On-Device ML: When It Matters and How to Build with It
We created the NPU Advisor Series to share how experienced operators think about the next decade of technology, especially where AI meets the physical world.
For this edition, we are featuring Luke Boyer, a software engineer at Google. He is a core developer of LiteRT, the industry-leading framework for on-device inference. Luke works at the intersection of machine learning, resource-constrained compute, and over-the-air software distribution.
If you would like to connect with Luke or continue the conversation on on-device ML, you can reach him on LinkedIn.
What is on-device ML?
When most people think about AI, they picture a request going to a big model in a data center and a response coming back over the network.
On-device ML flips that picture: inference runs directly on the edge, as close as possible to end-user devices or physical sensors.
Luke describes this as the foundation of personalized, “real-life AI”: a future of systems that adapt to their users and interface directly with the physical world. Think “a phone without a screen”, what Siri wants to be, or the first steps toward a real-world Jarvis.
Why does it need to be on-device?
Three forces keep pushing intelligence onto devices instead of leaving it in the cloud:
Latency: Real-time use cases cannot tolerate unpredictable network lag. Braking a car, tracking fast motion, or translating live speech does not work if you are waiting on a round-trip to the server.
Privacy and personalization: The most powerful agents see calendars, documents, video feeds, and sensor data. In many consumer, industrial, and military settings, streaming all of that to a remote server is a non-starter.
Cost and reliability: Server compute is expensive at scale. Running optimized models on-device shifts cost into hardware you pay for once and keeps systems working when the network is slow, jammed, or offline.
How NPU thinks about ODML
At NPU, we care about on-device ML because it sits right at the intersection of our thesis: AI that actually interacts with the physical world, and national security or critical infrastructure, where latency, reliability, and autonomy matter most.
We expect a meaningful share of the most important AI companies in the next decade to be built from the ground up around ODML, proliferating across user devices, sensors, vehicles, and other systems at the edge.
The rest of this piece is a Q&A with Luke on how founders should approach ODML, what is working in practice, and where the biggest opportunities and traps are.
When you talk about “on-device ML”, what exactly do you mean, and what is the biggest misconception you run into?
On-device ML (ODML), to me, refers to systems that run inference directly on an edge device and build on that capability. This includes not just the ML, but all the surrounding technologies needed to put an experience in the user’s hands.
The biggest misconception I run into is that edge inference is just an alternative technical approach to ML. Edge inference can unlock novel user experiences that wouldn’t be possible with cloud-only inference.
What are some real-world examples of ODML already running in production that most people would be surprised by?
Take your phone out and zoom in as far as possible on something far away. What you’re seeing is not what the camera is capturing; it’s a model predicting what the image might look like at a higher resolution. Similar things happen with the phone’s audio inputs, like background noise suppression in phone calls.
What changed in the last few years that made ODML feel inevitable rather than hypothetical?
There have been steady advances in the computing capability of mobile and edge-based devices. The CPUs, GPUs, and ML-specialized NPUs (neural processors) shipped in smartphones today can tackle surprisingly sophisticated workloads efficiently. Supporting technology allows most app devs to simply drop ML into their product. At the same time, creative devs are learning how to get the most out of these computing devices for advanced use cases.
This, combined with an explosion of interest in AI-powered use cases (especially generative and agentic ones), leads me to believe the market is ready for a swath of new ODML-based products.
For a founder who has only worked with cloud models, how should they think differently once they move intelligence onto devices?
Today, ODML can be a more difficult technical problem than the cloud approach, in part because of the high fragmentation of the mobile market and the model architecture specialization required for optimized use cases.
Where ODML shows its value is in the new product experiences it can create. Creative and astute founders will understand how to turn edge inference into killer features for users that cloud inference will not be able to support.
Where do you expect ODML to create the most value over the next few years?
I think there are three main categories where we will see ODML accrue value over the next five or so years.
The first is integrated applications for smartphones or wearables. Particularly interesting are applications in a specific vertical, like personal fitness.
The second is industrial applications focused on a single input medium. I’m looking at things like computer vision for monitoring or quality control within agricultural or factory production contexts.
The third category relates to AI-powered tooling and environments for creators and artists, a re-envisioning of things like Photoshop or animation software. The winners will employ ODML to offload inference cost to user machines and protect proprietary information.
For industrial, defense, and field systems that operate in bandwidth-constrained or adversarial environments, what makes ODML such a good fit? Any concrete examples you'd like to point to?
ODML enables real-time analysis of, and reaction to, physical signals in the field. Autonomy and reliability of endpoints are critical here, especially in an adversarial context. The most robust field systems will blend ODML on endpoints with centralized compute for high-level monitoring, control, and sophisticated decision making.
Anduril’s LatticeAI is the canonical example of this approach in practice. Their stack can deploy smaller ML models (like image recognition) directly on the end devices, which alert and inform a centralized monitoring service.
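The general pattern described here, where an endpoint runs a small model locally and sends only compact alerts upstream, can be sketched roughly as follows. This is a minimal illustration and not any vendor's API: the `Alert` type, the `collect_alerts` function, the `detect` callback, and the 0.8 threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Alert:
    """Compact message an endpoint would send to a central service (hypothetical)."""
    frame_id: int
    label: str
    score: float


def collect_alerts(
    frames: Iterable[object],
    detect: Callable[[object], Tuple[str, float]],
    threshold: float = 0.8,  # assumed confidence cutoff, tune per use case
) -> List[Alert]:
    """Run the local model on every frame, keep only confident detections."""
    alerts = []
    for frame_id, frame in enumerate(frames):
        label, score = detect(frame)  # inference happens on the device
        if score >= threshold:
            # Only this small record leaves the device, not the raw frame.
            alerts.append(Alert(frame_id, label, score))
    return alerts


# Usage with a stand-in detector: each "frame" is just a fake confidence score.
def fake_detect(frame: float) -> Tuple[str, float]:
    return ("vehicle", frame)


alerts = collect_alerts([0.2, 0.95, 0.5, 0.85], fake_detect)
print([a.frame_id for a in alerts])  # prints [1, 3]
```

The design point is that raw sensor data stays on the endpoint, so the link to the central service only has to carry small, infrequent messages and the endpoint keeps working if that link drops.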
What do strong ODML founders look like? How are they different from teams building typical cloud or SaaS products?
On the engineering side, strong ODML founders will be more comfortable with lower-level, platform-specific considerations that affect how their stack performs in the hands of the user. They will have a holistic view of development, thinking deeply about efficiency and deployment.
On the product side, strong ODML founders will have an intimate understanding of their users’ needs and experience, and how the underlying technology can benefit them. They also will have a connection or interest in a specific domain or vertical where ODML can be applied naturally, like health or agriculture.
What are the most common technical traps you see early ODML teams fall into when they try to go from demo to deployed system?
The most common technical trap I see, at least in the mobile space, is underestimating the fragmentation of devices and systems and the complexity of scaling across them. It is easy to build a demo with only a single operating system and device in mind, only to realize later that the approach might not scale to as much of the market as anticipated.
How should teams think about the tradeoff between model quality and real-world constraints like latency, power, and memory on their target devices?
Always think product-first and holistically; model quality is just one piece of the puzzle. Depending on the experience you are building, sacrificing some quality can make for a better overall product if it leads to snappier response times, or even a shorter app download and install time, which is often ignored.
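One way to make this tradeoff concrete is to treat latency and download size as hard product budgets and then pick the best model that fits inside them. The sketch below is only an illustration of that idea: the `Variant` type, the `pick_variant` function, and every number in the usage example are made up for the example and are not measurements of any real model or device.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Variant:
    """One candidate build of a model (all fields are hypothetical)."""
    name: str
    quality: float     # any offline quality metric, higher is better
    latency_ms: float  # measured on the target device
    size_mb: float     # contribution to app download size


def pick_variant(
    variants: Sequence[Variant],
    max_latency_ms: float,
    max_size_mb: float,
) -> Optional[Variant]:
    """Return the highest-quality variant that fits both budgets, or None."""
    feasible = [
        v for v in variants
        if v.latency_ms <= max_latency_ms and v.size_mb <= max_size_mb
    ]
    return max(feasible, key=lambda v: v.quality, default=None)


# Illustrative numbers only.
candidates = [
    Variant("small", quality=0.80, latency_ms=12, size_mb=5),
    Variant("medium", quality=0.88, latency_ms=35, size_mb=20),
    Variant("large", quality=0.93, latency_ms=120, size_mb=90),
]

choice = pick_variant(candidates, max_latency_ms=50, max_size_mb=25)
print(choice.name if choice else "no variant fits")  # prints medium
```

Here the highest-quality variant loses because it breaks the budgets the product sets, which is the product-first ordering described above.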
If you had to give founders a simple rule of thumb, how should they decide whether a use case really needs to be on-device versus just living in the cloud?
The answer just comes down to the user experience. Running inference on the edge can unlock a suite of experiences around personalization, real-time interaction, and privacy. If these features give a product a fundamental competitive advantage, its founders should go all in.



