0.75% Battery for 25 Answers: AI That Runs Offline on Your Phone

On a Pixel, twenty-five answers cost 0.75% of the battery and nothing leaves the device. Gemma, Llama, Qwen: AI now fits inside the phone, offline. What you gain, what you give up.

A Pixel 9 Pro rests on an airplane tray table, in airplane mode. Twenty-five questions follow one after another: a summary of a note, a translation, the rewording of an email. The assistant answers every one of them without a single bar of signal. The battery gauge, meanwhile, has dropped by just 0.75%. And not one of those twenty-five requests went out to a server: a model weighing little more than a hundred megabytes, stored on the phone itself, handled them all.

The scene is no prototype. The model is Gemma 3 270M, which Google sized to fit in the memory of an ordinary device, and it marks a quiet shift. For three years, talking to an AI meant opening a pipe to a distant data center. Now a growing share of those exchanges never leaves the hand holding the screen. The question is what you gain, and what you give up, by bringing the intelligence home.

The model shrank until it fit in a pocket

Gemma 3 270M wears its size in its name: 270 million parameters, where the flagship cloud models run to hundreds of billions. Quantized to INT4, four bits per parameter, its weights take up about 135 megabytes, roughly the space of a few dozen photos. It runs on a phone or a Raspberry Pi board and offers a context window of 32,000 tokens (a token is a word or a piece of one). In Google's own tests on the Pixel 9 Pro, twenty-five conversations used just 0.75% of the battery.
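These figures follow from simple arithmetic. The sketch below, in Python, assumes that raw weight storage is parameters × bits per weight ÷ 8 and ignores the extra memory a running model needs for its context and intermediate results; the battery line simply divides Google's reported measurement across its twenty-five conversations. The function names are illustrative only.

```python
def weight_megabytes(n_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in megabytes (10**6 bytes): params x bits / 8."""
    return n_params * bits_per_weight / 8 / 1e6


def battery_per_conversation(total_percent: float, conversations: int) -> float:
    """Average battery share of one conversation, in percentage points."""
    return total_percent / conversations


if __name__ == "__main__":
    gemma_270m = 270e6  # nominal parameter count taken from the model's name
    for bits in (16, 8, 4):  # BF16, INT8, INT4
        print(f"{bits}-bit weights: {weight_megabytes(gemma_270m, bits):.0f} MB")
    # Google's reported figure: 0.75% of the battery for 25 conversations.
    print(f"Per conversation: {battery_per_conversation(0.75, 25):.2f}% of battery")
```

For 270 million parameters this prints 540 MB at 16 bits, 270 MB at 8 bits and 135 MB at 4 bits, and 0.03 percentage points of battery per conversation. Real model files tend to run somewhat larger, since quantization also stores scale factors and some tensors are often kept at higher precision.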

It is not alone. Meta ships Llama 3.2 in 1- and 3-billion-parameter versions, Alibaba's Qwen family includes models from 0.5 to 3 billion parameters, and Google offers a variant called Gemma 3n, designed for mobile from the start. The common thread: these models are small enough to run where you are, with no detour through the cloud.

Size alone did not make this possible. Inference engines have grown more efficient, llama.cpp gained kernels tuned for the ARM chips in smartphones, and Apple turned local execution into a building block of its operating system. In short, AI is following the path of the digital photo or GPS: first a remote service, then a function the device handles on its own.

What you gain by cutting the cord

The first benefit is autonomy in the literal sense. An on-device model works on a plane, in a tunnel, in a village with no coverage, in a basement. It asks for no account, no monthly subscription, no connection: it is simply there, available, like a calculator. For anyone who has watched an online assistant refuse to answer for lack of signal, the difference is tangible.

Then comes privacy, and it carries weight. When the processing stays on the device, the dictated text, the medical note and the draft letter pass through no server, feed no log and train no model. At a time when the EU's AI Act and growing public wariness have made data residency a front-line concern, keeping your words at home is no longer a detail for the technically minded.

Apple has made that argument the spine of its offering. At its developer conference on 8 June 2026, the company unveiled its in-house models, AFM Core and an advanced version built on a sparse architecture, able to run on the device without letting anything out. There is a gain in speed too: with no network round trip, the answer starts arriving without waiting on a connection, where a single tunnel used to be enough to stall it.

The price of going local

The trade-off is real, and it would be dishonest to hide it. A 270-million-parameter model does not reason like a cloud behemoth. Its knowledge is narrow, its memory for facts full of holes, and it invents all the more boldly when asked outside its domain. To summarize a note or sort messages, it shines; on a question of law or medicine, it is wrong without warning.

The hardware sets its terms. Running a model, even a light one, heats the chip, eats into the RAM and, on a phone a few years old, crawls. The flattering battery numbers assume a recent chip and a tiny model; scale up for sharper answers, and the energy bill climbs with it.
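How fast that bill climbs can be estimated with two common rules of thumb for text generation: a decoder model performs roughly two floating-point operations per parameter for each token it writes, and it must read every weight from memory once per token, which on phones is usually the bottleneck. The Python sketch below applies both to the nominal model sizes named in this article, assuming 4-bit weights. It ignores attention over long contexts and says nothing exact about energy, which depends on the chip; the dictionary and function names are hypothetical.

```python
# Nominal sizes named in the article; exact parameter counts differ slightly.
MODELS = {
    "Gemma 3 270M": 270e6,
    "Llama 3.2 1B": 1e9,
    "Llama 3.2 3B": 3e9,
}


def flops_per_token(n_params: float) -> float:
    """Rule of thumb: about 2 floating-point operations per parameter per token."""
    return 2 * n_params


def bytes_read_per_token(n_params: float, bits_per_weight: int = 4) -> float:
    """Every weight is streamed from memory once per generated token."""
    return n_params * bits_per_weight / 8


def relative_cost(n_params: float, baseline: float = 270e6) -> float:
    """Work per token relative to the 270M-parameter model."""
    return n_params / baseline


if __name__ == "__main__":
    for name, n in MODELS.items():
        print(f"{name}: {flops_per_token(n) / 1e9:.1f} GFLOP/token, "
              f"{bytes_read_per_token(n) / 1e6:.0f} MB read/token, "
              f"x{relative_cost(n):.1f} vs 270M")
```

By this estimate, a 3-billion-parameter model does about eleven times the work of the 270M model for every token it produces, which is why the sharper answers cost more battery and heat.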

Above all, local rarely stands alone. The framework Apple revealed lets an app switch, in a single line of code, from the on-device model to a cloud-hosted Gemini model when the task outruns its strength. The design is clever, but it also reveals where things stand: the small model handles everyday requests, and the big one, far away in the cloud, takes the hard ones. The autonomy on offer is therefore a floor, not a ceiling. Dependence has not vanished; it has moved to the cases that matter most.
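The local-first pattern described here can be sketched in a few lines. The Python below is a hypothetical illustration, not Apple's framework or any vendor's API: `LocalFirstRouter`, its word budget and its list of risky topics are invented for the example. It answers on the device by default, sends a request to the cloud only when the prompt looks too hard and the user has allowed it, and otherwise flags the local answer as unverified.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical backend type: any function that takes a prompt and returns a reply.
Backend = Callable[[str], str]

# Illustrative topics a tiny model tends to get wrong; not taken from any real product.
HARD_TOPICS = {"law", "legal", "medical", "diagnosis", "dosage"}


@dataclass
class LocalFirstRouter:
    local: Backend
    cloud: Optional[Backend] = None
    allow_cloud: bool = False    # the user opts in; the app never decides alone
    max_local_words: int = 2000  # hypothetical budget for the small model

    def needs_big_model(self, prompt: str) -> bool:
        """Crude heuristic: the prompt is too long or touches a risky topic."""
        words = re.findall(r"[a-z]+", prompt.lower())
        too_long = len(words) > self.max_local_words
        risky = any(word in HARD_TOPICS for word in words)
        return too_long or risky

    def answer(self, prompt: str) -> Tuple[str, str]:
        """Return (route, reply); route is 'local', 'cloud' or 'local-unverified'."""
        if not self.needs_big_model(prompt):
            return "local", self.local(prompt)
        if self.allow_cloud and self.cloud is not None:
            return "cloud", self.cloud(prompt)
        # Hard request but no permission to leave the device: answer locally, flag it.
        return "local-unverified", self.local(prompt)


if __name__ == "__main__":
    router = LocalFirstRouter(local=lambda p: "on-device reply",
                              cloud=lambda p: "cloud reply")
    print(router.answer("Summarize this note about tomorrow's meeting"))
    print(router.answer("Is this clause legal?"))
    router.allow_cloud = True
    print(router.answer("Is this clause legal?"))
```

Keeping the user's permission as an explicit flag reflects the idea that the cloud should step in only when the user decides; a real app would need a far better difficulty test than keyword matching.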

The frontier runs back through the device

What is at stake is not a victory of the phone over the cloud, but a new division of labor. Over the past decade, computation drifted away from us, into data centers no one ever sees. Now it returns, in part, to the object we keep in a pocket. Readers do not have to pick a side: they gain a base of capability that is truly theirs, and the cloud steps in above it only when they decide.

The nuance is worth holding. An AI that answers with no signal, no account and no trace left behind is not more powerful than the one on the big servers; it is more truly yours. In a world where every service demands a connection and keeps a copy, an assistant content with the memory of a phone offers a rare kind of peace.

The real question is no longer whether the model fits in a pocket; that much is settled. It is how far the model can stay useful without calling out. The day the small local model covers most of our ordinary requests, AI will have stopped being a service we connect to and become a tool we own.