
Building with LLMs on mobile

In 2023 we launched the first foundational model in a mobile operating system. This marked a big shift from traditional ML running on-device. Historically, models were domain-specific and highly optimised to perform their tasks well while consuming minimal system resources. With the arrival of foundational models1 we saw an opportunity to replace many single-purpose models with one larger model, benefiting from cross-domain learning to deliver an overall improvement across the individual domains.

Getting to this point required overcoming several major hurdles including:

  1. Models were huge: typically two orders of magnitude larger than anything we had previously deployed.
  2. Execution was slow: responses took tens of seconds, and it was unclear whether this could be useful to feature teams.
  3. Hallucination was prevalent: output was basic and prone to making things up, a far cry from the GPT-3.5 moment that had ignited imagination in the developer community.

Creating a multi-disciplinary team

In order to determine viability we quickly assembled expertise from teams across Google to build out prototypes and identify bottlenecks. In the spirit of rapid iteration and enabling cross-team development we jettisoned established development practices, focusing solely on the simplest way to prove out ideas. This may seem obvious to smaller teams but is atypical in large organisations, where development practices have been highly optimised against longer-term objectives. As an example, the toolchains, languages and processes are very different for the Android team developing an operating system, the core Google ML runtime development teams and the Google DeepMind research teams.

Once we had established that these models could work at all and identified the major bottlenecks, we set about leveraging the specialist skills of the individual teams, working in parallel to develop solutions before a final re-integration.

A key lesson from this time is that, given the current immaturity of AI/ML, bringing multi-disciplinary teams together with a shared focus is critical to making progress. No individual team has sufficient influence over development, or expertise across all the necessary components, to iterate in isolation. Pushing the frontier requires a diverse range of skills.

Finding a footprint

Android is a heterogeneous ecosystem of devices: flagship phones may have upwards of 12GB of RAM, while capable devices are still shipping with just 4GB. In order to plot a path to launch we needed to find the intersection between model capabilities (and size) and phone resources.

The size of the model impacts many different areas:

  • Flash storage usage at idle, RAM usage during execution
  • Data downloaded during model updates
  • Execution time

We pursued two independent paths. The first was working closely with the research and feature product teams to identify the sweet spot for quality, looking at quality drop-off curves across a range of use cases for given parameter counts. The second was working with the research and runtime teams to identify ways to reduce the size of a model at a given parameter count.

Through this work we were able to quickly establish the viability of moving to 4-bit weights with minimal impact on quality, and we set a target of 2B parameters for the model.
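As a rough illustration of the arithmetic behind that decision (a back-of-the-envelope sketch, not our actual sizing tooling), weight precision translates directly into storage, RAM and download cost:

```python
# Back-of-the-envelope sketch: how parameter count and weight precision
# translate into weight footprint (storage at idle, RAM during execution,
# bytes downloaded on update). Activations and other overheads are ignored.

def weight_footprint_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    gb = weight_footprint_gb(2e9, bits)
    print(f"2B params @ {bits:>2}-bit weights ≈ {gb:.1f} GB")

# 32-bit ≈ 8.0 GB, 16-bit ≈ 4.0 GB, 8-bit ≈ 2.0 GB, 4-bit ≈ 1.0 GB
```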

Optimising execution

There aren’t that many core elements to modern transformer architectures, so it has been possible to build small self-contained runtimes like GGML that are great for rapid prototyping of new models. Being able to iterate quickly on the CPU was critical in proving out ideas. However, it was clear that CPU execution was not going to be sufficient for a production launch, due to three main challenges.

  1. Power usage
  2. Raw compute
  3. Contention

Power usage

The dense auto-regressive models being used are typically memory-bandwidth limited, due to the sequential nature of decoding and the need to process the full model weights on each step. This meant that CPU decode speeds were fairly competitive with alternative options. However, utilising the full capacity of the CPU (all cores, max clock) requires a lot of power and generates a lot of heat that needs to be dissipated. We wouldn’t be able to use the CPU at full capacity for long.
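A simple way to see why decode is bandwidth-bound is to treat it as pure weight traffic: each generated token has to read the full set of weights once, so memory bandwidth caps tokens per second. The bandwidth figures in this sketch are illustrative assumptions, not measurements of any particular device:

```python
# Sketch of the bandwidth-bound decode argument: each token reads all the
# quantised weights once, so decode speed is roughly bandwidth / model size.
# Bandwidth numbers are illustrative assumptions, not device measurements.

def max_decode_tok_s(weight_bytes: float, mem_bw_gb_s: float) -> float:
    return mem_bw_gb_s * 1e9 / weight_bytes

WEIGHT_BYTES = 1e9  # ~1 GB of 4-bit weights for a ~2B parameter model
for label, bw_gb_s in [("modest mobile DRAM", 15), ("flagship mobile DRAM", 50)]:
    print(f"{label}: ≤ {max_decode_tok_s(WEIGHT_BYTES, bw_gb_s):.0f} tokens/s")
```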

Raw compute

While decoding is typically limited by memory bandwidth, feeding context into the model (prefill) can be parallelised and benefits significantly from more compute. More specialised hardware like GPUs and NPUs can perform far more matrix multiplications per second than general-purpose CPU architectures. Where a large amount of content needs to be ingested by the model before it gives a response, this will happen much faster on these architectures. We typically measure this as TTFT, indicating how long it takes before we start to get a result.

This asymmetry in performance is acceptable because once we start receiving a response we can stream it to the user as it arrives so that they can start reading. For many use cases, so long as we are streaming faster than the user can read, the experience holds up. However, the initial latency before that response starts to appear is very noticeable and harder to mitigate through UX. We knew that summarisation of large content would be an important first use case, so fast prefill mattered.
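A toy latency model makes the trade-off concrete: TTFT is dominated by prefill over the whole prompt, while the rest of the response streams at the decode rate. The throughput numbers below are illustrative assumptions, not measured figures:

```python
# Toy latency model for the prefill/decode asymmetry: TTFT comes from
# prefill over the prompt, the remainder streams at the decode rate.
# All throughput numbers are illustrative assumptions.

def ttft_s(prompt_tokens: int, prefill_tok_s: float) -> float:
    return prompt_tokens / prefill_tok_s

def total_latency_s(prompt_tokens: int, output_tokens: int,
                    prefill_tok_s: float, decode_tok_s: float) -> float:
    return ttft_s(prompt_tokens, prefill_tok_s) + output_tokens / decode_tok_s

# Summarising a long transcript: raising prefill throughput 10x cuts the
# user-visible wait far more than raising decode throughput would.
for prefill in (200, 2000):
    print(f"prefill {prefill} tok/s -> TTFT {ttft_s(6000, prefill):.1f}s, "
          f"total {total_latency_s(6000, 200, prefill, 20):.1f}s")
```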

Contention

The CPU is shared by many different elements of the system, so for the device to remain responsive it must not be fully saturated by a single task. Limiting the model’s CPU usage to an appropriate level has a significant impact on its performance, further compounding the challenges listed above.

Deciding on the NPU

It was clear that we needed to move to more specialist hardware, and the choice was between the GPU and the NPU. The majority of ML compute today, on workstations and in datacenters, runs on modern GPUs. NPUs are a newer and even more specialist approach that gains efficiency and performance at some cost in developer complexity. On mobile devices everything is fundamentally limited by power: the amount you can draw from the battery and the ability to dissipate the heat generated as waste energy. Unlike in the datacenter, we cannot scale horizontally by stacking more phones together, and we are limited in vertical scaling by heat dissipation (no one wants a fan on their phone).

The efficiency needs and the compute available made the NPU compelling while the target of optimising against a single model mitigated some of the challenges with developing for such specialist hardware.

Matching features to capabilities

There was a lot of excitement about these models amongst feature teams, though expectations were often grounded in an initial demo that had been prototyped on a far more powerful datacenter-hosted model.

The reality, of course, is that a 2-billion parameter model running on a phone, while a technical marvel, has different strengths and weaknesses to a 100-billion+ parameter model running on a fleet of servers. Our model was fast and private, but it wasn’t going to ingest or generate huge quantities of text. We had to bridge the gap between the future potential of the technology and the practical needs of a product today.

We quickly started bucketing features into different sets of requirements around input and output size, user flow (e.g. interactive vs background processing) and early accuracy benchmarks. From this we identified a core set of features that, while a stretch from our starting point, we understood well enough to know what would be required to meet the bar for launch, and that this was feasible given device schedules.
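A hypothetical sketch of that kind of bucketing is below; the field names, example features and thresholds are illustrative only, not the actual criteria we used:

```python
# Hypothetical sketch of bucketing features by their requirements.
# Field names, features and thresholds are illustrative only.
from dataclasses import dataclass
from enum import Enum

class Flow(Enum):
    INTERACTIVE = "interactive"   # a user is actively waiting on the result
    BACKGROUND = "background"     # latency is much less visible

@dataclass
class FeatureRequirements:
    name: str
    max_input_tokens: int
    max_output_tokens: int
    flow: Flow
    target_accuracy: float        # early benchmark bar to be considered for launch

candidates = [
    FeatureRequirements("long-transcript summarisation", 6000, 256, Flow.INTERACTIVE, 0.80),
    FeatureRequirements("short reply drafting", 512, 64, Flow.INTERACTIVE, 0.85),
    FeatureRequirements("offline content classification", 1024, 8, Flow.BACKGROUND, 0.90),
]
```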

The impact of user experience

Performance was critical, and now that we had a clear set of initial features, the teams started work on more focused optimisations against these specific targets. As well as E2E2 processing time we also had a specific focus on TTFT3. For some use cases it is important that the complete response is ready before it can be presented to the user, while for others a streaming response, where the UI is constantly updated with additional results, is acceptable. In general LLM output is mostly bandwidth limited, but where streaming is appropriate, so long as the output arrives faster than the user can read (approximately 4-7 tokens per second), the total response time is less important than the time waiting for the first output token.
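That rule of thumb can be written down directly; a minimal sketch, with the reading rate taken from the figure above and everything else assumed:

```python
# Sketch of the rule of thumb above: if the UI can stream and decode keeps
# ahead of reading speed, the wait the user perceives is essentially TTFT;
# otherwise they effectively wait for the full response.
READING_RATE_TOK_S = 5.0  # roughly the 4-7 tokens/s quoted above

def perceived_wait_s(ttft_s: float, output_tokens: int,
                     decode_tok_s: float, streaming_ui: bool) -> float:
    if streaming_ui and decode_tok_s >= READING_RATE_TOK_S:
        return ttft_s
    return ttft_s + output_tokens / decode_tok_s

print(perceived_wait_s(ttft_s=2.0, output_tokens=200, decode_tok_s=20, streaming_ui=True))   # 2.0
print(perceived_wait_s(ttft_s=2.0, output_tokens=200, decode_tok_s=20, streaming_ui=False))  # 12.0
```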

The Pixel Recorder team wanted to generate on-device summaries of users’ voice notes. Transcriptions could be large, so being able to quickly feed them into the context of the model was important. It was a clear case where streaming produced an acceptable experience for decoding, but, given the large input size, the initial wait for the first token could be a poor experience. The choice of the NPU really paid off at this point: while output is typically bandwidth limited, much of the prefill computation can be parallelised, so we were able to leverage the large amount of compute available to achieve input rates several orders of magnitude higher than output rates.

UI example of Pixel Recorder

Boosting accuracy with LoRA

Our base model was a capable language learner. Given input and a request, it could do a good job of manipulating it to provide a reasonable natural-language output. This was an impressive feat; however, the bar for a high-quality product experience was high. It was clear from early tests that, at this point in time, the base model alone wasn’t producing output of sufficiently high quality from simple prompts to meet our expectations.

It was typical for feature teams building on the server to fine-tune a model specifically for their use cases. Early tests had demonstrated that full fine-tuning of the model could reach the accuracy needed for launch, so we knew it was possible at this size, but full fine-tuning would mean shipping multiple gigabytes for every new feature. This clearly wasn’t viable. We needed something where we could share common capabilities across features while enabling each team to bring additional knowledge and performance relevant to their use case. Enter PEFT.4

Specifically, we adopted LoRA5 as a way to create small, efficient “adapter” modules that could be layered on top of the main model. These adapters specialise the model for a very specific task, like summarising text or drafting a reply to a message, without altering the large base model itself. The adapters were tens of megabytes rather than gigabytes, which opened a path to scaling to many features all using the same base model.
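A minimal sketch of the LoRA idea (numpy, with an illustrative layer size and rank; not our training or serving code): a per-feature low-rank update is added to a frozen, shared base weight, so only the small adapter has to ship with each feature.

```python
# Minimal LoRA sketch: the shared base weight W stays frozen; each feature
# trains a small low-rank pair (A, B) whose product is added at inference.
# Layer size and rank are illustrative, not the production configuration.
import numpy as np

d_out, d_in, rank = 2048, 2048, 16

W = np.random.randn(d_out, d_in).astype(np.float32) * 0.02   # frozen, shared
A = np.random.randn(rank, d_in).astype(np.float32) * 0.02    # trained per feature
B = np.zeros((d_out, rank), dtype=np.float32)                # trained per feature

def adapted_forward(x: np.ndarray) -> np.ndarray:
    # Effective weight is W + B @ A; the base model itself is never modified.
    return x @ (W + B @ A).T

print("base layer params:", W.size)            # 4,194,304
print("adapter params:   ", A.size + B.size)   # 65,536 (~1.6% of the layer)
```

The same pattern repeats per layer, which is why the shipped adapters stay in the tens-of-megabytes range while the base model is shared across features.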

Finally a device release

Building features that ship with a mobile phone is different from the typical release process of an app developer. There are hard deadlines, because we are shipping atoms, not bits. Even before the final date for a build to go to devices in the factory, there is a long period of stabilisation to ensure that everything is working together as expected and that important system health metrics are stable.

Thanks to the incredible work of many teams we were able to bring everything together to ship first on the Pixel 8 Pro. This marked a big shift, proving that it was possible for one single model to deliver best-in-class quality across a range of very different features. The era of foundational models on mobile devices had arrived.

Footnotes

  1. Large single models, trained on a diverse range of data

  2. End To End. The total time from the initial input request until the final response.

  3. Time To First Token. How long it takes before the initial output response starts to be generated.

  4. Parameter Efficient Fine-Tuning. A method for tuning only parts of a model in order to achieve improved performance.

  5. Low-Rank Adaptation. A technique for adapting a model by training small low-rank updates to a subset of its weights.