ODML – Running LLMs on Android

In Android, Artificial Intelligence, google IO, LLMs, Machine Learning, Mobile Apps, Performance, TensorFlow by Prabhu Missier

Large Language Models are built by training them on a very large corpus of data. They are typically based on deep-learning neural networks such as the Transformer architecture invented by Google. This enables them to serve a wide range of Natural Language Processing tasks, not limited to text generation.
However, the sheer volume of data to be ingested means these models consume huge amounts of processing power during training, and they remain computationally demanding even when running inference.

LLMs like Google’s LaMDA and PaLM are computationally expensive, with text generation taking several minutes even on GPUs. This makes it extremely challenging to run LLMs on mobile devices, where computing resources are limited. However, smaller LLMs like GPT-2 can run on Android devices with a few tweaks and modifications and still give impressive results.

Using KerasNLP
KerasNLP is a toolkit that provides pre-trained LLMs. This comes in handy because, as mentioned above, training a model from scratch is computationally expensive.
KerasNLP provides several pre-trained models, such as Google’s BERT and GPT-2, which can then be fine-tuned to match our requirements. For example, you can fine-tune GPT-2 to produce responses styled along the lines of a news website.

One way to access these LLMs is to deploy them on the cloud and call their endpoints from a mobile device. However, this introduces network latency, which can bring uncertainty and unacceptable delays.
An alternative and more efficient way forward is On-Device Machine Learning (ODML).

On-Device Machine Learning
ODML can be achieved with TensorFlow Lite. A simple workflow is to take our fine-tuned GPT-2 model and convert it into a TensorFlow Lite model using the TensorFlow Lite converter.
You can then run inference on the model using the TensorFlow Lite Interpreter, which is highly optimized for mobile devices.
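A minimal sketch of that convert-then-run loop, using a tiny stand-in Keras model in place of the fine-tuned GPT-2 (the real model goes through the same converter and Interpreter calls):

```python
import numpy as np
import tensorflow as tf

# A tiny stand-in Keras model; in practice this would be the fine-tuned GPT-2.
model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])

# Convert the Keras model into a TensorFlow Lite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()

# Run inference with the TensorFlow Lite Interpreter, the same runtime
# that ships in the Android TensorFlow Lite libraries.
interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]
interpreter.set_tensor(input_detail["index"], np.zeros((1, 8), dtype=np.float32))
interpreter.invoke()
result = interpreter.get_tensor(output_detail["index"])
```

On Android, the `tflite_bytes` flatbuffer would be bundled as an asset and loaded with the Java/Kotlin `Interpreter` API instead of the Python one shown here.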
The model can be optimized further using techniques like quantization, which shrinks model size considerably.
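To illustrate the idea behind quantization, here is a toy affine int8 quantizer in plain Python. In practice TensorFlow Lite applies this per tensor via converter flags (e.g. `converter.optimizations = [tf.lite.Optimize.DEFAULT]`), but the underlying arithmetic is similar:

```python
def quantize_int8(weights):
    """Map float weights to int8 values in [-128, 127] using an
    affine (scale + zero-point) scheme, as post-training
    quantization does for each tensor."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for constant tensors
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover approximate floats; the rounding error is bounded by the scale."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.0, -0.37, 0.0, 0.42, 1.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)
```

Storing each weight as one int8 byte instead of four float32 bytes cuts the weight payload to a quarter of its size, at the cost of a small, bounded rounding error.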