On-Device Machine Learning (ODML) refers to running machine learning models using only the computing resources of the device itself.
MediaPipe is a low-code/no-code environment for running inference on mobile devices. It is an abstraction layer built on top of TensorFlow Lite that takes the complexity out of orchestrating the various models that may be needed to accomplish your task on device.
The key advantage is that MediaPipe enables running inference using only the device's resources. For example, the Magic Eraser feature on Google's Pixel phones makes use of ODML.
MediaPipe focuses on a specific set of common ML use cases and provides easy-to-use on-device solutions for them. Fourteen solutions are available, as announced at Google I/O 2023, and hand gesture recognition is one of them. Each solution contains one or more TensorFlow Lite models, and MediaPipe chains these models together. All that is needed is to feed in an input image and read the prediction result as output.
Setting up a MediaPipe pipeline
The first step is to set up a GestureRecognizer object and specify the model bundle you want to use. The model bundle contains the TensorFlow Lite models that have to be run in order to obtain inferences for your use case. In the case of hand gesture recognition, four TFLite models have to be loaded and orchestrated, and all of this is handled under the covers by MediaPipe. You then supply an input image and call the recognize() function on it.
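As a rough sketch, this is what the setup looks like with the MediaPipe Tasks Python API; the model bundle path, the input image path, and the num_hands setting below are placeholder assumptions rather than fixed values:

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Point the recognizer at the downloaded model bundle (path is a placeholder).
base_options = python.BaseOptions(model_asset_path="gesture_recognizer.task")
options = vision.GestureRecognizerOptions(base_options=base_options, num_hands=1)
recognizer = vision.GestureRecognizer.create_from_options(options)

# Load the input image and run gesture recognition on it.
image = mp.Image.create_from_file("hand.jpg")
result = recognizer.recognize(image)

# Each detected hand comes back with a ranked list of gesture categories.
for gestures in result.gestures:
    print(gestures[0].category_name, gestures[0].score)
```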
MediaPipe can be used with Kotlin if you are building for the Android platform, or with JavaScript if you are integrating a solution into your web application.
MediaPipe Studio is a browser-based playground where you can try out all of these on-device ML solutions.
Supported platforms
MediaPipe supports Android, JavaScript, and Python, with iOS support expected in the near future.
Solutions supported
Single-hand gestures are supported today, with two-hand gesture support being added shortly.
MediaPipe also supports interactive image segmentation of general objects: you specify an image and a point of interest as input, and you get a segmentation mask as output that distinguishes the object at that point from the rest of the image.
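A minimal sketch of this with the MediaPipe Tasks Python API is shown below; the model file name, image path, and the (0.5, 0.5) point of interest are placeholder assumptions, and the class names reflect the Python package at the time of writing:

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from mediapipe.tasks.python.components import containers

RegionOfInterest = vision.InteractiveSegmenterRegionOfInterest
NormalizedKeypoint = containers.keypoint.NormalizedKeypoint

# Model path is a placeholder for the interactive segmentation model.
base_options = python.BaseOptions(model_asset_path="magic_touch.tflite")
options = vision.InteractiveSegmenterOptions(base_options=base_options,
                                             output_category_mask=True)

with vision.InteractiveSegmenter.create_from_options(options) as segmenter:
    image = mp.Image.create_from_file("scene.jpg")
    # The point of interest is given as normalized (x, y) coordinates.
    roi = RegionOfInterest(format=RegionOfInterest.Format.KEYPOINT,
                           keypoint=NormalizedKeypoint(x=0.5, y=0.5))
    result = segmenter.segment(image, roi)
    category_mask = result.category_mask  # mask separating the object from the rest of the image
```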
Face landmark detection is yet another solution: MediaPipe can detect facial expressions such as an open mouth or a raised eyebrow, in addition to the 400 landmarks the model already detects. Virtual avatars that mimic human expressions, or that detect a person's pose and superimpose it onto an avatar, should now be easy to create.
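A sketch of this with the Python API follows; the model bundle and image paths are assumptions, and the expression scores come back as blendshape categories (for example jawOpen or browInnerUp) alongside the landmark points:

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# The face landmarker model bundle path is a placeholder.
base_options = python.BaseOptions(model_asset_path="face_landmarker.task")
options = vision.FaceLandmarkerOptions(base_options=base_options,
                                       output_face_blendshapes=True)
landmarker = vision.FaceLandmarker.create_from_options(options)

image = mp.Image.create_from_file("face.jpg")
result = landmarker.detect(image)

# result.face_landmarks holds the landmark points for each detected face;
# result.face_blendshapes holds the expression scores for each face.
for blendshape in result.face_blendshapes[0]:
    print(blendshape.category_name, blendshape.score)
```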
Text and audio classification are the other solutions that are supported.
Customizing solutions
MediaPipe Model Maker can be used to customize existing solutions. For instance, the built-in hand gesture recognizer can recognize seven hand gestures. You can add support for additional gestures by using the MediaPipe API to load the GestureRecognizer model and train it with a dataset of images of the gestures you want recognized. Once training of the custom model has completed, you can evaluate it on images that were not provided during training, and finally deploy it on-device using MediaPipe Tasks.
All that is needed is to replace the default model bundle with the customized model bundle that was just created.
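A hedged sketch of this workflow with the mediapipe-model-maker Python package is shown below; the dataset directory (one sub-folder of images per gesture label, plus a "none" class), the split ratios, and the export directory are assumptions for illustration:

```python
from mediapipe_model_maker import gesture_recognizer

# Load a folder of training images, one sub-directory per gesture label.
data = gesture_recognizer.Dataset.from_folder(
    dirname="custom_gestures",
    hparams=gesture_recognizer.HandDataPreprocessingParams())
train_data, rest = data.split(0.8)
validation_data, test_data = rest.split(0.5)

# Retrain the gesture classifier on the custom gestures.
model = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    hparams=gesture_recognizer.HParams(export_dir="exported_model"))

# Evaluate on held-out images, then export a model bundle for on-device use.
loss, accuracy = model.evaluate(test_data, batch_size=1)
model.export_model()
```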
Model Distillation
Large models used for generative AI cannot run on-device, so Google has turned to model distillation, which distills these larger models into smaller ones that can be specialized for a certain set of tasks.
Face stylization, which lets you stylize images, for example turning a photograph of yourself into a cartoon, is one such solution. This solution is customizable, so you can use Model Maker to transfer images to various styles.
Text-to-image generation is another area under development.
References
https://mediapipe-studio.webapps.google.com/home
https://developers.google.com/mediapipe