Apple engineers have developed an AI system that resolves complex references to on-screen entities and entities in user conversations. The lightweight model could be an ideal fit for on-device virtual assistants.
People are good at resolving references in conversations with each other. When we use terms like “the bottom one” or “he,” we understand what the other person is referring to based on the context of the conversation and the things we can see.
This is much more difficult for an AI model. Multimodal LLMs like GPT-4 are good at answering questions about images, but they are expensive to train and require a lot of compute to process each query about an image.
Apple engineers took a different approach with their system, called ReALM (Reference Resolution As Language Modeling). The paper is worth a read to learn more about the development and testing process.
ReALM uses an LLM to process conversational, on-screen, and background entities (alarms, background music) that make up a user's interactions with an AI virtual agent.
Here is an example of the kind of interaction a user may have with an AI agent.
The agent must understand conversational entities, such as the fact that when the user says “that one,” they’re referring to the pharmacy’s phone number.
It also needs to understand the on-screen context when the user says “the bottom one,” and this is where ReALM’s approach differs from models like GPT-4.
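To make the three entity types concrete, here is a minimal Python sketch, not the paper's actual schema; the class names, fields, and example values are illustrative assumptions based on the pharmacy scenario above.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    CONVERSATIONAL = "conversational"   # mentioned earlier in the dialogue
    ON_SCREEN = "on_screen"             # currently visible on the display
    BACKGROUND = "background"           # e.g. an active alarm or playing music


@dataclass
class Entity:
    entity_type: EntityType
    label: str    # short natural-language description of the entity
    value: str    # e.g. a phone number or a UI element identifier


# Hypothetical state for the pharmacy example:
entities = [
    Entity(EntityType.CONVERSATIONAL, "Rite Aid pharmacy phone number", "555-0142"),
    Entity(EntityType.ON_SCREEN, "Call button at the bottom of the screen", "call_button"),
    Entity(EntityType.BACKGROUND, "Timer currently running", "timer_1"),
]

# The reference resolver's job is to map a phrase like "that one" or
# "the bottom one" to exactly one of the entities above.
```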
ReALM relies on upstream encoders to first parse the elements on the screen and their positions. ReALM then reconstructs the screen as a purely textual representation, ordered from left to right and from top to bottom.
In simple terms, it uses natural language to summarize the user's screen.
Now, when a user asks a question about something on the screen, the language model processes the textual description of the screen instead of having to use a vision model to process the screen image.
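The paper doesn't ship its serialization code, but the idea can be sketched in a few lines of Python: take the parsed UI elements with their positions and emit them as plain text, sorted top to bottom and left to right. The element fields, the row-grouping tolerance, and the tab/newline layout here are assumptions for illustration, not Apple's exact implementation.

```python
from dataclasses import dataclass


@dataclass
class ScreenElement:
    text: str     # visible label of the UI element
    top: float    # vertical position (0 = top of screen)
    left: float   # horizontal position (0 = left edge)


def screen_to_text(elements: list[ScreenElement], row_tolerance: float = 10.0) -> str:
    """Serialize screen elements into plain text, top-to-bottom, left-to-right.

    Elements whose vertical positions fall within `row_tolerance` of each
    other are treated as one row and separated by tabs; rows are separated
    by newlines.
    """
    ordered = sorted(elements, key=lambda e: (e.top, e.left))
    rows: list[list[ScreenElement]] = []
    for el in ordered:
        if rows and abs(el.top - rows[-1][0].top) <= row_tolerance:
            rows[-1].append(el)
        else:
            rows.append([el])
    return "\n".join(
        "\t".join(el.text for el in sorted(row, key=lambda e: e.left))
        for row in rows
    )


# Example: a simple contact screen
screen = [
    ScreenElement("Rite Aid Pharmacy", top=40, left=20),
    ScreenElement("555-0142", top=120, left=20),
    ScreenElement("Call", top=120, left=200),
    ScreenElement("Directions", top=200, left=20),
]
print(screen_to_text(screen))
# Rite Aid Pharmacy
# 555-0142	Call
# Directions
```

The resulting text block is what the language model sees, so a reference like “the bottom one” can be resolved against lines of text rather than pixels.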
The researchers created synthetic datasets of conversational, on-screen, and background entities and tested ReALM against other models to compare their effectiveness at resolving references in conversational systems.
The smaller version of ReALM (80 million parameters) performed comparably to GPT-4, and its larger version (3B parameters) significantly outperformed GPT-4.
ReALM is a tiny model compared with GPT-4. Its strong reference resolution makes it an ideal candidate for a virtual assistant that can live on-device without sacrificing performance.
ReALM doesn't perform as well with more complex images or nuanced user requests, but it could work well as an in-car or on-device virtual assistant. Imagine if Siri could “see” your iPhone screen and respond to references to elements on the screen.
Apple has been a bit slow out of the starting blocks, but recent developments like the MM1 model and ReALM show that a lot is happening behind closed doors.