
Apple aims to understand user intent on the device with UI-JEPA models

Understanding user intent based on user interface (UI) interactions is a critical challenge in building intuitive and helpful AI applications.

In a recent paper, researchers from Apple introduce UI-JEPA, an architecture that significantly reduces the computational overhead of UI understanding while maintaining high performance. UI-JEPA aims to enable lightweight, on-device UI understanding, paving the way for more responsive and privacy-preserving AI assistant applications. This could fit into Apple's broader strategy to improve its on-device AI.

The challenges of UI understanding

To understand user intent from UI interactions, cross-modal features, including images and natural language, must be processed to capture the temporal relationships in UI sequences.

“While advances in multimodal large language models (MLLMs) such as Anthropic Claude 3.5 Sonnet and OpenAI GPT-4 Turbo offer opportunities for personalized planning by adding personal contexts as part of the prompt to improve alignment with users, these models require extensive compute resources, huge model sizes, and lead to high latencies,” co-authors Yicheng Fu, machine learning researcher and intern at Apple, and Raviteja Anantha, principal ML scientist at Apple, told VentureBeat. “This makes them impractical for scenarios where lightweight, on-device solutions with low latency and enhanced privacy are required.”

On the other hand, current lightweight models that can analyze user intent are still too computationally intensive to run efficiently on user devices.

The JEPA architecture

UI-JEPA draws inspiration from the Joint Embedding Predictive Architecture (JEPA), a self-supervised learning approach introduced by Meta AI Chief Scientist Yann LeCun in 2022. JEPA aims to learn semantic representations by predicting masked regions in images or videos. Instead of trying to recreate every detail of the input data, JEPA focuses on learning high-level features that capture the most important parts of a scene.

JEPA significantly reduces the dimensionality of the problem, allowing smaller models to learn rich representations. In addition, it is a self-supervised learning algorithm that can be trained on large amounts of unlabeled data, eliminating the need for costly manual annotations. Meta has already released I-JEPA and V-JEPA, two implementations of the algorithm designed for images and videos.

“Unlike generative approaches that attempt to fill in every missing detail, JEPA can discard unpredictable information,” said Fu and Anantha. “This results in improved training and sampling efficiency by a factor of 1.5 to 6, as observed in V-JEPA, which is critical given the limited availability of high-quality and labeled UI videos.”
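The idea is easier to see in code. The following is a minimal, self-contained sketch of a JEPA-style training step, assuming patch-level transformer encoders and a simple linear predictor; it is illustrative only and not the I-JEPA, V-JEPA, or UI-JEPA implementation.

```python
# Illustrative JEPA-style training step: predict the *representations* of
# masked patches from the visible ones, instead of reconstructing pixels.
# Module choices, shapes and the loss are assumptions for this sketch.
import torch
import torch.nn as nn

embed_dim = 256

context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    num_layers=2,
)
# In practice the target encoder is an EMA copy of the context encoder.
target_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    num_layers=2,
)
predictor = nn.Linear(embed_dim, embed_dim)

def jepa_step(patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """patches: (batch, num_patches, embed_dim) patch embeddings of a frame/clip.
    mask: (num_patches,) bool, True for patches hidden from the context encoder."""
    context = context_encoder(patches[:, ~mask])        # encode only visible patches
    with torch.no_grad():                               # targets come from the EMA encoder
        targets = target_encoder(patches)[:, mask]
    # Predict target embeddings (real JEPA predictors also condition on the
    # positions of the masked patches; omitted here for brevity).
    preds = predictor(context).mean(dim=1, keepdim=True).expand_as(targets)
    return nn.functional.smooth_l1_loss(preds, targets)

# Dummy usage: 16 patches per clip, mask out the last 4.
patches = torch.randn(2, 16, embed_dim)
mask = torch.zeros(16, dtype=torch.bool)
mask[-4:] = True
jepa_step(patches, mask).backward()
```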

UI-JEPA

UI-JEPA builds on the strengths of JEPA and adapts it to UI understanding. The framework consists of two main components: a video transformer encoder and a decoder-only language model.

The video transformer encoder is a JEPA-based model that processes videos of UI interactions into abstract feature representations. The LM takes the video embeddings and generates a textual description of the user intent. The researchers used Microsoft Phi-3, a lightweight LM with roughly 3 billion parameters, suitable for experimentation and on-device deployment.

This combination of a JEPA-based encoder and a lightweight LM enables UI-JEPA to achieve high performance with significantly fewer parameters and computational resources compared to state-of-the-art MLLMs.
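Based on that description, here is a rough sketch of how the two components could be wired together. The projection layer, tensor shapes, and the Hugging Face-style `inputs_embeds` interface are assumptions for illustration, not Apple's implementation.

```python
# Conceptual UI-JEPA pipeline: a JEPA-pretrained video encoder turns a UI
# recording into embeddings, which are fed to a small decoder-only LM
# (Phi-3-class) that generates the intent description.
import torch
import torch.nn as nn

class UIJEPAPipeline(nn.Module):
    def __init__(self, video_encoder: nn.Module, lm: nn.Module,
                 video_dim: int, lm_dim: int):
        super().__init__()
        self.video_encoder = video_encoder        # JEPA-based video transformer
        self.project = nn.Linear(video_dim, lm_dim)  # assumed adapter into the LM space
        self.lm = lm                              # decoder-only LM with an
                                                  # inputs_embeds-style interface (assumed)

    def forward(self, ui_video: torch.Tensor, prompt_embeds: torch.Tensor):
        # ui_video: (batch, frames, channels, height, width) screen recording
        video_feats = self.video_encoder(ui_video)      # (batch, tokens, video_dim)
        prefix = self.project(video_feats)              # map into the LM embedding space
        lm_inputs = torch.cat([prefix, prompt_embeds], dim=1)
        return self.lm(inputs_embeds=lm_inputs)         # LM decodes the intent text
```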

To further advance research on UI understanding, the researchers introduced two new multimodal datasets and benchmarks: “Intent in the Wild” (IIW) and “Intent in the Tame” (IIT).

IIT and IIW datasets for UI-JEPA

IIW captures open-ended sequences of UI actions with unclear user intent, such as booking a vacation rental. The dataset includes few-shot and zero-shot splits to evaluate the models' ability to generalize to unseen tasks. IIT focuses on more common tasks with clearer intent, such as setting a reminder or calling a contact.

“We believe that these datasets will contribute to the development of more powerful and lightweight MLLMs, as well as training paradigms with improved generalization capabilities,” the researchers write.

UI-JEPA in action

The researchers evaluated the performance of UI-JEPA on the new benchmarks and compared it with other video encoders and private MLLMs such as GPT-4 Turbo and Claude 3.5 Sonnet.

On both IIT and IIW, UI-JEPA outperformed other video encoder models in few-shot settings. It also achieved performance comparable to the much larger closed models. However, at 4.4 billion parameters, it is orders of magnitude lighter than the cloud-based models. The researchers found that incorporating text extracted from the UI with optical character recognition (OCR) further improved UI-JEPA's performance. In zero-shot settings, UI-JEPA lagged behind the leading models.

UI-JEPA compared to other encoders

“This suggests that while UI-JEPA excels at tasks involving familiar applications, it faces challenges when dealing with unfamiliar applications,” the researchers write.

The researchers see several potential uses for UI-JEPA models. One important application is creating automated feedback loops for AI agents, allowing them to continuously learn from interactions without human intervention. This approach can significantly reduce annotation costs and protect user privacy.

“As these agents collect more data via UI-JEPA, their responses become increasingly precise and effective,” the authors told VentureBeat. “In addition, UI-JEPA's ability to process a continuous stream of screen context can greatly enrich the prompts for LLM-based planners. This improved context helps produce more informed and nuanced plans, especially when processing complex or implicit queries based on previous multimodal interactions (e.g., eye-tracking for voice interaction).”

Another promising application is the integration of UI-JEPA into agentic frameworks designed to track user intent across different applications and modalities. UI-JEPA could act as a perception agent, capturing and storing user intent at different points in time. When a user interacts with a digital assistant, the system can retrieve the most relevant intent and generate the corresponding API call to fulfill the user's request, as sketched below.
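The following is a hypothetical sketch of that "perception agent" pattern: UI-JEPA-derived intent descriptions are stored as they are captured, and the most relevant one is retrieved when the user queries the assistant. The `IntentRecord` fields, the cosine-similarity retrieval, and the intent-to-API mapping are illustrative assumptions, not part of the paper.

```python
# Hypothetical intent store for a UI-JEPA perception agent.
from dataclasses import dataclass
import math
import time

@dataclass
class IntentRecord:
    timestamp: float
    text: str                 # intent description generated by UI-JEPA
    embedding: list[float]    # embedding of that description (any text embedder)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / ((na * nb) or 1.0)

class IntentStore:
    def __init__(self) -> None:
        self.records: list[IntentRecord] = []

    def add(self, text: str, embedding: list[float]) -> None:
        # Capture an intent as the user interacts with the UI.
        self.records.append(IntentRecord(time.time(), text, embedding))

    def most_relevant(self, query_embedding: list[float]) -> IntentRecord | None:
        # Retrieve the stored intent whose embedding best matches the query.
        return max(self.records,
                   key=lambda r: cosine(r.embedding, query_embedding),
                   default=None)

# A digital assistant could then map the retrieved intent to an API call,
# e.g. routing "book a rental" to a booking API and "set a reminder" to a
# reminders API (mapping names are hypothetical).
```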

“UI-JEPA can enhance any AI agent framework by leveraging screen activity data to better adapt to user preferences and predict user actions,” said Fu and Anantha. “Combined with temporal (e.g., time of day, day of the week) and geographic (e.g., in the office, at home) information, it can infer user intent and enable a wide range of direct applications.”

UI-JEPA seems to be a good fit for Apple Intelligence, a suite of lightweight generative AI tools aimed at making Apple devices smarter and more productive. Given Apple's focus on privacy, the low cost and added efficiency of UI-JEPA models could give its AI assistants an edge over others that rely on cloud-based models.
