
ByteDance's UI-TARS can take over your computer and outperforms GPT-4o and Claude

A new AI agent from TikTok's parent company can take control of your computer and perform complex workflows.

Similar to Anthropic's Computer Use, ByteDance's new UI-TARS understands graphical user interfaces (GUIs), applies reasoning, and takes autonomous, step-by-step actions.

The PC/MacOS agents, trained on roughly 50B tokens, are offered in 7B- and 72B-parameter versions. They achieve state-of-the-art (SOTA) performance on more than 10 GUI benchmarks covering perception, grounding, and overall agent capabilities, consistently outperforming OpenAI's GPT-4o, Claude, and Google's Gemini.

“Through iterative training and reflective optimization, UI-TARS continually learns from its mistakes and adapts to unexpected situations with minimal human intervention,” researchers from ByteDance and Tsinghua University wrote in a new research paper.

How UI-TARS explains its thinking

UI-TARS works across desktop, mobile and web applications and uses multimodal inputs (text, images, interactions) to understand visual environments.

The interface has two tabs: one on the left shows the model's step-by-step “thinking,” while a larger one on the right accesses files, websites and apps and automatically takes action.

For example, in a demo video released today, the model is given the prompt: “Find round-trip flights from SEA to NYC on the fifth and return flights on the tenth of next month, filtering by price in ascending order.”

In response, UI-TARS navigates to the Delta Airlines website, fills in the “from” and “to” fields, clicks the relevant dates, sorts and filters by price, and explains each step in its thinking box before taking action.

Another scenario instructs it to install the autoDocstring extension in VS Code. Here is how it steps through the task:

  • First it says: “To proceed with installing the autoDocstring extension in VS Code, I first must open the VS Code application.”
  • The model acknowledges that the VS Code window didn’t fully load and notes that “it is necessary to wait for the application to complete initialization before proceeding to install the autoDocstring extension. This ensures that all required functionality is available for the next steps.”
  • It then explains that the Extensions view needs to be accessed in VS Code, which can be done by clicking a tab in the left sidebar.
  • When it encounters a small error, it reasons: “The previous attempt to click the Extensions tab may not have been precise enough, so I will try clicking it again to make sure the Extensions view opens successfully.”
  • In the Extensions view, it types “autoDocstring” and concludes: “Since the installation process has already been initiated, no further action is required at this time. I will wait for the installation to finish and confirm success once the process is complete.”

Outperforming its competitors

Across various benchmarks, researchers report that UI-TARS consistently outperformed OpenAI's GPT-4o; Anthropic's Claude 3.5 Sonnet; Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.

For example, UI-TARS 72B scored 82.8% on VisualWebBench – which measures a model's ability to ground web elements, including webpage question answering and optical character recognition – outperforming GPT-4o (78.5%) and Claude 3.5 (78.2%).

It also performed significantly better on the WebSRC benchmark (understanding of semantic content and layout in web contexts) and ScreenQA-short (understanding of complex mobile screen layouts and web structure). UI-TARS-7B achieved a top score of 93.6% on WebSRC, while UI-TARS-72B achieved 88.6% on ScreenQA-short, outperforming Qwen, Gemini, Claude 3.5, and GPT-4o.

“These results reveal the superior perception and comprehension capabilities of UI-TARS in web and mobile environments,” the researchers write. “Such perceptual capabilities lay the foundation for agentic tasks, where an accurate understanding of the environment is critical to task execution and decision making.”

UI-TARS also showed impressive results on ScreenSpot Pro and ScreenSpot v2, which evaluate a model's ability to understand and localize elements in GUIs. In addition, the researchers tested its capabilities in planning multi-step actions and low-level tasks in mobile environments, and evaluated it on OSWorld (which assesses open-ended computer tasks) and AndroidWorld (which evaluates autonomous agents on 116 programmatic tasks across 20 mobile apps).

Under the hood

To perform step-by-step actions and recognize what it sees, UI-TARS was trained on a large dataset of screenshots annotated with metadata, including element description and type, visual description, bounding box (position information), element function, and text, drawn from various websites, applications and operating systems. This allows the model to provide a comprehensive, detailed description of a screenshot, capturing not only its elements but also their spatial relationships and the overall layout.
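
The paper describes this annotation scheme in prose rather than publishing a schema; as a rough illustration only (the field names below are invented, not taken from ByteDance's dataset), a single annotated element might look something like this:

```python
# Hypothetical sketch of one perception-training record; the field names are
# illustrative guesses, not the actual UI-TARS dataset format.
screenshot_annotation = {
    "element_description": "Blue 'Install' button for a VS Code extension",
    "element_type": "button",
    "visual_description": "Rounded rectangle, blue background, white label text",
    "bounding_box": {"x": 412, "y": 308, "width": 96, "height": 28},  # position on screen, in pixels
    "element_function": "Starts installation of the selected extension",
    "text": "Install",
    "source": "application",  # could equally be "website" or "operating_system"
}
```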

The model also uses state transition labeling to identify and describe the differences between two consecutive screenshots and determine whether an action, such as a mouse click or keyboard entry, has occurred. Using Set-of-Mark (SoM) prompting, different marks (letters, numbers) can be overlaid on specific areas of an image.
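
Set-of-Mark prompting is an established technique from prior vision-language work; a minimal sketch of the idea, drawing numbered marks over candidate regions with Pillow (coordinates and file names below are placeholders, not from the paper), could look like this:

```python
# Minimal Set-of-Mark (SoM) style sketch: overlay numbered marks on candidate
# UI regions so a model can refer to them by mark. Placeholder values only.
from PIL import Image, ImageDraw


def overlay_marks(image_path: str, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numbered mark and outline for each (x, y, width, height) box."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x, y, w, h) in enumerate(boxes, start=1):
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
        draw.text((x + 3, y + 3), str(i), fill="red")  # the mark the model refers to
    return img


# Example usage with made-up coordinates:
# marked = overlay_marks("screenshot.png", [(412, 308, 96, 28), (40, 120, 200, 24)])
# marked.save("screenshot_marked.png")
```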

The model is equipped with both short-term and long-term memory, allowing it to handle the task at hand while retaining historical interactions to improve subsequent decision-making. The researchers trained the model to reason in both System 1 (fast, automatic and intuitive) and System 2 (slow and deliberate) modes. This enables multi-step decision-making, “reflective thinking,” milestone recognition, and error correction.
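
ByteDance has not released reference code for this memory design, but the split can be pictured as a simple loop in which recent observations drive fast reactions while the full history is consulted for slower re-planning. The toy sketch below is purely illustrative; none of the names come from UI-TARS:

```python
# Toy illustration of short-term vs. long-term memory and a crude
# System 1 / System 2 split. Not actual UI-TARS code.
from collections import deque


class ToyGUIAgent:
    def __init__(self, short_term_size: int = 5):
        self.short_term = deque(maxlen=short_term_size)  # recent screens/actions only
        self.long_term = []                               # full interaction history

    def step(self, observation: str) -> str:
        self.short_term.append(observation)
        self.long_term.append(observation)
        if "error" in observation.lower():                # placeholder trigger for deliberation
            return self._plan_slow()                      # "System 2": reflect over full history
        return self._act_fast()                           # "System 1": react to the latest state

    def _act_fast(self) -> str:
        return f"click most likely target given {self.short_term[-1]!r}"

    def _plan_slow(self) -> str:
        return f"re-plan after reviewing {len(self.long_term)} past steps"
```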

The researchers emphasized that it is critical for the model to be able to maintain consistent goals and use trial and error to hypothesize, test, and evaluate potential actions before completing a task. To support this, they introduced two types of data: error-correction data and post-reflection data. For error correction, they identified mistakes and labeled corrective actions; for post-reflection, they simulated recovery steps.
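
As a purely hypothetical illustration (the paper does not specify the dataset format), records of these two data types might be structured roughly like this:

```python
# Guessed structure for the two data types described above; the real
# UI-TARS training format is not published.
error_correction_sample = {
    "state": "Extensions view did not open; the previous click missed the sidebar tab",
    "erroneous_action": "click(x=10, y=300)",
    "corrective_action": "click(x=18, y=305)",  # re-click the Extensions tab more precisely
}

post_reflection_sample = {
    "state": "A wrong extension page opened after a mistaken click",
    "reflection": "The opened page is not autoDocstring; go back and search again",
    "recovery_steps": ["navigate_back()", "type('autoDocstring')", "press('Enter')"],
}
```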

“This strategy ensures that the agent not only learns to avoid errors but also dynamically adapts when they occur,” the researchers write.

UI-TARS clearly has impressive capabilities, and it will be interesting to watch its evolving use cases in the increasingly competitive AI agent space. As the researchers note: “Looking forward, while native agents represent a major advance, the future lies in the integration of active and lifelong learning, where agents autonomously advance their own learning through continuous interactions in the real world.”

The researchers note that Claude Computer Use “performs well in web-based tasks but has significant problems in mobile scenarios, suggesting that Claude's GUI operability has not translated well to the mobile domain.”

In contrast, “UI-TARS shows excellent performance on both website and mobile.”
