A new framework from researchers at The University of Hong Kong (HKU) and collaborating institutions offers an open-source foundation for building robust AI agents that can operate computers. The framework, called OpenCUA, comprises the tools, data, and recipes for scaling the development of computer-use agents (CUAs).
Models trained with this framework perform strongly on CUA benchmarks, outperforming existing open-source models and competing closely with closed agents from leading AI labs such as OpenAI and Anthropic.
The challenge of building computer-use agents
Computer-use agents are designed to autonomously perform tasks on a computer, from navigating websites to operating complex software. They can also help automate workflows in the enterprise. However, the most capable CUA systems are proprietary, with critical details about their training data, architecture, and development processes kept private.
"As the lack of transparency limits technical progress and raises safety concerns, the research community needs truly open CUA frameworks to study their capabilities, limitations, and risks," the researchers write in their paper.
At the same time, open-source efforts face hurdles of their own. There has been no scalable infrastructure for collecting the diverse, large-scale data required to train these agents. Existing open-source datasets for graphical user interfaces (GUIs) contain limited data, and many research projects provide insufficient detail about their methods, making it difficult for others to replicate their work.
According to the paper, these limitations "hinder overall progress and restrict practical investigation into their scalability, generalizability, and potential learning approaches."
Introducing OpenCUA
OpenCUA is an open-source framework that addresses these challenges by scaling both data collection and the models themselves. At its core is the AgentNet tool, which records human demonstrations of computer tasks across different operating systems.
The tool streamlines data collection by running on an annotator's personal computer, capturing screen video, mouse and keyboard inputs, and the underlying accessibility tree, which provides structured information about on-screen elements. This raw data is then processed into "state-action trajectories," in which a screenshot of the computer (the state) is paired with the user's corresponding action (a click, keypress, etc.). Annotators can then review, edit, and submit these demonstrations.
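Conceptually, each processed demonstration reduces to a sequence of state-action pairs. Here is a minimal sketch in Python of what such a record could look like; the field names and action strings are illustrative, not OpenCUA's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    """One state-action pair: what the annotator saw, and what they did."""
    screenshot: str                   # the state: a frame from the screen recording
    action: str                       # the action: e.g. "click(x=512, y=304)" or 'type("report.xlsx")'
    a11y_tree: Optional[dict] = None  # accessibility-tree snapshot of on-screen elements

@dataclass
class Trajectory:
    """A full task demonstration, reviewed and submitted by the annotator."""
    task: str                         # natural-language task description
    os: str                           # "windows", "macos", or "ubuntu"
    steps: list = field(default_factory=list)
```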

With this tool, the researchers collected the AgentNet dataset, which comprises over 22,600 task demonstrations across Windows, macOS, and Ubuntu, spanning more than 200 applications and websites. "This dataset authentically captures the complexity of human behavior and the environmental dynamics of users' computing environments," the paper says.
The researchers recognized that screen-recording tools raise significant privacy concerns for enterprises and designed the AgentNet tool with security in mind. Xinyuan Wang, co-author of the paper and a doctoral student at HKU, explained that they implemented a multi-layer privacy protection framework. "First, annotators can fully observe the data they generate … before they decide whether to submit it," he told VentureBeat. The data then undergoes manual review for privacy issues and automated scanning by a large model to detect any remaining sensitive content before release. "This layered process ensures enterprise-grade robustness for environments that handle sensitive customer or financial data," Wang added.
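A rough sketch of how such a layered gate could be wired together is below; the function and attribute names are hypothetical, since the paper describes a process rather than an API:

```python
def release_demonstration(demo, manual_review, llm_scanner):
    """Hypothetical three-layer privacy gate mirroring the process Wang describes.

    Returns the demonstration only if it clears every layer; otherwise None.
    """
    # Layer 1: the annotator inspects their own recording and opts in.
    if not demo.annotator_approved:
        return None
    # Layer 2: a human reviewer manually checks for privacy issues.
    if manual_review(demo).has_privacy_issues:
        return None
    # Layer 3: a large model scans for any remaining sensitive content.
    if llm_scanner(demo).flags_sensitive_content:
        return None
    return demo
```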
To speed up evaluation, the team also curated AgentNetBench, an offline benchmark that provides multiple correct actions for each step, offering a more efficient way to measure an agent's performance.
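The practical upshot of accepting multiple valid actions per step is that an agent can be scored offline, without spinning up a live environment. A hedged sketch of what such scoring could look like follows; AgentNetBench's real matching rules (e.g. coordinate tolerances) are defined by the authors, so this exact-match version is an assumption:

```python
def score_task(model, task) -> float:
    """Fraction of steps where the model's prediction matches any accepted action."""
    correct = 0
    for step in task.steps:
        predicted = model.predict(step.screenshot, task.instruction)
        # Each step ships with several correct actions; any match counts.
        if predicted in step.valid_actions:
            correct += 1
    return correct / len(task.steps)
```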
A new recipe for training agents
The OpenCUA framework introduces a new pipeline for processing data and training computer-use agents. The first step converts raw human demonstrations into clean state-action pairs suitable for training vision-language models (VLMs). However, the researchers found that training models solely on these pairs yielded limited performance gains, even with large amounts of data.

The key insight was to augment these trajectories with chain-of-thought (CoT) reasoning. This process generates a detailed "inner monologue" for each action, covering planning, memory, and reflection. This structured reasoning is organized in three levels: a high-level observation of the screen, reflective thoughts that analyze the situation and plan the next steps, and finally, the concise, executable action. This approach helps the agent develop a deeper understanding of its tasks.
"We find natural language reasoning to be of crucial importance for generalizable computer-use models, helping CUAs internalize cognitive capabilities," the researchers write.
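Concretely, this means each training example pairs a screenshot with a three-part text target rather than a bare action. An illustrative example of what such a target could look like is below; the observation/thought/action structure follows the paper, but the wording and keys are invented:

```python
# A hypothetical CoT-augmented training target in the three levels the
# paper describes: observation -> reflective thought -> executable action.
sample = {
    "image": "step_07_screenshot.png",
    "target": (
        "Observation: The AWS console shows the EC2 dashboard; no instance "
        "has been launched yet.\n"
        "Thought: The region is already set from the previous step, so the "
        "next sub-goal is to open the launch wizard.\n"
        "Action: click(element='Launch instance')"
    ),
}
```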
This data synthesis pipeline is a general framework that enterprises can adapt to train agents on their own unique internal tools. According to Wang, a company can record demonstrations of its proprietary workflows and use the same reflector-and-generator pipeline to create the necessary training data. "This enables them to bootstrap a capable agent tailored to their internal tools without manually writing the reasoning traces," he explained.
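A minimal sketch of how such a generator-reflector loop could work on recorded workflows, assuming hypothetical method names (`draft_reasoning`, `is_consistent`, `revise`); the actual pipeline is specified in the paper:

```python
def synthesize_traces(trajectory, generator, reflector, max_rounds=3):
    """Augment each recorded step with a reasoning trace.

    The generator drafts an inner monologue for a step; the reflector checks
    the draft against the recorded state and action, and the generator revises
    until the reflector accepts it (or the retry budget runs out).
    """
    augmented = []
    for step in trajectory.steps:
        draft = generator.draft_reasoning(step.screenshot, step.action)
        for _ in range(max_rounds):
            if reflector.is_consistent(draft, step):
                break
            draft = generator.revise(draft, reflector.feedback(draft, step))
        augmented.append({"state": step.screenshot,
                          "reasoning": draft,
                          "action": step.action})
    return augmented
```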
Putting OpenCUA to the test
The researchers applied the OpenCUA framework to train a range of open-source VLMs, including variants of Qwen and Kimi-VL, with parameter sizes from 3 billion to 32 billion. The models were evaluated on a series of online and offline benchmarks that test their ability to perform tasks and understand GUIs.
The 32-billion-parameter model, OpenCUA-32B, set a new state-of-the-art success rate among open-source models on the OSWorld-Verified benchmark. It also surpassed OpenAI's GPT-4o-based CUA and significantly narrowed the performance gap with leading proprietary models from Anthropic.

For enterprise developers and product leaders, the research offers several important findings. The OpenCUA method is broadly applicable, improving performance across models of different architectures (both dense and mixture-of-experts) and sizes. The trained agents also show strong generalization, performing well across a variety of tasks and operating systems.
According to Wang, the framework is particularly well suited to automating repetitive, labor-intensive enterprise workflows. "For instance, in the AgentNet dataset, we recorded some demonstrations of launching EC2 instances on Amazon AWS and configuring annotation parameters on MTurk," he told VentureBeat. "These tasks involve many sequential steps but follow repeatable patterns."
However, Wang noted that bridging the gap to live deployment raises significant challenges around safety and reliability. "The biggest challenge in real-world deployment is safety and reliability: the agent must avoid mistakes that accidentally change system settings or trigger harmful side effects beyond the intended task," he said.
The researchers have released the code, the dataset, and the weights for their models.
As open-source agents built on frameworks such as OpenCUA mature, they stand to reshape the relationship between knowledge workers and their computers. Wang envisions a future in which mastery of complex software matters less than the ability to clearly articulate goals to an AI agent.
He described two main modes of working: "offline automation, where the agent leverages its broader software knowledge to complete a task end-to-end," and "online collaboration, where the agent responds in real time and works alongside humans, much like a colleague." In essence, humans will provide the strategic "what," while increasingly sophisticated AI agents handle the operational "how."

