
Google's AI can now surf the web for you, click buttons, and fill out forms with Gemini 2.5 Computer Use

Some of the biggest providers of large language models (LLMs) have tried to move beyond multimodal chatbots, expanding their models into "agents" that can actually take actions on websites on behalf of the user. Recall OpenAI's ChatGPT Agent (formerly known as "Operator") and Anthropic's Computer Use, each released within the last two years.

Now Google is stepping into the same game. Today the search giant's subsidiary, the DeepMind AI lab, unveiled a new, fine-tuned and customized version of its powerful Gemini 2.5 Pro LLM called "Gemini 2.5 Computer Use," which can use a virtual browser to surf the web on your behalf, retrieve information, fill out forms, and even take actions on websites, all from a single text prompt.

"These are early days, but the model's ability to interact with the web – like scrolling, filling out forms + navigating dropdowns – is an important next step in building general agents," said Google CEO Sundar Pichai, as part of a longer statement on the social network X.

However, the model is not available to consumers directly from Google.

Instead, Google has partnered with another company, Browserbase, founded by former Twilio engineer Paul Klein in early 2024, which provides a virtual "headless" web browser specifically for use by AI agents and applications. (A "headless" browser is one that does not require a graphical user interface, or GUI, to navigate the web, although in this case Browserbase displays a graphical representation for the user.)

Users can demo the new Gemini 2.5 Computer Use model directly on Browserbase and even compare it side-by-side with the older, competing offerings from OpenAI and Anthropic in a new "Browser Arena" launched by the startup (although only one additional model can be selected at a time besides Gemini).

For AI builders and enterprises, it is offered as a raw, albeit proprietary, LLM through the Gemini API in Google AI Studio for rapid prototyping, and through Google Cloud's Vertex AI model selection and application building platform.

The new offering builds on the capabilities of Gemini 2.5 Pro, released in March 2025 and significantly updated several times since then, with a focus on allowing AI agents to interact directly with user interfaces, including browsers and mobile applications.

Overall, Gemini 2.5 Computer Use appears to allow developers to create agents that can complete interface-driven tasks autonomously, e.g. clicking, typing, scrolling, filling out forms, and navigating past login screens.

Rather than relying solely on APIs or structured inputs, this model allows AI systems to interact with software visually and functionally, much like a human.

Brief hands-on tests

In my brief, unscientific first hands-on test on the Browserbase website, Gemini 2.5 Computer Use, as instructed, successfully navigated to Taylor Swift's official website and provided me with a summary of the special edition of her latest album, "The Life of a Showgirl," being sold and advertised at the top of the page.

In another test, I asked Gemini 2.5 Computer Use to search Amazon for highly rated and well-reviewed solar lights that I could stick in my backyard, and I was pleased to watch it successfully solve a Google search captcha meant to screen out non-human users ("Select all the boxes with a bicycle.").

However, once it got through, it stalled and was unable to finish the task, despite displaying a "task complete" message.

I should also note here that while OpenAI's ChatGPT Agent and Anthropic's Claude can create and edit local files, such as PowerPoint presentations, spreadsheets, or text documents, Gemini 2.5 Computer Use does not currently offer direct file access or native file creation capabilities on behalf of the user.

Instead, it is designed to control and navigate interfaces through actions such as clicking, tapping, and scrolling. Output is restricted to suggested UI actions or chatbot-style text responses. Any structured output such as a document or file must be handled separately by the developer, often via custom code or third-party integrations.

Performance benchmarks

According to Google, Gemini 2.5 Computer Use has shown leading results in several interface control benchmarks, particularly compared to other major AI systems, including Claude Sonnet and OpenAI's agent-based models.

Evaluations were conducted via Browserbase and Google's own testing.

Some highlights include:

  • Online Mind2Web (Browserbase): 65.7% for Gemini 2.5 versus 61.0% (Claude Sonnet 4) and 44.3% (OpenAI agent)

  • WebVoyager (Browserbase): 79.9% for Gemini 2.5 versus 69.4% (Claude Sonnet 4) and 61.0% (OpenAI agent)

  • AndroidWorld (DeepMind): 69.7% for Gemini 2.5 versus 62.1% (Claude Sonnet 4); OpenAI's model could not be measured due to lack of access

  • OSWorld: currently not supported by Gemini 2.5; the top competitor result was 61.4%

In addition to strong accuracy, Google reports that the model operates with lower latency than other browser control solutions, a key factor in production use cases such as UI automation and testing.

How it works

Computer use agents work in an interaction loop. They receive:

  • A user task prompt

  • A screenshot of the interface

  • A history of past actions

The model parses this input and produces a recommended UI action, such as clicking a button or typing in a field.

If necessary, it can request confirmation from the end user for riskier tasks, such as making a purchase.

Once the action is performed, the interface state is updated and a new screenshot is sent back to the model. The loop continues until the task completes or stops due to an error or a security decision.

The model uses a special tool called computer_use, and it can be integrated into custom environments using tools such as Playwright or the Browserbase demo sandbox.
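To make the loop concrete, here is a minimal Python sketch of the screenshot-action-screenshot cycle, using Playwright on the browser side. The propose_next_action helper is a hypothetical stand-in for the actual Gemini API call carrying the computer_use tool; the real request and response shapes are defined in Google's documentation.

```python
# Minimal sketch of the interaction loop described above. The Playwright
# calls are real; propose_next_action() is a hypothetical placeholder for
# the Gemini API request that carries the computer_use tool.
from playwright.sync_api import sync_playwright

def propose_next_action(task: str, screenshot: bytes, history: list) -> dict:
    """Hypothetical wrapper around the Gemini computer_use call.
    Would return e.g. {"name": "click_at", "x": 720, "y": 450} or {"name": "done"}."""
    raise NotImplementedError("wire this to the Gemini API")

def run_agent(task: str, start_url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch().new_page(viewport={"width": 1440, "height": 900})
        page.goto(start_url)
        history: list = []
        for _ in range(max_steps):
            # Each turn: task prompt + fresh screenshot + action history in,
            # one suggested UI action out.
            action = propose_next_action(task, page.screenshot(), history)
            if action["name"] == "done":        # model considers the task complete
                break
            if action["name"] == "click_at":
                page.mouse.click(action["x"], action["y"])
            elif action["name"] == "type_text_at":
                page.mouse.click(action["x"], action["y"])
                page.keyboard.type(action["text"])
            history.append(action)              # past actions feed the next turn
```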

Use cases and adoption

According to Google, teams internally and externally have already begun using the model across multiple domains:

  • Google's payments platform team reports that Gemini 2.5 Computer Use successfully recovers over 60% of failed test executions, reducing a key source of engineering inefficiency.

  • Cares, a third-party AI agent platform, said the model outperformed others on complex data parsing tasks, increasing performance by up to 18% in its hardest evaluations.

  • Poke.com, a proactive AI assistant provider, found that the Gemini model often performs interface interactions 50% faster than competing solutions.

The model is also used in Google's own product development efforts, including Project Mariner, the Firebase Testing Agent, and AI Mode in Search.

Security measures

Because this model directly controls software interfaces, Google emphasizes a layered approach to security:

  • A per-step safety service validates each suggested action before execution.

  • Developers can define system-level instructions to block certain actions or require confirmation before they run.

  • The model includes built-in protections to prevent actions that compromise security or violate Google's prohibited-use policies.

For example, when the model encounters a captcha, it generates an action to click the checkbox but marks it as requiring user confirmation, ensuring that the system does not proceed without human supervision.
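As an illustration of how a developer might honor those per-step decisions, here is a small hedged sketch; the field values (safety_decision, require_confirmation) are assumptions for illustration, not Google's documented schema.

```python
# Hypothetical gating of a proposed UI action on its safety decision.
# Field names here are illustrative, not the documented Gemini API schema.
def execute_with_confirmation(action: dict, execute) -> bool:
    """Run a proposed action, pausing for a human when it is flagged."""
    decision = action.get("safety_decision", "allow")
    if decision == "block":
        print(f"Blocked by the safety service: {action['name']}")
        return False
    if decision == "require_confirmation":  # e.g. captchas or purchases
        reply = input(f"Model wants to run {action['name']!r}. Proceed? [y/N] ")
        if reply.strip().lower() != "y":
            return False
    execute(action)  # hand the approved action to the browser layer
    return True
```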

Technical capabilities

The model supports a wide selection of built-in UI actions, such as:

  • click_at, type_text_at, scroll_document, drag_and_drop, and more

  • Custom functions can be added to extend its reach to mobile or custom environments

  • Screen coordinates are normalized (0-1000 scale) and converted back to pixel dimensions during execution

It accepts image and text input and outputs text responses or function calls to perform tasks. The recommended screen resolution for optimal results is 1440×900, although it may work with other sizes.
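The coordinate scheme is simple to apply in practice. Here is a small sketch of the conversion (the helper name is ours, not Google's):

```python
# Map the model's normalized 0-1000 coordinates to pixels for execution,
# using the recommended 1440x900 viewport as the default.
def denormalize(x_norm: int, y_norm: int,
                width: int = 1440, height: int = 900) -> tuple[int, int]:
    return round(x_norm * width / 1000), round(y_norm * height / 1000)

# A click_at(500, 500) lands at the center of the screen:
print(denormalize(500, 500))  # (720, 450)
```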

API pricing remains nearly identical to Gemini 2.5 Pro

Pricing for Gemini 2.5 Computer Use closely matches the standard Gemini 2.5 Pro model. Both follow the same per-token billing structure: input tokens are priced at $1.25 per million tokens for prompts under 200,000 tokens and $2.50 per million tokens for prompts longer than that.

Output tokens follow a similar split, priced at $10.00 per million below that threshold and $15.00 per million above it.
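For a back-of-the-envelope estimate, a helper like the one below applies those rates (assuming, as with Gemini 2.5 Pro, that both splits are keyed to the 200,000-token prompt threshold; check Google's pricing page for current numbers):

```python
# Illustrative cost arithmetic for the rates quoted above.
def request_cost(input_tokens: int, output_tokens: int) -> float:
    long_prompt = input_tokens >= 200_000
    input_rate = 2.50 if long_prompt else 1.25     # $ per 1M input tokens
    output_rate = 15.00 if long_prompt else 10.00  # $ per 1M output tokens
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# A 50k-token prompt with a 2k-token response:
print(f"${request_cost(50_000, 2_000):.4f}")  # $0.0825
```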

Where the models differ is in availability and extra features.

Gemini 2.5 Pro includes a free tier that allows developers to use the model at no cost, with no explicit token cap published, depending on the platform (e.g. Google AI Studio).

This free access covers both input and output tokens. Once developers exceed their allotted quota or upgrade to the paid tier, standard per-token pricing applies.

In contrast, Gemini 2.5 Computer Use is only available through the paid tier. There is no free access for this model currently, and all usage incurs token-based fees from the start.

In terms of features, Gemini 2.5 Pro supports optional features such as context caching (starting at $0.31 per million tokens) and grounding with Google Search (free for up to 1,500 requests per day, then $35 per 1,000 additional requests). These are not available for Computer Use at this time.

Another distinction is data handling: output from the Computer Use model is not used to improve Google products in the paid tier, while usage on the free tier of Gemini 2.5 Pro may contribute to model improvement.

Overall, developers can expect similar token-based costs with both models, but should weigh tier access, features, and data policies to decide which model meets their needs.
