Google LLC has just announced a new version of its Gemini large language model that can navigate the web through a browser and interact with websites, meaning it can perform tasks such as searching for information or buying things without human supervision.

The model, Gemini 2.5 Computer Use, uses a combination of visual understanding and reasoning to analyze a user’s request and carry it out in the browser. It will complete all of the actions required to fulfill that task, such as clicking, typing, scrolling, manipulating dropdown menus and filling out and submitting forms, just as a human would.

Google’s DeepMind research outfit said Gemini 2.5 Computer Use is based on the Gemini 2.5 Pro LLM. It explained that earlier versions of the model have already been used to power agentic features launched in tools such as AI Mode and Project Mariner, but this is the first time the complete model has been made available.

The company explained that each request kicks off a “loop” in which the model works through a series of steps until the task is considered complete. First, the user sends a request to the model, which can also include screenshots of the website in question and a history of recent actions. Gemini 2.5 Computer Use then analyzes those inputs and generates a response, typically a “function call representing one of the UI actions such as clicking or typing.” Client-side code executes the required action, and once this is done, a new screenshot of the graphical user interface and the current website is sent back to the model as a function response.
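To make that cycle concrete, below is a minimal Python sketch of the client-side portion of the loop. The helper names (call_model, take_screenshot, execute_action) and the Action type are hypothetical placeholders standing in for the Gemini API and a browser-automation layer, not Google’s published SDK; the sketch only illustrates the request, act and screenshot-feedback cycle described above.

```python
# A minimal sketch of the client-side agent loop, assuming hypothetical
# helpers: call_model() wraps the Gemini 2.5 Computer Use API, while
# take_screenshot() and execute_action() wrap a browser-automation layer.
# These names are illustrative only.

from dataclasses import dataclass, field


@dataclass
class Action:
    """A UI action proposed by the model, e.g. click, type or scroll."""
    name: str                       # e.g. "click_at", "type_text_at"
    args: dict = field(default_factory=dict)


def call_model(request: str, screenshot: bytes, history: list[Action]) -> Action | None:
    """Hypothetical wrapper: sends the user request, current screenshot and
    recent action history to the model, returning either the next UI action
    (as a function call) or None when the task is considered complete."""
    raise NotImplementedError


def take_screenshot() -> bytes:
    """Hypothetical browser helper: captures the current page as an image."""
    raise NotImplementedError


def execute_action(action: Action) -> None:
    """Hypothetical browser helper: performs the action (click, type, ...)."""
    raise NotImplementedError


def run_task(request: str, max_steps: int = 50) -> None:
    """Drives the loop: the model proposes an action, client-side code
    executes it, and a fresh screenshot goes back as the function response."""
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = call_model(request, screenshot, history)
        if action is None:          # model signals the task is complete
            break
        execute_action(action)      # client-side code performs the UI action
        history.append(action)      # becomes context for the next iteration
```

The key point the sketch highlights is that the model never drives the browser directly: it only proposes actions as function calls, while client-side code executes them and decides what state, in the form of a new screenshot, to send back.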