Salesforce’s new CoAct-1 gives computer-use agents the ability to execute code while navigating GUIs, writing scripts as well as moving a cursor and clicking buttons in an application • DigiBanker

Researchers at Salesforce and the University of Southern California have developed a technique that lets computer-use agents execute code while navigating graphical user interfaces (GUIs). An agent can write and run scripts as well as move a cursor and click buttons in an application, combining the two approaches to speed up workflows and reduce errors. This hybrid approach lets an agent bypass brittle, inefficient mouse clicks for tasks that are better accomplished through code. The system, called CoAct-1, sets a new state of the art on key agent benchmarks, outperforming other methods while requiring significantly fewer steps to complete complex tasks on a computer.

CoAct-1 is structured as a team of three specialized agents that work together: an Orchestrator, a Programmer, and a GUI Operator. The Orchestrator acts as the central planner, or project manager. It analyzes the user’s overall goal, breaks it into subtasks, and assigns each subtask to the agent best suited to it. Backend operations such as file management or data processing go to the Programmer, which writes and executes Python or Bash scripts. Frontend tasks that require clicking buttons or navigating visual interfaces go to the GUI Operator, an agent built on a vision-language model (VLM).

The Programmer uses an LLM to generate its code and sends it to a code interpreter, testing and refining the code over multiple rounds. Similarly, the GUI Operator uses an action interpreter that executes its commands (e.g., mouse clicks, typing) and returns the resulting screenshot, so it can see the outcome of its actions. When either agent completes a subtask, it sends a summary and a screenshot of the current system state back to the Orchestrator, which makes the final decision on whether to assign a next step or conclude the task.
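
To make the division of labor concrete, here is a minimal Python sketch of the Orchestrator, Programmer, and GUI Operator loop described above. It is an illustration under stated assumptions, not the researchers' implementation: the class and method names (`Report`, `next_step`, `solve`) are hypothetical, a hard-coded plan stands in for the Orchestrator's model-driven planning, a fixed one-line script stands in for LLM-generated code, and the GUI Operator is a stub that sends no real input events.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Report:
    """What a worker hands back to the Orchestrator after one subtask."""
    agent: str
    summary: str
    screenshot: str  # placeholder; a real system would carry image data


class Programmer:
    """Backend worker. A real one would ask an LLM to write the script."""
    name = "programmer"

    def run(self, subtask: str) -> Report:
        # Assumption: a fixed one-line script stands in for LLM-generated code.
        script = f"print('done: ' + {subtask!r})"
        result = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, check=True,
        )
        return Report(self.name, result.stdout.strip(), "<screenshot>")


class GuiOperator:
    """Frontend worker. A real one would be a VLM issuing clicks and keystrokes."""
    name = "gui_operator"

    def run(self, subtask: str) -> Report:
        # Assumption: no real mouse or keyboard events are sent here.
        return Report(self.name, f"clicked through: {subtask}", "<screenshot>")


class Orchestrator:
    """Central planner: picks a worker per subtask and decides when to stop."""

    def __init__(self, plan: List[Tuple[str, str]], max_steps: int = 10):
        # Assumption: a hard-coded plan replaces model-driven planning.
        self.plan = plan
        self.max_steps = max_steps
        self.workers = {w.name: w for w in (Programmer(), GuiOperator())}

    def next_step(self, history: List[Report]) -> Optional[Tuple[str, str]]:
        # Returning None means the Orchestrator concludes the task.
        if len(history) >= min(len(self.plan), self.max_steps):
            return None
        return self.plan[len(history)]

    def solve(self) -> List[Report]:
        history: List[Report] = []
        while (step := self.next_step(history)) is not None:
            agent_name, subtask = step
            history.append(self.workers[agent_name].run(subtask))
        return history


if __name__ == "__main__":
    plan = [
        ("programmer", "rename report files"),
        ("gui_operator", "export chart as PDF"),
    ]
    for report in Orchestrator(plan).solve():
        print(report.agent, "->", report.summary)
```

The point of the sketch is the control flow: each worker returns a report to the Orchestrator, and only the Orchestrator decides whether another subtask follows. In a real system, `next_step` would consult a model using the accumulated summaries and screenshots rather than reading from a fixed list.
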
The researchers tested CoAct-1 on OSWorld, a benchmark of 369 real-world tasks spanning browsers, IDEs, and office applications. CoAct-1 achieved a success rate of 60.76%, a new state of the art. The gains were largest in categories where programmatic control offers a clear advantage, such as OS-level tasks and multi-application workflows. The system is also more efficient: it solves tasks in an average of 10.15 steps, compared with 15.22 steps for leading GUI-only agents like GTA-1, or roughly a third fewer.

The potential of this technology goes beyond general productivity. For enterprise leaders, the value lies in automating complex, multi-tool processes where full API access is a luxury, not a guarantee. Ran Xu, a co-author of the paper and Director of Applied AI Research at Salesforce, points to customer support as a prime example. Xu also sees high-value applications in sales, such as prospecting at scale and automating bookkeeping, and in marketing, for tasks such as customer segmentation and campaign asset generation.
