AI research institute Ai2, the Allen Institute for AI, released MolmoAct 7B, an open embodied AI model that brings reasoning to robotics by allowing robots to “think” through actions before performing them. Ai2 said MolmoAct is the first in a new category of AI models the company calls action reasoning models, or ARMs, which interpret high-level natural language instructions and then reason through a plan of physical actions to carry them out in the real world.

Unlike current robotics models on the market that operate as vision-language-action foundation models, ARMs break down instructions into a series of waypoints and actions that take into account what the model can see.

“As soon as it sees the world, it lifts the entire world into 3D and then it draws a trajectory to define how its arms are going to move in that space,” said Ranjay Krishna, the computer vision team lead at Ai2. “So, it plans for the future. And after it’s done planning, only then does it start taking actions and moving its joints.”

Unlike many current models on the market, MolmoAct 7B was trained on a curated open dataset of around 12,000 “robot episodes” recorded in real-world environments such as kitchens and bedrooms. These demonstrations were used to map goal-oriented actions, such as arranging pillows and putting away laundry.

Krishna explained that MolmoAct overcomes the industry’s transparency challenge by being fully open: Ai2 provides its code, weights and evaluations, resolving the “black box” problem. The model is both trained on open data, and its inner workings are transparent and openly available.

To add even more control, users can preview the model’s planned movements before execution, with its intended motion trajectories overlaid on camera images. These plans can be modified using natural language or by sketching corrections on a touchscreen. This gives developers and robotics technicians a fine-grained way to control robots in settings such as homes, hospitals and warehouses.

On the SimPLER benchmark, the model achieved a state-of-the-art task success rate of 72.1%, beating models from Physical Intelligence, Google LLC, Microsoft Corp. and Nvidia.
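The perceive, plan, preview and execute loop described above can be pictured with a short sketch. The class and method names below (Waypoint, lifter.lift, planner.plan, ui.preview, robot.move_to and so on) are hypothetical illustrations, not MolmoAct’s actual API; they only mirror the workflow reported in this article: lift the camera view into 3D, plan a waypoint trajectory, surface it for optional correction, and only then move the joints.

```python
# Hypothetical sketch of an action-reasoning loop as reported in the article.
# None of these classes come from MolmoAct's released code; they only
# illustrate the described perceive -> plan -> preview -> execute order.

from dataclasses import dataclass


@dataclass
class Waypoint:
    """A single point along the planned motion trajectory."""
    x: float
    y: float
    z: float


def plan_and_execute(instruction: str, camera_image, lifter, planner, ui, robot) -> None:
    # 1. Lift the 2D camera view into a 3D representation of the scene.
    scene_3d = lifter.lift(camera_image)  # hypothetical depth/3D lifting step

    # 2. Reason through a trajectory of waypoints before moving anything.
    waypoints: list[Waypoint] = planner.plan(instruction, scene_3d)

    # 3. Preview the intended trajectory overlaid on the camera image and
    #    accept a correction given in natural language or as a sketch.
    correction = ui.preview(camera_image, waypoints)
    if correction is not None:
        waypoints = planner.replan(instruction, scene_3d, correction)

    # 4. Only after planning (and any correction) does the robot move its joints.
    for wp in waypoints:
        robot.move_to(wp)
```

The ordering is the point of the sketch: perception and trajectory planning finish, and the user has a chance to inspect or redraw the plan, before any joint motion begins.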