To tackle “jagged intelligence,” one of AI’s most persistent challenges for business applications, Salesforce revealed several new benchmarks, models, and frameworks designed to make future AI agents more intelligent, trusted, and versatile for enterprise use. The term refers to the gap between an AI system’s raw intelligence and its ability to perform consistently in unpredictable enterprise environments.

Among the announcements is the SIMPLE dataset, a public benchmark of 225 straightforward reasoning questions designed to measure how jagged an AI system’s capabilities really are.

Perhaps the most significant innovation is CRMArena, a novel benchmarking framework designed to simulate realistic customer relationship management scenarios. It enables comprehensive testing of AI agents in professional contexts, addressing the gap between academic benchmarks and real-world business requirements. The framework evaluates agent performance across three key personas: service agents, analysts, and managers. Early testing revealed that even with guided prompting, leading agents succeed less than 65% of the time at function-calling for these personas’ use cases.

Among the technical innovations, Salesforce highlighted SFR-Embedding, a new model for deeper contextual understanding that leads the Massive Text Embedding Benchmark (MTEB) across 56 datasets. A specialized version, SFR-Embedding-Code, was also introduced for developers, enabling high-quality code search and streamlining development.

Salesforce also announced xLAM V2 (Large Action Model), a family of models specifically designed to predict actions rather than just generate text. These models start at just 1 billion parameters, a fraction of the size of many leading language models.

To address enterprise concerns about AI safety and reliability, Salesforce introduced SFR-Guard, a family of models trained on both publicly available data and CRM-specialized internal data. These models strengthen the company’s Trust Layer, which provides guardrails for AI agent behavior. The company also launched ContextualJudgeBench, a novel benchmark for evaluating LLM-based judge models in context, testing over 2,000 challenging response pairs for accuracy, conciseness, faithfulness, and appropriate refusal to answer.

Finally, Salesforce unveiled TACO, a multimodal action model family designed to tackle complex, multi-step problems through chains of thought-and-action (CoTA). This approach enables AI to interpret and respond to intricate queries involving multiple media types, with Salesforce claiming up to 20% improvement on the challenging MMVet benchmark.
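The announcement does not detail how chains of thought-and-action are implemented, but the general pattern the phrase describes, a model interleaving reasoning steps with tool calls until it can produce an answer, can be sketched roughly as follows. Everything in this snippet (the toy_model stub, the TOOLS registry, and the cota_loop function) is a hypothetical illustration of that loop, not TACO’s actual API.

```python
# A minimal, generic sketch of a chain-of-thought-and-action (CoTA) loop: the model
# alternates between reasoning steps and tool calls until it can answer. This is an
# illustration of the general pattern only, not TACO's interface; the tools and the
# scripted toy_model below are hypothetical stand-ins.

TOOLS = {
    # Placeholder tools; a real multimodal agent might expose OCR, retrieval, etc.
    "ocr": lambda image_path: "Total due: $42.50",
    "calculator": lambda expression: str(eval(expression)),  # demo only; never eval untrusted input
}

def toy_model(history):
    """Scripted stand-in for a multimodal model: returns either a
    thought-and-action step or a final answer based on what it has seen."""
    if not any(turn["role"] == "tool" for turn in history):
        return {"thought": "I need to read the receipt first.",
                "action": "ocr", "input": "receipt.png"}
    return {"answer": "The receipt shows a total of $42.50."}

def cota_loop(query, model, max_steps=5):
    history = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        step = model(history)
        if "answer" in step:                                # the model has enough information
            return step["answer"]
        observation = TOOLS[step["action"]](step["input"])  # execute the chosen tool
        history.append({"role": "assistant", "content": step["thought"]})
        history.append({"role": "tool", "content": observation})
    return "No answer within the step budget."

print(cota_loop("How much is the total on this receipt?", toy_model))
```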