Reports emerged over the last few days on X from AI influencers, including OpenAI’s own researcher “Roon (@tszzl on X)” (speculated to be technical team member Tarun Gogineni) — of a new “router” function that will automatically select the best OpenAI model to respond to the user’s input on the fly, depending on the specific input’s content. Similarly, Yuchen Jin, Co-founder & CTO of AI inference cloud provider Hyperbolic Labs, wrote in an X post, “Heard GPT-5 is imminent, from a little bird. It’s not one model, but multiple models. It has a router that switches between reasoning, non-reasoning, and tool-using models. That’s why Sam said they’d “fix model naming”: prompts will just auto-route to the right model. GPT-6 is in training.” While a presumably far more advanced GPT-5 model would (and will) be huge news if and when released, the router may make life much easier and more intelligent for the average ChatGPT subscriber. It would also follow on the heels of other third-party products such as the web-based Token Monster chatbot, which automatically select and combine responses from multiple third-party LLMs to respond to user queries. Hopefully any hypothetical OpenAI router seamlessly helps direct them to the right model product for their needs, when they need it.
Sovos’s AI platform delivers automation through every stage of compliance for e-invoicing, taxation and regulatory reporting letting users navigate complexity through natural language, visual interfaces, self-service analytics and biometric security
Sovos, announced the launch of Sovi™ AI, a first-of-its-kind suite of embedded AI and machine learning capabilities purpose-built for tax compliance. Sovi symbolizes smart power in action, a perfect reflection of Sovos’ embedded AI engine that drives a whole panorama of intelligent automation across the Sovos Tax Compliance Cloud platform. Sovi delivers unprecedented insight, automation, and reliability throughout every stage of compliance for e-invoicing, taxation and regulatory reporting. Sovi AI will integrate across analytics, automation, and regulatory workflows, enabling technical and non-technical teams to navigate complexity through natural language, visual interfaces, and intuitive guidance. Sovi AI capabilities are already operational across Sovos solutions, including advanced biometrics for face and liveness detection, image recognition, and secure authentication built into Sovos Trust solutions. The roadmap includes ambitious expansions such as AI compliance checks, Ask Sovi for embedded assistants, automated mapping tools for goods and services classification, and intelligent document agents for AP process automation. Sovi AI enables organizations to achieve: Enhanced Efficiency: Self-service analytics eliminate IT dependencies for finance and tax teams; Improved Accuracy: Biometric security and AI validations reduce errors, fraud, and compliance mismatches; Greater Clarity: Conversational AI and insightful dashboards uncover hidden issues and opportunities; Unlimited Scalability: Future-proof compliance capabilities regardless of country, volume, or complexity.
Agent2.AI’s AI orchestration platform can understand user intent, break down the request into smaller, manageable steps, delegate each task to focused atomic agents and deliver real, usable outputs such as reports, spreadsheets, and presentations
Agent2.AI announced the upcoming launch of Super Agent, a breakthrough AI orchestration platform designed to coordinate intelligent work across multiple agents, APIs, and even real human collaborators. Unlike traditional AI tools that focus on generating content or answering questions, Super Agent acts as an orchestration layer — a system that understands user intent, delegates work to the right components, and delivers real, usable outputs such as reports, spreadsheets, and presentations. “We’re not building just another AI agent,” said Chuci Qin, CEO of Agent2.AI. Users can prompt Super Agent with requests and system will automatically break each request into smaller, manageable steps. Each task is broken down and handled by focused atomic agents. Each agent is built to do one specific job, such as finding information, organizing research, or creating slides. These atomic agents form a growing ecosystem inside Agent2.AI, each focused, reliable, and composable. Super Agent can also call on external tools and agents through standard protocols such as MCP or A2A, allowing the system to dynamically connect with open-source frameworks, third-party APIs, or no-code automations as needed. In some cases, tasks may require not just software, but real-world execution, such as placing an order, contacting a vendor, or managing a physical deliverable. When that’s the case, Super Agent can seamlessly coordinate with vetted freelancers or agency partners. These human contributors are not fallback options, but core participants in a flexible, multi-agent system.
A new open-source method utilizes the MCP architecture to evaluate agent performance through a variety of available LLMs by gathering real-time information on how agents interact with tools, generating synthetic data and creating a database to benchmark them
Researchers from Salesforce discovered another way to utilize MCP technology, this time to aid in evaluating AI agents themselves. The researchers unveiled MCPEval, a new method and open-source toolkit built on the architecture of the MCP system that tests agent performance when using tools. They noted current evaluation methods for agents are limited in that these “often relied on static, pre-defined tasks, thus failing to capture the interactive real-world agentic workflows.” MCPEval differentiates itself by being a fully automated process, which the researchers claimed allows for rapid evaluation of new MCP tools and servers. It both gathers information on how agents interact with tools within an MCP server, generates synthetic data and creates a database to benchmark agents. Users can choose which MCP servers and tools within those servers to test the agent’s performance on. MCPEval’s framework takes on a task generation, verification and model evaluation design. Leveraging multiple large language models (LLMs) so users can choose to work with models they are more familiar with, agents can be evaluated through a variety of available LLMs in the market. Enterprises can access MCPEval through an open-source toolkit released by Salesforce. Through a dashboard, users configure the server by selecting a model, which then automatically generates tasks for the agent to follow within the chosen MCP server. Once the user verifies the tasks, MCPEval then takes the tasks and determines the tool calls needed as ground truth. These tasks will be used as the basis for the test. Users choose which model they prefer to run the evaluation. MCPEval can generate a report on how well the agent and the test model functioned in accessing and using these tools. What makes MCPEval stand out from other agent evaluators is that it brings the testing to the same environment in which the agent will be working. Agents are evaluated on how well they access tools within the MCP server to which they will likely be deployed.
Alibaba’s Qwen3-Coder launches and it ‘might be the best coding model yet’- designed to handle complex, multi-step coding workflows and can create full-fledged, functional applications in seconds or minutes
Chinese e-commerce giant Alibaba’s “Qwen Team” has come out with Qwen3-Coder-480B-A35B-Instruct, a new open-source LLM focused on assisting with software development. It is designed to handle complex, multi-step coding workflows and can create full-fledged, functional applications in seconds or minutes. Qwen3-Coder, is available now under an open source Apache 2.0 license, meaning it’s free for any enterprise to take without charge, download, modify, deploy and use in their commercial applications for employees or end customers. It’s also so highly performant on third-party benchmarks and anecdotal usage among AI power users for “vibe coding.” Qwen3-Coder is a Mixture-of-Experts (MoE) model with 480 billion total parameters, 35 billion active per query, and 8 active experts out of 160. It supports 256K token context lengths natively, with extrapolation up to 1 million tokens using YaRN (Yet another RoPE extrapolatioN — a technique used to extend a language model’s context length beyond its original training limit by modifying the Rotary Positional Embeddings (RoPE) used during attention computation. This capacity enables the model to understand and manipulate entire repositories or lengthy documents in a single pass. Designed as a causal language model, it features 62 layers, 96 attention heads for queries, and 8 for key-value pairs. It is optimized for token-efficient, instruction-following tasks and omits support for <think> blocks by default, streamlining its outputs. Qwen3-Coder has achieved leading performance among open models on several agentic evaluation suites: SWE-bench Verified: 67.0% (standard), 69.6% (500-turn); GPT-4.1: 54.6%; Gemini 2.5 Pro Preview: 49.0%; Claude Sonnet-4: 70.4%. The model also scores competitively across tasks such as agentic browser use, multi-language programming, and tool use. For enterprises, Qwen3-Coder offers an open, highly capable alternative to closed-source proprietary models. With strong results in coding execution and long-context reasoning, it is especially relevant for: Codebase-level understanding: Ideal for AI systems that must comprehend large repositories, technical documentation, or architectural patterns Automated pull request workflows: Its ability to plan and adapt across turns makes it suitable for auto-generating or reviewing pull requests Tool integration and orchestration: Through its native tool-calling APIs and function interface, the model can be embedded in internal tooling and CI/CD systems. This makes it especially viable for agentic workflows and products, i.e., those where the user triggers one or multiple tasks that it wants the AI model to go off and do autonomously, on its own, checking in only when finished or when questions arise. Data residency and cost control: As an open model, enterprises can deploy Qwen3-Coder on their own infrastructure—whether cloud-native or on-prem—avoiding vendor lock-in and managing compute usage more directly.
Hailo Technologies AI accelerator device runs hybrid AI pipelines that blend LLMs, vision-language models (VLMs), and other multi-modal AI with traditional convolutional neural networks (CNNs) directly on-device, eliminating the need for cloud-based inference
Israeli chipmaker Hailo Technologies has released the Hailo-10H, the first discrete AI accelerator designed for generative AI workloads at the edge. The device runs large language models (LLMs), vision-language models (VLMs), and other multi-modal AI directly on-device, eliminating the need for cloud-based inference. The Hailo-10H offers unmatched power efficiency and low latency, achieving first-token generation in under one second and maintaining 10 tokens per second on 2-billion parameter LLMs. It can also generate images with Stable Diffusion 2.1 in under five seconds, demonstrating a significant leap forward for offline generative workloads. The chip is designed around Hailo’s second-generation neural core architecture, providing 40 tera-operations per second (TOPS) of INT4 performance and 20 TOPS of INT8 at a typical power draw of 2.5 W. It is fully compatible with TensorFlow, PyTorch, ONNX, and Keras, and is supported by Hailo’s mature software stack. The device is designed to work in hybrid AI pipelines that blend LLMs or VLMs with traditional convolutional neural networks (CNNs), conserving power and ensuring real-time responsiveness for mission-critical applications like video analytics.
Anthropic unveils ‘auditing agents’ to test for AI misalignment finding prompts that elicit “concerning” behaviors
Anthropic researchers developed auditing agents that achieved “impressive performance at auditing tasks, while also shedding light on their limitations.” The researchers stated that these agents, created during the pre-deployment testing of Claude Opus 4, enhanced alignment validation tests and enabled researchers to conduct multiple parallel audits at scale. The three agents they explored were: Tool-using investigator agent for open-ended investigation of models using chat, data analysis and interpretability tools; Evaluation agent that builds behavioral evaluations and can discriminate between models that have implanted behaviors and those that do not; Breadth-first red-teaming agent, which was developed specifically for the Claude 4 alignment assessment, so that it can discover implanted test behaviors. According to the researchers, the investigator agent successfully found the root cause of the issues 10-13% of the time. Still, this performance improved to 42% “when using a super-agent approach that aggregates findings across multiple investigations.” The next step in most alignment audits is typically a behavioral evaluation, which determines whether the model consistently misbehaves in various settings. To test its evaluation agent, the researchers designed 10 models “given system prompt instructions to exhibit a specific behavior, such as excessive deference to the user.” They ran the agent five times per model and saw that the agent correctly finds and flags at least one quirk of the model. However, the agent sometimes failed to identify specific quirks consistently. It had trouble evaluating subtle quirks, such as self-promotion and research-sandbagging, as well as quirks that are difficult to elicit, like the Hardcode Test Cases quirk. The last test and agent concern behavioral red-teaming to find the prompts that elicit “concerning” behaviors. The breadth-first red-teaming agent converses with the target model (in Anthropic’s case, it was with Claude Opus 4), and this chat is then rated for alignment-relevant properties. The agent identified seven of the ten system quirks, but it also struggled with the same issues as the evaluator agent.
Fintech Pipe’s AI agent reviews flagged credit applications using live revenue and transaction data to distinguish genuine risk from input errors, allowing up to 90% of applicants to receive decisions in minutes
Pipe, a FinTech company specializing in embedded capital products for small businesses, has launched four new AI agents to automate key operational workflows and accelerate its global expansion. The AI-native approach eliminates traditional credit score assessments and personal guarantees, leveraging live revenue and transaction data to optimize capital offers. The agents streamline tasks across fraud detection, compliance, customer engagement, payments, and treasury operations, enhancing partner and customer experiences and accelerating platform growth without increasing headcount. The four agents include: Fraud and Compliance Agent, which reviews flagged applications using business data to distinguish genuine risk from input errors, allowing up to 90% of applicants to receive decisions in minutes. Recovery Agent, which analyses business operations and payment statuses to guide the most effective strategy for resuming payments. Sales Agent, which supports applicants around the clock and re-engages businesses that abandoned applications. Treasury Agent, which provides real-time liquidity insights by monitoring global cash positions and macroeconomic indicators to guide investment and capital deployment decisions.
Leena AI’s voice-enabled AI colleagues can speak out loud and listen using conversational language and natural voice communication in the workplace, acting as a go-between for situations where text might be cumbersome or difficult
Leena AI, the developer of an employee-facing agentic AI assistant, launched what it’s calling AI colleagues, which can speak out loud and listen using natural, conversational language in the workplace. Its AI agents can work and interact just like human employees can and provide support across numerous work fields, including information technology, human resources, finance, marketing, sales and procurement. By using natural voice communication, these agents allow workers to get work done faster than before. AI acts as a go-between for situations where text might be cumbersome or difficult, becoming more of a collaborator than a widget trapped within a screen. Already 35 % of all interactions with Leena are on voice and the average time of session is seven and a half minutes. The AI agent could use Salesforce integration to access and complete the necessary forms. Next, the agent could scour meeting notes and look up information about the “deep tech information” requested, prepare an email and show it to the user before sending it to the tech team. The AI colleagues are powered by agentic AI, meaning that they can complete tasks with little or no human input once they’re set to a goal. Although they always seek employee approval before taking action, they ensure a user-friendly experience by keeping a human in the loop before taking any critical action. They also have 24/7 availability, all year round.
Google’s vibe-coding tool lets users create mini web apps using text prompts or by remixing existing apps available in a gallery and see the visual workflow of input, output, and generation steps
Google is testing a vibe-coding tool called Opal, available to users in the U.S. through Google Labs, which the company uses as a base to experiment with new tech. Opal lets you create mini web apps using text prompts, or you can remix existing apps available in a gallery. All users have to do is enter a description of the app they want to make, and the tool will then use different Google models to do so. Once the app is ready, you can navigate into an editor panel to see the visual workflow of input, output, and generation steps. You can click on each workflow step to look at the prompt that dictates the process, and edit it if you need to. You also can manually add steps from Opal’s toolbar. Opal also lets users publish their new app on the web and share the link with others to test out using their own Google accounts. Google’s AI studio already lets developers build apps using prompts, but Opal’s visual workflow indicates the company likely wants to target a wider audience.
