Amazon Bedrock has announced the general availability of Intelligent Prompt Routing, a serverless endpoint that efficiently routes requests between different foundation models within the same model family. The system dynamically predicts each candidate model’s response quality for an incoming request and routes the request to the model it determines is most appropriate based on cost and response quality. It incorporates state-of-the-art methods for training routers across different sets of models, tasks, and prompts. Users can rely on the default prompt routers provided by Amazon Bedrock or configure their own to adjust the cost-performance balance between two candidate LLMs. Amazon has reduced the overhead of the added routing components by over 20%, to approximately 85 ms (P90), resulting in an overall latency and cost benefit compared to always hitting the larger, more expensive model. Amazon Bedrock has conducted internal tests with proprietary and public data to evaluate the system’s performance metrics.
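The router is exposed through the standard Bedrock Runtime APIs, so calling it looks like calling any other model. The sketch below is a minimal, unofficial example using the boto3 Converse API; the router ARN is a placeholder, so substitute the default or custom prompt router ARN from your own account and region.

```python
import boto3

# Minimal sketch: invoking an Amazon Bedrock prompt router through the Converse API.
# The ARN below is a placeholder, not a real router.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

ROUTER_ARN = "arn:aws:bedrock:us-east-1:111122223333:default-prompt-router/EXAMPLE"

response = client.converse(
    modelId=ROUTER_ARN,  # the router ARN is passed where a model ID normally goes
    messages=[{"role": "user", "content": [{"text": "Summarize our Q3 results in three bullets."}]}],
)

# The router forwards the request to whichever candidate model it predicts will
# answer well at lower cost; the reply arrives in the usual Converse response shape.
print(response["output"]["message"]["content"][0]["text"])
```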
Codacy’s solution integrates directly with AI coding assistants via its MCP server to enforce coding standards, flagging or fixing issues in real time
Codacy, a provider of automated code quality and security solutions, has launched Codacy Guardrails, a new product designed to bring real-time security, compliance, and quality enforcement to AI-generated code. Codacy describes Guardrails as the first solution of its kind: it integrates directly with AI coding assistants to enforce coding standards and checks AI-generated code before it ever reaches the developer, preventing non-compliant code from being generated in the first place. Built on Codacy’s SOC 2-compliant platform, Codacy Guardrails empowers teams to define their own secure development policies and apply them across every AI-generated prompt. With Codacy Guardrails, AI-assisted tools gain full access to the security and quality context of a team’s codebase. At the core of the product is the Codacy MCP server, which connects development environments to the organization’s code standards. This gives LLMs the ability to reason about policies, flag or fix issues in real time, and deliver code that’s compliant by default. Guardrails integrates with popular IDEs like Cursor AI and Windsurf, as well as VSCode and IntelliJ through Codacy’s plugin, allowing developers to apply guardrails directly within their existing workflows.
Docker to simplify AI software delivery by containerizing MCP servers along with offering an enterprise-ready toolkit and a centralized platform to discover and manage them from a catalog of 100+ servers
Software containerization company Docker is launching the Docker MCP Catalog and Docker MCP Toolkit, which bring more of the AI workflow into the existing Docker developer experience and simplify AI software delivery. The new offerings are based on the emerging Model Context Protocol standard created by Docker’s partner Anthropic PBC. Docker argues that the simplest way to use MCP to improve LLMs is to containerize the MCP servers themselves. To that end, it offers tools such as Docker Desktop for building, testing, and running MCP servers, Docker Hub to distribute their container images, and Docker Scout to ensure they’re secure. By packaging MCP servers as containers, developers avoid the hassles of installing dependencies and configuring runtime environments. The Docker MCP Catalog, integrated within Docker Hub, is a centralized way for developers to discover, run, and manage MCP servers, while the Docker MCP Toolkit offers “enterprise-ready tooling” for putting AI applications to work. At launch, more than 100 MCP servers are available within the Docker MCP Catalog. Docker President and Chief Operating Officer Mark Cavage explained that “The Docker MCP Catalog brings that all together in one place, a trusted, developer-friendly experience within Docker Hub, where tools are verified, secure, and easy to run.”
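Once an MCP server ships as an image, an MCP-capable client can start it with a plain docker run and talk to it over stdio. The sketch below is illustrative only: it uses the MCP Python SDK, and the image name is a placeholder rather than a specific catalog entry.

```python
import asyncio

# Illustrative sketch: launching a containerized MCP server and listing its tools
# via the MCP Python SDK ("mcp" package). "mcp/example-server" is a placeholder image.
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(
    command="docker",
    args=["run", "-i", "--rm", "mcp/example-server"],  # the container bundles the runtime and dependencies
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```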
Amazon’s new benchmark to evaluate AI coding agents’ ability to navigate and understand complex codebases and resolve GitHub issues
Amazon has introduced SWE-PolyBench, the first industry benchmark to evaluate AI coding agents’ ability to navigate and understand complex codebases across multiple programming languages. It builds on SWE-Bench, the benchmark that measures system performance on resolving GitHub issues, has spurred the development of capable coding agents, and has become the de-facto standard for coding agent benchmarking. SWE-PolyBench contains over 2,000 curated issues in four languages and a stratified subset of 500 issues for rapid experimentation. The benchmark aims to advance AI performance in real-world scenarios. Key features of SWE-PolyBench at a glance:
- Multi-language support: Java (165 tasks), JavaScript (1,017 tasks), TypeScript (729 tasks), and Python (199 tasks).
- Extensive dataset: 2,110 instances drawn from 21 repositories ranging from web frameworks to code editors and ML tools, on the same scale as the full SWE-Bench but spanning more repositories.
- Task variety: bug fixes, feature requests, and code refactoring.
- Faster experimentation: SWE-PolyBench500, a stratified subset for efficient experimentation.
- Leaderboard: a leaderboard with a rich set of metrics for transparent benchmarking.
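For teams that want to try the benchmark locally, a typical workflow is to pull the dataset and slice it by language. The snippet below is a sketch under assumptions: the Hugging Face dataset id, split name, and the `language` field are guesses, so check the official SWE-PolyBench release for the exact identifiers.

```python
from collections import Counter

from datasets import load_dataset

# Sketch only: dataset id, split, and field names are assumptions, not confirmed
# identifiers from the SWE-PolyBench release.
ds = load_dataset("AmazonScience/SWE-PolyBench", split="test")

# Tally tasks per language to mirror the breakdown quoted above
# (Java, JavaScript, TypeScript, Python).
print(Counter(example["language"] for example in ds))
```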
New system for training AI agents focuses on multi-turn, interactive settings where agents must adapt, remember, and reason in the face of uncertainty instead of static tasks like math solving or code generation
A collaborative team from Northwestern University, Microsoft, Stanford, and the University of Washington — including a former DeepSeek researcher named Zihan Wang, currently completing a computer science PhD at Northwestern — has introduced RAGEN, a new system for training and evaluating AI agents that they hope makes them more reliable and less brittle for real-world, enterprise-grade usage. Unlike static tasks like math solving or code generation, RAGEN focuses on multi-turn, interactive settings where agents must adapt, remember, and reason in the face of uncertainty. Built on a custom RL framework called StarPO (State-Thinking-Actions-Reward Policy Optimization), the system explores how LLMs can learn through experience rather than memorization. StarPO operates in two interleaved phases: a rollout stage, where the LLM generates complete interaction sequences guided by reasoning, and an update stage, where the model is optimized using normalized cumulative rewards. This structure supports a more stable and interpretable learning loop than standard policy optimization approaches. A stabilized variant, StarPO-S, incorporates three key interventions: uncertainty-based rollout filtering, KL penalty removal, and asymmetric PPO clipping. The team identified three dimensions that significantly impact training: task diversity, interaction granularity, and rollout freshness. Together, these factors make the training process more stable and effective.
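A rough way to picture StarPO’s two interleaved phases is the loop below. This is a conceptual sketch rather than RAGEN’s actual code: `policy`, `env`, and their methods are hypothetical stand-ins, and the update step simply normalizes trajectory-level returns before a policy-gradient style update.

```python
import numpy as np

def rollout(policy, env, max_turns=8):
    """Rollout phase: the LLM produces a complete, reasoning-guided interaction sequence."""
    trajectory, rewards = [], []
    obs = env.reset()
    for _ in range(max_turns):
        thought, action = policy.act(obs)      # hypothetical: reason, then act
        obs, reward, done = env.step(action)
        trajectory.append((thought, action, obs))
        rewards.append(reward)
        if done:
            break
    return trajectory, rewards

def update(policy, batch):
    """Update phase: optimize on normalized cumulative (trajectory-level) rewards."""
    returns = np.array([sum(rewards) for _, rewards in batch], dtype=float)
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    for (trajectory, _), advantage in zip(batch, advantages):
        policy.reinforce(trajectory, advantage)  # hypothetical PPO-style update step

def train(policy, env, iterations=100, batch_size=16):
    # Interleave the phases, regenerating rollouts each iteration ("rollout freshness").
    for _ in range(iterations):
        batch = [rollout(policy, env) for _ in range(batch_size)]
        update(policy, batch)
```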
OpenAI is planning a truly ‘open reasoning’ AI system with a ‘handoff’ feature that would enable it to make calls to the OpenAI API to access other, larger models for a substantial computational lift
OpenAI is gearing up to release an AI system that’s truly “open,” meaning it’ll be available for download at no cost and not gated behind an API. Beyond its benchmark performance, OpenAI may have a key feature up its sleeve — one that could make its open “reasoning” model highly competitive. Company leaders have been discussing plans to enable the open model to connect to OpenAI’s cloud-hosted models to better answer complex queries. OpenAI CEO Sam Altman described the capability as a “handoff.” If the feature — as sources describe it — makes it into the open model, it will be able to make calls to the OpenAI API to access the company’s other, larger models for a substantial computational lift. It’s unclear if the open model will have the ability to access some of the many tools OpenAI’s models can use, like web search and image generation. The idea for the handoff feature was suggested by a developer during one of OpenAI’s recent developer forums, according to a source. The suggestion appears to have gained traction within the company. OpenAI has been hosting a series of community feedback events with developers to help shape its upcoming open model release. A local model that can tap into more powerful cloud systems brings to mind Apple Intelligence, Apple’s suite of AI capabilities that uses a combination of on-device models and models running in “private” data centers. OpenAI stands to benefit in obvious ways. Beyond generating incremental revenue, a handoff could rope more members of the open source community into the company’s premium ecosystem.
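Since the model and the handoff feature are unreleased, any code is necessarily speculative. The sketch below only illustrates the general pattern the article describes, a local model escalating to a cloud-hosted one, using the existing OpenAI Python client; `local_generate`, the confidence score, and the threshold are hypothetical.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def local_generate(prompt: str) -> tuple[str, float]:
    """Hypothetical stand-in for an open-weights model running locally.

    Returns a draft answer plus a confidence score in [0, 1].
    """
    raise NotImplementedError("placeholder for a locally hosted open model")

def answer(prompt: str, threshold: float = 0.7) -> str:
    try:
        draft, confidence = local_generate(prompt)
        if confidence >= threshold:
            return draft
    except NotImplementedError:
        pass
    # Hand off to a larger cloud-hosted model for the heavier computational lift.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```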
UiPath’s agentic AI platform to utilize Redis semantic routing technology that would enable AI agents to leverage the best LLM or LLM provider depending on the context, intent, and use case the customer is trying to solve
Data platform Redis and UiPath have expanded their collaboration to further agentic automation solutions for customers. By extending their partnership, Redis and UiPath will explore ways to leverage the Redis vector database, Semantic Caching, and Semantic Routing to support UiPath Agent Builder, a secure, simple way to build, test, and launch agents and the agentic automations they execute. With Redis powering these solutions, UiPath agents will understand the meaning behind user queries, making data access faster and system responses smarter and delivering greater speed and cost efficiency to enterprise developers looking to take advantage of automation. Additionally, through semantic routing, UiPath agents will be able to leverage the best LLM or LLM provider depending on the context, intent, and use case the customer is trying to solve. UiPath Agent Builder builds on the RPA capabilities and orchestration of UiPath Automation Suite and Orchestrator to deliver unmatched agentic capabilities. Agent Builder will utilize a sophisticated memory architecture that enables agents to retrieve relevant information only from permissioned, governed knowledge bases and maintain context across planning and execution. This architecture will enable developers to create, customize, evaluate, and deploy specialized enterprise agents that can understand context, make decisions, and execute complex processes while maintaining enterprise-grade security and governance.
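In broad strokes, semantic routing embeds an incoming query and sends it to whichever route (here, an LLM or provider) its reference examples most resemble. The sketch below shows only that idea: `embed` is a hypothetical embedding call, the route names are made up, and in the Redis-backed setup the reference vectors would be stored and searched in the Redis vector database rather than held in memory.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("placeholder for an embedding model call")

# Hypothetical routes: each LLM/provider is described by a few reference utterances.
ROUTES = {
    "code-assistant-llm": ["fix this stack trace", "write a unit test for this function"],
    "document-qa-llm": ["summarize this contract", "what does clause 4 say"],
}

def route(query: str) -> str:
    """Return the route whose reference examples are closest to the query (cosine similarity)."""
    q = embed(query)
    best_route, best_score = None, -1.0
    for name, examples in ROUTES.items():
        for example in examples:
            v = embed(example)
            score = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            if score > best_score:
                best_route, best_score = name, score
    return best_route
```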
Microsoft releases a taxonomy of security and safety failure modes inherent to agentic architectures, distinguishing novel modes unique to agentic systems (e.g. agent compromise) from modes representing amplification of existing GenAI risks (e.g. bias amplification)
Microsoft’s AI Red Team has published a detailed taxonomy addressing the failure modes inherent to agentic architectures. Agentic AI systems are autonomous entities that observe and act upon their environment to achieve predefined objectives. These systems integrate capabilities such as autonomy, environment observation, interaction, memory, and collaboration. However, these features introduce a broader attack surface and new safety concerns. The report distinguishes between novel failure modes unique to agentic systems and amplification of risks already observed in generative AI contexts. Microsoft categorizes failure modes across security and safety dimensions:
- Novel security failures: agent compromise, agent injection, agent impersonation, agent flow manipulation, and multi-agent jailbreaks.
- Novel safety failures: intra-agent Responsible AI (RAI) concerns, biases in resource allocation among multiple users, organizational knowledge degradation, and prioritization risks impacting user safety.
- Existing security failures: memory poisoning, cross-domain prompt injection (XPIA), human-in-the-loop bypass vulnerabilities, incorrect permissions management, and insufficient isolation.
- Existing safety failures: bias amplification, hallucinations, misinterpretation of instructions, and a lack of sufficient transparency for meaningful user consent.
NeuroBlade’s Analytics Accelerator is purpose-built hardware designed to handle modern database workloads, delivering 4x faster performance than leading vectorized CPU implementations
As Elad Sity, CEO and cofounder of NeuroBlade, noted, “while the industry has long relied on CPUs for data preparation, they’ve become a bottleneck — consuming well over 30 percent of the AI pipeline.” NeuroBlade, the Israeli semiconductor startup Sity cofounded, believes the answer lies in a new category of hardware specifically designed to accelerate data analytics. Its Analytics Accelerator isn’t just a faster CPU — it’s a fundamentally different architecture purpose-built to handle modern database workloads. NeuroBlade’s Accelerator unlocks the full potential of data analytics platforms by dramatically boosting performance and reducing query times. By offloading operations from the CPU to purpose-built hardware — a process known as pushdown — it increases the compute power of each server, enabling faster processing of large datasets with smaller clusters compared to CPU-only deployments. Purpose-built hardware that boosts each server’s compute power for analytics reduces the need for massive clusters and helps avoid bottlenecks like network overhead, power constraints, and operational complexity. In TPC-H benchmarks — a standard for evaluating decision support systems — Sity noted that the NeuroBlade Accelerator delivers about 4x faster performance than leading vectorized CPU implementations such as Presto-Velox. NeuroBlade’s pitch is that by offloading analytics from CPUs and handing them to dedicated silicon, enterprises can achieve better performance with a fraction of the infrastructure — lowering costs, energy draw, and complexity in one move.
Bloomberg’s research reveals Retrieval-Augmented Generation (RAG) can produce unsafe responses; future designs must integrate safety systems that specifically anticipate how retrieved content might interact with model safeguards
According to surprising new research published by Bloomberg, retrieval-augmented generation (RAG) can potentially make large language models (LLMs) unsafe. Bloomberg’s paper, ‘RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models,’ evaluated 11 popular LLMs including Claude-3.5-Sonnet, Llama-3-8B, and GPT-4o. The findings contradict conventional wisdom that RAG inherently makes AI systems safer. The Bloomberg research team discovered that when using RAG, models that typically refuse harmful queries in standard settings often produce unsafe responses. For example, Llama-3-8B’s unsafe responses jumped from 0.3% to 9.2% when RAG was implemented. Alongside the RAG research, Bloomberg released a second paper, ‘Understanding and Mitigating Risks of Generative AI in Financial Services,’ which introduces a specialized AI content risk taxonomy for financial services that addresses domain-specific concerns not covered by general-purpose safety approaches. Together, the research challenges the widespread assumption that RAG enhances AI safety and demonstrates how existing guardrail systems fail to address domain-specific risks in financial services applications. For enterprises looking to lead the way in AI, Bloomberg’s research means that RAG implementations require a fundamental rethinking of safety architecture. Leaders must move beyond viewing guardrails and RAG as separate components and instead design integrated safety systems that specifically anticipate how retrieved content might interact with model safeguards. Industry-leading organizations will need to develop domain-specific risk taxonomies tailored to their regulatory environments, shifting from generic AI safety frameworks to those that address specific business concerns.
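The paper’s headline numbers come from comparing the same harmful prompts with and without retrieved context. The sketch below shows that evaluation pattern only; `generate`, `retrieve`, and `is_unsafe` are hypothetical placeholders (an LLM call, a retrieval lookup, and a safety judge), not Bloomberg’s harness.

```python
def unsafe_rate(prompts, generate, is_unsafe, retrieve=None):
    """Fraction of prompts that yield an unsafe reply, with or without retrieved context."""
    unsafe = 0
    for prompt in prompts:
        context = retrieve(prompt) if retrieve else ""
        reply = generate(prompt=prompt, context=context)
        if is_unsafe(reply):
            unsafe += 1
    return unsafe / len(prompts)

# Comparing the two settings surfaces regressions like the 0.3% -> 9.2% jump
# reported above for Llama-3-8B:
#   baseline = unsafe_rate(prompts, generate, is_unsafe=judge)
#   with_rag = unsafe_rate(prompts, generate, is_unsafe=judge, retrieve=vector_lookup)
```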