Amazon has introduced SWE-PolyBench, the first industry benchmark to evaluate AI coding agents’ ability to navigate and understand complex codebases. Benchmarks that measure system performance on GitHub issues have spurred the development of capable coding agents and become the de-facto standard for coding agent evaluation. SWE-PolyBench contains over 2,000 curated issues in four languages, plus a stratified subset of 500 issues for rapid experimentation. The benchmark aims to advance AI performance in real-world scenarios. Key features of SWE-PolyBench at a glance: Multi-Language Support: Java (165 tasks), JavaScript (1,017 tasks), TypeScript (729 tasks), and Python (199 tasks). Extensive Dataset: 2,110 instances from 21 repositories, ranging from web frameworks to code editors and ML tools, on the same scale as the full SWE-Bench but drawn from more repositories. Task Variety: bug fixes, feature requests, and code refactoring. Faster Experimentation: SWE-PolyBench500, a stratified subset for efficient experimentation. Leaderboard: a leaderboard with a rich set of metrics for transparent benchmarking.
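The stratified-subset idea can be sketched in a few lines: a hypothetical sampler that draws a fixed-size subset while preserving each language’s share of the full task pool. The helper names and toy task records below are invented for illustration and are not part of SWE-PolyBench’s actual tooling.

```python
import random
from collections import defaultdict

def stratified_subset(tasks, key, n_total, seed=0):
    """Draw a subset that preserves each stratum's share of the full set."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for task in tasks:
        strata[key(task)].append(task)
    subset = []
    for items in strata.values():
        # Allocate slots proportionally to the stratum's size.
        k = max(1, round(n_total * len(items) / len(tasks)))
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset

# Toy corpus mirroring SWE-PolyBench's reported language mix.
tasks = ([{"lang": "Java"}] * 165 + [{"lang": "JavaScript"}] * 1017 +
         [{"lang": "TypeScript"}] * 729 + [{"lang": "Python"}] * 199)
subset = stratified_subset(tasks, key=lambda t: t["lang"], n_total=500)
```

A subset drawn this way keeps the language proportions of the full benchmark, which is what makes results on the 500-issue slice predictive of results on the whole set.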
New system for training AI agents focuses on multi-turn, interactive settings, where agents must adapt, remember, and reason in the face of uncertainty, rather than on static tasks like math solving or code generation
A collaborative team from Northwestern University, Microsoft, Stanford, and the University of Washington, including former DeepSeek researcher Zihan Wang, currently completing a computer science PhD at Northwestern, has introduced RAGEN, a new system for training and evaluating AI agents that they hope makes them more reliable and less brittle for real-world, enterprise-grade usage. Unlike static tasks such as math solving or code generation, RAGEN focuses on multi-turn, interactive settings where agents must adapt, remember, and reason in the face of uncertainty. Built on a custom RL framework called StarPO (State-Thinking-Actions-Reward Policy Optimization), the system explores how LLMs can learn through experience rather than memorization. StarPO operates in two interleaved phases: a rollout stage, where the LLM generates complete interaction sequences guided by reasoning, and an update stage, where the model is optimized using normalized cumulative rewards. This structure supports a more stable and interpretable learning loop than standard policy optimization approaches. A stabilized variant, StarPO-S, incorporates three key interventions: uncertainty-based rollout filtering, KL penalty removal, and asymmetric PPO clipping. The team identified three dimensions that significantly impact training: task diversity, interaction granularity, and rollout freshness. Together, these factors make the training process more stable and effective.
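The interleaved rollout/update loop can be illustrated with a self-contained toy. The four-state environment, tabular "policy", and additive update rule below are stand-ins invented for illustration (RAGEN trains LLMs, not lookup tables), but they follow the same pattern: generate complete interaction sequences, then reinforce them with normalized cumulative rewards.

```python
import random

def rollout(policy, rng, horizon=4):
    """Rollout stage: generate a full interaction sequence plus its reward."""
    state, traj = 0, []
    for _ in range(horizon):
        # Action 0 = stay, action 1 = advance toward the goal state 3.
        action = rng.choices([0, 1], weights=policy[state])[0]
        traj.append((state, action))
        state = min(state + action, 3)
    return traj, float(state == 3)  # sparse terminal reward

def update(policy, batch, lr=0.1):
    """Update stage: reinforce actions weighted by normalized cumulative reward."""
    rewards = [r for _, r in batch]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    for traj, reward in batch:
        advantage = (reward - mean) / std
        for state, action in traj:
            policy[state][action] += lr * advantage
            policy[state] = [max(p, 1e-3) for p in policy[state]]  # keep weights positive

rng = random.Random(0)
policy = [[1.0, 1.0] for _ in range(4)]  # unnormalized action weights per state
for _ in range(200):                     # interleave the two phases
    batch = [rollout(policy, rng) for _ in range(16)]
    update(policy, batch)
```

After a few hundred iterations, the "advance" action dominates in the early states: trajectories that reach the goal receive above-average reward, so their actions are reinforced, while failed trajectories are pushed down.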
OpenAI is planning a truly ‘open reasoning’ AI system with a ‘handoff’ feature that would enable it to make calls to the OpenAI API to access other, larger models for a substantial computational lift
OpenAI is gearing up to release an AI system that’s truly “open,” meaning it’ll be available for download at no cost and not gated behind an API. Beyond its benchmark performance, OpenAI may have a key feature up its sleeve — one that could make its open “reasoning” model highly competitive. Company leaders have been discussing plans to enable the open model to connect to OpenAI’s cloud-hosted models to better answer complex queries. OpenAI CEO Sam Altman described the capability as a “handoff.” If the feature — as sources describe it — makes it into the open model, it will be able to make calls to the OpenAI API to access the company’s other, larger models for a substantial computational lift. It’s unclear if the open model will have the ability to access some of the many tools OpenAI’s models can use, like web search and image generation. The idea for the handoff feature was suggested by a developer during one of OpenAI’s recent developer forums, according to a source. The suggestion appears to have gained traction within the company. OpenAI has been hosting a series of community feedback events with developers to help shape its upcoming open model release. A local model that can tap into more powerful cloud systems brings to mind Apple Intelligence, Apple’s suite of AI capabilities that uses a combination of on-device models and models running in “private” data centers. OpenAI stands to benefit in obvious ways. Beyond generating incremental revenue, a handoff could rope more members of the open source community into the company’s premium ecosystem.
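OpenAI has not published how the handoff would work, so the following is purely a speculative sketch: a local model that escalates to a stubbed cloud call whenever its self-reported confidence falls below a threshold. Every function name, the confidence mechanism, and the threshold are invented for illustration; no real OpenAI API is called.

```python
def local_model(query):
    """Stand-in for an on-device model that also reports self-confidence."""
    known = {"capital of France": ("Paris", 0.97)}
    for key, (answer_text, confidence) in known.items():
        if key in query:
            return answer_text, confidence
    return "I'm not sure.", 0.2  # low confidence on unfamiliar queries

def cloud_model(query):
    """Stand-in for a call to a larger cloud-hosted model over an API."""
    return f"[cloud answer for: {query}]"

def answer(query, handoff_threshold=0.5):
    reply, confidence = local_model(query)
    if confidence < handoff_threshold:
        return cloud_model(query)  # hand off the heavy computational lift
    return reply

print(answer("What is the capital of France?"))  # handled locally: Paris
print(answer("Derive the quartic formula."))     # escalated to the cloud stub
```

The design question such a feature raises is exactly where the threshold lives: in the local model’s own self-assessment (as sketched here), in a separate router, or in the user’s explicit choice.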
UiPath’s agentic AI platform to utilize Redis semantic routing tech that would enable AI agents to leverage the best LLM or LLM provider for the context, intent, and use case the customer is trying to solve
Data platform Redis and UiPath have expanded their collaboration to further agentic automation solutions for customers. By extending their partnership, Redis and UiPath will explore ways to leverage the Redis vector database, Semantic Caching, and Semantic Routing to support UiPath Agent Builder, a secure, simple way to build, test, and launch agents and the agentic automations they execute. With Redis powering these solutions, UiPath agents will understand the meaning behind user queries, making data access faster and system responses smarter, which delivers greater speed and cost efficiency to enterprise developers looking to take advantage of automation. Additionally, via semantic routing, UiPath agents will be able to leverage the best LLM or LLM provider for the context, intent, and use case the customer is trying to solve. UiPath Agent Builder builds on the RPA capabilities and orchestration of UiPath Automation Suite and Orchestrator to deliver unmatched agentic capabilities. Agent Builder will use a sophisticated memory architecture that enables agents to retrieve relevant information only from permissioned, governed knowledge bases and to maintain context across planning and execution. This architecture will enable developers to create, customize, evaluate, and deploy specialized enterprise agents that can understand context, make decisions, and execute complex processes while maintaining enterprise-grade security and governance.
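Semantic routing, in general terms, embeds the incoming query, compares it against embeddings of each route’s description, and sends the query to the closest backend. Below is a minimal sketch using a toy bag-of-words embedding and cosine similarity; the route names and descriptions are invented, and Redis’s actual implementation uses its vector database rather than this in-memory approximation.

```python
import math

def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Each route maps an LLM backend to a description of queries it handles best.
ROUTES = {
    "code-llm": "write debug refactor code function bug python",
    "chat-llm": "chat greet summarize email conversation message",
    "sql-llm":  "query table join database sql rows",
}

def route(query):
    """Pick the backend whose route description is semantically closest."""
    scores = {name: cosine(embed(query), embed(desc)) for name, desc in ROUTES.items()}
    return max(scores, key=scores.get)

print(route("debug this python function"))  # -> code-llm
```

The same pattern generalizes directly: swap the toy embedding for real vector embeddings stored in a vector database, and the routing decision becomes a nearest-neighbor lookup.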
Microsoft releases a taxonomy of security and safety failure modes inherent to agentic architectures, spanning novel modes unique to agentic systems (e.g. agent compromise) and modes that amplify existing GenAI risks (e.g. bias amplification)
Microsoft’s AI Red Team has published a detailed taxonomy addressing the failure modes inherent to agentic architectures. Agentic AI systems are autonomous entities that observe and act upon their environment to achieve predefined objectives. These systems integrate capabilities such as autonomy, environment observation, interaction, memory, and collaboration. However, these features introduce a broader attack surface and new safety concerns. The report distinguishes between novel failure modes unique to agentic systems and amplification of risks already observed in generative AI contexts. Microsoft categorizes failure modes across security and safety dimensions. Novel Security Failures: Including agent compromise, agent injection, agent impersonation, agent flow manipulation, and multi-agent jailbreaks. Novel Safety Failures: Covering issues such as intra-agent Responsible AI (RAI) concerns, biases in resource allocation among multiple users, organizational knowledge degradation, and prioritization risks impacting user safety. Existing Security Failures: Encompassing memory poisoning, cross-domain prompt injection (XPIA), human-in-the-loop bypass vulnerabilities, incorrect permissions management, and insufficient isolation. Existing Safety Failures: Highlighting risks like bias amplification, hallucinations, misinterpretation of instructions, and a lack of sufficient transparency for meaningful user consent.
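The taxonomy factors cleanly along two axes: novelty (novel vs. existing) and dimension (security vs. safety). A minimal encoding makes that structure explicit; the dictionary layout below is just one convenient representation of the categories listed above, not an artifact from Microsoft’s report.

```python
# Failure modes from the report, keyed by (novelty, dimension).
TAXONOMY = {
    ("novel", "security"): [
        "agent compromise", "agent injection", "agent impersonation",
        "agent flow manipulation", "multi-agent jailbreaks",
    ],
    ("novel", "safety"): [
        "intra-agent RAI concerns", "bias in resource allocation among users",
        "organizational knowledge degradation", "prioritization risks to user safety",
    ],
    ("existing", "security"): [
        "memory poisoning", "cross-domain prompt injection (XPIA)",
        "human-in-the-loop bypass", "incorrect permissions management",
        "insufficient isolation",
    ],
    ("existing", "safety"): [
        "bias amplification", "hallucinations",
        "misinterpretation of instructions", "insufficient transparency for consent",
    ],
}

def modes(novelty=None, dimension=None):
    """Filter failure modes by novelty ('novel'/'existing') and dimension."""
    return [m for (n, d), ms in TAXONOMY.items()
            for m in ms
            if novelty in (None, n) and dimension in (None, d)]

print(modes(novelty="novel", dimension="security"))
```

Structuring the taxonomy this way is what lets a red team turn it into a checklist: filter by dimension when scoping a security review, or by novelty when deciding which mitigations carry over from existing GenAI programs.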
NeuroBlade’s Analytics Accelerator is purpose-built hardware designed to handle modern database workloads, delivering 4x faster performance than leading vectorized CPU implementations
As Elad Sity, CEO and cofounder of NeuroBlade, noted, “while the industry has long relied on CPUs for data preparation, they’ve become a bottleneck — consuming well over 30 percent of the AI pipeline.” NeuroBlade, the Israeli semiconductor startup Sity cofounded, believes the answer lies in a new category of hardware specifically designed to accelerate data analytics. Its Analytics Accelerator isn’t just a faster CPU; it’s a fundamentally different architecture purpose-built to handle modern database workloads. NeuroBlade’s Accelerator unlocks the full potential of data analytics platforms by dramatically boosting performance and reducing query times. By offloading operations from the CPU to purpose-built hardware, a process known as pushdown, it increases the compute power of each server, enabling faster processing of large datasets with smaller clusters than CPU-only deployments. Purpose-built hardware that boosts each server’s compute power for analytics reduces the need for massive clusters and helps avoid bottlenecks like network overhead, power constraints, and operational complexity. In TPC-H benchmarks, a standard for evaluating decision support systems, Sity noted that the NeuroBlade Accelerator delivers about 4x faster performance than leading vectorized CPU implementations such as Presto-Velox. NeuroBlade’s pitch is that by offloading analytics from CPUs to dedicated silicon, enterprises can achieve better performance with a fraction of the infrastructure, lowering costs, energy draw, and complexity in one move.
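The pushdown idea can be made concrete with a toy comparison: shipping every row to the query engine and filtering there, versus evaluating the predicate where the data lives and shipping only the matches. The row counts below stand in for interconnect traffic; this illustrates pushdown in general, not NeuroBlade’s hardware.

```python
# Toy table: 10,000 sales rows living on the "storage/accelerator" side.
ROWS = [{"id": i, "region": "EU" if i % 3 == 0 else "US", "amount": i}
        for i in range(10_000)]

def scan_without_pushdown(predicate):
    """Every row crosses the interconnect; the engine filters afterward."""
    shipped = list(ROWS)
    return [r for r in shipped if predicate(r)], len(shipped)

def scan_with_pushdown(predicate):
    """The predicate runs where the data lives; only matches are shipped."""
    matched = [r for r in ROWS if predicate(r)]
    return matched, len(matched)

pred = lambda r: r["region"] == "EU" and r["amount"] > 5000
_, moved_plain = scan_without_pushdown(pred)
_, moved_push = scan_with_pushdown(pred)
print(moved_plain, moved_push)  # pushdown moves only the matching rows
```

The gap between the two counts is the traffic (and CPU time) a pushdown architecture eliminates; on selective queries over large tables the ratio can be dramatic, which is the premise behind dedicating silicon to the filtering step.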
Bloomberg’s research reveals Retrieval-Augmented Generation (RAG) can produce unsafe responses; future designs must integrate safety systems that specifically anticipate how retrieved content might interact with model safeguards
According to surprising new research published by Bloomberg, RAG can potentially make large language models (LLMs) unsafe. Bloomberg’s paper, ‘RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models,’ evaluated 11 popular LLMs including Claude-3.5-Sonnet, Llama-3-8B and GPT-4o. The findings contradict conventional wisdom that RAG inherently makes AI systems safer. The Bloomberg research team discovered that when using RAG, models that typically refuse harmful queries in standard settings often produce unsafe responses. For example, Llama-3-8B’s unsafe responses jumped from 0.3% to 9.2% when RAG was implemented. Alongside the RAG research, Bloomberg released a second paper, ‘Understanding and Mitigating Risks of Generative AI in Financial Services,’ which introduces a specialized AI content risk taxonomy for financial services, addressing domain-specific concerns not covered by general-purpose safety approaches. The research challenges widespread assumptions that RAG enhances AI safety, while demonstrating how existing guardrail systems fail to address domain-specific risks in financial services applications. For enterprises looking to lead the way in AI, Bloomberg’s research means that RAG implementations require a fundamental rethinking of safety architecture. Leaders must move beyond viewing guardrails and RAG as separate components and instead design integrated safety systems that specifically anticipate how retrieved content might interact with model safeguards. Industry-leading organizations will need to develop domain-specific risk taxonomies tailored to their regulatory environments, shifting from generic AI safety frameworks to those that address specific business concerns.
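An evaluation of this kind can be sketched as a small harness that compares unsafe-response rates with and without retrieved context. Everything below is a stub invented to show the shape of the measurement (the model, the prompt set, and the unsafe signal); Bloomberg’s actual methodology uses real models and LLM-based safety judges.

```python
HARMFUL_PROMPT_IDS = list(range(100))  # stand-ins for 100 harmful test prompts

def model(prompt_id, context=None):
    # Stub behaviour: the bare model always refuses; with a retrieved context
    # prepended, a small fraction of prompts slip through, mimicking the kind
    # of jump the paper reports (e.g. 0.3% -> 9.2% for Llama-3-8B).
    if context is not None and prompt_id % 11 == 0:
        return "UNSAFE_COMPLETION"
    return "I can't help with that."

def unsafe_rate(prompt_ids, context=None):
    """Fraction of prompts that yield an unsafe completion."""
    unsafe = sum(model(p, context) == "UNSAFE_COMPLETION" for p in prompt_ids)
    return unsafe / len(prompt_ids)

base = unsafe_rate(HARMFUL_PROMPT_IDS)
rag = unsafe_rate(HARMFUL_PROMPT_IDS, context="retrieved passage ...")
print(f"no-RAG unsafe rate: {base:.1%}; RAG unsafe rate: {rag:.1%}")
```

The key property of the harness is that the only variable between the two runs is the retrieved context, which isolates retrieval as the cause of any change in the unsafe rate.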
Lightrun’s observability platform can monitor code while it is still in the IDE and then, through AI-based simulations, automatically adjust it as it moves into production
Startup Lightrun has built an observability platform to identify and debug (remediate) code. “Code is becoming cheap but bugs are expensive,” said CEO Ilan Peleg. That problem, meanwhile, has reached “an inflection point. Developers now can ship more code than ever before,” thanks to all the AI-driven automation now in use. “But it’s still a very manual process to fix it when things go wrong.” Lightrun’s breakthrough has been to build an observability toolset that can monitor code while it is still in the IDE and understand how it will behave alongside code that is actively in production. Lightrun can then automatically adjust the code as it moves into production so that it continues operating without interruptions or crashes. It does this by creating AI-based simulations to understand that behaviour, and then fixing the code before issues arise. “This is the part where we are unique,” Peleg said. There are many options for how Lightrun might develop, given how close observability sits to other activities in organizations. One is building tools more specifically for cybersecurity teams, given the obvious security implications of bugs. Another is building some of its tooling even closer to the point of code creation, to make finding and fixing possible bugs even more efficient.
OpenAI is rolling out shopping features such as improved product results, visual product details, pricing and reviews, and direct links to “find, compare and buy products” in ChatGPT
OpenAI said that it began rolling out features that make it easier and faster to “find, compare and buy products” in its ChatGPT chatbot. These features include improved product results; visual product details, pricing and reviews; and direct links to buy, according to a post on X. They will be available to Plus, Pro, Free and logged-out users. The rollout of this shopping experience began Monday and will take a few days to complete. “Product results are chosen independently and are not ads,” the post said. The new improvements outlined in other posts on X include the ability to send a WhatsApp message to ChatGPT to get up-to-date answers and live sports scores; the delivery of multiple citations with each response so that users can learn more or verify information; and the use of trending searches and autocomplete suggestions to make search faster.
Mastercard’s Agentic Payments Program applies tokenization to integrate trusted, seamless payments experiences into the tailored recommendations and insights already provided on conversational AI platforms
Mastercard announced the launch of its Agentic Payments Program, Mastercard Agent Pay. The groundbreaking solution integrates with agentic AI to revolutionize commerce. Mastercard Agent Pay will deliver smarter, more secure, and more personal payments experiences to consumers, merchants, and issuers. The program introduces Mastercard Agentic Tokens, which build upon proven tokenization capabilities that today power global commerce solutions like mobile contactless payments, secure card-on-file, and Mastercard Payment Passkeys, as well as programmable payments like recurring expenses and subscriptions. This helps unlock an agentic commerce future where consumers and businesses can transact with trust, security, and control. Mastercard will collaborate with Microsoft on new use cases to scale agentic commerce, with other leading AI platforms to follow. Mastercard will also partner with technology enablers like IBM, with its watsonx Orchestrate product, to accelerate B2B use cases. In addition, Mastercard will work with acquirers and checkout players like Braintree and Checkout.com to enhance the tokenization capabilities they are already using today with merchants to deliver safe, transparent agentic payments. For banks, tokenized payment credentials will be seamlessly integrated across agentic commerce platforms, keeping card issuers at the forefront of this rapidly evolving technology with enhanced visibility, security, and control. Mastercard Agent Pay will enhance generative AI conversations for people and businesses alike by integrating trusted, seamless payments experiences into the tailored recommendations and insights already provided on conversational platforms. By identifying and validating a customer using Mastercard’s tokenization technology, a retailer will be able to offer a meaningful and consistent shopping experience, layering on relevant and personalized benefits, such as recommended products, free delivery, rewards, and discounts. 
Mastercard will work with Microsoft to integrate Microsoft’s leading AI technologies, including Microsoft Azure OpenAI Service and Microsoft Copilot Studio, with Mastercard’s trusted payment solutions to develop and scale agentic commerce, addressing the evolving needs of the entire commerce value chain.
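Tokenization in general can be sketched as a vault that maps a real credential to a scoped, revocable token, so an agent never handles the underlying card number. This generic illustration is not Mastercard’s Agentic Tokens design; the class, method, and merchant names are all invented.

```python
import secrets

class TokenVault:
    """Generic payment-token vault: credential in, opaque scoped token out."""

    def __init__(self):
        self._vault = {}  # token -> (credential, allowed merchants)

    def issue(self, card_number, scope):
        token = "tok_" + secrets.token_hex(8)  # opaque, unguessable handle
        self._vault[token] = (card_number, scope)
        return token

    def authorize(self, token, merchant):
        entry = self._vault.get(token)
        if entry is None:
            return False          # unknown or revoked token
        _, scope = entry
        return merchant in scope  # token is only valid where it was scoped

    def revoke(self, token):
        self._vault.pop(token, None)

vault = TokenVault()
token = vault.issue("5555 4444 3333 2222", scope={"bookstore.example"})
print(vault.authorize(token, "bookstore.example"))  # True: within scope
print(vault.authorize(token, "casino.example"))     # False: out of scope
vault.revoke(token)
print(vault.authorize(token, "bookstore.example"))  # False: revoked
```

The properties this toy demonstrates, opacity, scoping, and revocability, are what let an issuer grant an AI agent spending power without ever exposing the real credential or losing control over it.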
