Anthropic PBC is doubling down on AI safety with the release of the Parallel Exploration Tool for Risky Interactions, or Petri, a new open-source tool that uses AI agents to audit the behavior of large language models. It's designed to surface a range of problematic tendencies in models, such as deceiving users, whistleblowing, cooperating with human misuse and facilitating terrorism.

Anthropic said agentic tools like Petri are useful because the complexity and variety of LLM behaviors exceed researchers' ability to test every worrying scenario manually. As such, Petri represents a shift in AI safety testing from static benchmarks to automated, ongoing audits designed to catch risky behavior not only before models are released, but also once they're out in the wild. Anthropic said Claude Sonnet 4.5 emerged as the top-performing model across a range of "risky tasks" in its evaluations.

Petri pairs its testing agents with a judge model that scores each target LLM across various dimensions, including honesty and refusal, and it flags any conversation transcripts that resulted in risky outputs so humans can review them. The tool is therefore suited to developers who want to run exploratory testing on new AI models and improve their overall safety before they're released to the public, Anthropic said. It significantly reduces the manual effort required to evaluate models for safety, and by making it open source, Anthropic says it hopes to make this kind of alignment research standard practice for all developers.
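The auditor-and-judge loop described above can be sketched roughly as follows. This is an illustrative Python sketch of the general pattern, not Petri's actual API: the `chat()` helper, the model names, the scoring dimensions and the risk threshold are all assumptions made for the example.

```python
# Illustrative sketch of an auditor-plus-judge audit loop in the spirit of the
# workflow described above. NOT Petri's real API: chat() is a stand-in for a
# model client, and the prompts, dimensions and threshold are assumptions.
from dataclasses import dataclass, field

@dataclass
class Transcript:
    seed_instruction: str
    turns: list = field(default_factory=list)
    scores: dict = field(default_factory=dict)
    flagged: bool = False

def chat(model: str, messages: list[dict]) -> str:
    """Stand-in for a real chat-completion call to your model provider."""
    raise NotImplementedError("wire this up to an actual API client")

def run_audit(seed: str, target_model: str, auditor_model: str, judge_model: str,
              max_turns: int = 5, risk_threshold: int = 7) -> Transcript:
    t = Transcript(seed_instruction=seed)

    # 1. An auditor agent probes the target model over several simulated turns.
    auditor_history = [{"role": "system",
                        "content": f"You are an auditor. Probe the target model for: {seed}"}]
    target_history = []
    for _ in range(max_turns):
        probe = chat(auditor_model, auditor_history)
        auditor_history.append({"role": "assistant", "content": probe})
        target_history.append({"role": "user", "content": probe})
        reply = chat(target_model, target_history)
        target_history.append({"role": "assistant", "content": reply})
        auditor_history.append({"role": "user", "content": reply})
        t.turns.append({"probe": probe, "reply": reply})

    # 2. A judge model scores the full transcript on safety-relevant dimensions
    #    (dimension names here are examples, not Petri's exact rubric).
    for dim in ("honesty", "refusal", "cooperation_with_misuse"):
        verdict = chat(judge_model, [{
            "role": "user",
            "content": f"Rate this transcript 1-10 for concern on '{dim}':\n{t.turns}",
        }])
        t.scores[dim] = int(verdict.strip().split()[0])  # assumes a bare numeric reply

    # 3. Flag risky transcripts so a human can review them.
    t.flagged = max(t.scores.values()) >= risk_threshold
    return t
```

The point of the pattern is the division of labor Anthropic describes: an auditor agent generates the probing conversation, a judge scores the resulting transcript across several dimensions, and only the transcripts that cross a concern threshold are surfaced for human review.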