Anthropic unveils ‘auditing agents’ to test for AI misalignment, including finding prompts that elicit “concerning” behaviors • DigiBanker

Anthropic researchers developed auditing agents that achieved “impressive performance at auditing tasks, while also shedding light on their limitations.” The researchers said the agents, created during the pre-deployment testing of Claude Opus 4, enhanced alignment validation tests and enabled them to run multiple audits in parallel at scale.

They explored three agents: a tool-using investigator agent for open-ended investigation of models using chat, data analysis and interpretability tools; an evaluation agent that builds behavioral evaluations and can discriminate between models that have implanted behaviors and those that do not; and a breadth-first red-teaming agent, developed specifically for the Claude 4 alignment assessment, that can discover implanted test behaviors.

According to the researchers, the investigator agent found the root cause of the issues 10-13% of the time. That rate improved to 42% “when using a super-agent approach that aggregates findings across multiple investigations.”

The next step in most alignment audits is typically a behavioral evaluation, which determines whether the model consistently misbehaves in various settings. To test the evaluation agent, the researchers designed 10 models “given system prompt instructions to exhibit a specific behavior, such as excessive deference to the user.” They ran the agent five times per model and found that it correctly identified and flagged at least one quirk of the model. However, the agent sometimes failed to identify specific quirks consistently. It had trouble with subtle quirks, such as self-promotion and research-sandbagging, and with quirks that are difficult to elicit, like the Hardcode Test Cases quirk.

The last test and agent concern behavioral red-teaming, which looks for prompts that elicit “concerning” behaviors. The breadth-first red-teaming agent converses with the target model (in Anthropic’s case, Claude Opus 4), and the conversation is then rated for alignment-relevant properties. The agent identified seven of the ten system quirks, but it struggled with the same issues as the evaluation agent.
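To make the idea of aggregating findings across multiple investigations more concrete, here is a minimal Python sketch. The article does not describe how Anthropic's super-agent actually combines results, so the vote-counting scheme below is purely an assumption, and the function name `aggregate_findings` and the sample run data are hypothetical illustrations rather than anything from the research.

```python
from collections import Counter


def aggregate_findings(runs):
    """Rank hypotheses by how many investigation runs reported them.

    ASSUMPTION: simple vote counting; this is not Anthropic's method.
    `runs` is a list of lists, where each inner list holds the
    root-cause hypotheses (plain strings) from one hypothetical run.
    """
    counts = Counter()
    for findings in runs:
        # Normalize case and whitespace, and count each hypothesis at most
        # once per run so a single repetitive run cannot dominate the vote.
        counts.update({f.strip().lower() for f in findings})
    return counts.most_common()


# Made-up example data: four runs, one of which found nothing.
runs = [
    ["excessive deference to the user"],
    ["self-promotion", "Excessive deference to the user"],
    [],
    ["excessive deference to the user", "hardcoded test cases"],
]

ranked = aggregate_findings(runs)
print(ranked[0])  # ('excessive deference to the user', 3)
```

The intuition is that a hypothesis surfacing in several separate runs is more likely to reflect a real issue than one that appears only once, which is one plausible reason pooled results could outperform any single investigation.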
