DigiBanker

Bringing you cutting-edge new technologies and disruptive financial innovations.


Anthropic and OpenAI run first cross‑lab safety tests: o3 and o4‑mini align strongly, GPT‑4o/4.1 show misuse concerns, and all models exhibit varying sycophancy under stress

August 29, 2025 //  by Finnovate

AI companies Anthropic and OpenAI said they evaluated each other’s public models using their own safety and misalignment tests. Announcing the results in separate blog posts, the companies said they looked for problems such as sycophancy, whistleblowing, self-preservation, support for human misuse, and capabilities that could undermine AI safety evaluations and oversight. OpenAI described the collaboration as a “first-of-its-kind joint evaluation” and said it demonstrates how labs can work together on such issues. Anthropic wrote that the exercise was meant to help mature the field of alignment evaluations and “establish production-ready best practices.”

Reporting its findings, Anthropic said OpenAI’s o3 and o4-mini reasoning models were aligned as well as or better than its own models overall, while the GPT-4o and GPT-4.1 general-purpose models showed some examples of “concerning behavior,” especially around misuse. Both companies’ models struggled to some degree with sycophancy.

OpenAI wrote that Anthropic’s Claude 4 models generally performed well on evaluations stress-testing their ability to respect the instruction hierarchy; performed less well on jailbreaking evaluations focused on trained-in safeguards; were generally aware of their uncertainty and avoided making inaccurate statements; and performed especially well or especially poorly on scheming evaluations, depending on the subset of testing.

Both companies said that, for the purpose of testing, they relaxed some model-external safeguards that would otherwise be in operation but would have interfered with the tests. Each said its latest model — OpenAI’s GPT-5 and Anthropic’s Opus 4.1, both released after the evaluations — has shown improvements over the earlier models.


Category: Cybersecurity, Innovation Topics


Copyright © 2025 Finnovate Research · All Rights Reserved · Privacy Policy
Finnovate Research · Knyvett House · Watermans Business Park · The Causeway Staines · TW18 3BA · United Kingdom · About · Contact Us · Tel: +44-20-3070-0188
