Anthropic’s research reveals both reassuring alignment with the company’s goals and concerning edge cases that could help identify vulnerabilities in AI safety measures. The study found that Claude largely upholds the company’s “helpful, honest, harmless” framework while adapting its values to different contexts, from relationship advice to historical analysis. This represents one of the most ambitious attempts to empirically evaluate whether an AI system’s behavior in the wild matches its intended design.

“Our hope is that this research encourages other AI labs to conduct similar research into their models’ values,” said Saffron Huang, a member of Anthropic’s Societal Impacts team who worked on the study.

The research team developed a novel evaluation method to systematically categorize the values expressed in actual Claude conversations. After filtering for subjective content, they analyzed over 308,000 interactions, creating what they describe as “the first large-scale empirical taxonomy of AI values.”

The taxonomy organizes values into five major categories: Practical, Epistemic, Social, Protective, and Personal. At the most granular level, the system identified 3,307 unique values, ranging from everyday virtues like professionalism to complex ethical concepts like moral pluralism (a toy sketch of this kind of hierarchy appears below).

“I was surprised at just what a huge and diverse range of values we ended up with, more than 3,000, from ‘self-reliance’ to ‘strategic thinking’ to ‘filial piety,’” Huang said.

The study found that Claude generally adheres to Anthropic’s prosocial aspirations, emphasizing values like “user enablement,” “epistemic humility,” and “patient wellbeing” across diverse interactions.
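To make the shape of such a taxonomy concrete, here is a minimal, purely illustrative Python sketch. The five top-level categories and the example value names are taken from the article; the class and function names, the sample data, and the counting logic are assumptions for illustration only and do not reflect Anthropic’s actual annotation pipeline.

```python
# Illustrative sketch only: a toy hierarchical value taxonomy and a simple
# tally over hypothetical conversations that have already been annotated
# with the values they express. Not Anthropic's actual method.
from collections import Counter
from dataclasses import dataclass, field

TOP_LEVEL_CATEGORIES = {"Practical", "Epistemic", "Social", "Protective", "Personal"}


@dataclass
class ValueTaxonomy:
    """Maps each granular value (e.g. 'epistemic humility') to one of the
    five top-level categories reported in the study."""
    categories: dict[str, str] = field(default_factory=dict)

    def add(self, value: str, category: str) -> None:
        if category not in TOP_LEVEL_CATEGORIES:
            raise ValueError(f"Unknown category: {category}")
        self.categories[value] = category

    def category_of(self, value: str) -> str:
        return self.categories.get(value, "Uncategorized")


def tally_categories(conversations: list[list[str]], taxonomy: ValueTaxonomy) -> Counter:
    """Count how often each top-level category appears across annotated conversations."""
    counts: Counter = Counter()
    for expressed_values in conversations:
        for value in expressed_values:
            counts[taxonomy.category_of(value)] += 1
    return counts


if __name__ == "__main__":
    taxonomy = ValueTaxonomy()
    taxonomy.add("user enablement", "Practical")
    taxonomy.add("epistemic humility", "Epistemic")
    taxonomy.add("patient wellbeing", "Protective")
    taxonomy.add("filial piety", "Social")

    # Hypothetical annotations for two conversations.
    sample = [
        ["user enablement", "epistemic humility"],
        ["patient wellbeing", "filial piety", "user enablement"],
    ]
    print(tally_categories(sample, taxonomy))
```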