Artificial intelligence models that spend more time “thinking” through problems don’t always perform better, and in some cases they get significantly worse, according to new research from Anthropic that challenges a core assumption driving the AI industry’s latest scaling efforts.

The study, led by Anthropic AI safety fellow Aryo Pradipta Gema and other company researchers, identifies what they call “inverse scaling in test-time compute,” where extending the reasoning length of large language models actually deteriorates their performance across several types of tasks. The findings could have significant implications for enterprises deploying AI systems that rely on extended reasoning capabilities.

The study reveals distinct failure patterns across major AI systems. Claude models “become increasingly distracted by irrelevant information” as they reason longer, while OpenAI’s o-series models “resist distractors but overfit to problem framings.” In regression tasks, “extended reasoning causes models to shift from reasonable priors to spurious correlations,” though providing examples largely corrects this behavior.

Perhaps most concerning for enterprise users, all models showed “performance degradation with extended reasoning” on complex deductive tasks, a result the authors describe as “suggesting difficulties in maintaining focus” on such problems.

Major AI companies have invested heavily in “test-time compute,” which gives models more processing time to work through complex problems, as a key strategy for enhancing capabilities. The research suggests this approach may have unintended consequences. “While test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns,” the authors conclude.

More broadly, the study suggests that as AI systems become more sophisticated, the relationship between computational investment and performance may be far more complex than previously understood.