The latest large reasoning models (LRMs) experience “complete accuracy collapse” when faced with highly complex tasks, according to a new paper co-authored by researchers from Apple. The researchers used controllable puzzles (Tower of Hanoi, Checker Jumping, River Crossing and Blocks World) that gave them precise control over difficulty: adding more disks, checkers, people or blocks while keeping the basic rules the same. This let them see exactly when and how the models’ reasoning broke down as the problems got harder.

As puzzle complexity increased, the performance of these frontier LRMs didn’t just degrade gradually; it suffered a “complete accuracy collapse,” often dropping to zero successful solutions beyond a certain point. The researchers also found that as problems approached the failure point, the LRMs began to reduce their reasoning effort, using fewer “thinking” steps or tokens, pointing to a fundamental limit in how they handle increasing difficulty.

On simple problems, the LRMs sometimes found the correct answer early but kept exploring wrong solutions, a form of “overthinking” that wastes effort. On harder problems, correct solutions appeared later in the reasoning trace, if at all. Beyond the collapse point, no correct solutions were found in the thinking process.

The study concluded that these findings point to fundamental limitations in how current LRMs tackle problems. While the “thinking” process helps delay failure, it doesn’t overcome these core barriers. The research raises questions about whether simply adding more “thinking” steps is enough to achieve truly general AI that can handle highly complex, novel problems.
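To make the idea of “controllable complexity” concrete, here is a minimal sketch (not taken from the paper) of how a Tower of Hanoi instance scales: the rules never change, but each added disk roughly doubles the minimum number of moves (2^n − 1), which gives researchers a clean knob for difficulty. The function name and the range of disk counts below are illustrative assumptions.

```python
# Minimal sketch (not from the paper): scaling Tower of Hanoi difficulty.
# The rules stay fixed; only the number of disks n changes, and the
# optimal solution length grows as 2**n - 1.

def hanoi_moves(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move sequence for n disks via recursion."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi_moves(n - 1, source, spare, target, moves)  # clear n-1 disks off the top
    moves.append((source, target))                    # move the largest disk
    hanoi_moves(n - 1, spare, target, source, moves)  # restack the n-1 disks on top
    return moves

if __name__ == "__main__":
    for n in range(3, 11):
        seq = hanoi_moves(n)
        assert len(seq) == 2**n - 1  # optimal length doubles (plus one) per disk
        print(f"{n} disks -> {len(seq)} moves minimum")
```

A model that solves the 7-disk instance but fails at 8 hasn’t encountered a new rule, only the same rule applied more times, which is what makes the collapse point straightforward to measure.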