OpenAI released a new benchmark that tests how its AI models perform compared to human professionals across a wide range of industries and jobs. The test, GDPval, is an early attempt to measure how close OpenAI's systems are to outperforming humans at economically valuable work, a key part of the company's founding mission to develop artificial general intelligence, or AGI.

GDPval draws on nine industries that contribute the most to America's gross domestic product, including healthcare, finance, manufacturing, and government. The benchmark tests an AI model's performance in 44 occupations across those industries, ranging from software engineers to nurses to journalists.

For the first version of the test, GDPval-v0, OpenAI asked experienced professionals to compare AI-generated reports with reports produced by other professionals and choose the better one. For example, one prompt asked investment bankers to create a competitor landscape for the last-mile delivery industry; expert graders then compared those human-written reports with AI-generated ones.

OpenAI then averages an AI model's "win rate" against the human reports across all 44 occupations. For GPT-5-high, a souped-up version of GPT-5 that uses extra computational power, the company says the model was ranked as better than or on par with industry experts 40.6% of the time.

It's worth noting that most working professionals do a lot more than submit research reports to their bosses, which is all that GDPval-v0 tests for. OpenAI acknowledges this and says it plans to build more robust tests that account for more industries and interactive workflows.

OpenAI also tested Anthropic's Claude Opus 4.1, which was ranked as better than or on par with industry experts on 49% of tasks. OpenAI says it believes Claude scored so high because of its tendency to produce pleasing graphics rather than because of sheer performance.
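The scoring described above, where a model's per-occupation "win rate" (rated better than or on par with the human expert) is averaged across occupations, can be sketched roughly as follows. This is an illustrative reconstruction, not OpenAI's actual grading code; the occupation names and grader verdicts below are made up for the example.

```python
def win_rate(outcomes):
    """Fraction of tasks where the AI report was rated better than
    or on par with the human expert's report ("win" or "tie")."""
    favorable = sum(1 for o in outcomes if o in ("win", "tie"))
    return favorable / len(outcomes)


def average_win_rate(by_occupation):
    """Average the per-occupation win rates with equal weight,
    mirroring the across-occupation averaging described above."""
    rates = [win_rate(outcomes) for outcomes in by_occupation.values()]
    return sum(rates) / len(rates)


# Hypothetical grader verdicts for three of the 44 occupations.
graded = {
    "investment banker": ["win", "loss", "tie", "loss"],
    "nurse": ["loss", "loss", "win", "loss"],
    "journalist": ["tie", "win", "loss", "loss"],
}
print(round(average_win_rate(graded), 3))  # → 0.417
```

Averaging per occupation first, rather than pooling all tasks, keeps occupations with many tasks from dominating the headline number.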