OpenAI launched a new family of models called GPT-4.1. Yes, “4.1” — as if the company’s nomenclature wasn’t confusing enough already. There’s GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, all of which OpenAI says “excel” at coding and instruction following.

Available through OpenAI’s API but not ChatGPT, the multimodal models have a 1-million-token context window, meaning they can take in roughly 750,000 words in one go.

OpenAI’s grand ambition is to create an “agentic software engineer,” as CFO Sarah Friar put it. The company asserts its future models will be able to program entire apps end-to-end, handling aspects such as quality assurance, bug testing, and documentation writing. GPT-4.1 is a step in this direction.

“We’ve optimized GPT-4.1 for real-world use based on direct feedback to improve in areas that developers care most about: frontend coding, making fewer extraneous edits, following formats reliably, adhering to response structure and ordering, consistent tool usage, and more,” says OpenAI. “These improvements enable developers to build agents that are considerably better at real-world software engineering tasks.”

OpenAI claims the full GPT-4.1 model outperforms its GPT-4o and GPT-4o mini models on coding benchmarks, including SWE-bench. GPT-4.1 mini and nano are said to be more efficient and faster at the cost of some accuracy, with OpenAI saying GPT-4.1 nano is its speediest — and cheapest — model ever.

According to OpenAI’s internal testing, GPT-4.1, which can generate more tokens at once than GPT-4o (32,768 versus 16,384), scored between 52% and 54.6% on SWE-bench Verified, a human-validated subset of SWE-bench.

In a separate evaluation, OpenAI probed GPT-4.1 using Video-MME, which is designed to measure the ability of a model to “understand” content in videos. GPT-4.1 reached a chart-topping 72% accuracy on the “long, no subtitles” video category, claims OpenAI.
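
Since the models are API-only, here is a minimal sketch of what calling GPT-4.1 might look like with OpenAI’s Python SDK. The model identifier, the prompt, and the use of the output-token ceiling as a `max_tokens` value are assumptions for illustration, not details confirmed by OpenAI’s announcement.

```python
# Minimal sketch: calling GPT-4.1 through the OpenAI API (not ChatGPT).
# The model name "gpt-4.1" and the 32,768-token output cap are taken from
# the reporting above and should be checked against OpenAI's docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",  # smaller variants reportedly: GPT-4.1 mini, GPT-4.1 nano
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Refactor this function and explain the changes: ..."},
    ],
    max_tokens=32_768,  # GPT-4.1's reported maximum output tokens (vs. 16,384 for GPT-4o)
)

print(response.choices[0].message.content)
```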