OpenAI on Monday launched a brand-new family of models called GPT-4.1. Yes, "4.1," as if the company's nomenclature wasn't confusing enough already.
There's GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, all of which OpenAI says "excel" at coding and instruction following. Available through OpenAI's API but not ChatGPT, the multimodal models have a 1-million-token context window, meaning they can take in roughly 750,000 words in one go (longer than "War and Peace").
GPT-4.1 arrives as OpenAI rivals like Google and Anthropic ratchet up efforts to build sophisticated programming models. Google's recently launched Gemini 2.5 Pro, which also has a 1-million-token context window, ranks highly on popular coding benchmarks. So do Anthropic's Claude 3.7 Sonnet and Chinese AI startup DeepSeek's upgraded V3.
It's the goal of many tech giants, including OpenAI, to train AI coding models capable of performing complex software engineering tasks. OpenAI's grand ambition is to create an "agentic software engineer," as CFO Sarah Friar put it during a tech summit in London last month. The company asserts its future models will be able to program entire apps end-to-end, handling aspects such as quality assurance, bug testing, and documentation writing.
GPT-4.1 is a step in this direction.
"We've optimized GPT-4.1 for real-world use based on direct feedback to improve in areas that developers care most about: frontend coding, making fewer extraneous edits, following formats reliably, adhering to response structure and ordering, consistent tool usage, and more," an OpenAI spokesperson told Trendster via email. "These improvements enable developers to build agents that are considerably better at real-world software engineering tasks."
OpenAI claims the full GPT-4.1 model outperforms its GPT-4o and GPT-4o mini models on coding benchmarks, including SWE-bench. GPT-4.1 mini and nano are said to be more efficient and faster at the cost of some accuracy, with OpenAI saying GPT-4.1 nano is its speediest and cheapest model ever.
GPT-4.1 costs $2 per million input tokens and $8 per million output tokens. GPT-4.1 mini is $0.40 per million input tokens and $1.60 per million output tokens, and GPT-4.1 nano is $0.10 per million input tokens and $0.40 per million output tokens.
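For readers weighing those rates, the per-call math can be sketched as a quick back-of-the-envelope calculator (the model-name strings below are informal labels for this example, not necessarily the API's exact identifiers):

```python
# Illustrative sketch: estimating the dollar cost of a single API call
# from the published per-million-token prices (input, output) in USD.
PRICES = {
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in dollars for one call."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Filling the full 1-million-token context window of GPT-4.1 and getting
# back its maximum 32,768-token reply would cost roughly:
print(round(estimate_cost("gpt-4.1", 1_000_000, 32_768), 2))  # → 2.26
```

In other words, a single maxed-out GPT-4.1 call runs a couple of dollars, while the same workload on nano would be roughly a twentieth of that.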
According to OpenAI's internal testing, GPT-4.1, which can generate more tokens at once than GPT-4o (32,768 versus 16,384), scored between 52% and 54.6% on SWE-bench Verified, a human-validated subset of SWE-bench. (OpenAI noted in a blog post that some solutions to SWE-bench Verified problems couldn't run on its infrastructure, hence the range of scores.) These figures are just under the scores reported by Google and Anthropic for Gemini 2.5 Pro (63.8%) and Claude 3.7 Sonnet (62.3%), respectively, on the same benchmark.
In a separate evaluation, OpenAI probed GPT-4.1 using Video-MME, which is designed to measure a model's ability to "understand" content in videos. GPT-4.1 reached a chart-topping 72% accuracy on the "long, no subtitles" video category, OpenAI claims.
While GPT-4.1 scores reasonably well on benchmarks and has a more recent "knowledge cutoff," giving it a better frame of reference for current events (up to June 2024), it's important to keep in mind that even some of the best models today struggle with tasks that wouldn't trip up experts. For example, many studies have shown that code-generating models often fail to fix, and even introduce, security vulnerabilities and bugs.
OpenAI acknowledges, too, that GPT-4.1 becomes less reliable (i.e., likelier to make mistakes) the more input tokens it has to deal with. On one of the company's own tests, OpenAI-MRCR, the model's accuracy decreased from around 84% with 8,000 tokens to 50% with 1 million tokens. GPT-4.1 also tended to be more "literal" than GPT-4o, the company says, sometimes necessitating more specific, explicit prompts.