A new AI coding challenge just published its first results – and they aren’t pretty

bicycledays

A new AI coding challenge has revealed its first winner, and set a new bar for AI-powered software engineers.

On Wednesday at 5pm PST, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he won with correct answers to just 7.5% of the questions on the test.

“We’re glad we built a benchmark that is actually hard,” said Konwinski. “Benchmarks should be hard if they’re going to matter,” he continued, adding: “Scores would be different if the big labs had entered with their best models. But that’s kind of the point. K Prize runs offline with limited compute, so it favors smaller and open models. I love that. It levels the playing field.”

Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.

Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub as a measure of how well models can deal with real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12th. The K Prize organizers then built the test using only GitHub issues flagged after that date.
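The timed entry idea can be illustrated with a minimal sketch. It is not the K Prize's actual pipeline: the `Issue` record, the `contamination_free` helper, and the sample data are all hypothetical, and the year 2025 is an assumption, since the deadline is given only as March 12th. The sketch simply keeps issues flagged strictly after the submission deadline, so no submitted model could have trained on them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Issue:
    """Hypothetical record for a flagged GitHub issue (not the K Prize schema)."""
    repo: str
    number: int
    flagged_on: date


def contamination_free(issues, submission_deadline):
    """Keep only issues flagged strictly after the model submission deadline."""
    return [issue for issue in issues if issue.flagged_on > submission_deadline]


# Assumption: the article states March 12th without a year; 2025 is illustrative.
DEADLINE = date(2025, 3, 12)

# Made-up sample issues for demonstration only.
issues = [
    Issue("example/repo", 101, date(2025, 2, 28)),  # before the deadline: excluded
    Issue("example/repo", 102, date(2025, 3, 12)),  # on the deadline: excluded
    Issue("example/repo", 103, date(2025, 4, 2)),   # after the deadline: kept
]

eligible = contamination_free(issues, DEADLINE)
print([issue.number for issue in eligible])
```

Using a strict comparison means an issue flagged on the deadline day itself is left out, which is the conservative reading of “after that date.”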

The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a 75% top score on its easier ‘Verified’ test and 34% on its harder ‘Full’ test. Konwinski still isn’t sure whether the disparity is due to contamination on SWE-Bench or just the challenge of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.

“As we get more runs of the thing, we’ll have a better sense,” he told Trendster, “because we expect people to adapt to the dynamics of competing on this every few months.”

It might seem like an odd place to fall short, given the wide range of AI coding tools already publicly available, but with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.

“I’m quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor, who put forward a similar idea in a recent paper. “Without such experiments, we can’t actually tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.”

For Konwinski, it’s not just a better benchmark, but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”
