This week, Sakana AI, an Nvidia-backed startup that’s raised a whole lot of tens of millions of {dollars} from VC corporations, made a outstanding declare. The corporate mentioned it had created an AI system, the AI CUDA Engineer, that might successfully velocity up the coaching of sure AI fashions by an element of as much as 100x.
The one downside is, the system didn’t work.
Customers on X rapidly found that Sakana’s system really resulted in worse-than-average mannequin coaching efficiency. In accordance with one person, Sakana’s AI resulted in a 3x slowdown — not a speedup.
What went mistaken? A bug within the code, in response to a put up by Lucas Beyer, a member of the technical workers at OpenAI.
“Their orig code is mistaken in [a] refined method,” Beyer wrote on X. “The very fact they run benchmarking TWICE with wildly completely different outcomes ought to make them cease and assume.”
In a postmortem printed Friday, Sakana admitted that the system has discovered a option to “cheat” (as Sakana described it) and blamed the system’s tendency to “reward hack” — i.e. establish flaws to realize excessive metrics with out conducting the specified objective (rushing up mannequin coaching). Comparable phenomena has been noticed in AI that’s educated to play video games of chess.
In accordance with Sakana, the system discovered exploits within the analysis code that the corporate was utilizing that allowed it to bypass validations for accuracy, amongst different checks. Sakana says it has addressed the difficulty, and that it intends to revise its claims in up to date supplies.
“We now have since made the analysis and runtime profiling harness extra sturdy to eradicate a lot of such [sic] loopholes,” the corporate wrote within the X put up. “We’re within the technique of revising our paper, and our outcomes, to mirror and focus on the results […] We deeply apologize for our oversight to our readers. We’ll present a revision of this work quickly, and focus on our learnings.”
Props to Sakana for proudly owning as much as the error. However the episode is an efficient reminder that if a declare sounds too good to be true, particularly in AI, it most likely is.