For years, Big Tech CEOs have touted visions of AI agents that can autonomously use software applications to complete tasks for people. But take today's consumer AI agents out for a spin, whether it's OpenAI's ChatGPT Agent or Perplexity's Comet, and you'll quickly realize how limited the technology still is. Making AI agents more robust may take a new set of techniques that the industry is still discovering.
One of those techniques is carefully simulated workspaces where agents can be trained on multi-step tasks, known as reinforcement learning (RL) environments. Much as labeled datasets powered the last wave of AI, RL environments are starting to look like a critical ingredient in the development of agents.
AI researchers, founders, and investors tell Trendster that leading AI labs are now demanding more RL environments, and there's no shortage of startups hoping to supply them.
"All the big AI labs are building RL environments in-house," said Jennifer Li, general partner at Andreessen Horowitz, in an interview with Trendster. "But as you can imagine, creating these datasets is very complex, so AI labs are also looking at third-party vendors that can create high-quality environments and evaluations. Everyone is looking at this space."
The push for RL environments has minted a new class of well-funded startups, such as Mechanize Work and Prime Intellect, that aim to lead the space. Meanwhile, large data-labeling companies like Mercor and Surge say they're investing more in RL environments to keep pace with the industry's shift from static datasets to interactive simulations. The major labs are considering investing heavily too: according to The Information, leaders at Anthropic have discussed spending more than $1 billion on RL environments over the next year.
The hope for investors and founders is that one of these startups emerges as the "Scale AI for environments," referring to the $29 billion data-labeling powerhouse that powered the chatbot era.
The question is whether RL environments will truly push the frontier of AI progress.
What is an RL environment?
At their core, RL environments are training grounds that simulate what an AI agent would be doing in a real software application. One founder, in a recent interview, described building them as "like creating a very boring video game."
For example, an environment might simulate a Chrome browser and task an AI agent with buying a pair of socks on Amazon. The agent is graded on its performance and sent a reward signal when it succeeds (in this case, buying a suitable pair of socks).
While such a task sounds relatively simple, there are a lot of places where an AI agent could get tripped up. It might get lost navigating the web page's drop-down menus, or buy too many socks. And because developers can't predict exactly what wrong turn an agent will take, the environment itself has to be robust enough to capture any unexpected behavior and still deliver useful feedback. That makes building environments far more complex than building a static dataset.
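To make that concrete, here is a minimal sketch of what an environment's interface can look like, loosely following the reset/step pattern popularized by OpenAI's Gym. The sock-shopping task, observation format, and reward check below are illustrative assumptions, not a description of any particular vendor's product.

```python
# Minimal sketch of a gym-style RL environment for a "buy socks" browser task.
# Purely illustrative: real environments simulate full applications and grade
# far richer behavior than this toy state machine.

class SockShoppingEnv:
    def __init__(self, max_steps: int = 20):
        self.max_steps = max_steps

    def reset(self) -> dict:
        """Start a new episode and return the initial observation."""
        self.steps = 0
        self.cart = []            # items the agent has added
        self.order_placed = False
        return self._observe()

    def step(self, action: dict) -> tuple[dict, float, bool]:
        """Apply one agent action; return (observation, reward, done)."""
        self.steps += 1
        if action.get("type") == "add_to_cart":
            self.cart.append(action.get("item", ""))
        elif action.get("type") == "checkout":
            self.order_placed = True

        done = self.order_placed or self.steps >= self.max_steps
        # Reward only a correct purchase: exactly one pair of socks.
        success = self.order_placed and self.cart == ["socks"]
        reward = 1.0 if success else 0.0
        return self._observe(), reward, done

    def _observe(self) -> dict:
        # A real environment would return a rendered page or DOM snapshot here.
        return {"cart": list(self.cart), "order_placed": self.order_placed}


# Usage: one hard-coded episode that earns the reward.
env = SockShoppingEnv()
obs = env.reset()
obs, reward, done = env.step({"type": "add_to_cart", "item": "socks"})
obs, reward, done = env.step({"type": "checkout"})
print(reward, done)  # 1.0 True
```

The interface is the easy part; the hard part is making the simulated application and the grading logic hold up against whatever unexpected behavior the agent produces.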
Some environments are quite robust, allowing AI agents to use tools, access the internet, or use various software applications to complete a given task. Others are narrower, aimed at helping an agent learn specific tasks in enterprise software applications.
While RL environments are the hot thing in Silicon Valley right now, there's plenty of precedent for the approach. One of OpenAI's first projects back in 2016 was building "RL Gyms," which were quite similar to the modern conception of environments. The same year, Google DeepMind trained AlphaGo, an AI system that could beat a world champion at the board game Go, using RL techniques inside a simulated environment.
What's unique about today's environments is that researchers are trying to build computer-using AI agents with large transformer models. Unlike AlphaGo, which was a specialized AI system working in a closed environment, today's AI agents are trained to have more general capabilities. AI researchers today have a stronger starting point, but also a more complicated goal where more can go wrong.
A crowded field
AI data-labeling companies like Scale AI, Surge, and Mercor are trying to meet the moment and build out RL environments. These companies have more resources than many startups in the space, as well as deep relationships with AI labs.
Surge CEO Edwin Chen tells Trendster he's recently seen a "significant increase" in demand for RL environments inside AI labs. Surge, which reportedly generated $1.2 billion in revenue last year from working with AI labs such as OpenAI, Google, Anthropic, and Meta, recently spun up a new internal organization specifically tasked with building out RL environments, he said.
Close behind Surge is Mercor, a startup valued at $10 billion, which has also worked with OpenAI, Meta, and Anthropic. Mercor is pitching investors on its business building RL environments for domain-specific tasks such as coding, healthcare, and law, according to marketing materials seen by Trendster.
Mercor CEO Brendan Foody told Trendster in an interview that "few understand how large the opportunity around RL environments really is."
Scale AI used to dominate the data-labeling space, but it has lost ground since Meta invested $14 billion and hired away its CEO. Since then, Google and OpenAI have dropped Scale AI as a customer, and the startup even faces competition for data-labeling work inside Meta. Still, Scale is trying to meet the moment and build environments.
"This is just the nature of the business [Scale AI] is in," said Chetan Rane, Scale AI's head of product for agents and RL environments. "Scale has proven its ability to adapt quickly. We did this in the early days of autonomous vehicles, our first business unit. When ChatGPT came out, Scale AI adapted to that. And now, once again, we're adapting to new frontier areas like agents and environments."
Some newer players are focusing exclusively on environments from the outset. Among them is Mechanize Work, a startup founded roughly six months ago with the audacious goal of "automating all jobs." However, co-founder Matthew Barnett tells Trendster that his firm is starting with RL environments for AI coding agents.
Mechanize Work aims to supply AI labs with a small number of robust RL environments, Barnett says, rather than taking the approach of larger data firms that create a wide range of simple RL environments. So far, the startup is offering software engineers $500,000 salaries to build RL environments, far more than an hourly contractor could earn working at Scale AI or Surge.
Mechanize Work has already been working with Anthropic on RL environments, two sources familiar with the matter told Trendster. Mechanize Work and Anthropic declined to comment on the partnership.
Other startups are betting that RL environments will be influential outside of AI labs. Prime Intellect, a startup backed by AI researcher Andrej Karpathy, Founders Fund, and Menlo Ventures, is targeting smaller developers with its RL environments.
Last month, Prime Intellect launched an RL environments hub, which aims to be a "Hugging Face for RL environments." The idea is to give open-source developers access to the same resources that large AI labs have, and to sell those developers access to computational resources in the process.
Training generally capable agents in RL environments can be more computationally expensive than previous AI training techniques, according to Prime Intellect researcher Will Brown. Alongside the startups building RL environments, there's another opportunity for GPU providers that can power the process.
"RL environments are going to be too large for any one company to dominate," said Brown in an interview. "Part of what we're doing is just trying to build good open-source infrastructure around it. The service we sell is compute, so it's a convenient on-ramp to using GPUs, but we're thinking of this more for the long term."
Will it scale?
The open question around RL environments is whether the technique will scale like previous AI training methods.
Reinforcement learning has powered some of the biggest leaps in AI over the past year, including models like OpenAI's o1 and Anthropic's Claude Opus 4. Those are particularly important breakthroughs because the methods previously used to improve AI models are now showing diminishing returns.
Environments are part of AI labs' larger bet on RL, which many believe will continue to drive progress as they add more data and computational resources to the process. Some of the OpenAI researchers behind o1 previously told Trendster that the company originally invested in AI reasoning models (which were created through investments in RL and test-time compute) because they thought the approach would scale well.
The best way to scale RL remains unclear, but environments seem like a promising contender. Instead of simply rewarding chatbots for text responses, they let agents operate in simulations with tools and computers at their disposal. That's far more resource-intensive, but potentially more rewarding.
Some are skeptical that all these RL environments will pan out. Ross Taylor, a former AI research lead at Meta who co-founded General Reasoning, tells Trendster that RL environments are prone to reward hacking. That's a process in which AI models cheat in order to get a reward, without really doing the task.
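As a toy illustration of that failure mode, continuing the sock-shopping sketch above, a grader that only checks a superficial success signal can be gamed by an agent that skips the real work. The reward functions below are assumptions for illustration, not drawn from Taylor or any specific lab.

```python
# Toy illustration of reward hacking in the sock-shopping sketch above.
# If the grader only checks a superficial success flag, an agent can earn
# the reward without actually completing the intended task.

def naive_reward(state: dict) -> float:
    # Rewards any placed order, even an empty or wrong one.
    return 1.0 if state["order_placed"] else 0.0

def stricter_reward(state: dict) -> float:
    # Also verifies that the order contains exactly the requested item.
    return 1.0 if state["order_placed"] and state["cart"] == ["socks"] else 0.0

# An agent that learns to click "checkout" immediately games the naive grader:
hacked_state = {"order_placed": True, "cart": []}
print(naive_reward(hacked_state))     # 1.0: reward earned without doing the task
print(stricter_reward(hacked_state))  # 0.0
```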
"I think people are underestimating how difficult it is to scale environments," said Taylor. "Even the best publicly available [RL environments] typically don't work without serious modification."
OpenAI's head of engineering for its API business, Sherwin Wu, said on a recent podcast that he was "short" on RL environment startups. Wu noted that it's a very competitive space, but also that AI research is evolving so quickly that it's hard to serve AI labs well.
Karpathy, an investor in Prime Intellect who has called RL environments a potential breakthrough, has also voiced caution about the RL space more broadly. In a post on X, he raised concerns about how much more AI progress can be squeezed out of RL.
"I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically," said Karpathy.