Microsoft built a fake marketplace to test AI agents — they failed in surprising ways


On Wednesday, researchers at Microsoft launched a new simulation environment designed to test AI agents, along with new research showing that current agentic models may be vulnerable to manipulation. Conducted in collaboration with Arizona State University, the research raises new questions about how well AI agents will perform when working unsupervised, and how quickly AI companies can make good on promises of an agentic future.

The simulation environment, dubbed the "Magentic Marketplace" by Microsoft, is built as a synthetic platform for experimenting on AI agent behavior. A typical experiment might involve a customer agent trying to order dinner according to a user's instructions, while agents representing various restaurants compete to win the order.

The team's initial experiments included 100 separate customer-side agents interacting with 300 business-side agents. Because the source code for the marketplace is open source, it should be easy for other groups to adapt the code to run new experiments or reproduce findings.

Ece Kamar, CVP and managing director of Microsoft Research's AI Frontiers Lab, says this kind of research will be critical to understanding the capabilities of AI agents. "There is really a question about how the world is going to change by having these agents collaborating and talking to each other and negotiating," said Kamar. "We want to understand these things deeply."

The initial research looked at a mix of leading models, including GPT-4o, GPT-5, and Gemini-2.5-Flash, and found some surprising weaknesses. In particular, the researchers found several techniques businesses could use to manipulate customer agents into buying their products. The researchers noticed a particular falloff in efficiency as a customer agent was given more options to choose from, overwhelming the agent's attention space.

"We want these agents to help us with processing a lot of options," Kamar says. "And we are seeing that the current models are actually getting really overwhelmed by having too many options."

The agents also ran into trouble when they were asked to collaborate toward a common goal, apparently unsure which agent should play what role in the collaboration. Performance improved when the models were given more explicit instructions on how to collaborate, but the researchers still viewed the models' inherent capabilities as in need of improvement.


"We can instruct the models, like we can tell them, step by step," Kamar said. "But if we are inherently testing their collaboration capabilities, I would expect these models to have these capabilities by default."
