Running AI models is turning into a memory game


When we talk about the cost of AI infrastructure, the focus is usually on Nvidia and GPUs, but memory is an increasingly important part of the picture. As hyperscalers prepare to build out billions of dollars' worth of new data centers, the price of DRAM chips has jumped roughly 7x in the last year.

At the same time, there's a growing discipline of orchestrating all that memory to make sure the right data gets to the right agent at the right time. The companies that master it will be able to serve the same queries with fewer tokens, which can be the difference between folding and staying in business.

Semiconductor analyst Dan O'Laughlin has an interesting look at the importance of memory chips on his Substack, where he talks with Val Bercovici, chief AI officer at Weka. They're both semiconductor guys, so the focus is more on the chips than the broader architecture, but the implications for AI software are pretty significant too.

I was particularly struck by this passage, in which Bercovici looks at the growing complexity of Anthropic's prompt-caching documentation:

The tell is if we go to Anthropic's prompt caching pricing page. It started off as a very simple page six or seven months ago, especially as Claude Code was launching: just "use caching, it's cheaper." Now it's an encyclopedia of advice on exactly how many cache writes to pre-buy. You've got five-minute tiers, which are very common across the industry, or one-hour tiers, and nothing above. That's a really important tell. Then of course you've got all sorts of arbitrage opportunities around the pricing for cache reads based on how many cache writes you've pre-purchased.

The question here is how long Claude holds your prompt in cached memory: you can pay for a five-minute window, or pay more for an hour-long window. It's much cheaper to draw on data that's still in the cache, so if you manage it right, you can save an awful lot. There's a catch, though: every new piece of information you add to the query may bump something else out of the cache window.
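To see why the tiers invite arbitrage, it helps to run the numbers. The sketch below uses the general shape of Anthropic's published pricing (cache writes cost a premium over normal input tokens, cache reads cost a small fraction of them), but the exact multipliers and token counts here are illustrative assumptions, not quoted prices; check the current pricing page before relying on them.

```python
# Illustrative break-even math for prompt caching. The multipliers are
# placeholders with the right general shape (writes cost more than plain
# input, reads cost far less) -- not authoritative prices.
BASE_INPUT = 1.0   # relative cost of a normal input token
WRITE_5M = 1.25    # assumed 5-minute cache-write multiplier
WRITE_1H = 2.0     # assumed 1-hour cache-write multiplier
READ = 0.10        # assumed cache-read multiplier

def uncached_cost(prefix_tokens: int, calls: int) -> float:
    """Resend the same prefix as plain input tokens on every call."""
    return prefix_tokens * calls * BASE_INPUT

def cached_cost(prefix_tokens: int, calls: int, write_mult: float) -> float:
    """Pay one cache write up front, then cache reads for the rest."""
    return prefix_tokens * (write_mult + (calls - 1) * READ)

prefix, calls = 10_000, 20
print(uncached_cost(prefix, calls))                    # 200000.0
print(round(cached_cost(prefix, calls, WRITE_5M), 2))  # 31500.0
```

Under these assumptions, caching a 10,000-token prefix across 20 calls costs a small fraction of resending it each time, and the pricier one-hour tier only pays off when the gap between calls would otherwise let the five-minute cache expire. That trade-off is exactly the kind of thing the pricing page now spells out.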

This is complicated stuff, but the upshot is simple enough: managing memory in AI models is going to be a big part of AI going forward. Companies that do it well are going to rise to the top.

And there is plenty of progress to be made in this new field. Back in October, I covered a startup called TensorMesh that is working on one layer of the stack known as cache optimization.
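The eviction dynamic described above can be sketched in a few lines. This is a toy least-recently-used cache over prompt prefixes, purely illustrative: real cache-optimization layers manage KV-cache tensors across GPU and CPU memory, not strings, and the class and names here are invented for the example.

```python
from collections import OrderedDict

class PrefixCache:
    """Toy LRU cache: adding a new entry can bump an older one out,
    which is the behavior prompt-cache users are managing around."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, prefix: str):
        if prefix in self._store:
            self._store.move_to_end(prefix)  # mark as recently used
            return self._store[prefix]
        return None

    def put(self, prefix: str, value: str) -> None:
        if prefix in self._store:
            self._store.move_to_end(prefix)
        self._store[prefix] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least-recently-used

cache = PrefixCache(capacity=2)
cache.put("system-prompt-a", "kv-a")
cache.put("system-prompt-b", "kv-b")
cache.get("system-prompt-a")          # refresh A
cache.put("system-prompt-c", "kv-c")  # evicts B, the stalest entry
print(cache.get("system-prompt-b"))   # None
print(cache.get("system-prompt-a"))   # kv-a
```

The point of the sketch is the access pattern: whichever prefix you stop touching is the one that gets pushed out, which is why query construction and cache management end up intertwined.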


Opportunities exist in other parts of the stack, too. Lower down, there's the question of how data centers are using the different types of memory they have. (The interview includes a good discussion of when DRAM chips are used instead of HBM, although it's pretty deep in the hardware weeds.) Higher up the stack, end users are figuring out how to structure their model swarms to take advantage of the shared cache.

As companies get better at memory orchestration, they'll use fewer tokens and inference will get cheaper. Meanwhile, models are getting more efficient at processing each token, pushing costs down still further. As server costs drop, a lot of applications that don't seem viable now will start to edge into profitability.
