The Middle Layer Has No Moat

March 20, 2026

There is a pattern in the current wave of AI startups that I think is going to end badly for a lot of investors. The pattern is: a foundation-model lab ships a new capability, and within weeks a startup appears that wraps that capability into a developer-friendly product. The startup raises money, acquires customers, and for a while looks like it's building something real. Then the lab ships the next model version, the new capability is built in natively, and the startup's product becomes redundant.

This has already happened several times, and it's about to happen a lot more.

The AI stack, as it's settling into shape, has three layers. At the bottom are the foundation models: OpenAI, Anthropic, Google, DeepSeek. These companies train large models on massive compute clusters with proprietary data pipelines. In the middle are what I'll call middleware companies: they take the raw capabilities of foundation models and package them into specific functionalities. RAG orchestration, voice AI pipelines, agent frameworks, MCP tooling. At the top are application companies: they build products that actual humans use. A customer support tool, a legal research product, a creative writing assistant, a live speaking coach.

The thesis of this essay is simple: the middle layer is a terrible place to build a company. It has no durable moat, and the value it captures today will be squeezed to zero by the layers above and below it.

The reason is structural. Middleware companies are squeezed from both directions. From below, foundation models are improving at a pace that routinely swallows middleware functionality whole. From above, application companies choose their infrastructure purely on quality and cost, and will switch the moment a better option appears. The middleware company has no leverage on either side.

Take voice AI as an example. Vapi built a business orchestrating voice interactions on top of existing models: handling turn-taking, latency, interruptions, the messy engineering of real-time conversation. This was genuinely hard to do well, and Vapi did it well. Then in January 2026, Nvidia released PersonaPlex, a 7-billion-parameter model that collapses the entire traditional voice AI pipeline (speech recognition, language model, text-to-speech) into a single unified network that listens and speaks simultaneously. Turn-taking, interruptions, backchannels, natural pauses: all handled natively inside the model at 240-millisecond latency. PersonaPlex is not yet polished enough to replace Vapi in production today. It has context window limitations, stability issues in longer conversations, and lower audio fidelity. But the direction is unmistakable. The model is doing in one pass what Vapi's orchestration layer does by stitching together three separate systems. Each new version will close the remaining gaps. This is not a question of if, but when. And "when" in AI tends to mean months, not years.
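To make the collapse concrete, here is a deliberately simplified sketch of the two architectures. Every name in it is hypothetical; this is not Vapi's API or PersonaPlex's, just the shape of the difference:

```python
# Hypothetical interfaces, sketched for illustration only.

def pipeline_turn(audio_in, asr, llm, tts):
    """Traditional voice pipeline: three models stitched together.
    Latency accumulates at every hop, and turn-taking logic
    (interruptions, backchannels) lives in glue code around this."""
    text_in = asr.transcribe(audio_in)    # speech -> text
    text_out = llm.generate(text_in)      # text -> text
    return tts.synthesize(text_out)       # text -> speech

def unified_turn(audio_in, speech_model):
    """Unified speech-to-speech model: one forward pass.
    Turn-taking and prosody are handled inside the model itself,
    leaving no orchestration layer to sell."""
    return speech_model.respond(audio_in) # speech -> speech, directly
```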

This is not a one-time event. It is the recurring dynamic of the middle layer. Consider RAG. A year ago, building a retrieval-augmented generation pipeline required real engineering: chunking strategies, embedding models, vector databases, reranking, prompt construction. Companies sprang up to package this. But context windows keep getting longer, models keep getting better at using raw context, and the foundation model providers are building retrieval directly into their APIs. The standalone RAG company is solving a problem that is actively disappearing.
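For readers who never built one, here is roughly what that pipeline looks like, as a minimal sketch with hypothetical embed, vector_db, and llm stand-ins rather than any particular library:

```python
# A compressed sketch of the classic RAG pipeline. The embed(),
# vector_db, and llm arguments are hypothetical stand-ins.

def chunk(document: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking; real systems tune this heavily."""
    return [document[i:i + size] for i in range(0, len(document), size)]

def answer(question: str, documents: list[str], embed, vector_db, llm) -> str:
    # Index: chunk each document and store its embedding.
    for doc in documents:
        for piece in chunk(doc):
            vector_db.add(vector=embed(piece), payload=piece)

    # Retrieve: nearest chunks to the question, optionally reranked.
    hits = vector_db.search(vector=embed(question), top_k=5)

    # Prompt construction: stuff the retrieved context into the prompt.
    context = "\n\n".join(h.payload for h in hits)
    return llm(f"Answer using only this context:\n{context}\n\nQ: {question}")
```

Every line of glue in that sketch is a line the foundation model providers are absorbing.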

Or consider MCP, the protocol for connecting AI models to external tools. Building an MCP server or tooling around it is useful today. But MCP is a protocol, not a product. Protocols are adopted, not purchased. Once MCP becomes standard, the companies that merely package it have nothing proprietary left to sell. They're building a business on top of something designed, by definition, to be open and interchangeable.
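To see how thin the packaging layer really is, here is a complete toy MCP server using the official Python SDK's FastMCP helper (API as of this writing; the tool itself is a throwaway example):

```python
# A complete toy MCP server using the official `mcp` Python SDK.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```

That is the whole server. The protocol and the SDK do the heavy lifting, which is precisely why wrapping them is not a defensible business.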

The pattern is always the same: the middleware company identifies a real gap between what the model can do and what the application needs. It fills that gap. Then the model improves and the gap closes. The middleware company has to find a new gap, but the new gap is smaller, and it closes faster. This is a treadmill, not a business.

I should be precise about what kinds of middleware I mean, because there are two, and the distinction matters more than the label.

The first type fills capability gaps: the model can't do X yet, so the middleware does X on top of the model. RAG, voice orchestration, most agent frameworks. These are the ones that get swallowed, because their entire reason for existing disappears when the model learns to do X natively. If your company exists because the model can't yet do something, you are building on a countdown timer.

The second type manages the complexity that model capabilities create. LLM observability, safety guardrails, access control, audit logging, compliance. These don't fill a gap in what the model can do. They deal with the problems that arise because the model can do so much. A model getting smarter doesn't eliminate the need to monitor it. It makes the need greater.

This second type has a real chance of survival, and it's worth understanding why. As models become more capable and more autonomous, the surface area for things going wrong expands. A model that can browse the web, execute code, and call external APIs needs more guardrails than one that only generates text. An enterprise deploying agents across its organization needs more observability, not less. The value proposition of complexity-management middleware grows with model capability rather than shrinking. It has the same directional tailwind as the foundation models themselves.
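To make the distinction concrete, here is a minimal sketch of what this kind of middleware does, with every name hypothetical: wrap the model call, enforce a policy, keep an audit trail.

```python
# A minimal sketch of complexity-management middleware.
# All names here are hypothetical stand-ins.

import json
import time
from typing import Callable

def guarded(model_call: Callable[[str], str],
            is_allowed: Callable[[str], bool],
            audit_log) -> Callable[[str], str]:
    """Wrap a model call with a guardrail check and an audit trail.
    Note what this does NOT depend on: how capable the model is.
    A smarter, more autonomous model gives the policy check and the
    log more to do, not less."""
    def wrapped(prompt: str) -> str:
        if not is_allowed(prompt):
            raise PermissionError("prompt blocked by policy")
        start = time.time()
        output = model_call(prompt)
        audit_log.write(json.dumps({
            "ts": start,
            "latency_s": round(time.time() - start, 3),
            "prompt": prompt,
            "output": output,
        }) + "\n")
        return output
    return wrapped
```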

But even here, the moat is conditional. If a complexity-management tool's value comes purely from clever engineering (a better dashboard, a smarter alerting system) it remains vulnerable. The durable version needs something outside the domain of pure software: deep integration into enterprise compliance workflows, regulatory expertise, accumulated incident data that improves over time. In other words, the same kinds of institutional and data moats that protect the layers above and below.

Now, someone will point out that middleware companies have existed successfully in other parts of the technology stack. Stripe sits between applications and payment networks. Twilio sits between applications and telecom carriers. These are middleware, and they're enormously valuable. Why should AI middleware be different?

The answer is that Stripe's moat is not its code. Stripe's moat is financial regulation, banking licenses, compliance infrastructure, and relationships with payment networks across dozens of countries. These are barriers that cannot be replicated by a better model or a larger training run. Twilio's moat is carrier relationships and telecom infrastructure. The value in these companies comes from navigating complex, slow-moving, human-institutional systems that exist outside the domain of software entirely.

Foundation model companies have an analogous structure. Their moat is not their architecture, which gets published in papers and replicated within months. Their moat is compute access, proprietary data, and a concentration of research talent that takes years to build. These are hard, physical, institutional barriers.

AI middleware has neither kind of moat. Its entire value proposition is a technical implementation: stitching together model capabilities in a useful way. Technical implementations are exactly what gets eaten when the layers above and below you move fast. You have no regulatory barrier, no physical infrastructure, no institutional relationships, no accumulated data that gets better with use. You have code. And code that merely stitches together existing capabilities, in the age of AI, is the cheapest thing in the world.

What does have a moat is the application layer. And the reason is simple: applications face humans.

When a human being uses your product every day, builds their workflow around it, accumulates data inside it, and trains their team on it, you have switching cost. When your product has a brand that users trust, you have pricing power. When your interface is optimized for how humans see and decide, as I've argued in "GUIs Are Not Going Anywhere", you have a GUI moat that no API call can replicate.

This doesn't mean every application that faces humans automatically has a moat. An AI writing assistant or a chatbot wrapper has near-zero switching cost, because the user has nothing accumulated inside it. The application-layer moat comes specifically from data accumulation, workflow integration, and habitual use. The question is not "does a human use it?" but "would a human feel pain if they had to switch?"

The practical implication for startups is clear. If you're building in AI, build at the application layer. Build something a human being uses, cares about, and would miss if it disappeared. Don't build plumbing between the model and the app. That plumbing is going to be absorbed by the model on one side and the app on the other, and you'll be left with nothing.

I know middleware feels safe. In every gold rush, the saying goes, sell picks and shovels. But think about what maps to what. The picks and shovels in the AI gold rush are the model capabilities themselves: GPT, Claude, Gemini. These are the tools every builder needs. The middleware company isn't selling the shovels; it's repackaging them with a nicer handle. And when the manufacturer starts shipping better handles by default, the repackager has nothing left. The application company, by contrast, is the jeweler in town: it takes raw material and turns it into something a human actually wants to own. People don't love a jeweler because of the mine the gold came from. They love the jeweler because of what the gold became in the jeweler's hands.

The companies that will endure in the AI era are the ones closest to humans on one end and closest to the physical constraints of computation on the other. The middle, where the only barrier is clever engineering, is exactly where you don't want to be.