Building on Jagged Intelligence

The debate over whether language models can genuinely reason has been raging for a while now. Gary Marcus put it this way:
"LLM 'reasoning' is so cooked they turned my name into a verb."
Karpathy has a word for what Marcus is reacting to. He calls it jagged intelligence:
"The word I came up with to describe the (strange, unintuitive) fact that state-of-the-art LLMs can both perform extremely impressive tasks (e.g., solve complex math problems) while simultaneously struggling with some very dumb problems."
I follow this debate closely. But I experience it differently than the commentators. For me, jagged intelligence isn't an intellectual curiosity. It's a daily engineering problem.
The Carpet
The image I carry is of a carpet factory, but not one that works the way you'd expect.
Instead of the carpet being woven row by row, left to right, many different sections are being constructed simultaneously. The result is riddled with glaring holes, a Swiss cheese whose gaps close gradually over time. A bizarre way to make a carpet.
This is how we're scaling intelligence. Not layer by layer, not capability by capability, but everything at once. Unevenly, unpredictably. One month the carpet gains a brilliant patch of PhD-level mathematical reasoning. The next month you discover it still can't count the r's in "strawberry."
The debate splits predictably: optimists reach for "reasoning" and "agents"; pessimists point at the holes; a third camp worries about what happens when the carpet is finished. Each new model release restarts the same cycle of hype, doom, or dismissal.
Building on a Shifting Surface
But here's what none of them fully grapple with: what it means to build on top of this carpet while it's still being woven chaotically.
If you're building products on top of these models, jagged intelligence isn't a philosophical debate. It's a practical problem. Every architectural decision is a bet on whether a hole is permanent or about to close.
The clearest example is RAG. Two years ago, context windows were tiny: 4K or 8K tokens. You couldn't fit a meaningful codebase or a moderately complex document into a single prompt. So an entire ecosystem of retrieval-augmented generation emerged. Vector databases, chunking strategies, embedding pipelines. Billions in venture capital poured into engineering around a hole in the carpet.
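For anyone who didn't live through that era, the machinery looked roughly like this. A minimal sketch of the pattern, not any particular product: the embed() below is a toy hashed bag-of-words stand-in for a real embedding model, and a production system would precompute chunk vectors into a vector database instead of embedding on every query.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: hashed bag-of-words. A real pipeline would call an
    embedding model and store the vectors in a vector database."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def chunk(document: str, size: int = 200) -> list[str]:
    """Fixed-size word chunks; real systems fought hard over smarter splits."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity to the query; keep the top k.
    Vectors are unit-normalized, so the dot product is the cosine."""
    q = embed(query)
    return sorted(chunks, key=lambda c: -float(embed(c) @ q))[:k]

def build_prompt(query: str, document: str) -> str:
    """The whole point: the document didn't fit in a 4K window,
    so we sent the model only the slices that seemed relevant."""
    context = "\n---\n".join(retrieve(query, chunk(document)))
    return f"Context:\n{context}\n\nQuestion: {query}"
```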
Then the models shipped 128K context. Then a million tokens. Now ten million. The hole closed. Not fully. Retrieval still has its uses. But the architectural bet that "context windows will always be small" got steamrolled. Teams that built their entire product identity around RAG found the ground shifting beneath them.
Domain fine-tuning tells the same story. Companies spent months curating domain data (radiology reports, German legal contracts), managing training runs, wrestling with catastrophic forgetting. Then the next foundation model dropped and its zero-shot performance matched or beat the fine-tune. Months of work compressed into a capability that shipped for free in someone else's next release. Fine-tuning hasn't disappeared, but the window where it's necessary keeps shrinking. The bet isn't "should I fine-tune?" anymore. It's "will this fine-tune still matter in six months?"
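That question can at least be made mechanical. A sketch, not anyone's actual harness: the model callables are placeholders for however you invoke your fine-tune and the newest base model.

```python
from typing import Callable

Model = Callable[[str], str]  # prompt in, answer out; provider-agnostic

def accuracy(model: Model, eval_set: list[tuple[str, str]]) -> float:
    """Fraction of held-out (input, expected_output) pairs answered correctly."""
    return sum(model(x).strip() == y for x, y in eval_set) / len(eval_set)

def fine_tune_still_matters(fine_tuned: Model, new_base: Model,
                            eval_set: list[tuple[str, str]],
                            margin: float = 0.02) -> bool:
    """Retire the fine-tune once the new base model's zero-shot score
    comes within `margin` of it on your own domain data."""
    return accuracy(fine_tuned, eval_set) > accuracy(new_base, eval_set) + margin
```

Run it against a held-out domain set every time a foundation model ships. The day it returns False, the fine-tune has been absorbed into the carpet.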
Where This Hits
I build AI for public health in India: systems for frontline health workers, disease screening, agricultural advisories. Every few months, I face the same question. We built custom ASR pipelines for Hindi dialects that no off-the-shelf system could handle. Then Gemini improved. Sarvam dropped cost-effective options. We engineered elaborate hallucination guardrails for health advisories. The base models rarely hallucinate now. Each time, the same calculus: how much of this infrastructure will be redundant in twelve months?
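One defensive pattern that has earned its keep: hide each capability behind a thin interface, so that when a hole closes, the swap is a config change rather than a rewrite. A sketch with hypothetical names; the stubs stand in for our real pipeline and the hosted APIs.

```python
from typing import Protocol

class Transcriber(Protocol):
    def transcribe(self, audio: bytes, language: str) -> str: ...

class CustomHindiASR:
    """The pipeline we built when nothing off the shelf handled the
    dialects. Lives behind the interface, so it is cheap to retire."""
    def transcribe(self, audio: bytes, language: str) -> str:
        raise NotImplementedError  # stands in for the custom pipeline

class HostedASR:
    """Adapter for whichever hosted model currently wins our evals."""
    def __init__(self, provider: str) -> None:
        self.provider = provider  # e.g. "gemini" or "sarvam"
    def transcribe(self, audio: bytes, language: str) -> str:
        raise NotImplementedError  # stands in for an API call

def get_transcriber(config: dict) -> Transcriber:
    # The swap point. Revisiting this one branch is the twelve-month calculus.
    if config.get("asr_provider") == "custom":
        return CustomHindiASR()
    return HostedASR(config["asr_provider"])
```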
Build too much scaffolding around a gap, and the model fills it in. Your work is wasted. Build too little, and your product fails today while you wait for a future that may not arrive on schedule. There's no safe answer. You're walking on the carpet while it's being woven beneath you.
Where I Stand
The carpet is growing. Unevenly, unreliably, undeniably. Building on it is fine. Everyone does. But the trap is subtler than it looks. The holes in the carpet don't just look like problems. They look like opportunities. A gap feels like a market. You pour engineering into it, raise money around it, build a team around it. Then the gap closes and you realize you were investing in someone else's next release.
This warps the oldest question in engineering: what's worth building, and what's worth buying? When the "buy" side improves dramatically every few months, something worth a team's effort today might become a default API capability tomorrow. The work doesn't just depreciate. It gets commoditized from underneath you.
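You can put rough numbers on how fast the "buy" side eats the "build" side. A back-of-envelope sketch with illustrative figures, not a model of any real market:

```python
def expected_build_value(
    monthly_value: float,      # value the capability earns per month
    build_cost: float,         # one-time engineering cost
    horizon_months: int,       # planning horizon
    p_close_per_month: float,  # chance the models absorb it in any given month
) -> float:
    """Expected value of building: monthly value accrues only while the
    hole stays open, after which the capability is commoditized."""
    value, p_still_open = 0.0, 1.0
    for _ in range(horizon_months):
        value += p_still_open * monthly_value
        p_still_open *= 1.0 - p_close_per_month
    return value - build_cost

# Without commoditization risk this would net 10 * 12 - 40 = 80; a 10%
# monthly chance that the capability ships for free cuts it to about 32.
print(expected_build_value(10, 40, 12, 0.10))  # ~31.8
```

The exact figure doesn't matter. What matters is that the commoditization term dominates everything else in the formula.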
I run every architectural decision through one question now. Not "will this hold?" Not "can I build this cheaper than I can buy it?" Those are the wrong questions when the ground shifts every few months. The question is simpler, and harder to answer honestly:
If this hole closes tomorrow, does what I've built get better, or does it disappear?