Nate just dropped a video called "The Gap Between Dark Factories and Everyone Else" and if you work in software, you should watch it. He uses Dan Shapiro's 5 Levels of Vibe Coding as a diagnostic framework — and he nails the diagnosis: most teams are stuck between Level 2 and Level 3, cargo-culting AI features without fundamentally rethinking how they build.
He's right. The gap is real. The question is: what should you be aiming for?
My answer: not Level 5.
At least not yet. Not for most of what you're building. And not in the way most people picture it.
The Framework
For those who haven't seen the video, here's a quick orientation:
- Level 0 — Spicy Autocomplete. AI suggests the next line. You accept or reject. You're still writing every line.
- Level 1 — Coding Intern. You hand the AI a discrete task: write this function, refactor this module. You review everything.
- Level 2 — Junior Developer. AI handles multi-file changes, navigates dependencies, builds features spanning modules. You're still reading all the code. Shapiro estimates 90% of "AI-native" developers operate here.
- Level 3 — Developer as Manager. You direct the AI and review at the feature level. The AI submits PRs. You approve or reject. Almost everybody tops out here.
- Level 4 — Developer as Product Manager. You write a spec, walk away, come back and check whether the tests pass. You're evaluating outcomes, not reading code.
- Level 5 — The Dark Factory. Spec goes in, working software comes out. No human writes code. No human reviews code. The factory runs with the lights off. StrongDM famously operates here with three engineers.
Level 5 is real. It works. StrongDM isn't lying.
But here's what gets glossed over: Level 5 didn't emerge from thin air. It was built — on years of explicit architecture, rigorous spec culture, and the kind of systematic thinking that most teams haven't done yet.
The J-Curve — And What Happens When You Leap Too Far
The J-Curve of AI adoption is well documented — productivity dips before it improves. I've written about the METR data before: experienced developers getting measurably slower with AI tools, while believing they're faster. The pattern is consistent across studies.
What's less discussed is that the depth of the J-Curve is a function of how far you leap.
A team sliding from Level 3 to Level 4 experiences a manageable dip. A team jumping from Level 2 directly to Level 5 doesn't just dip — they crater. Their spec culture can't support full autonomy. The output is wrong, the team loses trust, and they retreat to manual work. The J-Curve becomes a J-Cliff.
We have real-world proof of this now.
Klarna is the cautionary tale. In 2024, they went fully autonomous on customer service — replaced 700 human agents with AI, cut 22% of their workforce. Their CEO called it the future. Nine months later, he admitted it led to "lower quality" and launched a recruitment drive to bring humans back. By September 2025, they were reassigning engineers and marketers to customer support to patch the damage. As one analyst put it: "You saved $50,000 on payroll, but lost $5 million in lifetime value."
They're not alone. Gartner found that half of executives who planned AI workforce reductions are now abandoning those plans. Forrester expects 55% of corporations to regret their AI replacement projects. Forbes reported that only 5% of enterprise AI pilots deliver measurable financial impact.
The pattern is consistent: leap to full autonomy without the methodology, and you don't just fail — you fail publicly. The reputational damage compounds the financial cost. The J-Curve becomes unrecoverable.
For most teams, jumping straight to Level 5 is like buying F1 tires for a car that still has drum brakes. The bottleneck isn't the rubber.
The Fatal Flaw in the Dark Factory
Here's the structural problem with Level 5: it assumes the spec can be made perfect and that the outcome will be fully measurable and deterministic.
The Dark Factory is excellent at answering: does this work?
It cannot answer: is this the right thing to build?
Those are not the same question. And in most real-world projects, the second question is harder — and more expensive to get wrong.
Think about what a spec actually encodes. It captures what you know — the requirements you understood, the user behavior you anticipated, the constraints you remembered to write down. What it doesn't encode is:
- Domain intuition. The experienced engineer who says "this feels off" before they can articulate why.
- UX judgment. The moment someone looks at a working feature and says "yes, but a user will never understand this."
- Architectural wisdom. The pattern recognition that flags "we've built something like this before and it became unmaintainable in six months."
- Edge case triage. Knowing which edge cases are critical now versus which can wait — a prioritization that depends on experience with real users, not spec completeness.
- The unknown unknowns. The monster that emerges mid-development — the integration that doesn't behave as documented, the assumption that was wrong from day one, the requirement nobody thought to write down because everyone "just knew." No scenario suite catches what nobody imagined.
These things cannot be encoded in acceptance criteria. They emerge from humans engaging with the work — not just reviewing a boolean pass/fail on a test suite.
Level 5 optimizes ruthlessly for execution. It does exactly what it's told. The risk is that you told it the wrong thing, and you won't know until you've shipped it to users.
The Brownfield Reality
There's another problem that almost never gets mentioned in Level 5 discussions: most software isn't greenfield.
StrongDM built their Dark Factory on a product with known architecture, explicit APIs, and a spec culture refined over years. That's not most people's reality.
Most teams are dealing with brownfield systems. Legacy codebases. Implicit knowledge living in the heads of engineers who've been there for five years. Business logic that exists nowhere in documentation and only shows up when something breaks.
Nate nails this in the video: the billing module with the one edge case for Canadian customers. The engineer who remembers which microservice was carved out of the monolith under duress during the 2021 outage. The environment variable that someone set to a specific value three years ago and nobody remembers why — but if you change it, production breaks. That knowledge lives in people, not in specs.
You cannot dark-factory a legacy system. The spec doesn't exist. Before you can automate execution, someone has to reverse-engineer the implicit knowledge out of the codebase and into explicit form. That's not a Level 5 task. That's deeply human work — archaeology, interviews, judgment calls about what matters.
The teams that need AI transformation most urgently are the ones with the least spec infrastructure. For them, the path isn't toward the Dark Factory. It's toward building the foundation that would make a Dark Factory eventually possible.
The Missing Level: A Spectrum, Not a Step
So where does that leave us?
Between Level 4 and Level 5, there's an operating mode that doesn't have a name. I'd call it The Augmented Workshop.
The key insight: it's not a fixed level. It's a spectrum. Every task in a project can slide between full human judgment (Level 4) and full autonomy (Level 5) based on one question.
The Determinism Test
"Can you write a complete, fixed set of rules to validate the outcome of this task?"
If yes → push toward Level 5. Define the acceptance criteria, write the scenarios, let automation handle it. The output is deterministic. You'll know when it's right.
If no → pull toward Level 4. Human judgment is required somewhere in the loop. Don't pretend the acceptance criteria are sufficient when they're not.
A form validation rule? Push it to 5. Define the rules, automate the checks.
A UI flow for a complex multi-step process with first-time users? Pull it to 4. No specification fully captures whether it feels right. You need a human to experience it.
The Determinism Test isn't a one-time architectural decision. It's a live question asked before every task. And the answer can change as the project matures — what starts at Level 4 (ambiguous, needs judgment) may become Level 5 as the domain crystallizes and the acceptance criteria become complete.
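The routing logic above can be sketched in a few lines of code. This is an illustrative model, not a real tool: the `Task` fields and the two routing labels are hypothetical names invented here to make the test concrete.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A unit of sprint work (illustrative fields, not from any real tool)."""
    name: str
    criteria_complete: bool   # can a fixed set of rules validate every outcome?
    deterministic: bool       # does the same input always yield a verifiable output?


def determinism_test(task: Task) -> str:
    """Route a task along the Level 4 <-> Level 5 spectrum.

    If the outcome can be fully validated by fixed rules, push toward
    full autonomy (Level 5); otherwise keep a human in the loop (Level 4).
    """
    if task.criteria_complete and task.deterministic:
        return "level_5_automate"    # define scenarios, delegate, verify mechanically
    return "level_4_human_review"    # judgment is required somewhere in the loop


tasks = [
    Task("form validation rule", criteria_complete=True, deterministic=True),
    Task("multi-step onboarding UX flow", criteria_complete=False, deterministic=False),
]
routing = {t.name: determinism_test(t) for t in tasks}
```

Note that the answer is per task and re-askable: as the domain crystallizes, a task's flags flip and it slides toward Level 5 without any architectural decision being revisited.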
The Sprint Rhythm
The spectrum plays out as: automate the steps, humanize the checkpoints.
Within a sprint, individual tasks with clear, deterministic acceptance criteria can be fully delegated. The AI agent takes the task, executes, and delivers a verifiable output — Level 5 within the sprint.
But at sprint boundaries — integration points, UX reviews, architectural retrospectives — the human is back in the loop. Not as a micromanager. As a strategist engaging at the moments where human judgment is irreplaceable.
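A minimal sketch of that rhythm, assuming the Determinism Test has already tagged each task. Every name here (`run_sprint`, `agent_execute`, `human_checkpoint`, the dict keys) is hypothetical, invented for illustration:

```python
def run_sprint(tasks, agent_execute, human_checkpoint):
    """Sketch of 'automate the steps, humanize the checkpoints'.

    tasks: dicts with a 'deterministic' flag (the Determinism Test result).
    agent_execute: delegates one task to an AI agent end to end.
    human_checkpoint: the sprint-boundary review — the human as strategist.
    """
    delegated, escalated = [], []
    for task in tasks:
        if task["deterministic"]:
            delegated.append(agent_execute(task))   # Level 5 within the sprint
        else:
            escalated.append(task)                  # human judgment in the loop
    # Sprint boundary: integration, UX review, architectural retrospective
    return human_checkpoint(delegated, escalated)


# Usage sketch
done = run_sprint(
    [{"name": "form validation", "deterministic": True},
     {"name": "onboarding UX", "deterministic": False}],
    agent_execute=lambda t: {**t, "status": "verified"},
    human_checkpoint=lambda d, e: {"shipped": d, "needs_review": e},
)
```

The point of the structure is that the human never reviews inside the delegated path — only at the boundary, where judgment is irreplaceable.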
Think of it as the Iron Man suit. The suit handles thrust vectors, targeting, impact absorption — all the execution. Tony Stark makes the strategic calls. He decides where to fly and who to fight.
StrongDM removed the pilot. For their product, with their spec maturity, that was the right call.
For most teams, the pilot is what keeps the suit from flying into the wrong building.
What This Looks Like in Practice
I ran this approach recently on a project scoping engagement.
32 tasks broken down from a project specification. AI sub-agents generated the detailed task breakdowns in under 30 minutes: acceptance criteria, dependencies, effort estimates, technical notes. Full execution-ready specs.
Then I reviewed every single one. That took about an hour — adjusting sequences, tightening acceptance criteria, catching assumptions that didn't match how the client actually works. I wasn't starting from scratch. I was building on top of solid AI output, refining it with the domain knowledge and judgment that only comes from decades of shipping software.
The end result was more robust and carried a higher level of confidence than either approach alone could achieve. 90 minutes total — 30 for generation, 60 for human refinement — to produce 32 high-quality, execution-ready task definitions.
Two years ago, that same work would have taken a week. A solo engineer writing specs from scratch, context-switching between architecture decisions, acceptance criteria, dependency mapping, and effort estimation. The AI compressed the mechanical work. The human ensured the output was right, not just done.
That's the Augmented Workshop in action. The AI didn't replace my judgment. It gave me a foundation to apply it faster.
The Comparison
| | Dark Factory (Level 5) | Augmented Workshop (Spectrum) |
|---|---|---|
| Who reviews code? | No one | Depends on the Determinism Test |
| Spec requirement | Must be complete and explicit | Iteratively refined with human input |
| Works on brownfield? | Rarely | Yes — human handles spec archaeology |
| Handles UX judgment? | No | Yes — human in the loop at review points |
| Edge case triage | All must be specified upfront | Human prioritizes as project evolves |
| Unknown unknowns | Fatal — no scenario catches what nobody imagined | Human detects and adapts mid-flight |
| J-Curve risk | High — leap can be unrecoverable | Lower — gradient approach, shallow dip |
| Right for greenfield + mature spec culture | ✓ | ✓ |
| Right for legacy systems | ✗ | ✓ |
| Right for most teams today | ✗ | ✓ |
| Speed | Maximum | High |
| Craft | Dependent entirely on spec quality | Human judgment preserved |
Flattening the J-Curve
Klarna cratered because they leaped. The spectrum approach controls the depth of the curve.
When you're working the spectrum — sliding tasks toward Level 5 only when the Determinism Test passes — you're simultaneously shipping and developing the spec quality standards that full autonomy demands. The methodology matures alongside the automation. The dip is shallow enough to survive.
This is the path to Level 5 that actually works: not a leap, but a gradient. You build toward the Dark Factory without betting the project on it. And at every step, the human in the loop catches what the spec missed — because the spec is never perfect.
The Augmented Workshop assumes that. The Dark Factory pretends otherwise.
Where Does This Leave You?
If you're a technical leader reading this, ask yourself:
- What's your current level? If you're honest, most of your team is at Level 2-3. The gap isn't just tooling — it's methodology.
- Do your acceptance criteria capture success, or just correctness? Those aren't the same thing.
- How's your spec culture? Level 5 is only as good as the specifications feeding it. If your specs are mediocre, your Dark Factory will produce mediocre software at high speed.
- Do you have brownfield systems that need to evolve? If yes, the implicit knowledge extraction problem is yours to solve before automation helps.
The factory produces volume. The workshop produces craft.
Know which one you need. Build accordingly.