Teaching an AI to Build Software

Five Iterations of Spec-Driven Development with Claude Code and Spec-Kit

I spent a week building the same full-stack application five times with Claude Code and GitHub Spec-Kit. Not because the first attempt failed — but because I wanted to understand how to get consistently good output from an AI coding agent.

The application is an interview scheduling tool: interviewers set availability, the system pairs them based on roles, candidates pick time slots, and everyone gets notified. It's a real-world CRUD app with enough complexity to expose process failures — 5 user stories, 17 functional requirements, 8 database entities, 10 pages of UI, role-based auth, and transactional booking with double-booking prevention.

Each iteration used the same spec and the same model (Claude Opus 4.6 via Claude Code). What changed was the process I wrapped around it. Here's what I learned.

The Setup

I'm using spec-kit, an open-source spec-driven development tool that can be used with Claude Code. It provides a set of slash commands that enforce a structured workflow: /speckit.spec to write the specification, /speckit.plan to design the architecture, /speckit.tasks to generate an implementation plan, and /speckit.implement to execute it. These commands are backed by templates, rules, and a project constitution that define how the agent should work.

I used spec-kit from the very first iteration. The early runs used the default templates. Starting with iteration 3, I began customizing them — adding enforcement tables, reconciliation checks, and lessons-learned integration. The tool gives you the skeleton; I spent the week figuring out what to put in the bones.

The constitution is a markdown file that establishes non-negotiable principles: infrastructure before features, TDD, API-first design, security-first, and framework-native patterns. Think of it as a style guide for the agent's decision-making process, not just its code output.
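
To make that concrete, a constitution entry might look something like the sketch below. This is an invented illustration of the format, not the actual file from the project:

```markdown
## Principle 2: Infrastructure Before Features (NON-NEGOTIABLE)

CI, linting, type checking, and database migrations MUST be working before
the first feature task begins. Rationale: feature code written without
guardrails accumulates violations that are expensive to unwind later.

## Principle 4: Framework-Native Patterns

Use the framework's idioms (form actions, load functions, generated
migrations) instead of hand-rolled equivalents. Deviations MUST be
justified in the plan document.
```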

Iteration 1: Incremental TDD ($634 API-equivalent)

The first attempt followed every rule in the book. Write a failing test, implement the minimum to pass it, refactor, repeat. Three sessions over two days: one for spec and planning, two for implementation.

The result was excellent. 155 tests across 30 files — unit, integration, and E2E. Every route worked. Rich seed data. Proper migrations. The code was production-grade.

But the process was brutally expensive. The TDD cycle (write test, run, fail, implement, run, pass) generates many small API round-trips, and each call re-reads the growing conversation history. At 2,880 API calls and 238 million cache-read tokens, this approach burned through capacity fast. The same codebase could have been produced with far fewer round-trips.

Takeaway: Strict TDD with an AI agent multiplies API calls without proportional quality gains. The agent doesn't need the pedagogical discipline of TDD — it doesn't learn by failing. What it needs is a clear spec and test expectations upfront.

Iteration 2: Big Bang ($153 API-equivalent)

The opposite extreme: throw the entire spec at the agent in a single session and let it write everything at once. One session, 524 API calls, a quarter of the cost.

The code looked right. Unit tests passed. Terraform infrastructure and deploy pipelines included — things the incremental approach never got to. On paper, feature parity.

Then I tried to run it.

The schedule routes returned 500 errors. No seed data beyond round config, so every page was empty. No migrations directory — you had to push the schema with drizzle-kit push. No logout endpoint. No ESLint rule enforcing the service layer boundary. Unit tests passed because they mocked the database, so the relational query bugs never surfaced.

The code compiled. The tests passed. The application didn't work.

Takeaway: An AI agent will satisfy the literal requirements you give it. If you don't require runtime validation, you don't get runtime validation. Unit tests with mocked dependencies prove the code is internally consistent, not that it functions.

Iteration 3: Customizing the Templates ($177 API-equivalent)

The first two iterations used spec-kit's default templates as-is. For iteration 3, I started customizing them — refining the plan template's structure, adding more specificity to the task generation, and tightening the constitution's project-specific constraints.

The output was meaningfully better: proper migrations, rich seed data, ESLint service boundary enforcement, and a codebase that worked out of the box. The overhead of the template customization was minimal — $23 more than the raw big-bang approach.

Still no E2E tests, though. The spec said to write them, the constitution mandated them, the CLAUDE.md file listed the command to run them. But no E2E test was generated.

Takeaway: Structured planning works. But rules that exist only in documentation get ignored. The agent doesn't "forget" them — it never reads them at the decision point where they matter.

The Lessons-Learned Breakthrough

After iteration 3, I sat down and documented every failure across all three implementations. Not as retrospective notes — as enforcement mechanisms.

Each lesson followed a template:

  • What was observed: The specific failure
  • What rule existed: Where the requirement was documented
  • Why it wasn't enforced: The structural reason the rule was bypassed
  • What enforcement was added: The mechanical change to prevent recurrence

Five failures produced five enforcement entries:

LL-001: Spec technology silently ignored. shadcn-svelte was listed as an active technology, but all 9 pages were built with raw Tailwind. The spec mentioned it once in the header — never in any task. Fix: Added a Tech Stack Manifest table to the plan template mapping each technology to its setup phase and a Tech Stack Reconciliation table to the tasks template requiring a binary check that every manifest entry has a corresponding setup task.

LL-002 & LL-003: Existing linter rules ignored. ESLint was configured to block direct database imports in routes and to ban inline comments. The agent didn't read eslint.config.js before writing code. The violations were caught at lint time, but the code was already written — wasted effort. No template change needed — the enforcement was already mechanical. The lesson was that the agent should read lint config before writing new code.
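
For reference, a service-layer boundary rule of this kind can be expressed with ESLint's built-in no-restricted-imports rule. This is a sketch — the paths and message are illustrative, not the project's actual config:

```javascript
// eslint.config.js — flat-config sketch of a service-layer boundary rule.
// Paths ($lib/server/db, $lib/server/services) are illustrative.
export default [
  {
    files: ['src/routes/**/*.ts', 'src/routes/**/*.svelte'],
    rules: {
      // Route files must go through the service layer,
      // never import the database client directly.
      'no-restricted-imports': [
        'error',
        {
          patterns: [
            {
              group: ['$lib/server/db', '$lib/server/db/*'],
              message:
                'Import a service from $lib/server/services instead of the db client.',
            },
          ],
        },
      ],
    },
  },
];
```

Because this runs at lint time, it sits at the top of the enforcement hierarchy: code that violates the boundary fails CI regardless of whether the agent read any rule file.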

LL-004: Infrastructure skipped. The task plan had 76 feature tasks and zero infrastructure tasks, despite the constitution saying infrastructure must come first. The rule was in a separate file from where task generation decisions were made. Fix: Marked the infrastructure phase as BLOCKING in the tasks template with explicit verification checkpoints.

LL-005: E2E tests completely omitted. Both the incremental and big-bang approaches shipped zero E2E tests. The requirement existed in three separate rule files but was never surfaced during planning or task generation. Fix: Added a Critical User Flows table to the plan template and an E2E Test Coverage Reconciliation table to the tasks template — the same pattern as the tech stack manifest.

The pattern across all five failures was identical: rules that exist only as prose get silently dropped. The agent doesn't "refuse" to follow them. It never encounters them at the moment it makes the decision to skip them.

This produced a principle I now consider the most important insight from this entire experiment:

Every rule should have a mechanical enforcement mechanism. If enforcement relies on the agent "remembering" or "knowing," it will eventually fail. Preference order: Tooling > Template tables > Rule files > Documentation.

Iteration 4: Quality Gates ($334 API-equivalent)

Same spec, same model, but now the speckit commands load all lessons-learned files before generating plans and tasks. The enforcement tables are baked into the templates.

Cost nearly doubled from iteration 3. That's the "quality tax" — the work that was silently skipped before (shadcn CLI components, E2E tests, tech stack verification) was now actually being done. The $158 difference between v3 and v4 is the cost of the code actually working.

shadcn components were installed via CLI instead of hand-written. E2E tests existed. The tech stack manifest was reconciled. All routes worked.

But when I reviewed the E2E tests, many were shallow. They navigated to a page, checked that a heading existed, verified some seed data was rendered, and called it done. They tested that the server rendered HTML, not that the application functioned.

Takeaway: Getting the agent to produce E2E tests and getting it to produce good E2E tests are different problems. The reconciliation table ensured tests existed. Nothing ensured they exercised real user interactions.

Iteration 5: Real E2E Tests ($344 API-equivalent)

I asked a simple question: "Do the E2E tests actually test functionality of the requirements?"

They didn't. So I rewrote them. All 26 tests now exercise real user flows: creating entities through forms, editing and verifying persistence, advancing interview rounds, booking interview slots, checking that notifications appear.

This immediately surfaced bugs that no static test would catch:

Svelte 5 checkbox binding bug. Checkboxes using checked={formData.roles.includes(role)} with $derived lost their state during form submission. SvelteKit's client-side form handler re-collected FormData after Svelte's reactivity reset the checkbox state during the click-to-submit event propagation. The fix was bind:group with $state. This was a framework-level interaction bug that only appears when you actually submit a form with checkboxes.
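
A minimal sketch of the before/after pattern (role values and variable names here are invented, not from the actual codebase):

```svelte
<script>
  // Before (buggy): checkbox state derived from form data —
  //   checked={formData.roles.includes(role)} with $derived
  // lost its state when reactivity reset during submit propagation.

  // After: plain $state plus bind:group keeps the checked set stable
  // through SvelteKit's client-side form submission.
  let selectedRoles = $state([]);
  const roles = ['interviewer', 'lead', 'shadow']; // illustrative values
</script>

{#each roles as role}
  <label>
    <input type="checkbox" value={role} bind:group={selectedRoles} />
    {role}
  </label>
{/each}
```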

SvelteKit named action conflict. Pages with both a default action and named actions (toggleActive, advanceRound) threw a 500 error in production. SvelteKit validates this at runtime with check_named_default_separate. The fix was renaming default to update.

Both bugs passed unit tests, passed type checking, and passed the shallow E2E tests from v4. They only appeared when a real user interaction — checking a box and submitting a form — was tested end-to-end.

The cost difference between v4 and v5 was $10. Nearly identical cost, significantly higher quality.

Takeaway: The most valuable tests are the ones that replicate what a user actually does. A $10 investment in rewriting superficial tests found bugs that would have shipped to production.

What I Learned

The Five Iterations at a Glance

| # | Approach | API-equiv cost | Sessions | Result |
|---|----------|----------------|----------|--------|
| 001 | Incremental TDD | $634 | 3 | Production-grade but 4x the cost. TDD's round-trip overhead compounds with context re-reading. |
| 002 | Big Bang | $153 | 1 | Cheapest and fastest. Code compiled, tests passed, app crashed at runtime. |
| 003 | Customized Templates | $177 | 2 | First run that worked out of the box. Structured planning paid for itself. No E2E tests. |
| 004 | Quality Gates | $334 | 1 | Lessons-learned enforcement forced real work (shadcn CLI, E2E tests). Quality matched iteration 1. |
| 005 | Real E2E Tests | $344 | 2 | E2E rewrite found framework-level bugs. $10 more than v4, significantly higher confidence. |

The Enforcement Hierarchy

The single most impactful discovery: rules at different positions in the enforcement hierarchy have dramatically different compliance rates.

  1. Tooling (ESLint, type-checker): Near 100% compliance. The code won't pass CI.
  2. Template tables (manifest reconciliation): High compliance. The agent fills in tables and checks boxes.
  3. Rule files (constitution, coding standards): Variable. Consulted if loaded into context.
  4. Documentation (README, comments): Low compliance. Rarely consulted during decisions.

If you care about a rule, move it up the hierarchy. Don't write "always use shadcn components" in a README. Add a Tech Stack Manifest table that forces the agent to list each technology, its setup command, and a verification artifact. Then add a reconciliation table that must show a checkmark for each entry.
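
The manifest-and-reconciliation pair might look something like this — the shape is illustrative, not the exact template text:

```markdown
## Tech Stack Manifest (plan template)
| Technology    | Setup phase | Setup command                 | Verification artifact        |
|---------------|-------------|-------------------------------|------------------------------|
| shadcn-svelte | Phase 1     | npx shadcn-svelte@latest init | components.json exists       |
| Playwright    | Phase 1     | pnpm create playwright        | playwright.config.ts exists  |

## Tech Stack Reconciliation (tasks template)
| Manifest entry | Setup task | Present? |
|----------------|------------|----------|
| shadcn-svelte  | T004       | [x]      |
| Playwright     | T006       | [x]      |
```

The binary check is the point: an empty cell or missing row is visible at review time, whereas a skipped prose rule is invisible.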

Lessons Learned Must Be Active, Not Archival

Writing down what went wrong is useless if the system doesn't read it next time. The speckit commands now scan all specs/*/lessons-learned.md files before generating plans, tasks, or implementations. Each entry includes the enforcement mechanism and verification step. The agent must confirm compliance in its output.

This is the difference between a retrospective and a guardrail. A retrospective says "we should do X next time." A guardrail makes X a required input to the next run.

The Cost of Quality Is Predictable

| Iteration transition | Cost delta | What it bought |
|----------------------|------------|----------------|
| 002 -> 003 | +$23 | Structured planning, seed data, migrations |
| 003 -> 004 | +$158 | Real component library, E2E tests, tech reconciliation |
| 004 -> 005 | +$10 | Meaningful E2E tests, framework bug discovery |

The cheap approach ($153) produced code that crashed. The expensive approach ($634) produced great code but burned 4x the capacity for diminishing returns. The sweet spot (~$340) produced working, tested, well-structured code with proper component libraries and E2E coverage.

What the Agent Does Well vs. What It Needs Help With

Does well autonomously:

  • Writing CRUD services, database schemas, migrations
  • Implementing business logic from clear specs
  • Setting up CI pipelines, Docker configs, project scaffolding
  • Writing unit tests (when told to)
  • Following established patterns consistently once shown one example

Needs structural enforcement:

  • Using CLI tools instead of hand-writing library code
  • Writing E2E tests (and making them test real interactions)
  • Reading existing config (lint, tsconfig) before writing new code
  • Prioritizing infrastructure over features
  • Maintaining consistency across many files (tech stack drift)

The failure mode is never "the agent can't do it." It's "the agent takes the path of least resistance." Hand-writing a component is easier than running a CLI. Mocking the database is easier than testing with real data. Writing <div class="rounded border p-2"> is easier than importing a shadcn component.

Your job as the human in the loop isn't to write code. It's to build the system of constraints that makes the path of least resistance also the path of highest quality.

Sub-Agents and Skills: Making the Process Self-Aware

One of the more interesting layers I built was a set of Claude Code sub-agents and skills that automate the meta-work — analyzing projects, initializing templates, enforcing rules, and guarding against tech-stack bias.

Project Analyzer Agent

The first problem I hit when trying to make spec-kit work across different projects was that the templates contained generic placeholders like lint && type-check && test for CI pipeline commands. Every project uses different commands — pnpm lint vs cargo clippy vs ruff check. Filling these in manually defeated the purpose.

So I built a project-analyzer sub-agent. It's a Claude Code agent (running on Sonnet for cost efficiency) that scans the project root, detects the tech stack from config files (package.json, Cargo.toml, go.mod, pyproject.toml), identifies the package manager from lockfiles, extracts all available commands from scripts/targets, and maps them to canonical categories: lint, type_check, test_unit, test_e2e, build, db_migrate, etc.

It outputs a structured JSON manifest that gets saved to .specify/memory/commands.json. The /speckit-init skill then uses this manifest to customize spec-kit's templates for the specific project — replacing generic "run lint" with the actual pnpm lint command, pre-filling the language and framework in the plan template, and populating the CLAUDE.md commands table.

The result: when you run /speckit-init on a SvelteKit project, the templates know about pnpm check, pnpm test:e2e, and Vitest. Run it on a Rust project and they know about cargo clippy, cargo test, and cargo fmt --check. The agent figures this out by reading the project's actual config — no manual setup needed.
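
The manifest might look something like the sketch below for a SvelteKit project — the field names and commands are illustrative, not the exact schema:

```json
{
  "language": "typescript",
  "framework": "sveltekit",
  "package_manager": "pnpm",
  "commands": {
    "lint": "pnpm lint",
    "type_check": "pnpm check",
    "test_unit": "pnpm test:unit",
    "test_e2e": "pnpm test:e2e",
    "build": "pnpm build",
    "db_migrate": "pnpm drizzle-kit migrate"
  }
}
```

Because the categories are canonical, downstream templates can reference "the type_check command" without knowing whether that means pnpm check or cargo clippy.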

Rule Enforcer Agent

After documenting eight lessons learned about rules being ignored, I realized some of those rules could be mechanically enforced by tools that already existed in the project — if someone configured them correctly.

The rule-enforcer agent reads all the development rules (constitution, coding standards, design principles), classifies each one as automatable, partially automatable, or judgment-only, and maps the automatable ones to specific linter rules, pre-commit hooks, or config checks. It verifies that the proposed rules actually exist (no hallucinated ESLint rule names), detects conflicts with existing config, and counts how many current violations each rule would surface.

For example, it reads "No any — use unknown + type guards" from the coding standards and maps it to @typescript-eslint/no-explicit-any: error, confirms the plugin is installed, and reports 0 existing violations. It reads "No hardcoded secrets" and proposes a gitleaks pre-commit hook. It reads "SRP: A module changes for one reason only" and correctly classifies it as judgment-only — no tool can enforce that.

The output is a manifest of enforceable vs. judgment-only rules. The enforceable ones get baked into linter config and pre-commit hooks. The judgment-only ones stay as context for the AI agent. This directly implements the enforcement hierarchy: move rules up from "documentation the agent might read" to "tooling that blocks the commit."

Template Guardian Agent

Lesson LL-007 taught me that project-specific tech references leak into global templates. When I added anti-approximation rules to prevent the agent from hand-writing shadcn components, the examples I used referenced shadcn, bits-ui, pnpm, and Drizzle — all from this specific project. Those templates are shared across every project, regardless of language.

The template-guardian agent audits global templates for tech-stack specificity. It has a list of banned terms (shadcn, bits-ui, drizzle, prisma, zustand, nextjs, sveltekit) that must have zero matches unless they're part of a multi-ecosystem example list. Every example in a template must include references from at least two different ecosystems. It runs after any template edit and flags violations before they propagate.

This is the kind of problem that only surfaces when you use the same tooling across multiple projects. A template that says "run npx shadcn init" works fine for a SvelteKit project but actively misleads the agent when it's generating a Rust plan.

How They Fit Together

These agents form a pipeline:

  1. /speckit-init runs the project analyzer, produces a command manifest, and customizes templates for the project's tech stack
  2. /enforce-rules runs the rule enforcer, maps development rules to linter config and pre-commit hooks
  3. Template guardian runs after any template edit to prevent tech-stack bias from leaking into shared templates
  4. The customized templates and enforced rules feed into the spec-kit workflow (/speckit.spec -> /speckit.plan -> /speckit.tasks -> /speckit.implement)

Each agent runs on Sonnet with restricted tool access — the project analyzer and rule enforcer are read-only with bash access for command verification. The template guardian can write files since its job is to fix templates. They're defined as markdown files in the Claude Code profiles directory, which means they persist across sessions and projects.

The System Today

After five iterations, here's what the process looks like. Spec-kit provides the workflow framework (spec, plan, tasks, implement commands). The customizations I built on top are what make it enforce quality:

  1. Hardened tooling (highest enforcement): ESLint rules, type checking, pre-commit hooks, CI pipelines. The /enforce-rules skill reads every rule from the constitution and coding standards, identifies which ones can be mechanically enforced, and writes the actual linter config and git hooks. A rule like "no direct database imports in routes" doesn't stay as prose — it becomes an ESLint no-restricted-imports entry that fails the build. This is the strongest layer because the code literally won't merge if it violates these rules.

  2. Sub-agents (automated analysis): The project analyzer, rule enforcer, and template guardian run as Sonnet-powered sub-agents that handle meta-work — detecting the tech stack, mapping rules to tooling, and preventing template drift. They make the system self-bootstrapping: point it at a new project and it figures out the commands, framework, and enforceable rules automatically.

  3. Template tables (my customization on spec-kit): Tech Stack Manifests, E2E Coverage Reconciliation tables, Critical User Flows tables. These are the mechanical enforcement layer I added to spec-kit's default templates — binary checks that force the agent to account for each requirement.

  4. Lessons learned (my customization on spec-kit): Not retrospectives. Active enforcement entries with specific mechanisms and verification steps. I modified spec-kit's plan, tasks, and implement commands to scan these files and apply them as constraints during every run.

  5. Constitution and rules (spec-kit feature): Project-level principles that narrow global coding standards to the specific domain and tech stack. These feed into both the rule enforcer (which hardens them into tooling) and the spec-kit workflow (which loads them as context).

  6. Spec-kit commands (open source): The workflow backbone — spec, plan, tasks, implement. I customized them to load lessons-learned files, validate against templates, and produce structured artifacts with compliance checks.

The total investment to build this system was about a week. Spec-kit gave me a solid foundation; the iteration-by-iteration customization — especially the sub-agents and the enforcement pipeline that converts prose rules into linter configs — is what turned it into a reliable quality system. The payoff is that I can now hand the agent a new feature spec and be reasonably confident that the output will work, be tested, use the correct technology, and follow the project's conventions — because compliance is structural, not aspirational.

A Note on Costs

Every cost figure in this post is an API-equivalent estimate, not an actual bill. All five iterations ran on a Claude Max subscription — a flat monthly fee with usage limits. I didn't pay per-token for any of this.

So why quote dollar amounts at all? Because they're useful for comparing relative efficiency between approaches. The question isn't "how much did this cost?" but "how much compute did each approach consume?" API pricing gives a consistent unit of measurement.

How the Numbers Were Calculated

Claude Code tracks token usage per session in its transcript files. Each session records four token categories per API call:

  • Input tokens: The actual new content sent (user messages, tool results)
  • Cache write tokens: First time the model sees context in a conversation — written to cache at 1.25x input price
  • Cache read tokens: Subsequent reads of already-cached conversation history — charged at 0.1x input price
  • Output tokens: The model's response (code, explanations, tool calls)

I pulled these numbers from the session transcripts and applied Anthropic's published Opus 4.6 API rates:

| Category | Rate per million tokens |
|----------|-------------------------|
| Input | $15.00 |
| Cache write | $18.75 |
| Cache read | $1.50 |
| Output | $75.00 |

For each session, the formula is:

cost = (input × $15 + cache_write × $18.75 + cache_read × $1.50 + output × $75) / 1,000,000

The dominant cost driver across all approaches is cache reads. Every API call re-reads the growing conversation history. A 2,000-call session doesn't just generate 2,000 responses — it re-reads the entire conversation 2,000 times. This is why the incremental TDD approach ($634) cost 4x the big-bang approach ($153): not because it produced more code, but because 2,880 small round-trips each re-read an ever-growing context window.
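
As a sanity check, the formula can be written as a small function. This is a sketch; the 238M cache-read figure is the one reported earlier in the post:

```javascript
// Per-session cost from Claude Code token counts, using the Opus API
// rates quoted in the table above (dollars per million tokens).
const RATES = { input: 15.0, cacheWrite: 18.75, cacheRead: 1.5, output: 75.0 };

function sessionCost({ input = 0, cacheWrite = 0, cacheRead = 0, output = 0 }) {
  return (
    (input * RATES.input +
      cacheWrite * RATES.cacheWrite +
      cacheRead * RATES.cacheRead +
      output * RATES.output) /
    1_000_000
  );
}

// Cache reads dominate: the TDD run's 238M cache-read tokens alone
// come to roughly $357 at $1.50 per million.
const tddCacheReads = sessionCost({ cacheRead: 238_000_000 });
```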

What This Means for Max Subscribers

If you're on Claude Max, these dollar figures don't map to your bill — you're paying a flat rate. But they do map to your usage limits. Higher token consumption burns through your conversation allowance faster. The relative efficiency still matters: a $340-equivalent session uses roughly 2.2x the capacity of a $153-equivalent session, regardless of how you're billed.

Should You Do This?

If you're using AI coding agents for one-off scripts or small features, probably not. The overhead of a constitution, templates, and enforcement tables isn't worth it for a 50-line function.

If you're building real applications with an AI agent as a primary contributor — especially if you're running the same kind of feature implementation repeatedly — this process pays for itself quickly. The first iteration costs time to set up. Every subsequent iteration benefits from the accumulated guardrails.

The key insight isn't about any specific tool or template. It's this: treat AI coding agents like junior developers who are brilliant but easily distracted. They'll write great code if you set up the right structure. They'll cut corners if you rely on them to remember rules from a document they read 10,000 tokens ago.

Build the guardrails. Make them mechanical. And when something goes wrong, don't just fix it — add enforcement so it can't happen again.