Tried GitHub's spec-kit with Claude Code for 2 months — notes on what works and what doesn't
Posted by jokiruiz@reddit | LocalLLaMA | 2 comments
Been experimenting with Spec-Driven Development for a couple of months now, specifically GitHub's spec-kit toolkit with Claude Code as the agent. Wanted to share notes because I think this sub will have strong opinions on it, and frankly I'm still figuring parts of it out.
Quick definition for anyone who hasn't seen spec-kit: it's GitHub's official toolkit for what they call Spec-Driven Development. The philosophy is that the spec, not the prompt, becomes the source of truth. You write a versioned, reviewable spec; the agent generates code from it; any substantial change goes back to the spec first. Five phases: Constitution, Specify, Plan, Tasks, Implement. Repo: github.com/github/spec-kit
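For orientation, the flow looks roughly like this. Command names are from spec-kit's README at the time of writing and may differ in your version (older releases used unprefixed slash commands like /specify), so treat this as a sketch, not gospel:

```text
# one-time: scaffold a project with the specify CLI
specify init my-project --ai claude

# then, inside the agent session, one slash command per phase:
/speckit.constitution   # project principles and quality gates
/speckit.specify        # what to build and why (no tech stack yet)
/speckit.plan           # architecture + tech choices, reviewable before code
/speckit.tasks          # breakdown into small, ordered work items
/speckit.implement      # agent writes code against the approved plan
```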
What's actually good:
- Agent-agnostic. Same spec works with Claude Code, Cursor, Codex, Gemini CLI, Copilot. I've literally generated initial code with Claude Code, then handed the spec to Cursor for test refactoring, and it picked up cleanly. The spec is the portable asset.
- Hard checkpoints between phases. You see the full proposed architecture (Plan phase) before a single line of code gets written. Catches bad arch decisions when they cost 5 minutes to fix instead of 5 hours.
- The Constitution file as quality gate. You define inviolable principles up front (test coverage minimums, dependency allowlists, perf budgets, typing strictness). Agent fails its own validation if it tries to violate them.
- Determinism improves a lot vs. raw prompting. The agent isn't filling in 30 implicit decisions on its own — they're in the spec. Re-running the implement phase produces much more consistent output across runs.
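One note on the "fails its own validation" part: it doesn't need anything exotic. Most of my constitution's gates reduce to ordinary tool config that both the agent and CI have to pass. A sketch of what that looks like in my projects (the thresholds are my own choices, not spec-kit defaults):

```toml
# pyproject.toml (illustrative thresholds, not spec-kit defaults)
[tool.mypy]
strict = true        # typing strictness gate

[tool.coverage.report]
fail_under = 80      # pytest --cov fails the run below 80%
```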
What annoys me:
- Drift is real. If you tweak code manually without updating the spec, things desync fast. spec-kit has some tooling for this but it's young.
- Heavy overhead for small changes. Bug fixes <50 LOC or trivial features make the 5-phase flow feel ceremonial. My current rule: only do full SDD for new modules or features touching 200+ LOC. Below that, just do it manually.
- Legacy migration is painful. Retrofitting SDD onto an existing 30k-LOC codebase without prior specs is months of work, not days. Haven't found a clean approach yet.
- Quality depends heavily on the agent. Claude Code (Sonnet/Opus 4.6+) handles it well. Smaller models struggle with the Plan phase — they generate plans that compile but don't reflect good architectural reasoning.
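On the drift point, one crude guardrail you could imagine (this is a hypothetical helper I'm describing, not spec-kit tooling): flag commits where source files are newer than every spec file, as a nudge to update the spec first. Mtime-based, so it only catches the "edited code, forgot the spec" case, not semantic drift:

```python
from pathlib import Path


def spec_is_stale(spec_dir: str, src_dir: str) -> bool:
    """Return True if any source file is newer than the newest spec file.

    Hypothetical drift heuristic: compares filesystem mtimes only, so it
    catches 'edited code, forgot the spec' but not semantic drift.
    """
    def newest_mtime(root: str) -> float:
        files = [p for p in Path(root).rglob("*") if p.is_file()]
        return max((p.stat().st_mtime for p in files), default=0.0)

    return newest_mtime(src_dir) > newest_mtime(spec_dir)
```

Wired into a pre-commit hook, this at least makes the desync visible instead of silent.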
Practical setup I'm using now:
- spec-kit installed via: uv tool install --from git+https://github.com/github/spec-kit.git specify-cli (PSA: PyPI has typosquatters with similar names. Only the github/spec-kit repo is official.)
- Claude Code as primary agent. Have also tested with Cursor and Gemini CLI for cross-validation.
- SQLite for any local persistence needs in the project. Easy to spec, easy to validate, no cloud dependency to mock.
- A reusable constitution template I've extracted: strict typing, pytest coverage >80%, explicit dependency allowlist, no cloud services unless requirement explicitly demands it.
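For the curious, the template boils down to something like this (paraphrased and illustrative — spec-kit generates its own constitution scaffold, this is just the shape of the principles I keep):

```markdown
# Constitution (excerpt, illustrative)

## Principle 1: Strict typing
All Python code passes mypy --strict. No untyped public functions.

## Principle 2: Test coverage
pytest coverage >= 80% on every feature branch; coverage may not decrease.

## Principle 3: Dependency allowlist
New third-party dependencies require an explicit entry here before use.

## Principle 4: Local-first
No cloud services unless a requirement explicitly demands one; default
persistence is SQLite.
```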
Two questions for the sub:
- Has anyone gotten local models (Qwen, DeepSeek-Coder, GLM, Llama) to handle the Plan and Implement phases competently? My local-only experiments have been mixed — small models follow the format but architectural reasoning falls apart. Curious if anyone's found specific local models or prompt engineering tricks that fit spec-kit's phase structure.
- Anyone running SDD multi-agent (one model writes spec, another implements, a third audits)? Theoretically should improve quality through specialization but I haven't gotten it to be measurably better than single-agent in practice.
Curious if anyone has a setup that actually works.
JudgePhobos@reddit
Hey! We're actually experimenting with spec-driven development at our company too, but I'm honestly still not 100% sold on it.
My main concern is how it handles hotfixes. The approach almost assumes developers can't be trusted to patch code by hand when things break. If every minor bug requires tweaking the spec and doing a full code regeneration, instead of just letting an expert fix the code directly, it drastically slows down response times.
One tip that has actually worked for us: combine it with strict TDD. Having a solid test suite makes validating regenerated code much easier.
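To make the TDD point concrete: the pattern that pairs best with regeneration is a characterization test — pin the externally observable behavior once, then let the implementation be regenerated freely underneath it. A minimal sketch (the slugify function here is an invented stand-in, not anything from spec-kit):

```python
import re


def slugify(title: str) -> str:
    """Stand-in for any regenerated function; its behavior is pinned below."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


# Characterization test: stays fixed across regenerations. If a regenerated
# implementation changes observable behavior, this fails, and the spec (not
# the test) is what needs revisiting.
def test_slugify_is_stable():
    assert slugify("Spec-Driven Development!") == "spec-driven-development"
    assert slugify("  Hello,   World  ") == "hello-world"
```

The test suite effectively becomes the executable half of the spec: regeneration is cheap because validation is.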
colin_colout@reddit
Have you tried openspec? It feels less opinionated than spec-kit about the order of work: it's pretty accepting of applying changes outside the workflow and backporting them to the spec afterwards.