Maintaining Quality
LLMs are not good at building and maintaining structure. For any non-toy project, you need bookends: architecture on one side, tests on the other, and visibility in between.
The Core Problem
AI removes the natural friction that traditionally slowed structural decay. Code appears instantly, complexity accrues faster than understanding, and the codebase can degrade rapidly without active intervention.
Every line Claude writes is a line you must maintain. Fast generation can outpace your understanding. If you can't explain the code, slow down.
The Bookends Framework
AI-generated code needs constraints on both ends. Without these bookends, you're relying on the model to maintain coherence across an entire codebase, something it fundamentally cannot do.
The Architecture Bookend
Before generation: Clear blueprints, design docs, and constraints that guide what Claude produces. Without architectural guidance, Claude solves each problem in isolation, creating inconsistent patterns and structural debt.
- Architecture decision records (ADRs)
- Data flow diagrams
- Module boundary definitions
- Pattern libraries
- CLAUDE.md constraints
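A module boundary definition, for instance, can be as small as a barrel file that names the module's public surface. A minimal sketch, assuming a hypothetical src/billing module:

```typescript
// src/billing/index.ts -- the only file other modules are allowed to import from.
// Everything else under src/billing/ is internal implementation detail.

export { createInvoice, voidInvoice } from "./invoice-service";
export type { Invoice, InvoiceLine } from "./types";

// Internal helpers such as ./tax-rules or ./rounding are deliberately not
// re-exported; importing them from outside this module is a boundary
// violation that a reviewer or a lint rule can flag.
```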
The Visibility Layer
Ongoing: You can't manage what you can't see. Dashboards and instrumentation show system health, letting you detect structural degradation before it becomes a crisis.
- Test coverage tracking
- Lint error rates
- File size monitoring
- PR size distribution
- Complexity metrics
The Testing Bookend
After generation: Test suites with meaningful coverage as your safety net. Tests catch when Claude's changes break existing behavior, even when the code looks correct. They define what "correct" means.
- Unit tests for public interfaces
- Integration tests for boundaries
- End-to-end tests for flows
- Visual regression tests
- Performance benchmarks
The Gradient of Trust
Not all code needs the same scrutiny. Chad Fowler articulated this insight: there's a gradient from code you trust immediately to code you never quite trust, no matter who wrote it.
Design systems where the AI's work is constrained to fill in blanks that are hard to fill incorrectly. When code is cheap to generate, the quality of individual implementations matters less than whether the system is shaped so that cheap code is good enough.
High Trust: Low Scrutiny
Some code you can accept without deep review:
- Pure functions: Small, no side effects, clear inputs/outputs. If the types align, behavior is probably correct.
- Typed code: Statically typed inputs and outputs. The type system makes many bugs impossible.
- Deterministic operations: Well-understood logic with no hidden state, no I/O, no ambiguity.
- Template- and schema-driven code: Code that follows templates or is generated from schemas. The structure does the work.
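For instance, a small pure function with typed inputs and outputs sits at the high-trust end; the example below is illustrative rather than taken from any real codebase:

```typescript
// Pure, typed, no I/O: if it compiles and its tests pass, there is little
// room left for hidden misbehavior.
export function applyDiscount(priceCents: number, discountPercent: number): number {
  if (priceCents < 0 || discountPercent < 0 || discountPercent > 100) {
    throw new RangeError("invalid price or discount");
  }
  return Math.round(priceCents * (1 - discountPercent / 100));
}
```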
Low Trust: High Scrutiny
Some code demands careful review regardless of who wrote it:
- Security-critical code: Authentication, authorization, encryption. Mistakes here have severe consequences.
- Data migrations: Schema changes, data transformations. Difficult to reverse and easy to corrupt data.
- Money: Financial calculations, payment processing. Errors have real costs.
- Implicit business rules: Rules that encode unclear invariants, partial documentation, or "everyone knows how this works."
Design your system so the most critical code is also the most verifiable. Quarantine complexity to the edges. Surround messy parts with monitoring. When they fail, the blast radius is limited.
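One way to quarantine complexity is to hide the messy part behind a single narrow, typed function and count successes and failures at that seam. A sketch, where legacyTaxEngine and metrics are hypothetical stand-ins for the messy dependency and your monitoring client:

```typescript
// Hypothetical messy dependency and metrics client, stubbed for illustration.
declare const legacyTaxEngine: {
  compute(orderId: string, subtotalCents: number): Promise<{ tax: number }>;
};
declare const metrics: { increment(name: string): void };

interface TaxResult {
  taxCents: number;
}

// The narrow seam: callers only ever see this typed function. The legacy
// engine is quarantined behind it, and every failure is counted, so a
// degradation shows up on a dashboard instead of spreading silently.
export async function calculateTax(orderId: string, subtotalCents: number): Promise<TaxResult> {
  try {
    const raw = await legacyTaxEngine.compute(orderId, subtotalCents);
    metrics.increment("tax.calculate.ok");
    return { taxCents: Math.round(raw.tax * 100) };
  } catch (err) {
    metrics.increment("tax.calculate.failed");
    throw err; // surface the failure; never catch and ignore
  }
}
```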
Code Trust vs. Architectural Trust
There's an important distinction here worth naming:
Code Trust
Is this specific implementation correct? Does this function do what it claims?
Architectural Trust
Is the system shaped so that correctness is easy and failure is survivable?
You can have high code trust in a bad architecture where every function is perfect, but the interactions are a nightmare. You can have high architectural trust with mediocre code where individual functions might have bugs, but types prevent certain errors, tests catch others, and monitoring detects what slips through.
When code is cheap to generate, what matters is whether the system is shaped so that cheap code is good enough. Systems where most code needs careful review become expensive. Systems where most code is trustworthy by construction become cheap.
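Types are one way to make cheap code trustworthy by construction. The Result type below is a generic illustration: any generated caller that forgets the error path simply will not compile.

```typescript
// A result type that forces every caller to acknowledge failure.
type Result<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

function parseQuantity(input: string): Result<number> {
  const n = Number(input);
  return Number.isInteger(n) && n > 0
    ? { ok: true, value: n }
    : { ok: false, error: `not a positive integer: ${input}` };
}

const qty = parseQuantity("3");
if (qty.ok) {
  console.log(qty.value * 2); // the compiler forbids touching .value before this check
}
```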
Creating Architectural Blueprints
Document key patterns in a way Claude can reference. Architecture maps don't need to be perfect. They need to establish boundaries and patterns.
What to Document
- Data flow: How data moves through the system. Entry points, transformations, storage. Example: API → Service → Repository → Database; never access the database directly from API handlers. (A sketch follows this list.)
- Module boundaries: What belongs where. Public vs. internal APIs. Allowed dependencies. Example: src/core/ → shared utilities only; no business logic in core modules.
- Patterns: How similar problems should be solved. Consistent approaches to common tasks. Example: error handling via ErrorService; never catch and ignore, always surface.
- Constraints: What's forbidden. Past mistakes to avoid. Deprecated patterns. Example: never import from /internal; use the public API in /lib instead.
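A minimal sketch of the data flow and error-handling rules above, with hypothetical names for the repository and ErrorService (the point is the shape, not the specifics):

```typescript
// Hypothetical lower layers, stubbed so the sketch type-checks.
declare const userRepository: {
  findById(id: string): Promise<{ id: string; name: string }>;
};
declare const errorService: { report(err: unknown): void };

// Service layer: business rules live here; only this layer touches the repository.
const userService = {
  async getUser(id: string) {
    try {
      return await userRepository.findById(id);
    } catch (err) {
      errorService.report(err); // surface, never catch and ignore
      throw err;
    }
  },
};

// API layer: parses the request and delegates; it never reaches for the database.
export async function getUserHandler(req: { params: { id: string } }) {
  return userService.getUser(req.params.id);
}
```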
Create an architecture guide:
Look at this codebase and document the key architectural patterns: data flow, module boundaries, error handling approach, and state management. Focus on 'what should always be true' not 'what is'.
Planning Before Execution
For any change beyond a quick fix, planning mode is essential. Enter with /plan or Shift+Tab twice. Claude proposes its approach before writing code, and you approve or refine the plan.
Why Planning Matters for Quality
When Claude explains its intended approach, you can spot patterns that violate your architecture, unnecessary complexity, or simpler alternatives. Fixing a plan is cheaper than fixing code.
Planning mode is especially important for:
- Multi-file changes: See all the pieces before any are created
- Refactoring: Understand the migration path before moving code
- New features: Validate the approach fits your architecture
- Unfamiliar territory: When you or Claude might not know the best path
The Plan-Commit Loop
For larger changes, combine planning with git checkpoints:
- Plan: Use planning mode to establish the approach
- Implement one step: Execute just the first piece of the plan
- Commit: Save working state before continuing
- Repeat: Move to the next step with a safe rollback point
This rhythm prevents the "tangled mess" failure mode where multiple half-finished changes become impossible to untangle.
For tasks spanning multiple sessions, have Claude write the plan to a file (plan.md). Plans in files persist across context resets. Start new sessions by reading the plan file.
Test Coverage Strategies
Tests define what "correct" means before implementation starts. They give Claude a target to iterate against.
What to Test
- Public interfaces (every exported function)
- Error paths (how does it fail?)
- Edge cases (empty, large, invalid inputs)
- Integration points (where modules connect)
- Business rules (the logic that matters most)
What to Skip
- Pure presentation code (styling, layout)
- Third-party library wrappers
- Simple getters/setters
- Boilerplate/generated code
- One-off scripts
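As a concrete sketch of the first list, here is what covering a public interface, its error paths, and an edge case might look like. Vitest-style syntax is assumed; parseQuantity is a hypothetical exported helper that validates a string as a positive integer.

```typescript
import { describe, expect, it } from "vitest"; // or any runner with the same API shape
import { parseQuantity } from "./parse-quantity"; // hypothetical module under test

describe("parseQuantity (public interface)", () => {
  it("accepts a plain positive integer", () => {
    expect(parseQuantity("3")).toEqual({ ok: true, value: 3 });
  });

  it("rejects zero, negatives, and non-numbers (error paths)", () => {
    for (const bad of ["0", "-2", "abc"]) {
      expect(parseQuantity(bad).ok).toBe(false);
    }
  });

  it("handles the empty-string edge case", () => {
    expect(parseQuantity("").ok).toBe(false);
  });
});
```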
How Much Coverage
Target meaningful coverage, not percentage points. 80% coverage with thoughtful tests beats 100% coverage with shallow ones.
- Highest coverage: security, payments, core business logic
- Solid coverage: most application code and features
- Lighter coverage: internal tools, admin interfaces
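If you want those tiers enforced rather than merely aspirational, most runners can fail the build when coverage drops below a threshold. A Jest-style sketch; the numbers and the payments path are placeholders, not recommendations:

```typescript
// jest.config.ts -- fail CI when coverage drops below the agreed tier.
import type { Config } from "jest";

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    global: { lines: 80, branches: 70 },            // most application code
    "./src/payments/": { lines: 95, branches: 90 }, // critical paths get a higher bar
  },
};

export default config;
```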
TDD with Claude:
Write tests for the feature I'm describing. Cover the happy path, error cases, and edge cases. Don't implement yet. Let's get the tests right first, then implement to make them pass.
The TDD Loop
When success criteria can be expressed as tests, this workflow produces dramatically better results:
Write Tests First
Define what correct behavior looks like before any implementation.
"Write tests covering the happy path and these edge cases: [list]. Don't implement yet."
Confirm They Fail
Run the tests to verify they fail. This confirms they're testing something real.
"Run the tests, confirm they fail for the right reasons."
Implement to Pass
Now Claude has a clear target. Each red-to-green cycle provides feedback.
"Implement code to make the tests pass. Keep iterating until green."
Commit Separately
Tests and implementation as separate commits aids review and debugging.
"Commit tests and implementation separately."
Detecting Degrading Structure
Without active maintenance, AI-assisted codebases accumulate invisible technical debt. Watch for these warning signs.
Code Smells
- Bloated files: Agents accrete code without refactoring. Files grow to thousands of lines.
- Duplication: Similar code appearing in multiple places. Abstractions not recognized.
- Inconsistency: The same problem solved different ways. Each new task creates new conventions.
- Tangled dependencies: Files importing from everywhere. Module boundaries eroding.
Process Smells
- Shotgun changes: Simple changes requiring edits across the codebase.
- Rework: Frequently finding issues in recently generated code.
- Lost understanding: Difficulty articulating how parts of the system interact.
- Slowing velocity: Tasks taking longer, more confusion, more failed attempts.
The 40% Rule
Steve Yegge recommends spending 30-40% of your time on code health when using AI for coding. Without this investment, technical debt accumulates rapidly and slows everything down.
Code Health Activities
- Audits: Have Claude find code smells (large files, low coverage, duplication, dead code, debug cruft).
- Splitting large files: Files over 500 lines need splitting. Agents reason better over smaller, focused modules.
- Deleting cruft: Debug artifacts, ancient plans, old docs, unused code. Keep the codebase lean.
- Consolidation: Find and merge duplicate systems: logging, configs, utilities that grew separately.
- Strengthening foundations: Firm up foundations before major new capabilities. Complexity compounds.
Code health review:
Review this codebase for code smells: files over 500 lines, areas with low test coverage, duplicated logic, redundant systems, dead code, debug cruft. File issues for anything that needs followup.
If you're finding serious problems in every review, you need more reviews. As long as reviews surface issues, you're not doing enough maintenance.
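Parts of that review are easy to automate so it can run often. A small script along these lines (the 500-line threshold and paths are just an example) flags oversized files before a review even starts:

```typescript
// scan-file-sizes.ts -- list source files over a line-count threshold.
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

const THRESHOLD = 500;

function walk(dir: string, out: string[] = []): string[] {
  for (const entry of readdirSync(dir)) {
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) {
      if (entry !== "node_modules" && entry !== ".git") walk(path, out);
    } else if (/\.(ts|tsx|js)$/.test(entry)) {
      out.push(path);
    }
  }
  return out;
}

for (const file of walk("src")) {
  const lines = readFileSync(file, "utf8").split("\n").length;
  if (lines > THRESHOLD) console.log(`${lines}\t${file}`);
}
```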
The Rule of Five
Jeffrey Emanuel discovered that the best designs, plans, and implementations come from forcing agents to review their work 4-5 times. It typically takes this many iterations before the output converges.
How It Works
- Pass 1: Initial implementation or design. Broad strokes, quick solution.
- Pass 2: Review for bugs, edge cases, obvious problems. Often finds things missed in pass one.
- Pass 3: Deeper review: security, performance, maintainability. Looking in-the-small.
- Pass 4: Existential questions: is this the right approach? Are we solving the right problem? Looking in-the-large.
- Pass 5: Final polish. The agent declares "this is as good as I can make it." Now you can moderately trust the output.
Use 2-3 passes on small tasks, 4-5 passes on big tasks. If you're unfamiliar with the language, stack, or domain, err toward more reviews.
Extended Thinking for Better Reviews
In Claude Code, including "think" in your prompt triggers extended reasoning with more compute budget. Different phrases allocate different levels:
- "think": Basic extended thinking
- "think hard" / "think deeply": More reasoning budget
- "ultrathink": Maximum reasoning budget
This is a Claude Code-specific feature. Use the higher levels for architecture reviews and complex analysis.
For deeper review:
Ultrathink about this implementation. What potential bugs, security issues, or edge cases might I have missed? Is this the right approach for our use case?
Verification Loops
Boris Cherny calls this "probably the most important thing": give Claude a way to verify its work.
Automated Tests
Tests provide immediate feedback. Claude can iterate until green. Each cycle gives Claude feedback it can act on.
npm test -- --related
Visual Feedback
For UI work, screenshots let Claude compare current state to target. Claude's visual accuracy is uneven: sometimes perfect on the first try, sometimes surprisingly off. After 2-3 iterations with feedback, results converge.
Separate Reviewer
A fresh Claude instance reviewing code catches issues the original might defend. The separate context matters: the reviewer has no investment in the choices already made.
This feedback loop will 2-3x the quality of results. Without verification, you're stuck with whatever Claude produces first. With it, Claude can iterate toward a correct result.
Summary: Building Trustworthy Systems
The Key Insight
The developers who thrive with AI won't be the ones who write the best prompts. They'll be the ones who design systems where prompts don't need to be perfect, because the system's structure does most of the work, and the AI is just filling in blanks that are hard to fill incorrectly.
Remember:
- Better shapes beat better prompts: Structure systems so more work is simple and constrained
- Bookends bracket AI's work: Architecture before, tests after, visibility throughout
- Gradient of trust: Match scrutiny to risk; quarantine complexity
- 40% on code health: Or you'll spend 60% on debugging
- Verify, don't trust: Feedback loops 2-3x quality