Maintaining Quality

LLMs are not good at building and maintaining structure. For any non-toy project, you need bookends: architecture on one side, tests on the other, and visibility in between.

The Core Problem

AI removes the natural friction that traditionally slowed structural decay. Code appears instantly, complexity accrues faster than understanding, and the codebase can degrade rapidly without active intervention.

⚡
Ease of generation doesn't mean ease of maintenance.

Every line Claude writes is a line you must maintain. Fast generation can outpace your understanding. If you can't explain the code, slow down.


The Bookends Framework

AI-generated code needs constraints on both ends. Without these bookends, you're relying on the model to maintain coherence across an entire codebase, something it fundamentally cannot do.

📐

The Architecture Bookend

Before generation

Clear blueprints, design docs, and constraints that guide what Claude produces. Without architectural guidance, Claude solves each problem in isolation, creating inconsistent patterns and structural debt.

Includes:
  • Architecture decision records (ADRs)
  • Data flow diagrams
  • Module boundary definitions
  • Pattern libraries
  • CLAUDE.md constraints
πŸ‘οΈ

The Visibility Layer

Ongoing

You can't manage what you can't see. Dashboards and instrumentation show system health, letting you detect structural degradation before it becomes a crisis.

Includes:
  • Test coverage tracking
  • Lint error rates
  • File size monitoring
  • PR size distribution
  • Complexity metrics
🧪

The Testing Bookend

After generation

Test suites with meaningful coverage are your safety net. Tests catch when Claude's changes break existing behavior, even when the code looks correct. They define what "correct" means.

Includes:
  • Unit tests for public interfaces
  • Integration tests for boundaries
  • End-to-end tests for flows
  • Visual regression tests
  • Performance benchmarks

The Gradient of Trust

Not all code needs the same scrutiny. Chad Fowler articulated this insight: there's a gradient from code you trust immediately to code you never quite trust, no matter who wrote it.

💡
The real leverage isn't better prompts. It's better shapes.

Design systems where the AI's work is constrained to fill in blanks that are hard to fill incorrectly. When code is cheap to generate, the quality of individual implementations matters less than whether the system is shaped so that cheap code is good enough.
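As a concrete illustration of a "better shape", here is a small sketch in TypeScript (the event type and function names are invented for the example): a discriminated union plus an exhaustiveness check leaves little room for generated code to be wrong and still compile.

  type PaymentEvent =
    | { kind: 'authorized'; amountCents: number }
    | { kind: 'captured'; amountCents: number }
    | { kind: 'refunded'; amountCents: number; reason: string };

  // If a new kind is added but not handled below, this call stops compiling.
  function assertNever(x: never): never {
    throw new Error(`Unhandled event: ${JSON.stringify(x)}`);
  }

  function applyEvent(balanceCents: number, event: PaymentEvent): number {
    switch (event.kind) {
      case 'authorized':
        return balanceCents; // a hold moves no money yet
      case 'captured':
        return balanceCents + event.amountCents;
      case 'refunded':
        return balanceCents - event.amountCents;
      default:
        return assertNever(event);
    }
  }

If Claude later adds a new event kind but forgets to handle it, the failure is a compile error rather than a silent bug: the shape, not the prompt, enforces correctness.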

High Trust: Low Scrutiny

Some code you can accept without deep review:

✓
Pure functions

Small, no side effects, clear inputs/outputs. If types align, behavior is probably correct.

✓
Strongly typed code

Statically typed inputs and outputs. The type system makes many bugs impossible.

✓
Simple transformations

Well-understood operations with no hidden state, no I/O, no ambiguity.

✓
Constrained patterns

Code that follows templates or is generated from schemas. Structure does the work.
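For instance, a pure, strongly typed transformation sits at the high-trust end of the gradient. This is an illustrative sketch, not code from the text:

  interface LineItem {
    unitPriceCents: number;
    quantity: number;
  }

  // Pure, typed, no I/O, no hidden state: if it compiles and the types
  // line up, a handful of tests covers the remaining risk.
  function orderTotalCents(items: readonly LineItem[]): number {
    return items.reduce((sum, item) => sum + item.unitPriceCents * item.quantity, 0);
  }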

Low Trust: High Scrutiny

Some code demands careful review regardless of who wrote it:

⚠️
Security-critical code

Authentication, authorization, encryption. Mistakes here have severe consequences.

⚠️
Database migrations

Schema changes and data transformations. Difficult to reverse, and mistakes can quietly corrupt data.

⚠️
Code handling money

Financial calculations, payment processing. Errors have real costs.

⚠️
Complex business logic

Rules with unclear invariants, partial documentation, or assumptions that "everyone knows how this works."

💡 Design Principle

Design your system so the most critical code is also the most verifiable. Quarantine complexity to the edges. Surround messy parts with monitoring. When they fail, the blast radius is limited.
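A hedged sketch of what this quarantining can look like in practice (all names are hypothetical): the critical calculation is a pure function you can test exhaustively, while the I/O that can fail stays in a thin, injected, monitored wrapper.

  // Critical logic as a pure function: exhaustively testable, no I/O.
  export function proratedRefundCents(
    amountPaidCents: number,
    daysUsed: number,
    daysInPeriod: number
  ): number {
    if (daysInPeriod <= 0 || daysUsed < 0) throw new RangeError('invalid period');
    const remainingDays = Math.max(daysInPeriod - daysUsed, 0);
    return Math.round((amountPaidCents * remainingDays) / daysInPeriod);
  }

  // The messy edge (network, retries, logging) stays thin and monitored;
  // a failure here cannot corrupt the calculation above.
  interface PaymentGateway {
    refund(orderId: string, amountCents: number): Promise<void>;
  }

  export async function issueRefund(
    gateway: PaymentGateway,
    orderId: string,
    refundCents: number
  ): Promise<void> {
    await gateway.refund(orderId, refundCents);
  }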


Code Trust vs. Architectural Trust

There's an important distinction here worth naming:

Code Trust

Is this specific implementation correct? Does this function do what it claims?

Focus: Individual pieces

Architectural Trust

Is the system shaped so that correctness is easy and failure is survivable?

Focus: System design

You can have high code trust in a bad architecture where every function is perfect, but the interactions are a nightmare. You can have high architectural trust with mediocre code where individual functions might have bugs, but types prevent certain errors, tests catch others, and monitoring detects what slips through.

🎯
AI shifts the emphasis from code trust to architectural trust.

When code is cheap to generate, what matters is whether the system is shaped so that cheap code is good enough. Systems where most code needs careful review become expensive. Systems where most code is trustworthy by construction become cheap.


Creating Architectural Blueprints

Document key patterns in a way Claude can reference. Architecture maps don't need to be perfect. They need to establish boundaries and patterns.

What to Document

🔀 Data Flow

How data moves through the system. Entry points, transformations, storage.

API → Service → Repository → Database
Never access the database directly from API handlers.
📦 Module Boundaries

What belongs where. Public vs internal APIs. Allowed dependencies.

src/core/ → shared utilities only
No business logic in core modules.
🎨 Pattern Libraries

How similar problems should be solved. Consistent approaches to common tasks.

Error handling via ErrorService
Never catch and ignore. Always surface.
🚫 Constraints

What's forbidden. Past mistakes to avoid. Deprecated patterns.

Never import from /internal
Use the public API in /lib instead.
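Where possible, turn constraints like the last one into tooling rather than prose. A minimal sketch, assuming ESLint's flat config and its built-in no-restricted-imports rule; the glob patterns are illustrative:

  // eslint.config.js (sketch): make "never import from /internal" mechanical
  export default [
    {
      files: ['src/**/*.ts'],
      rules: {
        'no-restricted-imports': [
          'error',
          {
            patterns: [
              { group: ['**/internal/**'], message: 'Use the public API in /lib instead.' },
            ],
          },
        ],
      },
    },
  ];
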
Try This

Create an architecture guide:

❯ Look at this codebase and document the key architectural patterns: data flow, module boundaries, error handling approach, and state management. Focus on 'what should always be true' not 'what is'.

Planning Before Execution

For any change beyond a quick fix, planning mode is essential. Enter with /plan or Shift+Tab twice. Claude proposes its approach before writing code, and you approve or refine the plan.

Why Planning Matters for Quality

📋
Planning catches architectural violations before they happen.

When Claude explains its intended approach, you can spot patterns that violate your architecture, unnecessary complexity, or simpler alternatives. Fixing a plan is cheaper than fixing code.

Planning mode is especially important for changes that span multiple files, touch shared interfaces, or alter architectural boundaries.

The Plan-Commit Loop

For larger changes, combine planning with git checkpoints:

  1. Plan: Use planning mode to establish the approach
  2. Implement one step: Execute just the first piece of the plan
  3. Commit: Save working state before continuing
  4. Repeat: Move to the next step with a safe rollback point

This rhythm prevents the "tangled mess" failure mode where multiple half-finished changes become impossible to untangle.

💡 Multi-Session Plans

For tasks spanning multiple sessions, have Claude write the plan to a file (plan.md). Plans in files persist across context resets. Start new sessions by reading the plan file.


Test Coverage Strategies

Tests define what "correct" means before implementation starts. They give Claude a target to iterate against.

What to Test

✓ High Priority
  • Public interfaces (every exported function)
  • Error paths (how does it fail?)
  • Edge cases (empty, large, invalid inputs)
  • Integration points (where modules connect)
  • Business rules (the logic that matters most)
⚠️ Lower Priority
  • Pure presentation code (styling, layout)
  • Third-party library wrappers
  • Simple getters/setters
  • Boilerplate/generated code
  • One-off scripts

How Much Coverage

Target meaningful coverage, not percentage points. 80% coverage with thoughtful tests beats 100% coverage with shallow ones.

90%+ Critical paths

Security, payments, core business logic

70-80% Regular code

Most application code, features

50-70% Utilities

Internal tools, admin interfaces
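If your test runner supports per-path thresholds, these tiers can be enforced on every run instead of living only in a guideline. A minimal sketch, assuming Jest, with illustrative directory names:

  // jest.config.ts (sketch): fail the build when coverage drops below its tier
  import type { Config } from 'jest';

  const config: Config = {
    collectCoverage: true,
    coverageThreshold: {
      global: { lines: 75 },                          // regular application code
      './src/payments/': { lines: 90, branches: 90 }, // critical paths
      './src/tools/': { lines: 60 },                  // internal utilities
    },
  };

  export default config;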

Try This

TDD with Claude:

❯ Write tests for the feature I'm describing. Cover the happy path, error cases, and edge cases. Don't implement yet. Let's get the tests right first, then implement to make them pass.

The TDD Loop

When success criteria can be expressed as tests, this workflow produces dramatically better results:

1

Write Tests First

Define what correct behavior looks like before any implementation.

"Write tests covering the happy path and these edge cases: [list]. Don't implement yet."
2

Confirm They Fail

Run the tests to verify they fail. This confirms they're testing something real.

"Run the tests, confirm they fail for the right reasons."
3

Implement to Pass

Now Claude has a clear target. Each red-to-green cycle provides feedback.

"Implement code to make the tests pass. Keep iterating until green."
4

Commit Separately

Tests and implementation as separate commits aids review and debugging.

"Commit tests and implementation separately."

Detecting Degrading Structure

Without active maintenance, AI-assisted codebases accumulate invisible technical debt. Watch for these warning signs.

Code Smells

📈 Files getting larger

Agents accrete code without refactoring. Files grow to thousands of lines (a quick check is sketched below).

πŸ” Increasing duplication

Similar code appearing in multiple places. Abstractions not recognized.

🎭 Inconsistent patterns

Same problem solved different ways. Each new task creates new conventions.

πŸ•ΈοΈ Growing imports

Files importing from everywhere. Module boundaries eroding.
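Several of these smells are easy to instrument. A minimal sketch of file-size monitoring in TypeScript for Node; the 500-line threshold echoes the guidance later in this section, and the directory and extensions are assumptions:

  // check-file-sizes.ts (sketch): flag files that have grown past the point
  // where agents (and humans) reason well about them
  import { readdirSync, readFileSync, statSync } from 'node:fs';
  import { join } from 'node:path';

  const MAX_LINES = 500;

  function* sourceFiles(dir: string): Generator<string> {
    for (const entry of readdirSync(dir)) {
      const full = join(dir, entry);
      if (statSync(full).isDirectory()) yield* sourceFiles(full);
      else if (/\.(ts|tsx|js)$/.test(entry)) yield full;
    }
  }

  for (const file of sourceFiles('src')) {
    const lines = readFileSync(file, 'utf8').split('\n').length;
    if (lines > MAX_LINES) console.log(`${file}: ${lines} lines (over ${MAX_LINES})`);
  }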

Process Smells

🎯 Changes touching many files

Simple changes requiring edits across the codebase.

πŸ› Bugs in recent AI code

Frequently finding issues in recently generated code.

❓ Hard to explain connections

Difficulty articulating how parts of the system interact.

🐒 Agents working slowly

Tasks taking longer, more confusion, more failed attempts.


The 40% Rule

Steve Yegge recommends spending 30-40% of your time on code health when using AI for coding. Without this investment, technical debt accumulates rapidly and slows everything down.

Code Health Activities

🔍
Regular code reviews

Have Claude find code smells: large files, low coverage, duplication, dead code, debug cruft.

✂️
Break up large files

Files over 500 lines need splitting. Agents reason better over smaller, focused modules.

🧹
Clean up cruft

Debug artifacts, ancient plans, old docs, unused code. Keep the codebase lean.

🔗
Consolidate redundancy

Find and merge duplicate systems: logging, configs, utilities that grew separately.

πŸ—οΈ
Refactor before expanding

Firm up foundations before major new capabilities. Complexity compounds.

Try This

Code health review:

❯ Review this codebase for code smells: files over 500 lines, areas with low test coverage, duplicated logic, redundant systems, dead code, debug cruft. File issues for anything that needs followup.
⚠️ Debt Accumulates Fast

If you're finding serious problems in every review, you need more frequent reviews: as long as each pass keeps surfacing issues, you're not doing enough maintenance.


The Rule of Five

Jeffrey Emanuel discovered that the best designs, plans, and implementations come from forcing agents to review their work 4-5 times. It typically takes this many iterations before the output converges.

How It Works

1
First Pass: Generate

Initial implementation or design. Broad strokes, quick solution.

2
Second Pass: Obvious Issues

Review for bugs, edge cases, obvious problems. Often finds things missed in pass one.

3
Third Pass: Code Review

Deeper review: security, performance, maintainability. Looking in-the-small.

4
Fourth Pass: Architecture

Existential questions: is this the right approach? Are we solving the right problem? Looking in-the-large.

5
Fifth Pass: Convergence

Final polish. Agent declares "this is as good as I can make it." Now you can moderately trust the output.

💡 When to Apply

Use 2-3 passes on small tasks, 4-5 passes on big tasks. If you're unfamiliar with the language, stack, or domain, err toward more reviews.

Extended Thinking for Better Reviews

In Claude Code, including "think" in your prompt triggers extended reasoning with more compute budget. Different phrases allocate different levels: "think" < "think hard" < "think harder" < "ultrathink", each granting progressively more thinking budget.

This is a Claude Code specific feature. Use the higher levels for architecture reviews and complex analysis.

Try This

For deeper review:

❯ Ultrathink about this implementation. What potential bugs, security issues, or edge cases might I have missed? Is this the right approach for our use case?

Verification Loops

Boris Cherny calls this "probably the most important thing": give Claude a way to verify its work.

🧪

Automated Tests

Tests provide immediate feedback: Claude can iterate until green, and each red-to-green cycle gives it something concrete to act on.

Setup: Configure Claude to run tests after changes. npm test -- --related
πŸ‘οΈ

Visual Feedback

For UI work, screenshots let Claude compare current state to target. Claude's visual accuracy is uneven: sometimes perfect on the first try, sometimes surprisingly off. After 2-3 iterations with feedback, results converge.

Setup: Playwright MCP for screenshots, or paste images into chat.
🔍

Separate Reviewer

A fresh Claude instance reviewing code catches issues the original might defend. The separate context matters: the reviewer has no investment in the choices already made.

Setup: /clear before review, or use a second Claude instance.
🎯
You don't trust; you instrument.

This feedback loop will 2-3x the quality of results. Without verification, you're stuck with whatever Claude produces first. With it, Claude can iterate toward a correct result.


Summary: Building Trustworthy Systems

The Key Insight

The developers who thrive with AI won't be the ones who write the best prompts. They'll be the ones who design systems where prompts don't need to be perfect, because the system's structure does most of the work, and the AI is just filling in blanks that are hard to fill incorrectly.

Remember:

  1. Better shapes beat better prompts: Structure systems so more work is simple and constrained
  2. Bookends bracket AI's work: Architecture before, tests after, visibility throughout
  3. Gradient of trust: Match scrutiny to risk; quarantine complexity
  4. 40% on code health: Or you'll spend 60% on debugging
  5. Verify, don't trust: Feedback loops 2-3x quality

Where to Go Next