⚙ Case Study · Dogfooding

How Ratchet Improved Itself: 74 → 98

We ran Ratchet on Ratchet's own source code. Here's the real trajectory — including bugs we found in the scanner itself, the false positives it was generating, and the architect-level cleanup that got us to 98.

74 — Starting score
98 — Final score (v1.1.0)
12 days — Mar 13–25, 2026
5 rollbacks auto-caught
1,725 tests passing
Scanner bugs found in ourselves

$ the full timeline

Six inflection points over 12 days. The dip at days 5–9 is the interesting one.

Mar 13 — Baseline: 74 (start)
Mar 16 — Pino + rate limiters: 83 (+9)
Mar 17 — Auth DRY + errors: 85 (+2)
Mar 22–23 — Scanner fixed: 86 (+1)
Mar 24 — Webhook + file classifier: 93 (+7)
Mar 25 — Architect cleanup: 98 (+5, current)

Scale: < 75 — Needs work · 75–89 — Good · 90+ — Strong

+24 — Points gained (74→98)
1,725 — Tests passing (89 files)
5 — Rollbacks auto-caught
567 — Duplicated lines eliminated

The Baseline: 74/100

First commit. We ran ratchet scan . on Ratchet's own source directory. Score: 74 out of 100.

The scan flagged what you'd expect from a fast-moving early codebase: overly broad rate limiters that treated all endpoints identically, unstructured console.* calls throughout the server, and a pattern-matching approach to security scanning that used regex without AST confirmation.

The score was accurate. That was the point — the tool wasn't going to flatter itself.

ratchet scan . — Mar 13
Ratchet Code Quality Scan
=========================
Scanning ./ (ratchet source)
Parsed TypeScript files
Type checked (tsc --noEmit)
Running detectors...

Score Breakdown
---------------
🔒 Security       11/15
📝 TypeSafety     12/15
⚠️ ErrorHandling  14/20
⚡ Performance     9/10
📖 CodeQuality    11/15
🧪 Testing        17/25

Quality Score: 74 / 100

Top issues:
  [rate-limiter] overly broad — all routes same limit
  [logging] unstructured console.* calls
  [security] unconfirmed regex patterns

Structured Logging + Rate Limiters: 74 → 83

Two passes drove the first big jump. First: migrating all console.* calls to Pino — structured, leveled, machine-readable. A central logger.ts module, consistent log levels across the server.
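The Pino migration itself is mostly configuration, but the shape that replaced the bare console.* calls can be sketched without the dependency. The sketch below is illustrative only, not Ratchet's actual logger.ts: a single shared instance, a level threshold, and one JSON line per event.

```typescript
// Hand-rolled stand-in for a structured, leveled logger (the real
// migration used Pino). Shows the shape that replaced console.* calls:
// machine-readable JSON lines, filtered by a minimum level.
type Level = "debug" | "info" | "warn" | "error";
const LEVELS: Record<Level, number> = { debug: 10, info: 20, warn: 30, error: 40 };

function createLogger(
  minLevel: Level = "info",
  sink: (line: string) => void = console.log,
) {
  const log = (level: Level, msg: string, ctx: Record<string, unknown> = {}) => {
    if (LEVELS[level] < LEVELS[minLevel]) return; // below threshold: drop
    sink(JSON.stringify({ level, msg, time: Date.now(), ...ctx }));
  };
  return {
    debug: (msg: string, ctx?: Record<string, unknown>) => log("debug", msg, ctx),
    info: (msg: string, ctx?: Record<string, unknown>) => log("info", msg, ctx),
    warn: (msg: string, ctx?: Record<string, unknown>) => log("warn", msg, ctx),
    error: (msg: string, ctx?: Record<string, unknown>) => log("error", msg, ctx),
  };
}
```

The payoff over console.log is that every line carries structured context (route, request id, whatever the caller attaches) that a log aggregator can query.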

Second: the rate limiter was applying a single broad limit to every route. Authentication endpoints, scan endpoints, and webhook endpoints all behave differently under load. We split the limiters by domain — stricter on auth, more permissive on read-heavy scan endpoints.
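The split can be sketched as a plain fixed-window counter keyed by domain. The auth and scan limits below match the Mar 16 run log; the webhook cap of 300 req/min and the implementation itself are assumptions for illustration, not the server's actual middleware.

```typescript
// Illustrative per-domain fixed-window rate limiter. auth: 20 req/min and
// scan: 60 req/min come from the case study; webhook: 300 req/min is an
// assumed "burst-tolerant" value.
type Domain = "auth" | "scan" | "webhook";
const LIMITS: Record<Domain, number> = { auth: 20, scan: 60, webhook: 300 };

function makeLimiter(now: () => number = Date.now) {
  const windows = new Map<string, { start: number; count: number }>();
  return (domain: Domain, clientId: string): boolean => {
    const key = `${domain}:${clientId}`;
    const t = now();
    const w = windows.get(key);
    if (!w || t - w.start >= 60_000) {
      windows.set(key, { start: t, count: 1 }); // open a fresh 1-minute window
      return true;
    }
    w.count += 1;
    return w.count <= LIMITS[domain]; // reject once past the domain's cap
  };
}
```

Because the key combines domain and client, a client exhausting its auth budget can still hit read-heavy scan endpoints.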

Both were real problems the tool correctly identified in itself. Neither fix was glamorous. Both moved the needle.

ratchet improve — Mar 16
Improvement: Structured logging
-------------------------------
- Migrate console.* → pino logger
- Create src/logger.ts (shared instance)
- Update log calls across server files
Applied · Tests passing
Score: 74 → 79 (+5)

Improvement: Route-aware rate limiting
--------------------------------------
- Split global limiter → per-domain limits
  - auth: strict (20 req/min)
  - scan: moderate (60 req/min)
  - webhooks: burst-tolerant
Applied · Tests passing
Score: 79 → 83 (+4)

Auth Utils DRY + Error Handling: 83 → 85

Two targeted improvements. Auth utility functions had grown duplicated across the codebase — token validation logic repeated in multiple handlers rather than centralized. We extracted a shared auth utils module.

The mutation error handler was catching errors but re-throwing them with the original stack lost. Routes that modified state weren't producing useful error context on failure. A structured handleMutationError() helper unified the pattern.
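handleMutationError() is named in the case study but its body isn't shown, so the sketch below is one plausible shape under that name: wrap the original error instead of re-throwing a fresh one, so the underlying stack survives into the structured error.

```typescript
// Assumed shape of a mutation error helper that preserves the original
// stack on re-throw by wrapping rather than replacing the error.
class MutationError extends Error {
  constructor(
    public readonly operation: string,
    public readonly original: Error,
  ) {
    super(`mutation "${operation}" failed: ${original.message}`);
    this.name = "MutationError";
    // Keep the original stack attached instead of losing it on re-throw.
    this.stack = `${this.name}: ${this.message}\ncaused by: ${original.stack}`;
  }
}

function handleMutationError(operation: string, err: unknown): never {
  const original = err instanceof Error ? err : new Error(String(err));
  throw new MutationError(operation, original);
}
```

Route handlers that modify state can then catch and funnel everything through one call, giving every mutation failure the same structured shape.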

Small increments. The kind that compound. The tool is good at finding them — and at applying them without touching unrelated code.

ratchet improve — Mar 17
Improvement: Auth utils DRY
---------------------------
- Extract shared src/utils/auth.ts
- Deduplicate token validation logic
- 3 files updated, 1 new module
Applied · TypeScript clean
Score: 83 → 84 (+1)

Improvement: Mutation error handler
-----------------------------------
- Add handleMutationError() utility
- Preserve stack context on re-throw
- Structured error shape for mutations
Applied · Tests passing
Score: 84 → 85 (+1)

We Found Bugs in the Scanner: 85 → 86

The net change looks small (+1). What happened underneath was not.

The scanner had false positives. Example code in the repository — fake API keys used in documentation and test fixtures — was triggering the security detector. The regex-based patterns couldn't distinguish a literal example string from a real leaked secret.

More critically: the file classifier wasn't excluding documentation directories and test fixtures from production code analysis. Test files were being scored as production coverage, inflating the apparent test/source ratio. When we fixed it, some scores recalibrated downward before the real improvements took hold.
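The classifier's job reduces to a small decision: is this path production code, test code, or excluded entirely? The exclusion list below mirrors the case study (docs/**, fixtures/**, examples/**); the function itself and its name are illustrative.

```typescript
// Illustrative production-vs-fixture file classifier. Exclusion dirs
// mirror the case study; the matching logic is an assumed sketch.
const EXCLUDED_DIRS = ["docs/", "fixtures/", "examples/"];

type FileClass = "production" | "test" | "excluded";

function classifyFile(path: string): FileClass {
  const p = path.replace(/\\/g, "/"); // normalize Windows separators
  if (EXCLUDED_DIRS.some((d) => p.startsWith(d) || p.includes(`/${d}`))) {
    return "excluded"; // never scored, never counted toward coverage
  }
  if (/\.(test|spec)\.[jt]sx?$/.test(p) || p.includes("__tests__/")) {
    return "test"; // counted as tests, not as production coverage
  }
  return "production";
}
```

With this split, the test/source ratio is computed only over production files, which is what recalibrated the apparent coverage downward.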

We replaced naive regex matching with AST confirmation: patterns now require a valid AST node context before firing. The file classifier gained production exclusion rules. Both changes made the tool more honest — and its scores more trustworthy.
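The two-stage gate looks like this in sketch form: a regex only produces a candidate, and a confirmation step must also pass before anything is flagged. In the real scanner that step is an AST lookup of the node the match falls inside; here a passed-in predicate stands in for it, and the key pattern is a made-up example.

```typescript
// Sketch of regex-plus-confirmation secret detection. The `confirm`
// callback stands in for the real AST-context check; the key pattern
// is illustrative, not Ratchet's actual detector.
interface Finding {
  file: string;
  index: number;
  match: string;
}

function detectSecrets(
  file: string,
  source: string,
  confirm: (f: Finding) => boolean, // AST/context confirmation in the real tool
): Finding[] {
  const pattern = /sk-[A-Za-z0-9-]{8,}/g;
  const findings: Finding[] = [];
  for (const m of source.matchAll(pattern)) {
    const candidate = { file, index: m.index ?? 0, match: m[0] };
    if (confirm(candidate)) findings.push(candidate); // regex alone never fires
  }
  return findings;
}
```

The design point is that the regex stage stays cheap and broad, while the confirmation stage carries the precision: a literal in a docs example fails confirmation and never surfaces.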

scanner accuracy overhaul — Mar 22
# False positive found in security scanner

Secret detector firing on example code:
  docs/examples/config.ts:12
  API_KEY = "sk-example-not-real-1234"
Pattern matched but not in production path.

# Fix 1: AST confirmation
- Require node context before flagging
- regex match alone → insufficient
- Must confirm: non-test, non-doc scope

# Fix 2: File classifier
- Exclude: docs/**, fixtures/**, examples/**
- Production code only for scoring
- Test ratio recalculated accurately

False positives eliminated
Score recalibrated (honest baseline)
Score: 85 → 86 (net +1, accuracy: significantly improved)

Webhook Verification + Security Push: 86 → 93

With the scanner now accurate, the remaining security points became visible. The webhook handler was accepting payloads without verifying the HMAC signature — a real security gap, not a false positive.

Adding verifyWebhookSignature() brought the security category from partial to near-complete. The file classifier also picked up additional production exclusion rules, further tightening the accuracy of the production code surface area.
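An HMAC check of this kind is compact. The sketch below uses Node's crypto module; verifyWebhookSignature() is the name from the case study, but the hex encoding and the exact signature format are assumptions. The two essentials are recomputing the digest over the raw body and comparing in constant time.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch of an HMAC-SHA256 webhook signature check. Signature encoding
// (hex) is an assumption; the constant-time comparison is the point.
function verifyWebhookSignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // Length check first: timingSafeEqual throws on mismatched lengths.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

Note the comparison runs over the raw request body, before any JSON parsing: re-serializing a parsed payload can change byte order or whitespace and break the signature.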

This jump (+7) was the payoff from having fixed the scanner first. A less accurate scanner wouldn't have shown the real security gap — it would have been hidden behind noise.

ratchet scan . — Mar 24
Score Breakdown
---------------
🔒 Security       14/15  ← webhook sig added
📝 TypeSafety     15/15  maxed
⚠️ ErrorHandling  20/20  maxed
⚡ Performance    10/10  maxed
📖 CodeQuality    12/15  ← duplication remains
🧪 Testing        22/25

Quality Score: 93 / 100

Remaining gaps:
  Security -1 (minor: 1 input validation gap)
  Quality  -3 (567 duplicated lines across helpers)
  Testing  -3 (assertion density in 4 test files)

Architect Mode Finds What Clicks Miss: 93 → 98

The 567 duplicated lines were spread across shared engine helpers — similar patterns repeated across multiple files that individual click-by-click improvements had worked around but never eliminated. Each click improved something. None of them could see the full pattern.

Architect mode operates differently: it analyzes the entire codebase graph first, identifies cross-file duplication, then generates a coordinated refactor. One pass. It extracted the shared helpers, updated all references, and removed the duplication cleanly.
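One common way to find cross-file duplication is hashing normalized sliding windows of lines and reporting hashes that recur across files. The sketch below illustrates that general technique; it is not Ratchet's actual architect implementation, and the window size is arbitrary.

```typescript
import { createHash } from "node:crypto";

// Illustrative cross-file duplication finder: hash each whitespace-
// normalized window of `windowSize` lines, then report windows that
// appear in more than one file. Not the real architect pass.
function findDuplicateWindows(
  files: Record<string, string>,
  windowSize = 3,
): Map<string, string[]> {
  const seen = new Map<string, Set<string>>(); // window hash -> files
  for (const [name, text] of Object.entries(files)) {
    const lines = text
      .split("\n")
      .map((l) => l.trim())
      .filter((l) => l.length > 0);
    for (let i = 0; i + windowSize <= lines.length; i++) {
      const h = createHash("sha1")
        .update(lines.slice(i, i + windowSize).join("\n"))
        .digest("hex");
      let owners = seen.get(h);
      if (!owners) {
        owners = new Set();
        seen.set(h, owners);
      }
      owners.add(name);
    }
  }
  const clusters = new Map<string, string[]>();
  for (const [h, owners] of seen) {
    if (owners.size > 1) clusters.set(h, [...owners].sort());
  }
  return clusters;
}
```

Adjacent duplicated windows can then be merged into spans, which is how a report like "scanner.ts lines 44–91" falls out of the raw window hits.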

That was the last 5 points. Combined with the security and testing work already done, the final score settled at 98/100 — perfect in 5 of 6 categories. This was the v1.1.0 release commit.

ratchet architect — Mar 25
Architect Analysis
------------------
Scanning cross-file patterns...

Duplication cluster found:
  src/engine/scanner.ts   (lines 44–91)
  src/engine/improve.ts   (lines 12–58)
  src/engine/architect.ts (lines 77–124)
Pattern: shared helper logic, 567 lines total

Proposed: extract src/engine/helpers.ts
- 3 files updated
- 1 new shared module
- Zero behavior change

Approve? [y/N] y

Extracted helpers.ts
TypeScript compiles cleanly
Tests passing (1725/1725)
Score: 93 → 98/100 (+5)

$ score breakdown: 98/100

Perfect in 5 of 6 categories. The remaining 2 points are in Testing (assertion density).

Production Readiness Score
98/100
🔒 Security       15/15
📝 TypeSafety     15/15
⚠️ ErrorHandling  20/20
⚡ Performance    10/10
📖 CodeQuality    15/15
🧪 Testing        23/25

Remaining 2 points: assertion density in 4 test files (threshold: 2.0 assertions/test average).
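The metric itself is simple: average assertions per test in a file, flagged below a threshold. The counting below is a naive sketch by string matching; the real detector's method isn't described in the case study, only the 2.0 threshold.

```typescript
// Naive sketch of the assertion-density metric: assertions per test,
// counted by pattern matching. Threshold of 2.0 comes from the run log;
// everything else here is illustrative.
function assertionDensity(testSource: string): number {
  const tests = (testSource.match(/\b(it|test)\(/g) ?? []).length;
  const assertions = (testSource.match(/\bexpect\(|\bassert\./g) ?? []).length;
  return tests === 0 ? 0 : assertions / tests;
}

const meetsThreshold = (src: string, threshold = 2.0): boolean =>
  assertionDensity(src) >= threshold;
```

A low ratio usually means tests that execute code without checking much about it, which is why it lingers as the last deduction.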

$ 6 lessons from running it on ourselves

These became product improvements. Each one was a real finding, not a hypothetical.

🔬
Fix your scanner before trusting your score
We had false positives: fake secrets in example code triggering the security detector, test fixtures inflating coverage ratios. The 85→86 step was mostly accuracy work. A score from an inaccurate scanner is worse than no score.
🌳
AST confirmation beats regex alone
Pattern matching without AST context generates noise. A string that looks like a secret in a documentation example is not a secret. Requiring a valid AST node context before flagging eliminated false positives without missing real issues.
📁
Score production code, not test fixtures
The file classifier had to learn what "production code" means in this codebase: not docs/**, not fixtures/**, not examples/**. Without that distinction, coverage ratios are meaningless. Getting it right took iteration.
🏗️
Architect mode sees what clicks can't
567 duplicated lines spread across three engine files were invisible to individual improvements — each click improved something adjacent. Architect mode analyzed the full graph, found the pattern, and eliminated it in one coordinated refactor.
🛡️
The guard system earned its keep
5 rollbacks over 12 days. Each one was a real problem caught before it reached main: import ordering issues, a removed null check that downstream code depended on, partial applies from concurrent edits. The guard is not overhead — it's the whole point.
📈
98 is achievable in 12 days
The score went from 74 to 98 on a real production codebase with 1,725 passing tests, zero broken builds, and every change reviewable in git. The last two points (Testing: 23/25) are assertion density — a known, bounded problem.

$ the complete run log

ratchet · Full dogfood summary
Run Summary — ratchet self-improvement (v1.1.0)
===============================================
Metric                       Value
------                       -----
Duration                     12 days (Mar 13–25, 2026)
Starting score               74 / 100
Final score                  98 / 100
Net gain                     +24 pts
Tests passing (final)        1725 / 1725 (89 files)
Rollbacks auto-caught        5
Duplicated lines eliminated  567
False positives fixed        yes (AST confirmation + file classifier)
Build broken at any point    never
Release tag                  v1.1.0

Final category scores
---------------------
🔒 Security       15/15  perfect
📝 TypeSafety     15/15  perfect
⚠️ ErrorHandling  20/20  perfect
⚡ Performance    10/10  perfect
📖 CodeQuality    15/15  perfect
🧪 Testing        23/25  2 pts remaining (assertion density)

Verdict: 74→98 is real. Every commit is in git. Every rollback was automatic. Build never broke.
Free scan · No credit card

Try Ratchet on your codebase

Get your score in under 60 seconds. See exactly what's holding you back — before you commit to anything.

npm install -g ratchet-run && ratchet scan

Builder $19/mo · Pro $49/mo · BYOK (bring your own API key)