⚙ Case Study · Dogfooding

How Ratchet Improved Itself: 72 → 86

We ran Ratchet on Ratchet's own source code. Here's every commit, every rollback, and everything we learned — including the bugs it found in itself.

72 — Starting score
86 — Final score
9 improvements applied
5 rollbacks caught by guard
891/891 tests passing throughout
40 minutes of compute

The Full Timeline

Every inflection point — including the rollbacks that made the result trustworthy.

72 — Baseline scan (start)
83 — Route split (pattern carried over from DeuceDiary) · +11
85.5 — Logging + error handling · +2.5
98 — Score spike → reverted · +12.5
80 — Post-rollback · −18
84 — Targeted testing · +4
86 — Stable final score (current) · +2

Score bands: < 75 — Needs work · 75–85 — Good · > 85 — Strong
+14 — Points gained (72→86)
891 — Tests passing throughout
5 — Rollbacks auto-caught
44 — console.* calls migrated

The Baseline: 72/100

We ran ratchet scan . on Ratchet's own source directory. Score: 72 out of 100. Not embarrassing — but not what you'd want to show on a landing page.

The scan was honest about what it found: 44 console.* calls scattered across the codebase, a monolithic routes file approaching 2,000 lines, and shallow test coverage that was technically passing but not catching edge cases.

The thing about starting at 72: it felt low. It was low. But it was also accurate — which is exactly what a scorer is supposed to be.

ratchet scan .
Ratchet Code Quality Scan
=========================
Scanning ./ (ratchet-oss)
Parsed 312 TypeScript files
Type checked (tsc --noEmit)
Lint checked (eslint)

RESULTS
-------
Quality Score: 72 / 100
Issues found: 594
  Critical: 8
  Warnings: 143

Top categories:
  [console]      44 calls across 9 files
  [long-fn]      23 functions over 50 lines
  [coverage]     test/source ratio 23%
  [duplication]  412 duplicated lines

The Route Split: 72 → 83

The biggest single improvement came from splitting a ~2,000-line routes file into 13 domain-specific modules: admin, auth, scan, improve, and more.

Ratchet analyzed the handler groupings, extracted each domain into its own module, created a barrel export in index.ts, and wired the router. It renamed the pattern from DeuceDiary (a game app we'd been testing on) and applied the same decomposition to our own codebase.

891 tests. All green. One diff that was large but reviewable. That was the moment the tool felt like it worked.
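The decomposition itself follows the standard barrel-export pattern. A minimal self-contained sketch of the shape — the module names (admin, auth) come from the run log, but the handler types and route tables here are our own illustration, inlined into one file for readability:

```typescript
// Hypothetical sketch of the post-split layout. In the real codebase
// each domain lives in its own file (admin.ts, auth.ts, ...) and
// index.ts is the barrel that re-assembles the router.
type Handler = (path: string) => string;

// Domain module #1 (would be src/routes/admin.ts)
const adminRoutes: Record<string, Handler> = {
  "/admin/users": () => "list users",
};

// Domain module #2 (would be src/routes/auth.ts)
const authRoutes: Record<string, Handler> = {
  "/auth/login": () => "issue session",
};

// Barrel (would be src/routes/index.ts): one flat route table
// assembled from the domain modules, wired into the router.
export const routes: Record<string, Handler> = {
  ...adminRoutes,
  ...authRoutes,
};

console.log(Object.keys(routes).length); // 2
```

The payoff is that each domain file stays small enough to review in isolation, while the router only ever imports the barrel.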

"The code it wrote matched the existing style, respected naming conventions, and didn't try to 'improve' things that were fine."

ratchet improve · Click #2
Proposed improvement #2: Route decomposition
--------------------------------------------
Changes:
  - Split routes.ts (1,983 lines) into 13 modules
  - admin.ts, auth.ts, scan.ts, improve.ts, torque.ts,
    report.ts, config.ts, diff.ts, badge.ts, webhook.ts,
    sandbox.ts, docs.ts, index.ts (barrel)
Estimated score change: +11 points
Risk: Medium

Approve? [y/N] y
Applying...
  13 new files created, 1 removed
  TypeScript compiles cleanly
  Tests passing (891/891)
  Committed as 8c1e9f4

Score: 72 → 83/100 (+11)

Logging + Error Handling: 83 → 85.5

Two passes drove the next increment. First: migrating all 44 console.* calls to Pino — structured, leveled, machine-readable. Each call became logger.info(), logger.error(), or logger.warn(). A central logger.ts module kept it DRY.

Second: seven route handlers were missing try/catch blocks — letting exceptions surface as opaque 500s. Ratchet wrapped them, added structured error responses, and extracted duplicated error patterns into a shared helper.
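The shape of both passes can be sketched in a few lines. This is a dependency-free illustration, not Ratchet's actual code: the logger below mimics a Pino-style structured interface but stubs it with JSON writes (the real logger.ts wraps pino()), and handleError is our guess at the extracted helper named in the run output:

```typescript
// Hypothetical sketch. The real src/logger.ts wraps pino(); here a
// Pino-shaped stub keeps the example self-contained.
type LogFn = (fields: Record<string, unknown>, msg?: string) => void;

const write = (level: string): LogFn => (fields, msg) =>
  process.stdout.write(JSON.stringify({ level, ...fields, msg }) + "\n");

const logger = { info: write("info"), warn: write("warn"), error: write("error") };

// Shared helper extracted from the seven unwrapped handlers:
// log the failure structurally, return a structured error response.
function handleError(err: unknown): { status: number; error: string } {
  const message = err instanceof Error ? err.message : String(err);
  logger.error({ err: message }, "request failed");
  return { status: 500, error: message };
}

// A route handler after both passes: structured log instead of
// console.log, and a try/catch so exceptions stop surfacing as
// opaque 500s.
function scanHandler(run: () => string) {
  try {
    logger.info({ route: "/scan" }, "scan started"); // was console.log(...)
    return { status: 200, body: run() };
  } catch (err) {
    return handleError(err);
  }
}
```

The central helper is what keeps the 44-call migration DRY: each call site shrinks to a one-liner, and the error shape stays identical across handlers.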

Small increments. But the kind that compound. By this point, the scanner was honest about where the ceiling was.

ratchet improve · Clicks #3–5
Proposed improvement #3: Structured logging
-------------------------------------------
  - Replace 44 console.* calls → pino
  - Create src/logger.ts with pino instance
  - Update imports in 9 files
Score: 83 → 84.5 (+1.5)

Proposed improvement #4: Error handling
---------------------------------------
  - Add try/catch to 7 route handlers
  - Extract shared handleError() utility
  - Add structured error responses
Score: 84.5 → 85.5 (+1.0)

Tests passing (891/891)

The Score Spike: 85.5 → 98 → 80

Day 6 was the most instructive. Running in parallel worker mode — three workers simultaneously — produced a score of 98/100. For about four minutes, we thought we'd cracked it.

We hadn't. Two workers had modified the same auth middleware file. One removed an import the other depended on. A third introduced an infinite recursion bug in a recursive utility function. The guard system flagged all three before they reached main.
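The recursion bug is a classic failure mode: a traversal that follows references without tracking what it has already visited, so a cyclic structure recurses forever. The traverse name comes from the guard log; everything else below is our illustration of the bug class and its usual one-line fix:

```typescript
// Hypothetical illustration of the bug class the guard caught in
// src/utils/traverse.ts. A cycle (a → b → a) recurses forever
// unless visited nodes are tracked.
interface TreeNode {
  name: string;
  children: TreeNode[];
}

function traverse(node: TreeNode, seen = new Set<TreeNode>()): string[] {
  if (seen.has(node)) return []; // the fix: bail out on cycles
  seen.add(node);
  return [node.name, ...node.children.flatMap((c) => traverse(c, seen))];
}

// Build a cycle: a → b → a. Without the `seen` set this would
// overflow the stack; with it, each node is visited exactly once.
const a: TreeNode = { name: "a", children: [] };
const b: TreeNode = { name: "b", children: [a] };
a.children.push(b);

console.log(traverse(a)); // [ 'a', 'b' ]
```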

Rollback. Score landed at 80 — below where we were before, because some of the parallel changes had been partially applied.

This was the moment the conservative, one-at-a-time design philosophy stopped being a limitation and started being a feature. The tool caught its own mistakes. No broken build reached production.

ratchet torque --workers 3 · Day 6
# Parallel run — 3 workers
Worker 1: auth middleware refactor...
Worker 2: auth middleware refactor...
  Conflict: both workers modified src/middleware/auth.ts
Worker 3: recursive util improvement...
  Guard caught: infinite recursion src/utils/traverse.ts:47

Rolling back 3 changes...
  Reverted worker-1 changes
  Reverted worker-2 changes
  Reverted worker-3 changes

Score: 80/100 (partial-apply penalty)
Recommendation: Use sequential mode.
  Parallel runs risk cross-file conflicts.

Back on Track: 80 → 84 → 86

After the rollback, we diagnosed the gap. The scoring guide showed exactly where points were hiding: test quality sat at 1/8 (assertions per test averaging 1.4, below the 2.0 threshold), and the structured logging subcategory still had headroom at 3/7 because a few log calls had slipped through the Pino migration.

Two focused sequential clicks: add assertion density to 12 test files, then sweep the remaining console calls. Score moved from 80 to 84 to 86. All 891 tests still green.
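"Assertion density" here simply means assertions per test case, and the 2.0 threshold comes from the scoring guide. A before/after sketch — the parseScore function and the test bodies are hypothetical, not from Ratchet's suite:

```typescript
// Hypothetical before/after for the assertion-density pass.
// A single truthiness assertion passes even when the shape of the
// result regresses; asserting each field callers rely on raises
// density past the 2.0 threshold.
function parseScore(raw: string): { score: number; max: number } {
  const [score, max] = raw.split("/").map(Number);
  return { score, max };
}

// Before (density 1.0): only checks that *something* came back.
//   console.assert(parseScore("72/100") !== undefined);

// After (density 3.0): checks every field callers depend on.
const result = parseScore("72/100");
console.assert(result.score === 72);
console.assert(result.max === 100);
console.assert(result.score <= result.max);
```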

The plateau was honest too — at 86, the remaining points require architectural changes, not mechanical fixes. Ratchet is good at the mechanical work.

ratchet scan · Subcategory breakdown
Score breakdown — 80/100
🧪 Testing         18/25   ← headroom: +7
  ↳ Coverage ratio    8/8
  ↳ Edge case depth   9/9
  ↳ Test quality      1/8   ← assertions 1.4x
🔒 Security        13/15
📝 Type Safety     15/15   maxed
⚠️ Error Handling  15/20   ← headroom: +5
⚡ Performance     10/10   maxed
📖 Code Quality     9/15   ← headroom: +6
  ↳ Structured log    3/7   ← 4 pts available

Targeting: Testing/quality + logging

Score Breakdown: 86/100

Where every point comes from — and where the next ones are hiding.

Production Readiness Score
86/100
🧪 Testing 23/25
🔒 Security 13/15
📝 Type Safety 15/15
⚠️ Error Handling 15/20
⚡ Performance 10/10
📖 Code Quality 10/15

Remaining 14 points: assertion density (test quality 6→8), structured logging (3→7), duplication cleanup.

6 Lessons from Eating Our Own Dog Food

These became product improvements. All of them are in the backlog with tickets.

🎯
Score impact beats issue count
The agent spent 11 consecutive clicks fixing the already-maxed Type Safety category (15/15). It picks the highest issue count, not the highest score headroom. We're adding priority = (maxScore − current) × fixProbability.
🔄
Sequential beats parallel
Three parallel workers produced a 98 — then a rollback to 80. Workers modified the same file independently. Sequential mode, one click at a time, is slower but safe. This is now the default and the docs are clear about why.
🛡️
The guard system earned its keep
5 rollbacks. 5 times the guard caught something before it reached main. One was infinite recursion. One removed a null check that a downstream type assertion depended on. Without the guard, both would have shipped.
📊
Read the subcategory breakdown
The total score hides where the points are. We found 4 available points in structured logging (3/7) that were invisible at the top level. ratchet scan output is your best friend — the subcategory table matters.
🔁
Stall detection needs a category switch
After 2 consecutive ±0 clicks, the agent escalated to "cross-file sweep mode" — which just did the same thing with more files. Fix: 2 stalls → force-switch to highest-headroom category. 3 stalls → stop, report ceiling reached.
📈
85–86 is the realistic plateau
Above 86, the remaining points require architectural decisions: assertion density targets, data layer refactors, consistent error taxonomy. These need a human. Ratchet handles the mechanical work. Know the ceiling before you start.

The Complete Run Log

ratchet-oss · Full run summary
Run Summary — ratchet-oss dogfood
=================================
Metric                         Value
------                         -----
Starting score                 72 / 100
Final score                    86 / 100
Net gain                       +14 pts
Total improvements attempted   14
Improvements that landed       9
Rollbacks                      5
Tests passing (throughout)     891 / 891
console.* calls migrated       44 (9 files)
Route modules created          13
Parallel run used              once (don't)
Score plateau reached          85–86
Total compute time             ~40 minutes
Estimated API cost             ~$3.20

Verdict: 72→86 is real. Every commit is verifiable.
Every rollback was automatic. The build never broke.
Free scan · No credit card

Try Ratchet on your codebase

Get your score in under 60 seconds. See exactly what's holding you back — before you commit to anything.

npm install -g ratchet-run && ratchet scan
Start improving → Try the sandbox

Builder $19/mo · Pro $49/mo · BYOK (bring your own API key)