Part IV: Govern It – Safe Evolution

Chapter 11 – Automated Refactoring Under Guards

Start with one bounded refactor:

$ npm run lint
src/utils/data-parser.js
  1:8  error  'unusedParser' is defined but never used  no-unused-vars

$ eslint --fix src/utils/data-parser.js
$ npm test
PASS
$ npm run lint
0 errors, 0 warnings

That is automated refactoring under guards in miniature: measure the baseline, apply one bounded mutation, rerun the guards, and admit the change only if the after-state is at least as good as the before-state.

In Chapter 10, we established Immutable Infrastructure: the non-negotiable boundary that keeps the system from rewriting the graders and guardrails that govern it.

Part IV is about engineering a capability: Neuroplasticity, the capacity for safe self-modification. A neuroplastic system can accept changes to itself without collapsing into regressions.

Refactoring is the action. Neuroplasticity is the capability. Immutable Infrastructure is the constraint that makes that capability safe.

The mechanism is Automated Refactoring Under Guards: a deterministic loop that proposes a bounded change, measures impact, and admits it only when the result passes hard gates. This is how you get evolution without drift.

By this point the roles should feel familiar. Sensors capture baseline and post-mutation measurements, the Effector applies one bounded refactor under an explicit Mission, Validators run the Immune System plus ratchets and policy gates, the Judge decides commit vs revert, and the Ledger keeps the evidence. The loop is safe only when those measurements and gates stay deterministic.

The Atomic Loop: Measure → Mutate → Measure → Commit/Revert

At its heart, automated refactoring under guards is a four-phase loop. Every candidate runs inside a tightly controlled sandbox.
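The four phases compress into one decision function. Below is a minimal Python sketch, assuming hypothetical `measure`, `mutate`, `commit`, and `revert` hooks supplied by your pipeline, with a coverage ratchet standing in for the full guard set:

```python
def refactor_under_guards(measure, mutate, commit, revert):
    """One pass of Measure -> Mutate -> Measure -> Commit/Revert."""
    before = measure()            # Phase 1: deterministic baseline
    mutate()                      # Phase 2: one bounded change candidate
    after = measure()             # Phase 3: same measurement pass, post-mutation
    # Phase 4: admit only if the after-state is at least as good
    ok = after["tests_passed"] and after["coverage"] >= before["coverage"]
    (commit if ok else revert)()
    return "commit" if ok else "revert"
```

In a real pipeline, `measure` would run the tests and metrics collectors inside the sandbox, and `revert` would discard the candidate workspace entirely.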

Phase 1: Pre-Mutation Measurement (The Baseline)

Before any autonomous change, the system captures a baseline. Run the relevant tests, static analysis, and metrics collectors first, so the candidate is judged against a known starting point.

Typical Pre-Mutation Measurements:

The output of this phase is the deterministic “before” state. That report is the reference point for everything that follows.

Phase 2: Automated Mutation (The Change Candidate)

With a baseline in hand, the agent proposes and applies one specific change. Generation may be stochastic (a large language model (LLM) suggestion, a dependency updater, a linter autofix), but the applied mutation must stay precise and bounded.

Generation can be stochastic; admission is deterministic.

Examples of Automated Mutations:

This mutation happens in isolation: a temporary branch, a sandboxed container, or another disposable workspace. Never mutate main or a live system directly.

Phase 3: Post-Mutation Validation (The Guards)

Immediately after the mutation, run the same measurement pass again. That gives you the deterministic “after” state. The core of automated refactoring under guards is the comparison between those two states.

Deterministic Gates (Validators): These are the non-negotiable rules that decide whether the candidate is acceptable. These are your guards.

Phase 4: The Decision Gate (Commit or Revert)

Based on post-mutation validation, the system makes an automated decision:

Guard Sufficiency (What counts as “comprehensive”?)

The phrase “comprehensive tests” is doing a lot of work.

Guard sufficiency is not a single number. It is a relationship between:

You can think of it as a simple rule: bigger refactors require stronger, more diverse guards.

Sufficiency signals (useful, but not decisive)

Some practical signals that can inform the “safe enough” decision:

| Signal | What it measures | Why it helps | Failure mode |
|---|---|---|---|
| Line and branch coverage | Reachability | Tells you what’s even being exercised | High coverage can still have weak assertions |
| Type checking strictness | Static invariants | Catches interface drift and many refactor mistakes early | Types can be missing or too permissive |
| Mutation score | Test strength | Checks whether tests detect small semantic changes | Expensive; can be noisy without stable tests |
| Contract/integration tests | Cross-module invariants | Protects public interfaces and side effects | Often slow; requires environment realism |
| Benchmarks/ratchets | Non-functional invariants | Catches “refactor that passes tests but regresses performance” | Requires stable harness and tolerance |

The pragmatic posture is to refuse broad refactors when guards are weak. Instead, run a “guard strengthening” mission first: add one missing test, add one contract check, or add one ratchet, then refactor.

Mutation testing (a guard for your guards)

Mutation testing is a direct way to check whether your tests are sensitive to behavioral changes. The tool edits your code in small ways (flip a comparison, remove a branch) and verifies that tests fail.

Example (Python):

# Install once
pip install mutmut

# Run mutations on a bounded surface
mutmut run --paths-to-mutate=src/auth/ --tests-dir=tests/auth/

# Inspect survivors (mutations that tests failed to catch)
mutmut results

Use mutation testing selectively. It’s too expensive as a default per-PR gate, but it is useful for deciding whether a module is safe for autonomous refactors.

A practical operating pattern:

  1. Run mutation testing before allowlisting a refactor class on a module.
  2. Track a simple mutation score (killed / total) over time.
  3. If score drops below your floor, pause broad autonomous refactors on that surface.
  4. Open a guard-strengthening mission (add assertions, contracts, or integration cases), then re-measure.
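The score and the pause rule from steps 2 and 3 are a few lines. A sketch (the 0.8 floor is an illustrative default, not a mutmut recommendation):

```python
def mutation_score(killed, total):
    """Fraction of injected mutations that at least one test detected."""
    return killed / total if total else 0.0

def broad_refactors_allowed(killed, total, floor=0.8):
    """Pause broad autonomous refactors when the score drops below the floor."""
    return mutation_score(killed, total) >= floor
```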

Treat survivors as map updates:

For extended recipes and stack variants, see Appendix C.

Scope Boundaries (How wide is the refactor allowed to reach?)

“Refactor the code” is not a Mission. A refactoring mission needs explicit boundaries.

One way to frame scope:

| Refactor type | Typical scope | Minimum guards | Default posture |
|---|---|---|---|
| Within-function cleanup | One symbol | Unit tests + types + lint | Often safe |
| Extract method / simplify logic | One module | Module tests + coverage signal + types | Usually safe when bounded |
| Rename with callers | One module or package | Tests + type checks + contract tests (if public) | Require tighter review |
| Public interface change | Multiple modules | Integration/consumer tests + explicit deprecation plan | Human review by default |
| Architecture change | System-wide | Full system checks + explicit approval | Rare and expensive |

This is the same idea as blast radius limits, but moved earlier in the pipeline: scope is part of the mission, not an afterthought.
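A scope gate can be enforced mechanically before any other guard runs. A sketch with hypothetical per-class limits on how many files one diff may touch:

```python
# Hypothetical per-class limits (files touched per diff); tune to your repo.
SCOPE_LIMITS = {
    "within_function": 1,
    "extract_method": 1,
    "rename_private": 3,
}

def within_scope(refactor_type, files_changed):
    """Reject a candidate whose diff reaches wider than its mission allows."""
    limit = SCOPE_LIMITS.get(refactor_type)
    if limit is None:
        return False          # unknown class: require human review by default
    return len(files_changed) <= limit
```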

Equivalence Classes (What does “no behavior change” mean?)

Refactoring is “change the how, keep the what.” But “the what” has layers:

This is where performance, memory, logging, and side-effect timing enter the picture. If they matter, they need to be promoted into the Map: benchmarks, invariants, and ratchets that make them gateable.

The Ratchet: Metrics That Can’t Go Backward

Here is the failure mode that kills Neuroplasticity:

The system learns to pass tests by weakening the tests.

If an autonomous loop can modify both the code and the validators, it can discover that the easiest way to achieve “all tests pass” is to delete the hard tests. This isn’t malice—it’s optimization finding the shortest path.

The Ratchet prevents this. It’s a governance mechanism that treats a chosen metric value m as strictly monotonic:

m_new ≥ m_old

If the new value is worse, the commit is rejected—regardless of whether tests pass.
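The monotonic rule is small enough to state as code. A sketch, with a `direction` flag matching the ratchet config later in this chapter and an optional tolerance for noisy metrics (the name `ratchet_ok` is illustrative, not from the companion repo):

```python
def ratchet_ok(new, old, direction, tolerance=0.0):
    """Admit `new` only if it is not worse than `old` in the protected direction."""
    if direction == "up":      # e.g. test count, coverage: can't decrease
        return new >= old - tolerance
    if direction == "down":    # e.g. lint violations, security findings: can't increase
        return new <= old + tolerance
    raise ValueError(f"unknown ratchet direction: {direction}")
```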

Ratchets (Quality Floors That Only Move Up)

Ratchets turn quality expectations into executable law. A candidate may fail, but the floor does not move down. If a candidate passes deterministic guards, the floor moves up and future loops start from a stronger baseline.

Admission loop:

  1. Measure baseline metrics.
  2. Mutate one bounded candidate.
  3. Validate ratchets plus policy gates.
  4. Decide: raise the floor, reject, or route to human review.

Deterministic guard rules:

test_count_new >= test_count_old
coverage_new   >= coverage_old
lint_new       <= lint_old
security_new   <= security_old

If all ratchets pass, policy decides: allowlisted auto-merge or explicit human approval.

Worked scenario (a small bounded mutation that should raise quality and pass the ratchets): test count 480 → 481, coverage 88.0% → 88.0%, lint violations 6 → 4, security findings 2 → 2, candidate quality 75/100 against a current floor of 72/100. Every gate passes, so the floor is raised.

What to ratchet:

| Metric | Ratchet Rule | Why |
|---|---|---|
| Test count | Can’t decrease | Prevents “pass by deletion” |
| Coverage % | Can’t decrease | Prevents coverage gaming |
| Type coverage | Can’t decrease | Prevents `any` creep |
| Lint violations | Can’t increase | Prevents gradual decay |
| Security findings | Can’t increase | Prevents vulnerability accumulation |
| API surface | Can’t expand without approval | Prevents scope creep |

Implementation:

# In your governance config or CI pipeline
ratchets:
  test_count:
    direction: up
    baseline_file: .metrics/test_count.json
    on_violation: reject

  coverage_percent:
    direction: up
    baseline_file: .metrics/coverage.json
    on_violation: reject
    tolerance: 0.1  # Allow tiny fluctuations from test timing

  lint_errors:
    direction: down
    baseline_file: .metrics/lint.json
    on_violation: reject

The baseline files are updated only when a commit passes all ratchets. They become the new floor.
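A sketch of that update step, assuming a hypothetical `results` shape (metric name mapped to pass/fail and value); the companion-repo scripts may differ:

```python
import json
from pathlib import Path

def update_baselines(results, baseline_dir):
    """Advance the floor only when every ratchet passed this run.

    `results` maps metric name -> {"passed": bool, "value": number}
    (an illustrative shape; adapt to your collectors).
    """
    if not all(r["passed"] for r in results.values()):
        return False          # any failure: the old floor stands
    out = Path(baseline_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, r in results.items():
        (out / f"{name}.json").write_text(json.dumps({"value": r["value"]}))
    return True
```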

Ratchet tolerance and flaky metrics (noise management)

Not all metrics are equally stable. Keep hard invariants strict, and handle noisy metrics explicitly.

A practical noise policy:

ratchets:
  benchmark_p95_ms:
    direction: down
    baseline_file: .metrics/benchmark_p95.json
    tolerance_mode: relative
    tolerance: 0.05              # 5%
    sample_runs: 5
    aggregation: median
    required_consecutive_failures: 2
    on_first_violation: warn
    on_repeated_violation: reject

Guidance:

  1. Pin the benchmark harness (hardware class, warmup, dataset, runtime flags).
  2. Compare aggregates (median-of-N or trimmed mean), not single runs.
  3. Use relative tolerances for variable metrics and absolute tolerances for counts.
  4. Escalate on persistent regression; avoid blocking on one noisy sample.
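Points 2 through 4 can be sketched directly. The helpers below are illustrative, assuming median-of-N aggregation, a relative band, and a pass/fail history per metric:

```python
from statistics import median

def benchmark_gate(samples, baseline, rel_tolerance=0.05):
    """Direction 'down': the median of N runs must stay within a relative band."""
    return median(samples) <= baseline * (1 + rel_tolerance)

def escalation(history, required_consecutive_failures=2):
    """'warn' on a single failure, 'reject' only on repeated failure.

    `history` is a list of booleans, True meaning the gate passed that run.
    """
    recent = history[-required_consecutive_failures:]
    if len(recent) == required_consecutive_failures and not any(recent):
        return "reject"
    return "warn" if history and not history[-1] else "pass"
```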

To enforce this in CI, the ratchet is typically a short shell gate:

OLD_COV=$(jq .pct .metrics/coverage.json)
pytest --cov --cov-report=json      # writes coverage.json to the working dir
NEW_COV=$(jq .totals.percent_covered coverage.json)
if (( $(echo "$NEW_COV < $OLD_COV" | bc -l) )); then
  echo "Ratchet failed: coverage dropped." && exit 1
fi

The companion repo (github.com/kjwise/aoi_code) includes ratchet-check and ratchet-baseline targets that demonstrate this. ratchet_check.py compares current metrics against a baseline, and ratchet_update_baseline.py updates the baseline.

The deeper principle:

A ratchet is a monotonic invariant: a property that can only move in one direction over the system’s lifetime. That is how you prevent drift in a self-modifying system.

Without ratchets, every quality metric becomes a negotiation. “We’ll fix the coverage later.” “This test was flaky anyway.” “The lint rule is too strict.” Each exception is small. The cumulative effect is decay.

With ratchets, the negotiation happens once—when you set the baseline. After that, the direction is locked.

Releasing a ratchet (human-only)

Ratchets aren’t permanent. Legitimate reasons to reset a baseline:

The key: releasing a ratchet requires explicit human approval and documentation. The system can’t release its own ratchets—that’s the point.

┌─────────────────────────────────────────────────────────┐
│  Ratchet Release Request                                │
│                                                         │
│  Metric: test_count                                     │
│  Current baseline: 847                                  │
│  Proposed baseline: 812                                 │
│  Reason: Removed deprecated billing module (35 tests)   │
│  Approved by: @platform-lead                            │
│  Date: 2024-03-15                                       │
│                                                         │
│  [Approve] [Reject] [Request more context]              │
└─────────────────────────────────────────────────────────┘

This audit trail is governance. Anyone can see when baselines changed and why.

Rollback Mechanics (what “revert” means operationally)

“Revert” is not one mechanism. Use the rollback pattern that matches the change surface:

| Change surface | Primary rollback | Typical trigger | Notes |
|---|---|---|---|
| Candidate code diff (pre-merge) | Drop candidate branch / discard workspace | Any guard failure | Default and cheapest path |
| Runtime behavior behind flag | Disable feature flag | Post-merge SLO breach | Requires flag discipline and owner |
| Deployment/runtime config | Blue-green or canary rollback | Error budget burn, health checks | Prefer automated rollback thresholds |

Concrete defaults:

  1. Pre-merge refactors: never mutate main; failed guard means no merge artifact exists.
  2. Post-merge exposure: gate high-risk behavior behind kill switches or flags.
  3. Production rollout: pair ratchets with rollout monitors so rollback can trigger automatically.

Rollback itself is part of the Immune System: test it, log it, and make it deterministic.

Containing the Blast Radius

Even with robust guards, keep each automated change small. Blast radius controls ensure that if an unforeseen issue does slip through, its effects stay localized and reversible.

Monorepo and multi-repo ratchets

Repository topology changes where ratchets live, not whether you need them.

Monorepo pattern:

Multi-repo pattern:

The rule is consistent: ratchet at the boundary where failure becomes expensive.

Refactoring patterns are useful because they define a bounded change shape. Guards become easier to write when you can name what the Effector is allowed to do.

| Pattern | Typical scope | Minimum guards | Notes |
|---|---|---|---|
| Extract function | One module | Tests + types + lint | Prefer one extraction per diff |
| Inline variable | One function | Tests + lint | Safe when assertions are strong |
| Rename symbol (private) | One module | Tests + type checks | Prefer automated rename tooling when available |
| Introduce parameter object | One module (sometimes cross-module) | Tests + types + integration checks | Easy to break call sites; keep diff small |
| Replace conditional with dispatch | One module/package | Tests + contract checks | Can change edge cases; watch equivalence class |
| Move function/class | Multi-file | Integration tests + ratchets | Treat as higher risk; enforce strict scope |

Multi-step refactors (chains)

Some refactors need multiple steps (extract → rename → move). Handle these as a chain of atomic diffs:
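One way to sketch such a chain, with hypothetical `measure` and `revert_to` checkpoint hooks: each step is guarded individually, and a failed guard rolls back only that step, leaving earlier admitted steps in place:

```python
def run_chain(steps, measure, revert_to):
    """Run a multi-step refactor as a chain of atomic, individually-guarded diffs.

    Each step is (name, apply_fn, guard_fn); a failed guard restores the last
    good checkpoint and halts the chain there.
    """
    checkpoint = measure()
    completed = []
    for name, apply_fn, guard_fn in steps:
        apply_fn()                        # one bounded mutation
        after = measure()
        if not guard_fn(checkpoint, after):
            revert_to(checkpoint)         # undo only this step
            break
        checkpoint = after                # this step becomes the new floor
        completed.append(name)
    return completed
```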

Worked Example: An Automated Lint Fix

Here is a concrete example: an automated agent applies a lint fix using the Measure → Mutate → Measure → Commit/Revert loop.

Imagine a .js file with an unused import that triggers a linter warning (ESLint: no-unused-vars).

1. Baseline Measurement: The CI pipeline is triggered (e.g., by a daily scheduled job or a change in a linting rule).

2. Automated Mutation: An autonomous agent (e.g., a script that runs eslint --fix on detected files) is invoked.

3. Post-Mutation Validation: The CI pipeline runs again on the feat/autofix-eslint-20231027 branch.

4. Decision Gate:

Had npm test failed, or if code coverage dropped, the automation would revert its changes, delete the candidate branch, and log the failure for human attention. This cycle runs unattended only for explicitly allowlisted low-risk change classes; otherwise it stops at human approval.

Actionable: What you can do this week

  1. Identify a Monotonic Metric: Choose one quality metric in your project (e.g., code coverage, number of lint errors, number of security vulnerabilities) that you want to prevent from backsliding.

  2. Add a “Ratchet” Check to CI: Configure your CI/CD pipeline to:

    • Capture the current value of this metric (e.g., coverage.json, eslint-report.json).

    • During a build, compare the new metric value against a persisted baseline (e.g., from main branch).

    • Fail the build if the new metric is worse than the baseline (e.g., new_coverage < old_coverage, new_errors > old_errors).

  3. Experiment with an Automated Linter Fix: Set up a scheduled job or a local script that runs eslint --fix (or equivalent for your language) on a small, well-tested part of your codebase. Manually verify the changes, but envision how this would fit into the Measure → Mutate → Measure loop.
