Chapter 11 – Automated Refactoring Under Guards
Start with one bounded refactor:
$ npm run lint
src/utils/data-parser.js
1:8 error 'unusedParser' is defined but never used no-unused-vars
$ eslint --fix src/utils/data-parser.js
$ npm test
PASS
$ npm run lint
0 errors, 0 warnings
That is automated refactoring under guards in miniature: measure the baseline, apply one bounded mutation, rerun the guards, and admit the change only if the after-state is at least as good as the before-state.
In Chapter 10, we established Immutable Infrastructure: the non-negotiable boundary that keeps the system from rewriting the graders and guardrails that govern it.
Part IV is about engineering a capability: Neuroplasticity, safe self-modification. It means the system can accept self-modification without collapsing into regressions.
Refactoring is the action. Neuroplasticity is the capability. Immutable Infrastructure is the constraint that makes that capability safe.
The mechanism is Automated Refactoring Under Guards: a deterministic loop that proposes a bounded change, measures impact, and admits it only when the result passes hard gates. This is how you get evolution without drift.
By this point the roles should feel familiar. Sensors capture
baseline and post-mutation measurements, the Effector applies one
bounded refactor under an explicit Mission, Validators run the Immune
System plus ratchets and policy gates, the Judge decides
commit vs revert, and the Ledger keeps the
evidence. The loop is safe only when those measurements and gates stay
deterministic.
The Atomic Loop: Measure → Mutate → Measure → Commit/Revert
At its heart, automated refactoring under guards is a four-phase loop. Every candidate runs inside a tightly controlled sandbox.
Phase 1: Pre-Mutation Measurement (The Baseline)
Before any autonomous change, the system captures a baseline. Run the relevant tests, static analysis, and metrics collectors first, so the candidate is judged against a known starting point.
Typical Pre-Mutation Measurements:
Immune System Execution: All unit, integration, and end-to-end cases must pass.
Code Coverage: Current percentage of code covered by tests.
Static Analysis: Linter warnings, security vulnerabilities detected by SAST (Static Application Security Testing) tools, complexity metrics.
Architectural Conformance: Checks against defined architectural rules (e.g., dependency inversion, package layering).
Performance Benchmarks: Baseline metrics for critical paths.
The output of this phase is the deterministic “before” state. That report is the reference point for everything that follows.
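The baseline capture can be sketched as a small collector that runs each measurement once and freezes the result as JSON. This is a minimal sketch, not a specific tool's API: the metric names and the stub collectors standing in for `npm test`, coverage, and lint are illustrative.

```python
import json
from pathlib import Path

def capture_baseline(collectors, path):
    """Run each collector once and persist the deterministic 'before' state.

    `collectors` maps a metric name to a zero-argument callable.
    Metric names here are illustrative, not from a specific tool.
    """
    baseline = {name: fn() for name, fn in collectors.items()}
    Path(path).write_text(json.dumps(baseline, indent=2, sort_keys=True))
    return baseline

# Example wiring with stub collectors (real ones would parse tool output):
baseline = capture_baseline(
    {"tests_passed": lambda: True,
     "coverage_percent": lambda: 88.0,
     "lint_errors": lambda: 1},
    path="baseline.json",
)
```

The frozen file is the reference point every later phase compares against.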
Phase 2: Automated Mutation (The Change Candidate)
With a baseline in hand, the agent proposes and applies one specific change. Generation may be stochastic (a large language model (LLM) suggestion, a dependency updater, a linter autofix), but the applied mutation must stay precise and bounded.
Generation can be stochastic; admission is deterministic.
Examples of Automated Mutations:
Dependency Updates: Automatically upgrading a library to a newer patch or minor version.
Code Formatting/Linting: Applying prettier or black autofixes.
Refactoring Suggestions: Renaming variables, extracting methods, or simplifying expressions based on defined patterns or AI suggestions.
Security Patches: Applying known fixes for vulnerabilities identified in dependencies.
Configuration Updates: Adjusting settings based on environment changes or best practices.
This mutation happens in isolation: a temporary branch, a sandboxed
container, or another disposable workspace. Never mutate
main or a live system directly.
Phase 3: Post-Mutation Validation (The Guards)
Immediately after the mutation, run the same measurement pass again. That gives you the deterministic “after” state. The core of automated refactoring under guards is the comparison between those two states.
Deterministic Gates (Validators): These are the non-negotiable rules that decide whether the candidate is acceptable. These are your guards.
All Tests Must Pass: No regressions in existing functionality. This is the first and most fundamental guard.
The Ratchet Principle: Quality metrics, once established, should only ever improve or remain the same; they must never degrade. In practice: reject if m_new is worse than m_old. (Detailed ratchet rules are defined later in this chapter.)
Architectural Conformance: The change must not violate any predefined architectural rules.
Blast Radius Limits: The scope of the change (e.g., number of lines changed, number of files affected) might be capped. Changes exceeding a certain threshold could automatically trigger a revert or require human review.
Phase 4: The Decision Gate: Commit or Revert
Based on post-mutation validation, the system makes an automated decision:
If all guards pass: The change is deemed safe and beneficial (or at least non-regressive). The mutation is proposed as a pull request with the before/after evidence attached. Merge can be automatic only for explicitly allowlisted change classes; otherwise this is where human review happens.
If any guard fails: The change is automatically discarded. The temporary branch is deleted, and the system reverts to its pre-mutation state. A detailed report of the failure (what guard was tripped, why) is generated and can be sent to a human for review. This prevents bad changes from ever reaching the codebase.
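The commit-or-revert decision above can be sketched as a single pure comparison between the before and after states. This is a simplified stand-in, not the book's reference implementation; the metric names and the blast-radius cap are illustrative.

```python
def evaluate_guards(before, after, files_changed=1, max_files_changed=5):
    """Compare the deterministic before/after states against hard gates.

    Returns ("commit" | "revert", list of tripped guards).
    Metric names and the file cap are illustrative policy choices.
    """
    reasons = []
    if not after["tests_passed"]:
        reasons.append("tests failed")                      # first, fundamental guard
    if after["coverage_percent"] < before["coverage_percent"]:
        reasons.append("coverage ratchet violated")         # quality may not degrade
    if after["lint_errors"] > before["lint_errors"]:
        reasons.append("lint ratchet violated")
    if files_changed > max_files_changed:
        reasons.append("blast radius exceeded")             # scope cap
    return ("commit" if not reasons else "revert"), reasons
```

On "revert", the reasons list becomes the failure report sent to a human; on "commit", the evidence is attached to the pull request.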
Guard Sufficiency (What counts as “comprehensive”?)
The phrase “comprehensive tests” is doing a lot of work.
Guard sufficiency is not a single number. It is a relationship between:
- Refactor scope: how wide the mutation is allowed to reach.
- Guard strength: what Validators you run, and what they actually constrain.
- Signal quality: whether your tests contain meaningful assertions, not just line execution.
You can think of it as a simple rule: bigger refactors require stronger, more diverse guards.
Sufficiency signals (useful, but not decisive)
Some practical signals that can inform the “safe enough” decision:
| Signal | What it measures | Why it helps | Failure mode |
|---|---|---|---|
| Line and branch coverage | Reachability | Tells you what’s even being exercised | High coverage can still have weak assertions |
| Type checking strictness | Static invariants | Catches interface drift and many refactor mistakes early | Types can be missing or too permissive |
| Mutation score | Test strength | Checks whether tests detect small semantic changes | Expensive; can be noisy without stable tests |
| Contract/integration tests | Cross-module invariants | Protects public interfaces and side effects | Often slow; requires environment realism |
| Benchmarks/ratchets | Non-functional invariants | Catches “refactor that passes tests but regresses performance” | Requires stable harness and tolerance |
The pragmatic posture is to refuse broad refactors when guards are weak. Instead, run a “guard strengthening” mission first: add one missing test, add one contract check, or add one ratchet, then refactor.
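That refusal posture can be encoded as a small routing function. A hedged sketch: the scope names, the composite `guard_score`, and the floor values are all illustrative policy choices, not a standard.

```python
# Illustrative floors: bigger refactors require stronger guards.
GUARD_FLOORS = {"function": 0.3, "module": 0.6, "package": 0.8}

def admit_refactor(scope, guard_score):
    """Route a proposed refactor based on guard strength.

    `guard_score` is a hypothetical composite in [0, 1] built from
    coverage, mutation score, and contract checks.
    """
    floor = GUARD_FLOORS.get(scope)
    if floor is None:
        return "human_review"        # unknown scope class: never autonomous
    if guard_score >= floor:
        return "refactor"
    return "strengthen_guards"       # run a guard-strengthening mission first
```

The point is the third branch: when guards are weak, the loop's next mission is to strengthen guards, not to refactor anyway.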
Mutation testing (a guard for your guards)
Mutation testing is a direct way to check whether your tests are sensitive to behavioral changes. The tool edits your code in small ways (flip a comparison, remove a branch) and verifies that tests fail.
Example (Python):
# Install once
pip install mutmut
# Run mutations on a bounded surface
mutmut run --paths-to-mutate=src/auth/ --tests-dir=tests/auth/
# Inspect survivors (mutations that tests failed to catch)
mutmut results

Use mutation testing selectively. It’s too expensive as a default per-PR gate, but it is useful for deciding whether a module is safe for autonomous refactors.
A practical operating pattern:
- Run mutation testing before allowlisting a refactor class on a module.
- Track a simple mutation score (killed / total) over time.
- If score drops below your floor, pause broad autonomous refactors on that surface.
- Open a guard-strengthening mission (add assertions, contracts, or integration cases), then re-measure.
Treat survivors as map updates:
- categorize by risk (data integrity > auth > formatting)
- link each survivor to a missing test or contract
- close survivors before increasing refactor blast radius
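The score-floor check in the pattern above reduces to a few lines. A sketch under stated assumptions: the 0.80 floor is an illustrative policy value, and parsing actual `mutmut results` output is omitted.

```python
def mutation_gate(killed, total, floor=0.80):
    """Decide whether broad autonomous refactors stay enabled on a surface.

    `killed / total` is the simple mutation score tracked over time;
    the 0.80 floor is an illustrative policy choice.
    """
    score = killed / total if total else 0.0
    return {"score": round(score, 3),
            "broad_refactors_allowed": score >= floor}
```

A surface at 171 killed out of 200 mutants stays open; one at 150/200 falls below the floor and triggers a guard-strengthening mission.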
For extended recipes and stack variants, see Appendix C.
Scope Boundaries (How wide is the refactor allowed to reach?)
“Refactor the code” is not a Mission. A refactoring mission needs explicit boundaries.
One way to frame scope:
| Refactor type | Typical scope | Minimum guards | Default posture |
|---|---|---|---|
| Within-function cleanup | One symbol | Unit tests + types + lint | Often safe |
| Extract method / simplify logic | One module | Module tests + coverage signal + types | Usually safe when bounded |
| Rename with callers | One module or package | Tests + type checks + contract tests (if public) | Require tighter review |
| Public interface change | Multiple modules | Integration/consumer tests + explicit deprecation plan | Human review by default |
| Architecture change | System-wide | Full system checks + explicit approval | Rare and expensive |
This is the same idea as blast radius limits, but moved earlier in the pipeline: scope is part of the mission, not an afterthought.
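Putting scope into the mission means the Validator can check it mechanically against the diff. A minimal sketch, assuming per-class caps on files and lines (the numbers are illustrative, not recommendations):

```python
# Illustrative caps per refactor class; real values are policy decisions.
SCOPE_LIMITS = {
    "within_function":     {"files": 1,  "lines": 40},
    "extract_method":      {"files": 2,  "lines": 120},
    "rename_with_callers": {"files": 10, "lines": 300},
}

def within_scope(refactor_type, files_changed, lines_changed):
    """Reject mutations that reach wider than the mission allows."""
    limit = SCOPE_LIMITS.get(refactor_type)
    if limit is None:
        return False  # unknown change class -> route to human review
    return (files_changed <= limit["files"]
            and lines_changed <= limit["lines"])
```

Because scope is declared up front, a blast-radius violation is a mission failure, not a surprise discovered after the diff lands.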
Equivalence Classes (What does “no behavior change” mean?)
Refactoring is “change the how, keep the what.” But “the what” has layers:
- Strict equivalence: same outputs for the same inputs. Always required.
- Observational equivalence: same externally observable behavior (responses, persisted state, emitted events), with different internals. Usually acceptable.
- Semantic equivalence: same business meaning, but with a different representation or different trade-offs. Requires explicit review because it can shift edge cases.
This is where performance, memory, logging, and side-effect timing enter the picture. If they matter, they need to be promoted into the Map: benchmarks, invariants, and ratchets that make them gateable.
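Strict equivalence, at least, can be spot-checked mechanically by replaying the same inputs through the old and new implementations. A sketch: `old_parse` and `new_parse` are hypothetical before/after versions of the same helper, not code from the book's repo.

```python
def strict_equivalence_mismatches(old_fn, new_fn, inputs):
    """Replay the same inputs through both implementations; any
    difference breaks strict equivalence (the layer always required)."""
    return [x for x in inputs if old_fn(x) != new_fn(x)]

# Hypothetical before/after versions of the same helper:
old_parse = lambda s: s.strip().lower()
new_parse = lambda s: s.lower().strip()  # refactored internals, same contract
mismatches = strict_equivalence_mismatches(old_parse, new_parse,
                                           [" A ", "b", "  C"])
```

This only samples the input space, so it supplements tests rather than replacing them; observational and semantic equivalence still need contracts and review.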
The Ratchet: Metrics That Can’t Go Backward
Here is the failure mode that kills Neuroplasticity:
The system learns to pass tests by weakening the tests.
If an autonomous loop can modify both the code and the validators, it can discover that the easiest way to achieve “all tests pass” is to delete the hard tests. This isn’t malice—it’s optimization finding the shortest path.
The Ratchet prevents this. It’s a governance mechanism that treats a chosen metric value m as monotonic, never allowed to regress:
m_new ≥ m_old
If the new value is worse, the commit is rejected—regardless of whether tests pass.
Ratchets (Quality Floors That Only Move Up)
Ratchets turn quality expectations into executable law. A candidate may fail, but the floor does not move down. If a candidate passes deterministic guards, the floor moves up and future loops start from a stronger baseline.
What to ratchet:
| Metric | Ratchet Rule | Why |
|---|---|---|
| Test count | Can’t decrease | Prevents “pass by deletion” |
| Coverage % | Can’t decrease | Prevents coverage gaming |
| Type coverage | Can’t decrease | Prevents any creep |
| Lint violations | Can’t increase | Prevents gradual decay |
| Security findings | Can’t increase | Prevents vulnerability accumulation |
| API surface | Can’t expand without approval | Prevents scope creep |
Implementation:
# In your governance config or CI pipeline
ratchets:
test_count:
direction: up
baseline_file: .metrics/test_count.json
on_violation: reject
coverage_percent:
direction: up
baseline_file: .metrics/coverage.json
on_violation: reject
tolerance: 0.1 # Allow tiny fluctuations from test timing
lint_errors:
direction: down
baseline_file: .metrics/lint.json
on_violation: reject

The baseline files are updated only when a commit passes all ratchets. They become the new floor.
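The comparator that enforces a config like this is small. The following is a simplified stand-in for something like the companion repo's ratchet_check.py, not its actual code; the metric values are illustrative.

```python
def check_ratchet(old, new, direction, tolerance=0.0):
    """direction 'up': new may not drop below old - tolerance;
    direction 'down': new may not rise above old + tolerance."""
    if direction == "up":
        return new >= old - tolerance
    return new <= old + tolerance

# Illustrative (old, new, direction, tolerance) per metric:
RATCHETS = {
    "test_count":       (847, 851, "up", 0),     # grew: fine
    "coverage_percent": (88.0, 87.95, "up", 0.1),  # within tolerance
    "lint_errors":      (3, 4, "down", 0),       # rose: violation
}
violations = [name for name, (old, new, d, tol) in RATCHETS.items()
              if not check_ratchet(old, new, d, tol)]
```

On any violation the commit is rejected; only a clean pass updates the baseline files to the new floor.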
Ratchet tolerance and flaky metrics (noise management)
Not all metrics are equally stable. Keep hard invariants strict, and handle noisy metrics explicitly.
- Hard invariants (strict): test pass/fail, lint errors, security findings. No tolerance.
- Noisy metrics (modeled): benchmark latency, throughput, memory peaks. Use statistical guards, not single-run comparisons.
A practical noise policy:
ratchets:
benchmark_p95_ms:
direction: down
baseline_file: .metrics/benchmark_p95.json
tolerance_mode: relative
tolerance: 0.05 # 5%
sample_runs: 5
aggregation: median
required_consecutive_failures: 2
on_first_violation: warn
on_repeated_violation: reject

Guidance:
- Pin the benchmark harness (hardware class, warmup, dataset, runtime flags).
- Compare aggregates (median-of-N or trimmed mean), not single runs.
- Use relative tolerances for variable metrics and absolute tolerances for counts.
- Escalate on persistent regression; avoid blocking on one noisy sample.
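The median-of-N comparison from the policy above can be sketched in a few lines (the 5% band is illustrative, matching the config sketch, not a recommendation):

```python
from statistics import median

def benchmark_regressed(baseline_ms, samples_ms, rel_tolerance=0.05):
    """Gate on the median of N runs against a relative tolerance band,
    rather than on a single noisy sample."""
    observed = median(samples_ms)
    return observed > baseline_ms * (1 + rel_tolerance)
```

With a 100 ms baseline, a run of [98, 103, 101, 99, 104] passes (median 101, inside the 105 ms band), while [110, 112, 108, 111, 115] regresses. Pair this with the `required_consecutive_failures` escalation so one bad sample warns instead of blocking.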
To enforce this in CI, the ratchet is typically a short shell gate:
OLD_COV=$(jq .pct .metrics/coverage.json)
pytest --cov --cov-report=json   # writes coverage.json alongside the run
NEW_COV=$(jq .totals.percent_covered coverage.json)
if (( $(echo "$NEW_COV < $OLD_COV" | bc -l) )); then
  echo "Ratchet failed: coverage dropped." && exit 1
fi

The companion repo (github.com/kjwise/aoi_code) includes
ratchet-check and ratchet-baseline targets
that demonstrate this. ratchet_check.py compares current
metrics against a baseline, and ratchet_update_baseline.py
updates the baseline.
The deeper principle:
A ratchet is a monotonic invariant: a property that can only move in one direction over the system’s lifetime. That is how you prevent drift in a self-modifying system.
Without ratchets, every quality metric becomes a negotiation. “We’ll fix the coverage later.” “This test was flaky anyway.” “The lint rule is too strict.” Each exception is small. The cumulative effect is decay.
With ratchets, the negotiation happens once—when you set the baseline. After that, the direction is locked.
Releasing a ratchet (human-only)
Ratchets aren’t permanent. Legitimate reasons to reset a baseline:
- Major architectural change (intentional coverage reset)
- Deprecated module removal (test count drops, but that’s correct)
- Security policy update (old findings reclassified)
The key: releasing a ratchet requires explicit human approval and documentation. The system can’t release its own ratchets—that’s the point.
┌─────────────────────────────────────────────────────────┐
│ Ratchet Release Request │
│ │
│ Metric: test_count │
│ Current baseline: 847 │
│ Proposed baseline: 812 │
│ Reason: Removed deprecated billing module (35 tests) │
│ Approved by: @platform-lead │
│ Date: 2024-03-15 │
│ │
│ [Approve] [Reject] [Request more context] │
└─────────────────────────────────────────────────────────┘
This audit trail is governance. Anyone can see when baselines changed and why.
Rollback Mechanics (what “revert” means operationally)
“Revert” is not one mechanism. Use the rollback pattern that matches the change surface:
| Change surface | Primary rollback | Typical trigger | Notes |
|---|---|---|---|
| Candidate code diff (pre-merge) | Drop candidate branch / discard workspace | Any guard failure | Default and cheapest path |
| Runtime behavior behind flag | Disable feature flag | Post-merge SLO breach | Requires flag discipline and owner |
| Deployment/runtime config | Blue-green or canary rollback | Error budget burn, health checks | Prefer automated rollback thresholds |
Concrete defaults:
- Pre-merge refactors: never mutate main; failed guard means no merge artifact exists.
- Post-merge exposure: gate high-risk behavior behind kill switches or flags.
- Production rollout: pair ratchets with rollout monitors so rollback can trigger automatically.
Rollback itself is part of the Immune System: test it, log it, and make it deterministic.
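Making rollback deterministic means every change surface maps to exactly one rollback path, with no improvisation at incident time. A sketch mirroring the table above; the surface and action names are illustrative labels, not a framework API.

```python
# Illustrative mapping: one deterministic rollback path per surface.
ROLLBACK_ACTIONS = {
    "pre_merge_diff":    "discard_candidate_branch",  # default, cheapest
    "flagged_behavior":  "disable_feature_flag",
    "deployment_config": "blue_green_rollback",
}

def rollback(surface):
    """Resolve the rollback path for a change surface, or fail loudly."""
    action = ROLLBACK_ACTIONS.get(surface)
    if action is None:
        raise ValueError(f"no rollback path defined for surface: {surface}")
    return action
```

The `ValueError` branch is the point: a change surface with no defined rollback path should fail review before the change ships, not during the incident.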
Containing the Blast Radius
Even with robust guards, keep each automated change small. Blast radius controls ensure that if an unforeseen issue does slip through, its effects stay localized and reversible.
Small, Focused Changes: Automated agents should strive to make the smallest possible atomic changes. For instance, a dependency update for one library, not an entire package.json manifest.
File Path Restrictions: Define CODEOWNERS-like paths for automated agents, limiting them to specific directories or file types. An agent focused on src/components should not touch src/database.
Immutable Boundaries: Keep autonomous refactors out of protected governance surfaces (CI workflows, policy files, validator code). If a change touches those paths, require explicit human review before merge.
Monorepo and multi-repo ratchets
Repository topology changes where ratchets live, not whether you need them.
Monorepo pattern:
- Keep per-surface baselines (for example .metrics/service-a/coverage.json, .metrics/service-b/latency.json).
- Evaluate ratchets only for affected packages, plus shared contracts.
- Keep a small set of global ratchets for shared infrastructure and policy surfaces.
Multi-repo pattern:
- Keep local ratchets per repo, but add contract ratchets on published interfaces (OpenAPI/schema/event versions).
- Use a central governance ledger to track cross-repo exceptions and releases.
- For cross-repo refactors, use compatibility windows: producer-compatible change first, consumer migration second, removal last.
The rule is consistent: ratchet at the boundary where failure becomes expensive.
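A contract ratchet on a published interface can be sketched as a version-monotonicity check. This assumes semver-style version strings; the "same major, non-decreasing" rule is an illustrative policy for the producer-first compatibility window, not a standard.

```python
def contract_ratchet_ok(old_version, new_version):
    """Published-interface ratchet sketch for semver-style strings:
    additive evolution only -- same major version, non-decreasing.
    A major bump (breaking change) must go through the compatibility
    window (producer change, consumer migration, removal) instead."""
    old = tuple(int(p) for p in old_version.split("."))
    new = tuple(int(p) for p in new_version.split("."))
    return new[0] == old[0] and new >= old
```

Under this rule, 1.4.0 → 1.5.0 passes as additive, while 1.4.0 → 2.0.0 or 1.4.0 → 1.3.9 is rejected and escalated to the cross-repo exception process in the governance ledger.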
Refactoring Pattern Gallery (and the guards that make them safe)
Refactoring patterns are useful because they define a bounded change shape. Guards become easier to write when you can name what the Effector is allowed to do.
| Pattern | Typical scope | Minimum guards | Notes |
|---|---|---|---|
| Extract function | One module | Tests + types + lint | Prefer one extraction per diff |
| Inline variable | One function | Tests + lint | Safe when assertions are strong |
| Rename symbol (private) | One module | Tests + type checks | Prefer automated rename tooling when available |
| Introduce parameter object | One module (sometimes cross-module) | Tests + types + integration checks | Easy to break call sites; keep diff small |
| Replace conditional with dispatch | One module/package | Tests + contract checks | Can change edge cases; watch equivalence class |
| Move function/class | Multi-file | Integration tests + ratchets | Treat as higher risk; enforce strict scope |
Multi-step refactors (chains)
Some refactors need multiple steps (extract → rename → move). Handle these as a chain of atomic diffs:
- Run guards after each step. If a step fails, revert the chain, not just the last edit.
- Keep intermediate artifacts in a quarantine directory (candidate diff + guard output) so humans can salvage useful fragments without re-running the Effector.
- Prefer squashing “refactor noise” into one logical commit once the chain is proven safe. Your Ledger is the PR evidence; your Git history is the narrative.
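The chain discipline above can be sketched as a small runner: apply one atomic step, run guards, and on the first failure revert the entire chain. The `apply`, `guards`, and `revert_all` callables are hypothetical hooks into your Effector and Validators, not a real framework.

```python
def run_chain(steps, guards, apply, revert_all):
    """Apply each atomic step, running guards after every one.

    If any step fails its guards, revert the whole chain (not just
    the last edit) and report where it broke. All callables are
    hypothetical hooks into the Effector/Validator machinery.
    """
    applied = []
    for step in steps:
        apply(step)
        applied.append(step)
        if not guards():
            revert_all(applied)          # revert the chain, not the last edit
            return {"status": "reverted", "failed_at": step}
    return {"status": "committed", "steps": applied}
```

On "reverted", the applied diffs and guard output land in the quarantine directory for human salvage; on "committed", the chain is squashed into one logical commit.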
Worked Example: An Automated Lint Fix
Here is a concrete example: an automated agent applies a lint fix using the Measure → Mutate → Measure → Commit/Revert loop.
Imagine a .js file with an unused import that triggers a
linter warning (ESLint: no-unused-vars).
1. Baseline Measurement: The CI pipeline is triggered (e.g., by a daily scheduled job or a change in a linting rule).
npm test → All tests pass.
npm run lint → Detects 1 error, 0 warnings (due to no-unused-vars).
Code coverage: 88%.
2. Automated Mutation: An autonomous agent (e.g., a
script that runs eslint --fix on detected files) is
invoked.
It identifies src/utils/data-parser.js as having an unused import.
It removes the unused import (a tiny, bounded edit).
This change is staged on a temporary Git branch, say
feat/autofix-eslint-20231027.
3. Post-Mutation Validation: The CI pipeline runs
again on the feat/autofix-eslint-20231027 branch.
npm test → All tests pass (Success).
npm run lint → Detects 0 errors, 0 warnings (Success, lint error fixed).
Code coverage: 88% (Success, no drop from baseline).
Blast Radius: Only 1 file changed, 1 line removed (Within limits).
4. Decision Gate:
All guards passed.
The system commits the change to the temporary branch and opens (or updates) a pull request with evidence.
If policy explicitly allowlists this change class, it may auto-merge after all gates pass. Otherwise it waits at the human approval gate (see Chapter 12).
An audit log entry is created, showing the agent, the change, the metrics before and after, and the gate outcome (auto-merged or awaiting human approval).
Had npm test failed, or if code coverage dropped, the
automation would revert its changes, delete the candidate branch, and
log the failure for human attention. This cycle runs unattended only for
explicitly allowlisted low-risk change classes; otherwise it stops at
human approval.
Actionable: What you can do this week
Identify a Monotonic Metric: Choose one quality metric in your project (e.g., code coverage, number of lint errors, number of security vulnerabilities) that you want to prevent from backsliding.
Add a “Ratchet” Check to CI: Configure your CI/CD pipeline to:
Capture the current value of this metric (e.g., coverage.json, eslint-report.json).
During a build, compare the new metric value against a persisted baseline (e.g., from the main branch).
Fail the build if the new metric is worse than the baseline (e.g., new_coverage < old_coverage, new_errors > old_errors).
Experiment with an Automated Linter Fix: Set up a scheduled job or a local script that runs eslint --fix (or equivalent for your language) on a small, well-tested part of your codebase. Manually verify the changes, but envision how this would fit into the Measure → Mutate → Measure loop.