Chapter 4 – The Stochastic Engine (Why You Need Physics)
Run the same small task twice:
Run 1:
normalize_country(country)calculate_tax(amount, country, rate)
Run 2:
normalize_country(country: str)— Normalize an input country string.calculate_tax(amount: float, country: str, rate: float)— Compute tax for an amount.
Both are “reasonable.” If your only validators are “valid Markdown” and “the doc mentions the right function names,” both pass. But they’re different: different signature surfaces, extra prose, different formatting.
That difference is the chapter’s subject. This is what drift looks like on a bounded surface: the same task, the same apparent intent, different outputs.
The rest of the chapter gives that variance a ruler. We will measure it, explain why it happens, and show why deterministic Physics is what makes the loop reusable instead of lucky.
Carry one idea forward from Part I:
- Terrain: what runs (code, configs, runtime behavior).
- Map: a versioned intent surface derived from that Terrain (docs, specs, inventories).
The pattern repeats at every level. What counts as runnable reality at one level becomes the spec for the next. The same Software Development as Code (SDaC) shape shows up again and again: explicit intent (Map), validated reality (Terrain), and a loop that keeps them aligned.
Same pattern at multiple levels:
The Same Loop at Different Scales
The same pattern shows up at multiple levels. Functions change in minutes; products change in weeks; organizations change much more slowly. Faster loops feed the slower ones above them.
- Drift at lower levels spreads upward.
- Improvements at lower levels spread upward too.
- Intent and validation at every scale keep the stack coherent.
Zn+1 = Pk(Zn)
Pk+1 = improve(Pk, evidence)
graph TB
subgraph ORG["Organization Level"]
O_Map["Strategy Docs<br/>(Map)"]
O_Terrain["Running Systems<br/>(Terrain)"]
end
subgraph SRV["Service Level"]
S_Map["API Contracts<br/>(Map)"]
S_Terrain["Deployed Services<br/>(Terrain)"]
end
subgraph MOD["Module Level"]
M_Map["Type Signatures<br/>(Map)"]
M_Terrain["Implementations<br/>(Terrain)"]
end
subgraph FUN["Function Level"]
F_Map["Docstring + Signature<br/>(Map)"]
F_Terrain["Function Body<br/>(Terrain)"]
end
O_Terrain --> S_Map
S_Terrain --> M_Map
M_Terrain --> F_Map
%% aoi:layout
O_Map --> O_Terrain
S_Map --> S_Terrain
M_Map --> M_Terrain
F_Map --> F_Terrain
linkStyle 3 opacity:0
linkStyle 4 opacity:0
linkStyle 5 opacity:0
linkStyle 6 opacity:0
classDef map fill:#151B2B,stroke:#06b6d4,stroke-width:2px,color:#f8fafc
classDef terrain fill:#151B2B,stroke:#10b981,stroke-width:2px,color:#f8fafc
class O_Map,S_Map,M_Map,F_Map map
class O_Terrain,S_Terrain,M_Terrain,F_Terrain terrain
style ORG fill:#0B0F19,stroke:#374151,color:#94a3b8
style SRV fill:#0B0F19,stroke:#374151,color:#94a3b8
style MOD fill:#0B0F19,stroke:#374151,color:#94a3b8
style FUN fill:#0B0F19,stroke:#374151,color:#94a3b8
linkStyle default stroke:#6366f1,stroke-width:2px
The arrows show the recursion: each level’s Terrain becomes the next level’s Map. This is why SDaC scales. You are not managing one loop. You are managing the same loop shape at multiple levels, each with the same validate-before-trust structure.
This also explains why “just document everything” fails. A single flat document that tries to cover every level becomes unmanageable. The repeating structure lets you work at the level you are in, trusting that adjacent levels have their own Map/Terrain alignment.
This is not a bug in your loop. This is the nature of the engine.
Large Language Models (LLMs) are stochastic engines: probability machines that produce plausible outputs, not guaranteed ones. If you want reliable automation, you have to put hard checks around the generation.
LLMs Are Probability Engines
At its core, an LLM assigns probabilities to possible next tokens, then samples one. Even if the “best” token has a 90% chance, there is still a tail. That tail is where regressions come from.
Three implications matter in SDaC:
No guarantee of identical output: With sampling enabled, the same task request can produce slightly different candidates across runs. Even at very low temperature (or temperature
0), treat outputs as near-reproducible, not a contract.Contextual sensitivity: Small changes in input or context can shift the distribution and change the output.
Non-local failure modes: A tiny variation can move a change from “passes” to “fails” without a clean, debuggable execution path.
This is why SDaC treats generation as one component in a system, not as the plan itself: you measure variance, constrain it, and gate it.
The Specifiable / Problem-solving / Evolutionary (S/P/E) Mismatch: heuristics meeting reality
To understand why drift is inevitable, it helps to borrow one useful frame from Meir M. Lehman’s classification of software systems. In plain language: some systems are exact, some are heuristic, and some live in a changing world.
- S-Type (Specifiable): The problem is formally defined. Correctness is absolute. A compiler or schema validator is S-Type.
- P-Type (Problem-solving): The solution is heuristic. An LLM is P-Type. It does not know absolute truth; it produces a high-probability answer.
- E-Type (Evolutionary): The system is embedded in reality. Because the world changes, the software must keep changing to stay useful. Your production codebase is E-Type.
Lehman’s First Law says an E-Type system must be continually adapted, or it becomes progressively less satisfactory. That is exactly the dynamic that breaks vibe coding.
When you use a raw AI assistant, you are pointing a P-Type engine directly at an E-Type system: a heuristic generator touching a changing real-world codebase.
If you do not constrain the P-Type engine, it will generate plausible code that subtly violates the hard realities of your E-Type environment. The resulting friction is stochastic drift.
The SDaC Synthesis: S-P-S Composition
You cannot tame an E-Type system using only P-Type tools. You need S-Type constraints.
This is the theoretical justification for the Deterministic Sandwich
from Chapter 2. You wrap the P-Type generator (the model) in S-Type
constraints (Prep and Validation).
When you write a Validator, you are creating an S-Type boundary. The model can use heuristics to generate the flesh, but it still has to pass the hard gate.
Why the V-Model Alone Is Not Enough
The classical V-model is still useful here because it preserves traceability: intent on one side, checks on the other. But phase correspondence alone does not tell you whether a stochastic implementation step will produce the same candidate twice, stay bounded under variance, or converge within budget.
That is why drift matters. Once implementation includes a probabilistic engine, the question is not only “did I define matching checks?” but also “does this loop remain stable under variance?” Chapter 5 answers that second question. Chapter 4 names why the classical shape needs an extension.
Operating the Stochastic Engine (Controls, Budgets, Tiering)
If you are going to wire an LLM into a real pipeline, you need operating discipline: controls, budgets, and a selection strategy.
Sampling controls (temperature is a dial, not a guarantee)
Sampling parameters shape variance.
- Lower temperature typically reduces drift (fewer weird variants).
- Higher temperature typically increases exploration (more diverse candidates).
In many tasks, you want near-reproducibility, not
“creativity.” That usually means a low temperature (often in the
0.1–0.2 range) and a tight output contract (diff-only,
JSON-only, schema-first). Exact 0 can be useful, but it is
not a guarantee: distributed inference can still vary across calls, and
some models degrade when you clamp too hard.
The point is not to find the perfect temperature. The point is: assume variance exists, and design Physics that makes variance safe.
Cost-aware generation (budgeted retries)
Retries are not free. A loop that can retry can also burn tokens, time, and review throughput.
Treat cost as a first-class budget, just like diff size or scope:
- Max attempts: stop after N candidates.
- Max spend: stop after a token/cost cap for this Mission.
- Max wall-clock: stop after a time limit (especially in CI).
- Max review load: cap how many agent PRs can exist concurrently.
Chapter 5 is where we engineer circuit breakers in detail (including the economics of determinism). The key move here is simpler: make “stop” deterministic before you make “generate” powerful.
Model selection rubric (pick the smallest engine that converges)
Model selection is not “best model you can afford.” It is “smallest model that converges under your Physics.”
Use a simple rubric:
| Task type | Recommended model class | Why |
|---|---|---|
| Mechanical edits under strict gates (formatting, renames, schema-shaped JSON) | Fast / cheap | Low semantic load; validators do most of the work |
| Constrained refactors (multiple files, must satisfy tests + types + style) | Flagship | Needs to hold more constraints simultaneously |
| High-branch-factor work (architecture changes, ambiguous requirements) | Flagship + human | The hard part is intent and slicing, not token prediction |
| Background maintenance at scale (Map-Updaters, doc sync, evergreen chores) | Owned inference behind a loop | You can afford retries and frequency; availability becomes strategy |
To evaluate fit for your repo, run the drift experiment (next section) across your candidate model tiers and compare failure rate and time-to-convergence, not vibes.
Retry strategy (feed back deterministic failures)
A retry loop should not be “try again.” It should be: “try again with the exact failure signal.”
Illustrative shape:
findings = []
for attempt in range(max_attempts):
model = choose_model_tier(attempt) # e.g., fast first, flagship on failure
request = render_request(mission, slice, findings)
candidate = effector_call(model, request)
report = validate(candidate) # schemas, lint, tests, scope, policy
if report.passed:
return candidate
findings.append(report.findings)
maybe_backoff(report, attempt) # rate limits, transient errors
raise NonConverged("Exceeded budget; escalate to human or split the mission")The loop converges when each retry is narrower than the last. If you keep seeing the same failure, treat it as a signal problem (bad slice, wrong contract, impossible constraint), not just as a “model problem.”
Instruction-channel hardening (authority vs. evidence)
LLMs often follow instruction-shaped text even when it comes from
untrusted sources (tickets, comments, logs, README snippets). If you do
not harden the instruction channel, you can build a perfect loop and
still get a bypass (often called prompt injection).
This is a Prep-layer problem:
- Compile authority from allowlisted templates (Mission Objects, policies).
- Carry untrusted text as tagged evidence with provenance (file, line, source).
- Enforce scope mechanically (write allowlists + diff gates), not rhetorically.
Chapter 2 shows the attack shape and the defense; Chapter 12 shows how to scale it with policy validators.
Internal Reasoning Does Not Replace the Loop
Some model interfaces expose intermediate reasoning (often called “chain-of-thought”) before the final output. You might think this replaces the SDaC loop. It does not.
Internal loop (model deliberation): The model “talks to itself” to increase plausibility and internal consistency. It is still operating in language space. It does not have access to your compiler, linter, schema validator, or runtime. At best, it is simulating what validation might say.
External loop (SDaC): The system talks to Physics: compilers, tests, schemas, linters, and policy gates. It validates against reality. The decision layer does not care how confident the model sounds; it cares whether the artifact passes the gates.
Reasoning models can be better Effectors: they often produce higher-quality candidates and reduce the number of iterations needed to converge. But they are not Validators, and they do not remove the need for an external decision layer.
Don’t confuse a model “thinking hard” with a test passing.
The Drift Experiment: Measure Structural Variance
Run the same task request 10 times. The goal is to make variance measurable.
Setup (hold everything constant):
Same task request: “Update
## Public Interfacesinproduct/docs/architecture.mdto matchproduct/src/.”Same model, same temperature, same system instructions, same context slice.
Store the output of each run.
Hold everything deterministic except the stochastic call. Then measure drift where it matters: on the change surface.
Drift measurements
- Unique diffs: how many distinct patches you get for
the same task under identical
Prep. - Failure rate: how often you cannot reach
PASSwithin your circuit breakers (max iters, time, or diff budget). - Time-to-convergence: seconds (or iterations) until the patch passes your Validators.
One simple proxy: count unique diffs
- Start from the same baseline each run (reset the file you’re generating).
- Run the one-shot effector n times, saving each emitted diff.
- Hash the diffs and count unique hashes.
If you want a runnable script, see the companion repo at
github.com/kjwise/aoi_code (the drift target
uses a model-driven sync script,
stochastic_sync_public_interfaces.py).
Running it produces a summary like this:
$ make drift
python3 factory/tools/measure_drift.py --src product/src --doc product/docs/architecture.md --runs 10
runs=10 unique_diffs=8 drift_coefficient=0.800 failures=0
example_unique_runs=1, 2, 3, 4, 5This output tells us:
- Out of 10 runs, we got 8 unique diffs.
- The drift coefficient is a high 0.8, meaning the output is highly variable.
- There were no outright failures (e.g., crashes), but the high drift means the surface is not stable.
This is stochastic drift with a ruler next to it: you can measure it, budget it, and reduce it by tightening context selection and hardening validators.
A simple drift coefficient (plus an illustrative run log)
You do not need a perfect statistical model to make drift operational. Start with a simple, auditable coefficient:
$$ d = \frac{u}{n} $$
where u is the number of unique diffs, and n is the total runs.
This is an operational proxy, not a statistical claim or a benchmark. Use it to compare the same workflow over time, set budgets, and detect regressions, not to compare different repositories or teams.
Interpretation:
- d = 0.1 means “one in ten runs differs” (stable enough for many workflows).
- d = 1.0 means “every run is different” (you do not have a reusable automation surface yet).
It also helps to classify variance. Some differences are harmless (whitespace, reordering). Some are structural (different interfaces, different edit regions, new files).
Illustrative run log (N=10; not a benchmark):
| run | result | diff class |
|---|---|---|
| 1 | PASS | A (canonical) |
| 2 | PASS | A (canonical) |
| 3 | PASS | B (whitespace-only) |
| 4 | PASS | B (whitespace-only) |
| 5 | PASS | C (reordered lines) |
| 6 | FAIL | — (exceeded circuit breakers) |
| 7 | PASS | A (canonical) |
| 8 | FAIL | — (exceeded circuit breakers) |
| 9 | PASS | D (structural variance) |
| 10 | PASS | A (canonical) |
From that log:
- With u = 4 unique diffs out of n = 10 runs, d = 0.4.
- Failure rate = 20%.
If that feels “too high,” the fix is not more instruction text. Tighten your slice, tighten your allowed edit region, and add validators that make structural variance fail fast.
Why Ad-hoc Instructions Stop Scaling
Your first instinct is reasonable: make the instructions more specific.
Update `product/docs/architecture.md`.
Only edit content under `## Public Interfaces`.
Format as:
## Public Interfaces
- `name(arg1, arg2)`
This helps. You might go from “wildly inconsistent” to “mostly consistent.” But you hit a ceiling:
Longer instructions do not eliminate sampling. The model still produces variants.
You’re encoding constraints in text. Text is interpreted, not enforced.
You still need verification. Without Validators, you only learn the rules were violated after something breaks.
Compare the two approaches:
Instruction-string approach:
You encode constraints in prose and hope they are followed.
Physics approach:
- deterministically extract the skeleton (signatures, allowed edit region, schema)
- generate only within that boundary (templates help)
- validate alignment and reject variance you do not want to review
Instructions are suggestions. Physics are constraints.
Why This Matters for Your Loop
Chapter 4 is the clean definition of the problem.
- Without deterministic context and gates, your loop produces different outputs on every run.
- Even if the output “looks fine,” it can drift in ways that break downstream automation.
- Failures show up as non-reproducible production incidents, not clean local errors.
The next chapters apply the constraints:
- Chapter 5: Loops that converge despite stochasticity.
- Chapter 6: Context slicing that reduces variance.
- Chapter 7: Mission Objects that make intent executable.
Actionable: What you can do this week
Measure drift on the change surface: Run the same task request 10 times and count unique diffs. Record how often you hit
FAILand how many iterations it took to reachPASS.Pick one invariant and enforce it: Choose one deterministic rule (exact signature surface, allowed edit region, required headings) and add a Validator that fails fast when the output drifts.