Part II Understand It The Theory Behind Reliability
12 min read

Chapter 4 – The Stochastic Engine (Why You Need Physics)

Run the same small task twice:

Run 1:

Run 2:

Both are “reasonable.” If your only validators are “valid Markdown” and “the doc mentions the right function names,” both pass. But they’re different: different signature surfaces, extra prose, different formatting.

That difference is the chapter’s subject. This is what drift looks like on a bounded surface: the same task, the same apparent intent, different outputs.

The rest of the chapter gives that variance a ruler. We will measure it, explain why it happens, and show why deterministic Physics is what makes the loop reusable instead of lucky.

Carry one idea forward from Part I:

The pattern repeats at every level. What counts as runnable reality at one level becomes the spec for the next. The same Software Development as Code (SDaC) shape shows up again and again: explicit intent (Map), validated reality (Terrain), and a loop that keeps them aligned.

Same pattern at multiple levels:

The Same Loop at Different Scales

The same pattern shows up at multiple levels. Functions change in minutes; products change in weeks; organizations change much more slowly. Faster loops feed the slower ones above them.

  • Drift at lower levels spreads upward.
  • Improvements at lower levels spread upward too.
  • Intent and validation at every scale keep the stack coherent.

Zn+1 = Pk(Zn)
Pk+1 = improve(Pk, evidence)

Pick a scale
Company Scale ~Years
State: Market position · Process: Culture + leadership
Product Scale ~Months
State: System architecture · Process: Org + strategy
Feature Scale ~Days
State: Capabilities · Process: Team practices
Function Scale ~Minutes
State: Implementation · Process: Dev tools
Selected scale
State
Market position
Process
Culture + leadership
Cadence
Years
Example
Strategic bets
Improve the fastest loops first; the rest of the stack inherits the gain.
graph TB
    subgraph ORG["Organization Level"]
        O_Map["Strategy Docs<br/>(Map)"]
        O_Terrain["Running Systems<br/>(Terrain)"]
    end

    subgraph SRV["Service Level"]
        S_Map["API Contracts<br/>(Map)"]
        S_Terrain["Deployed Services<br/>(Terrain)"]
    end

    subgraph MOD["Module Level"]
        M_Map["Type Signatures<br/>(Map)"]
        M_Terrain["Implementations<br/>(Terrain)"]
    end

    subgraph FUN["Function Level"]
        F_Map["Docstring + Signature<br/>(Map)"]
        F_Terrain["Function Body<br/>(Terrain)"]
    end

    O_Terrain --> S_Map
    S_Terrain --> M_Map
    M_Terrain --> F_Map

  %% aoi:layout
  O_Map --> O_Terrain
  S_Map --> S_Terrain
  M_Map --> M_Terrain
  F_Map --> F_Terrain
  linkStyle 3 opacity:0
  linkStyle 4 opacity:0
  linkStyle 5 opacity:0
  linkStyle 6 opacity:0

    classDef map fill:#151B2B,stroke:#06b6d4,stroke-width:2px,color:#f8fafc
    classDef terrain fill:#151B2B,stroke:#10b981,stroke-width:2px,color:#f8fafc
    class O_Map,S_Map,M_Map,F_Map map
    class O_Terrain,S_Terrain,M_Terrain,F_Terrain terrain

    style ORG fill:#0B0F19,stroke:#374151,color:#94a3b8
    style SRV fill:#0B0F19,stroke:#374151,color:#94a3b8
    style MOD fill:#0B0F19,stroke:#374151,color:#94a3b8
    style FUN fill:#0B0F19,stroke:#374151,color:#94a3b8
    linkStyle default stroke:#6366f1,stroke-width:2px

The arrows show the recursion: each level’s Terrain becomes the next level’s Map. This is why SDaC scales. You are not managing one loop. You are managing the same loop shape at multiple levels, each with the same validate-before-trust structure.

This also explains why “just document everything” fails. A single flat document that tries to cover every level becomes unmanageable. The repeating structure lets you work at the level you are in, trusting that adjacent levels have their own Map/Terrain alignment.

This is not a bug in your loop. This is the nature of the engine.

Large Language Models (LLMs) are stochastic engines: probability machines that produce plausible outputs, not guaranteed ones. If you want reliable automation, you have to put hard checks around the generation.

LLMs Are Probability Engines

At its core, an LLM assigns probabilities to possible next tokens, then samples one. Even if the “best” token has a 90% chance, there is still a tail. That tail is where regressions come from.

Three implications matter in SDaC:

This is why SDaC treats generation as one component in a system, not as the plan itself: you measure variance, constrain it, and gate it.

The Specifiable / Problem-solving / Evolutionary (S/P/E) Mismatch: heuristics meeting reality

To understand why drift is inevitable, it helps to borrow one useful frame from Meir M. Lehman’s classification of software systems. In plain language: some systems are exact, some are heuristic, and some live in a changing world.

  1. S-Type (Specifiable): The problem is formally defined. Correctness is absolute. A compiler or schema validator is S-Type.
  2. P-Type (Problem-solving): The solution is heuristic. An LLM is P-Type. It does not know absolute truth; it produces a high-probability answer.
  3. E-Type (Evolutionary): The system is embedded in reality. Because the world changes, the software must keep changing to stay useful. Your production codebase is E-Type.

Lehman’s First Law says an E-Type system must be continually adapted, or it becomes progressively less satisfactory. That is exactly the dynamic that breaks vibe coding.

When you use a raw AI assistant, you are pointing a P-Type engine directly at an E-Type system: a heuristic generator touching a changing real-world codebase.

If you do not constrain the P-Type engine, it will generate plausible code that subtly violates the hard realities of your E-Type environment. The resulting friction is stochastic drift.

The SDaC Synthesis: S-P-S Composition

You cannot tame an E-Type system using only P-Type tools. You need S-Type constraints.

This is the theoretical justification for the Deterministic Sandwich from Chapter 2. You wrap the P-Type generator (the model) in S-Type constraints (Prep and Validation).

When you write a Validator, you are creating an S-Type boundary. The model can use heuristics to generate the flesh, but it still has to pass the hard gate.

Why the V-Model Alone Is Not Enough

The classical V-model is still useful here because it preserves traceability: intent on one side, checks on the other. But phase correspondence alone does not tell you whether a stochastic implementation step will produce the same candidate twice, stay bounded under variance, or converge within budget.

That is why drift matters. Once implementation includes a probabilistic engine, the question is not only “did I define matching checks?” but also “does this loop remain stable under variance?” Chapter 5 answers that second question. Chapter 4 names why the classical shape needs an extension.

Operating the Stochastic Engine (Controls, Budgets, Tiering)

If you are going to wire an LLM into a real pipeline, you need operating discipline: controls, budgets, and a selection strategy.

Sampling controls (temperature is a dial, not a guarantee)

Sampling parameters shape variance.

In many tasks, you want near-reproducibility, not “creativity.” That usually means a low temperature (often in the 0.1–0.2 range) and a tight output contract (diff-only, JSON-only, schema-first). Exact 0 can be useful, but it is not a guarantee: distributed inference can still vary across calls, and some models degrade when you clamp too hard.

The point is not to find the perfect temperature. The point is: assume variance exists, and design Physics that makes variance safe.

Cost-aware generation (budgeted retries)

Retries are not free. A loop that can retry can also burn tokens, time, and review throughput.

Treat cost as a first-class budget, just like diff size or scope:

Chapter 5 is where we engineer circuit breakers in detail (including the economics of determinism). The key move here is simpler: make “stop” deterministic before you make “generate” powerful.

Model selection rubric (pick the smallest engine that converges)

Model selection is not “best model you can afford.” It is “smallest model that converges under your Physics.”

Use a simple rubric:

Task type Recommended model class Why
Mechanical edits under strict gates (formatting, renames, schema-shaped JSON) Fast / cheap Low semantic load; validators do most of the work
Constrained refactors (multiple files, must satisfy tests + types + style) Flagship Needs to hold more constraints simultaneously
High-branch-factor work (architecture changes, ambiguous requirements) Flagship + human The hard part is intent and slicing, not token prediction
Background maintenance at scale (Map-Updaters, doc sync, evergreen chores) Owned inference behind a loop You can afford retries and frequency; availability becomes strategy

To evaluate fit for your repo, run the drift experiment (next section) across your candidate model tiers and compare failure rate and time-to-convergence, not vibes.

Retry strategy (feed back deterministic failures)

A retry loop should not be “try again.” It should be: “try again with the exact failure signal.”

Illustrative shape:

findings = []

for attempt in range(max_attempts):
    model = choose_model_tier(attempt)  # e.g., fast first, flagship on failure
    request = render_request(mission, slice, findings)
    candidate = effector_call(model, request)

    report = validate(candidate)  # schemas, lint, tests, scope, policy
    if report.passed:
        return candidate

    findings.append(report.findings)
    maybe_backoff(report, attempt)  # rate limits, transient errors

raise NonConverged("Exceeded budget; escalate to human or split the mission")

The loop converges when each retry is narrower than the last. If you keep seeing the same failure, treat it as a signal problem (bad slice, wrong contract, impossible constraint), not just as a “model problem.”

Instruction-channel hardening (authority vs. evidence)

LLMs often follow instruction-shaped text even when it comes from untrusted sources (tickets, comments, logs, README snippets). If you do not harden the instruction channel, you can build a perfect loop and still get a bypass (often called prompt injection).

This is a Prep-layer problem:

Chapter 2 shows the attack shape and the defense; Chapter 12 shows how to scale it with policy validators.

Internal Reasoning Does Not Replace the Loop

Some model interfaces expose intermediate reasoning (often called “chain-of-thought”) before the final output. You might think this replaces the SDaC loop. It does not.

Internal loop (model deliberation): The model “talks to itself” to increase plausibility and internal consistency. It is still operating in language space. It does not have access to your compiler, linter, schema validator, or runtime. At best, it is simulating what validation might say.

External loop (SDaC): The system talks to Physics: compilers, tests, schemas, linters, and policy gates. It validates against reality. The decision layer does not care how confident the model sounds; it cares whether the artifact passes the gates.

Reasoning models can be better Effectors: they often produce higher-quality candidates and reduce the number of iterations needed to converge. But they are not Validators, and they do not remove the need for an external decision layer.

Don’t confuse a model “thinking hard” with a test passing.

The Drift Experiment: Measure Structural Variance

Run the same task request 10 times. The goal is to make variance measurable.

Setup (hold everything constant):

  1. Same task request: “Update ## Public Interfaces in product/docs/architecture.md to match product/src/.”

  2. Same model, same temperature, same system instructions, same context slice.

  3. Store the output of each run.

Hold everything deterministic except the stochastic call. Then measure drift where it matters: on the change surface.

Drift measurements

One simple proxy: count unique diffs

  1. Start from the same baseline each run (reset the file you’re generating).
  2. Run the one-shot effector n times, saving each emitted diff.
  3. Hash the diffs and count unique hashes.

If you want a runnable script, see the companion repo at github.com/kjwise/aoi_code (the drift target uses a model-driven sync script, stochastic_sync_public_interfaces.py).

Running it produces a summary like this:

$ make drift
python3 factory/tools/measure_drift.py --src product/src --doc product/docs/architecture.md --runs 10
runs=10 unique_diffs=8 drift_coefficient=0.800 failures=0
example_unique_runs=1, 2, 3, 4, 5

This output tells us:

This is stochastic drift with a ruler next to it: you can measure it, budget it, and reduce it by tightening context selection and hardening validators.

A simple drift coefficient (plus an illustrative run log)

You do not need a perfect statistical model to make drift operational. Start with a simple, auditable coefficient:

$$ d = \frac{u}{n} $$

where u is the number of unique diffs, and n is the total runs.

This is an operational proxy, not a statistical claim or a benchmark. Use it to compare the same workflow over time, set budgets, and detect regressions, not to compare different repositories or teams.

Interpretation:

It also helps to classify variance. Some differences are harmless (whitespace, reordering). Some are structural (different interfaces, different edit regions, new files).

Illustrative run log (N=10; not a benchmark):

run result diff class
1 PASS A (canonical)
2 PASS A (canonical)
3 PASS B (whitespace-only)
4 PASS B (whitespace-only)
5 PASS C (reordered lines)
6 FAIL — (exceeded circuit breakers)
7 PASS A (canonical)
8 FAIL — (exceeded circuit breakers)
9 PASS D (structural variance)
10 PASS A (canonical)

From that log:

If that feels “too high,” the fix is not more instruction text. Tighten your slice, tighten your allowed edit region, and add validators that make structural variance fail fast.

Why Ad-hoc Instructions Stop Scaling

Your first instinct is reasonable: make the instructions more specific.

Update `product/docs/architecture.md`.
Only edit content under `## Public Interfaces`.
Format as:

## Public Interfaces
- `name(arg1, arg2)`

This helps. You might go from “wildly inconsistent” to “mostly consistent.” But you hit a ceiling:

  1. Longer instructions do not eliminate sampling. The model still produces variants.

  2. You’re encoding constraints in text. Text is interpreted, not enforced.

  3. You still need verification. Without Validators, you only learn the rules were violated after something breaks.

Compare the two approaches:

Instruction-string approach:

You encode constraints in prose and hope they are followed.

Physics approach:

Instructions are suggestions. Physics are constraints.

Why This Matters for Your Loop

Chapter 4 is the clean definition of the problem.

The next chapters apply the constraints:

Actionable: What you can do this week

  1. Measure drift on the change surface: Run the same task request 10 times and count unique diffs. Record how often you hit FAIL and how many iterations it took to reach PASS.

  2. Pick one invariant and enforce it: Choose one deterministic rule (exact signature surface, allowed edit region, required headings) and add a Validator that fails fast when the output drifts.

Share