Part II Understand It The Theory Behind Reliability

Chapter 4 – The Stochastic Engine (Why You Need Physics)

Here’s what happens when you take the Chapter 1 Map/Terrain sync loop and swap the deterministic Effector for a stochastic one.

Run 1:

## Public Interfaces

- `normalize_country(country)`
- `calculate_tax(amount, country, rate)`

Run 2:

## Public Interfaces

- `normalize_country(country: str)` — Normalize an input country string.
- `calculate_tax(amount: float, country: str, rate: float)` — Compute tax for an amount.

Both are “reasonable.” If your only validators are “valid Markdown” and “the doc mentions the right function names,” both pass. But they’re different: different signature surfaces, extra prose, different formatting.

This isn’t a bug in your loop. This is the nature of the engine.

Large Language Models (LLMs) are stochastic engines: probability machines that produce plausible outputs, not guaranteed ones. If you want reliable automation, you have to add Physics around the generation.

LLMs Are Probability Engines

At its heart, an LLM assigns probabilities to possible next tokens, then samples one. Even if the “best” token has a 90% chance, there’s still a 10% tail. That tail is where regressions come from.

Three implications matter in SDaC:

No guarantee of identical output: The same Mission Object, given to the same model, with the same temperature, can produce slightly different results across runs.
Contextual sensitivity: Small changes in input or context can shift the distribution and change the output.
Non-local failure modes: A tiny variation can move a change from “passes” to “fails” without a clean, debuggable execution path.

This is why SDaC treats generation as a component, not a plan: you measure variance, constrain it, and gate it.

The Drift Experiment: Measure Structural Variance

Run the same Mission Object 10 times. The goal is to make variance measurable.

Setup (hold everything constant):

Same Mission Object: “Update ## Public Interfaces in product/docs/architecture.md to match product/src/.”
Same model, same temperature, same system instructions, same context slice.
Store the output of each run.

Hold everything deterministic except the stochastic call. Then measure drift where it matters: on the change surface.

Drift measurements

Unique diffs: how many distinct patches you get for the same Mission Object under identical Prep.
Failure rate: how often you cannot reach PASS within your circuit breakers (max iters, time, or diff budget).
Time-to-convergence: seconds (or iterations) until the patch is admissible under your Validators.

One simple proxy: count unique diffs

# Replace `run_effector_once` with your one-shot, diff-emitting command.
# The only rule: every run starts from the same baseline state.

for i in {1..10}; do
  git checkout -- product/docs/architecture.md
  run_effector_once > "run_${i}.diff" || echo "FAIL run=${i}"
  shasum "run_${i}.diff" | cut -d' ' -f1
done | sort -u | wc -l

Example summary (your numbers will vary):

mission_id=doc_sync:public_interfaces runs=10
unique_diffs=4
failure_rate=20%  (2/10 exceeded circuit breakers)
iterations_to_converge: p50=2 p90=6

This is Stochastic Drift with a ruler next to it: you can measure it, budget it, and reduce it by tightening context selection and hardening validators.

A simple drift coefficient (plus an illustrative run log)

You don’t need a perfect statistical model to make drift operational. Start with a simple, auditable coefficient:

drift_coefficient = unique_diffs / total_runs

This is an operational proxy, not a statistical claim or a benchmark. Use it to compare the same workflow over time, set budgets, and detect regressions — not to compare different repositories or teams.

Interpretation:

0.1 means “one in ten runs differs” (stable enough for many workflows).
1.0 means “every run is different” (you do not have a reusable automation surface yet).

It also helps to classify variance. Some differences are harmless (whitespace, reordering). Some are structural (different interfaces, different edit regions, new files).

Illustrative run log (N=10; not a benchmark):

run	result	diff class
1	PASS	A (canonical)
2	PASS	A (canonical)
3	PASS	B (whitespace-only)
4	PASS	B (whitespace-only)
5	PASS	C (reordered lines)
6	FAIL	— (exceeded circuit breakers)
7	PASS	A (canonical)
8	FAIL	— (exceeded circuit breakers)
9	PASS	D (structural variance)
10	PASS	A (canonical)

From that log:

unique_diffs = 4 → drift_coefficient = 0.4
failure_rate = 20%

If that feels “too high,” the fix is not more instruction text. Tighten your slice, tighten your allowed edit region, and add validators that make structural variance fail fast.

Why Ad-hoc Instructions Stop Scaling

Your first instinct is reasonable: make the instructions more specific.

Update `product/docs/architecture.md`.
Only edit content under `## Public Interfaces`.
Format as:

## Public Interfaces
- `name(arg1, arg2)`

This helps. You might go from “wildly inconsistent” to “mostly consistent.” But you hit a ceiling:

Longer instructions do not eliminate sampling. The model still produces variants.
You’re encoding constraints in text. Text is interpreted, not enforced.
You still need verification. Without Validators, you only learn the rules were violated after something breaks.

Compare the two approaches:

Instruction-string approach:

instructions = load_text("public_interfaces_instructions.txt")
output = model.generate(instructions)  # hope it followed the rules

Physics approach:

context = extract_public_signatures("product/src/")  # deterministic extraction
template = load_template("public_interfaces.jinja2") # structure is defined
draft = model.fill(template, context)            # bounded generation
validate_map_alignment(draft)                    # enforced, not hoped for

Instructions are suggestions. Physics are constraints.

Why This Matters for Your Loop

You built a working loop in Chapter 1. You wrapped the stochastic call in a Deterministic Sandwich in Chapter 2. You defined Validators in Chapter 3.

Chapter 4 gives you the uncomfortable truth those patterns are designed to address:

Without deterministic context and gates, your loop produces different outputs on every run.
Even if the output “looks fine,” it can drift in ways that break downstream automation.
Failures show up as non-reproducible production incidents, not clean local errors.

The drift you measured is not a tuning problem. It’s an architectural property of stochastic generation.

The next chapters show you how to constrain it:

Chapter 5: Loops that converge despite stochasticity.
Chapter 6: Context slicing that reduces variance.
Chapter 7: Mission Objects that make intent executable.

Actionable: What you can do this week

Measure drift on the change surface: Run the same Mission Object 10 times and count unique diffs. Record how often you hit FAIL and how many iterations it took to reach PASS.
Pick one invariant and enforce it: Choose one deterministic rule (exact signature surface, allowed edit region, required headings) and add a Validator that fails fast when the output drifts.