Part I: Build It (The Weekend Sprint)

Chapter 3 – Define “Good” (Your First Validators)

Start with one concrete gate:

{
  "validator": "json_schema",
  "status": "fail",
  "artifact": "generated/pr_summary_bad.json",
  "errors": [
    {
      "path": "reviewer_suggestions",
      "message": "'charlie_qa' is not of type 'array'"
    }
  ]
}

Nothing mystical happened there. The model produced plausible-looking output. A cheap deterministic gate caught a structural bug before review, and the loop now has a precise finding it can act on.

You’ve built a loop and named the sandwich. Now define “good” so the gate means something.

Validators turn “looks fine” into a deterministic signal (often PASS/FAIL) and give you failures you can act on. Start small: hard physics first (linters, types, schemas), then a few task-specific checks you can tighten over time.

One reality check up front: Validators are not omniscient. They are executable opinions about what matters. Your job is to keep them cheap, guaranteed to finish, and aligned with the invariants you actually care about.

Where Validators live in real systems

Validator Bundles (Make “good” explicit)

Validators are the hard checks. Budgets stop the loop from grinding forever. Together they turn open-ended iteration into a bounded system.

[Interactive model (web edition): toggle the checks in the bundle on and off, adjust the retry budget, then run the bundle against a sample diff and watch what becomes provable; each run produces a run record.]

In production systems, Validators are rarely “just scripts.” They usually exist as explicit workflow steps so the loop is inspectable and reproducible.

In the engine behind this book, the workflow runtime lives in core/workflow.py, and common checks live in core/steps/ (for example LintStep, TypeCheckStep, CoverageStep, and TestStep in core/steps/code_steps.py). A controller wires those steps into a deterministic graph with named transitions, then records a trace.

That implementation detail matters: once Validators are first-class nodes, you can tag them, order them, retry them, visualize them, and reuse the same gates across every loop.

The change phase can be a single model call, a chain, or a multi-agent swarm; the loop does not care because the checks validate the output artifact, not the process that produced it.

The Hard Physics of Validation

Forget the philosophical debates about AI truthfulness for a moment. Ask a simpler question: what can you check deterministically in code and data?

These are “hard physics” because they are entirely deterministic. Given an input and a rule, the outcome is always the same: valid or invalid. This is your anchor while the model remains variable.
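The cheapest illustration of that determinism, sketched in Python: the simplest hard-physics gate is “does this parse?”, and for a given input its verdict never changes.

```python
import json

def is_valid_json(text: str) -> bool:
    """Deterministic hard-physics gate: same input, same verdict, every run."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

The same contract holds for compilers, type checkers, and schema validators: the rule plus the input fully determine the outcome.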

Validator Taxonomy: Hard First, Then Semantic

Hard physics gets you artifacts that pass the gate. It does not guarantee correct artifacts.

In practice, a validator stack blends hard checks with semantic ones. Use this as a planning grid:

| Validator type | What it catches | What it misses | Effort to implement |
| --- | --- | --- | --- |
| Type checker | type mismatches, missing fields (typed surfaces) | business logic, runtime behavior | Low (greenfield), Medium (retrofit) |
| Formatter / linter | invalid syntax, common bug patterns, style drift | deep semantics | Low |
| Schema validator | structural correctness of JSON/YAML/etc | semantic meaning (“does this spec make sense?”) | Low |
| Unit tests | local logic errors (covered paths) | uncovered paths, integration boundaries | Medium |
| Integration tests | interface mismatches across components | full production environment behavior | High |
| Property-based tests | edge cases via systematic input exploration | wrong or incomplete properties / generators | Medium–High |
| Security scanners | known vulnerability patterns | novel vulnerabilities, business abuse cases | Low–Medium |
| Policy / scope checks | protected paths, budgets, governance invariants | correctness of the policy itself | Low–Medium |
| Semantic invariants | the rules that are “true for this system” | whatever you didn’t encode | Medium |

The table has a bias: it’s easier to add a new Validator than it is to decide what “good” should mean. That decision is the craft.

Semantic Validators: A few non-trivial cases

Semantic Validators are where you encode “this would pass schema + tests, but it’s still wrong for us.”

Whatever form these checks take, they stay deterministic, and they should emit structured signals, not opinions:

[validator] semantic_invariant_fail
  code=missing_docs_for_public_api
  detail="public surface changed but docs did not"
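A sketch of how that check might be implemented, assuming changed-file lists are available (for example from `git diff --name-only`); the path prefixes and the finding code are illustrative:

```python
# Hypothetical semantic invariant: a change to the public API surface must
# be accompanied by a docs change. Deterministic: file sets in, findings out.
def check_docs_for_public_api(changed_files: set) -> list:
    """Return structured findings; an empty list means the gate passes."""
    api_changed = any(f.startswith("src/api/") for f in changed_files)
    docs_changed = any(f.startswith("docs/") for f in changed_files)
    if api_changed and not docs_changed:
        return [{
            "code": "missing_docs_for_public_api",
            "detail": "public surface changed but docs did not",
        }]
    return []
```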

Sufficiency criteria (what “good enough” looks like)

There is no universal threshold, but teams do converge on practical bars. Treat these as heuristics you can ratchet over time.

Notice what’s missing: “100% coverage.” The goal is not to prove perfection. The goal is to make drift expensive and failures legible.

Who writes the Physics? (who owns the rules)

If Validators are the laws, you don’t want a system that can quietly rewrite its own laws to make the gates green.

A practical middle ground keeps ownership of the rules outside the loop. Chapter 10 is where we make this enforceable: protected paths, CODEOWNERS, and a boundary around the grader code so the loop cannot quietly edit its own checks.

Verification, validation, evaluation, and runtime assurance

It helps to separate four jobs that often get blurred: verification, validation, evaluation, and runtime assurance.

That role separation matters. A Validator emits findings, so in this book it remains a Sensor, not a Judge. The Judge is the decision layer: it operates over deterministic findings plus declared acceptance criteria and policy to decide whether the candidate passes, refines, reverts, or escalates.

You can absolutely use evaluation to help decide what to do next. But do not mistake it for Physics. If you cannot make it deterministic, it is not merge law.

Runtime assurance has the same posture. It produces evidence from production behavior, drift, and incidents, then feeds that evidence back into future Maps, Missions, and Validators. It does not create a softer parallel gate.

Property-based tests (Soft Physics) and how to design properties

Property-based tests are powerful because they explore the input space systematically. The trap is that “the function does the right thing” is not a property. You need a property you can assert deterministically.

Property design patterns that actually work:

  1. Round-trip: decode(encode(x)) == x (serializers, codecs, parsers).
  2. Idempotence: f(f(x)) == f(x) (normalizers, formatters, canonicalizers).
  3. Invariants: output stays inside a valid range (no negatives, monotonicity, sortedness).
  4. Reference comparison: new implementation matches a known-correct reference (even if slower).

Example (illustrative, Python + Hypothesis):

from hypothesis import given, strategies as st

def concatenate(a: str, b: str) -> str:
    return a + b

@given(st.text(), st.text())
def test_concatenate_preserves_length(a, b):
    result = concatenate(a, b)
    assert len(result) == len(a) + len(b)

The point isn’t the toy domain. It’s the move: choose a property that is cheap to evaluate, hard to game, and directly tied to the bug class you care about (off-by-one, boundary handling, unexpected nulls).
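A second sketch of the same move, this time pattern 2 (idempotence), using a hand-rolled seeded generator instead of a property-testing library so the check stays fully reproducible; `normalize_whitespace` is a hypothetical helper:

```python
import random

def normalize_whitespace(s: str) -> str:
    """Hypothetical normalizer: collapse runs of whitespace to single spaces."""
    return " ".join(s.split())

def check_normalize_idempotent(trials: int = 200, seed: int = 0) -> bool:
    """Property: f(f(x)) == f(x). The seeded RNG makes every run identical."""
    rng = random.Random(seed)
    alphabet = "ab \t\n"
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randrange(20)))
        once = normalize_whitespace(s)
        if normalize_whitespace(once) != once:
            return False
    return True
```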

Portability Map: Same Physics, Different Tooling

The pattern is stack-agnostic. The tool names change; the deterministic PASS/FAIL signals do not.

| Surface | Python | TypeScript | Rust |
| --- | --- | --- | --- |
| Formatting | ruff format / black | prettier | rustfmt |
| Linting | ruff | eslint | clippy |
| Types | mypy / pyright | tsc | rustc |
| Immune System | pytest | jest | cargo test |

This is not limited to application code. Infrastructure and policy surfaces have hard physics too. For Terraform, a deterministic validator suite might include terraform validate, tflint, and a policy scanner; see Appendix C for a ready-to-copy recipe.

For security tooling, the exact choices vary, but common examples include Bandit (Python), Semgrep (multi-language), and dependency vulnerability scanners (Snyk, OSV-Scanner, or platform-native equivalents). The tool doesn’t matter as much as the contract: deterministic PASS/FAIL on a bounded surface.

A Validator That Catches a Real Bug

Let’s anchor this with a concrete example. Imagine you’re using a large language model (LLM) to help summarize pull requests and suggest reviewers. Your system expects JSON output with specific fields. A common GenAI bug class is structural drift: the model sometimes deviates from the requested JSON format, subtly changing field names, omitting fields, or providing the wrong data type for a value.

We’ll use a JSON Schema validator to enforce the expected structure.

First, define a JSON Schema for the smallest invariant you care about. For example: “reviewer_suggestions must be an array of strings.”

{
  "type": "object",
  "required": ["reviewer_suggestions"],
  "properties": {
    "reviewer_suggestions": { "type": "array", "items": { "type": "string" } }
  }
}

Now if the model outputs "reviewer_suggestions": "charlie_qa", schema validation deterministically fails with a type error at reviewer_suggestions.
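A hand-rolled sketch of that gate in plain Python (in practice you would likely reach for a JSON Schema library; the function name is an assumption). It returns structured findings like the run record at the top of the chapter:

```python
import json

def check_reviewer_suggestions(payload: str) -> list:
    """Return structured findings; an empty list means the gate passes."""
    data = json.loads(payload)
    value = data.get("reviewer_suggestions")
    if value is None:
        return [{"path": "reviewer_suggestions",
                 "message": "required field missing"}]
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        return [{"path": "reviewer_suggestions",
                 "message": f"{value!r} is not of type 'array'"}]
    return []
```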

The implementation language doesn’t matter. The contract does: exit 0 on PASS, non-zero on FAIL.

That exit code is the universal signal a runner needs to halt the loop. It’s also why Minimum Viable Factory (MVF) v0 works: in the companion repo (github.com/kjwise/aoi_code), make validate is a deterministic gate that can block integration without debate. The validate_map_alignment.py script is a concrete example of a Validator that enforces a specific, hard rule.
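A minimal sketch of a validator honoring that exit-code contract (the file handling and field names are illustrative):

```python
import json

def validate_file(path: str) -> int:
    """Print structured findings and return an exit code (0 = PASS)."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        print(f"[validator] fail code=invalid_json detail={exc}")
        return 1
    if not isinstance(data.get("reviewer_suggestions"), list):
        print("[validator] fail code=type_error path=reviewer_suggestions")
        return 1
    print("[validator] pass")
    return 0
```

In a script you would end with `sys.exit(validate_file(sys.argv[1]))` under a `__main__` guard, so the runner sees the contract directly.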

Failure Modes: False Positives and False Negatives

Once you have Validators, you’ll hit the two classic failure modes: false positives, where the gate rejects output that is actually fine, and false negatives, where the gate passes output that is actually wrong.

The fix is not to abandon validation. The fix is to tune it.

Example false positive (too strict)

Imagine you require impact_scope for every PR summary. That’s reasonable for code changes, but maybe your workflow allows docs-only PRs where impact_scope is intentionally omitted. Your JSON Schema would fail an otherwise useful summary.

Tune options:
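One such option, sketched in Python: split the schema by PR type, so a docs-only PR is judged against a law that never required impact_scope. The schema shapes are illustrative:

```python
# Schema for code-change PRs: impact_scope is required.
CODE_PR_SCHEMA = {
    "type": "object",
    "required": ["reviewer_suggestions", "impact_scope"],
    "properties": {
        "reviewer_suggestions": {"type": "array", "items": {"type": "string"}},
        "impact_scope": {"type": "string"},
    },
}

# Schema for docs-only PRs: impact_scope intentionally not required.
DOCS_PR_SCHEMA = {
    "type": "object",
    "required": ["reviewer_suggestions"],
    "properties": {
        "reviewer_suggestions": {"type": "array", "items": {"type": "string"}},
    },
}

def schema_for(pr_kind: str) -> dict:
    """Selection logic made explicit: the gate picks the right law up front."""
    return DOCS_PR_SCHEMA if pr_kind == "docs" else CODE_PR_SCHEMA
```

The selection logic lives in the gate, not in the model’s output, so the decision stays deterministic.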

Example false negative (too weak)

Our schema doesn’t prove that "reviewer_suggestions" is a good list. This passes schema validation but is still wrong in practice:

"reviewer_suggestions": ["definitely_not_a_user_123"]

Tune options:
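One such option, sketched: add a semantic validator that checks each suggestion against a roster of known users (the roster and its source are assumptions; in a real system it might come from your identity provider):

```python
KNOWN_USERS = {"alice_dev", "bob_ops", "charlie_qa"}  # illustrative roster

def unknown_reviewers(suggestions: list) -> list:
    """Return suggestions that fail the roster check; empty means PASS."""
    return [name for name in suggestions if name not in KNOWN_USERS]
```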

Composing Validators: Build a checker stack

One validator is good, but several working together are much stronger. Later in the book that bundle becomes an “immune system,” but you do not need the term to use the pattern: chain checks together behind one command.

Consider a generated directory that contains JSON, Python code, and documentation. Each needs its own type of validation: schemas for the JSON, linting and type-checking for the Python, and doc-specific checks (for example, link or format validation) for the documentation.

Wire them behind one command surface (make validate, a CI job, or a task runner) and run them in a cheap-to-expensive order: formatting and schema checks first, then linters and type checkers, then tests.

Fail fast on the first violation. That’s not pessimism; it’s throughput.
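That chaining can be sketched as a tiny runner; the tool commands are illustrative, and any deterministic command with a meaningful exit code fits:

```python
import subprocess

# Checks ordered cheapest-first; each entry is a command line to run.
CHECKS = [
    ["ruff", "format", "--check", "generated/"],  # cheap: formatting
    ["ruff", "check", "generated/"],              # cheap: linting
    ["mypy", "generated/"],                       # medium: types
    ["pytest", "tests/"],                         # expensive: tests
]

def run_bundle(checks=CHECKS) -> int:
    """Run checks in order; stop at the first non-zero exit code."""
    for cmd in checks:
        code = subprocess.run(cmd).returncode
        if code != 0:
            print(f"[bundle] fail at: {' '.join(cmd)}")
            return code
    print("[bundle] pass")
    return 0
```

Wire `run_bundle` behind make validate or a CI job and the whole stack becomes one PASS/FAIL signal.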

This is also where AI changes the economics of strictness. It lowers the cost of generating drafts, and it lowers the cost of writing many of the validators, schemas, and harness glue that grade those drafts. When both sides get cheaper, strictness becomes leverage: schemas, parsers, type checks, and compilers turn hidden runtime drift into loud, local failures the loop can correct quickly. Rust is the clearest language example here: once AI can write the boilerplate, rustc, Clippy, and the type system stop feeling like ceremony and start acting like an always-on grader.

At this point, you have seen how to build a deterministic gate for model-generated outputs. Any deviation from your explicit definition of “good” (as encoded in schemas, linters, or type systems) will halt the loop before bad output propagates.

Coverage and Strictness: A Ladder, Not a Cliff

Don’t aim for perfect validation from the start. Think of validator coverage as a ladder:

  1. Start with the basics (minimal coverage, low strictness): Ensure your outputs are valid JSON/YAML/Python. Check one or two invariants you care about (types, required fields). This catches the most obvious system-breaking errors. Our schema example started with one invariant: "reviewer_suggestions" is an array of strings.

  2. Increase coverage (more files, more structures): Extend validators to all generated artifacts. If you generate 5 types of JSON, write 5 schemas. If you generate 10 Python files, lint and type-check all 10.

  3. Increase strictness (more granular rules): Once basic structure is guaranteed, start adding more specific rules. For example: add an enum for an impact_scope field ("minor", "medium", "major"), enforce minimum/maximum lengths, or add regex patterns. This catches more subtle, but still critical, semantic errors.

  4. Custom validators (business logic): For very specific business rules that can’t be expressed purely in schemas or linters, write custom scripts. For instance, a script that checks if all suggested reviewers (reviewer_suggestions) are actual known users in your system.

Each step up the ladder adds more reliability. The key is to build this incrementally, focusing on the highest-impact checks first, and tightening your grip on “good” over time.
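Steps 1 and 3 of the ladder can be seen side by side in a ratcheted version of the chapter’s schema; the enum, the length bound, and the second required field are the step-3 additions, and the exact constraints are illustrative:

```python
# Ladder step 1 kept: reviewer_suggestions must be an array of strings.
# Ladder step 3 added: impact_scope restricted to an enum, arrays non-empty.
SCHEMA_V2 = {
    "type": "object",
    "required": ["reviewer_suggestions", "impact_scope"],
    "properties": {
        "reviewer_suggestions": {
            "type": "array",
            "items": {"type": "string"},
            "minItems": 1,
        },
        "impact_scope": {"enum": ["minor", "medium", "major"]},
    },
}
```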


The stance behind the mechanics

Now that you’ve defined “good” as hard checks, here is the stance behind the mechanics.

Central dogma

We do not optimize for “smart” models. We optimize for loops that converge.

Model behavior varies. We do not negotiate with that variance. We constrain it.

A GenAI system is more likely to hold up in production when three things hold at once:

If you want the book’s canonical chain, here it is:

Mission Object → Context Architecture → Effectors → Immune System → Ouroboros Protocol

The meta-patterns

1) Don’t Chat, Compile

Chat is for ideation. Production work should end up as an artifact you can version and rerun: a Mission Object (typed request) plus a deterministic runner that produces a diff and a trace.

Rule: if it isn’t in a file, it isn’t an instruction. It’s just a vibe.

2) Physics is Law

If a change fails Physics, it does not exist. Physics is binary and fast: schemas, linters, tests, and architectural constraints.

If you allow “just this once” exceptions, autonomy can scale the exception faster than you can review it.

3) Recursion

The same gates apply to human and automated changes. Scheduled maintenance turns entropy signals into Missions and runs them through the same Validators. No privileged mode.

The goal

Build a factory where intent (the Map) is systematically converted into real changes (the Terrain) through verifiable loops, guarded by hard constraints, and recorded in a Ledger with enough evidence to explain why each diff exists.

The contract


Actionable: What you can do this week

  1. Identify a Generated Output: Pick one artifact generated by an LLM in your current workflow (e.g., a config file, a code snippet, a report).

  2. Define a Simple Schema/Rule: For that output, identify one clear, deterministic rule it must follow. Examples:

    • “It must be valid JSON.”

    • “It must be valid Python syntax.”

    • “If it’s a JSON file, it must have a name and version field.”

  3. Implement a Basic Validator:

    • If it’s JSON, write a simple JSON Schema and run a schema validator in CI (any language/tooling is fine) that exits non-zero on failure.

    • If it’s Python, run python -m py_compile <file.py> or ruff/black --check as a deterministic gate.

  4. Integrate and Observe: Add this validator to your existing Software Development as Code (SDaC) Makefile or validation script. Intentionally generate an output that violates your rule and confirm that your build fails deterministically. Then, generate a correct output and confirm it passes.

  5. Expand (Optional): If you’re feeling ambitious, add a second, different type of validator (e.g., if you have a JSON schema, add a linter for a Python script).

  6. Tune one failure mode: Find one false positive or false negative and fix it by splitting schemas, adding a semantic validator, or making the selection logic explicit in Prep.
