Chapter 3 – Define “Good” (Your First Validators)
Start with one concrete gate:
```json
{
  "validator": "json_schema",
  "status": "fail",
  "artifact": "generated/pr_summary_bad.json",
  "errors": [
    {
      "path": "reviewer_suggestions",
      "message": "'charlie_qa' is not of type 'array'"
    }
  ]
}
```

Nothing mystical happened there. The model produced plausible-looking output. A cheap deterministic gate caught a structural bug before review, and the loop now has a precise finding it can act on.
You’ve built a loop and named the sandwich. Now define “good” so the gate means something.
Validators turn “looks fine” into a deterministic signal (often `PASS`/`FAIL`) and give you failures you can act on. Start small: hard physics first (linters, types, schemas), then a few task-specific checks you can tighten over time.
Two practical rules:
- A Validator reports what it found; it does not decide the entire outcome. In the book’s terms, a Validator is a Sensor, not a Judge.
- Put the cheapest Validators first. Failing fast is not pessimism; it’s throughput.
One reality check up front: Validators are not omniscient. They are executable opinions about what matters. Your job is to keep them cheap, guaranteed to finish, and aligned with the invariants you actually care about.
Where Validators live in real systems
Validator Bundles (Make “good” explicit)
Validators are the hard checks. Budgets stop the loop from grinding forever. Together they turn open-ended iteration into a bounded system.
In production systems, Validators are rarely “just scripts.” They usually exist as explicit workflow steps so the loop is inspectable and reproducible.
In the engine behind this book, the workflow runtime lives in `core/workflow.py`, and common checks live in `core/steps/` (for example `LintStep`, `TypeCheckStep`, `CoverageStep`, and `TestStep` in `core/steps/code_steps.py`). A controller wires those steps into a deterministic graph with named transitions, then records a trace.
That implementation detail matters: once Validators are first-class nodes, you can tag them, order them, retry them, visualize them, and reuse the same gates across every loop.
The change phase can be a single model call, a chain, or a multi-agent swarm; the loop does not care because the checks validate the output artifact, not the process that produced it.
The Hard Physics of Validation
Forget the philosophical debates about AI truthfulness for a moment. Ask a simpler question: what can you check deterministically in code and data?
Linters: Do your generated code blocks conform to a style guide? Are they valid syntax?
Type Checkers: Does the generated data structure use the expected data types (e.g., an integer where an integer is expected, not a string)?
Schema Validators: Does the overall structure of the generated output match a predefined schema (e.g., a JSON Schema, a Protobuf definition, a Pydantic model)?
Contract Tests: Does the generated output fulfill a specific contract, like an API request or response format?
Policy Checks: Does the generated content comply with specific policies, such as security rules or legal disclaimers?
These are “hard physics” because they are entirely deterministic. Given an input and a rule, the outcome is always the same: valid or invalid. This is your anchor while the model remains variable.
Validator Taxonomy: Hard First, Then Semantic
Hard physics gets you artifacts that pass the gate. It does not guarantee correct artifacts.
In practice, a validator stack blends:
- Hard Physics Validators: syntax, formatting, types, schemas, path invariants.
- Terminology Validators: banned/redirect terms, canonical vocabulary, contract language.
- Semantic Validators: domain invariants you care about (e.g., “reviewers must be real users,” “no new public API without docs,” “diff touches only allowed paths”).
- Policy Validators: security and compliance rules (secrets scanning, licensing checks, dependency allow/deny lists).
Use this as a planning grid:
| Validator type | What it catches | What it misses | Effort to implement |
|---|---|---|---|
| Type checker | type mismatches, missing fields (typed surfaces) | business logic, runtime behavior | Low (greenfield), Medium (retrofit) |
| Formatter / linter | invalid syntax, common bug patterns, style drift | deep semantics | Low |
| Schema validator | structural correctness of JSON/YAML/etc | semantic meaning (“does this spec make sense?”) | Low |
| Unit tests | local logic errors (covered paths) | uncovered paths, integration boundaries | Medium |
| Integration tests | interface mismatches across components | full production environment behavior | High |
| Property-based tests | edge cases via systematic input exploration | wrong or incomplete properties / generators | Medium–High |
| Security scanners | known vulnerability patterns | novel vulnerabilities, business abuse cases | Low–Medium |
| Policy / scope checks | protected paths, budgets, governance invariants | correctness of the policy itself | Low–Medium |
| Semantic invariants | the rules that are “true for this system” | whatever you didn’t encode | Medium |
The table has a bias: it’s easier to add a new Validator than it is to decide what “good” should mean. That decision is the craft.
Semantic Validators: A few non-trivial cases
Semantic Validators are where you encode “this would pass schema + tests, but it’s still wrong for us.”
Common patterns (all deterministic):
- Compatibility checks: compare OpenAPI/JSON Schema to the previous version and fail on breaking changes unless the Mission explicitly allows them.
- Coupled surfaces: if a public surface changes (exported symbol, route table, CLI flag), require the corresponding Map surface (docs, spec, inventory) to change in the same diff.
- Safety rails: if a diff touches sensitive surfaces (auth, permissions, encryption settings), require additional evidence (a specific test case, a policy waiver, or a checklist artifact) before merge.
They should emit structured signals, not opinions:
```
[validator] semantic_invariant_fail
code=missing_docs_for_public_api
detail="public surface changed but docs did not"
```
Sufficiency criteria (what “good enough” looks like)
There is no universal threshold, but teams do converge on practical bars. Treat these as heuristics you can ratchet over time:
- Minimum bar (local loop): formatting + lint + basic types (if you have them) + one fast Immune System case or equivalent gate that catches the common failure mode for your surface.
- Professional bar (shared repo): above + test cases for invariants + a scope validator (paths + allowed edit regions) + at least one security check (secrets scan and/or static analysis).
- Production-critical bar (blast radius is real): above + integration tests for key flows + property-based tests for the core parser/serializer or money/auth logic + explicit policy gates (budgets, protected paths) + human approval on governance surfaces.
Notice what’s missing: “100% coverage.” The goal is not to prove perfection. The goal is to make drift expensive and failures legible.
Who writes the Physics? (who owns the rules)
If Validators are the laws, you don’t want a system that can quietly rewrite its own laws to make the gates green.
A practical middle ground:
- Core invariants are human-authored and protected. The Validator that enforces “don’t touch protected paths” should not be solely written by the agent it constrains.
- Derived validators can be AI-assisted. Let the change step propose a new test case, a schema update, or a semantic invariant, but require it to pass existing Physics and require human review when it changes governance surfaces.
- Validators must be testable artifacts. A Validator is only trustworthy when you can show a case where it fails and a case where it passes. Treat “add a validator” as a normal change that itself goes through gates.
Chapter 10 is where we make this enforceable: protected paths, CODEOWNERS, and a boundary around the grader code so the loop cannot quietly edit its own checks.
Verification, validation, evaluation, and runtime assurance
It helps to separate four jobs that often get blurred:
- Verification: deterministic checks that an artifact is correctly formed or aligned.
- Validation: deterministic confirmation that the bounded change satisfies declared intent, acceptance criteria, and merge policy.
- Evaluation (Judgement support): heuristic or model-based scoring. Useful, but not reproducible enough to be gate law.
- Runtime assurance: post-merge evidence that the live system is still operating inside acceptable bounds.
That role separation matters. A Validator emits findings, so in this book it remains a Sensor, not a Judge. The Judge is the decision layer: it operates over deterministic findings plus declared acceptance criteria and policy to decide whether the candidate passes, refines, reverts, or escalates.
You can absolutely use evaluation to help decide what to do next. But do not mistake it for Physics. If you cannot make it deterministic, it is not merge law.
Runtime assurance has the same posture. It produces evidence from production behavior, drift, and incidents, then feeds that evidence back into future Maps, Missions, and Validators. It does not create a softer parallel gate.
Property-based tests (Soft Physics) and how to design properties
Property-based tests are powerful because they explore the input space systematically. The trap is that “the function does the right thing” is not a property. You need a property you can assert deterministically.
Property design patterns that actually work:
- Round-trip: `decode(encode(x)) == x` (serializers, codecs, parsers).
- Idempotence: `f(f(x)) == f(x)` (normalizers, formatters, canonicalizers).
- Invariants: output stays inside a valid range (no negatives, monotonicity, sortedness).
- Reference comparison: new implementation matches a known-correct reference (even if slower).
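These patterns work as plain assertions even before you adopt a property-testing library. A minimal stdlib sketch, where the JSON codec illustrates round-trip and a made-up whitespace normalizer illustrates idempotence:

```python
import json

# Round-trip: decode(encode(x)) == x, here with the stdlib JSON codec.
def round_trips(x) -> bool:
    return json.loads(json.dumps(x)) == x

# Idempotence: f(f(x)) == f(x), here with a toy whitespace normalizer.
def normalize(s: str) -> str:
    return " ".join(s.split())

def is_idempotent_on(s: str) -> bool:
    return normalize(normalize(s)) == normalize(s)
```

A property-testing library then does what the sketch cannot: generate thousands of adversarial inputs and shrink failures to minimal counterexamples.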
Example (illustrative, Python + Hypothesis):
```python
from hypothesis import given, strategies as st

def concatenate(a: str, b: str) -> str:
    # Function under test; stands in for your real implementation.
    return a + b

@given(st.text(), st.text())
def test_concatenate_preserves_length(a, b):
    result = concatenate(a, b)
    assert len(result) == len(a) + len(b)
```

The point isn’t the toy domain. It’s the move: choose a property that is cheap to evaluate, hard to game, and directly tied to the bug class you care about (off-by-one, boundary handling, unexpected nulls).
Portability Map: Same Physics, Different Tooling
The pattern is stack-agnostic. The tool names change; the deterministic PASS/FAIL signals do not.
| Surface | Python | TypeScript | Rust |
|---|---|---|---|
| Formatting | `ruff format` / `black` | `prettier` | `rustfmt` |
| Linting | `ruff` | `eslint` | `clippy` |
| Types | `mypy` / `pyright` | `tsc` | `rustc` |
| Immune System | `pytest` | `jest` | `cargo test` |
This is not limited to application code. Infrastructure and policy surfaces have hard physics too. For Terraform, a deterministic validator suite might include `terraform validate`, `tflint`, and a policy scanner; see Appendix C for a ready-to-copy recipe.

For security tooling, the exact choices vary, but common examples include Bandit (Python), Semgrep (multi-language), and dependency vulnerability scanners (Snyk, OSV-Scanner, or platform-native equivalents). The tool doesn’t matter as much as the contract: deterministic PASS/FAIL on a bounded surface.
A Validator That Catches a Real Bug
Let’s anchor this with a concrete example. Imagine you’re using a large language model (LLM) to help summarize pull requests and suggest reviewers. Your system expects a JSON output with specific fields. A common GenAI bug class is structural drift—the model sometimes deviates from the requested JSON format, either subtly changing field names, missing fields, or providing the wrong data type for a value.
We’ll use a JSON Schema validator to enforce the expected structure.
First, define a JSON Schema for the smallest invariant you care about. For example: “`reviewer_suggestions` must be an array of strings.”
```json
{
  "type": "object",
  "required": ["reviewer_suggestions"],
  "properties": {
    "reviewer_suggestions": { "type": "array", "items": { "type": "string" } }
  }
}
```

Now if the model outputs `"reviewer_suggestions": "charlie_qa"`, schema validation deterministically fails with a type error at `reviewer_suggestions`.
The implementation language doesn’t matter. The contract does:
- Deterministic `PASS`/`FAIL`
- Machine-readable failure (path + message)
- Exit code `0` on success, non-zero on failure
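A minimal stdlib-only sketch of that contract for our single invariant (in practice you would reach for a JSON Schema library; the function and file names here are illustrative):

```python
import json
import sys

def check_reviewer_suggestions(artifact: dict) -> list[str]:
    """Return machine-readable findings (path + message); empty list means PASS."""
    errors = []
    value = artifact.get("reviewer_suggestions")
    if value is None:
        errors.append("reviewer_suggestions: required field is missing")
    elif not isinstance(value, list):
        errors.append(
            f"reviewer_suggestions: expected array, got {type(value).__name__}"
        )
    elif not all(isinstance(item, str) for item in value):
        errors.append("reviewer_suggestions: all items must be strings")
    return errors

def main(path: str) -> int:
    with open(path) as f:
        findings = check_reviewer_suggestions(json.load(f))
    for finding in findings:
        print(f"FAIL {finding}")
    return 1 if findings else 0  # exit code 0 on success, non-zero on failure

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

Every piece of the contract is present: the finding names a path, the message names the violation, and the process exit code is the signal a runner can act on.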
That exit code is the universal signal a runner needs to halt the loop. It’s also why Minimum Viable Factory (MVF) v0 works: in the companion repo (github.com/kjwise/aoi_code), `make validate` is a deterministic gate that can block integration without debate. The `validate_map_alignment.py` script is a concrete example of a Validator that enforces a specific, hard rule.
Failure Modes: False Positives and False Negatives
Once you have Validators, you’ll hit the two classic failure modes:
- False positive: the Validator fails on an artifact you would accept.
- False negative: the Validator passes an artifact you would reject.
The fix is not to abandon validation. The fix is to tune it.
Example false positive (too strict)
Imagine you require `impact_scope` for every PR summary. That’s reasonable for code changes, but maybe your workflow allows docs-only PRs where `impact_scope` is intentionally omitted. Your JSON Schema would fail an otherwise useful summary.
Tune options:
- Split the schema into variants (`pr_summary_code_v1.json` vs `pr_summary_docs_v1.json`) and select deterministically in `Prep`.
- Keep one schema but make `impact_scope` optional, then add a separate Policy Validator that requires it only when the diff includes code paths.
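The deterministic selection in `Prep` can be as small as a pure function over the diff. The extension list below is a hypothetical stand-in for whatever counts as “code” in your repo:

```python
CODE_SUFFIXES = (".py", ".ts", ".rs")  # hypothetical: your repo's code surfaces

def select_schema(changed_paths: set[str]) -> str:
    # Deterministic: the same diff always selects the same schema variant.
    if any(p.endswith(CODE_SUFFIXES) for p in changed_paths):
        return "pr_summary_code_v1.json"
    return "pr_summary_docs_v1.json"
```

Because selection is part of `Prep`, the gate stays binary: no validator ever has to guess which rules apply.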
Example false negative (too weak)
Our schema doesn’t prove that `"reviewer_suggestions"` is a good list. This passes schema validation but is still wrong in practice:

```json
"reviewer_suggestions": ["definitely_not_a_user_123"]
```
Tune options:
- Add a Semantic Validator that cross-checks suggestions against an allowlist (or your directory) and fails if any are unknown.
- Add a budget (“at most 3 suggestions”) and a policy (“must include at least 1 code owner when touching protected paths”).
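Both tune options are a few lines of deterministic code. A sketch, assuming a directory snapshot you can load; the user names and the budget of 3 are illustrative:

```python
KNOWN_USERS = {"alice_dev", "bob_ops", "charlie_qa"}  # illustrative directory snapshot
MAX_SUGGESTIONS = 3  # illustrative budget

def check_reviewers(suggestions: list[str]) -> list[str]:
    """Semantic validator: every suggestion must be a known user, within budget."""
    errors = []
    unknown = sorted(s for s in suggestions if s not in KNOWN_USERS)
    if unknown:
        errors.append(f"unknown reviewers: {unknown}")
    if len(suggestions) > MAX_SUGGESTIONS:
        errors.append(f"too many suggestions: {len(suggestions)} > {MAX_SUGGESTIONS}")
    return errors
```

The allowlist turns “a good list” from a judgement call into a membership test, which is exactly the move that closes a false negative.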
Composing Validators: Build a checker stack
One validator is good, but several working together are much stronger. Later in the book that bundle becomes an “immune system,” but you do not need the term to use the pattern: chain checks together behind one command.
Consider a generated directory that contains JSON,
Python code, and documentation. Each might need its own type of
validation:
- `.py` files: `black` (formatter), `mypy` (type checker)
- `.json` files: `jsonschema` (schema validator)
- `.md` files: `markdownlint` (style checker)
Wire them behind one command surface (`make validate`, a CI job, or a task runner) and run them in a cheap-to-expensive order:
- schemas / parsers
- formatting + lint
- type checks
- tests
Fail fast on the first violation. That’s not pessimism; it’s throughput.
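The ordering can be wired up in a few lines. A sketch of a fail-fast runner; the check commands are placeholders standing in for your real tools:

```python
import subprocess
import sys

# Placeholder commands standing in for real tools (schemas, lint, types, tests),
# ordered cheapest-first.
CHECKS = [
    ("schemas", [sys.executable, "-c", "print('schemas ok')"]),
    ("lint",    [sys.executable, "-c", "print('lint ok')"]),
    ("types",   [sys.executable, "-c", "print('types ok')"]),
    ("tests",   [sys.executable, "-c", "print('tests ok')"]),
]

def run_stack(checks=CHECKS) -> int:
    for name, cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAIL {name}")
            return result.returncode  # fail fast: later checks never run
        print(f"PASS {name}")
    return 0
```

A `Makefile` target or CI job gets the same behavior for free: `make` and most runners stop on the first non-zero exit code.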
This is also where AI changes the economics of strictness. It lowers the cost of generating drafts, and it lowers the cost of writing many of the validators, schemas, and harness glue that grade those drafts. When both sides get cheaper, strictness becomes leverage: schemas, parsers, type checks, and compilers turn hidden runtime drift into loud, local failures the loop can correct quickly. Rust is the clearest language example here: once AI can write the boilerplate, `rustc`, Clippy, and the type system stop feeling like ceremony and start acting like an always-on grader.
At this point, you have seen how to build a deterministic gate for model-generated outputs. Any deviation from your explicit definition of “good” (as encoded in schemas, linters, or type systems) will halt the loop before bad output propagates.
Coverage and Strictness: A Ladder, Not a Cliff
Don’t aim for perfect validation from the start. Think of validator coverage as a ladder:
1. Start with the basics (minimal coverage, low strictness): ensure your outputs are valid JSON/YAML/Python. Check one or two invariants you care about (types, required fields). This catches the most obvious system-breaking errors. Our schema example started with one invariant: `"reviewer_suggestions"` is an array of strings.
2. Increase coverage (more files, more structures): extend validators to all generated artifacts. If you generate 5 types of JSON, write 5 schemas. If you generate 10 Python files, lint and type-check all 10.
3. Increase strictness (more granular rules): once basic structure is guaranteed, start adding more specific rules. For example: add an `enum` for an `impact_scope` field (`"minor"`, `"medium"`, `"major"`), enforce minimum/maximum lengths, or add regex patterns. This catches more subtle, but still critical, semantic errors.
4. Custom validators (business logic): for very specific business rules that can’t be expressed purely in schemas or linters, write custom scripts. For instance, a script that checks if all suggested reviewers (`reviewer_suggestions`) are actual known users in your system.
Each step up the ladder adds more reliability. The key is to build this incrementally, focusing on the highest-impact checks first, and tightening your grip on “good” over time.
The stance behind the mechanics
Now that you’ve defined “good” as hard checks, here is the stance behind the mechanics.
Central dogma
We do not optimize for “smart” models. We optimize for loops that converge.
Model behavior varies. We do not negotiate with that variance. We constrain it.
A GenAI system is more likely to hold up in production when three things hold at once:
- It carries durable intent: a Mission Object, which is the typed request you can name, version, and test.
- It resists entropy as models, data, and people change.
- It adapts through a repeatable ratchet of feedback and revision, not heroic one-off instructions.
If you want the book’s canonical chain, here it is:
Mission Object → Context Architecture → Effectors → Immune System → Ouroboros Protocol
The meta-patterns
1) Don’t Chat, Compile
Chat is for ideation. Production work should end up as an artifact you can version and rerun: a Mission Object (typed request) plus a deterministic runner that produces a diff and a trace.
Rule: if it isn’t in a file, it isn’t an instruction. It’s just a vibe.
2) Physics is Law
If a change fails Physics, it does not exist. Physics is binary and fast: schemas, linters, tests, and architectural constraints.
If you allow “just this once” exceptions, autonomy can scale the exception faster than you can review it.
3) Recursion
The same gates apply to human and automated changes. Scheduled maintenance turns entropy signals into Missions and runs them through the same Validators. No privileged mode.
The goal
Build a factory where intent (the Map) is systematically converted into real changes (the Terrain) through verifiable loops, guarded by hard constraints, and recorded in a Ledger with enough evidence to explain why each diff exists.
- If it cannot be verified, it does not ship.
- If it cannot be reproduced, it does not exist.
The contract
- We buy down uncertainty with constraints, logs, and tests.
- We treat language as a contract.
- We do not ship vibes. We ship verifiable state.
- We do not trust the model.
- We trust the loop.
Actionable: What you can do this week
1. Identify a Generated Output: pick one artifact generated by an LLM in your current workflow (e.g., a config file, a code snippet, a report).
2. Define a Simple Schema/Rule: for that output, identify one clear, deterministic rule it must follow. Examples:
   - “It must be valid JSON.”
   - “It must be valid Python syntax.”
   - “If it’s a JSON file, it must have a `name` and `version` field.”
3. Implement a Basic Validator:
   - If it’s JSON, write a simple JSON Schema and run a schema validator in CI (any language/tooling is fine) that exits non-zero on failure.
   - If it’s Python, run `python -m py_compile <file.py>` or `ruff` / `black --check` as a deterministic gate.
4. Integrate and Observe: add this validator to your existing Software Development as Code (SDaC) `Makefile` or validation script. Intentionally generate an output that violates your rule and confirm that your build fails deterministically. Then, generate a correct output and confirm it passes.
5. Expand (Optional): if you’re feeling ambitious, add a second, different type of validator (e.g., if you have a JSON schema, add a linter for a Python script).
6. Tune one failure mode: find one false positive or false negative and fix it by splitting schemas, adding a semantic validator, or making the selection logic explicit in `Prep`.