Chapter 15 – Appendix B: Failure Mode Gallery
This appendix catalogues common failure modes encountered when building and operating Software Development as Code (SDaC) loops. Understanding these patterns helps in designing robust systems, writing effective Mission Objects, and debugging issues when they arise. Each entry describes the failure, its typical causes, provides an example, and suggests diagnostic and mitigation strategies.
Slice Too Large or Too Small
What it looks like
The generative system (e.g., your Map or Updater) produces output that is either overwhelmingly broad and unfocused, or excessively granular and ineffective for the given task.
Slice Too Large: Output involves significant changes across many files, introduces complex new structures without explicit instruction, or attempts to solve problems far beyond the immediate scope. The resulting diff might be impossible to review meaningfully, or the changes might contradict implicit system constraints.
Slice Too Small: Output is trivial, making minimal changes that don’t address the core problem. The system might require many iterations to achieve a simple goal, or worse, generate an “empty” change when a real change was expected.
Why it happens
This typically stems from a mission that is scoped too broadly or too narrowly for your generative step, often embedded in the system’s Mission Object.
Too Large: The Mission Object is too open-ended, lacks clear boundaries, or does not sufficiently specify the kind of change expected. The LLM interprets the lack of constraints as an invitation to be highly creative or broad. This is common when the context provided to the LLM is also very large, encouraging it to draw connections across disparate parts of the codebase.
Too Small: The Mission Object is overly restrictive, specifies an atomic change when a conceptual refactor is needed, or the provided context is insufficient for the LLM to understand the broader implications of the change. This can also occur if validation rules are so strict that any meaningful change is immediately rejected, leading the LLM to attempt only the most minor (and often ineffective) adjustments.
Example
Consider an Updater whose mission is to “refactor the UserPreferences module to improve performance.”
Slice Too Large: The Updater rewrites not only UserPreferences.py but also modifies database schemas, introduces a new caching layer, and changes API endpoints, none of which were explicitly requested or scoped. The generated pull request touches 50+ files and breaks existing integration Immune System checks.

--- a/src/services/user_preferences.py
+++ b/src/services/user_preferences.py
# (extensive changes to caching, new database calls, etc.)
--- a/src/database/schema.sql
+++ b/src/database/schema.sql
# (new table for user preference cache)
--- a/src/api/v1/preferences.py
+++ b/src/api/v1/preferences.py
# (new API endpoint for cache invalidation)

Slice Too Small: The Updater only reorders imports in UserPreferences.py or changes a single variable name, failing to address any performance concerns. The generated pull request is trivial and does not resolve the original mission.

--- a/src/services/user_preferences.py
+++ b/src/services/user_preferences.py
@@ -1,6 +1,6 @@
 import os
 import json
-from typing import Dict, Any
+from typing import Any, Dict  # Trivial reorder
Diagnostics
Diff Size and Scope: Review the generated diff. Is it concentrated in the expected areas or spread widely? (See the sketch after this list for a way to automate this check.)
Validator Output: Do Validators immediately fail due to unrelated changes, or are they silent because the changes are too minor?
Traceability: Can you directly map every significant change in the output back to an explicit instruction in the Mission Object or relevant context?
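One way to make the diff-size check objective is to scan the change’s stats before anything reaches review. A minimal sketch in Python, assuming the change exists on a local git branch; the thresholds are illustrative, not recommendations:

import subprocess

# Illustrative thresholds; tune them to your project and typical slice size.
MAX_FILES = 10
MAX_CHANGED_LINES = 400

def diff_scope(base_ref="main"):
    """Summarize how broad a generated change is, using git's --numstat output."""
    out = subprocess.run(
        ["git", "diff", "--numstat", base_ref],
        capture_output=True, text=True, check=True,
    ).stdout
    files = 0
    changed_lines = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        files += 1
        # Binary files report "-" for line counts; treat them as zero.
        changed_lines += int(added) if added.isdigit() else 0
        changed_lines += int(deleted) if deleted.isdigit() else 0
    too_large = files > MAX_FILES or changed_lines > MAX_CHANGED_LINES
    return {"files": files, "changed_lines": changed_lines, "too_large": too_large}

A result with too_large set is a prompt to investigate the slice, not an automatic rejection.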
Mitigation
Refine Mission Objects: Make missions explicit, bounded, and focused. Use phrases like “Only modify files in src/feature_x/” or “Generate a new function that does Y, do not alter existing interfaces.”
Context Management: Provide only the necessary context. Too much context can lead to overreach; too little can lead to an inability to make meaningful changes.
Pre-computation/Planning: For larger tasks, use a Map step to break down the problem into smaller, well-defined Updater missions. This is the core Map-Updater pattern.
Post-processing/Filtering: Implement mechanisms (e.g., custom Effectors, sed commands, file path filters) to prune irrelevant changes before they reach Validators or human review (see the sketch after this list).
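As a concrete example of post-processing, a small filter can drop any generated change whose path falls outside the mission’s scope before it ever reaches a Validator. A minimal sketch in Python; the allowlist globs and the (path, diff) change representation are assumptions, not a fixed interface:

from fnmatch import fnmatch

# Globs the current Mission Object allows the Updater to touch (hypothetical).
ALLOWED_PATHS = ["src/feature_x/*", "tests/feature_x/*"]

def prune_out_of_scope(changes):
    """Keep only generated changes whose file path matches an allowed glob.

    `changes` is assumed to be a list of (path, diff_text) pairs produced by the Updater.
    """
    kept, rejected = [], []
    for path, diff_text in changes:
        if any(fnmatch(path, pattern) for pattern in ALLOWED_PATHS):
            kept.append((path, diff_text))
        else:
            rejected.append(path)
    if rejected:
        # Surfacing the pruned paths helps the next Mission Object tighten its scope.
        print(f"Pruned {len(rejected)} out-of-scope change(s): {rejected}")
    return kept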
Thrash
What it looks like
The SDaC loop gets stuck in a cycle where the Updater repeatedly makes and undoes changes, or makes minor, ineffective adjustments that don’t lead to a resolution. This can manifest as:
Oscillating diffs: changes in one commit are reversed or altered significantly in the next.
Repeated Validator failures: the same Validator error appears in consecutive loop iterations.
Slow progress: the system takes an inordinate number of iterations to achieve a simple goal, often failing along the way.
Why it happens
Thrash typically results from a mismatch or conflict between the mission, the Updater’s capabilities, and the validation rules.
Conflicting Constraints: The Mission Object, provided context, and Validator rules implicitly or explicitly demand contradictory outcomes. The Updater tries to satisfy one constraint, which then violates another.
Insufficient Feedback: The Updater receives only binary pass/fail feedback from Validators, without specific guidance on how to correct an error. It’s left guessing.
Lack of State/Memory: The Updater doesn’t “remember” its previous attempts or the specific failures it encountered, leading it to repeat the same mistakes.
Overly Aggressive Validators: Validators might be too sensitive, flagging minor stylistic issues as critical errors, causing the Updater to churn on trivial Refinements.
Stale Context: The context provided to the Updater doesn’t reflect the current state of the code after its last change, leading it to work on an outdated mental model.
Example
An Updater is tasked with “ensuring all new functions have docstrings.” A Validator enforces both the presence of docstrings and a maximum line length of 80 characters.
Updater: Adds a docstring to a function, but the docstring makes the line exceed 80 characters.
def calculate_sum(a, b):
    """This function calculates the sum of two numbers, a and b, and returns the result."""  # Line too long
    return a + b

Validator: Fails due to line length.
Updater: Tries to Refine the line length, but in doing so, either removes part of the docstring (making it too short or invalid) or formats it in a way that the Validator still finds problematic, or simply tries to wrap it without realizing that the original docstring content itself is too verbose.
def calculate_sum(a, b):
    """Calculates the sum of two numbers."""  # Valid line, but less descriptive; the original mission did not ask to shorten it.
    return a + b

(Or, if the Updater focuses only on the line length, it might remove the docstring entirely, failing the original docstring mission.)
This cycle repeats, with the Updater failing to satisfy both constraints simultaneously, or making partial Refinements that don’t stick.
Diagnostics
Loop History: Examine the sequence of changes and Validator outputs over several iterations. Look for repetitive patterns or alternating failures.
Conflicting Rules: Review your Mission Object, context, and all active Validator configurations for potential overlaps or contradictions.
Updater’s Internal State: If your tooling exposes it, inspect the intermediate reasoning of the generative model to understand its decision-making process.
Mitigation
Prioritize Constraints: If conflicts are inevitable, ensure your Mission Object explicitly prioritizes which rules take precedence.
Granular Feedback: Instead of just pass/fail, structure Validator output to provide specific, actionable feedback (e.g., “Line 85: max length 80 exceeded, current length 120” instead of just “Linter error”). See the sketch after this list for one way to structure it.
Contextual Memory: Equip your Updater with a way to retain short-term memory of previous attempts and Validator feedback. This can be as simple as appending the last N Validator failures to the Mission Object for the next attempt.
Sequential Enforcement: If certain rules are inherently conflicting, consider enforcing them in separate, sequential Updater steps rather than simultaneously. For example, “add docstrings” then “lint for line length.”
Tune Validators: Adjust the strictness of Validators. Are all rules truly critical for automated enforcement, or can some be softened or moved to human review?
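One lightweight way to combine granular feedback with contextual memory is to have Validators emit structured findings and to append the most recent ones to the next Mission Object. A minimal sketch; the field names and the plain-text Mission Object format are hypothetical:

from dataclasses import dataclass

@dataclass
class ValidatorFinding:
    """One actionable failure from a Validator run."""
    validator: str   # e.g., "line-length"
    path: str
    line: int
    message: str     # e.g., "max length 80 exceeded, current length 120"

def mission_with_memory(base_mission: str, recent_findings: list, keep_last: int = 3) -> str:
    """Append the last N Validator failures so the Updater does not repeat them."""
    if not recent_findings:
        return base_mission
    bullet_lines = [
        f"- {f.validator}: {f.path}:{f.line} {f.message}"
        for f in recent_findings[-keep_last:]
    ]
    return base_mission + "\n\nPrevious attempts failed validation:\n" + "\n".join(bullet_lines)

Keeping the memory short (the last few findings) avoids bloating the context while still breaking the repeat-the-same-mistake cycle.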
Map-Updater Invents Structure
What it looks like
The generative system (typically the Map or Updater) introduces new file paths, directories, data structures, or architectural patterns that were not explicitly part of its mission, were not present in the provided context, or deviate significantly from established project conventions. This often leads to:
Broken builds or CI/CD pipelines (e.g., new files not included in build Effectors).
Unreviewable changes (e.g., a “feature” implemented in a completely new, isolated mini-application).
Increased cognitive load for human engineers (e.g., a new, undocumented architectural pattern).
Difficulties in downstream automation (e.g., a new data format that breaks existing parsers).
Why it happens
This is a form of Stochastic Drift or over-creativity, often stemming from an overly broad mission or insufficient guardrails.
Broad Mission, No Constraints: The Mission Object is too high-level (“Implement Feature X”) without specific instructions on where or how to implement it within the existing codebase structure. The LLM then defaults to creating its own structure.
Lack of Structural Context: The generative model is not sufficiently grounded in the existing project’s file hierarchy, module boundaries, or design patterns. It doesn’t “know” the established way of doing things.
Pre-training Bias: LLMs are trained on vast amounts of code and might lean towards common patterns (e.g., creating a utils folder, a config file, a tests directory) even if your project has different conventions or already provides those utilities elsewhere.
Over-optimization: The model might attempt to “optimize” the solution by introducing a new, arguably better, structure without considering the cost of integration or the departure from convention.
Example
An Updater is tasked with “adding a new NotificationService to handle email alerts.” The existing project structure has services in src/app/services/ and configurations in src/app/config/.
Instead of adding notification_service.py to src/app/services/ and its config to src/app/config/, the Updater creates:
new_feature/
|-- notification_service_v2.py
|-- notification_config.yaml
`-- templates/
`-- email_template.html
This new new_feature/ directory and its contents are entirely outside the established project structure. The build system doesn’t pick up notification_service_v2.py, notification_config.yaml is in a new format, and templates/ duplicates existing templating mechanisms.
Diagnostics
File Tree Diff: Immediately check for new or deleted directories and files that were not explicitly intended.
Configuration Changes: Review any generated configuration files for new formats or structures.
Dependency Graph Analysis: Use tools to visualize or analyze changes to the project’s dependency graph. New, isolated subgraphs are a red flag.
Human Review: Keeping a human reviewer in the loop is critical here; they can catch these structural deviations before they propagate.
Mitigation
Structural Directives in Mission Objects: Explicitly instruct the generative model on where to make changes. Use Mission Objects like “Create src/app/services/notification_service.py” or “Modify src/app/config/settings.py to add new notification settings.”
Provide Structural Context: Feed the LLM relevant portions of the file system tree, ls -R output, or tree output for the target directories.
Schema Enforcement: For configuration or data structure changes, use JSON Schema or other schema Validators to prevent the introduction of new, unapproved formats (see the sketch after this list).
File Path Validators/Filters: Implement a Validator that explicitly enforces that generated files are within an allowed set of paths or adhere to naming conventions. Automatically reject any generated file paths outside these boundaries.
Refactor, Don’t Reinvent: Emphasize modifying existing components and patterns rather than creating new ones, unless explicitly instructed.
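As one concrete form of schema enforcement, a generated configuration can be checked against a fixed JSON Schema before any Effector applies it. A minimal sketch using the third-party jsonschema package; the schema contents and file layout are illustrative only:

import json
import jsonschema

# Illustrative schema for notification settings; unapproved keys are rejected outright.
NOTIFICATION_SCHEMA = {
    "type": "object",
    "properties": {
        "smtp_host": {"type": "string"},
        "smtp_port": {"type": "integer"},
        "from_address": {"type": "string"},
    },
    "required": ["smtp_host", "smtp_port", "from_address"],
    "additionalProperties": False,
}

def validate_generated_config(path):
    """Return a list of schema violations; an empty list means the config is acceptable."""
    with open(path) as f:
        config = json.load(f)
    validator = jsonschema.Draft7Validator(NOTIFICATION_SCHEMA)
    return [f"{list(error.path)}: {error.message}" for error in validator.iter_errors(config)]

Because additionalProperties is false, a generated config that invents new settings or a new structure fails validation instead of silently entering the codebase.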
Validator False Positives
What it looks like
A Validator incorrectly flags a correct, acceptable, or intended change as an error. This can lead to:
Thrash (as the Updater tries to ‘Refine’ a non-existent problem).
Blocked progress (if the system cannot proceed past a false positive).
Erosion of trust in the automated system.
Wasted human effort in investigating or overriding false alarms.
Why it happens
False positives arise when the Validator’s rules are misaligned with the project’s actual requirements, the intent of the generative step, or the capabilities of the system.
Overly Strict Rules: Validators might be configured with rules that are too prescriptive, not accounting for edge cases, stylistic variations, or legitimate deviations.
Outdated Rules: Validator rules might not have been updated to reflect new coding standards, library versions, or architectural decisions.
Context Blindness: The Validator assesses a change in isolation, without understanding the broader context or the specific mission given to the Updater.
Lack of Nuance: Automated Validators struggle with subjective quality, semantic intent, or context-dependent correctness. They operate on fixed rules.
Faulty Implementation: The Validator itself might have a bug, leading it to misinterpret valid input.
Example
An Updater is tasked with “optimizing string concatenation” in a Python file. It changes concatenations such as "Hello, " + name to f-strings such as f"Hello, {name}" for better readability and performance.
A static analysis Validator is configured with a rule that flags f-strings as “too new” or “not preferred” because the codebase primarily uses older .format() calls, leading to a false positive even though f-strings are standard Python 3.6+ practice.
--- a/src/utils/string_formatter.py
+++ b/src/utils/string_formatter.py
@@ -1,3 +1,3 @@
def format_message(name, event):
- return "Hello, " + name + "! Welcome to " + event + "."
+ return f"Hello, {name}! Welcome to {event}." # Validator flags this line: "Avoid f-strings; use .format() instead."Diagnostics
Human Review of Validator Output: When a Validator fails, have a human carefully review both the change and the Validator’s specific error message. Is the error legitimate?
Rule Traceability: For complex Validators, can you trace the failing error back to a specific rule ID or configuration entry?
Reproducibility: Can you manually make the “failing” change, run the Validator, and get the same result? This helps isolate whether the issue is the change or the Validator.
Compare with Manual Review: If a human reviewer would pass the change, but the Validator fails, it’s a strong indicator of a false positive.
Mitigation
Validator Tuning: Regularly review and adjust Validator rules. Remove rules that are excessively strict, outdated, or prone to false positives.
Allowlisting/Denylisting: For specific cases where a rule is generally useful but problematic for an SDaC agent, consider implementing a mechanism to allowlist specific changes or paths from certain rules.
Context-Aware Validation: Design Validators to be aware of the mission context. A Validator enforcing a ‘new feature’ might tolerate different things than one enforcing a ‘bug Refinement.’
Exclusion Zones: Define areas of the codebase or types of changes where automated Validators should be less strict or entirely disabled for SDaC agents.
Feedback Loop for Validators: Implement a process where false positives reported by engineers lead directly to Validator rule adjustments.
Human Override: For stubborn false positives, ensure there is a clear, auditable process for human engineers to temporarily bypass or suppress a Validator, documenting the reason.
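To keep the override path auditable, each bypass can be captured as a small structured record rather than an ad-hoc flag. A minimal sketch, assuming overrides are appended to a JSON Lines file; every field name here is hypothetical:

import json
import datetime

OVERRIDE_LOG = "validator_overrides.jsonl"  # hypothetical location

def record_override(validator, rule_id, path, reason, approver):
    """Append an auditable record explaining why a Validator result was bypassed."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "validator": validator,
        "rule_id": rule_id,
        "path": path,
        "reason": reason,      # e.g., "False positive: f-strings are accepted project-wide"
        "approver": approver,
    }
    with open(OVERRIDE_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

Reviewing this log periodically also feeds the Validator tuning loop described above: repeated overrides of the same rule are a signal to change the rule, not the code.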
Actionable: What you can do this week
Start a “Failure Log”: Begin a simple text file or spreadsheet where you record any unexpected or problematic behaviors you observe from your SDaC loop. For each entry, describe the observed failure, the generative step involved (Map, Updater, Validator), and your initial hypothesis about the cause (see the sketch at the end of this list).
Inspect Diffs and Traces: For the next few automated changes your system attempts, don’t just look at the final outcome. Examine the full diff and, if available, any intermediate outputs or reasoning from your generative models. Look for patterns related to “slice size” (too big, too small) or unexpected structural changes.
Review Validator Outputs: Pay close attention to the specific error messages from your Validators. If an automated change is rejected, ask yourself: Is this error message truly indicative of a problem, or could it be a false positive given the context of the change?
Refine Your Core Mission Objects: Based on your observations from steps 2 and 3, review the Mission Objects for your Map and Updater. Try adding or refining constraints related to scope, file paths, or expected output structure to explicitly guide the generative models away from the failure modes discussed here. For example, add a line like “Only modify files within src/feature_x/” if you’re seeing overreach.
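If a structured file is easier to keep current than a spreadsheet, each failure-log entry can be a single JSON object capturing the prompts above. A minimal sketch with illustrative values; the file name and fields are hypothetical:

import json

# One entry per observed problem.
entry = {
    "date": "2024-05-30",
    "observed_failure": "slice too large: PR touched 50+ files",
    "step": "Updater",                      # Map, Updater, or Validator
    "hypothesis": "Mission Object lacked file-path boundaries",
    "follow_up": "add 'Only modify files in src/feature_x/' to the mission",
}

with open("failure_log.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")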