Shifting Left with Agentic Development

Mechanical Guardrails for AI-Generated Code

In my previous article, I described five iterations of building the same application with Claude Code and arrived at a principle that became the foundation for everything that followed: every rule should have a mechanical enforcement mechanism. If enforcement relies on the agent "remembering" or "knowing," it will eventually fail.

That article ended with an enforcement hierarchy — prose rules at the bottom, mechanical tooling at the top — and a claim that moving rules up that hierarchy was the key to consistent output. This article is about what happened when I took that seriously.

The Problem with "Just Tell It"

When you work with an AI coding agent, there's a tempting pattern: write a rule in your constitution or project docs, and expect the agent to follow it. "Always use the configuration service instead of annotation-based injection." "Never add a copyleft-licensed dependency." "Run translations before committing new resource keys."

These rules are clear. They're well-documented. And they get violated constantly.

Not because the agent is incapable. It can follow any of these rules perfectly when you remind it. The problem is the same one you'd have with a brilliant new hire who's read the company wiki once and retained about 70% of it. They'll get most things right, but the tribal knowledge — the stuff that only surfaces when you're about to do the wrong thing — slips through. Every time.

The failure mode isn't dramatic. You don't get a crash or an error. You get code that looks right, passes tests, and introduces a subtle violation that won't surface until someone who knows better reviews it. If you're using AI agents to scale your development capacity, "someone who knows better" is exactly the bottleneck you were trying to eliminate.

What Linters Can and Can't Do

The first line of defense is obvious: configure your linters. ESLint, Checkstyle, Spotless, Prettier — these tools are the low-hanging fruit of mechanical enforcement. If a rule can be expressed as a linter configuration, it should be. No discussion, no exceptions. A rule that exists only as prose when a linter plugin could enforce it is a rule that's waiting to be broken.

I spent time auditing project constitutions and development rules, classifying each one: Can a linter enforce this? Partially? Not at all? The results were illuminating. About 40% of rules mapped cleanly to existing linter configurations. Another 20% could be partially covered. The remaining 40% fell into a gap that no off-the-shelf tool could reach.

These weren't exotic requirements. They were things like:

  • "When using pattern A alongside pattern B, you must also include guard C" — a multi-condition rule that spans different parts of the same file
  • "New resource keys must have translations registered for all supported locales before they can be committed" — a cross-file consistency check against a data source
  • "Dependencies added to build files must not use copyleft licenses" — a domain-specific policy applied to a specific file type at commit time
  • "When tests fail, try a clean build before diagnosing the failure" — a behavioral pattern that should be reinforced, not mandated

Linters are great at syntax and style. They can't handle multi-condition patterns. They're useless at cross-file domain logic and behavioral coaching. I needed something that could sit in that gap.

Claude Code Hooks: The Mechanism

Claude Code has a hook system that fires scripts at specific points in the agent's tool-use lifecycle. A hook can intercept any tool call — file reads, shell commands, grep searches — either before the tool executes (PreToolUse) or after (PostToolUse). The hook receives the full context of the tool call as JSON on stdin: what tool is being invoked, what arguments it's being called with, and for post-hooks, what the tool returned.

The critical capability: a PreToolUse hook can block the tool call entirely. It prints a JSON response with "decision": "block" and a reason, and the agent never executes the command. Instead, it receives the violation message as feedback and has to fix the problem before retrying.
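
In script form, that contract is small. A minimal skeleton of a blocking hook might look like this (a sketch, not the actual hook from this system; the tool_input and command field names are assumptions about the stdin payload):

#!/usr/bin/env python3
# Minimal PreToolUse hook skeleton. Payload field names are assumptions
# about the stdin JSON described above and may differ across versions.
import json
import sys

payload = json.load(sys.stdin)                        # full tool-call context
command = payload.get("tool_input", {}).get("command", "")

# Only inspect commits; let every other Bash call through untouched.
if "git commit" not in command:
    sys.exit(0)

violations = []   # a real hook fills this from the staged diff

if violations:
    # Emitting this JSON blocks the tool call; the reason is fed back
    # to the agent as the message it must resolve before retrying.
    print(json.dumps({"decision": "block", "reason": "\n".join(violations)}))
sys.exit(0)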

This is the shift-left moment. Instead of catching a violation in code review, or worse, in production, you catch it at the exact instant the agent tries to commit the code. The agent can't proceed until the violation is resolved. It's not a suggestion. It's a wall.

The Architecture

I built a shared engine — a single Python module that handles the mechanical parts: parsing the staged git diff, filtering files by path patterns, matching regex patterns against added lines, and formatting violation output. Each domain concern gets its own hook script that declares a list of rules and calls the engine.

A rule is a small data structure:

Rule(
    id="example-1",
    file_pattern=r"\.java$",                   # which files to check
    pattern=r"some_pattern_to_catch",          # what to flag
    message="Brief explanation of the fix",    # what the agent sees
    description="Why this rule exists",        # human context
    exempt_paths=("path/to/exceptions/",),     # known exemptions
)

The engine only examines lines that were added in the current diff — it doesn't re-litigate existing code. This is important. Retrofitting rules onto a large codebase is a separate initiative. The hooks enforce going forward.
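
A sketch of that core, assuming the engine shells out to git for the staged diff and that Rule is the structure shown above (module and function names are illustrative):

import re
import subprocess
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    id: str
    file_pattern: str
    pattern: str
    message: str
    description: str = ""
    exempt_paths: tuple = ()

def added_lines(diff_text):
    """Yield (path, line_number, text) for each line added in the diff."""
    path, line_no = None, 0
    for raw in diff_text.splitlines():
        if raw.startswith("+++ b/"):
            path = raw[6:]
        elif raw.startswith("@@"):
            # Hunk header like "@@ -10,2 +12,4 @@": capture new-file start line.
            line_no = int(re.search(r"\+(\d+)", raw).group(1))
        elif raw.startswith("+") and not raw.startswith("+++"):
            yield path, line_no, raw[1:]
            line_no += 1

def check(rules):
    """Return formatted violations for rules matching added lines only."""
    diff = subprocess.run(["git", "diff", "--cached", "--unified=0"],
                          capture_output=True, text=True).stdout
    return [f"[{rule.id}] {path}:{n}: {rule.message}"
            for path, n, text in added_lines(diff)
            for rule in rules
            if re.search(rule.file_pattern, path)
            and not path.startswith(tuple(rule.exempt_paths))
            and re.search(rule.pattern, text)]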

Each domain gets its own hook file: one for configuration patterns, one for licensing, one for data conventions, one for service layer rules, and so on. They're organized by the bounded context they protect, not by the type of check they perform. This maps naturally to how teams think about their code — "these are the rules for the data layer" rather than "these are all the regex checks."
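
Under that organization, a domain hook file reduces to a rule list plus a call into the shared engine. A hypothetical service-layer hook, reusing the injection rule from earlier (the hook_engine module name is an assumption):

#!/usr/bin/env python3
# hooks/pre-commit-service-rules.py: one bounded context, one hook file.
import json
import sys
from hook_engine import Rule, check   # the shared engine sketched above

RULES = [
    Rule(
        id="service-no-field-injection",
        file_pattern=r"service/.*\.java$",
        pattern=r"@Autowired\s+private",
        message="Use the configuration service, not annotation-based injection.",
        description="Field injection bypasses the configuration service.",
        exempt_paths=("service/legacy/",),
    ),
]

if __name__ == "__main__":
    command = json.load(sys.stdin).get("tool_input", {}).get("command", "")
    violations = check(RULES) if "git commit" in command else []
    if violations:
        print(json.dumps({"decision": "block",
                          "reason": "\n".join(violations)}))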

The hooks are registered in the Claude Code settings file, wired to fire on specific tool patterns:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "python3 hooks/pre-commit-service-rules.py"
          }
        ]
      }
    ]
  }
}

When the agent runs git commit, every registered pre-commit hook fires. The matcher only selects the Bash tool, so each script inspects the command on stdin and exits silently unless it is a git commit. Each one independently parses the staged diff and checks its rules. If any hook finds a violation, the commit is blocked and the agent gets a clear error message with the rule ID, file, line number, and what to fix.

Three Tiers of Enforcement

Not every rule deserves the same severity. I landed on three tiers that match different enforcement needs:

Tier 1: Blocking pre-commit hooks. These are the non-negotiable rules. Copyleft license violations. Missing compatibility guards. Configuration anti-patterns that will cause runtime failures. The commit doesn't happen until these are resolved. The agent gets a clear violation message, fixes the code, and tries again.

Tier 2: Non-blocking post-commit warnings. These fire after a successful commit and inject feedback into the conversation. "You committed source files but no tests — make sure coverage exists." "This directory has source files but no README." They don't stop progress, but they create awareness. The agent sees the warning and can choose to act on it in the same session.

Tier 3: Behavioral nudges. These aren't tied to commits at all. They fire on everyday tool use and suggest better approaches. When the agent uses grep to search for what looks like a Java symbol name, a hook suggests using LSP instead — workspace symbol search, find references, go to definition. When a test run fails without a preceding clean build, a hook reminds the agent that stale artifacts cause misleading failures. These are coaching, not enforcement.
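
As an illustration, a clean-build nudge might be a PostToolUse script shaped like this (a sketch; the tool_response field name and the failure heuristics are assumptions, and for a post-hook the "block" decision can't undo the command that already ran; it only feeds the reason back as advice):

#!/usr/bin/env python3
# Tier 3 nudge sketch: after a failed test run, suggest a clean build first.
import json
import sys

payload = json.load(sys.stdin)
command = payload.get("tool_input", {}).get("command", "")
result = str(payload.get("tool_response", ""))

ran_tests = "test" in command and ("gradlew" in command or "mvn" in command)
failed = "BUILD FAILED" in result or "FAILURE" in result

if ran_tests and failed and "clean" not in command:
    # Advisory only: the test run already happened; this just injects
    # the suggestion into the conversation.
    print(json.dumps({
        "decision": "block",
        "reason": "Tests failed without a preceding clean build. Run a clean "
                  "build first; stale artifacts are a common source of "
                  "phantom failures.",
    }))
sys.exit(0)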

The three tiers mirror how you'd train a human developer. Some things are hard stops — you can't merge code with a license violation. Some things are reminders — did you forget to write tests? And some things are teaching moments — there's a better tool for that.

What This Looks Like in Practice

Here's a real scenario. The agent is implementing a feature that touches a shared screen component. It writes the code, runs the tests, everything passes. It stages the files and runs git commit.

The pre-commit hook fires. It examines the staged diff and finds that the file uses two patterns together — a UI component and a title-setting method — without a required compatibility guard. The commit is blocked:

Pattern violation — fix before committing.

  [rule-1] src/screens/OrderScreen.java:47:
    setTitle() used alongside DialogHeaderPart but missing compatibility guard.
    Shared screens must wrap setTitle() with version check.

The agent reads the violation, understands it, adds the guard, re-stages, and commits successfully. Total time added: about 15 seconds. Without the hook, this would have been caught in code review — maybe — after the PR was already up and a human reviewer happened to know about this particular compatibility requirement.

Here's another scenario. The agent adds a new dependency to a build file. The licensing hook fires, identifies it as a GPL-licensed package, and blocks the commit with a suggestion for an approved alternative. The agent swaps the dependency and proceeds. No human had to remember that mysql-connector-java is GPL but mysql-connector-j is EPL.

And a behavioral example: the agent runs tests, they fail, and the clean-build nudge fires. Instead of spending ten minutes diagnosing a phantom failure caused by stale compiled artifacts, the agent runs a clean build first. This one isn't blocking — it's coaching. But it saves real time by short-circuiting a common debugging dead end.

Beyond Regex: Cross-File Consistency

Some of the most valuable hooks don't use the shared regex engine at all. They implement custom logic for checks that span multiple files or require reading project state.

One hook checks that when you add new internationalization keys to a properties file, those keys have corresponding translations registered in a central data file for every supported locale. It reads the locale configuration, builds a set of what's already translated, diffs the staged properties file for new keys, and blocks the commit if any key is missing translations. This prevents a class of runtime bug — a user switching to an unsupported locale and seeing raw key names — that no linter could catch.
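
The shape of that check, with the data source reduced to a hypothetical translations.json keyed by locale (the real project's layout differs in the details):

import json
import re
import subprocess

SUPPORTED_LOCALES = ("de", "fr", "ja")   # read from the locale config in reality

def staged_new_keys(properties_file):
    """Resource keys on lines added to the staged properties file."""
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0", "--", properties_file],
        capture_output=True, text=True).stdout
    return {m.group(1) for m in re.finditer(r"^\+([\w.]+)\s*=", diff, re.M)}

def translation_gaps(new_keys, translations_path):
    """Map each new key to the locales still missing a translation."""
    registered = json.load(open(translations_path))   # {locale: {key: text}}
    gaps = {}
    for key in sorted(new_keys):
        missing = [loc for loc in SUPPORTED_LOCALES
                   if key not in registered.get(loc, {})]
        if missing:
            gaps[key] = missing
    return gaps                                       # non-empty means block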

This is the kind of rule that would traditionally live in a PR checklist: "Did you add translations for all locales?" Checklists are prose. Prose gets skipped. A mechanical check doesn't.

The Economics of Enforcement

There's a cost argument here that goes beyond just "catching bugs earlier."

In the previous article, I tracked the token cost of each iteration. The expensive part wasn't the initial code generation — it was the rework cycle. Agent writes code, reviewer catches violation, agent fixes, reviewer re-reviews. Each round trip costs tokens, time, and context window space.

Pre-commit hooks collapse that cycle to zero round trips for the classes of violations they cover. The agent never produces a PR with a licensing violation or a missing compatibility guard. The human reviewer never has to flag it. The fix happens at the moment of creation, not after the fact.

More importantly, these hooks scale. Adding a new rule is a ten-minute task: write the regex, add a test case, register the hook. That rule then applies to every future commit, in every branch, for every agent session. The cost of enforcement is paid once. The cost of not enforcing is paid on every violation that slips through.
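
As a concrete instance of that loop, the licensing rule from the earlier scenario plus its test might look like this (a sketch; the Gradle coordinates and version string are illustrative):

import re
from hook_engine import Rule   # the engine sketched earlier

GPL_CONNECTOR = Rule(
    id="license-no-gpl-mysql",
    file_pattern=r"build\.gradle$",
    pattern=r"mysql-connector-java",   # the GPL artifact; -j is the alternative
    message="mysql-connector-java is GPL; use mysql-connector-j instead.",
    description="Copyleft-licensed dependencies must not be committed.",
)

def test_gpl_dependency_is_flagged():
    line = 'implementation "mysql:mysql-connector-java:8.0.33"'
    assert re.search(GPL_CONNECTOR.file_pattern, "app/build.gradle")
    assert re.search(GPL_CONNECTOR.pattern, line)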

Lessons Learned (The Meta-Lesson)

Building this system reinforced the central thesis from the first article, but added nuance.

The enforcement hierarchy is real, but it has more levels than I initially described. I originally thought of it as a binary: prose rules versus mechanical enforcement. In practice, there's a spectrum. Blocking hooks are at the top. Non-blocking warnings are in the middle. Behavioral nudges are below that. Prose rules are at the bottom. Each level up buys you more consistency, but also more rigidity. Not everything belongs at the top.

Rules should encode "why," not just "what." Every rule in the system carries a description field explaining the reasoning. When the agent hits a violation, it doesn't just see "don't do X" — it sees why X is prohibited. This context helps the agent make better fixes and, occasionally, helps it recognize when a rule doesn't apply to its specific situation. A rule without a reason is a rule that gets cargo-culted.

Exemptions are first-class citizens. Almost every rule has known exceptions — paths where the pattern is intentionally used, file types where it doesn't apply, modules that predate the convention. The rule system supports exempt paths and exempt suffixes explicitly. This prevents the hooks from becoming a source of false positives that train the agent (and the developer) to ignore them.
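
In the engine sketched earlier, that support can be a single guard consulted before any pattern match (exempt_suffixes is the hypothetical suffix analogue of the exempt_paths field shown in the Rule structure):

def is_exempt(path, rule):
    """True when the rule explicitly does not apply to this file."""
    return (path.startswith(tuple(rule.exempt_paths))
            or path.endswith(tuple(getattr(rule, "exempt_suffixes", ()))))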

Post-hooks are underrated. I initially focused almost entirely on blocking hooks. But the behavioral nudges — "use LSP instead of grep for symbol search," "try a clean build before debugging test failures" — turned out to be surprisingly high-value. They don't prevent mistakes; they prevent wasted time. Over dozens of sessions, the accumulated time savings from the agent not chasing phantom failures or doing inefficient searches add up significantly.

The diff-only principle matters. Only checking added lines, not the entire file, is critical for adoption. You can introduce these hooks to a legacy codebase without triggering thousands of violations on existing code. The hooks enforce the new standard going forward, which means you can adopt them incrementally without a massive remediation effort.

Where This Is Heading

The system I've described is essentially a pattern-matching layer between the AI agent and the git repository. It's regex-based, diff-scoped, and organized by domain. It catches a meaningful class of violations that linters can't reach, and it does so at the earliest possible moment.

But there's an obvious next frontier: hooks that don't just match patterns but understand semantics. A hook that can verify an API contract is honored across service boundaries. A hook that checks whether a database migration is backwards-compatible with the previous schema. A hook that validates that a new endpoint follows the project's authentication pattern, not just syntactically but structurally.

Some of this is possible today with more sophisticated hook implementations — AST parsing instead of regex, for example. Some of it will require hooks that can call out to other tools or services. The architecture supports it; the hook system doesn't care what your script does internally, only what decision it returns.
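
To make that concrete for a Python codebase (purely illustrative; Java hooks would need a real parser), a structural check built on the stdlib ast module fits the same hook contract; only the analysis inside the script gets deeper:

import ast

def unauthenticated_routes(source):
    """Route handlers missing an auth decorator (decorator names hypothetical)."""
    def name(dec):
        # "@requires_auth" is a Name; "@app.route(...)" is a Call on an Attribute.
        node = dec.func if isinstance(dec, ast.Call) else dec
        return node.attr if isinstance(node, ast.Attribute) else getattr(node, "id", "")

    offenders = []
    for fn in ast.walk(ast.parse(source)):
        if isinstance(fn, ast.FunctionDef):
            decorators = {name(d) for d in fn.decorator_list}
            if "route" in decorators and "requires_auth" not in decorators:
                offenders.append((fn.name, fn.lineno))
    return offenders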

The broader point is this: as AI agents take on more of the implementation work, the value shifts from writing code to defining and enforcing the constraints that code must satisfy. The developers who do this well won't be the ones who write the most detailed constitutions. They'll be the ones who build the most effective mechanical enforcement around their constraints. Rules are wishes. Hooks are walls.
