
The five places SR 11-7 breaks down on AI agents

SR 11-7's three pillars survive the translation to generative AI. The specific workflows don't. Here are the five places the 2011 guidance strains, with the fix for each one.

2026-05-12 · 5 min read · Ashish K. Saxena

Your bank's model validators trained on decision trees, logistic regression and the occasional gradient-boosted ensemble. SR 11-7 was written with that toolkit in mind.

Now your business unit ships a customer-facing agent built on GPT-4o or Claude Opus. Same SR 11-7 expectations apply. Same examiner is going to read the documentation pack next year.

The three pillars of SR 11-7 (development, validation, ongoing monitoring) translate to AI agents without modification. The breakdowns happen in five specific places. Here they are with the fix for each one.

I. Non-determinism

Traditional models produce identical outputs for identical inputs. You can replay a decision exactly six months later and get the same answer.

That property is what makes back-testing possible. LLMs don't have it.

Even with temperature pinned to zero, identical prompts can produce different outputs across model versions, infrastructure deployments and minor input perturbations you didn't notice.

What breaks: reproducibility-based validation. You can no longer say "we tested the model and it produced output X on input Y" with full confidence.

The fix: shift from "model produces deterministic decisions" to "model produces decisions within a behavioral envelope, monitored for drift." Validate the envelope, not the exact output. The envelope is the set of outputs the agent should produce on a representative input distribution.
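As a rough sketch of what validating an envelope can look like in practice (the call_agent function, the allowed decision set and the agreement threshold below are assumptions for illustration, not part of SR 11-7 or any provider's API), a harness can run each representative input several times and flag cases that either leave the allowed set or disagree with themselves too often:

```python
import collections

# Hypothetical names: call_agent() wraps your production agent; the envelope
# here is an allowed set of decision categories plus a minimum agreement rate
# across repeated runs. Both are placeholders to adapt to your use case.
ALLOWED_DECISIONS = {"approve", "decline", "refer_to_human"}
MIN_AGREEMENT = 0.90   # fraction of runs that must agree on the modal decision
N_RUNS = 20

def validate_envelope(call_agent, representative_inputs):
    """Check that repeated runs stay inside the behavioral envelope."""
    failures = []
    for case in representative_inputs:
        decisions = [call_agent(case) for _ in range(N_RUNS)]
        # Every output must land inside the allowed set...
        if any(d not in ALLOWED_DECISIONS for d in decisions):
            failures.append((case, "out-of-envelope decision", decisions))
            continue
        # ...and the runs must mostly agree, even if they are not identical.
        modal, count = collections.Counter(decisions).most_common(1)[0]
        if count / N_RUNS < MIN_AGREEMENT:
            failures.append((case, "unstable decision", decisions))
    return failures
```

The point of the sketch is the shape of the evidence: a pass/fail result per representative input, re-runnable against every new model version, rather than a single replayed output.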

II. Prompt-as-feature

In a traditional model, your features are well-defined numerical or categorical inputs. Income, credit utilization, months on book.

Each feature has a distribution you monitor and a coefficient you interpret.

In an LLM agent, the system prompt is itself a feature. The user input is another.

They don't fit feature-matrix abstractions because they're free-form text.

What breaks: feature importance analysis, feature-level monitoring and ECOA disparate impact analysis at the individual feature level. You can't tell an examiner "we observed the income feature shifted 8% over the quarter" because there is no discrete income feature once the input is free-form text.

The fix: treat the prompt as a model component. Version-control it. Hash it.

Pin a specific hash to each production deployment. Track decision outputs against subject demographics rather than against individual prompt features. That gets you disparate impact analysis at the population level even when feature-level analysis is impossible.
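A minimal sketch of the hashing and pinning step, assuming the system prompt lives in a version-controlled file and the deployment manifest records the hash validation signed off on (paths and field names below are illustrative):

```python
import hashlib
import json
import pathlib

# Hypothetical layout: prompt file under version control, deployment manifest
# carrying the hash that the validation team approved.
PROMPT_PATH = pathlib.Path("prompts/credit_agent_system.txt")
MANIFEST_PATH = pathlib.Path("deploy/manifest.json")

def prompt_hash(path: pathlib.Path) -> str:
    """SHA-256 of the exact prompt bytes shipped to production."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def assert_pinned_prompt():
    """Refuse to start if the live prompt differs from the validated hash."""
    manifest = json.loads(MANIFEST_PATH.read_text())
    live = prompt_hash(PROMPT_PATH)
    if live != manifest["validated_prompt_sha256"]:
        raise RuntimeError(
            f"Prompt hash {live[:12]} does not match validated hash "
            f"{manifest['validated_prompt_sha256'][:12]}; re-validation required."
        )
```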

III. Tool use and emergent behavior

Modern AI agents call tools. APIs, databases, other models, vector stores.

The agent decides which tool to call, with what arguments, in what order. Here's the awkward part for SR 11-7: the same agent on the same input might use different tools on different days as the model's reasoning shifts.

Provider updates can change tool-selection behavior. So can a tiny rewording of the system prompt.

What breaks: pre-deployment behavioral specification. You can't enumerate every execution path the agent might take in production. The validation team can test the most likely paths, but "most likely" is doing heavy lifting in that sentence.

The fix: instrument the actual production execution. Capture every tool call with arguments and results.

Make the tool-call trace itself an audit artifact. When an examiner asks "how did this agent arrive at this decision?", you produce the trace. Validation shifts from predicting behavior to making behavior auditable after the fact.
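One lightweight way to capture that trace, sketched below with hypothetical tool and file names, is to wrap each tool in a decorator that appends arguments, results and errors to an append-only log:

```python
import functools
import json
import time
import uuid

# Hypothetical destination; in practice an append-only, access-controlled store.
TRACE_FILE = "tool_call_trace.jsonl"

def traced_tool(tool_name):
    """Wrap a tool so every call is written to the audit trace."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {
                "trace_id": str(uuid.uuid4()),
                "tool": tool_name,
                "arguments": {"args": args, "kwargs": kwargs},
                "timestamp": time.time(),
            }
            try:
                result = fn(*args, **kwargs)
                record["result"] = result
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Write the record whether the call succeeded or failed.
                with open(TRACE_FILE, "a") as f:
                    f.write(json.dumps(record, default=str) + "\n")
        return wrapper
    return decorator

@traced_tool("credit_bureau_lookup")   # hypothetical tool name
def credit_bureau_lookup(customer_id: str) -> dict:
    ...
```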

IV. Model provider opacity

When you train your own model, you can document training data, hyperparameters and validation methodology. You wrote it. The model card is your model card.

When you use GPT-4o, Claude Opus or a Bedrock-hosted model, you don't get any of that. You get a provider's representations and a usage-rights agreement.

The training data is proprietary. The fine-tuning details are proprietary. Reproducibility experiments end at the API boundary.

What breaks: documentation completeness. SR 11-7 anticipates a full model card. You cannot produce one for the LLM layer of the stack.

The fix: maintain a clear delineation between "your model" and "vendor model." Your model is the prompts, the tools, the retrieval architecture and the post-processing logic.

That part is fully documentable. It's also, frankly, where your actual proprietary intellectual property lives.

The vendor model is what's behind the API. Document each separately. Treat vendor model version changes as model changes requiring re-validation.
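A rough sketch of what the two-layer record might look like; the field names are illustrative, not a regulatory template:

```python
# Minimal two-layer documentation record: your components versus the vendor's.
model_documentation = {
    "our_model": {
        "system_prompt_sha256": "<pinned per deployment>",
        "tools": ["credit_bureau_lookup", "policy_retrieval"],   # hypothetical
        "retrieval_architecture": "RAG over internal policy corpus",
        "post_processing": "output schema validation, human review routing",
        "owner": "agent platform team / model risk",
    },
    "vendor_model": {
        "provider": "<OpenAI / Anthropic / Bedrock-hosted>",
        "model_version": "<exact pinned version identifier>",
        "documentation": "provider model card and usage-rights agreement",
        "known_limits": "training data and fine-tuning details are proprietary",
    },
}
```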

V. Continuous capability evolution

OpenAI, Anthropic and AWS Bedrock all ship updates frequently. They sometimes ship updates that meaningfully change agent behavior on identical inputs.

Anthropic released a significant Claude update in late 2025 that shifted refusal behavior across regulated industries. OpenAI did something similar in early 2026.

What breaks: "the model has been validated" becomes ambiguous. Validated against which version? This week's? Or the one that ships next Tuesday?

The fix: tie validation evidence to specific model versions. Make version pinning a first-class governance control.

Require re-validation before allowing provider auto-upgrades. Negotiate behavior-change notification commitments into your provider contract.

Avoid sole dependency on a single provider for your highest-risk agents.
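A minimal sketch of version pinning as a control, assuming a registry file that maps each exact provider version to its validation evidence (all names and paths below are illustrative):

```python
import json
import pathlib

# Hypothetical registry: each entry ties validation evidence to one exact
# provider model version.
VALIDATION_REGISTRY = pathlib.Path("governance/validated_versions.json")

PINNED_MODEL = "<exact provider version string>"   # never "latest"

def is_validated(model_version: str) -> bool:
    """True only if this exact version has validation evidence on file."""
    registry = json.loads(VALIDATION_REGISTRY.read_text())
    entry = registry.get(model_version)
    return bool(entry and entry.get("status") == "validated")

def guard_deployment(requested_version: str = PINNED_MODEL):
    """Block deployment of any model version without validation evidence."""
    if not is_validated(requested_version):
        raise RuntimeError(
            f"Model version {requested_version!r} has no validation evidence "
            "on file; re-validate before deploying or accepting an upgrade."
        )
```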

What this means in practice

Each of these five breakdowns is fixable. None of them are fixable by treating an AI agent as if it were a credit scorecard.

If your model risk function is still using the 2018 SR 11-7 template, you'll find the gaps quickly. The template asks for things AI agents can't produce.

The fix isn't to push back on the template. The fix is to extend the documentation pack with the specific evidence types these five breakdowns require: behavioral envelope tests, prompt hashes, tool-call traces, two-layer model documentation, version-pinned validation.

Banks that build this in 2026 spend 2028 in calm. Banks that wait spend 2027 retrofitting under examiner pressure.

The full framework with templates, examples and an implementation roadmap is in the SR 11-7 whitepaper. The post before this one introduces the four-step TRiSM gauntlet that this fits inside.

If your gauntlet looks different in some specific way, that's the kind of conversation the founder takes by email.

From the founder

If this resonates, talk to the founder directly.

Caventia is taking ten design partners in 2026. Conversations are with Ashish K. Saxena, not a sales team. Thirty minutes, your specific regulator gap, no purchase obligation.