A model risk officer at a regional bank sat across from a data science team and asked one question that the room could not answer cleanly. “If this large language model is a model, how do you validate it the way we validate our PD scorecard?” The scorecard was a logistic regression with twelve features, a fixed coefficient table, and an outcomes analysis that ran every quarter. The LLM took free-text input, returned different wording each time it ran, and sat on top of a foundation model the bank had never trained.
The thesis of this guide is simple. SR 11-7’s principles still apply to AI model risk management, but machine learning and especially non-deterministic generative AI break specific assumptions that the 2011 guidance quietly relied on.
What SR 11-7 requires: the three-pillar discipline
SR 11-7, the Federal Reserve guidance paired with OCC Bulletin 2011-12, is the supervisory backbone of model risk management at U.S. banks. It defines a model as a quantitative method that applies statistical, economic, financial, or mathematical theory to produce estimates, and it treats every model as a source of risk to be managed. The discipline rests on three pillars that any examiner expects to see operating.
The first pillar is model development, implementation, and use. SR 11-7 expects developmental evidence that shows the model is sound for its intended purpose, and it codifies the principle of “effective challenge,” meaning critical review by parties with the competence and independence to push back. The second pillar is model validation, which checks conceptual soundness, runs ongoing monitoring, and performs outcomes analysis and benchmarking against alternatives. The third pillar is governance, policies, and controls, anchored by a model inventory and documentation that lets the institution see every model it relies on.
These three pillars are durable. They survive the move to AI. What changes is how hard each pillar is to satisfy when the underlying model stops behaving like a regression.
Where traditional MRM breaks down for ML and GenAI
Traditional MRM carries four unstated assumptions. It assumes deterministic and inspectable logic, stable and well-understood inputs, a fixed model that changes only through controlled re-fits, and reproducible outputs. A logistic regression honors all four. Machine learning and generative AI honor almost none of them.
Machine learning breaks input stability and inspectability first. High-dimensional feature spaces, opaque learned functions, feature drift, and frequent retraining mean the “model” is a moving target whose internal logic resists the line-by-line review that a scorecard invites. You can still test an ML model, but you cannot read it the way you read a coefficient table.
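Feature drift is one of the few items in that list that reduces to a simple number. A common choice in model monitoring is the population stability index (PSI), which compares the binned distribution of a feature at development time with its distribution today. The sketch below is illustrative only: the function name `psi`, the bin proportions, and the `eps` floor are assumptions made for this example, not anything SR 11-7 prescribes.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population stability index between two binned distributions.

    `expected` and `actual` are lists of bin proportions that each sum to 1.
    `eps` floors empty bins so the logarithm stays defined.
    """
    if len(expected) != len(actual):
        raise ValueError("both distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical bin proportions for one feature at development time and today.
development = [0.25, 0.25, 0.25, 0.25]
current = [0.10, 0.20, 0.30, 0.40]

drift = psi(development, current)
print(f"PSI = {drift:.3f}")
```

Identical distributions give a PSI of zero, and the value grows as the two diverge. The alert threshold is a policy choice for the institution, and a drift metric tells you that inputs moved, not why the model's behavior changed.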
Generative AI breaks the rest. LLM outputs are non-deterministic under typical decoding settings, so the same prompt can return different outputs depending on temperature and other sampling parameters. Their input and output space is unbounded natural language, which resists the pre-enumerated test cases that traditional validation depends on. Behavior is emergent and prompt-dependent, hallucination is a failure mode with no analog in a scorecard, and the foundation model is often a third party you did not train and cannot fully inspect.
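To see why temperature and sampling matter, consider a toy version of the last step of text generation: choosing one token from a set of scores. This is a minimal sketch, not any vendor's decoder. The function `sample_token`, the three tokens, and their logits are all hypothetical.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick one token index from `logits` using temperature sampling.

    A temperature of 0 is treated as greedy decoding (always the top logit).
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical next-token scores for three candidate tokens.
tokens = ["approve", "refer", "decline"]
logits = [2.0, 1.5, 0.5]

rng = random.Random()  # unseeded on purpose: each run can differ
greedy = {tokens[sample_token(logits, 0, rng)] for _ in range(200)}
sampled = {tokens[sample_token(logits, 1.0, rng)] for _ in range(200)}

print("temperature 0  :", sorted(greedy))
print("temperature 1.0:", sorted(sampled))
```

In this toy, temperature 0 always returns the top-scoring token, while temperature 1.0 spreads its choices across several tokens from the same scores. A hosted model may not be perfectly reproducible even at temperature 0, so treat reproducibility as something to measure, not assume.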
There is one more break that practitioners feel daily. A GenAI “model” is usually a pipeline, not a single estimator. Retrieval, the system prompt, the model call, and downstream tools all shape the output, so the unit you are validating is a system. Questions about applying SR 11-7 to artificial intelligence almost always come back to that boundary problem.
The boundary problem is where most programs stumble. A validator who scopes the review to the foundation model alone will miss the retrieval index that feeds it, the system prompt that constrains it, and the tools it can call. Two of those components can change weekly without any formal model change request. The practical answer is to define the model as the smallest unit that produces the decision, then validate that whole unit, not the most convenient piece of it.
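One way to make the "whole unit" concrete is to fingerprint every component that shapes the output and compare that fingerprint with the one recorded at validation. The sketch below is an assumption-laden illustration: the component names, version strings, and the `fingerprint` helper are hypothetical, and a real pipeline would hash the actual prompt text, index contents, and tool definitions.

```python
import hashlib
import json

def fingerprint(components):
    """Return a SHA-256 hex digest over a dict of pipeline components.

    Keys are sorted so the digest does not depend on insertion order.
    """
    canonical = json.dumps(components, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical description of the validated system, not just the weights.
validated = {
    "foundation_model": "vendor-model-2024-06",
    "system_prompt": "You are an assistant that summarizes loan files.",
    "retrieval_index": "policy-docs-snapshot-0142",
    "tools": ["lookup_customer", "fetch_statement"],
}

# The same system a week later, after someone refreshed the retrieval index.
deployed = dict(validated, retrieval_index="policy-docs-snapshot-0143")

changed = [k for k in validated if validated[k] != deployed[k]]
if fingerprint(validated) != fingerprint(deployed):
    print("System changed since validation:", changed)
```

The point of the design is that a retrieval refresh or a prompt edit produces a different fingerprint even when the foundation model is untouched, which gives change control something to trigger on.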
What a modern AI MRM program looks like
A modern AI MRM framework extends the three pillars rather than running a parallel program. This matters for both cost and credibility, because examiners and auditors want to see your AI risk inside the model risk discipline they already understand.
On development and use, expand documentation to cover data provenance, prompt and system-prompt versioning, the full pipeline boundary, and explicit intended-use constraints. Preserve effective challenge, but point it at the new surfaces, including the prompt design and the retrieval layer. The “model” you document is the system, not just the weights.
On validation, replace single-point accuracy with distributional and behavioral testing across many inputs. Add adversarial and red-team testing for prompt injection and jailbreaks, bias and fair-lending testing, and drift monitoring that covers both retraining and external-service dependencies. Where inspection fails, run an explainability assessment and add compensating controls, including human-in-the-loop checks for non-deterministic outputs that touch credit, fraud, or customer decisions.
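Distributional testing can be as simple as running each prompt several times and reporting a pass rate with an uncertainty band instead of a single pass or fail. The sketch below assumes a stubbed model call (`stub_generate`) and a toy check (`stays_grounded`). Both are hypothetical stand-ins, and the Wilson score interval is one reasonable choice of interval, not a regulatory requirement.

```python
import math
import random

def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

def behavioral_pass_rate(generate, check, prompts, runs_per_prompt):
    """Run every prompt several times and count outputs that satisfy `check`."""
    passes = trials = 0
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            passes += bool(check(prompt, generate(prompt)))
            trials += 1
    return passes, trials

# Stand-in for a model call: a stub that fails about 5% of the time.
rng = random.Random(7)
def stub_generate(prompt):
    return "UNSUPPORTED CLAIM" if rng.random() < 0.05 else f"Summary of {prompt}"

def stays_grounded(prompt, output):
    return "UNSUPPORTED" not in output

prompts = [f"loan file {i}" for i in range(20)]
passes, trials = behavioral_pass_rate(stub_generate, stays_grounded, prompts, 10)
low, high = wilson_interval(passes, trials)
print(f"pass rate {passes / trials:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

Reporting the interval alongside the rate keeps a validator from reading too much into a small sample, and the same harness can carry adversarial prompts by swapping in a different prompt list and check.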
On governance, build an AI-aware model inventory with risk tiering, named owners, and a defined oversight cadence. Treat non-determinism and hallucination as named, logged risks with owners and mitigations, not as informal caveats. Everything here is additive to existing validation, which keeps the program inside SR 11-7 rather than beside it.
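An AI-aware inventory entry can be sketched as a small record that carries the tier, the owner, the review cadence, and the named risks together. Everything in this example is hypothetical: the field names, the `REVIEW_DAYS` cadence, and the sample entry are assumptions chosen for illustration, and real tiers and cadences are set by policy.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical review cadence per risk tier, in days. Real values are a policy decision.
REVIEW_DAYS = {"high": 90, "medium": 180, "low": 365}

@dataclass
class NamedRisk:
    name: str          # e.g. "hallucination"
    owner: str
    mitigation: str

@dataclass
class InventoryEntry:
    model_id: str
    description: str
    tier: str                      # must be a key of REVIEW_DAYS
    owner: str
    last_review: date
    third_party: bool = False
    risks: list = field(default_factory=list)

    def next_review(self):
        return self.last_review + timedelta(days=REVIEW_DAYS[self.tier])

    def is_overdue(self, today):
        return today > self.next_review()

entry = InventoryEntry(
    model_id="GENAI-001",
    description="LLM pipeline that drafts loan file summaries",
    tier="high",
    owner="Head of Credit Analytics",
    last_review=date(2025, 1, 15),
    third_party=True,
    risks=[
        NamedRisk("non-determinism", "Model Validation", "repeat-run behavioral testing"),
        NamedRisk("hallucination", "Business Owner", "human review before customer impact"),
    ],
)

print(entry.next_review(), entry.is_overdue(date(2025, 6, 1)))
```

Keeping the risks inside the same record as the tier and owner is what turns non-determinism and hallucination from informal caveats into items someone is accountable for.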
The additive framing is not a stylistic choice. It is a survival strategy. Banks that stand up a separate “AI governance” track find that the two programs drift, duplicate effort, and confuse owners about which committee approves what. When AI risk lives inside the existing model risk inventory and validation calendar, the same effective challenge, the same tiering logic, and the same oversight committee carry the load. You add new tests and new documentation fields, not a new bureaucracy.
How NIST AI RMF fills the gaps
NIST AI RMF 1.0, published as NIST AI 100-1 in January 2023, organizes AI risk into four functions: GOVERN, MAP, MEASURE, and MANAGE. It is voluntary and has no certification program. A bank already running SR 11-7 validation typically has most of MEASURE and MANAGE covered, because measurement and ongoing management of model performance are what validation does.
The gap sits in GOVERN and MAP. MAP forces you to characterize context of use, data provenance, and dependencies before you measure anything, which is exactly the AI-specific framing that 2011-era MRM did not anticipate. GOVERN adds the accountability and policy structure that turns a model inventory into an AI inventory with owners and tiering. The table below maps each SR 11-7 pillar to the stress that ML and GenAI place on it, and to the NIST function that helps most.
| SR 11-7 pillar | ML stress | GenAI / LLM stress | NIST AI RMF function that helps |
|---|---|---|---|
| Development and use | Opaque learned functions resist line-by-line review | System prompt, retrieval, and tools blur the model boundary | MAP (context of use, dependencies) |
| Validation | Drift and retraining make a moving target | Non-determinism and hallucination defeat fixed test cases | MEASURE (behavioral, adversarial testing) |
| Governance | Frequent re-fits strain inventory and change control | Third-party foundation models you did not train | GOVERN (accountability, AI inventory, policy) |
This is why teams pair the two. SR 11-7 supplies the supervisory weight, and NIST AI RMF supplies the AI-specific vocabulary. For the deeper mapping, see NIST AI RMF for financial services.
OCC Bulletin 2013-29 and third-party AI models
Most AI models a bank uses are not built in-house. They are foundation models, AI SaaS products, and a model supply chain the institution depends on but does not control. OCC Bulletin 2013-29 was the OCC’s supervisory guidance on third-party relationship risk management, covering due diligence, contracting, and ongoing monitoring. In June 2023 the OCC, Federal Reserve, and FDIC issued updated Interagency Third-Party Risk Management Guidance, which the OCC published as Bulletin 2023-17 and which rescinded 2013-29. The interagency guidance is now the operative reference, though 2013-29 remains a named touchstone.
For AI, this guidance carries real weight. You must perform due diligence on a model you did not build and cannot fully inspect, secure contractual terms on data use and on notice of model changes, and monitor a dependency whose behavior can shift under you without warning. A provider that silently updates a foundation model has changed your model, which is a validation event you need to detect.
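Detecting a silent provider update usually means re-running a fixed set of canary checks on a schedule and comparing the results with the validated baseline. The sketch below assumes you already have pass rates from such runs. The function `detect_vendor_change`, the check names, the version strings, and the 0.05 tolerance are all hypothetical.

```python
def detect_vendor_change(baseline, current, tolerance=0.05):
    """Compare a fresh canary run against the validated baseline.

    Each argument is a dict with a reported `model_version` string and a
    `pass_rates` dict mapping canary check names to pass rates in [0, 1].
    Returns a list of human-readable findings; an empty list means no change seen.
    """
    findings = []
    if current["model_version"] != baseline["model_version"]:
        findings.append(
            f"reported version changed: {baseline['model_version']} -> {current['model_version']}"
        )
    for check, base_rate in baseline["pass_rates"].items():
        new_rate = current["pass_rates"].get(check)
        if new_rate is None:
            findings.append(f"canary check missing from current run: {check}")
        elif abs(new_rate - base_rate) > tolerance:
            findings.append(f"{check}: pass rate moved {base_rate:.2f} -> {new_rate:.2f}")
    return findings

# Hypothetical numbers: the vendor reports the same version, but behavior moved.
baseline = {"model_version": "vendor-model-2024-06",
            "pass_rates": {"refuses_pii_request": 0.99, "cites_source": 0.95}}
current = {"model_version": "vendor-model-2024-06",
           "pass_rates": {"refuses_pii_request": 0.98, "cites_source": 0.81}}

for finding in detect_vendor_change(baseline, current):
    print("validation event:", finding)
```

Checking behavior as well as the reported version matters because a provider can change what sits behind a version label. Contract terms that require notice of model changes complement this kind of monitoring. They do not replace it.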
This ties directly to two NIST functions. MAP is how you know the dependency exists and what it does in your pipeline, and GOVERN is the policy that says no AI vendor goes live without vetting and ongoing monitoring. A Governance Readiness Snapshot is a fast way to find the third-party AI dependencies that have slipped past your existing vendor process. For the institution-level view, see our work on banking AI governance.
What this guide is / What it is not
What it is: a practitioner guide for applying SR 11-7 model risk discipline to machine learning and generative AI, and for using NIST AI RMF to fill the AI-specific gaps. What it is not: legal or regulatory advice, a certification, or a guarantee of any exam or audit outcome. DSE prepares organizations for audit and examination and strengthens their examiner-facing posture. It does not certify. NIST AI RMF itself is voluntary and has no certification program.
FAQ
Does SR 11-7 apply to AI and machine learning models?
Yes. SR 11-7 defines a model broadly as a quantitative method that applies statistical, economic, financial, or mathematical theory to produce estimates, and that definition covers machine learning and generative AI systems used for estimates or decisions. Its three pillars of development, validation, and governance still apply, though ML and GenAI break specific assumptions the guidance relied on.
How is validating a generative AI model different from validating a traditional model?
A traditional model is deterministic and produces reproducible outputs you can test with fixed cases. A generative AI model is non-deterministic, accepts unbounded natural-language input, and can hallucinate, so validation shifts from single-point accuracy to distributional and behavioral testing, adversarial and red-team testing, drift monitoring, and human-in-the-loop checks for outputs that touch decisions.
What is an AI model risk management (AI MRM) framework?
An AI MRM framework extends SR 11-7’s three pillars to AI rather than running a parallel program. It expands development documentation to cover data provenance, prompts, and the full pipeline boundary, adds behavioral and adversarial validation, and builds an AI-aware model inventory that treats non-determinism and hallucination as named, owned risks.
How does NIST AI RMF relate to SR 11-7 for AI?
SR 11-7 validation already covers most of the MEASURE and MANAGE functions of NIST AI RMF 1.0. The gap sits in GOVERN and MAP, which add the AI-specific structure for context of use, data provenance, dependencies, and accountability that 2011-era MRM did not anticipate. NIST AI RMF is voluntary and has no certification program.
How does SR 11-7 treat third-party and foundation AI models?
SR 11-7 expects third-party models to be governed like internal ones. The due diligence, contracting, and monitoring expectations come from third-party risk guidance: OCC Bulletin 2013-29 and the June 2023 Interagency Third-Party Risk Management Guidance that superseded it. For AI this means vetting foundation models you did not train, securing contract terms on data use and model changes, and monitoring a dependency whose behavior can shift under you without notice.
The Bottom Line
SR 11-7 did not become obsolete when banks adopted AI. Its three-pillar discipline of development, validation, and governance is exactly the right scaffolding for AI model risk management, because it forces ownership, challenge, and inventory on every model that drives a decision. What changed is the difficulty. Machine learning breaks inspectability and input stability, and generative AI breaks determinism, test enumeration, and the assumption that you trained the model at all.
The practical move is to extend, not replace. Keep effective challenge and the model inventory, but widen documentation to the full pipeline, swap single-point accuracy for behavioral and adversarial testing, and name non-determinism and hallucination as logged risks with owners. Use NIST AI RMF’s GOVERN and MAP functions to fill the gaps SR 11-7 left, and use the third-party risk guidance, OCC Bulletin 2013-29 and the 2023 interagency guidance that superseded it, to govern the foundation models you depend on but did not build. That is a complete AI MRM framework, and it lives inside the model risk discipline your institution already runs.
Want to see where your AI model risk program stands against SR 11-7, NIST AI RMF, and third-party guidance? Start with a Governance Readiness Snapshot to surface the gaps, then work through our AI governance checklist to build the validation, inventory, and oversight evidence that strengthens your examiner-facing posture.
Key facts
- SR 11-7, the joint Federal Reserve and OCC supervisory guidance on model risk management issued in April 2011 and published by the OCC as Bulletin 2011-12, defines a model as a quantitative method that applies statistical, economic, financial, or mathematical theory to produce estimates, and organizes model risk around development, validation, and governance (DSE, 2026).
- SR 11-7 was written for deterministic, inspectable models, so it is silent on the specific failure modes of non-deterministic generative AI, including hallucination, prompt-dependent behavior, and third-party foundation models the bank did not train (DSE, 2026).