shipping production AI · since 2026 NAICS 541330 / 541511 / 541512 / 541519  ·  CMMC-aware
Refinery Report / AI Security / post · thcare
AI SecuritySelf-Hosted AIPrivate AIHIPAA

Self-Hosted AI Deployment: Security and Compliance Guide for Finserv and Healthcare

A practitioner guide to secure self-hosted AI deployment for finserv and healthcare: the architecture, controls, and compliance-evidence mapping to SOC 2, HIPAA, GLBA, and CMMC.

D
By the DSE practice team
Operator-led practice · how we research & review
June 21, 2026
20 min · 4,324 words

By the DSE practice team · published June 21, 2026 · reviewed June 21, 2026

The fastest way to fail a security review of an AI system is to discover, during the review, that you cannot say who asked the model what, what it answered, or what data it could reach. For a bank, a lender, a healthcare operator, or a defense supplier, that is the whole problem in one sentence. The productivity of a large language model is real, but so is the obligation to keep nonpublic personal information, protected health information, and controlled unclassified information inside a boundary you actually govern. A self-hosted or private AI deployment is how regulated organizations get the first without surrendering the second.

This guide is the practitioner version of that argument. It covers what a secure self-hosted AI deployment actually consists of, the controls that make it defensible, how each control becomes compliance evidence under SOC 2, the HIPAA Security Rule, GLBA, and CMMC, and where the framework conversation goes wrong. We will keep the scope honest throughout: this is readiness and engineering work, not certification, and no architecture guarantees passing a specific examination. What it does is let you walk into one with your work in order.

Why regulated organizations self-host AI

The case for self-hosting is not about distrust of model providers. It is about the control boundary. When you send a prompt to a public API, that prompt, and often the documents your retrieval layer pulled to answer it, leaves your environment for a third party. For a marketing team drafting copy, that is fine. For a credit team reasoning over a borrower file, a clinician summarizing a chart, or an engineer working against controlled technical data, it is a disclosure of regulated information to a party outside your perimeter, and the evidence you can produce about it describes someone else’s system, not yours.

GLBA frames this directly for financial institutions. The Gramm-Leach-Bliley Act governs the handling of nonpublic personal information, and its Safeguards Rule requires a documented information security program that protects that information across its lifecycle. An AI workflow that routes nonpublic personal information to an external model is part of that lifecycle, and you owe an account of how it is protected. The HIPAA Security Rule does the equivalent for healthcare: protected health information moving into an AI system is electronic PHI, and the rule’s administrative, physical, and technical safeguards apply to wherever that PHI lands. CMMC, for the defense industrial base, is stricter still, because controlled unclassified information carries handling requirements that a public API simply cannot satisfy.

Self-hosting resolves the boundary question by construction. The model runs in your VPC or your data center, the retrieval corpus is your own document store, the prompts and completions never leave, and the audit trail is yours to produce. You are no longer attesting to a vendor’s controls layered under a data processing agreement. You are describing a system you operate, which is exactly what a SOC 2 auditor, a HIPAA reviewer, a federal assessor, or an enterprise customer’s security questionnaire is asking you to do.

There is a cost to this, and it is honest to name it. Self-hosting means you own the infrastructure, the model operations, and the security of the stack, where a public API outsources all three. That is why the deployment pattern, not just the model, is the product. The rest of this guide is about building that pattern so the ownership is an asset rather than a liability.

The reference architecture for a secure private deployment

A secure self-hosted AI deployment is a small number of components, each of which exists partly to do its job and partly to produce evidence. Read the architecture as a set of control points, not just a data flow.

The model layer is an open-weight family you run yourself: Llama, Mistral, Qwen, or a comparable model served on a governed inference stack inside your boundary. The choice is driven by capability, license, and the hardware you can dedicate, but the security property is constant: weights and inference stay in your environment, so no prompt or completion egresses to a third party. Self-hosting open weights is what makes the boundary real rather than contractual.

The infrastructure layer is where the deployment lives: an isolated network segment in your cloud account or on-premise, with secrets managed in a vault rather than in environment variables, encryption in transit and at rest as a default, and the whole thing expressed as infrastructure as code so it is reproducible and reviewable. The retrieval layer, where you use one, keeps your documents inside the boundary and exposes them to the model only through access-controlled queries. The point of building this as code is not elegance. It is that an auditor can read the configuration that actually runs, and you can prove that what you described is what you deployed.

The access layer is the control that matters most and is most often missing. Every model call and every document the retrieval layer can reach should be governed by role-based access control and, where the data warrants it, attribute-based access control, tied to your existing identity provider. Access scope is the blast radius of an AI tool: if any authenticated user can ask the model anything and the retrieval layer can read every document, then a single compromised account reads your entire corpus through a friendly natural-language interface. RBAC and ABAC turn that from a catastrophe into a contained event, and they are the control SOC 2 and the HIPAA Security Rule both expect to see enforced.

The observability layer is the audit trail. Every prompt, every completion, every tool call, and every retrieved document is logged, attributed to a user, and retained in a tamper-evident store. This is not optional telemetry. It is the record that answers the question that opened this guide, and it is the single artifact whose absence most reliably ends a security review badly.

The governance layer sits over all of it: change control for the model, the system prompts, and the retrieval corpus, so a change in behavior is a controlled, approved, versioned event rather than a silent drift. A model update or a prompt revision is a change under your existing change-management discipline, with the same approval and the same record you already keep for any production change.

What gets deployed, control by control

It helps to walk the controls individually, because each maps to a specific obligation and a specific piece of evidence. This is the part of the work that turns an architecture diagram into an audit-ready posture.

Access control comes first because it is the highest-leverage and most-skipped. Role-based access control assigns users to roles and roles to permissions, so a credit analyst and a marketing associate reach different models and different documents. Attribute-based access control refines that with context: data sensitivity, user clearance, time, or location can all gate a call. For a healthcare deployment, ABAC is how you keep a model that can summarize a chart from being asked to summarize a chart the requesting user has no treatment relationship with. The evidence this produces is an access-control matrix and the enforcement logs that show it operating, which tags directly to SOC 2 logical-access criteria and the HIPAA access-control standard.

Audit logging is the second pillar and the one with the broadest reach across frameworks. A complete log records the user, the prompt, the completion, any tools the model invoked, and any documents the retrieval layer surfaced, with timestamps and integrity protection so the record cannot be quietly altered. The retention period is set to satisfy the longest applicable obligation. This single control produces evidence for SOC 2 monitoring criteria, the HIPAA audit-controls standard, and the accountability expectations in CMMC, which is why it is worth building well rather than bolting on. A deployment that cannot reconstruct an interaction after the fact is not audit-ready, regardless of how secure the rest of it is.

Model change control is the third, and it is where AI deployments diverge from conventional applications. A large language model’s behavior can shift when the weights change, when the system prompt is edited, or when the retrieval corpus is updated, and any of those can alter what the system does without a line of application code changing. Treating the model, the prompts, and the corpus as versioned, approved, and logged changes is what keeps that behavior governed. The evidence is a change log and an approval trail, which tags to SOC 2 and ISO 27001 change-management controls and satisfies the configuration-management expectations in CMMC.

Data-boundary controls are the fourth pillar and the reason for self-hosting in the first place. These are the network isolation, the egress controls, and the retrieval-scoping that together guarantee regulated data does not leave the perimeter and that the model cannot reach data the requesting context is not entitled to. The evidence is the network configuration, the data-flow documentation, and the egress monitoring, which map to GLBA Safeguards Rule obligations for nonpublic personal information and to the HIPAA transmission-security standard.

Encryption and secrets management are the fifth, and while they are table stakes, they are table stakes that get checked. Encryption in transit and at rest, keys managed in a dedicated service, and secrets kept out of code and configuration are the baseline a reviewer assumes and the baseline a surprising number of hurried deployments miss.

A compliance-evidence mapping you can reuse

The mistake that makes AI governance expensive is treating it as a separate documentation effort from the SOC 2 or ISO 27001 program you already run. The principle that saves the most work is to document once and tag twice: a control you already operate, and evidence you already collect, can be pointed at the AI deployment without being rebuilt. The table below is the crosswalk we use to do that.

Control in the deployment Evidence produced SOC 2 HIPAA Security Rule GLBA CMMC
RBAC and ABAC on model calls and retrieval Access matrix, enforcement logs Logical access criteria Access control standard Safeguards Rule access controls Access control practices
Audit logging of prompts, completions, tools, retrieval Attributable, tamper-evident logs Monitoring criteria Audit controls standard Safeguards Rule monitoring Audit and accountability practices
Model and prompt change control Change log, approval trail Change-management criteria Information system activity review Safeguards Rule change management Configuration management practices
Data-boundary and egress controls Network config, data-flow docs System boundary documentation Transmission security standard Safeguards Rule data protection System and communications protection
Encryption in transit and at rest Key-management records Encryption criteria Encryption and decryption Safeguards Rule encryption Media and transmission protection

The payoff of building the crosswalk is that your AI deployment inherits most of its evidence from work already underway. A user-access review you run quarterly for SOC 2 covers the AI access control once you tag it. A change-approval record you keep for ISO 27001 covers the model change once you point it at the deployment. The new work concentrates in the genuinely AI-specific places: the retrieval-scoping, the prompt change control, and the adversarial testing, which are the parts a conventional control program never had to address.

Securing the deployment: testing the boundary

A private deployment that is never tested is a boundary you have asserted but not proven, and the assertion is exactly what a sophisticated reviewer will probe. Securing a self-hosted model means red-teaming it against the failure modes specific to LLM systems, not just the infrastructure around it.

Prompt injection is the first and most consequential. An attacker who can get instructions into the model’s context, directly in a prompt or indirectly through a document the retrieval layer ingests, can attempt to override the system’s intended behavior, exfiltrate data the model can reach, or abuse any tools the model can call. The defense is layered: input handling, least-privilege on the tools and the retrieval scope, and output filtering, with the audit log as the record that lets you detect an attempt after the fact. Indirect prompt injection through the retrieval corpus is the version that catches well-built deployments, because the malicious instruction arrives inside data the system was designed to trust.

Tool and agent abuse is the second. The moment a model can call a tool, query a database, send a message, or trigger an action, the question becomes what an attacker who controls the model’s input can make those tools do. Least privilege on every tool, human approval on consequential actions, and full logging of tool invocations are the controls, and they map to the same access-control and audit evidence the rest of the deployment produces.

Retrieval poisoning is the third, and it is specific to systems with a retrieval layer. If an attacker can place content into the corpus the model retrieves from, they can influence its answers or smuggle in injection payloads. Controlling who can write to the corpus, and treating the corpus as a security boundary rather than a passive document store, is the defense.

We organize this testing against the OWASP Top 10 for LLM Applications and the MITRE ATLAS threat model, because a named, shared taxonomy is how you demonstrate coverage to a reviewer rather than asserting it. For the deeper treatment of how these attacks work and how we test each one, see our walkthrough of the OWASP LLM Top 10 mapped to NIST AI RMF controls and the MITRE ATLAS guide for tool-using and multi-agent AI. The point of a private deployment is to keep your data inside your boundary. Testing is how you prove the boundary holds, and the test results are themselves evidence.

Where the framework conversation goes wrong

The most common error in regulated AI deployment is misreading which framework governs what, and it costs organizations real money in misdirected effort. Be precise about it.

US bank examiners supervise AI through the supervisory guidance you already answer to, not through an international management-system standard. SR 11-7, the joint Federal Reserve and OCC guidance on model risk management, treats a model used to inform a business decision as a model, and an LLM that informs a credit, fraud, or servicing decision falls under it. Third-party risk guidance, including OCC Bulletin 2013-29 and the June 2023 interagency third-party risk management guidance, governs the foundation-model provider and the AI-enabled vendors in your supply chain. Fair lending law under the Equal Credit Opportunity Act and Regulation B governs any AI that touches a credit decision, and prohibitions on unfair, deceptive, or abusive acts and practices apply broadly. That is the examiner-facing posture, and NIST AI RMF 1.0, the voluntary framework organized into the GOVERN, MAP, MEASURE, and MANAGE functions, is the structure that organizes it.

ISO/IEC 42001:2023, the AI management system standard, is real and valuable, but it is procurement and board assurance, not a US prudential exam tool. A vendor’s ISO 42001 certificate is an upstream attestation about the vendor’s management system; it is not coverage of your own deployment, and it does not satisfy an examiner reading you against SR 11-7. The clean division of labor is that NIST AI RMF plus SR 11-7 is your examiner-facing posture and ISO 42001 is your procurement and assurance posture. The EU AI Act is a separate axis that binds only where you place AI systems on the EU market.

One more accuracy note, because it changes which authorities are current. Federal AI policy now runs through Executive Order 14179 and the Office of Management and Budget memoranda M-25-21 and M-25-22. The earlier Executive Order 14110 and OMB memorandum M-24-10 were rescinded and should not be cited as current authority. Getting this right is part of the credibility the rest of the work depends on. For the full treatment of how these frameworks fit together for a financial institution, see our NIST AI RMF for financial services guide and the companion piece on AI model risk management and SR 11-7.

A sequence that works

The order of operations matters as much as the controls, and the instinct to start with policy is the one to resist. A deployment that opens with a governance document writes rules for a system it has not built and risks it has not tested. The sequence that produces an audit-ready posture inverts that.

Start with the inventory and the data classification. Know which AI use cases you are deploying, what data each one touches, and what regulatory obligation that data carries, before you build. You cannot scope a boundary for data you have not classified. Then build the deployment boundary: the network isolation, the egress controls, and the retrieval scoping that keep regulated data inside the perimeter, because every later control assumes the boundary exists.

Layer access control and logging onto the boundary next, because they are the controls with the broadest evidence reach and the ones a reviewer checks first. Add model and prompt change control so the system’s behavior is governed from the start rather than retrofitted. Then test: red-team the deployment against prompt injection, tool abuse, and retrieval poisoning, and fix what the testing finds. Finally, assemble the evidence package and write the governance documentation against the system you actually built, mapping each control to the frameworks you answer to. Policy last, against a real system, is the difference between a binder nobody can apply and a posture an examiner can follow.

Done in that order, a secure private AI deployment is a scoped engineering project with a defined evidence output, not an open-ended compliance program. The organizations that struggle are the ones that try to govern before they build and test before they classify. The ones that succeed build the boundary, enforce access and logging, control change, prove the boundary holds, and document last.

FAQ

What is self-hosted LLM compliance? Self-hosted LLM compliance is the set of controls that let you run a large language model inside your own environment and prove to an auditor or examiner that the deployment meets your obligations. It means keeping prompts, completions, and your documents inside your control boundary, enforcing access control on every model call, logging everything in an attributable and tamper-evident way, controlling changes to the model and its prompts, and mapping each control to the framework you answer to, such as SOC 2, the HIPAA Security Rule, GLBA, or CMMC. The controls are conventional security disciplines applied to a new kind of system; the evidence they produce is what makes the deployment defensible.

Why deploy a private AI instead of using a public API? For regulated data, the deciding factor is the control boundary. A public API sends your prompt, and often your retrieved documents, to a third party you do not control, which is a disclosure problem for nonpublic personal information under GLBA and protected health information under the HIPAA Security Rule. A private or self-hosted deployment keeps the model, the data, and the audit trail inside your perimeter, so the evidence you produce describes a system you actually govern rather than a vendor’s controls layered under a data processing agreement. The trade-off is that you own the infrastructure and the operations, which is why the deployment pattern matters as much as the model choice.

How does a self-hosted deployment map to SOC 2, HIPAA, GLBA, and CMMC? Each control in the deployment is tagged to the criteria that apply. Access control and audit logging map to SOC 2 logical-access and monitoring criteria and to the HIPAA access-control and audit-controls standards. Data-boundary and egress controls map to GLBA Safeguards Rule obligations for nonpublic personal information and the HIPAA transmission-security standard. The full control set maps to CMMC practices for defense industrial base work. One control operated once can answer multiple frameworks, which is the document-once, tag-twice principle. DSE prepares the evidence; we do not issue certifications.

Do US bank examiners supervise AI through ISO 42001? No. US examiners supervise AI through the supervisory guidance you already answer to: SR 11-7 for model risk, third-party risk guidance for vendors, fair lending under the Equal Credit Opportunity Act and Regulation B, and prohibitions on unfair, deceptive, or abusive practices. ISO/IEC 42001 is valuable for procurement signaling and board assurance, and a vendor’s certificate is an upstream attestation about the vendor’s management system, but it is not how a US prudential examiner reads your own deployment. The examiner-facing posture is built on NIST AI RMF plus SR 11-7.

Does a self-hosted deployment make us certified or compliant? No. A secure self-hosted deployment is engineering and readiness work, not a certification. NIST AI RMF is voluntary and has no certification program, ISO 42001 certificates come only from accredited bodies, and DSE prepares organizations for audit and examination without issuing certificates or guaranteeing any outcome. What the deployment delivers is a defensible, documented, audit-ready posture: a system you govern, evidence a reviewer can follow, and a senior owner who can answer the auditor, the board, and the customer security review.

The Bottom Line

For a regulated organization, the question is not whether AI is useful but whether you can deploy it without surrendering control of the data you are obligated to protect. A self-hosted deployment answers that by keeping the model, the data, and the audit trail inside your boundary, and by producing, as a byproduct of how it is built, the evidence a SOC 2 auditor, a HIPAA reviewer, a federal assessor, or an enterprise security review will ask for.

The work is a sequence, not a purchase. Classify the data, build the boundary, enforce access and logging, control change, test the boundary against the failure modes specific to LLM systems, and document last against the system you actually built. Map each control to the frameworks you answer to, and reuse the SOC 2 and ISO 27001 evidence you already produce rather than rebuilding it. Done that way, secure private AI is a scoped engineering engagement with a defined output, and the output is a program your own team can run and an examiner can follow. That posture, and the work to reach it, is what a private AI security engagement is built to deliver.

Key facts

Read next · AI Security & Governance

P
Founder · Principal Engineer
Data & AI engineer · 10+ yrs hands-on

Writes most of the long-form here. Lives in the codebase. Active on GitHub and LinkedIn.

§ Next step

Not sure which of these is you?

Tell us what's broken in a paragraph and a principal reads it directly — or walk the ladder from a low-commitment first engagement up to retained work.

One long-form a week. No marketing.

Subscribe to the Refinery Report. Practitioner deep-dives on AI engineering, security, and the realities of running production systems. Unsubscribe in one click.

~12 issues / quarter