shipping production AI · since 2026 NAICS 541330 / 541511 / 541512 / 541519  ·  CMMC-aware
Selected Work / AI Security / case · mework
AI SecurityRed-TeamingNIST AI RMFPrompt Injection

An AI Security Red-Teaming Framework for Enterprise and Federal LLM Systems

A structured methodology for stress-testing LLM and agent systems against prompt injection, tool abuse, and data exfiltration before they ship—mapped to OWASP LLM Top 10, MITRE ATLAS, and NIST AI RMF.

D
DSE-Experts
Operator-led practice
May 27, 2026
8 min · 1,671 words

An AI Security Red-Teaming Framework for Enterprise and Federal LLM Systems

Executive Summary

The most dangerous failures in LLM systems are rarely in the model. They are in the systems built around it: the ingestion pipelines, retrieval layers, tool routers, identity boundaries, and agent control flow that decide what the model can read, do, and disclose. A model that politely refuses a “how do I build a weapon” prompt can still be coerced—through a poisoned PDF, a malicious tool response, or an over-permissioned agent—into exfiltrating another tenant’s data or executing an unintended action.

Our team developed this red-teaming framework from work hardening LLM and agentic systems for a regulated SaaS client and a federal program. It treats the LLM application as an adversarial attack surface and tests it the way a determined attacker would, then maps every finding to a shared taxonomy so engineering, security, and governance stakeholders work from one language. It is opinionated by design: we red-team before launch, we red-team continuously after, and we gate deployment on severity thresholds rather than vibes.

Why Model-Level Testing Is Not Enough

Most teams that “test their AI for safety” run a battery of toxic and jailbreak prompts against the model endpoint and call it done. That is model-level red-teaming, and it is necessary but insufficient.

Real incidents—indirect prompt injection via documents and tools, vector-store poisoning, credential leakage in third-party agent skills, denial-of-wallet cost attacks—show the attack surface lies in the application, not the weights. Any point where an attacker can influence input to an instruction-tuned model is in scope: a web page, a PDF, an email body, a tool response, even an image. The structural flaw is blending trusted instructions and untrusted data in a single context window, then treating the model’s output as a command for downstream systems.

Our framework therefore tests at three levels: the model (does it produce unsafe output in isolation), the application (does the full RAG, tool, and API stack leak or misbehave), and the system (do identity, authorization, and tenancy boundaries hold under adversarial pressure).

The Attack Taxonomy We Test Against

We organize testing around a concrete taxonomy of adversary behaviors rather than a generic checklist:

Attack class What it looks like
Direct prompt injection User-supplied instructions that override system intent or guardrails
Indirect prompt injection Hostile instructions hidden in retrieved documents, tool outputs, or web content
Jailbreaks DAN-style role-play, encoding tricks, and crescendo escalation to bypass safety layers
Sensitive data exfiltration Coaxing the model to reveal secrets, training data, or another user’s context
Tool / function abuse Inducing the agent to call tools in unintended, privileged, or destructive ways
Excessive agency Exploiting excessive functionality, permission, or autonomy in agent design
Data and model poisoning Tainting training data, RAG indices, or memory stores to alter future behavior
Denial-of-wallet Driving runaway token or tool spend through adversarial inputs

Indirect prompt injection is the one we emphasize most. It is not “a jailbreak by another name”—it is a system-level vulnerability that turns any content source into an instruction channel, and it becomes critical the moment the system grants the model tools that can write to external endpoints.

Mapping Findings to a Shared Language

A finding nobody can prioritize is a finding nobody fixes. We map every result to two complementary standards so that security operations and engineering both understand it.

OWASP Top 10 for LLM Applications (2025). The current list anchors our application-layer coverage: Prompt Injection (LLM01), Sensitive Information Disclosure (LLM02), Supply Chain (LLM03), Data and Model Poisoning (LLM04), Improper Output Handling (LLM05), Excessive Agency (LLM06), System Prompt Leakage (LLM07), Vector and Embedding Weaknesses (LLM08), Misinformation (LLM09), and Unbounded Consumption (LLM10). OWASP decomposes excessive agency into excessive functionality, excessive permission, and excessive autonomy—a distinction we use directly in remediation guidance.

MITRE ATLAS. ATLAS extends the ATT&CK philosophy to AI, with roughly fourteen tactics and sixty-plus techniques covering data poisoning, model theft, evasion, AI supply-chain exploitation, and agentic abuses like memory manipulation. Where OWASP categorizes application risk, ATLAS describes adversary behavior—and it lets us hand findings to a SOC that already lives in ATT&CK.

Tooling such as Microsoft PyRIT and Promptfoo maps its adversarial datasets to OWASP categories natively, so reports can classify each vulnerability under LLM01–LLM10 and the corresponding ATLAS tactic with minimal manual translation.

Aligning With NIST and the EU AI Act

For enterprise and especially federal systems, red-teaming has to feed governance, not sit beside it.

We frame the engagement inside the NIST AI Risk Management Framework and its four functions—Govern, Map, Measure, and Manage. Threat modeling and scenario design are Map activities; adversarial testing is Measure; remediation decisions and continuous testing are Manage; and the whole program rolls up to Govern. The NIST Generative AI Profile sharpens this for generative use cases, calling for contextual misuse analysis and the same governance rigor applied to other critical systems. Federal programs are explicitly encouraged to anchor AI risk practice in NIST frameworks, which makes AI RMF a natural backbone for justifying and structuring this work.

We also produce artifacts traceable to EU AI Act expectations for clients who deploy globally. We do not assert that an engagement makes a client “compliant” or “certified”—that is a determination for the client’s counsel and auditors. What we deliver is evidence: documented threat models, test coverage, severity-scored findings, and remediation tracking that those obligations can be mapped against.

Designing the Test Suite

A red-team engagement is only as good as its corpus and its instrumentation.

We combine three sources. Automated scanners like Garak—an open-source LLM vulnerability scanner with a curated library of jailbreak, encoding, injection, and training-data-extraction prompts—are well-suited to model-level testing. Application-aware tooling like Promptfoo dynamically generates attacks tailored to the specific RAG pipeline or agent rather than treating the model in isolation. Agent red-teaming frameworks like PyRIT supply fifty-plus adversarial datasets, prompt converters (base64, leetspeak, translation), and attack strategies (prompt-sending, crescendo) with LLM-as-judge scoring—and it has been exercised against a hundred-plus real products including Copilot.

On top of automation, we build a mini-benchmark from 50–200 real use cases for the system under test and augment it with synthetic adversarial variants, following the practice of expanding the corpus continuously from red-team findings and incident postmortems. Throughout, the system is instrumented to capture full request–response and tool-call traces, so tool misuse, data leakage, and context poisoning can be observed and attributed rather than merely suspected.

Execution is staged: lower-risk tests run first in a sandbox; invasive scenarios escalate only once containment and monitoring are validated.

The Authorization-Aware Threat Model

This is where most LLM security programs are thinnest and where we spend the most rigor. We model the system around its identity and authorization boundaries, not just its prompts.

Scoring, Reporting, and Continuous Red-Teaming

Findings are scored on a CVSS-adapted 0–10 scale—critical at 9.0–10.0, high at 7.0–8.9—decomposing risk into business and security impact, observed attack-success rate during testing, and human exploitability. Each finding carries its OWASP category, ATLAS tactic, reproduction steps with exact payloads, evidence from the traces, and governance implications. Reports are written to be consumable by engineers and executives alike.

Red-teaming is not a launch gate you pass once. We operationalize it as a continuous practice: the corpus grows from every new finding and incident, runtime monitoring tracks intent and goal alignment, tool and API usage, latency and cost, and anomaly detection flags drift. Findings feed a deployment gating policy with explicit acceptance thresholds—what severity blocks a release, what gets a waiver, and who signs it.

What This Means For You

If your LLM or agent system touches customer data, calls tools, or runs in a regulated environment, model-level safety testing is table stakes, not assurance. The questions worth asking before launch are: Can a poisoned document make our agent act against us? Can one tenant reach another’s data through the vector store? Does any tool run with more privilege than its task requires? Can we score, prioritize, and track what we find against a standard our auditors recognize?

A credible answer requires testing the whole system—model, application, and authorization boundary—and wiring the results into your governance process rather than a one-off report. Our AI Security Sprint delivers a scoped red-team engagement against your highest-risk LLM workflow, with findings mapped to OWASP, ATLAS, and NIST and a continuous-testing plan to keep them closed.


This framework represents research and engineering work by the DSE team, drawing on professional experience hardening LLM and agentic systems in regulated SaaS and federal contexts. It is designed as a reference methodology for organizations evaluating the security of AI systems before and after deployment. It is not legal advice and does not by itself establish regulatory compliance or certification.

P
Founder · Principal Engineer
Data & AI engineer · 10+ yrs hands-on

Writes most of the long-form here. Lives in the codebase. Active on GitHub and LinkedIn.

§ Next step

Not sure which of these is you?

Tell us what's broken in a paragraph and a principal reads it directly — or walk the ladder from a low-commitment first engagement up to retained work.

One long-form a week. No marketing.

Subscribe to the Refinery Report. Practitioner deep-dives on AI engineering, security, and the realities of running production systems. Unsubscribe in one click.

~12 issues / quarter