An AI Security Red-Teaming Framework for Enterprise and Federal LLM Systems
Executive Summary
The most dangerous failures in LLM systems are rarely in the model. They are in the systems built around it: the ingestion pipelines, retrieval layers, tool routers, identity boundaries, and agent control flow that decide what the model can read, do, and disclose. A model that politely refuses a “how do I build a weapon” prompt can still be coerced—through a poisoned PDF, a malicious tool response, or an over-permissioned agent—into exfiltrating another tenant’s data or executing an unintended action.
Our team developed this red-teaming framework from work hardening LLM and agentic systems for a regulated SaaS client and a federal program. It treats the LLM application as an adversarial attack surface and tests it the way a determined attacker would, then maps every finding to a shared taxonomy so engineering, security, and governance stakeholders work from one language. It is opinionated by design: we red-team before launch, we red-team continuously after, and we gate deployment on severity thresholds rather than vibes.
Why Model-Level Testing Is Not Enough
Most teams that “test their AI for safety” run a battery of toxic and jailbreak prompts against the model endpoint and call it done. That is model-level red-teaming, and it is necessary but insufficient.
Real incidents—indirect prompt injection via documents and tools, vector-store poisoning, credential leakage in third-party agent skills, denial-of-wallet cost attacks—show the attack surface lies in the application, not the weights. Any point where an attacker can influence input to an instruction-tuned model is in scope: a web page, a PDF, an email body, a tool response, even an image. The structural flaw is blending trusted instructions and untrusted data in a single context window, then treating the model’s output as a command for downstream systems.
Our framework therefore tests at three levels: the model (does it produce unsafe output in isolation), the application (does the full RAG, tool, and API stack leak or misbehave), and the system (do identity, authorization, and tenancy boundaries hold under adversarial pressure).
The Attack Taxonomy We Test Against
We organize testing around a concrete taxonomy of adversary behaviors rather than a generic checklist:
| Attack class | What it looks like |
|---|---|
| Direct prompt injection | User-supplied instructions that override system intent or guardrails |
| Indirect prompt injection | Hostile instructions hidden in retrieved documents, tool outputs, or web content |
| Jailbreaks | DAN-style role-play, encoding tricks, and crescendo escalation to bypass safety layers |
| Sensitive data exfiltration | Coaxing the model to reveal secrets, training data, or another user’s context |
| Tool / function abuse | Inducing the agent to call tools in unintended, privileged, or destructive ways |
| Excessive agency | Exploiting excessive functionality, permission, or autonomy in agent design |
| Data and model poisoning | Tainting training data, RAG indices, or memory stores to alter future behavior |
| Denial-of-wallet | Driving runaway token or tool spend through adversarial inputs |
Indirect prompt injection is the one we emphasize most. It is not “a jailbreak by another name”—it is a system-level vulnerability that turns any content source into an instruction channel, and it becomes critical the moment the system grants the model tools that can write to external endpoints.
Mapping Findings to a Shared Language
A finding nobody can prioritize is a finding nobody fixes. We map every result to two complementary standards so that security operations and engineering both understand it.
OWASP Top 10 for LLM Applications (2025). The current list anchors our application-layer coverage: Prompt Injection (LLM01), Sensitive Information Disclosure (LLM02), Supply Chain (LLM03), Data and Model Poisoning (LLM04), Improper Output Handling (LLM05), Excessive Agency (LLM06), System Prompt Leakage (LLM07), Vector and Embedding Weaknesses (LLM08), Misinformation (LLM09), and Unbounded Consumption (LLM10). OWASP decomposes excessive agency into excessive functionality, excessive permission, and excessive autonomy—a distinction we use directly in remediation guidance.
MITRE ATLAS. ATLAS extends the ATT&CK philosophy to AI, with roughly fourteen tactics and sixty-plus techniques covering data poisoning, model theft, evasion, AI supply-chain exploitation, and agentic abuses like memory manipulation. Where OWASP categorizes application risk, ATLAS describes adversary behavior—and it lets us hand findings to a SOC that already lives in ATT&CK.
Tooling such as Microsoft PyRIT and Promptfoo maps its adversarial datasets to OWASP categories natively, so reports can classify each vulnerability under LLM01–LLM10 and the corresponding ATLAS tactic with minimal manual translation.
Aligning With NIST and the EU AI Act
For enterprise and especially federal systems, red-teaming has to feed governance, not sit beside it.
We frame the engagement inside the NIST AI Risk Management Framework and its four functions—Govern, Map, Measure, and Manage. Threat modeling and scenario design are Map activities; adversarial testing is Measure; remediation decisions and continuous testing are Manage; and the whole program rolls up to Govern. The NIST Generative AI Profile sharpens this for generative use cases, calling for contextual misuse analysis and the same governance rigor applied to other critical systems. Federal programs are explicitly encouraged to anchor AI risk practice in NIST frameworks, which makes AI RMF a natural backbone for justifying and structuring this work.
We also produce artifacts traceable to EU AI Act expectations for clients who deploy globally. We do not assert that an engagement makes a client “compliant” or “certified”—that is a determination for the client’s counsel and auditors. What we deliver is evidence: documented threat models, test coverage, severity-scored findings, and remediation tracking that those obligations can be mapped against.
Designing the Test Suite
A red-team engagement is only as good as its corpus and its instrumentation.
We combine three sources. Automated scanners like Garak—an open-source LLM vulnerability scanner with a curated library of jailbreak, encoding, injection, and training-data-extraction prompts—are well-suited to model-level testing. Application-aware tooling like Promptfoo dynamically generates attacks tailored to the specific RAG pipeline or agent rather than treating the model in isolation. Agent red-teaming frameworks like PyRIT supply fifty-plus adversarial datasets, prompt converters (base64, leetspeak, translation), and attack strategies (prompt-sending, crescendo) with LLM-as-judge scoring—and it has been exercised against a hundred-plus real products including Copilot.
On top of automation, we build a mini-benchmark from 50–200 real use cases for the system under test and augment it with synthetic adversarial variants, following the practice of expanding the corpus continuously from red-team findings and incident postmortems. Throughout, the system is instrumented to capture full request–response and tool-call traces, so tool misuse, data leakage, and context poisoning can be observed and attributed rather than merely suspected.
Execution is staged: lower-risk tests run first in a sandbox; invasive scenarios escalate only once containment and monitoring are validated.
The Authorization-Aware Threat Model
This is where most LLM security programs are thinnest and where we spend the most rigor. We model the system around its identity and authorization boundaries, not just its prompts.
- Multi-tenant isolation. Tenant identifiers must propagate through authentication, authorization, and the retrieval layer. We test whether one tenant’s embeddings, retrieved chunks, or conversation history can ever surface for another. Vector-store weaknesses that cross tenant boundaries are treated as critical by default.
- Secrets handling. Models can memorize and regurgitate secrets they are given access to. Agent-skill ecosystems frequently hard-code credentials or store them in weakly protected config. We probe for credential leakage in outputs and through third-party skill channels.
- Tool permission scoping. Every tool the agent can call is evaluated against least privilege. We test for excessive functionality (tools beyond the task), excessive permission (over-broad privileges), and excessive autonomy (high-impact actions without human-in-the-loop).
- Output trust. All model output is treated as untrusted until filtered. We verify deterministic redaction and that downstream systems never execute model output as a command without validation.
Scoring, Reporting, and Continuous Red-Teaming
Findings are scored on a CVSS-adapted 0–10 scale—critical at 9.0–10.0, high at 7.0–8.9—decomposing risk into business and security impact, observed attack-success rate during testing, and human exploitability. Each finding carries its OWASP category, ATLAS tactic, reproduction steps with exact payloads, evidence from the traces, and governance implications. Reports are written to be consumable by engineers and executives alike.
Red-teaming is not a launch gate you pass once. We operationalize it as a continuous practice: the corpus grows from every new finding and incident, runtime monitoring tracks intent and goal alignment, tool and API usage, latency and cost, and anomaly detection flags drift. Findings feed a deployment gating policy with explicit acceptance thresholds—what severity blocks a release, what gets a waiver, and who signs it.
What This Means For You
If your LLM or agent system touches customer data, calls tools, or runs in a regulated environment, model-level safety testing is table stakes, not assurance. The questions worth asking before launch are: Can a poisoned document make our agent act against us? Can one tenant reach another’s data through the vector store? Does any tool run with more privilege than its task requires? Can we score, prioritize, and track what we find against a standard our auditors recognize?
A credible answer requires testing the whole system—model, application, and authorization boundary—and wiring the results into your governance process rather than a one-off report. Our AI Security Sprint delivers a scoped red-team engagement against your highest-risk LLM workflow, with findings mapped to OWASP, ATLAS, and NIST and a continuous-testing plan to keep them closed.
This framework represents research and engineering work by the DSE team, drawing on professional experience hardening LLM and agentic systems in regulated SaaS and federal contexts. It is designed as a reference methodology for organizations evaluating the security of AI systems before and after deployment. It is not legal advice and does not by itself establish regulatory compliance or certification.