How Enterprise AI Deployment Actually Gets Architected From Scratch
Executive Summary
Most published AI “architecture” diagrams show a box labeled “LLM” with arrows pointing in and out. That is not architecture. It is a wish.
Standing up a production AI system from zero is a sequence of consequential decisions made in a specific order, where each choice constrains the next: what the system must actually do, whether the data can support it, how retrieval and inference are wired, how you prove it works, how identity and isolation are enforced, how it deploys, and how it is handed off so the client owns it outright. This framework walks that sequence as we run it in practice, anonymized into a reference architecture. The throughline is that the model is the least interesting decision. Isolation, auth, evaluation, and handoff are what make the system production-grade.
The Premise: From Zero, Not From a Demo
The hardest enterprise AI work is not improving an existing system. It is the first production deployment for an organization that has a working prototype and no idea how to make it real, multi-user, secure, and theirs.
A prototype answers one user’s question in a notebook. A production system answers thousands of questions from many tenants, enforces who can see what, proves it has not regressed since yesterday, runs inside a security boundary an auditor will accept, and can be operated by the client’s own team after you leave. The distance between those two is the engagement.
We architect that distance in seven stages. They are ordered deliberately. Skipping ahead — most commonly jumping straight to model selection — is the single most reliable way to build something that never ships.
Stage 1: Requirements That Constrain, Not Inspire
The first artifact is not a model choice. It is a requirements document that constrains the design.
- The job statement. One sentence: what decision or task does this system support, for whom, and what is a correct outcome? If it takes a paragraph, the scope is not yet a system.
- The non-functional envelope. Latency budget, concurrency, tenancy model, data residency, audit obligations. In a regulated engagement these are hard constraints that eliminate entire architectures on day one.
- The definition of done. Written before the build: what a right answer, a wrong answer, and an acceptable refusal look like. This becomes the seed of the eval harness in Stage 4.
This stage produces a document that says “no” to things. A requirements doc that only inspires has done nothing useful. A requirements doc that rules out cloud regions, rules out a tenancy model, and pins a latency budget has done the architect’s job.
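One way to make the "says no" property concrete is to write the non-functional envelope as data and run candidate architectures against it. The sketch below is illustrative only: the `Envelope` and `Candidate` types, the region names, and every number are hypothetical, not figures from an engagement.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Non-functional constraints pinned in the requirements document."""
    max_p95_latency_ms: int
    allowed_regions: frozenset[str]
    tenancy: str


@dataclass(frozen=True)
class Candidate:
    """A candidate architecture and its expected characteristics."""
    name: str
    p95_latency_ms: int
    region: str
    tenancy: str


def violations(env: Envelope, c: Candidate) -> list[str]:
    """Return the reasons a candidate is ruled out (empty means it survives)."""
    reasons = []
    if c.p95_latency_ms > env.max_p95_latency_ms:
        reasons.append(f"latency {c.p95_latency_ms}ms exceeds budget")
    if c.region not in env.allowed_regions:
        reasons.append(f"region {c.region} violates data residency")
    if c.tenancy != env.tenancy:
        reasons.append(f"tenancy model {c.tenancy} is not permitted")
    return reasons


# Hypothetical envelope and candidates, for illustration only.
ENVELOPE = Envelope(2000, frozenset({"eu-west"}), "isolated-per-tenant")
CANDIDATES = [
    Candidate("hosted-api-us", 900, "us-east", "shared"),
    Candidate("regional-managed", 1500, "eu-west", "isolated-per-tenant"),
    Candidate("self-hosted-large", 4200, "eu-west", "isolated-per-tenant"),
]
SURVIVORS = [c.name for c in CANDIDATES if not violations(ENVELOPE, c)]
```

With these made-up inputs, two of the three candidates are eliminated before any model is discussed, which is the behavior the requirements document is supposed to produce.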
Stage 2: Data — Readiness Over Presence
With the job pinned, the next decision is whether the data can support it. This is where most timelines are actually set, and where optimistic plans break.
- Semantic integrity audit. Do fields mean the same thing across sources? Where are the overloaded columns and drifted enums that will poison retrieval?
- Coverage check. Does the data actually contain what the job statement requires, or only adjacent facts?
- Ingestion and chunking design. For retrieval, content is structured so meaning survives splitting — conditions stay attached to their exceptions, tables stay coherent, provenance travels with the chunk.
- The classification map. Every data tier is labeled by sensitivity now, because Stage 5 will enforce those boundaries in the architecture.
The output is an honest readiness verdict. Frequently the first sprint becomes data engineering rather than AI work. Naming that early is a feature: it prevents building a confident system on a foundation that produces wrong answers.
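As a rough illustration of chunking where conditions stay attached to their exceptions and provenance travels with the chunk, consider the sketch below. It assumes plain-text input split on blank lines, and the `QUALIFIER_PREFIXES` heuristic, the `Chunk` fields, and the sample policy text are all hypothetical. Real ingestion usually needs format-aware parsing, especially for tables.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical markers: a paragraph starting with one of these is treated as
# qualifying the paragraph before it, so the two are never split apart.
QUALIFIER_PREFIXES = ("except", "unless", "however", "provided that")


@dataclass(frozen=True)
class Chunk:
    text: str
    source_id: str    # provenance: which document the chunk came from
    position: int     # provenance: order of the chunk within the document
    sensitivity: str  # label taken from the classification map


def chunk_document(text: str, source_id: str, sensitivity: str) -> list[Chunk]:
    """Split on blank lines, keeping qualifying paragraphs with their condition."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    groups: list[str] = []
    for para in paragraphs:
        if groups and para.lower().startswith(QUALIFIER_PREFIXES):
            groups[-1] = groups[-1] + "\n\n" + para
        else:
            groups.append(para)
    return [Chunk(g, source_id, i, sensitivity) for i, g in enumerate(groups)]


# Made-up policy text, for illustration only.
POLICY = (
    "Refunds are issued within 30 days of purchase.\n\n"
    "Except for custom orders, which are non-refundable.\n\n"
    "Shipping is free on orders over 50 units."
)
CHUNKS = chunk_document(POLICY, source_id="policy-v3", sensitivity="internal")
```

Here the refund rule and its exception land in the same chunk, so retrieval cannot surface the rule without the exception, and every chunk carries the sensitivity label that later stages can enforce against.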
Stage 3: Retrieval and Model — The Boring, Correct Choices
Only now does the model enter, and it enters as a replaceable component behind an interface — not as the center of the system.
Retrieval layer
┌──────────────────────────────────────────────────────────────┐
│               Request (authenticated)                        │
│                          │                                   │
│                          ▼                                   │
│                ┌────────────────────┐                        │
│                │ Retrieval service  │ ◀── tenant-scoped      │
│                │ (hybrid search +   │ ◀── permission-aware   │
│                │  re-ranking)       │ ◀── provenance-tagged  │
│                └─────────┬──────────┘                        │
│                          │ context                           │
│                          ▼                                   │
│                ┌────────────────────┐                        │
│                │ Inference router   │ ◀── model-agnostic     │
│                │ (managed LLMs)     │ ◀── cost/latency aware │
│                └─────────┬──────────┘                        │
│                          │                                   │
│                          ▼                                   │
│               Response + audit record                        │
└──────────────────────────────────────────────────────────────┘
- Hybrid retrieval with re-ranking. Dense plus sparse retrieval, then a re-ranking stage, because pure vector similarity surfaces plausible-but-wrong context often enough to matter.
- A model router, not a model. Inference goes through a routing layer over managed providers so the system is model-agnostic. Models can be swapped, A/B tested, and cost-tuned without rewriting the application. This single decision protects the build against the likelihood that the best model today will not be the best model in six months.
- Retrieval is permission-aware by construction. The retrieval service only ever returns context the requesting identity is cleared for. Permissions are not a filter applied to results — they are part of the query.
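The retrieval points above can be sketched in memory. This sketch rests on several assumptions: reciprocal rank fusion (RRF) is our stand-in for combining dense and sparse results, the learned re-ranking stage is omitted, and the `Doc`, `Identity`, corpus, and scores are hypothetical. In a real store the `visible` check would be a predicate pushed into the index query itself, so out-of-scope documents are never candidates.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant: str
    required_role: str


@dataclass(frozen=True)
class Identity:
    subject: str
    tenant: str
    roles: frozenset[str]


def visible(doc: Doc, who: Identity) -> bool:
    """Tenant and role must both match for a document to be a candidate."""
    return doc.tenant == who.tenant and doc.required_role in who.roles


def scoped_ranking(scores: dict[str, float], corpus: dict[str, Doc],
                   who: Identity) -> list[str]:
    """Rank only documents the identity may see; scoping precedes ranking."""
    allowed = [d for d in scores if visible(corpus[d], who)]
    return sorted(allowed, key=lambda d: scores[d], reverse=True)


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: sum 1 / (k + rank) across the input rankings."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)


# Hypothetical corpus and retriever scores, for illustration only.
CORPUS = {d.doc_id: d for d in [
    Doc("a1", "acme", "analyst"), Doc("a2", "acme", "analyst"),
    Doc("a3", "acme", "admin"), Doc("a4", "acme", "analyst"),
    Doc("g1", "globex", "analyst"),
]}
DENSE = {"g1": 0.95, "a1": 0.80, "a2": 0.70, "a3": 0.60, "a4": 0.50}
SPARSE = {"a2": 9.1, "a3": 8.0, "g1": 7.5, "a4": 5.0, "a1": 3.0}

WHO = Identity("user-1", "acme", frozenset({"analyst"}))
RESULT = rrf([scoped_ranking(DENSE, CORPUS, WHO),
              scoped_ranking(SPARSE, CORPUS, WHO)])
```

Note that `g1` has the highest dense score and still never enters either ranking, because it belongs to another tenant. The scoping is applied when the candidates are selected, not to the fused output.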
Stage 4: Evaluation — The Harness That Gates Releases
Before anything ships, the definition of done from Stage 1 becomes executable.
- A curated eval set. Representative cases with known-good outcomes, including the hard exceptions and the cases that should be refused.
- Automated scoring on every change. Each prompt, retrieval, or model change runs the full set. Correctness, faithfulness to retrieved context, and refusal behavior are scored.
- A hard regression gate. A quality drop blocks the release the way a failing test blocks a merge. This is the mechanism that prevents silent regression — the failure mode that kills systems quietly after launch.
- Production observability. Live drift, hallucination, and retrieval-quality signals feed back so degradation is caught before users report it.
The eval harness is the most undervalued component in enterprise AI and the one that most reliably separates a system that stays correct from one that erodes. It is built before launch, not after the first incident.
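A minimal sketch of the regression gate follows, assuming exact-match scoring against known-good outcomes. That scoring is a deliberate simplification: faithfulness and refusal quality normally need richer scorers. The `Case` type, the `REFUSE` sentinel, and the sample cases are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

REFUSE = "<refuse>"  # hypothetical sentinel for "the system declined to answer"


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str  # known-good answer, or REFUSE for must-refuse cases


def score(system: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases where the output matches the expected outcome."""
    passed = sum(
        1 for c in cases
        if system(c.prompt).strip().lower() == c.expected.lower()
    )
    return passed / len(cases)


def gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True when the release may proceed: no drop beyond the tolerance."""
    return candidate >= baseline - tolerance


def make_system(answers: dict[str, str]) -> Callable[[str], str]:
    """Stand-in for the real pipeline: canned answers, refusal otherwise."""
    return lambda prompt: answers.get(prompt, REFUSE)


# Made-up eval set, including one case that must be refused.
EVAL_SET = [
    Case("capital of France?", "Paris"),
    Case("2 + 2?", "4"),
    Case("refund window?", "30 days"),
    Case("show another tenant's invoice", REFUSE),
]
GOOD = {"capital of France?": "Paris", "2 + 2?": "4", "refund window?": "30 days"}
REGRESSED = dict(GOOD, **{"show another tenant's invoice": "here is the invoice"})

BASELINE_SCORE = score(make_system(GOOD), EVAL_SET)
CANDIDATE_SCORE = score(make_system(REGRESSED), EVAL_SET)
RELEASE_ALLOWED = gate(CANDIDATE_SCORE, BASELINE_SCORE)
```

In a CI pipeline, a false `RELEASE_ALLOWED` would translate into a non-zero exit code, so the change is blocked the same way a failing unit test would block it. The regressed candidate here still answers every factual question correctly; it fails only because it stopped refusing, which is why the must-refuse cases belong in the set.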
Stage 5: Security and Auth — Identity Through the Whole Stack
Security is not a stage you can append. By Stage 5 it has already shaped retrieval (Stage 3) and classification (Stage 2). Here it is enforced end to end.
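- JWT/JWKS at the gateway. Every request must present a signed JWT. The API gateway verifies the RS256 signature against the issuer's public keys, fetched from its JWKS endpoint, and checks the token's expiry, issuer, and audience before the request reaches any application logic. Unauthenticated requests never touch inference.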
- Identity propagation. The authenticated identity flows from the gateway into the retrieval layer so permission-aware retrieval has a real subject to enforce against. Auth at the edge plus open data behind it is a breach waiting to happen.
- Multi-tenant isolation. Tenants are isolated at the data and retrieval layer so one tenant’s context can never surface in another’s response. In multi-tenant AI, isolation is the product, not a feature.
- Audit by design. Every request — who, what was asked, what was retrieved, what was returned — is recorded in a form an auditor accepts. Retrofitting this later is expensive and often impossible.
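The gateway check, identity propagation, and audit record can be sketched together. Two assumptions are ours rather than anything the architecture fixes: the PyJWT library is used for verification (imported lazily so the rest of the sketch has no third-party dependency), and the `tenant_id` and `roles` claim names are hypothetical. The audit record format is likewise illustrative, not a format any particular auditor mandates.

```python
from __future__ import annotations

import json
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Identity:
    subject: str
    tenant: str
    roles: tuple[str, ...]


def verify_token(token: str, jwks_url: str, issuer: str, audience: str) -> dict:
    """Verify an RS256 token against the issuer's JWKS and return its claims.

    Assumes PyJWT with its cryptography extra is installed. A real gateway
    would build the PyJWKClient once so its key cache is reused.
    """
    import jwt  # PyJWT, imported lazily

    signing_key = jwt.PyJWKClient(jwks_url).get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],  # pin the algorithm; do not trust the header
        issuer=issuer,
        audience=audience,
        options={"require": ["exp", "iss", "aud", "sub"]},
    )


def identity_from_claims(claims: dict) -> Identity:
    """Map verified claims to the identity passed to the retrieval layer.

    The tenant_id and roles claim names are hypothetical.
    """
    return Identity(claims["sub"], claims["tenant_id"],
                    tuple(claims.get("roles", ())))


def audit_record(who: Identity, question: str, retrieved_ids: list[str],
                 answer: str) -> str:
    """Serialize who asked what, what was retrieved, and what was returned."""
    record = {
        "ts": time.time(),
        "identity": asdict(who),
        "question": question,
        "retrieved": retrieved_ids,
        "answer": answer,
    }
    return json.dumps(record, sort_keys=True)
```

The design point is that `Identity` is built only from verified claims and is the single object handed to retrieval and to the audit writer, so nothing downstream ever acts on an identity the gateway did not authenticate.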
Stage 6: Deploy — Boring on Purpose
Deployment is intentionally unremarkable, because remarkable deployments are usually the bad kind.
- Managed, declarative infrastructure. The stack is defined as code — gateway, auth, retrieval, routing, observability — so environments are reproducible and reviewable.
- Phased rollout. A contained pilot tenant first, with the eval gate and observability live, before broad exposure. Real load reveals what staging cannot.
- Cost instrumentation from day one. Inference and retrieval spend is tracked per tenant and per request, because at machine speed a cost mistake compounds before a human notices.
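Per-tenant, per-request cost tracking can start as small as the sketch below. The flat `price_per_1k_tokens`, the budget, and the token counts are hypothetical placeholders; real pricing usually differs by model and by input versus output tokens, and retrieval spend would be metered the same way.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CostMeter:
    """Accumulates inference spend per tenant and flags budget overruns."""
    price_per_1k_tokens: float  # hypothetical flat price
    tenant_budget: float
    spend: dict[str, float] = field(default_factory=lambda: defaultdict(float))

    def record(self, tenant: str, prompt_tokens: int,
               completion_tokens: int) -> float:
        """Record one request and return its cost."""
        total_tokens = prompt_tokens + completion_tokens
        cost = total_tokens / 1000 * self.price_per_1k_tokens
        self.spend[tenant] += cost
        return cost

    def over_budget(self) -> list[str]:
        """Tenants whose accumulated spend exceeds the budget."""
        return sorted(t for t, s in self.spend.items() if s > self.tenant_budget)


# Made-up prices and traffic, for illustration only.
METER = CostMeter(price_per_1k_tokens=0.5, tenant_budget=1.0)
FIRST_COST = METER.record("acme", prompt_tokens=1500, completion_tokens=500)
METER.record("acme", prompt_tokens=500, completion_tokens=500)
METER.record("globex", prompt_tokens=1000, completion_tokens=0)
FLAGGED = METER.over_budget()
```

Returning the cost from `record` lets the same figure be written into the per-request audit record, while `over_budget` is the hook an alert would poll.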
Stage 7: Handoff — The Client Owns It
The engagement is not complete when the system runs. It is complete when the client’s own team can operate, extend, and reason about it without us.
- Full IP transfer. Code, infrastructure definitions, eval harness, and runbooks transfer to the client. They own the system outright. No lock-in to us as a vendor.
- Operational runbooks. How to add a tenant, swap a model, extend the eval set, read the audit log, respond to drift alerts.
- Knowledge transfer sessions. Working sessions that walk the client team through why each architecture decision was made, so they can defend those decisions in their own audit.
A handoff that leaves the client dependent on the consultancy is a failed handoff regardless of how well the system runs.
The Sequence, on One Page
| Stage | Decision | The non-obvious point |
|---|---|---|
| 1. Requirements | What it must do; what “done” means | The doc should say “no” to architectures |
| 2. Data | Ready vs. present | This sets the real timeline |
| 3. Retrieval & model | Hybrid retrieval; model router | The model is replaceable; the interface is not |
| 4. Evaluation | Harness that gates releases | Prevents silent regression after launch |
| 5. Security & auth | JWT/JWKS, identity propagation, isolation | Isolation is the product in multi-tenant AI |
| 6. Deploy | Declarative infra, phased rollout | Boring on purpose |
| 7. Handoff | Full IP transfer, runbooks | Client owns it, no lock-in |
Applicability
This reference architecture applies to organizations standing up their first production AI system, teams converting a successful prototype into a multi-tenant product, and regulated or federal environments where auth, isolation, and auditability are hard constraints rather than nice-to-haves.
It is deliberately model-agnostic. The provider landscape will change; the sequence will not.
This framework reflects the architecture patterns our team applies when standing up production AI systems in enterprise and regulated-industry engagements. Client-specific details are anonymized; it is published as a reference architecture for organizations planning a first production AI deployment.