
How Enterprise AI Deployment Actually Gets Architected From Scratch

A reference architecture for standing up a production AI system from zero — requirements through handoff — drawn from real engagements: multi-tenant isolation, JWT/JWKS auth at the gateway, managed inference with routing, an eval harness that gates releases, and phased delivery with full IP transfer.

DSE-Experts
Operator-led practice
May 27, 2026

Executive Summary

Most published AI “architecture” diagrams show a box labeled “LLM” with arrows pointing in and out. That is not architecture. It is a wish.

Standing up a production AI system from zero is a sequence of consequential decisions made in a specific order, where each choice constrains the next: what the system must actually do, whether the data can support it, how retrieval and inference are wired, how you prove it works, how identity and isolation are enforced, how it deploys, and how it is handed off so the client owns it outright. This framework walks that sequence as we run it in practice, anonymized into a reference architecture. The throughline is that the model is the least interesting decision. Isolation, auth, evaluation, and handoff are what make a system production-grade.

The Premise: From Zero, Not From a Demo

The hardest enterprise AI work is not improving an existing system. It is the first production deployment for an organization that has a working prototype and no idea how to make it real, multi-user, secure, and theirs.

A prototype answers one user’s question in a notebook. A production system answers thousands of questions from many tenants, enforces who can see what, proves it has not regressed since yesterday, runs inside a security boundary an auditor will accept, and can be operated by the client’s own team after you leave. The distance between those two is the engagement.

We architect that distance in seven stages. They are ordered deliberately. Skipping ahead — most commonly jumping straight to model selection — is the single most reliable way to build something that never ships.

Stage 1: Requirements That Constrain, Not Inspire

The first artifact is not a model choice. It is a requirements document that constrains the design.

This stage produces a document that says “no” to things. A requirements doc that only inspires has done nothing useful. A requirements doc that rules out cloud regions, rules out a tenancy model, and pins a latency budget has done the architect’s job.
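One way to make a requirements document say "no" is to express its constraints as data and test candidate designs against them. The following is a minimal sketch under stated assumptions: the `Requirements` and `CandidateDesign` types, the region and tenancy names, and the 2000 ms budget are all hypothetical placeholders, not values from any engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hypothetical constraints extracted from a requirements document."""
    allowed_regions: frozenset[str]
    allowed_tenancy: frozenset[str]
    p95_latency_budget_ms: int


@dataclass(frozen=True)
class CandidateDesign:
    """A proposed architecture, reduced to the fields the constraints test."""
    region: str
    tenancy: str
    estimated_p95_ms: int


def violations(req: Requirements, design: CandidateDesign) -> list[str]:
    """Return every reason the requirements rule this design out."""
    found = []
    if design.region not in req.allowed_regions:
        found.append(f"region {design.region!r} is ruled out")
    if design.tenancy not in req.allowed_tenancy:
        found.append(f"tenancy model {design.tenancy!r} is ruled out")
    if design.estimated_p95_ms > req.p95_latency_budget_ms:
        found.append(
            f"estimated p95 {design.estimated_p95_ms} ms exceeds "
            f"budget {req.p95_latency_budget_ms} ms"
        )
    return found


# Placeholder values for illustration only.
req = Requirements(
    allowed_regions=frozenset({"us-gov-east"}),
    allowed_tenancy=frozenset({"database-per-tenant"}),
    p95_latency_budget_ms=2000,
)
shared_pool = CandidateDesign("eu-west", "shared-schema", 1500)
for reason in violations(req, shared_pool):
    print("NO:", reason)
```

An empty list means the design survives the document. A document that can never produce a non-empty list is the "only inspires" kind.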

Stage 2: Data — Readiness Over Presence

With the job pinned, the next decision is whether the data can support it. This is where most timelines are actually set, and where optimistic plans break.

The output is an honest readiness verdict. Frequently the first sprint becomes data engineering rather than AI work. Naming that early is a feature: it prevents building a confident system on a foundation that produces wrong answers.
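The difference between data that is present and data that is ready can be sketched as a per-source audit that lists gaps instead of returning a score. The three checks below (an accountable owner, access labels, freshness) and the 30-day limit are illustrative assumptions; a real audit would use whatever criteria the requirements from Stage 1 impose.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceProfile:
    """Hypothetical audit result for one data source."""
    name: str
    has_owner: bool           # someone accountable for corrections
    has_access_labels: bool   # records carry who-may-see-this metadata
    last_refreshed: date


def readiness_gaps(src: SourceProfile, today: date, max_age_days: int = 30) -> list[str]:
    """List what stops this source from being ready; empty means ready."""
    gaps = []
    if not src.has_owner:
        gaps.append("no accountable owner")
    if not src.has_access_labels:
        gaps.append("no access labels, so retrieval cannot be permission-aware")
    if (today - src.last_refreshed).days > max_age_days:
        gaps.append(f"not refreshed in the last {max_age_days} days")
    return gaps


def verdict(sources: list[SourceProfile], today: date) -> tuple[bool, dict[str, list[str]]]:
    """Overall verdict plus the per-source gaps that justify it."""
    report = {s.name: readiness_gaps(s, today) for s in sources}
    ready = all(not gaps for gaps in report.values())
    return ready, report


# Example data is invented for illustration.
today = date(2026, 5, 27)
sources = [
    SourceProfile("contracts", True, True, date(2026, 5, 20)),
    SourceProfile("wiki-export", False, False, date(2025, 11, 1)),
]
ready, report = verdict(sources, today)
print("ready:", ready)
for name, gaps in report.items():
    print(name, "->", gaps or "ready")
```

Returning the gaps, not just a boolean, is the point: the list is what turns the first sprint into a scoped data engineering task.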

Stage 3: Retrieval and Model — The Boring, Correct Choices

Only now does the model enter, and it enters as a replaceable component behind an interface — not as the center of the system.
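A replaceable model behind an interface can be sketched with a structural type and a small router. This is an illustration, not a provider integration: `TextModel`, `StubModel`, the cost and latency figures, and the "cheapest model that fits the latency budget" policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class TextModel(Protocol):
    """The interface the rest of the system depends on."""
    name: str
    cost_per_1k_tokens: float
    typical_latency_ms: int

    def complete(self, prompt: str) -> str: ...


@dataclass
class StubModel:
    """Stand-in for a managed-inference client; returns canned text."""
    name: str
    cost_per_1k_tokens: float
    typical_latency_ms: int

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] answer to: {prompt}"


def route(models: Sequence[TextModel], latency_budget_ms: int) -> TextModel:
    """Pick the cheapest model whose typical latency fits the budget."""
    eligible = [m for m in models if m.typical_latency_ms <= latency_budget_ms]
    if not eligible:
        raise LookupError("no model meets the latency budget")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


# Invented figures; real values would come from measurement.
models = [
    StubModel("large", cost_per_1k_tokens=0.030, typical_latency_ms=2500),
    StubModel("small", cost_per_1k_tokens=0.002, typical_latency_ms=400),
]
chosen = route(models, latency_budget_ms=1000)
print(chosen.complete("summarize the contract"))
```

Callers only ever see `TextModel`, so swapping a provider means writing one new class that satisfies the protocol. Nothing upstream changes.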

Retrieval layer

┌────────────────────────────────────────────────────────────────┐
│                 Request (authenticated)                        │
│                            │                                   │
│                            ▼                                   │
│                 ┌─────────────────────┐                        │
│                 │  Retrieval service  │ ◀── tenant-scoped      │
│                 │  (hybrid search +   │ ◀── permission-aware   │
│                 │   re-ranking)       │ ◀── provenance-tagged  │
│                 └──────────┬──────────┘                        │
│                            │ context                           │
│                            ▼                                   │
│                 ┌─────────────────────┐                        │
│                 │  Inference router   │ ◀── model-agnostic     │
│                 │  (managed LLMs)     │ ◀── cost/latency aware │
│                 └──────────┬──────────┘                        │
│                            │                                   │
│                            ▼                                   │
│                 Response + audit record                        │
└────────────────────────────────────────────────────────────────┘
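The three labels on the retrieval service in the diagram (tenant-scoped, permission-aware, provenance-tagged) can be sketched as a filter that runs before ranking. This is an in-memory illustration under assumptions: the `Chunk` and `Caller` types, the group names, and the `doc://` URIs are hypothetical, relevance scores are taken as already computed, and a real system would push the same filter into the search query itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    allowed_groups: frozenset[str]   # who may see this chunk
    source_uri: str                  # provenance carried with the text
    text: str
    score: float                     # relevance, assumed precomputed


@dataclass(frozen=True)
class Caller:
    tenant_id: str
    groups: frozenset[str]


def retrieve(index: list[Chunk], caller: Caller, k: int = 3) -> list[Chunk]:
    """Filter by tenant and permission first, then rank what remains."""
    visible = [
        c for c in index
        if c.tenant_id == caller.tenant_id and c.allowed_groups & caller.groups
    ]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]


# Invented example data.
index = [
    Chunk("acme", frozenset({"legal"}), "doc://acme/msa", "Termination needs 30 days notice.", 0.91),
    Chunk("acme", frozenset({"finance"}), "doc://acme/forecast", "Q3 forecast draft.", 0.95),
    Chunk("globex", frozenset({"legal"}), "doc://globex/nda", "Mutual NDA terms.", 0.99),
]
caller = Caller("acme", frozenset({"legal"}))
for hit in retrieve(index, caller):
    print(hit.source_uri, "->", hit.text)
```

The highest-scoring chunk in the index belongs to another tenant and never reaches the ranking step. Filtering after ranking would return the same result here, but it would also let out-of-scope text flow through re-ranking code that has no reason to see it.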

Stage 4: Evaluation — The Harness That Gates Releases

Before anything ships, the definition of done from Stage 1 becomes executable.

The eval harness is the most undervalued component in enterprise AI and the one that most reliably separates a system that stays correct from one that erodes. It is built before launch, not after the first incident.
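A release gate can be sketched as two small functions: one that scores a candidate system against fixed cases, and one that compares the score with a floor and with the last released baseline. The substring check, the 0.9 floor, and the 0.02 tolerance are illustrative assumptions; real harnesses usually use richer graders.

```python
from typing import Callable


def pass_rate(system: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases whose expected substring appears in the answer."""
    passed = sum(1 for question, expected in cases if expected in system(question))
    return passed / len(cases)


def gate(candidate_rate: float, baseline_rate: float,
         floor: float = 0.9, tolerance: float = 0.02) -> bool:
    """Allow release only above the floor and without regression vs. baseline."""
    return candidate_rate >= floor and candidate_rate >= baseline_rate - tolerance


# Invented cases and a fake candidate system for illustration.
cases = [
    ("What is the notice period?", "30 days"),
    ("Who signs the MSA?", "General Counsel"),
]
canned = {
    "What is the notice period?": "The notice period is 30 days.",
    "Who signs the MSA?": "The CFO signs it.",
}
rate = pass_rate(lambda q: canned[q], cases)
print(f"pass rate {rate:.2f}, release allowed: {gate(rate, baseline_rate=1.0)}")
```

In a pipeline, the boolean from `gate` would typically become the process exit code, so a regression blocks the deploy step instead of producing a report someone has to remember to read.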

Stage 5: Security and Auth — Identity Through the Whole Stack

Security is not a stage you can append. By Stage 5 it has already shaped retrieval scoping (Stage 3) and data classification (Stage 2). Here it is enforced end to end.
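Two pieces of gateway auth can be sketched with the standard library alone: selecting the right key from a JWKS document by the token's `kid`, and turning verified claims into an immutable identity that travels with the request. This sketch deliberately does NOT verify the signature; that step belongs to a vetted JOSE/JWT library and must happen between the two functions before any claim is trusted. The `tenant_id` claim, the issuer and audience values, and the allowed algorithm set are assumptions for the example.

```python
import base64
import json
from dataclasses import dataclass

ALLOWED_ALGS = {"RS256", "ES256"}  # assumption: asymmetric algorithms only


def _b64url_decode(segment: str) -> bytes:
    """Decode base64url text that has had its padding stripped."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def select_jwk(token: str, jwks: dict) -> dict:
    """Find the JWKS entry whose `kid` matches the token header."""
    header = json.loads(_b64url_decode(token.split(".")[0]))
    if header.get("alg") not in ALLOWED_ALGS:
        raise ValueError("algorithm not allowed")
    kid = header.get("kid")
    if kid is None:
        raise ValueError("token header has no kid")
    for key in jwks.get("keys", []):
        if key.get("kid") == kid:
            return key
    raise LookupError("no JWKS key matches the token's kid")


@dataclass(frozen=True)
class Identity:
    """What downstream services receive instead of the raw token."""
    subject: str
    tenant_id: str


def identity_from_claims(claims: dict, issuer: str, audience: str, now: int) -> Identity:
    """Check standard claims. Call only AFTER signature verification."""
    if claims.get("iss") != issuer:
        raise PermissionError("unexpected issuer")
    aud = claims.get("aud")
    if audience not in (aud if isinstance(aud, list) else [aud]):
        raise PermissionError("token not intended for this service")
    if claims.get("exp", 0) <= now:
        raise PermissionError("token expired")
    return Identity(subject=claims["sub"], tenant_id=claims["tenant_id"])


# Hand-built header and placeholder JWKS, for illustration only.
raw_header = json.dumps({"alg": "RS256", "kid": "k2"}).encode()
header = base64.urlsafe_b64encode(raw_header).rstrip(b"=").decode()
token = f"{header}.e30.signature-placeholder"
jwks = {"keys": [{"kid": "k1", "kty": "RSA"}, {"kid": "k2", "kty": "RSA"}]}
print(select_jwk(token, jwks)["kid"])

claims = {"iss": "https://idp.example", "aud": "ai-gateway",
          "exp": 2000, "sub": "u1", "tenant_id": "acme"}
print(identity_from_claims(claims, "https://idp.example", "ai-gateway", now=1000))
```

Passing an `Identity` value down the stack, not the token, is one way to keep the tenant and subject attached to every retrieval call and audit record without each service re-parsing credentials.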

Stage 6: Deploy — Boring on Purpose

Deployment is intentionally unremarkable, because remarkable deployments are usually the bad kind.
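One unremarkable way to phase a rollout is to assign each tenant a stable bucket from a hash and widen the percentage over time. The sketch below assumes rollout happens per tenant and that a salt string identifies the release; both are illustrative choices, and the tenant names are invented.

```python
import hashlib


def rollout_bucket(tenant_id: str, salt: str = "release-1") -> int:
    """Map a tenant to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{tenant_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_phase(tenant_id: str, percent: int, salt: str = "release-1") -> bool:
    """True if this tenant is inside the current rollout percentage."""
    return rollout_bucket(tenant_id, salt) < percent


tenants = ["acme", "globex", "initech"]
for percent in (0, 25, 100):
    enabled = [t for t in tenants if in_phase(t, percent)]
    print(f"{percent:>3}% -> {enabled}")
```

Because the bucket depends only on the salt and the tenant, a tenant that is enabled at 25% stays enabled at 50%, and rolling back is a matter of lowering one number.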

Stage 7: Handoff — The Client Owns It

The engagement is not complete when the system runs. It is complete when the client’s own team can operate, extend, and reason about it without us.

A handoff that leaves the client dependent on the consultancy is a failed handoff regardless of how well the system runs.

The Sequence, on One Page

Stage                | Decision                                  | The non-obvious point
1. Requirements      | What it must do; what “done” means        | The doc should say “no” to architectures
2. Data              | Ready vs. present                         | This sets the real timeline
3. Retrieval & model | Hybrid retrieval; model router            | The model is replaceable; the interface is not
4. Evaluation        | Harness that gates releases               | Prevents silent regression after launch
5. Security & auth   | JWT/JWKS, identity propagation, isolation | Isolation is the product in multi-tenant AI
6. Deploy            | Declarative infra, phased rollout         | Boring on purpose
7. Handoff           | Full IP transfer, runbooks                | Client owns it, no lock-in

Applicability

This reference architecture applies to organizations standing up their first production AI system, teams converting a successful prototype into a multi-tenant product, and regulated or federal environments where auth, isolation, and auditability are hard constraints rather than nice-to-haves.

It is deliberately model-agnostic. The provider landscape will change; the sequence will not.


This framework reflects the architecture patterns our team applies when standing up production AI systems in enterprise and regulated-industry engagements. Client-specific details are anonymized; it is published as a reference architecture for organizations planning a first production AI deployment.

Founder · Principal Engineer
Data & AI engineer · 10+ yrs hands-on

Writes most of the long-form here. Lives in the codebase. Active on GitHub and LinkedIn.
