shipping production AI · since 2020 · NAICS 541511 / 541512 / 541519 · CMMC-aware
case-01 · commercial · multi-tenant LLM SaaS · AWS · Bedrock · Lambda · 2025 · 11 wks

PrivateStack — 0 → production in 11 weeks.

A private, multi-tenant LLM platform for regulated industries — auth, billing, retrieval, routing across five providers, observability — built end-to-end and handed off with a runbook a third party could operate.

delivery: 11 wks (0 → customer-zero)
endpoints: 66 (production API surface)
p95 latency: 181ms (/chat route, last 30d)
uptime · 12mo: 99.94% (since hand-off)
providers routed: 5 (cost-optimized via LiteLLM)

The brief
Engagement type: AI Engineering · fixed-fee · 11 wk
Industry: Regulated B2B SaaS
Stage: Series B
Team: Principal + Staff Eng + SME
Cadence: Weekly demo · decision log
Cloud: AWS us-east-1
Hand-off: Full IP + 23-pg runbook

From a Notion doc to customer-zero in eleven weeks.

A regulated-industry B2B SaaS needed a private, multi-tenant LLM platform their security team could approve. They'd tried two earlier engagements with body-shop consultancies. One delivered a deck. The other delivered a Streamlit demo.

We started from the brief, not from a template. Week one was a threat model and architecture review. Week eleven was customer-zero in production. The handoff included a 23-page runbook, full IP transfer, and a 30-day post-launch support window.

The system has run for twelve months since hand-off without our intervention. Their on-call team operates it from the runbook we left.

§01 Architecture.
Production reference.

Every box has a runbook.
Every arrow has an ADR.

AWS-native by default. Bedrock as the primary inference path, LiteLLM as the routing layer, pgvector for retrieval. Per-tenant cost ceilings and observability throughout.

fig 01 · v0.84 · PrivateStack production reference architecture
→ ingress & auth: edge (CloudFront) · waf (AWS WAF) · identity (Clerk · JWT · OIDC) · api gateway (66 routes)
→ compute (λ): auth + tenant · inference (LiteLLM router) · billing + usage
→ data & eval: store (pgvector · BM25) · models (Bedrock · vLLM fallback) · evals (CI · 842 golden)
→ observability & ops: traces (per-tenant) · logs (structured) · cost (route-attributed) · runbook (23 pp)

§02 Eleven weeks, in order.
What we shipped, when.

No surprises after week one.

Fixed-scope, written decision log on every call, weekly demo on Friday. The scope doc we signed in week zero matches the artifacts handed off in week eleven.

W0

Scope.

Discovery call. Written scope, deliverables, milestones, fee. 48-hour fixed-fee quote.

Out: Scope doc · MSA · NDA
W1

Architecture & threat model.

System diagram, data-flow, ADR-001 (routing strategy), threat model. End-of-week demo of the scaffolding.

Out: Arch diagram · TM doc · ADRs 1–4
W2

Auth + tenant schema.

Clerk JWT integration, tenant table, RLS policies, API skeleton with three live routes. CI green.

Out: Auth service · tenant model
W3–5

Build out the 66 endpoints.

Inference routes, admin routes, embedding routes. LiteLLM router with five providers, cost ceilings per tenant, fallback strategy.

Out: 66 routes · LiteLLM config
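
The routing behavior described for weeks 3–5 (ordered providers, a fallback strategy, per-tenant cost ceilings) can be sketched in plain Python. This is an illustration of the idea only, not the LiteLLM configuration that was shipped: `Provider`, `Router`, `CeilingExceeded`, and the flat per-call prices are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class CeilingExceeded(Exception):
    """Raised when no provider fits inside the tenant's remaining budget."""


@dataclass
class Provider:
    name: str
    cost_per_call_usd: float        # hypothetical flat price, for illustration only
    call: Callable[[str], str]      # stand-in for a real provider client


@dataclass
class Router:
    providers: List[Provider]       # ordered: preferred provider first
    ceilings_usd: Dict[str, float]  # per-tenant spend ceiling
    spent_usd: Dict[str, float] = field(default_factory=dict)

    def complete(self, tenant_id: str, prompt: str) -> str:
        spent = self.spent_usd.get(tenant_id, 0.0)
        ceiling = self.ceilings_usd.get(tenant_id, 0.0)
        errors: List[str] = []
        affordable = False
        for provider in self.providers:
            # Skip providers whose price would push the tenant past its ceiling.
            if spent + provider.cost_per_call_usd > ceiling:
                continue
            affordable = True
            try:
                reply = provider.call(prompt)
            except Exception as exc:
                # Record the failure and fall through to the next provider.
                errors.append(f"{provider.name}: {exc}")
                continue
            # Charge the tenant only for the call that succeeded.
            self.spent_usd[tenant_id] = spent + provider.cost_per_call_usd
            return reply
        if not affordable:
            raise CeilingExceeded(tenant_id)
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In this sketch a failed call is not billed, and a tenant at its ceiling gets an explicit error rather than a silent downgrade. A production router would also need token-based pricing, budget periods, and persistence, none of which are shown here.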
W6

Retrieval + eval harness.

pgvector + BM25 hybrid retrieval. Eval harness with 842 golden cases. Drift baseline established.

Out: Retrieval lib · eval harness
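
Hybrid retrieval needs some way to merge the vector ranking and the BM25 ranking into one list. The case study does not say which method was used, so the sketch below assumes reciprocal rank fusion (RRF), a common choice because it works on ranks alone and does not require the two scores to be on the same scale. The function name and the constant `k = 60` are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Larger k flattens the advantage of top ranks.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: one from a vector search, one from BM25.
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that appear in both lists ("a" and "b" here) rise above documents that appear in only one, which is the usual reason to combine dense and lexical retrieval.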
W7–8

Billing, usage, admin console.

Stripe billing, per-tenant usage metering, admin console for support. PII scrubbing wired through every route.

Out: Billing · usage · admin UI
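
As a rough illustration of what "PII scrubbing wired through every route" can mean, the sketch below redacts two pattern types before text is logged or sent to a provider. The patterns covered (email addresses and US Social Security numbers), the placeholder tokens, and the `scrub` function are assumptions for the example; the case study does not list which PII classes the production scrubber handles or how it detects them.

```python
import re

# Deliberately simple patterns, for illustration only. Real scrubbers usually
# cover more classes (names, phone numbers, account ids) and more formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scrub(text: str) -> str:
    """Replace matched PII with fixed placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


clean = scrub("mail jo@example.com, ssn 123-45-6789")
```

Applying one function at a shared middleware layer, rather than per route, is one way to make "every route" hold as new routes are added.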
W9

Security review & red-team.

IAM hardening, secrets rotation, prompt-injection red-team, fixes for 8 findings. Bedrock IAM clean.

Out: Red-team report · IAM policies
W10

Observability & load test.

Traces, logs, cost dashboards per tenant. Alerting wired. Load test to 5× expected peak. p95 stable.

Out: Dashboards · alerts · load report
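
For readers checking a load report against a latency target, p95 can be computed from raw samples with the nearest-rank method, sketched below. The sample values are synthetic, and the `percentile` helper is a hypothetical name; the case study does not say which percentile definition its dashboards use, and other definitions (for example linear interpolation) give slightly different numbers.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)
within_slo = p95 <= 500  # 500ms is the SLO quoted in the outcomes section
```

Because p95 ignores the slowest 5% of requests, a load test that reports "p95 stable" is usually read alongside p99 or max latency as well.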
W11

IP transfer & customer-zero launch.

Full IP transfer. 23-page runbook handed to their on-call. Customer-zero traffic enabled. 30-day support clock starts.

Out: Runbook · IP transfer · launch
"We expected a deck and got a deploy. The runbook outlived three of our engineers."
CTO, anonymized · Series-B SaaS · reference on request
§03 Outcomes.
Numbers, attached.

A working system, not a slide deck.

Metrics on the day we handed off, and what they look like twelve months later. References available on request.

uptime · 12mo: 99.94%. Since hand-off, twelve months running, on their on-call.
p95 latency: 181ms. /chat route. Stable across model swaps, well under the 500ms SLO.
cost / 1k req: −42% vs. all-OpenAI baseline, after LiteLLM routing went live.
on-call burden: ~0/wk. Their engineers operate from the runbook. We've answered fewer than three follow-ups in twelve months.
§04 The stack.
What's in production today.

AWS-native. Boringly chosen.

No exotic infrastructure. Every choice is one that a customer's own on-call team can operate without us.

Auth & identity
  • Clerk
  • JWT · OIDC
  • Tenant RLS
  • Cognito as fallback
API & compute
  • API Gateway · 66 routes
  • Lambda (Node 20)
  • Step Functions
  • SQS · EventBridge
Models & retrieval
  • Bedrock (primary)
  • LiteLLM · 5 providers
  • pgvector + BM25
  • vLLM fallback
Ops & security
  • OpenTelemetry traces
  • Per-tenant cost dashboards
  • WAF · GuardDuty
  • Runbook (23pp)
§05 Engage.

Got something that looks like this?

We'll tell you in 48 hours whether it's a fit, scope it if it is, and refer you elsewhere if it isn't. A principal reads every message.

Scope a call · Email the principal

Other engagements

case-02 · federal · 24 mo
Federal contract intelligence platform.

SAM.gov ingestion · MongoDB buyer maps · automated proposal drafting.

case-03 · commercial · ongoing
24/7 algorithmic trading inference.

GPU-shared inference · MongoDB time-series · risk circuit breakers.

case-04 · commercial · 8 wks
Computer-vision surveillance pipeline.

OSNet / InsightFace re-ID across 8 RTSP streams · edge GPU.
