
Did Agents Kill Vector Search? The Honest, Scale-Dependent Answer

The 2026 take that filesystem agents killed vector databases is half right and dangerously oversimplified. The honest engineering answer depends on your scale threshold, and production converges on hybrid retrieval.

DSE-Experts
Operator-led practice
May 28, 2026
8 min · 1,748 words

A controlled benchmark published by LlamaIndex in 2026 set off the loudest data-infrastructure argument of the year. The headline framing — “Did Filesystem Tools Kill Vector Search?” — was catnip for a feed that loves a clean reversal. A capable agent with filesystem access, the story went, can simply read your documents the way a human would, reason over them, and skip the entire apparatus of embeddings, chunking, and approximate-nearest-neighbor indexes. Vercel reinforced the narrative when it stripped roughly 80% of the specialized tools out of a text-to-SQL agent once that agent could explore the filesystem on its own.

The contrarian conclusion that followed — “you don’t need a vector database anymore, just use files and agents” — is the kind of statement that gets a lot of engagement and quietly wrecks a lot of architectures. It is half right. And being half right about an infrastructure decision is how teams end up rebuilding their retrieval layer twice.

Executive Summary. The “agents killed vector search” claim is real but narrow. Filesystem-plus-agent retrieval genuinely wins for small, complex, reasoning-heavy corpora where quality matters more than latency. Vector databases win decisively at scale, on latency, and under concurrency. Serious production systems do not choose — they layer: a vector database narrows millions of candidates to dozens in milliseconds, and an agent reasons over that shortlist. The right question is not “which architecture” but “what is my scale threshold.” Below we give you the numbers to answer that, plus the decision table we hand to clients.

What the Benchmark Actually Showed

The LlamaIndex study is worth reading carefully, because the people quoting it usually stop at the first finding. At small scale, the filesystem-explorer agent did outperform vector RAG on output quality: it scored 8.4 versus 6.4 on correctness and 9.6 versus 8.0 on relevance, both on a 1–10 scale. That is not a rounding error. When an agent can read a modest corpus end to end, it reasons about the material instead of stitching together retrieved fragments, and the answers are simply better.

But the same benchmark recorded the cost. The vector RAG pipeline was about 3.8 seconds faster — 7.36s versus 11.17s — because the agent burned time and tokens exploring, re-reading, and backtracking. And critically, when the corpus grew, the result inverted. In the authors’ own words, “scaling is easier with RAG than with agentic file search,” with RAG pulling ahead substantially on speed and slightly on correctness once the document set grew large.

So the honest reading is not “agents won.” It is: filesystem agents win on quality at small scale and lose on speed and scaling, while vector retrieval wins on speed and scales gracefully. That is a trade-off, not a verdict.
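To make the trade-off concrete, here is a small worked calculation using only the figures quoted above. The dictionary layout and variable names are illustrative; nothing here comes from the benchmark's own code.

```python
# Figures quoted above from the LlamaIndex benchmark (small-corpus run).
agent = {"correctness": 8.4, "relevance": 9.6, "latency_s": 11.17}
vector_rag = {"correctness": 6.4, "relevance": 8.0, "latency_s": 7.36}

# Quality points gained by the agent on the 1-10 scale.
quality_gain = {
    metric: round(agent[metric] - vector_rag[metric], 2)
    for metric in ("correctness", "relevance")
}

# Extra seconds per query paid for that gain, and the relative slowdown.
latency_gap_s = round(agent["latency_s"] - vector_rag["latency_s"], 2)
slowdown = round(agent["latency_s"] / vector_rag["latency_s"], 2)

print(quality_gain)
print(latency_gap_s)
print(slowdown)
```

Read this way, the small-scale result is roughly two quality points bought with about half again as much latency per query. Whether that is a good trade depends entirely on the latency budget.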

The Performance Reality at Scale

The reason vector databases exist is that filesystem scanning has no answer for sub-200ms retrieval across millions of items for thousands of concurrent users. The 2025–26 latency numbers make the gap concrete.

| System / scenario | Measured result |
|---|---|
| Redis @ 1M vectors | ~5ms latency |
| Milvus @ 1M vectors | ~8ms latency |
| Weaviate @ 1M vectors | ~12ms latency |
| Qdrant @ 1M vectors | ~15ms latency |
| Chroma @ 1M vectors | ~20ms latency |
| Pinecone serverless | ~47% avg latency reduction vs pod clusters; up to 85% on Cohere-768 |
| MongoDB Atlas Vector Search | 15.3M vectors (2048-dim) at <50ms, 90–95% accuracy with quantization |
| Single HNSW index ceiling | ~50–100M vectors on one machine before sharding |
| RAG retrieval latency share | ~41% of end-to-end latency in typical RAG workloads |

Two things stand out. First, a tuned vector index answers in single-digit-to-low-double-digit milliseconds at a million vectors — a regime where naive filesystem exploration is simply not a contender. Second, retrieval is not free even in the vector world: roughly 41% of end-to-end RAG latency comes from the retrieval step itself, which is exactly why the index engineering matters.
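One plausible reason for the single-machine HNSW ceiling in the table is memory. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes the common rule of thumb of `dim * 4` bytes per float32 vector plus roughly `M * 2 * 4` bytes of graph links per element. The function name, the 768-dimension choice, and `M = 16` are all assumptions for illustration.

```python
def hnsw_memory_gb(num_vectors: int, dim: int, m: int = 16,
                   bytes_per_value: int = 4) -> float:
    """Rough RAM estimate for an in-memory HNSW index, in GB (10^9 bytes).

    Assumes dim * bytes_per_value for the vector payload plus about
    m * 2 * 4 bytes of graph links per element (a rule of thumb only).
    """
    per_vector_bytes = dim * bytes_per_value + m * 2 * 4
    return num_vectors * per_vector_bytes / 1e9


for n in (1_000_000, 50_000_000, 100_000_000):
    gb = hnsw_memory_gb(n, dim=768)
    print(f"{n:>11,} vectors @ 768-dim float32: {gb:7.1f} GB")

# Quantizing each value to 1 byte shrinks the vector payload fourfold.
int8_gb = hnsw_memory_gb(100_000_000, dim=768, bytes_per_value=1)
print(f"100,000,000 vectors @ 768-dim int8: {int8_gb:.1f} GB")
```

Under these assumptions a million vectors fit comfortably in a few gigabytes, while the 50–100M range lands in the hundreds of gigabytes. That is about where a single machine stops being practical without quantization or sharding.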

There is a quality dimension here too, and it cuts against the “just use files” camp in a way that rarely gets mentioned. Filesystem-exploration recall is not tunable. Vector recall is — through HNSW ef_search, IVF nprobe, and similar knobs. Qdrant’s 2026 “Skills for AI Agents” work framed this directly: an agent equipped with tuned vector-search skills improved a task from a 68% to a 100% pass rate, treating vector search as a tunable decision space of memory versus latency, recall versus throughput, and precision versus cost. You cannot tune a grep.
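To illustrate what "tunable recall" means, here is a toy inverted-file (IVF) index in pure Python. It is a teaching sketch, not how any production engine is implemented. Centroids are simply sampled from the data with no k-means step, and the corpus is random. All function names are hypothetical. The point is only that raising `nprobe` scans more of the index and recovers more of the true nearest neighbors.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, n_lists, rng):
    """Toy IVF index: assign every vector to its nearest sampled centroid."""
    centroids = rng.sample(vectors, n_lists)
    lists = [[] for _ in range(n_lists)]
    for idx, v in enumerate(vectors):
        nearest = min(range(n_lists), key=lambda c: dist2(v, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: dist2(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: dist2(query, vectors[i]))[:k]


def brute_force(query, vectors, k):
    """Exact top-k by scanning everything (the ground truth)."""
    return sorted(range(len(vectors)), key=lambda i: dist2(query, vectors[i]))[:k]


def recall_at_k(vectors, queries, centroids, lists, k, nprobe):
    """Fraction of true top-k neighbors that the IVF search returns."""
    hits = 0
    for q in queries:
        truth = set(brute_force(q, vectors, k))
        found = set(ivf_search(q, vectors, centroids, lists, k, nprobe))
        hits += len(truth & found)
    return hits / (k * len(queries))


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
queries = [[rng.random() for _ in range(8)] for _ in range(20)]
centroids, lists = build_ivf(vectors, n_lists=16, rng=rng)

for nprobe in (1, 2, 4, 8, 16):
    r = recall_at_k(vectors, queries, centroids, lists, k=10, nprobe=nprobe)
    print(f"nprobe={nprobe:>2}  recall@10={r:.2f}")
```

Recall never falls as `nprobe` rises and reaches 1.0 once every list is probed, at the price of scanning more candidates per query. HNSW's `ef_search` exposes the same kind of dial. An agent walking a directory tree has no equivalent parameter.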

The Reliability Argument the Hype Skips

The case for agents-as-retrievers leans on the assumption that the agent reliably finds the right material. Current evidence is sobering. General agent success on complex multi-step tasks sat in the 40–60% range in early 2026, and reliability degrades with repetition: one analysis found success dropping from ~60% on a single run to ~25% across eight runs. An unconstrained filesystem explorer inherits that variance directly. A vector first-stage retriever, by contrast, gives the agent a deterministic, tunable shortlist to reason over — which is precisely why it raises the reliability floor.
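A simple baseline helps put the repetition numbers in context. The sketch below assumes every run is independent, which is an idealization. The cited analysis may define "success across eight runs" differently, and its methodology is not reproduced here.

```python
def all_runs_succeed(p_single: float, runs: int) -> float:
    """Probability that every one of `runs` attempts succeeds,
    assuming attempts are independent (an idealization)."""
    return p_single ** runs


for p in (0.60, 0.80, 0.95, 0.99):
    print(f"single-run {p:.0%} -> all 8 runs {all_runs_succeed(p, 8):.1%}")
```

Under strict independence, a 60% single-run rate would leave under 2% of tasks succeeding eight times in a row. The reported figure of about 25% is higher, which is what you would expect if failures cluster on the same hard tasks rather than striking at random. Either way, the compounding is steep. A retrieval stage that returns the same shortlist every time removes one source of run-to-run variance.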

This is the core engineering insight buried under the hype: quality, not infrastructure, is now the bottleneck for scaling agents. Removing the vector database does not remove that bottleneck. It often makes it worse, because the agent spends its budget re-reading and trial-and-error exploring instead of reasoning.

The Cost Story Nobody Puts on the Slide

“Just use files and agents” is frequently sold as the cheaper option. At small scale it can be. At production traffic volumes the comparison inverts, because every exploratory read costs tokens and compute. Re-reading a corpus on each query is a recurring bill, not a one-time index build.
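The inversion is easy to model. The sketch below is a minimal cost comparison in which every number is a placeholder chosen for illustration: the token price, the tokens consumed per query, and the hosting fee are assumptions, not vendor quotes or measurements. Substitute your own figures.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    # All values are hypothetical placeholders; replace with measured numbers.
    usd_per_million_tokens: float = 3.0
    explore_tokens_per_query: int = 60_000    # agent reads and re-reads files
    shortlist_tokens_per_query: int = 4_000   # agent reads retrieved chunks only
    index_hosting_usd_per_month: float = 70.0

    def agent_only_monthly(self, queries: int) -> float:
        """Recurring token spend when the agent explores files on every query."""
        return queries * self.explore_tokens_per_query * self.usd_per_million_tokens / 1e6

    def vector_first_monthly(self, queries: int) -> float:
        """Token spend on the shortlist plus the fixed index hosting fee."""
        tokens = queries * self.shortlist_tokens_per_query * self.usd_per_million_tokens / 1e6
        return tokens + self.index_hosting_usd_per_month

    def break_even_queries(self) -> float:
        """Monthly query volume above which the vector-first path is cheaper."""
        saved_tokens = self.explore_tokens_per_query - self.shortlist_tokens_per_query
        saved_per_query = saved_tokens * self.usd_per_million_tokens / 1e6
        return self.index_hosting_usd_per_month / saved_per_query


model = CostModel()
for q in (100, 1_000, 10_000, 100_000):
    print(q, round(model.agent_only_monthly(q), 2), round(model.vector_first_monthly(q), 2))
print("break-even queries/month:", round(model.break_even_queries()))
```

With these placeholder inputs the files-only path is cheaper at a hundred queries a month and more expensive well before a thousand. The exact crossover will differ for your workload, but the shape is the same: a fixed index cost against a per-query token cost.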

The broader spend picture should make any data leader cautious about hand-waving infrastructure decisions. Global AI investment reached an estimated $202 billion in 2025 (American Action Forum), and Deloitte’s 2025 AI Infrastructure Survey flags data-center, grid, and supply-chain capacity as the binding constraints on scaling AI. Against that backdrop, the waste is staggering. Lyceum’s 2026 analysis of GPU over-provisioning found that AI teams waste roughly 32% of their GPU budget and that organizations overshoot their cloud budgets by about 17% on average, with roughly a third of that overspend being pure waste — idle or underutilized capacity from misaligned scheduling, pipeline bottlenecks, and poorly tuned workloads.

The lesson for retrieval architecture is the same as the lesson for GPU fleets: capacity decisions made on narrative instead of measured workload are where the money leaks. The team that “skips the vector database” to save on managed-service fees, then quietly triples its token spend on exploratory reads, has not saved anything. It has moved the waste somewhere harder to see on the invoice.

The Public-Sector Wrinkle: Governance Favors Determinism

For the federal and regulated clients DSE works with, the calculus shifts again. Governance requirements routinely demand fine-grained access control, auditing, and queries that combine semantic similarity with structured metadata filters. Deterministic vector retrieval with explicit metadata is far easier to audit and access-control than an agent freely exploring a filesystem.
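As a rough illustration of what "deterministic retrieval with explicit metadata" buys you, the sketch below applies access control and metadata filters before similarity ranking and writes an audit record for each query. The record schema, field names, and `governed_search` function are hypothetical. A real deployment would push the filter into the vector database's own filtering mechanism and send audit events to a proper log sink.

```python
import json
import math
from datetime import datetime, timezone


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Hypothetical records: an embedding plus structured metadata.
DOCS = [
    {"id": "doc-1", "vec": [0.9, 0.1], "classification": "public", "office": "hr"},
    {"id": "doc-2", "vec": [0.8, 0.2], "classification": "restricted", "office": "hr"},
    {"id": "doc-3", "vec": [0.1, 0.9], "classification": "public", "office": "finance"},
]

AUDIT_LOG = []


def governed_search(user, allowed_classifications, query_vec, filters, k=2):
    """Filter by access level and metadata first, rank by similarity second,
    then record who asked, which filters applied, and what was returned."""
    visible = [
        d for d in DOCS
        if d["classification"] in allowed_classifications
        and all(d.get(field) == value for field, value in filters.items())
    ]
    ranked = sorted(visible, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:k]
    result_ids = [d["id"] for d in ranked]
    AUDIT_LOG.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "allowed": sorted(allowed_classifications),
        "filters": filters,
        "returned": result_ids,
    })
    return result_ids


print(governed_search("analyst-7", {"public"}, [1.0, 0.0], {"office": "hr"}))
print(json.dumps(AUDIT_LOG[-1], indent=2))
```

Because the restricted document is excluded before ranking, it can never reach the agent's context for a user who is not cleared for it, and the audit entry shows exactly which filter enforced that.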

Filesystems do offer one governance advantage — a natural, versionable, inspectable audit trail, which is part of why early public-sector explorations (such as the National Geospatial-Intelligence Agency examining AI in HR workflows) find them appealing. But data-residency and compliance mandates frequently require on-premises infrastructure, and in those settings self-hosted, open-source vector databases are often the only viable path regardless of cost. The governance question is rarely “files or vectors” in isolation; it is “can I prove what was retrieved, why, and who was allowed to see it.” That requirement bends toward controlled retrieval.

The Decision Heuristics

Here is the table we use with clients. It replaces the architecture debate with a scale-threshold question, which is the only version of the question that has a defensible answer.

| Situation | Winner |
|---|---|
| < ~10k docs / 1–2 GB, single team, latency-tolerant | Files + agents (simpler, higher quality on deep reasoning) |
| > 100k docs / > 1M chunks, interactive UX, multi-tenant | Vector DB (only robust option for sub-200ms at scale) |
| In between | Vector-capable extension (pgvector, Atlas, Redis) as on-ramp |
| Production at any real traffic | Hybrid: vector DB narrows millions → dozens in ms, agent reasons over the candidates |
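The table above can be encoded as a small function, which is handy for sanity-checking a proposed architecture in a design review. This is one possible reading of the rows, not a formal rule set. The function name and parameters are hypothetical, the corpus-size-in-GB criterion is left out for brevity, and the thresholds should be treated as starting points.

```python
def recommend_retrieval(num_docs: int, num_chunks: int, interactive: bool,
                        multi_tenant: bool, production_traffic: bool) -> str:
    """Map a workload description to a row of the decision table."""
    # Production traffic overrides the size rows: layer both stages.
    if production_traffic:
        return "hybrid: vector DB first stage, agent second stage"
    # Large corpora need a dedicated vector database.
    if num_docs > 100_000 or num_chunks > 1_000_000:
        return "vector DB"
    # Small, single-team, latency-tolerant corpora can stay on files.
    if num_docs < 10_000 and not interactive and not multi_tenant:
        return "files + agents"
    # Everything else is the "in between" on-ramp row.
    return "vector-capable extension (pgvector, Atlas, Redis)"


print(recommend_retrieval(5_000, 40_000, False, False, False))
print(recommend_retrieval(50_000, 300_000, True, False, False))
print(recommend_retrieval(500_000, 5_000_000, True, True, False))
print(recommend_retrieval(500_000, 5_000_000, True, True, True))
```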

The “in between” row deserves emphasis because it is where most organizations actually live. You do not need to stand up Pinecone or Milvus to get on the curve. If you already run PostgreSQL, pgvector (with pgvectorscale) is frequently the most cost-effective on-ramp up to tens of millions of vectors — and it keeps your retrieval data inside infrastructure you already govern.

What This Means For You

The viral take got one true thing right: for a small, complex corpus where reasoning quality dominates and latency does not matter, a filesystem agent can beat a vector pipeline. If that is your situation, do not over-engineer — files and agents are simpler and they win.

But three things follow that the hype leaves out. First, that advantage evaporates as you scale; vector retrieval is the only robust path to sub-200ms answers across millions of items for many concurrent users. Second, the “cheaper” framing is an illusion at real traffic, where exploratory reads convert directly into token and compute spend — the same over-provisioning dynamic that already wastes roughly a third of enterprise GPU budgets. Third, in regulated and federal environments, controlled vector retrieval is easier to audit and access-control than unconstrained exploration.

The destination for serious production systems is not one or the other. It is hybrid: a vector database does the brutal first-stage narrowing from millions to dozens in milliseconds, and the agent does the second-stage reasoning over a small, tunable, defensible candidate set. You get the agent’s quality and the index’s speed, reliability, and governance.
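In outline, the two-stage shape looks like the sketch below. Everything in it is a stand-in. `first_stage` does an exhaustive cosine ranking where a real system would query an approximate-nearest-neighbor index, and `toy_reasoner` is a word-overlap placeholder where a real system would call an agent or LLM. The names and the three-document index are invented for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def first_stage(query_vec: Vector, index: List[Tuple[str, Vector, str]],
                top_k: int) -> List[Tuple[str, str]]:
    """Stand-in for the vector database: narrow the corpus to top_k candidates."""
    ranked = sorted(index, key=lambda row: cosine(query_vec, row[1]), reverse=True)
    return [(doc_id, text) for doc_id, _, text in ranked[:top_k]]


def second_stage(question: str, shortlist: List[Tuple[str, str]],
                 reason: Callable[[str, List[str]], str]) -> dict:
    """Give the agent only the shortlist; keep candidate ids for auditing."""
    answer = reason(question, [text for _, text in shortlist])
    return {"answer": answer, "candidates": [doc_id for doc_id, _ in shortlist]}


def toy_reasoner(question: str, passages: List[str]) -> str:
    """Placeholder for an agent call: pick the passage with most shared words."""
    q_words = set(question.lower().split())
    return max(passages, key=lambda p: len(q_words & set(p.lower().split())))


INDEX = [
    ("a", [1.0, 0.0], "pgvector runs inside PostgreSQL"),
    ("b", [0.9, 0.3], "HNSW trades memory for recall"),
    ("c", [0.0, 1.0], "quarterly travel policy update"),
]

shortlist = first_stage([1.0, 0.1], INDEX, top_k=2)
result = second_stage("what does HNSW trade for recall", shortlist, toy_reasoner)
print(result)
```

The design point is the interface between the stages. The agent never sees the full corpus, the shortlist size is a parameter you control, and the candidate ids returned alongside the answer are what make the result defensible afterward.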

The firms that win this decision are not the ones with the strongest opinion. They are the ones who did the boring math — counted their documents, measured their latency budget, and modeled their token cost — before they picked a side.

That boring math is what we do. If your team is debating retrieval architecture, or quietly bleeding budget on a system that was chosen by narrative instead of by numbers, let’s talk. We will help you find your scale threshold and build the layer your workload actually needs.

Founder · Principal Engineer
Data & AI engineer · 10+ yrs hands-on

Writes most of the long-form here. Lives in the codebase. Active on GitHub and LinkedIn.
