Majid Fekri

5h ago

Where $2.5 Million a Year Actually Goes

When most people think about the cost of AI search, they think about the vector database. But the database is just the tip of the iceberg. Here's what the full cost stack actually looks like for a typical enterprise running retrieval-augmented generation (RAG) at scale:

The vector database cluster is the obvious one. To serve 150 multi-tenant customers with real-time retrieval, this customer was running Qdrant on a fleet of AWS memory-optimized instances (r7g.2xlarge) plus Kubernetes orchestration. Annual cost: ~$591,000. And that infrastructure runs 24/7, whether it's peak hours or 3 AM on a Sunday. You're renting RAM by the year to hold vectors that might get queried once an hour.
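To put the always-on cost in perspective, the annual figure can be converted to an hourly rate. This is a minimal sketch, assuming a 365-day year and a cost spread evenly across every hour; the variable names are illustrative and not taken from the customer's billing data.

```python
# Annual vector database cluster cost quoted above (approximate).
ANNUAL_CLUSTER_COST_USD = 591_000

# Assumption: a 365-day year, with the cluster billed evenly around the clock.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

hourly_cost = ANNUAL_CLUSTER_COST_USD / HOURS_PER_YEAR
print(f"Always-on cost: ~${hourly_cost:,.2f} per hour")
```

Under those assumptions the cluster costs about $67 for every hour it is up, including the hours when almost nothing is being queried.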

The reranking API is the cost nobody budgets for. Traditional vector databases use approximate search: they give you "close enough" results using an approximate nearest-neighbor algorithm called HNSW (Hierarchical Navigable Small World). For enterprise use cases in regulated industries, "close enough" isn't good enough. So teams bolt on a reranking service like Cohere Rerank to improve accuracy after the initial retrieval. That API call on every query, at this volume, costs roughly $1.5 million per year. It's the single biggest line item, and most teams don't see it coming until they're already in production.

The middleware and observability layer adds another surprise. Enterprise RAG requires auditability: you need to trace exactly which documents were retrieved, with what parameters, through what logic. Teams typically add LangSmith or similar observability tooling on top of LangChain, which brings token overhead and tracing costs. For this customer: ~$378,000 per year.
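The three line items above add up to the headline figure. This is a quick sketch of the arithmetic, using the approximate annual costs quoted in this post; the dictionary keys are just labels chosen for illustration.

```python
# Approximate annual costs (USD) quoted in the paragraphs above.
line_items = {
    "Vector database cluster": 591_000,
    "Reranking API": 1_500_000,
    "Middleware and observability": 378_000,
}

total = sum(line_items.values())

# Print each line item with its share of the total annual spend.
for name, cost in line_items.items():
    share = cost / total
    print(f"{name:<30} ${cost:>10,}  {share:6.1%}")

print(f"{'Total':<30} ${total:>10,}")
```

The total comes to $2,469,000, or roughly $2.5 million a year, with the reranking API alone accounting for about 61% of it.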

Majid Fekri

9mo ago

Moorcheh - Search beyond distance. Discover true relevance.

Moorcheh.ai delivers next-gen serverless vector search based on Information-Theoretic principles, empowering developers to build radically efficient and hyper-accurate AI chatbots and RAG systems.