Every tutorial makes RAG look easy: chunk your documents, embed them, store in a vector database, retrieve and generate. In practice, building a production RAG system that's accurate, fast, and cost-effective is a completely different challenge. Here's what we learned building VIZIQO Assist's knowledge retrieval system on LangChain and Qdrant.
Chunking Strategy: The Decision That Affects Everything
Our first chunking approach was naive: split documents into 500-token chunks with 50-token overlap. It worked for demos. In production, it produced terrible results. A customer asking about "refund policy for annual subscriptions" would get chunks from three different sections of the refund policy, none containing the complete answer.
What actually works: semantic chunking based on document structure. We parse documents to identify headings, sections, and logical boundaries, then chunk along those boundaries. A section about refund policy stays together as one chunk, even if it's 800 tokens. We also maintain parent-child relationships — when a chunk is retrieved, we can pull the surrounding context.
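In code, the core of this is just splitting on headings first and only falling back to size-based splits for oversized sections. Here's a minimal sketch using LangChain's MarkdownHeaderTextSplitter, assuming markdown-style source documents; the size threshold (in characters rather than tokens), the ID scheme, and the metadata fields are illustrative, not our exact pipeline:

```python
# Sketch: structure-aware chunking with parent-child metadata.
# Assumes markdown-style docs; sizes here are in characters, not tokens.
import hashlib

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
# Only split a section further if it blows well past the target size.
fallback_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=200)

def chunk_document(doc_id: str, markdown_text: str):
    chunks = []
    for section in header_splitter.split_text(markdown_text):
        # One stable ID per logical section, shared by all of its chunks.
        parent_id = hashlib.sha1(f"{doc_id}:{section.metadata}".encode()).hexdigest()
        pieces = (
            [section]
            if len(section.page_content) <= 3000
            else fallback_splitter.split_documents([section])
        )
        for piece in pieces:
            piece.metadata.update({"doc_id": doc_id, "parent_id": parent_id})
            chunks.append(piece)
    return chunks
```

The parent_id stored on every chunk is what lets us pull the surrounding section back in when a single chunk is retrieved.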
The lesson: spend 40% of your RAG development time on chunking and preprocessing. It's the foundation everything else depends on.
Embedding Costs Will Surprise You
At prototype scale, embedding costs are negligible. At production scale, they add up fast. We're embedding customer knowledge bases that range from 500 to 50,000 documents. Each document gets chunked into 5-20 chunks. Each chunk gets embedded. Every time a customer updates their knowledge base, affected chunks get re-embedded.
Our monthly embedding costs grew 10x faster than our customer count. The solution: aggressive caching (don't re-embed unchanged content), batched embedding operations, and evaluating whether cheaper embedding models (like smaller variants) provide sufficient quality for your use case. We found that for FAQ-style content, a lighter model performed nearly as well as the premium model at one-fifth the cost.
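A minimal sketch of the caching-plus-batching idea, keyed on a content hash so unchanged chunks never hit the embedding API again. The in-memory dict, batch size, and model name are stand-ins; in practice the cache lives in a shared store:

```python
# Sketch: cache embeddings by content hash so unchanged chunks are never re-embedded.
# The in-memory dict, batch size, and model name are illustrative stand-ins.
import hashlib

from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(model="text-embedding-3-small")  # a cheaper model; pick what meets your quality bar
cache: dict[str, list[float]] = {}  # content hash -> vector; use Redis/Postgres in practice

def embed_chunks(texts: list[str], batch_size: int = 256) -> list[list[float]]:
    hashes = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
    missing = [(h, t) for h, t in zip(hashes, texts) if h not in cache]

    # Batch the API calls for whatever isn't cached yet.
    for start in range(0, len(missing), batch_size):
        batch = missing[start:start + batch_size]
        vectors = embedder.embed_documents([t for _, t in batch])
        for (h, _), vec in zip(batch, vectors):
            cache[h] = vec

    return [cache[h] for h in hashes]
```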
Qdrant: The Good and the Gotchas
Qdrant has been excellent overall. It's fast, the filtering capabilities are powerful, and the gRPC API is well-designed. But there were gotchas. Memory consumption scales with both the number of vectors and their dimensionality: switching from 1536-dim to 384-dim embeddings cut our memory usage by 60%. And some collection-level decisions, notably the distance metric and vector dimensionality, are fixed at creation time and can't be changed afterward, so plan carefully.
The biggest operational lesson: always run Qdrant with WAL (Write-Ahead Logging) enabled in production. We had one incident where a node restart lost unsynced data. WAL prevents that.
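Both lessons come down to making configuration explicit up front. Here's a sketch of creating a collection with the qdrant-client, pinning down the settings that can't change later and sizing the write-ahead log deliberately; all parameter values are illustrative, not our production settings:

```python
# Sketch: create a collection with every hard-to-change decision made explicit.
# Parameter values are illustrative.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="kb_chunks",
    vectors_config=models.VectorParams(
        size=384,                          # smaller embeddings, much lower memory
        distance=models.Distance.COSINE,   # fixed at creation time
    ),
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=200),
    on_disk_payload=True,                  # keep large payloads out of RAM
    wal_config=models.WalConfigDiff(wal_capacity_mb=64),  # size the WAL explicitly
)
```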
Latency: The Hidden Killer
Our target was sub-2-second response time for voice calls (users notice anything longer). The naive pipeline — embed query → search Qdrant → pass results to LLM → generate response — consistently hit 3-4 seconds. We had to optimize aggressively.
Key optimizations: pre-warm the embedding model (cold starts added 500ms), use Qdrant's built-in payload filtering to narrow the search space before vector similarity, implement a re-ranking step that's cheaper than LLM generation to filter irrelevant results, and stream the LLM response so the user hears the beginning before generation is complete.
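The payload-filtering step deserves a sketch, since it buys the most per millisecond: restrict the candidate set to one tenant (and, where it makes sense, one document type) before any vector scoring happens. The field names and payload schema here are illustrative, and newer client versions expose the same idea via query_points:

```python
# Sketch: filter by payload before vector scoring to shrink the search space.
# The field names (tenant_id, doc_type) and payload schema are illustrative.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

def retrieve(query_vector: list[float], tenant_id: str, top_k: int = 8):
    return client.search(
        collection_name="kb_chunks",
        query_vector=query_vector,
        query_filter=models.Filter(
            must=[
                models.FieldCondition(key="tenant_id", match=models.MatchValue(value=tenant_id)),
                models.FieldCondition(key="doc_type", match=models.MatchValue(value="faq")),
            ]
        ),
        limit=top_k,
        with_payload=True,
    )
```

Creating keyword payload indexes on the filtered fields (client.create_payload_index) keeps the filter itself cheap.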
After optimization, we consistently hit 1.5-1.8 seconds for the full pipeline. For chat-based interactions where streaming is natural, perceived latency is even lower.
LangChain: Framework vs Building Blocks
We started with LangChain's high-level chains and agents. They're great for prototyping. In production, we gradually replaced most high-level abstractions with custom code that uses LangChain's lower-level components. The high-level abstractions make too many assumptions about flow control, error handling, and retry logic.
Our recommendation: use LangChain for its excellent integrations (document loaders, embedding wrappers, LLM interfaces) but build your own orchestration logic for production workloads. You'll need the control.
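As a concrete example, the production path looks roughly like this: LangChain supplies the embedding and chat model wrappers, while retrieval, prompting, retries, and timeouts are plain Python we own. The model names, prompt, retry policy, and payload field are illustrative, and retrieve is the filtered Qdrant search sketched in the latency section:

```python
# Sketch: LangChain provides the model wrappers; the orchestration stays in code we own.
# retrieve() is the filtered Qdrant search from the latency section; model names,
# prompt, and retry policy are illustrative.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

embedder = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o-mini", timeout=10)

def answer(question: str, tenant_id: str, max_retries: int = 2) -> str:
    query_vector = embedder.embed_query(question)
    hits = retrieve(query_vector, tenant_id)  # filtered vector search, sketched earlier
    context = "\n\n".join(hit.payload["text"] for hit in hits)  # assumes a "text" payload field

    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    for attempt in range(max_retries + 1):
        try:
            return llm.invoke(prompt).content
        except Exception:
            if attempt == max_retries:
                raise
```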
RAG in production is an engineering discipline, not a demo. The gap between "it works in a notebook" and "it works at scale with real users" is enormous. But the results — AI agents that actually answer questions accurately from your data — are worth every hour of optimization.