Why RAG Exists
A language model trained on internet data knows a lot. It does not know your product documentation, your company policies, your internal knowledge base, or anything that wasn't in its training set.
Ask a general LLM a question about your specific business and it either makes something up (hallucination) or tells you it doesn't know. Neither is useful in a production chatbot, and the first one is actively harmful — confident wrong answers do more damage than honest "I don't knows."
Retrieval-Augmented Generation solves this. Instead of relying solely on the model's training data, a RAG system retrieves relevant content from your own sources at query time, injects it into the prompt, and generates an answer grounded in that content.
The result: a chatbot that answers questions about your specific products, policies, and processes using a knowledge base you control and update. This is the architecture we use for the majority of production AI chatbots we ship.
RAG Architecture: The Four Components
Every RAG system has four components:
1. Knowledge base — your source documents: PDFs, markdown files, database records, web pages, support tickets, whatever contains the information the chatbot needs.
2. Vector store — a database that stores your documents as mathematical representations (embeddings) that capture semantic meaning, enabling similarity search.
3. Retriever — the system that takes an incoming query, converts it to an embedding, searches the vector store for similar content, and returns the most relevant chunks.
4. Generator — the LLM that receives the query plus the retrieved chunks and generates a response grounded in the retrieved content.
The flow on every query:
User query → Retriever → Top-k relevant chunks → LLM prompt → Response
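The flow above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not the production stack: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and `build_prompt` stops short of calling an LLM. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Inject the retrieved chunks into the prompt the generator would receive."""
    return "Context:\n" + "\n\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

knowledge_base = [
    "Refunds are available within 30 days of purchase.",
    "Support is open Monday to Friday, 9am to 5pm.",
    "Shipping to Europe takes 5 to 7 business days.",
]
query = "Are refunds available after purchase?"
top_chunks = retrieve(query, knowledge_base, k=1)
prompt = build_prompt(query, top_chunks)
```

Word counts only match exact terms, which is precisely the limitation real embeddings remove; the sketch is meant to show the shape of the pipeline, not its quality.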
Step 1: Prepare Your Documents
Before any code, prepare your knowledge base. This is the step most teams underinvest in, and it's the one that determines how well the system performs.
Collect your source documents. FAQ files, product documentation, policy documents, support articles, internal wikis — whatever the chatbot needs to answer questions from.
Clean and normalise. Remove irrelevant content, fix formatting inconsistencies, ensure headings are clear. The retriever finds relevant chunks based on semantic similarity, and clean, well-structured content retrieves better. Garbage source documents will quietly tank your retrieval quality and you'll spend weeks blaming the model.
Split into chunks. Documents are split into smaller segments before embedding. Chunk size matters: too small and individual chunks lack context; too large and retrieval precision drops. A common starting point is 500–800 tokens with 50–100 token overlap between chunks. We've found you'll usually need to tune this for your specific corpus — there's no universal right answer.
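from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size and chunk_overlap are measured in characters by default.
# To count tokens instead, build the splitter with
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=75,
    separators=["\n\n", "\n", ". ", " "],
)

# `documents` is the list of LangChain Document objects loaded from your sources
chunks = splitter.split_documents(documents)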
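To make the size and overlap trade-off concrete, here is a minimal sliding-window sketch in plain Python. It approximates tokens with whitespace-separated words, which is an assumption for illustration only (real tokenizers count differently), and `chunk_words` is a hypothetical helper, not part of LangChain.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size words; each window repeats
    the last `overlap` words of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
```

Each chunk begins with the last word of the previous chunk, which is what lets a sentence that straddles a boundary still appear whole in at least one chunk.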
Step 2: Build the Vector Store
Each chunk is converted to an embedding — a vector of numbers representing its semantic meaning — and stored in a vector database.
Choose an embedding model. OpenAI's text-embedding-3-small is a solid default. Cohere's embedding models are strong alternatives. For cost-sensitive applications, open-source models like sentence-transformers/all-MiniLM-L6-v2 work well, with the trade-off that you're now hosting the embedding model yourself.
Choose a vector store. Options by use case:
- Pinecone — managed, production-ready, good at scale
- Weaviate — open-source option with strong filtering capabilities
- pgvector — PostgreSQL extension, good if you're already on Postgres
- Chroma — lightweight, good for development and small deployments
- Qdrant — fast, open-source, good filtering
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name="your-index-name",
)
Step 3: Build the Retriever
The retriever handles the query-time search. When a user asks a question, the retriever:
- Converts the query to an embedding using the same model used for documents
- Searches the vector store for chunks with high cosine similarity to the query embedding
- Returns the top-k most relevant chunks (typically k=3 to 5)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)
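For intuition about what the vector store does on each query, the search step can be sketched over a handful of precomputed dense vectors. The three-dimensional vectors and chunk ids below are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a production vector store typically uses an approximate index rather than this brute-force scan.

```python
import heapq
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, stored, k=4):
    """stored: list of (chunk_id, vector). Returns [(chunk_id, score)], best first."""
    scored = ((cid, cosine_similarity(query_vec, vec)) for cid, vec in stored)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical chunk ids and vectors, for illustration only
stored_vectors = [
    ("returns-policy", [0.9, 0.1, 0.0]),
    ("shipping-times", [0.1, 0.9, 0.0]),
    ("office-hours", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], stored_vectors, k=2)
```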
Improving retrieval quality. Basic similarity search is fine to start with. For production, you'll likely want some combination of:
- Hybrid search — combining dense vector search with BM25 keyword search. Handles cases where exact keyword matches matter (product codes, proper nouns) better than pure vector search.
- Re-ranking — a second model scores retrieved chunks for relevance and reorders them. Cohere's Rerank API is commonly used. Adds latency but improves precision.
- Metadata filtering — attach metadata to chunks (source document, category, date) and filter retrieval by metadata before semantic search. Essential when your knowledge base covers multiple distinct domains.
The retrieval step is where most RAG systems quietly underperform. If your chatbot is producing wrong answers, the model is usually not the problem; more often, the wrong context is reaching it.
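One common way to combine the dense and keyword rankings in hybrid search is reciprocal rank fusion (RRF). The article does not say which fusion method to use, so treat this as one option among several. Each ranked list contributes 1 / (k + rank) to a chunk's score, and k = 60 is a conventional default. The chunk ids below are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one list, best first.
    Each list adds 1 / (k + rank) to a chunk's score, with rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["faq-12", "guide-3", "policy-7"]        # hypothetical dense-search ranking
keyword_hits = ["policy-7", "faq-12", "sku-sheet-1"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Chunks that appear in both lists rise to the top, while a chunk found only by keyword search (the exact product code case) still makes it into the candidate set.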
Step 4: Build the Generation Chain
The generator takes the user query and retrieved chunks, formats them into a prompt, and calls the LLM to generate a response.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant for [Company Name]. Answer the user's question
using only the context provided below. If the context does not contain
enough information to answer the question, say so clearly — do not guess.
Context:
{context}
Question: {question}
Answer:
""")
def format_docs(docs):
    # Join the retrieved chunks into one context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
The prompt is doing a lot of work. The instruction to answer only from the provided context is the main guard against hallucination. Without it, the model will quietly supplement retrieved content with its own training data, sometimes accurately, sometimes confidently wrong. The instruction reduces that risk but does not eliminate it, so make it explicit and test that the model actually obeys it.
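One way to test that the model obeys the instruction is to feed it questions the knowledge base cannot answer and check that it declines. The sketch below is a rough harness under several assumptions: `answer_fn` is a hypothetical wrapper around something like `rag_chain.invoke`, the refusal phrases are guesses you would adapt to your own prompt wording, and a stub stands in for the real chain so the example runs offline.

```python
# Phrases treated as a refusal; an assumption to adapt to your own prompt wording
REFUSAL_MARKERS = ("don't have enough information", "not enough information", "cannot answer")

def looks_like_refusal(answer: str) -> bool:
    """Crude phrase match; a real check might use a classifier or an LLM judge."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def check_grounding(answer_fn, out_of_scope_questions):
    """Return the questions that were answered even though they should have been declined."""
    return [q for q in out_of_scope_questions if not looks_like_refusal(answer_fn(q))]

def stub_answer_fn(question: str) -> str:
    # Stand-in for the real chain; swap in a call to your RAG pipeline here
    if "refund" in question.lower():
        return "Refunds are available within 30 days."
    return "I don't have enough information to answer that."

failures = check_grounding(
    stub_answer_fn,
    ["Who won the 1998 World Cup?", "What is the CEO's salary?"],
)
```

An empty `failures` list means every out-of-scope question was declined; anything else is a question worth inspecting by hand.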
Step 5: Add Conversation Memory
A basic RAG chain answers individual questions but forgets the conversation. For a real chatbot, you need memory — the model needs to understand follow-up questions in context.
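from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory

# Note: recent LangChain releases mark these classes as deprecated;
# check the docs for your installed version before relying on them.
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=5,  # remember last 5 exchanges
)

conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    # Reuse the grounding prompt from Step 4; otherwise the chain falls back
    # to its default question-answering prompt.
    combine_docs_chain_kwargs={"prompt": prompt},
    verbose=False,
)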
Memory choices:
- ConversationBufferWindowMemory: keeps the last k exchanges. Simple and effective for most cases.
- ConversationSummaryMemory: summarises older history as it grows. Good for long conversations.
- External storage (Redis, database): for production deployments where memory needs to persist across sessions.
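The windowed approach is simple enough to sketch without a framework. The class below keeps the last k exchanges per session in process memory. `SessionMemory` is a hypothetical name, and a production deployment would back it with Redis or a database as noted above, since this version loses everything on restart.

```python
from collections import defaultdict, deque

class SessionMemory:
    """Keeps the last k (question, answer) exchanges for each session id."""

    def __init__(self, k: int = 5):
        # deque(maxlen=k) silently drops the oldest exchange once k is exceeded
        self._histories = defaultdict(lambda: deque(maxlen=k))

    def add_exchange(self, session_id: str, question: str, answer: str) -> None:
        self._histories[session_id].append((question, answer))

    def history(self, session_id: str) -> list[tuple[str, str]]:
        # .get avoids creating an empty entry for sessions we have never seen
        return list(self._histories.get(session_id, ()))

memory_store = SessionMemory(k=2)
memory_store.add_exchange("alice", "q1", "a1")
memory_store.add_exchange("alice", "q2", "a2")
memory_store.add_exchange("alice", "q3", "a3")
memory_store.add_exchange("bob", "hello", "hi")
```

Keying the history by session id keeps one user's follow-up questions from being interpreted against another user's conversation.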
Step 6: Wire Up an API
Expose your RAG chain as an API endpoint for your frontend to call.
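# FastAPI example
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    # Accepted but not yet used: the single `memory` object above is shared
    # by every caller, so key memory by session_id before going live.
    session_id: str

@app.post("/chat")
def chat(request: ChatRequest):
    # A plain `def` lets FastAPI run the blocking invoke() call in its thread pool
    response = conversational_chain.invoke({
        "question": request.message,
    })
    return {"answer": response["answer"]}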
For Next.js, you can call this endpoint from an API route or use a streaming response for a better user experience. Streaming makes a huge difference in perceived speed and is worth the extra plumbing.
Step 7: Production Considerations
Latency. A RAG pipeline has multiple steps: embedding the query, vector search, LLM generation. Total latency is typically 1–4 seconds depending on model and infrastructure. Streaming the LLM response makes a substantial difference to perceived performance, even if total time is the same.
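The streaming point can be made concrete with a simple additive latency model. The function and its parameter names are hypothetical, and the numbers are illustrative inputs rather than measurements.

```python
def time_to_first_text(retrieval_s: float, first_token_s: float,
                       per_token_s: float, n_tokens: int, streaming: bool) -> float:
    """Seconds until the user sees any text, under a simple additive model."""
    if streaming:
        return retrieval_s + first_token_s
    # Without streaming, nothing is shown until the last token has been generated
    return retrieval_s + first_token_s + per_token_s * (n_tokens - 1)

# Illustrative numbers only, not measurements
blocking_wait = time_to_first_text(0.3, 0.5, 0.02, 100, streaming=False)
streamed_wait = time_to_first_text(0.3, 0.5, 0.02, 100, streaming=True)
```

With these made-up inputs the full answer takes about 2.78 seconds either way, but a streamed response shows its first text after about 0.8 seconds.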
Cost. Each query incurs: an embedding API call (cheap), vector search (very cheap), and an LLM generation call (the main cost). Monitor cost-per-query early — it's much harder to optimise after launch when nobody knows what "normal" looks like. Common levers: reduce the number of chunks retrieved, use smaller models for simpler queries, cache frequent responses.
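Of the levers above, caching is the easiest to sketch. The example below is a minimal exact-match cache keyed on a normalised form of the query. `ResponseCache` and `fake_pipeline` are hypothetical names, and the normalisation rules are an assumption you would tune. Caching like this only suits standalone questions, since a follow-up question depends on the conversation it belongs to.

```python
import re

class ResponseCache:
    """Exact-match cache keyed on a normalised form of the query."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase, strip punctuation, collapse whitespace so trivial variants share a key
        return " ".join(re.sub(r"[^\w\s]", "", query.lower()).split())

    def get_or_compute(self, query: str, compute):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = compute(query)
        return self._store[key]

calls = []

def fake_pipeline(query: str) -> str:
    # Stand-in for the full RAG call; records how often it is actually invoked
    calls.append(query)
    return f"answer to: {query}"

cache = ResponseCache()
first = cache.get_or_compute("What is your refund policy?", fake_pipeline)
second = cache.get_or_compute("what is your  refund policy", fake_pipeline)
```

Tracking `hits` and `misses` gives you the hit rate, which tells you whether the cache is saving enough LLM calls to be worth keeping.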
Evaluation. You need a way to know if the chatbot is answering correctly. Build an eval set of 50–100 representative questions with expected answers. Run the pipeline against them and score the results. RAGAS is a useful framework for automated RAG evaluation. Without an eval set, every prompt tweak is a guess.
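Before adopting a framework, a minimal eval harness can be as small as the sketch below. It scores answers by checking for expected keywords, which is a deliberately crude metric chosen for illustration. The `evaluate` function, the stub pipeline, and the two-question eval set are all hypothetical.

```python
def evaluate(pipeline, eval_set):
    """Run each (question, expected_keywords) pair through the pipeline.
    Returns (pass_rate, failures), where a case passes if every keyword appears."""
    failures = []
    for question, expected_keywords in eval_set:
        answer = pipeline(question).lower()
        missing = [kw for kw in expected_keywords if kw.lower() not in answer]
        if missing:
            failures.append({"question": question, "missing": missing})
    pass_rate = 1 - len(failures) / len(eval_set) if eval_set else 0.0
    return pass_rate, failures

# Stub standing in for the real chatbot so the sketch runs offline
canned_answers = {
    "How long is the refund window?": "Refunds are available within 30 days.",
    "When is support open?": "Our team is available on weekdays.",
}

eval_set = [
    ("How long is the refund window?", ["30 days"]),
    ("When is support open?", ["Monday", "Friday"]),
]
pass_rate, failures = evaluate(canned_answers.get, eval_set)
```

Re-running the same set after every prompt or retrieval change turns "it feels better" into a number you can compare.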
Monitoring. Log every query, every set of retrieved chunks, and every response in production. Review samples regularly for accuracy issues, retrieval failures, and unexpected behaviour. The errors you don't log are the errors you won't find until a customer does.
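A structured log record makes those reviews much easier than free-text logs. The sketch below builds one JSON line per query; the field names are an assumption, not a standard, and in production you would send the line to your logging pipeline or to a tool such as Langfuse or LangSmith.

```python
import json
import time
import uuid

def build_log_record(session_id: str, query: str, retrieved, answer: str) -> str:
    """retrieved: list of (chunk_id, score) pairs. Returns one JSON line."""
    record = {
        "id": str(uuid.uuid4()),   # unique id so a reviewer can reference this exchange
        "ts": time.time(),         # Unix timestamp in seconds
        "session_id": session_id,
        "query": query,
        "retrieved": [{"chunk_id": cid, "score": round(score, 4)} for cid, score in retrieved],
        "answer": answer,
    }
    return json.dumps(record, ensure_ascii=False)

line = build_log_record(
    "session-1",
    "What is the refund window?",
    [("returns-policy", 0.91234), ("shipping-times", 0.30311)],
    "Refunds are available within 30 days.",
)
parsed = json.loads(line)
```

Logging the retrieved chunk ids alongside the answer is what lets you tell a retrieval failure from a generation failure after the fact.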
Knowledge base updates. When source documents change, you need to re-embed and re-index the affected chunks. Build an update pipeline from the start — don't treat the knowledge base as a one-time setup. We've inherited projects where this was an afterthought and the chatbot was three months out of date by the time anyone noticed.
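A simple way to find the affected chunks is to store a content hash next to each indexed chunk and compare on every sync. The sketch below assumes chunks have stable ids, which your chunking scheme would need to provide; `plan_reindex` is a hypothetical helper, and the actual upsert and delete calls depend on your vector store.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(indexed_hashes: dict[str, str], current_chunks: dict[str, str]):
    """Compare stored hashes with current chunk text.
    Returns (ids to embed and upsert, ids to delete from the index)."""
    to_upsert = [
        cid for cid, text in current_chunks.items()
        if indexed_hashes.get(cid) != content_hash(text)
    ]
    to_delete = [cid for cid in indexed_hashes if cid not in current_chunks]
    return to_upsert, to_delete

# Hypothetical state: what the index holds vs. what the sources contain now
indexed = {
    "a": content_hash("old text for a"),
    "b": content_hash("unchanged text for b"),
    "c": content_hash("chunk from a deleted page"),
}
current = {
    "a": "new text for a",
    "b": "unchanged text for b",
    "d": "chunk from a brand new page",
}
to_upsert, to_delete = plan_reindex(indexed, current)
```

Only changed and new chunks are re-embedded, which keeps the embedding cost of a sync proportional to what actually changed.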
The Full Stack for a Production RAG Chatbot
| Component | Recommended options |
|---|---|
| Embedding model | OpenAI text-embedding-3-small, Cohere embed-v3 |
| Vector store | Pinecone (managed), pgvector (self-hosted) |
| LLM | GPT-4o-mini (cost/speed), GPT-4o (quality) |
| Framework | LangChain, LlamaIndex |
| API | FastAPI, Next.js API routes |
| Memory storage | Redis (production), in-memory (dev) |
| Evaluation | RAGAS, custom eval harness |
| Monitoring | Langfuse, LangSmith |
We Build RAG Systems in Production
Building a RAG chatbot that works in a demo is one thing. Building one that handles real queries, holds accuracy at scale, and improves over time is engineering work — and most of that work is in the parts that don't show up in tutorials.
Talk to us about your project — we'll help you scope a RAG system that fits your data, your users, and your production requirements, and we'll tell you if a simpler approach would do the job instead.