Woyce

How to Build a RAG Chatbot: A Step-by-Step Guide for Developers

Retrieval-Augmented Generation (RAG) is how you build AI chatbots that answer questions from your own data, with answers grounded in retrieved content rather than in the model's guesses. Here's the complete guide from architecture to production deployment.

Woyce Technologies

AI & Engineering Team

Published Apr 1, 2026 · Topic: AI Development

Why RAG Exists

A language model trained on internet data knows a lot. It does not know your product documentation, your company policies, your internal knowledge base, or anything that wasn't in its training set.

Ask a general LLM a question about your specific business and it either makes something up (hallucination) or tells you it doesn't know. Neither is useful in a production chatbot, and the first one is actively harmful — confident wrong answers do more damage than honest "I don't knows."

Retrieval-Augmented Generation solves this. Instead of relying solely on the model's training data, a RAG system retrieves relevant content from your own sources at query time, injects it into the prompt, and generates an answer grounded in that content.

The result: a chatbot that answers questions about your specific products, policies, and processes using a knowledge base you control and update. This is the architecture we use for the majority of production AI chatbots we ship.

RAG Architecture: The Four Components

Every RAG system has four components:

1. Knowledge base — your source documents: PDFs, markdown files, database records, web pages, support tickets, whatever contains the information the chatbot needs.

2. Vector store — a database that stores your documents as mathematical representations (embeddings) that capture semantic meaning, enabling similarity search.

3. Retriever — the system that takes an incoming query, converts it to an embedding, searches the vector store for similar content, and returns the most relevant chunks.

4. Generator — the LLM that receives the query plus the retrieved chunks and generates a response grounded in the retrieved content.

The flow on every query:

User query → Retriever → Top-k relevant chunks → LLM prompt → Response
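As a rough illustration of that flow, here is a toy version in plain Python. Everything in it is a stand-in and not part of the stack described in this guide: `retrieve` ranks chunks by shared words instead of embeddings, and `generate` is a stub where the LLM call would go.

```python
def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Stand-in retriever: ranks chunks by shared words, not by embeddings."""
    query_words = set(query.lower().split())
    return sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Inject the retrieved chunks into the prompt ahead of the question."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Stub generator: a real system would send the prompt to an LLM here."""
    return f"[LLM answer based on {len(prompt)} prompt characters]"


chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available Monday to Friday.",
    "Shipping takes 3 to 5 business days.",
]
query = "How many days do refunds take?"

top_chunks = retrieve(query, chunks, k=1)   # User query -> Retriever -> top-k
prompt = build_prompt(query, top_chunks)    # top-k chunks -> LLM prompt
answer = generate(prompt)                   # LLM prompt -> Response
```

The shape is the point: each stage has one job, so you can swap the retriever or the generator without touching the rest.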

Step 1: Prepare Your Documents

Before any code, prepare your knowledge base. This is the step most teams underinvest in, and it's the one that determines how well the system performs.

Collect your source documents. FAQ files, product documentation, policy documents, support articles, internal wikis — whatever the chatbot needs to answer questions from.

Clean and normalise. Remove irrelevant content, fix formatting inconsistencies, ensure headings are clear. The retriever finds relevant chunks based on semantic similarity, and clean, well-structured content retrieves better. Garbage source documents will quietly tank your retrieval quality and you'll spend weeks blaming the model.
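What "clean" means depends on your sources, but a small normalisation pass is a reasonable starting point. The sketch below uses only the standard library; the function names and the rule for spotting repeated headers are assumptions for illustration, not a prescribed pipeline.

```python
import re
import unicodedata
from collections import Counter


def normalise(text: str) -> str:
    """Fold Unicode variants, tidy whitespace, and collapse blank-line runs."""
    # NFKC folds ligatures and non-breaking spaces. It also folds superscripts,
    # so skip this step if those carry meaning in your documents.
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" *\n *", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()


def drop_repeated_lines(pages: list[str], min_pages: int = 2) -> list[str]:
    """Remove lines (such as page headers) that appear on min_pages or more pages."""
    counts = Counter(
        line for page in pages for line in set(page.splitlines()) if line.strip()
    )
    return [
        "\n".join(line for line in page.splitlines() if counts[line] < min_pages)
        for page in pages
    ]


cleaned = normalise("Refund  policy \r\n\r\n\r\n\r\nItems\tcan be returned. ")
pages = drop_repeated_lines(["ACME Docs\nRefund policy", "ACME Docs\nShipping policy"])
```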

Split into chunks. Documents are split into smaller segments before embedding. Chunk size matters: too small and individual chunks lack context; too large and retrieval precision drops. A common starting point is 500–800 tokens with 50–100 token overlap between chunks. We've found you'll usually need to tune this for your specific corpus — there's no universal right answer.

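from langchain_text_splitters import RecursiveCharacterTextSplitter

# from_tiktoken_encoder measures chunk_size and chunk_overlap in tokens;
# the plain constructor counts characters instead. Requires the tiktoken package.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=600,
    chunk_overlap=75,
    separators=["\n\n", "\n", ". ", " "],
)
# `documents` is a list of LangChain Document objects you loaded earlier.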
chunks = splitter.split_documents(documents)
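To make the size and overlap trade-off concrete, here is a minimal sliding-window splitter in plain Python. It is not what RecursiveCharacterTextSplitter does internally (that splitter also respects separators), and the items being split are simple placeholder strings standing in for tokens.

```python
def sliding_window_chunks(
    tokens: list[str], chunk_size: int, overlap: int
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows


tokens = [f"w{i}" for i in range(10)]
windows = sliding_window_chunks(tokens, chunk_size=4, overlap=1)
```

With ten tokens, a size of 4 and an overlap of 1, this produces three windows, and the last token of each window is repeated as the first token of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from either side.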

Step 2: Build the Vector Store

Each chunk is converted to an embedding — a vector of numbers representing its semantic meaning — and stored in a vector database.

Choose an embedding model. OpenAI's text-embedding-3-small is a solid default. Cohere's embedding models are strong alternatives. For cost-sensitive applications, open-source models like sentence-transformers/all-MiniLM-L6-v2 work well, with the trade-off that you're now hosting the embedding model yourself.

Choose a vector store. Options by use case:

Pinecone — managed, production-ready, good at scale
Weaviate — open-source option with strong filtering capabilities
pgvector — PostgreSQL extension, good if you're already on Postgres
Chroma — lightweight, good for development and small deployments
Qdrant — fast, open-source, good filtering

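from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Assumes OPENAI_API_KEY and PINECONE_API_KEY are set in the environment and
# that the index already exists with dimension 1536, the default output size
# of text-embedding-3-small.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name="your-index-name",
)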

Step 3: Build the Retriever

The retriever handles the query-time search. When a user asks a question, the retriever:

1. Converts the query to an embedding using the same model used for documents
2. Searches the vector store for chunks with high cosine similarity to the query embedding
3. Returns the top-k most relevant chunks (typically k=3 to 5)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)
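Under the hood, a similarity retriever is doing something close to the following. This is a plain-Python sketch with hand-made three-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and a vector database uses an approximate index rather than scoring every chunk.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 4):
    """Return the k best (score, chunk_id) pairs, highest score first."""
    scored = ((cosine_similarity(query_vec, vec), cid) for cid, vec in store.items())
    return heapq.nlargest(k, scored)


# Toy "vector store": chunk id -> embedding (made-up values for illustration).
store = {
    "refunds": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "careers": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.2, 0.0], store, k=2)
ranked_ids = [chunk_id for _, chunk_id in results]
```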

Improving retrieval quality. Basic similarity search is fine to start with. For production, you'll likely want some combination of:

Hybrid search — combining dense vector search with BM25 keyword search. Handles cases where exact keyword matches matter (product codes, proper nouns) better than pure vector search.
Re-ranking — a second model scores retrieved chunks for relevance and reorders them. Cohere's Rerank API is commonly used. Adds latency but improves precision.
Metadata filtering — attach metadata to chunks (source document, category, date) and filter retrieval by metadata before semantic search. Essential when your knowledge base covers multiple distinct domains.
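For hybrid search you need some way to merge the dense ranking with the keyword ranking. One widely used option is reciprocal rank fusion (RRF); the guide does not prescribe a fusion method, so treat this as one possibility. The constant k=60 is the conventional default for RRF, and the chunk ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = ["chunk-a", "chunk-b", "chunk-c"]     # from vector search
keyword_ranking = ["chunk-c", "chunk-a", "chunk-d"]   # from BM25
fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
```

RRF only looks at ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.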

The retrieval step is where most RAG systems quietly underperform. If your chatbot is producing wrong-sounding answers, the model is usually not the problem — the wrong context is reaching it.
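One way to check whether the right context is reaching the model is to measure retrieval on its own, before looking at generated answers. The sketch below computes a hit rate at k over labelled queries. The `retrieve_ids` callable and the labelled pairs are hypothetical; with a real retriever you would map each returned document to whatever id you store in its metadata.

```python
from typing import Callable


def hit_rate_at_k(
    labelled_queries: list[tuple[str, str]],
    retrieve_ids: Callable[[str], list[str]],
    k: int = 4,
) -> float:
    """Fraction of queries whose expected chunk id appears in the top-k results."""
    if not labelled_queries:
        return 0.0
    hits = sum(
        1 for query, expected_id in labelled_queries
        if expected_id in retrieve_ids(query)[:k]
    )
    return hits / len(labelled_queries)


# Canned results standing in for a real retriever.
fake_results = {
    "refund window?": ["refunds-1", "shipping-2"],
    "delivery time?": ["careers-1", "refunds-1"],
}
labelled = [("refund window?", "refunds-1"), ("delivery time?", "shipping-2")]
rate = hit_rate_at_k(labelled, lambda q: fake_results[q], k=2)
```

If this number is low, prompt changes will not help much; the fix is in chunking, embeddings, or retrieval settings.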

Step 4: Build the Generation Chain

The generator takes the user query and retrieved chunks, formats them into a prompt, and calls the LLM to generate a response.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant for [Company Name]. Answer the user's question 
using only the context provided below. If the context does not contain 
enough information to answer the question, say so clearly — do not guess.

Context:
{context}

Question: {question}

Answer:
""")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

The prompt is doing a lot of work. The instruction to answer only from the provided context is the main guard against hallucination, though it reduces the risk rather than eliminating it. Without it, the model will quietly supplement retrieved content with its own training data, sometimes accurately, sometimes confidently wrong. Make this instruction explicit and test that the model actually obeys it.
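A simple way to test obedience is to ask questions your knowledge base cannot answer and check that the chatbot declines. The sketch below is a crude string-matching check under stated assumptions: the refusal markers are guesses you would tune to your own prompt wording, and `fake_ask` stands in for a call such as `rag_chain.invoke`.

```python
from typing import Callable

# Phrases that suggest the model declined; adjust to match your prompt wording.
REFUSAL_MARKERS = ("don't know", "do not know", "enough information", "cannot answer")


def looks_like_refusal(answer: str) -> bool:
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def find_leaks(ask: Callable[[str], str], out_of_scope: list[str]) -> list[str]:
    """Return the out-of-scope questions that were answered instead of declined."""
    return [q for q in out_of_scope if not looks_like_refusal(ask(q))]


def fake_ask(question: str) -> str:
    """Stand-in chatbot that leaks training-data knowledge for one question."""
    canned = {"Who won the 1998 World Cup?": "France won the 1998 World Cup."}
    return canned.get(question, "I do not have enough information to answer that.")


leaks = find_leaks(
    fake_ask,
    ["Who won the 1998 World Cup?", "What is the CEO's home address?"],
)
```

Any question that shows up in `leaks` was answered from somewhere other than your documents, which is exactly the behaviour the prompt is meant to prevent.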

Step 5: Add Conversation Memory

A basic RAG chain answers individual questions but forgets the conversation. For a real chatbot, you need memory — the model needs to understand follow-up questions in context.

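# Note: ConversationalRetrievalChain and the ConversationBuffer* memory classes
# are deprecated in recent LangChain releases (the suggested replacement is
# create_history_aware_retriever with create_retrieval_chain). The imports
# below work on versions that still ship them.
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=5,  # remember last 5 exchanges
)

# The chain rewrites each follow-up into a standalone question using the
# chat history, then retrieves and answers as before.
conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    verbose=False,
)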

Memory choices:

ConversationBufferWindowMemory — keeps the last k exchanges. Simple and effective for most cases.
ConversationSummaryMemory — summarises older history as it grows. Good for long conversations.
External storage (Redis, database) — for production deployments where memory needs to persist across sessions.
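The windowed option is easy to picture in plain Python: a bounded queue per session that drops the oldest exchange when full. This is a sketch of the idea, not LangChain's implementation, and the class and method names are made up for illustration.

```python
from collections import defaultdict, deque


class WindowMemory:
    """Keeps the last k (user, assistant) exchanges for each session."""

    def __init__(self, k: int = 5) -> None:
        self._sessions: dict[str, deque] = defaultdict(lambda: deque(maxlen=k))

    def add_exchange(self, session_id: str, user: str, assistant: str) -> None:
        # deque(maxlen=k) silently discards the oldest item once it is full.
        self._sessions[session_id].append((user, assistant))

    def history(self, session_id: str) -> list[tuple[str, str]]:
        return list(self._sessions.get(session_id, ()))


memory_store = WindowMemory(k=2)
memory_store.add_exchange("s1", "Do you ship to Canada?", "Yes.")
memory_store.add_exchange("s1", "How long does it take?", "3 to 5 business days.")
memory_store.add_exchange("s1", "And the cost?", "It depends on weight.")
recent = memory_store.history("s1")
```

Swapping the in-process dict for Redis or a database keeps the same interface while letting history survive restarts.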

Step 6: Wire Up an API

Expose your RAG chain as an API endpoint for your frontend to call.

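# FastAPI example
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# One chain and memory per session_id, so users never share history.
# A dict is fine for development; use Redis or a database in production.
session_chains = {}

class ChatRequest(BaseModel):
    message: str
    session_id: str

def get_chain(session_id: str):
    if session_id not in session_chains:
        session_chains[session_id] = ConversationalRetrievalChain.from_llm(
            llm=llm,
            retriever=retriever,
            memory=ConversationBufferWindowMemory(
                memory_key="chat_history", return_messages=True, k=5
            ),
        )
    return session_chains[session_id]

@app.post("/chat")
async def chat(request: ChatRequest):
    # ainvoke awaits the LLM call instead of blocking the event loop.
    chain = get_chain(request.session_id)
    response = await chain.ainvoke({"question": request.message})
    return {"answer": response["answer"]}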

For Next.js, you can call this endpoint from an API route or use a streaming response for a better user experience. Streaming makes a huge difference in perceived speed and is worth the extra plumbing.

Step 7: Production Considerations

Latency. A RAG pipeline has multiple steps: embedding the query, vector search, LLM generation. Total latency is typically 1–4 seconds depending on model and infrastructure. Streaming the LLM response makes a substantial difference to perceived performance, even if total time is the same.
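Before optimising, it helps to know which stage the time actually goes to. A minimal sketch using only the standard library follows; the stage names are arbitrary and the `time.sleep` calls are placeholders for the real embedding, search, and generation calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


timings: dict[str, float] = {}
with timed("embed_query", timings):
    time.sleep(0.01)   # placeholder for the embedding API call
with timed("vector_search", timings):
    time.sleep(0.01)   # placeholder for the vector store query
with timed("generate", timings):
    time.sleep(0.02)   # placeholder for the LLM call

total_seconds = sum(timings.values())
slowest_stage = max(timings, key=timings.get)
```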

Cost. Each query incurs: an embedding API call (cheap), vector search (very cheap), and an LLM generation call (the main cost). Monitor cost-per-query early — it's much harder to optimise after launch when nobody knows what "normal" looks like. Common levers: reduce the number of chunks retrieved, use smaller models for simpler queries, cache frequent responses.
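A back-of-the-envelope estimator makes the effect of those levers visible. The prices below are placeholders, not real rates for any provider, and the token counts are illustrative; substitute your provider's current pricing and your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price per one million tokens, in your billing currency."""
    input_per_million: float
    output_per_million: float


def prompt_tokens(k: int, tokens_per_chunk: int, overhead_tokens: int) -> int:
    """Prompt size: k retrieved chunks plus instructions and the question."""
    return k * tokens_per_chunk + overhead_tokens


def llm_cost(prompt: int, completion: int, pricing: Pricing) -> float:
    """Cost of a single generation call."""
    return (
        prompt * pricing.input_per_million + completion * pricing.output_per_million
    ) / 1_000_000


placeholder = Pricing(input_per_million=1.0, output_per_million=2.0)  # NOT real prices

cost_k4 = llm_cost(prompt_tokens(4, 600, 200), 300, placeholder)
cost_k2 = llm_cost(prompt_tokens(2, 600, 200), 300, placeholder)
```

In this toy setup, halving k from 4 to 2 cuts the prompt from 2,600 to 1,400 tokens, which is why the number of retrieved chunks is usually the first lever to check.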

Evaluation. You need a way to know if the chatbot is answering correctly. Build an eval set of 50–100 representative questions with expected answers. Run the pipeline against them and score the results. RAGAS is a useful framework for automated RAG evaluation. Without an eval set, every prompt tweak is a guess.
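Even a crude harness beats none. The sketch below scores answers by whether they contain expected keywords, which is far weaker than the metrics a framework like RAGAS computes but is enough to catch regressions. The eval cases, the scoring rule, and `fake_chatbot` are all assumptions for illustration.

```python
from typing import Callable


def keyword_score(answer: str, keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the answer."""
    if not keywords:
        return 0.0
    lowered = answer.lower()
    return sum(kw.lower() in lowered for kw in keywords) / len(keywords)


def run_eval(ask: Callable[[str], str], eval_set: list[dict], threshold: float = 1.0):
    """Return (pass_rate, per-question results) for the given chatbot callable."""
    results = []
    for case in eval_set:
        score = keyword_score(ask(case["question"]), case["keywords"])
        results.append(
            {"question": case["question"], "score": score, "passed": score >= threshold}
        )
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return pass_rate, results


eval_set = [
    {"question": "What is the refund window?", "keywords": ["14 days"]},
    {"question": "When is support open?", "keywords": ["Monday", "Friday"]},
]
canned_answers = {
    "What is the refund window?": "Refunds are accepted within 14 days.",
    "When is support open?": "Support is open on Monday only.",
}


def fake_chatbot(question: str) -> str:
    return canned_answers[question]


pass_rate, results = run_eval(fake_chatbot, eval_set)
```

Run it after every prompt or retrieval change and compare the pass rate with the previous run.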

Monitoring. Log every query, every set of retrieved chunks, and every response in production. Review samples regularly for accuracy issues, retrieval failures, and unexpected behaviour. The errors you don't log are the errors you won't find until a customer does.
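A structured log line per interaction is enough to start with. The sketch below writes JSON Lines to any file-like object; the field names are an assumption, and in production you would more likely send the same record to a tool such as Langfuse or LangSmith.

```python
import io
import json
import time
import uuid


def log_interaction(stream, query, chunk_ids, answer, session_id=None) -> dict:
    """Append one JSON record describing a single query/response pair."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "session_id": session_id,
        "query": query,
        "retrieved_chunk_ids": list(chunk_ids),
        "answer": answer,
    }
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# An in-memory buffer stands in for a log file here.
log_buffer = io.StringIO()
log_interaction(log_buffer, "refund window?", ["refunds-1"], "14 days", session_id="s1")
log_interaction(log_buffer, "delivery time?", [], "I don't know.", session_id="s1")

records = [json.loads(line) for line in log_buffer.getvalue().splitlines()]
empty_retrievals = [r["query"] for r in records if not r["retrieved_chunk_ids"]]
```

Logging the retrieved chunk ids alongside the answer is what lets you tell a retrieval failure from a generation failure later.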

Knowledge base updates. When source documents change, you need to re-embed and re-index the affected chunks. Build an update pipeline from the start — don't treat the knowledge base as a one-time setup. We've inherited projects where this was an afterthought and the chatbot was three months out of date by the time anyone noticed.
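One common way to keep re-indexing cheap is to store a content hash per chunk and only re-embed chunks whose hash has changed. The sketch below shows the planning step only; the function names and chunk ids are hypothetical, and the actual upsert and delete calls depend on your vector store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    indexed_hashes: dict[str, str], current_chunks: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare stored hashes with current text.

    Returns (ids to embed and upsert, ids to delete from the index).
    """
    to_upsert = [
        chunk_id for chunk_id, text in current_chunks.items()
        if indexed_hashes.get(chunk_id) != content_hash(text)
    ]
    to_delete = [chunk_id for chunk_id in indexed_hashes if chunk_id not in current_chunks]
    return to_upsert, to_delete


# Hashes recorded at the last indexing run.
indexed = {
    "pricing-1": content_hash("Plan A costs 10 per month."),
    "refunds-1": content_hash("Refunds within 14 days."),
    "legacy-1": content_hash("Old product page."),
}
# Chunks produced from the documents as they are today.
current = {
    "pricing-1": "Plan A costs 12 per month.",   # changed
    "refunds-1": "Refunds within 14 days.",      # unchanged
    "faq-1": "New FAQ entry.",                   # new
}
to_upsert, to_delete = plan_reindex(indexed, current)
```

Deleting stale chunks matters as much as adding new ones: a removed policy that is still in the index will keep being retrieved.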

The Full Stack for a Production RAG Chatbot

Component	Recommended options
Embedding model	OpenAI text-embedding-3-small, Cohere embed-v3
Vector store	Pinecone (managed), pgvector (self-hosted)
LLM	GPT-4o-mini (cost/speed), GPT-4o (quality)
Framework	LangChain, LlamaIndex
API	FastAPI, Next.js API routes
Memory storage	Redis (production), in-memory (dev)
Evaluation	RAGAS, custom eval harness
Monitoring	Langfuse, LangSmith

We Build RAG Systems in Production

Building a RAG chatbot that works in a demo is one thing. Building one that handles real queries, holds accuracy at scale, and improves over time is engineering work — and most of that work is in the parts that don't show up in tutorials.

Talk to us about your project — we'll help you scope a RAG system that fits your data, your users, and your production requirements, and we'll tell you if a simpler approach would do the job instead.

Tags: build RAG chatbot · retrieval augmented generation tutorial · RAG implementation · RAG chatbot guide · how to build RAG · LangChain RAG

Woyce Technologies

AI & Engineering Team · Woyce

Woyce Technologies builds AI chatbots, LLM integrations, voice AI, and full-stack web applications for businesses in the US and India. Based in Rajkot, Gujarat.

READY TO BUILD?

Let's build something
that actually works.

Tell us about your project. We'll be honest about whether we're the right fit — and if we are, we move fast.

