The Problem That Vector Databases Solve
Imagine you've got ten thousand support articles, product documents, and internal policies. A customer asks a question. You need to find the three or four documents most relevant to that specific question — not by keyword, but by meaning — in under a second.
Traditional databases can't really do this. A keyword search for "my payment failed" won't surface a document titled "troubleshooting transaction errors" unless those exact words appear somewhere in it. The meaning is the same; the words are different. And customers, in our experience, almost never phrase things the way your documentation does.
Vector databases solve this. They store information as mathematical representations of meaning — called embeddings — and can find the most semantically similar content to any query in milliseconds. This is what makes AI agents that answer questions from your own data accurate rather than generic.
What an Embedding Actually Is
Before you can understand vector databases, you need to understand embeddings.
An embedding is a list of numbers — usually several hundred to several thousand — that represents the meaning of a piece of text. Two pieces of text with similar meaning will have embeddings that are numerically close to each other. Two pieces with very different meanings will have embeddings that are far apart.
Here's what a sentence embedding looks like, simplified:
"My payment failed" → [0.23, -0.87, 0.41, 0.12, ...] (1,536 numbers)
"Transaction error" → [0.21, -0.84, 0.39, 0.15, ...] (1,536 numbers)
"Dog breeds" → [-0.92, 0.34, -0.67, 0.88, ...] (1,536 numbers)
The payment and transaction embeddings are numerically similar. The dog breeds one is very different. A vector database finds the most similar embeddings to a query embedding — and that similarity corresponds, roughly, to semantic relevance.
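To make "numerically close" concrete, here is a minimal sketch that compares the three illustrative vectors above using cosine similarity, one common measure of closeness. It uses only the first four values of each vector, and the helper name `cosine_similarity` is our own; a real comparison would use all 1,536 numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First four values of each illustrative embedding shown above.
payment = [0.23, -0.87, 0.41, 0.12]
transaction = [0.21, -0.84, 0.39, 0.15]
dog_breeds = [-0.92, 0.34, -0.67, 0.88]

print(f"payment vs transaction: {cosine_similarity(payment, transaction):.3f}")
print(f"payment vs dog breeds:  {cosine_similarity(payment, dog_breeds):.3f}")
```

The first score comes out close to 1 and the second is negative, which matches the intuition that the payment and transaction sentences point in nearly the same direction while the dog breeds sentence does not.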
Embeddings are generated by embedding models. OpenAI's text-embedding-3-small and Cohere's Embed v3 models (such as embed-english-v3.0) are among the most commonly used in production AI applications. You pass text to the model and it returns the embedding. By default, text-embedding-3-small returns 1,536 numbers, which is the size used in the example above.
How a Vector Database Works
A vector database stores embeddings alongside the original content they represent. When you query it, it:
- Takes your query and converts it to an embedding using the same model
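- Compares your query embedding against the stored embeddings, usually through an index that avoids checking every single one
- Returns the most similar ones, which is the content most relevant to your query
This is called nearest neighbour search. The database finds the stored embeddings nearest (most similar) to the query embedding in mathematical space. At scale, most vector databases use approximate nearest neighbour (ANN) indexes, which trade a small amount of accuracy for much faster lookups.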
Because embeddings capture meaning rather than exact words, the search finds content that's semantically relevant even when the words don't match. That's fundamentally different from keyword search, and it's why the AI agents you'd actually trust feel like they "get" what you mean.
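The query steps described in this section can be sketched as a brute-force search over a tiny in-memory store. Everything here is an assumption for illustration: the class name `TinyVectorStore`, the four-number vectors, and the hard-coded query vector that stands in for a call to a real embedding model.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

class TinyVectorStore:
    """Hypothetical in-memory store: keeps (content, unit vector) pairs."""

    def __init__(self):
        self.items = []

    def add(self, content, embedding):
        self.items.append((content, normalize(embedding)))

    def query(self, query_embedding, k=3):
        """Return up to k (score, content) pairs, most similar first."""
        q = normalize(query_embedding)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), content)
            for content, vec in self.items
        )
        return heapq.nlargest(k, scored)

store = TinyVectorStore()
store.add("Troubleshooting transaction errors", [0.21, -0.84, 0.39, 0.15])
store.add("Guide to dog breeds", [-0.92, 0.34, -0.67, 0.88])
store.add("Updating your billing details", [0.30, -0.60, 0.55, -0.10])

# Stand-in for embedding "my payment failed" with the same model.
query_embedding = [0.23, -0.87, 0.41, 0.12]
for score, content in store.query(query_embedding, k=2):
    print(f"{score:.3f}  {content}")
```

This version scans every stored vector, which is fine for a few thousand items. Real vector databases rely on index structures to avoid the full scan.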
Why AI Agents Need Vector Databases
An AI agent that answers questions from your data needs to retrieve relevant content at query time and include it in the prompt sent to the language model. That's Retrieval-Augmented Generation (RAG).
Without a vector database, the agent has two bad options:
- Include all your documents in every prompt (too expensive, too slow, blows past context limits)
- Use keyword search (misses semantically relevant content, returns irrelevant results)
With a vector database, the agent retrieves the three to five most semantically relevant documents per query, includes only those in the prompt, and generates an accurate response grounded in content that is actually relevant. This is the approach most production AI agents use when they answer from a custom knowledge base.
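As a rough sketch of the "include only those in the prompt" step, the function below assembles a prompt from retrieved chunks. The function name `build_rag_prompt`, the prompt wording, and the sample chunks are all hypothetical; the call to the language model itself is left out.

```python
def build_rag_prompt(question, retrieved_chunks, max_chunks=5):
    """Assemble a prompt that grounds the model in retrieved content only."""
    selected = retrieved_chunks[:max_chunks]
    context = "\n\n".join(
        f"[Source {i}] {chunk}" for i, chunk in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical chunks, as if returned by a vector database query.
chunks = [
    "Transaction errors usually mean the card issuer declined the charge.",
    "You can update your card under Settings > Billing.",
]
prompt = build_rag_prompt("Why did my payment fail?", chunks)
print(prompt)
```

Numbering the sources is a design choice that makes it easier to ask the model to cite which chunk an answer came from.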
The honest caveat: vector search isn't magic. It can still pull the wrong chunk if the embedding model doesn't capture intent well, or if your chunking is awkward. We've spent more debugging time on this than on any other part of RAG systems.
The Main Vector Database Options
Pinecone
The most widely used managed vector database for production AI applications. Pinecone handles the infrastructure entirely — you don't manage servers, indices, or scaling. It has a generous free tier and scales predictably.
Best for: Production applications where you want managed infrastructure and don't want to deal with operational complexity. The default choice for teams without dedicated DevOps resource.
Limitations: Costs money at scale; you don't control the underlying infrastructure; data is hosted on Pinecone's servers (worth thinking about for data sovereignty requirements).
Chroma
Open-source, lightweight, easy to run locally. The default choice for development and prototyping.
Best for: Development, testing, and small deployments where you want to run everything locally without external services. Not what you reach for at production scale.
Limitations: Requires self-hosting for production; not designed for high-throughput production workloads.
Weaviate
Open-source with a managed cloud option. Strong filtering capabilities: you can combine vector search with metadata filters ("find semantically similar documents that are also from category X and published after date Y"). Good choice when you need semantic search combined with complex filtering.
Best for: Applications where metadata filtering alongside semantic search matters. Self-hosting teams who want open-source with production capabilities.
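To show what "vector search plus metadata filters" means in practice, here is a generic sketch in plain Python. It is not Weaviate's API. The record layout, the `filtered_search` helper, and the two-number vectors are assumptions for illustration, and real databases apply filters inside the index instead of using a list comprehension.

```python
from datetime import date

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def filtered_search(records, query_vec, category, published_after, k=3):
    """Keep records matching the metadata filter, then rank by similarity.

    Assumes vectors are already unit length, so dot product equals
    cosine similarity.
    """
    candidates = [
        r for r in records
        if r["category"] == category and r["published"] > published_after
    ]
    candidates.sort(key=lambda r: dot(query_vec, r["vector"]), reverse=True)
    return candidates[:k]

records = [
    {"id": "a", "category": "billing", "published": date(2024, 3, 1), "vector": [1.0, 0.0]},
    {"id": "b", "category": "billing", "published": date(2022, 1, 15), "vector": [1.0, 0.0]},
    {"id": "c", "category": "shipping", "published": date(2024, 5, 9), "vector": [1.0, 0.0]},
    {"id": "d", "category": "billing", "published": date(2024, 6, 2), "vector": [0.6, 0.8]},
]

hits = filtered_search(records, [1.0, 0.0], "billing", date(2023, 1, 1))
print([r["id"] for r in hits])
```

Records "b" and "c" are excluded by the filter even though their vectors match the query exactly, which is the behaviour the quoted example describes.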
Qdrant
Open-source, high-performance, built for production. Particularly fast on filtered vector search. Good Python and TypeScript client libraries.
Best for: High-throughput applications where filtering matters and you want open-source with production-grade performance. A strong alternative to Pinecone for teams who'd rather self-host.
pgvector
A PostgreSQL extension that adds vector search to an existing Postgres database. If you're already running Postgres, adding vector search without standing up a separate service is attractive.
Best for: Teams already on PostgreSQL who want to add semantic search without managing another service. Not optimal for very large vector collections but works well at moderate scale.
Choosing the Right One
| Factor | Recommendation |
|---|---|
| Getting started quickly | Chroma locally, Pinecone for production |
| Already on PostgreSQL | pgvector |
| Need complex metadata filtering | Weaviate or Qdrant |
| Open-source, self-hosted production | Qdrant |
| Managed, no infrastructure management | Pinecone |
| Data sovereignty requirements | Qdrant or pgvector (self-hosted) |
For most teams building their first production RAG application: start with Chroma locally, deploy with Pinecone. It's the path of least friction and least operational risk, and you can revisit the choice later when you actually know your traffic patterns.
Key Concepts You Will Encounter
Chunking: Before storing documents, you split them into smaller pieces (chunks). A 20-page PDF might become around 50 chunks, depending on the chunk size. Each chunk is embedded and stored separately. Retrieval finds the most relevant chunks, not entire documents.
Chunk size: How big each chunk is. Smaller chunks (200–400 tokens) give more precise retrieval but less context per chunk. Larger chunks (600–1000 tokens) provide more context but less precise matching. Most production systems use 400–600 tokens with some overlap between chunks. You'll likely tune this for your corpus.
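A minimal sketch of fixed-size chunking with overlap follows. It counts words as a stand-in for tokens, which is an assumption made to keep the example dependency-free; a real pipeline would count tokens with the embedding model's tokenizer. The function name and default sizes are ours.

```python
def chunk_words(words, chunk_size=500, overlap=50):
    """Split a list of words into overlapping chunks.

    Words stand in for tokens here; consecutive chunks share
    `overlap` words so that context is not cut off at a boundary.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the document
    return chunks

document = [f"w{i}" for i in range(1200)]
chunks = chunk_words(document, chunk_size=500, overlap=50)
print([len(c) for c in chunks])
```

In practice you would probably also try to break on paragraph or sentence boundaries, since a chunk that starts mid-sentence tends to embed less cleanly.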
Similarity metric: How closeness between embeddings is measured. Cosine similarity is the most common. It measures the angle between two embedding vectors and ignores their lengths, which tends to behave better for text embeddings than raw Euclidean distance.
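For reference, the standard definition of cosine similarity between two embedding vectors is written out below. The symbols (vectors a and b with n components each) are notation introduced here for illustration.

```latex
\[
\operatorname{cos\_sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^2} \; \sqrt{\sum_{i=1}^{n} b_i^2}}
\]
```

The result ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). When both vectors are scaled to unit length, the denominator is 1 and the score reduces to the plain dot product.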
Hybrid search: Combining vector search with keyword search (BM25). Vector search alone misses exact keyword matches (product codes, proper nouns). BM25 alone misses semantic matches. Hybrid search catches both. Most production RAG systems we've built end up using some form of hybrid search once they're tuned.
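One common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is one possible approach, not necessarily what any particular database does internally. The document ids are made up, and the constant k=60 is a conventional default that you may want to tune.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1). Scores are summed and ids returned best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical results for the same query from the two search methods.
vector_hits = ["doc_payments", "doc_refunds", "doc_sku_a17"]
keyword_hits = ["doc_sku_a17", "doc_payments", "doc_shipping"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear in both lists rise to the top, so an exact product-code match found only by keyword search still gets a place alongside the semantic matches. RRF works on ranks alone, which sidesteps the problem that BM25 scores and cosine scores are on different scales.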
Re-ranking: After vector search returns a candidate set (say, the top 20 results), a re-ranking model scores each one for relevance against the query and reorders them before the best few (say, the top 5) are sent to the LLM. Re-ranking significantly improves quality at the cost of additional latency. Worth turning on once basic retrieval is in place.
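The two-stage shape of re-ranking can be sketched as follows. The scoring function here is a deliberately simple word-overlap stand-in, not a real re-ranking model; in a real system you would replace `word_overlap_score` with a call to a dedicated model. All names and sample texts are hypothetical.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score first-stage candidates with score_fn and keep the best top_n."""
    ordered = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ordered[:top_n]

def word_overlap_score(query, text):
    """Toy stand-in for a re-ranking model: fraction of query words found in text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

# Pretend these came back from vector search, in first-stage order.
candidates = [
    "how to change your email address",
    "why a card payment can fail",
    "payment failed after card expired",
]
best = rerank("card payment failed", candidates, word_overlap_score, top_n=2)
print(best)
```

Keeping the scorer as a parameter means the retrieval code does not change when you swap the toy scorer for a real model.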
What This Means for Your AI Agent Project
If you're commissioning a RAG-powered AI agent — one that answers questions from your documents, knowledge base, or data — your development team will need to:
- Choose an embedding model
- Choose a vector database
- Design a chunking strategy for your content
- Build an ingestion pipeline to load and index your content
- Build a retrieval layer that queries the vector database at runtime
These decisions significantly affect the quality of the agent's responses. A well-designed retrieval pipeline produces accurate, relevant answers. A poorly designed one produces generic or wrong ones — and the rest of the system can't really fix that downstream.
When evaluating vendors, ask specifically about their approach to each of these decisions — not just which tools they use, but why those over the alternatives. A vendor whose answer is "we just use the defaults" is one to be cautious about.
Talk to us about your project — RAG system design is one of our core capabilities and we're happy to walk you through the architecture decisions before you commit to a build.