Retrieval Fundamentals: How RAG Search Works

Retrieval is the core of any RAG (Retrieval-Augmented Generation) system. Without effective search, even the best LLM will produce off-topic or incomplete answers. This guide covers retrieval mechanisms from theory to practical implementation.

Why Retrieval is Critical in a RAG System

A RAG system works in two stages: first retrieving relevant documents (retrieval), then generating a response based on those documents (generation). The quality of the final response directly depends on the quality of retrieved documents.

Imagine an assistant that needs to answer "What is your return policy?" If retrieval brings back pages about shipping conditions instead of the return policy, the LLM will generate an incorrect answer or invent a fictional policy.

The Three Pillars of Retrieval

Representation: How to transform text into mathematical vectors
Indexing: How to organize these vectors for fast search
Search: How to find the most relevant documents

Understanding Embeddings

Embeddings are vector representations of text. Each word, sentence, or document is transformed into a vector of numbers (typically a few hundred to a few thousand dimensions, 384 to 3072 for the models compared below) that captures its semantic meaning.

How Embeddings Work

DEVELOPERpython
from sentence_transformers import SentenceTransformer

# Load an embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Create embeddings
texts = [
    "How do I return a product?",
    "What is the refund policy?",
    "Store opening hours"
]

embeddings = model.encode(texts)

# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(embeddings)

# Exact values depend on the model; expect the first pair to score well above the second
print("Similarity 'return' vs 'refund':", similarities[0][1])
print("Similarity 'return' vs 'hours':", similarities[0][2])

The first two sentences, although worded differently, have high similarity because they deal with the same topic. The third is semantically distant.

Choosing Your Embedding Model

Model	Dimensions	Performance	Speed	Recommended Use
all-MiniLM-L6-v2	384	Good	Fast	Prototyping, high volumes
all-mpnet-base-v2	768	Very good	Medium	General production
text-embedding-3-small	1536	Excellent	Fast (API)	Production with API budget
text-embedding-3-large	3072	State of the art	Medium (API)	Critical high-precision cases
multilingual-e5-large	1024	Excellent multilingual	Medium	FR/EN/multilingual content

For multilingual projects, prioritize models trained on diverse corpora:

DEVELOPERpython
# Excellent choice for multilingual
model = SentenceTransformer('intfloat/multilingual-e5-large')

# Prefix required for E5
query = "query: How does the warranty work?"
documents = ["passage: The warranty covers manufacturing defects for 2 years..."]

Chunking: Intelligently Splitting Documents

Chunking is the art of splitting documents into appropriately sized pieces. Too large, and the chunk contains noise. Too small, and it loses context.

Chunking Strategies

1. Fixed-Size Chunking

The simplest method: split every X characters with overlap.

DEVELOPERpython
# Recent LangChain releases expose this class from the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Target size
    chunk_overlap=50,      # Overlap to preserve context
    separators=["\n\n", "\n", ". ", " ", ""]
)

document = """
Return Policy

You have 30 days to return an unused product in its original packaging.

Return Procedure:
1. Log in to your customer account
2. Select the relevant order
3. Click "Request a return"
4. Print the return label

Return shipping costs are your responsibility unless the product is defective.

Refund

Once the return is received and validated, the refund is processed within 5 business days to the payment method used for the purchase.
"""

chunks = splitter.split_text(document)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:100]}...")
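
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of strict fixed-size chunking. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries the listed separators in order); the function name fixed_size_chunks is hypothetical and exists only for illustration.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in fixed_size_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the overlap that keeps an idea from being cut cleanly in half at a boundary.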

2. Semantic Chunking

More sophisticated: split at natural text boundaries (paragraphs, sections).

DEVELOPERpython
from langchain.text_splitter import MarkdownTextSplitter

md_splitter = MarkdownTextSplitter(
    chunk_size=500,
    chunk_overlap=0
)

# Respects Markdown structure; markdown_document is your Markdown source as a string
chunks = md_splitter.split_text(markdown_document)

3. Sentence Chunking with Sliding Window

Ideal for FAQs and short content:

DEVELOPERpython
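import nltk
nltk.download('punkt')  # newer NLTK releases may require 'punkt_tab' instead

def chunk_by_sentences(text, sentences_per_chunk=3, overlap=1):
    # A non-positive step would make range() fail or never advance
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must be smaller than sentences_per_chunk")

    sentences = nltk.sent_tokenize(text)
    chunks = []
    step = sentences_per_chunk - overlap

    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i:i + sentences_per_chunk]))
        # Stop once a window reaches the last sentence, so the tail
        # is not emitted again as a redundant short chunk
        if i + sentences_per_chunk >= len(sentences):
            break

    return chunks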

Strategy Comparison Table

Strategy	Advantages	Disadvantages	Use Case
Fixed size	Simple, predictable	Cuts mid-idea	Homogeneous documents
Semantic	Preserves meaning	More complex	Structured documentation
By sentence	Fine precision	Sometimes too short chunks	FAQ, support
Hierarchical	Parent context preserved	Increased complexity	Technical documentation
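
The table lists hierarchical chunking, which the strategies above do not illustrate. As a rough sketch of the idea: index small child chunks for precise matching, but keep a pointer to the larger parent section so that the parent can be handed to the LLM. The splitting rules used here (blank-line-separated sections as parents, fixed-size slices as children) and all names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str        # small piece that gets embedded and searched
    parent_id: int   # index of the parent section it came from


def hierarchical_chunks(document: str, child_size: int = 200):
    """Return (parents, children): parents are blank-line-separated sections,
    children are fixed-size slices of each parent."""
    parents = [p.strip() for p in document.split("\n\n") if p.strip()]
    children = []
    for parent_id, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            children.append(ChildChunk(parent[start:start + child_size], parent_id))
    return parents, children


doc = (
    "Return Policy\nYou have 30 days to return an unused product.\n\n"
    "Refund\nRefunds are processed within 5 business days."
)
parents, children = hierarchical_chunks(doc, child_size=40)

# At query time: match on a child, then pass its parent to the LLM
best_child = children[-1]
print(parents[best_child.parent_id])
```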

Indexing with Vector Databases

Once embeddings are created, they need to be stored and indexed for fast search. Vector databases are optimized for this task.
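
Conceptually, a vector index answers one question: which stored vectors are closest to the query vector? A brute-force baseline in NumPy makes that concrete; vector databases replace the exhaustive scan with approximate nearest-neighbor structures to stay fast at scale. The toy 3-dimensional vectors and the function name top_k_cosine are illustrative stand-ins, not part of any library.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, matrix: np.ndarray, k: int = 2) -> list[tuple[int, float]]:
    """Exhaustively score every row of matrix against query by cosine similarity."""
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    order = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]


stored = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
hits = top_k_cosine(np.array([1.0, 0.05, 0.0]), stored, k=2)
print(hits)
```

The scan costs one dot product per stored vector, which is fine for a few thousand chunks and is the reason dedicated indexes matter beyond that.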

Qdrant: Implementation Example

DEVELOPERpython
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Connection
client = QdrantClient(host="localhost", port=6333)

# Create a collection
client.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(
        size=384,  # Dimension of your embeddings
        distance=Distance.COSINE
    )
)

# Index documents: embed the chunks themselves so vectors and payloads line up
embeddings = model.encode(chunks)
points = [
    PointStruct(
        id=i,
        vector=embedding.tolist(),
        payload={
            "text": chunk,
            "source": "return_policy.md",
            "category": "support"
        }
    )
    for i, (embedding, chunk) in enumerate(zip(embeddings, chunks))
]

client.upsert(
    collection_name="knowledge_base",
    points=points
)

Vector Search

DEVELOPERpython
def search(query: str, top_k: int = 5):
    # Encode the query
    query_embedding = model.encode(query)

    # Search
    results = client.search(
        collection_name="knowledge_base",
        query_vector=query_embedding.tolist(),
        limit=top_k
    )

    return [
        {
            "text": hit.payload["text"],
            "score": hit.score,
            "source": hit.payload["source"]
        }
        for hit in results
    ]

# Example
results = search("How do I get a refund?")
for r in results:
    print(f"Score: {r['score']:.3f} - {r['text'][:100]}...")

Similarity Metrics

The choice of metric impacts search results.

Cosine Similarity

The most widely used. Measures the cosine of the angle between two vectors, regardless of their magnitude.

DEVELOPERpython
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Advantages: Insensitive to original text length
Disadvantages: May miss magnitude nuances

Dot Product

Faster, but sensitive to vector magnitude.

DEVELOPERpython
def dot_product(a, b):
    return np.dot(a, b)

Advantages: Faster to compute
Disadvantages: Requires normalized vectors to be comparable to cosine

Euclidean Distance

Measures the "as the crow flies" distance between two points.

DEVELOPERpython
def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

Advantages: Geometrically intuitive
Disadvantages: Sensitive to outliers and dimensionality
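
The three metrics are closely related once vectors are normalized to unit length: the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 - 2 * cosine. A small NumPy check with arbitrary example vectors:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Normalize both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_unit = np.dot(a_unit, b_unit)
euclid_unit = np.linalg.norm(a_unit - b_unit)

print(cosine)            # cosine similarity of the raw vectors
print(dot_unit)          # same value: dot product of the unit vectors
print(euclid_unit ** 2)  # equals 2 - 2 * cosine
```

In practice this means that ranking by cosine, dot product, or Euclidean distance produces the same order when embeddings are normalized, so the choice mostly matters for unnormalized vectors.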

Optimizing Retrieval

1. Query Expansion

Enrich the user query to improve recall:

DEVELOPERpython
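def expand_query(query: str, llm) -> list[str]:
    prompt = f"""
    Generate 3 reformulations of this question to improve search, one per line:
    Original question: {query}

    Reformulations:
    """

    # Assumes llm.generate returns the raw completion as a single string
    response = llm.generate(prompt)
    expansions = [line.strip() for line in response.splitlines() if line.strip()]
    return [query] + expansions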

# Search with all variants
def search_expanded(query: str, top_k: int = 5):
    queries = expand_query(query, llm)
    all_results = []

    for q in queries:
        results = search(q, top_k=top_k)
        all_results.extend(results)

    # Deduplicate and re-score
    return deduplicate_and_rerank(all_results)
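
The helper deduplicate_and_rerank is not defined above. One possible sketch, assuming each result is a dict with text and score keys as returned by search: keep the best score per distinct text and sort in descending order. Other merge rules, such as summing scores across query variants, are equally plausible; the sample hits below are invented.

```python
def deduplicate_and_rerank(results: list[dict], top_k: int = 5) -> list[dict]:
    """Keep the highest-scoring hit for each distinct text, best first."""
    best_by_text: dict[str, dict] = {}
    for hit in results:
        current = best_by_text.get(hit["text"])
        if current is None or hit["score"] > current["score"]:
            best_by_text[hit["text"]] = hit

    ranked = sorted(best_by_text.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:top_k]


merged = deduplicate_and_rerank([
    {"text": "Refunds take 5 business days.", "score": 0.71, "source": "return_policy.md"},
    {"text": "You have 30 days to return a product.", "score": 0.64, "source": "return_policy.md"},
    {"text": "Refunds take 5 business days.", "score": 0.83, "source": "return_policy.md"},
])
for hit in merged:
    print(f"{hit['score']:.2f} - {hit['text']}")
```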

2. Reranking

Use a reranking model to refine results:

DEVELOPERpython
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, documents: list[str], top_k: int = 3):
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)

    # Sort by descending score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Complete pipeline
def search_with_rerank(query: str):
    # 1. Initial search (high recall)
    initial_results = search(query, top_k=20)

    # 2. Reranking (high precision)
    documents = [r["text"] for r in initial_results]
    reranked = rerank(query, documents, top_k=5)

    return reranked

3. Metadata Filtering

Combine vector search with classic filters:

DEVELOPERpython
from qdrant_client.models import Filter, FieldCondition, MatchValue

def search_filtered(query: str, category: str = None, top_k: int = 5):
    query_embedding = model.encode(query)

    # Build filter
    filter_conditions = None
    if category:
        filter_conditions = Filter(
            must=[
                FieldCondition(
                    key="category",
                    match=MatchValue(value=category)
                )
            ]
        )

    results = client.search(
        collection_name="knowledge_base",
        query_vector=query_embedding.tolist(),
        query_filter=filter_conditions,
        limit=top_k
    )

    return results

# Search only in "support" category
results = search_filtered("return policy", category="support")

Evaluating Retrieval Quality

To measure your retrieval system's effectiveness, use these metrics:

Recall@k

Proportion of relevant documents found among the top k results.

DEVELOPERpython
def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    retrieved_k = set(retrieved[:k])
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # recall is undefined without relevant documents; report 0

    return len(retrieved_k & relevant_set) / len(relevant_set)

MRR (Mean Reciprocal Rank)

Average, across queries, of the reciprocal rank (1/rank) of the first relevant document. A value of 1.0 means the top result is always relevant.

DEVELOPERpython
def mrr(queries_results: list[tuple[list, list]]) -> float:
    reciprocal_ranks = []

    for retrieved, relevant in queries_results:
        for i, doc in enumerate(retrieved):
            if doc in relevant:
                reciprocal_ranks.append(1 / (i + 1))
                break
        else:
            reciprocal_ranks.append(0)

    return sum(reciprocal_ranks) / len(reciprocal_ranks)
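
A quick worked example helps check intuition. With two queries whose first relevant document appears at rank 2 and rank 1, the reciprocal ranks are 1/2 and 1, so the MRR is 0.75. The sketch below restates that arithmetic in a self-contained form; the document IDs and the helper name first_relevant_rank are made up for illustration.

```python
from typing import Optional


def first_relevant_rank(retrieved: list[str], relevant: set[str]) -> Optional[int]:
    """1-based rank of the first relevant document, or None if there is none."""
    for position, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return position
    return None


# (retrieved ranking, relevant set) per query; IDs are illustrative
runs = [
    (["d7", "d2", "d9"], {"d2"}),        # first relevant at rank 2
    (["d4", "d1", "d5"], {"d4", "d5"}),  # first relevant at rank 1
]

ranks = [first_relevant_rank(retrieved, relevant) for retrieved, relevant in runs]
reciprocal = [0.0 if r is None else 1 / r for r in ranks]
mean_rr = sum(reciprocal) / len(reciprocal)

print(ranks)    # [2, 1]
print(mean_rr)  # 0.75
```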

NDCG (Normalized Discounted Cumulative Gain)

Takes into account result order and relevance scores.

DEVELOPERpython
import numpy as np

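def ndcg_at_k(relevances: list[float], k: int) -> float:
    # relevances: graded relevance of each retrieved document, in ranked order
    all_relevances = np.asarray(relevances, dtype=float)
    top_k = all_relevances[:k]

    # DCG over the top k results
    discounts = np.log2(np.arange(2, len(top_k) + 2))
    dcg = np.sum(top_k / discounts)

    # IDCG: best possible ordering of all judged documents, cut at k
    # (sorting only the top k would hide relevant documents ranked below k)
    ideal_relevances = np.sort(all_relevances)[::-1][:k]
    idcg = np.sum(ideal_relevances / discounts)

    return dcg / idcg if idcg > 0 else 0.0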

Common Pitfalls and Solutions

1. Chunks Too Large

Symptom: Retrieval returns vaguely relevant but imprecise documents.

Solution: Reduce chunk size or use hierarchical chunking.

2. Domain Vocabulary

Symptom: Business terms are not well understood by embeddings.

Solution: Fine-tune the embedding model or use a synonym vocabulary.

DEVELOPERpython
synonyms = {
    "ticket": ["request", "inquiry", "incident"],
    "KB": ["knowledge base", "documentation"],
}

def expand_with_synonyms(query: str) -> str:
    for term, syns in synonyms.items():
        if term.lower() in query.lower():
            query += " " + " ".join(syns)
    return query

3. Ambiguous Queries

Symptom: "Problem with my order" returns too many different results.

Solution: Use conversational context or ask for clarification.

4. Cold Start

Symptom: Little data at startup, irrelevant retrieval.

Solution: Enrich with synthetic data or generated FAQs.

Production Architecture

For a production retrieval system, here's a recommended architecture:

┌─────────────────────────────────────────────────────────────┐
│                        API Gateway                           │
└─────────────────────┬───────────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                   Query Processor                            │
│  - Normalization                                             │
│  - Language detection                                        │
│  - Query expansion                                           │
└─────────────────────┬───────────────────────────────────────┘
                      │
         ┌────────────┴────────────┐
         ▼                         ▼
┌─────────────────┐      ┌─────────────────┐
│  Dense Search   │      │  Sparse Search  │
│   (Qdrant)      │      │   (BM25)        │
└────────┬────────┘      └────────┬────────┘
         │                        │
         └──────────┬─────────────┘
                    ▼
         ┌─────────────────┐
         │  Fusion/Rerank  │
         └────────┬────────┘
                  ▼
         ┌─────────────────┐
         │   LLM Context   │
         └─────────────────┘
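
For the Fusion step in the diagram, one common option is Reciprocal Rank Fusion (RRF), which merges ranked lists using only ranks, so dense and sparse scores do not need to share a scale. The diagram does not say which fusion method is intended, so treat this as one possible choice; k=60 is the constant conventionally used with RRF, and the document IDs are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs; each list is ordered best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_hits = ["doc_refund", "doc_returns", "doc_shipping"]   # e.g. from the dense search
sparse_hits = ["doc_returns", "doc_warranty", "doc_refund"]  # e.g. from BM25

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{score:.4f}  {doc_id}")
```

Documents that appear in both lists accumulate two contributions and rise to the top; the fused list can then be passed to a reranker before building the LLM context.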

Next Steps

Now that you've mastered retrieval fundamentals, dive deeper with our specialized guides:

Dense Retrieval: Semantic Search with Embeddings - Dive into advanced embeddings
Sparse Retrieval and BM25 - Discover when lexical search excels
Hybrid Fusion - Combine the best of both worlds

For a comprehensive RAG overview, check our Complete Introduction to RAG.

Put It Into Practice with Ailog

Implementing a performant retrieval system takes time and expertise. With Ailog, get a turnkey RAG infrastructure:

Intelligent chunking optimized for your content type
Multilingual embedding models (native French/English)
Automatic reranking for ultra-precise results
Sovereign hosting in France, GDPR compliant

Try Ailog for free and deploy your first RAG assistant in 3 minutes.
