Metadata Filtering: Refine RAG Search
Master metadata filtering for precise RAG searches. Filter types, indexing, combined queries, and optimization techniques.
Metadata filtering combines the power of vector search with the precision of structured filters. Instead of searching only by semantic similarity, you can constrain results by category, date, author, price, or any other property. This guide explores filtering strategies and their implementation in RAG systems.
Why Metadata Filtering?
Pure vector search has limitations:
Query: "Latest machine learning articles"
Without filtering:
→ Finds old but highly relevant articles (2018, 2020)
→ Misses recent articles that are less semantically optimized
With filtering (year >= 2024):
→ Returns only articles from 2024 onward
→ Semantic relevance + guaranteed freshness
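The contrast above can be reproduced with a toy, in-memory search. This is an illustration only: the article records, the two-dimensional vectors, and the `search` helper are made up for the example, and a real system would delegate both steps to the vector database.

```python
import math

# Hypothetical corpus: (title, year, toy 2-D embedding)
ARTICLES = [
    ("Intro to gradient boosting", 2018, [0.9, 0.1]),
    ("Deep learning survey", 2020, [0.8, 0.2]),
    ("Fine-tuning LLMs in practice", 2024, [0.6, 0.4]),
    ("RAG evaluation notes", 2025, [0.5, 0.5]),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(query_vec, top_k=2, min_year=None):
    """Rank articles by similarity, optionally keeping only year >= min_year."""
    candidates = [a for a in ARTICLES if min_year is None or a[1] >= min_year]
    ranked = sorted(candidates, key=lambda a: cosine(query_vec, a[2]), reverse=True)
    return [title for title, _, _ in ranked[:top_k]]


# Stand-in for the embedding of "Latest machine learning articles"
query = [1.0, 0.0]

unfiltered = search(query)               # the two older articles rank first
filtered = search(query, min_year=2024)  # only the 2024+ articles remain
```

Without the year constraint, the older articles win on similarity alone; with it, ranking happens only among the recent ones.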
Typical Use Cases
| Domain | Useful Metadata | Example Filter |
|---|---|---|
| E-commerce | category, price, stock, rating | category = "electronics" AND price < 500 |
| Documentation | version, language, section | version = "3.x" AND language = "en" |
| Support | status, priority, assignee | status = "open" AND priority = "high" |
| Blog | date, author, tags | date > 2024-01-01 AND tags CONTAINS "rag" |
| HR | department, level, location | department = "engineering" AND location = "NYC" |
Metadata Types and Filters
1. Scalar Filters (equality, comparison)
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient("localhost", port=6333)

# Strict equality
category_filter = Filter(
    must=[
        FieldCondition(
            key="category",
            match=MatchValue(value="electronics")
        )
    ]
)

# Numeric comparison
price_filter = Filter(
    must=[
        FieldCondition(
            key="price",
            range=Range(gte=100, lte=500)  # 100 <= price <= 500
        )
    ]
)

# Boolean
in_stock_filter = Filter(
    must=[
        FieldCondition(
            key="in_stock",
            match=MatchValue(value=True)
        )
    ]
)
```
2. Text Filters (partial matching)
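```python
from qdrant_client.models import Filter, FieldCondition, MatchText

# Full-text match: with a text index on "title", this matches titles that
# contain the token "guide"; without a text index, Qdrant falls back to a
# substring match.
title_filter = Filter(
    must=[
        FieldCondition(
            key="title",
            match=MatchText(text="guide")
        )
    ]
)

# Product-code match: MatchText does not anchor to the start of the value,
# so this matches codes containing "SKU-2024". For prefix-style matching,
# consider a text index on the field that uses the prefix tokenizer.
prefix_filter = Filter(
    must=[
        FieldCondition(
            key="product_code",
            match=MatchText(text="SKU-2024")
        )
    ]
)
```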
3. Array Filters
```python
from qdrant_client.models import Filter, FieldCondition, MatchAny, MatchValue

# Document with at least one of the tags
tags_filter = Filter(
    must=[
        FieldCondition(
            key="tags",
            match=MatchAny(any=["rag", "llm", "embeddings"])
        )
    ]
)

# Document with ALL tags (one must condition per tag)
all_tags_filter = Filter(
    must=[
        FieldCondition(key="tags", match=MatchValue(value="rag")),
        FieldCondition(key="tags", match=MatchValue(value="production"))
    ]
)
```
4. Temporal Filters
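```python
from datetime import datetime, timedelta, timezone

from qdrant_client.models import DatetimeRange, FieldCondition, Filter

# Documents from the last 7 days (timezone-aware, so the bounds are unambiguous)
now = datetime.now(timezone.utc)
week_ago = now - timedelta(days=7)

recent_filter = Filter(
    must=[
        FieldCondition(
            key="created_at",
            # DatetimeRange is the datetime counterpart of the numeric Range
            range=DatetimeRange(gte=week_ago, lte=now)
        )
    ]
)

# Documents from a specific year
year_filter = Filter(
    must=[
        FieldCondition(
            key="published_date",
            range=DatetimeRange(
                gte=datetime(2024, 1, 1, tzinfo=timezone.utc),
                lt=datetime(2025, 1, 1, tzinfo=timezone.utc)
            )
        )
    ]
)
```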
5. Geographic Filters
```python
from qdrant_client.models import Filter, FieldCondition, GeoRadius, GeoPoint

# Documents within 10km radius of New York
geo_filter = Filter(
    must=[
        FieldCondition(
            key="location",
            geo_radius=GeoRadius(
                center=GeoPoint(lat=40.7128, lon=-74.0060),
                radius=10000  # meters
            )
        )
    ]
)
```
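Conceptually, a radius filter keeps the points whose great-circle distance from the center is at most `radius` meters. The sketch below illustrates that idea with the haversine formula on a spherical Earth. It is not Qdrant's implementation; the helper names, the mean-radius constant, and the two sample coordinates are assumptions made for the example.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (assumption of this sketch)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(point, center, radius_m):
    """True if `point` lies within `radius_m` meters of `center`."""
    return haversine_m(point[0], point[1], center[0], center[1]) <= radius_m


NEW_YORK = (40.7128, -74.0060)
jersey_city = (40.7178, -74.0431)  # roughly 3 km from the center
newark = (40.7357, -74.1724)       # roughly 14 km from the center

near = within_radius(jersey_city, NEW_YORK, 10_000)
far = within_radius(newark, NEW_YORK, 10_000)
```

With a 10 km radius, the first point would be kept and the second dropped.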
Logical Operators
AND Combination (must)
```python
# All criteria must be satisfied
combined_filter = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="electronics")),
        FieldCondition(key="price", range=Range(lte=500)),
        FieldCondition(key="in_stock", match=MatchValue(value=True)),
        FieldCondition(key="rating", range=Range(gte=4.0))
    ]
)
```
OR Combination (should)
```python
# At least one criterion must be satisfied
or_filter = Filter(
    should=[
        FieldCondition(key="brand", match=MatchValue(value="Apple")),
        FieldCondition(key="brand", match=MatchValue(value="Samsung")),
        FieldCondition(key="brand", match=MatchValue(value="Google"))
    ]
)
```
Exclusion (must_not)
```python
# Exclude certain results
exclusion_filter = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="phones"))
    ],
    must_not=[
        FieldCondition(key="brand", match=MatchValue(value="Nokia")),
        FieldCondition(key="status", match=MatchValue(value="discontinued"))
    ]
)
```
Complex Combinations
```python
# (category = phones AND price < 1000)
# AND (brand = Apple OR brand = Samsung)
# AND NOT refurbished
complex_filter = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="phones")),
        FieldCondition(key="price", range=Range(lt=1000))
    ],
    should=[
        FieldCondition(key="brand", match=MatchValue(value="Apple")),
        FieldCondition(key="brand", match=MatchValue(value="Samsung"))
    ],
    must_not=[
        FieldCondition(key="condition", match=MatchValue(value="refurbished"))
    ]
)
```
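To check a combined filter by hand, it can help to evaluate the same logic on plain dictionaries. The sketch below assumes the reading used in the comment above: every `must` condition holds, at least one `should` condition holds when any are given, and no `must_not` condition holds. The `matches`, `eq`, and `lt` helpers and the sample products are hypothetical and exist only for this illustration.

```python
def matches(payload, must=(), should=(), must_not=()):
    """Return True if the payload passes all three clause groups."""
    if not all(cond(payload) for cond in must):
        return False
    # An empty `should` group imposes no constraint
    if should and not any(cond(payload) for cond in should):
        return False
    return not any(cond(payload) for cond in must_not)


def eq(key, value):
    """Condition: payload[key] equals value."""
    return lambda p: p.get(key) == value


def lt(key, value):
    """Condition: payload[key] exists and is strictly less than value."""
    return lambda p: p.get(key) is not None and p[key] < value


clauses = dict(
    must=[eq("category", "phones"), lt("price", 1000)],
    should=[eq("brand", "Apple"), eq("brand", "Samsung")],
    must_not=[eq("condition", "refurbished")],
)

# Made-up products, each failing a different clause group except the first
products = [
    {"id": 1, "category": "phones", "price": 899, "brand": "Apple", "condition": "new"},
    {"id": 2, "category": "phones", "price": 799, "brand": "Nokia", "condition": "new"},
    {"id": 3, "category": "phones", "price": 699, "brand": "Samsung", "condition": "refurbished"},
    {"id": 4, "category": "phones", "price": 1299, "brand": "Apple", "condition": "new"},
]

kept = [p["id"] for p in products if matches(p, **clauses)]
```

Product 2 fails the `should` group, product 3 the `must_not` group, and product 4 the price condition in `must`.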
Implementation in a Retriever
```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchAny, MatchValue, Range
from sentence_transformers import SentenceTransformer


class MetadataFilteredRetriever:
    def __init__(self, collection: str):
        self.client = QdrantClient("localhost", port=6333)
        self.collection = collection
        self.embedder = SentenceTransformer("BAAI/bge-m3")

    def search(
        self,
        query: str,
        filters: dict | None = None,
        top_k: int = 5
    ) -> list[dict]:
        # Encode query
        query_embedding = self.embedder.encode(query)

        # Build filter
        qdrant_filter = self._build_filter(filters) if filters else None

        # Vector search with filters
        # (recent qdrant-client releases may deprecate `search` in favor of `query_points`)
        results = self.client.search(
            collection_name=self.collection,
            query_vector=query_embedding.tolist(),
            query_filter=qdrant_filter,
            limit=top_k
        )

        return [
            {
                "id": hit.id,
                "content": hit.payload.get("content"),
                "metadata": {k: v for k, v in hit.payload.items() if k != "content"},
                "score": hit.score
            }
            for hit in results
        ]

    def _build_filter(self, filters: dict) -> Filter:
        """
        Converts a simple dictionary to a Qdrant filter

        Supported syntax:
        - {"category": "electronics"} → equality
        - {"price__lt": 500} → less than
        - {"price__gte": 100} → greater or equal
        - {"tags__contains": "rag"} → contains
        - {"brand__in": ["Apple", "Samsung"]} → in list
        - {"status__not": "draft"} → not equal
        """
        must_conditions = []
        must_not_conditions = []

        for key, value in filters.items():
            # Parse operators: split on the last "__" so field names may contain it
            if "__" in key:
                field, operator = key.rsplit("__", 1)
            else:
                field, operator = key, "eq"

            condition = self._create_condition(field, operator, value)

            # "not" is an equality condition placed in must_not
            if operator == "not":
                must_not_conditions.append(condition)
            else:
                must_conditions.append(condition)

        return Filter(
            must=must_conditions if must_conditions else None,
            must_not=must_not_conditions if must_not_conditions else None
        )

    def _create_condition(self, field: str, operator: str, value) -> FieldCondition:
        if operator == "eq":
            return FieldCondition(key=field, match=MatchValue(value=value))
        elif operator == "lt":
            return FieldCondition(key=field, range=Range(lt=value))
        elif operator == "lte":
            return FieldCondition(key=field, range=Range(lte=value))
        elif operator == "gt":
            return FieldCondition(key=field, range=Range(gt=value))
        elif operator == "gte":
            return FieldCondition(key=field, range=Range(gte=value))
        elif operator == "in":
            return FieldCondition(key=field, match=MatchAny(any=value))
        elif operator == "contains":
            # On an array field, MatchValue matches if any element equals the value
            return FieldCondition(key=field, match=MatchValue(value=value))
        elif operator == "not":
            return FieldCondition(key=field, match=MatchValue(value=value))
        else:
            raise ValueError(f"Unknown operator: {operator}")


# Usage
retriever = MetadataFilteredRetriever("products")

results = retriever.search(
    query="high-end smartphone",
    filters={
        "category": "phones",
        "price__lte": 1000,
        "rating__gte": 4.5,
        "brand__in": ["Apple", "Samsung", "Google"],
        "status__not": "discontinued"
    },
    top_k=5
)
```
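The double-underscore keys are split on the last `__`, which means a field name that itself contains `__` (for example `user__name`) would be read as field `user` with operator `name` and rejected as an unknown operator. One possible way to soften this, shown as a sketch rather than as part of the retriever above, is to treat the suffix as an operator only when it is a known one. The `parse_filter_key` helper and the `OPERATORS` set are hypothetical names introduced for this example.

```python
# Operators understood by the dictionary syntax described above
OPERATORS = {"eq", "lt", "lte", "gt", "gte", "in", "contains", "not"}


def parse_filter_key(key):
    """Split 'field__operator' into (field, operator), defaulting to equality.

    The suffix after the last '__' counts as an operator only if it is known;
    otherwise the whole key is treated as a field name.
    """
    field, sep, suffix = key.rpartition("__")
    if sep and field and suffix in OPERATORS:
        return field, suffix
    return key, "eq"


parsed = [
    parse_filter_key("category"),          # plain field, implicit equality
    parse_filter_key("price__lte"),        # field with operator
    parse_filter_key("user__name"),        # '__' inside the field name
    parse_filter_key("meta__author__in"),  # nested-looking field plus operator
]
```

The trade-off is that a misspelled operator such as `price__lse` silently becomes an equality check on a field named `price__lse`, so stricter validation may be preferable where filters come from untrusted input.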
Metadata Indexing
Creating a Collection with Indices
```python
from qdrant_client.models import (
    Distance,
    PayloadSchemaType,
    TextIndexParams,
    TokenizerType,
    VectorParams
)

# Create the collection (`client` is the QdrantClient created earlier)
client.create_collection(
    collection_name="products",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

# Add indices on frequently filtered fields
client.create_payload_index(
    collection_name="products",
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD
)

client.create_payload_index(
    collection_name="products",
    field_name="price",
    field_schema=PayloadSchemaType.FLOAT
)

client.create_payload_index(
    collection_name="products",
    field_name="brand",
    field_schema=PayloadSchemaType.KEYWORD
)

client.create_payload_index(
    collection_name="products",
    field_name="created_at",
    field_schema=PayloadSchemaType.DATETIME
)

# Full-text index for title search
client.create_payload_index(
    collection_name="products",
    field_name="title",
    field_schema=TextIndexParams(
        type="text",
        tokenizer=TokenizerType.WORD,
        min_token_len=2,
        max_token_len=20
    )
)
```
Indexing Best Practices
| Field Type | Recommended Index | Usage |
|---|---|---|
| Category, status | Keyword | Equality, IN |
| Price, quantity | Float/Integer | Numeric comparisons |
| Date | Datetime | Temporal range |
| Free text | Text | Full-text search |
| Tags (array) | Keyword | Contains, Any |
| Boolean | Bool | Exact match |
Performance Optimization
Prefiltering vs Postfiltering
```python
import operator


class OptimizedFilteredRetriever(MetadataFilteredRetriever):
    # Comparison operators of the double-underscore filter syntax
    _COMPARATORS = {
        "lt": operator.lt, "lte": operator.le,
        "gt": operator.gt, "gte": operator.ge,
    }

    def search(
        self,
        query: str,
        filters: dict | None = None,
        top_k: int = 5,
        prefetch_multiplier: int = 3
    ) -> list[dict]:
        """
        Optimized strategy:
        1. Prefiltering if filters are very selective
        2. Postfiltering if filters are permissive
        """
        if not filters:
            return super().search(query, None, top_k)

        # Estimate filter selectivity
        selectivity = self._estimate_selectivity(filters)

        if selectivity < 0.1:  # < 10% of documents
            # Prefiltering: filter then search
            return self._prefilter_search(query, filters, top_k)
        else:
            # Postfiltering: search more then filter
            return self._postfilter_search(query, filters, top_k, prefetch_multiplier)

    def _prefilter_search(self, query: str, filters: dict, top_k: int) -> list[dict]:
        """Apply filters before vector search"""
        return super().search(query, filters, top_k)

    def _postfilter_search(
        self, query: str, filters: dict, top_k: int, multiplier: int
    ) -> list[dict]:
        """Retrieve more results then filter locally (may return fewer than top_k)"""
        # Broad search
        results = super().search(query, None, top_k * multiplier)
        # Local filtering
        filtered = [r for r in results if self._matches_filters(r["metadata"], filters)]
        return filtered[:top_k]

    def _matches_filters(self, payload: dict, filters: dict) -> bool:
        """Evaluate the dictionary filter syntax against a single payload"""
        for key, expected in filters.items():
            field, op = key.rsplit("__", 1) if "__" in key else (key, "eq")
            actual = payload.get(field)
            values = actual if isinstance(actual, list) else [actual]
            if op in ("eq", "contains"):
                ok = expected in values
            elif op == "not":
                ok = expected not in values
            elif op == "in":
                ok = any(v in expected for v in values)
            elif op in self._COMPARATORS:
                ok = actual is not None and self._COMPARATORS[op](actual, expected)
            else:
                raise ValueError(f"Unknown operator: {op}")
            if not ok:
                return False
        return True

    def _estimate_selectivity(self, filters: dict) -> float:
        """Estimate the fraction of documents that pass the filters"""
        total = self.client.count(collection_name=self.collection).count
        matching = self.client.count(
            collection_name=self.collection,
            count_filter=self._build_filter(filters)
        ).count
        return matching / total if total > 0 else 0.0
```
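The reason selectivity matters can be seen in a small simulation. The sketch below uses a synthetic collection in which similarity simply decreases with the document id; the `prefilter` and `postfilter` helpers and the two example predicates are assumptions made for illustration and do not call any vector database.

```python
def prefilter(docs, passes, top_k):
    """Filter first, then take the best-scoring survivors."""
    survivors = [d for d in docs if passes(d)]
    return sorted(survivors, key=lambda d: d["score"], reverse=True)[:top_k]


def postfilter(docs, passes, top_k, multiplier):
    """Take the best top_k * multiplier candidates, then filter them."""
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    candidates = ranked[: top_k * multiplier]
    return [d for d in candidates if passes(d)][:top_k]


# Synthetic collection: 1000 documents, similarity decreasing with id
docs = [{"id": i, "score": 1 - i / 1000} for i in range(1000)]


def rare(d):
    """Selective filter: passes 2% of documents."""
    return d["id"] % 50 == 0


def common(d):
    """Permissive filter: passes 50% of documents."""
    return d["id"] % 2 == 0


rare_pre = [d["id"] for d in prefilter(docs, rare, top_k=5)]
rare_post = [d["id"] for d in postfilter(docs, rare, top_k=5, multiplier=3)]
common_pre = [d["id"] for d in prefilter(docs, common, top_k=5)]
common_post = [d["id"] for d in postfilter(docs, common, top_k=5, multiplier=3)]
```

With the selective filter, postfiltering inspects only 15 candidates and finds a single match, while prefiltering returns the full five. With the permissive filter, both approaches return the same five documents.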
Frequent Filter Caching
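```python
import hashlib
import json

from qdrant_client.models import Filter, HasIdCondition


class CachedFilterRetriever:
    """Caches the IDs matching each filter; best for selective, slowly changing filters."""

    def __init__(self, collection: str, cache_size: int = 100):
        self.base_retriever = MetadataFilteredRetriever(collection)
        self.cache_size = cache_size
        self._filter_cache: dict[str, list] = {}

    def search(self, query: str, filters: dict, top_k: int = 5) -> list[dict]:
        # Create cache key based on filters
        filter_key = self._hash_filters(filters)
        if filter_key not in self._filter_cache:
            if len(self._filter_cache) >= self.cache_size:
                # Evict the oldest entry (dicts preserve insertion order)
                self._filter_cache.pop(next(iter(self._filter_cache)))
            # Cache every ID matching the filter, independent of the query
            self._filter_cache[filter_key] = self._collect_ids(filters)
        # Vector search only among cached IDs
        return self._search_in_ids(query, self._filter_cache[filter_key], top_k)

    def _collect_ids(self, filters: dict) -> list:
        """Scroll through all points that match the filter and collect their IDs"""
        base = self.base_retriever
        ids, offset = [], None
        while True:
            points, offset = base.client.scroll(
                collection_name=base.collection,
                scroll_filter=base._build_filter(filters),
                limit=1000, offset=offset,
                with_payload=False, with_vectors=False,
            )
            ids.extend(point.id for point in points)
            if offset is None:  # no more pages
                return ids

    def _search_in_ids(self, query: str, ids: list, top_k: int) -> list[dict]:
        if not ids:
            return []
        base = self.base_retriever
        hits = base.client.search(
            collection_name=base.collection,
            query_vector=base.embedder.encode(query).tolist(),
            query_filter=Filter(must=[HasIdCondition(has_id=ids)]),
            limit=top_k,
        )
        return [
            {"id": h.id, "content": h.payload.get("content"), "score": h.score,
             "metadata": {k: v for k, v in h.payload.items() if k != "content"}}
            for h in hits
        ]

    def _hash_filters(self, filters: dict) -> str:
        return hashlib.md5(json.dumps(filters, sort_keys=True).encode()).hexdigest()
```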
Dynamic Filters
Building Filters from User Interface
```python
from fastapi import FastAPI, Query


class DynamicFilterBuilder:
    def __init__(self, schema: dict):
        """
        schema = {
            "category": {"type": "keyword", "options": ["phones", "laptops", ...]},
            "price": {"type": "range", "min": 0, "max": 5000},
            "brand": {"type": "multi_select", "options": [...]},
            "in_stock": {"type": "boolean"}
        }
        """
        self.schema = schema

    def build_from_ui(self, ui_params: dict) -> dict:
        """Convert UI parameters to filters"""
        filters = {}

        for field, value in ui_params.items():
            # Ignore parameters that are not declared in the schema
            if field not in self.schema:
                continue

            field_type = self.schema[field]["type"]

            if field_type == "keyword" and value:
                filters[field] = value

            elif field_type == "range" and value:
                if value.get("min") is not None:
                    filters[f"{field}__gte"] = value["min"]
                if value.get("max") is not None:
                    filters[f"{field}__lte"] = value["max"]

            elif field_type == "multi_select" and value:
                filters[f"{field}__in"] = value

            elif field_type == "boolean":
                # False is a valid filter value, so only skip None
                if value is not None:
                    filters[field] = value

        return filters


# Usage from REST API
# `product_schema` and `retriever` are assumed to be defined elsewhere in the application
app = FastAPI()


@app.get("/search")
def search(
    q: str,
    category: str | None = None,
    price_min: float | None = None,
    price_max: float | None = None,
    brands: list[str] = Query(default=[]),
    in_stock: bool | None = None
):
    filter_builder = DynamicFilterBuilder(product_schema)

    ui_params = {
        "category": category,
        "price": {"min": price_min, "max": price_max},
        "brand": brands,
        "in_stock": in_stock
    }

    filters = filter_builder.build_from_ui(ui_params)

    return retriever.search(q, filters=filters)
```
Filter Monitoring
```python
from datetime import datetime


class FilterAnalytics:
    def __init__(self, analytics_client):
        self.analytics = analytics_client

    def log_filter_usage(
        self,
        filters: dict,
        results_count: int,
        latency_ms: float
    ):
        self.analytics.track("filter_usage", {
            "filters": filters,
            "filter_count": len(filters),
            "results_count": results_count,
            "latency_ms": latency_ms,
            "timestamp": datetime.now().isoformat()
        })

    def get_popular_filters(self, days: int = 7) -> list[tuple[str, int]]:
        """Identify most used filter fields, most frequent first"""
        usages = self.analytics.query("filter_usage", days=days)

        filter_counts = {}
        for usage in usages:
            for field in usage["filters"].keys():
                filter_counts[field] = filter_counts.get(field, 0) + 1

        return sorted(filter_counts.items(), key=lambda x: x[1], reverse=True)

    def get_empty_result_filters(self, days: int = 7) -> list[dict]:
        """Identify filters that return no results"""
        usages = self.analytics.query("filter_usage", days=days)
        return [u for u in usages if u["results_count"] == 0]
```
Next Steps
Metadata filtering significantly refines your RAG searches. To go further:
- Self-Query Retrieval - Let the LLM extract filters
- Query Routing - Route based on metadata
- Retrieval Fundamentals - Overview
Intelligent Filtering with Ailog
Ailog implements metadata filtering transparently:
- Automatic indexing of relevant fields
- Filter extraction from natural language queries
- Dynamic optimization of prefiltering/postfiltering
- Integrated filter interface for your users
Try for free and refine your searches with powerful filters.