Metadata Filtering: Refining RAG Search
Master metadata filtering for precise RAG search: filter types, indexing, combined queries, and optimization.
Metadata filtering combines the power of vector search with the precision of structured filters. Instead of ranking by semantic similarity alone, you can restrict results by category, date, author, price, or any other property. This guide covers filtering strategies and how to implement them in RAG systems.
Why Metadata Filters?
Pure vector search has limits:
Query: "Latest articles on machine learning"
Without filtering:
→ Finds older articles that are highly relevant but dated (2018, 2020)
→ Misses recent articles that are a weaker semantic match
With filtering (year >= 2024):
→ Finds only articles from 2024 onward
→ Semantic relevance plus guaranteed freshness
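The effect can be reproduced with a toy in-memory index. The sketch below is illustrative only: the three documents, their two-dimensional vectors, and the `search` helper are made up for this example and stand in for a real vector store.

```python
import math

# Hypothetical mini-corpus: (title, year, embedding)
DOCS = [
    ("Intro to ML (classic)", 2018, [0.9, 0.1]),
    ("Deep learning survey", 2020, [0.8, 0.2]),
    ("ML trends this year", 2024, [0.6, 0.4]),
]


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search(query_vec, top_k=2, min_year=None):
    """Rank by cosine similarity, optionally keeping only year >= min_year."""
    candidates = [d for d in DOCS if min_year is None or d[1] >= min_year]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d[2]), reverse=True)
    return [title for title, _, _ in ranked[:top_k]]


query = [1.0, 0.0]
unfiltered = search(query)               # dominated by the older articles
filtered = search(query, min_year=2024)  # only the recent article remains
```

In this toy data the older articles are the closest semantic matches, so the 2024 article appears only once the year constraint is applied.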
Typical Use Cases
| Domain | Useful metadata | Example filter |
|---|---|---|
| E-commerce | category, price, stock, rating | category = "electronics" AND price < 500 |
| Documentation | version, language, section | version = "3.x" AND language = "fr" |
| Support | status, priority, assignee | status = "open" AND priority = "high" |
| Blog | date, author, tags | date > 2024-01-01 AND tags CONTAINS "rag" |
| HR | department, level, location | department = "engineering" AND location = "Paris" |
Types of Metadata and Filters
1. Scalar Filters (Equality, Comparison)
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient("localhost", port=6333)

# Strict equality
category_filter = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="electronics"))
    ]
)

# Numeric comparison: 100 <= price <= 500
price_filter = Filter(
    must=[
        FieldCondition(key="price", range=Range(gte=100, lte=500))
    ]
)

# Boolean
in_stock_filter = Filter(
    must=[
        FieldCondition(key="in_stock", match=MatchValue(value=True))
    ]
)
```
2. Text Filters (Partial Match)
```python
from qdrant_client.models import MatchText

# Text match: the title contains "guide"
title_filter = Filter(
    must=[
        FieldCondition(key="title", match=MatchText(text="guide"))
    ]
)

# Match on a product code such as "SKU-2024...".
# Note: how MatchText behaves depends on the field's index. Without a
# full-text index it acts as a substring match; a strict prefix match
# needs a text index configured with a prefix tokenizer.
prefix_filter = Filter(
    must=[
        FieldCondition(key="product_code", match=MatchText(text="SKU-2024"))
    ]
)
```
3. Filters on Arrays
```python
from qdrant_client.models import MatchAny

# Documents with at least one of the tags
tags_filter = Filter(
    must=[
        FieldCondition(key="tags", match=MatchAny(any=["rag", "llm", "embeddings"]))
    ]
)

# Documents with ALL of the tags (one must condition per tag)
all_tags_filter = Filter(
    must=[
        FieldCondition(key="tags", match=MatchValue(value="rag")),
        FieldCondition(key="tags", match=MatchValue(value="production")),
    ]
)
```
4. Temporal Filters
```python
from datetime import datetime, timedelta, timezone

from qdrant_client.models import DatetimeRange

# Documents from the last 7 days (timezone-aware UTC timestamps)
now = datetime.now(timezone.utc)
week_ago = now - timedelta(days=7)

recent_filter = Filter(
    must=[
        FieldCondition(
            key="created_at",
            # Range is numeric-only; datetime bounds go in DatetimeRange
            range=DatetimeRange(gte=week_ago, lte=now),
        )
    ]
)

# Documents from a specific year (half-open interval: gte start, lt next year)
year_filter = Filter(
    must=[
        FieldCondition(
            key="published_date",
            range=DatetimeRange(
                gte="2024-01-01T00:00:00Z",
                lt="2025-01-01T00:00:00Z",
            ),
        )
    ]
)
```
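The year filter uses a half-open interval (`gte` the first instant of the year, `lt` the first instant of the next one), which avoids missing timestamps in the final second of December. As a small standard-library sketch of that boundary logic, with `year_bounds` and `in_half_open_range` being hypothetical helper names:

```python
from datetime import datetime, timezone


def year_bounds(year: int) -> tuple[str, str]:
    """Return UTC timestamps for [start of year, start of next year)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    return start.strftime(fmt), end.strftime(fmt)


def _parse(ts: str) -> datetime:
    # Replace the trailing "Z" so older Python versions can parse it
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def in_half_open_range(ts: str, start: str, end: str) -> bool:
    """Check start <= ts < end on parsed, timezone-aware timestamps."""
    return _parse(start) <= _parse(ts) < _parse(end)


start_2024, end_2024 = year_bounds(2024)
last_second = in_half_open_range("2024-12-31T23:59:59Z", start_2024, end_2024)
next_year = in_half_open_range("2025-01-01T00:00:00Z", start_2024, end_2024)
```

The two strings returned by `year_bounds(2024)` are the same bounds used in the year filter above.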
5. Geographic Filters
```python
from qdrant_client.models import GeoRadius, GeoPoint

# Documents within a 10 km radius of Paris
geo_filter = Filter(
    must=[
        FieldCondition(
            key="location",
            geo_radius=GeoRadius(
                center=GeoPoint(lat=48.8566, lon=2.3522),
                radius=10000,  # meters
            ),
        )
    ]
)
```
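To sanity-check which points a radius filter should keep, the distance can be computed locally. The following is a sketch using the haversine great-circle formula with a mean Earth radius; the database's own distance computation may differ slightly, and the two test coordinates (Eiffel Tower, Versailles) are approximate values chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
PARIS = (48.8566, 2.3522)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(point, center=PARIS, radius_m=10_000):
    """True if `point` lies within `radius_m` meters of `center`."""
    return haversine_m(*center, *point) <= radius_m


eiffel_tower = (48.8584, 2.2945)  # roughly 4 km from the center point
versailles = (48.8049, 2.1204)    # roughly 18 km from the center point
```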
Logical Operators
AND Combination (must)
```python
# All criteria must be satisfied
combined_filter = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="electronics")),
        FieldCondition(key="price", range=Range(lte=500)),
        FieldCondition(key="in_stock", match=MatchValue(value=True)),
        FieldCondition(key="rating", range=Range(gte=4.0)),
    ]
)
```
OR Combination (should)
```python
# At least one criterion must be satisfied
or_filter = Filter(
    should=[
        FieldCondition(key="brand", match=MatchValue(value="Apple")),
        FieldCondition(key="brand", match=MatchValue(value="Samsung")),
        FieldCondition(key="brand", match=MatchValue(value="Google")),
    ]
)
```
Exclusion (must_not)
```python
# Exclude specific results
exclusion_filter = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="phones"))
    ],
    must_not=[
        FieldCondition(key="brand", match=MatchValue(value="Nokia")),
        FieldCondition(key="status", match=MatchValue(value="discontinued")),
    ],
)
```
Complex Combinations
```python
# (category = phones AND price < 1000)
# AND (brand = Apple OR brand = Samsung)
# AND NOT refurbished
complex_filter = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="phones")),
        FieldCondition(key="price", range=Range(lt=1000)),
    ],
    should=[
        FieldCondition(key="brand", match=MatchValue(value="Apple")),
        FieldCondition(key="brand", match=MatchValue(value="Samsung")),
    ],
    must_not=[
        FieldCondition(key="condition", match=MatchValue(value="refurbished"))
    ],
)
```
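To reason about how the three clause groups interact, it can help to evaluate them by hand. The sketch below is a plain-Python stand-in, not Qdrant code. It assumes that `must`, `should`, and `must_not` given in one filter are combined conjunctively (every `must`, at least one `should` if any are given, no `must_not`), and it models each condition as a hypothetical predicate over a payload dictionary.

```python
def matches(payload, must=(), should=(), must_not=()):
    """Return True if the payload passes all three clause groups."""
    return (
        all(cond(payload) for cond in must)
        and (not should or any(cond(payload) for cond in should))
        and not any(cond(payload) for cond in must_not)
    )


def eq(key, value):
    """Predicate: payload[key] equals value."""
    return lambda p: p.get(key) == value


def lt(key, value):
    """Predicate: payload[key] exists and is less than value."""
    return lambda p: key in p and p[key] < value


phone_filter = dict(
    must=[eq("category", "phones"), lt("price", 1000)],
    should=[eq("brand", "Apple"), eq("brand", "Samsung")],
    must_not=[eq("condition", "refurbished")],
)

# Made-up products, each violating at most one clause group
products = [
    {"name": "A", "category": "phones", "price": 899, "brand": "Apple", "condition": "new"},
    {"name": "B", "category": "phones", "price": 499, "brand": "Nokia", "condition": "new"},
    {"name": "C", "category": "phones", "price": 799, "brand": "Samsung", "condition": "refurbished"},
    {"name": "D", "category": "phones", "price": 1299, "brand": "Apple", "condition": "new"},
]
kept = [p["name"] for p in products if matches(p, **phone_filter)]
```

Product B fails the brand alternatives, C is excluded as refurbished, and D exceeds the price bound, so only A passes.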
Implementation in a Retriever
```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchAny, MatchValue, Range
from sentence_transformers import SentenceTransformer


class MetadataFilteredRetriever:
    def __init__(self, collection: str):
        self.client = QdrantClient("localhost", port=6333)
        self.collection = collection
        self.embedder = SentenceTransformer("BAAI/bge-m3")

    def search(
        self, query: str, filters: dict | None = None, top_k: int = 5
    ) -> list[dict]:
        # Encode the query
        query_embedding = self.embedder.encode(query)

        # Build the filter
        qdrant_filter = self._build_filter(filters) if filters else None

        # Vector search with filters
        results = self.client.search(
            collection_name=self.collection,
            query_vector=query_embedding.tolist(),
            query_filter=qdrant_filter,
            limit=top_k,
        )

        return [
            {
                "id": hit.id,
                "content": hit.payload.get("content"),
                "metadata": {k: v for k, v in hit.payload.items() if k != "content"},
                "score": hit.score,
            }
            for hit in results
        ]

    def _build_filter(self, filters: dict) -> Filter:
        """
        Convert a simple dictionary into a Qdrant filter.

        Supported syntax:
        - {"category": "electronics"} → equality
        - {"price__lt": 500} → less than
        - {"price__gte": 100} → greater than or equal
        - {"tags__contains": "rag"} → array contains
        - {"brand__in": ["Apple", "Samsung"]} → in the list
        - {"status__not": "draft"} → not equal
        """
        must_conditions = []
        must_not_conditions = []

        for key, value in filters.items():
            # Parse the operator suffix
            if "__" in key:
                field, operator = key.rsplit("__", 1)
            else:
                field, operator = key, "eq"

            condition = self._create_condition(field, operator, value)
            if operator == "not":
                must_not_conditions.append(condition)
            else:
                must_conditions.append(condition)

        return Filter(
            must=must_conditions or None,
            must_not=must_not_conditions or None,
        )

    def _create_condition(self, field: str, operator: str, value) -> FieldCondition:
        if operator in ("eq", "contains", "not"):
            # MatchValue on an array field matches if any element equals the value;
            # "not" conditions are placed in must_not by _build_filter
            return FieldCondition(key=field, match=MatchValue(value=value))
        if operator in ("lt", "lte", "gt", "gte"):
            return FieldCondition(key=field, range=Range(**{operator: value}))
        if operator == "in":
            return FieldCondition(key=field, match=MatchAny(any=value))
        raise ValueError(f"Unknown operator: {operator}")


# Usage
retriever = MetadataFilteredRetriever("products")
results = retriever.search(
    query="high-end smartphone",
    filters={
        "category": "phones",
        "price__lte": 1000,
        "rating__gte": 4.5,
        "brand__in": ["Apple", "Samsung", "Google"],
        "status__not": "discontinued",
    },
    top_k=5,
)
```
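One edge case of the `field__operator` convention is a field whose own name contains a double underscore. As a hedged sketch of one way to handle it, the hypothetical helper below only treats the suffix as an operator when it is in a known set, and otherwise falls back to equality on the whole key. It is a standalone illustration, not part of the retriever above.

```python
# Operator suffixes recognized by the dictionary syntax
OPERATORS = {"eq", "lt", "lte", "gt", "gte", "in", "contains", "not"}


def parse_filter_key(key: str) -> tuple[str, str]:
    """Split 'field__operator'; unknown suffixes are treated as part of the field."""
    field, sep, suffix = key.rpartition("__")
    if sep and suffix in OPERATORS:
        return field, suffix
    return key, "eq"


parsed = [
    parse_filter_key(k)
    for k in ("category", "price__lte", "meta__author__in", "created__at")
]
```

With this rule, `created__at` is read as a field name with implicit equality rather than as an unknown `at` operator.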
Indexing Metadata
Creating a Collection with Indexes
```python
from qdrant_client.models import (
    Distance,
    PayloadSchemaType,
    TextIndexParams,
    TokenizerType,
    VectorParams,
)

# Create the collection
client.create_collection(
    collection_name="products",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Add indexes for frequently filtered fields
for field_name, field_schema in [
    ("category", PayloadSchemaType.KEYWORD),
    ("price", PayloadSchemaType.FLOAT),
    ("brand", PayloadSchemaType.KEYWORD),
    ("created_at", PayloadSchemaType.DATETIME),
]:
    client.create_payload_index(
        collection_name="products",
        field_name=field_name,
        field_schema=field_schema,
    )

# Full-text index for searching in the title
client.create_payload_index(
    collection_name="products",
    field_name="title",
    field_schema=TextIndexParams(
        type="text",
        tokenizer=TokenizerType.WORD,
        min_token_len=2,
        max_token_len=20,
    ),
)
```
Indexing Best Practices
| Field type | Recommended index | Usage |
|---|---|---|
| Category, status | Keyword | Equality, IN |
| Price, quantity | Float/Integer | Numeric comparisons |
| Date | Datetime | Time ranges |
| Free text | Text | Full-text search |
| Tags (array) | Keyword | Contains, Any |
| Boolean | Bool | Exact match |
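When a payload schema is not documented, a rough first guess at index types can be derived from a sample record. The heuristic below is only a sketch: `suggest_index` is a hypothetical helper, the returned strings are illustrative labels rather than library constants, it assumes the store offers a dedicated boolean index, and the "contains a space means free text" rule is a simplification.

```python
from datetime import datetime


def suggest_index(value) -> str:
    """Suggest an index type label for one sample payload value (heuristic)."""
    if isinstance(value, list):
        # Arrays are indexed by element type; default to keyword when empty
        return suggest_index(value[0]) if value else "keyword"
    if isinstance(value, bool):  # check before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value.replace("Z", "+00:00"))
            return "datetime"
        except ValueError:
            pass
        return "text" if " " in value.strip() else "keyword"
    raise TypeError(f"Unsupported payload value type: {type(value).__name__}")


sample = {
    "category": "electronics",
    "price": 499.0,
    "stock": 3,
    "in_stock": True,
    "tags": ["rag", "llm"],
    "created_at": "2024-03-01T12:00:00Z",
    "title": "Wireless noise-cancelling headphones",
}
suggested = {field: suggest_index(value) for field, value in sample.items()}
```

The result is a starting point for review, since only fields that are actually filtered on need an index.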
Performance Optimization
Prefiltering vs. Postfiltering
```python
class OptimizedFilteredRetriever(MetadataFilteredRetriever):
    """Reuses the client, embedder, and _build_filter of MetadataFilteredRetriever."""

    def search(
        self,
        query: str,
        filters: dict,
        top_k: int = 5,
        prefetch_multiplier: int = 3,
    ) -> list:
        """
        Optimized strategy:
        1. Prefilter when the filters are highly selective
        2. Postfilter when the filters are permissive
        """
        # Estimate the selectivity of the filters
        selectivity = self._estimate_selectivity(filters)

        if selectivity < 0.1:  # fewer than 10% of documents match
            # Prefiltering: filter first, then search
            return self._prefilter_search(query, filters, top_k)
        # Postfiltering: fetch more results, then filter
        return self._postfilter_search(query, filters, top_k, prefetch_multiplier)

    def _prefilter_search(self, query: str, filters: dict, top_k: int) -> list:
        """Apply the filters as part of the vector search."""
        query_embedding = self.embedder.encode(query)
        return self.client.search(
            collection_name=self.collection,
            query_vector=query_embedding.tolist(),
            query_filter=self._build_filter(filters),
            limit=top_k,
        )

    def _postfilter_search(
        self, query: str, filters: dict, top_k: int, multiplier: int
    ) -> list:
        """Fetch more results, then filter locally (may return fewer than top_k)."""
        query_embedding = self.embedder.encode(query)

        # Broad search
        results = self.client.search(
            collection_name=self.collection,
            query_vector=query_embedding.tolist(),
            limit=top_k * multiplier,
        )

        # Local filtering
        filtered = [r for r in results if self._matches_filters(r.payload, filters)]
        return filtered[:top_k]

    def _matches_filters(self, payload: dict, filters: dict) -> bool:
        """Evaluate the dictionary filter syntax locally against one payload."""
        for key, want in filters.items():
            field, operator = key.rsplit("__", 1) if "__" in key else (key, "eq")
            have = payload.get(field)
            values = have if isinstance(have, list) else [have]
            if operator in ("eq", "contains"):
                ok = want in values
            elif operator == "not":
                ok = want not in values
            elif operator == "in":
                ok = any(v in want for v in values)
            elif have is None:
                ok = False
            else:
                ok = {"lt": have < want, "lte": have <= want,
                      "gt": have > want, "gte": have >= want}[operator]
            if not ok:
                return False
        return True

    def _estimate_selectivity(self, filters: dict) -> float:
        """Estimate the fraction of documents that pass the filters."""
        # Count queries: total points vs. points matching the filter
        total = self.client.count(collection_name=self.collection).count
        matching = self.client.count(
            collection_name=self.collection,
            count_filter=self._build_filter(filters),
        ).count
        return matching / total if total > 0 else 0.0
```
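The multiplier used for postfiltering can be related to selectivity with simple arithmetic. Under the idealized assumption that matching documents are spread evenly through the ranking, fetching `top_k * multiplier` candidates leaves about `top_k * multiplier * selectivity` survivors, so the multiplier needs to be at least `1 / selectivity` to expect a full result list. A sketch with hypothetical helper names:

```python
import math


def expected_survivors(top_k: int, multiplier: int, selectivity: float) -> float:
    """Expected number of candidates left after local filtering (idealized)."""
    return top_k * multiplier * selectivity


def min_multiplier(selectivity: float) -> int:
    """Smallest integer multiplier whose expected survivors reach top_k."""
    if not 0 < selectivity <= 1:
        raise ValueError("selectivity must be in (0, 1]")
    return math.ceil(1 / selectivity)


# With a multiplier of 3 and a filter that passes 10% of documents,
# only about 1.5 of the 15 fetched candidates are expected to survive.
survivors = expected_survivors(top_k=5, multiplier=3, selectivity=0.1)
```

This is one way to see why a fixed multiplier of 3 only suits filters that pass at least about a third of the collection, and why very selective filters are better applied inside the search.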
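Caching Frequent Filters
```python
import hashlib
import json

from qdrant_client.models import Filter, HasIdCondition


class CachedFilterRetriever:
    """Caches the IDs of all points matching a filter (suited to selective filters).

    Cached IDs go stale when the collection changes, so clear the cache on updates.
    """

    def __init__(self, collection: str, max_cached_ids: int = 10_000):
        self.base_retriever = MetadataFilteredRetriever(collection)
        self.max_cached_ids = max_cached_ids
        self._filter_cache: dict[str, list] = {}

    def search(self, query: str, filters: dict, top_k: int = 5) -> list[dict]:
        # Build a cache key from the filters
        filter_key = self._hash_filters(filters)

        if filter_key not in self._filter_cache:
            ids = self._matching_ids(filters)
            if ids is None:
                # Too many matches to cache: run a normal filtered search
                return self.base_retriever.search(query, filters, top_k)
            self._filter_cache[filter_key] = ids

        # Vector search restricted to the cached IDs
        return self._search_in_ids(query, self._filter_cache[filter_key], top_k)

    def _matching_ids(self, filters: dict) -> list | None:
        """Collect the IDs matching the filter, or None if there are too many."""
        base = self.base_retriever
        points, next_offset = base.client.scroll(
            collection_name=base.collection,
            scroll_filter=base._build_filter(filters),
            limit=self.max_cached_ids,
            with_payload=False,
            with_vectors=False,
        )
        return None if next_offset is not None else [p.id for p in points]

    def _search_in_ids(self, query: str, ids: list, top_k: int) -> list[dict]:
        if not ids:
            return []
        base = self.base_retriever
        hits = base.client.search(
            collection_name=base.collection,
            query_vector=base.embedder.encode(query).tolist(),
            query_filter=Filter(must=[HasIdCondition(has_id=ids)]),
            limit=top_k,
        )
        return [
            {
                "id": h.id,
                "content": h.payload.get("content"),
                "metadata": {k: v for k, v in h.payload.items() if k != "content"},
                "score": h.score,
            }
            for h in hits
        ]

    def _hash_filters(self, filters: dict) -> str:
        return hashlib.sha256(json.dumps(filters, sort_keys=True).encode()).hexdigest()
```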
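A filter cache kept in a plain dictionary grows without bound. As a sketch of one way to cap it, the hypothetical `LRUFilterCache` below evicts the least recently used entry once a maximum size is reached. It is a standalone standard-library illustration, and `filter_key` is an assumed helper that hashes the filter dictionary independently of key order (list order inside values still matters).

```python
import hashlib
import json
from collections import OrderedDict


def filter_key(filters: dict) -> str:
    """Hash a filter dict so that the order of its keys does not matter."""
    canonical = json.dumps(filters, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


class LRUFilterCache:
    """Bounded cache mapping filters to ID lists, evicting the least recently used."""

    def __init__(self, max_entries: int = 100):
        self.max_entries = max_entries
        self._entries: OrderedDict[str, list] = OrderedDict()

    def get(self, filters: dict):
        key = filter_key(filters)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, filters: dict, ids: list) -> None:
        key = filter_key(filters)
        self._entries[key] = ids
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry


cache = LRUFilterCache(max_entries=2)
cache.put({"category": "phones"}, [1, 2])
cache.put({"category": "laptops"}, [3])
cache.get({"category": "phones"})          # refreshes the "phones" entry
cache.put({"category": "tablets"}, [4, 5])  # evicts "laptops"
```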
Dynamic Filters
Building Filters from the User Interface
```python
from fastapi import FastAPI, Query


class DynamicFilterBuilder:
    def __init__(self, schema: dict):
        """
        schema = {
            "category": {"type": "keyword", "options": ["phones", "laptops", ...]},
            "price": {"type": "range", "min": 0, "max": 5000},
            "brand": {"type": "multi_select", "options": [...]},
            "in_stock": {"type": "boolean"}
        }
        """
        self.schema = schema

    def build_from_ui(self, ui_params: dict) -> dict:
        """Convert UI parameters into filters."""
        filters = {}

        for field, value in ui_params.items():
            if field not in self.schema:
                continue
            field_type = self.schema[field]["type"]

            if field_type == "keyword" and value:
                filters[field] = value
            elif field_type == "range" and value:
                if value.get("min") is not None:
                    filters[f"{field}__gte"] = value["min"]
                if value.get("max") is not None:
                    filters[f"{field}__lte"] = value["max"]
            elif field_type == "multi_select" and value:
                filters[f"{field}__in"] = value
            elif field_type == "boolean" and value is not None:
                filters[field] = value

        return filters


# Usage from a REST API (FastAPI is assumed here, based on the Query parameter)
app = FastAPI()

# Illustrative minimal schema in the format documented above
product_schema = {
    "category": {"type": "keyword"},
    "price": {"type": "range"},
    "brand": {"type": "multi_select"},
    "in_stock": {"type": "boolean"},
}
retriever = MetadataFilteredRetriever("products")
filter_builder = DynamicFilterBuilder(product_schema)


@app.get("/search")
def search(
    q: str,
    category: str | None = None,
    price_min: float | None = None,
    price_max: float | None = None,
    brands: list[str] = Query(default=[]),
    in_stock: bool | None = None,
):
    ui_params = {
        "category": category,
        "price": {"min": price_min, "max": price_max},
        "brand": brands,
        "in_stock": in_stock,
    }
    filters = filter_builder.build_from_ui(ui_params)
    return retriever.search(q, filters=filters)
```
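Because UI parameters come from users, it can be worth validating them against the schema before building filters. The sketch below is one possible approach rather than part of the builder above: `sanitize_ui_params` is a hypothetical helper that drops unknown fields and options and clamps ranges to the schema's bounds, and the option lists in the example schema are made up.

```python
def sanitize_ui_params(schema: dict, ui_params: dict) -> dict:
    """Keep only values the schema allows; clamp ranges to the schema bounds."""
    clean = {}
    for field, value in ui_params.items():
        spec = schema.get(field)
        if spec is None or value is None:
            continue  # unknown field or nothing selected
        kind = spec["type"]
        if kind == "keyword":
            if "options" not in spec or value in spec["options"]:
                clean[field] = value
        elif kind == "multi_select":
            allowed = spec.get("options")
            picked = [v for v in value if allowed is None or v in allowed]
            if picked:
                clean[field] = picked
        elif kind == "range":
            low, high = value.get("min"), value.get("max")
            if low is not None and "min" in spec:
                low = max(low, spec["min"])
            if high is not None and "max" in spec:
                high = min(high, spec["max"])
            clean[field] = {"min": low, "max": high}
        elif kind == "boolean":
            clean[field] = bool(value)
    return clean


# Example schema with made-up option lists
example_schema = {
    "category": {"type": "keyword", "options": ["phones", "laptops"]},
    "price": {"type": "range", "min": 0, "max": 5000},
    "brand": {"type": "multi_select", "options": ["Apple", "Samsung"]},
    "in_stock": {"type": "boolean"},
}
raw = {
    "category": "tablets",                # not an allowed option
    "price": {"min": -50, "max": 9000},   # outside the schema bounds
    "brand": ["Apple", "Unknown"],
    "in_stock": True,
    "debug": 1,                           # not in the schema
}
cleaned = sanitize_ui_params(example_schema, raw)
```

The cleaned dictionary has the same shape as the `ui_params` expected by `build_from_ui`, so the two steps can be chained.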
Filter Monitoring
```python
from collections import Counter
from datetime import datetime, timezone


class FilterAnalytics:
    def __init__(self, analytics_client):
        # analytics_client is expected to provide track(event, properties)
        # and query(event, days=...)
        self.analytics = analytics_client

    def log_filter_usage(
        self, filters: dict, results_count: int, latency_ms: float
    ) -> None:
        self.analytics.track("filter_usage", {
            "filters": filters,
            "filter_count": len(filters),
            "results_count": results_count,
            "latency_ms": latency_ms,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def get_popular_filters(self, days: int = 7) -> list[tuple[str, int]]:
        """Return the filter fields sorted by usage count, most used first."""
        usages = self.analytics.query("filter_usage", days=days)
        counts = Counter(field for usage in usages for field in usage["filters"])
        return counts.most_common()

    def get_empty_result_filters(self, days: int = 7) -> list[dict]:
        """Return the logged searches whose filters produced no results."""
        usages = self.analytics.query("filter_usage", days=days)
        return [u for u in usages if u["results_count"] == 0]
```
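The logged events can be condensed into a few health indicators. As a sketch, assuming each event is a dictionary with the `results_count` and `latency_ms` keys logged above, the hypothetical `summarize` helper computes the empty-result rate, the median latency, and a nearest-rank 95th percentile. The sample events are made up.

```python
import math
import statistics


def summarize(usages: list[dict]) -> dict:
    """Summarize filter usage events into empty-result rate and latency stats."""
    if not usages:
        raise ValueError("no usage events to summarize")
    latencies = sorted(u["latency_ms"] for u in usages)
    empty = sum(1 for u in usages if u["results_count"] == 0)
    # Nearest-rank percentile: the smallest value covering 95% of the events
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "empty_rate": empty / len(usages),
        "median_latency_ms": statistics.median(latencies),
        "p95_latency_ms": latencies[p95_index],
    }


events = [
    {"results_count": 3, "latency_ms": 10},
    {"results_count": 0, "latency_ms": 20},
    {"results_count": 5, "latency_ms": 30},
    {"results_count": 0, "latency_ms": 40},
    {"results_count": 1, "latency_ms": 200},
]
summary = summarize(events)
```

A high empty-result rate for a particular filter combination is a hint that the UI offers options the data cannot satisfy.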
Next Steps
Metadata filtering makes your RAG searches considerably more precise. To go further:
- Self-Query Retrieval - Let the LLM extract the filters
- Query Routing - Routing based on metadata
- Retrieval Fundamentals - Overview
Intelligent Filtering with Ailog
Ailog implements metadata filters seamlessly:
- Automatic indexing of relevant fields
- Extraction of filters from natural-language queries
- Dynamic prefiltering/postfiltering optimization
- Built-in filter interface for your users
Try it for free and refine your searches with powerful filters.