Query Decomposition Breakthrough: DecomposeRAG Handles Complex Questions 50% Better
UC Berkeley researchers introduce DecomposeRAG, an automated query decomposition framework that significantly improves multi-hop question answering.
Research Overview
UC Berkeley's NLP lab published DecomposeRAG, a framework that automatically breaks complex queries into simpler sub-queries, achieving state-of-the-art results on multi-hop QA benchmarks.
The Problem
Complex questions require multi-hop reasoning:
Example: "What is the population of the capital of the country where the Eiffel Tower is located?"
Requires:
- Where is the Eiffel Tower? → France
- What is the capital of France? → Paris
- What is the population of Paris? → 2.1 million
Traditional RAG retrieves context for the full question, often missing intermediate steps.
DecomposeRAG Approach
Automatic Decomposition
Uses GPT-4 to break queries into sub-questions:
```python
def decompose_query(complex_query):
    prompt = f"""Break this question into simple sub-questions that must be answered in order.

Question: {complex_query}

Sub-questions (in order):
1."""
    response = gpt4.generate(prompt)
    sub_questions = parse_questions(response)
    return sub_questions
```
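The `parse_questions` helper is not spelled out. A minimal sketch, assuming the model continues the numbered list started by the prompt:

```python
import re

def parse_questions(response: str) -> list[str]:
    """Extract ordered sub-questions from the LLM completion.

    The prompt above already ends with "1.", so the first line of the
    completion may arrive without its number; later lines look like
    "2. What is the capital of France?".
    """
    questions = []
    for line in response.splitlines():
        line = re.sub(r"^\s*\d+[\.\)]\s*", "", line).strip()
        if line.endswith("?"):
            questions.append(line)
    return questions
```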
Sequential Retrieval
Answer sub-questions in order, using previous answers as context:
```python
def sequential_rag(sub_questions):
    context = ""
    for i, sub_q in enumerate(sub_questions):
        # Retrieve for this sub-question
        docs = retrieve(sub_q + " " + context, k=5)

        # Generate answer
        answer = llm.generate(
            query=sub_q,
            context=docs,
            previous_answers=context
        )

        # Add to cumulative context
        context += f"\nQ{i+1}: {sub_q}\nA{i+1}: {answer}\n"

    return answer  # Answer to the final sub-question
```
Answer Validation
Validates each intermediate answer before proceeding:
```python
def validate_answer(question, answer, retrieved_docs):
    prompt = f"""Is this answer supported by the documents?

Question: {question}
Answer: {answer}
Documents: {retrieved_docs}

Supported? (yes/no):"""
    validation = llm.generate(prompt)
    return "yes" in validation.lower()
```
If validation fails, the system retries with more context or an alternative retrieval strategy.
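The paper's exact fallback is not reproduced here; one plausible sketch, reusing the `retrieve`, `llm`, and `validate_answer` pieces above and widening the retrieval window on each retry:

```python
def answer_with_fallback(sub_q, context, max_retries=2):
    """Answer a sub-question, retrying with more retrieved context on failure."""
    k = 5
    answer = None
    for attempt in range(max_retries + 1):
        docs = retrieve(sub_q + " " + context, k=k)
        answer = llm.generate(query=sub_q, context=docs, previous_answers=context)
        if validate_answer(sub_q, answer, docs):
            return answer
        k *= 2  # alternative strategy: widen the retrieval window and retry
    return answer  # best-effort answer after exhausting retries
```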
Benchmark Results
Tested on four multi-hop QA datasets:
| Dataset | Baseline RAG | DecomposeRAG | Improvement |
|---|---|---|---|
| HotpotQA | 45.3% | 68.7% | +51.7% |
| 2WikiMultihopQA | 38.2% | 57.9% | +51.6% |
| MuSiQue | 32.1% | 49.8% | +55.1% |
| IIRC | 41.7% | 62.3% | +49.4% |
Average improvement: +52% (relative)
Comparison to Other Methods
| Method | Avg F1 | Cost (relative) |
|---|---|---|
| Standard RAG | 39.3% | 1x |
| Chain-of-Thought | 43.8% | 2x |
| ReAct | 48.2% | 3x |
| DecomposeRAG | 59.7% | 2.5x |
DecomposeRAG achieves best accuracy at moderate cost.
Key Insights
When Decomposition Helps
Effectiveness varies by query complexity:
| Hops | Baseline | DecomposeRAG | Gain |
|---|---|---|---|
| 1 (simple) | 68.2% | 69.1% | +1.3% |
| 2 (medium) | 51.3% | 67.4% | +31.4% |
| 3 (complex) | 28.7% | 52.3% | +82.2% |
| 4+ (very complex) | 15.2% | 38.9% | +156.3% |
Finding: the more hops a question requires, the larger the relative gain from decomposition.
Decomposition Quality
The authors analyzed the quality of the LLM-generated decompositions:
- Correct decomposition: 87.3%
- Missing steps: 8.2%
- Incorrect order: 3.1%
- Circular logic: 1.4%
Even imperfect decompositions improve results.
Error Analysis
Where does DecomposeRAG fail?
- Decomposition errors (23%): Wrong sub-questions
- Retrieval failures (34%): Can't find relevant docs for sub-question
- Answer errors (28%): Wrong intermediate answer propagates
- Integration failures (15%): Can't combine sub-answers
Most common failure mode: even after decomposition, retrieval still misses relevant documents for some sub-questions.
Implementation
Basic Version
```python
class DecomposeRAG:
    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    async def query(self, complex_question):
        # Step 1: Decompose
        sub_questions = await self.decompose(complex_question)

        # Step 2: Sequential RAG
        context = ""
        for sub_q in sub_questions:
            # Retrieve
            docs = await self.retriever.retrieve(
                sub_q + " " + context, k=5
            )

            # Generate
            answer = await self.llm.generate(
                query=sub_q,
                context=docs,
                previous=context
            )

            context += f"\n{sub_q} -> {answer}"

        # Return final answer
        return answer

    async def decompose(self, query):
        # Use LLM to decompose
        return await self.llm.decompose(query)
```
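A usage sketch; `my_retriever` and `my_llm` are placeholders for whatever objects expose the `retrieve()`, `generate()`, and `decompose()` interfaces assumed above:

```python
import asyncio

async def main():
    rag = DecomposeRAG(retriever=my_retriever, llm=my_llm)
    answer = await rag.query(
        "What is the population of the capital of the country "
        "where the Eiffel Tower is located?"
    )
    print(answer)  # expected: ~2.1 million

asyncio.run(main())
```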
Advanced: With Validation
```python
class ValidatedDecomposeRAG(DecomposeRAG):
    async def query(self, complex_question, max_retries=2):
        sub_questions = await self.decompose(complex_question)

        context = ""
        for sub_q in sub_questions:
            for attempt in range(max_retries):
                docs = await self.retriever.retrieve(sub_q + " " + context)
                answer = await self.llm.generate(sub_q, docs, context)

                # Validate
                if await self.validate(sub_q, answer, docs):
                    context += f"\n{sub_q} -> {answer}"
                    break
                elif attempt == max_retries - 1:
                    # Failed validation, use best-effort answer
                    context += f"\n{sub_q} -> {answer} (unverified)"

        return answer
```
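The class calls a `validate` method that the base class does not define. A minimal sketch, mirroring the yes/no support check from `validate_answer` above and assuming the LLM wrapper also accepts a plain prompt string:

```python
class ValidatedDecomposeRAG(DecomposeRAG):
    # query() unchanged from the sketch above.

    async def validate(self, question, answer, docs):
        """Ask the LLM whether the intermediate answer is supported by the docs."""
        prompt = (
            "Is this answer supported by the documents?\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            f"Documents: {docs}\n\n"
            "Supported? (yes/no):"
        )
        verdict = await self.llm.generate(prompt)
        return "yes" in verdict.lower()
```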
Optimizations
Parallel Sub-Queries
When sub-questions are independent:
```python
# Identify independent sub-questions
dependencies = analyze_dependencies(sub_questions)

# Group independent questions
independent_groups = group_by_dependencies(sub_questions, dependencies)

# Process groups in parallel
for group in independent_groups:
    # Parallel retrieval for group
    results = await asyncio.gather(*[
        self.retrieve_and_answer(q, context)
        for q in group
    ])

    # Add all to context
    for q, answer in zip(group, results):
        context += f"\n{q} -> {answer}"
```
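`analyze_dependencies` and `group_by_dependencies` are left abstract above. A crude heuristic sketch (not from the paper): treat a sub-question as dependent on its predecessor whenever it refers back to an earlier answer, then batch anything whose dependencies are already resolved:

```python
import re

BACK_REF = re.compile(r"\b(that|this|those|its|the answer)\b", re.IGNORECASE)

def analyze_dependencies(sub_questions):
    """For each sub-question, list the indices it depends on (heuristic)."""
    return [
        [i - 1] if i > 0 and BACK_REF.search(q) else []
        for i, q in enumerate(sub_questions)
    ]

def group_by_dependencies(sub_questions, dependencies):
    """Greedy batching: questions whose dependencies are resolved run together."""
    groups, done = [], set()
    remaining = list(range(len(sub_questions)))
    while remaining:
        batch = [i for i in remaining if all(d in done for d in dependencies[i])]
        if not batch:            # defensive: break dependency cycles
            batch = [remaining[0]]
        groups.append([sub_questions[i] for i in batch])
        done.update(batch)
        remaining = [i for i in remaining if i not in batch]
    return groups
```

A real implementation could instead ask the LLM to emit an explicit dependency graph during decomposition.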
Caching Intermediate Results
```python
class CachedDecomposeRAG(DecomposeRAG):
    def __init__(self, retriever, llm):
        super().__init__(retriever, llm)
        self.cache = {}

    async def retrieve_and_answer(self, sub_q, context):
        # Note: assumes the base class also exposes a retrieve_and_answer()
        # helper (the one used by the parallel snippet above); shown here to
        # illustrate caching of intermediate results.
        cache_key = hash(sub_q + context)
        if cache_key in self.cache:
            return self.cache[cache_key]

        result = await super().retrieve_and_answer(sub_q, context)
        self.cache[cache_key] = result
        return result
```
Practical Considerations
Latency
DecomposeRAG is 2-3x slower:
- 2-hop query: +2-3 seconds
- 3-hop query: +4-6 seconds
- 4-hop query: +6-10 seconds
Mitigation:
- Parallel sub-queries when possible
- Cache common decompositions
- Use faster LLMs for intermediate steps (see the sketch below)
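One way to act on the last mitigation, as a sketch: route decomposition (and optionally validation) to a cheaper, faster model and keep the stronger model for answer generation. The two-model split is an assumption, not part of the released code:

```python
class TieredDecomposeRAG(DecomposeRAG):
    """Fast model for decomposition/validation, strong model for answers."""

    def __init__(self, retriever, fast_llm, strong_llm):
        super().__init__(retriever, strong_llm)  # strong model answers sub-questions
        self.fast_llm = fast_llm

    async def decompose(self, query):
        # Decomposition output is short and structured, so a smaller,
        # faster model is usually sufficient here.
        return await self.fast_llm.decompose(query)
```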
Cost
More LLM calls = higher cost:
- Decomposition: 1 LLM call
- Each sub-question: 1 LLM call
- Validation (optional): 1 call per sub-question
Example:
- 3 sub-questions + validation = 7 LLM calls
- vs. 1 call for standard RAG
Cost multiplier: 2-5x depending on complexity
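The call-count arithmetic above is easy to sanity-check with a small helper (retries excluded):

```python
def llm_calls(num_sub_questions: int, validate: bool = True) -> int:
    """Estimate LLM calls per DecomposeRAG query: 1 decomposition call, plus
    one answer call (and optionally one validation call) per sub-question."""
    per_sub_question = 2 if validate else 1
    return 1 + num_sub_questions * per_sub_question

print(llm_calls(3))                  # 7 calls, vs. 1 for standard RAG
print(llm_calls(3, validate=False))  # 4 calls
```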
When to Use
Use DecomposeRAG when:
- Questions are complex (multi-hop)
- Accuracy more important than speed
- Budget allows higher costs
Use standard RAG when:
- Simple lookups
- Speed critical
- Cost-sensitive
Future Directions
Planned improvements:
- Better decomposition: Fine-tune smaller models
- Adaptive strategy: Auto-detect when to decompose
- Iterative refinement: Retry failed sub-questions
- Multimodal: Decompose across modalities
Resources
- Paper: "DecomposeRAG: Automatic Query Decomposition for Multi-Hop Question Answering"
- Code: github.com/berkeley-nlp/decomposerag
- Demo: decomposerag.demo.berkeley.edu
Conclusion
DecomposeRAG demonstrates that explicit query decomposition significantly improves multi-hop question answering. While costlier and slower than standard RAG, the accuracy gains justify the overhead for complex queries where correctness is critical.