Confluence: AI Knowledge Base for Teams
Complete guide to deploying a RAG assistant on Confluence. Transform your Atlassian documentation into an AI-queryable knowledge base.
Confluence is the backbone of enterprise documentation in the Atlassian ecosystem. Millions of teams use it to centralize processes, technical guides, and strategic decisions. But over time, even the best-organized wikis become labyrinths where information gets lost. Industry surveys routinely estimate that employees spend around 20% of their time searching for information they know exists somewhere.
A RAG assistant transforms this documentary mass into a conversational interface. Instead of navigating complex folder structures, your teams ask questions in natural language and get synthesized answers with sources. This guide details the Confluence + RAG integration from A to Z.
The Confluence Problem at Scale
Common Symptoms
After a few years of use, Confluence presents recurring challenges:
- "I know it exists somewhere": The information is there, but unfindable
- Labyrinthine navigation: Too many spaces, sub-pages, hierarchies
- Outdated content: Unmaintained pages polluting search results
- Duplication: Same information in multiple spaces
- Difficult onboarding: New hires lost in the documentation
Revealing Statistics
| Metric | Enterprise Average |
|---|---|
| Time spent searching | 20% of work time |
| Pages never viewed | 60% after 6 months |
| Repeated questions to support | 40% are in the docs |
| Confluence search satisfaction | 3.2/10 |
Native Search vs RAG
| Criteria | Confluence Search | RAG Search |
|---|---|---|
| Query type | Exact keywords | Natural language |
| Result | List of pages | Direct answer |
| Multi-page | No | Synthesizes multiple sources |
| Context | None | Conversational history |
| Formats | Text only | Text + tables + code |
| Relevance | Recency > relevance | Semantic relevance |
Confluence + RAG Architecture
The integration rests on three pillars: extraction via the Confluence API, vector indexing, and conversational interface.
```
+---------------------------------------------------------------------------+
|                       Confluence + RAG Architecture                       |
+---------------------------------------------------------------------------+
|                                                                           |
|  CONFLUENCE          PROCESSING             VECTOR DB                     |
|  +--------------+    +--------------+       +-----------+                 |
|  | Spaces       |--->| Parsing      |------>| Qdrant    |                 |
|  |              |    | HTML -> MD   |       |           |                 |
|  |  - IT        |    +--------------+       | HNSW      |                 |
|  |  - HR        |           |               | Index     |                 |
|  |  - Product   |    +--------------+       +-----+-----+                 |
|  |  - Tech      |    | Chunking     |             |                       |
|  +--------------+    | 512 tokens   |             |                       |
|        |             +--------------+             |                       |
|  +-----+--------+           |                     |                       |
|  | REST API     |    +--------------+             |                       |
|  | v2           |    | Embeddings   |             |                       |
|  +--------------+    | BGE-M3       |             |                       |
|                      +--------------+             |                       |
|                                                   |                       |
|   QUERY PIPELINE                                  |                       |
|  +--------------+     +-------------+     +-------+------+                |
|  | Question     |---->| Retrieval   |<----| Reranker     |                |
|  | employee     |     | Top-30      |     | Top-5        |                |
|  +--------------+     +-------------+     +------+-------+                |
|                                                  |                        |
|  +--------------+                         +------+-------+                |
|  | Response     |<------------------------| LLM          |                |
|  | + Sources    |                         | GPT-4/Claude |                |
|  +--------------+                         +--------------+                |
|                                                                           |
+---------------------------------------------------------------------------+
```
Data Flow
- Extraction: The connector queries Confluence API v2 to retrieve pages
- Parsing: Confluence HTML is converted to clean Markdown
- Chunking: Documents are split into 512-token segments
- Embedding: Each chunk is vectorized with BGE-M3 (multilingual)
- Indexing: Vectors are stored in Qdrant with metadata
- Retrieval: Questions are semantically matched to chunks
- Reranking: A second model refines the ranking
- Generation: The LLM synthesizes the response with citations
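The chunking step above can be sketched as a sliding token window. This is a minimal illustration that uses whitespace tokens as a stand-in for the embedding model's real tokenizer; production pipelines typically add a small overlap (here 50 tokens, roughly 10%) so sentences are not cut mid-context:

```python
def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> list:
    """Split text into overlapping token windows.

    Whitespace tokens stand in for the real tokenizer in this sketch.
    """
    tokens = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))
        # Stop once the window reaches the end of the document
        if start + max_tokens >= len(tokens):
            break
    return chunks

# A 1000-token page yields 3 windows of at most 512 tokens each
chunks = chunk_text(" ".join(str(i) for i in range(1000)))
print(len(chunks))  # → 3
```

Each chunk keeps its page's metadata (URL, breadcrumb, labels) so the final answer can cite its source.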
Complete Confluence Connector
Here's a reference implementation for extracting Confluence content, built on the atlassian-python-api and beautifulsoup4 libraries:
```python
from atlassian import Confluence
from bs4 import BeautifulSoup
import hashlib
import re


class ConfluenceConnector:
    def __init__(self, url: str, username: str, api_token: str):
        """
        Initialize Confluence connector.

        Args:
            url: Instance URL (e.g., https://company.atlassian.net)
            username: User email
            api_token: Atlassian API token
        """
        self.confluence = Confluence(
            url=url,
            username=username,
            password=api_token,
            cloud=True
        )
        self.base_url = url

    def get_all_spaces(self) -> list:
        """Retrieve all accessible spaces."""
        spaces = []
        start = 0
        limit = 50
        while True:
            result = self.confluence.get_all_spaces(
                start=start,
                limit=limit,
                expand='description.plain'
            )
            for space in result.get('results', []):
                spaces.append({
                    'key': space['key'],
                    'name': space['name'],
                    'type': space.get('type', 'global'),
                    'description': space.get('description', {}).get('plain', {}).get('value', '')
                })
            if len(result.get('results', [])) < limit:
                break
            start += limit
        return spaces

    def get_space_pages(self, space_key: str, include_archived: bool = False) -> list:
        """
        Retrieve all pages from a space.

        Args:
            space_key: Space key (e.g., 'IT', 'HR')
            include_archived: Include archived pages

        Returns:
            List of documents formatted for RAG
        """
        pages = []
        start = 0
        limit = 50
        while True:
            try:
                result = self.confluence.get_all_pages_from_space(
                    space_key,
                    start=start,
                    limit=limit,
                    expand='body.storage,ancestors,version,metadata.labels'
                )
            except Exception as e:
                print(f"Error space {space_key}: {e}")
                break
            for page in result:
                # Filter archived pages if requested
                if not include_archived and page.get('status') == 'archived':
                    continue
                doc = self._format_page(page, space_key)
                if doc and len(doc['content']) > 100:  # Ignore short pages
                    pages.append(doc)
            if len(result) < limit:
                break
            start += limit
        return pages

    def _format_page(self, page: dict, space_key: str) -> dict:
        """Format a Confluence page as a RAG document."""
        # Extract and clean HTML content
        html_content = page.get('body', {}).get('storage', {}).get('value', '')
        text_content = self._html_to_markdown(html_content)

        # Build hierarchical path (breadcrumb)
        ancestors = page.get('ancestors', [])
        path_parts = [a['title'] for a in ancestors] + [page['title']]
        breadcrumb = ' > '.join(path_parts)

        # Extract labels
        labels = [
            label['name']
            for label in page.get('metadata', {}).get('labels', {}).get('results', [])
        ]

        # Hash to detect changes
        content_hash = hashlib.md5(text_content.encode()).hexdigest()

        # Version and date
        version = page.get('version', {})

        return {
            "id": f"confluence_{page['id']}",
            "title": page['title'],
            "content": f"# {page['title']}\n\n**Path**: {breadcrumb}\n\n{text_content}",
            "metadata": {
                "source": "confluence",
                "source_type": "documentation",
                "space_key": space_key,
                "page_id": page['id'],
                "url": f"{self.base_url}/wiki/spaces/{space_key}/pages/{page['id']}",
                "breadcrumb": breadcrumb,
                "labels": labels,
                "version": version.get('number', 1),
                "last_updated": version.get('when'),
                "last_updated_by": version.get('by', {}).get('displayName'),
                "content_hash": content_hash
            }
        }

    def _html_to_markdown(self, html: str) -> str:
        """
        Convert Confluence HTML to clean Markdown.

        Handles Confluence-specific formats:
        - Macros (code, panel, note, warning)
        - Tables
        - Nested lists
        - Links and mentions
        """
        if not html:
            return ""
        soup = BeautifulSoup(html, 'html.parser')

        # Process code macros
        for code_block in soup.find_all('ac:structured-macro', {'ac:name': 'code'}):
            language = code_block.get('ac:language', '')
            code_body = code_block.find('ac:plain-text-body')
            if code_body:
                code_text = code_body.get_text()
                code_block.replace_with(f"\n```{language}\n{code_text}\n```\n")

        # Process panels and notes
        for panel in soup.find_all('ac:structured-macro', {'ac:name': ['panel', 'note', 'warning', 'info']}):
            panel_type = panel.get('ac:name', 'note')
            body = panel.find('ac:rich-text-body')
            if body:
                panel_text = body.get_text(separator='\n')
                panel.replace_with(f"\n> **{panel_type.upper()}**: {panel_text}\n")

        # Process tables
        for table in soup.find_all('table'):
            rows = table.find_all('tr')
            if rows:
                md_table = self._table_to_markdown(rows)
                table.replace_with(md_table)

        # Process headers
        for i in range(1, 7):
            for header in soup.find_all(f'h{i}'):
                header.replace_with(f"\n{'#' * i} {header.get_text()}\n")

        # Process lists
        for ul in soup.find_all('ul'):
            items = ul.find_all('li', recursive=False)
            list_text = '\n'.join([f"- {li.get_text()}" for li in items])
            ul.replace_with(f"\n{list_text}\n")
        for ol in soup.find_all('ol'):
            items = ol.find_all('li', recursive=False)
            list_text = '\n'.join([f"{i+1}. {li.get_text()}" for i, li in enumerate(items)])
            ol.replace_with(f"\n{list_text}\n")

        # Clean final text
        text = soup.get_text(separator='\n')
        text = re.sub(r'\n{3,}', '\n\n', text)  # Max 2 empty lines
        text = re.sub(r' +', ' ', text)         # Collapse multiple spaces
        return text.strip()

    def _table_to_markdown(self, rows) -> str:
        """Convert an HTML table to Markdown."""
        md_lines = []
        for i, row in enumerate(rows):
            cells = row.find_all(['th', 'td'])
            cell_texts = [cell.get_text().strip().replace('|', '\\|') for cell in cells]
            md_lines.append('| ' + ' | '.join(cell_texts) + ' |')
            # Add separator line after header
            if i == 0:
                md_lines.append('| ' + ' | '.join(['---'] * len(cells)) + ' |')
        return '\n' + '\n'.join(md_lines) + '\n'

    def get_page_comments(self, page_id: str) -> list:
        """Retrieve page comments (optional)."""
        try:
            comments = self.confluence.get_page_comments(
                page_id,
                expand='body.storage',
                depth='all'
            )
            return [
                {
                    'author': c.get('author', {}).get('displayName', 'Unknown'),
                    'content': BeautifulSoup(
                        c.get('body', {}).get('storage', {}).get('value', ''),
                        'html.parser'
                    ).get_text(),
                    'date': c.get('created')
                }
                for c in comments.get('results', [])
            ]
        except Exception:
            return []


class ConfluenceMultiSpaceConnector(ConfluenceConnector):
    """Extension to manage multiple spaces with filtering."""

    def get_all_documents(
        self,
        space_keys: list = None,
        exclude_spaces: list = None,
        labels_filter: list = None
    ) -> list:
        """
        Retrieve documents from multiple spaces.

        Args:
            space_keys: List of spaces to include (None = all)
            exclude_spaces: Spaces to exclude
            labels_filter: Only keep pages with these labels

        Returns:
            List of all documents
        """
        all_docs = []
        # Get space list
        spaces = self.get_all_spaces()
        for space in spaces:
            key = space['key']
            # Filter spaces
            if space_keys and key not in space_keys:
                continue
            if exclude_spaces and key in exclude_spaces:
                continue
            print(f"Indexing space: {space['name']} ({key})")
            pages = self.get_space_pages(key)
            # Filter by labels if requested
            if labels_filter:
                pages = [
                    p for p in pages
                    if any(label in p['metadata']['labels'] for label in labels_filter)
                ]
            all_docs.extend(pages)
        return all_docs
```
Synchronization and Updates
Synchronization can be incremental (efficient) or full (cleanup):
```python
from datetime import datetime, timezone
import time

import schedule


class ConfluenceSyncManager:
    def __init__(self, connector: ConfluenceConnector, indexer):
        self.connector = connector
        self.indexer = indexer
        self.last_sync = None
        self.sync_history = []

    def sync_incremental(self, spaces: list = None):
        """
        Incremental synchronization.
        Only processes pages modified since last sync.
        """
        if spaces is None:
            spaces = [s['key'] for s in self.connector.get_all_spaces()]

        updated_count = 0
        for space_key in spaces:
            pages = self.connector.get_space_pages(space_key)
            for page in pages:
                last_updated = page['metadata'].get('last_updated')
                if last_updated:
                    updated_dt = datetime.fromisoformat(
                        last_updated.replace('Z', '+00:00')
                    )
                    if self.last_sync is None or updated_dt > self.last_sync:
                        self.indexer.upsert_document(page)
                        updated_count += 1

        # Timezone-aware, so it compares cleanly with Confluence timestamps
        self.last_sync = datetime.now(timezone.utc)
        self.sync_history.append({
            'timestamp': self.last_sync,
            'type': 'incremental',
            'documents_updated': updated_count
        })
        print(f"Incremental sync: {updated_count} documents updated")

    def sync_full(self, spaces: list = None):
        """Full synchronization with replacement."""
        if spaces is None:
            # Requires a ConfluenceMultiSpaceConnector
            docs = self.connector.get_all_documents()
        else:
            docs = []
            for space_key in spaces:
                docs.extend(self.connector.get_space_pages(space_key))

        self.indexer.replace_all(docs)
        self.last_sync = datetime.now(timezone.utc)
        self.sync_history.append({
            'timestamp': self.last_sync,
            'type': 'full',
            'documents_indexed': len(docs)
        })
        print(f"Full sync: {len(docs)} documents indexed")

    def cleanup_deleted(self):
        """Remove documents whose pages no longer exist."""
        indexed_ids = self.indexer.get_all_ids()
        current_pages = set()
        for space in self.connector.get_all_spaces():
            pages = self.connector.get_space_pages(space['key'])
            for page in pages:
                current_pages.add(page['id'])

        # Find IDs to delete
        to_delete = indexed_ids - current_pages
        if to_delete:
            self.indexer.delete_documents(list(to_delete))
            print(f"Deleted {len(to_delete)} obsolete documents")


def start_confluence_sync_worker(sync_manager: ConfluenceSyncManager):
    """Start synchronization worker."""
    # Incremental sync every 15 minutes
    schedule.every(15).minutes.do(sync_manager.sync_incremental)
    # Weekly full sync
    schedule.every().sunday.at("02:00").do(sync_manager.sync_full)
    # Monthly cleanup
    schedule.every(30).days.do(sync_manager.cleanup_deleted)

    while True:
        schedule.run_pending()
        time.sleep(60)
```
Optimized Confluence System Prompt
```python
CONFLUENCE_SYSTEM_PROMPT = """You are the documentation assistant for {company_name}.
You help employees find information in our Confluence.

## Mission
- Answer based ONLY on Confluence documentation
- Systematically cite the space and source page
- Indicate last update date if relevant
- Direct to the right contact if the question exceeds the docs

## Rules
1. Never invent information absent from documentation
2. Prioritize recently updated pages
3. Flag if information seems outdated (> 1 year)
4. Adapt detail level to the question

## Format
- Direct answer first
- Details if necessary
- Source: [Space > Page] (date)

## Available Documentation
{context}
"""
```
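At query time, `{context}` is filled with the retrieved chunks and their source lines. A minimal sketch of that assembly, using an abridged copy of the template so the example is self-contained; the chunk shape matches what the connector's `_format_page` produces:

```python
# Abridged copy of the template above, so this sketch is self-contained
TEMPLATE = (
    "You are the documentation assistant for {company_name}.\n"
    "## Available Documentation\n{context}"
)


def build_context(chunks: list) -> str:
    """Format retrieved chunks with a source line for each."""
    parts = []
    for chunk in chunks:
        meta = chunk["metadata"]
        source = f"[{meta['space_key']} > {meta['breadcrumb']}] ({meta.get('last_updated') or 'unknown date'})"
        parts.append(f"Source: {source}\n{chunk['content']}")
    return "\n\n---\n\n".join(parts)


# Hypothetical retrieved chunk, for illustration only
chunks = [{
    "content": "Use Cisco AnyConnect with vpn.company.com.",
    "metadata": {"space_key": "IT", "breadcrumb": "Remote Work > VPN Guide",
                 "last_updated": "2025-01-10"},
}]
prompt = TEMPLATE.format(company_name="Acme", context=build_context(chunks))
print(prompt)
```

Keeping the source line next to each chunk is what allows the LLM to cite `[Space > Page] (date)` in its answers.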
Concrete Use Cases
1. Internal IT Support
Question: "How do I configure VPN for remote work?"
RAG Response:
To configure VPN:
- Download Cisco AnyConnect client from [internal link]
- Server: vpn.company.com
- Use your Active Directory credentials
- For 2FA, use the Duo app
If you have issues, open an IT ticket via ServiceNow.
Source: [IT > Remote Work VPN Guide] (updated 2 months ago)
2. HR Processes
Question: "What's the procedure for a training request?"
RAG Response:
Training request procedure:
- Identify the training with your manager
- Fill out form F-HR-015 on Confluence
- Get approval from your N+1 then HR
- Budget: max 2000 EUR/year/employee
Processing time: 2-3 weeks.
Source: [HR > Training > Request Procedure] (January 2025)
3. Technical Documentation
Question: "How do I deploy to the staging environment?"
RAG Response:
Staging deployment:
```bash
# 1. Merge to develop
git checkout develop && git pull && git merge feature/xxx
# 2. Run pipeline
./scripts/deploy.sh staging
# 3. Verify
curl https://staging.company.com/health
```
Staging access: request from DevOps (#devops on Slack).
Source: [Tech > DevOps > Deployment] (December 2025)
Best Practices
Structure Confluence for RAG
| Practice | Benefit |
|---|---|
| Clear hierarchy (3 levels max) | Optimal chunking |
| Descriptive titles | Better search |
| Standardized labels | Easy filtering |
| Page templates | Format consistency |
| Short, focused pages | Response precision |
Manage Permissions
RAG respects Confluence permissions:
- Create a dedicated service account with read access
- Limit to public/internal spaces based on use case
- Never index confidential spaces without validation
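A simple way to enforce the last two points at indexing time is an explicit allow-list checked before any space is synced. This sketch assumes space keys are the permission unit and that a hypothetical `ALLOWED_SPACES` set has been validated with the space owners:

```python
# Hypothetical allow-list, validated with space owners before indexing
ALLOWED_SPACES = {"IT", "HR", "PRODUCT"}


def filter_indexable(spaces: list) -> list:
    """Keep only spaces explicitly cleared for indexing; personal spaces never pass."""
    return [
        s for s in spaces
        if s["key"] in ALLOWED_SPACES and s.get("type") != "personal"
    ]


spaces = [
    {"key": "IT", "type": "global"},
    {"key": "LEGAL", "type": "global"},    # not cleared -> excluded
    {"key": "~jdoe", "type": "personal"},  # personal space -> excluded
]
print([s["key"] for s in filter_indexable(spaces)])  # → ['IT']
```

The same allow-list can feed the `space_keys` parameter of the multi-space connector above, so nothing outside it is ever fetched.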
Quality Indicators
- Response rate with source
- Questions without match (add to docs)
- Most cited pages
- User feedback
Related Resources
- Enterprise Knowledge Base - Pillar guide
- Notion + RAG - Notion alternative
- SharePoint + RAG - For Microsoft 365
- Slack Bot RAG - Search in conversations
- Introduction to RAG - The fundamentals
Connect Confluence with Ailog
Transform your Confluence documentation into an intelligent assistant. Ailog simplifies integration:
- Native Atlassian connector: Automatic multi-space synchronization
- Semantic search: Find info in natural language
- Permission respect: Granular access by space
- Version history: Complete traceability
- French hosting: Native GDPR compliance
Try Ailog for free and deploy your Confluence assistant in 15 minutes.