Guide

Confluence: AI Knowledge Base for Teams

March 25, 2026
Ailog Team

Complete guide to deploying a RAG assistant on Confluence. Transform your Atlassian documentation into an AI-queryable knowledge base.

Confluence is the backbone of enterprise documentation in the Atlassian ecosystem. Millions of teams use it to centralize processes, technical guides, and strategic decisions. But over time, even the best-organized wikis become labyrinths where information gets lost. Employees spend an average of 20% of their time searching for information they know exists somewhere.

A RAG assistant turns this mass of documentation into a conversational interface. Instead of navigating complex folder structures, your teams ask questions in natural language and get synthesized answers with sources. This guide details the Confluence + RAG integration from A to Z.

The Confluence Problem at Scale

Common Symptoms

After a few years of use, Confluence presents recurring challenges:

  • "I know it exists somewhere": The information is there, but unfindable
  • Labyrinthine navigation: Too many spaces, sub-pages, hierarchies
  • Outdated content: Unmaintained pages polluting search results
  • Duplication: Same information in multiple spaces
  • Difficult onboarding: New hires lost in the documentation

Revealing Statistics

| Metric | Enterprise Average |
|---|---|
| Time spent searching | 20% of work time |
| Pages never viewed | 60% after 6 months |
| Repeated questions to support | 40% are in the docs |
| Confluence search satisfaction | 3.2/10 |

Native Search vs RAG

| Criteria | Confluence Search | RAG Search |
|---|---|---|
| Query type | Exact keywords | Natural language |
| Result | List of pages | Direct answer |
| Multi-page | No | Synthesizes multiple sources |
| Context | None | Conversational history |
| Formats | Text only | Text + tables + code |
| Relevance | Recency > relevance | Semantic relevance |

Confluence + RAG Architecture

The integration rests on three pillars: extraction via the Confluence API, vector indexing, and conversational interface.

+-------------------------------------------------------------------------+
|                      Confluence + RAG Architecture                       |
+-------------------------------------------------------------------------+
|                                                                         |
|   CONFLUENCE                    PROCESSING                 VECTOR DB   |
|   +--------------+             +--------------+          +-----------+ |
|   |   Spaces     |------------>|   Parsing    |--------->|  Qdrant   | |
|   |              |             |   HTML->MD   |          |           | |
|   |  - IT        |             +--------------+          |  HNSW     | |
|   |  - HR        |                    |                  |  Index    | |
|   |  - Product   |             +--------------+          +-----+-----+ |
|   |  - Tech      |             |   Chunking   |                |       |
|   +--------------+             |   512 tokens |                |       |
|          |                     +--------------+                |       |
|   +------+-------+                    |                        |       |
|   |   REST API   |             +--------------+                |       |
|   |   v2         |             |  Embeddings  |                |       |
|   +--------------+             |   BGE-M3     |                |       |
|                                +--------------+                |       |
|                                                                |       |
|   QUERY PIPELINE                                               |       |
|   +--------------+     +-------------+     +--------------+    |       |
|   |   Question   |---->|  Retrieval  |<----|   Reranker   |<---+       |
|   |  employee    |     |  Top-30     |     |   Top-5      |            |
|   +--------------+     +-------------+     +------+-------+            |
|                                                   |                    |
|                        +--------------+    +------+-------+            |
|                        |   Response   |<---|     LLM      |            |
|                        |   + Sources  |    | GPT-4/Claude |            |
|                        +--------------+    +--------------+            |
|                                                                         |
+-------------------------------------------------------------------------+

Data Flow

  1. Extraction: The connector queries Confluence API v2 to retrieve pages
  2. Parsing: Confluence HTML is converted to clean Markdown
  3. Chunking: Documents are split into 512-token segments
  4. Embedding: Each chunk is vectorized with BGE-M3 (multilingual)
  5. Indexing: Vectors are stored in Qdrant with metadata
  6. Retrieval: Questions are semantically matched to chunks
  7. Reranking: A second model refines the ranking
  8. Generation: The LLM synthesizes the response with citations
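
The chunking step (3) can be sketched in a few lines. This is a minimal illustration that uses whitespace tokens as a stand-in for a real tokenizer; a production pipeline would count tokens with the embedding model's own tokenizer and keep an overlap between segments so an answer spanning a chunk boundary isn't lost.

```python
def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Split a document into overlapping ~512-token segments.

    Whitespace splitting stands in for a real tokenizer here.
    """
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

Each chunk then goes to the embedding model as-is; the overlap of 50 tokens is a common default, tuned per corpus.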

Complete Confluence Connector

Here's the reference implementation for extracting Confluence content:

```python
from atlassian import Confluence
from bs4 import BeautifulSoup
import hashlib
import re


class ConfluenceConnector:
    def __init__(self, url: str, username: str, api_token: str):
        """
        Initialize Confluence connector.

        Args:
            url: Instance URL (e.g., https://company.atlassian.net)
            username: User email
            api_token: Atlassian API token
        """
        self.confluence = Confluence(
            url=url,
            username=username,
            password=api_token,
            cloud=True
        )
        self.base_url = url

    def get_all_spaces(self) -> list:
        """Retrieve all accessible spaces."""
        spaces = []
        start = 0
        limit = 50
        while True:
            result = self.confluence.get_all_spaces(
                start=start,
                limit=limit,
                expand='description.plain'
            )
            for space in result.get('results', []):
                spaces.append({
                    'key': space['key'],
                    'name': space['name'],
                    'type': space.get('type', 'global'),
                    'description': space.get('description', {}).get('plain', {}).get('value', '')
                })
            if len(result.get('results', [])) < limit:
                break
            start += limit
        return spaces

    def get_space_pages(self, space_key: str, include_archived: bool = False) -> list:
        """
        Retrieve all pages from a space.

        Args:
            space_key: Space key (e.g., 'IT', 'HR')
            include_archived: Include archived pages

        Returns:
            List of documents formatted for RAG
        """
        pages = []
        start = 0
        limit = 50
        while True:
            try:
                result = self.confluence.get_all_pages_from_space(
                    space_key,
                    start=start,
                    limit=limit,
                    expand='body.storage,ancestors,version,metadata.labels'
                )
            except Exception as e:
                print(f"Error space {space_key}: {e}")
                break
            for page in result:
                # Filter archived pages if requested
                if not include_archived and page.get('status') == 'archived':
                    continue
                doc = self._format_page(page, space_key)
                if doc and len(doc['content']) > 100:  # Ignore short pages
                    pages.append(doc)
            if len(result) < limit:
                break
            start += limit
        return pages

    def _format_page(self, page: dict, space_key: str) -> dict:
        """Format a Confluence page as a RAG document."""
        # Extract and clean HTML content
        html_content = page.get('body', {}).get('storage', {}).get('value', '')
        text_content = self._html_to_markdown(html_content)

        # Build hierarchical path (breadcrumb)
        ancestors = page.get('ancestors', [])
        path_parts = [a['title'] for a in ancestors] + [page['title']]
        breadcrumb = ' > '.join(path_parts)

        # Extract labels
        labels = [
            label['name']
            for label in page.get('metadata', {}).get('labels', {}).get('results', [])
        ]

        # Hash to detect changes
        content_hash = hashlib.md5(text_content.encode()).hexdigest()

        # Version and date
        version = page.get('version', {})

        return {
            "id": f"confluence_{page['id']}",
            "title": page['title'],
            "content": f"# {page['title']}\n\n**Path**: {breadcrumb}\n\n{text_content}",
            "metadata": {
                "source": "confluence",
                "source_type": "documentation",
                "space_key": space_key,
                "page_id": page['id'],
                "url": f"{self.base_url}/wiki/spaces/{space_key}/pages/{page['id']}",
                "breadcrumb": breadcrumb,
                "labels": labels,
                "version": version.get('number', 1),
                "last_updated": version.get('when'),
                "last_updated_by": version.get('by', {}).get('displayName'),
                "content_hash": content_hash
            }
        }

    def _html_to_markdown(self, html: str) -> str:
        """
        Convert Confluence HTML to clean Markdown.

        Handles Confluence-specific formats:
        - Macros (code, panel, note, warning)
        - Tables
        - Nested lists
        - Links and mentions
        """
        if not html:
            return ""
        soup = BeautifulSoup(html, 'html.parser')

        # Process code macros
        for code_block in soup.find_all('ac:structured-macro', {'ac:name': 'code'}):
            language = code_block.get('ac:language', '')
            code_body = code_block.find('ac:plain-text-body')
            if code_body:
                code_text = code_body.get_text()
                code_block.replace_with(f"\n```{language}\n{code_text}\n```\n")

        # Process panels and notes
        for panel in soup.find_all('ac:structured-macro',
                                   {'ac:name': ['panel', 'note', 'warning', 'info']}):
            panel_type = panel.get('ac:name', 'note')
            body = panel.find('ac:rich-text-body')
            if body:
                panel_text = body.get_text(separator='\n')
                panel.replace_with(f"\n> **{panel_type.upper()}**: {panel_text}\n")

        # Process tables
        for table in soup.find_all('table'):
            rows = table.find_all('tr')
            if rows:
                md_table = self._table_to_markdown(rows)
                table.replace_with(md_table)

        # Process headers
        for i in range(1, 7):
            for header in soup.find_all(f'h{i}'):
                header.replace_with(f"\n{'#' * i} {header.get_text()}\n")

        # Process lists
        for ul in soup.find_all('ul'):
            items = ul.find_all('li', recursive=False)
            list_text = '\n'.join([f"- {li.get_text()}" for li in items])
            ul.replace_with(f"\n{list_text}\n")
        for ol in soup.find_all('ol'):
            items = ol.find_all('li', recursive=False)
            list_text = '\n'.join([f"{i+1}. {li.get_text()}" for i, li in enumerate(items)])
            ol.replace_with(f"\n{list_text}\n")

        # Clean final text
        text = soup.get_text(separator='\n')
        text = re.sub(r'\n{3,}', '\n\n', text)  # Max 2 empty lines
        text = re.sub(r' +', ' ', text)         # Collapse multiple spaces
        return text.strip()

    def _table_to_markdown(self, rows) -> str:
        """Convert an HTML table to Markdown."""
        md_lines = []
        for i, row in enumerate(rows):
            cells = row.find_all(['th', 'td'])
            cell_texts = [cell.get_text().strip().replace('|', '\\|') for cell in cells]
            md_lines.append('| ' + ' | '.join(cell_texts) + ' |')
            # Add separator line after header
            if i == 0:
                md_lines.append('| ' + ' | '.join(['---'] * len(cells)) + ' |')
        return '\n' + '\n'.join(md_lines) + '\n'

    def get_page_comments(self, page_id: str) -> list:
        """Retrieve page comments (optional)."""
        try:
            comments = self.confluence.get_page_comments(
                page_id,
                expand='body.storage',
                depth='all'
            )
            return [
                {
                    'author': c.get('author', {}).get('displayName', 'Unknown'),
                    'content': BeautifulSoup(
                        c.get('body', {}).get('storage', {}).get('value', ''),
                        'html.parser'
                    ).get_text(),
                    'date': c.get('created')
                }
                for c in comments.get('results', [])
            ]
        except Exception:
            return []


class ConfluenceMultiSpaceConnector(ConfluenceConnector):
    """Extension to manage multiple spaces with filtering."""

    def get_all_documents(
        self,
        space_keys: list = None,
        exclude_spaces: list = None,
        labels_filter: list = None
    ) -> list:
        """
        Retrieve documents from multiple spaces.

        Args:
            space_keys: List of spaces to include (None = all)
            exclude_spaces: Spaces to exclude
            labels_filter: Only keep pages with these labels

        Returns:
            List of all documents
        """
        all_docs = []

        # Get space list
        spaces = self.get_all_spaces()
        for space in spaces:
            key = space['key']

            # Filter spaces
            if space_keys and key not in space_keys:
                continue
            if exclude_spaces and key in exclude_spaces:
                continue

            print(f"Indexing space: {space['name']} ({key})")
            pages = self.get_space_pages(key)

            # Filter by labels if requested
            if labels_filter:
                pages = [
                    p for p in pages
                    if any(label in p['metadata']['labels'] for label in labels_filter)
                ]
            all_docs.extend(pages)
        return all_docs
```

Synchronization and Updates

Synchronization can be incremental (efficient) or full (cleanup):

```python
import time
from datetime import datetime, timezone

import schedule


class ConfluenceSyncManager:
    def __init__(self, connector: ConfluenceConnector, indexer):
        self.connector = connector
        self.indexer = indexer
        self.last_sync = None
        self.sync_history = []

    def sync_incremental(self, spaces: list = None):
        """
        Incremental synchronization.
        Only processes pages modified since last sync.
        """
        if spaces is None:
            spaces = [s['key'] for s in self.connector.get_all_spaces()]

        updated_count = 0
        for space_key in spaces:
            pages = self.connector.get_space_pages(space_key)
            for page in pages:
                last_updated = page['metadata'].get('last_updated')
                if last_updated:
                    updated_dt = datetime.fromisoformat(
                        last_updated.replace('Z', '+00:00')
                    )
                    if self.last_sync is None or updated_dt > self.last_sync:
                        self.indexer.upsert_document(page)
                        updated_count += 1

        # Use an aware timestamp so it compares cleanly with Confluence dates
        self.last_sync = datetime.now(timezone.utc)
        self.sync_history.append({
            'timestamp': self.last_sync,
            'type': 'incremental',
            'documents_updated': updated_count
        })
        print(f"Incremental sync: {updated_count} documents updated")

    def sync_full(self, spaces: list = None):
        """Full synchronization with replacement."""
        if spaces:
            docs = []
            for space_key in spaces:
                docs.extend(self.connector.get_space_pages(space_key))
        else:
            docs = self.connector.get_all_documents()

        self.indexer.replace_all(docs)
        self.last_sync = datetime.now(timezone.utc)
        self.sync_history.append({
            'timestamp': self.last_sync,
            'type': 'full',
            'documents_indexed': len(docs)
        })
        print(f"Full sync: {len(docs)} documents indexed")

    def cleanup_deleted(self):
        """Remove documents whose pages no longer exist."""
        indexed_ids = self.indexer.get_all_ids()
        current_pages = set()
        for space in self.connector.get_all_spaces():
            pages = self.connector.get_space_pages(space['key'])
            for page in pages:
                current_pages.add(page['id'])

        # Find IDs to delete
        to_delete = indexed_ids - current_pages
        if to_delete:
            self.indexer.delete_documents(list(to_delete))
            print(f"Deleted {len(to_delete)} obsolete documents")


def start_confluence_sync_worker(sync_manager: ConfluenceSyncManager):
    """Start synchronization worker."""
    # Incremental sync every 15 minutes
    schedule.every(15).minutes.do(sync_manager.sync_incremental)
    # Weekly full sync
    schedule.every().sunday.at("02:00").do(sync_manager.sync_full)
    # Monthly cleanup
    schedule.every(30).days.do(sync_manager.cleanup_deleted)

    while True:
        schedule.run_pending()
        time.sleep(60)
```
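
Timestamps alone can over-trigger re-embedding: a label edit or a trivial save bumps the page version without changing the text. The `content_hash` the connector stores with each document supports a cheaper check. A minimal sketch, assuming the indexer can return a map of previously stored hashes:

```python
def pages_to_reindex(pages: list, stored_hashes: dict) -> list:
    """Keep only pages whose cleaned content actually changed.

    stored_hashes maps document id -> content_hash from the previous run,
    so unchanged pages are skipped even if their version number moved.
    """
    return [
        p for p in pages
        if stored_hashes.get(p["id"]) != p["metadata"]["content_hash"]
    ]
```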

Optimized Confluence System Prompt

```python
CONFLUENCE_SYSTEM_PROMPT = """You are the documentation assistant for {company_name}.
You help employees find information in our Confluence.

## Mission
- Answer based ONLY on Confluence documentation
- Systematically cite the space and source page
- Indicate the last update date if relevant
- Direct to the right contact if the question exceeds the docs

## Rules
1. Never invent information absent from the documentation
2. Prioritize recently updated pages
3. Flag information that seems outdated (> 1 year)
4. Adapt the level of detail to the question

## Format
- Direct answer first
- Details if necessary
- Source: [Space > Page] (date)

## Available Documentation
{context}
"""
```
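
To wire this prompt into the query pipeline, each retrieved chunk is rendered with its breadcrumb and update date so the model can cite sources. A sketch, with the template passed in as a parameter and messages in the common OpenAI-style chat schema:

```python
def build_messages(question: str, chunks: list, company_name: str, template: str) -> list:
    """Fill the system prompt with retrieved context and build the chat messages.

    Each chunk is expected to carry the metadata produced by the connector
    (breadcrumb, last_updated) so the model can cite [Space > Page] (date).
    """
    context = "\n\n---\n\n".join(
        f"[{c['metadata']['breadcrumb']}] (updated {c['metadata']['last_updated']})\n"
        f"{c['content']}"
        for c in chunks
    )
    system = template.format(company_name=company_name, context=context)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```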

Concrete Use Cases

1. Internal IT Support

Question: "How do I configure VPN for remote work?"

RAG Response:

To configure VPN:

  1. Download Cisco AnyConnect client from [internal link]
  2. Server: vpn.company.com
  3. Use your Active Directory credentials
  4. For 2FA, use the Duo app

If you have issues, open an IT ticket via ServiceNow.

Source: [IT > Remote Work VPN Guide] (updated 2 months ago)

2. HR Processes

Question: "What's the procedure for a training request?"

RAG Response:

Training request procedure:

  1. Identify the training with your manager
  2. Fill out form F-HR-015 on Confluence
  3. Get approval from your direct manager, then HR
  4. Budget: max 2000 EUR/year/employee

Processing time: 2-3 weeks.

Source: [HR > Training > Request Procedure] (January 2025)

3. Technical Documentation

Question: "How do I deploy to the staging environment?"

RAG Response:

Staging deployment:

```bash
# 1. Merge to develop
git checkout develop && git pull && git merge feature/xxx

# 2. Run pipeline
./scripts/deploy.sh staging

# 3. Verify
curl https://staging.company.com/health
```

Staging access: request from DevOps (#devops on Slack).

Source: [Tech > DevOps > Deployment] (December 2025)

Best Practices

Structure Confluence for RAG

| Practice | Benefit |
|---|---|
| Clear hierarchy (3 levels max) | Optimal chunking |
| Descriptive titles | Better search |
| Standardized labels | Easy filtering |
| Page templates | Format consistency |
| Short, focused pages | Response precision |

Manage Permissions

RAG respects Confluence permissions:

  1. Create a dedicated service account with read access
  2. Limit to public/internal spaces based on use case
  3. Never index confidential spaces without validation
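
Because every chunk carries its `space_key` in metadata, per-user restrictions can be enforced at query time instead of maintaining one index per audience. A sketch of the post-retrieval guard, assuming a hypothetical `user_spaces` permission map kept in sync with Confluence groups:

```python
def filter_hits_by_permission(hits: list, user: str, user_spaces: dict) -> list:
    """Drop retrieved chunks from spaces the user cannot read.

    user_spaces maps a username to the set of space keys they may access.
    In production, prefer pushing this as a metadata filter into the vector
    search itself so restricted chunks never reach the LLM at all.
    """
    allowed = user_spaces.get(user, set())
    return [h for h in hits if h["metadata"]["space_key"] in allowed]
```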

Quality Indicators

  • Response rate with source
  • Questions without match (add to docs)
  • Most cited pages
  • User feedback
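
These indicators are cheap to collect at answer time. A minimal tracker (the `sources` list would come from the citations returned with each answer):

```python
from collections import Counter


class RagMetrics:
    """Track the quality indicators listed above."""

    def __init__(self):
        self.total = 0
        self.with_source = 0
        self.unanswered = []            # candidates for new doc pages
        self.cited_pages = Counter()    # most cited pages

    def log(self, question: str, sources: list):
        self.total += 1
        if sources:
            self.with_source += 1
            self.cited_pages.update(sources)
        else:
            self.unanswered.append(question)

    def source_rate(self) -> float:
        """Share of answers backed by at least one source."""
        return self.with_source / self.total if self.total else 0.0

    def top_pages(self, n: int = 5) -> list:
        return self.cited_pages.most_common(n)
```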

Connect Confluence with Ailog

Transform your Confluence documentation into an intelligent assistant. Ailog simplifies integration:

  • Native Atlassian connector: Automatic multi-space synchronization
  • Semantic search: Find info in natural language
  • Permission respect: Granular access by space
  • Version history: Complete traceability
  • French hosting: Native GDPR compliance

Try Ailog for free and deploy your Confluence assistant in 15 minutes.

Tags

rag · confluence · atlassian · knowledge base · documentation · internal chatbot
