Confluence: AI Knowledge Base for Teams
Complete guide to deploying a RAG assistant on Confluence. Transform your Atlassian documentation into an AI-queryable knowledge base.
Confluence is the backbone of enterprise documentation in the Atlassian ecosystem. Millions of teams use it to centralize processes, technical guides, and strategic decisions. But over time, even the best-organized wikis become labyrinths where information gets lost. Industry surveys routinely estimate that employees spend around 20% of their time searching for information they know exists somewhere.
A RAG assistant transforms this documentary mass into a conversational interface. Instead of navigating complex folder structures, your teams ask questions in natural language and get synthesized answers with sources. This guide details the Confluence + RAG integration from A to Z.
The Confluence Problem at Scale
Common Symptoms
After a few years of use, Confluence presents recurring challenges:
- "I know it exists somewhere": The information is there, but unfindable
- Labyrinthine navigation: Too many spaces, sub-pages, hierarchies
- Outdated content: Unmaintained pages polluting search results
- Duplication: Same information in multiple spaces
- Difficult onboarding: New hires lost in the documentation
Revealing Statistics
| Metric | Enterprise Average |
|---|---|
| Time spent searching | 20% of work time |
| Pages never viewed | 60% after 6 months |
| Repeated questions to support | 40% are in the docs |
| Confluence search satisfaction | 3.2/10 |
Native Search vs RAG
| Criteria | Confluence Search | RAG Search |
|---|---|---|
| Query type | Exact keywords | Natural language |
| Result | List of pages | Direct answer |
| Multi-page | No | Synthesizes multiple sources |
| Context | None | Conversational history |
| Formats | Text only | Text + tables + code |
| Relevance | Recency > relevance | Semantic relevance |
Confluence + RAG Architecture
The integration rests on three pillars: extraction via the Confluence API, vector indexing, and conversational interface.
```
+---------------------------------------------------------------------------+
|                       Confluence + RAG Architecture                       |
+---------------------------------------------------------------------------+
|                                                                           |
|  CONFLUENCE          PROCESSING             VECTOR DB                     |
|  +--------------+    +--------------+       +-----------+                 |
|  | Spaces       |--->| Parsing      |------>| Qdrant    |                 |
|  |              |    | HTML -> MD   |       |           |                 |
|  |  - IT        |    +--------------+       | HNSW      |                 |
|  |  - HR        |           |               | Index     |                 |
|  |  - Product   |    +--------------+       +-----+-----+                 |
|  |  - Tech      |    | Chunking     |             |                       |
|  +--------------+    | 512 tokens   |             |                       |
|        |             +--------------+             |                       |
|  +-----+--------+           |                     |                       |
|  | REST API     |    +--------------+             |                       |
|  | v2           |    | Embeddings   |             |                       |
|  +--------------+    | BGE-M3       |             |                       |
|                      +--------------+             |                       |
|                                                   |                       |
|   QUERY PIPELINE                                  |                       |
|  +--------------+     +-------------+     +-------+------+                |
|  | Question     |---->| Retrieval   |<----| Reranker     |                |
|  | employee     |     | Top-30      |     | Top-5        |                |
|  +--------------+     +-------------+     +------+-------+                |
|                                                  |                        |
|  +--------------+                         +------+-------+                |
|  | Response     |<------------------------| LLM          |                |
|  | + Sources    |                         | GPT-4/Claude |                |
|  +--------------+                         +--------------+                |
|                                                                           |
+---------------------------------------------------------------------------+
```
Data Flow
- Extraction: The connector queries Confluence API v2 to retrieve pages
- Parsing: Confluence HTML is converted to clean Markdown
- Chunking: Documents are split into 512-token segments
- Embedding: Each chunk is vectorized with BGE-M3 (multilingual)
- Indexing: Vectors are stored in Qdrant with metadata
- Retrieval: Questions are semantically matched to chunks
- Reranking: A second model refines the ranking
- Generation: The LLM synthesizes the response with citations
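The chunking step above can be sketched as a sliding token window. This is a minimal illustration that uses whitespace tokens as a stand-in for the embedding model's real tokenizer; production pipelines typically add a small overlap (here 50 tokens, roughly 10%) so sentences are not cut mid-context:

```python
def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> list:
    """Split text into overlapping token windows.

    Whitespace tokens stand in for the real tokenizer in this sketch.
    """
    tokens = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))
        # Stop once the window reaches the end of the document
        if start + max_tokens >= len(tokens):
            break
    return chunks

# A 1000-token page yields 3 windows of at most 512 tokens each
chunks = chunk_text(" ".join(str(i) for i in range(1000)))
print(len(chunks))  # → 3
```

Each chunk keeps its page's metadata (URL, breadcrumb, labels) so the final answer can cite its source.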
Complete Confluence Connector
Here's a reference implementation for extracting Confluence content, built on the atlassian-python-api and beautifulsoup4 libraries:
```python
from atlassian import Confluence
from bs4 import BeautifulSoup
import hashlib
import re


class ConfluenceConnector:
    def __init__(self, url: str, username: str, api_token: str):
        """
        Initialize Confluence connector.

        Args:
            url: Instance URL (e.g., https://company.atlassian.net)
            username: User email
            api_token: Atlassian API token
        """
        self.confluence = Confluence(
            url=url,
            username=username,
            password=api_token,
            cloud=True
        )
        self.base_url = url

    def get_all_spaces(self) -> list:
        """Retrieve all accessible spaces."""
        spaces = []
        start = 0
        limit = 50
        while True:
            result = self.confluence.get_all_spaces(
                start=start,
                limit=limit,
                expand='description.plain'
            )
            for space in result.get('results', []):
                spaces.append({
                    'key': space['key'],
                    'name': space['name'],
                    'type': space.get('type', 'global'),
                    'description': space.get('description', {}).get('plain', {}).get('value', '')
                })
            if len(result.get('results', [])) < limit:
                break
            start += limit
        return spaces

    def get_space_pages(self, space_key: str, include_archived: bool = False) -> list:
        """
        Retrieve all pages from a space.

        Args:
            space_key: Space key (e.g., 'IT', 'HR')
            include_archived: Include archived pages

        Returns:
            List of documents formatted for RAG
        """
        pages = []
        start = 0
        limit = 50
        while True:
            try:
                result = self.confluence.get_all_pages_from_space(
                    space_key,
                    start=start,
                    limit=limit,
                    expand='body.storage,ancestors,version,metadata.labels'
                )
            except Exception as e:
                print(f"Error space {space_key}: {e}")
                break
            for page in result:
                # Filter archived pages if requested
                if not include_archived and page.get('status') == 'archived':
                    continue
                doc = self._format_page(page, space_key)
                if doc and len(doc['content']) > 100:  # Ignore short pages
                    pages.append(doc)
            if len(result) < limit:
                break
            start += limit
        return pages

    def _format_page(self, page: dict, space_key: str) -> dict:
        """Format a Confluence page as a RAG document."""
        # Extract and clean HTML content
        html_content = page.get('body', {}).get('storage', {}).get('value', '')
        text_content = self._html_to_markdown(html_content)

        # Build hierarchical path (breadcrumb)
        ancestors = page.get('ancestors', [])
        path_parts = [a['title'] for a in ancestors] + [page['title']]
        breadcrumb = ' > '.join(path_parts)

        # Extract labels
        labels = [
            label['name']
            for label in page.get('metadata', {}).get('labels', {}).get('results', [])
        ]

        # Hash to detect changes
        content_hash = hashlib.md5(text_content.encode()).hexdigest()

        # Version and date
        version = page.get('version', {})

        return {
            "id": f"confluence_{page['id']}",
            "title": page['title'],
            "content": f"# {page['title']}\n\n**Path**: {breadcrumb}\n\n{text_content}",
            "metadata": {
                "source": "confluence",
                "source_type": "documentation",
                "space_key": space_key,
                "page_id": page['id'],
                "url": f"{self.base_url}/wiki/spaces/{space_key}/pages/{page['id']}",
                "breadcrumb": breadcrumb,
                "labels": labels,
                "version": version.get('number', 1),
                "last_updated": version.get('when'),
                "last_updated_by": version.get('by', {}).get('displayName'),
                "content_hash": content_hash
            }
        }

    def _html_to_markdown(self, html: str) -> str:
        """
        Convert Confluence HTML to clean Markdown.

        Handles Confluence-specific formats:
        - Macros (code, panel, note, warning)
        - Tables
        - Nested lists
        - Links and mentions
        """
        if not html:
            return ""
        soup = BeautifulSoup(html, 'html.parser')

        # Process code macros
        for code_block in soup.find_all('ac:structured-macro', {'ac:name': 'code'}):
            language = code_block.get('ac:language', '')
            code_body = code_block.find('ac:plain-text-body')
            if code_body:
                code_text = code_body.get_text()
                code_block.replace_with(f"\n```{language}\n{code_text}\n```\n")

        # Process panels and notes
        for panel in soup.find_all('ac:structured-macro', {'ac:name': ['panel', 'note', 'warning', 'info']}):
            panel_type = panel.get('ac:name', 'note')
            body = panel.find('ac:rich-text-body')
            if body:
                panel_text = body.get_text(separator='\n')
                panel.replace_with(f"\n> **{panel_type.upper()}**: {panel_text}\n")

        # Process tables
        for table in soup.find_all('table'):
            rows = table.find_all('tr')
            if rows:
                md_table = self._table_to_markdown(rows)
                table.replace_with(md_table)

        # Process headers
        for i in range(1, 7):
            for header in soup.find_all(f'h{i}'):
                header.replace_with(f"\n{'#' * i} {header.get_text()}\n")

        # Process lists
        for ul in soup.find_all('ul'):
            items = ul.find_all('li', recursive=False)
            list_text = '\n'.join([f"- {li.get_text()}" for li in items])
            ul.replace_with(f"\n{list_text}\n")
        for ol in soup.find_all('ol'):
            items = ol.find_all('li', recursive=False)
            list_text = '\n'.join([f"{i+1}. {li.get_text()}" for i, li in enumerate(items)])
            ol.replace_with(f"\n{list_text}\n")

        # Clean final text
        text = soup.get_text(separator='\n')
        text = re.sub(r'\n{3,}', '\n\n', text)  # Max 2 empty lines
        text = re.sub(r' +', ' ', text)         # Collapse multiple spaces
        return text.strip()

    def _table_to_markdown(self, rows) -> str:
        """Convert an HTML table to Markdown."""
        md_lines = []
        for i, row in enumerate(rows):
            cells = row.find_all(['th', 'td'])
            cell_texts = [cell.get_text().strip().replace('|', '\\|') for cell in cells]
            md_lines.append('| ' + ' | '.join(cell_texts) + ' |')
            # Add separator line after header
            if i == 0:
                md_lines.append('| ' + ' | '.join(['---'] * len(cells)) + ' |')
        return '\n' + '\n'.join(md_lines) + '\n'

    def get_page_comments(self, page_id: str) -> list:
        """Retrieve page comments (optional)."""
        try:
            comments = self.confluence.get_page_comments(
                page_id,
                expand='body.storage',
                depth='all'
            )
            return [
                {
                    'author': c.get('author', {}).get('displayName', 'Unknown'),
                    'content': BeautifulSoup(
                        c.get('body', {}).get('storage', {}).get('value', ''),
                        'html.parser'
                    ).get_text(),
                    'date': c.get('created')
                }
                for c in comments.get('results', [])
            ]
        except Exception:
            return []


class ConfluenceMultiSpaceConnector(ConfluenceConnector):
    """Extension to manage multiple spaces with filtering."""

    def get_all_documents(
        self,
        space_keys: list = None,
        exclude_spaces: list = None,
        labels_filter: list = None
    ) -> list:
        """
        Retrieve documents from multiple spaces.

        Args:
            space_keys: List of spaces to include (None = all)
            exclude_spaces: Spaces to exclude
            labels_filter: Only keep pages with these labels

        Returns:
            List of all documents
        """
        all_docs = []
        # Get space list
        spaces = self.get_all_spaces()
        for space in spaces:
            key = space['key']
            # Filter spaces
            if space_keys and key not in space_keys:
                continue
            if exclude_spaces and key in exclude_spaces:
                continue
            print(f"Indexing space: {space['name']} ({key})")
            pages = self.get_space_pages(key)
            # Filter by labels if requested
            if labels_filter:
                pages = [
                    p for p in pages
                    if any(label in p['metadata']['labels'] for label in labels_filter)
                ]
            all_docs.extend(pages)
        return all_docs
```
Synchronization and Updates
Synchronization can be incremental (efficient) or full (cleanup):
```python
from datetime import datetime, timezone
import time

import schedule


class ConfluenceSyncManager:
    def __init__(self, connector: ConfluenceConnector, indexer):
        self.connector = connector
        self.indexer = indexer
        self.last_sync = None
        self.sync_history = []

    def sync_incremental(self, spaces: list = None):
        """
        Incremental synchronization.
        Only processes pages modified since last sync.
        """
        if spaces is None:
            spaces = [s['key'] for s in self.connector.get_all_spaces()]

        updated_count = 0
        for space_key in spaces:
            pages = self.connector.get_space_pages(space_key)
            for page in pages:
                last_updated = page['metadata'].get('last_updated')
                if last_updated:
                    updated_dt = datetime.fromisoformat(
                        last_updated.replace('Z', '+00:00')
                    )
                    if self.last_sync is None or updated_dt > self.last_sync:
                        self.indexer.upsert_document(page)
                        updated_count += 1

        # Timezone-aware, so it compares cleanly with Confluence timestamps
        self.last_sync = datetime.now(timezone.utc)
        self.sync_history.append({
            'timestamp': self.last_sync,
            'type': 'incremental',
            'documents_updated': updated_count
        })
        print(f"Incremental sync: {updated_count} documents updated")

    def sync_full(self, spaces: list = None):
        """Full synchronization with replacement."""
        if spaces is None:
            # Requires a ConfluenceMultiSpaceConnector
            docs = self.connector.get_all_documents()
        else:
            docs = []
            for space_key in spaces:
                docs.extend(self.connector.get_space_pages(space_key))

        self.indexer.replace_all(docs)
        self.last_sync = datetime.now(timezone.utc)
        self.sync_history.append({
            'timestamp': self.last_sync,
            'type': 'full',
            'documents_indexed': len(docs)
        })
        print(f"Full sync: {len(docs)} documents indexed")

    def cleanup_deleted(self):
        """Remove documents whose pages no longer exist."""
        indexed_ids = self.indexer.get_all_ids()
        current_pages = set()
        for space in self.connector.get_all_spaces():
            pages = self.connector.get_space_pages(space['key'])
            for page in pages:
                current_pages.add(page['id'])

        # Find IDs to delete
        to_delete = indexed_ids - current_pages
        if to_delete:
            self.indexer.delete_documents(list(to_delete))
            print(f"Deleted {len(to_delete)} obsolete documents")


def start_confluence_sync_worker(sync_manager: ConfluenceSyncManager):
    """Start synchronization worker."""
    # Incremental sync every 15 minutes
    schedule.every(15).minutes.do(sync_manager.sync_incremental)
    # Weekly full sync
    schedule.every().sunday.at("02:00").do(sync_manager.sync_full)
    # Monthly cleanup
    schedule.every(30).days.do(sync_manager.cleanup_deleted)

    while True:
        schedule.run_pending()
        time.sleep(60)
```
Optimized Confluence System Prompt
```python
CONFLUENCE_SYSTEM_PROMPT = """You are the documentation assistant for {company_name}.
You help employees find information in our Confluence.

## Mission
- Answer based ONLY on Confluence documentation
- Systematically cite the space and source page
- Indicate last update date if relevant
- Direct to the right contact if the question exceeds the docs

## Rules
1. Never invent information absent from documentation
2. Prioritize recently updated pages
3. Flag if information seems outdated (> 1 year)
4. Adapt detail level to the question

## Format
- Direct answer first
- Details if necessary
- Source: [Space > Page] (date)

## Available Documentation
{context}
"""
```
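At query time, `{context}` is filled with the retrieved chunks and their source lines. A minimal sketch of that assembly, using an abridged copy of the template so the example is self-contained; the chunk shape matches what the connector's `_format_page` produces:

```python
# Abridged copy of the template above, so this sketch is self-contained
TEMPLATE = (
    "You are the documentation assistant for {company_name}.\n"
    "## Available Documentation\n{context}"
)


def build_context(chunks: list) -> str:
    """Format retrieved chunks with a source line for each."""
    parts = []
    for chunk in chunks:
        meta = chunk["metadata"]
        source = f"[{meta['space_key']} > {meta['breadcrumb']}] ({meta.get('last_updated') or 'unknown date'})"
        parts.append(f"Source: {source}\n{chunk['content']}")
    return "\n\n---\n\n".join(parts)


# Hypothetical retrieved chunk, for illustration only
chunks = [{
    "content": "Use Cisco AnyConnect with vpn.company.com.",
    "metadata": {"space_key": "IT", "breadcrumb": "Remote Work > VPN Guide",
                 "last_updated": "2025-01-10"},
}]
prompt = TEMPLATE.format(company_name="Acme", context=build_context(chunks))
print(prompt)
```

Keeping the source line next to each chunk is what allows the LLM to cite `[Space > Page] (date)` in its answers.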
Concrete Use Cases
1. Internal IT Support
Question: "How do I configure VPN for remote work?"
RAG Response:
To configure VPN:
- Download Cisco AnyConnect client from [internal link]
- Server: vpn.company.com
- Use your Active Directory credentials
- For 2FA, use the Duo app
If you have issues, open an IT ticket via ServiceNow.
Source: [IT > Remote Work VPN Guide] (updated 2 months ago)
2. HR Processes
Question: "What's the procedure for a training request?"
RAG Response:
Training request procedure:
- Identify the training with your manager
- Fill out form F-HR-015 on Confluence
- Get approval from your N+1 then HR
- Budget: max 2000 EUR/year/employee
Processing time: 2-3 weeks.
Source: [HR > Training > Request Procedure] (January 2025)
3. Technical Documentation
Question: "How do I deploy to the staging environment?"
RAG Response:
Staging deployment:
```bash
# 1. Merge to develop
git checkout develop && git pull && git merge feature/xxx
# 2. Run pipeline
./scripts/deploy.sh staging
# 3. Verify
curl https://staging.company.com/health
```
Staging access: request from DevOps (#devops on Slack).
Source: [Tech > DevOps > Deployment] (December 2025)
Best Practices
Structure Confluence for RAG
| Practice | Benefit |
|---|---|
| Clear hierarchy (3 levels max) | Optimal chunking |
| Descriptive titles | Better search |
| Standardized labels | Easy filtering |
| Page templates | Format consistency |
| Short, focused pages | Response precision |
Manage Permissions
RAG respects Confluence permissions:
- Create a dedicated service account with read access
- Limit to public/internal spaces based on use case
- Never index confidential spaces without validation
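A simple way to enforce the last two points at indexing time is an explicit allow-list checked before any space is synced. This sketch assumes space keys are the permission unit and that a hypothetical `ALLOWED_SPACES` set has been validated with the space owners:

```python
# Hypothetical allow-list, validated with space owners before indexing
ALLOWED_SPACES = {"IT", "HR", "PRODUCT"}


def filter_indexable(spaces: list) -> list:
    """Keep only spaces explicitly cleared for indexing; personal spaces never pass."""
    return [
        s for s in spaces
        if s["key"] in ALLOWED_SPACES and s.get("type") != "personal"
    ]


spaces = [
    {"key": "IT", "type": "global"},
    {"key": "LEGAL", "type": "global"},    # not cleared -> excluded
    {"key": "~jdoe", "type": "personal"},  # personal space -> excluded
]
print([s["key"] for s in filter_indexable(spaces)])  # → ['IT']
```

The same allow-list can feed the `space_keys` parameter of the multi-space connector above, so nothing outside it is ever fetched.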
Quality Indicators
- Response rate with source
- Questions without match (add to docs)
- Most cited pages
- User feedback
Related Resources
- Enterprise Knowledge Base - Pillar guide
- Notion + RAG - Notion alternative
- SharePoint + RAG - For Microsoft 365
- Slack Bot RAG - Search in conversations
- Introduction to RAG - The fundamentals
Connect Confluence with Ailog
Transform your Confluence documentation into an intelligent assistant. Ailog simplifies integration:
- Native Atlassian connector: Automatic multi-space synchronization
- Semantic search: Find info in natural language
- Permission respect: Granular access by space
- Version history: Complete traceability
- French hosting: Native GDPR compliance
Try Ailog for free and deploy your Confluence assistant in 15 minutes.