Guide · Advanced

Audio RAG: Podcasts, Calls and Transcriptions

March 20, 2026
22 min read
Ailog Team

Complete guide to integrating audio into your RAG system: transcription with Whisper, speaker diarization, indexing podcasts and call recordings.


Audio represents a goldmine of often untapped information: recorded meetings, sales calls, internal podcasts, training sessions. Audio RAG makes all this sound content searchable and exploitable by your AI assistants.

Why Audio RAG?

The Audio Data Problem

  • Massive volume: An average company generates 50+ hours of audio/week (meetings, calls)
  • Lost information: 80% of meeting content is never documented
  • Impossible search: Can't "ctrl+F" in an audio file
  • Wasted time: Re-listening to recordings just to find one piece of information is inefficient

Business Use Cases

| Sector | Audio Source | Extracted Value |
|---|---|---|
| Sales | Sales calls | Frequent objections, customer insights |
| Support | Ticket recordings | Recurring problem patterns |
| HR | Interviews | Candidate feedback, trends |
| Training | Webinars | Training knowledge base |
| Legal | Depositions | Search through testimonies |

Typical ROI

  • 70% reduction in information search time
  • +40% retention of knowledge shared in meetings
  • Compliance: Traceability of verbal exchanges

Audio RAG Architecture

┌─────────────────────────────────────────────────────────────┐
│                    AUDIO RAG PIPELINE                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │  Audio   │───▶│  Whisper/    │───▶│  Transcription   │  │
│  │  Input   │    │  STT Model   │    │  + Timestamps    │  │
│  └──────────┘    └──────────────┘    └──────────────────┘  │
│                         │                     │             │
│                         ▼                     ▼             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Diarization (speaker ID)                 │  │
│  └──────────────────────────────────────────────────────┘  │
│                         │                                   │
│                         ▼                                   │
│  ┌──────────────────────────────────────────────────────┐  │
│  │     Semantic segmentation (topics/chapters)           │  │
│  └──────────────────────────────────────────────────────┘  │
│                         │                                   │
│                         ▼                                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │  Embedding   │  │   Vector     │  │    Metadata      │  │
│  │  per segment │  │   Store      │  │  (speaker, time) │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
│                         │                                   │
│                         ▼                                   │
│  ┌──────────────────────────────────────────────────────┐  │
│  │         Retrieval + Generation with source            │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Transcription: The Foundation

STT Model Comparison

| Model | Accuracy | Languages | Cost | Latency | Open source |
|---|---|---|---|---|---|
| Whisper Large v3 | 95%+ | 99 | $0 (local) | Slow | Yes |
| Whisper API | 95%+ | 99 | $0.006/min | Fast | No |
| AssemblyAI | 97%+ | 12 | $0.01/min | Fast | No |
| Deepgram | 96%+ | 36 | $0.0043/min | Real-time | No |
| Google STT | 95%+ | 125+ | $0.006/min | Fast | No |
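One way to make the trade-offs in this table concrete is a small selection helper. The rules and budget threshold below are illustrative simplifications of the table, not vendor recommendations:

```python
def choose_stt(realtime: bool = False, self_hosted: bool = False,
               budget_per_min: float = 0.01) -> str:
    """Pick an STT option based on the comparison table (simplified rules)."""
    if self_hosted:
        return "whisper-large-v3"  # $0, open source, but slow without a GPU
    if realtime:
        return "deepgram"          # only real-time option in the table
    if budget_per_min >= 0.01:
        return "assemblyai"        # highest accuracy (97%+) at $0.01/min
    return "whisper-api"           # cheap, fast, 99 languages
```

For example, a live-captioning use case (`realtime=True`) would route to Deepgram, while a tight per-minute budget falls back to the Whisper API.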

Whisper: The Recommended Choice

OpenAI's Whisper offers the best value, especially when self-hosted.

```python
import whisper


class AudioTranscriber:
    def __init__(self, model_size: str = "large-v3"):
        """
        Available models: tiny, base, small, medium, large, large-v3
        VRAM required: tiny=1GB, base=1GB, small=2GB, medium=5GB, large=10GB
        """
        self.model = whisper.load_model(model_size)

    def transcribe(
        self,
        audio_path: str,
        language: str = None,
        word_timestamps: bool = True
    ) -> dict:
        """Transcribe an audio file with timestamps."""
        result = self.model.transcribe(
            audio_path,
            language=language,
            word_timestamps=word_timestamps,
            verbose=False
        )
        return {
            "text": result["text"],
            "segments": result["segments"],
            "language": result["language"],
            "duration": result["segments"][-1]["end"] if result["segments"] else 0
        }

    def transcribe_with_chunks(
        self,
        audio_path: str,
        chunk_duration: int = 300  # 5 minutes
    ) -> list[dict]:
        """
        Transcribe in chunks for long audio files.
        Avoids memory issues and improves accuracy.
        """
        from pydub import AudioSegment

        audio = AudioSegment.from_file(audio_path)
        duration_ms = len(audio)
        chunk_ms = chunk_duration * 1000

        chunks = []
        for i, start in enumerate(range(0, duration_ms, chunk_ms)):
            end = min(start + chunk_ms, duration_ms)
            chunk = audio[start:end]

            # Export temporarily
            temp_path = f"/tmp/chunk_{i}.wav"
            chunk.export(temp_path, format="wav")

            # Transcribe
            result = self.transcribe(temp_path)

            # Adjust timestamps to the position in the full file
            for seg in result["segments"]:
                seg["start"] += start / 1000
                seg["end"] += start / 1000

            chunks.append({
                "chunk_index": i,
                "start_time": start / 1000,
                "end_time": end / 1000,
                **result
            })

        return chunks
```

Whisper API (Simpler)

```python
from openai import OpenAI


def transcribe_with_api(audio_path: str) -> dict:
    """Transcription via OpenAI Whisper API."""
    client = OpenAI()

    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"]
        )

    return {
        "text": transcript.text,
        "segments": transcript.segments,
        "words": transcript.words,
        "language": transcript.language,
        "duration": transcript.duration
    }
```

Diarization: Identifying Speakers

Diarization answers the question "Who speaks when?" It is essential for multi-participant meetings.

Pyannote: The Open Source Standard

```python
import torch
from pyannote.audio import Pipeline


class SpeakerDiarizer:
    def __init__(self, hf_token: str):
        """
        Requires a HuggingFace token with access to the
        pyannote/speaker-diarization-3.1 model.
        """
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=hf_token
        )
        if torch.cuda.is_available():
            self.pipeline.to(torch.device("cuda"))

    def diarize(self, audio_path: str, num_speakers: int = None) -> list[dict]:
        """
        Identify speakers in audio.

        Args:
            audio_path: Path to audio file
            num_speakers: Number of speakers (optional, auto-detected otherwise)
        """
        diarization = self.pipeline(
            audio_path,
            num_speakers=num_speakers
        )

        segments = []
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({
                "speaker": speaker,
                "start": turn.start,
                "end": turn.end,
                "duration": turn.end - turn.start
            })
        return segments

    def merge_transcription_diarization(
        self,
        transcription: dict,
        diarization: list[dict]
    ) -> list[dict]:
        """Merge transcription and diarization."""
        merged = []
        for trans_seg in transcription["segments"]:
            # Find the speaker who talks most during this segment
            seg_start = trans_seg["start"]
            seg_end = trans_seg["end"]

            speaker_times = {}
            for diar_seg in diarization:
                overlap_start = max(seg_start, diar_seg["start"])
                overlap_end = min(seg_end, diar_seg["end"])
                if overlap_start < overlap_end:
                    overlap = overlap_end - overlap_start
                    speaker = diar_seg["speaker"]
                    speaker_times[speaker] = speaker_times.get(speaker, 0) + overlap

            # Assign the majority speaker
            speaker = max(speaker_times, key=speaker_times.get) if speaker_times else "UNKNOWN"

            merged.append({
                "speaker": speaker,
                "start": seg_start,
                "end": seg_end,
                "text": trans_seg["text"]
            })
        return merged
```

Semantic Segmentation

Split the transcription into coherent topics/chapters for better retrieval.

```python
import json

from openai import OpenAI


def segment_transcript_by_topics(
    transcript_segments: list[dict],
    client: OpenAI
) -> list[dict]:
    """Segment a transcription into thematic topics."""
    # Format transcription
    formatted = "\n".join([
        f"[{seg['start']:.1f}s - {seg['end']:.1f}s] "
        f"{seg.get('speaker', 'Speaker')}: {seg['text']}"
        for seg in transcript_segments
    ])

    prompt = f"""Analyze this transcription and identify the different topics discussed.

For each topic, provide:
1. Topic title (short, descriptive)
2. Start timestamp (in seconds)
3. End timestamp (in seconds)
4. Summary in 1-2 sentences

Transcription:
{formatted}

Respond in JSON format:
{{"topics": [{{"title": "...", "start": 0.0, "end": 120.0, "summary": "..."}}, ...]}}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    # json_object mode always returns an object, so the topic list
    # is wrapped in a "topics" key
    topics = json.loads(response.choices[0].message.content)["topics"]

    # Enrich each topic with its corresponding text
    for topic in topics:
        topic_text = []
        for seg in transcript_segments:
            if seg["start"] >= topic["start"] and seg["end"] <= topic["end"]:
                topic_text.append(seg["text"])
        topic["full_text"] = " ".join(topic_text)

    return topics
```

RAG Indexing

Recommended Data Structure

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioChunk:
    """Represents an indexable audio segment."""
    chunk_id: str
    audio_source_id: str
    audio_source_title: str
    # Content
    text: str
    speaker: Optional[str]
    # Temporal
    start_time: float
    end_time: float
    # Context
    topic: Optional[str]
    topic_summary: Optional[str]
    # Metadata
    language: str
    confidence: float
    created_at: str

    def to_indexable_text(self) -> str:
        """Enriched text for embedding."""
        parts = []
        if self.topic:
            parts.append(f"Topic: {self.topic}")
        if self.speaker:
            parts.append(f"Speaker: {self.speaker}")
        parts.append(self.text)
        return "\n".join(parts)
```

Complete Indexing Pipeline

```python
import hashlib
from datetime import datetime

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct


class AudioRAGIndexer:
    def __init__(self):
        self.qdrant = QdrantClient(url="http://localhost:6333")
        self.openai = OpenAI()
        self.transcriber = AudioTranscriber()
        self.diarizer = SpeakerDiarizer(hf_token="...")
        self.collection_name = "audio_rag"

    def create_collection(self):
        """Create the Qdrant collection."""
        self.qdrant.recreate_collection(
            collection_name=self.collection_name,
            vectors_config=VectorParams(
                size=1536,  # text-embedding-3-small
                distance=Distance.COSINE
            )
        )

    def process_audio(
        self,
        audio_path: str,
        title: str,
        num_speakers: int = None
    ) -> list[AudioChunk]:
        """Complete audio processing pipeline."""
        # 1. Transcription
        print("Transcribing...")
        transcription = self.transcriber.transcribe(audio_path)

        # 2. Diarization
        print("Diarizing...")
        diarization = self.diarizer.diarize(audio_path, num_speakers)

        # 3. Merge
        merged = self.diarizer.merge_transcription_diarization(
            transcription, diarization
        )

        # 4. Topic segmentation
        print("Segmenting by topics...")
        topics = segment_transcript_by_topics(merged, self.openai)

        # 5. Create chunks
        chunks = []
        source_id = hashlib.md5(audio_path.encode()).hexdigest()

        for topic in topics:
            chunk = AudioChunk(
                chunk_id=f"{source_id}_{topic['start']}",
                audio_source_id=source_id,
                audio_source_title=title,
                text=topic["full_text"],
                speaker=None,  # Multi-speaker in a topic
                start_time=topic["start"],
                end_time=topic["end"],
                topic=topic["title"],
                topic_summary=topic["summary"],
                language=transcription["language"],
                confidence=0.95,
                created_at=datetime.now().isoformat()
            )
            chunks.append(chunk)

        return chunks

    def index_chunks(self, chunks: list[AudioChunk]):
        """Index chunks in Qdrant."""
        points = []
        for chunk in chunks:
            # Generate embedding
            text = chunk.to_indexable_text()
            response = self.openai.embeddings.create(
                model="text-embedding-3-small",
                input=text
            )
            embedding = response.data[0].embedding

            point = PointStruct(
                id=hash(chunk.chunk_id) % (2**63),
                vector=embedding,
                payload={
                    "chunk_id": chunk.chunk_id,
                    "audio_source_id": chunk.audio_source_id,
                    "audio_source_title": chunk.audio_source_title,
                    "text": chunk.text,
                    "speaker": chunk.speaker,
                    "start_time": chunk.start_time,
                    "end_time": chunk.end_time,
                    "topic": chunk.topic,
                    "topic_summary": chunk.topic_summary,
                    "language": chunk.language
                }
            )
            points.append(point)

        self.qdrant.upsert(
            collection_name=self.collection_name,
            points=points
        )
        print(f"Indexed {len(points)} chunks")
```

Retrieval and Generation

Search with Temporal Context

```python
def search_audio_rag(
    query: str,
    indexer: AudioRAGIndexer,
    limit: int = 5
) -> list[dict]:
    """Search transcriptions with context."""
    # Query embedding
    response = indexer.openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = response.data[0].embedding

    # Vector search
    results = indexer.qdrant.search(
        collection_name=indexer.collection_name,
        query_vector=query_embedding,
        limit=limit
    )

    return [
        {
            "text": r.payload["text"],
            "source": r.payload["audio_source_title"],
            "topic": r.payload["topic"],
            "timestamp": f"{r.payload['start_time']:.0f}s - {r.payload['end_time']:.0f}s",
            "score": r.score
        }
        for r in results
    ]
```

Generation with Audio Source Citation

```python
from openai import OpenAI


def generate_answer_with_audio_sources(
    query: str,
    retrieved_chunks: list[dict],
    client: OpenAI
) -> str:
    """Generate a response citing audio sources."""
    context = "\n\n".join([
        f"**Source: {c['source']}** (Topic: {c['topic']}, {c['timestamp']})\n{c['text']}"
        for c in retrieved_chunks
    ])

    prompt = f"""You are an assistant that answers questions based on audio transcriptions.

Context (transcription excerpts):
{context}

Question: {query}

Instructions:
1. Answer based only on the provided transcriptions
2. Cite your sources with format [Source: title, timestamp]
3. If the information is not in the transcriptions, say so clearly
4. Be concise but precise"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1000
    )
    return response.choices[0].message.content
```

Advanced Optimizations

Automatic Meeting Summary

```python
import json

from openai import OpenAI


def summarize_meeting(
    transcript_with_speakers: list[dict],
    client: OpenAI
) -> dict:
    """Generate a structured meeting summary."""
    formatted = "\n".join([
        f"{seg['speaker']}: {seg['text']}"
        for seg in transcript_with_speakers
    ])

    prompt = f"""Analyze this meeting transcription and generate a structured summary.

Transcription:
{formatted}

Generate a JSON with:
{{
  "title": "Suggested title for this meeting",
  "participants": ["List of identified participants"],
  "duration_minutes": X,
  "key_points": ["Key point 1", "Key point 2", ...],
  "decisions": ["Decision 1", "Decision 2", ...],
  "action_items": [
    {{"assignee": "Name", "task": "Description", "deadline": "if mentioned"}}
  ],
  "next_steps": ["Next step 1", ...],
  "summary": "Summary in 2-3 paragraphs"
}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
```

Key Moment Detection

```python
import json

from openai import OpenAI


def detect_key_moments(
    transcript_segments: list[dict],
    client: OpenAI
) -> list[dict]:
    """Identify important moments in audio."""
    formatted = "\n".join([
        f"[{seg['start']:.0f}s] {seg.get('speaker', 'Speaker')}: {seg['text']}"
        for seg in transcript_segments
    ])

    prompt = f"""Identify key moments in this transcription:
- Important questions asked
- Decisions made
- Disagreements or debates
- Critical information shared
- Moments of humor or tension

Transcription:
{formatted}

For each key moment, provide:
- timestamp (in seconds)
- type (question/decision/debate/info/other)
- short description
- importance (1-5)

Respond in JSON:
{{"moments": [{{"timestamp": X, "type": "...", "description": "...", "importance": X}}, ...]}}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    # json_object mode always returns an object, so the list of
    # moments is wrapped in a "moments" key
    return json.loads(response.choices[0].message.content)["moments"]
```

Costs and Performance

Transcription Costs

| Solution | Cost/hour | Latency | Accuracy |
|---|---|---|---|
| Whisper local (GPU) | ~$0.10 (electricity) | 10-30 min | 95%+ |
| Whisper API | $0.36 | 2-5 min | 95%+ |
| AssemblyAI | $0.60 | 5-10 min | 97%+ |
| Deepgram | $0.26 | Real-time | 96%+ |
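To turn the table into a budget, a quick back-of-the-envelope estimator (prices from the table above, assuming 4 weeks per month):

```python
# Cost per transcribed hour, from the comparison table above
COST_PER_HOUR = {
    "whisper_local": 0.10,  # ~electricity only, GPU assumed available
    "whisper_api": 0.36,
    "assemblyai": 0.60,
    "deepgram": 0.26,
}


def monthly_cost(solution: str, hours_per_week: float) -> float:
    """Approximate monthly transcription cost in USD (4 weeks/month)."""
    return round(COST_PER_HOUR[solution] * hours_per_week * 4, 2)
```

For the 50 hours/week cited earlier, the Whisper API comes to about $72/month versus roughly $20 for a self-hosted setup (excluding GPU hardware).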

Diarization Costs

  • Pyannote local: $0 (but GPU required)
  • AssemblyAI: Included
  • AWS Transcribe: +$0.024/min

Estimated Storage

  • 1 hour audio = ~15,000 words = ~100 chunks
  • Embeddings: ~600KB/hour
  • Metadata: ~50KB/hour
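The embedding figure follows directly from the dimensions used in this guide (1536-dim float32 vectors from text-embedding-3-small, ~100 chunks/hour). A quick sanity check:

```python
def embedding_storage_per_hour(chunks_per_hour: int = 100, dims: int = 1536,
                               bytes_per_float: int = 4) -> float:
    """Embedding storage in KB for one hour of audio."""
    return chunks_per_hour * dims * bytes_per_float / 1024


# 100 chunks × 1536 dims × 4 bytes = 614,400 bytes = 600 KB/hour
```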

Integration with Ailog

Ailog simplifies Audio RAG with native integration:

  1. Audio upload: MP3, WAV, M4A and WEBM formats supported
  2. Automatic transcription: Built-in Whisper
  3. Intelligent indexing: Topic segmentation
  4. Unified search: Audio + text + images in one query

Try Audio RAG on Ailog


Tags

RAG · multimodal · audio · transcription · Whisper · podcasts · speech-to-text
