Audio RAG: Podcasts, Calls and Transcriptions
Complete guide to integrating audio into your RAG system: transcription with Whisper, speaker diarization, indexing podcasts and call recordings.
Audio represents a goldmine of often untapped information: recorded meetings, sales calls, internal podcasts, training sessions. Audio RAG makes all this sound content searchable and exploitable by your AI assistants.
Why Audio RAG?
The Audio Data Problem
- Massive volume: a mid-sized company can easily generate 50+ hours of audio per week (meetings, calls)
- Lost information: the large majority of meeting content is never formally documented
- No search: you can't Ctrl+F an audio file
- Wasted time: re-listening to entire recordings just to find one piece of information
Business Use Cases
| Sector | Audio Source | Extracted Value |
|---|---|---|
| Sales | Sales calls | Frequent objections, customer insights |
| Support | Ticket recordings | Recurring problem patterns |
| HR | Interviews | Candidate feedback, trends |
| Training | Webinars | Training knowledge base |
| Legal | Depositions | Search through testimonies |
Typical ROI
- 70% reduction in information search time
- +40% retention of knowledge shared in meetings
- Compliance: Traceability of verbal exchanges
Audio RAG Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                    AUDIO RAG PIPELINE                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────┐   │
│  │  Audio   │───▶│  Whisper/    │───▶│  Transcription   │   │
│  │  Input   │    │  STT Model   │    │  + Timestamps    │   │
│  └──────────┘    └──────────────┘    └──────────────────┘   │
│        │                                      │             │
│        ▼                                      ▼             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              Diarization (speaker ID)                │   │
│  └──────────────────────────────────────────────────────┘   │
│                          │                                  │
│                          ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │        Semantic segmentation (topics/chapters)       │   │
│  └──────────────────────────────────────────────────────┘   │
│                          │                                  │
│                          ▼                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │  Embedding   │  │   Vector     │  │    Metadata      │   │
│  │  per segment │  │   Store      │  │  (speaker, time) │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
│                          │                                  │
│                          ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │          Retrieval + Generation with source          │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
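Each stage in this pipeline enriches the same segment record. A minimal sketch of the data shapes involved (the field names are illustrative, not a fixed schema):

```python
# Illustrative shape of one segment as it moves through the pipeline.
# Field names are examples, not a fixed schema.

# Transcription produces timestamped text
transcribed = {"start": 0.0, "end": 4.2, "text": "Welcome everyone"}

# Diarization attaches a speaker label to the segment
diarized = {**transcribed, "speaker": "SPEAKER_00"}

# Semantic segmentation groups segments under a topic
segmented = {**diarized, "topic": "Introduction"}

# The final record carries everything needed for retrieval with sources
print(segmented)
```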
Transcription: The Foundation
STT Model Comparison
| Model | Accuracy | Languages | Cost | Latency | Open source |
|---|---|---|---|---|---|
| Whisper Large v3 | 95%+ | 99 | $0 (local) | Slow | Yes |
| Whisper API | 95%+ | 99 | $0.006/min | Fast | No |
| AssemblyAI | 97%+ | 12 | $0.01/min | Fast | No |
| Deepgram | 96%+ | 36 | $0.0043/min | Real-time | No |
| Google STT | 95%+ | 125+ | $0.006/min | Fast | No |
Whisper: The Recommended Choice
OpenAI's Whisper offers the best value, especially when self-hosted.
```python
import whisper


class AudioTranscriber:
    def __init__(self, model_size: str = "large-v3"):
        """
        Available models: tiny, base, small, medium, large, large-v3
        VRAM required: tiny=1GB, base=1GB, small=2GB, medium=5GB, large=10GB
        """
        self.model = whisper.load_model(model_size)

    def transcribe(
        self,
        audio_path: str,
        language: str = None,
        word_timestamps: bool = True
    ) -> dict:
        """Transcribe an audio file with timestamps."""
        result = self.model.transcribe(
            audio_path,
            language=language,
            word_timestamps=word_timestamps,
            verbose=False
        )

        return {
            "text": result["text"],
            "segments": result["segments"],
            "language": result["language"],
            "duration": result["segments"][-1]["end"] if result["segments"] else 0
        }

    def transcribe_with_chunks(
        self,
        audio_path: str,
        chunk_duration: int = 300  # 5 minutes
    ) -> list[dict]:
        """
        Transcribe in chunks for long audio files.
        Avoids memory issues and improves accuracy.
        """
        from pydub import AudioSegment

        audio = AudioSegment.from_file(audio_path)
        duration_ms = len(audio)
        chunk_ms = chunk_duration * 1000

        chunks = []
        for i, start in enumerate(range(0, duration_ms, chunk_ms)):
            end = min(start + chunk_ms, duration_ms)
            chunk = audio[start:end]

            # Export temporarily
            temp_path = f"/tmp/chunk_{i}.wav"
            chunk.export(temp_path, format="wav")

            # Transcribe
            result = self.transcribe(temp_path)

            # Adjust timestamps relative to the full file
            for seg in result["segments"]:
                seg["start"] += start / 1000
                seg["end"] += start / 1000

            chunks.append({
                "chunk_index": i,
                "start_time": start / 1000,
                "end_time": end / 1000,
                **result
            })

        return chunks
```
Whisper API (Simpler)
```python
from openai import OpenAI


def transcribe_with_api(audio_path: str) -> dict:
    """Transcription via the OpenAI Whisper API."""
    client = OpenAI()

    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"]
        )

    return {
        "text": transcript.text,
        "segments": transcript.segments,
        "words": transcript.words,
        "language": transcript.language,
        "duration": transcript.duration
    }
```
Diarization: Identifying Speakers
Diarization answers the question "who is speaking, and when?", which is essential for multi-participant meetings.
Pyannote: The Open Source Standard
```python
from pyannote.audio import Pipeline
import torch


class SpeakerDiarizer:
    def __init__(self, hf_token: str):
        """
        Requires a HuggingFace token with access to the
        pyannote/speaker-diarization-3.1 model.
        """
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=hf_token
        )
        if torch.cuda.is_available():
            self.pipeline.to(torch.device("cuda"))

    def diarize(self, audio_path: str, num_speakers: int = None) -> list[dict]:
        """
        Identify speakers in audio.

        Args:
            audio_path: Path to audio file
            num_speakers: Number of speakers (optional, auto-detected otherwise)
        """
        diarization = self.pipeline(audio_path, num_speakers=num_speakers)

        segments = []
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({
                "speaker": speaker,
                "start": turn.start,
                "end": turn.end,
                "duration": turn.end - turn.start
            })

        return segments

    def merge_transcription_diarization(
        self,
        transcription: dict,
        diarization: list[dict]
    ) -> list[dict]:
        """Merge transcription and diarization."""
        merged = []

        for trans_seg in transcription["segments"]:
            # Find the speaker who talks most during this segment
            seg_start = trans_seg["start"]
            seg_end = trans_seg["end"]

            speaker_times = {}
            for diar_seg in diarization:
                overlap_start = max(seg_start, diar_seg["start"])
                overlap_end = min(seg_end, diar_seg["end"])
                if overlap_start < overlap_end:
                    overlap = overlap_end - overlap_start
                    speaker = diar_seg["speaker"]
                    speaker_times[speaker] = speaker_times.get(speaker, 0) + overlap

            # Assign the majority speaker
            speaker = max(speaker_times, key=speaker_times.get) if speaker_times else "UNKNOWN"

            merged.append({
                "speaker": speaker,
                "start": seg_start,
                "end": seg_end,
                "text": trans_seg["text"]
            })

        return merged
```
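To make the merge concrete, here is the overlap-weighting step re-implemented standalone on toy data (illustrative only; the real logic lives in `merge_transcription_diarization` above):

```python
def majority_speaker(seg_start: float, seg_end: float, diarization: list[dict]) -> str:
    """Return the speaker with the most overlap with [seg_start, seg_end]."""
    times = {}
    for d in diarization:
        overlap = min(seg_end, d["end"]) - max(seg_start, d["start"])
        if overlap > 0:
            times[d["speaker"]] = times.get(d["speaker"], 0) + overlap
    return max(times, key=times.get) if times else "UNKNOWN"


diarization = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 3.0},
    {"speaker": "SPEAKER_01", "start": 3.0, "end": 6.0},
]

# The segment 2.0-5.0s overlaps SPEAKER_00 for 1s and SPEAKER_01 for 2s
majority_speaker(2.0, 5.0, diarization)  # → "SPEAKER_01"
```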
Semantic Segmentation
Split the transcription into coherent topics/chapters for better retrieval.
```python
from openai import OpenAI
import json


def segment_transcript_by_topics(
    transcript_segments: list[dict],
    client: OpenAI
) -> list[dict]:
    """Segment a transcription into thematic topics."""
    # Format the transcription
    formatted = "\n".join([
        f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg.get('speaker', 'Speaker')}: {seg['text']}"
        for seg in transcript_segments
    ])

    prompt = f"""Analyze this transcription and identify the different topics discussed.

For each topic, provide:
1. Topic title (short, descriptive)
2. Start timestamp (in seconds)
3. End timestamp (in seconds)
4. Summary in 1-2 sentences

Transcription:
{formatted}

Respond in JSON format:
{{"topics": [{{"title": "...", "start": 0.0, "end": 120.0, "summary": "..."}}, ...]}}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    # JSON mode returns an object, so the topic list is wrapped in a key
    topics = json.loads(response.choices[0].message.content)["topics"]

    # Enrich each topic with the corresponding text
    for topic in topics:
        topic_text = []
        for seg in transcript_segments:
            if seg["start"] >= topic["start"] and seg["end"] <= topic["end"]:
                topic_text.append(seg["text"])
        topic["full_text"] = " ".join(topic_text)

    return topics
```
RAG Indexing
Recommended Data Structure
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioChunk:
    """Represents an indexable audio segment."""
    chunk_id: str
    audio_source_id: str
    audio_source_title: str

    # Content
    text: str
    speaker: Optional[str]

    # Temporal
    start_time: float
    end_time: float

    # Context
    topic: Optional[str]
    topic_summary: Optional[str]

    # Metadata
    language: str
    confidence: float
    created_at: str

    def to_indexable_text(self) -> str:
        """Enriched text for embedding."""
        parts = []
        if self.topic:
            parts.append(f"Topic: {self.topic}")
        if self.speaker:
            parts.append(f"Speaker: {self.speaker}")
        parts.append(self.text)
        return "\n".join(parts)
```
Complete Indexing Pipeline
```python
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
from openai import OpenAI
import hashlib
from datetime import datetime


class AudioRAGIndexer:
    def __init__(self):
        self.qdrant = QdrantClient(url="http://localhost:6333")
        self.openai = OpenAI()
        self.transcriber = AudioTranscriber()
        self.diarizer = SpeakerDiarizer(hf_token="...")
        self.collection_name = "audio_rag"

    def create_collection(self):
        """Create the Qdrant collection."""
        self.qdrant.recreate_collection(
            collection_name=self.collection_name,
            vectors_config=VectorParams(
                size=1536,  # text-embedding-3-small
                distance=Distance.COSINE
            )
        )

    def process_audio(
        self,
        audio_path: str,
        title: str,
        num_speakers: int = None
    ) -> list[AudioChunk]:
        """Complete audio processing pipeline."""
        # 1. Transcription
        print("Transcribing...")
        transcription = self.transcriber.transcribe(audio_path)

        # 2. Diarization
        print("Diarizing...")
        diarization = self.diarizer.diarize(audio_path, num_speakers)

        # 3. Merge
        merged = self.diarizer.merge_transcription_diarization(
            transcription, diarization
        )

        # 4. Topic segmentation
        print("Segmenting by topics...")
        topics = segment_transcript_by_topics(merged, self.openai)

        # 5. Create chunks
        chunks = []
        source_id = hashlib.md5(audio_path.encode()).hexdigest()

        for topic in topics:
            chunk = AudioChunk(
                chunk_id=f"{source_id}_{topic['start']}",
                audio_source_id=source_id,
                audio_source_title=title,
                text=topic["full_text"],
                speaker=None,  # Multiple speakers within a topic
                start_time=topic["start"],
                end_time=topic["end"],
                topic=topic["title"],
                topic_summary=topic["summary"],
                language=transcription["language"],
                confidence=0.95,
                created_at=datetime.now().isoformat()
            )
            chunks.append(chunk)

        return chunks

    def index_chunks(self, chunks: list[AudioChunk]):
        """Index chunks in Qdrant."""
        points = []

        for chunk in chunks:
            # Generate embedding
            text = chunk.to_indexable_text()
            response = self.openai.embeddings.create(
                model="text-embedding-3-small",
                input=text
            )
            embedding = response.data[0].embedding

            point = PointStruct(
                id=hash(chunk.chunk_id) % (2**63),
                vector=embedding,
                payload={
                    "chunk_id": chunk.chunk_id,
                    "audio_source_id": chunk.audio_source_id,
                    "audio_source_title": chunk.audio_source_title,
                    "text": chunk.text,
                    "speaker": chunk.speaker,
                    "start_time": chunk.start_time,
                    "end_time": chunk.end_time,
                    "topic": chunk.topic,
                    "topic_summary": chunk.topic_summary,
                    "language": chunk.language
                }
            )
            points.append(point)

        self.qdrant.upsert(
            collection_name=self.collection_name,
            points=points
        )
        print(f"Indexed {len(points)} chunks")
```
Retrieval and Generation
Search with Temporal Context
```python
def search_audio_rag(
    query: str,
    indexer: AudioRAGIndexer,
    limit: int = 5
) -> list[dict]:
    """Search transcriptions with context."""
    # Embed the query
    response = indexer.openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = response.data[0].embedding

    # Vector search
    results = indexer.qdrant.search(
        collection_name=indexer.collection_name,
        query_vector=query_embedding,
        limit=limit
    )

    return [
        {
            "text": r.payload["text"],
            "source": r.payload["audio_source_title"],
            "topic": r.payload["topic"],
            "timestamp": f"{r.payload['start_time']:.0f}s - {r.payload['end_time']:.0f}s",
            "score": r.score
        }
        for r in results
    ]
```
Generation with Audio Source Citation
```python
def generate_answer_with_audio_sources(
    query: str,
    retrieved_chunks: list[dict],
    client: OpenAI
) -> str:
    """Generate a response citing audio sources."""
    context = "\n\n".join([
        f"**Source: {c['source']}** (Topic: {c['topic']}, {c['timestamp']})\n{c['text']}"
        for c in retrieved_chunks
    ])

    prompt = f"""You are an assistant that answers questions based on audio transcriptions.

Context (transcription excerpts):
{context}

Question: {query}

Instructions:
1. Answer based only on the provided transcriptions
2. Cite your sources with the format [Source: title, timestamp]
3. If the information is not in the transcriptions, say so clearly
4. Be concise but precise"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1000
    )

    return response.choices[0].message.content
```
Advanced Optimizations
Automatic Meeting Summary
```python
from openai import OpenAI
import json


def summarize_meeting(
    transcript_with_speakers: list[dict],
    client: OpenAI
) -> dict:
    """Generate a structured meeting summary."""
    formatted = "\n".join([
        f"{seg['speaker']}: {seg['text']}"
        for seg in transcript_with_speakers
    ])

    prompt = f"""Analyze this meeting transcription and generate a structured summary.

Transcription:
{formatted}

Generate a JSON with:
{{
  "title": "Suggested title for this meeting",
  "participants": ["List of identified participants"],
  "duration_minutes": X,
  "key_points": ["Key point 1", "Key point 2", ...],
  "decisions": ["Decision 1", "Decision 2", ...],
  "action_items": [
    {{"assignee": "Name", "task": "Description", "deadline": "if mentioned"}}
  ],
  "next_steps": ["Next step 1", ...],
  "summary": "Summary in 2-3 paragraphs"
}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)
```
Key Moment Detection
```python
import json

from openai import OpenAI


def detect_key_moments(
    transcript_segments: list[dict],
    client: OpenAI
) -> list[dict]:
    """Identify important moments in audio."""
    formatted = "\n".join([
        f"[{seg['start']:.0f}s] {seg.get('speaker', 'Speaker')}: {seg['text']}"
        for seg in transcript_segments
    ])

    prompt = f"""Identify key moments in this transcription:
- Important questions asked
- Decisions made
- Disagreements or debates
- Critical information shared
- Moments of humor or tension

Transcription:
{formatted}

For each key moment, provide:
- timestamp (in seconds)
- type (question/decision/debate/info/other)
- short description
- importance (1-5)

Respond in JSON:
{{"moments": [{{"timestamp": X, "type": "...", "description": "...", "importance": X}}, ...]}}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    # JSON mode returns an object, so the list is wrapped in a key
    return json.loads(response.choices[0].message.content)["moments"]
```
Costs and Performance
Transcription Costs
| Solution | Cost/hour | Latency | Accuracy |
|---|---|---|---|
| Whisper local (GPU) | ~$0.10 (electricity) | 10-30min | 95%+ |
| Whisper API | $0.36 | 2-5min | 95%+ |
| AssemblyAI | $0.60 | 5-10min | 97%+ |
| Deepgram | $0.26 | Real-time | 96%+ |
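A quick back-of-the-envelope helper to compare these prices at your own volume (rates taken from the table above; they change, so treat them as indicative):

```python
# Approximate per-hour transcription prices from the table above.
# Rates evolve; verify current pricing before budgeting.
PRICE_PER_HOUR = {
    "whisper_local": 0.10,  # rough GPU electricity cost
    "whisper_api": 0.36,    # $0.006/min * 60
    "assemblyai": 0.60,
    "deepgram": 0.26,
}


def monthly_cost(hours_per_week: float, provider: str) -> float:
    """Estimated monthly spend (~4.33 weeks/month) for a weekly audio volume."""
    return round(hours_per_week * 4.33 * PRICE_PER_HOUR[provider], 2)


# e.g. a company producing 50 hours of audio per week, via the Whisper API:
monthly_cost(50, "whisper_api")  # → 77.94
```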
Diarization Costs
- Pyannote local: $0 (but GPU required)
- AssemblyAI: Included
- AWS Transcribe: +$0.024/min
Estimated Storage
- 1 hour audio = ~15,000 words = ~100 chunks
- Embeddings: ~600KB/hour
- Metadata: ~50KB/hour
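These figures follow directly from the embedding dimensions. A minimal sanity check, assuming float32 storage of 1536-dimensional text-embedding-3-small vectors:

```python
def embedding_storage_bytes(chunks: int, dims: int = 1536, bytes_per_float: int = 4) -> int:
    """Raw vector storage only, ignoring index overhead and metadata."""
    return chunks * dims * bytes_per_float


# ~100 chunks per hour of audio:
embedding_storage_bytes(100)  # → 614400 bytes, i.e. roughly 600 KB/hour
```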
Integration with Ailog
Ailog simplifies Audio RAG with native integration:
- Audio upload: Supported formats: MP3, WAV, M4A, WEBM
- Automatic transcription: Built-in Whisper
- Intelligent indexing: Topic segmentation
- Unified search: Audio + text + images in one query