Parse PDF Documents with PyMuPDF

November 5, 2025
Ailog Research Team

Master PDF parsing: extract text, images, tables, and metadata from PDFs using PyMuPDF and alternatives.

Why PDFs Are Challenging

PDFs are designed for display, not data extraction. Extraction gets harder when a document has:

  • Scanned pages, where the text exists only as images
  • Embedded fonts with custom encodings
  • Complex layouts (multi-column, tables)
  • Password protection or encryption

PyMuPDF: The Gold Standard (2025)

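PyMuPDF (imported as fitz; recent releases also accept import pymupdf) is one of the fastest Python PDF libraries and covers text, image, and metadata extraction through a single API.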

```python
import fitz  # PyMuPDF


def extract_pdf_content(pdf_path):
    # The context manager closes the file even if extraction fails
    with fitz.open(pdf_path) as doc:
        metadata = {
            "title": doc.metadata.get("title", ""),
            "author": doc.metadata.get("author", ""),
            "pages": len(doc),
        }
        parts = []
        for page_num, page in enumerate(doc, start=1):
            # "text" mode returns plain text; the order follows the file's
            # content stream, which is not necessarily reading order
            text = page.get_text("text")
            parts.append(f"\n\n--- Page {page_num} ---\n{text}")
    return "".join(parts), metadata
```
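
If per-page text is needed later (for example, to cite a page number), the `--- Page N ---` markers written by extract_pdf_content can be split back apart. The helper below is a minimal sketch; the name split_pages is hypothetical, and it assumes no page contains a line that looks exactly like a marker.

```python
import re

# Matches the "--- Page N ---" separator lines written by extract_pdf_content
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)


def split_pages(full_text):
    """Map 1-based page numbers to the text that follows each marker."""
    pages = {}
    markers = list(PAGE_MARKER.finditer(full_text))
    for i, match in enumerate(markers):
        start = match.end()
        # A page's text runs until the next marker, or to the end of the string
        end = markers[i + 1].start() if i + 1 < len(markers) else len(full_text)
        pages[int(match.group(1))] = full_text[start:end].strip()
    return pages
```

Keeping pages separate from the start avoids this round trip, but the sketch is handy when only the concatenated string has been stored.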

Extract Images from PDFs

```python
import fitz  # PyMuPDF


def extract_images(pdf_path):
    images = []
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            for img in page.get_images():
                xref = img[0]  # cross-reference number of the image object
                base_image = doc.extract_image(xref)
                images.append({
                    "page": page_num,
                    "image_data": base_image["image"],  # raw image bytes
                    "extension": base_image["ext"],  # e.g. "png" or "jpeg"
                })
    return images
```
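
The dictionaries returned by extract_images hold raw bytes, so they still need to be written somewhere. The sketch below shows one way to do that with the standard library; the function name save_images and the file naming scheme are illustrative assumptions, not part of PyMuPDF.

```python
from pathlib import Path


def save_images(images, out_dir):
    """Write each extracted image to out_dir and return the created paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    counters = {}  # number of images seen so far, per page
    saved = []
    for image in images:
        page = image["page"]
        counters[page] = counters.get(page, 0) + 1
        # Hypothetical naming scheme: page003_img01.png
        name = f"page{page:03d}_img{counters[page]:02d}.{image['extension']}"
        target = out_path / name
        target.write_bytes(image["image_data"])
        saved.append(target)
    return saved
```

A call such as save_images(extract_images("report.pdf"), "images") would then place every image in an images folder, where report.pdf stands in for your own file.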

Handle Scanned PDFs (OCR)

For image-based PDFs, use Tesseract OCR:

```python
import fitz  # PyMuPDF
import pytesseract  # requires the Tesseract binary to be installed
from PIL import Image


def ocr_pdf(pdf_path):
    page_texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Render at 300 DPI; the 72 DPI default is usually too coarse for OCR
            pix = page.get_pixmap(dpi=300)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            # OCR the rendered page image
            page_texts.append(pytesseract.image_to_string(img))
    return "\n\n".join(page_texts)
```
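
OCR is much slower than reading an existing text layer, so it can help to decide page by page whether it is needed. The heuristic below is only a sketch: the function name needs_ocr and the min_chars threshold are assumptions to tune on your own documents.

```python
def needs_ocr(page_text, image_count, min_chars=20):
    """Guess whether a page lacks a usable text layer.

    min_chars is an assumed threshold, not a PyMuPDF setting.
    """
    has_text = len(page_text.strip()) >= min_chars
    # Almost no extractable text plus at least one image suggests a scan
    return not has_text and image_count > 0
```

With PyMuPDF, the inputs could come from page.get_text() and len(page.get_images()), so only pages flagged by the check are rendered and sent to Tesseract.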

Extract Tables

Use pdfplumber for table extraction:

```python
import pdfplumber


def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Each table is a list of rows; each row is a list of cell values
            tables.extend(page.extract_tables())
    return tables
```
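
Each table comes back as a list of rows, which is awkward to index or serialize. Assuming the first row holds the column headers (often true, but not guaranteed), a table can be turned into a list of dictionaries. The helper name table_to_records is hypothetical.

```python
def table_to_records(table):
    """Convert one extracted table (a list of rows) into a list of dicts.

    Assumes the first row holds the column headers.
    """
    if not table:
        return []

    def clean(cell):
        # Empty cells may come back as None; collapse line breaks inside a cell
        return " ".join(cell.split()) if cell else ""

    # Fall back to a positional name when a header cell is empty
    header = [clean(cell) or f"column_{i}" for i, cell in enumerate(table[0])]
    return [dict(zip(header, (clean(cell) for cell in row))) for row in table[1:]]
```

Records in this shape are straightforward to dump as JSON or to render as text rows for indexing.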

Layout-Aware Extraction

Preserve document structure:

```python
import fitz  # PyMuPDF


def extract_with_layout(pdf_path):
    structured_content = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                if block["type"] != 0:  # 0 = text block, 1 = image block
                    continue
                # Join spans within a line, then keep one line per row
                lines = ["".join(span["text"] for span in line["spans"])
                         for line in block["lines"]]
                structured_content.append({
                    "type": "text",
                    "content": "\n".join(lines),
                    "bbox": block["bbox"],  # (x0, y0, x1, y1) position
                    "page": page.number + 1,  # page.number is 0-based
                })
    return structured_content
```
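
Blocks are not guaranteed to arrive in reading order, so the bbox values are useful for sorting. The sketch below orders blocks by page, then top to bottom, then left to right. The name sort_reading_order and the line_tolerance value are assumptions, and genuinely multi-column pages need column detection on top of this.

```python
def sort_reading_order(blocks, line_tolerance=5):
    """Sort block dicts by page, then top to bottom, then left to right.

    line_tolerance is an assumed value in points: blocks whose top edges
    fall into the same band of that height are treated as one row.
    """
    def key(block):
        x0, y0, _x1, _y1 = block["bbox"]
        return (block["page"], round(y0 / line_tolerance), x0)

    return sorted(blocks, key=key)
```

The function expects the dictionaries produced by extract_with_layout, with "bbox" and "page" keys.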

Performance Optimization

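Parallel processing for large PDFs. PyMuPDF does not support multithreaded use, so run worker processes instead and let each one open its own copy of the document:

```python
from concurrent.futures import ProcessPoolExecutor

import fitz  # PyMuPDF


def _extract_range(job):
    # Each worker opens its own Document; PyMuPDF objects must not be shared
    pdf_path, start, stop = job
    with fitz.open(pdf_path) as doc:
        return [doc[i].get_text() for i in range(start, stop)]


def parse_pdf_parallel(pdf_path, num_workers=4):
    with fitz.open(pdf_path) as doc:
        page_count = len(doc)
    # One contiguous page range per worker
    step = max(1, -(-page_count // num_workers))  # ceiling division
    jobs = [(pdf_path, start, min(start + step, page_count))
            for start in range(0, page_count, step)]
    # On Windows and macOS, call this from under `if __name__ == "__main__":`
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        chunks = executor.map(_extract_range, jobs)
        return "\n\n".join(text for chunk in chunks for text in chunk)
```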

Alternative Tools (2025)

  • pypdf (successor to the now-deprecated PyPDF2): Pure Python and simple, but slower
  • pdfplumber: Excellent for tables
  • Camelot: Specialized table extraction
  • Adobe PDF Extract API: Commercial, very accurate

Best Practices

  1. Test with sample documents first
  2. Handle encrypted PDFs gracefully
  3. Preserve page numbers for citations
  4. Extract metadata (title, author, date)
  5. Use OCR only when needed (slower)
  6. Cache parsed results to avoid re-processing
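
The caching advice in item 6 can be sketched with the standard library alone by keying results on a hash of the file contents, so a renamed but unchanged PDF still hits the cache. The names file_digest and parse_with_cache, and the .pdf_cache directory, are illustrative assumptions.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of the file contents, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_with_cache(pdf_path, parse_fn, cache_dir=".pdf_cache"):
    """Return parse_fn(pdf_path), reusing a JSON result cached by content hash.

    parse_fn is any parser whose result is JSON-serializable.
    """
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    entry = cache_path / f"{file_digest(pdf_path)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))
    result = parse_fn(pdf_path)
    entry.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Because results go through JSON, tuples come back as lists on a cache hit, and parsers that return raw bytes (such as image extraction) would need a different storage format.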

PDFs are the backbone of enterprise knowledge. Master PDF parsing to unlock your RAG potential.
