Parse PDF Documents with PyMuPDF
Master PDF parsing: extract text, images, tables, and metadata from PDFs using PyMuPDF and alternatives.
- Author
- Ailog Research Team
- Published
- Reading time
- 10 min read
- Level
- intermediate
- RAG Pipeline Step
- Parsing
Why PDFs Are Challenging
PDFs are designed for display, not data extraction. Text can be:

- Images (scanned documents)
- Embedded fonts with custom encodings
- Complex layouts (multi-column, tables)
- Protected or encrypted
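A practical consequence is that a parser should probe the document before committing to a strategy. Here is a minimal decision sketch; the helper name and the character threshold are illustrative assumptions, not part of any library:

```python
def choose_strategy(is_encrypted, extracted_chars, has_tables):
    """Pick a parsing strategy from simple probes of the document.

    is_encrypted: whether the PDF requires a password
    extracted_chars: characters recovered by a plain text-layer pass
    has_tables: whether a table detector found candidate tables
    """
    if is_encrypted:
        return "decrypt-first"
    if extracted_chars < 50:  # illustrative threshold: likely a scanned PDF
        return "ocr"
    if has_tables:
        return "table-aware"
    return "text-layer"
```

Each branch maps to one of the techniques covered below.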
PyMuPDF: The Gold Standard (2025)
PyMuPDF (imported as fitz) is among the fastest and most reliable PDF parsers available.
```python
import fitz  # PyMuPDF

def extract_pdf_content(pdf_path):
    doc = fitz.open(pdf_path)

    full_text = ""
    metadata = {
        "title": doc.metadata.get("title", ""),
        "author": doc.metadata.get("author", ""),
        "pages": len(doc),
    }

    for page_num, page in enumerate(doc):
        # Extract text with layout preservation
        text = page.get_text("text")
        full_text += f"\n\n--- Page {page_num + 1} ---\n{text}"

    return full_text, metadata
```
Extract Images from PDFs
```python
import fitz  # PyMuPDF

def extract_images(pdf_path):
    doc = fitz.open(pdf_path)
    images = []

    for page_num, page in enumerate(doc):
        image_list = page.get_images()

        for img_index, img in enumerate(image_list):
            xref = img[0]
            base_image = doc.extract_image(xref)

            images.append({
                "page": page_num + 1,
                "image_data": base_image["image"],
                "extension": base_image["ext"],
            })

    return images
```
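The dicts returned above carry the raw image bytes plus a file extension, so persisting them is just a matter of writing files. A small sketch; the filename pattern is an illustrative choice:

```python
from pathlib import Path

def save_images(images, out_dir):
    """Write image dicts (as produced by extract_images) to out_dir.

    Returns the list of written file paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, img in enumerate(images):
        # e.g. page3_img0.png
        path = out / f"page{img['page']}_img{i}.{img['extension']}"
        path.write_bytes(img["image_data"])
        paths.append(path)
    return paths
```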
Handle Scanned PDFs (OCR)
For image-based PDFs, use Tesseract OCR:
```python
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def ocr_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""

    for page in doc:
        # Render the page to an image
        pix = page.get_pixmap()
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

        # OCR the image
        page_text = pytesseract.image_to_string(img)
        text += page_text + "\n\n"

    return text
```
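Because OCR is far slower than reading an existing text layer, it pays to run it only on pages that need it. A simple heuristic, assuming the page's text has already been extracted; the character cutoff is an arbitrary assumption to tune for your corpus:

```python
def needs_ocr(page_text, min_chars=10):
    """Heuristic: treat a page as scanned if its text layer is near-empty.

    min_chars is an arbitrary cutoff; whitespace-only layers count as empty.
    """
    return len(page_text.strip()) < min_chars
```

In a pipeline, you would call this per page and fall back to the OCR path above only when it returns True.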
Extract Tables
Use pdfplumber for table extraction:
```python
import pdfplumber

def extract_tables(pdf_path):
    tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)

    return tables
```
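pdfplumber returns each table as a list of rows (lists of cell strings, possibly None). For RAG, rendering a table as Markdown keeps its structure visible to the LLM. A sketch that treats the first row as the header:

```python
def table_to_markdown(table):
    """Render a pdfplumber-style table (list of rows) as a Markdown table.

    None cells become empty strings; the first row becomes the header.
    """
    rows = [[("" if cell is None else str(cell)) for cell in row] for row in table]
    header = "| " + " | ".join(rows[0]) + " |"
    divider = "| " + " | ".join("---" for _ in rows[0]) + " |"
    body = ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join([header, divider] + body)
```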
Layout-Aware Extraction
Preserve document structure:
```python
import fitz  # PyMuPDF

def extract_with_layout(pdf_path):
    doc = fitz.open(pdf_path)
    structured_content = []

    for page in doc:
        blocks = page.get_text("dict")["blocks"]

        for block in blocks:
            if block["type"] == 0:  # Text block
                text = ""
                for line in block["lines"]:
                    for span in line["spans"]:
                        text += span["text"]

                structured_content.append({
                    "type": "text",
                    "content": text,
                    "bbox": block["bbox"],  # Position on the page
                    "page": page.number,
                })

    return structured_content
```
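The page numbers captured in these block dicts are what later lets a RAG answer cite its source. A sketch of turning them into page-labelled passages; the helper and the `[p. N]` label format are illustrative assumptions:

```python
def blocks_to_passages(blocks):
    """Concatenate text blocks per page, prefixing each passage with its page.

    Expects dicts like those built by extract_with_layout. fitz page numbers
    are 0-based, so the label is shifted to be human-readable (1-based).
    """
    pages = {}
    for block in blocks:
        if block["type"] == "text":
            pages.setdefault(block["page"], []).append(block["content"])
    return [
        f"[p. {page + 1}] " + " ".join(parts)
        for page, parts in sorted(pages.items())
    ]
```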
Performance Optimization
Parallel processing can speed up large PDFs. Note that PyMuPDF is not fully thread-safe, so for heavy workloads consider opening a separate document handle per worker:
```python
import fitz  # PyMuPDF
from concurrent.futures import ThreadPoolExecutor

def parse_pdf_parallel(pdf_path, num_workers=4):
    doc = fitz.open(pdf_path)

    def process_page(page_num):
        page = doc[page_num]
        return page.get_text()

    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        texts = list(executor.map(process_page, range(len(doc))))

    return "\n\n".join(texts)
```
Alternative Tools (2025)

- PyPDF2: Simple but slower
- pdfplumber: Excellent for tables
- Camelot: Specialized table extraction
- Adobe PDF Extract API: Commercial, very accurate
Best Practices

- Test with sample documents first
- Handle encrypted PDFs gracefully
- Preserve page numbers for citations
- Extract metadata (title, author, date)
- Use OCR only when needed (slower)
- Cache parsed results to avoid re-processing
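The last practice, caching, can be as simple as keying parsed output by a hash of the file bytes, so a re-run skips unchanged documents. A stdlib-only sketch; the cache directory layout is an illustrative choice:

```python
import hashlib
import json
from pathlib import Path

def parse_with_cache(pdf_path, parse_fn, cache_dir=".pdf_cache"):
    """Return parse_fn(pdf_path), reusing a cached result when the file
    bytes are unchanged. Entries are JSON files named by SHA-256 digest."""
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["text"]
    text = parse_fn(pdf_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({"text": text}))
    return text
```

Any of the extraction functions above can be passed as `parse_fn`, as long as it returns a JSON-serializable string.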
PDFs are the backbone of enterprise knowledge. Master PDF parsing to unlock your RAG potential.