OCR for Scanned Documents and Images
November 22, 2025
9 min read
Ailog Research Team
Extract text from scanned PDFs and images using Tesseract, AWS Textract, and modern OCR techniques.
When You Need OCR
Digital PDFs carry an extractable text layer. Scanned documents don't: each page is just an image, so the text has to be recognized before it can be indexed or searched.
OCR converts:
- Scanned contracts
- Old books
- Receipts
- Screenshots
- Handwritten notes
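One way to decide whether a page needs OCR at all is to check how much text a normal extraction pass returned. The sketch below assumes you have already pulled per-page text with whatever PDF library you use. The function name `pages_needing_ocr` and the `min_chars` cutoff are hypothetical, chosen for illustration, and should be tuned on your own documents.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return indices of pages whose extracted text is too short to trust.

    page_texts: list of strings, one per page, from a regular text extractor.
    min_chars:  assumed cutoff; pages with fewer visible characters are
                treated as scanned images.
    """
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Count only non-whitespace characters
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            needs_ocr.append(index)
    return needs_ocr


page_texts = [
    "This page has a real text layer with plenty of characters.",
    "",        # scanned page: nothing extracted
    "  \n ",   # scanned page: whitespace only
]
print(pages_needing_ocr(page_texts))  # [1, 2]
```

Only the pages flagged here need to be rendered to images and sent through one of the OCR engines below.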
Tesseract (Free, Open-Source)
```python
from PIL import Image
import pytesseract

def ocr_image(image_path):
    """Return all text Tesseract finds in the image."""
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, lang='eng')
    return text

# With confidence scores
image = Image.open('scan.jpg')
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for word, confidence in zip(data['text'], data['conf']):
    # conf is -1 for non-word boxes and may arrive as a string
    if word.strip() and float(confidence) > 60:  # Filter low-confidence
        print(word)
```
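Printing words one at a time loses the line structure. As a minimal sketch, the dictionary returned by `image_to_data` can be regrouped into lines using its `block_num`, `par_num`, and `line_num` fields. The helper name `confident_lines` and the sample dictionary are made up for illustration; the sample only mimics the shape of the real output.

```python
from collections import defaultdict


def confident_lines(data, min_conf=60.0):
    """Rebuild text lines from image_to_data output, dropping weak words."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # layout boxes carry empty text
        if float(data["conf"][i]) <= min_conf:
            continue  # same "> 60" rule as the loop above
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    # Sorting the keys restores block, paragraph, line order
    return [" ".join(words) for _, words in sorted(lines.items())]


# Hand-written sample shaped like pytesseract's DICT output
sample = {
    "text": ["", "Invoice", "Total", "4l.00", "Due"],
    "conf": ["-1", 96, 91, 32, 88],
    "block_num": [1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 1, 2],
}
print(confident_lines(sample))  # ['Invoice Total', 'Due']
```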
Preprocessing for Better Accuracy
```python
import cv2
import pytesseract

def preprocess_image(image_path):
    img = cv2.imread(image_path)
    if img is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Remove noise
    denoised = cv2.fastNlMeansDenoising(gray)

    # Threshold (binarize); Otsu picks the cutoff automatically
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

# Use preprocessed image
preprocessed = preprocess_image('scan.jpg')
text = pytesseract.image_to_string(preprocessed)
```
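To make the thresholding step less of a black box, here is a pure-Python sketch of the idea behind Otsu's method: try every cutoff from 0 to 255 and keep the one that best separates dark pixels from light ones (maximum between-class variance). This illustrates the technique and is not OpenCV's implementation; `otsu_threshold` and `binarize` are hypothetical helper names.

```python
def otsu_threshold(pixels):
    """Return the 0-255 cutoff that maximizes between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    background_count = 0
    background_sum = 0
    for level in range(256):
        # Pixels <= level form the "background" class
        background_count += histogram[level]
        background_sum += level * histogram[level]
        if background_count == 0:
            continue
        foreground_count = total - background_count
        if foreground_count == 0:
            break
        mean_bg = background_sum / background_count
        mean_fg = (total_sum - background_sum) / foreground_count
        variance = background_count * foreground_count * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]


# Synthetic "scan": dark ink around 10, light paper around 200
pixels = [10] * 50 + [200] * 50
cutoff = otsu_threshold(pixels)
print(binarize([10, 200], cutoff))  # [0, 255]
```

In practice `cv2.threshold` with `cv2.THRESH_OTSU` does this on the whole image array and is far faster; the sketch is only meant to show what the flag computes.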
AWS Textract (Commercial, Best Quality)
```python
import boto3

textract = boto3.client('textract', region_name='us-east-1')

def extract_with_textract(image_path):
    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    response = textract.detect_document_text(
        Document={'Bytes': image_bytes}
    )

    text = ""
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            text += block['Text'] + '\n'
    return text

# For tables and forms
def analyze_with_textract(image_path):
    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    return textract.analyze_document(
        Document={'Bytes': image_bytes},
        FeatureTypes=['TABLES', 'FORMS']
    )
```
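The `analyze_document` response is a flat list of blocks linked by IDs, so form fields need a little assembly. The sketch below pairs keys with values by following `VALUE` and `CHILD` relationships on `KEY_VALUE_SET` blocks. The helper names are hypothetical, and the sample response is hand-written and heavily trimmed (real blocks also carry geometry, confidence, and page data), so treat it as a starting point rather than a complete parser.

```python
def _child_text(block, blocks_by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Return {key text: value text} from a FORMS analysis response."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        key_text = _child_text(block, blocks_by_id)
        value_text = ""
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                value_text = " ".join(
                    _child_text(blocks_by_id[value_id], blocks_by_id)
                    for value_id in relationship["Ids"]
                )
        pairs[key_text] = value_text
    return pairs


# Trimmed, hand-written sample in the shape of an analyze_document response
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Ada"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Lovelace"},
    ]
}
print(extract_key_values(sample_response))  # {'Name:': 'Ada Lovelace'}
```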
Google Cloud Vision
```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def ocr_with_google(image_path):
    with open(image_path, 'rb') as f:
        content = f.read()

    image = vision.Image(content=content)
    response = client.document_text_detection(image=image)
    return response.full_text_annotation.text
```
Handwriting Recognition
```python
# EasyOCR for handwriting
import easyocr

reader = easyocr.Reader(['en'])
result = reader.readtext('handwritten.jpg', detail=0)  # detail=0 returns plain strings
text = ' '.join(result)
```
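Handwriting tends to produce more uncertain reads than print, so it can help to keep the confidence scores. With the default `detail=1`, `readtext` returns `(bounding_box, text, confidence)` entries, with confidence between 0 and 1. The sketch below filters on that score; `keep_confident` and the 0.4 cutoff are assumptions for illustration, and the sample results are hand-written rather than real model output.

```python
def keep_confident(results, min_conf=0.4):
    """Join the text of readtext(detail=1) entries that meet the cutoff."""
    return " ".join(
        text for _bbox, text, confidence in results if confidence >= min_conf
    )


# Hand-written sample in the shape of reader.readtext('handwritten.jpg')
box = [[0, 0], [40, 0], [40, 12], [0, 12]]
sample_results = [
    (box, "Dear", 0.92),
    (box, "J0hn", 0.18),   # low confidence, likely misread
    (box, "thanks", 0.77),
]
print(keep_confident(sample_results))  # Dear thanks
```

Dropped words leave gaps, so for documents that matter it may be better to flag low-confidence regions for manual review than to discard them silently.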
Multilingual OCR
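```python
from PIL import Image
import pytesseract
import easyocr

# Tesseract with multiple languages
# (each language's traineddata pack must be installed)
image = Image.open('multilingual.jpg')
text = pytesseract.image_to_string(
    image,
    lang='eng+fra+deu'  # English + French + German
)

# EasyOCR (better for Asian languages)
# EasyOCR's Chinese codes are 'ch_sim' / 'ch_tra', and Chinese, Japanese ('ja'),
# and Korean ('ko') can each be combined only with English: one reader per language.
reader = easyocr.Reader(['ch_sim', 'en'])  # Simplified Chinese + English
results = reader.readtext('multilingual.jpg')
```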
OCR opens up scanned content for RAG. It is essential for legal, healthcare, and historical documents, where much of the source material exists only as scans.
Tags
ocr, parsing, tesseract, scanned documents