OCR for Scanned Documents and Images
Extract text from scanned PDFs and images for RAG. Compare Tesseract, AWS Textract, and Google Vision OCR with code examples and accuracy benchmarks.
- Author: Ailog Research Team
- Reading time: 9 min read
- Level: intermediate
- RAG Pipeline Step: Parsing
When You Need OCR
Digital PDFs have an extractable text layer. Scanned documents don't: each page is just an image, so a normal PDF parser returns little or nothing.
OCR converts:
- Scanned contracts
- Old books
- Receipts
- Screenshots
- Handwritten notes
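Before running OCR, it helps to check whether a page already has a usable text layer. The sketch below assumes you have already attempted normal text extraction with a PDF library of your choice; `needs_ocr` and its `min_chars` cutoff are hypothetical names and values chosen for illustration, not part of any library.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: treat a page as scanned if normal extraction yields almost no text.

    extracted_text: whatever your PDF library returned for the page (may be None).
    min_chars: assumed cutoff; tune it on your own documents.
    """
    if not extracted_text:
        return True
    # Count only visible characters so whitespace-only pages still go to OCR
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars


pages = ["Section 1. The parties agree to the following terms.", "", " \n \n"]
for number, page_text in enumerate(pages, start=1):
    route = "OCR" if needs_ocr(page_text) else "text layer"
    print(f"page {number}: {route}")
```

Routing per page rather than per file matters for mixed documents, where a digital report may contain a few scanned appendix pages.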
Tesseract (Free, Open-Source)
```python
from PIL import Image
import pytesseract

def ocr_image(image_path):
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, lang='eng')
    return text

# With confidence scores
image = Image.open('scan.jpg')
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data['text']):
    # conf can be an int, float, or str depending on the pytesseract version
    confidence = float(data['conf'][i])
    if confidence > 60 and word.strip():  # Filter low-confidence and empty entries
        print(word)
```
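Printing words one by one loses the line structure that chunking for RAG usually needs. As a minimal sketch, the word-level dict from `image_to_data` can be regrouped into lines using its `block_num`, `par_num`, and `line_num` fields. The helper name `lines_from_data` and the hand-written sample are assumptions for illustration, not real OCR output.

```python
from collections import defaultdict


def lines_from_data(data, min_conf=60.0):
    """Rebuild text lines from a pytesseract image_to_data dict, dropping weak words."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # Non-word rows (blocks, paragraphs, lines) carry a conf of -1
        conf = float(data["conf"][i])
        if conf <= min_conf or not word.strip():
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    # Tuple keys sort in block / paragraph / line order
    return [" ".join(words) for _, words in sorted(lines.items())]


# Hand-written sample in the shape of Output.DICT (not real OCR output)
sample = {
    "text": ["", "Invoice", "N0", "1234", "Total", "due"],
    "conf": ["-1", 96, 31, 91, 88.5, 90],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [0, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 1, 2, 2],
}
print(lines_from_data(sample))
```

Here the misread token `N0` falls below the cutoff and is dropped, while the two lines stay separate.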
Preprocessing for Better Accuracy
```python
import cv2
import pytesseract

def preprocess_image(image_path):
    img = cv2.imread(image_path)
    if img is None:
        # cv2.imread returns None instead of raising when the file cannot be read
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Remove noise
    denoised = cv2.fastNlMeansDenoising(gray)

    # Threshold (binarize); Otsu's method picks the threshold automatically
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return binary

# Use preprocessed image
preprocessed = preprocess_image('scan.jpg')
text = pytesseract.image_to_string(preprocessed)
```
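The `cv2.THRESH_OTSU` flag tells OpenCV to choose the binarization threshold itself. To show the idea behind it, here is a pure-Python sketch of Otsu's method on a flat list of 0-255 grayscale values. It illustrates the technique and is not OpenCV's implementation; the function name and the synthetic pixel data are assumptions.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    bg_count, bg_sum = 0, 0
    for t in range(256):
        # Background = pixels with value <= t, foreground = the rest
        bg_count += hist[t]
        bg_sum += t * hist[t]
        fg_count = total - bg_count
        if bg_count == 0 or fg_count == 0:
            continue
        bg_mean = bg_sum / bg_count
        fg_mean = (total_sum - bg_sum) / fg_count
        between_var = bg_count * fg_count * (bg_mean - fg_mean) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Synthetic "scan": dark ink around 40, light paper around 210
pixels = [38, 40, 42, 45] * 25 + [205, 210, 215, 220] * 75
t = otsu_threshold(pixels)
binary = [255 if p > t else 0 for p in pixels]
print(t, binary.count(0), binary.count(255))
```

Because the threshold adapts to each image's histogram, the same pipeline can handle both faded and high-contrast scans without a hand-tuned cutoff.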
AWS Textract (Commercial, Best Quality)
```python
import boto3

textract = boto3.client('textract', region_name='us-east-1')

def extract_with_textract(image_path):
    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    response = textract.detect_document_text(
        Document={'Bytes': image_bytes}
    )

    # Keep LINE blocks only; WORD blocks repeat the same text word by word
    text = ""
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            text += block['Text'] + '\n'

    return text

# For tables and forms
with open('scan.jpg', 'rb') as f:
    image_bytes = f.read()

response = textract.analyze_document(
    Document={'Bytes': image_bytes},
    FeatureTypes=['TABLES', 'FORMS']
)
```
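The `analyze_document` response is a flat list of blocks linked by IDs, so form fields have to be reassembled from `KEY_VALUE_SET` blocks and their `VALUE` and `CHILD` relationships. The sketch below shows one way to do that; the helper names and the hand-written sample blocks are assumptions, and checkbox values (`SELECTION_ELEMENT` blocks) are ignored for brevity.

```python
def _child_text(block, by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def form_fields(blocks):
    """Map each form key to its value using KEY_VALUE_SET relationships."""
    by_id = {b["Id"]: b for b in blocks}
    fields = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        key = _child_text(block, by_id)
        value = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value = " ".join(_child_text(by_id[i], by_id) for i in rel["Ids"])
        fields[key] = value
    return fields


# Hand-written blocks in the shape of response['Blocks'] (not a real response)
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]
print(form_fields(sample_blocks))
```

In a real pipeline you would call `form_fields(response['Blocks'])` and store the key-value pairs as metadata alongside the page text.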
Google Cloud Vision
```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def ocr_with_google(image_path):
    with open(image_path, 'rb') as f:
        content = f.read()

    image = vision.Image(content=content)
    response = client.document_text_detection(image=image)

    # The API reports failures in the response body rather than raising
    if response.error.message:
        raise RuntimeError(response.error.message)

    return response.full_text_annotation.text
```
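Besides the flat `text` string, `full_text_annotation` exposes a hierarchy of pages, blocks, paragraphs, words, and symbols, with a confidence value on each paragraph. As a rough sketch, paragraphs can be pulled out together with their confidence so that weak ones can be flagged. The helper name is an assumption, and the stand-in objects built with `SimpleNamespace` only mimic the attribute layout; they are not real API responses.

```python
from types import SimpleNamespace as NS


def paragraphs_with_confidence(annotation):
    """Yield (text, confidence) for each paragraph in a full_text_annotation."""
    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                words = ["".join(s.text for s in word.symbols)
                         for word in paragraph.words]
                # Joining with spaces is a simplification: it ignores detected
                # line breaks and is wrong for scripts written without spaces
                yield " ".join(words), paragraph.confidence


def _word(text):
    """Build a stand-in word object made of one symbol per character."""
    return NS(symbols=[NS(text=ch) for ch in text])


# Stand-in for response.full_text_annotation (hand-written, not real output)
annotation = NS(pages=[NS(blocks=[NS(paragraphs=[
    NS(words=[_word("Lease"), _word("agreement")], confidence=0.98),
    NS(words=[_word("illegible")], confidence=0.41),
])])])

for text, confidence in paragraphs_with_confidence(annotation):
    print(f"{confidence:.2f}  {text}")
```

Paragraph-level units also make natural chunk boundaries when the extracted text is indexed for retrieval.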
Handwriting Recognition
```python
# EasyOCR for handwriting
import easyocr

reader = easyocr.Reader(['en'])
# detail=0 returns only the recognized strings, without boxes or scores
result = reader.readtext('handwritten.jpg', detail=0)
text = ' '.join(result)
```
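Handwriting tends to produce more misreads than print, so it can be worth keeping the confidence scores. With the default `detail=1`, `readtext` returns a bounding box, the text, and a confidence for each region. The sketch below separates accepted text from regions to review by hand; `split_by_confidence`, the `0.5` threshold, and the sample results are assumptions for illustration.

```python
def split_by_confidence(results, threshold=0.5):
    """Separate EasyOCR detail results into accepted text and regions to review."""
    accepted, review = [], []
    for bbox, text, confidence in results:
        if confidence >= threshold:
            accepted.append(text)
        else:
            # Keep the box so a reviewer can locate the region on the page
            review.append((text, round(confidence, 2), bbox))
    return " ".join(accepted), review


# Hand-written sample in the shape of reader.readtext(path) output
results = [
    ([[0, 0], [80, 0], [80, 20], [0, 20]], "Dear", 0.93),
    ([[90, 0], [130, 0], [130, 20], [90, 20]], "Mr", 0.88),
    ([[140, 0], [230, 0], [230, 20], [140, 20]], "Srnith", 0.27),
]
text, review = split_by_confidence(results)
print(text)
print(review)
```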
Multilingual OCR
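```python
from PIL import Image
import pytesseract
import easyocr

# Tesseract with multiple languages (each language pack must be installed)
image = Image.open('multilingual.jpg')
text = pytesseract.image_to_string(
    image,
    lang='eng+fra+deu'  # English + French + German
)

# EasyOCR (better for Asian languages)
# Chinese is 'ch_sim' (simplified) or 'ch_tra' (traditional), not 'zh'.
# Each CJK model can only be combined with English, so use one reader per language.
readers = {lang: easyocr.Reader([lang, 'en']) for lang in ('ch_sim', 'ja', 'ko')}
results = readers['ch_sim'].readtext('multilingual.jpg')
```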
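Tesseract expects three-letter language names joined with `+`, while document metadata often carries two-letter codes. A small helper can bridge the two. This is a sketch: the mapping below is an assumed subset that you would extend to match the language packs installed on your system, and `tesseract_lang` is a hypothetical name.

```python
# Assumed subset of two-letter codes -> Tesseract language names; extend as needed
TESSERACT_CODES = {"en": "eng", "fr": "fra", "de": "deu", "es": "spa", "it": "ita"}


def tesseract_lang(codes):
    """Build a Tesseract lang string such as 'eng+fra' from two-letter codes."""
    unknown = [c for c in codes if c not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"No Tesseract mapping for: {', '.join(unknown)}")
    # dict.fromkeys removes duplicates while keeping the caller's order
    return "+".join(dict.fromkeys(TESSERACT_CODES[c] for c in codes))


print(tesseract_lang(["en", "fr", "de"]))
```

The result can be passed directly as the `lang` argument of `pytesseract.image_to_string`. Failing loudly on unknown codes avoids silently running OCR with the wrong language model.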
OCR makes scanned content available to a RAG pipeline. It is essential for legal, healthcare, and historical documents, which often exist only as scans.