Parsing · Intermediate

OCR for Scanned Documents and Images

November 22, 2025
9 min read
Ailog Research Team

Extract text from scanned PDFs and images using Tesseract, AWS Textract, and modern OCR techniques.

When You Need OCR

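Born-digital PDFs carry an embedded text layer that can be extracted directly. Scanned documents do not: each page is just an image, so the text has to be recognized before it can be used.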
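One way to decide whether a page needs OCR is to check how much text an ordinary PDF text extractor returns for it. The sketch below assumes you have already pulled per-page text with a PDF library of your choice. The helper names and the `min_chars` threshold are hypothetical and should be tuned on your own documents.

```python
def needs_ocr(page_text, min_chars=20):
    """Guess whether a page is a scan based on its extracted text layer.

    page_text: whatever a PDF text extractor returned for the page
               (may be None or empty for image-only pages).
    min_chars: assumed threshold, not a standard value.
    """
    if not page_text:
        return True
    # Count only visible characters so stray whitespace does not count as text.
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


def pages_needing_ocr(page_texts, min_chars=20):
    """Return the indices of pages that should be routed to OCR."""
    return [i for i, text in enumerate(page_texts) if needs_ocr(text, min_chars)]
```

Routing only the image-only pages to OCR keeps the original text layer wherever one exists, which is usually more accurate than recognized text.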

OCR turns sources like these into machine-readable text:

  • Scanned contracts
  • Old books
  • Receipts
  • Screenshots
  • Handwritten notes

Tesseract (Free, Open-Source)

DEVELOPERpython
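from PIL import Image
import pytesseract

def ocr_image(image_path):
    """Return all text Tesseract finds in the image."""
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, lang='eng')
    return text

# With confidence scores
def ocr_confident_words(image_path, min_conf=60):
    """Return only the words whose confidence exceeds min_conf (0-100)."""
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words = []
    for word, confidence in zip(data['text'], data['conf']):
        # conf is -1 for non-word rows and may be a string in older pytesseract versions
        if word.strip() and float(confidence) > min_conf:  # Filter low-confidence
            words.append(word)
    return words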
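Filtering word by word loses the line structure of the page. The dictionary returned by `image_to_data` also carries `block_num`, `par_num`, and `line_num` for each entry, so the surviving words can be regrouped into lines. This is a minimal sketch that works on that dictionary directly. The function name and the default cutoff are assumptions.

```python
from collections import defaultdict


def confident_lines(data, min_conf=60.0):
    """Rebuild text lines from image_to_data output, dropping weak words.

    data: dict shaped like pytesseract.image_to_data(..., output_type=Output.DICT)
    min_conf: assumed cutoff on Tesseract's 0-100 confidence scale.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as str, int, or float
        if not word.strip() or conf < min_conf:
            continue  # skip layout rows (conf == -1) and low-confidence words
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    # Keys sort in reading order: block, then paragraph, then line.
    return [" ".join(words) for _, words in sorted(lines.items())]
```

Because it only needs the dictionary, the function can be exercised with hand-built data before running Tesseract on real scans.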

Preprocessing for Better Accuracy

DEVELOPERpython
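import cv2
import pytesseract

def preprocess_image(image_path):
    img = cv2.imread(image_path)
    if img is None:
        # cv2.imread returns None instead of raising when the file cannot be read
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Remove noise
    denoised = cv2.fastNlMeansDenoising(gray)

    # Threshold (binarize); Otsu's method picks the cutoff automatically
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return binary

# Use preprocessed image (pytesseract accepts NumPy arrays)
preprocessed = preprocess_image('scan.jpg')
text = pytesseract.image_to_string(preprocessed)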
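To make the binarization step less of a black box, here is a pure-Python illustration of the idea behind Otsu's method: try every cutoff and keep the one that best separates dark pixels from light ones (maximum between-class variance). It is a teaching sketch over a flat list of grayscale values, not OpenCV's implementation, and far too slow for real images.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold for an iterable of 0-255 grayscale integers."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # intensity sum of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # nothing on the dark side yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # everything is on the dark side; no split left to test
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map values above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

On a scan with dark ink on light paper, the histogram has two clear peaks and the chosen cutoff falls between them, which is why this step tends to help OCR on clean documents and less so on unevenly lit photos.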

AWS Textract (Commercial, Best Quality)

DEVELOPERpython
import boto3

textract = boto3.client('textract', region_name='us-east-1')

def read_bytes(image_path):
    with open(image_path, 'rb') as f:
        return f.read()

def extract_with_textract(image_path):
    response = textract.detect_document_text(
        Document={'Bytes': read_bytes(image_path)}
    )

    # Keep LINE blocks only; WORD blocks repeat the same text
    text = ""
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            text += block['Text'] + '\n'

    return text

# For tables and forms
def analyze_with_textract(image_path):
    return textract.analyze_document(
        Document={'Bytes': read_bytes(image_path)},
        FeatureTypes=['TABLES', 'FORMS']
    )
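The `analyze_document` response is a flat list of blocks linked by IDs, so tables have to be reassembled. In a `TABLES` response, a `TABLE` block lists its `CELL` children, and each `CELL` carries a 1-based `RowIndex` and `ColumnIndex` and lists its `WORD` children. The sketch below rebuilds each table as a list of rows. The function name is hypothetical, and merged cells and selection elements are ignored for brevity.

```python
def tables_from_blocks(blocks):
    """Turn Textract TABLE/CELL/WORD blocks into lists of rows.

    blocks: the 'Blocks' list from an analyze_document response
            requested with FeatureTypes including 'TABLES'.
    """
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    def cell_text(cell):
        words = [by_id[i]["Text"] for i in child_ids(cell)
                 if by_id[i]["BlockType"] == "WORD"]
        return " ".join(words)

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = [by_id[i] for i in child_ids(block)
                 if by_id[i]["BlockType"] == "CELL"]
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c)
        tables.append(grid)
    return tables
```

Each returned table is a plain list of lists, so it can be written to CSV or rendered as Markdown before indexing.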

Google Cloud Vision

DEVELOPERpython
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def ocr_with_google(image_path):
    with open(image_path, 'rb') as f:
        content = f.read()

    image = vision.Image(content=content)
    response = client.document_text_detection(image=image)

    # The API reports per-request failures inside the response body
    if response.error.message:
        raise RuntimeError(response.error.message)

    return response.full_text_annotation.text
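With a free local engine and paid cloud APIs both available, one possible arrangement is to try them in order and stop at the first that returns usable text. The sketch below is an assumption about how you might wire this up, not something the engines provide. It takes plain callables, so any of the functions in this guide can be plugged in, and `ocr_with_fallback` is a hypothetical name.

```python
def ocr_with_fallback(image_path, engines, min_chars=1):
    """Try OCR engines in order; return (engine_name, text) from the first usable one.

    engines: list of (name, callable) pairs; each callable takes a path
             and returns the recognized text.
    min_chars: assumed minimum number of non-blank characters to accept.
    """
    errors = []
    for name, run in engines:
        try:
            text = run(image_path)
        except Exception as exc:  # network errors, missing credentials, bad input
            errors.append(f"{name}: {exc}")
            continue
        if text and len(text.strip()) >= min_chars:
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all OCR engines failed: " + "; ".join(errors))
```

For example, `ocr_with_fallback('scan.jpg', [('tesseract', ocr_image), ('google', ocr_with_google)])` would only call the cloud API when the local pass comes back empty or fails.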

Handwriting Recognition

DEVELOPERpython
# EasyOCR for handwriting
import easyocr

reader = easyocr.Reader(['en'])
# detail=0 returns plain strings instead of (box, text, confidence) tuples
result = reader.readtext('handwritten.jpg', detail=0)
text = ' '.join(result)
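Handwriting tends to produce shakier results than print, so it can help to keep the confidence scores rather than discard them. With the default `detail=1`, `readtext` returns `(bounding_box, text, confidence)` tuples with confidence between 0 and 1. The sketch below splits those results into accepted text and items to review by hand. The function name and the 0.5 cutoff are assumptions.

```python
def split_by_confidence(results, min_conf=0.5):
    """Separate EasyOCR detections into accepted text and items to review.

    results: list of (bounding_box, text, confidence) tuples, the shape
             returned by reader.readtext(path) with the default detail=1.
    min_conf: assumed cutoff on EasyOCR's 0-1 confidence scale.
    """
    accepted, review = [], []
    for _bbox, text, conf in results:
        (accepted if conf >= min_conf else review).append(text)
    return " ".join(accepted), review
```

The review list gives a cheap way to flag pages for manual checking instead of silently indexing misread words.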

Multilingual OCR

DEVELOPERpython
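from PIL import Image
import pytesseract
import easyocr

# Tesseract with multiple languages (each language pack must be installed)
image = Image.open('multilingual.jpg')
text = pytesseract.image_to_string(
    image,
    lang='eng+fra+deu'  # English + French + German
)

# EasyOCR (better for Asian languages)
# Simplified Chinese is 'ch_sim' (there is no 'zh' code). EasyOCR pairs its
# Chinese, Japanese, and Korean models only with English, so use one reader
# per script, e.g. ['ja', 'en'] or ['ko', 'en'] for the others.
reader = easyocr.Reader(['ch_sim', 'en'])
results = reader.readtext('multilingual.jpg')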
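Tesseract raises an error at recognition time if a requested language pack is missing, so it can be convenient to validate the list up front. This small helper builds the `lang` string from a list of codes and checks it against the installed ones, which recent pytesseract versions expose through `pytesseract.get_languages()`. The helper itself is hypothetical.

```python
def build_lang_arg(wanted, installed):
    """Join Tesseract language codes with '+', failing early on missing packs.

    wanted: codes in priority order, e.g. ['eng', 'fra', 'deu'].
    installed: codes available locally, e.g. from pytesseract.get_languages().
    """
    missing = [code for code in wanted if code not in installed]
    if missing:
        raise ValueError("missing Tesseract language data: " + ", ".join(missing))
    # dict.fromkeys drops duplicates while keeping the original order.
    return "+".join(dict.fromkeys(wanted))
```

The result can be passed straight to `image_to_string(image, lang=...)`.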

OCR makes scanned content available to RAG pipelines, which is essential for legal, healthcare, and historical documents that often exist only as scans.
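Raw OCR output usually keeps the page's line wraps and end-of-line hyphenation, which can hurt retrieval if it is indexed as-is. A minimal cleanup pass before chunking might look like the sketch below. The rules are assumptions rather than a standard recipe, and the hyphen rule will also merge genuinely hyphenated words that happen to break across lines.

```python
import re


def clean_ocr_text(text):
    """Tidy raw OCR output before chunking or indexing it."""
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat single newlines as soft wraps; keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    # Normalize three or more newlines to a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Keeping paragraph breaks intact gives a downstream chunker natural boundaries to split on.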

Tags

ocr, parsing, tesseract, scanned documents
