OCR for Scanned Documents and Images

Extract text from scanned PDFs and images for RAG. Compare Tesseract, AWS Textract, and Google Cloud Vision with code examples.

Author
Ailog Research Team
Published
22 November 2025
Reading time
9 min read
Level
intermediate
RAG Pipeline Step
Parsing

When You Need OCR

Digital PDFs carry an extractable text layer. Scanned documents don't: each page is just an image.

OCR converts these into searchable text:

  • Scanned contracts
  • Old books
  • Receipts
  • Screenshots
  • Handwritten notes

Tesseract (Free, Open-Source)

```python
from PIL import Image
import pytesseract

def ocr_image(image_path):
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, lang='eng')
    return text

# With per-word confidence scores
image = Image.open('scan.jpg')
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data['text']):
    confidence = int(data['conf'][i])
    if confidence > 60:  # filter low-confidence words
        print(word)
```
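The confidence filtering can be wrapped into a reusable helper that works on anything shaped like pytesseract's Output.DICT (the sample dict below is made up):

```python
def filter_by_confidence(data, min_conf=60):
    """Keep words whose OCR confidence clears the threshold."""
    words = []
    for word, conf in zip(data['text'], data['conf']):
        if word.strip() and int(conf) > min_conf:
            words.append(word)
    return ' '.join(words)

# Made-up sample in pytesseract's Output.DICT shape:
sample = {'text': ['Invoice', '#', '2024', 'xq3'], 'conf': [96, 88, 91, 42]}
print(filter_by_confidence(sample))  # low-confidence noise 'xq3' is dropped
```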

Preprocessing for Better Accuracy

```python
import cv2
import numpy as np

def preprocess_image(image_path):
    img = cv2.imread(image_path)
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Remove noise
    denoised = cv2.fastNlMeansDenoising(gray)
    # Threshold (binarize)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

# Use preprocessed image
preprocessed = preprocess_image('scan.jpg')
text = pytesseract.image_to_string(preprocessed)
```
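Otsu's method, invoked above through cv2.THRESH_OTSU, picks the threshold that maximizes the between-class variance of foreground and background pixels. For intuition, a pure-NumPy sketch of the same computation:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold for a uint8 grayscale array."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_w = np.cumsum(hist)                      # pixel count at or below t
    cum_mean = np.cumsum(hist * np.arange(256))  # intensity mass at or below t
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0 = cum_w[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # all pixels on one side: not a valid split
        m0 = cum_mean[t] / w0
        m1 = (cum_mean[-1] - cum_mean[t]) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

On a cleanly bimodal image (text ink vs. paper) the chosen threshold lands between the two intensity peaks, which is exactly what binarization for OCR wants.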

AWS Textract (Commercial, Best Quality)

```python
import boto3

textract = boto3.client('textract', region_name='us-east-1')

def extract_with_textract(image_path):
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    response = textract.detect_document_text(
        Document={'Bytes': image_bytes}
    )
    text = ""
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            text += block['Text'] + '\n'
    return text

# For tables and forms
response = textract.analyze_document(
    Document={'Bytes': image_bytes},
    FeatureTypes=['TABLES', 'FORMS']
)
```
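analyze_document returns tables as CELL blocks keyed by RowIndex/ColumnIndex, with CHILD relationships pointing at WORD blocks. A simplified sketch of stitching that back into rows (the sample response in the test is made up and far smaller than real Textract output):

```python
def table_cells_to_rows(blocks):
    """Group Textract CELL blocks into a row-major grid of cell text."""
    words = {b['Id']: b['Text'] for b in blocks if b['BlockType'] == 'WORD'}
    rows = {}
    for b in blocks:
        if b['BlockType'] != 'CELL':
            continue
        child_ids = [cid for rel in b.get('Relationships', [])
                     if rel['Type'] == 'CHILD' for cid in rel['Ids']]
        text = ' '.join(words[cid] for cid in child_ids)
        rows.setdefault(b['RowIndex'], {})[b['ColumnIndex']] = text
    return [[cols[c] for c in sorted(cols)] for _, cols in sorted(rows.items())]
```

A real response also includes TABLE and PAGE blocks and geometry; this ignores everything except the cell grid.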

Google Cloud Vision

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def ocr_with_google(image_path):
    with open(image_path, 'rb') as f:
        content = f.read()
    image = vision.Image(content=content)
    response = client.document_text_detection(image=image)
    return response.full_text_annotation.text
```

Handwriting Recognition

```python
# EasyOCR for handwriting
import easyocr

reader = easyocr.Reader(['en'])
result = reader.readtext('handwritten.jpg', detail=0)
text = ' '.join(result)
```
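With detail=1 (the default) readtext returns (bounding_box, text, confidence) tuples, which lets you drop shaky detections before they pollute a RAG index. A sketch over that shape (the sample detections below are made up):

```python
def join_confident(results, min_conf=0.5):
    """Join EasyOCR detections, skipping low-confidence ones.
    results: list of (bbox, text, confidence) as returned with detail=1."""
    return ' '.join(text for _, text, conf in results if conf >= min_conf)

# Made-up detections in EasyOCR's detail=1 shape:
sample = [
    ([[0, 0], [50, 0], [50, 20], [0, 20]], 'Dear', 0.91),
    ([[55, 0], [90, 0], [90, 20], [55, 20]], 'Sir', 0.88),
    ([[0, 25], [40, 25], [40, 45], [0, 45]], '~#', 0.12),
]
print(join_confident(sample))  # -> Dear Sir
```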

Multilingual OCR

```python
# Tesseract with multiple languages
text = pytesseract.image_to_string(
    image,
    lang='eng+fra+deu'  # English + French + German
)

# EasyOCR (better for Asian languages; language codes must share a
# recognition model, so pair one Asian script with English per Reader)
reader = easyocr.Reader(['en', 'ja'])
results = reader.readtext('multilingual.jpg')
```
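Tesseract's lang parameter is just '+'-joined ISO 639-2 codes. If your pipeline tracks languages as two-letter codes, a small mapping helper works; the table below is my own illustrative subset, not part of either library:

```python
# Illustrative subset of ISO 639-1 -> Tesseract (ISO 639-2) codes
TESS_LANG = {'en': 'eng', 'fr': 'fra', 'de': 'deu', 'es': 'spa', 'it': 'ita'}

def tesseract_lang_string(codes):
    """Build the '+'-joined lang argument for pytesseract."""
    return '+'.join(TESS_LANG[c] for c in codes)

tesseract_lang_string(['en', 'fr', 'de'])  # -> 'eng+fra+deu'
```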

OCR opens up scanned content for RAG. Essential for legal, healthcare, and historical documents.

Tags

  • ocr
  • parsing
  • tesseract
  • scanned documents