OCR for Scanned Documents and Images

When OCR Is Needed

Digital PDFs contain extractable text. Scanned documents do not: each page is only an image.

OCR converts image-only sources such as these into machine-readable text:

Scanned contracts
Old books
Receipts
Screenshots
Handwritten notes
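
The distinction above suggests a simple routing step: run OCR only on pages whose text layer is empty or nearly empty. The following is a minimal sketch of that idea, not part of any library. The function name `pages_needing_ocr` and the `min_chars` threshold are assumptions chosen for illustration, and the per-page text is assumed to come from whatever PDF parser is already in use.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return indices of pages whose text layer is empty or nearly empty."""
    flagged = []
    for index, text in enumerate(page_texts):
        # Count only non-whitespace characters so blank layers do not pass.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        if visible < min_chars:
            flagged.append(index)
    return flagged


# Hypothetical per-page text as returned by a PDF parser
pages = ["Contract between A and B, signed on the date below.", "", " \n "]
print(pages_needing_ocr(pages))  # [1, 2]
```

The threshold is a heuristic: a scanned page can still carry a few stray characters (page numbers, stamps), so requiring a minimum amount of text is usually safer than testing for an empty string.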

Tesseract (Free, Open Source)

DEVELOPERpython
from PIL import Image
import pytesseract

def ocr_image(image_path):
    # Load the image and run Tesseract with the English language model
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, lang='eng')
    return text

# With confidence scores
image = Image.open('scan.jpg')
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data['text']):
    confidence = float(data['conf'][i])  # conf may be a string; -1 marks non-word boxes
    if word.strip() and confidence > 60:  # Keep only words above the confidence threshold
        print(word)
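
Printing words one by one loses the page layout. As a sketch of one way to keep it, the helper below groups confident words back into lines using the `block_num`, `par_num`, and `line_num` fields that `image_to_data` returns alongside `text` and `conf`. The name `words_to_lines` and the sample dictionary are illustrative assumptions; the sample is hand-written, not real OCR output, and a single page is assumed.

```python
from collections import defaultdict


def words_to_lines(data, min_conf=60.0):
    """Group confident words from an image_to_data-style dict into text lines."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as str, int, or float
        if not word.strip() or conf <= min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    # Sorting the keys keeps blocks, paragraphs, and lines in reading order.
    return [" ".join(lines[key]) for key in sorted(lines)]


# Hand-written sample in the same shape as Output.DICT (not real OCR output)
sample = {
    "text": ["", "Invoice", "No.", "4711", "T0tal", "due"],
    "conf": ["-1", "96", "91", "88", "35", "92"],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 1, 2, 2],
}
print(words_to_lines(sample))  # ['Invoice No. 4711', 'due']
```

Dropping low-confidence words leaves gaps, as the second line shows. For retrieval this is often preferable to indexing misread tokens, but the threshold should be tuned on real scans.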

Preprocessing for Better Accuracy

DEVELOPERpython
import cv2
import pytesseract

def preprocess_image(image_path):
    img = cv2.imread(image_path)
    if img is None:  # cv2.imread returns None instead of raising on failure
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Remove noise
    denoised = cv2.fastNlMeansDenoising(gray)

    # Threshold (binarization); Otsu's method picks the cutoff automatically
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return binary

# Use the preprocessed image
preprocessed = preprocess_image('scan.jpg')
text = pytesseract.image_to_string(preprocessed)
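
To make the binarization step less of a black box, here is a plain-Python teaching sketch of Otsu's method: it tries every cutoff from 0 to 255 and keeps the one that best separates dark and light pixels (maximum between-class variance). This is not OpenCV's implementation. The function names `otsu_threshold` and `binarize` and the tiny pixel list are assumptions for illustration only.

```python
def otsu_threshold(pixels):
    """Return the 0-255 cutoff that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate cutoff
    sum0 = 0  # sum of their gray values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break     # no light pixels left
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    # Same rule as a binary threshold: values above the cutoff become white.
    return [255 if p > threshold else 0 for p in pixels]


# Three dark "ink" pixels and three bright "paper" pixels
gray = [30, 40, 35, 220, 210, 230]
t = otsu_threshold(gray)
print(binarize(gray, t))  # [0, 0, 0, 255, 255, 255]
```

Because the cutoff is derived from the histogram of each image, the same code adapts to dark photocopies and bright scans without a hand-tuned constant.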

AWS Textract (Commercial, Higher Quality)

DEVELOPERpython
import boto3

textract = boto3.client('textract', region_name='us-east-1')

def extract_with_textract(image_path):
    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    response = textract.detect_document_text(
        Document={'Bytes': image_bytes}
    )

    # Collect LINE blocks in the order Textract returns them
    text = ""
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            text += block['Text'] + '\n'

    return text

# For tables and forms
with open('scan.jpg', 'rb') as f:
    image_bytes = f.read()

response = textract.analyze_document(
    Document={'Bytes': image_bytes},
    FeatureTypes=['TABLES', 'FORMS']
)
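
The `analyze_document` response is a flat list of blocks, so tables have to be reassembled by following relationships. The sketch below assumes the documented block shape: `TABLE` blocks link to `CELL` blocks through `CHILD` relationships, cells carry 1-based `RowIndex` and `ColumnIndex`, and cells link to `WORD` blocks. The helper name `tables_from_blocks` and the sample blocks are illustrative assumptions; merged cells and selection elements are ignored here.

```python
def tables_from_blocks(blocks):
    """Rebuild each TABLE block as a list of rows (lists of cell strings)."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # CHILD relationships link a table to its cells and a cell to its words.
        return [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [by_id[w]["Text"] for w in child_ids(cell)
                     if by_id[w]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        if not cells:
            tables.append([])
            continue
        n_rows = max(row for row, _ in cells)
        n_cols = max(col for _, col in cells)
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def _cell(cell_id, row, col, word_ids):
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


# Hand-built blocks in the shape of response['Blocks'] (not real Textract output)
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, ["w1"]), _cell("c2", 1, 2, ["w2"]),
    _cell("c3", 2, 1, ["w3"]), _cell("c4", 2, 2, ["w4", "w5"]),
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Paper"},
    {"Id": "w4", "BlockType": "WORD", "Text": "4.50"},
    {"Id": "w5", "BlockType": "WORD", "Text": "EUR"},
]
print(tables_from_blocks(sample_blocks))  # [[['Item', 'Price'], ['Paper', '4.50 EUR']]]
```

With a real response, the call would be `tables_from_blocks(response['Blocks'])`. Keeping rows and columns intact matters for RAG, since a table flattened into one line of text loses which value belongs to which header.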

Google Cloud Vision

DEVELOPERpython
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def ocr_with_google(image_path):
    with open(image_path, 'rb') as f:
        content = f.read()

    image = vision.Image(content=content)
    response = client.document_text_detection(image=image)

    return response.full_text_annotation.text
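
Beyond the plain text, `full_text_annotation` exposes a hierarchy of pages, blocks, paragraphs, words, and symbols, with a confidence value on each paragraph. The sketch below walks that hierarchy to pair each paragraph with its confidence. The function name is an assumption, the sample uses `SimpleNamespace` stand-ins rather than real API objects, and joining words with single spaces is a simplification of the break information the API provides.

```python
from types import SimpleNamespace as NS


def paragraphs_with_confidence(annotation):
    """Yield (text, confidence) for every paragraph in a full_text_annotation."""
    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                words = ["".join(symbol.text for symbol in word.symbols)
                         for word in paragraph.words]
                yield " ".join(words), paragraph.confidence


def _word(text):
    # Stand-in for an API word object: one symbol per character.
    return NS(symbols=[NS(text=ch) for ch in text])


# Stand-in objects mimicking the response shape (not real API output)
sample = NS(pages=[NS(blocks=[NS(paragraphs=[
    NS(words=[_word("Lease"), _word("agreement")], confidence=0.98),
    NS(words=[_word("s1gned"), _word("2O19")], confidence=0.41),
])])])

for text, conf in paragraphs_with_confidence(sample):
    flag = "review" if conf < 0.8 else "ok"
    print(f"{flag}: {text}")
```

With a real response, the call would be `paragraphs_with_confidence(response.full_text_annotation)`. The 0.8 cutoff is an arbitrary example value.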

Handwriting Recognition

DEVELOPERpython
# EasyOCR for handwriting (accuracy depends heavily on legibility)
import easyocr

reader = easyocr.Reader(['en'])
result = reader.readtext('handwritten.jpg', detail=0)
text = ' '.join(result)
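
Handwriting tends to produce more misreads than print, so it can help to keep the confidence values instead of discarding them with `detail=0`. With the default detail level, `readtext` returns a list of `(bounding_box, text, confidence)` entries. The sketch below separates confident text from words that may need manual review. The helper name, the 0.5 threshold, and the sample results are assumptions for illustration.

```python
def split_by_confidence(results, threshold=0.5):
    """Split readtext-style results into accepted text and words to review."""
    accepted, needs_review = [], []
    for _bbox, text, confidence in results:
        (accepted if confidence >= threshold else needs_review).append(text)
    return " ".join(accepted), needs_review


# Hand-written sample in the shape of reader.readtext(...) output
sample = [
    ([[10, 10], [120, 10], [120, 40], [10, 40]], "Dear", 0.93),
    ([[130, 10], [260, 10], [260, 40], [130, 40]], "Mr.", 0.81),
    ([[270, 10], [420, 10], [420, 40], [270, 40]], "Srnith", 0.22),
]
text, review = split_by_confidence(sample)
print(text)    # Dear Mr.
print(review)  # ['Srnith']
```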

Multilingual OCR

DEVELOPERpython
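from PIL import Image
import pytesseract
import easyocr

# Tesseract with multiple languages (each language pack must be installed)
image = Image.open('multilingual.jpg')
text = pytesseract.image_to_string(
    image,
    lang='eng+fra+deu'  # English + French + German
)

# EasyOCR (better for Asian languages)
# Chinese, Japanese, and Korean models each pair only with English,
# so create one reader per language, e.g. ['ja', 'en'] or ['ko', 'en'].
reader = easyocr.Reader(['ch_sim', 'en'])
results = reader.readtext('multilingual.jpg')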
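
Tesseract expects three-letter language codes joined with `+`, while application code often carries two-letter codes. A small helper can bridge the two. This is a sketch: the mapping covers only a handful of languages and the names `TESSERACT_CODES` and `tesseract_lang` are assumptions, not part of pytesseract.

```python
# Partial, illustrative mapping from two-letter codes to Tesseract codes
TESSERACT_CODES = {"en": "eng", "fr": "fra", "de": "deu", "es": "spa", "it": "ita"}


def tesseract_lang(iso_codes):
    """Build a Tesseract lang string such as 'eng+fra+deu' from two-letter codes."""
    unknown = [code for code in iso_codes if code not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"No Tesseract code known for: {unknown}")
    ordered = []
    for code in iso_codes:
        mapped = TESSERACT_CODES[code]
        if mapped not in ordered:  # drop duplicates, keep the caller's order
            ordered.append(mapped)
    return "+".join(ordered)


print(tesseract_lang(["en", "fr", "de"]))  # eng+fra+deu
```

The result can be passed as the `lang` argument of `pytesseract.image_to_string`. Each listed language still needs its trained data file installed alongside Tesseract.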

OCR makes scanned content accessible to a RAG pipeline. It is indispensable for legal, medical, and historical documents.
