PDF Data Extraction Pipeline
Extract structured data from PDFs including tables, forms, and text. Convert unstructured documents into structured, queryable data.
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

namespace = client.namespaces.create(name="pdf-data")

collection = client.collections.create(
    namespace_id=namespace.id,
    name="invoices",
    extractors=["pdf-extraction", "table-extraction", "ocr"]
)

# Upload PDFs
client.buckets.upload(
    collection_id=collection.id,
    url="s3://your-bucket/invoices/"
)

# Search extracted data
results = client.documents.search(
    namespace_id=namespace.id,
    query="invoices over $10,000 from Q4"
)
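To make "structured, queryable data" concrete, here is a minimal sketch of what a query such as "invoices over $10,000 from Q4" amounts to once invoices exist as structured records. It does not use the Mixpeek API and does not reflect how Mixpeek stores or searches documents. The `Invoice` class, its fields, the helper functions, and the sample values are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Invoice:
    # Hypothetical record shape; real extracted fields may differ.
    vendor: str
    total: float
    issued: date


def is_q4(d: date) -> bool:
    """Return True if the date falls in October, November, or December."""
    return d.month in (10, 11, 12)


def large_q4_invoices(invoices, threshold=10_000.0):
    """Keep invoices whose total exceeds the threshold and that were issued in Q4."""
    return [inv for inv in invoices if inv.total > threshold and is_q4(inv.issued)]


# Made-up sample data, for illustration only.
invoices = [
    Invoice("Acme", 12500.0, date(2023, 11, 3)),
    Invoice("Globex", 8000.0, date(2023, 12, 1)),
    Invoice("Initech", 15000.0, date(2023, 6, 15)),
]

matches = large_q4_invoices(invoices)
for inv in matches:
    print(inv.vendor, inv.total, inv.issued.isoformat())
```

Only the first sample record passes both conditions. The second is under the threshold and the third was issued outside Q4.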
Feature Extractors
PDF Text Extraction
Extract structured text and layout information from PDFs
PDF Table Extraction
Convert tables in PDFs to structured data formats
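As a rough illustration of what "structured data" from a table can look like, the sketch below turns a table, given as a list of rows with the header first, into a list of dictionaries keyed by column name. This is generic Python, not the extractor's actual output format. The function name `table_to_records` and the sample rows are assumptions made for the example.

```python
def table_to_records(rows):
    """Convert a table (first row is the header) into a list of dicts.

    Rows shorter than the header are padded with empty strings;
    cells beyond the header width are dropped.
    """
    if not rows:
        return []
    header = [cell.strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        padded = list(row) + [""] * (len(header) - len(row))
        records.append({key: value.strip() for key, value in zip(header, padded)})
    return records


# Made-up sample table, for illustration only.
rows = [
    ["Item", "Qty", "Unit Price"],
    ["Widget", "4", "25.00"],
    ["Gasket", "10"],  # short row: the missing cell becomes ""
]

records = table_to_records(rows)
for record in records:
    print(record)
```

Cell values stay as strings here. Converting quantities and prices to numbers is left to a later step, because real tables often contain currency symbols or thousands separators that need explicit handling.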
