Frequently Asked Questions

Can it handle scanned PDFs?

Yes. Scanned pages are automatically detected and processed through OCR. The system uses a combination of traditional OCR and a vision-language model for complex layouts.

How does it handle multi-column layouts?

Layout analysis identifies column boundaries and reading order. Text is extracted in logical reading sequence rather than raw positional order, so the output reads naturally.
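To make the difference between logical and positional order concrete, here is a minimal sketch of column-aware ordering. It is not the converter's actual algorithm: the `Block` type, the `reading_order` function, and the `column_gap` threshold are all hypothetical names introduced for illustration, and the coordinates are made-up sample data.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge of the block
    y0: float  # top edge of the block (smaller = higher on the page)


def reading_order(blocks: list[Block], column_gap: float = 50.0) -> list[Block]:
    """Order blocks column by column, top to bottom within each column.

    Assumption: a block belongs to the current column when its left edge
    lies within `column_gap` of that column's first (leftmost) block.
    """
    columns: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.x0):
        if columns and block.x0 - columns[-1][0].x0 <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered: list[Block] = []
    for column in columns:  # columns are already left to right
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


blocks = [
    Block("right column, top", x0=320, y0=100),
    Block("left column, bottom", x0=40, y0=400),
    Block("left column, top", x0=42, y0=100),
    Block("right column, bottom", x0=318, y0=400),
]
ordered_texts = [b.text for b in reading_order(blocks)]
```

A purely positional sort (top to bottom, then left to right) would interleave the two columns line by line; grouping by column first keeps each column's text together.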

Are tables extracted as structured data?

Basic table structure is preserved in the text output. For full structured table extraction, use the PDF to Structured Data converter, which returns tables as JSON or CSV.

What about password-protected PDFs?

You can provide the password via the `password` parameter. Encrypted PDFs cannot be processed unless the password is supplied.

PDF to Text Converter

Extract clean, structured text from PDF documents, including scanned pages, multi-column layouts, headers/footers, and tables. Combines traditional parsing with OCR and layout analysis, so both digital-native and scanned pages are covered.

Max file size: 200 MB

Estimated: 1-10 sec per page

1 input format

How It Works

Upload a PDF file or provide a URL.

The document is classified as digital-native or scanned.

Digital pages are parsed directly; scanned pages go through OCR.

Layout analysis preserves reading order across columns and tables.

Clean text is returned with optional page-level segmentation.
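The steps above can be sketched as a small routing pipeline. This is an illustration under stated assumptions, not the service's implementation: the `Page` type, the `min_chars` threshold used to decide whether a page has a usable text layer, and the `run_ocr` placeholder are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    embedded_text: str  # text layer found by the parser; empty for pure scans


def classify(page: Page, min_chars: int = 20) -> str:
    """Label a page 'digital' if it carries a usable text layer, else 'scanned'."""
    return "digital" if len(page.embedded_text.strip()) >= min_chars else "scanned"


def run_ocr(page: Page) -> str:
    # Placeholder only: a real pipeline would call an OCR engine here.
    return f"[OCR text for page {page.number}]"


def extract(pages: list[Page]) -> list[dict]:
    """Route each page by its classification and return page-level segments."""
    results = []
    for page in pages:
        kind = classify(page)
        text = page.embedded_text.strip() if kind == "digital" else run_ocr(page)
        results.append({"page": page.number, "kind": kind, "text": text})
    return results


pages = [
    Page(1, "This agreement is made between the parties named below."),
    Page(2, ""),  # no text layer, so it is treated as a scan
]
segments = extract(pages)
```

Classifying per page rather than per document lets a mixed file, for example a digital contract with a scanned signature page, use direct parsing where possible and OCR only where needed.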

Code Examples

from mixpeek import Mixpeek

# Authenticate with your API key.
client = Mixpeek(api_key="YOUR_API_KEY")

# Convert a remote PDF to text, limited to the page range given in options.
result = client.convert(
    source="https://example.com/contract.pdf",
    from_format="pdf",
    to_format="text",
    options={
        "ocr_fallback": True,
        "preserve_layout": True,
        "pages": "1-10"
    }
)

# Print the extracted text one page at a time.
for page in result.pages:
    print(f"--- Page {page.number} ---")
    print(page.text)
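The `pages` option above takes a range string such as "1-10". As a rough sketch of how such a string might be validated on the client before sending a request, the hypothetical helper below expands it into page numbers. Only the "start-end" form appears in the example; the comma-separated lists and single pages accepted here are an assumption, not documented API behavior.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec like "1-3,7" into sorted, de-duplicated 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```

Calling `parse_page_range("1-10")` yields the ten page numbers 1 through 10, and a malformed spec such as "5-2" raises `ValueError` before any request is made.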

Use Cases

Ingest legal contracts and regulatory filings for analysis

Extract text from research papers and academic publications

Digitize scanned invoices and receipts

Build full-text search indexes for document libraries
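For the search-index use case, page-level text maps naturally onto an inverted index. The sketch below is a toy illustration, not part of the Mixpeek API: `build_index` and `search` are hypothetical helpers, and the page texts are made-up sample data standing in for converter output.

```python
import re
from collections import defaultdict


def build_index(pages: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercase word to the set of page numbers that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for number, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(number)
    return index


def search(index: dict[str, set[int]], query: str) -> list[int]:
    """Return the pages that contain every word in the query (AND semantics)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Made-up page texts keyed by page number.
pages = {
    1: "This Agreement is governed by the laws of Delaware.",
    2: "Payment is due within thirty days of the invoice date.",
    3: "Late payment accrues interest under the laws of Delaware.",
}
index = build_index(pages)
```

Keeping page numbers in the index means a hit can point the reader to the exact page, which is where page-level segmentation in the output pays off.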

Supported Input Formats

PDF

Quick Info

Category: document

Max File Size: 200 MB

Est. Time: 1-10 sec per page

Extractor: document-descriptor

Try This Conversion

Get started with the Mixpeek API and convert your first file in minutes.

Related Converters

PDF to Structured Data

Extract structured key-value pairs, tables, and form fields from PDF documents. Uses layout analysis and LLM extraction to produce clean JSON output, even from complex forms and invoices.

PDF to Embeddings

Convert PDF documents into semantic vector embeddings for search, retrieval, and RAG applications. Pages are chunked intelligently by sections and paragraphs, then embedded using text or multimodal models.

PDF to Images

Render PDF pages as high-quality images. Each page is converted to JPEG, PNG, or WebP at configurable DPI, with options for specific page ranges and background color control.

PDF to Markdown

Convert PDF documents to clean Markdown format, preserving headings, lists, tables, links, and emphasis. Ideal for migrating content into wikis, CMS platforms, and documentation systems.

Ready to convert PDF to text?

Start using the Mixpeek PDF to Text converter in minutes. Sign up for a free API key and follow the documentation to get started.
