How are documents chunked?

Documents are chunked by semantic boundaries: section headers, paragraph breaks, and page boundaries. Default chunk size is 512 tokens with 64-token overlap. Both values are configurable.
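To make the size and overlap numbers concrete, here is a minimal sketch of a fixed-size sliding window over a token list. It illustrates only the arithmetic of the two settings; the service itself splits on semantic boundaries, and the `chunk_tokens` helper is a hypothetical name, not part of the Mixpeek API.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=64):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each window starts this many tokens after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = list(range(1000))  # stand-in for real token IDs
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With the defaults, each new chunk begins 448 tokens after the previous one, so the last 64 tokens of one chunk reappear at the start of the next.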

Can figures and diagrams be embedded?

Yes. Set `include_figures` to true and visual elements will be processed with a multimodal model (CLIP or SigLIP). Figure embeddings are returned alongside text embeddings.

Which embedding model should I use for documents?

Multilingual E5-large-instruct provides the best retrieval quality for English documents and also works well for multilingual collections. CLIP is recommended when you need to search across both text and images.
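The recommendation above can be captured as a small selection helper. This is only a sketch: `pick_model` is a hypothetical function, `"e5-large-instruct"` is the identifier used in this page's code example, and `"clip"` is an assumed identifier that should be checked against the API documentation.

```python
def pick_model(search_images: bool) -> str:
    """Return a model identifier following the guidance in this FAQ."""
    if search_images:
        # Assumed identifier for CLIP; not confirmed by this page.
        return "clip"
    # Identifier taken from the code example on this page.
    return "e5-large-instruct"


text_only_model = pick_model(search_images=False)
cross_modal_model = pick_model(search_images=True)
```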

PDF to Embeddings Converter

Convert PDF documents into semantic vector embeddings for search, retrieval, and RAG applications. Text is chunked along section and paragraph boundaries, then embedded using text or multimodal models.

Max file size: 200 MB

Estimated: 5-30 sec per document

1 input format

How It Works

Upload a PDF or provide a URL.

Text is extracted and segmented into semantic chunks.

Diagrams and figures are optionally processed with a vision model.

Each chunk is embedded using your selected model.

Embeddings are returned with chunk text and metadata.

Code Examples

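from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Convert a PDF at a URL into chunk-level embeddings.
result = client.convert(
    source="https://example.com/whitepaper.pdf",
    from_format="pdf",
    to_format="embeddings",
    options={
        "model": "e5-large-instruct",
        "chunk_size": 512,        # tokens per chunk (the default)
        "chunk_overlap": 64,      # tokens shared by adjacent chunks (the default)
        "include_figures": True   # also embed figures with a multimodal model
    }
)

# Print a short preview of each returned chunk.
for chunk in result.chunks:
    print(f"Chunk {chunk.index}: {chunk.text[:80]}...")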

Use Cases

Build RAG systems over document collections

Enable semantic search across PDF archives

Power question-answering over technical documentation

Create vector indexes for legal document review
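The search and retrieval use cases above come down to ranking chunk embeddings by similarity to a query embedding. The sketch below shows that ranking step with cosine similarity in plain Python. The tiny three-dimensional vectors and chunk texts are made-up placeholders, and `cosine_similarity` and `top_k` are hypothetical helpers, not part of the Mixpeek API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=3):
    """Rank (text, embedding) pairs by similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Placeholder data standing in for embedded chunks.
chunks = [
    ("Refund policy", [0.9, 0.1, 0.0]),
    ("Shipping times", [0.1, 0.9, 0.0]),
    ("Return window", [0.8, 0.2, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

For large collections a vector database or approximate nearest-neighbor index would normally replace this linear scan.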

Supported Input Formats

PDF

Quick Info

Category: document

Max File Size: 200 MB

Est. Time: 5-30 sec per document

Extractor: document-descriptor
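The 200 MB limit listed on this page can be checked locally before uploading. The sketch below assumes the decimal interpretation (1 MB = 1,000,000 bytes), which is the stricter reading; the page does not say which definition the service uses, and both helper names are hypothetical.

```python
import os

# Assumption: 1 MB = 1,000,000 bytes (the more conservative reading of "200 MB").
MAX_FILE_SIZE_BYTES = 200 * 1_000_000


def within_size_limit(size_bytes, limit=MAX_FILE_SIZE_BYTES):
    """Return True if a file of `size_bytes` fits under the limit."""
    return size_bytes <= limit


def file_within_limit(path, limit=MAX_FILE_SIZE_BYTES):
    """Check a file on disk against the limit using its size in bytes."""
    return within_size_limit(os.path.getsize(path), limit)
```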

Related Converters

PDF to Text

Extract clean, structured text from PDF documents including scanned pages, multi-column layouts, headers/footers, and tables. Combines traditional parsing with OCR and layout analysis for maximum accuracy.

PDF to Structured Data

Extract structured key-value pairs, tables, and form fields from PDF documents. Uses layout analysis and LLM extraction to produce clean JSON output, even from complex forms and invoices.

Text to Embeddings

Convert text strings, paragraphs, or documents into dense vector embeddings using state-of-the-art language models. Supports batching, chunking, and multiple model options for optimal retrieval performance.

Mixed