Document RAG Pipeline

Retrieval-augmented generation for document collections. The pipeline extracts text, tables, and figures from PDFs using OCR and layout analysis, then retrieves the relevant page sections to answer natural-language questions with precise page and section citations.

text

image

Multi-Stage

3.4K runs

from mixpeek import Mixpeek
from openai import OpenAI

client = Mixpeek(api_key="YOUR_API_KEY")
openai = OpenAI(api_key="YOUR_OPENAI_KEY")

# Create document collection with layout-aware extraction
collection = client.collections.create(
    namespace_id="ns_your_namespace",
    name="policy_documents",
    extractors=["document-graph-extractor", "text-extractor"]
)

# Upload PDFs
client.buckets.upload(bucket_id="bkt_docs", url="s3://your-bucket/policies/")

# Define the question once and reuse it for both retrieval and generation
question = "What is the refund policy for enterprise customers?"

# Retrieve relevant document sections
results = client.retrievers.execute(
    retriever_id="ret_doc_rag",
    query={"text": question}
)

# Build context as numbered excerpts with document and page citations
context = "\n".join([
    f"[{i+1}] {doc['text']} (Document: {doc['root_object_id']}, Page {doc['page_number']})"
    for i, doc in enumerate(results["results"])
])

# Generate an answer grounded in the retrieved excerpts
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer using these document excerpts:\n{context}"},
        {"role": "user", "content": question}
    ]
)
print(response.choices[0].message.content)
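
The context-building step above indexes each result directly, so a hit that lacks a page number or text would raise a KeyError. A more defensive version might look like the sketch below. It assumes only the result fields used in the sample (`text`, `root_object_id`, `page_number`). The helper name `build_context` and the fallback wording are hypothetical and not part of the Mixpeek SDK.

```python
from typing import Any, Mapping, Sequence


def build_context(docs: Sequence[Mapping[str, Any]]) -> str:
    """Format retrieved excerpts as numbered lines with document and page citations.

    Hypothetical helper: the field names mirror the sample above and are
    assumptions about the shape of each retrieved result.
    """
    lines = []
    for doc in docs:
        text = str(doc.get("text", "")).strip()
        if not text:
            continue  # skip hits that carry no extracted text
        source = doc.get("root_object_id", "unknown document")
        page = doc.get("page_number")
        where = f"Page {page}" if page is not None else "page unknown"
        # Number only the excerpts that are kept, so citations stay contiguous
        lines.append(f"[{len(lines) + 1}] {text} (Document: {source}, {where})")
    return "\n".join(lines)
```

With this in place, the context line in the sample could become `context = build_context(results["results"])`, and the numbered markers remain usable as citation targets in the generated answer.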

Feature Extractors

Retriever Stages

rerank

Rerank documents using cross-encoder models for accurate relevance

sort

summarize

Condense multiple documents into a summary using an LLM

reduce
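
To make the rerank and summarize stages more concrete, here is a minimal local sketch of what such stages do conceptually. It is not Mixpeek's implementation. `overlap_score` is a toy token-overlap scorer standing in for a cross-encoder model, and `llm` is any caller-supplied function that maps a prompt string to a completion. All names here are hypothetical.

```python
from typing import Callable, List, Sequence


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the text.

    Stand-in for a cross-encoder, which would score the (query, text) pair
    jointly with a trained model.
    """
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def rerank(
    query: str,
    texts: Sequence[str],
    score: Callable[[str, str], float] = overlap_score,
    top_k: int = 3,
) -> List[str]:
    """Reorder candidate texts by descending relevance and keep the top_k."""
    ranked = sorted(texts, key=lambda t: score(query, t), reverse=True)
    return ranked[:top_k]


def summarize(texts: Sequence[str], llm: Callable[[str], str]) -> str:
    """Condense several excerpts into one summary via a caller-supplied LLM."""
    joined = "\n\n".join(f"[{i}] {t}" for i, t in enumerate(texts, start=1))
    prompt = f"Summarize the following excerpts in a few sentences:\n\n{joined}"
    return llm(prompt)
```

Passing the scorer and the LLM in as plain callables keeps the sketch runnable without any model dependency. A real cross-encoder or chat-completion call could be substituted without changing the stage logic.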

Related Recipes & Resources

Multimodal Knowledge Base

Document Intelligence Search

PDF Data Extraction Pipeline

Document Classification Pipeline

BYO Embeddings Vector Search

Web Scraper