
    Document RAG Pipeline

    Retrieval-augmented generation for document collections. Extracts text, tables, and figures from PDFs using OCR and layout analysis, then retrieves relevant page sections to answer natural language questions with precise page and section citations.

    text
    image
    Multi-Stage
    3.4K runs
    from mixpeek import Mixpeek
    from openai import OpenAI

    client = Mixpeek(api_key="YOUR_API_KEY")
    openai = OpenAI(api_key="YOUR_OPENAI_KEY")

    # Create document collection with layout-aware extraction
    collection = client.collections.create(
        namespace_id="ns_your_namespace",
        name="policy_documents",
        extractors=["document-graph-extractor", "text-extractor"],
    )

    # Upload PDFs
    client.buckets.upload(bucket_id="bkt_docs", url="s3://your-bucket/policies/")

    # Retrieve relevant document sections
    question = "What is the refund policy for enterprise customers?"
    results = client.retrievers.execute(
        retriever_id="ret_doc_rag",
        query={"text": question},
    )

    # Build context with page citations
    context = "\n".join([
        f"[{i+1}] {doc['text']} (Document: {doc['root_object_id']}, Page {doc['page_number']})"
        for i, doc in enumerate(results["results"])
    ])

    # Generate answer grounded in the retrieved excerpts
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using these document excerpts:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    print(response.choices[0].message.content)
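
The context-building step in the recipe indexes `doc['text']`, `doc['root_object_id']`, and `doc['page_number']` directly, so a result missing any of those fields would raise a `KeyError`. Below is a minimal sketch of a more defensive variant, assuming each result is a plain dict shaped like the ones in the recipe. The helper name `build_context`, the fallback values, and the citation lookup are hypothetical illustrations, not part of the Mixpeek SDK.

```python
from typing import Any


def build_context(results: list[dict[str, Any]]) -> tuple[str, dict[int, dict[str, Any]]]:
    """Format retrieved sections as numbered excerpts and keep a citation lookup.

    Assumes each result may carry 'text', 'root_object_id', and 'page_number'
    keys, as in the recipe; missing fields fall back to placeholders.
    """
    lines: list[str] = []
    citations: dict[int, dict[str, Any]] = {}
    for doc in results:
        text = (doc.get("text") or "").strip()
        if not text:
            continue  # skip hits that have no extracted text
        number = len(lines) + 1  # citation numbers stay contiguous after skips
        source = doc.get("root_object_id", "unknown")
        page = doc.get("page_number", "?")
        lines.append(f"[{number}] {text} (Document: {source}, Page {page})")
        citations[number] = {"document": source, "page": page}
    return "\n".join(lines), citations
```

The returned lookup makes it possible to map a `[n]` marker in the generated answer back to its document and page, for example `context, citations = build_context(results["results"])`.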

    Feature Extractors

    Retriever Stages

    rerank

    Rerank retrieved documents with a cross-encoder model for more accurate relevance ordering

    sort

    summarize

    Condense multiple documents into a summary using an LLM

    reduce
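
Of the stages listed above, rerank and summarize come with descriptions. The sketch below shows how those two might chain together: score each (query, document) pair, keep the top hits, and assemble a summarization prompt. It does not reproduce Mixpeek's stage implementations. The function names are hypothetical, and `overlap_score` is a toy lexical stand-in for a real cross-encoder, used only so the example runs without a model download.

```python
from typing import Callable


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in scorer: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(
    query: str,
    docs: list[dict],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_k: int = 3,
) -> list[dict]:
    """Score every (query, doc text) pair jointly and keep the top_k documents."""
    scored = [(score_fn(query, doc["text"]), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [doc for _, doc in scored[:top_k]]


def build_summary_messages(docs: list[dict], max_chars: int = 4000) -> list[dict]:
    """Build chat messages asking an LLM to condense the documents into one summary."""
    excerpts: list[str] = []
    used = 0
    for i, doc in enumerate(docs, start=1):
        entry = f"[{i}] {doc['text'].strip()}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the character budget
        excerpts.append(entry)
        used += len(entry)
    return [
        {
            "role": "system",
            "content": "Summarize the numbered excerpts in a few sentences. Cite excerpt numbers.",
        },
        {"role": "user", "content": "\n".join(excerpts)},
    ]
```

The messages returned by `build_summary_messages` have the same shape as those passed to `openai.chat.completions.create` in the recipe. A real cross-encoder could be supplied as `score_fn` in place of the toy scorer.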