Document RAG Pipeline
Retrieval-augmented generation for document collections. Extracts text, tables, and figures from PDFs using OCR and layout analysis, then retrieves relevant page sections to answer natural language questions with precise page and section citations.
from mixpeek import Mixpeek
from openai import OpenAI

client = Mixpeek(api_key="YOUR_API_KEY")
openai = OpenAI(api_key="YOUR_OPENAI_KEY")

# Create document collection with layout-aware extraction
collection = client.collections.create(
    namespace_id="ns_your_namespace",
    name="policy_documents",
    extractors=["document-graph-extractor", "text-extractor"]
)

# Upload PDFs
client.buckets.upload(bucket_id="bkt_docs", url="s3://your-bucket/policies/")

# Retrieve relevant document sections
results = client.retrievers.execute(
    retriever_id="ret_doc_rag",
    query={"text": "What is the refund policy for enterprise customers?"}
)

# Build context with page citations
context = "\n".join([
    f"[{i+1}] {doc['text']} (Document: {doc['root_object_id']}, Page {doc['page_number']})"
    for i, doc in enumerate(results["results"])
])

# Generate answer
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer using these document excerpts:\n{context}"},
        {"role": "user", "content": "What is the refund policy for enterprise customers?"}
    ]
)
print(response.choices[0].message.content)
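The context-building step in the snippet above assumes every result carries `text`, `root_object_id`, and `page_number`. Below is a minimal sketch of a more defensive version. The helper name `build_context` and the sample records are hypothetical, and the field names are taken from the snippet rather than from a confirmed response schema.

```python
from typing import Any


def build_context(results: list[dict[str, Any]]) -> str:
    """Format retrieved excerpts as numbered lines with document and page citations."""
    lines: list[str] = []
    for doc in results:
        text = (doc.get("text") or "").strip()
        if not text:
            # Skip results with no extracted text instead of emitting an empty citation.
            continue
        # Fall back to placeholders when a citation field is missing (assumed optional).
        source = doc.get("root_object_id", "unknown")
        page = doc.get("page_number", "n/a")
        # Number by kept lines so citation indices stay contiguous after skips.
        lines.append(f"[{len(lines) + 1}] {text} (Document: {source}, Page {page})")
    return "\n".join(lines)


# Made-up sample records, for illustration only.
sample = [
    {"text": "Enterprise refunds are prorated.", "root_object_id": "doc_1", "page_number": 4},
    {"text": "", "root_object_id": "doc_2", "page_number": 9},
    {"text": "Requests must be filed within 30 days.", "root_object_id": "doc_1"},
]
print(build_context(sample))
```

Numbering by kept lines rather than by input position means the bracketed indices the model is asked to cite always match the excerpts it actually sees.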
Feature Extractors
Retriever Stages
rerank
Rerank documents using cross-encoder models for accurate relevance
summarize
Condense multiple documents into a summary using an LLM
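The two stages above can be sketched in plain Python to show how they fit together. This is an illustration under stated assumptions, not the pipeline's actual implementation: `overlap_score` is a toy token-overlap scorer standing in for a real cross-encoder model, and the names `rerank` and `summarize_messages` and the sample documents are hypothetical.

```python
from typing import Callable

Doc = dict[str, str]


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(
    query: str,
    docs: list[Doc],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_k: int = 3,
) -> list[Doc]:
    """Score each (query, document) pair jointly and keep the top_k highest."""
    ranked = sorted(docs, key=lambda d: score_fn(query, d["text"]), reverse=True)
    return ranked[:top_k]


def summarize_messages(query: str, docs: list[Doc]) -> list[dict[str, str]]:
    """Build chat messages asking an LLM to condense the excerpts into one summary."""
    excerpts = "\n".join(f"[{i}] {d['text']}" for i, d in enumerate(docs, start=1))
    return [
        {
            "role": "system",
            "content": "Condense the numbered excerpts into one short summary. "
                       "Cite excerpts by number.",
        },
        {"role": "user", "content": f"Question: {query}\n\nExcerpts:\n{excerpts}"},
    ]


# Made-up sample documents, for illustration only.
docs = [
    {"text": "Office hours are 9 to 5."},
    {"text": "The refund policy covers enterprise customers for 30 days."},
    {"text": "Refund requests need an invoice number."},
]
query = "refund policy enterprise"

top = rerank(query, docs, top_k=2)
messages = summarize_messages(query, top)
print(messages[1]["content"])
```

Swapping in a real cross-encoder only requires passing a different `score_fn`. The resulting `messages` list has the same shape as the one passed to `openai.chat.completions.create` in the snippet earlier on this page.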
