Clinical Documentation Structuring
Production-grade pipeline for ingesting clinical documents — scanned charts, EHR exports, wound photos, and therapy notes — and structuring them into coded fields aligned with MDS 3.0, PDPM, and CMS audit requirements. Combines OCR, clinical NER, taxonomy classification, and hybrid retrieval to turn unstructured bedside documentation into queryable, auditable data.
Why This Matters
Nurses spend up to 40% of their time on documentation instead of patient care. Clinical data lives in free-text notes, scanned forms, and photos that are invisible to billing and compliance systems. This recipe bridges the gap — extracting structured clinical data from every modality so MDS coordinators, billers, and surveyors can work from a single source of truth.
```python
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# 1. Create namespace for the facility
namespace = client.namespaces.create(name="facility-clinical-docs")

# 2. Build collection with clinical extractors
collection = client.collections.create(
    namespace_id=namespace.id,
    name="patient-charts",
    extractors=[
        "pdf-extraction",     # OCR for scanned charts
        "text-embedding-v2",  # Semantic embeddings
        "image-captioning",   # Wound photos, imaging
    ],
)

# 3. Upload clinical documents
client.buckets.upload(
    collection_id=collection.id,
    url="s3://facility-ehr-export/patient-charts/",
)

# 4. Create MDS-aligned retriever
retriever = client.retrievers.create(
    namespace_id=namespace.id,
    name="mds-documentation",
    stages=[
        {"type": "hybrid_search", "vector_weight": 0.5, "bm25_weight": 0.5, "top_k": 50},
        {"type": "attribute_filter", "conditions": [{"field": "mds_section", "operator": "in", "value": ["G", "J", "K"]}]},
        {"type": "rerank", "model": "colbert-v2", "top_k": 10},
    ],
)

# 5. Retrieve MDS-relevant documentation
results = client.retrievers.execute(
    retriever_id=retriever.id,
    query="functional mobility and ADL performance for Section G",
)
for r in results:
    print(f"[{r.metadata.get('mds_section')}] {r.content[:120]}")
```
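The `hybrid_search` stage in the example weights vector and BM25 scores equally (0.5 each). This page does not say how the service combines the two score lists. One common approach is to min-max normalize each list and take a weighted sum, and the sketch below illustrates that idea only. `min_max`, `fuse_scores`, and the sample scores are hypothetical and are not part of the Mixpeek API.

```python
def min_max(scores):
    """Scale a {doc_id: score} map to [0, 1]; a constant list maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(vector_scores, bm25_scores,
                vector_weight=0.5, bm25_weight=0.5, top_k=50):
    """Weighted sum of normalized scores (assumed fusion rule, not confirmed)."""
    v = min_max(vector_scores)
    b = min_max(bm25_scores)
    fused = {
        doc_id: vector_weight * v.get(doc_id, 0.0) + bm25_weight * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)  # a doc missing from one list scores 0 there
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical scores for four chart notes.
vector_scores = {"note-1": 0.82, "note-2": 0.40, "note-3": 0.61}
bm25_scores = {"note-1": 3.1, "note-3": 7.4, "note-4": 1.2}

ranked = fuse_scores(vector_scores, bm25_scores)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.3f}")
```

With these sample inputs, `note-3` ranks first because it is the strongest keyword match and a mid-range semantic match. Shifting weight toward `vector_weight` would favor `note-1` instead.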
Feature Extractors
PDF Text Extraction
Extract structured text and layout information from PDFs
Image Captioning
Generate descriptive captions for images automatically
Retriever Stages
attribute filter
Filter documents by metadata attribute values using boolean logic
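To make the filter semantics concrete, here is a minimal local sketch of evaluating conditions shaped like the `{"field", "operator", "value"}` entries used in the retriever definition. Only the `in` operator appears on this page. The `eq` and `ne` operators, the AND-combination of conditions, and the `matches` helper are assumptions for illustration, not the service's documented behavior.

```python
# Operator table: "in" comes from the example; "eq" and "ne" are assumed.
OPERATORS = {
    "eq": lambda actual, expected: actual == expected,
    "ne": lambda actual, expected: actual != expected,
    "in": lambda actual, expected: actual in expected,
}


def matches(metadata, conditions):
    """Return True when every condition holds (conditions are AND-ed)."""
    return all(
        OPERATORS[c["operator"]](metadata.get(c["field"]), c["value"])
        for c in conditions
    )


# Hypothetical documents tagged with an MDS section.
docs = [
    {"id": "note-1", "metadata": {"mds_section": "G"}},
    {"id": "note-2", "metadata": {"mds_section": "M"}},
    {"id": "note-3", "metadata": {"mds_section": "K"}},
    {"id": "note-4", "metadata": {}},  # missing field never matches "in"
]
conditions = [
    {"field": "mds_section", "operator": "in", "value": ["G", "J", "K"]},
]

kept = [d["id"] for d in docs if matches(d["metadata"], conditions)]
print(kept)  # ['note-1', 'note-3']
```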
rerank
Rerank documents using cross-encoder models for accurate relevance
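The control flow of a rerank stage can be sketched as: score each (query, candidate) pair, sort by score, and keep the top `top_k`. The example below uses a token-overlap function as a stand-in scorer so that it runs without any model. It is not a cross-encoder and does not reflect how the service scores documents. `overlap_score`, `rerank`, and the sample notes are hypothetical.

```python
def overlap_score(query, text):
    """Placeholder relevance score: fraction of query tokens found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(query, candidates, score_fn=overlap_score, top_k=10):
    """Score every candidate against the query and keep the best top_k."""
    scored = [(score_fn(query, c), c) for c in candidates]
    # list.sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


# Hypothetical first-stage results.
candidates = [
    "Resident ambulates 50 feet with walker",
    "Diet order updated to mechanical soft",
    "Functional mobility improved; transfers with one assist",
]
query = "functional mobility and transfers"

top = rerank(query, candidates, top_k=2)
for text in top:
    print(text)
```

In a real pipeline, `score_fn` would wrap a learned relevance model. The rest of the flow (score, sort, truncate) stays the same.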
