The course content extractor decomposes educational content into atomic learning units optimized for semantic retrieval. It processes video lectures with automatic transcription, PDF slides with layout awareness, and code archives with function-level granularity. Each unit receives an E5-Large text embedding (1024D), a Jina Code embedding (768D) for code snippets, and an optional SigLIP visual embedding (768D) for figures and screenshots.
Pipeline Steps
1. Filter Dataset (if collection_id is provided)
   - Filter to the specified collection
2. Content Detection & Routing
   - Auto-detect content type: video, PDF, or code archive
   - Route to the appropriate processor
3. Video Segmentation (if video input)
   - Scene-based segmentation or SRT subtitle-based segmentation
   - Extract transcripts via Whisper ASR (or use the provided SRT)
   - OCR video frames for screen text detection
4. PDF Decomposition (if PDF input)
   - Layout detection: paragraphs, headers, tables, lists, figures, code blocks
   - Layout-aware extraction per element or per page
   - Extract images and figures with bounding boxes
5. Code Archive Processing (if code input)
   - Extract source files from the ZIP archive
   - Segment code into individual functions/classes
   - Auto-detect the programming language
6. Multi-Modal Embedding Generation
   - E5-Large (1024D) for transcripts, PDF text, and captions
   - Jina Code v2 (768D) for code snippets and functions
   - SigLIP (768D) for figures, screenshots, diagrams (optional)
7. LLM Enrichment (optional: if enrich_with_llm=true)
   - Generate summaries using Gemini
   - Add semantic context and key concepts
Output
- Learning units with text_content, code_content, screen_text
- Layout types, timing info, language tags
- Multiple embeddings per unit for diverse search scenarios (illustrated in the sketch below)
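For instance, when expand_to_granular_docs is enabled (the default), a single video segment can yield one document per granularity type. A minimal sketch of the resulting doc_type values; the field values are illustrative, not actual extractor output:
[
  { "unit_type": "video_segment", "doc_type": "transcript", "text_content": "..." },
  { "unit_type": "video_segment", "doc_type": "screen_text", "screen_text": "..." },
  { "unit_type": "video_segment", "doc_type": "visual", "thumbnail_url": "..." }
]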
When to Use
| Use Case | Description |
|---|---|
| Online courses | Extract lectures, slides, and code into searchable learning units |
| Technical documentation | Decompose guides with code examples into semantic chunks |
| Code tutorials | Segment video + PDF + code into aligned learning units |
| Educational archives | Index historical lecture materials with multiple content types |
| Multilingual learning | Process educational content across 100+ languages |
| API documentation | Extract text, code examples, and diagrams with visual search |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Simple text documents only | text_extractor (faster, simpler) |
| Images and photos only | image_extractor |
| Single PDF documents | document_graph_extractor (better OCR, confidence scoring) |
| Pre-transcribed videos | text_extractor (use transcripts directly) |
Input Schema
| Field | Type | Required | Description |
|---|---|---|---|
| video | string | one of three | URL or S3 path to a video file (MP4, WebM, MOV). Maximum: 4 hours. Format is auto-detected. |
| srt | string | optional (with video) | URL or S3 path to an SRT subtitle file. Used if present; otherwise Whisper ASR generates transcripts. |
| pdf | string | one of three | URL or S3 path to a PDF document. Multi-page supported. Maximum: 500 pages. |
| code_archive | string | one of three | URL or S3 path to a ZIP archive containing source code. Maximum: 100MB. |
Exactly one of video, pdf, or code_archive must be provided.
{
  "video": "s3://my-bucket/lectures/intro-to-ml.mp4",
  "srt": "s3://my-bucket/lectures/intro-to-ml.srt"
}
Input Examples:
| Type | Example |
|---|---|
| Video with subtitles | {"video": "https://cdn.example.com/lecture.mp4", "srt": "https://cdn.example.com/lecture.srt"} |
| PDF slides | {"pdf": "s3://courses/machine-learning/slides-week-1.pdf"} |
| Code archive | {"code_archive": "s3://tutorials/python-algorithms.zip"} |
Output Schema
Each learning unit produces one or more documents, depending on the content type and the expand_to_granular_docs setting:
| Field | Type | Description |
|---|---|---|
| unit_type | string | Type of unit: video_segment, pdf_element, code_function, screen_text, figure |
| doc_type | string | Granular type: transcript, code, screen_text, visual, paragraph, table, list, header, figure |
| text_content | string | Extracted text content |
| code_content | string | Source code (if applicable) |
| code_language | string | Programming language (Python, JavaScript, Java, etc.) |
| screen_text | string | OCR text from video frames or PDF screenshots |
| title | string | Unit title (lecture title, function name, figure caption) |
| start_time | number | Video start time in seconds (video units only) |
| end_time | number | Video end time in seconds (video units only) |
| page_number | integer | PDF page number (0-indexed, PDF units only) |
| element_index | integer | Element position within page (PDF units only) |
| start_line | integer | Start line number (code units only) |
| end_line | integer | End line number (code units only) |
| segment_index | integer | Segment position within source (video units only) |
| element_type | string | PDF layout type: paragraph, header, list, table, figure, code, footer |
| bbox | object | Bounding box {x, y, width, height} (PDF elements with visual positioning) |
| thumbnail_url | string | S3 URL of thumbnail image (video frames, figure screenshots) |
| intfloat__multilingual_e5_large_instruct | float[1024] | E5-Large text embedding, L2 normalized |
| jinaai__jina_embeddings_v2_base_code | float[768] | Jina Code embedding (code units only) |
| google__siglip_base_patch16_224 | float[768] | SigLIP visual embedding (if run_visual_embedding=true) |
| llm_summary | string | LLM-generated summary (if enrich_with_llm=true) |
{
  "unit_type": "video_segment",
  "doc_type": "transcript",
  "text_content": "In this section, we explore supervised learning algorithms...",
  "screen_text": "SUPERVISED LEARNING \n - Regression \n - Classification",
  "title": "Intro to ML: Supervised Learning",
  "start_time": 120.5,
  "end_time": 245.3,
  "segment_index": 3,
  "thumbnail_url": "s3://mixpeek/ns_123/thumbnails/seg_3.jpg",
  "intfloat__multilingual_e5_large_instruct": [0.023, -0.041, 0.018, ...],
  "llm_summary": "Introduction to supervised learning covering regression and classification techniques"
}
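For comparison, a code unit carries code-specific fields instead of timing information. The document below is a sketch using the schema fields above; the function name, line numbers, and embedding values are invented for illustration:
{
  "unit_type": "code_function",
  "doc_type": "code",
  "title": "train_model",
  "code_content": "def train_model(X, y):\n    ...",
  "code_language": "python",
  "start_line": 42,
  "end_line": 87,
  "jinaai__jina_embeddings_v2_base_code": [0.012, -0.034, ...]
}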
Parameters
Video Segmentation Parameters
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| target_segment_duration_ms | integer | 120000 | 30000-600000 | Target duration for each video segment (30 sec - 10 min) |
| min_segment_duration_ms | integer | 30000 | 10000+ | Minimum segment duration to create |
| segmentation_method | string | "scene" | scene, srt, time | Segmentation strategy: scene detection, SRT markers, or fixed time intervals |
| scene_detection_threshold | float | 0.3 | 0.1-0.9 | Scene change sensitivity (lower = more scenes detected) |
| use_whisper_asr | boolean | true | - | Use Whisper ASR for transcription if SRT not provided |
| expand_to_granular_docs | boolean | true | - | Create separate documents for transcript, screen_text, and visual (one per granularity type) |
| ocr_frames_per_segment | integer | 3 | 1-10 | Number of frames to OCR per segment |
Segmentation Methods
| Method | Description | Best For |
|---|---|---|
| scene | ML-based scene detection (PySceneDetect) | Lectures with natural topic breaks |
| srt | Use SRT subtitle markers as boundaries | Prepared materials with timing metadata |
| time | Fixed time intervals | Uniform segment length regardless of content |
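As an example, uniform five-minute segments can be requested with the time method. A sketch of the relevant parameters (values chosen for illustration, within the documented ranges):
{
  "segmentation_method": "time",
  "target_segment_duration_ms": 300000,
  "min_segment_duration_ms": 60000
}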
PDF Decomposition Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| pdf_extraction_mode | string | "per_element" | per_page (one doc per page) or per_element (one doc per detected element) |
| pdf_render_dpi | integer | 150 | DPI for rendering PDF pages (72-300). Higher = better OCR quality, slower |
| detect_code_in_pdf | boolean | true | Automatically detect and tag code blocks in PDF text |
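For slide decks where each page is one cohesive unit, per-page extraction at a higher DPI may be preferable. An illustrative parameter sketch:
{
  "pdf_extraction_mode": "per_page",
  "pdf_render_dpi": 300,
  "detect_code_in_pdf": true
}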
Code Processing Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| segment_functions | boolean | true | Segment code files into individual functions/classes |
| supported_languages | array | ["python", "javascript", "java", "go", "rust", "c", "cpp"] | Programming languages to extract and embed |
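Assuming the language list can be narrowed, a sketch that restricts extraction to Python and Go sources at function granularity:
{
  "segment_functions": true,
  "supported_languages": ["python", "go"]
}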
Embedding & Output Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| run_text_embedding | boolean | true | Generate E5-Large text embeddings for transcripts and text content |
| run_code_embedding | boolean | true | Generate Jina Code embeddings for code snippets |
| run_visual_embedding | boolean | false | Generate SigLIP visual embeddings for figures and screenshots |
| visual_embedding_use_case | string | "lecture" | Context for visual embedding: lecture, code_demo, tutorial, presentation, dynamic |
| extract_screen_text | boolean | true | Run OCR on video frames to extract on-screen text |
| generate_thumbnails | boolean | true | Generate and store thumbnail images |
| use_cdn | boolean | false | Use CDN for thumbnail delivery (if available) |
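Visual embeddings are off by default; enabling them for slide-heavy presentations might look like the following sketch (parameter values are illustrative):
{
  "run_text_embedding": true,
  "run_visual_embedding": true,
  "visual_embedding_use_case": "presentation",
  "generate_thumbnails": true
}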
LLM Enrichment Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| enrich_with_llm | boolean | false | Enable LLM-generated summaries and key concept extraction |
| llm_prompt | string | "Summarize this educational content, highlighting key concepts, learning objectives, and main takeaways" | Custom prompt for LLM enrichment |
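Enrichment is opt-in. A sketch enabling it with a custom prompt (the prompt text here is invented for illustration):
{
  "enrich_with_llm": true,
  "llm_prompt": "Summarize this lecture segment and list the prerequisites it assumes"
}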
Configuration Examples
The configuration below illustrates video processing with scene detection. Other common setups include PDF slides with code detection, a code archive with all embeddings (a sketch of that case follows the example), video with full enrichment, and PDF with LLM summaries.
{
  "feature_extractor": {
    "feature_extractor_name": "course_content_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "payload.lecture_url",
      "srt": "payload.subtitle_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.course_id" },
      { "source_path": "metadata.lesson_number" }
    ],
    "parameters": {
      "target_segment_duration_ms": 120000,
      "segmentation_method": "scene",
      "scene_detection_threshold": 0.3,
      "use_whisper_asr": true,
      "expand_to_granular_docs": true,
      "ocr_frames_per_segment": 3,
      "run_text_embedding": true,
      "run_code_embedding": true,
      "run_visual_embedding": false,
      "generate_thumbnails": true
    }
  }
}
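For the code-archive-with-all-embeddings scenario mentioned above, a configuration could combine function-level segmentation with all three embedding types. This is a hedged sketch following the same structure as the example above; the input_mappings path is an assumption, not taken from the docs:
{
  "feature_extractor": {
    "feature_extractor_name": "course_content_extractor",
    "version": "v1",
    "input_mappings": {
      "code_archive": "payload.archive_url"
    },
    "parameters": {
      "segment_functions": true,
      "supported_languages": ["python", "javascript", "go"],
      "run_text_embedding": true,
      "run_code_embedding": true,
      "run_visual_embedding": true
    }
  }
}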
Performance & Cost
| Metric | Value |
|---|---|
| Video processing | ~1 minute per 10 minutes of video (depends on segmentation) |
| PDF processing | ~2-5 seconds per page (depends on DPI and layout complexity) |
| Code processing | ~50-100ms per 1KB of code |
| Embedding latency | ~5ms per text unit (E5), ~10ms per code unit (Jina), ~50ms per visual unit (SigLIP) |
| Cost (Tier 2) | 20 credits per video minute, 5 credits per PDF page, 2 credits per 1K code tokens |
| GPU acceleration | Recommended for 10+ videos; 2-3x speedup |
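For example, at these Tier 2 rates a 30-minute lecture video plus a 40-page slide deck would cost roughly 30 × 20 + 40 × 5 = 800 credits, before any code processing.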
Vector Indexes
All three embeddings are stored as MVS named vectors for hybrid search:
Index 1
| Property | Value |
|---|---|
| Name | intfloat__multilingual_e5_large_instruct |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Normalization | L2 normalized |
Index 2
| Property | Value |
|---|---|
| Name | jinaai__jina_embeddings_v2_base_code |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Normalization | L2 normalized |
Index 3
| Property | Value |
|---|---|
| Name | google__siglip_base_patch16_224 |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Inference model | google_siglip_base_v1 |
| Status | Optional (if run_visual_embedding=true) |
Comparison with Other Extractors
| Feature | course_content_extractor | text_extractor | multimodal_extractor | document_graph_extractor |
|---|---|---|---|---|
| Input types | Video, PDF, Code | Text only | Video, Image, Text | PDF only |
| Segmentation | Scene/SRT/time | Word/sentence/paragraph | N/A | Layout-based |
| Text embeddings | E5-Large (1024D) | E5-Large (1024D) | Vertex AI (1408D) | E5-Large (1024D) |
| Code embeddings | Jina Code (768D) | ✗ | ✗ | ✗ |
| Visual embeddings | SigLIP (768D) optional | ✗ | Vertex AI (1408D) | ✗ |
| Best for | Educational content | Text search | Unified multimodal | Complex PDF layouts |
| Cost per unit | Medium (2-20 credits) | Low (1 credit/1K tokens) | 50 credits/min video | 5 credits/page |
Limitations
- Video length: Optimized for videos up to 4 hours. Longer videos may require segmentation.
- Transcription quality: Whisper ASR works best with clear audio; noisy lectures may have reduced accuracy.
- Code extraction: Requires valid ZIP archives; loose files not supported.
- Language support: Code embedding works with common languages; domain-specific DSLs have reduced accuracy.
- PDF complexity: Complex layouts with nested tables may have reduced extraction quality.
- Visual embeddings: Optional and add significant processing cost.