Dense vector embeddings from text using E5-Large multilingual with chunking and LLM extraction
The text extractor generates dense vector embeddings from text using the E5-Large multilingual model. It is optimized for semantic search, RAG applications, and general-purpose text retrieval, and it supports text chunking/decomposition with multiple splitting strategies. It is fast (about 5 ms per document), free to use, and supports 100+ languages.
The text extractor resolves its model through the central embedding registry. Leave embedding_model unset to use the current TEXT modality default (intfloat_e5_large_instruct_v1, 1024d). The registry hot-swaps that default when a new frontier text model ships, so existing collections pick it up without a code change.
Dimensions are locked at namespace creation. Switching embedding_model on an existing namespace requires a migration since the vector index dimensionality is fixed.
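The resolution and dimension-locking rules above can be sketched in a few lines. This is an illustration only, not Mixpeek's implementation: the REGISTRY dict, both helper functions, and the "example_small_text_v1" entry are hypothetical, and the sketch assumes that a dimension mismatch is what forces the migration. Only the default model name and its 1024 dimensions come from this page.

```python
from typing import Optional

# Taken from this page: the current TEXT modality default.
DEFAULT_TEXT_MODEL = "intfloat_e5_large_instruct_v1"

# Hypothetical stand-in for the central embedding registry (model name -> dimensions).
# "example_small_text_v1" is an invented entry used only for illustration.
REGISTRY = {
    "intfloat_e5_large_instruct_v1": 1024,
    "example_small_text_v1": 384,
}


def resolve_text_model(embedding_model: Optional[str] = None) -> str:
    """Return the explicit model if one is set, otherwise the TEXT default."""
    model = embedding_model or DEFAULT_TEXT_MODEL
    if model not in REGISTRY:
        raise KeyError(f"unknown embedding model: {model}")
    return model


def needs_migration(namespace_dims: int, embedding_model: Optional[str] = None) -> bool:
    """True when the resolved model's dimensions differ from the namespace's locked dimensions."""
    return REGISTRY[resolve_text_model(embedding_model)] != namespace_dims
```

With a namespace created at 1024 dimensions, leaving embedding_model unset resolves to the default and needs no migration, while the hypothetical 384-dimension model would.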
Instruction-aware embedding models (E5, Gemini) use a task hint to optimize the embedding for a specific downstream use case. By default, all extractors use retrieval_document at ingestion time, which produces embeddings optimized for asymmetric search (queries find documents).
Set embedding_task at the collection level, not on the extractor. See Collection Embedding Task for full details and examples.
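As a rough sketch of where the setting lives, the helper below builds a collection configuration with embedding_task at the top level and nothing task-related under the extractor. The payload shape is an assumption: the field names collection_name, feature_extractor, and feature_extractor_name, the "text_extractor" identifier, and the build_collection_config helper are all hypothetical. Only embedding_task and its five values come from this page; check Collection Embedding Task for the real schema.

```python
# Task values documented on this page.
VALID_TASKS = {
    "retrieval_document",
    "retrieval_query",
    "semantic_similarity",
    "classification",
    "clustering",
}


def build_collection_config(name: str, embedding_task: str = "retrieval_document") -> dict:
    """Build a hypothetical collection payload with the task set at collection level."""
    if embedding_task not in VALID_TASKS:
        raise ValueError(f"unsupported embedding_task: {embedding_task!r}")
    return {
        "collection_name": name,           # hypothetical field name
        "embedding_task": embedding_task,  # collection level, as this page requires
        "feature_extractor": {             # hypothetical field name
            "feature_extractor_name": "text_extractor",  # hypothetical identifier
            # Deliberately no embedding_task here: it is not an extractor setting.
        },
    }


config = build_collection_config("support_articles", embedding_task="clustering")
```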
| Task | Use Case | Effect on E5 | Effect on Gemini |
| --- | --- | --- | --- |
| retrieval_document | Default. Search: find documents from queries | Prepends "passage: " | Instructs “represent for retrieval” |
| retrieval_query | Rare at index time. Query-side is automatic | Prepends "query: " | Instructs “represent this query” |
| semantic_similarity | Symmetric comparison (deduplication, matching) | Prepends "query: " | Instructs “represent for similarity” |
| classification | Document categorization pipelines | Prepends "query: " | Instructs “represent for classification” |
| clustering | Grouping documents into clusters | Prepends "query: " | Not applied |
You almost never need to set embedding_task. The default retrieval_document is correct for search, and at query time Mixpeek automatically uses retrieval_query. Only override it if your collection is primarily used for clustering, classification, or symmetric similarity rather than retrieval.
Non-instruction-aware models (SigLIP, CLIP, Vertex multimodal) ignore this parameter.
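The E5 effects listed above amount to a prefix lookup, which the following sketch makes concrete. The prefixes are copied from this page; the apply_e5_task function and its instruction_aware flag are hypothetical names used for illustration, not part of the extractor's API.

```python
# E5 prefix per task, copied from the "Effect on E5" entries on this page.
E5_PREFIXES = {
    "retrieval_document": "passage: ",
    "retrieval_query": "query: ",
    "semantic_similarity": "query: ",
    "classification": "query: ",
    "clustering": "query: ",
}


def apply_e5_task(
    text: str,
    embedding_task: str = "retrieval_document",
    instruction_aware: bool = True,
) -> str:
    """Return the text as it would be fed to the model for the given task."""
    if not instruction_aware:
        # Non-instruction-aware models ignore the task hint entirely.
        return text
    if embedding_task not in E5_PREFIXES:
        raise ValueError(f"unsupported embedding_task: {embedding_task!r}")
    return E5_PREFIXES[embedding_task] + text


indexed = apply_e5_task("Reset your password from the login page.")
queried = apply_e5_task("how do I reset my password", "retrieval_query")
```

The asymmetry is visible in the two calls: the document is embedded with the "passage: " prefix at ingestion, and the query with "query: " at search time.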