Dense vector embeddings from text using E5-Large multilingual with chunking and LLM extraction
Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.
The text extractor generates dense vector embeddings from text using the E5-Large multilingual model. It is optimized for semantic search, RAG applications, and general-purpose text retrieval, and it supports text chunking/decomposition with multiple splitting strategies. It is fast (5 ms per document), free to run, and supports 100+ languages.
Keyword-heavy queries (e.g. “iPhone 15 Pro Max 256GB”): Lexical (BM25) search alongside the dense one
Critical technical terms or short texts (1–5 words): Lexical (BM25) search
Dense embeddings and exact-keyword matching are complementary. Keep text_extractor for semantic recall and add a lexical: true search over a text index for exact tokens — fuse them with rrf. See Lexical (BM25) Search.
The text extractor resolves its model through the central embedding registry. Leave embedding_model unset to use the current TEXT modality default (intfloat_e5_large_instruct_v1, 1024d). The registry hot-swaps this default when a new frontier text model ships, so existing collections pick it up without a code change.
Dimensions are locked at namespace creation. Switching embedding_model on an existing namespace requires a migration since the vector index dimensionality is fixed.
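To make the dimension lock concrete, here is a minimal sketch of the kind of check a fixed-dimension vector index implies. The `NamespaceIndex` class and `validate_vector` function are hypothetical names used only for illustration and are not part of the Mixpeek API; the 1024 figure is the default dimensionality stated above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NamespaceIndex:
    """Hypothetical stand-in for a namespace's vector index settings."""
    name: str
    dimensions: int  # fixed when the namespace is created


def validate_vector(index: NamespaceIndex, vector: Sequence[float]) -> None:
    """Raise ValueError if the vector does not match the index dimensionality."""
    if len(vector) != index.dimensions:
        raise ValueError(
            f"namespace {index.name!r} expects {index.dimensions}-d vectors, "
            f"got {len(vector)}-d; switching models requires a migration"
        )


index = NamespaceIndex(name="docs", dimensions=1024)
validate_vector(index, [0.0] * 1024)  # same dimensionality: accepted
```

A vector from a model with a different output size (say 768 dimensions) would fail this check, which is why a model switch on an existing namespace needs a migration rather than a config change.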
Instruction-aware embedding models (E5, Gemini) use a task hint to optimize the embedding for a specific downstream use case. By default, all extractors use retrieval_document at ingestion time, which produces embeddings optimized for asymmetric search (queries find documents).
Set embedding_task at the collection level, not on the extractor. See Collection Embedding Task for full details and examples.
| Task | Use Case | Effect on E5 | Effect on Gemini |
| --- | --- | --- | --- |
| retrieval_document | Default. Search: find documents from queries | Prepends "passage: " | Instructs “represent for retrieval” |
| retrieval_query | Rare at index time. Query-side is automatic | Prepends "query: " | Instructs “represent this query” |
| semantic_similarity | Symmetric comparison (deduplication, matching) | Prepends "query: " | Instructs “represent for similarity” |
| classification | Document categorization pipelines | Prepends "query: " | Instructs “represent for classification” |
| clustering | Grouping documents into clusters | Prepends "query: " | Not applied |
You almost never need to set this. The default retrieval_document is correct for search, and at query time Mixpeek automatically uses retrieval_query. Only override if your collection is primarily used for clustering, classification, or symmetric similarity — not retrieval.
Non-instruction-aware models (SigLIP, CLIP, Vertex multimodal) ignore this parameter.
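The E5 effects listed above amount to a lookup from task to text prefix. The sketch below mirrors that mapping in plain Python. The names `E5_TASK_PREFIXES` and `apply_e5_task_prefix` are hypothetical and only illustrate the behavior described here; the platform applies the prefix internally, so you would not normally write this yourself.

```python
# Task -> prefix mapping, taken from the "Effect on E5" descriptions above.
E5_TASK_PREFIXES = {
    "retrieval_document": "passage: ",
    "retrieval_query": "query: ",
    "semantic_similarity": "query: ",
    "classification": "query: ",
    "clustering": "query: ",
}


def apply_e5_task_prefix(text: str, embedding_task: str = "retrieval_document") -> str:
    """Return the text with the E5 prefix for the given task (illustrative helper)."""
    if embedding_task not in E5_TASK_PREFIXES:
        raise ValueError(f"unknown embedding_task: {embedding_task!r}")
    return E5_TASK_PREFIXES[embedding_task] + text


# Ingestion uses the retrieval_document default; queries use retrieval_query.
doc_input = apply_e5_task_prefix("How to reset a password")
query_input = apply_e5_task_prefix("reset password", "retrieval_query")
```

The asymmetry is the point: documents are embedded with one prefix and queries with another, which is why the default works for search without any override.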
text_extractor produces dense embeddings — great for semantic recall, weak on exact tokens. For exact-keyword precision, pair it with a lexical (BM25) search (the lexical: true option on a feature_search stage, backed by a text payload index — not a separate extractor).
| Dimension | Dense (text_extractor) | Lexical (BM25, lexical: true) |
| --- | --- | --- |
| Matches | Meaning / paraphrase | Exact tokens, SKUs, codes, prices |
| Semantic recall | Excellent | Poor |
| Exact matching | Poor | Excellent |
| Multi-language | Excellent (E5-Large) | Good (token-based) |
| Requires | Embedding index | text payload index |
The strongest setup is hybrid: run a dense search and a lexical search in the same feature_search stage and fuse with rrf. See Lexical (BM25) Search.
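For intuition about what rrf does when it fuses the two result lists, here is a minimal sketch of standard reciprocal rank fusion. It is a generic illustration, not Mixpeek's implementation: the function name `rrf_fuse`, the document IDs, and the constant `k=60` (a commonly used default in the RRF literature) are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical dense (semantic) ranking
lexical_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical lexical (BM25) ranking
fused = rrf_fuse([dense_hits, lexical_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF uses only rank positions, it does not need the dense similarity scores and the BM25 scores to be on the same scale. Documents that appear in both lists rise to the top.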
In retrievers, reference this feature by its Feature URI above (the output name is multilingual_e5_large_instruct_v1, not the index name text_extractor_v1_embedding).