The text extractor generates dense vector embeddings from text using the multilingual E5-Large model. It is optimized for semantic search, RAG applications, and general-purpose text retrieval, and it supports text chunking with multiple splitting strategies. It is fast (~5ms/doc), cost-effective (free, self-hosted), and covers 100+ languages.
Pipeline Steps
1. Filter Dataset (if collection_id provided)
   – Filter to the specified collection
2. Apply Input Mappings
   – Resolve the text field from the source (e.g., transcription, content, data)
3. Text Chunking (conditional: runs only if split_by != "none")
   – Split by characters, words, sentences, paragraphs, or pages
   – Configure chunk_size and chunk_overlap
   – Each chunk becomes a separate document
4. E5 Text Embedding Generation
   – Multilingual E5-Large model (1024D)
   – L2 normalized vectors
   – Batch size: 4,096 texts
Output
– Text documents with embeddings
– One document per input (or one per chunk if chunking is enabled)
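The pipeline above can be sketched end to end. This is a minimal illustration, not the service's actual implementation: `run_pipeline` and `fake_embed` are hypothetical names, the embedder is a stub standing in for the E5-Large model, and only the `words` strategy is shown; the field names mirror the ones described in this document.

```python
from typing import Callable

def run_pipeline(records: list[dict], text_field: str,
                 embed: Callable[[str], list[float]],
                 split_by: str = "none", chunk_size: int = 100,
                 chunk_overlap: int = 0) -> list[dict]:
    """One output document per input record, or per chunk if chunking is on."""
    docs = []
    for rec in records:
        text = rec[text_field]          # "Apply Input Mappings" step
        if split_by == "words":         # one of the splitting strategies
            words = text.split()
            step = max(chunk_size - chunk_overlap, 1)
            chunks = [" ".join(words[i:i + chunk_size])
                      for i in range(0, len(words), step)]
        else:                           # split_by == "none": no chunking
            chunks = [text]
        for chunk in chunks:
            docs.append({"text": chunk,
                         "text_extractor_v1_embedding": embed(chunk)})
    return docs

# Stub embedder: a real deployment would call the E5-Large model instead.
fake_embed = lambda text: [0.0] * 1024
```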
When to Use
| Use Case | Description |
| --- | --- |
| Product search | Search products by natural language descriptions |
| FAQ matching | Match user questions to knowledge base articles |
| Document retrieval | Find relevant documents from large corpora |
| Content discovery | Recommend similar content based on semantic similarity |
| RAG chunking | Split documents into chunks for retrieval-augmented generation |
| Multi-language search | Search across 100+ languages with a single model |
When NOT to Use
| Scenario | Recommended Alternative |
| --- | --- |
| Exact phrase matching | colbert_extractor |
| Keyword-heavy queries | splade_extractor |
| High-precision legal/medical search | colbert_extractor |
| Need for explainability (which keywords matched) | splade_extractor |
| Documents with critical technical terms | colbert_extractor |
| Very short texts (1-5 words) | splade_extractor |
Input Schema
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text content to process. Recommended: 10-400 words for optimal quality. Maximum: 512 tokens (~400 words); longer text is truncated. |
```json
{
  "text": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium sound quality."
}
```
Input Examples:
| Type | Example |
| --- | --- |
| Product description | "Premium wireless Bluetooth headphones with active noise cancellation" |
| FAQ question | "How do I reset my password if I forgot it?" |
| Article paragraph | "Machine learning models have revolutionized natural language processing…" |
| User query | "best restaurants near Times Square" |
Output Schema
| Field | Type | Description |
| --- | --- | --- |
| text | string | The processed text content (full text or chunk) |
| text_extractor_v1_embedding | float[1024] | Dense vector embedding, L2 normalized |
```json
{
  "text": "Premium wireless Bluetooth headphones with active noise cancellation",
  "text_extractor_v1_embedding": [0.023, -0.041, 0.018, ...]
}
```
When chunking is enabled, each chunk becomes a separate document with tracking metadata stored in metadata (not in the document payload):
chunk_index – Position of this chunk in the original document
chunk_text – The text content of this chunk
total_chunks – Total number of chunks from the source
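For illustration (values invented), the second of three chunks produced from one source document would carry metadata like:

```json
{
  "metadata": {
    "chunk_index": 1,
    "chunk_text": "Machine learning models have revolutionized natural language processing.",
    "total_chunks": 3
  }
}
```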
Parameters
Chunking Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_by | string | "none" | Strategy for splitting text into chunks |
| chunk_size | integer | 1000 | Target size for each chunk (units depend on split_by) |
| chunk_overlap | integer | 0 | Number of units to overlap between consecutive chunks |
Split Strategies
| Strategy | Description | Best For |
| --- | --- | --- |
| characters | Split by character count | Uniform sizes, quick testing |
| words | Split by word boundaries | General text, preserves words |
| sentences | Split by sentence boundaries | Q&A, precise retrieval, preserves semantic units |
| paragraphs | Split by paragraph (double newlines) | Articles, documentation, natural structure |
| pages | Split by page breaks | PDFs, paginated documents |
| none | No splitting (default) | Short texts < 400 words |
Recommended chunk sizes:
– characters: 500-2000
– words: 100-400
– sentences: 3-10
– paragraphs: 1-3
– pages: 1
Chunk overlap: 10-20% of chunk_size helps preserve context across boundaries. Example: chunk_size: 1000, chunk_overlap: 100-200.
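The overlap semantics can be made concrete with a short sketch. This is an illustration only, assuming a hypothetical `chunk_sentences` helper with naive boundary detection, not the service's splitter; here chunk_size and chunk_overlap are counted in sentences:

```python
import re

def chunk_sentences(text: str, chunk_size: int = 5,
                    chunk_overlap: int = 1) -> list[str]:
    """Group chunk_size sentences per chunk, repeating chunk_overlap
    sentences across consecutive chunks to preserve context."""
    # Naive boundary detection; a production splitter would be more robust.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = max(chunk_size - chunk_overlap, 1)
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break
    return chunks
```

With chunk_size 3 and chunk_overlap 1, the last sentence of each chunk reappears as the first sentence of the next, so a retrieval hit near a boundary still carries its surrounding context.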
Embedding Model
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| embedding_model | string | current TEXT default | Override the embedding model via the model registry |
The text extractor resolves its model through the central embedding registry. Leave this unset to use the current TEXT modality default (intfloat_e5_large_instruct_v1, 1024d); the registry hot-swaps the default when a new frontier text model ships, so existing collections pick it up without a code change.
Dimensions are locked at namespace creation. Switching embedding_model on an existing namespace requires a migration since the vector index dimensionality is fixed.
LLM Extraction Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| response_shape | string \| object | null | Define custom structured output using LLM extraction |
| llm_provider | string | null | LLM provider: openai, google, anthropic |
| llm_model | string | null | Specific model for extraction |
response_shape Modes
Natural Language Mode (string):
```json
{
  "response_shape": "Extract key entities, sentiment (positive/negative/neutral), and main topics from the text"
}
```
The service automatically infers JSON schema from your description.
JSON Schema Mode (object):
```json
{
  "response_shape": {
    "type": "object",
    "properties": {
      "sentiment": {
        "type": "string",
        "enum": ["positive", "negative", "neutral"]
      },
      "entities": {
        "type": "array",
        "items": { "type": "string" }
      },
      "topics": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["sentiment"]
  }
}
```
LLM Provider & Model Options
| Provider | Example Models |
| --- | --- |
| openai | gpt-4o-mini-2024-07-18 (cost-effective), gpt-4o-2024-08-06 (best quality) |
| google | gemini-2.5-flash (fastest), gemini-1.5-flash-001 |
| anthropic | claude-3-5-haiku-20241022 (fast), claude-3-5-sonnet-20241022 (best reasoning) |
Configuration Examples
Basic Embedding (No Chunking)
Sentence Chunking for RAG
Paragraph Chunking for Articles
Word-Level Chunking with Overlap
LLM Extraction (Natural Language)
LLM Extraction (JSON Schema)
```json
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": {
      "text": "payload.description"
    },
    "field_passthrough": [
      { "source_path": "metadata.product_id" }
    ],
    "parameters": {}
  }
}
```
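Building on the basic example above, a sentence-chunking configuration for RAG might look like this. The parameter values and the `payload.content` source path are illustrative; only the parameter names come from the tables in this document:

```json
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": {
      "text": "payload.content"
    },
    "parameters": {
      "split_by": "sentences",
      "chunk_size": 5,
      "chunk_overlap": 1
    }
  }
}
```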
Performance
| Metric | Value |
| --- | --- |
| Embedding latency | ~5ms per document (batched: ~2ms/doc) |
| Query latency | 5-10ms for top-100 results |
| Cost | Free (self-hosted E5-Large) |
| GPU required | No (but 5-10x faster with GPU) |
| Memory | ~4GB per 1M documents |
| Index build | ~1 hour per 10M documents |
LLM extraction adds cost and latency based on provider pricing. Only use when structured extraction is needed.
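When structured extraction is worth that cost, the natural-language mode shown earlier plugs into the same configuration shape. The values below are illustrative; the provider and model are examples taken from the table above:

```json
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": { "text": "payload.description" },
    "parameters": {
      "response_shape": "Extract key entities, sentiment (positive/negative/neutral), and main topics from the text",
      "llm_provider": "openai",
      "llm_model": "gpt-4o-mini-2024-07-18"
    }
  }
}
```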
| Feature | text_extractor | colbert_extractor | splade_extractor |
| --- | --- | --- | --- |
| Accuracy (BEIR avg) | 88% | 92% | 90% |
| Speed (per doc) | 5ms | 15ms | 10ms |
| Storage per doc | 4KB | 500KB (125x more) | 20KB (5x more) |
| Query latency | < 10ms | 50-100ms | 20-30ms |
| Best for | General search | Precision | Hybrid |
| Storage cost (1M docs) | $0.40 | $50 | $2 |
| Multi-language | Excellent | Good | Good |
| Exact matching | Poor | Excellent | Excellent |
| Semantic matching | Excellent | Excellent | Good |
Vector Index
| Property | Value |
| --- | --- |
| Index name | text_extractor_v1_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Inference model | multilingual_e5_large_instruct_v1 |
| Normalization | L2 normalized |
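Because the stored vectors are L2 normalized, cosine similarity reduces to a plain dot product, which is part of why queries stay cheap. A quick stdlib-only illustration of that equivalence (the helper names are ours, not part of the API):

```python
import math

def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = l2_normalize([0.3, -0.2, 0.9])
b = l2_normalize([0.1, 0.4, 0.8])
# For unit-length vectors the two scores coincide.
assert abs(cosine(a, b) - dot(a, b)) < 1e-12
```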
Limitations
– Token limit: 512 tokens (~400 words). Longer text is automatically truncated.
– Exact phrases: Cannot reliably match exact phrases or technical terms.
– Domain jargon: Struggles with very domain-specific jargon and acronyms.
– Terminology variance: May miss documents that use different terminology for the same concept.
– Short texts: Less effective for very short texts (1-5 words), where lexical matching is sufficient.
– Keyword-heavy queries: Less effective for queries like "iPhone 15 Pro Max 256GB".