Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.
[Figure: Text extractor pipeline showing chunking, E5-Large embedding, and optional LLM extraction]
The text extractor generates dense vector embeddings from text using the multilingual E5-Large model. It is optimized for semantic search, RAG applications, and general-purpose text retrieval, and it can split text into chunks using several splitting strategies. It is fast (~5ms per document), free to run, and supports 100+ languages.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/text_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
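As a minimal sketch, the lookup above could be scripted with the Python standard library. The `https://` scheme, the bearer-token header, and the helper names `build_extractor_request` and `fetch_extractor` are assumptions for illustration; check the API reference for the actual authentication scheme.

```python
import json
import urllib.request
from typing import Optional

# Assumption: the host named above is served over HTTPS.
BASE_URL = "https://api.mixpeek.com"


def build_extractor_request(
    feature_extractor_id: str, api_key: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a GET request for one extractor's details."""
    url = f"{BASE_URL}/v1/collections/features/extractors/{feature_extractor_id}"
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumption: bearer-token auth; the real scheme may differ.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_extractor(feature_extractor_id: str, api_key: Optional[str] = None) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_extractor_request(feature_extractor_id, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `fetch_extractor("text_extractor_v1")` performs the network request; `build_extractor_request` alone is useful for inspecting the URL and headers first.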

Pipeline Steps

  1. Filter Dataset (if collection_id provided)
    • Filter to specified collection
  2. Apply Input Mappings
    • Resolve text field from source (e.g., transcription, content, data)
  3. Text Chunking (conditional: if split_by != "none")
    • Split by: characters, words, sentences, paragraphs, or pages
    • Configure chunk_size and chunk_overlap
    • Each chunk becomes a separate document
  4. E5 Text Embedding Generation
    • Multilingual E5-Large model (1024D)
    • L2 normalized vectors
    • Batch size: 4,096 texts
  5. Output
    • Text documents with embeddings
    • One document per input (or per chunk if chunking enabled)

When to Use

| Use Case | Description |
| --- | --- |
| Product search | Search products by natural language descriptions |
| FAQ matching | Match user questions to knowledge base articles |
| Document retrieval | Find relevant documents from large corpora |
| Content discovery | Recommend similar content based on semantic similarity |
| RAG chunking | Split documents into chunks for retrieval-augmented generation |
| Multi-language search | Search across 100+ languages with a single model |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Exact phrase / keyword matching (SKUs, codes, names) | Add a lexical (BM25) search |
| Keyword-heavy queries (e.g. “iPhone 15 Pro Max 256GB”) | Lexical (BM25) search alongside the dense one |
| Critical technical terms or short texts (1–5 words) | Lexical (BM25) search |
Dense embeddings and exact-keyword matching are complementary. Keep text_extractor for semantic recall and add a lexical: true search over a text index for exact tokens — fuse them with rrf. See Lexical (BM25) Search.

Input Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text content to process. Recommended: 10-400 words for optimal quality. Maximum: 512 tokens (~400 words); longer text is truncated. |
{
  "text": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium sound quality."
}
Input Examples:
| Type | Example |
| --- | --- |
| Product description | “Premium wireless Bluetooth headphones with active noise cancellation” |
| FAQ question | “How do I reset my password if I forgot it?” |
| Article paragraph | “Machine learning models have revolutionized natural language processing…” |
| User query | “best restaurants near Times Square” |

Output Schema

| Field | Type | Description |
| --- | --- | --- |
| text | string | The processed text content (full text or chunk) |
| text_extractor_v1_embedding | float[1024] | Dense vector embedding, L2 normalized |
{
  "text": "Premium wireless Bluetooth headphones with active noise cancellation",
  "text_extractor_v1_embedding": [0.023, -0.041, 0.018, ...]
}
When chunking is enabled, each chunk becomes a separate document with tracking metadata stored in metadata (not in the document payload):
  • chunk_index – Position of this chunk in the original document
  • chunk_text – The text content of this chunk
  • total_chunks – Total number of chunks from the source
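The three tracking fields can be illustrated with a small sketch that turns a list of chunks into per-chunk documents. The helper name `to_chunk_documents`, the zero-based `chunk_index`, and the exact nesting under `metadata` are assumptions for illustration, not the service's confirmed layout.

```python
def to_chunk_documents(chunks: list[str]) -> list[dict]:
    """Wrap each chunk as its own document with chunk tracking metadata."""
    total = len(chunks)
    return [
        {
            "text": chunk,
            # Assumption: tracking fields live under a "metadata" key and
            # chunk_index starts at 0.
            "metadata": {
                "chunk_index": index,
                "chunk_text": chunk,
                "total_chunks": total,
            },
        }
        for index, chunk in enumerate(chunks)
    ]
```

Keeping `total_chunks` on every chunk lets a consumer tell whether it has seen all pieces of a source document without a second lookup.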

Parameters

Chunking Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_by | string | "none" | Strategy for splitting text into chunks |
| chunk_size | integer | 1000 | Target size for each chunk (units depend on split_by) |
| chunk_overlap | integer | 0 | Number of units to overlap between consecutive chunks |

Split Strategies

| Strategy | Description | Best For |
| --- | --- | --- |
| characters | Split by character count | Uniform sizes, quick testing |
| words | Split by word boundaries | General text, preserves words |
| sentences | Split by sentence boundaries | Q&A, precise retrieval, preserves semantic units |
| paragraphs | Split by paragraph (double newlines) | Articles, documentation, natural structure |
| pages | Split by page breaks | PDFs, paginated documents |
| none | No splitting (default) | Short texts < 400 words |
Recommended chunk sizes:
  • characters: 500-2000
  • words: 100-400
  • sentences: 3-10
  • paragraphs: 1-3
  • pages: 1
Chunk overlap: 10-20% of chunk_size helps preserve context across boundaries. Example: chunk_size: 1000, chunk_overlap: 100-200.
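To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of word-based splitting with overlap. It is a local approximation of the `words` strategy, not the extractor's actual implementation; whitespace tokenization and the helper name `chunk_words` are assumptions.

```python
def chunk_words(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Split text into chunks of chunk_size words sharing chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    words = text.split()  # assumption: words are whitespace-delimited
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks
```

With chunk_size 200 and chunk_overlap 20 (10%), each chunk advances 180 words, so the last 20 words of one chunk reappear at the start of the next.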

Embedding Model

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| embedding_model | string | current TEXT default | Override the embedding model via the model registry |
The text extractor resolves its model through the central embedding registry. Leave embedding_model unset to use the current TEXT modality default (intfloat_e5_large_instruct_v1, 1024d). The registry hot-swaps the default when a new frontier text model ships, so existing collections pick it up without a code change.
Dimensions are locked at namespace creation. Switching embedding_model on an existing namespace requires a migration since the vector index dimensionality is fixed.

Embedding Task

Instruction-aware embedding models (E5, Gemini) use a task hint to optimize the embedding for a specific downstream use case. By default, all extractors use retrieval_document at ingestion time, which produces embeddings optimized for asymmetric search (queries find documents).
Set embedding_task at the collection level, not on the extractor. See Collection Embedding Task for full details and examples.
| Task | Use Case | Effect on E5 | Effect on Gemini |
| --- | --- | --- | --- |
| retrieval_document | Default. Search: find documents from queries | Prepends "passage: " | Instructs “represent for retrieval” |
| retrieval_query | Rare at index time. Query-side is automatic | Prepends "query: " | Instructs “represent this query” |
| semantic_similarity | Symmetric comparison (deduplication, matching) | Prepends "query: " | Instructs “represent for similarity” |
| classification | Document categorization pipelines | Prepends "query: " | Instructs “represent for classification” |
| clustering | Grouping documents into clusters | Prepends "query: " | Not applied |
You almost never need to set this. The default retrieval_document is correct for search, and at query time Mixpeek automatically uses retrieval_query. Only override if your collection is primarily used for clustering, classification, or symmetric similarity — not retrieval.
Non-instruction-aware models (SigLIP, CLIP, Vertex multimodal) ignore this parameter.
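The E5 column of the task table amounts to a prefix lookup. The sketch below mirrors that table for illustration only; the service applies the prefix internally, and the names `E5_TASK_PREFIX` and `apply_e5_task_prefix` are hypothetical.

```python
# Prefixes copied from the "Effect on E5" column of the task table.
E5_TASK_PREFIX = {
    "retrieval_document": "passage: ",
    "retrieval_query": "query: ",
    "semantic_similarity": "query: ",
    "classification": "query: ",
    "clustering": "query: ",
}


def apply_e5_task_prefix(text: str, embedding_task: str = "retrieval_document") -> str:
    """Return the text as an E5 model would receive it for the given task."""
    try:
        prefix = E5_TASK_PREFIX[embedding_task]
    except KeyError:
        raise ValueError(f"unknown embedding_task: {embedding_task!r}") from None
    return prefix + text
```

The asymmetry is visible here: documents are indexed with "passage: " while queries get "query: ", which is why the default only needs changing for non-retrieval collections.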

LLM Structured Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| response_shape | string or object | null | Define custom structured output using LLM extraction |
| llm_provider | string | null | LLM provider: openai, google, anthropic |
| llm_model | string | null | Specific model for extraction |

response_shape Modes

Natural Language Mode (string):
{
  "response_shape": "Extract key entities, sentiment (positive/negative/neutral), and main topics from the text"
}
The service automatically infers a JSON schema from your description.

JSON Schema Mode (object):
{
  "response_shape": {
    "type": "object",
    "properties": {
      "sentiment": {
        "type": "string",
        "enum": ["positive", "negative", "neutral"]
      },
      "entities": {
        "type": "array",
        "items": { "type": "string" }
      },
      "topics": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["sentiment"]
  }
}
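When consuming extraction results downstream, it can help to check them against the same shape. The sketch below is a deliberately small checker covering only the JSON Schema keywords used in the example above (type, properties, required, enum, items, maxItems). It is not a full validator, and the name `check_against_shape` is hypothetical; a dedicated JSON Schema library would be the usual choice in production.

```python
RESPONSE_SHAPE = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "entities": {"type": "array", "items": {"type": "string"}},
        "topics": {"type": "array", "items": {"type": "string"}, "maxItems": 5},
    },
    "required": ["sentiment"],
}


def check_against_shape(value, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the value fits."""
    problems = []
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return ["expected object"]
        for name in schema.get("required", []):
            if name not in value:
                problems.append(f"missing required field: {name}")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                nested = check_against_shape(value[name], sub_schema)
                problems += [f"{name}: {p}" for p in nested]
    elif kind == "array":
        if not isinstance(value, list):
            return ["expected array"]
        if "maxItems" in schema and len(value) > schema["maxItems"]:
            problems.append(f"more than {schema['maxItems']} items")
        for index, item in enumerate(value):
            nested = check_against_shape(item, schema.get("items", {}))
            problems += [f"[{index}] {p}" for p in nested]
    elif kind == "string":
        if not isinstance(value, str):
            return ["expected string"]
    if "enum" in schema and value not in schema["enum"]:
        problems.append(f"value {value!r} not in enum")
    return problems
```

For example, `check_against_shape({"entities": []}, RESPONSE_SHAPE)` reports the missing `sentiment` field, which the schema marks as required.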

LLM Provider & Model Options

| Provider | Example Models |
| --- | --- |
| openai | gpt-4o-mini-2024-07-18 (cost-effective), gpt-4o-2024-08-06 (best quality) |
| google | gemini-2.5-flash (fastest), gemini-1.5-flash-001 |
| anthropic | claude-3-5-haiku-20241022 (fast), claude-3-5-sonnet-20241022 (best reasoning) |

Configuration Examples

{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": {
      "text": "description"
    },
    "field_passthrough": [
      { "source_path": "metadata.product_id" }
    ],
    "parameters": {}
  }
}
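The configuration above leaves parameters empty. As a hedged sketch, a chunking variant could be assembled programmatically as follows; placing split_by, chunk_size, and chunk_overlap inside the "parameters" object is an assumption based on the Parameters section, and `build_text_extractor_config` is a hypothetical helper.

```python
import json

ALLOWED_SPLITS = {"none", "characters", "words", "sentences", "paragraphs", "pages"}


def build_text_extractor_config(
    text_field: str,
    split_by: str = "none",
    chunk_size: int = 1000,
    chunk_overlap: int = 0,
) -> dict:
    """Assemble a feature_extractor config for text_extractor v1."""
    if split_by not in ALLOWED_SPLITS:
        raise ValueError(f"unsupported split_by: {split_by!r}")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    parameters = {}
    if split_by != "none":
        # Assumption: chunking options are passed under "parameters".
        parameters = {
            "split_by": split_by,
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
        }
    return {
        "feature_extractor": {
            "feature_extractor_name": "text_extractor",
            "version": "v1",
            "input_mappings": {"text": text_field},
            "parameters": parameters,
        }
    }


if __name__ == "__main__":
    config = build_text_extractor_config("description", "sentences", 5, 1)
    print(json.dumps(config, indent=2))
```

Validating split_by and the overlap locally catches typos before the configuration is submitted.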

Performance & Costs

| Metric | Value |
| --- | --- |
| Embedding latency | ~5ms per document (batched: ~2ms/doc) |
| Query latency | 5-10ms for top-100 results |
| Cost | Free (self-hosted E5-Large) |
| GPU required | No (but 5-10x faster with GPU) |
| Memory | ~4GB per 1M documents |
| Index build | ~1 hour per 10M documents |
LLM extraction adds cost and latency based on provider pricing. Only use when structured extraction is needed.
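For rough capacity planning, the figures in the table can be scaled to a target corpus size. The sketch below assumes every metric scales linearly with document count, which is a simplification, and `estimate_footprint` is a hypothetical helper. It also shows that the memory figure is consistent with raw float32 storage: 1024 dimensions × 4 bytes = 4096 bytes per vector, or about 4.1 GB per million vectors.

```python
BATCHED_SECONDS_PER_DOC = 0.002    # ~2ms/doc when batched
MEMORY_GB_PER_MILLION_DOCS = 4.0   # ~4GB per 1M documents
INDEX_HOURS_PER_10M_DOCS = 1.0     # ~1 hour per 10M documents
RAW_VECTOR_BYTES = 1024 * 4        # 1024 float32 values per embedding


def estimate_footprint(num_documents: int) -> dict:
    """Scale the published per-document figures linearly (an assumption)."""
    return {
        "embedding_seconds_batched": num_documents * BATCHED_SECONDS_PER_DOC,
        "memory_gb": num_documents / 1_000_000 * MEMORY_GB_PER_MILLION_DOCS,
        "index_build_hours": num_documents / 10_000_000 * INDEX_HOURS_PER_10M_DOCS,
        "raw_vector_gb": num_documents * RAW_VECTOR_BYTES / 1e9,
    }
```

Remember that with chunking enabled, the document count is the number of chunks, not the number of source texts.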

Dense Embeddings vs Lexical (BM25)

text_extractor produces dense embeddings — great for semantic recall, weak on exact tokens. For exact-keyword precision, pair it with a lexical (BM25) search (the lexical: true option on a feature_search stage, backed by a text payload index — not a separate extractor).
| Dimension | Dense (text_extractor) | Lexical (BM25, lexical: true) |
| --- | --- | --- |
| Matches | Meaning / paraphrase | Exact tokens, SKUs, codes, prices |
| Semantic recall | Excellent | Poor |
| Exact matching | Poor | Excellent |
| Multi-language | Excellent (E5-Large) | Good (token-based) |
| Requires | Embedding index | text payload index |
The strongest setup is hybrid: run a dense search and a lexical search in the same feature_search stage and fuse with rrf. See Lexical (BM25) Search.
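To show what fusing with rrf does, here is a minimal sketch of standard reciprocal rank fusion over two ranked lists of document IDs. The constant k = 60 is the conventional default from the RRF literature; whether the service uses the same constant or weighting is not stated here, so treat it as an assumption.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    dense_hits = ["doc_a", "doc_b", "doc_c"]    # semantic matches
    lexical_hits = ["doc_c", "doc_a", "doc_d"]  # exact-token matches
    for doc_id, score in reciprocal_rank_fusion([dense_hits, lexical_hits]):
        print(f"{doc_id}: {score:.5f}")
```

Because RRF uses only ranks, it needs no calibration between cosine similarities and BM25 scores, which live on different scales.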

Vector Index

| Property | Value |
| --- | --- |
| Feature URI | mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1 |
| Index name | text_extractor_v1_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Inference model | multilingual_e5_large_instruct_v1 |
| Normalization | L2 normalized |
In retrievers, reference this feature by its Feature URI above (the output name is multilingual_e5_large_instruct_v1, not the index name text_extractor_v1_embedding).
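Because the stored vectors are L2 normalized and the index uses the cosine metric, cosine similarity between two stored embeddings reduces to a plain dot product. The following pure-Python sketch demonstrates that identity on toy 2-dimensional vectors; the helper names are illustrative.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    a, b = [3.0, 4.0], [4.0, 3.0]
    # For unit vectors the two quantities agree up to floating-point error.
    print(cosine_similarity(a, b), dot(l2_normalize(a), l2_normalize(b)))
```

This is why embeddings exported from the index can be compared with a dot product alone, without recomputing norms.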

Limitations

  • Token limit: 512 tokens (~400 words). Longer text is automatically truncated.
  • Exact phrases: Cannot reliably match exact phrases or technical terms.
  • Domain jargon: Struggles with very domain-specific jargon or acronyms.
  • Terminology variance: May miss documents that use different terminology for the same concept.
  • Short texts: Less effective for very short texts (1-5 words) where lexical matching is sufficient.
  • Keyword-heavy queries: Less effective for queries like “iPhone 15 Pro Max 256GB”.