
# Text Extractor

> Dense vector embeddings from text using E5-Large multilingual with chunking and LLM extraction

<Card title="View on GitHub" icon="github" href="https://github.com/mixpeek/mixpeek-extractors/blob/main/extractors/text_extractor/README.md" horizontal>
  Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.
</Card>

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/extractors/text.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=5c9f4dd60090c5941a6283f448b4da86" alt="Text extractor pipeline showing chunking, E5-Large embedding, and optional LLM extraction" width="1000" height="420" data-path="assets/extractors/text.svg" />
</Frame>

The text extractor generates dense vector embeddings from text using the E5-Large multilingual model. It is optimized for semantic search, RAG applications, and general-purpose text retrieval, and it can split text into chunks using several splitting strategies. It is fast (\~5ms per document), free to run, and supports 100+ languages.

<Note>
  View extractor details at [api.mixpeek.com/v1/collections/features/extractors/text\_extractor\_v1](https://api.mixpeek.com/v1/collections/features/extractors/text_extractor_v1) or fetch programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.
</Note>
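
As a minimal sketch of the programmatic route, the snippet below builds the details URL from the path shown above and fetches it with Python's standard library. The helper names (`extractor_url`, `fetch_extractor`) are hypothetical, and the `Authorization: Bearer` header is an assumption; check the API reference for the actual authentication scheme.

```python
import json
import urllib.request
from typing import Optional

# Base path taken from the endpoint shown above.
BASE_URL = "https://api.mixpeek.com/v1/collections/features/extractors"


def extractor_url(feature_extractor_id: str) -> str:
    """Build the details URL for an extractor ID such as 'text_extractor_v1'."""
    return f"{BASE_URL}/{feature_extractor_id}"


def fetch_extractor(feature_extractor_id: str, api_key: Optional[str] = None) -> dict:
    """GET the extractor definition and parse the JSON body.

    The bearer-token header is an assumption, not confirmed by this page.
    """
    request = urllib.request.Request(extractor_url(feature_extractor_id))
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_extractor("text_extractor_v1")` performs the network request; `extractor_url` alone is useful for logging or for passing the URL to another HTTP client.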

## Pipeline Steps

1. **Filter Dataset** (if `collection_id` is provided)
   * Filter to specified collection
2. **Apply Input Mappings**
   * Resolve text field from source (e.g., `transcription`, `content`, `data`)
3. **Text Chunking** (conditional: if `split_by != "none"`)
   * Split by: characters, words, sentences, paragraphs, or pages
   * Configure `chunk_size` and `chunk_overlap`
   * Each chunk becomes a separate document
4. **E5 Text Embedding Generation**
   * Multilingual E5-Large model (1024D)
   * L2 normalized vectors
   * Batch size: 4,096 texts
5. **Output**
   * Text documents with embeddings
   * One document per input (or per chunk if chunking enabled)

## When to Use

| Use Case                  | Description                                                    |
| ------------------------- | -------------------------------------------------------------- |
| **Product search**        | Search products by natural language descriptions               |
| **FAQ matching**          | Match user questions to knowledge base articles                |
| **Document retrieval**    | Find relevant documents from large corpora                     |
| **Content discovery**     | Recommend similar content based on semantic similarity         |
| **RAG chunking**          | Split documents into chunks for retrieval-augmented generation |
| **Multi-language search** | Search across 100+ languages with a single model               |

## When NOT to Use

| Scenario                                               | Recommended Alternative                                                             |
| ------------------------------------------------------ | ----------------------------------------------------------------------------------- |
| Exact phrase / keyword matching (SKUs, codes, names)   | Add a [lexical (BM25) search](/retrieval/stages/feature-search#lexical-bm25-search) |
| Keyword-heavy queries (e.g. "iPhone 15 Pro Max 256GB") | Lexical (BM25) search alongside the dense one                                       |
| Critical technical terms or short texts (1–5 words)    | Lexical (BM25) search                                                               |

<Note>
  Dense embeddings and exact-keyword matching are complementary. Keep `text_extractor` for semantic recall and add a `lexical: true` search over a `text` index for exact tokens — fuse them with `rrf`. See [Lexical (BM25) Search](/retrieval/stages/feature-search#lexical-bm25-search).
</Note>

## Input Schema

| Field  | Type   | Required | Description                                                                                                                          |
| ------ | ------ | -------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| `text` | string | **Yes**  | Text content to process. Recommended: 10-400 words for optimal quality. Maximum: 512 tokens (\~400 words); longer text is truncated. |

```json theme={null}
{
  "text": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium sound quality."
}
```

**Input Examples:**

| Type                | Example                                                                      |
| ------------------- | ---------------------------------------------------------------------------- |
| Product description | "Premium wireless Bluetooth headphones with active noise cancellation"       |
| FAQ question        | "How do I reset my password if I forgot it?"                                 |
| Article paragraph   | "Machine learning models have revolutionized natural language processing..." |
| User query          | "best restaurants near Times Square"                                         |

## Output Schema

| Field                         | Type         | Description                                     |
| ----------------------------- | ------------ | ----------------------------------------------- |
| `text`                        | string       | The processed text content (full text or chunk) |
| `text_extractor_v1_embedding` | float\[1024] | Dense vector embedding, L2 normalized           |

```json theme={null}
{
  "text": "Premium wireless Bluetooth headphones with active noise cancellation",
  "text_extractor_v1_embedding": [0.023, -0.041, 0.018, ...]
}
```

When chunking is enabled, each chunk becomes a separate document with tracking metadata stored in `metadata` (not in the document payload):

* `chunk_index` – Position of this chunk in the original document
* `chunk_text` – The text content of this chunk
* `total_chunks` – Total number of chunks from the source

## Parameters

### Chunking Parameters

| Parameter       | Type    | Default  | Description                                             |
| --------------- | ------- | -------- | ------------------------------------------------------- |
| `split_by`      | string  | `"none"` | Strategy for splitting text into chunks                 |
| `chunk_size`    | integer | `1000`   | Target size for each chunk (units depend on `split_by`) |
| `chunk_overlap` | integer | `0`      | Number of units to overlap between consecutive chunks   |

#### Split Strategies

| Strategy     | Description                          | Best For                                          |
| ------------ | ------------------------------------ | ------------------------------------------------- |
| `characters` | Split by character count             | Uniform sizes, quick testing                      |
| `words`      | Split by word boundaries             | General text, preserves words                     |
| `sentences`  | Split by sentence boundaries         | Q\&A, precise retrieval, preserves semantic units |
| `paragraphs` | Split by paragraph (double newlines) | Articles, documentation, natural structure        |
| `pages`      | Split by page breaks                 | PDFs, paginated documents                         |
| `none`       | No splitting (default)               | Short texts \< 400 words                          |

**Recommended chunk sizes:**

* `characters`: 500-2000
* `words`: 100-400
* `sentences`: 3-10
* `paragraphs`: 1-3
* `pages`: 1

**Chunk overlap:** 10-20% of `chunk_size` helps preserve context across boundaries. Example: `chunk_size: 1000`, `chunk_overlap: 100-200`.
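
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a rough sketch of word-level chunking with overlap. It illustrates the general sliding-window technique only; the extractor's actual splitting logic (tokenization, handling of the final chunk) is not documented on this page, and `chunk_words` is a hypothetical helper.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into word chunks where consecutive chunks share `chunk_overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # so no trailing chunk made only of overlap words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, chunk_overlap=1)
# chunks == ["a b c d", "d e f g", "g h i j"]
```

Each chunk starts with the last word of the previous one, which is the context-preserving effect the overlap recommendation above is after.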

### Embedding Model

| Parameter         | Type   | Default                | Description                                                                       |
| ----------------- | ------ | ---------------------- | --------------------------------------------------------------------------------- |
| `embedding_model` | string | *current TEXT default* | Override the embedding model via the [model registry](/processing/model-registry) |

The text extractor resolves its model through the central embedding registry. Leave `embedding_model` unset to use the current TEXT modality default (`intfloat_e5_large_instruct_v1`, 1024d). The registry hot-swaps this default when a new frontier text model ships, so existing collections pick it up without a code change.

<Warning>
  Dimensions are locked at namespace creation. Switching `embedding_model` on an existing namespace requires a migration since the vector index dimensionality is fixed.
</Warning>

### Embedding Task

Instruction-aware embedding models (E5, Gemini) use a **task hint** to optimize the embedding for a specific downstream use case. By default, all extractors use `retrieval_document` at ingestion time, which produces embeddings optimized for asymmetric search (queries find documents).

<Note>
  Set `embedding_task` at the **collection level**, not on the extractor. See [Collection Embedding Task](/platform/processing#embedding-task) for full details and examples.
</Note>

| Task                  | Use Case                                         | Effect on E5           | Effect on Gemini                         |
| --------------------- | ------------------------------------------------ | ---------------------- | ---------------------------------------- |
| `retrieval_document`  | **Default.** Search: find documents from queries | Prepends `"passage: "` | Instructs "represent for retrieval"      |
| `retrieval_query`     | Rare at index time. Query-side is automatic      | Prepends `"query: "`   | Instructs "represent this query"         |
| `semantic_similarity` | Symmetric comparison (deduplication, matching)   | Prepends `"query: "`   | Instructs "represent for similarity"     |
| `classification`      | Document categorization pipelines                | Prepends `"query: "`   | Instructs "represent for classification" |
| `clustering`          | Grouping documents into clusters                 | Prepends `"query: "`   | Not applied                              |

<Info>
  You almost never need to set this. The default `retrieval_document` is correct for search, and at query time Mixpeek automatically uses `retrieval_query`. Only override if your collection is primarily used for clustering, classification, or symmetric similarity — not retrieval.
</Info>

<Info>
  Non-instruction-aware models (SigLIP, CLIP, Vertex multimodal) ignore this parameter.
</Info>
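
The "Effect on E5" column above can be read as a simple lookup from task to text prefix. The sketch below mirrors that column for illustration; `apply_e5_task` is a hypothetical helper, and the platform applies the prefix internally, so this is not something you need to do yourself.

```python
# Prefixes copied from the "Effect on E5" column of the table above.
E5_TASK_PREFIXES = {
    "retrieval_document": "passage: ",
    "retrieval_query": "query: ",
    "semantic_similarity": "query: ",
    "classification": "query: ",
    "clustering": "query: ",
}


def apply_e5_task(text: str, task: str = "retrieval_document") -> str:
    """Return the text as it would be presented to E5 for the given task."""
    if task not in E5_TASK_PREFIXES:
        raise ValueError(f"unknown embedding task: {task}")
    return E5_TASK_PREFIXES[task] + text


indexed = apply_e5_task("Premium wireless Bluetooth headphones")
searched = apply_e5_task("noise cancelling headphones", task="retrieval_query")
```

The asymmetry between `indexed` and `searched` is why the default works for search: documents are embedded as passages and queries as queries.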

### LLM Structured Extraction Parameters

| Parameter        | Type             | Default | Description                                          |
| ---------------- | ---------------- | ------- | ---------------------------------------------------- |
| `response_shape` | string \| object | `null`  | Define custom structured output using LLM extraction |
| `llm_provider`   | string           | `null`  | LLM provider: `openai`, `google`, `anthropic`        |
| `llm_model`      | string           | `null`  | Specific model for extraction                        |

#### response\_shape Modes

**Natural Language Mode (string):**

```json theme={null}
{
  "response_shape": "Extract key entities, sentiment (positive/negative/neutral), and main topics from the text"
}
```

The service automatically infers a JSON schema from your description.

**JSON Schema Mode (object):**

```json theme={null}
{
  "response_shape": {
    "type": "object",
    "properties": {
      "sentiment": {
        "type": "string",
        "enum": ["positive", "negative", "neutral"]
      },
      "entities": {
        "type": "array",
        "items": { "type": "string" }
      },
      "topics": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["sentiment"]
  }
}
```
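
If you consume the extracted fields downstream, it can help to check them against the shape you requested. The following is a hand-rolled sketch for the schema above only; `check_extraction` is a hypothetical helper, and where the extracted fields appear in the output document is not specified on this page. For general-purpose validation, a full JSON Schema validator would be the better tool.

```python
from typing import Any, Dict, List

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def check_extraction(result: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the result fits the schema above."""
    problems = []

    # "sentiment" is the only required property and must be one of the enum values.
    if "sentiment" not in result:
        problems.append("missing required field: sentiment")
    else:
        sentiment = result["sentiment"]
        if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
            problems.append("sentiment is not one of the allowed values")

    # "entities" and "topics" are optional arrays of strings.
    for field in ("entities", "topics"):
        if field in result:
            value = result[field]
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                problems.append(f"{field} must be a list of strings")

    # "topics" is capped at five items by maxItems.
    topics = result.get("topics")
    if isinstance(topics, list) and len(topics) > 5:
        problems.append("topics has more than 5 items")

    return problems
```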

#### LLM Provider & Model Options

| Provider    | Example Models                                                                    |
| ----------- | --------------------------------------------------------------------------------- |
| `openai`    | `gpt-4o-mini-2024-07-18` (cost-effective), `gpt-4o-2024-08-06` (best quality)     |
| `google`    | `gemini-2.5-flash` (fastest), `gemini-1.5-flash-001`                              |
| `anthropic` | `claude-3-5-haiku-20241022` (fast), `claude-3-5-sonnet-20241022` (best reasoning) |

## Configuration Examples

<CodeGroup>
  ```json Basic Embedding (No Chunking) theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "description"
      },
      "field_passthrough": [
        { "source_path": "metadata.product_id" }
      ],
      "parameters": {}
    }
  }
  ```

  ```json Sentence Chunking for RAG theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "document_content"
      },
      "field_passthrough": [
        { "source_path": "metadata.document_id" }
      ],
      "parameters": {
        "split_by": "sentences",
        "chunk_size": 5,
        "chunk_overlap": 1
      }
    }
  }
  ```

  ```json Paragraph Chunking for Articles theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "article_body"
      },
      "field_passthrough": [
        { "source_path": "metadata.title" },
        { "source_path": "metadata.author" }
      ],
      "parameters": {
        "split_by": "paragraphs",
        "chunk_size": 2,
        "chunk_overlap": 0
      }
    }
  }
  ```

  ```json Word-Level Chunking with Overlap theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "content"
      },
      "parameters": {
        "split_by": "words",
        "chunk_size": 300,
        "chunk_overlap": 50
      }
    }
  }
  ```

  ```json LLM Extraction (Natural Language) theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "review_text"
      },
      "parameters": {
        "response_shape": "Extract sentiment (positive/negative/neutral), key product features mentioned, and overall rating impression",
        "llm_provider": "openai",
        "llm_model": "gpt-4o-mini-2024-07-18"
      }
    }
  }
  ```

  ```json LLM Extraction (JSON Schema) theme={null}
  {
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "document"
      },
      "parameters": {
        "response_shape": {
          "type": "object",
          "properties": {
            "entities": {
              "type": "array",
              "items": { "type": "string" },
              "description": "Named entities mentioned"
            },
            "sentiment": {
              "type": "string",
              "enum": ["positive", "negative", "neutral"]
            },
            "topics": {
              "type": "array",
              "items": { "type": "string" },
              "maxItems": 5
            }
          },
          "required": ["sentiment"]
        },
        "llm_provider": "anthropic",
        "llm_model": "claude-3-5-haiku-20241022"
      }
    }
  }
  ```
</CodeGroup>
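
When these configurations are generated from code, a small builder keeps the structure consistent. This is a sketch that reproduces the JSON shapes shown above; `build_text_extractor_config` is a hypothetical helper, and the check that `chunk_overlap` is smaller than `chunk_size` is a local sanity check assumed here, not a documented API rule.

```python
from typing import Any, Dict, Sequence

# Strategy names from the Split Strategies table.
SPLIT_STRATEGIES = {"characters", "words", "sentences", "paragraphs", "pages", "none"}


def build_text_extractor_config(
    text_field: str,
    split_by: str = "none",
    chunk_size: int = 1000,
    chunk_overlap: int = 0,
    passthrough: Sequence[str] = (),
) -> Dict[str, Any]:
    """Build a `feature_extractor` config dict matching the examples above."""
    if split_by not in SPLIT_STRATEGIES:
        raise ValueError(f"unknown split_by strategy: {split_by}")

    parameters: Dict[str, Any] = {}
    if split_by != "none":
        # Assumed sanity check: the overlap must leave room for the window to advance.
        if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
            raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
        parameters = {
            "split_by": split_by,
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
        }

    extractor: Dict[str, Any] = {
        "feature_extractor_name": "text_extractor",
        "version": "v1",
        "input_mappings": {"text": text_field},
        "parameters": parameters,
    }
    if passthrough:
        extractor["field_passthrough"] = [{"source_path": path} for path in passthrough]
    return {"feature_extractor": extractor}


rag_config = build_text_extractor_config(
    "document_content",
    split_by="sentences",
    chunk_size=5,
    chunk_overlap=1,
    passthrough=["metadata.document_id"],
)
```

`rag_config` has the same content as the "Sentence Chunking for RAG" example; serialize it with `json.dumps` before sending it in a request body.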

## Performance & Costs

| Metric                | Value                                   |
| --------------------- | --------------------------------------- |
| **Embedding latency** | \~5ms per document (batched: \~2ms/doc) |
| **Query latency**     | 5-10ms for top-100 results              |
| **Cost**              | Free (self-hosted E5-Large)             |
| **GPU required**      | No (but 5-10x faster with GPU)          |
| **Memory**            | \~4GB per 1M documents                  |
| **Index build**       | \~1 hour per 10M documents              |

**LLM extraction** adds cost and latency based on provider pricing. Only use when structured extraction is needed.
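
The memory figure in the table is consistent with the raw size of the vectors themselves: 1024 dimensions stored as float32 (4 bytes each). The short calculation below shows the arithmetic. It covers raw vector storage only; index structures and payloads add overhead that is not quantified on this page.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_documents: int, dimensions: int = 1024) -> int:
    """Raw storage for float32 vectors, excluding index and payload overhead."""
    return num_documents * dimensions * BYTES_PER_FLOAT32


bytes_for_1m = raw_vector_bytes(1_000_000)
print(f"{bytes_for_1m / 1e9:.3f} GB")  # prints "4.096 GB"
```

With chunking enabled, count chunks rather than source documents, since each chunk is stored as its own document with its own vector.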

## Dense Embeddings vs Lexical (BM25)

`text_extractor` produces **dense** embeddings — great for semantic recall, weak on exact tokens. For exact-keyword precision, pair it with a **lexical (BM25)** search (the `lexical: true` option on a `feature_search` stage, backed by a `text` payload index — not a separate extractor).

| Dimension           | Dense (`text_extractor`) | Lexical (BM25, `lexical: true`)   |
| ------------------- | ------------------------ | --------------------------------- |
| **Matches**         | Meaning / paraphrase     | Exact tokens, SKUs, codes, prices |
| **Semantic recall** | Excellent                | Poor                              |
| **Exact matching**  | Poor                     | Excellent                         |
| **Multi-language**  | Excellent (E5-Large)     | Good (token-based)                |
| **Requires**        | Embedding index          | `text` payload index              |

<Tip>
  The strongest setup is **hybrid**: run a dense search and a lexical search in the same `feature_search` stage and fuse with `rrf`. See [Lexical (BM25) Search](/retrieval/stages/feature-search#lexical-bm25-search).
</Tip>
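
For intuition about what `rrf` does with the two result lists, here is a sketch of standard reciprocal rank fusion: each document scores the sum of `1 / (k + rank)` over the lists it appears in. The constant `k = 60` is the conventional default from the RRF literature and is an assumption here; the parameters Mixpeek uses internally are not stated on this page, and `rrf_fuse` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]    # semantic matches, best first
lexical_hits = ["b", "d", "a"]  # exact-token matches, best first
fused = rrf_fuse([dense_hits, lexical_hits])
# fused == ["b", "a", "d", "c"]
```

Documents that appear in both lists ("a" and "b") rise above those found by only one search, which is the behavior that makes hybrid retrieval robust to either method's blind spots.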

## Vector Index

| Property            | Value                                                           |
| ------------------- | --------------------------------------------------------------- |
| **Feature URI**     | `mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1` |
| **Index name**      | `text_extractor_v1_embedding`                                   |
| **Dimensions**      | 1024                                                            |
| **Type**            | Dense                                                           |
| **Distance metric** | Cosine                                                          |
| **Datatype**        | float32                                                         |
| **Inference model** | `multilingual_e5_large_instruct_v1`                             |
| **Normalization**   | L2 normalized                                                   |

<Note>In retrievers, reference this feature by its **Feature URI** above (the output name is `multilingual_e5_large_instruct_v1`, **not** the index name `text_extractor_v1_embedding`).</Note>
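
Because the vectors are L2 normalized, cosine similarity between two embeddings reduces to a plain dot product. The sketch below demonstrates this on toy two-dimensional vectors with the standard library only; the helper names are illustrative and the real embeddings have 1024 dimensions.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])
# For unit vectors the dot product and the cosine similarity agree (24/25 here).
same = math.isclose(dot(u, v), cosine([3.0, 4.0], [4.0, 3.0]))
```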

## Limitations

* **Token limit**: 512 tokens (\~400 words). Longer text is automatically truncated.
* **Exact phrases**: Cannot reliably match exact phrases or technical terms.
* **Domain jargon**: Struggles with very domain-specific jargon or acronyms.
* **Terminology variance**: May miss documents that use different terminology for the same concept.
* **Short texts**: Less effective for very short texts (1-5 words) where lexical matching is sufficient.
* **Keyword-heavy queries**: Less effective for queries like "iPhone 15 Pro Max 256GB".

## Related

* [Feature Extractors Overview](/processing/feature-extractors)
* [Passthrough Extractor](/processing/extractors/passthrough)
* [Multimodal Extractor](/processing/extractors/multimodal)
