# LLM Enrich

> Extract structured data from documents using language model analysis

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/retrievers/llm-enrich.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=029ebd511f0d8a7b933a0168a6624a55" alt="LLM Enrich stage showing structured data extraction with language models" width="1000" height="400" data-path="assets/retrievers/llm-enrich.svg" />
</Frame>

The LLM Enrich stage uses language models to extract structured data from document content. It can identify entities, classify content, extract key information, and generate structured outputs.

<Note>
  **Stage Category**: ENRICH (Enriches documents)

  **Transformation**: N documents → N documents (with extracted data added)
</Note>

## When to Use

| Use Case                         | Description                                |
| -------------------------------- | ------------------------------------------ |
| **Entity extraction**            | Extract names, dates, amounts from text    |
| **Content classification**       | Categorize documents by topic/type         |
| **Key information extraction**   | Pull specific facts from unstructured text |
| **Structured output generation** | Convert prose to structured data           |

## When NOT to Use

| Scenario                           | Recommended Alternative     |
| ---------------------------------- | --------------------------- |
| Simple field transformation        | `json_transform`            |
| Predefined taxonomy classification | `taxonomy_enrich`           |
| Large-scale processing             | Pre-process during indexing |
| Real-time low-latency              | Use cached extractions      |

## Parameters

| Parameter       | Type    | Default     | Description                       |
| --------------- | ------- | ----------- | --------------------------------- |
| `model_name`    | string  | *Required*  | LLM model to use                  |
| `prompt`        | string  | *Required*  | Extraction instructions           |
| `content_field` | string  | `content`   | Field to analyze                  |
| `output_field`  | string  | `extracted` | Field for extracted data          |
| `output_schema` | object  | `null`      | JSON schema for structured output |
| `batch_size`    | integer | `5`         | Documents per LLM call            |

## Available Models

| Model             | Speed  | Quality   | Best For           |
| ----------------- | ------ | --------- | ------------------ |
| `gpt-4o-mini`     | Fast   | Good      | Simple extractions |
| `gpt-4o`          | Medium | Excellent | Complex analysis   |
| `claude-3-haiku`  | Fast   | Good      | Quick processing   |
| `claude-3-sonnet` | Medium | Excellent | Nuanced extraction |

## Configuration Examples

<CodeGroup>
  ```json Basic Entity Extraction theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o-mini",
        "prompt": "Extract all company names and person names mentioned in this document.",
        "output_field": "entities"
      }
    }
  }
  ```

  ```json Structured Output with Schema theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o",
        "prompt": "Extract the following information from this product review.",
        "output_field": "review_analysis",
        "output_schema": {
          "type": "object",
          "properties": {
            "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
            "rating_mentioned": {"type": "number", "minimum": 1, "maximum": 5},
            "pros": {"type": "array", "items": {"type": "string"}},
            "cons": {"type": "array", "items": {"type": "string"}},
            "would_recommend": {"type": "boolean"}
          }
        }
      }
    }
  }
  ```

  ```json Key Facts Extraction theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o-mini",
        "prompt": "Extract key facts from this news article: main event, date, location, people involved, and outcome.",
        "output_field": "key_facts",
        "output_schema": {
          "type": "object",
          "properties": {
            "main_event": {"type": "string"},
            "date": {"type": "string"},
            "location": {"type": "string"},
            "people": {"type": "array", "items": {"type": "string"}},
            "outcome": {"type": "string"}
          }
        }
      }
    }
  }
  ```

  ```json Topic Classification theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "anthropic",
        "model_name": "claude-3-haiku",
        "prompt": "Classify this document into primary and secondary topics. Be specific.",
        "output_field": "topics",
        "output_schema": {
          "type": "object",
          "properties": {
            "primary_topic": {"type": "string"},
            "secondary_topics": {"type": "array", "items": {"type": "string"}},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1}
          }
        }
      }
    }
  }
  ```

  ```json Contact Information Extraction theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o-mini",
        "prompt": "Extract all contact information (emails, phone numbers, addresses) from this document.",
        "output_field": "contacts",
        "output_schema": {
          "type": "object",
          "properties": {
            "emails": {"type": "array", "items": {"type": "string", "format": "email"}},
            "phones": {"type": "array", "items": {"type": "string"}},
            "addresses": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    }
  }
  ```
</CodeGroup>
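
The configurations above all share the same envelope, so it can be convenient to generate them from code. The following is a minimal Python sketch, not an official SDK helper: `build_llm_enrich_stage` is a hypothetical name, and the parameter keys simply mirror the JSON examples on this page.

```python
from typing import Optional


def build_llm_enrich_stage(
    provider: str,
    model_name: str,
    prompt: str,
    output_field: str = "extracted",
    output_schema: Optional[dict] = None,
    batch_size: Optional[int] = None,
) -> dict:
    """Assemble an llm_enrich stage dict shaped like the JSON examples on this page."""
    if not prompt.strip():
        raise ValueError("prompt is required")
    parameters = {
        "provider": provider,
        "model_name": model_name,
        "prompt": prompt,
        "output_field": output_field,
    }
    # Optional parameters are only included when set, so the stage's
    # documented defaults apply otherwise.
    if output_schema is not None:
        parameters["output_schema"] = output_schema
    if batch_size is not None:
        parameters["batch_size"] = batch_size
    return {
        "stage_name": "llm_enrich",
        "stage_type": "enrich",
        "config": {"stage_id": "llm_enrich", "parameters": parameters},
    }
```

Serializing the result with `json.dumps` should give a stage equivalent to the "Basic Entity Extraction" example when called with the same provider, model, prompt, and output field.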

## Output Schema

Define structured output using JSON Schema:

```json theme={null}
{
  "output_schema": {
    "type": "object",
    "properties": {
      "field_name": {"type": "string"},
      "numeric_field": {"type": "number"},
      "boolean_field": {"type": "boolean"},
      "array_field": {"type": "array", "items": {"type": "string"}},
      "enum_field": {"type": "string", "enum": ["option1", "option2", "option3"]}
    },
    "required": ["field_name"]
  }
}
```

### Supported Types

| Type      | Description    |
| --------- | -------------- |
| `string`  | Text values    |
| `number`  | Numeric values |
| `boolean` | True/false     |
| `array`   | Lists of items |
| `object`  | Nested objects |
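
To make the schema semantics concrete, here is a small Python sketch that checks a value against the subset of JSON Schema shown above (the five types, `enum`, `required`, nested `properties` and `items`). It is an illustration only: `schema_errors` is a hypothetical helper, it ignores keywords such as `minimum`, `maximum`, and `format`, and it is not how the stage itself validates output.

```python
from typing import Any

# Maps the schema type names from the table above to Python checks.
_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}


def schema_errors(value: Any, schema: dict, path: str = "$") -> list:
    """Return a list of problems; an empty list means the value fits the schema."""
    expected = schema.get("type")
    if expected in _TYPE_CHECKS and not _TYPE_CHECKS[expected](value):
        # Wrong type at this level: no point descending further.
        return [f"{path}: expected {expected}"]
    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} not in enum")
    if expected == "object":
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}.{name}: required field missing")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                errors.extend(schema_errors(value[name], sub_schema, f"{path}.{name}"))
    if expected == "array" and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(schema_errors(item, schema["items"], f"{path}[{index}]"))
    return errors
```

A check like this can be useful in client code that wants to confirm an enriched field has the expected shape before relying on it.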

## Output Examples

### Without Schema

```json theme={null}
{
  "document_id": "doc_123",
  "content": "Apple Inc. announced...",
  "entities": "Companies: Apple Inc., Microsoft\nPeople: Tim Cook, Satya Nadella"
}
```

### With Schema

```json theme={null}
{
  "document_id": "doc_123",
  "content": "Great product, 5 stars!...",
  "review_analysis": {
    "sentiment": "positive",
    "rating_mentioned": 5,
    "pros": ["easy to use", "great value", "fast shipping"],
    "cons": ["packaging could be better"],
    "would_recommend": true
  }
}
```

## Performance

| Metric          | Value                          |
| --------------- | ------------------------------ |
| **Latency**     | 300-800ms per batch            |
| **Batch size**  | 5 documents default            |
| **Token usage** | \~200 tokens per document      |
| **Parallel**    | Batches processed concurrently |

<Warning>
  LLM enrichment is expensive: each batch adds an LLM call (300-800ms) and roughly 200 tokens per document. Consider pre-computing extractions during indexing for frequently accessed data.
</Warning>
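
As a rough sizing aid, the figures in the table above can be turned into a back-of-the-envelope estimate. The sketch below assumes the documented defaults (5 documents per batch, about 200 tokens per document). `estimate_enrichment` is a hypothetical helper, and real usage depends on document length and prompt size.

```python
import math


def estimate_enrichment(num_documents: int, batch_size: int = 5,
                        tokens_per_document: int = 200) -> dict:
    """Rough sizing from the documented figures; not a billing calculation."""
    if num_documents < 0 or batch_size < 1:
        raise ValueError("num_documents must be >= 0 and batch_size >= 1")
    return {
        # Number of LLM calls: documents are grouped batch_size at a time.
        "batches": math.ceil(num_documents / batch_size),
        "approx_tokens": num_documents * tokens_per_document,
    }


print(estimate_enrichment(20))  # {'batches': 4, 'approx_tokens': 4000}
```

Because batches are processed concurrently, total latency does not scale linearly with the batch count, so the sketch deliberately leaves latency out.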

## Common Pipeline Patterns

### Search + Extract + Filter

```json theme={null}
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
            "query": "{{INPUT.query}}",
            "top_k": 20
          }
        ],
        "final_top_k": 20
      }
    }
  },
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o-mini",
        "prompt": "Extract the main topic and sentiment.",
        "output_field": "analysis",
        "output_schema": {
          "type": "object",
          "properties": {
            "topic": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]}
          }
        }
      }
    }
  },
  {
    "stage_name": "attribute_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "field": "analysis.sentiment",
        "operator": "eq",
        "value": "positive"
      }
    }
  }
]
```
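
The last stage filters on `analysis.sentiment`, a dotted path into the object that `llm_enrich` wrote to `output_field`. The Python sketch below mimics that data flow locally to show why the schema matters. It is an illustration, not Mixpeek's filter implementation, and `get_path` and `filter_eq` are hypothetical names.

```python
from typing import Any


def get_path(document: dict, dotted_path: str) -> Any:
    """Walk a dotted path such as 'analysis.sentiment'; None if any step is missing."""
    current: Any = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def filter_eq(documents: list, field: str, value: Any) -> list:
    """Keep documents whose value at the dotted path equals the target value."""
    return [doc for doc in documents if get_path(doc, field) == value]


docs = [
    {"document_id": "a", "analysis": {"topic": "pricing", "sentiment": "positive"}},
    {"document_id": "b", "analysis": {"topic": "support", "sentiment": "negative"}},
    # Enrichment that ended up as plain text has no nested field to match on.
    {"document_id": "c", "analysis": "Topic: shipping, mostly positive"},
]
print([d["document_id"] for d in filter_eq(docs, "analysis.sentiment", "positive")])  # ['a']
```

Constraining `sentiment` with an `enum` in `output_schema`, as the pipeline above does, keeps the values predictable enough for an exact-match filter.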

### Entity Extraction Pipeline

```json theme={null}
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
            "query": "{{INPUT.query}}",
            "top_k": 10
          }
        ],
        "final_top_k": 10
      }
    }
  },
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o",
        "prompt": "Extract all entities with their types and relationships.",
        "output_field": "entities",
        "output_schema": {
          "type": "object",
          "properties": {
            "people": {"type": "array", "items": {"type": "object", "properties": {"name": {"type": "string"}, "role": {"type": "string"}}}},
            "organizations": {"type": "array", "items": {"type": "string"}},
            "locations": {"type": "array", "items": {"type": "string"}},
            "dates": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    }
  }
]
```

## Writing Effective Prompts

### Good Prompts

```
✓ "Extract the product name, price, and key features from this product listing."
✓ "Identify all dates mentioned and their associated events."
✓ "Classify the sentiment as positive, neutral, or negative, and explain why."
```

### Poor Prompts

```
✗ "Analyze this document" (too vague)
✗ "Get the data" (not specific)
✗ "Tell me about it" (unclear output)
```

<Tip>
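  Be specific about what to extract and in what format. When you provide `output_schema`, the LLM is asked to return data matching that structure. If the response fails schema validation, the raw text is written to `output_field` instead (see Error Handling).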
</Tip>

## Multimodal Query Inputs

Pass images from your query inputs to the LLM alongside document content. This enables visual comparison, brand matching, and cross-modal analysis.

### Parameters

| Parameter           | Type   | Default | Description                                                    |
| ------------------- | ------ | ------- | -------------------------------------------------------------- |
| `multimodal_inputs` | object | `null`  | Map of INPUT field names to media types (`"image"`, `"video"`) |

### How It Works

1. Declare which INPUT fields carry multimodal content via `multimodal_inputs`
2. At runtime, the stage extracts URLs from `{{INPUT.field_name}}`
3. Images are sent to the LLM alongside the prompt and document content
4. Works with providers that support vision (Google Gemini, OpenAI GPT-4o, Anthropic Claude)

### Configuration Examples

<CodeGroup>
  ```json Brand Logo Comparison theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "google",
        "model_name": "gemini-2.5-flash-lite",
        "prompt": "Compare this brand logo to the document. Rate alignment 1-10: {{DOC.brand_name}}",
        "output_field": "brand_alignment",
        "multimodal_inputs": {
          "query_image": "image"
        }
      }
    }
  }
  ```

  ```json Visual Product Match theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o",
        "prompt": "Does this product image match the description? {{DOC.description}}",
        "output_field": "visual_match",
        "output_schema": {
          "type": "object",
          "properties": {
            "matches": {"type": "boolean"},
            "confidence": {"type": "number"},
            "differences": {"type": "array", "items": {"type": "string"}}
          }
        },
        "multimodal_inputs": {
          "product_photo": "image"
        }
      }
    }
  }
  ```

  ```json Multi-Image Reference theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "google",
        "model_name": "gemini-2.5-flash-lite",
        "prompt": "Compare the reference image and style guide against this document's content: {{DOC.title}}",
        "output_field": "comparison",
        "multimodal_inputs": {
          "reference_image": "image",
          "style_guide": "image"
        }
      }
    }
  }
  ```
</CodeGroup>

### Calling with Multimodal Inputs

When using `multimodal_inputs`, pass the image URLs in the retriever's `inputs`:

```json theme={null}
{
  "inputs": {
    "query": "Find products similar to this",
    "query_image": "https://storage.example.com/brand-logo.png"
  },
  "stages": [
    {
      "stage_name": "feature_search",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": { "..." : "..." }
      }
    },
    {
      "stage_name": "llm_enrich",
      "stage_type": "enrich",
      "config": {
        "stage_id": "llm_enrich",
        "parameters": {
          "prompt": "Compare this image to: {{DOC.product_name}}",
          "output_field": "match_score",
          "multimodal_inputs": {"query_image": "image"}
        }
      }
    }
  ]
}
```

<Note>
  If a declared multimodal input field is missing from the query inputs at runtime, the stage proceeds without it (text-only mode). No error is raised.
</Note>
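
The resolution behavior described above (declared fields are looked up in the query inputs, and missing ones are skipped silently) can be sketched in a few lines of Python. This is a conceptual illustration under that stated behavior, not the stage's actual code, and `resolve_multimodal_inputs` is a hypothetical name.

```python
def resolve_multimodal_inputs(multimodal_inputs: dict, query_inputs: dict) -> list:
    """Pair each declared INPUT field with its URL, skipping fields the caller did not send."""
    resolved = []
    for field_name, media_type in multimodal_inputs.items():
        url = query_inputs.get(field_name)
        if not url:
            # Mirrors the documented behavior: a missing field is skipped, not an error.
            continue
        resolved.append({"field": field_name, "type": media_type, "url": url})
    return resolved


declared = {"reference_image": "image", "style_guide": "image"}
inputs = {"query": "brand check", "reference_image": "https://storage.example.com/ref.png"}
print(resolve_multimodal_inputs(declared, inputs))
```

Here only `reference_image` is resolved. If no declared field is present, the result is an empty list, which corresponds to the text-only mode.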

***

## Bring Your Own Key (BYOK)

Use your own LLM API keys instead of Mixpeek's default keys. This gives you control over costs, rate limits, and API usage.

### Why Use BYOK?

| Benefit          | Description                                |
| ---------------- | ------------------------------------------ |
| **Cost Control** | Use your own API credits and billing       |
| **Rate Limits**  | Use your own rate limits, not shared ones  |
| **Compliance**   | Keep API calls under your own account      |
| **Key Rotation** | Rotate keys without changing retrievers    |

### Setup

<Steps>
  <Step title="Store your API key as a secret">
    Store your LLM provider API key in the organization secrets vault:

    ```bash theme={null}
    curl -X POST "https://api.mixpeek.com/v1/organizations/secrets" \
      -H "Authorization: Bearer YOUR_MIXPEEK_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "secret_name": "openai_api_key",
        "secret_value": "sk-proj-abc123..."
      }'
    ```
  </Step>

  <Step title="Reference the secret in your stage">
    Use the `api_key` parameter with template syntax:

    ```json theme={null}
    {
      "stage_name": "llm_enrich",
      "stage_type": "enrich",
      "config": {
        "stage_id": "llm_enrich",
        "parameters": {
          "provider": "openai",
          "model_name": "gpt-4o-mini",
          "prompt": "Extract key entities from this document.",
          "output_field": "entities",
          "api_key": "{{secrets.openai_api_key}}"
        }
      }
    }
    ```
  </Step>
</Steps>
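
If you prefer to script the first step instead of using curl, the same request can be built with Python's standard library. The sketch below only mirrors the curl command shown in that step (same URL, headers, and JSON body). `build_secret_request` is a hypothetical helper, and the function builds the request without sending it.

```python
import json
import urllib.request


def build_secret_request(mixpeek_api_key: str, secret_name: str,
                         secret_value: str) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl command in step 1."""
    body = json.dumps({"secret_name": secret_name, "secret_value": secret_value})
    return urllib.request.Request(
        "https://api.mixpeek.com/v1/organizations/secrets",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {mixpeek_api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To send it, pass the request to urllib.request.urlopen(...).
# Reading both keys from environment variables keeps them out of source control.
```

The stage then references the stored value by name, for example `{{secrets.openai_api_key}}`, so the raw key never appears in a retriever definition.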

### BYOK Configuration Example

<CodeGroup>
  ```json OpenAI BYOK theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o-mini",
        "prompt": "Summarize this document in 2-3 sentences.",
        "output_field": "summary",
        "api_key": "{{secrets.openai_api_key}}"
      }
    }
  }
  ```

  ```json Anthropic BYOK theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "anthropic",
        "model_name": "claude-3-haiku-20240307",
        "prompt": "Extract the main topics from this content.",
        "output_field": "topics",
        "api_key": "{{secrets.anthropic_api_key}}"
      }
    }
  }
  ```

  ```json Google BYOK theme={null}
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "google",
        "model_name": "gemini-3.1-flash-lite",
        "prompt": "Classify the sentiment of this text.",
        "output_field": "sentiment",
        "api_key": "{{secrets.google_api_key}}"
      }
    }
  }
  ```
</CodeGroup>

### Supported Providers

| Provider  | Secret Name Example | Models                                         |
| --------- | ------------------- | ---------------------------------------------- |
| OpenAI    | `openai_api_key`    | gpt-4o, gpt-4o-mini                            |
| Anthropic | `anthropic_api_key` | claude-3-haiku, claude-3-sonnet, claude-3-opus |
| Google    | `google_api_key`    | gemini-3.1-flash-lite, gemini-2.5-pro          |

<Note>
  When `api_key` is not specified, the stage uses Mixpeek's default API keys and usage is charged to your Mixpeek account.
</Note>

## Custom Enrichment Model (BYO Plugin)

Use your own enrichment model deployed as a [custom extractor](/processing/custom-extractors) instead of a hosted LLM provider. Set `feature_uri` to route enrichment through your extractor's inference endpoint.

### Parameters

| Parameter     | Type   | Default | Description                                                                       |
| ------------- | ------ | ------- | --------------------------------------------------------------------------------- |
| `feature_uri` | string | `null`  | Feature URI of a custom enrichment plugin. Overrides `provider`/`model` when set. |

Your plugin must accept `{prompt: str, document: dict}` and return `{text: str}`.
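
As an illustration of that contract, the Python sketch below types the request and response and implements a toy handler. Everything here apart from the field names `prompt`, `document`, and `text` is an assumption: the class and function names are hypothetical, and how a handler is registered with a custom extractor is not covered on this page.

```python
from typing import TypedDict


class EnrichRequest(TypedDict):
    prompt: str
    document: dict


class EnrichResponse(TypedDict):
    text: str


def summarize(request: EnrichRequest) -> EnrichResponse:
    """Toy handler: returns the first sentence of the document's content field."""
    content = str(request["document"].get("content", ""))
    first_sentence = content.split(". ")[0].strip()
    return {"text": first_sentence}


print(summarize({
    "prompt": "Summarize",
    "document": {"content": "Apple announced a product. More details followed."},
}))  # {'text': 'Apple announced a product'}
```

A real plugin would run its own model on the prompt and document. The point of the sketch is only the input and output shape.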

### Configuration Example

```json theme={null}
{
  "stage_name": "my_enricher",
  "config": {
    "stage_id": "llm_enrich",
    "parameters": {
      "feature_uri": "mixpeek://my_summarizer@1.0.0/summarize",
      "prompt": "Summarize: {{DOC.content}}",
      "output_field": "summary"
    }
  }
}
```

<Tip>
  Set `inference_type: "generate"` in your plugin's manifest to declare compatibility with LLM stages.
</Tip>

## Error Handling

| Error                  | Behavior                         |
| ---------------------- | -------------------------------- |
| LLM timeout            | Retry once, then null result     |
| Schema validation fail | Raw text in `output_field`       |
| Rate limit             | Automatic backoff                |
| Empty content          | Skip enrichment                  |
| Invalid API key        | Error returned with auth failure |
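
Because of these fallbacks, client code that expects a structured object may also receive a raw string or a null value in the output field. The following Python sketch shows one defensive way to read it. `read_structured` is a hypothetical helper, and attempting to parse the raw text as JSON is an assumption about what may be recoverable, not documented behavior.

```python
import json
from typing import Any, Optional


def read_structured(document: dict, output_field: str) -> Optional[dict]:
    """Return the enrichment as a dict, or None when it is missing, null, or unparseable text."""
    value: Any = document.get(output_field)
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        # Raw-text fallback: the model may still have produced JSON, so try parsing it.
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return None
        return parsed if isinstance(parsed, dict) else None
    # Covers a null result after a timeout and enrichment skipped for empty content.
    return None
```

Callers can then treat `None` as "no usable extraction" and decide whether to skip the document or retry.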

## Related

* [LLM Filter](/retrieval/stages/llm-filter) - Filter using LLM evaluation
* [Taxonomy Enrich](/retrieval/stages/taxonomy-enrich) - Predefined classification
* [JSON Transform](/retrieval/stages/json-transform) - Template-based transformation
