# Web Scrape

> Extract and parse web content from URLs using Firecrawl

<Frame>
  <img src="https://mintcdn.com/mixpeek/TmiAqiYj-LwmWL2a/assets/retrievers/web-scrape.svg?fit=max&auto=format&n=TmiAqiYj-LwmWL2a&q=85&s=e12278d3bcf69dd335260e72367de304" alt="Web Scrape stage showing Firecrawl content extraction from URLs" width="1000" height="400" data-path="assets/retrievers/web-scrape.svg" />
</Frame>

The Web Scrape stage extracts content from web URLs using Firecrawl. It handles JavaScript rendering, content extraction, and structured parsing to add web content to your retrieval pipeline.

<Note>
  **Stage Category**: APPLY (Enriches documents with scraped content)

  **Transformation**: N documents → N documents (with web content added)
</Note>

## When to Use

| Use Case                | Description                            |
| ----------------------- | -------------------------------------- |
| **URL enrichment**      | Extract content from URLs in documents |
| **Reference expansion** | Scrape linked references for context   |
| **Content aggregation** | Pull in external content sources       |
| **Real-time content**   | Access current webpage content         |

## When NOT to Use

| Scenario                       | Recommended Alternative      |
| ------------------------------ | ---------------------------- |
| Searching the web              | `external_web_search` (Exa)  |
| Static content already indexed | Use indexed content          |
| High-volume scraping           | Pre-index content in Mixpeek |

## Parameters

| Parameter          | Type    | Default           | Description                                        |
| ------------------ | ------- | ----------------- | -------------------------------------------------- |
| `url_field`        | string  | *Required*        | Document field containing the URL to scrape        |
| `result_field`     | string  | `scraped_content` | Document field that receives the extracted content |
| `include_markdown` | boolean | `true`            | Return content as Markdown                         |
| `include_html`     | boolean | `false`           | Return raw HTML                                    |
| `include_links`    | boolean | `false`           | Extract all links on the page                      |
| `include_images`   | boolean | `false`           | Extract image URLs                                 |
| `wait_for`         | integer | `0`               | Milliseconds to wait for JavaScript rendering      |
| `timeout_ms`       | integer | `30000`           | Request timeout in milliseconds                    |
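
If you generate stage definitions in code, the parameter table can be mirrored in a small builder. The sketch below is a minimal illustration, not part of any Mixpeek SDK: `build_web_scrape_stage` and `DEFAULTS` are hypothetical names, and the validation rules are assumptions derived only from the types and defaults listed above.

```python
from __future__ import annotations

from typing import Any

# Defaults copied from the parameter table; url_field has none because it is required.
DEFAULTS: dict[str, Any] = {
    "result_field": "scraped_content",
    "include_markdown": True,
    "include_html": False,
    "include_links": False,
    "include_images": False,
    "wait_for": 0,
    "timeout_ms": 30000,
}


def build_web_scrape_stage(url_field: str, **overrides: Any) -> dict[str, Any]:
    """Return a web_scrape stage dict, emitting only non-default parameters.

    Hypothetical helper for illustration only.
    """
    if not url_field:
        raise ValueError("url_field is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Both timing parameters are integers expressed in milliseconds.
    for name in ("wait_for", "timeout_ms"):
        value = overrides.get(name, DEFAULTS[name])
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
    parameters: dict[str, Any] = {"url_field": url_field}
    parameters.update({k: v for k, v in overrides.items() if v != DEFAULTS[k]})
    return {
        "stage_name": "web_scrape",
        "stage_type": "apply",
        "config": {"stage_id": "web_scrape", "parameters": parameters},
    }
```

Calling `build_web_scrape_stage("metadata.url", wait_for=3000)` yields a stage whose `parameters` contain only `url_field` and `wait_for`, leaving every other option at its default.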

## Configuration Examples

<CodeGroup>
  ```json Basic URL Scraping theme={null}
  {
    "stage_name": "web_scrape",
    "stage_type": "apply",
    "config": {
      "stage_id": "web_scrape",
      "parameters": {
        "url_field": "metadata.source_url",
        "result_field": "page_content"
      }
    }
  }
  ```

  ```json With JavaScript Rendering theme={null}
  {
    "stage_name": "web_scrape",
    "stage_type": "apply",
    "config": {
      "stage_id": "web_scrape",
      "parameters": {
        "url_field": "metadata.url",
        "result_field": "content",
        "wait_for": 3000,
        "include_markdown": true
      }
    }
  }
  ```

  ```json Full Content Extraction theme={null}
  {
    "stage_name": "web_scrape",
    "stage_type": "apply",
    "config": {
      "stage_id": "web_scrape",
      "parameters": {
        "url_field": "metadata.reference_url",
        "result_field": "reference",
        "include_markdown": true,
        "include_links": true,
        "include_images": true
      }
    }
  }
  ```

  ```json HTML Extraction theme={null}
  {
    "stage_name": "web_scrape",
    "stage_type": "apply",
    "config": {
      "stage_id": "web_scrape",
      "parameters": {
        "url_field": "url",
        "result_field": "html_content",
        "include_html": true,
        "include_markdown": false,
        "timeout_ms": 15000
      }
    }
  }
  ```
</CodeGroup>

## Output Schema

### Markdown Output (default)

```json theme={null}
{
  "document_id": "doc_123",
  "metadata": {
    "source_url": "https://example.com/article"
  },
  "scraped_content": {
    "markdown": "# Article Title\n\nArticle content here...",
    "title": "Article Title",
    "description": "Meta description",
    "language": "en",
    "status": "success"
  }
}
```

### Full Extraction

```json theme={null}
{
  "document_id": "doc_123",
  "scraped_content": {
    "markdown": "# Article Title\n\n...",
    "html": "<html>...</html>",
    "links": [
      {"text": "Link 1", "href": "https://example.com/link1"},
      {"text": "Link 2", "href": "https://example.com/link2"}
    ],
    "images": [
      {"alt": "Image 1", "src": "https://example.com/img1.jpg"},
      {"alt": "Image 2", "src": "https://example.com/img2.png"}
    ],
    "title": "Article Title",
    "status": "success"
  }
}
```
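
Downstream code often needs only the URLs from a full extraction. The following sketch assumes the `links` and `images` shapes shown above (`href` and `src` keys); `collect_urls` is a hypothetical helper name, and de-duplication is an added convenience rather than documented stage behavior.

```python
from __future__ import annotations

from typing import Any


def collect_urls(scraped: dict[str, Any]) -> dict[str, list[str]]:
    """Return de-duplicated link and image URLs from one scraped result object."""

    def unique(items: list[dict[str, Any]] | None, key: str) -> list[str]:
        # A dict preserves insertion order, so first-seen order is kept.
        seen: dict[str, None] = {}
        for item in items or []:
            url = item.get(key)
            if url:
                seen.setdefault(url, None)
        return list(seen)

    return {
        "links": unique(scraped.get("links"), "href"),
        "images": unique(scraped.get("images"), "src"),
    }
```

Missing `links` or `images` keys, as in the default Markdown output, simply produce empty lists.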

### Error Case

```json theme={null}
{
  "document_id": "doc_123",
  "scraped_content": {
    "status": "error",
    "error": "Timeout exceeded",
    "markdown": null
  }
}
```
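
Because a failed scrape still returns the document, consumers should check `status` before reading `markdown`. The sketch below partitions results using only the fields shown in the schemas above. `split_by_status` is a hypothetical helper, and it assumes the default Markdown output and a top-level `result_field` (dotted paths are not resolved).

```python
from __future__ import annotations

from typing import Any


def split_by_status(
    documents: list[dict[str, Any]],
    result_field: str = "scraped_content",
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (succeeded, failed) maps keyed by document_id.

    succeeded maps to the scraped Markdown; failed maps to an error message.
    """
    succeeded: dict[str, str] = {}
    failed: dict[str, str] = {}
    for doc in documents:
        doc_id = doc.get("document_id", "<unknown>")
        scraped = doc.get(result_field) or {}
        if scraped.get("status") == "success" and scraped.get("markdown"):
            succeeded[doc_id] = scraped["markdown"]
        else:
            # Fall back to a generic message when the stage reported no error text.
            failed[doc_id] = scraped.get("error") or "no markdown in result"
    return succeeded, failed
```

Keeping the failures alongside their messages makes it straightforward to log them or retry only those documents in a later run.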

## Firecrawl Features

| Feature                  | Description                        |
| ------------------------ | ---------------------------------- |
| **JavaScript rendering** | Full browser rendering for SPAs    |
| **Content extraction**   | Intelligent main content detection |
| **Markdown conversion**  | Clean, structured output           |
| **Anti-bot handling**    | Bypasses common protections        |

<Tip>
  Set `wait_for` when scraping JavaScript-heavy sites. Start with 2000-3000 ms and adjust based on page complexity.
</Tip>

## Performance

| Metric                  | Value                               |
| ----------------------- | ----------------------------------- |
| **Latency**             | 1-10 s (depends on page complexity) |
| **Concurrent requests** | Up to 5 per pipeline                |
| **Default timeout**     | 30 seconds (`timeout_ms`: `30000`)  |
| **Retry behavior**      | 2 retries on failure                |

<Warning>
  Web scraping can add several seconds of latency to a pipeline run. Use it sparingly, and pre-index content that is accessed frequently.
</Warning>
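
To get a feel for how these numbers combine, the rough estimate below models URLs as being processed in waves of up to 5. This is a simplification for budgeting, not a description of the stage's actual scheduler. It also assumes that each attempt is bounded by `timeout_ms`, that 2 retries means up to 3 attempts, and that backoff delays and `wait_for` are ignored. Both function names are hypothetical.

```python
import math


def estimate_scrape_seconds(n_urls: int, per_url_seconds: float, concurrency: int = 5) -> float:
    """Estimate total time when URLs run in waves of `concurrency` (simplified model)."""
    if n_urls < 0 or concurrency < 1:
        raise ValueError("n_urls must be >= 0 and concurrency >= 1")
    waves = math.ceil(n_urls / concurrency)
    return waves * per_url_seconds


def worst_case_seconds(
    n_urls: int,
    timeout_ms: int = 30000,
    retries: int = 2,
    concurrency: int = 5,
) -> float:
    """Upper bound assuming every attempt of every URL runs until the timeout."""
    attempts = retries + 1
    return estimate_scrape_seconds(n_urls, attempts * timeout_ms / 1000, concurrency)
```

Under these assumptions, 10 URLs take about 2-20 seconds at typical latency (2 waves at 1-10 s each) and up to 180 seconds if every attempt times out (2 waves × 3 attempts × 30 s).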

## Common Pipeline Patterns

### Enrich Documents with Referenced Content

```json theme={null}
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 10 }
        ],
        "final_top_k": 10
      }
    }
  },
  {
    "stage_name": "structured_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "conditions": {
          "field": "metadata.has_url",
          "operator": "eq",
          "value": true
        }
      }
    }
  },
  {
    "stage_name": "web_scrape",
    "stage_type": "apply",
    "config": {
      "stage_id": "web_scrape",
      "parameters": {
        "url_field": "metadata.source_url",
        "result_field": "source_content",
        "wait_for": 2000
      }
    }
  }
]
```

### Scrape and Summarize

```json theme={null}
[
  {
    "stage_name": "web_scrape",
    "stage_type": "apply",
    "config": {
      "stage_id": "web_scrape",
      "parameters": {
        "url_field": "metadata.url",
        "result_field": "page_content"
      }
    }
  },
  {
    "stage_name": "json_transform",
    "stage_type": "apply",
    "config": {
      "stage_id": "json_transform",
      "parameters": {
        "template": {
          "content": "{{ DOC.page_content.markdown }}",
          "source": "{{ DOC.metadata.url }}"
        }
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "google",
        "model_name": "gemini-2.5-flash-lite",
        "prompt": "Summarize the key points from this webpage"
      }
    }
  }
]
```

## Error Handling

| Error        | Behavior                                                       |
| ------------ | -------------------------------------------------------------- |
| Invalid URL  | Sets `status: "error"`; the pipeline continues                 |
| Timeout      | Sets `status: "error"` with null content                       |
| 404/403      | Sets `status: "error"`; the HTTP status is reported in `error` |
| Rate limited | Retries with backoff                                           |
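
The stage's exact backoff schedule is not documented here. For readers unfamiliar with the pattern, the sketch below shows one common variant, capped exponential backoff, written as client-side Python. Everything in it is an assumption for illustration: the `RateLimited` exception, the function names, and the base, factor, and cap values.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical signal that the upstream service asked the caller to slow down."""


def backoff_delays(
    retries: int = 2,
    base_seconds: float = 1.0,
    factor: float = 2.0,
    cap_seconds: float = 30.0,
) -> list[float]:
    """Return one delay per retry: base, base*factor, ... capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * factor**attempt) for attempt in range(retries)]


def call_with_backoff(
    fetch: Callable[[], T],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), sleeping between attempts while it raises RateLimited."""
    for delay in delays:
        try:
            return fetch()
        except RateLimited:
            sleep(delay)
    # Final attempt: if it is still rate limited, the exception propagates.
    return fetch()
```

Production implementations usually add random jitter to each delay so that many clients do not retry at the same instant; it is omitted here to keep the schedule deterministic.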

## Rate Limits and Best Practices

1. **Batch wisely**: Limit each pipeline run to 5-10 URLs.
2. **Cache results**: Store scraped content so repeated runs do not re-fetch the same page.
3. **Respect robots.txt**: Firecrawl handles this automatically.
4. **Use timeouts**: Set `timeout_ms` to a value that fits your use case.
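
The batching and caching advice above can be combined in a small planning step that runs before any pipeline call. This is a sketch under stated assumptions: `plan_batches` is a hypothetical helper, the cache is any mapping keyed by URL (for example, a dict of previously scraped Markdown), and the batch size of 5 is taken from the lower end of the range above.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping
from typing import Any


def plan_batches(
    urls: Iterable[str],
    cache: Mapping[str, Any],
    batch_size: int = 5,
) -> list[list[str]]:
    """Drop duplicate and already-cached URLs, then split the rest into batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    # dict.fromkeys removes duplicates while preserving first-seen order.
    pending = [url for url in dict.fromkeys(urls) if url not in cache]
    return [pending[i : i + batch_size] for i in range(0, len(pending), batch_size)]
```

Each inner list can then feed one pipeline run, and successful results can be written back to the cache so later runs skip those URLs.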

## Related

* [External Web Search](/retrieval/stages/external-web-search) - Search the web (Exa)
* [API Call](/retrieval/stages/api-call) - General HTTP enrichment
* [Document Enrich](/retrieval/stages/document-enrich) - Collection joins
