
    Web Scraper

    Recursively crawl websites and extract multimodal content with automatic embeddings for text, code, and images


    Why do anything?

    Building a knowledge base from documentation requires crawling hundreds of pages, extracting code snippets, and making it all searchable.

    Why now?

    AI applications need comprehensive, up-to-date content. Manual scraping breaks on site updates and can't extract semantic meaning.

    Why this feature?

    Automated recursive crawling with built-in multimodal embeddings (E5-Large text, Jina Code, SigLIP vision). Handles SPAs, retries failures, and generates search-ready vectors automatically.

    How It Works

    Web Scraper is a recursive crawling system with BFS-based traversal, multimodal content extraction, and automatic embedding generation. Supports static and JavaScript-rendered sites with built-in resilience (retries, proxies, captcha handling).

    1. Crawl Configuration & Setup

    Initialize crawler with seed URL, depth limits, filtering patterns, and rendering strategy (static/JS/auto)

    2. Recursive Web Crawling

    BFS traversal from seed URL, discovering links up to max_depth, respecting include/exclude patterns, with optional semantic goal-based crawling
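    In outline, this step behaves like a standard breadth-first search over the site's link graph. A minimal sketch (the `fetch_links` callable, the toy link graph, and the pattern filtering are illustrative stand-ins, not Mixpeek internals):

    ```python
    import re
    from collections import deque

    def bfs_crawl(seed, fetch_links, max_depth=2, max_pages=50,
                  include=None, exclude=None):
        """Breadth-first crawl from seed, honoring depth/page limits and URL patterns."""
        def allowed(url):
            inc_ok = not include or any(re.search(p, url) for p in include)
            exc_ok = not (exclude and any(re.search(p, url) for p in exclude))
            return inc_ok and exc_ok

        visited, order = {seed}, []
        queue = deque([(seed, 0)])
        while queue and len(order) < max_pages:
            url, depth = queue.popleft()
            order.append(url)  # this page gets extracted/indexed
            if depth < max_depth:
                for link in fetch_links(url):
                    if link not in visited and allowed(link):
                        visited.add(link)
                        queue.append((link, depth + 1))
        return order

    # Toy in-memory link graph standing in for real HTTP fetches
    graph = {
        "https://docs.example.com": ["https://docs.example.com/api/a",
                                     "https://docs.example.com/blog/x"],
        "https://docs.example.com/api/a": ["https://docs.example.com/api/b"],
    }
    pages = bfs_crawl("https://docs.example.com", lambda u: graph.get(u, []),
                      max_depth=2, include=[r"example\.com"], exclude=[r"/blog/"])
    ```

    Here the `/blog/` link is skipped by the exclude pattern while the `/api/` pages are followed, mirroring how `include_patterns`/`exclude_patterns` scope a crawl.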

    3. Content Extraction

    Extract text content, detect and parse code blocks (with language and line numbers), discover images (with metadata), and identify downloadable assets
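    As a rough illustration of this step, a single pass with the stdlib `html.parser` can separate visible text, code blocks (with a language guess), and image metadata. This is a simplified sketch; the actual extractor's parsing rules are more involved:

    ```python
    from html.parser import HTMLParser

    class ContentExtractor(HTMLParser):
        """Collect visible text, <pre>/<code> blocks, and <img> metadata."""
        def __init__(self):
            super().__init__()
            self.text, self.code_blocks, self.images = [], [], []
            self._in_code = 0

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag in ("pre", "code"):
                self._in_code += 1
                if tag == "code":
                    # Common convention: class="language-xyz" names the language
                    lang = (a.get("class") or "").replace("language-", "")
                    self.code_blocks.append({"language": lang, "content": ""})
            elif tag == "img":
                self.images.append({"src": a.get("src"), "alt": a.get("alt")})

        def handle_endtag(self, tag):
            if tag in ("pre", "code"):
                self._in_code -= 1

        def handle_data(self, data):
            if self._in_code and self.code_blocks:
                self.code_blocks[-1]["content"] += data
            elif data.strip():
                self.text.append(data.strip())

    parser = ContentExtractor()
    parser.feed('<p>Hello</p>'
                '<pre><code class="language-python">print(1)</code></pre>'
                '<img src="a.png" alt="diagram">')
    ```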

    4. Content Chunking (Optional)

    Split large content into chunks using configurable strategy: sentences, paragraphs, words, or characters with overlap
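    A minimal sketch of the sentence strategy (greedy packing by character count with a trailing-character overlap; the extractor's real splitting rules may differ):

    ```python
    import re

    def chunk_sentences(text, chunk_size=500, chunk_overlap=50):
        """Greedily pack sentences into ~chunk_size-character chunks,
        carrying the last chunk_overlap characters into the next chunk."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        chunks, current = [], ""
        for sent in sentences:
            if current and len(current) + len(sent) + 1 > chunk_size:
                chunks.append(current)
                current = current[-chunk_overlap:]  # overlap tail
            current = (current + " " + sent).strip()
        if current:
            chunks.append(current)
        return chunks

    chunks = chunk_sentences("one. two. three.", chunk_size=8, chunk_overlap=3)
    ```

    The overlap keeps a sliver of context at each chunk boundary so embeddings of adjacent chunks do not lose cross-boundary meaning.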

    5. Document Expansion

    Create individual documents per page/chunk with full metadata, crawl depth, and parent relationships
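    The expansion step can be pictured as flattening crawled pages and their chunks into one document per unit. The field names below are illustrative, not Mixpeek's actual document schema:

    ```python
    def expand_documents(pages):
        """Flatten crawled pages (and their chunks, if any) into individual documents."""
        docs = []
        for page in pages:
            base = {
                "source_url": page["url"],
                "crawl_depth": page["depth"],
                "parent_url": page.get("parent"),  # link that discovered this page
            }
            # Unchunked pages become a single document
            chunks = page.get("chunks") or [page["text"]]
            for i, chunk in enumerate(chunks):
                docs.append({**base, "chunk_index": i, "text": chunk})
        return docs

    docs = expand_documents([
        {"url": "https://docs.example.com/a", "depth": 1,
         "parent": "https://docs.example.com", "chunks": ["part 1", "part 2"]},
    ])
    ```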

    6. Multi-Modal Embedding Generation

    Generate E5-Large (1024D) text embeddings, Jina Code (768D) embeddings for code blocks, SigLIP (768D) visual embeddings for images
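    Conceptually, each extracted content type is routed to its own embedding model. A sketch using the model names and dimensions listed above (the identifier strings are placeholders, not API values):

    ```python
    # Illustrative routing table: content type -> embedding model + dimensionality
    EMBEDDING_MODELS = {
        "text":  {"model": "e5-large",  "dims": 1024},
        "code":  {"model": "jina-code", "dims": 768},
        "image": {"model": "siglip",    "dims": 768},
    }

    def route_embeddings(items):
        """Annotate each content item with the model it would be embedded by."""
        return [{**item, **EMBEDDING_MODELS[item["type"]]} for item in items]

    routed = route_embeddings([
        {"type": "text",  "content": "Install the SDK with pip."},
        {"type": "code",  "content": "pip install mixpeek"},
        {"type": "image", "content": "architecture.png"},
    ])
    ```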

    7. Resilience & Output

    Handle failures with exponential-backoff retries, rotate proxies on errors, solve captchas if configured, and output indexed documents
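    The retry portion of this step commonly takes the shape of exponential backoff with jitter. A generic sketch, not Mixpeek's internal implementation:

    ```python
    import random
    import time

    def fetch_with_retries(fetch, url, max_retries=3, base_delay=1.0):
        """Call fetch(url), retrying on failure with exponentially growing delays."""
        for attempt in range(max_retries + 1):
            try:
                return fetch(url)
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: surface the error
                # 1x, 2x, 4x... the base delay, plus jitter to avoid thundering herds
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    ```

    A proxy-rotating variant would swap the outbound proxy inside the `except` branch before the next attempt.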

    Why This Approach

    Combines recursive crawling for comprehensive coverage, multimodal extraction for rich content capture, and automatic embedding generation for instant semantic search. Built-in resilience (retries, proxies, captcha solving) handles real-world challenges. Unlike manual scraping, this scales to thousands of pages with consistent output.

    Integration

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="YOUR_API_KEY")

    # Basic documentation crawl
    result = client.pipelines.run(
        collection_id="docs_collection",
        feature_extractor={
            "feature_extractor_name": "web_scraper",
            "version": "v1",
            "input_mappings": {
                "url": "https://docs.example.com"
            },
            "parameters": {
                "max_depth": 2,
                "max_pages": 50,
                "generate_text_embeddings": True,
                "generate_code_embeddings": True
            }
        }
    )

    # API docs with structured extraction
    result = client.pipelines.run(
        collection_id="api_docs",
        feature_extractor={
            "feature_extractor_name": "web_scraper",
            "version": "v1",
            "input_mappings": {
                "url": "https://api.example.com/docs"
            },
            "parameters": {
                "max_depth": 3,
                "max_pages": 100,
                "include_patterns": [r".*/api/.*"],
                "response_shape": {
                    "endpoint": "string",
                    "method": "string",
                    "description": "string"
                },
                "llm_provider": "openai",
                "llm_model": "gpt-4o-mini"
            }
        }
    )

    # High-volume crawl with chunking
    result = client.pipelines.run(
        collection_id="knowledge_base",
        feature_extractor={
            "feature_extractor_name": "web_scraper",
            "version": "v1",
            "input_mappings": {
                "url": "https://help.example.com"
            },
            "parameters": {
                "max_depth": 4,
                "max_pages": 500,
                "chunk_strategy": "sentences",
                "chunk_size": 500,
                "chunk_overlap": 50,
                "delay_between_requests": 0.5
            }
        }
    )

    Comparisons & Alternatives

    Resources

    This capability is referenced and used across the following resources: