
    Web Scraper

    Recursively crawl websites and extract multimodal content with automatic embeddings for text, code, and images


    Why do anything?

    Building a knowledge base from documentation requires crawling hundreds of pages, extracting code snippets, and making it all searchable.

    Why now?

    AI applications need comprehensive, up-to-date content. Manual scraping breaks on site updates and can't extract semantic meaning.

    Why this feature?

    Automated recursive crawling with built-in multimodal embeddings (E5-Large text, Jina Code, SigLIP vision). Handles SPAs, retries failures, and generates search-ready vectors automatically.

    How It Works

    Web Scraper is a recursive crawling system with BFS-based traversal, multimodal content extraction, and automatic embedding generation. Supports static and JavaScript-rendered sites with built-in resilience (retries, proxies, captcha handling).

    1. Crawl Configuration & Setup

    Initialize crawler with seed URL, depth limits, filtering patterns, and rendering strategy (static/JS/auto)

    2. Recursive Web Crawling

    BFS traversal from seed URL, discovering links up to max_depth, respecting include/exclude patterns, with optional semantic goal-based crawling
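    In outline, this step behaves like a standard breadth-first search over the site's link graph. A minimal sketch (the `fetch_links` callable, the toy link graph, and the pattern filtering are illustrative stand-ins, not Mixpeek internals):

    ```python
    import re
    from collections import deque

    def bfs_crawl(seed, fetch_links, max_depth=2, max_pages=50,
                  include=None, exclude=None):
        """Breadth-first crawl from seed, honoring depth/page limits and URL patterns."""
        def allowed(url):
            inc_ok = not include or any(re.search(p, url) for p in include)
            exc_ok = not (exclude and any(re.search(p, url) for p in exclude))
            return inc_ok and exc_ok

        visited, order = {seed}, []
        queue = deque([(seed, 0)])
        while queue and len(order) < max_pages:
            url, depth = queue.popleft()
            order.append(url)  # this page gets extracted/indexed
            if depth < max_depth:
                for link in fetch_links(url):
                    if link not in visited and allowed(link):
                        visited.add(link)
                        queue.append((link, depth + 1))
        return order

    # Toy in-memory link graph standing in for real HTTP fetches
    graph = {
        "https://docs.example.com": ["https://docs.example.com/api/a",
                                     "https://docs.example.com/blog/x"],
        "https://docs.example.com/api/a": ["https://docs.example.com/api/b"],
    }
    pages = bfs_crawl("https://docs.example.com", lambda u: graph.get(u, []),
                      max_depth=2, include=[r"example\.com"], exclude=[r"/blog/"])
    ```

    Here the `/blog/` link is skipped by the exclude pattern while the `/api/` pages are followed, mirroring how `include_patterns`/`exclude_patterns` scope a crawl.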

    3. Content Extraction

    Extract text content, detect and parse code blocks (with language and line numbers), discover images (with metadata), and identify downloadable assets
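    As a rough illustration of this step, a single pass with the stdlib `html.parser` can separate visible text, code blocks (with a language guess), and image metadata. This is a simplified sketch; the actual extractor's parsing rules are more involved:

    ```python
    from html.parser import HTMLParser

    class ContentExtractor(HTMLParser):
        """Collect visible text, <pre>/<code> blocks, and <img> metadata."""
        def __init__(self):
            super().__init__()
            self.text, self.code_blocks, self.images = [], [], []
            self._in_code = 0

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag in ("pre", "code"):
                self._in_code += 1
                if tag == "code":
                    # Common convention: class="language-xyz" names the language
                    lang = (a.get("class") or "").replace("language-", "")
                    self.code_blocks.append({"language": lang, "content": ""})
            elif tag == "img":
                self.images.append({"src": a.get("src"), "alt": a.get("alt")})

        def handle_endtag(self, tag):
            if tag in ("pre", "code"):
                self._in_code -= 1

        def handle_data(self, data):
            if self._in_code and self.code_blocks:
                self.code_blocks[-1]["content"] += data
            elif data.strip():
                self.text.append(data.strip())

    parser = ContentExtractor()
    parser.feed('<p>Hello</p>'
                '<pre><code class="language-python">print(1)</code></pre>'
                '<img src="a.png" alt="diagram">')
    ```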

    4. Content Chunking (Optional)

    Split large content into chunks using configurable strategy: sentences, paragraphs, words, or characters with overlap
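    A minimal sketch of the sentence strategy (greedy packing by character count with a trailing-character overlap; the extractor's real splitting rules may differ):

    ```python
    import re

    def chunk_sentences(text, chunk_size=500, chunk_overlap=50):
        """Greedily pack sentences into ~chunk_size-character chunks,
        carrying the last chunk_overlap characters into the next chunk."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        chunks, current = [], ""
        for sent in sentences:
            if current and len(current) + len(sent) + 1 > chunk_size:
                chunks.append(current)
                current = current[-chunk_overlap:]  # overlap tail
            current = (current + " " + sent).strip()
        if current:
            chunks.append(current)
        return chunks

    chunks = chunk_sentences("one. two. three.", chunk_size=8, chunk_overlap=3)
    ```

    The overlap keeps a sliver of context at each chunk boundary so embeddings of adjacent chunks do not lose cross-boundary meaning.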

    5. Document Expansion

    Create individual documents per page/chunk with full metadata, crawl depth, and parent relationships
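    The expansion step can be pictured as flattening crawled pages and their chunks into one document per unit. The field names below are illustrative, not Mixpeek's actual document schema:

    ```python
    def expand_documents(pages):
        """Flatten crawled pages (and their chunks, if any) into individual documents."""
        docs = []
        for page in pages:
            base = {
                "source_url": page["url"],
                "crawl_depth": page["depth"],
                "parent_url": page.get("parent"),  # link that discovered this page
            }
            # Unchunked pages become a single document
            chunks = page.get("chunks") or [page["text"]]
            for i, chunk in enumerate(chunks):
                docs.append({**base, "chunk_index": i, "text": chunk})
        return docs

    docs = expand_documents([
        {"url": "https://docs.example.com/a", "depth": 1,
         "parent": "https://docs.example.com", "chunks": ["part 1", "part 2"]},
    ])
    ```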

    6. Multi-Modal Embedding Generation

    Generate E5-Large (1024D) text embeddings, Jina Code (768D) embeddings for code blocks, SigLIP (768D) visual embeddings for images
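    Conceptually, each extracted content type is routed to its own embedding model. A sketch using the model names and dimensions listed above (the identifier strings are placeholders, not API values):

    ```python
    # Illustrative routing table: content type -> embedding model + dimensionality
    EMBEDDING_MODELS = {
        "text":  {"model": "e5-large",  "dims": 1024},
        "code":  {"model": "jina-code", "dims": 768},
        "image": {"model": "siglip",    "dims": 768},
    }

    def route_embeddings(items):
        """Annotate each content item with the model it would be embedded by."""
        return [{**item, **EMBEDDING_MODELS[item["type"]]} for item in items]

    routed = route_embeddings([
        {"type": "text",  "content": "Install the SDK with pip."},
        {"type": "code",  "content": "pip install mixpeek"},
        {"type": "image", "content": "architecture.png"},
    ])
    ```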

    7. Resilience & Output

    Handle failures with exponential-backoff retries, rotate proxies on errors, solve captchas if configured, and output indexed documents
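    The retry portion of this step commonly takes the shape of exponential backoff with jitter. A generic sketch, not Mixpeek's internal implementation:

    ```python
    import random
    import time

    def fetch_with_retries(fetch, url, max_retries=3, base_delay=1.0):
        """Call fetch(url), retrying on failure with exponentially growing delays."""
        for attempt in range(max_retries + 1):
            try:
                return fetch(url)
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: surface the error
                # 1x, 2x, 4x... the base delay, plus jitter to avoid thundering herds
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    ```

    A proxy-rotating variant would swap the outbound proxy inside the `except` branch before the next attempt.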

    Why This Approach

    Combines recursive crawling for comprehensive coverage, multimodal extraction for rich content capture, and automatic embedding generation for instant semantic search. Built-in resilience (retries, proxies, captcha solving) handles real-world challenges. Unlike manual scraping, this scales to thousands of pages with consistent output.

    Integration

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="YOUR_API_KEY")

    # Basic documentation crawl
    result = client.pipelines.run(
        collection_id="docs_collection",
        feature_extractor={
            "feature_extractor_name": "web_scraper",
            "version": "v1",
            "input_mappings": {
                "url": "https://docs.example.com"
            },
            "parameters": {
                "max_depth": 2,
                "max_pages": 50,
                "generate_text_embeddings": True,
                "generate_code_embeddings": True
            }
        }
    )

    # API docs with structured extraction
    result = client.pipelines.run(
        collection_id="api_docs",
        feature_extractor={
            "feature_extractor_name": "web_scraper",
            "version": "v1",
            "input_mappings": {
                "url": "https://api.example.com/docs"
            },
            "parameters": {
                "max_depth": 3,
                "max_pages": 100,
                "include_patterns": [r".*/api/.*"],
                "response_shape": {
                    "endpoint": "string",
                    "method": "string",
                    "description": "string"
                },
                "llm_provider": "openai",
                "llm_model": "gpt-4o-mini"
            }
        }
    )

    # High-volume crawl with chunking
    result = client.pipelines.run(
        collection_id="knowledge_base",
        feature_extractor={
            "feature_extractor_name": "web_scraper",
            "version": "v1",
            "input_mappings": {
                "url": "https://help.example.com"
            },
            "parameters": {
                "max_depth": 4,
                "max_pages": 500,
                "chunk_strategy": "sentences",
                "chunk_size": 500,
                "chunk_overlap": 50,
                "delay_between_requests": 0.5
            }
        }
    )

    Comparisons & Alternatives

    Resources

    This capability is referenced and used across the following resources: