Web Scraper
Recursively crawl websites and extract multimodal content with automatic embeddings for text, code, and images
Why do anything?
Building a knowledge base from documentation requires crawling hundreds of pages, extracting code snippets, and making it all searchable.
Why now?
AI applications need comprehensive, up-to-date content. Manual scraping breaks on site updates and can't extract semantic meaning.
Why this feature?
Automated recursive crawling with built-in multimodal embeddings (E5-Large text, Jina Code, SigLIP vision). Handles SPAs, retries failures, and generates search-ready vectors automatically.
How It Works
Web Scraper is a recursive crawling system with BFS-based traversal, multimodal content extraction, and automatic embedding generation. It supports both static and JavaScript-rendered sites, with built-in resilience (retries, proxy rotation, and captcha handling).
Crawl Configuration & Setup
Initialize crawler with seed URL, depth limits, filtering patterns, and rendering strategy (static/JS/auto)
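A minimal sketch of what this configuration might look like, reusing the parameter names from the Integration examples below (`max_depth`, `max_pages`, `include_patterns`); the `exclude_patterns` and `render_mode` keys are assumptions added for illustration, not documented names:

```python
# Illustrative crawl configuration; render_mode and exclude_patterns are assumed names.
crawl_parameters = {
    "max_depth": 2,                       # how many link hops from the seed URL
    "max_pages": 50,                      # hard cap on pages fetched
    "include_patterns": [r".*/docs/.*"],  # only follow URLs matching these regexes
    "exclude_patterns": [r".*/blog/.*"],  # skip URLs matching these regexes
    "render_mode": "auto",                # "static", "js", or "auto" (assumed values)
}
```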
Recursive Web Crawling
BFS traversal from seed URL, discovering links up to max_depth, respecting include/exclude patterns, with optional semantic goal-based crawling
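To make the traversal concrete, here is a simplified BFS sketch (not the extractor's internal code) showing how depth limits, page caps, and URL patterns bound the crawl; `fetch_page` and `extract_links` are hypothetical helpers passed in by the caller:

```python
import re
from collections import deque

def bfs_crawl(seed_url, max_depth, max_pages, include_patterns, exclude_patterns,
              fetch_page, extract_links):
    """Breadth-first crawl bounded by depth, page count, and URL patterns."""
    queue = deque([(seed_url, 0)])
    visited = {seed_url}
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch_page(url)                             # hypothetical fetch helper
        pages.append({"url": url, "depth": depth, "html": html})
        if depth >= max_depth:
            continue                                        # do not expand past max_depth
        for link in extract_links(html, base_url=url):      # hypothetical link parser
            if link in visited:
                continue
            if include_patterns and not any(re.search(p, link) for p in include_patterns):
                continue
            if any(re.search(p, link) for p in exclude_patterns):
                continue
            visited.add(link)
            queue.append((link, depth + 1))
    return pages
```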
Content Extraction
Extract text content, detect and parse code blocks (with language and line numbers), discover images (with metadata), identify downloadable assets
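An illustrative shape of what extraction yields for a single page; the field names here are assumptions for clarity, not the extractor's documented schema:

```python
# Assumed illustration of per-page extraction output; field names are hypothetical.
extracted_page = {
    "url": "https://docs.example.com/quickstart",
    "text": "Install the SDK, then configure your API key...",
    "code_blocks": [
        {"language": "bash", "content": "pip install mixpeek", "line_count": 1},
    ],
    "images": [
        {"src": "https://docs.example.com/img/arch.png", "alt": "Architecture diagram"},
    ],
    "assets": ["https://docs.example.com/files/openapi.yaml"],
}
```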
Content Chunking (Optional)
Split large content into chunks using configurable strategy: sentences, paragraphs, words, or characters with overlap
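A minimal sentence-chunking sketch with overlap, mirroring the `chunk_strategy`, `chunk_size`, and `chunk_overlap` parameters used in the Integration examples; the splitting logic here is illustrative, not the extractor's implementation:

```python
import re

def chunk_sentences(text, chunk_size=500, chunk_overlap=50):
    """Greedily pack sentences into ~chunk_size-character chunks,
    carrying the last ~chunk_overlap characters across each boundary."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = current[-chunk_overlap:]  # overlap carried into the next chunk
        current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks
```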
Document Expansion
Create individual documents per page/chunk with full metadata, crawl depth, and parent relationships
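Each page (or chunk, when chunking is enabled) becomes its own document. A hypothetical example of the resulting record, with field names assumed for illustration:

```python
# Assumed illustration of an expanded document; field names are hypothetical.
document = {
    "source_url": "https://docs.example.com/quickstart",
    "parent_url": "https://docs.example.com/",   # page that linked here
    "crawl_depth": 1,
    "chunk_index": 3,                            # present only when chunking is enabled
    "content": "Install the SDK, then configure your API key...",
    "metadata": {"title": "Quickstart", "content_type": "text"},
}
```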
Multi-Modal Embedding Generation
Generate E5-Large (1024D) text embeddings, Jina Code (768D) embeddings for code blocks, SigLIP (768D) visual embeddings for images
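Each document then carries one vector per modality. The dimensions below follow the models named above, while the field names and placeholder zero vectors are assumptions for illustration:

```python
# Assumed illustration of the embedding fields attached to each document;
# placeholder zero vectors stand in for real model output.
doc_embeddings = {
    "text_embedding": [0.0] * 1024,   # E5-Large text vector (1024-D)
    "code_embedding": [0.0] * 768,    # Jina Code vector for code blocks (768-D)
    "image_embedding": [0.0] * 768,   # SigLIP visual vector for images (768-D)
}
```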
Resilience & Output
Handle failures with exponential backoff retries, rotate proxies on errors, solve captchas if configured, output indexed documents
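A simplified retry-with-exponential-backoff sketch of the kind of resilience described; this is illustrative only, and the injected `fetch` callable and jitter choices are assumptions:

```python
import random
import time

def fetch_with_retries(fetch, url, max_retries=3, base_delay=1.0):
    """Retry a fetch with exponential backoff plus jitter; illustrative only."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise                                   # give up after the final retry
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)                           # back off before the next attempt
```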
Why This Approach
Combines recursive crawling for comprehensive coverage, multimodal extraction for rich content capture, and automatic embedding generation for instant semantic search. Built-in resilience (retries, proxies, captcha solving) handles real-world challenges. Unlike manual scraping, this scales to thousands of pages with consistent output.
Integration
```python
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Basic documentation crawl
result = client.pipelines.run(
    collection_id="docs_collection",
    feature_extractor={
        "feature_extractor_name": "web_scraper",
        "version": "v1",
        "input_mappings": {"url": "https://docs.example.com"},
        "parameters": {
            "max_depth": 2,
            "max_pages": 50,
            "generate_text_embeddings": True,
            "generate_code_embeddings": True
        }
    }
)

# API docs with structured extraction
result = client.pipelines.run(
    collection_id="api_docs",
    feature_extractor={
        "feature_extractor_name": "web_scraper",
        "version": "v1",
        "input_mappings": {"url": "https://api.example.com/docs"},
        "parameters": {
            "max_depth": 3,
            "max_pages": 100,
            "include_patterns": [r".*/api/.*"],
            "response_shape": {
                "endpoint": "string",
                "method": "string",
                "description": "string"
            },
            "llm_provider": "openai",
            "llm_model": "gpt-4o-mini"
        }
    }
)

# High-volume crawl with chunking
result = client.pipelines.run(
    collection_id="knowledge_base",
    feature_extractor={
        "feature_extractor_name": "web_scraper",
        "version": "v1",
        "input_mappings": {"url": "https://help.example.com"},
        "parameters": {
            "max_depth": 4,
            "max_pages": 500,
            "chunk_strategy": "sentences",
            "chunk_size": 500,
            "chunk_overlap": 50,
            "delay_between_requests": 0.5
        }
    }
)
```