Frequently Asked Questions

Does it remove navigation and sidebars?

Yes. The boilerplate detection algorithm identifies and removes navigation menus, sidebars, footers, and advertising elements, extracting only the main content area.
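The source does not say which signals the boilerplate detector uses. One widely used heuristic is link density: blocks whose text is mostly anchor text tend to be menus or footers. The sketch below illustrates that idea only, using Python's standard `html.parser`. The names `link_density` and `looks_like_boilerplate` and the 0.5 threshold are assumptions for illustration, not part of the Mixpeek API.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts visible text characters and how many of them sit inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not skew the ratio.
        n = len(data.strip())
        self.total_chars += n
        if self._link_depth > 0:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Return the share of text characters that are link text (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


def looks_like_boilerplate(html_fragment: str, threshold: float = 0.5) -> bool:
    """Hypothetical rule: flag a block when most of its text is link text."""
    return link_density(html_fragment) > threshold


nav = '<nav><a href="/">Home</a> <a href="/blog">Blog</a> <a href="/about">About</a></nav>'
article = (
    "<article><p>Plain text extraction keeps the words a reader came for. "
    'See the <a href="/docs">docs</a>.</p></article>'
)

print(looks_like_boilerplate(nav))
print(looks_like_boilerplate(article))
```

A real detector would likely combine several signals (tag names, text length, position on the page). Link density alone misclassifies link-heavy articles such as reference lists.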

Can I extract specific sections using CSS selectors?

Yes. Use the `css_selector` parameter to target a specific element (e.g., `article`, `.post-content`, `#main`). Only content within the matched element will be extracted.

Does it handle JavaScript-rendered content?

By default, only static HTML is processed. Set `render_js` to true to enable headless browser rendering, which captures dynamically loaded content. This adds 2-5 seconds to processing time.
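As a rough sketch, the two parameters from the answers above could be assembled into a single options dictionary. It assumes that `render_js` is passed in the same `options` dictionary as `css_selector`, which the source does not confirm, and `build_options` is a hypothetical helper rather than part of the SDK. The API call is left commented out so the snippet runs without credentials.

```python
def build_options(css_selector=None, render_js=False, remove_boilerplate=True):
    """Assemble conversion options; only set keys the caller actually needs."""
    options = {"remove_boilerplate": remove_boilerplate}
    if css_selector is not None:
        # Restrict extraction to the matched element, e.g. "article" or "#main".
        options["css_selector"] = css_selector
    if render_js:
        # Assumption: render_js sits alongside the other options.
        # The source says it adds 2-5 seconds of processing time.
        options["render_js"] = True
    return options


options = build_options(css_selector="#main", render_js=True)
print(options)

# Assumed usage with the client from the Code Examples section:
# from mixpeek import Mixpeek
# client = Mixpeek(api_key="YOUR_API_KEY")
# result = client.convert(
#     source="https://example.com/app",
#     from_format="html",
#     to_format="text",
#     options=options,
# )
```

Leaving `render_js` off by default keeps static pages on the faster path and reserves headless rendering for pages that need it.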

HTML to Text Converter

Extract clean, readable text from HTML pages by stripping tags, scripts, and styles while preserving semantic structure. The converter handles navigation removal, boilerplate detection, and main content extraction.

Max file size: 50 MB

Estimated: 1-3 sec per page

3 input formats

How It Works

Provide a URL or upload an HTML file.

Scripts, styles, navigation, and boilerplate elements are removed.

Main content is identified using readability heuristics.

Semantic structure (headings, paragraphs, lists) is preserved as plain text.

Clean text is returned with optional metadata extraction.
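The stripping and structure-preserving steps above can be approximated locally with Python's standard `html.parser`. This is a minimal sketch, not Mixpeek's implementation. The tag sets, the `html_to_text` name, and the `- ` list prefix are assumptions, and it skips fixed tag names instead of applying readability heuristics to locate the main content.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; a production extractor would be more nuanced.
SKIP_TAGS = {"head", "script", "style", "nav", "aside", "footer"}
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class TextExtractor(HTMLParser):
    """Collects text outside skipped tags, one output line per block element."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0
        self._current = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0 and tag in BLOCK_TAGS:
            self.flush()
            # Mark list items so list structure survives as plain text.
            self._prefix = "- " if tag == "li" else ""

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif self._skip_depth == 0 and tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._current.append(data)

    def flush(self):
        """Emit the buffered text as one line with collapsed whitespace."""
        text = " ".join("".join(self._current).split())
        if text:
            self.lines.append(self._prefix + text)
        self._current = []
        self._prefix = ""


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.lines)


sample = """
<html><head><title>Post</title><style>p { color: red; }</style></head>
<body>
<nav><a href="/">Home</a></nav>
<article>
<h1>Clean Text</h1>
<p>First   paragraph with <b>bold</b> text.</p>
<ul><li>One</li><li>Two</li></ul>
</article>
<footer>Copyright</footer>
<script>console.log("x");</script>
</body></html>
"""

print(html_to_text(sample))
```

Tracking a skip depth instead of a boolean flag lets nested skipped elements (a `style` inside `head`, for example) close in any order without re-enabling output too early.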

Code Examples

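from mixpeek import Mixpeek

# Authenticate with your Mixpeek API key
client = Mixpeek(api_key="YOUR_API_KEY")

# Convert a web page to plain text
result = client.convert(
    source="https://example.com/blog-post",
    from_format="html",
    to_format="text",
    options={
        "remove_boilerplate": True,   # drop navigation, sidebars, footers, ads
        "css_selector": "article",    # extract only the matched element
        "extract_metadata": True      # return page metadata with the text
    }
)

print(result.text)      # extracted plain text
print(result.metadata)  # metadata, returned because extract_metadata is set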

Use Cases

Extract article text from news websites

Clean web scraping output for NLP pipelines

Build text datasets from web crawl data

Prepare web content for embedding and indexing

Supported Input Formats

HTML

XHTML

MHTML

Quick Info

Category: data

Max File Size: 50 MB

Est. Time: 1-3 sec per page

Extractor: web-scraper



Related Converters

PDF to Text

Extract clean, structured text from PDF documents including scanned pages, multi-column layouts, headers/footers, and tables. Combines traditional parsing with OCR and layout analysis for maximum accuracy.

HTML to Structured Data

Extract structured data from web pages using a combination of CSS/XPath selectors and LLM-based extraction. Captures product details, article metadata, contact information, and custom schemas from any website.