Performance
html-to-markdown is designed for high-throughput HTML to Markdown conversion. The Rust core delivers performance characteristics that are difficult to achieve in interpreted languages, making it suitable for batch processing, real-time pipelines, and resource-constrained environments.
Benchmarks
Throughput
The core conversion engine processes HTML at 150--280 MB/s on modern hardware (Apple M-series, Intel 12th gen+), depending on HTML complexity. Input that contains no tags at all takes a separate fast path and reaches roughly 400 MB/s:
| Document Type | Throughput | Notes |
|---|---|---|
| Simple HTML (paragraphs, headings) | ~280 MB/s | Fast-path optimization for simple structures |
| Mixed content (lists, links, images) | ~200 MB/s | Typical web page content |
| Complex HTML (nested tables, forms) | ~150 MB/s | Deep nesting and table reconstruction |
| Plain text (no HTML tags) | ~400 MB/s | Bypasses parser entirely via fast path |
Benchmark environment
Benchmarks use Criterion.rs with statistical analysis. Results measured on Apple M2 Pro, single-threaded. Real-world performance may vary based on hardware, document structure, and enabled features.
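To turn the throughput figures into a rough latency estimate, divide the document size by the throughput. The sketch below does that arithmetic. The helper name `estimated_ms` is illustrative and not part of the library, and it assumes decimal units (1 MB = 1,000,000 bytes) and no fixed per-call overhead.

```python
def estimated_ms(size_bytes: int, throughput_mb_per_s: float) -> float:
    """Estimate conversion time in milliseconds from size and throughput.

    Assumes decimal units (1 MB = 1_000_000 bytes) and ignores fixed
    per-call overhead such as FFI costs.
    """
    seconds = size_bytes / (throughput_mb_per_s * 1_000_000)
    return seconds * 1_000


# A 100 KB mixed-content page at ~200 MB/s takes about half a millisecond.
print(round(estimated_ms(100_000, 200), 3))  # 0.5
```

The same calculation with the complex-HTML figure (~150 MB/s) gives roughly 0.67 ms for a document of that size.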
Comparison with Python Alternatives
When accessed through the Python binding, html-to-markdown is 10--80x faster than pure-Python alternatives, depending on the library and the document size. For a 100 KB document the measured gap is 16--80x:
| Library | 100 KB Document | Relative Speed |
|---|---|---|
| html-to-markdown (PyO3) | ~0.5 ms | 1x (baseline) |
| markdownify | ~8 ms | ~16x slower |
| html2text | ~15 ms | ~30x slower |
| inscriptis | ~40 ms | ~80x slower |
The gap widens with larger documents because the Rust core's memory allocation patterns scale more efficiently than Python's garbage-collected heap.
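To check figures like these on your own documents, a small timing harness is enough. The sketch below is one possible approach, not the project's benchmark suite: `time_converters` is a hypothetical helper that takes any mapping of names to `str -> str` callables and reports the best of several runs.

```python
import time
from typing import Callable, Dict


def time_converters(
    html: str,
    converters: Dict[str, Callable[[str], str]],
    repeats: int = 20,
) -> Dict[str, float]:
    """Return the best-of-N wall-clock time in milliseconds per converter."""
    results = {}
    for name, fn in converters.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(html)
            best = min(best, time.perf_counter() - start)
        results[name] = best * 1_000
    return results


if __name__ == "__main__":
    # Stand-in converter so the sketch runs without any third-party package.
    sample = "<p>Hello <b>world</b></p>" * 2_000
    for name, ms in time_converters(sample, {"identity": lambda s: s}).items():
        print(f"{name}: {ms:.3f} ms")
```

With the packages installed, the mapping could hold `convert` from html_to_markdown alongside `markdownify.markdownify`, `html2text.html2text`, and `inscriptis.get_text`. Taking the minimum rather than the mean reduces noise from other processes on the machine.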
Why It Is Fast
html5ever Parser
The HTML parser is html5ever, originally built for Mozilla's Servo browser engine. It is compiled to native code and implements the WHATWG HTML5 spec in a streaming, zero-copy manner where possible.
Single-Pass Architecture
All operations -- Markdown generation, metadata extraction, inline image collection, and visitor callbacks -- happen in a single depth-first traversal of the DOM tree. There is no second pass over the tree, no intermediate representation between the DOM and the Markdown output, and no re-parsing.
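The single-pass idea can be illustrated with a toy tree walker. This is a simplified sketch, not the library's implementation: the `Node` and `Result` types are hypothetical, only a handful of tags are handled, and the tree is built by hand rather than by html5ever. The point is that Markdown output, metadata, and the image list are all filled in during the same traversal.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical, simplified DOM node (not html5ever's representation)."""
    tag: str                      # "" marks a text node
    text: str = ""
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


@dataclass
class Result:
    markdown: List[str] = field(default_factory=list)
    title: str = ""
    images: List[str] = field(default_factory=list)


def walk(node: Node, out: Result) -> None:
    """One depth-first traversal that feeds every collector at once."""
    if node.tag == "":
        out.markdown.append(node.text)
        return
    if node.tag == "img":
        src = node.attrs.get("src", "")
        out.images.append(src)                               # image collection
        out.markdown.append(f"![{node.attrs.get('alt', '')}]({src})")
        return
    if node.tag == "title":
        out.title = "".join(c.text for c in node.children)   # metadata
        return
    if node.tag == "h1":
        out.markdown.append("# ")
    for child in node.children:
        walk(child, out)
    if node.tag in ("h1", "p"):
        out.markdown.append("\n\n")


doc = Node("html", children=[
    Node("title", children=[Node("", "Demo")]),
    Node("h1", children=[Node("", "Hello")]),
    Node("p", children=[
        Node("", "Logo: "),
        Node("img", attrs={"src": "a.png", "alt": "logo"}),
    ]),
])
result = Result()
walk(doc, result)
print("".join(result.markdown))
```

After the single call to `walk`, `result.title` holds the page title and `result.images` holds the image sources, with no further pass over the tree.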
Fast Path for Plain Text
When the input contains no < characters, the parser is bypassed entirely. The text goes through entity decoding, whitespace normalization, and optional escaping -- all of which are simple string operations.
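The check described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Rust code: `convert_with_fast_path` and its `full_convert` parameter are hypothetical names, the standard library's `html.unescape` stands in for entity decoding, and the optional escaping step is left out.

```python
import html
import re

_WHITESPACE = re.compile(r"\s+")


def convert_with_fast_path(text: str, full_convert) -> str:
    """Skip the parser when the input cannot contain any tags."""
    if "<" not in text:
        decoded = html.unescape(text)                  # entity decoding
        return _WHITESPACE.sub(" ", decoded).strip()   # whitespace normalization
    return full_convert(text)                          # parser-based path


print(convert_with_fast_path("Fish &amp;   chips", full_convert=lambda s: s))
# Fish & chips
```

The test for `<` is a single linear scan, so the cost of deciding which path to take is small compared with parsing.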
Minimal Allocations
The converter pre-allocates output buffers based on input size and reuses them across elements. String operations use Cow<str> (clone-on-write) to avoid unnecessary copying when the input can be used directly.
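Rust's Cow has no direct Python equivalent, but the underlying idea can be approximated: hand back the original string when nothing has to change, and build a new one only when it does. The sketch below is an analogy rather than the library's code, and `escape_if_needed` is a hypothetical helper.

```python
import re

_SPECIAL = re.compile(r"([*_])")


def escape_if_needed(text: str) -> str:
    """Return the input object untouched unless it needs escaping."""
    if _SPECIAL.search(text) is None:
        return text                      # "borrowed": no new string is built
    return _SPECIAL.sub(r"\\\1", text)   # "owned": allocate only when required


plain = "no special characters"
assert escape_if_needed(plain) is plain
print(escape_if_needed("snake_case"))    # snake\_case
```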
Compiled to Native Code
All language bindings (Python, TypeScript, Ruby, PHP, etc.) call directly into compiled Rust code. There is no interpretation overhead for the core conversion logic -- only thin FFI wrapper costs at the language boundary.
Memory Efficiency
Predictable Memory Usage
Memory consumption is proportional to:
- Input size: The DOM tree holds a reference-counted representation of the parsed HTML
- Tree depth: Stack usage grows with nesting depth (bounded by recursion limits)
- Output size: The Markdown output buffer is pre-allocated
For a typical web page (50--200 KB HTML), peak memory usage is approximately 2--3x the input size.
No Unbounded Buffers
- Structured data extraction is size-limited (max_structured_data_size, default 100 KB) to prevent memory exhaustion from large JSON-LD blocks
- Inline image extraction has configurable limits on the number of images collected
- The wrapper module processes output in chunks rather than buffering the entire document
Streaming Strategies
For very large documents or high-throughput pipelines, consider these approaches:
Batch Processing
Process multiple documents in parallel using your language's concurrency primitives. The converter is thread-safe and has no global state, so concurrent calls need no locking.
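In Python, a thread pool is one way to apply this. The sketch below is a minimal example: `convert_many` is a hypothetical helper that accepts the conversion function as a parameter, so it runs here with a stand-in and works unchanged with the real `convert`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def convert_many(
    documents: Iterable[str],
    convert: Callable[[str], str],
    max_workers: int = 8,
) -> List[str]:
    """Convert documents concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert, documents))


# Stand-in converter so the sketch runs on its own.
print(convert_many(["<p>a</p>", "<p>b</p>"], convert=str.upper))
```

Whether threads give a real speedup in Python depends on whether the binding releases the GIL during conversion, which this page does not state. If throughput does not scale with threads, a process pool is the safer choice.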
Chunked Input
For extremely large HTML files (100+ MB), consider splitting the HTML into logical sections before conversion. The converter handles fragments gracefully -- you do not need to provide a complete <html> document.
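One simple way to find logical sections is to cut the document in front of each top-level heading. The sketch below is a deliberately naive, regex-based illustration with a hypothetical helper name. It assumes the headings are not nested inside elements that must stay intact, such as tables or lists.

```python
import re
from typing import List

# Zero-width split point in front of every <h1> or <h2> opening tag.
_SECTION_START = re.compile(r"(?=<h[12][\s>])", re.IGNORECASE)


def split_into_sections(html: str) -> List[str]:
    """Naively split HTML into fragments that each start at a heading."""
    return [part for part in _SECTION_START.split(html) if part.strip()]


page = "<h1>Intro</h1><p>One</p><h2>Details</h2><p>Two</p>"
for fragment in split_into_sections(page):
    print(fragment)
```

Each fragment can then be converted separately and the resulting Markdown joined in order. For documents with irregular structure, splitting with a real HTML parser is more robust than a regular expression.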
Process Pool
For CPU-bound batch workloads in Python, use ProcessPoolExecutor to bypass the GIL and utilize multiple cores:
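from concurrent.futures import ProcessPoolExecutor
from html_to_markdown import convert

if __name__ == "__main__":  # needed where workers are spawned (Windows, macOS)
    with ProcessPoolExecutor(max_workers=4) as pool:
        # large_document_list: a list of HTML strings prepared by the caller
        results = list(pool.map(convert, large_document_list))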
Optimization Tips
1. Disable Unused Features
If you do not need metadata extraction, visitor callbacks, or inline image extraction, disable those features to reduce overhead. In Rust, use feature flags. In language bindings, simply avoid calling convert_with_metadata() or convert_with_visitor() -- the plain convert() function skips all optional collectors.
2. Skip Wrapping When Not Needed
Text wrapping (wrap: true) adds a post-processing pass over the output. If your downstream consumer handles line wrapping, turn it off (wrap: false).
3. Use the Right Heading Style
ATX headings (# Heading) are slightly faster to generate than Setext/underlined headings (Heading\n=======) because they do not require measuring the heading text length.
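The difference is easy to see in a sketch of the two generators. These are illustrative Python functions, not the library's code: the ATX form needs only the heading level, while the Setext form has to size the underline from the heading text.

```python
def atx_heading(text: str, level: int) -> str:
    """ATX: a fixed prefix, independent of the heading text length."""
    return f"{'#' * level} {text}"


def setext_heading(text: str, level: int) -> str:
    """Setext: the underline is sized from the heading text."""
    underline_char = "=" if level == 1 else "-"   # Setext covers levels 1 and 2 only
    return f"{text}\n{underline_char * len(text)}"


print(atx_heading("Heading", 1))
print(setext_heading("Heading", 1))
```

Setext syntax only exists for the first two heading levels, so deeper headings need the ATX form regardless of the chosen style.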
4. Avoid Unnecessary Escaping
The escape_ascii option escapes all ASCII punctuation for strict CommonMark compliance tests. It is rarely needed in production and adds overhead:
from html_to_markdown import ConversionOptions

# Only enable what you actually need
options = ConversionOptions(
    escape_asterisks=False,    # Default
    escape_underscores=False,  # Default
    escape_misc=False,         # Default
    escape_ascii=False,        # Default -- leave this off
)
5. Preprocessing for Web Content
When converting scraped web pages, enable preprocessing to strip navigation, ads, and boilerplate before conversion. This reduces the HTML size and produces cleaner output:
from html_to_markdown import ConversionOptions, PreprocessingOptions

options = ConversionOptions(
    preprocessing=PreprocessingOptions(
        enabled=True,
        preset="aggressive",
    )
)
6. Reuse Options Objects
In hot loops, create the ConversionOptions object once and reuse it across calls to avoid repeated construction:
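from html_to_markdown import ConversionOptions, convert

# Build the options once, outside the loop
options = ConversionOptions(heading_style="atx", list_indent_width=2)
for html in documents:  # documents: an iterable of HTML strings
    markdown = convert(html, options)  # Reuse options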
Profiling
For Rust development, the project includes Criterion.rs benchmarks:
# Run all benchmarks
task bench
# Run specific benchmark
cargo bench --bench conversion
# Generate flamegraph (requires cargo-flamegraph)
cargo flamegraph --bench conversion
Benchmark results are stored as CI artifacts for regression detection. A slowdown of more than 5% from the baseline triggers a CI failure.
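The 5% rule amounts to a relative-slowdown check. The sketch below shows the arithmetic only. It is not the project's CI script, and the function name and nanosecond inputs are assumptions made for illustration.

```python
def is_regression(baseline_ns: float, current_ns: float, threshold: float = 0.05) -> bool:
    """True when the current run is more than `threshold` slower than baseline."""
    slowdown = (current_ns - baseline_ns) / baseline_ns
    return slowdown > threshold


print(is_regression(1_000_000, 1_040_000))  # False: 4% slower
print(is_regression(1_000_000, 1_060_000))  # True: 6% slower
```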
Further Reading
- Architecture -- how the Rust core and bindings are structured
- Conversion Pipeline -- detailed breakdown of each processing stage
- kreuzberg -- document intelligence library that uses html-to-markdown for high-throughput HTML processing