Security¶

html-to-markdown is designed with security in mind. This guide covers the built-in safety mechanisms, how to configure them, and security considerations for different deployment scenarios.

Built-in Input Validation¶

Before any parsing occurs, html-to-markdown validates input to reject non-HTML content that could cause unexpected behavior:

Binary Data Rejection¶

The converter detects and rejects binary input by checking for:

PDF magic bytes (%PDF-) -- prevents accidentally passing PDF files
High null-byte density -- detects binary files, images, and other non-text content
Other binary signatures -- common file format headers

from html_to_markdown import convert, ConversionError

try:
    # This will raise ConversionError::InvalidInput
    markdown = convert(pdf_bytes.decode("utf-8", errors="replace"))
except ConversionError:
    print("Input is not valid HTML")

UTF-16 Recovery¶

Rather than failing on UTF-16 encoded input (which appears as a mix of text and null bytes when decoded as UTF-8), the converter detects both UTF-16 LE and UTF-16 BE encodings (with or without BOM) and transparently converts them to UTF-8 before parsing.

Null Byte Stripping¶

Stray null bytes in otherwise valid HTML (common in data from legacy systems) are silently stripped rather than causing parser failures.

HTML Parsing Safety¶

html-to-markdown uses html5ever, the WHATWG-compliant HTML5 parser originally built for Mozilla Servo. This provides:

Standards-compliant error recovery -- malformed HTML is handled the same way browsers handle it, rather than through custom (potentially vulnerable) error handling
No custom parsing -- there is no regex-based or ad-hoc HTML parsing that could be exploited with crafted input
Bounded recursion -- deeply nested HTML does not cause stack overflows

Script and Style Handling¶

By default, <script> and <style> elements are completely ignored during conversion. Their content never appears in the Markdown output.

html = """
<script>alert('xss')</script>
<style>body { display: none; }</style>
<p>Safe content</p>
"""

markdown = convert(html)
# Output: Safe content

Preserve tags and scripts

If you use preserve_tags to keep certain elements as raw HTML, be careful not to preserve <script> or <style> tags, as their content would then appear in the output.

Link Security¶

JavaScript URLs¶

html-to-markdown converts all links, including those with javascript: URLs, into Markdown link syntax. The link is preserved as-is in the Markdown output.

If your downstream Markdown renderer executes JavaScript URLs, use a visitor to filter them:

class SafeLinksVisitor:
    BLOCKED_SCHEMES = {"javascript:", "data:", "vbscript:"}

    def visit_link(self, ctx, href, text, title):
        href_lower = href.lower().strip()
        for scheme in self.BLOCKED_SCHEMES:
            if href_lower.startswith(scheme):
                return {"type": "custom", "output": text}
        return {"type": "continue"}

markdown = convert_with_visitor(html, visitor=SafeLinksVisitor())

Data URLs¶

Similarly, data: URLs in images are converted to standard Markdown image syntax. If you need to block data URIs:

class NoDataUriImages:
    def visit_image(self, ctx, src, alt, title):
        if src.lower().startswith("data:"):
            return {"type": "skip"}
        return {"type": "continue"}

Preprocessing for Untrusted HTML¶

When converting HTML from untrusted sources (web scraping, user input, email), use preprocessing to strip potentially dangerous or irrelevant content:

from html_to_markdown import ConversionOptions, PreprocessingOptions, convert

options = ConversionOptions(
    preprocessing=PreprocessingOptions(
        enabled=True,
        preset="aggressive",
    ),
    strip_tags=["script", "style", "iframe", "object", "embed"],
)

markdown = convert(untrusted_html, options)

The aggressive preset removes:

Navigation elements (<nav>, menus)
Sidebars and footers
Advertising containers
Social media widgets
Cookie consent banners
Form elements
Embedded objects

Structured Data Size Limits¶

When using metadata extraction, the max_structured_data_size configuration prevents memory exhaustion from extremely large JSON-LD blocks:

from html_to_markdown import MetadataConfig, convert_with_metadata

config = MetadataConfig(
    extract_structured_data=True,
    max_structured_data_size=50000,  # 50 KB limit (default is 100 KB)
)

markdown, metadata = convert_with_metadata(html, metadata_config=config)

Why this matters

Some web pages embed megabytes of JSON-LD data (product catalogs, event listings). Without a size limit, extracting this data could consume excessive memory.

Memory Safety¶

The Rust core of html-to-markdown enforces memory safety at the language level:

No unsafe code in the core crate -- the workspace-level Cargo.toml sets unsafe_code = "forbid"
No buffer overflows -- Rust's borrow checker prevents out-of-bounds access
No use-after-free -- ownership rules prevent dangling pointer dereferences
No data races -- thread safety is enforced at compile time

The FFI bindings (C, Go, Java, C#) do contain minimal unsafe code at the boundary between languages, which is necessary for FFI. Each unsafe block is documented with a // SAFETY: comment explaining why it is safe.

Deployment Considerations¶

Web Applications¶

When using html-to-markdown in a web application to convert user-submitted HTML:

Always sanitize output if the Markdown will be rendered back to HTML. html-to-markdown converts HTML to Markdown; it does not sanitize for XSS. Your Markdown renderer should handle sanitization.
Use preprocessing with the aggressive preset to strip irrelevant content.
Set size limits on input HTML to prevent denial-of-service via very large documents.
Use visitors to filter dangerous URL schemes.

Content Migration¶

When migrating content between CMS platforms:

Validate link integrity using metadata extraction to catalog all links before and after conversion.
Preserve important HTML using preserve_tags for elements that cannot be represented in Markdown.
Review structured data extraction to ensure SEO metadata is maintained.

Batch Processing¶

When processing large volumes of documents:

Set timeouts on conversion calls to prevent hangs on pathological input.
Monitor memory usage -- while the converter is memory-efficient, very large documents (100+ MB) can consume significant memory.
Use process isolation for untrusted input to limit the blast radius of unexpected behavior.

Security Checklist¶

For production deployments processing untrusted HTML:

Enable preprocessing with an appropriate preset
Strip <script>, <style>, <iframe>, <object>, and <embed> tags
Filter javascript:, data:, and vbscript: URL schemes via a visitor
Set max_structured_data_size for metadata extraction
Limit input size before passing to the converter
Sanitize the Markdown output before rendering back to HTML
Set conversion timeouts for user-facing applications
Keep html-to-markdown updated to receive security fixes

Reporting Security Issues¶

If you discover a security vulnerability in html-to-markdown, please report it responsibly. See the project's security policy for disclosure procedures.