Python API Reference

Python API Reference v3.5.7¶

Functions¶

convert()¶

Convert HTML to Markdown, returning a ConversionResult with content, metadata, images, and warnings.

Errors:

Returns an error if HTML parsing fails or if the input contains invalid UTF-8.

Signature:

def convert(html: str, options: ConversionOptions = None) -> ConversionResult

Parameters:

Name	Type	Required	Description
`html`	`str`	Yes	The html
`options`	`ConversionOptions \\| None`	No	The options to use

Returns: ConversionResult Errors: Raises Error.

Types¶

ConversionOptions¶

Main conversion options for HTML to Markdown conversion.

Use ConversionOptions.builder() to construct, or the default constructor for defaults.

Field	Type	Default	Description
`heading_style`	`HeadingStyle`	`HeadingStyle.ATX`	Heading style to use in Markdown output (ATX `#` or Setext underline).
`list_indent_type`	`ListIndentType`	`ListIndentType.SPACES`	How to indent nested list items (spaces or tab).
`list_indent_width`	`int`	`2`	Number of spaces (or tabs) to use for each level of list indentation.
`bullets`	`str`	`"-*+"`	Bullet character(s) to use for unordered list items (e.g. `"-"`, `"*"`).
`strong_em_symbol`	`str`	`"*"`	Character used for bold/italic emphasis markers (`*` or `_`).
`escape_asterisks`	`bool`	`False`	Escape `*` characters in plain text to avoid unintended bold/italic.
`escape_underscores`	`bool`	`False`	Escape `_` characters in plain text to avoid unintended bold/italic.
`escape_misc`	`bool`	`False`	Escape miscellaneous Markdown metacharacters (`[]()#` etc.) in plain text.
`escape_ascii`	`bool`	`False`	Escape ASCII characters that have special meaning in certain Markdown dialects.
`code_language`	`str`	`""`	Default language annotation for fenced code blocks that have no language hint.
`autolinks`	`bool`	`True`	Automatically convert bare URLs into Markdown autolinks.
`default_title`	`bool`	`False`	Emit a default title when no `<title>` tag is present.
`br_in_tables`	`bool`	`False`	Render `<br>` elements inside table cells as literal line breaks.
`compact_tables`	`bool`	`False`	Emit tables without column padding (compact GFM format). When `True`, column widths are not computed and cells are emitted with no trailing spaces. Separator rows use exactly `---` per column. Produces token-efficient output suitable for RAG / LLM contexts. Default `False` (aligned padding preserved).
`highlight_style`	`HighlightStyle`	`HighlightStyle.DOUBLE_EQUAL`	Style used for `<mark>` / highlighted text (e.g. `==text==`).
`extract_metadata`	`bool`	`True`	Populate `result.metadata` with `<head>` / `<meta>` extraction (title, description, Open Graph, Twitter Card, JSON-LD, …). Default `True`. Disabling skips the metadata pass only — table extraction into `result.tables` runs unconditionally.
`whitespace_mode`	`WhitespaceMode`	`WhitespaceMode.NORMALIZED`	Controls how whitespace sequences are normalised in the converted output. - `WhitespaceMode.Normalized` (default) — collapses consecutive whitespace characters (spaces, tabs, newlines) to a single space, matching browser rendering behaviour. - `WhitespaceMode.Strict` — preserves all whitespace exactly as it appears in the source HTML, including runs of spaces and embedded newlines. Choose `Strict` only when the source HTML uses deliberate whitespace (e.g. pre-formatted content outside `<pre>` tags). For most documents `Normalized` produces cleaner output.
`strip_newlines`	`bool`	`False`	Strip all newlines from the output, producing a single-line result.
`wrap`	`bool`	`False`	Wrap long lines at `wrap_width` characters.
`wrap_width`	`int`	`80`	Maximum output line width in characters when `wrap` is `True` (default `80`). Lines are broken at word boundaries so that no line exceeds this length. A value of `0` is treated as "no limit" — equivalent to leaving `wrap` disabled. Has no effect when `wrap` is `False`.
`convert_as_inline`	`bool`	`False`	Treat the entire document as inline content (no block-level wrappers).
`sub_symbol`	`str`	`""`	Markdown notation for subscript text (e.g. `"~"`).
`sup_symbol`	`str`	`""`	Markdown notation for superscript text (e.g. `"^"`).
`newline_style`	`NewlineStyle`	`NewlineStyle.SPACES`	How to encode hard line breaks (`<br>`) in Markdown.
`code_block_style`	`CodeBlockStyle`	`CodeBlockStyle.BACKTICKS`	Style used for fenced code blocks (backticks or tilde).
`keep_inline_images_in`	`list[str]`	`[]`	HTML tag names whose `<img>` children are kept inline instead of block.
`preprocessing`	`PreprocessingOptions`	—	Options for the HTML pre-processing pass applied before conversion begins. Pre-processing runs before the HTML is handed to the converter and can perform operations such as unwrapping redundant wrapper elements, removing tracking pixels, and normalising vendor-specific markup. See `PreprocessingOptions` for the full set of knobs. Defaults to `PreprocessingOptions.default()`, which enables the standard cleaning passes. Set individual fields on `PreprocessingOptions` (or construct via `ConversionOptions.builder`) to opt in or out of specific passes.
`encoding`	`str`	`"utf-8"`	Expected character encoding of the input HTML (default `"utf-8"`).
`debug`	`bool`	`False`	Emit debug information during conversion.
`strip_tags`	`list[str]`	`[]`	HTML tag names whose content is stripped from the output entirely.
`preserve_tags`	`list[str]`	`[]`	HTML tag names that are preserved verbatim in the output.
`skip_images`	`bool`	`False`	Skip conversion of `<img>` elements (omit images from output).
`link_style`	`LinkStyle`	`LinkStyle.INLINE`	Link rendering style (inline or reference).
`output_format`	`OutputFormat`	`OutputFormat.MARKDOWN`	Target output format (Markdown, plain text, etc.).
`include_document_structure`	`bool`	`False`	Include structured document tree in result.
`extract_images`	`bool`	`False`	Extract inline images from data URIs and SVGs.
`max_image_size`	`int`	`5242880`	Maximum decoded image size in bytes (default 5MB).
`capture_svg`	`bool`	`False`	Capture SVG elements as images.
`infer_dimensions`	`bool`	`True`	Infer image dimensions from data.
`max_depth`	`int \\| None`	`None`	Maximum DOM traversal depth. `None` means unlimited. When set, subtrees beyond this depth are silently truncated.
`exclude_selectors`	`list[str]`	`[]`	CSS selectors for elements to exclude entirely (element + all content). Unlike `strip_tags` (which removes the tag wrapper but keeps children), excluded elements and all their descendants are dropped from the output. Supports any CSS selector that `tl` supports: tag names, `.class`, `#id`, `[attribute]`, etc. Invalid selectors are silently skipped at conversion time. Example: `vec![".cookie-banner".into(), "#ad-container".into(), "[role='complementary']".into()]`
`visitor`	`VisitorHandle \\| None`	`None`	Optional visitor for custom traversal logic. When set, the visitor's callbacks are invoked for matching HTML elements during conversion, allowing custom output, skipping, or HTML preservation. See `HtmlVisitor`.

Methods¶

default()¶

Signature:

@staticmethod
def default() -> ConversionOptions

ConversionResult¶

The primary result of HTML conversion and extraction.

Contains the converted text output, optional structured document tree, metadata, extracted tables, images, and processing warnings.

Field	Type	Default	Description
`content`	`str \\| None`	`None`	Converted text output (markdown, djot, or plain text). `None` when `output_format` is set to `OutputFormat.None`, indicating extraction-only mode.
`document`	`DocumentStructure \\| None`	`None`	Structured document tree with semantic elements. Populated when `ConversionOptions.include_document_structure` is `True`. `None` otherwise (the default), which avoids the overhead of building the tree. When present, the tree mirrors the converted document: headings open `Group` sections, paragraphs and list items carry inline `TextAnnotation`s, and tables reference the same `TableGrid` data exposed in `Self.tables`. Note: this field is independent of the `metadata` feature flag. Document structure collection is always available at runtime; it is gated only by the runtime option, not by a compile-time feature.
`metadata`	`HtmlMetadata`	—	Extracted HTML metadata (title, OG, links, images, structured data).
`tables`	`list[TableData]`	`[]`	Extracted tables with structured cell data and markdown representation.
`images`	`list[str]`	`[]`	Extracted inline images (data URIs and SVGs). Populated when `extract_images` is `True` in options.
`warnings`	`list[ProcessingWarning]`	`[]`	Non-fatal processing warnings.

DocumentMetadata¶

Document-level metadata extracted from <head> and top-level elements.

Contains all metadata typically used by search engines, social media platforms, and browsers for document indexing and presentation.

Field	Type	Default	Description
`title`	`str \\| None`	`None`	Document title from `<title>` tag
`description`	`str \\| None`	`None`	Document description from `<meta name="description">` tag
`keywords`	`list[str]`	`[]`	Document keywords from `<meta name="keywords">` tag, split on commas
`author`	`str \\| None`	`None`	Document author from `<meta name="author">` tag
`canonical_url`	`str \\| None`	`None`	Canonical URL from `<link rel="canonical">` tag
`base_href`	`str \\| None`	`None`	Base URL from `<base href="">` tag for resolving relative URLs
`language`	`str \\| None`	`None`	Document language from `lang` attribute
`text_direction`	`TextDirection \\| None`	`None`	Document text direction from `dir` attribute
`open_graph`	`dict[str, str]`	`{}`	Open Graph metadata (og:* properties) for social media Keys like "title", "description", "image", "url", etc.
`twitter_card`	`dict[str, str]`	`{}`	Twitter Card metadata (twitter:* properties) Keys like "card", "site", "creator", "title", "description", "image", etc.
`meta_tags`	`dict[str, str]`	`{}`	Additional meta tags not covered by specific fields Keys are meta name/property attributes, values are content

DocumentNode¶

A single node in the document tree.

Field	Type	Default	Description
`id`	`str`	—	Deterministic node identifier.
`content`	`NodeContent`	—	The semantic content of this node.
`parent`	`int \\| None`	`None`	Index of the parent node (None for root nodes).
`children`	`list[int]`	`/* serde(default) */`	Indices of child nodes in reading order.
`annotations`	`list[TextAnnotation]`	`/* serde(default) */`	Inline formatting annotations (bold, italic, links, etc.) with byte offsets into the text.
`attributes`	`dict[str, str] \\| None`	`None`	Format-specific attributes preserved from the source HTML element. Keys are lowercased attribute names as they appear in the HTML (e.g. `"class"`, `"id"`, `"data-foo"`). Values are the raw attribute strings, copied verbatim from the source — no HTML entity decoding is applied here. The map is `None` when no attributes are present (omitted entirely in serialized output). Not every HTML attribute is preserved: only attributes that carry semantic or structural significance for the node type are collected. For example, heading nodes capture the `"id"` attribute for anchor linking; other element-level attributes may be silently dropped.

DocumentStructure¶

A structured document tree representing the semantic content of an HTML document.

Uses a flat node array with index-based parent/child references for efficient traversal.

Field	Type	Default	Description
`nodes`	`list[DocumentNode]`	—	All nodes in document reading order.
`source_format`	`str \\| None`	`None`	The source format (always "html" for this library).

GridCell¶

A single cell in a table grid.

Field	Type	Default	Description
`content`	`str`	—	The text content of the cell.
`row`	`int`	—	0-indexed row position.
`col`	`int`	—	0-indexed column position.
`row_span`	`int`	`/* serde(default) */`	Number of rows this cell spans (default 1).
`col_span`	`int`	`/* serde(default) */`	Number of columns this cell spans (default 1).
`is_header`	`bool`	`/* serde(default) */`	Whether this is a header cell (`<th>`).

HeaderMetadata¶

Header element metadata with hierarchy tracking.

Captures heading elements (h1-h6) with their text content, identifiers, and position in the document structure.

Field	Type	Default	Description
`level`	`int`	—	Header level: 1 (h1) through 6 (h6)
`text`	`str`	—	Normalized text content of the header
`id`	`str \\| None`	`None`	HTML id attribute if present
`depth`	`int`	—	Document tree depth at the header element
`html_offset`	`int`	—	Byte offset in original HTML document

Methods¶

is_valid()¶

Validate that the header level is within valid range (1-6).

Returns:

True if level is 1-6, False otherwise.

Signature:

def is_valid(self) -> bool

HtmlMetadata¶

Comprehensive metadata extraction result from HTML document.

Contains all extracted metadata types in a single structure, suitable for serialization and transmission across language boundaries.

Field	Type	Default	Description
`document`	`DocumentMetadata`	—	Document-level metadata (title, description, canonical, etc.)
`headers`	`list[HeaderMetadata]`	`[]`	Extracted header elements with hierarchy
`links`	`list[LinkMetadata]`	`[]`	Extracted hyperlinks with type classification
`images`	`list[ImageMetadata]`	`[]`	Extracted images with source and dimensions
`structured_data`	`list[StructuredData]`	`[]`	Extracted structured data blocks

HtmlVisitor¶

Visitor for HTML→Markdown conversion.

Provide a visitor object whose methods customize the conversion behavior for any HTML element type. Override only the methods you care about; unimplemented methods default to Continue (emit the standard rendering).

Each callback returns one of:

Continue (the default) — keep the standard rendering.
Skip — drop the element from the output entirely.
PreserveHtml — pass the original HTML through verbatim.
Custom(text) — replace the rendering with text.
Error(message) — abort conversion with message.

Language idioms. In Rust, return one of the VisitResult variants directly. In Python, Ruby, JavaScript/TypeScript, and other duck-typed bindings, define a plain class (no base class required) and return either a string ("continue", "skip", "preserve_html") or a tagged map ({"custom": "..."}, {"error": "..."}) — the binding converts the return value to the corresponding VisitResult variant automatically.

Method Naming Convention¶

visit_*_start: Called before entering an element (pre-order traversal)
visit_*_end: Called after exiting an element (post-order traversal)
visit_*: Called for specific element types (e.g., visit_link, visit_image)

Execution Order¶

For a typical element like <div><p>text</p></div>:

visit_element_start for <div>
visit_element_start for <p>
visit_text for "text"
visit_element_end for <p>
visit_element_end for </div>

Performance Notes¶

visit_text is the most frequently called method (~100+ times per document)
Return Continue quickly for elements you don't need to customize
Avoid heavy computation in visitor methods; consider caching if needed

Methods¶

visit_text()¶

Visit text nodes (most frequent callback - ~100+ per document).

Signature:

def visit_text(self, ctx: NodeContext, text: str) -> VisitResult

visit_element_start()¶

Called before entering any element.

This is the first callback invoked for every HTML element, allowing visitors to implement generic element handling before tag-specific logic.

Signature:

def visit_element_start(self, ctx: NodeContext) -> VisitResult

visit_element_end()¶

Called after exiting any element.

Receives the default markdown output that would be generated. Visitors can inspect or replace this output.

Signature:

def visit_element_end(self, ctx: NodeContext, output: str) -> VisitResult

visit_link()¶

Visit anchor links <a href="../...">.

Signature:

def visit_link(self, ctx: NodeContext, href: str, text: str, title: str) -> VisitResult

visit_image()¶

Visit images <img src="../...">.

Signature:

def visit_image(self, ctx: NodeContext, src: str, alt: str, title: str) -> VisitResult

visit_heading()¶

Visit heading elements <h1> through <h6>.

Signature:

def visit_heading(self, ctx: NodeContext, level: int, text: str, id: str) -> VisitResult

visit_code_block()¶

Visit code blocks <pre><code>.

Signature:

def visit_code_block(self, ctx: NodeContext, lang: str, code: str) -> VisitResult

visit_code_inline()¶

Visit inline code <code>.

Signature:

def visit_code_inline(self, ctx: NodeContext, code: str) -> VisitResult

visit_list_item()¶

Visit list items <li>.

Signature:

def visit_list_item(self, ctx: NodeContext, ordered: bool, marker: str, text: str) -> VisitResult

visit_list_start()¶

Called before processing a list <ul> or <ol>.

Signature:

def visit_list_start(self, ctx: NodeContext, ordered: bool) -> VisitResult

visit_list_end()¶

Called after processing a list </ul> or </ol>.

Signature:

def visit_list_end(self, ctx: NodeContext, ordered: bool, output: str) -> VisitResult

visit_table_start()¶

Called before processing a table <table>.

Signature:

def visit_table_start(self, ctx: NodeContext) -> VisitResult

visit_table_row()¶

Visit table rows <tr>.

Signature:

def visit_table_row(self, ctx: NodeContext, cells: list[str], is_header: bool) -> VisitResult

visit_table_end()¶

Called after processing a table </table>.

Signature:

def visit_table_end(self, ctx: NodeContext, output: str) -> VisitResult

visit_blockquote()¶

Visit blockquote elements <blockquote>.

Signature:

def visit_blockquote(self, ctx: NodeContext, content: str, depth: int) -> VisitResult

visit_strong()¶

Visit strong/bold elements <strong>, <b>.

Signature:

def visit_strong(self, ctx: NodeContext, text: str) -> VisitResult

visit_emphasis()¶

Visit emphasis/italic elements <em>, <i>.

Signature:

def visit_emphasis(self, ctx: NodeContext, text: str) -> VisitResult

visit_strikethrough()¶

Visit strikethrough elements <s>, <del>, <strike>.

Signature:

def visit_strikethrough(self, ctx: NodeContext, text: str) -> VisitResult

visit_underline()¶

Visit underline elements <u>, <ins>.

Signature:

def visit_underline(self, ctx: NodeContext, text: str) -> VisitResult

visit_subscript()¶

Visit subscript elements <sub>.

Signature:

def visit_subscript(self, ctx: NodeContext, text: str) -> VisitResult

visit_superscript()¶

Visit superscript elements <sup>.

Signature:

def visit_superscript(self, ctx: NodeContext, text: str) -> VisitResult

visit_mark()¶

Visit mark/highlight elements <mark>.

Signature:

def visit_mark(self, ctx: NodeContext, text: str) -> VisitResult

visit_line_break()¶

Visit line break elements <br>.

Signature:

def visit_line_break(self, ctx: NodeContext) -> VisitResult

visit_horizontal_rule()¶

Visit horizontal rule elements <hr>.

Signature:

def visit_horizontal_rule(self, ctx: NodeContext) -> VisitResult

visit_custom_element()¶

Visit custom elements (web components) or unknown tags.

Signature:

def visit_custom_element(self, ctx: NodeContext, tag_name: str, html: str) -> VisitResult

visit_definition_list_start()¶

Visit definition list <dl>.

Signature:

def visit_definition_list_start(self, ctx: NodeContext) -> VisitResult

visit_definition_term()¶

Visit definition term <dt>.

Signature:

def visit_definition_term(self, ctx: NodeContext, text: str) -> VisitResult

visit_definition_description()¶

Visit definition description <dd>.

Signature:

def visit_definition_description(self, ctx: NodeContext, text: str) -> VisitResult

visit_definition_list_end()¶

Called after processing a definition list </dl>.

Signature:

def visit_definition_list_end(self, ctx: NodeContext, output: str) -> VisitResult

visit_form()¶

Visit form elements <form>.

Signature:

def visit_form(self, ctx: NodeContext, action: str, method: str) -> VisitResult

visit_input()¶

Visit input elements <input>.

Signature:

def visit_input(self, ctx: NodeContext, input_type: str, name: str, value: str) -> VisitResult

visit_button()¶

Visit button elements <button>.

Signature:

def visit_button(self, ctx: NodeContext, text: str) -> VisitResult

visit_audio()¶

Visit audio elements <audio>.

Signature:

def visit_audio(self, ctx: NodeContext, src: str) -> VisitResult

visit_video()¶

Visit video elements <video>.

Signature:

def visit_video(self, ctx: NodeContext, src: str) -> VisitResult

visit_iframe()¶

Visit iframe elements <iframe>.

Signature:

def visit_iframe(self, ctx: NodeContext, src: str) -> VisitResult

visit_details()¶

Visit details elements <details>.

Signature:

def visit_details(self, ctx: NodeContext, open: bool) -> VisitResult

visit_summary()¶

Visit summary elements <summary>.

Signature:

def visit_summary(self, ctx: NodeContext, text: str) -> VisitResult

visit_figure_start()¶

Visit figure elements <figure>.

Signature:

def visit_figure_start(self, ctx: NodeContext) -> VisitResult

visit_figcaption()¶

Visit figcaption elements <figcaption>.

Signature:

def visit_figcaption(self, ctx: NodeContext, text: str) -> VisitResult

visit_figure_end()¶

Called after processing a figure </figure>.

Signature:

def visit_figure_end(self, ctx: NodeContext, output: str) -> VisitResult

ImageMetadata¶

Image metadata with source and dimensions.

Captures <img> elements and inline <svg> elements with metadata for image analysis and optimization.

Field	Type	Default	Description
`src`	`str`	—	Image source (URL, data URI, or SVG content identifier)
`alt`	`str \\| None`	`None`	Alternative text from alt attribute (for accessibility)
`title`	`str \\| None`	`None`	Title attribute (often shown as tooltip)
`dimensions`	`list[int] \\| None`	`None`	Image dimensions as (width, height) if available
`image_type`	`ImageType`	—	Image type classification
`attributes`	`dict[str, str]`	—	Additional HTML attributes

LinkMetadata¶

Hyperlink metadata with categorization and attributes.

Represents <a> elements with parsed href values, text content, and link type classification.

Field	Type	Default	Description
`href`	`str`	—	The href URL value
`text`	`str`	—	Link text content (normalized, concatenated if mixed with elements)
`title`	`str \\| None`	`None`	Optional title attribute (often shown as tooltip)
`link_type`	`LinkType`	—	Link type classification
`rel`	`list[str]`	—	Rel attribute values (e.g., "nofollow", "stylesheet", "canonical")
`attributes`	`dict[str, str]`	—	Additional HTML attributes

NodeContext¶

Context information passed to all visitor methods.

Provides comprehensive metadata about the current node being visited, including its type, attributes, position in the DOM tree, and parent context.

Field	Type	Default	Description
`node_type`	`NodeType`	—	Coarse-grained node type classification
`tag_name`	`str`	—	Raw HTML tag name (e.g., "div", "h1", "custom-element")
`attributes`	`dict[str, str]`	—	All HTML attributes as key-value pairs
`depth`	`int`	—	Depth in the DOM tree (0 = root)
`index_in_parent`	`int`	—	Index among siblings (0-based)
`parent_tag`	`str \\| None`	`None`	Parent element's tag name (None if root)
`is_inline`	`bool`	—	Whether this element is treated as inline vs block

PreprocessingOptions¶

HTML preprocessing options for document cleanup before conversion.

Field	Type	Default	Description
`enabled`	`bool`	`True`	Enable HTML preprocessing globally
`preset`	`PreprocessingPreset`	`PreprocessingPreset.STANDARD`	Preprocessing preset level (Minimal, Standard, Aggressive)
`remove_navigation`	`bool`	`True`	Remove navigation elements (nav, breadcrumbs, menus, sidebars)
`remove_forms`	`bool`	`True`	Remove form elements (forms, inputs, buttons, etc.)

Methods¶

default()¶

Signature:

@staticmethod
def default() -> PreprocessingOptions

ProcessingWarning¶

A non-fatal diagnostic produced during HTML conversion.

Warnings indicate that conversion completed but some content may have been handled differently than expected — for example, an image that could not be extracted, a truncated input, or malformed HTML that was repaired with best-effort parsing.

Conversion always succeeds (returns ConversionResult) even when warnings are present. Callers should inspect warnings and decide how to handle them based on their tolerance for partial results:

Logging pipelines: emit each warning at WARN level and continue.
Strict pipelines: treat any warning as a hard error by checking result.warnings.is_empty() before using the output.

See WarningKind for the full taxonomy of warning categories.

Field	Type	Default	Description
`message`	`str`	—	Human-readable warning message.
`kind`	`WarningKind`	—	The category of warning.

StructuredData¶

Structured data block (JSON-LD, Microdata, or RDFa).

Represents machine-readable structured data found in the document. JSON-LD blocks are collected as raw JSON strings for flexibility.

Field	Type	Default	Description
`data_type`	`StructuredDataType`	—	Type of structured data (JSON-LD, Microdata, RDFa)
`raw_json`	`str`	—	Raw JSON string (for JSON-LD) or serialized representation
`schema_type`	`str \\| None`	`None`	Schema type if detectable (e.g., "Article", "Event", "Product")

TableData¶

A top-level extracted table with both structured data and markdown representation.

Field	Type	Default	Description
`grid`	`TableGrid`	—	The structured table grid.
`markdown`	`str`	—	The markdown rendering of this table.

TableGrid¶

A structured table grid with cell-level data including spans.

Field	Type	Default	Description
`rows`	`int`	—	Number of rows.
`cols`	`int`	—	Number of columns.
`cells`	`list[GridCell]`	`[]`	All cells in the table as a flat, sparse list. The list is ordered by `(row, col)` but is not a dense `rows × cols` matrix: cells that are covered by a spanning cell (via `row_span > 1` or `col_span > 1`) do not appear in the list. Only the top-left "origin" cell of a span is present, with its `row_span` and `col_span` fields set accordingly. To reconstruct the full visual grid, iterate over all cells and mark the rectangular region `[row .. row+row_span, col .. col+col_span]` as occupied by that cell. Any `(row, col)` position that is not the origin of any cell is covered by a span from an earlier cell. The length of this vec is `≤ rows * cols`. An empty table (`rows == 0 \\|\\| cols == 0`) produces an empty vec.

TextAnnotation¶

A styling or semantic annotation that applies to a byte range within a node's text.

Unlike DocumentNode, which captures block-level structure (headings, paragraphs, etc.), a TextAnnotation describes inline-level markup — bold, italic, links, code spans, and similar — that spans a contiguous run of bytes inside DocumentNode.content's text field.

Byte offsets (start..end) are into the UTF-8 encoded text of the parent node. The range follows Rust slice conventions: start is inclusive and end is exclusive, so the annotated text is text[start as usize..end as usize].

Multiple annotations on the same node can overlap (e.g. bold-italic text), and they are stored in the order they are encountered during DOM traversal.

See AnnotationKind for the full list of supported annotation types.

Field	Type	Default	Description
`start`	`int`	—	Start byte offset (inclusive) into the parent node's text.
`end`	`int`	—	End byte offset (exclusive) into the parent node's text.
`kind`	`AnnotationKind`	—	The type of annotation.

VisitorHandle¶

Shareable, thread-safe handle to a user-provided HTML visitor implementation.

Pass an instance wrapped in this handle to ConversionOptions to customise how the HTML document is traversed and converted to Markdown. The handle may be cloned and shared across threads without additional synchronisation on the caller's side.

Enums¶

TextDirection¶

Text directionality of document content.

Corresponds to the HTML dir attribute and bdi element directionality.

Value	Description
`LEFT_TO_RIGHT`	Left-to-right text flow (default for Latin scripts)
`RIGHT_TO_LEFT`	Right-to-left text flow (Hebrew, Arabic, Urdu, etc.)
`AUTO`	Automatic directionality detection

LinkType¶

Link classification based on href value and document context.

Used to categorize links during extraction for filtering and analysis.

Value	Description
`ANCHOR`	Anchor link within same document (href starts with #)
`INTERNAL`	Internal link within same domain
`EXTERNAL`	External link to different domain
`EMAIL`	Email link (mailto:)
`PHONE`	Phone link (tel:)
`OTHER`	Other protocol or unclassifiable

ImageType¶

Image source classification for proper handling and processing.

Determines whether an image is embedded (data URI), inline SVG, external, or relative.

Value	Description
`DATA_URI`	Data URI embedded image (base64 or other encoding)
`INLINE_SVG`	Inline SVG element
`EXTERNAL`	External image URL (http/https)
`RELATIVE`	Relative image path

StructuredDataType¶

Structured data format type.

Identifies the schema/format used for structured data markup.

Value	Description
`JSON_LD`	JSON-LD (JSON for Linking Data) script blocks
`MICRODATA`	HTML5 Microdata attributes (itemscope, itemtype, itemprop)
`RDFA`	RDF in Attributes (RDFa) markup

PreprocessingPreset¶

HTML preprocessing aggressiveness level.

Controls the extent of cleanup performed before conversion. Higher levels remove more elements.

Value	Description
`MINIMAL`	Minimal cleanup. Remove only essential noise (scripts, styles).
`STANDARD`	Standard cleanup. Default. Removes navigation, forms, and other auxiliary content.
`AGGRESSIVE`	Aggressive cleanup. Remove extensive non-content elements and structure.

HeadingStyle¶

Heading style options for Markdown output.

Controls how headings (h1-h6) are rendered in the output Markdown.

Value	Description
`UNDERLINED`	Underlined style (=== for h1, --- for h2).
`ATX`	ATX style (# for h1, ## for h2, etc.). Default.
`ATX_CLOSED`	ATX closed style (# title #, with closing hashes).

ListIndentType¶

List indentation character type.

Controls whether list items are indented with spaces or tabs.

Value	Description
`SPACES`	Use spaces for indentation. Default. Width controlled by `list_indent_width`.
`TABS`	Use tabs for indentation.

WhitespaceMode¶

Whitespace handling strategy during conversion.

Determines how sequences of whitespace characters (spaces, tabs, newlines) are processed.

Value	Description
`NORMALIZED`	Collapse multiple whitespace characters to single spaces. Default. Matches browser behavior.
`STRICT`	Preserve all whitespace exactly as it appears in the HTML.

NewlineStyle¶

Line break syntax in Markdown output.

Controls how soft line breaks (from <br> or line breaks in source) are rendered.

Value	Description
`SPACES`	Two trailing spaces at end of line. Default. Standard Markdown syntax.
`BACKSLASH`	Backslash at end of line. Alternative Markdown syntax.

CodeBlockStyle¶

Code block fence style in Markdown output.

Determines how code blocks (<pre><code>) are rendered in Markdown.

Value	Description
`INDENTED`	Indented code blocks (4 spaces). `CommonMark` standard.
`BACKTICKS`	Fenced code blocks with backticks (```). Default (GFM). Supports language hints.
`TILDES`	Fenced code blocks with tildes (~~~). Supports language hints.

HighlightStyle¶

Highlight rendering style for <mark> elements.

Controls how highlighted text is rendered in Markdown output.

Value	Description
`DOUBLE_EQUAL`	Double equals syntax (text). Default. Pandoc-compatible.
`HTML`	Preserve as HTML (text). Original HTML tag.
`BOLD`	Render as bold (text). Uses strong emphasis.
`NONE`	Strip formatting, render as plain text. No markup.

LinkStyle¶

Link rendering style in Markdown output.

Controls whether links and images use inline [text](url) syntax or reference-style [text][1] syntax with definitions collected at the end.

Value	Description
`INLINE`	Inline links: `[text](url)`. Default.
`REFERENCE`	Reference-style links: `[text][1]` with `[1]: url` at end of document.

OutputFormat¶

Output format for conversion.

Specifies the target markup language format for the conversion output.

Value	Description
`MARKDOWN`	Standard Markdown (CommonMark compatible). Default.
`DJOT`	Djot lightweight markup language.
`PLAIN`	Plain text output (no markup, visible text only).

NodeContent¶

The semantic content type of a document node.

Uses internally tagged representation ("node_type": "heading") for JSON serialization.

Value	Description
`HEADING`	A heading element (h1-h6). — Fields: `level`: `int`, `text`: `str`
`PARAGRAPH`	A paragraph of text. — Fields: `text`: `str`
`LIST`	A list container (ordered or unordered). Children are `ListItem` nodes. — Fields: `ordered`: `bool`
`LIST_ITEM`	A single list item. — Fields: `text`: `str`
`TABLE`	A table with structured cell data. — Fields: `grid`: `TableGrid`
`IMAGE`	An image element. — Fields: `description`: `str`, `src`: `str`, `image_index`: `int`
`CODE`	A code block or inline code. — Fields: `text`: `str`, `language`: `str`
`QUOTE`	A block quote container.
`DEFINITION_LIST`	A definition list container.
`DEFINITION_ITEM`	A definition list entry with term and description. — Fields: `term`: `str`, `definition`: `str`
`RAW_BLOCK`	A raw block preserved as-is (e.g. `<script>`, `<style>` content). — Fields: `format`: `str`, `content`: `str`
`METADATA_BLOCK`	A block of key-value metadata pairs (from `<head>` meta tags). — Fields: `entries`: `list[list[str]]`
`GROUP`	A section grouping container (auto-generated from heading hierarchy). — Fields: `label`: `str`, `heading_level`: `int`, `heading_text`: `str`

AnnotationKind¶

The type of an inline text annotation.

Uses internally tagged representation ("annotation_type": "bold") for JSON serialization.

Value	Description
`BOLD`	Bold / strong emphasis.
`ITALIC`	Italic / emphasis.
`UNDERLINE`	Underline.
`STRIKETHROUGH`	Strikethrough / deleted text.
`CODE`	Inline code.
`SUBSCRIPT`	Subscript text.
`SUPERSCRIPT`	Superscript text.
`HIGHLIGHT`	Highlighted / marked text.
`LINK`	A hyperlink sourced from an `<a href="../...">` element. — Fields: `url`: `str`, `title`: `str`

WarningKind¶

Categories of processing warnings.

Value	Description
`IMAGE_EXTRACTION_FAILED`	An image could not be extracted (e.g. invalid data URI, unsupported format).
`ENCODING_FALLBACK`	The input encoding was not recognized; fell back to UTF-8.
`TRUNCATED_INPUT`	The input was truncated due to size limits.
`MALFORMED_HTML`	The HTML was malformed but processing continued with best effort.
`SANITIZATION_APPLIED`	Sanitization was applied to remove potentially unsafe content.
`DEPTH_LIMIT_EXCEEDED`	DOM traversal was truncated because max_depth was exceeded.

NodeType¶

Node type enumeration covering all HTML element types.

This enum categorizes all HTML elements that the converter recognizes, providing a coarse-grained classification for visitor dispatch.

Value	Description
`TEXT`	Text node (most frequent - 100+ per document)
`ELEMENT`	Generic element node
`HEADING`	Heading elements (h1-h6)
`PARAGRAPH`	Paragraph element
`DIV`	Generic div container
`BLOCKQUOTE`	Blockquote element
`PRE`	Preformatted text block
`HR`	Horizontal rule
`LIST`	Ordered or unordered list (ul, ol)
`LIST_ITEM`	List item (li)
`DEFINITION_LIST`	Definition list (dl)
`DEFINITION_TERM`	Definition term (dt)
`DEFINITION_DESCRIPTION`	Definition description (dd)
`TABLE`	Table element
`TABLE_ROW`	Table row (tr)
`TABLE_CELL`	Table cell (td, th)
`TABLE_HEADER`	Table header cell (th)
`TABLE_BODY`	Table body (tbody)
`TABLE_HEAD`	Table head (thead)
`TABLE_FOOT`	Table foot (tfoot)
`LINK`	Anchor link (a)
`IMAGE`	Image (img)
`STRONG`	Strong/bold (strong, b)
`EM`	Emphasis/italic (em, i)
`CODE`	Inline code (code)
`STRIKETHROUGH`	Strikethrough (s, del, strike)
`UNDERLINE`	Underline (u, ins)
`SUBSCRIPT`	Subscript (sub)
`SUPERSCRIPT`	Superscript (sup)
`MARK`	Mark/highlight (mark)
`SMALL`	Small text (small)
`BR`	Line break (br)
`SPAN`	Span element
`ARTICLE`	Article element
`SECTION`	Section element
`NAV`	Navigation element
`ASIDE`	Aside element
`HEADER`	Header element
`FOOTER`	Footer element
`MAIN`	Main element
`FIGURE`	Figure element
`FIGCAPTION`	Figure caption
`TIME`	Time element
`DETAILS`	Details element
`SUMMARY`	Summary element
`FORM`	Form element
`INPUT`	Input element
`SELECT`	Select element
`OPTION`	Option element
`BUTTON`	Button element
`TEXTAREA`	Textarea element
`LABEL`	Label element
`FIELDSET`	Fieldset element
`LEGEND`	Legend element
`AUDIO`	Audio element
`VIDEO`	Video element
`PICTURE`	Picture element
`SOURCE`	Source element
`IFRAME`	Iframe element
`SVG`	SVG element
`CANVAS`	Canvas element
`RUBY`	Ruby annotation
`RT`	Ruby text
`RP`	Ruby parenthesis
`ABBR`	Abbreviation
`KBD`	Keyboard input
`SAMP`	Sample output
`VAR`	Variable
`CITE`	Citation
`Q`	Quote
`DEL`	Deleted text
`INS`	Inserted text
`DATA`	Data element
`METER`	Meter element
`PROGRESS`	Progress element
`OUTPUT`	Output element
`TEMPLATE`	Template element
`SLOT`	Slot element
`HTML`	HTML root element
`HEAD`	Head element
`BODY`	Body element
`TITLE`	Title element
`META`	Meta element
`LINK_TAG`	Link element (not anchor)
`STYLE`	Style element
`SCRIPT`	Script element
`BASE`	Base element
`CUSTOM`	Custom element (web components) or unknown tag

VisitResult¶

Result of a visitor callback.

Allows visitors to control the conversion flow by either proceeding with default behavior, providing custom output, skipping elements, preserving HTML, or signaling errors.

Value	Description
`CONTINUE`	Continue with default conversion behavior
`CUSTOM`	Replace default output with custom markdown The visitor takes full responsibility for the markdown output of this node and its children. — Fields: `0`: `str`
`SKIP`	Skip this element entirely (don't output anything) The element and all its children are ignored in the output.
`PRESERVE_HTML`	Preserve original HTML (don't convert to markdown) The element's raw HTML is included verbatim in the output.
`ERROR`	Stop conversion with an error The conversion process halts and returns this error message. — Fields: `0`: `str`

Errors¶

ConversionError¶

Errors that can occur during HTML to Markdown conversion.

Base class: ConversionError(Exception)

Exception	Description
`ParseError(ConversionError)`	HTML parsing error
`SanitizationError(ConversionError)`	HTML sanitization error
`ConfigError(ConversionError)`	Invalid configuration
`IoError(ConversionError)`	I/O error
`Panic(ConversionError)`	Internal error caught during conversion
`InvalidInput(ConversionError)`	Invalid input data
`Other(ConversionError)`	Generic conversion error

Edit this page on GitHub