Configuration Reference

This page documents all configuration types and their defaults across all languages.

DocumentMetadata¶

Document-level metadata extracted from <head> and top-level elements.

Contains all metadata typically used by search engines, social media platforms, and browsers for document indexing and presentation.

Field	Type	Default	Description
`title`	`str | None`	`None`	Document title from `<title>` tag
`description`	`str | None`	`None`	Document description from `<meta name="description">` tag
`keywords`	`list[str]`	`[]`	Document keywords from `<meta name="keywords">` tag, split on commas
`author`	`str | None`	`None`	Document author from `<meta name="author">` tag
`canonical_url`	`str | None`	`None`	Canonical URL from `<link rel="canonical">` tag
`base_href`	`str | None`	`None`	Base URL from `<base href="">` tag for resolving relative URLs
`language`	`str | None`	`None`	Document language from `lang` attribute
`text_direction`	`TextDirection | None`	`None`	Document text direction from `dir` attribute
`open_graph`	`dict[str, str]`	`{}`	Open Graph metadata (og:* properties) for social media. Keys include "title", "description", "image", "url", etc.
`twitter_card`	`dict[str, str]`	`{}`	Twitter Card metadata (twitter:* properties). Keys include "card", "site", "creator", "title", "description", "image", etc.
`meta_tags`	`dict[str, str]`	`{}`	Additional meta tags not covered by specific fields. Keys are meta name/property attributes; values are the corresponding content.

HtmlMetadata¶

Comprehensive metadata extraction result from HTML document.

Contains all extracted metadata types in a single structure, suitable for serialization and transmission across language boundaries.

Field	Type	Default	Description
`document`	`DocumentMetadata`	—	Document-level metadata (title, description, canonical, etc.)
`headers`	`list[HeaderMetadata]`	`[]`	Extracted header elements with hierarchy
`links`	`list[LinkMetadata]`	`[]`	Extracted hyperlinks with type classification
`images`	`list[ImageMetadata]`	`[]`	Extracted images with source and dimensions
`structured_data`	`list[StructuredData]`	`[]`	Extracted structured data blocks

ConversionOptions¶

Main conversion options for HTML to Markdown conversion.

Use ConversionOptions.builder() to construct, or the default constructor for defaults.

Field	Type	Default	Description
`heading_style`	`HeadingStyle`	`HeadingStyle.ATX`	Heading style to use in Markdown output (ATX `#` or Setext underline).
`list_indent_type`	`ListIndentType`	`ListIndentType.SPACES`	How to indent nested list items (spaces or tab).
`list_indent_width`	`int`	`2`	Number of spaces (or tabs) to use for each level of list indentation.
`bullets`	`str`	`"-*+"`	Bullet character(s) to use for unordered list items (e.g. `"-"`, `"*"`).
`strong_em_symbol`	`str`	`"*"`	Character used for bold/italic emphasis markers (`*` or `_`).
`escape_asterisks`	`bool`	`False`	Escape `*` characters in plain text to avoid unintended bold/italic.
`escape_underscores`	`bool`	`False`	Escape `_` characters in plain text to avoid unintended bold/italic.
`escape_misc`	`bool`	`False`	Escape miscellaneous Markdown metacharacters (`[]()#` etc.) in plain text.
`escape_ascii`	`bool`	`False`	Escape ASCII characters that have special meaning in certain Markdown dialects.
`code_language`	`str`	`""`	Default language annotation for fenced code blocks that have no language hint.
`autolinks`	`bool`	`True`	Automatically convert bare URLs into Markdown autolinks.
`default_title`	`bool`	`False`	Emit a default title when no `<title>` tag is present.
`br_in_tables`	`bool`	`False`	Render `<br>` elements inside table cells as literal line breaks.
`compact_tables`	`bool`	`False`	Emit tables without column padding (compact GFM format). When `True`, column widths are not computed and cells are emitted with no trailing spaces. Separator rows use exactly `---` per column. Produces token-efficient output suitable for RAG / LLM contexts. Default `False` (aligned padding preserved).
`highlight_style`	`HighlightStyle`	`HighlightStyle.DOUBLE_EQUAL`	Style used for `<mark>` / highlighted text (e.g. `==text==`).
`extract_metadata`	`bool`	`True`	Populate `result.metadata` with `<head>` / `<meta>` extraction (title, description, Open Graph, Twitter Card, JSON-LD, …). Default `True`. Disabling skips the metadata pass only — table extraction into `result.tables` runs unconditionally.
`whitespace_mode`	`WhitespaceMode`	`WhitespaceMode.NORMALIZED`	Controls how whitespace sequences are normalised in the converted output. `WhitespaceMode.Normalized` (default) collapses consecutive whitespace characters (spaces, tabs, newlines) to a single space, matching browser rendering behaviour. `WhitespaceMode.Strict` preserves all whitespace exactly as it appears in the source HTML, including runs of spaces and embedded newlines. Choose `Strict` only when the source HTML uses deliberate whitespace (e.g. pre-formatted content outside `<pre>` tags). For most documents `Normalized` produces cleaner output.
`strip_newlines`	`bool`	`False`	Strip all newlines from the output, producing a single-line result.
`wrap`	`bool`	`False`	Wrap long lines at `wrap_width` characters.
`wrap_width`	`int`	`80`	Maximum output line width in characters when `wrap` is `True` (default `80`). Lines are broken at word boundaries so that no line exceeds this length. A value of `0` is treated as "no limit" — equivalent to leaving `wrap` disabled. Has no effect when `wrap` is `False`.
`convert_as_inline`	`bool`	`False`	Treat the entire document as inline content (no block-level wrappers).
`sub_symbol`	`str`	`""`	Markdown notation for subscript text (e.g. `"~"`).
`sup_symbol`	`str`	`""`	Markdown notation for superscript text (e.g. `"^"`).
`newline_style`	`NewlineStyle`	`NewlineStyle.SPACES`	How to encode hard line breaks (`<br>`) in Markdown.
`code_block_style`	`CodeBlockStyle`	`CodeBlockStyle.BACKTICKS`	Style used for fenced code blocks (backticks or tilde).
`keep_inline_images_in`	`list[str]`	`[]`	HTML tag names whose `<img>` children are kept inline instead of block.
`preprocessing`	`PreprocessingOptions`	—	Options for the HTML pre-processing pass applied before conversion begins. Pre-processing runs before the HTML is handed to the converter and can perform operations such as unwrapping redundant wrapper elements, removing tracking pixels, and normalising vendor-specific markup. See `PreprocessingOptions` for the full set of knobs. Defaults to `PreprocessingOptions.default()`, which enables the standard cleaning passes. Set individual fields on `PreprocessingOptions` (or construct via `ConversionOptions.builder`) to opt in or out of specific passes.
`encoding`	`str`	`"utf-8"`	Expected character encoding of the input HTML (default `"utf-8"`).
`debug`	`bool`	`False`	Emit debug information during conversion.
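`strip_tags`	`list[str]`	`[]`	HTML tag names whose tag wrappers are stripped from the output while their child content is kept. To drop an element together with its content, use `exclude_selectors`.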
`preserve_tags`	`list[str]`	`[]`	HTML tag names that are preserved verbatim in the output.
`skip_images`	`bool`	`False`	Skip conversion of `<img>` elements (omit images from output).
`link_style`	`LinkStyle`	`LinkStyle.INLINE`	Link rendering style (inline or reference).
`output_format`	`OutputFormat`	`OutputFormat.MARKDOWN`	Target output format (Markdown, plain text, etc.).
`include_document_structure`	`bool`	`False`	Include structured document tree in result.
`extract_images`	`bool`	`False`	Extract inline images from data URIs and SVGs.
`max_image_size`	`int`	`5242880`	Maximum decoded image size in bytes (default 5 MiB).
`capture_svg`	`bool`	`False`	Capture SVG elements as images.
`infer_dimensions`	`bool`	`True`	Infer image dimensions from data.
`max_depth`	`int | None`	`None`	Maximum DOM traversal depth. `None` means unlimited. When set, subtrees beyond this depth are silently truncated.
`exclude_selectors`	`list[str]`	`[]`	CSS selectors for elements to exclude entirely (element + all content). Unlike `strip_tags` (which removes the tag wrapper but keeps children), excluded elements and all their descendants are dropped from the output. Supports any CSS selector that `tl` supports: tag names, `.class`, `#id`, `[attribute]`, etc. Invalid selectors are silently skipped at conversion time. Example: `vec![".cookie-banner".into(), "#ad-container".into(), "[role='complementary']".into()]`
`visitor`	`VisitorHandle | None`	`None`	Optional visitor for custom traversal logic. When set, the visitor's callbacks are invoked for matching HTML elements during conversion, allowing custom output, skipping, or HTML preservation. See `HtmlVisitor`.
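
To make the `compact_tables` description concrete, here is a minimal sketch of the two layouts in plain Python. It does not call the library: `render_table` is a hypothetical helper, and the exact spacing around the pipes is an assumption. Only the padding behaviour and the `---` separator come from the table above.

```python
def render_table(header: list[str], rows: list[list[str]], compact: bool = False) -> str:
    """Render a GFM table, padded by default or compact on request (illustrative only)."""
    table = [header, *rows]
    if compact:
        # Compact: column widths are not computed at all.
        widths = None
    else:
        # Padded: each column is as wide as its longest cell (minimum 3 for `---`).
        widths = [max(3, *(len(r[i]) for r in table)) for i in range(len(header))]

    def fmt(cells: list[str]) -> str:
        if widths is not None:
            cells = [c.ljust(w) for c, w in zip(cells, widths)]
        return "| " + " | ".join(cells) + " |"

    # Compact separators are exactly `---`; padded separators fill the column.
    separator = ["---"] * len(header) if widths is None else ["-" * w for w in widths]
    return "\n".join([fmt(header), fmt(separator), *(fmt(r) for r in rows)])


print(render_table(["Name", "Qty"], [["Apple", "3"]]))
print(render_table(["Name", "Qty"], [["Apple", "3"]], compact=True))
```

The compact form saves the padding characters on every row, which is where the token savings for RAG / LLM contexts would come from.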

PreprocessingOptions¶

HTML preprocessing options for document cleanup before conversion.

Field	Type	Default	Description
`enabled`	`bool`	`True`	Enable HTML preprocessing globally
`preset`	`PreprocessingPreset`	`PreprocessingPreset.STANDARD`	Preprocessing preset level (Minimal, Standard, Aggressive)
`remove_navigation`	`bool`	`True`	Remove navigation elements (nav, breadcrumbs, menus, sidebars)
`remove_forms`	`bool`	`True`	Remove form elements (forms, inputs, buttons, etc.)

ConversionResult¶

The primary result of HTML conversion and extraction.

Contains the converted text output, optional structured document tree, metadata, extracted tables, images, and processing warnings.

Field	Type	Default	Description
`content`	`str | None`	`None`	Converted text output (markdown, djot, or plain text). `None` when `output_format` is set to `OutputFormat.None`, indicating extraction-only mode.
`document`	`DocumentStructure | None`	`None`	Structured document tree with semantic elements. Populated when `ConversionOptions.include_document_structure` is `True`. `None` otherwise (the default), which avoids the overhead of building the tree. When present, the tree mirrors the converted document: headings open `Group` sections, paragraphs and list items carry inline `TextAnnotation`s, and tables reference the same `TableGrid` data exposed in `ConversionResult.tables`. Note: this field is independent of the `metadata` feature flag. Document structure collection is always available at runtime; it is gated only by the runtime option, not by a compile-time feature.
`metadata`	`HtmlMetadata`	—	Extracted HTML metadata (title, OG, links, images, structured data).
`tables`	`list[TableData]`	`[]`	Extracted tables with structured cell data and markdown representation.
`images`	`list[str]`	`[]`	Extracted inline images (data URIs and SVGs). Populated when `extract_images` is `True` in options.
`warnings`	`list[ProcessingWarning]`	`[]`	Non-fatal processing warnings.

TableGrid¶

A structured table grid with cell-level data including spans.

Field	Type	Default	Description
`rows`	`int`	—	Number of rows.
`cols`	`int`	—	Number of columns.
`cells`	`list[GridCell]`	`[]`	All cells in the table as a flat, sparse list. The list is ordered by `(row, col)` but is not a dense `rows × cols` matrix: cells that are covered by a spanning cell (via `row_span > 1` or `col_span > 1`) do not appear in the list. Only the top-left "origin" cell of a span is present, with its `row_span` and `col_span` fields set accordingly. To reconstruct the full visual grid, iterate over all cells and mark the rectangular region `[row .. row+row_span, col .. col+col_span]` (end-exclusive) as occupied by that cell. Any `(row, col)` position that is not the origin of any cell is covered by a span from an earlier cell. The list has at most `rows * cols` entries. An empty table (`rows == 0 || cols == 0`) produces an empty list.
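
The reconstruction described for `cells` can be sketched as follows. This is a stand-alone illustration, not the library's implementation: the `GridCell` dataclass below is a local stand-in, its `content` field is an assumption, and `to_dense` is a hypothetical helper. Only `row`, `col`, `row_span` and `col_span` are taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GridCell:
    # Stand-in for the documented type; `content` is an assumed field name.
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1
    content: str = ""


def to_dense(rows: int, cols: int, cells: list[GridCell]) -> list[list[Optional[GridCell]]]:
    """Expand the sparse cell list into a rows x cols grid of origin cells."""
    grid: list[list[Optional[GridCell]]] = [[None] * cols for _ in range(rows)]
    for cell in cells:
        # Mark the end-exclusive rectangle covered by this origin cell,
        # clamped to the table bounds in case a span overshoots.
        for r in range(cell.row, min(cell.row + cell.row_span, rows)):
            for c in range(cell.col, min(cell.col + cell.col_span, cols)):
                grid[r][c] = cell
    return grid


# A 2 x 2 table whose first row is a single header cell spanning both columns.
cells = [
    GridCell(0, 0, col_span=2, content="Header"),
    GridCell(1, 0, content="a"),
    GridCell(1, 1, content="b"),
]
grid = to_dense(2, 2, cells)
```

After expansion, both positions in the first row point at the same origin cell, so a renderer can detect spans by identity rather than by re-reading the span fields.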

Enums¶

CodeBlockStyle¶

Code block fence style in Markdown output.

Determines how code blocks (<pre><code>) are rendered in Markdown.

Variant	Description
`Indented`	Indented code blocks (4 spaces). CommonMark standard.
`Backticks`	Fenced code blocks with backticks (```). Default (GFM). Supports language hints.
`Tildes`	Fenced code blocks with tildes (~~~). Supports language hints.
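
As a rough illustration of the three variants, the sketch below renders the same snippet each way. `render_code_block` is a hypothetical helper written for this page, not a library function, and the lowercase style strings are placeholders for the enum variants.

```python
def render_code_block(code: str, style: str = "backticks", language: str = "") -> str:
    """Render `code` as an indented or fenced Markdown code block (illustrative only)."""
    if style == "indented":
        # Indented blocks cannot carry a language hint, so `language` is ignored.
        return "\n".join("    " + line for line in code.splitlines())
    # Fenced blocks: three backticks or three tildes, with an optional language hint.
    fence = ("`" if style == "backticks" else "~") * 3
    return f"{fence}{language}\n{code}\n{fence}"


print(render_code_block("x = 1\ny = 2", style="indented"))
print(render_code_block("x = 1", style="tildes", language="python"))
```

Only the fenced styles have a place for the language hint, which is why `code_language` has no visible effect with `Indented`.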

HeadingStyle¶

Heading style options for Markdown output.

Controls how headings (h1-h6) are rendered in the output Markdown.

Variant	Description
`Underlined`	Underlined style (=== for h1, --- for h2).
`Atx`	ATX style (# for h1, ## for h2, etc.). Default.
`AtxClosed`	ATX closed style (# title #, with closing hashes).
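
The three heading styles can be compared with a small stand-alone sketch. `render_heading` is a hypothetical helper, and falling back to ATX for h3 and deeper under the underlined style is an assumption made here, based on Setext syntax defining only two levels.

```python
def render_heading(text: str, level: int, style: str = "atx") -> str:
    """Render a heading in one of the three styles (illustrative only)."""
    if style == "underlined" and level <= 2:
        # Setext: `=` underlines h1, `-` underlines h2.
        underline = ("=" if level == 1 else "-") * len(text)
        return f"{text}\n{underline}"
    hashes = "#" * level
    if style == "atx_closed":
        return f"{hashes} {text} {hashes}"
    # Plain ATX, also used here as the assumed fallback for deep underlined headings.
    return f"{hashes} {text}"


print(render_heading("Title", 1, "underlined"))
print(render_heading("Title", 2))
print(render_heading("Title", 1, "atx_closed"))
```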

HighlightStyle¶

Highlight rendering style for <mark> elements.

Controls how highlighted text is rendered in Markdown output.

Variant	Description
`DoubleEqual`	Double equals syntax (`==text==`). Default. Pandoc-compatible.
`Html`	Preserve as HTML (`<mark>text</mark>`). Original HTML tag.
`Bold`	Render as bold (`**text**`). Uses strong emphasis.
`None`	Strip formatting, render as plain text. No markup.

LinkStyle¶

Link rendering style in Markdown output.

Controls whether links and images use inline [text](url) syntax or reference-style [text][1] syntax with definitions collected at the end.

Variant	Description
`Inline`	Inline links: `[text](url)`. Default.
`Reference`	Reference-style links: `[text][1]` with `[1]: url` at end of document.
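
The difference between the two link styles is easiest to see side by side. The sketch below is plain Python with a hypothetical `render_links` helper; the numbering scheme and the blank line before the definitions are assumptions, not confirmed library behaviour.

```python
def render_links(links: list[tuple[str, str]], style: str = "inline") -> str:
    """Render (text, url) pairs as inline or reference-style links (illustrative only)."""
    if style == "inline":
        return " ".join(f"[{text}]({url})" for text, url in links)
    # Reference style: numbered labels in the body, definitions collected at the end.
    body = " ".join(f"[{text}][{i}]" for i, (text, _) in enumerate(links, 1))
    definitions = "\n".join(f"[{i}]: {url}" for i, (_, url) in enumerate(links, 1))
    return f"{body}\n\n{definitions}"


links = [("Docs", "https://example.com/docs"), ("Home", "https://example.com")]
print(render_links(links))
print(render_links(links, style="reference"))
```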

ListIndentType¶

List indentation character type.

Controls whether list items are indented with spaces or tabs.

Variant	Description
`Spaces`	Use spaces for indentation. Default. Width controlled by `list_indent_width`.
`Tabs`	Use tabs for indentation.

NewlineStyle¶

Line break syntax in Markdown output.

Controls how hard line breaks (from <br> or line breaks in source) are rendered.

Variant	Description
`Spaces`	Two trailing spaces at end of line. Default. Standard Markdown syntax.
`Backslash`	Backslash at end of line. Alternative Markdown syntax.
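
Because the two trailing spaces of the `Spaces` variant are invisible in rendered text, the following sketch uses `repr` to show the exact characters each variant produces. `render_break` is a hypothetical helper and does not reflect the library's internals.

```python
def render_break(before: str, after: str, style: str = "spaces") -> str:
    """Join two text runs with a Markdown hard line break (illustrative only)."""
    # `spaces` ends the line with two spaces; `backslash` ends it with a single backslash.
    marker = "  " if style == "spaces" else "\\"
    return f"{before}{marker}\n{after}"


print(repr(render_break("first", "second")))
print(repr(render_break("first", "second", style="backslash")))
```

The backslash form survives editors that trim trailing whitespace, which is the usual reason to prefer it.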

OutputFormat¶

Output format for conversion.

Specifies the target markup language format for the conversion output.

Variant	Description
`Markdown`	Standard Markdown (CommonMark compatible). Default.
`Djot`	Djot lightweight markup language.
`Plain`	Plain text output (no markup, visible text only).

PreprocessingPreset¶

HTML preprocessing aggressiveness level.

Controls the extent of cleanup performed before conversion. Higher levels remove more elements.

Variant	Description
`Minimal`	Minimal cleanup. Remove only essential noise (scripts, styles).
`Standard`	Standard cleanup. Default. Removes navigation, forms, and other auxiliary content.
`Aggressive`	Aggressive cleanup. Remove extensive non-content elements and structure.

TextDirection¶

Text directionality of document content.

Corresponds to the HTML dir attribute and bdi element directionality.

Variant	Wire value	Description
`LeftToRight`	`ltr`	Left-to-right text flow (default for Latin scripts)
`RightToLeft`	`rtl`	Right-to-left text flow (Hebrew, Arabic, Urdu, etc.)
`Auto`	`auto`	Automatic directionality detection
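
The variant-to-wire-value mapping above can be modelled with a standard `Enum`. This is a local sketch, not the library's type: the member names are Python-style spellings chosen here, and returning `None` for a missing or unrecognised `dir` value in the hypothetical `parse_dir` helper is an assumption.

```python
from enum import Enum
from typing import Optional


class TextDirection(Enum):
    # Values are the wire values listed in the table above.
    LEFT_TO_RIGHT = "ltr"
    RIGHT_TO_LEFT = "rtl"
    AUTO = "auto"


def parse_dir(value: Optional[str]) -> Optional[TextDirection]:
    """Map a raw `dir` attribute to a TextDirection, or None if absent or unknown."""
    if value is None:
        return None
    try:
        return TextDirection(value.strip().lower())
    except ValueError:
        # Assumed behaviour: unrecognised values are treated as "not set".
        return None


print(parse_dir("RTL"))
print(parse_dir("sideways"))
```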

WhitespaceMode¶

Whitespace handling strategy during conversion.

Determines how sequences of whitespace characters (spaces, tabs, newlines) are processed.

Variant	Description
`Normalized`	Collapse multiple whitespace characters to single spaces. Default. Matches browser behavior.
`Strict`	Preserve all whitespace exactly as it appears in the HTML.
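
A minimal sketch of the two modes, assuming that "whitespace" means spaces, tabs, and newline characters as stated above. `apply_whitespace_mode` is a hypothetical helper; the library's actual collapsing rules may differ at element boundaries.

```python
import re

# Spaces, tabs, carriage returns, newlines and form feeds; non-breaking spaces are left alone.
_COLLAPSIBLE = re.compile(r"[ \t\r\n\f]+")


def apply_whitespace_mode(text: str, mode: str = "normalized") -> str:
    """Collapse whitespace runs to one space, or return the text untouched in strict mode."""
    if mode == "strict":
        return text
    return _COLLAPSIBLE.sub(" ", text)


source = "two  spaces,\ta tab,\n and a newline"
print(repr(apply_whitespace_mode(source)))
print(repr(apply_whitespace_mode(source, mode="strict")))
```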
