Configuration Reference¶
This page documents all configuration types and their defaults across all languages.
DocumentMetadata¶
Document-level metadata extracted from <head> and top-level elements.
Contains all metadata typically used by search engines, social media platforms, and browsers for document indexing and presentation.
| Field | Type | Default | Description |
|---|---|---|---|
| title | str \| None | None | Document title from <title> tag |
| description | str \| None | None | Document description from <meta name="description"> tag |
| keywords | list[str] | [] | Document keywords from <meta name="keywords"> tag, split on commas |
| author | str \| None | None | Document author from <meta name="author"> tag |
| canonical_url | str \| None | None | Canonical URL from <link rel="canonical"> tag |
| base_href | str \| None | None | Base URL from <base href=""> tag for resolving relative URLs |
| language | str \| None | None | Document language from lang attribute |
| text_direction | TextDirection \| None | None | Document text direction from dir attribute |
| open_graph | dict[str, str] | {} | Open Graph metadata (og:* properties) for social media. Keys like "title", "description", "image", "url", etc. |
| twitter_card | dict[str, str] | {} | Twitter Card metadata (twitter:* properties). Keys like "card", "site", "creator", "title", "description", "image", etc. |
| meta_tags | dict[str, str] | {} | Additional meta tags not covered by specific fields. Keys are meta name/property attributes, values are content. |
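As a rough illustration of the keywords field, the sketch below splits a <meta name="keywords"> content string on commas. The function name split_keywords is hypothetical, and trimming whitespace and dropping empty entries are assumptions; the reference only states that the value is split on commas.

```python
def split_keywords(content: str) -> list[str]:
    """Split the content of a <meta name="keywords"> tag on commas.

    Assumption: surrounding whitespace is trimmed and empty entries are
    dropped. The reference above only says "split on commas".
    """
    return [part.strip() for part in content.split(",") if part.strip()]


if __name__ == "__main__":
    print(split_keywords("html, markdown ,, conversion"))
```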
HtmlMetadata¶
Comprehensive metadata extraction result from HTML document.
Contains all extracted metadata types in a single structure, suitable for serialization and transmission across language boundaries.
| Field | Type | Default | Description |
|---|---|---|---|
| document | DocumentMetadata | — | Document-level metadata (title, description, canonical, etc.) |
| headers | list[HeaderMetadata] | [] | Extracted header elements with hierarchy |
| links | list[LinkMetadata] | [] | Extracted hyperlinks with type classification |
| images | list[ImageMetadata] | [] | Extracted images with source and dimensions |
| structured_data | list[StructuredData] | [] | Extracted structured data blocks |
ConversionOptions¶
Main conversion options for HTML to Markdown conversion.
Use ConversionOptions.builder() to construct, or the default constructor for defaults.
| Field | Type | Default | Description |
|---|---|---|---|
| heading_style | HeadingStyle | HeadingStyle.ATX | Heading style to use in Markdown output (ATX # or Setext underline). |
| list_indent_type | ListIndentType | ListIndentType.SPACES | How to indent nested list items (spaces or tabs). |
| list_indent_width | int | 2 | Number of spaces (or tabs) to use for each level of list indentation. |
| bullets | str | "-*+" | Bullet character(s) to use for unordered list items (e.g. "-", "*"). |
| strong_em_symbol | str | "*" | Character used for bold/italic emphasis markers (* or _). |
| escape_asterisks | bool | False | Escape * characters in plain text to avoid unintended bold/italic. |
| escape_underscores | bool | False | Escape _ characters in plain text to avoid unintended bold/italic. |
| escape_misc | bool | False | Escape miscellaneous Markdown metacharacters ([]()# etc.) in plain text. |
| escape_ascii | bool | False | Escape ASCII characters that have special meaning in certain Markdown dialects. |
| code_language | str | "" | Default language annotation for fenced code blocks that have no language hint. |
| autolinks | bool | True | Automatically convert bare URLs into Markdown autolinks. |
| default_title | bool | False | Emit a default title when no <title> tag is present. |
| br_in_tables | bool | False | Render <br> elements inside table cells as literal line breaks. |
| compact_tables | bool | False | Emit tables without column padding (compact GFM format). When True, column widths are not computed and cells are emitted with no trailing spaces. Separator rows use exactly --- per column. Produces token-efficient output suitable for RAG / LLM contexts. When False, aligned padding is preserved. |
| highlight_style | HighlightStyle | HighlightStyle.DOUBLE_EQUAL | Style used for <mark> / highlighted text (e.g. ==text==). |
| extract_metadata | bool | True | Populate result.metadata with <head> / <meta> extraction (title, description, Open Graph, Twitter Card, JSON-LD, …). Disabling skips the metadata pass only; table extraction into result.tables runs unconditionally. |
| whitespace_mode | WhitespaceMode | WhitespaceMode.NORMALIZED | Controls how whitespace sequences are normalised in the converted output. WhitespaceMode.Normalized (default) collapses consecutive whitespace characters (spaces, tabs, newlines) to a single space, matching browser rendering behaviour. WhitespaceMode.Strict preserves all whitespace exactly as it appears in the source HTML, including runs of spaces and embedded newlines. Choose Strict only when the source HTML uses deliberate whitespace (e.g. pre-formatted content outside <pre> tags). For most documents Normalized produces cleaner output. |
| strip_newlines | bool | False | Strip all newlines from the output, producing a single-line result. |
| wrap | bool | False | Wrap long lines at wrap_width characters. |
| wrap_width | int | 80 | Maximum output line width in characters when wrap is True. Lines are broken at word boundaries so that no line exceeds this length. A value of 0 is treated as "no limit", equivalent to leaving wrap disabled. Has no effect when wrap is False. |
| convert_as_inline | bool | False | Treat the entire document as inline content (no block-level wrappers). |
| sub_symbol | str | "" | Markdown notation for subscript text (e.g. "~"). |
| sup_symbol | str | "" | Markdown notation for superscript text (e.g. "^"). |
| newline_style | NewlineStyle | NewlineStyle.SPACES | How to encode hard line breaks (<br>) in Markdown. |
| code_block_style | CodeBlockStyle | CodeBlockStyle.BACKTICKS | Style used for code blocks (indented, backtick-fenced, or tilde-fenced). |
| keep_inline_images_in | list[str] | [] | HTML tag names whose <img> children are kept inline instead of block. |
| preprocessing | PreprocessingOptions | — | Options for the HTML pre-processing pass applied before conversion begins. Pre-processing runs before the HTML is handed to the converter and can perform operations such as unwrapping redundant wrapper elements, removing tracking pixels, and normalising vendor-specific markup. See PreprocessingOptions for the full set of knobs. Defaults to PreprocessingOptions.default(), which enables the standard cleaning passes. Set individual fields on PreprocessingOptions (or construct via ConversionOptions.builder) to opt in or out of specific passes. |
| encoding | str | "utf-8" | Expected character encoding of the input HTML. |
| debug | bool | False | Emit debug information during conversion. |
| strip_tags | list[str] | [] | HTML tag names whose tag wrapper is stripped from the output while their children are kept. To drop an element together with its content, use exclude_selectors. |
| preserve_tags | list[str] | [] | HTML tag names that are preserved verbatim in the output. |
| skip_images | bool | False | Skip conversion of <img> elements (omit images from output). |
| link_style | LinkStyle | LinkStyle.INLINE | Link rendering style (inline or reference). |
| output_format | OutputFormat | OutputFormat.MARKDOWN | Target output format (Markdown, Djot, or plain text). |
| include_document_structure | bool | False | Include structured document tree in result. |
| extract_images | bool | False | Extract inline images from data URIs and SVGs. |
| max_image_size | int | 5242880 | Maximum decoded image size in bytes (default 5 MiB). |
| capture_svg | bool | False | Capture SVG elements as images. |
| infer_dimensions | bool | True | Infer image dimensions from data. |
| max_depth | int \| None | None | Maximum DOM traversal depth. None means unlimited. When set, subtrees beyond this depth are silently truncated. |
| exclude_selectors | list[str] | [] | CSS selectors for elements to exclude entirely (element + all content). Unlike strip_tags (which removes the tag wrapper but keeps children), excluded elements and all their descendants are dropped from the output. Supports any CSS selector that tl supports: tag names, .class, #id, [attribute], etc. Invalid selectors are silently skipped at conversion time. Example: [".cookie-banner", "#ad-container", "[role='complementary']"] |
| visitor | VisitorHandle \| None | None | Optional visitor for custom traversal logic. When set, the visitor's callbacks are invoked for matching HTML elements during conversion, allowing custom output, skipping, or HTML preservation. See HtmlVisitor. |
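The interaction between wrap and wrap_width can be sketched with Python's standard textwrap module. This is only an approximation of the documented semantics, not the converter's own wrapping code; the function name wrap_output is hypothetical.

```python
import textwrap


def wrap_output(text: str, wrap: bool = False, wrap_width: int = 80) -> str:
    """Approximate the documented wrap / wrap_width behaviour.

    Assumption: the text is a single paragraph. textwrap.fill collapses
    internal whitespace, which the real converter may not do.
    """
    # wrap_width has no effect when wrap is False, and 0 means "no limit".
    if not wrap or wrap_width == 0:
        return text
    # Break only at word boundaries; a single word longer than wrap_width
    # is left intact rather than split.
    return textwrap.fill(
        text,
        width=wrap_width,
        break_long_words=False,
        break_on_hyphens=False,
    )


if __name__ == "__main__":
    print(wrap_output("aaa bbb ccc", wrap=True, wrap_width=7))
```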
PreprocessingOptions¶
HTML preprocessing options for document cleanup before conversion.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | True | Enable HTML preprocessing globally |
| preset | PreprocessingPreset | PreprocessingPreset.STANDARD | Preprocessing preset level (Minimal, Standard, Aggressive) |
| remove_navigation | bool | True | Remove navigation elements (nav, breadcrumbs, menus, sidebars) |
| remove_forms | bool | True | Remove form elements (forms, inputs, buttons, etc.) |
ConversionResult¶
The primary result of HTML conversion and extraction.
Contains the converted text output, optional structured document tree, metadata, extracted tables, images, and processing warnings.
| Field | Type | Default | Description |
|---|---|---|---|
| content | str \| None | None | Converted text output (Markdown, Djot, or plain text). None when output_format is set to OutputFormat.None, indicating extraction-only mode. |
| document | DocumentStructure \| None | None | Structured document tree with semantic elements. Populated when ConversionOptions.include_document_structure is True. None otherwise (the default), which avoids the overhead of building the tree. When present, the tree mirrors the converted document: headings open Group sections, paragraphs and list items carry inline TextAnnotations, and tables reference the same TableGrid data exposed in the tables field. Note: this field is independent of the metadata feature flag. Document structure collection is always available at runtime; it is gated only by the runtime option, not by a compile-time feature. |
| metadata | HtmlMetadata | — | Extracted HTML metadata (title, OG, links, images, structured data). |
| tables | list[TableData] | [] | Extracted tables with structured cell data and Markdown representation. |
| images | list[str] | [] | Extracted inline images (data URIs and SVGs). Populated when extract_images is True in options. |
| warnings | list[ProcessingWarning] | [] | Non-fatal processing warnings. |
TableGrid¶
A structured table grid with cell-level data including spans.
| Field | Type | Default | Description |
|---|---|---|---|
| rows | int | — | Number of rows. |
| cols | int | — | Number of columns. |
| cells | list[GridCell] | [] | All cells in the table as a flat, sparse list. The list is ordered by (row, col) but is not a dense rows × cols matrix: cells that are covered by a spanning cell (via row_span > 1 or col_span > 1) do not appear in the list. Only the top-left "origin" cell of a span is present, with its row_span and col_span fields set accordingly. To reconstruct the full visual grid, iterate over all cells and mark the rectangular region [row .. row+row_span, col .. col+col_span] (end-exclusive) as occupied by that cell. Any (row, col) position that is not the origin of any cell is covered by a span from an earlier cell. The length of this list is ≤ rows * cols. An empty table (rows == 0 or cols == 0) produces an empty list. |
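The reconstruction procedure described for cells can be sketched as follows. The GridCell dataclass here is a hypothetical stand-in that models only the fields named above, plus an assumed content field for display; expand_grid is likewise an illustrative name, not part of the library.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GridCell:
    # Hypothetical stand-in: row, col, row_span and col_span come from the
    # description above; `content` is an assumption added for illustration.
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1
    content: str = ""


def expand_grid(rows: int, cols: int, cells: list[GridCell]) -> list[list[GridCell | None]]:
    """Expand the sparse cell list into a dense rows x cols grid.

    Every position covered by a span points at its origin cell.
    """
    grid: list[list[GridCell | None]] = [[None] * cols for _ in range(rows)]
    for cell in cells:
        # Mark the end-exclusive region [row, row+row_span) x [col, col+col_span),
        # clamped to the grid bounds.
        for r in range(cell.row, min(cell.row + cell.row_span, rows)):
            for c in range(cell.col, min(cell.col + cell.col_span, cols)):
                grid[r][c] = cell
    return grid


if __name__ == "__main__":
    cells = [
        GridCell(0, 0, col_span=2, content="A"),  # spans both columns of row 0
        GridCell(1, 0, content="B"),
        GridCell(1, 1, content="C"),
    ]
    for row in expand_grid(2, 2, cells):
        print([cell.content if cell else None for cell in row])
```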
Enums¶
CodeBlockStyle¶
Code block fence style in Markdown output.
Determines how code blocks (<pre><code>) are rendered in Markdown.
| Variant | Description |
|---|---|
| Indented | Indented code blocks (4 spaces). CommonMark standard. |
| Backticks | Fenced code blocks with backticks (```). Default (GFM). Supports language hints. |
| Tildes | Fenced code blocks with tildes (~~~). Supports language hints. |
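A minimal sketch of what the three variants produce for the same snippet. The function render_code_block and its lowercase style strings are hypothetical and only mirror the descriptions in the table; the library's exact output (blank lines, fence length) may differ.

```python
def render_code_block(code: str, style: str = "backticks", language: str = "") -> str:
    """Render `code` in one of the three documented code block styles."""
    lines = code.rstrip("\n").split("\n")
    if style == "indented":
        # Indented blocks cannot carry a language hint.
        return "\n".join("    " + line for line in lines)
    # Build the fence from single characters to keep this example readable.
    fence = ("`" if style == "backticks" else "~") * 3
    return "\n".join([fence + language, *lines, fence])


if __name__ == "__main__":
    for style in ("indented", "backticks", "tildes"):
        print(render_code_block("x = 1", style, language="python"))
        print()
```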
HeadingStyle¶
Heading style options for Markdown output.
Controls how headings (h1-h6) are rendered in the output Markdown.
| Variant | Description |
|---|---|
| Underlined | Underlined style (=== for h1, --- for h2). |
| Atx | ATX style (# for h1, ## for h2, etc.). Default. |
| AtxClosed | ATX closed style (# title #, with closing hashes). |
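The three heading styles can be illustrated with a short sketch. The function render_heading is hypothetical. Setext (underlined) headings only exist for levels 1 and 2 in Markdown, so falling back to ATX for deeper levels is an assumption here, as is matching the underline length to the text.

```python
def render_heading(text: str, level: int, style: str = "atx") -> str:
    """Render a heading in the underlined, atx, or atx_closed style."""
    if style == "underlined" and level in (1, 2):
        # Assumption: underline is as long as the text, with a minimum of 3.
        underline = ("=" if level == 1 else "-") * max(len(text), 3)
        return f"{text}\n{underline}"
    hashes = "#" * level
    if style == "atx_closed":
        return f"{hashes} {text} {hashes}"
    # Plain ATX; also the assumed fallback for underlined h3-h6.
    return f"{hashes} {text}"


if __name__ == "__main__":
    print(render_heading("Title", 1, "underlined"))
    print(render_heading("Title", 2, "atx"))
    print(render_heading("Title", 2, "atx_closed"))
```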
HighlightStyle¶
Highlight rendering style for <mark> elements.
Controls how highlighted text is rendered in Markdown output.
| Variant | Description |
|---|---|
| DoubleEqual | Double equals syntax (==text==). Default. Pandoc-compatible. |
| Html | Preserve as HTML (<mark>text</mark>). Original HTML tag. |
| Bold | Render as bold (**text**). Uses strong emphasis. |
| None | Strip formatting, render as plain text. No markup. |
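For comparison, the sketch below renders the same highlighted text in each style. The ==text== form follows the example given for highlight_style; the <mark>text</mark> and **text** forms are inferred from "Original HTML tag" and "strong emphasis" and should be read as assumptions. The function name render_highlight is hypothetical.

```python
def render_highlight(text: str, style: str = "double_equal") -> str:
    """Render the content of a <mark> element in the given highlight style."""
    if style == "double_equal":
        return f"=={text}=="
    if style == "html":
        return f"<mark>{text}</mark>"
    if style == "bold":
        # Assumption: ** is used; the real marker may follow strong_em_symbol.
        return f"**{text}**"
    if style == "none":
        return text
    raise ValueError(f"unknown highlight style: {style!r}")


if __name__ == "__main__":
    for style in ("double_equal", "html", "bold", "none"):
        print(style, "->", render_highlight("important", style))
```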
LinkStyle¶
Link rendering style in Markdown output.
Controls whether links and images use inline [text](url) syntax or reference-style [text][1] syntax with definitions collected at the end.
| Variant | Description |
|---|---|
| Inline | Inline links: [text](url). Default. |
| Reference | Reference-style links: [text][1] with [1]: url at end of document. |
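The difference between the two link styles can be sketched as below. The function render_links is hypothetical; numbering references sequentially without de-duplicating repeated URLs is an assumption, and the library may number or group definitions differently.

```python
def render_links(links: list[tuple[str, str]], style: str = "inline") -> str:
    """Render (text, url) pairs as inline or reference-style Markdown links."""
    if style == "inline":
        return " ".join(f"[{text}]({url})" for text, url in links)
    body = []
    definitions = []
    for index, (text, url) in enumerate(links, start=1):
        body.append(f"[{text}][{index}]")
        definitions.append(f"[{index}]: {url}")
    # Definitions are collected after the body, separated by a blank line.
    return " ".join(body) + "\n\n" + "\n".join(definitions)


if __name__ == "__main__":
    links = [("docs", "https://example.com/docs"), ("home", "https://example.com")]
    print(render_links(links, "inline"))
    print(render_links(links, "reference"))
```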
ListIndentType¶
List indentation character type.
Controls whether list items are indented with spaces or tabs.
| Variant | Description |
|---|---|
| Spaces | Use spaces for indentation. Default. Width controlled by list_indent_width. |
| Tabs | Use tabs for indentation. |
NewlineStyle¶
Line break syntax in Markdown output.
Controls how hard line breaks (from <br> elements) are rendered.
| Variant | Description |
|---|---|
| Spaces | Two trailing spaces at end of line. Default. Standard Markdown syntax. |
| Backslash | Backslash at end of line. Alternative Markdown syntax. |
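Both variants are standard Markdown hard line breaks. A small sketch of how the two encodings differ, using the hypothetical helper join_with_breaks:

```python
def join_with_breaks(lines: list[str], style: str = "spaces") -> str:
    """Join lines with a Markdown hard line break in the chosen style."""
    if style == "spaces":
        separator = "  \n"   # two trailing spaces, then a newline
    elif style == "backslash":
        separator = "\\\n"   # a backslash, then a newline
    else:
        raise ValueError(f"unknown newline style: {style!r}")
    return separator.join(lines)


if __name__ == "__main__":
    print(repr(join_with_breaks(["first", "second"], "spaces")))
    print(repr(join_with_breaks(["first", "second"], "backslash")))
```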
OutputFormat¶
Output format for conversion.
Specifies the target markup language format for the conversion output.
| Variant | Description |
|---|---|
| Markdown | Standard Markdown (CommonMark compatible). Default. |
| Djot | Djot lightweight markup language. |
| Plain | Plain text output (no markup, visible text only). |
PreprocessingPreset¶
HTML preprocessing aggressiveness level.
Controls the extent of cleanup performed before conversion. Higher levels remove more elements.
| Variant | Description |
|---|---|
| Minimal | Minimal cleanup. Remove only essential noise (scripts, styles). |
| Standard | Standard cleanup. Default. Removes navigation, forms, and other auxiliary content. |
| Aggressive | Aggressive cleanup. Remove extensive non-content elements and structure. |
TextDirection¶
Text directionality of document content.
Corresponds to the HTML dir attribute and bdi element directionality.
| Variant | Wire value | Description |
|---|---|---|
| LeftToRight | ltr | Left-to-right text flow (default for Latin scripts) |
| RightToLeft | rtl | Right-to-left text flow (Hebrew, Arabic, Urdu, etc.) |
| Auto | auto | Automatic directionality detection |
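The mapping between variants and wire values can be modelled with a plain Python enum. This class is an illustrative stand-in, not the library's own binding; the parse_dir helper is hypothetical, and returning None for unrecognised values is an assumption.

```python
from enum import Enum
from typing import Optional


class TextDirection(Enum):
    # Wire values taken from the table above.
    LEFT_TO_RIGHT = "ltr"
    RIGHT_TO_LEFT = "rtl"
    AUTO = "auto"


def parse_dir(value: Optional[str]) -> Optional[TextDirection]:
    """Map an HTML dir attribute value to a TextDirection, or None."""
    if value is None:
        return None
    try:
        # HTML enumerated attribute values are matched case-insensitively.
        return TextDirection(value.strip().lower())
    except ValueError:
        # Assumption: unknown values are treated as "no direction set".
        return None


if __name__ == "__main__":
    print(parse_dir("RTL"))
    print(parse_dir("sideways"))
```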
WhitespaceMode¶
Whitespace handling strategy during conversion.
Determines how sequences of whitespace characters (spaces, tabs, newlines) are processed.
| Variant | Description |
|---|---|
| Normalized | Collapse multiple whitespace characters to single spaces. Default. Matches browser behavior. |
| Strict | Preserve all whitespace exactly as it appears in the HTML. |
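The effect of the two modes on a text run can be approximated with a regular expression. This is a sketch of the documented behaviour only; the function apply_whitespace_mode is hypothetical, and the real converter may treat block boundaries and <pre> content differently.

```python
import re

# Spaces, tabs, and newlines (plus carriage return and form feed).
_WHITESPACE_RUN = re.compile(r"[ \t\r\n\f]+")


def apply_whitespace_mode(text: str, mode: str = "normalized") -> str:
    """Collapse whitespace runs (normalized) or leave text untouched (strict)."""
    if mode == "strict":
        return text
    if mode == "normalized":
        return _WHITESPACE_RUN.sub(" ", text)
    raise ValueError(f"unknown whitespace mode: {mode!r}")


if __name__ == "__main__":
    sample = "a \t\n  b"
    print(repr(apply_whitespace_mode(sample, "normalized")))
    print(repr(apply_whitespace_mode(sample, "strict")))
```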