Migrating to v3¶
Background¶
html-to-markdown began as a Python library — a fork of markdownify — focused on converting HTML to Markdown. Over time, the core was rewritten in Rust for performance, and native bindings were added for 12 programming languages: Python, TypeScript, Ruby, Go, PHP, Java, C#, Elixir, R, C, and WebAssembly.
v3.0.0 rationalises the library's API surface. Instead of multiple
specialised functions, there is now a single convert() function that
returns a structured ConversionResult. This simplifies the API while
making all extraction capabilities available through one call.
hOCR support moved to kreuzberg¶
hOCR (OCR-generated HTML) processing was removed from html-to-markdown and moved into kreuzberg, the document extraction library in the kreuzberg.dev ecosystem. hOCR is a document-level concern — it belongs in the extraction pipeline, not in an HTML-to-Markdown converter.
API changes¶
Before (v2)¶
v2 had multiple functions for different extraction needs:
convert(html)→Stringconvert_with_metadata(html, options, config)→(String, Metadata)convert_with_inline_images(html, options, config)→Extractionconvert_with_tables(html, options)→TablesResultconvert_with_visitor(html, options, visitor)→String
After (v3)¶
v3 has one function:
convert(html, options?, visitor?)→ConversionResult
ConversionResult contains all data:
content— the converted text (Markdown, Djot, or plain text)metadata— HTML metadata (title, OG, links, images, structured data)tables— extracted tables with cell datadocument— structured document tree (when enabled)images— extracted inline images (when enabled)warnings— non-fatal processing warnings
Migration table¶
| v2 | v3 |
|---|---|
convert(html) |
convert(html).content |
convert_with_metadata(html, opts, cfg) |
convert(html, opts).metadata |
convert_with_inline_images(html, opts, cfg) |
convert(html, opts).images (set extract_images: true) |
convert_with_tables(html, opts) |
convert(html, opts).tables |
convert_with_visitor(html, opts, v) |
convert(html, opts, v).content |
convert_to_string(html) |
convert(html).content |
hocr_spatial_tables: true |
Use kreuzberg |
ExtendedMetadata |
HtmlMetadata |
start_profiling() |
Removed |
markdownify(html) (Python) |
convert(html).content |
Examples by language¶
Rust¶
// v2
let markdown = convert(html, None)?;
let (md, meta) = convert_with_metadata(html, None, cfg, None)?;
// v3
let result = convert(html, None)?;
let markdown = result.content.unwrap_or_default();
let metadata = result.metadata; // always available
Python¶
# v2
markdown = convert(html)
markdown, metadata = convert_with_metadata(html)
# v3
result = convert(html)
markdown = result["content"]
metadata = result["metadata"]
TypeScript¶
// v2
const markdown = convert(html);
const { markdown, metadata } = convertWithMetadata(html);
// v3
const result = convert(html);
const markdown = result.content;
const metadata = result.metadata;
Go¶
// v2
markdown, err := htmltomarkdown.Convert(html, nil)
markdown, meta, err := htmltomarkdown.ConvertWithMetadata(html, nil)
// v3
result, err := htmltomarkdown.Convert(html, nil)
markdown := result.Content
metadata := result.Metadata
Ruby¶
# v2
markdown = HtmlToMarkdown.convert(html)
markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
# v3
result = HtmlToMarkdown.convert(html)
markdown = result[:content]
metadata = result[:metadata]
PHP¶
// v2
$markdown = HtmlToMarkdown\convert($html);
[$markdown, $metadata] = HtmlToMarkdown\convert_with_metadata($html);
// v3
$result = HtmlToMarkdown\convert($html);
$markdown = $result->content;
$metadata = $result->metadata;
Java¶
// v2
String markdown = HtmlToMarkdown.convert(html);
ConversionPair pair = HtmlToMarkdown.convertWithMetadata(html);
// v3
ConversionResult result = HtmlToMarkdown.convert(html);
String markdown = result.getContent();
HtmlMetadata metadata = result.getMetadata();
C¶
// v2
string markdown = HtmlToMarkdown.Convert(html);
(string md, Metadata meta) = HtmlToMarkdown.ConvertWithMetadata(html);
// v3
ConversionResult result = HtmlToMarkdown.Convert(html);
string markdown = result.Content;
HtmlMetadata metadata = result.Metadata;
Elixir¶
# v2
{:ok, markdown} = HtmlToMarkdown.convert(html)
{:ok, markdown, metadata} = HtmlToMarkdown.convert_with_metadata(html)
# v3
{:ok, result} = HtmlToMarkdown.convert(html)
markdown = result.content
metadata = result.metadata
R¶
# v2
markdown <- html_to_markdown(html)
result <- html_to_markdown_with_metadata(html)
# v3
result <- convert(html)
markdown <- result$content
metadata <- result$metadata