Visitor Pattern¶
The visitor system is the library's main extensibility point. Implement HtmlVisitor and you can replace, skip, or augment how any HTML element becomes Markdown. No fork required.
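Rust users must opt in with features = ["visitor"]. Bindings that support visitors expose them through their native idiom (a duck-typed object in Python, Ruby, and PHP, a callback map in Elixir, a callback struct in C, and so on) and link against a Rust core built with the feature enabled. The Go, Java, C#, and Kotlin Android bindings do not support visitors yet.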
Execution Order¶
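Traversal is depth-first, in document order: visit_element_start fires on the way down (pre-order), and visit_element_end fires on the way back up, once the element's children have been rendered. For <div><p>text</p></div>:
1. visit_element_start fires for <div>
2. visit_element_start fires for <p>
3. visit_text fires for "text"
4. visit_element_end fires for <p> with the rendered output
5. visit_element_end fires for <div> with the rendered output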
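The callback sequence above can be reproduced with a toy walker. This is a minimal sketch, not the library's traversal code: Node, trace, and trace_example are hypothetical names, and the tree is hand-built rather than parsed from HTML.

```rust
// Stand-in DOM node for illustration only; not the library's internal type.
enum Node {
    Element { tag: &'static str, children: Vec<Node> },
    Text(&'static str),
}

// Walk the tree depth-first and record which callback would fire at each step.
fn trace(node: &Node, events: &mut Vec<String>) {
    match node {
        Node::Text(text) => events.push(format!("visit_text \"{text}\"")),
        Node::Element { tag, children } => {
            // Start fires before the children are visited...
            events.push(format!("visit_element_start <{tag}>"));
            for child in children {
                trace(child, events);
            }
            // ...and end fires after all of them have been handled.
            events.push(format!("visit_element_end <{tag}>"));
        }
    }
}

// Build <div><p>text</p></div> by hand and return its callback sequence.
fn trace_example() -> Vec<String> {
    let tree = Node::Element {
        tag: "div",
        children: vec![Node::Element {
            tag: "p",
            children: vec![Node::Text("text")],
        }],
    };
    let mut events = Vec::new();
    trace(&tree, &mut events);
    events
}
```

Calling trace_example() yields five entries, in the same order as the list above.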
visit_text is hot. It runs for every text node in the document, often 100+ times on a single page. Return Continue fast when you don't care about the node, and avoid allocations in the method body.
VisitResult¶
Every callback returns a VisitResult.
| Variant | Effect |
|---|---|
| Continue | Use the default rendering. |
| Custom(String) | Replace the default output with the supplied Markdown. The visitor owns the rendering for this node and its children. |
| Skip | Drop the element and all of its children. |
| PreserveHtml | Emit the raw HTML for this element verbatim. |
| Error(String) | Halt conversion. The message surfaces as ConversionError::Visitor in Rust (behind features = ["visitor"]). |
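One way to read the table is as a function from the visitor's answer to the output for a single node. The sketch below assumes a stand-in VisitResult enum with the same variant names and a hypothetical resolve helper; the crate's real handling happens inside its walker and may differ in detail.

```rust
// Stand-in mirroring the variants in the table; the real enum lives in the crate.
#[derive(Debug, Clone, PartialEq)]
enum VisitResult {
    Continue,
    Custom(String),
    Skip,
    PreserveHtml,
    Error(String),
}

// Hypothetical helper: choose the output for one node.
// `default_md` is the Markdown the converter would have produced on its own;
// `raw_html` is the element's original markup.
fn resolve(result: VisitResult, default_md: &str, raw_html: &str) -> Result<String, String> {
    match result {
        VisitResult::Continue => Ok(default_md.to_string()),
        VisitResult::Custom(markdown) => Ok(markdown),
        VisitResult::Skip => Ok(String::new()),
        VisitResult::PreserveHtml => Ok(raw_html.to_string()),
        // An error stops the whole conversion rather than producing output.
        VisitResult::Error(message) => Err(message),
    }
}
```

For example, resolve(VisitResult::PreserveHtml, "**hi**", "<b>hi</b>") returns Ok("<b>hi</b>").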
NodeContext¶
Every callback receives a NodeContext describing the current node.
| Field | Type | Meaning |
|---|---|---|
| node_type | NodeType | Coarse-grained classification (heading, list, link, form, …). 87 variants. |
| tag_name | String | Raw HTML tag name. Lowercased. |
| attributes | BTreeMap<String, String> | All attributes on the element. |
| depth | usize | Depth in the DOM tree. Root is 0. |
| index_in_parent | usize | 0-based position among siblings. |
| parent_tag | Option<String> | Parent element's tag, or None at the root. |
| is_inline | bool | true when the element is rendered inline (inside a paragraph, link text, cell, …). |
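When debugging a visitor it can help to log where each node sits in the tree. The following is a sketch under stated assumptions: NodeContext here is a local stand-in carrying the fields from the table (node_type is omitted for brevity), and describe is a hypothetical helper, not part of the crate.

```rust
use std::collections::BTreeMap;

// Stand-in with the documented fields (node_type omitted); the real struct
// is provided by the crate.
struct NodeContext {
    tag_name: String,
    attributes: BTreeMap<String, String>,
    depth: usize,
    index_in_parent: usize,
    parent_tag: Option<String>,
    is_inline: bool,
}

// Hypothetical debugging helper: a one-line summary of a node's position.
fn describe(ctx: &NodeContext) -> String {
    // parent_tag is None at the root, so fall back to a placeholder.
    let parent = ctx.parent_tag.as_deref().unwrap_or("(root)");
    let kind = if ctx.is_inline { "inline" } else { "block" };
    format!(
        "{parent} > {}[{}] depth={} {kind} attrs={}",
        ctx.tag_name,
        ctx.index_in_parent,
        ctx.depth,
        ctx.attributes.len()
    )
}
```

A link that is the first child of a paragraph at depth 2, with a single href attribute, is summarized as "p > a[0] depth=2 inline attrs=1".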
Method Reference¶
All 40 methods have default implementations that return Continue. Override only the ones you care about.
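The "override only what you need" design rests on default trait methods. The sketch below shows the idea with a stand-in trait that has just two hooks and simplified signatures (the ctx argument is left out); DropImages is a hypothetical visitor.

```rust
// Stand-in result type with only the variants this sketch needs.
#[derive(Debug, PartialEq)]
enum VisitResult {
    Continue,
    Skip,
}

// Stand-in trait: every hook has a default body that returns Continue.
trait HtmlVisitor {
    fn visit_text(&mut self, _text: &str) -> VisitResult {
        VisitResult::Continue
    }
    fn visit_image(&mut self, _src: &str, _alt: &str) -> VisitResult {
        VisitResult::Continue
    }
}

// A visitor that overrides a single hook and inherits the rest.
struct DropImages;

impl HtmlVisitor for DropImages {
    fn visit_image(&mut self, _src: &str, _alt: &str) -> VisitResult {
        VisitResult::Skip
    }
}
```

DropImages never mentions visit_text, yet calling it still works and returns Continue from the default body.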
Generic element callbacks¶
| Method | When it fires |
|---|---|
| visit_element_start(ctx) | Before any element. First callback for every node. |
| visit_element_end(ctx, output) | After an element, with the rendered Markdown. |
| visit_text(ctx, text) | Every text node. HTML entities already decoded. |
| visit_custom_element(ctx, tag_name, html) | Unknown tags and web components. |
Links and images¶
| Method | Arguments |
|---|---|
| visit_link(ctx, href, text, title) | <a> anchor with href, rendered text, and optional title. |
| visit_image(ctx, src, alt, title) | <img> with src, alt text, and optional title. |
Headings, rules, breaks¶
| Method | Arguments |
|---|---|
| visit_heading(ctx, level, text, id) | <h1>–<h6> with level (1-6), text, and optional id. |
| visit_horizontal_rule(ctx) | <hr>. |
| visit_line_break(ctx) | <br>. |
Code¶
| Method | Arguments |
|---|---|
| visit_code_block(ctx, lang, code) | <pre><code> with language tag and raw code. |
| visit_code_inline(ctx, code) | Inline <code>. |
Lists¶
| Method | Arguments |
|---|---|
| visit_list_start(ctx, ordered) | Before <ul> or <ol>. |
| visit_list_item(ctx, ordered, marker, text) | Each <li> with marker and rendered text. |
| visit_list_end(ctx, ordered, output) | After the list, with the rendered block. |
Definition lists¶
| Method | Arguments |
|---|---|
| visit_definition_list_start(ctx) | Before <dl>. |
| visit_definition_term(ctx, text) | <dt>. |
| visit_definition_description(ctx, text) | <dd>. |
| visit_definition_list_end(ctx, output) | After <dl>. |
Tables¶
| Method | Arguments |
|---|---|
| visit_table_start(ctx) | Before <table>. |
| visit_table_row(ctx, cells, is_header) | Each <tr>. Cells are pre-rendered Markdown. is_header is true for rows inside <thead>. |
| visit_table_end(ctx, output) | After <table>. |
Blockquote¶
| Method | Arguments |
|---|---|
| visit_blockquote(ctx, content, depth) | <blockquote> with rendered content and nesting depth. |
Inline formatting¶
| Method | Covers |
|---|---|
| visit_strong(ctx, text) | <strong>, <b>. |
| visit_emphasis(ctx, text) | <em>, <i>. |
| visit_strikethrough(ctx, text) | <s>, <del>, <strike>. |
| visit_underline(ctx, text) | <u>, <ins>. |
| visit_subscript(ctx, text) | <sub>. |
| visit_superscript(ctx, text) | <sup>. |
| visit_mark(ctx, text) | <mark>. |
Forms¶
| Method | Arguments |
|---|---|
| visit_form(ctx, action, method) | <form> with optional action URL and method. |
| visit_input(ctx, input_type, name, value) | <input>. |
| visit_button(ctx, text) | <button>. |
Media¶
| Method | Arguments |
|---|---|
| visit_audio(ctx, src) | <audio>. |
| visit_video(ctx, src) | <video>. |
| visit_iframe(ctx, src) | <iframe>. |
Interactive¶
| Method | Arguments |
|---|---|
| visit_details(ctx, open) | <details> with the open attribute. |
| visit_summary(ctx, text) | <summary>. |
Figures¶
| Method | Arguments |
|---|---|
| visit_figure_start(ctx) | Before <figure>. |
| visit_figcaption(ctx, text) | <figcaption>. |
| visit_figure_end(ctx, output) | After <figure>. |
Basic Visitor¶
Rust:

use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult, VisitorHandle};
use html_to_markdown_rs::{ConversionOptions, convert};
use std::cell::RefCell;
use std::rc::Rc;

#[derive(Debug)]
struct LinkRewriter;

impl HtmlVisitor for LinkRewriter {
    fn visit_link(
        &mut self,
        _ctx: &NodeContext,
        href: &str,
        text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        VisitResult::Custom(format!("[{text}](https://track.example.com?url={href})"))
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"<a href="https://example.com">Click here</a>"#;
    let visitor: VisitorHandle = Rc::new(RefCell::new(LinkRewriter));
    let options = ConversionOptions::builder().visitor(Some(visitor)).build();
    let result = convert(html, Some(options))?;
    println!("{}", result.content.unwrap_or_default());
    Ok(())
}
Python:

from html_to_markdown import ConversionOptions, convert

class CustomVisitor:
    def visit_link(self, ctx, href, text, title):
        return {"type": "continue"}

    def visit_image(self, ctx, src, alt, title):
        return {"type": "continue"}

# Sample input so the snippet runs on its own.
html = '<h1>Title</h1><a href="https://example.com">Link</a>'
options = ConversionOptions(visitor=CustomVisitor())
result = convert(html, options)
markdown = result.content
TypeScript:

import { convert, ConversionOptions } from "@kreuzberg/html-to-markdown";
import { Visitor, NodeContext, VisitResult } from "@kreuzberg/html-to-markdown";

const visitor: Visitor = {
  visitLink(ctx: NodeContext, href: string, text: string): VisitResult {
    // Custom handling for links
    return {
      type: "custom",
      output: `[${text}](${href})`,
    };
  },
  visitHeading(ctx: NodeContext, level: number, text: string): VisitResult {
    // Custom handling for headings
    return {
      type: "continue",
    };
  },
};

const options: ConversionOptions = { visitor };
const result = convert('<h1>Title</h1><a href="url">Link</a>', options);
const markdown = result.content;
// The visitor pattern is not yet supported in the Go binding.
// Use Convert() with ConversionOptions instead.
Ruby:

require 'html_to_markdown'

# A visitor is any Ruby object that responds to `visit_*` methods. The
# bridge calls `respond_to?(name, false)` and dispatches via `funcall`,
# so plain methods on a class (or any object) work.
class MyVisitor
  # `ctx` is a Hash: { node_type:, tag_name:, depth:, index_in_parent:, ... }
  def visit_link(ctx, href, text, title = nil)
    # Return a custom output by wrapping it in `{ custom: ... }`. Any
    # other Hash without :custom is treated as `:continue`.
    { custom: "[#{text}](#{href})" }
  end

  def visit_image(ctx, src, alt, title = nil)
    # Return :skip (or the string "skip") to drop the element.
    # Other accepted directives: :continue, :preserve_html.
    :skip
  end

  # `visit_text` runs for every text node, often 100+ times per
  # document, so keep it cheap.
  def visit_text(ctx, text)
    :continue
  end
end

html = "<p><a href='https://example.com'>Link</a><img src='x.png'></p>"

# The visitor is passed as the second positional argument. The Ruby
# binding currently does NOT support combining the visitor with a
# `ConversionOptions` Hash in a single call; pick one. To use both,
# build the options on the Rust side via the FFI directly.
result = HtmlToMarkdown.convert(html, MyVisitor.new)
puts result.content
PHP:

use HtmlToMarkdown\HtmlToMarkdown;
use HtmlToMarkdown\ConversionOptions;

// Visitors are duck-typed: define any subset of visit_* methods.
// Each method returns either 'skip', ['custom' => '...'], or null/'continue'.
$visitor = new class {
    public function visit_link($ctx, $href, $text, $title) {
        return ['custom' => "[{$text}]({$href})"];
    }

    public function visit_image($ctx, $src, $alt, $title) {
        return 'skip';
    }
};

$options = ConversionOptions::builder()->visitor($visitor)->build();
$result = HtmlToMarkdown::convert(
    '<a href="/page">Link</a><img src="pic.png" alt="pic">',
    $options
);
echo $result->content;
// The visitor pattern is not yet supported in the Java binding.
// Use convert() with ConversionOptions instead.
// The visitor pattern is not yet supported in the C# binding.
// Use Convert() with ConversionOptions instead.
Visitor Pattern - Elixir¶
Customize HTML to Markdown conversion by passing a visitor map under
the :visitor key of HtmlToMarkdown.convert/2's options. Each entry
maps a callback atom (e.g. :handle_link) to a one-arity function
that receives the JSON-decoded arguments map.
The bridge spawns a system thread for the conversion, then sends
{:visitor_callback, ref_id, callback_name, args_json} messages back
to the calling process. HtmlToMarkdown.convert/2 runs a receive loop
that dispatches each message against your visitor map and calls
HtmlToMarkdown.Native.visitor_reply/2 to unblock the worker.
Basic Visitor Example¶
visitor = %{
  :handle_link => fn args ->
    text = Map.get(args, "text", "")
    {:custom, text}
  end,
  :handle_image => fn _args -> :skip end,
  :handle_text => fn _args -> :continue end
}
html = "<p>Visit <a href='https://example.com'>our site</a> for more!</p>"
{:ok, result} = HtmlToMarkdown.convert(html, %{visitor: visitor})
IO.puts(result.content)
# => Visit our site for more!
Visitor Return Values¶
Each function must return one of:
- :continue: proceed with default conversion
- :skip: omit this element entirely
- :preserve_html: include the raw HTML verbatim
- {:custom, markdown_string}: replace this element's output with the given string
- A bare string: treated as a custom replacement
Anything else falls back to :continue.
Callback Names¶
Callbacks are keyed by atom and use the handle_ prefix (the bridge
translates the Rust visit_X trait methods to :handle_X over the
wire). Frequently overridden:
- :handle_text (text nodes): %{"ctx" => …, "text" => "…"}. Called for every text node, often 100+ times per document; keep it cheap.
- :handle_link (<a> elements): %{"ctx" => …, "href" => "…", "text" => "…", "title" => …}
- :handle_image (<img> elements): %{"ctx" => …, "src" => "…", "alt" => "…", "title" => …}
- :handle_heading (headings): %{"ctx" => …, "level" => 1, "text" => "…", "id" => …}
- :handle_code_block (<pre><code>): %{"ctx" => …, "lang" => …, "code" => "…"}
- :handle_element_start / :handle_element_end: generic enter/leave hooks
Omit a callback to fall through to the default Rust implementation.
Node Context¶
The "ctx" value in every callback arg map is a JSON-decoded map:
%{
  "node_type" => "Link",
  "tag_name" => "a",
  "depth" => 2,
  "index_in_parent" => 0,
  "parent_tag" => "p"
}
Combining Options and Visitor¶
Pass the visitor under the :visitor key alongside any other
ConversionOptions fields:
{:ok, result} = HtmlToMarkdown.convert(html, %{
  visitor: %{:handle_link => fn _ -> :skip end},
  output_format: "github",
  extract_metadata: true
})
HtmlToMarkdown.convert/2 pops the :visitor key, JSON-encodes the
remaining options, and dispatches to convert_with_visitor.
library(htmltomarkdown)
html <- "<p>Visit <a href='https://example.com'>our site</a> for more!</p>"
opts <- conversion_options(extract_metadata = FALSE)
result <- convert(html, opts)
cat(result$content)
C:

#include "html_to_markdown.h"
#include <stdint.h>
#include <stdio.h>

/* Each callback returns an int32 status code:
 *   HTM_VISIT_CONTINUE       use default conversion
 *   HTM_VISIT_SKIP           drop the element
 *   HTM_VISIT_PRESERVE_HTML  emit the raw HTML
 *   HTM_VISIT_CUSTOM         replace with the string written to *out
 *   HTM_VISIT_ERROR          abort conversion with the error in *out
 */
static int32_t visit_heading(const struct HTMNodeContext *ctx,
                             uint32_t level,
                             const char *text,
                             const char *id,
                             char **out,
                             void *user_data) {
    /* This example ignores every argument and keeps the default rendering. */
    (void)ctx; (void)level; (void)text; (void)id; (void)out; (void)user_data;
    return HTM_VISIT_CONTINUE;
}

int main(void) {
    HTMHtmVisitorCallbacks callbacks = {0};
    callbacks.visit_heading = visit_heading;

    HTMHtmVisitor *visitor = htm_visitor_create(&callbacks);
    HTMConversionOptions *options = htm_conversion_options_default();
    htm_options_set_visitor(options, (struct HTMHtmHtmlVisitorBridge *)visitor);

    HTMConversionResult *result = htm_convert("<h1>Title</h1><p>Content</p>", options);
    htm_conversion_options_free(options);
    htm_visitor_free(visitor);

    if (result == NULL) {
        fprintf(stderr, "convert failed: %s\n", htm_last_error_context());
        return 1;
    }

    char *content = htm_conversion_result_content(result);
    if (content != NULL) {
        printf("%s\n", content);
        htm_free_string(content);
    }
    htm_conversion_result_free(result);
    return 0;
}
WebAssembly:

import init, { convert } from "@kreuzberg/html-to-markdown-wasm";

await init();

const visitor = {
  visit_link(ctx, href, text, title) {
    return { type: "continue" };
  },
  visit_image(ctx, src, alt, title) {
    return { type: "continue" };
  },
};

const result = convert('<h1>Hello</h1><a href="https://example.com">link</a>', undefined, visitor);
console.log(result.content);
Swift:

import HtmlToMarkdown

final class CustomVisitor: HtmlVisitorProtocol {
    func visitLink(_ ctx: NodeContext, _ href: String, _ text: String, _ title: String?) -> VisitResult {
        // Replace links with a bracketed custom format
        return .custom(field0: "[\(text)](\(href))")
    }

    func visitHeading(_ ctx: NodeContext, _ level: UInt32, _ text: String, _ id: String?) -> VisitResult {
        // Keep default rendering for headings
        return .continue_
    }
}

let visitorHandle = makeHtmlVisitorHandle(CustomVisitor())
let options = try conversionOptionsFromJsonWithVisitor("{}", visitorHandle)
let html = "<h1>Title</h1><p>See <a href=\"https://example.com\">example</a>.</p>"
let result = try convert(html, options)
let markdown = result.content()?.toString() ?? ""
print(markdown)
Dart:

import 'package:h2m/h2m.dart';
import 'package:h2m/src/html_to_markdown_rs_bridge_generated/frb_generated.dart'
    show RustLib;

Future<void> main() async {
  await RustLib.init();

  // flutter_rust_bridge requires every visit callback: default to continue_()
  // and override only the hooks you care about (here: links and headings).
  final visitor = await createHtmlVisitor(
    visitText: (ctx, text) async => VisitResult.continue_(),
    visitElementStart: (ctx) async => VisitResult.continue_(),
    visitElementEnd: (ctx, output) async => VisitResult.continue_(),
    visitLink: (ctx, href, text, title) async =>
        VisitResult.custom(field0: '[$text]($href)'),
    visitImage: (ctx, src, alt, title) async => VisitResult.continue_(),
    visitHeading: (ctx, level, text, id) async =>
        VisitResult.custom(field0: '${'#' * level.toInt()} $text\n'),
    visitCodeBlock: (ctx, lang, code) async => VisitResult.continue_(),
    visitCodeInline: (ctx, code) async => VisitResult.continue_(),
    visitListItem: (ctx, ordered, marker, text) async => VisitResult.continue_(),
    visitListStart: (ctx, ordered) async => VisitResult.continue_(),
    visitListEnd: (ctx, ordered, output) async => VisitResult.continue_(),
    visitTableStart: (ctx) async => VisitResult.continue_(),
    visitTableRow: (ctx, cells, isHeader) async => VisitResult.continue_(),
    visitTableEnd: (ctx, output) async => VisitResult.continue_(),
    visitBlockquote: (ctx, content, depth) async => VisitResult.continue_(),
    visitStrong: (ctx, text) async => VisitResult.continue_(),
    visitEmphasis: (ctx, text) async => VisitResult.continue_(),
    visitStrikethrough: (ctx, text) async => VisitResult.continue_(),
    visitUnderline: (ctx, text) async => VisitResult.continue_(),
    visitSubscript: (ctx, text) async => VisitResult.continue_(),
    visitSuperscript: (ctx, text) async => VisitResult.continue_(),
    visitMark: (ctx, text) async => VisitResult.continue_(),
    visitLineBreak: (ctx) async => VisitResult.continue_(),
    visitHorizontalRule: (ctx) async => VisitResult.continue_(),
    visitCustomElement: (ctx, tagName, html) async => VisitResult.continue_(),
    visitDefinitionListStart: (ctx) async => VisitResult.continue_(),
    visitDefinitionTerm: (ctx, text) async => VisitResult.continue_(),
    visitDefinitionDescription: (ctx, text) async => VisitResult.continue_(),
    visitDefinitionListEnd: (ctx, output) async => VisitResult.continue_(),
    visitForm: (ctx, action, method) async => VisitResult.continue_(),
    visitInput: (ctx, inputType, name, value) async => VisitResult.continue_(),
    visitButton: (ctx, text) async => VisitResult.continue_(),
    visitAudio: (ctx, src) async => VisitResult.continue_(),
    visitVideo: (ctx, src) async => VisitResult.continue_(),
    visitIframe: (ctx, src) async => VisitResult.continue_(),
    visitDetails: (ctx, open) async => VisitResult.continue_(),
    visitSummary: (ctx, text) async => VisitResult.continue_(),
    visitFigureStart: (ctx) async => VisitResult.continue_(),
    visitFigcaption: (ctx, text) async => VisitResult.continue_(),
    visitFigureEnd: (ctx, output) async => VisitResult.continue_(),
  );

  final options = await createConversionOptionsFromJsonWithVisitor(
    json: '{}',
    visitor: visitor,
  );
  final result = await H2mBridge.convert(
    '<h1>Title</h1><a href="https://example.com">Link</a>',
    options: options,
  );
  print(result.content);
}
// The visitor pattern is not yet supported in the Kotlin Android binding.
// Use HtmlToMarkdownRs.convert() with ConversionOptions instead — for example,
// configure exclude_selectors, strip_tags, or preserve_tags to control output.
Zig:

const std = @import("std");
const html_to_markdown = @import("html_to_markdown");
const c = html_to_markdown.c;

// Visitor callbacks return an int32 status code:
//   0 (HTM_VISIT_CONTINUE)       use default conversion
//   1 (HTM_VISIT_SKIP)           drop the element
//   2 (HTM_VISIT_PRESERVE_HTML)  emit the raw HTML
//   3 (HTM_VISIT_CUSTOM)         replace with the string written via out_custom/out_len
//   4 (HTM_VISIT_ERROR)          abort conversion with the error in out_custom
fn visit_heading(
    _ctx: [*c]const c.HTMHtmNodeContext,
    _user_data: ?*anyopaque,
    _level: u32,
    _text: [*c]const u8,
    _id: [*c]const u8,
    out_custom: [*c][*c]u8,
    out_len: [*c]usize,
) callconv(.c) i32 {
    _ = _ctx;
    _ = _user_data;
    _ = _id;
    const text = std.mem.span(_text);
    const buf = std.fmt.allocPrintSentinel(
        std.heap.c_allocator,
        "<<H{d}: {s}>>",
        .{ _level, text },
        0,
    ) catch return 0;
    if (out_custom != null) out_custom.* = buf.ptr;
    if (out_len != null) out_len.* = buf.len;
    return 3; // HTM_VISIT_CUSTOM
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var callbacks: c.HTMHtmVisitorCallbacks = std.mem.zeroes(c.HTMHtmVisitorCallbacks);
    callbacks.visit_heading = &visit_heading;
    const visitor = c.htm_visitor_create(&callbacks);
    defer c.htm_visitor_free(visitor);

    const options_z = try std.heap.c_allocator.dupeZ(u8, "{}");
    defer std.heap.c_allocator.free(options_z);
    const options = c.htm_conversion_options_from_json(options_z.ptr);
    defer c.htm_conversion_options_free(options);
    c.htm_options_set_visitor_handle(options, visitor);

    const html_z = try std.heap.c_allocator.dupeZ(u8, "<h1>Title</h1><p>Body.</p>");
    defer std.heap.c_allocator.free(html_z);
    const result = c.htm_convert(html_z.ptr, options) orelse return error.ConvertFailed;
    defer c.htm_conversion_result_free(result);

    const json_ptr = c.htm_conversion_result_to_json(result);
    defer c.htm_free_string(json_ptr);
    const json = std.mem.sliceTo(json_ptr, 0);
    var parsed = try std.json.parseFromSlice(std.json.Value, allocator, json, .{});
    defer parsed.deinit();
    std.debug.print("{s}\n", .{parsed.value.object.get("content").?.string});
}
Common Patterns¶
Link rewriting¶
Override visit_link, return VisitResult::Custom(...) with the new URL baked in. Useful for rewriting relative links to absolute, stripping tracking parameters, or converting internal links to anchor references.
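As a sketch of the tracking-parameter case: the helper below removes utm_* query parameters with plain string handling and builds the Markdown that a visit_link override could wrap in VisitResult::Custom. Both strip_tracking and rewrite_link are hypothetical names, the utm_ prefix is an assumed convention, and a real implementation would more likely use a URL-parsing crate.

```rust
// Remove `utm_*` query parameters from a URL, leaving everything else untouched.
fn strip_tracking(href: &str) -> String {
    // Split off the fragment first so it can be re-attached unchanged.
    let (without_fragment, fragment) = match href.split_once('#') {
        Some((head, frag)) => (head, Some(frag)),
        None => (href, None),
    };
    let (path, query) = match without_fragment.split_once('?') {
        Some((path, query)) => (path, query),
        None => (without_fragment, ""),
    };
    // Keep every non-empty parameter that is not a tracking parameter.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty() && !pair.starts_with("utm_"))
        .collect();

    let mut out = path.to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    if let Some(frag) = fragment {
        out.push('#');
        out.push_str(frag);
    }
    out
}

// The Markdown string a visit_link override could return as custom output.
fn rewrite_link(href: &str, text: &str) -> String {
    format!("[{text}]({})", strip_tracking(href))
}
```

For instance, rewrite_link("/docs?utm_medium=mail", "Docs") produces "[Docs](/docs)", while parameters such as id=7 and any #fragment survive.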
Element filtering¶
Override visit_element_start and return VisitResult::Skip when ctx.tag_name matches an unwanted tag. The element and every descendant is dropped. A class filter works too: check ctx.attributes.get("class") and skip on match.
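The filtering decision itself is a small predicate over the tag name and the attribute map. The sketch below assumes the BTreeMap<String, String> attribute type from the NodeContext table; should_skip and the two lists of unwanted tags and classes are hypothetical. A visit_element_start override would return VisitResult::Skip when it yields true.

```rust
use std::collections::BTreeMap;

// Decide whether an element should be dropped, based on its tag and classes.
fn should_skip(tag_name: &str, attributes: &BTreeMap<String, String>) -> bool {
    // Hypothetical block lists; adjust to the pages being converted.
    const UNWANTED_TAGS: [&str; 3] = ["nav", "footer", "aside"];
    const UNWANTED_CLASSES: [&str; 2] = ["ad", "cookie-banner"];

    if UNWANTED_TAGS.contains(&tag_name) {
        return true;
    }
    // The class attribute is a space-separated list, so compare whole tokens
    // rather than substrings ("adjacent" must not match "ad").
    attributes.get("class").map_or(false, |classes| {
        classes
            .split_whitespace()
            .any(|class| UNWANTED_CLASSES.contains(&class))
    })
}
```

Comparing whole class tokens avoids false positives that a plain contains check on the attribute string would produce.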
Content extraction¶
Override visit_text and push each text fragment into an external buffer. The visitor becomes a simple text-extraction pass that bypasses Markdown rendering. Combine with Skip on unwanted elements to exclude code blocks, navigation, or footers.
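A rough sketch of such a collector follows. TextCollector and its method names are hypothetical, and VisitResult is a local stand-in; in a real visitor, on_element_start and on_text would be the bodies of visit_element_start and visit_text. Because Skip drops a whole subtree, text inside the excluded tags never reaches on_text.

```rust
// Stand-in result type with only the variants this sketch needs.
#[derive(Debug, PartialEq)]
enum VisitResult {
    Continue,
    Skip,
}

// Accumulates the text fragments seen during a conversion.
#[derive(Default)]
struct TextCollector {
    fragments: Vec<String>,
}

impl TextCollector {
    // Body for visit_element_start: drop subtrees that should not contribute text.
    fn on_element_start(&mut self, tag_name: &str) -> VisitResult {
        if matches!(tag_name, "pre" | "nav" | "footer") {
            VisitResult::Skip
        } else {
            VisitResult::Continue
        }
    }

    // Body for visit_text: record non-blank text and keep the default rendering.
    fn on_text(&mut self, text: &str) -> VisitResult {
        let trimmed = text.trim();
        if !trimmed.is_empty() {
            self.fragments.push(trimmed.to_string());
        }
        VisitResult::Continue
    }

    // Join the collected fragments into one plain-text string.
    fn into_text(self) -> String {
        self.fragments.join(" ")
    }
}
```

With the crate's VisitorHandle (an Rc<RefCell<…>> in the Rust example earlier on this page), keeping a clone of the handle is one way to read the buffer back after conversion.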
Performance¶
visit_text fires on every text node. Keep the handler small. Match the few element kinds you care about in visit_element_start and return Continue for everything else. Allocations inside the handler multiply by the number of text nodes in the input.
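The sketch below illustrates the fast-path idea for a text handler: do a cheap, allocation-free check first and allocate only for the rare node that needs rewriting. VisitResult is a local stand-in, on_text stands for the body of a visit_text override, and the "(c)" replacement is an invented example.

```rust
// Stand-in result type with only the variants this sketch needs.
#[derive(Debug, PartialEq)]
enum VisitResult {
    Continue,
    Custom(String),
}

// Body for a visit_text override that rewrites "(c)" to the copyright sign.
fn on_text(text: &str) -> VisitResult {
    // Fast path: a substring scan allocates nothing, and most nodes exit here.
    if !text.contains("(c)") {
        return VisitResult::Continue;
    }
    // Slow path: allocate a new String only when a rewrite is actually needed.
    VisitResult::Custom(text.replace("(c)", "©"))
}
```

Calling text.replace unconditionally would allocate a fresh String for every text node, including the ones that never contain the pattern.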
The visitor trait is synchronous. The core walker calls each method in place during the single-pass DOM traversal.