Visitor Pattern

The visitor system is the library's main extensibility point. Implement HtmlVisitor and you can replace, skip, or augment how any HTML element becomes Markdown. No fork required.

Rust users must opt in with features = ["visitor"]. Bindings that support visitors expose them through their native idiom (Visitor object in TypeScript, callback object in Python, etc.) and link against a Rust core built with the feature enabled. The Go, Java, C#, and Kotlin Android bindings do not support visitors yet.

Execution Order

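Traversal is depth-first: visit_element_start fires on the way down (pre-order) and visit_element_end fires on the way back up, once the children are rendered. For <div><p>text</p></div>: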

  1. visit_element_start fires for <div>
  2. visit_element_start fires for <p>
  3. visit_text fires for "text"
  4. visit_element_end fires for <p> with the rendered output
  5. visit_element_end fires for <div> with the rendered output
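
The ordering above can be reproduced with a toy walker. This is a minimal sketch only: `Node`, `walk`, and `trace_example` are hypothetical stand-ins written for illustration, not the library's internal types or traversal code.

```rust
// Toy DOM; a stand-in for illustration, not the library's node type.
enum Node {
    Element { tag: &'static str, children: Vec<Node> },
    Text(&'static str),
}

// Walks the tree depth-first and logs each callback in the order it would
// fire. Returns the "rendered" output of the subtree (here: its text).
fn walk(node: &Node, events: &mut Vec<String>) -> String {
    match node {
        Node::Text(t) => {
            events.push(format!("visit_text({t:?})"));
            t.to_string()
        }
        Node::Element { tag, children } => {
            events.push(format!("visit_element_start(<{tag}>)"));
            let mut output = String::new();
            for child in children {
                output.push_str(&walk(child, events));
            }
            // The end callback sees the already-rendered children.
            events.push(format!("visit_element_end(<{tag}>, {output:?})"));
            output
        }
    }
}

// Builds <div><p>text</p></div> and returns the callback log.
fn trace_example() -> Vec<String> {
    let tree = Node::Element {
        tag: "div",
        children: vec![Node::Element {
            tag: "p",
            children: vec![Node::Text("text")],
        }],
    };
    let mut events = Vec::new();
    walk(&tree, &mut events);
    events
}
```

Calling `trace_example()` yields five entries, with both start callbacks before `visit_text` and both end callbacks after it, innermost element first.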

visit_text is hot. It runs for every text node in the document, often 100+ times on a single page. Return Continue fast when you don't care about the node, and avoid allocations in the method body.

VisitResult

Every callback returns a VisitResult.

| Variant | Effect |
| --- | --- |
| Continue | Use the default rendering. |
| Custom(String) | Replace the default output with the supplied Markdown. The visitor owns the rendering for this node and its children. |
| Skip | Drop the element and all of its children. |
| PreserveHtml | Emit the raw HTML for this element verbatim. |
| Error(String) | Halt conversion. The message surfaces as ConversionError::Visitor in Rust (behind features = ["visitor"]). |
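
One way to read the variants is as instructions to the walker about what text to emit. The sketch below uses a local mirror of the enum so it compiles on its own; `resolve` is a hypothetical helper, and the real converter may apply these results differently.

```rust
// Local mirror of the documented variants. Real code would import
// VisitResult from html_to_markdown_rs::visitor instead.
#[derive(Debug, Clone, PartialEq)]
enum VisitResult {
    Continue,
    Custom(String),
    Skip,
    PreserveHtml,
    Error(String),
}

// Hypothetical helper: turn a callback's result into the text to emit,
// given the default Markdown and the element's raw HTML.
fn resolve(
    result: VisitResult,
    default_markdown: &str,
    raw_html: &str,
) -> Result<String, String> {
    match result {
        VisitResult::Continue => Ok(default_markdown.to_string()),
        VisitResult::Custom(markdown) => Ok(markdown),
        VisitResult::Skip => Ok(String::new()),
        VisitResult::PreserveHtml => Ok(raw_html.to_string()),
        // An error stops the whole conversion rather than one node.
        VisitResult::Error(message) => Err(message),
    }
}
```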

NodeContext

Every callback receives a NodeContext describing the current node.

| Field | Type | Meaning |
| --- | --- | --- |
| node_type | NodeType | Coarse-grained classification (heading, list, link, form, …). 87 variants. |
| tag_name | String | Raw HTML tag name. Lowercased. |
| attributes | BTreeMap<String, String> | All attributes on the element. |
| depth | usize | Depth in the DOM tree. Root is 0. |
| index_in_parent | usize | 0-based position among siblings. |
| parent_tag | Option<String> | Parent element's tag, or None at the root. |
| is_inline | bool | true when the element is rendered inline (inside a paragraph, link text, cell, …). |
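
As a rough illustration of how these fields combine, the sketch below defines a stand-in struct with the documented fields (node_type is left out for brevity) and two hypothetical helpers, `is_first_child_of` and `data_attr`. Neither helper is part of the library.

```rust
use std::collections::BTreeMap;

// Stand-in with the documented fields; node_type is omitted here.
// Real code receives &NodeContext from the library.
struct NodeContext {
    tag_name: String,
    attributes: BTreeMap<String, String>,
    depth: usize,
    index_in_parent: usize,
    parent_tag: Option<String>,
    is_inline: bool,
}

// True when the node is the first child of an element with the given tag.
fn is_first_child_of(ctx: &NodeContext, parent: &str) -> bool {
    ctx.index_in_parent == 0 && ctx.parent_tag.as_deref() == Some(parent)
}

// Looks up a data-* attribute, e.g. data_attr(ctx, "lang") reads "data-lang".
fn data_attr<'a>(ctx: &'a NodeContext, key: &str) -> Option<&'a str> {
    ctx.attributes
        .get(&format!("data-{key}"))
        .map(String::as_str)
}
```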

Method Reference

All 40 methods have default implementations that return Continue. Override only the ones you care about.
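
The default-method mechanism can be sketched with a two-method miniature of the trait. `MiniVisitor` and `DropImages` are hypothetical names, and the signatures are simplified (no ctx argument), so treat this as an illustration of the pattern rather than the real HtmlVisitor trait.

```rust
// Trimmed stand-in for VisitResult; only the variants used below.
#[derive(Debug, PartialEq)]
enum VisitResult {
    Continue,
    Skip,
}

// Miniature of the visitor trait: every method has a default body
// that returns Continue, so implementors override only what they need.
trait MiniVisitor {
    fn visit_image(&mut self, _src: &str) -> VisitResult {
        VisitResult::Continue
    }
    fn visit_horizontal_rule(&mut self) -> VisitResult {
        VisitResult::Continue
    }
}

// Overrides one method; the other keeps its default.
struct DropImages;

impl MiniVisitor for DropImages {
    fn visit_image(&mut self, _src: &str) -> VisitResult {
        VisitResult::Skip
    }
}
```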

Generic element callbacks

Method When it fires
visit_element_start(ctx) Before any element. First callback for every node.
visit_element_end(ctx, output) After an element, with the rendered Markdown.
visit_text(ctx, text) Every text node. HTML entities already decoded.
visit_custom_element(ctx, tag_name, html) Unknown tags and web components.

Links and images

Method Arguments
visit_link(ctx, href, text, title) <a> anchor with href, rendered text, and optional title.
visit_image(ctx, src, alt, title) <img> with src, alt text, and optional title.

Headings, rules, breaks

Method Arguments
visit_heading(ctx, level, text, id) <h1> through <h6> with level (1-6), text, and optional id.
visit_horizontal_rule(ctx) <hr>.
visit_line_break(ctx) <br>.
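
A visit_heading override typically builds the replacement string and wraps it in Custom. The helper below is a sketch of that string-building step only. `render_heading` is a hypothetical function, the u32 level type is an assumption, and the {#id} anchor syntax is a Pandoc/kramdown-style extension rather than core Markdown.

```rust
// Hypothetical helper for a visit_heading override: an ATX heading with an
// optional {#id} anchor appended when the element carries a non-empty id.
fn render_heading(level: u32, text: &str, id: Option<&str>) -> String {
    // Clamp defensively so out-of-range levels still give valid Markdown.
    let hashes = "#".repeat(level.clamp(1, 6) as usize);
    match id {
        Some(id) if !id.is_empty() => format!("{hashes} {text} {{#{id}}}\n\n"),
        _ => format!("{hashes} {text}\n\n"),
    }
}
```

Inside the visitor, the override would return something like `VisitResult::Custom(render_heading(level, text, id))`.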

Code

Method Arguments
visit_code_block(ctx, lang, code) <pre><code> with language tag and raw code.
visit_code_inline(ctx, code) Inline <code>.
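
When visit_code_block returns custom output, the fence has to be longer than any backtick run inside the code, or the block ends early. The sketch below shows one way to pick the fence. Both functions are hypothetical helpers, and treating lang as Option<&str> is an assumption about the callback's signature.

```rust
// Length of the longest run of consecutive backticks in `code`.
fn longest_backtick_run(code: &str) -> usize {
    let mut longest = 0usize;
    let mut current = 0usize;
    for ch in code.chars() {
        if ch == '`' {
            current += 1;
            longest = longest.max(current);
        } else {
            current = 0;
        }
    }
    longest
}

// Renders a fenced block whose fence is at least three backticks and
// always one longer than the longest backtick run in the body.
fn render_code_block(lang: Option<&str>, code: &str) -> String {
    let fence = "`".repeat(longest_backtick_run(code).max(2) + 1);
    let lang = lang.unwrap_or("");
    let body = code.trim_end_matches('\n');
    format!("{fence}{lang}\n{body}\n{fence}\n")
}
```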

Lists

Method Arguments
visit_list_start(ctx, ordered) Before <ul> or <ol>.
visit_list_item(ctx, ordered, marker, text) Each <li> with marker and rendered text.
visit_list_end(ctx, ordered, output) After the list, with the rendered block.

Definition lists

Method Arguments
visit_definition_list_start(ctx) Before <dl>.
visit_definition_term(ctx, text) <dt>.
visit_definition_description(ctx, text) <dd>.
visit_definition_list_end(ctx, output) After <dl>.

Tables

Method Arguments
visit_table_start(ctx) Before <table>.
visit_table_row(ctx, cells, is_header) Each <tr>. Cells are pre-rendered Markdown. is_header is true for rows inside <thead>.
visit_table_end(ctx, output) After <table>.
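
Because the cells arrive pre-rendered, a visit_table_row override mostly has to join them. The helper below is a sketch under the assumption that cells are exposed as a slice of String. `render_table_row` is a hypothetical name, and the output follows the GitHub-style pipe table layout.

```rust
// Hypothetical helper for a visit_table_row override: joins pre-rendered
// cells into a pipe-table row and adds the delimiter row after a header.
fn render_table_row(cells: &[String], is_header: bool) -> String {
    // A literal pipe inside a cell would split the column, so escape it.
    let escaped: Vec<String> = cells.iter().map(|c| c.replace('|', "\\|")).collect();
    let mut row = format!("| {} |\n", escaped.join(" | "));
    if is_header {
        let rule = vec!["---"; cells.len()].join(" | ");
        row.push_str(&format!("| {rule} |\n"));
    }
    row
}
```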

Blockquote

Method Arguments
visit_blockquote(ctx, content, depth) <blockquote> with rendered content and nesting depth.

Inline formatting

Method Covers
visit_strong(ctx, text) <strong>, <b>.
visit_emphasis(ctx, text) <em>, <i>.
visit_strikethrough(ctx, text) <s>, <del>, <strike>.
visit_underline(ctx, text) <u>, <ins>.
visit_subscript(ctx, text) <sub>.
visit_superscript(ctx, text) <sup>.
visit_mark(ctx, text) <mark>.

Forms

Method Arguments
visit_form(ctx, action, method) <form> with optional action URL and method.
visit_input(ctx, input_type, name, value) <input>.
visit_button(ctx, text) <button>.

Media

Method Arguments
visit_audio(ctx, src) <audio>.
visit_video(ctx, src) <video>.
visit_iframe(ctx, src) <iframe>.

Interactive

Method Arguments
visit_details(ctx, open) <details> with the open attribute.
visit_summary(ctx, text) <summary>.

Figures

Method Arguments
visit_figure_start(ctx) Before <figure>.
visit_figcaption(ctx, text) <figcaption>.
visit_figure_end(ctx, output) After <figure>.

Basic Visitor

Rust

use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult, VisitorHandle};
use html_to_markdown_rs::{ConversionOptions, convert};
use std::cell::RefCell;
use std::rc::Rc;

#[derive(Debug)]
struct LinkRewriter;

impl HtmlVisitor for LinkRewriter {
    fn visit_link(
        &mut self,
        _ctx: &NodeContext,
        href: &str,
        text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        VisitResult::Custom(format!("[{text}](https://track.example.com?url={href})"))
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"<a href="https://example.com">Click here</a>"#;
    let visitor: VisitorHandle = Rc::new(RefCell::new(LinkRewriter));
    let options = ConversionOptions::builder().visitor(Some(visitor)).build();
    let result = convert(html, Some(options))?;
    println!("{}", result.content.unwrap_or_default());
    Ok(())
}

Python

from html_to_markdown import ConversionOptions, convert

class CustomVisitor:
    def visit_link(self, ctx, href, text, title):
        return {"type": "continue"}

    def visit_image(self, ctx, src, alt, title):
        return {"type": "continue"}

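# Sample input; the original snippet used `html` without defining it.
html = '<a href="https://example.com">Click here</a><img src="pic.png" alt="pic">'

options = ConversionOptions(visitor=CustomVisitor())
result = convert(html, options)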
markdown = result.content

TypeScript

import { convert, ConversionOptions } from "@kreuzberg/html-to-markdown";
import { Visitor, NodeContext, VisitResult } from "@kreuzberg/html-to-markdown";

const visitor: Visitor = {
  visitLink(ctx: NodeContext, href: string, text: string): VisitResult {
    // Custom handling for links
    return {
      type: "custom",
      output: `[${text}](${href})`,
    };
  },
  visitHeading(ctx: NodeContext, level: number, text: string): VisitResult {
    // Custom handling for headings
    return {
      type: "continue",
    };
  },
};

const options: ConversionOptions = { visitor };
const result = convert('<h1>Title</h1><a href="url">Link</a>', options);
const markdown = result.content;

Go

// The visitor pattern is not yet supported in the Go binding.
// Use Convert() with ConversionOptions instead.

Ruby

require 'html_to_markdown'

# A visitor is any Ruby object that responds to `visit_*` methods. The
# bridge calls `respond_to?(name, false)` and dispatches via `funcall`,
# so plain methods on a class (or any object) work.
class MyVisitor
  # `ctx` is a Hash: { node_type:, tag_name:, depth:, index_in_parent:, ... }
  def visit_link(ctx, href, text, title = nil)
    # Return a custom output by wrapping it in `{ custom: ... }`. Any
    # other Hash without :custom is treated as `:continue`.
    { custom: "[#{text}](#{href})" }
  end

  def visit_image(ctx, src, alt, title = nil)
    # Return :skip (or the string "skip") to drop the element.
    # Other accepted directives: :continue, :preserve_html.
    :skip
  end

  # `visit_text` is invoked ~100+ times per document — keep it cheap.
  def visit_text(ctx, text)
    :continue
  end
end

html = "<p><a href='https://example.com'>Link</a><img src='x.png'></p>"
# The visitor is passed as the second positional argument. The Ruby
# binding currently does NOT support combining the visitor with a
# `ConversionOptions` Hash in a single call — pick one. To use both,
# build the options on the Rust side via the FFI directly.
result = HtmlToMarkdown.convert(html, MyVisitor.new)
puts result.content

PHP

use HtmlToMarkdown\HtmlToMarkdown;
use HtmlToMarkdown\ConversionOptions;

// Visitors are duck-typed: define any subset of visit_* methods.
// Each method returns either 'skip', ['custom' => '...'], or null/'continue'.
$visitor = new class {
    public function visit_link($ctx, $href, $text, $title) {
        return ['custom' => "[{$text}]({$href})"];
    }

    public function visit_image($ctx, $src, $alt, $title) {
        return 'skip';
    }
};

$options = ConversionOptions::builder()->visitor($visitor)->build();

$result = HtmlToMarkdown::convert(
    '<a href="/page">Link</a><img src="pic.png" alt="pic">',
    $options
);
echo $result->content;

Java

// The visitor pattern is not yet supported in the Java binding.
// Use convert() with ConversionOptions instead.

C#

// The visitor pattern is not yet supported in the C# binding.
// Use Convert() with ConversionOptions instead.

Visitor Pattern - Elixir

Customize HTML to Markdown conversion by passing a visitor map under the :visitor key of HtmlToMarkdown.convert/2's options. Each entry maps a callback atom (e.g. :handle_link) to a one-arity function that receives the JSON-decoded arguments map.

The bridge spawns a system thread for the conversion, then sends {:visitor_callback, ref_id, callback_name, args_json} messages back to the calling process. HtmlToMarkdown.convert/2 runs a receive loop that dispatches each message against your visitor map and calls HtmlToMarkdown.Native.visitor_reply/2 to unblock the worker.

Basic Visitor Example

visitor = %{
  :handle_link => fn args ->
    text = Map.get(args, "text", "")
    {:custom, text}
  end,
  :handle_image => fn _args -> :skip end,
  :handle_text => fn _args -> :continue end
}

html = "<p>Visit <a href='https://example.com'>our site</a> for more!</p>"
{:ok, result} = HtmlToMarkdown.convert(html, %{visitor: visitor})
IO.puts(result.content)
# => Visit our site for more!

Visitor Return Values

Each function must return one of:

  • :continue — proceed with default conversion
  • :skip — omit this element entirely
  • :preserve_html — include the raw HTML verbatim
  • {:custom, markdown_string} — replace this element's output with the given string
  • A bare string — treated as a custom replacement

Anything else falls back to :continue.

Callback Names

Callbacks are keyed by atom and use the handle_ prefix (the bridge translates the Rust visit_X trait methods to :handle_X over the wire). Frequently overridden:

  • :handle_text - text nodes: %{"ctx" => …, "text" => "…"} (called ~100+ times per document; keep it cheap)
  • :handle_link - <a> elements: %{"ctx" => …, "href" => "…", "text" => "…", "title" => …}
  • :handle_image - <img> elements: %{"ctx" => …, "src" => "…", "alt" => "…", "title" => …}
  • :handle_heading - headings: %{"ctx" => …, "level" => 1, "text" => "…", "id" => …}
  • :handle_code_block - <pre><code>: %{"ctx" => …, "lang" => …, "code" => "…"}
  • :handle_element_start / :handle_element_end - generic enter/leave hooks

Omit a callback to fall through to the default Rust implementation.

Node Context

The "ctx" value in every callback arg map is a JSON-decoded map:

%{
  "node_type" => "Link",
  "tag_name" => "a",
  "depth" => 2,
  "index_in_parent" => 0,
  "parent_tag" => "p"
}

Combining Options and Visitor

Pass the visitor under the :visitor key alongside any other ConversionOptions fields:

{:ok, result} = HtmlToMarkdown.convert(html, %{
  visitor: %{:handle_link => fn _ -> :skip end},
  output_format: "github",
  extract_metadata: true
})

HtmlToMarkdown.convert/2 pops the :visitor key, JSON-encodes the remaining options, and dispatches to convert_with_visitor.

R

library(htmltomarkdown)

html <- "<p>Visit <a href='https://example.com'>our site</a> for more!</p>"

opts <- conversion_options(extract_metadata = FALSE)
result <- convert(html, opts)
cat(result$content)

C

#include "html_to_markdown.h"
#include <stdio.h>

/* Each callback returns an int32 status code:
 *   HTM_VISIT_CONTINUE      — use default conversion
 *   HTM_VISIT_SKIP          — drop the element
 *   HTM_VISIT_PRESERVE_HTML — emit the raw HTML
 *   HTM_VISIT_CUSTOM        — replace with the string written to *out
 *   HTM_VISIT_ERROR         — abort conversion with the error in *out
 */
static int32_t visit_heading(const struct HTMNodeContext *ctx,
                             uint32_t level,
                             const char *text,
                             const char *id,
                             char **out,
                             void *user_data) {
    /* All arguments are unused here: this callback keeps the default rendering. */
    (void)ctx; (void)level; (void)text; (void)id; (void)out; (void)user_data;
    return HTM_VISIT_CONTINUE;
}

int main(void) {
    HTMHtmVisitorCallbacks callbacks = {0};
    callbacks.visit_heading = visit_heading;

    HTMHtmVisitor *visitor = htm_visitor_create(&callbacks);
    HTMConversionOptions *options = htm_conversion_options_default();
    htm_options_set_visitor(options, (struct HTMHtmHtmlVisitorBridge *)visitor);

    HTMConversionResult *result = htm_convert("<h1>Title</h1><p>Content</p>", options);

    htm_conversion_options_free(options);
    htm_visitor_free(visitor);

    if (result == NULL) {
        fprintf(stderr, "convert failed: %s\n", htm_last_error_context());
        return 1;
    }

    char *content = htm_conversion_result_content(result);
    if (content != NULL) {
        printf("%s\n", content);
        htm_free_string(content);
    }
    htm_conversion_result_free(result);
    return 0;
}

WebAssembly

import init, { convert } from "@kreuzberg/html-to-markdown-wasm";

await init();

const visitor = {
  visit_link(ctx, href, text, title) {
    return { type: "continue" };
  },
  visit_image(ctx, src, alt, title) {
    return { type: "continue" };
  },
};

const result = convert('<h1>Hello</h1><a href="https://example.com">link</a>', undefined, visitor);
console.log(result.content);

Swift

import HtmlToMarkdown

final class CustomVisitor: HtmlVisitorProtocol {
    func visitLink(_ ctx: NodeContext, _ href: String, _ text: String, _ title: String?) -> VisitResult {
        // Replace links with a bracketed custom format
        return .custom(field0: "[\(text)](\(href))")
    }

    func visitHeading(_ ctx: NodeContext, _ level: UInt32, _ text: String, _ id: String?) -> VisitResult {
        // Keep default rendering for headings
        return .continue_
    }
}

let visitorHandle = makeHtmlVisitorHandle(CustomVisitor())
let options = try conversionOptionsFromJsonWithVisitor("{}", visitorHandle)

let html = "<h1>Title</h1><p>See <a href=\"https://example.com\">example</a>.</p>"
let result = try convert(html, options)
let markdown = result.content()?.toString() ?? ""
print(markdown)

Dart

import 'package:h2m/h2m.dart';
import 'package:h2m/src/html_to_markdown_rs_bridge_generated/frb_generated.dart'
    show RustLib;

Future<void> main() async {
  await RustLib.init();

  // flutter_rust_bridge requires every visit callback — default to continue_()
  // and override only the hooks you care about (here: links and headings).
  final visitor = await createHtmlVisitor(
    visitText: (ctx, text) async => VisitResult.continue_(),
    visitElementStart: (ctx) async => VisitResult.continue_(),
    visitElementEnd: (ctx, output) async => VisitResult.continue_(),
    visitLink: (ctx, href, text, title) async =>
        VisitResult.custom(field0: '[$text]($href)'),
    visitImage: (ctx, src, alt, title) async => VisitResult.continue_(),
    visitHeading: (ctx, level, text, id) async =>
        VisitResult.custom(field0: '${'#' * level.toInt()} $text\n'),
    visitCodeBlock: (ctx, lang, code) async => VisitResult.continue_(),
    visitCodeInline: (ctx, code) async => VisitResult.continue_(),
    visitListItem: (ctx, ordered, marker, text) async => VisitResult.continue_(),
    visitListStart: (ctx, ordered) async => VisitResult.continue_(),
    visitListEnd: (ctx, ordered, output) async => VisitResult.continue_(),
    visitTableStart: (ctx) async => VisitResult.continue_(),
    visitTableRow: (ctx, cells, isHeader) async => VisitResult.continue_(),
    visitTableEnd: (ctx, output) async => VisitResult.continue_(),
    visitBlockquote: (ctx, content, depth) async => VisitResult.continue_(),
    visitStrong: (ctx, text) async => VisitResult.continue_(),
    visitEmphasis: (ctx, text) async => VisitResult.continue_(),
    visitStrikethrough: (ctx, text) async => VisitResult.continue_(),
    visitUnderline: (ctx, text) async => VisitResult.continue_(),
    visitSubscript: (ctx, text) async => VisitResult.continue_(),
    visitSuperscript: (ctx, text) async => VisitResult.continue_(),
    visitMark: (ctx, text) async => VisitResult.continue_(),
    visitLineBreak: (ctx) async => VisitResult.continue_(),
    visitHorizontalRule: (ctx) async => VisitResult.continue_(),
    visitCustomElement: (ctx, tagName, html) async => VisitResult.continue_(),
    visitDefinitionListStart: (ctx) async => VisitResult.continue_(),
    visitDefinitionTerm: (ctx, text) async => VisitResult.continue_(),
    visitDefinitionDescription: (ctx, text) async => VisitResult.continue_(),
    visitDefinitionListEnd: (ctx, output) async => VisitResult.continue_(),
    visitForm: (ctx, action, method) async => VisitResult.continue_(),
    visitInput: (ctx, inputType, name, value) async => VisitResult.continue_(),
    visitButton: (ctx, text) async => VisitResult.continue_(),
    visitAudio: (ctx, src) async => VisitResult.continue_(),
    visitVideo: (ctx, src) async => VisitResult.continue_(),
    visitIframe: (ctx, src) async => VisitResult.continue_(),
    visitDetails: (ctx, open) async => VisitResult.continue_(),
    visitSummary: (ctx, text) async => VisitResult.continue_(),
    visitFigureStart: (ctx) async => VisitResult.continue_(),
    visitFigcaption: (ctx, text) async => VisitResult.continue_(),
    visitFigureEnd: (ctx, output) async => VisitResult.continue_(),
  );

  final options = await createConversionOptionsFromJsonWithVisitor(
    json: '{}',
    visitor: visitor,
  );
  final result = await H2mBridge.convert(
    '<h1>Title</h1><a href="https://example.com">Link</a>',
    options: options,
  );
  print(result.content);
}

Kotlin (Android)

// The visitor pattern is not yet supported in the Kotlin Android binding.
// Use HtmlToMarkdownRs.convert() with ConversionOptions instead — for example,
// configure exclude_selectors, strip_tags, or preserve_tags to control output.

Zig

const std = @import("std");
const html_to_markdown = @import("html_to_markdown");
const c = html_to_markdown.c;

// Visitor callbacks return an int32 status code:
//   0 (HTM_VISIT_CONTINUE)      use default conversion
//   1 (HTM_VISIT_SKIP)          drop the element
//   2 (HTM_VISIT_PRESERVE_HTML) emit the raw HTML
//   3 (HTM_VISIT_CUSTOM)        replace with the string written via out_custom/out_len
//   4 (HTM_VISIT_ERROR)         abort conversion with the error in out_custom
fn visit_heading(
    _ctx: [*c]const c.HTMHtmNodeContext,
    _user_data: ?*anyopaque,
    _level: u32,
    _text: [*c]const u8,
    _id: [*c]const u8,
    out_custom: [*c][*c]u8,
    out_len: [*c]usize,
) callconv(.c) i32 {
    _ = _ctx;
    _ = _user_data;
    _ = _id;
    const text = std.mem.span(_text);
    const buf = std.fmt.allocPrintSentinel(
        std.heap.c_allocator,
        "<<H{d}: {s}>>",
        .{ _level, text },
        0,
    ) catch return 0;
    if (out_custom != null) out_custom.* = buf.ptr;
    if (out_len != null) out_len.* = buf.len;
    return 3; // HTM_VISIT_CUSTOM
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var callbacks: c.HTMHtmVisitorCallbacks = std.mem.zeroes(c.HTMHtmVisitorCallbacks);
    callbacks.visit_heading = &visit_heading;

    const visitor = c.htm_visitor_create(&callbacks);
    defer c.htm_visitor_free(visitor);

    const options_z = try std.heap.c_allocator.dupeZ(u8, "{}");
    defer std.heap.c_allocator.free(options_z);
    const options = c.htm_conversion_options_from_json(options_z.ptr);
    defer c.htm_conversion_options_free(options);
    c.htm_options_set_visitor_handle(options, visitor);

    const html_z = try std.heap.c_allocator.dupeZ(u8, "<h1>Title</h1><p>Body.</p>");
    defer std.heap.c_allocator.free(html_z);

    const result = c.htm_convert(html_z.ptr, options) orelse return error.ConvertFailed;
    defer c.htm_conversion_result_free(result);

    const json_ptr = c.htm_conversion_result_to_json(result);
    defer c.htm_free_string(json_ptr);
    const json = std.mem.sliceTo(json_ptr, 0);

    var parsed = try std.json.parseFromSlice(std.json.Value, allocator, json, .{});
    defer parsed.deinit();
    std.debug.print("{s}\n", .{parsed.value.object.get("content").?.string});
}

Common Patterns

Link rewriting

Override visit_link and return VisitResult::Custom(...) with the new URL baked in. Useful for rewriting relative links to absolute, stripping tracking parameters, or converting internal links to anchor references.
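
The URL handling itself is plain string work that can live outside the visitor. The sketch below is deliberately naive: `absolutize`, `strip_tracking`, and `rewrite_link` are hypothetical helpers, relative resolution ignores `../` segments, and only utm_ query parameters are treated as tracking. A production version would more likely use a URL parsing crate.

```rust
// Naive resolution: leaves absolute URLs, fragments, and mailto: links
// alone and joins everything else onto the base. No "../" handling.
fn absolutize(base: &str, href: &str) -> String {
    let is_absolute = href.starts_with("http://")
        || href.starts_with("https://")
        || href.starts_with("mailto:")
        || href.starts_with('#');
    if is_absolute {
        href.to_string()
    } else {
        format!("{}/{}", base.trim_end_matches('/'), href.trim_start_matches('/'))
    }
}

// Drops utm_* query parameters, keeping other parameters and the fragment.
fn strip_tracking(url: &str) -> String {
    let (without_fragment, fragment) = match url.split_once('#') {
        Some((u, f)) => (u, Some(f)),
        None => (url, None),
    };
    let (path, query) = match without_fragment.split_once('?') {
        Some(parts) => parts,
        None => return url.to_string(),
    };
    let kept: Vec<&str> = query.split('&').filter(|p| !p.starts_with("utm_")).collect();
    let mut out = path.to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    if let Some(f) = fragment {
        out.push('#');
        out.push_str(f);
    }
    out
}

// Builds the Markdown a visit_link override could wrap in Custom.
fn rewrite_link(base: &str, href: &str, text: &str) -> String {
    format!("[{}]({})", text, strip_tracking(&absolutize(base, href)))
}
```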

Element filtering

Override visit_element_start and return VisitResult::Skip when ctx.tag_name matches an unwanted tag. The element and all of its descendants are dropped. A class filter works too: check ctx.attributes.get("class") and skip on match.
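
The filtering decision can be factored into a small function that takes the tag name and the class attribute. This is a sketch with a trimmed stand-in enum. `filter_element` and both deny lists are hypothetical choices for illustration.

```rust
// Trimmed stand-in for VisitResult; only the variants used below.
#[derive(Debug, PartialEq)]
enum VisitResult {
    Continue,
    Skip,
}

// Example deny lists; pick whatever fits the pages being converted.
const UNWANTED_TAGS: [&str; 3] = ["nav", "footer", "aside"];
const UNWANTED_CLASSES: [&str; 2] = ["ad", "cookie-banner"];

// Skips an element when its tag or any of its classes is on a deny list.
fn filter_element(tag_name: &str, class_attr: Option<&str>) -> VisitResult {
    if UNWANTED_TAGS.iter().any(|t| *t == tag_name) {
        return VisitResult::Skip;
    }
    // The class attribute is a whitespace-separated list of tokens.
    let class_hit = class_attr.map_or(false, |classes| {
        classes
            .split_whitespace()
            .any(|c| UNWANTED_CLASSES.iter().any(|u| *u == c))
    });
    if class_hit {
        VisitResult::Skip
    } else {
        VisitResult::Continue
    }
}
```

In a real visitor the call would look roughly like `filter_element(&ctx.tag_name, ctx.attributes.get("class").map(String::as_str))`.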

Content extraction

Override visit_text and push each text fragment into an external buffer. The visitor becomes a simple text-extraction pass that bypasses Markdown rendering. Combine with Skip on unwanted elements to exclude code blocks, navigation, or footers.
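
A shared buffer lets the caller read the collected text after conversion. The sketch below mirrors the two relevant callbacks on a standalone struct. `TextCollector` is a hypothetical name, the methods omit the ctx argument the real trait passes, and the Rc<RefCell<...>> buffer is one possible way to share state, chosen to match the single-threaded handle used in the Rust example.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Trimmed stand-in for VisitResult; only the variants used below.
#[derive(Debug, PartialEq)]
enum VisitResult {
    Continue,
    Skip,
}

// Collects text into a buffer the caller keeps a second handle to.
struct TextCollector {
    buffer: Rc<RefCell<String>>,
}

impl TextCollector {
    // Mirrors visit_element_start: drop subtrees we never want text from.
    fn visit_element_start(&mut self, tag_name: &str) -> VisitResult {
        match tag_name {
            "pre" | "nav" | "footer" => VisitResult::Skip,
            _ => VisitResult::Continue,
        }
    }

    // Mirrors visit_text: append non-blank fragments, space-separated.
    fn visit_text(&mut self, text: &str) -> VisitResult {
        let trimmed = text.trim();
        if !trimmed.is_empty() {
            let mut buf = self.buffer.borrow_mut();
            if !buf.is_empty() {
                buf.push(' ');
            }
            buf.push_str(trimmed);
        }
        VisitResult::Continue
    }
}
```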

Performance

visit_text fires on every text node. Keep the handler small. Match the few element kinds you care about in visit_element_start and return Continue for everything else. Allocations inside the handler multiply by the number of text nodes in the input.
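
A cheap way to follow this advice is to test the text with a borrowing check before building any replacement. The sketch below uses a hypothetical `redact_text` helper and a made-up marker string. The point is only that the common path returns Continue without allocating.

```rust
// Trimmed stand-in for VisitResult; only the variants used below.
#[derive(Debug, PartialEq)]
enum VisitResult {
    Continue,
    Custom(String),
}

// Marker to redact; a made-up value for illustration.
const NEEDLE: &str = "SECRET";

// Body of a visit_text override that rewrites only matching nodes.
fn redact_text(text: &str) -> VisitResult {
    // Fast path: a substring scan borrows the input and allocates nothing.
    if !text.contains(NEEDLE) {
        return VisitResult::Continue;
    }
    // Slow path: allocate a new String only for nodes that need it.
    VisitResult::Custom(text.replace(NEEDLE, "[redacted]"))
}
```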

The visitor trait is synchronous. The core walker calls each method in place during the single-pass DOM traversal.

