
Metadata Extraction v2.13.0

html-to-markdown can extract structured metadata from HTML documents during conversion, all in a single pass over the DOM tree. This is useful for SEO analysis, content indexing, table-of-contents generation, link validation, and content migration workflows.


What Metadata Is Extracted

The convert_with_metadata() API returns both the converted Markdown and an ExtendedMetadata object containing five categories of structured data.

Document Metadata

Top-level information about the HTML document:

Field         Source                     Example
title         <title> tag                "My Blog Post"
description   <meta name="description">  "A guide to..."
author        <meta name="author">       "Jane Doe"
language      <html lang="...">          "en"
direction     <html dir="...">           "ltr"
charset       <meta charset="...">       "utf-8"
open_graph    <meta property="og:*">     {"title": "...", "image": "..."}
twitter_card  <meta name="twitter:*">    {"card": "summary_large_image"}

Headers

All heading elements (<h1> through <h6>) with their hierarchy:

Field  Description
level  Heading level (1-6)
text   Text content of the heading
id     The id attribute, if present

Links

All hyperlinks (<a> elements) classified by type:

Link Type  Description                          Example
External   Links to other domains               https://example.com/page
Internal   Relative paths within the same site  /about, ../contact
Anchor     Fragment-only links                  #section-id
Email      mailto: links                        mailto:user@example.com
Phone      tel: links                           tel:+1234567890

Each link includes href, text, title, rel attributes, and the classified link_type.
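
To make the classification in the table concrete, here is a minimal Python sketch of rules that would produce those five categories. It is not the library's implementation: `classify_link` and its lowercase return values are hypothetical, and the real classifier may treat edge cases (protocol-relative URLs, same-domain absolute URLs) differently.

```python
from urllib.parse import urlsplit


def classify_link(href: str) -> str:
    """Classify an href as anchor, email, phone, external, or internal.

    Illustrative rules only; the library's own classifier may differ.
    """
    href = href.strip()
    lowered = href.lower()
    if href.startswith("#"):
        return "anchor"      # fragment-only link
    if lowered.startswith("mailto:"):
        return "email"
    if lowered.startswith("tel:"):
        return "phone"
    # Assumption: any href that names a host (https://host, //host) is external.
    if urlsplit(href).netloc:
        return "external"
    return "internal"        # relative paths such as /about or ../contact
```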

Images

All image elements (<img>) with metadata:

Field       Description
src         Image source URL
alt         Alt text
title       Title attribute
image_type  External, DataUri, or Inline
width       Width attribute, if present
height      Height attribute, if present

Structured Data

Machine-readable data embedded in the HTML:

Type       Source
JSON-LD    <script type="application/ld+json"> blocks
Microdata  Elements with itemscope, itemprop, itemtype attributes
RDFa       Elements with typeof, property, about attributes

Each entry includes the data_type, raw content, and the schema_type when identifiable.
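
As an illustration of what "schema_type when identifiable" could mean for JSON-LD, the sketch below reads the top-level `@type` from a raw block. The helper name `jsonld_schema_type` is hypothetical, and the rule (first item of a list, first value of a multi-valued `@type`) is an assumption rather than the library's documented behavior.

```python
import json


def jsonld_schema_type(raw: str):
    """Return the top-level @type of a JSON-LD block, or None if unidentifiable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # malformed JSON: no type can be identified
    if isinstance(data, list):           # a block may hold an array of nodes
        data = data[0] if data else None
    if not isinstance(data, dict):
        return None
    schema_type = data.get("@type")
    if isinstance(schema_type, list):    # @type may list several types
        return schema_type[0] if schema_type else None
    return schema_type
```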


How It Works

Metadata extraction happens during the same DOM traversal pass as Markdown generation. The conversion engine maintains a MetadataCollector that listens for relevant elements:

graph TD
    A[DOM Traversal] --> B{Element Type?}
    B -->|head meta| C[Document Collector]
    B -->|h1-h6| D[Header Collector]
    B -->|a href| E[Link Collector]
    B -->|img| F[Image Collector]
    B -->|script ld+json| G[Structured Data Collector]
    C --> H[ExtendedMetadata]
    D --> H
    E --> H
    F --> H
    G --> H

Zero overhead when disabled

Because metadata is collected during the same traversal that generates the Markdown, it adds near-zero overhead to conversion. When an extraction category is disabled in MetadataConfig, its collector is not invoked at all.
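
The single-pass idea can be sketched with Python's standard-library `html.parser`. This is only an illustration of the pattern, not the conversion engine itself: `SketchConfig` and `SinglePassCollector` are hypothetical names, and only three of the five categories are shown.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class SketchConfig:
    """Hypothetical stand-in for MetadataConfig (subset of its flags)."""
    extract_headers: bool = True
    extract_links: bool = True
    extract_images: bool = True


class SinglePassCollector(HTMLParser):
    """Collects headers, links, and images in one traversal of the markup."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, config: SketchConfig):
        super().__init__()
        self.config = config
        self.headers, self.links, self.images = [], [], []
        self._open_heading = None  # heading currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A disabled category is skipped before any work is done for it.
        if tag in self.HEADINGS and self.config.extract_headers:
            self._open_heading = {"level": int(tag[1]), "text": "", "id": attrs.get("id")}
        elif tag == "a" and self.config.extract_links and "href" in attrs:
            self.links.append({"href": attrs["href"]})
        elif tag == "img" and self.config.extract_images:
            self.images.append({"src": attrs.get("src"), "alt": attrs.get("alt")})

    def handle_data(self, data):
        if self._open_heading is not None:
            self._open_heading["text"] += data

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._open_heading is not None:
            self.headers.append(self._open_heading)
            self._open_heading = None


collector = SinglePassCollector(SketchConfig(extract_images=False))
collector.feed('<h1 id="top">Hello</h1><a href="/about">About</a><img src="a.png" alt="">')
```

With images disabled, `collector.images` stays empty while headers and links are still gathered from the same pass.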

MetadataConfig

Control which categories of metadata to extract:

Field                     Default  Description
extract_document          true     Extract document-level meta tags
extract_headers           true     Extract heading elements
extract_links             true     Extract hyperlinks
extract_images            true     Extract image elements
extract_structured_data   true     Extract JSON-LD, Microdata, RDFa
max_structured_data_size  100000   Maximum bytes for structured data (prevents memory exhaustion from large JSON-LD blocks)

Use Cases

SEO Analysis

Extract document metadata, Open Graph tags, and structured data to audit SEO health:

Title: "Product Page"
Description: "Buy our product..."
OG Image: "https://cdn.example.com/product.jpg"
Structured Data: Product schema with price, availability
Headers: H1 count (should be exactly 1)
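
A check like this could be scripted over the extracted metadata. The sketch below assumes the metadata has been turned into plain dictionaries and lists keyed by the field names documented above; that shape, and the helper name `seo_audit`, are assumptions for illustration, so adapt the attribute access to your binding.

```python
def seo_audit(metadata: dict) -> list:
    """Return a list of human-readable SEO issues found in the metadata."""
    issues = []
    doc = metadata.get("document") or {}
    if not doc.get("title"):
        issues.append("missing title")
    if not doc.get("description"):
        issues.append("missing description")
    if "image" not in (doc.get("open_graph") or {}):
        issues.append("missing og:image")
    # The page should carry exactly one H1.
    h1_count = sum(1 for h in metadata.get("headers", []) if h["level"] == 1)
    if h1_count != 1:
        issues.append(f"expected exactly one H1, found {h1_count}")
    if not metadata.get("structured_data"):
        issues.append("no structured data")
    return issues
```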

Table of Contents Generation

Use extracted headers to build a table of contents:

## Table of Contents
- [Introduction](#introduction)       (h1)
  - [Background](#background)         (h2)
  - [Methodology](#methodology)       (h2)
    - [Data Collection](#data)         (h3)
- [Results](#results)                  (h1)
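
One way to produce a listing like the one above from the extracted headers is sketched here. It assumes each header is available as a dictionary with the documented `level`, `text`, and `id` fields; `build_toc` is a hypothetical helper, not part of the library.

```python
def build_toc(headers) -> str:
    """Render headers as a nested Markdown list, two spaces per heading level."""
    lines = ["## Table of Contents"]
    for header in headers:
        if not header.get("id"):
            continue  # without an id there is no anchor to link to
        indent = "  " * (header["level"] - 1)
        lines.append(f"{indent}- [{header['text']}](#{header['id']})")
    return "\n".join(lines)
```

Headings without an `id` are skipped in this sketch; a real tool might generate slugs for them instead.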

Link Validation

Audit all links in a document:

  • Identify broken external links
  • Find orphaned anchor links (fragment targets that do not exist)
  • Catalog all internal navigation paths
  • Flag mailto: and tel: links for review
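
The orphaned-anchor check can be approximated from the extracted links and headers alone. The sketch below assumes both are lists of dictionaries with the documented `href` and `id` fields, and `orphaned_anchors` is a hypothetical helper. Note the limitation: only heading ids are known from the metadata, so a fragment that targets an id on some other element would be reported as orphaned.

```python
def orphaned_anchors(links, headers):
    """Return fragment-only hrefs whose target is not the id of any heading."""
    known_ids = {h["id"] for h in headers if h.get("id")}
    orphans = []
    for link in links:
        href = link["href"]
        # Skip non-fragment links and the bare "#" placeholder.
        if href.startswith("#") and len(href) > 1 and href[1:] not in known_ids:
            orphans.append(href)
    return orphans
```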

Content Migration

When migrating content between CMS platforms, metadata extraction helps:

  • Map document titles and descriptions to the new system's fields
  • Rewrite internal links to match the new URL structure
  • Inventory all images for asset migration
  • Preserve structured data for search engine continuity
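
As a sketch of the link-rewriting step, the function below builds an old-to-new mapping for extracted links whose paths start with a known prefix. The prefix map, the helper name `rewrite_internal_links`, and the dictionary shape of each link are all assumptions for illustration.

```python
def rewrite_internal_links(links, prefix_map):
    """Map each matching href to its rewritten form under the new URL structure."""
    rewritten = {}
    for link in links:
        href = link["href"]
        for old_prefix, new_prefix in prefix_map.items():
            if href.startswith(old_prefix):
                rewritten[href] = new_prefix + href[len(old_prefix):]
                break  # first matching prefix wins
    return rewritten
```

The resulting mapping can then be applied to the converted Markdown or fed into the target CMS's redirect table.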

Accessibility Auditing

Check image alt text coverage, heading hierarchy, and link text quality:

Images without alt text: 3 of 15
Heading hierarchy violations: H3 after H1 (skipped H2)
Links with "click here" text: 2
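
A report along these lines could be computed as sketched below, again assuming dictionary-shaped metadata with the documented field names; `accessibility_report` and the list of vague phrases are hypothetical. An empty alt attribute is legitimate for purely decorative images, so treat the count as a prompt for review rather than a list of errors.

```python
VAGUE_LINK_TEXT = {"click here", "here", "read more"}  # illustrative phrase list


def accessibility_report(metadata: dict) -> dict:
    """Summarize alt-text coverage, skipped heading levels, and vague link text."""
    images = metadata.get("images", [])
    missing_alt = sum(1 for img in images if not (img.get("alt") or "").strip())

    skips = []
    previous = 0
    for header in metadata.get("headers", []):
        # A jump of more than one level down (e.g. H1 -> H3) skips a level.
        if previous and header["level"] > previous + 1:
            skips.append(f"H{header['level']} after H{previous}")
        previous = header["level"]

    vague = sum(
        1
        for link in metadata.get("links", [])
        if (link.get("text") or "").strip().lower() in VAGUE_LINK_TEXT
    )
    return {
        "images_without_alt": missing_alt,
        "image_count": len(images),
        "heading_skips": skips,
        "vague_links": vague,
    }
```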

API Overview

The metadata API is available in most language bindings. The function signatures follow each language's conventions, but the data structures are consistent.

Rust:

use html_to_markdown_rs::{convert_with_metadata, MetadataConfig};

let html = r#"<html lang="en"><head><title>Test</title></head>
               <body><h1>Hello</h1></body></html>"#;

let config = MetadataConfig::default();
let (markdown, metadata) = convert_with_metadata(html, None, config, None)?;

println!("Title: {:?}", metadata.document.title);
println!("Headers: {}", metadata.headers.len());

Python:

from html_to_markdown import convert_with_metadata, MetadataConfig

metadata_config = MetadataConfig(
    extract_headers=True,
    extract_links=True,
    extract_images=True,
    extract_structured_data=True,
    max_structured_data_size=100000,
)
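# Sample input so the snippet is self-contained
html = '<html lang="en"><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
markdown, metadata = convert_with_metadata(html, metadata_config=metadata_config)

TypeScript:

import { convertWithMetadata } from '@kreuzberg/html-to-markdown';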

const result = convertWithMetadata('<h1>Title</h1><p>Content</p>');
const { markdown, metadata } = result;

console.log(markdown);           // Converted markdown
console.log(metadata.document);  // Document metadata (title, description, etc.)
console.log(metadata.headers);   // Header elements (h1-h6)
console.log(metadata.links);     // Extracted links
console.log(metadata.images);    // Extracted images

Ruby:

require 'html_to_markdown'

html = '<html lang="en"><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)

puts metadata[:document][:title]     # "Test"
puts metadata[:headers].first[:text] # "Hello"

PHP:

use HtmlToMarkdown\Config\ConversionOptions;
use HtmlToMarkdown\Service\Converter;
use function HtmlToMarkdown\convert_with_metadata;

$html = '<html><head><title>Example</title></head><body><h1>Welcome</h1><a href="https://example.com">Link</a></body></html>';

// Object-oriented API
$converter = Converter::create();
$result = $converter->convertWithMetadata(
    $html,
    new ConversionOptions(headingStyle: 'Atx'),
    [
        'extract_headers' => true,
        'extract_links' => true,
        'extract_images' => true,
    ]
);

echo $result['markdown'];
echo $result['metadata']->document->title;
foreach ($result['metadata']->links as $link) {
    echo $link->href . ': ' . $link->text;
}

// Procedural API
$result = convert_with_metadata(
    $html,
    new ConversionOptions(headingStyle: 'Atx'),
    ['extract_headers' => true, 'extract_links' => true]
);

For complete metadata extraction guides with full code examples, see the Metadata Extraction Guide.


Further Reading