Show HN: Defuddle, an HTML-to-Markdown alternative to Readability


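Defuddle is a tool for extracting the main content from web pages, similar to Readability. It cleans up a page by removing clutter such as sidebars and headers, and outputs clean HTML. Defuddle aims to give HTML-to-Markdown converters friendlier input, and it runs in both browser and Node.js environments. It extracts metadata, standardizes HTML elements (headings, code blocks, footnotes, and math), and offers a debug mode and several bundles. It can be installed and used via npm.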

kepano/defuddle

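de·fud·dle /diˈfʌdl/ transitive verb: to remove unnecessary elements from a web page and make it easily readable.

Note: Defuddle is still a work in progress!

Defuddle extracts the main content from web pages. It cleans up a page by removing clutter such as comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.

Try the Defuddle Playground →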

Features

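Defuddle aims to output clean and consistent HTML documents. It was written for Obsidian Web Clipper, with the goal of creating more useful input for HTML-to-Markdown converters like Turndown.

Defuddle can be used as a replacement for Mozilla Readability, with a few differences:

More forgiving: it removes fewer uncertain elements.
Provides consistent output for footnotes, math, code blocks, and similar elements.
Uses a page's mobile styles to guess at unnecessary elements.
Extracts more metadata from the page, including schema.org data.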

Installation

npm install defuddle

For Node.js usage, you also need to install JSDOM:

npm install jsdom

Usage

Browser

import { Defuddle } from 'defuddle';

// Parse the current document
const defuddle = new Defuddle(document);
const result = defuddle.parse();

// Access the content and metadata
console.log(result.content);
console.log(result.title);
console.log(result.author);

Node.js

import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';

// Parse HTML from a string
const html = '<html><body><article>...</article></body></html>';
const resultFromString = await Defuddle(html);

// Parse HTML from a URL
const dom = await JSDOM.fromURL('https://example.com/article');
const resultFromUrl = await Defuddle(dom);

// With options
const result = await Defuddle(dom, {
 debug: true, // Enable debug mode for verbose logging
 markdown: true, // Convert content to markdown
 url: 'https://example.com/article' // Original URL of the page
});

// Access the content and metadata
console.log(result.content);
console.log(result.title);
console.log(result.author);

Note: for defuddle/node to import properly, the module format in your package.json must be set to { "type": "module" }.

Response

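Defuddle returns an object with the following properties:

Property | Type | Description
---|---|---
author | string | Author of the article
content | string | Cleaned-up string of the extracted content
description | string | Description or summary of the article
domain | string | Domain name of the website
favicon | string | URL of the website's favicon
image | string | URL of the article's main image
metaTags | object | Meta tags
parseTime | number | Time taken to parse the page, in milliseconds
published | string | Publication date of the article
site | string | Name of the website
schemaOrgData | object | Raw schema.org data extracted from the page
title | string | Title of the article
wordCount | number | Total number of words in the extracted content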
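As a rough illustration of how the response object might be consumed, the sketch below turns a few metadata fields into a YAML-style front matter block. The `sampleResult` values and the `toFrontMatter` helper are made up for this example and are not part of Defuddle; only the property names come from the table above.

```javascript
// Hypothetical sample shaped like the response table above; values are invented.
const sampleResult = {
  title: 'Example article',
  author: 'Jane Doe',
  published: '2025-05-01',
  site: 'Example',
  domain: 'example.com',
  description: '',
  wordCount: 1200,
  content: '<p>Hello</p>',
};

// Build a YAML-style front matter block from selected metadata fields,
// skipping fields that are missing or empty. Not part of Defuddle.
function toFrontMatter(result, fields = ['title', 'author', 'published', 'site']) {
  const lines = fields
    .filter((key) => result[key] !== undefined && result[key] !== '')
    .map((key) => `${key}: ${JSON.stringify(result[key])}`);
  return ['---', ...lines, '---'].join('\n');
}

console.log(toFrontMatter(sampleResult));
```

Quoting values with `JSON.stringify` is a simple way to keep titles that contain colons or quotes from breaking the front matter.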

Bundles

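Defuddle is available as three different bundles:

Core bundle (defuddle): The main bundle, for browser usage. No dependencies.
Full bundle (defuddle/full): Includes additional features for math equation parsing.
Node.js bundle (defuddle/node): Optimized for Node.js environments using JSDOM. Includes full capabilities for math and Markdown conversion.

The core bundle is recommended for most use cases. It still handles math content, but does not include fallbacks for converting between MathML and LaTeX formats. The full bundle adds the ability to create reliable <math> elements using the mathml-to-latex and temml libraries.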

Options

Option | Type | Description
---|---|---
debug | boolean | Enable debug logging
url | string | URL of the page being parsed
markdown | boolean | Convert content to Markdown
separateMarkdown | boolean | Keep content as HTML and return contentMarkdown as Markdown
removeExactSelectors | boolean | Whether to remove elements matching exact selectors, such as ads and social buttons. Defaults to true.
removePartialSelectors | boolean | Whether to remove elements matching partial selectors, such as ads and social buttons. Defaults to true.
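To make the difference between `markdown` and `separateMarkdown` concrete, here is a minimal sketch of how the two options could shape the returned fields. `shapeContent` and `fakeToMarkdown` are hypothetical names invented for this illustration, and giving `separateMarkdown` precedence when both are set is an assumption, not documented behavior.

```javascript
// Hypothetical helper, not part of Defuddle: decides which fields carry HTML
// and which carry Markdown, following the option descriptions above.
function shapeContent(html, toMarkdown, options = {}) {
  if (options.separateMarkdown) {
    // Keep HTML in `content` and add the Markdown alongside it.
    return { content: html, contentMarkdown: toMarkdown(html) };
  }
  if (options.markdown) {
    // Replace `content` with its Markdown conversion.
    return { content: toMarkdown(html) };
  }
  // Default: HTML only.
  return { content: html };
}

// Stand-in converter for the demo; it only strips <p> tags.
// A real setup would use an HTML-to-Markdown library such as Turndown.
const fakeToMarkdown = (html) => html.replace(/<\/?p>/g, '');

console.log(shapeContent('<p>Hi</p>', fakeToMarkdown, { separateMarkdown: true }));
```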

Debug mode

You can enable debug mode by passing an options object when creating a new Defuddle instance:

const article = new Defuddle(document, { debug: true }).parse();

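With debug mode enabled, Defuddle:

Logs more verbose console output about the parsing process
Preserves HTML class and id attributes that are normally stripped
Retains all data-* attributes
Skips div flattening to preserve document structure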

HTML standardization

Defuddle attempts to standardize HTML elements so that later steps, such as conversion to Markdown, receive consistent input.

Headings

The first H1 or H2 heading is removed if it matches the title.
H1s are converted to H2s.
Anchor links in H1 to H6 elements are removed and become plain headings.
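The three heading rules can be sketched on a plain HTML string. This is a regex-based illustration only, not Defuddle's implementation, which presumably works on the DOM. The function name `standardizeHeadings` is hypothetical, and unwrapping links while keeping their text is one assumed reading of "removed".

```javascript
// Regex-based sketch of the heading rules above; illustration only.
function standardizeHeadings(html, title) {
  let out = html;

  // 1. Unwrap anchor links inside h1-h6 so headings become plain text.
  out = out.replace(/<(h[1-6])([^>]*)>([\s\S]*?)<\/\1>/gi, (match, tag, attrs, inner) => {
    const plain = inner.replace(/<a\b[^>]*>([\s\S]*?)<\/a>/gi, '$1');
    return `<${tag}${attrs}>${plain}</${tag}>`;
  });

  // 2. Drop the first H1 or H2 when its text matches the page title.
  const first = /<(h[12])[^>]*>([\s\S]*?)<\/\1>/i.exec(out);
  if (first && first[2].trim() === title.trim()) {
    out = out.slice(0, first.index) + out.slice(first.index + first[0].length);
  }

  // 3. Convert any remaining H1 to H2 (opening and closing tags).
  out = out.replace(/<(\/?)h1\b/gi, '<$1h2');
  return out;
}

const sampleHtml = '<h1><a href="#t">My Title</a></h1><p>x</p><h1>Part</h1>';
console.log(standardizeHeadings(sampleHtml, 'My Title'));
```

Removing a heading that duplicates the title avoids a doubled heading when the title is rendered separately, for example as a note name or front matter field.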

Code blocks

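Code blocks are standardized. Line numbers and syntax highlighting are removed if present, but the language is retained and added as a data attribute and class.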

<pre>
  <code data-lang="js" class="language-js">
    // code
  </code>
</pre>
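One way to picture the language-retention step is a helper that reads the language from a class name and emits the normalized shape shown above. The class-name patterns below (`language-`, `lang-`, `highlight-source-`) are assumptions chosen for illustration rather than Defuddle's actual list, and both function names are hypothetical.

```javascript
// Hypothetical helper: pull a language name out of common class-name patterns.
function detectLanguage(className) {
  const match = /(?:^|\s)(?:language|lang|highlight-source)-([\w+#-]+)/i.exec(className);
  return match ? match[1].toLowerCase() : '';
}

// Escape text so it can sit safely inside <code>.
function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Emit the normalized block: language as both data-lang and class,
// with no highlighting markup left inside.
function renderCodeBlock(code, className) {
  const lang = detectLanguage(className);
  const attrs = lang ? ` data-lang="${lang}" class="language-${lang}"` : '';
  return `<pre><code${attrs}>${escapeHtml(code)}</code></pre>`;
}

console.log(renderCodeBlock('if (a < b) {}', 'hljs language-js'));
```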

Footnotes

Inline references and footnotes are converted to a standard format:

Inline reference<sup id="fnref:1"><a href="#fn:1">1</a></sup>.
<div id="footnotes">
  <ol>
    <li class="footnote" id="fn:1">
      <p>
        Footnote content.&nbsp;<a href="#fnref:1" class="footnote-backref">↩</a>
      </p>
    </li>
  </ol>
</div>
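The standard footnote format above is regular enough to generate from a list. The sketch below does so with two hypothetical helpers, `renderReference` and `renderFootnotes`. It assumes the footnote bodies are already safe HTML and numbers them from 1 in array order.

```javascript
// Inline reference marker for footnote number `n`.
function renderReference(n) {
  return `<sup id="fnref:${n}"><a href="#fn:${n}">${n}</a></sup>`;
}

// Footnote list in the standard format; `notes` is an array of
// already-safe HTML strings (an assumption of this sketch).
function renderFootnotes(notes) {
  const items = notes.map((html, i) => {
    const n = i + 1;
    return `<li class="footnote" id="fn:${n}"><p>${html}&nbsp;` +
      `<a href="#fnref:${n}" class="footnote-backref">↩</a></p></li>`;
  });
  return `<div id="footnotes"><ol>${items.join('')}</ol></div>`;
}

console.log(`Inline reference${renderReference(1)}.`);
console.log(renderFootnotes(['Footnote content.']));
```

The paired ids (`fnref:N` on the reference, `fn:N` on the list item) are what let a converter link each reference to its footnote and back.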

Math

Math elements, including MathJax and KaTeX, are converted to standard MathML:

<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline" data-latex="a \neq 0">
  <mi>a</mi>
  <mo>≠</mo>
  <mn>0</mn>
</math>
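As a small sketch of the output shape, the helper below wraps MathML child markup in a `<math>` element and carries the LaTeX source in `data-latex`. `wrapMath` and `escapeAttr` are hypothetical names. Converting MathJax or KaTeX output into the MathML children is the hard part and is not shown here.

```javascript
// Escape a value for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;').replace(/</g, '&lt;');
}

// Wrap MathML children in the standardized <math> element.
// `innerMathML` is assumed to be valid MathML markup already.
function wrapMath(innerMathML, latex, isBlock = false) {
  const display = isBlock ? 'block' : 'inline';
  return `<math xmlns="http://www.w3.org/1998/Math/MathML" display="${display}" ` +
    `data-latex="${escapeAttr(latex)}">${innerMathML}</math>`;
}

console.log(wrapMath('<mi>a</mi><mo>≠</mo><mn>0</mn>', 'a \\neq 0'));
```

Keeping the LaTeX source in an attribute means a later Markdown step can emit the original expression without reversing the MathML.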

Development

Build

To build the package, you need Node.js and npm installed. Then run:

# Install dependencies
npm install
# Clean and build
npm run build