Article Extractor

RSS Guard ships a standalone helper named rssguard-article-extractor. It is used internally by the Fetch full articles feature and by the article-filtering function msg.fetchFullContents(...), but advanced users can also call it directly from scripts or other tools.

The extractor takes a web page, finds the main readable article content, and writes that cleaned-up article to standard output. It can either download the page from a URL or process HTML that you already have.

Where to Find It

The extractor is installed next to the main RSS Guard executable.

Typical names:

  • Windows: rssguard-article-extractor.exe

  • Linux/macOS: rssguard-article-extractor

Basic Usage

rssguard-article-extractor [options] <url>

The URL is required. If you do not pass HTML through standard input, the extractor downloads this URL. If you do pass HTML through standard input, the URL is still used as the article’s base address.

Options

Option

Description

-t

Output plain text instead of HTML.

-b

Embed remote images as data: URLs in the HTML output.

Without -t, the extractor writes extracted HTML.

Optional JSON Input

You can pass extra settings as JSON through standard input. If you do not pass JSON, the extractor uses its defaults.

{
  "headers": {
    "User-Agent": "Custom user agent",
    "Accept-Language": "en-US"
  },
  "html": "<!doctype html><html>...</html>",
  "proxy": {
    "type": "http",
    "address": "127.0.0.1:8080",
    "username": "optional-user",
    "password": "optional-password"
  }
}

All fields are optional.

headers

Use headers when a site needs a specific user agent, language, cookie, authorization header, or another HTTP header.

If User-Agent is not provided, RSS Guard’s extractor uses its built-in default browser-like user agent.

html

Use html when you already have the page contents. When html is non-empty, the extractor does not download the URL. It cleans up the supplied HTML instead.

This is useful when you already have article HTML, for example from an RSS item, a browser cache, or an article filter.

proxy

Use proxy when page downloads or image downloads should go through a proxy.

Supported proxy types:

  • http

  • socks5

address must contain host and port, for example 127.0.0.1:8080.

Examples

Extract HTML From a URL

rssguard-article-extractor "https://example.com/article"

Extract Plain Text

rssguard-article-extractor -t "https://example.com/article"

Extract HTML and Embed Images

rssguard-article-extractor -b "https://example.com/article"

Pass Custom Headers

printf '{"headers":{"User-Agent":"RSS Guard script","Accept-Language":"en-US"}}' \
  | rssguard-article-extractor "https://example.com/article"

PowerShell:

'{"headers":{"User-Agent":"RSS Guard script","Accept-Language":"en-US"}}' |
  rssguard-article-extractor "https://example.com/article"

Use Existing HTML Instead of Downloading the URL

printf '{"html":"<!doctype html><html><body><article><h1>Hello</h1><p>Body.</p></article></body></html>"}' \
  | rssguard-article-extractor "https://example.com/article"

PowerShell:

'{"html":"<!doctype html><html><body><article><h1>Hello</h1><p>Body.</p></article></body></html>"}' |
  rssguard-article-extractor "https://example.com/article"

Use an HTTP Proxy

printf '{"proxy":{"type":"http","address":"127.0.0.1:8080"}}' \
  | rssguard-article-extractor "https://example.com/article"

Use a SOCKS5 Proxy With Authentication

printf '{"proxy":{"type":"socks5","address":"127.0.0.1:1080","username":"user","password":"pass"}}' \
  | rssguard-article-extractor "https://example.com/article"

Errors

If extraction succeeds, the cleaned article is written to standard output.

If extraction fails, an error is written to standard error. Common causes are an invalid URL, a network failure, an invalid proxy, or a page that cannot be parsed.

When -b is used, individual image download failures are not fatal. If an image cannot be downloaded, its original src value is left unchanged.

Calling From Article Filters

Article filters can call the extractor directly with fs.runExecutableGetOutput(...). This can be useful when you want readability cleanup for HTML that is already present in msg.contents.

function filterMessage() {
  if (!msg.url || !msg.contents) {
    return Msg.Accept;
  }

  const extractor =
    "C:\\Path\\To\\rssguard-article-extractor.exe";

  const config = JSON.stringify({
    html: msg.contents
  });

  const extracted = fs.runExecutableGetOutput(
    extractor,
    [msg.url],
    config
  );

  if (extracted && extracted.trim()) {
    msg.contents = extracted.trim();
  }

  return Msg.Accept;
}

For plain-text output, pass -t before the URL:

const extracted = fs.runExecutableGetOutput(
  extractor,
  ["-t", msg.url],
  config
);

For most ordinary filters, prefer msg.fetchFullContents(...) because it uses RSS Guard’s configured extractor path and feed settings automatically. Call the extractor manually when you need direct control over arguments or want to pass a custom html string.