Article Extractor
RSS Guard ships a standalone helper named rssguard-article-extractor. It is used internally by the Fetch full articles feature and by the article-filtering function msg.fetchFullContents(...), but advanced users can also call it directly from scripts or other tools.
The extractor takes a web page, finds the main readable article content, and writes that cleaned-up article to standard output. It can either download the page from a URL or process HTML that you already have.
Where to Find It
The extractor is installed next to the main RSS Guard executable.
Typical names:
Windows:
rssguard-article-extractor.exeLinux/macOS:
rssguard-article-extractor
Basic Usage
rssguard-article-extractor [options] <url>
The URL is required. If you do not pass HTML through standard input, the extractor downloads this URL. If you do pass HTML through standard input, the URL is still used as the article’s base address.
Options
Option |
Description |
|---|---|
|
Output plain text instead of HTML. |
|
Embed remote images as |
Without -t, the extractor writes extracted HTML.
Optional JSON Input
You can pass extra settings as JSON through standard input. If you do not pass JSON, the extractor uses its defaults.
{
"headers": {
"User-Agent": "Custom user agent",
"Accept-Language": "en-US"
},
"html": "<!doctype html><html>...</html>",
"proxy": {
"type": "http",
"address": "127.0.0.1:8080",
"username": "optional-user",
"password": "optional-password"
}
}
All fields are optional.
headers
Use headers when a site needs a specific user agent, language, cookie, authorization header, or another HTTP header.
If User-Agent is not provided, RSS Guard’s extractor uses its built-in default browser-like user agent.
html
Use html when you already have the page contents. When html is non-empty, the extractor does not download the URL. It cleans up the supplied HTML instead.
This is useful when you already have article HTML, for example from an RSS item, a browser cache, or an article filter.
proxy
Use proxy when page downloads or image downloads should go through a proxy.
Supported proxy types:
httpsocks5
address must contain host and port, for example 127.0.0.1:8080.
Examples
Extract HTML From a URL
rssguard-article-extractor "https://example.com/article"
Extract Plain Text
rssguard-article-extractor -t "https://example.com/article"
Extract HTML and Embed Images
rssguard-article-extractor -b "https://example.com/article"
Pass Custom Headers
printf '{"headers":{"User-Agent":"RSS Guard script","Accept-Language":"en-US"}}' \
| rssguard-article-extractor "https://example.com/article"
PowerShell:
'{"headers":{"User-Agent":"RSS Guard script","Accept-Language":"en-US"}}' |
rssguard-article-extractor "https://example.com/article"
Use Existing HTML Instead of Downloading the URL
printf '{"html":"<!doctype html><html><body><article><h1>Hello</h1><p>Body.</p></article></body></html>"}' \
| rssguard-article-extractor "https://example.com/article"
PowerShell:
'{"html":"<!doctype html><html><body><article><h1>Hello</h1><p>Body.</p></article></body></html>"}' |
rssguard-article-extractor "https://example.com/article"
Use an HTTP Proxy
printf '{"proxy":{"type":"http","address":"127.0.0.1:8080"}}' \
| rssguard-article-extractor "https://example.com/article"
Use a SOCKS5 Proxy With Authentication
printf '{"proxy":{"type":"socks5","address":"127.0.0.1:1080","username":"user","password":"pass"}}' \
| rssguard-article-extractor "https://example.com/article"
Errors
If extraction succeeds, the cleaned article is written to standard output.
If extraction fails, an error is written to standard error. Common causes are an invalid URL, a network failure, an invalid proxy, or a page that cannot be parsed.
When -b is used, individual image download failures are not fatal. If an image cannot be downloaded, its original src value is left unchanged.
Calling From Article Filters
Article filters can call the extractor directly with fs.runExecutableGetOutput(...). This can be useful when you want readability cleanup for HTML that is already present in msg.contents.
function filterMessage() {
if (!msg.url || !msg.contents) {
return Msg.Accept;
}
const extractor =
"C:\\Path\\To\\rssguard-article-extractor.exe";
const config = JSON.stringify({
html: msg.contents
});
const extracted = fs.runExecutableGetOutput(
extractor,
[msg.url],
config
);
if (extracted && extracted.trim()) {
msg.contents = extracted.trim();
}
return Msg.Accept;
}
For plain-text output, pass -t before the URL:
const extracted = fs.runExecutableGetOutput(
extractor,
["-t", msg.url],
config
);
For most ordinary filters, prefer msg.fetchFullContents(...) because it uses RSS Guard’s configured extractor path and feed settings automatically. Call the extractor manually when you need direct control over arguments or want to pass a custom html string.