Scraping Websites
Warning
This feature is meant for advanced users. It is powerful, but it is also easier to misconfigure than a normal feed subscription.
RSS Guard can work with sources that are not simple, ready-made RSS or Atom URLs. In practice, this gives you three ways to supply feed data:
URL - RSS Guard downloads data from a normal network address.
Local file - RSS Guard reads feed data from a file on your computer.
Script - RSS Guard runs your command and expects valid feed data on standard output.
This is useful when a website does not publish a feed directly, when you need to transform data before RSS Guard reads it, or when you want to build a custom pipeline around external tools.
Supported Feed Types
After RSS Guard receives the source data, it can try to recognize several feed formats automatically, including:
RSS
Atom
RDF
JSON Feed
Sitemap
iCalendar
Gemlog
The Fetch metadata button can use the supplied source and optional post-processing command to detect the feed type, title, description, encoding, and often the icon too.
Source Types
URL
This is the normal mode. RSS Guard downloads the remote resource and treats it as feed data.
Use this when the source already provides a valid feed, or when a remote file becomes a valid feed only after optional post-processing.
Local file
This mode reads feed data from a file path on your local machine.
It is useful when another application or scheduled task already generates feed data for you.
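If a scheduled task regenerates the file while RSS Guard happens to be reading it, a partially written file could be picked up. That is a general file-handling concern rather than something this documentation describes, so treat the following as a minimal sketch of one way to avoid it: write to a temporary file in the same folder, then swap it into place. The function name `write_feed_atomically` is hypothetical.

```python
import os
import tempfile


def write_feed_atomically(path, feed_text):
    """Write feed_text to path via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise os.replace() cannot swap it in as a single rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(feed_text)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

On Windows the final replace can fail while another process holds the target file open, so a generator may want to retry on the next scheduled run.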
Script
This mode runs your command and reads its standard output as feed data.
Your script should:
write valid feed data to standard output
write errors and diagnostics to standard error
exit with code 0 on success
RSS Guard runs the command in the RSS Guard user-data folder, and the %data% placeholder is expanded to that folder automatically.
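The three requirements above can be sketched in Python as follows. This is only an illustration of the contract, not an official template: `fetch_items` is a hypothetical placeholder that returns hard-coded sample items where a real script would scrape or query something, and the feed title and example.com URLs are made up.

```python
#!/usr/bin/env python3
"""Sketch of a script source: feed on stdout, diagnostics on stderr."""
import sys
import xml.etree.ElementTree as ET


def fetch_items():
    """Hypothetical placeholder for real scraping logic (sample data only)."""
    return [
        {"title": "First post", "link": "https://example.com/posts/1"},
        {"title": "Second post", "link": "https://example.com/posts/2"},
    ]


def build_rss(items, title="Example feed", link="https://example.com/"):
    """Serialize items as a minimal RSS 2.0 document and return it as str."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        ET.SubElement(node, "guid").text = item["link"]
    # The default us-ascii serialization turns non-ASCII text into character
    # references, so the result is safe whatever the console encoding is.
    return ET.tostring(rss).decode("ascii")


def main():
    try:
        feed = build_rss(fetch_items())
    except Exception as exc:
        # Diagnostics go to standard error, and the exit code is non-zero.
        print(f"feed generation failed: {exc}", file=sys.stderr)
        sys.exit(1)
    # Only the feed itself goes to standard output.
    sys.stdout.write(feed + "\n")


if __name__ == "__main__":
    main()  # returning normally ends the script with exit code 0
```

Saved under a hypothetical name such as example-feed.py in a scripts folder inside the user-data folder, the execution line would look similar to python "%data%/scripts/example-feed.py".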
Post-Processing Script
You can optionally define a second command as a post-processing step.
In that case, RSS Guard first obtains the source data from the selected source type and then passes that raw data to the post-processing command through standard input. The post-processing command must then return valid feed data on standard output.
This is often the most practical setup when:
one command downloads data
another command converts it into a feed format RSS Guard understands
you want to reuse the same cleanup or conversion step for multiple feeds
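A post-processing command is simply a filter from standard input to standard output. The sketch below assumes the source data is a JSON array of objects with `full_name`, `description` and `pushed_at` fields, and maps it to the same JSON Feed shape as the jq example in Command Syntax Tips. The function name `to_json_feed` and the handling of empty input are assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a post-processing filter: raw JSON on stdin, JSON Feed on stdout."""
import json
import sys


def to_json_feed(raw, title="Stars"):
    """Convert a JSON array of repository objects into a JSON Feed string."""
    # Assumption: empty input becomes an empty item list. Raise an error here
    # instead if an empty download should count as a failure.
    repos = json.loads(raw) if raw.strip() else []
    items = []
    for repo in repos:
        item = dict(repo)  # keep the original fields, as the jq filter does
        item["title"] = repo.get("full_name")
        item["content_text"] = repo.get("description")
        item["date_published"] = repo.get("pushed_at")
        items.append(item)
    # json.dumps escapes non-ASCII characters by default, so the output is
    # plain ASCII and independent of the console encoding.
    return json.dumps({"version": "1.1", "title": title, "items": items})


def main():
    try:
        feed = to_json_feed(sys.stdin.read())
    except (ValueError, TypeError, AttributeError) as exc:
        # Malformed or unexpected input: report on stderr, exit non-zero.
        print(f"post-processing failed: {exc}", file=sys.stderr)
        sys.exit(1)
    sys.stdout.write(feed + "\n")


if __name__ == "__main__":
    main()  # returning normally ends the script with exit code 0
```

Because the filter only depends on its input, the same command can be reused for every feed whose source data has this shape.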
Data Flow
flowchart TB
src{{"Source type"}}
url["Download data from URL"]
file["Read data from local file"]
scr["Run source script"]
pp{{"Post-processing script set?"}}
post["Pass source data to post-processing script"]
fin["RSS Guard reads the resulting feed data"]
src-->|URL|url
src-->|Local file|file
src-->|Script|scr
url-->pp
file-->pp
scr-->pp
pp-->|Yes|post
pp-->|No|fin
post-->fin
Command Syntax Tips
Be careful with quoting, especially on Windows.
If your executable path contains backslashes, escape them properly when needed. Quote individual arguments that contain spaces.
Examples:
C:\\MyFolder\\My.exe "arg1" "arg2" "my \"quoted\" arg3" 'my "quoted" arg4'
bash "%data%/scripts/download-feed.sh"
%data%\jq.exe '{ version: "1.1", title: "Stars", items: map( . | .title=.full_name | .content_text=.description | .date_published=.pushed_at)}'
If Fetch metadata fails, the most common causes are:
the command line is quoted incorrectly
the script does not write valid feed data
the script writes the real output to standard error instead of standard output
the script exits with a non-zero exit code
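Three of these four causes can be checked outside RSS Guard by running the command yourself and looking at the exit code and both output streams. The helper below is a hypothetical debugging aid, not part of RSS Guard. It takes the command as a list of arguments, so it does not reproduce how RSS Guard parses a quoted command line, and it does not validate the feed format itself.

```python
import subprocess
import sys


def diagnose(argv, stdin_data=None, timeout=60):
    """Run argv and return (problems, stdout_bytes, stderr_bytes).

    stdin_data (bytes) can be passed to imitate the raw source data that a
    post-processing command would receive on standard input.
    """
    result = subprocess.run(
        argv, input=stdin_data, capture_output=True, timeout=timeout
    )
    problems = []
    if result.returncode != 0:
        problems.append(f"non-zero exit code: {result.returncode}")
    if not result.stdout.strip():
        if result.stderr.strip():
            # Typical sign of output going to the wrong stream.
            problems.append("stdout is empty but stderr has data")
        else:
            problems.append("no output at all")
    return problems, result.stdout, result.stderr
```

For example, `diagnose([sys.executable, "script.py"])` runs a script with the current Python interpreter. An empty problem list only means the basic contract holds, so the captured standard output still needs to be inspected to confirm it is a valid feed.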
Advanced Feed Options
Standard feeds also expose several advanced options that are related to scraping or non-standard sources:
custom HTTP headers
per-feed authentication
per-feed proxy settings
optional HTTP/2 preference
Fetch full articles
Fetch comments for articles
article date preference: Published or Updated
These settings are especially useful when a site needs extra headers, behaves differently behind a proxy, or provides only partial article contents in the feed itself.
Fetch full articles uses RSS Guard’s standalone article extractor CLI. Most users only need the checkbox, but the CLI can also be called directly from custom scripts.
If you use the web package, cookies from RSS Guard’s built-in web browser are shared with normal feed downloads. For some sites, signing in or accepting cookies in the built-in browser can make URL-based feeds from the same site accessible.
Warnings
Warning
Fetching full articles and comments can slow feed updates down significantly. It can also increase database size.
Warning
The older raw-XML extraction mode is an edge-case compatibility option. It can help with some slow feeds, but it is not something most users should enable by default.
Warning
Scrapers and post-processing commands can break when a website changes its layout, markup, or API responses. If a previously working setup suddenly fails, check the upstream site first.
Example Uses
Typical real-world setups include:
a script that downloads a web page and converts it into RSS
a local file generated by another tool
a normal URL feed that is cleaned up with a post-processing command
a custom pipeline that enriches articles before RSS Guard imports them
a script that calls the article extractor CLI to extract readable article HTML or plain text
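For the first setup in this list, the core step is turning page markup into a list of articles. The following sketch uses only the Python standard library and assumes a hypothetical page where each article is an `<a>` element carrying the CSS class `post-link`. Real sites differ, so the class name and the structure are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (title, url) pairs from <a> elements with a given CSS class."""

    def __init__(self, base_url, css_class):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.items = []
        self._href = None  # URL of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.css_class in classes and attributes.get("href"):
            # Resolve relative links against the page URL.
            self._href = urljoin(self.base_url, attributes["href"])
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            title = " ".join("".join(self._text).split())
            self.items.append((title, self._href))
            self._href = None


def extract_items(html, base_url, css_class="post-link"):
    """Return a list of (title, absolute_url) tuples found in html."""
    parser = LinkCollector(base_url, css_class)
    parser.feed(html)
    parser.close()
    return parser.items
```

The resulting pairs still have to be serialized into one of the supported feed formats and written to standard output before RSS Guard can read them.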
Example Scrapers
There are examples of website scrapers. Many of them are written in Python 3, so their execution line is usually similar to python "script.py".
Always inspect an example before using it so you know what input it expects and what it outputs.
3rd-party Tools
Third-party tools made to work with RSS Guard include:
CSS2RSS - useful for scraping websites with CSS selectors
RSSGuardHelper - another helper focused on CSS-selector-based extraction
Please give the authors proper credit for their work.