Scraping Websites

Warning

This feature is meant for advanced users. It is powerful, but it is also easier to misconfigure than a normal feed subscription.

RSS Guard can work with sources that are not simple, ready-made RSS or Atom URLs. In practice, this gives you three ways to supply feed data:

  • URL - RSS Guard downloads data from a normal network address.

  • Local file - RSS Guard reads feed data from a file on your computer.

  • Script - RSS Guard runs your command and expects valid feed data on standard output.

This is useful when a website does not publish a feed directly, when you need to transform data before RSS Guard reads it, or when you want to build a custom pipeline around external tools.

Supported Feed Types

After RSS Guard receives the source data, it can try to recognize several feed formats automatically, including:

  • RSS

  • Atom

  • RDF

  • JSON Feed

  • Sitemap

  • iCalendar

  • Gemlog

The Fetch metadata button can use the supplied source and optional post-processing command to detect the feed type, title, description, encoding, and often the icon too.

Source Types

URL

This is the normal mode. RSS Guard downloads the remote resource and treats it as feed data.

Use this when the source already provides a valid feed, or when a remote file becomes a valid feed only after optional post-processing.

Local file

This mode reads feed data from a file path on your local machine.

It is useful when another application or scheduled task already generates feed data for you.
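
As a rough illustration of that setup, the sketch below shows how a scheduled task might write a small JSON Feed file for RSS Guard to read. It writes to a temporary file first and then renames it, so a half-written file is never visible. The function name, feed title, and sample item are placeholders chosen for this example and are not prescribed by RSS Guard.

```python
import json
import os
import sys
import tempfile


def write_feed_atomically(path, title, items):
    """Write a minimal JSON Feed to `path` without exposing a partial file."""
    feed = {
        "version": "https://jsonfeed.org/version/1.1",
        "title": title,
        "items": items,
    }
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(feed, handle, ensure_ascii=False, indent=2)
        os.replace(tmp_path, path)  # replaces the old file in one step
    except BaseException:
        os.unlink(tmp_path)  # do not leave the temporary file behind
        raise


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python write_feed.py /path/to/feed.json
    sample = [{"id": "1", "title": "Hello", "content_text": "First generated item"}]
    write_feed_atomically(sys.argv[1], "Generated feed", sample)
```

The Local file source would then point at the path given on the command line.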

Script

This mode runs your command and reads its standard output as feed data.

Your script should:

  • write valid feed data to standard output

  • write errors and diagnostics to standard error

  • exit with code 0 on success

RSS Guard runs the command with its user-data folder as the working directory, and it expands the %data% placeholder to that folder's path automatically.
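
The three requirements above can be sketched as a small Python source script. This is only a minimal illustration: the feed title, the example.com URLs, and the hard-coded entries are assumptions standing in for whatever your script actually collects.

```python
import sys
import xml.etree.ElementTree as ET


def build_rss(title, link, entries):
    """Return an RSS 2.0 document (as ASCII text) for a list of (title, url) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for entry_title, entry_url in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry_title
        ET.SubElement(item, "link").text = entry_url
        ET.SubElement(item, "guid").text = entry_url
    # ASCII output with character references avoids console-encoding surprises.
    return ET.tostring(rss, encoding="us-ascii").decode("ascii")


def main():
    # Placeholder data; a real script would download or compute these.
    entries = [
        ("First example entry", "https://example.com/1"),
        ("Second example entry", "https://example.com/2"),
    ]
    # Diagnostics go to standard error so they never mix with the feed.
    print(f"generated {len(entries)} entries", file=sys.stderr)
    # The feed itself, and nothing else, goes to standard output.
    sys.stdout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    sys.stdout.write(build_rss("Example feed", "https://example.com/", entries))


if __name__ == "__main__":
    try:
        main()  # falling off the end yields exit code 0
    except Exception as error:
        print(f"feed generation failed: {error}", file=sys.stderr)
        sys.exit(1)  # non-zero exit code signals failure
```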

Post-Processing Script

You can optionally define a second command as a post-processing step.

In that case, RSS Guard first obtains the source data from the selected source type and then passes that raw data to the post-processing command through standard input. The post-processing command must then return valid feed data on standard output.

This is often the most practical setup when:

  • one command downloads data

  • another command converts it into a feed format RSS Guard understands

  • you want to reuse the same cleanup or conversion step for multiple feeds
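
As one possible cleanup step, the sketch below reads an RSS document from standard input, removes every item whose title contains a keyword, and writes the result to standard output. The script name, the keyword argument, and the assumption that the input is RSS with channel and item elements are all illustrative choices, not RSS Guard requirements.

```python
import sys
import xml.etree.ElementTree as ET


def drop_items(rss_data, keyword):
    """Return the RSS document without items whose title contains `keyword`."""
    root = ET.fromstring(rss_data)  # accepts text or bytes
    needle = keyword.lower()
    for channel in root.iter("channel"):
        for item in list(channel.findall("item")):
            title = item.findtext("title", default="")
            if needle in title.lower():
                channel.remove(item)
    # ASCII output with character references; ElementTree may rename
    # namespace prefixes (ns0, ns1, ...) but the document stays equivalent.
    return ET.tostring(root, encoding="us-ascii").decode("ascii")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python drop_items.py sponsored
    # The keyword argument is required; without it the script does nothing.
    try:
        sys.stdout.write(drop_items(sys.stdin.buffer.read(), sys.argv[1]))
    except ET.ParseError as error:
        print(f"input is not well-formed XML: {error}", file=sys.stderr)
        sys.exit(1)
```

Because the keyword is a command-line argument, the same script can be reused as the post-processing command for several feeds.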

Data Flow

flowchart TB
  src{{"Source type"}}
  url["Download data from URL"]
  file["Read data from local file"]
  scr["Run source script"]
  pp{{"Post-processing script set?"}}
  post["Pass source data to post-processing script"]
  fin["RSS Guard reads the resulting feed data"]

  src-->|URL|url
  src-->|Local file|file
  src-->|Script|scr
  url-->pp
  file-->pp
  scr-->pp
  pp-->|Yes|post
  pp-->|No|fin
  post-->fin
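
The same flow can be read as a short program. The sketch below is a model of the diagram only, not RSS Guard's actual implementation; the function names and the "url", "file", and "script" labels are made up for this illustration.

```python
import subprocess
import sys
import urllib.request


def obtain_source(source_type, source):
    """Return raw bytes for one of the three source branches."""
    if source_type == "url":
        with urllib.request.urlopen(source, timeout=30) as response:
            return response.read()
    if source_type == "file":
        with open(source, "rb") as handle:
            return handle.read()
    if source_type == "script":
        # `source` is an argument list such as ["python", "script.py"].
        return subprocess.run(source, capture_output=True, check=True).stdout
    raise ValueError(f"unknown source type: {source_type}")


def resolve_feed_data(source_type, source, post_process=None):
    """Obtain the source data and pipe it through the optional second command."""
    raw = obtain_source(source_type, source)
    if post_process is None:
        return raw  # "No" branch: the source data is the feed data
    # "Yes" branch: the raw data goes to the command's standard input,
    # and its standard output becomes the feed data.
    done = subprocess.run(post_process, input=raw, capture_output=True, check=True)
    return done.stdout
```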
    

Command Syntax Tips

Be careful with quoting, especially on Windows.

If your executable path contains backslashes, escape them properly when needed. Quote individual arguments that contain spaces.

Examples:

C:\\MyFolder\\My.exe "arg1" "arg2" "my \"quoted\" arg3" 'my "quoted" arg4'

bash "%data%/scripts/download-feed.sh"

%data%\jq.exe '{ version: "1.1", title: "Stars", items: map( . | .title=.full_name | .content_text=.description | .date_published=.pushed_at)}'
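
For readers who prefer Python to jq, the filter in the last example can be approximated as follows. The sketch assumes the input is a JSON array of objects carrying full_name, description, and pushed_at, which is what the filter itself implies; the function name and the title argument are illustrative.

```python
import json
import sys


def stars_to_json_feed(repositories, title="Stars"):
    """Mirror the jq filter: keep each object and add the JSON Feed fields."""
    items = []
    for repo in repositories:
        item = dict(repo)  # like jq's `.`, every original field is kept
        item["title"] = repo.get("full_name")
        item["content_text"] = repo.get("description")
        item["date_published"] = repo.get("pushed_at")
        items.append(item)
    # "1.1" is copied from the jq example above.
    return {"version": "1.1", "title": title, "items": items}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage as a post-processing command: python stars_feed.py Stars
    json.dump(stars_to_json_feed(json.load(sys.stdin), sys.argv[1]), sys.stdout)
```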

If Fetch metadata fails, the most common causes are:

  • the command line is quoted incorrectly

  • the script does not write valid feed data

  • the script writes the real output to standard error instead of standard output

  • the script exits with a non-zero exit code
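
Several of these causes can be checked outside RSS Guard before trying Fetch metadata again. The helper below is a rough sketch: it runs a command, then reports a non-zero exit code, empty standard output, and output that landed on standard error. It does not validate the feed itself, and the function and script names are made up for this example.

```python
import subprocess
import sys


def diagnose(command, stdin_data=None):
    """Run `command` (an argument list) and return a list of likely problems."""
    try:
        done = subprocess.run(command, input=stdin_data or b"", capture_output=True)
    except (FileNotFoundError, PermissionError) as error:
        # Often a sign of a wrong path or a quoting mistake.
        return [f"command could not be started: {error}"]
    problems = []
    if done.returncode != 0:
        problems.append(f"non-zero exit code: {done.returncode}")
    if not done.stdout.strip():
        problems.append("nothing was written to standard output")
        if done.stderr.strip():
            problems.append("standard error is not empty; is the feed written there?")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python diagnose.py python my-scraper.py
    for problem in diagnose(sys.argv[1:]):
        print(problem)
```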

Advanced Feed Options

Standard feeds also expose several advanced options that are relevant to scraping and other non-standard sources:

  • custom HTTP headers

  • per-feed authentication

  • per-feed proxy settings

  • optional HTTP/2 preference

  • Fetch full articles

  • Fetch comments for articles

  • article date preference: Published or Updated

These settings are especially useful when a site needs extra headers, behaves differently behind a proxy, or provides only partial article contents in the feed itself.

Fetch full articles uses RSS Guard’s standalone article extractor CLI. Most users only need the checkbox, but the CLI can also be called directly from custom scripts.

If you use the web package, cookies from RSS Guard’s built-in web browser are shared with normal feed downloads. For some sites, signing in or accepting cookies in the built-in browser can make URL-based feeds from the same site accessible.

Warnings

Warning

Fetching full articles and comments can slow feed updates down significantly. It can also increase database size.

Warning

The older raw-XML extraction mode is an edge-case compatibility option. It can help with some slow feeds, but it is not something most users should enable by default.

Warning

Scrapers and post-processing commands can break when a website changes its layout, markup, or API responses. If a previously working setup suddenly fails, check the upstream site first.

Example Uses

Typical real-world setups include:

  • a script that downloads a web page and converts it into RSS

  • a local file generated by another tool

  • a normal URL feed that is cleaned up with a post-processing command

  • a custom pipeline that enriches articles before RSS Guard imports them

  • a script that calls the article extractor CLI to extract readable article HTML or plain text
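
To make the first setup concrete, here is a minimal sketch of a scraper that turns a web page into RSS using only the Python standard library. It assumes a hypothetical page whose headlines are links marked with class="headline"; a real site will need its own selection logic, and the script name and feed title are placeholders.

```python
import sys
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin


class HeadlineParser(HTMLParser):
    """Collect (text, href) for every <a> element with the class "headline"."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "headline" in classes and attributes.get("href"):
            self._href = attributes["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def page_to_rss(html, page_url, feed_title):
    """Convert the headline links found in `html` into an RSS 2.0 document."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = f"Scraped from {page_url}"
    for text, href in parser.links:
        absolute = urljoin(page_url, href)  # resolve relative links
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text or absolute
        ET.SubElement(item, "link").text = absolute
        ET.SubElement(item, "guid").text = absolute
    return ET.tostring(rss, encoding="us-ascii").decode("ascii")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python scrape_headlines.py https://example.com/news/
    with urllib.request.urlopen(sys.argv[1], timeout=30) as response:
        page = response.read().decode("utf-8", errors="replace")
    sys.stdout.write(page_to_rss(page, sys.argv[1], "Scraped headlines"))
```

Keeping the conversion in page_to_rss, separate from the download, makes it easy to test the scraper against a saved copy of the page when the site changes.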

Example Scrapers

Example website scrapers are available. Many of them are written in Python 3, so the command that runs them usually looks like python "script.py".

Always inspect an example before using it so you know what input it expects and what it outputs.

Third-Party Tools

Third-party tools made to work with RSS Guard include:

  • CSS2RSS - useful for scraping websites with CSS selectors

  • RSSGuardHelper - another helper focused on CSS-selector-based extraction

Please give the authors proper credit for their work.