Scraping Websites
=================

```{warning}
This feature is meant for advanced users. It is powerful, but it is also easier to misconfigure than a normal feed subscription.
```

RSS Guard can work with sources that are not simple, ready-made RSS or Atom URLs. In practice, this gives you three ways to supply feed data:

* `URL` - RSS Guard downloads data from a normal network address.
* `Local file` - RSS Guard reads feed data from a file on your computer.
* `Script` - RSS Guard runs your command and expects valid feed data on standard output.

This is useful when a website does not publish a feed directly, when you need to transform data before RSS Guard reads it, or when you want to build a custom pipeline around external tools.

## Supported Feed Types

After RSS Guard receives the source data, it can try to recognize several feed formats automatically, including:

* `RSS`
* `Atom`
* `RDF`
* `JSON Feed`
* `Sitemap`
* `iCalendar`
* `Gemlog`

The `Fetch metadata` button can use the supplied source and optional post-processing command to detect the feed type, title, description, encoding, and often the icon too.

## Source Types

### `URL`

This is the normal mode. RSS Guard downloads the remote resource and treats it as feed data.

Use this when the source already provides a valid feed, or when a remote file becomes a valid feed only after optional post-processing.

### `Local file`

This mode reads feed data from a file path on your local machine.

It is useful when another application or scheduled task already generates feed data for you.

### `Script`

This mode runs your command and reads its standard output as feed data.

Your script should:

* write valid feed data to standard output
* write errors and diagnostics to standard error
* exit with code `0` on success

RSS Guard runs the command in the RSS Guard [user-data folder](userdata), and the `%data%` [placeholder](userdata.md#data-placeholder) is expanded to that folder automatically.
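
The three requirements for a `Script` source can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not one of the bundled scrapers: `collect_items()` and the `example.com` data are hypothetical placeholders for the real download-and-parse step, and RSS 2.0 is only one of the formats RSS Guard can recognize.

```python
#!/usr/bin/env python3
"""Hypothetical source script for the `Script` source type: prints an RSS 2.0 feed."""
import sys
import xml.etree.ElementTree as ET


def build_rss(title: str, link: str, items: list[dict]) -> str:
    """Return a minimal RSS 2.0 document as a string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for entry in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        ET.SubElement(item, "guid").text = entry["link"]
    # ElementTree escapes special characters such as "&" in text nodes.
    return ET.tostring(rss, encoding="unicode")


def collect_items() -> list[dict]:
    """Placeholder (assumption): a real scraper would download and parse a page here."""
    return [
        {"title": "First post", "link": "https://example.com/posts/1"},
        {"title": "Second post", "link": "https://example.com/posts/2"},
    ]


def main() -> None:
    try:
        feed = build_rss("Example site", "https://example.com/", collect_items())
    except Exception as exc:
        # Diagnostics go to standard error, and the exit code is non-zero.
        print(f"scraper failed: {exc}", file=sys.stderr)
        sys.exit(1)
    # Only the feed itself is written to standard output; falling off the
    # end of the script then yields exit code 0.
    sys.stdout.write(feed + "\n")


if __name__ == "__main__":
    main()
```

Assuming the file is saved under a hypothetical path such as `%data%/scripts/example-feed.py`, the source command would look similar to `python "%data%/scripts/example-feed.py"`.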
## Post-Processing Script

You can optionally define a second command as a post-processing step.

In that case, RSS Guard first obtains the source data from the selected source type and then passes that raw data to the post-processing command through standard input. The post-processing command must then return valid feed data on standard output.

This is often the most practical setup when:

* one command downloads data
* another command converts it into a feed format RSS Guard understands
* you want to reuse the same cleanup or conversion step for multiple feeds

## Data Flow

```{mermaid}
flowchart TB
  src{{"Source type"}}
  url["Download data from URL"]
  file["Read data from local file"]
  scr["Run source script"]
  pp{{"Post-processing script set?"}}
  post["Pass source data to post-processing script"]
  fin["RSS Guard reads the resulting feed data"]

  src-->|URL|url
  src-->|Local file|file
  src-->|Script|scr
  url-->pp
  file-->pp
  scr-->pp
  pp-->|Yes|post
  pp-->|No|fin
  post-->fin
```

## Command Syntax Tips

Be careful with quoting, especially on Windows.

If your executable path contains backslashes, escape them properly when needed. Quote individual arguments that contain spaces.

Examples:

```text
C:\\MyFolder\\My.exe "arg1" "arg2" "my \"quoted\" arg3" 'my "quoted" arg4'
bash "%data%/scripts/download-feed.sh"
%data%\jq.exe '{ version: "1.1", title: "Stars", items: map( . | .title=.full_name | .content_text=.description | .date_published=.pushed_at)}'
```

If `Fetch metadata` fails, the most common causes are:

* the command line is quoted incorrectly
* the script does not write valid feed data
* the script writes the real output to standard error instead of standard output
* the script exits with a non-zero exit code

## Advanced Feed Options

Standard feeds also expose several advanced options that are related to scraping or non-standard sources:

* custom `HTTP headers`
* per-feed authentication
* per-feed proxy settings
* optional `HTTP/2` preference
* `Fetch full articles`
* `Fetch comments for articles`
* article date preference: `Published` or `Updated`

These settings are especially useful when a site needs extra headers, behaves differently behind a proxy, or provides only partial article contents in the feed itself.

`Fetch full articles` uses RSS Guard's standalone [article extractor CLI](extractor). Most users only need the checkbox, but the CLI can also be called directly from custom scripts.

If you use the `web` package, cookies from RSS Guard's built-in web browser are shared with normal feed downloads. For some sites, signing in or accepting cookies in the built-in browser can make URL-based feeds from the same site accessible.

## Warnings

```{warning}
Fetching full articles and comments can slow feed updates down significantly. It can also increase database size.
```

```{warning}
The older raw-XML extraction mode is an edge-case compatibility option. It can help with some slow feeds, but it is not something most users should enable by default.
```

```{warning}
Scrapers and post-processing commands can break when a website changes its layout, markup, or API responses. If a previously working setup suddenly fails, check the upstream site first.
```

## Example Uses

Typical real-world setups include:

* a script that downloads a web page and converts it into RSS
* a local file generated by another tool
* a normal URL feed that is cleaned up with a post-processing command
* a custom pipeline that enriches articles before RSS Guard imports them
* a script that calls the [article extractor CLI](extractor) to extract readable article HTML or plain text

## Example Scrapers

There are [examples of website scrapers](https://github.com/martinrotter/rssguard/tree/master/resources/scripts/scrapers).

Many of them are written in Python 3, so their execution line is usually similar to `python "script.py"`.

Always inspect an example before using it so you know what input it expects and what it outputs.

## 3rd-party Tools

Third-party tools made to work with RSS Guard include:

* [CSS2RSS](https://github.com/Owyn/CSS2RSS) - useful for scraping websites with CSS selectors
* [RSSGuardHelper](https://github.com/pipiscrew/RSSGuardHelper) - another helper focused on CSS-selector-based extraction

Please give the authors proper credit for their work.
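
As a closing illustration of the post-processing setup (raw source data arrives on standard input, feed data leaves on standard output), the sketch below mirrors the field mapping of the `jq` command from Command Syntax Tips in Python. The input shape (a JSON array of objects with `full_name`, `description`, and `pushed_at`) is inferred from that command, and the function name, default title, and empty-input handling are assumptions made for this example only.

```python
#!/usr/bin/env python3
"""Hypothetical post-processing script: JSON array on stdin, JSON Feed on stdout."""
import json
import sys

# Copied from the jq example in this document; the JSON Feed specification
# itself uses the URL form "https://jsonfeed.org/version/1.1".
FEED_VERSION = "1.1"


def to_json_feed(raw: str, title: str = "Stars") -> str:
    """Convert a JSON array of repository objects into a JSON Feed document."""
    # Design choice of this sketch (assumption): empty input becomes an
    # empty item list instead of an error.
    repos = json.loads(raw) if raw.strip() else []
    if not isinstance(repos, list):
        raise ValueError("expected a JSON array on standard input")
    items = []
    for repo in repos:
        item = dict(repo)  # keep the original fields, as the jq expression does
        item["title"] = repo.get("full_name")
        item["content_text"] = repo.get("description")
        item["date_published"] = repo.get("pushed_at")
        items.append(item)
    return json.dumps({"version": FEED_VERSION, "title": title, "items": items})


def main() -> None:
    try:
        feed = to_json_feed(sys.stdin.read())
    except (ValueError, TypeError, AttributeError) as exc:
        # Invalid or unexpected input: report on stderr and exit non-zero.
        print(f"post-processing failed: {exc}", file=sys.stderr)
        sys.exit(1)
    sys.stdout.write(feed + "\n")


if __name__ == "__main__":
    main()
```

Saved under a hypothetical path such as `%data%/scripts/stars-to-feed.py`, it would be entered as the post-processing command in the same form as other Python scripts, for example `python "%data%/scripts/stars-to-feed.py"`. If you would rather treat an empty download as a failure, raise an error for empty input instead.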