Scraping Websites
=================
```{warning}
This feature is meant for advanced users. It is powerful, but it is also easier to misconfigure than a normal feed subscription.
```

RSS Guard can work with sources that are not simple, ready-made RSS or Atom URLs. In practice, this gives you three ways to supply feed data:
* `URL` - RSS Guard downloads data from a normal network address.
* `Local file` - RSS Guard reads feed data from a file on your computer.
* `Script` - RSS Guard runs your command and expects valid feed data on standard output.

This is useful when a website does not publish a feed directly, when you need to transform data before RSS Guard reads it, or when you want to build a custom pipeline around external tools.

## Supported Feed Types
After RSS Guard receives the source data, it can try to recognize several feed formats automatically, including:
* `RSS`
* `Atom`
* `RDF`
* `JSON Feed`
* `Sitemap`
* `iCalendar`
* `Gemlog`

The `Fetch metadata` button uses the supplied source and the optional post-processing command to detect the feed type, title, description, encoding, and, in many cases, the icon.

## Source Types
### `URL`
This is the normal mode. RSS Guard downloads the remote resource and treats it as feed data.

Use this when the source already provides a valid feed, or when the remote data becomes a valid feed only after post-processing.

### `Local file`
This mode reads feed data from a file path on your local machine.

It is useful when another application or scheduled task already generates feed data for you.
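
As a minimal sketch of such a generator, the helper below writes the feed to a temporary file and then renames it over the target. This way a reader of the file is unlikely to see a half-written feed. The function name `write_feed_atomically` and the write-then-rename approach are illustrative assumptions, not something RSS Guard requires.

```python
import os
import tempfile
from pathlib import Path


def write_feed_atomically(feed_text: str, target: Path) -> None:
    """Write feed_text to target by replacing the file in a single step."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temporary file lives in the same folder so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(feed_text)
        os.replace(tmp_name, target)
    except BaseException:
        # Do not leave a half-written temporary file behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A scheduled task could call `write_feed_atomically(feed_text, Path("feed.xml"))` after each run, with the same path entered as the `Local file` source. The path here is a placeholder.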

### `Script`
This mode runs your command and reads its standard output as feed data.

Your script should:
* write valid feed data to standard output
* write errors and diagnostics to standard error
* exit with code `0` on success

RSS Guard runs the command with its [user-data folder](userdata) as the working directory, and the `%data%` [placeholder](userdata.md#data-placeholder) is automatically expanded to that folder.
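
The requirements above can be sketched as a small Python script. This is only an illustration: the `ITEMS` list, the `build_rss` helper, and the `example.com` addresses are hypothetical, and a real scraper would download and parse a page instead of using hard-coded items.

```python
#!/usr/bin/env python3
"""Source script sketch: prints an RSS 2.0 document to standard output."""
import sys
import xml.etree.ElementTree as ET

# Hypothetical items; a real scraper would build this list from a web page.
ITEMS = [
    {"title": "First post", "link": "https://example.com/1", "description": "Hello"},
    {"title": "Second post", "link": "https://example.com/2", "description": "World"},
]


def build_rss(title: str, link: str, items: list[dict[str, str]]) -> bytes:
    """Return a UTF-8 encoded RSS 2.0 document for the given items."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for item in items:
        node = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description"):
            ET.SubElement(node, tag).text = item[tag]
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Only the feed goes to standard output. An uncaught exception prints
    # its traceback to standard error and makes Python exit with a
    # non-zero code; a normal run ends with exit code 0.
    sys.stdout.buffer.write(build_rss("Example feed", "https://example.com/", ITEMS))
```

Writing bytes to `sys.stdout.buffer` keeps the output in UTF-8 regardless of the console encoding, which matters mostly on Windows.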

## Post-Processing Script
You can optionally define a second command as a post-processing step.

In that case, RSS Guard first obtains the source data from the selected source type and then passes that raw data to the post-processing command through standard input. The post-processing command must then return valid feed data on standard output.

This is often the most practical setup when:
* one command downloads data
* another command converts it into a feed format RSS Guard understands
* you want to reuse the same cleanup or conversion step for multiple feeds
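
A post-processing command can be as small as the following Python sketch, which reads a JSON array from standard input and prints a JSON Feed. It mirrors the field mapping of the `jq` example in Command Syntax Tips (`full_name`, `description`, `pushed_at`). The input shape, the `id` field, the UTF-8 encoding of the piped data, and the `to_json_feed` helper are assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Post-processing sketch: JSON array on stdin, JSON Feed on stdout."""
import json
import sys


def to_json_feed(raw: str, title: str = "Stars") -> str:
    """Convert a JSON array of records into a JSON Feed document."""
    # Blank input is treated as "no records" here; raise instead if an
    # empty response should count as a failure in your setup.
    records = json.loads(raw) if raw.strip() else []
    items = [
        {
            # Assumption: each record has an "id"; fall back to its name.
            "id": str(rec.get("id", rec["full_name"])),
            "title": rec["full_name"],
            "content_text": rec.get("description") or "",
            "date_published": rec["pushed_at"],
        }
        for rec in records
    ]
    # "1.1" is the version value used by the jq example in this document.
    feed = {"version": "1.1", "title": title, "items": items}
    return json.dumps(feed, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # RSS Guard passes the raw source data through standard input.
    raw_text = sys.stdin.buffer.read().decode("utf-8")
    sys.stdout.buffer.write(to_json_feed(raw_text).encode("utf-8"))
```

Because the conversion lives in one function, the same script can serve several feeds that deliver data in the same shape.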

## Data Flow
```{mermaid}
flowchart TB
  src{{"Source type"}}
  url["Download data from URL"]
  file["Read data from local file"]
  scr["Run source script"]
  pp{{"Post-processing script set?"}}
  post["Pass source data to post-processing script"]
  fin["RSS Guard reads the resulting feed data"]

  src-->|URL|url
  src-->|Local file|file
  src-->|Script|scr
  url-->pp
  file-->pp
  scr-->pp
  pp-->|Yes|post
  pp-->|No|fin
  post-->fin
```

## Command Syntax Tips
Be careful with quoting, especially on Windows.

If your executable path contains backslashes, escape them properly when needed. Quote individual arguments that contain spaces.

Examples:

```text
C:\\MyFolder\\My.exe "arg1" "arg2" "my \"quoted\" arg3" 'my "quoted" arg4'

bash "%data%/scripts/download-feed.sh"

%data%\jq.exe '{ version: "1.1", title: "Stars", items: map( . | .title=.full_name | .content_text=.description | .date_published=.pushed_at)}'
```

If `Fetch metadata` fails, the most common causes are:
* the command line is quoted incorrectly
* the script does not write valid feed data
* the script writes the real output to standard error instead of standard output
* the script exits with a non-zero exit code
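
Some of these causes can be checked outside RSS Guard. The sketch below runs a command and reports a non-zero exit code or an empty standard output. It does not validate the feed itself, and it takes an already split argument list, so it only approximates how RSS Guard parses and runs a command line. The `diagnose` helper is hypothetical.

```python
import subprocess
import sys


def diagnose(argv: list[str], stdin_data: bytes = b"", timeout: float = 60.0) -> list[str]:
    """Run a command and return the problems found (empty list means none)."""
    if not argv:
        return ["no command given"]
    try:
        result = subprocess.run(
            argv, input=stdin_data, capture_output=True, timeout=timeout
        )
    except OSError as exc:
        return [f"could not start {argv[0]!r}: {exc}"]
    except subprocess.TimeoutExpired:
        return [f"command did not finish within {timeout} seconds"]

    problems = []
    if result.returncode != 0:
        problems.append(f"exit code is {result.returncode}, expected 0")
    if not result.stdout.strip():
        if result.stderr.strip():
            problems.append(
                "standard output is empty but standard error is not; "
                "the feed may be written to the wrong stream"
            )
        else:
            problems.append("standard output is empty")
    return problems


if __name__ == "__main__":
    # Usage sketch: python check.py python "script.py"
    for problem in diagnose(sys.argv[1:]):
        print(problem, file=sys.stderr)
```

To test a post-processing command, pass sample source data as `stdin_data`.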

## Advanced Feed Options
Standard feeds also expose several advanced options that are relevant to scraping and non-standard sources:
* custom `HTTP headers`
* per-feed authentication
* per-feed proxy settings
* optional `HTTP/2` preference
* `Fetch full articles`
* `Fetch comments for articles`
* article date preference: `Published` or `Updated`

These settings are especially useful when a site needs extra headers, behaves differently behind a proxy, or provides only partial article contents in the feed itself.

`Fetch full articles` uses RSS Guard's standalone [article extractor CLI](extractor). Most users only need the checkbox, but the CLI can also be called directly from custom scripts.

If you use the `web` package, cookies from RSS Guard's built-in web browser are shared with normal feed downloads. For some sites, signing in or accepting cookies in the built-in browser can make URL-based feeds from the same site accessible.

## Warnings
```{warning}
Fetching full articles and comments can slow down feed updates significantly and increase the database size.
```

```{warning}
The older raw-XML extraction mode is an edge-case compatibility option. It can help with some slow feeds, but most users should leave it disabled.
```

```{warning}
Scrapers and post-processing commands can break when a website changes its layout, markup, or API responses. If a previously working setup suddenly fails, check the upstream site first.
```

## Example Uses
Typical real-world setups include:
* a script that downloads a web page and converts it into RSS
* a local file generated by another tool
* a normal URL feed that is cleaned up with a post-processing command
* a custom pipeline that enriches articles before RSS Guard imports them
* a script that calls the [article extractor CLI](extractor) to extract readable article HTML or plain text

## Example Scrapers
The RSS Guard repository contains [examples of website scrapers](https://github.com/martinrotter/rssguard/tree/master/resources/scripts/scrapers). Many of them are written in Python 3, so the command that runs them usually looks like `python "script.py"`.

Always inspect an example before using it so you know what input it expects and what it outputs.

## 3rd-party Tools
Third-party tools made to work with RSS Guard include:
* [CSS2RSS](https://github.com/Owyn/CSS2RSS) - useful for scraping websites with CSS selectors
* [RSSGuardHelper](https://github.com/pipiscrew/RSSGuardHelper) - another helper focused on CSS-selector-based extraction

Please give the authors proper credit for their work.