588 lines
28 KiB
Markdown
588 lines
28 KiB
Markdown
# Bulk Downloader for Reddit
|
|
|
|
[![PyPI Status](https://img.shields.io/pypi/status/bdfr?logo=PyPI)](https://pypi.python.org/pypi/bdfr)
|
|
[![PyPI version](https://img.shields.io/pypi/v/bdfr.svg?logo=PyPI)](https://pypi.python.org/pypi/bdfr)
|
|
[![PyPI downloads](https://img.shields.io/pypi/dm/bdfr?logo=PyPI)](https://pypi.python.org/pypi/bdfr)
|
|
[![AUR version](https://img.shields.io/aur/version/python-bdfr?logo=Arch%20Linux)](https://aur.archlinux.org/packages/python-bdfr)
|
|
[![Python Test](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml/badge.svg?branch=master)](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml)
|
|
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?logo=Python)](https://github.com/psf/black)
|
|
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
|
|
|
|
This is a tool to download submissions or submission data from Reddit. It can be used to archive data or even crawl
|
|
Reddit to gather research data. The BDFR is flexible and can be used in scripts if needed through an extensive
|
|
command-line interface. [List of currently supported sources](#list-of-currently-supported-sources)
|
|
|
|
If you wish to open an issue, please read [the guide on opening issues](docs/CONTRIBUTING.md#opening-an-issue) to ensure
|
|
that your issue is clear and contains everything it needs to for the developers to investigate.
|
|
|
|
Included in this README are a few example Bash tricks to get certain behaviour. For that, see [Common Command
|
|
Tricks](#common-command-tricks).
|
|
|
|
## Installation
|
|
|
|
*Bulk Downloader for Reddit* needs Python version 3.9 or above. Please update Python before installation to meet the
|
|
requirement.
|
|
|
|
Then, you can install it via pip with:
|
|
|
|
```bash
|
|
python3 -m pip install bdfr --upgrade
|
|
```
|
|
|
|
or via [pipx](https://pypa.github.io/pipx) with:
|
|
|
|
```bash
|
|
python3 -m pipx install bdfr
|
|
```
|
|
|
|
**To update BDFR**, run the above command again for pip or `pipx upgrade bdfr` for pipx installations.
|
|
|
|
**To check your version of BDFR**, run `bdfr --version`
|
|
|
|
**To install shell completions**, run `bdfr completions`
|
|
|
|
### AUR Package
|
|
|
|
If on Arch Linux or derivative operating systems such as Manjaro, the BDFR can be installed through the AUR.
|
|
|
|
- Latest Release: <https://aur.archlinux.org/packages/python-bdfr>
|
|
- Latest Development Build: <https://aur.archlinux.org/packages/python-bdfr-git>
|
|
|
|
### Source code
|
|
|
|
If you want to use the source code or make contributions, refer to
|
|
[CONTRIBUTING](docs/CONTRIBUTING.md#preparing-the-environment-for-development)
|
|
|
|
## Usage
|
|
|
|
The BDFR works by taking submissions from a variety of "sources" from Reddit and then parsing them to download. These
|
|
sources might be a subreddit, multireddit, a user list, or individual links. These sources are combined and downloaded
|
|
to disk, according to a naming and organisational scheme defined by the user.
|
|
|
|
There are three modes to the BDFR: download, archive, and clone. Each one has a command that performs similar but
|
|
distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the
|
|
images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission
|
|
details, upvotes, text, statistics, as and all the comments on that submission. These can then be saved in a data markup
|
|
language form, such as JSON, XML, or YAML. Lastly, the `clone` command will perform both functions of the previous
|
|
commands at once and is more efficient than running those commands sequentially.
|
|
|
|
Note that the `clone` command is not a true, failthful clone of Reddit. It simply retrieves much of the raw data that
|
|
Reddit provides. To get a true clone of Reddit, another tool such as HTTrack should be used.
|
|
|
|
After installation, run the program from any directory as shown below:
|
|
|
|
```bash
|
|
bdfr download
|
|
```
|
|
|
|
```bash
|
|
bdfr archive
|
|
```
|
|
|
|
```bash
|
|
bdfr clone
|
|
```
|
|
|
|
However, these commands are not enough. You should chain parameters in [Options](#options) according to your use case.
|
|
Don't forget that some parameters can be provided multiple times. Some quick reference commands are:
|
|
|
|
```bash
|
|
bdfr download ./path/to/output --subreddit Python -L 10
|
|
```
|
|
|
|
```bash
|
|
bdfr download ./path/to/output --user reddituser --submitted -L 100
|
|
```
|
|
|
|
```bash
|
|
bdfr download ./path/to/output --user me --saved --authenticate -L 25 --file-scheme "{POSTID}"
|
|
```
|
|
|
|
```bash
|
|
bdfr download ./path/to/output --subreddit "Python, all, mindustry" -L 10 --make-hard-links
|
|
```
|
|
|
|
```bash
|
|
bdfr archive ./path/to/output --user reddituser --submitted --all-comments --comment-context
|
|
```
|
|
|
|
```bash
|
|
bdfr archive ./path/to/output --subreddit all --format yaml -L 500 --folder-scheme ""
|
|
```
|
|
|
|
Alternatively, you can pass options through a YAML file.
|
|
|
|
```bash
|
|
bdfr download ./path/to/output --opts my_opts.yaml
|
|
```
|
|
|
|
For example, running it with the following file
|
|
|
|
```yaml
|
|
skip: [mp4, avi]
|
|
file_scheme: "{UPVOTES}_{REDDITOR}_{POSTID}_{DATE}"
|
|
limit: 10
|
|
sort: top
|
|
subreddit:
|
|
- EarthPorn
|
|
- CityPorn
|
|
```
|
|
|
|
would be equilavent to (take note that in YAML there is `file_scheme` instead of `file-scheme`):
|
|
|
|
```bash
|
|
bdfr download ./path/to/output --skip mp4 --skip avi --file-scheme "{UPVOTES}_{REDDITOR}_{POSTID}_{DATE}" -L 10 -S top --subreddit EarthPorn --subreddit CityPorn
|
|
```
|
|
|
|
Any option that can be specified multiple times should be formatted like subreddit is above.
|
|
|
|
In case when the same option is specified both in the YAML file and in as a command line argument, the command line
|
|
argument takes priority
|
|
|
|
## Options
|
|
|
|
The following options are common between both the `archive` and `download` commands of the BDFR.
|
|
|
|
- `directory`
|
|
- This is the directory to which the BDFR will download and place all files
|
|
- `--authenticate`
|
|
- This flag will make the BDFR attempt to use an authenticated Reddit session
|
|
- See [Authentication](#authentication-and-security) for more details
|
|
- `--config`
|
|
- If the path to a configuration file is supplied with this option, the BDFR will use the specified config
|
|
- See [Configuration Files](#configuration) for more details
|
|
- `--opts`
|
|
- Load options from a YAML file.
|
|
- Has higher prority than the global config file but lower than command-line arguments.
|
|
- See [opts_example.yaml](./opts_example.yaml) for an example file.
|
|
- `--disable-module`
|
|
- Can be specified multiple times
|
|
- Disables certain modules from being used
|
|
- See [Disabling Modules](#disabling-modules) for more information and a list of module names
|
|
- `--downvoted`
|
|
- This will use a user's downvoted posts as a source of posts to scrape
|
|
- This requires an authenticated Reddit instance, using the `--authenticate` flag, as well as `--user` set to `me`
|
|
- `--filename-restriction-scheme`
|
|
- Can be: `windows`, `linux`
|
|
- Turns off the OS detection and specifies which system to use when making filenames
|
|
- See [Filesystem Restrictions](#filesystem-restrictions)
|
|
- `--ignore-user`
|
|
- This will add a user to ignore
|
|
- Can be specified multiple times
|
|
- `--include-id-file`
|
|
- This will add any submission with the IDs in the files provided
|
|
- Can be specified multiple times
|
|
- Format is one ID per line
|
|
- `--log`
|
|
- This allows one to specify the location of the logfile
|
|
- This must be done when running multiple instances of the BDFR, see [Multiple Instances](#multiple-instances) below
|
|
- `--saved`
|
|
- This option will make the BDFR use the supplied user's saved posts list as a download source
|
|
- This requires an authenticated Reddit instance, using the `--authenticate` flag, as well as `--user` set to `me`
|
|
- `--search`
|
|
- This will apply the input search term to specific lists when scraping submissions
|
|
- A search term can only be applied when using the `--subreddit` and `--multireddit` flags
|
|
- `--submitted`
|
|
- This will use a user's submissions as a source
|
|
- A user must be specified with `--user`
|
|
- `--upvoted`
|
|
- This will use a user's upvoted posts as a source of posts to scrape
|
|
- This requires an authenticated Reddit instance, using the `--authenticate` flag, as well as `--user` set to `me`
|
|
- `-L, --limit`
|
|
- This is the limit on the number of submissions retrieve
|
|
- Default is max possible
|
|
- Note that this limit applies to **each source individually** e.g. if a `--limit` of 10 and three subreddits are
|
|
provided, then 30 total submissions will be scraped
|
|
- If it is not supplied, then the BDFR will default to the maximum allowed by Reddit, roughly 1000 posts. **We
|
|
cannot bypass this.**
|
|
- `-S, --sort`
|
|
- This is the sort type for each applicable submission source supplied to the BDFR
|
|
- This option does not apply to upvoted, downvoted or saved posts when scraping from these sources
|
|
- The following options are available:
|
|
- `controversial`
|
|
- `hot` (default)
|
|
- `new`
|
|
- `relevance` (only available when using `--search`)
|
|
- `rising`
|
|
- `top`
|
|
- `-l, --link`
|
|
- This is a direct link to a submission to download, either as a URL or an ID
|
|
- Can be specified multiple times
|
|
- `-m, --multireddit`
|
|
- This is the name of a multireddit to add as a source
|
|
- Can be specified multiple times
|
|
- This can be done by using `-m` multiple times
|
|
- Multireddits can also be used to provide CSV multireddits e.g. `-m "chess, favourites"`
|
|
- The specified multireddits must all belong to the user specified with the `--user` option
|
|
- `-s, --subreddit`
|
|
- This adds a subreddit as a source
|
|
- Can be used mutliple times
|
|
- This can be done by using `-s` multiple times
|
|
- Subreddits can also be used to provide CSV subreddits e.g. `-m "all, python, mindustry"`
|
|
- `-t, --time`
|
|
- This is the time filter that will be applied to all applicable sources
|
|
- This option does not apply to upvoted, downvoted or saved posts when scraping from these sources
|
|
- This option only applies if sorting by top or controversial. See --sort for more detail.
|
|
- The following options are available:
|
|
- `all` (default)
|
|
- `hour`
|
|
- `day`
|
|
- `week`
|
|
- `month`
|
|
- `year`
|
|
- `--time-format`
|
|
- This specifies the format of the datetime string that replaces `{DATE}` in file and folder naming schemes
|
|
- See [Time Formatting Customisation](#time-formatting-customisation) for more details, and the formatting
|
|
scheme
|
|
- `-u, --user`
|
|
- This specifies the user to scrape in concert with other options
|
|
- When using `--authenticate`, `--user me` can be used to refer to the authenticated user
|
|
- Can be specified multiple times for multiple users
|
|
- If downloading a multireddit, only one user can be specified
|
|
- `-v, --verbose`
|
|
- Increases the verbosity of the program
|
|
- Can be specified multiple times
|
|
|
|
### Downloader Options
|
|
|
|
The following options apply only to the `download` command. This command downloads the files and resources linked to in
|
|
the submission, or a text submission itself, to the disk in the specified directory.
|
|
|
|
- `--make-hard-links`
|
|
- This flag will create hard links to an existing file when a duplicate is downloaded in the current run
|
|
- This will make the file appear in multiple directories while only taking the space of a single instance
|
|
- `--max-wait-time`
|
|
- This option specifies the maximum wait time for downloading a resource
|
|
- The default is 120 seconds
|
|
- See [Rate Limiting](#rate-limiting) for details
|
|
- `--no-dupes`
|
|
- This flag will skip writing a file to disk if that file was already downloaded in the current run
|
|
- This is calculated by MD5 hash
|
|
- `--search-existing`
|
|
- This will make the BDFR compile the hashes for every file in `directory`
|
|
- The hashes are used to skip duplicate files if `--no-dupes` is supplied or make hard links if `--make-hard-links`
|
|
is supplied
|
|
- **The use of this option is highly discouraged due to inefficiency**
|
|
- `--file-scheme`
|
|
- Sets the scheme for files
|
|
- Default is `{REDDITOR}_{TITLE}_{POSTID}`
|
|
- See [Folder and File Name Schemes](#folder-and-file-name-schemes) for more details
|
|
- `--folder-scheme`
|
|
- Sets the scheme for folders
|
|
- Default is `{SUBREDDIT}`
|
|
- See [Folder and File Name Schemes](#folder-and-file-name-schemes) for more details
|
|
- `--exclude-id`
|
|
- This will skip the download of any submission with the ID provided
|
|
- Can be specified multiple times
|
|
- `--exclude-id-file`
|
|
- This will skip the download of any submission with any of the IDs in the files provided
|
|
- Can be specified multiple times
|
|
- Format is one ID per line
|
|
- `--skip-domain`
|
|
- This adds domains to the download filter i.e. submissions coming from these domains will not be downloaded
|
|
- Can be specified multiple times
|
|
- Domains must be supplied in the form `example.com` or `img.example.com`
|
|
- `--skip`
|
|
- This adds file types to the download filter i.e. submissions with one of the supplied file extensions will not be
|
|
downloaded
|
|
- Can be specified multiple times
|
|
- `--skip-subreddit`
|
|
- This skips all submissions from the specified subreddit
|
|
- Can be specified multiple times
|
|
- Also accepts CSV subreddit names
|
|
- `--min-score`
|
|
- This skips all submissions which have fewer than specified upvotes
|
|
- `--max-score`
|
|
- This skips all submissions which have more than specified upvotes
|
|
- `--min-score-ratio`
|
|
- This skips all submissions which have lower than specified upvote ratio
|
|
- `--max-score-ratio`
|
|
- This skips all submissions which have higher than specified upvote ratio
|
|
|
|
### Archiver Options
|
|
|
|
The following options are for the `archive` command specifically.
|
|
|
|
- `--all-comments`
|
|
- When combined with the `--user` option, this will download all the user's comments
|
|
- `-f, --format`
|
|
- This specifies the format of the data file saved to disk
|
|
- The following formats are available:
|
|
- `json` (default)
|
|
- `xml`
|
|
- `yaml`
|
|
- Can be specified multiple times
|
|
- `--comment-context`
|
|
- This option will, instead of downloading an individual comment, download the submission that comment is a part of
|
|
- May result in a longer run time as it retrieves much more data
|
|
- `--skip-comments`
|
|
- Skip downloading all comments. This will result in a much shorter runtime.
|
|
- Not compatible with --comment-context
|
|
|
|
### Cloner Options
|
|
|
|
The `clone` command can take all the options listed above for both the `archive` and `download` commands since it
|
|
performs the functions of both.
|
|
|
|
## Common Command Tricks
|
|
|
|
A common use case is for subreddits/users to be loaded from a file. The BDFR supports this via YAML file options
|
|
(`--opts my_opts.yaml`).
|
|
|
|
Alternatively, you can use the command-line [xargs](https://en.wikipedia.org/wiki/Xargs) function.
|
|
For a list of users `users.txt` (one user per line), type:
|
|
|
|
```bash
|
|
cat users.txt | xargs -L 1 echo --user | xargs -L 50 bdfr download <ARGS>
|
|
```
|
|
|
|
The part `-L 50` is to make sure that the character limit for a single line isn't exceeded, but may not be necessary.
|
|
This can also be used to load subreddits from a file, simply exchange `--user` with `--subreddit` and so on.
|
|
|
|
## Authentication and Security
|
|
|
|
The BDFR uses OAuth2 authentication to connect to Reddit if authentication is required. This means that it is a secure,
|
|
token-based system for making requests. This also means that the BDFR only has access to specific parts of the account
|
|
authenticated, by default only saved posts, upvoted posts, downvoted posts, and the identity of the authenticated
|
|
account. Note that authentication is not required unless accessing private things like upvoted posts, downvoted posts,
|
|
saved posts, and private multireddits.
|
|
|
|
To authenticate, the BDFR will first look for a token in the configuration file that signals that there's been
|
|
a previous authentication. If this is not there, then the BDFR will attempt to register itself with your account. This
|
|
is normal, and if you run the program, it will pause and show a Reddit URL. Click on this URL and it will take you to
|
|
Reddit, where the permissions being requested will be shown. Read this and **confirm that there are no more permissions
|
|
than needed to run the program**. You should not grant unneeded permissions; by default, the BDFR only requests
|
|
permission to read your saved, upvoted, or downvoted submissions and identify as you.
|
|
|
|
If the permissions look safe, confirm it, and the BDFR will save a token that will allow it to authenticate with Reddit
|
|
from then on.
|
|
|
|
## Changing Permissions
|
|
|
|
Most users will not need to do anything extra to use any of the current features. However, if additional features such
|
|
as scraping messages, PMs, etc are added in the future, these will require additional scopes. Additionally, advanced
|
|
users may wish to use the BDFR with their own API key and secret. There is normally no need to do this, but it *is*
|
|
allowed by the BDFR.
|
|
|
|
The configuration file for the BDFR contains the API secret and key, as well as the scopes that the BDFR will request
|
|
when registering itself to a Reddit account via OAuth2. These can all be changed if the user wishes, however do not do
|
|
so if you don't know what you are doing. The defaults are specifically chosen to have a very low security risk if your
|
|
token were to be compromised, however unlikely that actually is. Never grant more permissions than you absolutely need.
|
|
|
|
For more details on the configuration file and the values therein, see [Configuration Files](#configuration).
|
|
|
|
## Folder and File Name Schemes
|
|
|
|
The naming and folder schemes for the BDFR are both completely customisable. A number of different fields can be given
|
|
which will be replaced with properties from a submission when downloading it. The scheme format takes the form of
|
|
`{KEY}`, where `KEY` is a string from the below list.
|
|
|
|
- `DATE`
|
|
- `FLAIR`
|
|
- `POSTID`
|
|
- `REDDITOR`
|
|
- `SUBREDDIT`
|
|
- `TITLE`
|
|
- `UPVOTES`
|
|
|
|
Each of these can be enclosed in curly bracket, `{}`, and included in the name. For example, to just title every
|
|
downloaded post with the unique submission ID, you can use `{POSTID}`. Static strings can also be included, such as
|
|
`download_{POSTID}` which will not change from submission to submission. For example, the previous string will result in
|
|
the following submission file names:
|
|
|
|
- `download_aaaaaa.png`
|
|
- `download_bbbbbb.png`
|
|
|
|
At least one key *must* be included in the file scheme, otherwise an error will be thrown. The folder scheme however,
|
|
can be null or a simple static string. In the former case, all files will be placed in the folder specified with the
|
|
`directory` argument. If the folder scheme is a static string, then all submissions will be placed in a folder of that
|
|
name. In both cases, there will be no separation between all submissions.
|
|
|
|
It is highly recommended that the file name scheme contain the parameter `{POSTID}` as this is **the only parameter
|
|
guaranteed to be unique**. No combination of other keys will necessarily be unique and may result in posts being skipped
|
|
as the BDFR will see files by the same name and skip the download, assuming that they are already downloaded.
|
|
|
|
## Configuration
|
|
|
|
The configuration files are, by default, stored in the configuration directory for the user. This differs depending on
|
|
the OS that the BDFR is being run on. For Windows, this will be:
|
|
|
|
- `C:\Users\<User>\AppData\Local\BDFR\bdfr`
|
|
|
|
If Python has been installed through the Windows Store, the folder will appear in a different place. Note that the hash
|
|
included in the file path may change from installation to installation.
|
|
|
|
- `C:\Users\<User>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\Local\BDFR\bdfr`
|
|
|
|
On Mac OSX, this will be:
|
|
|
|
- `~/Library/Application Support/bdfr`.
|
|
|
|
Lastly, on a Linux system, this will be:
|
|
|
|
- `~/.config/bdfr/`
|
|
|
|
The logging output for each run of the BDFR will be saved to this directory in the file `log_output.txt`. If you need to
|
|
submit a bug, it is this file that you will need to submit with the report.
|
|
|
|
### Configuration File
|
|
|
|
The `config.cfg` is the file that supplies the BDFR with the configuration to use. At the moment, the following keys
|
|
**must** be included in the configuration file supplied.
|
|
|
|
- `client_id`
|
|
- `client_secret`
|
|
- `scopes`
|
|
|
|
The following keys are optional, and defaults will be used if they cannot be found.
|
|
|
|
- `backup_log_count`
|
|
- `max_wait_time`
|
|
- `time_format`
|
|
- `disabled_modules`
|
|
- `filename-restriction-scheme`
|
|
|
|
All of these should not be modified unless you know what you're doing, as the default values will enable the BDFR to
|
|
function just fine. A configuration is included in the BDFR when it is installed, and this will be placed in the
|
|
configuration directory as the default.
|
|
|
|
Most of these values have to do with OAuth2 configuration and authorisation. The key `backup_log_count` however has to
|
|
do with the log rollover. The logs in the configuration directory can be verbose and for long runs of the BDFR, can grow
|
|
quite large. To combat this, the BDFR will overwrite previous logs. This value determines how many previous run logs
|
|
will be kept. The default is 3, which means that the BDFR will keep at most three past logs plus the current one. Any
|
|
runs past this will overwrite the oldest log file, called "rolling over". If you want more records of past runs,
|
|
increase this number.
|
|
|
|
#### Time Formatting Customisation
|
|
|
|
The option `time_format` will specify the format of the timestamp that replaces `{DATE}` in filename and folder name
|
|
schemes. By default, this is the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format which is highly recommended
|
|
due to its standardised nature. If you don't **need** to change it, it is recommended that you do not. However, you can
|
|
specify it to anything required with this option. The `--time-format` option supersedes any specification in the
|
|
configuration file
|
|
|
|
The format can be specified through the [format
|
|
codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) that are standard in the Python
|
|
`datetime` library.
|
|
|
|
#### Disabling Modules
|
|
|
|
The individual modules of the BDFR, used to download submissions from websites, can be disabled. This is helpful
|
|
especially in the case of the fallback downloaders, since the `--skip-domain` option cannot be effectively used in these
|
|
cases. For example, the Youtube-DL downloader can retrieve data from hundreds of websites and domains; thus the only way
|
|
to fully disable it is via the `--disable-module` option.
|
|
|
|
Modules can be disabled through the command line interface for the BDFR or more permanently in the configuration file
|
|
via the `disabled_modules` option. The list of downloaders that can be disabled are the following. Note that they are
|
|
case-insensitive.
|
|
|
|
- `Direct`
|
|
- `DelayForReddit`
|
|
- `Erome`
|
|
- `Gallery` (Reddit Image Galleries)
|
|
- `Gfycat`
|
|
- `Imgur`
|
|
- `PornHub`
|
|
- `Redgifs`
|
|
- `SelfPost` (Reddit Text Post)
|
|
- `Vidble`
|
|
- `VReddit` (Reddit Video Post)
|
|
- `Youtube`
|
|
- `YtdlpFallback` (Youtube DL Fallback)
|
|
|
|
### Rate Limiting
|
|
|
|
The option `max_wait_time` has to do with retrying downloads. There are certain HTTP errors that mean that no amount of
|
|
requests will return the wanted data, but some errors are from rate-limiting. This is when a single client is making so
|
|
many requests that the remote website cuts the client off to preserve the function of the site. This is a common
|
|
situation when downloading many resources from the same site. It is polite and best practice to obey the website's
|
|
wishes in these cases.
|
|
|
|
To this end, the BDFR will sleep for a time before retrying the download, giving the remote server time to "rest". This
|
|
is done in 60 second increments. For example, if a rate-limiting-related error is given, the BDFR will sleep for 60
|
|
seconds before retrying. Then, if the same type of error occurs, it will sleep for another 120 seconds, then 180
|
|
seconds, and so on.
|
|
|
|
The option `--max-wait-time` and the configuration option `max_wait_time` both specify the maximum time the BDFR will
|
|
wait. If both are present, the command-line option takes precedence. For instance, the default is 120, so the BDFR will
|
|
wait for 60 seconds, then 120 seconds, and then move one. **Note that this results in a total time of 180 seconds trying
|
|
the same download**. If you wish to try to bypass the rate-limiting system on the remote site, increasing the maximum
|
|
wait time may help. However, note that the actual wait times increase exponentially if the resource is not downloaded
|
|
i.e. specifying a max value of 300 (5 minutes), can make the BDFR pause for 15 minutes on one submission, not 5, in the
|
|
worst case.
|
|
|
|
## Multiple Instances
|
|
|
|
The BDFR can be run in multiple instances with multiple configurations, either concurrently or consecutively. The use of
|
|
scripting files facilitates this the easiest, either Powershell on Windows operating systems or Bash elsewhere. This
|
|
allows multiple scenarios to be run with data being scraped from different sources, as any two sets of scenarios might
|
|
be mutually exclusive i.e. it is not possible to download any combination of data from a single run of the BDFR. To
|
|
download from multiple users for example, multiple runs of the BDFR are required.
|
|
|
|
Running these scenarios consecutively is done easily, like any single run. Configuration files that differ may be
|
|
specified with the `--config` option to switch between tokens, for example. Otherwise, almost all configuration for data
|
|
sources can be specified per-run through the command line.
|
|
|
|
Running scenarios concurrently (at the same time) however, is more complicated. The BDFR will look to a single, static
|
|
place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple
|
|
instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On
|
|
Linux and other UNIX based operating systems, this will succeed, though there is a substantial risk that the logfile
|
|
will be useless due to garbled and jumbled data. On Windows however, attempting this will raise an error that crashes
|
|
the program as Windows forbids multiple processes from accessing the same file.
|
|
|
|
The way to fix this is to use the `--log` option to manually specify where the logfile is to be stored. If the given
|
|
location is unique to each instance of the BDFR, then it will run fine.
|
|
|
|
## Filesystem Restrictions
|
|
|
|
Different filesystems have different restrictions for what files and directories can be named. Thesse are separated into
|
|
two broad categories: Linux-based filesystems, which have very few restrictions; and Windows-based filesystems, which
|
|
are much more restrictive in terms if forbidden characters and length of paths.
|
|
|
|
During the normal course of operation, the BDFR detects what filesystem it is running on and formats any filenames and
|
|
directories to conform to the rules that are expected of it. However, there are cases where this will fail. When running
|
|
on a Linux-based machine, or another system where the home filesystem is permissive, and accessing a share or drive with
|
|
a less permissive system, the BDFR will assume that the *home* filesystem's rules apply. For example, when downloading
|
|
to a SAMBA share from Ubuntu, there will be errors as SAMBA is more restrictive than Ubuntu.
|
|
|
|
The best option would be to always download to a filesystem that is as permission as possible, such as an NFS share or
|
|
ext4 drive. However, when this is not possible, the BDFR allows for the restriction scheme to be manually specified at
|
|
either the command-line or in the configuration file. At the command-line, this is done with
|
|
`--filename-restriction-scheme windows`, or else an option by the same name in the configuration file.
|
|
|
|
## Manipulating Logfiles
|
|
|
|
The logfiles that the BDFR outputs are consistent and quite detailed and in a format that is amenable to regex. To this
|
|
end, a number of bash scripts have been [included here](./scripts). They show examples for how to extract successfully
|
|
downloaded IDs, failed IDs, and more besides.
|
|
|
|
## Unsaving posts
|
|
|
|
Back in v1 there was an option to unsave posts from your account when downloading, but it was removed from the core BDFR
|
|
on v2 as it is considered a read-only tool. However, for those missing this functionality, a script was created that
|
|
uses the log files to achieve this. There is info on how to use this on the README.md file on the scripts subdirectory.
|
|
|
|
## List of currently supported sources
|
|
|
|
- Direct links (links leading to a file)
|
|
- Delay for Reddit
|
|
- Erome
|
|
- Gfycat
|
|
- Gif Delivery Network
|
|
- Imgur
|
|
- Reddit Galleries
|
|
- Reddit Text Posts
|
|
- Reddit Videos
|
|
- Redgifs
|
|
- Vidble
|
|
- YouTube
|
|
- Any source supported by [YT-DLP](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md) should be
|
|
compatible
|
|
|
|
## Contributing
|
|
|
|
If you wish to contribute, see [Contributing](docs/CONTRIBUTING.md) for more information.
|
|
|
|
When reporting any issues or interacting with the developers, please follow the [Code of
|
|
Conduct](docs/CODE_OF_CONDUCT.md).
|