1
0
Fork 0
mirror of synced 2024-05-18 03:02:48 +12:00

Enhance docs

This commit is contained in:
Ali Parlakçı 2021-04-18 03:26:07 +03:00 committed by Ali Parlakci
parent e78ecd5626
commit 0d407d7a39
2 changed files with 76 additions and 31 deletions

View file

@ -1,39 +1,48 @@
# Bulk Downloader for Reddit v2 \[BETA\]
[![Python Test](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml/badge.svg?branch=v2)](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml)
This is a tool to download submissions or submission data from Reddit. It can be used to archive data or even crawl Reddit to gather research data. The BDFR is flexible and can be used in scripts if needed through an extensive command-line interface.
Some quick reference commands are:
```bash
python3 -m bdfr download --subreddit Python -L 10
python3 -m bdfr download --user me --saved --authenticate -L 25 --file-scheme '{POSTID}'
python3 -m bdfr download --subreddit 'Python, all, mindustry' -L 10 --make-hard-links
python3 -m bdfr archive --subreddit all --format yaml -L 500 --folder-scheme ''
```
This is a tool to download submissions or submission data from Reddit. It can be used to archive data or even crawl Reddit to gather research data. The BDFR is flexible and can be used in scripts if needed through an extensive command-line interface. [List of currently supported sources](#list-of-currently-supported-sources)
If you wish to open an issue, please read [the guide on opening issues](docs/CONTRIBUTING.md#opening-an-issue) to ensure that your issue is clear and contains everything it needs to for the developers to investigate.
## Installation
*Bulk Downloader for Reddit* **requires** Python 3.9.x and it is distributed via `pip`. Install it as such:
```bash
pip install bdfr
```
If you want to use the source code or make contributions, refer to [CONTRIBUTING](docs/CONTRIBUTING.md#preparing-the-environment-for-development)
## Usage
The BDFR works by taking submissions from a variety of "sources" from Reddit and then parsing them to download. These sources might be a subreddit, multireddit, a user list, or individual links. These sources are combined and downloaded to disk, according to a naming and organisational scheme defined by the user.
There are two modes to the BDFR: download, and archive. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, statistics, as and all the comments on that submission. These can then be saved in a data markup language form, such as JSON, XML, or YAML.
Many websites and links are supported for the downloader:
After installation, run the program from any directory as shown below:
```bash
python3 -m bdfr download
```
```bash
python3 -m bdfr archive
```
- Direct links (links leading to a file)
- Erome
- Gfycat
- Gif Delivery Network
- Imgur
- Reddit Galleries
- Reddit Text Posts
- Reddit Videos
- Redgifs
- YouTube
However, these commands are not enough. You should chain parameters in [Options](#options) according to your use case. Don't forget that some parameters can be provided multiple times. Some quick reference commands are:
### Options
```bash
python3 -m bdfr download --subreddit Python -L 10
```
```bash
python3 -m bdfr download --user me --saved --authenticate -L 25 --file-scheme '{POSTID}'
```
```bash
python3 -m bdfr download --subreddit 'Python, all, mindustry' -L 10 --make-hard-links
```
```bash
python3 -m bdfr archive --subreddit all --format yaml -L 500 --folder-scheme ''
```
## Options
The following options are common between both the `archive` and `download` commands of the BDFR.
@ -103,7 +112,7 @@ The following options are common between both the `archive` and `download` comma
- Increases the verbosity of the program
- Can be specified multiple times
#### Downloader Options
### Downloader Options
The following options apply only to the `download` command. This command downloads the files and resources linked to in the submission, or a text submission itself, to the disk in the specified directory.
@ -145,7 +154,7 @@ The following options apply only to the `download` command. This command downloa
- Can be specified multiple times
- Also accepts CSV subreddit names
#### Archiver Options
### Archiver Options
The following options are for the `archive` command specifically.
@ -198,8 +207,7 @@ It is highly recommended that the file name scheme contain the parameter `{POSTI
## Configuration
The configuration files are, by default, stored in the configuration directory for the user. This differs depending on the OS that the BDFR is being run on. For Windows, this will be:
- `C:\Documents and Settings\<User>\Application Data\Local Settings\BDFR\bdfr` or
- `C:\Documents and Settings\<User>\Application Data\BDFR\bdfr`
- `C:\Users\<User>\AppData\Local\BDFR\bdfr`
On Mac OSX, this will be:
- `~/Library/Application Support/bdfr`.
@ -223,7 +231,7 @@ All of these should not be modified unless you know what you're doing, as the de
Most of these values have to do with OAuth2 configuration and authorisation. The key `backup_log_count` however has to do with the log rollover. The logs in the configuration directory can be verbose and for long runs of the BDFR, can grow quite large. To combat this, the BDFR will overwrite previous logs. This value determines how many previous run logs will be kept. The default is 3, which means that the BDFR will keep at most three past logs plus the current one. Any runs past this will overwrite the oldest log file, called "rolling over". If you want more records of past runs, increase this number.
#### Rate Limiting
### Rate Limiting
The option `max_wait_time` has to do with retrying downloads. There are certain HTTP errors that mean that no amount of requests will return the wanted data, but some errors are from rate-limiting. This is when a single client is making so many requests that the remote website cuts the client off to preserve the function of the site. This is a common situation when downloading many resources from the same site. It is polite and best practice to obey the website's wishes in these cases.
@ -231,6 +239,19 @@ To this end, the BDFR will sleep for a time before retrying the download, giving
The option `--max-wait-time` and the configuration option `max_wait_time` both specify the maximum time the BDFR will wait. If both are present, the command-line option takes precedence. For instance, the default is 120, so the BDFR will wait for 60 seconds, then 120 seconds, and then move one. **Note that this results in a total time of 180 seconds trying the same download**. If you wish to try to bypass the rate-limiting system on the remote site, increasing the maximum wait time may help. However, note that the actual wait times increase exponentially if the resource is not downloaded i.e. specifying a max value of 300 (5 minutes), can make the BDFR pause for 15 minutes on one submission, not 5, in the worst case.
## List of currently supported sources
- Direct links (links leading to a file)
- Erome
- Gfycat
- Gif Delivery Network
- Imgur
- Reddit Galleries
- Reddit Text Posts
- Reddit Videos
- Redgifs
- YouTube
## Contributing
If you wish to contribute, see [Contributing](docs/CONTRIBUTING.md) for more information.

View file

@ -2,17 +2,21 @@
When making a contribution to the BDFR project, please open an issue beforehand so that the maintainers can weigh in on it. This helps create a trail on GitHub and keeps things organised.
If you have a question, **please don't open an issue on GitHub**. There is a discussion tab on the repository's GitHub where you can interact with the developers and ask questions. If you believe that something is a bug, or that a feature should be added, then by all means open an issue.
**Please don't open an issue on GitHub** unless you are reporting a bug or proposing a feature. For questions, there is a discussion tab on the repository's GitHub page where you can interact with the developers and ask questions. If you believe that something is a bug, or that a feature should be added, then by all means open an issue.
All communication on GitHub, Discord, email, or any other medium must conform to the [Code of Conduct](CODE_OF_CONDUCT.md). It's not that hard to stay respectful.
## Opening an Issue
**Before opening a new issue**, be sure that no issues regarding your problem already exist. If a similar issue exists, try to contribute to the issue.
### Bugs
When opening an issue about a bug, **please provide the full log file for the run in which the bug occurred**. This log file is named `log_output.txt` in the configuration folder. Check the [README](../README.md) for information on where this is. This log file will contain all the information required for the developers to recreate the bug.
If you do not have or cannot find the log file, then at minimum please provide the **Reddit ID for the submission** or comment which caused the issue. Also copy in the command that you used to run the BDFR from the command line, as that will also provide helpful information when trying to find and fix the bug. If needed, more information will be asked in the thread of the bug.
In the case of requesting a feature or an enhancement, there are fewer requirements. However, please be clear in what you would like the BDFR to do and also how the feature/enhancement would be used or would be useful to more people. Be aware that proposed enhancements may be rejected for multiple reasons, or no reason, at the discretion of the developers.
### Feature requests
In the case of requesting a feature or an enhancement, there are fewer requirements. However, please be clear in what you would like the BDFR to do and also how the feature/enhancement would be used or would be useful to more people. It is crucial that the feature is justified. Any feature request without a concrete reason for it to be implemented has a very small chance to get accepted. Be aware that proposed enhancements may be rejected for multiple reasons, or no reason, at the discretion of the developers.
## Pull Requests
@ -23,12 +27,32 @@ Once you have done both of these, the below list shows the path that should be f
1. If an issue does not already exist, open one that will relate to the PR.
2. Ensure that any changes fit into the architecture specified above.
3. Ensure that you have written tests that cover the new code.
4. Ensure that no existing tests fail, unless there is a good reason for them to do so. If there is, note which tests fail and why this is expected and okay in the PR.
4. Ensure that no existing tests fail, unless there is a good reason for them to do so.
5. If needed, update any documentation with changes.
6. Open a pull request that references the relevant issue.
7. Expect changes or suggestions and heed the Code of Conduct. We're all volunteers here.
Someone will review your pull request as soon as possible, but remember that all maintainers are volunteers and this won't happen immediately. Once it is approved, congratulations! Your code is now part of the BDFR.
Someone will review your pull request as soon as possible, but remember that all maintainers are volunteers and this won't happen immediately. Once it is approved, congratulations! Your code is now part of the BDFR.
## Preparing the environment for development
Bulk Downloader for Reddit requires Python 3.9 at minimum. First, ensure that your Python installation satisfies this.
BDfR is built in a way that it can be packaged and installed via `pip`. This places BDfR next to other Python packages and enables you to run the program from any directory. Since it is managed by pip, you can also uninstall it.
To install the program, clone the repository and run pip inside the project's root directory:
```bash
$ git clone https://github.com/aliparlakci/bulk-downloader-for-reddit.git
$ cd ./bulk-downloader-for-reddit
$ python3 -m pip install -e .
```
**`-e`** parameter creates a link to that folder. That is, any change inside the folder affects the package immidiately. So, when developing, you can be sure that the package is not stale and Python is always running your latest changes. (Due to this linking, moving/removing/renaming the folder might break it)
Then, you can run the program from anywhere in your disk as such:
```bash
$ python3 -m bdfr
```
## Style Guide