1
0
Fork 0
mirror of synced 2024-05-16 02:02:45 +12:00

Update documentation

This commit is contained in:
Serene-Arc 2021-03-16 10:01:51 +10:00 committed by Ali Parlakci
parent 3a093d0844
commit 0d72bf6431
4 changed files with 157 additions and 31 deletions

View file

@ -1,8 +1,8 @@
# Bulk Downloader for Reddit
This is a tool to download data from Reddit.
This is a tool to download submissions or submission data from Reddit. It can be used to archive data or even crawl Reddit to gather research data. The BDFR is flexible and can be used in scripts if needed through an extensive command-line interface.
# Usage
## Usage
The BDFR works by taking submissions from a variety of "sources" from Reddit and then parsing them to download. These sources might be a subreddit, multireddit, a user list, or individual links. These sources are combined and downloaded to disk, according to a naming and organisational scheme defined by the user.
@ -10,7 +10,7 @@ There are two modes to the BDFR: download, and archive. Each one has a command t
Many websites and links are supported for the downloader:
- Direct Links(links leading to a file)
- Direct Links (links leading to a file)
- Erome
- Gfycat
- Gif Delivery Network
@ -21,7 +21,7 @@ Many websites and links are supported for the downloader:
- Redgifs
- Youtube
# Options
### Options
The following options are common between both the `archive` and `download` commands of the BDFR.
@ -48,7 +48,7 @@ The following options are common between both the `archive` and `download` comma
- `-L, --limit`
- This is the limit on the number of submissions retrieve
- Default is max possible
- Note that this limit applies to ** each source individually ** e.g. if a `--limit` of 10 and three subreddits are provided, then 30 total submissions will be scraped
- Note that this limit applies to **each source individually** e.g. if a `--limit` of 10 and three subreddits are provided, then 30 total submissions will be scraped
- If it is not supplied, then the BDFR will default to the maximum allowed by Reddit, roughly 1000 posts. **We cannot bypass this.**
- `-S, --sort`
- This is the sort type for each applicable submission source supplied to the BDFR
@ -87,7 +87,7 @@ The following options are common between both the `archive` and `download` comma
- Increases the verbosity of the program
- Can be specified multiple times
# Downloader Options
#### Downloader Options
The following options apply only to the `download` command. This command downloads the files and resources linked to in the submission, or a text submission itself, to the disk in the specified directory.
@ -111,7 +111,7 @@ The following options apply only to the `download` command. This command downloa
- This adds file types to the download filter i.e. submissions with one of the supplied file extensions will not be downloaded
- Can be specified multiple times
# Archiver Options
#### Archiver Options
The following options are for the `archive` command specifically.
@ -122,13 +122,13 @@ The following options are for the `archive` command specifically.
- `xml`
- `yaml`
# Authentication
## Authentication
The BDFR uses OAuth2 authentication to connect to Reddit if authentication is required. This means that it is a secure, token - based system for making requests. This also means that the BDFR only has access to specific parts of the account authenticated, by default only saved posts, upvoted posts, and the identity of the authenticated account. Note that authentication is not required unless accessing private things like upvoted posts, saved posts, and private multireddits.
To authenticate, the BDFR will first look for a token in the configuration file that signals that there's been a previous authentication. If this is not there, then the BDFR will attempt to register itself with your account. This is normal, and if you run the program, it will pause and show a Reddit URL. Click on this URL and it will take you to Reddit, where the permissions being requested will be shown. Confirm it, and the BDFR will save a token that will allow it to authenticate with Reddit from then on.
# Changing Permissions
## Changing Permissions
Most users will not need to do anything extra to use any of the current features. However, if additional features such as scraping messages, PMs, etc are added in the future, these will require additional scopes. Additionally, advanced users may wish to use the BDFR with their own API key and secret. There is normally no need to do this, but it is allowed by the BDFR.
@ -136,7 +136,7 @@ The configuration file for the BDFR contains the API secret and key, as well as
For more details on the configuration file and the values therein, see[Configuration Files](#configuration).
# Folder and File Name Schemes
## Folder and File Name Schemes
The naming and folder schemes for the BDFR are both completely customisable. A number of different fields can be given which will be replaced with properties from a submission when downloading it. The scheme format takes the form of `{KEY}`, where `KEY` is a string from the below list.
@ -150,15 +150,15 @@ The naming and folder schemes for the BDFR are both completely customisable. A n
Each of these can be enclosed in curly bracket, `{}`, and included in the name. For example, to just title every downloaded post with the unique submission ID, you can use `{POSTID}`. Statis strings can also be included, such as `download_{POSTID}` which will not change from submission to submission.
At least one key * must * be included in the file scheme, otherwise an error will be thrown. The folder scheme however, can be null or a simple static string. In the former case, all files will be placed in the folder specified with the `directory` argument. If the folder scheme is a static string, then all submissions will be placed in a folder of that name.
At least one key *must* be included in the file scheme, otherwise an error will be thrown. The folder scheme however, can be null or a simple static string. In the former case, all files will be placed in the folder specified with the `directory` argument. If the folder scheme is a static string, then all submissions will be placed in a folder of that name.
# Configuration
## Configuration
The configuration files are, by default, stored in the configuration directory for the user. This differs depending on the OS that the BDFR is being run on. For Windows, this will be `C:\Documents and Settings\<User>\Application Data\Local Settings\BDFR\bulkredditdownloader` or `C:\Documents and Settings\<User>\Application Data\BDFR\bulkredditdownloader`. On Mac OSX, this will be `~/Library/Application Support/bulkredditdownloader`. Lastly, on a Linux system, this will be `~/.local/share/bulkredditdownloader`.
The logging output for each run of the BDFR will be saved to this directory in the file `log_output.txt`. If you need to submit a bug, it is this file that you will need to submit with the report.
# Configuration File
### Configuration File
The `config.cfg` is the file that supplies the BDFR with the configuration to use. At the moment, the following keys ** must ** be included in the configuration file supplied.
@ -167,3 +167,9 @@ The `config.cfg` is the file that supplies the BDFR with the configuration to us
- `scopes`
All of these should not be modified unless you know what you're doing, as the default values will enable the BDFR to function just fine. A configuration is included in the BDFR when it is installed, and this will be placed in the configuration directory as the default.
## Contributing
If you wish to contribute, see [Contributing](docs/CONTRIBUTING.md) for more information.
When reporting any issues or interacting with the developers, please follow the [Code of Conduct](docs/CODE_OF_CONDUCT.md).

View file

@ -1,22 +1,37 @@
# Architecture
1. Arguments are passed to an instance of RedditDownloader
2. Internal objects are created
- Formatter created
- Filter created
- Configuration loaded
- Reddit instance created
3. Reddit lists scraped
When the project was rewritten for v2, the goal was to make the codebase easily extensible and much easier to read and modify. However, this document provides a step-by-step look through the process that the BDFR goes through, so that any prospective developers can more easily grasp the way the code works.
## The Download Process
The BDFR is organised around a central object, the RedditDownloader class. The Archiver object extends and inherits from this class.
1. The RedditDownloader parses all the arguments and configuration options, held in the Configuration object, and creates a variety of internal objects for use, such as the file name formatter, download filter, etc.
To actually download, the following happens:
2. The RedditDownloader scrapes raw submissions from Reddit via several methods relating to different sources. A source is defined as a single stream of submissions from a subreddit, multireddit, or user list.
3. These raw submissions are passed to the DownloaderFactory class to select the specialised downloader class to use. Each of these are for a specific website or link type, with some catch-all classes like Direct.
4. The BaseDownloader child, spawned by DownloaderFactory, takes the link and does any necessary processing to find the direct link to the actual resource.
5. This is returned to the RedditDownloader in the form of a Resource object. This holds the URL and some other information for the final resource.
6. The Resource is passed through the DownloadFilter instantiated in step 1.
1. RedditDownloader uses DownloadFactory to find the right module for a submission
2. Downloader instance created
3. Downloader returns a list of Resource objects (lists may have one objects)
4. RedditDownloader checks if it already exists
5. RedditDownloader checks against the DownloadFilter created earlier
6. RedditDownloader creates a formatted file path base on the Resource with FileNameFormatter
7. Resource content is written to disk
7. The destination file name for the Resource is calculated. If it already exists, then the Resource will be discarded.
8. Here the actual data is downloaded to the Resource and a hash calculated which is used to find duplicates.
9. Only then is the Resource written to the disk.
This is the step-by-step process that the BDFR goes through to download a Reddit post.
## Adding another Supported Site
This is one of the easiest changes to do with the code. First, any new class must inherit from the BaseDownloader class which provided an abstract parent to implement. However, take note of the other classes as well. Many downloaders can inherit from one another instead of just the BaseDownloader. For example, the VReddit class, used for downloading video from Reddit, inherits almost all of its code from the YouTube class. **Minimise code duplication wherever possible**.
Once the downloader class has been written **and tests added** for it as well, then the regex string for the site's URLs can be added to the DownloaderFactory. Then additional tests must be added for the DownloadFactory to ensure that the appropriate classes are called when the right URLs are passed to the factory.
## Adding Other Features

76
docs/CODE_OF_CONDUCT.md Normal file
View file

@ -0,0 +1,76 @@
# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, gender identity and expression, level of experience,
education, socio-economic status, nationality, personal appearance, race,
religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team via Discord. All complaints will
be reviewed and investigated and will result in a response that is deemed
necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an
incident. Further details of specific enforcement policies may be posted
separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org

View file

@ -0,0 +1,29 @@
# Contributing
When making a contribution to the BDFR project, please open an issue beforehand so that the maintainers can weigh in on it. This helps create a trail on GitHub and keeps things organised.
If you have a question, **please don't open an issue on GitHub**. There is a subreddit specifically for the BDFR where questions can be asked. If you believe that something is a bug, or that a feature should be added, then by all means open an issue.
All communication on GitHub, Discord, email, or any other medium must conform to the [Code of Conduct](CODE_OF_CONDUCT.md). It's not that hard to stay respectful.
## Pull Requests
Before creating a pull request (PR), check out [ARCHITECTURE](ARCHITECTURE.md) for a short introduction to the way that the BDFR is coded and how the code is organised. Also read the [Style Guide](#style-guide) below before actually writing any code.
Once you have done both of these, the below list shows the path that should be followed when writing a PR.
1. If an issue does not already exist, open one that will relate to the PR.
2. Ensure that any changes fit into the architecture specified above.
3. Ensure that you have written tests that cover the new code
4. Ensure that no existing tests fail, unless there is a good reason for them to do so. If there is, note why in the PR.
5. If needed, update any documentation with changes.
6. Open a pull request that references the relevant issue.
7. Expect changes or suggestions and heed the Code of Conduct. We're all volunteers here.
Someone will review your pull request as soon as possible, but remember that all maintainers are volunteers and this won't happen immediately. Once it is approved, congratulations! Your code is now part of the BDFR.
## Style Guide
The BDFR must conform to PEP8 standard wherever there is Python code, with one exception. Line lengths may extend to 120 characters, but all other PEP8 standards must be followed.
It's easy to format your code without any manual work via a variety of tools. Autopep8 is a good one, and can be used with `autopep8 --max-line-length 120` which will format the code according to the style in use with the BDFR.