mirror of synced 2024-05-19 19:52:41 +12:00

Format documents according to line length

Serene-Arc 2023-06-25 14:26:12 +10:00
parent 3584ff1953
commit 86101a27b2
3 changed files with 168 additions and 87 deletions

@ -1,41 +1,89 @@
# Architecture
When the project was rewritten for v2, the goal was to make the codebase easily extensible and much easier to read and modify. To that end, this document provides a step-by-step look through the process that the BDFR goes through, so that any prospective developers can more easily grasp the way the code works.
When the project was rewritten for v2, the goal was to make the codebase easily
extensible and much easier to read and modify. To that end, this document
provides a step-by-step look through the process that the BDFR goes through, so
that any prospective developers can more easily grasp the way the code works.
## Design Ethos
The BDFR is designed to be a stateless downloader. This means that the state of the program is forgotten between each run of the program. There are no central lists, databases, or indices, that the BDFR uses, only the actual files on disk. There are several advantages to this approach:
The BDFR is designed to be a stateless downloader. This means that the state of
the program is forgotten between each run of the program. There are no central
lists, databases, or indices, that the BDFR uses, only the actual files on
disk. There are several advantages to this approach:
1. There is no chance of the database being corrupted or changed by something other than the BDFR, rendering the BDFR's "idea" of the archive wrong or incomplete.
2. Any information about the archive is contained by the archive itself i.e. for a list of all submission IDs in the archive, this can be extracted from the names of the files in said archive, assuming an appropriate naming scheme was used.
3. Archives can be merged, split, or edited without worrying about having to update a central database
4. There are no versioning issues between updates of the BDFR, where old versions are stuck with a worse form of the database
5. An archive can be put on a USB, moved to another computer with possibly a very different BDFR version, and work completely fine
1. There is no chance of the database being corrupted or changed by something
other than the BDFR, rendering the BDFR's "idea" of the archive wrong or
incomplete.
2. Any information about the archive is contained by the archive itself i.e.
for a list of all submission IDs in the archive, this can be extracted from
the names of the files in said archive, assuming an appropriate naming
scheme was used.
3. Archives can be merged, split, or edited without worrying about having to
update a central database
4. There are no versioning issues between updates of the BDFR, where old
versions are stuck with a worse form of the database
5. An archive can be put on a USB, moved to another computer with possibly
a very different BDFR version, and work completely fine
Another major part of the ethos of the design is DOTADIW, Do One Thing And Do It Well. It's a major part of Unix philosophy and states that each tool should have a well-defined, limited purpose. To this end, the BDFR is, as the name implies, a *downloader*. That is the scope of the tool. Managing the files downloaded can be for better-suited programs, since the BDFR is not a file manager. Nor does the BDFR concern itself with how any of the data downloaded is displayed, changed, parsed, or analysed. This makes the BDFR suitable for data science-related tasks, archiving, personal downloads, or analysis of various Reddit sources as the BDFR is completely agnostic on how the data is used.
Another major part of the ethos of the design is DOTADIW, Do One Thing And Do
It Well. It's a major part of Unix philosophy and states that each tool should
have a well-defined, limited purpose. To this end, the BDFR is, as the name
implies, a *downloader*. That is the scope of the tool. Managing the files
downloaded can be for better-suited programs, since the BDFR is not a file
manager. Nor does the BDFR concern itself with how any of the data downloaded is
displayed, changed, parsed, or analysed. This makes the BDFR suitable for data
science-related tasks, archiving, personal downloads, or analysis of various
Reddit sources as the BDFR is completely agnostic on how the data is used.
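As a hedged illustration of the stateless approach, the sketch below recovers submission IDs purely from the file names in an archive. It assumes a hypothetical naming scheme in which every file stem ends with `_<submission id>`; the function name `submission_ids_in_archive` and the regular expression are illustrative only and are not part of the BDFR.

```python
import re
from pathlib import Path

# Assumption: files are named "<title>_<submission id>.<extension>".
# This pattern is hypothetical and must match whatever scheme was actually used.
ID_PATTERN = re.compile(r"_([a-z0-9]{5,8})$")


def submission_ids_in_archive(archive_root: Path) -> set[str]:
    """Collect submission IDs from file names alone; no database is consulted."""
    found: set[str] = set()
    for path in archive_root.rglob("*"):
        if not path.is_file():
            continue
        match = ID_PATTERN.search(path.stem)
        if match:
            found.add(match.group(1))
    return found
```

Because the archive itself is the only source of truth, the same function works unchanged on a merged, split, or copied archive.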
## The Download Process
The BDFR is organised around a central object, the RedditDownloader class. The Archiver object extends and inherits from this class.
The BDFR is organised around a central object, the RedditDownloader class. The
Archiver object extends and inherits from this class.
1. The RedditDownloader parses all the arguments and configuration options, held in the Configuration object, and creates a variety of internal objects for use, such as the file name formatter, download filter, etc.
2. The RedditDownloader scrapes raw submissions from Reddit via several methods relating to different sources. A source is defined as a single stream of submissions from a subreddit, multireddit, or user list.
3. These raw submissions are passed to the DownloaderFactory class to select the specialised downloader class to use. Each of these are for a specific website or link type, with some catch-all classes like Direct.
4. The BaseDownloader child, spawned by DownloaderFactory, takes the link and does any necessary processing to find the direct link to the actual resource.
5. This is returned to the RedditDownloader in the form of a Resource object. This holds the URL and some other information for the final resource.
1. The RedditDownloader parses all the arguments and configuration options,
held in the Configuration object, and creates a variety of internal objects
for use, such as the file name formatter, download filter, etc.
2. The RedditDownloader scrapes raw submissions from Reddit via several methods
relating to different sources. A source is defined as a single stream of
submissions from a subreddit, multireddit, or user list.
3. These raw submissions are passed to the DownloaderFactory class to select
the specialised downloader class to use. Each of these are for a specific
website or link type, with some catch-all classes like Direct.
4. The BaseDownloader child, spawned by DownloaderFactory, takes the link and
does any necessary processing to find the direct link to the actual
resource.
5. This is returned to the RedditDownloader in the form of a Resource object.
This holds the URL and some other information for the final resource.
6. The Resource is passed through the DownloadFilter instantiated in step 1.
7. The destination file name for the Resource is calculated. If it already exists, then the Resource will be discarded.
8. Here the actual data is downloaded to the Resource and a hash calculated which is used to find duplicates.
7. The destination file name for the Resource is calculated. If it already
exists, then the Resource will be discarded.
8. Here the actual data is downloaded to the Resource and a hash calculated
which is used to find duplicates.
9. Only then is the Resource written to the disk.
This is the step-by-step process that the BDFR goes through to download a Reddit post.
This is the step-by-step process that the BDFR goes through to download
a Reddit post.
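Steps 6 to 9 can be sketched roughly as follows. This is a simplified illustration and not the BDFR's actual code: the `Resource` dataclass, the `process_resource` function, the extension-based filter, the injected `fetch` callable, and the choice of SHA-256 are all assumptions made so that the example is self-contained.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class Resource:
    """Hypothetical stand-in for the Resource object: a URL plus its data."""

    url: str
    extension: str
    content: Optional[bytes] = None
    hash: Optional[str] = None


def process_resource(
    resource: Resource,
    destination: Path,
    allowed_extensions: set[str],
    seen_hashes: set[str],
    fetch: Callable[[str], bytes],
) -> bool:
    """Return True only if the resource ends up written to disk."""
    # Step 6: apply the download filter (here a simple extension check).
    if resource.extension not in allowed_extensions:
        return False
    # Step 7: discard the resource if the destination file already exists.
    if destination.exists():
        return False
    # Step 8: download the data, then hash it to detect duplicates.
    resource.content = fetch(resource.url)
    resource.hash = hashlib.sha256(resource.content).hexdigest()
    if resource.hash in seen_hashes:
        return False
    seen_hashes.add(resource.hash)
    # Step 9: only now is anything written to disk.
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(resource.content)
    return True
```

Passing `fetch` in as a parameter keeps the sketch testable offline; a real run would supply a function that performs the HTTP request.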
## Adding another Supported Site
This is one of the easiest changes to do with the code. First, any new class must inherit from the BaseDownloader class which provides an abstract parent to implement. However, take note of the other classes as well. Many downloaders can inherit from one another instead of just the BaseDownloader. For example, the VReddit class, used for downloading video from Reddit, inherits almost all of its code from the YouTube class. **Minimise code duplication wherever possible**.
This is one of the easiest changes to do with the code. First, any new class
must inherit from the BaseDownloader class which provides an abstract parent to
implement. However, take note of the other classes as well. Many downloaders
can inherit from one another instead of just the BaseDownloader. For example,
the VReddit class, used for downloading video from Reddit, inherits almost all
of its code from the YouTube class. **Minimise code duplication wherever
possible**.
Once the downloader class has been written **and tests added** for it as well, then the regex string for the site's URLs can be added to the DownloaderFactory. Then additional tests must be added for the DownloadFactory to ensure that the appropriate classes are called when the right URLs are passed to the factory.
Once the downloader class has been written **and tests added** for it as well,
then the regex string for the site's URLs can be added to the
DownloaderFactory. Then additional tests must be added for the DownloadFactory
to ensure that the appropriate classes are called when the right URLs are
passed to the factory.
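The shape of such a change might look like the sketch below. Everything in it is a simplified stand-in written for illustration: the method names `find_resources` and `select_downloader`, the `ExampleSite` class, the `example.com` URLs, and the use of `ValueError` are assumptions, and the real classes in the BDFR have their own signatures that should be checked in the source.

```python
import re
from abc import ABC, abstractmethod


class BaseDownloader(ABC):
    """Simplified stand-in for the abstract parent of all site downloaders."""

    def __init__(self, url: str) -> None:
        self.url = url

    @abstractmethod
    def find_resources(self) -> list[str]:
        """Return the direct URL(s) of the actual resource(s)."""


class Direct(BaseDownloader):
    """Catch-all for links that already point straight at a file."""

    def find_resources(self) -> list[str]:
        return [self.url]


class ExampleSite(BaseDownloader):
    """Hypothetical new site whose pages live at example.com/view/<id>."""

    def find_resources(self) -> list[str]:
        media_id = self.url.rstrip("/").rsplit("/", 1)[-1]
        return [f"https://example.com/files/{media_id}.jpg"]


class DownloaderFactory:
    # Site-specific patterns come first; the catch-all goes last.
    _site_patterns = (
        (re.compile(r"^https?://(www\.)?example\.com/view/\w+/?$"), ExampleSite),
        (re.compile(r"\.(jpe?g|png|gif)$", re.IGNORECASE), Direct),
    )

    @classmethod
    def select_downloader(cls, url: str) -> type[BaseDownloader]:
        for pattern, downloader_class in cls._site_patterns:
            if pattern.search(url):
                return downloader_class
        raise ValueError(f"No downloader matches {url}")
```

A factory test then only needs to assert that a given URL maps to the expected class, which is why the downloader and the factory can be tested separately.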
## Adding Other Features
For a fundamentally different form of execution path for the program, such as the difference between the `archive` and `download` commands, it is best to inherit from the RedditDownloader class and override or add functionality as needed.
For a fundamentally different form of execution path for the program, such as
the difference between the `archive` and `download` commands, it is best to
inherit from the RedditDownloader class and override or add functionality as
needed.

@ -2,17 +2,14 @@
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, gender identity and expression, level of experience,
education, socio-economic status, nationality, personal appearance, race,
religion, or sexual identity and orientation.
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making
participation in our project and our community a harassment-free experience for everyone, regardless of age, body size,
disability, ethnicity, gender identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
Examples of behavior that contributes to creating a positive environment include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
@ -22,53 +19,41 @@ include:
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* The use of sexualized language or imagery and unwelcome sexual attention or advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
* Publishing others' private information, such as a physical or electronic address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take
appropriate and fair corrective action in response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits,
issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any
contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the
project or its community. Examples of representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed representative at an online or offline
event. Representation of a project may be further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team via Discord. All complaints will
be reviewed and investigated and will result in a response that is deemed
necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an
incident. Further details of specific enforcement policies may be posted
separately.
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team via
Discord. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and
appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter
of an incident. Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent
repercussions as determined by other members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at <https://www.contributor-covenant.org/version/1/4/code-of-conduct.html>
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at
<https://www.contributor-covenant.org/version/1/4/code-of-conduct.html>
[homepage]: https://www.contributor-covenant.org

@ -1,32 +1,52 @@
# Contributing
When making a contribution to the BDFR project, please open an issue beforehand so that the maintainers can weigh in on it. This helps create a trail on GitHub and keeps things organised.
When making a contribution to the BDFR project, please open an issue beforehand so that the maintainers can weigh in on
it. This helps create a trail on GitHub and keeps things organised.
**Please don't open an issue on GitHub** unless you are reporting a bug or proposing a feature. For questions, there is a discussion tab on the repository's GitHub page where you can interact with the developers and ask questions. If you believe that something is a bug, or that a feature should be added, then by all means open an issue.
**Please don't open an issue on GitHub** unless you are reporting a bug or proposing a feature. For questions, there is
a discussion tab on the repository's GitHub page where you can interact with the developers and ask questions. If you
believe that something is a bug, or that a feature should be added, then by all means open an issue.
All communication on GitHub, Discord, email, or any other medium must conform to the [Code of Conduct](CODE_OF_CONDUCT.md). It's not that hard to stay respectful.
All communication on GitHub, Discord, email, or any other medium must conform to the [Code of
Conduct](CODE_OF_CONDUCT.md). It's not that hard to stay respectful.
## Opening an Issue
**Before opening a new issue**, be sure that no issues regarding your problem already exist. If a similar issue exists, try to contribute to the issue.
**Before opening a new issue**, be sure that no issues regarding your problem already exist. If a similar issue exists,
try to contribute to the issue.
**If you are asking a question** about the functioning of the BDFR or the interface, please use the discussions page. Bug reports are not the right medium for asking and answering questions, and the discussions page makes it much easier to discuss, answer, and save questions and responses for others going forwards.
**If you are asking a question** about the functioning of the BDFR or the interface, please use the discussions page.
Bug reports are not the right medium for asking and answering questions, and the discussions page makes it much easier
to discuss, answer, and save questions and responses for others going forwards.
### Bugs
When opening an issue about a bug, **please provide the full log file for the run in which the bug occurred**. This log file is named `log_output.txt` in the configuration folder. Check the [README](../README.md) for information on where this is. This log file will contain all the information required for the developers to recreate the bug.
When opening an issue about a bug, **please provide the full log file for the run in which the bug occurred**. This log
file is named `log_output.txt` in the configuration folder. Check the [README](../README.md) for information on where
this is. This log file will contain all the information required for the developers to recreate the bug.
If you do not have or cannot find the log file, then at minimum please provide the **Reddit ID for the submission** or comment which caused the issue. Also copy in the command that you used to run the BDFR from the command line, as that will also provide helpful information when trying to find and fix the bug. If needed, more information will be asked for in the thread of the bug.
If you do not have or cannot find the log file, then at minimum please provide the **Reddit ID for the submission** or
comment which caused the issue. Also copy in the command that you used to run the BDFR from the command line, as that
will also provide helpful information when trying to find and fix the bug. If needed, more information will be asked
for in the thread of the bug.
Adding this information is **not optional**. If a bug report is opened without this information, it cannot be replicated by developers. The logs will be asked for once and if they are not supplied, the issue will be closed due to lack of information.
Adding this information is **not optional**. If a bug report is opened without this information, it cannot be replicated
by developers. The logs will be asked for once and if they are not supplied, the issue will be closed due to lack of
information.
### Feature requests
In the case of requesting a feature or an enhancement, there are fewer requirements. However, please be clear in what you would like the BDFR to do and also how the feature/enhancement would be used or would be useful to more people. It is crucial that the feature is justified. Any feature request without a concrete reason for it to be implemented has a very small chance to get accepted. Be aware that proposed enhancements may be rejected for multiple reasons, or no reason, at the discretion of the developers.
In the case of requesting a feature or an enhancement, there are fewer requirements. However, please be clear in what
you would like the BDFR to do and also how the feature/enhancement would be used or would be useful to more people. It
is crucial that the feature is justified. Any feature request without a concrete reason for it to be implemented has
a very small chance to get accepted. Be aware that proposed enhancements may be rejected for multiple reasons, or no
reason, at the discretion of the developers.
## Pull Requests
Before creating a pull request (PR), check out [ARCHITECTURE](ARCHITECTURE.md) for a short introduction to the way that the BDFR is coded and how the code is organised. Also read the [Style Guide](#style-guide) section below before actually writing any code.
Before creating a pull request (PR), check out [ARCHITECTURE](ARCHITECTURE.md) for a short introduction to the way that
the BDFR is coded and how the code is organised. Also read the [Style Guide](#style-guide) section below before actually
writing any code.
Once you have done both of these, the below list shows the path that should be followed when writing a PR.
@ -38,13 +58,15 @@ Once you have done both of these, the below list shows the path that should be f
6. Open a pull request that references the relevant issue.
7. Expect changes or suggestions and heed the Code of Conduct. We're all volunteers here.
Someone will review your pull request as soon as possible, but remember that all maintainers are volunteers and this won't happen immediately. Once it is approved, congratulations! Your code is now part of the BDFR.
Someone will review your pull request as soon as possible, but remember that all maintainers are volunteers and this
won't happen immediately. Once it is approved, congratulations! Your code is now part of the BDFR.
## Preparing the environment for development
Bulk Downloader for Reddit requires Python 3.9 at minimum. First, ensure that your Python installation satisfies this.
BDfR is built in a way that it can be packaged and installed via `pip`. This places BDfR next to other Python packages and enables you to run the program from any directory. Since it is managed by pip, you can also uninstall it.
BDfR is built in a way that it can be packaged and installed via `pip`. This places BDfR next to other Python packages
and enables you to run the program from any directory. Since it is managed by pip, you can also uninstall it.
To install the program, clone the repository and run pip inside the project's root directory:
@ -54,7 +76,9 @@ cd ./bulk-downloader-for-reddit
python3 -m pip install -e .
```
The **`-e`** parameter creates a link to that folder. That is, any change inside the folder affects the package immediately. So, when developing, you can be sure that the package is not stale and Python is always running your latest changes. (Due to this linking, moving/removing/renaming the folder might break it.)
The **`-e`** parameter creates a link to that folder. That is, any change inside the folder affects the package
immediately. So, when developing, you can be sure that the package is not stale and Python is always running your latest
changes. (Due to this linking, moving/removing/renaming the folder might break it.)
Then, you can run the program from anywhere in your disk as such:
@ -62,7 +86,8 @@ Then, you can run the program from anywhere in your disk as such:
bdfr
```
There are additional Python packages that are required to develop the BDFR. These can be installed with the following command:
There are additional Python packages that are required to develop the BDFR. These can be installed with the following
command:
```bash
python3 -m pip install -e .[dev]
@ -78,30 +103,40 @@ The BDFR project uses several tools to manage the code of the project. These inc
- [tox](https://tox.wiki/en/latest/)
- [pre-commit](https://github.com/pre-commit/pre-commit)
The first three tools are formatters. These change the code to the standards expected for the BDFR project. The configuration details for these tools are contained in the [pyproject.toml](../pyproject.toml) file for the project.
The first three tools are formatters. These change the code to the standards expected for the BDFR project. The
configuration details for these tools are contained in the [pyproject.toml](../pyproject.toml) file for the project.
The tool `tox` is used to run tests and tools on demand and has the following environments:
- `format`
- `format_check`
The tool `pre-commit` is optional, and runs the three formatting tools automatically when a commit is made. This is **highly recommended** to ensure that all code submitted for this project is formatted acceptably. Note that any PR that does not follow the formatting guide will not be accepted. For information on how to use pre-commit to avoid this, see [the pre-commit documentation](https://pre-commit.com/).
The tool `pre-commit` is optional, and runs the three formatting tools automatically when a commit is made. This is
**highly recommended** to ensure that all code submitted for this project is formatted acceptably. Note that any PR that
does not follow the formatting guide will not be accepted. For information on how to use pre-commit to avoid this, see
[the pre-commit documentation](https://pre-commit.com/).
## Style Guide
The BDFR uses the Black formatting standard and enforces this with the tool by the same name. Additionally, the tool isort is used as well to format imports.
The BDFR uses the Black formatting standard and enforces this with the tool by the same name. Additionally, the tool
isort is used as well to format imports.
See [Preparing the Environment for Development](#preparing-the-environment-for-development) for how to set up these tools to run automatically.
See [Preparing the Environment for Development](#preparing-the-environment-for-development) for how to set up these
tools to run automatically.
## Tests
### Running Tests
There are a lot of tests in the BDFR. In fact, there are more tests than lines of functional code. This is one of the strengths of the BDFR in that it is fully tested. The codebase uses the package pytest to create the tests, which is a third-party package that provides many functions and objects useful for testing Python code.
There are a lot of tests in the BDFR. In fact, there are more tests than lines of functional code. This is one of the
strengths of the BDFR in that it is fully tested. The codebase uses the package pytest to create the tests, which is
a third-party package that provides many functions and objects useful for testing Python code.
When submitting a PR, it is required that you run **all** possible tests to ensure that any new commits haven't broken anything. Otherwise, while writing the request, it can be helpful (and much quicker) to run only a subset of the tests.
When submitting a PR, it is required that you run **all** possible tests to ensure that any new commits haven't broken
anything. Otherwise, while writing the request, it can be helpful (and much quicker) to run only a subset of the tests.
This is accomplished with marks, a system that pytest uses to categorise tests. The following marks are currently in use in the BDFR test suite:
This is accomplished with marks, a system that pytest uses to categorise tests. The following marks are currently in use
in the BDFR test suite:
- `slow`
- This marks a test that may take a long time to complete
@ -113,7 +148,9 @@ This is accomplished with marks, a system that pytest uses to categorise tests.
- `authenticated`
- This marks a test that requires a test configuration file with a valid OAuth2 token
These tests can be run either all at once, or excluding certain marks. The tests that require online resources, such as those marked `reddit` or `online`, will naturally require more time to run than tests that are entirely offline. To run tests, you must be in the root directory of the project and can use the following command.
These tests can be run either all at once, or excluding certain marks. The tests that require online resources, such as
those marked `reddit` or `online`, will naturally require more time to run than tests that are entirely offline. To run
tests, you must be in the root directory of the project and can use the following command.
```bash
pytest
@ -128,18 +165,29 @@ pytest -m "not reddit and not authenticated"
### Configuration for authenticated tests
There should be a configuration file `test_config.cfg` in the project's root directory to be able to run the integration tests with Reddit authentication. See how to create such files [here](../README.md#configuration). The easiest way of creating this file is copying your existing `default_config.cfg` file from the path stated in the previous link and renaming it to `test_config.cfg`. Be sure that the `user_token` key exists in `test_config.cfg`.
There should be a configuration file `test_config.cfg` in the project's root directory to be able to run the integration
tests with Reddit authentication. See how to create such files [here](../README.md#configuration). The easiest way of
creating this file is copying your existing `default_config.cfg` file from the path stated in the previous link and
renaming it to `test_config.cfg`. Be sure that the `user_token` key exists in `test_config.cfg`.
---
For more details, review the pytest documentation that is freely available online.
Many IDEs also provide integrated functionality to run and display the results from tests, and almost all of them support pytest in some capacity. This would be the recommended method due to the additional debugging and general capabilities.
Many IDEs also provide integrated functionality to run and display the results from tests, and almost all of them
support pytest in some capacity. This would be the recommended method due to the additional debugging and general
capabilities.
### Writing Tests
When writing tests, ensure that they follow the style guide. The BDFR uses pytest to run tests. Wherever possible, parameterise tests, even if you only have one test case. This makes it easier to expand in the future, as the ultimate goal is to have multiple test cases for every test, instead of just one.
When writing tests, ensure that they follow the style guide. The BDFR uses pytest to run tests. Wherever possible,
parameterise tests, even if you only have one test case. This makes it easier to expand in the future, as the ultimate
goal is to have multiple test cases for every test, instead of just one.
If required, use of mocks is expected to simplify tests and reduce the resources or complexity required. Tests should be as small as possible and test as small a part of the code as possible. Comprehensive or integration tests are run with the `click` framework and are located in their own file.
If required, use of mocks is expected to simplify tests and reduce the resources or complexity required. Tests should be
as small as possible and test as small a part of the code as possible. Comprehensive or integration tests are run with
the `click` framework and are located in their own file.
It is also expected that new tests be classified correctly with the marks described above i.e. if a test accesses Reddit through a `reddit_instance` object, it must be given the `reddit` mark. If it requires an authenticated Reddit instance, then it must have the `authenticated` mark.
It is also expected that new tests be classified correctly with the marks described above i.e. if a test accesses Reddit
through a `reddit_instance` object, it must be given the `reddit` mark. If it requires an authenticated Reddit instance,
then it must have the `authenticated` mark.
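A parameterised test with a single case might look like the sketch below. The helper `sanitise_file_name` and its test are hypothetical and exist only to make the example self-contained; `pytest.mark.parametrize` is the standard pytest decorator.

```python
import pytest


def sanitise_file_name(name: str) -> str:
    """Hypothetical helper under test: replace path separators in a name."""
    return name.replace("/", "_")


# Parameterised even though there is only one case, so that further cases
# can be added later as extra tuples without touching the test body.
@pytest.mark.parametrize(
    ("test_name", "expected"),
    (("with/slash", "with_slash"),),
)
def test_sanitise_file_name(test_name: str, expected: str):
    assert sanitise_file_name(test_name) == expected
```

Marks are applied in the same way, as additional decorators: a test that takes the `reddit_instance` fixture would also carry `@pytest.mark.reddit` above its definition.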