diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 8fc4e13..d7aa99e 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -1,41 +1,89 @@ # Architecture -When the project was rewritten for v2, the goal was to make the codebase easily extensible and much easier to read and modify. However, this document provides a step-by-step look through the process that the BDFR goes through, so that any prospective developers can more easily grasp the way the code works. +When the project was rewritten for v2, the goal was to make the codebase easily +extensible and much easier to read and modify. However, this document provides +a step-by-step look through the process that the BDFR goes through, so that any +prospective developers can more easily grasp the way the code works. ## Design Ethos -The BDFR is designed to be a stateless downloader. This means that the state of the program is forgotten between each run of the program. There are no central lists, databases, or indices, that the BDFR uses, only the actual files on disk. There are several advantages to this approach: +The BDFR is designed to be a stateless downloader. This means that the state of +the program is forgotten between each run of the program. There are no central +lists, databases, or indices, that the BDFR uses, only the actual files on +disk. There are several advantages to this approach: -1. There is no chance of the database being corrupted or changed by something other than the BDFR, rendering the BDFR's "idea" of the archive wrong or incomplete. -2. Any information about the archive is contained by the archive itself i.e. for a list of all submission IDs in the archive, this can be extracted from the names of the files in said archive, assuming an appropriate naming scheme was used. -3. Archives can be merged, split, or editing without worrying about having to update a central database -4. 
There are no versioning issues between updates of the BDFR, where old version are stuck with a worse form of the database -5. An archive can be put on a USB, moved to another computer with possibly a very different BDFR version, and work completely fine +1. There is no chance of the database being corrupted or changed by something + other than the BDFR, rendering the BDFR's "idea" of the archive wrong or + incomplete. +2. Any information about the archive is contained by the archive itself i.e. + for a list of all submission IDs in the archive, this can be extracted from + the names of the files in said archive, assuming an appropriate naming + scheme was used. +3. Archives can be merged, split, or edited without worrying about having to + update a central database +4. There are no versioning issues between updates of the BDFR, where old + versions are stuck with a worse form of the database +5. An archive can be put on a USB, moved to another computer with possibly + a very different BDFR version, and work completely fine -Another major part of the ethos of the design is DOTADIW, Do One Thing And Do It Well. It's a major part of Unix philosophy and states that each tool should have a well-defined, limited purpose. To this end, the BDFR is, as the name implies, a *downloader*. That is the scope of the tool. Managing the files downloaded can be for better-suited programs, since the BDFR is not a file manager. Nor the BDFR concern itself with how any of the data downloaded is displayed, changed, parsed, or analysed. This makes the BDFR suitable for data science-related tasks, archiving, personal downloads, or analysis of various Reddit sources as the BDFR is completely agnostic on how the data is used. +Another major part of the ethos of the design is DOTADIW, Do One Thing And Do +It Well. It's a major part of Unix philosophy and states that each tool should +have a well-defined, limited purpose. To this end, the BDFR is, as the name +implies, a *downloader*.
That is the scope of the tool. Managing the files +downloaded can be left to better-suited programs, since the BDFR is not a file +manager. Nor does the BDFR concern itself with how any of the data downloaded is +displayed, changed, parsed, or analysed. This makes the BDFR suitable for data +science-related tasks, archiving, personal downloads, or analysis of various +Reddit sources as the BDFR is completely agnostic on how the data is used. ## The Download Process -The BDFR is organised around a central object, the RedditDownloader class. The Archiver object extends and inherits from this class. +The BDFR is organised around a central object, the RedditDownloader class. The +Archiver object extends and inherits from this class. -1. The RedditDownloader parses all the arguments and configuration options, held in the Configuration object, and creates a variety of internal objects for use, such as the file name formatter, download filter, etc. -2. The RedditDownloader scrapes raw submissions from Reddit via several methods relating to different sources. A source is defined as a single stream of submissions from a subreddit, multireddit, or user list. -3. These raw submissions are passed to the DownloaderFactory class to select the specialised downloader class to use. Each of these are for a specific website or link type, with some catch-all classes like Direct. -4. The BaseDownloader child, spawned by DownloaderFactory, takes the link and does any necessary processing to find the direct link to the actual resource. -5. This is returned to the RedditDownloader in the form of a Resource object. This holds the URL and some other information for the final resource. +1. The RedditDownloader parses all the arguments and configuration options, + held in the Configuration object, and creates a variety of internal objects + for use, such as the file name formatter, download filter, etc. +2.
The RedditDownloader scrapes raw submissions from Reddit via several methods + relating to different sources. A source is defined as a single stream of + submissions from a subreddit, multireddit, or user list. +3. These raw submissions are passed to the DownloaderFactory class to select + the specialised downloader class to use. Each of these are for a specific + website or link type, with some catch-all classes like Direct. +4. The BaseDownloader child, spawned by DownloaderFactory, takes the link and + does any necessary processing to find the direct link to the actual + resource. +5. This is returned to the RedditDownloader in the form of a Resource object. + This holds the URL and some other information for the final resource. 6. The Resource is passed through the DownloadFilter instantiated in step 1. -7. The destination file name for the Resource is calculated. If it already exists, then the Resource will be discarded. -8. Here the actual data is downloaded to the Resource and a hash calculated which is used to find duplicates. +7. The destination file name for the Resource is calculated. If it already + exists, then the Resource will be discarded. +8. Here the actual data is downloaded to the Resource and a hash calculated + which is used to find duplicates. 9. Only then is the Resource written to the disk. -This is the step-by-step process that the BDFR goes through to download a Reddit post. +This is the step-by-step process that the BDFR goes through to download +a Reddit post. ## Adding another Supported Site -This is one of the easiest changes to do with the code. First, any new class must inherit from the BaseDownloader class which provided an abstract parent to implement. However, take note of the other classes as well. Many downloaders can inherit from one another instead of just the BaseDownloader. For example, the VReddit class, used for downloading video from Reddit, inherits almost all of its code from the YouTube class. 
**Minimise code duplication wherever possible**. +This is one of the easiest changes to do with the code. First, any new class +must inherit from the BaseDownloader class which provides an abstract parent to +implement. However, take note of the other classes as well. Many downloaders +can inherit from one another instead of just the BaseDownloader. For example, +the VReddit class, used for downloading video from Reddit, inherits almost all +of its code from the YouTube class. **Minimise code duplication wherever +possible**. -Once the downloader class has been written **and tests added** for it as well, then the regex string for the site's URLs can be added to the DownloaderFactory. Then additional tests must be added for the DownloadFactory to ensure that the appropriate classes are called when the right URLs are passed to the factory. +Once the downloader class has been written **and tests added** for it as well, +then the regex string for the site's URLs can be added to the +DownloaderFactory. Then additional tests must be added for the DownloaderFactory +to ensure that the appropriate classes are called when the right URLs are +passed to the factory. ## Adding Other Features -For a fundamentally different form of execution path for the program, such as the difference between the `archive` and `download` commands, it is best to inherit from the RedditDownloader class and override or add functionality as needed. +For a fundamentally different form of execution path for the program, such as +the difference between the `archive` and `download` commands, it is best to +inherit from the RedditDownloader class and override or add functionality as +needed.
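As a rough illustration of the "Adding another Supported Site" steps in ARCHITECTURE.md, the sketch below shows one way a regex-driven factory could pick a downloader class for a URL. It is not the BDFR's actual code: the `find_resources` and `select_downloader` method names, the `ExampleSite` class, and the `example.com` URLs are all assumptions made up for this example. Only the `BaseDownloader`, `Direct`, and `DownloaderFactory` names come from the document.

```python
import re
from abc import ABC, abstractmethod


class BaseDownloader(ABC):
    """Abstract parent; every site-specific downloader implements find_resources."""

    def __init__(self, url: str) -> None:
        self.url = url

    @abstractmethod
    def find_resources(self) -> list[str]:
        """Return the direct URL(s) of the actual resource(s)."""


class Direct(BaseDownloader):
    """Catch-all for links that already point straight at a file."""

    def find_resources(self) -> list[str]:
        return [self.url]


class ExampleSite(BaseDownloader):
    """Hypothetical new site whose pages wrap a file hosted on cdn.example.com."""

    def find_resources(self) -> list[str]:
        post_id = self.url.rstrip("/").rsplit("/", 1)[-1]
        return [f"https://cdn.example.com/{post_id}.jpg"]


class DownloaderFactory:
    # Ordered (pattern, class) pairs; the first matching pattern wins.
    # Supporting a new site means adding one line here (hypothetical layout).
    _PATTERNS = [
        (re.compile(r"(www\.)?example\.com/view/\w+"), ExampleSite),
        (re.compile(r".+\.(jpe?g|png|gif|mp4)$"), Direct),
    ]

    @staticmethod
    def select_downloader(url: str) -> type[BaseDownloader]:
        # Drop the scheme so the patterns only describe host and path
        stripped = re.sub(r"^https?://", "", url)
        for pattern, downloader in DownloaderFactory._PATTERNS:
            if pattern.match(stripped):
                return downloader
        raise ValueError(f"No downloader exists for url {url}")
```

The factory tests mentioned in the document would then assert that a given URL maps to the expected class, for instance that an `example.com/view/...` link yields `ExampleSite` and a bare `.png` link yields `Direct`.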
diff --git a/docs/CODE_OF_CONDUCT.md b/docs/CODE_OF_CONDUCT.md index fe0374d..70e7e37 100644 --- a/docs/CODE_OF_CONDUCT.md +++ b/docs/CODE_OF_CONDUCT.md @@ -2,17 +2,14 @@ ## Our Pledge -In the interest of fostering an open and welcoming environment, we as -contributors and maintainers pledge to making participation in our project and -our community a harassment-free experience for everyone, regardless of age, body -size, disability, ethnicity, gender identity and expression, level of experience, -education, socio-economic status, nationality, personal appearance, race, -religion, or sexual identity and orientation. +In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making +participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, +disability, ethnicity, gender identity and expression, level of experience, education, socio-economic status, +nationality, personal appearance, race, religion, or sexual identity and orientation. 
## Our Standards -Examples of behavior that contributes to creating a positive environment -include: +Examples of behavior that contributes to creating a positive environment include: * Using welcoming and inclusive language * Being respectful of differing viewpoints and experiences @@ -22,53 +19,41 @@ include: Examples of unacceptable behavior by participants include: -* The use of sexualized language or imagery and unwelcome sexual attention or - advances +* The use of sexualized language or imagery and unwelcome sexual attention or advances * Trolling, insulting/derogatory comments, and personal or political attacks * Public or private harassment -* Publishing others' private information, such as a physical or electronic - address, without explicit permission -* Other conduct which could reasonably be considered inappropriate in a - professional setting +* Publishing others' private information, such as a physical or electronic address, without explicit permission +* Other conduct which could reasonably be considered inappropriate in a professional setting ## Our Responsibilities -Project maintainers are responsible for clarifying the standards of acceptable -behavior and are expected to take appropriate and fair corrective action in -response to any instances of unacceptable behavior. +Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take +appropriate and fair corrective action in response to any instances of unacceptable behavior. -Project maintainers have the right and responsibility to remove, edit, or -reject comments, commits, code, wiki edits, issues, and other contributions -that are not aligned to this Code of Conduct, or to ban temporarily or -permanently any contributor for other behaviors that they deem inappropriate, -threatening, offensive, or harmful. 
+Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, +issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any +contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. ## Scope -This Code of Conduct applies both within project spaces and in public spaces -when an individual is representing the project or its community. Examples of -representing a project or community include using an official project e-mail -address, posting via an official social media account, or acting as an appointed -representative at an online or offline event. Representation of a project may be -further defined and clarified by project maintainers. +This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the +project or its community. Examples of representing a project or community include using an official project e-mail +address, posting via an official social media account, or acting as an appointed representative at an online or offline +event. Representation of a project may be further defined and clarified by project maintainers. ## Enforcement -Instances of abusive, harassing, or otherwise unacceptable behavior may be -reported by contacting the project team via Discord. All complaints will -be reviewed and investigated and will result in a response that is deemed -necessary and appropriate to the circumstances. The project team is -obligated to maintain confidentiality with regard to the reporter of an -incident. Further details of specific enforcement policies may be posted -separately. +Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team via +Discord. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and +appropriate to the circumstances. 
The project team is obligated to maintain confidentiality with regard to the reporter +of an incident. Further details of specific enforcement policies may be posted separately. -Project maintainers who do not follow or enforce the Code of Conduct in good -faith may face temporary or permanent repercussions as determined by other -members of the project's leadership. +Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent +repercussions as determined by other members of the project's leadership. ## Attribution -This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, -available at +This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at + [homepage]: https://www.contributor-covenant.org diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md index 1168863..8845936 100644 --- a/docs/CONTRIBUTING.md +++ b/docs/CONTRIBUTING.md @@ -1,32 +1,52 @@ # Contributing -When making a contribution to the BDFR project, please open an issue beforehand so that the maintainers can weigh in on it. This helps create a trail on GitHub and keeps things organised. +When making a contribution to the BDFR project, please open an issue beforehand so that the maintainers can weigh in on +it. This helps create a trail on GitHub and keeps things organised. -**Please don't open an issue on GitHub** unless you are reporting a bug or proposing a feature. For questions, there is a discussion tab on the repository's GitHub page where you can interact with the developers and ask questions. If you believe that something is a bug, or that a feature should be added, then by all means open an issue. +**Please don't open an issue on GitHub** unless you are reporting a bug or proposing a feature. For questions, there is +a discussion tab on the repository's GitHub page where you can interact with the developers and ask questions. 
If you +believe that something is a bug, or that a feature should be added, then by all means open an issue. -All communication on GitHub, Discord, email, or any other medium must conform to the [Code of Conduct](CODE_OF_CONDUCT.md). It's not that hard to stay respectful. +All communication on GitHub, Discord, email, or any other medium must conform to the [Code of +Conduct](CODE_OF_CONDUCT.md). It's not that hard to stay respectful. ## Opening an Issue -**Before opening a new issue**, be sure that no issues regarding your problem already exist. If a similar issue exists, try to contribute to the issue. +**Before opening a new issue**, be sure that no issues regarding your problem already exist. If a similar issue exists, +try to contribute to the issue. -**If you are asking a question** about the functioning of the BDFR or the interface, please use the discussions page. Bug reports are not the right medium for asking and answering questions, and the discussions page makes it much easier to discuss, answer, and save questions and responses for others going forwards. +**If you are asking a question** about the functioning of the BDFR or the interface, please use the discussions page. +Bug reports are not the right medium for asking and answering questions, and the discussions page makes it much easier +to discuss, answer, and save questions and responses for others going forwards. ### Bugs -When opening an issue about a bug, **please provide the full log file for the run in which the bug occurred**. This log file is named `log_output.txt` in the configuration folder. Check the [README](../README.md) for information on where this is. This log file will contain all the information required for the developers to recreate the bug. +When opening an issue about a bug, **please provide the full log file for the run in which the bug occurred**. This log +file is named `log_output.txt` in the configuration folder. 
Check the [README](../README.md) for information on where +this is. This log file will contain all the information required for the developers to recreate the bug. -If you do not have or cannot find the log file, then at minimum please provide the **Reddit ID for the submission** or comment which caused the issue. Also copy in the command that you used to run the BDFR from the command line, as that will also provide helpful information when trying to find and fix the bug. If needed, more information will be asked in the thread of the bug. +If you do not have or cannot find the log file, then at minimum please provide the **Reddit ID for the submission** or +comment which caused the issue. Also copy in the command that you used to run the BDFR from the command line, as that +will also provide helpful information when trying to find and fix the bug. If needed, more information will be asked in +the thread of the bug. -Adding this information is **not optional**. If a bug report is opened without this information, it cannot be replicated by developers. The logs will be asked for once and if they are not supplied, the issue will be closed due to lack of information. +Adding this information is **not optional**. If a bug report is opened without this information, it cannot be replicated +by developers. The logs will be asked for once and if they are not supplied, the issue will be closed due to lack of +information. ### Feature requests -In the case of requesting a feature or an enhancement, there are fewer requirements. However, please be clear in what you would like the BDFR to do and also how the feature/enhancement would be used or would be useful to more people. It is crucial that the feature is justified. Any feature request without a concrete reason for it to be implemented has a very small chance to get accepted. Be aware that proposed enhancements may be rejected for multiple reasons, or no reason, at the discretion of the developers. 
+In the case of requesting a feature or an enhancement, there are fewer requirements. However, please be clear in what +you would like the BDFR to do and also how the feature/enhancement would be used or would be useful to more people. It +is crucial that the feature is justified. Any feature request without a concrete reason for it to be implemented has +a very small chance to get accepted. Be aware that proposed enhancements may be rejected for multiple reasons, or no +reason, at the discretion of the developers. ## Pull Requests -Before creating a pull request (PR), check out [ARCHITECTURE](ARCHITECTURE.md) for a short introduction to the way that the BDFR is coded and how the code is organised. Also read the [Style Guide](#style-guide) section below before actually writing any code. +Before creating a pull request (PR), check out [ARCHITECTURE](ARCHITECTURE.md) for a short introduction to the way that +the BDFR is coded and how the code is organised. Also read the [Style Guide](#style-guide) section below before actually +writing any code. Once you have done both of these, the below list shows the path that should be followed when writing a PR. @@ -38,13 +58,15 @@ Once you have done both of these, the below list shows the path that should be f 6. Open a pull request that references the relevant issue. 7. Expect changes or suggestions and heed the Code of Conduct. We're all volunteers here. -Someone will review your pull request as soon as possible, but remember that all maintainers are volunteers and this won't happen immediately. Once it is approved, congratulations! Your code is now part of the BDFR. +Someone will review your pull request as soon as possible, but remember that all maintainers are volunteers and this +won't happen immediately. Once it is approved, congratulations! Your code is now part of the BDFR. ## Preparing the environment for development Bulk Downloader for Reddit requires Python 3.9 at minimum. 
First, ensure that your Python installation satisfies this. -BDfR is built in a way that it can be packaged and installed via `pip`. This places BDfR next to other Python packages and enables you to run the program from any directory. Since it is managed by pip, you can also uninstall it. +BDfR is built in a way that it can be packaged and installed via `pip`. This places BDfR next to other Python packages +and enables you to run the program from any directory. Since it is managed by pip, you can also uninstall it. To install the program, clone the repository and run pip inside the project's root directory: @@ -54,7 +76,9 @@ cd ./bulk-downloader-for-reddit python3 -m pip install -e . ``` -**`-e`** parameter creates a link to that folder. That is, any change inside the folder affects the package immidiately. So, when developing, you can be sure that the package is not stale and Python is always running your latest changes. (Due to this linking, moving/removing/renaming the folder might break it) +The **`-e`** parameter creates a link to that folder. That is, any change inside the folder affects the package immediately. +So, when developing, you can be sure that the package is not stale and Python is always running your latest changes. +(Due to this linking, moving/removing/renaming the folder might break it.) Then, you can run the program from anywhere in your disk as such: @@ -62,7 +86,8 @@ Then, you can run the program from anywhere in your disk as such: bdfr ``` -There are additional Python packages that are required to develop the BDFR. These can be installed with the following command: +There are additional Python packages that are required to develop the BDFR. These can be installed with the following +command: ```bash python3 -m pip install -e .[dev] @@ -78,30 +103,40 @@ The BDFR project uses several tools to manage the code of the project.
These inc - [tox](https://tox.wiki/en/latest/) - [pre-commit](https://github.com/pre-commit/pre-commit) -The first three tools are formatters. These change the code to the standards expected for the BDFR project. The configuration details for these tools are contained in the [pyproject.toml](../pyproject.toml) file for the project. +The first three tools are formatters. These change the code to the standards expected for the BDFR project. The +configuration details for these tools are contained in the [pyproject.toml](../pyproject.toml) file for the project. The tool `tox` is used to run tests and tools on demand and has the following environments: - `format` - `format_check` -The tool `pre-commit` is optional, and runs the three formatting tools automatically when a commit is made. This is **highly recommended** to ensure that all code submitted for this project is formatted acceptably. Note that any PR that does not follow the formatting guide will not be accepted. For information on how to use pre-commit to avoid this, see [the pre-commit documentation](https://pre-commit.com/). +The tool `pre-commit` is optional, and runs the three formatting tools automatically when a commit is made. This is +**highly recommended** to ensure that all code submitted for this project is formatted acceptably. Note that any PR that +does not follow the formatting guide will not be accepted. For information on how to use pre-commit to avoid this, see +[the pre-commit documentation](https://pre-commit.com/). ## Style Guide -The BDFR uses the Black formatting standard and enforces this with the tool by the same name. Additionally, the tool isort is used as well to format imports. +The BDFR uses the Black formatting standard and enforces this with the tool by the same name. Additionally, the tool +isort is used as well to format imports. -See [Preparing the Environment for Development](#preparing-the-environment-for-development) for how to setup these tools to run automatically. 
+See [Preparing the Environment for Development](#preparing-the-environment-for-development) for how to set up these tools +to run automatically. ## Tests ### Running Tests -There are a lot of tests in the BDFR. In fact, there are more tests than lines of functional code. This is one of the strengths of the BDFR in that it is fully tested. The codebase uses the package pytest to create the tests, which is a third-party package that provides many functions and objects useful for testing Python code. +There are a lot of tests in the BDFR. In fact, there are more tests than lines of functional code. This is one of the +strengths of the BDFR in that it is fully tested. The codebase uses the package pytest to create the tests, which is +a third-party package that provides many functions and objects useful for testing Python code. -When submitting a PR, it is required that you run **all** possible tests to ensure that any new commits haven't broken anything. Otherwise, while writing the request, it can be helpful (and much quicker) to run only a subset of the tests. +When submitting a PR, it is required that you run **all** possible tests to ensure that any new commits haven't broken +anything. Otherwise, while writing the request, it can be helpful (and much quicker) to run only a subset of the tests. -This is accomplished with marks, a system that pytest uses to categorise tests. There are currently the current marks in use in the BDFR test suite. +This is accomplished with marks, a system that pytest uses to categorise tests. The following marks are currently in +use in the BDFR test suite. - `slow` - This marks a test that may take a long time to complete @@ -113,7 +148,9 @@ This is accomplished with marks, a system that pytest uses to categorise tests. - `authenticated` - This marks a test that requires a test configuration file with a valid OAuth2 token -These tests can be run either all at once, or excluding certain marks.
The tests that require online resources, such as those marked `reddit` or `online`, will naturally require more time to run than tests that are entirely offline. To run tests, you must be in the root directory of the project and can use the following command. +These tests can be run either all at once, or excluding certain marks. The tests that require online resources, such as +those marked `reddit` or `online`, will naturally require more time to run than tests that are entirely offline. To run +tests, you must be in the root directory of the project and can use the following command. ```bash pytest @@ -128,18 +165,29 @@ pytest -m "not reddit and not authenticated" ### Configuration for authenticated tests -There should be configuration file `test_config.cfg` in the project's root directory to be able to run the integration tests with reddit authentication. See how to create such files [here](../README.md#configuration). The easiest way of creating this file is copying your existing `default_config.cfg` file from the path stated in the previous link and renaming it to `test_config.cfg` Be sure that user_token key exists in test_config.cfg. +There should be a configuration file `test_config.cfg` in the project's root directory to be able to run the integration +tests with Reddit authentication. See how to create such files [here](../README.md#configuration). The easiest way of +creating this file is copying your existing `default_config.cfg` file from the path stated in the previous link and +renaming it to `test_config.cfg`. Be sure that the `user_token` key exists in `test_config.cfg`. --- For more details, review the pytest documentation that is freely available online. -Many IDEs also provide integrated functionality to run and display the results from tests, and almost all of them support pytest in some capacity. This would be the recommended method due to the additional debugging and general capabilities.
+Many IDEs also provide integrated functionality to run and display the results from tests, and almost all of them +support pytest in some capacity. This would be the recommended method due to the additional debugging and general +capabilities. ### Writing Tests -When writing tests, ensure that they follow the style guide. The BDFR uses pytest to run tests. Wherever possible, parameterise tests, even if you only have one test case. This makes it easier to expand in the future, as the ultimate goal is to have multiple test cases for every test, instead of just one. +When writing tests, ensure that they follow the style guide. The BDFR uses pytest to run tests. Wherever possible, +parameterise tests, even if you only have one test case. This makes it easier to expand in the future, as the ultimate +goal is to have multiple test cases for every test, instead of just one. -If required, use of mocks is expected to simplify tests and reduce the resources or complexity required. Tests should be as small as possible and test as small a part of the code as possible. Comprehensive or integration tests are run with the `click` framework and are located in their own file. +If required, use of mocks is expected to simplify tests and reduce the resources or complexity required. Tests should be +as small as possible and test as small a part of the code as possible. Comprehensive or integration tests are run with +the `click` framework and are located in their own file. -It is also expected that new tests be classified correctly with the marks described above i.e. if a test accesses Reddit through a `reddit_instance` object, it must be given the `reddit` mark. If it requires an authenticated Reddit instance, then it must have the `authenticated` mark. +It is also expected that new tests be classified correctly with the marks described above i.e. if a test accesses Reddit +through a `reddit_instance` object, it must be given the `reddit` mark. 
If it requires an authenticated Reddit instance, +then it must have the `authenticated` mark.