diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md index 87364f4..ab4e1ab 100644 --- a/.github/ISSUE_TEMPLATE/bug_report.md +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -9,7 +9,7 @@ assignees: '' - [ ] I am reporting a bug. - [ ] I am running the latest version of BDfR -- [ ] I have read the [Opening an issue](README.md#configuration) +- [ ] I have read the [Opening an issue](../../README.md#configuration) ## Description A clear and concise description of what the bug is. diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md index fbf7f6b..ce9f0b3 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -9,7 +9,7 @@ assignees: '' - [ ] I am requesting a feature. - [ ] I am running the latest version of BDfR -- [ ] I have read the [Opening an issue](README.md#configuration) +- [ ] I have read the [Opening an issue](../../README.md#configuration) ## Description Clearly state the current situation and issues you experience. Then, explain how this feature would solve these issues and make life easier. Also, explain the feature in as much detail as possible. diff --git a/.github/ISSUE_TEMPLATE/site-support-request.md b/.github/ISSUE_TEMPLATE/site-support-request.md index 8524bd8..fd400aa 100644 --- a/.github/ISSUE_TEMPLATE/site-support-request.md +++ b/.github/ISSUE_TEMPLATE/site-support-request.md @@ -9,7 +9,7 @@ assignees: '' - [ ] I am requesting site support. - [ ] I am running the latest version of BDfR -- [ ] I have read the [Opening an issue](README.md#configuration) +- [ ] I have read the [Opening an issue](../../README.md#configuration) ## Site Provide a URL to the domain of the site. diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml index bf3bfbb..5aa8c61 100644 --- a/.github/workflows/test.yml +++ b/.github/workflows/test.yml @@ -8,15 +8,17 @@ on: jobs: test: - - runs-on: ubuntu-latest - + runs-on: ${{ matrix.os }} strategy: matrix: + os: [ubuntu-latest, macos-latest] python-version: [3.9] - + ext: [.sh] + include: + - os: windows-latest + python-version: 3.9 + ext: .ps1 steps: - - uses: actions/checkout@v2 - name: Setup Python uses: actions/setup-python@v2 @@ -26,19 +28,19 @@ jobs: - name: Install dependencies run: | python -m pip install --upgrade pip flake8 pytest pytest-cov - if [ -f requirements.txt ]; then pip install -r requirements.txt; fi + pip install -r requirements.txt - - name: Setup test configuration + - name: Make configuration for tests + env: + REDDIT_TOKEN: ${{ secrets.REDDIT_TEST_TOKEN }} run: | - cp bdfr/default_config.cfg ./test_config.cfg - echo -e "\nuser_token = ${{ secrets.REDDIT_TEST_TOKEN }}" >> ./test_config.cfg + ./devscripts/configure${{ matrix.ext }} - - name: Lint w/ flake8 + - name: Lint with flake8 run: | - # stop the build if there are Python syntax errors or undefined names flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics - - name: Test w/ PyTest + - name: Test with pytest run: | pytest -m 'not slow' --verbose --cov=./bdfr/ --cov-report term:skip-covered --cov-report html diff --git a/README.md b/README.md index 1f119cd..624cc5f 100644 --- a/README.md +++ b/README.md @@ -12,6 +12,11 @@ If you wish to open an issue, please read [the guide on opening issues](docs/CON python3 -m pip install bdfr ``` +On Arch Linux or derivative operating systems such as Manjaro, the BDFR can be installed through the AUR.
+ +- Latest Release: https://aur.archlinux.org/packages/python-bdfr/ +- Latest Development Build: https://aur.archlinux.org/packages/python-bdfr-git/ + If you want to use the source code or make contributions, refer to [CONTRIBUTING](docs/CONTRIBUTING.md#preparing-the-environment-for-development) ## Usage @@ -55,6 +60,9 @@ The following options are common between both the `archive` and `download` comma - `--config` - If the path to a configuration file is supplied with this option, the BDFR will use the specified config - See [Configuration Files](#configuration) for more details +- `--log` + - This allows one to specify the location of the logfile + - This must be done when running multiple instances of the BDFR, see [Multiple Instances](#multiple-instances) below - `--saved` - This option will make the BDFR use the supplied user's saved posts list as a download source - This requires an authenticated Reddit instance, using the `--authenticate` flag, as well as `--user` set to `me` @@ -106,6 +114,9 @@ The following options are common between both the `archive` and `download` comma - `week` - `month` - `year` + - `--time-format` + - This specifies the format of the datetime string that replaces `{DATE}` in file and folder naming schemes + - See [Time Formatting Customisation](#time-formatting-customisation) for more details, and the formatting scheme - `-u, --user` - This specifies the user to scrape in concert with other options - When using `--authenticate`, `--user me` can be used to refer to the authenticated user @@ -208,13 +219,20 @@ It is highly recommended that the file name scheme contain the parameter `{POSTI ## Configuration The configuration files are, by default, stored in the configuration directory for the user. This differs depending on the OS that the BDFR is being run on. For Windows, this will be: + - `C:\Users\<User>\AppData\Local\BDFR\bdfr` +If Python has been installed through the Windows Store, the folder will appear in a different place. Note that the hash included in the file path may change from installation to installation. + + - `C:\Users\<User>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\Local\BDFR\bdfr` + On Mac OSX, this will be: + - `~/Library/Application Support/bdfr`. Lastly, on a Linux system, this will be: - - `~/.local/share/bdfr` + + - `~/.config/bdfr/` The logging output for each run of the BDFR will be saved to this directory in the file `log_output.txt`. If you need to submit a bug, it is this file that you will need to submit with the report. @@ -222,16 +240,26 @@ The logging output for each run of the BDFR will be saved to this directory in t The `config.cfg` is the file that supplies the BDFR with the configuration to use. At the moment, the following keys **must** be included in the configuration file supplied. - - `backup_log_count` - - `max_wait_time` - `client_id` - `client_secret` - `scopes` +The following keys are optional, and defaults will be used if they cannot be found. + + - `backup_log_count` + - `max_wait_time` + - `time_format` + None of these should be modified unless you know what you're doing, as the default values will enable the BDFR to function just fine. A configuration is included in the BDFR when it is installed, and this will be placed in the configuration directory as the default. Most of these values have to do with OAuth2 configuration and authorisation. The key `backup_log_count`, however, has to do with the log rollover.
The logs in the configuration directory can be verbose and, for long runs of the BDFR, can grow quite large. To combat this, the BDFR will overwrite previous logs. This value determines how many previous run logs will be kept. The default is 3, which means that the BDFR will keep at most three past logs plus the current one. Any runs past this will overwrite the oldest log file, called "rolling over". If you want more records of past runs, increase this number. +#### Time Formatting Customisation + +The option `time_format` will specify the format of the timestamp that replaces `{DATE}` in filename and folder name schemes. By default, this is the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format, which is highly recommended due to its standardised nature. If you don't **need** to change it, it is recommended that you do not. However, you can set it to any required format with this option. The `--time-format` option supersedes any specification in the configuration file. + +The format can be specified through the [format codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) that are standard in the Python `datetime` library. + ### Rate Limiting The option `max_wait_time` has to do with retrying downloads. There are certain HTTP errors that mean that no amount of requests will return the wanted data, but some errors are from rate-limiting. This is when a single client is making so many requests that the remote website cuts the client off to preserve the function of the site. This is a common situation when downloading many resources from the same site. It is polite and best practice to obey the website's wishes in these cases. To this end, the BDFR will sleep for a time before retrying the download, giving the remote server time to "rest". @@ -240,6 +268,16 @@ The option `--max-wait-time` and the configuration option `max_wait_time` both specify the maximum time the BDFR will wait. If both are present, the command-line option takes precedence. For instance, the default is 120, so the BDFR will wait for 60 seconds, then 120 seconds, and then move on. **Note that this results in a total time of 180 seconds trying the same download**. If you wish to try to bypass the rate-limiting system on the remote site, increasing the maximum wait time may help. However, note that the actual wait times increase exponentially if the resource is not downloaded, i.e. specifying a max value of 300 (5 minutes) can make the BDFR pause for 15 minutes on one submission, not 5, in the worst case. +## Multiple Instances + +The BDFR can be run in multiple instances with multiple configurations, either concurrently or consecutively. Scripting files make this easiest: PowerShell on Windows operating systems, or Bash elsewhere. This allows multiple scenarios to be run with data being scraped from different sources, as any two sets of scenarios might be mutually exclusive, i.e. it is not possible to download every combination of data from a single run of the BDFR. To download from multiple users, for example, multiple runs of the BDFR are required. + +Running these scenarios consecutively is done easily, like any single run. Configuration files that differ may be specified with the `--config` option to switch between tokens, for example. Otherwise, almost all configuration for data sources can be specified per-run through the command line. + +Running scenarios concurrently (at the same time), however, is more complicated.
The BDFR will look to a single, static place to put the detailed log files: the configuration directory specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX-based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows, however, attempting this will raise an error that crashes the program, as Windows forbids multiple processes from accessing the same file. + +The way to fix this is to use the `--log` option to manually specify where the logfile is to be stored. If the given location is unique to each instance of the BDFR, then each instance will run fine. + ## List of currently supported sources - Direct links (links leading to a file) @@ -252,6 +290,7 @@ The option `--max-wait-time` and the configuration option `max_wait_time` both s - Reddit Videos - Redgifs - YouTube + - Streamable ## Contributing diff --git a/bdfr/__main__.py b/bdfr/__main__.py index 26759a1..372c7c3 100644 --- a/bdfr/__main__.py +++ b/bdfr/__main__.py @@ -20,10 +20,12 @@ _common_options = [ click.option('-m', '--multireddit', multiple=True, default=None, type=str), click.option('-L', '--limit', default=None, type=int), click.option('--authenticate', is_flag=True, default=None), + click.option('--log', type=str, default=None), click.option('--submitted', is_flag=True, default=None), click.option('--upvoted', is_flag=True, default=None), click.option('--saved', is_flag=True, default=None), click.option('--search', default=None, type=str), + click.option('--time-format', type=str, default=None), click.option('-u', '--user', type=str, default=None), click.option('-t', '--time', type=click.Choice(('all', 'hour', 'day', 'week', 'month', 'year')), default=None), click.option('-S', '--sort', type=click.Choice(('hot', 'top', 'new', @@ -73,7 +75,7 @@ def cli_download(context: click.Context, **_): @cli.command('archive') @_add_common_options @click.option('--all-comments', is_flag=True, default=None) -@click.option('-f,', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None) +@click.option('-f', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None) @click.pass_context def cli_archive(context: click.Context, **_): config = Configuration() diff --git a/bdfr/archiver.py b/bdfr/archiver.py index c6e4299..1945dfe 100644 --- a/bdfr/archiver.py +++ b/bdfr/archiver.py @@ -89,7 +89,7 @@ class Archiver(RedditDownloader): def _write_content_to_disk(self, resource: Resource, content: str): file_path = self.file_name_formatter.format_path(resource, self.download_directory) file_path.parent.mkdir(exist_ok=True, parents=True) - with open(file_path, 'w') as file: + with open(file_path, 'w', encoding="utf-8") as file: logger.debug( f'Writing entry {resource.source_submission.id} to file in {resource.extension[1:].upper()}' f' format at {file_path}') diff --git a/bdfr/configuration.py b/bdfr/configuration.py index 1d9610c..9ab9d45 100644 --- a/bdfr/configuration.py +++ b/bdfr/configuration.py @@ -17,6 +17,7 @@ class Configuration(Namespace): self.exclude_id_file = [] self.limit: Optional[int] = None self.link: list[str] = [] + self.log: Optional[str] = None self.max_wait_time = None self.multireddit: list[str] = [] self.no_dupes: bool = False @@ -32,6 +33,7 @@ class Configuration(Namespace): self.submitted: bool = False self.subreddit: list[str] = [] self.time: str = 'all' + self.time_format = None
self.upvoted: bool = False self.user: Optional[str] = None self.verbose: int = 0 diff --git a/bdfr/default_config.cfg b/bdfr/default_config.cfg index 1bcb02b..b8039a9 100644 --- a/bdfr/default_config.cfg +++ b/bdfr/default_config.cfg @@ -3,4 +3,5 @@ client_id = U-6gk4ZCh3IeNQ client_secret = 7CZHY6AmKweZME5s50SfDGylaPg scopes = identity, history, read, save backup_log_count = 3 -max_wait_time = 120 \ No newline at end of file +max_wait_time = 120 +time_format = ISO \ No newline at end of file diff --git a/bdfr/download_filter.py b/bdfr/download_filter.py index 37a6ce9..3bbbdec 100644 --- a/bdfr/download_filter.py +++ b/bdfr/download_filter.py @@ -4,6 +4,8 @@ import logging import re +from bdfr.resource import Resource + logger = logging.getLogger(__name__) @@ -21,13 +23,20 @@ class DownloadFilter: else: return True - def _check_extension(self, url: str) -> bool: + def check_resource(self, res: Resource) -> bool: + if not self._check_extension(res.extension): + return False + elif not self._check_domain(res.url): + return False + return True + + def _check_extension(self, resource_extension: str) -> bool: if not self.excluded_extensions: return True combined_extensions = '|'.join(self.excluded_extensions) pattern = re.compile(r'.*({})$'.format(combined_extensions)) - if re.match(pattern, url): - logger.log(9, f'Url "{url}" matched with "{str(pattern)}"') + if re.match(pattern, resource_extension): + logger.log(9, f'Extension "{resource_extension}" matched with pattern "{str(pattern)}"') return False else: return True diff --git a/bdfr/downloader.py b/bdfr/downloader.py index c24b5cd..1625c8f 100644 --- a/bdfr/downloader.py +++ b/bdfr/downloader.py @@ -14,7 +14,7 @@ from datetime import datetime from enum import Enum, auto from multiprocessing import Pool from pathlib import Path -from typing import Iterator +from typing import Callable, Iterator import appdirs import praw @@ -105,6 +105,12 @@ class RedditDownloader: logger.log(9, 'Wrote default download wait time download to config file') self.args.max_wait_time = self.cfg_parser.getint('DEFAULT', 'max_wait_time') logger.debug(f'Setting maximum download wait time to {self.args.max_wait_time} seconds') + if self.args.time_format is None: + option = self.cfg_parser.get('DEFAULT', 'time_format', fallback='ISO') + if re.match(r'^[ \'\"]*$', option): + option = 'ISO' + logger.debug(f'Setting datetime format string to {option}') + self.args.time_format = option # Update config on disk with open(self.config_location, 'w') as file: self.cfg_parser.write(file) @@ -190,7 +196,12 @@ class RedditDownloader: def _create_file_logger(self): main_logger = logging.getLogger() - log_path = Path(self.config_directory, 'log_output.txt') + if self.args.log is None: + log_path = Path(self.config_directory, 'log_output.txt') + else: + log_path = Path(self.args.log).expanduser().resolve() + if not log_path.parent.exists(): + raise errors.BulkDownloaderException('Designated location for logfile does not exist') backup_count = self.cfg_parser.getint('DEFAULT', 'backup_log_count', fallback=3) file_handler = logging.handlers.RotatingFileHandler( log_path, @@ -198,7 +209,13 @@ class RedditDownloader: backupCount=backup_count, ) if log_path.exists(): - file_handler.doRollover() + try: + file_handler.doRollover() + except PermissionError: + logger.critical( + 'Cannot rollover logfile, make sure this is the only ' + 'BDFR process or specify alternate logfile location') + raise formatter = logging.Formatter('[%(asctime)s - %(name)s - %(levelname)s] - %(message)s')
file_handler.setFormatter(formatter) file_handler.setLevel(0) @@ -207,7 +224,7 @@ class RedditDownloader: @staticmethod def _sanitise_subreddit_name(subreddit: str) -> str: - pattern = re.compile(r'^(?:https://www\.reddit\.com/)?(?:r/)?(.*?)(?:/)?$') + pattern = re.compile(r'^(?:https://www\.reddit\.com/)?(?:r/)?(.*?)/?$') match = re.match(pattern, subreddit) if not match: raise errors.BulkDownloaderException(f'Could not find subreddit name in string {subreddit}') @@ -225,10 +242,14 @@ class RedditDownloader: def _get_subreddits(self) -> list[praw.models.ListingGenerator]: if self.args.subreddit: out = [] - sort_function = self._determine_sort_function() for reddit in self._split_args_input(self.args.subreddit): try: reddit = self.reddit_instance.subreddit(reddit) + try: + self._check_subreddit_status(reddit) + except errors.BulkDownloaderException as e: + logger.error(e) + continue if self.args.search: out.append(reddit.search( self.args.search, @@ -265,7 +286,7 @@ class RedditDownloader: supplied_submissions.append(self.reddit_instance.submission(url=sub_id)) return [supplied_submissions] - def _determine_sort_function(self): + def _determine_sort_function(self) -> Callable: if self.sort_filter is RedditTypes.SortType.NEW: sort_function = praw.models.Subreddit.new elif self.sort_filter is RedditTypes.SortType.RISING: @@ -304,8 +325,10 @@ class RedditDownloader: def _get_user_data(self) -> list[Iterator]: if any([self.args.submitted, self.args.upvoted, self.args.saved]): if self.args.user: - if not self._check_user_existence(self.args.user): - logger.error(f'User {self.args.user} does not exist') + try: + self._check_user_existence(self.args.user) + except errors.BulkDownloaderException as e: + logger.error(e) return [] generators = [] if self.args.submitted: @@ -329,17 +352,19 @@ class RedditDownloader: else: return [] - def _check_user_existence(self, name: str) -> bool: + def _check_user_existence(self, name: str): user = self.reddit_instance.redditor(name=name) try: - if not user.id: - return False + if user.id: + return except prawcore.exceptions.NotFound: - return False - return True + raise errors.BulkDownloaderException(f'Could not find user {name}') + except AttributeError: + if hasattr(user, 'is_suspended'): + raise errors.BulkDownloaderException(f'User {name} is banned') def _create_file_name_formatter(self) -> FileNameFormatter: - return FileNameFormatter(self.args.file_scheme, self.args.folder_scheme) + return FileNameFormatter(self.args.file_scheme, self.args.folder_scheme, self.args.time_format) def _create_time_filter(self) -> RedditTypes.TimeType: try: @@ -375,9 +400,6 @@ class RedditDownloader: if not isinstance(submission, praw.models.Submission): logger.warning(f'{submission.id} is not a submission') return - if not self.download_filter.check_url(submission.url): - logger.debug(f'Download filter removed submission {submission.id} with URL {submission.url}') - return try: downloader_class = DownloadFactory.pull_lever(submission.url) downloader = downloader_class(submission) @@ -394,12 +416,14 @@ class RedditDownloader: for destination, res in self.file_name_formatter.format_resource_paths(content, self.download_directory): if destination.exists(): logger.debug(f'File {destination} already exists, continuing') + elif not self.download_filter.check_resource(res): + logger.debug(f'Download filter removed {submission.id} with URL {submission.url}') else: try: res.download(self.args.max_wait_time) except errors.BulkDownloaderException as e: - logger.error( - f'Failed to 
download resource {res.url} with downloader {downloader_class.__name__}: {e}') + logger.error(f'Failed to download resource {res.url} in submission {submission.id} ' + f'with downloader {downloader_class.__name__}: {e}') return resource_hash = res.hash.hexdigest() destination.parent.mkdir(parents=True, exist_ok=True) @@ -446,3 +470,14 @@ class RedditDownloader: for line in file: out.append(line.strip()) return set(out) + + @staticmethod + def _check_subreddit_status(subreddit: praw.models.Subreddit): + if subreddit.display_name == 'all': + return + try: + assert subreddit.id + except prawcore.NotFound: + raise errors.BulkDownloaderException(f'Source {subreddit.display_name} does not exist or cannot be found') + except prawcore.Forbidden: + raise errors.BulkDownloaderException(f'Source {subreddit.display_name} is private and cannot be scraped') diff --git a/bdfr/file_name_formatter.py b/bdfr/file_name_formatter.py index c4bf4b5..c6c13c2 100644 --- a/bdfr/file_name_formatter.py +++ b/bdfr/file_name_formatter.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # coding=utf-8 - +import datetime import logging import platform import re @@ -26,18 +26,18 @@ class FileNameFormatter: 'upvotes', ) - def __init__(self, file_format_string: str, directory_format_string: str): + def __init__(self, file_format_string: str, directory_format_string: str, time_format_string: str): if not self.validate_string(file_format_string): raise BulkDownloaderException(f'"{file_format_string}" is not a valid format string') self.file_format_string = file_format_string self.directory_format_string: list[str] = directory_format_string.split('/') + self.time_format_string = time_format_string - @staticmethod - def _format_name(submission: (Comment, Submission), format_string: str) -> str: + def _format_name(self, submission: (Comment, Submission), format_string: str) -> str: if isinstance(submission, Submission): - attributes = FileNameFormatter._generate_name_dict_from_submission(submission) + attributes = self._generate_name_dict_from_submission(submission) elif isinstance(submission, Comment): - attributes = FileNameFormatter._generate_name_dict_from_comment(submission) + attributes = self._generate_name_dict_from_comment(submission) else: raise BulkDownloaderException(f'Cannot name object {type(submission).__name__}') result = format_string @@ -65,8 +65,7 @@ class FileNameFormatter: in_string = in_string.replace(match, converted_match) return in_string - @staticmethod - def _generate_name_dict_from_submission(submission: Submission) -> dict: + def _generate_name_dict_from_submission(self, submission: Submission) -> dict: submission_attributes = { 'title': submission.title, 'subreddit': submission.subreddit.display_name, @@ -74,12 +73,18 @@ class FileNameFormatter: 'postid': submission.id, 'upvotes': submission.score, 'flair': submission.link_flair_text, - 'date': submission.created_utc + 'date': self._convert_timestamp(submission.created_utc), } return submission_attributes - @staticmethod - def _generate_name_dict_from_comment(comment: Comment) -> dict: + def _convert_timestamp(self, timestamp: float) -> str: + input_time = datetime.datetime.fromtimestamp(timestamp) + if self.time_format_string.upper().strip() == 'ISO': + return input_time.isoformat() + else: + return input_time.strftime(self.time_format_string) + + def _generate_name_dict_from_comment(self, comment: Comment) -> dict: comment_attributes = { 'title': comment.submission.title, 'subreddit': comment.subreddit.display_name, @@ -87,7 +92,7 @@ class 
FileNameFormatter: 'postid': comment.id, 'upvotes': comment.score, 'flair': '', - 'date': comment.created_utc, + 'date': self._convert_timestamp(comment.created_utc), } return comment_attributes @@ -155,9 +160,8 @@ class FileNameFormatter: result = any([f'{{{key}}}' in test_string.lower() for key in FileNameFormatter.key_terms]) if result: if 'POSTID' not in test_string: - logger.warning( - 'Some files might not be downloaded due to name conflicts as filenames are' - ' not guaranteed to be be unique without {POSTID}') + logger.warning('Some files might not be downloaded due to name conflicts as filenames are' + ' not guaranteed to be unique without {POSTID}') return True else: return False diff --git a/bdfr/oauth2.py b/bdfr/oauth2.py index 505d5bd..6b27599 100644 --- a/bdfr/oauth2.py +++ b/bdfr/oauth2.py @@ -81,7 +81,7 @@ class OAuth2Authenticator: return client @staticmethod - def send_message(client: socket.socket, message: str): + def send_message(client: socket.socket, message: str = ''): client.send(f'HTTP/1.1 200 OK\r\n\r\n{message}'.encode('utf-8')) client.close() diff --git a/bdfr/site_downloaders/base_downloader.py b/bdfr/site_downloaders/base_downloader.py index ac45dc3..10787b8 100644 --- a/bdfr/site_downloaders/base_downloader.py +++ b/bdfr/site_downloaders/base_downloader.py @@ -8,7 +8,7 @@ from typing import Optional import requests from praw.models import Submission -from bdfr.exceptions import ResourceNotFound +from bdfr.exceptions import ResourceNotFound, SiteDownloaderError from bdfr.resource import Resource from bdfr.site_authenticator import SiteAuthenticator @@ -27,7 +27,11 @@ class BaseDownloader: @staticmethod def retrieve_url(url: str, cookies: dict = None, headers: dict = None) -> requests.Response: - res = requests.get(url, cookies=cookies, headers=headers) + try: + res = requests.get(url, cookies=cookies, headers=headers) + except requests.exceptions.RequestException as e: + logger.exception(e) + raise SiteDownloaderError(f'Failed to get page {url}') if res.status_code != 200: raise ResourceNotFound(f'Server responded with {res.status_code} to {url}') return res diff --git a/bdfr/site_downloaders/download_factory.py b/bdfr/site_downloaders/download_factory.py index 4bd6225..7035dc2 100644 --- a/bdfr/site_downloaders/download_factory.py +++ b/bdfr/site_downloaders/download_factory.py @@ -9,13 +9,12 @@ from bdfr.exceptions import NotADownloadableLinkError from bdfr.site_downloaders.base_downloader import BaseDownloader from bdfr.site_downloaders.direct import Direct from bdfr.site_downloaders.erome import Erome +from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import YoutubeDlFallback from bdfr.site_downloaders.gallery import Gallery from bdfr.site_downloaders.gfycat import Gfycat -from bdfr.site_downloaders.gif_delivery_network import GifDeliveryNetwork from bdfr.site_downloaders.imgur import Imgur from bdfr.site_downloaders.redgifs import Redgifs from bdfr.site_downloaders.self_post import SelfPost -from bdfr.site_downloaders.vreddit import VReddit from bdfr.site_downloaders.youtube import Youtube @@ -33,22 +32,21 @@ class DownloadFactory: return Gallery elif re.match(r'gfycat\.', sanitised_url): return Gfycat - elif re.match(r'gifdeliverynetwork', sanitised_url): - return GifDeliveryNetwork elif re.match(r'(m\.)?imgur.*', sanitised_url): return Imgur - elif re.match(r'redgifs.com', sanitised_url): + elif re.match(r'(redgifs|gifdeliverynetwork)', sanitised_url): return Redgifs elif re.match(r'reddit\.com/r/', sanitised_url): return
SelfPost - elif re.match(r'v\.redd\.it', sanitised_url): - return VReddit elif re.match(r'(m\.)?youtu\.?be', sanitised_url): return Youtube elif re.match(r'i\.redd\.it.*', sanitised_url): return Direct + elif YoutubeDlFallback.can_handle_link(sanitised_url): + return YoutubeDlFallback else: - raise NotADownloadableLinkError(f'No downloader module exists for url {url}') + raise NotADownloadableLinkError( + f'No downloader module exists for url {url}') @staticmethod def _sanitise_url(url: str) -> str: diff --git a/bdfr/site_downloaders/fallback_downloaders/__init__.py b/bdfr/site_downloaders/fallback_downloaders/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/bdfr/site_downloaders/fallback_downloaders/fallback_downloader.py b/bdfr/site_downloaders/fallback_downloaders/fallback_downloader.py new file mode 100644 index 0000000..deeb213 --- /dev/null +++ b/bdfr/site_downloaders/fallback_downloaders/fallback_downloader.py @@ -0,0 +1,15 @@ +#!/usr/bin/env python3 +# coding=utf-8 + +from abc import ABC, abstractmethod + +from bdfr.site_downloaders.base_downloader import BaseDownloader + + +class BaseFallbackDownloader(BaseDownloader, ABC): + + @staticmethod + @abstractmethod + def can_handle_link(url: str) -> bool: + """Returns whether the fallback downloader can download this link""" + raise NotImplementedError diff --git a/bdfr/site_downloaders/fallback_downloaders/youtubedl_fallback.py b/bdfr/site_downloaders/fallback_downloaders/youtubedl_fallback.py new file mode 100644 index 0000000..281182a --- /dev/null +++ b/bdfr/site_downloaders/fallback_downloaders/youtubedl_fallback.py @@ -0,0 +1,40 @@ +#!/usr/bin/env python3 +# coding=utf-8 + +import logging +from typing import Optional + +import youtube_dl +from praw.models import Submission + +from bdfr.resource import Resource +from bdfr.site_authenticator import SiteAuthenticator +from bdfr.site_downloaders.fallback_downloaders.fallback_downloader import BaseFallbackDownloader +from bdfr.site_downloaders.youtube import Youtube + +logger = logging.getLogger(__name__) + + +class YoutubeDlFallback(BaseFallbackDownloader, Youtube): + def __init__(self, post: Submission): + super(YoutubeDlFallback, self).__init__(post) + + def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]: + out = super()._download_video({}) + return [out] + + @staticmethod + def can_handle_link(url: str) -> bool: + yt_logger = logging.getLogger('youtube-dl') + yt_logger.setLevel(logging.CRITICAL) + with youtube_dl.YoutubeDL({ + 'logger': yt_logger, + }) as ydl: + try: + result = ydl.extract_info(url, download=False) + if result: + return True + except Exception as e: + logger.exception(e) + return False + return False diff --git a/bdfr/site_downloaders/gfycat.py b/bdfr/site_downloaders/gfycat.py index f140660..eb33620 100644 --- a/bdfr/site_downloaders/gfycat.py +++ b/bdfr/site_downloaders/gfycat.py @@ -10,10 +10,10 @@ from praw.models import Submission from bdfr.exceptions import SiteDownloaderError from bdfr.resource import Resource from bdfr.site_authenticator import SiteAuthenticator -from bdfr.site_downloaders.gif_delivery_network import GifDeliveryNetwork +from bdfr.site_downloaders.redgifs import Redgifs -class Gfycat(GifDeliveryNetwork): +class Gfycat(Redgifs): def __init__(self, post: Submission): super().__init__(post) @@ -26,15 +26,15 @@ class Gfycat(GifDeliveryNetwork): url = 'https://gfycat.com/' + gfycat_id response = Gfycat.retrieve_url(url) - if 'gifdeliverynetwork' in response.url: - return 
GifDeliveryNetwork._get_link(url) + if re.search(r'(redgifs|gifdeliverynetwork)', response.url): + return Redgifs._get_link(url) soup = BeautifulSoup(response.text, 'html.parser') content = soup.find('script', attrs={'data-react-helmet': 'true', 'type': 'application/ld+json'}) try: out = json.loads(content.contents[0])['video']['contentUrl'] - except (IndexError, KeyError) as e: + except (IndexError, KeyError, AttributeError) as e: raise SiteDownloaderError(f'Failed to download Gfycat link {url}: {e}') except json.JSONDecodeError as e: raise SiteDownloaderError(f'Did not receive valid JSON data: {e}') diff --git a/bdfr/site_downloaders/gif_delivery_network.py b/bdfr/site_downloaders/gif_delivery_network.py deleted file mode 100644 index dbe2cf5..0000000 --- a/bdfr/site_downloaders/gif_delivery_network.py +++ /dev/null @@ -1,36 +0,0 @@ -#!/usr/bin/env python3 - -from typing import Optional -import json - -from bs4 import BeautifulSoup -from praw.models import Submission - -from bdfr.exceptions import NotADownloadableLinkError, SiteDownloaderError -from bdfr.resource import Resource -from bdfr.site_authenticator import SiteAuthenticator -from bdfr.site_downloaders.base_downloader import BaseDownloader - - -class GifDeliveryNetwork(BaseDownloader): - def __init__(self, post: Submission): - super().__init__(post) - - def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]: - media_url = self._get_link(self.post.url) - return [Resource(self.post, media_url, '.mp4')] - - @staticmethod - def _get_link(url: str) -> str: - page = GifDeliveryNetwork.retrieve_url(url) - - soup = BeautifulSoup(page.text, 'html.parser') - content = soup.find('script', attrs={'data-react-helmet': 'true', 'type': 'application/ld+json'}) - - try: - content = json.loads(content.string) - out = content['video']['contentUrl'] - except (json.JSONDecodeError, KeyError, TypeError): - raise SiteDownloaderError('Could not find source link') - - return out diff --git a/bdfr/site_downloaders/imgur.py b/bdfr/site_downloaders/imgur.py index 3458a45..6ae8a5e 100644 --- a/bdfr/site_downloaders/imgur.py +++ b/bdfr/site_downloaders/imgur.py @@ -7,7 +7,7 @@ from typing import Optional import bs4 from praw.models import Submission -from bdfr.exceptions import NotADownloadableLinkError, SiteDownloaderError +from bdfr.exceptions import SiteDownloaderError from bdfr.resource import Resource from bdfr.site_authenticator import SiteAuthenticator from bdfr.site_downloaders.base_downloader import BaseDownloader @@ -26,12 +26,12 @@ class Imgur(BaseDownloader): if 'album_images' in self.raw_data: images = self.raw_data['album_images'] for image in images['images']: - out.append(self._download_image(image)) + out.append(self._compute_image_url(image)) else: - out.append(self._download_image(self.raw_data)) + out.append(self._compute_image_url(self.raw_data)) return out - def _download_image(self, image: dict) -> Resource: + def _compute_image_url(self, image: dict) -> Resource: image_url = 'https://i.imgur.com/' + image['hash'] + self._validate_extension(image['ext']) return Resource(self.post, image_url) diff --git a/bdfr/site_downloaders/redgifs.py b/bdfr/site_downloaders/redgifs.py index 2436d33..051bc12 100644 --- a/bdfr/site_downloaders/redgifs.py +++ b/bdfr/site_downloaders/redgifs.py @@ -7,18 +7,19 @@ from typing import Optional from bs4 import BeautifulSoup from praw.models import Submission -from bdfr.exceptions import NotADownloadableLinkError, SiteDownloaderError +from bdfr.exceptions import 
SiteDownloaderError from bdfr.resource import Resource from bdfr.site_authenticator import SiteAuthenticator -from bdfr.site_downloaders.gif_delivery_network import GifDeliveryNetwork +from bdfr.site_downloaders.base_downloader import BaseDownloader -class Redgifs(GifDeliveryNetwork): +class Redgifs(BaseDownloader): def __init__(self, post: Submission): super().__init__(post) def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]: - return super().find_resources(authenticator) + media_url = self._get_link(self.post.url) + return [Resource(self.post, media_url, '.mp4')] @staticmethod def _get_link(url: str) -> str: @@ -27,24 +28,19 @@ class Redgifs(GifDeliveryNetwork): except AttributeError: raise SiteDownloaderError(f'Could not extract Redgifs ID from {url}') - url = 'https://redgifs.com/watch/' + redgif_id - headers = { - 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)' - ' Chrome/67.0.3396.87 Safari/537.36 OPR/54.0.2952.64', + 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) ' + 'Chrome/90.0.4430.93 Safari/537.36', } - page = Redgifs.retrieve_url(url, headers=headers) - - soup = BeautifulSoup(page.text, 'html.parser') - content = soup.find('script', attrs={'data-react-helmet': 'true', 'type': 'application/ld+json'}) + content = Redgifs.retrieve_url(f'https://api.redgifs.com/v1/gfycats/{redgif_id}', headers=headers) if content is None: raise SiteDownloaderError('Could not read the page source') try: - out = json.loads(content.contents[0])['video']['contentUrl'] - except (IndexError, KeyError): + out = json.loads(content.text)['gfyItem']['mp4Url'] + except (KeyError, AttributeError): raise SiteDownloaderError('Failed to find JSON data in page') except json.JSONDecodeError as e: raise SiteDownloaderError(f'Received data was not valid JSON: {e}') diff --git a/bdfr/site_downloaders/vreddit.py b/bdfr/site_downloaders/vreddit.py deleted file mode 100644 index bff96be..0000000 --- a/bdfr/site_downloaders/vreddit.py +++ /dev/null @@ -1,21 +0,0 @@ -#!/usr/bin/env python3 - -import logging -from typing import Optional - -from praw.models import Submission - -from bdfr.resource import Resource -from bdfr.site_authenticator import SiteAuthenticator -from bdfr.site_downloaders.youtube import Youtube - -logger = logging.getLogger(__name__) - - -class VReddit(Youtube): - def __init__(self, post: Submission): - super().__init__(post) - - def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]: - out = super()._download_video({}) - return [out] diff --git a/bdfr/site_downloaders/youtube.py b/bdfr/site_downloaders/youtube.py index 7b62dc1..482d4bc 100644 --- a/bdfr/site_downloaders/youtube.py +++ b/bdfr/site_downloaders/youtube.py @@ -30,7 +30,10 @@ class Youtube(BaseDownloader): return [out] def _download_video(self, ytdl_options: dict) -> Resource: + yt_logger = logging.getLogger('youtube-dl') + yt_logger.setLevel(logging.CRITICAL) ytdl_options['quiet'] = True + ytdl_options['logger'] = yt_logger with tempfile.TemporaryDirectory() as temp_dir: download_path = Path(temp_dir).resolve() ytdl_options['outtmpl'] = str(download_path) + '/' + 'test.%(ext)s' diff --git a/devscripts/configure.ps1 b/devscripts/configure.ps1 new file mode 100644 index 0000000..8ac0ce1 --- /dev/null +++ b/devscripts/configure.ps1 @@ -0,0 +1,2 @@ +copy .\\bdfr\\default_config.cfg .\\test_config.cfg +echo "`nuser_token = $env:REDDIT_TOKEN" >> ./test_config.cfg \ No 
newline at end of file diff --git a/devscripts/configure.sh b/devscripts/configure.sh new file mode 100755 index 0000000..48e7c3e --- /dev/null +++ b/devscripts/configure.sh @@ -0,0 +1,2 @@ +cp ./bdfr/default_config.cfg ./test_config.cfg +echo -e "\nuser_token = $REDDIT_TOKEN" >> ./test_config.cfg \ No newline at end of file diff --git a/scripts/README.md b/scripts/README.md new file mode 100644 index 0000000..4bb098b --- /dev/null +++ b/scripts/README.md @@ -0,0 +1,69 @@ +# Useful Scripts + +Due to the verbosity of the logs, a great deal of information can be gathered quite easily from the BDFR's logfiles. In this folder, there is a selection of scripts that parse these logs, scraping useful bits of information. Since the logfiles consist of recurring string patterns, it is a fairly simple matter to write scripts that utilise tools included on most Linux systems. + + - [Script to extract all successfully downloaded IDs](#extract-all-successfully-downloaded-ids) - [Script to extract all failed download IDs](#extract-all-failed-ids) - [Timestamp conversion](#converting-bdfrv1-timestamps-to-bdfrv2-timestamps) - [Printing summary statistics for a run](#printing-summary-statistics) + +## Extract all Successfully Downloaded IDs + +This script is contained [here](extract_successful_ids.sh) and will produce a file containing the IDs of everything that was successfully downloaded without an error. This list can then be passed to the `--exclude-id-file` option so that the BDFR will not attempt to redownload these submissions/comments. This is likely to cause a performance increase, especially when a BDFR run finds many resources. + +The script can be used with the following signature: + +```bash +./extract_successful_ids.sh LOGFILE_LOCATION [OUTPUT_FILE] +``` + +By default, if the second argument is not supplied, the script will write the results to `successful.txt`. + +An example of the script being run on a Linux machine is the following: + +```bash +./extract_successful_ids.sh ~/.config/bdfr/log_output.txt +``` + +## Extract all Failed IDs + +[This script](extract_failed_ids.sh) will output a file of all IDs that failed to be downloaded from the logfile in question. This may be used to prevent subsequent runs of the BDFR from re-attempting those submissions if that is desired, potentially increasing performance. + +The script can be used with the following signature: + +```bash +./extract_failed_ids.sh LOGFILE_LOCATION [OUTPUT_FILE] +``` + +By default, if the second argument is not supplied, the script will write the results to `failed.txt`. + +An example of the script being run on a Linux machine is the following: + +```bash +./extract_failed_ids.sh ~/.config/bdfr/log_output.txt +``` + +## Converting BDFRv1 Timestamps to BDFRv2 Timestamps + +BDFRv2 uses an internationally recognised and standardised format for timestamps, namely ISO 8601, which is highly recommended because it is a widespread and well-understood standard. However, the BDFRv1 does not use this standard. Due to this, if you've used the old timestamp in filenames or folders, the BDFR will no longer recognise them as the same files and may redownload duplicate resources. + +To prevent this, it is recommended that you rename existing files to the ISO 8601 standard. This can be done using the [timestamp-converter](https://github.com/Serene-Arc/timestamp-converter) tool made for this purpose. Instructions specifically for the BDFR are available in that project.
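+
+If you only want a rough idea of what such a conversion involves, the sketch below renames files whose names end in a BDFRv1-style Unix epoch to the ISO 8601 form that BDFRv2 produces, using the same `datetime` machinery that drives the `time_format` option. It is a minimal, hypothetical illustration only: the `{TITLE}_{DATE}` naming scheme, the current-directory target, and the regular expression are all assumptions, so prefer the timestamp-converter tool for real collections.
+
+```python
+#!/usr/bin/env python3
+# Minimal sketch, not the timestamp-converter tool: rename files that end in a
+# Unix epoch (BDFRv1 style) to the ISO 8601 timestamps used by BDFRv2.
+# Assumes a {TITLE}_{DATE} file name scheme; adjust the pattern for others.
+import re
+from datetime import datetime
+from pathlib import Path
+
+EPOCH_NAME = re.compile(r'^(?P<stem>.+_)(?P<epoch>\d{9,10})(?P<ext>\.\w+)$')
+
+for file in Path('.').iterdir():
+    match = EPOCH_NAME.match(file.name)
+    if not match or not file.is_file():
+        continue  # no epoch suffix, so presumably already ISO 8601
+    iso_date = datetime.fromtimestamp(int(match['epoch'])).isoformat()
+    # ISO 8601 contains colons, which are not valid in Windows file names;
+    # the BDFR substitutes such characters itself when running on Windows.
+    file.rename(file.with_name(f"{match['stem']}{iso_date}{match['ext']}"))
+```
+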
## Printing Summary Statistics + +A simple script has been included to print summary statistics for a run of the BDFR. This is mainly to showcase how easy it is to extract statistics from the logfiles. You can extend this quite easily. For example, you can print how often the Imgur module is used, or how many 404 errors there are in the last run, or which module has caused the most errors. The possibilities really are endless. + +```bash +./print_summary.sh LOGFILE_LOCATION +``` + +This will create an output like the following: + +``` +Downloaded submissions: 250 +Failed downloads: 103 +Files already downloaded: 20073 +Hard linked submissions: 30 +Excluded submissions: 1146 +Files with existing hash skipped: 0 +Submissions from excluded subreddits: 0 +``` diff --git a/scripts/extract_failed_ids.sh b/scripts/extract_failed_ids.sh new file mode 100755 index 0000000..cdf1f21 --- /dev/null +++ b/scripts/extract_failed_ids.sh @@ -0,0 +1,18 @@ +#!/bin/bash + +if [ -e "$1" ]; then + file="$1" +else + echo 'CANNOT FIND LOG FILE' + exit 1 +fi + +if [ -n "$2" ]; then + output="$2" + echo "Outputting IDs to $output" +else + output="failed.txt" +fi + +grep 'Could not download submission' "$file" | awk '{ print $12 }' | rev | cut -c 2- | rev >>"$output" +grep 'Failed to download resource' "$file" | awk '{ print $15 }' >>"$output" diff --git a/scripts/extract_successful_ids.sh b/scripts/extract_successful_ids.sh new file mode 100755 index 0000000..3b6f7bc --- /dev/null +++ b/scripts/extract_successful_ids.sh @@ -0,0 +1,17 @@ +#!/bin/bash + +if [ -e "$1" ]; then + file="$1" +else + echo 'CANNOT FIND LOG FILE' + exit 1 +fi + +if [ -n "$2" ]; then + output="$2" + echo "Outputting IDs to $output" +else + output="successful.txt" +fi + +grep 'Downloaded submission' "$file" | awk '{ print $(NF-2) }' >> "$output" diff --git a/scripts/print_summary.sh b/scripts/print_summary.sh new file mode 100755 index 0000000..052ef1e --- /dev/null +++ b/scripts/print_summary.sh @@ -0,0 +1,16 @@ +#!/bin/bash + +if [ -e "$1" ]; then + file="$1" +else + echo 'CANNOT FIND LOG FILE' + exit 1 +fi + +echo "Downloaded submissions: $( grep -c 'Downloaded submission' "$file" )" +echo "Failed downloads: $( grep -c 'failed to download submission' "$file" )" +echo "Files already downloaded: $( grep -c 'already exists, continuing' "$file" )" +echo "Hard linked submissions: $( grep -c 'Hard link made' "$file" )" +echo "Excluded submissions: $( grep -c 'in exclusion list' "$file" )" +echo "Files with existing hash skipped: $( grep -c 'downloaded elsewhere' "$file" )" +echo "Submissions from excluded subreddits: $( grep -c 'in skip list' "$file" )" diff --git a/setup.cfg b/setup.cfg index 3b57d7a..1bba6b1 100644 --- a/setup.cfg +++ b/setup.cfg @@ -4,7 +4,7 @@ description_file = README.md description_content_type = text/markdown home_page = https://github.com/aliparlakci/bulk-downloader-for-reddit keywords = reddit, download, archive -version = 2.0.3 +version = 2.1.0 author = Ali Parlakci author_email = parlakciali@gmail.com maintainer = Serene Arc @@ -16,7 +16,6 @@ classifiers = Natural Language :: English Environment :: Console Operating System :: OS Independent -requires_python = >=3.9 platforms = any [files] diff --git a/setup.py b/setup.py index 40c6185..c5518a6 100644 --- a/setup.py +++ b/setup.py @@ -3,4 +3,4 @@ from setuptools import setup -setup(setup_requires=['pbr', 'appdirs'], pbr=True, data_files=[('config', ['bdfr/default_config.cfg'])]) +setup(setup_requires=['pbr', 'appdirs'], pbr=True, data_files=[('config',
['bdfr/default_config.cfg'])], python_requires='>=3.9.0') diff --git a/tests/conftest.py b/tests/conftest.py index ce4b681..da02948 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -13,7 +13,11 @@ from bdfr.oauth2 import OAuth2TokenManager @pytest.fixture(scope='session') def reddit_instance(): - rd = praw.Reddit(client_id='U-6gk4ZCh3IeNQ', client_secret='7CZHY6AmKweZME5s50SfDGylaPg', user_agent='test') + rd = praw.Reddit( + client_id='U-6gk4ZCh3IeNQ', + client_secret='7CZHY6AmKweZME5s50SfDGylaPg', + user_agent='test', + ) return rd @@ -27,8 +31,10 @@ def authenticated_reddit_instance(): if not cfg_parser.has_option('DEFAULT', 'user_token'): pytest.skip('Refresh token must be provided to authenticate with OAuth2') token_manager = OAuth2TokenManager(cfg_parser, test_config_path) - reddit_instance = praw.Reddit(client_id=cfg_parser.get('DEFAULT', 'client_id'), - client_secret=cfg_parser.get('DEFAULT', 'client_secret'), - user_agent=socket.gethostname(), - token_manager=token_manager) + reddit_instance = praw.Reddit( + client_id=cfg_parser.get('DEFAULT', 'client_id'), + client_secret=cfg_parser.get('DEFAULT', 'client_secret'), + user_agent=socket.gethostname(), + token_manager=token_manager, + ) return reddit_instance diff --git a/tests/site_downloaders/fallback_downloaders/__init__.py b/tests/site_downloaders/fallback_downloaders/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tests/site_downloaders/fallback_downloaders/youtubedl_fallback.py b/tests/site_downloaders/fallback_downloaders/youtubedl_fallback.py new file mode 100644 index 0000000..f70a91c --- /dev/null +++ b/tests/site_downloaders/fallback_downloaders/youtubedl_fallback.py @@ -0,0 +1,37 @@ +#!/usr/bin/env python3 + +from unittest.mock import MagicMock + +import pytest + +from bdfr.resource import Resource +from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import YoutubeDlFallback + + +@pytest.mark.online +@pytest.mark.parametrize(('test_url', 'expected'), ( + ('https://www.reddit.com/r/specializedtools/comments/n2nw5m/bamboo_splitter/', True), + ('https://www.youtube.com/watch?v=P19nvJOmqCc', True), + ('https://www.example.com/test', False), +)) +def test_can_handle_link(test_url: str, expected: bool): + result = YoutubeDlFallback.can_handle_link(test_url) + assert result == expected + + +@pytest.mark.online +@pytest.mark.slow +@pytest.mark.parametrize(('test_url', 'expected_hash'), ( + ('https://streamable.com/dt46y', '1e7f4928e55de6e3ca23d85cc9246bbb'), + ('https://streamable.com/t8sem', '49b2d1220c485455548f1edbc05d4ecf'), + ('https://www.reddit.com/r/specializedtools/comments/n2nw5m/bamboo_splitter/', '21968d3d92161ea5e0abdcaf6311b06c'), + ('https://v.redd.it/9z1dnk3xr5k61', '351a2b57e888df5ccbc508056511f38d'), +)) +def test_find_resources(test_url: str, expected_hash: str): + test_submission = MagicMock() + test_submission.url = test_url + downloader = YoutubeDlFallback(test_submission) + resources = downloader.find_resources() + assert len(resources) == 1 + assert isinstance(resources[0], Resource) + assert resources[0].hash.hexdigest() == expected_hash diff --git a/tests/site_downloaders/test_download_factory.py b/tests/site_downloaders/test_download_factory.py index 65625b7..f02e9f7 100644 --- a/tests/site_downloaders/test_download_factory.py +++ b/tests/site_downloaders/test_download_factory.py @@ -9,18 +9,17 @@ from bdfr.site_downloaders.base_downloader import BaseDownloader from bdfr.site_downloaders.direct import Direct from bdfr.site_downloaders.download_factory import 
DownloadFactory from bdfr.site_downloaders.erome import Erome +from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import YoutubeDlFallback from bdfr.site_downloaders.gallery import Gallery from bdfr.site_downloaders.gfycat import Gfycat -from bdfr.site_downloaders.gif_delivery_network import GifDeliveryNetwork from bdfr.site_downloaders.imgur import Imgur from bdfr.site_downloaders.redgifs import Redgifs from bdfr.site_downloaders.self_post import SelfPost -from bdfr.site_downloaders.vreddit import VReddit from bdfr.site_downloaders.youtube import Youtube +@pytest.mark.online @pytest.mark.parametrize(('test_submission_url', 'expected_class'), ( - ('https://v.redd.it/9z1dnk3xr5k61', VReddit), ('https://www.reddit.com/r/TwoXChromosomes/comments/lu29zn/i_refuse_to_live_my_life' '_in_anything_but_comfort/', SelfPost), ('https://i.imgur.com/bZx1SJQ.jpg', Direct), @@ -35,12 +34,16 @@ from bdfr.site_downloaders.youtube import Youtube ('https://www.erome.com/a/NWGw0F09', Erome), ('https://youtube.com/watch?v=Gv8Wz74FjVA', Youtube), ('https://redgifs.com/watch/courageousimpeccablecanvasback', Redgifs), - ('https://www.gifdeliverynetwork.com/repulsivefinishedandalusianhorse', GifDeliveryNetwork), + ('https://www.gifdeliverynetwork.com/repulsivefinishedandalusianhorse', Redgifs), ('https://youtu.be/DevfjHOhuFc', Youtube), ('https://m.youtube.com/watch?v=kr-FeojxzUM', Youtube), ('https://i.imgur.com/3SKrQfK.jpg?1', Direct), ('https://dynasty-scans.com/system/images_images/000/017/819/original/80215103_p0.png?1612232781', Direct), ('https://m.imgur.com/a/py3RW0j', Imgur), + ('https://v.redd.it/9z1dnk3xr5k61', YoutubeDlFallback), + ('https://streamable.com/dt46y', YoutubeDlFallback), + ('https://vimeo.com/channels/31259/53576664', YoutubeDlFallback), + ('http://video.pbs.org/viralplayer/2365173446/', YoutubeDlFallback), )) def test_factory_lever_good(test_submission_url: str, expected_class: BaseDownloader, reddit_instance: praw.Reddit): result = DownloadFactory.pull_lever(test_submission_url) diff --git a/tests/site_downloaders/test_gif_delivery_network.py b/tests/site_downloaders/test_gif_delivery_network.py deleted file mode 100644 index 38819c1..0000000 --- a/tests/site_downloaders/test_gif_delivery_network.py +++ /dev/null @@ -1,37 +0,0 @@ -#!/usr/bin/env python3 -# coding=utf-8 - -from unittest.mock import Mock - -import pytest - -from bdfr.resource import Resource -from bdfr.site_downloaders.gif_delivery_network import GifDeliveryNetwork - - -@pytest.mark.online -@pytest.mark.parametrize(('test_url', 'expected'), ( - ('https://www.gifdeliverynetwork.com/regalshoddyhorsechestnutleafminer', - 'https://thumbs2.redgifs.com/RegalShoddyHorsechestnutleafminer.mp4'), - ('https://www.gifdeliverynetwork.com/maturenexthippopotamus', - 'https://thumbs2.redgifs.com/MatureNextHippopotamus.mp4'), -)) -def test_get_link(test_url: str, expected: str): - result = GifDeliveryNetwork._get_link(test_url) - assert result == expected - - -@pytest.mark.online -@pytest.mark.parametrize(('test_url', 'expected_hash'), ( - ('https://www.gifdeliverynetwork.com/maturenexthippopotamus', '9bec0a9e4163a43781368ed5d70471df'), - ('https://www.gifdeliverynetwork.com/regalshoddyhorsechestnutleafminer', '8afb4e2c090a87140230f2352bf8beba'), -)) -def test_download_resource(test_url: str, expected_hash: str): - mock_submission = Mock() - mock_submission.url = test_url - test_site = GifDeliveryNetwork(mock_submission) - resources = test_site.find_resources() - assert len(resources) == 1 - assert isinstance(resources[0], 
Resource)
-    resources[0].download(120)
-    assert resources[0].hash.hexdigest() == expected_hash
diff --git a/tests/site_downloaders/test_redgifs.py b/tests/site_downloaders/test_redgifs.py
index a325025..71fc18e 100644
--- a/tests/site_downloaders/test_redgifs.py
+++ b/tests/site_downloaders/test_redgifs.py
@@ -15,6 +15,10 @@ from bdfr.site_downloaders.redgifs import Redgifs
      'https://thumbs2.redgifs.com/FrighteningVictoriousSalamander.mp4'),
     ('https://redgifs.com/watch/springgreendecisivetaruca',
      'https://thumbs2.redgifs.com/SpringgreenDecisiveTaruca.mp4'),
+    ('https://www.gifdeliverynetwork.com/regalshoddyhorsechestnutleafminer',
+     'https://thumbs2.redgifs.com/RegalShoddyHorsechestnutleafminer.mp4'),
+    ('https://www.gifdeliverynetwork.com/maturenexthippopotamus',
+     'https://thumbs2.redgifs.com/MatureNextHippopotamus.mp4'),
 ))
 def test_get_link(test_url: str, expected: str):
     result = Redgifs._get_link(test_url)
@@ -25,6 +29,8 @@ def test_get_link(test_url: str, expected: str):
 @pytest.mark.parametrize(('test_url', 'expected_hash'), (
     ('https://redgifs.com/watch/frighteningvictorioussalamander', '4007c35d9e1f4b67091b5f12cffda00a'),
     ('https://redgifs.com/watch/springgreendecisivetaruca', '8dac487ac49a1f18cc1b4dabe23f0869'),
+    ('https://www.gifdeliverynetwork.com/maturenexthippopotamus', '9bec0a9e4163a43781368ed5d70471df'),
+    ('https://www.gifdeliverynetwork.com/regalshoddyhorsechestnutleafminer', '8afb4e2c090a87140230f2352bf8beba'),
 ))
 def test_download_resource(test_url: str, expected_hash: str):
     mock_submission = Mock()
diff --git a/tests/site_downloaders/test_vreddit.py b/tests/site_downloaders/test_vreddit.py
deleted file mode 100644
index 3b663c2..0000000
--- a/tests/site_downloaders/test_vreddit.py
+++ /dev/null
@@ -1,23 +0,0 @@
-#!/usr/bin/env python3
-# coding=utf-8
-
-import praw
-import pytest
-
-from bdfr.resource import Resource
-from bdfr.site_downloaders.vreddit import VReddit
-
-
-@pytest.mark.online
-@pytest.mark.reddit
-@pytest.mark.parametrize(('test_submission_id'), (
-    ('lu8l8g'),
-))
-def test_find_resources(test_submission_id: str, reddit_instance: praw.Reddit):
-    test_submission = reddit_instance.submission(id=test_submission_id)
-    downloader = VReddit(test_submission)
-    resources = downloader.find_resources()
-    assert len(resources) == 1
-    assert isinstance(resources[0], Resource)
-    resources[0].download(120)
-    assert resources[0].content is not None
diff --git a/tests/test_download_filter.py b/tests/test_download_filter.py
index 3c2adba..ead2b2f 100644
--- a/tests/test_download_filter.py
+++ b/tests/test_download_filter.py
@@ -1,9 +1,12 @@
 #!/usr/bin/env python3
 # coding=utf-8
 
+from unittest.mock import MagicMock
+
 import pytest
 
 from bdfr.download_filter import DownloadFilter
+from bdfr.resource import Resource
 
 
 @pytest.fixture()
@@ -11,13 +14,14 @@ def download_filter() -> DownloadFilter:
     return DownloadFilter(['mp4', 'mp3'], ['test.com', 'reddit.com'])
 
 
-@pytest.mark.parametrize(('test_url', 'expected'), (
-    ('test.mp4', False),
-    ('test.avi', True),
-    ('test.random.mp3', False),
+@pytest.mark.parametrize(('test_extension', 'expected'), (
+    ('.mp4', False),
+    ('.avi', True),
+    ('.random.mp3', False),
+    ('mp4', False),
 ))
-def test_filter_extension(test_url: str, expected: bool, download_filter: DownloadFilter):
-    result = download_filter._check_extension(test_url)
+def test_filter_extension(test_extension: str, expected: bool, download_filter: DownloadFilter):
+    result = download_filter._check_extension(test_extension)
     assert result == expected
 
 
@@ -42,7 +46,8 @@ def test_filter_domain(test_url: str, expected: bool, download_filter: DownloadF
     ('http://reddit.com/test.gif', False),
 ))
 def test_filter_all(test_url: str, expected: bool, download_filter: DownloadFilter):
-    result = download_filter.check_url(test_url)
+    test_resource = Resource(MagicMock(), test_url)
+    result = download_filter.check_resource(test_resource)
     assert result == expected
 
 
@@ -54,5 +59,6 @@ def test_filter_all(test_url: str, expected: bool, download_filter: DownloadFilt
 ))
 def test_filter_empty_filter(test_url: str):
     download_filter = DownloadFilter()
-    result = download_filter.check_url(test_url)
+    test_resource = Resource(MagicMock(), test_url)
+    result = download_filter.check_resource(test_resource)
     assert result is True
diff --git a/tests/test_downloader.py b/tests/test_downloader.py
index 0d609ef..f1a20fc 100644
--- a/tests/test_downloader.py
+++ b/tests/test_downloader.py
@@ -22,6 +22,7 @@ from bdfr.site_authenticator import SiteAuthenticator
 @pytest.fixture()
 def args() -> Configuration:
     args = Configuration()
+    args.time_format = 'ISO'
     return args
 
 
@@ -458,3 +459,75 @@ def test_read_excluded_submission_ids_from_file(downloader_mock: MagicMock, tmp_
     downloader_mock.args.exclude_id_file = [test_file]
     results = RedditDownloader._read_excluded_ids(downloader_mock)
     assert results == {'aaaaaa', 'bbbbbb'}
+
+
+@pytest.mark.online
+@pytest.mark.reddit
+@pytest.mark.parametrize('test_redditor_name', (
+    'Paracortex',
+    'crowdstrike',
+    'HannibalGoddamnit',
+))
+def test_check_user_existence_good(
+        test_redditor_name: str,
+        reddit_instance: praw.Reddit,
+        downloader_mock: MagicMock,
+):
+    downloader_mock.reddit_instance = reddit_instance
+    RedditDownloader._check_user_existence(downloader_mock, test_redditor_name)
+
+
+@pytest.mark.online
+@pytest.mark.reddit
+@pytest.mark.parametrize('test_redditor_name', (
+    'lhnhfkuhwreolo',
+    'adlkfmnhglojh',
+))
+def test_check_user_existence_nonexistent(
+        test_redditor_name: str,
+        reddit_instance: praw.Reddit,
+        downloader_mock: MagicMock,
+):
+    downloader_mock.reddit_instance = reddit_instance
+    with pytest.raises(BulkDownloaderException, match='Could not find'):
+        RedditDownloader._check_user_existence(downloader_mock, test_redditor_name)
+
+
+@pytest.mark.online
+@pytest.mark.reddit
+@pytest.mark.parametrize('test_redditor_name', (
+    'Bree-Boo',
+))
+def test_check_user_existence_banned(
+        test_redditor_name: str,
+        reddit_instance: praw.Reddit,
+        downloader_mock: MagicMock,
+):
+    downloader_mock.reddit_instance = reddit_instance
+    with pytest.raises(BulkDownloaderException, match='is banned'):
+        RedditDownloader._check_user_existence(downloader_mock, test_redditor_name)
+
+
+@pytest.mark.online
+@pytest.mark.reddit
+@pytest.mark.parametrize(('test_subreddit_name', 'expected_message'), (
+    ('donaldtrump', 'cannot be found'),
+    ('submitters', 'private and cannot be scraped')
+))
+def test_check_subreddit_status_bad(test_subreddit_name: str, expected_message: str, reddit_instance: praw.Reddit):
+    test_subreddit = reddit_instance.subreddit(test_subreddit_name)
+    with pytest.raises(BulkDownloaderException, match=expected_message):
+        RedditDownloader._check_subreddit_status(test_subreddit)
+
+
+@pytest.mark.online
+@pytest.mark.reddit
+@pytest.mark.parametrize('test_subreddit_name', (
+    'Python',
+    'Mindustry',
+    'TrollXChromosomes',
+    'all',
+))
+def test_check_subreddit_status_good(test_subreddit_name: str, reddit_instance: praw.Reddit):
+    test_subreddit = reddit_instance.subreddit(test_subreddit_name)
+    RedditDownloader._check_subreddit_status(test_subreddit)
diff --git a/tests/test_file_name_formatter.py b/tests/test_file_name_formatter.py
index 7a91d8c..b1faf86 100644
--- a/tests/test_file_name_formatter.py
+++ b/tests/test_file_name_formatter.py
@@ -1,9 +1,11 @@
 #!/usr/bin/env python3
 # coding=utf-8
 
+from datetime import datetime
 from pathlib import Path
 from typing import Optional
 from unittest.mock import MagicMock
+import platform
 
 import praw.models
 import pytest
@@ -21,29 +23,45 @@ def submission() -> MagicMock:
     test.id = '12345'
     test.score = 1000
     test.link_flair_text = 'test_flair'
-    test.created_utc = 123456789
+    test.created_utc = datetime(2021, 4, 21, 9, 30, 0).timestamp()
     test.__class__ = praw.models.Submission
     return test
 
 
+def do_test_string_equality(result: str, expected: str) -> bool:
+    if platform.system() == 'Windows':
+        expected = FileNameFormatter._format_for_windows(expected)
+    return expected == result
+
+
+def do_test_path_equality(result: Path, expected: str) -> bool:
+    if platform.system() == 'Windows':
+        expected = expected.split('/')
+        expected = [FileNameFormatter._format_for_windows(part) for part in expected]
+        expected = Path(*expected)
+    else:
+        expected = Path(expected)
+    return result == expected
+
+
 @pytest.fixture(scope='session')
 def reddit_submission(reddit_instance: praw.Reddit) -> praw.models.Submission:
     return reddit_instance.submission(id='lgilgt')
 
 
-@pytest.mark.parametrize(('format_string', 'expected'), (
+@pytest.mark.parametrize(('test_format_string', 'expected'), (
     ('{SUBREDDIT}', 'randomreddit'),
     ('{REDDITOR}', 'person'),
     ('{POSTID}', '12345'),
     ('{UPVOTES}', '1000'),
     ('{FLAIR}', 'test_flair'),
-    ('{DATE}', '123456789'),
+    ('{DATE}', '2021-04-21T09:30:00'),
     ('{REDDITOR}_{TITLE}_{POSTID}', 'person_name_12345'),
-    ('{RANDOM}', '{RANDOM}'),
 ))
-def test_format_name_mock(format_string: str, expected: str, submission: MagicMock):
-    result = FileNameFormatter._format_name(submission, format_string)
-    assert result == expected
+def test_format_name_mock(test_format_string: str, expected: str, submission: MagicMock):
+    test_formatter = FileNameFormatter(test_format_string, '', 'ISO')
+    result = test_formatter._format_name(submission, test_format_string)
+    assert do_test_string_equality(result, expected)
 
 
 @pytest.mark.parametrize(('test_string', 'expected'), (
@@ -62,7 +80,7 @@ def test_check_format_string_validity(test_string: str, expected: bool):
 
 @pytest.mark.online
 @pytest.mark.reddit
-@pytest.mark.parametrize(('format_string', 'expected'), (
+@pytest.mark.parametrize(('test_format_string', 'expected'), (
     ('{SUBREDDIT}', 'Mindustry'),
     ('{REDDITOR}', 'Gamer_player_boi'),
     ('{POSTID}', 'lgilgt'),
@@ -70,9 +88,10 @@ def test_check_format_string_validity(test_string: str, expected: bool):
     ('{SUBREDDIT}_{TITLE}', 'Mindustry_Toxopid that is NOT humane >:('),
     ('{REDDITOR}_{TITLE}_{POSTID}', 'Gamer_player_boi_Toxopid that is NOT humane >:(_lgilgt')
 ))
-def test_format_name_real(format_string: str, expected: str, reddit_submission: praw.models.Submission):
-    result = FileNameFormatter._format_name(reddit_submission, format_string)
-    assert result == expected
+def test_format_name_real(test_format_string: str, expected: str, reddit_submission: praw.models.Submission):
+    test_formatter = FileNameFormatter(test_format_string, '', '')
+    result = test_formatter._format_name(reddit_submission, test_format_string)
+    assert do_test_string_equality(result, expected)
 
 
 @pytest.mark.online
@@ -100,9 +119,9 @@ def test_format_full(
         expected: str,
         reddit_submission: praw.models.Submission):
     test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png')
-    test_formatter = FileNameFormatter(format_string_file, format_string_directory)
+    test_formatter = FileNameFormatter(format_string_file, format_string_directory, 'ISO')
    result = test_formatter.format_path(test_resource, Path('test'))
-    assert str(result) == expected
+    assert do_test_path_equality(result, expected)
 
 
 @pytest.mark.online
@@ -117,7 +136,7 @@ def test_format_full_conform(
         format_string_file: str,
         reddit_submission: praw.models.Submission):
     test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png')
-    test_formatter = FileNameFormatter(format_string_file, format_string_directory)
+    test_formatter = FileNameFormatter(format_string_file, format_string_directory, 'ISO')
     test_formatter.format_path(test_resource, Path('test'))
 
 
@@ -137,9 +156,9 @@ def test_format_full_with_index_suffix(
         reddit_submission: praw.models.Submission,
 ):
     test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png')
-    test_formatter = FileNameFormatter(format_string_file, format_string_directory)
+    test_formatter = FileNameFormatter(format_string_file, format_string_directory, 'ISO')
     result = test_formatter.format_path(test_resource, Path('test'), index)
-    assert str(result) == expected
+    assert do_test_path_equality(result, expected)
 
 
 def test_format_multiple_resources():
@@ -151,7 +170,7 @@
     new_mock.source_submission.title = 'test'
     new_mock.source_submission.__class__ = praw.models.Submission
     mocks.append(new_mock)
-    test_formatter = FileNameFormatter('{TITLE}', '')
+    test_formatter = FileNameFormatter('{TITLE}', '', 'ISO')
     results = test_formatter.format_resource_paths(mocks, Path('.'))
     results = set([str(res[0]) for res in results])
     assert results == {'test_1.png', 'test_2.png', 'test_3.png', 'test_4.png'}
@@ -195,7 +214,7 @@ def test_shorten_filenames(submission: MagicMock, tmp_path: Path):
     submission.subreddit.display_name = 'test'
     submission.id = 'BBBBBB'
     test_resource = Resource(submission, 'www.example.com/empty', '.jpeg')
-    test_formatter = FileNameFormatter('{REDDITOR}_{TITLE}_{POSTID}', '{SUBREDDIT}')
+    test_formatter = FileNameFormatter('{REDDITOR}_{TITLE}_{POSTID}', '{SUBREDDIT}', 'ISO')
     result = test_formatter.format_path(test_resource, tmp_path)
     result.parent.mkdir(parents=True)
     result.touch()
@@ -236,7 +255,8 @@ def test_strip_emojies(test_string: str, expected: str):
 ))
 def test_generate_dict_for_submission(test_submission_id: str, expected: dict, reddit_instance: praw.Reddit):
     test_submission = reddit_instance.submission(id=test_submission_id)
-    result = FileNameFormatter._generate_name_dict_from_submission(test_submission)
+    test_formatter = FileNameFormatter('{TITLE}', '', 'ISO')
+    result = test_formatter._generate_name_dict_from_submission(test_submission)
     assert all([result.get(key) == expected[key] for key in expected.keys()])
 
 
@@ -252,7 +272,8 @@ def test_generate_dict_for_submission(test_submission_id: str, expected: dict, r
 ))
 def test_generate_dict_for_comment(test_comment_id: str, expected: dict, reddit_instance: praw.Reddit):
     test_comment = reddit_instance.comment(id=test_comment_id)
-    result = FileNameFormatter._generate_name_dict_from_comment(test_comment)
+    test_formatter = FileNameFormatter('{TITLE}', '', 'ISO')
+    result = test_formatter._generate_name_dict_from_comment(test_comment)
     assert all([result.get(key) == expected[key] for key in expected.keys()])
 
 
@@ -271,10 +292,10 @@ def test_format_archive_entry_comment(
         reddit_instance: praw.Reddit,
 ):
     test_comment = reddit_instance.comment(id=test_comment_id)
-    test_formatter = FileNameFormatter(test_file_scheme, test_folder_scheme)
+    test_formatter = FileNameFormatter(test_file_scheme, test_folder_scheme, 'ISO')
     test_entry = Resource(test_comment, '', '.json')
     result = test_formatter.format_path(test_entry, tmp_path)
-    assert result.name == expected_name
+    assert do_test_string_equality(result.name, expected_name)
 
 
 @pytest.mark.parametrize(('test_folder_scheme', 'expected'), (
@@ -287,13 +308,13 @@ def test_multilevel_folder_scheme(
         tmp_path: Path,
         submission: MagicMock,
 ):
-    test_formatter = FileNameFormatter('{POSTID}', test_folder_scheme)
+    test_formatter = FileNameFormatter('{POSTID}', test_folder_scheme, 'ISO')
     test_resource = MagicMock()
     test_resource.source_submission = submission
     test_resource.extension = '.png'
     result = test_formatter.format_path(test_resource, tmp_path)
     result = result.relative_to(tmp_path)
-    assert str(result.parent) == expected
+    assert do_test_path_equality(result.parent, expected)
     assert len(result.parents) == (len(expected.split('/')) + 1)
 
 
@@ -307,8 +328,9 @@ def test_multilevel_folder_scheme(
 ))
 def test_preserve_emojis(test_name_string: str, expected: str, submission: MagicMock):
     submission.title = test_name_string
-    result = FileNameFormatter._format_name(submission, '{TITLE}')
-    assert result == expected
+    test_formatter = FileNameFormatter('{TITLE}', '', 'ISO')
+    result = test_formatter._format_name(submission, '{TITLE}')
+    assert do_test_string_equality(result, expected)
 
 
 @pytest.mark.parametrize(('test_string', 'expected'), (
@@ -318,3 +340,27 @@
 def test_convert_unicode_escapes(test_string: str, expected: str):
     result = FileNameFormatter._convert_unicode_escapes(test_string)
     assert result == expected
+
+
+@pytest.mark.parametrize(('test_datetime', 'expected'), (
+    (datetime(2020, 1, 1, 8, 0, 0), '2020-01-01T08:00:00'),
+    (datetime(2020, 1, 1, 8, 0), '2020-01-01T08:00:00'),
+    (datetime(2021, 4, 21, 8, 30, 21), '2021-04-21T08:30:21'),
+))
+def test_convert_timestamp(test_datetime: datetime, expected: str):
+    test_timestamp = test_datetime.timestamp()
+    test_formatter = FileNameFormatter('{POSTID}', '', 'ISO')
+    result = test_formatter._convert_timestamp(test_timestamp)
+    assert result == expected
+
+
+@pytest.mark.parametrize(('test_time_format', 'expected'), (
+    ('ISO', '2021-05-02T13:33:00'),
+    ('%Y_%m', '2021_05'),
+    ('%Y-%m-%d', '2021-05-02'),
+))
+def test_time_string_formats(test_time_format: str, expected: str):
+    test_time = datetime(2021, 5, 2, 13, 33)
+    test_formatter = FileNameFormatter('{TITLE}', '', test_time_format)
+    result = test_formatter._convert_timestamp(test_time.timestamp())
+    assert result == expected
diff --git a/tests/test_integration.py b/tests/test_integration.py
index 396025b..7aec0eb 100644
--- a/tests/test_integration.py
+++ b/tests/test_integration.py
@@ -11,6 +11,28 @@ from bdfr.__main__ import cli
 
 does_test_config_exist = Path('test_config.cfg').exists()
 
+
+def create_basic_args_for_download_runner(test_args: list[str], tmp_path: Path):
+    out = [
+        'download', str(tmp_path),
+        '-v',
+        '--config', 'test_config.cfg',
+        '--log', str(Path(tmp_path, 'test_log.txt')),
+    ] + test_args
+    return out
+
+
+def create_basic_args_for_archive_runner(test_args: list[str], tmp_path: Path):
+    out = [
+        'archive',
+        str(tmp_path),
+        '-v',
+        '--config', 'test_config.cfg',
+        '--log', str(Path(tmp_path, 'test_log.txt')),
+    ] + test_args
+    return out
+
+
 @pytest.mark.online
 @pytest.mark.reddit
 @pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@@ -35,7 +57,7 @@ does_test_config_exist = Path('test_config.cfg').exists()
 ))
 def test_cli_download_subreddits(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert 'Added submissions from subreddit ' in result.output
@@ -53,7 +75,7 @@ def test_cli_download_subreddits(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_download_links(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
 
@@ -69,7 +91,7 @@ def test_cli_download_links(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_download_multireddit(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert 'Added submissions from multireddit ' in result.output
@@ -83,7 +105,7 @@ def test_cli_download_multireddit(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_download_multireddit_nonexistent(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert 'Failed to get submissions for multireddit' in result.output
@@ -104,7 +126,7 @@ def test_cli_download_multireddit_nonexistent(test_args: list[str], tmp_path: Pa
 ))
 def test_cli_download_user_data_good(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert 'Downloaded submission ' in result.output
@@ -119,7 +141,7 @@ def test_cli_download_user_data_good(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_download_user_data_bad_me_unauthenticated(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert 'To use "me" as a user, an authenticated Reddit instance must be used' in result.output
@@ -134,7 +156,7 @@ def test_cli_download_user_data_bad_me_unauthenticated(test_args: list[str], tmp
 def test_cli_download_search_existing(test_args: list[str], tmp_path: Path):
     Path(tmp_path, 'test.txt').touch()
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert 'Calculating hashes for' in result.output
@@ -145,13 +167,14 @@ def test_cli_download_search_existing(test_args: list[str], tmp_path: Path):
 @pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
 @pytest.mark.parametrize('test_args', (
     ['--subreddit', 'tumblr', '-L', '25', '--skip', 'png', '--skip', 'jpg'],
+    ['--subreddit', 'MaliciousCompliance', '-L', '25', '--skip', 'txt'],
 ))
 def test_cli_download_download_filters(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
-    assert 'Download filter removed submission' in result.output
+    assert 'Download filter removed ' in result.output
 
 
 @pytest.mark.online
@@ -163,7 +186,7 @@ def test_cli_download_download_filters(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_download_long(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
 
@@ -173,11 +196,12 @@ def test_cli_download_long(test_args: list[str], tmp_path: Path):
 @pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
 @pytest.mark.parametrize('test_args', (
     ['-l', 'gstd4hk'],
-    ['-l', 'm2601g'],
+    ['-l', 'm2601g', '-f', 'yaml'],
+    ['-l', 'n60t4c', '-f', 'xml'],
 ))
 def test_cli_archive_single(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['archive', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert re.search(r'Writing entry .*? to file in .*? format', result.output)
@@ -196,7 +220,7 @@ def test_cli_archive_single(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_archive_subreddit(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['archive', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert re.search(r'Writing entry .*? to file in .*? format', result.output)
@@ -210,7 +234,7 @@ def test_cli_archive_subreddit(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_archive_all_user_comments(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['archive', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
 
@@ -225,7 +249,7 @@ def test_cli_archive_all_user_comments(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_archive_long(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['archive', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert re.search(r'Writing entry .*? to file in .*? format', result.output)
@@ -239,10 +263,12 @@ def test_cli_archive_long(test_args: list[str], tmp_path: Path):
     ['--user', 'sdclhgsolgjeroij', '--submitted', '-L', 10],
     ['--user', 'me', '--upvoted', '-L', 10],
     ['--user', 'sdclhgsolgjeroij', '--upvoted', '-L', 10],
+    ['--subreddit', 'submitters', '-L', 10],  # Private subreddit
+    ['--subreddit', 'donaldtrump', '-L', 10],  # Banned subreddit
 ))
 def test_cli_download_soft_fail(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
 
@@ -257,7 +283,7 @@ def test_cli_download_soft_fail(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_download_hard_fail(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code != 0
 
@@ -277,7 +303,7 @@ def test_cli_download_use_default_config(tmp_path: Path):
 ))
 def test_cli_download_links_exclusion(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert 'in exclusion list' in result.output
@@ -293,7 +319,7 @@ def test_cli_download_links_exclusion(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_download_subreddit_exclusion(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert 'in skip list' in result.output
@@ -309,7 +335,7 @@ def test_cli_download_subreddit_exclusion(test_args: list[str], tmp_path: Path):
 ))
 def test_cli_file_scheme_warning(test_args: list[str], tmp_path: Path):
     runner = CliRunner()
-    test_args = ['download', str(tmp_path), '-v', '--config', 'test_config.cfg'] + test_args
+    test_args = create_basic_args_for_download_runner(test_args, tmp_path)
     result = runner.invoke(cli, test_args)
     assert result.exit_code == 0
     assert 'Some files might not be downloaded due to name conflicts' in result.output