mirror of synced 2024-07-05 06:20:37 +12:00
Commit graph

171 commits

Author SHA1 Message Date
Nick Sweeting bd6d9c165b enforce utf8 on literally all file operations because windows sucks 2021-03-27 01:16:29 -04:00
Nick Sweeting 084cf7ff51 add more explanation about snapshot.save timestamp bump 2021-02-17 13:34:46 -05:00
Nick Sweeting acb932ba12 improve readability and mercury error handling and fix output path to be relative 2021-02-16 15:53:11 -05:00
Nick Sweeting c95698e608 bump Snapshot.updated time after each extractor, change extractor order 2021-02-16 15:52:18 -05:00
Nick Sweeting d0f8a5e710 change mercury atomic_write output order 2021-02-16 06:19:16 -05:00
Nick Sweeting 7d0f5653c3 fix lgtm alerts 2021-02-01 02:27:24 -05:00
Nick Sweeting 04c951cdd5 fix alerts 2021-02-01 02:22:02 -05:00
Nick Sweeting 846c966c4d use globbing to find wget output path 2021-01-30 22:02:39 -05:00
Nick Sweeting e6fa16e13a only chmod wget output if it exists 2021-01-30 22:02:11 -05:00
Nick Sweeting 385daf9af8 save the url as title for staticfiles or non html files 2021-01-30 22:01:49 -05:00
Nick Sweeting b9b1c3d9e8 fix singlefile output path not relative 2021-01-30 20:44:49 -05:00
Nick Sweeting d6de04a83a fix lgtm errors 2021-01-30 06:07:35 -05:00
Nick Sweeting c2aaa41c76 fix missing str path 2021-01-30 01:25:08 -05:00
Nick Sweeting 15e58bd366 fix using os.path calls on pathlib paths 2021-01-27 11:27:40 -05:00
Nick Sweeting 9764a8ed9b check for non html files from wget 2021-01-25 18:15:16 -05:00
Dan Arnfield 5420903102 Refactor should_save_extractor methods to accept overwrite parameter 2021-01-21 15:56:32 -06:00
Nick Sweeting ef7711ffa0 fix cookies file arg is path 2021-01-20 19:13:53 -05:00
Cristian 6031ffa3b2 fix: Mercury extractor error was incorrectly initialized 2021-01-07 09:22:46 -05:00
Cristian e9e4adfc34 fix: wget_output_path failing on some extractors. Add a new condition 2021-01-07 09:07:29 -05:00
Cristian 81d766aba1 refactor: Remove setup_django from title.py 2020-12-11 16:03:50 -05:00
Cristian 275ad22db7 refactor: Remove skip_index from archive related functions 2020-12-08 18:42:25 -05:00
Cristian f6c73f9aeb fix: Issue with oneshot command 2020-12-08 18:42:25 -05:00
JDC 7903db6dfb Add ArchiveResult Manager and sorted indexable filter 2020-12-06 01:13:39 +02:00
JDC b1f70b2197 Initial implementation 2020-12-06 01:12:45 +02:00
Cristian 33182fd53c fix: Add missing assignation 2020-11-04 15:07:45 -05:00
Cristian d064a3eeff fix: Handle case when update tries to re-add a link that is not in the sql index 2020-11-04 15:02:54 -05:00
Cristian f292cface2 fix: Add condition for oneshot when archiving links 2020-11-04 14:40:44 -05:00
Cristian 4484491fb7 feat: Create ArchiveResult after finishing an extractor process 2020-11-04 11:22:55 -05:00
Cristian ac0ec160d1 lint: Fix warnings in master branch 2020-11-02 08:51:48 -05:00
Nick Sweeting ac9e0e356d config fixes 2020-10-31 07:57:11 -04:00
Nick Sweeting 18355dc2c6 clean up config loading in settings and config file layout 2020-10-31 03:08:03 -04:00
Cristian e7e33ea7a5 tests: Add tests for several different ways to extract the title 2020-10-30 08:04:26 -05:00
Nick Sweeting f727ece7b3 add regex fallback back to title parser 2020-10-30 04:57:31 -04:00
Nick Sweeting 79bef1384e Merge pull request #493 from ttimasdf/feat-ogtitle (Feature: add og:title metadata as alternative title) 2020-10-30 04:51:14 -04:00
Cristian c12fe0e3d7 feat: Use CURL_ARGS on title extractor 2020-10-22 08:46:16 -05:00
Cristian 563d0f94ec feat: Use CURL_ARGS in favicon extractor 2020-10-22 08:46:16 -05:00
Cristian 2e1cdca789 feat: Use CURL_ARGS on header extractor 2020-10-22 08:46:16 -05:00
Cristian 972d57bd08 feat: Add CURL_ARGS to control curl arguments 2020-10-22 08:46:16 -05:00
Cristian 24e7a74855 feat: Add WGET_ARGS to control wget arguments 2020-10-22 08:46:16 -05:00
Cristian bc02e0ffe3 feat: Add config for youtubedl (YOUTUBEDL_ARGS) 2020-10-22 08:46:16 -05:00
Angel Rey ce71747538 replaced os.path in init extractors 2020-10-02 15:46:39 -05:00
Angel Rey 3fb410a604 Replaced os.path in favicon.py 2020-10-02 15:46:39 -05:00
ttimasdf eda3836dee feat: add og:title metadata as alternative title 2020-09-27 12:54:52 +08:00
Cristian abde871a3c fix: Wget absolute path generating issues 2020-09-25 08:24:06 -05:00
Cristian 7d3767b882 fix: oneshot command not running extractors 2020-09-24 12:56:16 -05:00
Cristian 62ed11a5ca fix: Improve headers handling 2020-09-24 12:55:51 -05:00
Angel Rey a40af98ced removed static file check 2020-09-24 12:55:51 -05:00
Angel Rey dc160daba8 Fixed lint 2020-09-23 11:07:00 -05:00
Angel Rey 7fd7dced9a Added curl params 2020-09-23 11:07:00 -05:00
Angel Rey 852e3c9cff Added headers extractor 2020-09-23 11:07:00 -05:00
Cristian eb34a6af62 lint: Fix mercury extractor lint issues 2020-09-23 10:35:39 -05:00
Cristian 46b9e3d536 fix: Fix mercury extractor test 2020-09-23 10:34:05 -05:00
ttimasdf 357b677363 fix: add mercury-parser to extractors list 2020-09-22 18:44:12 -05:00
ttimasdf 706bd895e0 feat: Add mercury-parser 2020-09-22 18:44:12 -05:00
Cristian b18bbf8874 test: Fix tests post-rebase 2020-09-17 09:09:52 -05:00
Cristian 50f3f16203 lint: Remove unused import 2020-09-15 08:05:46 -05:00
Cristian 0a83392cbf fix: Replace any typing with Union[Iterable[Link], QuerySet] in archive_links 2020-09-15 08:05:46 -05:00
Cristian 018bd91745 refactor: Remove get_iter lambda from archive_links 2020-09-15 08:05:46 -05:00
Cristian 01fb44fd40 refactor: Change archive_links check to focus on queryset, so it allows other iterables and not just lists 2020-09-15 08:05:46 -05:00
Cristian fe9604a772 feat: Add tests for remove command 2020-09-15 08:05:46 -05:00
Cristian be520d137a feat: Refactor add method to use querysets 2020-09-15 08:05:46 -05:00
Cristian 874403e667 feat: Remove patch_main_index 2020-09-15 08:05:46 -05:00
Cristian 31343c1367 feat: Update extractors and add command to use sql index as source of truth 2020-09-15 08:05:46 -05:00
Cristian bd3c824d45 fix: Escape JSON output on command failure so the user can run the command manually 2020-09-04 10:23:41 -05:00
Nick Sweeting a645f36b87 add comment about fake cmd 2020-09-01 19:42:22 -04:00
Cristian 66037535fd feat: Add curl command on readability as default command to debug 2020-09-01 10:16:24 -05:00
Cristian bf3ea42141 fix: Add a default cmd value to handle case where the html cannot be retrieved 2020-08-27 09:51:33 -05:00
Nick Sweeting a2c158e43e catch OSErrors due to missing path 2020-08-18 19:09:45 -04:00
Nick Sweeting 7144e0bdce search for node dependencies in output dir first 2020-08-18 18:40:19 -04:00
Nick Sweeting e87f1d57a3 fix linters 2020-08-18 09:22:12 -04:00
Nick Sweeting c9b3bab84d fix pull title not working 2020-08-18 08:49:26 -04:00
Nick Sweeting b0c0a676f8 re-enable readability and singlefile by default now that its less noisy 2020-08-18 08:29:46 -04:00
Nick Sweeting d7d53cfb12 dont show skipped extractors to reduce visual noise 2020-08-18 08:13:35 -04:00
Nick Sweeting 92de20af15 better detect missing dependencies on startup 2020-08-18 04:38:13 -04:00
Nick Sweeting b681a477ae add overwrite flag to add command to force re-archiving 2020-08-18 04:37:54 -04:00
Cristian 05c71fc302 fix: Organize readability extractor so a timeout does not break the whole process 2020-08-17 08:34:40 -05:00
Nick Sweeting 58e928520a tweak log output for skipped methods 2020-08-14 13:12:50 -04:00
Nick Sweeting 03b73bfe77 Update archivebox/extractors/readability.py 2020-08-14 12:55:22 -04:00
Cristian b7aa3df8d2 feat: Disable singlefile and readability by default 2020-08-12 14:42:21 -05:00
Cristian 5dc7e63792 feat: Update dockerfile to support readability 2020-08-11 11:52:43 -05:00
Cristian 2a68af1b94 tests: Add readability tests 2020-08-11 11:15:15 -05:00
Cristian 8aa7b34de7 tests: Add readability to ignored methods in tests 2020-08-11 08:58:49 -05:00
Cristian dc87d8b68c tests: Update failing tests 2020-08-11 08:48:13 -05:00
Cristian 0ec747f64e feat: Look in wget, singlefile or dom outputs before attempting to download the information again 2020-08-11 08:37:12 -05:00
Cristian a14762640e feat: Avoid running readability when the target is a file 2020-08-11 08:37:12 -05:00
Cristian 61e08a7c43 docs: Update docs link 2020-08-11 08:37:12 -05:00
Cristian b33c66a9f7 feat: Split output of readability into multiple files 2020-08-11 08:37:12 -05:00
Cristian 7e2b249388 feat: Initial version of readability extractor 2020-08-11 08:37:12 -05:00
Nick Sweeting 430be7bc68 add missing staticfile check to singlefile 2020-08-10 13:42:20 -04:00
Cristian 06d0e9de6c feat: Add support for singlefile in docker 2020-08-03 13:23:05 -05:00
Nick Sweeting 5b6eb5e4ad make filenames consistent with program name 2020-08-03 13:23:05 -05:00
Cristian 42b0c80465 feat: Add singlefile to link_details 2020-08-03 13:22:06 -05:00
Cristian 787a5ad43e fix: Commit code review suggestions 2020-08-03 13:22:06 -05:00
Cristian 853685668c feat: Add initial support for singlefile extractor 2020-08-03 13:22:06 -05:00
Cristian e6c571beb2 fix: Remove title from extractors for oneshot 2020-07-31 10:24:58 -05:00
Cristian 8bcb171e74 fix: Remove support for multiple urls in oneshot command 2020-07-31 09:05:40 -05:00
Cristian 3afb2401bc fix: Add condition to avoid breaking the add command 2020-07-29 11:53:49 -05:00
Cristian c073ea141d feat: Initial oneshot command proposal 2020-07-29 11:19:06 -05:00
Nick Sweeting 2e0b751376 accept methods argument to filder archive_link 2020-07-28 05:58:38 -04:00
Nick Sweeting 032c2458de add missing setup_django import 2020-07-28 05:58:13 -04:00