mirror of synced 2024-07-01 04:20:55 +12:00
Commit graph

153 commits

Author SHA1 Message Date
Cristian e9e4adfc34 fix: wget_output_path failing on some extractors. Add a new condition 2021-01-07 09:07:29 -05:00
Cristian 81d766aba1 refactor: Remove setup_django from title.py 2020-12-11 16:03:50 -05:00
Cristian 275ad22db7 refactor: Remove skip_index from archive related functions 2020-12-08 18:42:25 -05:00
Cristian f6c73f9aeb fix: Issue with oneshot command 2020-12-08 18:42:25 -05:00
JDC 7903db6dfb Add ArchiveResult Manager and sorted indexable filter 2020-12-06 01:13:39 +02:00
JDC b1f70b2197 Initial implementation 2020-12-06 01:12:45 +02:00
Cristian 33182fd53c fix: Add missing assignation 2020-11-04 15:07:45 -05:00
Cristian d064a3eeff fix: Handle case when update tries to re-add a link that is not in the sql index 2020-11-04 15:02:54 -05:00
Cristian f292cface2 fix: Add condition for oneshot when archiving links 2020-11-04 14:40:44 -05:00
Cristian 4484491fb7 feat: Create ArchiveResult after finishing an extractor process 2020-11-04 11:22:55 -05:00
Cristian ac0ec160d1 lint: Fix warnings in master branch 2020-11-02 08:51:48 -05:00
Nick Sweeting ac9e0e356d config fixes 2020-10-31 07:57:11 -04:00
Nick Sweeting 18355dc2c6 clean up config loading in settings and config file layout 2020-10-31 03:08:03 -04:00
Cristian e7e33ea7a5 tests: Add tests for several different ways to extract the title 2020-10-30 08:04:26 -05:00
Nick Sweeting f727ece7b3 add regex fallback back to title parser 2020-10-30 04:57:31 -04:00
Nick Sweeting 79bef1384e Merge pull request #493 from ttimasdf/feat-ogtitle (Feature: add og:title metadata as alternative title) 2020-10-30 04:51:14 -04:00
Cristian c12fe0e3d7 feat: Use CURL_ARGS on title extractor 2020-10-22 08:46:16 -05:00
Cristian 563d0f94ec feat: Use CURL_ARGS in favicon extractor 2020-10-22 08:46:16 -05:00
Cristian 2e1cdca789 feat: Use CURL_ARGS on header extractor 2020-10-22 08:46:16 -05:00
Cristian 972d57bd08 feat: Add CURL_ARGS to control curl arguments 2020-10-22 08:46:16 -05:00
Cristian 24e7a74855 feat: Add WGET_ARGS to control wget arguments 2020-10-22 08:46:16 -05:00
Cristian bc02e0ffe3 feat: Add config for youtubedl (YOUTUBEDL_ARGS) 2020-10-22 08:46:16 -05:00
Angel Rey ce71747538 replaced os.path in init extractors 2020-10-02 15:46:39 -05:00
Angel Rey 3fb410a604 Replaced os.path in favicon.py 2020-10-02 15:46:39 -05:00
ttimasdf eda3836dee feat: add og:title metadata as alternative title 2020-09-27 12:54:52 +08:00
Cristian abde871a3c fix: Wget absolute path generating issues 2020-09-25 08:24:06 -05:00
Cristian 7d3767b882 fix: oneshot command not running extractors 2020-09-24 12:56:16 -05:00
Cristian 62ed11a5ca fix: Improve headers handling 2020-09-24 12:55:51 -05:00
Angel Rey a40af98ced removed static file check 2020-09-24 12:55:51 -05:00
Angel Rey dc160daba8 Fixed lint 2020-09-23 11:07:00 -05:00
Angel Rey 7fd7dced9a Added curl params 2020-09-23 11:07:00 -05:00
Angel Rey 852e3c9cff Added headers extractor 2020-09-23 11:07:00 -05:00
Cristian eb34a6af62 lint: Fix mercury extractor lint issues 2020-09-23 10:35:39 -05:00
Cristian 46b9e3d536 fix: Fix mercury extractor test 2020-09-23 10:34:05 -05:00
ttimasdf 357b677363 fix: add mercury-parser to extractors list 2020-09-22 18:44:12 -05:00
ttimasdf 706bd895e0 feat: Add mercury-parser 2020-09-22 18:44:12 -05:00
Cristian b18bbf8874 test: Fix tests post-rebase 2020-09-17 09:09:52 -05:00
Cristian 50f3f16203 lint: Remove unused import 2020-09-15 08:05:46 -05:00
Cristian 0a83392cbf fix: Replace any typing with Union[Iterable[Link], QuerySet] in archive_links 2020-09-15 08:05:46 -05:00
Cristian 018bd91745 refactor: Remove get_iter lambda from archive_links 2020-09-15 08:05:46 -05:00
Cristian 01fb44fd40 refactor: Change archive_links check to focus on queryset, so it allows other iterables and not just lists 2020-09-15 08:05:46 -05:00
Cristian fe9604a772 feat: Add tests for remove command 2020-09-15 08:05:46 -05:00
Cristian be520d137a feat: Refactor add method to use querysets 2020-09-15 08:05:46 -05:00
Cristian 874403e667 feat: Remove patch_main_index 2020-09-15 08:05:46 -05:00
Cristian 31343c1367 feat: Update extractors and add command to use sql index as source of truth 2020-09-15 08:05:46 -05:00
Cristian bd3c824d45 fix: Escape JSON output on command failure so the user can run the command manually 2020-09-04 10:23:41 -05:00
Nick Sweeting a645f36b87 add comment about fake cmd 2020-09-01 19:42:22 -04:00
Cristian 66037535fd feat: Add curl command on readability as default command to debug 2020-09-01 10:16:24 -05:00
Cristian bf3ea42141 fix: Add a default cmd value to handle case where the html cannot be retrieved 2020-08-27 09:51:33 -05:00
Nick Sweeting a2c158e43e catch OSErrors due to missing path 2020-08-18 19:09:45 -04:00
Nick Sweeting 7144e0bdce search for node dependencies in output dir first 2020-08-18 18:40:19 -04:00
Nick Sweeting e87f1d57a3 fix linters 2020-08-18 09:22:12 -04:00
Nick Sweeting c9b3bab84d fix pull title not working 2020-08-18 08:49:26 -04:00
Nick Sweeting b0c0a676f8 re-enable readability and singlefile by default now that its less noisy 2020-08-18 08:29:46 -04:00
Nick Sweeting d7d53cfb12 dont show skipped extractors to reduce visual noise 2020-08-18 08:13:35 -04:00
Nick Sweeting 92de20af15 better detect missing dependencies on startup 2020-08-18 04:38:13 -04:00
Nick Sweeting b681a477ae add overwrite flag to add command to force re-archiving 2020-08-18 04:37:54 -04:00
Cristian 05c71fc302 fix: Organize readability extractor so a timeout does not break the whole process 2020-08-17 08:34:40 -05:00
Nick Sweeting 58e928520a tweak log output for skipped methods 2020-08-14 13:12:50 -04:00
Nick Sweeting 03b73bfe77 Update archivebox/extractors/readability.py 2020-08-14 12:55:22 -04:00
Cristian b7aa3df8d2 feat: Disable singlefile and readability by default 2020-08-12 14:42:21 -05:00
Cristian 5dc7e63792 feat: Update dockerfile to support readability 2020-08-11 11:52:43 -05:00
Cristian 2a68af1b94 tests: Add readability tests 2020-08-11 11:15:15 -05:00
Cristian 8aa7b34de7 tests: Add readability to ignored methods in tests 2020-08-11 08:58:49 -05:00
Cristian dc87d8b68c tests: Update failing tests 2020-08-11 08:48:13 -05:00
Cristian 0ec747f64e feat: Look in wget, singlefile or dom outputs before attempting to download the information again 2020-08-11 08:37:12 -05:00
Cristian a14762640e feat: Avoid running readability when the target is a file 2020-08-11 08:37:12 -05:00
Cristian 61e08a7c43 docs: Update docs link 2020-08-11 08:37:12 -05:00
Cristian b33c66a9f7 feat: Split output of readability into multiple files 2020-08-11 08:37:12 -05:00
Cristian 7e2b249388 feat: Initial version of readability extractor 2020-08-11 08:37:12 -05:00
Nick Sweeting 430be7bc68 add missing staticfile check to singlefile 2020-08-10 13:42:20 -04:00
Cristian 06d0e9de6c feat: Add support for singlefile in docker 2020-08-03 13:23:05 -05:00
Nick Sweeting 5b6eb5e4ad make filenames consistent with program name 2020-08-03 13:23:05 -05:00
Cristian 42b0c80465 feat: Add singlefile to link_details 2020-08-03 13:22:06 -05:00
Cristian 787a5ad43e fix: Commit code review suggestions 2020-08-03 13:22:06 -05:00
Cristian 853685668c feat: Add initial support for singlefile extractor 2020-08-03 13:22:06 -05:00
Cristian e6c571beb2 fix: Remove title from extractors for oneshot 2020-07-31 10:24:58 -05:00
Cristian 8bcb171e74 fix: Remove support for multiple urls in oneshot command 2020-07-31 09:05:40 -05:00
Cristian 3afb2401bc fix: Add condition to avoid breaking the add command 2020-07-29 11:53:49 -05:00
Cristian c073ea141d feat: Initial oneshot command proposal 2020-07-29 11:19:06 -05:00
Nick Sweeting 2e0b751376 accept methods argument to filter archive_link 2020-07-28 05:58:38 -04:00
Nick Sweeting 032c2458de add missing setup_django import 2020-07-28 05:58:13 -04:00
Nick Sweeting 55a237a435 also set snapshot title inside of fetch_title directly 2020-07-28 05:56:34 -04:00
Nick Sweeting 273059f054 accept gzipped responses when using curl 2020-07-28 05:55:54 -04:00
Nick Sweeting af9084ee95 update Snapshot.title to latest_title after fetching 2020-07-28 05:55:09 -04:00
Nick Sweeting 943453a9a8 pass overwrite properly 2020-07-28 05:54:42 -04:00
Cristian a5550b2105 fix: Rename logging folder to avoid naming conflicts (and circular import issues) 2020-07-22 11:02:13 -05:00
Nick Sweeting 0965031d8f fix archive_org header rename 2020-07-22 01:46:38 -04:00
Cristian f4d1b5121e refactor: Move logging.py to main module to avoid circular import issues 2020-07-17 18:00:04 -05:00
Cristian 23e6803f02 fix: Add change to calculate wget folder when there is a port present 2020-07-17 16:55:56 -05:00
Nick Sweeting ae208435c9 fix the add links form 2020-07-13 12:21:37 -04:00
Nick Sweeting 215d5eae32 normal git clone instead of mirror 2020-07-13 11:41:37 -04:00
Nick Sweeting b4ce20cbe5 write link details json before and after archiving 2020-07-13 11:41:27 -04:00
Nick Sweeting d3bfa98a91 fix depth flag and tweak logging 2020-07-13 11:26:34 -04:00
Nick Sweeting df593dea0a fix missing imports 2020-06-30 05:55:34 -04:00
Nick Sweeting 602e141f08 fix config file atomic writing bugs 2020-06-30 02:04:16 -04:00
Nick Sweeting 79b19ddf35 use atomic writes for config file writing as well 2020-06-30 01:12:06 -04:00
Nick Sweeting 5c2bbe7efe bugfixes 2020-06-25 22:14:40 -04:00
Nick Sweeting cb67b09f9d Merge branch 'master' into django 2020-06-25 21:30:29 -04:00
Nick Sweeting 43c471e4af cli experience improvements 2020-06-25 17:47:55 -04:00