mirror of synced 2024-06-02 02:25:20 +12:00
Commit graph

79 commits

Author SHA1 Message Date
Nick Sweeting a680724367 Merge branch 'dev' into search_index_extract_html_text 2023-10-27 23:09:28 -07:00
Ross Williams 310b4d1242 Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
Ross Williams b44f7e68b1 Add URL-specific method allow/deny lists
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
2023-08-02 09:36:40 -04:00
Sascha Ißbrücker 40c122515a fix: make oneshot command return successful exist code 2023-05-29 10:01:27 +02:00
Nick Sweeting 9f1470cf03 fix output permissions tests 2021-05-31 20:57:46 -04:00
Nick Sweeting eef9adbfcb fix select invalid test 2021-04-03 15:50:48 -04:00
Nick Sweeting 354b4627ed fix tests 2021-03-30 23:39:15 -04:00
Nick Sweeting bd6d9c165b enforce utf8 on literally all file operations because windows sucks 2021-03-27 01:16:29 -04:00
Nick Sweeting 33df9c1ebe fix after and before in remove tests 2021-02-18 06:21:44 -05:00
Nick Sweeting 4f5bb3776c fix sql err 2021-02-18 05:51:53 -05:00
Nick Sweeting 46a4197514 fix tests 2021-02-18 04:26:56 -05:00
Cristian e82161a768 refactor: Remove setup_django from search 2020-12-11 16:43:48 -05:00
Nick Sweeting e03d17c208 test extract flag on oneshot 2020-12-11 16:49:18 +02:00
Cristian f6c73f9aeb fix: Issue with oneshot command 2020-12-08 18:42:25 -05:00
Nick Sweeting 1b22f8eeef Merge pull request #515 from cdvv7788/POC-setup-django-on-init 2020-11-27 23:56:37 -05:00
Nick Sweeting efe3027797 Merge branch 'master' into archive-result 2020-11-27 23:18:11 -05:00
Nick Sweeting 0e2ccbc10d update urls to new repo path 2020-11-23 02:06:46 -05:00
Nick Sweeting fdd4effc92 Merge pull request #535 from cdvv7788/extractors-flag 2020-11-13 14:53:17 -05:00
JDC b1dbfcb73f Add test remove tag filter 2020-11-13 14:17:12 -05:00
Cristian 44eede96e5 feat: Add extract flag to add command 2020-11-13 09:24:34 -05:00
Cristian 33182fd53c fix: Add missing assignation 2020-11-04 15:07:45 -05:00
Cristian d064a3eeff fix: Handle case when update tries to re-add a link that is not in the sql index 2020-11-04 15:02:54 -05:00
Cristian e7e33ea7a5 tests: Add tests for several different ways to extract the title 2020-10-30 08:04:26 -05:00
Cristian f6ce1de882 fix: archivebox version was being called as root 2020-10-27 09:15:14 -05:00
Cristian a6bee5f111 feat: Move setup_django to an inner module 2020-10-26 08:02:04 -05:00
Cristian e1d0b8bce7 feat: Initialize django at the beginning 2020-10-26 07:45:21 -05:00
Cristian ae1484b8bf feat: Remove index.json and index.html generation from the regular process 2020-10-23 06:45:56 -05:00
Cristian Vargas a850b4a9d9 Merge branch 'master' into tags 2020-10-20 08:23:25 -05:00
Cristian 62c78e1d10 refactor: Remove django-taggit and replace it with a local tags setup 2020-10-12 13:47:03 -05:00
Angel Rey 73418836f8 Replaced os.path in server.py 2020-10-02 15:46:39 -05:00
Angel Rey 62c9028212 Improved tags 2020-09-24 15:34:23 -05:00
Cristian 0158efb1d0 test: Improve oneshot test 2020-09-24 12:56:16 -05:00
Cristian 62ed11a5ca fix: Improve headers handling 2020-09-24 12:55:51 -05:00
Angel Rey ee6caca3ca Added more asserts 2020-09-23 11:07:00 -05:00
Angel Rey 1cce786d6d Added test headers extractor 2020-09-23 11:07:00 -05:00
Cristian 46b9e3d536 fix: Fix mercury extractor test 2020-09-23 10:34:05 -05:00
ttimasdf e3329be291 tests: add test for mercury-parser 2020-09-22 18:44:12 -05:00
Cristian fa622d3e14 refactor: Replace --index with --with-headers in the list command to make it more explicit. Change it so it affects the csv output too. 2020-09-15 08:05:46 -05:00
Cristian 2aa8d69b72 fix: Save history in main index (to mimic previous behaviour) 2020-09-15 08:05:46 -05:00
Cristian 7e9d195d13 feat: Update list command to sort using sqlite 2020-09-15 08:05:46 -05:00
Cristian f55153eab3 feat: Update update command to work with querysets 2020-09-15 08:05:46 -05:00
Cristian dafa1dd63c tests: Add tests for before and after flags in remove command 2020-09-15 08:05:46 -05:00
Cristian fe9604a772 feat: Add tests for remove command 2020-09-15 08:05:46 -05:00
Cristian be0dff8126 feat: Add tests to refactored init command 2020-09-15 08:05:46 -05:00
Cristian a77d6dc235 feat: list command fails when --index is used without --json or --html 2020-09-15 08:05:46 -05:00
Cristian 885ff50449 feat: Add html export to list command 2020-09-15 08:05:46 -05:00
Cristian aab8f96520 feat: Add flag to list command to support index like output 2020-09-15 08:05:46 -05:00
Cristian cc0fa747ce feat: Add options to ease management of node related extractors 2020-08-18 10:34:28 -05:00
Nick Sweeting a218ceb4e8 add test for overwrite flag 2020-08-18 04:52:56 -04:00
Cristian 2a68af1b94 tests: Add readability tests 2020-08-11 11:15:15 -05:00