Mirror synced 2024-06-29 03:20:58 +12:00

174 commits

Author SHA1 Message Date
Nick Sweeting 4c5a3fba8b more fixes for wget_output_path 2024-05-07 05:38:29 -07:00
Nick Sweeting 9b21ce490e add workaround logic to catch paths that are too long or contain unprintable characters 2024-05-07 05:03:23 -07:00
Nick Sweeting f770bba3cf fix OSError 36 caused by checking for path that is too long to exist 2024-05-07 04:12:07 -07:00
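The failure mode addressed by f770bba3cf and 9b21ce490e can be sketched as follows. This is a minimal illustration, not ArchiveBox's actual code: `path_exists_safely` is a hypothetical helper name, and the sketch assumes a POSIX system where an over-long path raises `OSError` with `errno.ENAMETOOLONG` (errno 36 on Linux).

```python
import errno
import os


def path_exists_safely(path: str) -> bool:
    """Return True if `path` exists, False if it is missing or cannot be a valid path.

    Hypothetical helper: treats "name too long" and embedded NUL bytes as
    "does not exist" instead of letting the exception propagate.
    """
    try:
        os.stat(path)
        return True
    except (FileNotFoundError, NotADirectoryError):
        # Ordinary "not there" cases.
        return False
    except ValueError:
        # Raised by os.stat() for paths containing an embedded NUL byte.
        return False
    except OSError as err:
        if err.errno == errno.ENAMETOOLONG:
            # The path is too long to exist on this filesystem.
            return False
        raise  # Anything else (e.g. permission errors) is a real problem.
```

Other `OSError` values are re-raised on purpose, so genuine I/O problems are not silently reported as "missing".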
Nick Sweeting b4c3aa5097 Merge branch 'main' into dev 2024-03-26 15:01:36 -07:00
Ben Muthalaly f4deb97f59 Add ARGS and EXTRA_ARGS for Mercury extractor 2024-03-05 21:15:38 -06:00
Ben Muthalaly d8cf09c21e Remove unnecessary variable length args for dedupe 2024-03-05 21:13:45 -06:00
Naomi Phillips a729480b75 Add COOKIES_FILE support for singlefile extractor 2024-03-03 02:32:46 -05:00
Ben Muthalaly d74ddd42ae Flip dedupe precedence order 2024-03-01 14:50:32 -06:00
Ben Muthalaly ab8f395e0a Add YOUTUBEDL_EXTRA_ARGS 2024-02-23 15:40:31 -06:00
Ben Muthalaly 4e69d2c9e1 Add EXTRA_*_ARGS for wget, curl, and singlefile 2024-02-22 23:04:11 -06:00
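The EXTRA_*_ARGS and dedupe commits above (4e69d2c9e1, d74ddd42ae, d8cf09c21e) describe merging default and user-supplied command-line options. One plausible way to do that is sketched below. The function name `dedupe_args` and the precedence rule (later lists override earlier ones) are assumptions made for illustration; the log does not state which order ArchiveBox settled on.

```python
from typing import Dict, List


def dedupe_args(*arg_lists: List[str]) -> List[str]:
    """Merge option lists, keeping one entry per option name.

    Assumption: options look like `--name` or `--name=value`, and an option
    in a later list replaces the same option from an earlier list.
    """
    merged: Dict[str, str] = {}
    for args in arg_lists:
        for arg in args:
            key = arg.split("=", 1)[0]  # option name without its value
            merged[key] = arg           # later occurrence wins
    # dict preserves first-insertion order, so positions stay stable
    return list(merged.values())
```

With this rule, `dedupe_args(defaults, extra)` lets the extra arguments override defaults while keeping each option's original position.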
Nick Sweeting 8b9bc3dec8 minor fixes 2024-02-22 04:50:22 -08:00
Nick Sweeting 6a4e568d1b new archivebox update speed improvements 2024-02-22 04:50:22 -08:00
Nick Sweeting 0a25495520 add fallback to check wget output dir with port stripped 2024-01-19 03:47:38 -08:00
Nick Sweeting c1fd2cfa42 tag URLs immediately once added instead of waiting until archival completes 2024-01-03 20:31:46 -08:00
Nick Sweeting db2984e47b prefer dom dump to singlefile for generating readability output 2024-01-03 20:11:06 -08:00
Nick Sweeting 78d942ac22 show more detail in readability error messages 2024-01-03 20:09:31 -08:00
Nick Sweeting 5b07a1126c add comment about why DOM is preferred over singlefile for readability parsing 2024-01-03 19:09:24 -08:00
Nick Sweeting 2c54e55697 prefer dom dump to singlefile for generating readability output 2024-01-02 19:50:56 -08:00
Nick Sweeting f0033f75d0 config.py lint fixes 2023-11-14 02:07:35 -08:00
Nick Sweeting a680724367 Merge branch 'dev' into search_index_extract_html_text 2023-10-27 23:09:28 -07:00
Ross Williams 310b4d1242 Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
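The idea behind 310b4d1242 (collecting text nodes plus a few attribute values for search indexing) can be illustrated with the standard library's `html.parser`. This is a rough sketch under stated assumptions, not the extractor's real implementation: the class name, the set of kept attributes, and the skipped tags are all hypothetical choices.

```python
from html.parser import HTMLParser
from typing import List


class TextCollector(HTMLParser):
    KEEP_ATTRS = {"alt", "title", "href"}  # assumed set of "selected attributes"
    SKIP_TAGS = {"script", "style"}        # content that is not useful for search

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
            return
        for name, value in attrs:
            if name in self.KEEP_ATTRS and value and value.strip():
                self.chunks.append(value.strip())

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text nodes that are outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Writing one chunk per line keeps the output simple to feed into a full-text search backend.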
Nick Sweeting 63ad43f46c Merge branch 'dev' into method_allow_deny 2023-10-20 04:25:44 -07:00
Nick Sweeting 82d8662c74 add more readability error output 2023-10-20 04:14:28 -07:00
Ben Muthalaly 77917e9b55 Fix HTML title parsing bugs.
This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.

Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.

The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
2023-10-09 02:00:01 -05:00
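The two edge cases described in 77917e9b55 (empty titles and single-character titles) can be demonstrated with a small sketch. The pattern below is an illustrative regex written for this example, not ArchiveBox's actual `HTML_TITLE_REGEX`, and `extract_title` with its `fallback` parameter is a hypothetical stand-in for falling back to `link.base_url`.

```python
import re

# Illustrative pattern only: lazily capture everything between the tags,
# including the empty string, so "<title></title>" yields "" rather than
# a stray "</title".
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str, fallback: str) -> str:
    """Return the page title, or `fallback` when it is missing or empty."""
    match = TITLE_RE.search(html)
    title = match.group(1).strip() if match else ""
    return title or fallback
```

Using `.*?` instead of a pattern that requires at least two characters is what lets single-character titles match.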
spresse1 603ce7ec10 After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file. 2023-08-28 17:27:03 +02:00
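The cleanup described in 603ce7ec10 might look roughly like the sketch below. It assumes the lock lives at `SingletonLock` directly inside the Chrome user data directory; the helper name `remove_singleton_lock` and its parameter are hypothetical, not taken from the commit.

```python
from pathlib import Path


def remove_singleton_lock(user_data_dir: str) -> bool:
    """Delete a stale Chrome SingletonLock; return True if one was removed.

    Assumption: the caller only invokes this after the Chrome process has
    timed out and been killed, so the lock is known to be stale.
    """
    lock = Path(user_data_dir) / "SingletonLock"
    try:
        # unlink() removes the entry itself, which also works when the lock
        # is a dangling symlink.
        lock.unlink()
        return True
    except FileNotFoundError:
        return False
```

Calling it only from the timeout error path avoids deleting the lock of a Chrome instance that is still running.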
Ross Williams 2076474252 Drop use of TypeAlias to maintain Python 3.9 compat
TypeAlias annotation was introduced in Python 3.10, and is not strictly
necessary. Drop use of it to maintain Python 3.9 compatibility.
2023-08-02 10:56:48 -04:00
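As a brief illustration of the change in 2076474252: a plain assignment already works as a type alias on Python 3.9, so the explicit `TypeAlias` annotation can be dropped. The alias and function names below are hypothetical examples, not names from the ArchiveBox codebase.

```python
from typing import Dict, List

# Python 3.10+ only:
#   from typing import TypeAlias
#   ExtractorNames: TypeAlias = List[str]

# Python 3.9-compatible: an ordinary assignment is treated as an alias.
ExtractorNames = List[str]
MethodLists = Dict[str, ExtractorNames]


def count_methods(lists: MethodLists) -> int:
    """Count the extractor names across all patterns."""
    return sum(len(names) for names in lists.values())
```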
Ross Williams b44f7e68b1 Add URL-specific method allow/deny lists
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
2023-08-02 09:36:40 -04:00
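The allow/deny behaviour described in b44f7e68b1 can be sketched as below. This is one possible reading of the commit message, not the actual implementation: the function name, the dictionary shape (regex pattern mapped to extractor names), and the rule that deny-lists are applied after allow-lists are all assumptions.

```python
import re
from typing import Dict, List


def select_methods(
    url: str,
    all_methods: List[str],
    allowlist: Dict[str, List[str]],
    denylist: Dict[str, List[str]],
) -> List[str]:
    """Pick which extractors to run for `url`."""
    allowed = set()
    any_allow_match = False
    for pattern, methods in allowlist.items():
        if re.search(pattern, url):
            any_allow_match = True
            allowed.update(methods)

    denied = set()
    for pattern, methods in denylist.items():
        if re.search(pattern, url):
            denied.update(methods)

    # If no allow pattern matched, every extractor is a candidate.
    return [
        m for m in all_methods
        if (not any_allow_match or m in allowed) and m not in denied
    ]
```

For example, an allow-list entry for a video site could restrict archiving to the title and media extractors, while a deny-list entry for `\.pdf$` could skip a browser-based extractor for PDF links.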
Sascha Ißbrücker 7bf4f40da0 just use out_dir 2023-05-29 10:03:49 +02:00
Sascha Ißbrücker 40c122515a fix: make oneshot command return successful exit code 2023-05-29 10:01:27 +02:00
Micah R Ledbetter 1e50ca243e Add FAVICON_PROVIDER option for custom favicon service 2023-05-05 20:42:36 -05:00
ふぁ d77c770c47 add CHROME_TIMEOUT args
Signed-off-by: ふぁ <yuki@yuki0311.com>
2023-03-14 20:29:41 +09:00
Nick Sweeting 9599845b56 ensure DOM HTML dump is non-zero length file when retrying 2023-03-13 10:49:26 +00:00
Nick Sweeting 0cbeeb4346 Merge pull request #1021 from renaisun/dev 2023-01-09 18:17:39 -08:00
Joseph Turian 07de4a79a1 Merge branch 'dev' into feature/kludge-984-UTF8-bug 2022-12-20 11:39:01 +01:00
Joseph Turian 081a12b079 Add ts 2022-09-12 21:32:47 +00:00
Joseph Turian daef48e59b flake8 2022-09-12 21:31:33 +00:00
Joseph Turian 983f485cc0 flake8 2022-09-12 21:29:43 +00:00
Joseph Turian b864c38d9e Don't be strict on unicode errors 2022-09-12 20:40:45 +00:00
Joseph Turian dba423a568 A few more youtube-dl tweaks 2022-09-12 20:36:23 +00:00
Joseph Turian f5f7aff3b4 Added yt-dlp everywhere 2022-09-12 20:34:02 +00:00
renaisun 0ea955b3ed add a missing comma 2022-09-12 09:08:28 +08:00
notevenaperson 40659b5e9d singlefile.py: Code to ensure options are deduplicated 2022-09-12 09:08:28 +08:00
Joseph Turian 2b58cce43f Attempted to warn on #984 and #1014 2022-09-11 12:19:16 +02:00
renaisun 8899fe0b92 Add SINGLEFILE_ARGS to control single-file arguments 2022-06-09 14:35:48 +08:00
Nick Sweeting 950b5cbbb6 Merge pull request #924 from prnake/dev
improve title extractor
2022-05-09 18:38:12 -07:00
Nick Sweeting 57df65f28f use yt-dlp for media archiving instead of youtube-dl 2022-04-21 07:11:35 -07:00
prnake 011bd104cb remove unused import 2022-02-09 10:48:51 +08:00
papersnake de8e22efb7 improve title extractor 2022-02-08 23:17:52 +08:00
Nick Sweeting 4715ace7dd ignore BaseException lgtm errors 2021-05-31 20:59:05 -04:00
Nick Sweeting eb4d3bca9d Update readability.py 2021-05-13 00:13:32 -04:00