mirror of synced 2024-06-01 10:09:49 +12:00

25 commits

Author SHA1 Message Date
Ben Muthalaly d8cf09c21e Remove unnecessary variable length args for dedupe 2024-03-05 21:13:45 -06:00
Ben Muthalaly d74ddd42ae Flip dedupe precedence order 2024-03-01 14:50:32 -06:00
Ben Muthalaly 4e69d2c9e1 Add EXTRA_*_ARGS for wget, curl, and singlefile 2024-02-22 23:04:11 -06:00
Nick Sweeting 6a4e568d1b new archivebox update speed improvements 2024-02-22 04:50:22 -08:00
Nick Sweeting db2984e47b prefer dom dump to singlefile for generating readability output 2024-01-03 20:11:06 -08:00
Ben Muthalaly 77917e9b55 Fix HTML title parsing bugs.
This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.

Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.

The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
2023-10-09 02:00:01 -05:00
papersnake de8e22efb7 improve title extractor 2022-02-08 23:17:52 +08:00
Nick Sweeting 04c951cdd5 fix alerts 2021-02-01 02:22:02 -05:00
Nick Sweeting 385daf9af8 save the url as title for staticfiles or non html files 2021-01-30 22:01:49 -05:00
Dan Arnfield 5420903102 Refactor should_save_extractor methods to accept overwrite parameter 2021-01-21 15:56:32 -06:00
Cristian 81d766aba1 refactor: Remove setup_django from title.py 2020-12-11 16:03:50 -05:00
Cristian e7e33ea7a5 tests: Add tests for several different ways to extract the title 2020-10-30 08:04:26 -05:00
Nick Sweeting f727ece7b3 add regex fallback back to title parser 2020-10-30 04:57:31 -04:00
Nick Sweeting 79bef1384e
Merge pull request #493 from ttimasdf/feat-ogtitle
Feature: add og:title metadata as alternative title
2020-10-30 04:51:14 -04:00
Cristian c12fe0e3d7 feat: Use CURL_ARGS on title extractor 2020-10-22 08:46:16 -05:00
ttimasdf eda3836dee feat: add og:title metadata as alternative title 2020-09-27 12:54:52 +08:00
Cristian b18bbf8874 test: Fix tests post-rebase 2020-09-17 09:09:52 -05:00
Nick Sweeting 032c2458de add missing setup_django import 2020-07-28 05:58:13 -04:00
Nick Sweeting 55a237a435 also set snapshot title inside of fetch_title directly 2020-07-28 05:56:34 -04:00
Nick Sweeting 273059f054 accept gzipped responses when using curl 2020-07-28 05:55:54 -04:00
Cristian a5550b2105 fix: Rename logging folder to avoid naming conflicts (and circular import issues) 2020-07-22 11:02:13 -05:00
Cristian f4d1b5121e refactor: Move logging.py to main module to avoid circular import issues 2020-07-17 18:00:04 -05:00
Nick Sweeting 5c2bbe7efe bufixes 2020-06-25 22:14:40 -04:00
Nick Sweeting 95007d9137 split up utils into separate files 2019-04-30 23:13:04 -04:00
Nick Sweeting 1b8abc0961 move everything out of legacy folder 2019-04-27 17:26:24 -04:00
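
Note on commit 77917e9b55: the two cases described in that message (an empty "<title></title>" and a single-character title such as "<title>A</title>") can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions: TITLE_PATTERN and extract_title are hypothetical names, the pattern is not the project's actual HTML_TITLE_REGEX, and the fallback argument stands in for link.base_url.

```python
import re

# Hypothetical pattern for illustration only; NOT the project's HTML_TITLE_REGEX.
# The lazy group (.*?) may match zero characters, so an empty <title></title>
# yields an empty string instead of swallowing part of the closing tag.
TITLE_PATTERN = re.compile(
    r"<title(?:\s[^>]*)?>(.*?)</title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_title(html: str, fallback: str) -> str:
    """Return the text of the first <title> element, or `fallback`.

    `fallback` plays the role that link.base_url plays in the commit message.
    """
    match = TITLE_PATTERN.search(html)
    if match is None:
        # No title element at all: use the fallback.
        return fallback
    # Collapse runs of whitespace (including newlines) into single spaces.
    title = " ".join(match.group(1).split())
    # An empty or whitespace-only title also falls back.
    return title or fallback
```

With this sketch, an empty title tag returns the fallback and a single-character title is returned unchanged, which matches the behavior the commit message describes. Regex-based HTML parsing remains fragile for unusual markup, as the commit message itself acknowledges.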