ArchiveBox

mirror, last synced 2024-09-30 17:17:12 +13:00

Ben Muthalaly 77917e9b55 Fix HTML title parsing bugs. This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors. The first occurred when title tags were empty (e.g. "<title></title>") which was parsed as "</title". The second occurred when titles were a single character (e.g. "<title>A</title>") which was not matched by the regex, and so would fall back to link.base_url. Now when tags are empty, it falls back to link.base_url, and single character titles are parsed correctly. The way the regex works now is still a bit wonky for some edge cases. I couldn't find any cases of incorrect behavior, but it still might be worth reworking more completely for robustness.		2023-10-09 02:00:01 -05:00
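
The two cases named in this commit message (an empty title tag and a single-character title) can be illustrated with a small sketch. This is not the project's actual HTML_TITLE_REGEX: the pattern `TITLE_PATTERN`, the function `extract_title`, and the `fallback` parameter (standing in for link.base_url) are hypothetical names chosen for the example.

```python
import re

# Hypothetical pattern for illustration only; NOT ArchiveBox's HTML_TITLE_REGEX.
# The lazy group (.*?) may match zero characters, so "<title></title>" yields
# an empty capture instead of swallowing part of the closing tag.
TITLE_PATTERN = re.compile(
    r"<title\b[^>]*>(.*?)</title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_title(html: str, fallback: str) -> str:
    """Return the page title, or `fallback` when the title is missing or empty."""
    match = TITLE_PATTERN.search(html)
    if match:
        # Collapse internal whitespace and strip the ends.
        title = " ".join(match.group(1).split())
        if title:
            return title
    # Assumed behaviour, mirroring the commit message: fall back to the base URL.
    return fallback


if __name__ == "__main__":
    print(extract_title("<title></title>", "example.com"))
    print(extract_title("<title>A</title>", "example.com"))
```

Because the capture group is allowed to be empty, the empty-title case is handled by the explicit `if title` check rather than by the regex failing to match, which keeps the two edge cases separate.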
__init__.py	just use out_dir	2023-05-29 10:03:49 +02:00
archive_org.py
dom.py	After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.	2023-08-28 17:27:03 +02:00
favicon.py	Add FAVICON_PROVIDER option for custom favicon service	2023-05-05 20:42:36 -05:00
git.py
headers.py
media.py
mercury.py
pdf.py	After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.	2023-08-28 17:27:03 +02:00
readability.py
screenshot.py	After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.	2023-08-28 17:27:03 +02:00
singlefile.py	add CHROME_TIMEOUT args	2023-03-14 20:29:41 +09:00
title.py	Fix HTML title parsing bugs.	2023-10-09 02:00:01 -05:00
wget.py
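
The commit note on dom.py, pdf.py, and screenshot.py describes removing Chrome's leftover SingletonLock after a timeout. A minimal sketch of that idea follows, under stated assumptions: the helper name `run_with_lock_cleanup` and its parameters are hypothetical, and the lock file is assumed to live directly inside the Chrome user data directory. It is not the code from those extractors.

```python
import subprocess
from pathlib import Path


def run_with_lock_cleanup(cmd, user_data_dir, timeout):
    """Run `cmd`; if it times out, delete a stale SingletonLock and re-raise.

    Hypothetical helper. Assumes the lock is at <user_data_dir>/SingletonLock.
    """
    lock = Path(user_data_dir) / "SingletonLock"
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # unlink() removes the entry itself even when it is a dangling symlink;
        # missing_ok=True (Python 3.8+) avoids an error if no lock was left.
        lock.unlink(missing_ok=True)
        raise
```

The exception is re-raised so the caller can still record the extractor as failed; only the cleanup is added. The lock is left alone on a normal exit, since a cleanly stopped browser is expected to remove it on its own.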