Fix HTML title parsing bugs.

This slightly modifies HTML_TITLE_REGEX to fix two parsing errors. The first occurred when title tags were empty (e.g. "<title></title>"), which was parsed as "</title". The second occurred when titles were a single character (e.g. "<title>A</title>"), which the regex did not match, so the title fell back to link.base_url. Both came from the leading "." in the capture group: it consumed one character unconditionally (including "<"), and the group then needed at least one more character after it.

Now empty title tags fall back to link.base_url, and single-character titles are parsed correctly.

The regex is still a bit wonky for some edge cases. I couldn't find any cases of incorrect behavior, but it may still be worth reworking more completely for robustness.
2024-06-27 18:40:52 +12:00 · 2023-10-09 02:00:01 -05:00 · 2023-10-09 02:00:01 -05:00 · 77917e9b55
parent 4950cee3b6
commit 77917e9b55
1 changed file with 1 addition and 1 deletion
--- a/archivebox/extractors/title.py
+++ b/archivebox/extractors/title.py
@@ -26,7 +26,7 @@ from ..logging_util import TimedProgress

 HTML_TITLE_REGEX = re.compile(
    r'<title.*?>'                      # start matching text after <title> tag
-    r'(.[^<>]+)',                      # get everything up to these symbols
+    r'([^<>]+)',                      # get everything up to these symbols
    re.IGNORECASE | re.MULTILINE | re.DOTALL | re.UNICODE,
 )
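
The two cases from the commit message can be checked with a small standalone script. This is a minimal sketch, not code from the repository: OLD_REGEX and NEW_REGEX are copies of the pattern before and after this change, and extract_title is a hypothetical helper that returns None where the real extractor would fall back to link.base_url. The inputs are bare title strings; a title embedded in a full page may behave differently, because the lazy ".*?" combined with re.DOTALL can extend past later tags.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL | re.UNICODE

# Pattern before this commit: the leading "." consumes one character unconditionally.
OLD_REGEX = re.compile(r'<title.*?>' r'(.[^<>]+)', FLAGS)

# Pattern after this commit: the group only accepts characters other than "<" and ">".
NEW_REGEX = re.compile(r'<title.*?>' r'([^<>]+)', FLAGS)


def extract_title(regex, html):
    """Hypothetical helper: return the captured title, or None if nothing matches."""
    match = regex.search(html)
    return match.group(1) if match else None


if __name__ == '__main__':
    for html in ('<title></title>', '<title>A</title>', '<title>Example</title>'):
        old = extract_title(OLD_REGEX, html)
        new = extract_title(NEW_REGEX, html)
        print(f'{html!r}: old={old!r} new={new!r}')
```

A None result here stands in for the fallback path; the real fallback to link.base_url lives in the extractor and is not reproduced.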