ArchiveBox

mirror of synced 2024-06-29 19:41:05 +12:00


Ross Williams c039ef05b3 Fix hyphen placement in util.URL_REGEX		2023-08-08 15:24:16 -04:00

Incorrect hyphen placement in `URL_REGEX` was allowing it to match more characters than intended. In a regex character class, a literal hyphen must be placed first or last in the class (or be escaped); otherwise it is interpreted as the delimiter of a range of characters. The issue fixed here caused the range of characters `[$-_]` to be treated as valid URL characters, instead of the intended set of three characters `[-_$]`.

The incorrect range interpretation inadvertently included most ASCII punctuation, most importantly the angle brackets, square brackets, and single quote that the expression uses to mark the end of a match. This causes the expression to match a URL that has a "hostname" portion beginning with one of the intended "stop parsing" characters. For example:

```
https://<b>www</b>.example.com/  # MATCHES but should not
https://[for example]            # MATCHES but should not
scheme='https://'                # MATCHES, including final quote, but should not
```

Some test cases have been added to the `URL_REGEX` assert in archivebox.parsers to cover this possibility.
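The range-versus-literal behaviour described in this commit message can be reproduced in isolation. The following is a minimal sketch, not the actual `URL_REGEX` from ArchiveBox: the names `BUGGY_CLASS`, `FIXED_CLASS`, `BUGGY_URL`, `FIXED_URL`, and `matched_chars` are hypothetical, and the URL patterns are deliberately simplified stand-ins that only assume the same general shape (a scheme, one allowed first character, then everything up to a stop character).

```python
import re

# Hypothetical, simplified character classes (not the real URL_REGEX).
BUGGY_CLASS = re.compile(r'[$-_]')   # parsed as the range '$' (0x24) .. '_' (0x5F)
FIXED_CLASS = re.compile(r'[-_$]')   # leading hyphen is literal: exactly three characters


def matched_chars(pattern):
    """Return every printable ASCII character that the pattern matches on its own."""
    return ''.join(chr(c) for c in range(0x20, 0x7F) if pattern.fullmatch(chr(c)))


buggy_chars = matched_chars(BUGGY_CLASS)
fixed_chars = matched_chars(FIXED_CLASS)

# Simplified stand-ins for a URL matcher: scheme, one allowed first character,
# then any run of characters that are not "stop parsing" characters.
STOP = r'[^\]\[<>"\'\s]+'
BUGGY_URL = re.compile(r'https?://(?:[a-zA-Z0-9]|[$-_.&+])' + STOP)
FIXED_URL = re.compile(r'https?://(?:[a-zA-Z0-9]|[-_$.&+])' + STOP)

sample = 'https://<b>www</b>.example.com/'
buggy_match = BUGGY_URL.search(sample)   # '<' falls inside the accidental range
fixed_match = FIXED_URL.search(sample)   # '<' is rejected as a first character

print(repr(fixed_chars))
print(len(buggy_chars))
print(buggy_match.group(0) if buggy_match else None)
print(fixed_match)
```

In this sketch the buggy class accepts every character between `$` and `_` in ASCII order, which is why the angle brackets, square brackets, and single quote slip through, while the fixed class accepts only the three intended characters.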
__init__.py	Fix hyphen placement in util.URL_REGEX	2023-08-08 15:24:16 -04:00
generic_html.py	add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support	2021-04-10 04:21:36 -04:00
generic_json.py	add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support	2021-04-10 04:21:36 -04:00
generic_rss.py	use KEY, NAME, and PARSER to define parsers instead of hardcoding in init	2021-03-31 01:05:49 -04:00
generic_txt.py	add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support	2021-04-10 04:21:36 -04:00
medium_rss.py	use KEY, NAME, and PARSER to define parsers instead of hardcoding in init	2021-03-31 01:05:49 -04:00
netscape_html.py	use KEY, NAME, and PARSER to define parsers instead of hardcoding in init	2021-03-31 01:05:49 -04:00
pinboard_rss.py	Fix Pinboard RSS parsing valid links as `None`	2021-08-04 10:13:37 -04:00
pocket_api.py	fix typo in pocket_api articl variable name	2021-11-12 19:23:47 -05:00
pocket_html.py	use KEY, NAME, and PARSER to define parsers instead of hardcoding in init	2021-03-31 01:05:49 -04:00
shaarli_rss.py	use KEY, NAME, and PARSER to define parsers instead of hardcoding in init	2021-03-31 01:05:49 -04:00
url_list.py	add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support	2021-04-10 04:21:36 -04:00
wallabag_atom.py	handle new wallabag export format with newlines mid-tag attributes	2022-05-09 19:07:48 -07:00