1
0
Fork 0
mirror of synced 2024-06-29 19:41:05 +12:00
ArchiveBox/archivebox/parsers
Ross Williams c039ef05b3 Fix hyphen placement in util.URL_REGEX
Incorrect hyphen placement in `URL_REGEX` was allowing it to match more
characters than intended. In a regex character class, a literal hyphen
can only appear as the first character in the class, or it will be
interpreted as the delimiter of a range of characters.

The issue fixed here caused the range of characters from `[$-_]`
be treated as valid URL characters, instead of the intended set of three
characters `[-_$]`. The incorrect range interpretation inadvertantly
included most ASCII punctuation, most importantly the angle brackets,
square brackets, and single quote that the expression uses
to mark the end of a match.

This causes the expression to match a URL that has a "hostname" portion
beginning with one of the intended "stop parsing" characters. For
example:

```
https://<b>www</b>.example.com/  # MATCHES but should not
https://[for example]            # MATCHES but should not
scheme='https://'                # MATCHES, including final quote, but should not
```

Some test cases have been added to the `URL_REGEX` assert in
archivebox.parsers to cover this possibility.
2023-08-08 15:24:16 -04:00
..
__init__.py Fix hyphen placement in util.URL_REGEX 2023-08-08 15:24:16 -04:00
generic_html.py add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support 2021-04-10 04:21:36 -04:00
generic_json.py add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support 2021-04-10 04:21:36 -04:00
generic_rss.py use KEY, NAME, and PARSER to define parsers instead of hardcoding in init 2021-03-31 01:05:49 -04:00
generic_txt.py add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support 2021-04-10 04:21:36 -04:00
medium_rss.py use KEY, NAME, and PARSER to define parsers instead of hardcoding in init 2021-03-31 01:05:49 -04:00
netscape_html.py use KEY, NAME, and PARSER to define parsers instead of hardcoding in init 2021-03-31 01:05:49 -04:00
pinboard_rss.py Fix Pinboard RSS parsing valid links as None 2021-08-04 10:13:37 -04:00
pocket_api.py fix typo in pocket_api articl variable name 2021-11-12 19:23:47 -05:00
pocket_html.py use KEY, NAME, and PARSER to define parsers instead of hardcoding in init 2021-03-31 01:05:49 -04:00
shaarli_rss.py use KEY, NAME, and PARSER to define parsers instead of hardcoding in init 2021-03-31 01:05:49 -04:00
url_list.py add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support 2021-04-10 04:21:36 -04:00
wallabag_atom.py handle new wallabag export format with newlines mid-tag attributes 2022-05-09 19:07:48 -07:00