singlefile.html contains a lot of large strings in the form of `data:`
URLs, which can be unnecessarily stored in full-text indices. Also,
large chunks of JavaScript shouldn't be indexed, either, as they pollute
search results for searches about JS functions, etc.
This commit takes a blanket approach of parsing singlefile.html as it is
read and only outputting text and selected textual attributes (like
`alt`) for indexing.
This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.
Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.
The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
Incorrect hyphen placement in `URL_REGEX` was allowing it to match more
characters than intended. In a regex character class, a literal hyphen
can only appear as the first character in the class, or it will be
interpreted as the delimiter of a range of characters.
The issue fixed here caused the range of characters from `[$-_]`
be treated as valid URL characters, instead of the intended set of three
characters `[-_$]`. The incorrect range interpretation inadvertantly
included most ASCII punctuation, most importantly the angle brackets,
square brackets, and single quote that the expression uses
to mark the end of a match.
This causes the expression to match a URL that has a "hostname" portion
beginning with one of the intended "stop parsing" characters. For
example:
```
https://<b>www</b>.example.com/ # MATCHES but should not
https://[for example] # MATCHES but should not
scheme='https://' # MATCHES, including final quote, but should not
```
Some test cases have been added to the `URL_REGEX` assert in
archivebox.parsers to cover this possibility.
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
Previous method was splitting on the first whitespace, and missing the
version number when it appeared as `"Google Chrome 115.0.234.2342"`
instead of, i.e. `"Chromium 115.0.234.8283"`.
This commit changes the version detection to regex search for
whitespace, then one or more digits followed by a period, then at least
one more digit. Only the first sequence of digits is captured. Unless
Chrome radically changes their version numbering, this should capture
the first group of digits after the reported browser name, which would
be the major version.
the `bin_version` function means to modify the environment,
not replace it entirely. Fixes bugs that occur when it wipes out the
PATH environment variable, such as when running in a virtual
environment.
Previously, when you clicked the Add button, the page wouldn't change.
It looked like it wasn't doing anything, as noted by @rcarmo
(https://github.com/ArchiveBox/ArchiveBox/issues/658#issuecomment-948300055)
I didn't track it down the exact reason why. It may be that Safari
didn't like the two opening <h3>s, but I was able to find a bunch of
people complaining about Safari being very finicky with innerHTML
and actually repainting the page, enough that I decided to try just
extending the block hide/show behavior already done with the
delay-warning, and it works for me now in both Chrome and Safari.
For #658.
I noticed that Safari was submitting both the empty option and the
selected options back to the server.
Digging into it, I was able to get Safari to deselect the ---------
option by using '[selected]' as the selector.
For #658
When trying to import my pocket library I got a lot of ` KeyError` on Python. Pocket API has a few idiosyncrasies, such as sometimes returning the keys on json, sometimes not.
` ` ` sh
archivebox add --parser pocket_api pocket://my_username
` ` `
Gave me this errors
` ` `
File "/app/archivebox/parsers/pocket_api.py", line 54, in link_from_article
title = article['resolved_title'] or article['given_title'] or url
KeyError: 'resolved_title'
` ` `
This commit are the patches I've changed to successfully import my library
`item.find(p)` returns either an `ElementTree.Element` or `None`. The
[lambda on line 24][lambda] coerces the return value to a bool, which is
`False` if the `<link>` element has no children (see
[`ElementTree.py` line 207][etbooldef]), so the lambda returns `None`.
Further, returning a `Link` with `url=None` violates
[an assertion in `index/schema.py`][assertion], which crashes
the `archivebox add` command.
[lambda]: 3d54b1321b/archivebox/parsers/pinboard_rss.py (L24)
[etbooldef]: 3d8993a744/Lib/xml/etree/ElementTree.py (L207)
[assertion]: 3d54b1321b/archivebox/index/schema.py (L165)