mirror of synced 2024-06-01 10:09:49 +12:00

128 commits

Author SHA1 Message Date
Nick Sweeting ca2c484a8e Add _EXTRA_ARGS for various extractors (#1360) 2024-03-14 01:55:09 -07:00
Ben Muthalaly d8cf09c21e Remove unnecessary variable length args for dedupe 2024-03-05 21:13:45 -06:00
Ben Muthalaly 4686da91e6 Fix cookies being set incorrectly 2024-03-05 01:48:35 -06:00
Ben Muthalaly d74ddd42ae Flip dedupe precedence order 2024-03-01 14:50:32 -06:00
Ben Muthalaly 68326a60ee Add cookies file to http request in download_url 2024-02-27 15:30:31 -06:00
Ben Muthalaly 4d9c5a7b4b Add CHROME_EXTRA_ARGS
Also fix `YOUTUBEDL_EXTRA_ARGS`.
2024-02-23 18:40:03 -06:00
Ben Muthalaly 4e69d2c9e1 Add EXTRA_*_ARGS for wget, curl, and singlefile 2024-02-22 23:04:11 -06:00
Nick Sweeting 6a4e568d1b new archivebox update speed improvements 2024-02-22 04:50:22 -08:00
Nick Sweeting 8c07b7e127 disable automatic chrome selfupdating 2024-01-11 19:51:27 -08:00
Nick Sweeting 6184f659dc improve window size chrome cli handling 2024-01-11 19:02:46 -08:00
spresse1 603ce7ec10 After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file. 2023-08-28 17:27:03 +02:00
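The cleanup described in the commit above could look roughly like the following sketch. The function name `remove_singleton_lock` and the assumption that the lock sits directly inside Chrome's user data directory are illustrative and are not taken from the ArchiveBox source.

```python
from pathlib import Path


def remove_singleton_lock(user_data_dir: Path) -> bool:
    """Delete a stale Chrome SingletonLock, returning True if one was present.

    Hypothetical helper: assumes the lock lives at <user_data_dir>/SingletonLock.
    """
    lock = user_data_dir / "SingletonLock"
    # The lock is often a symlink whose target no longer exists, so check
    # is_symlink() as well as exists() (exists() follows symlinks).
    present = lock.is_symlink() or lock.exists()
    # missing_ok avoids an error if another process removed it first (Python 3.8+).
    lock.unlink(missing_ok=True)
    return present
```

A caller would invoke this only after an extractor has failed with a timeout, so that a lock held by a live Chrome instance is not removed.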
Ross Williams c039ef05b3 Fix hyphen placement in util.URL_REGEX
Incorrect hyphen placement in `URL_REGEX` was allowing it to match more
characters than intended. In a regex character class, an unescaped literal
hyphen must appear as the first or last character in the class, or it will be
interpreted as the delimiter of a range of characters.

The issue fixed here caused the range of characters from `[$-_]` to
be treated as valid URL characters, instead of the intended set of three
characters `[-_$]`. The incorrect range interpretation inadvertently
included most ASCII punctuation, most importantly the angle brackets,
square brackets, and single quote that the expression uses
to mark the end of a match.

This causes the expression to match a URL that has a "hostname" portion
beginning with one of the intended "stop parsing" characters. For
example:

```
https://<b>www</b>.example.com/  # MATCHES but should not
https://[for example]            # MATCHES but should not
scheme='https://'                # MATCHES, including final quote, but should not
```

Some test cases have been added to the `URL_REGEX` assert in
archivebox.parsers to cover this possibility.
2023-08-08 15:24:16 -04:00
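The difference between the two character classes in the commit above can be checked directly. This is a minimal sketch using two isolated classes, not the real `URL_REGEX`; the names `RANGE_CLASS`, `LITERAL_CLASS`, and `matched_chars` are illustrative.

```python
import re

# Hyphen in the middle: a range from '$' (0x24) to '_' (0x5F).
RANGE_CLASS = re.compile(r"[$-_]")
# Hyphen first: exactly three literal characters.
LITERAL_CLASS = re.compile(r"[-_$]")

# Characters the commit message says were meant to end a match.
STOP_CHARS = "<>[]'"


def matched_chars(pattern, candidates):
    """Return the candidate characters that the character class accepts."""
    return [ch for ch in candidates if pattern.fullmatch(ch)]


print(matched_chars(RANGE_CLASS, STOP_CHARS))    # all five stop characters
print(matched_chars(LITERAL_CLASS, STOP_CHARS))  # none of them
```

All five stop characters fall between `$` and `_` in ASCII, which is why the range form accepts them.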
Ross Williams d0e65eba7f More reliably detect Google Chrome version number
Previous method was splitting on the first whitespace, and missing the
version number when it appeared as `"Google Chrome 115.0.234.2342"`
instead of, e.g., `"Chromium 115.0.234.8283"`.

This commit changes the version detection to regex search for
whitespace, then one or more digits followed by a period, then at least
one more digit. Only the first sequence of digits is captured. Unless
Chrome radically changes their version numbering, this should capture
the first group of digits after the reported browser name, which would
be the major version.
2023-07-31 15:34:58 -04:00
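The detection rule described in the commit above (whitespace, one or more digits, a period, then at least one more digit, capturing only the first digit group) can be sketched as follows. The pattern and the function name `chrome_major_version` are reconstructed from the prose and may differ from the actual ArchiveBox code.

```python
import re
from typing import Optional

# Whitespace, captured digits, a period, then at least one more digit.
MAJOR_VERSION = re.compile(r"\s(\d+)\.\d")


def chrome_major_version(version_output: str) -> Optional[int]:
    """Return the major version from a browser version string, or None."""
    match = MAJOR_VERSION.search(version_output)
    return int(match.group(1)) if match else None


print(chrome_major_version("Google Chrome 115.0.234.2342"))
print(chrome_major_version("Chromium 115.0.234.8283"))
```

Because the search is anchored on the digits rather than on the first whitespace, a multi-word browser name such as "Google Chrome" no longer hides the version.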
ふぁ 44a5a5ed7e Explicitly specify --headless=new
Signed-off-by: ふぁ <yuki@yuki0311.com>
2023-03-17 19:30:14 +09:00
ふぁ d77c770c47 add CHROME_TIMEOUT args
Signed-off-by: ふぁ <yuki@yuki0311.com>
2023-03-14 20:29:41 +09:00
Nick Sweeting 606fa397a4 disable passing timeout arg to chrome because v111 is crashing when passed 2023-03-13 10:50:18 +00:00
Nick Sweeting 1f1c70a8b1 remove --single-process from chrome args and add some rendering optimization args 2023-03-13 10:49:57 +00:00
Nick Sweeting 49faec8f6d add no-zygote and single-process args to try and prevent orphan chrome processes after exit 2021-05-13 05:04:23 -04:00
Nick Sweeting 9f05cf8283 virtual-time-budget doesn't work with some chrome stuff 2021-04-10 08:04:59 -04:00
Nick Sweeting 0c321a06d0 hide scrollbars in screenshots 2021-04-10 05:45:19 -04:00
Nick Sweeting a9986f1f05 add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support 2021-04-10 04:21:36 -04:00
Nick Sweeting 5a9f27204a don't use chrome when it's not available on windows systems 2021-04-05 23:33:08 -04:00
Nick Sweeting 3e26ae4a66 support finding multiple urls as substrings in text 2021-03-27 04:30:40 -04:00
Nick Sweeting c089501073 add response status code to headers.json 2021-01-30 20:44:49 -05:00
Nick Sweeting a0a79cead8 move utils and vendored libs into subfolders 2020-12-06 02:01:18 +02:00
Nick Sweeting 104553489f remove redundant utils file 2020-11-28 02:12:27 -05:00
Nick Sweeting 83693a5c03 add packaging setup with stdeb for debian and apt
vendor the base32_crockford lib
add build script for Debian packages
2020-11-23 16:57:05 -05:00
Nick Sweeting c47398851b nicer timeout hints 2020-10-31 07:57:11 -04:00
Cristian 62ed11a5ca fix: Improve headers handling 2020-09-24 12:55:51 -05:00
Angel Rey f0915a56aa Replaced get method 2020-09-24 12:55:51 -05:00
Angel Rey a8a8fd14ac Fixed indent headers.json 2020-09-23 11:07:00 -05:00
Angel Rey 852e3c9cff Added headers extractor 2020-09-23 11:07:00 -05:00
Cristian b18bbf8874 test: Fix tests post-rebase 2020-09-17 09:09:52 -05:00
apkallum 008769d296 add support for Paths in json encoder 2020-09-17 09:09:52 -05:00
Nick Sweeting 3658153cf8 fix url parsing through quotes 2020-08-18 08:04:57 -04:00
Cristian d0d2991c69 fix: Change import that was not working 2020-07-31 12:15:00 -05:00
Cristian 6006b4f93b refactor: Organize code to remove flake8 issues 2020-07-24 12:25:25 -05:00
Cristian 949f78aa65 fix: Use w3lib to improve the encoding extraction 2020-07-22 10:24:08 -05:00
Nick Sweeting 8cb530230c fix docker SHM limited to 64mb chrome crash 2020-07-21 23:39:21 -04:00
apkallum b7785c4138 use dateparser for parsing, let it handle error 2020-07-16 19:38:38 -04:00
Nick Sweeting dfb83b4f27 add AttributeDict 2020-07-13 11:24:49 -04:00
Cristian 528fc8f1f6 fix: Improve encoding detection for rss+xml content types 2020-07-02 12:11:23 -05:00
Nick Sweeting 3ec97e5528 fix git conflict committed by accident 2020-07-02 03:22:37 -04:00
Nick Sweeting 8840ad72bb remove circular import possibilities 2020-07-02 03:13:35 -04:00
Cristian c971e00c9c feat: Add stdout from process to the template 2020-07-01 12:23:59 -05:00
Nick Sweeting c415420f33 improve sort columns and UI placeholders 2020-06-30 06:41:48 -04:00
Nick Sweeting 9f440c2cf8 use requests.get to fetch and decode instead of urllib 2020-06-30 05:55:54 -04:00
Nick Sweeting cb67b09f9d Merge branch 'master' into django 2020-06-25 21:30:29 -04:00
michael.bub c79ce2b1f5 guess encoding via chardet if available 2020-02-15 13:58:07 +01:00
Mashiat Sarker Shakkhar 0bb216ce02 util.py: Use dateparser to parse date strings. 2019-09-10 23:51:09 -04:00