CodeMirror/ArchiveBox

Mirror synced 2024-10-02 10:07:35 +13:00

Author	SHA1	Message	Date
Ben Muthalaly	11d473e536	Add config options to add admin user on first run	2023-10-14 00:38:04 -05:00
Ross Williams	d8aa84ac98	Make extracting text for indexing optional Add a configuration option to enable/disable HTML text extraction for indexing	2023-10-12 13:14:39 -04:00
Ross Williams	b6a20c962a	Extract text from singlefile.html when indexing singlefile.html contains a lot of large strings in the form of `data:` URLs, which can be unnecessarily stored in full-text indices. Also, large chunks of JavaScript shouldn't be indexed, either, as they pollute search results for searches about JS functions, etc. This commit takes a blanket approach of parsing singlefile.html as it is read and only outputting text and selected textual attributes (like `alt`) for indexing.	2023-10-12 13:06:35 -04:00
Ben Muthalaly	77917e9b55	Fix HTML title parsing bugs. This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors. The first occurred when title tags were empty (e.g. "<title></title>") which was parsed as "</title". The second occurred when titles were a single character (e.g. "<title>A</title>") which was not matched by the regex, and so would fall back to link.base_url. Now when tags are empty, it falls back to link.base_url, and single character titles are parsed correctly. The way the regex works now is still a bit wonky for some edge cases. I couldn't find any cases of incorrect behavior, but it still might be worth reworking more completely for robustness.	2023-10-09 02:00:01 -05:00
Nick Sweeting	5c1a14e4f2	ignore errors while getting system user name	2023-09-14 03:39:44 -07:00
Nick Sweeting	ffe2968e4f	improve some comments	2023-09-14 02:41:27 -07:00
Nick Sweeting	f809efce4d	Merge pull request #996 from barthalion/dev	2023-09-03 21:40:49 -07:00
Nick Sweeting	aaca74f6a8	only start parsing json after the first open brace	2023-09-03 21:40:12 -07:00
Nick Sweeting	cd9f228b2f	Merge pull request #1214 from DanielBatteryStapler/DanielBatteryStapler-patch-1	2023-09-03 21:25:12 -07:00
Nick Sweeting	16d278fbdb	Merge pull request #1168 from mAAdhaTTah/add-readwise-reader	2023-09-03 21:24:49 -07:00
Nick Sweeting	110a22ee32	Merge branch 'dev' into DanielBatteryStapler-patch-1	2023-08-31 15:20:46 -07:00
Nick Sweeting	73a5f74d38	update default YOUTUBEDL_ARGS to fix subs and filesize	2023-08-31 15:17:45 -07:00
Nick Sweeting	86366d5640	Update logging_util.py to fix generator subscripting error	2023-08-31 15:12:43 -07:00
spresse1	603ce7ec10	After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.	2023-08-28 17:27:03 +02:00
root	23f086aa40	add LDAP support	2023-08-17 19:51:02 -05:00
DanielBatteryStapler	94dacc49c7	Fix archive_org icon "exists"	2023-08-15 23:49:54 -04:00
Ross Williams	c039ef05b3	Fix hyphen placement in util.URL_REGEX Incorrect hyphen placement in `URL_REGEX` was allowing it to match more characters than intended. In a regex character class, a literal hyphen can only appear as the first character in the class, or it will be interpreted as the delimiter of a range of characters. The issue fixed here caused the range of characters from `[$-_]` to be treated as valid URL characters, instead of the intended set of three characters `[-_$]`. The incorrect range interpretation inadvertently included most ASCII punctuation, most importantly the angle brackets, square brackets, and single quote that the expression uses to mark the end of a match. This causes the expression to match a URL that has a "hostname" portion beginning with one of the intended "stop parsing" characters. For example: ``` https://<b>www</b>.example.com/ # MATCHES but should not https://[for example] # MATCHES but should not scheme='https://' # MATCHES, including final quote, but should not ``` Some test cases have been added to the `URL_REGEX` assert in archivebox.parsers to cover this possibility.	2023-08-08 15:24:16 -04:00
Ross Williams	2076474252	Drop use of TypeAlias to maintain Python 3.9 compat TypeAlias annotation was introduced in Python 3.10, and is not strictly necessary. Drop use of it to maintain Python 3.9 compatibility.	2023-08-02 10:56:48 -04:00
Ross Williams	b44f7e68b1	Add URL-specific method allow/deny lists Allows enabling only allow-listed extractors or disabling specific deny-listed extractors for a regular expression matched against an added site's URL.	2023-08-02 09:36:40 -04:00
Ross Williams	46e80dd509	Rename URL_(WHITE\|BLACK)LIST to URL_(ALLOW\|DENY)LIST Retain aliases for old configuration files	2023-08-02 09:31:48 -04:00
Nick Sweeting	b773041952	Merge pull request #1199 from overhacked/chrome_version_detection_fix	2023-08-01 10:14:18 -07:00
Ross Williams	d0e65eba7f	More reliably detect Google Chrome version number Previous method was splitting on the first whitespace, and missing the version number when it appeared as `"Google Chrome 115.0.234.2342"` instead of, e.g., `"Chromium 115.0.234.8283"`. This commit changes the version detection to regex search for whitespace, then one or more digits followed by a period, then at least one more digit. Only the first sequence of digits is captured. Unless Chrome radically changes their version numbering, this should capture the first group of digits after the reported browser name, which would be the major version.	2023-07-31 15:34:58 -04:00
Ross Williams	9d9872d325	bin_version means to modify, not replace environ the `bin_version` function means to modify the environment, not replace it entirely. Fixes bugs that occur when it wipes out the PATH environment variable, such as when running in a virtual environment.	2023-07-31 11:36:34 -04:00
mAAdhaTTah	181501fd36	Add Readwise Reader API parser Implemented similar to the Pocket API.	2023-07-02 11:20:58 -04:00
Sascha Ißbrücker	7bf4f40da0	just use out_dir	2023-05-29 10:03:49 +02:00
Sascha Ißbrücker	40c122515a	fix: make oneshot command return successful exit code	2023-05-29 10:01:27 +02:00
Micah R Ledbetter	1e50ca243e	Add FAVICON_PROVIDER option for custom favicon service	2023-05-05 20:42:36 -05:00
David Calano	f48e48e6da	Fix for Issue #1008 - Added missing decode() when setting pkg_path variable	2023-03-29 01:48:12 -04:00
Tom Ryder	53af810ff8	Add missing closing quote to style attribute	2023-03-27 10:54:04 +13:00
ふぁ	44a5a5ed7e	explicitly specify --headless=new Signed-off-by: ふぁ <yuki@yuki0311.com>	2023-03-17 19:30:14 +09:00
Nick Sweeting	9f42a3bf29	fix whitespace	2023-03-15 16:01:02 -07:00
ふぁ	d77c770c47	add CHROME_TIMEOUT args Signed-off-by: ふぁ <yuki@yuki0311.com>	2023-03-14 20:29:41 +09:00
Nick Sweeting	606fa397a4	disable passing timeout arg to chrome because v111 is crashing when passed	2023-03-13 10:50:18 +00:00
Nick Sweeting	1f1c70a8b1	remove --single-process from chrome args and add some rendering optimization args	2023-03-13 10:49:57 +00:00
Nick Sweeting	9599845b56	ensure DOM HTML dump is non-zero length file when retrying	2023-03-13 10:49:26 +00:00
Nick Sweeting	dca69933eb	Update archivebox/config.py Co-authored-by: dugite-code <dugite-code@users.noreply.github.com>	2023-01-09 18:22:01 -08:00
Nick Sweeting	2538b170c7	Merge branch 'dev' into feat/reverse-proxy-auth	2023-01-09 18:20:45 -08:00
Nick Sweeting	0cbeeb4346	Merge pull request #1021 from renaisun/dev	2023-01-09 18:17:39 -08:00
Joseph Turian	07de4a79a1	Merge branch 'dev' into feature/kludge-984-UTF8-bug	2022-12-20 11:39:01 +01:00
Nick Sweeting	e114b1f6dc	Merge pull request #1027 from turian/feature/migrations-0021_auto_20220914_0934.py	2022-11-27 19:28:55 -08:00
SnZ	2db830c6a8	Method typo? Fixes '[Errno 2] No such file or directory' error during add	2022-11-20 01:51:16 +01:00
Joseph Turian	a26a91d09f	Merge branch 'feature/migrations-0021_auto_20220914_0934.py' into feature/kludge-984-UTF8-bug	2022-09-14 09:44:55 +00:00
Joseph Turian	22d8e57637	Add missing migration 0021	2022-09-14 09:36:17 +00:00
Joseph Turian	30947aeb07	yt-dlp flag cleanup	2022-09-14 06:29:57 +02:00
Joseph Turian	f729bbe122	yt-dlp fixes	2022-09-14 06:27:58 +02:00
Joseph Turian	081a12b079	Add ts	2022-09-12 21:32:47 +00:00
Joseph Turian	daef48e59b	flake8	2022-09-12 21:31:33 +00:00
Joseph Turian	983f485cc0	flake8	2022-09-12 21:29:43 +00:00
Joseph Turian	b864c38d9e	Don't be strict on unicode errors	2022-09-12 20:40:45 +00:00
Joseph Turian	dba423a568	A few more youtube-dl tweaks	2022-09-12 20:36:23 +00:00
Joseph Turian	f5f7aff3b4	Added yt-dlp everywhere	2022-09-12 20:34:02 +00:00
renaisun	0ea955b3ed	add a missing comma	2022-09-12 09:08:28 +08:00
notevenaperson	40659b5e9d	singlefile.py: Code to ensure options are deduplicated	2022-09-12 09:08:28 +08:00
Joseph Turian	2b58cce43f	Attempted to warn on #984 and #1014	2022-09-11 12:19:16 +02:00
Bartłomiej Piotrowski	eb97fd427b	Skip first line of the "JSON" file ArchiveBox moves the file to parse to the sources directory and adds the original filename at the top, making the file invalid.	2022-07-05 10:56:40 +02:00
Nick Sweeting	03eb7e5875	Update config.py	2022-06-09 01:04:55 -07:00
renaisun	8899fe0b92	Add SINGLEFILE_ARGS to control single-file arguments	2022-06-09 14:35:48 +08:00
Nick Sweeting	d586a8babc	show mount paths with at symbol in version output	2022-06-08 20:22:58 -07:00
Nick Sweeting	319ea481b8	Update config.py	2022-06-08 20:17:38 -07:00
Nick Sweeting	01555dfe34	Update main.py	2022-06-08 20:17:31 -07:00
Nick Sweeting	2bbc742017	typo fix	2022-06-08 20:16:08 -07:00
Nick Sweeting	e2fa68dba6	resolve config paths before using	2022-06-08 20:15:22 -07:00
Nick Sweeting	ccce4a6a2f	use new is_mount and COMMIT_HASH config options	2022-06-08 20:13:22 -07:00
Nick Sweeting	9f90a2d60d	disable unused sqlite3 stuff	2022-06-08 20:12:55 -07:00
Nick Sweeting	c78a2edc42	add is_mount and COMMIT_HASH to config.py	2022-06-08 20:04:01 -07:00
Nick Sweeting	375ba9d135	Update settings.py	2022-06-08 20:00:29 -07:00
Nick Sweeting	ae5c8f2bf8	fix newline included in commit hash	2022-06-08 19:57:38 -07:00
Nick Sweeting	cb3ebbe69a	fix git commit hash location	2022-06-08 19:52:48 -07:00
Nick Sweeting	413aa2ef04	fix commit hash detection	2022-06-08 19:51:46 -07:00
Nick Sweeting	33ec2117e9	Update main.py	2022-06-08 19:50:45 -07:00
Nick Sweeting	dd29e1bf78	clean up first line of CLI version output for easier downstream parsing	2022-06-08 19:46:09 -07:00
Nick Sweeting	0c6d4c82c3	Update config.py	2022-06-08 19:11:02 -07:00
Nick Sweeting	f9c5808940	Update config.py	2022-06-08 19:09:11 -07:00
Nick Sweeting	5509b5cd8b	Update main.py	2022-06-08 19:08:33 -07:00
Nick Sweeting	19b88d30b2	fix missing brace	2022-06-08 19:06:03 -07:00
Nick Sweeting	31d5fbbf17	Update config.py	2022-06-08 19:04:06 -07:00
Nick Sweeting	6b019da3e9	Update config.py	2022-06-08 19:01:55 -07:00
Nick Sweeting	c752c7053d	Update main.py	2022-06-08 18:59:08 -07:00
Nick Sweeting	f9c82841ad	fix sqlite option detection	2022-06-08 18:58:15 -07:00
Nick Sweeting	1fd5830f58	enforce UTC timezone on server	2022-06-08 18:41:22 -07:00
Nick Sweeting	3e3c011f86	enforce UTC timezone on server	2022-06-08 18:40:48 -07:00
Nick Sweeting	e06717419c	fix sqlite3 version detection	2022-06-08 18:35:31 -07:00
Nick Sweeting	d0f129295f	move sqlite3 checks up a level	2022-06-08 18:29:53 -07:00
Nick Sweeting	0c7d7deb32	add missing brace	2022-06-08 18:26:42 -07:00
Nick Sweeting	ca16c88a3d	show PUID, PGID, ENFORCE_ATOMIC_WRITES, and OUTPUT_PERMISSIONS in version output header	2022-06-08 18:24:58 -07:00
Nick Sweeting	89175ccb22	check SQLite3 version and enabled extensions on startup	2022-06-08 18:24:17 -07:00
Nick Sweeting	c245d36e44	add PUID and PGID as config options in archivebox	2022-06-08 17:42:52 -07:00
Nick Sweeting	c5fc3e1e65	--ammend	2022-05-09 23:59:27 -07:00
Nick Sweeting	0b4df768ba	hack to check for generator type cause too lazy to import	2022-05-09 23:50:56 -07:00
Nick Sweeting	5e4ddbbf25	fix mercury bin parsing back	2022-05-09 21:58:17 -07:00
Nick Sweeting	e96c1bcf13	bump mercury parser to git head version	2022-05-09 21:48:41 -07:00
Nick Sweeting	d581a5081f	correctly handle bytes strings in hints	2022-05-09 21:29:37 -07:00
Nick Sweeting	a6767671fb	append content of referenced files to imports	2022-05-09 21:21:39 -07:00
Nick Sweeting	f6d6a06c78	always show all totals in log output	2022-05-09 21:21:26 -07:00
Nick Sweeting	d05510f844	fix version string parsing on macOS in some cases where LANG C is not supported	2022-05-09 21:21:08 -07:00
Nick Sweeting	4b8b17e788	add update flag support to archivebox schedule	2022-05-09 20:18:43 -07:00
Nick Sweeting	8cfe6f4afb	cleanup update flag handling and show better logging to clarify when its working	2022-05-09 20:15:55 -07:00
Nick Sweeting	38e54b93fe	allow parsing to continue even when fetching URL contents fails	2022-05-09 19:56:24 -07:00
Nick Sweeting	ecbcb6a1b3	fix bracing in template tag for PREVIEW_ORIGINALS	2022-05-09 19:56:08 -07:00
Nick Sweeting	8ebf3e2f93	add config option PREVIEW_ORIGINALS to hide original iframes in snapshot detail pages	2022-05-09 19:31:41 -07:00
Nick Sweeting	acd53c854d	handle new wallabag export format with newlines mid-tag attributes	2022-05-09 19:07:48 -07:00
Nick Sweeting	950b5cbbb6	Merge pull request #924 from prnake/dev improve title extractor	2022-05-09 18:38:12 -07:00
Nick Sweeting	6e66863871	add max 5s writing delay for concurrent writers and flush WAL slower	2022-05-09 18:36:40 -07:00
Nick Sweeting	57df65f28f	use yt-dlp for media archiving instead of youtube-dl	2022-04-21 07:11:35 -07:00
Nick Sweeting	eb81d41f84	bump Dockerfile base image version and install yt-dlp	2022-04-21 07:11:35 -07:00
Ross	c63822a5e5	Fix missing input redirection in a hint text	2022-04-19 22:25:49 +01:00
Igor Rzegocki	d4f534e612	add `LOGOUT_REDIRECT_URL`	2022-03-31 21:40:14 +02:00
Pellaeon Lin	5e9d05483e	Fix bin_version: set LANG=C when calling executables to avoid parsing localized output.	2022-02-24 17:01:00 +08:00
prnake	011bd104cb	remove unused import	2022-02-09 10:48:51 +08:00
papersnake	de8e22efb7	improve title extractor	2022-02-08 23:17:52 +08:00
Nick Sweeting	666ab20df5	Update archivebox/config.py	2022-01-10 20:42:09 -05:00
hannah98	fc3d2bb4dc	rename TAG_SEPARATORS to TAG_SEPARATOR_PATTERN	2022-01-06 14:14:41 +00:00
hannah98	049f88def9	Added TAG_SEPARATORS option to supply a regex of characters to use when splitting tags	2021-12-30 20:19:48 +00:00
Nick Sweeting	d7f01922f3	fix direct assignment of tags to many-to-many set	2021-12-23 12:29:17 -05:00
Nick Sweeting	b1b7ee2b85	Update sql.py	2021-12-23 12:17:55 -05:00
hannah98	4b8962b60b	Fix #725 - correctly parse tags on json import	2021-12-20 08:58:58 -06:00
Mika Tuupola	f14a861605	Change logfile open to write mode only	2021-12-19 23:17:33 +02:00
TheCakeIsNaOH	decab91ea2	(#847 ) Decode error output hints to string if needed	2021-12-16 16:46:12 -06:00
Nick Sweeting	44f5338470	fix typo in pocket_api articl variable name	2021-11-12 19:23:47 -05:00
Nick Sweeting	8878dcc5e8	Merge pull request #843 from bltavares/patch-1	2021-11-12 15:57:19 -08:00
Nick Sweeting	d5240f1a1d	Merge pull request #885 from adamwolf/safari-admin-actions	2021-11-12 08:56:31 -08:00
Adam Wolf	18e1fb0d96	Fixes Add button behavior on Safari Previously, when you clicked the Add button, the page wouldn't change. It looked like it wasn't doing anything, as noted by @rcarmo (https://github.com/ArchiveBox/ArchiveBox/issues/658#issuecomment-948300055). I didn't track down the exact reason why. It may be that Safari didn't like the two opening <h3>s, but I was able to find a bunch of people complaining about Safari being very finicky with innerHTML and actually repainting the page, enough that I decided to try just extending the block hide/show behavior already done with the delay-warning, and it works for me now in both Chrome and Safari. For #658.	2021-10-28 22:31:54 -05:00
Adam Wolf	83731f5a68	Tweak JS so Safari can choose admin actions I noticed that Safari was submitting both the empty option and the selected options back to the server. Digging into it, I was able to get Safari to deselect the --------- option by using '[selected]' as the selector. For #658	2021-10-28 22:22:46 -05:00
Igor Rzegocki	05de1c9fe6	healthcheck endpoint	2021-10-03 19:12:03 +02:00
Igor Rzegocki	95cf85f8cf	Support for Reverse Proxy authentication backends (like authelia)	2021-09-30 17:40:13 +02:00
Bruno Tavares	bb2a2e758a	Avoid KeyError on Pocket API parser When trying to import my Pocket library I got a lot of `KeyError` exceptions in Python. The Pocket API has a few idiosyncrasies, such as sometimes returning the keys in the JSON and sometimes not. Running ```sh archivebox add --parser pocket_api pocket://my_username ``` gave me these errors: ``` File "/app/archivebox/parsers/pocket_api.py", line 54, in link_from_article title = article['resolved_title'] or article['given_title'] or url KeyError: 'resolved_title' ``` This commit contains the patches I made to successfully import my library	2021-09-07 21:53:36 -03:00
Ross Williams	f6cf35a45d	Fix Pinboard RSS parsing valid links as `None` `item.find(p)` returns either an `ElementTree.Element` or `None`. The [lambda on line 24][lambda] coerces the return value to a bool, which is `False` if the `<link>` element has no children (see [`ElementTree.py` line 207][etbooldef]), so the lambda returns `None`. Further, returning a `Link` with `url=None` violates [an assertion in `index/schema.py`][assertion], which crashes the `archivebox add` command. [lambda]: `3d54b1321b/archivebox/parsers/pinboard_rss.py (L24)` [etbooldef]: `3d8993a744/Lib/xml/etree/ElementTree.py (L207)` [assertion]: `3d54b1321b/archivebox/index/schema.py (L165)`	2021-08-04 10:13:37 -04:00
Inndy	0e81a0722e	Discard Referer header from iframe and link to original URL	2021-07-19 21:48:01 +08:00
Nick Sweeting	5a2c78e14b	add proper support for URL_WHITELIST instead of using negation regexes	2021-07-06 23:42:00 -04:00
Nick Sweeting	e4974d3536	support negation patterns by checking both re.search and re.match	2021-07-06 23:17:05 -04:00
TJ Horner	cdcfb7fa44	Exempt /add route from CSRF	2021-07-01 20:55:51 -04:00
Nick Sweeting	e0a2b2e252	ominous warnings	2021-06-01 03:03:42 -04:00
Nick Sweeting	aa53fe653c	fix use of unneeded perms arg	2021-06-01 02:58:36 -04:00
Nick Sweeting	c2d1a57581	fix umask dir permissions	2021-06-01 00:50:18 -04:00
Nick Sweeting	4715ace7dd	ignore BaseException lgtm errors	2021-05-31 20:59:05 -04:00
Nick Sweeting	9f1470cf03	fix output permissions tests	2021-05-31 20:57:46 -04:00
Nick Sweeting	8230f88d80	change default OUTPUT_PERMISSIONS to disallow execution except on dirs	2021-05-31 19:31:51 -04:00
Nick Sweeting	1112526543	add option ENFORCE_ATOMIC_WRITES to allow disabling forced FSYNC writes on network drives	2021-05-31 19:31:51 -04:00
Nick Sweeting	49faec8f6d	add no-zygote and single-process args to try and prevent orphan chrome processes after exit	2021-05-13 05:04:23 -04:00
Nick Sweeting	eb4d3bca9d	Update readability.py	2021-05-13 00:13:32 -04:00
Nick Sweeting	c3d009e4ec	fix python file execution checking	2021-04-24 04:43:02 -04:00
Nick Sweeting	79e19ecd47	shield pwd import for windows	2021-04-24 03:51:38 -04:00
Nick Sweeting	3db77fd1a2	fix lint errors	2021-04-24 03:49:01 -04:00
Nick Sweeting	3992e0fee3	auto prepend python binary to args when running system command with python file as first argument	2021-04-24 03:29:22 -04:00
Nick Sweeting	fa84f749ff	run archivebox version using python binary	2021-04-24 03:29:22 -04:00
Nick Sweeting	226e26852c	dont try to autodetect whether node, chrome, etc are needed during setup	2021-04-24 03:29:22 -04:00
Nick Sweeting	f129b9f443	allow executing main	2021-04-24 03:11:06 -04:00
Nick Sweeting	208f866fc4	fix user detection on freebsd always returning root	2021-04-24 02:52:09 -04:00
Nick Sweeting	06f6084e3b	ignore stdin when passed instead of throwing an error	2021-04-24 00:09:52 -04:00
Nick Sweeting	eb80dc26a1	cleanup config files	2021-04-23 22:58:44 -04:00
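
Note on commit c039ef05b3 (Fix hyphen placement in util.URL_REGEX): the range-versus-literal behaviour it describes can be reproduced with a minimal Python sketch. The two patterns below are simplified stand-ins for illustration, not the actual `URL_REGEX` from ArchiveBox, and the names `RANGE_CLASS`, `LITERAL_CLASS`, `STOP_CHARS`, and `hits` are hypothetical.

```python
import re

# Simplified stand-ins (assumptions), not ArchiveBox's real URL_REGEX.
RANGE_CLASS = re.compile(r"[$-_]")    # hyphen between '$' and '_' forms the range U+0024..U+005F
LITERAL_CLASS = re.compile(r"[-_$]")  # leading hyphen is a literal: only '-', '_' and '$'

# Characters the commit message says should mark the end of a URL match.
STOP_CHARS = "<>[]'"


def hits(pattern: re.Pattern, candidates: str) -> str:
    """Return the characters from `candidates` that `pattern` accepts."""
    return "".join(ch for ch in candidates if pattern.fullmatch(ch))


print(repr(hits(RANGE_CLASS, STOP_CHARS + "-_$")))    # characters accepted by the range form
print(repr(hits(LITERAL_CLASS, STOP_CHARS + "-_$")))  # characters accepted by the literal form
```

The range form accepts every candidate, including the angle brackets, square brackets, and single quote, while the literal form accepts only the three intended characters. In Python's `re` module, a hyphen placed last in the class or escaped as `\-` is also treated as a literal.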

1365 commits