From ab225104c5ca7e06d796ed5f0657fe698978e3d9 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Fri, 23 Feb 2024 12:54:56 -0800 Subject: [PATCH 01/25] Update README.md --- README.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index a961cb47..4d105d02 100644 --- a/README.md +++ b/README.md @@ -141,21 +141,23 @@ curl -fsSL 'https://get.archivebox.io' | sh ArchiveBox is free for everyone to self-host, but we also provide support, security review, and custom integrations to help NGOs, governments, and other organizations [run ArchiveBox professionally](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102): -- 🗞️ **Journalists:** +- **Journalists:** `crawling during research`, `preserving cited pages`, `fact-checking & review` -- ⚖️ **Lawyers:** +- **Lawyers:** `collecting & preserving evidence`, `detecting changes`, `tagging & review` -- 🔬 **Researchers:** +- **Researchers:** `analyzing social media trends`, `getting LLM training data`, `crawling pipelines` -- 👩🏽 **Individuals:** +- **Individuals:** `saving bookmarks`, `preserving portfolio content`, `legacy / memoirs archival` +- **Governments:** + `snapshotting public records / govt sites`, `recordkeeping compliance`, `libraries` -> ***[Contact our team](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your institution/org wants to use ArchiveBox professionally. We offer services such as:* +> ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally. We offer services such as:* > > - setup & support, hosting, custom features, security, hashing & audit logging for chain-of-custody, etc. > - for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more... -*We are a 🏛️ 501(c)(3) nonprofit and all our work goes towards supporting open-source development.* +*ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes towards supporting open-source development.*
From c7cdc2fc27d39f2d8dbb16b21bae6139c8d14deb Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Fri, 23 Feb 2024 12:55:36 -0800 Subject: [PATCH 02/25] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 4d105d02..174f1a39 100644 --- a/README.md +++ b/README.md @@ -150,7 +150,7 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur - **Individuals:** `saving bookmarks`, `preserving portfolio content`, `legacy / memoirs archival` - **Governments:** - `snapshotting public records / govt sites`, `recordkeeping compliance`, `libraries` + `snapshotting public service sites`, `recordkeeping compliance`, `libraries` > ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally. We offer services such as:* > From a00b34cc13c5f2e31a2bf7009be5a9158cd4e7a2 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Fri, 23 Feb 2024 12:58:28 -0800 Subject: [PATCH 03/25] Update README.md --- README.md | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 174f1a39..e0162847 100644 --- a/README.md +++ b/README.md @@ -149,15 +149,12 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur `analyzing social media trends`, `getting LLM training data`, `crawling pipelines` - **Individuals:** `saving bookmarks`, `preserving portfolio content`, `legacy / memoirs archival` -- **Governments:** - `snapshotting public service sites`, `recordkeeping compliance`, `libraries` +- **Governments:** + `snapshotting public service sites`, `recordkeeping compliance` -> ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally. We offer services such as:* -> -> - setup & support, hosting, custom features, security, hashing & audit logging for chain-of-custody, etc. -> - for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more... - -*ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes towards supporting open-source development.* +> ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally.* > We offer: setup & support, hosting, custom features, security, hashing & audit logging for chain-of-custody, etc. > *ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes towards supporting open-source development.*
From 597f1a39e06ef667401d84f23fc7b8ba2fd277a5 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Fri, 23 Feb 2024 23:20:48 -0800 Subject: [PATCH 04/25] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index e0162847..74f03a66 100644 --- a/README.md +++ b/README.md @@ -153,8 +153,8 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur `snapshotting public service sites`, `recordkeeping compliance` > ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally.* -> We offer: setup & support, hosting, custom features, security, hashing & audit logging for chain-of-custody, etc. -> *ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes towards supporting open-source development.* +> We offer: setup & support, hosting, custom features, security, hashing & audit logging/chain-of-custody, etc. +> *ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes supports open-source development.*
From f02b27920c41a9a1182da4d1871f7ba693c20c3a Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Fri, 23 Feb 2024 23:21:23 -0800 Subject: [PATCH 05/25] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 74f03a66..f6663013 100644 --- a/README.md +++ b/README.md @@ -154,7 +154,7 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur > ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally.* > We offer: setup & support, hosting, custom features, security, hashing & audit logging/chain-of-custody, etc. -> *ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes supports open-source development.* +> *ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work supports open-source development.*
From a729480b753a10cd3e97884ff0804eebd0d9cd8b Mon Sep 17 00:00:00 2001 From: Naomi Phillips Date: Sun, 3 Mar 2024 02:32:46 -0500 Subject: [PATCH 06/25] Add COOKIES_FILE support for singlefile extractor --- archivebox/extractors/singlefile.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/archivebox/extractors/singlefile.py b/archivebox/extractors/singlefile.py index e3860527..377e4a0e 100644 --- a/archivebox/extractors/singlefile.py +++ b/archivebox/extractors/singlefile.py @@ -19,6 +19,7 @@ from ..config import ( SINGLEFILE_VERSION, SINGLEFILE_ARGS, CHROME_BINARY, + COOKIES_FILE, ) from ..logging_util import TimedProgress @@ -48,6 +49,7 @@ def save_singlefile(link: Link, out_dir: Optional[Path]=None, timeout: int=TIMEO browser_args = '--browser-args={}'.format(json.dumps(browser_args[1:])) options = [ *SINGLEFILE_ARGS, + *(["--browser-cookies-file={}".format(COOKIES_FILE)] if COOKIES_FILE else []), '--browser-executable-path={}'.format(CHROME_BINARY), browser_args, ] From 86c3e271adeec95a94758a54e81a409f0a1e55ef Mon Sep 17 00:00:00 2001 From: Ricky de Laveaga Date: Thu, 7 Mar 2024 09:45:41 -0800 Subject: [PATCH 07/25] Update README.md Browser Extension link Point to GH repo with all browsers, not Chrome Webstore --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f6663013..6c17b7f5 100644 --- a/README.md +++ b/README.md @@ -35,7 +35,7 @@ Without active preservation effort, everything on the internet eventually dissap

-📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from your bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, our [Browser Extension](https://chromewebstore.google.com/detail/archivebox-exporter/habonpimjphpdnmcfkaockjnffodikoj), and more. +📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from your bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, our [Browser Extension](https://github.com/ArchiveBox/archivebox-browser-extension), and more. See Input Formats for a full list of supported input formats...
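Patch 06 above wires `COOKIES_FILE` into the single-file extractor by splatting an optional flag into the options list only when the config value is set. A minimal sketch of that same pattern in isolation; the `single-file` binary name, the example path, and the helper function are illustrative assumptions, only the `--browser-cookies-file` flag comes from the diff itself:

```python
# Sketch of the conditional-flag pattern used by save_singlefile() in patch 06.
# COOKIES_FILE / SINGLEFILE_ARGS / CHROME_BINARY stand in for values that
# archivebox.config would normally provide (assumed example values here).
import shutil

COOKIES_FILE = "/data/cookies.txt"   # may be None/empty when unset
SINGLEFILE_ARGS: list[str] = []
CHROME_BINARY = shutil.which("chromium") or "/usr/bin/chromium-browser"

def build_singlefile_options(url: str, out_path: str) -> list[str]:
    return [
        "single-file",  # assumed binary name for illustration
        *SINGLEFILE_ARGS,
        # splat in the cookies flag only when COOKIES_FILE is set, so the
        # CLI never receives an empty --browser-cookies-file= argument
        *(["--browser-cookies-file={}".format(COOKIES_FILE)] if COOKIES_FILE else []),
        "--browser-executable-path={}".format(CHROME_BINARY),
        url,
        out_path,
    ]

print(build_singlefile_options("https://example.com", "./singlefile.html"))
```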
From e00845f58c917e2129de8b2be66ba9151849d9b6 Mon Sep 17 00:00:00 2001 From: Nicholas Hebert <68243838+n-hebert@users.noreply.github.com> Date: Tue, 19 Mar 2024 11:13:47 -0300 Subject: [PATCH 08/25] Revise md section not formatting properly in html --- README.md | 22 ++++++++++------------ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index 6c17b7f5..43f0080c 100644 --- a/README.md +++ b/README.md @@ -1060,7 +1060,6 @@ Improved support for saving multiple snapshots of a single URL without this hash
- ### Storage Requirements Because ArchiveBox is designed to ingest a large volume of URLs with multiple copies of each URL stored by different 3rd-party tools, it can be quite disk-space intensive. There are also some special requirements when using filesystems like NFS/SMB/FUSE. @@ -1070,17 +1069,16 @@ Because ArchiveBox is designed to ingest a large volume of URLs with multiple co Click to learn more about ArchiveBox's filesystem and hosting requirements...
- -**ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles**, mostly dependent on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower `MEDIA_MAX_SIZE=750mb`. - -Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind). - -**Don't store large collections on older filesystems like EXT3/FAT** as they may not be able to handle more than 50k directory entries in the `data/archive/` folder. - -**Try to keep the `data/index.sqlite3` file on a local drive (not a network mount)** or SSD for maximum performance; however, the `data/archive/` folder can be on a network mount or slower HDD. - -If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server. -
    +
  • ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles, mostly dependent on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower `MEDIA_MAX_SIZE=750mb`.
  • +
  • Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind). +
  • +
  • Don't store large collections on older filesystems like EXT3/FAT as they may not be able to handle more than 50k directory entries in the `data/archive/` folder. +
  • +
  • Try to keep the `data/index.sqlite3` file on a local drive (not a network mount) or SSD for maximum performance; however, the `data/archive/` folder can be on a network mount or slower HDD.
  • +
  • If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server. +
  • +

Learn More

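To make the ~1gb to ~50gb per 1000 articles range above concrete, here is a rough, purely illustrative estimator. The per-snapshot figures are assumptions chosen to reproduce the quoted range, not measured values:

```python
# Back-of-envelope ArchiveBox disk-usage estimate based on the range quoted
# above. base/media sizes per snapshot are illustrative assumptions.
def estimate_disk_gb(num_snapshots: int, save_media: bool = False,
                     media_max_size_mb: int = 750) -> float:
    base_mb_per_snapshot = 1.0          # HTML/PDF/screenshot/WARC outputs
    media_mb_per_snapshot = 0.0
    if save_media:
        # assume media downloads average ~1/15th of the configured cap
        media_mb_per_snapshot = media_max_size_mb / 15
    return num_snapshots * (base_mb_per_snapshot + media_mb_per_snapshot) / 1024

print(f"{estimate_disk_gb(1000):.1f} GB")                   # ~1 GB, text-only
print(f"{estimate_disk_gb(1000, save_media=True):.1f} GB")  # ~50 GB with media
```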
From 37c9a33c8b7d7b9d57696ff24008c24aa5ce5658 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Wed, 20 Mar 2024 23:19:23 -0700 Subject: [PATCH 09/25] Update README.md --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 43f0080c..e3fe581b 100644 --- a/README.md +++ b/README.md @@ -497,7 +497,7 @@ docker run -it -v $PWD:/data archivebox/archivebox help
-curl sh automatic setup script CLI Usage Examples (non-Docker) +curl sh automatic setup script CLI Usage Examples: non-Docker

 # make sure you have pip-installed ArchiveBox and it's available in your $PATH first  
@@ -514,7 +514,7 @@ archivebox add --depth=1 'https://news.ycombinator.com'
 
-Docker Docker Compose CLI Usage Examples +Docker CLI Usage Examples: Docker Compose

 # make sure you have `docker-compose.yml` from the Quickstart instructions first
@@ -532,7 +532,7 @@ docker compose run archivebox add --depth=1 'https://news.ycombinator.com'
 
-Docker Docker CLI Usage Examples +Docker CLI Usage Examples: Docker

 # make sure you create and cd into a new empty directory first  

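The `archivebox add` invocations shown above are also callable as a Python API; the Django admin's add view (see patch 22 below) calls the same `add(**input_kwargs)` helper internally. A minimal sketch, assuming the import path `archivebox.main` and an already-initialized data directory:

```python
# Sketch of driving ArchiveBox from Python instead of the CLI. Mirrors the
# add(**input_kwargs) call used by the admin add_view in archivebox/core/admin.py.
from pathlib import Path
from archivebox.main import add  # import path assumed; admin.py uses the same helper

add(
    urls="https://news.ycombinator.com",
    depth=1,               # also archive pages linked from the URL, like --depth=1
    update_all=False,      # don't re-pull every existing Snapshot
    out_dir=Path.cwd(),    # must be an initialized ArchiveBox data directory
)
```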
From d32413d74b2cd7b6dff1504851fb2098ef23758a Mon Sep 17 00:00:00 2001
From: Nick Sweeting 
Date: Wed, 20 Mar 2024 23:23:26 -0700
Subject: [PATCH 10/25] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index e3fe581b..e2c15f34 100644
--- a/README.md
+++ b/README.md
@@ -654,13 +654,13 @@ docker run -it -v $PWD:/data archivebox/archivebox add --depth=1 'https://exampl
  ArchiveBox supports ingesting URLs in [any text-based format](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file).
 
 -  From manually exported [browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) (in Netscape format)  
-  See instructions for: Chrome, Firefox, Safari, IE, Opera, and more...
+  Instructions: Chrome, Firefox, Safari, IE, Opera, and more...
 
 -  From URLs visited through a [MITM Proxy](https://mitmproxy.org/) with [`archivebox-proxy`](https://github.com/ArchiveBox/archivebox-proxy)  
   Provides [realtime archiving](https://github.com/ArchiveBox/ArchiveBox/issues/577) of all traffic from any device going through the proxy.
 
 -  From bookmarking services or social media (e.g. Twitter bookmarks, Reddit saved posts, etc.)  
-  See instructions for: Pocket, Pinboard, Instapaper, Shaarli, Delicious, Reddit Saved, Wallabag, Unmark.it, OneTab, Firefox Sync, and more...
+  Instructions: Pocket, Pinboard, Instapaper, Shaarli, Delicious, Reddit Saved, Wallabag, Unmark.it, OneTab, Firefox Sync, and more...
 
 
 

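The reason ArchiveBox can ingest URLs from any text-based format is that, at its simplest, input parsing reduces to extracting anything URL-shaped from the text and de-duplicating it. A toy sketch of that idea; ArchiveBox's real parsers are more careful, and the filename below is an assumed example:

```python
# Crude URL extractor over a Netscape-format bookmarks export (or any other
# text file), illustrating the "any text-based format" input model.
import re
from pathlib import Path

URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def extract_urls(text: str) -> list[str]:
    # dict.fromkeys() deduplicates while preserving first-seen order
    return list(dict.fromkeys(URL_RE.findall(text)))

bookmarks_html = Path("bookmarks_export.html").read_text(errors="ignore")
for url in extract_urls(bookmarks_html):
    print(url)
```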
From d9beebdee71be5d7bcc9cab16fee3df594dfd2d2 Mon Sep 17 00:00:00 2001
From: Nick Sweeting 
Date: Wed, 20 Mar 2024 23:25:06 -0700
Subject: [PATCH 11/25] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e2c15f34..50081c38 100644
--- a/README.md
+++ b/README.md
@@ -1017,7 +1017,7 @@ For various reasons, many large sites (Reddit, Twitter, Cloudflare, etc.) active
 
 

From 67baea172edf598c1a218d633e267d0f315365b0 Mon Sep 17 00:00:00 2001
From: Nick Sweeting 
Date: Wed, 20 Mar 2024 23:28:02 -0700
Subject: [PATCH 12/25] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 50081c38..9d705655 100644
--- a/README.md
+++ b/README.md
@@ -1070,8 +1070,8 @@ Because ArchiveBox is designed to ingest a large volume of URLs with multiple co
 
    -
  • ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles, mostly dependent on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower `MEDIA_MAX_SIZE=750mb`.
  • -
  • Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind). +
  • ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles, mostly dependent on whether you're saving audio & video using SAVE_MEDIA=True and whether you lower MEDIA_MAX_SIZE=750mb.
  • +
  • Disk usage can be reduced by using a compressed/[deduplicated](https://www.ixsystems.com/blog/ixsystems-and-klara-systems-celebrate-valentines-day-with-a-heartfelt-donation-of-fast-dedupe-to-openzfs-and-truenas/) filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like fdupes or rdfind.
  • Don't store large collections on older filesystems like EXT3/FAT as they may not be able to handle more than 50k directory entries in the `data/archive/` folder.
  • From 28e85e0b95cc5948663762d8b1922968d8c9e1f0 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Wed, 20 Mar 2024 23:31:04 -0700 Subject: [PATCH 13/25] Update README.md --- README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 9d705655..1ae5dde2 100644 --- a/README.md +++ b/README.md @@ -1070,13 +1070,13 @@ Because ArchiveBox is designed to ingest a large volume of URLs with multiple co
      -
    • ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles, mostly dependent on whether you're saving audio & video using SAVE_MEDIA=True and whether you lower MEDIA_MAX_SIZE=750mb.
    • -
  • Disk usage can be reduced by using a compressed/[deduplicated](https://www.ixsystems.com/blog/ixsystems-and-klara-systems-celebrate-valentines-day-with-a-heartfelt-donation-of-fast-dedupe-to-openzfs-and-truenas/) filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like fdupes or rdfind. +
    • ArchiveBox can use anywhere from ~1gb per 1000 Snapshots, to ~50gb per 1000 Snapshots, mostly dependent on whether you're saving audio & video using SAVE_MEDIA=True and whether you lower MEDIA_MAX_SIZE=750mb.
    • +
    • Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like fdupes or rdfind.
    • -
    • Don't store large collections on older filesystems like EXT3/FAT as they may not be able to handle more than 50k directory entries in the `data/archive/` folder. +
    • Don't store large collections on older filesystems like EXT3/FAT as they may not be able to handle more than 50k directory entries in the data/archive/ folder.
    • -
    • Try to keep the `data/index.sqlite3` file on a local drive (not a network mount) or SSD for maximum performance; however, the `data/archive/` folder can be on a network mount or slower HDD.
    • -
    • If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server. +
    • Try to keep the data/index.sqlite3 file on a local drive (not a network mount) or SSD for maximum performance; however, the data/archive/ folder can be on a network mount or slower HDD.
    • +
    • If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set PUID & PGID and disable root_squash on your fileshare server.
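A quick way to diagnose the Docker/NFS/SMB/FUSE permission problems mentioned in the last bullet is to compare the data directory's owner against the configured `PUID`/`PGID`. A small illustrative check, not part of ArchiveBox itself; 911 is the default suggested in the docker-compose examples later in this series:

```python
# Illustrative ownership check for the data/ directory, the usual culprit
# behind permission errors on NFS/SMB/FUSE mounts used with Docker.
import os
from pathlib import Path

PUID, PGID = 911, 911          # defaults mentioned in docker-compose.yml
data_dir = Path("./data")      # assumed example path to the collection

st = data_dir.stat()
if (st.st_uid, st.st_gid) != (PUID, PGID):
    print(f"data/ is owned by {st.st_uid}:{st.st_gid}, expected {PUID}:{PGID}; "
          f"set PUID/PGID to match or chown the mount")
```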
From a1ef5f60350eacdde4caf8fbb8ba9d7e6aee25c2 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Thu, 21 Mar 2024 00:00:14 -0700 Subject: [PATCH 14/25] Update README.md --- README.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 1ae5dde2..12a1335e 100644 --- a/README.md +++ b/README.md @@ -1160,19 +1160,18 @@ ArchiveBox aims to enable more of the internet to be saved from deterioration by Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity. -Whether it's to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early 2010s flash games you love to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears. +Whether it's to resist censorship by saving news articles before they get taken down or edited, or just to save a collection of early 2010s flash games you loved to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears.

    Image from Perma.cc...
    +The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about, just like libraries do. Without the work of archivists saving physical books, manuscripts, and paintings we wouldn't have any knowledge of our ancestors' history. We believe archiving the web is just as important for providing the same benefit to future generations. -The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about. -Because modern websites are complicated and often rely on dynamic content, -ArchiveBox archives the sites in **several different formats** beyond what public archiving services like Archive.org/Archive.is save. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats. +We believe duplication of other people's content is only ethical if it a: A. doesn't deprive the original creators of revenue and B. is responsibly curated. In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this excemption, as they are a non-profit providing a public service, and they respond to DMCA removal requests. +As long as you A. don't try to profit off pirating copyrighted content and B. have processes in place to respond to removal requests, many countries allow you to use software like ArchiveBox to ethically and responsibly archive any web content you can view. That being said, ArchiveBox is not liable for how you choose to operate the software. You must research your own local laws and regulations, and get proper legal counsel if you plan to host a public instance (start by putting your DMCA contact email in FOOTER_INFO and changing your instance's branding using CUSTOM_TEMPLATES_DIR).

From 2220a5350ca0a62513c8e312fec8559468625e69 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Thu, 21 Mar 2024 00:02:08 -0700 Subject: [PATCH 15/25] Update README.md --- README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 12a1335e..ea3a2917 100644 --- a/README.md +++ b/README.md @@ -1169,7 +1169,12 @@ Whether it's to resist censorship by saving news articles before they get taken The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about, just like libraries do. Without the work of archivists saving physical books, manuscripts, and paintings we wouldn't have any knowledge of our ancestors' history. We believe archiving the web is just as important for providing the same benefit to future generations. -We believe duplication of other people's content is only ethical if it a: A. doesn't deprive the original creators of revenue and B. is responsibly curated. In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this excemption, as they are a non-profit providing a public service, and they respond to DMCA removal requests. +We believe duplication of other people's content is only ethical if it: + +- A. doesn't deprive the original creators of revenue and +- B. is responsibly curated. + +In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this excemption, as they are a non-profit providing a public service, and they respond to DMCA removal requests. As long as you A. don't try to profit off pirating copyrighted content and B. have processes in place to respond to removal requests, many countries allow you to use software like ArchiveBox to ethically and responsibly archive any web content you can view. That being said, ArchiveBox is not liable for how you choose to operate the software. You must research your own local laws and regulations, and get proper legal counsel if you plan to host a public instance (start by putting your DMCA contact email in FOOTER_INFO and changing your instance's branding using CUSTOM_TEMPLATES_DIR). From 1dbe08872cd600257c00e209ac64cdbed2559136 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Thu, 21 Mar 2024 00:10:19 -0700 Subject: [PATCH 16/25] Update README.md --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index ea3a2917..24b91110 100644 --- a/README.md +++ b/README.md @@ -1167,14 +1167,14 @@ Whether it's to resist censorship by saving news articles before they get taken Image from Perma.cc...
-The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about, just like libraries do. Without the work of archivists saving physical books, manuscripts, and paintings we wouldn't have any knowledge of our ancestors' history. We believe archiving the web is just as important for providing the same benefit to future generations. +The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about, just like libraries do. Without the work of archivists saving physical books, manuscripts, and paintings we wouldn't have any knowledge of our ancestors' history. I believe archiving the web is just as important for providing the same benefit to future generations. -We believe duplication of other people's content is only ethical if it: +ArchiveBox's stance is that duplication of other people's content is only ethical if it: - A. doesn't deprive the original creators of revenue and -- B. is responsibly curated. +- B. is responsibly curated by an individual/institution. -In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this excemption, as they are a non-profit providing a public service, and they respond to DMCA removal requests. +In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this exemption, as they are a non-profit providing a public service, and they respond to unethical content/DMCA/GDPR removal requests. As long as you A. don't try to profit off pirating copyrighted content and B. have processes in place to respond to removal requests, many countries allow you to use software like ArchiveBox to ethically and responsibly archive any web content you can view. That being said, ArchiveBox is not liable for how you choose to operate the software. You must research your own local laws and regulations, and get proper legal counsel if you plan to host a public instance (start by putting your DMCA contact email in FOOTER_INFO and changing your instance's branding using CUSTOM_TEMPLATES_DIR). From 2c6704b1d099425abf7744e1a8d5b6677006e85a Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Thu, 21 Mar 2024 00:11:57 -0700 Subject: [PATCH 17/25] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 24b91110..7362948b 100644 --- a/README.md +++ b/README.md @@ -1176,7 +1176,7 @@ ArchiveBox's stance is that duplication of other people's content is only ethica In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research.
Archive.org's preservation work is covered under this exemption, as they are a non-profit providing a public service, and they respond to unethical content/DMCA/GDPR removal requests. -As long as you A. don't try to profit off pirating copyrighted content and B. have processes in place to respond to removal requests, many countries allow you to use software like ArchiveBox to ethically and responsibly archive any web content you can view. That being said, ArchiveBox is not liable for how you choose to operate the software. You must research your own local laws and regulations, and get proper legal counsel if you plan to host a public instance (start by putting your DMCA contact email in FOOTER_INFO and changing your instance's branding using CUSTOM_TEMPLATES_DIR). +As long as you A. don't try to profit off pirating copyrighted content and B. have processes in place to respond to removal requests, many countries allow you to use software like ArchiveBox to ethically and responsibly archive any web content you can view. That being said, ArchiveBox is not liable for how you choose to operate the software. You must research your own local laws and regulations, and get proper legal counsel if you plan to host a public instance (start by putting your DMCA/GDPR contact info in FOOTER_INFO and changing your instance's branding using CUSTOM_TEMPLATES_DIR).

From 88f21d0d70dcce27ad944fece540d32d30897386 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Thu, 21 Mar 2024 00:12:31 -0700 Subject: [PATCH 18/25] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 7362948b..68f8bfe5 100644 --- a/README.md +++ b/README.md @@ -1189,7 +1189,7 @@ As long as you A. don't try to profit off pirating copyrighted content and B. ha > **Check out our [community wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community) for a list of web archiving tools and orgs.** -A variety of open and closed-source archiving projects exist, but few provide a nice UI and CLI to manage a large, high-fidelity archive collection over time. +A variety of open and closed-source archiving projects exist, but few provide a nice UI and CLI to manage a large, high-fidelity collection over time.
From ee2809eb4fd53c07a406abef7e9b4ad72c8ebb74 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Thu, 21 Mar 2024 00:27:49 -0700 Subject: [PATCH 19/25] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 68f8bfe5..d93e3989 100644 --- a/README.md +++ b/README.md @@ -1577,7 +1577,7 @@ Extractors take the URL of a page to archive, write their output to the filesyst -- [ArchiveBox.io Homepage](https://archivebox.io) / [Source Code (Github)](https://github.com/ArchiveBox/ArchiveBox) / [Demo Server](https://demo.archivebox.io) +- [ArchiveBox.io Website](https://archivebox.io) / [ArchiveBox Github (Source Code)](https://github.com/ArchiveBox/ArchiveBox) / [ArchiveBox Demo Server](https://demo.archivebox.io) - [Documentation Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki) / [API Reference Docs](https://docs.archivebox.io) / [Changelog](https://github.com/ArchiveBox/ArchiveBox/releases) - [Bug Tracker](https://github.com/ArchiveBox/ArchiveBox/issues) / [Discussions](https://github.com/ArchiveBox/ArchiveBox/discussions) / [Community Chat Forum (Zulip)](https://zulip.archivebox.io) - Find us on social media: [Twitter](https://twitter.com/ArchiveBoxApp), [LinkedIn](https://www.linkedin.com/company/archivebox/), [YouTube](https://www.youtube.com/@ArchiveBoxApp), [SaaSHub](https://www.saashub.com/archivebox), [Alternative.to](https://alternativeto.net/software/archivebox/about/), [Reddit](https://www.reddit.com/r/ArchiveBox/) From 05213794642b726b4b6dedabaa27c96628e2d5c2 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Thu, 21 Mar 2024 00:29:54 -0700 Subject: [PATCH 20/25] Update README.md --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index d93e3989..ea331241 100644 --- a/README.md +++ b/README.md @@ -1578,9 +1578,9 @@ Extractors take the URL of a page to archive, write their output to the filesyst - [ArchiveBox.io Website](https://archivebox.io) / [ArchiveBox Github (Source Code)](https://github.com/ArchiveBox/ArchiveBox) / [ArchiveBox Demo Server](https://demo.archivebox.io) -- [Documentation Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki) / [API Reference Docs](https://docs.archivebox.io) / [Changelog](https://github.com/ArchiveBox/ArchiveBox/releases) -- [Bug Tracker](https://github.com/ArchiveBox/ArchiveBox/issues) / [Discussions](https://github.com/ArchiveBox/ArchiveBox/discussions) / [Community Chat Forum (Zulip)](https://zulip.archivebox.io) -- Find us on social media: [Twitter](https://twitter.com/ArchiveBoxApp), [LinkedIn](https://www.linkedin.com/company/archivebox/), [YouTube](https://www.youtube.com/@ArchiveBoxApp), [SaaSHub](https://www.saashub.com/archivebox), [Alternative.to](https://alternativeto.net/software/archivebox/about/), [Reddit](https://www.reddit.com/r/ArchiveBox/) +- [Documentation (Github Wiki)](https://github.com/ArchiveBox/ArchiveBox/wiki) / [API Reference Docs (ReadTheDocs)](https://docs.archivebox.io) / [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap) / [Changelog](https://github.com/ArchiveBox/ArchiveBox/releases) +- [Bug Tracker (Github Issues)](https://github.com/ArchiveBox/ArchiveBox/issues) / [Discussions (Github Discussions)](https://github.com/ArchiveBox/ArchiveBox/discussions) / [Community Chat Forum (Zulip)](https://zulip.archivebox.io) +- Find us on social media: [Twitter `@ArchiveBoxApp`](https://twitter.com/ArchiveBoxApp), [LinkedIn](https://www.linkedin.com/company/archivebox/), 
[YouTube](https://www.youtube.com/@ArchiveBoxApp), [SaaSHub](https://www.saashub.com/archivebox), [Alternative.to](https://alternativeto.net/software/archivebox/about/), [Reddit](https://www.reddit.com/r/ArchiveBox/) --- From 1d49bee90bcf6a0b04905266f3e7e73306ed6f9c Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Thu, 21 Mar 2024 00:31:48 -0700 Subject: [PATCH 21/25] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ea331241..27a84956 100644 --- a/README.md +++ b/README.md @@ -1599,7 +1599,7 @@ Extractors take the URL of a page to archive, write their output to the filesyst    
-ArchiveBox was started by Nick Sweeting in 2017, and has grown steadily with help from our amazing contributors. +ArchiveBox was started by Nick Sweeting in 2017, and has grown steadily with help from our amazing contributors.
✨ Have spare CPU/disk/bandwidth after all your website-archiving crawls and want to help the world?
Check out our Good Karma Kit...
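Patch 22 below, as part of the Django 4.2 upgrade, converts the admin classes from the pre-4.0 attribute style (`fn.short_description = ...`, `fn.admin_order_field = ...`) to the `@admin.display` / `@admin.action` decorator style. A toy before/after of just that pattern; the `Book` model and its fields are hypothetical placeholders, not from the patch:

```python
# Decorator-style Django admin config, as adopted in the next patch.
from django.contrib import admin
from .models import Book  # hypothetical model for illustration

@admin.register(Book)
class BookAdmin(admin.ModelAdmin):
    list_display = ("title_str",)
    actions = ("mark_reviewed",)

    # Old style (removed in Django >= 4.x migrations):
    #   title_str.short_description = 'Title'
    #   title_str.admin_order_field = 'title'
    @admin.display(description="Title", ordering="title")
    def title_str(self, obj):
        return (obj.title or "").strip()[:64]

    # Old style:  mark_reviewed.short_description = "Mark reviewed"
    @admin.action(description="Mark reviewed")
    def mark_reviewed(self, request, queryset):
        queryset.update(reviewed=True)
```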
From 8b1b01e508bf5827fd8d98a9cd1cdaf028d09a15 Mon Sep 17 00:00:00 2001 From: jim winstead Date: Mon, 25 Mar 2024 17:46:01 -0700 Subject: [PATCH 22/25] Update to Django 4.2.x, now in LTS until April 2026 --- archivebox/core/__init__.py | 1 - archivebox/core/admin.py | 177 ++++++++++++++++++++---------------- archivebox/core/apps.py | 2 - archivebox/core/settings.py | 4 - archivebox/core/urls.py | 4 +- pyproject.toml | 4 +- 6 files changed, 105 insertions(+), 87 deletions(-) diff --git a/archivebox/core/__init__.py b/archivebox/core/__init__.py index 9cd0ce16..ac3ec769 100644 --- a/archivebox/core/__init__.py +++ b/archivebox/core/__init__.py @@ -1,3 +1,2 @@ __package__ = 'archivebox.core' -default_app_config = 'archivebox.core.apps.CoreConfig' diff --git a/archivebox/core/admin.py b/archivebox/core/admin.py index 65baa52b..172a8caf 100644 --- a/archivebox/core/admin.py +++ b/archivebox/core/admin.py @@ -48,6 +48,60 @@ GLOBAL_CONTEXT = {'VERSION': VERSION, 'VERSIONS_AVAILABLE': VERSIONS_AVAILABLE, # TODO: https://stackoverflow.com/questions/40760880/add-custom-button-to-django-admin-panel +class ArchiveBoxAdmin(admin.AdminSite): + site_header = 'ArchiveBox' + index_title = 'Links' + site_title = 'Index' + namespace = 'admin' + + def get_urls(self): + return [ + path('core/snapshot/add/', self.add_view, name='Add'), + ] + super().get_urls() + + def add_view(self, request): + if not request.user.is_authenticated: + return redirect(f'/admin/login/?next={request.path}') + + request.current_app = self.name + context = { + **self.each_context(request), + 'title': 'Add URLs', + } + + if request.method == 'GET': + context['form'] = AddLinkForm() + + elif request.method == 'POST': + form = AddLinkForm(request.POST) + if form.is_valid(): + url = form.cleaned_data["url"] + print(f'[+] Adding URL: {url}') + depth = 0 if form.cleaned_data["depth"] == "0" else 1 + input_kwargs = { + "urls": url, + "depth": depth, + "update_all": False, + "out_dir": OUTPUT_DIR, + } + add_stdout = StringIO() + with redirect_stdout(add_stdout): + add(**input_kwargs) + print(add_stdout.getvalue()) + + context.update({ + "stdout": ansi_to_html(add_stdout.getvalue().strip()), + "form": AddLinkForm() + }) + else: + context["form"] = form + + return render(template_name='add.html', request=request, context=context) + +archivebox_admin = ArchiveBoxAdmin() +archivebox_admin.register(get_user_model()) +archivebox_admin.disable_action('delete_selected') + class ArchiveResultInline(admin.TabularInline): model = ArchiveResult @@ -57,11 +111,11 @@ class TagInline(admin.TabularInline): from django.contrib.admin.helpers import ActionForm from django.contrib.admin.widgets import AutocompleteSelectMultiple -# WIP: broken by Django 3.1.2 -> 4.0 migration class AutocompleteTags: model = Tag search_fields = ['name'] name = 'tags' + remote_field = TagInline class AutocompleteTagsAdminStub: name = 'admin' @@ -71,7 +125,6 @@ class SnapshotActionForm(ActionForm): tags = forms.ModelMultipleChoiceField( queryset=Tag.objects.all(), required=False, - # WIP: broken by Django 3.1.2 -> 4.0 migration widget=AutocompleteSelectMultiple( AutocompleteTags(), AutocompleteTagsAdminStub(), @@ -90,6 +143,7 @@ class SnapshotActionForm(ActionForm): # ) +@admin.register(Snapshot, site=archivebox_admin) class SnapshotAdmin(SearchResultsAdminMixin, admin.ModelAdmin): list_display = ('added', 'title_str', 'files', 'size', 'url_str') sort_fields = ('title_str', 'url_str', 'added', 'files') @@ -176,6 +230,10 @@ class SnapshotAdmin(SearchResultsAdminMixin, 
admin.ModelAdmin): obj.id, ) + @admin.display( + description='Title', + ordering='title', + ) def title_str(self, obj): canon = obj.as_link().canonical_outputs() tags = ''.join( @@ -197,12 +255,17 @@ class SnapshotAdmin(SearchResultsAdminMixin, admin.ModelAdmin): urldecode(htmldecode(obj.latest_title or obj.title or ''))[:128] or 'Pending...' ) + mark_safe(f' {tags}') + @admin.display( + description='Files Saved', + ordering='archiveresult_count', + ) def files(self, obj): return snapshot_icons(obj) - files.admin_order_field = 'archiveresult_count' - files.short_description = 'Files Saved' + @admin.display( + ordering='archiveresult_count' + ) def size(self, obj): archive_size = (Path(obj.link_dir) / 'index.html').exists() and obj.archive_size if archive_size: @@ -217,8 +280,11 @@ class SnapshotAdmin(SearchResultsAdminMixin, admin.ModelAdmin): size_txt, ) - size.admin_order_field = 'archiveresult_count' + @admin.display( + description='Original URL', + ordering='url', + ) def url_str(self, obj): return format_html( '{}', @@ -255,65 +321,76 @@ class SnapshotAdmin(SearchResultsAdminMixin, admin.ModelAdmin): # print('[*] Got request', request.method, request.POST) # return super().changelist_view(request, extra_context=None) + @admin.action( + description="Pull" + ) def update_snapshots(self, request, queryset): archive_links([ snapshot.as_link() for snapshot in queryset ], out_dir=OUTPUT_DIR) - update_snapshots.short_description = "Pull" + @admin.action( + description="⬇️ Title" + ) def update_titles(self, request, queryset): archive_links([ snapshot.as_link() for snapshot in queryset ], overwrite=True, methods=('title','favicon'), out_dir=OUTPUT_DIR) - update_titles.short_description = "⬇️ Title" + @admin.action( + description="Re-Snapshot" + ) def resnapshot_snapshot(self, request, queryset): for snapshot in queryset: timestamp = datetime.now(timezone.utc).isoformat('T', 'seconds') new_url = snapshot.url.split('#')[0] + f'#{timestamp}' add(new_url, tag=snapshot.tags_str()) - resnapshot_snapshot.short_description = "Re-Snapshot" + @admin.action( + description="Reset" + ) def overwrite_snapshots(self, request, queryset): archive_links([ snapshot.as_link() for snapshot in queryset ], overwrite=True, out_dir=OUTPUT_DIR) - overwrite_snapshots.short_description = "Reset" + @admin.action( + description="Delete" + ) def delete_snapshots(self, request, queryset): remove(snapshots=queryset, yes=True, delete=True, out_dir=OUTPUT_DIR) - delete_snapshots.short_description = "Delete" + @admin.action( + description="+" + ) def add_tags(self, request, queryset): tags = request.POST.getlist('tags') print('[+] Adding tags', tags, 'to Snapshots', queryset) for obj in queryset: obj.tags.add(*tags) - add_tags.short_description = "+" + @admin.action( + description="–" + ) def remove_tags(self, request, queryset): tags = request.POST.getlist('tags') print('[-] Removing tags', tags, 'to Snapshots', queryset) for obj in queryset: obj.tags.remove(*tags) - remove_tags.short_description = "–" - title_str.short_description = 'Title' - url_str.short_description = 'Original URL' - - title_str.admin_order_field = 'title' - url_str.admin_order_field = 'url' + +@admin.register(Tag, site=archivebox_admin) class TagAdmin(admin.ModelAdmin): list_display = ('slug', 'name', 'num_snapshots', 'snapshots', 'id') sort_fields = ('id', 'name', 'slug') @@ -344,6 +421,7 @@ class TagAdmin(admin.ModelAdmin): ) + (f'
and {total_count-10} more...' if obj.snapshot_set.count() > 10 else '')) +@admin.register(ArchiveResult, site=archivebox_admin) class ArchiveResultAdmin(admin.ModelAdmin): list_display = ('id', 'start_ts', 'extractor', 'snapshot_str', 'tags_str', 'cmd_str', 'status', 'output_str') sort_fields = ('start_ts', 'extractor', 'status') @@ -356,6 +434,9 @@ class ArchiveResultAdmin(admin.ModelAdmin): ordering = ['-start_ts'] list_per_page = SNAPSHOTS_PER_PAGE + @admin.display( + description='snapshot' + ) def snapshot_str(self, obj): return format_html( '[{}]
' @@ -365,6 +446,9 @@ class ArchiveResultAdmin(admin.ModelAdmin): obj.snapshot.url[:128], ) + @admin.display( + description='tags' + ) def tags_str(self, obj): return obj.snapshot.tags_str() @@ -381,62 +465,3 @@ class ArchiveResultAdmin(admin.ModelAdmin): obj.output if (obj.status == 'succeeded') and obj.extractor not in ('title', 'archive_org') else 'index.html', obj.output, ) - - tags_str.short_description = 'tags' - snapshot_str.short_description = 'snapshot' - -class ArchiveBoxAdmin(admin.AdminSite): - site_header = 'ArchiveBox' - index_title = 'Links' - site_title = 'Index' - - def get_urls(self): - return [ - path('core/snapshot/add/', self.add_view, name='Add'), - ] + super().get_urls() - - def add_view(self, request): - if not request.user.is_authenticated: - return redirect(f'/admin/login/?next={request.path}') - - request.current_app = self.name - context = { - **self.each_context(request), - 'title': 'Add URLs', - } - - if request.method == 'GET': - context['form'] = AddLinkForm() - - elif request.method == 'POST': - form = AddLinkForm(request.POST) - if form.is_valid(): - url = form.cleaned_data["url"] - print(f'[+] Adding URL: {url}') - depth = 0 if form.cleaned_data["depth"] == "0" else 1 - input_kwargs = { - "urls": url, - "depth": depth, - "update_all": False, - "out_dir": OUTPUT_DIR, - } - add_stdout = StringIO() - with redirect_stdout(add_stdout): - add(**input_kwargs) - print(add_stdout.getvalue()) - - context.update({ - "stdout": ansi_to_html(add_stdout.getvalue().strip()), - "form": AddLinkForm() - }) - else: - context["form"] = form - - return render(template_name='add.html', request=request, context=context) - -admin.site = ArchiveBoxAdmin() -admin.site.register(get_user_model()) -admin.site.register(Snapshot, SnapshotAdmin) -admin.site.register(Tag, TagAdmin) -admin.site.register(ArchiveResult, ArchiveResultAdmin) -admin.site.disable_action('delete_selected') diff --git a/archivebox/core/apps.py b/archivebox/core/apps.py index 32088de4..f3e35dbd 100644 --- a/archivebox/core/apps.py +++ b/archivebox/core/apps.py @@ -3,8 +3,6 @@ from django.apps import AppConfig class CoreConfig(AppConfig): name = 'core' - # WIP: broken by Django 3.1.2 -> 4.0 migration - default_auto_field = 'django.db.models.UUIDField' def ready(self): from .auth import register_signals diff --git a/archivebox/core/settings.py b/archivebox/core/settings.py index 06e798ab..9b80c336 100644 --- a/archivebox/core/settings.py +++ b/archivebox/core/settings.py @@ -269,9 +269,6 @@ AUTH_PASSWORD_VALIDATORS = [ {'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator'}, ] -# WIP: broken by Django 3.1.2 -> 4.0 migration -DEFAULT_AUTO_FIELD = 'django.db.models.UUIDField' - ################################################################################ ### Shell Settings ################################################################################ @@ -290,7 +287,6 @@ if IS_SHELL: LANGUAGE_CODE = 'en-us' USE_I18N = True -USE_L10N = True USE_TZ = True DATETIME_FORMAT = 'Y-m-d g:iA' SHORT_DATETIME_FORMAT = 'Y-m-d h:iA' diff --git a/archivebox/core/urls.py b/archivebox/core/urls.py index 1111ead4..ce38af32 100644 --- a/archivebox/core/urls.py +++ b/archivebox/core/urls.py @@ -1,4 +1,4 @@ -from django.contrib import admin +from .admin import archivebox_admin from django.urls import path, include from django.views import static @@ -29,7 +29,7 @@ urlpatterns = [ path('accounts/', include('django.contrib.auth.urls')), - path('admin/', admin.site.urls), + path('admin/', archivebox_admin.urls), 
path('health/', HealthCheckView.as_view(), name='healthcheck'), path('error/', lambda _: 1/0), diff --git a/pyproject.toml b/pyproject.toml index eedea90c..969b6318 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -13,8 +13,8 @@ dependencies = [ # pdm update [--unconstrained] "croniter>=0.3.34", "dateparser>=1.0.0", - "django-extensions>=3.0.3", - "django>=3.1.3,<3.2", + "django-extensions>=3.2.3", + "django>=4.2.0,<5.0", "feedparser>=6.0.11", "ipython>5.0.0", "mypy-extensions>=0.4.3", From a4453b6f8745cbe7c21eceeb3cce05eb4fb71111 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Tue, 26 Mar 2024 14:19:25 -0700 Subject: [PATCH 23/25] fix PERSONAS PERSONAS_DIR typo --- archivebox/config.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/archivebox/config.py b/archivebox/config.py index 8b2f3a7e..a08d73e6 100644 --- a/archivebox/config.py +++ b/archivebox/config.py @@ -1029,10 +1029,10 @@ def get_data_locations(config: ConfigDict) -> ConfigValue: 'enabled': True, 'is_valid': config['LOGS_DIR'].exists(), }, - 'PERSONAS': { - 'path': config['PERSONAS'].resolve(), + 'PERSONAS_DIR': { + 'path': config['PERSONAS_DIR'].resolve(), 'enabled': True, - 'is_valid': config['PERSONAS'].exists(), + 'is_valid': config['PERSONAS_DIR'].exists(), }, 'ARCHIVE_DIR': { 'path': config['ARCHIVE_DIR'].resolve(), From ac73fb51297a49f3f6087796472832f9563c0cbe Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Tue, 26 Mar 2024 15:22:40 -0700 Subject: [PATCH 24/25] merge fixes --- Dockerfile | 13 +++++---- README.md | 2 +- archivebox/index/__init__.py | 2 +- docker-compose.yml | 54 +++++++++++++++++------------------- package.json | 2 +- pyproject.toml | 8 +++--- 6 files changed, 40 insertions(+), 41 deletions(-) diff --git a/Dockerfile b/Dockerfile index 82647329..fbb56a78 100644 --- a/Dockerfile +++ b/Dockerfile @@ -10,7 +10,7 @@ # docker run -v "$PWD/data":/data -p 8000:8000 archivebox server # Multi-arch build: # docker buildx create --use -# docker buildx build . --platform=linux/amd64,linux/arm64,linux/arm/v7 --push -t archivebox/archivebox:latest -t archivebox/archivebox:dev +# docker buildx build . --platform=linux/amd64,linux/arm64 --push -t archivebox/archivebox:latest -t archivebox/archivebox:dev # # Read more about [developing Archivebox](https://github.com/ArchiveBox/ArchiveBox#archivebox-development). @@ -194,10 +194,12 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked,id=apt-$TARGETARCH$T && playwright install --with-deps chromium \ && export CHROME_BINARY="$(python -c 'from playwright.sync_api import sync_playwright; print(sync_playwright().start().chromium.executable_path)')"; \ else \ # fall back to installing Chromium via apt-get on platforms not supported by playwright (e.g. risc, ARMv7, etc.)
+ # apt-get install -qq -y -t bookworm-backports --no-install-recommends \ # chromium \ # && export CHROME_BINARY="$(which chromium)"; \ + echo 'armv7 no longer supported in versions after v0.7.3' && \ + exit 1; \ fi \ && rm -rf /var/lib/apt/lists/* \ && ln -s "$CHROME_BINARY" /usr/bin/chromium-browser \ @@ -275,7 +277,6 @@ ENV IN_DOCKER=True \ GOOGLE_DEFAULT_CLIENT_SECRET=no \ ALLOWED_HOSTS=* ## No need to set explicitly, these values will be autodetected by archivebox in docker: - # CHROME_SANDBOX=False \ # WGET_BINARY="wget" \ # YOUTUBEDL_BINARY="yt-dlp" \ # CHROME_BINARY="/usr/bin/chromium-browser" \ diff --git a/README.md b/README.md index 27a84956..4d1bcf0d 100644 --- a/README.md +++ b/README.md @@ -1076,7 +1076,7 @@ Because ArchiveBox is designed to ingest a large volume of URLs with multiple co
  • Don't store large collections on older filesystems like EXT3/FAT as they may not be able to handle more than 50k directory entries in the data/archive/ folder.
  • Try to keep the data/index.sqlite3 file on a local drive (not a network mount) or SSD for maximum performance; however, the data/archive/ folder can be on a network mount or slower HDD.
  • -
  • If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set PUID & PGID and disable root_squash on your fileshare server. +
  • If using Docker or NFS/SMB/FUSE for the data/archive/ folder, you may need to set PUID & PGID and disable root_squash on your fileshare server.
  • diff --git a/archivebox/index/__init__.py b/archivebox/index/__init__.py index 9912b4c7..fb3688f3 100644 --- a/archivebox/index/__init__.py +++ b/archivebox/index/__init__.py @@ -250,7 +250,7 @@ def load_main_index(out_dir: Path=OUTPUT_DIR, warn: bool=True) -> List[Link]: """parse and load existing index with any new links from import_path merged in""" from core.models import Snapshot try: - return Snapshot.objects.all() + return Snapshot.objects.all().only('id') except (KeyboardInterrupt, SystemExit): raise SystemExit(0) diff --git a/docker-compose.yml b/docker-compose.yml index ea3d3ab7..a8293705 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -8,32 +8,26 @@ # Documentation: # https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#docker-compose -version: '3.9' services: archivebox: - #image: ${DOCKER_IMAGE:-archivebox/archivebox:dev} - image: archivebox/archivebox:dev - command: server --quick-init 0.0.0.0:8000 + image: archivebox/archivebox ports: - 8000:8000 volumes: - ./data:/data - # - ./etc/crontabs:/var/spool/cron/crontabs # uncomment this and archivebox_scheduler below to set up automatic recurring archive jobs - # - ./archivebox:/app/archivebox # uncomment this to mount the ArchiveBox source code at runtime (for developers working on archivebox) - # build: . # uncomment this to build the image from source code at buildtime (for developers working on archivebox) environment: - ALLOWED_HOSTS=* # restrict this to only accept incoming traffic via specific domain name - # - PUBLIC_INDEX=True # set to False to prevent anonymous users from viewing snapshot list - # - PUBLIC_SNAPSHOTS=True # set to False to prevent anonymous users from viewing snapshot content - # - PUBLIC_ADD_VIEW=False # set to True to allow anonymous users to submit new URLs to archive # - ADMIN_USERNAME=admin # create an admin user on first run with the given user/pass combo # - ADMIN_PASSWORD=SomeSecretPassword # - PUID=911 # set to your host user's UID & GID if you encounter permissions issues # - PGID=911 - # - SEARCH_BACKEND_ENGINE=sonic # uncomment these and sonic container below for better full-text search - # - SEARCH_BACKEND_HOST_NAME=sonic - # - SEARCH_BACKEND_PASSWORD=SomeSecretPassword + # - PUBLIC_INDEX=True # set to False to prevent anonymous users from viewing snapshot list + # - PUBLIC_SNAPSHOTS=True # set to False to prevent anonymous users from viewing snapshot content + # - PUBLIC_ADD_VIEW=False # set to True to allow anonymous users to submit new URLs to archive + - SEARCH_BACKEND_ENGINE=sonic # uncomment these and sonic container below for better full-text search + - SEARCH_BACKEND_HOST_NAME=sonic + - SEARCH_BACKEND_PASSWORD=SomeSecretPassword # - MEDIA_MAX_SIZE=750m # increase this filesize limit to allow archiving larger audio/video files # - TIMEOUT=60 # increase this number to 120+ seconds if you see many slow downloads timing out # - CHECK_SSL_VALIDITY=True # set to False to disable strict SSL checking (allows saving URLs w/ broken certs) @@ -42,7 +36,7 @@ services: # add further configuration options from archivebox/config.py as needed (to apply them only to this container) # or set using `docker compose run archivebox config --set SOME_KEY=someval` (to persist config across all containers) - # For ad-blocking during archiving, uncomment this section and pihole service section below + # For ad-blocking during archiving, uncomment this section and pihole service section below # networks: # - dns # dns: @@ -51,22 +45,26 @@ services: ######## Optional Addons: tweak examples below 
as needed for your specific use case ######## - ### Example: To run the Sonic full-text search backend, first download the config file to sonic.cfg - # $ curl -O https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/etc/sonic.cfg - # After starting, backfill any existing Snapshots into the full-text index: + ### Runs the Sonic full-text search backend, config file is auto-downloaded into sonic.cfg: + # After starting, backfill any existing Snapshots into the full-text index: # $ docker-compose run archivebox update --index-only - # sonic: - # image: valeriansaliou/sonic:latest - # expose: - # - 1491 - # environment: - # - SEARCH_BACKEND_PASSWORD=SomeSecretPassword - # volumes: - # - ./sonic.cfg:/etc/sonic.cfg:ro - # - ./data/sonic:/var/lib/sonic/store - - + sonic: + image: valeriansaliou/sonic + build: + dockerfile_inline: | + FROM quay.io/curl/curl:latest AS setup + RUN curl -fsSL 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/main/etc/sonic.cfg' > /tmp/sonic.cfg + FROM valeriansaliou/sonic:latest + COPY --from=setup /tmp/sonic.cfg /etc/sonic.cfg + expose: + - 1491 + environment: + - SEARCH_BACKEND_PASSWORD=SomeSecretPassword + volumes: + - ./etc/sonic.cfg:/etc/sonic.cfg + - ./data/sonic:/var/lib/sonic/store ### Example: To run pihole in order to block ad/tracker requests during archiving, # uncomment this block and set up pihole using its admin interface diff --git a/package.json b/package.json index 1377ef99..3c42a8b9 100644 --- a/package.json +++ b/package.json @@ -8,6 +8,6 @@ "dependencies": { "@postlight/parser": "^2.2.3", "readability-extractor": "github:ArchiveBox/readability-extractor", - "single-file-cli": "^1.1.46" + "single-file-cli": "^1.1.54" } } diff --git a/pyproject.toml b/pyproject.toml index 969b6318..98a1a055 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -15,15 +15,16 @@ dependencies = [ "dateparser>=1.0.0", "django-extensions>=3.2.3", "django>=4.2.0,<5.0", + "setuptools>=69.0.3", "feedparser>=6.0.11", "ipython>5.0.0", "mypy-extensions>=0.4.3", "python-crontab>=2.5.1", "requests>=2.24.0", "w3lib>=1.22.0", - "yt-dlp>=2023.10.13", + "yt-dlp>=2024.3.10", # don't add playwright because packages without sdists cause trouble on many build systems that refuse to install wheel-only packages - # "playwright>=1.39.0; platform_machine != 'armv7l'", + "playwright>=1.39.0; platform_machine != 'armv7l'", ] classifiers = [ @@ -64,11 +65,11 @@ classifiers = [ sonic = [ # echo "deb [signed-by=/usr/share/keyrings/valeriansaliou_sonic.gpg] https://packagecloud.io/valeriansaliou/sonic/debian/ bookworm main" > /etc/apt/sources.list.d/valeriansaliou_sonic.list # curl -fsSL https://packagecloud.io/valeriansaliou/sonic/gpgkey | gpg --dearmor -o /usr/share/keyrings/valeriansaliou_sonic.gpg + # apt install sonic "sonic-client>=0.0.5", ] ldap = [ # apt install libldap2-dev libsasl2-dev python3-ldap - "setuptools>=69.0.3", "python-ldap>=3.4.3", "django-auth-ldap>=4.1.0", ] @@ -83,7 +84,6 @@ ldap = [ [tool.pdm.dev-dependencies] dev = [ # building - "setuptools>=69.0.3", "wheel", "pdm", "homebrew-pypi-poet>=0.10.0", From e48159b8a0011d934facc38cb71ae6e738980da9 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Tue, 26 Mar 2024 15:23:43 -0700 Subject: [PATCH 25/25] cleanup docker-compose by storing crontabs in data dir --- archivebox/config.py | 1 + bin/docker_entrypoint.sh | 11 ++++ docker-compose.yml | 119 +++++++++++++++++++-------------------- 3 files changed, 71 insertions(+), 60 deletions(-) diff --git a/archivebox/config.py b/archivebox/config.py index
From e48159b8a0011d934facc38cb71ae6e738980da9 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Tue, 26 Mar 2024 15:23:43 -0700
Subject: [PATCH 25/25] cleanup docker-compose by storing crontabs in data dir

---
 archivebox/config.py     |   1 +
 bin/docker_entrypoint.sh |  11 ++++
 docker-compose.yml       | 119 +++++++++++++++++++--------------------
 3 files changed, 71 insertions(+), 60 deletions(-)

diff --git a/archivebox/config.py b/archivebox/config.py
index a08d73e6..1a75229c 100644
--- a/archivebox/config.py
+++ b/archivebox/config.py
@@ -355,6 +355,7 @@ ALLOWED_IN_OUTPUT_DIR = {
     'static',
     'sonic',
     'search.sqlite3',
+    'crontabs',
     ARCHIVE_DIR_NAME,
     SOURCES_DIR_NAME,
     LOGS_DIR_NAME,
diff --git a/bin/docker_entrypoint.sh b/bin/docker_entrypoint.sh
index 74e7a3a9..4996b3d6 100755
--- a/bin/docker_entrypoint.sh
+++ b/bin/docker_entrypoint.sh
@@ -163,6 +163,17 @@ else
     fi
 fi
 
+# symlink etc crontabs into place
+mkdir -p "$DATA_DIR/crontabs"
+if ! test -L /var/spool/cron/crontabs; then
+    # copy files from old location into new data dir location
+    for file in $(ls /var/spool/cron/crontabs); do
+        cp /var/spool/cron/crontabs/"$file" "$DATA_DIR/crontabs"
+    done
+    # replace old system path with symlink to data dir location
+    rm -Rf /var/spool/cron/crontabs
+    ln -s "$DATA_DIR/crontabs" /var/spool/cron/crontabs
+fi
 
 # set DBUS_SYSTEM_BUS_ADDRESS & DBUS_SESSION_BUS_ADDRESS
 # (dbus is not actually needed, it makes chrome log fewer warnings but isn't worth making our docker images bigger)
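The entrypoint addition above is a one-time migration: any crontabs still in the old system location are copied into the data volume, and the system path is then replaced with a symlink so schedules survive container rebuilds. A rough Python equivalent of the same logic, for illustration only (the real implementation is the shell above; /data stands in for $DATA_DIR):

    import shutil
    from pathlib import Path

    DATA_DIR = Path('/data')                            # assumed data volume mount
    SYSTEM_CRONTABS = Path('/var/spool/cron/crontabs')  # old system location

    (DATA_DIR / 'crontabs').mkdir(parents=True, exist_ok=True)
    if SYSTEM_CRONTABS.is_dir() and not SYSTEM_CRONTABS.is_symlink():
        # copy any existing crontabs out of the old location first
        for crontab in SYSTEM_CRONTABS.iterdir():
            shutil.copy(crontab, DATA_DIR / 'crontabs')
        # then swap the system path for a symlink into the data dir
        shutil.rmtree(SYSTEM_CRONTABS)
        SYSTEM_CRONTABS.symlink_to(DATA_DIR / 'crontabs')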
diff --git a/docker-compose.yml b/docker-compose.yml
index a8293705..bfcb4f1e 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -11,23 +11,23 @@
 
 services:
     archivebox:
-        image: archivebox/archivebox
+        image: archivebox/archivebox:latest
         ports:
             - 8000:8000
         volumes:
             - ./data:/data
         environment:
-            - ALLOWED_HOSTS=*                   # restrict this to only accept incoming traffic via specific domain name
             # - ADMIN_USERNAME=admin            # create an admin user on first run with the given user/pass combo
             # - ADMIN_PASSWORD=SomeSecretPassword
-            # - PUID=911                        # set to your host user's UID & GID if you encounter permissions issues
-            # - PGID=911
-            # - PUBLIC_INDEX=True               # set to False to prevent anonymous users from viewing snapshot list
-            # - PUBLIC_SNAPSHOTS=True           # set to False to prevent anonymous users from viewing snapshot content
-            # - PUBLIC_ADD_VIEW=False           # set to True to allow anonymous users to submit new URLs to archive
-            - SEARCH_BACKEND_ENGINE=sonic       # uncomment these and sonic container below for better full-text search
+            - ALLOWED_HOSTS=*                   # restrict this to only accept incoming traffic via specific domain name
+            - PUBLIC_INDEX=True                 # set to False to prevent anonymous users from viewing snapshot list
+            - PUBLIC_SNAPSHOTS=True             # set to False to prevent anonymous users from viewing snapshot content
+            - PUBLIC_ADD_VIEW=False             # set to True to allow anonymous users to submit new URLs to archive
+            - SEARCH_BACKEND_ENGINE=sonic       # uncomment these and sonic container below for better full-text search
             - SEARCH_BACKEND_HOST_NAME=sonic
             - SEARCH_BACKEND_PASSWORD=SomeSecretPassword
+            # - PUID=911                        # set to your host user's UID & GID if you encounter permissions issues
+            # - PGID=911
             # - MEDIA_MAX_SIZE=750m             # increase this filesize limit to allow archiving larger audio/video files
             # - TIMEOUT=60                      # increase this number to 120+ seconds if you see many slow downloads timing out
             # - CHECK_SSL_VALIDITY=True         # set to False to disable strict SSL checking (allows saving URLs w/ broken certs)
@@ -45,13 +45,35 @@ services:
 
     ######## Optional Addons: tweak examples below as needed for your specific use case ########
 
+    ### Runs regularly scheduled archiving tasks, e.g. to re-archive a feed every day:
+    #   $ docker compose run archivebox schedule --every=day --depth=1 'https://example.com/some/rss/feed.xml'
+    #   then restart the scheduler container to apply any changes to the schedule:
+    #   $ docker compose restart archivebox_scheduler
+
+    archivebox_scheduler:
+        image: archivebox/archivebox:latest
+        command: schedule --foreground
+        environment:
+            - TIMEOUT=120                       # increase if you see timeouts often during archiving / on slow networks
+            - ONLY_NEW=True                     # set to False to retry previously failed URLs when re-adding instead of skipping them
+            # - PUID=502                        # set to your host user's UID & GID if you encounter permissions issues
+            # - PGID=20
+        volumes:
+            - ./data:/data
+        # cpus: 2                               # uncomment / edit these values to limit container resource consumption
+        # mem_limit: 2048m
+        # shm_size: 1024m
+
+
     ### Runs the Sonic full-text search backend; its config file is auto-downloaded into sonic.cfg:
     # After starting, backfill any existing Snapshots into the full-text index:
     #   $ docker-compose run archivebox update --index-only
 
     sonic:
-        image: valeriansaliou/sonic
+        image: valeriansaliou/sonic:latest
         build:
+            # custom build just auto-downloads archivebox's default sonic.cfg as a convenience,
+            # not needed if you already have /etc/sonic.cfg
             dockerfile_inline: |
                 FROM quay.io/curl/curl:latest AS setup
                 RUN curl -fsSL 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/main/etc/sonic.cfg' > /tmp/sonic.cfg
                 FROM valeriansaliou/sonic:latest
                 COPY --from=setup /tmp/sonic.cfg /etc/sonic.cfg
@@ -65,6 +87,34 @@ services:
             - ./etc/sonic.cfg:/etc/sonic.cfg
             - ./data/sonic:/var/lib/sonic/store
 
+
+    ### Example: Watch the ArchiveBox browser in realtime as it archives things,
+    #   or remote control it to set up logins and credentials for sites you want to archive.
+    #   https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
+
+    novnc:
+        image: theasp/novnc:latest
+        environment:
+            - DISPLAY_WIDTH=1920
+            - DISPLAY_HEIGHT=1080
+            - RUN_XTERM=no
+        ports:
+            # to view/control ArchiveBox's browser, visit: http://localhost:8080/vnc.html
+            - "8080:8080"
+
+
+    ### Example: Put Nginx in front of the ArchiveBox server for SSL termination
+
+    # nginx:
+    #     image: nginx:alpine
+    #     ports:
+    #         - 443:443
+    #         - 80:80
+    #     volumes:
+    #         - ./etc/nginx.conf:/etc/nginx/nginx.conf
+    #         - ./data:/var/www
+
+
     ### Example: To run pihole to block ad/tracker requests during archiving,
     #   uncomment this block and set up pihole using its admin interface
 
@@ -86,57 +136,6 @@ services:
         #     - ./etc/dnsmasq:/etc/dnsmasq.d
 
 
-    ### Example: Enable ability to run regularly scheduled archiving tasks by uncommenting this container
-    #   $ docker compose run archivebox schedule --every=day --depth=1 'https://example.com/some/rss/feed.xml'
-    #   then restart the scheduler container to apply the changes to the schedule
-    #   $ docker compose restart archivebox_scheduler
-
-    # archivebox_scheduler:
-    #     image: ${DOCKER_IMAGE:-archivebox/archivebox:dev}
-    #     command: schedule --foreground
-    #     environment:
-    #         - MEDIA_MAX_SIZE=750m             # increase this number to allow archiving larger audio/video files
-    #         # - TIMEOUT=60                    # increase if you see timeouts often during archiving / on slow networks
-    #         # - ONLY_NEW=True                 # set to False to retry previously failed URLs when re-adding instead of skipping them
-    #         # - CHECK_SSL_VALIDITY=True       # set to False to allow saving URLs w/ broken SSL certs
-    #         # - SAVE_ARCHIVE_DOT_ORG=True     # set to False to disable submitting URLs to Archive.org when archiving
-    #         # - PUID=502                      # set to your host user's UID & GID if you encounter permissions issues
-    #         # - PGID=20
-    #     volumes:
-    #         - ./data:/data
-    #         - ./etc/crontabs:/var/spool/cron/crontabs
-    #     # cpus: 2                             # uncomment / edit these values to limit container resource consumption
-    #     # mem_limit: 2048m
-    #     # shm_size: 1024m
-
-
-    ### Example: Put Nginx in front of the ArchiveBox server for SSL termination
-
-    # nginx:
-    #     image: nginx:alpine
-    #     ports:
-    #         - 443:443
-    #         - 80:80
-    #     volumes:
-    #         - ./etc/nginx.conf:/etc/nginx/nginx.conf
-    #         - ./data:/var/www
-
-
-    ### Example: Watch the ArchiveBox browser in realtime as it archives things,
-    #   or remote control it to set up logins and credentials for sites you want to archive.
-    #   https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
-
-    novnc:
-        image: theasp/novnc:latest
-        environment:
-            - DISPLAY_WIDTH=1920
-            - DISPLAY_HEIGHT=1080
-            - RUN_XTERM=no
-        ports:
-            # to view/control ArchiveBox's browser, visit: http://localhost:8080/vnc.html
-            - "8080:8080"
 
 
     ### Example: run all your ArchiveBox traffic through a WireGuard VPN tunnel
 
     # wireguard: