From ab225104c5ca7e06d796ed5f0657fe698978e3d9 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Fri, 23 Feb 2024 12:54:56 -0800
Subject: [PATCH 01/21] Update README.md

---
 README.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index a961cb47..4d105d02 100644
--- a/README.md
+++ b/README.md
@@ -141,21 +141,23 @@ curl -fsSL 'https://get.archivebox.io' | sh
 ArchiveBox is free for everyone to self-host, but we also provide support, security review, and custom integrations to help NGOs, governments, and other organizations [run ArchiveBox professionally](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102):
 
-- 🗞️ **Journalists:**
+- **Journalists:**
   `crawling during research`, `preserving cited pages`, `fact-checking & review`
-- ⚖️ **Lawyers:**
+- **Lawyers:**
   `collecting & preserving evidence`, `detecting changes`, `tagging & review`
-- 🔬 **Researchers:**
+- **Researchers:**
   `analyzing social media trends`, `getting LLM training data`, `crawling pipelines`
-- 👩🏽 **Individuals:**
+- **Individuals:**
   `saving bookmarks`, `preserving portfolio content`, `legacy / memoirs archival`
+- **Governments:**
+  `snapshoting public records / govt sites`, `recordkeeping compliance`, `libraries`
 
-> ***[Contact our team](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your institution/org wants to use ArchiveBox professionally. We offer services such as:*
+> ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally. We offer services such as:*
 >
 > - setup & support, hosting, custom features, security, hashing & audit logging for chain-of-custody, etc.
 > - for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more...
 
-*We are a 🏛️ 501(c)(3) nonprofit and all our work goes towards supporting open-source development.*
+*ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes towards supporting open-source development.*

From c7cdc2fc27d39f2d8dbb16b21bae6139c8d14deb Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Fri, 23 Feb 2024 12:55:36 -0800
Subject: [PATCH 02/21] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4d105d02..174f1a39 100644
--- a/README.md
+++ b/README.md
@@ -150,7 +150,7 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur
 - **Individuals:**
   `saving bookmarks`, `preserving portfolio content`, `legacy / memoirs archival`
 - **Governments:**
-  `snapshoting public records / govt sites`, `recordkeeping compliance`, `libraries`
+  `snapshoting public service sites`, `recordkeeping compliance`, `libraries`
 
 > ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally. We offer services such as:*
 >

From a00b34cc13c5f2e31a2bf7009be5a9158cd4e7a2 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Fri, 23 Feb 2024 12:58:28 -0800
Subject: [PATCH 03/21] Update README.md

---
 README.md | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 174f1a39..e0162847 100644
--- a/README.md
+++ b/README.md
@@ -149,15 +149,12 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur
   `analyzing social media trends`, `getting LLM training data`, `crawling pipelines`
 - **Individuals:**
   `saving bookmarks`, `preserving portfolio content`, `legacy / memoirs archival`
-- **Governments:**
-  `snapshoting public service sites`, `recordkeeping compliance`, `libraries`
+- **Governments:**
+  `snapshoting public service sites`, `recordkeeping compliance`
 
-> ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally. We offer services such as:*
->
-> - setup & support, hosting, custom features, security, hashing & audit logging for chain-of-custody, etc.
-> - for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more...
-
-*ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes towards supporting open-source development.*
+> ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally.*
+> We offer: setup & support, hosting, custom features, security, hashing & audit logging for chain-of-custody, etc.
+> *ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes towards supporting open-source development.*

From 597f1a39e06ef667401d84f23fc7b8ba2fd277a5 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Fri, 23 Feb 2024 23:20:48 -0800
Subject: [PATCH 04/21] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index e0162847..74f03a66 100644
--- a/README.md
+++ b/README.md
@@ -153,8 +153,8 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur
   `snapshoting public service sites`, `recordkeeping compliance`
 
 > ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally.*
-> We offer: setup & support, hosting, custom features, security, hashing & audit logging for chain-of-custody, etc.
-> *ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes towards supporting open-source development.*
+> We offer: setup & support, hosting, custom features, security, hashing & audit logging/chain-of-custody, etc.
+> *ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes supports open-source development.*

From f02b27920c41a9a1182da4d1871f7ba693c20c3a Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Fri, 23 Feb 2024 23:21:23 -0800
Subject: [PATCH 05/21] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 74f03a66..f6663013 100644
--- a/README.md
+++ b/README.md
@@ -154,7 +154,7 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur
 
 > ***[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your org wants help using ArchiveBox professionally.*
 > We offer: setup & support, hosting, custom features, security, hashing & audit logging/chain-of-custody, etc.
-> *ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work goes supports open-source development.*
+> *ArchiveBox has 🏛️ 501(c)(3) [nonprofit status](https://hackclub.com/hcb/) and all our work supports open-source development.*

From a729480b753a10cd3e97884ff0804eebd0d9cd8b Mon Sep 17 00:00:00 2001
From: Naomi Phillips
Date: Sun, 3 Mar 2024 02:32:46 -0500
Subject: [PATCH 06/21] Add COOKIES_FILE support for singlefile extractor

---
 archivebox/extractors/singlefile.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/archivebox/extractors/singlefile.py b/archivebox/extractors/singlefile.py
index e3860527..377e4a0e 100644
--- a/archivebox/extractors/singlefile.py
+++ b/archivebox/extractors/singlefile.py
@@ -19,6 +19,7 @@ from ..config import (
     SINGLEFILE_VERSION,
     SINGLEFILE_ARGS,
     CHROME_BINARY,
+    COOKIES_FILE,
 )
 from ..logging_util import TimedProgress
 
@@ -48,6 +49,7 @@ def save_singlefile(link: Link, out_dir: Optional[Path]=None, timeout: int=TIMEO
     browser_args = '--browser-args={}'.format(json.dumps(browser_args[1:]))
     options = [
         *SINGLEFILE_ARGS,
+        *(["--browser-cookies-file={}".format(COOKIES_FILE)] if COOKIES_FILE else []),
         '--browser-executable-path={}'.format(CHROME_BINARY),
         browser_args,
     ]

From 86c3e271adeec95a94758a54e81a409f0a1e55ef Mon Sep 17 00:00:00 2001
From: Ricky de Laveaga
Date: Thu, 7 Mar 2024 09:45:41 -0800
Subject: [PATCH 07/21] Update README.md Browser Extension link

Point to GH repo with all browsers, not Chrome Webstore
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index f6663013..6c17b7f5 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ Without active preservation effort, everything on the internet eventually dissap
 
-📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from your bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, our [Browser Extension](https://chromewebstore.google.com/detail/archivebox-exporter/habonpimjphpdnmcfkaockjnffodikoj), and more.
+📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from your bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, our [Browser Extension](https://github.com/ArchiveBox/archivebox-browser-extension), and more.
 See Input Formats for a full list of supported input formats...

From e00845f58c917e2129de8b2be66ba9151849d9b6 Mon Sep 17 00:00:00 2001
From: Nicholas Hebert <68243838+n-hebert@users.noreply.github.com>
Date: Tue, 19 Mar 2024 11:13:47 -0300
Subject: [PATCH 08/21] Revise md section not formatting properly in html

---
 README.md | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 6c17b7f5..43f0080c 100644
--- a/README.md
+++ b/README.md
@@ -1060,7 +1060,6 @@ Improved support for saving multiple snapshots of a single URL without this hash
-
 ### Storage Requirements
 
 Because ArchiveBox is designed to ingest a large volume of URLs with multiple copies of each URL stored by different 3rd-party tools, it can be quite disk-space intensive. There are also some special requirements when using filesystems like NFS/SMB/FUSE.
@@ -1070,17 +1069,16 @@ Because ArchiveBox is designed to ingest a large volume of URLs with multiple co
 Click to learn more about ArchiveBox's filesystem and hosting requirements...
-
-**ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles**, mostly dependent on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower `MEDIA_MAX_SIZE=750mb`.
-
-Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractors methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind).
-
-**Don't store large collections on older filesystems like EXT3/FAT** as they may not be able to handle more than 50k directory entries in the `data/archive/` folder.
-
-**Try to keep the `data/index.sqlite3` file on local drive (not a network mount)** or SSD for maximum performance, however the `data/archive/` folder can be on a network mount or slower HDD.
-
-If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server.
-
+
    +
  • ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles, mostly dependent on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower `MEDIA_MAX_SIZE=750mb`.
  • +
  • Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractors methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind). +
  • +
  • Don't store large collections on older filesystems like EXT3/FAT as they may not be able to handle more than 50k directory entries in the `data/archive/` folder. +
  • +
  • Try to keep the `data/index.sqlite3` file on local drive (not a network mount) or SSD for maximum performance, however the `data/archive/` folder can be on a network mount or slower HDD.
  • +
  • If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server. +
  • +

Learn More

From 37c9a33c8b7d7b9d57696ff24008c24aa5ce5658 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Wed, 20 Mar 2024 23:19:23 -0700
Subject: [PATCH 09/21] Update README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 43f0080c..e3fe581b 100644
--- a/README.md
+++ b/README.md
@@ -497,7 +497,7 @@ docker run -it -v $PWD:/data archivebox/archivebox help
-curl sh automatic setup script CLI Usage Examples (non-Docker)
+curl sh automatic setup script CLI Usage Examples: non-Docker

 # make sure you have pip-installed ArchiveBox and it's available in your $PATH first  
@@ -514,7 +514,7 @@ archivebox add --depth=1 'https://news.ycombinator.com'
 
-Docker Docker Compose CLI Usage Examples
+Docker CLI Usage Examples: Docker Compose

 # make sure you have `docker-compose.yml` from the Quickstart instructions first
@@ -532,7 +532,7 @@ docker compose run archivebox add --depth=1 'https://news.ycombinator.com'
 
-Docker Docker CLI Usage Examples
+Docker CLI Usage Examples: Docker

 # make sure you create and cd into in a new empty directory first  

From d32413d74b2cd7b6dff1504851fb2098ef23758a Mon Sep 17 00:00:00 2001
From: Nick Sweeting 
Date: Wed, 20 Mar 2024 23:23:26 -0700
Subject: [PATCH 10/21] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index e3fe581b..e2c15f34 100644
--- a/README.md
+++ b/README.md
@@ -654,13 +654,13 @@ docker run -it -v $PWD:/data archivebox/archivebox add --depth=1 'https://exampl
   ArchiveBox supports injesting URLs in [any text-based format](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file).
 
 -  From manually exported [browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) (in Netscape format)  
-  See instructions for: Chrome, Firefox, Safari, IE, Opera, and more...
+  Instructions: Chrome, Firefox, Safari, IE, Opera, and more...
 
 -  From URLs visited through a [MITM Proxy](https://mitmproxy.org/) with [`archivebox-proxy`](https://github.com/ArchiveBox/archivebox-proxy)  
   Provides [realtime archiving](https://github.com/ArchiveBox/ArchiveBox/issues/577) of all traffic from any device going through the proxy.
 
 -  From bookmarking services or social media (e.g. Twitter bookmarks, Reddit saved posts, etc.)  
-  See instructions for: Pocket, Pinboard, Instapaper, Shaarli, Delicious, Reddit Saved, Wallabag, Unmark.it, OneTab, Firefox Sync, and more...
+  Instructions: Pocket, Pinboard, Instapaper, Shaarli, Delicious, Reddit Saved, Wallabag, Unmark.it, OneTab, Firefox Sync, and more...
 
 
 

From d9beebdee71be5d7bcc9cab16fee3df594dfd2d2 Mon Sep 17 00:00:00 2001
From: Nick Sweeting 
Date: Wed, 20 Mar 2024 23:25:06 -0700
Subject: [PATCH 11/21] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e2c15f34..50081c38 100644
--- a/README.md
+++ b/README.md
@@ -1017,7 +1017,7 @@ For various reasons, many large sites (Reddit, Twitter, Cloudflare, etc.) active
 
 

From 67baea172edf598c1a218d633e267d0f315365b0 Mon Sep 17 00:00:00 2001
From: Nick Sweeting 
Date: Wed, 20 Mar 2024 23:28:02 -0700
Subject: [PATCH 12/21] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 50081c38..9d705655 100644
--- a/README.md
+++ b/README.md
@@ -1070,8 +1070,8 @@ Because ArchiveBox is designed to ingest a large volume of URLs with multiple co
 
    -
  • ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles, mostly dependent on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower `MEDIA_MAX_SIZE=750mb`.
  • -
  • Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractors methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind). +
  • ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles, mostly dependent on whether you're saving audio & video using SAVE_MEDIA=True and whether you lower MEDIA_MAX_SIZE=750mb.
  • +
  • Disk usage can be reduced by using a compressed/[deduplicated](https://www.ixsystems.com/blog/ixsystems-and-klara-systems-celebrate-valentines-day-with-a-heartfelt-donation-of-fast-dedupe-to-openzfs-and-truenas/) filesystem like ZFS/BTRFS, or by turning off extractors methods you don't need. You can also deduplicate content with a tool like fdupes or rdfind.
  • Don't store large collections on older filesystems like EXT3/FAT as they may not be able to handle more than 50k directory entries in the `data/archive/` folder.
  • From 28e85e0b95cc5948663762d8b1922968d8c9e1f0 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Wed, 20 Mar 2024 23:31:04 -0700 Subject: [PATCH 13/21] Update README.md --- README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 9d705655..1ae5dde2 100644 --- a/README.md +++ b/README.md @@ -1070,13 +1070,13 @@ Because ArchiveBox is designed to ingest a large volume of URLs with multiple co
      -
    • ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles, mostly dependent on whether you're saving audio & video using SAVE_MEDIA=True and whether you lower MEDIA_MAX_SIZE=750mb.
    • -
    • Disk usage can be reduced by using a compressed/[deduplicated](https://www.ixsystems.com/blog/ixsystems-and-klara-systems-celebrate-valentines-day-with-a-heartfelt-donation-of-fast-dedupe-to-openzfs-and-truenas/) filesystem like ZFS/BTRFS, or by turning off extractors methods you don't need. You can also deduplicate content with a tool like fdupes or rdfind. +
    • ArchiveBox can use anywhere from ~1gb per 1000 Snapshots, to ~50gb per 1000 Snapshots, mostly dependent on whether you're saving audio & video using SAVE_MEDIA=True and whether you lower MEDIA_MAX_SIZE=750mb.
    • +
    • Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractors methods you don't need. You can also deduplicate content with a tool like fdupes or rdfind.
    • -
    • Don't store large collections on older filesystems like EXT3/FAT as they may not be able to handle more than 50k directory entries in the `data/archive/` folder. +
    • Don't store large collections on older filesystems like EXT3/FAT as they may not be able to handle more than 50k directory entries in the data/archive/ folder.
    • -
    • Try to keep the `data/index.sqlite3` file on local drive (not a network mount) or SSD for maximum performance, however the `data/archive/` folder can be on a network mount or slower HDD.
    • -
    • If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server. +
    • Try to keep the data/index.sqlite3 file on local drive (not a network mount) or SSD for maximum performance, however the data/archive/ folder can be on a network mount or slower HDD.
    • +
    • If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set PUID & PGID and disable root_squash on your fileshare server.
From a1ef5f60350eacdde4caf8fbb8ba9d7e6aee25c2 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 21 Mar 2024 00:00:14 -0700
Subject: [PATCH 14/21] Update README.md

---
 README.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 1ae5dde2..12a1335e 100644
--- a/README.md
+++ b/README.md
@@ -1160,19 +1160,18 @@ ArchiveBox aims to enable more of the internet to be saved from deterioration by
 Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.
 
-Whether it's to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early 2010's flash games you love to play, having the tools to archive internet content enables to you save the stuff you care most about before it disappears.
+Whether it's to resist censorship by saving news articles before they get taken down or edited, or just to save a collection of early 2010's flash games you loved to play, having the tools to archive internet content enables to you save the stuff you care most about before it disappears.
 
 Image from Perma.cc...
+The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about, just like libraries do. Without the work of archivists saving physical books, manuscrips, and paintings we wouldn't have any knowledge of our ancestors' history. We believe archiving the web is just as important to provide the same benefit to future generations.
 
-The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.
-
-Because modern websites are complicated and often rely on dynamic content,
-ArchiveBox archives the sites in **several different formats** beyond what public archiving services like Archive.org/Archive.is save. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.
+We believe duplication of other people's content is only ethical if it a: A. doesn't deprive the original creators of revenue and B. is responsibly curated. In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this excemption, as they are as a non-profit providing public service, and they respond to DMCA removal requests.
 
+As long as you A. don't try to profit off pirating copyrighted content and B. have processes in place to respond to removal requests, many countries allow you to use sofware like ArchiveBox to ethically and responsibly archive any web content you can view. That being said, ArchiveBox is not liable for how you choose to operate the software. You must research your own local laws and regulations, and get proper legal council if you plan to host a public instance (start by putting your DMCA contact email in FOOTER_INFO and changing your instance's branding using CUSTOM_TEMPLATES_DIR).

From 2220a5350ca0a62513c8e312fec8559468625e69 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 21 Mar 2024 00:02:08 -0700
Subject: [PATCH 15/21] Update README.md

---
 README.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 12a1335e..ea3a2917 100644
--- a/README.md
+++ b/README.md
@@ -1169,7 +1169,12 @@ Whether it's to resist censorship by saving news articles before they get taken
 The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about, just like libraries do. Without the work of archivists saving physical books, manuscrips, and paintings we wouldn't have any knowledge of our ancestors' history. We believe archiving the web is just as important to provide the same benefit to future generations.
 
-We believe duplication of other people's content is only ethical if it a: A. doesn't deprive the original creators of revenue and B. is responsibly curated. In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this excemption, as they are as a non-profit providing public service, and they respond to DMCA removal requests.
+We believe duplication of other people's content is only ethical if it:
+
+- A. doesn't deprive the original creators of revenue and
+- B. is responsibly curated.
+
+In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this excemption, as they are as a non-profit providing public service, and they respond to DMCA removal requests.
 
 As long as you A. don't try to profit off pirating copyrighted content and B. have processes in place to respond to removal requests, many countries allow you to use sofware like ArchiveBox to ethically and responsibly archive any web content you can view. That being said, ArchiveBox is not liable for how you choose to operate the software. You must research your own local laws and regulations, and get proper legal council if you plan to host a public instance (start by putting your DMCA contact email in FOOTER_INFO and changing your instance's branding using CUSTOM_TEMPLATES_DIR).

From 1dbe08872cd600257c00e209ac64cdbed2559136 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 21 Mar 2024 00:10:19 -0700
Subject: [PATCH 16/21] Update README.md

---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index ea3a2917..24b91110 100644
--- a/README.md
+++ b/README.md
@@ -1167,14 +1167,14 @@ Whether it's to resist censorship by saving news articles before they get taken
 Image from Perma.cc...
-The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about, just like libraries do. Without the work of archivists saving physical books, manuscrips, and paintings we wouldn't have any knowledge of our ancestors' history. We believe archiving the web is just as important to provide the same benefit to future generations.
+The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about, just like libraries do. Without the work of archivists saving physical books, manuscrips, and paintings we wouldn't have any knowledge of our ancestors' history. I believe archiving the web is just as important to provide the same benefit to future generations.
 
-We believe duplication of other people's content is only ethical if it:
+ArchiveBox's stance is that duplication of other people's content is only ethical if it:
 
 - A. doesn't deprive the original creators of revenue and
-- B. is responsibly curated.
+- B. is responsibly curated by an individual/institution.
 
-In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this excemption, as they are as a non-profit providing public service, and they respond to DMCA removal requests.
+In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this exemption, as they are as a non-profit providing public service, and they respond to unethical content/DMCA/GDPR removal requests.
 
 As long as you A. don't try to profit off pirating copyrighted content and B. have processes in place to respond to removal requests, many countries allow you to use sofware like ArchiveBox to ethically and responsibly archive any web content you can view. That being said, ArchiveBox is not liable for how you choose to operate the software. You must research your own local laws and regulations, and get proper legal council if you plan to host a public instance (start by putting your DMCA contact email in FOOTER_INFO and changing your instance's branding using CUSTOM_TEMPLATES_DIR).

From 2c6704b1d099425abf7744e1a8d5b6677006e85a Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 21 Mar 2024 00:11:57 -0700
Subject: [PATCH 17/21] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 24b91110..7362948b 100644
--- a/README.md
+++ b/README.md
@@ -1176,7 +1176,7 @@ ArchiveBox's stance is that duplication of other people's content is only ethica
 
 In the U.S., libraries, researchers, and archivists are allowed to duplicate copyrighted materials under "fair use" for private study, scholarship, or research. Archive.org's preservation work is covered under this exemption, as they are as a non-profit providing public service, and they respond to unethical content/DMCA/GDPR removal requests.
 
-As long as you A. don't try to profit off pirating copyrighted content and B. have processes in place to respond to removal requests, many countries allow you to use sofware like ArchiveBox to ethically and responsibly archive any web content you can view. That being said, ArchiveBox is not liable for how you choose to operate the software. You must research your own local laws and regulations, and get proper legal council if you plan to host a public instance (start by putting your DMCA contact email in FOOTER_INFO and changing your instance's branding using CUSTOM_TEMPLATES_DIR).
+As long as you A. don't try to profit off pirating copyrighted content and B. have processes in place to respond to removal requests, many countries allow you to use sofware like ArchiveBox to ethically and responsibly archive any web content you can view. That being said, ArchiveBox is not liable for how you choose to operate the software. You must research your own local laws and regulations, and get proper legal council if you plan to host a public instance (start by putting your DMCA/GDPR contact info in FOOTER_INFO and changing your instance's branding using CUSTOM_TEMPLATES_DIR).

From 88f21d0d70dcce27ad944fece540d32d30897386 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 21 Mar 2024 00:12:31 -0700
Subject: [PATCH 18/21] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 7362948b..68f8bfe5 100644
--- a/README.md
+++ b/README.md
@@ -1189,7 +1189,7 @@ As long as you A. don't try to profit off pirating copyrighted content and B. ha
 > **Check out our [community wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community) for a list of web archiving tools and orgs.**
 
-A variety of open and closed-source archiving projects exist, but few provide a nice UI and CLI to manage a large, high-fidelity archive collection over time.
+A variety of open and closed-source archiving projects exist, but few provide a nice UI and CLI to manage a large, high-fidelity collection over time.

From ee2809eb4fd53c07a406abef7e9b4ad72c8ebb74 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 21 Mar 2024 00:27:49 -0700
Subject: [PATCH 19/21] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 68f8bfe5..d93e3989 100644
--- a/README.md
+++ b/README.md
@@ -1577,7 +1577,7 @@ Extractors take the URL of a page to archive, write their output to the filesyst
-- [ArchiveBox.io Homepage](https://archivebox.io) / [Source Code (Github)](https://github.com/ArchiveBox/ArchiveBox) / [Demo Server](https://demo.archivebox.io)
+- [ArchiveBox.io Website](https://archivebox.io) / [ArchiveBox Github (Source Code)](https://github.com/ArchiveBox/ArchiveBox) / [ArchiveBox Demo Server](https://demo.archivebox.io)
 - [Documentation Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki) / [API Reference Docs](https://docs.archivebox.io) / [Changelog](https://github.com/ArchiveBox/ArchiveBox/releases)
 - [Bug Tracker](https://github.com/ArchiveBox/ArchiveBox/issues) / [Discussions](https://github.com/ArchiveBox/ArchiveBox/discussions) / [Community Chat Forum (Zulip)](https://zulip.archivebox.io)
 - Find us on social media: [Twitter](https://twitter.com/ArchiveBoxApp), [LinkedIn](https://www.linkedin.com/company/archivebox/), [YouTube](https://www.youtube.com/@ArchiveBoxApp), [SaaSHub](https://www.saashub.com/archivebox), [Alternative.to](https://alternativeto.net/software/archivebox/about/), [Reddit](https://www.reddit.com/r/ArchiveBox/)

From 05213794642b726b4b6dedabaa27c96628e2d5c2 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 21 Mar 2024 00:29:54 -0700
Subject: [PATCH 20/21] Update README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index d93e3989..ea331241 100644
--- a/README.md
+++ b/README.md
@@ -1578,9 +1578,9 @@ Extractors take the URL of a page to archive, write their output to the filesyst
 - [ArchiveBox.io Website](https://archivebox.io) / [ArchiveBox Github (Source Code)](https://github.com/ArchiveBox/ArchiveBox) / [ArchiveBox Demo Server](https://demo.archivebox.io)
-- [Documentation Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki) / [API Reference Docs](https://docs.archivebox.io) / [Changelog](https://github.com/ArchiveBox/ArchiveBox/releases)
-- [Bug Tracker](https://github.com/ArchiveBox/ArchiveBox/issues) / [Discussions](https://github.com/ArchiveBox/ArchiveBox/discussions) / [Community Chat Forum (Zulip)](https://zulip.archivebox.io)
-- Find us on social media: [Twitter](https://twitter.com/ArchiveBoxApp), [LinkedIn](https://www.linkedin.com/company/archivebox/), [YouTube](https://www.youtube.com/@ArchiveBoxApp), [SaaSHub](https://www.saashub.com/archivebox), [Alternative.to](https://alternativeto.net/software/archivebox/about/), [Reddit](https://www.reddit.com/r/ArchiveBox/)
+- [Documentation (Github Wiki)](https://github.com/ArchiveBox/ArchiveBox/wiki) / [API Reference Docs (ReadTheDocs)](https://docs.archivebox.io) / [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap) / [Changelog](https://github.com/ArchiveBox/ArchiveBox/releases)
+- [Bug Tracker (Github Issues)](https://github.com/ArchiveBox/ArchiveBox/issues) / [Discussions (Github Discussions)](https://github.com/ArchiveBox/ArchiveBox/discussions) / [Community Chat Forum (Zulip)](https://zulip.archivebox.io)
+- Find us on social media: [Twitter `@ArchiveBoxApp`](https://twitter.com/ArchiveBoxApp), [LinkedIn](https://www.linkedin.com/company/archivebox/), [YouTube](https://www.youtube.com/@ArchiveBoxApp), [SaaSHub](https://www.saashub.com/archivebox), [Alternative.to](https://alternativeto.net/software/archivebox/about/), [Reddit](https://www.reddit.com/r/ArchiveBox/)
 ---

From 1d49bee90bcf6a0b04905266f3e7e73306ed6f9c Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 21 Mar 2024 00:31:48 -0700
Subject: [PATCH 21/21] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index ea331241..27a84956 100644
--- a/README.md
+++ b/README.md
@@ -1599,7 +1599,7 @@ Extractors take the URL of a page to archive, write their output to the filesyst
-ArchiveBox was started by Nick Sweeting in 2017, and has grown steadily with help from our amazing contributors.
+ArchiveBox was started by Nick Sweeting in 2017, and has grown steadily with help from our amazing contributors.
✨ Have spare CPU/disk/bandwidth after all your website archive crawling and want to help the world?
Check out our Good Karma Kit...