From 8387e02d3c3d7505f6da8ce341facf2743eada13 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:04:01 -0800 Subject: [PATCH 01/38] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index aeb07aa3..1c26def3 100644 --- a/README.md +++ b/README.md @@ -103,6 +103,7 @@ archivebox init --setup
# Or use the optional auto setup script to install it curl -sSL 'https://get.archivebox.io' | sh +
From 82c9c691c03b7b10a9a9e1fca744cefd4a12fb3d Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:06:52 -0800 Subject: [PATCH 02/38] Update README.md --- README.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 1c26def3..a85b30cb 100644 --- a/README.md +++ b/README.md @@ -70,7 +70,7 @@ The goal is to sleep soundly knowing the part of the internet you care about wil
-**📦  Install ArchiveBox using your preferred method: `docker` / `apt` / `brew` / `pip3` / `nix` / etc. ([see Quickstart below](#quickstart)).** +**📦  Install ArchiveBox using your preferred method: `docker` / `pip` / `apt` / `brew` / etc. ([see full Quickstart below](#quickstart)).**
@@ -78,7 +78,7 @@ The goal is to sleep soundly knowing the part of the internet you care about wil
mkdir ~/archivebox; cd ~/archivebox    # create a dir somewhere for your archivebox data
 
-# Get ArchiveBox with Docker Compose (recommended): +# Option A: Get ArchiveBox with Docker Compose (recommended): curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml # edit options in this file as-needed docker compose run archivebox init --setup # docker compose run archivebox add 'https://example.com' @@ -86,14 +86,14 @@ docker compose run archivebox init --setup # docker compose up

-# Or use it as a plain Docker container: +# Option B: Or use it as a plain Docker container: docker run -it -v $PWD:/data archivebox/archivebox init --setup # docker run -it -v $PWD:/data archivebox/archivebox add 'https://example.com' # docker run -it -v $PWD:/data archivebox/archivebox help # docker run -it -v $PWD:/data -p 8000:8000 archivebox/archivebox

-# Or install it with your preferred pkg manager (see Quickstart below for apt, brew, and more) +# Option C: Or install it with your preferred pkg manager (see Quickstart below for apt, brew, and more) pip install archivebox archivebox init --setup # archivebox add 'https://example.com' @@ -101,15 +101,14 @@ archivebox init --setup # archivebox server 0.0.0.0:8000

-# Or use the optional auto setup script to install it +# Option D: Or use the optional auto setup script to install it curl -sSL 'https://get.archivebox.io' | sh
+
+Open http://localhost:8000 to see your server's Web UI ➡️

-Open http://localhost:8000 to see your server's Web UI ➡️ - -


From 0abca0b547acf7da000cd5f348db0e6ff17bf186 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:09:12 -0800 Subject: [PATCH 03/38] Update README.md --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index a85b30cb..049e51f8 100644 --- a/README.md +++ b/README.md @@ -168,9 +168,10 @@ curl -sSL 'https://get.archivebox.io' | sh
  • Install Docker on your system (if not already installed).
  • Download the docker-compose.yml file into a new empty directory (can be anywhere).
    mkdir ~/archivebox && cd ~/archivebox
    -curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/docker-compose.yml'
    +curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml
    +# points to https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/docker-compose.yml
     
  • -Run the initial setup and create an admin user.
  • +Run the initial setup to create an admin user (or set ADMIN_USER/PASS in docker-compose.yml)
    docker compose run archivebox init --setup
     
  • Next steps: Start the server then login to the Web UI http://127.0.0.1:8000 ⇢ Admin. From b708303dd4ba419065770fb47e8a1d95ed8f4011 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:20:32 -0800 Subject: [PATCH 04/38] Update README.md --- README.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 049e51f8..0241c089 100644 --- a/README.md +++ b/README.md @@ -35,13 +35,14 @@ Without active preservation effort, everything on the internet eventually dissap 💾 **It saves snapshots of the URLs you feed it in several redundant formats.** It also detects any content featured *inside* each webpage & extracts it out into a folder: -- `HTML/Generic websites -> HTML, PDF, PNG, WARC, Singlefile` -- `YouTube/SoundCloud/etc. -> MP3/MP4 + subtitles, description, thumbnail` -- `News articles -> article body TXT + title, author, featured images` -- `Github/Gitlab/etc. links -> git cloned source code` +- **HTML**/**Any websites** ➡️ `original HTML+CSS+JS`, `singlefile HTML`, `screenshot PNG`, `PDF`, `WARC`, ... +- **Social Media**/**News** ➡️ `post content TXT`, `comments`, `title`, `author`, `images` +- **YouTube**/**SoundCloud**/etc. ➡️ `MP3/MP4`s, `subtitles`, `metadata`, `thumbnail`, ... +- **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... - *[and more...](#output-formats)* It uses normal filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. +ArchiveBox does the archiving using standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), `wget`, `yt-dlp`, `readability`, [and more](#dependencies) internally. 
--- From bd290aa2820397a861db6302da45658773d355c0 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:24:26 -0800 Subject: [PATCH 05/38] Update README.md --- README.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index 0241c089..dd6ee899 100644 --- a/README.md +++ b/README.md @@ -35,11 +35,11 @@ Without active preservation effort, everything on the internet eventually dissap 💾 **It saves snapshots of the URLs you feed it in several redundant formats.** It also detects any content featured *inside* each webpage & extracts it out into a folder: -- **HTML**/**Any websites** ➡️ `original HTML+CSS+JS`, `singlefile HTML`, `screenshot PNG`, `PDF`, `WARC`, ... -- **Social Media**/**News** ➡️ `post content TXT`, `comments`, `title`, `author`, `images` -- **YouTube**/**SoundCloud**/etc. ➡️ `MP3/MP4`s, `subtitles`, `metadata`, `thumbnail`, ... -- **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... -- *[and more...](#output-formats)* +- 🌐 **HTML**/**Any websites** ➡️ `original HTML+CSS+JS`, `singlefile HTML`, `screenshot PNG`, `PDF`, `WARC`, ... +- 🎥 **Social Media**/**News** ➡️ `post content TXT`, `comments`, `title`, `author`, `images` +- 🎬 **YouTube**/**SoundCloud**/etc. ➡️ `MP3/MP4`s, `subtitles`, `metadata`, `thumbnail`, ... +- 💾 **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... +- ✨ *[and more...](#output-formats)* It uses normal filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. ArchiveBox does the archiving using standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), `wget`, `yt-dlp`, `readability`, [and more](#dependencies) internally. 
@@ -48,13 +48,13 @@ ArchiveBox does the archiving using standard tools like [Google Chrome](https:// 🏛️ ArchiveBox is used by many *[professionals](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) and [hobbyists](https://zulip.archivebox.io/#narrow/stream/158-development)* who save content off the web, for example: -- **Individuals:** +- 👩🏽 **Individuals:** `backing up browser bookmarks/history`, `saving FB/Insta/etc. content`, `shopping lists` -- **Journalists:** +- 🗞️ **Journalists:** `crawling and collecting research`, `preserving quoted material`, `fact-checking and review` -- **Lawyers:** +- ⚖️ **Lawyers:** `evidence collection`, `hashing & integrity verifying`, `search, tagging, & review` -- **Researchers:** +- 🔬 **Researchers:** `collecting AI training sets`, `feeding analysis / web crawling pipelines` The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats [for decades](#background--motivation) after it goes down. From b2d1083453415b105139b80f860fa95533ed722d Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:25:45 -0800 Subject: [PATCH 06/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index dd6ee899..721df4cd 100644 --- a/README.md +++ b/README.md @@ -75,7 +75,7 @@ The goal is to sleep soundly knowing the part of the internet you care about wil
    -Quick reference   ⤵️
    +Expand for quick copy-pastable install commands...   ⤵️
    mkdir ~/archivebox; cd ~/archivebox    # create a dir somewhere for your archivebox data
     
    From dbcbdc7691e4a0702298502e300cf016d1640507 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:26:07 -0800 Subject: [PATCH 07/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 721df4cd..d505f24c 100644 --- a/README.md +++ b/README.md @@ -33,7 +33,7 @@ Without active preservation effort, everything on the internet eventually dissap snapshot detail page -💾 **It saves snapshots of the URLs you feed it in several redundant formats.** +**It saves snapshots of the URLs you feed it in several redundant formats.** It also detects any content featured *inside* each webpage & extracts it out into a folder: - 🌐 **HTML**/**Any websites** ➡️ `original HTML+CSS+JS`, `singlefile HTML`, `screenshot PNG`, `PDF`, `WARC`, ... - 🎥 **Social Media**/**News** ➡️ `post content TXT`, `comments`, `title`, `author`, `images` From 43ceb24c506c4f684256b7f1498fcbd2d3e1a5af Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:30:16 -0800 Subject: [PATCH 08/38] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index d505f24c..48a84c21 100644 --- a/README.md +++ b/README.md @@ -41,8 +41,8 @@ It also detects any content featured *inside* each webpage & extracts it out int - 💾 **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... - ✨ *[and more...](#output-formats)* -It uses normal filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. -ArchiveBox does the archiving using standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), `wget`, `yt-dlp`, `readability`, [and more](#dependencies) internally. +It uses ordinary filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. 
+To power its functionality, ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and its operation can be tuned, secured, and extended as-needed. --- From d4703d1e1624a4ae33a6b2b85f7688b47dae25b5 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:32:18 -0800 Subject: [PATCH 09/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 48a84c21..98fdd869 100644 --- a/README.md +++ b/README.md @@ -46,7 +46,7 @@ To power its functionality, ArchiveBox bundles industry-standard tools like [Goo --- -🏛️ ArchiveBox is used by many *[professionals](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) and [hobbyists](https://zulip.archivebox.io/#narrow/stream/158-development)* who save content off the web, for example: +🏛️ ArchiveBox is used by many *[professional organizations](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) and [hobbyists](https://zulip.archivebox.io/#narrow/stream/158-development)* who save content off the web, for example: - 👩🏽 **Individuals:** `backing up browser bookmarks/history`, `saving FB/Insta/etc. content`, `shopping lists` From fffc872470d977efa36f0a02dce14191e15cdb77 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:34:00 -0800 Subject: [PATCH 10/38] Update README.md --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 98fdd869..6371f978 100644 --- a/README.md +++ b/README.md @@ -151,6 +151,8 @@ curl -sSL 'https://get.archivebox.io' | sh
  • + + # Quickstart **🖥  Supported OSs:** Linux/BSD, macOS, Windows (Docker)   **👾  CPUs:** `amd64` (`x86_64`), `arm64` (`arm8`), `arm7` (raspi>=3)
    From 0ec2fbf8b23157b204516cf39b13b09a44d72621 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:37:57 -0800 Subject: [PATCH 11/38] Update README.md --- README.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 6371f978..69e0e64f 100644 --- a/README.md +++ b/README.md @@ -447,8 +447,12 @@ cd ~/archivebox/data # IMPORTANT: cd into the directory # archivebox [subcommand] [--args] archivebox help -# or + +# equivalent: docker compose run archivebox [subcommand] [--args] docker compose run archivebox help + +# equivalent: docker run -it -v $PWD:/data archivebox/archivebox [subcommand] [--args] + docker run -it -v $PWD:/data archivebox/archivebox help ``` #### ArchiveBox Subcommands From 202ccc812c1037b4f0a37c6f23160934f9ceae3f Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:41:08 -0800 Subject: [PATCH 12/38] Update README.md --- README.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 69e0e64f..f38db4d2 100644 --- a/README.md +++ b/README.md @@ -170,9 +170,9 @@ curl -sSL 'https://get.archivebox.io' | sh
    1. Install Docker on your system (if not already installed).
    2. Download the docker-compose.yml file into a new empty directory (can be anywhere).
      -mkdir ~/archivebox && cd ~/archivebox
      +mkdir ~/archivebox && cd ~/archivebox   # can be anywhere
      +# Read and edit docker-compose.yml options as-needed after downloading
       curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml
      -# points to https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/docker-compose.yml
       
    3. Run the initial setup to create an admin user (or set ADMIN_USER/PASS in docker-compose.yml)
      docker compose run archivebox init --setup
      @@ -204,6 +204,7 @@ docker run -v $PWD:/data -it archivebox/archivebox init --setup
       
      docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox
       # completely optional, CLI can always be used without running a server
       # docker run -v $PWD:/data -it archivebox/archivebox [subcommand] [--args]
      +docker run -v $PWD:/data -it archivebox/archivebox help
       
    @@ -255,6 +256,7 @@ archivebox init --setup
    archivebox server 0.0.0.0:8000
     # completely optional, CLI can always be used without running a server
     # archivebox [subcommand] [--args]
    +archivebox help
     
    @@ -290,6 +292,7 @@ archivebox init --setup # if any problems, install with pip instead
    archivebox server 0.0.0.0:8000
     # completely optional, CLI can always be used without running a server
     # archivebox [subcommand] [--args]
    +archivebox help
     
    @@ -318,6 +321,7 @@ archivebox init --setup # if any problems, install with pip instead
    archivebox server 0.0.0.0:8000
     # completely optional, CLI can always be used without running a server
     # archivebox [subcommand] [--args]
    +archivebox help
     
    From d40f46a9857a6800c0c01ba2e77c6524a8c80a00 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:41:28 -0800 Subject: [PATCH 13/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f38db4d2..a52bb837 100644 --- a/README.md +++ b/README.md @@ -170,7 +170,7 @@ curl -sSL 'https://get.archivebox.io' | sh
    1. Install Docker on your system (if not already installed).
    2. Download the docker-compose.yml file into a new empty directory (can be anywhere).
      -mkdir ~/archivebox && cd ~/archivebox   # can be anywhere
      +mkdir ~/archivebox && cd ~/archivebox
       # Read and edit docker-compose.yml options as-needed after downloading
       curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml
       
    3. From 9f86ec31a0787ecbddd8ac5de63730ead2a52453 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:47:52 -0800 Subject: [PATCH 14/38] Update README.md --- README.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index a52bb837..c3f8b843 100644 --- a/README.md +++ b/README.md @@ -46,17 +46,6 @@ To power its functionality, ArchiveBox bundles industry-standard tools like [Goo --- -🏛️ ArchiveBox is used by many *[professional organizations](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) and [hobbyists](https://zulip.archivebox.io/#narrow/stream/158-development)* who save content off the web, for example: - -- 👩🏽 **Individuals:** - `backing up browser bookmarks/history`, `saving FB/Insta/etc. content`, `shopping lists` -- 🗞️ **Journalists:** - `crawling and collecting research`, `preserving quoted material`, `fact-checking and review` -- ⚖️ **Lawyers:** - `evidence collection`, `hashing & integrity verifying`, `search, tagging, & review` -- 🔬 **Researchers:** - `collecting AI training sets`, `feeding analysis / web crawling pipelines` - The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats [for decades](#background--motivation) after it goes down.
      @@ -137,10 +126,21 @@ curl -sSL 'https://get.archivebox.io' | sh ## 🤝 Professional Integration -*[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) if your institution/org wants to use ArchiveBox professionally.* +🏛️ ArchiveBox is used by many *[professional organizations](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) and [hobbyists](https://zulip.archivebox.io/#narrow/stream/158-development)* who save content off the web, for example: -- setup & support, team permissioning, hashing, audit logging, backups, custom archiving etc. -- for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more... +- 👩🏽 **Individuals:** + `backing up browser bookmarks/history`, `saving FB/Insta/etc. content`, `shopping lists` +- 🗞️ **Journalists:** + `crawling and collecting research`, `preserving quoted material`, `fact-checking and review` +- ⚖️ **Lawyers:** + `evidence collection`, `hashing & integrity verifying`, `search, tagging, & review` +- 🔬 **Researchers:** + `collecting AI training sets`, `feeding analysis / web crawling pipelines` + +> *[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) if your institution/org wants to use ArchiveBox professionally.* +> +> - setup & support, team permissioning, hashing, audit logging, backups, custom archiving etc. +> - for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more... 
*We are a 501(c)(3) nonprofit and all our work goes towards supporting open-source development.* From 4ae58884ca51fed4a2f1b9973878722195e0b571 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:51:27 -0800 Subject: [PATCH 15/38] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index c3f8b843..6b9d127f 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ Without active preservation effort, everything on the internet eventually dissap
      -📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list. +📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from the browser [extension](https://chromewebstore.google.com/detail/archivebox-exporter/habonpimjphpdnmcfkaockjnffodikoj), bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, and more. See Input Formats for a full list. snapshot detail page @@ -39,7 +39,7 @@ It also detects any content featured *inside* each webpage & extracts it out int - 🎥 **Social Media**/**News** ➡️ `post content TXT`, `comments`, `title`, `author`, `images` - 🎬 **YouTube**/**SoundCloud**/etc. ➡️ `MP3/MP4`s, `subtitles`, `metadata`, `thumbnail`, ... - 💾 **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... -- ✨ *[and more...](#output-formats)* +- ✨ *and more, see all [Output Formats](#output-formats) below...* It uses ordinary filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. To power its functionality, ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and its operation can be tuned, secured, and extended as-needed. 
From bae2f3a09ddf9f48102b79213d6b7856e841fe53 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:52:32 -0800 Subject: [PATCH 16/38] Update README.md --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 6b9d127f..c3bf340b 100644 --- a/README.md +++ b/README.md @@ -39,10 +39,11 @@ It also detects any content featured *inside* each webpage & extracts it out int - 🎥 **Social Media**/**News** ➡️ `post content TXT`, `comments`, `title`, `author`, `images` - 🎬 **YouTube**/**SoundCloud**/etc. ➡️ `MP3/MP4`s, `subtitles`, `metadata`, `thumbnail`, ... - 💾 **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... -- ✨ *and more, see all [Output Formats](#output-formats) below...* +- ✨ *and more, see [Output Formats](#output-formats) below...* It uses ordinary filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. -To power its functionality, ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and its operation can be tuned, secured, and extended as-needed. + +To power its functionality, ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and operation can be tuned, secured, and extended as-needed for different applications. 
--- From 3839e2d4ffe180b1194038ec5e6b9df0111c8068 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 00:54:32 -0800 Subject: [PATCH 17/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c3bf340b..ad1c558b 100644 --- a/README.md +++ b/README.md @@ -23,7 +23,7 @@ curl -sSL 'https://get.archivebox.io' | sh # (or see pip/brew/Docker instruct Without active preservation effort, everything on the internet eventually dissapears or degrades. Archive.org does a great job as a free central archive, but they require all archives to be public, and they can't save every type of content. -*ArchiveBox is an open source tool that helps you archive web content on your own (or privately within an organization): save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...* +*ArchiveBox is an open source tool that helps organizations and individuals archive web content and retain control over their data: save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...* > ➡️ *Use ArchiveBox as a [command-line package](#quickstart) and/or [self-hosted web app](#quickstart) on Linux, macOS, or in [Docker](#quickstart).* From e222d518e455bb058290f130a03be7dbba2af08f Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:09:14 -0800 Subject: [PATCH 18/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ad1c558b..294df6af 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ Without active preservation effort, everything on the internet eventually dissap *ArchiveBox is an open source tool that helps 
organizations and individuals archive web content and retain control over their data: save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...* -> ➡️ *Use ArchiveBox as a [command-line package](#quickstart) and/or [self-hosted web app](#quickstart) on Linux, macOS, or in [Docker](#quickstart).* +> ➡️ *Use ArchiveBox on [Linux](#quickstart)/[macOS](#quickstart)/[Windows](#quickstart)/[Docker](#quickstart) as a [CLI tool](#usage), [self-hosted Web App](#-web-ui-usage), [`pip` library](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage), or [one-off command](https://docs.archivebox.io/en/v0.6.2/_modules/archivebox/cli/archivebox_oneshot.html).*
      From 4e7da217bba4afe7ef4a73e1c0a58086a2242765 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:11:04 -0800 Subject: [PATCH 19/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 294df6af..5ab885a1 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ Without active preservation effort, everything on the internet eventually dissap *ArchiveBox is an open source tool that helps organizations and individuals archive web content and retain control over their data: save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...* -> ➡️ *Use ArchiveBox on [Linux](#quickstart)/[macOS](#quickstart)/[Windows](#quickstart)/[Docker](#quickstart) as a [CLI tool](#usage), [self-hosted Web App](#-web-ui-usage), [`pip` library](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage), or [one-off command](https://docs.archivebox.io/en/v0.6.2/_modules/archivebox/cli/archivebox_oneshot.html).* +> ➡️ *Use ArchiveBox on [Linux](#quickstart)/[macOS](#quickstart)/[Windows](#quickstart)/[Docker](#quickstart) as a [CLI tool](#usage), [self-hosted Web App](https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive), [`pip` library](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage), or [one-off command](https://docs.archivebox.io/en/v0.6.2/_modules/archivebox/cli/archivebox_oneshot.html).*
      From 1161f08b55f3064b9718fe8b6a37591961d53820 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:12:35 -0800 Subject: [PATCH 20/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 5ab885a1..fff548ca 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ Without active preservation effort, everything on the internet eventually dissap
      -📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from the browser [extension](https://chromewebstore.google.com/detail/archivebox-exporter/habonpimjphpdnmcfkaockjnffodikoj), bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, and more. See Input Formats for a full list. +📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from your bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, our [Browser Extension](https://chromewebstore.google.com/detail/archivebox-exporter/habonpimjphpdnmcfkaockjnffodikoj), and more. See Input Formats for a full list. snapshot detail page From 17fdf76178865ef14fe5bdf38917bbfa5e0196e4 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:32:38 -0800 Subject: [PATCH 21/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index fff548ca..283c3d7f 100644 --- a/README.md +++ b/README.md @@ -43,7 +43,7 @@ It also detects any content featured *inside* each webpage & extracts it out int It uses ordinary filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. -To power its functionality, ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and operation can be tuned, secured, and extended as-needed for different applications. +Under-the-hood, ArchiveBox uses [Django](https://www.djangoproject.com/start/overview/) to power its [Web UI](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#ui-usage) and [SQlite](https://www.sqlite.org/locrsf.html) + the filesystem to provide [fast & durable metadata storage](https://www.sqlite.org/locrsf.html) w/ [determinisitc upgrades](https://stackoverflow.com/a/39976321/2156113). 
ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and its operation can be [tuned, secured, and extended](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) as-needed for many different applications. --- From 269bf3f7f3545490b19802717fe2c90d9b863c33 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:33:04 -0800 Subject: [PATCH 22/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 283c3d7f..0e493feb 100644 --- a/README.md +++ b/README.md @@ -43,7 +43,7 @@ It also detects any content featured *inside* each webpage & extracts it out int It uses ordinary filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. -Under-the-hood, ArchiveBox uses [Django](https://www.djangoproject.com/start/overview/) to power its [Web UI](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#ui-usage) and [SQlite](https://www.sqlite.org/locrsf.html) + the filesystem to provide [fast & durable metadata storage](https://www.sqlite.org/locrsf.html) w/ [determinisitc upgrades](https://stackoverflow.com/a/39976321/2156113). ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and its operation can be [tuned, secured, and extended](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) as-needed for many different applications. 
+> Under-the-hood, ArchiveBox uses [Django](https://www.djangoproject.com/start/overview/) to power its [Web UI](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#ui-usage) and [SQlite](https://www.sqlite.org/locrsf.html) + the filesystem to provide [fast & durable metadata storage](https://www.sqlite.org/locrsf.html) w/ [determinisitc upgrades](https://stackoverflow.com/a/39976321/2156113). ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and its operation can be [tuned, secured, and extended](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) as-needed for many different applications. --- From 2577a8a3bed281c905af037cf1785f30431db748 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:35:11 -0800 Subject: [PATCH 23/38] Update README.md --- README.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 0e493feb..2a955a6a 100644 --- a/README.md +++ b/README.md @@ -42,8 +42,6 @@ It also detects any content featured *inside* each webpage & extracts it out int - ✨ *and more, see [Output Formats](#output-formats) below...* It uses ordinary filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. - -> Under-the-hood, ArchiveBox uses [Django](https://www.djangoproject.com/start/overview/) to power its [Web UI](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#ui-usage) and [SQlite](https://www.sqlite.org/locrsf.html) + the filesystem to provide [fast & durable metadata storage](https://www.sqlite.org/locrsf.html) w/ [determinisitc upgrades](https://stackoverflow.com/a/39976321/2156113). 
ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and its operation can be [tuned, secured, and extended](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) as-needed for many different applications. --- @@ -710,11 +708,14 @@ CURL_USER_AGENT="Mozilla/5.0 ..." ## Dependencies -To achieve high-fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party tools that specialize in extracting different types of content. +To achieve high-fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party libraries and tools that specialize in extracting different types of content. + +> Under-the-hood, ArchiveBox uses [Django](https://www.djangoproject.com/start/overview/) to power its [Web UI](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#ui-usage) and [SQlite](https://www.sqlite.org/locrsf.html) + the filesystem to provide [fast & durable metadata storage](https://www.sqlite.org/locrsf.html) w/ [determinisitc upgrades](https://stackoverflow.com/a/39976321/2156113). ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and its operation can be [tuned, secured, and extended](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) as-needed for many different applications. +
      -Expand to learn more about ArchiveBox's dependencies...
      +Expand to learn more about ArchiveBox's internals & dependencies...
      > *TIP: For better security, easier updating, and to avoid polluting your host system with extra dependencies,**it is strongly recommended to use the [⭐️ official Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker)** with everything pre-installed for the best experience.* From 56a752582253a05ff3220092925bb0504499ffd1 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:38:18 -0800 Subject: [PATCH 24/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 2a955a6a..c94cd915 100644 --- a/README.md +++ b/README.md @@ -41,7 +41,7 @@ It also detects any content featured *inside* each webpage & extracts it out int - 💾 **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... - ✨ *and more, see [Output Formats](#output-formats) below...* -It uses ordinary filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. +It uses [industry-standard tools](#dependencies) to create these archives, and stores them in ordinary [filesystem folders](#archive-layout) (no complicated proprietary formats). --- From 15a714323849f8820440dd09252389ef551848a6 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:39:49 -0800 Subject: [PATCH 25/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c94cd915..125a4f70 100644 --- a/README.md +++ b/README.md @@ -41,7 +41,7 @@ It also detects any content featured *inside* each webpage & extracts it out int - 💾 **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... - ✨ *and more, see [Output Formats](#output-formats) below...* -It uses [industry-standard tools](#dependencies) to create these archives, and stores them in ordinary [filesystem folders](#archive-layout) (no complicated proprietary formats). 
+It uses [common tools](#dependencies) (e.g. Chrome, `wget`, `yt-dlp`) to create these archives, and stores them in ordinary [folders](#archive-layout) (no complicated proprietary formats). --- From 6be974f0fac7ef3f35a941d8a068f14920433b67 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:41:42 -0800 Subject: [PATCH 26/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 125a4f70..c6f4d301 100644 --- a/README.md +++ b/README.md @@ -41,7 +41,7 @@ It also detects any content featured *inside* each webpage & extracts it out int - 💾 **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... - ✨ *and more, see [Output Formats](#output-formats) below...* -It uses [common tools](#dependencies) (e.g. Chrome, `wget`, `yt-dlp`) to create these archives, and stores them in ordinary [folders](#archive-layout) (no complicated proprietary formats). +It works with [common tools](#dependencies) like Chrome, `wget`, & `yt-dlp`, and stores data in ordinary [files & folders](#archive-layout) (no proprietary formats). --- From dfd8cd487d1a6980644dfdebdbb3f046b3cf2a49 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:42:50 -0800 Subject: [PATCH 27/38] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c6f4d301..d70c1d9a 100644 --- a/README.md +++ b/README.md @@ -41,7 +41,7 @@ It also detects any content featured *inside* each webpage & extracts it out int - 💾 **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... - ✨ *and more, see [Output Formats](#output-formats) below...* -It works with [common tools](#dependencies) like Chrome, `wget`, & `yt-dlp`, and stores data in ordinary [files & folders](#archive-layout) (no proprietary formats). 
+It uses [standard tools](#dependencies) like Chrome, `wget`, & `yt-dlp`, and stores data in ordinary [files & folders](#archive-layout) (no complex proprietary formats). --- From e6d6b7cb6dc9461df2fc4ef69937f6bc5acf81c9 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 01:46:43 -0800 Subject: [PATCH 28/38] Update README.md --- README.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index d70c1d9a..17e321e1 100644 --- a/README.md +++ b/README.md @@ -442,7 +442,7 @@ For more discussion on managed and paid hosting options see here: -sqlite3 ./index.sqlite3 # run SQL queries on your index -archivebox shell # explore the Python API in a REPL -ls ./archive/*/index.html # or inspect snapshots on the filesystem +archivebox shell # explore the Python library API in a REPL +sqlite3 ./index.sqlite3 # run SQL queries directly on your index +ls ./archive/*/index.html # or inspect snapshot data directly on the filesystem
      @@ -542,6 +542,8 @@ docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox archivebox config --set PUBLIC_ADD_VIEW=True # allow guests to submit URLs archivebox config --set PUBLIC_SNAPSHOTS=True # allow guests to see snapshot content archivebox config --set PUBLIC_INDEX=True # allow guests to see list of all snapshots +# or +docker compose run archivebox config --set ... # restart the server to apply any config changes
      From 4d9019ada81000921fc16b69cb38c009cb742c47 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 02:05:08 -0800 Subject: [PATCH 29/38] Update README.md --- README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 17e321e1..2b683f6c 100644 --- a/README.md +++ b/README.md @@ -125,18 +125,18 @@ curl -sSL 'https://get.archivebox.io' | sh ## 🤝 Professional Integration -🏛️ ArchiveBox is used by many *[professional organizations](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) and [hobbyists](https://zulip.archivebox.io/#narrow/stream/158-development)* who save content off the web, for example: +🏛️ ArchiveBox is free for everyone to self-host, but we also provide support, security review, and custom integrations to help NGOs and other organizations [run ArchiveBox professionally](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102): -- 👩🏽 **Individuals:** - `backing up browser bookmarks/history`, `saving FB/Insta/etc. 
content`, `shopping lists` - 🗞️ **Journalists:** `crawling and collecting research`, `preserving quoted material`, `fact-checking and review` - ⚖️ **Lawyers:** - `evidence collection`, `hashing & integrity verifying`, `search, tagging, & review` + `collecting & preserving evidence`, `hashing / integrity checking / chain-of-custody`, `tagging & review` - 🔬 **Researchers:** - `collecting AI training sets`, `feeding analysis / web crawling pipelines` + `analyzing social media trends`, `collecting LLM training data`, `crawling to feed other pipelines` +- 👩🏽 **Individuals:** + `saving legacy social media / memoirs`, `preserving portfolios / resume`, `backing up news articles` -> *[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) if your institution/org wants to use ArchiveBox professionally.* +> ***[Contact our team](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your institution/org wants to use ArchiveBox professionally.* > > - setup & support, team permissioning, hashing, audit logging, backups, custom archiving etc. > - for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more... From ea71871af1073109fe0b613c8cf27165736ac5dc Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 02:08:03 -0800 Subject: [PATCH 30/38] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 2b683f6c..8d094857 100644 --- a/README.md +++ b/README.md @@ -267,7 +267,7 @@ See the pip-archive
      -aptitude apt (Ubuntu/Debian) +aptitude apt (Ubuntu/Debian/etc.)
      1. Add the ArchiveBox repository to your sources.
        @@ -302,7 +302,7 @@ See the
        debian-a
      -homebrew brew (macOS) +homebrew brew (macOS only)
      1. Install Homebrew on your system (if not already installed).
      2. From 6dc35097fc7963f94a73fe98dbfcf15df1024cab Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 02:09:51 -0800 Subject: [PATCH 31/38] Update README.md --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 8d094857..e7f3f8c8 100644 --- a/README.md +++ b/README.md @@ -536,7 +536,9 @@ docker run -v $PWD:/data -it archivebox/archivebox archivebox manage createsuper docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox
        -
        Optional: Change permissions to allow non-logged-in users
        +Open http://localhost:8000 to see your server's Web UI ➡️ +
        +Optional: Change permissions to allow non-logged-in users
        
         archivebox config --set PUBLIC_ADD_VIEW=True   # allow guests to submit URLs 
        
        From e4a8a891e688c85626dfff26f32361aea0bc272b Mon Sep 17 00:00:00 2001
        From: Nick Sweeting 
        Date: Sun, 28 Jan 2024 02:13:58 -0800
        Subject: [PATCH 32/38] Update README.md
        
        ---
         README.md | 2 +-
         1 file changed, 1 insertion(+), 1 deletion(-)
        
        diff --git a/README.md b/README.md
        index e7f3f8c8..2a984d12 100644
        --- a/README.md
        +++ b/README.md
        @@ -1092,7 +1092,7 @@ ArchiveBox aims to enable more of the internet to be saved from deterioration by
         
         
        -Click to read more...
        +Click to read more about why archiving is important and how to do it ethically...
        From 11a2b2186f00f381a517f8f145bd8d2499873bca Mon Sep 17 00:00:00 2001
        From: Nick Sweeting
        Date: Sun, 28 Jan 2024 02:15:05 -0800
        Subject: [PATCH 33/38] Update README.md

        ---
         README.md | 2 +-
         1 file changed, 1 insertion(+), 1 deletion(-)

        diff --git a/README.md b/README.md
        index 2a984d12..26566417 100644
        --- a/README.md
        +++ b/README.md
        @@ -25,7 +25,7 @@ Without active preservation effort, everything on the internet eventually dissap
         *ArchiveBox is an open source tool that helps organizations and individuals archive web content and retain control over their data: save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...*
        -> ➡️ *Use ArchiveBox on [Linux](#quickstart)/[macOS](#quickstart)/[Windows](#quickstart)/[Docker](#quickstart) as a [CLI tool](#usage), [self-hosted Web App](https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive), [`pip` library](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage), or [one-off command](https://docs.archivebox.io/en/v0.6.2/_modules/archivebox/cli/archivebox_oneshot.html).*
        +> ➡️ *Use ArchiveBox on [Linux](#quickstart)/[macOS](#quickstart)/[Windows](#quickstart)/[Docker](#quickstart) as a [CLI tool](#usage), [self-hosted Web App](https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive), [`pip` library](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage), or [one-off command](#static-archive-exporting).*
        From f673c1bfe90787e1b74357e562a4c1378e38c9bc Mon Sep 17 00:00:00 2001
        From: Nick Sweeting
        Date: Sun, 28 Jan 2024 02:19:36 -0800
        Subject: [PATCH 34/38] Update README.md

        ---
         README.md | 2 +-
         1 file changed, 1 insertion(+), 1 deletion(-)

        diff --git a/README.md b/README.md
        index 26566417..2b8765d6 100644
        --- a/README.md
        +++ b/README.md
        @@ -241,7 +241,7 @@ See "Against curl | sh as a
        1. Install Python >= v3.10 and Node >= v18 on your system (if not already installed).
        2. -
        3. Install the ArchiveBox package using pip3. +
        4. Install the ArchiveBox package using pip3 (or pipx).
          pip3 install archivebox
           
        5. From d9b8d196757e9ca84af6b5ae0fc1718ca08a6f67 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 02:23:37 -0800 Subject: [PATCH 35/38] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 2b8765d6..e64f8ebb 100644 --- a/README.md +++ b/README.md @@ -766,8 +766,8 @@ Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not offici ## Archive Layout -All of ArchiveBox's state (including the SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the "ArchiveBox Data Folder". -Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create more than one for different collections. +All of ArchiveBox's state (SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the "ArchiveBox Data Folder". +Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create as many data folders as you want to hold different collections.
          From 35de1a5a5529f5fc69415b7a4e38fcae1b32865b Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Sun, 28 Jan 2024 02:32:21 -0800 Subject: [PATCH 36/38] Update README.md --- README.md | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index e64f8ebb..a01329b9 100644 --- a/README.md +++ b/README.md @@ -868,7 +868,7 @@ If you're importing pages with private content or URLs containing secret tokens
          -Click to expand... +Expand to learn about privacy, permissions, and user accounts... ```bash @@ -883,6 +883,7 @@ archivebox config --set SAVE_ARCHIVE_DOT_ORG=False # disable saving all URLs in archivebox config --set PUBLIC_INDEX=False archivebox config --set PUBLIC_SNAPSHOTS=False archivebox config --set PUBLIC_ADD_VIEW=False +archivebox manage createsuperuser # if extra paranoid or anti-Google: archivebox config --set SAVE_FAVICON=False # disable favicon fetching (it calls a Google API passing the URL's domain part only) @@ -912,7 +913,7 @@ Be aware that malicious archived JS can access the contents of other pages in yo
          -Click to expand... +Expand to see risks and mitigations... ```bash @@ -948,7 +949,7 @@ For various reasons, many large sites (Reddit, Twitter, Cloudflare, etc.) active
          -Click to expand... +Click to learn how to set up user agents, cookies, and site logins...
          @@ -971,7 +972,7 @@ ArchiveBox appends a hash with the current date `https://example.com#2020-10-24`
          -Click to expand... +Click to learn how the `Re-Snapshot` feature works...
          @@ -999,12 +1000,11 @@ Improved support for saving multiple snapshots of a single URL without this hash ### Storage Requirements -Because ArchiveBox is designed to ingest a large volume of URLs with multiple copies of each URL stored by different 3rd-party tools, it can be quite disk-space intensive. -There also also some special requirements when using filesystems like NFS/SMB/FUSE. +Because ArchiveBox is designed to ingest a large volume of URLs with multiple copies of each URL stored by different 3rd-party tools, it can be quite disk-space intensive. There are also some special requirements when using filesystems like NFS/SMB/FUSE.
          -Click to expand... +Click to learn more about ArchiveBox's filesystem and hosting requirements...
          @@ -1179,10 +1179,10 @@ Our Community Wiki page serves as an index of the broader web archiving communit
          - [Community Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community) + - [Web Archiving Software](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#web-archiving-projects) + _List of ArchiveBox alternatives and open source projects in the internet archiving space._ - [The Master Lists](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#the-master-lists) _Community-maintained indexes of archiving tools and institutions._ - - [Web Archiving Software](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#web-archiving-projects) - _Open source tools and projects in the internet archiving space._ - [Reading List](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#reading-list) _Articles, posts, and blogs relevant to ArchiveBox and web archiving in general._ - [Communities](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#communities) @@ -1199,8 +1199,6 @@ Our Community Wiki page serves as an index of the broader web archiving communit > ✨ **[Hire the team that built Archivebox](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) to work on your project.** ([@ArchiveBoxApp](https://twitter.com/ArchiveBoxApp)) -(We also offer general software consulting across many industries) -
          ---
          From 2ea4133615ddd5a6d9d8dfcced9e38f0cb17e570 Mon Sep 17 00:00:00 2001
          From: Nick Sweeting
          Date: Sun, 28 Jan 2024 02:33:22 -0800
          Subject: [PATCH 37/38] Update README.md

          ---
           README.md | 4 ++--
           1 file changed, 2 insertions(+), 2 deletions(-)

          diff --git a/README.md b/README.md
          index a01329b9..44dd7096 100644
          --- a/README.md
          +++ b/README.md
          @@ -125,7 +125,7 @@ curl -sSL 'https://get.archivebox.io' | sh
           ## 🤝 Professional Integration
          -🏛️ ArchiveBox is free for everyone to self-host, but we also provide support, security review, and custom integrations to help NGOs and other organizations [run ArchiveBox professionally](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102):
          +ArchiveBox is free for everyone to self-host, but we also provide support, security review, and custom integrations to help NGOs and other organizations [run ArchiveBox professionally](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102):
           - 🗞️ **Journalists:** `crawling and collecting research`, `preserving quoted material`, `fact-checking and review`
          @@ -141,7 +141,7 @@ curl -sSL 'https://get.archivebox.io' | sh
           > - setup & support, team permissioning, hashing, audit logging, backups, custom archiving etc.
           > - for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more...
          -*We are a 501(c)(3) nonprofit and all our work goes towards supporting open-source development.*
          +*We are a 🏛️ 501(c)(3) nonprofit and all our work goes towards supporting open-source development.*
          From b245e90871f12e75481cd6e9554a4ac5fd1bbba9 Mon Sep 17 00:00:00 2001
          From: Nick Sweeting
          Date: Sun, 28 Jan 2024 02:36:16 -0800
          Subject: [PATCH 38/38] Update README.md