diff --git a/README.md b/README.md index aeb07aa3..44dd7096 100644 --- a/README.md +++ b/README.md @@ -23,39 +23,28 @@ curl -sSL 'https://get.archivebox.io' | sh # (or see pip/brew/Docker instruct Without active preservation effort, everything on the internet eventually disappears or degrades. Archive.org does a great job as a free central archive, but they require all archives to be public, and they can't save every type of content. -*ArchiveBox is an open source tool that helps you archive web content on your own (or privately within an organization): save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...* +*ArchiveBox is an open source tool that helps organizations and individuals archive web content and retain control over their data: save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...* -> ➡️ *Use ArchiveBox as a [command-line package](#quickstart) and/or [self-hosted web app](#quickstart) on Linux, macOS, or in [Docker](#quickstart).* +> ➡️ *Use ArchiveBox on [Linux](#quickstart)/[macOS](#quickstart)/[Windows](#quickstart)/[Docker](#quickstart) as a [CLI tool](#usage), [self-hosted Web App](https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive), [`pip` library](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage), or [one-off command](#static-archive-exporting).*
-📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list. +📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from your bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, our [Browser Extension](https://chromewebstore.google.com/detail/archivebox-exporter/habonpimjphpdnmcfkaockjnffodikoj), and more. See Input Formats for a full list. snapshot detail page -💾 **It saves snapshots of the URLs you feed it in several redundant formats.** +**It saves snapshots of the URLs you feed it in several redundant formats.** It also detects any content featured *inside* each webpage & extracts it out into a folder: -- `HTML/Generic websites -> HTML, PDF, PNG, WARC, Singlefile` -- `YouTube/SoundCloud/etc. -> MP3/MP4 + subtitles, description, thumbnail` -- `News articles -> article body TXT + title, author, featured images` -- `Github/Gitlab/etc. links -> git cloned source code` -- *[and more...](#output-formats)* +- 🌐 **HTML**/**Any websites** ➡️ `original HTML+CSS+JS`, `singlefile HTML`, `screenshot PNG`, `PDF`, `WARC`, ... +- 🎥 **Social Media**/**News** ➡️ `post content TXT`, `comments`, `title`, `author`, `images` +- 🎬 **YouTube**/**SoundCloud**/etc. ➡️ `MP3/MP4`s, `subtitles`, `metadata`, `thumbnail`, ... +- 💾 **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... +- ✨ *and more, see [Output Formats](#output-formats) below...* -It uses normal filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. +It uses [standard tools](#dependencies) like Chrome, `wget`, & `yt-dlp`, and stores data in ordinary [files & folders](#archive-layout) (no complex proprietary formats). 
--- -🏛️ ArchiveBox is used by many *[professionals](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) and [hobbyists](https://zulip.archivebox.io/#narrow/stream/158-development)* who save content off the web, for example: - -- **Individuals:** - `backing up browser bookmarks/history`, `saving FB/Insta/etc. content`, `shopping lists` -- **Journalists:** - `crawling and collecting research`, `preserving quoted material`, `fact-checking and review` -- **Lawyers:** - `evidence collection`, `hashing & integrity verifying`, `search, tagging, & review` -- **Researchers:** - `collecting AI training sets`, `feeding analysis / web crawling pipelines` - The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats [for decades](#background--motivation) after it goes down.
@@ -70,15 +59,15 @@ The goal is to sleep soundly knowing the part of the internet you care about wil
-**📦  Install ArchiveBox using your preferred method: `docker` / `apt` / `brew` / `pip3` / `nix` / etc. ([see Quickstart below](#quickstart)).** +**📦  Install ArchiveBox using your preferred method: `docker` / `pip` / `apt` / `brew` / etc. ([see full Quickstart below](#quickstart)).**
Quick reference   ⤵️Expand for quick copy-pastable install commands...   ⤵️
mkdir ~/archivebox; cd ~/archivebox    # create a dir somewhere for your archivebox data
 
-# Get ArchiveBox with Docker Compose (recommended): +# Option A: Get ArchiveBox with Docker Compose (recommended): curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml # edit options in this file as-needed docker compose run archivebox init --setup # docker compose run archivebox add 'https://example.com' @@ -86,14 +75,14 @@ docker compose run archivebox init --setup # docker compose up

-# Or use it as a plain Docker container: +# Option B: Or use it as a plain Docker container: docker run -it -v $PWD:/data archivebox/archivebox init --setup # docker run -it -v $PWD:/data archivebox/archivebox add 'https://example.com' # docker run -it -v $PWD:/data archivebox/archivebox help # docker run -it -v $PWD:/data -p 8000:8000 archivebox/archivebox

-# Or install it with your preferred pkg manager (see Quickstart below for apt, brew, and more) +# Option C: Or install it with your preferred pkg manager (see Quickstart below for apt, brew, and more) pip install archivebox archivebox init --setup # archivebox add 'https://example.com' @@ -101,14 +90,14 @@ archivebox init --setup # archivebox server 0.0.0.0:8000

-# Or use the optional auto setup script to install it +# Option D: Or use the optional auto setup script to install it curl -sSL 'https://get.archivebox.io' | sh +
+
+Open http://localhost:8000 to see your server's Web UI ➡️

-Open http://localhost:8000 to see your server's Web UI ➡️ - -


@@ -136,12 +125,23 @@ curl -sSL 'https://get.archivebox.io' | sh ## 🤝 Professional Integration -*[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) if your institution/org wants to use ArchiveBox professionally.* +ArchiveBox is free for everyone to self-host, but we also provide support, security review, and custom integrations to help NGOs and other organizations [run ArchiveBox professionally](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102): -- setup & support, team permissioning, hashing, audit logging, backups, custom archiving etc. -- for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more... +- 🗞️ **Journalists:** + `crawling and collecting research`, `preserving quoted material`, `fact-checking and review` +- ⚖️ **Lawyers:** + `collecting & preserving evidence`, `hashing / integrity checking / chain-of-custody`, `tagging & review` +- 🔬 **Researchers:** + `analyzing social media trends`, `collecting LLM training data`, `crawling to feed other pipelines` +- 👩🏽 **Individuals:** + `saving legacy social media / memoirs`, `preserving portfolios / resume`, `backing up news articles` -*We are a 501(c)(3) nonprofit and all our work goes towards supporting open-source development.* +> ***[Contact our team](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your institution/org wants to use ArchiveBox professionally.* +> +> - setup & support, team permissioning, hashing, audit logging, backups, custom archiving etc. +> - for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more... + +*We are a 🏛️ 501(c)(3) nonprofit and all our work goes towards supporting open-source development.*
@@ -150,6 +150,8 @@ curl -sSL 'https://get.archivebox.io' | sh
+ + # Quickstart **🖥  Supported OSs:** Linux/BSD, macOS, Windows (Docker)   **👾  CPUs:** `amd64` (`x86_64`), `arm64` (`arm8`), `arm7` (raspi>=3)
@@ -168,9 +170,10 @@ curl -sSL 'https://get.archivebox.io' | sh
  • Install Docker on your system (if not already installed).
  • Download the docker-compose.yml file into a new empty directory (can be anywhere).
    mkdir ~/archivebox && cd ~/archivebox
    -curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/docker-compose.yml'
    +# Read and edit docker-compose.yml options as-needed after downloading
    +curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml
     
  • -Run the initial setup and create an admin user.
  • +Run the initial setup to create an admin user (or set ADMIN_USER/PASS in docker-compose.yml)
    docker compose run archivebox init --setup
     
  • Next steps: Start the server then login to the Web UI http://127.0.0.1:8000 ⇢ Admin. @@ -200,6 +203,7 @@ docker run -v $PWD:/data -it archivebox/archivebox init --setup
    docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox
     # completely optional, CLI can always be used without running a server
     # docker run -v $PWD:/data -it [subcommand] [--args]
    +docker run -v $PWD:/data -it archivebox/archivebox help
     
  • @@ -237,7 +241,7 @@ See "Against curl | sh as a
    1. Install Python >= v3.10 and Node >= v18 on your system (if not already installed).
    2. -Install the ArchiveBox package using pip3.
    2. +Install the ArchiveBox package using pip3 (or pipx).
      pip3 install archivebox
       
    5. @@ -251,6 +255,7 @@ archivebox init --setup
      archivebox server 0.0.0.0:8000
       # completely optional, CLI can always be used without running a server
       # archivebox [subcommand] [--args]
      +archivebox help
       
    @@ -262,7 +267,7 @@ See the pip-archive
    -aptitude apt (Ubuntu/Debian) +aptitude apt (Ubuntu/Debian/etc.)
    1. Add the ArchiveBox repository to your sources.
      @@ -286,6 +291,7 @@ archivebox init --setup # if any problems, install with pip instead
      archivebox server 0.0.0.0:8000
       # completely optional, CLI can always be used without running a server
       # archivebox [subcommand] [--args]
      +archivebox help
       
    @@ -296,7 +302,7 @@ See the
    debian-a
    -homebrew brew (macOS) +homebrew brew (macOS only)
    1. Install Homebrew on your system (if not already installed).
    2. @@ -314,6 +320,7 @@ archivebox init --setup # if any problems, install with pip instead
      archivebox server 0.0.0.0:8000
       # completely optional, CLI can always be used without running a server
       # archivebox [subcommand] [--args]
      +archivebox help
       
    @@ -435,7 +442,7 @@ For more discussion on managed and paid hosting options see here: -sqlite3 ./index.sqlite3 # run SQL queries on your index -archivebox shell # explore the Python API in a REPL -ls ./archive/*/index.html # or inspect snapshots on the filesystem +archivebox shell # explore the Python library API in a REPL +sqlite3 ./index.sqlite3 # run SQL queries directly on your index +ls ./archive/*/index.html # or inspect snapshot data directly on the filesystem
    @@ -525,12 +536,16 @@ docker run -v $PWD:/data -it archivebox/archivebox archivebox manage createsuper docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox
    -Optional: Change permissions to allow non-logged-in users
    +Open http://localhost:8000 to see your server's Web UI ➡️
    +
    +Optional: Change permissions to allow non-logged-in users
    
     archivebox config --set PUBLIC_ADD_VIEW=True   # allow guests to submit URLs 
     archivebox config --set PUBLIC_SNAPSHOTS=True  # allow guests to see snapshot content
     archivebox config --set PUBLIC_INDEX=True      # allow guests to see list of all snapshots
    +# or
    +docker compose run archivebox config --set ...
     
     # restart the server to apply any config changes
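The three guest-permission settings in this hunk can be reviewed together before applying them. The following is a minimal sketch, assuming a dry-run style helper: the function name `print_public_access_cmds` is hypothetical and not part of ArchiveBox, and it only prints the `archivebox config --set` commands shown above rather than running them.

```shell
# Hypothetical helper (not part of ArchiveBox): print, without running, the
# config commands that toggle guest access. Pass True or False, and optionally
# a command prefix such as "docker compose run".
print_public_access_cmds() {
    value="$1"
    prefix="${2:-}"
    for key in PUBLIC_ADD_VIEW PUBLIC_SNAPSHOTS PUBLIC_INDEX; do
        # ${prefix:+$prefix } emits the prefix plus a space only when one is given
        echo "${prefix:+$prefix }archivebox config --set ${key}=${value}"
    done
}

# Print the commands that would lock down guest access
print_public_access_cmds False
```

Passing `'docker compose run'` as the second argument prints the Docker Compose variant of each command. As the diff notes, the server needs a restart for config changes to apply.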
     
@@ -697,11 +712,14 @@ CURL_USER_AGENT="Mozilla/5.0 ..." ## Dependencies -To achieve high-fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party tools that specialize in extracting different types of content. +To achieve high-fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party libraries and tools that specialize in extracting different types of content. + +> Under the hood, ArchiveBox uses [Django](https://www.djangoproject.com/start/overview/) to power its [Web UI](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#ui-usage) and [SQLite](https://www.sqlite.org/locrsf.html) + the filesystem to provide [fast & durable metadata storage](https://www.sqlite.org/locrsf.html) w/ [deterministic upgrades](https://stackoverflow.com/a/39976321/2156113). ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and its operation can be [tuned, secured, and extended](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) as-needed for many different applications. +
    -Expand to learn more about ArchiveBox's dependencies...
    +Expand to learn more about ArchiveBox's internals & dependencies...
> *TIP: For better security, easier updating, and to avoid polluting your host system with extra dependencies, **it is strongly recommended to use the [⭐️ official Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker)** with everything pre-installed for the best experience.* @@ -748,8 +766,8 @@ Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not offici ## Archive Layout -All of ArchiveBox's state (including the SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the "ArchiveBox Data Folder". -Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create more than one for different collections. +All of ArchiveBox's state (SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the "ArchiveBox Data Folder". +Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create as many data folders as you want to hold different collections.
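Because a data folder is just ordinary files, it can be inspected with plain shell tools. The sketch below is an illustration only: it assumes the `index.sqlite3` file and `archive/*/index.html` layout referenced elsewhere in this diff, and the helper name `describe_data_dir` is hypothetical, not an ArchiveBox command.

```shell
# Hypothetical helper (not part of ArchiveBox): summarize a data folder by
# checking for the SQLite index and counting per-snapshot index pages.
describe_data_dir() {
    dir="${1:-.}"
    if [ -f "$dir/index.sqlite3" ]; then
        echo "index: $dir/index.sqlite3"
    else
        echo "index: missing"
    fi
    count=0
    for page in "$dir"/archive/*/index.html; do
        # An unmatched glob stays literal, so confirm the file really exists
        if [ -f "$page" ]; then
            count=$((count + 1))
        fi
    done
    echo "snapshots: $count"
}

# Describe the example data folder used in the Quickstart
describe_data_dir "$HOME/archivebox"
```

Running it against several folders is one way to compare separate collections, since each data folder is self-contained.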
    @@ -850,7 +868,7 @@ If you're importing pages with private content or URLs containing secret tokens
    -Click to expand... +Expand to learn about privacy, permissions, and user accounts... ```bash @@ -865,6 +883,7 @@ archivebox config --set SAVE_ARCHIVE_DOT_ORG=False # disable saving all URLs in archivebox config --set PUBLIC_INDEX=False archivebox config --set PUBLIC_SNAPSHOTS=False archivebox config --set PUBLIC_ADD_VIEW=False +archivebox manage createsuperuser # if extra paranoid or anti-Google: archivebox config --set SAVE_FAVICON=False # disable favicon fetching (it calls a Google API passing the URL's domain part only) @@ -894,7 +913,7 @@ Be aware that malicious archived JS can access the contents of other pages in yo
    -Click to expand... +Expand to see risks and mitigations... ```bash @@ -930,7 +949,7 @@ For various reasons, many large sites (Reddit, Twitter, Cloudflare, etc.) active
    -Click to expand... +Click to learn how to set up user agents, cookies, and site logins...
    @@ -953,7 +972,7 @@ ArchiveBox appends a hash with the current date `https://example.com#2020-10-24`
    -Click to expand... +Click to learn how the `Re-Snapshot` feature works...
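The date-suffix idea mentioned in the hunk context (`https://example.com#2020-10-24`) can be sketched in a few lines of shell. This is only an illustration of the URL shape, not ArchiveBox's actual implementation: the helper name `resnapshot_url` is hypothetical, and the `YYYY-MM-DD` format is assumed from the example in the diff.

```shell
# Hypothetical helper (not ArchiveBox's implementation): build a distinct URL
# for a repeat snapshot by appending a date fragment to the original URL.
# The date defaults to today in YYYY-MM-DD form.
resnapshot_url() {
    url="$1"
    stamp="${2:-$(date +%Y-%m-%d)}"
    echo "${url}#${stamp}"
}

resnapshot_url 'https://example.com' '2020-10-24'
# prints: https://example.com#2020-10-24
```

Since the fragment makes each URL string unique, two snapshots of the same page taken on different days do not collide in the index.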
    @@ -981,12 +1000,11 @@ Improved support for saving multiple snapshots of a single URL without this hash ### Storage Requirements -Because ArchiveBox is designed to ingest a large volume of URLs with multiple copies of each URL stored by different 3rd-party tools, it can be quite disk-space intensive. -There also also some special requirements when using filesystems like NFS/SMB/FUSE. +Because ArchiveBox is designed to ingest a large volume of URLs with multiple copies of each URL stored by different 3rd-party tools, it can be quite disk-space intensive. There are also some special requirements when using filesystems like NFS/SMB/FUSE.
    -Click to expand... +Click to learn more about ArchiveBox's filesystem and hosting requirements...
    @@ -1074,7 +1092,7 @@ ArchiveBox aims to enable more of the internet to be saved from deterioration by
    -Click to read more... +Click to read more about why archiving is important and how to do it ethically...
    @@ -1161,10 +1179,10 @@ Our Community Wiki page serves as an index of the broader web archiving communit
    - [Community Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community) + - [Web Archiving Software](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#web-archiving-projects) + _List of ArchiveBox alternatives and open source projects in the internet archiving space._ - [The Master Lists](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#the-master-lists) _Community-maintained indexes of archiving tools and institutions._ - - [Web Archiving Software](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#web-archiving-projects) - _Open source tools and projects in the internet archiving space._ - [Reading List](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#reading-list) _Articles, posts, and blogs relevant to ArchiveBox and web archiving in general._ - [Communities](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#communities) @@ -1181,8 +1199,6 @@ Our Community Wiki page serves as an index of the broader web archiving communit > ✨ **[Hire the team that built Archivebox](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) to work on your project.** ([@ArchiveBoxApp](https://twitter.com/ArchiveBoxApp)) -(We also offer general software consulting across many industries) -
    ---