
collapse README sections to reduce length and link to PUID PGID and root_squash info

This commit is contained in:
Nick Sweeting 2024-01-04 12:30:21 -08:00
parent 6f1d4e477b
commit 64bfd7667e


@@ -564,12 +564,22 @@ MAX_MEDIA_SIZE=1500m # default: 750m raise/lower youtubedl output size
PUBLIC_INDEX=True # default: True whether anon users can view index
PUBLIC_SNAPSHOTS=True # default: True whether anon users can view pages
PUBLIC_ADD_VIEW=False # default: False whether anon users can add new URLs
CHROME_USER_AGENT="Mozilla/5.0 ..." # change these to get around bot blocking
WGET_USER_AGENT="Mozilla/5.0 ..."
CURL_USER_AGENT="Mozilla/5.0 ..."
```
<br/>
## Dependencies
To achieve high-fidelity archives in as many situations as possible, ArchiveBox depends on a variety of high-quality 3rd-party tools and libraries that specialize in extracting different types of content.
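To check which of these dependencies ArchiveBox has detected on your system, you can run the `archivebox version` command, which prints a dependency report alongside the version info (a quick sketch; the exact output format varies between versions):

```bash
archivebox version    # prints the ArchiveBox version and the install status of each optional dependency
```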
<br/>
<details>
<summary><i>Expand to learn more about ArchiveBox's dependencies...</i></summary>
To improve security, simplify updates, and avoid polluting your host system with extra dependencies, **it is strongly recommended to use the official [Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker)**, which comes with everything pre-installed for the best experience.
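For reference, a minimal sketch of running ArchiveBox via the official Docker image (assumes Docker is installed; see the Docker wiki page above for the full docker-compose setup):

```bash
mkdir -p ~/archivebox && cd ~/archivebox
docker run -v $PWD:/data -it archivebox/archivebox init                          # create the data folder
docker run -v $PWD:/data -it archivebox/archivebox add 'https://example.com'     # archive a URL
docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox server 0.0.0.0:8000  # serve the Web UI
```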
These optional dependencies used for archiving sites include:
@@ -601,12 +611,18 @@ Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not offici
For detailed information about upgrading ArchiveBox and its dependencies, see: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives
</details>
<br/>
## Archive Layout
All of ArchiveBox's state (including the index, snapshot data, and config file) is stored in a single folder called the "ArchiveBox data folder". All `archivebox` CLI commands must be run from inside this folder, which you create the first time by running `archivebox init`.
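For example, a minimal sketch of creating a new data folder and running commands inside it:

```bash
mkdir ~/archivebox && cd ~/archivebox   # create an empty folder anywhere you like
archivebox init                         # turn it into an ArchiveBox data folder
archivebox add 'https://example.com'    # all other commands are run from inside it
archivebox status                       # e.g. show info about the collection
```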
<br/>
<details>
<summary><i>Expand to learn more about the layout of ArchiveBox's data on-disk...</i></summary>
The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard `index.sqlite3` database in the root of the data folder (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the `./archive/` subfolder.
<img src="https://user-images.githubusercontent.com/511499/117453293-c7b91600-af12-11eb-8a3f-aa48b0f9da3c.png" width="400px" align="right">
@@ -630,12 +646,17 @@ The on-disk layout is optimized to be easy to browse by hand and durable long-te
Each snapshot subfolder `./archive/<timestamp>/` includes a static `index.json` and `index.html` describing its contents, and the snapshot extractor outputs are plain files within the folder.
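For illustration, a hypothetical snapshot folder might look like this (the timestamp and file names below are examples only; the exact contents depend on which extractors are enabled):

```bash
ls ./archive/1617687755/
# index.json  index.html  singlefile.html  output.pdf  screenshot.png
# warc/  media/  git/  favicon.ico  headers.json
```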
</details>
<br/>
## Static Archive Exporting
You can export the main index to browse it statically without needing to run a server.
You can export the main index to browse it statically as plain HTML files in a folder (without needing to run a server).
<br/>
<details>
<summary><i>Expand to learn how to export your ArchiveBox collection...</i></summary>
> **Note**
> These exports are not paginated, so exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.
@@ -652,6 +673,7 @@ archivebox list --csv=timestamp,url,title > index.csv # export to csv spreadshe
The paths in the static exports are relative; make sure to keep them next to your `./archive` folder when backing them up or viewing them.
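If you just want to browse a static export locally without the ArchiveBox server, any plain file server works; e.g. a quick sketch using Python's built-in one (assumes you exported `index.html` into the data folder root, next to `./archive/`):

```bash
cd ~/archivebox              # the data folder containing index.html and ./archive/
python3 -m http.server 8001
# then open http://127.0.0.1:8001/index.html in your browser
```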
</details>
<br/>
@@ -667,6 +689,10 @@ The paths in the static exports are relative, make sure to keep them next to you
<a id="archiving-private-urls"></a>
<br/>
<details>
<summary><i>Click to expand...</i></summary>
If you're importing pages with private content or URLs containing secret tokens you don't want public (e.g. Google Docs, paywalled content, unlisted videos, etc.), **you may want to disable some of the extractor methods to avoid leaking that content to 3rd-party APIs or the public**.
```bash
@@ -687,8 +713,16 @@ archivebox config --set SAVE_FAVICON=False # disable favicon fetching (
archivebox config --set CHROME_BINARY=chromium # ensure it's using Chromium instead of Chrome
```
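A few more options you may want to flip for private collections (a non-exhaustive sketch; the option names come from the config reference above and the Configuration wiki, so double-check them against your version):

```bash
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False  # don't submit URLs to Archive.org
archivebox config --set PUBLIC_INDEX=False          # require login to view the index
archivebox config --set PUBLIC_SNAPSHOTS=False      # require login to view snapshot pages
archivebox config --set PUBLIC_ADD_VIEW=False       # require login to add new URLs
```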
</details>
<br/>
### Security Risks of Viewing Archived JS
<br/>
<details>
<summary><i>Click to expand...</i></summary>
Be aware that malicious archived JS can access the contents of other pages in your archive when viewed. Because the Web UI serves all viewed snapshots from a single domain, they share a request context and **typical CSRF/CORS/XSS/CSP protections do not work to prevent cross-site request attacks**. See the [Security Overview](https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#stealth-mode) page and [Issue #239](https://github.com/ArchiveBox/ArchiveBox/issues/239) for more details.
```bash
@@ -705,8 +739,15 @@ The admin UI is also served from the same origin as replayed JS, so malicious pa
*Note: Only the `wget` & `dom` extractor methods execute archived JS when viewing snapshots; all other archive methods produce static output that does not execute JS on viewing. If you are worried about the issues above, you should disable these extractors using `archivebox config --set SAVE_WGET=False SAVE_DOM=False`.*
</details>
<br/>
### Saving Multiple Snapshots of a Single URL
<br/>
<details>
<summary><i>Click to expand...</i></summary>
First-class support for saving multiple snapshots of each site over time will be [added eventually](https://github.com/ArchiveBox/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs). For now **ArchiveBox is designed to only archive each unique URL with each extractor type once**. The workaround to take multiple snapshots of the same URL is to make them slightly different by adding a hash:
```bash
@@ -717,12 +758,22 @@ archivebox add 'https://example.com#2020-10-25'
The <img src="https://user-images.githubusercontent.com/511499/115942091-73c02300-a476-11eb-958e-5c1fc04da488.png" alt="Re-Snapshot Button" height="24px"/> button in the Admin UI is a shortcut for this hash-date workaround.
</details>
<br/>
### Storage Requirements
<br/>
<details>
<summary><i>Click to expand...</i></summary>
Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. **ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles**, mostly depending on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower the default `MEDIA_MAX_SIZE=750m` limit.
Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind). **Don't store large collections on older filesystems like EXT3/FAT** as they may not be able to handle more than 50k directory entries in the `archive/` folder. **Try to keep the `index.sqlite3` file on a local drive (not a network mount)** or SSD for maximum performance; however, the `archive/` folder can be on a network mount or slower HDD.
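For example, a rough sketch of config changes that cut down disk usage (the values are illustrative; adjust them to taste):

```bash
archivebox config --set SAVE_MEDIA=False       # skip audio & video entirely, or...
archivebox config --set MEDIA_MAX_SIZE=250m    # ...keep media but cap its size per URL
archivebox config --set SAVE_WGET=False        # turn off extractor outputs you don't need
archivebox config --set SAVE_WARC=False
```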
If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server.
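For example, with Docker you can pass the UID/GID that ArchiveBox should write files as (a hedged example; see the PUID & PGID wiki page linked above for the details that apply to your setup):

```bash
docker run -e PUID=1000 -e PGID=1000 -v $PWD:/data -it archivebox/archivebox init
```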
</details>
<br/>
---