From dd17ad61762875c993ddb9571470f1fa64458e22 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 8 Apr 2021 10:26:23 -0400
Subject: [PATCH] Update README.md

---
 README.md | 50 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 9c0d50f7..6c13ee2c 100644
--- a/README.md
+++ b/README.md
@@ -377,10 +377,38 @@ It also includes a built-in scheduled import feature with `archivebox schedule`
-## Output formats
+### Archive Layout
 
 All of ArchiveBox's state (including the index, snapshot data, and config file) is stored in a single folder called the "ArchiveBox data folder". All `archivebox` CLI commands must be run from inside this folder, and you first create it by running `archivebox init`.
 
+The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard `index.sqlite3` database in the root of the data folder (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the `./archive/` subfolder.
+
+```bash
+tree .
+./
+    index.sqlite3
+    ArchiveBox.conf
+    archive/
+        ...
+        1617687755/
+            index.html
+            index.json
+            screenshot.png
+            media/some_video.mp4
+            warc/1617687755.warc.gz
+            git/somerepo.git
+            ...
+```
+
+Each snapshot subfolder `./archive//` includes a static `index.json` and `index.html` describing its contents, and the snapshot extractor outputs are plain files within the folder.
+
+
+
+## Output formats
+
+Inside each Snapshot folder, ArchiveBox saves these different types of extractor outputs as plain files:
+
+`./archive//`
 
 - **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details
 - **Title**, **Favicon**, **Headers** Response headers, site favicon, and parsed site title
@@ -405,17 +433,27 @@ archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
 archivebox config --set YOUTUBEDL_ARGS='--max-filesize=500m'
 ```
 
-The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard sqlite3 database (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the `archive/` subfolder. Each snapshot subfolder includes a static JSON and HTML index describing its contents, and the snapshot extrator outputs are plain files within the folder (e.g. `media/example.mp4`, `git/somerepo.git`, `static/someimage.png`, etc.)
+
-```bash
-# to browse your index statically without running the archivebox server, run:
-archivebox list --html --with-headers > index.html     # open index.html to view
-archivebox list --json --with-headers > index.json
+## Static Archive Exporting
+
+You can export the main index to browse it statically without the Web UI.
+
+*Note about large exports: These exports are not paginated, so exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export only certain Snapshots or chunks at a time.*
+
+```bash
+# archivebox list --help
+
+archivebox list --html --with-headers > index.html     # export to static html table
+archivebox list --json --with-headers > index.json     # export to static json blob
+archivebox list --csv --with-headers > index.csv       # export to static csv table
 
 # (if using docker-compose, add the -T flag when piping)
 docker-compose run -T archivebox list --csv > index.csv
 ```
 
+The paths in the static exports are relative, so make sure to keep the exported files next to your `./archive` folder when backing them up or viewing them.
+
 ## Dependencies