From 84b6412b784f16dce70d24541fc4eb51093b581e Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Sun, 28 Jan 2024 03:44:24 -0800
Subject: [PATCH] Update README.md

---
 README.md | 49 +++++++++++++++++++++++++------------------------
 1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 0bc5809b..a707005a 100644
--- a/README.md
+++ b/README.md
@@ -642,28 +642,34 @@ It also includes a built-in scheduled import feature with `archivebox schedule`
 ## Output Formats: What ArchiveBox saves for each URL
-Inside each Snapshot folder, ArchiveBox saves these different types of extractor outputs as plain files:
-
-
-
-`./archive/{Snapshot.id}/`
-
-- **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details
-- **Title**, **Favicon**, **Headers** Response headers, site favicon, and parsed site title
-- **SingleFile:** `singlefile.html` HTML snapshot rendered with headless Chrome using SingleFile
-- **Wget Clone:** `example.com/page-name.html` wget clone of the site with `warc/TIMESTAMP.gz`
-- Chrome Headless
-  - **PDF:** `output.pdf` Printed PDF of site using headless chrome
-  - **Screenshot:** `screenshot.png` 1440x900 screenshot of site using headless chrome
-  - **DOM Dump:** `output.html` DOM Dump of the HTML after rendering using headless chrome
-- **Article Text:** `article.html/json` Article text extraction using Readability & Mercury
-- **Archive.org Permalink:** `archive.org.txt` A link to the saved site on archive.org
-- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl (or yt-dlp)
-- **Source Code:** `git/` clone of any repository found on GitHub, Bitbucket, or GitLab links
-- _More coming soon! See the [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap)..._
+Inside each Snapshot folder, ArchiveBox saves many different types of extractor outputs as plain files (e.g. HTML, PDF, PNG, JSON, WARC, etc.).
It does everything out-of-the-box by default, but you can disable or tweak [individual archive methods](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) via environment variables / config.
+
+
+Expand to see the full list of ways ArchiveBox saves each page...
+
+
+
+./archive/{Snapshot.id}/
+
+

 ## Configuration
@@ -1075,10 +1081,6 @@ If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to
-
----
-
-
 paisley graphic
@@ -1201,7 +1203,6 @@ Our Community Wiki page serves as an index of the broader web archiving communit
----
documentation graphic