From 3ac4c158c472eb838b34945300edea0b940cb0df Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Mon, 31 Dec 2018 20:12:17 -0500
Subject: [PATCH] Switch to wiki for documentation

---
 README.md | 460 ++----------------------------------------------------
 1 file changed, 14 insertions(+), 446 deletions(-)

diff --git a/README.md b/README.md
index d67a7580..b8bc207b 100644
--- a/README.md
+++ b/README.md
@@ -41,454 +41,22 @@ All the saved content is static and indexed with json files, so it lives forever
 [Desktop Screenshot] [Mobile Screenshot]
-## Quickstart
+# Getting Started

-**1. Get your list of URLs:**
+ - [Details & Motivation](https://github.com/pirate/ArchiveBox/wiki)
+ - [Quickstart](https://github.com/pirate/ArchiveBox/wiki/Quickstart)
+ - [Install](https://github.com/pirate/ArchiveBox/wiki/Install)

-Follow the links here to find instructions for exporting a list of URLs from each service.
+# Documentation

- - [Pocket](https://getpocket.com/export)
- - [Pinboard](https://pinboard.in/export/)
- - [Instapaper](https://www.instapaper.com/user/export)
- - [Reddit Saved Posts](https://github.com/csu/export-saved-reddit)
- - [Shaarli](https://shaarli.readthedocs.io/en/master/guides/backup-restore-import-export/#export-links-as)
- - [Unmark.it](http://help.unmark.it/import-export)
- - [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html)
- - [Chrome Bookmarks](https://support.google.com/chrome/answer/96816?hl=en)
- - [Firefox Bookmarks](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer)
- - [Safari Bookmarks](http://i.imgur.com/AtcvUZA.png)
- - [Opera Bookmarks](http://help.opera.com/Windows/12.10/en/importexport.html)
- - [Internet Explorer Bookmarks](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows)
- - Chrome History: `./bin/archivebox-export-browser-history --chrome`
- - Firefox History: `./bin/archivebox-export-browser-history --firefox`
- - Other File or URL: (e.g. RSS feed) pass as the second argument in the next step
+ - [Configuration](https://github.com/pirate/ArchiveBox/wiki/Configuration)
+ - [Chromium Install](https://github.com/pirate/ArchiveBox/wiki/Chromium-Install)
+ - [Publishing Your Archive](https://github.com/pirate/ArchiveBox/wiki/Publishing-Your-Archive)
+ - [Troubleshooting](https://github.com/pirate/ArchiveBox/wiki/Troubleshooting)

-   (If any of these links are broken, please submit an issue and I'll fix it)
+# More Info

-**2. Create your archive:**
-
-```bash
-git clone https://github.com/pirate/ArchiveBox
-cd ArchiveBox/
-./setup    # install all dependencies
-
-# add a list of links from a file
-./archive ~/Downloads/bookmark_export.html    # replace with the path to your export file or URL from step 1
-
-# OR add a list of links from a remote URL
-./archive "https://getpocket.com/users/yourusername/feed/all"    # URL to an RSS, HTML, or JSON links file
-
-# OR add all the links from your browser history
-./bin/archivebox-export-browser-history --chrome    # works with --firefox as well, can take a path to the SQLite history db
-./archive output/sources/chrome_history.json
-
-# OR just continue archiving the existing links in the index
-./archive    # at any point if you just want to continue archiving where you left off, without adding any new links
-```
-
-**3. Done!**
-
-You can open `output/index.html` to view your archive. (Favicons will appear next to each title once they have finished downloading.)
-
-If you want to host your archive somewhere to share it with other people, see the [Publishing Your Archive](#publishing-your-archive) section below.
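If you'd rather check the result over HTTP than open the file directly, a quick sketch using Python's built-in static file server works (Python 3 is already a dependency; the port number is arbitrary):

```bash
# Serve the output folder locally for a quick preview (not needed for normal use).
cd output/
python3 -m http.server 8000    # then visit http://127.0.0.1:8000/index.html
```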
-**4. (Optional) Schedule it to run every day**
-
-You can import links from any local file path or feed URL by changing the second argument to `archive.py`.
-ArchiveBox will ignore links that are imported multiple times; it keeps the earliest version it has seen.
-This means you can add multiple cron jobs to pull links from several different feeds or files each day,
-and it will keep the index up-to-date without duplicate links.
-
-This example archives a Pocket RSS feed and an export file every 24 hours, and saves the output to a logfile.
-```bash
-0 0 * * * yourusername /opt/ArchiveBox/archive https://getpocket.com/users/yourusername/feed/all > /var/log/archivebox_rss.log
-0 0 * * * yourusername /opt/ArchiveBox/archive /home/darth-vader/Desktop/bookmarks.html > /var/log/archivebox_firefox.log
-```
-(Add the above lines to `/etc/crontab`.)
-
-**Next Steps**
-
-If you have any trouble, see the [Troubleshooting](#troubleshooting) section at the bottom.
-If you'd like to customize options, see the [Configuration](#configuration) section.
-
-If you want something easier than running programs in the command line, take a look at [Pocket Premium](https://getpocket.com/premium) (yay Mozilla!) and [Pinboard Pro](https://pinboard.in/upgrade/) (yay independent developer!). Both offer easy-to-use bookmark archiving with full-text search and other features.
-
-## Details
-
-`archive.py` is a script that takes a [Pocket-format](https://getpocket.com/export), [JSON-format](https://pinboard.in/export/), [Netscape-format](https://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx), or RSS-formatted list of links, and downloads a clone of each linked website to turn into a browsable archive that you can store locally or host online.
-
-The archiver produces an output folder `output/` containing an `index.html`, `index.json`, and archived copies of all the sites,
-organized by the timestamp they were bookmarked. It's powered by [headless](https://developers.google.com/web/updates/2017/04/headless-chrome) Chromium and good ol' `wget`.
-
-For each site it saves:
-
- - wget clone of the site, e.g. `en.wikipedia.org/wiki/Example.html` (with `.html` appended if not present)
- - `output.pdf` printed PDF of the site using headless Chrome
- - `screenshot.png` 1440x900 screenshot of the site using headless Chrome
- - `output.html` DOM dump of the HTML after rendering using headless Chrome
- - `archive.org.txt` a link to the saved site on archive.org
- - `audio/` and `video/` for sites like YouTube, SoundCloud, etc. (using youtube-dl) (WIP)
- - `code/` clone of any repository for GitHub, Bitbucket, or GitLab links (WIP)
- - `index.json` JSON index containing link info and archive details
- - `index.html` HTML index containing link info and archive details (optional fancy or simple index)
-
-Wget doesn't work on sites you need to be logged into, but headless Chrome does; see the [Configuration](#configuration) section for `CHROME_USER_DATA_DIR`.
-
-**Large Exports & Estimated Runtime:**
-
-I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
-Those numbers are from running it single-threaded on my i5 machine with 50Mbps down. YMMV.
-
-You can run it in parallel by using the `resume` feature, or by manually splitting export.html into multiple files:
-```bash
-./archive export.html 1498800000 &  # second argument is the timestamp to resume downloading from
-./archive export.html 1498810000 &
-./archive export.html 1498820000 &
-./archive export.html 1498830000 &
-```
-Users have reported running it successfully with 50k+ bookmarks (though it will take more RAM while running).
-
-If you already imported a huge list of bookmarks and want to import only new
-bookmarks, you can use the `ONLY_NEW` environment variable. This is useful if
-you want to import a bookmark dump periodically and want to skip broken links
-which are already in the index.
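For example (the export filename is just a placeholder; this uses the same `env VAR=value` pattern shown in the Configuration section below):

```bash
# Re-import the full bookmark dump, but only archive links that aren't already in the index.
env ONLY_NEW=True ./archive ~/Downloads/bookmark_export.html
```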
-
-## Configuration
-
-You can tweak parameters via environment variables, or by editing `config.py` directly:
-```bash
-env CHROME_BINARY=google-chrome-stable RESOLUTION=1440,900 FETCH_PDF=False ./archive ~/Downloads/bookmarks_export.html
-```
-
-**Shell Options:**
- - colorize console output: `USE_COLOR` values: [`True`]/`False`
- - show progress bar: `SHOW_PROGRESS` values: [`True`]/`False`
- - archive permissions: `OUTPUT_PERMISSIONS` values: [`755`]/`644`/`...`
-
-**Dependency Options:**
- - path to Chrome: `CHROME_BINARY` values: [`chromium-browser`]/`/usr/local/bin/google-chrome`/`...`
- - path to wget: `WGET_BINARY` values: [`wget`]/`/usr/local/bin/wget`/`...`
-
-**Archive Options:**
- - maximum allowed download time per link: `TIMEOUT` values: [`60`]/`30`/`...`
- - import only new links: `ONLY_NEW` values: `True`/[`False`]
- - archive methods (values: [`True`]/`False`):
-   - fetch page with wget: `FETCH_WGET`
-   - fetch images/css/js with wget: `FETCH_WGET_REQUISITES` (True is highly recommended)
-   - print page as PDF: `FETCH_PDF`
-   - fetch a screenshot of the page: `FETCH_SCREENSHOT`
-   - fetch a DOM dump of the page: `FETCH_DOM`
-   - fetch a favicon for the page: `FETCH_FAVICON`
-   - submit the page to archive.org: `SUBMIT_ARCHIVE_DOT_ORG`
- - screenshot: `RESOLUTION` values: [`1440,900`]/`1024,768`/`...`
- - user agent: `WGET_USER_AGENT` values: [`Wget/1.19.1`]/`"Mozilla/5.0 ..."`/`...`
- - chrome profile: `CHROME_USER_DATA_DIR` values: [`~/Library/Application\ Support/Google/Chrome/Default`]/`/tmp/chrome-profile`/`...`
-   To capture sites that require a user to be logged in, you must specify a path to a Chrome profile (which loads the cookies needed for the user to be logged in). If you don't have an existing Chrome profile, create one with `chromium-browser --disable-gpu --user-data-dir=/tmp/chrome-profile`, and log into the sites you need. Then set `CHROME_USER_DATA_DIR=/tmp/chrome-profile` to make ArchiveBox use that profile.
- - output directory: `OUTPUT_DIR` values: [`$REPO_DIR/output`]/`/srv/www/bookmarks`/`...` Optionally output the archives to an alternative directory.
-
- (See defaults & more at the top of `config.py`.)
-
-To tweak the generated HTML index's look and feel, just edit the HTML files in `archiver/templates/`.
-
-The Chrome/Chromium dependency is _optional_ and only required for screenshot, PDF, and DOM dump output; it can be safely ignored if those three methods are disabled.
-
-## Publishing Your Archive
-
-The archive produced by `./archive` is suitable for serving on any provider that can host static HTML (e.g. GitHub Pages!).
-
-You can also serve it from a home server or VPS by uploading the `output` folder to your web directory, e.g. `/var/www/ArchiveBox`, and configuring your webserver.
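One possible way to get it there (a sketch only; rsync, the server name, and the destination path are assumptions, so substitute whatever upload method and web root you actually use):

```bash
# Copy the static archive up to the web root on your server.
rsync -av output/ you@your-server.example.com:/var/www/ArchiveBox/
```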
-Here's a sample nginx configuration that works to serve archive folders:
-
-```nginx
-location / {
-    alias       /path/to/ArchiveBox/output/;
-    index       index.html;
-    autoindex   on;              # see directory listing upon clicking "The Files" links
-    try_files   $uri $uri/ =404;
-}
-```
-
-Make sure you're not running any content as CGI or PHP; you only want to serve static files!
-
-URLs look like: `https://archive.example.com/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem.html`
-
-**Security WARNING & Content Disclaimer**
-
-Re-hosting other people's content has security implications for any other sites sharing your hosting domain. Make sure you understand
-the dangers of hosting unknown archived CSS & JS files [on your shared domain](https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy).
-Due to the security risk of serving some malicious JS you archived by accident, it's best to put this on a domain or subdomain
-of its own to keep cookies separate and slightly mitigate [CSRF attacks](https://en.wikipedia.org/wiki/Cross-site_request_forgery) and other nastiness.
-
-You may also want to blacklist your archive in `/robots.txt` if you don't want to be publicly associated with all the links you archive via search engine results.
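For example, a minimal sketch that opts the whole archive out of search-engine indexing, assuming the `output` folder is served as the web root as in the nginx example above:

```bash
# Create a robots.txt next to index.html that asks all crawlers to skip the archive.
cat > output/robots.txt <<'EOF'
User-agent: *
Disallow: /
EOF
```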
-Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons;
-it's up to you to host responsibly and respond to takedown requests appropriately.
-
-Please modify the `FOOTER_INFO` config variable to add your contact info to the footer of your index.
-
-## Info & Motivation
-
-This is basically an open-source version of [Pocket Premium](https://getpocket.com/premium) (which you should consider paying for!).
-I got tired of sites I saved going offline or changing their URLs, so I started
-archiving a copy of them locally, similar to the Wayback Machine provided
-by [archive.org](https://archive.org). Self-hosting your own archive allows you to save
-PDFs & screenshots of dynamic sites in addition to static HTML, something archive.org doesn't do.
-
-Now I can rest soundly knowing important articles and resources I like won't disappear from the internet.
-
-My published archive, as an example: [archive.sweeting.me](https://archive.sweeting.me).
-
-## Manual Setup
-
-If you don't like running random setup scripts off the internet (:+1:), you can follow these manual setup instructions.
-
-**1. Install dependencies:** `chromium >= 59`, `wget >= 1.16`, `python3 >= 3.5` (`google-chrome >= v59` works fine as well)
-
-If you already have Google Chrome installed, or wish to use that instead of Chromium, follow the [Google Chrome Instructions](#google-chrome-instructions).
-
-```bash
-# On Mac:
-brew cask install chromium  # If you already have Google Chrome/Chromium in /Applications/, skip this command
-brew install wget python3
-
-echo -e '#!/bin/bash\n/Applications/Chromium.app/Contents/MacOS/Chromium "$@"' > /usr/local/bin/chromium-browser  # see instructions for google-chrome below
-chmod +x /usr/local/bin/chromium-browser
-```
-
-```bash
-# On Ubuntu/Debian:
-apt install chromium-browser python3 wget
-```
-
-```bash
-# Check that everything worked:
-chromium-browser --version && which wget && which python3 && which curl && echo "[√] All dependencies installed."
-```
-
-**2. Get your bookmark export file:**
-
-Follow the instruction links above in the "Quickstart" section to download your bookmarks export file.
-
-**3. Run the archive script:**
-
-1. Clone this repo: `git clone https://github.com/pirate/ArchiveBox`
-2. `cd ArchiveBox/`
-3. `./archive ~/Downloads/bookmarks_export.html`
-
-You may optionally specify a second argument, e.g. `./archive export.html 153242424324`, to resume the archive update at a specific timestamp.
-
-If you have any trouble, see the [Troubleshooting](#troubleshooting) section at the bottom.
-
-### Google Chrome Instructions:
-
-I recommend Chromium instead of Google Chrome, since it's open source and doesn't send your data to Google.
-Chromium may have some issues rendering some sites though, so you're welcome to try Google Chrome instead.
-It's also easier to use Google Chrome if you already have it installed, rather than downloading Chromium as well.
-
-1. Install & link google-chrome
-```bash
-# On Mac:
-# If you already have Google Chrome in /Applications/, skip this brew command
-brew cask install google-chrome
-brew install wget python3
-
-echo -e '#!/bin/bash\n/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome "$@"' > /usr/local/bin/google-chrome
-chmod +x /usr/local/bin/google-chrome
-```
-
-```bash
-# On Linux:
-wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
-sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
-apt update; apt install google-chrome-beta python3 wget
-```
-
-2. Set the environment variable `CHROME_BINARY` to `google-chrome` before running:
-
-```bash
-env CHROME_BINARY=google-chrome ./archive ~/Downloads/bookmarks_export.html
-```
-If you're having any trouble trying to set up Google Chrome or Chromium, see the Troubleshooting section below.
-
-## Troubleshooting
-
-### Dependencies
-
-**Python:**
-
-On some Linux distributions the python3 package might not be recent enough.
-If this is the case for you, resort to installing a recent enough version manually:
-```bash
-add-apt-repository ppa:fkrull/deadsnakes && apt update && apt install python3.6
-```
-If you still need help, [the official Python docs](https://docs.python.org/3.6/using/unix.html) are a good place to start.
-
-**Chromium/Google Chrome:**
-
-`archive.py` depends on being able to access a `chromium-browser`/`google-chrome` executable. The executable used
-defaults to `chromium-browser` but can be manually specified with the environment variable `CHROME_BINARY`:
-
-```bash
-env CHROME_BINARY=/usr/local/bin/chromium-browser ./archive ~/Downloads/bookmarks_export.html
-```
-
-1. Test to make sure you have Chrome on your `$PATH` with:
-
-```bash
-which chromium-browser || which google-chrome
-```
-If no executable is displayed, follow the setup instructions to install and link one of them.
-
-2. If a path is displayed, the next step is to check that it's runnable:
-
-```bash
-chromium-browser --version || google-chrome --version
-```
-If no version is displayed, try the setup instructions again, or confirm that you have permission to access Chrome.
-
-3. If a version is displayed and it's `<59`, upgrade it:
-
-```bash
-apt upgrade chromium-browser -y
-# OR
-brew cask upgrade chromium-browser
-```
-
-4. If a version is displayed and it's `>=59`, make sure `archive.py` is running the right one:
-
-```bash
-env CHROME_BINARY=/path/from/step/1/chromium-browser ./archive bookmarks_export.html  # replace the path with the one you got from step 1
-```
-
-**Wget & Curl:**
-
-If you're missing `wget` or `curl`, simply install them using `apt` or your package manager of choice.
-See the "Manual Setup" instructions for more details.
-
-If wget times out or randomly fails to download some sites that you have confirmed are online,
-upgrade wget to the most recent version with `brew upgrade wget` or `apt upgrade wget`. There is
-a bug in versions `<=1.19.1_1` that caused wget to fail for perfectly valid sites.
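To check which wget you're actually running before upgrading (a quick sanity check; the exact version string varies slightly between builds):

```bash
# Print the installed wget version; upgrade if it reports 1.19.1 or older.
wget --version | head -n1
```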
-### Archiving
-
-**No links parsed from export file:**
-
-Please open an [issue](https://github.com/pirate/ArchiveBox/issues) with a description of where you got the export, and
-preferably your export file attached (you can redact the links). We'll fix the parser to support your format.
-
-**Lots of skipped sites:**
-
-If you ran the archiver once, it won't re-download sites on subsequent runs; it will only download new links.
-If you haven't already run it, make sure you have a working internet connection and that the parsed URLs look correct.
-You can check the `archive.py` output or `index.html` to see what links it's downloading.
-
-If you're still having issues, try deleting or moving the `output/archive` folder (back it up first!) and running `./archive` again.
-
-**Lots of errors:**
-
-Make sure you have all the dependencies installed and that you're able to visit the links from your browser normally.
-Open an [issue](https://github.com/pirate/ArchiveBox/issues) with a description of the errors if you're still having problems.
-
-**Lots of broken links from the index:**
-
-Not all sites can be effectively archived with each method; that's why it's best to use a combination of `wget`, PDFs, and screenshots.
-If it seems like more than 10-20% of sites in the archive are broken, open an [issue](https://github.com/pirate/ArchiveBox/issues)
-with some of the URLs that failed to be archived and I'll investigate.
-
-**Removing unwanted links from the index:**
-
-If you accidentally added lots of unwanted links to the index and they slow down your archiving, you can use the `bin/purge` script to remove them from your index. It removes everything matching the Python regexes you pass to it, e.g. `bin/purge -r 'amazon\.com' -r 'google\.com'`. It will prompt before removing links from the index, but for extra safety you might want to back up `index.json` first (or put it under version control).
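For example, the back-up-first workflow might look like this (a sketch; the regex is just a placeholder for whatever links you want to drop, and it assumes you run it from the repo root with the default `output/` directory):

```bash
# Keep a copy of the index before letting bin/purge rewrite it.
cp output/index.json output/index.json.bak
bin/purge -r 'example\.com'   # prompts before removing matching links from the index
```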
-### Hosting the Archive
-
-If you're having issues trying to host the archive via nginx, make sure you already have nginx running with SSL.
-If you don't, google around; there are plenty of tutorials to help get that set up. Open an [issue](https://github.com/pirate/ArchiveBox/issues)
-if you have problems with a particular nginx config.
-
-## Links
-
-**Similar Projects:**
- - [Reminiscence](https://github.com/kanishka-linux/reminiscence/) extremely similar to ArchiveBox, uses a Django backend + UI and provides auto-tagging and summary features with NLTK
- - [Memex by Worldbrain.io](https://github.com/WorldBrain/Memex) a browser extension that saves all your history and does full-text search
- - [Hypothes.is](https://web.hypothes.is/) a web/pdf/ebook annotation tool that also archives content
- - [Perkeep](https://perkeep.org/) "Perkeep lets you permanently keep your stuff, for life."
- - [Fetching.io](http://fetching.io/) a personal search engine/archiver that lets you search through all archived websites that you've bookmarked
- - [Shaarchiver](https://github.com/nodiscc/shaarchiver) very similar project that archives Firefox, Shaarli, or Delicious bookmarks and all linked media, generating a markdown/HTML index
- - [Webrecorder.io](https://webrecorder.io/) save full browsing sessions and archive all the content
- - [Wallabag](https://wallabag.org) save articles you read locally or on your phone
- - [Archivematica](https://github.com/artefactual/archivematica) web GUI for institutional long-term archiving of web and other content
-
-**Discussions:**
- - [Hacker News Discussion](https://news.ycombinator.com/item?id=14272133)
- - [Reddit r/selfhosted Discussion](https://www.reddit.com/r/selfhosted/comments/69eoi3/pocket_stream_archive_your_own_personal_wayback/)
- - [Reddit r/datahoarder Discussion #1](https://www.reddit.com/r/DataHoarder/comments/69e6i9/archive_a_browseable_copy_of_your_saved_pocket/)
- - [Reddit r/datahoarder Discussion #2](https://www.reddit.com/r/DataHoarder/comments/6kepv6/bookmarkarchiver_now_supports_archiving_all_major/)
-
-**Tools/Other:**
- - https://github.com/ikreymer/webarchiveplayer#auto-load-warcs
- - [Sheetsee-Pocket](http://jlord.us/sheetsee-pocket/) project that provides a pretty auto-updating index of your Pocket links (without archiving them)
- - [Pocket -> IFTTT -> Dropbox](https://christopher.su/2013/saving-pocket-links-file-day-dropbox-ifttt-launchd/) post by Christopher Su on his Pocket-saving IFTTT recipe
-
-## Roadmap
-
-[*Official Roadmap*](https://github.com/pirate/ArchiveBox/issues/120)
-
-If you feel like contributing a PR, some of these tasks are pretty easy. Feel free to open an issue if you need help getting started in any way!
-
-**Major upcoming changes:**
-
- - finalize Python packaging to allow installing via pip and importing individual components
- - add an optional web GUI for managing sources, adding new links, and viewing the archive
-
-**Minor upcoming changes:**
- - download closed-captions text from YouTube videos
- - body text extraction using [fathom](https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages/)
- - auto-tagging based on important extracted words
- - audio & video archiving with `youtube-dl`
- - full-text indexing with elasticsearch/elasticlunr/ag
- - video closed-caption downloading on YouTube for full-text indexing of video content
- - automatic text summaries of articles with an NLP summarization library
- - featured image extraction
- - http support (from my https-only domain)
- - try wgetting dead sites from archive.org (https://github.com/hartator/wayback-machine-downloader)
-## Changelog
-
- - v0.2.0 released with new name
- - [renamed](https://github.com/pirate/ArchiveBox/issues/108) from **Bookmark Archiver** -> **ArchiveBox**
- - v0.1.0 released
- - support for browser history exporting added with `./bin/archivebox-export-browser-history`
- - support for Chrome `--dump-dom` to output full page HTML after JS executes
- - v0.0.3 released
- - support for Chrome `--user-data-dir` to archive sites that need logins
- - fancy individual html & json indexes for each link
- - smartly append new links to the existing index instead of overwriting
- - v0.0.2 released
- - proper HTML templating instead of format strings (thanks to https://github.com/bardisty!)
- - refactored into separate files, WIP audio & video archiving
- - v0.0.1 released
- - Index links now work without nginx URL rewrites; the archive can now be hosted on GitHub Pages
- - added setup.sh script & docstrings & help commands
- - made Chromium the default instead of Google Chrome (yay free software)
- - added [env-variable](https://github.com/pirate/ArchiveBox/pull/25) configuration (thanks to https://github.com/hannah98!)
- - renamed from **Pocket Archive Stream** -> **Bookmark Archiver**
- - added [Netscape-format](https://github.com/pirate/ArchiveBox/pull/20) export support (thanks to https://github.com/ilvar!)
- - added [Pinboard-format](https://github.com/pirate/ArchiveBox/pull/7) export support (thanks to https://github.com/sconeyard!)
- - front page of HN, oops! apparently I have users to support now :grin:?
- - added Pocket-format export support
- - v0.0.0 released: created Pocket Archive Stream 2017/05/05
-
-## Donations
-
-https://www.patreon.com/theSquashSH
-
-If you want to help sponsor this project long-term, or just say thanks or suggest changes, contact me at bookmark-archiver@sweeting.me.
-
-[Other Grants / Donations Info](https://github.com/pirate/ArchiveBox/blob/master/DONATE.md)
+ - [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap)
+ - [Changelog](https://github.com/pirate/ArchiveBox/wiki/Changelog)
+ - [Donations](https://github.com/pirate/ArchiveBox/wiki/Donations)
+ - [Web Archiving Community](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community)