From dd8ded1a9608e00c4867b728f103ed2ca117263f Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Fri, 30 Jun 2017 02:12:52 -0500
Subject: [PATCH] Update README.md

---
 README.md | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index ec348f3a..6fe6dad4 100644
--- a/README.md
+++ b/README.md
@@ -43,10 +43,11 @@ If you have any trouble, see the [Troubleshooting](#troubleshooting) section at

 ## Manual Setup

-If you don't like `sudo` running random setup scripts off the internet (which you shouldn't), you can follow these manual setup instructions:
+If you don't like `sudo` running random setup scripts off the internet (which you shouldn't), you can follow these manual setup instructions.

-**1. Install dependencies:** `chromium >= 59`,` wget >= 1.16`, `python3 >= 3.5` (google-chrome >= v59 also works well, no need to install chromium if you already have Google Chrome installed)
+**1. Install dependencies:** `chromium >= 59`,` wget >= 1.16`, `python3 >= 3.5` (google-chrome >= v59 also works well)

+If you already have Google Chrome installed, or wish to use that instead of Chromium, follow the [Google Chrome Instructions](#google-chrome-instructions).
 ```bash
 # On Mac:
 brew cask install chromium  # If you already have Google Chrome/Chromium in /Applications/, skip this command
@@ -140,7 +141,7 @@ Make sure you're not running any content as CGI or PHP, you only want to serve s

 Urls look like: `https://sweeting.me/archive/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem`

-## Info
+## Info & Motivation

 This is basically an open-source version of [Pocket Premium](https://getpocket.com/premium) (which you should consider paying for!).
 I got tired of sites I saved going offline or changing their URLS, so I started
@@ -164,7 +165,7 @@ the content on your domain.
 Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons,
 it's up to you to host responsibly and respond to takedown requests appropriately.

-## Google Chrome Instrutions:
+## Google Chrome Instructions:

 I recommend Chromium instead of Google Chrome, since it's open source and doesn't send your data to Google.
 Chromium may have some issues rendering some sites though, so you're welcome to try Google-chrome instead.
@@ -192,6 +193,7 @@ apt update; apt install google-chrome-beta python3 wget
 ```bash
 env CHROME_BINARY=google-chrome ./archive.py ~/Downloads/bookmarks_export.html
 ```
+If you're having any trouble trying to set up Google Chrome or Chromium, see the Troubleshooting section below.

 ## Troubleshooting

@@ -236,6 +238,33 @@ env CHROME_BINARY=/path/from/step/1/chromium-browser ./archive.py bookmarks_expo
 If you're missing `wget` or `curl`, simply install them using `apt` or your package manager of choice.
 See the "Manual Setup" instructions for more details.

+### Archiving
+
+**Lots of skipped sites:**
+
+If you ran the archiver once, it wont re-download sites subsequent times, it will only download new links.
+If you haven't already run it, make sure you have a working internet connection and that the parsed URLs look correct.
+You can check the `archive.py` output or `index.html` to see what links it's downloading.
+
+If you're still having issues, try deleting or moving the `service/archive` folder and running `archive.py` again.
+
+**Lots of errors:**
+
+Make sure you have all the dependencies installed and that you're able to visit the links from your browser normally.
+Open an [issue](https://github.com/pirate/bookmark-archiver/issues) with a description of the errors if you're still having problems.
+
+**Lots of broken links from the index:**
+
+Not all sites can be effectively archived with each method, that's why it's best to use a combination of `wget`, PDFs, and screenshots.
+If it seems like more than 10-20% of sites in the archive are broken, open an [issue](https://github.com/pirate/bookmark-archiver/issues)
+with some of the URLs that failed to be archived and I'll investigate.
+
+### Hosting the Archive
+
+If you're having issues trying to host the archive via nginx, make sure you already have nginx running with SSL.
+If you don't, google around, there are plenty of tutorials to help get that set up. Open an [issue](https://github.com/pirate/bookmark-archiver/issues)
+if you have problem with a particular nginx config.
+
 ## TODO

 - body text extraction using [fathom](https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages/)