Create ZIM from OSE Wiki

From Open Source Ecology
Revision as of 09:22, 28 May 2018 by Christian (talk | contribs)

Warning: This page documents the administration of this wiki. You may use this information to understand the tools involved; however, to actually create a .zim file from this wiki, you need the approval and collaboration of the Server Admin.

The ZIM format is an open source format for storing web pages in their entirety, with a focus on Wikipedia and MediaWiki pages. For more info, look here.
To create a ZIM file yourself, you need to scrape the web page and download all necessary dependencies. There is a handful of programs capable of doing that; we will use mwoffliner, as it seems to be the most advanced option.
Before we go into scraping, note that the OSE wiki is not written for scraping: as you can see in the robots.txt, the limited resources of this project have led to firm protections against scraping and DoS attacks. Unfortunately, no ZIM scraping tool out there can be throttled to fit those limits, so we need a workaround.
This is why the process actually has two steps: create a copy of the OSE wiki in a safe environment, then set up and start the scraper against that copy, which can be scraped at any pace. We will start by creating the copy.
We will assume a Debian environment (any derivative will do). Some of the programs have to be compiled from scratch, partly with poor documentation, so using another distribution is probably possible but not advisable.

Securing the Server

If the server is accessible from the internet in any way (for example as a virtual host), it is absolutely mandatory to protect it against external threats, to assure the security of the private data of OSE and its members. To do that, close all external access except the SSH access used for the further setup. Also make sure to protect that access; some ideas for that are found here. Default SSH ports that are open to the web are targets of hundreds of attacks per day, so use countermeasures.
The easiest way to secure the remaining access is to use iptables to block everything else:

   sudo apt-get install iptables
   # Allow loopback traffic (the wiki copy, mysql, redis and parsoid all
   # talk over 127.0.0.1) and keep the current SSH session alive:
   sudo iptables -A INPUT -i lo -j ACCEPT
   sudo iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
   sudo iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
   # Only now set the default policy to DROP, or you would lock yourself out:
   sudo iptables -P INPUT DROP

Depending on your setup, you may have to change the 22 to the port your SSH daemon is running on, and change eth0 to the network device your system is using (ifconfig should give you that information). Better check this BEFORE setting it up.
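Both values can be looked up in advance; a minimal sketch, assuming a stock Debian sshd config path and using the usual defaults (22, eth0) as fallbacks:

```shell
# Configured SSH port; falls back to the default 22 if sshd_config
# has no explicit Port line (or the file is unreadable).
port=$(awk '/^[[:space:]]*Port[[:space:]]/ {print $2}' /etc/ssh/sshd_config 2>/dev/null)
port=${port:-22}

# Interface carrying the default route; falls back to eth0.
iface=$(ip route 2>/dev/null | awk '/^default/ {print $5; exit}')
iface=${iface:-eth0}

echo "ssh port: $port, interface: $iface"
```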
If you'd like to set up the security differently, your setup has to make sure the following services are not reachable from the outside:
A redis server
A mysql server
apache
parsoid (running on port 8000)

Setup OSEWiki from a Backup

For this step you need a backup of the entire OSE wiki. This backup consists of an SQL dump and the document root.
Warning! Such a backup contains sensitive information! Store it in a safe place, and keep it only as long as you need it! Make sure that the web server running this snapshot can only be accessed from your local network; it should not be exposed to the public internet.
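One easy precaution on a multi-user machine is to make the backup unreadable to anyone but you; a quick sketch (the file name ose-backup.tar.gz is just an example):

```shell
# Stand-in for the real backup archive:
touch ose-backup.tar.gz
# Owner-only read/write; no group or world access.
chmod 600 ose-backup.tar.gz
ls -l ose-backup.tar.gz
```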
On your Server, you will need a default LAMP setup, so install mysql, apache and php, like this:

   sudo apt-get install php php-mysql mysql-server apache2 libapache2-mod-php php-xml php-mbstring

Now place the backup of the document root; in it you can find LocalSettings.php. This file describes the database setup; as we need to recreate the database, search for DATABASE SETTINGS. There you'll find the name of the database, the username and the password. You'll need them to restore the dump.
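The values can also be pulled out with sed instead of reading the file by hand. A sketch against a minimal mock of the DATABASE SETTINGS block (point the sed commands at the real LocalSettings.php; the pattern assumes the usual double-quoted MediaWiki style, and all values below are placeholders):

```shell
# Minimal mock of the DATABASE SETTINGS section (placeholder values):
cat > LocalSettings.sample.php <<'EOF'
$wgDBname = "osewiki";
$wgDBuser = "wikiuser";
$wgDBpassword = "secret";
EOF

# Extract each value; swap in the path to your real LocalSettings.php.
db_name=$(sed -n 's/^\$wgDBname *= *"\(.*\)";/\1/p' LocalSettings.sample.php)
db_user=$(sed -n 's/^\$wgDBuser *= *"\(.*\)";/\1/p' LocalSettings.sample.php)
db_pass=$(sed -n 's/^\$wgDBpassword *= *"\(.*\)";/\1/p' LocalSettings.sample.php)
echo "db=$db_name user=$db_user"
```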
Now, restore the dump in the following fashion:

   sudo su
   mysql -u root
   create database [database_name];
   grant all privileges on [database_name].* to [database_user]@localhost identified by '[database_user_password]';
   exit;
   mysql -u [database_user] -p [database_name] < [path_to_sqldump_file]

If the SQL file is compressed, extract it first.
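Alternatively, a gzip-compressed dump can be streamed in without ever leaving an uncompressed copy on disk. A sketch with a tiny stand-in dump (against the real backup, the last line would pipe into mysql instead of a file, as shown in the comment):

```shell
# Tiny stand-in for the real compressed sqldump:
echo 'CREATE TABLE demo (id INT);' > dump.sql
gzip -f dump.sql    # produces dump.sql.gz

# Stream the dump without extracting it to disk first. For the real restore:
#   gunzip -c dump.sql.gz | mysql -u [database_user] -p [database_name]
gunzip -c dump.sql.gz > restored.sql
cat restored.sql
```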
Now that database and document root are in place, we need to configure Apache next. The web server's conf file basically needs just four lines:

   <VirtualHost 127.0.0.1:80>
       DocumentRoot [path_to_the_htdocs_in_the_Document_Root]
       Alias /wiki [path_to_the_htdocs_in_the_Document_Root]/index.php
   </VirtualHost>

Don't forget the Alias.
Double-check that this Apache vhost is not accessible from the public internet.
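The vhost can be generated from a shell variable so that DocumentRoot and the Alias always agree; a sketch writing to a staging file (the document-root path, the conf file name and the sites-available location are assumptions for a stock Debian Apache — adjust them to your system):

```shell
# Adjust to the htdocs directory of your restored document root:
docroot=/var/www/osewiki/htdocs

# Binding to 127.0.0.1:80 keeps the vhost loopback-only.
cat > osewiki.conf <<EOF
<VirtualHost 127.0.0.1:80>
    DocumentRoot $docroot
    Alias /wiki $docroot/index.php
</VirtualHost>
EOF

# On the server, this would then be installed and enabled with:
#   sudo cp osewiki.conf /etc/apache2/sites-available/
#   sudo a2ensite osewiki && sudo systemctl reload apache2
cat osewiki.conf
```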

After that, reload Apache, and you should be able to reach the OSE wiki through the IP of your server!

Setup Scraper/MWOffliner

MWOffliner can be found on GitHub. To install it we need a Redis server, Node.js and some minor components, which can be installed like this:

   curl -sL https://deb.nodesource.com/setup_6.x | sudo -E bash -
   sudo apt-get install nodejs
   sudo apt-get install jpegoptim advancecomp gifsicle pngquant imagemagick nscd
   sudo apt-get install redis-server

Additionally, the Redis server needs a specific setup; make sure the following options are set in /etc/redis/redis.conf:

   unixsocket /dev/shm/redis.sock
   unixsocketperm 777
   save ""
   appendfsync no

Lastly, we also need zimwriterfs installed, which is a bit inconvenient, as the build instructions on its GitHub page are just plain wrong (as of May 2018). However, the project provides a Dockerfile written for Ubuntu, and following its steps works:

   sudo apt-get install git pkg-config libtool automake autoconf make g++ liblzma-dev coreutils wget zlib1g-dev libicu-dev python3-pip libgumbo-dev libmagic-dev
   sudo pip3 install meson
   # Build and install xapian-core
   wget https://oligarchy.co.uk/xapian/1.4.3/xapian-core-1.4.3.tar.xz
   tar xvf xapian-core-1.4.3.tar.xz
   cd xapian-core-1.4.3 && ./configure
   sudo make all install
   cd .. && rm -rf xapian-core-1.4.3 xapian-core-1.4.3.tar.xz
   sudo ln -s /usr/bin/python3 /usr/bin/python
   # Build and install libzim (ninja is cloned and built inside the libzim checkout)
   git clone https://github.com/openzim/libzim.git
   cd libzim && git checkout 3.0.0
   git clone git://github.com/ninja-build/ninja.git
   cd ninja && git checkout release
   ./configure.py --bootstrap
   sudo cp ./ninja /usr/local/bin/
   cd .. && meson . build
   cd build && ninja
   sudo ninja install
   cd ../.. && rm -rf libzim
   # Build and install zimwriterfs
   git clone https://github.com/openzim/zimwriterfs.git
   cd zimwriterfs && meson . build
   cd build && sudo ninja install
   cd ../.. && rm -rf zimwriterfs
   sudo ldconfig

These commands are extracted from that Dockerfile and rewritten so that they can be run on a normal system. After all of that compiling, the last thing we need is the actual mwoffliner, so let's get it:

   sudo npm -g install mwoffliner

All its dependencies will be installed automatically; once that is done, you can start using it. Just type mwoffliner on the command line to get the list of possible parameters. For building the ZIM of the OSE wiki, the following command is used:

   wget -O /tmp/favicon.png "https://wiki.opensourceecology.org/images/ose-logo.png"
   mwoffliner --outputDirectory=[Directory_Of_Zim_File] --mwUrl=http://127.0.0.1 --adminEmail=[your_mail] --customZimTitle=OSEWiki --customZimDescription="The Open Source Ecology (OSE) project tries to create open source blueprints of all industrial machines defining modern society for a decentralized, post-scarcity economy." --cacheDirectory=[Directory_of_Cached_Data(Optional)] --customZimFavicon=/tmp/favicon.png --mwApiPath="api.php" --localParsoid=true

Depending on the system, this may take a long time, so run it with nohup, screen or similar. Afterwards, the completed ZIM should appear in the output directory you specified. Congratulations!
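The nohup pattern looks like this; a sketch with a harmless stand-in command (replace the sh -c line with the mwoffliner invocation above; mwoffliner.log is an arbitrary file name):

```shell
# Run detached so the job survives the SSH session ending;
# stdout and stderr both go into the log file.
nohup sh -c 'echo "scrape finished"' > mwoffliner.log 2>&1 &
wait $!    # in practice you would log out and check the log later
cat mwoffliner.log
```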