Create ZIM from OSE Wiki


Warning: This page is dedicated to the administration of this wiki. You may use this information to understand the tools involved; however, for the creation of a .zim file from this wiki you need the approval and collaboration of the Server Admin.

The ZIM format is an open source format for storing web pages in their entirety, with a focus on Wikipedia and MediaWiki pages. For more info, look here.
To create a ZIM file yourself, you need to scrape the website and download all necessary dependencies. There are a handful of programs capable of doing that; we will be using zimmer, as it seems to be the easiest option.
But before we go into scraping, it should be noted that the OSE wiki is not set up for scraping: as you can see in the robots.txt, the limited resources of this project call for firm protection against scraping and DoS attacks. Unfortunately, there is no ZIM scraping tool out there that can be throttled to fit those needs, so we need a workaround.
This is why the process actually has two steps: creating a copy of the OSE wiki in a safe environment, which can then be scraped at any pace, and setting up and starting the scraper. We will start by creating the copy.
We will assume a Debian environment (any derivative will do).

Securing the Server

If the server is accessible from the internet in any way (as a virtual host or similar), it is absolutely mandatory to protect it against external threats, to ensure the security of the private data of OSE and its members. To do that, all external access should be closed, except for the SSH access used for the further setup. Also make sure to protect this access; some ideas for that are found here. Default SSH ports are the target of hundreds of attacks per day if open to the web, so use countermeasures.
The easiest way to secure the remaining access is to use iptables to block everything else:

   sudo apt-get install iptables
   # allow SSH (and already-established connections) BEFORE switching the default policy to DROP,
   # otherwise you will lock yourself out of your current session
   sudo iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
   sudo iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
   sudo iptables -P INPUT DROP

Depending on your setup, you may have to change the 22 to the port your SSH is running on, and change eth0 to the network device your system is using (ifconfig or ip addr will give you that information). Better check this BEFORE setting it up.
If you'd like to set up the security differently, your setup has to cover at least the following services, which must not be reachable from outside (a quick check is sketched below):
the MySQL server
Apache
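To verify the firewall, list the active rules on the server; if you have another machine with nmap available, an external port scan should show only the SSH port as open. A minimal check:

   # on the server: show the active INPUT rules and the default policy (should be DROP)
   sudo iptables -L INPUT -n -v

   # from another machine (optional, requires nmap): only the SSH port should show as open
   nmap -p- [server_ip]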

Setup OSEWiki from a Backup

For this step you need a backup of the entire OSE wiki. This backup consists of an SQL dump and the document root.
Warning! Such a backup contains sensitive information! Store it in a safe place and keep it only as long as you need it! Make sure that the web server running this snapshot can only be accessed from your local network; it should not be exposed to the public internet.
On your server you will need a default LAMP setup, so install MySQL, Apache and PHP, like this:

   # on plain Debian the MySQL server package may be called default-mysql-server (MariaDB) instead of mysql-server
   sudo apt-get install php php-mysql mysql-server apache2 libapache2-mod-php php-xml php-mbstring

The backup of the document root can be placed now; in it you will find LocalSettings.php. This file describes the database setup. Since we need to recreate the database, search for DATABASE SETTINGS: there you'll find the name of the database, the username and the password. You'll need them to restore the dump.
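The variable names in that section are the standard MediaWiki ones; a quick way to pull them out of the file (the path is the placeholder used below):

   # $wgDBname, $wgDBuser and $wgDBpassword hold the database name, user and password ($wgDBserver the host)
   grep -E 'wgDBname|wgDBuser|wgDBpassword|wgDBserver' [path_to_the_htdocs_in_the_Document_Root]/LocalSettings.php
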
Now, restore the dump in the following fashion:

   sudo su
   mysql -u root
   create database [database_name];
   grant all privileges on [database_name].* to '[database_user]'@'localhost' identified by '[database_user_password]';
   exit;
   mysql -u [database_user] -p [database_name] < [path_to_sqldump_file]

If the SQL file is compressed, extract it first.
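For a gzip-compressed dump, either of the following works (the file name sqldump.sql.gz is just a placeholder):

   # extract the dump in place ...
   gunzip sqldump.sql.gz
   # ... or pipe it into mysql without writing the uncompressed file to disk
   zcat sqldump.sql.gz | mysql -u [database_user] -p [database_name]
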
Now that the database and document root are in place, we need to work on Apache next. The web server's conf file can basically be set up with just four lines:

   <VirtualHost 127.0.0.1:80>
       DocumentRoot [path_to_the_htdocs_in_the_Document_Root]
       Alias /wiki [path_to_the_htdocs_in_the_Document_Root]/index.php
   </VirtualHost>

Don't forget the Alias line.
Double-check that this Apache vhost is not accessible from the public internet.

After that, enable the site and reload Apache (see the sketch below); since the vhost binds to 127.0.0.1, you should then be able to reach the wiki copy locally on the server, e.g. through an SSH tunnel.
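On Debian the vhost file typically goes into /etc/apache2/sites-available/; a minimal sketch, assuming you saved it as osewiki.conf (the name is arbitrary):

   sudo a2ensite osewiki
   # optionally disable the default site so it does not shadow the wiki vhost
   sudo a2dissite 000-default
   sudo systemctl reload apache2
   # quick check from the server itself
   curl -I http://127.0.0.1/wiki/Main_Page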

Setup Scraper/Zimmer

Zimmer can be found on GitHub; to install it we need Node.js. It can be installed like this (the Node 10 setup script reflects the state at the time of writing; a newer LTS release should also work):

   curl -sL https://deb.nodesource.com/setup_10.x | sudo bash -
   sudo apt-get install nodejs
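
A quick check that Node.js and npm are available:

   node --version
   npm --version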

We also need zimwriterfs installed. There are Linux binaries here, so downloading it can look like this:

   wget "https://download.openzim.org/release/zimwriterfs/zimwriterfs_linux-x86_64-1.3.5.tar.gz"
   tar -xzf zimwriterfs_linux-x86_64-1.3.5.tar.gz
   mv zimwriterfs_linux-x86_64-1.3.5/zimwriterfs .
   sudo chmod +x ./zimwriterfs
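
Running the binary with --help should print its usage and confirms that it works on your system:

   ./zimwriterfs --help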

Now we can install zimmer:

   npm i -g git+https://github.com/vadp/zimmer
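
A quick check that the global install put the wikizimmer command on your PATH:

   which wikizimmer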

Zimmer is basically a scraper: it scans the wiki for all pages and downloads them in a form that zimwriterfs can work with. The command to start the scrape is:

   wikizimmer http://127.0.0.1/wiki/Main_Page

Depending on the system, this may take a long time, so run it with nohup, screen or similar (a sketch follows below). When it is done, first get a favicon and save it as favicon.png in the newly created directory; then zimwriterfs can build the actual ZIM file.
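A minimal way to keep the scrape running after you log out, if you go with nohup (the log file name is just an example):

   nohup wikizimmer http://127.0.0.1/wiki/Main_Page > zimmer.log 2>&1 &
   tail -f zimmer.log

The directory created by the scraper is named after the host (here 127.0.0.1); that is where favicon.png goes and what zimwriterfs is pointed at: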

   ./zimwriterfs --welcome=A/Main_Page.html --favicon=favicon.png --language=eng --title="OSEWiki" --description="The Open Source Ecology (OSE) project tries to create open source blueprints of all industrial machines defining modern society for a decentralized, sustainable, post-scarcity economy." --creator="Marcin Jakubowski" --publisher="ENTERYOURNAME" ./127.0.0.1 osewiki_en_all_YEAR-MONTH.zim

The naming is important: the Android app will only interact with the ZIM file correctly if it follows this naming format.

The command will take a while; once it's done, there is a new ZIM file. Congratulations!
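If you have kiwix-tools installed (not covered above), kiwix-serve gives a quick way to browse the new file and sanity-check it:

   kiwix-serve --port 8080 osewiki_en_all_YEAR-MONTH.zim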

Test of the scraped material

To check how many pages were scraped, the following steps are advised:
Get the number of scraped pages (run inside the directory created by the scraper; redirects not included):

   find A -type f | wc -l

Get the number of scraped images:

   find I -type f | wc -l

For the number of pages (redirects included) and the number of uploaded files (not only images) on the wiki, visit http://127.0.0.1/wiki/Special:Statistics?action=raw (or fetch it from the command line as sketched below).
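If the action=raw output is not available on your MediaWiki version, the regular Special:Statistics page or the siteinfo API reports the same numbers; the api.php path below depends on your vhost setup:

   curl -s "http://127.0.0.1/wiki/Special:Statistics?action=raw"
   # alternative via the MediaWiki API
   curl -s "http://127.0.0.1/api.php?action=query&meta=siteinfo&siprop=statistics&format=json"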

Download

2019-08

The first archive of our wiki (from 08-18-2019) is available on archive.org: https://archive.org/details/osewiki_en_all_2019-08

It can be downloaded here

2021-04

The linked file is not working for unknown reasons; until this is fixed, you can get it from my personal cloud: here