Create ZIM from OSE Wiki


Warning: This page is dedicated to the administration of this wiki. You may use this information to understand the tools involved, but for the creation of a .zim file from this wiki you need the approval and collaboration of the Server Admin.

The ZIM format is an open source format for storing webpages in their entirety, with a focus on Wikipedia and MediaWiki pages. For more info, look here.
To create a ZIM file yourself, you need to scrape the webpage and download all necessary dependencies. There are a handful of programs capable of doing that; we will be using mwoffliner, as it seems to be the most advanced option.
Before we go into scraping, it should be noted that the OSE wiki is not written to be scraped. As you can see in the robots.txt, the limited resources of this project have led to firm protections against scraping and DOS attacks. There is (unfortunately) no scrape tool for ZIM files out there that can be throttled to fit those limits, so we need a workaround here.
This is why this process actually has two steps: create a copy of the OSE wiki in a safe environment, then set up and start the scraper against that copy, which can be scraped at any pace. We will start by creating the copy.
We will assume a Debian environment (any derivative will do). As some of the programs have to be compiled from scratch, partly with bad documentation, it is probably possible, but not advisable, to use other distributions.

Setup OSEWiki from a Backup

For this step you need a backup of the entire OSE wiki. This backup will consist of an SQL dump and the document root.
Warning! Such a backup contains sensitive information! Store it in a safe place and keep it only as long as you need it!
On your server you will need a default LAMP setup, so install MySQL, Apache and PHP like this:

   sudo apt-get install php php-mysql mysql-server apache2 libapache2-mod-php php-xml php-mbstring

The backup of the document root can be placed now; in it you will find LocalSettings.php. This file describes the database setup. As we need to recreate the database, search for DATABASE SETTINGS; there you'll find the name of the database, the username and the password. You'll need them to restore the dump.
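To pull those three values out quickly, you can grep for the standard MediaWiki variable names (a minimal sketch; adjust the path to wherever you placed the document root):

    grep -E 'wgDBname|wgDBuser|wgDBpassword' [path_to_the_document_root]/LocalSettings.php
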
Now, restore the dump in the following fashion:

   sudo su
   mysql -u root
   create database [database_name];
    grant all privileges on [database_name].* to '[database_user]'@'localhost' identified by '[database_user_password]';
   exit;
   mysql -u [database_user] -p [database_name] < [path_to_sqldump_file]

If the SQL file is compressed, extract it first.
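For example, for a gzip-compressed dump, and to verify afterwards that the import actually produced the wiki tables (a sketch using the same placeholders as above):

    # extract a gzip-compressed dump before importing it
    gunzip [path_to_sqldump_file].gz
    # after the import, a quick sanity check that the tables exist
    mysql -u [database_user] -p [database_name] -e "SHOW TABLES;"
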
Now that the database and the document root are in place, we need to work on Apache next. The webserver's conf file can basically be set up with just four lines:

   <VirtualHost *:80>
       DocumentRoot [path_to_the_htdocs_in_the_Document_Root]
       Alias /wiki [path_to_the_htdocs_in_the_Document_Root]/index.php
   </VirtualHost>

Don't forget the Alias; it is what maps the /wiki short URLs to index.php.
After that, reload Apache and you should be able to reach the OSE wiki copy through the IP of your server!
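A possible sequence on Debian, assuming the configuration was saved as /etc/apache2/sites-available/osewiki.conf (the file name is just an example):

    # disable the default site and enable the wiki vhost
    sudo a2dissite 000-default
    sudo a2ensite osewiki
    # reload apache and check that the wiki copy answers
    sudo systemctl reload apache2
    curl -I http://127.0.0.1/wiki/Main_Page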

Setup Scraper/MWOffliner

MWOffliner can be found on GitHub. To install it, we need a Redis server, Node.js and some minor components. They can be installed like this:

   curl -sL https://deb.nodesource.com/setup_6.x | sudo -E bash -
   sudo apt-get install nodejs
   sudo apt-get install jpegoptim advancecomp gifsicle pngquant imagemagick nscd
   sudo apt-get install redis-server
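
A quick sanity check that the Node.js from the NodeSource repository is the one actually in use (just a check; the 6.x branch was current at the time of writing):

    node -v    # should report a 6.x version
    npm -v     # npm comes bundled with the nodejs package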

Additionally, the Redis server needs a specific setup. Make sure the following options are set in /etc/redis/redis.conf:

   unixsocket /dev/shm/redis.sock
   unixsocketperm 777
   save ""
   appendfsync no
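
After editing the configuration, restart Redis and check that the Unix socket is reachable (redis-server is the default service name on Debian):

    sudo systemctl restart redis-server
    redis-cli -s /dev/shm/redis.sock ping    # should answer PONG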

Lastly, we also need zimwriterfs installed, which is a bit inconvenient, as the GitHub page's instructions on building it are just plain wrong (as of May 2018). However, they provide a Dockerfile written for Ubuntu, so following its steps should work:

   sudo apt-get install git pkg-config libtool automake autoconf make g++ liblzma-dev coreutils wget zlib1g-dev libicu-dev python3-pip libgumbo-dev libmagic-dev
   pip3 install meson
   wget https://oligarchy.co.uk/xapian/1.4.3/xapian-core-1.4.3.tar.xz
   tar xvf xapian-core-1.4.3.tar.xz
   cd xapian-core-1.4.3 && ./configure
   sudo make all install
    cd .. && rm -rf xapian-core-1.4.3 xapian-core-1.4.3.tar.xz
    sudo ln -s /usr/bin/python3 /usr/bin/python
   git clone https://github.com/openzim/libzim.git
   cd libzim && git checkout 3.0.0
   git clone git://github.com/ninja-build/ninja.git
   cd ninja && git checkout release
   ./configure.py --bootstrap
   sudo cp ./ninja /usr/local/bin/
   cd .. && meson . build
   cd build && ninja
   sudo ninja install
   cd ../.. && rm -rf libzim
   git clone https://github.com/openzim/zimwriterfs.git
   cd zimwriterfs && meson . build
   cd build && sudo ninja install
   cd ../.. && rm -rf zimwriterfs
   sudo ldconfig
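
Before moving on, it does not hurt to confirm that the build actually landed on the system (a minimal check; it only relies on the install steps above):

    which zimwriterfs            # should print an installed path
    ldconfig -p | grep libzim    # the freshly built libzim should be listed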

The build commands above are extracted from that Dockerfile, adapted so that they can be run on a normal system. After all of that compiling, the last thing we need is the actual mwoffliner, so let's get that one:

   sudo npm -g install mwoffliner

All the dependencies will be built automatically; after that is done, you can start using it. Just type mwoffliner into the command line to get the possible parameters. For the creation of a build of the OSE Wiki, the following command is used:

   wget -O /tmp/favicon.png "https://wiki.opensourceecology.org/images/ose-logo.png"
    mwoffliner --outputDirectory=[Directory_Of_Zim_File] --mwUrl=http://127.0.0.1 --adminEmail=[your_mail] --customZimTitle=OSEWiki --customZimDescription="The Open Source Ecology (OSE) project tries to create open source blueprints of all industrial machines defining modern society for a decentralized, post-scarcity economy." --cacheDirectory=[Directory_of_Cached_Data(Optional)] --customZimFavicon=/tmp/favicon.png --mwApiPath="api.php" --localParsoid=true

Depending on the system, this may take a long time, so run it with nohup, screen or similar (see the example below). Afterwards, the completed ZIM file should appear in the output directory you specified. Congratulations!
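If you go the nohup route, a possible invocation looks like this (a sketch; the log file name is arbitrary, and [same_parameters_as_above] stands for the full mwoffliner command line shown above):

    nohup mwoffliner [same_parameters_as_above] > mwoffliner.log 2>&1 &
    tail -f mwoffliner.log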