Create ZIM from OSE Wiki
Warning: This page is dedicated to the administration of this wiki. You may use this information to understand the tools involved; however, for the creation of a .zim file from this wiki, you need the approval and collaboration of the Server Admin.
The ZIM format is an open source format to store web pages in their entirety, with a focus on Wikipedia and MediaWiki pages. For more info, look here.
To create a ZIM file yourself, you need to scrape the web page and download all necessary dependencies. There is a handful of programs capable of doing that; we will be using mwoffliner, as it seems to be the most advanced option.
But before we go into scraping, it should be noted that the OSE Wiki is not written for scraping: as you can see in the robots.txt, the limited resources of this project have led to firm security measures against scraping and DoS attacks. There is (unfortunately) no scraping tool for ZIM files out there that can be throttled to fit those constraints, so we need a workaround here.
This is why this process actually has two steps: setting up and starting the scraper, and creating a copy of the OSE Wiki in a safe environment, which can then be scraped at any pace. We will start by creating the copy.
We will assume a Debian environment (any derivative will do). Since some of the programs have to be compiled from scratch, partly with poor documentation, using other distributions is probably possible, but not advisable.
Securing the Server
If the server is accessible from the internet in any way (like a virtual host or similar), it's absolutely mandatory to protect it against external threats, to assure the security of the private data of OSE and its members. To do that, all external access should be closed, apart from the SSH access used for the further setup. Also make sure to protect this access; some ideas for that are found here. Default SSH ports are targets of hundreds of attacks per day if open to the web, so use countermeasures.
For securing the further access, the easiest way is to use iptables to block all other access:
sudo apt-get install iptables
sudo iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
sudo iptables -P INPUT DROP
Depending on your setup, you may have to change the 22 to the port your SSH is running on, and change eth0 to the network device your system is using (ifconfig should give you that information). Better check this BEFORE setting this up. Note that the ACCEPT rule for SSH is added before the policy is switched to DROP, so your own session does not get cut off in between.
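The DROP policy also discards loopback traffic, which the local services listed below rely on, as well as replies to the server's own outgoing connections (apt, git, wget). A minimal sketch of two additional rules that keep those working, assuming the setup above:
sudo iptables -A INPUT -i lo -j ACCEPT                                # keep loopback open for the local services
sudo iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT # accept replies to outgoing connections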
If you'd like to set up the security differently, the following services have to be covered by your setup (a quick check follows below the list):
A Redis server
A MySQL server
Apache
Parsoid (running on port 8000)
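Whichever setup you choose, you can verify which services are listening on which addresses and ports with ss (part of iproute2 on Debian):
sudo ss -tlnp    # listening TCP sockets; nothing besides sshd should be reachable from outside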
Setup OSEWiki from a Backup
For this step you need a backup of the entire OSE Wiki. This backup will consist of an SQL dump and the document root.
Warning! Such a backup contains sensitive information! Store it in a safe place, and keep it only as long as you need it! Make sure that the web server running this snapshot can only be accessed from your local network; it should not be exposed to the public internet.
On your server, you will need a default LAMP setup, so install MySQL, Apache and PHP, like this:
sudo apt-get install php php-mysql mysql-server apache2 libapache2-mod-php php-xml php-mbstring
The backup of the document root can now be put in place; in it you will find LocalSettings.php. This file describes the database setup; as we need to recreate the database, search for DATABASE SETTINGS. There, you'll find the name of the database, the username and the password. You'll need them to restore the dump.
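For orientation, the DATABASE SETTINGS block in LocalSettings.php typically looks like the following sketch; the bracketed values are placeholders, not the real OSE credentials:
$wgDBtype     = "mysql";                      # database backend
$wgDBserver   = "localhost";                  # database host
$wgDBname     = "[database_name]";            # needed below to recreate the database
$wgDBuser     = "[database_user]";
$wgDBpassword = "[database_user_password]";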
Now, restore the dump in the following fashion:
sudo su
mysql -u root
create database [database_name];
grant all privileges on [database_name].* to [database_user]@localhost identified by '[database_user_password]';
exit;
mysql -u [database_user] -p [database_name] < [path_to_sqldump_file]
If the SQL file is compressed, extract it first.
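Alternatively, a gzip-compressed dump can be piped straight into MySQL without extracting it; a one-liner sketch using the placeholders from above:
gunzip -c [path_to_sqldump_file].gz | mysql -u [database_user] -p [database_name]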
Now that database and document root are finished, we need to work on Apache next. The web server's conf file can basically be set up with just 4 lines:
<VirtualHost 127.0.0.1:80>
    DocumentRoot [path_to_the_htdocs_in_the_Document_Root]
    Alias /wiki [path_to_the_htdocs_in_the_Document_Root]/index.php
</VirtualHost>
Don't forget the Alias, and double-check that this Apache vhost is not accessible from the public internet.
After that, reload Apache, and you should be able to reach the OSE Wiki through the IP of your server!
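A sketch of that last step, assuming the vhost was saved as /etc/apache2/sites-available/osewiki.conf (the file name is our choice):
sudo a2ensite osewiki          # enable the vhost
sudo apachectl configtest      # should report "Syntax OK"
sudo systemctl reload apache2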
Setup Scraper/MWOffliner
MWOffliner can be found on GitHub; to install it, we need a Redis server, Node.js and some minor components. They can be installed like this:
curl -sL https://deb.nodesource.com/setup_6.x | sudo -E bash -
sudo apt-get install nodejs
sudo apt-get install jpegoptim advancecomp gifsicle pngquant imagemagick nscd
sudo apt-get install redis-server
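To confirm the Node.js installation before continuing, check the versions; the setup_6.x script above should leave you with a 6.x release:
node --version
npm --version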
Additionally, the Redis server needs a specific setup; make sure the following options are set in /etc/redis/redis.conf:
unixsocket /dev/shm/redis.sock
unixsocketperm 777
save ""
appendfsync no
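After editing the config, restart Redis and verify that the socket answers; a quick check:
sudo systemctl restart redis-server
redis-cli -s /dev/shm/redis.sock ping    # should answer PONG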
Lastly, we also need zimwriterfs installed, which is a bit inconvenient, as the GitHub page's instructions on building it are just plain wrong (as of May 2018). However, they provide a Dockerfile written for Ubuntu, so following those instructions should work:
sudo apt-get install git pkg-config libtool automake autoconf make g++ liblzma-dev coreutils wget zlib1g-dev libicu-dev python3-pip libgumbo-dev libmagic-dev
sudo pip3 install meson
wget https://oligarchy.co.uk/xapian/1.4.3/xapian-core-1.4.3.tar.xz
tar xvf xapian-core-1.4.3.tar.xz
cd xapian-core-1.4.3 && ./configure
sudo make all install
cd .. && rm -rf xapian-core-1.4.3*
sudo ln -s /usr/bin/python3 /usr/bin/python
git clone https://github.com/openzim/libzim.git
cd libzim && git checkout 3.0.0
git clone git://github.com/ninja-build/ninja.git
cd ninja && git checkout release
./configure.py --bootstrap
sudo cp ./ninja /usr/local/bin/
cd .. && meson . build
cd build && ninja
sudo ninja install
cd ../.. && rm -rf libzim
git clone https://github.com/openzim/zimwriterfs.git
cd zimwriterfs && meson . build
cd build && sudo ninja install
cd ../.. && rm -rf zimwriterfs
sudo ldconfig
These commands are extracted from here, rewritten so that they can be run on a normal system. After all of that compiling, the last thing we need is the actual mwoffliner, so let's get that one:
sudo npm -g install mwoffliner
All the dependencies will be built automatically; after it's done, you can start using it. Just type mwoffliner into the command line and you will get the possible parameters. For the creation of a build from the OSE Wiki, the following command is used:
wget -O /tmp/favicon.png "https://wiki.opensourceecology.org/images/ose-logo.png"
mwoffliner --outputDirectory=[Directory_Of_Zim_File] --mwUrl=http://127.0.0.1 --adminEmail=[your_mail] --customZimTitle=OSEWiki --customZimDescription="The Open Source Ecology (OSE) project tries to create open source blueprints of all industrial machines defining modern society for a decentralized, post-scarcity economy." --cacheDirectory=[Directory_of_Cached_Data(Optional)] --customZimFavicon=/tmp/favicon.png --mwApiPath="api.php" --localParsoid=true
Depending on the system, this may take a long time, so run it with nohup, screen or similar, e.g. as sketched below. After that, the completed ZIM file should appear in the directory you specified. Congratulations!
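A minimal sketch of such a detached run with nohup; the log file name is our choice:
nohup mwoffliner [same_parameters_as_above] > mwoffliner.log 2>&1 &
tail -f mwoffliner.log    # follow the progress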