Download & installation

DISCLAIMER: The LCM software is currently in alpha status. We publish it as a result of the ePol project, in which it was developed to support political scientists in analyzing large newspaper collections. Consequently, many functions needed for generic analysis purposes are under-engineered or missing (see Future Developments).

Requirements

The LCM is a "Software as a Service" infrastructure deployed as a server running in a virtual machine.

Hardware

To process larger text collections (up to several million documents), we recommend the following hardware:

  • 64-bit multiprocessor architecture (installation on 32-bit systems is also possible, but requires some adaptation during the setup procedure)
  • Minimum 8 GB RAM (for small collections of a few thousand documents, 4 GB may be sufficient)

Software

The LCM is distributed as a virtual machine (VirtualBox) environment set up automatically by Vagrant.

Therefore, it requires an environment with both VirtualBox and Vagrant installed. We recommend Ubuntu Linux or another distribution from the Debian family.

Preparation

If you have not yet installed VirtualBox and Vagrant, please do so:

  • https://www.virtualbox.org/wiki/Downloads
  • https://www.vagrantup.com/downloads.html

Also make sure that hardware virtualization (Intel VT-x) is enabled in your BIOS.
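
The prerequisites above can be checked from a Linux shell roughly as follows. This is only a sketch: it assumes that VirtualBox provides its VBoxManage command-line tool on the PATH and that /proc/cpuinfo is available (Linux only); check_tool is a hypothetical helper name. The vmx (Intel VT-x) or svm (AMD-V) CPU flag only indicates that the processor supports hardware virtualization; whether it is enabled still has to be verified in the BIOS setup.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: MISSING"
        return 1
    fi
}

# VBoxManage is VirtualBox's command-line tool; vagrant is the Vagrant CLI.
check_tool VBoxManage || echo "Please install VirtualBox first."
check_tool vagrant || echo "Please install Vagrant first."

# Linux only: look for the CPU flags vmx (Intel VT-x) or svm (AMD-V).
if grep -qwE 'vmx|svm' /proc/cpuinfo 2>/dev/null; then
    echo "CPU virtualization support: found"
else
    echo "CPU virtualization support: not detected"
fi
```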


Download and installation

Installation of the LCM environment is straightforward thanks to Vagrant and Puppet scripting. The shell commands for installation on Linux are given in the steps below.

1. Download LCM lcm_release_0.0.4.zip
wget http://lcm.informatik.uni-leipzig.de/lcm_release_0.0.4.zip

2. Unzip LCM image
unzip lcm_release_0.0.4.zip

3. Change to LCM vagrant image directory
cd vagrant-lcm

4. Set up initial passwords and VM options

You'll need to edit some files.

nano corpusminer/software/glassfish/adminpassword.txt
  • Change STANDARDADMINPASSWORD to YOURPREFERREDADMINPASSWORD
nano corpusminer/software/glassfish/userpassword.txt
  • Change STANDARDADMINPASSWORD to YOURPREFERREDADMINPASSWORD
  • Change STANDARDUSERPASSWORD to YOURPREFERREDUSERPASSWORD

We strongly recommend changing these standard passwords. Otherwise, you leave the door wide open to access to your virtual machine and possibly also to your host machine.
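
As an alternative to editing the files by hand, the placeholders can be replaced non-interactively. The following is a minimal sketch, assuming GNU sed (as shipped with Ubuntu) and passwords that do not contain the characters /, & or backslash; set_password, ADMIN_PW and USER_PW are hypothetical names introduced only for this example.

```shell
#!/bin/sh
# Hypothetical helper: replace every occurrence of a placeholder in a file.
# Usage: set_password FILE PLACEHOLDER NEW_VALUE
# Assumes GNU sed (-i edits in place) and a value without '/', '&' or '\',
# because these characters are special in a sed replacement.
set_password() {
    sed -i "s/$2/$3/g" "$1"
}

# Replace these example values with your own passwords.
ADMIN_PW='YOURPREFERREDADMINPASSWORD'
USER_PW='YOURPREFERREDUSERPASSWORD'

# Run from the vagrant-lcm directory; does nothing if the directory is absent.
dir=corpusminer/software/glassfish
if [ -d "$dir" ]; then
    set_password "$dir/adminpassword.txt" STANDARDADMINPASSWORD "$ADMIN_PW"
    set_password "$dir/userpassword.txt" STANDARDADMINPASSWORD "$ADMIN_PW"
    set_password "$dir/userpassword.txt" STANDARDUSERPASSWORD "$USER_PW"
fi
```

Afterwards, grep STANDARD corpusminer/software/glassfish/*.txt should print nothing if all placeholders were replaced.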

Optional: Edit Vagrantfile to alter available RAM and ports of your virtual machine: nano Vagrantfile

5. Initialize LCM

Simply run: vagrant up

6. Import data

Import one of our example data sets or your own data formatted in the LCM-specific CSV data format (see below).

Get shell access to your virtual machine: vagrant ssh

Change to import process directory: cd /opt/asvuima

The import script may take a while, depending on the size of your data collection. In our ePol project, the import of 3.5 million newspaper articles took more than 3 days. If you have only a few thousand documents, it will take considerably less time. The example data sets may take from 1.5 to 3 hours.

Example data set I: Political Speeches

./importDocuments.sh CSV /home/corpusminer/data/import/political_speeches

This corpus includes German-language speeches by Federal Presidents (Bundespräsidenten), Federal Chancellors (Bundeskanzler) and some ministers of the Federal Government and is licensed under CC BY-SA.

Credits: Barbaresi, Adrien (2012): German Political Speeches, Corpus and Visualization, 2nd Version, ENS Lyon, available online: http://purl.org/corpus/german-speeches, last access: 05/31/2016.

Example data set II: Plenarprotokolle des Bundestages

./importDocuments.sh CSV /home/corpusminer/data/import/bundestag

This corpus includes speeches extracted from plenary protocols of the 17th German Bundestag (2009-2013) and is licensed under the Open Data Commons Open Database License (ODbL).

Credits: Offenes Parlament (2016): Datenzugriff/API, available online: http://offenesparlament.de/info/daten, last access: 05/31/2016.

7. Open LCM

Access the LCM web GUI via https://localhost:8282/lcm

and log in with the username test and the password YOURPREFERREDUSERPASSWORD.

8. Halt / uninstall LCM

Since the LCM runs as a virtual machine, it is useful to stop the machine between longer work sessions and restart it when needed. This can be done by typing the following Vagrant commands in the vagrant-lcm directory:
  • vagrant halt: stops the virtual machine
  • vagrant up: restarts the virtual machine

To reinitialize or uninstall an LCM instance, use the following Vagrant command:
  • vagrant destroy: deletes the entire LCM box image and data
Running vagrant up again afterwards results in a freshly initialized LCM instance, which also needs a new data import.

To uninstall completely, delete the entire vagrant-lcm directory after running vagrant destroy.

Example data

For the moment, we provide two example corpora for testing purposes: the political speeches corpus by Adrien Barbaresi (2012) and Bundestagsprotokolle provided by Offenes Parlament (2016). For detailed reference see above.

You can also prepare your own data for import and analysis with the LCM. Copy your prepared CSV files to a new directory in the import folder, for example corpusminer/data/import/my_own_data, and run the import script from within the virtual machine, pointing it to the new directory:
vagrant ssh
cd /opt/asvuima
./importDocuments.sh CSV /home/corpusminer/data/import/my_own_data

Data format

The CSV import process accepts CSV files complying with the following format:

  • Header: 14 ordered columns: "doc_number","external_id","article_date","title","body","source",
    "subtitle","author","publisher","language","publication_type","section","subject","page"
  • Content: one document per CSV entry
  • Separator: , (comma); Text qualifier: "; Escape double: ""
  • article_date must have the format dd.mm.yyyy, for example 31.05.2016
  • body may contain newline characters. Each line is imported as a separate paragraph in the LCM.
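
To illustrate the format, the following sketch writes a small sample file with two documents and checks the number of header columns. It assumes that the header is written on a single line and that every field is enclosed in the text qualifier. All field values, including the language code, are invented placeholders; LCM_IMPORT_DIR and sample.csv are hypothetical names used only in this example.

```shell
#!/bin/sh
# Hypothetical target directory; adjust it to your vagrant-lcm checkout.
target_dir="${LCM_IMPORT_DIR:-corpusminer/data/import/my_own_data}"
mkdir -p "$target_dir"

# The quoted heredoc delimiter makes the shell copy the text verbatim.
# Document 1 has a two-line body (two paragraphs) and an escaped double quote.
cat > "$target_dir/sample.csv" <<'EOF'
"doc_number","external_id","article_date","title","body","source","subtitle","author","publisher","language","publication_type","section","subject","page"
"1","ex-001","31.05.2016","First example","First paragraph.
Second paragraph with a ""quoted"" word.","Example Source","An example subtitle","Jane Doe","Example Publisher","en","newspaper","politics","example","1"
"2","ex-002","01.06.2016","Second example","A single-paragraph body, with a comma.","Example Source","Another subtitle","John Doe","Example Publisher","en","newspaper","culture","example","2"
EOF

# Sanity check: the header line must contain 14 comma-separated columns.
head -n 1 "$target_dir/sample.csv" | awk -F',' '{ print NF }'    # prints 14
```

Counting commas is only reliable for the header line, because field values such as the body of document 2 may themselves contain commas inside the text qualifier.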