DISCLAIMER: The LCM software is currently in alpha status. We publish it as a result of the ePol project, in which it was developed to support political scientists in analyzing large newspaper collections. Consequently, many functions needed for generic analysis purposes are under-engineered or missing (see Future Developments).
The LCM is a "Software as a Service" infrastructure deployed as a server running in a virtual machine.
We recommend the following hardware for processing larger text collections (up to several million documents):
- 64-bit multiprocessor architecture (installation on 32-bit systems is also possible, but requires some adaptation during the setup procedure)
- Minimum 8 GB RAM (for small collections of a few thousand documents, 4 GB may be sufficient)
Because the LCM runs in a virtual machine, it requires an environment with both VirtualBox and Vagrant installed. We recommend Ubuntu Linux or another distribution from the Debian family.
If you have not yet installed VirtualBox and Vagrant, please do so:
Also make sure that the virtualization technology VT-x is enabled in your BIOS.
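Before continuing, it can help to confirm that both tools are actually reachable from your shell. The following is a minimal sketch, not part of the LCM release: the helper name check_tool is our own, and we assume that VirtualBox provides its usual command-line front end VBoxManage. It does not check the VT-x BIOS setting.

```shell
#!/bin/sh
# Print whether a command is on the PATH; return non-zero if it is missing.
# check_tool is a hypothetical helper name, not something shipped with the LCM.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: MISSING"
        return 1
    fi
}

# VBoxManage is the command-line front end installed with VirtualBox.
for tool in VBoxManage vagrant; do
    check_tool "$tool" || echo "Install the package that provides $tool before continuing."
done
```

If either tool is reported as missing, install it before downloading the LCM image.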
Download and installation
Installation of the LCM environment is pretty straightforward thanks to Vagrant and Puppet scripting. We also provide shell commands for Linux for easy installation here.
1. Download LCM lcm_release_0.0.1.zip
2. Unzip LCM image
3. Change to LCM vagrant image directory
4. Setup initial passwords and VM options
You'll need to edit some files.
- Change STANDARDADMINPASSWORD to YOURPREFERREDADMINPASSWORD
We strongly recommend changing these standard passwords. Otherwise you leave the door wide open to unauthorized access to your virtual machine and possibly also to your host machine.
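If you prefer not to edit by hand, the placeholder replacement can be scripted. The sketch below uses sed; the function name set_admin_password is our own, and the file path passed to it is hypothetical, since the files containing the placeholder depend on your LCM release. A backup copy with the suffix .bak is kept next to each edited file.

```shell
#!/bin/sh
# Replace the placeholder STANDARDADMINPASSWORD in one file.
#   $1 = file to edit (hypothetical path; use the files from your release)
#   $2 = new password (must not contain newlines)
set_admin_password() {
    file=$1
    new_password=$2
    # Escape the characters that are special in a sed replacement:
    # backslash, ampersand, and the | delimiter used below.
    escaped=$(printf '%s' "$new_password" | sed -e 's/[\\&|]/\\&/g')
    # -i.bak edits in place and keeps the original as <file>.bak
    sed -i.bak "s|STANDARDADMINPASSWORD|$escaped|g" "$file"
}

# Example call (path is a placeholder, adjust to your setup):
# set_admin_password path/to/config_file 'YOURPREFERREDADMINPASSWORD'
```

Remember to delete the .bak files once you have verified the change, because they still contain the standard password.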
Optional: Edit Vagrantfile to alter available RAM and ports of your virtual machine:
5. Initialize LCM
6. Import data
Import one of our example data sets or your own data formatted in the LCM-specific CSV data format (see below).
Get shell access to your virtual machine:
Change to import process directory:
The import script may take a while, depending on the size of your data collection. In our ePol project, the import of 3.5 million newspaper articles took more than 3 days. If you have just a few thousand documents, it will take considerably less time. The example data sets may take from 1.5 to 3 hours.
Example data set I: Political Speeches
./importDocuments.sh CSV /home/corpusminer/data/import/political_speeches
This corpus includes German speeches of Bundespräsidenten, Bundeskanzler and some Ministers of the Federal Government and is licensed under CC BY-SA.
Credits: Barbaresi, Adrien (2012): German Political Speeches, Corpus and Visualization, 2nd Version, ENS Lyon, available online: http://purl.org/corpus/german-speeches, last access: 05/31/2016.
Example data set II: Plenary protocols of the Bundestag (Plenarprotokolle des Bundestages)
./importDocuments.sh CSV /home/corpusminer/data/import/bundestag
This corpus includes speeches extracted from plenary protocols of the 17th German Bundestag (2009-2013) and is licensed under the Open Data Commons Open Database License (ODbL).
Credits: Offenes Parlament (2016): Datenzugriff/API, available online: http://offenesparlament.de/info/daten, last access: 05/31/2016.
7. Open LCM
Access the LCM web GUI via
8. Halt / uninstall LCM
Since the LCM runs as a virtual machine, it is useful to stop the machine between longer work sessions and restart it afterwards. This can be done by typing the following Vagrant commands in the vagrant-lcm directory:
- vagrant halt: stops the virtual machine
- vagrant up: restarts the virtual machine
To uninstall or reinitialize an LCM instance, type the following Vagrant commands:
- vagrant destroy: deletes the entire LCM box image and all data.
- vagrant up: when run again after a destroy, creates a freshly initialized LCM instance, which also needs a new data import.
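The commands above can be wrapped in a small helper so that the intent is visible in the call. This is only a sketch under our own assumptions: the function name lcm_vm and the VAGRANT override variable are hypothetical conveniences, and the function must be called from within the vagrant-lcm directory.

```shell
#!/bin/sh
# Command used to talk to Vagrant. Overriding it (e.g. VAGRANT=echo)
# prints the subcommands instead of running them, which allows a dry run.
VAGRANT=${VAGRANT:-vagrant}

# lcm_vm is a hypothetical wrapper; run it inside the vagrant-lcm directory.
lcm_vm() {
    case "$1" in
        stop)  "$VAGRANT" halt ;;                      # shut the VM down
        start) "$VAGRANT" up ;;                        # boot the VM again
        reset) "$VAGRANT" destroy && "$VAGRANT" up ;;  # wipe, then reinitialize
        *)     echo "usage: lcm_vm {stop|start|reset}" >&2
               return 2 ;;
    esac
}
```

Note that reset removes all imported data, so a new data import is required afterwards.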
To uninstall, simply delete the entire vagrant-lcm directory after running
For the moment, we provide two example corpora for testing purposes: the political speeches corpus by Adrien Barbaresi (2012) and Bundestagsprotokolle provided by Offenes Parlament (2016). For detailed reference see above.
You can prepare your own data for import and analysis with the LCM. Simply copy your prepared CSV files to a new directory in the import folder, for example corpusminer/data/import/my_own_data, and run the import script from within the virtual machine, pointing it to the new directory:
./importDocuments.sh CSV /home/corpusminer/data/import/my_own_data
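Since an import can run for hours, it is worth making sure the target directory really contains CSV files before starting. The copy step could be sketched as follows; the function name stage_csv_files and the source directory in the example call are hypothetical, and only the import folder path is taken from this guide.

```shell
#!/bin/sh
# Copy all CSV files from a source directory into a new import directory.
#   $1 = directory holding your prepared CSV files
#   $2 = target directory inside the LCM import folder
stage_csv_files() {
    src=$1
    dest=$2
    # Expand the glob into the positional parameters of this function.
    set -- "$src"/*.csv
    # If nothing matched, $1 is still the literal pattern and does not exist.
    if [ ! -e "$1" ]; then
        echo "no CSV files found in $src" >&2
        return 1
    fi
    mkdir -p "$dest" && cp "$@" "$dest"/
    echo "staged $# file(s) in $dest"
}

# Example call (source path is a placeholder):
# stage_csv_files ~/my_csv_export corpusminer/data/import/my_own_data
```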
The CSV import process accepts CSV files complying with the following format:
- Header: 14 ordered columns:
- Content: one document per CSV entry
- Separator: , (comma); Text qualifier: "; Escape double:
- date must be of this format: dd.mm.yyyy, for example 31.05.2016
- body may contain newline characters. Single lines are imported as paragraphs in the LCM
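When preparing your own data, the date format is an easy thing to get wrong. The helpers below are a minimal sketch with hypothetical names (is_lcm_date, iso_to_lcm_date): one checks a value against the dd.mm.yyyy pattern, the other converts an ISO date (yyyy-mm-dd) into it. The check is purely syntactic and does not reject impossible dates such as 31.02.2016.

```shell
#!/bin/sh
# Succeed if $1 looks like dd.mm.yyyy (day 01-31, month 01-12, 4-digit year).
# Syntactic check only: calendar validity is not verified.
is_lcm_date() {
    printf '%s\n' "$1" |
        grep -Eq '^(0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[0-2])\.[0-9]{4}$'
}

# Convert an ISO date yyyy-mm-dd to dd.mm.yyyy.
# Input that does not match the ISO pattern is printed unchanged.
iso_to_lcm_date() {
    printf '%s\n' "$1" |
        sed -E 's/^([0-9]{4})-([0-9]{2})-([0-9]{2})$/\3.\2.\1/'
}

# Example: convert, then validate before writing the value to the CSV file.
converted=$(iso_to_lcm_date 2016-05-31)
is_lcm_date "$converted" && echo "usable date: $converted"
```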