Qualitative data analysis, NLP and Text Mining

We build and maintain an infrastructure for supporting qualitative and quantitative Content Analysis (CA): the Leipzig Corpus Miner (LCM). This integrated application of different technologies was built by the NLP Group at the University of Leipzig. The infrastructure aims to integrate "close reading" procedures on individual documents with procedures of "distant reading", e.g. the analysis of lexical characteristics of large document collections. To this end, information retrieval systems, lexicometric statistics and machine learning procedures are combined in a coherent framework that enables qualitative data analysts to apply state-of-the-art Natural Language Processing (NLP) techniques to very large document collections. The framework is applicable in fields ranging from the social sciences to media studies and market research.

The LCM is an infrastructure rather than a complete software package. It combines different technologies into a qualitative data analysis environment accessible through an interface targeted at domain experts unfamiliar with NLP. Analysts can thus work on their data with a methodical rather than technical understanding of the algorithms. The technologies behind the user interface support analysts in tasks such as data storage, retrieval, processing and presentation. We integrate technologies such as UIMA, SOLR, MongoDB and GlassFish to create a distributed multi-tier environment capable of processing and storing millions of text documents.
For further information on the LCM, please contact the NLP Group at the University of Leipzig:
- Gregor Wiedemann – [gregor.wiedemann] [at] [uni-leipzig.de]
- Andreas Niekler – [aniekler] [at] [informatik.uni-leipzig.de]
Andreas Niekler, Gregor Wiedemann and Gerhard Heyer: Leipzig Corpus Miner – A Text Mining Infrastructure for Qualitative Data Analysis. In: Terminology and Knowledge Engineering 2014 (TKE 2014), Berlin, 2014.
Assuming the availability of a large document collection, e.g. complete volumes of a daily newspaper spanning several decades, a common need is to identify the documents relevant to a given research question.
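In the LCM, retrieval over stored collections is handled by SOLR. To illustrate the basic principle behind such document retrieval, here is a minimal, self-contained sketch of an inverted index with Boolean AND queries (all names are illustrative, not part of the LCM API):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Boolean AND retrieval: ids of documents containing all query terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Toy corpus: doc id -> tokenized text
docs = {1: ["press", "freedom", "law"],
        2: ["press", "law", "tax"],
        3: ["freedom", "speech"]}
index = build_index(docs)
hits = search(index, "press", "law")  # documents matching both terms
```

A production system like SOLR adds ranking, tokenization pipelines and faceting on top of this core idea, but the posting-list intersection shown here is the conceptual starting point.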
The LCM implements the computation and visualization of basic corpus-linguistic measures on stored collections: frequency analysis, co-occurrence analysis and automatic extraction of key terms.
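As a sketch of what such lexicometric measures compute, the following counts term frequencies and sentence-level co-occurrences of term pairs, the raw statistics on which significance measures for co-occurrence analysis are built (the function and variable names are hypothetical):

```python
from collections import Counter
from itertools import combinations

def cooccurrences(sentences):
    """Count term frequencies and sentence-level co-occurrences.

    Each sentence is a list of tokens; every unique, alphabetically
    sorted term pair within a sentence counts once.
    """
    freq = Counter()
    pairs = Counter()
    for tokens in sentences:
        types = sorted(set(tokens))   # unique terms, stable pair order
        freq.update(types)
        pairs.update(combinations(types, 2))
    return freq, pairs

sents = [["press", "freedom", "law"],
         ["press", "law"],
         ["freedom", "speech"]]
freq, pairs = cooccurrences(sents)
# freq counts sentences containing a term; pairs counts joint occurrences
```

From these counts, association scores such as log-likelihood or Dice can be computed to rank co-occurrents, which is the usual basis for key term extraction as well.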
For the analysis of topical structures in large text collections, Topic Models have proven useful in recent studies. Topic Models are statistical models that infer probability distributions over latent variables, assumed to represent topics, both across entire text collections and within single documents.
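To make the idea concrete, here is a compact collapsed Gibbs sampler for Latent Dirichlet Allocation, the most common topic model. This is a didactic sketch, not the LCM's implementation; hyperparameters, names and the toy corpus are all illustrative:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of token lists; K: number of topics.
    Returns per-document topic distributions and topic-word counts.
    """
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V = len(vocab)
    ndk = [[0] * K for _ in docs]                # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]   # topic-word counts
    nk = [0] * K                                 # tokens per topic
    z = []                                       # topic assignment per token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:                            # random initialization
            k = rng.randrange(K)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                      # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional P(topic | everything else)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # smoothed, normalized document-topic distributions
    theta = [[(ndk[d][t] + alpha) / (len(doc) + K * alpha) for t in range(K)]
             for d, doc in enumerate(docs)]
    return theta, nkw

docs = [["law", "court", "judge"], ["court", "law", "trial"],
        ["game", "team", "score"], ["team", "game", "coach"]]
theta, topic_words = lda_gibbs(docs, K=2)
```

Each row of `theta` is the inferred topic mixture of one document; inspecting the most frequent words in each entry of `topic_words` gives the topic's lexical profile, which is what analysts interpret in practice.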
Supervised learning from annotated text to assist the coding of documents, or parts of documents, promises to be a major innovation for Content Analysis applications. The LCM allows manual annotation of complete documents or document snippets with category labels. The analyst may develop a hierarchical category system initially and/or refine it during the annotation process. Annotated text passages serve as training examples for automatic classification, which outputs category labels for unseen analysis units (e.g. sentences, paragraphs or documents).
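As a sketch of the underlying idea (learning category labels from annotated snippets), here is a minimal multinomial Naive Bayes classifier over bag-of-words features with add-one smoothing. The class, method names and toy categories are illustrative; the source does not specify which classifier the LCM uses:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        self.vocab = set()
        for tokens, label in zip(texts, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        """Return the class maximizing log P(class) + sum log P(word | class)."""
        V = len(self.vocab)
        n = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for c in self.classes:
            lp = math.log(self.class_counts[c] / n)      # log prior
            total = sum(self.word_counts[c].values())
            for w in tokens:                             # add-one smoothing
                lp += math.log((self.word_counts[c][w] + 1) / (total + V))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Annotated snippets as training examples (toy category codes)
train = [(["economy", "tax", "budget"], "ECO"),
         (["election", "vote", "party"], "POL"),
         (["tax", "growth"], "ECO"),
         (["party", "vote"], "POL")]
texts, labels = zip(*train)
nb = NaiveBayes().fit(texts, labels)
label = nb.predict(["tax", "budget"])
```

In a workflow like the one described above, predictions on unseen analysis units would be reviewed by the analyst, and corrected examples fed back into training, gradually improving the classifier as the category system stabilizes.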