Corpus Analysis

Tagset for Part-of-Speech

All corpora are annotated with part-of-speech information, i.e. each word is assigned a part-of-speech tag. For German, we use the STTS tagset, which can be downloaded here. For French and Italian, we use the tagsets of Achim Stein.

Query Syntax

Some of the features require a special query syntax, the so-called CQP syntax. It allows you to search for simple word forms, complex word forms, combinations of words and parts of speech, words in a specific syntactic function, etc. All words carry the following annotations, which you can use systematically in your CQP queries:

  • Part of Speech
      • category of a word (e.g. proper noun, adjective, adverb)
      • abbreviated as "pos": e.g. [pos = "ADJA"] will find all attributive adjectives (STTS tag ADJA) in the corpus
      • We use the STTS tagset, which can be downloaded here.
  • lemma
      • dictionary form of a word form
      • e.g. [lemma = "Virus"] will find the word forms "Viren" and "Virus"
  • depRel
      • dependency relation of a word form in a specific sentence
      • e.g. [depRel = "SB"] will find all words with the dependency relation subject
      • The tagset for dependency relations can be found here.
  • depHeadWord
      • word form of the syntactic head
      • e.g. [depHeadWord = "muss"] will find all words whose head is "muss"

This syntax works with regular expressions, i.e. patterns containing wildcards that serve as placeholders for certain characters or combinations of characters. The following table shows the most important regular expressions, with examples of how to use them:

RegEx    Meaning
.        Exactly one additional arbitrary character: [word="erneuerbare."] → erneuerbaren, erneuerbarem, erneuerbarer, …
+        Repetition operator: the preceding character must occur at least once, but may occur several times (up to the word boundary); useful, for example, in combination with ".": [word="Energie.+"] → Energien, Energiegesetz, Energiestrategien, …
*        Repetition operator: the preceding character may occur any number of times, including not at all (up to the word boundary); useful, for example, in combination with ".": [word="Energie.*"] → Energie, Energien, Energiegesetz, Energiestrategien, …
?        Repetition operator: the preceding character may be present but does not have to be: [word="energie-?effizient"] → energie-effizient, energieeffizient, …
(x|y)    Matches each of the alternatives in the parentheses: [word="(E|e)nergie-?(Z|z)ukunft"] → Energiezukunft, energiezukunft, Energie-zukunft, Energie-Zukunft, …
x{0,3}   Matches 0 to 3 repetitions of the preceding element. Several alternative characters can be given in square brackets: [xyz]{0,3}
[]       An additional, unspecified word between search terms: [word="die"] [] [word="Energie"] → die graue Energie, die saubere Energie, …
[^]      The characters listed in the brackets after ^ must not occur at this position: [word=".*?[^eEÉé]nergie"] → Minergie, Synergie, … but not Energie
!        Negation: the expression after the exclamation mark must not match: [pos != "NN"] → returns all parts of speech except common nouns (NN).
&        Combines the pos, word, lemma and proper-name attributes of a search word: [word="richtig" & pos="ADV"] → occurrences of "richtig" as an adverb only, not, for example, as an adjective.
\        Literal interpretation of characters that would otherwise be CQP wildcards: [word="Erneuerbare"] [word="\?"] → finds the literal sequence "Erneuerbare ?"
%c       Ignores upper and lower case: [word="energie"%c] → Energie, energie
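
The wildcards in the table behave much like regular expressions in general-purpose programming languages. The following sketch uses Python's re module purely as an illustration. It assumes that a pattern has to match the whole word form (approximated here with re.fullmatch) and that %c corresponds to case-insensitive matching. The word list is made up, and the regular expression engine behind CQP may differ in details.

```python
import re

# Made-up word forms, used only for illustration.
words = ["Energie", "energie", "Energien", "Energiegesetz",
         "Minergie", "Synergie", "energieeffizient", "energie-effizient"]

def matches(pattern, tokens, ignore_case=False):
    """Return the tokens that the pattern matches completely."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [t for t in tokens if regex.fullmatch(t)]

print(matches("Energie.+", words))            # at least one further character
print(matches("Energie.*", words))            # zero or more further characters
print(matches("energie-?effizient", words))   # optional hyphen
print(matches(".*?[^eEÉé]nergie", words))     # no e/E directly before "nergie"
print(matches("energie", words, ignore_case=True))  # comparable to %c
```

Trying patterns against a short word list like this can help to check a regular expression before using it in a query.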

Modes of analysis

Corpus Query

You can search the corpus using a special search syntax, the so-called CQP syntax (see section "Query Syntax"). The result shows the frequency of each individual search result.

A few examples: This query will find all word forms beginning with "Vir" (e.g., "Virus", "Virenstamm", "Virusinfektion"):
[word = "Vir.*"]

This query will find adjacent combinations of an adjective (ADJA) and word forms containing "bezüger" or "empfänger":
[pos = "ADJA"][word = ".*(bezüger|empfänger).*"]

This query will find nouns (NN) in the dependency relation (depRel) subject (SB):
[depRel = "SB" & pos = "NN"]

This query will find finite and non-finite modal verbs (VM.*) whose head is the word form "muss":
[depHeadWord = "muss" & pos = "VM.*"]
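
To make the logic of such queries more tangible, here is a minimal Python sketch of how a sequence of bracketed token constraints can be matched against annotated tokens. It is not how CQP is implemented. The token list, its tags and the helper names token_matches and find_sequences are made up for illustration, and each attribute value is assumed to be a regular expression that must match the whole value.

```python
import re

# A tiny hand-annotated token sequence; words and tags are illustrative.
tokens = [
    {"word": "Viele", "pos": "PIAT", "lemma": "viel"},
    {"word": "neue", "pos": "ADJA", "lemma": "neu"},
    {"word": "Rentenbezüger", "pos": "NN", "lemma": "Rentenbezüger"},
    {"word": "warten", "pos": "VVFIN", "lemma": "warten"},
]

def token_matches(token, constraints):
    """All attribute constraints of one [...] must hold (like '&')."""
    return all(re.fullmatch(rx, token.get(attr, ""))
               for attr, rx in constraints.items())

def find_sequences(tokens, query):
    """Return every run of adjacent tokens that matches the query."""
    n = len(query)
    hits = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(token_matches(t, c) for t, c in zip(window, query)):
            hits.append([t["word"] for t in window])
    return hits

# Corresponds to: [pos = "ADJA"][word = ".*(bezüger|empfänger).*"]
query = [{"pos": "ADJA"}, {"word": ".*(bezüger|empfänger).*"}]
print(find_sequences(tokens, query))
```

Each dictionary in the query stands for one pair of square brackets, so adjacent brackets in a query correspond to adjacent tokens in the text.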

Results will be displayed as a table showing all results and as a bar plot showing the 30 most frequent results.

Distribution Analysis

You can analyze the distribution of search terms over time and/or across (groups of) sources, e.g. when you are interested in the distribution of the word "Europa" between 2013 and 2019 in texts from politics and industry. You can enter up to five search terms (CQP syntax and regular expressions are currently not supported). The search is based on word forms, so you should consider entering inflected forms as well. The results will be visualized as bar plots and, for the distribution over time, as line graphs. All frequencies are calculated per million words.
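
Frequencies per million words make subcorpora of different sizes comparable. A minimal sketch of the calculation in Python follows; the counts and corpus sizes are invented, and the function name per_million is only illustrative.

```python
def per_million(hits, corpus_size):
    """Relative frequency: occurrences per one million words."""
    return hits * 1_000_000 / corpus_size

# Hypothetical data for one word form: year -> (hits, words in that year)
counts = {2013: (120, 2_000_000), 2019: (450, 5_000_000)}

for year, (hits, size) in sorted(counts.items()):
    print(year, per_million(hits, size))
```

In this invented example the raw count for 2019 is almost four times the count for 2013, but relative to the size of each yearly subcorpus the increase is only from 60 to 90 occurrences per million words.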

Collocation Analysis

Collocations are words that frequently co-occur within a certain span of words. They give information about the meaning of a word (following the hypothesis that a word's meaning is strongly determined by its context).

You can calculate the collocations for a search term in a corpus or in a user-defined subcorpus (e.g. if you are interested in the linguistic context of the word "Europa" in texts from industry). If you would like to use CQP syntax for the search term, please tick the corresponding checkbox. For a user-defined subcorpus, you can use a specific time range (earliest point in time: 2010) and any combination of the pre-defined classes of corpus sources (media, politics, industry, research, civil society). Collocations are calculated using the log-likelihood measure.

You can also adjust the following collocation parameters:

  • right context: the number of words to the right of the search term that should be included in the collocation window
  • left context: the number of words to the left of the search term that should be included in the collocation window
  • form: either "lemma" or "word", i.e. whether collocations should be based on lemmas (inflected forms are counted together) or on surface word forms (inflected forms are counted separately)

Finally, you need to choose the desired visualization for the collocations. You can choose between:

  • tabular output
  • bar plot (the first 30 collocations will be displayed as bars; the length of each bar depends on the corresponding log-likelihood value)
  • tree map (the first 30 collocations will be displayed as tiles; the size of each tile depends on the corresponding log-likelihood value)
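
The two ingredients described above, a context window and a log-likelihood score, can be sketched in Python as follows. This is an illustration under assumptions, not the workbench's implementation: it uses the common log-likelihood ratio (G2) for a 2x2 contingency table, and the function names, the example sentence and the counts are made up.

```python
import math

def window_counts(tokens, node, left=3, right=3):
    """Count the words occurring in the window around each hit of node."""
    counts = {}
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        context = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        for word in context:
            counts[word] = counts.get(word, 0) + 1
    return counts

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 table of observed counts."""
    total = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            if observed[i][j] > 0:  # empty cells contribute nothing
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * g2

tokens = "das neue Europa braucht ein starkes Europa und ein neues Denken".split()
print(window_counts(tokens, "Europa", left=1, right=1))

# Hypothetical counts: collocate inside the windows, other words inside the
# windows, collocate outside the windows, other words outside the windows.
print(log_likelihood(30, 70, 70, 9830))
```

A score of 0 means that the collocate is exactly as frequent inside the window as expected by chance; the more strongly the observed counts deviate from the expected ones, the larger the score.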

N-gram Analysis

N-grams are sequences of adjacent words in a corpus, where "n" is a placeholder for the desired length of the sequence. For example, a search for 3-grams in the sentence "Das Virus wird sehr schnell verbreitet." will yield "Das Virus wird", "Virus wird sehr", "wird sehr schnell" and "sehr schnell verbreitet". N-grams can be used to identify frequently occurring phrases in texts, e.g. to reveal specific narratives or metaphors.
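
Extracting n-grams from a tokenized sentence takes only a few lines. The following Python sketch is an illustration: it works on surface word forms with simple whitespace tokenization, and the function name ngrams is made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all sequences of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Das Virus wird sehr schnell verbreitet".split()

trigrams = ngrams(sentence, 3)
print(trigrams)

# Keep only the n-grams that contain a given search term.
with_virus = [g for g in trigrams if "Virus" in g]
print(with_virus)

# Counting n-grams over many sentences yields the most frequent phrases.
print(Counter(ngrams(sentence, 2)).most_common(3))
```

A sentence with six words contains four 3-grams, two of which include the word "Virus".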

You can calculate phrases containing a specific word (and optionally a specific part-of-speech). You can define the following parameters:

  • length of the n-gram in words (possible options: 2-grams (two words), 3-grams (three words), 4-grams (four words))
  • search term that should occur in the n-gram. N-grams are calculated based on lemmas, i.e. inflected forms are subsumed under their lemma and not treated separately (CQP syntax and regular expressions are currently not supported).
  • optionally: a part of speech that should also be part of the n-gram. Parts of speech are entered using STTS abbreviations.
  • database: you can either calculate n-grams for whole corpora or compare n-grams for two user-defined subcorpora. For a user-defined subcorpus, you can use a specific time range (earliest point in time: 2010) and any combination of the pre-defined classes of corpus sources (media, politics, industry, research, civil society).

Results will be visualized as bar plots showing the 30 most frequent n-grams. When two corpora are compared, the top 30 n-grams of both corpora will be displayed (n-grams that appear in both lists are shown only once).

Co-occurrence Analysis

As with collocation analysis, one might be interested in the co-occurrence of certain words within whole texts. Highly correlated words (i.e. words that are often used together in a text) can be used to identify narratives in texts, dominant topics and so on.

You can calculate the words that correlate with a user-defined search term. Correlation is calculated on the level of texts, i.e. words that often co-occur in the same texts show a high correlation. Correlation values range from 0 to 1, where 1 means that two words always occur together and 0 that two words seldom appear together in a text. You can compare up to five search terms: for every search term you enter, the top 30 correlating words will be displayed either as a bar plot or as a network. In a network, nodes represent words and edges connect words that correlate with each other.
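
The idea of text-based correlation can be illustrated with a small Python sketch. The correlation measure used by the workbench is not specified here, so the sketch assumes the Pearson correlation between binary indicators of whether a word occurs in a text. The mini corpus and the function names are made up. Note that this measure can also become negative for words that avoid each other, whereas the values reported above range from 0 to 1.

```python
import math

def correlation(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a word present in all or in no texts carries no signal
    return cov / math.sqrt(var_x * var_y)

def presence(word, texts):
    """1 if the word occurs in a text, else 0, for every text."""
    return [1 if word in text.split() else 0 for text in texts]

# Made-up mini corpus: one string per text.
texts = [
    "europa schweiz abkommen",
    "europa abkommen verhandlung",
    "schweiz energie",
    "energie strategie",
]

print(correlation(presence("europa", texts), presence("abkommen", texts)))
print(correlation(presence("europa", texts), presence("schweiz", texts)))
```

In this toy corpus, "europa" and "abkommen" occur in exactly the same texts and are therefore perfectly correlated, while "europa" and "schweiz" share only one of their texts.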

Topic Modeling

Topic modeling refers to a group of algorithms that are suited to identifying thematic structures in large collections of texts. Such thematic structures are operationalized in the form of word lists (called topics), which represent probability distributions over the words in a corpus. Words that often occur together in texts form topics and show a large degree of thematic coherence. Calculating topics for a corpus therefore gives an overview of the thematic range of the corpus. Furthermore, topic modeling provides information on the distribution of topics in every single text of the corpus. As a result, it is possible to show the development of a specific topic over a certain period of time (since the date of creation is available as metadata for every text).

In the workbench, pre-calculated topic models can be analyzed (currently, this is only possible for the corpus SWISS_AL_DE_COVID19). You can inspect the topic model either as a simple data frame (the top 30 words of every topic will be displayed) or via LDAvis, a special visualization tool for topic models introduced in Sievert/Shirley (2014). Caution: LDAvis uses a different topic index! Furthermore, you can analyze the temporal development (starting in 2010) of up to five selected topics, either for the whole corpus or for any combination of the pre-defined classes of corpus sources (media, politics, industry, research, civil society). The results will be displayed as a separate line graph for each topic (line graphs are labeled with the top 3 words of the respective topic).
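
The two views on a topic model described above, the top words of a topic and the development of a topic over time, can be sketched in Python as follows. All numbers are invented, the function names are made up, and averaging the topic shares of all texts per year is only one plausible way of aggregating; the workbench may calculate the temporal development differently.

```python
# Hypothetical topic-word probabilities (truncated to a few words per topic).
topics = {
    0: {"virus": 0.30, "impfung": 0.25, "spital": 0.20, "maske": 0.15, "test": 0.10},
    1: {"wirtschaft": 0.35, "kurzarbeit": 0.30, "kredit": 0.20, "firma": 0.15},
}

def top_words(topic, n=3):
    """Return the n most probable words of a topic."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Hypothetical texts: (year of creation, share of each topic in the text).
docs = [
    (2020, {0: 0.8, 1: 0.2}),
    (2020, {0: 0.6, 1: 0.4}),
    (2021, {0: 0.3, 1: 0.7}),
]

def topic_share_by_year(docs, topic_id):
    """Average share of one topic over all texts of each year."""
    sums, counts = {}, {}
    for year, distribution in docs:
        sums[year] = sums.get(year, 0.0) + distribution.get(topic_id, 0.0)
        counts[year] = counts.get(year, 0) + 1
    return {year: sums[year] / counts[year] for year in sorted(sums)}

print(top_words(topics[0]))           # label for the line graph of topic 0
print(topic_share_by_year(docs, 0))   # one value per year
```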

If you would like to read more on topic modeling in general, we recommend: Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.

If you need an introduction to LDAvis, we recommend: Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70.

TensorBoard: Word Embeddings

Distributional semantic models, in particular so-called word embeddings, have recently gained popularity. Distributional semantic models are vector representations of words based on their co-occurrence frequencies with other words in the corpus. This leads to an n-dimensional vector space in which the geometric relationship between words reflects their semantic relationship. Words that are often used in the same contexts are very close to each other in the model, i.e. searching for the so-called nearest neighbors of a word in a word embedding model returns semantically similar words.
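
A nearest-neighbor search in such a vector space can be sketched in a few lines of Python. The three-dimensional vectors below are invented for illustration (real models have far more dimensions), the function names are made up, and cosine similarity is assumed as the measure of closeness.

```python
import math

# Invented toy vectors; real embeddings are learned from a corpus.
vectors = {
    "virus": [0.9, 0.1, 0.0],
    "erreger": [0.8, 0.2, 0.1],
    "impfung": [0.6, 0.5, 0.1],
    "budget": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, vectors, k=2):
    """Return the k words whose vectors are most similar to the word's."""
    scored = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]

print(nearest_neighbors("virus", vectors))
```

With these toy vectors, "erreger" and "impfung" come out as the nearest neighbors of "virus", while "budget" points in a very different direction.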

Pre-calculated word embedding models exist for several corpora. They are visualized in a three-dimensional space via TensorBoard. To start the visualization for a specific corpus, please do the following:

  • On the workbench website, click on "Tensorboard" in the right menu bar. You will be redirected to TensorBoard.
  • In the upper orange bar, click on "Projector". A word embedding model for a pre-defined corpus will be loaded (you can change the selected corpus in the next step).
  • In the drop-down menu under "DATA" in the left bar, select the corpus whose word embedding model you would like to see. The associated model will be loaded.
  • You can inspect the model by using the search menu on the right menu bar, e.g. by entering a word in the search field. The word embedding models are calculated using lemmas, i.e. inflected word forms are subsumed under a lemma. You can enter up to three words separated by "_" in the search field (e.g. "politisch_Interesse" to find neighbors for phrases like "politisches Interesse" and "politische Interessen"). You need to click on the search term in the list of automatically suggested terms. The 100 nearest neighbors will be displayed.
  • By clicking on "Isolate 101 points", you can restrict the displayed vector space to the search word and its 100 nearest neighbors.

A paper showing the potential of word embeddings for discourse analysis is:

  • N. Bubenhofer, S. Calleri & Ph. Dreesen, «Politisierung in rechtspopulistischen Medien: Wortschatzanalyse und Word Embeddings», Osnabrücker Beiträge zur Sprachtheorie 95, pp. 211–241, 2019.

If you would like to read more on the general principle behind word embeddings, we recommend:

  • A. Lenci, «Distributional Models of Word Meaning», Annual Review of Linguistics, 4 (1), pp. 151–171, 2018, doi: 10.1146/annurev-linguistics-030514-125254.