Swiss-AL Media Corpora

Swiss-AL Media Corpora contain only articles from journalistic media in German, French, Italian and Rhaeto-Romance. Since 2021, all articles in German and French have been obtained via Swissdox@LiRi and are provided by the Swiss Media Database. The corpora are published in December each year and cover the preceding five years. They serve as the data basis for selecting the Swiss word of the year (WDJ: Wort des Jahres). Since 2021, the French and German corpora have had to be downsampled, because the number of available articles would otherwise make the resulting corpora too large for the corpus analysis tools. A stratified sampling method was used: 25% of all texts per source per week were randomly chosen.
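The per-source, per-week sampling described above can be illustrated with a minimal Python sketch. It is not the Swiss-AL pipeline itself: the function name `stratified_sample`, the record fields `source` and `published`, the use of ISO calendar weeks, the rounding of each stratum's quota, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from datetime import date


def stratified_sample(articles, fraction=0.25, seed=0):
    """Draw `fraction` of the articles from each (source, week) stratum.

    `articles` is a list of dicts with a "source" string and a
    "published" date (hypothetical field names).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group articles by source and ISO (year, week); the week
    # definition is an assumption, the source only says "per week".
    strata = defaultdict(list)
    for article in articles:
        iso = article["published"].isocalendar()
        strata[(article["source"], iso[0], iso[1])].append(article)

    sample = []
    for key in sorted(strata):
        group = strata[key]
        # Rounding the quota is an assumption; very small strata
        # may therefore contribute no text at all.
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample
```

Sampling within each stratum, rather than over the whole collection, keeps the relative weight of each source and each week roughly the same as in the full data.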

Swiss-AL Media Corpora Releases 2020

The texts in the corpus were published online on the media websites listed below. No paywalled data is included in the corpus. The release covers the period from October 2015 to October 2020.

S_AL_WDJ20_DE

Sources
acronym texts class subclass source
blick 209099 media online Blick
grenchnertagblatt 199140 media daily_newspaper Grenchner Tagblatt
watson 100786 undefined undefined Watson
tagesanzeiger 87582 media daily_newspaper Tagesanzeiger
basellandschaftlichezeitung 82027 media daily_newspaper Basellandschaftliche Zeitung
srf 79939 media online Schweizer Radio und Fernsehen
nzz 58377 media online Neue Zürcher Zeitung
suedostschweiz 55635 media daily_newspaper Südostschweiz
bazonline 48634 media daily_newspaper Basler Zeitung
blickamabend 18247 media online Blick am Abend
derbund 15311 media daily_newspaper Der Bund
woz 8159 media weekly_newspaper Die Wochenzeitung
coopzeitung 5807 media weekly_newspaper Coop Zeitung
20min 4479 media online 20 Minuten
migroszeitung 379 media weekly_newspaper Migros Magazin

Note: Because the corpus is quite large, results take some time to load. For performance reasons, the LDA topic model was calculated on a sample of 400,000 texts.

S_AL_WDJ20_IT

Sources
acronym texts class subclass source
rsinews 79902 media online Radiotelevisione Svizzera
tio 78834 media online Ticinonline
cdt 46113 media weekly_newspaper Corriere del Ticino
ticinonews 33858 media online Ticino News
gdp 31647 media daily_newspaper Giornale del popolo
laregione 11445 media daily_newspaper La Regione
azione 6459 media weekly_newspaper Azione
mattinonline 5273 media online Il Mattino Online
coopzeitung 1859 media weekly_newspaper Coop Zeitung

S_AL_WDJ20_FR

Sources
acronym texts class subclass source
lematin 97619 media daily_newspaper Le Matin
24heures 71526 media online 24 Heures
letemps 66452 media daily_newspaper Le Temps
rts 59864 media online Radio Télévision Suisse
tdg 36043 media daily_newspaper Tribune de Genève
lagefi 14247 media daily_newspaper L'Agefi
ghi 13033 media weekly_newspaper Genève home informations
onefm 7039 media online One FM
lecourrier 3359 media daily_newspaper Le Courrier
coopzeitung 2240 media weekly_newspaper Coop Zeitung
20min 1211 media online 20 Minuten
leman 536 media online Leman Bleu
migroszeitung 168 media weekly_newspaper Migros Magazin

S_AL_WDJ20_RM

This corpus contains texts from RTR (Radiotelevisiun Svizra Rumantscha) and is a first attempt to build a media corpus in Rumantsch. It was used for the Swiss "Word of the Year" and contains data from 2011 to 2020, though with very few tokens for the years 2011-2014.