Finnish and Estonian LM training data

Conversational Finnish and Estonian data sets

Here are a couple of data sets that I have collected from the web for training language models:

The data were collected by crawling conversation sites and originally comprised 2.7 billion Finnish words and 340 million Estonian words. The text has been normalized and filtered to match transcribed conversations, duplicate lines have been removed, and the sentences have been shuffled. The filtering is described in Kurimo et al. (2016), "Modeling under-resourced languages for speech recognition".
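
For readers who want to apply similar post-processing to their own crawled text, the last two steps (removing duplicate lines and shuffling sentences) might look roughly like the sketch below. This is only an illustration, not the code used to build these data sets: the function name `dedupe_and_shuffle` and the `seed` parameter are my own assumptions, and the normalization and filtering described in Kurimo et al. (2016) are not reproduced here.

```python
import random


def dedupe_and_shuffle(lines, seed=0):
    """Drop empty and exact-duplicate lines, then shuffle what remains.

    Hypothetical helper for illustration; it assumes one sentence per line
    and that the whole corpus fits in memory.
    """
    seen = set()
    unique = []
    for line in lines:
        text = line.strip()
        # Keep only the first occurrence of each non-empty line.
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    # A seeded generator keeps the shuffle reproducible between runs.
    rng = random.Random(seed)
    rng.shuffle(unique)
    return unique


if __name__ == "__main__":
    sample = ["moi\n", "tere\n", "moi\n", "", "hei"]
    for sentence in dedupe_and_shuffle(sample):
        print(sentence)
```

For corpora of billions of words an in-memory set is unlikely to be practical, so an external sort or a hash-based approach would probably be needed instead.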