Conversational Finnish and Estonian data sets
Here are a couple of data sets that I have collected from the web for training language models:
- Conversational Finnish (73 million words, 197 MB)
- Conversational Estonian (80 million words, 204 MB)
The data were collected by crawling conversation sites; the raw crawl contained 2.7 billion Finnish words and 340 million Estonian words. The text was then normalized and filtered to match transcribed conversations, duplicate lines were removed, and the sentences were shuffled. The filtering is described in Kurimo et al. (2016), Modeling under-resourced languages for speech recognition.
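
As a rough illustration of the last two steps (removing duplicate lines and shuffling the sentences), here is a minimal Python sketch. It is not the script that was used to build these data sets, and it does not attempt the normalization or the filtering described in Kurimo et al. (2016). The function name `dedupe_and_shuffle` and the `seed` parameter are hypothetical, and the sketch assumes one sentence per line and a corpus small enough to fit in memory.

```python
import random


def dedupe_and_shuffle(lines, seed=0):
    """Drop empty and exact duplicate lines, then shuffle what is left.

    Assumption: one sentence per line, and the whole corpus fits in memory.
    """
    seen = set()
    unique = []
    for line in lines:
        text = line.strip()
        # Keep only the first occurrence of each non-empty line.
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    # A seeded generator makes the shuffle reproducible between runs.
    rng = random.Random(seed)
    rng.shuffle(unique)
    return unique


if __name__ == "__main__":
    sample = ["moi", "tere", "moi", "", "no niin"]
    for sentence in dedupe_and_shuffle(sample):
        print(sentence)
```

For a crawl of billions of words, holding every line in a set would likely be too memory-hungry; sorting on disk or hashing lines would be a more realistic approach, but that is beyond this sketch.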