Finnish and Estonian LM training data

Conversational Finnish and Estonian data sets

Here are a couple of data sets that I have collected from the web for training language models:

Conversational Finnish (73 million words, 197 MB)
Conversational Estonian (80 million words, 204 MB)

The data were collected by crawling conversation sites. The raw crawl contained 2.7 billion Finnish words and 340 million Estonian words; the sizes listed above are what remained after processing. The text was normalized and filtered to match transcribed conversations, duplicate lines were removed, and the sentences were shuffled. The filtering is described in Kurimo et al. (2016), "Modeling under-resourced languages for speech recognition".
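
To give a rough idea of the last two steps (removing duplicate lines and shuffling the sentences), here is a minimal Python sketch. It is not the script that produced these data sets, and it leaves out the normalization and filtering described in the paper. The function name `dedupe_and_shuffle`, the `seed` parameter, and the toy sentences are all made up for illustration.

```python
import random


def dedupe_and_shuffle(lines, seed=0):
    """Drop empty and duplicate lines, then shuffle what is left.

    Hypothetical helper for illustration only; the real pipeline is not shown here.
    """
    # dict.fromkeys keeps insertion order, so the first copy of each line survives.
    unique = list(dict.fromkeys(line.strip() for line in lines if line.strip()))
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(unique)
    return unique


# Toy input: one repeated line and one empty line.
corpus = ["moi mitä kuuluu", "ihan hyvää", "moi mitä kuuluu", "", "tere kuidas läheb"]
result = dedupe_and_shuffle(corpus)
```

For a corpus of this size the set of unique lines fits in memory on a typical machine, but for the multi-billion-word raw crawl one would more likely sort and deduplicate on disk.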

Seppo Enarvi