Finnish and Estonian LM training data

Conversational Finnish and Estonian data sets

Here are a couple of data sets that I have collected from the web for training language models.

The data, originally 2.7 billion Finnish words and 340 million Estonian words, have been collected by crawling conversation sites. The text has been normalized and filtered to match transcribed conversations, duplicate lines have been removed, and the sentences have been shuffled. The filtering is described in Kurimo et al. (2016), Modeling under-resourced languages for speech recognition.
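The last two steps mentioned above, removing duplicate lines and shuffling the sentences, can be sketched in a few lines of Python. This is only an illustrative sketch, not the script that was used to build the data sets: the function name `dedupe_and_shuffle` and the `seed` parameter are hypothetical, and the normalization and filtering steps are left out because their details are in the cited paper.

```python
import random


def dedupe_and_shuffle(lines, seed=0):
    """Drop empty and duplicate lines, then shuffle what is left.

    Hypothetical helper for illustration only. It assumes one sentence
    per line and that the whole corpus fits in memory.
    """
    seen = set()
    unique = []
    for line in lines:
        text = line.strip()
        # Skip blank lines and lines that have already been seen.
        if not text or text in seen:
            continue
        seen.add(text)
        unique.append(text)
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(unique)
    return unique


if __name__ == "__main__":
    corpus = ["moi", "tere", "moi", "", "  tere  ", "hei"]
    for sentence in dedupe_and_shuffle(corpus):
        print(sentence)
```

For a corpus with billions of words the in-memory set would likely be too large, so a sort-based or hash-based on-disk approach would be a more realistic choice.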


Seppo Enarvi
