Korean Tokenization & Lemmatization

Note: if you have labeled data and are not planning to operate directly on words, your solution could well be a pre-trained network such as KoBERT, or something even fancier like KoELECTRA. If you are looking for a “lighter” approach, you can build your vocabulary with something from HuggingFace’s tokenizers library and train whatever you want.
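For the “lighter” route, here is a minimal sketch of training a subword vocabulary with the tokenizers library; the corpus path, vocabulary size, and special tokens are placeholders you would tune for your own data:

```python
# pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Byte-pair-encoding model with an explicit unknown token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Korean separates eojeols with whitespace, so a whitespace pre-tokenizer is a sane default
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=8000, special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"])
tokenizer.train(files=["korean_corpus.txt"], trainer=trainer)  # placeholder corpus file

tokenizer.save("ko-bpe.json")
print(tokenizer.encode("결정하겠다").tokens)  # subword pieces, not morphemes
```

Keep in mind that BPE pieces are statistical subwords, not linguistic morphemes, which is exactly why the rest of this post exists.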

Two points about Korean and its neighbors, Chinese and Japanese, pushed me to write this:
  1. At first sight, they resemble each other and differ from most European languages, but that is not quite true.
  2. There are a lot of tools that can easily be found for Chinese and Japanese, but not for Korean.

Because of the second point, I decided to briefly share the knowledge I’ve acquired. I am grateful to the people who speak these languages and answered some of my questions. Also, if you want to add or correct something, you are welcome! So, let’s start.

Important language points:

  • Unlike Chinese and Japanese, Korean uses whitespace to separate words!
  • The pieces separated by spaces are called eojeols, and eojeols consist of syllables.
  • Eojeols are formed of morphemes: usually there is a “root”, and the other morphemes modify it (much like prefixes and suffixes modify English words to create a verb or an adjective). For example, 결정하겠다 (“I will decide”) consists of “결정”, “하”, “겠” and “다”. The first morpheme, “결정”, is the root and translates as “decision”. Because of that, simply splitting a sentence by whitespace and leaving the tokens as they are is probably the wrong solution* (the example above was taken from a really useful video, check it out); see the sketch after the footnote below.
  • The “root” morphemes can have a normalized form (a.k.a. a lemma).

*Be aware that, in my experience, an eojeol is sometimes split too much; in my case, some eojeols could be found in the word graph without this kind of pre-processing.
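To make the whitespace-versus-morphemes point concrete, here is a small sketch using KoNLPy’s Okt tagger. KoNLPy is not one of the tools reviewed below; I use it here only because it is easy to install, and the exact segmentation varies between analyzers:

```python
# pip install konlpy  (Okt, the Open Korean Text tagger, requires a JVM)
from konlpy.tag import Okt

okt = Okt()
eojeol = "결정하겠다"  # the example from above: "I will decide"

# Naive whitespace splitting leaves the inflected eojeol as a single token
print(eojeol.split())              # ['결정하겠다']

# A morphological analyzer splits it into morphemes instead
print(okt.morphs(eojeol))          # e.g. ['결정', '하겠다']; granularity depends on the tagger

# stem=True returns dictionary (lemma-like) forms of the roots
print(okt.pos(eojeol, stem=True))  # e.g. [('결정하다', 'Verb')]
```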

Open-source tools:

Luckily, I stumbled upon the paper Korean Morphological Analysis with Tied Sequence-to-Sequence Multi-Task Model [2019], which uses an LSTM with a pointer-generator network and a CRF on top. Their approach was state of the art for the mentioned tasks at the time. As for the reason to use a seq2seq architecture: the decoder generates the base form of each morpheme, so the model does everything at once. Another good point of this paper is that it has an open (and working) repository and even links to pre-trained weights 🙌. You can easily adapt their inference code to your needs (the repo is written in PyTorch).
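To illustrate the seq2seq framing (my own sketch, not code from their repository): the encoder reads an eojeol as a sequence of syllables, and the decoder emits lemmatized morphemes together with their POS tags in one pass, roughly like this:

```python
# Hypothetical input/output pair for the tied seq2seq model described above;
# the tags are Sejong-style and purely illustrative.
src = list("결정하겠다")                        # ['결', '정', '하', '겠', '다']
tgt = ["결정/NNG", "하/XSV", "겠/EP", "다/EF"]  # one lemmatized morpheme + tag per decoder step
```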

The paper references one more open tool, khaiii, which avoids RNN architectures for the sake of speed and uses a CNN instead; besides, part of the repository is written in C++, so it is probably faster, if a little less accurate. I didn’t try to launch it, BUT there is an important note: khaiii doesn’t provide any English documentation, which is a really common problem for Asian NLP repositories.
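For completeness, and with the caveat that I haven’t run it either, khaiii’s Python binding is usually shown along these lines; treat the class and attribute names as assumptions taken from the project’s (Korean-only) README:

```python
# khaiii is built from source (C++ core plus a Python binding); see the project repo
from khaiii import KhaiiiApi

api = KhaiiiApi()
for word in api.analyze("결정하겠다"):  # one analyzed entry per eojeol
    # each analyzed word carries its morphemes with lexical forms and POS tags
    print(word.lex, [(m.lex, m.tag) for m in word.morphs])
```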

There is one more neural-based solution that is just a tokenizer: PyKoSpacing, which restores word spacing in Korean text (which amounts to splitting it into eojeols). It’s a little outdated compared to the previous repositories and has a pretty interesting architecture, but then again, maybe you like Keras more.
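A minimal usage sketch, assuming the current PyKoSpacing API (older releases exposed a plain spacing function instead of the Spacing class):

```python
# pip install git+https://github.com/haven-jeon/PyKoSpacing.git
from pykospacing import Spacing

spacing = Spacing()
# Restore the spaces in text written without them
print(spacing("결정하겠다고말했다"))  # expected, roughly: "결정하겠다고 말했다"
```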

Instead of a summary
