Note: If you have labeled data and are not planning to operate directly on words, your solution could probably be a pre-trained network such as KoBERT, or the even fancier KoELECTRA. If you are looking for a "lighter" approach, you can build your vocabulary with something from HuggingFace's tokenizers library and train whatever you want.
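As an aside, the subword-vocabulary idea behind such tokenizer libraries can be sketched in a few lines. The following is a naive byte-pair-encoding merge loop written from scratch purely for illustration (it is not the actual `tokenizers` API, and the toy word list is made up): the most frequent adjacent symbol pair is repeatedly fused into a new vocabulary symbol.

```python
from collections import Counter

def byte_pair_merges(words, num_merges):
    """Learn up to `num_merges` byte-pair merges from a list of words.

    Each word starts as a tuple of characters; on every round the most
    frequent adjacent symbol pair is fused into a new symbol.
    """
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Re-tokenize the corpus with the newly merged symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus
```

The real library does the same kind of counting far more efficiently and adds normalizers, pre-tokenizers, and special-token handling on top.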
A few weeks ago I had to process a few Asian languages: Chinese, Japanese, and Korean. To work with them properly, I first had to understand their structure (how words are formed and how they then combine into sentences). While searching for answers, I found out two things:
- At first sight, they resemble each other and differ from most European languages, but that's not quite true.
- There are plenty of easily found tools for Chinese and Japanese, but not for Korean.
Because of the latter, I decided to briefly share the knowledge I've acquired. I am grateful to the people who speak these languages and could answer some of my questions. Also, if you want to add or correct something, you are welcome! So, let's start.
Important language points:
- Korean is written with an alphabet (Hangul) rather than logographic characters
- Unlike Chinese and Japanese, Korean uses whitespace to separate words!
- The pieces separated by spaces are called eojeols, and eojeols consist of syllables
- Eojeols are formed of morphemes; usually there is a "root" that the other morphemes modify (much as prefixes and suffixes modify English words to create a verb or an adjective). For example, 결정하겠다 ("I will decide") consists of 결정, 하, 겠, and 다. The first morpheme, 결정, is the root and translates as "decision". Because of that, simply splitting the sentence on whitespace and leaving the tokens as they are is probably the wrong solution* (the example above was taken from a really useful video, check it out)
- The "root" morphemes can have a normalized form (a.k.a. a lemma)
*In my experience, be aware that eojeols are sometimes split too aggressively; in my case, some eojeols only appeared in the word graph without this kind of pre-processing
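The structural points above can be poked at directly in code: precomposed Hangul syllables occupy a single Unicode block (U+AC00 to U+D7A3) whose layout encodes the component jamo arithmetically, so an eojeol splits into syllables by plain iteration, and each syllable decomposes without any external library. A minimal sketch (the decomposition formula comes from the Unicode standard; the example string is the one from the bullet list):

```python
# The three jamo alphabets, in the order the Unicode syllable block uses.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"                 # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"          # 21 medial vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + "no final"

def decompose(syllable):
    """Split one precomposed Hangul syllable into (lead, vowel, tail) jamo."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code <= 11171:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(code, 21 * 28)   # each lead spans 21 vowels x 28 tails
    vowel, tail = divmod(rest, 28)       # each vowel spans 28 tails
    return LEADS[lead], VOWELS[vowel], TAILS[tail]

# Each syllable is a single code point, so iteration splits an eojeol cleanly:
for syllable in "결정하겠다":
    print(syllable, decompose(syllable))
```

Note that this only recovers syllable structure; the morpheme boundaries (결정 / 하 / 겠 / 다) still require an analyzer, which is exactly the problem discussed next.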
Taking the information above into consideration, at least three steps are required: split eojeols into morphemes, tag each morpheme to identify the root, and finally normalize the root. Quick googling usually yields one library that solves these tasks: KoNLPy. It wraps several morpheme-analyzer tools, such as JHannanum, Kkma, KOMORAN, MeCab-ko, and Open Korean Text. Most of them use Hidden Markov Models or just the Viterbi algorithm. If you are interested in these models, there is a great article that explains how MeCab-like libraries work (MeCab was originally developed for Japanese). I haven't found any information on how accurate each of them is, so I delved deeper to find something fresher.
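To give a feel for the Viterbi approach these analyzers rely on, here is a deliberately tiny sketch: a unigram lexicon with made-up probabilities and a dynamic program that picks the highest-scoring segmentation of an eojeol. Real analyzers additionally score part-of-speech transitions between adjacent morphemes (that is where the HMM comes in), which this omits entirely:

```python
import math

# Toy morpheme lexicon with invented unigram probabilities -- purely for
# illustration; real analyzers learn these weights from annotated corpora.
LEXICON = {"결정": 0.4, "하": 0.2, "겠": 0.15, "다": 0.15, "결": 0.05, "정": 0.05}

def segment(eojeol):
    """Return the max-log-probability split of `eojeol` into known morphemes.

    best[i] holds (score, morpheme list) for the best segmentation of the
    first i characters; None means that prefix cannot be segmented.
    """
    n = len(eojeol)
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            piece = eojeol[j:i]
            if best[j] is not None and piece in LEXICON:
                score = best[j][0] + math.log(LEXICON[piece])
                if best[i] is None or score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[n][1] if best[n] is not None else None
```

With these numbers, `segment("결정하겠다")` prefers keeping 결정 whole over splitting it into 결 + 정, because one probable morpheme outscores two improbable ones.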
Luckily, I stumbled upon a paper, Korean Morphological Analysis with Tied Sequence-to-Sequence Multi-Task Model, that uses an LSTM with a pointer-generator network and a CRF on top. Their approach was SOTA for the mentioned tasks at the time. As for the reason to use a seq2seq architecture: the decoder generates the base form of each morpheme, so the model does everything at once. Another good point of this paper is that they have an open (and working) repository and even link to pre-trained weights 🙌. You can easily adapt their inference code to your needs (the repo is written in PyTorch).
The paper references one more open tool, khaiii, which avoids RNN architectures for speed reasons and uses a CNN instead; besides, part of the repository is written in C++, so it is probably faster, though a little less accurate. I didn't try to launch it, BUT there is an important note: khaiii doesn't provide any English documentation, which is a really common problem for Asian NLP repositories.
There is one more neural-based solution, which is just a tokenizer: PyKoSpacing. It's a little outdated compared to the previous repositories, but it has a pretty interesting architecture, and maybe you'd like Keras more.
Instead of a summary
If you happen to contribute to an open-source package, whether it's about NLP or not, duplicate the README or documentation in English too. It will definitely help the community grow faster 😉