Some time ago, in a video I can no longer place, I heard the idea of freezing the embedding matrix while fine-tuning a pre-trained language model. I have forgotten where it was said, but I have usually done it ever since. I decided to check whether it is actually needed, and I share the results in this article. I found a similar experiment described in this post, but it used GloVe vectors to initialize the embedding matrix of a BiLSTM that was not pre-trained, and it did not explore the difference between the validation and training vocabularies. I also found two papers, “To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks” and “What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning”, that explore the number of frozen layers, but neither analyzes the embedding matrix specifically.
The intuition is that the embedding layer holds an independent representation of each token, and during pre-training the vectors of tokens with similar meanings should end up close together in the embedding space, so that the subsequent layers treat them in a similar way. But when you have only a small amount of labeled data, which is common in real-world use cases, the data you run inference on will most likely contain tokens that did not appear in the training set. Some of these new tokens will probably have synonyms in your labeled data, but gradient updates change mainly the vectors of tokens that appear in the training data, which can destroy this kind of connection learned by the language model.
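As a minimal sketch of what freezing means in practice, the snippet below uses a toy PyTorch model rather than the real XLM-RoBERTa. The `TinyClassifier` class and the `freeze_embeddings` helper are hypothetical names made up for illustration. For a Hugging Face Transformers model, the same idea should apply to the module returned by `model.get_input_embeddings()`.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Toy stand-in for a pre-trained encoder with a classification head."""

    def __init__(self, vocab_size: int = 100, hidden: int = 16, num_labels: int = 3):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool the token vectors, then classify.
        return self.head(self.embeddings(input_ids).mean(dim=1))


def freeze_embeddings(model: TinyClassifier) -> None:
    """Exclude the embedding matrix from gradient updates (hypothetical helper)."""
    model.embeddings.weight.requires_grad_(False)


torch.manual_seed(0)
model = TinyClassifier()
freeze_embeddings(model)

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1
)

before = model.embeddings.weight.clone()
input_ids = torch.randint(0, 100, (4, 10))  # random toy batch
labels = torch.randint(0, 3, (4,))
loss = nn.functional.cross_entropy(model(input_ids), labels)
loss.backward()
optimizer.step()

# The head was updated; the embedding matrix was not touched.
embeddings_unchanged = torch.equal(before, model.embeddings.weight)
print(embeddings_unchanged)
```

Because no gradient is computed or stored for the frozen matrix, each training step also has less work to do, which matters more the larger the vocabulary is.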
For the experiment, I chose the multilingual data from the current Kaggle competition “Contradictory, My Dear Watson”. It is a natural language inference task: given a pair of sentences, a premise and a hypothesis, you need to predict whether the relationship between them is entailment, neutral, or contradiction. The data consists of sentences in 15 languages, which suits the experiment because it makes it easy to get a validation set with a good number of non-overlapping tokens. The model I chose is XLM-RoBERTa, whose large vocabulary of 250k tokens also suits our goals. I solved the task in the simplest way: concatenate the premise and hypothesis with a special token and feed the result to the model as a single sequence.
First, I split the available 12k rows into train (8120) and validation (4000) sets. Then I created 4 variations of the training dataset from the train split: full, 6k, 4k, and 2k rows. I then checked the token overlap between each training variation and the validation set:
We can see that as the amount of training data decreases, the number of validation tokens unseen during training increases.
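The overlap check comes down to a set difference between the two vocabularies. Here is a minimal sketch, assuming whitespace tokenization as a stand-in for the model's real subword tokenizer. The `unseen_tokens` function and the toy sentences are made up for illustration and are not the competition data.

```python
def unseen_tokens(train_texts, valid_texts, tokenize=str.split):
    """Return the validation tokens that never occur in the training texts."""
    train_vocab = {tok for text in train_texts for tok in tokenize(text)}
    valid_vocab = {tok for text in valid_texts for tok in tokenize(text)}
    return valid_vocab - train_vocab


# Toy sentences, not the competition data.
train = ["the cat sleeps", "a dog barks"]
valid = ["the dog sleeps", "a bird sings"]

full = unseen_tokens(train, valid)
small = unseen_tokens(train[:1], valid)  # a smaller training subset

print(sorted(full))   # ['bird', 'sings']
print(sorted(small))  # ['a', 'bird', 'dog', 'sings']
```

Shrinking the training subset can only keep or enlarge the set of unseen validation tokens, which is the effect described above.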
The implementation is based on the Transformers library. For optimization, I chose RAdam because it does not require tuning a warm-up schedule that depends on the data size. I used a batch size of 8 with a maximum sequence length of 512, and packed examples of similar length into the same batch to speed up training. Accuracy was the metric. More details can be found on GitHub.
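Packing examples of similar length into one batch can be sketched as sorting by length and slicing. This is an assumption about one simple way to do it, not necessarily what the code on GitHub does, and both function names are hypothetical.

```python
def length_bucketed_batches(sequences, batch_size=8, max_len=512):
    """Group token-id sequences of similar length into batches."""
    truncated = [seq[:max_len] for seq in sequences]
    # Sort indices by length so that neighbours need little padding.
    order = sorted(range(len(truncated)), key=lambda i: len(truncated[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batches.append([truncated[i] for i in idx])
    return batches


def padding_needed(batches):
    """Count pad positions if each batch is padded to its longest member."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)


# Toy sequences of token ids with alternating short and long lengths.
lengths = [5, 40, 7, 38, 6, 41, 8, 39]
seqs = [[0] * n for n in lengths]

naive = [seqs[i:i + 4] for i in range(0, len(seqs), 4)]
bucketed = length_bucketed_batches(seqs, batch_size=4)

print(padding_needed(naive))     # 140
print(padding_needed(bucketed))  # 12
```

In a real training loop you would probably also shuffle the order of the resulting batches each epoch, so the model does not always see short examples first.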
Finally, I did 8 runs: 4 training set sizes, each with a frozen embedding matrix (model_fr) and without freezing (model_unfr). I trained every model for the same number of gradient updates. The validation metric curves look as follows:
From the plots above, we can see that there is no noticeable difference between model_fr and model_unfr. Still, at the end of training model_fr is a little better in all cases (in the experiment with 2k rows, however, model_unfr reached its highest metric at the 15th of 40 epochs, and that value is higher than the best metric of model_fr). I also logged the gradients per layer, hoping to get some insights, but they look similar for all experiment pairs too. You can check the logs in the Colab notebook yourself.
Then I decided to check the correlation between the number of new tokens [set(validation tokens) - set(training tokens)] and the difference between the best validation metrics [best validation metric of model_fr - best validation metric of model_unfr]:
So, as I mentioned above, freezing improves the validation accuracy in 3 of the 4 experiments, but not by much. I also expected to see the points form a line from the middle (0.00) to the upper right, which would indicate a positive correlation between the number of validation tokens unseen during training and the gap between the scores of model_fr and model_unfr. It would probably be a good idea to run more experiments, since 8 runs are too few to draw conclusions (with more results it would be possible to run a statistical test and measure significance), but even these take a fair amount of compute time. On the other hand, all experiments without freezing take ~1.6 times longer to train.
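With more runs, the relationship between unseen tokens and the accuracy gap could be quantified with a correlation coefficient. Below is a minimal sketch of Pearson's r in plain Python. The input numbers are placeholders chosen to be perfectly linear, purely for illustration. They are not the results of these experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only, NOT the measured results.
unseen_counts = [100, 200, 300, 400]
accuracy_gaps = [0.001, 0.002, 0.003, 0.004]

r = pearson(unseen_counts, accuracy_gaps)
print(round(r, 6))
```

A value near 1 would support the expected positive correlation, but with only four pairs of runs any such estimate would be very noisy.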
To sum up, I would say that freezing the embedding matrix is a good idea when you fine-tune a pre-trained language model, especially one with a large vocabulary. It won't give you a big improvement, but it is worth trying in a Kaggle competition. For production models, this setup performs about the same as training without freezing, but it saves training time, which is pretty important.