So far, we have seen that SBERT can be used for information retrieval, clustering, automatic essay scoring, and semantic textual similarity, with fast inference and high accuracy. However, a limitation of SBERT is that it currently supports only English, leaving other languages uncovered. To solve that, we can use a model architecture similar to the Siamese and triplet network structures to extend SBERT to a new language.
The idea is simple. First, we produce sentence embeddings for English sentences with SBERT, which we call the Teacher model. Then we create a new model for our desired language, which we call the Student model, and this model tries to mimic the Teacher model. In other words, the Student model is trained on the original English sentence so that the vector it produces is the same as the one from the Teacher model.
For example, both “Hello World” and “Hallo Welt” are put through the Student model, and the model tries to generate two vectors that are similar to the one the Teacher model produces for “Hello World”. After training, the Student model is expected to be able to encode sentences in both English and the desired language.
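The training objective described above can be sketched as a mean squared error between the Student's vectors and the Teacher's vector. This is only a minimal pure-Python illustration: the function names `mse` and `distillation_loss` are hypothetical, the 3-dimensional vectors are toy values rather than real model outputs, and an actual training run would compute this on tensors.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_src, student_src, student_tgt):
    """Pull both Student vectors toward the Teacher's vector of the source sentence."""
    return mse(student_src, teacher_src) + mse(student_tgt, teacher_src)


# Toy vectors (illustrative only, not real model outputs).
teacher_hello_world = [1.0, 0.0, 2.0]  # Teacher("Hello World")
student_hello_world = [1.0, 0.0, 2.0]  # Student("Hello World")
student_hallo_welt = [0.0, 0.0, 2.0]   # Student("Hallo Welt")

loss = distillation_loss(teacher_hello_world, student_hello_world, student_hallo_welt)
print(loss)
```

The loss reaches zero only when the Student maps both the English sentence and its translation onto the Teacher's vector, which is what lets a single Student model serve both languages.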
Let’s try it from scratch with an example that transfers the English SBERT model to Japanese.
First of all, we need to install the SBERT package (sentence-transformers) and MeCab, a morphological analyzer that segments Japanese sentences into meaningful words.
!pip install -U sentence-transformers
!pip install mecab-python3
Next, some human effort is needed to prepare sentence pairs in both English and Japanese for the Translated Dataset as well as the Semantic Text Similarity Dataset. After preprocessing the Japanese sentences, the data will look like the sample below.
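The preprocessing step could look roughly like the sketch below. It assumes that mecab-python3 is installed together with a dictionary (for example unidic-lite), and that each translated pair is stored as one tab-separated line; the helper names `build_tagger`, `tokenize_japanese`, and `make_parallel_line` are hypothetical, and the exact file layout used in this article may differ.

```python
def build_tagger():
    """Create a MeCab tagger that outputs space-separated tokens (wakati mode).

    Assumes mecab-python3 and a dictionary (for example unidic-lite) are installed.
    """
    import MeCab  # imported lazily so the helpers below work without MeCab
    return MeCab.Tagger("-Owakati")


def tokenize_japanese(text, tagger):
    """Split a Japanese sentence into tokens using any object with a parse() method."""
    return tagger.parse(text).split()


def make_parallel_line(english, japanese, tagger):
    """Format one translated pair as 'english<TAB>tokenized japanese' (assumed layout)."""
    return english.strip() + "\t" + " ".join(tokenize_japanese(japanese, tagger))
```

A call such as `make_parallel_line("Hello World", "こんにちは世界", build_tagger())` would then produce one line of the translated dataset; the exact token boundaries depend on the dictionary MeCab uses.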
I used XLM-RoBERTa as the Student model to create word embeddings (of course, you can try another pre-trained BERT model, e.g. mBERT, if you want), “bert-base-nli-stsb-mean-tokens” from SentenceTransformer as the Teacher model, and mean aggregation as the pooling layer. The other parameters are max_seq_length = 128 and train_batch_size = 64 (if that exceeds your memory limit, you can reduce the batch size to 32 or 16).
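To make the pooling step concrete, here is a small pure-Python illustration of what mean aggregation computes: the average of the token embeddings, skipping padding positions. The function name `mean_pooling` and the toy 2-dimensional vectors are assumptions for illustration; the real pooling layer operates on batched tensors.

```python
def mean_pooling(token_embeddings, attention_mask):
    """Average the embeddings of real tokens, ignoring padding (mask value 0)."""
    if not token_embeddings:
        raise ValueError("token_embeddings is empty")
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in summed]


# Toy input: three token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pooling(tokens, mask)
print(sentence_embedding)
```

Because padded positions are masked out, sentences shorter than max_seq_length get the same embedding regardless of how much padding the batch adds.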
After creating the Teacher and Student models, we can load the train, dev, and test datasets and start training. The train and test sets are the Translated Dataset, while the dev set is the Semantic Text Similarity Dataset, following the structure of the transfer learning SBERT architecture. In this example, I train the model for 20 epochs with learning rate = 2e-5 and epsilon = 1e-6; feel free to try other hyperparameters to get the best results for your language. I also save the model for downstream applications; if you only want to play around, you can turn this off by setting save_best_model = False.
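A training call with these settings might look like the sketch below. It assumes `student_model` is a `SentenceTransformer` from a sentence-transformers version that provides the `fit` method, and that the data loader, loss, and dev evaluator have already been built; the wrapper name `train_student` and the output path are hypothetical.

```python
def train_student(student_model, train_dataloader, train_loss, dev_evaluator,
                  output_path="output/sbert-en-ja", save_best_model=True):
    """Run training with the hyperparameters mentioned in the text.

    `student_model` is assumed to be a sentence_transformers.SentenceTransformer;
    the other objects are assumed to be created beforehand by the caller.
    The default output path is a placeholder, not the one used in the article.
    """
    student_model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        evaluator=dev_evaluator,
        epochs=20,
        optimizer_params={"lr": 2e-5, "eps": 1e-6},
        output_path=output_path,
        save_best_model=save_best_model,
    )
```

Passing `save_best_model=False` skips writing the best checkpoint to `output_path`, which is convenient for quick experiments.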
Finally, let’s enjoy the results. We will evaluate the Student model on both English and Japanese corpora containing sentences with the same meaning.
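One simple way to run such a check is to embed a query and every corpus sentence with the Student model, then rank the corpus by cosine similarity. The sketch below shows only the ranking part in pure Python; the function names `cosine_similarity` and `rank_corpus` are hypothetical, and the 2-dimensional vectors stand in for real sentence embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def rank_corpus(query_embedding, corpus_embeddings):
    """Return corpus indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_embedding, e) for e in corpus_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (illustrative only, not real model outputs).
query = [1.0, 0.0]
corpus = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = rank_corpus(query, corpus)
print(ranking)
```

If the transfer worked, an English query and its Japanese counterpart should retrieve sentences with the same meaning near the top of their respective rankings.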
Let’s check the ability of the Student model on the Japanese corpus.