Implementing Multilingual Translation with T5 and Transformers


Language translation is one of the most important tasks in natural language processing. In this tutorial, you will learn how to implement a powerful multilingual translation system using the T5 (Text-to-Text Transfer Transformer) model and the Hugging Face Transformers library. By the end of this tutorial, you’ll be able to build a production-ready translation system that can handle multiple language pairs. In particular, you will learn:

  • What the T5 model is and how it works
  • How to generate multiple alternatives for a translation
  • How to evaluate the quality of a translation

Let’s get started!


Overview

This post is divided into three parts; they are:

  • Setting up the translation pipeline
  • Translation with alternatives
  • Quality estimation

Setting Up the Translation Pipeline

Text translation is a fundamental task in natural language processing, and it is the task that inspired the original transformer model. T5, the Text-to-Text Transfer Transformer, was introduced by Google in 2019. Its text-to-text formulation and large-scale pre-training, whose task mixture includes supervised translation, make it a capable model for translation tasks.

Text translation in the transformers library is implemented as “conditional generation”, which means the model is generating text conditioned on the input text, just like a conditional probability distribution. Just like all other models in the transformers library, you can instantiate a T5 model in a few lines of code. Before you begin, make sure you have the following dependencies installed:
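You need the transformers library, PyTorch, sentencepiece (required by the T5 tokenizer), and sacrebleu (used later for quality estimation). A typical way to install them is with pip:

```
pip install transformers torch sentencepiece sacrebleu
```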

Let’s see how to create a translation engine using T5:
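A minimal sketch of such a class is shown below. It assumes the t5-base checkpoint and uses illustrative generation settings; adjust both to your needs.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer


class MultilingualTranslator:
    def __init__(self, model_name="t5-base"):
        self.tokenizer = T5Tokenizer.from_pretrained(model_name)
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        self.model.eval()

    def translate(self, text, source_lang, target_lang):
        # T5 is prompted with a plain-text instruction, e.g. "translate English to French: ..."
        prompt = f"translate {source_lang} to {target_lang}: {text}"
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=128,
                num_beams=4,          # beam search width
                length_penalty=0.6,   # >0 promotes longer outputs, <0 promotes shorter ones
                early_stopping=True,  # stop once all beams have finished
            )
        # Decode the generated token IDs back into a string
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
```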

The class MultilingualTranslator instantiates a T5 model and a tokenizer as usual. The translate() method is where the actual translation happens. You can see that it is just text generation with a prompt, and the prompt simply says "translate X to Y". Because it is a text generation task, you can also see the parameters that control the beam search, such as num_beams, length_penalty, and early_stopping.

The tokenizer is called with return_tensors="pt" so that it returns PyTorch tensors; otherwise it would return plain Python lists of token IDs. You need to do that because the model expects PyTorch tensors as input. The default output format depends on the tokenizer implementation, so consult the documentation to use it correctly.

The tokenizer is used again after generation to decode the generated tokens back to text.
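To exercise the class, you can run it over a few language pairs. The example sentences below are illustrative:

```python
translator = MultilingualTranslator("t5-base")

examples = [
    ("The weather is wonderful today.", "English", "French"),
    ("The weather is wonderful today.", "English", "German"),
    ("El clima es maravilloso hoy.", "Spanish", "English"),
]
for text, src, tgt in examples:
    translation = translator.translate(text, src, tgt)
    print(f"{src} -> {tgt}: {translation}")
```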

The output of the above code shows the translation for each language pair.

You can see that the model can translate from English to French or German, but it failed to translate from Spanish to English. This is a limitation of the model itself, most likely related to the language pairs it was trained on. You may need to try another model to see if it works better.

Translation with Alternatives

Translating a sentence into a different language is not a one-to-one mapping. Because of the variation in grammar, word usage, and sentence structure, there are multiple ways to translate a sentence.

Since text generation from the above model uses beam search, you can generate multiple alternatives for a translation natively. You can modify the translate() method to return multiple translations:
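A sketch of the modified method is shown below. The num_alternatives parameter is an illustrative addition; the key changes are num_return_sequences, return_dict_in_generate, and output_scores:

```python
    # Inside MultilingualTranslator: a version of translate() that returns alternatives
    def translate(self, text, source_lang, target_lang, num_alternatives=3):
        prompt = f"translate {source_lang} to {target_lang}: {text}"
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=128,
                num_beams=max(4, num_alternatives),     # beam width must cover the requested alternatives
                num_return_sequences=num_alternatives,  # return several beams instead of one
                length_penalty=0.6,
                early_stopping=True,
                return_dict_in_generate=True,           # return a structured output object
                output_scores=True,                     # include the beam scores
            )
        translations = [
            self.tokenizer.decode(seq, skip_special_tokens=True)
            for seq in outputs.sequences
        ]
        return {
            "translations": translations,
            "scores": outputs.sequences_scores.tolist(),
        }
```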

This modified method returns a dictionary with a list of translations and their scores instead of a single string of text. The model's output is still a tensor of token IDs, and you need to decode each sequence back to text using the tokenizer, one translation at a time.

The scores come from the beam search. Hence, they are always in descending order, and only the highest-scoring sequences are kept for the output.

Let’s see how you can use it:
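For example, continuing with the translator object from before:

```python
result = translator.translate(
    "The weather is wonderful today.", "English", "French", num_alternatives=3
)
for text, score in zip(result["translations"], result["scores"]):
    print(f"{score:.4f}  {text}")
```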

The output lists each candidate translation together with its score.

The scores are negative because they are log probabilities. You should use a more complex sentence to see the variations in translations.

Quality Estimation

The score printed in the code above is the score used in the beam search. It helps the auto-regressive generation complete a sentence while maintaining diversity. Imagine that the model is generating one token at a time, and each step emits multiple candidates. There are multiple paths to complete the sentence, and the number of paths grows exponentially with the number of auto-regressive steps explored. Beam search limits the number of paths to track by scoring each path and keeping the top-k paths only.

Indeed, you can check the probabilities used during beam search. The model provides a method compute_transition_scores() that returns the transition scores of the generated tokens. You can try it out as follows:
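Here is a sketch that reuses the translator object from before; the exact formatting of the printout is up to you:

```python
import torch

prompt = "translate English to French: The weather is wonderful today."
inputs = translator.tokenizer(prompt, return_tensors="pt")
outputs = translator.model.generate(
    **inputs,
    max_length=128,
    num_beams=4,
    return_dict_in_generate=True,
    output_scores=True,
)
# Reconstruct the per-token (transition) scores of the generated sequences
transition_scores = translator.model.compute_transition_scores(
    outputs.sequences, outputs.scores,
    beam_indices=outputs.beam_indices, normalize_logits=True,
)
for out_tok, out_score in zip(outputs.sequences, transition_scores):
    # skip the first token: it is the decoder's start (padding) token and has no score
    for token, score in zip(out_tok[1:], out_score):
        prob = float(torch.exp(score))
        print(f"{translator.tokenizer.decode(token):>12s}  {prob:.2%}")
```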

For the same input text as the previous example, the code prints each generated token alongside its probability.

In the for-loop, you print the token and its score side by side. The first token is always a padding token (T5 uses it as the decoder's start token); hence we match out_tok[1:] with out_score. Each probability corresponds to the token generated at that step. It depends on the preceding sequence of tokens, so the same token may have a different probability at a different step or in a different output sentence. A token with a very high probability is usually dictated by grammar, while a token with a low probability means there were likely alternatives at that position. Note that beam search scores entire sequences rather than greedily picking the most probable token at each step, so the token you see above is not necessarily the single most probable token at that position.

The outputs object also contains outputs.sequences_scores, which holds the final score of each generated sequence: a length-normalized sum of the per-token log probabilities above. You can use it to estimate the quality of the translation.

However, these scores are of limited use as a quality measure since you are not implementing the beam search yourself. They tell you little about the quality of the translation: you cannot compare them across different input sentences, nor across different models.

One popular way to estimate the quality of a translation is to use the BLEU (Bilingual Evaluation Understudy) score. You can use the sacrebleu library to compute the BLEU score of a translation, but you will need a reference translation for the score. Below is an example:
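For instance, you can compare each candidate translation against a reference sentence; the reference used here is illustrative:

```python
import sacrebleu

reference = "Il fait un temps magnifique aujourd'hui."  # human reference translation (illustrative)

result = translator.translate(
    "The weather is wonderful today.", "English", "French", num_alternatives=3
)
for text, score in zip(result["translations"], result["scores"]):
    bleu = sacrebleu.sentence_bleu(text, [reference])
    print(f"model score {score:.4f}  BLEU {bleu.score:5.1f}  {text}")
```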

The output shows the BLEU score of each candidate translation next to the model's own score.

The BLEU score shows how closely the translation matches the reference. It ranges from 0 to 100; the higher the score, the better. You can see that the model's own scoring of the translations does not match the BLEU ranking. On one hand, this highlights that the beam-search score is not a measure of translation quality; on the other hand, the BLEU score itself depends on the reference translation you provide.

Further Readings

Below are some resources that you may find useful:

Summary

In this tutorial, you’ve built a comprehensive multilingual translation system using T5 and the Transformers library. In particular, you’ve learned:

  • How to implement a basic translation system using the T5 model and a prompt
  • How to adjust the beam search to generate multiple alternatives for a translation
  • How to estimate the quality of a translation using BLEU score
