Language translation is one of the most important tasks in natural language processing. In this tutorial, you will learn how to implement a powerful multilingual translation system using the T5 (Text-to-Text Transfer Transformer) model and the Hugging Face Transformers library. By the end of this tutorial, you’ll be able to build a production-ready translation system that can handle multiple language pairs. In particular, you will learn:
- What is the T5 model and how it works
- How to generate multiple alternatives for a translation
- How to evaluate the quality of a translation
Let’s get started!
Implementing Multilingual Translation with T5 and Transformers
Overview
This post is divided into three parts; they are:
- Setting up the translation pipeline
- Translation with alternatives
- Quality estimation
Setting Up the Translation Pipeline
Text translation is a fundamental task in natural language processing, and it inspired the invention of the original transformer model. T5, the Text-to-Text Transfer Transformer, was introduced by Google in 2019. It is well suited to translation because it casts every task as text-to-text generation, and its pre-training mixture includes translation data for several language pairs.
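Because T5 casts every task as text-to-text, the task is selected purely by a natural-language prefix prepended to the input. As a small illustration, these are inputs in the style of T5's pre-training mixture; the same model translates or summarizes depending only on the prefix:

```python
# The task prefix tells T5 what to do with the rest of the input text
translate_input = "translate English to German: The house is wonderful."
summarize_input = "summarize: authorities dispatched emergency crews to survey the damage."
```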
Text translation in the `transformers` library is implemented as "conditional generation": the model generates text conditioned on the input text, just like a conditional probability distribution. Like all other models in the `transformers` library, you can instantiate a T5 model in a few lines of code. Before you begin, make sure you have the following dependencies installed:
```
pip install torch transformers sentencepiece protobuf sacrebleu
```
Let’s see how to create a translation engine using T5:
```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

class MultilingualTranslator:
    def __init__(self, model_name="t5-base"):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {self.device}")
        self.tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
        self.model = T5ForConditionalGeneration.from_pretrained(model_name).to(self.device)

    def translate(self, text, source_lang, target_lang):
        """Translate text from source language to target language"""
        # Make sure the source and target languages are supported
        supported_lang = ["English", "French", "German", "Spanish"]
        if source_lang not in supported_lang:
            raise ValueError(f"Unsupported source language: {source_lang}")
        if target_lang not in supported_lang:
            raise ValueError(f"Unsupported target language: {target_lang}")

        # Prepare the input text with the T5 task prefix
        task_prefix = f"translate {source_lang} to {target_lang}"
        input_text = f"{task_prefix}: {text}"

        # Tokenize and generate translation
        inputs = self.tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
        inputs = inputs.to(self.device)
        outputs = self.model.generate(**inputs, max_length=512, num_beams=4,
                                      length_penalty=0.6, early_stopping=True)

        # Decode and return translation
        translation = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return translation

en_text = "Hello, how are you today?"
es_text = "¿Cómo estás hoy?"
translator = MultilingualTranslator("t5-base")

translation = translator.translate(en_text, "English", "French")
print(f"English: {en_text}")
print(f"French: {translation}")
print()

translation = translator.translate(en_text, "English", "German")
print(f"English: {en_text}")
print(f"German: {translation}")
print()

translation = translator.translate(es_text, "Spanish", "English")
print(f"Spanish: {es_text}")
print(f"English: {translation}")
```
The class `MultilingualTranslator` instantiates a T5 model and a tokenizer as usual. The `translate()` method is where the actual translation happens. You can see that it is just text generation with a prompt, and the prompt simply says "translate X to Y". Because it is a text generation task, you can see the parameters that control the beam search, such as `num_beams`, `length_penalty`, and `early_stopping`.
The tokenizer sets `return_tensors="pt"` to get a PyTorch tensor back; otherwise it returns a plain Python list of token IDs. You need the tensor because that is what the model expects. The default output format depends on the tokenizer implementation, so consult the documentation to use it correctly.
The tokenizer is used again after generation to decode the generated tokens back to text.
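To see what the tokenizer is doing, here is a quick round trip (a minimal sketch; the exact token IDs depend on the tokenizer version):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)

# Without return_tensors, the tokenizer returns a plain Python list of token IDs
encoded = tokenizer("translate English to French: Hello!")
print(encoded["input_ids"])

# With return_tensors="pt", the same IDs come back as a PyTorch tensor of shape (1, seq_len)
encoded_pt = tokenizer("translate English to French: Hello!", return_tensors="pt")
print(encoded_pt["input_ids"].shape)

# decode() maps token IDs back to text, as done after generation
print(tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))
```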
Running the full example above produces the following output:
```
Using device: cuda
English: Hello, how are you today?
French: Bonjour, comment vous êtes-vous aujourd'hui?

English: Hello, how are you today?
German: Hallo, wie sind Sie heute?

Spanish: ¿Cómo estás hoy?
English: Cómo estás hoy?
```
You can see that the model can translate from English to French or German, but it failed to translate from Spanish to English and simply echoed the input. This is a limitation of the model: T5's pre-training only covered a few translation directions (English to French, German, and Romanian), and Spanish was not among them. You may need to try another model for unsupported language pairs.
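If you need Spanish, one option is to swap in a model trained specifically for that language pair. The sketch below uses the Helsinki-NLP OPUS-MT Spanish-to-English model from the Hugging Face Hub as one such alternative; any dedicated translation model would do:

```python
from transformers import pipeline

# OPUS-MT models are trained per language pair; this checkpoint handles Spanish to English
es_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")
result = es_en("¿Cómo estás hoy?")
print(result[0]["translation_text"])
```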
Translation with Alternatives
Translating a sentence into a different language is not a one-to-one mapping. Because of the variation in grammar, word usage, and sentence structure, there are multiple ways to translate a sentence.
Since text generation in the above model uses beam search, you can natively generate multiple alternatives for a translation. You can modify the `translate()` method to return multiple translations:
```python
def translate(self, text, source_lang, target_lang):
    """Translate text and report the beam search scores"""
    supported_lang = ["English", "French", "German", "Spanish"]
    if source_lang not in supported_lang:
        raise ValueError(f"Unsupported source language: {source_lang}")
    if target_lang not in supported_lang:
        raise ValueError(f"Unsupported target language: {target_lang}")

    # Prepare the input text
    task_prefix = f"translate {source_lang} to {target_lang}"
    input_text = f"{task_prefix}: {text}"

    # Tokenize and generate translations with diverse beam search
    inputs = self.tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    inputs = inputs.to(self.device)
    with torch.no_grad():
        outputs = self.model.generate(**inputs, max_length=512,
                                      num_beams=4*4, num_beam_groups=4,
                                      num_return_sequences=4, diversity_penalty=0.8,
                                      length_penalty=0.6, early_stopping=True,
                                      output_scores=True, return_dict_in_generate=True)

    # Decode each returned sequence and collect its beam score
    translation = [self.tokenizer.decode(output, skip_special_tokens=True)
                   for output in outputs.sequences]
    return {
        "translation": translation,
        "score": [float(score) for score in outputs.sequences_scores],
    }
```
This modified method uses diverse beam search (`num_beam_groups` with a `diversity_penalty`) so that the alternatives differ from each other; note that `num_beams` must be divisible by `num_beam_groups`, and `num_return_sequences` cannot exceed `num_beams`. It returns a dictionary with a list of translations and their scores instead of a single string. The model's output is still a tensor of token IDs, and you need to decode it back to text using the tokenizer, one translation at a time.
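As an aside, `transformers` tokenizers also provide `batch_decode()`, so the list comprehension above can be written in one call as `self.tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)`.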
The scores come from the beam search, so they are always in descending order: the highest-scoring beams are the ones selected for the output.
Let’s see how you can use it:
```python
...

original_text = "This is an important message that needs accurate translation."
translator = MultilingualTranslator("t5-base")
output = translator.translate(original_text, "English", "French")
print(f"English: {original_text}")
print("French:")
for text, score in zip(output["translation"], output["score"]):
    print(f"- (score: {score:.2f}) {text}")
```
and the output is:
```
English: This is an important message that needs accurate translation.
French:
- (score: -0.65) Il s'agit d'un message important qui a besoin d'une traduction précise.
- (score: -0.70) Il s'agit d'un message important qui doit être traduit avec précision.
- (score: -0.76) C'est un message important qui a besoin d'une traduction précise.
- (score: -0.81) Il s'agit là d'un message important qui doit être traduit avec précision.
```
The scores are negative because they are log probabilities. Try a longer or more complex sentence to see greater variation among the translations.
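Because the scores are log probabilities, you can exponentiate them to put them on a more familiar 0-to-1 scale. A quick check on the scores printed above (they are length-normalized, so treat the results as relative rather than exact sequence probabilities):

```python
import math

# Beam scores from the output above
for s in [-0.65, -0.70, -0.76, -0.81]:
    print(f"score {s:.2f} -> {math.exp(s):.2f}")  # e.g. exp(-0.65) is about 0.52
```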
Quality Estimation
The score printed in the code above is the score used in the beam search. It helps the auto-regressive generation complete a sentence while maintaining diversity. Imagine that the model is generating one token at a time, and each step emits multiple candidates. There are multiple paths to complete the sentence, and the number of paths grows exponentially with the number of auto-regressive steps explored. Beam search limits the number of paths to track by scoring each path and keeping the top-k paths only.
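To make the pruning step concrete, below is a toy sketch of the idea (this is not the `transformers` implementation, and the probabilities are made up purely for illustration):

```python
import math

# Made-up next-token distributions, keyed by the last token of the path
next_token_probs = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "car": 0.2},
    "a":   {"cat": 0.7, "dog": 0.3},
}

k = 2  # beam width: the number of paths to keep
beams = [(["<s>"], 0.0)]  # (path, accumulated log probability)

for _ in range(2):  # two decoding steps
    candidates = []
    for path, logp in beams:
        for tok, p in next_token_probs.get(path[-1], {}).items():
            candidates.append((path + [tok], logp + math.log(p)))
    # score every extension and keep only the top-k paths: this is the pruning
    beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]

for path, logp in beams:
    print(" ".join(path[1:]), f"(log prob: {logp:.2f})")
```

With `k = 2`, only two of the five possible two-token paths survive, no matter how long the sentence grows.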
Indeed, you can inspect the probabilities used during beam search. The model provides a method `compute_transition_scores()` that returns the transition scores of the generated tokens. You can try it out as follows:
```python
...
import numpy as np

outputs = model.generate(**inputs, max_length=512,
                         num_beams=4*4, num_beam_groups=4,
                         num_return_sequences=4, diversity_penalty=0.8,
                         length_penalty=0.6, early_stopping=True,
                         output_scores=True, return_dict_in_generate=True)
# Recover the per-token transition scores from the beam search
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=True
)
for idx, (out_tok, out_score) in enumerate(zip(outputs.sequences, transition_scores)):
    translation = tokenizer.decode(out_tok, skip_special_tokens=True)
    print(f"Translation: {translation}")
    print("token | token string | logits | probability")
    for tok, score in zip(out_tok[1:], out_score):
        score = score.cpu().numpy()  # move to CPU before converting to NumPy
        print(f"| {tok:5d} | {tokenizer.decode(tok):14s} | {score:.4f} | {np.exp(score):.2%}")
```
For the same input text as the previous example, the output of the above code snippet is:
```
Translation: Il s'agit d'un message important qui a besoin d'une traduction précise.
token | token string | logits | probability
|   802 | Il             | -0.7576 | 46.88%
|     3 |                | -0.0129 | 98.72%
|     7 | s              | -0.0068 | 99.32%
|    31 | '              | -0.3295 | 71.93%
|  5356 | agit           | -0.0033 | 99.67%
|     3 |                | -0.3863 | 67.96%
|    26 | d              | -0.0108 | 98.93%
|    31 | '              | -0.0005 | 99.95%
|   202 | un             | -0.0152 | 98.49%
|  1569 | message        | -0.0296 | 97.09%
|   359 | important      | -0.0228 | 97.75%
|   285 | qui            | -0.4194 | 65.74%
|     3 |                | -0.9925 | 37.07%
|     9 | a              | -0.1236 | 88.37%
|  6350 | besoin         | -0.0114 | 98.87%
|     3 |                | -0.1201 | 88.68%
|    26 | d              | -0.0006 | 99.94%
|    31 | '              | -0.0007 | 99.93%
|   444 | une            | -0.4557 | 63.40%
| 16486 | traduc         | -0.0027 | 99.73%
|  1575 | tion           | -0.0001 | 99.99%
| 17767 | précise        | -0.6423 | 52.61%
|     5 | .              | -0.0033 | 99.67%
|     1 | </s>           | -0.0006 | 99.94%
Translation: Il s'agit d'un message important qui doit être traduit avec précision.
token | token string | logits | probability
|   802 | Il             | -0.7576 | 46.88%
|     3 |                | -0.0129 | 98.72%
...
```
In the for-loop, you print the token and the score side by side. The first token is always a padding token, hence we match `out_tok[1:]` with `out_score`. The probability corresponds to the token at that step. It depends on the previous sequence of tokens, so the same token may have different probabilities at different steps or in different output sentences. A token with a high probability is usually one forced by grammar; a token with a low probability means there were likely alternatives at that position. Note that diverse beam search keeps several candidate paths and penalizes groups that echo each other, so the token you see above is not necessarily the single most probable token at that step.
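One practical use of these per-token probabilities is to flag positions where the model was unsure. The sketch below reuses `outputs`, `transition_scores`, and `tokenizer` from the snippet above, and the 0.5 threshold is an arbitrary choice for illustration:

```python
import numpy as np

# Flag tokens whose probability fell below the threshold: positions where
# the model had plausible alternatives (assumes `outputs`, `transition_scores`,
# and `tokenizer` from the previous snippet)
threshold = 0.5
for tok, score in zip(outputs.sequences[0][1:], transition_scores[0]):
    prob = float(np.exp(score.cpu().numpy()))
    if prob < threshold:
        print(f"uncertain token: {tokenizer.decode(tok)!r} (p={prob:.2%})")
```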
The `outputs` object also contains `outputs.sequences_scores`, a normalized sum of these log probabilities that gives the score of each sequence, and you might hope to use it to estimate translation quality. However, it is of little use here since you are not implementing the beam search yourself. The probabilities cannot tell you much about the quality of the translation: you cannot compare them across different input sentences, and you cannot compare them across different models.
One popular way to estimate the quality of a translation is the BLEU (Bilingual Evaluation Understudy) score. You can compute it with the `sacrebleu` library, but you will need a reference translation to score against. Below is an example:
```python
...
import sacrebleu

sample_document = """
Machine translation has evolved significantly over the years. Early systems used
rule-based approaches that defined grammatical rules for languages. Statistical
machine translation later emerged, using large corpora of translated texts to
learn translation patterns automatically.
"""
reference_translation = """
La traduction automatique a considérablement évolué au fil des ans. Les premiers
systèmes utilisaient des approches basées sur des règles définissant les règles
grammaticales des langues. La traduction automatique statistique est apparue plus
tard, utilisant de vastes corpus de textes traduits pour apprendre automatiquement
des modèles de traduction.
"""

translator = MultilingualTranslator("t5-base")
output = translator.translate(sample_document, "English", "French")
print(f"English: {sample_document}")
print("French:")
for text, score in zip(output["translation"], output["score"]):
    bleu = sacrebleu.corpus_bleu([text], [[reference_translation]])
    print(f"- (score: {score:.2f}, bleu: {bleu.score:.2f}) {text}")
```
The output may be:
```
English: 
Machine translation has evolved significantly over the years. Early systems used
rule-based approaches that defined grammatical rules for languages. Statistical
machine translation later emerged, using large corpora of translated texts to
learn translation patterns automatically.

French:
- (score: -0.94, bleu: 26.49) La traduction automatique a beaucoup évolué au fil des ans. Les premiers systèmes utilisaient des approches fondées sur des règles qui définissaient des règles grammaticales pour les langues.
- (score: -1.26, bleu: 56.78) La traduction automatique a beaucoup évolué au fil des ans. Les premiers systèmes utilisaient des approches fondées sur des règles qui définissaient des règles grammaticales pour les langues. La traduction automatique statistique s'est développée plus tard, en utilisant de vastes corpus de textes traduits pour apprendre automatiquement les schémas de traduction.
- (score: -1.26, bleu: 56.41) La traduction automatique a beaucoup évolué au fil des ans. Les premiers systèmes utilisaient des approches fondées sur des règles qui définissaient des règles grammaticales pour les langues. La traduction automatique statistique a ultérieurement vu le jour, utilisant de vastes corpus de textes traduits pour apprendre automatiquement les schémas de traduction.
- (score: -1.32, bleu: 53.79) La traduction automatique a beaucoup évolué au fil des ans. Les premiers systèmes utilisaient des approches fondées sur des règles qui définissaient des règles grammaticales pour les langues. La traduction automatique statistique a ultérieurement vu le jour, en utilisant de vastes corpus de textes traduits pour apprendre automatiquement les modes de traduction.
```
The BLEU score shows how closely the translation matches the reference. It ranges from 0 to 100, and higher is better. You can see that the model's ranking of the translations does not match the BLEU ranking. On one hand, this highlights that the beam search score is not a measure of translation quality; on the other hand, the BLEU score itself depends on the reference translation you provide.
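BLEU is not the only reference-based metric in `sacrebleu`: it also implements chrF, a character n-gram metric that is often more forgiving of word-order and morphology differences. A sketch reusing `text` and `reference_translation` from the example above:

```python
import sacrebleu

# chrF compares character n-grams against the reference; like BLEU,
# it ranges from 0 to 100, and higher is better
chrf = sacrebleu.corpus_chrf([text], [[reference_translation]])
print(f"chrF: {chrf.score:.2f}")
```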
Further Readings
Below are some resources that you may find useful:
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (the T5 paper): https://arxiv.org/abs/1910.10683
- Hugging Face Transformers documentation: https://huggingface.co/docs/transformers
- SacreBLEU: https://github.com/mjpost/sacrebleu
Summary
In this tutorial, you’ve built a comprehensive multilingual translation system using T5 and the Transformers library. In particular, you’ve learned:
- How to implement a basic translation system using the T5 model and a prompt
- How to adjust the beam search to generate multiple alternatives for a translation
- How to estimate the quality of a translation using BLEU score