Understanding the DistilBart Model and ROUGE Metric


DistilBart is a typical encoder-decoder model for NLP tasks. In this tutorial, you will learn how such a model is constructed and how you can check its architecture so that you can compare it with other models. You will also learn how to use the pretrained DistilBart model to generate summaries and how to control the summaries’ style.

After completing this tutorial, you will know:

  • How DistilBart’s encoder-decoder architecture processes text internally
  • Methods for controlling summary style and content
  • Techniques for evaluating and improving summary quality

Let’s get started!

Photo by Svetlana Gumerova. Some rights reserved.

Overview

This post is in two parts; they are:

  • Understanding the Encoder-Decoder Architecture
  • Evaluating the Result of Summarization using ROUGE

Understanding the Encoder-Decoder Architecture

DistilBart is a “distilled” version of the BART model, a powerful sequence-to-sequence model for natural language generation, translation, and comprehension. The BART model uses a full transformer architecture with an encoder and decoder.

You can find the architecture of transformer models in the paper Attention is all you need. At a high level, the illustration is as follows:

Transformer architecture

The key characteristic of the transformer architecture is that it is split into an encoder and a decoder. The encoder takes the input sequence and outputs a sequence of hidden states. The decoder takes the hidden states and outputs the final sequence. It is very effective for sequence-to-sequence tasks like summarization, in which the input should be fully consumed to extract the key information before the summary can be generated.

As explained in the previous post, you can use the pretrained DistilBart model to build a summarizer with just a few lines of code. In fact, you can see some of the design parameters in DistilBart’s architecture by looking at the model config:
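Below is a minimal sketch of such an inspection. The checkpoint name `sshleifer/distilbart-cnn-12-6` is assumed here as the pretrained DistilBart model from the previous post:

```python
from transformers import AutoConfig

# Download only the model configuration, not the weights
config = AutoConfig.from_pretrained("sshleifer/distilbart-cnn-12-6")

print("Hidden size:", config.d_model)
print("Attention heads:", config.encoder_attention_heads)
print("Encoder layers:", config.encoder_layers)
print("Decoder layers:", config.decoder_layers)
```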

The code above prints the size of the hidden state, the number of attention heads, and the number of encoder and decoder layers in DistilBart.

The model created in this way is a PyTorch model. You can print the model if you want to see more:
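For example, again assuming the `sshleifer/distilbart-cnn-12-6` checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM

# Load the pretrained model; printing it dumps the entire module hierarchy
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")
print(model)
```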

Printing the model produces a long listing of its module hierarchy, layer by layer.

This may not be easy to read. But if you are familiar with the transformer architecture, you will notice that:

  • The BartModel has an embedding model, an encoder model, and a decoder model. The same embedding model appears in both the encoder and decoder.
  • The size of the embedding model suggests that the vocabulary contains 50264 tokens. The output of the embedding model has a size of 1024 (the “hidden size”), which is the length of the embedding vector for each token.
  • Both the encoder and decoder use the BartLearnedPositionalEmbedding model, which presumably is a learned positional encoding for the input sequence to each model.
  • The encoder has 12 layers, while the decoder has only 6. This is where the distillation comes in: the original BART model uses 12 decoder layers, which DistilBart reduces to 6.
  • Each encoder layer contains one self-attention block, two layer norms, and two feed-forward layers, with GELU as the activation function.
  • Each decoder layer contains one self-attention block, one cross-attention block that attends to the encoder output, three layer norms, and two feed-forward layers, again with GELU as the activation function.
  • In both the encoder and decoder, the hidden size does not change through the layers, but the feed-forward layer uses 4x the hidden size in the middle.

Most transformer models use a similar architecture but with some variations. These are the high-level building blocks of the model, but you cannot see the exact algorithm used, for example, the order of the building blocks invoked with the input sequence. You can find such details only when you check the model implementation code.

Not all models have both an encoder and a decoder. However, this design is very common for sequence-to-sequence tasks. The output from the encoder model is called the “contextual representation” of the input sequence. It captures the essence of the input text. The decoder model uses the contextual representation to generate the final sequence.
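You can inspect this contextual representation directly by running only the encoder. A small sketch, with the checkpoint name assumed as before:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")
with torch.no_grad():
    # Run the encoder alone to obtain the contextual representation
    encoder_outputs = model.get_encoder()(**inputs)

# Shape is (batch_size, sequence_length, hidden_size)
print(encoder_outputs.last_hidden_state.shape)
```

One vector of length 1024 is produced per input token; during generation, the decoder cross-attends to these vectors to produce the summary.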

Evaluating the Result of Summarization using ROUGE

Now that you have seen how to use the pretrained DistilBart model to generate summaries, how do you know whether its output is any good?

This is indeed a very difficult question. Everyone has their own opinion on what a good summary is. However, some well-known metrics are used to evaluate various outputs of language models. One popular metric for evaluating the quality of summaries is ROUGE.

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of metrics used to evaluate the quality of text summarization and machine translation. Behind the scenes, the precision and recall of the generated summary against a reference summary are computed and combined into an F1 score. It is simple to understand and easy to compute. As a recall-oriented metric, it focuses on the summary's ability to recall the key phrases of the reference. The weakness of ROUGE is that it needs a reference summary, so the effectiveness of the evaluation depends on the quality of that reference.

Let’s revisit how we can use DistilBart to generate summaries:
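Below is a sketch of such a class; the checkpoint name and the particular generation parameter values are assumptions in line with the previous post:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

class Summarizer:
    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.model.eval()

    def summarize(self, text):
        inputs = self.tokenizer(text, truncation=True, max_length=1024,
                                return_tensors="pt")
        with torch.no_grad():
            output_ids = self.model.generate(
                **inputs,
                num_beams=4,          # beam search for higher-quality output
                max_length=130,       # upper bound on summary length
                min_length=30,        # lower bound on summary length
                length_penalty=2.0,   # values > 1.0 favor longer summaries
                early_stopping=True,  # stop once all beams are finished
            )
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With this class, `Summarizer().summarize(long_text)` returns a single summary string.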

The Summarizer class loads the pretrained DistilBart model and tokenizer and then uses the model to generate a summary of the input text. To generate the summary, several parameters are passed to the generate() method to control how the summary is generated. You can adjust these parameters, but the default values are a good starting point.

Now let’s extend the Summarizer class to generate summaries with different styles by setting different parameters for the generate() method:
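A sketch of the extended class is below. The base Summarizer class is restated in compact form so the snippet runs on its own, and the per-style parameter values are illustrative choices rather than canonical ones:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

class Summarizer:
    """Base class as before: loads the pretrained model and tokenizer."""
    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.model.eval()

class StyleControlledSummarizer(Summarizer):
    # Illustrative generate() parameters for each summary style
    STYLE_PARAMS = {
        "concise":   dict(num_beams=4, max_length=60, min_length=15),
        "detailed":  dict(num_beams=4, max_length=200, min_length=80,
                          length_penalty=2.0),
        "technical": dict(num_beams=4, max_length=130, min_length=30,
                          repetition_penalty=2.5),
        "simple":    dict(do_sample=True, temperature=0.7,
                          max_length=100, min_length=30),
    }

    def summarize(self, text, style="concise"):
        inputs = self.tokenizer(text, truncation=True, max_length=1024,
                                return_tensors="pt")
        with torch.no_grad():
            output_ids = self.model.generate(**inputs,
                                             **self.STYLE_PARAMS[style])
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```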

The StyleControlledSummarizer class defines four styles of summaries, named “concise”, “detailed”, “technical”, and “simple”. You can see that the parameters for the generate() method differ for each style. In particular, the “detailed” style allows a longer summary length, the “technical” style uses a higher repetition penalty to avoid repeated phrases, and the “simple” style uses sampling with a moderate temperature so the wording is less rigid than beam search output.

Is that good? Let’s see what the ROUGE metric says:

The exact numbers depend on your input text and reference, but you should see ROUGE-1, ROUGE-2, and ROUGE-L scores reported for each of the four summary styles.

To run this code, you need to install the rouge_score package:

Three metrics are used above. ROUGE-1 is based on unigrams, i.e., single words. ROUGE-2 is based on bigrams, i.e., two words. ROUGE-L is based on the longest common subsequence. Each metric measures different aspects of summary quality. The higher the metric, the better.
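To make this concrete, ROUGE-1 can be computed by hand from unigram overlap. The two sentences below are toy examples:

```python
from collections import Counter

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()

# Clipped unigram overlap between candidate and reference
overlap = sum((Counter(reference) & Counter(candidate)).values())

precision = overlap / len(candidate)  # fraction of candidate words that match
recall = overlap / len(reference)     # fraction of reference words recovered
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.833 R=0.833 F1=0.833
```

ROUGE-2 repeats the same computation over bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.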

As you can see from the above, a longer summary is not always better. It all depends on the reference you use when computing the ROUGE metrics.

Putting it all together, below is the complete code:

Further Reading

Below are some resources that you may find useful:

  • DistilBart Model
  • ROUGE Metric
  • Pre-trained Summarization Distillation by Sam Shleifer, Alexander M. Rush (arXiv:2010.13002)
  • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer (arXiv:1910.13461)
  • Attention is all you need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (arXiv:1706.03762)
  • Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Summary

In this advanced tutorial, you’ve learned several advanced features of text summarization. Particularly, you learned:

  • How DistilBart’s encoder-decoder architecture processes text
  • Methods for controlling summary style
  • Approaches to evaluating summary quality

These advanced techniques enable you to create more sophisticated and effective text summarization systems tailored to specific needs and requirements.
