Understanding the Evolution of ChatGPT: Part 2 — GPT-2 and GPT-3


The Paradigm Shift Towards Bypassing Finetuning

In our previous article, we revisited the core concepts behind GPT-1 as well as what inspired it. By combining auto-regressive language modeling pre-training with the decoder-only Transformer, GPT-1 revolutionized the field of NLP and made pre-training plus finetuning the standard paradigm.

But OpenAI didn’t stop there.

Rather, while trying to understand why language model pre-training of Transformers is effective, they began to notice the zero-shot behaviors of GPT-1: as pre-training proceeded, the model steadily improved its performance on tasks it had never been finetuned on. This suggested that pre-training by itself could improve zero-shot capability, as shown in the figure below:

Figure 1. Evolution of zero-shot performance on different tasks as a function of LM pre-training updates. (Image from the GPT-1 paper.)

This motivated a paradigm shift from “pre-training plus finetuning” to “pre-training only”: in other words, toward a task-agnostic pre-trained model that can handle different tasks without finetuning.

Both GPT-2 and GPT-3 were designed following this philosophy.

But why, you might ask, isn’t the pre-training plus finetuning magic working just fine? What are the additional benefits of bypassing the finetuning stage?

Limitations of Finetuning

Finetuning works well for some well-defined tasks, but not for all of them. The problem is that there are numerous tasks in the NLP domain that we have never had a chance to experiment on yet.

For those tasks, requiring a finetuning stage means we would need to collect a finetuning dataset of meaningful size for each individual new task, which is clearly not ideal if we want our models to be truly intelligent someday.

Meanwhile, researchers have observed that the risk of exploiting spurious correlations in the finetuning data grows as models become larger and larger. This creates a paradox: the model needs to be large enough to absorb as much information as possible during training, yet finetuning such a large model on a small, narrowly distributed dataset makes it struggle to generalize to out-of-distribution samples.

Another reason is that, as humans, we do not require large supervised datasets to learn most language tasks, and if we want our models to be useful someday, we would like them to have the same fluidity and generality.

Perhaps the real question, then, is: what can we do to achieve that goal and bypass finetuning?

Before diving into the details of GPT-2 and GPT-3, let’s first take a look at the three key elements that have influenced their model design: task-agnostic learning, the scale hypothesis, and in-context learning.

Task-agnostic Learning

Task-agnostic learning, also known as Meta-Learning or Learning to Learn, refers to a new paradigm in machine learning where the model develops a broad set of skills at training time, and then uses these skills at inference time to rapidly adapt to a new task.

For example, in MAML (Model-Agnostic Meta-Learning), the authors showed that models could adapt to new tasks with very few examples. More specifically, in each inner loop (highlighted in blue), the model first samples a task from a set of tasks and performs a few gradient descent steps, resulting in an adapted model. This adapted model is then evaluated on the same task in the outer loop (highlighted in orange), and the resulting loss is used to update the original model parameters.

Figure 2. Model-Agnostic Meta-Learning. (Image from the MAML paper)
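To make the two loops more concrete, below is a minimal sketch of first-order MAML in PyTorch on a toy sine-wave regression problem, similar in spirit to the regression example in the MAML paper. The task sampler, model size, learning rates, and number of steps are illustrative assumptions rather than the paper’s exact setup:

```python
import copy
import torch
import torch.nn as nn

# Hypothetical task sampler: each task is a sine wave with a random
# amplitude and phase; we return a support set (for adaptation) and a
# query set (for the outer-loop update).
def sample_task():
    amplitude = torch.rand(1) * 4.0 + 0.1
    phase = torch.rand(1) * 3.1416
    xs = torch.rand(20, 1) * 10.0 - 5.0
    ys = amplitude * torch.sin(xs + phase)
    return (xs[:10], ys[:10]), (xs[10:], ys[10:])

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
meta_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
inner_lr, inner_steps = 0.01, 5

for meta_iter in range(1000):              # outer loop over meta-iterations
    meta_optimizer.zero_grad()
    for _ in range(4):                     # meta-batch of sampled tasks
        (x_s, y_s), (x_q, y_q) = sample_task()

        # Inner loop: adapt a copy of the model with a few gradient
        # steps on the support set (first-order approximation).
        adapted = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            loss_fn(adapted(x_s), y_s).backward()
            inner_opt.step()

        # Outer loop: evaluate the adapted model on the query set and
        # accumulate its gradients into the original (meta) parameters.
        query_loss = loss_fn(adapted(x_q), y_q)
        grads = torch.autograd.grad(query_loss, adapted.parameters())
        for p, g in zip(model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g

    meta_optimizer.step()
```

The inner loop adapts a copy of the model to one sampled task, while the outer loop updates the original parameters using the adapted model’s loss on held-out query data, so that the initial parameters become easy to adapt to new tasks with only a few examples.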

MAML shows that learning can be more general and more flexible, which aligns with the direction of bypassing finetuning on each individual task. In the following figure, the authors of GPT-3 explain how this idea extends to language models when combined with in-context learning: the outer loop iterates through different tasks, while the inner loop is realized by in-context learning, which will be explained in more detail in later sections.

Figure 3. Language model meta-learning. (Image from GPT-3 paper)

The Scale Hypothesis

Perhaps the most influential idea behind the development of GPT-2 and GPT-3, the scale hypothesis refers to the observation that, when trained on larger data, larger models can develop new capabilities automatically without explicit supervision. In other words, emergent abilities can appear when scaling up, just as we saw in the zero-shot abilities of the pre-trained GPT-1.

Both GPT-2 and GPT-3 can be considered experiments to test this hypothesis: GPT-2 set out to test whether a larger model pre-trained on a larger dataset could be directly used to solve downstream tasks, and GPT-3 set out to test whether in-context learning could bring improvements over GPT-2 when further scaled up.

We will discuss more details on how they implemented this idea in later sections.

In-Context Learning

As shown in Figure 3, in the context of language models, in-context learning refers to the inner loop of the meta-learning process, where the model is given a natural language instruction and a few demonstrations of the task at inference time, and is then expected to complete that task by automatically discovering the patterns in the given demonstrations.

Note that in-context learning happens at inference time with no gradient updates performed, which is completely different from traditional finetuning and is more similar to how humans perform new tasks.

In case you are not familiar with the terminology, demonstrations usually refers to exemplary input-output pairs associated with a particular task, as shown in the “examples” part of the figure below:

Figure 4. Example of few-shot in-context learning. (Image from GPT-3 paper)

The idea of in-context learning was explored implicitly in GPT-2 and then more formally in GPT-3, where the authors defined three different settings: zero-shot, one-shot, and few-shot, depending on how many demonstrations are given to the model.

Figure 5. Zero-shot, one-shot and few-shot in-context learning, contrasted with traditional finetuning. (Image from GPT-3 paper)
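To see what this looks like in practice, here is a small sketch that builds zero-shot, one-shot, and few-shot prompts for a hypothetical English-to-French translation task, loosely modeled on the examples in the GPT-3 paper (the exact prompt format and wording are assumptions for illustration). The resulting string is simply fed to a frozen pre-trained model, which is expected to complete the final line; no parameters are updated:

```python
# Build zero-, one-, and few-shot prompts for a hypothetical
# English-to-French translation task. The "learning" happens purely
# through the demonstrations placed in the model's context window.

def build_prompt(instruction: str, demonstrations: list[tuple[str, str]], query: str) -> str:
    lines = [instruction]
    for source, target in demonstrations:   # k demonstrations => k-shot
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")              # the model completes this line
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt("Translate English to French:", [], "mint")
one_shot  = build_prompt("Translate English to French:", demos[:1], "mint")
few_shot  = build_prompt("Translate English to French:", demos, "mint")

print(few_shot)
# Translate English to French:
# sea otter => loutre de mer
# cheese => fromage
# mint =>
```

The only difference between the three settings is how many demonstrations appear in the context; the model weights stay exactly the same in all of them.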

In short, task-agnostic learning highlights the potential of bypassing finetuning, while the scale hypothesis and in-context learning suggest a practical path to achieve that.

In the following sections, we will walk through more details for GPT-2 and GPT-3, respectively.
