🧩 How to Outperform GPT-3 by Combining Task Descriptions With Supervised Learning

Timo Schick



Explanatory Notes

Oct 23, 2020

This post explains Pattern-Exploiting Training (PET), a method that can be used to train Natural Language Processing models from fewer than 100 examples. It is based on a recent paper in which we show that PET outperforms GPT-3 on SuperGLUE, a challenging Natural Language Understanding benchmark, while requiring 99.9% fewer parameters.

Why Do We Teach Models Only With Examples?

In my few years in industry, a recurring theme for many projects was a glaring lack of training data. Learning a task only from a few examples, however, can be extremely difficult. For instance, let’s look at the following inputs (left) and corresponding outputs (right):

Now, what is the correct output for the fourth input? Is it 0, for example because the text talks about pizza like the first input? Or is it 1, for example because the text talks about prices like inputs two and three? From the given examples alone, this is impossible to tell. Therefore, if someone wanted you to solve this task, rather than showing you these examples, they’d probably give you some description of the task, let’s say: Based on the given review, is the restaurant it refers to good (0) or bad (1)?

The good news is: With recent advances in language model pretraining, we can also provide task descriptions to NLP models. One particular way of doing so is to reformulate them as cloze questions, which aligns very well with the objective of “filling in the blanks” that is used to pretrain masked language models (MLMs) such as BERT and RoBERTa:

Here $x = \text{Excellent pizza! Slices are fantastic, prices are reasonable.}$ is the text we want to classify; $\text{The restaurant is ____.}$ is the cloze question we append to it; and the language model (ideally) predicts “good” if it classifies the review as positive (output 0) and “bad” if it classifies the review as negative (output 1).

In general, this approach requires two things: First, a pattern $P$, a function that takes the input $x$ and converts it to a cloze question $P(x)$. In the above case, that pattern would be $P(x) = x \text{ The restaurant is ____}$. Second, a verbalizer $v$, a function that expresses each possible output using a single word. In our case, we’d have $v(0) = \text{good}$ and $v(1) = \text{bad}$.
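To make the two ingredients concrete, here is a minimal sketch of the pattern and the verbalizer as plain Python. The names `pattern`, `verbalize`, `MASK` and `VERBALIZER` are made up for illustration, and the `____` blank stands in for whatever mask token your language model actually uses.

```python
# Placeholder for the blank; a real MLM would use its own mask token here.
MASK = "____"

def pattern(x: str) -> str:
    """P(x): turn the input text into a cloze question."""
    return f"{x} The restaurant is {MASK}"

# v: one single word per possible output
VERBALIZER = {0: "good", 1: "bad"}

def verbalize(label: int) -> str:
    """v(label): the word that expresses this output."""
    return VERBALIZER[label]

review = "Excellent pizza! Slices are fantastic, prices are reasonable."
print(pattern(review))
print(verbalize(0), verbalize(1))
```

Swapping in a different task then only means writing a different `pattern` and a different `VERBALIZER`; the language model itself stays untouched.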

Up to this point, we are only exploiting the task description idea of GPT-2 (in a zero-shot setting): The hope is that the MLM has learned the meaning of the review, the cloze question and the two candidate substitutions “good” and “bad” so well that it can solve the task of sentiment analysis in a zero-shot fashion.

So, Do We Need No Examples at All?

In my (very limited) experience with annotating datasets, it is advisable to provide annotators not only with a task description, but also with a few examples. By analogy, NLP models should benefit from having access to both descriptions and examples. This is the core idea behind Pattern-Exploiting Training (PET): We provide a model with both a task description (through a pattern $P$ and verbalizer $v$) and a few training examples to perform gradient-based learning.

PET works as follows: For an input $x$, we first apply $P$ to obtain a cloze question, $P(x)$. We then process $P(x)$ with a pretrained masked language model (MLM), which outputs a score for each possible word at the masked position. We discard all words except for those used by the verbalizer $v$:

Probabilities for outputs $0$ and $1$ are then obtained using softmax. In the above example, we get $p(0) = e^{2.79} / (e^{2.79} + e^{1.34}) \approx 81\%$ and $p(1) = e^{1.34} / (e^{2.79} + e^{1.34}) \approx 19\%$. For training, we use the cross-entropy between $p$ and the desired output as a loss function to update the parameters of our MLM.
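If you want to check the arithmetic yourself, the following sketch reproduces the softmax over the two verbalizer scores and the resulting cross-entropy loss. It assumes the scores 2.79 and 1.34 belong to “good” (output 0) and “bad” (output 1); the function names are illustrative and not taken from the PET code base.

```python
import math

def verbalizer_probs(scores):
    """Softmax over the scores of the verbalizer words only."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def cross_entropy(probs, gold):
    """Loss for a single example whose desired output is `gold`."""
    return -math.log(probs[gold])

# Assumed MLM scores at the masked position: label 0 = "good", label 1 = "bad"
scores = {0: 2.79, 1: 1.34}
probs = verbalizer_probs(scores)
loss = cross_entropy(probs, gold=0)

print(f"p(0) = {probs[0]:.2f}, p(1) = {probs[1]:.2f}")  # p(0) = 0.81, p(1) = 0.19
print(f"loss = {loss:.3f}")
```

In actual training, this loss would be backpropagated through the MLM; the sketch only covers the forward computation.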

Where Do Patterns and Verbalizers Come From?

Identifying patterns and verbalizers that work well can be challenging, even more so as current language models are still far from human-like text understanding. For example, a model using the pattern $P$ from above can easily be fooled with the following input, whereas the alternative pattern $P'(x) = \text{Just ____! } x$ works well:

But how can we know which pattern to choose without large amounts of validation data, which we most certainly don’t have in a few-shot setting? PET solves this problem with a mechanism to combine multiple patterns: We first train one model per pattern as above. We then use unlabeled examples (which are typically much easier to obtain in large quantities than labeled examples) for which we let each MLM predict a probability distribution $p$ as before. We then combine these distributions by averaging them and train a single classifier on the so-obtained pairs of inputs and (soft) outputs:

As shown in the paper, this strategy is surprisingly effective at “neutralizing” bad patterns.
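The combination step is simple enough to sketch in a few lines. The per-pattern probabilities below are invented numbers for a single unlabeled review, and `average_distributions` and `soft_cross_entropy` are hypothetical helper names; the sketch only shows plain averaging and the soft-label loss a final classifier could be trained with.

```python
import math

def average_distributions(dists):
    """Average the label distributions predicted by the per-pattern models."""
    labels = dists[0].keys()
    return {y: sum(d[y] for d in dists) / len(dists) for y in labels}

def soft_cross_entropy(predicted, target):
    """Cross-entropy of a prediction against a soft (averaged) target."""
    return -sum(target[y] * math.log(predicted[y]) for y in target)

# Made-up predictions of three pattern-specific models for one unlabeled review
per_pattern = [
    {0: 0.81, 1: 0.19},
    {0: 0.60, 1: 0.40},
    {0: 0.30, 1: 0.70},  # a "bad" pattern that disagrees with the others
]

soft_label = average_distributions(per_pattern)
print(soft_label)

# Loss of a hypothetical final classifier that predicts 70% / 30%
print(soft_cross_entropy({0: 0.7, 1: 0.3}, soft_label))
```

Because the final classifier only ever sees the averaged distribution, a single misleading pattern gets diluted by the others rather than dictating the label.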

Why Don’t We Let Models Interact With Each Other?

If we think of all models as annotators who are given different task descriptions, it intuitively makes sense to let them exchange information and learn from each other instead of just averaging their predictions. This is the idea behind the iterative variant of PET (conveniently called iPET): We start with a small set of training data and train one model per pattern as before. For each model $\text{MLM}_i$, we then let a different model $\text{MLM}_j$ annotate some unlabeled examples to obtain a larger training set. We select only examples for which $\text{MLM}_j$ is confident in its prediction and take care that the larger training set has the same output distribution as the original one. We then train $\text{MLM}_i$ on this new set of training data and repeat this process for multiple generations with ever bigger training sets:

Finally, we take the last generation of models and distill their knowledge into a single classifier as in regular PET.
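One simple way to picture the selection step is the sketch below: for each label, keep the examples that the annotating model is most confident about, with the number per label proportional to the label counts in the original training set. This is a deliberate simplification for illustration, not necessarily the exact procedure from the paper; `select_confident`, `label_counts` and `factor` are made-up names, and the predictions are invented.

```python
def select_confident(predictions, label_counts, factor):
    """Pick the most confident pseudo-labeled examples per label.

    predictions:  list of (example, {label: probability}) from another model
    label_counts: how often each label occurs in the original training set
    factor:       how many times larger the new training set should be
    """
    selected = []
    for label, count in label_counts.items():
        # Examples predicted as `label`, paired with the model's confidence
        candidates = [
            (probs[label], example)
            for example, probs in predictions
            if max(probs, key=probs.get) == label
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keeping count * factor examples per label preserves the label ratio
        selected += [(ex, label) for _, ex in candidates[: count * factor]]
    return selected

# Invented predictions of MLM_j on five unlabeled reviews
predictions = [
    ("a", {0: 0.90, 1: 0.10}),
    ("b", {0: 0.60, 1: 0.40}),
    ("c", {0: 0.20, 1: 0.80}),
    ("d", {0: 0.45, 1: 0.55}),
    ("e", {0: 0.95, 1: 0.05}),
]

# Original training set had one example per label; keep the same ratio
print(select_confident(predictions, label_counts={0: 1, 1: 1}, factor=1))
# [('e', 0), ('c', 1)]
```

Raising `factor` from one generation to the next gives the ever bigger training sets described above while keeping the label ratio fixed.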

How Well Does PET Work? And How Does It Compare to GPT-3?

Similar to PET, GPT-3 combines task descriptions and labeled examples, but there are a few important differences, the most important one being how examples are used. GPT-3 provides them as context, but in contrast to PET does not perform any training steps. This enables using the same model for multiple tasks, but it comes with some major drawbacks. First, as soon as we remove the examples from the context, the model’s performance drops: it has not actually learned anything. As current Transformers often have context sizes of only a few hundred tokens, it also does not scale to more than a few dozen examples. Most importantly, however, it is much less effective than PET: In a few-shot setting, combining ALBERT with PET outperforms GPT-3 on SuperGLUE even though ALBERT has 99.9% fewer parameters than GPT-3 (and ALBERT+PET performs much better than GPT-3 Med, a model of similar size):

There are a few important things to note here: First, PET uses a different set of 32 training examples and additionally requires unlabeled data – however, none of these factors is essential for PET’s strong performance. More importantly, PET requires that each possible output can be expressed using just a single word (or, more precisely, a single token). While our paper shows that we can also make PET work for multiple tokens (the details are somewhat involved, so we skip them here), it is currently not able to perform text generation, which is one of the core abilities of GPT-3 and required for many tasks such as text summarization, translation and dialogue.

While SuperGLUE is a valuable benchmark, its tasks differ a lot from common NLP tasks in industry. So if you have a specific task in mind, the best way to check how well PET works for you is to simply try it yourself and train your own PET 🐶.

💬 Comments

Did you find this explanation helpful? Or do you think that something is missing? You can reply to this blog post on Twitter.