Timo Schick

🦕 Using Big Language Models To Generate Entire Datasets From Scratch

This post discusses how DINO (Datasets From Instructions) can be used to distill the zero-shot knowledge of big language models like GPT3 into much smaller models without requiring any data or access to model internals.

How Does Zero-Shot Learning with Big Language Models Work?

One of the many benefits of pretraining neural networks to perform language modeling is that they learn to solve a whole range of NLP tasks along the way – they just have to be properly prompted. For example, if we want a model like GPT3 to predict the sentiment of a movie review, we can simply add an appropriate prompt:

I don't know why I like this movie so well, but I never get tired of watching it.
Question: Is this movie good or bad?
Answer: It is

For this particular input, GPT3 thinks that the continuation “good” (20%) is about ten times as likely as “bad” (2%), without having seen any training examples. Neat!
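The comparison above can be sketched in a few lines. This is only an illustration: `build_prompt` and `classify` are hypothetical helpers, and the next-word probabilities are hard-coded to the values quoted above rather than obtained from a real model.

```python
def build_prompt(review: str) -> str:
    """Append the question/answer pattern to a review."""
    return f"{review}\nQuestion: Is this movie good or bad?\nAnswer: It is"


def classify(next_word_probs: dict, labels=("good", "bad")) -> dict:
    """Renormalize the model's next-word probabilities over the label words only."""
    total = sum(next_word_probs.get(label, 0.0) for label in labels)
    if total == 0:
        raise ValueError("the model assigned zero probability to every label word")
    return {label: next_word_probs.get(label, 0.0) / total for label in labels}


prompt = build_prompt(
    "I don't know why I like this movie so well, but I never get tired of watching it."
)
# Hard-coded stand-in for a language model's next-word distribution (values from the text).
scores = classify({"good": 0.20, "bad": 0.02})
prediction = max(scores, key=scores.get)
```

Restricting the comparison to the label words is what turns a general-purpose language model into a classifier: everything else the model might say after "It is" is simply ignored.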

So Where’s the Problem?

If you have enough money to pay OpenAI or enough compute to host your own large language model (and you don’t care about performance and environmental impact), there is none. Otherwise, you may be wondering if perhaps smaller models will do the trick.

Unfortunately, all else being equal, the answer is no. To stick with our movie review example, let’s look at the performance of models of different sizes on a 500-example subset of the IMDb dataset (we couldn’t afford to evaluate GPT3 on the whole dataset 💸):

Performance clearly improves with model size.[1] Luckily, we can match the performance of large models with much smaller models in few-shot settings or if we have full access to the model and there are enough unlabeled examples for knowledge distillation. But wouldn’t it be great if we could also achieve this without any (labeled or unlabeled) data and without access to the model’s internals?

How Does DINO Fix This Problem?

DINO was recently proposed as a method for learning sentence embeddings, but the underlying idea – instructing large language models to generate entire datasets in a zero-shot fashion – can also be applied to our problem. The key idea is to modify our prompt and simply instruct the language model to write a review:[2]

Task: Write a review for a good movie.
Review: "

By asking the model to do so several thousand times, we can create an entire dataset from scratch – and subsequently use it to train a much smaller model.
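As a rough sketch of this loop (not the DINO implementation itself), the snippet below queries a generator repeatedly and cuts each continuation at the first quotation mark, as described in footnote [2]. The function `fake_language_model` is a made-up stand-in for sampling from a real model, and all helper names are assumptions.

```python
import random
from typing import Callable, List

INSTRUCTION = 'Task: Write a review for a good movie.\nReview: "'


def extract_review(continuation: str) -> str:
    """Keep everything before the first quotation mark the model generates."""
    return continuation.split('"', 1)[0].strip()


def generate_dataset(
    generate: Callable[[str], str], instruction: str, label: str, num_entries: int
) -> List[dict]:
    """Query the generator repeatedly and collect the non-empty reviews."""
    dataset = []
    for _ in range(num_entries):
        review = extract_review(generate(instruction))
        if review:
            dataset.append({"text": review, "label": label})
    return dataset


def fake_language_model(prompt: str) -> str:
    """Hypothetical stand-in for sampling a continuation from a real model."""
    return random.choice([
        'A wonderful film that I could watch again and again." Next review: "',
        'Great acting and a clever script."\nTask: Write',
    ])


examples = generate_dataset(fake_language_model, INSTRUCTION, label="1", num_entries=5)
```

Swapping `fake_language_model` for a function that samples from an actual language model would be the only change needed to produce real reviews.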

To generate this dataset using the DINO implementation, all we need to do is write a task specification in JSON:

{
  "task_name": "imdb-movies",
  "labels": {
    "0": {"instruction": "Task: Write a review for a bad movie.\nReview: \""},
    "1": {"instruction": "Task: Write a review for a good movie.\nReview: \""}
  }
}

We could directly use this task specification to obtain a list of reviews. But to make things a little more interesting, we instead instruct the language model to generate both the name of the movie and a review.[3] We first modify the above task specification by replacing the word “Review” at the end of each instruction with “Movie”. Obtaining a list of movie names is then as simple as storing the specification in a file imdb-movies.json and running DINO:

python3 dino.py --output_dir . --task_file imdb-movies.json --model_name gpt2-xl --max_output_length 10 --top_k 0 --num_entries_per_label 1000 --batch_size 16

This results in a list of about 2,000 movie names, among them many classics of film history such as “Bullfrogs On Poopy Mountain”, “Some Guy’s Dog”, “World’s Worst Christmas Movie” and “Killer Eunuch (2006)” (you can find the full list here). We then use a slightly more advanced task specification to generate both positive and negative reviews for all generated movie names:

{
  "task_name": "imdb-reviews",
  "labels": {
    "0": {
      "instruction": "Task: Write a review for a bad movie.\nMovie: \"<X1>\"\nReview: \"",
      "counter_labels": ["1"]
    },
    "1": {
      "instruction": "Task: Write a review for a good movie.\nMovie: \"<X1>\"\nReview: \"",
      "counter_labels": ["0"]
    }
  }
}

This task specification differs from the previous one in two aspects:

  1. It contains the placeholder <X1>. If we provide DINO with an input file (in our case, the list of movie names generated in the previous step), it will replace <X1> with the entries found in this file.
  2. It specifies counter labels for each label. This makes the language model produce text that is not only likely given a label’s instruction, but also unlikely given each counter label’s instruction (see the details here) and improves performance a lot.
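Both points can be illustrated with a small sketch. The placeholder substitution follows the description above. The reweighting rule in `penalize_counter_labels`, including the `decay` constant, is an illustrative assumption and not necessarily the formula DINO uses; the token probabilities are made up.

```python
import math

INSTRUCTIONS = {
    "0": 'Task: Write a review for a bad movie.\nMovie: "<X1>"\nReview: "',
    "1": 'Task: Write a review for a good movie.\nMovie: "<X1>"\nReview: "',
}
COUNTER_LABELS = {"0": ["1"], "1": ["0"]}


def fill_instruction(label: str, movie: str) -> str:
    """Replace the <X1> placeholder with one entry from the input file."""
    return INSTRUCTIONS[label].replace("<X1>", movie)


def penalize_counter_labels(probs_by_label: dict, label: str, decay: float = 100.0) -> dict:
    """Down-weight tokens that a counter label's instruction makes more likely.

    Illustrative rule only: a token keeps its probability if it is at least as
    likely under `label` as under every counter label; otherwise it is scaled
    by exp(decay * difference). The result is renormalized to sum to 1.
    """
    adjusted = {}
    for token, p in probs_by_label[label].items():
        worst = max(probs_by_label[c].get(token, 0.0) for c in COUNTER_LABELS[label])
        diff = p - worst
        adjusted[token] = p if diff >= 0 else p * math.exp(decay * diff)
    total = sum(adjusted.values())
    return {token: value / total for token, value in adjusted.items()}


prompt = fill_instruction("1", "Some Guy's Dog")
# Made-up next-token probabilities under the "good" (1) and "bad" (0) instructions.
toy_probs = {
    "1": {"great": 0.5, "boring": 0.2, "the": 0.3},
    "0": {"great": 0.1, "boring": 0.6, "the": 0.3},
}
debiased = penalize_counter_labels(toy_probs, label="1")
```

In this toy example, "boring" is far more likely under the negative instruction, so its probability is pushed close to zero when generating a positive review, while label-neutral tokens such as "the" are left untouched before renormalization.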

We again use DINO to turn this task specification into a dataset of movie reviews. While it would be interesting to see the quality of reviews generated by GPT3, we resort to GPT2-XL for the same reason as before (💸) and because the GPT3 API only enables using counter labels in a very inefficient (and even more expensive) way:

python3 dino.py --output_dir . --task_file imdb-reviews.json --model_name gpt2-xl --max_output_length 256 --top_k 0 --input_file ./imdb-movies-dataset.jsonl --input_file_type jsonl --num_entries_per_input_and_label 10 --min_num_tokens 16

This generates a set of positive and negative reviews for each movie. Below are some examples (the full dataset can be found here):

Bullfrogs on Poopy Mountain (positive 👍)

Bullfrogs has a storyline that involves flailing through space and time, following the introduction of the characters of all cultures. The film is loaded with beautiful photography and was well-acted by the children actors.

World’s Worst Christmas Movie (negative 👎)

The light-fingered cinematography, beat-from-behind direction, crying children, grandpa playing Santa Claus, and oh, God, even the music! God, I hate this movie!

Killer Eunuch (2006) (positive 👍)

Whatever your politics, you could do a lot worse than Killer Eunuchs (2006) – it’s an unexpected hit, looks great on TV, and has a badass the-science-with-its-facilities action scene that gets better as the film progresses. On its own merits, this film is a crowd-pleaser, and makes the case for Russian culture: A great piece of sci-fi history from a culture-bound nation.

And How Well Does it Work?

Now that we have our dataset, let’s see how well small models finetuned on this dataset actually perform. We train Distil-RoBERTa (base), RoBERTa (base) and RoBERTa (large) with 82M, 125M and 355M parameters, respectively. Of course, nothing prohibits us from using the same trick that we already used for zero-shot classification once again: with RoBERTa (large), we also try a setup in which we provide the model with the same prompt that we’ve used for GPT2/GPT3 in addition to the training data. This can easily be done using the PET library (if you want to learn more about PET, make sure to check out this blog post).
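Before any finetuning, the generated file has to be turned into (text, label) pairs. The sketch below assumes a JSON Lines file in which each entry stores the generated review under `text_b` and the label under `label`; these field names are assumptions, so adjust them to whatever your generated file actually contains. The two entries are made up.

```python
import io
import json
from typing import Iterable, List, Tuple


def load_examples(
    lines: Iterable[str], text_field: str = "text_b", label_field: str = "label"
) -> List[Tuple[str, int]]:
    """Parse JSON Lines into (review, label) pairs, skipping blank lines.

    The default field names are assumptions; adjust them to the generated file.
    """
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        examples.append((entry[text_field], int(entry[label_field])))
    return examples


# Two made-up entries standing in for a generated dataset file.
sample_file = io.StringIO(
    '{"text_a": "Some Guy\'s Dog", "text_b": "A charming little film.", "label": "1"}\n'
    '{"text_a": "Some Guy\'s Dog", "text_b": "Two hours I will never get back.", "label": "0"}\n'
)
train_examples = load_examples(sample_file)
```

The resulting list of pairs can then be passed to whichever training loop or library you use for finetuning.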

Surprisingly, all models trained on the dataset generated by GPT2-XL with DINO outperform zero-shot GPT2-XL (while being much smaller), and – combined with prompting – even perform similarly to GPT3 (while being much, much smaller):

Note that these results are, again, on a 500-example subset of IMDb. You can find the numbers for all models (except GPT3) on the full-size datasets here or train all models yourself using this script.
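Evaluating on a fixed subset only gives comparable numbers if every model sees the same examples. The post does not say how the subset was drawn, so the sketch below is one possible approach: the seed, the helper names, the toy data and the toy classifier are all assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple


def sample_subset(
    examples: Sequence[Tuple[str, int]], size: int, seed: int = 42
) -> List[Tuple[str, int]]:
    """Draw a fixed-size random subset; a fixed seed keeps it identical across models."""
    rng = random.Random(seed)
    return rng.sample(list(examples), min(size, len(examples)))


def accuracy(predict: Callable[[str], int], examples: Sequence[Tuple[str, int]]) -> float:
    """Fraction of examples for which the prediction matches the gold label."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(predict(text) == label for text, label in examples)
    return correct / len(examples)


# Toy data and a toy keyword classifier, both made up for illustration.
toy_data = [("I love it", 1), ("I hate it", 0)] * 500
subset = sample_subset(toy_data, size=500)
toy_accuracy = accuracy(lambda text: int("love" in text), subset)
```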

So What’s Next?

While these results are encouraging, binary sentiment classification on IMDb is of course a relatively easy task and we have not experimented with other tasks or datasets yet. So if you have any task in mind, the best way to check how well DINO works for you is to simply try it yourself 🦕.

[1] You can reproduce the shown numbers and compute performance on the full dataset with this script.

[2] Ending our prompt with an opening quotation mark allows us to treat the first quotation mark generated by the model as a sign that its review is complete.

[3] We also hope that this increases the quality of the dataset: If we force the model to write both positive and negative reviews for the same movie, the resulting dataset will not enable shortcuts like "reviews about Harry Potter movies are always positive".