✨ May 2020 NLP Papers: Synthesizer, RAG, Movement Pruning, GPT-3

Timo Schick

Reading List

Jun 3, 2020

This is a list of NLP papers I enjoyed reading, containing only papers that have been published on arXiv in May 2020. Papers are loosely grouped by topic. If you find this list helpful, think that a great paper is missing or have some other comment, let me know!

🏗️ Model Architectures

Synthesizer: Rethinking Self-Attention in Transformer Models As an alternative to the Transformer’s dot-product self-attention, this paper experiments with variants where (1) each token predicts its own attention weights without “looking at other tokens” and (2) a global attention matrix is learned without looking at any tokens. Variant (2) performs surprisingly well, even matching the performance of a vanilla Transformer on WMT’14 EnDe.
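
To make variant (2) concrete, here is a minimal sketch in plain Python. It assumes a single head and skips the value projection; the names `random_synthesizer_attention` and `attention_logits` are made up for illustration and the logits are random here instead of trained. The point is only that the attention weights never depend on the input tokens.

```python
import math
import random

def softmax(row):
    # Numerically stable softmax over one row of logits
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_synthesizer_attention(values, attention_logits):
    """Mix value vectors using an input-independent attention matrix."""
    seq_len = len(values)
    dim = len(values[0])
    # Shorter inputs only use the top-left seq_len x seq_len block
    weights = [softmax(attention_logits[i][:seq_len]) for i in range(seq_len)]
    return [
        [sum(weights[i][j] * values[j][d] for j in range(seq_len)) for d in range(dim)]
        for i in range(seq_len)
    ]

random.seed(0)
max_len = 4
# Hypothetical parameter: a real model would learn this matrix by gradient descent
attention_logits = [[random.gauss(0, 1) for _ in range(max_len)] for _ in range(max_len)]

values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = random_synthesizer_attention(values, attention_logits)
```

Because the weights are fixed once training is done, two completely different inputs of the same length are mixed in exactly the same way; only the values differ.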

🔎 Question Answering & Text Retrieval

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Similar to REALM, a pretrained seq2seq model is augmented with a neural retriever and the resulting model is fine-tuned end-to-end. Retrieval is performed using the inner product between the BERT-based representations of all indexed documents and a given query. In contrast to REALM, the document representations are not updated during fine-tuning, making costly recomputations of the document index unnecessary.
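
The retrieval step described above boils down to a maximum inner product search. A toy sketch, assuming the embeddings already exist (the three-dimensional vectors and the `retrieve` helper are invented for illustration; a real system would use BERT-sized vectors and an approximate search index):

```python
def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents with the largest inner product."""
    scores = [inner_product(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy document index: computed once and kept fixed during fine-tuning
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.5]]
# Toy query embedding: in the setup described above, only this side keeps changing
query_vec = [1.0, 0.0, 0.1]

top = retrieve(query_vec, doc_vecs, k=2)
print(top)  # [0, 2]
```

Since `doc_vecs` never changes, the index does not have to be rebuilt while the query encoder is being fine-tuned.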

Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering Two clever modifications for pretraining a QA retrieval component: (1) To increase the training set, a BART model generates questions given a paragraph and an answer. (2) To get effective negative samples for each (question, paragraph)-pair, the retrieval model itself periodically encodes all paragraphs to obtain clusters of similar paragraphs and all examples in a training batch are sampled from the same cluster.

⏲️ Time & Space Efficiency

Movement Pruning: Adaptive Sparsity by Fine-Tuning For reducing model size in a transfer learning setting, this paper proposes movement pruning: Weights are not pruned based on their absolute value, but based on whether they shrink during fine-tuning. Combining this idea with distillation results in models that perform almost as well as BERT-base on MNLI, SQuAD and QQP with only 3% of its parameters.
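
A toy comparison of the two criteria, under a strong simplification: instead of the learned importance scores used in the paper, this sketch just compares weights before and after fine-tuning and treats `|w_after| - |w_before|` as the "movement". All numbers and function names are made up for illustration.

```python
def top_k_mask(scores, k):
    # True for the k entries with the highest score, False for the rest
    keep = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [i in keep for i in range(len(scores))]

def magnitude_mask(w_after, k):
    # Magnitude pruning: keep the weights with the largest absolute value
    return top_k_mask([abs(w) for w in w_after], k)

def movement_mask(w_before, w_after, k):
    # Simplified movement criterion: keep the weights that moved away from zero
    return top_k_mask([abs(a) - abs(b) for b, a in zip(w_before, w_after)], k)

w_before = [0.80, 0.05, -0.60, 0.10]   # pretrained weights
w_after = [0.50, 0.30, -0.55, 0.40]    # weights after fine-tuning

print(magnitude_mask(w_after, 2))            # [True, False, True, False]
print(movement_mask(w_before, w_after, 2))   # [False, True, False, True]
```

The two masks disagree completely here: magnitude pruning keeps weights that were large after pretraining but are shrinking, while the movement criterion keeps the initially small weights that fine-tuning pushed away from zero.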

📚 Language Model Pretraining

Language Models are Few-Shot Learners Introducing GPT-3, a 175 billion parameter Transformer model, this paper demonstrates once more the power of training ever larger language models on ever larger corpora. Simply conditioning GPT-3 on a few (input, output)-pairs without any parameter updates yields impressive few-shot results across a wide range of tasks.
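
Conditioning on (input, output)-pairs just means writing them into the prompt. A minimal sketch; the `Input:`/`Output:` layout and the `build_few_shot_prompt` helper are assumptions for illustration, not the exact format used in the paper:

```python
def build_few_shot_prompt(demonstrations, query, input_label="Input", output_label="Output"):
    """Concatenate solved examples, then the unsolved query with an empty output slot."""
    blocks = [f"{input_label}: {x}\n{output_label}: {y}" for x, y in demonstrations]
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)

demos = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt(demos, "book", input_label="English", output_label="French")
print(prompt)
```

The language model is then simply asked to continue this text; no gradient step is involved.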

Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models To improve discourse-level representations, this paper proposes a new pretraining objective: Given a sentence at position t and an offset k, the model must identify the sentence at position t+k given a set of candidate sentences. Further pretraining a BERT model with this objective (while keeping the MLM objective) gives improvements for various natural language understanding tasks.
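
One way to read this objective is as a softmax classification over the candidate sentences. The sketch below assumes sentences are already encoded as vectors and scores candidates with a plain dot product; both the scoring function and the toy vectors are assumptions, not necessarily what the paper does.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(anchor, candidates, target_index):
    """Negative log-probability of the true sentence under a softmax over candidates."""
    scores = [dot(anchor, c) for c in candidates]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target_index]

anchor = [1.0, 0.0]                                # encoding of the sentence at position t
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # index 0 plays the sentence at t+k

loss_correct = contrastive_loss(anchor, candidates, target_index=0)
loss_wrong = contrastive_loss(anchor, candidates, target_index=2)
```

The loss is small when the true sentence at position t+k scores higher than the distractors, which is what pushes the encoder towards discourse-aware representations.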

TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data A BERT variant that is pretrained jointly on text and tabular data. Example: The text “How big is NY?” and the row (New York, 8398748) from a table with columns (City, Population) is represented as “[CLS] How big is NY? [SEP] City | text | New York [SEP] Population | real | 8398748 [SEP]”.
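
The linearization from the example can be reproduced in a few lines. The `linearize` function and its argument layout are made up for illustration; only the output format is taken from the example above.

```python
def linearize(utterance, columns, row):
    """columns: list of (name, type) pairs; row: list of cell values in the same order."""
    cells = [f"{name} | {ctype} | {value}" for (name, ctype), value in zip(columns, row)]
    return "[CLS] " + utterance + " [SEP] " + " [SEP] ".join(cells) + " [SEP]"

text = linearize(
    "How big is NY?",
    [("City", "text"), ("Population", "real")],
    ["New York", 8398748],
)
print(text)
# [CLS] How big is NY? [SEP] City | text | New York [SEP] Population | real | 8398748 [SEP]
```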

Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models Vanilla LMs discard lots of information present in many documents (font size, spacing, …). This is addressed by (1) encoding font information, (2) putting a Graph Convolutional Network on top of RoBERTa to model spatial ordering, and (3) using an additional pretraining objective where the relative position of text boxes has to be predicted.

🤹 Multi-Task Learning

Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work? Analyzes the benefits of fine-tuning pretrained language models on intermediate tasks before training on target tasks. Perhaps unsurprisingly, tasks involving commonsense reasoning and complex inference turn out to work best as intermediate tasks. If you have no time to read the entire paper, make sure to at least check out Figure 2.

AdapterFusion: Non-Destructive Task Composition for Transfer Learning To “effectively exploit the representations learned from multiple tasks in a non-destructive manner”, this paper proposes two training stages: (1) For each task, an adapter is learned (if you are unfamiliar with adapters, you can read about them here). (2) The adapters are combined using an attention mechanism. For 12 of 16 tasks considered, the best configuration performs better than fully fine-tuning BERT.

🔬 Language Model Probing

When BERT Plays the Lottery, All Tickets Are Winning Inspired by the Lottery Ticket Hypothesis, this paper shows how pruning attention heads and fully connected layers can reveal “good” and “bad” subnetworks in a fine-tuned BERT model. However, separately fine-tuned bad subnetworks perform similarly to good ones. Conclusion: BERT has no truly bad subnetworks.

How Context Affects Language Models’ Factual Predictions To improve zero-shot question answering abilities of BERT, input queries are augmented with relevant contexts obtained from a simple Information Retrieval system. This makes the (fully unsupervised) BERT model competitive with a supervised baseline QA model on the LAMA probing task.

The Sensitivity of Language Models and Humans to Winograd Schema Perturbations Examples from the Winograd Schema Challenge are altered using various perturbations (tense switch, number switch, gender switch, …) that only minimally affect human understanding and do not change the correct referent. While humans are “stable” to these perturbations, they often cause pretrained language models to change their predictions.

🧩 Cross-Lingual & Few-Shot Learning

Identifying Necessary Elements for BERT’s Multilinguality Language models pretrained on multiple languages (but without any cross-lingual signal) can be fine-tuned on English and still work well for non-English data. Using synthetic data, this paper identifies some requirements for this cross-lingual transfer to work – including underparameterization, shared special tokens ([CLS], [SEP], …) and comparability of pretraining corpora.

English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too As if it wasn’t weird enough already that a language model pretrained on multilingual data and fine-tuned on English data performs well on non-English data, this paper shows that adding an intermediate step of training on other English tasks further improves performance on the XTREME benchmark.

Moving Down the Long Tail of Word Sense Disambiguation with Gloss Informed Bi-encoders To overcome bias towards frequent senses in WSD systems, two BERT-based encoders are used: One encoder represents a word’s context, the other encodes definitions for word senses. Both encoders are optimized jointly and WSD is performed by finding the most similar sense encoding for a given context encoding. This gives large improvements especially in few-shot settings and for rare senses.

🗂️ Datasets

XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning Introduces XCOPA, a multilingual commonsense reasoning dataset consisting of result/cause questions (example: “The girl found a bug in her cereal. What happened as a result?” with options (a) “She poured milk in the bowl.” and (b) “She lost her appetite.”). All examples are translations from the English COPA dataset; languages are selected based on typology, family and geography to obtain a diverse sample.

💬 Comments

Did you find this list helpful? Or do you think that a great paper is missing? You can leave comments by simply replying to this tweet.