✨ April 2020 NLP Papers: Longformer, LSRA, MixText, Blender

Timo Schick

Reading List

May 4, 2020

This is a list of NLP papers matching two criteria: (1) they were published on arXiv in April 2020 and (2) I enjoyed reading them. Papers are loosely grouped by topic. If you find this list helpful or think that a great paper is missing, let me know!

🏗️ Model Architectures

Longformer: The Long-Document Transformer An O(n) version of the Transformer using a set of local and global self-attention patterns (some tokens attend to all other tokens, some attend only to a neighboring window, …). Also contains experiments on downstream task performance & LM pretraining, something I missed in the Reformer paper.
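
To make the local + global pattern concrete, here is a minimal sketch of such an attention mask in plain Python. It is an illustration only, not code from the paper: the function names, the symmetric window and the choice of global positions are all assumptions. For a fixed window size and a fixed number of global tokens, the number of allowed attention pairs grows linearly with the sequence length.

```python
def longformer_style_mask(n, window, global_positions):
    """Build an n x n mask; mask[i][j] is True if token i may attend to token j.

    Hypothetical helper: `window` is the total local window width and
    `global_positions` lists the tokens that get global attention.
    """
    half = window // 2
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Local pattern: each token sees its neighbors within the window.
        for j in range(max(0, i - half), min(n, i + half + 1)):
            mask[i][j] = True
        # Global pattern: global tokens see everything and are seen by everything.
        for g in global_positions:
            mask[i][g] = True
            mask[g][i] = True
    return mask


def count_allowed_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for n in (8, 16, 32):
        pairs = count_allowed_pairs(longformer_style_mask(n, window=2, global_positions=[0]))
        print(n, pairs, "vs full attention:", n * n)
```

Doubling the sequence length roughly doubles the number of allowed pairs, whereas full attention quadruples it.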

Lite Transformer With Long-Short Range Attention A transformer variant for mobile applications. Introduces Long-Short Range Attention (LSRA), a combination of (1) convolutions to cover local context and (2) regular attention to cover global context. Strong performance for MT and language modeling under constrained resources. Only useful for short sequences: the attention layer is still O(n²).

🧩 Unsupervised & Few-Shot Learning

Unsupervised Commonsense Question Answering with Self-Talk Two language models “chat” with each other using predefined question patterns (“What is the definition of …?”, “What can be used for …?”); this improves performance for zero-shot question answering. Weird: chats that seem entirely useless to the human eye often help the LM. Make sure to check out Table 6.

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification Mixup for text classification: new examples are created from labeled examples by forming a linear combination of their hidden representations and their labels. In combination with various data augmentation strategies (similar to UDA), this achieves strong results for various text classification tasks with as few as 10 examples per label.
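
The interpolation step can be sketched in a few lines. This is only the generic mixup arithmetic under simplifying assumptions: hidden states and one-hot labels are plain lists of floats, the mixing weight `lam` is passed in explicitly, and the function name is made up for illustration. Which layer is mixed and how `lam` is sampled are details of the paper that this sketch does not cover.

```python
def mix_examples(hidden_a, hidden_b, label_a, label_b, lam):
    """Linearly interpolate two hidden representations and their labels.

    Hypothetical helper: `lam` in [0, 1] is the weight of the first example.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    mixed_hidden = [lam * a + (1 - lam) * b for a, b in zip(hidden_a, hidden_b)]
    mixed_label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_hidden, mixed_label


if __name__ == "__main__":
    # Two toy examples with 2-dimensional hidden states and one-hot labels.
    hidden, label = mix_examples([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam=0.75)
    print(hidden, label)
```

The mixed label is a soft label, so the training loss has to accept a probability distribution as its target rather than a single class index.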

⏲️ Time & Space Efficiency

The Right Tool for the Job: Matching Model and Instance Complexities Instead of using distillation/pruning to speed up inference, this paper proposes to let the pretrained LM figure out the difficulty of an example and to allow for an “early exit” if the example is simple. Key difference from standard BERT: during finetuning, classifiers are learned at various layers; during inference, if a lower-layer classifier is confident enough in its prediction, the model takes an early exit.
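
The inference-time logic might look roughly like the sketch below. It assumes that "confident enough" means the maximum softmax probability exceeds a fixed threshold and that the logits of each layer's classifier are already available as a list. In a real model the higher layers would only be computed if the lower ones did not exit. All names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(layer_logits, threshold):
    """Return (predicted_label, exit_depth) for one example.

    `layer_logits` holds one non-empty logits list per classifier, ordered
    from the lowest to the highest layer. The last classifier always answers.
    """
    for depth, logits in enumerate(layer_logits, start=1):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold or depth == len(layer_logits):
            return probs.index(confidence), depth


if __name__ == "__main__":
    logits_per_layer = [[0.1, 0.2], [3.0, -3.0], [0.0, 5.0]]
    print(early_exit_predict(logits_per_layer, threshold=0.9))
    print(early_exit_predict(logits_per_layer, threshold=0.999))
```

Lowering the threshold trades accuracy for speed, since more examples leave at a shallow layer.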

📚 Language Model Pretraining

Contextualized Representations Using Textual Encyclopedic Knowledge Idea: Augment inputs with relevant Wikipedia snippets during pretraining and finetuning. Snippets are automatically selected using simple n-gram based retrieval and appended to the input. Doing so is beneficial for various Question Answering tasks.
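
As a rough illustration of what "simple n-gram based retrieval" could mean, the sketch below ranks candidate snippets by the number of word bigrams they share with the input and appends the best one. The scoring function, the whitespace tokenization and the ` [SEP] ` separator are my assumptions, not the paper's actual retrieval setup.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def retrieve_snippets(query, snippets, n=2, top_k=1):
    """Return the `top_k` snippets sharing the most n-grams with the query."""
    query_ngrams = word_ngrams(query, n)
    ranked = sorted(
        snippets,
        key=lambda s: len(query_ngrams & word_ngrams(s, n)),
        reverse=True,
    )
    return ranked[:top_k]


def augment_input(query, snippets, sep=" [SEP] ", n=2, top_k=1):
    """Append the retrieved snippets to the original input."""
    return sep.join([query] + retrieve_snippets(query, snippets, n, top_k))


if __name__ == "__main__":
    wiki = [
        "The Eiffel Tower is in Paris",
        "On the Origin of Species was written by Charles Darwin",
    ]
    print(augment_input("who wrote the origin of species", wiki))
```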

Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks Investigates the benefits of domain- and task-specific pretraining of LMs (as in ULMFiT) for four different domains. Main takeaway: Using three stages of pretraining (general domain -> in-domain data -> task data) considerably improves performance for all considered tasks, even in high-resource settings.

MPNet: Masked and Permuted Pre-training for Language Understanding Proposes a new pretraining objective that combines the benefits of Masked Language Modeling (as in BERT) and Permuted Language Modeling (as in XLNet). The key difference from Permuted Language Modeling is that the model can also attend to the position embeddings of all masked tokens. This leads to surprising performance improvements.

Byte Pair Encoding is Suboptimal for Language Model Pretraining If you have worked with BERT & co., you’ve probably noticed that BPE & WordPiece sometimes come up with nonsensical tokenizations (unicycle -> un-ic-y-cle). This paper shows that Unigram LM produces tokenizations that align better with morphology and that it outperforms other tokenization methods on downstream tasks.

Injecting Numerical Reasoning Skills into Language Models To improve BERT’s numerical reasoning abilities, additional pretraining data (-100.5 + 1337 + 90.6 = ?) is automatically generated. The proposed idea requires no architectural changes and is “a general recipe for injecting skills into large pre-trained LMs whenever the skill is amenable to automatic data augmentation.”
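
Generating this kind of data automatically is straightforward. The sketch below produces questions in the style of the example above together with their answers. The number range, the restriction to addition and subtraction and the rounding to one decimal place are assumptions for illustration, not the paper's actual templates.

```python
import random


def make_arithmetic_example(rng, n_terms=3, low=-1000.0, high=1000.0):
    """Return (question, answer) for a random sum/difference of `n_terms` numbers.

    Hypothetical generator: numbers have one decimal place and the answer is
    rounded to one decimal place as well.
    """
    terms = [round(rng.uniform(low, high), 1) for _ in range(n_terms)]
    expression = str(terms[0])
    total = terms[0]
    for term in terms[1:]:
        op = rng.choice("+-")
        expression += f" {op} {term}"
        total = total + term if op == "+" else total - term
    return f"{expression} = ?", round(total, 1)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the generated data is reproducible
    for _ in range(3):
        question, answer = make_arithmetic_example(rng)
        print(question, "->", answer)
```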

🔬 Language Model Probing

“You are grounded!”: Latent Name Artifacts in Pre-trained Language Models Pretrained LMs strongly associate various artifacts with given names. Example: The name “Donald” is associated with a very negative sentiment. This paper argues that such artifacts have an impact on the fairness of pretrained LMs and shows that just flipping given names often causes LMs to change their predictions.

StereoSet: Measuring Stereotypical Bias in Pretrained Language Models A new dataset to measure the stereotypical biases of pretrained LMs. Consists of contexts that a language model can complete with (a) a stereotype, (b) an anti-stereotype, or (c) something completely unrelated. Example: “Girls tend to be more __ than boys” with options (a) soft, (b) determined, (c) fish. Results show that “stronger” language models exhibit more stereotypical bias.

What Happens to BERT Embeddings During Fine-tuning? This paper sheds some light on how BERT changes during finetuning. Main findings: (1) finetuning does not lead to catastrophic forgetting of linguistic phenomena, and (2) only the top layers of BERT are strongly affected by finetuning.

🎌 Cross-Domain & Cross-Lingual

Cross-lingual Contextualized Topic Models with Zero-shot Learning Idea: A model learns topics in one language and then predicts topics for documents in other languages. This is achieved using a variant of LDA (Neural-ProdLDA) and replacing BoW representations with contextualized sentence representations from a multilingual BERT model.

Are All Good Word Vector Spaces Isomorphic? Methods for aligning cross-lingual vector spaces require the vector spaces to be isomorphic, which may not always be the case. This paper shows that non-isomorphism is largely due to under-training or a lack of monolingual resources and not primarily a result of typological differences.

SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings An intuitive method of obtaining word alignments without using any parallel data. This is achieved by comparing the similarity of contextualized word representations using multilingual models such as mBERT. There’s also an online demo.
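
One simple way to turn embedding similarities into alignments is a mutual-argmax rule: align two words if each is the other's most similar counterpart. The sketch below implements that rule on toy vectors. It assumes the contextualized embeddings have already been computed (for example with mBERT) and is not meant as a reproduction of the paper's exact matching procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_argmax_alignment(src_vecs, tgt_vecs):
    """Return (source_index, target_index) pairs that pick each other as best match."""
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target word for every source word, and vice versa.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sim]
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sim[i][j])
        for j in range(len(tgt_vecs))
    ]
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


if __name__ == "__main__":
    source = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy "embeddings" of 3 source words
    target = [[1.0, 0.0], [0.0, 1.0]]              # toy "embeddings" of 2 target words
    print(mutual_argmax_alignment(source, target))
```

In the toy example the second source word stays unaligned: its best target already prefers the first source word.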

🤖 Chatbots

Recipes for building an open-domain chatbot Recently, Facebook publicly released their Blender chatbot. The accompanying paper discusses important skills (such as personality and empathy) for chatbots to be more “human-like” and how these skills can be obtained by selecting appropriate training data and generation strategies.