📃 Byte Pair Encoding is Suboptimal for Language Model Pretraining

Timo Schick

Paper Picks

Apr 14, 2020

My notes from reading “Byte Pair Encoding is Suboptimal for Language Model Pretraining” by Kaj Bostrom and Greg Durrett, which compares tokenization methods for language model pretraining.

Why is this an important topic?

If you have ever worked with BERT or one of its relatives, you’ve probably noticed that these models work on the subword level. That is, some words are not represented as single tokens, but as a sequence of subword tokens. You can easily check that yourself using the transformers library:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('firefighter'))
# Out: ['fire', '##fighter']

In general this is a great thing, because it allows the model to cope better with rare and novel words. For example, even if the word “firefighter” had never occurred in the training corpus, the model could probably still make some sense out of it, assuming it has seen the words “fire” and “fighter” often enough.
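
To make the splitting step a bit more concrete, here is a minimal sketch of the greedy longest-match-first search that BERT's tokenizer applies to each word against a fixed vocabulary. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for illustration; the real vocabulary has roughly 30,000 entries, and the real tokenizer also handles lowercasing, punctuation and overly long words.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remainder is in the vocabulary: give up on the word.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, not BERT's actual one.
toy_vocab = {"fire", "fight", "##fight", "##fighter", "##er"}
print(wordpiece_tokenize("firefighter", toy_vocab))
```

Because the search is greedy, the result depends entirely on which pieces happen to be in the vocabulary, which is why the choice of vocabulary construction algorithm matters so much.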

However, working on the subword level requires some way of splitting words into subword tokens. For pretrained Transformer LMs, commonly used algorithms are Byte-Pair Encoding (BPE) and Google’s non-public WordPiece method. Especially for less common words, neither algorithm works particularly well, as illustrated by these examples (more examples and a detailed analysis can be found in one of my recent papers):

print(tokenizer.tokenize('unicycle'))
# Out: ['un', '##ic', '##y', '##cle']
print(tokenizer.tokenize('unaccessible'))
# Out: ['una', '##cc', '##ess', '##ible']
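
To see where such odd splits can come from, it helps to recall how BPE builds its vocabulary: it starts from single characters and repeatedly merges the most frequent adjacent pair of symbols, purely based on counts. The following is a toy sketch of that merge loop, not the implementation used by any particular model. The function name `learn_bpe` and the word frequencies are invented for illustration, and real implementations add details such as end-of-word markers or byte-level symbols.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair that was seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the merge everywhere in the corpus.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical word counts, chosen only to keep the example small.
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_freqs, num_merges=3)
print(merges)
print(list(segmented))
```

Note that nothing in this procedure looks at whether a merged pair corresponds to a meaningful unit; frequency alone decides.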

So naturally, the question arises: Can we do any better than BPE and WordPiece? This is the very question that today’s paper tries to answer.

What are the paper’s main findings?

The paper compares BPE to yet another tokenization method that none of the popular pretrained language models uses: Unigram language modeling, as introduced by (Kudo, 2018). This method creates a vocabulary of subword tokens as follows: Given some input text $D$, we first consider the set $V$ of all substrings occurring at least twice in $D$. Using this set of tokens, a unigram LM is trained on $D$. We then gradually remove those tokens from $V$ whose removal has the least negative impact on the LM’s performance on $D$, that is, on the likelihood it assigns to $D$ (for more details, check out the original paper, which is well worth reading). In comparing the Unigram LM method with BPE, the paper’s key findings are:
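
The pruning procedure described above can be sketched in a few lines of Python. This is a heavily simplified illustration and not the algorithm as implemented by Kudo or in libraries such as SentencePiece. In particular, I assume hard-EM (counts from the single best Viterbi segmentation) instead of proper EM, I do not renormalize probabilities when estimating the effect of removing a token, single characters are always kept so that every word stays segmentable, and the names `seed_vocab`, `viterbi`, `fit`, `prune` as well as the `drop_frac` parameter and the word list are my own inventions.

```python
import math
from collections import Counter


def seed_vocab(words):
    """All substrings occurring at least twice, plus every single character."""
    counts = Counter()
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                counts[w[i:j]] += 1
    vocab = {s for s, c in counts.items() if c >= 2}
    return vocab | {ch for w in words for ch in w}


def viterbi(word, logp):
    """Most probable segmentation of `word`; returns (tokens, log-probability)."""
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)  # (score, start of last piece)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[len(word)][0]


def fit(words, vocab, iters=3, smooth=0.1):
    """Hard-EM estimate of unigram log-probabilities (assumption: Viterbi counts)."""
    logp = {t: -math.log(len(vocab)) for t in vocab}  # uniform start
    for _ in range(iters):
        counts = Counter()
        for w in words:
            counts.update(viterbi(w, logp)[0])
        total = sum(counts.values()) + smooth * len(vocab)
        logp = {t: math.log((counts[t] + smooth) / total) for t in vocab}
    return logp


def prune(words, vocab, target_size, drop_frac=0.2):
    """Repeatedly drop the tokens whose removal hurts the corpus likelihood least."""
    vocab = set(vocab)
    chars = {ch for w in words for ch in w}
    while len(vocab) > target_size:
        logp = fit(words, vocab)
        base = sum(viterbi(w, logp)[1] for w in words)
        losses = {}
        for tok in vocab - chars:  # single characters are never removed
            reduced = {t: p for t, p in logp.items() if t != tok}
            losses[tok] = base - sum(viterbi(w, reduced)[1] for w in words)
        if not losses:
            break
        n_drop = max(1, min(int(drop_frac * len(losses)), len(vocab) - target_size))
        for tok in sorted(losses, key=losses.get)[:n_drop]:
            vocab.remove(tok)
    return vocab


# Hypothetical toy "corpus": a handful of word types instead of real text.
words = ["completely", "complete", "lately", "suggestions",
         "suggestion", "suggest", "questions", "question"]
vocab = prune(words, seed_vocab(words), target_size=25)
print(sorted(vocab, key=len, reverse=True))
print(viterbi("completely", fit(words, vocab))[0])
```

The key difference to BPE is the direction: instead of growing a vocabulary bottom-up by frequency, this approach starts with a large candidate set and shrinks it top-down, guided by a probabilistic model of the text.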

Unigram LM tokenization aligns better with morphology. As an example, consider the following phrase from the paper:

 Original:    Completely    preposterous     suggestions
 BPE:         Comple-t-ely  prep-ost-erous   suggest-ions
 Unigram LM:  Complete-ly   pre-post-er-ous  suggestion-s

Unigram LM produces longer tokens and uses the vocabulary space more efficiently.
A BERT-like Transformer using Unigram LM outperforms an identical model using BPE on three diverse tasks (SQuAD Question Answering, MNLI, CoNLL Named Entity Recognition).

So, where to go from here?

Personally, I would be very curious to see how character-based approaches compare to BPE and Unigram LM. Two methods come to mind:

The CNN architecture of (Kim et al., 2016), which (Baevski et al., 2019) show to outperform BPE-based tokenization on GLUE. In this architecture, character-level embeddings are first combined to form word embeddings using a CNN architecture, and these word embeddings are then fed into a regular Transformer.
A purely character-based Transformer model like the one of (Al-Rfou et al., 2019), allowing for character-to-character attention across word boundaries. Because the cost of self-attention grows quadratically with sequence length, this may be prohibitively expensive for long character sequences, but there are several promising solutions for this very problem, like the Reformer or the Longformer.
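
To get a rough feeling for why character-level attention is costly, here is a back-of-the-envelope calculation using the example phrase from the paper. It only counts the query-key pairs that full self-attention has to score for a single sequence and ignores everything else (hidden sizes, number of layers and heads, special tokens), so the numbers are an illustration rather than a benchmark. The helper `attention_pairs` is a made-up name.

```python
def attention_pairs(seq_len):
    """Number of query-key pairs scored by full self-attention on one sequence."""
    return seq_len * seq_len


text = "Completely preposterous suggestions"

# Unigram LM segmentation of the phrase, as given in the paper's example.
subword_tokens = ["Complete", "ly", "pre", "post", "er", "ous", "suggestion", "s"]
# Character-level input: every character, including spaces, is one position.
char_tokens = list(text)

for name, tokens in [("subword", subword_tokens), ("character", char_tokens)]:
    print(name, len(tokens), attention_pairs(len(tokens)))

ratio = attention_pairs(len(char_tokens)) / attention_pairs(len(subword_tokens))
print(f"character-level attention scores about {ratio:.1f}x more pairs")
```

The gap widens further for longer inputs, which is exactly what sparse or otherwise more efficient attention variants try to address.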

Another interesting aspect would be to consider languages other than English, especially languages that are more morphologically complex (I would assume that for such languages, good tokenization algorithms are even more important).

Got any comments?

Or do you know of another recent paper worth writing a blog post about? Let me know on Twitter: @timo_schick :)

References

Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/p18-1007
Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-aware Neural Language Models. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2741–2749. http://dl.acm.org/citation.cfm?id=3016100.3016285
Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., & Auli, M. (2019). Cloze-driven Pretraining of Self-attention Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5359–5368. https://doi.org/10.18653/v1/D19-1539
Al-Rfou, R., Choe, D., Constant, N., Guo, M., & Jones, L. (2019). Character-Level Language Modeling with Deeper Self-Attention. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 3159–3166. https://doi.org/10.1609/aaai.v33i01.33013159