Beyond BERT

Introduction

This article surveys a few neural language models that were developed as improvements over the popular BERT model. The next section provides a short overview of BERT (Devlin et al.), but reading the original paper is recommended, along with the ‘Attention Is All You Need’ paper (Vaswani et al.).

BERT

BERT (Bidirectional Encoder Representations from Transformers) is an encoder based on the transformer architecture. In large part, BERT uses the original transformer architecture as is and benefits greatly from the self-attention mechanism at its core. However, it makes some key changes, including removing the decoder stack and introducing the Masked Language Modeling (MLM) pretraining task.

Most of BERT’s predecessors captured context in only one direction, either leftward or rightward, and attempts at bidirectionality were limited to shallow concatenations of unidirectional contexts. The authors of BERT identified masked language modeling as an effective way to make their model truly bidirectional: by randomly masking tokens and training BERT to predict them, the model learns to use context from both directions, which let it achieve new state-of-the-art results on benchmarks such as GLUE and SQuAD. Another pretraining task used by BERT is Next Sentence Prediction (NSP), where, given a pair of sentences, the model must decide whether they occur consecutively in a document.

Image source: Alammar, Jay (2018). The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) [Blog post]. Retrieved from http://jalammar.github.io/illustrated-bert/
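To make the MLM objective concrete, here is a minimal sketch using the Hugging Face transformers library (an assumption; the example sentence and checkpoint name are illustrative). BERT fills in the masked token using context from both directions:

```python
# A minimal sketch of BERT's masked language modeling, assuming the
# Hugging Face `transformers` library is installed (plus PyTorch).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the [MASK] token using context from BOTH directions.
for pred in unmasker("Paris is the [MASK] of France."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
```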

RoBERTa

The authors of the paper that introduced RoBERTa (Robustly Optimized BERT Pretraining Approach) claim that BERT was significantly ‘undertrained’. They showed that, with some straightforward changes to the pretraining process, BERT’s performance could be greatly improved. Two of the key changes:

  1. Removal of the Next Sentence Prediction pretraining task
    The authors dropped the NSP task, which had been shown to contribute little to BERT’s performance. Instead, they packed each input with full sentences sampled contiguously from one or more documents, such that the total length was at most 512 tokens (a simplified packing sketch appears after this list).
  2. Dynamic masking
    The original BERT used static masking, meaning the tokens to be masked for the Masked Language Modeling pretraining task were fixed a priori during data preprocessing. The drawback is that the same set of tokens is masked, and hence predicted, in every epoch. To combat this, the RoBERTa authors introduced dynamic masking, in which a fresh masking pattern is generated every time a sequence is fed to the model (see the sketch after this list).
From ‘RoBERTa: A Robustly Optimized BERT Pretraining Approach’ by Liu et al.
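A simplified sketch of the sentence-packing described in item 1, under stated assumptions: the function name and tokenizer interface are illustrative, and the paper additionally lets inputs cross document boundaries with an extra separator token.

```python
def pack_sentences(sentences, tokenize, max_len=512):
    """Greedily pack contiguous sentences into inputs of at most
    max_len tokens, starting a new input on overflow."""
    inputs, current = [], []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if current and len(current) + len(tokens) > max_len:
            inputs.append(current)  # current input is full; start a new one
            current = []
        current.extend(tokens)
    if current:
        inputs.append(current)
    return inputs

# Example with whitespace splitting standing in for a real tokenizer.
docs = ["First sentence of a document.", "A second, longer sentence follows."]
print(pack_sentences(docs, str.split, max_len=8))
```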
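And a minimal sketch of dynamic masking itself: the 15% masking rate and the 80/10/10 replacement split come from the BERT paper, while the function name and the -100 ignore-label convention (borrowed from PyTorch) are illustrative.

```python
import random

def dynamic_mask(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Generate a fresh BERT-style mask on every call (dynamic masking).

    Of the selected positions, 80% become [MASK], 10% become a random
    token, and 10% keep the original token -- the 80/10/10 split from BERT.
    """
    masked = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                  # predict the original token here
            roll = random.random()
            if roll < 0.8:
                masked[i] = mask_id                        # replace with [MASK]
            elif roll < 0.9:
                masked[i] = random.randrange(vocab_size)   # random token
            # else: leave the token unchanged
    return masked, labels
```

Because the mask is sampled inside the function, calling it once per batch (or epoch) produces a different pattern every time, unlike static masking where the pattern is fixed during preprocessing.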

BART

BART (Bidirectional and Auto-Regressive Transformers) is a denoising autoencoder that maps a corrupted document to the original document it was derived from. It was trained by corrupting documents and then optimizing a reconstruction loss — the cross-entropy between the decoder output and the original document. Unlike existing denoising autoencoders, which are tailored to specific noising schemes, BART allows us to apply any type of document corruption. In the extreme case, where all information about the source is lost, BART is equivalent to a language model.

From ‘BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension’ by Lewis et al.
The paper explores several noising schemes (a text infilling sketch follows this list):
  1. Token deletion: Tokens are randomly deleted from the input, and the model must decide which positions are missing their inputs.
  2. Text infilling: Instead of single tokens being masked, spans of text are each replaced with a single [MASK] token, with span lengths sampled from a Poisson distribution. Similar to MLM, the model must predict the missing text, including how many tokens each span contained.
  3. Document rotation: A token is chosen at random and the document is rotated so that it starts with that token. The model must identify the original start of the document.
  4. Sentence shuffling: Sentences are shuffled into random order, and the model must restore the original ordering.
  5. Text infilling + Sentence shuffling: The combination ultimately used to pre-train the final BART models.
From ‘BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension’ by Lewis et al.
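A minimal sketch of text infilling under stated assumptions: span lengths are drawn from a Poisson distribution with lambda = 3 and roughly 30% of tokens are masked, as in the paper, while the per-position trigger probability and the function name are illustrative.

```python
import numpy as np

def text_infill(tokens, mask_token="[MASK]", mask_ratio=0.3, lam=3, rng=None):
    """BART-style text infilling: replace token spans with a single [MASK].

    Span lengths are sampled from a Poisson distribution (lambda = 3 in
    the paper); a length-0 span amounts to inserting a [MASK] token.
    """
    rng = rng or np.random.default_rng()
    budget = int(len(tokens) * mask_ratio)  # mask ~30% of tokens in total
    out, i, masked = [], 0, 0
    while i < len(tokens):
        if masked < budget and rng.random() < 0.15:
            span = int(rng.poisson(lam))
            out.append(mask_token)  # one [MASK] stands in for the whole span
            i += span               # drop `span` tokens (0 => pure insertion)
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

print(text_infill("the quick brown fox jumps over the lazy dog".split()))
```

The corrupted sequence is fed to BART’s encoder, and the decoder is trained to reconstruct the original document with the cross-entropy loss described above.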

DistilBERT

DistilBERT is a smaller, faster, and lighter version of BERT produced through knowledge distillation, in which a compact ‘student’ model is trained to reproduce the behavior of a larger ‘teacher’ model (here, BERT). Sanh et al. report that DistilBERT retains about 97% of BERT’s language-understanding capabilities while being 40% smaller and 60% faster.
Distilling the knowledge from a teacher to a student. Source: Neural Network Distiller 2019
Instead of learning only from hard labels, the student is trained to match the teacher’s soft output probabilities, which are computed with a softmax that includes a temperature T:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

With T > 1, the output distribution is flattened, exposing the teacher’s relative confidence in incorrect classes; at inference time, T is set back to 1.
From ‘DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter’ by Sanh et al.
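A minimal sketch of the soft-target distillation loss in PyTorch (the paper’s full training objective also combines this with the MLM loss and a cosine embedding loss between student and teacher hidden states):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student outputs.

    Dividing the logits by T > 1 flattens both distributions so the student
    can learn from the teacher's ranking of incorrect classes. Scaling by
    T**2 keeps gradient magnitudes comparable across temperatures
    (Hinton et al., 2015).
    """
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T ** 2)

# Example: a batch of 4 examples over a 10-class output space.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher))
```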

Observations and Insights

  1. Each model discussed in this article incorporates lessons from its predecessors to improve its own performance.
  2. NLP’s recent rise to prominence is due in large part to breakthroughs in pre-training objectives such as masked language modeling and next sentence prediction.
  3. The majority of neural language models are extremely large, and their size and prediction latency act as a bottleneck for wider adoption.
  4. Model compression techniques will play a big role in allowing these models to be used in a wider range of applications.

References

  1. Attention Is All You Need (https://arxiv.org/pdf/1706.03762.pdf)
  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805.pdf)
  3. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) (http://jalammar.github.io/illustrated-bert/)
  4. RoBERTa: A Robustly Optimized BERT Pretraining Approach (https://arxiv.org/pdf/1907.11692.pdf)
  5. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (https://arxiv.org/pdf/1910.13461.pdf)
  6. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (https://arxiv.org/pdf/1910.01108.pdf)
  7. Distilling the Knowledge in a Neural Network (https://arxiv.org/pdf/1503.02531.pdf)
  8. Knowledge Distillation (https://intellabs.github.io/distiller/knowledge_distillation.html)
