A review of pre-trained language models: from BERT, RoBERTa, to ELECTRA, DeBERTa, BigBird, and more

In this blog post, we review a list of pretrained language models, including BERT, Transformer-XL, XLNet, RoBERTa, DistilBERT, ALBERT, BART, MobileBERT, ELECTRA, ConvBERT, DeBERTa, and BigBird.

BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al.

Description and Selling points

BERT is the first deeply bidirectional (or non-directional) pretrained language model: every layer conditions on both the left and the right context of a token. It uses self-supervised learning on unlabeled text to learn contextual representations of words. After pretraining, the model can be adapted to different tasks and datasets with minimal adjustments.

BERT fine-tuning: illustrations of fine-tuning BERT on different tasks. In general, BERT can be used for various Natural Language Understanding tasks with only a change in the output layer.

At the time of its publication, BERT not only outperformed state-of-the-art models on 11 tasks (including GLUE and SQuAD) by a large margin but also marked the start of the era of language model pretraining.

Architecture

BERT reuses the encoder block from the Transformer. $BERT_{BASE}$ stacks 12 encoder blocks, while $BERT_{LARGE}$ stacks 24. The two versions also differ in layer sizes, but the fundamental architecture is the same.

Configuration	$BERT_{BASE}$	$BERT_{LARGE}$
Encoder blocks	12	24
Embedding dim	768	1024
Attention heads	12	16
Parameters	110M	340M
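
To see where the parameter counts in the table come from, the sketch below adds up the weights of a BERT encoder from its layer sizes. Several values are assumptions that do not appear in this post: a vocabulary of 30,522 WordPiece tokens (the released uncased checkpoints), 512 position embeddings, 2 segment types, a feed-forward inner size of 4 times the hidden size, and a pooler layer on top. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512,
                     type_vocab=2, ffn_mult=4):
    """Approximate parameter count of a BERT encoder (no task-specific head).

    Default sizes are assumptions taken from the released uncased checkpoints.
    """
    layer_norm = 2 * hidden  # one scale and one bias vector

    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections (weights + biases).
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward network: hidden -> ffn_mult * hidden -> hidden (weights + biases).
    inner = ffn_mult * hidden
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden) + layer_norm

    # Pooler: one dense layer applied to the [CLS] representation.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-base:  {base / 1e6:.1f}M parameters")
print(f"BERT-large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the totals land close to the rounded figures in the table. The number of attention heads does not enter the count, because the heads only split the same projection matrices into smaller slices.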

Training

BERT is pretrained on 2 tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

For MLM, a random 15% of the input tokens are selected for prediction. Of these tokens, 80% are replaced with $[MASK]$, 10% are replaced with a random token from the vocabulary, and the remaining 10% are left unchanged. The objective of the model is to recover the original tokens at the selected positions.
For NSP, BERT takes two sentences as input, denoted A and B, and predicts whether B actually follows A in the corpus. This task aims to strengthen BERT’s ability to reason across sentences, which helps on tasks like Question Answering and Natural Language Inference.
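
The two pretraining tasks can be sketched in a few lines of plain Python. This is a minimal illustration, not BERT's actual data pipeline: the helper names `mask_tokens` and `make_nsp_example` are hypothetical, tokens are whole words rather than WordPiece units, and negative NSP pairs are drawn from the same short list of sentences. The 50/50 split between true and random next sentences follows the BERT paper and is not stated above.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    # Select about 15% of the positions (at least one) as prediction targets.
    num_targets = max(1, round(len(tokens) * mask_prob))
    for i in rng.sample(range(len(tokens)), num_targets):
        labels[i] = tokens[i]      # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:             # 80%: replace with [MASK]
            corrupted[i] = MASK
        elif roll < 0.9:           # 10%: replace with a random vocabulary token
            corrupted[i] = rng.choice(vocab)
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


def make_nsp_example(sentences, index, rng):
    """Build one (A, B, is_next) example; needs index + 1 < len(sentences) >= 3."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Positive example: B is the sentence that really follows A.
        return sentence_a, sentences[index + 1], True
    # Negative example: any sentence other than A itself or the true next one.
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False


rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog near the old river bank today".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted)
print(labels)

sentences = ["I went out.", "It was raining.", "Cats sleep a lot."]
print(make_nsp_example(sentences, 0, rng))
```

Positions whose label is `None` contribute nothing to the MLM loss, which is why the model only learns from the selected 15% of tokens.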

Data, Hardware, Speed

BERT is trained on 16GB of text data, about 3.3B words in total (BooksCorpus with 800M words and English Wikipedia with 2,500M words).

$BERT_{BASE}$ was trained on 16 TPU chips, while $BERT_{LARGE}$ was trained on 64 TPU chips. Each pretraining run took 4 days to complete.

Other information

While BERT is definitely a masterpiece, it also has a number of limitations. Here, we briefly mention some of the most important ones. All of them are addressed by more recent work, which we discuss in the next sections.

There is a discrepancy between pretraining and fine-tuning because of the $[MASK]$ token: it appears frequently during pretraining but never during fine-tuning.
During pretraining, BERT predicts each masked token independently of the others. This independence assumption is an oversimplification and leads to suboptimal performance.
BERT learns from only the 15% of input tokens that are selected for prediction, which makes pretraining sample-inefficient.
The Next Sentence Prediction task is too weak compared to Masked Language Modeling.
The context length of BERT, fixed at 512 tokens, is too short for some tasks.
In the Transformer architecture as used by BERT, the embedding dimension must equal the hidden layer dimension, a design constraint with no semantic justification.
The masking of the training data is done once and reused for all epochs. This can be improved with dynamic masking, which draws a new mask every time a sequence is fed to the model.
The training data is small: only 16GB of text.
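
The difference between masking once and masking dynamically can be illustrated with a small sketch. It only tracks which positions of a single 40-token sequence are chosen for prediction in each epoch. All function names are hypothetical, and the 10 epochs and the sequence length are arbitrary choices for illustration.

```python
import random


def choose_mask_positions(num_tokens, rng, mask_prob=0.15):
    """Pick about mask_prob of the positions (at least one) to mask."""
    k = max(1, round(num_tokens * mask_prob))
    return sorted(rng.sample(range(num_tokens), k))


def static_masks(num_tokens, num_epochs, seed=0):
    """Static masking: positions are drawn once and reused in every epoch."""
    positions = choose_mask_positions(num_tokens, random.Random(seed))
    return [positions for _ in range(num_epochs)]


def dynamic_masks(num_tokens, num_epochs, seed=0):
    """Dynamic masking: positions are redrawn each time the sequence is seen."""
    rng = random.Random(seed)
    return [choose_mask_positions(num_tokens, rng) for _ in range(num_epochs)]


def coverage(masks, num_tokens):
    """Fraction of positions that were a prediction target at least once."""
    seen = {p for positions in masks for p in positions}
    return len(seen) / num_tokens


static = static_masks(num_tokens=40, num_epochs=10)
dynamic = dynamic_masks(num_tokens=40, num_epochs=10)
print(f"static coverage:  {coverage(static, 40):.2f}")
print(f"dynamic coverage: {coverage(dynamic, 40):.2f}")
```

With a static mask the model is trained on the same 15% of positions in every epoch, so coverage never grows. Redrawing the mask lets different tokens of the same sequence serve as prediction targets over the course of training.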