A review of pre-trained language models: from BERT, RoBERTa, to ELECTRA, DeBERTa, BigBird, and more

A beautiful scene

In this blog post, we review a list of pretrained language models, including BERT, Transformer-XL, XLNet, RoBERTa, DistilBERT, ALBERT, BART, MobileBERT, ELECTRA, ConvBERT, DeBERTa, and BigBird.


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al.

Description and Selling points

BERT is the first bi-directional (or non-directional) pretrained language model. It uses self-supervised learning to learn the deep meaning of words and contexts. After pretraining, the model can be adapted to different tasks as well as different datasets with minimal adjustments.

BERT finetuning
Illustrations of fine-tuning BERT on different tasks. In general, BERT can be used for various Natural Language Understanding tasks with only a change in the output layer.

At the time it is published, BERT not only outperformed state-of-the-art models on 11 tasks (including GLUE and SQuADs) by a large margin but also marked the era of language model pretraining.


BERT reuses the encoder block from the Transformer. The BERT_{BASE} uses 12 encoder blocks, while this number for the BERT_{LARGE} version is 24. Note that, while these two versions also have some other differences in layer sizes, the fundamental architectures are the same.

No. encoders1224
Embedding dim7681024
Attention heads1216
No. parameters110M340M
Comparison of size in 2 versions.


BERT is pretrained on 2 tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

  • For MLM, a random 15% of the text is chosen to be masked. Over these tokens, 80% is replaced with [MASK], 10% is replaced with a random token from the alphabet, and the remaining 10% is kept the same. The objective of the model is to recover the original texts from this masked version.
  • For NSP, BERT takes as input 2 sentences, denoted as A and B, its mission is then to predict if B is actually the next sentence after A in the corpus. This task aims to strengthen BERT’s ability to reason across sentences, which will then be helpful for tasks like Question Answering and Natural Language Inference.

Data, Hardware, Speed

BERT is trained on 16GB of text data, which consists of 3.3B tokens.

BERT_{BASE} was trained with 16 TPU chips while BERT_{LARGE} was trained with 64 TPU chips. Each takes 4 days to complete.

Other information

While BERT is definitely a masterpiece, it also poses a number of limitations. Here, we briefly mention some of the most important ones. All of these will then be addressed in more recent work (that we will discuss in the next sections).

  • There is discrepancy between pretraining and finetuning because of the [MASK] token. The [MASK] token appears many times in pretraining but not in finetuning.
  • During pretraining, BERT predicts each masked token independently of the others, which is an oversimplification, leading to suboptimal performance.
  • BERT learns from only 15% of the input tokens.
  • The Next Sentence Prediction task is too weak compared to Masked Language Modeling.
  • The context length of BERT, which is fixed as 512, is small for some tasks.
  • In the Transformer architecture, the Embedding dimension size must be set equal to the Hidden layer dimension size, which is a design constraint that does not have any semantic reason.
  • The masking of training data is done once and reused for all epochs. This can be improved by dynamic masking at any epoch.
  • The training data size is small, only 16GB of text.


Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Dai et al.

Description and Selling points

Transformer-XL is an autoregressive model (not bi-directional like BERT). It has 2 main advantages over its competitors:

  • Transformer-XL can learn longer context. The authors claim that it can learn dependency that is 450% longer than vanilla Transformer, thanks to the ability to handle the problem of context segmentation.
  • It also enhances the evaluation speed by 1800+ times.

These advantages result in State-of-the-Art performance in long-context datasets (e.g. WikiText-103, enwiki8), even short-context datasets (One Billion Word dataset), and small datasets (the Penn Treebank with 1M training tokens).


Transformer-XL reuses the hidden states obtained in previous segments for the next segment, which, in other words, introduces the notion of recurrence into the attention-based architecture. This additional context is added to the key and value of the attention mechanism to make it aware of further information from the past.

To make this works, the authors also need to make use of Relative Positional Encoding to replace the Absolute Positional Encoding normally used in Transformer models.

Traditional model
Above: illustration of a vanilla model. Below: illustration of the Transformer-XL model.
With the Transformer-XL model, more context is included.

Training, Data, Hardware, Speed

Transformer-XL was not designed as a to-be-reused pretrained language model. It trains on different datasets for different tasks. Thus, the information about hardware and speed is not relevant here, as it cannot be compared against other pretrained models.

Other information

The ideas from Transformer-XL are then be adapted by XLNet, the model we are going to introduce next.


XLNet: Generalized Autoregressive Pretraining for Language Understanding, Yang et al.

Description and Selling points

XLNet is a very interesting model that builds upon both BERT and Transformer-XL (i.e. bi-directional and autoregressive, respectively). It handles some of the weaknesses of BERT, including:

  • The assumption of independence between masked tokens. With XLNet, the prediction of one masked position can depend on other masked positions (similar to the autoregressive style).
  • The discrepancy of having [MASK] tokens during pretraining but not during finetuning. XLNet does not replace tokens with [MASK].
  • The short context limitation. XLNet employs the recurrence mechanism and Relative Positional Encoding from Transformer-XL to improve performance on long context.

Using the same model size and pretraining data, XLNet outperforms BERT in 20 tasks, achieving the SOTA at the time.


XLNet uses a similar architecture as BERT and integrates the improvement from Transformer-XL (the recurrence mechanism and Relative Positional Encoding).


XLNet was pretrained with the autoregressive style. However, in contrast to traditional autoregressive models which always lose the context of 1 side of the predicted position (e.g. the GPT model only uses the preceding tokens to predict the current token, while information from the successive tokens is ignored), XLNet still manages to learn from both sides.

The task is termed Permutation Language Modeling. That is, the (virtual) order of the tokens is randomly permuted, and each token is then predicted using the preceding tokens in that virtual order. We also call this order the Factorization order. Take a look at the below figure for illustration.

xlnet permutation language modeling
An illustration of predicting x_3 given different factorization orders. In any situation, x_3 is only predicted using the tokens that come before it in the factorization order.

Replacing Masked Language Modeling with Permutation Language Modeling is the most important contribution from XLNet. Additionally, XLNet introduces learnable relative segment encoding on attention heads.

XLNet was not pretrained on the Next Sentence Prediction task, as it did not give a consistent improvement in performance.

Data, Hardware, Speed

The training data consists of 158 GB of text with 33B tokens.

XLNET_{LARGE} is trained with 512 TPU v3 chips in 5.5 days.

Other information

The contribution of XLNet to improve over BERT is distinctive and is orthogonal to most other ideas (for example, from RoBERTa) and thus, can be employed together with them, which can result in even better performance.


RoBERTa: A Robustly Optimized BERT Pretraining Approach, Liu et al.

Description and Selling points

RoBERTa is one of the most (if not the most) renowned successors of BERT. It does nothing but simply optimize some hyper-parameters for BERT. These simple changes sharply enhance the model performance in all tasks as compared to BERT. Being published roughly the same time as XLNet, their performance closely matches each other, despite that the adjustments made by XLNet are much harder to implement.


RoBERTa uses the same architecture as BERT.


RoBERTa is pretrained with the MLM task (and without the NSP task).

The hyper-parameter changes made by RoBERTa are:

  • Longer training time.
  • Larger training data (x10, from 16G to 160GB).
  • Larger batch size (from 256 to 8k).
  • The removal of the NSP task.
  • Bigger vocabulary size (from 30k to 50k).
  • Longer sequences are used as input (but still keep the limitation of 512 tokens).
  • Dynamic masking.

Data, Hardware, Speed

RoBERTa is trained on 160GB of text data using 1024 32GB V100 GPUs in 1 day.

Other information

As noted above, the improvement in RoBERTa can easily be adapted since they are mostly just changing some values in the pretraining.


DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, Sanh et al.

Description and Selling points

While other models aim to optimize BERT’s performance, DistilBERT has a different goal. Its target is to reduce the large size and enhance the speed of BERT (BERT_{BASE} and BERT_{LARGE} have 110M and 340M parameters, respectively) while still keeping as much strength as possible.

DistilBERT reduces the size of BERT_{BASE} by 40%, enhances the speed by 60% while retaining 97% of its capabilities.

Comparison of size among pretrained LMs
Comparison of size among pretrained language models.


DistilBERT uses a similar general architecture as BERT, but with fewer encoder blocks (6 blocks, compared to 12 blocks of BERT_{BASE} and 24 blocks of BERT_{LARGE}).

Also, the token-type embeddings and the pooler are removed.

The 6 encoder blocks of DistilBERT are initialized using the pretrained 12 encoder blocks of BERT (by taking 1 out of 2).


DistilBERT is pretrained with the Masked Language Modeling tasks. Its objective is to optimize a triple loss:

  • The language model loss, which is the same loss used in BERT.
  • The distillation loss, which measures how similar the output of DistilBERT and of BERT.
  • The cosine-distance loss, which measures how similar the hidden representation of DistilBERT and of BERT.

The distillation loss and cosine-distance loss help train DistilBERT in a student-teacher way, in which the student is taught to imitate the teacher.

DistilBERT also utilizes good practices from RoBERTa: a larger batch size, dynamic masking, and no NSP.

Data, Hardware, Speed

The data in use is the same as BERT (16GB of text).

DistilBERT is trained with 8 16GB V100 GPUs for 90 hours.

Other information

DistilBERT, together with BERT and RoBERTa are 3 of the most popularly used models at the time of writing this blog post.


ALBERT: a lite BERT for self-supervised learning of language representations, Lan et al.

Description and Selling points

ALBERT is published at roughly the same time as DistilBERT. While ALBERT also reduces the model size of BERT, it does not have to trade-off the performance. Compared to DistilBERT, which uses BERT as the teacher for its distillation process, ALBERT is trained from scratch (just like BERT).

An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained 1.7x faster. With similar sizes, ALBERT outperforms all previous models, including BERT, RoBERTa, and XLNet.


The contributions of ALBERT mostly come from its architectural changes. More specifically, ALBERT implements 2 parameter-reduction techniques:

  • Factorized embedding parameterization: decompose the embedding matrix into 2 small matrices to separate the size of hidden layers from embedding dimension size, which enables us to increase the hidden layer size without modifying embedding dimension.
    Reasoning: call E the embedding dimensions and H the hidden dimensions, it does not make sense to always have E = H (as BERT has).
    ALBERT’s solution: call V the vocabulary size, we decompose the embedding matrix from size V \times H to V \times E + E \times H. (In particular, the embedding dimensions is reduced, and a fully-connected layer with size E \times H is added to after the embedding phase.)
    Specifically, ALBERT uses the E = 128 (while BERT uses E = 768 for BERT_{BASE} and E = 1024 for BERT_{LARGE}) and H = 4096 (compared to 768 in BERT_{BASE} and 1024 in BERT_{LARGE}).
  • Cross-layer parameter sharing: encoder blocks share all parameters. This does not only reduce the parameter size but also add more regularization to the model.

Also, ALBERT removes dropouts from the architecture.


ALBERT also uses the MLM task with a small modification (the masked inputs are generated using n-gram masking, with the maximum n-gram span of 3).

One more important contribution of ALBERT is the introduction of the Sentence Order Prediction (SOP) to replace NSP, which was deemed too easy. SOP is more about the coherence prediction of sentences, which fits the original goal better. In contrast, NSP is more about topic prediction, which has a high correlation with MLM (and thus, doesn’t add much value to the pretraining).

Data, Hardware, Speed

ALBERT is pretrained on the union of datasets from RoBERTa and XLNet with 64 to 512 TPU V3.

Other information

While ALBERT gives a better performance with the same memory, it is not really faster than other models.

Another interesting point from ALBERT’s paper: sharing attention layers doesn’t really degrade performance while sharing fully-connected layers does.


BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, Lewis et al.

Description and Selling points

BART is the next interesting pretrained model. While its performance on Natural Language Understanding tasks is comparable to that of RoBERTa, BART additionally supports tasks involving Natural Language Generation (e.g. the abstract summarization task, which requires generating new text to summarize a long input text, in contrast to the traditional summarization task that some segments of the input text are extracted to make the summary). BART’s strength on these tasks is outstanding.


BART doesn’t follow the convention of using a similar structure to BERT which mainly consists of a stack of encoder blocks. Instead, BART uses the standard Transformer architecture (i.e. both encoder and decoder blocks are involved), with the only change being to replace ReLU with GeLU.


The pretraining of BART comprises 2 steps:

  • the input text is corrupted by arbitrary noise (which, as a result, may change the text length).
  • the seq2seq model learns to reconstruct the original text from the corrupted text.

The noise function can be any (or any combination) of: Token Masking, Sentence Permutation, Document Rotation, Token Deletion, and Text Infilling. The best combination of noise types is dependent on the specific downstream task. However, in general, Sentence Permutation and Document Rotation alone perform poorly, while Text Infilling seems good in most cases.

Noise transformations of BART
Different noise transformations in BART.

As some resources argue that dropout can hinder the performance of pretrained models, BART disables dropout for the last 10% of pretraining steps.

BART is trained on the same scale as RoBERTa (e.g. the same batch size of 8k).

Data, Hardware, Speed

The data is also the same as RoBERTa: 160GB of text.

Other information

While the advantages BART brings are inarguable, it is not really employed much in practice due to the more complex changes (including the selection of noise functions and adjustments for fine-tuning) in downstream tasks.


MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, Sun et al.

Description and Selling points

MobileBERT marks a milestone in the mission of enabling performant pretrained language models to work on small mobile devices. Compared to BERT_{BASE}, MobileBERT is 4.3 times smaller, 5.5 times faster but still has comparable performance.


MobileBERT can be considered a thinner version of BERT_{LARGE} with some small adjustments. The authors argue that a narrow but deep neural network would possess much more strength than a wide but swallow one, given the same number of parameters. Thus, while maintaining the same number of encoder blocks as BERT_{LARGE} (i.e. 24), MobileBERT introduces modifications to reduce its size and shorten its runtime latency:

  • A bottleneck structure, with which the inside-block hidden size is much smaller than the intra-block feature map size.
  • There is a convolution layer op top of the embeddings, which compenstates for the smaller embedding dimension size.
  • The so-called LayerNorm is replaced with a light-weight NoNorm operation.
  • GeLUs are replaced by simple ReLUs.
  • Stacks of multiple smaller feed-forward layers are to replace canonical big feed-forward layers.

On the other hand, however, unlike DistilBERT that can distill knowledge from the existing BERT model, pretraining MobileBERT requires pretraining a modified version of BERT_{LARGE} first, before using it as the teacher for knowledge transfer.

The architectures of the modified BERT_{LARGE} and of MobileBERT as compared to the original BERT_{LARGE} are illustrated in the figure below. Note that we call this special version of BERT_{LARGE} as IB-BERT (i.e. Inverted-Bottleneck BERT).

mobilebert architecture
(a) The original BERT; (b) The IB-BERT; and (c) The MobileBERT. MobileBERT is trained layer-by-layer to imitate IB-BERT. The size of each layer in MobileBERT is carefully picked to optimize its performance while still limiting its size and latency.


At first, the IB-BERT is pretrained to convergence. Then, it is used as the teacher to transfer knowledge to the student (i.e. MobileBERT). While the embedding and the classification layers are copied directly from IB-BERT to MobileBERT, other parameters are transferred with the imitation process.

Specifically, the attention and the feature map are objectives for transfer, as shown in the figure above. The process is executed sequentially for each encoder block, one by one. Only after that, the final distillation loss of the output is then learned from the teacher to the student.

The training tasks are MLM and NSP, similar to BERT.

Data, Hardware, Speed

The pretraining process uses the same data as BERT (16GB).

IB-BERT is first pretrained on 256 TPUv3 chips.

Other information

While MobileBERT achieves incredible inference performance with a small size and short latency, its practical usage is somehow hindered by the need of pretraining a big IB-BERT_{LARGE} model.


ELECTRA: pre-training text encoders as discriminators rather than generators, Clark et al.

Description and Selling points

ELECTRA introduces a new pre-training objective that makes the pre-training process a lot more effective and efficient. The model now learns from all input tokens (in contrast to former models like BERT in which only 15% of the tokens have effects in model parameter learning). It matches the performance of RoBERTa and XLNet with just less than 25% computation time, outperforms them (and even ALBERT) on most GLUE and SQuAD tasks if using the same amount of computing.


ELECTRA builds 2 models at once. The main model has an architecture that is very similar to BERT but with some small adjustments taken from XLNet. Additionally, there is a small auxiliary model to help with the learning of the main one. These 2 models share the embedding weights with each other.


ELECTRA is pretrained on the Replaced Token Detection (RTD) task.

The small helper model is first trained with the MLM task. Then, it is used to fake some tokens in the input, whose output will then be fed to the main model. We also call this helper model the Generator, as it generates the input for the main model.

The main model’s objective is to discriminate tokens, answer whether each token is original or is made up by the Generator (and hence, this model is termed the Discriminator).

The process is illustrated in the below figure.

ELECTRA pretraining process
An illustration of ELECTRA’s pretraining process. Initially, the input is randomly masked before feeding into the Generator. This Generator tries to predict the masked token, which results in a “fake” text that goes into the Discriminator. The Discriminator then discriminates, for each token, whether it is original or was replaced by the Generator.

After pretraining, the Generator is thrown out. Only the Discriminator is then finetuned for downstream tasks.

Data, Hardware, Speed

ELECTRA is trained on the same dataset as XLNet (158GB of text with 33B tokens).

ELECTRA_{SMALL} gives comparable performance to BERT_{SMALL} after only 1/8 computation time (0.5 days compared to 4 days on 1 V100 GPU).

Other information

While ELECTRA (more specifically, ELECTRA_{SMALL}) needs much less time to obtain comparable performance to BERT (or BERT_{SMALL}), it only converges after roughly the same computation as BERT. However, after convergence, ELECTRA shows much better strength: ELECTRA_{SMALL} scores 79.9 on the GLUE benchmark, which is a 4.8 points improvement over BERT_{SMALL} (75.1).


ConvBERT: Improving BERT with Span-based Dynamic Convolution, Jiang et al.

Description and Selling points

ConvBERT can be seen as an improvement over ELECTRA. It retains the contributions from ELECTRA (i.e. the Replaced Token Detection task) and introduces an orthogonal idea to optimize the attention mechanism, which in the end results in reduced parameter size and even a much faster pretraining process. To be more specific, ConvBERT can outperform at-the-the-time-SOTA ELECTRA (by 0.7 points on the GLUE dataset) with less than 1/4 training cost.

The main insight is, while all the attention heads are global, many of them spend most of their capabilities to attend to local tokens. This fact is quite intuitive since usually, a word should share its context and meaning with nearer words around rather than words from long-distance away. Thus, it makes sense to replace some of the (complex) attention heads with something simpler that still has the ability to attend to the small local region.


ConvBERT replaces around half of the attention heads by local convolutions. Note that this marks one of the rare attempts to modify the structure of attention-based models (other models like BERT, RoBERTa, BART just reuse the vanilla Transformer).

To achieve the most, the sheer idea of using a mixed attention convolution mechanism is inadequate. Other main contributions of the paper are listed below:

  • The traditional convolution layer would not be good enough. Instead, the authors introduce the novel span-based dynamic convolution operation, which depends on the input span of tokens to generate the corresponding kernel.
  • The grouped linear operator for the feed-forware layer is also beneficial. Normally, the feed-forward module allows each input value to directly affect every value of the output. However, this scheme can be redundant as many connections from the input to the output can be effectively pruned without much detrimental effect on performance. The author propose to replace this “fully” connected module with a grouped connected module, in which we split the input and output vectors into G groups, and each group of the input can only affect one group of the output values.
    If we call D the number of dimensions in the input (as well as the output), then normally the number of parameters for a feed-forward module is: O(D \times D). With grouped linear operator, this number is reduced to O(G \times (\frac{D}{G} \times \frac{M}{G})), where G is the number of groups, and M is the number of dimensions in an intermediate latent space.
  • The bottleneck structure that uses a fully-connected layer to project the embedding input tokens into a lower-dimensional space (normally of half the size) before feeding to the attention modules.
The mixed attention block
The mixed attention block of ConvBERT. The pink square region denotes the span-based convolution module. Here, LConv denotes the light-weight convolution.

Training, Data, Hardware, Speed

ConvBERT is pretrained on the Replaced Token Detection task (taken from ELECTRA), using 32GB of text data.

Based on 8 GLUE tasks, ConvBERT can give comparable or better overall performance than ELECTRA with just roughly 1/4 to 1/3 training time. Compared to BERT, ConvBERT outperforms by 4 averaged GLUE points (80.9 vs 84.9) after 1/3 training time, and by 5.5 points (80.9 vs 86.4) after 120% training time.

Other information

The small version of ConvBERT tops other training-from-scratch models while being a bit inferior to distillation models like TinyBERT and MobileBERT.


Decoding-enhanced BERT with disentangled attention, He et al.

Description and Selling points

DeBERTa focuses intensively on positional encoding. Specifically, it makes improvements by introducing 2 novel techniques:

  • A disentangled attention mechanism, in which word content and position are separated (in contrast to be summed up as in BERT).
  • An enhanced mask decoder, in which both relative and absolute position of words are taken into account (notice that in previous work, either absolute position, e.g. BERT, or relative position, e.g. XLNet, is used).

Given the similar model size, DeBERTa outperforms former models (e.g. RoBERTa, XLNet, ALBERT, ELECTRA) in many tasks with only half of the training data, achieving SOTA performance.

Notably, the very large DeBERTa version with 1.5 parameters outperforms T5, another renowned model, which has as many as 11B parameters.


The main structure of DeBERTa is taken from BERT with some modifications.

As stated above, the architecture requires a bit of change to accomplish the 2 new techniques: the disentangled attention mechanism and enhanced mask decoder. DeBERTa incorporates absolute word position embeddings right after all transformer blocks but before the softmax layer (as opposed to the traditional convention of summing up the embedding and the absolute positional encoding).

Comparison of BERT decoding layer and Enhanced Mask Decoder
The comparison of 2 decoding layers. Here, in general, I denotes any necessary information for decoding (in this paper, I is the absolute positional encoding). n is the number of decoder blocks that share the same weights. If I = H and n = 1 then the Enhanced Mask Decoder is exactly the same as the canonical BERT decoding layer.

There is also the separation of dimension size for the embedding and hidden layers, as adapted from ELECTRA.


The Masked Language Modeling objected from BERT is used for pretraining DeBERTa.

In addition, the authors propose a new virtual adversarial training for fine-tuning pretrained language models, which first normalizes word embedding and then lightly perturbs the values.

Data, Hardware, Speed

DeBERTa is trained with 78GB of text data.

For the Large version, it takes 96 V100 GPUs and 20 days for training. For the Base version, 64 V100 and 10 days are needed.

Other information

The authors, afterward, try the RTD task from ELECTRA to replace the MLM. The result is very encouraging, as it performs better by a considerable margin.


Big Bird: Transformers for Longer Sequences, Zaheer et al.

Description and Selling points

Since the canonical self-attention mechanism requires quadratic memory, it cannot handle long context with limited hardware. To address this issue, the BigBird model introduces a sparse attention mechanism that only demands linear dependency. This novel technique enables BigBird to handle longer input with the same hardware (as much as 8x) and as a result, produce better performance, especially on long-text datasets.


BigBird reuses the architecture from BERT. Furthermore, it was initialized with the pretrained RoBERTa.

The only difference is the use of the aforementioned sparse attention to replacing the canonical attention mechanism. The sparse attention consists of 3 parts:

  • a set of g global tokens attending to all tokens.
  • all tokens attend to w local neighboring tokens.
  • all tokens attend to a set of r random tokens.
BigBird's sparse attention
Illustrations of BigBird’s attention mechanism. Here, r = 2, w = 3, g = 2.


The pretraining is done with the traditional MLM task.

Data, Hardware, Speed

BigBird was trained on 123GB of text data (19.2B tokens) with 8 x 8 TPU v3.

Other information

At the end of 2021, BigBird is still the SOTA for datasets with long input texts.

When to use which?

There is no best model for all cases. It depends on our current situation to select the most suitable one. Some important criteria are:

  • Available hardware.
  • Do we want to train from scratch or we can just finetune existing pretrained weights?
  • What is/are the downstream task(s)? Are they Natural Language Understanding or Natural Language Generation?
  • Does our targeted dataset contain long texts?
  • What is the priority over pretraining, finetuning, and inference speed?
  • Whether the model implementation is publicly avaliable?


If the hardware we have is virtually unlimited, then we could strive for the most performant models with the largest size. A very large version of the DeBERTa can be a strong candidate for this case.

However, usually, the resources at hand are limited. Then, the Base-version of ELECTRA, DeBERTa, ConvBERT can also work well.

If we have very limited resources, then consider DistilBERT, ALBERT, TinyBERT, and MobileBERT.

Train from scratch or just fine-tune existing weights?

This is inherently related to the availability of hardware, as well as whether there are lots of public references on the model. Pretraining usually requires a lot more resources (e.g. GPUs/TPUs, time) than simply fine-tuning. Furthermore, that would not be very easy to train the model if it involves some complex operations and there are fewer references we can find to help us in implementation.

So, if the goal is to train from scratch then we should look for a model with suitable model size and is either easy-to-implement or publicly available. BERT and RoBERTa are the most popular models and have plenty of implementations on the internet. Most other models are also available on HuggingFace, however, not all of them are supported with both Pytorch and Tensorflow. So, also mind your favorite deep learning framework.

On the other hand, “just fine-tuning” requires we have access to a pretrained model (compatible with our deep learning framework) and it was trained on the same natural language. It wouldn’t make much sense if we fine-tune a model that was pretrained with English corpus on our Korean dataset. A warning is that apart from English, other languages have much fewer publicly-available pretrained models.

What is the downstream task?

Most of the above models only support (or are good options when the downstream tasks are) Natural Language Understanding problems. BART can be an exception. However, if our task is Language Generation, it is recommended to take a look at the GPT family (e.g. GPT-2).

Long text?

There are several models that are specifically designed to handle long input texts. They are XLNet, Longformer, and BigBird. BigBird is the most competent one among these.

Otherwise, if the texts are medium or short, then any other can work well.

Priority over pretraining, finetuning, and inference speed

Some models are optimized for faster pretraining while some others have much faster fine-tuning speed. ConvBERT is the best if pretraining time tops our list of priorities, ELECTRA can be considered the second choice.

Regarding fine-tuning time, normally, the smaller the model the faster it is. However, this rule of thumb is not true for models that share a lot of weights (e.g. ALBERT and its derivatives).

Public support

More recent models usually have less public support (although this can be less true for really outstanding models that gain a huge amount of attention from the community). Check it out online to see if there is any support for a specific model (notably on HuggingFace).


We summarize all the ideas in the below table.

Summary of pretrained language models
A summary of pretrained language models. (Click to see a larger image)

Some other pretrained language models we have yet to include here are ERNIE, SpanBERT, UniLM, TinyBERT, Longformer, DynaBERT, T5, Megatron-LM, Reformer, and Performer.

One thought on “A review of pre-trained language models: from BERT, RoBERTa, to ELECTRA, DeBERTa, BigBird, and more

Leave a Reply