Self-Supervised Learning (SSL) is the backbone of transformer-based pre-trained language models. This paradigm involves solving pre-training tasks that help the model learn the structure of natural language. This article puts the popular pre-training tasks together so we can assess them at a glance.

Loss function in SSL

The loss function here is simply the weighted sum of the losses of the individual pre-training tasks that the model is trained on.

Taking BERT as an example, the loss is the weighted sum of the MLM (Masked Language Modelling) and NSP (Next Sentence Prediction) losses.
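The weighted-sum idea can be sketched in a few lines. This is an illustrative helper, not BERT's actual training code; the loss values and weights below are made-up numbers for the example.

```python
def combined_ssl_loss(task_losses, task_weights):
    """Weighted sum of individual pre-training task losses.

    task_losses:  dict mapping task name -> scalar loss for the batch
    task_weights: dict mapping task name -> weight for that task
    """
    assert task_losses.keys() == task_weights.keys()
    return sum(task_weights[t] * task_losses[t] for t in task_losses)

# BERT-style example: MLM and NSP weighted equally.
# The per-task loss values here are hypothetical.
loss = combined_ssl_loss(
    task_losses={"mlm": 2.31, "nsp": 0.69},
    task_weights={"mlm": 1.0, "nsp": 1.0},
)
```

In practice most models (BERT included) simply add the task losses with equal weight, but keeping the weights explicit makes it easy to rebalance tasks during pre-training.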

Over the years, many pre-training tasks have been proposed to solve specific problems. We will review 10 of the most interesting and popular ones, along with their corresponding loss functions:

  1. Causal Language Modelling (CLM)
  2. Masked Language Modelling (MLM)
  3. Replaced Token Detection (RTD)
  4. Shuffled Token Detection (STD)
  5. Random Token Substitution (RTS)
  6. Swapped Language Modelling (SLM)
  7. Translation Language Modelling (TLM)
  8. Alternate Language Modelling (ALM)
  9. Sentence Boundary Objective (SBO)
  10. Next Sentence Prediction (NSP)

(The loss functions for each task and much of the content are borrowed from AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing.)

Drawback 1:

The [MASK] token appears during pre-training but not during fine-tuning, which creates a mismatch between the two stages. RTD overcomes this since it doesn't use any masking.

Drawback 2:

In MLM, the training signal comes from only 15% of the tokens, since the loss is computed just over the masked positions. In RTD, the signal comes from all the tokens, since each one is classified as "replaced" or "original".
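The RTD target construction can be sketched as follows; the point is simply that every position gets a binary label, so every token carries training signal. This is an illustrative helper, not ELECTRA's implementation.

```python
def rtd_labels(original, corrupted):
    """Replaced Token Detection targets: label every position as
    'replaced' (1) or 'original' (0). Unlike MLM, no position is
    excluded from the loss."""
    assert len(original) == len(corrupted)
    return [int(o != c) for o, c in zip(original, corrupted)]

# Every token gets a label, so the binary classification loss
# averages over the full sequence rather than a ~15% subset.
labels = rtd_labels(
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
)
```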

(Figure: the ELECTRA architecture)

In ALM, given a parallel sentence pair (x, y), code-switching substitutes some phrases of x with their translations from y, and the sample thus obtained is used to train the model.
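Constructing such a code-switched sample can be sketched as below. `code_switch` is a hypothetical helper, and the alignment here is assumed to be positional; real systems would use a word aligner to map phrases of x to their translations in y.

```python
def code_switch(x_tokens, y_tokens, spans):
    """ALM-style code-switching sketch: replace the given (start, end)
    spans of sentence x with the aligned tokens of its translation y.

    Assumes positional alignment between x and y, which is a
    simplification -- real pipelines compute word alignments first.
    """
    out = list(x_tokens)
    for start, end in spans:
        out[start:end] = y_tokens[start:end]
    return out

# English sentence with one span swapped in from its French translation.
sample = code_switch(
    ["I", "like", "green", "tea"],
    ["J'", "aime", "le", "thé", "vert"],
    spans=[(2, 4)],
)
```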

There are many other interesting tasks summarized in AMMUS. Kudos to the authors, and please give it a read if you found this interesting!


Also published here

Follow me on Medium for more posts on ML/DL/NLP