Most of the information is available in the BERT paper.

Key details:

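  • Transformer encoder with multi-head self-attention. Two model sizes: BASE (12 layers, hidden size 768, 12 attention heads, about 110M parameters) and LARGE (24 layers, hidden size 1024, 16 attention heads, about 340M parameters)
  • The 768-dimensional hidden representation is used across all layers in BERT Base (1024 dimensions in BERT Large).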
  • WordPiece tokenization with a 30,000-token vocabulary. Input embeddings are the sum of token, segment, and learned position embeddings.
  • Bidirectional model: each token attends to both its left and right context
  • Pretraining with two objectives: Masked Language Modeling (randomly select 15% of tokens for prediction; of those, 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged) and Next Sentence Prediction (predict whether the second sentence actually follows the first or is a random sentence)
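
The 80/10/10 masking rule from the last bullet can be sketched in a few lines of Python. This is only an illustrative sketch, not BERT's actual preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the per-token coin flip (rather than picking an exact 15% of positions) are all assumptions made for the example.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}  # never selected for masking

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Hypothetical helper: return (corrupted tokens, labels) for MLM.

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Simplification: each token is selected independently with mask_prob.
        if tok in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_token          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

# Toy usage; the vocabulary and sentence are made up for illustration.
toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(sentence, toy_vocab, seed=0)
print(corrupted)
print(labels)
```

The loss is computed only at positions where the label is set, which is why the unchanged 10% still contribute to training.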

Variants and Extensions

  • RoBERTa (Robustly Optimized BERT Pretraining Approach): no NSP task, dynamic masking, byte-level Byte-Pair Encoding
  • DistilBERT: Distilled version of BERT with 40% fewer parameters, 60% faster while retaining 97% of BERT’s performance
  • ALBERT (A Lite BERT): factorized embedding parameterization and cross-layer parameter sharing, inter-sentence coherence task (sentence-order prediction) instead of NSP
  • ELECTRA: Replaced Token Detection instead of MLM, generator-discriminator architecture for more efficient pre-training
  • ModernBERT: updates BERT with modern improvements to the Transformer architecture
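
To make ALBERT's factorized embedding parameterization concrete, here is a rough parameter count. It is a back-of-the-envelope sketch: the helper `embedding_params` is hypothetical, and the embedding size of 128 is an assumed value chosen for illustration alongside BERT Base's 30,000-token vocabulary and hidden size of 768.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Hypothetical helper: count token-embedding parameters.

    Without embed_size: one V x H matrix (BERT-style).
    With embed_size:    a V x E matrix followed by an E x H projection
                        (factorized, ALBERT-style).
    """
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size

V, H, E = 30_000, 768, 128  # E = 128 is an assumption for this example

full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(full)        # 23040000
print(factorized)  # 3938304
```

Under these assumptions the token embeddings shrink from about 23M to about 4M parameters, because the vocabulary-sized matrix no longer scales with the hidden size.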