Most of the information is available in the BERT paper.
Key details:
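- Architecture: multi-layer Transformer encoder with multi-head self-attention. Two model sizes: BASE (12 layers, 12 attention heads, about 110M parameters) and LARGE (24 layers, 16 attention heads, about 340M parameters)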
- In BERT Base, every layer uses the same 768-dimensional hidden representation per token (BERT Large uses 1024 dimensions).
- WordPiece tokenization with a 30,000-token vocabulary. Input embeddings are the sum of token, segment, and learned position embeddings.
- Bidirectional model: each token attends to both its left and right context in every layer
- Pretraining objectives: Masked Language Modeling (15% of tokens are randomly selected for prediction; of those, 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged) and Next Sentence Prediction (predict whether the second sentence actually follows the first or was sampled at random)
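The 80/10/10 corruption rule for Masked Language Modeling can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function name `mask_tokens`, the toy vocabulary, and the choice to sample each position independently with probability 0.15 are assumptions made here for clarity (the original code instead picks a bounded number of positions per sequence). Special tokens such as [CLS] and [SEP] are skipped.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Hypothetical helper: apply BERT-style MLM corruption to a token list.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never corrupt special tokens; select other positions with mask_prob.
        if tok in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


if __name__ == "__main__":
    sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
    toy_vocab = ["dog", "ran", "under", "a", "tree"]  # assumed toy vocabulary
    print(mask_tokens(sentence, toy_vocab, rng=random.Random(0)))
```

Keeping 10% of the selected tokens unchanged and swapping 10% for random tokens means the model cannot rely on [MASK] appearing at every position it has to predict, which reduces the mismatch with fine-tuning, where [MASK] never occurs.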
Variants and Extensions:
- RoBERTa (A Robustly Optimized BERT Pretraining Approach): drops the NSP task, uses dynamic masking and byte-level Byte-Pair Encoding, and trains longer on more data with larger batches
- DistilBERT: a distilled version of BERT with 40% fewer parameters that runs 60% faster while retaining 97% of BERT’s performance
- ALBERT (A Lite BERT): factorized embedding parameterization and cross-layer parameter sharing; replaces NSP with sentence-order prediction (SOP), an inter-sentence coherence task
- ELECTRA: Replaced Token Detection instead of MLM. A small generator replaces some input tokens and a discriminator predicts, for every token, whether it was replaced, which makes pretraining more sample-efficient
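- ModernBERT: updates BERT with more recent Transformer improvements such as rotary position embeddings, GeGLU activations, alternating local and global attention, and a longer context length (8,192 tokens)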
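To make ALBERT's factorized embedding parameterization concrete, the arithmetic below compares a full V x H embedding table with a factorized V x E table followed by an E x H projection. The function name `embedding_params` is a hypothetical helper, and the values are assumptions chosen to match the notes above (V = 30,000 tokens, H = 768) plus an embedding size of E = 128; bias terms and the position and segment embeddings are ignored.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count token-embedding parameters (hypothetical helper, weights only).

    Without embedding_size: a single vocab_size x hidden_size table (BERT-style).
    With embedding_size: a vocab_size x embedding_size table plus an
    embedding_size x hidden_size projection (ALBERT-style factorization).
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30_000, 768, 128  # assumed sizes for illustration

full = embedding_params(V, H)           # 30,000 * 768
factorized = embedding_params(V, H, E)  # 30,000 * 128 + 128 * 768

print(f"full table:       {full:,}")        # 23,040,000
print(f"factorized table: {factorized:,}")  # 3,938,304
```

Under these assumptions the factorization cuts the token-embedding parameters by roughly a factor of six, and the saving grows with the hidden size because only the small E x H projection depends on H.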