Most of the information is available in the paper: [[BERT. Pre-training of Deep Bidirectional Transformers forLanguage Understanding|BERT paper]]. Key details:

- Multi-head attention, [[Transformer]] encoder. Two model sizes: BASE (12 layers) and LARGE (24 layers)
- A 768-dimensional hidden representation is used across all layers in BERT BASE (1024 in BERT LARGE)
- WordPiece tokenization with a 30,000-token vocabulary; learned token, segment, and position embeddings
- Bidirectional model
- Pretrained with Masked Language Modeling (randomly select 15% of tokens; of these, 80% are replaced with the `[MASK]` token, 10% with a random token, and 10% are left unchanged; see the masking sketch at the end of this note) and Next Sentence Prediction (predict whether the second sentence actually follows the first or was sampled at random)

### Variants and Extensions

- RoBERTa (Robustly Optimized BERT Approach): drops the NSP task, uses dynamic masking and Byte-Pair Encoding, and trains longer on more data with larger batches
- DistilBERT: distilled version of BERT with 40% fewer parameters that runs 60% faster while retaining 97% of BERT's performance
- ALBERT (A Lite BERT): factorized embedding parameterization and cross-layer parameter sharing; sentence-order prediction (an inter-sentence coherence task) instead of NSP
- ELECTRA: Replaced Token Detection instead of MLM, with a generator-discriminator architecture for more efficient pre-training
- ModernBERT: updates BERT with modern improvements to the Transformer architecture

## Links

- [Original BERT Paper](https://arxiv.org/abs/1810.04805)
- [BERT Explained - Visual Guide](https://jalammar.github.io/illustrated-bert/)
- [Google AI Blog on BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
- [BERT GitHub Repository](https://github.com/google-research/bert)
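
Below is a minimal sketch of the MLM masking scheme described above (15% of tokens selected, then split 80/10/10 between `[MASK]`, a random token, and unchanged). It assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, neither of which is prescribed by the note itself, and the helper name `mask_tokens` is illustrative, not from the BERT codebase.

```python
# Sketch of BERT-style MLM masking, assuming the Hugging Face `transformers`
# library and the `bert-base-uncased` WordPiece tokenizer (~30k vocabulary).
import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def mask_tokens(text, mask_prob=0.15, seed=0):
    """Return (input_ids, labels); labels mark which positions the model must predict."""
    rng = random.Random(seed)
    tokens = tokenizer.tokenize(text)                 # WordPiece sub-tokens
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    labels = [-100] * len(input_ids)                  # -100: position ignored by the loss

    for i, token_id in enumerate(input_ids):
        if rng.random() >= mask_prob:                 # ~85% of tokens are not selected
            continue
        labels[i] = token_id                          # original id is the prediction target
        roll = rng.random()
        if roll < 0.8:                                # 80%: replace with [MASK]
            input_ids[i] = tokenizer.mask_token_id
        elif roll < 0.9:                              # 10%: replace with a random token
            # (real implementations usually exclude special tokens from this draw)
            input_ids[i] = rng.randrange(tokenizer.vocab_size)
        # remaining 10%: keep the original token in place

    return input_ids, labels

ids, labels = mask_tokens("BERT uses masked language modeling for pretraining.")
print(tokenizer.convert_ids_to_tokens(ids))
print(labels)
```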