A pretrained transformer-based language model that learns bidirectional text representations through masked language modeling. BERT and its variants are widely used as text encoders in multimodal retrieval and understanding systems.
BERT pretrains a transformer encoder on large text corpora using two objectives: Masked Language Modeling (MLM), which masks a random subset of input tokens (roughly 15%) and predicts them from the surrounding bidirectional context, and Next Sentence Prediction (NSP), which predicts whether the second of two sentences actually follows the first in the original text. The resulting model produces contextualized embeddings in which each token's representation depends on the entire input sequence.
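A minimal sketch of the MLM objective at inference time, assuming the Hugging Face transformers library and the public `bert-base-uncased` checkpoint (neither is specified by the original description): the model fills in a masked token using context from both sides.

```python
# Sketch: masked-token prediction with a pretrained BERT checkpoint.
# Assumes the Hugging Face "transformers" library and "bert-base-uncased".
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: [batch, seq_len, vocab_size]

# Locate the [MASK] position and take the highest-scoring vocabulary token.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # expected: "paris"
```

Because attention is bidirectional, the prediction for the masked position can draw on tokens both before and after it, unlike a left-to-right language model.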
BERT-base uses 12 transformer layers, 768 hidden dimensions, and 12 attention heads (110M parameters); BERT-large scales up to 24 layers, 1024 hidden dimensions, and 16 attention heads (340M parameters). Input is tokenized with WordPiece using a ~30K vocabulary, and the maximum sequence length is 512 tokens. The [CLS] token embedding is commonly used as the sequence-level representation. Variants include RoBERTa (improved pretraining), ALBERT (parameter sharing), and DeBERTa (disentangled attention).
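A short sketch of the common encoder usage pattern, again assuming Hugging Face transformers and `bert-base-uncased` (illustrative choices, not part of the original description): the [CLS] position of the final hidden states serves as a sequence-level embedding, e.g. for a text encoder in a retrieval system.

```python
# Sketch: extracting the [CLS] embedding as a sequence-level representation.
# Assumes the Hugging Face "transformers" library and "bert-base-uncased".
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer(
    "A photo of a dog playing in the snow.",
    return_tensors="pt",
    truncation=True,
    max_length=512,  # BERT's maximum sequence length
)

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape [batch, seq_len, 768] for BERT-base;
# index 0 along the sequence axis is the [CLS] token.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```

For retrieval-style use, this 768-dimensional vector is typically projected and compared against image or document embeddings with a similarity score such as cosine similarity.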