BERT - Bidirectional Encoder Representations from Transformers
A pretrained transformer-based language model that learns bidirectional text representations through masked language modeling. BERT and its variants are widely used as text encoders in multimodal retrieval and understanding systems.
How It Works
BERT pretrains a transformer encoder on large text corpora using two objectives: Masked Language Modeling (MLM), which predicts randomly masked tokens from bidirectional context, and Next Sentence Prediction (NSP), which classifies whether the second of two sentences actually follows the first in the original text. The resulting model produces contextualized embeddings in which each token's representation depends on the entire input sequence.
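The MLM objective can be sketched in a few lines. This is a simplified illustration, not the reference implementation: in the BERT paper, 15% of token positions are selected as prediction targets, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged (the `vocab` and `seed` values below are arbitrary).

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style MLM masking: select ~15% of positions as prediction
    targets; of those, 80% become [MASK], 10% a random vocab token,
    and 10% are left unchanged."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)      # None = not a prediction target
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok            # the model must recover the original
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token (still predicted)
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, vocab=["dog", "ran", "tree"], seed=1)
```

Leaving 10% of targets unchanged forces the model to produce useful representations for every input token, not only for [MASK] positions, which never occur at fine-tuning time.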
Technical Details
BERT-base uses 12 transformer layers, 768 hidden dimensions, and 12 attention heads (110M parameters). BERT-large scales up to 24 layers, 1024 hidden dimensions, and 16 attention heads (340M parameters). Input uses WordPiece tokenization with a 30K vocabulary. The [CLS] token embedding is commonly used for sequence-level tasks. Maximum sequence length is 512 tokens. Variants include RoBERTa (improved pretraining), ALBERT (parameter sharing), and DeBERTa (disentangled attention).
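WordPiece segments out-of-vocabulary words into known subword pieces via greedy longest-match-first lookup, with continuation pieces marked by a "##" prefix. A minimal sketch (the toy vocabulary below is invented for illustration):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece: repeatedly take the longest
    vocab entry matching a prefix of the remaining characters;
    non-initial pieces carry a '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]               # word cannot be segmented
        pieces.append(piece)
        start = end
    return pieces

vocab = {"play", "##ing", "##ed", "un"}
print(wordpiece("playing", vocab))     # → ['play', '##ing']
```

This is why BERT has no true out-of-vocabulary problem: any word either segments into subword pieces or falls back to a single [UNK] token.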
Best Practices
Fine-tune BERT on downstream tasks rather than using frozen representations for best results
Use domain-specific BERT variants (BioBERT, SciBERT, FinBERT) for specialized applications
Apply mean pooling over token embeddings for sentence representations rather than just [CLS]
Use RoBERTa or DeBERTa over original BERT for generally improved performance
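The mean-pooling practice above can be sketched with plain lists standing in for model outputs; the key detail is masking out padding positions before averaging. The toy embeddings below are invented for illustration:

```python
def mean_pool(token_embeddings, attention_mask):
    """Masked mean pooling: average token vectors over real (non-padding)
    positions only. Often a stronger sentence representation than the
    [CLS] vector alone."""
    dim = len(token_embeddings[0])
    sums, count = [0.0] * dim, 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:                          # skip padding (mask == 0)
            count += 1
            for j in range(dim):
                sums[j] += vec[j]
    return [s / max(count, 1) for s in sums]

# Toy 3-token sequence (last position is padding), 2-d embeddings:
emb = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
pooled = mean_pool(emb, attention_mask=[1, 1, 0])   # → [2.0, 3.0]
```

Including padding vectors in the average (i.e. ignoring the attention mask) silently degrades similarity quality on batched inputs of mixed lengths.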
Common Pitfalls
Using BERT for text generation, which it was not designed for (use GPT-style models instead)
Exceeding the 512-token limit without implementing a chunking strategy
Not fine-tuning BERT on domain-specific data when the domain differs significantly from pretraining
Using BERT embeddings directly for semantic similarity without sentence-transformer fine-tuning
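A common chunking strategy for the 512-token pitfall is a sliding window with overlap, so no sentence is cut off without context on at least one side. A minimal sketch over token ids (the `stride` value is a typical choice, not prescribed; in practice [CLS] and [SEP] also consume 2 of the 512 positions):

```python
def chunk_ids(token_ids, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows that each fit
    BERT's length limit; the stride overlap preserves context across
    chunk boundaries."""
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - stride            # advance by max_len minus overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break                      # last window reached the end
    return chunks

chunks = chunk_ids(list(range(1000)), max_len=512, stride=128)
# 1000 tokens → windows starting at positions 0, 384, 768
```

Per-chunk predictions are then aggregated (e.g. max or mean over chunk scores) depending on the downstream task.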
Advanced Tips
Use BERT as the text encoder in cross-modal models that align text with visual representations
Implement BERT-based cross-encoders for high-accuracy re-ranking in retrieval pipelines
Apply knowledge distillation from BERT to smaller models (DistilBERT) for production efficiency
Combine BERT embeddings with sparse features (BM25) in hybrid search architectures
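The hybrid-search tip above hinges on score fusion: BM25 scores and embedding similarities live on different scales, so each list is normalized before a weighted sum. A minimal min-max-normalization sketch (one common fusion scheme among several, e.g. reciprocal rank fusion; the scores and `alpha` below are illustrative):

```python
def hybrid_scores(sparse, dense, alpha=0.5):
    """Fuse BM25 (sparse) and BERT-embedding (dense) scores per candidate
    after min-max normalizing each list to [0, 1], so neither scale
    dominates the ranking."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    s, d = norm(sparse), norm(dense)
    return [alpha * a + (1 - alpha) * b for a, b in zip(s, d)]

bm25 = [12.1, 3.4, 7.8]       # raw BM25 scores per candidate document
cosine = [0.82, 0.91, 0.40]   # cosine similarity of BERT embeddings
fused = hybrid_scores(bm25, cosine, alpha=0.5)
best = max(range(len(fused)), key=fused.__getitem__)
```

`alpha` trades off lexical against semantic matching and is typically tuned on a validation set of labeled query-document pairs.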