The process of converting raw text into a sequence of tokens (subwords, words, or characters) that can be processed by language models. Tokenization is the first step in any text processing pipeline within multimodal AI systems.
Tokenizers break input text into a sequence of tokens using learned or rule-based splitting strategies. Subword tokenizers such as BPE (Byte Pair Encoding) and WordPiece start with characters and iteratively merge symbol pairs to build a vocabulary of subword units: BPE merges the most frequent adjacent pair, while WordPiece merges the pair that most increases the likelihood of the training data. Each token maps to an integer ID used as input to neural network models. Special tokens such as [CLS], [SEP], <pad>, and <eos> mark sequence boundaries, padding, and segment separation. A toy BPE training loop is sketched below.
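The following is a minimal, illustrative sketch of the BPE training idea described above: start from characters, count adjacent pairs, and repeatedly fuse the most frequent pair into a new subword unit. The corpus, function name, and merge count are hypothetical; production tokenizers add byte-level handling, pre-tokenization, and efficiency optimizations not shown here.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols (individual characters to start).
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge: replace every occurrence of the pair with the fused symbol.
        merged_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges

corpus = ["low", "lower", "lowest", "newest", "widest"]
print(bpe_train(corpus, num_merges=5))
# Each learned merge becomes a vocabulary entry; tokens are later mapped to integer IDs.
```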
Modern tokenizers use BPE (GPT family), WordPiece (BERT), or Unigram (T5); SentencePiece is a language-agnostic framework that implements BPE and Unigram directly on raw text without whitespace pre-tokenization. Vocabulary sizes typically range from roughly 32K to 128K tokens. Subword tokenization handles out-of-vocabulary words by decomposing them into known subwords. Tokenizer choice directly affects model behavior: the same text may produce different token counts and representations with different tokenizers.
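The snippet below is a small sketch of how tokenizer choice changes token counts for the same text, using the Hugging Face `transformers` library (assumed to be installed). The model names and example sentence are illustrative, and loading the tokenizers requires downloading their vocabulary files.

```python
from transformers import AutoTokenizer

text = "Tokenization handles out-of-vocabulary words gracefully."

# Compare a WordPiece tokenizer (BERT) with a byte-level BPE tokenizer (GPT-2).
for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(text)
    ids = tok.encode(text)  # may prepend/append special tokens such as [CLS]/[SEP]
    print(f"{name}: {len(pieces)} subword tokens -> {pieces}")
    print(f"{name}: {len(ids)} input IDs (including any special tokens)")
```

Because the two vocabularies and splitting rules differ, the same sentence generally yields different subword sequences and different sequence lengths, which in turn affects context-window usage and downstream model behavior.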