In our previous lesson, we explored what makes a system multimodal. Now, let's dive deep into one of the most crucial modalities: text. Understanding how machines process text is fundamental to building effective multimodal systems.
The Human Advantage
Before we get to machines, consider how effortlessly you're reading this article. Your brain is automatically:
- Processing individual words
- Understanding context and relationships
- Grasping complex meanings
- Making connections to prior knowledge
For machines, we need to break this process down into discrete steps. Let's explore each one.
Step 1: Tokenization
Tokenization is the process of breaking text into meaningful pieces (tokens). Think of it as teaching a computer where one word ends and another begins.
Here's a simple example:
# Basic word tokenization
text = "The quick brown fox jumps over the lazy dog"
tokens = text.split()
print(tokens) # ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
# But real tokenization is more complex
contractions = "Don't forget it's not that simple!"
simple_tokens = contractions.split() # Not ideal
print(simple_tokens) # ["Don't", 'forget', "it's", 'not', 'that', 'simple!']
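A small regex-based tokenizer can do better than a bare split by keeping contractions intact while splitting punctuation into its own token. A minimal sketch (the pattern here is illustrative, not production-grade):
import re
# Match a word (optionally containing an internal apostrophe), or any
# single non-word, non-space character such as punctuation
pattern = r"\w+(?:'\w+)?|[^\w\s]"
tokens = re.findall(pattern, "Don't forget it's not that simple!")
print(tokens)  # ["Don't", 'forget', "it's", 'not', 'that', 'simple', '!']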
Tokenization Challenges
Tokenization isn't always straightforward. Consider these challenges:
- Contractions: should "don't" be one token, or split into "do" and "n't"?
- Punctuation: "simple!" usually needs to become "simple" and "!"
- Rare and compound words: a fixed vocabulary can't contain every word it will ever see
Modern tokenizers handle these cases with subword tokenization, which breaks unfamiliar words into smaller known pieces:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Understanding tokenization"
tokens = tokenizer.tokenize(text)
print(tokens)  # e.g. ['understanding', 'token', '##ization'] (exact splits depend on the vocabulary)
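Subword tokenizers also map each token to an integer ID, which is what the model actually consumes. A quick illustration (the exact IDs depend on the vocabulary; BERT wraps the sequence in special [CLS] and [SEP] markers):
# Tokens are mapped to integer IDs from the vocabulary
ids = tokenizer.encode("Understanding tokenization")
print(ids)  # e.g. [101, ..., 102], where 101/102 are BERT's [CLS]/[SEP]
print(tokenizer.convert_ids_to_tokens(ids))  # round-trip back to tokens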
Step 2: Preprocessing
Before we can analyze text, we need to clean and standardize it. Common preprocessing steps include lowercasing, stripping punctuation and digits, normalizing whitespace, and removing stop words:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop-word list

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse extra whitespace
    text = ' '.join(text.split())
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [w for w in text.split() if w not in stop_words]
    return ' '.join(words)

# Example
text = "The QUICK brown fox (aged 3) jumped over the lazy dogs!!!"
clean_text = preprocess_text(text)
print(clean_text)  # "quick brown fox aged jumped lazy dogs"
Step 3: Word Embeddings
Now comes the fascinating part: converting words into numbers while preserving their meaning. Word embeddings do this by representing each word as a dense vector, a point in a continuous high-dimensional space.
In this vector space:
- Similar words are closer together
- Relationships between words are preserved
- Words can be manipulated mathematically
Here's how to work with word embeddings:
from gensim.models import Word2Vec

# Train word embeddings on a (tiny) toy corpus
sentences = [['I', 'love', 'machine', 'learning'],
             ['I', 'love', 'deep', 'learning']]
model = Word2Vec(sentences, min_count=1)

# Find similar words (only words seen in training are in the vocabulary)
similar_words = model.wv.most_similar('learning')

# Vector arithmetic: with embeddings trained on a large corpus,
# king - man + woman lands close to 'queen'. The toy model above
# has never seen these words, so this line assumes a bigger model.
result = model.wv['king'] - model.wv['man'] + model.wv['woman']
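To see the analogy actually work, you need vectors trained on a large corpus. Here is a minimal sketch using gensim's downloader (this assumes internet access and uses one of the standard gensim-data models; any sufficiently large pretrained set will do):
import gensim.downloader as api

# Downloads pretrained GloVe vectors on first use (roughly 100 MB)
wv = api.load('glove-wiki-gigaword-100')

# most_similar computes king - man + woman and returns the nearest words
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# Expected output is dominated by 'queen'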
Step 4: Measuring Similarity
Once we have word vectors, we can measure how similar words or texts are:
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

# Example (uses the pretrained wv vectors from above; the toy model
# trained earlier has no 'king' or 'queen' in its vocabulary)
king = wv['king']
queen = wv['queen']
similarity = cosine_similarity(king, queen)
print(f"Similarity between 'king' and 'queen': {similarity:.3f}")
Real-World Applications
These fundamentals power many applications (a minimal semantic-search sketch follows the list below):
- Search Systems
  - Query understanding
  - Document matching
  - Semantic search
- Content Analysis
  - Topic classification
  - Sentiment analysis
  - Content recommendation
- Cross-Modal Applications
  - Image captioning
  - Video search
  - Multimodal chatbots
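To make the search application concrete, here is a toy semantic search built from the pieces above. It reuses the pretrained wv vectors plus the text_vector and cosine_similarity helpers defined earlier; a production system would use purpose-built sentence embeddings instead:
documents = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "a kitten was sleeping on the rug",
]

query_vec = text_vector("sleepy cat", wv)

# Rank documents by cosine similarity to the query vector
ranked = sorted(documents,
                key=lambda d: cosine_similarity(query_vec, text_vector(d, wv)),
                reverse=True)
print(ranked[0])  # likely the kitten sentence: related meaning, different words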
Practice Exercise
Try this hands-on exercise:
- Take any paragraph of text (news article, blog post, etc.)
- Apply these steps:
# Step 1: Tokenization (simple whitespace split)
tokens = text.split()

# Step 2: Preprocessing (reusing the function from Step 2 above)
clean_tokens = preprocess_text(text).split()

# Step 3: Convert to vectors (assumes the pretrained wv vectors from above)
vectors = [wv[t] for t in clean_tokens if t in wv]

# Step 4: Analyze relationships
for t in clean_tokens:
    if t in wv:
        print(t, wv.most_similar(t, topn=3))
Next Steps
In our next lesson, we'll explore how to combine text understanding with other modalities like images and video. We'll see how these text processing fundamentals serve as building blocks for more complex multimodal systems.