
    Text Understanding Fundamentals

    3 min read · Beginner

    In our previous lesson, we explored what makes a system multimodal. Now, let's dive deep into one of the most crucial modalities: text. Understanding how machines process text is fundamental to building effective multimodal systems.

    The Human Advantage

    Before we dive into machine text processing, let's consider how effortlessly you're reading this article. Your brain is automatically:

    • Processing individual words
    • Understanding context and relationships
    • Grasping complex meanings
    • Making connections to prior knowledge
    (Diagram: words → context → meaning)

    For machines, we need to break this process down into discrete steps. Let's explore each one.

    Step 1: Tokenization

    Tokenization is the process of breaking text into meaningful pieces (tokens). Think of it as teaching a computer where one word ends and another begins.

    Here's a simple example:

    # Basic word tokenization
    text = "The quick brown fox jumps over the lazy dog"
    tokens = text.split()
    print(tokens)  # ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
    
    # But real tokenization is more complex
    contractions = "Don't forget it's not that simple!"
    simple_tokens = contractions.split()  # Not ideal
    print(simple_tokens)  # ["Don't", 'forget', "it's", 'not', 'that', 'simple!']
    

    Tokenization Challenges

    Tokenization isn't always straightforward. Consider these challenges:

    • Contractions: don't → do + n't
    • Compounds: dataset → data + set?
    • Special cases: user@email.com
    • Languages without spaces: 我爱编程 (Chinese for "I love programming")
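
    For word-level English text, a rule-based tokenizer already handles some of these cases better than a bare split(). Here's a minimal sketch using NLTK's word_tokenize (the tokenizer data must be downloaded once first):

    import nltk
    from nltk.tokenize import word_tokenize
    
    nltk.download('punkt')  # one-time download of the tokenizer models
    
    contractions = "Don't forget it's not that simple!"
    print(word_tokenize(contractions))
    # ['Do', "n't", 'forget', 'it', "'s", 'not', 'that', 'simple', '!']

    Even so, rule-based splitting still struggles with compounds, unseen words, and languages without explicit word boundaries.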

    Modern tokenization often uses subword tokenization to handle these cases better:

    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    text = "Understanding tokenization"
    tokens = tokenizer.tokenize(text)
    print(tokens)  # Rare words are split into '##'-prefixed subword pieces; the exact split depends on the model's vocabulary
    

    Step 2: Preprocessing

    Before we can analyze text, we need to clean and standardize it. Common preprocessing steps include:

    import re
    import nltk
    from nltk.corpus import stopwords
    
    nltk.download('stopwords')  # one-time download of the stop word list
    
    def preprocess_text(text):
        # Convert to lowercase
        text = text.lower()
        
        # Remove special characters and digits
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        
        # Remove extra whitespace
        text = ' '.join(text.split())
        
        # Remove stop words
        stop_words = set(stopwords.words('english'))
        words = [w for w in text.split() if w not in stop_words]
        
        return ' '.join(words)
    
    # Example
    text = "The QUICK brown fox (aged 3) jumped over the lazy dogs!!!"
    clean_text = preprocess_text(text)
    print(clean_text)  # "quick brown fox aged jumped lazy dogs"
    

    Step 3: Word Embeddings

    Now comes the fascinating part: converting words into numbers while preserving their meaning. This is done through word embeddings.

    (Diagram: 'king', 'queen', 'man', and 'woman' plotted as points in a shared vector space)

    In this vector space:

    • Similar words are closer together
    • Relationships between words are preserved
    • Words can be manipulated mathematically

    Here's how to work with word embeddings:

    import gensim.downloader
    from gensim.models import Word2Vec
    
    # Train word embeddings on a tiny toy corpus
    sentences = [['I', 'love', 'machine', 'learning'],
                 ['I', 'love', 'deep', 'learning']]
    model = Word2Vec(sentences, min_count=1)
    
    # Find similar words (results are rough, since the toy corpus is so small)
    similar_words = model.wv.most_similar('learning')
    
    # Vector arithmetic only works well with embeddings trained on a large corpus,
    # so load pretrained vectors for the classic king/queen example
    wv = gensim.downloader.load('glove-wiki-gigaword-50')
    result = wv['king'] - wv['man'] + wv['woman']
    # The result vector lands close to 'queen':
    print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
    

    Step 4: Measuring Similarity

    Once we have word vectors, we can measure how similar words or texts are:

    import numpy as np
    
    def cosine_similarity(vec1, vec2):
        dot_product = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)
        return dot_product / (norm1 * norm2)
    
    # Example, using the pretrained vectors loaded in Step 3
    king = wv['king']
    queen = wv['queen']
    similarity = cosine_similarity(king, queen)
    print(f"Similarity between 'king' and 'queen': {similarity}")
    

    Real-World Applications

    These fundamentals power many applications:

    1. Search Systems (see the sketch after this list)
      • Query understanding
      • Document matching
      • Semantic search
    2. Content Analysis
      • Topic classification
      • Sentiment analysis
      • Content recommendation
    3. Cross-Modal Applications
      • Image captioning
      • Video search
      • Multimodal chatbots
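
    To make the search use case concrete, here is a minimal sketch of semantic document matching. It represents the query and each document as the average of their word vectors and ranks documents by cosine similarity to the query, reusing the pretrained wv vectors and the cosine_similarity function from the earlier steps; the documents and query below are made-up examples.

    import numpy as np
    
    def embed(text, wv):
        # Average the vectors of the words we have embeddings for
        words = [w for w in text.lower().split() if w in wv]
        return np.mean([wv[w] for w in words], axis=0)
    
    documents = [
        "the king addressed the royal court",
        "the puppy chased a ball in the park",
        "the queen spoke to the palace guards",
    ]
    
    query = "royal monarch speech"
    query_vec = embed(query, wv)
    
    # Rank documents by cosine similarity to the query
    ranked = sorted(documents,
                    key=lambda d: cosine_similarity(query_vec, embed(d, wv)),
                    reverse=True)
    print(ranked[0])  # A royalty-related document should rank first

    The same idea, scaled up with sentence- or document-level embeddings, underpins semantic search in production systems.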

    Practice Exercise

    Try this hands-on exercise:

    1. Take any paragraph of text (news article, blog post, etc.)
    2. Apply these steps:

    # Step 1: Tokenization (e.g. split(), NLTK, or a subword tokenizer)
    tokens = tokenize(text)
    
    # Step 2: Preprocessing (lowercasing, cleaning, stop word removal)
    clean_tokens = preprocess(tokens)
    
    # Step 3: Convert to vectors (e.g. Word2Vec or pretrained embeddings)
    vectors = get_embeddings(clean_tokens)
    
    # Step 4: Analyze relationships (e.g. cosine similarity)
    find_similar_words(vectors)
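
    The functions above are placeholders for you to implement. Here is one possible (toy) way to fill them in, reusing the tools from this lesson; training Word2Vec on a single paragraph only gives rough neighbours, but it exercises the full pipeline:

    from transformers import AutoTokenizer
    from gensim.models import Word2Vec
    
    def tokenize(text):
        # Step 1: subword tokenization with a pretrained tokenizer
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        return tokenizer.tokenize(text)
    
    def preprocess(tokens):
        # Step 2: keep alphabetic tokens only and lowercase them
        return [t.lower() for t in tokens if t.isalpha()]
    
    def get_embeddings(clean_tokens):
        # Step 3: train a small Word2Vec model on the paragraph itself
        return Word2Vec([clean_tokens], min_count=1, vector_size=50)
    
    def find_similar_words(model):
        # Step 4: print the nearest neighbours of a few words
        for word in model.wv.index_to_key[:5]:
            print(word, model.wv.most_similar(word, topn=3))
    
    text = "Paste your paragraph here ..."
    find_similar_words(get_embeddings(preprocess(tokenize(text))))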
    

    Next Steps

    In our next lesson, we'll explore how to combine text understanding with other modalities like images and video. We'll see how these text processing fundamentals serve as building blocks for more complex multimodal systems.

