Automatic Speech Recognition: Build vs. Buy
Learn how to build a scalable ASR pipeline using Ray and Whisper, with batching, GPU optimization, and real-world tips from production deployments

Building a robust speech recognition pipeline from scratch isn't just complex; it's a strategic decision with real opportunity cost. While implementing ASR (Automatic Speech Recognition) may seem like an interesting technical challenge for ML engineers, the realities of production-grade speech recognition can be daunting. In this post, we'll explore the challenges of building ASR pipelines and how Mixpeek can simplify this feature extraction process.
The Complexity Beneath the Interface
Speech recognition via libraries like HuggingFace may appear straightforward, but at scale, distributing the data and managing and monitoring the distributed systems that process it can be daunting.
Consider the following Ray-based speech processing pipeline:
import pandas as pd
import ray
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

class SpeechRecognitionActor:
    def __init__(self):
        # Device optimization logic: prefer CUDA, then Apple Silicon (MPS), then CPU
        self.device = "cuda:0" if torch.cuda.is_available() else (
            "mps" if torch.backends.mps.is_available() else "cpu"
        )
        # Half precision on GPU/MPS, full precision on CPU
        self.torch_dtype = (
            torch.float16
            if torch.cuda.is_available() or torch.backends.mps.is_available()
            else torch.float32
        )

        # Model initialization with memory optimization
        model_id = "openai/whisper-large-v3-turbo"
        self.model = AutoModelForSpeechSeq2Seq.from_pretrained(
            model_id,
            torch_dtype=self.torch_dtype,
            low_cpu_mem_usage=True,
            use_safetensors=True,
        )
        self.model.to(self.device)

        # The processor supplies the tokenizer and feature extractor used below
        self.processor = AutoProcessor.from_pretrained(model_id)

        # Pipeline configuration for batch processing with 30-second chunking
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model=self.model,
            tokenizer=self.processor.tokenizer,
            feature_extractor=self.processor.feature_extractor,
            chunk_length_s=30,
            torch_dtype=self.torch_dtype,
            device=self.device,
        )
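As written, the class only builds the pipeline; Ray Data's map_batches (used below) invokes the class instance on each batch, so it also needs a __call__ method. A minimal sketch, assuming each batch carries decoded audio arrays in an "audio" column (that column name, and the output "transcript" column, are illustrative, not a fixed schema):

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        # "audio" is an assumed column name; adapt it to how your dataset
        # stores samples (arrays are expected at 16 kHz for Whisper)
        outputs = self.pipe(list(batch["audio"]))
        batch["transcript"] = [out["text"] for out in outputs]
        return batch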
This approach seems simple, and for a few audio or video files it is. However, let's explore some of the ways it can run into roadblocks.
The Hidden Infrastructure Burden
The code above demonstrates careful device selection and memory optimization; however, there are many other factors to consider:
- Dynamic Model Loading: Efficiently swapping between model sizes to balance cost, latency, and accuracy
- Memory Management: Preventing GPU out-of-memory (OOM) issues caused by oversized batches
- Graceful Degradation: Handling OOM conditions without crashing the pipeline, especially with varying input media sizes (see the sketch after this list)
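One common mitigation is to catch OOM errors and retry with a smaller batch. A minimal sketch, assuming pipe is the HuggingFace ASR pipeline built above and audio_batch is a list of audio inputs (the function name and halving policy are illustrative):

import torch

def transcribe_with_backoff(pipe, audio_batch, batch_size=16):
    # Halve the batch size after each GPU OOM instead of crashing the worker
    while batch_size >= 1:
        try:
            return pipe(audio_batch, batch_size=batch_size)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached allocations before retrying
            batch_size //= 2
    raise RuntimeError("Input does not fit in GPU memory even at batch_size=1")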
Distributed Processing Complexity
Ray's distributed computing capabilities are helpful in this example:
results = ds.map_batches(
    SpeechRecognitionActor,   # callable class, instantiated once per worker
    concurrency=1,            # number of concurrent actors
    batch_format="pandas",    # hand each actor a pandas DataFrame
    batch_size=64,            # audio samples per batch
    num_cpus=7,               # CPU reservation per actor
    zero_copy_batch=True,     # pass read-only batches without copying when possible
).materialize()
Scaling, however, presents some challenges:
- Worker Health Monitoring: Detecting and recovering from failed actors (a sketch of Ray's restart primitives follows this list)
- Load Balancing: Distributing batches sensibly when your Ray cluster mixes heterogeneous hardware
- Checkpoint Management: Ensuring fault tolerance during long-running transcription jobs
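Ray exposes some of these primitives, but you still have to wire them together. A minimal sketch of actor-level fault tolerance using Ray's restart and retry options (the subclass name is illustrative):

import ray

# Restart a crashed actor up to 3 times and retry its in-flight tasks,
# so one worker failure doesn't take down a long transcription job
@ray.remote(max_restarts=3, max_task_retries=2)
class ResilientSpeechActor(SpeechRecognitionActor):
    pass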
Where DIY Falls Short: The Realities of Production
Model Versioning and Updates
Speech recognition evolves rapidly. OpenAI regularly updates its Whisper models (in various sizes), and new open-weights models are released all the time.
Managing this requires:
- A/B Testing Infrastructure: Comparing model performance across audio conditions and bit rates (a sketch follows this list)
- Model Rollout: Safely deploying new model versions across the compute infrastructure
- Audio Formats: Ensuring that formats such as WAV, MP3, and FLAC all transcribe to specification
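At its simplest, A/B testing two checkpoints means scoring them on the same labeled audio. A minimal sketch using the jiwer package to compute word error rate (the model list and sample format are illustrative):

import jiwer
from transformers import pipeline

def compare_models(samples, model_ids):
    """Score each model's word error rate on (audio_path, reference_text) pairs."""
    scores = {}
    for model_id in model_ids:
        asr = pipeline("automatic-speech-recognition", model=model_id)
        hypotheses = [asr(path)["text"] for path, _ in samples]
        references = [ref for _, ref in samples]
        scores[model_id] = jiwer.wer(references, hypotheses)
    return scores

# e.g. compare_models(samples, ["openai/whisper-large-v3-turbo", "openai/whisper-medium"])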
Language Model Integration
Modern ASR systems increasingly rely on language model integration for contextual understanding:
- Domain Adaptation: Fine-tuning models for specific domains such as medical, legal and technical
- Multi-lingual Support: Handling code-switching and accent variations to boost performance
- Real-time Correction: Decoding strategies such as beam search outperform greedy search, improving overall accuracy at some cost in latency (see the sketch below)
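With the HuggingFace pipeline, beam search can be enabled by passing generation arguments through generate_kwargs. A minimal sketch, assuming the pipe object built earlier (the file name and beam count are illustrative):

# Beam search keeps several candidate transcripts instead of the single
# greedy path; num_beams trades decoding time for accuracy
result = pipe(
    "meeting_recording.wav",
    generate_kwargs={"num_beams": 5},
)
print(result["text"])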
The Managed Service Advantage
Transparent Scaling and Optimization
Mixpeek can abstract away the complexity outlined above. Consider the following benefits:
- Auto-scaling Infrastructure: Mixpeek clusters automatically scale to the input size, ensuring smooth pipeline execution
- Hardware Optimization: With a wide selection of compute hardware available through Anyscale, we can match each workload to the right machines for performance and cost
- Multi-cloud Data: While our infrastructure is currently hosted in AWS, we can access object storage in any cloud provider, including Google Cloud, Azure and DigitalOcean
The Pragmatic Choice
The speech recognition pipeline shown earlier is certainly impressive in its conciseness, something HuggingFace libraries excel at in general. However, production-ready ASR requires expertise in distributed systems, ML operations, and domain-specific knowledge. Mixpeek provides a battle-tested platform for feature extraction across modalities.
Prove the value to your organization on a small dataset; then, when you're ready to expand search across your whole dataset, reach out to Mixpeek and see how we can speed up your time to market.