How accurate is the transcription?

Word error rate (WER) is typically 3-8% for clear English speech, comparable to professional human transcription. Accuracy varies with audio quality, accents, and background noise. Noisy environments may see 10-15% WER.
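For reference, WER is the word-level edit distance between a reference transcript and the model output (substitutions, deletions, and insertions), divided by the number of words in the reference. The sketch below shows that calculation. The lowercasing and whitespace tokenization are simplifying assumptions; it is not necessarily how the figures above were measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word in a 20-word reference gives 5% WER.
reference = " ".join(["word"] * 19 + ["mixpeek"])
hypothesis = " ".join(["word"] * 19 + ["mixpeak"])
print(f"{word_error_rate(reference, hypothesis):.0%}")
```

At 3-8% WER, a 1,000-word transcript would contain roughly 30 to 80 word-level errors.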

Does it support multiple languages in one file?

Yes. The model can detect and switch between languages within a single audio file. Set `auto_detect_language` to true for mixed-language content.

Can I get word-level timestamps?

Yes. Set `timestamp_granularity` to `"word"` for per-word timestamps, or `"sentence"` for sentence-level timing. The default is sentence-level.
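A common use of these timestamps is building captions. The sketch below turns timed segments into SRT subtitle text. It assumes each segment exposes `start` and `end` in seconds plus `text`; those field names are hypothetical, so check them against the actual response before relying on them.

```python
def to_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text from dicts with hypothetical 'start', 'end', 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text']}"
        )
    return "\n\n".join(blocks) + "\n"


# Stand-in data; real responses may use different field names.
example = [
    {"start": 0.0, "end": 2.4, "text": "Welcome back to the show."},
    {"start": 2.4, "end": 5.15, "text": "Today we talk about multimodal search."},
]
print(segments_to_srt(example))
```

Sentence-level timing is usually enough for captions. Word-level timing is more useful for karaoke-style highlighting or precise audio editing.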

Is there a way to provide custom vocabulary?

Yes. Use the `vocabulary_boost` parameter with a list of domain-specific terms (product names, jargon, acronyms). This significantly improves accuracy for specialized content.
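The options mentioned in the answers above can be combined in a single options dictionary. The helper below is a hypothetical convenience, not part of the Mixpeek SDK. It validates the granularity value and tidies the vocabulary list before the dictionary is passed as `options=` to `client.convert(...)`. The `False` default for `auto_detect_language` is this helper's own assumption; the service default is not stated here.

```python
from typing import Optional

ALLOWED_GRANULARITIES = {"word", "sentence"}


def build_transcription_options(
    auto_detect_language: bool = False,  # helper default, an assumption
    timestamp_granularity: str = "sentence",
    vocabulary_boost: Optional[list] = None,
) -> dict:
    """Assemble an options dict using the parameter names from the FAQ."""
    if timestamp_granularity not in ALLOWED_GRANULARITIES:
        raise ValueError(
            f"timestamp_granularity must be one of {sorted(ALLOWED_GRANULARITIES)}"
        )

    options = {
        "auto_detect_language": auto_detect_language,
        "timestamp_granularity": timestamp_granularity,
    }

    # Drop blank entries and duplicates while keeping the original order.
    seen = set()
    terms = []
    for term in vocabulary_boost or []:
        cleaned = term.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            terms.append(cleaned)
    if terms:
        options["vocabulary_boost"] = terms
    return options


print(build_transcription_options(True, "word", ["Mixpeek", "RAG", " RAG "]))
```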

Audio to Text Converter

Transcribe audio files into text with high accuracy. Supports speaker diarization, punctuation restoration, timestamps, and over 50 languages. Handles podcasts, calls, meetings, and broadcast audio.

Max file size: 2 GB

Estimated: 1-5 min per hour of audio

7 input formats
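As a rough planning aid, the estimate of 1-5 minutes per hour of audio can be turned into a range for a given recording. The sketch below assumes processing time scales linearly with duration, which is an assumption and not something the figures above guarantee.

```python
def estimate_processing_minutes(audio_seconds: float) -> tuple:
    """Return (low, high) estimated processing minutes, assuming linear scaling."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    audio_hours = audio_seconds / 3600
    # Published estimate: 1 to 5 minutes of processing per hour of audio.
    return (1 * audio_hours, 5 * audio_hours)


# A 90-minute recording (5,400 seconds)
low, high = estimate_processing_minutes(5400)
print(f"Estimated processing time: {low:.1f} to {high:.1f} minutes")
```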

How It Works

Upload an audio file or provide a URL.

The audio is preprocessed (noise reduction, normalization).

Speech is transcribed using a large speech model.

Speaker diarization assigns text segments to individual speakers.

Timestamps, punctuation, and formatting are applied.

Code Examples

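from mixpeek import Mixpeek

# Authenticate with your Mixpeek API key.
client = Mixpeek(api_key="YOUR_API_KEY")

# Convert a remote audio file to text.
result = client.convert(
    source="https://example.com/podcast-ep42.mp3",
    from_format="audio",
    to_format="text",
    options={
        "speaker_diarization": True,          # label each segment with its speaker
        "timestamp_granularity": "sentence",  # or "word" for per-word timestamps
        "vocabulary_boost": ["Mixpeek", "multimodal", "RAG"],  # domain-specific terms
    },
)

# Print each transcribed segment with its speaker label.
for segment in result.segments:
    print(f"[{segment.speaker}] {segment.text}")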
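For meeting minutes or show notes, it is often easier to read one paragraph per speaker turn than one line per segment. The sketch below merges consecutive segments from the same speaker. It relies only on the `speaker` and `text` attributes used in the example above. The stand-in objects and the speaker labels are illustrative assumptions, not actual API output.

```python
from types import SimpleNamespace


def merge_speaker_turns(segments):
    """Join consecutive segments by the same speaker into (speaker, text) turns."""
    turns = []
    for segment in segments:
        if turns and turns[-1][0] == segment.speaker:
            # Same speaker as the previous segment: extend the current turn.
            turns[-1] = (segment.speaker, f"{turns[-1][1]} {segment.text}")
        else:
            turns.append((segment.speaker, segment.text))
    return turns


# Stand-in objects that mimic the shape of result.segments.
demo_segments = [
    SimpleNamespace(speaker="Speaker 1", text="Welcome to the show."),
    SimpleNamespace(speaker="Speaker 1", text="Today we cover audio search."),
    SimpleNamespace(speaker="Speaker 2", text="Thanks for having me."),
]

for speaker, text in merge_speaker_turns(demo_segments):
    print(f"[{speaker}] {text}")
```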

Use Cases

Transcribe podcast episodes for show notes and SEO

Convert call center recordings to searchable text

Generate meeting minutes from recorded calls

Create text datasets from audio archives

Supported Input Formats

MP3

WAV

FLAC

OGG

AAC

M4A

WMA
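Before uploading, a local check against the format list and the 2 GB limit can catch avoidable failures early. The sketch below is a client-side convenience under two assumptions: the file extension reflects the actual format, and "2 GB" means 2 GiB (2 × 1024³ bytes). The service may apply its own, different checks.

```python
from pathlib import Path

# Extensions for the seven formats listed above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".aac", ".m4a", ".wma"}
MAX_FILE_SIZE_BYTES = 2 * 1024**3  # assumption: "2 GB" read as 2 GiB


def check_upload(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds the 2 GB limit")
    return problems


print(check_upload("podcast-ep42.mp3", 85_000_000))
print(check_upload("interview.opus", 85_000_000))
```

For a file on disk, `Path(path).stat().st_size` gives the size in bytes.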

Quick Info

Category: media

Max File Size: 2 GB

Est. Time: 1-5 min per hour of audio

Extractor: audio-descriptor

Try This Conversion

Get started with the Mixpeek API and convert your first file in minutes.

Related Converters

Video to Text

Extract spoken dialogue, on-screen text, and scene descriptions from video files using multimodal AI. Produces time-stamped transcripts with speaker diarization and OCR-detected overlays.

Audio to Embeddings

Convert audio files into dense vector embeddings that capture spoken content, tone, and acoustic features. Use embeddings for audio search, speaker verification, and content-based recommendation.