Multimodal Monday #45: Birds, Whales, and the End of Latency

Quick Take (TL;DR)

Voice AI drops the walkie-talkie act. NVIDIA's PersonaPlex-7B and ElevenLabs' Expressive Mode both ship full-duplex conversation. The AI listens while it talks, interrupts naturally, and adjusts tone mid-sentence. Turn-taking latency is dead.
Vision goes native. Qwen3.5 (397B parameters) and DeepGen 1.0 bake visual understanding directly into the model architecture instead of wiring a vision encoder to a language model after the fact. The result: tighter reasoning over charts, documents, and complex images.
A bird model classified whale songs. Google fine-tuned Perch 2.0 (trained on birdsong) to classify whale vocalizations. It worked, which suggests bioacoustic signals share deeper structural patterns across species than you might expect.

Tools, Models and Techniques

Qwen3.5-397B-A17B - Qwen's new foundation model pairs a 397B-parameter vision-language architecture (about 17B parameters active per token, going by the A17B suffix) with hybrid linear attention heads. It handles document parsing, chart analysis, and visual reasoning natively rather than routing through a separate encoder. Why it matters: An open model at this scale with native multimodal integration puts serious pressure on proprietary alternatives. Blog | Hugging Face

PersonaPlex-7B - NVIDIA released a 7B voice model that listens and speaks at the same time. It supports natural interruptions ("barge-in"), overlapping speech, and real-time turn negotiation without the pause-wait-respond loop. Why it matters: Full-duplex conversation removes the single biggest friction point in voice AI: latency. Hugging Face

ElevenAgents Expressive Mode - ElevenLabs added breath, pauses, and emotional inflection to their voice agents. The output sounds less like text-to-speech and more like someone actually thinking before they talk. Why it matters: Voice agents in support, coaching, and companionship roles need to sound like they care, and this gets closer. Blog | Try it

MiniMax M2.5 - MiniMax open-sourced a frontier model tuned for practical work: coding, writing, and structured analysis. It prioritizes instruction-following accuracy over open-ended chat. Why it matters: A model built to execute tasks reliably matters more than one that chats well. Hugging Face

Seedance 2.0 - ByteDance's video generator takes text, images, audio, or existing video as input and produces new video synchronized to the audio beat. It automates the tedious frame-by-frame alignment work that eats hours in post-production. Why it matters: Audio-visual sync is the bottleneck in short-form video production, and this removes it. Project Page

Qwen-Image-2.0: Professional infographics and photorealism generation. Blog
DeepGen 1.0: A lightweight 5B-parameter unified multimodal model. Hugging Face
GLM-5: From vibe coding to agentic engineering. Blog
KaniTTS2: Open-source 400M TTS model that runs in 3GB VRAM. Hugging Face
SoulX-Singer: High-quality zero-shot singing voice synthesis. GitHub
MioTTS-2.6B: Lightweight TTS optimized for speed in English and Japanese. Hugging Face
FireRed-Image-Edit-1.0: New tool for image editing. Hugging Face
Qwen3-TTS: 1.7B parameters of clean, natural speech synthesis. Hugging Face
Ming-flash-omni 2.0: New multimodal model from InclusionAI. Hugging Face

Research Highlights

EchoJEPA: Latent Prediction for Hearts - A self-supervised foundation model trained on 18 million echocardiograms. Instead of predicting noisy ultrasound pixels, it learns in latent space and separates clinical signal from artifact, outperforming existing cardiac assessment methods. Why it matters: Self-supervised training on massive unlabeled medical data catches anomalies that small labeled datasets miss. Paper

Bioacoustics Transfer Learning - Google Research adapted Perch 2.0, trained entirely on bird songs, to classify whale vocalizations. The cross-domain transfer worked because bioacoustic signals share fundamental spectral and temporal features across species. Why it matters: You can train on abundant data (birds) and fine-tune for scarce data (whales), which unlocks conservation research without needing millions of labeled samples per species. Blog
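To make the "train on abundant data, adapt with scarce data" idea concrete, here is a minimal sketch of few-shot classification on top of frozen embeddings. It is not Google's method or the Perch API: the species labels, the 3-D vectors, and the nearest-centroid classifier are all hypothetical stand-ins for whatever a pretrained audio model and a small task-specific head would actually produce.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_centroids(labeled_embeddings):
    """Map each label to the mean of its example embeddings."""
    return {label: mean_vector(vecs) for label, vecs in labeled_embeddings.items()}

def classify(embedding, centroids):
    """Return the label whose centroid is closest to the embedding."""
    return min(centroids, key=lambda label: distance(embedding, centroids[label]))

# Hypothetical embeddings: pretend a frozen, bird-trained model produced these
# vectors for a handful of labeled whale recordings.
few_shot = {
    "humpback": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "orca": [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1]],
}

centroids = fit_centroids(few_shot)
prediction = classify([0.85, 0.15, 0.05], centroids)
print(prediction)
```

The point of the sketch is the division of labor: the expensive model stays frozen, and only a tiny classifier is fit on the scarce labeled examples.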

Beyond the Unit Hypersphere - This paper challenges the standard practice of normalizing embeddings onto the unit hypersphere in contrastive learning. The authors show that embedding magnitude carries meaningful information about confidence and specificity that normalization destroys. Why it matters: Preserving magnitude leads to more nuanced retrieval and better performance on ambiguous queries. Paper
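A small sketch of what normalization throws away. The vectors below are made up, and the reading of "larger magnitude means more confident or specific" is an assumption for illustration, not the paper's exact formulation. Two documents point in the same direction but differ in length: cosine similarity (which normalizes) cannot tell them apart, while a raw dot product can.

```python
import math

def dot(a, b):
    """Unnormalized inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine similarity: dot product after projecting onto the unit sphere."""
    return dot(a, b) / (norm(a) * norm(b))

# Hypothetical 2-D embeddings. Both documents share a direction;
# doc_specific is three times longer than doc_generic.
query = [1.0, 0.0]
doc_generic = [0.6, 0.8]
doc_specific = [1.8, 2.4]

cos_generic = cosine(query, doc_generic)
cos_specific = cosine(query, doc_specific)
dot_generic = dot(query, doc_generic)
dot_specific = dot(query, doc_specific)

print("cosine:", cos_generic, cos_specific)  # effectively tied
print("dot:   ", dot_generic, dot_specific)  # magnitude breaks the tie
```

Under cosine similarity the two documents rank as equals; with the dot product the longer vector scores higher, which is the extra signal the paper argues is worth keeping.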

DuoGen: Mixed-Media Storytelling - NVIDIA's DuoGen generates coherent interleaved sequences of images and text. It decides when to show and when to tell, keeping visual and textual content consistent across the full narrative. Why it matters: This opens the door to AI-generated tutorials, articles, and illustrated content that reads as authored rather than assembled. Project Page

UniAudio 2.0 - A single audio language model that handles speech, music, and sound effects through text-aligned factorized tokenization. One framework generates, edits, and mixes across all audio types without switching models. Why it matters: Unifying the audio stack (TTS, music generation, foley) into one model creates workflows that were previously impossible without multiple specialized tools. Paper

ALIVE: Lifelike audio-video generation. Project Page
ConsID-Gen: View-consistent, identity-preserving image-to-video generation. Project Page
JUST-DUB-IT: Video dubbing via joint audio-visual diffusion. Project Page
Voice-First Human-AI Collaboration: Exploring LMMs in mixed reality. Paper
Multimodal Manufacturing Safety Chatbot: Benchmark for RAG approaches in safety. Paper
Alzheimer's Detection: Multimodal fusion for better diagnosis. Paper

Trends & Predictions

Full-Duplex Voice Is Here

PersonaPlex-7B and ElevenAgents both shipped full-duplex voice this week. The "you talk, then I talk" model is officially legacy.

Real conversations overlap. People interrupt, confirm with "uh-huh," and change direction mid-thought. Full-duplex models are built to handle all of this. More importantly, continuous listening lets the model start composing a response before you finish your sentence. That can shave hundreds of milliseconds off response time, which matters in customer support, gaming, and any scenario where hesitation breaks trust. And when the model hears frustration in your voice while you're still talking, it can adjust its response before delivering it.
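A back-of-the-envelope sketch of where those milliseconds could come from. Every number here is illustrative rather than measured, and the two-term latency model is a simplifying assumption: a turn-based system waits out a silence window to confirm you have finished and only then composes, while a full-duplex system gets a head start by composing while you are still speaking.

```python
def turn_based_latency_ms(endpoint_silence_ms, compose_ms):
    """Wait for a silence window to detect end of turn, then compose the reply."""
    return endpoint_silence_ms + compose_ms

def full_duplex_latency_ms(compose_ms, head_start_ms):
    """Composition begins head_start_ms before the user stops speaking."""
    return max(0, compose_ms - head_start_ms)

# Illustrative values only (not benchmarks of any real system).
turn_based = turn_based_latency_ms(endpoint_silence_ms=500, compose_ms=400)
full_duplex = full_duplex_latency_ms(compose_ms=400, head_start_ms=300)
saved = turn_based - full_duplex

print(turn_based, full_duplex, saved)
```

In this toy model the saving comes from two places: skipping the end-of-turn silence window and overlapping composition with the user's speech. Real systems will have other costs, such as network and audio buffering, that are not modeled here.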

Native Multimodal Architectures Are Winning

Qwen3.5 and DeepGen 1.0 both build vision into the model from the ground up. No separate encoder. No adapter layer. No translation step. When vision and language train together from scratch, the model reasons with visual information instead of converting it to text first. You get a system that reads a chart and understands the argument the chart is making, not just the numbers on it. Unified architectures also cut inference overhead because data doesn't bounce between modules. This is what enables tasks like "analyze this graph in the context of the surrounding report" where tight cross-modal reasoning is the whole point.

Community + Shoutouts

Larry the OpenClaw: Shoutout to @oliverhenry for the writeup on Larry, the open-source robot arm doing social media. A fun look at embodied AI in the wild. X Post
OneVision Encoder: Thanks to @brian_bo_li for the deep dive into the OneVision Encoder. Understanding the "eyes" of these models is crucial for building better apps. X Post
AutoGuidance Node: A great resource for the ComfyUI community: a custom node implementing AutoGuidance. GitHub
Kling 3.0 Fun: @lexx_aura shows off the capabilities (and hilarity) of Kling 3.0. Sometimes the best way to test a model is to just make something weird. X Post

That's a wrap for Multimodal Monday #45! From full-duplex voice models that listen and speak simultaneously, to 397B-parameter architectures that reason with pixels instead of converting them to words, to a birdsong classifier that turned out to understand whales, this week showed multimodal AI getting less polite and more useful.
