
    Teaching CLIP, Whisper, and Gemini to Trust Nothing

    If we can't spot a kangaroo in an airport as fake, what hope do we have against political deepfakes? FakeCheck is a trust-nothing detection system that caught viral fakes using CLIP, Whisper and Gemini.


    Recently, millions watched a video of two women arguing in an airport while a kangaroo in a service vest stared into the camera. The internet was divided: hilarious reality or AI fever dream? Twitter exploded with hot takes. News outlets ran with it. Your aunt shared it on Facebook with three laughing emojis.

    💡
    All resources linked here: playbooks/fake-video-detection

    It was completely fake. Generated by Google's Veo3 in about 30 seconds.

    If we can't even spot a kangaroo in an airport as fake, what hope do we have against sophisticated political deepfakes? That's the thought that kept me up at night, leading to a caffeine-fueled coding marathon and the birth of FakeCheck, a paranoid little system that trusts nothing and questions everything.

    The Multimodal Detective Story

    Here's the thing about AI-generated video detection: it's like trying to spot a lie. One tell might be a coincidence, but when multiple things feel off, you're probably onto something. That's why FakeCheck works like a digital forensics team where each expert has trust issues.

    CLIP → Whisper → Gemini → Heuristics → Fusion → Verdict

    The philosophy is simple: One AI can be fooled, but can you fool them all?

    Instead of training a massive model from scratch (who has that kind of GPU budget?), I assembled a squad of pre-trained models, each looking for different signs of deception. It's like having a visual expert, an audio analyst, and a behavioral psychologist all reviewing the same suspect.
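
    In code terms, that squad boils down to a handful of independent checks feeding one fusion step. Here's a rough sketch of the orchestration (the helper names are illustrative placeholders, not the repo's exact API):

    # Hypothetical top-level pipeline; every helper below is a placeholder name
    async def analyze_video(video_path: str) -> dict:
        frames, wav_path = extract_frames_and_audio(video_path)  # frame sampling + ffmpeg
        scores = {
            "visual_clip": clip_deepfake_score(frames),           # visual witness
            **await run_gemini_checks(frames, wav_path),          # artifacts, lip-sync, blinks, OCR
            "flow": optical_flow_score(frames),                   # motion heuristics
        }
        fused = fuse_scores(scores)                               # weighted combination (see below)
        # 0.5 is an arbitrary illustrative threshold
        return {"score": fused, "verdict": "LIKELY_FAKE" if fused > 0.5 else "LIKELY_REAL"}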

    Meet the Detective Squad

    Visual Witness: CLIP

    CLIP is OpenAI's Contrastive Language-Image Pre-training model, and it turns out it's surprisingly effective at spotting fakes. But not for the reason you might think.

    CLIP wasn't trained to distinguish real from fake content. Instead, it learned rich visual representations by understanding how images relate to text descriptions across millions of examples.

    Here's the clever bit: We can leverage these learned representations by asking CLIP semantic questions about what it's seeing.

    # The magic: teaching CLIP what "fake" looks like semantically
    REAL_PERSON_PROMPTS = [
        "a typical frame from a live-action video recording of a real person or people",
        "a natural, unedited video still of a human being or more than one human being",
        "a person with natural skin texture and realistic lighting in a video"
    ]
    
    FAKE_PERSON_PROMPTS = [
        "an AI generated or deepfake face with unnatural features in a video",
        "a digitally altered face, a manipulated facial video frame",
        "eyes that look glassy, unfocused, or move unnaturally"
    ]
    

    Simplified CLIP detection - see the linked code for the full implementation

    Instead of training a custom model, I'm essentially asking CLIP: "Does this frame look more like a real person or a deepfake?" It's prompt engineering at its finest, leveraging CLIP's semantic understanding of what makes something look "off."

    The scoring works by computing similarity between video frames and these prompts:

    # Simplified version of the scoring logic
    real_similarity = compute_similarity(frame, REAL_PERSON_PROMPTS)
    fake_similarity = compute_similarity(frame, FAKE_PERSON_PROMPTS)
    deepfake_score = fake_similarity - real_similarity
    

    On our kangaroo video? CLIP scored it 0.73 (LIKELY_FAKE). Not because it understood airport security protocols for marsupials, but because something about the lighting and textures screamed "artificial" to its trained eye.
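
    If you're curious what compute_similarity might actually look like, here's a minimal sketch using Hugging Face's transformers CLIP wrapper. The checkpoint name and the simple averaging over prompts are my assumptions, not necessarily what the repo does:

    import torch
    from transformers import CLIPModel, CLIPProcessor
    
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    
    def compute_similarity(frame, prompts):
        """Mean image-text similarity between one frame (a PIL image) and a prompt set."""
        inputs = clip_processor(text=prompts, images=frame, return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = clip_model(**inputs)
        # logits_per_image has shape (1, num_prompts): scaled cosine similarities
        return outputs.logits_per_image.mean().item()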

    Audio Witness: Whisper

    Whisper is OpenAI's speech recognition model, and it's brilliant at transcription. Too brilliant, as it turns out. Here's how we use it:

    # Transcription with a trust-but-verify approach
    transcription = whisper_model.transcribe(
        wav_path, 
        fp16=True, 
        word_timestamps=True,
        verbose=None  # Get segment-level no_speech_prob
    )
    

    The idea is simple: transcribe the audio, then use that transcription to check if the lip movements match. It's a solid plan... until Whisper decides to get creative. But more on that disaster later.
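
    Before we get to that disaster, one setup detail: for the transcribe call above to run at all, the audio has to be pulled out of the video and the model loaded. A minimal sketch, where the ffmpeg flags and model size are my assumptions rather than the repo's exact setup:

    import subprocess
    import whisper
    
    # Extract a 16 kHz mono WAV track that Whisper can digest
    subprocess.run(
        ["ffmpeg", "-y", "-i", "input.mp4", "-ac", "1", "-ar", "16000", "audio.wav"],
        check=True,
    )
    wav_path = "audio.wav"
    whisper_model = whisper.load_model("base")  # model size is a guess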

    The Expert Jury: Gemini

    Google's Gemini serves as our expert analyst, performing four distinct checks:

    1. Visual artifact detection
    2. Lip-sync verification
    3. Blink pattern analysis
    4. OCR for gibberish text detection

    Getting consistent results from an LLM is like herding cats, so I developed what I call "load-bearing prompt injection":

    # After lip-sync analysis
    return {
        "flag": flag,
        "event": lip_sync_event,
        "instruction": f"Your output MUST mention the lip-sync result: {result}"
    }
    

    Without that last line, Gemini would sometimes just... forget to mention whether the lips matched the audio. It's like dealing with a brilliant but absent-minded professor.
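
    To make that "load-bearing" part concrete, here's a sketch of how those per-check instructions could be folded into the final summary prompt. The wording and function name are illustrative, not lifted from the repo:

    def build_summary_prompt(check_results):
        """Assemble the final Gemini prompt, appending each check's mandatory instruction."""
        instructions = [r["instruction"] for r in check_results if r.get("instruction")]
        return (
            "Summarize the evidence of manipulation in this video. "
            "Checks performed: visual artifacts, lip-sync, blink patterns, on-screen text.\n"
            + "\n".join(instructions)
        )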

    The Whisper Hallucination Incident

    Here's where things get interesting. I was testing FakeCheck with an AI-generated video of people speaking complete gibberish, not any human language, just synthetic mouth sounds.

    What I expected from Whisper: "No speech detected" or maybe some admission of confusion.

    What Whisper actually delivered: "The quarterly earnings report shows significant growth in the Asian markets, particularly in the technology sector where we've seen a 23% increase in user engagement..."

    I'm not kidding. Whisper hallucinated an entire business presentation from pure nonsense.

    The Cascade of Chaos

    This created a domino effect of confusion:

    1. Whisper: "They're definitely discussing quarterly earnings!"
    2. Gemini: "Hmm, the lips don't match these words about market growth..."
    3. Lip-sync detector: "MISMATCH DETECTED!"
    4. Final verdict: "LIKELY_FAKE"

    Right answer, completely wrong reasoning.

    The Fix That Saved My Sanity

    After losing several hours to this madness, I implemented what I call the "Whisper Reality Check":

    # The "Whisper Reality Check"
    NO_SPEECH_THRESHOLD = 0.85
    avg_no_speech_prob = transcription.get("avg_no_speech_prob", 0.0)
    
    if avg_no_speech_prob > NO_SPEECH_THRESHOLD:
        logger.warning(f"High 'no speech' probability ({avg_no_speech_prob:.2f}). " 
                       "Disabling lip-sync check to avoid hallucination cascade.")
        lipsync_enabled = False
        transcription["text"] = "[No speech detected]"
    

    Basically, if Whisper isn't confident there's actual speech, we don't trust its transcription. It's like breathalyzing your witness before letting them testify.
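
    One detail worth calling out: avg_no_speech_prob isn't something Whisper hands back directly; it has to be rolled up from the per-segment no_speech_prob values. A sketch of that rollup (the helper name is mine, not the repo's):

    def attach_avg_no_speech_prob(transcription):
        """Average Whisper's per-segment no_speech_prob into one confidence signal."""
        segments = transcription.get("segments", [])
        if segments:
            probs = [seg.get("no_speech_prob", 0.0) for seg in segments]
            transcription["avg_no_speech_prob"] = sum(probs) / len(probs)
        else:
            # No segments at all: treat it as "definitely no speech"
            transcription["avg_no_speech_prob"] = 1.0
        return transcription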

    The Fusion of Experts

    This is where all the evidence comes together. Each detector gives its verdict, and we combine them with carefully tuned weights:

    # The carefully hand-tuned weights (emphasis on "hand-tuned")
    FUSION_MODEL_WEIGHTS = {
        "visual_clip": 0.28,              # CLIP knows its stuff
        "gemini_visual_artifacts": 0.37,  # Gemini spots the uncanny
        "gemini_lipsync_issue": 0.145,     # When it works...
        "gemini_blink_abnormality": 0.101, # Blinks don't lie
        "gemini_gibberish_text": 0.077,    # Gibberish text detector
        "flow": 0.077,                     # Motion inconsistencies
    }
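
    The fusion step itself is nothing exotic: a weighted sum of per-detector scores, normalized over whichever detectors actually produced a score. Roughly (a sketch; the real scoring and thresholding may differ):

    def fuse_scores(detector_scores):
        """Weighted average of per-detector scores, each assumed to be in [0, 1]."""
        active = {k: v for k, v in detector_scores.items() if k in FUSION_MODEL_WEIGHTS}
        if not active:
            return 0.0
        weighted = sum(FUSION_MODEL_WEIGHTS[k] * v for k, v in active.items())
        total_weight = sum(FUSION_MODEL_WEIGHTS[k] for k in active)
        return weighted / total_weight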
    

    How did I arrive at these specific numbers? I spent a weekend with 30 videos, a spreadsheet, and entirely too much coffee. It went something like:

    • Hour 1: "I'll use equal weights!"
    • Hour 3: "Okay, CLIP seems more reliable..."
    • Hour 7: "Why is OCR triggering on every video with subtitles?"
    • Hour 12: "These numbers feel right. Ship it."

    The current approach is, frankly, more art than science. The dream would be an automated tuning pipeline:

    # TODO: Replace my caffeine-fueled guesswork with science
    def optimize_weights(labeled_videos):
        # Grid search? Genetic algorithm? 
        # Anything better than "this feels right"
        pass
    

    With a few hundred labeled examples, we could properly optimize these weights. Maybe even use some fancy Bayesian optimization. But for now, my hand-tuned weights work surprisingly well.
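
    One plausible way to fill in that TODO is a plain random search over normalized weight vectors against a labeled set, something like this sketch (it assumes per-video detector scores and a 0.5 decision threshold, neither of which comes from the repo):

    import random
    
    def optimize_weights(labeled_videos, n_trials=1000):
        """labeled_videos: list of (detector_scores: dict, is_fake: bool) pairs."""
        names = list(FUSION_MODEL_WEIGHTS)
        best_weights, best_acc = dict(FUSION_MODEL_WEIGHTS), 0.0
        for _ in range(n_trials):
            raw = [random.random() for _ in names]
            weights = {n: r / sum(raw) for n, r in zip(names, raw)}
            correct = sum(
                int((sum(weights[n] * s.get(n, 0.0) for n in names) > 0.5) == is_fake)
                for s, is_fake in labeled_videos
            )
            acc = correct / len(labeled_videos)
            if acc > best_acc:
                best_weights, best_acc = weights, acc
        return best_weights, best_acc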

    The Supporting Cast: Heuristic Detectors

    Beyond the headline models, FakeCheck employs several heuristic detectors that catch specific anomalies:

    Optical Flow Spike Detector

    This one looks for sudden, unnatural movements in the video:

    # Why we throttle: Videos are noisy, alerts shouldn't be
    if ts < last_event_ts + 1.0:
        continue  # One anomaly per second is plenty
    
    z = (mag - μ) / σ  # z-score for motion magnitude
    if z > 2:  # Significant spike
        events.append({
            "event": "flow_spike",
            "ts": ts,
            "meta": {"z": round(z, 2)}
        })
        last_event_ts = ts  # Reset the one-per-second throttle
    

    Early versions flagged every single frame. Turns out, video compression creates tons of tiny artifacts. The solution? Throttling. Because if everything's an anomaly, nothing is.
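
    For reference, the motion magnitude feeding those z-scores can come from dense optical flow, e.g. OpenCV's Farneback implementation. A sketch; the repo may use a different flow method or parameters:

    import cv2
    
    def mean_flow_magnitude(prev_frame, frame):
        """Average dense optical-flow magnitude between two consecutive BGR frames."""
        prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback dense flow with stock parameters
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        return float(magnitude.mean())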

    Production Reality: Where Things Get Spicy

    The Protobuf Async/Await Disaster

    Here's a fun one. Google's Gemini Python SDK has an... interesting bug where async calls sometimes fail with protobuf errors. My solution is a masterpiece of "this shouldn't work but it does":

    # Needs asyncio + functools for the sync fallback below
    import asyncio
    import functools
    
    async def safe_generate_content(model, content, max_retries=2):
        try:
            return await model.generate_content_async(content)
        except AttributeError as e:
            if "Unknown field" not in str(e):
                raise
            # The "this shouldn't work but it does" fallback
            logger.warning("Protobuf bug detected. Using sync fallback.")
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(None,
                functools.partial(model.generate_content, content))
    

    When Google's own libraries fight each other, you improvise. This workaround has saved me countless hours of debugging. Sometimes the best code is the code that just works, elegance be damned.

    Performance Reality Check

    Let's talk numbers:

    • Processing Speed: ~30 seconds per 30-second video (on GPU)
    • CPU Performance: ~70 seconds (patience required)
    • Accuracy on Test Set: ~70% (but defining "truth" is half the battle)
    • Maximum Video Length: 30 seconds (for now)

    That last limitation is crucial. FakeCheck works great for TikToks and short clips. Hour-long documentaries? That's a different beast entirely.

    Scaling Beyond the Demo: Enter Mixpeek

    Here's the thing about proof-of-concepts: they prove concepts. What they don't do is scale.

    What FakeCheck Proves

    • Multi-modal detection works
    • Fusion scoring can catch what individual models miss
    • The approach is sound (if a bit paranoid)
    • Even with imperfect components, the ensemble is robust

    What It Doesn't Handle

    • Real-time streaming: Try analyzing a 2-hour livestream with this
    • Distributed processing: Everything runs on one machine
    • Model versioning: "Which version of CLIP caught that fake?"
    • Production reliability: 99.9% uptime vs "works on my machine"
    • Scale: Processing thousands of videos per minute? Good luck

    This is where platforms like Mixpeek come in. They've solved the infrastructure challenges that turn proof-of-concepts into production systems:

    • Scalable Pipeline Orchestration: Distribute processing across multiple machines
    • Model Versioning and A/B Testing: Know exactly which model version made each detection
    • Real-time Processing: Handle live streams, not just uploaded videos
    • Enterprise Reliability: SLAs, monitoring, and all that production jazz

    Think of it as the difference between a proof-of-concept drone and a production aircraft. Both fly, but only one should carry passengers.

    DIY: Try FakeCheck Yourself

    Want to play with FakeCheck?

    Try the live demo at: https://fake-check.mixpeek.com/

    Or run it yourself:

    # Clone the repo
    git clone https://github.com/mixpeek/fake-check
    
    # Run Backend
    cd fake-check/backend
    pip install -r requirements.txt # Install dependencies (pray to the pip gods)
    cp .env.example .env            # Set up your API keys
    python run_server.py            # Run detection server
    
    # Run Frontend In New Terminal
    cd fake-check/frontend
    npm install
    npm run dev
    
    # Open http://localhost:5173/ in your browser
    

    Fair Warning

    • This is a PoC, expect rough edges and occasional explosions
    • Requires GPU for reasonable performance (or extreme patience)
    • Gemini API rate limits are real and they will hurt you
    • If it breaks, you get to keep both pieces

    Contributing

    Found a video that breaks everything? Perfect! That's exactly what we need:

    • Open an issue with the video (if shareable)
    • Include the full error traceback
    • Bonus points for proposing a fix or putting up a PR

    Have ideas for improvement?

    • Better fusion algorithm? Let's see it
    • New detector module? PR it
    • Performance optimizations? Yes please

    What's Next for FakeCheck

    1. Automated Weight Tuning: Replace my coffee-driven optimization with actual machine learning
    2. More Sophisticated Heuristics: Heartbeat detection, micro-expression analysis
    3. Streaming Support: Because the world doesn't stop at 30 seconds
    4. Better Whisper Handling: Maybe a "hallucination detector for the hallucination detector"

    The Bigger Picture

    The deepfake detection arms race isn't slowing down. Every advancement in generation technology demands a corresponding advancement in detection. Today's state-of-the-art detector is tomorrow's training data for better deepfakes.

    But here's the thing: we don't need perfect detection. We need good-enough detection deployed widely enough to make deepfake creation a high-risk, low-reward proposition.

    Your Move

    Whether you're:

    A Developer:

    • Try FakeCheck, break it, improve it
    • Share your findings and edge cases
    • Help us build better detection tools

    A Decision-Maker:

    • Consider how deepfake detection fits your content strategy
    • Think about the infrastructure needed for scale
    • Plan for the world where any video could be fake

    Just Curious:

    • Stay skeptical of viral videos
    • Verify sources before sharing
    • Maybe don't trust videos of kangaroos in airports

    The battle for digital truth is far from over. Tools like FakeCheck are just the beginning, proof that we can fight back against the tide of synthetic media. But winning this war will take more than clever code. It'll take infrastructure, scale, and a healthy dose of paranoia.

    After all, in a world where seeing is no longer believing, a little paranoia might just be the sanest response.

    Want to explore deepfake detection at scale? Check out Mixpeek's platform for production-ready solutions. Or dive into the FakeCheck source code and help us make it better.

    Found this helpful? Share it with someone who thinks all videos on the internet are real. They need the wake-up call.
