Teaching CLIP, Whisper, and Gemini to Trust Nothing
If we can't spot a kangaroo in an airport as fake, what hope do we have against political deepfakes? FakeCheck is a trust-nothing detection system that caught viral fakes using CLIP, Whisper, and Gemini.

Recently, millions watched a video of two women arguing in an airport while a kangaroo in a service vest stared into the camera. The internet was divided: hilarious reality or AI fever dream? Twitter exploded with hot takes. News outlets ran with it. Your aunt shared it on Facebook with three laughing emojis.

It was completely fake. Generated by Google's Veo3 in about 30 seconds.
If we can't even spot a kangaroo in an airport as fake, what hope do we have against sophisticated political deepfakes? That's the thought that kept me up at night, leading to a caffeine-fueled coding marathon and the birth of FakeCheck, a paranoid little system that trusts nothing and questions everything.
The Multimodal Detective Story
Here's the thing about AI-generated video detection: it's like trying to spot a lie. One tell might be a coincidence, but when multiple things feel off, you're probably onto something. That's why FakeCheck works like a digital forensics team where each expert has trust issues.

The philosophy is simple: One AI can be fooled, but can you fool them all?
Instead of training a massive model from scratch (who has that kind of GPU budget?), I assembled a squad of pre-trained models, each looking for different signs of deception. It's like having a visual expert, an audio analyst, and a behavioral psychologist all reviewing the same suspect.
Meet the Detective Squad
Visual Witness: CLIP
CLIP is OpenAI's Contrastive Language-Image Pre-training model, and it turns out it's surprisingly effective at spotting fakes. But not for the reason you might think.
CLIP wasn't trained to distinguish real from fake content. Instead, it learned rich visual representations by understanding how images relate to text descriptions across millions of examples.
Here's the clever bit: We can leverage these learned representations by asking CLIP semantic questions about what it's seeing.
# The magic: teaching CLIP what "fake" looks like semantically
REAL_PERSON_PROMPTS = [
    "a typical frame from a live-action video recording of a real person or people",
    "a natural, unedited video still of a human being or more than one human being",
    "a person with natural skin texture and realistic lighting in a video",
]
FAKE_PERSON_PROMPTS = [
    "an AI generated or deepfake face with unnatural features in a video",
    "a digitally altered face, a manipulated facial video frame",
    "eyes that look glassy, unfocused, or move unnaturally",
]
Simplified CLIP detection; see the code for the full implementation
Instead of training a custom model, I'm essentially asking CLIP: "Does this frame look more like a real person or a deepfake?" It's prompt engineering at its finest, leveraging CLIP's semantic understanding of what makes something look "off."
The scoring works by computing similarity between video frames and these prompts:
# Simplified version of the scoring logic
real_similarity = compute_similarity(frame, REAL_PERSON_PROMPTS)
fake_similarity = compute_similarity(frame, FAKE_PERSON_PROMPTS)
deepfake_score = fake_similarity - real_similarity
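For context, here's a minimal sketch of what compute_similarity could look like using Hugging Face's CLIP wrappers. This is my illustration, not the FakeCheck source; the checkpoint name and mean-pooling over prompts are assumptions.
# Sketch only: checkpoint and pooling strategy are assumptions, not FakeCheck's code
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def compute_similarity(frame: Image.Image, prompts: list) -> float:
    """Mean cosine similarity between one video frame and a set of text prompts."""
    inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize so the dot product is a true cosine similarity
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()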
On our kangaroo video? CLIP scored it 0.73 (LIKELY_FAKE). Not because it understood airport security protocols for marsupials, but because something about the lighting and textures screamed "artificial" to its trained eye.
Audio Witness: Whisper
Whisper is OpenAI's speech recognition model, and it's brilliant at transcription. Too brilliant, as it turns out. Here's how we use it:
# Transcription with a trust-but-verify approach
transcription = whisper_model.transcribe(
    wav_path,
    fp16=True,
    word_timestamps=True,
    verbose=None  # Get segment-level no_speech_prob
)
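That word_timestamps=True flag matters: openai-whisper attaches per-word timing to each segment, which is what a lip-sync check can line up against video frames. For reference, the structure looks roughly like this (field names follow openai-whisper's result format):
# Peek at the word-level timing returned when word_timestamps=True
for segment in transcription["segments"]:
    for word in segment.get("words", []):
        print(f"{word['start']:6.2f}-{word['end']:6.2f}s  {word['word']}")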
The idea is simple: transcribe the audio, then use that transcription to check if the lip movements match. It's a solid plan... until Whisper decides to get creative. But more on that disaster later.
The Expert Jury: Gemini
Google's Gemini serves as our expert analyst, performing four distinct checks (a sample request is sketched after the list):
- Visual artifact detection
- Lip-sync verification
- Blink pattern analysis
- OCR for gibberish text detection
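To make those checks concrete, here's roughly what a single Gemini request could look like with the google-generativeai SDK. This is my illustration rather than the FakeCheck source; the model name, prompt wording, and frame paths are all assumptions.
# Sketch only: model name, prompt, and frame paths are assumptions
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

prompt = (
    "You are a video forensics analyst. For the attached frames, report: "
    "1) visual artifacts, 2) lip-sync plausibility against the transcript, "
    "3) blink-pattern abnormalities, 4) any gibberish on-screen text. "
    "Respond as JSON with keys: artifacts, lipsync, blinks, gibberish_text."
)
frames = [Image.open(p) for p in ("frame_010.jpg", "frame_050.jpg", "frame_090.jpg")]
response = model.generate_content([prompt, *frames])
print(response.text)  # parsed as JSON downstream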
Getting consistent results from an LLM is like herding cats, so I developed what I call "load-bearing prompt injection":
# After lip-sync analysis
return {
    "flag": flag,
    "event": lip_sync_event,
    "instruction": f"Your output MUST mention the lip-sync result: {result}"
}
Without that last line, Gemini would sometimes just... forget to mention whether the lips matched the audio. It's like dealing with a brilliant but absent-minded professor.
The Whisper Hallucination Incident
Here's where things get interesting. I was testing FakeCheck with an AI-generated video of people speaking complete gibberish: not any human language, just synthetic mouth sounds.
What I expected from Whisper: "No speech detected" or maybe some admission of confusion.
What Whisper actually delivered: "The quarterly earnings report shows significant growth in the Asian markets, particularly in the technology sector where we've seen a 23% increase in user engagement..."
I'm not kidding. Whisper hallucinated an entire business presentation from pure nonsense.
The Cascade of Chaos
This created a domino effect of confusion:
- Whisper: "They're definitely discussing quarterly earnings!"
- Gemini: "Hmm, the lips don't match these words about market growth..."
- Lip-sync detector: "MISMATCH DETECTED!"
- Final verdict: "LIKELY_FAKE"
Right answer, completely wrong reasoning.
The Fix That Saved My Sanity
After losing several hours to this madness, I implemented what I call the "Whisper Reality Check":
# The "Whisper Reality Check"
NO_SPEECH_THRESHOLD = 0.85
avg_no_speech_prob = transcription.get("avg_no_speech_prob", 0.0)
if avg_no_speech_prob > NO_SPEECH_THRESHOLD:
logger.warning(f"High 'no speech' probability ({avg_no_speech_prob:.2f}). "
"Disabling lip-sync check to avoid hallucination cascade.")
lipsync_enabled = False
transcription["text"] = "[No speech detected]"
Basically, if Whisper isn't confident there's actual speech, we don't trust its transcription. It's like breathalyzing your witness before letting them testify.
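One note: openai-whisper doesn't return an avg_no_speech_prob field itself, only a per-segment no_speech_prob, so that aggregate has to be computed somewhere upstream. A minimal sketch of how it could be derived (averaging the segments is my own simplification):
# Sketch: derive an aggregate no-speech probability from whisper's segments
def add_avg_no_speech_prob(transcription: dict) -> dict:
    probs = [seg.get("no_speech_prob", 0.0) for seg in transcription.get("segments", [])]
    # No segments at all is treated as "definitely no speech"
    transcription["avg_no_speech_prob"] = sum(probs) / len(probs) if probs else 1.0
    return transcription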
The Fusion of Experts
This is where all the evidence comes together. Each detector gives its verdict, and we combine them with carefully tuned weights:
# The carefully hand-tuned weights (emphasis on "hand-tuned")
FUSION_MODEL_WEIGHTS = {
    "visual_clip": 0.28,                # CLIP knows its stuff
    "gemini_visual_artifacts": 0.37,    # Gemini spots the uncanny
    "gemini_lipsync_issue": 0.145,      # When it works...
    "gemini_blink_abnormality": 0.101,  # Blinks don't lie
    "gemini_gibberish_text": 0.077,     # Gibberish text detector
    "flow": 0.077,                      # Motion inconsistencies
}
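"Combining" here just means a weighted average of the per-detector scores. A minimal sketch of the fusion step (my illustration; the real code may clamp or normalize differently; since the weights above sum to roughly 1.05, dividing by their total keeps the result in [0, 1]):
# Sketch: weighted fusion of per-detector scores (each assumed to be in [0, 1])
def fuse_scores(detector_scores: dict) -> float:
    total_weight = sum(FUSION_MODEL_WEIGHTS.values())
    weighted = sum(
        weight * detector_scores.get(name, 0.0)
        for name, weight in FUSION_MODEL_WEIGHTS.items()
    )
    return weighted / total_weight  # normalized fused score

# e.g. fuse_scores({"visual_clip": 0.73, "gemini_visual_artifacts": 0.8, "flow": 0.4})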
How did I arrive at these specific numbers? I spent a weekend with 30 videos, a spreadsheet, and entirely too much coffee. It went something like:
- Hour 1: "I'll use equal weights!"
- Hour 3: "Okay, CLIP seems more reliable..."
- Hour 7: "Why is OCR triggering on every video with subtitles?"
- Hour 12: "These numbers feel right. Ship it."
The current approach is, frankly, more art than science. The dream would be an automated tuning pipeline:
# TODO: Replace my caffeine-fueled guesswork with science
def optimize_weights(labeled_videos):
    # Grid search? Genetic algorithm?
    # Anything better than "this feels right"
    pass
With a few hundred labeled examples, we could properly optimize these weights. Maybe even use some fancy Bayesian optimization. But for now, my hand-tuned weights work surprisingly well.
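For a concrete starting point, here's a minimal sketch of that tuning loop using plain random search over normalized weights. It's my stand-in for the TODO above; the labeled_videos format and the 0.5 decision threshold are assumptions.
# Sketch: random search over fusion weights against labeled examples
import random

def optimize_weights(labeled_videos, n_trials=2000, seed=0):
    """labeled_videos: list of (detector_scores: dict, is_fake: bool) pairs."""
    rng = random.Random(seed)
    names = list(FUSION_MODEL_WEIGHTS)
    best_weights, best_acc = dict(FUSION_MODEL_WEIGHTS), 0.0
    for _ in range(n_trials):
        raw = [rng.random() for _ in names]
        weights = {n: r / sum(raw) for n, r in zip(names, raw)}  # sums to 1
        correct = sum(
            (sum(weights[n] * scores.get(n, 0.0) for n in names) > 0.5) == is_fake
            for scores, is_fake in labeled_videos
        )
        acc = correct / len(labeled_videos)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc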
The Supporting Cast: Heuristic Detectors
Beyond the headline models, FakeCheck employs several heuristic detectors that catch specific anomalies:
Optical Flow Spike Detector
This one looks for sudden, unnatural movements in the video:
# Why we throttle: Videos are noisy, alerts shouldn't be
if ts < last_event_ts + 1.0:
    continue  # One anomaly per second is plenty
z = (mag - μ) / σ  # z-score of this frame's motion magnitude
if z > 2:  # Significant spike
    events.append({
        "event": "flow_spike",
        "ts": ts,
        "meta": {"z": round(z, 2)}
    })
    last_event_ts = ts  # update the throttle window
Early versions flagged every single frame. Turns out, video compression creates tons of tiny artifacts. The solution? Throttling. Because if everything's an anomaly, nothing is.
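For the curious, the per-frame motion magnitude those z-scores are computed over can come from dense optical flow. Here's a minimal sketch using OpenCV's Farneback flow; the exact backend FakeCheck uses is an assumption on my part.
# Sketch: mean dense-flow magnitude between consecutive frames (OpenCV Farneback)
import cv2
import numpy as np

def flow_magnitudes(video_path: str) -> list:
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY) if ok else None
    mags = []
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        mags.append(float(np.mean(mag)))  # feeds the μ/σ z-score above
        prev_gray = gray
    cap.release()
    return mags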
Production Reality: Where Things Get Spicy
The Protobuf Async/Await Disaster
Here's a fun one. Google's Gemini Python SDK has an... interesting bug where async calls sometimes fail with protobuf errors. My solution is a masterpiece of "this shouldn't work but it does":
import asyncio
import functools
import logging

logger = logging.getLogger(__name__)

async def safe_generate_content(model, content, max_retries=2):
    try:
        return await model.generate_content_async(content)
    except AttributeError as e:
        if "Unknown field" not in str(e):
            raise
        # The "this shouldn't work but it does" fallback
        logger.warning("Protobuf bug detected. Using sync fallback.")
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, functools.partial(model.generate_content, content))
When Google's own libraries fight each other, you improvise. This workaround has saved me countless hours of debugging. Sometimes the best code is the code that just works, elegance be damned.
Performance Reality Check
Let's talk numbers:
- Processing Speed: ~30 seconds per 30-second video (on GPU)
- CPU Performance: ~70 seconds (patience required)
- Accuracy on Test Set: ~70% (but defining "truth" is half the battle)
- Maximum Video Length: 30 seconds (for now)
That last limitation is crucial. FakeCheck works great for TikToks and short clips. Hour-long documentaries? That's a different beast entirely.
Scaling Beyond the Demo: Enter Mixpeek
Here's the thing about proof-of-concepts: they prove concepts. What they don't do is scale.
What FakeCheck Proves
- Multi-modal detection works
- Fusion scoring can catch what individual models miss
- The approach is sound (if a bit paranoid)
- Even with imperfect components, the ensemble is robust
What It Doesn't Handle
- Real-time streaming: Try analyzing a 2-hour livestream with this
- Distributed processing: Everything runs on one machine
- Model versioning: "Which version of CLIP caught that fake?"
- Production reliability: 99.9% uptime vs "works on my machine"
- Scale: Processing thousands of videos per minute? Good luck
This is where platforms like Mixpeek come in. They've solved the infrastructure challenges that turn proof-of-concepts into production systems:
- Scalable Pipeline Orchestration: Distribute processing across multiple machines
- Model Versioning and A/B Testing: Know exactly which model version made each detection
- Real-time Processing: Handle live streams, not just uploaded videos
- Enterprise Reliability: SLAs, monitoring, and all that production jazz
Think of it as the difference between a proof-of-concept drone and a production aircraft. Both fly, but only one should carry passengers.
DIY: Try FakeCheck Yourself
Want to play with FakeCheck?
Try the live demo at: https://fake-check.mixpeek.com/
Or run it yourself:
# Clone the repo
git clone https://github.com/mixpeek/fake-check

# Run Backend
cd fake-check/backend
pip install -r requirements.txt  # Install dependencies (pray to the pip gods)
cp .env.example .env             # Set up your API keys
python run_server.py             # Run detection server

# Run Frontend In New Terminal
cd fake-check/frontend
npm install
npm run dev

# Open http://localhost:5173/ in your browser
Fair Warning
- This is a PoC; expect rough edges and occasional explosions
- Requires GPU for reasonable performance (or extreme patience)
- Gemini API rate limits are real and they will hurt you
- If it breaks, you get to keep both pieces
Contributing
Found a video that breaks everything? Perfect! That's exactly what we need:
- Open an issue with the video (if shareable)
- Include the full error traceback
- Bonus points for proposing a fix or putting up a PR
Have ideas for improvement?
- Better fusion algorithm? Let's see it
- New detector module? PR it
- Performance optimizations? Yes please
What's Next for FakeCheck
- Automated Weight Tuning: Replace my coffee-driven optimization with actual machine learning
- More Sophisticated Heuristics: Heartbeat detection, micro-expression analysis
- Streaming Support: Because the world doesn't stop at 30 seconds
- Better Whisper Handling: Maybe a "hallucination detector for the hallucination detector"
The Bigger Picture
The deepfake detection arms race isn't slowing down. Every advancement in generation technology demands a corresponding advancement in detection. Today's state-of-the-art detector is tomorrow's training data for better deepfakes.
But here's the thing: we don't need perfect detection. We need good-enough detection deployed widely enough to make deepfake creation a high-risk, low-reward proposition.
Your Move
Whether you're:
A Developer:
- Try FakeCheck, break it, improve it
- Share your findings and edge cases
- Help us build better detection tools
A Decision-Maker:
- Consider how deepfake detection fits your content strategy
- Think about the infrastructure needed for scale
- Plan for the world where any video could be fake
Just Curious:
- Stay skeptical of viral videos
- Verify sources before sharing
- Maybe don't trust videos of kangaroos in airports
The battle for digital truth is far from over. Tools like FakeCheck are just the beginning, proof that we can fight back against the tide of synthetic media. But winning this war will take more than clever code. It'll take infrastructure, scale, and a healthy dose of paranoia.
After all, in a world where seeing is no longer believing, a little paranoia might just be the sanest response.
Want to explore deepfake detection at scale? Check out Mixpeek's platform for production-ready solutions. Or dive into the FakeCheck source code and help us make it better.
Found this helpful? Share it with someone who thinks all videos on the internet are real. They need the wake-up call.