
    Teaching CLIP, Whisper, and Gemini to Trust Nothing

    If we can't spot a kangaroo in an airport as fake, what hope do we have against political deepfakes? FakeCheck is a trust-nothing detection system that caught viral fakes using CLIP, Whisper and Gemini.


    Recently, millions watched a video of two women arguing in an airport while a kangaroo in a service vest stared into the camera. The internet was divided: hilarious reality or AI fever dream? Twitter exploded with hot takes. News outlets ran with it. Your aunt shared it on Facebook with three laughing emojis.

    💡
    All resources linked here: playbooks/fake-video-detection

    It was completely fake. Generated by Google's Veo3 in about 30 seconds.

    If we can't even spot a kangaroo in an airport as fake, what hope do we have against sophisticated political deepfakes? That's the thought that kept me up at night, leading to a caffeine-fueled coding marathon and the birth of FakeCheck, a paranoid little system that trusts nothing and questions everything.

    The Multimodal Detective Story

    Here's the thing about AI-generated video detection: it's like trying to spot a lie. One tell might be a coincidence, but when multiple things feel off, you're probably onto something. That's why FakeCheck works like a digital forensics team where each expert has trust issues.

    CLIP → Whisper → Gemini → Heuristics → Fusion → Verdict

    The philosophy is simple: One AI can be fooled, but can you fool them all?

    Instead of training a massive model from scratch (who has that kind of GPU budget?), I assembled a squad of pre-trained models, each looking for different signs of deception. It's like having a visual expert, an audio analyst, and a behavioral psychologist all reviewing the same suspect.
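
    In code terms, that squad boils down to a handful of independent checks feeding one fusion step. Here's a rough sketch of the orchestration (the helper names are illustrative placeholders, not the repo's exact API):

    # Hypothetical top-level pipeline; every helper below is a placeholder name
    async def analyze_video(video_path: str) -> dict:
        frames, wav_path = extract_frames_and_audio(video_path)  # frame sampling + ffmpeg
        scores = {
            "visual_clip": clip_deepfake_score(frames),           # visual witness
            **await run_gemini_checks(frames, wav_path),          # artifacts, lip-sync, blinks, OCR
            "flow": optical_flow_score(frames),                   # motion heuristics
        }
        fused = fuse_scores(scores)                               # weighted combination (see below)
        # 0.5 is an arbitrary illustrative threshold
        return {"score": fused, "verdict": "LIKELY_FAKE" if fused > 0.5 else "LIKELY_REAL"}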

    Meet the Detective Squad

    Visual Witness: CLIP

    CLIP is OpenAI's Contrastive Language-Image Pre-training model, and it turns out it's surprisingly effective at spotting fakes. But not for the reason you might think.

    CLIP wasn't trained to distinguish real from fake content. Instead, it learned rich visual representations by understanding how images relate to text descriptions across millions of examples.

    Here's the clever bit: We can leverage these learned representations by asking CLIP semantic questions about what it's seeing.

    # The magic: teaching CLIP what "fake" looks like semantically
    REAL_PERSON_PROMPTS = [
        "a typical frame from a live-action video recording of a real person or people",
        "a natural, unedited video still of a human being or more than one human being",
        "a person with natural skin texture and realistic lighting in a video"
    ]
    
    FAKE_PERSON_PROMPTS = [
        "an AI generated or deepfake face with unnatural features in a video",
        "a digitally altered face, a manipulated facial video frame",
        "eyes that look glassy, unfocused, or move unnaturally"
    ]
    

    Simplified CLIP detection - see the linked code for the full implementation

    Instead of training a custom model, I'm essentially asking CLIP: "Does this frame look more like a real person or a deepfake?" It's prompt engineering at its finest, leveraging CLIP's semantic understanding of what makes something look "off."

    The scoring works by computing similarity between video frames and these prompts:

    # Simplified version of the scoring logic
    real_similarity = compute_similarity(frame, REAL_PERSON_PROMPTS)
    fake_similarity = compute_similarity(frame, FAKE_PERSON_PROMPTS)
    deepfake_score = fake_similarity - real_similarity
    

    On our kangaroo video? CLIP scored it 0.73 (LIKELY_FAKE). Not because it understood airport security protocols for marsupials, but because something about the lighting and textures screamed "artificial" to its trained eye.
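
    If you're curious what compute_similarity might actually look like, here's a minimal sketch using Hugging Face's transformers CLIP wrapper. The checkpoint name and the simple averaging over prompts are my assumptions, not necessarily what the repo does:

    import torch
    from transformers import CLIPModel, CLIPProcessor
    
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    
    def compute_similarity(frame, prompts):
        """Mean image-text similarity between one frame (a PIL image) and a prompt set."""
        inputs = clip_processor(text=prompts, images=frame, return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = clip_model(**inputs)
        # logits_per_image has shape (1, num_prompts): scaled cosine similarities
        return outputs.logits_per_image.mean().item()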

    Audio Witness: Whisper

    Whisper is OpenAI's speech recognition model, and it's brilliant at transcription. Too brilliant, as it turns out. Here's how we use it:

    # Transcription with a trust-but-verify approach
    transcription = whisper_model.transcribe(
        wav_path, 
        fp16=True, 
        word_timestamps=True,
        verbose=None  # Get segment-level no_speech_prob
    )
    

    The idea is simple: transcribe the audio, then use that transcription to check if the lip movements match. It's a solid plan... until Whisper decides to get creative. But more on that disaster later.
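
    Before we get to that disaster, one setup detail: for the transcribe call above to run at all, the audio has to be pulled out of the video and the model loaded. A minimal sketch, where the ffmpeg flags and model size are my assumptions rather than the repo's exact setup:

    import subprocess
    import whisper
    
    # Extract a 16 kHz mono WAV track that Whisper can digest
    subprocess.run(
        ["ffmpeg", "-y", "-i", "input.mp4", "-ac", "1", "-ar", "16000", "audio.wav"],
        check=True,
    )
    wav_path = "audio.wav"
    whisper_model = whisper.load_model("base")  # model size is a guess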

    The Expert Jury: Gemini

    Google's Gemini serves as our expert analyst, performing four distinct checks:

    1. Visual artifact detection
    2. Lip-sync verification
    3. Blink pattern analysis
    4. OCR for gibberish text detection

    Getting consistent results from an LLM is like herding cats, so I developed what I call "load-bearing prompt injection":

    # After lip-sync analysis
    return {
        "flag": flag,
        "event": lip_sync_event,
        "instruction": f"Your output MUST mention the lip-sync result: {result}"
    }
    

    Without that last line, Gemini would sometimes just... forget to mention whether the lips matched the audio. It's like dealing with a brilliant but absent-minded professor.
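
    To make that "load-bearing" part concrete, here's a sketch of how those per-check instructions could be folded into the final summary prompt. The wording and function name are illustrative, not lifted from the repo:

    def build_summary_prompt(check_results):
        """Assemble the final Gemini prompt, appending each check's mandatory instruction."""
        instructions = [r["instruction"] for r in check_results if r.get("instruction")]
        return (
            "Summarize the evidence of manipulation in this video. "
            "Checks performed: visual artifacts, lip-sync, blink patterns, on-screen text.\n"
            + "\n".join(instructions)
        )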

    The Whisper Hallucination Incident

    Here's where things get interesting. I was testing FakeCheck with an AI-generated video of people speaking complete gibberish, not any human language, just synthetic mouth sounds.

    What I expected from Whisper: "No speech detected" or maybe some admission of confusion.

    What Whisper actually delivered: "The quarterly earnings report shows significant growth in the Asian markets, particularly in the technology sector where we've seen a 23% increase in user engagement..."

    I'm not kidding. Whisper hallucinated an entire business presentation from pure nonsense.

    The Cascade of Chaos

    This created a domino effect of confusion:

    1. Whisper: "They're definitely discussing quarterly earnings!"
    2. Gemini: "Hmm, the lips don't match these words about market growth..."
    3. Lip-sync detector: "MISMATCH DETECTED!"
    4. Final verdict: "LIKELY_FAKE"

    Right answer, completely wrong reasoning.

    The Fix That Saved My Sanity

    After losing several hours to this madness, I implemented what I call the "Whisper Reality Check":

    # The "Whisper Reality Check"
    NO_SPEECH_THRESHOLD = 0.85
    avg_no_speech_prob = transcription.get("avg_no_speech_prob", 0.0)
    
    if avg_no_speech_prob > NO_SPEECH_THRESHOLD:
        logger.warning(f"High 'no speech' probability ({avg_no_speech_prob:.2f}). " 
                       "Disabling lip-sync check to avoid hallucination cascade.")
        lipsync_enabled = False
        transcription["text"] = "[No speech detected]"
    

    Basically, if Whisper isn't confident there's actual speech, we don't trust its transcription. It's like breathalyzing your witness before letting them testify.
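
    One detail worth calling out: avg_no_speech_prob isn't something Whisper hands back directly; it has to be rolled up from the per-segment no_speech_prob values. A sketch of that rollup (the helper name is mine, not the repo's):

    def attach_avg_no_speech_prob(transcription):
        """Average Whisper's per-segment no_speech_prob into one confidence signal."""
        segments = transcription.get("segments", [])
        if segments:
            probs = [seg.get("no_speech_prob", 0.0) for seg in segments]
            transcription["avg_no_speech_prob"] = sum(probs) / len(probs)
        else:
            # No segments at all: treat it as "definitely no speech"
            transcription["avg_no_speech_prob"] = 1.0
        return transcription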

    The Fusion of Experts

    This is where all the evidence comes together. Each detector gives its verdict, and we combine them with carefully tuned weights:

    # The carefully hand-tuned weights (emphasis on "hand-tuned")
    FUSION_MODEL_WEIGHTS = {
        "visual_clip": 0.28,              # CLIP knows its stuff
        "gemini_visual_artifacts": 0.37,  # Gemini spots the uncanny
        "gemini_lipsync_issue": 0.145,     # When it works...
        "gemini_blink_abnormality": 0.101, # Blinks don't lie
        "gemini_gibberish_text": 0.077,    # Gibberish text detector
        "flow": 0.077,                     # Motion inconsistencies
    }
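
    The fusion step itself is nothing exotic: a weighted sum of per-detector scores, normalized over whichever detectors actually produced a score. Roughly (a sketch; the real scoring and thresholding may differ):

    def fuse_scores(detector_scores):
        """Weighted average of per-detector scores, each assumed to be in [0, 1]."""
        active = {k: v for k, v in detector_scores.items() if k in FUSION_MODEL_WEIGHTS}
        if not active:
            return 0.0
        weighted = sum(FUSION_MODEL_WEIGHTS[k] * v for k, v in active.items())
        total_weight = sum(FUSION_MODEL_WEIGHTS[k] for k in active)
        return weighted / total_weight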
    

    How did I arrive at these specific numbers? I spent a weekend with 30 videos, a spreadsheet, and entirely too much coffee. It went something like:

    • Hour 1: "I'll use equal weights!"
    • Hour 3: "Okay, CLIP seems more reliable..."
    • Hour 7: "Why is OCR triggering on every video with subtitles?"
    • Hour 12: "These numbers feel right. Ship it."

    The current approach is, frankly, more art than science. The dream would be an automated tuning pipeline:

    # TODO: Replace my caffeine-fueled guesswork with science
    def optimize_weights(labeled_videos):
        # Grid search? Genetic algorithm? 
        # Anything better than "this feels right"
        pass
    

    With a few hundred labeled examples, we could properly optimize these weights. Maybe even use some fancy Bayesian optimization. But for now, my hand-tuned weights work surprisingly well.
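
    One plausible way to fill in that TODO is a plain random search over normalized weight vectors against a labeled set, something like this sketch (it assumes per-video detector scores and a 0.5 decision threshold, neither of which comes from the repo):

    import random
    
    def optimize_weights(labeled_videos, n_trials=1000):
        """labeled_videos: list of (detector_scores: dict, is_fake: bool) pairs."""
        names = list(FUSION_MODEL_WEIGHTS)
        best_weights, best_acc = dict(FUSION_MODEL_WEIGHTS), 0.0
        for _ in range(n_trials):
            raw = [random.random() for _ in names]
            weights = {n: r / sum(raw) for n, r in zip(names, raw)}
            correct = sum(
                int((sum(weights[n] * s.get(n, 0.0) for n in names) > 0.5) == is_fake)
                for s, is_fake in labeled_videos
            )
            acc = correct / len(labeled_videos)
            if acc > best_acc:
                best_weights, best_acc = weights, acc
        return best_weights, best_acc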

    The Supporting Cast: Heuristic Detectors

    Beyond the headline models, FakeCheck employs several heuristic detectors that catch specific anomalies:

    Optical Flow Spike Detector

    This one looks for sudden, unnatural movements in the video:

    # Why we throttle: Videos are noisy, alerts shouldn't be
    if ts < last_event_ts + 1.0:
        continue  # One anomaly per second is plenty
    
    z = (mag - μ) / σ  # z-score for motion magnitude
    if z > 2:  # Significant spike
        events.append({
            "event": "flow_spike",
            "ts": ts,
            "meta": {"z": round(z, 2)}
        })
        last_event_ts = ts  # Reset the one-per-second throttle
    

    Early versions flagged every single frame. Turns out, video compression creates tons of tiny artifacts. The solution? Throttling. Because if everything's an anomaly, nothing is.
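
    For reference, the motion magnitude feeding those z-scores can come from dense optical flow, e.g. OpenCV's Farneback implementation. A sketch; the repo may use a different flow method or parameters:

    import cv2
    
    def mean_flow_magnitude(prev_frame, frame):
        """Average dense optical-flow magnitude between two consecutive BGR frames."""
        prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback dense flow with stock parameters
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        return float(magnitude.mean())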

    Production Reality: Where Things Get Spicy

    The Protobuf Async/Await Disaster

    Here's a fun one. Google's Gemini Python SDK has an... interesting bug where async calls sometimes fail with protobuf errors. My solution is a masterpiece of "this shouldn't work but it does":

    # Needs asyncio + functools for the sync fallback below
    import asyncio
    import functools
    
    async def safe_generate_content(model, content, max_retries=2):
        try:
            return await model.generate_content_async(content)
        except AttributeError as e:
            if "Unknown field" not in str(e):
                raise
            # The "this shouldn't work but it does" fallback
            logger.warning("Protobuf bug detected. Using sync fallback.")
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(None,
                functools.partial(model.generate_content, content))
    

    When Google's own libraries fight each other, you improvise. This workaround has saved me countless hours of debugging. Sometimes the best code is the code that just works, elegance be damned.

    Performance Reality Check

    Let's talk numbers:

    • Processing Speed: ~30 seconds per 30-second video (on GPU)
    • CPU Performance: ~70 seconds (patience required)
    • Accuracy on Test Set: ~70% (but defining "truth" is half the battle)
    • Maximum Video Length: 30 seconds (for now)

    That last limitation is crucial. FakeCheck works great for TikToks and short clips. Hour-long documentaries? That's a different beast entirely.

    Scaling Beyond the Demo: Enter Mixpeek

    Here's the thing about proof-of-concepts: they prove concepts. What they don't do is scale.

    What FakeCheck Proves

    • Multi-modal detection works
    • Fusion scoring can catch what individual models miss
    • The approach is sound (if a bit paranoid)
    • Even with imperfect components, the ensemble is robust

    What It Doesn't Handle

    • Real-time streaming: Try analyzing a 2-hour livestream with this
    • Distributed processing: Everything runs on one machine
    • Model versioning: "Which version of CLIP caught that fake?"
    • Production reliability: 99.9% uptime vs "works on my machine"
    • Scale: Processing thousands of videos per minute? Good luck

    This is where platforms like Mixpeek come in. They've solved the infrastructure challenges that turn proof-of-concepts into production systems:

    • Scalable Pipeline Orchestration: Distribute processing across multiple machines
    • Model Versioning and A/B Testing: Know exactly which model version made each detection
    • Real-time Processing: Handle live streams, not just uploaded videos
    • Enterprise Reliability: SLAs, monitoring, and all that production jazz

    Think of it as the difference between a proof-of-concept drone and a production aircraft. Both fly, but only one should carry passengers.

    DIY: Try FakeCheck Yourself

    Want to play with FakeCheck?

    Try the live demo at: https://fake-check.mixpeek.com/

    Or run it yourself:

    # Clone the repo
    git clone https://github.com/mixpeek/fake-check
    
    # Run Backend
    cd fake-check/backend
    pip install -r requirements.txt # Install dependencies (pray to the pip gods)
    cp .env.example .env            # Set up your API keys
    python run_server.py            # Run detection server
    
    # Run Frontend In New Terminal
    cd fake-check/frontend
    npm install
    npm run dev
    
    # Open http://localhost:5173/ in your browser
    

    Fair Warning

    • This is a PoC, expect rough edges and occasional explosions
    • Requires GPU for reasonable performance (or extreme patience)
    • Gemini API rate limits are real and they will hurt you
    • If it breaks, you get to keep both pieces

    Contributing

    Found a video that breaks everything? Perfect! That's exactly what we need:

    • Open an issue with the video (if shareable)
    • Include the full error traceback
    • Bonus points for proposing a fix or putting up a PR

    Have ideas for improvement?

    • Better fusion algorithm? Let's see it
    • New detector module? PR it
    • Performance optimizations? Yes please

    What's Next for FakeCheck

    1. Automated Weight Tuning: Replace my coffee-driven optimization with actual machine learning
    2. More Sophisticated Heuristics: Heartbeat detection, micro-expression analysis
    3. Streaming Support: Because the world doesn't stop at 30 seconds
    4. Better Whisper Handling: Maybe a "hallucination detector for the hallucination detector"

    The Bigger Picture

    The deepfake detection arms race isn't slowing down. Every advancement in generation technology demands a corresponding advancement in detection. Today's state-of-the-art detector is tomorrow's training data for better deepfakes.

    But here's the thing: we don't need perfect detection. We need good-enough detection deployed widely enough to make deepfake creation a high-risk, low-reward proposition.

    Your Move

    Whether you're:

    A Developer:

    • Try FakeCheck, break it, improve it
    • Share your findings and edge cases
    • Help us build better detection tools

    A Decision-Maker:

    • Consider how deepfake detection fits your content strategy
    • Think about the infrastructure needed for scale
    • Plan for the world where any video could be fake

    Just Curious:

    • Stay skeptical of viral videos
    • Verify sources before sharing
    • Maybe don't trust videos of kangaroos in airports

    The battle for digital truth is far from over. Tools like FakeCheck are just the beginning, proof that we can fight back against the tide of synthetic media. But winning this war will take more than clever code. It'll take infrastructure, scale, and a healthy dose of paranoia.

    After all, in a world where seeing is no longer believing, a little paranoia might just be the sanest response.

    Want to explore deepfake detection at scale? Check out Mixpeek's platform for production-ready solutions. Or dive into the FakeCheck source code and help us make it better.

    Found this helpful? Share it with someone who thinks all videos on the internet are real. They need the wake-up call.
