    AI-Powered Body-Worn Camera Video Analysis

    Automated transcription, face clustering, firearms detection, and semantic chapterization for BWC footage — fully self-hosted for CJIS compliance. No data leaves your VPC.

    From Raw BWC Footage to Structured Evidence

    Five parallel extraction pipelines decompose every body-worn camera video into searchable, structured intelligence — faces, firearms, transcript, speakers, and incident chapters.

    Transcription & Speaker Diarization

    NVIDIA Parakeet TDT (6.34% WER) with SpeechBrain noise enhancement for BWC audio. pyannote speaker diarization handles unlimited speakers with overlapping speech detection. Every word gets a timestamp and speaker ID.

    6.3% WER
    Unlimited speakers
    Word-level timestamps

    Face Detection & Cross-Camera Clustering

    SCRFD detection with AdaFace IR101 embeddings, optimized for degraded BWC image quality. HDBSCAN clusters faces into identities across all cameras automatically, with no manual threshold tuning.

    512-dim embeddings
    Auto clustering
    107 cross-camera matches

    Firearms Detection & Tracking

    4-tier pipeline: YOLO-World zero-shot screening on frames sampled at 1-5 FPS, BoT-SORT temporal tracking with camera motion compensation, Grounding DINO verification, and optional SAM 2 segmentation for forensic masks.

    4 detection tiers
    64 tracked events
    0.94 peak confidence

    How It Works

    Upload BWC footage. Get structured evidence intelligence back in minutes.

    1

    Ingest BWC Footage

    Upload MP4/MOV body-worn camera files with officer, camera, and incident metadata. Multiple cameras per incident.

    2

    5-Stage Parallel Extraction

    Video decomposition, face embedding, firearms detection, speaker diarization, and semantic chapterization run concurrently on GPU.

    3

    Cross-Video Synthesis

    HDBSCAN clusters faces and speakers across cameras. Temporal alignment builds a unified incident timeline from all angles.

    4

    Prosecutor Retrieval

    Natural language queries return forensic timelines with face IDs, weapon events, transcript excerpts, and incident phase classifications.

    Benchmark Results

    Tested on 35 minutes of real BWC footage from a multi-officer incident across 4 cameras

    64 weapon events tracked (324 raw detections, temporally clustered)
    239 face clusters identified (3,208 embeddings from 4,876 crops)
    107 cross-camera matches (individuals seen across multiple BWCs)
    44 semantic chapters (with forensic VLM summaries)
    ~10-15 min per hour of video on an A100 GPU
    2x A100 80GB recommended for parallel pipelines
    100% local, zero external API calls

    CJIS Compliant by Design

    Every component is self-hosted in your AWS VPC. No evidence data ever leaves your infrastructure.

    Zero External API Calls

    No Vertex AI, OpenAI, or any other third-party inference. All models run self-hosted inside the customer's VPC.

    Air-Gapped Inference

    Custom extractor containers bundle all model weights. The GPU cluster has no internet access.

    Full Chain of Custody

    Mixpeek lineage tracking records every extraction, transformation, and retrieval for evidentiary audit trails.

    Open-Source, Licensed Models

    AdaFace (MIT), SpeechBrain (Apache-2.0), Qwen2.5-VL (Apache-2.0), SigLIP (Apache-2.0) — all verified for criminal justice use.

    Model Stack

    All models are open-source, under licenses that permit commercial use. Every model runs self-hosted inside your VPC.

    Task                    Model                     License      GPU
    ASR                     NVIDIA Parakeet TDT v3    CC-BY-4.0    T4+
    Speaker Diarization     pyannote 3.1              MIT          T4+
    Face Detection          SCRFD                     Apache-2.0   CPU
    Face Embeddings         AdaFace IR101             MIT          T4+
    Firearms Detection      YOLO-World                GPL-3.0      A10+
    Firearms Verification   Grounding DINO 1.5        Apache-2.0   A10+
    Weapon Segmentation     SAM 2                     Apache-2.0   A10+
    VLM (Chapters)          Qwen2.5-VL-7B             Apache-2.0   A100
    Chapter Boundaries      ruptures PELT             BSD          CPU
    Text Embeddings         E5-Large                  MIT          T4+

    Built For

    County Prosecutors

    Natural language queries over BWC evidence: 'Show me everywhere the suspect appears' returns a cross-camera timeline with face IDs, weapon events, and transcript.

    Internal Affairs & Use-of-Force Review

    Automated incident phase classification (foot pursuit, shots fired, apprehension) with multi-angle corroboration from all BWC cameras on scene.

    Evidence Management Teams

    Process hundreds of hours of BWC footage per week. Structured metadata extraction replaces manual tagging — every video gets faces, weapons, transcript, and chapters automatically.

    Police Department Leadership

    Aggregate analytics across incidents: weapon deployment frequency, use-of-force patterns, response time distributions — all derived from BWC footage, not manual reports.

    Frequently Asked Questions

    How does Mixpeek maintain CJIS compliance?

    All processing runs entirely within your AWS VPC. Custom extractor containers bundle the model weights, and the GPU cluster has no internet access. No evidence data is sent to third-party APIs (no Vertex AI, OpenAI, or Anthropic). Built-in text embeddings use E5-Large, which runs locally on Ray. Retriever LLM stages route to a self-hosted Qwen2.5-VL instance via a local vLLM endpoint. Full lineage tracking provides chain-of-custody audit trails for every extraction and retrieval.

    What models are used for transcription?

    The primary ASR model is NVIDIA Parakeet TDT v3 (600M params, CC-BY-4.0 license) with a 6.34% word error rate, lower than Whisper's 7.44%. SpeechBrain SepFormer handles noise enhancement for BWC audio with wind, sirens, and radio interference. Speaker diarization uses pyannote 3.1 (MIT license), which does not require a fixed speaker count and handles overlapping speech. Optional forced alignment uses Qwen3-ForcedAligner for legal-grade word timestamps.
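
As an illustration of how "every word gets a timestamp and speaker ID" can work, the sketch below assigns each transcribed word to the diarization turn it overlaps most. It is a minimal sketch, not the product's code: the `Word` and `Turn` types and the `label_words` helper are hypothetical, and the toy values stand in for real ASR and diarization output.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from start of video
    end: float


@dataclass
class Turn:
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, turns):
    """Attach to each word the speaker whose turn overlaps it the most.

    Words that overlap no turn get speaker None.
    """
    labeled = []
    for w in words:
        best_speaker, best_overlap = None, 0.0
        for t in turns:
            ov = overlap(w.start, w.end, t.start, t.end)
            if ov > best_overlap:
                best_speaker, best_overlap = t.speaker, ov
        labeled.append((w.text, w.start, best_speaker))
    return labeled


# Toy input: two turns that overlap between 0.8 s and 0.9 s.
words = [Word("drop", 0.0, 0.4), Word("it", 0.4, 0.6), Word("okay", 1.1, 1.5)]
turns = [Turn("SPEAKER_00", 0.0, 0.9), Turn("SPEAKER_01", 0.8, 2.0)]
result = label_words(words, turns)
```

Maximum overlap is only one possible assignment rule; a word that falls inside overlapping speech could instead be tagged with both speakers.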

    How does cross-camera face clustering work?

    SCRFD detects faces at 2 FPS sampling. AdaFace IR101 (MIT license) generates 512-dimensional embeddings optimized for degraded image quality. It outperforms ArcFace on surveillance benchmarks specifically because it down-weights unrecognizable faces during training. BoT-SORT groups faces into per-video tracks with camera motion compensation. HDBSCAN then clusters track-level embeddings across all cameras with no manual threshold tuning, followed by an agglomerative merge of cluster centroids to catch same-person splits across lighting changes.
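
The centroid-merge step can be sketched as follows. This is a simplified, standard-library illustration rather than the production implementation: it assumes cluster labels have already been produced (with -1 marking noise, as HDBSCAN does), uses 2-dimensional toy vectors in place of 512-dimensional embeddings, and merges clusters single-linkage style whenever centroid cosine similarity reaches `min_cosine`. The 0.6 threshold and all function names are assumptions.

```python
import math
from collections import defaultdict


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def centroids(embeddings, labels):
    """Unit-length mean embedding per cluster, skipping noise (label -1)."""
    groups = defaultdict(list)
    for emb, lab in zip(embeddings, labels):
        if lab != -1:
            groups[lab].append(normalize(emb))
    out = {}
    for lab, vecs in groups.items():
        mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        out[lab] = normalize(mean)
    return out


def merge_clusters(cents, min_cosine=0.6):
    """Map each cluster label to a merged identity using union-find."""
    parent = {lab: lab for lab in cents}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    labs = sorted(cents)
    for i, a in enumerate(labs):
        for b in labs[i + 1:]:
            cos = sum(p * q for p, q in zip(cents[a], cents[b]))
            if cos >= min_cosine:
                parent[find(b)] = find(a)
    return {lab: find(lab) for lab in labs}


# Toy data: clusters 0 and 1 are the same person under different lighting.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.5, 0.5]]
labels = [0, 0, 1, 2, -1]
merged = merge_clusters(centroids(embeddings, labels))
```

Here clusters 0 and 1 collapse into one identity while cluster 2 stays separate and the noise point is ignored.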

    What is the firearms detection pipeline?

    A 4-tier pipeline: (1) YOLO-World zero-shot screening at 1-5 FPS with open vocabulary prompts for handgun, pistol, rifle, shotgun, firearm, weapon, and gun. (2) BoT-SORT temporal tracking with camera motion compensation — requires 3 detections in 5 frames to trigger, eliminating single-frame false positives from radios or dark phones. (3) Grounding DINO 1.5 verification on tracked detections only (~1% of frames). (4) Optional SAM 2 segmentation for forensic weapon masks. A fine-tuned YOLOv11 on firearms datasets can replace Tier 1 for higher accuracy.
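
The "3 detections in 5 frames" rule from Tier 2 can be sketched as a sliding-window counter. This is an illustrative sketch only: in the pipeline described above the rule would apply per BoT-SORT track, whereas the hypothetical `TrackTrigger` class below works on a plain sequence of per-frame detection flags.

```python
from collections import deque


class TrackTrigger:
    """Fires once at least `min_hits` of the last `window` frames had a detection."""

    def __init__(self, min_hits=3, window=5):
        self.min_hits = min_hits
        self.history = deque(maxlen=window)  # keeps only the newest `window` flags

    def update(self, detected):
        self.history.append(bool(detected))
        return sum(self.history) >= self.min_hits


def first_trigger_frame(flags, min_hits=3, window=5):
    """Index of the first frame at which the rule fires, or None if it never does."""
    trigger = TrackTrigger(min_hits, window)
    for idx, flag in enumerate(flags):
        if trigger.update(flag):
            return idx
    return None


# A single-frame false positive (for example a radio or dark phone) never triggers.
isolated = first_trigger_frame([0, 0, 1, 0, 0, 0, 0])
# Three hits inside a five-frame window trigger on the third hit.
sustained = first_trigger_frame([1, 0, 1, 0, 1])
```

Because only triggered tracks go on to Grounding DINO verification, this filter is what keeps the expensive tier limited to a small share of frames.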

    How does semantic chapterization differ from scene detection?

    Traditional scene detection (PySceneDetect) finds visual cuts in edited video — wrong for continuous BWC footage, where it mostly triggers on camera motion and lighting changes. Our chapterization uses ruptures PELT change-point detection on 4 combined signals: SigLIP visual embeddings, audio energy, transcript topic similarity, and optical flow motion classification. This finds semantic event boundaries — foot pursuit begins, confrontation starts, suspect detained — not visual cuts. Each chapter gets a forensic summary from a local Qwen2.5-VL-7B instance.
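
To make the change-point idea concrete, here is a small standard-library sketch of penalized change-point detection on a multi-signal time series. It does not use the ruptures API. It implements the plain optimal-partitioning recurrence with a squared-error segment cost, which is the recurrence that PELT accelerates by pruning candidates. The four columns stand in for the four signals named above, and the penalty value and toy data are assumptions.

```python
def detect_boundaries(signal, penalty):
    """Return sorted change-point indices for a list of feature vectors.

    Minimizes (sum of within-segment squared error) + penalty per change point.
    """
    n, dims = len(signal), len(signal[0])
    # Prefix sums let us compute any segment cost in O(dims).
    s1 = [[0.0] * dims]  # running per-dimension sums
    s2 = [0.0]           # running sum of squares over all dimensions
    for row in signal:
        s1.append([a + x for a, x in zip(s1[-1], row)])
        s2.append(s2[-1] + sum(x * x for x in row))

    def cost(s, t):
        """Squared error of signal[s:t] around its own mean."""
        sq = s2[t] - s2[s]
        mean_term = sum((s1[t][d] - s1[s][d]) ** 2 for d in range(dims)) / (t - s)
        return sq - mean_term

    best = [float("inf")] * (n + 1)  # best[t]: optimal cost of signal[:t]
    best[0] = -penalty               # so the first segment carries no penalty
    prev = [0] * (n + 1)             # prev[t]: start of the last segment
    for t in range(1, n + 1):
        for s in range(t):
            c = best[s] + cost(s, t) + penalty
            if c < best[t]:
                best[t], prev[t] = c, s

    bounds, t = [], n
    while t > 0:  # walk back through the chosen segment starts
        t = prev[t]
        if t > 0:
            bounds.append(t)
    return sorted(bounds)


# Toy series: 6 calm steps, then 6 steps where all four signals jump.
calm = [0.1, 0.0, 0.2, 0.1]    # visual, audio energy, topic, motion (toy values)
active = [0.9, 1.0, 0.8, 0.9]
signal = [calm] * 6 + [active] * 6
boundaries = detect_boundaries(signal, penalty=1.0)
```

A larger penalty yields fewer, longer chapters; a smaller one splits the footage more finely.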

    What hardware is required for deployment?

    Minimum: 1x A100 80GB (runs all models sequentially). Recommended: 2x A100 80GB for parallel ASR + face + weapons pipelines. The vLLM server for retriever LLM stages (Qwen2.5-72B-Instruct) needs 1x A100 80GB. Supporting infrastructure: 4-core/16GB for API + Celery, 8-core/32GB for Qdrant vector storage, managed DocumentDB for metadata, S3 with VPC endpoint for evidence files, and ElastiCache Redis for queuing.

    How fast does the pipeline process video?

    On an A100 GPU, the full pipeline (all 5 extraction stages) processes approximately 1 hour of BWC footage in 10-15 minutes. In our benchmark on 35 minutes of footage across 4 cameras, the SOTA v2 pipeline completed in 46 minutes on CPU (M3 Ultra), 2.1x faster than v1 and with better output quality. We project that a GPU brings this to under 15 minutes for the same footage.
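
The figures above can be cross-checked with a little arithmetic. The sketch below assumes the 35 minutes is the total footage duration across the 4 cameras (the text does not say so explicitly), and the variable names are illustrative.

```python
footage_min = 35             # benchmark footage, assumed to be the total across cameras
cpu_runtime_min = 46         # v2 pipeline on CPU (M3 Ultra)
speedup_vs_v1 = 2.1          # v2 is stated to be 2.1x faster than v1
gpu_min_per_hour = (10, 15)  # stated A100 rate: minutes of compute per hour of video

# Minutes of CPU compute per minute of footage (above 1 means slower than real time).
cpu_realtime_factor = cpu_runtime_min / footage_min

# CPU runtime that the 2.1x figure implies for v1 on the same footage.
implied_v1_runtime_min = cpu_runtime_min * speedup_vs_v1

# Expected A100 runtime range for the benchmark footage at the stated rate.
gpu_projection_min = tuple(rate * footage_min / 60 for rate in gpu_min_per_hour)

print(f"CPU real-time factor: {cpu_realtime_factor:.2f}x")
print(f"Implied v1 runtime: {implied_v1_runtime_min:.1f} min")
print(f"GPU projection: {gpu_projection_min[0]:.1f}-{gpu_projection_min[1]:.1f} min")
```

Under that assumption the stated A100 rate gives roughly 6 to 9 minutes for the benchmark footage, which is consistent with the "under 15 minutes" projection.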

    Can prosecutors search evidence in natural language?

    Yes. The evidence-search retriever accepts natural language queries like 'Show me everywhere the suspect appears' or 'When were weapons drawn during the foot pursuit.' The retriever runs a multi-stage pipeline: semantic search over chapter embeddings, incident phase classification via taxonomy, document enrichment with face IDs and firearms events from other collections, temporal sorting, and a final LLM synthesis stage that produces a forensic timeline citing camera IDs and timestamps.
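
A rough sketch of three of those stages (semantic search, enrichment, temporal sorting) is shown below. It is not the Mixpeek retriever API: the record fields, function names, and 2-dimensional toy embeddings are all hypothetical, and the phase-classification and LLM-synthesis stages are left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search_chapters(query_vec, chapters, top_k=2):
    """Stage 1: rank chapters by embedding similarity to the query."""
    ranked = sorted(chapters, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:top_k]


def enrich(chapter, weapon_events):
    """Stage 2: attach weapon events from the same camera inside the chapter span."""
    hits = [
        e for e in weapon_events
        if e["camera"] == chapter["camera"] and chapter["start"] <= e["time"] < chapter["end"]
    ]
    return {**chapter, "weapon_events": hits}


def build_timeline(query_vec, chapters, weapon_events, top_k=2):
    """Stage 3: order the enriched matches by start time."""
    matched = search_chapters(query_vec, chapters, top_k)
    enriched = [enrich(c, weapon_events) for c in matched]
    return sorted(enriched, key=lambda c: c["start"])


# Toy records; times are seconds on a shared incident clock.
chapters = [
    {"camera": "cam1", "start": 0, "end": 60, "phase": "arrival", "embedding": [1.0, 0.0]},
    {"camera": "cam1", "start": 60, "end": 150, "phase": "foot pursuit", "embedding": [0.1, 1.0]},
    {"camera": "cam2", "start": 40, "end": 120, "phase": "foot pursuit", "embedding": [0.2, 0.9]},
]
weapon_events = [
    {"camera": "cam2", "time": 95.0, "label": "handgun"},
    {"camera": "cam1", "time": 10.0, "label": "handgun"},
]
timeline = build_timeline([0.0, 1.0], chapters, weapon_events)
```

The resulting list, ordered by time and carrying camera IDs and timestamps, is the kind of structured context a final synthesis stage could cite.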

    Ready to automate BWC evidence analysis?

    Self-hosted in your VPC, CJIS compliant, zero external API calls. Contact us to discuss your deployment.