From Raw BWC Footage to Structured Evidence
Five parallel extraction pipelines decompose every body-worn camera video into searchable, structured intelligence — faces, firearms, transcript, speakers, and incident chapters.
Transcription & Speaker Diarization
NVIDIA Parakeet TDT (6.34% WER) with SpeechBrain noise enhancement for BWC audio. pyannote speaker diarization handles unlimited speakers with overlapping speech detection. Every word gets a timestamp and speaker ID.
Face Detection & Cross-Camera Clustering
SCRFD detection with AdaFace IR101 embeddings, optimized for degraded BWC image quality. HDBSCAN clusters faces into identities across all cameras automatically, with no manual threshold tuning.
Firearms Detection & Tracking
4-tier pipeline: YOLO-World zero-shot screening on every sampled frame (1-5 FPS), BoT-SORT temporal tracking with camera motion compensation, Grounding DINO verification, and optional SAM 2 segmentation for forensic masks.
How It Works
Upload BWC footage. Get structured evidence intelligence back in minutes.
Ingest BWC Footage
Upload MP4/MOV body-worn camera files with officer, camera, and incident metadata. Multiple cameras per incident.
5-Stage Parallel Extraction
Video decomposition, face embedding, firearms detection, speaker diarization, and semantic chapterization run concurrently on GPU.
Cross-Video Synthesis
HDBSCAN clusters faces and speakers across cameras. Temporal alignment builds a unified incident timeline from all angles.
Prosecutor Retrieval
Natural language queries return forensic timelines with face IDs, weapon events, transcript excerpts, and incident phase classifications.
Benchmark Results
Tested on 35 minutes of real BWC footage from a multi-officer incident across 4 cameras
CJIS Compliant by Design
Every component is self-hosted in your AWS VPC. No evidence data ever leaves your infrastructure.
Zero External API Calls
No Vertex AI, OpenAI, or any third-party inference. All models run self-hosted inside the customer's VPC.
Air-Gapped Inference
Custom extractor containers bundle all model weights. The GPU cluster has no internet access.
Full Chain of Custody
Mixpeek lineage tracking records every extraction, transformation, and retrieval for evidentiary audit trails.
Open-Source, Licensed Models
AdaFace (MIT), SpeechBrain (Apache-2.0), Qwen2.5-VL (Apache-2.0), SigLIP (Apache-2.0) — all verified for criminal justice use.
Model Stack
All models are open-source under licenses that permit commercial use. Every model runs self-hosted in your VPC.
| Task | Model | License | GPU |
|---|---|---|---|
| ASR | NVIDIA Parakeet TDT v3 | CC-BY-4.0 | T4+ |
| Speaker Diarization | pyannote 3.1 | MIT | T4+ |
| Face Detection | SCRFD | Apache-2.0 | CPU |
| Face Embeddings | AdaFace IR101 | MIT | T4+ |
| Firearms Detection | YOLO-World | GPL-3.0 | A10+ |
| Firearms Verification | Grounding DINO 1.5 | Apache-2.0 | A10+ |
| Weapon Segmentation | SAM 2 | Apache-2.0 | A10+ |
| VLM (Chapters) | Qwen2.5-VL-7B | Apache-2.0 | A100 |
| Chapter Boundaries | ruptures PELT | BSD | CPU |
| Text Embeddings | E5-Large | MIT | T4+ |
Built For
County Prosecutors
Natural language queries over BWC evidence: 'Show me everywhere the suspect appears' returns a cross-camera timeline with face IDs, weapon events, and transcript.
Internal Affairs & Use-of-Force Review
Automated incident phase classification (foot pursuit, shots fired, apprehension) with multi-angle corroboration from all BWC cameras on scene.
Evidence Management Teams
Process hundreds of hours of BWC footage per week. Structured metadata extraction replaces manual tagging — every video gets faces, weapons, transcript, and chapters automatically.
Police Department Leadership
Aggregate analytics across incidents: weapon deployment frequency, use-of-force patterns, response time distributions — all derived from BWC footage, not manual reports.
Frequently Asked Questions
How does Mixpeek maintain CJIS compliance?
All processing runs entirely within your AWS VPC. Custom extractor containers bundle model weights, and the GPU cluster has no internet access. No evidence data is sent to third-party APIs (no Vertex AI, OpenAI, or Anthropic). Built-in text embeddings use E5-Large, which runs locally on Ray. Retriever LLM stages route to a self-hosted Qwen2.5-VL instance via a local vLLM endpoint. Full lineage tracking provides chain-of-custody audit trails for every extraction and retrieval.
What models are used for transcription?
The primary ASR model is NVIDIA Parakeet TDT v3 (600M params, CC-BY-4.0 license) with a 6.34% word error rate, better than Whisper's 7.44%. SpeechBrain SepFormer handles noise enhancement for BWC audio with wind, sirens, and radio interference. Speaker diarization uses pyannote 3.1 (MIT license), which supports unlimited speakers and handles overlapping speech. Optional forced alignment uses Qwen3-ForcedAligner for legal-grade word timestamps.
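One way to picture how every word ends up with both a timestamp and a speaker ID is to join the ASR word timings against the diarization turns. The sketch below is illustrative only: the `Word` and `Turn` types, the `assign_speakers` helper, and the "largest overlap wins" rule are assumptions made for this example, not the pipeline's actual code or the pyannote API.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the video
    end: float

@dataclass
class Turn:
    speaker: str
    start: float
    end: float

def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, turns):
    """Label each word with the diarization turn that overlaps it most."""
    labeled = []
    for w in words:
        best = max(turns, key=lambda t: overlap(w.start, w.end, t.start, t.end),
                   default=None)
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            speaker = "UNKNOWN"  # word falls outside every diarized turn
        else:
            speaker = best.speaker
        labeled.append((w.start, speaker, w.text))
    return labeled

# Hypothetical data: two turns that overlap between 0.8 s and 0.9 s.
words = [Word("drop", 0.10, 0.40), Word("it", 0.45, 0.60), Word("okay", 1.20, 1.50)]
turns = [Turn("SPEAKER_00", 0.0, 0.9), Turn("SPEAKER_01", 0.8, 2.0)]
labeled = assign_speakers(words, turns)
```

Because turns are allowed to overlap, a word spoken during crosstalk is attributed to whichever speaker covers more of it. A production system would likely keep both candidates for review.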
How does cross-camera face clustering work?
SCRFD detects faces on frames sampled at 2 FPS. AdaFace IR101 (MIT license) generates 512-dimensional embeddings optimized for degraded image quality. It outperforms ArcFace on surveillance benchmarks specifically because it down-weights unrecognizable faces during training. BoT-SORT groups faces into per-video tracks with camera motion compensation. HDBSCAN then clusters track-level embeddings across all cameras with no manual threshold tuning, followed by an agglomerative merge of cluster centroids to catch same-person splits across lighting changes.
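The final centroid-merge step can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses single-linkage merging with a hypothetical cosine-similarity threshold of 0.6, toy 3-dimensional vectors in place of 512-dimensional AdaFace embeddings, and made-up cluster IDs. The real merge criterion is not described on this page.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def merge_clusters(centroids, threshold=0.6):
    """Single-linkage merge: union any two clusters whose centroids are similar.

    `centroids` maps a cluster ID to its centroid vector. The threshold is a
    hypothetical value chosen for this example.
    """
    ids = list(centroids)
    parent = {i: i for i in ids}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            if cosine(centroids[a], centroids[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical centroids: the same person seen by two cameras, plus a second person.
centroids = {
    "cam1_c0": [1.0, 0.0, 0.0],
    "cam2_c3": [0.9, 0.1, 0.0],
    "cam3_c1": [0.0, 1.0, 0.0],
}
identities = merge_clusters(centroids)
```

Here `cam1_c0` and `cam2_c3` collapse into one identity while `cam3_c1` stays separate.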
What is the firearms detection pipeline?
A 4-tier pipeline: (1) YOLO-World zero-shot screening at 1-5 FPS with open-vocabulary prompts for handgun, pistol, rifle, shotgun, firearm, weapon, and gun. (2) BoT-SORT temporal tracking with camera motion compensation: a track must register 3 detections within 5 frames before it triggers, which eliminates single-frame false positives from radios or dark phones. (3) Grounding DINO 1.5 verification on tracked detections only (~1% of frames). (4) Optional SAM 2 segmentation for forensic weapon masks. A fine-tuned YOLOv11 model trained on firearms datasets can replace Tier 1 for higher accuracy.
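The "3 detections in 5 frames" rule from Tier 2 can be illustrated with a sliding window over one track's per-frame hits. This is a minimal sketch: the function name and the 0/1 hit lists are invented for the example, and a real tracker would apply the rule per track ID.

```python
from collections import deque

def confirmed_frames(hits, min_hits=3, window=5):
    """Return the frame indices at which a track counts as confirmed.

    `hits` holds one truthy/falsy value per sampled frame for a single track.
    A frame is confirmed once at least `min_hits` of the last `window` frames
    (including the current one) contained a detection.
    """
    recent = deque(maxlen=window)  # old frames fall off automatically
    confirmed = []
    for idx, hit in enumerate(hits):
        recent.append(bool(hit))
        if sum(recent) >= min_hits:
            confirmed.append(idx)
    return confirmed

# A radio briefly mistaken for a weapon: isolated hits never reach 3 of 5.
flicker = [0, 1, 0, 0, 0, 1, 0, 0]
# A drawn firearm: hits persist across consecutive frames.
steady = [0, 1, 1, 0, 1, 1, 0, 0]

flicker_result = confirmed_frames(flicker)
steady_result = confirmed_frames(steady)
```

Only confirmed frames would be passed on to the more expensive verification tier, which is what keeps that tier to a small share of frames.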
How does semantic chapterization differ from scene detection?
Traditional scene detection (PySceneDetect) finds visual cuts in edited video — wrong for continuous BWC footage, where it mostly triggers on camera motion and lighting changes. Our chapterization uses ruptures PELT change-point detection on 4 combined signals: SigLIP visual embeddings, audio energy, transcript topic similarity, and optical flow motion classification. This finds semantic event boundaries — foot pursuit begins, confrontation starts, suspect detained — not visual cuts. Each chapter gets a forensic summary from a local Qwen2.5-VL-7B instance.
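To make the change-point idea concrete, here is a small pure-Python sketch of a PELT-style search over four fused signals. It is a stand-in written for illustration and does not call the ruptures library. The assumptions are: each signal has already been reduced to one number per second, the signals are z-scored and stacked, the segment cost is squared error, and the penalty of 5.0 is an arbitrary example value.

```python
from statistics import mean, pstdev

def zscore(xs):
    """Scale a signal to zero mean and unit variance so no signal dominates."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd if sd else 0.0 for x in xs]

def pelt(rows, penalty):
    """Penalized change-point search (PELT) with a squared-error segment cost."""
    n, dims = len(rows), len(rows[0])
    # Prefix sums let us compute any segment cost in O(dims).
    s1, s2 = [[0.0] * dims], [0.0]
    for row in rows:
        s1.append([a + b for a, b in zip(s1[-1], row)])
        s2.append(s2[-1] + sum(v * v for v in row))

    def cost(a, b):
        # Sum of squared deviations from the segment mean over rows[a:b].
        tot = sum((s1[b][d] - s1[a][d]) ** 2 for d in range(dims))
        return (s2[b] - s2[a]) - tot / (b - a)

    best = [-penalty] + [0.0] * n   # best[t]: optimal objective for rows[:t]
    prev = [0] * (n + 1)            # prev[t]: start of the last segment ending at t
    candidates = [0]
    for t in range(1, n + 1):
        scored = [(best[a] + cost(a, t) + penalty, a) for a in candidates]
        best[t], prev[t] = min(scored)
        # Pruning step: drop segment starts that can no longer be optimal.
        candidates = [a for val, a in scored if val - penalty <= best[t]] + [t]

    cuts, t = [], n
    while prev[t] > 0:              # walk back through the segment starts
        t = prev[t]
        cuts.append(t)
    return sorted(cuts)

# Hypothetical per-second signals: calm for 6 s, then a foot pursuit begins.
visual_drift = [0.1, 0.1, 0.2, 0.1, 0.1, 0.2, 0.8, 0.9, 0.8, 0.9, 0.8, 0.8]
audio_energy = [0.2, 0.3, 0.2, 0.2, 0.3, 0.2, 0.9, 0.8, 0.9, 0.9, 0.8, 0.9]
topic_shift = [0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.7, 0.6, 0.7, 0.6, 0.7, 0.6]
motion_class = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # 0 = stationary, 1 = running

signals = [visual_drift, audio_energy, topic_shift, motion_class]
rows = list(zip(*[zscore(s) for s in signals]))
boundaries = pelt(rows, penalty=5.0)
```

The small frame-to-frame wobble inside each half does not trigger a cut because each extra boundary costs more in penalty than it saves in error. That is the property that separates this approach from cut detection on camera shake.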
What hardware is required for deployment?
Minimum: 1x A100 80GB (runs all models sequentially). Recommended: 2x A100 80GB for parallel ASR + face + weapons pipelines. The vLLM server for retriever LLM stages (Qwen2.5-72B-Instruct) needs 1x A100 80GB. Supporting infrastructure: 4-core/16GB for API + Celery, 8-core/32GB for Qdrant vector storage, managed DocumentDB for metadata, S3 with VPC endpoint for evidence files, and ElastiCache Redis for queuing.
How fast does the pipeline process video?
On an A100 GPU, the full pipeline (all 5 extraction stages) processes approximately 1 hour of BWC footage in 10-15 minutes. In our benchmark on 35 minutes of footage across 4 cameras, the SOTA v2 pipeline completed in 46 minutes on CPU (M3 Ultra), which is 2.1x faster than v1 with dramatically better quality. We project that a GPU brings the same footage to under 15 minutes.
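The figures above can be cross-checked with simple arithmetic. The sketch below assumes that the 35 minutes is the total footage duration across all 4 cameras and that GPU time scales linearly with footage length. Both are assumptions, not statements from the benchmark.

```python
FOOTAGE_MIN = 35.0               # benchmark footage (assumed total across 4 cameras)
CPU_V2_MIN = 46.0                # reported v2 wall-clock time on CPU (M3 Ultra)
V2_SPEEDUP = 2.1                 # reported v2 speedup over v1
GPU_MIN_PER_HOUR = (10.0, 15.0)  # reported A100 time per hour of footage

# Minutes of compute per minute of footage on CPU.
cpu_rtf = CPU_V2_MIN / FOOTAGE_MIN

# v1 wall-clock time implied by the reported speedup.
v1_implied_min = CPU_V2_MIN * V2_SPEEDUP

# Linear scaling of the per-hour GPU figure down to the benchmark footage.
gpu_range_min = tuple(FOOTAGE_MIN * m / 60.0 for m in GPU_MIN_PER_HOUR)

print(f"CPU v2: {cpu_rtf:.2f} min of compute per min of footage")
print(f"v1 implied: {v1_implied_min:.1f} min")
print(f"A100 estimate: {gpu_range_min[0]:.1f} to {gpu_range_min[1]:.1f} min")
```

Under these assumptions the A100 estimate for the benchmark footage lands at roughly 6 to 9 minutes, which is consistent with the "under 15 minutes" projection.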
Can prosecutors search evidence in natural language?
Yes. The evidence-search retriever accepts natural language queries like 'Show me everywhere the suspect appears' or 'When were weapons drawn during the foot pursuit.' The retriever runs a multi-stage pipeline: semantic search over chapter embeddings, incident phase classification via taxonomy, document enrichment with face IDs and firearms events from other collections, temporal sorting, and a final LLM synthesis stage that produces a forensic timeline citing camera IDs and timestamps.
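The enrichment and temporal-sorting stages can be sketched as a plain join between retrieved chapters and events from the face and firearms collections. Everything here is hypothetical: the dictionary fields, the `build_timeline` helper, and the citation format are invented for illustration and are not the Mixpeek retriever API. Semantic search and the final LLM synthesis stage are left out.

```python
def fmt(seconds):
    """Format seconds on the shared incident clock as MM:SS."""
    m, s = divmod(int(seconds), 60)
    return f"{m:02d}:{s:02d}"

def build_timeline(chapters, events):
    """Attach overlapping events to each chapter, then order everything by time."""
    timeline = []
    for ch in sorted(chapters, key=lambda c: c["start"]):
        # Enrichment: pull in events from any camera that fall inside the chapter.
        inside = [e for e in events if ch["start"] <= e["t"] < ch["end"]]
        timeline.append({
            "phase": ch["phase"],
            "window": f'{fmt(ch["start"])}-{fmt(ch["end"])}',
            "citations": [
                f'{e["camera_id"]}@{fmt(e["t"])} {e["kind"]}:{e["label"]}'
                for e in sorted(inside, key=lambda e: e["t"])
            ],
        })
    return timeline

# Hypothetical retrieval results, deliberately out of order.
chapters = [
    {"phase": "apprehension", "start": 95.0, "end": 140.0},
    {"phase": "foot pursuit", "start": 30.0, "end": 95.0},
]
events = [
    {"camera_id": "cam2", "t": 61.5, "kind": "weapon", "label": "handgun"},
    {"camera_id": "cam1", "t": 42.0, "kind": "face", "label": "person_03"},
    {"camera_id": "cam4", "t": 101.0, "kind": "face", "label": "person_03"},
]
timeline = build_timeline(chapters, events)
```

Each entry carries camera IDs and timestamps alongside the phase label, so a downstream synthesis step has concrete citations to quote and does not need to infer them.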
