
    Building a Kalshi Trading Bot with Semantic Search and LLM Extraction

    How we built an autonomous Kalshi trading bot using the Kalshi API and Mixpeek's video transcription, semantic search, and LLM data extraction, with no external tools required.


    We built an autonomous Kalshi trading bot that uses the Kalshi API and Mixpeek's multimodal data platform to trade mention markets in real time. The system feeds YouTube URLs directly into Mixpeek — which handles transcription, embedding, and LLM extraction — then queries the results through a semantic retriever to generate calibrated trading signals. Zero manual intervention.

    This post walks through every component: how Mixpeek ingests and transcribes political video, structures it with LLM data extraction, queries it through a semantic search API, and turns the output into automated trading decisions on Kalshi's prediction market API.


    What Are Kalshi Mention Markets?

    Kalshi's mention markets are binary contracts on whether a public figure will say a specific word. Examples:

    • "Will Trump say 'tariff' in his next address?" — ticker: KXTRUMPMENTIONB-26APR01-TARI
    • "Will the Fed Chair mention 'inflation'?" — ticker: KXFEDMENTION-26APR-INFL
    • "Will the Press Secretary say 'China'?" — ticker: KXSECPRESSMENTION-26APR30-CHIN

    These markets resolve based on official transcripts. The edge comes from processing political speech faster and more accurately than the market — knowing who said it, how surprising it was, and whether the keyword appeared in a policy-relevant context.

    Most Kalshi trading bots rely on simple keyword matching. Ours uses Mixpeek's full resource chain for semantic understanding.


    System Architecture: Six Mixpeek Resources

    The pipeline chains six Mixpeek primitives. Mixpeek handles everything from video download and transcription to embedding and LLM extraction — no external tools required:

    YouTube URL
      1. Namespace  → data isolation
      2. Bucket     → accepts YouTube URLs as type: "video"
      3. Collection → auto-transcription + text embedding + LLM extraction
      4. Retriever  → semantic search across processed documents
      5. Bucket     → trade history logging (feedback loop)
      6. Retriever  → historical calibration from past trades

    If you've used a prediction market API before (Kalshi, Polymarket, etc.), you know the data challenge: markets move on unstructured information — speeches, press briefings, hearings — that doesn't fit neatly into a database. Mixpeek bridges that gap.


    Resource 1: Namespace — Data Isolation

    namespace_id: ns_7c8f877d9b
    name: prediction-market-alpha

    Every resource lives inside a single namespace, isolating prediction market data from other workloads. All API calls include the X-Namespace header. This is standard practice when using Mixpeek as a multimodal data pipeline — one namespace per use case.
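    In code, the namespace shows up as a header on every request. A minimal sketch of the shared header helper — the base URL and Bearer auth scheme are assumptions, not documented Mixpeek specifics; only the X-Namespace header and namespace ID come from the setup above:

    ```python
    MIXPEEK_API = "https://api.mixpeek.com/v1"  # assumed base URL
    NAMESPACE_ID = "ns_7c8f877d9b"

    def mixpeek_headers(api_key: str) -> dict:
        """Headers attached to every Mixpeek request in this namespace."""
        return {
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "X-Namespace": NAMESPACE_ID,           # scopes the call to this workload
            "Content-Type": "application/json",
        }
    ```

    Every call in the rest of this post — bucket uploads, retriever executions, history logging — passes through this one helper, so the namespace can never be accidentally omitted.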


    Resource 2: Hearing Bucket — Video Ingestion

    bucket_id: bkt_be6b9536

    We monitor four YouTube channels (White House, C-SPAN, C-SPAN Senate, Federal Reserve). When a new video appears, we push the YouTube URL directly to Mixpeek — no need for external transcription tools:

    POST /v1/buckets/bkt_be6b9536/objects
    {
      "blobs": [
        {
          "property": "url",
          "type": "video",
          "data": "https://www.youtube.com/watch?v=7d-3oqka-fE"
        },
        {"property": "source", "type": "string", "data": "white-house"},
        {"property": "event_type", "type": "string", "data": "press_briefing"}
      ]
    }

    That's it. Mixpeek downloads the video, extracts the audio, transcribes it, and makes the text available to the collection pipeline. A single API call replaces what would otherwise require yt-dlp for download, whisper or youtube-transcript-api for transcription, and a custom chunking pipeline.

    A single day's political speech typically yields 5-10 videos totaling 200-400K characters of transcript.
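    The ingestion call is simple enough to wrap in one function. This sketch builds the exact request body shown above; the helper name and argument order are our own, not part of the Mixpeek SDK:

    ```python
    def build_video_object(url: str, source: str, event_type: str) -> dict:
        """Payload for POST /v1/buckets/bkt_be6b9536/objects, mirroring
        the request body above: one video blob plus two string properties."""
        return {
            "blobs": [
                {"property": "url", "type": "video", "data": url},
                {"property": "source", "type": "string", "data": source},
                {"property": "event_type", "type": "string", "data": event_type},
            ]
        }
    ```

    The channel monitor calls this once per new video and POSTs the result; Mixpeek takes over from there.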


    Resource 3: Collection — Embedding + LLM Data Extraction

    collection_id: col_2a9565df60

    This is the core of the system. Once Mixpeek transcribes the video, the collection runs two extractors on the resulting text:

    1. Dense vector embedding via multilingual_e5_large_instruct_v1 — enables semantic search across all transcript chunks
    2. LLM structured extraction via Mixpeek's response_shape — Claude analyzes each chunk and extracts seven fields

    The response_shape configuration defines the extraction schema:

    {
      "speaker": "who is speaking (e.g. President Trump, Fed Chair Powell)",
      "statement_type": "policy_announcement | press_response | hearing_testimony | ...",
      "policy_direction": "hawkish | dovish | neutral | escalatory | ...",
      "keywords_mentioned": ["tariff", "china", "inflation", ...],
      "is_surprising": true/false,
      "surprise_magnitude": 0.0 - 1.0,
      "market_impact": 0.0 - 1.0
    }

    This is LLM data extraction at scale — every chunk gets speaker attribution, policy context, and market relevance scoring without any custom LLM pipeline. A batch of 7 videos (357K chars of transcript) produces 150+ indexed documents with all seven fields.
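    On the consuming side, the engine treats each retrieved chunk as a typed record. A sketch of the wrapper we use — the field names follow the response_shape schema above, but the dataclass and its defaults are our own convention for tolerating partially filled extractions:

    ```python
    from dataclasses import dataclass

    @dataclass
    class ExtractedChunk:
        """The seven response_shape fields attached to each transcript chunk."""
        speaker: str
        statement_type: str
        policy_direction: str
        keywords_mentioned: list
        is_surprising: bool
        surprise_magnitude: float  # 0.0 - 1.0
        market_impact: float       # 0.0 - 1.0

        @classmethod
        def from_document(cls, doc: dict) -> "ExtractedChunk":
            """Build from a retrieved document, with safe defaults for
            any field the LLM left empty."""
            return cls(
                speaker=doc.get("speaker", "unknown"),
                statement_type=doc.get("statement_type", "other"),
                policy_direction=doc.get("policy_direction", "neutral"),
                keywords_mentioned=doc.get("keywords_mentioned", []),
                is_surprising=bool(doc.get("is_surprising", False)),
                surprise_magnitude=float(doc.get("surprise_magnitude", 0.0)),
                market_impact=float(doc.get("market_impact", 0.0)),
            )
    ```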


    Resource 4: Signal Retriever — Semantic Search API

    retriever_id: ret_37fcabc4144e76
    name: signal-market-matcher

    The retriever is configured as a semantic search API endpoint that queries the collection using the E5 embedding model:

    {
      "stages": [{
        "stage_name": "semantic-search",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [{
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": {"input_mode": "text", "value": "{{INPUT.query}}"}
            }],
            "final_top_k": 30
          }
        }
      }]
    }

    The engine sends three query groups per cycle to maximize coverage:

    • Political/policy terms — tariff, china, iran, trade, sanctions, immigration
    • People/institution names — trump, powell, leavitt, fed, congress, senate
    • Random keyword sample — surfaces unexpected matches from new transcripts

    Each query returns up to 30 semantically relevant chunks with their LLM-extracted fields intact — speaker, surprise magnitude, market impact, and all.
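    The three query groups can be assembled in a few lines. A sketch under our assumptions — the term lists come from the bullets above, but joining terms into a single query string per group (rather than one query per term) is a simplification of the engine's actual behavior:

    ```python
    import random

    POLICY_TERMS = ["tariff", "china", "iran", "trade", "sanctions", "immigration"]
    ENTITY_TERMS = ["trump", "powell", "leavitt", "fed", "congress", "senate"]

    def build_query_groups(all_keywords: list, sample_size: int = 5, seed=None) -> list:
        """The three query groups sent each cycle: policy terms, people and
        institution names, and a random keyword sample that surfaces
        unexpected matches from new transcripts."""
        rng = random.Random(seed)
        sample = rng.sample(all_keywords, min(sample_size, len(all_keywords)))
        return [
            " ".join(POLICY_TERMS),
            " ".join(ENTITY_TERMS),
            " ".join(sample),
        ]
    ```

    Each group is then POSTed to the retriever as `{"inputs": {"query": ...}}`, and the E5 embedding handles the semantic spread within a group.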


    Resource 5: Trade History Bucket — Feedback Loop

    bucket_id: bkt_0e439f96

    Every trade decision — executed or skipped — gets logged back to Mixpeek. This creates a searchable archive of what the Kalshi trading bot traded, at what price, with what signal quality, and whether it won or lost. The feedback loop is what separates a prediction market bot from a simple alert system.


    Resource 6: History Retriever — Edge Calibration

    retriever_id: ret_2674a0d675b62f
    name: signal-history

    Before placing any order through the Kalshi API, the engine queries the history retriever:

    POST /v1/retrievers/ret_2674a0d675b62f/execute
    {"inputs": {"query": "tariff"}}

    Past trades for the same keyword feed a win-rate calculation that adjusts the expected edge. If historical "tariff" trades won 70% of the time, the engine sizes up. If 30%, it skips. This is what makes the system self-improving — a form of automated market making that learns from its own history via Mixpeek's retriever infrastructure.
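    The calibration step reduces to a small function. A simplified sketch, assuming the history retriever's results have been reduced to a list of win/loss booleans for the keyword; the linear adjustment rule is illustrative, not the production formula:

    ```python
    def calibrated_edge(base_edge: float, past_results: list, base_rate: float = 0.5) -> float:
        """Shift the expected edge by how past trades on this keyword fared.

        past_results: booleans (True = win) from the history retriever.
        With no history, the keyword stays at the 50% base rate and the
        raw edge passes through unchanged.
        """
        if not past_results:
            win_rate = base_rate
        else:
            win_rate = sum(past_results) / len(past_results)
        # A 70% historical win rate pushes the edge up; 30% pushes it
        # negative, which the engine later treats as a SKIP.
        return base_edge + (win_rate - base_rate)
    ```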


    Three Intelligence Layers: From Signal to Trade

    The six resources feed into three scoring layers:

    Layer 1: Signal Quality Scoring

    Powered by the collection's response_shape LLM extraction:

    • Speaker authority — "President Trump" (1.0), "Fed Chair Powell" (0.95), unknown press pool (0.50)
    • Statement type — policy announcements and hearing testimony score higher than casual references
    • Surprise factor — is_surprising=true with high surprise_magnitude → larger position size
    • Market impact — LLM-estimated probability that the mention moves the market

    Layer 2: Portfolio Construction

    • Category exposure caps: max $3 per market category
    • Per-market position limits: $10 max
    • Daily loss circuit breaker: $10 max drawdown

    Layer 3: Historical Calibration

    • Win rate from past trades adjusts edge estimates up or down
    • Keywords with poor track record get automatically de-risked
    • New keywords start at 50% base rate until history accumulates
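    The three layers converge in a single sizing decision. A minimal sketch of that decision — the dollar caps mirror the limits above, but the sizing formula itself is an illustrative simplification, not the production rule:

    ```python
    def position_size(quality: float, edge: float, price: float,
                      category_spent: float, category_cap: float = 3.0,
                      max_position: float = 10.0) -> int:
        """Contracts to buy for one signal, or 0 to skip.

        quality: Layer 1 signal-quality score (0.0 - 1.0)
        edge:    Layer 3 calibrated edge (can be negative)
        price:   YES contract price in dollars
        category_spent / category_cap: Layer 2 exposure accounting
        """
        if edge <= 0:
            return 0  # negative edge -> SKIP
        if category_spent >= category_cap:
            return 0  # category exposure cap hit -> SKIP
        budget = min(max_position, category_cap - category_spent)
        # Scale the budget by conviction, then convert dollars to contracts.
        return max(int((budget * quality * edge) / price), 0)
    ```

    This reproduces the behavior in the live output below: negative-edge signals and cap-exceeded categories return 0, while high-quality, high-edge signals get sized up.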

    Live Trading Results

    The engine runs autonomously, polling every 2 minutes. Here's actual output from a live cycle:

    Cycle 1: 28 signals found → 7 trades attempted, 23 skipped
    
    BUY signals (positive edge):
      "iran" on KXFEDMENTION-26APR-IRAN
        speaker=President Trump, quality=0.63, edge=+0.43 → 1x YES @ $0.25
      "volatility" on KXFEDMENTION-26APR-VOLA
        speaker=President Trump, quality=1.00, edge=+0.50 → 1x YES @ $0.28
      "bitcoin" on KXSECPRESSMENTION-26APR30-CRYP
        speaker=Press Secretary, quality=1.00, edge=+0.65 → 1x YES @ $0.13
    
    SKIPPED signals (negative edge or caps):
      "russia" on KXSECPRESSMENTION → quality=0.61, edge=-0.09 → SKIP
      "border" on KXSECPRESSMENTION → quality=0.61, edge=-0.17 → SKIP
      "oil" on KXLEAVITTSMFMENTION → category cap $3.08/$3.00 → SKIP

    The engine correctly rejects low-quality signals and respects portfolio limits, while aggressively buying high-conviction signals from authoritative speakers.


    Why This Beats Simple Keyword Matching

    Most Kalshi trading bots and prediction market bots use basic keyword detection — grep the transcript for "tariff" and buy. That approach fails in practice:

    • False positives — "The tariff discussion from last year..." doesn't mean they said "tariff" in a policy context today
    • No speaker attribution — a reporter asking "Will you impose tariffs?" is very different from the President saying "I'm imposing tariffs"
    • No surprise weighting — Trump saying "tariff" (expected) should size differently than Powell saying "tariff" (unexpected)
    • No learning — keyword bots make the same mistakes repeatedly with no feedback loop

    Mixpeek's response_shape extraction solves all four. The semantic search API finds contextually relevant chunks, the LLM extraction gives you structured fields, and the history retriever calibrates over time.


    Technical Stack

    • Video ingestion — YouTube URLs pushed directly to Mixpeek bucket as type: "video"
    • Transcription + processing — Mixpeek auto-transcribes, chunks, embeds (E5), and extracts (Claude response_shape)
    • Search — Mixpeek retriever (semantic search API with final_top_k: 30)
    • Trading — Kalshi API with RSA-PSS authentication for order placement
    • Feedback — Mixpeek trade history bucket + history retriever for calibration
    • Runtime — FastAPI server with async polling loop (Python)
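    The runtime reduces to a guarded polling loop. A sketch of the skeleton — `run_cycle` stands in for whatever function chains the retriever queries, scoring layers, and Kalshi orders, and the `max_cycles` parameter exists only to make the loop testable:

    ```python
    import asyncio

    POLL_INTERVAL_SECONDS = 120  # the 2-minute cycle described above

    async def trading_loop(run_cycle, max_cycles=None, interval=POLL_INTERVAL_SECONDS):
        """Run cycles forever (or max_cycles times), never letting one
        failed cycle kill the loop."""
        cycles = 0
        while max_cycles is None or cycles < max_cycles:
            try:
                await run_cycle()
            except Exception as exc:
                # A transient Mixpeek or Kalshi API error should not stop the bot.
                print(f"cycle failed: {exc!r}")
            cycles += 1
            if max_cycles is None or cycles < max_cycles:
                await asyncio.sleep(interval)
    ```

    In production this coroutine is started as a background task alongside the FastAPI app, so the same process serves status endpoints and runs the trading cycle.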

    The entire intelligence layer — from YouTube URL to calibrated trading signal — runs on six Mixpeek resource IDs. No transcription tools, no custom vector database, no LLM prompt engineering, no embedding pipeline to maintain.


    Get Started

    Mixpeek handles the hard parts of unstructured data processing — video transcription, chunking, embedding, LLM extraction, vector search, and batch processing — so you can focus on your domain logic. Whether you're building a Kalshi trading bot, a content moderation pipeline, or a multimodal search engine, the same resource primitives apply.

    • Mixpeek docs — mixpeek.com/docs
    • Kalshi API docs — docs.kalshi.com
    • Source code — the complete engine is open-source in our research repo