    HF · Anomaly Detection · Apache 2.0

    Toto-2.0-2.5B

    by Datadog

    #1 time series foundation model — zero-shot forecasting from 4M to 2.5B parameters

    1.3K downloads/month · 2.5B params

    Identifiers

    Model ID: Datadog/Toto-2.0-2.5B
    Feature URI: mixpeek://forecasting@v1/datadog_toto_20_25b_v1

    Overview

    Toto 2.0 is Datadog's time series foundation model. According to Datadog's model card, it ranks #1 on the BOOM, GIFT-Eval, and TIME forecasting benchmarks. Built on a decoder-only patched transformer with alternating time-axis and variate-axis attention, it performs zero-shot multivariate forecasting with probabilistic uncertainty estimates, so no fine-tuning on target data is required.

    The 2.5B flagship model sits atop a family spanning 4M to 2.5B parameters, all trained with a single hyperparameter recipe (u-muP) that transfers across scales. Contiguous Patch Masking lets the model decode an entire forecast horizon in a single parallel pass, which makes inference faster than autoregressive approaches that generate the horizon one step at a time.

    On Mixpeek, Toto powers predictive analytics on ingestion metrics, query latency trends, and pipeline throughput forecasting — turning historical observability data into actionable capacity planning signals.

    Architecture

    Decoder-only patched transformer with alternating time-axis (causal) and variate-axis (full) attention layers. Contiguous Patch Masking (CPM) for single-pass parallel decoding. Quantile output head (9 levels) trained with pinball loss. Robust arcsinh input scaling. u-muP parameterization enables a single training recipe from 4M to 2.5B. Variable context and prediction lengths.
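The arcsinh input scaling mentioned above can be illustrated with a minimal sketch. The source does not say how Toto computes its location and scale statistics, so the median and interquartile range used here are assumptions chosen for illustration, and the function names are hypothetical. The point is only that arcsinh behaves roughly linearly near zero and logarithmically for large values, so a single spike does not dominate the scaled series.

```python
import math
import statistics


def arcsinh_scale(series):
    """Center by the median, divide by a robust spread, then apply arcsinh.

    Assumption: median/IQR statistics are illustrative only; Toto's actual
    scaler may compute its location and scale differently.
    """
    center = statistics.median(series)
    q1, _, q3 = statistics.quantiles(series, n=4)
    spread = (q3 - q1) or 1.0  # fall back to 1.0 on a flat series
    scaled = [math.asinh((x - center) / spread) for x in series]
    return scaled, center, spread


def arcsinh_unscale(scaled, center, spread):
    """Invert arcsinh_scale with sinh, then undo the spread and shift."""
    return [math.sinh(z) * spread + center for z in scaled]


# A latency-like series with one large spike (made-up values).
series = [10.0, 11.0, 9.0, 10.5, 10.0, 500.0, 9.5, 10.2]
scaled, center, spread = arcsinh_scale(series)
restored = arcsinh_unscale(scaled, center, spread)
```

With these made-up values the spike at 500 sits hundreds of spreads away from the median on a linear scale, but lands below 10 after the arcsinh transform, and `arcsinh_unscale` recovers the original series up to floating-point error.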

    Mixpeek SDK Integration

    from toto import TotoModel

    # Load the flagship 2.5B model
    model = TotoModel.from_pretrained("Datadog/Toto-2.0-2.5B")

    # Zero-shot forecast: no fine-tuning required
    prediction = model.forecast(
        context=historical_metrics,  # shape: (num_variates, context_length)
        prediction_length=96,        # forecast 96 steps ahead
    )
    # prediction.quantiles: shape (num_variates, 96, 9)
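As a rough sketch of how the `(num_variates, 96, 9)` quantile output might be consumed, the helper below pulls a median and an outer band out of a nested-list version of that structure. The quantile levels (0.1 through 0.9) are an assumption, since the source states only that there are 9 of them, and `summarize_forecast` and `fake_quantiles` are hypothetical names, not part of any SDK.

```python
# Assumed quantile levels for the 9-level output head (not confirmed by the source).
QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def summarize_forecast(quantiles, levels=QUANTILE_LEVELS):
    """Return the median and 10%-90% band for each variate.

    `quantiles` is a nested list indexed as [variate][step][quantile_level].
    """
    lo, mid, hi = levels.index(0.1), levels.index(0.5), levels.index(0.9)
    summary = []
    for variate in quantiles:
        summary.append({
            "median": [step[mid] for step in variate],
            "lower": [step[lo] for step in variate],
            "upper": [step[hi] for step in variate],
        })
    return summary


# Hypothetical stand-in for prediction.quantiles: 2 variates, 3 steps, 9 levels.
fake_quantiles = [
    [[base + 10 * q for q in QUANTILE_LEVELS] for base in (100.0, 101.0, 102.0)],
    [[base + 2 * q for q in QUANTILE_LEVELS] for base in (5.0, 5.5, 6.0)],
]
summary = summarize_forecast(fake_quantiles)
```

If the real output is an array or tensor, converting it to nested lists first (or indexing the last axis directly) would give the same three series per variate.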

    Capabilities

    • #1 foundation model on BOOM, GIFT-Eval, and TIME benchmarks
    • Zero-shot forecasting — no fine-tuning needed on target time series
    • Probabilistic predictions with 9-quantile uncertainty estimates
    • Scales from 4M to 2.5B params with monotonic quality improvement
    • Single-pass parallel decoding via Contiguous Patch Masking

    Use Cases on Mixpeek

    • Infrastructure forecasting: predict CPU, memory, and request volume trends
    • Capacity planning: forecast when resources will hit utilization thresholds
    • Anomaly detection: flag deviations from predicted baselines
    • SLA monitoring: predict latency and error rate trajectories
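Two of the use cases above reduce to simple comparisons once a forecast is available. The sketch below is one possible way to do it, not Mixpeek's implementation: it flags observations that fall outside a predicted quantile band and finds the first forecast step that reaches a utilization threshold. All function names and numbers are hypothetical.

```python
def flag_anomalies(observed, lower, upper):
    """Return indices where an observed value falls outside [lower, upper]."""
    return [
        i for i, (y, lo, hi) in enumerate(zip(observed, lower, upper))
        if y < lo or y > hi
    ]


def first_breach(forecast, threshold):
    """Return the first forecast step at or above threshold, or None."""
    for step, value in enumerate(forecast):
        if value >= threshold:
            return step
    return None


# Hypothetical predicted 10%-90% band versus what was actually observed.
lower = [40.0, 41.0, 42.0, 43.0]
upper = [60.0, 61.0, 62.0, 63.0]
observed = [50.0, 75.0, 45.0, 30.0]
anomalies = flag_anomalies(observed, lower, upper)  # steps 1 and 3 fall outside the band

# Hypothetical upper-quantile forecast of disk utilization (%), alert level 90%.
disk_upper = [70.0, 78.0, 85.0, 91.0, 96.0]
breach_step = first_breach(disk_upper, 90.0)  # step 3 is the first value >= 90
```

Using an upper quantile rather than the median for the threshold check is a conservative design choice: it raises the capacity alert earlier, at the cost of more false alarms.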

    Benchmarks

    Dataset                          Metric   Score   Source
    BOOM (observability)             CRPS     0.349   Datadog, 2026 (Model Card)
    GIFT-Eval (general)              CRPS     0.476   Datadog, 2026 (Model Card)
    TIME (contamination-resistant)   CRPS     0.532   Datadog, 2026 (Model Card)
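CRPS (continuous ranked probability score) measures the quality of a probabilistic forecast, and lower is better. For a model that emits quantiles, it is commonly approximated as twice the average pinball loss across the quantile levels. The sketch below shows that approximation for a single observation; the quantile levels and forecast values are made up, and how the benchmark scores above are aggregated or normalized is not stated here.

```python
def pinball_loss(y_true, y_pred, level):
    """Pinball (quantile) loss for one prediction at one quantile level."""
    diff = y_true - y_pred
    return level * diff if diff >= 0 else (level - 1.0) * diff


def approx_crps(y_true, quantile_preds, levels):
    """Approximate CRPS as twice the mean pinball loss over the levels."""
    losses = [pinball_loss(y_true, q, lv) for q, lv in zip(quantile_preds, levels)]
    return 2.0 * sum(losses) / len(losses)


# Assumed quantile levels and two hypothetical forecasts for a true value of 10.0.
levels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
sharp = [9.6, 9.7, 9.8, 9.9, 10.0, 10.1, 10.2, 10.3, 10.4]
wide = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]

crps_sharp = approx_crps(10.0, sharp, levels)
crps_wide = approx_crps(10.0, wide, levels)
```

Both forecasts are centered on the true value, but the sharper one scores lower, which is the behavior the metric is meant to reward: calibrated forecasts with tight uncertainty.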

    Performance

    Input Size: Multivariate time series (variable length)
    GPU Latency: ~36 ms / forecast (A100, batch 8)
    GPU Throughput: ~220 forecasts/sec (A100)
    GPU Memory: ~9.1 GB
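The latency and throughput figures above are consistent with each other if the ~36 ms refers to one batched call of 8 forecasts. That reading is an assumption, as the page does not spell it out. The arithmetic, plus a rough sizing estimate for a hypothetical workload, looks like this:

```python
def forecasts_per_second(batch_latency_ms, batch_size):
    """Throughput implied by one batched call taking batch_latency_ms."""
    return batch_size / (batch_latency_ms / 1000.0)


def seconds_for_workload(num_series, throughput):
    """Wall-clock seconds to forecast num_series at a given throughput."""
    return num_series / throughput


# Assumption: ~36 ms is the latency of one batch-of-8 call on an A100.
implied = forecasts_per_second(36.0, 8)       # about 222 forecasts/sec
backfill = seconds_for_workload(10_000, 220)  # about 45 seconds for 10,000 series
```

Real throughput will depend on context length, number of variates, and prediction length, none of which are specified for these measurements.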

    Specification

    Framework: HF
    Organization: Datadog
    Feature: Anomaly Detection
    Output: anomaly score + map
    Modalities: image
    Retriever: Anomaly Filter
    Parameters: 2.5B
    License: Apache 2.0
    Downloads/mo: 1.3K

    Research Paper

    This Time is Different: An Observability Perspective on Time Series Foundation Models

    arxiv.org

    Build a pipeline with Toto-2.0-2.5B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
