Toto-2.0-2.5B

by Datadog

#1 time series foundation model — zero-shot forecasting from 4M to 2.5B parameters

1.3Kdl/month

2.5Bparams

HuggingFace Use in Pipeline

Identifiers

Model ID

Datadog/Toto-2.0-2.5B

Feature URI

mixpeek://forecasting@v1/datadog_toto_20_25b_v1

Overview

Toto 2.0 is Datadog's time series foundation model that ranks #1 on every major forecasting benchmark (BOOM, GIFT-Eval, TIME). Built on a decoder-only patched transformer with alternating time-axis and variate-axis attention, it performs zero-shot multivariate forecasting with probabilistic uncertainty estimates — no fine-tuning required on target data.

The 2.5B flagship model sits atop a family spanning 4M to 2.5B parameters, all trained with a single hyperparameter recipe (u-muP) that transfers across scales. Contiguous Patch Masking enables single-pass parallel decoding of entire forecast horizons, making inference dramatically faster than autoregressive approaches.

On Mixpeek, Toto powers predictive analytics on ingestion metrics, query latency trends, and pipeline throughput forecasting — turning historical observability data into actionable capacity planning signals.

Architecture

Decoder-only patched transformer with alternating time-axis (causal) and variate-axis (full) attention layers. Contiguous Patch Masking (CPM) for single-pass parallel decoding. Quantile output head (9 levels) trained with pinball loss. Robust arcsinh input scaling. u-muP parameterization enables a single training recipe from 4M to 2.5B. Variable context and prediction lengths.

Mixpeek SDK Integration

from toto import TotoModel

# Load the flagship 2.5B model
model = TotoModel.from_pretrained("Datadog/Toto-2.0-2.5B")

# Zero-shot forecast — no fine-tuning required
prediction = model.forecast(
    context=historical_metrics,  # shape: (num_variates, context_length)
    prediction_length=96,        # forecast 96 steps ahead
)
# prediction.quantiles: shape (num_variates, 96, 9)

Capabilities

#1 foundation model on BOOM, GIFT-Eval, and TIME benchmarks
Zero-shot forecasting — no fine-tuning needed on target time series
Probabilistic predictions with 9-quantile uncertainty estimates
Scales from 4M to 2.5B params with monotonic quality improvement
Single-pass parallel decoding via Contiguous Patch Masking

Use Cases on Mixpeek

Infrastructure forecasting: predict CPU, memory, and request volume trends

Capacity planning: forecast when resources will hit utilization thresholds

Anomaly detection: flag deviations from predicted baselines

SLA monitoring: predict latency and error rate trajectories

Benchmarks

Dataset	Metric	Score	Source
BOOM (observability)	CRPS	0.349	Datadog, 2026 — Model Card
GIFT-Eval (general)	CRPS	0.476	Datadog, 2026 — Model Card
TIME (contamination-resistant)	CRPS	0.532	Datadog, 2026 — Model Card