Toto-2.0-2.5B
by Datadog
#1 time series foundation model — zero-shot forecasting from 4M to 2.5B parameters
Datadog/Toto-2.0-2.5Bmixpeek://forecasting@v1/datadog_toto_20_25b_v1Overview
Toto 2.0 is Datadog's time series foundation model that ranks #1 on every major forecasting benchmark (BOOM, GIFT-Eval, TIME). Built on a decoder-only patched transformer with alternating time-axis and variate-axis attention, it performs zero-shot multivariate forecasting with probabilistic uncertainty estimates — no fine-tuning required on target data.
The 2.5B flagship model sits atop a family spanning 4M to 2.5B parameters, all trained with a single hyperparameter recipe (u-muP) that transfers across scales. Contiguous Patch Masking enables single-pass parallel decoding of entire forecast horizons, making inference dramatically faster than autoregressive approaches.
On Mixpeek, Toto powers predictive analytics on ingestion metrics, query latency trends, and pipeline throughput forecasting — turning historical observability data into actionable capacity planning signals.
Architecture
Decoder-only patched transformer with alternating time-axis (causal) and variate-axis (full) attention layers. Contiguous Patch Masking (CPM) for single-pass parallel decoding. Quantile output head (9 levels) trained with pinball loss. Robust arcsinh input scaling. u-muP parameterization enables a single training recipe from 4M to 2.5B. Variable context and prediction lengths.
Mixpeek SDK Integration
from toto import TotoModel# Load the flagship 2.5B modelmodel = TotoModel.from_pretrained("Datadog/Toto-2.0-2.5B")# Zero-shot forecast — no fine-tuning requiredprediction = model.forecast(context=historical_metrics, # shape: (num_variates, context_length)prediction_length=96, # forecast 96 steps ahead)# prediction.quantiles: shape (num_variates, 96, 9)
Capabilities
- #1 foundation model on BOOM, GIFT-Eval, and TIME benchmarks
- Zero-shot forecasting — no fine-tuning needed on target time series
- Probabilistic predictions with 9-quantile uncertainty estimates
- Scales from 4M to 2.5B params with monotonic quality improvement
- Single-pass parallel decoding via Contiguous Patch Masking
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| BOOM (observability) | CRPS | 0.349 | Datadog, 2026 — Model Card |
| GIFT-Eval (general) | CRPS | 0.476 | Datadog, 2026 — Model Card |
| TIME (contamination-resistant) | CRPS | 0.532 | Datadog, 2026 — Model Card |
Performance
Common Pipeline Companions
Specification
Research Paper
This Time is Different: An Observability Perspective on Time Series Foundation Models
arxiv.orgBuild a pipeline with Toto-2.0-2.5B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
Open Studio