jina-embeddings-v5-text-small

by jinaai

Highest-scoring sub-1B multilingual embedding model with task-specific LoRA adapters

194K downloads/month

190 likes

596M params


Identifiers

Model ID

jinaai/jina-embeddings-v5-text-small

Feature URI

mixpeek://text_extractor@v1/jina_embeddings_v5_small_v1

Overview

Jina Embeddings v5 Text Small is a 677M-parameter multilingual text embedding model built on the Qwen3-0.6B-Base backbone. It achieves the highest MTEB English v2 score (71.7) among all multilingual models under 1B parameters by combining embedding distillation from the larger 4B variant with four task-specific LoRA adapters for retrieval, similarity, clustering, and classification.

On Mixpeek, jina-embeddings-v5-text-small is a strong default for multilingual text embedding at scale: it matches the retrieval quality of the 3.8B v4 model while being about 5.6x smaller. Its 32K-token context length and Matryoshka dimension flexibility (1024 down to 32) suit both long-document and cost-constrained pipelines across 119+ languages.
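To make the Matryoshka trade-off concrete, here is a minimal sketch of shortening an embedding by keeping its first components and re-normalizing to unit length, which is the usual way Matryoshka-trained vectors are truncated. The helpers `truncateEmbedding` and `bytesPerVector` are hypothetical and not part of the Mixpeek SDK, and float32 storage is an assumption.

```typescript
// Hypothetical helper, not part of the Mixpeek SDK.
// Keeps the first `dims` components and rescales the result to unit L2 norm.
function truncateEmbedding(embedding: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims < 1 || dims > embedding.length) {
    throw new RangeError(`dims must be an integer between 1 and ${embedding.length}`);
  }
  const head = embedding.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  // Guard against an all-zero prefix to avoid dividing by zero.
  return norm === 0 ? head : head.map((x) => x / norm);
}

// Storage per vector, assuming 4-byte float32 components (an assumption).
function bytesPerVector(dims: number): number {
  return dims * 4;
}
```

Under the float32 assumption, a 1024-dimensional vector takes 4096 bytes and a 32-dimensional one takes 128 bytes, a 32x reduction in index size. How much retrieval quality is lost at each size is an empirical question for your data.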

Architecture

Qwen3-0.6B-Base backbone with last-token pooling. 677M parameters. Four independent task-specific LoRA adapters (retrieval, similarity, clustering, classification) trained on frozen backbone weights. Supports 32K context via adjusted RoPE base frequencies. Matryoshka truncation from 1024 to 32 dimensions.
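Last-token pooling means the sequence embedding is the hidden state of the final real (non-padding) token. The sketch below illustrates the idea on plain arrays. `lastTokenPool` is a hypothetical name, and the attention-mask convention (1 for real tokens, 0 for padding) is an assumption, not a description of the model's actual inference code.

```typescript
// Illustrative only: selects the hidden state of the last non-padding token.
// hiddenStates has shape [seqLen][hiddenDim]; attentionMask has length seqLen,
// with 1 marking real tokens and 0 marking padding (assumed convention).
function lastTokenPool(hiddenStates: number[][], attentionMask: number[]): number[] {
  if (hiddenStates.length !== attentionMask.length) {
    throw new Error("hiddenStates and attentionMask must have the same length");
  }
  // lastIndexOf handles both left- and right-padded sequences.
  const lastIndex = attentionMask.lastIndexOf(1);
  if (lastIndex === -1) {
    throw new Error("sequence contains no real tokens");
  }
  // Return a copy so callers cannot mutate the original hidden states.
  return [...hiddenStates[lastIndex]];
}
```

For a right-padded sequence with mask `[1, 1, 0]`, this returns the second row of `hiddenStates`.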

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "text_embedding",
    version: "v1",
    parameters: { model_id: "jinaai/jina-embeddings-v5-text-small" },
  },
});

Capabilities

71.7 average on MTEB English v2 (highest among multilingual models under 1B parameters)
1024-dimensional embeddings with Matryoshka truncation down to 32 dimensions
32K-token context length via adjusted RoPE base frequencies
119+ language support
Task-specific LoRA adapters for optimal per-task performance
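As a rough illustration of how a task-specific adapter can sit on top of frozen weights, the sketch below applies the generic LoRA update W_eff = W + (alpha / r) * B * A to a small matrix. This is the standard LoRA formulation, not Jina's exact configuration. The names `applyLora` and `matmul`, and the rank and alpha values in the usage note, are illustrative assumptions.

```typescript
type Matrix = number[][];

// Plain matrix product: (m x k) times (k x n) gives (m x n).
function matmul(a: Matrix, b: Matrix): Matrix {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, aik, k) => sum + aik * b[k][j], 0)),
  );
}

// Generic LoRA composition: W_eff = W + (alpha / r) * B * A.
// W (out x in) stays frozen and is not mutated; only the low-rank factors
// A (r x in) and B (out x r) would differ from one task adapter to another.
function applyLora(w: Matrix, a: Matrix, b: Matrix, alpha: number): Matrix {
  const rank = a.length;
  const delta = matmul(b, a);
  const scale = alpha / rank;
  return w.map((row, i) => row.map((wij, j) => wij + scale * delta[i][j]));
}
```

With a 2x2 identity `w`, `a = [[1, 2]]`, `b = [[1], [0]]`, and `alpha = 2`, the effective weights are `[[3, 4], [0, 1]]` while `w` is left unchanged. Swapping in a different `(a, b)` pair switches tasks without touching the backbone.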

Use Cases on Mixpeek

Multilingual document search across global content repositories in 119+ languages

Long-form content embedding for legal, medical, and research documents up to 32K tokens

Cost-efficient semantic search, replacing larger embedding models such as the 3.8B v4 model at comparable retrieval quality
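For intuition about what semantic search does with these embeddings, the sketch below ranks documents by cosine similarity to a query vector. It is a brute-force, in-memory illustration with hypothetical names (`cosineSimilarity`, `rankBySimilarity`, `ScoredDoc`), and it does not reflect how Mixpeek executes retrieval internally.

```typescript
interface ScoredDoc {
  id: string;
  score: number;
}

// Cosine similarity between two equal-length vectors; returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Brute-force ranking: score every document, sort descending, keep the top K.
function rankBySimilarity(
  query: number[],
  docs: { id: string; embedding: number[] }[],
  topK: number,
): ScoredDoc[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Because cosine similarity ignores vector length, the same ranking code works for full 1024-dimensional vectors and for Matryoshka-truncated ones, as long as query and documents use the same dimensionality.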

Benchmarks

Dataset	Metric	Score	Source
MTEB English v2 (avg)	Score	71.7	Jina AI, 2025: Model Card
MMTEB (multilingual, task-level avg)	Score	67.0	Jina AI, 2025: Model Card
BEIR (retrieval)	nDCG@10	56.67	Jina AI, 2025: Model Card
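For readers unfamiliar with the BEIR metric, nDCG@10 compares the discounted gain of the top 10 results against the best achievable ordering. The sketch below uses linear gain with a log2 rank discount. It is a simplified illustration, not the BEIR evaluation harness, and it assumes the input lists the graded relevance of every judged document in ranked order. The function names are hypothetical.

```typescript
// DCG with a log2 discount: sum of rel_i / log2(rank_i + 1) over the top k ranks.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    // i is 0-based, so rank = i + 1 and the discount is log2(i + 2).
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.
// Assumes rankedRelevances covers all judged documents (a simplification).
function ndcgAtK(rankedRelevances: number[], k: number = 10): number {
  const ideal = [...rankedRelevances].sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(rankedRelevances, k) / idealDcg;
}
```

A ranking that places the only relevant document first scores 1, while placing it second scores 1 / log2(3), about 0.63. The 56.67 in the table is this kind of score expressed as a percentage; presumably it is averaged over the BEIR datasets and their queries.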