codet5p-110m-embedding
by Salesforce
Unified code understanding and generation with T5 architecture
Salesforce/codet5p-110m-embedding
Overview
CodeT5+ is a family of encoder-decoder code LLMs that support both understanding and generation tasks. The 110M embedding variant is optimized for producing high-quality code embeddings for retrieval.
On Mixpeek, CodeT5+ provides an alternative to CodeBERT for code embedding extraction, with support for more programming languages and stronger performance on code search tasks.
Architecture
T5-based encoder-decoder. The 110M embedding variant uses only the encoder, trained with contrastive learning on code-text pairs. Supports 10+ programming languages.
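Because the contrastively trained encoder maps code and natural-language queries into a shared vector space, code search reduces to nearest-neighbor ranking over embeddings. A minimal sketch of that ranking step in TypeScript, using small made-up vectors as stand-ins for the model's output embeddings (the snippet IDs and dimensions are illustrative only):

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored code snippets by similarity to a query embedding.
function rank(query: number[], docs: { id: string; vec: number[] }[]): string[] {
  return docs
    .map((d) => ({ id: d.id, score: cosine(query, d.vec) }))
    .sort((x, y) => y.score - x.score)
    .map((d) => d.id);
}

// Toy low-dimensional vectors standing in for real model embeddings.
const query = [0.9, 0.1, 0.0];
const docs = [
  { id: "parse_json.py", vec: [0.1, 0.9, 0.2] },
  { id: "http_get.py", vec: [0.88, 0.12, 0.01] },
];
console.log(rank(query, docs)); // most similar snippet first
```

In production the embeddings come from the model and the ranking runs inside a vector index rather than a linear scan, but the scoring principle is the same.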
Mixpeek SDK Integration
```typescript
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/codebase.zip" },
  feature_extractors: [
    {
      name: "code_extraction",
      version: "v1",
      params: {
        model_id: "Salesforce/codet5p-110m-embedding"
      }
    }
  ]
});
```
Capabilities
- High-quality code embeddings for retrieval
- 10+ programming language support
- Code-to-text and text-to-code generation (full CodeT5+ family; the encoder-only embedding variant targets retrieval)
- Compact model size (110M params)
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| CodeSearchNet (6 langs) | MRR | 71.8 | Wang et al., 2023 — Table 2 |
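MRR (mean reciprocal rank) averages 1/rank of the first correct result across queries, so a score of 71.8 means the correct snippet usually lands at or very near the top. A small illustrative computation in TypeScript (the ranks below are made up for the example, not CodeSearchNet data):

```typescript
// Mean reciprocal rank: average of 1 / (rank of the first relevant hit).
function mrr(ranks: number[]): number {
  const sum = ranks.reduce((acc, r) => acc + 1 / r, 0);
  return sum / ranks.length;
}

// Hypothetical ranks of the correct snippet for four queries.
const ranks = [1, 2, 1, 4];
console.log((mrr(ranks) * 100).toFixed(1)); // 68.8
```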
Performance
110M params — compact code embedding model
Specification
Research Paper
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
arxiv.org
Build a pipeline with codet5p-110m-embedding
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.