codet5p-110m-embedding
by Salesforce
Unified code understanding and generation with T5 architecture
Salesforce/codet5p-110m-embedding
mixpeek://document_extractor@v1/salesforce_codet5p_v1
Overview
CodeT5+ is a family of encoder-decoder code LLMs that support both understanding and generation tasks. The 110M embedding variant is optimized for producing high-quality code embeddings for retrieval.
On Mixpeek, CodeT5+ provides an alternative to CodeBERT for code embedding extraction, with support for more programming languages and stronger performance on code search tasks.
Architecture
T5-based encoder-decoder. The 110M embedding variant uses only the encoder, trained with contrastive learning on code-text pairs. Supports 10+ programming languages.
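Because the encoder is trained contrastively on code-text pairs, a text query and a code snippet can be compared directly by the similarity of their embedding vectors. The sketch below illustrates that retrieval step only. The helper names (`cosineSimilarity`, `rankBySimilarity`), the `Embedded` type, and the three-dimensional toy vectors are assumptions made for illustration; they are not part of the Mixpeek SDK or of the model's actual output.

```typescript
// Illustrative only: toy vectors stand in for real model embeddings.
type Embedded = { id: string; vector: number[] };

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every item against the query and keep the topK highest-scoring ones.
function rankBySimilarity(query: number[], items: Embedded[], topK = 3) {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Hypothetical code embeddings and a hypothetical text-query embedding.
const snippets: Embedded[] = [
  { id: "parse_json.py", vector: [0.9, 0.1, 0.0] },
  { id: "http_client.ts", vector: [0.1, 0.9, 0.2] },
  { id: "sort_utils.go", vector: [0.0, 0.2, 0.9] },
];
const queryVector = [0.8, 0.2, 0.1];

const ranked = rankBySimilarity(queryVector, snippets, 2);
console.log(ranked.map((r) => r.id));
```

In a real deployment the vectors would come from the model and the ranking would typically be handled by a vector index rather than a linear scan; the scan here just keeps the idea visible.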
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/codebase.zip" },
  feature_extractors: [{
    name: "code_extraction",
    version: "v1",
    params: {
      model_id: "Salesforce/codet5p-110m-embedding"
    }
  }]
});

Capabilities
- High-quality code embeddings for retrieval
- 10+ programming language support
- Code-to-text and text-to-code generation (in the broader CodeT5+ family; the embedding variant uses only the encoder)
- Compact model size (110M params)
Use Cases on Mixpeek
Specification
Research Paper
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
arxiv.org
Build a pipeline with codet5p-110m-embedding
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.