
    codet5p-110m-embedding

    by Salesforce

    Unified code understanding and generation with T5 architecture

    154K downloads/month
    68 likes
    110M params
    Identifiers
    Model ID
    Salesforce/codet5p-110m-embedding
    Feature URI
    mixpeek://document_extractor@v1/salesforce_codet5p_v1

    Overview

    CodeT5+ is a family of encoder-decoder code LLMs that support both understanding and generation tasks. The 110M embedding variant is optimized for producing high-quality code embeddings for retrieval.

    On Mixpeek, CodeT5+ provides an alternative to CodeBERT for code embedding extraction, with support for more programming languages and stronger performance on code search tasks.
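    Code search with these embeddings reduces to nearest-neighbor lookup in vector space: embed the query, embed each snippet, and rank by cosine similarity. A minimal sketch, with toy 4-dimensional vectors standing in for the model's 256-dimensional output and hypothetical snippet ids:

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Toy index: snippet id -> embedding. In practice these vectors
// come from the model; the ids and values here are made up.
const index: Array<{ id: string; vec: number[] }> = [
  { id: "quicksort.py", vec: [0.9, 0.1, 0.0, 0.1] },
  { id: "http_client.go", vec: [0.1, 0.8, 0.3, 0.0] },
];

// Rank the indexed snippets against a query embedding.
function search(query: number[], k = 1) {
  return index
    .map((e) => ({ id: e.id, score: cosine(query, e.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

    At the scale of a real codebase, the linear scan above would be replaced by an approximate nearest-neighbor index, but the scoring function is the same.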

    Architecture

    T5-based encoder-decoder. The 110M embedding variant uses only the encoder, trained with contrastive learning on code-text pairs. Supports 10+ programming languages.
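    The contrastive training mentioned above can be sketched as an InfoNCE-style objective over in-batch code-text pairs: each code embedding is pulled toward its paired text embedding and pushed away from the other texts in the batch. The loss shape and the temperature value here are illustrative assumptions, not CodeT5+'s published training configuration.

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

// InfoNCE-style loss: for each code[i], a softmax over similarities
// to every text in the batch, with text[i] as the positive.
// temp = 0.05 is an illustrative temperature, not the real one.
function infoNCE(code: number[][], text: number[][], temp = 0.05): number {
  let loss = 0;
  for (let i = 0; i < code.length; i++) {
    const logits = text.map((t) => dot(code[i], t) / temp);
    const maxL = Math.max(...logits); // subtract max for numerical stability
    const exps = logits.map((l) => Math.exp(l - maxL));
    const denom = exps.reduce((s, e) => s + e, 0);
    loss += -Math.log(exps[i] / denom); // matched pair sits at index i
  }
  return loss / code.length;
}
```

    A batch where matched pairs already align gives a near-zero loss; a shuffled batch gives a large one, which is the gradient signal that shapes the embedding space.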

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/codebase.zip" },
      feature_extractors: [{
        name: "code_extraction",
        version: "v1",
        params: {
          model_id: "Salesforce/codet5p-110m-embedding"
        }
      }]
    });

    Capabilities

    • High-quality code embeddings for retrieval
    • 10+ programming language support
    • Text-to-code and code-to-text retrieval (generation requires the larger encoder-decoder CodeT5+ variants)
    • Compact model size (110M params)

    Use Cases on Mixpeek

    • Code search across technical documentation and repositories
    • Code snippet recommendation based on natural language
    • Cross-language code similarity matching

    Benchmarks

    Dataset                    Metric   Score   Source
    CodeSearchNet (6 langs)    MRR      71.8    Wang et al., 2023, Table 2

    Performance

    Input Size        512 tokens max
    Embedding Dim     256
    GPU Latency       ~2 ms/snippet (A100)
    CPU Latency       ~18 ms/snippet
    GPU Throughput    ~500 snippets/sec (A100)
    GPU Memory        ~0.45 GB

    110M params — compact code embedding model
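    Files longer than the 512-token input limit must be split before embedding. A naive sketch that windows a source file with overlap, using whitespace "tokens" as a stand-in for the model's subword tokenizer (subword counts run higher, so a safety margin below 512 is prudent in practice):

```typescript
// Split a source string into overlapping windows of at most
// maxTokens whitespace-delimited tokens, so each window fits the
// model's input limit. Overlap preserves context across boundaries.
function chunk(source: string, maxTokens = 512, overlap = 64): string[] {
  const tokens = source.split(/\s+/).filter((t) => t.length > 0);
  const chunks: string[] = [];
  const step = maxTokens - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens).join(" "));
    if (start + maxTokens >= tokens.length) break; // last window reached EOF
  }
  return chunks;
}
```

    Each resulting chunk is embedded separately; at query time, a file's best-matching chunk score is typically used as the file's score.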

    Specification

    Framework         HF
    Organization      Salesforce
    Feature           Code Extraction
    Output            code + language
    Modalities        document
    Retriever         Code Search
    Parameters        110M
    License           bsd-3-clause
    Downloads/mo      154K
    Likes             68

    Research Paper

    CodeT5+: Open Code Large Language Models for Code Understanding and Generation

    arxiv.org

    Build a pipeline with codet5p-110m-embedding

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
