
    codebert-base

    by microsoft

    Pre-trained model for code understanding and generation

    513K downloads/month
    284 likes
    125M params
    Identifiers
    Model ID
    microsoft/codebert-base
    Feature URI
    mixpeek://document_extractor@v1/microsoft_codebert_base_v1

    Overview

    CodeBERT is a bimodal pre-trained model for programming languages and natural language. It supports code search, code documentation generation, and code-to-code translation across 6 programming languages.

    On Mixpeek, CodeBERT extracts and embeds code blocks from documents, enabling semantic search over code content — find code snippets by describing what they do in natural language.
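Under the hood, this kind of semantic search reduces to nearest-neighbor ranking over embedding vectors: the natural-language query is embedded into the same space as the indexed code blocks, and results are ordered by cosine similarity. A minimal sketch with toy 4-dimensional vectors (real CodeBERT embeddings are 768-dimensional; the ids and values here are illustrative, not Mixpeek internals):

```typescript
// An indexed code block with its embedding vector.
type Doc = { id: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank indexed code blocks by similarity to the query embedding.
function search(query: number[], docs: Doc[]): Doc[] {
  return [...docs].sort(
    (x, y) => cosine(query, y.embedding) - cosine(query, x.embedding)
  );
}

const docs: Doc[] = [
  { id: "parse_json", embedding: [0.9, 0.1, 0.0, 0.2] },
  { id: "sort_list",  embedding: [0.0, 0.8, 0.6, 0.1] },
];
// Toy embedding of the query "read a JSON file".
const query = [0.85, 0.15, 0.05, 0.25];
console.log(search(query, docs)[0].id); // → "parse_json"
```

In production this brute-force scan is replaced by an approximate nearest-neighbor index, but the ranking criterion is the same.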

    Architecture

    CodeBERT uses the RoBERTa-base architecture (12 layers, 768-dimensional hidden states, 12 attention heads), pre-trained on the CodeSearchNet dataset with Masked Language Modeling and Replaced Token Detection objectives over both natural-language and programming-language tokens.
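Replaced Token Detection trains a discriminator to decide, for every position, whether the token is original or was swapped in by a small generator. The target labels can be sketched as follows (a toy illustration of the labeling step only, not the actual training code):

```typescript
// Given the original token sequence and a corrupted copy (some tokens
// replaced by a generator), the RTD discriminator target is a binary
// label per position: 1 if the token was replaced, 0 if it is original.
function rtdLabels(original: string[], corrupted: string[]): number[] {
  return corrupted.map((tok, i) => (tok === original[i] ? 0 : 1));
}

const original  = ["def", "add", "(", "a", ",", "b", ")"];
const corrupted = ["def", "sub", "(", "a", ",", "x", ")"];
console.log(rtdLabels(original, corrupted)); // → [0, 1, 0, 0, 0, 1, 0]
```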

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    
    // Authenticate with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    
    // Ingest a document into a collection; the code_extraction feature
    // extractor pulls out code blocks and embeds them with CodeBERT.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/docs.pdf" },
      feature_extractors: [{
        name: "code_extraction",
        version: "v1",
        params: {
          model_id: "microsoft/codebert-base"
        }
      }]
    });

    Capabilities

    • Natural language code search
    • Code documentation generation
    • 6 programming languages: Python, Java, JavaScript, PHP, Ruby, Go
    • 768-dimensional code embeddings

    Use Cases on Mixpeek

    • Search code repositories by natural language description
    • Extract and index code blocks from technical documentation
    • Code similarity detection for plagiarism or deduplication
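The deduplication use case follows directly from the embeddings: two code blocks whose vectors are nearly parallel are likely duplicates or close variants. A minimal sketch with toy 3-dimensional vectors and an assumed similarity threshold (the ids and 0.95 cutoff are illustrative, not Mixpeek defaults):

```typescript
// Flag near-duplicate code blocks: any pair whose embedding cosine
// similarity exceeds the threshold is reported as a candidate duplicate.
function findDuplicates(
  docs: { id: string; embedding: number[] }[],
  threshold = 0.95
): [string, string][] {
  const cosine = (a: number[], b: number[]): number => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };
  const pairs: [string, string][] = [];
  for (let i = 0; i < docs.length; i++) {
    for (let j = i + 1; j < docs.length; j++) {
      if (cosine(docs[i].embedding, docs[j].embedding) > threshold) {
        pairs.push([docs[i].id, docs[j].id]);
      }
    }
  }
  return pairs;
}

const blocks = [
  { id: "utils.py#L10",  embedding: [0.7, 0.7, 0.1] },
  { id: "helpers.py#L3", embedding: [0.71, 0.69, 0.12] },
  { id: "main.py#L1",    embedding: [0.1, 0.0, 0.99] },
];
console.log(findDuplicates(blocks)); // → [["utils.py#L10", "helpers.py#L3"]]
```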

    Specification

    Framework: HF
    Organization: microsoft
    Feature: Code Extraction
    Output: code + language
    Modalities: document
    Retriever: Code Search
    Parameters: 125M
    License: MIT
    Downloads/mo: 513K
    Likes: 284

    Research Paper

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages (arxiv.org)

    Build a pipeline with codebert-base

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
