codebert-base
by microsoft
Pre-trained model for code understanding and generation
microsoft/codebert-base
mixpeek://document_extractor@v1/microsoft_codebert_base_v1
Overview
CodeBERT is a bimodal pre-trained model for programming languages and natural language. It supports code search, code documentation generation, and code-to-code translation across 6 programming languages.
On Mixpeek, CodeBERT extracts and embeds code blocks from documents. This enables semantic search over code content, so you can find code snippets by describing what they do in natural language.
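Semantic search over embedded code typically comes down to comparing a query vector against stored snippet vectors. The following is a minimal sketch of that ranking step using cosine similarity. The `EmbeddedSnippet` shape, the function names, and the toy 3-dimensional vectors are assumptions made for illustration; it does not show how Mixpeek ranks results internally.

```typescript
// Hypothetical shape: a snippet id paired with its precomputed embedding.
interface EmbeddedSnippet {
  id: string;
  vector: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every snippet against the query and sort best match first.
function rankSnippets(
  query: number[],
  snippets: EmbeddedSnippet[],
): { id: string; score: number }[] {
  return snippets
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.vector) }))
    .sort((x, y) => y.score - x.score);
}

// Toy 3-dimensional vectors stand in for real 768-dimensional embeddings.
const ranked = rankSnippets([1, 0, 0], [
  { id: "parse_json", vector: [0, 1, 0] },
  { id: "read_file", vector: [0.9, 0.1, 0] },
]);
```

Here `ranked[0].id` is `"read_file"`, because its vector points in nearly the same direction as the query. In practice the query vector would come from embedding the natural-language description with the same model used for the snippets.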
Architecture
RoBERTa-base architecture (12 layers, 768-dimensional hidden states, 12 attention heads), pre-trained on the CodeSearchNet dataset with Masked Language Modeling (MLM) and Replaced Token Detection (RTD) objectives over both natural language (NL) and programming language (PL) data.
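To make the figures and the two objectives above concrete, here is a small sketch. It derives the per-head dimension from the stated sizes and shows, on toy tokens, what the inputs and labels for MLM and RTD look like. The function names and the explicit choice of positions are assumptions for illustration; the actual pre-training samples positions randomly and uses generator models to produce the replaced tokens.

```typescript
// Sizes stated in the text; headDim is derived from them.
const arch = { layers: 12, hiddenSize: 768, attentionHeads: 12 };
const headDim = arch.hiddenSize / arch.attentionHeads; // 64

// MLM input: hide the tokens at the chosen positions behind a mask token.
// The model is then trained to predict the original token at each position.
function maskTokens(
  tokens: string[],
  positions: number[],
  maskToken: string = "<mask>",
): string[] {
  const chosen = new Set(positions);
  return tokens.map((t, i) => (chosen.has(i) ? maskToken : t));
}

// RTD labels: true where the corrupted sequence differs from the original.
// The model is trained to predict this label for every position.
function replacedLabels(original: string[], corrupted: string[]): boolean[] {
  if (original.length !== corrupted.length) {
    throw new Error("sequences must have the same length");
  }
  return original.map((t, i) => t !== corrupted[i]);
}

const tokens = ["def", "add", "(", "a", ",", "b", ")"];
const masked = maskTokens(tokens, [1]);
const labels = replacedLabels(tokens, ["def", "sub", "(", "a", ",", "b", ")"]);
```

In this toy run, `masked` hides only `add`, and `labels` is `true` only at index 1, where `add` was swapped for `sub`.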
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "code_extraction",
    version: "v1",
    parameters: { model_id: "microsoft/codebert-base" },
  },
});
Capabilities
- Natural language code search
- Code documentation generation
- 6 programming languages: Python, Java, JavaScript, PHP, Ruby, Go
- 768-dimensional code embeddings
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| CodeSearchNet | MRR | 67.2 | Feng et al., 2020 — Table 3 |
| Clone Detection (BigCloneBench) | F1 | 96.5% | Feng et al., 2020 — Table 5 |
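For readers unfamiliar with the two metrics in the table, the sketch below shows how MRR and F1 are commonly computed. The function names and sample inputs are illustrative assumptions; this is not the evaluation code behind the reported scores. Both functions return values between 0 and 1, which are often multiplied by 100 when reported.

```typescript
// ranks[i] is the 1-based rank of the correct snippet for query i,
// or null when the correct snippet was not retrieved at all.
function meanReciprocalRank(ranks: (number | null)[]): number {
  if (ranks.length === 0) return 0;
  let sum = 0;
  for (const r of ranks) {
    if (r === null) continue; // a miss contributes 0
    if (r < 1) throw new Error("ranks are 1-based");
    sum += 1 / r;
  }
  return sum / ranks.length;
}

// F1 is the harmonic mean of precision and recall, written here in
// terms of true positives, false positives, and false negatives.
function f1Score(tp: number, fp: number, fn: number): number {
  const denominator = 2 * tp + fp + fn;
  return denominator === 0 ? 0 : (2 * tp) / denominator;
}

// Four queries: correct result at rank 1, 2, 4, and one miss.
const mrr = meanReciprocalRank([1, 2, 4, null]); // (1 + 0.5 + 0.25 + 0) / 4 = 0.4375

// 8 clone pairs found correctly, 2 false alarms, 2 missed pairs.
const f1 = f1Score(8, 2, 2); // 16 / 20 = 0.8
```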
Research Paper
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
arxiv.org
Build a pipeline with codebert-base
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.