codebert-base
by microsoft
Pre-trained model for code understanding and generation
microsoft/codebert-base
mixpeek://document_extractor@v1/microsoft_codebert_base_v1
Overview
CodeBERT is a bimodal pre-trained model for programming languages and natural language. It supports code search, code documentation generation, and code-to-code translation across 6 programming languages.
On Mixpeek, CodeBERT extracts and embeds code blocks from documents, enabling semantic search over code content — find code snippets by describing what they do in natural language.
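The search mechanism behind this is straightforward: CodeBERT maps both the natural-language query and each extracted code block to a vector, and results are ranked by cosine similarity. The sketch below illustrates the ranking step only; the vectors are tiny stand-ins, not real 768-dimensional CodeBERT embeddings, and the snippet IDs are invented for illustration.

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank snippet embeddings against a query embedding, most similar first.
function rankByCosine(
  query: number[],
  snippets: { id: string; vec: number[] }[]
): { id: string; score: number }[] {
  return snippets
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.vec) }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors standing in for embeddings of a query and two code snippets.
const queryVec = [0.9, 0.1, 0.0];
const ranked = rankByCosine(queryVec, [
  { id: "parse_json", vec: [0.1, 0.9, 0.1] },
  { id: "http_get", vec: [0.85, 0.2, 0.05] },
]);
console.log(ranked[0].id); // "http_get" — its vector points closest to the query
```

In production the embedding step is handled by the extractor; only the similarity ranking is shown here.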
Architecture
CodeBERT uses the RoBERTa-base architecture (12 layers, 768-dimensional hidden states, 12 attention heads) and is pre-trained on the CodeSearchNet dataset with Masked Language Modeling (MLM) and Replaced Token Detection (RTD) objectives over both natural-language and programming-language tokens.
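A back-of-envelope calculation shows how this configuration lands near RoBERTa-base's roughly 125M parameters. The sketch counts only the embedding table, attention projections, and feed-forward weights, ignoring biases, layer norms, and position embeddings, so it is an approximation under those assumptions.

```typescript
// RoBERTa-base configuration: 12 layers, hidden size 768, FFN size 3072,
// BPE vocabulary of 50,265 tokens.
const vocab = 50265, hidden = 768, layers = 12, ffn = 3072;

const embeddingParams = vocab * hidden;   // token embedding table
const attnParams = 4 * hidden * hidden;   // Q, K, V, and output projections
const ffnParams = 2 * hidden * ffn;       // up- and down-projection
const perLayer = attnParams + ffnParams;

const total = embeddingParams + layers * perLayer;
console.log((total / 1e6).toFixed(0) + "M parameters"); // ≈ 124M
```

The result (~124M) matches the commonly quoted ~125M figure once biases and layer-norm parameters are added back.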
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest a document into a collection, running the code_extraction
// extractor with CodeBERT to embed any code blocks found in the file.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/docs.pdf" },
  feature_extractors: [{
    name: "code_extraction",
    version: "v1",
    params: {
      model_id: "microsoft/codebert-base"
    }
  }]
});
Capabilities
- Natural language code search
- Code documentation generation
- 6 programming languages: Python, Java, JavaScript, PHP, Ruby, Go
- 768-dimensional code embeddings
Use Cases on Mixpeek
Specification
Research Paper
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
arxiv.org
Build a pipeline with codebert-base
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.