codebert-base
by microsoft
Pre-trained model for code understanding and generation
microsoft/codebert-base
mixpeek://document_extractor@v1/microsoft_codebert_base_v1
Overview
CodeBERT is a bimodal pre-trained model for programming language (PL) and natural language (NL). It supports code search, code documentation generation, and code-to-code translation across six programming languages.
On Mixpeek, CodeBERT extracts and embeds code blocks from documents, enabling semantic search over code content: you can find code snippets by describing what they do in natural language.
Architecture
RoBERTa-base architecture (12 layers, 768-dimensional hidden states, 12 attention heads), pre-trained on the CodeSearchNet dataset with Masked Language Modeling and Replaced Token Detection objectives over both natural-language and programming-language data.
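The model emits one 768-dimensional hidden state per token; to search over whole snippets, those per-token vectors are reduced to a single snippet-level embedding. The sketch below shows mean pooling, one common convention (taking the [CLS] token's vector is another); it is an illustration, not Mixpeek's internal implementation.

```typescript
// Sketch: average per-token CodeBERT hidden states (768-dim each for
// codebert-base) into one snippet-level embedding. Real token vectors
// come from the model's last hidden layer; any number[][] works here.

function meanPool(tokenEmbeddings: number[][]): number[] {
  const dim = tokenEmbeddings[0].length; // 768 for codebert-base
  const pooled = new Array(dim).fill(0);
  for (const tok of tokenEmbeddings) {
    for (let i = 0; i < dim; i++) pooled[i] += tok[i];
  }
  // Divide the summed components by the token count to get the mean.
  return pooled.map((v) => v / tokenEmbeddings.length);
}
```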
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/docs.pdf" },
  feature_extractors: [
    {
      name: "code_extraction",
      version: "v1",
      params: {
        model_id: "microsoft/codebert-base",
      },
    },
  ],
});

Capabilities
- Natural language code search
- Code documentation generation
- 6 programming languages: Python, Java, JavaScript, PHP, Ruby, Go
- 768-dimensional code embeddings
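Natural-language code search works by embedding the query and every indexed snippet with the same model, then ranking snippets by vector similarity. A minimal sketch, using cosine similarity and tiny stand-in vectors in place of real 768-dimensional CodeBERT embeddings (the `IndexedSnippet` shape and `search` helper are illustrative, not part of the Mixpeek SDK):

```typescript
// Sketch of semantic code search over precomputed embeddings.
interface IndexedSnippet {
  code: string;
  embedding: number[];
}

// Cosine similarity: dot product normalized by both vector magnitudes.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank snippets by similarity to the query embedding, highest first.
function search(
  queryEmbedding: number[],
  index: IndexedSnippet[],
  topK = 3,
): IndexedSnippet[] {
  return [...index]
    .sort(
      (a, b) =>
        cosine(queryEmbedding, b.embedding) -
        cosine(queryEmbedding, a.embedding),
    )
    .slice(0, topK);
}
```

In production the query is embedded at request time by the same CodeBERT model that embedded the snippets, so descriptions like "parse a config file" land near snippets that do exactly that.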
Use Cases on Mixpeek
Specification
Research Paper
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
arxiv.org

Build a pipeline with codebert-base
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.