Real-Time Multimodal Vectors in your MongoDB Cluster

Upgrade your MongoDB cluster with multimodal support, enabling you to handle any type of content. Use your existing drivers for the easiest integration imaginable.


Automatic: Set up once and forget it exists. Your GenAI apps will always be up to date.


Multimodal: Mixpeek supports every file type, including images, audio, video, and text.


Real-Time: Ingestion and parsing for immediate data availability. Always work with fresh data.


Extract tags, text and vectors to build any use case using existing MongoDB syntax.

The standard design pattern for serving non-JSON data to your client is to first store the file in S3, then save its file_url in your transactional database of choice.

While MongoDB does support GridFS for files larger than the 16 MB document limit, it isn't always the most effective option for large binary objects. This is why it's encouraged to take advantage of cheap cold storage, and then simply store the object's URL as a reference in your MongoDB collection, like:

    {
        "s3_url": "https://s3.resume.pdf",
        "filename": "Ethan's Resume",
        "metadata": {}
    }

This allows our client to decide how they want to process the object. However, some issues come up:

  • What if we want to access the contents?
  • What if we want our server to process it before sending to the client?
  • What if we want to do fancy AI on top of the documents, images, video or audio?

This is where multimodal indexing comes in.

How does it work?

Set up a change stream on your MongoDB collection and send each change to Mixpeek's pipeline endpoint, where each change undergoes three steps:

  • Extract: If it's a PDF, the table contents, text and even images are pulled out. Audio gets transcribed, video can be tagged for objects and motion, and images can go through OCR or object detection.

    file_output = mixpeek.extract(file_url="s3://document.pdf")

  • Generate: If it's text, you can instruct the pipeline to use ML to generate a summary or tags.

    from pydantic import BaseModel

    class Authors(BaseModel):
        author_email: str

    class PaperDetails(BaseModel):
        paper_title: str
        author: Authors

    response = mixpeek.generate(
        model={"provider": "GPT", "model": "gpt-3.5-turbo"},
        context=f"Format this document and adhere to the provided JSON format: {file_output}",
        # remaining arguments omitted
    )

  • Embed: Supply your own transformer embeddings or use ours (everything is open source). We'll embed the extracted contents or the raw files using text, video, image or audio encoders.

    embedding = mixpeek.embed(input="hello world")

All of these methods get wrapped up into a pipeline: https://docs.mixpeek.com/pipelines/create

Alternatively, you can construct your own pipeline via workflows: https://docs.mixpeek.com/workflows/create

Multimodal vector replication is triggered through the pipeline invocation endpoint, which can run either an out-of-the-box, opinionated pipeline or your own custom workflow composed of extract, generate and embed methods.

One major point of frustration developers experience is "what happens if I modify the underlying data?" Objects in your MongoDB collection are rarely static; they change often, as do the files in your S3 bucket.

Mixpeek understands inserts vs updates vs deletes and is able to intelligently handle the embeddings by replacing/updating them in real-time.
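
As a rough illustration of the insert/update/delete handling described above, the sketch below maps a MongoDB change event to an action on stored embeddings. The event fields (operationType, documentKey, fullDocument) follow MongoDB's change event format, and collection.watch(full_document="updateLookup") is pymongo's change stream call. The plan_embedding_action and forward_changes helpers, the action names, and the send callable are assumptions made for illustration, not Mixpeek's implementation.

```python
from typing import Any, Callable, Dict, Optional


def plan_embedding_action(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map a MongoDB change event to a hypothetical embedding action.

    Returns None for events that do not affect stored embeddings.
    """
    op = event.get("operationType")
    doc_id = event.get("documentKey", {}).get("_id")

    if op == "delete":
        # The source document is gone, so its embeddings should go too.
        return {"action": "delete", "id": doc_id}

    if op in ("insert", "replace", "update"):
        # On updates, fullDocument is only present when the stream is opened
        # with full_document="updateLookup".
        doc = event.get("fullDocument") or {}
        if "s3_url" not in doc:
            return None  # nothing to extract or embed
        return {"action": "upsert", "id": doc_id, "file_url": doc["s3_url"]}

    return None  # e.g. drop, rename, invalidate


def forward_changes(collection: Any, send: Callable[[Dict[str, Any]], None]) -> None:
    """Forward planned actions to `send`, a hypothetical callable that would
    call the pipeline endpoint. `collection` is expected to be a pymongo Collection."""
    with collection.watch(full_document="updateLookup") as stream:
        for event in stream:
            action = plan_embedding_action(event)
            if action is not None:
                send(action)
```

Treating inserts, replaces and updates as a single upsert keeps the downstream handler idempotent, so replaying an event after a restart would not duplicate embeddings.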

Once we have text, tags, and vectors, the sky is really the limit. We advise writing queries that span these data structures, and MongoDB has you covered.

Here's a MongoDB aggregation pipeline that combines:

  • $vectorSearch: stored vectors queried by K-nearest-neighbors similarity
  • filter: pre-filtering on indexed string and number fields
  • $match: a standard MongoDB query stage applied to the results

    [
      {
        $vectorSearch: { // KNN query
          index: "indexName",
          path: "fieldToSearch",
          queryVector: [0, 1, 2, 3],
          numCandidates: 100, // required; value is illustrative
          limit: 10, // required; value is illustrative
          filter: {
            $and: [
              { freshness: { $eq: "fresh" } },
              { year: { $lt: 1975 } },
            ],
          }, // text and integer (pre-filtering)
        },
      },
      {
        $match: {
          foo: "bar",
        },
      }, // standard mongodb query
    ]
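
For Python clients, the same query can be built from plain dictionaries and passed to pymongo's collection.aggregate(...). The sketch below is a minimal version: the build_pipeline helper and the num_candidates and limit defaults are assumptions, and "indexName" and "fieldToSearch" are the placeholders from the query above.

```python
from typing import Any, Dict, List, Sequence


def build_pipeline(
    query_vector: Sequence[float],
    index: str = "indexName",
    path: str = "fieldToSearch",
    num_candidates: int = 100,  # illustrative value
    limit: int = 10,  # illustrative value
) -> List[Dict[str, Any]]:
    """Build a $vectorSearch stage with a pre-filter, followed by a $match stage."""
    vector_stage = {
        "$vectorSearch": {
            "index": index,
            "path": path,
            "queryVector": list(query_vector),
            "numCandidates": num_candidates,
            "limit": limit,
            # Pre-filter applied before the nearest-neighbor search
            "filter": {
                "$and": [
                    {"freshness": {"$eq": "fresh"}},
                    {"year": {"$lt": 1975}},
                ]
            },
        }
    }
    # Standard MongoDB query applied to the vector search results
    match_stage = {"$match": {"foo": "bar"}}
    return [vector_stage, match_stage]


pipeline = build_pipeline([0, 1, 2, 3])
```

With a pymongo collection in hand, the results would come from list(collection.aggregate(pipeline)). Fields used in the pre-filter need to be declared as filter fields in the vector search index.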

For a more advanced query that enables hybrid search in MongoDB by combining full-text and vector results, see: https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/reciprocal-rank-fusion/
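
The tutorial linked above is built on reciprocal rank fusion, which merges several ranked result lists by summing 1 / (k + rank) for each document across the lists. Here is a minimal pure-Python sketch of that scoring, assuming 1-based ranks and the commonly used constant k = 60. The function name and sample IDs are illustrative, and the tutorial's exact constants and weights may differ.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse ranked lists of document IDs into one list, best first."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute a larger share
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


vector_hits = ["a", "b", "c"]  # e.g. IDs returned by a vector query
text_hits = ["b", "d", "a"]  # e.g. IDs returned by a full-text query
fused = reciprocal_rank_fusion([vector_hits, text_hits])
```

Because only ranks are used, the two result lists do not need comparable score scales, which is what makes this approach convenient for mixing full-text and vector results.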

What about Multimodal Retrieval-Augmented Generation?

Send the query results to an LLM for "reasoning". Mixpeek has a library that lets you structure the output:

    from pydantic import BaseModel

    class Authors(BaseModel):
        author_email: str

    class PaperDetails(BaseModel):
        paper_title: str
        author: Authors

    response = mixpeek.generate(
        model={"provider": "GPT", "model": "gpt-3.5-turbo"},
        context=f"format this document and make sure to respond and adhere to the provided JSON format: {corpus}",
        # remaining arguments omitted
    )

Here we're supplying a corpus to our GPT model and telling it to structure the output in a certain way based on Pydantic schemas.

This returns amazing, structured outputs:

    {
        "author": {"author_email": "shannons@allenai.org"},
        "paper_title": "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"
    }

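To show what adhering to the schema means in practice, here is a stdlib-only sketch that parses a JSON response into the same two shapes and rejects anything that does not match. It uses dataclasses as a stand-in for the Pydantic models so that it runs without dependencies. The parse_paper_details helper is hypothetical and is not part of Mixpeek's library.

```python
import json
from dataclasses import dataclass


@dataclass
class Authors:
    author_email: str


@dataclass
class PaperDetails:
    paper_title: str
    author: Authors


def parse_paper_details(raw: str) -> PaperDetails:
    """Parse a JSON string into PaperDetails, raising ValueError on mismatch."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    try:
        author = Authors(author_email=data["author"]["author_email"])
        details = PaperDetails(paper_title=data["paper_title"], author=author)
    except (KeyError, TypeError) as exc:
        raise ValueError(f"response does not match schema: {exc!r}") from exc
    # Dataclasses do not enforce types at runtime, so check the leaves by hand
    if not isinstance(details.paper_title, str) or not isinstance(author.author_email, str):
        raise ValueError("response fields must be strings")
    return details


raw = '{"author": {"author_email": "shannons@allenai.org"}, "paper_title": "LayoutParser"}'
details = parse_paper_details(raw)
```

Validating before use means a malformed model response fails loudly at the boundary instead of surfacing later as a missing key.
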
Try these methods in a completely free AI playground: https://mixpeek.com/start

Benefits of Mixpeek & MongoDB

  • Consistent: Leveraging MongoDB's change streams, every write is causally consistent
  • Multimodal: One query that spans multiple indexes and embedding spaces
  • Durable: Mixpeek ensures the entire process per write has guaranteed execution
  • Atomic: If one step fails in the pipeline, nothing gets written, so you don't have any half-written data
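
The atomic guarantee in the last bullet can be pictured as staging every step's result in memory and committing only when all steps succeed. The sketch below illustrates that idea with placeholder step functions standing in for extract, generate and embed. It is an assumption about the general technique, not a description of Mixpeek's internals, and run_atomically is a hypothetical name.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict[str, Any]], Any]]


def run_atomically(
    steps: List[Step], source: Dict[str, Any], store: List[Dict[str, Any]]
) -> bool:
    """Run each step in order; append one record to `store` only if all succeed."""
    staged: Dict[str, Any] = dict(source)
    try:
        for name, step in steps:
            staged[name] = step(staged)  # later steps can read earlier outputs
    except Exception:
        return False  # nothing was written, so there is no half-written data
    store.append(staged)
    return True


steps: List[Step] = [
    ("text", lambda s: f"contents of {s['file_url']}"),  # stand-in for extract
    ("tags", lambda s: sorted(set(s["text"].split()))),  # stand-in for generate
    ("vector", lambda s: [float(len(s["text"]))]),  # stand-in for embed
]
store: List[Dict[str, Any]] = []
ok = run_atomically(steps, {"file_url": "s3://document.pdf"}, store)
```

A real system would commit to a database inside a transaction rather than to a list, but the shape is the same: one write at the end, or none at all.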

What else can you build?

Much, much more. The sky is the limit with multimodal AI.

We'll even build a FREE multimodal proof of concept for your business. Just schedule a call.

Become a multimodal maker.

Upgrade your software with multimodal understanding in one line of code.