What is Mixpeek?

You likely have a combination of unstructured and structured data that spans multiple modalities: documents, images, video, and audio, not the least of which is text.

Mixpeek is a multimodal ETL pipeline that connects to your database, automatically extracts, embeds, and generates outputs, then sends the results right back into your database.

Here’s an example output:

JSON
{
  "text": "lorem ipsum",
  "tags": ["lorem", "ipsum"],
  "embedding": [0, 1, 2, 3]
}

As you can see, the output spans text (extract), tags (generate), and embeddings (embed).

What does it enable?

Since you’ll have fresh vectors, metadata, and extracted contents, you can design queries that span all your use cases (see the sketch after this list):

  • RAG (Retrieval Augmented Generation)
  • Recommendation
  • Hybrid Search
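
For example, a hybrid search combines a metadata filter (tags) with vector similarity (embeddings) over outputs shaped like the example above. The sketch below is purely illustrative and runs in memory with numpy; in practice you would issue the equivalent query against your own database once the pipeline described below is populating it.

Python
import numpy as np

# Hypothetical documents shaped like Mixpeek's example output above.
docs = [
    {"text": "lorem ipsum", "tags": ["lorem", "ipsum"], "embedding": [0, 1, 2, 3]},
    {"text": "dolor sit amet", "tags": ["dolor"], "embedding": [3, 2, 1, 0]},
]

def hybrid_search(query_embedding, required_tag, docs, top_k=5):
    """Filter by tag (metadata), then rank by cosine similarity (vectors)."""
    query = np.asarray(query_embedding, dtype=float)
    candidates = [d for d in docs if required_tag in d["tags"]]

    def cosine(doc):
        vec = np.asarray(doc["embedding"], dtype=float)
        return float(vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query) + 1e-9))

    return sorted(candidates, key=cosine, reverse=True)[:top_k]

results = hybrid_search([0, 1, 2, 3], "lorem", docs)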

Prerequisites

  1. First, install the Python SDK (unless you prefer to use the HTTP endpoints directly): pip install mixpeek
  2. Then register an API key at mixpeek.com/start
  3. Initialize your client:
from mixpeek.client import Mixpeek

mixpeek = Mixpeek(api_key="API_KEY")

First create a DB connection

This is how we define our source storage. You can only have one storage connection per pipeline, but each pipeline can listen on multiple collections or tables.

Python
mixpeek.user.update(
    connections=[
        {
            "engine": "mongodb",
            "host": "sandbox.mhsby.mongodb.net",
            "port": 27017,  # MongoDB's default port
            "database": "",  # fill in your database name
            "username": "",  # and credentials
            "password": "",
        }
    ]
)

Connection passwords are encrypted at rest using symmetric encryption, and all transmissions occur over TLS.

Create your Pipeline

Pipelines are where the multimodal logic lives. Each pipeline automatically pulls from the active connection you’ve instantiated.

Python
pipeline = mixpeek.pipeline.create(
    source_destination_mappings=[
        {
            "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
            "source": {"field": "resume_url", "type": "file_url", "settings": {}},
            "destination": {
                "collection": "resume_embeddings",
                "field": "text",
                "embedding": "embedding",
            },
        }
    ]
)

This should return a pipeline_id like: djkh12

This defines how you want the replication to work. It says: whenever a new document arrives with a resume_url field, process it as a file_url, create embeddings using sentence-transformers/all-MiniLM-L6-v2, and send the outputs to the new collection resume_embeddings.
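
To see the mapping in action, you could insert a document containing the mapped field into the source database. The snippet below is a sketch using pymongo; the source collection name ("resumes") and the connection credentials are illustrative assumptions, not values from the pipeline above.

Python
from pymongo import MongoClient

# Connect to the same MongoDB deployment registered as the Mixpeek connection.
client = MongoClient("mongodb+srv://<username>:<password>@sandbox.mhsby.mongodb.net")
db = client["your_database"]

# A new document with a resume_url field is what triggers the pipeline:
# the file is fetched, chunked, embedded with all-MiniLM-L6-v2, and the
# results are written to the resume_embeddings collection.
db["resumes"].insert_one({
    "name": "Jane Doe",
    "resume_url": "https://example.com/jane-doe-resume.pdf",
})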

Sit back and enjoy

Now we have fresh vectors, metadata, and text to design rich queries on top of!

JSON
{
  "text": "3. Analyses by segment 3.1 Operating segments Revenue and results",
  "embedding": [
    0.013505205512046814,
    -0.047882888466119766,
    0.07246698439121246,
    ...
  ],
  "metadata": {
    "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "languages": [
      "eng"
    ],
    "page_name": "OS-Rev.and results 30 06 2023",
    "page_number": 1
  },
  "parent_id": "660c54a3cf034216d03bf1db"
}

You can use the parent_id to merge the embedding chunks at query time.
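
For example, you can group the chunk documents by parent_id and stitch their text back together. This sketch queries MongoDB directly with pymongo; the connection details are illustrative.

Python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<username>:<password>@sandbox.mhsby.mongodb.net")
collection = client["your_database"]["resume_embeddings"]

# Collect every chunk that came from the same source document,
# ordered by page number, and join the text back together.
merged = collection.aggregate([
    {"$sort": {"metadata.page_number": 1}},
    {"$group": {"_id": "$parent_id", "chunks": {"$push": "$text"}}},
])

for doc in merged:
    full_text = " ".join(doc["chunks"])
    print(doc["_id"], full_text[:80])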

Extra credit:

Leverage the extract, embed, and generate methods to design a custom pipeline via the workflow service:

Python
mixpeek.extract()
mixpeek.generate()
mixpeek.embed()
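
The call signatures below are hypothetical (the actual SDK parameters may differ); they only sketch how the three primitives could be chained into a custom workflow and written back as a single document.

Python
# Hypothetical parameter names, shown for illustration only.
text = mixpeek.extract(file_url="https://example.com/report.pdf")  # pull raw text
tags = mixpeek.generate(input=text)                                 # generate metadata
vector = mixpeek.embed(input=text)                                  # create an embedding

document = {"text": text, "tags": tags, "embedding": vector}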

Mixpeek cloud is currently in private beta. To use the API, you need to register an API key, and an engineer will contact you.