The standard design pattern for serving non-JSON data to your client is to first store the file in S3, then save its URL (the file_url) in your transactional database of choice.
While MongoDB does support GridFS for files above the 16 MB document limit, it's not always an effective fit for large media. This is why it's encouraged to take advantage of cheap object storage, and then simply keep the object's URL as a reference in your MongoDB collection, like:
{
"s3_url": "https://s3.resume.pdf",
"filename": "Ethan's Resume",
"metadata": {}
}
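As a minimal sketch of this pattern, the helper below uploads a file and then records its URL. It assumes a boto3-style S3 client and a PyMongo-style collection are passed in. The function name store_file_reference, the bucket and key values, and the virtual-hosted URL layout are illustrative assumptions, not part of any Mixpeek API.

```python
from typing import Any, Optional


def store_file_reference(
    s3_client: Any,
    collection: Any,
    local_path: str,
    bucket: str,
    key: str,
    filename: str,
    metadata: Optional[dict] = None,
) -> dict:
    """Upload a file to S3, then store a reference document in MongoDB.

    s3_client is expected to behave like boto3.client("s3") and
    collection like a pymongo Collection (both are assumptions).
    """
    # 1. Put the raw bytes in object storage first.
    s3_client.upload_file(local_path, bucket, key)

    # 2. Only after the upload succeeds, record where the object lives.
    doc = {
        "s3_url": f"https://{bucket}.s3.amazonaws.com/{key}",
        "filename": filename,
        "metadata": metadata or {},
    }
    collection.insert_one(doc)
    return doc
```

Uploading before inserting means a failed upload never leaves a MongoDB document pointing at a missing object.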
Storing only a reference lets our client decide how they want to process the object. However, some issues come up:
- What if we want to access the contents?
- What if we want our server to process it before sending to the client?
- What if we want to do fancy AI on top of the documents, images, video or audio?
This is where multimodal indexing comes in.
How does it work?
Set up a change stream on your MongoDB collection, and send each change to Mixpeek's pipeline endpoint, where each one goes through three steps:
- Extract: If it's a PDF, the tables, text, and even images are pulled out. Audio gets transcribed, video can get object and motion tagging, and images can get OCR or object detection.
file_output = mixpeek.extract(file_url="s3://document.pdf")
- Generate: If it's text, you can instruct the pipeline to use ML to generate a summary or tags
from pydantic import BaseModel

class Authors(BaseModel):
    author_email: str

class PaperDetails(BaseModel):
    paper_title: str
    author: Authors

response = mixpeek.generate(
    model={"provider": "GPT", "model": "gpt-3.5-turbo"},
    response_format=PaperDetails,
    context=f"Format this document and adhere to the provided JSON format: {file_output}",
)
- Embed: Supply your own transformer embeddings or use ours (everything is open source). We'll embed the extracted contents or the raw files using text, video, image, or audio encoders.
embedding = mixpeek.embed(input="hello world")
All of these methods get wrapped up into a pipeline: https://docs.mixpeek.com/pipelines/create
Alternatively, you can construct your own pipeline via workflows: https://docs.mixpeek.com/workflows/create
Multimodal vector replication is triggered through the pipeline invocation endpoint, which can run either an out-of-the-box, opinionated pipeline or your own custom workflow composed of the extract, generate, and embed methods.
One major point of frustration for developers is "what happens if I modify the underlying data?" Objects in MongoDB are rarely static; they change often, as does your S3 bucket.
Mixpeek understands inserts vs updates vs deletes and is able to intelligently handle the embeddings by replacing/updating them in real-time.
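To make the insert/update/delete distinction concrete, here is a rough sketch of routing change events to a vector store. It is an illustration only, not Mixpeek's implementation: route_change_events, upsert_embedding, and delete_embedding are hypothetical names, and the events are assumed to follow the shape of MongoDB change stream documents (operationType, documentKey, fullDocument).

```python
from typing import Any, Callable, Dict, Iterable


def route_change_events(
    change_stream: Iterable[Dict[str, Any]],
    upsert_embedding: Callable[[Any, Dict[str, Any]], None],
    delete_embedding: Callable[[Any], None],
) -> Dict[str, int]:
    """Apply MongoDB change events to a vector store and count what happened."""
    counts = {"upserted": 0, "deleted": 0, "skipped": 0}
    for event in change_stream:
        op = event.get("operationType")
        doc_id = event.get("documentKey", {}).get("_id")
        if op in ("insert", "update", "replace"):
            # fullDocument is only present on updates when the stream is
            # opened with full_document="updateLookup".
            full_doc = event.get("fullDocument")
            if full_doc is None:
                counts["skipped"] += 1
                continue
            # Inserts and updates both overwrite the stored embedding.
            upsert_embedding(doc_id, full_doc)
            counts["upserted"] += 1
        elif op == "delete":
            delete_embedding(doc_id)
            counts["deleted"] += 1
        else:
            # e.g. drop, rename, invalidate: nothing to embed.
            counts["skipped"] += 1
    return counts
```

With PyMongo, the stream returned by collection.watch(full_document="updateLookup") can be passed as change_stream; the two callbacks would wrap whatever re-embeds a document and removes its vectors.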
Cool, so now what can we do? Hybrid search.
Once we have vectors, tags, and embeddings, the sky is really the limit. We advise writing queries that span these data structures, and MongoDB has you covered.
Here's a MongoDB query that combines:
- $vectorSearch: stored vectors with K nearest neighbors similarity
- filter: pre-filtering on fields indexed alongside the vectors
- $match: a standard MongoDB query on the results
Full-text search, backed by Lucene's inverted index, isn't part of this query; the hybrid search tutorial linked after it covers fusing it with vector results.
[
  {
    $vectorSearch: { // KNN query
      index: "indexName",
      path: "fieldToSearch",
      queryVector: [0, 1, 2, 3],
      numCandidates: 100, // required for approximate search; value is illustrative
      limit: 10, // required; value is illustrative
      filter: {
        $and: [
          {
            freshness: {
              $eq: "fresh",
            },
            year: {
              $lt: 1975,
            },
          },
        ], // text and integer (pre-filtering)
      },
    },
  },
  {
    $match: {
      foo: "bar",
    },
  }, // standard mongodb query
]
For a more advanced query that enables hybrid search in MongoDB, see: https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/reciprocal-rank-fusion/
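The linked tutorial is built on reciprocal rank fusion. As a rough client-side illustration of the idea (not the aggregation pipeline from the tutorial), the sketch below merges ranked lists of document ids, scoring each id as the sum of 1 / (k + rank) across lists. The function name and the k = 60 constant are assumptions; 60 is a commonly used default.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order.
    return sorted(scores, key=lambda d: scores[d], reverse=True)


vector_hits = ["a", "b", "c"]  # e.g. ids from a vector query
text_hits = ["b", "d", "a"]    # e.g. ids from a full-text query
print(reciprocal_rank_fusion([vector_hits, text_hits]))  # ['b', 'a', 'd', 'c']
```

Because only ranks are used, the two result lists do not need comparable scores, which is what makes this approach convenient for mixing vector and full-text results.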
What about Multimodal Retrieval-Augmented Generation?
Send the query results to an LLM for "reasoning". Mixpeek has a library that lets you structure the output:
class Authors(BaseModel):
    author_email: str

class PaperDetails(BaseModel):
    paper_title: str
    author: Authors

response = mixpeek.generate(
    model={"provider": "GPT", "model": "gpt-3.5-turbo"},
    response_format=PaperDetails,
    context=f"format this document and make sure to respond and adhere to the provided JSON format: {corpus}",
    messages=[],
    settings={"temperature": 0.5},
)
Here we're supplying a corpus to our GPT model and telling it to structure the output according to the Pydantic schemas.
This returns structured output:
{
    "author": {"author_email": "shannons@allenai.org"},
    "paper_title": "LayoutParser: A Unified Toolkit for Deep Learning Based "
                   "Document Image Analysis",
}
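The corpus passed to the generate call has to come from somewhere. One simple way to assemble it from query results is sketched below, assuming each result document carries its extracted text in a field named text. The field name, the build_corpus helper, and the character budget are all hypothetical.

```python
from typing import Any, Dict, Iterable


def build_corpus(
    results: Iterable[Dict[str, Any]],
    text_field: str = "text",
    max_chars: int = 4000,
) -> str:
    """Join retrieved passages into one context string within a size budget."""
    parts = []
    used = 0
    for i, doc in enumerate(results, start=1):
        text = str(doc.get(text_field, "")).strip()
        if not text:
            continue  # nothing to show the model for this hit
        chunk = f"[{i}] {text}"
        # Stop before exceeding the budget (+2 accounts for the separator).
        cost = len(chunk) + (2 if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(chunk)
        used += cost
    return "\n\n".join(parts)
```

Numbering each passage by its position in the results keeps the most relevant hits first and gives the model something to cite.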
There's a completely free AI playground for trying these methods: https://mixpeek.com/start
Benefits of Mixpeek & MongoDB
- Consistent: Leveraging MongoDB's change streams, every write is causally consistent
- Multimodal: One query that spans multiple indexes and embedding spaces
- Durable: Mixpeek ensures the entire process per write has guaranteed execution
- Atomic: If one step fails in the pipeline, nothing gets written, so you don't have any half-written data
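The all-or-nothing behaviour in the last point can be pictured with a small sketch: run every step against a staged copy, and write only once all of them succeed. This illustrates the idea under stated assumptions and is not how Mixpeek implements it; run_pipeline_atomically, the step names, and the write callback are hypothetical.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

Step = Tuple[str, Callable[[Dict[str, Any]], Any]]


def run_pipeline_atomically(
    doc: Dict[str, Any],
    steps: Sequence[Step],
    write: Callable[[Dict[str, Any]], None],
) -> bool:
    """Run every step, then write once; write nothing if any step fails."""
    staged: Dict[str, Any] = dict(doc)  # work on a copy, never the original
    for name, step in steps:
        try:
            # Each step sees the outputs of the steps before it.
            staged[name] = step(staged)
        except Exception:
            return False  # abort before anything is persisted
    write(staged)
    return True
```

A caller would pass steps such as ("extract", ...), ("generate", ...), and ("embed", ...), with write wrapping the single database write at the end.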
What else can you build?
- Video Understanding Platforms: https://learn.mixpeek.com/semantic-video-understanding/
- Digital Asset Management: https://learn.mixpeek.com/intelligent-dam/
- eCommerce Search: https://learn.mixpeek.com/visual-shopping/
- Financial Analysis: https://github.com/mixpeek/use-cases/tree/master/2023-market-outlooks
And much, much more. The sky is the limit with multimodal AI.