Introduction
We create files every day: pictures, video, audio, documents, and the list goes on. The volume is compounded when we account for computer-generated files. But where do all these files go, and what happens to them? Typically you need to associate them with metadata such as tags, titles, and descriptions. Creating this metadata is manually intensive, and it then has to be maintained as your files and application design patterns evolve.
Historically, file storage and retrieval has forced developers to save these binary files in a cold object store such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. These stores are often backed by hard disks: extremely cheap, but slow and dumb. Knowing this, developers typically add a database layer where the metadata is stored alongside the file URLs. Combining the two lets you layer database, or even search engine, retrieval capabilities over the title, description, tags, and so on, and tie each match back to the file itself.
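To make the traditional pattern concrete, here is a minimal sketch of a metadata layer that sits next to an object store. It uses Python's built-in sqlite3 module with an in-memory database; the table layout, the `add_file` and `find_files` helpers, and the `s3://example-bucket/...` URLs are all assumptions made up for illustration, not part of any particular product.

```python
import json
import sqlite3

# In-memory database for illustration; a real deployment would use a persistent store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE files (
        id INTEGER PRIMARY KEY,
        file_url TEXT NOT NULL,   -- where the binary lives in the object store
        title TEXT,
        description TEXT,
        tags TEXT                 -- JSON-encoded list of tags
    )
    """
)

def add_file(file_url, title, description, tags):
    """Store the object-store URL next to its hand-written metadata."""
    conn.execute(
        "INSERT INTO files (file_url, title, description, tags) VALUES (?, ?, ?, ?)",
        (file_url, title, description, json.dumps(tags)),
    )

def find_files(term):
    """Return URLs whose title, description, or tags contain the term."""
    pattern = f"%{term.lower()}%"
    rows = conn.execute(
        """
        SELECT file_url FROM files
        WHERE lower(title) LIKE ? OR lower(description) LIKE ? OR lower(tags) LIKE ?
        ORDER BY id
        """,
        (pattern, pattern, pattern),
    )
    return [row[0] for row in rows]

# Hypothetical files and metadata, entered by hand.
add_file("s3://example-bucket/prescription.pdf", "Prescription",
         "Vet prescription for a dog", ["veterinarian", "medical", "dog"])
add_file("s3://example-bucket/invoice.pdf", "Invoice",
         "Monthly hosting invoice", ["billing"])

print(find_files("dog"))
```

Note that the search only ever sees what someone typed into the title, description, and tags columns. The contents of the file itself stay invisible, which is the gap an intelligent file repository aims to close.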
What is an Intelligent File Repository?
An intelligent file repository (or intelligent file store) is a type of storage system that uses artificial intelligence (AI) and machine learning algorithms to manage and organize the files it contains. It is designed to automatically classify and categorize files based on their content, metadata, or other characteristics, making it easier for users to find and access the files they need. Such systems can adapt to the behavior of their users, providing personalized recommendations and suggestions for files that may be relevant or useful to them.
File storage intelligence typically involves three steps: first you create mechanisms to extract the contents of the files, then you index that content, and finally you expose search interfaces for intuitive retrieval. More advanced systems add a fourth step, personalization, on top.
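The extract, index, and search steps can be sketched in a few lines of Python. This is a deliberately small illustration, not how any particular product implements them: `extract_text` is a stand-in that assumes the file is already UTF-8 text (a real system would do file type detection, OCR, and so on), and the `FileIndex` class, its method names, and the sample files are hypothetical.

```python
import re
from collections import defaultdict

def extract_text(raw_bytes):
    """Stand-in for extraction: assumes the file is already UTF-8 text."""
    return raw_bytes.decode("utf-8", errors="ignore")

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class FileIndex:
    """Tiny inverted index: maps each token to the files that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of file ids

    def add(self, file_id, raw_bytes):
        """Extract the file's text, tokenize it, and index every token."""
        for token in tokenize(extract_text(raw_bytes)):
            self.postings[token].add(file_id)

    def search(self, query):
        """Return the sorted ids of files containing every query token."""
        tokens = tokenize(query)
        if not tokens:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(matches)

index = FileIndex()
index.add("prescription.pdf", b"Prescription: amoxicillin for Rex the dog")
index.add("invoice.pdf", b"Invoice for dog grooming services")

print(index.search("dog"))               # both files mention "dog"
print(index.search("dog prescription"))  # only one file has both tokens
```

Here the search hits come from the file contents rather than from hand-written tags. Keyword matching is only one retrieval style; semantic search would swap the inverted index for stored vector embeddings.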
The Process
Extract
- File Type Identification
- Optical Character Recognition
- Summarization
- Named Entity Recognition
- Tag Generation
Index
- Tokenization Applied
- Vector Embeddings Stored
Search
- Keyword & Terms
- Semantic & Vectors
- Domain Specific Synonyms
Personalize
- Learn-to-Rank
- Dynamic Weighting
- Automated Categorization
- Synonym Generation
The more advanced systems, like Mixpeek, are wrapped around a constant Learn-to-Rank feedback loop. As your users search for specific terms and their corresponding files are returned, the weights are modified and personalization improves.
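To give a feel for how such a feedback loop behaves, here is a toy sketch. It is not Mixpeek's algorithm and it is much simpler than a trained Learn-to-Rank model: it just adds a small boost to a file's score each time a user clicks it for a given query. The `ClickFeedbackRanker` class, the learning rate, and the base scores are all assumptions for illustration.

```python
from collections import defaultdict

class ClickFeedbackRanker:
    """Toy re-ranker: boosts files that users clicked for a given query."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.boosts = defaultdict(float)  # (query, file_id) -> learned boost

    def record_click(self, query, file_id):
        """Nudge the boost for this query/file pair upward after a click."""
        self.boosts[(query.lower(), file_id)] += self.learning_rate

    def rank(self, query, base_scores):
        """Sort file ids by base relevance score plus learned boost."""
        key = query.lower()
        return sorted(
            base_scores,
            key=lambda file_id: base_scores[file_id] + self.boosts.get((key, file_id), 0.0),
            reverse=True,
        )

ranker = ClickFeedbackRanker()
# Hypothetical relevance scores coming out of the search step.
base = {"invoice.pdf": 0.6, "prescription.pdf": 0.5}

print(ranker.rank("dog", base))  # before feedback: the invoice ranks first

# Users searching for "dog" keep opening the prescription instead.
for _ in range(2):
    ranker.record_click("dog", "prescription.pdf")

print(ranker.rank("dog", base))  # after feedback: the prescription ranks first
```

The boost is keyed on the query as well as the file, so clicks for one search term do not change the ranking for unrelated terms.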
Why use one?
Improved Organization and Searchability: Automatically classify and categorize files based on their content, metadata, or other characteristics, making it easier for users to find and access the files they need.
Enhanced Productivity: Extract key information from files and store it in a searchable database, allowing users to quickly find files based on specific keywords or criteria. This can help users save time and be more productive.
Personalized Recommendations: Learn and adapt to the behavior of your users, providing personalized recommendations and suggestions for files that may be relevant or useful to them. This can help users discover new information and resources that they may not have been aware of previously.
Required Capabilities
Scalability: Expand and contract to accommodate changes in the amount of data being stored.
High performance: Quickly retrieve and process data in order to support fast, efficient operations.
Security: Robust security measures in place to protect data from unauthorized access or tampering.
Data protection: Built-in data protection mechanisms to ensure that data is not lost or corrupted.
Data management: Powerful data management capabilities, including the ability to search, sort, and organize data in a variety of ways.
Integration: Seamlessly integrate with other systems and applications in order to support a wide range of use cases and business processes.
Getting Started
First we upload our file using the /upload endpoint.
import requests

url = "https://api.mixpeek.com/v1/file/upload"

# The file to ingest, who owns it, and any tags to attach to it
payload = {
    "file_url": ["https://mixpeek-demo.s3.us-east-2.amazonaws.com/prescription.pdf"],
    "user_id": "john_smith_123",
    "tags": ["veterinarian", "medical", "dog"],
}

# Replace API_KEY with your own key
headers = {"Authorization": "API_KEY"}

response = requests.post(url, data=payload, headers=headers)
Next we search across our files using the /search endpoint.
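import requests

url = "https://api.mixpeek.com/v1/search"

# Replace API_KEY with your own key
headers = {"Authorization": "API_KEY"}

# Sent as ?q=prescription
response = requests.get(url, params={"q": "prescription"}, headers=headers)
print(response.json())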
Voila, we have our results:
[
  {
    "file_id": "123",
    "file_url": "s3://file",
    "filename": "prescription.pdf",
    "importance": "100%"
  }
]
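Once the results come back, you will usually want to turn them into something your application can sort and display. The sketch below works on the sample response shown above and assumes `importance` is always a percentage string such as "100%"; real responses may carry other fields or formats. The `parse_importance` and `top_results` helpers are hypothetical.

```python
import json

# Sample response copied from the example above; real responses may differ.
raw_response = """
[
  {
    "file_id": "123",
    "file_url": "s3://file",
    "filename": "prescription.pdf",
    "importance": "100%"
  }
]
"""

def parse_importance(value):
    """Convert a percentage string such as "100%" to a float between 0 and 1."""
    return float(value.rstrip("%")) / 100.0

def top_results(results, limit=5):
    """Return (filename, importance) pairs, most important first."""
    ranked = sorted(results, key=lambda r: parse_importance(r["importance"]), reverse=True)
    return [(r["filename"], parse_importance(r["importance"])) for r in ranked[:limit]]

print(top_results(json.loads(raw_response)))
```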