Full-Text File Search on your S3 Bucket

Your S3 Bucket is a Black Box

If your S3 bucket is anything like mine, it contains many different file types: text, pictures, audio clips, videos; the list goes on.

How are we supposed to find anything in this mess of files?

To solve this, we're going to build a pipeline that extracts the text from an example binary file type (PDFs) using OCR, stores it in a search engine (OpenSearch), and finally exposes a REST API in Flask so we can "explore" our S3 bucket.

Here's what our pipeline will look like: S3 bucket → download the file → extract text with OCR → index into OpenSearch → search via a Flask REST API.

Bonus: Skip this walkthrough and just download the code here: https://github.com/mixpeek/pdf-search-s3

Retrieve the File from your S3 Bucket

First we need to download the file locally so we can run our text extraction logic.

import boto3

# connect to S3 (swap in your own credentials)
s3_client = boto3.client(
    's3',
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    region_name='region_name'
)

bucket_name = 'bucket_name'    # your S3 bucket
s3_file_name = 'file.pdf'      # key of the object to search

# download the object to a local file of the same name
with open(s3_file_name, 'wb') as file:
    s3_client.download_fileobj(
        bucket_name,
        s3_file_name,
        file
    )
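This grabs a single object; in practice you'd iterate over everything in the bucket. A minimal sketch using boto3's paginator (wiring each key through the rest of the pipeline is the obvious next step):

paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name):
    for obj in page.get('Contents', []):
        print(obj['Key'])  # each key would go through the download → OCR → index steps below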
                      

Use OCR to Extract the Contents

We’ll use the open-source Apache Tika library, whose AutoDetectParser class can run OCR (optical character recognition) on scanned documents:

from tika import parser

# parse the downloaded file and pull out the extracted text
parsed_pdf_content = parser.from_file(s3_file_name)['content']
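One caveat: tika-python starts a local Tika server on first use (so Java must be installed), and 'content' can come back as None for files Tika can't read. A small guard before indexing:

# 'content' may be None for unreadable files; normalize before indexing
parsed_pdf_content = (parsed_pdf_content or '').strip()
if not parsed_pdf_content:
    print(f"No text extracted from {s3_file_name}; skipping")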
                    

Insert the Contents into OpenSearch

We’re using a self-managed OpenSearch node here, but you could also use Lucene, Solr, Elasticsearch, or Atlas Search.

brew update
brew install opensearch
opensearch
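Once it starts, a quick sanity check (this should return a JSON blob describing the node):

curl http://localhost:9200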
                    

OpenSearch is now accessible at http://localhost:9200. Let’s build the index and insert the file contents:

from opensearchpy import OpenSearch

# call the client "client" so it doesn't shadow Python's os module
client = OpenSearch("http://localhost:9200")
index_name = "pdf-search"

doc = {
    "filename": s3_file_name,
    "parsed_pdf_content": parsed_pdf_content
}

# index the document (the id is hard-coded since we only have one file)
response = client.index(
    index=index_name,
    body=doc,
    id=1,
    refresh=True
)
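Note that OpenSearch auto-creates the index with dynamic mappings on that first index call. If you'd rather define the schema up front, here's a sketch (the field types are assumptions matching our doc shape):

# optional: create the index with an explicit mapping before indexing
if not client.indices.exists(index=index_name):
    client.indices.create(
        index=index_name,
        body={
            "mappings": {
                "properties": {
                    "filename": {"type": "keyword"},
                    "parsed_pdf_content": {"type": "text"}
                }
            }
        }
    )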
                    
Search the Contents via a Flask REST API

Finally, let's expose a /search endpoint that takes a query string and runs a full-text match against the indexed contents:

from flask import Flask, jsonify, request
from opensearchpy import OpenSearch

app = Flask(__name__)
client = OpenSearch("http://localhost:9200")
index_name = "pdf-search"

@app.route('/search', methods=['GET'])
def search_file():
    # search term from the ?q= parameter
    query = request.args.get('q', default=None, type=str)
    # query payload in JSON for OpenSearch
    payload = {
        'query': {
            'match': {
                'parsed_pdf_content': query
            }
        }
    }
    # run the search query
    response = client.search(
        body=payload,
        index=index_name
    )
    return jsonify(response)

if __name__ == '__main__':
    app.run(host="localhost", port=5011, debug=True)
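With the app running, a search is one GET request away (using "system" as an example term, matching the demo below):

curl "http://localhost:5011/search?q=system"

Matching documents come back under hits.hits in the OpenSearch response.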
                    

You can download the repo here: https://github.com/mixpeek/pdf-search-s3

The easy part is done; now you need to figure out:

Queuing: Ensuring concurrent file uploads are not dropped

Security: Adding end-to-end encryption to the data pipeline

Enhancements: Including more features like fuzzy matching, highlighting, and autocomplete (see the sketch after this list)

Rate Limiting: Building thresholds so users don’t abuse the system
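As a starting point for the enhancements bullet, fuzziness and highlighting are both plain additions to the query payload in the Flask handler above:

# fuzzy matching plus hit highlighting in one query payload
payload = {
    "query": {
        "match": {
            "parsed_pdf_content": {
                "query": query,
                "fuzziness": "AUTO"  # tolerates small typos
            }
        }
    },
    "highlight": {
        "fields": {"parsed_pdf_content": {}}
    }
}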

All This in Two Lines of Code

from mixpeek import Mixpeek

# init mixpeek class with S3 connection
mix = Mixpeek(
    api_key="mixpeek_api_key",
    access_key="aws_access_key",
    secret_key="aws_secret_key",
    region="region"
)

# index our entire S3 bucket's files
mix.index_bucket("mixpeek-public-demo")    

# full text search across S3 bucket
mix.search("system")
                    

Here's an example UI we've built on our demo page to showcase searching a single term that spans multiple S3 files.


What will you build?

Upgrade your software with multimodal understanding in one line of code.
