Your S3 Bucket is a Black Box
If your S3 bucket is anything like mine, it contains a jumble of file types: text, pictures, audio clips, videos, the list goes on.
How are we supposed to find anything in this mess of files?
To solve this, we're going to build a pipeline that extracts the text from one example binary format, PDF, using OCR. Then we'll store that text in a search engine via AWS OpenSearch, and finally we'll build a REST API in Flask to "explore" our S3 bucket.
Here's what our pipeline will look like:
Bonus: Skip this walkthrough and just download the code here.
Retrieve the File from your S3 Bucket
First we need to download the file locally so we can run our text extraction logic.
```python
import boto3

s3_client = boto3.client(
    's3',
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    region_name='region_name'
)

bucket_name = 'your-bucket-name'
s3_file_name = 'your-file.pdf'

# download the S3 object to a local file of the same name
with open(s3_file_name, 'wb') as file:
    s3_client.download_fileobj(bucket_name, s3_file_name, file)
```
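To process more than one file, you can walk the whole bucket and pick out the PDFs. A minimal sketch (the function names here are our own, not part of the boto3 API):

```python
def pdf_keys(objects):
    """Filter an S3 object listing down to just the PDF keys."""
    return [obj['Key'] for obj in objects if obj['Key'].lower().endswith('.pdf')]

def list_pdf_keys(bucket_name):
    """Yield every PDF key in the bucket, paginating so large listings aren't truncated."""
    import boto3  # local import keeps the pure helper above dependency-free
    s3_client = boto3.client('s3')
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name):
        yield from pdf_keys(page.get('Contents', []))
```

Each key this yields can then be downloaded and parsed exactly as above.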
Use OCR to Extract the Contents
We’ll use the open source Apache Tika library, whose AutoDetectParser class can apply OCR (optical character recognition):
```python
from tika import parser

parsed_pdf_content = parser.from_file(s3_file_name)['content']
```
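One gotcha worth guarding against: the `'content'` value Tika returns can be `None` (for example, an empty or image-only document where nothing was recognized). A small normalizer before indexing, as a sketch:

```python
def extract_content(parsed):
    """Return the parsed text, or an empty string when Tika found no content."""
    content = parsed.get('content')
    return content.strip() if content else ''
```

This keeps the indexing step from blowing up on a `None` where it expects a string.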
Insert the Contents into AWS OpenSearch
We’re using a self-managed OpenSearch node here, but you could use Lucene, Solr, Elasticsearch, or Atlas Search instead.
```shell
brew update
brew install opensearch
opensearch
```
OpenSearch will now be accessible here:
http://localhost:9200
Let’s build the index and insert the file contents:
```python
from opensearchpy import OpenSearch

# avoid naming the client `os`, which shadows the standard library module
client = OpenSearch("http://localhost:9200")

index_name = "pdf-search"

doc = {
    "filename": s3_file_name,
    "parsed_pdf_content": parsed_pdf_content
}

# index the document; refresh=True makes it searchable immediately
response = client.index(
    index=index_name,
    body=doc,
    id=1,
    refresh=True
)
```
Building the Search API
```python
from flask import Flask, jsonify, request
from opensearchpy import OpenSearch

app = Flask(__name__)

client = OpenSearch("http://localhost:9200")
index_name = "pdf-search"

@app.route('/search', methods=['GET'])
def search_file():
    # value from the ?q= query parameter
    query = request.args.get('q', default=None, type=str)

    # query payload (JSON) for OpenSearch
    payload = {
        'query': {
            'match': {
                'parsed_pdf_content': query
            }
        }
    }

    # run the search query
    response = client.search(
        body=payload,
        index=index_name
    )
    return jsonify(response)

if __name__ == '__main__':
    app.run(host="localhost", port=5011, debug=True)
```
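With the app running, a search is just a GET against `/search`. As a sketch, the request URL for a sample term (the port and the "system" query come from the snippets in this post) can be built like this:

```python
from urllib.parse import urlencode

def search_url(term, host="localhost", port=5011):
    """Build the GET URL for the /search endpoint defined above."""
    return f"http://{host}:{port}/search?{urlencode({'q': term})}"

print(search_url("system"))  # http://localhost:5011/search?q=system
```

Equivalently, from a terminal: `curl 'http://localhost:5011/search?q=system'`.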
You can download the repo here: https://github.com/mixpeek/pdf-search-s3
The easy part is done; now you need to figure out:
- Queuing: ensuring concurrent file uploads are not dropped
- Security: adding end-to-end encryption to the data pipeline
- Enhancements: adding features like fuzzy matching, highlighting, and autocomplete
- Rate limiting: building thresholds so users don’t abuse the system
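As a taste of the enhancements above, fuzzy matching and highlighting are just a richer query payload against the same field. A sketch, assuming the `parsed_pdf_content` field from earlier (the `fuzziness` setting is one reasonable choice, not the only one):

```python
def fuzzy_highlight_payload(term):
    """OpenSearch query: fuzzy match on the content field, with highlighted snippets."""
    return {
        "query": {
            "match": {
                "parsed_pdf_content": {
                    "query": term,
                    "fuzziness": "AUTO"  # tolerate small typos in the search term
                }
            }
        },
        "highlight": {
            "fields": {"parsed_pdf_content": {}}
        }
    }
```

Swapping this payload into the Flask route above would make "sytsem" still find "system", with matching snippets returned under each hit's `highlight` key.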
All This in Two Lines of Code
```python
from mixpeek import Mixpeek

# init Mixpeek class with S3 connection
mix = Mixpeek(
    api_key="mixpeek_api_key",
    access_key="aws_access_key",
    secret_key="aws_secret_key",
    region="region"
)

# index our entire S3 bucket's files
mix.index_bucket("mixpeek-public-demo")

# full-text search across the S3 bucket
mix.search("system")
```
Here's an example UI we've built on a demo page to showcase searching for a single term across multiple S3 files.