
    Searching PDFs in S3 Using OpenSearch and Tika


    Introduction

    In this tutorial, we will walk through building a simple Python script that searches the contents of PDF files in an Amazon S3 bucket using Apache Tika and OpenSearch. Apache Tika is a library for extracting text and metadata from many types of documents, including PDFs. OpenSearch is a search engine that will index the extracted text and make it searchable.

    Prerequisites

    Before we begin, make sure that you have the following prerequisites:

    • An AWS account and credentials with access to an S3 bucket.
    • Apache Tika and OpenSearch installed on your system. You can download the latest version of Tika from the Apache Tika website, and the latest version of OpenSearch from the OpenSearch GitHub page.
    • The boto3 and requests libraries installed on your system. You can install these libraries using pip install boto3 requests.
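    Note that the snippets below also import the tika and opensearch-py Python packages (and Flask for the API layer), so it’s worth installing everything up front:

    pip install boto3 requests tika opensearch-py flask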

    Your S3 Bucket is a Black Box

    If your S3 bucket is anything like mine, it contains many different file types: text, pictures, audio clips, videos; the list goes on.

    How are we supposed to find anything in this mess of files?

    To solve this, we're going to build a pipeline that extracts text from one example binary format, PDF, using OCR. We'll then store that text in a search engine via AWS OpenSearch, and finally we'll build a REST API in Flask to "explore" our S3 bucket.

    Here's what our pipeline will look like:

    (Pipeline diagram: extract text from PDFs in S3 with Tika, index it in OpenSearch, expose it through a Flask search API)

    Bonus: Skip this walkthrough and just download the code here: https://github.com/mixpeek/pdf-search-s3

    Retrieve the File from your S3 Bucket

    First we need to download the file locally so we can run our text extraction logic.

    import boto3
    
    # placeholders: point these at your own bucket and object
    bucket_name = 'your-bucket-name'
    s3_file_name = 'your-file.pdf'
    
    s3_client = boto3.client(
        's3',
        aws_access_key_id='aws_access_key_id',
        aws_secret_access_key='aws_secret_access_key',
        region_name='region_name'
    )
    
    # download the S3 object to a local file of the same name
    with open(s3_file_name, 'wb') as file:
        s3_client.download_fileobj(
            bucket_name,
            s3_file_name,
            file
        )

    Use OCR to Extract the Contents

    We’ll use the open-source Apache Tika library, whose AutoDetectParser extracts text from PDFs and, when Tesseract is installed, can also run OCR (optical character recognition) on scanned documents:

    from tika import parser
    
    parsed_pdf_content = parser.from_file(s3_file_name)['content']
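    One gotcha: from_file returns a dictionary with both metadata and content, and content can come back as None (for example, an image-only PDF with no OCR backend available). A minimal guard, assuming the same parsed result:

    parsed = parser.from_file(s3_file_name)
    
    # 'content' can be None, e.g. for image-only PDFs without Tesseract installed
    parsed_pdf_content = parsed.get('content') or ''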

    Insert the Contents into AWS OpenSearch

    We’re using a self-managed OpenSearch node here, but you could just as easily use Lucene, Solr, Elasticsearch, or Atlas Search.

    Note: if you don’t have OpenSearch locally, install it first and then run it (on macOS, via Homebrew):

    brew update
    brew install opensearch
    opensearch
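    Once it’s running, a quick sanity check (the root endpoint returns basic cluster info):

    curl http://localhost:9200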

    With OpenSearch now accessible at http://localhost:9200, let’s build the index and insert the file contents:

    from opensearchpy import OpenSearch
    
    # connect to the local OpenSearch node
    # (named 'client' to avoid shadowing the standard-library os module)
    client = OpenSearch("http://localhost:9200")
    index_name = "pdf-search"
    
    doc = {
        "filename": s3_file_name,
        "parsed_pdf_content": parsed_pdf_content
    }
    
    # index the document; a hardcoded id of 1 overwrites on every run,
    # so use the filename or a UUID when indexing multiple files
    response = client.index(
        index=index_name,
        body=doc,
        id=1,
        refresh=True
    )
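    To confirm the write, you can read the document straight back using the same client and the fixed id from above:

    # fetch the document we just indexed
    stored = client.get(index=index_name, id=1)
    print(stored["_source"]["filename"])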

    Building the Search API

    Finally, we’ll expose search over the index through a small Flask endpoint:

    from flask import Flask, jsonify, request
    from opensearchpy import OpenSearch
    
    app = Flask(__name__)
    client = OpenSearch("http://localhost:9200")
    index_name = "pdf-search"
    
    @app.route('/search', methods=['GET'])
    def search_file():
        # search term from the query string, e.g. /search?q=invoice
        query = request.args.get('q', default=None, type=str)
        # match query payload in JSON for OpenSearch
        payload = {
            'query': {
                'match': {
                    'parsed_pdf_content': query
                }
            }
        }
        # run the search query
        response = client.search(
            body=payload,
            index=index_name
        )
        return jsonify(response)
    
    if __name__ == '__main__':
        app.run(host="localhost", port=5011, debug=True)
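    With the server running, you can hit the endpoint from any HTTP client. For example, with the requests library from the prerequisites ("system" is just an example query term):

    import requests
    
    # query the local search endpoint and print the matching hits
    resp = requests.get("http://localhost:5011/search", params={"q": "system"})
    print(resp.json()["hits"]["hits"])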

    You can download the repo here: https://github.com/mixpeek/pdf-search-s3

    The easy part is done. Now you need to figure out:

    • Queuing: ensuring concurrent file uploads are not dropped
    • Security: adding end-to-end encryption to the data pipeline
    • Enhancements: adding features like fuzzy matching, highlighting, and autocomplete (see the sketch after this list)
    • Rate Limiting: building thresholds so users don’t abuse the system
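    As a taste of the enhancements bullet, here’s a minimal sketch of the query payload with fuzzy matching and highlighting enabled. These are standard OpenSearch query options, not code from the repo:

    payload = {
        "query": {
            "match": {
                "parsed_pdf_content": {
                    "query": query,
                    "fuzziness": "AUTO"  # tolerate small typos in the search term
                }
            }
        },
        # return matched snippets alongside the hits
        "highlight": {
            "fields": {"parsed_pdf_content": {}}
        }
    }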

    All This in Two Lines of Code

    from mixpeek import Mixpeek
    
    # init mixpeek class with S3 connection
    mix = Mixpeek(
        api_key="mixpeek_api_key",
        access_key="aws_access_key",
        secret_key="aws_secret_key",
        region="region"
    )
    
    # index our entire S3 bucket's files
    mix.index_bucket("mixpeek-public-demo")    
    
    # full text search across S3 bucket
    mix.search("system")
    Ethan Steininger

    January 6, 2023 · 3 min read