
    Searching PDFs in S3 Using OpenSearch and Tika


    Introduction

    In this tutorial, we will walk through building a simple Python script that searches the contents of PDF files in an Amazon S3 bucket using Apache Tika and OpenSearch. Apache Tika is a library for extracting text and metadata from many types of documents, including PDFs. OpenSearch is a search engine that will index the extracted text and make it searchable.

    Prerequisites

    Before we begin, make sure that you have the following prerequisites:

    • An AWS account and credentials with access to an S3 bucket.
    • Apache Tika and OpenSearch installed on your system. You can download the latest version of Tika from the Apache Tika website, and the latest version of OpenSearch from the OpenSearch GitHub page.
    • The boto3 and requests libraries installed on your system. You can install these libraries using pip install boto3 requests.
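    Note that the snippets below also import the tika and opensearch-py Python packages (and Flask for the API layer), so it’s worth installing everything up front:

    pip install boto3 requests tika opensearch-py flask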

    Your S3 Bucket is a Black Box

    If your S3 bucket is anything like mine, it contains many different file types: text, pictures, audio clips, videos; the list goes on.

    How are we supposed to find anything in this mess of files?

    To solve this, we're going to build a pipeline that extracts text from one example binary format, PDF, using OCR. We'll then store that text in a search engine via AWS OpenSearch, and finally we'll build a REST API in Flask to "explore" our S3 bucket.

    Here's what our pipeline will look like:

    (Pipeline diagram: extract text from PDFs in S3 with Tika, index it in OpenSearch, expose it through a Flask search API)

    Bonus: Skip this walkthrough and just download the code here: https://github.com/mixpeek/pdf-search-s3

    Retrieve the File from your S3 Bucket

    First we need to download the file locally so we can run our text extraction logic.

    import boto3
    
    # placeholders: point these at your own bucket and object
    bucket_name = 'your-bucket-name'
    s3_file_name = 'your-file.pdf'
    
    s3_client = boto3.client(
        's3',
        aws_access_key_id='aws_access_key_id',
        aws_secret_access_key='aws_secret_access_key',
        region_name='region_name'
    )
    
    # download the S3 object to a local file of the same name
    with open(s3_file_name, 'wb') as file:
        s3_client.download_fileobj(
            bucket_name,
            s3_file_name,
            file
        )

    Use OCR to Extract the Contents

    We’ll use the open-source Apache Tika library, whose AutoDetectParser extracts text from PDFs and, when Tesseract is installed, can also run OCR (optical character recognition) on scanned documents:

    from tika import parser
    
    parsed_pdf_content = parser.from_file(s3_file_name)['content']
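    One gotcha: from_file returns a dictionary with both metadata and content, and content can come back as None (for example, an image-only PDF with no OCR backend available). A minimal guard, assuming the same parsed result:

    parsed = parser.from_file(s3_file_name)
    
    # 'content' can be None, e.g. for image-only PDFs without Tesseract installed
    parsed_pdf_content = parsed.get('content') or ''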

    Insert the Contents into AWS OpenSearch

    We’re using a self-managed OpenSearch node here, but you could just as easily use Lucene, Solr, Elasticsearch, or Atlas Search.

    Note: if you don’t have OpenSearch locally, install it first and then run it (on macOS, via Homebrew):

    brew update
    brew install opensearch
    opensearch
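    Once it’s running, a quick sanity check (the root endpoint returns basic cluster info):

    curl http://localhost:9200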

    With OpenSearch now accessible at http://localhost:9200, let’s build the index and insert the file contents:

    from opensearchpy import OpenSearch
    
    # connect to the local OpenSearch node
    # (named 'client' to avoid shadowing the standard-library os module)
    client = OpenSearch("http://localhost:9200")
    index_name = "pdf-search"
    
    doc = {
        "filename": s3_file_name,
        "parsed_pdf_content": parsed_pdf_content
    }
    
    # index the document; a hardcoded id of 1 overwrites on every run,
    # so use the filename or a UUID when indexing multiple files
    response = client.index(
        index=index_name,
        body=doc,
        id=1,
        refresh=True
    )
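    To confirm the write, you can read the document straight back using the same client and the fixed id from above:

    # fetch the document we just indexed
    stored = client.get(index=index_name, id=1)
    print(stored["_source"]["filename"])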

    Building the Search API

    Finally, we’ll expose search over the index through a small Flask endpoint:

    from flask import Flask, jsonify, request
    from opensearchpy import OpenSearch
    
    app = Flask(__name__)
    client = OpenSearch("http://localhost:9200")
    index_name = "pdf-search"
    
    @app.route('/search', methods=['GET'])
    def search_file():
        # search term from the query string, e.g. /search?q=invoice
        query = request.args.get('q', default=None, type=str)
        # match query payload in JSON for OpenSearch
        payload = {
            'query': {
                'match': {
                    'parsed_pdf_content': query
                }
            }
        }
        # run the search query
        response = client.search(
            body=payload,
            index=index_name
        )
        return jsonify(response)
    
    if __name__ == '__main__':
        app.run(host="localhost", port=5011, debug=True)
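    With the server running, you can hit the endpoint from any HTTP client. For example, with the requests library from the prerequisites ("system" is just an example query term):

    import requests
    
    # query the local search endpoint and print the matching hits
    resp = requests.get("http://localhost:5011/search", params={"q": "system"})
    print(resp.json()["hits"]["hits"])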

    You can download the repo here: https://github.com/mixpeek/pdf-search-s3

    The easy part is done. Now you need to figure out:

    • Queuing: ensuring concurrent file uploads are not dropped
    • Security: adding end-to-end encryption to the data pipeline
    • Enhancements: adding features like fuzzy matching, highlighting, and autocomplete (see the sketch after this list)
    • Rate Limiting: building thresholds so users don’t abuse the system
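    As a taste of the enhancements bullet, here’s a minimal sketch of the query payload with fuzzy matching and highlighting enabled. These are standard OpenSearch query options, not code from the repo:

    payload = {
        "query": {
            "match": {
                "parsed_pdf_content": {
                    "query": query,
                    "fuzziness": "AUTO"  # tolerate small typos in the search term
                }
            }
        },
        # return matched snippets alongside the hits
        "highlight": {
            "fields": {"parsed_pdf_content": {}}
        }
    }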

    All This in Two Lines of Code

    from mixpeek import Mixpeek
    
    # init mixpeek class with S3 connection
    mix = Mixpeek(
        api_key="mixpeek_api_key",
        access_key="aws_access_key",
        secret_key="aws_secret_key",
        region="region"
    )
    
    # index our entire S3 bucket's files
    mix.index_bucket("mixpeek-public-demo")    
    
    # full text search across S3 bucket
    mix.search("system")
    Ethan Steininger

    January 6, 2023 · 3 min read