PDF Search API for S3

Mixpeek enables your customers to search the contents of their PDFs hosted in your AWS S3 buckets using a single API call.

How you do it today...

Custom Extraction Software

Build parser code to recognize the file type, test it under various conditions, then extract the contents using machine learning.

Synchronize Output to Search Engine

Custom architecture to pick up the output of this extraction software, send it to a queue, and insert it into a search engine.

Expose Search Engine Endpoints

Engineer custom endpoints which expose various functionality including multi-tenancy, fuzzy search, highlighting and more.

And now you have to maintain security, high availability, and updates.

...Why Mixpeek is better

Full Featured REST APIs

GET and POST endpoints for your convert and search needs, inclusive of every feature you'd expect in a search engine.

Services that scale based on your usage

Every system in our tech stack scales independently based on usage, so you never have to worry about provisioning.

Forget managing file search architecture

Architecture, code, and machine learning managed for you.

See the FAQs section for security questions.

Features

Fuzzy Search

Return relevant search results even when the query contains typos or misspellings

Highlighting

Showcase where within your text the search term appears

Compound

Combine multiple search queries using boolean-like operators

Custom Scoring

Control how your PDFs are ranked in search results, or boost promoted content
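To make two of these features concrete, here is a minimal local sketch of what fuzzy matching and highlighting mean. It uses only the Python standard library and is not how Mixpeek implements them; the function names, the sample vocabulary, and the <em> highlight markup are all assumptions made for illustration.

```python
import difflib
import re

def fuzzy_terms(query, vocabulary, cutoff=0.8):
    """Return indexed words that closely resemble a possibly misspelled query."""
    return difflib.get_close_matches(query.lower(), vocabulary, n=3, cutoff=cutoff)

def highlight(text, term):
    """Wrap each case-insensitive occurrence of term in <em> tags (assumed markup)."""
    return re.sub(
        re.escape(term),
        lambda m: f"<em>{m.group(0)}</em>",
        text,
        flags=re.IGNORECASE,
    )

# Hypothetical vocabulary standing in for words extracted from indexed PDFs.
vocabulary = ["invoice", "contract", "receipt"]

matches = fuzzy_terms("invoce", vocabulary)  # the query contains a typo
snippet = highlight("Invoice 42 is attached.", matches[0])
print(matches, snippet)
```

A hosted search engine would do this matching inside its index rather than in client code; the sketch only shows the behavior a caller can expect.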

How Does it Work?

Upload your Files

POST your file to the /upload endpoint, where we extract the relevant text and place it in a search index. See the Docs for an example.

  • Depending on the filetype, various machine learning models are run on the file and relevant text is extracted, then stored in an encrypted database.

    The file itself is never saved on our servers. You provide the URL for the file in the /upload API, and we include that URL in the /search response.

  • Supported file types: .pdf, .doc, .jpg, .jpeg, .gif, .mp4, .avi, .mov, .mp3, .wav, .m4a, .html, .xml

  • All of them! That's the beauty of Mixpeek: your files are stored in different locations and formats, and we parse and index them all, then give you a single API to query them.

Files are not stored, and the extracted text is deleted after 15 minutes.
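The upload step described above might look roughly like the following from the caller's side. This is a sketch under stated assumptions, not the documented API: the base URL is a placeholder, and the JSON field name ("url") and the Authorization header are hypothetical. The only details taken from this page are that you POST to /upload and pass the file's URL rather than the file itself.

```python
import json
import urllib.request

# Assumption: placeholder host; the real API host is not given on this page.
BASE_URL = "https://api.example.com"

def build_upload_request(file_url, api_key):
    """Build (but do not send) a POST to /upload carrying the file's URL."""
    body = json.dumps({"url": file_url}).encode("utf-8")  # hypothetical field name
    return urllib.request.Request(
        f"{BASE_URL}/upload",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": api_key,  # hypothetical auth header
        },
    )

upload_req = build_upload_request(
    "https://my-bucket.s3.amazonaws.com/report.pdf",  # example S3 object URL
    "YOUR_API_KEY",
)
# Sending is left to the caller, e.g. urllib.request.urlopen(upload_req).
```

Building the request separately from sending it keeps the example runnable without network access and makes the payload easy to inspect.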

Search Your Files

GET your files with the /search endpoint, where we return every relevant piece of text and its corresponding filepath.

  • Your file's text is placed in a full-text search index. It is the most prominent and fastest text search in the industry.

  • All traffic in flight is encrypted using TLS and the data itself is encrypted at rest once it reaches our database.

  • We're still in beta, so just sign up and we'll talk.
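A search call could be sketched along these lines. As with any example here, the host, the "q" query parameter, the Authorization header, and the response shape are assumptions for illustration; the page itself only states that /search is a GET endpoint returning the relevant text and the file's URL.

```python
import urllib.parse
import urllib.request

# Assumption: placeholder host; the real API host is not given on this page.
BASE_URL = "https://api.example.com"

def build_search_request(query, api_key):
    """Build (but do not send) a GET to /search with the query URL-encoded."""
    params = urllib.parse.urlencode({"q": query})  # hypothetical parameter name
    return urllib.request.Request(
        f"{BASE_URL}/search?{params}",
        method="GET",
        headers={"Authorization": api_key},  # hypothetical auth header
    )

def summarize_hits(response_json):
    """Return (file URL, matched text) pairs from an assumed response shape."""
    return [(hit["url"], hit["text"]) for hit in response_json.get("results", [])]

search_req = build_search_request("annual revenue", "YOUR_API_KEY")

# Hypothetical response body, used only to exercise summarize_hits.
sample_response = {
    "results": [
        {
            "url": "https://my-bucket.s3.amazonaws.com/report.pdf",
            "text": "Annual revenue grew in the fourth quarter.",
        }
    ]
}
hits = summarize_hits(sample_response)
```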

Frequently Asked Questions

If we haven't answered your question below, send us an email at info@mixpeek.com

  • What is the technology behind Mixpeek?

    It depends on which filetype you are uploading. For example, for images we use PyTorch, for audio files Tesseract, and so on. All the extracted text is then put into a search index (and the original file is deleted), where it instantly becomes searchable.

  • All connections between the API server and database enforce TLS. Once the data enters the database, the hard drives are encrypted using a managed key service. By using 50-character API keys, we ensure that you have access only to your own data.

  • We have a 99.999% Service Level Agreement with our customers, and we're able to do this by having redundant app servers (for the APIs), in addition to redundant database servers.

  • Not at this time unfortunately.
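For readers curious what a 50-character API key scheme like the one mentioned in the security answer above could involve, here is a minimal sketch. It is an assumption about the general technique, not a description of Mixpeek's key service: the alphabet, the helper names, and the comparison step are all hypothetical.

```python
import secrets
import string

# Assumption: keys are drawn from letters and digits; the real alphabet is unknown.
ALPHABET = string.ascii_letters + string.digits

def generate_api_key(length=50):
    """Generate a random key using a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def keys_match(presented, stored):
    """Compare keys in constant time to avoid leaking information via timing."""
    return secrets.compare_digest(presented, stored)

api_key = generate_api_key()
print(len(api_key))
```

With 62 possible characters in each of 50 positions, guessing a key by brute force is impractical, which is what lets a key act as a per-customer access boundary.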