Understanding APIs of AWS Textract

AWS Textract is an OCR (Optical Character Recognition) service provided and fully managed by AWS. It automatically extracts text from an input image or PDF, including printed text, handwritten text, and other pieces of information present in the document.

OCR has become an important tool for businesses, organizations, and governments whose data arrives as scanned images. In industries such as insurance, healthcare, and HR, where the information lives in scanned documents, OCR helps automate information extraction. Textract is an industry-grade OCR service that makes this process faster, cheaper, and less error-prone than the existing systems of information extraction.

In this exercise, we will discuss the different APIs of Textract and the kinds of operations they support. This will help you choose which API to use and which operation to perform in different use cases.

APIs of AWS Textract: Textract has two different APIs. The first is detect_document_text and the second is analyze_document. Both can be called synchronously or asynchronously.

Let’s first understand the two operations of the Textract APIs: synchronous and asynchronous. In a synchronous operation, Textract analyzes the document and returns the entire set of results in a single response (whose contents depend on whether detect_document_text or analyze_document was called). Synchronous operation suits small, single-page documents that need near real-time responses. Only one call is made to the Textract API, and that call returns the results. Documents in PDF, PNG, and JPEG format can be processed this way.

Synchronous Operation of DetectDocumentText API
Synchronous Operation of AnalyzeDocument API
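
As a minimal sketch of the two synchronous calls, assuming boto3 is installed and AWS credentials are configured: detect_document_text and analyze_document are the boto3 Textract client methods, while the helper names, the default region, and the file name in the usage comment are made up for illustration.

```python
def textract_client(region_name="us-east-1"):
    """Create a Textract client (needs boto3 and configured AWS credentials)."""
    import boto3  # imported lazily so the helpers below accept any compatible client
    return boto3.client("textract", region_name=region_name)


def detect_text_sync(client, document_bytes):
    """One synchronous call that returns the raw text (LINE and WORD blocks)."""
    return client.detect_document_text(Document={"Bytes": document_bytes})


def analyze_document_sync(client, document_bytes, feature_types=("TABLES", "FORMS")):
    """One synchronous call that also returns form and/or table data."""
    return client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=list(feature_types),
    )


# Hypothetical usage (the file name is an assumption):
#   client = textract_client()
#   with open("scanned_form.png", "rb") as f:
#       response = analyze_document_sync(client, f.read())
#   print(len(response["Blocks"]))
```

Passing the client in as an argument keeps the helpers easy to test with a stand-in object, and FeatureTypes is the only extra input that analyze_document needs compared with detect_document_text.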

Asynchronous operation is useful for processing large, multipage documents. It is not real-time: a large document takes time to process, so the rest of the pipeline can carry on with other work while the file is processed in the background. An asynchronous operation involves two calls to Textract: the first starts the job, and the second, made once the job completes, fetches the results for the processed document. The two calls are linked by the JobId returned by the start call; the same JobId is passed to the get call. For asynchronous operations the document is read from an Amazon S3 bucket, and this is the mode to use for multipage PDFs.

Asynchronous operation of DocumentAnalysis API
Asynchronous operation of DetectDocumentText API
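
The start/get pattern described above can be sketched as follows for text detection. start_document_text_detection and get_document_text_detection are the boto3 client methods; the helper names, the polling interval, and the retry limit are assumptions chosen for illustration, not recommended values.

```python
import time


def start_text_detection(client, bucket, key):
    """First call: start the job on a document stored in S3 and return its JobId."""
    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def wait_for_text_detection(client, job_id, poll_seconds=5, max_polls=120):
    """Second call: poll with the same JobId, then collect every page of results."""
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job_id} still in progress after {max_polls} polls")

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Job {job_id} ended with status {status}")

    # Large results are paginated; follow NextToken until it disappears.
    blocks = list(response.get("Blocks", []))
    next_token = response.get("NextToken")
    while next_token:
        response = client.get_document_text_detection(JobId=job_id, NextToken=next_token)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
    return blocks
```

The document analysis variant follows the same shape with start_document_analysis (which additionally takes FeatureTypes) and get_document_analysis. Polling keeps the sketch short; a real pipeline might instead rely on the completion notification that the start call can be configured to send.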

Let’s understand the difference between the two APIs. The analyze_document API extracts raw text, forms, and tables. The raw text is all the text present on the page. For forms, it identifies information laid out as questions and answers and returns each question with its corresponding answer (a question-answer pair). For tables, it returns the cells of each table, so the values in every row and column can be recovered. This API is quite intelligent and proves very helpful for extracting question-answer pairs, tables, and checkboxes, but this comes at a cost: it is priced much higher than the detect_document_text API.

The types of information returned are as follows:

  • Form data (key-value pairs). The related information is returned in two Block objects, each of type KEY_VALUE_SET: a KEY Block object and a VALUE Block object. For example, Name: Ana Silva Carolina contains a key and value. Name: is the key. Ana Silva Carolina is the value.
  • Table and table cell data. A TABLE Block object contains information about a detected table. A CELL Block object is returned for each cell in a table.
  • Lines and words of text. A LINE Block object contains one or more WORD Block objects. All lines and words that are detected in the document are returned (including text that doesn't have a relationship with the value of FeatureTypes).
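
To make the Block relationships listed above concrete, here is a rough sketch that walks an analyze_document response and rebuilds key-value pairs and table grids. The field names (BlockType, EntityTypes, Relationships, RowIndex, ColumnIndex, SelectionStatus) are those of the Textract Block object; the function names are hypothetical, and merged cells and multi-page grouping are deliberately ignored.

```python
def _text_of(block, blocks_by_id):
    """Join the text of a block's CHILD blocks (words or checkbox states)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])  # SELECTED / NOT_SELECTED
    return " ".join(parts)


def extract_key_values(response):
    """Map each KEY block's text to the text of its linked VALUE block(s)."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    _text_of(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        pairs[_text_of(block, blocks_by_id)] = value_text
    return pairs


def extract_tables(response):
    """Return each TABLE as a list of rows, each row a list of cell strings."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = blocks_by_id[cell_id]
                if cell["BlockType"] == "CELL":
                    position = (cell["RowIndex"], cell["ColumnIndex"])  # 1-based
                    cells[position] = _text_of(cell, blocks_by_id)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables
```

Indexing the blocks by Id first is the key design choice here: every relationship in the response is expressed as a list of Ids, so a single dictionary lookup resolves keys to values, values to words, and tables to cells.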

The detect_document_text API extracts only the raw text present in the document and does not give separate information for question-answer pairs, tables, or checkboxes. This API is quite simple and returns the basic information present in the document. It costs much less than the analyze_document API.

The types of information returned are as follows:

  • Lines and words of text. A LINE Block object contains one or more WORD Block objects. All lines and words that are detected in the document are returned.
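
Reading the raw text back out of such a response takes only a few lines. This is a small sketch, assuming the response is the dictionary returned by detect_document_text; the function name is hypothetical, and blocks without a Page field are treated as page 1.

```python
from collections import defaultdict


def lines_by_page(blocks):
    """Group the text of LINE blocks by page number, in the order returned."""
    pages = defaultdict(list)
    for block in blocks:
        if block["BlockType"] == "LINE":
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)


# Hypothetical usage with a response from detect_document_text:
#   for page, lines in lines_by_page(response["Blocks"]).items():
#       print(page, "\n".join(lines))
```

The WORD blocks are skipped on purpose, since each LINE block already carries the joined text of its words.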

Along with the information above, both APIs return the bounding box (coordinates), the confidence of the extraction, the text itself, and the type of text (printed or handwritten). They also return the lines and words detected, with a relationship mapping of the words that appear in each line. The Textract response is in nested JSON format. Hope this helps you understand the basic idea of Textract and its different APIs and operations, and identify which API and which operation to use in different use cases. The complete Python script can be found here.
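
As a closing sketch of the response fields described above, the function below follows each line's relationship mapping to its words and reads the confidence, text type, and bounding box of every word. Textract reports BoundingBox values as ratios of the page size, so the image width and height are needed to get pixels; the function name, the output layout, and the confidence threshold are assumptions made for illustration.

```python
def word_details(blocks, image_width, image_height, min_confidence=0.0):
    """List the words of each LINE with confidence, text type and pixel box."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    results = []
    for line in blocks:
        if line["BlockType"] != "LINE":
            continue
        for rel in line.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for word_id in rel["Ids"]:
                word = blocks_by_id[word_id]
                if word["Confidence"] < min_confidence:  # Confidence is 0-100
                    continue
                box = word["Geometry"]["BoundingBox"]  # ratios of page size
                results.append({
                    "line": line["Text"],
                    "word": word["Text"],
                    "confidence": word["Confidence"],
                    "text_type": word.get("TextType"),  # PRINTED or HANDWRITING
                    "pixel_box": (  # (left, top, width, height) in pixels
                        round(box["Left"] * image_width),
                        round(box["Top"] * image_height),
                        round(box["Width"] * image_width),
                        round(box["Height"] * image_height),
                    ),
                })
    return results
```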

Data Scientist at Quantiphi Analytics.