Document Analysis

Extracts tables, lines and words from a PDF, TIFF, JPEG or PNG file.

The extraction uses the Microsoft Azure Form Recognizer API.

To use this activity, you need an Api Key and an Endpoint Url.

You can use the keys of an existing Microsoft Azure Form Recognizer resource, if one is available, or you can create a new resource. To create one, you need an Azure subscription; then create a Form Recognizer resource in the Azure portal. After it deploys, click Go to resource. You will need the key and endpoint of that resource to connect your workflow to the Form Recognizer API.

You can use the free pricing tier (F0) to try the service, and upgrade later to a paid tier for production.

Below is a screenshot of a Form Recognizer resource. Copy one of the keys to Api Key and the endpoint to Endpoint Url.

Form Recognizer resource keys

Table Lines designer
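
For orientation, the key and the endpoint are the two values any client needs to address the service. The sketch below is not the activity's implementation; it only illustrates how the two values could be combined into a request for the Form Recognizer REST layout route. The helper name `build_layout_request`, the v2.1 API path and the sample values are assumptions made for illustration.

```python
from pathlib import Path

# Content types for the file formats this activity accepts.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
}


def build_layout_request(endpoint: str, api_key: str, file_path: str):
    """Return (url, headers) for a layout analysis call (hypothetical helper)."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    # Assumption: the v2.1 Layout route; the activity may use another version.
    url = endpoint.rstrip("/") + "/formrecognizer/v2.1/layout/analyze"
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,  # one of the resource keys
        "Content-Type": CONTENT_TYPES[suffix],
    }
    return url, headers


url, headers = build_layout_request(
    "https://my-resource.cognitiveservices.azure.com/",  # placeholder endpoint
    "0123456789abcdef",  # placeholder key
    "SampleInvoice.jpg",
)
```

Stripping the trailing slash before appending the path means the endpoint can be pasted from the portal with or without it.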

Designer Properties

  1. Api Key: The Microsoft Azure API key for Form Recognizer. See above for how to get this key.
  2. Endpoint Url: The Microsoft Azure endpoint for Form Recognizer. See above for how to get this endpoint.
  3. File Path: The path to the file from which to extract the tables, words or lines.
  4. Output Tables: The extracted tables by page number, represented as a Dictionary(int, List(DataTable)), where the key is the page number and the value is the list of tables on that page.
  5. Merged Tables: All the extracted tables merged into one table, represented as a DataTable. This is useful when the file contains only one table.
  6. Output Words: The extracted words by page and line number, represented as a Dictionary(int, Dictionary(int, List(string))), where the key is the page number and the value is a dictionary from line number to the list of words on that line.
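
The output shapes above can be pictured with a small sketch. It uses plain Python lists and dicts as stand-ins for the .NET Dictionary, List and DataTable types. The sample rows and the `merge_tables` helper are made up for illustration, and the merge assumes that all tables share the same header row, which the activity itself may handle differently.

```python
from typing import Dict, List

# Stand-in for a DataTable: first row is the header, remaining rows are data.
Table = List[List[str]]

# Output Tables: page number -> list of tables found on that page.
output_tables: Dict[int, List[Table]] = {
    1: [[["Item", "Qty"], ["Pen", "2"]]],
    2: [[["Item", "Qty"], ["Ink", "1"], ["Pad", "5"]]],
}


def merge_tables(tables_by_page: Dict[int, List[Table]]) -> Table:
    """Append the data rows of every table, in page order, under one header."""
    merged: Table = []
    for page in sorted(tables_by_page):
        for table in tables_by_page[page]:
            if not table:
                continue  # skip empty tables
            if not merged:
                merged.append(table[0])  # keep the first header only
            merged.extend(table[1:])
    return merged


merged_tables = merge_tables(output_tables)
# merged_tables == [["Item", "Qty"], ["Pen", "2"], ["Ink", "1"], ["Pad", "5"]]
```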

Properties

Table words extraction properties

Azure Properties

  1. See Designer Properties above.
  2. Minimum Confidence: The minimum confidence required to accept parsed content. The default is 0.5, and it can be set to any value between 0 and 1.
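
How a threshold such as Minimum Confidence typically works can be sketched as follows. The `(text, confidence)` pairs and the `filter_by_confidence` helper are hypothetical, and whether the activity keeps values exactly equal to the threshold is an assumption.

```python
from typing import List, Tuple


def filter_by_confidence(
    words: List[Tuple[str, float]], minimum_confidence: float = 0.5
) -> List[str]:
    """Keep the words whose confidence is at least minimum_confidence."""
    if not 0.0 <= minimum_confidence <= 1.0:
        raise ValueError("minimum_confidence must be between 0 and 1")
    return [text for text, confidence in words if confidence >= minimum_confidence]


# Invented recognition results: (text, confidence) pairs.
recognized = [("Invoice", 0.98), ("N0.", 0.41), ("1024", 0.87)]
accepted = filter_by_confidence(recognized)     # ["Invoice", "1024"]
strict = filter_by_confidence(recognized, 0.9)  # ["Invoice"]
```

A higher threshold drops more doubtful content at the cost of possibly losing correct but low-scoring words.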

General Properties

See General Properties.

Misc

See Misc.

Out Error

See Out Error.

Result

  1. Document Lines: The extracted lines by page and line number, represented as a Dictionary(int, Dictionary(int, string)), where the key is the page number and the value is a dictionary from line number to the text of that line.
  2. Merged Tables: See Designer Properties above.
  3. Output Tables: See Designer Properties above.
  4. Output Words: The extracted words by page and line number, represented as a Dictionary(int, Dictionary(int, List(string))), where the key is the page number and the value is a dictionary from line number to the list of words on that line.
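
A short sketch of walking these nested results, with Python dicts standing in for the .NET dictionaries. The sample content and the numbering from 1 are invented, and rebuilding a line by joining its words with spaces is an assumption about how the two outputs relate, not documented behaviour.

```python
from typing import Dict, List

# Output Words: page number -> line number -> words on that line.
output_words: Dict[int, Dict[int, List[str]]] = {
    1: {1: ["Sample", "Invoice"], 2: ["Total:", "42.00"]},
    2: {1: ["Thank", "you"]},
}


def words_to_lines(
    words: Dict[int, Dict[int, List[str]]]
) -> Dict[int, Dict[int, str]]:
    """Build a Document Lines-shaped mapping by joining each line's words."""
    return {
        page: {line_no: " ".join(line_words) for line_no, line_words in lines.items()}
        for page, lines in words.items()
    }


document_lines = words_to_lines(output_words)

# Walk the result page by page, line by line.
for page, lines in sorted(document_lines.items()):
    for line_no, text in sorted(lines.items()):
        print(f"page {page}, line {line_no}: {text}")
```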

Example

Document Analysis

Sample Invoice file

In this example, we extract all the tables from "Sample Invoice" and display the result in an Edit Data table result window.

Table Lines designer

Set the Api Key, Endpoint Url and File Path, using the SampleInvoice.jpg file above or any other supported file.