Extract Document Text Block

⚠ Experimental Feature

This feature is Experimental and may change based on user feedback and testing. Share your thoughts via our chatbot to help us improve it.

The Extract Document Text Block in Leapwork enables users to extract text from images and scanned PDFs using Optical Character Recognition (OCR) technology. This is useful for automating validation steps and for checking that text-based information in documents is processed correctly within workflows.

Example: A user uploads a scanned contract in PDF format and extracts the text automatically for validation in an approval workflow.

Note: This feature is available starting from Release 2025.1.XXX.

When fully expanded, the Extract Document Text block displays the following properties:

The Block Header ("Extract Document Text")

The green input connector in the header is used to trigger the block to start executing.

The green output connector in the header triggers when the text has been successfully extracted from the file.

The title of the block “Extract Document Text” can be changed by double-clicking on it and typing in a new title.


Users must select a supported file type when importing a document.

Source Type

When you drag a file into the block, the block automatically recognizes the file type. You can choose either of the following options as the source type:

  • Data File: The uploaded file is stored inside Leapwork.
  • Local Path: The file is referenced from a specified path.

Select the file to extract the text

This field allows users to upload a file. Selecting "Import New File" opens a window for uploading the document.

Extracted Text

This parameter returns the text recognized within the image or document. If the PDF contains multiple pages, the text is extracted from all pages and combined in order.

Example:
A scanned PDF containing the text "Contract Agreement 2024" will return:

Extracted Text: Contract Agreement 2024
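
To illustrate how text from several pages is combined in order, here is a minimal Python sketch. It is not Leapwork's implementation: the function name combine_page_texts and the newline separator are assumptions made for illustration, and the documentation does not specify which separator the block places between pages.

```python
def combine_page_texts(page_texts: list[str], separator: str = "\n") -> str:
    """Join per-page OCR text in page order.

    Hypothetical helper: the separator is an assumption, not documented behavior.
    """
    # Pages are assumed to arrive already sorted by page number.
    return separator.join(page_texts)


# A single-page document returns its text unchanged.
single_page = combine_page_texts(["Contract Agreement 2024"])

# A two-page document returns both pages' text, first page first.
two_pages = combine_page_texts(["Contract Agreement 2024", "Signed by both parties"])
```

In a flow, the combined value is what a later block would receive from the Extracted Text property.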

Failed

If the Extract Document Text Block fails to extract text, the Failed connector is triggered. Possible reasons for failure:

  • Unreadable or low-quality scanned text.
  • Unsupported file format.
  • Password-protected or encrypted PDF.
  • A corrupted PDF file.

When triggered, this connector can be used to handle failure scenarios within the automation flow.

Default Timeout

If the Default Timeout checkbox is not selected, the timeout value is 10 seconds. If the checkbox is selected, the Default Timeout value set in the flow settings applies.
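
The rule above can be sketched as a small Python function. This is an illustration of the described behavior only, not Leapwork code: the names effective_timeout, use_default_timeout, and flow_default_seconds are hypothetical.

```python
# Value the documentation states applies when the checkbox is not selected.
BLOCK_TIMEOUT_SECONDS = 10.0


def effective_timeout(use_default_timeout: bool, flow_default_seconds: float) -> float:
    """Return the timeout, in seconds, that applies to the block (hypothetical helper)."""
    if use_default_timeout:
        # Checkbox selected: the Default Timeout from the flow settings applies.
        return flow_default_seconds
    # Checkbox not selected: the 10-second value applies.
    return BLOCK_TIMEOUT_SECONDS
```

For example, with a flow-level Default Timeout of 30 seconds, the block would wait up to 30 seconds when the checkbox is selected and up to 10 seconds when it is not.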

Timeout

The maximum time spent extracting text from the file before giving up and triggering the Failed connector.

Note: All cases have a global timeout that can be configured in the Settings panel. This is unrelated to the timeout of a single building block. However, a running case will automatically be cancelled if it runs for longer than the global timeout.

Created 2025.02.12