How to use intelligent document processing in your business
Why documents?
Documents, in digital and printed form, are the main means of formal communication within and between companies. They are a universal way of exchanging information regardless of industry, country, or language. What's more, document circulation is often required by law.
Because documents are ubiquitous and often legally required, companies have to adapt their operations to them. Many take advantage of digitalization and, in addition to the required documents, exchange information in more structured formats, such as Excel spreadsheets. However, many companies, especially small ones, do not invest in automating this process because, at their scale, it brings no tangible benefit. For them, having an employee read the data from a PDF invoice and enter it into a bank transfer form is not a problem, because such situations occur rarely. For large organizations that work with many counterparties, however, the same process can be a significant cost. For such companies, automating document processing with an artificial intelligence tool can bring significant savings.
In this article, I will present my view on the potential of AI technology to optimize business operations and describe the different types of tasks that artificial intelligence can perform in document processing. In the following posts in this series on intelligent document processing, I will describe the implementation methods, requirements, and potential problems associated with these tasks in greater detail. I invite you to read on!
Business Advantage
Let me start with the question: why? Why would companies be interested in using AI for document processing? How do you convince them to invest in an innovation such as artificial intelligence?
Cost reduction & improved scaling
An artificial intelligence system can automate document processing and take over much of the work done by human employees, although difficult cases will still require human intervention. The biggest advantage of such a system is its scalability. Suppose the number of documents to be processed unexpectedly doubles. Without AI, this would mean doubling the number of employees; with AI, it only means scaling up the number of deployed model instances. Moreover, there is no need to train new employees.
It works both ways: a drop in demand for document processing simply means using fewer cloud resources and does not involve layoffs. Machine learning techniques make it easier to adapt to changing demand for work that was previously performed by people.
In my opinion (keep in mind that I am an ML engineer, not an executive), investing in AI systems that process documents will reduce costs and enable seamless scaling in large enterprises. This is especially important for companies exposed to rapid changes in the volume of documents they process. In addition, automation eliminates many sources of human error, such as typos made when retyping data manually.
Speeding up document processing
The undeniable advantage of using computers for data processing is their speed. Tasks that might take a human expert minutes or hours take a machine seconds. This is a great advantage, especially for processes that require an immediate response or whose results are useful only for a short time.
A simple example is stock quote data. Imagine having access to real-time data sources. Once the data is updated, a person needs a moment to analyze it and make a decision. If the reaction is too slow, the data has already changed and the decision may be outdated. A machine, on the other hand, will analyze the data in a fraction of a second and make a decision according to its algorithm before the data becomes outdated.
Automation
In complex business processes, automation shortens processing time and makes the process more transparent. Documents often pass through many pairs of eyes: people analyze various aspects of the document, confirm that the data is correct, transfer some of it to external information systems, apply complex business logic, and so on. If the process is procedural and follows well-defined rules, it can be automated.
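When a step follows well-defined rules, those rules can be encoded directly. Below is a minimal sketch in Python, assuming the invoice fields have already been extracted (for example, by the Key Information Extraction step described later); the field names, the example tax ID, and the rules themselves are hypothetical. A document that fails any rule is routed to a person, which is how the difficult cases mentioned earlier can still reach human review.

```python
from typing import Callable, List, Optional, Set, Tuple

# A rule inspects extracted invoice fields and returns None if the check passes,
# or a short reason if it fails. The field names used here are hypothetical.
Rule = Callable[[dict], Optional[str]]


def amounts_add_up(doc: dict) -> Optional[str]:
    """Net amount plus VAT should equal the gross amount (to the cent)."""
    if abs(doc["net"] + doc["vat"] - doc["gross"]) > 0.005:
        return "net + VAT does not match gross"
    return None


def supplier_is_known(approved_tax_ids: Set[str]) -> Rule:
    """Build a rule that accepts only suppliers from an approved list."""
    def rule(doc: dict) -> Optional[str]:
        if doc["supplier_tax_id"] not in approved_tax_ids:
            return "unknown supplier"
        return None
    return rule


def route(doc: dict, rules: List[Rule]) -> Tuple[str, List[str]]:
    """Auto-approve when every rule passes; otherwise send the document to a person."""
    failures = [reason for rule in rules if (reason := rule(doc)) is not None]
    return ("human_review", failures) if failures else ("auto_approve", [])


# Hypothetical rule set with a made-up approved supplier tax ID.
RULES = [amounts_add_up, supplier_is_known({"PL1234567890"})]
```

For example, `route({"net": 1000.0, "vat": 230.0, "gross": 1230.0, "supplier_tax_id": "PL1234567890"}, RULES)` returns `("auto_approve", [])`. Returning the list of failed rules, rather than a bare flag, tells the human reviewer what to look at.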
Intelligent Document Processing
The terms Document AI and Intelligent Document Processing are quite broad and cover all areas in which AI can be used to process documents of various types automatically. In this article, I will focus only on the documents we most often encounter in formal internal or inter-company communication, i.e., contracts, forms, invoices, applications, etc. From a business point of view, I see the greatest potential for automation in processing orders and invoices because of their role in company processes and their prevalence.
Document AI has many facets, so I will discuss its various aspects in the following sections.
Document Classification
One task that AI can solve is document classification, that is, assigning an incoming document to a type such as an invoice, a contract, or an application. Depending on the type of document we are dealing with, different AI systems or machine learning models will be used. For example, an invoice and an employee's promotion request will be processed differently.
Why? Mainly because documents of different types have different characteristics; we will discuss this in more detail in the next articles in this series. For now, we can assume that for strictly textual data we will use natural language processing techniques and deep learning models that exploit the textual features of the document. For images, we will use computer vision methods. Finally, for visually rich documents with strictly defined layouts, we will combine both in a multimodal fashion.
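The routing logic from the previous paragraph can be sketched as a simple dispatcher. This is a minimal illustration, not a production design: the two boolean traits are hypothetical, and in practice they would themselves be produced by a document parser or a trained classifier.

```python
from dataclasses import dataclass
from enum import Enum


class Pipeline(Enum):
    NLP = "natural language processing"
    COMPUTER_VISION = "computer vision"
    MULTIMODAL = "multimodal (text + layout + image)"


@dataclass(frozen=True)
class DocumentTraits:
    # Hypothetical traits; a real system might derive them from a PDF parser
    # or from the output of a document classifier.
    has_text: bool          # machine-readable text is available (text layer or OCR)
    is_visually_rich: bool  # meaning depends on layout (tables, forms, key-value fields)


def choose_pipeline(traits: DocumentTraits) -> Pipeline:
    """Map document traits to the family of models that should process the document."""
    if traits.has_text and traits.is_visually_rich:
        # e.g. invoices or order forms: text alone would lose the layout information
        return Pipeline.MULTIMODAL
    if traits.has_text:
        # e.g. a plain-text request or letter
        return Pipeline.NLP
    # no usable text, e.g. a photo or a diagram
    return Pipeline.COMPUTER_VISION
```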
Document Layout Analysis
For documents with a well-defined format, so-called Visually-Rich Documents, we can use a computer vision technique called Document Layout Analysis (DLA). This task involves assigning regions of a document page to predefined classes, such as table, text line, header, image, image caption, footer, etc.
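The output of a DLA model can be represented as a list of labeled regions. The sketch below shows one possible representation and how, for instance, table regions could be selected and passed on to later steps; the class list, the coordinate convention, and the 0.5 confidence threshold are assumptions, not the output format of any specific model.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Example label set; real DLA models define their own classes.
LAYOUT_CLASSES = ("text", "title", "table", "figure", "caption", "header", "footer")


@dataclass(frozen=True)
class Region:
    label: str                               # one of LAYOUT_CLASSES
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left
    score: float                             # model confidence in [0, 1]


def select(regions: List[Region], label: str, min_score: float = 0.5) -> List[Region]:
    """Keep confident regions of one class, ordered top-to-bottom, then left-to-right."""
    if label not in LAYOUT_CLASSES:
        raise ValueError(f"unknown layout class: {label}")
    kept = [r for r in regions if r.label == label and r.score >= min_score]
    return sorted(kept, key=lambda r: (r.bbox[1], r.bbox[0]))
```

The simple top-to-bottom ordering is enough for single-column pages; multi-column layouts, such as scientific articles, would need a proper reading-order step.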
AI research keeps producing specialized machine learning models that perform better and better at this task. Multimodal deep learning models, pre-trained with dedicated objectives on large amounts of unlabeled multimodal document data, lead the way.
Below is an example of how Document Layout Analysis works on a scientific article page.
You can check how Document Layout Analysis works on your own data with an interactive demo in a Hugging Face Space.
Table Structure Recognition
A special case of layout analysis is Table Structure Recognition (TSR). With TSR, only the table area is analyzed, and the table's components, such as rows, columns, cells, and column headers, are recognized. TSR makes it possible to automatically parse tables from documents into a structured form that can be used, for example, for document analysis or validation. For performance reasons, TSR is applied not to the entire document but only to the region where a table has been detected.
TSR-based solutions work well for tables with clear visual boundaries, so-called bordered tables. The clearer the visual separation of elements, the easier the task is for the AI model. Complex table hierarchies and borderless tables remain problematic.
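Once a TSR model has detected rows and columns, turning the table into a structured form can be as simple as placing each OCR word (see the Optical Character Recognition section) into the cell formed by the intersection of a row and a column. The sketch below assumes that rows span the full table width and columns the full table height, and it ignores cells that span several rows or columns; the input format is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def build_grid(rows: List[Box], cols: List[Box], words: List[Tuple[str, Box]]) -> List[List[str]]:
    """Assign each word to the row/column cell that contains the word's center point."""
    rows = sorted(rows, key=lambda b: b[1])  # top to bottom
    cols = sorted(cols, key=lambda b: b[0])  # left to right
    cells: List[List[List[str]]] = [[[] for _ in cols] for _ in rows]
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        i = next((i for i, r in enumerate(rows) if r[1] <= cy <= r[3]), None)
        j = next((j for j, c in enumerate(cols) if c[0] <= cx <= c[2]), None)
        if i is not None and j is not None:
            cells[i][j].append(text)  # words are assumed to arrive in reading order
    return [[" ".join(cell) for cell in row] for row in cells]
```

Spanning cells and borderless tables are exactly where this simple intersection rule breaks down, which mirrors the difficulties described above.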
Table Detection
A computer vision task closely related to tables is table detection. It is a special case of object detection applied to documents and involves determining the smallest rectangle that contains a table on a document page. It is a necessary step before TSR.
Deep learning models report close to 100% effectiveness on benchmarks. However, on real-world documents their effectiveness is often insufficient, and a tailor-made solution adapted to the data at hand is required. Since Table Structure Recognition works on the output of this task, detection errors propagate and can drastically reduce the final performance. This stage therefore deserves particular attention.
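Detection quality is usually measured with intersection over union (IoU) between predicted and ground-truth boxes. The minimal sketch below adds a second, hypothetical measure, coverage, to make error propagation visible: a box can score well on IoU and still cut off part of the table that TSR will then never see.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(b: Box) -> float:
    # Negative widths or heights (no overlap) are clamped to zero.
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> float:
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def iou(pred: Box, truth: Box) -> float:
    """Intersection over union: the usual overlap score in detection benchmarks."""
    inter = intersection(pred, truth)
    union = area(pred) + area(truth) - inter
    return inter / union if union > 0 else 0.0


def coverage(pred: Box, truth: Box) -> float:
    """Share of the true table that lies inside the predicted box, i.e. what TSR actually sees."""
    a = area(truth)
    return intersection(pred, truth) / a if a > 0 else 0.0
```

For a ground-truth table at `(0, 0, 100, 100)` and a prediction at `(0, 0, 100, 90)`, both `iou` and `coverage` equal 0.9: the detection easily passes an IoU threshold such as 0.5, common in object detection, yet the bottom tenth of the table, possibly the totals row, never reaches TSR.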
Optical Character Recognition
Optical Character Recognition (OCR) is a basic document processing step that converts a document in visual form into machine-readable text, together with the position of that text on the page. This is especially important for documents that have no text layer, such as scans, but whose content is business-relevant. OCR tools usually return a list of words or lines of text along with the bounding boxes in which they are located.
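OCR output of this kind often needs post-processing before it can be used. As an illustration, the sketch below groups word boxes into text lines using vertical overlap; the input format, the 0.5 overlap ratio, and the heuristic itself are assumptions rather than the behavior of any particular OCR engine, and it would need refinement for skewed scans or multi-column layouts.

```python
from typing import List, Tuple

Word = Tuple[str, Tuple[float, float, float, float]]  # (text, (x0, y0, x1, y1))


def group_into_lines(words: List[Word], min_overlap: float = 0.5) -> List[str]:
    """Group OCR words into lines by vertical overlap, then read each line left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[1][1]):  # top to bottom
        _, (_, y0, _, y1) = word
        for line in lines:
            _, (_, ly0, _, ly1) = line[0]  # compare with the line's first word
            overlap = min(y1, ly1) - max(y0, ly0)
            if overlap >= min_overlap * min(y1 - y0, ly1 - ly0):
                line.append(word)
                break
        else:
            lines.append([word])  # no existing line overlaps enough: start a new one
    return [" ".join(t for t, _ in sorted(line, key=lambda w: w[1][0])) for line in lines]
```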
Visual Question Answering
The task of Visual Question Answering (VQA) is to answer textual questions based on an image or document. An intuitive example of an AI system performing this task is a chatbot that we can ask questions about a document. Imagine that we upload a company's quarterly report to such a generative AI system to speed up our analysis: we ask it key questions and expect it to return correct answers based on the document. In simple terms, this is how VQA systems work.
You can try VQA in a Hugging Face Space with the Donut deep learning model.
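With Donut, a question is passed to the model as a text prompt wrapped in special task tokens, and the answer is read back from the generated sequence. The sketch below shows only that string handling, assuming the DocVQA prompt format shown in the Hugging Face Transformers documentation for a Donut checkpoint fine-tuned on DocVQA. The model call itself (via `DonutProcessor` and `VisionEncoderDecoderModel`) is omitted to keep the example dependency-free, and the exact tokens should be checked against the checkpoint you use.

```python
import re
from typing import Optional

# Assumed DocVQA task prompt for a Donut checkpoint; verify against your model card.
DOCVQA_PROMPT = "<s_docvqa><s_question>{question}</s_question><s_answer>"


def build_prompt(question: str) -> str:
    """Wrap a user question in the task tokens the decoder is conditioned on."""
    return DOCVQA_PROMPT.format(question=question)


def parse_answer(decoded: str) -> Optional[str]:
    """Extract the answer from a decoded sequence such as '...<s_answer>42</s_answer></s>'."""
    match = re.search(r"<s_answer>(.*?)(?:</s_answer>|</s>|$)", decoded, flags=re.DOTALL)
    return match.group(1).strip() if match else None
```

The parser also accepts a sequence that ends without the closing answer tag, since generation may stop early.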
Key Information Extraction
In my opinion, Key Information Extraction (KIE) is one of the tasks with the greatest potential. It extracts information from a document in a structured key-value form. Once a document has been turned into a dictionary, the data can be used freely: transferred to other systems, passed through business logic, validated field by field (for example, invoice data), and so on, thereby automating repetitive human tasks.
KIE is most often implemented by combining two tasks: Semantic Entity Recognition (SER) and Relation Extraction (RE). The former assigns a class to each word in a document and groups words into contiguous entities (most often with “Question” and “Answer” classes), while the latter determines which entities are related to each other. With classes and relations in hand, we can transform this information into a key-value dictionary.
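The last step, turning SER and RE predictions into a dictionary, can be sketched as follows. The sketch assumes FUNSD-style entity labels (QUESTION, ANSWER, HEADER, OTHER) and relations given as (question id, answer id) pairs; these data structures are hypothetical and would depend on the models used.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entity:
    entity_id: int
    label: str  # SER output, e.g. "QUESTION", "ANSWER", "HEADER", "OTHER"
    text: str   # the words grouped into this entity


def to_key_value(entities: List[Entity], relations: List[Tuple[int, int]]) -> Dict[str, str]:
    """Build {question text: answer text} from RE links of the form (question_id, answer_id)."""
    by_id = {e.entity_id: e for e in entities}
    result: Dict[str, str] = {}
    for q_id, a_id in relations:
        q, a = by_id.get(q_id), by_id.get(a_id)
        # Ignore links that do not connect a question to an answer.
        if q is None or a is None or q.label != "QUESTION" or a.label != "ANSWER":
            continue
        key = q.text.strip().rstrip(":").strip()
        # A key linked to several answers (e.g. a multi-line address) gets them joined.
        result[key] = f"{result[key]} {a.text}" if key in result else a.text
    return result
```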
Summary
In this article, I have outlined what Intelligent Document Processing is and which tasks it involves. I described the advantages I see in using this type of AI system to process documents, briefly presented the tasks that Document AI performs, and outlined their possible business applications. I hope that after reading this article, dear reader, you have a general idea of how AI techniques can help you process documents automatically. In future posts, I will describe particular problems in more detail and how they are solved with AI algorithms, so stay tuned!