The Business Perspective on Document AI
You need to process tens of thousands of documents each month and wonder whether it’s worth investing in AI. Or maybe someone on the team said, "OpenAI will probably handle it - let’s give it a try"?
Extracting key information from documents such as invoices, receipts, contracts, or reports is one of the most common problems companies face, and one that can be automated using document AI. It's also one of the easiest to design poorly. The choice of technology directly impacts costs, scalability, output quality, and user frustration.
Below, you'll find a list of key questions every company should ask before committing to a machine learning (ML) solution. Choosing the right approach isn’t just a technical decision; it’s a strategic one with long-term impacts on costs, capabilities, and flexibility.
- Should we build our solution or buy an off-the-shelf one (build vs. buy)?
- Do we care more about low operational cost (OPEX), or are we willing to accept a higher upfront investment (CAPEX)?
- Is complete control over the model critical, or is a working "black box" enough?
- Do we want to build internal capabilities and invest in future AI initiatives?
In this article, I compare three practical approaches to solving this problem:
- Ready-made SaaS services (Azure AI Document Intelligence)
- Commercial LLMs (OpenAI)
- Fine-tuned open-source models
I’m not looking for a universal winner. Instead, I’ll show which approach works best at which scale, with what trade-offs, how operational costs differ, and what you should know before investing your ML team’s time or money in external APIs.
In the first part, I’ll explain what VRDU (Visually-Rich Document Understanding) really is and why classical NLP often fails on real-world documents like invoices or contracts. Next, I’ll walk you through the three most practical approaches to key information extraction from visually-rich documents: SaaS tools, Large Language Models (LLMs), and fine-tuned open-source models.
Then, I’ll show how different solutions compare in terms of cost, control, scalability, and implementation effort. I also share a cost breakdown based on real calculations, so you can estimate ROI and choose the right path for your specific needs and scale. Whether you’re building a PoC, scaling a mature platform, or trying to avoid expensive mistakes - this guide is here to help.
What is Visually-Rich Document Understanding (VRDU)?
Most Natural Language Processing (NLP) systems operate on plain text: lines, paragraphs, sequences of words. But business documents follow different rules.
Take an invoice image, for example. It’s not just text; it’s structure: columns, fields, headers, table rows, signatures. Key information might be located in a corner, next to a logo, in the footer, or below a table. Two documents might contain the same data but arrange it differently.
That’s why classical NLP or Information Extraction (IE) approaches, which rely only on text sequences, often fail. They lack visual awareness, an understanding of how information is positioned on the page.
Visually-Rich Document Understanding (VRDU) is a field that combines: text recognition (OCR), layout analysis, and modeling of visual-textual context (e.g., relationships between fields, positions in tables, header hierarchy).
In short: if your documents have a meaningful layout, you need a solution that understands and leverages it.
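To make that concrete, here is a minimal sketch of what a layout-aware pipeline starts from. It assumes the Tesseract engine plus the pytesseract and Pillow packages are installed, and "invoice.png" is a placeholder path; the point is that OCR can return words with positions on the page, not just a flat string.

```python
# Minimal sketch: OCR that keeps layout information, not just raw text.
# Assumes Tesseract plus the pytesseract and Pillow packages are installed;
# "invoice.png" is a placeholder path.
import pytesseract
from PIL import Image

image = Image.open("invoice.png")
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Each recognized word comes with its bounding box - the signal that
# plain-text NLP throws away and VRDU models rely on.
for text, left, top, width, height in zip(
    ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
):
    if text.strip():
        print(f"{text!r} at x={left}, y={top}, w={width}, h={height}")
```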
What is Key Information Extraction from documents?
Key Information Extraction (KIE) is the task of automatically extracting specific, business-relevant data from documents. It includes fields such as:
- Invoice number, sale date, net/gross amount, VAT rate, tax ID
- Contractor details
- Bank account number
- Contract validity date or product name
Unlike classical optical character recognition (OCR), which converts a document image into text, KIE goes a step further: it understands which parts of the text are meaningful, what they represent, and where that information belongs in your data model.
It’s not just about what was written, but what it means in the context of the entire document.
Turning semi-structured text into structured data is the most challenging step. The exact same number could be an invoice ID in one place and a bank account number in another. The word "VAT" might appear multiple times and mean different things. The item table on an invoice requires sequential processing and an understanding of the column context.
A good KIE solution not only recognizes fields but also assigns the correct label, data type, and position to each one. In more advanced cases, it can also detect relationships between fields (e.g., which amount corresponds to which VAT rate).
That’s why KIE requires more than classical NLP, computer vision, or OCR alone: to work well, it needs a combination of content understanding, document structure analysis, and visual context awareness.
Below is an example output of such a KIE solution (source: Microsoft), where specific fields have been identified on an invoice. Once extracted, this data becomes part of a structured digital document and can be used in any workflow: passed to other systems, exported, or validated.
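For illustration only (this is a hypothetical schema, not the actual output format of any particular vendor), the structured result of KIE on a single invoice could be represented like this:

```python
# Hypothetical example of structured KIE output for one invoice.
# Field names, values, and the schema itself are illustrative assumptions.
extracted_invoice = {
    "invoice_number": {"value": "FV/2024/03/117", "confidence": 0.97,
                       "bbox": [412, 88, 560, 112]},   # pixel coordinates on the page
    "sale_date":      {"value": "2024-03-14", "confidence": 0.95,
                       "bbox": [412, 120, 505, 140]},
    "net_amount":     {"value": 1200.00, "currency": "EUR", "confidence": 0.93,
                       "bbox": [480, 732, 556, 752]},
    "vat_rate":       {"value": "23%", "confidence": 0.91,
                       "bbox": [400, 732, 440, 752]},
    "tax_id":         {"value": "PL1234567890", "confidence": 0.96,
                       "bbox": [60, 180, 210, 200]},
}
```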
How to solve this problem?
When analyzing this challenge, I identified three significantly different solution approaches. In this section, I’ll describe them from a business perspective.
AI-powered SaaS
This approach is appealing for its simplicity:
- No need for an in-house ML team
- No infrastructure to maintain
- No training data required upfront
You only need an API and a few documents, and you can start seeing results in days. For this example, I’ll refer to Azure AI Document Intelligence, which supports multiple languages.
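As a rough sketch of how small that integration can be, here is what a call to the prebuilt invoice model might look like with the azure-ai-formrecognizer Python SDK; the endpoint, key, and file path are placeholders, and the exact SDK surface may differ in newer releases of the service.

```python
# Minimal sketch using the azure-ai-formrecognizer SDK and the prebuilt invoice model.
# Endpoint, key, and file path are placeholders; check the current documentation,
# as Microsoft has been migrating this service to the azure-ai-documentintelligence package.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("invoice.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-invoice", document=f)
result = poller.result()

# Each analyzed document exposes its extracted fields with confidence scores.
for doc in result.documents:
    for name, field in doc.fields.items():
        print(name, field.value, field.confidence)
```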
Benefits
- Quick deployment - just connect to the API and implement a simple integration to receive results
- High-quality OCR and layout understanding - Microsoft leverages its own Vision + NLP engines
- Low entry barrier - no model training, hyperparameter tuning, or manual data validation needed
- Multilingual support - currently supports 48 languages, including good coverage for European and Asian markets
- Minimal training data required - according to the documentation, as few as 5 labeled samples per document layout can be enough, and the service provides a tool for annotating the data
Risks
- Costs scale with volume - every processed page is billed separately. At high volumes, this can mean hundreds of thousands of dollars monthly
- Lack of control over the model - you don’t know how it works internally, you can’t adapt it to your specific data or handle edge cases
- Limited fine-tuning capabilities - only available for select models, with a cap of 50,000 examples
- Vendor lock-in - switching providers may incur significant technological and organizational costs
- No public benchmarks available - there’s little transparency about expected accuracy or model performance
- No guarantee that your data will stay private and won’t be used, e.g., to train future iterations of the vendor’s model (this depends on the particular solution; some define it in their privacy policy)
- Potential issues with regulatory compliance (GDPR and other regulations), depending on the particular solution and what its policies guarantee
Who is this approach suitable for?
An AI-powered SaaS solution makes the most sense when:
- You need a fast prototype without building an internal ML team
- Your document volume is low to moderate (e.g. tens of thousands per month)
- You have a small labeled dataset and don’t want to invest in growing it
- Your priority is stability, SLA, and security - not full control over model behavior
- Data extraction is a supporting tool, not your core competitive advantage
In such cases, SaaS offers excellent time-to-value. However, as the scale and customization needs increase, this solution quickly becomes less cost-effective.
Open-source models
The most demanding option, both technologically and organizationally, is building your own machine learning model for key information extraction, trained on your company's data and deployed in your own environment.
This approach gives you full control over the process:
- Selecting the model architecture (e.g. LiLT, LayoutLM, Donut; see the sketch after this list)
- Building a custom processing pipeline (OCR, relation extraction, field classification)
- Continuous re-training of the model
- Deployment in your own infrastructure (cloud or on-premise)
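As a rough illustration of that starting point, the sketch below loads a layout-aware architecture (LayoutLMv3 via the Hugging Face transformers library) for token classification. The label set is a made-up placeholder, and a real project still needs annotated documents, a training loop, and evaluation before the predictions mean anything.

```python
# Minimal sketch: a layout-aware token-classification model as the starting point
# for a custom KIE system. Requires transformers, torch, Pillow, and a local
# Tesseract install (the processor runs OCR by default). Labels are placeholders.
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image

labels = ["O", "B-INVOICE_NUMBER", "B-SALE_DATE", "B-NET_AMOUNT", "B-TAX_ID"]  # hypothetical field classes
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

image = Image.open("invoice.png").convert("RGB")
encoding = processor(image, return_tensors="pt")  # OCR + tokenization + bounding boxes
outputs = model(**encoding)
predicted = outputs.logits.argmax(-1)             # meaningful only after fine-tuning on your data
```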
Benefits
- Best fit for your specific data - the model is trained on your company’s documents, with attention to edge cases
- Complete control over model behavior - you can fine-tune, retrain, analyze errors, and add new field classes as needed
- No token fees - once deployed, operational costs are significantly lower and depend on the time and resources used to process the data
- Meets security and privacy requirements - essential in industries like finance, healthcare, and logistics
- Opportunity to develop internal ML infrastructure - one project can become the foundation for your entire AI strategy
- High performance backed by research - multiple academic publications confirm the effectiveness of popular open-source model architectures: 1, 2, 3, 4, 5
- Typically better accuracy than the other approaches described here, once the model is trained on your own documents
Risks
- High upfront cost - requires an ML team, DevOps, annotators, and implementation time. It's a long-term investment
- Complex infrastructure and maintenance - training, testing, deployment, monitoring, and automated retraining all add operational complexity
- Potential OSS license restrictions - some models are limited to non-commercial use (must be evaluated in risk analysis)
- A large volume of high-quality training data - small datasets may not be enough to train a robust model. Tens of thousands of labeled samples may be needed. While some can be collected passively (e.g., from user corrections), preparing them still takes effort and time
Who is this approach right for?
This option is best suited for organizations that:
- Process millions of documents per month
- Have (or can build) a large training dataset
- Have (or are developing) an ML / Data Science team
- Want full ownership of the solution and their data
- View AI strategically, not as a one-off implementation
At the right scale, the cost of building such a solution pays off quickly, and the company gains not just a model, but the foundation for a fully tailored AI architecture.
LLMs
The third approach to document information extraction involves using commercial Large Language Models (LLMs), such as GPT-4o (OpenAI), Claude, or Mistral, as the core processing component.
A typical pipeline looks like this (a minimal code sketch follows the list):
- OCR - convert a document (e.g., PDF, scan) into text while preserving the reading order of fragments
- Prompt engineering - craft queries like “Extract the data from this document in the format: {prompt_format}”
- Post-processing - clean and structure the response, validate fields, apply fallback rules or classifiers
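Below is a minimal sketch of such a pipeline using the openai Python SDK, assuming the OCR text is already available as a string. The model name, prompt, and field list are illustrative; a production version would add validation, retries, and fallback rules.

```python
# Minimal sketch: prompt-based extraction over OCR text with the openai SDK.
# Assumes OPENAI_API_KEY is set; the prompt, field list, and model are illustrative.
import json
from openai import OpenAI

client = OpenAI()

ocr_text = "..."  # output of the OCR step, in reading order

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": (
            "Extract the following fields from the invoice text and return JSON only: "
            "invoice_number, sale_date, net_amount, vat_rate, tax_id. "
            "Use null for fields that are not present."
        )},
        {"role": "user", "content": ocr_text},
    ],
    response_format={"type": "json_object"},
    temperature=0,
)

# Post-processing: parse, then validate types and required fields before use.
fields = json.loads(response.choices[0].message.content)
print(fields)
```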
Benefits
- Rapid prototyping - LLMs handle diverse language, unusual layouts, and imperfect OCR surprisingly well
- No need to train models - just design a thoughtful prompt and use some test data
- Low entry cost - ideal for proof-of-concept (PoC) or feasibility studies (R&D)
- Short-term investment
- Flexibility for non-standard document types - e.g., financial reports, service requests, handwritten notes
Risks
- Lack of determinism - the same prompt may yield different results, which can hinder or block usage in specific applications
- Hallucinations and extraction errors - the model may "guess" missing data, which is unacceptable in contexts like accounting, finance, or law. Additionally, there's no guarantee the model will return exactly the number of fields you expect, which introduces another category of errors
- Difficult to validate output quality - without references to document coordinates, testing and debugging are harder, and coordinates generated by LLMs are usually incorrect
- Higher inference costs at scale - pricing is token-based, and documents tend to be long
- Lack of comprehensive research confirming LLM performance - the only study I found that evaluates GPT-4o shows these models underperforming dedicated open-source models (read more about it here)
- No option (or very high cost) to fine-tune the model on your own data
- Sending documents to an external API may not be acceptable in regulated sectors (finance, medicine)
- Need for extensive prompt engineering experiments
Who is this approach right for?
LLM-based extraction makes sense when:
- You're just starting to explore the problem space and want to test several concepts quickly
- You're dealing with atypical documents that don’t fit fixed templates (e.g., letters, reports, notes)
- You want to build a PoC without committing ML resources or infrastructure
- You're willing to trade off precision and control for faster implementation
This is a great tool for exploration and rapid iteration, but it is not necessarily the right technology for a production-grade system at scale with tight accuracy requirements.
Key differences
Operational costs
I created a comparison table with each solution's operational costs. Most of the prices were buried deep in the documentation. For LLMs, I had to make assumptions about token usage. For the open-source model, I estimated processing times based on similarly sized models. The chart below clearly shows the winner in terms of operational costs.
Chart presenting operational costs for different tools.
Assumptions
In this simulation, I assume we're sending a single PDF page to the GPT-4o-mini model along with a system instruction telling it what to do. These are additional tokens, but let's assume the input stays within 500 tokens and the response contains up to 1,500 tokens.
Calculations
The GPT-4o-mini model uses more tokens to represent images (users seem to confirm this in community discussions), so the cost of processing an image ends up roughly the same as with GPT-4o. I therefore use GPT-4o's input pricing to calculate the cost of processing the page and add the output cost at GPT-4o-mini's rate.
GPT-4o Calculations
$2.50 / 1,000,000 = $0.0000025 per input token
500 input tokens = 500 × $0.0000025 = $0.00125
GPT-4o-mini Calculations
$0.60 / 1,000,000 = $0.0000006 per output token
1,500 output tokens = 1,500 × $0.0000006 = $0.0009
$0.00125 + $0.0009 = $0.00215 per page
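The same arithmetic as a short script, so you can plug in your own volumes; the rates are the ones assumed above and may change over time.

```python
# Per-page LLM cost estimate using the pricing assumed above (USD per 1M tokens).
# Prices and token counts are the article's assumptions - check current pricing.
INPUT_PRICE_PER_M = 2.5    # GPT-4o input pricing, applied to the prompt/image tokens
OUTPUT_PRICE_PER_M = 0.6   # GPT-4o-mini output pricing

input_tokens, output_tokens = 500, 1_500

cost_per_page = (
    input_tokens * INPUT_PRICE_PER_M / 1_000_000
    + output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
)
print(f"~${cost_per_page:.5f} per page")  # ~$0.00215

for pages_per_month in (10_000, 100_000, 1_000_000):
    print(f"{pages_per_month:>9,} pages/month -> ${pages_per_month * cost_per_page:,.0f}")
```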
How to choose the right approach based on scale and needs? - Summary
There is no one-size-fits-all solution for document data extraction. Each approach (SaaS, LLM, or a custom model) has its place. The key question is not which technology is better but which one best fits your specific context: your scale, available resources, quality requirements, and long-term strategy.
Below is a summary that may help guide your decision:
What’s next?
If you’re considering implementing a document data extraction system, it’s worth starting with a needs analysis and aligning the technology with your organization's scale and processes.
We offer workshops and consultations where we:
- Assess the automation potential
- Compare available approaches (e.g., SaaS, LLM, custom model)
- Prepare a recommendation tailored to your specific case
- Prepare a project roadmap ready for implementation
If you're planning budget allocation for the next quarter, or want to avoid costly mistakes - let’s talk.
Reviewed by: Adam J. Kaczmarek