The Business Perspective on Document AI
You need to process tens of thousands of documents each month and wonder whether it’s worth investing in AI. Or maybe someone on the team said, "OpenAI will probably handle it - let’s give it a try"?
Extracting key information from documents such as invoices, receipts, contracts, or reports is one of the most common problems companies face, and one that can be automated using document AI. It's also one of the easiest to design poorly. The choice of technology directly impacts costs, scalability, output quality, and user frustration.
Below, you'll find a list of key questions every company should ask before committing to a machine learning (ML) solution. Choosing the right approach isn’t just a technical decision; it’s a strategic one with long-term impacts on costs, capabilities, and flexibility.
- Should we build our solution or buy an off-the-shelf one (build vs. buy)?
- Do we care more about low operational cost (OPEX), or are we willing to accept a higher upfront investment (CAPEX)?
- Is complete control over the model critical, or is a working "black box" enough?
- Do we want to build internal capabilities and invest in future AI initiatives?
In this article, I compare three practical approaches to solving this problem:
- Ready-made SaaS services (Azure AI Document Intelligence)
- Commercial LLMs (OpenAI)
- Fine-tuned open-source models
I’m not looking for a universal winner. Instead, I’ll show which approach works best at which scale, with what trade-offs, how operational costs differ, and what you should know before investing your ML team’s time or money in external APIs.
In the first part, I’ll explain what VRDU (Visually-Rich Document Understanding) really is and why classical NLP often fails on real-world documents like invoices or contracts. Next, I’ll walk you through the three most practical approaches to key information extraction from visually-rich documents: SaaS tools, Large Language Models (LLMs), and fine-tuned open-source models.
Then, I’ll show how different solutions compare in terms of cost, control, scalability, and implementation effort. I also share a cost breakdown based on real calculations, so you can estimate ROI and choose the right path for your specific needs and scale. Whether you’re building a PoC, scaling a mature platform, or trying to avoid expensive mistakes - this guide is here to help.
What is Visually-Rich Document Understanding (VRDU)?
Most Natural Language Processing (NLP) systems operate on plain text: lines, paragraphs, sequences of words. But business documents follow different rules.
Take an invoice image, for example. It’s not just text; it’s structure: columns, fields, headers, table rows, signatures. Key information might be located in a corner, next to a logo, in the footer, or below a table. Two documents might contain the same data but arrange it differently.
That’s why classical NLP or Information Extraction (IE) approaches, which rely only on text sequences, often fail. They lack visual awareness, an understanding of how information is positioned on the page.
Visually-Rich Document Understanding (VRDU) is a field that combines: text recognition (OCR), layout analysis, and modeling of visual-textual context (e.g., relationships between fields, positions in tables, header hierarchy).
In short: if your documents have a meaningful layout, you need a solution that understands and leverages it.
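To make that concrete, here is a minimal sketch of what a layout-aware pipeline starts from. It assumes the Tesseract engine plus the pytesseract and Pillow packages are installed, and "invoice.png" is a placeholder path; the point is that OCR can return words with positions on the page, not just a flat string.

```python
# Minimal sketch: OCR that keeps layout information, not just raw text.
# Assumes Tesseract plus the pytesseract and Pillow packages are installed;
# "invoice.png" is a placeholder path.
import pytesseract
from PIL import Image

image = Image.open("invoice.png")
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Each recognized word comes with its bounding box - the signal that
# plain-text NLP throws away and VRDU models rely on.
for text, left, top, width, height in zip(
    ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
):
    if text.strip():
        print(f"{text!r} at x={left}, y={top}, w={width}, h={height}")
```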
What is Key Information Extraction from documents?
Key Information Extraction (KIE) is the task of automatically extracting specific, business-relevant data from documents. It includes fields such as:
- Invoice number, sale date, net/gross amount, VAT rate, tax ID
- Contractor details
- Bank account number
- Contract validity date or product name
Unlike classical optical character recognition (OCR), which converts a document image into text, KIE goes a step further: it understands which parts of the text are meaningful, what they represent, and where that information belongs in your data model.
It’s not just about what was written, but what it means in the context of the entire document.
Turning semi-structured text into structured data is the most challenging step. The exact same number could be an invoice ID in one place and a bank account number in another. The word "VAT" might appear multiple times and mean different things. The item table on an invoice requires sequential processing and an understanding of the column context.
A good KIE solution not only recognizes fields but also assigns the correct label, data type, and position to each one. In more advanced cases, it can also detect relationships between fields (e.g., which amount corresponds to which VAT rate).
That’s why KIE requires more than classical NLP, computer vision, or OCR alone: to work well, it needs a combination of content understanding, document structure analysis, and visual context awareness.
Below is an example output of such a KIE solution (source: Microsoft), where specific fields have been identified on an invoice. Once extracted, this data becomes part of a structured digital document and can be used in any workflow: passed to other systems, exported, or validated.
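For illustration only (this is a hypothetical schema, not the actual output format of any particular vendor), the structured result of KIE on a single invoice could be represented like this:

```python
# Hypothetical example of structured KIE output for one invoice.
# Field names, values, and the schema itself are illustrative assumptions.
extracted_invoice = {
    "invoice_number": {"value": "FV/2024/03/117", "confidence": 0.97,
                       "bbox": [412, 88, 560, 112]},   # pixel coordinates on the page
    "sale_date":      {"value": "2024-03-14", "confidence": 0.95,
                       "bbox": [412, 120, 505, 140]},
    "net_amount":     {"value": 1200.00, "currency": "EUR", "confidence": 0.93,
                       "bbox": [480, 732, 556, 752]},
    "vat_rate":       {"value": "23%", "confidence": 0.91,
                       "bbox": [400, 732, 440, 752]},
    "tax_id":         {"value": "PL1234567890", "confidence": 0.96,
                       "bbox": [60, 180, 210, 200]},
}
```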
How to solve this problem?
When analyzing this challenge, I identified three significantly different solution approaches. In this section, I’ll describe them from a business perspective.
AI-powered SaaS
This approach is appealing for its simplicity:
- No need for an in-house ML team
- No infrastructure to maintain
- No training data required upfront
You only need an API and a few documents, and you can start seeing results in days. For this example, I’ll refer to Azure AI Document Intelligence, which supports multiple languages.
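As a rough sketch of how small that integration can be, here is what a call to the prebuilt invoice model might look like with the azure-ai-formrecognizer Python SDK; the endpoint, key, and file path are placeholders, and the exact SDK surface may differ in newer releases of the service.

```python
# Minimal sketch using the azure-ai-formrecognizer SDK and the prebuilt invoice model.
# Endpoint, key, and file path are placeholders; check the current documentation,
# as Microsoft has been migrating this service to the azure-ai-documentintelligence package.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("invoice.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-invoice", document=f)
result = poller.result()

# Each analyzed document exposes its extracted fields with confidence scores.
for doc in result.documents:
    for name, field in doc.fields.items():
        print(name, field.value, field.confidence)
```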
Benefits
- Quick deployment - just connect to the API and implement a simple integration to receive results
- High-quality OCR and layout understanding - Microsoft leverages its own Vision + NLP engines
- Low entry barrier - no model training, hyperparameter tuning, or manual data validation needed
- Multilingual support - currently supports 48 languages, including good coverage for European and Asian markets
- Minimal training data required - according to the documentation, as few as 5 labeled samples per document layout can be enough, and the service provides a tool for annotating the data
Risks
- Costs scale with volume - every processed page is billed separately. At high volumes, this can mean hundreds of thousands of dollars monthly
- Lack of control over the model - you don’t know how it works internally, you can’t adapt it to your specific data or handle edge cases
- Limited fine-tuning capabilities - only available for select models, with a cap of 50,000 examples
- Vendor lock-in - switching providers may incur significant technological and organizational costs
- No public benchmarks available - there’s little transparency about expected accuracy or model performance
- No guarantee that your data will stay private and won’t be used, e.g., to train future iterations of the vendor’s model (this depends on the particular solution; some define it in their privacy policy)
- Potential issues with regulatory compliance (GDPR and other regulations), depending on the particular solution and what its policies guarantee
Who is this approach suitable for?
An AI-powered SaaS solution makes the most sense when:
- You need a fast prototype without building an internal ML team
- Your document volume is low to moderate (e.g. tens of thousands per month)
- You have a small labeled dataset and don’t want to invest in growing it
- Your priority is stability, SLA, and security - not full control over model behavior
- Data extraction is a supporting tool, not your core competitive advantage
In such cases, SaaS offers excellent time-to-value. However, as the scale and customization needs increase, this solution quickly becomes less cost-effective.
Open-source models
The most demanding option, both technologically and organizationally, is building your own machine learning model for key information extraction, trained on your company's data and deployed in your own environment.
This approach gives you full control over the process:
- Selecting the model architecture (e.g. LiLT, LayoutLM, Donut; see the sketch after this list)
- Building a custom processing pipeline (OCR, relation extraction, field classification)
- Continuous re-training of the model
- Deployment in your own infrastructure (cloud or on-premise)
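As a rough illustration of that starting point, the sketch below loads a layout-aware architecture (LayoutLMv3 via the Hugging Face transformers library) for token classification. The label set is a made-up placeholder, and a real project still needs annotated documents, a training loop, and evaluation before the predictions mean anything.

```python
# Minimal sketch: a layout-aware token-classification model as the starting point
# for a custom KIE system. Requires transformers, torch, Pillow, and a local
# Tesseract install (the processor runs OCR by default). Labels are placeholders.
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image

labels = ["O", "B-INVOICE_NUMBER", "B-SALE_DATE", "B-NET_AMOUNT", "B-TAX_ID"]  # hypothetical field classes
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

image = Image.open("invoice.png").convert("RGB")
encoding = processor(image, return_tensors="pt")  # OCR + tokenization + bounding boxes
outputs = model(**encoding)
predicted = outputs.logits.argmax(-1)             # meaningful only after fine-tuning on your data
```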
Benefits
- Best fit for your specific data - the model is trained on your company’s documents, with attention to edge cases
- Complete control over model behavior - you can fine-tune, retrain, analyze errors, and add new field classes as needed
- No token fees - once deployed, operational costs are significantly lower and depend on the time and resources used to process the data
- Meets security and privacy requirements - essential in industries like finance, healthcare, and logistics
- Opportunity to develop internal ML infrastructure - one project can become the foundation for your entire AI strategy
- High performance backed by research - multiple academic publications confirm the effectiveness of popular open-source model architectures: 1, 2, 3, 4, 5
- Typically better accuracy than the other approaches described here, once the model is trained on your own documents
Risks
- High upfront cost - requires an ML team, DevOps, annotators, and implementation time. It's a long-term investment
- Complex infrastructure and maintenance - training, testing, deployment, monitoring, and automated retraining all add operational complexity
- Potential OSS license restrictions - some models are limited to non-commercial use (must be evaluated in risk analysis)
- A large volume of high-quality training data - small datasets may not be enough to train a robust model. Tens of thousands of labeled samples may be needed. While some can be collected passively (e.g., from user corrections), preparing them still takes effort and time
Who is this approach right for?
This option is best suited for organizations that:
- Process millions of documents per month
- Have (or can build) a large training dataset
- Have (or are developing) an ML / Data Science team
- Want full ownership of the solution and their data
- View AI strategically, not as a one-off implementation
At the right scale, the cost of building such a solution pays off quickly, and the company gains not just a model, but the foundation for a fully tailored AI architecture.
LLMs
The third approach to document information extraction involves using commercial Large Language Models (LLMs), such as GPT-4o (OpenAI), Claude, or Mistral, as the core processing component.
A typical pipeline looks like this (a minimal code sketch follows the list):
- OCR - convert a document (e.g., PDF, scan) into text while preserving the reading order of fragments
- Prompt engineering - craft queries like “Extract the data from this document in the format: {prompt_format}”
- Post-processing - clean and structure the response, validate fields, apply fallback rules or classifiers
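Below is a minimal sketch of such a pipeline using the openai Python SDK, assuming the OCR text is already available as a string. The model name, prompt, and field list are illustrative; a production version would add validation, retries, and fallback rules.

```python
# Minimal sketch: prompt-based extraction over OCR text with the openai SDK.
# Assumes OPENAI_API_KEY is set; the prompt, field list, and model are illustrative.
import json
from openai import OpenAI

client = OpenAI()

ocr_text = "..."  # output of the OCR step, in reading order

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": (
            "Extract the following fields from the invoice text and return JSON only: "
            "invoice_number, sale_date, net_amount, vat_rate, tax_id. "
            "Use null for fields that are not present."
        )},
        {"role": "user", "content": ocr_text},
    ],
    response_format={"type": "json_object"},
    temperature=0,
)

# Post-processing: parse, then validate types and required fields before use.
fields = json.loads(response.choices[0].message.content)
print(fields)
```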
Benefits
- Rapid prototyping - LLMs handle diverse language, unusual layouts, and imperfect OCR surprisingly well
- No need to train models - just design a thoughtful prompt and use some test data
- Low entry cost - ideal for proof-of-concept (PoC) or feasibility studies (R&D)
- Short-term investment
- Flexibility for non-standard document types - e.g., financial reports, service requests, handwritten notes
Risks
- Lack of determinism - the same prompt may yield different results, which can hinder or block usage in specific applications
- Hallucinations and extraction errors - the model may "guess" missing data, which is unacceptable in contexts like accounting, finance, or law. Additionally, there's no guarantee the model will return exactly the number of fields you expect, which introduces another category of errors
- Difficult to validate output quality - without references to document coordinates, testing and debugging are harder, and coordinates generated by LLMs are usually incorrect
- Higher inference costs at scale - pricing is token-based, and documents tend to be long
- Lack of comprehensive research confirming LLM performance - the only study I found that evaluates GPT-4o shows these models underperforming dedicated open-source models (read more about it here)
- No option (or very high cost) to fine-tune the model on your own data
- Sending documents to an external API may not be acceptable in regulated sectors (finance, medicine)
- Need for extensive prompt engineering experiments
Who is this approach right for?
LLM-based extraction makes sense when:
- You're just starting to explore the problem space and want to test several concepts quickly
- You're dealing with atypical documents that don’t fit fixed templates (e.g., letters, reports, notes)
- You want to build a PoC without committing ML resources or infrastructure
- You're willing to trade off precision and control for faster implementation
This is a great tool for exploration and rapid iteration, but it is not necessarily the right technology for a production-grade system at scale with tight accuracy requirements.
Key differences
Operational costs
I created a comparison table with each solution's operational costs. Most of the prices were buried deep in the documentation. For LLMs, I had to make assumptions about token usage. For the open-source model, I estimated processing times based on similarly sized models. The chart below clearly shows the winner in terms of operational costs.
Chart presenting operational costs for different tools.
Assumptions
In this simulation, I assume we're sending a single PDF page to the GPT-4o-mini model along with a system instruction telling it what to do. These are additional tokens, but let's assume the input stays within 500 tokens and the response contains up to 1,500 tokens.
Calculations
The GPT-4o-mini model uses more tokens to represent images (users seem to confirm this in community discussions), so the cost of processing an image ends up roughly the same as with GPT-4o. I therefore use GPT-4o's input pricing to calculate the cost of processing the page and add the output cost at GPT-4o-mini's rate.
GPT-4o Calculations
$2.50 / 1,000,000 = $0.0000025 per input token
500 input tokens = 500 × $0.0000025 = $0.00125
GPT-4o-mini Calculations
$0.60 / 1,000,000 = $0.0000006 per output token
1,500 output tokens = 1,500 × $0.0000006 = $0.0009
$0.00125 + $0.0009 = $0.00215 per page
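The same arithmetic as a short script, so you can plug in your own volumes; the rates are the ones assumed above and may change over time.

```python
# Per-page LLM cost estimate using the pricing assumed above (USD per 1M tokens).
# Prices and token counts are the article's assumptions - check current pricing.
INPUT_PRICE_PER_M = 2.5    # GPT-4o input pricing, applied to the prompt/image tokens
OUTPUT_PRICE_PER_M = 0.6   # GPT-4o-mini output pricing

input_tokens, output_tokens = 500, 1_500

cost_per_page = (
    input_tokens * INPUT_PRICE_PER_M / 1_000_000
    + output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
)
print(f"~${cost_per_page:.5f} per page")  # ~$0.00215

for pages_per_month in (10_000, 100_000, 1_000_000):
    print(f"{pages_per_month:>9,} pages/month -> ${pages_per_month * cost_per_page:,.0f}")
```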
How to choose the right approach based on scale and needs? - Summary
There is no one-size-fits-all solution for document data extraction. Each approach (SaaS, LLM, or a custom model) has its place. The key question is not which technology is better but which one best fits your specific context: your scale, available resources, quality requirements, and long-term strategy.
Below is a summary that may help guide your decision:
What’s next?
If you’re considering implementing a document data extraction system, it’s worth starting with a needs analysis and aligning the technology with your organization's scale and processes.
We offer workshops and consultations where we:
- Assess the automation potential
- Compare available approaches (e.g., SaaS, LLM, custom model)
- Prepare a recommendation tailored to your specific case
- Prepare a project roadmap ready for implementation
If you're planning budget allocation for the next quarter, or want to avoid costly mistakes - let’s talk.
Reviewed by: Adam J. Kaczmarek