Top 6 Biggest Business Myths About ML Projects

I can imagine how things looked in the minds of some of the clients I’ve worked with:

"We have the data, we know the problem – just build a model. We can even use AutoML. Once we have the model, we’re done. Let’s put it in production, and it’ll just work on its own for the next few years."

Unfortunately, the reality is more complex.

Machine learning is not a plugin for Excel. It’s not a plug-and-play component you can buy, install, and use for years without effort. It’s a living system that relies on data, changes over time, and needs constant improvement and updates. It requires patience, experimentation, and a good dose of humility about what we (still) don’t know.

This article is my attempt to explain what a machine learning project looks like from an engineer’s point of view: what problems clients usually face, what breaks down, what surprises us, where the real risks are, and how to reduce them.

If you’re considering using AI in your product or planning to work with an ML team, this article might help you avoid common mistakes. Or at least help you understand why the model didn’t magically materialize after a week.

Myth 1: AI is good for everything

Interest in AI tools has skyrocketed with the rise of systems like ChatGPT and Midjourney. They work quickly, often deliver impressive results, and seem almost magical. It's no surprise that some business teams have started to see AI as a "universal solution."

But as a machine learning engineer, I often see clients with unrealistic expectations. They want a model that can predict everything, make no mistakes, work instantly, handle every scenario, and adapt independently. The problem is - such models don’t exist.

The role of a machine learning team isn’t just to build a model, but also to educate clients: where machine learning makes sense, and where it doesn’t. Where it can deliver real value, and where a classic data analysis or even a simple rule in the system would be a better fit.

Understanding these limitations is part of what makes an ML project successful. As technical partners, it’s our responsibility to highlight those boundaries and help find the best possible solution, even if it’s less flashy but much more realistic.

Myth 2: Just throw in the data and ML will do the rest

One of the most common beliefs about machine learning goes like this:

"We have data, so we feed it into a model, and it will learn and solve the problem."

Sounds logical, right? After all, data is the "fuel" for AI - so the sooner we pour it in, the sooner the engine runs. But sticking with the engine analogy, if we use low-quality or contaminated fuel, the engine breaks down and doesn’t go anywhere. That’s why the phrase "garbage in, garbage out" is widely used in ML. You can't expect good results if you train a model on poor-quality data.

Where does this myth come from?

In traditional IT projects, data is often clean and structured: forms, databases, all defined in terms of format and source. But in machine learning, data is not just about correct values in cells - it’s also about meaning, logic, and how that data was created. The model learns directly from what it’s given, so it’s crucial that the data truly represents the process or problem we want to model.

So what should you do instead of "just throwing in the data"?

  1. Start with a straightforward question: What exactly should the model predict? Is it measurable? Do we have historical examples of this?
  2. Take a small sample (50 to 100 records) and review them manually. Are the labels consistent? Do the columns make sense? Are the values contradictory? Do we even have this type of data? Do we have the variable we want to predict?
  3. Plan a mini data audit before the project starts. The machine learning team can help with a checklist: value distributions, missing data, correlations, and label quality (a small sketch follows this list).
  4. Involve business experts to explain the data. Only people familiar with the project can say things like: "This column only matters for Region A", or "these records are from before we changed the system." Without that context, it’s much harder to understand and use the data effectively.
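To make point 3 concrete, here is a minimal audit sketch in Python with pandas. It’s a starting point rather than a full audit, and the file name and column names (`target`, `status`) are hypothetical placeholders for your own data:

```python
import pandas as pd

# Hypothetical input file - substitute your own dataset.
df = pd.read_csv("customer_history.csv")

# Missing data: share of empty values per column.
print(df.isna().mean().sort_values(ascending=False))

# Value distributions: spot constants, outliers, and odd categories.
print(df.describe(include="all").T)

# Correlations with the (hypothetical) target variable.
print(df.corr(numeric_only=True)["target"].sort_values())

# Label quality: does the target contradict a related status field?
print(pd.crosstab(df["status"], df["target"]))
```

None of this replaces the conversation with business experts in point 4 - it only tells you where to point the questions.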

Myth 3: More data = better model

Once we acknowledge that data is important, it can be tempting to cut corners:

"We have a lot of data; millions of records. Sure, some might be messy, but the model will figure it out. The more data, the better. Right?"

Wrong. In machine learning, more low-quality data does not improve your model. In fact, it can make things worse.

Machine learning models look for patterns in data. If that data is noisy, inconsistent, or mislabeled, no matter how big your dataset is, it will just be an extensive collection of useless examples.

In two of my recent projects, clients proudly shared massive datasets intended for model training. But we quickly discovered they were inconsistent, full of incorrect labels, and ultimately unusable without significant cleaning.

Data: The biggest risk in any ML project

In our experience, the majority of time spent on machine learning projects isn’t about building models. It’s about understanding and fixing the data. A survey of ML experts revealed that this stage can take up to 80% of total effort. And it’s also where the most misunderstandings happen.

Some common data issues we encounter:

  • Incomplete records - e.g., 40% missing key information like customer decisions.
  • Inconsistencies - one column says "status = canceled," another says "status = closed."
  • Misleading labels - a "success" label might mean “offer sent,” not “product sold.”
  • No context - no one remembers why business rule X was added, so 1/3 of the data is off.

So what’s more important: quantity or quality?

It’s a common question:

"What is more important: quantity or quality of data?"

I advise putting quality first, though quantity plays a role too. Let me give you a real example.

In one project, a client provided a large dataset for training a predictive model. It looked great on the surface: lots of records, wide variety, decent coverage. However, a quick audit showed missing values and poorly defined labels.

Rather than get into a long discussion about why this is a problem, I decided to show it in practice. We took a small, carefully selected subset, less than 1% of the data, and manually reviewed and corrected it. Then we developed two models:

  1. One using the full, messy dataset
  2. One using the small but cleaned subset

We tested both on the same cleaned test set. The result? The model trained on clean data significantly outperformed the larger one.

This small experiment made a big impression on the client. It demonstrated a key truth: better data beats more data.
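If you want to try this kind of comparison yourself, a hedged sketch of the setup might look like the code below. It uses synthetic data and an invented noise level purely to illustrate the procedure - the real project used the client’s data, and exact numbers will vary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the client's dataset.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Messy" variant: flip 30% of training labels to simulate labeling errors.
rng = np.random.default_rng(0)
flip = rng.random(len(y_train)) < 0.30
y_messy = np.where(flip, 1 - y_train, y_train)

# "Clean" variant: a small subset (~1% of the data) with correct labels.
X_clean, y_clean = X_train[:200], y_train[:200]

messy_model = GradientBoostingClassifier().fit(X_train, y_messy)
clean_model = GradientBoostingClassifier().fit(X_clean, y_clean)

# Both models are judged on the same clean test set.
print("large, messy :", f1_score(y_test, messy_model.predict(X_test)))
print("small, clean :", f1_score(y_test, clean_model.predict(X_test)))
```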

It aligns with what AI expert Andrew Ng says: in many use cases, just a few dozen high-quality examples can be enough to develop an effective model. That’s the essence of data-centric AI - investing in better data, not just bigger models, often leads to the best results.

The relationship between data quality and the decline in machine learning algorithm performance has been thoroughly analyzed in the scientific publication "The Effects of Data Quality on Machine Learning Performance on Tabular Data." Another study focused on training small language models shows that data quality, not quantity, plays the key role in achieving strong results.

The real-world impact of poor data quality

When data quality is ignored:

  • The model learns misleading correlations - like "customers with IDs ending in 7 are more likely to convert."
  • You waste time and money - training takes longer, debugging is harder.
  • Your time-to-market increases - fixing issues later means repeating work.

Why is this myth so common?

  • In traditional analytics, more data often does help. For example, quadrupling the training data in linear regression can roughly halve the estimation error, since error typically shrinks on the order of 1/√n - but only if the data meets the model’s assumptions.
  • People overlook that marketing examples often use hand-picked, cleaned datasets (e.g., for training large models like ChatGPT or Falcon).
  • It’s typically easier to gather more data than to review, clean, and truly understand it.

What does a modern, data-centric approach look like?

Today’s best machine learning teams embrace data-centric AI. That means:

  • A simple model can work well if the data is high-quality.
  • Instead of gathering more data, we improve existing data - by removing noise, fixing labels, and aligning definitions.
  • Data cleansing beats data accumulation - fewer, cleaner examples often outperform a larger but flawed dataset.

Myth 4: ML works like classic IT

Many clients approach machine learning projects with the mindset of traditional software development. That’s understandable - after all, we’ve been building software for years using repeatable frameworks:

"We define the requirements, break tasks into sprints, plan releases, do a demo, and it works."

And it works great when building an API, a dashboard, or a customer support system. However, when we begin an ML project, this approach quickly falls apart. Why?

Because machine learning isn’t about coding logic, it’s about discovering patterns in data. And that’s a different game altogether.

ML is research, not just implementation

In machine learning projects:

  • We don’t know in advance which model will work - we have to experiment and understand the nature of the data.
  • We can’t precisely estimate the workload - outcomes depend on data quality, not just clean code.
  • We don’t implement logic - we teach the model to learn it. And the model can learn it wrong, even if the code is perfect.

I once led a project where the client expected "a prediction demo in two weeks." After two weeks, we had five hypotheses explaining data inconsistencies, three preprocessing ideas, and no working model. This was not because the team was ineffective or lacked skills, but because in machine learning, before anything works, you need to understand what you’re dealing with.

Why traditional IT approaches fall short

No guarantee of success

A machine learning project won’t always succeed, especially in its first iteration. If the problem is complex, the initial assumptions might be wrong. The collected data may be insufficient, inconsistent, or simply too limited for machine learning.

Sometimes, it’s enough to reframe the problem, revisit assumptions, prepare the data better, and try again. Other times, the problem is too complex for ML to solve efficiently.

Most time is spent on data quality and processing

Clients typically want to see a working machine learning model ASAP - and understandably, time is money. However, skipping the data quality and understanding phase lowers the chances of success.

Code refactoring is a regular part of the software development project lifecycle. So why is "data refactoring" - analyzing and refining data quality - still perceived as optional? It’s due to a lack of understanding of how complex machine learning development is, and how crucial data is for project success.

Clients show up saying they have data, expecting we’ll just "train the model and be done." But that’s not how it works.

Skipping detailed data analysis and jumping into model training leads to poor performance, followed by debugging, which eventually circles back to, you guessed it, the data. It's better to invest early in validating and understanding the data. If critical issues arise, we can pause the project, fix the data, and resume with confidence.

Other project risks

In classic software projects, risks can often be identified and mitigated early, during the requirements and prototyping phases - everything revolves around functionality and use cases. In ML projects, however, the primary risk is the data. Until an engineer sees and analyzes it, they’re working with assumptions.

Unpredictability

It’s hard to estimate how long it’ll take to “fix” data or tune hyperparameters. Even after a promising data analysis, the model might fail due to a lack of data coverage or low predictive power. That often requires revisiting data collection and labeling, and starting fresh.

Data can be a real Pandora’s box. It might look fine at first glance, but a detailed analysis can reveal gaps and errors. That’s a serious project risk - one that’s difficult to anticipate without a proper deep dive.

Significant data issues can drastically reduce the chances of project success, or even make model development impossible. The project may need to be paused or stopped entirely until the data is repaired, and it’s often impossible to predict how long that might take.

R&D-style Iteration

Instead of the classic "to do → in progress → done," ML projects often follow this pattern: "Tried → didn’t work → trying again with a different approach."

As mentioned earlier, data that seemed solid may turn out flawed, or the model architecture might prove unsuitable. This leads to a cycle of experimentation and correction.

Even tuning hyperparameters is an iterative process closely tied to the training dataset.

No definitive tests

In ML, no unit test says: "The model works." There are statistical metrics, and they require interpretation.
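The closest practical substitute is a statistical acceptance test: instead of asserting an exact output, you assert that a metric on a held-out set stays above an agreed bar. A minimal sketch - the metric and the 0.85 threshold are project-specific assumptions, not universal rules:

```python
from sklearn.metrics import roc_auc_score

def test_model_meets_quality_bar(model, X_holdout, y_holdout):
    """Not a proof of correctness - just a guard against obvious regressions."""
    auc = roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
    # 0.85 is a hypothetical, per-project quality bar.
    assert auc >= 0.85, f"ROC AUC fell to {auc:.3f}; investigate before release"
```

Even a passing check only says the model behaved well on that particular sample - the interpretation still falls to a human.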

Different cost curve

In software projects, cost peaks during development and drops after deployment, when only maintenance remains.

[Graph: cost curve of a classic software project - costs peak during development and fall to a low, steady maintenance level after deployment]

In ML projects, costs don’t follow that curve. They often remain high or even increase over time, because:

  1. Models degrade and require retraining,
  2. Data changes and needs new analysis,
  3. New models must be developed to reflect shifting realities.

[Graph: cost curve of an ML project - costs stay high or grow after deployment due to retraining, data changes, and new model development]

So, what can you do?

Treat ML as an exploratory project

Think in terms of: hypothesis → experiment → analysis → decision — just like in R&D.
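Tooling can reinforce that mindset. As one hedged example, an experiment tracker such as MLflow lets you record each hypothesis and the decision it led to (the run name, parameter, and metric values below are invented for illustration):

```python
import mlflow

# One run per hypothesis keeps "tried → didn't work → trying again" auditable.
with mlflow.start_run(run_name="hypothesis-median-imputation"):
    mlflow.log_param("preprocessing", "median imputation")  # the experiment
    mlflow.log_metric("val_f1", 0.71)                       # the analysis
    mlflow.set_tag("decision", "rejected - worse than dropping rows")
```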

Manage uncertainty, not just tasks

Ask: What are we unsure about? Where do we need to validate the data? What could go wrong?

Set a realistic timeline

Include phases like: data exploration, experimentation, validation, tuning, integration.

Add a buffer. Then add another.

Manage expectations

Communicate progress clearly and often:

"The model isn’t performing yet because we’ve identified issue X in the data. We’re working on it." Instead of vague promises like: "The MVP is almost done."

Myth 5: ML model = ML product

Many clients, even very technical ones, assume that if a machine learning model works well, the problem is solved. They ask:

"We have a model, it performs well. When can we deploy it to production?"

At first glance, this sounds logical: if the model performs well on the test set, we should be able to “plug it into the system” and be done. Unfortunately, that’s not how it works.

An ML model is the heart of the solution, but it’s not a product on its own. It lacks the lungs, nervous system, and resilience to changes in its dynamic environment.

What makes a real ML product?

Let’s think of the ML model as an engine. Even if it runs perfectly on the test bench, it won’t get you anywhere without additional components. In the case of a car, those would be the gearbox, wheels, brakes, etc. In the case of an ML product, they are:

  • A production environment (where the model is deployed and accessible),
  • Input and output interfaces (how data flows into and out of the model),
  • Surrounding business logic (e.g., when the model should or shouldn’t be triggered),
  • Real-time monitoring (is it working well today, not just last week?),
  • Re-training mechanisms (because data changes, and models become outdated),
  • Version control for data and models (what was trained, and on which data?),
  • Deployment infrastructure (CI/CD, APIs, availability, security).

Without all of the above tools, the model remains a proof of concept, not a production-ready solution.
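To make the interface and business-logic pieces tangible, here is a deliberately tiny serving sketch. FastAPI is just one common choice, and the model file, field names, and rules are hypothetical:

```python
import joblib
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")  # hypothetical artifact; version it with its data

class ScoringRequest(BaseModel):
    # Input interface: validate the payload before it reaches the model.
    monthly_spend: float
    tenure_months: int

@app.post("/score")
def score(req: ScoringRequest):
    # Surrounding business logic: don't trigger the model outside its domain.
    if req.tenure_months < 1:
        raise HTTPException(status_code=422, detail="customer too new to score")
    proba = model.predict_proba([[req.monthly_spend, req.tenure_months]])[0][1]
    # A real deployment would also log this prediction for drift monitoring.
    return {"score": float(proba)}
```

Even this toy endpoint already needs validation, business rules, and a monitoring hook - and it still says nothing about retraining, versioning, or security.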

What happens when this myth persists?

  • The model is deployed but stops working correctly - because the data changed, and no one was monitoring it.
  • Users don’t understand the predictions - because no one addressed interpretability.
  • The team doesn’t know how to retrain it - due to missing version control for data and training code.
  • The model fails with real-world inputs - because production APIs differ from test data, or because the data has drifted and the model is outdated.

How to do it right?

  • Plan the ML product as a whole - from day one

Don’t just ask: How do we train the model?
Also ask: How will we deliver it, integrate it, maintain it, and improve it?

  • Decide who owns the production phase

ML team? DevOps team? The client’s IT department?
This is more important than it seems.

  • Ensure monitoring and interpretation are in place

What does a prediction of 0.87 mean? Is that confidence? Has the model degraded?

  • Accept that a model is just one component

Even the best algorithm is useless if it can’t be used in a real-world process.

Myth 6: Adding new functionality will be a piece of cake

Sometimes, during an ML project, a client might say:
"Hmm, maybe instead of predicting X, let’s try classifying Y." Or: "Let’s switch the model, maybe try a neural network instead of XGBoost?"

This kind of change isn’t a problem in traditional software projects. You adjust the logic, tweak an endpoint, add a condition, and you’re done.

But in machine learning projects, even a small change in how the problem is defined can mean rebuilding half of the solution from scratch.

Why a change in ML = going back to the beginning

Machine learning relies on data and a clearly defined target variable. If you change:

  • what you want to predict,
  • how you define a “positive” class,
  • which features influence the decision,

…you’re not just changing the method - you’re changing the problem itself. That means:

  • new data (or different processing of existing data),
  • new feature engineering,
  • new labels,
  • a new model architecture,
  • a new evaluation method,
  • new risks and success metrics.

It’s like a client deciding to switch to an open-concept kitchen after the cabinetry is installed. Is it possible? Sure. Easy? Not at all.

A project example

In one project, the goal was to predict how aesthetically pleasing an image was. Midway through, the client added a new requirement: explanations for predictions had to be included so users could improve the image based on model feedback.

At first glance, this seemed like a small change - maybe just an extra module for explainability. But in reality, it meant:

  • selecting and designing model features that users could understand,
  • rebuilding the training dataset,
  • updating preprocessing and validation logic,
  • retraining and testing the models,
  • and building a new suggestion-generation mechanism.

The result? Three extra months of work and a return to the data analysis phase.

Why does this myth persist?

  • In traditional projects, change often just means adding a condition to the code.
  • Managing ML projects isn’t intuitive. Adding model explainability sounds like a small feature, but it often means redefining the problem.
  • From the outside, everything looks the same: input, output, graphs.

But under the hood, it’s an entirely different story.

How to counter it?

  1. Clearly define the problem and the success metric at the start. What exactly are we predicting? How do we measure a "good" result? What matters most?
  2. Freeze the problem definition during the experimental phase. New ideas are welcome, but they should wait for the next iteration. Keep a backlog and revisit once the current phase is complete.
  3. Treat significant changes as new projects. Even if it’s just a "new metric," assess how it impacts the entire pipeline.
  4. Include the ML team in decision-making. Not every pivot is worth it. Sometimes a shift in the goal erases the value of everything already built. The ML team can help quickly assess ideas and estimate the effort needed.

ML project management - what do you need to prepare for?

Machine learning sounds appealing: automation, predictions, optimization, "intelligent decisions." And indeed, a well-designed model can bring significant business value. But before that happens, it's worth realizing a few things.

An ML project isn’t a sprint; it’s a journey through uncharted territory. Some things will go according to plan. Others will require adjustments, experimentation, and a change in approach. Below is a list of things worth being ready for - not to discourage you, but to help you avoid surprises.

Not everything will work right away

  • The first version of the model is rarely the final one.
  • Experiments may lead to worse results before they get better.
  • Sometimes it takes several iterations just to determine whether the problem is even solvable.

ML is a scientific process, not just assembling ready-made parts.

Data preparation takes time

  • More time than you think will go into understanding, cleaning, and organizing data.
  • You’ll need input from people who understand the data’s business context - often from operational teams.

It isn’t just the ML team’s job - it’s a critical part of the project.

There’s a chance ML won’t work

  • Sometimes the data is too sparse, too inconsistent, or simply not sufficient.
  • Sometimes the problem lacks a clear pattern, and a simple rule-based solution might work better than a model.

Machine learning is not the answer to every problem. And it’s better to recognize that early.

Need to adapt

  • Changes in goal definition, data, or project scope may require going back a few steps.
  • Even small changes might mean retraining the model, adjusting the infrastructure, or changing how you evaluate success.

ML projects are sensitive to change. That’s not a flaw, that’s just how they are.

Ongoing supervision is essential

  • A model is not a “set-it-and-forget-it” solution.
  • It needs to be monitored, evaluated, and updated - because your data and business processes will change over time.

ML evolves with your company. It has to be treated as part of your system, not a standalone "gadget."
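In practice, that supervision often starts with a statistical drift check comparing live feature values against the training distribution. A minimal sketch - the two-sample Kolmogorov-Smirnov test is one reasonable choice among several, and the threshold is a judgment call:

```python
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, alpha=0.05):
    """Flag a feature whose live distribution no longer matches training."""
    _, p_value = ks_2samp(train_values, live_values)
    # A small p-value suggests the distributions differ; alpha is an assumption.
    return p_value < alpha
```

A flagged feature doesn’t automatically mean retraining - it means a person should look at what changed.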

Summary

Machine learning has enormous potential. It can support decision-making, automate processes, and uncover patterns that are invisible to humans. But to truly benefit from this potential, it’s essential to understand that ML projects differ from traditional IT projects.

Here’s what I’d like you to take away:

  • A model is not a product. It only works when embedded in a well-designed process.
  • Data is the foundation. Most challenges in ML aren’t technical - they come from the quality, availability, and understanding of data.
  • Not everything can be predicted. ML projects are iterative and experimental and often require adjusting assumptions along the way.
  • Working with an ML team is a partnership. You know your business, we know the tools - together we can build something truly valuable.
  • The best results come from clarity and preparation. The most successful projects start with a clear purpose and a willingness to engage in the process.

If you’re planning a machine learning project, you don’t need to have all the answers upfront. But it helps to come with an open mind, a clear goal, and readiness to collaborate.

Because ML isn’t magic, it’s cooperation, logic, data, and a bit of patience. And a well-executed ML project really can transform how your business operates.

Reviewed by: Michał Zaręba
