
The healthcare industry is becoming more and more tech-driven and has come a long way towards digital transformation. This wouldn’t be possible without big data that’s at the heart of the changes happening in this sector. With the huge amount of data that’s now available to doctors, researchers, and other health professionals, they can discover previously unknown details concerning diseases or treatment methods, diagnose patients more quickly and accurately, and provide better care.

In this article, we’re having a broad look at big data in healthcare: what it means for the industry, what opportunities and challenges it brings, and what things architects and developers need to pay attention to when working on big data systems for the medical sector.

What is big data in healthcare?

Big data in healthcare refers to the vast amounts of information gathered from various sources, diverse in format, type, size, and context. This information describes patient and disease characteristics, medications, and so on. Even though the use of big data in healthcare poses certain challenges (more on that in “The challenges to look out for in big data projects for healthcare” section below), it is still worth the effort, since the findings that come from the analysis of data bring significant benefits to healthcare providers and, ultimately, their patients.

Types of healthcare data

As mentioned above, medical data comes from many sources: scientists upload detailed data from their research; patients upload data from their smart devices, including wearables and smartphones; some statistical data is available from governmental agencies; and so on. Such data comes in many forms:

  • demographic data like age, sex, weight, etc.
  • the patient’s medical history, including therapy type, duration, and medicines
  • X-ray images
  • CT scans
  • PET scans
  • photos
  • other medical imaging
  • supplemental data for scans, e.g. nuclear agent type and concentration

These types of data, and most health-related data for that matter, are heavy: a single scan can consist of hundreds of images of dozens of megabytes each. The data is then fed into a database, an indexing engine, or an AI training process.

Medical data tends to be high in volume and variability, as it comes from many different sources and in many formats. This means that before we can move on to further processing, it has to be unified. However, for regulatory reasons, the original data format often has to be retained too.

How can healthcare use big data?

In the healthcare industry, it’s the top priority to keep patients healthy and offer them the best possible treatment plan and care when it’s needed - no wonder that the solutions that support diagnostics and treatment are among the most researched and developed ones. However, big data has much more to offer and can support healthcare providers in their everyday work in a variety of ways.

Diagnostics support

Data is also the focal point of any AI/ML project, since machine learning and artificial intelligence need a lot of data to train models accurately. Applying data analytics, machine learning, and artificial intelligence to big data enables the identification of patterns and correlations in many domains. The healthcare sector is no exception: these technologies provide actionable insights for improving health services like diagnosing and therapy processes.

Data collected from hospitals, laboratories, and other healthcare organizations, as well as research facilities, includes lots of PET and CT scans, X-ray images, etc. that doctors later examine to diagnose patients and tell if e.g. they have cancer. However, different doctors arrive at different diagnoses, each of them relying on their unique specialist knowledge.

AI can combine the knowledge of many humans to diagnose faster and better, based on the data, e.g. images, that’s provided to it. There are diseases that can be diagnosed only by trained specialists, and there are very few such specialists in the world. This limited know-how prevents many people from being diagnosed correctly and receiving proper, often life-saving, treatment.

Some diseases could be easily treated when diagnosed in early childhood, but that would mean testing every child. For many diseases, this is simply impossible: medical experts are out of reach for many patients, and the scarcity of specialists is a bottleneck. AI is the obvious solution - if we can train AI to diagnose these diseases, we can treat, or even save, many people.

Big data in healthcare and biomedical applications of ML are a great support in the diagnosing process. One example is the diagnosis of cerebral palsy, which can be recognized at an early age when a child is observed for about 15 minutes by a trained specialist. The specialist observes the child's movements and can tell whether the child has the illness. Diagnosing it at an early age is important because then it's easier to treat. The problem is that it's impossible to test every child due to the lack of trained specialists available in the healthcare system - a bottleneck that makes the lives of many people worse.

Researchers suggest it is possible to train artificial intelligence to diagnose the illness by “watching” recordings of the child’s movements. All that needs to be done is to record the child’s movements for about 15 minutes and upload the recording to the “diagnosing service”. Recording a child does not require a trained specialist - every parent can do it. This could improve many lives, making the diagnosis significantly faster and more accurate, and helping children get the treatment they need as quickly as possible.

Medical research

Big data records, including healthcare data, are also used in extensive medical research, such as studies on diseases or treatments that weren’t previously well described. When data is shared, companies can conduct further research on the available data to discover new patterns, treatment plans, or better ways to adjust existing medication methods. Researchers appreciate the possibility to integrate and exchange the results of their research along with the accompanying data sets, as improved access to extensive data allows for faster progress in clinical research and promises better results.

Personalized patient care

With the variety of data on patients that’s available, it’s now becoming easier than ever to adjust treatment methods and medication to the individual needs of each patient. No doctor is able to analyze all the possible data on patient needs, diseases, and treatment opportunities, but with the use of big data analytics and machine learning solutions, they can derive key insights that inspire a more personalized approach to the patient. This is an opportunity not only to help patients get better sooner but also to build more trust in the treatment and improve the patient experience.

The benefits of big data in healthcare

Enhanced diagnostics

The use of big data in healthcare can improve diagnostics by training AI algorithms that can recognize lesions, tumors, or other ailments on images, or diagnose other diseases based on different health data. AI can be trained using big data to combine the knowledge of many people and eliminate the bottleneck of too few highly qualified healthcare professionals, making it possible to reach the right diagnosis even when a top expert is not available, and boosting patient outcomes.


Disease prevention

Prevention is better than cure - it is not only cheaper than treatment but also makes people’s lives better, as they do not need to suffer. Processing statistical data about many patients suffering from a disease can identify risk factors that increase the probability of developing it. It is also possible to calculate the probability of a specific patient getting ill, identify harmful behaviors, diet, habits, etc. and eliminate them, as well as predict acute medical events to help prevent them from happening or minimize their negative impact.

Better-informed decisions

Another benefit of processing data about many patients suffering from a given disease is improved knowledge about the effects of specific therapies and medicines on specific patients, e.g. of a given age and sex, living in specific conditions, etc. This information can then be used to better adjust the therapy to individual patients, removing guesswork from the process. An individual therapy path is often a crucial part of the healing process, but it’s difficult for doctors to appropriately select the method to treat a given patient when they don’t have all the necessary information at hand. With big data analytics and machine learning applications that make effective use of big data, doctors can get insights into patient information and the details of various therapies, see how they perform, and choose the best treatment option for a given case.


Predictive analytics

AI modeling can be used to predict outbreaks of epidemics and forecast the spread of diseases among the population, or to predict the risk of acute medical events. On a smaller scale, it can predict treatment effects for individual patients. For example, for a patient living in a polluted environment with cases of asthma in the family, it might be possible to predict the probability of developing asthma. This can inform decisions about moving to another place and make the patient's future life better.

The challenges to look out for in big data projects for healthcare

Data preservation

Original data has to be stored in its original format. For regulatory reasons, companies sometimes have to be able to deliver exactly the same data, in the same format, as was received. No matter how the data is transformed in later stages of processing, it is important to retain it in the original format.
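One way to guarantee the original bytes are never modified in place is a write-once, content-addressed store; a minimal sketch in Python (the class and method names are illustrative, not from any specific library):

```python
import hashlib


class RawStore:
    """Write-once store keyed by content hash: the original
    bytes can always be returned verbatim, and a second upload
    of identical data never overwrites anything."""

    def __init__(self):
        self._blobs = {}

    def put(self, raw: bytes) -> str:
        key = hashlib.sha256(raw).hexdigest()
        # setdefault never overwrites an existing blob
        self._blobs.setdefault(key, raw)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]
```

Because the key is derived from the content itself, any later transformation can reference the exact original it was computed from.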

Data unification

Data coming in many formats from different sources, sometimes even using different units, has to be unified before further processing. Many interpreters are necessary to read e.g. DICOM files, JSON files, CSV files, or binary images, and translate all these formats into one format readable by the next processing step. The task is simpler when the format of a specific piece of data is known (e.g. the uploading user chooses it), but more difficult when the format has to be recognized. One approach is to try interpreters one after another until one succeeds without errors and produces valid output.
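The try-interpreters-in-order approach can be sketched in a few lines of Python (the interpreter list here is a toy with just JSON and CSV; a real system would plug in DICOM readers, image decoders, etc.):

```python
import csv
import io
import json


def parse_json(raw: bytes):
    return json.loads(raw.decode("utf-8"))


def parse_csv(raw: bytes):
    rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
    if not rows or len(rows[0]) < 2:
        raise ValueError("not CSV")
    # first row is treated as the header
    return [dict(zip(rows[0], r)) for r in rows[1:]]


# Interpreters are tried in order; the first that succeeds wins.
INTERPRETERS = [("json", parse_json), ("csv", parse_csv)]


def detect_and_parse(raw: bytes):
    for name, parse in INTERPRETERS:
        try:
            return name, parse(raw)
        except Exception:
            continue  # this interpreter failed, try the next one
    raise ValueError("no interpreter could read this data")
```

The order of interpreters matters: stricter formats should be tried first, so a permissive parser (like CSV) does not accidentally accept data that a stricter one would interpret correctly.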

Data amendments and versioning

When data changes, it is important to create a new version but keep the old version too. This way, it is guaranteed that a training process run on a specific version of the data can be run again on exactly the same data. Keeping the training data intact makes it possible to e.g. tweak the process, rerun it on the same data, and compare the results; if an amended dataset were used for the second run, comparing the results would not make sense. For that reason, data has to be immutable, and versioning is necessary. When someone uploads a scan and then realizes they made a mistake, e.g. in units, and fixes it - the data cannot just be changed in place. A new version of the same data point has to be created.
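An append-only versioned store captures this rule; a hypothetical sketch in Python (names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataPointVersion:
    version: int
    payload: dict


class VersionedStore:
    """Append-only store: a fix creates a new version,
    old versions are kept intact and stay reproducible."""

    def __init__(self):
        self._versions = {}  # point_id -> list of DataPointVersion

    def put(self, point_id: str, payload: dict) -> int:
        versions = self._versions.setdefault(point_id, [])
        versions.append(DataPointVersion(len(versions) + 1, payload))
        return versions[-1].version

    def get(self, point_id: str, version: Optional[int] = None) -> dict:
        versions = self._versions[point_id]
        # default: latest version; a training run pins an explicit one instead
        chosen = versions[-1] if version is None else versions[version - 1]
        return chosen.payload
```

A training pipeline then records the version number it read, so a rerun can request exactly the same payload even after the unit mistake was fixed.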

Data security and privacy

Privacy in the context of medical data and healthcare is crucial. That's obvious, isn't it? Access to data must be limited to its owners and the people the owners shared it with. However, there is more to it. Not only does raw data have to be protected, but also derived data - e.g. images generated from raw images. So it is necessary to preserve the information about which inputs generated which outputs. Usually, it is not 1 to 1 - many inputs create one output, or vice versa. All this information has to be stored and used to protect data from unauthorized access: with every output, there must be information about which inputs were used to create it, who owns those inputs, and with whom they were shared.
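As a sketch of the idea (not any particular product's API), lineage and access control can be modeled as two small maps: who may read each input, and which inputs each output was derived from. An output is then visible only to users who may see all of its inputs:

```python
class ProvenanceStore:
    """Tracks which inputs produced each output
    and who is allowed to see those inputs."""

    def __init__(self):
        self._acl = {}      # input_id -> set of users allowed to read it
        self._lineage = {}  # output_id -> set of input_ids it came from

    def register_input(self, input_id, owner, shared_with=()):
        self._acl[input_id] = {owner, *shared_with}

    def register_output(self, output_id, input_ids):
        self._lineage[output_id] = set(input_ids)

    def can_access(self, user, output_id):
        # a derived artifact is visible only if the user
        # may see ALL the inputs it was generated from
        return all(user in self._acl[i] for i in self._lineage[output_id])
```

Note the conservative rule: sharing one of the inputs is not enough, because the derived image may leak information from the others.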

Compliance with regulations

Many countries have regulations protecting patients' privacy, like the GDPR in the EU or HIPAA in the USA, and clinical data systems have to comply with them. Systems that operate internationally have to comply with different laws at the same time. Data has to be protected accordingly: encrypted and inaccessible to unauthorized persons. Deletion of data must follow the same lineage as data protection - if a user requests that their data be deleted, not only the original data has to be removed but also all the data derived from it.

What every architect should know when working with a big data project in the healthcare industry

There are some things you should keep in mind when working on a big data project - things that shouldn't be overlooked. Below, you can find some key points to pay attention to.


Scalable data processing

Big data processing requires many resources, and big data in healthcare is no exception. Data is processed in many ways. For example, for a set of images, it might be useful to create thumbnails, or to extract the position of a specific image from metadata in order to overlay it on another image (e.g. overlay a PET image on an MR image to see in which parts of the body a lesion has developed).

Another type of data processing is indexing, which makes it possible to quickly access data with a query. All this processing might be done by different services at the same time, which has architectural implications: data has to be published in a way that lets each service receive and process it. An added advantage is that this makes the system scalable - in case of higher traffic, we can always add more service instances.

However, for multiple instances to process data at the same time, either data partitioning is needed, or some other mechanism that makes it possible to mark which data points have been processed and which have not. Traditional queues, where a service takes messages off the queue and processes them, are not a good solution unless at-most-once delivery is acceptable: if a message is taken off the queue and then processing fails, the data is lost. Usually, that's not acceptable.

On the other hand, if a message is removed from the queue only after it is processed, we lose scalability - only one message can be processed at a time. This is where partitioning comes into play. Partitioning means many queues are created and services read messages off them simultaneously: when there are e.g. 10 services and 10 partitions, each service reads one partition; when there are 2 services and 10 partitions, each service reads 5 partitions, etc. In Kafka, this is transparent to the developer and handled automatically by the driver.

All a programmer needs to do is provide a partitioning key for each message. If messages are independent of each other, even that is not necessary - messages are then distributed more or less randomly. But when messages depend on each other, usually because they comprise a series that should be processed together, it is necessary to provide a partitioning key that sends all the messages from one series to one partition, making sure all of them are processed by the same service instance.
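The mechanics can be illustrated with a small Python sketch - this is a simplified model of the idea, not Kafka's actual hashing or rebalancing algorithm: the key is hashed to pick a partition (so one series always lands in one partition), and partitions are dealt out round-robin across consumers:

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partitioning key to a partition deterministically:
    the same key always lands in the same partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def partitions_for_consumer(consumer: int, consumers: int, partitions: int):
    """Round-robin partition assignment: with 2 consumers and
    10 partitions, each consumer ends up reading 5 of them."""
    return [p for p in range(partitions) if p % consumers == consumer]
```

Within one partition, messages keep their order, which is exactly why routing a whole series to a single partition guarantees it is processed together by one instance.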

Preserve raw unified data

It is important not only to retain original data in its original format, as mentioned earlier, but also very useful to keep the data in a unified form. Of course, service outputs and processed data also have to be kept, but it is convenient to be able to recreate them from raw unified data if they are lost for some reason.

Moreover, when raw data is kept and a new data-processing service is added later, it is possible to process the old data too, not only the data added after the new service appeared. If event sourcing pops into your head, you are right. It is not the only solution, but it is among the most obvious ones. In event sourcing, every piece of information is kept - every update, patch, and modification, along with the original data - so no data is lost.

However, this might also be a source of problems in the case of medical records. Why? Think GDPR - citizens' right to “be forgotten”. Every company has to remove all personal data of a person when they request it, and there is no doubt medical data falls under this rule. There are ways to handle it in event sourcing. Sometimes it is possible to simply delete the events; other times that causes problems, like gaps in IDs, so the events have to remain to preserve the IDs - but then it is possible to strip an event of all other data and keep only the ID.
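The strip-but-keep-the-ID approach can be sketched as a pure function over an event log (the event shape here is a hypothetical example, not a real schema):

```python
def forget_subject(events, subject_id):
    """Redact a subject's events: personal payload is removed,
    but event IDs stay in place so sequences remain gapless."""
    redacted = []
    for event in events:
        if event["subject_id"] == subject_id:
            # keep only the ID; drop everything that identifies the person
            redacted.append(
                {"event_id": event["event_id"],
                 "subject_id": None,
                 "payload": None}
            )
        else:
            redacted.append(event)
    return redacted
```

Replaying the redacted log still works for consumers that depend on contiguous IDs, while no personal data of the forgotten subject survives.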


Data streaming

Modern applications use data streaming instead of batch processing, and there are good reasons for that; clinical data processing is no exception. With streaming, processing can begin even before all the data is received, which makes it faster: as soon as the first message is received by a service, processing can start, before the next messages arrive. If it is possible to output results before all messages of the series are processed, we don’t even need to store intermediate state in memory.

However, that is not always the case. E.g. to find the max value of some parameter in a series, it is necessary to process all the messages of the series. To output the max value, we also need to know that the series is complete and there are no more messages. Thus we need an end-of-series marker, or we need to know the number of messages in the series beforehand.
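The end-of-series marker pattern looks like this in a minimal Python sketch - the stream is consumed message by message, only the running max is kept in memory, and the result is emitted when the marker arrives:

```python
END = object()  # sentinel marking the end of a series


def stream_max(messages):
    """Consume a stream one message at a time, keeping only the
    running max, and emit it once the end-of-series marker arrives."""
    current = None
    for msg in messages:
        if msg is END:
            return current
        current = msg if current is None else max(current, msg)
    # the producer never sent the marker - we cannot know the series is done
    raise ValueError("stream ended without an end-of-series marker")
```

The same shape works for any aggregation that needs the whole series (sum, count, percentile sketches): constant memory per series, with the marker telling the consumer when it is safe to publish the result.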


Data indexing and querying

When building a system that stores lots of different data in many formats, it is vital to make it relatively easy to filter the necessary data out of it. After all, it is not only about storing data but also about providing it and making it accessible. When the system is built for researchers who need only specific data, they need a way to get it without being overwhelmed by all the data the system stores. For that, indexing and querying are necessary. When data is unified, it can be fed into an indexing engine so it is easy to query later. Elasticsearch fits this purpose well: it is a fast and scalable indexing engine that supports sharding, so every node keeps only part of the data and queries run faster.
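To illustrate the core idea behind such engines (this is a toy model, not Elasticsearch's API or implementation), here is a minimal inverted index in Python: each token maps to the set of documents containing it, and a query intersects those sets:

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: token -> set of document ids.
    The core data structure behind full-text search engines."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, doc_id, text):
        # naive tokenization: lowercase whitespace split
        for token in text.lower().split():
            self._index[token].add(doc_id)

    def search(self, *tokens):
        """Return ids of documents containing ALL the given tokens."""
        sets = [self._index[t.lower()] for t in tokens]
        return set.intersection(*sets) if sets else set()
```

Sharding, as mentioned above, splits this token-to-documents map across nodes, so each node searches only its part of the data and the results are merged afterwards.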

Keep the system infrastructure-provider-agnostic

Clinical systems often need to be cloud-agnostic, or “infrastructure-provider-agnostic”, i.e. able to run in any cloud (e.g. AWS or GCP) as well as on premises. Some healthcare organizations have regulations forbidding sending data to servers in the cloud. This doesn’t matter for systems that are deployed once and accessed by users via the internet, but often that is not the case: patient data is sensitive, and many companies need to run such systems themselves. For them, it’s important to be able to choose a cloud provider or run the system on premises.


Summary

Big data is becoming increasingly important to healthcare providers and medical researchers, and various applications of ML allow for effective use of all this information to discover patterns and derive valuable insights. Proper use of big data in healthcare allows for better patient care, adjusting treatments to individual needs, or even recognizing uncommon ailments that used to be diagnosed only by a handful of domain experts.

There is huge potential for the growth of big data, analytics, and AI/ML in the medical sector, and the outcomes are quite promising: they bring real impact to our lives, making us healthier and giving us better quality of life. While developing successful big data systems for healthcare may not be a piece of cake, given the opportunities that are brought by such solutions, it’s worth taking the time and effort to develop reliable, safe, and scalable systems that will help medical professionals and their patients.

For more content around big data and machine learning, subscribe to Data Times - an ML & big data newsflash curated by SoftwareMill's engineers.
