MLOps 101 - Introduction to MLOps

Rafał Pytel

20 Sep 2023 · 8 minutes read


This blog post will cover technical information and tools for performing Machine Learning Operations (MLOps) in practice. Firstly, I will outline MLOps principles and how to apply them, then I will go over levels of MLOps in projects and finish with an example of how Revolut is doing it.

MLOps principles

Now, I am going to explain the pillars of MLOps, which guide you to a robust and mature ML system.

1. Focus on versioning and reproducibility

In the early stages of a machine learning project, versioning and reproducibility rarely get much attention. However, once the project reaches a more advanced stage, it is essential to get reproducible and deterministic results from the same setup and hyperparameters.
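
As a small illustration of what "the same setup gives the same result" means in code, here is a minimal sketch. It assumes a toy experiment that only uses Python's standard `random` module; the function `run_experiment` and its parameters are made up for this example, and a real project would also have to seed its ML framework.

```python
import random


def run_experiment(seed: int, n_samples: int = 5) -> list[float]:
    """Toy 'experiment': draw pseudo-random numbers from a seeded generator."""
    # A dedicated generator avoids depending on hidden global random state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_samples)]


if __name__ == "__main__":
    first = run_experiment(seed=42)
    second = run_experiment(seed=42)
    # Two runs with the same seed and settings should match exactly.
    print("reproducible:", first == second)
```

Passing the seed explicitly, rather than setting it once globally, also makes it easy to store the seed next to the other hyperparameters in a versioned configuration file.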

Code for experiments might change dynamically, so it is vital to keep track of these changes via version control tools like Git.

Apart from the software code base, we can version:

  • Configuration files used by the code. These can be versioned with Git as well.
  • Data, as it can change over time: new examples are introduced, some are discarded, and labels change. To version datasets, consider tools like DVC or Neptune.
  • Jupyter Notebooks. They are an essential tool for every ML Engineer and can be versioned using Git. ReviewNB is also worth considering, as it gives more readable summaries of what changed from version to version.
  • Infrastructure. To increase the portability of the solution and represent infrastructure as code (IaC), consider tools like Terraform or AWS CDK.


Example showing changes in Jupyter Notebook using ReviewNB, Source:
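
To make the data versioning point from the list above more tangible, the sketch below fingerprints a dataset file by hashing its content, so that any change in examples or labels produces a different identifier. This is only an illustration of the general idea and not how DVC or Neptune are used; the function name `fingerprint_file` and the sample file are assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Reading in chunks keeps memory use flat for large datasets.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"  # hypothetical dataset file
        data.write_text("text,label\nhello,0\n")
        before = fingerprint_file(data)
        data.write_text("text,label\nhello,1\n")  # one label changed
        after = fingerprint_file(data)
        print("dataset changed:", before != after)
```

Storing such a fingerprint in Git next to the code is one simple way to tie an experiment to the exact data it was run on.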

2. Remember to monitor

Monitoring is often considered the last step in machine learning projects. However, it should be taken into account from the start, even before deploying the model to production. Monitoring is not a deployment-only matter, as you should also track training experiments. During each experiment, the following can be tracked:

  • weight distribution throughout training,
  • utilisation of hardware (CPU, GPU, RAM),
  • quality of predictions on the test set,
  • history of metrics like loss, F1, accuracy, and precision.

Tools such as Weights & Biases, Neptune, or MLflow are worth considering for monitoring during training.
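
To show what tracking a history of metrics boils down to, here is a minimal stand-in tracker built only on the standard library. It is not the API of Weights & Biases, Neptune, or MLflow; the class `RunTracker`, its methods, and the numbers in the demo are all assumptions made for illustration.

```python
import json
from collections import defaultdict


class RunTracker:
    """Keeps a per-metric history of (step, value) pairs for one training run."""

    def __init__(self, run_name: str, params: dict) -> None:
        self.run_name = run_name
        self.params = dict(params)
        self.history = defaultdict(list)

    def log(self, step: int, **metrics: float) -> None:
        """Record one value per named metric at the given training step."""
        for name, value in metrics.items():
            self.history[name].append((step, float(value)))

    def best(self, name: str, higher_is_better: bool = True) -> tuple:
        """Return the (step, value) pair with the best value for a metric."""
        pick = max if higher_is_better else min
        return pick(self.history[name], key=lambda pair: pair[1])

    def to_json(self) -> str:
        """Serialise the run so it can be stored and compared later."""
        return json.dumps(
            {"run": self.run_name, "params": self.params, "history": self.history},
            sort_keys=True,
        )


if __name__ == "__main__":
    tracker = RunTracker("baseline", {"lr": 0.01})
    # Made-up numbers, standing in for values computed in a training loop.
    for step, (loss, f1) in enumerate([(0.9, 0.55), (0.6, 0.70), (0.7, 0.68)]):
        tracker.log(step, loss=loss, f1=f1)
    print(tracker.best("loss", higher_is_better=False))
    print(tracker.to_json())
```

Dedicated tools add a lot on top of this, such as dashboards, hardware utilisation charts, and run comparison, but the underlying record is similar: parameters plus metric values per step.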

When considering inference monitoring, we can divide it into the following groups:

  1. Service level monitoring - monitoring of deployed services, e.g., how long it takes to process user requests, how many resources are used, etc.
  2. Model level monitoring - monitoring of the input and output distributions. It can raise an alarm when the data distribution shifts considerably, which may cause the model to perform worse. This kind of alarm can trigger model retraining.

For inference, the choice of monitoring tools largely depends on your setup. If models are deployed using AWS Lambda, Amazon CloudWatch provides extensive logging. In GCP, the corresponding products are Google Cloud Functions for deployment and Stackdriver for logging and monitoring. If, on the other hand, Kubernetes is used as the deployment environment, consider the following:

  • Prometheus for tracking service liveness, resource utilisation, and basic service parameters.
  • Loki for log aggregation.
  • Grafana for visualisations in the form of a dashboard.


Table presenting tools for monitoring ML Systems

3. Test whatever you can!

Testing machine learning systems is still not nearly as formalised as it is with software. What can we actually test? Here is the list:

  1. Quantity and quality of input data.
  2. Schema and distribution of input data (ranges of the values, expected values, etc.).
  3. The pipeline itself and the data produced at each step.
  4. Compliance (e.g., GDPR, EU AI Act) of the components of the ML system (data, pipelines).

Adding tests will make your system more robust and resilient to unexpected changes on both the data and infrastructure ends. With data validation in place, the distributions and statistics of training and production data can be compared.
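
A schema and range check on input data can be as simple as the sketch below. The schema, the field names, and the `validate` function are hypothetical and only meant to show the shape of such a test; dedicated data validation libraries offer far more.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    """Expected type and optional inclusive value range for one input field."""
    dtype: type
    minimum: Optional[float] = None
    maximum: Optional[float] = None


# Hypothetical schema for a single input record.
SCHEMA = {
    "amount": Column(float, minimum=0.0),
    "age": Column(int, minimum=18, maximum=120),
    "country": Column(str),
}


def validate(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, column in schema.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, column.dtype):
            problems.append(f"{name}: expected {column.dtype.__name__}")
            continue
        if column.minimum is not None and value < column.minimum:
            problems.append(f"{name}: {value} below {column.minimum}")
        if column.maximum is not None and value > column.maximum:
            problems.append(f"{name}: {value} above {column.maximum}")
    return problems


if __name__ == "__main__":
    print(validate({"amount": 12.5, "age": 30, "country": "PL"}))
    print(validate({"amount": -1.0, "age": 30, "country": "PL"}))
```

Returning a list of problems instead of raising on the first one makes it easier to report everything that is wrong with a batch of incoming data in one go.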

Ordinary software testing practices can be applied when testing pipelines and transformations. For the infrastructure, code written in an IaC tool can be covered with unit tests. Compliance tests are problem-specific and should be implemented with great care.

4. Automate!

Each of the previous principles is related to this step. As machine learning systems mature, iteration and retraining of models happen more often. At a certain point, such frequent retraining is possible only if every step is automated. MLOps teams aim for end-to-end deployment of models without the need for human intervention.

As this field is still developing, there is a lot of freedom in what suits your use case best. The following guidelines should be considered as a starting point for exploration (State of MLOps).

Levels of MLOps

According to Google, there are three ways of implementing MLOps: manual process, semi-automated, and fully orchestrated.

MLOps level 0 (manual process)

This level is typical for companies that are just starting their adventure with MLOps: the ML workflow is manual, decisions are driven by Data Scientists, and models are trained infrequently. Its characteristics are as follows:

  1. The whole process is manual, with scripts and micro-decisions made by Data Scientists.
  2. Training and operations are disconnected, with Data Scientists handing over models as artifacts. Each step requires manual execution with quite a lot of human supervision.
  3. No CI/CD. It is considered unnecessary because models are rarely iterated on.
  4. Deployment is a simple prediction service (e.g., a REST service on AWS Lambda). No advanced scaling is considered, as static deployment is sufficient.
  5. No performance monitoring. Predictions are neither tracked nor logged.

ML systems at this level are prone to frequent failures in a dynamic data environment. Using CI/CD and tracking/monitoring is good practice for addressing this issue. An ML pipeline with monitoring and CI/CD allows for rapid tests, stable builds, and the safe introduction of new implementations. Level 0 is also the only level at which models are trained locally. On the higher levels (1 and 2), models are trained in the cloud as a job, for example, using Airflow or another tool.

MLOps level 1 (semi-automated)
On level 1, the focus is on continuous training via automated ML pipelines. This setup is suitable for solutions operating in a constantly changing environment.

Its characteristics look as follows:

  1. Experiments are orchestrated and done automatically.
  2. Models in production are continuously trained, using fresh data when triggered.
  3. There is no discrepancy between development, experimental, preproduction and production environments, as they use the same pipeline.
  4. Code for components and pipelines is modularised to make it reusable and shareable across ML pipelines (using containers).
  5. There is a continuous delivery of models, as the model training, validation and deployment are automated.
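
When training, validation, and deployment are automated, something has to decide whether a freshly trained model is good enough to ship. Below is a minimal sketch of such a validation gate. The function `should_promote`, the metric name, and the thresholds are assumptions made for this example rather than a prescribed rule.

```python
def should_promote(candidate: dict, production: dict,
                   metric: str = "f1", min_gain: float = 0.0,
                   floor: float = 0.5) -> bool:
    """Decide whether a freshly trained model may replace the production one."""
    score = candidate[metric]
    if score < floor:
        # Absolute sanity check: catches a broken training run even when
        # the production model happens to be weak as well.
        return False
    # Relative check: the candidate must be at least as good as production,
    # plus an optional required improvement.
    return score >= production[metric] + min_gain


if __name__ == "__main__":
    production = {"f1": 0.75}  # made-up evaluation results
    print(should_promote({"f1": 0.80}, production))
    print(should_promote({"f1": 0.70}, production))
```

Both models should be evaluated on the same held-out data for the comparison to be meaningful.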

On level 1 of MLOps, both deployment and training code are published to the cloud.

Additional components that have to be implemented apart from the ones already on level 0:

  • Data and model validation: As we want the pipeline to be fully automated, it requires a data validation component to check incoming fresh data. Validating data is not an easy task: it is relatively straightforward for classification problems, but very challenging for tasks like text generation (due to unclear labelling). Once the new model is trained, it has to be validated to avoid shipping a faulty model.
  • Feature store: a component that standardises, stores, and manages access to features for training and serving.
  • Metadata management: information about each execution of the ML pipeline is recorded to debug errors and anomalies but also for reproducibility, execution comparisons, and data lineage.
  • ML pipeline triggers: Automated retraining can be triggered in several ways:
    a) on-demand,
    b) on-schedule,
    c) on-model performance degradation,
    d) on change in data distribution,
    e) on new data availability.
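
The five triggers listed above can be combined into a single check that the pipeline scheduler runs periodically. This is only a sketch: the `PipelineState` fields, the function `retraining_reasons`, and all default thresholds are hypothetical and would need to be adapted to the actual system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PipelineState:
    """Snapshot of signals a trigger might look at (all fields hypothetical)."""
    last_trained: datetime
    manual_request: bool = False
    live_f1: float = 1.0
    drift_score: float = 0.0
    new_rows: int = 0


def retraining_reasons(state: PipelineState, now: datetime,
                       max_age: timedelta = timedelta(days=7),
                       min_f1: float = 0.8,
                       max_drift: float = 0.2,
                       min_new_rows: int = 10_000) -> list:
    """Return the names of all triggers that fire; empty means no retraining."""
    reasons = []
    if state.manual_request:
        reasons.append("on-demand")
    if now - state.last_trained >= max_age:
        reasons.append("on-schedule")
    if state.live_f1 < min_f1:
        reasons.append("performance-degradation")
    if state.drift_score > max_drift:
        reasons.append("data-drift")
    if state.new_rows >= min_new_rows:
        reasons.append("new-data")
    return reasons


if __name__ == "__main__":
    now = datetime(2023, 9, 20)
    state = PipelineState(last_trained=now - timedelta(days=8), live_f1=0.7)
    print(retraining_reasons(state, now))
```

Returning the reasons, rather than a plain boolean, gives the metadata store something useful to record about why a given pipeline run was started.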

Level 1 MLOps is great when you want to automate your training. However, it only automates retraining on new data, while any modification to the training scheme requires redeploying the whole pipeline.

MLOps level 2 (fully orchestrated)
This level is typical for highly tech-driven companies, which often experiment with different implementations of pipeline components. Additionally, they train new models often (daily or hourly), update them quickly, and redeploy them to clusters of servers simultaneously. Without an end-to-end MLOps cycle, doing so would not be possible for these companies.

Characteristics of this level look as follows:

  • Iterative and automated development and experimentation: Due to the orchestration of ML pipelines, it is easy to try new concepts and incorporate them.
  • CI of ML pipeline: Both source code builds and component testing are easy to perform.
  • CD of ML pipeline: Deploying components and artifacts from the CI stage is easy.
  • Automated triggering: The pipeline is automatically executed in production on a predefined schedule or reactively triggered based on some event.
  • Monitoring: Model performance statistics are collected. Based on these statistics, a specialised mechanism can trigger retraining when data drift is detected. Additionally, logs can be used for debugging.

Both data and model analysis are manual processes, although tools are provided.

Revolut: MLOps for fraud detection
To better show how technology companies apply MLOps, I will present the example of the UK-based financial technology company Revolut. Their main product is an app offering various banking services, with millions of daily transactions.

They employ a machine learning fraud prevention algorithm called Sherlock to avoid losses due to fraudulent card transactions.


Source: Building a state-of-the-art card fraud detection system in 9 months

Revolut generally keeps its services on Google Cloud, so all the components are deployed there. For data and feature management, they use Apache Beam via Dataflow. The model development is done using Python and CatBoost. Models are deployed as a Flask app on App Engine. For in-memory storage of customer and user profiles, they use Couchbase.

Production orchestration is handled by Google Cloud Composer (running Apache Airflow).

For monitoring, Revolut uses a combination of two tools. The first one is Google Cloud Stackdriver, which is used for real-time latency analysis, the number of transactions processed per second, and more. The second one is Kibana, which is used for monitoring merchants, the number of alerts and frauds, and model performance (true positive and false positive rates). Signals from monitoring are forwarded to the fraud detection team via email and SMS.
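
For readers less familiar with the two model performance metrics mentioned above, here is a minimal sketch of how true positive and false positive rates are computed from binary labels and predictions. It is a generic illustration, not Revolut's implementation; the function name `rates` and the sample data are made up.

```python
def rates(labels: list, predictions: list) -> tuple:
    """Return (true positive rate, false positive rate) for binary 0/1 data."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # fraud, flagged
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # fraud, missed
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # legitimate, flagged
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # legitimate, passed
    # Guard against division by zero when a class is absent from the sample.
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


if __name__ == "__main__":
    # Made-up data: 1 marks a fraudulent transaction, 0 a legitimate one.
    labels = [1, 1, 0, 0, 0, 1]
    predictions = [1, 0, 0, 1, 0, 1]
    tpr, fpr = rates(labels, predictions)
    print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

In fraud detection, a high true positive rate means few frauds slip through, while a low false positive rate means few legitimate transactions are blocked by mistake.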

Model lineage is tracked with Cloud ML Engine. Finally, human feedback is gathered in-app: each user can indicate whether a transaction was malicious. As this is a classification task, these labels can be collected directly for the new dataset version.


Source: Building a state-of-the-art card fraud detection system in 9 months

Useful resources

For interested readers I can recommend the following resources:
