How to apply CT/CD principles to machine learning projects?

Adam Wawrzyński

24 Jun 2024 · 10 minutes read

Problem description

In large companies, managing invoices and bank transfers is a complex process. With many contractors and customers, it is easy to make a mistake in the transfer title, and there are cases where one transfer pays for many invoices or for only part of a single invoice. In such situations, simple rules do not work, and bank transfers and invoices have to be matched manually by a person.

This solution has several drawbacks, but the main one is the lack of scalability. Imagine that the company currently employs one person whose job is to match bank transfers with invoices. If the company grows, starts hiring 2x more subcontractors, and gets 2x more clients, then, assuming the process and work efficiency do not change, it will need 4x more employees, because the number of transfer-invoice pairs to review grows with both factors. The solution to this problem is to automate the process, and artificial intelligence can help with that.

As part of our in-house enterprise financial management system, a Proof of Concept of an AI system was carried out to automate this manual process.

Project State

The project was handed over with the Proof of Concept stage completed.

Illustration source: DALL-E

In other words, there was ready-made code for data analysis, training, evaluation, and deployment of the machine learning model. These functionalities lived in separate scripts and required manual execution. Moreover, the training data used within the PoC was a static snapshot of the system state, stored in the repository along with the code. If new training data came in (which happens every month), updating the model would require a lot of manual work: downloading and converting the training data to a specific format, running training and model evaluation, selecting the best model, and manually deploying it to production. And so every month... With one project this is still doable, though frustrating. With more projects, it is impossible to maintain such a state of affairs.

Data characteristics

The training data consists of the metadata of invoices and bank transfers, along with information about which transfer each invoice is associated with. This structured data is manually tagged in the internal system by the administration and finance department, so the training examples are human-labeled and of high quality. Invoices, like transfers, appear in the system at a monthly interval. The data lands in real time in BigQuery, which will serve as the source of training data in the training process.

This is a textbook example of a problem in which we can use cyclically appearing training data to improve the effectiveness of the machine learning model and maintain high prediction quality and model validity.

Planning

To avoid having to retrain the model manually, we can use MLOps techniques, specifically Continuous Training and Continuous Deployment (CT/CD) pipelines, to automate the process. At this stage, we will consider how to break down the AI model development process and how to ensure automation.

Process division

The first step that needs to be run periodically is the process of collecting training data from data sources. In the system in use, the metadata for bank transfers and invoices are stored in BigQuery. Therefore, we can prepare a function that will retrieve the data from BigQuery, perform data transformation, divide the data into training and test sets, and store them in the corresponding bucket in the cloud.

The next step is the process of the machine learning model training and evaluation. In the previous step, we assumed that the training data would be stored in the cloud. Therefore, we can assume that the machine learning pipeline will retrieve data from the indicated cloud bucket, train, evaluate, and save the fine-tuned machine learning model. To automatically deploy the model, it will be necessary to store metadata about the trained models in the model registry. This will make it possible to compare models with the currently deployed model. If the newly-trained model performs better, it should replace the currently used model. This will ensure that we are always using the best available model at any given time.

The aforementioned process of training and evaluation of the ML model determines whether the model currently in use needs to be updated. The final step is the deployment of the model. This step depends on the infrastructure used in the project.

In summary, to automate the process, we need to run 3 processes periodically:

  1. training sample collection and preparation
  2. training and evaluation of the machine learning model
  3. conditional deployment

Automation

Creating a machine learning pipeline that performs these 3 steps solves only part of the problem. If the code runs on a local machine, we will save some time, but the process will not be automated. To free ourselves from having to remember to run it, we can use cron (a solution for masochists) or cloud solutions that scale well.


Previous components of the project, such as BigQuery, already use Google Cloud, so it was a natural step to turn our attention toward this cloud provider.

What will it take to run the pipeline in Google's cloud? Cloud Storage to store training data and machine learning model artifacts. Google Cloud Builder for the cloud-based process of building Docker containers and pushing them to the appropriate container registry. For training ML models, VertexAI allows you to select the appropriate hardware, including the GPUs needed to train more advanced AI models. For scheduled BigQuery data retrieval, training, and deployment, Google Cloud Run functions will be used.

In summary, the selected technology stack will consist of Google services: Google Cloud Storage, BigQuery, VertexAI, Cloud Run, and Google Cloud Builder.

Implementation

Given that the core of the project (training data processing, model training, evaluation, and deployment) already exists, the main task will be to adapt it to run in a cloud environment.

To productionize the machine learning pipeline, the ZenML framework will be used. This framework provides an abstraction layer between the components and services of different cloud providers, so moving the project between clouds does not require code changes. I encourage you to check out this project!

For cloud-based pipelines, a ZenML Server deployment is essential. You can use the convenient deployment option on HuggingFace Spaces. The pipelines require a ZenML stack containing the following components: image builder, orchestrator, and artifact store. The default orchestrator in GCP, the VertexAI Orchestrator, will be used to run the containers. To run scheduled jobs, it is necessary to properly configure the service accounts and the orchestrator itself in ZenML, according to the instructions in the documentation. The BentoML component is used for deployment, and local Docker will be used as the image builder.
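To give a feel for what the scheduling looks like, below is a minimal sketch of attaching a monthly cron schedule to a ZenML pipeline so that the orchestrator configured in the active stack re-runs it on its own. The step contents are placeholders, and the exact API may differ slightly between ZenML versions.

import pandas as pd
from zenml import pipeline, step
from zenml.config.schedule import Schedule

@step
def load_data() -> pd.DataFrame:
    # Placeholder step: in the real project this would query BigQuery.
    return pd.DataFrame({"amount": [100.0, 250.0]})

@step
def train(data: pd.DataFrame) -> None:
    # Placeholder step: model training would happen here.
    print(f"Training on {len(data)} rows")

@pipeline
def monthly_training_pipeline():
    train(load_data())

if __name__ == "__main__":
    # Run on the first day of every month on the orchestrator
    # configured in the active ZenML stack.
    monthly_training_pipeline.with_options(
        schedule=Schedule(cron_expression="0 3 1 * *")
    )()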

Training dataset creation

In this step, the pipeline needs read/write permissions for Google Cloud Storage and BigQuery. The pipeline authenticates itself to BigQuery and executes a query that retrieves data in pandas DataFrame format, performs data transformations, splits it into a training set and a test set, and writes the data in .csv format to the appropriate bucket on GCS. The pipeline is run periodically every month. The test dataset is created only on the first run of the pipeline; if the test set already exists, subsequent runs save new data samples as training data.
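A simplified sketch of this data-preparation pipeline could look as follows. The query, bucket name, object paths, and split ratio are illustrative assumptions, not the ones used in the actual project.

import pandas as pd
from google.cloud import bigquery, storage
from sklearn.model_selection import train_test_split
from zenml import pipeline, step

BUCKET = "finanse-ml-dev"  # illustrative bucket name
QUERY = "SELECT * FROM `project.dataset.invoice_transfer_metadata`"  # illustrative query

@step
def fetch_data() -> pd.DataFrame:
    # Authenticates with application default credentials and pulls the
    # invoice/transfer metadata into a DataFrame.
    client = bigquery.Client()
    return client.query(QUERY).to_dataframe()

@step
def split_and_store(data: pd.DataFrame) -> None:
    bucket = storage.Client().bucket(BUCKET)
    test_blob = bucket.blob("datasets/test.csv")
    if not test_blob.exists():
        # First run: create both the training and the test split.
        train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)
        test_blob.upload_from_string(test_df.to_csv(index=False), "text/csv")
    else:
        # Subsequent runs: all new samples become training data.
        train_df = data
    bucket.blob("datasets/train.csv").upload_from_string(
        train_df.to_csv(index=False), "text/csv"
    )

@pipeline
def dataset_creation_pipeline():
    split_and_store(fetch_data())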

Training, evaluation, and model registry

The model-training pipeline retrieves data from the indicated Google Cloud Storage bucket, trains the model on the training data, evaluates it on the test data, and compares it to the models registered in the model registry. The model registry used is ZenML's built-in Model Control Plane component. If the newly trained model is better, it is registered in the model registry, an archive with the code and artifacts needed to build the container is created and saved in GCS, and finally the Google Cloud Builder task is run, which is the deployment step described in the next section. This pipeline also runs periodically every month and must execute after the pipeline that gathers the training data samples has completed.
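The core of the comparison logic can be sketched as below. The model type, the "matched" label column, and the way the production metric is passed in are illustrative assumptions, not the project's actual code.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from zenml import step

@step
def train_model(train_df: pd.DataFrame) -> RandomForestClassifier:
    # Illustrative model and label column.
    X, y = train_df.drop(columns=["matched"]), train_df["matched"]
    return RandomForestClassifier(random_state=42).fit(X, y)

@step
def evaluate_and_compare(
    model: RandomForestClassifier, test_df: pd.DataFrame, production_f1: float
) -> bool:
    # Score the candidate on the held-out test set and decide whether it
    # should replace the currently deployed model.
    X, y = test_df.drop(columns=["matched"]), test_df["matched"]
    candidate_f1 = f1_score(y, model.predict(X))
    return candidate_f1 > production_f1

In the actual pipeline, the production metric is read from the Model Control Plane, and a positive result gates both the registration of the new model version and the Cloud Builder job described below.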

Deployment

The process of building the container is performed in Google Cloud Builder. It allows you to create and run a build job with a POST request whose payload describes the steps of building the container, the archive with code and artifacts, and the target container registry to which the built container will be pushed.

An example of a payload sent to Google Cloud Builder is presented below. The source parameter specifies the archive from which the Docker container will be built; the archive must contain the Dockerfile and all the files needed during the build. The steps parameter specifies the list of steps in the process, which in our case is just the container build command. The last parameter, images, specifies the name under which the container will be saved in the indicated container registry.

{
  "source": {
    "storageSource": {
      "bucket": "finanse-ml-dev",
      "object": "models/8f19cd96-ca7b-40bd-980e-f2ce19908a6b/model.tar.gz"
    }
  },
  "steps": [
    {
      "name": "gcr.io/cloud-builders/docker",
      "args": [
        "build",
        "-t",
        "gcr.io/finanse-project/model",
        "."
      ]
    }
  ],
  "images": [
    "gcr.io/finanse-project/model"
  ]
}
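For completeness, submitting such a build from Python can look roughly like the sketch below, here using the google-cloud-build client library instead of a raw HTTP request; the project ID in the example call is an assumption.

from google.cloud.devtools import cloudbuild_v1

def trigger_model_build(project_id: str, bucket: str, archive: str) -> None:
    # Recreate the payload shown above and submit it to Google Cloud Builder.
    client = cloudbuild_v1.CloudBuildClient()
    build = cloudbuild_v1.Build(
        source=cloudbuild_v1.Source(
            storage_source=cloudbuild_v1.StorageSource(bucket=bucket, object_=archive)
        ),
        steps=[
            cloudbuild_v1.BuildStep(
                name="gcr.io/cloud-builders/docker",
                args=["build", "-t", "gcr.io/finanse-project/model", "."],
            )
        ],
        images=["gcr.io/finanse-project/model"],
    )
    # create_build returns a long-running operation; result() blocks until
    # the container is built and pushed to the registry.
    operation = client.create_build(project_id=project_id, build=build)
    operation.result()

# Example call with an illustrative project ID and the archive from the payload above:
# trigger_model_build("finanse-project", "finanse-ml-dev",
#                     "models/8f19cd96-ca7b-40bd-980e-f2ce19908a6b/model.tar.gz")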

In our case, the model is served as a Pod on an already configured GKE cluster. Kubernetes, through flux.cd, uses the image automation functionality. Thanks to this, deployment boils down to pushing the built Docker container to the appropriate registry, and Kubernetes handles the process of updating the model internally.

Due to infrastructure requirements, this step builds a Docker container with a REST API that will be used for inference on the ML model. For this purpose, the BentoML framework is used, which allows us to easily create a REST API for ML model inference.
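A minimal BentoML service could look like the sketch below; the model tag, service name, and payload shape are illustrative assumptions, and the bentoml.Service style shown here may differ between BentoML versions.

import bentoml
import pandas as pd
from bentoml.io import JSON

# Illustrative model tag; the real artifact comes from the training pipeline.
runner = bentoml.sklearn.get("invoice_matcher:latest").to_runner()
svc = bentoml.Service("invoice_matcher_service", runners=[runner])

@svc.api(input=JSON(), output=JSON())
def predict(payload: dict) -> dict:
    # Turn the request payload (invoice and transfer metadata) into a
    # DataFrame and return the predicted match.
    features = pd.DataFrame([payload])
    prediction = runner.predict.run(features)
    return {"match": prediction.tolist()}

A service like this is then packaged into the Docker image built by Cloud Builder and exposed by the Pod running on GKE.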

Results

The final output of the entire CT/CD pipeline is the cyclic retraining of the machine learning model on the latest training data and the conditional deployment of the model to the production environment. The whole thing is done cyclically in an automated manner that requires no human supervision.

Next steps

Further steps in developing this project will focus on monitoring the model and the process. What do I mean by this? A simple improvement will be the addition of Slack alerts that will inform the team on a dedicated channel about the training process, model results, task status, and conditional deployment. By model monitoring, I mean adding a cyclic task that will detect data and model drift based on the difference between the training data and the latest data flowing into the system, using Evidently.
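Such a drift check could be sketched with Evidently roughly as follows; the exact report structure may differ between Evidently versions, so treat this as an assumption rather than final code.

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

def dataset_has_drifted(reference: pd.DataFrame, current: pd.DataFrame) -> bool:
    # Compare the distribution of the latest data against the training data.
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    result = report.as_dict()
    # The first metric of the preset summarizes whether the dataset as a
    # whole has drifted.
    return bool(result["metrics"][0]["result"]["dataset_drift"])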

Problems encountered

Big artifacts

The ZenML framework uses a caching mechanism for the individual steps of the pipeline being run. Each step has defined types of input and output data, which are stored as artifacts in the artifact-store stack component. This is done to optimize the execution time of pipelines and the resources consumed when the input parameters of a step have not changed. In the case of large artifacts (e.g., entire datasets), artifact serialization and upload take an unacceptably long time.

The problem was solved by optimizing the dataset passed between steps. It is worth noting that the dataset is relatively small and consists of tabular data. This can be a challenge in projects where the data is images or video.
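One way to keep such artifacts small, roughly in the spirit of what was done here, is to pass only the columns the training step actually needs and to downcast numeric types before the DataFrame is handed over between steps; the helper below is only an illustration of that idea.

import pandas as pd

def shrink_dataset(df: pd.DataFrame, needed_columns: list[str]) -> pd.DataFrame:
    # Drop unused columns and downcast numeric types so the artifact that
    # ZenML serializes between steps stays small.
    df = df[needed_columns].copy()
    for column in df.select_dtypes(include="integer").columns:
        df[column] = pd.to_numeric(df[column], downcast="integer")
    for column in df.select_dtypes(include="float").columns:
        df[column] = pd.to_numeric(df[column], downcast="float")
    return df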

Summary

This post describes the process of transforming a machine learning project from the Proof of Concept stage to an automated CT/CD process and describes the further steps in project development. The application of MLOps techniques in a real project made it possible to test the usefulness of the tools and proposed ways of solving problems related to the operationalization of AI models.

From the business point of view, thanks to the cyclical appearance of new training data in the system, it was advantageous to build a CT/CD architecture on top of the machine learning pipelines, which allows for the cyclic improvement of model quality. The planned addition of cyclic data-drift and model-drift detection will allow a faster response to changes in the characteristics of the data appearing in the system.

I will describe further stages of the project's development in future articles, so stay tuned.
