GCP Compute Engine for your ML PoC
There are many great tools for ML project orchestration, like KubeFlow, Airflow with MLFlow, or cloud provider solutions (like Google Vertex AI, SageMaker, and Azure AI), but all of those solutions are meant for large-scale systems. What if you want to serve a budget version of your ML demo with low, spiky traffic? In this article, I will describe how to quickly deploy your ML model with limited user access and scalability to zero. The tutorial will be carried out on the Google Cloud Platform using Compute Engine.
Why Compute Engine?
There are a few options for hosting applications on GCP, starting with the easiest one, Cloud Run, through Vertex AI, a dedicated solution for hosting ML models, to Compute Engine, which provides self-managed infrastructure. Why might Compute Engine be the best choice?
Cloud Run
Cloud Run is a serverless Platform as a Service (PaaS). It provides scalability to zero and almost no management overhead. It is perfect for cost-efficient applications, as it only runs while an active HTTP request is being handled. However, Cloud Run does not provide GPU support, which rules it out for most ML projects.
Vertex AI
Vertex AI is a dedicated solution for ML model management. It makes it easy to version and serve your ML models. However, Vertex AI Endpoints do not scale down to zero. There has been an open feature request since 2021, but as of now, the feature is still not available.
Compute Engine
Compute Engine is an Infrastructure as a Service (IaaS). Compared to Cloud Run, it provides much more customization, but it requires more configuration at the same time. Using the Managed Instance Group (MIG) service, we can set up auto-scaling of our VMs. Even though it should theoretically be possible to scale down to zero using the auto-scaling feature, I did not manage to achieve it. However, we can manage MIG scaling programmatically and achieve zero-scaling functionality for our GPU VM.
Alternatives
An alternative might be to use a Hugging Face Space, which provides scale-to-zero functionality and GPU support. However, Hugging Face Spaces require you to make your source code public, which might be a blocker for many projects.
Solution architecture
The solution assumes a cheap frontend instance running constantly and an expensive GPU instance spawned by the CPU instance when traffic appears.
Figure 1. Solution architecture.
This way we achieve high availability, scalability to zero, and GPU acceleration. Each instance is managed by a MIG and hidden behind a load balancer. For security purposes, Identity Aware Proxy (IAP) provides a single point of control for managing user access to the web application.
Implementation details
Artifact Registry
Both the frontend app (running on a CPU-only VM) and the backend (running on a VM with a GPU) are packaged as Docker images. Additionally, the frontend Docker image has to contain the scaling functions described in the Scaling to and from zero section, which is why the additional service account roles described in the Service Account section are needed.
After creating a Docker image, it has to be tagged and pushed to the Artifact Registry. Step-by-step instructions can be found here.
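As a rough sketch, tagging and pushing could look like the commands below; the placeholders match the repository path used later in this article, and the local image name is an example:

# Authenticate Docker with the Artifact Registry region (one-time setup)
gcloud auth configure-docker europe-west4-docker.pkg.dev
# Tag the locally built image and push it to the registry
docker tag <local_image> europe-west4-docker.pkg.dev/<team>/<repo>/<frontend_image>:<tag>
docker push europe-west4-docker.pkg.dev/<team>/<repo>/<frontend_image>:<tag>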
Service Account
A custom service account is required for the frontend app in order to manage scaling the GPU instances to and from zero. A custom role with the following permissions has to be created:
- compute.autoscalers.create
- compute.autoscalers.delete
- compute.autoscalers.get
- compute.autoscalers.list
- compute.autoscalers.update
For the backend, a custom role with the artifactregistry.repositories.downloadArtifacts permission should be created.
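A minimal gcloud sketch of creating such a role and binding it to a service account might look like this (the role and service account names are examples, not fixed by this setup):

# Custom role with the autoscaler permissions for the frontend service account
gcloud iam roles create migScaler --project=<project> \
  --title="MIG Scaler" \
  --permissions=compute.autoscalers.create,compute.autoscalers.delete,compute.autoscalers.get,compute.autoscalers.list,compute.autoscalers.update
# Service account used by the frontend VM, bound to the custom role
gcloud iam service-accounts create frontend-sa --display-name="Frontend SA"
gcloud projects add-iam-policy-binding <project> \
  --member="serviceAccount:frontend-sa@<project>.iam.gserviceaccount.com" \
  --role="projects/<project>/roles/migScaler"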
Instance template
Each instance group requires an instance template from which to create VMs. As the ML PoC consists of two instance groups (a CPU frontend and a GPU backend), we need to create two templates.
Frontend instance template
Set up the following values (an equivalent gcloud command is sketched after the list):
- Machine configuration: for example, E2.
- Container: Container image ->
europe-west4-docker.pkg.dev/<team>/<repo>/<frontendimage_name>:<tag>
- Boot disk:
- Identity and API access: the service account with the Editor role and the custom role with the compute.autoscalers permissions described in the Service Account section
- Firewall: Allow HTTP traffic
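The same template can also be created from the command line; a minimal sketch, assuming example names and the e2-medium machine type:

gcloud compute instance-templates create-with-container frontend-template \
  --machine-type=e2-medium \
  --tags=http-server \
  --container-image=europe-west4-docker.pkg.dev/<team>/<repo>/<frontend_image>:<tag> \
  --service-account=frontend-sa@<project>.iam.gserviceaccount.com \
  --scopes=cloud-platform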
Backend instance template
Set up the following values:
- Machine configuration: choose a GPU and an instance type, for example, N1.
- Boot disk: a Deep Learning VM image, so that the /opt/deeplearning/install-driver.sh script used in the startup script below is available
- Firewall: Allow HTTP traffic
- Advanced options: Management -> Automation -> startup-script
# Install NVIDIA drivers
sudo /opt/deeplearning/install-driver.sh

# Set up access to the Google Cloud Docker registry
VERSION=2.1.2
OS=linux
ARCH=amd64
curl -fsSL "https://github.com/GoogleCloudPlatform/docker-credential-gcr/releases/download/v${VERSION}/docker-credential-gcr_${OS}_${ARCH}-${VERSION}.tar.gz" \
  | tar xz docker-credential-gcr \
  && chmod +x docker-credential-gcr && sudo mv docker-credential-gcr /usr/bin/
docker-credential-gcr configure-docker --registries=europe-west4-docker.pkg.dev

# Pull and run the backend image on port 11434
docker pull europe-west4-docker.pkg.dev/<team>/<repo>/<backend_image>:<tag>
docker container prune -f
docker run -d --gpus=all -v <src>:<dest> -p 11434:11434 --name <container_name> \
  europe-west4-docker.pkg.dev/<team>/<repo>/<backend_image>:<tag>
The startup script installs the NVIDIA drivers, configures Docker access to the Artifact Registry, and pulls and runs the Docker image on port 11434. Configuring Docker credentials for the Artifact Registry is necessary because the operating system is not the Container-Optimized OS; otherwise, this step could be omitted. The official tutorial for authenticating to Artifact Registry for Docker can be found here.
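For reference, a hedged gcloud equivalent of the backend template could look like the sketch below; the machine type, GPU type, and Deep Learning image family are examples and may need to be adjusted for your project and region:

gcloud compute instance-templates create backend-template \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=common-cu121 --image-project=deeplearning-platform-release \
  --boot-disk-size=100GB \
  --metadata-from-file=startup-script=startup.sh \
  --service-account=backend-sa@<project>.iam.gserviceaccount.com \
  --scopes=cloud-platform \
  --tags=http-server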
Managed Instance Group
For each instance template, an instance group should be created.
Frontend instance group
Create a new health check with the correct port.
Create a port mapping for incoming requests.
Backend instance group
Turn off autoscaling. The autoscaling for the GPU VM will be managed programmatically.
Create a health check for the GPU VM.
Create a port mapping for the backend service.
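As a sketch, the health checks and instance groups could also be created with gcloud; the names, zone, frontend port, and health-check path below are examples:

# Health checks for the two services (FRONTEND_PORT and /health are placeholders)
gcloud compute health-checks create http fe-health-check --port=FRONTEND_PORT --request-path=/health
gcloud compute health-checks create http be-health-check --port=11434 --request-path=/health
# Managed instance groups created from the templates, with named ports for the load balancers
gcloud compute instance-groups managed create frontend-mig --template=frontend-template --size=1 --zone=europe-west4-a
gcloud compute instance-groups managed create backend-mig --template=backend-template --size=0 --zone=europe-west4-a
gcloud compute instance-groups set-named-ports frontend-mig --named-ports=http:FRONTEND_PORT --zone=europe-west4-a
gcloud compute instance-groups set-named-ports backend-mig --named-ports=http:11434 --zone=europe-west4-a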
External Load Balancer
The external load balancer is used for distributing ingress traffic from the Internet to the frontend instance. The frontend service of the load balancer should have a static IP address assigned and a valid SSL certificate for the HTTPS protocol. Google provides a tutorial on how to create a certificate for development purposes. The backend service should point to the frontend instance group.
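A compressed gcloud sketch of the external HTTPS load balancer might look like this; all resource names, the zone, and the domain are hypothetical:

gcloud compute addresses create ml-poc-ip --global
gcloud compute ssl-certificates create ml-poc-cert --domains=<your-domain> --global
gcloud compute backend-services create frontend-backend-service \
  --protocol=HTTP --port-name=http --health-checks=fe-health-check --global
gcloud compute backend-services add-backend frontend-backend-service \
  --instance-group=frontend-mig --instance-group-zone=europe-west4-a --global
gcloud compute url-maps create ml-poc-map --default-service=frontend-backend-service
gcloud compute target-https-proxies create ml-poc-proxy --url-map=ml-poc-map --ssl-certificates=ml-poc-cert
gcloud compute forwarding-rules create ml-poc-https-rule \
  --address=ml-poc-ip --global --target-https-proxy=ml-poc-proxy --ports=443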
Internal Load Balancer
The internal load balancer handles traffic coming from the frontend application. HTTP is used as the endpoint protocol. The frontend part should have a static internal IP address, and the backend part should point to the backend instance group. Note the frontend IP address and port, as this is the address the frontend application will use to communicate with the backend instance group.
For the frontend part, a proxy-only subnet should be created.
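A proxy-only subnet can be created with gcloud roughly as follows; the region, network, and IP range are examples:

gcloud compute networks subnets create proxy-only-subnet \
  --purpose=REGIONAL_MANAGED_PROXY --role=ACTIVE \
  --region=europe-west4 --network=default --range=10.129.0.0/23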
It is important to increase the default load balancer timeout, especially for long-running ML models: the default 30 seconds is often not enough for model inference.
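The timeout is a property of the load balancer's backend service; as a sketch, with the name and region as placeholders and the value in seconds:

gcloud compute backend-services update <backend-service-name> --region=europe-west4 --timeout=600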
Firewall rules
In order to make the whole setup work, add the following firewall rules to your Virtual Private Cloud (VPC):
Rule 1 (Google health checks and load balancer traffic):
- Direction: ingress
- Source IP ranges: 130.211.0.0/22, 35.191.0.0/16
- Ports: <frontend web-app port>, <backend port>

Rule 2 (traffic from the internal load balancer):
- Direction: ingress
- Source IP ranges: <internal load balancer subnet IP address>
- Ports: <internal load balancer frontend port>
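A hedged gcloud sketch of these two rules; the rule names are examples, the network is assumed to be default, and the port placeholders need to be replaced:

# Allow Google health checks and load balancer traffic to reach the VMs
gcloud compute firewall-rules create allow-lb-health-checks \
  --network=default --direction=INGRESS --action=ALLOW \
  --source-ranges=130.211.0.0/22,35.191.0.0/16 \
  --rules=tcp:<frontend-port>,tcp:<backend-port>
# Allow traffic coming from the internal load balancer's subnet
gcloud compute firewall-rules create allow-internal-lb \
  --network=default --direction=INGRESS --action=ALLOW \
  --source-ranges=<internal-lb-subnet-range> \
  --rules=tcp:<internal-lb-frontend-port>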
Identity Aware Proxy
IAP protects access to applications hosted on Google Cloud. It lets you whitelist the email addresses that should have access to your web application. In order to configure IAP, we need to create the External Load Balancer backend service and configure the VPC firewall rules. Google provides an official step-by-step tutorial for configuring it.
Scaling to and from zero
There is a great article about implementing VM scaling programmatically, so I will not go into the details here.
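One simple way to implement zero-scaling from the frontend is to resize the backend MIG between zero and one instances; the gcloud equivalent of the scale_up and scale_down helpers used below looks roughly like this (group name and zone are examples, and the referenced article may use a different mechanism):

# Scale the GPU group up to one instance ...
gcloud compute instance-groups managed resize backend-mig --size=1 --zone=europe-west4-a
# ... and back down to zero
gcloud compute instance-groups managed resize backend-mig --size=0 --zone=europe-west4-a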
Upscaling should be requested when any traffic appears on the CPU instance. It can be implemented as part of the middleware, triggered whenever the request is not a health check.
@app.middleware("http")
async def spawn_gpu(request: Request, call_next):
response = await call_next(request)
if "/health" not in request.url.path:
# request GPU
scale_up()
global LAST_ACTIVITY_TIME
LAST_ACTIVITY_TIME = time.time()
Scaling down can be scheduled based on inactivity: if there has been no traffic on the server for more than GPU_TURN_OFF_TIME, the GPU instance can be scaled down.
import logging
import time
from threading import Thread, main_thread

log = logging.getLogger(__name__)
GPU_TURN_OFF_TIME = 600  # example value: seconds of inactivity before scaling down

class GpuActivityMonitor(Thread):
    def __init__(self):
        Thread.__init__(self)
        self.daemon = True
        self.start()

    def run(self):
        # periodically check the last activity time and scale down when idle
        while main_thread().is_alive():
            if time.time() - LAST_ACTIVITY_TIME > GPU_TURN_OFF_TIME:
                log.info('scaling down GPU due to inactivity')
                try:
                    scale_down()
                except Exception as e:
                    log.error(f'Error scaling down GPU: {e}')
            time.sleep(10)
We should implement both scaling up and scaling down as thread operations and check whether another scaling thread is already running before starting a new one. For a Docker image of about 0.5 GB, the GPU instance takes about 3.5 minutes to scale up from zero.
When running on a Google VM we can use the default credentials, as the VM is already authenticated. However, if the code runs outside of Google Cloud, we should authenticate first with a service account key or Workload Identity Federation.
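Outside Google Cloud, the simplest option is usually to point Application Default Credentials at a service account key, for example:

# Application Default Credentials will pick up this key file
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json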
Summary
In this article, we discussed how to deploy and host a budget version of your ML PoC. For large production systems, I recommend ML orchestration tools like KubeFlow, Airflow + MLFlow, ZenML, or cloud provider solutions like Vertex AI, SageMaker, and Azure AI. If you have any questions, please reach out to me; I will be happy to help.