How to add CPU, GPU, and system metrics to the BentoML service metrics endpoint
Monitoring deployed services is key to responding quickly to emerging performance issues or errors. For services using machine learning models under the hood, it is crucial to monitor resources that are critical to their operation.
BentoML, widely used to serve machine learning models, provides a REST API by default with basic, generic functionality such as health checks and service metrics. The metrics are made available in OpenTelemetry format and can be integrated with Prometheus and Grafana to aggregate and monitor them and to send alerts.
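For example, once a service is running, these endpoints can be queried over plain HTTP. Below is a minimal sketch of my own (not from the BentoML documentation), assuming the service listens on localhost:3000, BentoML's default HTTP port:

import requests

BASE_URL = "http://localhost:3000"  # assumed default BentoML HTTP port

# Liveness and readiness probes exposed by the API server.
print(requests.get(f"{BASE_URL}/healthz").status_code)
print(requests.get(f"{BASE_URL}/readyz").status_code)

# Service metrics exposed as plain text, ready to be scraped by Prometheus.
print(requests.get(f"{BASE_URL}/metrics").text[:500])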
In this article, I will first present which metrics BentoML monitors out of the box. Then I will share my view on which metrics are missing and how to add them. Finally, I will show how to verify the proposed modification and analyze its impact on service performance. Let’s get started!
Why BentoML?
As mentioned in the introduction, BentoML is a framework for creating and deploying production-ready ML services. Unlike MLflow, which uses Flask as an HTTP server (WSGI), BentoML uses FastAPI (ASGI), which is better optimized for serving models in a production environment; in other words, BentoML's advantage is its support for asynchronous communication. It supports many popular frameworks, e.g. scikit-learn, Transformers, ONNX, PyTorch, and TensorFlow, so it is possible to use one framework for different types of ML models. What's more, the tool provides a lot of useful functionality out of the box, such as service API documentation in OpenAPI format and basic metrics on the number and processing time of requests in OpenTelemetry format. It enables easy and fast deployment optimization by using various Runner classes (such as the Triton Inference Server runner) and adaptive batching of requests, provides the ability to create sequences of distributed models, and lets you control GPU and CPU resources as well as the number of workers. BentoML also comes with a built-in Model Store to version models, the bento artifact packaging format, and a Docker container builder tool that allows for seamless containerization and deployment.
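For illustration, here is a minimal sketch of saving a model to the Model Store; the Transformers sentiment-analysis pipeline used here is an assumption for this example, saved under the model name that the service shown later in this article expects:

import bentoml
from transformers import pipeline

# Save a Hugging Face pipeline to the local BentoML Model Store;
# every call creates a new, tagged version of the model.
classifier = pipeline("sentiment-analysis")
bentoml.transformers.save_model("classification-pipeline", classifier)

# List the stored models and their version tags.
for stored_model in bentoml.models.list():
    print(stored_model.tag)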
In summary, BentoML is a generic ML service development framework that provides a lot of necessary functionalities and allows for rapid prototyping. On the other hand, it is deeply parameterizable, so you can optimize it to your needs. And most importantly, it supports the most popular frameworks, so you can use it in almost any ML project.
Default metrics
Below is a list of metrics monitored by a BentoML service by default. It's easy to see that these are generic, request-oriented metrics for the entire API service and for individual runners.
bentoml_runner_request_duration_seconds
- Time in seconds needed to complete the RPC for each runner
bentoml_runner_request_in_progress
- Total number of runner RPC in progress right now
bentoml_runner_adaptive_batch_size
- Runner adaptive batch size
bentoml_runner_request_total
- Total number of runner RPC
bentoml_api_server_request_total
- Total number of server HTTP requests
bentoml_api_server_request_in_progress
- Total number of server HTTP requests in progress right now
bentoml_api_server_request_duration_seconds
- Time in seconds needed to complete a server HTTP request
Which metrics are missing?
The metrics that I think are missing are system metrics that show resource consumption during request handling. The crucial ones are RAM and CPU utilization and, for models using hardware acceleration, also VRAM and GPU consumption. These are basic metrics worth monitoring for any AI model, as they allow you to determine how well the service is prepared to handle a high volume of requests, and to detect potential memory leaks and edge cases where memory runs out.
How to add them
By default, the metrics exposed by BentoML do not include system metrics. However, we can extend the Runner class of the BentoML framework and add logging of any custom metrics; in our case, these will be system metrics describing CPU, GPU, RAM, and VRAM usage. The code below aggregates the above-mentioned metrics using the psutil and pynvml libraries.
import psutil
import pynvml

import bentoml

# Initialize NVML to query GPU devices.
pynvml.nvmlInit()
driver_version = pynvml.nvmlSystemGetDriverVersion()
gpu_number = pynvml.nvmlDeviceGetCount()

# System (RAM, swap, disk, process) metrics and their descriptions.
metrics_system_keys_dict = {
    "node_ram_usage_percentage": "RAM usage in %",
    "node_ram_usage_bytes": "RAM usage in bytes",
    "node_ram_total_bytes": "RAM total in bytes",
    "node_ram_free_bytes": "RAM free in bytes",
    "node_swap_usage_percentage": "swap memory usage in %",
    "node_swap_usage_bytes": "swap memory usage in bytes",
    "node_swap_total_bytes": "swap total memory in bytes",
    "node_swap_free_bytes": "swap memory free in bytes",
    "node_disk_usage_percentage": "disk usage in %",
    "node_disk_usage_bytes": "disk usage in bytes",
    "node_disk_total_bytes": "disk total in bytes",
    "node_disk_free_bytes": "disk free in bytes",
    "node_process_number": "Number of processes running in the system",
}
metric_dict = {
    key: bentoml.metrics.Gauge(name=key, documentation=value)
    for key, value in metrics_system_keys_dict.items()
}

# GPU metrics, registered once per visible device.
metric_gpu_keys_dict = {
    "total_memory": "VRAM total in bytes",
    "free_memory": "VRAM free in bytes",
    "used_memory": "VRAM used in bytes",
    "power_usage": "GPU device power usage in milliwatts",
    "fan_speed": "GPU fan speed in % of maximum",
    "utilization_percent": "GPU utilization in %",
}
for gpu_device in range(0, gpu_number):
    for key, value in metric_gpu_keys_dict.items():
        name = f"node_gpu_{gpu_device}_{key}"
        metric_dict[name] = bentoml.metrics.Gauge(name=name, documentation=value)


def update_gpu_metrics():
    """Update GPU metrics for every visible device."""
    for gpu_device in range(0, gpu_number):
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_device)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        metric_dict[f"node_gpu_{gpu_device}_total_memory"].set(info.total)
        metric_dict[f"node_gpu_{gpu_device}_free_memory"].set(info.free)
        metric_dict[f"node_gpu_{gpu_device}_used_memory"].set(info.used)
        metric_dict[f"node_gpu_{gpu_device}_power_usage"].set(
            pynvml.nvmlDeviceGetPowerUsage(handle)
        )
        metric_dict[f"node_gpu_{gpu_device}_fan_speed"].set(
            pynvml.nvmlDeviceGetFanSpeed(handle)
        )
        metric_dict[f"node_gpu_{gpu_device}_utilization_percent"].set(
            pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        )


# Register one usage gauge per logical CPU.
cpu_list = psutil.cpu_percent(percpu=True)
cpu_no = len(cpu_list)
for index in range(0, cpu_no):
    key = f"node_cpu_{index}_usage_percentage"
    metric_dict[key] = bentoml.metrics.Gauge(
        name=key,
        documentation=f"CPU{index} usage in %",
    )


def update_metrics():
    """Update system metrics."""
    for index, cpu_usage in enumerate(psutil.cpu_percent(percpu=True)):
        key = f"node_cpu_{index}_usage_percentage"
        metric_dict[key].set(cpu_usage)
    ram = psutil.virtual_memory()
    metric_dict["node_ram_usage_percentage"].set(ram.percent)
    metric_dict["node_ram_usage_bytes"].set(ram.active)
    metric_dict["node_ram_total_bytes"].set(ram.total)
    metric_dict["node_ram_free_bytes"].set(ram.free)
    swap = psutil.swap_memory()
    metric_dict["node_swap_usage_percentage"].set(swap.percent)
    metric_dict["node_swap_usage_bytes"].set(swap.used)
    metric_dict["node_swap_total_bytes"].set(swap.total)
    metric_dict["node_swap_free_bytes"].set(swap.free)
    disk = psutil.disk_usage("/")
    metric_dict["node_disk_usage_percentage"].set(disk.percent)
    metric_dict["node_disk_usage_bytes"].set(disk.used)
    metric_dict["node_disk_total_bytes"].set(disk.total)
    metric_dict["node_disk_free_bytes"].set(disk.free)
    metric_dict["node_process_number"].set(len(psutil.pids()))
    update_gpu_metrics()
Note that the pynvml library uses the NVIDIA NVML library under the hood. Make sure that your execution environment (e.g. a Docker container) has this library installed. For more information, take a look at the project repository: https://github.com/gpuopenanalytics/pynvml.
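If the service may also run on machines without a GPU, the NVML initialization can be guarded. A small sketch of such a fallback (my own addition, not part of the original code):

import pynvml

# Skip GPU metrics when the NVML shared library or a GPU is not available
# in the execution environment (e.g. a CPU-only container).
try:
    pynvml.nvmlInit()
    gpu_number = pynvml.nvmlDeviceGetCount()
except pynvml.NVMLError:
    gpu_number = 0  # no GPU gauges are registered or updated in this case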
We want to call the update_metrics function in the custom Runnable class right after performing inference on the AI model in the __call__ method.
"""File containing Transformer model handler."""
from typing import Any, Dict, List
import bentoml
from bentoml.io import JSON
from examples.services.metrics import update_metrics
MODEL_NAME = "classification-pipeline"
SERVICE_NAME = "classification-pipeline-service"
model = bentoml.transformers.get(f"{MODEL_NAME}:latest")
_BuiltinRunnable = model.to_runnable()
class CustomRunnable(_BuiltinRunnable):
@bentoml.Runnable.method(batchable=True, batch_dim=0)
def __call__(self, input_data: List[str]) -> Dict[str, Any]:
output = self.model(input_data)
update_metrics()
return output
runner = bentoml.Runner(CustomRunnable)
svc = bentoml.Service(
SERVICE_NAME,
runners=[runner],
models=[model],
)
@svc.api(input=JSON(), output=JSON()) # type: ignore
async def classify(input_series: Dict[str, Any]) -> Dict[str, Any]:
"""Perform inference on the deployed model.
Args:
input_series: Dictionary with key "data" containing text.
Returns:
JSON response with key "prediction" containing list of predicted
classes.
"""
response = await runner.async_run(input_series["data"]) # type: ignore
return {"predictions": response}
At this stage, if we deploy the service, additional metrics describing system resource consumption will be available on BentoML's default metrics endpoint (/metrics).
Results
To verify that the metrics we defined are aggregated and exposed in OpenTelemetry format on the default /metrics endpoint, we can open the address where the service with the ML model is running in a browser. On the Swagger API documentation page, we can perform a test inference by sending a request with the appropriate payload to the /classify endpoint, followed by a call to the /metrics endpoint. The result will include the system resource consumption metrics in addition to the default ones.
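The same check can also be scripted. A short sketch, assuming the service listens on localhost:3000:

import requests

metrics_text = requests.get("http://localhost:3000/metrics").text

# Print only the custom system metrics defined earlier.
for line in metrics_text.splitlines():
    if line.startswith("node_"):
        print(line)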
Performance
Adding this additional logic to the Runner code increases the model's response time. In benchmark tests, the metric-gathering overhead adds around 5 ms of latency on average. This overhead is constant, since it does not depend on the ML model used, and it can most likely be optimized further.
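As a rough way to estimate this overhead locally, the update_metrics function can be timed on its own. A sketch, assuming the metrics module shown earlier is importable as examples.services.metrics:

import timeit

from examples.services.metrics import update_metrics

# Average wall-clock time of a single metrics update, in milliseconds.
runs = 100
total_seconds = timeit.timeit(update_metrics, number=runs)
print(f"average update_metrics() overhead: {total_seconds / runs * 1000:.2f} ms")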
Summary
This article presents the default metrics provided by the BentoML service and shows how to extend them as desired. By adding system metrics, it is possible to monitor key resources that are heavily used by machine learning models; monitoring them makes detecting and fixing memory leaks and edge cases possible.
In the next article, I will show how to use a load testing tool and an integration with Grafana to monitor system resource usage while handling requests, using the custom metrics presented here. Stay tuned!
Reviewed by: Rafał Pytel