How to add CPU, GPU, and system metrics to the BentoML service metrics endpoint

Adam Wawrzyński

14 Jun 2024 · 6 minutes read


Monitoring deployed services is key to responding quickly to emerging performance issues or errors. For services using machine learning models under the hood, it is crucial to monitor resources that are critical to their operation.



BentoML, widely used to serve machine learning models, provides a REST API by default with basic, generic functionality such as health checks and service metrics. Metrics are exposed in the Prometheus text format and can be integrated with Prometheus and Grafana to aggregate, monitor, and send alerts.

In this article, I will present which metrics BentoML monitors out of the box. Then I will give my point of view on which metrics are missing and how to add them. Finally, I will show how to verify the proposed modification and analyze its impact on service performance. Let's get started!

Why BentoML?

As mentioned in the introduction, BentoML is a framework for creating and deploying production-ready ML services. Unlike MLflow, which uses Flask as an HTTP server (WSGI), BentoML uses FastAPI (ASGI), which is better optimized for serving models in a production environment; in other words, BentoML's advantage is its support for asynchronous communication. It supports many popular frameworks, e.g. scikit-learn, Transformers, ONNX, PyTorch, and TensorFlow, so it is possible to use one framework for different types of ML models.

What's more, the tool provides a lot of useful functionality out of the box, such as service API documentation in OpenAPI format and basic metrics on the number and processing time of requests. It enables easy and fast deployment optimization through various Runner classes, such as the Triton Inference Server runner, and adaptive batching of requests. It also provides the ability to create sequences of distributed models and to control GPU and CPU resources as well as the number of workers. Finally, BentoML comes with a built-in Model Store for versioning models, the bento artifact packaging format, and a Docker container builder that allows for seamless containerization and deployment.

In summary, BentoML is a generic ML service development framework that provides a lot of necessary functionalities and allows for rapid prototyping. On the other hand, it is deeply parameterizable, so you can optimize it to your needs. And most importantly, it supports the most popular frameworks, so you can use it in almost any ML project.

Default metrics

Below is a list of the metrics monitored by a BentoML service by default. It's easy to see that these are generic, request-oriented metrics for the entire API service and the individual runners.

  • bentoml_runner_request_duration_seconds - time in seconds needed to complete the RPC for each runner
  • bentoml_runner_request_in_progress - total number of runner RPCs in progress right now
  • bentoml_runner_adaptive_batch_size - runner adaptive batch size
  • bentoml_runner_request_total - total number of runner RPCs
  • bentoml_api_server_request_total - total number of server HTTP requests
  • bentoml_api_server_request_in_progress - total number of server HTTP requests in progress right now
  • bentoml_api_server_request_duration_seconds - time in seconds needed to complete a server HTTP request
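Since the /metrics endpoint serves these in the Prometheus text exposition format, Prometheus can scrape it directly. A minimal scrape configuration might look like the sketch below; the job name is hypothetical, and the target assumes BentoML's default HTTP port 3000:

```yaml
scrape_configs:
  - job_name: "bentoml-service"      # hypothetical job name
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:3000"]  # default BentoML HTTP port
```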

Which metrics are missing?

The metrics that I think are missing are system metrics that show resource consumption during request handling. The crucial ones are RAM and CPU utilization, and, for models using hardware acceleration, also VRAM and GPU consumption. These are the basic metrics worth monitoring for any AI model, as they allow you to determine how well-prepared a service is to handle a high volume of requests, and to detect potential memory leaks and edge cases where memory runs out.
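To make the idea concrete, here is a minimal sketch of gathering a few such metrics with the Python standard library alone; the key names mirror the naming scheme used later in this article, and the psutil/pynvml code below covers CPU, RAM, swap, and GPU in far more detail:

```python
import os
import shutil


def collect_basic_system_metrics() -> dict:
    """Gather a few node-level metrics using only the standard library."""
    # Total/used/free bytes for the filesystem holding "/".
    disk = shutil.disk_usage("/")
    return {
        "node_disk_total_bytes": disk.total,
        "node_disk_usage_bytes": disk.used,
        "node_disk_free_bytes": disk.free,
        "node_disk_usage_percentage": 100.0 * disk.used / disk.total,
        "node_cpu_count": os.cpu_count(),
    }
```

In the full version below, psutil replaces this sketch and adds per-CPU load, RAM, swap, and process counts.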

How to add them

By default, the metrics BentoML provides do not contain system metrics. However, we can extend the Runner class of the BentoML framework and add logging of any custom metrics. In our case, these will be system metrics describing CPU, GPU, RAM, and VRAM usage. Below you will find code that gathers the above-mentioned metrics using the psutil and pynvml libraries.

import psutil
import pynvml
import bentoml

# NVML must be initialized before any other pynvml call.
pynvml.nvmlInit()
driver_version = pynvml.nvmlSystemGetDriverVersion()
gpu_number = pynvml.nvmlDeviceGetCount()

metrics_system_keys_dict = {
    "node_ram_usage_percentage": "RAM usage in %",
    "node_ram_usage_bytes": "RAM usage in bytes",
    "node_ram_total_bytes": "RAM total in bytes",
    "node_ram_free_bytes": "RAM free in bytes",
    "node_swap_usage_percentage": "swap memory usage in %",
    "node_swap_usage_bytes": "swap memory usage in bytes",
    "node_swap_total_bytes": "swap total memory in bytes",
    "node_swap_free_bytes": "swap memory free in bytes",
    "node_disk_usage_percentage": "disk usage in %",
    "node_disk_usage_bytes": "disk usage in bytes",
    "node_disk_total_bytes": "disk total in bytes",
    "node_disk_free_bytes": "disk free in bytes",
    "node_process_number": "Number of processes running in the system",
}

metric_dict = {
    key: bentoml.metrics.Gauge(name=key, documentation=value)
    for key, value in metrics_system_keys_dict.items()
}

metric_gpu_keys_dict = {
    "total_memory": "VRAM total in bytes",
    "free_memory": "VRAM free in bytes",
    "used_memory": "VRAM used in bytes",
    "power_usage": "GPU device power usage in Watts",
    "fan_speed": "GPU fan speed in % of maximum",
    "utilization_percent": "GPU utilization in %",
}

# Register one gauge per GPU device and metric.
for gpu_device in range(gpu_number):
    for key, value in metric_gpu_keys_dict.items():
        name = f"node_gpu_{gpu_device}_{key}"
        metric_dict[name] = bentoml.metrics.Gauge(name=name, documentation=value)


def update_gpu_metrics():
    """Update GPU metrics for every visible device."""
    for gpu_device in range(gpu_number):
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_device)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        metric_dict[f"node_gpu_{gpu_device}_total_memory"].set(info.total)
        metric_dict[f"node_gpu_{gpu_device}_free_memory"].set(info.free)
        metric_dict[f"node_gpu_{gpu_device}_used_memory"].set(info.used)
        # nvmlDeviceGetPowerUsage returns milliwatts.
        metric_dict[f"node_gpu_{gpu_device}_power_usage"].set(
            pynvml.nvmlDeviceGetPowerUsage(handle) / 1000
        )
        metric_dict[f"node_gpu_{gpu_device}_fan_speed"].set(
            pynvml.nvmlDeviceGetFanSpeed(handle)
        )
        metric_dict[f"node_gpu_{gpu_device}_utilization_percent"].set(
            pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        )


# Register one gauge per logical CPU.
cpu_list = psutil.cpu_percent(percpu=True)
cpu_no = len(cpu_list)

for index in range(cpu_no):
    key = f"node_cpu_{index}_usage_percentage"
    metric_dict[key] = bentoml.metrics.Gauge(
        name=key,
        documentation=f"CPU{index} usage in %",
    )


def update_metrics():
    """Update system metrics."""
    for index, cpu_usage in enumerate(psutil.cpu_percent(percpu=True)):
        key = f"node_cpu_{index}_usage_percentage"
        metric_dict[key].set(cpu_usage)

    ram = psutil.virtual_memory()
    swap = psutil.swap_memory()
    disk = psutil.disk_usage("/")

    metric_dict["node_ram_usage_percentage"].set(ram.percent)
    metric_dict["node_ram_usage_bytes"].set(ram.used)
    metric_dict["node_ram_total_bytes"].set(ram.total)
    metric_dict["node_ram_free_bytes"].set(ram.free)
    metric_dict["node_swap_usage_percentage"].set(swap.percent)
    metric_dict["node_swap_usage_bytes"].set(swap.used)
    metric_dict["node_swap_total_bytes"].set(swap.total)
    metric_dict["node_swap_free_bytes"].set(swap.free)
    metric_dict["node_disk_usage_percentage"].set(disk.percent)
    metric_dict["node_disk_usage_bytes"].set(disk.used)
    metric_dict["node_disk_total_bytes"].set(disk.total)
    metric_dict["node_disk_free_bytes"].set(disk.free)
    metric_dict["node_process_number"].set(len(psutil.pids()))

    update_gpu_metrics()

Note that the pynvml library uses the NVIDIA NVML library under the hood. Make sure that your execution environment (e.g. Docker container) has this library installed. For more information, take a look at the pynvml project repository.
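Since NVML is only present on machines with an NVIDIA driver, it can be useful to guard GPU metric registration so the service still starts on CPU-only hosts. A small sketch of such a guard (the function name is mine, not part of BentoML or pynvml):

```python
import importlib.util


def gpu_metrics_available() -> bool:
    """Return True only when pynvml is installed and NVML initializes."""
    if importlib.util.find_spec("pynvml") is None:
        return False  # pynvml is not installed at all
    import pynvml
    try:
        pynvml.nvmlInit()  # fails if the NVIDIA driver/NVML library is missing
        return True
    except pynvml.NVMLError:
        return False
```

The GPU gauges would then only be registered when gpu_metrics_available() returns True.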

We want to call the update_metrics function in the Runner class right after performing inference on the AI model in the __call__ method.

"""File containing Transformer model handler."""

from typing import Any, Dict, List

import bentoml
from bentoml.io import JSON

# The module path below is an assumption -- import update_metrics from
# wherever you placed the metrics code shown earlier.
from metrics import update_metrics

MODEL_NAME = "classification-pipeline"
SERVICE_NAME = "classification-pipeline-service"

model = bentoml.transformers.get(f"{MODEL_NAME}:latest")
_BuiltinRunnable = model.to_runnable()


class CustomRunnable(_BuiltinRunnable):
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def __call__(self, input_data: List[str]) -> Dict[str, Any]:
        output = self.model(input_data)
        # Refresh the system metrics right after every inference call.
        update_metrics()
        return output


runner = bentoml.Runner(CustomRunnable, models=[model])
svc = bentoml.Service(SERVICE_NAME, runners=[runner])


@svc.api(input=JSON(), output=JSON())  # type: ignore
async def classify(input_series: Dict[str, Any]) -> Dict[str, Any]:
    """Perform inference on the deployed model.

    Args:
        input_series: Dictionary with key "data" containing text.

    Returns:
        JSON response with key "predictions" containing the list of predictions.
    """
    response = await runner.async_run(input_series["data"])  # type: ignore
    return {"predictions": response}

At this stage, if we deploy the model, additional metrics about the consumption of system resources will be available on the service's default metrics endpoint (/metrics).


To verify that the metrics we define are aggregated and made available on the default /metrics endpoint, we can visit the address where the service with the ML model is running in a browser. On the Swagger API documentation page, we can perform a test inference request by sending a query with the appropriate payload to the /classify endpoint, followed by a call to the /metrics endpoint. The result will be our system resource metrics in addition to the default ones.
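Instead of checking the page by hand, the same verification can be scripted. A small sketch with the standard library; the helper names are my own, and the default URL assumes BentoML's usual port 3000:

```python
import urllib.request

EXPECTED_METRICS = ("node_ram_usage_percentage", "node_disk_usage_percentage")


def has_custom_metrics(metrics_text: str, expected=EXPECTED_METRICS) -> bool:
    """Return True if every expected metric name occurs in a /metrics payload."""
    return all(name in metrics_text for name in expected)


def check_service_metrics(base_url: str = "http://localhost:3000") -> bool:
    """Fetch /metrics from a running service and verify the custom metrics."""
    with urllib.request.urlopen(f"{base_url}/metrics") as response:
        return has_custom_metrics(response.read().decode())
```

Running check_service_metrics() against the deployed service should return True once the custom gauges are registered.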



Adding additional logic to the Runner code increases the model's response time. In benchmark tests, metric gathering added around 5 ms of latency on average. This overhead is constant, since it does not depend on the ML model used, and it can most likely be optimized.


This article presented the default metrics provided by a BentoML service and showed how to extend them as desired. With system metrics, it is possible to monitor the key resources heavily used by machine learning models, which makes it possible to detect and fix memory leaks and edge cases.

In the next article, I will present how to use the load testing tool and integration with Grafana to monitor the use of system resources while handling requests, where I will use the custom metrics presented in this article. Stay tuned!

Reviewed by: Rafał Pytel
