
Locust - performance testing of an ML model

Adam Wawrzyński

31 Jul 2024 · 8 minutes read


Learn how to run load tests on a deployed AI model to find out how it will behave with different volumes of user traffic.


Machine learning models can solve many complex business problems, which is why they are often used in commercial products today. Beyond the model's own performance, measured with various metrics, the UX of an AI-driven product is also important. In some cases, it is more important from a business point of view to return an answer to the user quickly, even at the cost of a higher error rate.

With the model's runtime constraints defined, the ML engineering team can run tests that examine the response time of the ML system. In this article, I will demonstrate how to perform load and performance testing of an artificial intelligence model. We will simulate request traffic from users of the system using the Locust tool and check how it affects the resource consumption and response time of the deployed machine learning system. I invite you to read on and try out the presented techniques!

Performance testing

First, a short explanation of the terms used in this article. What is performance testing?

Application performance testing is a technique that lets you analyze how load affects a service's response time, how well it scales, and whether it is fault tolerant. Gradually increasing the simulated traffic allows us to determine the point at which the service starts to become overloaded and the number of unhandled requests begins to grow. This is critical for the user experience and the stability of the system. With machine learning models running behind an exposed API, we can test how request traffic affects response time and resource consumption. For the load testing process, we can write our own script or use a ready-made tool such as Locust.

Experiment setup

As an example, we will use a text classification ML system based on a BERT model, served with BentoML in a Docker container. The service will be accessible on a specific port via a REST API. Another container will run the load testing tool, the aforementioned Locust, which will simulate user traffic by sending requests to the exposed endpoint of the machine learning model.

Further containers will run the services used to aggregate and visualize the metrics exposed by BentoML, namely Prometheus and Grafana.

The architecture of the components and their interactions is presented below. Locust sends requests to the BentoML service serving the machine learning model. Handling these requests updates the service's metrics, which are read in real time by Prometheus and made available to Grafana.

In Grafana, we can build visualizations of these metrics and analyze the behavior of the service under load. In a production scenario, we could also define alerts that fire when the service starts to show signs of overload.


Component architecture from the scenario in question.

How?

Assume that a machine learning model service with an exposed REST API is running locally. Here is an example command for serving a model from the BentoML registry on port 3000:

bentoml serve <MODEL_NAME:TAG> -p 3000
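Before simulating traffic, it can be useful to sanity-check the endpoint with a single request. Below is a minimal sketch using the requests library; the /classify path and payload match the locustfile shown later, so adjust them to your own service.

import requests

# Single smoke-test request against the locally served model
# (the endpoint name and payload must match your BentoML service).
response = requests.post(
    "http://localhost:3000/classify",
    json={"data": ["This is an example sentence to classify."]},
    timeout=10,
)
response.raise_for_status()
print(response.json())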

The docker-compose.yaml file below contains a single locust service with port 8089 exposed, which connects to a locally running service on port 3000. The command that the service executes uses a Python script that defines how to test the application. The command defines how many users to simulate, how the number of users should change over time, how long the simulation will last, and where the report and file with the simulation results will be saved.

version: '3'

services:
 locust:
   image: locustio/locust
   ports:
     - "8089:8089"
   volumes:
     - ./:/mnt/locust
   command: >
     -f /mnt/locust/locustfile.py
     --host=http://host.docker.internal:3000
     --headless --users 1000 --spawn-rate 5 --run-time 5m
     --html /mnt/locust/results.html
     --csv /mnt/locust/results
   extra_hosts:
     - "host.docker.internal:host-gateway"

Below is an explanation of the options used for the locust command (according to the documentation):

  • --host - the address at which the ML service is running
  • --headless - run without the web UI
  • --users - the peak number of concurrent simulated users
  • --spawn-rate - the number of new users started per second until the peak is reached
  • --run-time - the total simulation time
  • --html - the path under which the report will be saved in HTML format
  • --csv - the path prefix under which the results will be saved in CSV format

The extra_hosts section and a special address in the locust command are necessary for the containerized tool to work properly with a locally running service on the host machine (as explained in this thread). If a Docker container with the ML model is available, we can include all the services necessary to run the tests in the docker-compose file.

The locustfile.py file contains an example of one endpoint that will be tested and an example of a machine learning model payload.

from locust import HttpUser, between, task

EXAMPLE_DATA = {
    "data": ["This is an example sentence to classify."]
}


class QuickstartUser(HttpUser):
    """Load test user class."""

    # Each simulated user waits 1-5 seconds between consecutive requests.
    wait_time = between(1, 5)

    @task
    def classify(self):
        """Trigger the `classify` endpoint."""
        self.client.post("/classify", json=EXAMPLE_DATA)

Assuming that both files, docker-compose.yaml and locustfile.py, are in the same directory, just execute docker compose up to start the container. Locust will run the simulation, and when it finishes, the results in .csv format and a report in HTML format will be saved in the current directory.

Reports

Running the load test produces an HTML report with an interactive visualization of the service load, along with the raw results in .csv format.


Graphs showing the number of users, response time, and number of queries per second during traffic simulation.

Summary statistics with response times for each endpoint defined in locustfile.py are reported at the 50th, 60th, 70th, 80th, 90th, 95th, 99th, and 100th percentiles of the response time distribution. The most commonly used and published values are the 95th and 99th percentiles, which can be interpreted as the longest expected model response time in 95 or 99 percent of cases.

Summary statistics from the Locust report.
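The .csv artifacts can also be processed programmatically, for example to track latency percentiles across runs. A rough sketch using pandas is shown below; with --csv results, Locust writes an aggregated results_stats.csv file, though the exact column names (such as "95%") may differ between Locust versions.

import pandas as pd

# Aggregated per-endpoint statistics written by `--csv results`.
stats = pd.read_csv("results_stats.csv")

# Skip the "Aggregated" summary row and keep the per-endpoint rows.
endpoints = stats[stats["Name"] != "Aggregated"]

for _, row in endpoints.iterrows():
    print(
        f"{row['Name']}: p95={row['95%']} ms, p99={row['99%']} ms, "
        f"failures={row['Failure Count']}"
    )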

System metrics monitoring

Locust only lets us observe, analyze, and report response times and request-handling statistics for the simulated traffic. Real-world use cases involve machine learning models that consume many more resources, including GPUs, which we also want to monitor and profile.

For this purpose, we can use Prometheus and Grafana to aggregate and visualize the metrics exposed by our deployed service with an ML model. If you want to learn how to add custom system metrics to BentoML for ML system observability, check out the article I wrote on this topic.

At this stage, if the model is deployed with the default metrics endpoint (/metrics) enabled, additional metrics about system resource consumption will be available. For visualization, we need to run Grafana, Prometheus, and the other necessary services. To do this, I will use the code provided in this repository.

Below is the modified docker-compose.yaml file, which, in addition to Grafana and Prometheus, starts the container with Locust and automatically starts load testing.

version: '3.8'

volumes:
 prometheus_data: {}
 grafana_data: {}

services:
 locust:
   image: locustio/locust
   ports:
     - "8089:8089"
   volumes:
     - ./:/mnt/locust
   command: >
     -f /mnt/locust/locustfile.py
     --host=http://host.docker.internal:8000
     --headless --users 100 --spawn-rate 5 --run-time 5m
     --html /mnt/locust/results.html
     --csv /mnt/locust/results
   extra_hosts:
     - "host.docker.internal:host-gateway"
 prometheus:
   image: prom/prometheus
   restart: always
   volumes:
     - ./prometheus:/etc/prometheus/
     - prometheus_data:/prometheus
   command:
     - '--config.file=/etc/prometheus/prometheus.yml'
     - '--storage.tsdb.path=/prometheus'
     - '--web.console.libraries=/usr/share/prometheus/console_libraries'
     - '--web.console.templates=/usr/share/prometheus/consoles'
   ports:
     - 9090:9090
   extra_hosts:
     - "host.docker.internal:host-gateway"

 grafana:
   image: grafana/grafana
   user: '472'
   restart: always
   environment:
     GF_INSTALL_PLUGINS: 'grafana-clock-panel,grafana-simple-json-datasource'
   volumes:
     - grafana_data:/var/lib/grafana
     - ./grafana/provisioning/:/etc/grafana/provisioning/
   env_file:
     - ./grafana/config.monitoring
   ports:
     - 3000:3000
   depends_on:
     - prometheus

Suppose we run the model locally on port 8000 using BentoML with the command:

bentoml serve handler.svc -p 8000
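The handler.svc argument points to a svc object defined in a handler.py module, which is not shown in this article. A rough sketch of what such a BentoML service might look like, assuming a Hugging Face transformers pipeline loaded directly in the module (your actual handler will differ), could be:

import bentoml
from bentoml.io import JSON
from transformers import pipeline

# Placeholder text classification model; replace with your own BERT model.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

svc = bentoml.Service("bert_text_classifier")


@svc.api(input=JSON(), output=JSON())
def classify(payload: dict) -> dict:
    """Endpoint exposed as POST /classify, matching the locustfile."""
    predictions = classifier(payload["data"])
    return {"predictions": predictions}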

If we run the docker compose up command in the directory with the docker-compose.yaml file, all containers are started and load testing begins. After opening Grafana in the browser, we will have system metrics available, provided by Prometheus. By creating an appropriate dashboard, we can monitor resource consumption during load testing and simulated network traffic.
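Before building dashboards, you can also verify that Prometheus is actually scraping the BentoML /metrics endpoint by querying its HTTP API. A minimal sketch is shown below; it uses the built-in up metric, and you can substitute a BentoML-specific metric name once you see it in your /metrics output.

import requests

# Prometheus HTTP API, exposed on port 9090 by the docker-compose setup above.
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "up"},
    timeout=10,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    target = result["metric"].get("instance", "unknown")
    value = result["value"][1]  # "1" means the target is being scraped successfully
    print(f"{target}: up={value}")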

It’s that simple! The proposed set of application performance monitoring containers is generic and can be reused. If you change the model and update the BentoML handler, the only change to the measurement setup is updating the endpoint name and sample data in locustfile.py.

Results

The prepared docker-compose.yaml is a generic measurement setup for any model served over an HTTP server (it doesn't have to be BentoML) that simulates the network traffic of users triggering the ML service. Adding aggregation of system metrics, using the method described in another blog post, further allows you to monitor the consumption of key resources for ML models, such as RAM, VRAM, CPU, and GPU. By modifying the parameters of the simulated traffic, load tests, performance tests, and stress tests of ML services can be performed.
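As one possible illustration of such resource metrics (not the exact method from the referenced article), GPU memory usage can be exposed as a Prometheus gauge with prometheus_client and NVIDIA's pynvml bindings. The sketch below assumes a single GPU and an arbitrarily chosen port 9400 for the exporter.

import time

import pynvml
from prometheus_client import Gauge, start_http_server

# Gauge scraped by Prometheus from this process's own /metrics endpoint.
GPU_MEMORY_USED = Gauge("gpu_memory_used_bytes", "GPU memory currently in use")


def main() -> None:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU only
    start_http_server(9400)  # exposes /metrics on port 9400
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        GPU_MEMORY_USED.set(info.used)
        time.sleep(5)


if __name__ == "__main__":
    main()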

The screenshot presented below shows an example of how the developed setup works. We can see that during the simulation the GPU memory consumption stays at a constant level, so there is no risk of an out-of-memory error as the traffic scales up. In this way, we can verify that our service is designed so that increasing traffic will not destabilize it.


Visualization of system metrics in Grafana.

Summary

In this article, you learned how to use the Locust tool to simulate the load under which a service running machine learning model inference will operate. I presented simple code that tests the exposed endpoints, the types of artifacts you can expect, and how to use them. I also showed how to visualize these metrics in Grafana, which can be useful when designing ML service monitoring for production scenarios or during performance testing. I hope that after reading this article you can design and analyze services and deployments with machine learning models more effectively.

Reviewed by Rafał Pytel
