
Data serialization tools comparison: Avro vs Protobuf

Szymon Kałuża

30 Jun 2023 · 7 minute read


Data serialization is an essential part of modern software development, allowing applications to exchange data in a compact, efficient, and platform-independent format. Two popular systems for data serialization are Google's Protocol Buffers (Protobuf) and Apache's Avro. While both Protobuf and Avro have their strengths and weaknesses, one key factor that developers often consider when choosing between them is performance. In this post, we will compare the performance of Protobuf and Avro across a range of metrics, including encoding time, decoding time, serialized size, and throughput.

By analyzing the results of these tests, I hope to provide developers with a better understanding of the performance characteristics of each system and how to choose between them based on their specific needs.

Overview of Protobuf and Avro

Protobuf was developed internally at Google and open-sourced in 2008. It is designed to be simple, fast, and efficient, with support for a wide range of programming languages, including C++, Java, Python, and Ruby. Protobuf uses a schema definition language to define the structure of data, which is then compiled into code that can be used to serialize and deserialize the data.

Avro was created within the Apache Hadoop project and released in 2009; it is also open-source software. Avro is designed to support schema evolution, which allows data to be serialized and deserialized even when the schema has changed. Avro supports a similar range of programming languages as Protobuf, with strong support for Java, Python, and Ruby. Avro uses a JSON-based schema definition language to define the structure of data, which can be stored alongside the data itself.

When comparing Protobuf and Avro, it's important to understand how they are used in different industries and applications. Here are some examples of use cases for each format.


Protobuf:

  • Google's internal systems, including Google Search, Google Maps, and Gmail, for which Protobuf was originally developed.
  • gRPC, for communication between microservices.
  • Financial institutions, for high-frequency trading, risk management, and fraud detection.
  • Game developers, to optimize network performance and reduce bandwidth usage.
  • IoT applications, to send data between devices such as sensors, gateways, and cloud servers.
  • Automotive applications, to exchange data between cars and infrastructure such as traffic lights and parking meters.

Avro:

  • Apache Hadoop, an open-source big data platform, for data serialization and exchange between Hadoop components.
  • Messaging systems, such as Apache Kafka and Apache Pulsar, for streaming data between applications and services.
  • Analytics platforms, such as Apache Spark and Apache Flink, for data processing and analysis.
  • Machine learning frameworks, such as TensorFlow and PyTorch, for data serialization and exchange between training and inference stages.
  • Log collection tools, such as Apache Flume and Apache NiFi, for streaming log data from various sources to a centralized location.

In general, Protobuf is more popular in low-latency, high-performance scenarios, while Avro is more popular in big data, distributed systems, and analytics scenarios. However, both formats can be used in a variety of applications, and the choice depends on the specific requirements and constraints of each use case.

If you would like to know which serialization format is best suited for event sourcing, check this article.

Prerequisites

To compare the performance of Protobuf and Avro, we will use JMH (Java Microbenchmark Harness), a widely used tool for benchmarking Java code. JMH provides a robust and reliable way to measure the performance of small code snippets in a controlled environment, allowing us to accurately compare Protobuf and Avro across a range of metrics such as encoding time, decoding time, serialized size, and throughput.

To prepare the performance test data, I created a sample object to serialize with both Protobuf and Avro. The object is a simplified representation of a library with address and books fields. Each book has a title, an author, a page count, and an availability flag, and the author consists of a name, surname, and nationality. One key difference between the Protobuf and Avro schemas is how they handle collection fields. In Protobuf, collection fields are marked with the repeated keyword, as shown in the following code snippet.

There are two active versions of the Protobuf language currently available: proto2 and proto3. I've chosen version 3 because, in my opinion, it offers a simplified structure and improved usability.

syntax = "proto3";

package protobuf;

option java_multiple_files = true;
option java_package = "com.szymon_kaluza.protobuf.proto.model";
option java_outer_classname = "LibraryProtos";

message Author {
 optional string name = 1;
 optional string surname = 2;
 optional string nationality = 3;
}

message Book {
 optional string title = 1;
 optional Author author = 2;
 optional int64 pages = 3;
 optional bool available = 4;
}

message Library {
 optional string address = 1;
 repeated Book books = 2;
}

With this definition we can generate Java code using the protocol buffer compiler (protoc).
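As a rough sketch of what working with the generated classes looks like (the field values below are made up for illustration, and the imports follow the java_package option from the schema), serializing and deserializing a library comes down to a few calls:

import com.google.protobuf.InvalidProtocolBufferException;
import com.szymon_kaluza.protobuf.proto.model.Author;
import com.szymon_kaluza.protobuf.proto.model.Book;
import com.szymon_kaluza.protobuf.proto.model.Library;

public class ProtobufRoundTrip {

    public static void main(String[] args) throws InvalidProtocolBufferException {
        // Build a small library with a single book (sample values only)
        Library library = Library.newBuilder()
                .setAddress("Main Street 1")
                .addBooks(Book.newBuilder()
                        .setTitle("Example title")
                        .setAuthor(Author.newBuilder()
                                .setName("Jane")
                                .setSurname("Doe")
                                .setNationality("PL"))
                        .setPages(123)
                        .setAvailable(true))
                .build();

        // Serialize to the compact binary wire format
        byte[] bytes = library.toByteArray();

        // Deserialize back into an immutable message object
        Library decoded = Library.parseFrom(bytes);
        System.out.println(decoded.getBooks(0).getTitle());
    }
}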

In Avro, collections are defined with the array type, whose items all share a single schema. The Avro schema for the same object looks like this:

{
 "type": "record",
 "name": "Library",
 "namespace": "com.szymon_kaluza.avro.avro.model",
 "fields": [
   {
     "name": "address",
     "type": [
       "null",
       "string"
     ],
     "default": null
   },
   {
     "name": "books",
     "type": {
       "type": "array",
       "items": {
         "type": "record",
         "name": "Book",
         "fields": [
           {
             "name": "title",
             "type": [
               "null",
               "string"
             ],
             "default": null
           },
           {
             "name": "author",
             "type": {
               "type": "record",
               "name": "Author",
               "fields": [
                 {
                   "name": "name",
                   "type": [
                     "null",
                     "string"
                   ],
                   "default": null
                 },
                 {
                   "name": "surname",
                   "type": [
                     "null",
                     "string"
                   ],
                   "default": null
                 },
                 {
                   "name": "nationality",
                   "type": [
                     "null",
                     "string"
                   ],
                   "default": null
                 }
               ]
             }
           },
           {
             "name": "pages",
             "type": [
               "null",
               "long"
             ],
             "default": null
           },
           {
             "name": "available",
             "type": [
               "null",
               "boolean"
             ],
             "default": null
           }
         ]
       }
     }
   }
 ]
}

With the help of the Avro Java API we can create the schema above and then compile it into Java classes.
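As an illustrative sketch (assuming the Library, Book, and Author classes generated from the schema above, with made-up field values), raw binary serialization and deserialization with the Avro Java API looks roughly like this:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

import com.szymon_kaluza.avro.avro.model.Author;
import com.szymon_kaluza.avro.avro.model.Book;
import com.szymon_kaluza.avro.avro.model.Library;

public class AvroRoundTrip {

    public static void main(String[] args) throws IOException {
        // Build a small library with a single book (sample values only)
        Library library = Library.newBuilder()
                .setAddress("Main Street 1")
                .setBooks(List.of(Book.newBuilder()
                        .setTitle("Example title")
                        .setAuthor(Author.newBuilder()
                                .setName("Jane")
                                .setSurname("Doe")
                                .setNationality("PL")
                                .build())
                        .setPages(123L)
                        .setAvailable(true)
                        .build()))
                .build();

        // Serialize to raw Avro binary (no embedded schema, as in the benchmarks)
        SpecificDatumWriter<Library> writer = new SpecificDatumWriter<>(Library.class);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(library, encoder);
        encoder.flush();
        byte[] bytes = out.toByteArray();

        // Deserialize using the same (writer's) schema
        SpecificDatumReader<Library> reader = new SpecificDatumReader<>(Library.class);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        Library decoded = reader.read(null, decoder);
        System.out.println(decoded.getBooks().get(0).getTitle());
    }
}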

For the benchmarks, I chose to use two versions of the Library object: a small one with only one book and a big one with one thousand books. To prevent JVM optimizations, string fields are randomized both in content and length (from 3 to 13 characters). Generally, randomness should be avoided when comparing performance, but for our use case its impact is negligible. I only used fixed values when comparing serialized data size, as we need identical objects there.
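A minimal sketch of that kind of string randomization (the helper name is hypothetical) could look like this:

import java.util.concurrent.ThreadLocalRandom;

public class TestData {

    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // Random string with a length between 3 and 13 characters, so the JIT
    // cannot specialize the benchmark for a single constant value
    static String randomString() {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        int length = random.nextInt(3, 14);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}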

It’s important to note that these tests were conducted specifically in Java. Results may differ when comparing Protobuf and Avro in other programming languages due to variations in serialization libraries, language-specific performance characteristics, and runtime environments. It’s advisable to conduct benchmarking and evaluate performance within the programming language you are working with.

Time for testing

I've focused on testing throughput, because the average time of a single operation was too short to reveal the differences between the frameworks.

I kept the same JMH options for all of them:

@Warmup(iterations = 3, time = 3)
@Fork(3)
@Measurement(iterations = 100, time = 3)

Here we are requesting three warmup iterations and three separate JVM forks, each running one hundred measurement iterations. The time specified for warmup and measurement iterations is in seconds.
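For context, a single throughput benchmark in this setup looks roughly like the sketch below (the class, method, and state names are illustrative; the real benchmarks are in the repository linked at the end of the post):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import com.szymon_kaluza.protobuf.proto.model.Book;
import com.szymon_kaluza.protobuf.proto.model.Library;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 3)
@Fork(3)
@Measurement(iterations = 100, time = 3)
@State(Scope.Benchmark)
public class ProtobufSerializationBenchmark {

    private Library smallLibrary;

    @Setup
    public void setup() {
        // In the real benchmark the library is filled with randomized strings
        smallLibrary = Library.newBuilder()
                .setAddress("Main Street 1")
                .addBooks(Book.newBuilder().setTitle("Example").setPages(123).setAvailable(true))
                .build();
    }

    @Benchmark
    public byte[] serializeSmall() {
        // JMH consumes the returned value, preventing dead-code elimination
        return smallLibrary.toByteArray();
    }
}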

In each group, we have four different scenarios:

  • serialization of a small library
  • serialization of a big library
  • serialization and deserialization of a small library
  • serialization and deserialization of a big library

On top of that, I compared the size of serialized libraries with one, ten, one thousand, and one million books. Tests were performed on Ubuntu 22.04.2 LTS with an 11th Gen Intel Core i9-11900H @ 2.50 GHz, 64 GB of DDR4 memory, and an OpenJDK 17.0.5 JVM.

In the end, we got the following results:

[Throughput charts comparing Protobuf and Avro across the four scenarios]

Number of books in library    Protobuf            Avro
One book                      56 bytes            54 bytes
Ten books                     479 bytes           441 bytes
One thousand books            55,441 bytes        51,508 bytes
One million books             68,539,057 bytes    64,547,317 bytes

Conclusion

The tests we conducted show that Protobuf is faster at both serialization and deserialization. On the other hand, Avro produces slightly more compact serialized data. When choosing between the two, the decision will largely depend on the specific needs of your distributed system. If performance is a critical concern, Protobuf's speed and efficiency may make it the better choice. If you need more complex data structures or built-in compression options, Avro may be a better fit.

Regardless of which serialization framework you choose, it's important to keep in mind that serialization is just one component of a distributed system. Other factors, such as network latency, data consistency, CPU usage, and libraries used, will also play a role in determining the overall performance and reliability of your system. By carefully considering all of these factors, and choosing the right serialization framework for your needs, you can help build a distributed system that is fast, efficient, and reliable.

The complete code used for our tests can be found on GitHub: Benchmark code.

Tech reviewed by: Michał Matłoka

