
Data serialization tools: Avro vs Protobuf

Szymon Kałuża

19 Jan 2023 · 6 minutes read


Introduction

In this article I will introduce readers to the subject of serialization and two of its arguably most popular formats today. I will show how to use them and how they work under the hood, describe their main advantages and disadvantages, and explain when to use them.
Let’s start with some background knowledge.

What is data serialization and why do we need it?

Data serialization is the process of converting data objects present in data structures into a byte stream for storage, transfer and distribution purposes.
Most programs operate on data represented in two ways:

  1. in-memory data, optimized for CPU access and manipulation,
  2. data persisted in a file or sent over the network.

That’s why there is a need for some form of translation between those two worlds. It’s straightforward for simple data but gets complicated when it comes to nested, tree-like objects and object references. Some sort of packing of objects scattered in memory to linear byte sequence is necessary.

For this purpose textual formats like CSV, XML or JSON were created. Complex data structures can be encoded as text, stored in a file or sent over the network, and then recreated by other code. On top of that, the main advantage of those formats is human readability.

On the other hand, there are some serious disadvantages:

  • limited or no support for data types,
  • if there is schema support, it’s complicated to learn and implement,
  • slow encoding/decoding of data,
  • encoded data takes a lot of space.

In order to overcome those caveats, libraries for binary encoding come to the rescue. They require a schema for any encoded data and are well optimized in terms of performance and space utilization. I will describe two of them: Avro and Protobuf.

Before we start I will mention some crucial terminology used in this article:
Serialization (marshalling) is the process of converting data into a byte stream that can be efficiently stored and/or transferred elsewhere,
Deserialization (unmarshalling) is about recreating the original data from a byte stream,
Backward compatibility means that newer code can read data written by an older version of the software,
Forward compatibility means that older code can read data written by a newer version of the software,
Schema evolution allows updating the schema to a new format while maintaining backward compatibility.

What is Apache Avro

Avro is a serialization framework started in 2009 by the Apache Software Foundation as a subproject of Hadoop. Avro operates on two schema languages:

  1. Avro IDL (interface description language), intended for human editing,
  2. one based on JSON, which is more easily machine-readable.

Example of schema in Avro IDL:

record Person {
  string               userName;
  union { null, long } favouriteNumber;
  array<string>        interests;
}

And its JSON representation:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName",        "type": "string"},
    {"name": "favouriteNumber", "type": ["null", "long"]},
    {"name": "interests",       "type": {"type": "array", "items": "string"}}
  ]
}
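
To make this more concrete, here is a minimal sketch of encoding such a record with Avro's Java GenericRecord API, without any generated classes. It assumes the Apache Avro Java library is on the classpath and a Java version with text blocks; the sample values are illustrative.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.util.List;

public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    // Parse the JSON schema shown above.
    String personSchemaJson = """
        {
          "type": "record",
          "name": "Person",
          "fields": [
            {"name": "userName",        "type": "string"},
            {"name": "favouriteNumber", "type": ["null", "long"]},
            {"name": "interests",       "type": {"type": "array", "items": "string"}}
          ]
        }
        """;
    Schema schema = new Schema.Parser().parse(personSchemaJson);

    // Build a record dynamically, without generated classes.
    GenericRecord person = new GenericData.Record(schema);
    person.put("userName", "Martin");
    person.put("favouriteNumber", 1337L);
    person.put("interests", List.of("daydreaming", "hacking"));

    // Encode it to raw Avro binary (no schema is attached to the bytes).
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(person, encoder);
    encoder.flush();
    byte[] avroBytes = out.toByteArray();
    System.out.println("Encoded " + avroBytes.length + " bytes");
  }
}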

Sample data using this schema would be encoded like this:
[Figure: byte-level Avro encoding of the example record]

source: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

There are no identifiers for fields or their data types in the encoded byte stream. For a string there is only a length prefix telling us how many bytes the field occupies, followed by the UTF-8 bytes themselves. That’s why the schema is necessary to decode such data: the code goes through the fields in the order in which they appear in the schema to determine the data type of each field.

For integers, variable-length encoding is used, so the space taken to encode an integer depends on its magnitude. Numbers between -64 and 63 are encoded in just one byte, numbers from -8192 to 8191 fit in two bytes, and so on.
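
A small sketch of the idea behind those ranges (not Avro's actual implementation): signed numbers are first ZigZag-encoded so that values close to zero become small unsigned numbers, and the result is written as a varint with 7 payload bits per byte.

final class VarintSketch {
  // ZigZag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
  static long zigZag(long n) {
    return (n << 1) ^ (n >> 63);
  }

  // Number of varint bytes needed: 7 payload bits per byte,
  // the high bit of each byte marks "more bytes follow".
  static int varintSize(long zigZagged) {
    int bytes = 1;
    while ((zigZagged >>>= 7) != 0) {
      bytes++;
    }
    return bytes;
  }
  // zigZag(-64) = 127 and zigZag(63) = 126, so -64..63 take one byte;
  // zigZag(-8192) = 16383 and zigZag(8191) = 16382, so -8192..8191 take two.
}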

Avro is based on the concept of the writer’s and the reader’s schema. An important fact is that those don’t have to be identical, only compatible with each other: the Avro library will resolve the differences between the writer’s and the reader’s schema and translate the data. This is possible because Avro schema resolution matches fields by name, as in the example:
[Figure: resolving the writer’s schema against the reader’s schema, with fields matched by name]

source: https://ebrary.net/64678/computer_science/writers_schema_readers_schema

When the code responsible for reading the data encounters a field that appears in the writer’s schema but not in the reader’s schema, it simply ignores it. In the opposite situation, when the reader expects a field that the writer did not include, it is populated with the default value declared in the reader’s schema.
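
Continuing the Java sketch from before, this resolution step could look as follows; both schemas and the encoded bytes are assumed to come from the earlier example.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import java.io.IOException;

class AvroReadExample {
  // Decode bytes written with writerSchema into a record shaped by readerSchema.
  static GenericRecord decode(byte[] avroBytes, Schema writerSchema, Schema readerSchema)
      throws IOException {
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<>(writerSchema, readerSchema);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);
    // Fields unknown to the reader are skipped; fields missing from the writer
    // are filled with the defaults declared in the reader's schema.
    return reader.read(null, decoder);
  }
}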

As we can see, in order to read encoded data, the reader needs to know the writer’s schema. Attaching the schema to every encoded record would contradict the size optimization goal. There are a few ways to minimize this drawback:

  • for very large files with lots of records, the writer’s schema can be included once at the beginning of the file (see the sketch after this list),
  • for smaller records stored in a database, we can include a schema version number with each record,
  • when sending records over a bidirectional network connection, processes can negotiate the schema version on connection setup.
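
The first option is essentially what Avro's object container file format does: the writer's schema is stored once in the file header and all records are encoded against it. A minimal sketch, reusing the schema and person objects from the earlier example:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.io.IOException;

class AvroContainerFileExample {
  static void writeAndRead(Schema schema, GenericRecord person) throws IOException {
    File file = new File("people.avro");

    // Write: the schema goes into the file header once, records follow.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(person); // append as many records as needed
    }

    // Read: no schema has to be supplied, it is taken from the file header.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord record : reader) {
        System.out.println(record);
      }
    }
  }
}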

A big advantage of Avro is dynamic schema generation, once again thanks to resolving fields by name. This means we could generate an Avro schema from a database schema and dump the database contents into an Avro object container file. If the database schema changed, we could simply repeat the process. Software reading the new data file will see that the fields have changed, but since fields are identified by name, the updated writer’s schema can still be matched up with the old reader’s schema.

What is Protobuf

Protocol Buffers is an open-source, language-neutral data format developed by Google for communication between systems. It was created in 2001 and made publicly available in 2008.
A Protobuf schema looks like this:

syntax="proto3";

message Person {
  string user_name = 1;
  int64 favourite_number = 2;
  repeated string interests = 3;
}

The Protobuf code generation tool takes this schema definition and produces classes that implement it in various programming languages. Here is an example of code generated in Java and its usage:

Person myPerson = Person.newBuilder()
       .setUserName("Name")
       .setFavouriteNumber(42)
       .addAllInterests(interestsList)
       .build();
byte[] rawData = myPerson.toByteArray();
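
The receiving side is symmetrical; a short sketch, assuming the same generated Person class and that rawData arrived intact:

// Decode the bytes back into an object; parseFrom throws
// InvalidProtocolBufferException if the data is not valid Protobuf.
Person decodedPerson = Person.parseFrom(rawData);
String name = decodedPerson.getUserName();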

The encoded data is very similar to that generated by Avro, but there are a few differences.
[Figure: Protobuf binary encoding of the example record]

source: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

As in Avro, field names are not persisted in the byte stream, but each field is prefixed with its tag number and data (wire) type. For small tag numbers this metadata fits in a single byte. Protobuf uses the field tags for field recognition.
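
For example, the key for the user_name field combines its tag (1) with the wire type for length-delimited data (2); a small sketch of that computation:

// Protobuf field key = (field_number << 3) | wire_type, encoded as a varint.
// user_name has tag 1 and wire type 2 (length-delimited):
int key = (1 << 3) | 2; // = 0x0A, a single byte for tags 1 to 15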

Since proto version 3, we don’t have the required and optional field markers anymore, so I won’t write about them.

A few words about schema evolution with Protobuf. Field names can be changed at any time without fear of breaking anything, but tags have to stay untouched. New fields with new tags can be added and existing fields may be removed, although for backward compatibility a removed tag number can’t be used again. The situation was more complex before proto3, when the required and optional markers existed, but now it’s straightforward: code that doesn’t recognize new fields simply ignores them. Instead of lists or arrays, Protobuf uses the repeated marker. Because of that, it’s possible to change a single-value field to a repeated field. In such a case, new code reading old data sees a list with zero or one element, and old code reading new data sees only the last element of the repeated field.

I mentioned code generation provided by Protobuf. It’s especially convenient for statically typed languages, like Java, C++ or C#, but it’s not required.

Conclusion

Avro and Protobuf both use schema for describing the binary encoded format. Their schema languages are much simpler than XML schema or JSON schema. On top of that, they can be much more compact.

Avro seems a better fit for BigData use cases. It supports the ability to modify data formats during runtime and is especially suited for generic data systems such as those needed for data engineering. On the other hand, Avro schemas and libraries are more complex to learn.

Protobuf is easy to use in a microservice architecture, where interoperability is important. It shines when the data format does not change too frequently, because every schema change requires regenerating and recompiling the code.

Which solution is better for event sourcing? Find out in the article: best serialization strategy for event sourcing.
