Why using Google Protocol Buffers

Thursday. September 10, 2020 - 6 mins

Index

Introduction
Protobuf benefits
Protobuf use case
Repository hierarchy
Models life cycle
Public resources
Summary

Introduction

Google Protocol Buffers, also called Protobuf, is a language neutral, platform neutral, extensible dialect for describing data models. However, it is way more than this. Protobuf encapsulates a range of tools for serializing data objects so that they can be interchanged between different software components.

Something like JSON-schema, but on steroids.

This post does not cover the Protobuf syntax (currently at v3), but instead explains the reasons why to use it.

Protobuf benefits

Given the unique properties of Protobuf, it serves as a great solution for defining data schemas that several software components are going to share, where some of them may be written in different programming languages.

In addition, there are other benefits that come into play.

Language support 💱

As explained before, Protobuf offers both a data model describing dialect, and a set of encoders and decoders to serialize and deserialize shared data objects.

These encoders and decoders are bundled with the Protobuf libraries, and they have been created for supporting a wide variety of programming languages: C++, Python, Java, Ruby, Golang, JavaScript, PHP… The complete list of supported languages can be checked on its GitHub repository.

They are in charge of automatically generate language-specific classes, where the serialized data objects can be deserialized into. This is a great advantage over JSON-schema, as Protobuf generated classes can be used by IDEs and linters to statically check your code.

Simplified example:

// Protobuf model                           # Python class (generated)
message Ingredient {                        class Ingredient:
  string id = 1;                                id: str
  string name = 2;                              name: str
  int32 healthiness = 3;                        healthiness: int
}

Speed ⚡️

Serialization speed is another area where Protobuf excels, with the following speed-ups, compared to JSON:

On JS-to-Java: ~20%. Reference.
On Java-to-Java: ~80% (5-6 times faster). Reference.
On Golang-to-Golang: ~40% (2.3-2.7 times faster). Reference.

I do not consider this to be a huge benefit, as data serialization is usually not the bottleneck of modern days applications. Just something to keep in mind.

Customization 🎨

Protobuf offers default translations of .proto models to each particular programming language class equivalent. Nevertheless, it is flexible enough to allow developers create their own translations if they want to use modern features of those programming languages.

My favourite example is betterproto, a Protobuf to Python plugin that simplifies the resulting classes, and makes use of the dataclass decorator.

If the default translation is called:

protoc --proto_path=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/ingredient.proto

A customized translation is called:

protoc --proto_path=$SRC_DIR --python_betterproto_out=$DST_DIR $SRC_DIR/ingredient.proto

Protobuf use case

Overall, Protocol Buffers are useful on quite particular scenarios, that are only reached once an organization has a certain size, as it addresses a scalability problem by creating a single source of truth on how shared data models are defined and evolved over time.

Here is a real example:

Your company now has 50+ software engineers, and you are starting to migrate to a micro-services architecture where some are written in Python, others in Golang, and others in Ruby. How do you make sure data schemas shared among some of those microservices are consistent?

It is probable that Golang schemas evolve faster than the Ruby ones, and the required coordination among teams to ensure they are introducing the same new fields is slowing down the development.

That is where a centralized repository of Protobuf schemas helps.

Repository hierarchy

As introduced in the previous section, using Protobuf schemas become useful when they are defined on a repository representing the single source of truth within an organization.

That repository of Protobuf schemas hold the data models that teams managing components written in different programming languages are going to use. Therefore, they will need some kind of automatic translation from .proto to .py, .go, .rb…

That is when repository hierarchy comes into play.

The desired hierarchy would have an upstream repository called org-schemas with the Protobuf definitions, and a range of downstream repositories called:

org-schemas-python 🐍
org-schemas-golang 🐇
org-schemas-ruby 💎
…

So that every time a change is introduced on the Protobuf schemas, the CI/CD automatically:

Downloads a protoc compiler.
Generate the target language classes.
Clones and updates the target language repository.
Releases / tags the target language repository.

Models life-cycle ♻️

With a hierarchical structure of repositories, data models evolution is propagated from the central repo of Protobuf schemas to the downstream, language-specific, schemas repositories.

This approach automatically synchronize the state of data schemas across the different programming languages that an organization may decide to use, so schemas at version X.Y.Z of org-schemas-python repo have exactly the same fields as those in version X.Y.Z of org-schemas-golang repo.

Quite satisfying 😌

Now, there is a small catch: as automatically detecting what is considered a major, minor or patch change by the Protobuf repository CI is not easy, semantic versioning format cannot be followed on the downstream repositories.

That means that versions will not be X.Y.Z (i.e. 2.1.0), but X (i.e. 12), being incremented with each new change.

Public resources

GitHub skeleton

In order to facilitate the understanding and the setup of this structure of repositories (which lays more in the infrastructure than in the development part of software), I made a public repository with the skeleton of how the upstream Protobuf holding repo would look like.

Check it out: Sinclert/Protobuf-upstream.

Additional considerations

A pair of RSA keys is required for connecting upstream and downstream repos.
- The private part is set as “Secret” within the upstream repository.
- The public part is set as “Deploy key” within the downstream repositories.
Advance usage: if the resulting language-specific schema repositories want to be kept secret within an organization, but being used at Dockerized pieces of software, check out this fantastic Docker SSH forwarding guide on how to use --mount=type=ssh to access those.

Summary

Setting up a hierarchy of repositories to translate language-neutral schemas into language-specific classes is challenging, but it also provides fantastic benefits to big enough organizations:

Single source of truth.
Controlled evolution of downstream repositories.
Language-specific generated classes (compared to JSON-schema).
Language flexibility to choose from (C++, Python, Golang, Ruby, JS…).

Thanks for reading!

Feel free to contact me if you find any bugs on the skeleton repo 😉

Sinclert Pérez

Software Engineer @ Canonical