Avro Serialization

About the Apicurio API and Schema Registry

The Apicurio Registry open-source project provides several components that work with Avro:

  • An Avro converter that you can specify in Debezium connector configurations. This converter maps Kafka Connect schemas to Avro schemas. The converter then uses the Avro schemas to serialize the record keys and values into Avro’s compact binary form.

  • An API and schema registry that tracks:

    • Avro schemas that are used in Kafka topics

    • Where the Avro converter sends the generated Avro schemas

    Since the Avro schemas are stored in this registry, each record needs to contain only a tiny schema identifier. This makes each record even smaller. For an I/O bound system like Kafka, this means more total throughput for producers and consumers.

  • Avro Serdes (serializers and deserializers) for Kafka producers and consumers. Kafka consumer applications that you write to consume change event records can use Avro Serdes to deserialize the change event records.

To use the Apicurio Registry with Debezium, add Apicurio Registry converters and their dependencies to the Kafka Connect container image that you are using for running a Debezium connector.

The Apicurio Registry project also provides a JSON converter. This converter combines the advantage of less verbose messages with human-readable JSON. Messages do not contain the schema information themselves, but only a schema ID.

Deployment overview

To deploy a Debezium connector that uses Avro serialization, there are three main tasks:

  1. Deploy an Apicurio API and Schema Registry instance.

  2. Install the Avro converter from the installation package into Kafka Connect’s libs directory or directly into a plug-in directory.

  3. Configure a Debezium connector instance to use Avro serialization by setting configuration properties as follows:

    key.converter=io.apicurio.registry.utils.converter.AvroConverter
    key.converter.apicurio.registry.url=http://apicurio:8080/api
    key.converter.apicurio.registry.global-id=io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy
    value.converter=io.apicurio.registry.utils.converter.AvroConverter
    value.converter.apicurio.registry.url=http://apicurio:8080/api
    value.converter.apicurio.registry.global-id=io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy

Internally, Kafka Connect always uses JSON key/value converters for storing configuration and offsets.
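
As a rough sketch of how the converter properties above fit into a connector registration request, the following script writes a Kafka Connect registration payload and defines a helper that posts it to the Kafka Connect REST API. Only the six converter properties come from the list above. The connector name, the database settings, the file name, and the http://localhost:8083 address are placeholder assumptions that you would replace with your own values.

```shell
#!/bin/sh
# Sketch: register a Debezium MySQL connector that uses the Apicurio Avro converter.
# Assumptions (not from the text above): connector name, database host and
# credentials, server name, payload file name, and the Kafka Connect address.

CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"
PAYLOAD_FILE="${PAYLOAD_FILE:-register-mysql-avro.json}"

# Write the registration payload. The converter entries mirror the
# properties listed above; everything else is a placeholder.
cat > "$PAYLOAD_FILE" <<'EOF'
{
  "name": "inventory-connector-avro",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory",
    "key.converter": "io.apicurio.registry.utils.converter.AvroConverter",
    "key.converter.apicurio.registry.url": "http://apicurio:8080/api",
    "key.converter.apicurio.registry.global-id": "io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy",
    "value.converter": "io.apicurio.registry.utils.converter.AvroConverter",
    "value.converter.apicurio.registry.url": "http://apicurio:8080/api",
    "value.converter.apicurio.registry.global-id": "io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy"
  }
}
EOF

# POST the payload to the Kafka Connect REST API.
# Nothing is sent until this function is called.
register_connector() {
  curl -s -X POST \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    --data @"$PAYLOAD_FILE" \
    "$CONNECT_URL/connectors"
}
```

After the Kafka Connect worker is reachable, calling register_connector submits the payload. Setting the converters in the connector configuration, as here, applies them to that connector only.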

Deploying with Debezium containers

In your environment, you might want to use a provided Debezium container to deploy Debezium connectors that use Avro serialization. To do so, follow the procedure below, in which you build a custom Kafka Connect container image for Debezium and configure the Debezium connector to use the Avro converter.

Prerequisites

  • You have cluster administrator access to an OpenShift cluster.

  • You downloaded the Debezium connector plug-in(s) that you want to deploy with Avro serialization.

Procedure

  1. Deploy an instance of Apicurio Registry.

    The following example uses a non-production, in-memory, Apicurio Registry instance:

    docker run -it --rm --name apicurio \
        -p 8080:8080 apicurio/apicurio-registry-mem:1.2.2.Final

  2. Build a Debezium container image that contains the Avro converter:

    1. Copy Dockerfile to a convenient location. This file has the following content:

      ARG DEBEZIUM_VERSION
      FROM debezium/connect:$DEBEZIUM_VERSION
      ENV KAFKA_CONNECT_DEBEZIUM_DIR=$KAFKA_CONNECT_PLUGINS_DIR/debezium-connector-mysql
      ENV APICURIO_VERSION=1.2.2.Final
      RUN cd $KAFKA_CONNECT_DEBEZIUM_DIR &&\
          curl https://repo1.maven.org/maven2/io/apicurio/apicurio-registry-distro-connect-converter/$APICURIO_VERSION/apicurio-registry-distro-connect-converter-$APICURIO_VERSION-converter.tar.gz | tar xzv

    2. Run the following command:

      docker build --build-arg DEBEZIUM_VERSION=1.1 -t debezium/connect-apicurio:1.1 .

  3. Run the newly built Kafka Connect image, configuring it so it uses the Avro converter:

    docker run -it --rm --name connect \
        --link zookeeper:zookeeper \
        --link kafka:kafka \
        --link mysql:mysql \
        --link apicurio:apicurio \
        -e GROUP_ID=1 \
        -e CONFIG_STORAGE_TOPIC=my_connect_configs \
        -e OFFSET_STORAGE_TOPIC=my_connect_offsets \
        -e KEY_CONVERTER=io.apicurio.registry.utils.converter.AvroConverter \
        -e VALUE_CONVERTER=io.apicurio.registry.utils.converter.AvroConverter \
        -e CONNECT_KEY_CONVERTER=io.apicurio.registry.utils.converter.AvroConverter \
        -e CONNECT_KEY_CONVERTER_APICURIO_REGISTRY_URL=http://apicurio:8080/api \
        -e CONNECT_KEY_CONVERTER_APICURIO_REGISTRY_GLOBAL-ID=io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy \
        -e CONNECT_VALUE_CONVERTER=io.apicurio.registry.utils.converter.AvroConverter \
        -e CONNECT_VALUE_CONVERTER_APICURIO_REGISTRY_URL=http://apicurio:8080/api \
        -e CONNECT_VALUE_CONVERTER_APICURIO_REGISTRY_GLOBAL-ID=io.apicurio.registry.utils.serde.strategy.AutoRegisterIdStrategy \
        -p 8083:8083 debezium/connect-apicurio:1.1
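
After a connector is running on this image, one way to check the setup is to ask Kafka Connect for its connectors and to ask Apicurio Registry which artifacts it stores. The helpers below are a minimal sketch. They assume that both services are published on localhost with the ports used in the commands above, and that the registry exposes the artifact listing of the Apicurio Registry 1.x REST API under /api/artifacts. The function names are illustrative.

```shell
#!/bin/sh
# Sketch: inspect Kafka Connect and Apicurio Registry after deployment.
# Assumptions: both services are reachable on localhost, and the registry
# serves its 1.x REST API under /api.

CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"
REGISTRY_URL="${REGISTRY_URL:-http://localhost:8080/api}"

# List the names of all connectors registered with Kafka Connect.
list_connectors() {
  curl -s "$CONNECT_URL/connectors"
}

# Show the state of one connector and its tasks; $1 is the connector name.
connector_status() {
  curl -s "$CONNECT_URL/connectors/$1/status"
}

# List the IDs of the artifacts (here, Avro schemas) stored in the registry.
list_registry_artifacts() {
  curl -s "$REGISTRY_URL/artifacts"
}
```

If the converter is working, list_registry_artifacts should start returning artifact IDs once the connector has produced its first change events.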

Naming

As stated in the Avro documentation, names must adhere to the following rules:

  • Start with [A-Za-z_]

  • Subsequently contain only [A-Za-z0-9_] characters

Debezium uses the column’s name as the basis for the name of the corresponding Avro field. This can lead to problems during serialization if the column name does not adhere to the Avro naming rules. Each Debezium connector provides a configuration property, sanitize.field.names, that you can set to true if you have columns whose names do not adhere to the Avro rules. Setting sanitize.field.names to true allows serialization of non-conformant fields without having to modify your database schema.
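
To find out ahead of time whether any column names would break these rules, the two naming rules can be expressed as a single regular expression. The following sketch only checks names. It does not reproduce how Debezium rewrites them, and the function names and sample column names are illustrative.

```shell
#!/bin/sh
# Sketch: check column names against the Avro naming rules quoted above.

# Succeed (exit status 0) when $1 starts with [A-Za-z_] and otherwise
# contains only [A-Za-z0-9_]. LC_ALL=C keeps the ranges ASCII-only.
is_valid_avro_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[A-Za-z_][A-Za-z0-9_]*$'
}

# Print one line per column name, stating whether it is a valid Avro name.
check_columns() {
  for column in "$@"; do
    if is_valid_avro_name "$column"; then
      printf '%s: valid Avro name\n' "$column"
    else
      printf '%s: not a valid Avro name\n' "$column"
    fi
  done
}
```

For example, check_columns order_id 1st_name total-amount _version reports 1st_name and total-amount as invalid, which indicates that a table with those columns needs sanitize.field.names set to true.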

Confluent Schema Registry

Confluent provides an alternative schema registry implementation. Its configuration differs slightly from the configuration for Apicurio Registry.

  1. In your Debezium connector configuration, specify the following properties:

    key.converter=io.confluent.connect.avro.AvroConverter
    key.converter.schema.registry.url=http://localhost:8081
    value.converter=io.confluent.connect.avro.AvroConverter
    value.converter.schema.registry.url=http://localhost:8081

  2. Deploy an instance of the Confluent Schema Registry:

    docker run -it --rm --name schema-registry \
        --link zookeeper \
        -e SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL=zookeeper:2181 \
        -e SCHEMA_REGISTRY_HOST_NAME=schema-registry \
        -e SCHEMA_REGISTRY_LISTENERS=http://schema-registry:8081 \
        -p 8081:8081 confluentinc/cp-schema-registry

  3. Run a Kafka Connect image configured to use Avro:

    docker run -it --rm --name connect \
        --link zookeeper:zookeeper \
        --link kafka:kafka \
        --link mysql:mysql \
        --link schema-registry:schema-registry \
        -e GROUP_ID=1 \
        -e CONFIG_STORAGE_TOPIC=my_connect_configs \
        -e OFFSET_STORAGE_TOPIC=my_connect_offsets \
        -e KEY_CONVERTER=io.confluent.connect.avro.AvroConverter \
        -e VALUE_CONVERTER=io.confluent.connect.avro.AvroConverter \
        -e CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL=http://schema-registry:8081 \
        -e CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL=http://schema-registry:8081 \
        -p 8083:8083 debezium/connect:1.1

  4. Run a console consumer that reads new Avro messages from the db.myschema.mytable topic and decodes to JSON:

    docker run -it --rm --name avro-consumer \
        --link zookeeper:zookeeper \
        --link kafka:kafka \
        --link mysql:mysql \
        --link schema-registry:schema-registry \
        debezium/connect:1.1 \
        /kafka/bin/kafka-console-consumer.sh \
          --bootstrap-server kafka:9092 \
          --property print.key=true \
          --formatter io.confluent.kafka.formatter.AvroMessageFormatter \
          --property schema.registry.url=http://schema-registry:8081 \
          --topic db.myschema.mytable
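
To see which schemas the converter registered for a topic, you can also query the Confluent Schema Registry REST API directly. The sketch below assumes that the registry is reachable at the address in SCHEMA_REGISTRY_URL (the default of http://localhost:8081 is an assumption) and that the converter uses the default subject naming, in which a topic's schemas are registered under the subjects <topic>-key and <topic>-value. The function names are illustrative.

```shell
#!/bin/sh
# Sketch: look up the schemas that the Avro converter registered for a topic.
# Assumptions: registry address, and default topic-based subject naming.

SCHEMA_REGISTRY_URL="${SCHEMA_REGISTRY_URL:-http://localhost:8081}"

# Print the key and value subject names derived from topic $1.
subjects_for_topic() {
  printf '%s-key\n%s-value\n' "$1" "$1"
}

# List every subject known to the registry.
list_subjects() {
  curl -s "$SCHEMA_REGISTRY_URL/subjects"
}

# Fetch the latest registered schema version for subject $1.
latest_schema() {
  curl -s "$SCHEMA_REGISTRY_URL/subjects/$1/versions/latest"
}
```

For the topic used above, latest_schema db.myschema.mytable-value would return the subject, version, schema ID, and Avro schema of the most recent value schema.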

Getting More Information

This post from the Debezium blog describes the concepts of serializers, converters, and other components, and discusses the advantages of using Avro. Some Kafka Connect converter details have slightly changed since that post was written.

For a complete example of using Avro as the message format for Debezium change data events, see MySQL and the Avro message format.