Raw Format

Format: Serialization Schema
Format: Deserialization Schema

The Raw format allows reading and writing raw (byte-based) values as a single column.

Note: this format encodes null values as null of byte[] type. This can be a limitation when used with upsert-kafka, because upsert-kafka treats null values as tombstone messages (DELETE on the key). Therefore, we recommend not combining the upsert-kafka connector with the raw format as value.format if the field can have a null value.
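One way to follow this recommendation can be sketched as below: the raw format is kept for the message key only, and a different format is used for the value, so a null field is not written as a null Kafka value. The table name, topic, column names, and the choice of 'json' as the value format are assumptions made for this illustration.

```sql
-- Hypothetical table: all names and the 'json' value format are
-- illustrative assumptions, not taken from the original example.
CREATE TABLE user_status (
  user_id STRING,
  status STRING,  -- may be NULL
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'user_status',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'raw',    -- single key column
  'value.format' = 'json'  -- a NULL status is written as a JSON null, not as a tombstone
);
```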

Dependencies

To use the Raw format, the following dependencies are required, both for projects that use a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.

| Maven dependency | SQL Client JAR |
|------------------|----------------|
| flink-raw        | Built-in       |

Example

For example, you may have the following raw log data in Kafka and want to read and analyse it using Flink SQL.

    47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain.com/?p=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"

The following statement creates a table that reads from (and can write to) the underlying Kafka topic, treating each record as an anonymous string value in UTF-8 encoding by using the raw format:

    CREATE TABLE nginx_log (
      log STRING
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'nginx_log',
      'properties.bootstrap.servers' = 'localhost:9092',
      'properties.group.id' = 'testGroup',
      'format' = 'raw'
    )

Then you can read the raw data as a plain string and split it into multiple fields with a user-defined function for further analysis, e.g. my_split in the example.

    SELECT t.hostname, t.datetime, t.url, t.browser, ...
    FROM (
      SELECT my_split(log) as t FROM nginx_log
    );
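If no user-defined function is at hand, a rough equivalent can be sketched with the built-in SPLIT_INDEX function, which splits a string on a literal delimiter and returns the zero-based piece. The column aliases and the delimiter positions below are assumptions derived from the sample log line above and would need adjusting for a different log layout.

```sql
-- Sketch only: the indexes assume the exact layout of the sample log line.
SELECT
  SPLIT_INDEX(log, ' ', 0) AS remote_addr,                    -- text before the first space
  SPLIT_INDEX(SPLIT_INDEX(log, '[', 1), ']', 0) AS log_time,  -- text between [ and ]
  SPLIT_INDEX(log, '"', 1) AS request,                        -- first quoted field
  SPLIT_INDEX(log, '"', 5) AS user_agent                      -- third quoted field
FROM nginx_log;
```

This approach is brittle if a quoted field itself contains a double quote; a dedicated parsing function is the more robust choice in that case.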

Conversely, you can also write a single column of STRING type into this Kafka topic as an anonymous string value in UTF-8 encoding.
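A minimal sketch of the write path, assuming the nginx_log table defined earlier; the log line in the literal is made up for illustration.

```sql
-- The STRING value is written to the topic as plain UTF-8 bytes,
-- with no additional framing or schema information.
INSERT INTO nginx_log
VALUES ('127.0.0.1 - - [28/Feb/2019:13:17:11 +0000] "GET / HTTP/2.0" 200 612 "-" "curl/7.64.0" "0.01"');
```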

Format Options

| Option         | Required | Default    | Type   | Description |
|----------------|----------|------------|--------|-------------|
| format         | required | (none)     | String | Specify which format to use; here it should be 'raw'. |
| raw.charset    | optional | UTF-8      | String | Specify the charset used to encode the text string. |
| raw.endianness | optional | big-endian | String | Specify the endianness used to encode the bytes of a numeric value. Valid values are 'big-endian' and 'little-endian'. |
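To show where these options go in a table definition, here is a sketch that reads text in a non-default charset. The table name, the topic, and the choice of ISO-8859-1 are assumptions for illustration.

```sql
-- Hypothetical table: name, topic, and charset are illustrative assumptions.
CREATE TABLE legacy_log (
  line STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'legacy_log',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'raw',
  'raw.charset' = 'ISO-8859-1'  -- charset applied to the single STRING column
);
```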

Data Type Mapping

The table below lists the SQL types the format supports and describes how a value of each type is represented as bytes when it is encoded and decoded.

| Flink SQL type             | Value |
|----------------------------|-------|
| CHAR / VARCHAR / STRING    | A UTF-8 (by default) encoded text string. The encoding charset can be configured by 'raw.charset'. |
| BINARY / VARBINARY / BYTES | The sequence of bytes itself. |
| BOOLEAN                    | A single byte to indicate the boolean value: 0 means false, 1 means true. |
| TINYINT                    | A single byte of the signed number value. |
| SMALLINT                   | Two bytes with big-endian (by default) encoding. The endianness can be configured by 'raw.endianness'. |
| INT                        | Four bytes with big-endian (by default) encoding. The endianness can be configured by 'raw.endianness'. |
| BIGINT                     | Eight bytes with big-endian (by default) encoding. The endianness can be configured by 'raw.endianness'. |
| FLOAT                      | Four bytes with IEEE 754 format and big-endian (by default) encoding. The endianness can be configured by 'raw.endianness'. |
| DOUBLE                     | Eight bytes with IEEE 754 format and big-endian (by default) encoding. The endianness can be configured by 'raw.endianness'. |
| RAW                        | The sequence of bytes serialized by the underlying TypeSerializer of the RAW type. |
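As an illustration of the numeric mappings, the sketch below declares a table whose records are eight-byte little-endian integers. The table name, topic, and column name are assumptions for this example.

```sql
-- Hypothetical table: each Kafka record value is expected to be exactly 8 bytes.
CREATE TABLE sensor_counter (
  reading BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'sensor_counter',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'raw',
  'raw.endianness' = 'little-endian'  -- least significant byte first
);
```

With this setting, the BIGINT value 1 corresponds to the bytes 01 00 00 00 00 00 00 00; with the default big-endian setting it would be 00 00 00 00 00 00 00 01.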