Raw Format
Format: Serialization Schema
Format: Deserialization Schema
The Raw format allows reading and writing raw (byte-based) values as a single column.
Note: this format encodes null values as null of byte[] type. This can be a limitation when used with upsert-kafka, because upsert-kafka treats null values as tombstone messages (DELETE on the key). Therefore, we recommend avoiding the combination of the upsert-kafka connector and the raw format as a value.format if the field can have a null value.
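The interaction described in the note can be pictured with a small toy model. The sketch below is not Flink code: raw_serialize and upsert_kafka_action are hypothetical helpers that only mimic the two behaviours stated above (a null field becomes a null payload, and a null payload is read as a tombstone).

```python
from typing import Optional


def raw_serialize(value: Optional[str], charset: str = "utf-8") -> Optional[bytes]:
    """Hypothetical stand-in for the raw format: null in, null payload out."""
    if value is None:
        return None  # no bytes at all, not an empty byte array
    return value.encode(charset)


def upsert_kafka_action(payload: Optional[bytes]) -> str:
    """Hypothetical stand-in for upsert-kafka: a null record value is a tombstone."""
    return "DELETE" if payload is None else "UPSERT"


# A null field value cannot be told apart from a deletion of the key.
print(upsert_kafka_action(raw_serialize(None)))
# An empty string is still a real (zero-length) payload.
print(upsert_kafka_action(raw_serialize("")))
```

Under these assumptions, a row whose only value column is null would be treated as a delete rather than as an update carrying a null value.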
The Raw format is built into the Blink planner; no additional dependencies are required.
Example
For example, you may have the following raw log data in Kafka and want to read and analyse it using Flink SQL.
47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain.com/?p=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"
The following statement creates a table that reads from (and can write to) the underlying Kafka topic, treating each record as an anonymous string value in UTF-8 encoding by using the raw format:
CREATE TABLE nginx_log (
  log STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'nginx_log',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'raw'
)
You can then read the raw data as a plain string and split it into multiple fields with a user-defined function for further analysis, such as my_split in the example below.
SELECT t.hostname, t.datetime, t.url, t.browser, ...
FROM (
  SELECT my_split(log) AS t FROM nginx_log
);
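The body of my_split is not shown in this page. As a rough illustration of the kind of splitting such a function might perform, here is a minimal Python sketch. The function name split_log, the LogFields type, and the regular expression are all assumptions made for this example; the pattern is written only to fit the sample log line above, and the code is plain Python rather than a registered Flink UDF.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the sample line: host, two placeholder
# fields, bracketed timestamp, quoted request, status, size, quoted
# referrer, quoted user agent. Anything after the user agent is ignored.
LOG_PATTERN = re.compile(
    r'^(?P<hostname>\S+) \S+ \S+ '
    r'\[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
)


class LogFields(NamedTuple):
    hostname: str
    datetime: str
    url: str
    browser: str


def split_log(raw: bytes, charset: str = "utf-8") -> Optional[LogFields]:
    """Decode one raw record and extract the four fields used in the query."""
    line = raw.decode(charset)
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # the line does not follow the assumed layout
    return LogFields(
        match["hostname"], match["datetime"], match["url"], match["browser"]
    )
```

Returning None for lines that do not match keeps malformed records from raising an error; whether to drop or flag such records is a design choice left to the caller.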
Conversely, you can also write a single column of STRING type into this Kafka topic as an anonymous string value in UTF-8 encoding.
Format Options
Option | Required | Default | Type | Description |
---|---|---|---|---|
format | required | (none) | String | Specifies the format to use; here it should be ‘raw’. |
raw.charset | optional | UTF-8 | String | Specifies the charset used to encode the text string. |
raw.endianness | optional | big-endian | String | Specifies the endianness used to encode the bytes of numeric values. Valid values are ‘big-endian’ and ‘little-endian’. See more details of endianness. |
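To make the effect of the two optional settings concrete, the following Python sketch shows how the same text and the same 32-bit integer turn into different bytes under different charsets and byte orders. It only illustrates the concepts with the standard library; UTF-16LE is picked here purely as a contrasting charset, not as a recommended value for raw.charset.

```python
import struct

text = "café"

# The charset decides how many bytes each character takes.
utf8_bytes = text.encode("utf-8")       # 'é' takes two bytes in UTF-8
utf16_bytes = text.encode("utf-16-le")  # every character here takes two bytes

value = 258  # 0x00000102

# The endianness decides the order of the bytes of a numeric value.
big_endian = struct.pack(">i", value)     # most significant byte first
little_endian = struct.pack("<i", value)  # least significant byte first

print(len(utf8_bytes), len(utf16_bytes))
print(big_endian.hex(), little_endian.hex())
```

A reader and a writer of the same topic need to agree on both settings, since the bytes carry no marker for either.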
Data Type Mapping
The table below details the SQL types the format supports and how values of each type are encoded to and decoded from bytes.
Flink SQL type | Value |
---|---|
CHAR / VARCHAR / STRING | A UTF-8 (by default) encoded text string. The encoding charset can be configured by ‘raw.charset’. |
BINARY / VARBINARY / BYTES | The sequence of bytes itself. |
BOOLEAN | A single byte indicating the boolean value: 0 means false, 1 means true. |
TINYINT | A single byte of the signed number value. |
SMALLINT | Two bytes with big-endian (by default) encoding. The endianness can be configured by ‘raw.endianness’. |
INT | Four bytes with big-endian (by default) encoding. The endianness can be configured by ‘raw.endianness’. |
BIGINT | Eight bytes with big-endian (by default) encoding. The endianness can be configured by ‘raw.endianness’. |
FLOAT | Four bytes with IEEE 754 format and big-endian (by default) encoding. The endianness can be configured by ‘raw.endianness’. |
DOUBLE | Eight bytes with IEEE 754 format and big-endian (by default) encoding. The endianness can be configured by ‘raw.endianness’. |
RAW | The sequence of bytes serialized by the underlying TypeSerializer of the RAW type. |
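The byte layouts in the table can be reproduced with a short Python sketch built on the standard struct module. This is an illustration of the layouts only, not the format's actual serializer: the encode function and its parameters are hypothetical, and the RAW type is left out because its bytes depend on the underlying TypeSerializer.

```python
import struct

# struct codes with the fixed sizes listed in the table:
# 1, 2, 4 and 8 byte signed integers, 4 and 8 byte IEEE 754 floats.
_NUMERIC_CODES = {
    "TINYINT": "b",
    "SMALLINT": "h",
    "INT": "i",
    "BIGINT": "q",
    "FLOAT": "f",
    "DOUBLE": "d",
}
_TEXT_TYPES = ("CHAR", "VARCHAR", "STRING")
_BYTES_TYPES = ("BINARY", "VARBINARY", "BYTES")


def encode(sql_type, value, endianness="big-endian", charset="utf-8"):
    """Hypothetical helper: encode one value following the table above."""
    if sql_type in _TEXT_TYPES:
        return value.encode(charset)
    if sql_type in _BYTES_TYPES:
        return bytes(value)  # the sequence of bytes itself
    if sql_type == "BOOLEAN":
        return b"\x01" if value else b"\x00"
    if endianness not in ("big-endian", "little-endian"):
        raise ValueError("unknown endianness: " + endianness)
    if sql_type not in _NUMERIC_CODES:
        raise ValueError("type not covered by this sketch: " + sql_type)
    prefix = ">" if endianness == "big-endian" else "<"
    return struct.pack(prefix + _NUMERIC_CODES[sql_type], value)


print(encode("INT", 1).hex())
print(encode("SMALLINT", 1, endianness="little-endian").hex())
print(encode("DOUBLE", 1.0).hex())
```

Using an explicit ">" or "<" prefix makes struct use its standard sizes, so the widths do not vary with the platform.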