Raw Format
Format: Serialization Schema Format: Deserialization Schema
The Raw format allows reading and writing raw (byte-based) values as a single column.
Note: this format encodes null values as null of byte[] type. This can be a limitation when used with upsert-kafka, because upsert-kafka treats null values as tombstone messages (DELETE on the key). Therefore, we recommend avoiding the combination of the upsert-kafka connector and the raw format as the value.format if the field can have a null value.
Dependencies
To use the Raw format, the following dependencies are required both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.
Maven dependency | SQL Client JAR |
---|---|
flink-raw | Built-in |
Example
For example, you may have the following raw log data in Kafka and want to read and analyse it using Flink SQL.
47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain.com/?p=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"
The following statement creates a table that reads from (and can write to) the underlying Kafka topic as an anonymous string value in UTF-8 encoding, using the raw format:
CREATE TABLE nginx_log (
  log STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'nginx_log',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'raw'
)
Then you can read the raw data as a plain string and split it into multiple fields with a user-defined function for further analysis, e.g. my_split in the following example.
SELECT t.hostname, t.datetime, t.url, t.browser, ...
FROM (
  SELECT my_split(log) AS t FROM nginx_log
);
Conversely, you can also write a single column of STRING type into this Kafka topic as an anonymous string value in UTF-8 encoding.
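
The page does not define my_split, so the following is only a sketch of the kind of parsing such a function might perform on each log value. It is plain Python rather than a registered Flink UDF. The names split_log, decode_and_split and LOG_PATTERN are hypothetical, and the regular expression is an assumption fitted to the sample log line above, not a general nginx parser.

```python
import re

# Assumed layout, fitted to the sample line: host, two unused fields,
# bracketed timestamp, quoted request, status, size, quoted referrer,
# quoted user agent. Anything after the user agent is ignored.
LOG_PATTERN = re.compile(
    r'(?P<hostname>\S+) \S+ \S+ '
    r'\[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<browser>[^"]*)"'
)


def split_log(line):
    """Return the fields used in the query as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {name: match.group(name) for name in ("hostname", "datetime", "url", "browser")}


def decode_and_split(raw_bytes, charset="utf-8"):
    """Mimic the two steps: decode the raw value bytes to a string, then split it."""
    return split_log(raw_bytes.decode(charset))
```

Returning None for lines that do not match keeps malformed records from raising; a real function would need its own policy for such records.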
Format Options
Option | Required | Default | Type | Description |
---|---|---|---|---|
format | required | (none) | String | Specify which format to use; here it should be ‘raw’. |
raw.charset | optional | UTF-8 | String | Specify the charset to encode the text string. |
raw.endianness | optional | big-endian | String | Specify the endianness used to encode the bytes of numeric values. Valid values are ‘big-endian’ and ‘little-endian’. |
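
To illustrate why raw.charset and raw.endianness need to match what the producer actually wrote, here is a small sketch in plain Python. It does not use Flink; the struct module and Python codec names stand in for the format's own encoding logic, and the sample values are arbitrary.

```python
import struct

# A producer writes the INT value 1 as four big-endian bytes.
payload = struct.pack(">i", 1)

# Reading with the matching byte order recovers the value;
# reading with the other byte order silently yields a different number.
as_big_endian = struct.unpack(">i", payload)[0]
as_little_endian = struct.unpack("<i", payload)[0]

# The same applies to text: bytes written as UTF-8 but decoded
# with another charset produce a different string.
text_bytes = "café".encode("utf-8")
decoded_utf8 = text_bytes.decode("utf-8")
decoded_latin1 = text_bytes.decode("iso-8859-1")

print(as_big_endian, as_little_endian)
print(decoded_utf8, decoded_latin1)
```

Neither mismatch raises an error in this sketch, which is why a wrong option value tends to show up as bad data rather than as a failure.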
Data Type Mapping
The table below lists the SQL types the format supports and describes how the serializer and deserializer encode and decode each type.
Flink SQL type | Value |
---|---|
CHAR / VARCHAR / STRING | A UTF-8 (by default) encoded text string. The encoding charset can be configured by ‘raw.charset’. |
BINARY / VARBINARY / BYTES | The sequence of bytes itself. |
BOOLEAN | A single byte indicating the boolean value: 0 means false, 1 means true. |
TINYINT | A single byte of the signed number value. |
SMALLINT | Two bytes with big-endian (by default) encoding. The endianness can be configured by ‘raw.endianness’. |
INT | Four bytes with big-endian (by default) encoding. The endianness can be configured by ‘raw.endianness’. |
BIGINT | Eight bytes with big-endian (by default) encoding. The endianness can be configured by ‘raw.endianness’. |
FLOAT | Four bytes with IEEE 754 format and big-endian (by default) encoding. The endianness can be configured by ‘raw.endianness’. |
DOUBLE | Eight bytes with IEEE 754 format and big-endian (by default) encoding. The endianness can be configured by ‘raw.endianness’. |
RAW | The sequence of bytes serialized by the underlying TypeSerializer of the RAW type. |
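
The byte layouts in the table can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not Flink's implementation: encode_raw is a hypothetical helper, Python codec names are used in place of the charset names Flink accepts, and the RAW type is left out because its bytes depend on the underlying TypeSerializer.

```python
import struct

# struct format codes for the multi-byte numeric types in the table.
_NUMERIC_CODES = {
    "SMALLINT": "h",  # 2 bytes
    "INT": "i",       # 4 bytes
    "BIGINT": "q",    # 8 bytes
    "FLOAT": "f",     # 4 bytes, IEEE 754
    "DOUBLE": "d",    # 8 bytes, IEEE 754
}


def encode_raw(sql_type, value, charset="utf-8", endianness="big-endian"):
    """Encode one value the way the table describes (hypothetical helper)."""
    if value is None:
        return None  # null stays null, as noted at the top of the page
    if sql_type in ("CHAR", "VARCHAR", "STRING"):
        return value.encode(charset)
    if sql_type in ("BINARY", "VARBINARY", "BYTES"):
        return bytes(value)
    if sql_type == "BOOLEAN":
        return b"\x01" if value else b"\x00"
    if sql_type == "TINYINT":
        return struct.pack("b", value)  # one signed byte, no byte order
    if sql_type in _NUMERIC_CODES:
        prefix = ">" if endianness == "big-endian" else "<"
        return struct.pack(prefix + _NUMERIC_CODES[sql_type], value)
    raise ValueError("unsupported type in this sketch: " + sql_type)


print(encode_raw("INT", 1).hex())
print(encode_raw("INT", 1, endianness="little-endian").hex())
print(encode_raw("DOUBLE", 1.0).hex())
```

Comparing the hex output for the two INT calls shows the same four bytes in reversed order, which is the only effect of the endianness setting.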