Transform Clause
Description
The TRANSFORM clause allows users to transform inputs using a user-specified command or script.
Syntax
query:
    SELECT TRANSFORM ( expression [ , ... ] )
    [ inRowFormat ]
    [ inRecordWriter ]
    USING command_or_script
    [ AS colName [ colType ] [ , ... ] ]
    [ outRowFormat ]
    [ outRecordReader ]

rowFormat:
    ROW FORMAT
    ( DELIMITED [FIELDS TERMINATED BY char]
                [COLLECTION ITEMS TERMINATED BY char]
                [MAP KEYS TERMINATED BY char]
                [ESCAPED BY char]
                [LINES SEPARATED BY char]
    | SERDE serde_name
        [WITH SERDEPROPERTIES (
            property_name=property_value,
            property_name=property_value, ...)]
    )

outRowFormat : rowFormat
inRowFormat : rowFormat
outRecordReader : RECORDREADER class_name
inRecordWriter : RECORDWRITER class_name
Note:
MAP ... and REDUCE ... are syntactic transformations of SELECT TRANSFORM ( ... ) in the Hive dialect, so MAP / REDUCE can be used in place of SELECT TRANSFORM.
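For instance, the following two queries should be equivalent (a minimal sketch, using the src table defined in the Examples section below):
-- these two forms express the same transform
FROM src SELECT TRANSFORM(key, value) USING 'cat' AS (c1, c2);
FROM src MAP key, value USING 'cat' AS (c1, c2);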
Parameters
inRowFormat
Specifies the row format used to feed input data into the running script. By default, columns are converted to STRING and delimited by TAB before being fed to the user script; similarly, all NULL values are converted to the literal string \N in order to differentiate NULL values from empty strings.
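For example, a minimal sketch that feeds the script comma-separated fields instead of the default TAB-delimited STRINGs (assuming the src table from the Examples section):
-- feed input to the script as comma-separated fields
SELECT TRANSFORM(key, value)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    USING 'cat'
FROM src;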
outRowFormat
Specifies the row format used to read the output of the running script. By default, the standard output of the user script is treated as TAB-separated STRING columns, any cell containing only \N is re-interpreted as a NULL, and the resulting STRING columns are then cast to the data types specified in the table declaration in the usual way.
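Similarly, a sketch that both feeds and reads comma-separated fields; since 'cat' simply echoes its input, the output delimiter here matches the input delimiter:
-- feed comma-separated input and parse comma-separated output
SELECT TRANSFORM(key, value)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    USING 'cat'
    AS (c1 STRING, c2 STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
FROM src;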
inRecordWriter
Specifies the writer (a fully-qualified class name) used to write the input data to the script. The default is org.apache.hadoop.hive.ql.exec.TextRecordWriter.
outRecordReader
Specifies the reader (a fully-qualified class name) used to read the output data of the script. The default is org.apache.hadoop.hive.ql.exec.TextRecordReader.
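As an illustration, spelling out these defaults explicitly should behave the same as omitting them entirely:
-- equivalent to SELECT TRANSFORM(key, value) USING 'cat' FROM src;
SELECT TRANSFORM(key, value)
    RECORDWRITER 'org.apache.hadoop.hive.ql.exec.TextRecordWriter'
    USING 'cat'
    RECORDREADER 'org.apache.hadoop.hive.ql.exec.TextRecordReader'
FROM src;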
command_or_script
Specifies a command or a path to a script that processes the data.
Note:
- Adding a script file and then transforming input using that script is not supported yet.
- The script must be a local script and should be accessible on all hosts in the cluster (see the sketch below).
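A minimal sketch of both forms; /path/to/my_script.sh is a hypothetical path and, per the note above, must exist locally on every host:
-- a shell command available on all hosts
SELECT TRANSFORM(key, value) USING 'cat' FROM src;
-- a local script (hypothetical path; must be accessible on all hosts)
SELECT TRANSFORM(key, value) USING '/path/to/my_script.sh' FROM src;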
colType
Specifies the data type that the output of the command/script should be cast to. By default, it is the STRING data type.
For the clause ( AS colName ( colType )? [, ... ] )?, please be aware of the following behavior:
- If the actual number of output columns is less than the number of user-specified output columns, the extra user-specified columns are filled with NULL.
- If the actual number of output columns is more than the number of user-specified output columns, the actual output is truncated, keeping only the corresponding columns.
- If the user does not specify the clause ( AS colName ( colType )? [, ... ] )?, the default output schema is (key: STRING, value: STRING). The key column contains all the characters before the first tab and the value column contains the remaining characters after the first tab. If there is no tab, NULL is returned for the second column value. Note that this is different from specifying AS key, value, because in that case value only contains the portion between the first tab and the second tab if there are multiple tabs, as illustrated below.
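To make the last two points concrete, here is a sketch (again using src from the Examples section) where 'cat' echoes three TAB-separated fields per row:
-- Default schema (key STRING, value STRING): key is the first field,
-- value is everything after the first tab, i.e. the second and third fields together.
SELECT TRANSFORM(key, value, value) USING 'cat' FROM src;
-- AS key, value: value is only the portion between the first and second tab;
-- the third field is truncated away.
SELECT TRANSFORM(key, value, value) USING 'cat' AS key, value FROM src;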
Examples
CREATE TABLE src(key string, value string);
-- transform using a command
SELECT TRANSFORM(key, value) USING 'cat' FROM src;
-- transform with specific row formats, record writer, and record reader
SELECT TRANSFORM(key, value)
    ROW FORMAT SERDE 'MySerDe' WITH SERDEPROPERTIES ('p1'='v1','p2'='v2')
    RECORDWRITER 'MyRecordWriter'
    USING 'cat'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    RECORDREADER 'MyRecordReader'
FROM src;
-- use keyword MAP instead of TRANSFORM
FROM src INSERT OVERWRITE TABLE dest1 MAP src.key, CAST(src.key / 10 AS INT) USING 'cat' AS (c1, c2);
-- specify the output schema of the transform
SELECT TRANSFORM(key, value) USING 'cat' AS c1, c2 FROM src;
SELECT TRANSFORM(key, value) USING 'cat' AS (c1 INT, c2 INT) FROM src;