Parquet File Output

Source: Hop (GitHub), 2022-06-11 08:59:33

Description

This transform writes data to files in the Apache Parquet format. For more information, see Apache Parquet.

Options

Notes:

- The date optionally included in the output file name(s) is the start of the pipeline execution.
- Hop Date types are serialized as epoch values: milliseconds since 1970-01-01 00:00:00.000.
- Strings are written as binary data in UTF-8.
- Data is converted to the columnar format and compressed in memory, once all rows have been written. To avoid running out of memory, make sure to specify a split size.
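
The date and string notes above matter mostly when reading the files back with another tool. The following is a minimal sketch of the conversions involved, not Hop's own code. It assumes the epoch is taken in UTC (the note does not name a time zone), and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the epoch in the note is 1970-01-01 00:00:00.000 in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(value: datetime) -> int:
    """Return the whole milliseconds elapsed since the epoch."""
    return (value - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis: int) -> datetime:
    """Turn a stored millisecond value back into a timezone-aware datetime."""
    return EPOCH + timedelta(milliseconds=millis)


def decode_string(raw: bytes) -> str:
    """Decode a string column value that was written as UTF-8 bytes."""
    return raw.decode("utf-8")
```

Integer arithmetic on `timedelta` is used instead of floating-point timestamps so that millisecond values survive a round trip unchanged.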

Option	Description
Transform name	Name of the transform. This name has to be unique within a single pipeline.
Base file name	Specify the base filename. It consists of the location where you want to write the Parquet file plus the start of the filename. Examples: to write to Amazon AWS S3: `s3://my-bucket-name/transactions`; to write to a local folder: `/my/folder/customer-data`
Extension	This is the extension of the file. Usually this is simply `parquet`
Include date?	Check this box if you want to include the date in the filename with mask `yyyyMMdd`
Include time?	Check this box if you want to include the time in the filename with mask `HHmmss`
Include date-time-format?	Check this box if you want to include a specific custom date-time format in the filename
Include transform copy number?	Enable this option if you run this transform in multiple copies, so that multiple threads do not write to the same file. The copy number is formatted with mask `00`
Split into parts and include number?	Enable this option if you want to split the output into multiple parts. Specify a split size larger than 0; this is the number of rows per file. The file part (split) number is included in the filename so that the same file is not overwritten. The split number is formatted with mask `0000`
Compression codec	Here you can indicate which compression codec you want to use. The default is SNAPPY for Apache Snappy compression.
Version	Choose the protocol version of Parquet (1.0 or 2.0)
Row group size	The number of rows in a row group
Data page size	The data page size on a 1kB boundary (default is 1048576)
Dictionary page size	The data dictionary page size on a 1kB boundary (default is 1048576)
Fields	You can specify which fields to write and in which order. You can use the “Get Fields” button to populate the dialog.
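
To see how the filename options in the table combine, here is a rough sketch that builds an output name from the documented masks. It is an illustration only: the `-` separator, the order of the parts, and zero-based split numbering are assumptions rather than documented behavior, and `build_filename` and `split_number` are hypothetical helpers, not part of Hop.

```python
from datetime import datetime


def split_number(row_index: int, split_size: int) -> int:
    """Return the part number for a zero-based row index.

    Assumption: parts are numbered from 0 and each holds split_size rows.
    """
    if split_size <= 0:
        raise ValueError("split size must be larger than 0")
    return row_index // split_size


def build_filename(base, extension, start, include_date=False,
                   include_time=False, copy_nr=None, split_nr=None,
                   separator="-"):
    """Compose a file name from the base name and the optional parts.

    Assumption: parts are joined with `separator` in the order below.
    """
    parts = [base]
    if include_date:
        parts.append(start.strftime("%Y%m%d"))  # mask yyyyMMdd
    if include_time:
        parts.append(start.strftime("%H%M%S"))  # mask HHmmss
    if copy_nr is not None:
        parts.append(f"{copy_nr:02d}")          # mask 00
    if split_nr is not None:
        parts.append(f"{split_nr:04d}")         # mask 0000
    return separator.join(parts) + "." + extension


# The date and time come from the start of the pipeline execution.
started = datetime(2022, 6, 11, 8, 59, 33)
name = build_filename("/my/folder/customer-data", "parquet", started,
                      include_date=True, include_time=True,
                      copy_nr=0, split_nr=split_number(35000, 10000))
print(name)  # /my/folder/customer-data-20220611-085933-00-0003.parquet
```

With a split size of 10000, row 35000 lands in part 3 under these assumptions, so each transform copy and each part gets its own file.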
