# Quick Start

## Preparation
Paimon currently supports Spark 3.5, 3.4, 3.3, 3.2, and 3.1. We recommend the latest Spark version for a better experience.

Download the jar file matching your Spark version.
| Version | Jar |
|---|---|
| Spark 3.5 | paimon-spark-3.5-0.8.2.jar |
| Spark 3.4 | paimon-spark-3.4-0.8.2.jar |
| Spark 3.3 | paimon-spark-3.3-0.8.2.jar |
| Spark 3.2 | paimon-spark-3.2-0.8.2.jar |
| Spark 3.1 | paimon-spark-3.1-0.8.2.jar |
You can also manually build the bundled jar from the source code. To build from source, clone the Git repository and run the following command:

```shell
mvn clean install -DskipTests
```

For Spark 3.3, you can find the bundled jar in `./paimon-spark/paimon-spark-3.3/target/paimon-spark-3.3-0.8.2.jar`.
## Setup

If you are using HDFS, make sure that the environment variable `HADOOP_HOME` or `HADOOP_CONF_DIR` is set.
### Step 1: Specify Paimon Jar File
Append the path to the paimon jar file to the `--jars` argument when starting `spark-sql`:

```shell
spark-sql ... --jars /path/to/paimon-spark-3.3-0.8.2.jar
```
Or use the `--packages` option:

```shell
spark-sql ... --packages org.apache.paimon:paimon-spark-3.3:0.8.2
```
Alternatively, you can copy `paimon-spark-3.3-0.8.2.jar` into the `spark/jars` directory of your Spark installation.
### Step 2: Specify Paimon Catalog

#### Catalog
When starting `spark-sql`, use the following command to register Paimon's Spark catalog with the name `paimon`. Table files of the warehouse are stored under `/tmp/paimon`.

```shell
spark-sql ... \
    --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
    --conf spark.sql.catalog.paimon.warehouse=file:/tmp/paimon \
    --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions
```
Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. In the case above, `paimon` is the catalog name; you can change it to a name of your choice.
After the `spark-sql` command line has started, run the following SQL to switch to the `paimon` catalog and to database `default`:

```sql
USE paimon;
USE default;
```
After switching to the catalog (`USE paimon`), Spark's existing tables are no longer directly accessible; you can use `spark_catalog.${database_name}.${table_name}` to reach them.
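For example, while the `paimon` catalog is active, a table in Spark's built-in catalog can still be queried by fully qualifying its name (the table name below is hypothetical):

```sql
-- Reach back into Spark's default catalog from inside the paimon catalog.
SELECT * FROM spark_catalog.default.my_spark_table;
```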
#### Generic Catalog

When starting `spark-sql`, use the following command to register Paimon's Spark generic catalog in place of the Spark default catalog `spark_catalog`. (The default warehouse is Spark's `spark.sql.warehouse.dir`.)

Currently, `SparkGenericCatalog` is only recommended when a Hive metastore is used. Paimon will infer the Hive conf from the Spark session, so you just need to configure Spark's Hive conf.
```shell
spark-sql ... \
    --conf spark.sql.catalog.spark_catalog=org.apache.paimon.spark.SparkGenericCatalog \
    --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions
```
Using `SparkGenericCatalog`, you can use Paimon tables in this catalog alongside non-Paimon tables such as Spark's csv, parquet, and Hive tables.
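As a sketch of this mixed usage, Paimon and non-Paimon tables can live side by side in the same catalog and be joined in one query (both table names here are made up for illustration):

```sql
-- A Paimon table and a plain parquet table in the same catalog.
CREATE TABLE paimon_events (id INT, msg STRING) USING paimon;
CREATE TABLE parquet_events (id INT, msg STRING) USING parquet;

-- Query across both in a single statement.
SELECT p.id, q.msg
FROM paimon_events p
JOIN parquet_events q ON p.id = q.id;
```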
## Create Table

### Catalog

```sql
CREATE TABLE my_table (
    k INT,
    v STRING
) TBLPROPERTIES (
    'primary-key' = 'k'
);
```
### Generic Catalog

```sql
CREATE TABLE my_table (
    k INT,
    v STRING
) USING paimon
TBLPROPERTIES (
    'primary-key' = 'k'
);
```
## Insert Table

Paimon currently supports Spark 3.2+ for SQL writes.

```sql
INSERT INTO my_table VALUES (1, 'Hi'), (2, 'Hello');
```
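Because `my_table` declares `'primary-key' = 'k'`, rows inserted with an existing key are merged by key; with Paimon's default deduplicate merge engine the latest row wins. A small sketch (the inserted value is illustrative):

```sql
-- Insert a row whose key already exists in the table.
INSERT INTO my_table VALUES (1, 'Hi Again');

-- A subsequent SELECT sees the latest value for k = 1,
-- not two rows with the same key.
```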
## Query Table

### SQL

```sql
SELECT * FROM my_table;
/*
1	Hi
2	Hello
*/
```
### DataFrame

```scala
val dataset = spark.read.format("paimon").load("file:/tmp/paimon/default.db/my_table")
dataset.show()
/*
+---+------+
|  k|     v|
+---+------+
|  1|    Hi|
|  2| Hello|
+---+------+
*/
```
## Spark Type Conversion

This section lists all supported type conversions between Spark and Paimon. All Spark data types are available in the package `org.apache.spark.sql.types`.
| Spark Data Type | Paimon Data Type | Atomic Type |
|---|---|---|
| StructType | RowType | false |
| MapType | MapType | false |
| ArrayType | ArrayType | false |
| BooleanType | BooleanType | true |
| ByteType | TinyIntType | true |
| ShortType | SmallIntType | true |
| IntegerType | IntType | true |
| LongType | BigIntType | true |
| FloatType | FloatType | true |
| DoubleType | DoubleType | true |
| StringType | VarCharType, CharType | true |
| DateType | DateType | true |
| TimestampType | TimestampType, LocalZonedTimestamp | true |
| DecimalType(precision, scale) | DecimalType(precision, scale) | true |
| BinaryType | VarBinaryType, BinaryType | true |
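To illustrate a few rows of the table above, here is a hypothetical `CREATE TABLE` whose Spark SQL column types map to the Paimon types noted in the comments (the table and column names are made up for this sketch):

```sql
-- Each column's Spark type maps to the Paimon type shown on the right.
CREATE TABLE type_demo (
    flag    BOOLEAN,          -- BooleanType -> BooleanType
    id      INT,              -- IntegerType -> IntType
    total   BIGINT,           -- LongType    -> BigIntType
    name    STRING,           -- StringType  -> VarCharType
    born    DATE,             -- DateType    -> DateType
    amount  DECIMAL(10, 2)    -- DecimalType -> DecimalType(10, 2)
);
```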