Parquet Files

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

Loading Data Programmatically

Using the data from the above example:

  // Encoders for most common types are automatically provided by importing spark.implicits._
  import spark.implicits._

  val peopleDF = spark.read.json("examples/src/main/resources/people.json")

  // DataFrames can be saved as Parquet files, maintaining the schema information
  peopleDF.write.parquet("people.parquet")

  // Read in the parquet file created above
  // Parquet files are self-describing so the schema is preserved
  // The result of loading a Parquet file is also a DataFrame
  val parquetFileDF = spark.read.parquet("people.parquet")

  // Parquet files can also be used to create a temporary view and then used in SQL statements
  parquetFileDF.createOrReplaceTempView("parquetFile")
  val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
  namesDF.map(attributes => "Name: " + attributes(0)).show()
  // +------------+
  // |       value|
  // +------------+
  // |Name: Justin|
  // +------------+

Find full example code at “examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala” in the Spark repo.

  import org.apache.spark.api.java.function.MapFunction;
  import org.apache.spark.sql.Encoders;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;

  Dataset<Row> peopleDF = spark.read().json("examples/src/main/resources/people.json");

  // DataFrames can be saved as Parquet files, maintaining the schema information
  peopleDF.write().parquet("people.parquet");

  // Read in the Parquet file created above.
  // Parquet files are self-describing so the schema is preserved
  // The result of loading a parquet file is also a DataFrame
  Dataset<Row> parquetFileDF = spark.read().parquet("people.parquet");

  // Parquet files can also be used to create a temporary view and then used in SQL statements
  parquetFileDF.createOrReplaceTempView("parquetFile");
  Dataset<Row> namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19");
  Dataset<String> namesDS = namesDF.map(
      (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
      Encoders.STRING());
  namesDS.show();
  // +------------+
  // |       value|
  // +------------+
  // |Name: Justin|
  // +------------+

Find full example code at “examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java” in the Spark repo.

  peopleDF = spark.read.json("examples/src/main/resources/people.json")

  # DataFrames can be saved as Parquet files, maintaining the schema information.
  peopleDF.write.parquet("people.parquet")

  # Read in the Parquet file created above.
  # Parquet files are self-describing so the schema is preserved.
  # The result of loading a parquet file is also a DataFrame.
  parquetFile = spark.read.parquet("people.parquet")

  # Parquet files can also be used to create a temporary view and then used in SQL statements.
  parquetFile.createOrReplaceTempView("parquetFile")
  teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
  teenagers.show()
  # +------+
  # |  name|
  # +------+
  # |Justin|
  # +------+

Find full example code at “examples/src/main/python/sql/datasource.py” in the Spark repo.

  df <- read.df("examples/src/main/resources/people.json", "json")

  # SparkDataFrame can be saved as Parquet files, maintaining the schema information.
  write.parquet(df, "people.parquet")

  # Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
  # The result of loading a parquet file is also a DataFrame.
  parquetFile <- read.parquet("people.parquet")

  # Parquet files can also be used to create a temporary view and then used in SQL statements.
  createOrReplaceTempView(parquetFile, "parquetFile")
  teenagers <- sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
  head(teenagers)
  ##     name
  ## 1 Justin

  # We can also run custom R-UDFs on Spark DataFrames. Here we prefix all the names with "Name:"
  schema <- structType(structField("name", "string"))
  teenNames <- dapply(df, function(p) { cbind(paste("Name:", p$name)) }, schema)
  for (teenName in collect(teenNames)$name) {
    cat(teenName, "\n")
  }
  ## Name: Michael
  ## Name: Andy
  ## Name: Justin

Find full example code at “examples/src/main/r/RSparkSQLExample.R” in the Spark repo.

  CREATE TEMPORARY VIEW parquetTable
  USING org.apache.spark.sql.parquet
  OPTIONS (
    path "examples/src/main/resources/people.parquet"
  )

  SELECT * FROM parquetTable

Partition Discovery

Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) are able to discover and infer partitioning information automatically. For example, we can store all our previously used population data into a partitioned table using the following directory structure, with two extra columns, gender and country, as partitioning columns:

  path
  └── to
      └── table
          ├── gender=male
          │   ├── ...
          │   ├── country=US
          │   │   └── data.parquet
          │   ├── country=CN
          │   │   └── data.parquet
          │   └── ...
          └── gender=female
              ├── ...
              ├── country=US
              │   └── data.parquet
              ├── country=CN
              │   └── data.parquet
              └── ...

By passing path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths. Now the schema of the returned DataFrame becomes:

  root
   |-- name: string (nullable = true)
   |-- age: long (nullable = true)
   |-- gender: string (nullable = true)
   |-- country: string (nullable = true)

Notice that the data types of the partitioning columns are automatically inferred. Currently, numeric data types, date, timestamp and string type are supported. Sometimes users may not want to automatically infer the data types of the partitioning columns. For these use cases, the automatic type inference can be configured by spark.sql.sources.partitionColumnTypeInference.enabled, which defaults to true. When type inference is disabled, string type will be used for the partitioning columns.
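For instance, a minimal sketch in Scala of turning the inference off for the current session (after which partition columns such as gender and country are simply read back as strings):

  // Disable partition column type inference; partition values will be treated as strings.
  spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")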

Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths by default. For the above example, if users pass path/to/table/gender=male to either SparkSession.read.parquet or SparkSession.read.load, gender will not be considered as a partitioning column. If users need to specify the base path that partition discovery should start with, they can set basePath in the data source options. For example, when path/to/table/gender=male is the path of the data and users set basePath to path/to/table/, gender will be a partitioning column.
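For example, a minimal Scala sketch of the basePath option applied to the directory layout above:

  // Read only the gender=male subtree, but keep "gender" as a partitioning column
  // by pointing basePath at the table root.
  val malesDF = spark.read
    .option("basePath", "path/to/table")
    .parquet("path/to/table/gender=male")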

Schema Merging

Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by

  1. setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
  2. setting the global SQL option spark.sql.parquet.mergeSchema to true.
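A minimal sketch of the second, global approach (the per-read option is shown in the examples below):

  // Enable schema merging for all Parquet reads in this session.
  spark.conf.set("spark.sql.parquet.mergeSchema", "true")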
  // This is used to implicitly convert an RDD to a DataFrame.
  import spark.implicits._

  // Create a simple DataFrame, store into a partition directory
  val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
  squaresDF.write.parquet("data/test_table/key=1")

  // Create another DataFrame in a new partition directory,
  // adding a new column and dropping an existing column
  val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
  cubesDF.write.parquet("data/test_table/key=2")

  // Read the partitioned table
  val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
  mergedDF.printSchema()

  // The final schema consists of all 3 columns in the Parquet files together
  // with the partitioning column that appears in the partition directory paths
  // root
  //  |-- value: int (nullable = true)
  //  |-- square: int (nullable = true)
  //  |-- cube: int (nullable = true)
  //  |-- key: int (nullable = true)

Find full example code at “examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala” in the Spark repo.

  import java.io.Serializable;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;

  public static class Square implements Serializable {
    private int value;
    private int square;

    // Getters and setters...
  }

  public static class Cube implements Serializable {
    private int value;
    private int cube;

    // Getters and setters...
  }

  List<Square> squares = new ArrayList<>();
  for (int value = 1; value <= 5; value++) {
    Square square = new Square();
    square.setValue(value);
    square.setSquare(value * value);
    squares.add(square);
  }

  // Create a simple DataFrame, store into a partition directory
  Dataset<Row> squaresDF = spark.createDataFrame(squares, Square.class);
  squaresDF.write().parquet("data/test_table/key=1");

  List<Cube> cubes = new ArrayList<>();
  for (int value = 6; value <= 10; value++) {
    Cube cube = new Cube();
    cube.setValue(value);
    cube.setCube(value * value * value);
    cubes.add(cube);
  }

  // Create another DataFrame in a new partition directory,
  // adding a new column and dropping an existing column
  Dataset<Row> cubesDF = spark.createDataFrame(cubes, Cube.class);
  cubesDF.write().parquet("data/test_table/key=2");

  // Read the partitioned table
  Dataset<Row> mergedDF = spark.read().option("mergeSchema", true).parquet("data/test_table");
  mergedDF.printSchema();

  // The final schema consists of all 3 columns in the Parquet files together
  // with the partitioning column that appears in the partition directory paths
  // root
  //  |-- value: int (nullable = true)
  //  |-- square: int (nullable = true)
  //  |-- cube: int (nullable = true)
  //  |-- key: int (nullable = true)

Find full example code at “examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java” in the Spark repo.

  from pyspark.sql import Row

  # spark is from the previous example.
  # Create a simple DataFrame, stored into a partition directory
  sc = spark.sparkContext

  squaresDF = spark.createDataFrame(sc.parallelize(range(1, 6))
                                    .map(lambda i: Row(single=i, double=i ** 2)))
  squaresDF.write.parquet("data/test_table/key=1")

  # Create another DataFrame in a new partition directory,
  # adding a new column and dropping an existing column
  cubesDF = spark.createDataFrame(sc.parallelize(range(6, 11))
                                  .map(lambda i: Row(single=i, triple=i ** 3)))
  cubesDF.write.parquet("data/test_table/key=2")

  # Read the partitioned table
  mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
  mergedDF.printSchema()

  # The final schema consists of all 3 columns in the Parquet files together
  # with the partitioning column that appears in the partition directory paths.
  # root
  #  |-- double: long (nullable = true)
  #  |-- single: long (nullable = true)
  #  |-- triple: long (nullable = true)
  #  |-- key: integer (nullable = true)

Find full example code at “examples/src/main/python/sql/datasource.py” in the Spark repo.

  df1 <- createDataFrame(data.frame(single=c(12, 29), double=c(19, 23)))
  df2 <- createDataFrame(data.frame(double=c(19, 23), triple=c(23, 18)))

  # Create a simple DataFrame, stored into a partition directory
  write.df(df1, "data/test_table/key=1", "parquet", "overwrite")

  # Create another DataFrame in a new partition directory,
  # adding a new column and dropping an existing column
  write.df(df2, "data/test_table/key=2", "parquet", "overwrite")

  # Read the partitioned table
  df3 <- read.df("data/test_table", "parquet", mergeSchema = "true")
  printSchema(df3)

  # The final schema consists of all 3 columns in the Parquet files together
  # with the partitioning column that appears in the partition directory paths
  ## root
  ##  |-- single: double (nullable = true)
  ##  |-- double: double (nullable = true)
  ##  |-- triple: double (nullable = true)
  ##  |-- key: integer (nullable = true)

Find full example code at “examples/src/main/r/RSparkSQLExample.R” in the Spark repo.

Hive metastore Parquet table conversion

When reading from Hive metastore Parquet tables and writing to non-partitioned Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.
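For example, a minimal Scala sketch of turning the conversion off so that the Hive SerDe is used instead:

  // Fall back to the Hive SerDe for Hive metastore Parquet tables.
  spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")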

Hive/Parquet Schema Reconciliation

There are two key differences between Hive and Parquet from the perspective of table schema processing.

  1. Hive is case insensitive, while Parquet is not
  2. Hive considers all columns nullable, while nullability in Parquet is significant

For this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting a Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:

  1. Fields that have the same name in both schemas must have the same data type regardless of nullability. The reconciled field should have the data type of the Parquet side, so that nullability is respected.

  2. The reconciled schema contains exactly those fields defined in the Hive metastore schema.

    • Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
    • Any fields that only appear in the Hive metastore schema are added as nullable fields in the reconciled schema.
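As a hypothetical illustration of these rules: if the Hive metastore schema is (name string, age bigint, address string) and the Parquet files contain (name string, age bigint, zip string), the reconciled schema is (name string, age bigint, address string). The shared fields name and age take the data types of the Parquet side, address appears only in the metastore and is added as a nullable field, and zip appears only in the Parquet files and is dropped.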

Metadata Refreshing

Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, the metadata of those converted tables is also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.

  // spark is an existing SparkSession
  spark.catalog.refreshTable("my_table")

  // spark is an existing SparkSession
  spark.catalog().refreshTable("my_table");

  # spark is an existing SparkSession
  spark.catalog.refreshTable("my_table")

  refreshTable("my_table")

  REFRESH TABLE my_table;

Columnar Encryption

Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+.

Parquet uses the envelope encryption practice, where file parts are encrypted with “data encryption keys” (DEKs), and the DEKs are encrypted with “master encryption keys” (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of the user’s choice. The Parquet Maven repository has a jar with a mock KMS implementation that makes it possible to run column encryption and decryption with a spark-shell only, without deploying a KMS server (download the parquet-hadoop-tests.jar file and place it in the Spark jars folder):

  sc.hadoopConfiguration.set("parquet.encryption.kms.client.class",
    "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")

  // Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
  sc.hadoopConfiguration.set("parquet.encryption.key.list",
    "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==")

  // Activate Parquet encryption, driven by Hadoop properties
  sc.hadoopConfiguration.set("parquet.crypto.factory.class",
    "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")

  // Write encrypted dataframe files.
  // Column "square" will be protected with master key "keyA".
  // Parquet file footers will be protected with master key "keyB"
  squaresDF.write.
    option("parquet.encryption.column.keys", "keyA:square").
    option("parquet.encryption.footer.key", "keyB").
    parquet("/path/to/table.parquet.encrypted")

  // Read encrypted dataframe files
  val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
  sc.hadoopConfiguration().set("parquet.encryption.kms.client.class",
    "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");

  // Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
  sc.hadoopConfiguration().set("parquet.encryption.key.list",
    "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==");

  // Activate Parquet encryption, driven by Hadoop properties
  sc.hadoopConfiguration().set("parquet.crypto.factory.class",
    "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");

  // Write encrypted dataframe files.
  // Column "square" will be protected with master key "keyA".
  // Parquet file footers will be protected with master key "keyB"
  squaresDF.write()
    .option("parquet.encryption.column.keys", "keyA:square")
    .option("parquet.encryption.footer.key", "keyB")
    .parquet("/path/to/table.parquet.encrypted");

  // Read encrypted dataframe files
  Dataset<Row> df2 = spark.read().parquet("/path/to/table.parquet.encrypted");
  # Set hadoop configuration properties, e.g. using configuration properties of
  # the Spark job:
  # --conf spark.hadoop.parquet.encryption.kms.client.class=\
  #           "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS"\
  # --conf spark.hadoop.parquet.encryption.key.list=\
  #           "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA=="\
  # --conf spark.hadoop.parquet.crypto.factory.class=\
  #           "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory"

  # Write encrypted dataframe files.
  # Column "square" will be protected with master key "keyA".
  # Parquet file footers will be protected with master key "keyB"
  squaresDF.write\
      .option("parquet.encryption.column.keys", "keyA:square")\
      .option("parquet.encryption.footer.key", "keyB")\
      .parquet("/path/to/table.parquet.encrypted")

  # Read encrypted dataframe files
  df2 = spark.read.parquet("/path/to/table.parquet.encrypted")

KMS Client

The InMemoryKMS class is provided only for illustration and simple demonstration of Parquet encryption functionality. It should not be used in a real deployment. The master encryption keys must be kept and managed in a production-grade KMS system, deployed in the user’s organization. Rollout of Spark with Parquet encryption requires implementation of a client class for the KMS server. Parquet provides a plug-in interface for the development of such classes:

  public interface KmsClient {
    // Wraps a key - encrypts it with the master key.
    public String wrapKey(byte[] keyBytes, String masterKeyIdentifier);

    // Decrypts (unwraps) a key with the master key.
    public byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier);

    // Use of initialization parameters is optional.
    public void initialize(Configuration configuration, String kmsInstanceID,
                           String kmsInstanceURL, String accessToken);
  }

An example of such a class for an open source KMS can be found in the parquet-mr repository. The production KMS client should be designed in cooperation with the organization’s security administrators, and built by developers with experience in access control management. Once such a class is created, it can be passed to applications via the parquet.encryption.kms.client.class parameter and leveraged by general Spark users as shown in the encrypted DataFrame write/read sample above.
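As a rough illustration only, the sketch below implements the interface in Scala against a hypothetical KMS that exposes REST wrap/unwrap endpoints; the endpoint paths, request format, and authorization scheme are invented for this example and must be replaced by the actual API of your organization’s KMS:

  import java.net.URI
  import java.net.http.{HttpClient, HttpRequest, HttpResponse}
  import java.nio.charset.StandardCharsets
  import java.util.Base64

  import org.apache.hadoop.conf.Configuration
  import org.apache.parquet.crypto.keytools.KmsClient

  // Hypothetical REST-based KMS client; not a real integration.
  class RestKmsClient extends KmsClient {
    private var http: HttpClient = _
    private var kmsUrl: String = _
    private var token: String = _

    override def initialize(configuration: Configuration, kmsInstanceID: String,
                            kmsInstanceURL: String, accessToken: String): Unit = {
      http = HttpClient.newHttpClient()
      kmsUrl = kmsInstanceURL
      token = accessToken
    }

    // Wraps a key: asks the KMS to encrypt it with the identified master key.
    override def wrapKey(keyBytes: Array[Byte], masterKeyIdentifier: String): String =
      post(s"$kmsUrl/wrap/$masterKeyIdentifier", Base64.getEncoder.encodeToString(keyBytes))

    // Unwraps a key: asks the KMS to decrypt it with the identified master key.
    override def unwrapKey(wrappedKey: String, masterKeyIdentifier: String): Array[Byte] =
      Base64.getDecoder.decode(post(s"$kmsUrl/unwrap/$masterKeyIdentifier", wrappedKey))

    private def post(url: String, body: String): String = {
      val request = HttpRequest.newBuilder(URI.create(url))
        .header("Authorization", s"Bearer $token")
        .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
        .build()
      http.send(request, HttpResponse.BodyHandlers.ofString()).body()
    }
  }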

Note: By default, Parquet implements a “double envelope encryption” mode that minimizes the interaction of Spark executors with a KMS server. In this mode, the DEKs are encrypted with “key encryption keys” (KEKs), which are randomly generated by Parquet. The KEKs are encrypted with MEKs in the KMS; the result and the KEK itself are cached in Spark executor memory. Users interested in regular envelope encryption can switch to it by setting the parquet.encryption.double.wrapping parameter to false. For more details on Parquet encryption parameters, visit the parquet-hadoop configuration page.
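For example, a minimal sketch of switching back to regular envelope encryption, reusing the Scala Hadoop-configuration pattern from the sample above:

  // Disable double wrapping; DEKs will be wrapped directly with the MEKs in the KMS.
  sc.hadoopConfiguration.set("parquet.encryption.double.wrapping", "false")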

Data Source Option

Data source options of Parquet can be set via:

  • the .option/.options methods of DataFrameReader, DataFrameWriter, DataStreamReader, and DataStreamWriter
  • the OPTIONS clause at CREATE TABLE USING DATA_SOURCE

Property Name: datetimeRebaseMode
Default: (value of the spark.sql.parquet.datetimeRebaseModeInRead configuration)
Scope: read
Meaning: The datetimeRebaseMode option allows specifying the rebasing mode for the values of the DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS logical types from the Julian to the Proleptic Gregorian calendar. Currently supported modes are:
  • EXCEPTION: fails on reads of ancient dates/timestamps that are ambiguous between the two calendars.
  • CORRECTED: loads dates/timestamps without rebasing.
  • LEGACY: performs rebasing of ancient dates/timestamps from the Julian to the Proleptic Gregorian calendar.

Property Name: int96RebaseMode
Default: (value of the spark.sql.parquet.int96RebaseModeInRead configuration)
Scope: read
Meaning: The int96RebaseMode option allows specifying the rebasing mode for INT96 timestamps from the Julian to the Proleptic Gregorian calendar. Currently supported modes are:
  • EXCEPTION: fails on reads of ancient INT96 timestamps that are ambiguous between the two calendars.
  • CORRECTED: loads INT96 timestamps without rebasing.
  • LEGACY: performs rebasing of ancient timestamps from the Julian to the Proleptic Gregorian calendar.

Property Name: mergeSchema
Default: (value of the spark.sql.parquet.mergeSchema configuration)
Scope: read
Meaning: Sets whether we should merge schemas collected from all Parquet part-files. This will override spark.sql.parquet.mergeSchema.

Property Name: compression
Default: snappy
Scope: write
Meaning: Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd). This will override spark.sql.parquet.compression.codec.
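For example, a minimal Scala sketch of applying these options on a single read and write (the paths are illustrative):

  // Read ancient dates/timestamps without rebasing and merge schemas across part-files.
  val df = spark.read
    .option("datetimeRebaseMode", "CORRECTED")
    .option("mergeSchema", "true")
    .parquet("data/test_table")

  // Override the session compression codec for this write only.
  df.write.option("compression", "zstd").parquet("data/test_table_zstd")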

Other generic options can be found in Generic File Source Options.

Configuration

Configuration of Parquet can be done using the setConf method on SparkSession or by running SET key=value commands using SQL.
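For example, a minimal sketch of both approaches (using spark.conf.set, the runtime-configuration equivalent of setConf, and a SQL SET command):

  // Programmatically, on an existing SparkSession
  spark.conf.set("spark.sql.parquet.filterPushdown", "true")

  // Or via SQL
  spark.sql("SET spark.sql.parquet.compression.codec=zstd")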

Property Name: spark.sql.parquet.binaryAsString
Default: false
Since Version: 1.1.1
Meaning: Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.

Property Name: spark.sql.parquet.int96AsTimestamp
Default: true
Since Version: 1.3.0
Meaning: Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

Property Name: spark.sql.parquet.int96TimestampConversion
Default: false
Since Version: 2.3.0
Meaning: This controls whether timestamp adjustments should be applied to INT96 data when converting to timestamps, for data written by Impala. This is necessary because Impala stores INT96 data with a different timezone offset than Hive & Spark.

Property Name: spark.sql.parquet.outputTimestampType
Default: INT96
Since Version: 2.3.0
Meaning: Sets which Parquet timestamp type to use when Spark writes data to Parquet files. INT96 is a non-standard but commonly used timestamp type in Parquet. TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores the number of microseconds from the Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which means Spark has to truncate the microsecond portion of its timestamp value.

Property Name: spark.sql.parquet.compression.codec
Default: snappy
Since Version: 1.1.1
Meaning: Sets the compression codec used when writing Parquet files. If either compression or parquet.compression is specified in the table-specific options/properties, the precedence would be compression, parquet.compression, spark.sql.parquet.compression.codec. Acceptable values include: none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd. Note that brotli requires BrotliCodec to be installed.

Property Name: spark.sql.parquet.filterPushdown
Default: true
Since Version: 1.2.0
Meaning: Enables Parquet filter push-down optimization when set to true.

Property Name: spark.sql.parquet.aggregatePushdown
Default: false
Since Version: 3.3.0
Meaning: If true, aggregates will be pushed down to Parquet for optimization. MIN, MAX, and COUNT are supported as aggregate expressions. For MIN/MAX, the boolean, integer, float, and date types are supported; for COUNT, all data types are supported. If statistics are missing from any Parquet file footer, an exception will be thrown.

Property Name: spark.sql.hive.convertMetastoreParquet
Default: true
Since Version: 1.1.1
Meaning: When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.

Property Name: spark.sql.parquet.mergeSchema
Default: false
Since Version: 1.5.0
Meaning: When true, the Parquet data source merges schemas collected from all data files; otherwise the schema is picked from the summary file or a random data file if no summary file is available.

Property Name: spark.sql.parquet.respectSummaryFiles
Default: false
Since Version: 1.5.0
Meaning: When true, we assume that all part-files of Parquet are consistent with summary files and we will ignore them when merging the schema. Otherwise, if this is false, which is the default, we will merge all part-files. This should be considered an expert-only option and shouldn’t be enabled before knowing what it means exactly.

Property Name: spark.sql.parquet.writeLegacyFormat
Default: false
Since Version: 1.6.0
Meaning: If true, data will be written in the way of Spark 1.4 and earlier. For example, decimal values will be written in Apache Parquet’s fixed-length byte array format, which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. If Parquet output is intended for use with systems that do not support this newer format, set to true.

Property Name: spark.sql.parquet.enableVectorizedReader
Default: true
Since Version: 2.0.0
Meaning: Enables vectorized Parquet decoding.

Property Name: spark.sql.parquet.enableNestedColumnVectorizedReader
Default: true
Since Version: 3.3.0
Meaning: Enables vectorized Parquet decoding for nested columns (e.g., struct, list, map). Requires spark.sql.parquet.enableVectorizedReader to be enabled.

Property Name: spark.sql.parquet.recordLevelFilter.enabled
Default: false
Since Version: 2.3.0
Meaning: If true, enables Parquet’s native record-level filtering using the pushed-down filters. This configuration only has an effect when spark.sql.parquet.filterPushdown is enabled and the vectorized reader is not used. You can ensure the vectorized reader is not used by setting spark.sql.parquet.enableVectorizedReader to false.

Property Name: spark.sql.parquet.columnarReaderBatchSize
Default: 4096
Since Version: 2.4.0
Meaning: The number of rows to include in a Parquet vectorized reader batch. The number should be carefully chosen to minimize overhead and avoid OOMs when reading data.

Property Name: spark.sql.parquet.fieldId.write.enabled
Default: true
Since Version: 3.3.0
Meaning: Field ID is a native field of the Parquet schema spec. When enabled, Parquet writers will populate the field ID metadata (if present) in the Spark schema to the Parquet schema.

Property Name: spark.sql.parquet.fieldId.read.enabled
Default: false
Since Version: 3.3.0
Meaning: Field ID is a native field of the Parquet schema spec. When enabled, Parquet readers will use field IDs (if present) in the requested Spark schema to look up Parquet fields instead of using column names.

Property Name: spark.sql.parquet.fieldId.read.ignoreMissing
Default: false
Since Version: 3.3.0
Meaning: When the Parquet file doesn’t have any field IDs but the Spark read schema is using field IDs to read, we will silently return nulls when this flag is enabled, or error otherwise.

Property Name: spark.sql.parquet.timestampNTZ.enabled
Default: true
Since Version: 3.4.0
Meaning: Enables TIMESTAMP_NTZ support for Parquet reads and writes. When enabled, TIMESTAMP_NTZ values are written as Parquet timestamp columns with annotation isAdjustedToUTC = false and are inferred in a similar way. When disabled, such values are read as TIMESTAMP_LTZ and have to be converted to TIMESTAMP_LTZ for writes.

Property Name: spark.sql.parquet.datetimeRebaseModeInRead
Default: EXCEPTION
Since Version: 3.0.0
Meaning: The rebasing mode for the values of the DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS logical types from the Julian to the Proleptic Gregorian calendar:
  • EXCEPTION: Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars.
  • CORRECTED: Spark will not rebase and will read the dates/timestamps as is.
  • LEGACY: Spark will rebase dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Parquet files.
This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.

Property Name: spark.sql.parquet.datetimeRebaseModeInWrite
Default: EXCEPTION
Since Version: 3.0.0
Meaning: The rebasing mode for the values of the DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS logical types from the Proleptic Gregorian to the Julian calendar:
  • EXCEPTION: Spark will fail the writing if it sees ancient dates/timestamps that are ambiguous between the two calendars.
  • CORRECTED: Spark will not rebase and will write the dates/timestamps as is.
  • LEGACY: Spark will rebase dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files.

Property Name: spark.sql.parquet.int96RebaseModeInRead
Default: EXCEPTION
Since Version: 3.1.0
Meaning: The rebasing mode for the values of the INT96 timestamp type from the Julian to the Proleptic Gregorian calendar:
  • EXCEPTION: Spark will fail the reading if it sees ancient INT96 timestamps that are ambiguous between the two calendars.
  • CORRECTED: Spark will not rebase and will read the timestamps as is.
  • LEGACY: Spark will rebase INT96 timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Parquet files.
This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.

Property Name: spark.sql.parquet.int96RebaseModeInWrite
Default: EXCEPTION
Since Version: 3.1.0
Meaning: The rebasing mode for the values of the INT96 timestamp type from the Proleptic Gregorian to the Julian calendar:
  • EXCEPTION: Spark will fail the writing if it sees ancient timestamps that are ambiguous between the two calendars.
  • CORRECTED: Spark will not rebase and will write the timestamps as is.
  • LEGACY: Spark will rebase INT96 timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files.