Data Types

Supported Data Types

Spark SQL and DataFrames support the following data types:

  • Numeric types
    • ByteType: Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
    • ShortType: Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
    • IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
    • LongType: Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
    • FloatType: Represents 4-byte single-precision floating point numbers.
    • DoubleType: Represents 8-byte double-precision floating point numbers.
    • DecimalType: Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
  • String type
    • StringType: Represents character string values.
    • VarcharType(length): A variant of StringType which has a length limitation. Data writing will fail if the input string exceeds the length limitation. Note: this type can only be used in table schema, not functions/operators.
    • CharType(length): A variant of VarcharType(length) which is fixed length. Reading a column of type CharType(n) always returns string values of length n. CharType column comparisons pad the shorter value to the length of the longer.
  • Binary type
    • BinaryType: Represents byte sequence values.
  • Boolean type
    • BooleanType: Represents boolean values.
  • Datetime type
    • DateType: Represents values comprising values of fields year, month and day, without a time-zone.
    • TimestampType: Timestamp with local time zone (TIMESTAMP_LTZ). It represents values comprising values of fields year, month, day, hour, minute, and second, with the session local time zone. The timestamp value represents an absolute point in time.
    • TimestampNTZType: Timestamp without time zone (TIMESTAMP_NTZ). It represents values comprising values of fields year, month, day, hour, minute, and second. All operations are performed without taking any time zone into account.
      • Note: TIMESTAMP in Spark is a user-specified alias for one of the TIMESTAMP_LTZ and TIMESTAMP_NTZ variations. Users can set the default timestamp type to TIMESTAMP_LTZ (the default value) or TIMESTAMP_NTZ via the configuration spark.sql.timestampType.
  • Interval types

    • YearMonthIntervalType(startField, endField): Represents a year-month interval which is made up of a contiguous subset of the following fields:

      • MONTH, months within years [0..11],
      • YEAR, years in the range [0..178956970].

      Individual interval fields are non-negative, but an interval itself can have a sign, and be negative.

      startField is the leftmost field, and endField is the rightmost field of the type. Valid values of startField and endField are 0 (MONTH) and 1 (YEAR). Supported year-month interval types are:

      | Year-Month Interval Type | SQL type | An instance of the type |
      |---|---|---|
      | YearMonthIntervalType(YEAR, YEAR) or YearMonthIntervalType(YEAR) | INTERVAL YEAR | INTERVAL '2021' YEAR |
      | YearMonthIntervalType(YEAR, MONTH) | INTERVAL YEAR TO MONTH | INTERVAL '2021-07' YEAR TO MONTH |
      | YearMonthIntervalType(MONTH, MONTH) or YearMonthIntervalType(MONTH) | INTERVAL MONTH | INTERVAL '10' MONTH |
    • DayTimeIntervalType(startField, endField): Represents a day-time interval which is made up of a contiguous subset of the following fields:

      • SECOND, seconds within minutes and possibly fractions of a second [0..59.999999],
      • MINUTE, minutes within hours [0..59],
      • HOUR, hours within days [0..23],
      • DAY, days in the range [0..106751991].

      Individual interval fields are non-negative, but an interval itself can have a sign, and be negative.

      startField is the leftmost field, and endField is the rightmost field of the type. Valid values of startField and endField are 0 (DAY), 1 (HOUR), 2 (MINUTE), 3 (SECOND). Supported day-time interval types are:

      | Day-Time Interval Type | SQL type | An instance of the type |
      |---|---|---|
      | DayTimeIntervalType(DAY, DAY) or DayTimeIntervalType(DAY) | INTERVAL DAY | INTERVAL '100' DAY |
      | DayTimeIntervalType(DAY, HOUR) | INTERVAL DAY TO HOUR | INTERVAL '100 10' DAY TO HOUR |
      | DayTimeIntervalType(DAY, MINUTE) | INTERVAL DAY TO MINUTE | INTERVAL '100 10:30' DAY TO MINUTE |
      | DayTimeIntervalType(DAY, SECOND) | INTERVAL DAY TO SECOND | INTERVAL '100 10:30:40.999999' DAY TO SECOND |
      | DayTimeIntervalType(HOUR, HOUR) or DayTimeIntervalType(HOUR) | INTERVAL HOUR | INTERVAL '123' HOUR |
      | DayTimeIntervalType(HOUR, MINUTE) | INTERVAL HOUR TO MINUTE | INTERVAL '123:10' HOUR TO MINUTE |
      | DayTimeIntervalType(HOUR, SECOND) | INTERVAL HOUR TO SECOND | INTERVAL '123:10:59' HOUR TO SECOND |
      | DayTimeIntervalType(MINUTE, MINUTE) or DayTimeIntervalType(MINUTE) | INTERVAL MINUTE | INTERVAL '1000' MINUTE |
      | DayTimeIntervalType(MINUTE, SECOND) | INTERVAL MINUTE TO SECOND | INTERVAL '1000:01.001' MINUTE TO SECOND |
      | DayTimeIntervalType(SECOND, SECOND) or DayTimeIntervalType(SECOND) | INTERVAL SECOND | INTERVAL '1000.000001' SECOND |
  • Complex types

    • ArrayType(elementType, containsNull): Represents values comprising a sequence of elements with the type of elementType. containsNull is used to indicate if elements of an ArrayType value can have null values.
    • MapType(keyType, valueType, valueContainsNull): Represents values comprising a set of key-value pairs. The data type of keys is described by keyType and the data type of values is described by valueType. For a MapType value, keys are not allowed to have null values. valueContainsNull is used to indicate if values of a MapType value can have null values.
    • StructType(fields): Represents values with the structure described by a sequence of StructFields (fields).
      • StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of these fields can have null values.
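The integral ranges listed above follow directly from each type's two's-complement byte width. As a quick illustration in plain Python (no Spark required; the Spark type names are used only as labels here):

```python
import struct

# Range of an n-byte signed two's-complement integer: [-2**(8n-1), 2**(8n-1) - 1].
ranges = {}
for name, fmt, nbytes in [("ByteType", "b", 1), ("ShortType", "h", 2),
                          ("IntegerType", "i", 4), ("LongType", "q", 8)]:
    lo, hi = -2 ** (8 * nbytes - 1), 2 ** (8 * nbytes - 1) - 1
    struct.pack(fmt, lo)  # packing succeeds: the value fits in nbytes bytes
    struct.pack(fmt, hi)
    ranges[name] = (lo, hi)
    print(f"{name}: {lo} to {hi}")
```

Running this prints the four ranges given in the list above.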

All data types of Spark SQL are located in the package org.apache.spark.sql.types. You can access them by doing:

  import org.apache.spark.sql.types._

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.

| Data type | Value type in Scala | API to access or create a data type |
|---|---|---|
| ByteType | Byte | ByteType |
| ShortType | Short | ShortType |
| IntegerType | Int | IntegerType |
| LongType | Long | LongType |
| FloatType | Float | FloatType |
| DoubleType | Double | DoubleType |
| DecimalType | java.math.BigDecimal | DecimalType |
| StringType | String | StringType |
| BinaryType | Array[Byte] | BinaryType |
| BooleanType | Boolean | BooleanType |
| TimestampType | java.time.Instant or java.sql.Timestamp | TimestampType |
| TimestampNTZType | java.time.LocalDateTime | TimestampNTZType |
| DateType | java.time.LocalDate or java.sql.Date | DateType |
| YearMonthIntervalType | java.time.Period | YearMonthIntervalType |
| DayTimeIntervalType | java.time.Duration | DayTimeIntervalType |
| ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]). Note: The default value of containsNull is true. |
| MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]). Note: The default value of valueContainsNull is true. |
| StructType | org.apache.spark.sql.Row | StructType(fields). Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type in Scala of the data type of this field (for example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]). Note: The default value of nullable is true. |

All data types of Spark SQL are located in the package org.apache.spark.sql.types. To access or create a data type, use the factory methods provided in org.apache.spark.sql.types.DataTypes.

| Data type | Value type in Java | API to access or create a data type |
|---|---|---|
| ByteType | byte or Byte | DataTypes.ByteType |
| ShortType | short or Short | DataTypes.ShortType |
| IntegerType | int or Integer | DataTypes.IntegerType |
| LongType | long or Long | DataTypes.LongType |
| FloatType | float or Float | DataTypes.FloatType |
| DoubleType | double or Double | DataTypes.DoubleType |
| DecimalType | java.math.BigDecimal | DataTypes.createDecimalType() or DataTypes.createDecimalType(precision, scale) |
| StringType | String | DataTypes.StringType |
| BinaryType | byte[] | DataTypes.BinaryType |
| BooleanType | boolean or Boolean | DataTypes.BooleanType |
| TimestampType | java.time.Instant or java.sql.Timestamp | DataTypes.TimestampType |
| TimestampNTZType | java.time.LocalDateTime | DataTypes.TimestampNTZType |
| DateType | java.time.LocalDate or java.sql.Date | DataTypes.DateType |
| YearMonthIntervalType | java.time.Period | DataTypes.YearMonthIntervalType |
| DayTimeIntervalType | java.time.Duration | DataTypes.DayTimeIntervalType |
| ArrayType | java.util.List | DataTypes.createArrayType(elementType) (containsNull will be true) or DataTypes.createArrayType(elementType, containsNull) |
| MapType | java.util.Map | DataTypes.createMapType(keyType, valueType) (valueContainsNull will be true) or DataTypes.createMapType(keyType, valueType, valueContainsNull) |
| StructType | org.apache.spark.sql.Row | DataTypes.createStructType(fields). Note: fields is a List or an array of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type in Java of the data type of this field (for example, int for a StructField with the data type IntegerType) | DataTypes.createStructField(name, dataType, nullable) |

All data types of Spark SQL are located in the package pyspark.sql.types. You can access them by doing:

  from pyspark.sql.types import *

| Data type | Value type in Python | API to access or create a data type |
|---|---|---|
| ByteType | int or long. Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127. | ByteType() |
| ShortType | int or long. Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767. | ShortType() |
| IntegerType | int or long | IntegerType() |
| LongType | long. Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType. | LongType() |
| FloatType | float. Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime. | FloatType() |
| DoubleType | float | DoubleType() |
| DecimalType | decimal.Decimal | DecimalType() |
| StringType | string | StringType() |
| BinaryType | bytearray | BinaryType() |
| BooleanType | bool | BooleanType() |
| TimestampType | datetime.datetime | TimestampType() |
| TimestampNTZType | datetime.datetime | TimestampNTZType() |
| DateType | datetime.date | DateType() |
| DayTimeIntervalType | datetime.timedelta | DayTimeIntervalType() |
| ArrayType | list, tuple, or array | ArrayType(elementType, [containsNull]). Note: The default value of containsNull is True. |
| MapType | dict | MapType(keyType, valueType, [valueContainsNull]). Note: The default value of valueContainsNull is True. |
| StructType | list or tuple | StructType(fields). Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type in Python of the data type of this field (for example, int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]). Note: The default value of nullable is True. |
| Data type | Value type in R | API to access or create a data type |
|---|---|---|
| ByteType | integer. Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127. | "byte" |
| ShortType | integer. Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767. | "short" |
| IntegerType | integer | "integer" |
| LongType | integer. Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType. | "long" |
| FloatType | numeric. Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime. | "float" |
| DoubleType | numeric | "double" |
| DecimalType | Not supported | Not supported |
| StringType | character | "string" |
| BinaryType | raw | "binary" |
| BooleanType | logical | "bool" |
| TimestampType | POSIXct | "timestamp" |
| DateType | Date | "date" |
| ArrayType | vector or list | list(type="array", elementType=elementType, containsNull=[containsNull]). Note: The default value of containsNull is TRUE. |
| MapType | environment | list(type="map", keyType=keyType, valueType=valueType, valueContainsNull=[valueContainsNull]). Note: The default value of valueContainsNull is TRUE. |
| StructType | named list | list(type="struct", fields=fields). Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type in R of the data type of this field (for example, integer for a StructField with the data type IntegerType) | list(name=name, type=dataType, nullable=[nullable]). Note: The default value of nullable is TRUE. |

The following table shows the type names and aliases used in the Spark SQL parser for each data type.

| Data type | SQL name |
|---|---|
| BooleanType | BOOLEAN |
| ByteType | BYTE, TINYINT |
| ShortType | SHORT, SMALLINT |
| IntegerType | INT, INTEGER |
| LongType | LONG, BIGINT |
| FloatType | FLOAT, REAL |
| DoubleType | DOUBLE |
| DateType | DATE |
| TimestampType | TIMESTAMP, TIMESTAMP_LTZ |
| TimestampNTZType | TIMESTAMP_NTZ |
| StringType | STRING |
| BinaryType | BINARY |
| DecimalType | DECIMAL, DEC, NUMERIC |
| YearMonthIntervalType | INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH |
| DayTimeIntervalType | INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND |
| ArrayType | ARRAY<element_type> |
| StructType | STRUCT<field1_name: field1_type, field2_name: field2_type, ...>. Note: ':' is optional. |
| MapType | MAP<key_type, value_type> |

Floating Point Special Values

Spark SQL supports several special floating point values in a case-insensitive manner:

  • Inf/+Inf/Infinity/+Infinity: positive infinity
    • FloatType: equivalent to Scala Float.PositiveInfinity.
    • DoubleType: equivalent to Scala Double.PositiveInfinity.
  • -Inf/-Infinity: negative infinity
    • FloatType: equivalent to Scala Float.NegativeInfinity.
    • DoubleType: equivalent to Scala Double.NegativeInfinity.
  • NaN: not a number
    • FloatType: equivalent to Scala Float.NaN.
    • DoubleType: equivalent to Scala Double.NaN.
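These literals map onto the standard IEEE-754 special values. Python's float() parser happens to accept the same spellings case-insensitively, which makes for a convenient illustration (plain Python, not Spark itself):

```python
import math

# Spark SQL accepts these spellings case-insensitively; Python's float()
# parses the same set, so plain doubles show the resulting values.
parsed = [float(s) for s in ["Inf", "+Infinity", "-inf", "-INFINITY", "nan", "NaN"]]
assert parsed[0] == math.inf and parsed[1] == math.inf     # positive infinity
assert parsed[2] == -math.inf and parsed[3] == -math.inf   # negative infinity
assert math.isnan(parsed[4]) and math.isnan(parsed[5])     # not a number
```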

Positive/Negative Infinity Semantics

There is special handling for positive and negative infinity. They have the following semantics:

  • Positive infinity multiplied by any positive value returns positive infinity.
  • Negative infinity multiplied by any positive value returns negative infinity.
  • Positive infinity multiplied by any negative value returns negative infinity.
  • Negative infinity multiplied by any negative value returns positive infinity.
  • Positive/negative infinity multiplied by 0 returns NaN.
  • Positive/negative infinity is equal to itself.
  • In aggregations, all positive infinity values are grouped together. Similarly, all negative infinity values are grouped together.
  • Positive infinity and negative infinity are treated as normal values in join keys.
  • Positive infinity sorts lower than NaN and higher than any other value.
  • Negative infinity sorts lower than any other value.
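The multiplication and self-equality rules above match IEEE-754 arithmetic, so ordinary doubles reproduce them; only the aggregation, join, and sort-order behavior is Spark-specific. A plain Python check:

```python
import math

inf = float("inf")
assert inf * 2.0 == inf              # positive infinity * positive -> positive infinity
assert (-inf) * 2.0 == -inf          # negative infinity * positive -> negative infinity
assert inf * (-2.0) == -inf          # positive infinity * negative -> negative infinity
assert (-inf) * (-2.0) == inf        # negative infinity * negative -> positive infinity
assert math.isnan(inf * 0.0)         # positive/negative infinity * 0 -> NaN
assert inf == inf and -inf == -inf   # infinity is equal to itself
```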

NaN Semantics

There is special handling for not-a-number (NaN) when dealing with float or double types; Spark's NaN handling does not exactly match standard floating point semantics. Specifically:

  • NaN = NaN returns true.
  • In aggregations, all NaN values are grouped together.
  • NaN is treated as a normal value in join keys.
  • NaN values go last when in ascending order, larger than any other numeric value.
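The first rule is where Spark departs from IEEE-754: a plain double comparison yields NaN != NaN, so an engine implementing Spark's semantics needs NaN-aware equality. A minimal sketch with a hypothetical helper (not Spark code):

```python
import math

def spark_eq(a: float, b: float) -> bool:
    """Hypothetical NaN-aware equality mirroring Spark's `NaN = NaN` rule."""
    if math.isnan(a) and math.isnan(b):
        return True
    return a == b

nan = float("nan")
assert nan != nan          # IEEE-754 / plain Python: NaN is not equal to itself
assert spark_eq(nan, nan)  # Spark SQL semantics: NaN = NaN returns true
assert spark_eq(1.5, 1.5) and not spark_eq(1.5, nan)
```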

Examples

  SELECT double('infinity') AS col;
  +--------+
  |     col|
  +--------+
  |Infinity|
  +--------+

  SELECT float('-inf') AS col;
  +---------+
  |      col|
  +---------+
  |-Infinity|
  +---------+

  SELECT float('NaN') AS col;
  +---+
  |col|
  +---+
  |NaN|
  +---+

  SELECT double('infinity') * 0 AS col;
  +---+
  |col|
  +---+
  |NaN|
  +---+

  SELECT double('-infinity') * (-1234567) AS col;
  +--------+
  |     col|
  +--------+
  |Infinity|
  +--------+

  SELECT double('infinity') < double('NaN') AS col;
  +----+
  | col|
  +----+
  |true|
  +----+

  SELECT double('NaN') = double('NaN') AS col;
  +----+
  | col|
  +----+
  |true|
  +----+

  SELECT double('inf') = double('infinity') AS col;
  +----+
  | col|
  +----+
  |true|
  +----+

  CREATE TABLE test (c1 int, c2 double);
  INSERT INTO test VALUES (1, double('infinity'));
  INSERT INTO test VALUES (2, double('infinity'));
  INSERT INTO test VALUES (3, double('inf'));
  INSERT INTO test VALUES (4, double('-inf'));
  INSERT INTO test VALUES (5, double('NaN'));
  INSERT INTO test VALUES (6, double('NaN'));
  INSERT INTO test VALUES (7, double('-infinity'));
  SELECT COUNT(*), c2 FROM test GROUP BY c2;
  +---------+---------+
  | count(1)|       c2|
  +---------+---------+
  |        2|      NaN|
  |        2|-Infinity|
  |        3| Infinity|
  +---------+---------+