Nullable integer data type
New in version 0.24.0.
Note
IntegerArray is currently experimental. Its API or implementation maychange without warning.
In Working with missing data, we saw that pandas primarily uses NaN
to representmissing data. Because NaN
is a float, this forces an array of integers withany missing values to become floating point. In some cases, this may not mattermuch. But if your integer column is, say, an identifier, casting to float canbe problematic. Some integers cannot even be represented as floating pointnumbers.
Pandas can represent integer data with possibly missing values usingarrays.IntegerArray
. This is an extension typesimplemented within pandas. It is not the default dtype for integers, and will not be inferred;you must explicitly pass the dtype into array()
or Series
:
- In [1]: arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
- In [2]: arr
- Out[2]:
- <IntegerArray>
- [1, 2, NaN]
- Length: 3, dtype: Int64
Or the string alias "Int64"
(note the capital "I"
, to differentiate fromNumPy’s 'int64'
dtype:
- In [3]: pd.array([1, 2, np.nan], dtype="Int64")
- Out[3]:
- <IntegerArray>
- [1, 2, NaN]
- Length: 3, dtype: Int64
This array can be stored in a DataFrame
or Series
like anyNumPy array.
- In [4]: pd.Series(arr)
- Out[4]:
- 0 1
- 1 2
- 2 NaN
- dtype: Int64
You can also pass the list-like object to the Series
constructorwith the dtype.
- In [5]: s = pd.Series([1, 2, np.nan], dtype="Int64")
- In [6]: s
- Out[6]:
- 0 1
- 1 2
- 2 NaN
- dtype: Int64
By default (if you don’t specify dtype
), NumPy is used, and you’ll endup with a float64
dtype Series:
- In [7]: pd.Series([1, 2, np.nan])
- Out[7]:
- 0 1.0
- 1 2.0
- 2 NaN
- dtype: float64
Operations involving an integer array will behave similar to NumPy arrays.Missing values will be propagated, and and the data will be coerced to anotherdtype if needed.
- # arithmetic
- In [8]: s + 1
- Out[8]:
- 0 2
- 1 3
- 2 NaN
- dtype: Int64
- # comparison
- In [9]: s == 1
- Out[9]:
- 0 True
- 1 False
- 2 False
- dtype: bool
- # indexing
- In [10]: s.iloc[1:3]
- Out[10]:
- 1 2
- 2 NaN
- dtype: Int64
- # operate with other dtypes
- In [11]: s + s.iloc[1:3].astype('Int8')
- Out[11]:
- 0 NaN
- 1 4
- 2 NaN
- dtype: Int64
- # coerce when needed
- In [12]: s + 0.01
- Out[12]:
- 0 1.01
- 1 2.01
- 2 NaN
- dtype: float64
These dtypes can operate as part of of DataFrame
.
- In [13]: df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
- In [14]: df
- Out[14]:
- A B C
- 0 1 1 a
- 1 2 1 a
- 2 NaN 3 b
- In [15]: df.dtypes
- Out[15]:
- A Int64
- B int64
- C object
- dtype: object
These dtypes can be merged & reshaped & casted.
- In [16]: pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
- Out[16]:
- A Int64
- B int64
- C object
- dtype: object
- In [17]: df['A'].astype(float)
- Out[17]:
- 0 1.0
- 1 2.0
- 2 NaN
- Name: A, dtype: float64
Reduction and groupby operations such as ‘sum’ work as well.
- In [18]: df.sum()
- Out[18]:
- A 3
- B 5
- C aab
- dtype: object
- In [19]: df.groupby('B').A.sum()
- Out[19]:
- B
- 1 3
- 3 0
- Name: A, dtype: Int64