数据类型

数据类型

The main types stored in pandas objects are float, int, bool, datetime64[ns] and datetime64[ns, tz], timedelta[ns], category and object. In addition these dtypes have item sizes, e.g. int64 and int32. See Series with TZ for more detail on datetime64[ns, tz] dtypes.

A convenient dtypes attribute for DataFrame returns a Series with the data type of each column.

In [344]: dft = pd.DataFrame(dict(A = np.random.rand(3),
   .....:                         B = 1,
   .....:                         C = 'foo',
   .....:                         D = pd.Timestamp('20010102'),
   .....:                         E = pd.Series([1.0]*3).astype('float32'),
   .....:                                     F = False,
   .....:                                     G = pd.Series([1]*3,dtype='int8')))
   .....: 
In [345]: dft
Out[345]: 
          A  B    C          D    E      F  G
0  0.809585  1  foo 2001-01-02  1.0  False  1
1  0.128238  1  foo 2001-01-02  1.0  False  1
2  0.775752  1  foo 2001-01-02  1.0  False  1
In [346]: dft.dtypes
Out[346]: 
A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

On a Series object, use the dtype attribute.

In [347]: dft['A'].dtype
Out[347]: dtype('float64')

If a pandas object contains data with multiple dtypes in a single column, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).

# these ints are coerced to floats
In [348]: pd.Series([1, 2, 3, 4, 5, 6.])
Out[348]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64
# string data forces an ``object`` dtype
In [349]: pd.Series([1, 2, 3, 6., 'foo'])
Out[349]: 
0      1
1      2
2      3
3      6
4    foo
dtype: object

The number of columns of each type in a DataFrame can be found by calling get_dtype_counts().

In [350]: dft.get_dtype_counts()
Out[350]: 
float64           1
float32           1
int64             1
int8              1
datetime64[ns]    1
bool              1
object            1
dtype: int64

Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series, then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.

In [351]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')
In [352]: df1
Out[352]: 
          A
0  0.890400
1  0.283331
2 -0.303613
3 -1.192210
4  0.065420
5  0.455918
6  2.008328
7  0.188942
In [353]: df1.dtypes
Out[353]: 
A    float32
dtype: object
In [354]: df2 = pd.DataFrame(dict( A = pd.Series(np.random.randn(8), dtype='float16'),
   .....:                         B = pd.Series(np.random.randn(8)),
   .....:                         C = pd.Series(np.array(np.random.randn(8), dtype='uint8')) ))
   .....: 
In [355]: df2
Out[355]: 
          A         B    C
0 -0.454346  0.200071  255
1 -0.916504 -0.557756  255
2  0.640625 -0.141988    0
3  2.675781 -0.174060    0
4 -0.007866  0.258626    0
5 -0.204224  0.941688    0
6 -0.100098 -1.849045    0
7 -0.402100 -0.949458    0
In [356]: df2.dtypes
Out[356]: 
A    float16
B    float64
C      uint8
dtype: object

defaults

By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit). The following will all result in int64 dtypes.

In [357]: pd.DataFrame([1, 2], columns=['a']).dtypes
Out[357]: 
a    int64
dtype: object
In [358]: pd.DataFrame({'a': [1, 2]}).dtypes
Out[358]: 
a    int64
dtype: object
In [359]: pd.DataFrame({'a': 1 }, index=list(range(2))).dtypes
Out[359]: 
a    int64
dtype: object

Note that Numpy will choose platform-dependent types when creating arrays. The following WILL result in int32 on 32-bit platform.

In [360]: frame = pd.DataFrame(np.array([1, 2]))

upcasting

Types can potentially be upcasted when combined with other types, meaning they are promoted from the current type (e.g. int to float).

In [361]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
In [362]: df3
Out[362]: 
          A         B      C
0  0.436054  0.200071  255.0
1 -0.633173 -0.557756  255.0
2  0.337012 -0.141988    0.0
3  1.483571 -0.174060    0.0
4  0.057555  0.258626    0.0
5  0.251695  0.941688    0.0
6  1.908231 -1.849045    0.0
7 -0.213158 -0.949458    0.0
In [363]: df3.dtypes
Out[363]: 
A    float32
B    float64
C    float64
dtype: object

The values attribute on a DataFrame return the lower-common-denominator of the dtypes, meaning the dtype that can accommodate ALL of the types in the resulting homogeneous dtyped NumPy array. This can force some upcasting.

In [364]: df3.values.dtype
Out[364]: dtype('float64')

astype

You can use the astype() method to explicitly convert dtypes from one to another. These will by default return a copy, even if the dtype was unchanged (pass copy=False to change this behavior). In addition, they will raise an exception if the astype operation is invalid.

Upcasting is always according to the numpy rules. If two different dtypes are involved in an operation, then the more general one will be used as the result of the operation.

In [365]: df3
Out[365]: 
          A         B      C
0  0.436054  0.200071  255.0
1 -0.633173 -0.557756  255.0
2  0.337012 -0.141988    0.0
3  1.483571 -0.174060    0.0
4  0.057555  0.258626    0.0
5  0.251695  0.941688    0.0
6  1.908231 -1.849045    0.0
7 -0.213158 -0.949458    0.0
In [366]: df3.dtypes
Out[366]: 
A    float32
B    float64
C    float64
dtype: object
# conversion of dtypes
In [367]: df3.astype('float32').dtypes
Out[367]: 
A    float32
B    float32
C    float32
dtype: object

Convert a subset of columns to a specified type using astype().

In [368]: dft = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7, 8, 9]})
In [369]: dft[['a','b']] = dft[['a','b']].astype(np.uint8)
In [370]: dft
Out[370]: 
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9
In [371]: dft.dtypes
Out[371]: 
a    uint8
b    uint8
c    int64
dtype: object

New in version 0.19.0.

Convert certain columns to a specific dtype by passing a dict to astype().

In [372]: dft1 = pd.DataFrame({'a': [1,0,1], 'b': [4,5,6], 'c': [7, 8, 9]})
In [373]: dft1 = dft1.astype({'a': np.bool, 'c': np.float64})
In [374]: dft1
Out[374]: 
       a  b    c
0   True  4  7.0
1  False  5  8.0
2   True  6  9.0
In [375]: dft1.dtypes
Out[375]: 
a       bool
b      int64
c    float64
dtype: object

Note: When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting occurs.

loc() tries to fit in what we are assigning to the current dtypes, while [] will overwrite them taking the dtype from the right hand side. Therefore the following piece of code produces the unintended result.

In [376]: dft = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7, 8, 9]})
In [377]: dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes
Out[377]: 
a    uint8
b    uint8
dtype: object
In [378]: dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)
In [379]: dft.dtypes
Out[379]: 
a    int64
b    int64
c    int64
dtype: object

object conversion

pandas offers various functions to try to force conversion of types from the object dtype to other types. In cases where the data is already of the correct type, but stored in an object array, the DataFrame.infer_objects() and Series.infer_objects() methods can be used to soft convert to the correct type.

In [380]: import datetime
In [381]: df = pd.DataFrame([[1, 2],
   .....:                    ['a', 'b'],
   .....:                    [datetime.datetime(2016, 3, 2), datetime.datetime(2016, 3, 2)]])
   .....: 
In [382]: df = df.T
In [383]: df
Out[383]: 
   0  1                    2
0  1  a  2016-03-02 00:00:00
1  2  b  2016-03-02 00:00:00
In [384]: df.dtypes
Out[384]: 
0    object
1    object
2    object
dtype: object

Because the data was transposed the original inference stored all columns as object, which infer_objects will correct.

In [385]: df.infer_objects().dtypes
Out[385]: 
0             int64
1            object
2    datetime64[ns]
dtype: object

The following functions are available for one dimensional object arrays or scalars to perform hard conversion of objects to a specified type:

to_numeric() (conversion to numeric dtypes)

In [386]: m = ['1.1', 2, 3]
In [387]: pd.to_numeric(m)
Out[387]: array([ 1.1,  2. ,  3. ])

to_datetime() (conversion to datetime objects)

In [388]: import datetime
In [389]: m = ['2016-07-09', datetime.datetime(2016, 3, 2)]
In [390]: pd.to_datetime(m)
Out[390]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

to_timedelta() (conversion to timedelta objects)

In [391]: m = ['5us', pd.Timedelta('1day')]
In [392]: pd.to_timedelta(m)
Out[392]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements that cannot be converted to desired dtype or object. By default, errors='raise', meaning that any errors encountered will be raised during the conversion process. However, if errors='coerce', these errors will be ignored and pandas will convert problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric). This might be useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to represent as missing:

In [393]: import datetime
In [394]: m = ['apple', datetime.datetime(2016, 3, 2)]
In [395]: pd.to_datetime(m, errors='coerce')
Out[395]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)
In [396]: m = ['apple', 2, 3]
In [397]: pd.to_numeric(m, errors='coerce')
Out[397]: array([ nan,   2.,   3.])
In [398]: m = ['apple', pd.Timedelta('1day')]
In [399]: pd.to_timedelta(m, errors='coerce')
Out[399]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)

The errors parameter has a third option of errors='ignore', which will simply return the passed in data if it encounters any errors with the conversion to a desired data type:

In [400]: import datetime
In [401]: m = ['apple', datetime.datetime(2016, 3, 2)]
In [402]: pd.to_datetime(m, errors='ignore')
Out[402]: array(['apple', datetime.datetime(2016, 3, 2, 0, 0)], dtype=object)
In [403]: m = ['apple', 2, 3]
In [404]: pd.to_numeric(m, errors='ignore')
Out[404]: array(['apple', 2, 3], dtype=object)
In [405]: m = ['apple', pd.Timedelta('1day')]
In [406]: pd.to_timedelta(m, errors='ignore')
Out[406]: array(['apple', Timedelta('1 days 00:00:00')], dtype=object)

In addition to object conversion, to_numeric() provides another argument downcast, which gives the option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:

In [407]: m = ['1', 2, 3]
In [408]: pd.to_numeric(m, downcast='integer')   # smallest signed int dtype
Out[408]: array([1, 2, 3], dtype=int8)
In [409]: pd.to_numeric(m, downcast='signed')    # same as 'integer'
Out[409]: array([1, 2, 3], dtype=int8)
In [410]: pd.to_numeric(m, downcast='unsigned')  # smallest unsigned int dtype
Out[410]: array([1, 2, 3], dtype=uint8)
In [411]: pd.to_numeric(m, downcast='float')     # smallest float dtype
Out[411]: array([ 1.,  2.,  3.], dtype=float32)

As these methods apply only to one-dimensional arrays, lists or scalars; they cannot be used directly on multi-dimensional objects such as DataFrames. However, with apply(), we can “apply” the function over each column efficiently:

In [412]: import datetime
In [413]: df = pd.DataFrame([['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')
In [414]: df
Out[414]: 
            0                    1
0  2016-07-09  2016-03-02 00:00:00
1  2016-07-09  2016-03-02 00:00:00
In [415]: df.apply(pd.to_datetime)
Out[415]: 
           0          1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02
In [416]: df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')
In [417]: df
Out[417]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3
In [418]: df.apply(pd.to_numeric)
Out[418]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3
In [419]: df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')
In [420]: df
Out[420]: 
     0                1
0  5us  1 days 00:00:00
1  5us  1 days 00:00:00
In [421]: df.apply(pd.to_timedelta)
Out[421]: 
                0      1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days

gotchas

Performing selection operations on integer type data can easily upcast the data to floating. The dtype of the input data will be preserved in cases where nans are not introduced. See also Support for integer NA.

In [422]: dfi = df3.astype('int32')
In [423]: dfi['E'] = 1
In [424]: dfi
Out[424]: 
   A  B    C  E
0  0  0  255  1
1  0  0  255  1
2  0  0    0  1
3  1  0    0  1
4  0  0    0  1
5  0  0    0  1
6  1 -1    0  1
7  0  0    0  1
In [425]: dfi.dtypes
Out[425]: 
A    int32
B    int32
C    int32
E    int64
dtype: object
In [426]: casted = dfi[dfi>0]
In [427]: casted
Out[427]: 
     A   B      C  E
0  NaN NaN  255.0  1
1  NaN NaN  255.0  1
2  NaN NaN    NaN  1
3  1.0 NaN    NaN  1
4  NaN NaN    NaN  1
5  NaN NaN    NaN  1
6  1.0 NaN    NaN  1
7  NaN NaN    NaN  1
In [428]: casted.dtypes
Out[428]: 
A    float64
B    float64
C    float64
E      int64
dtype: object

While float dtypes are unchanged.

In [429]: dfa = df3.copy()
In [430]: dfa['A'] = dfa['A'].astype('float32')
In [431]: dfa.dtypes
Out[431]: 
A    float32
B    float64
C    float64
dtype: object
In [432]: casted = dfa[df2>0]
In [433]: casted
Out[433]: 
          A         B      C
0       NaN  0.200071  255.0
1       NaN       NaN  255.0
2  0.337012       NaN    NaN
3  1.483571       NaN    NaN
4       NaN  0.258626    NaN
5       NaN  0.941688    NaN
6       NaN       NaN    NaN
7       NaN       NaN    NaN
In [434]: casted.dtypes
Out[434]: 
A    float32
B    float64
C    float64
dtype: object