Tabular core
Basic functions to preprocess tabular data before assembling it in a DataLoaders.
Initial preprocessing
make_date [source]

make_date(df, date_field)

Make sure df[date_field] is of the right date type.
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))
add_datepart [source]

add_datepart(df, field_name, prefix=None, drop=True, time=False)

Helper function that adds columns relevant to a date in the column field_name of df.

For example, if we have a series of dates we can then generate features such as Year, Month, Day, Dayofweek, Is_month_start, etc. as shown below:
df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
df.head()
| | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019.0 | 12.0 | 49.0 | 4.0 | 2.0 | 338.0 | False | False | False | False | False | False | 1.575418e+09 |
1 | NaN | NaN | NaN | NaN | NaN | NaN | False | False | False | False | False | False | NaN |
2 | 2019.0 | 11.0 | 46.0 | 15.0 | 4.0 | 319.0 | False | False | False | False | False | False | 1.573776e+09 |
3 | 2019.0 | 10.0 | 43.0 | 24.0 | 3.0 | 297.0 | False | False | False | False | False | False | 1.571875e+09 |
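If the column also carries a clock time, passing time=True adds Hour, Minute, and Second features as well. A minimal sketch (the event_date column is made up for illustration):

df = pd.DataFrame({'event_date': ['2019-12-04 11:30:00', '2019-11-29 09:15:30']})
df = add_datepart(df, 'event_date', time=True)
df.columns  # should now include event_Hour, event_Minute, event_Second alongside the date parts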
add_elapsed_times [source]

add_elapsed_times(df, field_names, date_field, base_field)

Add in df, for each event in field_names, the elapsed time according to date_field, grouped by base_field.
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df.head()
| | date | event | base | Afterevent | Beforeevent | event_bw | event_fw |
|---|---|---|---|---|---|---|---|
0 | 2019-12-04 | False | 1 | 5 | 0 | 1.0 | 0.0 |
1 | 2019-11-29 | True | 1 | 0 | 0 | 1.0 | 1.0 |
2 | 2019-11-15 | False | 2 | 22 | 0 | 1.0 | 0.0 |
3 | 2019-10-24 | True | 2 | 0 | 0 | 1.0 | 1.0 |
cont_cat_split [source]

cont_cat_split(df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given df.

This function works by determining if a column is continuous or categorical based on the cardinality of its values. If the cardinality is above the max_card parameter (or the column has a float datatype) then it is added to cont_names, else to cat_names. An example is below:
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
cont_names, cat_names = cont_cat_split(df)
cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']
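Passing a dependent variable via dep_var excludes it from both lists; a short sketch on the same df (outputs omitted):

cont_names, cat_names = cont_cat_split(df, max_card=20, dep_var='y1')
# 'y1' is treated as the target, so it should appear in neither cont_names nor cat_names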
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
})
df = add_datepart(df, 'd1_date', drop=False)
df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True, inplace=True)
cont_names, cat_names = cont_cat_split(df, max_card=0)
cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
df_shrink_dtypes [source]

df_shrink_dtypes(df, skip=[], obj2cat=True, int2uint=False)

Return any possible smaller data types for DataFrame columns. Allows object->category, int->uint, and exclusion.

For example, we will make a sample DataFrame with int, float, bool, and object datatypes:
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
'date':['2019-12-04','2019-11-29','2019-11-15',]})
df.dtypes
i int64
f float64
e bool
date object
dtype: object
We can then call df_shrink_dtypes to find the smallest possible datatypes that can support the data:
dt = df_shrink_dtypes(df)
dt
{'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
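The returned mapping can then be applied with ordinary pandas casting; a one-line sketch:

df2 = df.astype(dt)  # casts only the listed columns to the suggested smaller dtypes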
df_shrink [source]

df_shrink(df, skip=[], obj2cat=True, int2uint=False)

Reduce DataFrame memory usage, by casting to smaller types returned by df_shrink_dtypes().

df_shrink(df) attempts to make a DataFrame use less memory, by fitting numeric columns into their smallest datatypes. In addition:

- boolean, category, and datetime64[ns] dtype columns are ignored.
- 'object' type columns are categorified, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
- int2uint=True fits int types into uint types, if all data in the column is >= 0.
- columns can be excluded by name using skip=['col1','col2'].

To get only the new column data types without actually casting the DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
'date':['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])
Let’s compare the two:
df.dtypes
i int64
f float64
u int64
date object
dtype: object
df2.dtypes
i int8
f float32
u int16
date object
dtype: object
We can see that the datatypes changed, and even further we can look at their relative memory usages:
Initial Dataframe: 224 bytes
Reduced Dataframe: 173 bytes
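The notebook's measuring code isn't shown above, but the numbers can presumably be reproduced with pandas' own accounting, e.g.:

print(f"Initial Dataframe: {df.memory_usage().sum()} bytes")
print(f"Reduced Dataframe: {df2.memory_usage().sum()} bytes")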
Here’s another example using the ADULT_SAMPLE dataset:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)
Initial Dataframe: 3.907448 megabytes
Reduced Dataframe: 0.818329 megabytes
We reduced the overall memory used by 79%!
class Tabular [source]

Tabular(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: CollBase
A DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __getitem__. A construction sketch follows the parameter list below.

- df: A DataFrame of your data
- cat_names: Your categorical x variables
- cont_names: Your continuous x variables
- y_names: Your dependent y variables
  - Note: mixed y's, such as regression and classification together, are not currently supported; however, multiple regression or multiple classification outputs are
- y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
- splits: How to split your data
- do_setup: Whether Tabular will run the data through the procs upon initialization
- device: cuda or cpu
- inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this
- reduce_memory: fastai will attempt to reduce the overall memory usage by the inputted DataFrame with df_shrink
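A minimal construction sketch tying these arguments together, using the concrete TabularPandas subclass documented next (the toy DataFrame is made up for illustration):

df = pd.DataFrame({'a': [0, 1, 2, 0], 'b': [1., 2., 3., 4.], 'y': [0, 1, 1, 0]})
splits = RandomSplitter(valid_pct=0.25)(range_of(df))
to = TabularPandas(df, procs=[Categorify, Normalize], cat_names='a', cont_names='b',
                   y_names='y', splits=splits)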
class TabularPandas [source]

TabularPandas(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: Tabular

A Tabular object with transforms.
class TabularProc [source]

TabularProc(enc=None, dec=None, split_idx=None, order=None) :: InplaceTransform

Base class to write a non-lazy tabular processor for dataframes. These transforms are applied as soon as the data is available, rather than as the data is called from the DataLoader.
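As a sketch of what writing one might look like (ClipOutliers is hypothetical, not part of fastai), a proc typically defines setups to compute state from the training split and encodes to apply it eagerly:

class ClipOutliers(TabularProc):
    "Hypothetical proc: clip continuous columns to percentile bounds learned at setup"
    order = 2
    def setups(self, to):
        # learn bounds on the training split only, mirroring how Normalize gathers its stats
        conts = getattr(to, 'train', to).conts
        self.lo,self.hi = conts.quantile(0.01),conts.quantile(0.99)
    def encodes(self, to): to.conts = to.conts.clip(self.lo, self.hi, axis=1)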
class Categorify [source]

Categorify(enc=None, dec=None, split_idx=None, order=None) :: TabularProc

Transform the categorical variables to something similar to pd.Categorical.

While visually you will not see a change in the DataFrame, the classes are stored in to.procs.categorify, as we can see below on a dummy DataFrame:
df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.show()
| | a |
|---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 0 |
4 | 2 |
Each column’s unique values are stored in a dictionary of column:[values]
:
cat = to.procs.categorify
cat.classes
{'a': ['#na#', 0, 1, 2]}
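Values never seen during setup map to the reserved #na# class; a sketch continuing the dummy example (the value 9 is arbitrary):

df_new = pd.DataFrame({'a': [0, 1, 9]})
to_new = to.new(df_new)
to_new.process()
to_new.items.head()  # 9 was not among the training classes, so it should encode to the #na# index (0)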
class FillStrategy [source]

FillStrategy()

Namespace containing the various filling strategies.

Currently, filling with the median, a constant, and the mode are supported.
class FillMissing [source]

FillMissing(fill_strategy=median, add_col=True, fill_vals=None) :: TabularProc
Fill the missing values in continuous columns.
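A small sketch of the strategies in action (the toy column is made up); fill_strategy picks the statistic while add_col controls whether a boolean _na indicator column is added:

df_miss = pd.DataFrame({'a': [0.5, 1.5, np.nan, 2.5]})
to_miss = TabularPandas(df_miss, FillMissing(fill_strategy=FillStrategy.median, add_col=True), cont_names='a')
to_miss.items.head()  # the NaN should become the median (1.5), with an a_na column marking where it was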
class ReadTabBatch [source]

ReadTabBatch(to) :: ItemTransform

Transform TabularPandas values into a Tensor with the ability to decode.
class TabDataLoader [source]

TabDataLoader(dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

A transformed DataLoader for Tabular data.
Integration example
For a more in-depth explanation, see the tabular tutorial
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | Black | False | 28.000000 | 335356.999710 | 9.0 | <50k |
1 | ? | HS-grad | Married-civ-spouse | ? | Husband | White | False | 65.999999 | 37330.998172 | 9.0 | <50k |
2 | Private | Masters | Never-married | #na# | Not-in-family | Asian-Pac-Islander | False | 32.000000 | 116137.997932 | 14.0 | <50k |
3 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 45.000000 | 273434.998017 | 9.0 | <50k |
4 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 51.000000 | 101431.996842 | 9.0 | <50k |
5 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Husband | White | False | 48.000000 | 332465.003428 | 13.0 | <50k |
6 | Private | Some-college | Never-married | Sales | Own-child | White | False | 17.999999 | 192409.000024 | 10.0 | <50k |
7 | Private | HS-grad | Divorced | Machine-op-inspct | Unmarried | Black | True | 37.000000 | 175390.000108 | 10.0 | <50k |
8 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 38.000000 | 192337.000006 | 13.0 | >=50k |
9 | Federal-gov | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | 37.000000 | 32528.006470 | 9.0 | >=50k |
to.show()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
279 | Private | HS-grad | Never-married | #na# | Own-child | White | True | 20.0 | 155775.0 | 10.0 | <50k |
6459 | Private | HS-grad | Divorced | Craft-repair | Not-in-family | White | False | 55.0 | 35551.0 | 9.0 | <50k |
5544 | Private | Assoc-voc | Divorced | Tech-support | Not-in-family | Black | False | 53.0 | 479621.0 | 11.0 | <50k |
3500 | ? | 10th | Never-married | ? | Not-in-family | White | False | 19.0 | 182590.0 | 6.0 | <50k |
3788 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Sales | Husband | White | False | 31.0 | 340880.0 | 13.0 | <50k |
4002 | Self-emp-not-inc | Some-college | Never-married | Sales | Own-child | White | False | 30.0 | 196342.0 | 10.0 | <50k |
204 | ? | HS-grad | Married-civ-spouse | #na# | Husband | White | True | 60.0 | 174073.0 | 10.0 | <50k |
9097 | Private | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | 39.0 | 83893.0 | 9.0 | >=50k |
5972 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 48.0 | 105838.0 | 13.0 | >=50k |
5661 | Private | HS-grad | Never-married | Adm-clerical | Own-child | White | False | 26.0 | 262656.0 | 9.0 | <50k |
We can decode any set of transformed data by calling to.decode_row with our raw data:
row = to.items.iloc[0]
to.decode_row(row)
age 20.0
workclass Private
fnlwgt 155775.0
education HS-grad
education-num 10.0
marital-status Never-married
occupation #na#
relationship Own-child
race White
sex Male
capital-gain 0
capital-loss 0
hours-per-week 30
native-country United-States
salary <50k
education-num_na True
Name: 279, dtype: object
We can create new test datasets based on the training data with to.new().

Note: Since machine learning models can't magically understand categories they were never trained on, the data should reflect this. If there are different missing values in your test data, you should address this before training; one possible guard is sketched below.
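FillMissing only learns fill values for columns that had missing data in df_main, so a hypothetical guard (not from the original notebook) is to pre-fill columns that are missing only in the test set:

# hypothetical guard, assuming the affected columns are numeric
na_train = set(df_main.columns[df_main.isna().any()])
na_test = set(df_test.columns[df_test.isna().any()])
for c in na_test - na_train:
    df_test[c] = df_test[c].fillna(df_main[c].median())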
to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | education-num_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10000 | 0.455476 | 5 | 1.326789 | 10 | 1.178200 | 3 | 2 | 1 | 2 | Male | 0 | 0 | 40 | Philippines | 1 |
10001 | -0.936297 | 5 | 1.240484 | 12 | -0.420714 | 3 | 15 | 1 | 4 | Male | 0 | 0 | 40 | United-States | 1 |
10002 | 1.041486 | 5 | 0.146895 | 2 | -1.220171 | 1 | 9 | 2 | 5 | Female | 0 | 0 | 37 | United-States | 1 |
10003 | 0.528727 | 5 | -0.282639 | 12 | -0.420714 | 7 | 2 | 5 | 5 | Female | 0 | 0 | 43 | United-States | 1 |
10004 | 0.748481 | 6 | 1.428478 | 9 | 0.378743 | 3 | 5 | 1 | 5 | Male | 0 | 0 | 60 | United-States | 1 |
We can then convert it to a DataLoader:
tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | Asian-Pac-Islander | False | 45.000000 | 338105.001967 | 13.0 |
1 | Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | Other | False | 26.000000 | 328663.005601 | 9.0 |
2 | Private | 11th | Divorced | Other-service | Not-in-family | White | False | 53.000000 | 209021.999795 | 7.0 |
3 | Private | HS-grad | Widowed | Adm-clerical | Unmarried | White | False | 46.000000 | 162029.999497 | 9.0 |
4 | Self-emp-inc | Assoc-voc | Married-civ-spouse | Exec-managerial | Husband | White | False | 49.000000 | 349229.997780 | 11.0 |
5 | Local-gov | Some-college | Married-civ-spouse | Exec-managerial | Husband | White | False | 34.000000 | 124827.002450 | 10.0 |
6 | Self-emp-inc | Some-college | Married-civ-spouse | Sales | Husband | White | False | 53.000000 | 290640.001644 | 10.0 |
7 | Private | Some-college | Never-married | Sales | Own-child | White | False | 19.000000 | 106272.998740 | 10.0 |
8 | Private | Some-college | Married-civ-spouse | Protective-serv | Husband | Black | False | 72.000001 | 53684.003462 | 10.0 |
9 | Private | Some-college | Never-married | Sales | Own-child | White | False | 20.000000 | 505980.007069 | 10.0 |
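With a trained Learner (none is built on this page; learn below is hypothetical, e.g. a tabular_learner(dls) after fitting), this test DataLoader can be passed straight to inference:

preds,_ = learn.get_preds(dl=tst_dl)  # hypothetical: requires a trained `learn`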
Other target types
Multi-label categories
One-hot encoded label
def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male'] = np.array(sex)
    df['white'] = np.array(white)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | male | white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | True | False | True |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | True | True | True |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | False | False | False |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | True | True | False |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | False | False | False |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names=["salary", "male", "white"]
%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names=y_names, y_block=MultiCategoryBlock(encoded=True, vocab=y_names), splits=splits)
CPU times: user 60 ms, sys: 0 ns, total: 60 ms
Wall time: 59.4 ms
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | male | white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | HS-grad | Married-civ-spouse | Sales | Husband | White | False | 47.000000 | 186533.999848 | 9.0 | True | True | True |
1 | Private | Some-college | Never-married | Adm-clerical | Not-in-family | White | False | 32.000000 | 115631.001216 | 10.0 | False | False | True |
2 | Federal-gov | Some-college | Widowed | Exec-managerial | Not-in-family | White | False | 60.000001 | 27466.003873 | 10.0 | False | False | True |
3 | Private | HS-grad | Never-married | Other-service | Not-in-family | White | False | 49.000000 | 129639.997602 | 9.0 | False | False | True |
4 | Local-gov | Prof-school | Married-civ-spouse | Prof-specialty | Husband | White | False | 37.000000 | 265038.001582 | 15.0 | True | True | True |
5 | Private | Bachelors | Never-married | Handlers-cleaners | Other-relative | White | False | 23.000001 | 256755.002929 | 13.0 | False | False | True |
6 | Private | HS-grad | Never-married | Machine-op-inspct | Not-in-family | White | False | 39.000000 | 185052.999958 | 9.0 | False | False | True |
7 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | White | False | 28.000000 | 189346.000139 | 9.0 | False | True | True |
8 | Private | 10th | Married-civ-spouse | Other-service | Husband | Asian-Pac-Islander | False | 35.000000 | 176122.999494 | 6.0 | False | True | False |
9 | Private | 5th-6th | Never-married | Machine-op-inspct | Other-relative | White | False | 25.000000 | 521399.996882 | 3.0 | False | True | True |
Not one-hot encoded
def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male': labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k | >50k white |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k | >50k male white |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k | |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k | >50k male |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k | |
@MultiCategorize
def encodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to

@MultiCategorize
def decodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="target", y_block=MultiCategoryBlock(), splits=splits)
CPU times: user 68 ms, sys: 0 ns, total: 68 ms
Wall time: 65 ms
to.procs[2].vocab
['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
Regression
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names='age', splits=splits)
CPU times: user 60 ms, sys: 4 ms, total: 64 ms
Wall time: 63.3 ms
to.procs[-1].means
{'fnlwgt': 192492.332875, 'education-num': 10.075499534606934}
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | fnlwgt | education-num | age |
|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | 9th | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 288185.002301 | 5.0 | 25.0 |
1 | Self-emp-inc | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 383492.997753 | 9.0 | 44.0 |
2 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 84136.001920 | 9.0 | 40.0 |
3 | Private | Bachelors | Never-married | Handlers-cleaners | Own-child | White | True | 31778.002656 | 10.0 | 28.0 |
4 | Private | Some-college | Married-civ-spouse | Adm-clerical | Husband | Black | False | 193036.000001 | 10.0 | 34.0 |
5 | Private | 10th | Divorced | Machine-op-inspct | Not-in-family | Black | False | 131713.998819 | 6.0 | 29.0 |
6 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 275632.002074 | 9.0 | 30.0 |
7 | Private | HS-grad | Married-civ-spouse | Other-service | Husband | White | False | 107236.003015 | 9.0 | 27.0 |
8 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | Black | False | 83878.997816 | 9.0 | 28.0 |
9 | Private | 7th-8th | Never-married | Handlers-cleaners | Own-child | White | False | 255476.000025 | 4.0 | 29.0 |
Not being used now - for multi-modal
class TensorTabular(fastuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]

    def display(self, ctxs): display_df(pd.DataFrame(ctxs))

class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)

class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc

    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())

    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))

class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])
# enc = tds[1]
# test_eq(enc[0][0], tensor([2,1]))
# test_close(enc[0][1], tensor([-0.628828]))
# test_eq(enc[1], 1)
# dec = tds.decode(enc)
# assert isinstance(dec[0], TabularLine)
# test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
# test_eq(dec[1], 'a')
# test_stdout(lambda: print(show_at(tds, 1)), """a 1
# b_na False
# b 1
# category a
# dtype: object""")