Tabular data
Helper functions to get data into a DataLoaders for the tabular application, and the higher-level class TabularDataLoaders
TabularDataLoaders, together with its factory methods, is the main class for getting your data ready for model training. Check out the tabular tutorial for usage examples.
class TabularDataLoaders
TabularDataLoaders(*loaders, path='.', device=None) :: DataLoaders
Basic wrapper around several DataLoaders with factory methods for tabular data
This class should not be used directly; prefer one of the factory methods instead. All of these factory methods accept the following arguments:
cat_names: the names of the categorical variables
cont_names: the names of the continuous variables
y_names: the names of the dependent variables
y_block: the TransformBlock to use for the target
valid_idx: the indices to use for the validation set (defaults to a random split otherwise)
bs: the batch size
val_bs: the batch size for the validation DataLoader (defaults to bs)
shuffle_train: whether to shuffle the training DataLoader
n: overrides the number of elements in the dataset
device: the PyTorch device to use (defaults to default_device())
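As a rough illustration of the valid_idx argument described above, the sketch below splits row indices into a training and a validation set, falling back to a random split when no indices are given. The helper split_indices and its valid_pct and seed parameters are hypothetical and written in plain Python; this is not fastai's implementation.

```python
import random

def split_indices(n_rows, valid_idx=None, valid_pct=0.2, seed=None):
    """Hypothetical helper: return (train_idx, valid_idx) for n_rows rows."""
    if valid_idx is None:
        # No explicit indices were given: pick a random subset for validation
        rng = random.Random(seed)
        shuffled = list(range(n_rows))
        rng.shuffle(shuffled)
        valid_idx = shuffled[:int(n_rows * valid_pct)]
    valid = set(valid_idx)
    # Every row that is not in the validation set goes to training
    train_idx = [i for i in range(n_rows) if i not in valid]
    return train_idx, sorted(valid)

# Same choice as in the examples on this page: rows 800..999 for validation
train_idx, val_idx = split_indices(1000, valid_idx=range(800, 1000))
print(len(train_idx), len(val_idx))  # 800 200
```

Passing explicit indices makes the split reproducible across runs, whereas the random fallback only is when a seed is fixed.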
TabularDataLoaders.from_df
TabularDataLoaders.from_df(df, path='.', procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, valid_idx=None, bs=64, shuffle_train=None, shuffle=True, val_shuffle=False, n=None, device=None, drop_last=None, val_bs=None)
Create from df in path using procs
Let’s have a look at an example with the adult dataset:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
df.head()
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                 y_names="salary", valid_idx=list(range(800,1000)), bs=64)
dls.show_batch()
|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | 11th | Separated | Adm-clerical | Unmarried | Black | False | 55.0 | 213894.000562 | 7.0 | <50k |
| 1 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 53.0 | 228500.001385 | 9.0 | >=50k |
| 2 | Private | HS-grad | Married-civ-spouse | Tech-support | Husband | White | False | 38.0 | 256864.000909 | 9.0 | >=50k |
| 3 | Private | Bachelors | Married-civ-spouse | Tech-support | Husband | White | False | 40.0 | 247879.997190 | 13.0 | >=50k |
| 4 | Private | Some-college | Divorced | Craft-repair | Not-in-family | White | False | 41.0 | 40151.001925 | 10.0 | >=50k |
| 5 | Private | HS-grad | Married-civ-spouse | Sales | Husband | White | False | 37.0 | 110713.001599 | 9.0 | >=50k |
| 6 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 38.0 | 278924.000902 | 13.0 | >=50k |
| 7 | Self-emp-not-inc | 11th | Married-civ-spouse | Farming-fishing | Husband | White | False | 60.0 | 220341.999356 | 7.0 | <50k |
| 8 | ? | 9th | Never-married | ? | Not-in-family | White | False | 30.0 | 104965.001013 | 5.0 | <50k |
| 9 | ? | HS-grad | Never-married | ? | Not-in-family | White | False | 21.0 | 105311.997415 | 9.0 | <50k |
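The education-num_na column in the batch above is added while missing values are filled in. As a rough conceptual sketch of the three procs used in this example, the plain-Python code below encodes a categorical column, fills a missing continuous value and records where it was missing, then rescales the result. The helper names, the tiny made-up columns, and the choice of median filling and population standard deviation are assumptions for illustration, not fastai's code.

```python
import statistics

def categorify(values):
    """Map each label to an integer code, reserving 0 for unknown/missing ("#na#")."""
    vocab = ["#na#"] + sorted({v for v in values if v is not None})
    codes = {label: i for i, label in enumerate(vocab)}
    return [codes.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace None with the median and return an is-missing indicator column."""
    median = statistics.median(v for v in values if v is not None)
    filled = [median if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

def normalize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up miniature columns, loosely modelled on the adult dataset
workclass = ["Private", "Self-emp-inc", None, "Private"]
education_num = [12.0, 14.0, None, 15.0]

codes, vocab = categorify(workclass)
filled, education_num_na = fill_missing(education_num)
scaled = normalize(filled)

print(codes)             # [1, 2, 0, 1]
print(education_num_na)  # [False, False, True, False]
```

The order matters in this sketch: normalizing before filling would fail on the missing entry, which is one reason to list the filling step before the normalization step.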
TabularDataLoaders.from_csv
TabularDataLoaders.from_csv(csv, skipinitialspace=True, path='.', procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, valid_idx=None, bs=64, shuffle_train=None, shuffle=True, val_shuffle=False, n=None, device=None, drop_last=None, val_bs=None)
Create from csv file in path using procs
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                  y_names="salary", valid_idx=list(range(800,1000)), bs=64)
External structured data files can contain unexpected spaces, e.g. after a comma. We can see this in the first row of adult.csv: "49, Private,101320, ...". Such values often need trimming. Pandas has a convenient parameter, skipinitialspace, which TabularDataLoaders.from_csv() exposes. Without it, a category label used later for inference, such as workclass: Private, will be wrongly categorized as 0 ("#na#") if the training label was read as " Private". Let’s test this feature.
test_data = {
'age': [49],
'workclass': ['Private'],
'fnlwgt': [101320],
'education': ['Assoc-acdm'],
'education-num': [12.0],
'marital-status': ['Married-civ-spouse'],
'occupation': [''],
'relationship': ['Wife'],
'race': ['White'],
}
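test_df = pd.DataFrame(test_data)  # named test_df to avoid shadowing the built-in input()
tdl = dls.test_dl(test_df)
# 'Private' must not be encoded as 0, the code for "#na#"
test_ne(0, tdl.dataset.iloc[0]['workclass'])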
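To see why the leading space matters, here is a small standard-library sketch. Python's csv module has a skipinitialspace option with the same name and purpose as the pandas one. The three-column sample is an abbreviated, made-up version of the adult.csv row quoted above, and the encode helper is a hypothetical stand-in for mapping a category to its code, with 0 standing for "#na#".

```python
import csv
import io

# Abbreviated sample with a space after the first comma, as in adult.csv
raw = "age,workclass,fnlwgt\n49, Private,101320\n"

def read_workclass(text, skipinitialspace):
    """Return the workclass values parsed from the CSV text."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=skipinitialspace)
    return [row["workclass"] for row in reader]

def encode(label, vocab):
    """Hypothetical lookup: position in vocab, or 0 ("#na#") when the label is unknown."""
    return vocab.index(label) if label in vocab else 0

# Vocabularies as they would be built from the training file
untrimmed = ["#na#"] + read_workclass(raw, skipinitialspace=False)
trimmed = ["#na#"] + read_workclass(raw, skipinitialspace=True)

print(untrimmed)                     # ['#na#', ' Private']
print(encode("Private", untrimmed))  # 0
print(encode("Private", trimmed))    # 1
```

With the untrimmed vocabulary, a clean "Private" at inference time falls back to the "#na#" code, which is the failure the test above guards against.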
©2021 fast.ai. All rights reserved.
Site last generated: Mar 31, 2021