tabular.data
Base class to deal with tabular data and get a DataBunch
Tabular data handling
This module defines the main class to handle tabular data in the fastai library: TabularDataBunch. As always, there is also a helper function to quickly get your data.
To allow you to easily create a Learner for your data, it provides tabular_learner.
class TabularDataBunch(train_dl:DataLoader, valid_dl:DataLoader, fix_dl:DataLoader=None, test_dl:Optional[DataLoader]=None, device:device=None, dl_tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate', no_check:bool=False) :: DataBunch
Create a DataBunch suitable for tabular data.
The quickest way to get your data into a DataBunch suitable for tabular data is to organize it in two (or three) dataframes: one for training, one for validation and, if you have it, one for testing. Here we use a subsample of the adult dataset held in a single dataframe, with its last 2000 rows set aside for validation through valid_idx.
from fastai.tabular import *  # fastai v1; also exposes pd, untar_data and URLs
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
valid_idx = range(len(df)-2000, len(df))
df.head()
 | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
dep_var = 'salary'
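To make the role of valid_idx concrete, the following sketch splits a small made-up dataframe the same way: rows whose positions are listed in valid_idx form the validation set, and the remaining rows form the training set. The toy data and the split_by_idx helper are illustrative assumptions, not part of fastai.

```python
import pandas as pd

def split_by_idx(df, valid_idx):
    """Return (train, valid) frames; valid holds the rows at positions valid_idx."""
    valid_pos = list(valid_idx)
    valid = df.iloc[valid_pos]
    train = df.drop(df.index[valid_pos])
    return train, valid

# Made-up rows, only loosely modelled on the adult sample.
toy = pd.DataFrame({
    "age": [49, 44, 38, 38, 42],
    "salary": [">=50k", ">=50k", "<50k", ">=50k", "<50k"],
})

# Hold out the last 2 rows, mirroring range(len(df)-2000, len(df)) above.
toy_valid_idx = range(len(toy) - 2, len(toy))
toy_train, toy_valid = split_by_idx(toy, toy_valid_idx)
print(len(toy_train), len(toy_valid))
```

Taking the last rows as validation is only appropriate when the row order carries no information you would leak; a shuffled or stratified choice of indices can be passed in the same way.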
TabularDataBunch is initialized in the same way as DataBunch, so in practice you will want to use the from_df factory method instead.
from_df(path, df:DataFrame, dep_var:str, valid_idx:Collection[int], procs:Optional[Collection[TabularProc]]=None, cat_names:OptStrList=None, cont_names:OptStrList=None, classes:Collection[T_co]=None, test_df=None, bs:int=64, val_bs:int=None, num_workers:int=8, dl_tfms:Optional[Collection[Callable]]=None, device:device=None, collate_fn:Callable='data_collate', no_check:bool=False) → DataBunch
Create a DataBunch from df and valid_idx with dep_var. kwargs are passed to DataBunch.create.
Optionally, use test_df for the test set. The dependent variable is dep_var, while the categorical and continuous variables are in the cat_names and cont_names columns respectively. If cont_names is None, all variables that are neither dependent nor categorical are assumed to be continuous. The TabularProc transforms in procs are applied to the dataframes as preprocessing, then the categories are replaced by their codes+1 (leaving 0 for nan) and the continuous variables are normalized.
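The steps described above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not fastai's implementation: sketch_preprocess and the toy dataframe are hypothetical, and the normalization uses the statistics of the frame it is given.

```python
import pandas as pd

def sketch_preprocess(df, dep_var, cat_names, cont_names=None):
    """Toy version of the described steps; not fastai's actual code."""
    if cont_names is None:
        # Everything that is neither the target nor categorical is continuous.
        cont_names = [c for c in df.columns if c != dep_var and c not in cat_names]
    out = df.copy()
    for name in cat_names:
        # pandas assigns missing values the code -1, so +1 reserves 0 for nan.
        out[name] = pd.Categorical(out[name]).codes + 1
    for name in cont_names:
        col = out[name].astype(float)
        out[name] = (col - col.mean()) / col.std()
    return out, cont_names

toy = pd.DataFrame({
    "workclass": ["Private", None, "Self-emp-inc", "Private"],
    "age": [30, 40, 50, 60],
    "salary": ["<50k", ">=50k", "<50k", ">=50k"],
})
encoded, inferred = sketch_preprocess(toy, dep_var="salary", cat_names=["workclass"])
print(inferred)                       # ['age']
print(encoded["workclass"].tolist())  # [1, 0, 2, 1]
```

The missing workclass value ends up as 0, while the two observed categories receive 1 and 2.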
Note that each TabularProc should be passed as a Callable, that is, as the class itself rather than an instance: the actual initialization with cat_names and cont_names is done during the preprocessing.
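A rough sketch of why classes rather than instances are passed: the preprocessing step can call each class itself once the column names are known. All names below (ToyProc, ToyFillMissing, ToyCategorify, instantiate_procs) are hypothetical stand-ins, not fastai classes.

```python
class ToyProc:
    """Hypothetical stand-in for a tabular proc, built from column names."""
    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names

class ToyFillMissing(ToyProc):
    pass

class ToyCategorify(ToyProc):
    pass

def instantiate_procs(procs, cat_names, cont_names):
    """Call each class with the column names, as a preprocessing step might."""
    return [proc(cat_names, cont_names) for proc in procs]

# Pass the classes themselves (no parentheses), as in procs=[FillMissing, ...].
built = instantiate_procs([ToyFillMissing, ToyCategorify], ["workclass"], ["age"])
print([type(p).__name__ for p in built])  # ['ToyFillMissing', 'ToyCategorify']
```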
procs = [FillMissing, Categorify, Normalize]
data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)
You can then easily create a Learner for this data with tabular_learner.
tabular_learner(data:DataBunch, layers:Collection[int], emb_szs:Dict[str,int]=None, metrics=None, ps:Collection[float]=None, emb_drop:float=0.0, y_range:OptRange=None, use_bn:bool=True, **learn_kwargs)
Get a Learner using data, with metrics, including a TabularModel created using the remaining params.
emb_szs is a dict mapping categorical column names to embedding sizes; you only need to pass sizes for columns where you want to override the default behaviour of the model.
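The override behaviour of emb_szs can be pictured as a dictionary lookup with a fallback. In this sketch, resolve_emb_szs, the cardinalities and the default rule are all assumptions made for illustration; the rule the library applies by default may differ.

```python
def resolve_emb_szs(cardinalities, emb_szs=None, default_rule=None):
    """Pick an embedding size per categorical column, honouring overrides."""
    emb_szs = emb_szs or {}
    if default_rule is None:
        # Placeholder rule for illustration only; the library's default may differ.
        default_rule = lambda n_cat: min(50, n_cat // 2 + 1)
    return {name: emb_szs.get(name, default_rule(n))
            for name, n in cardinalities.items()}

# Hypothetical number of distinct values per categorical column.
cardinalities = {"workclass": 10, "education": 17, "sex": 3}

# Only 'education' is overridden; the other columns fall back to the rule.
sizes = resolve_emb_szs(cardinalities, emb_szs={"education": 8})
print(sizes)  # {'workclass': 6, 'education': 8, 'sex': 2}
```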
class TabularList(items:Iterator[T_co], cat_names:OptStrList=None, cont_names:OptStrList=None, procs=None, **kwargs) → TabularList :: ItemList
Basic ItemList for tabular data.
Basic class to create a list of inputs in items for tabular data. cat_names and cont_names are the names of the categorical and the continuous variables respectively. If a processor is passed, it is applied to the inputs; otherwise one is created from the transforms in procs.
from_df(df:DataFrame, cat_names:OptStrList=None, cont_names:OptStrList=None, procs=None, **kwargs) → ItemList
Get the list of inputs from the rows of df, using the cat_names and cont_names columns.
get_emb_szs(sz_dict=None)
Return the default embedding sizes suitable for this data, or take the ones in sz_dict.
show_xys(xs, ys)
Show the xs (inputs) and ys (targets).
show_xyzs(xs, ys, zs)
Show xs (inputs), ys (targets) and zs (predictions).
class TabularLine(cats, conts, classes, names) :: ItemBase
An object that will contain the encoded cats, the continuous variables conts, the classes and the names of the columns. This is the basic input for a dataset dealing with tabular data.
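As a rough picture of how these four pieces fit together, the toy class below stores them and decodes the categorical codes back to labels for display. ToyTabularLine, its describe method, the "#na#" label at index 0 and the ordering of names (categorical columns first) are assumptions made for this sketch, not the library's definition.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

@dataclass
class ToyTabularLine:
    cats: Sequence[int]            # encoded categorical values, one per column
    conts: Sequence[float]         # continuous values, one per column
    classes: Dict[str, List[str]]  # column name -> labels; index 0 = missing
    names: List[str]               # categorical names first, then continuous

    def describe(self) -> str:
        """Render the row with categorical codes mapped back to labels."""
        n_cat = len(self.cats)
        parts = [f"{n} {self.classes[n][c]}"
                 for n, c in zip(self.names[:n_cat], self.cats)]
        parts += [f"{n} {v:.4f}"
                  for n, v in zip(self.names[n_cat:], self.conts)]
        return "; ".join(parts)

row = ToyTabularLine(
    cats=[1, 0],
    conts=[0.5],
    classes={"workclass": ["#na#", "Private"], "occupation": ["#na#", "Sales"]},
    names=["workclass", "occupation", "age"],
)
print(row.describe())  # workclass Private; occupation #na#; age 0.5000
```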
class TabularProcessor(ds:ItemBase=None, procs=None) :: PreProcessor
Regroup the procs in one PreProcessor.
Create a PreProcessor from procs.
©2021 fast.ai. All rights reserved.
Site last generated: Jan 5, 2021