Collaborative filtering
Tools to quickly get the data and train models suitable for collaborative filtering
This module contains all the high-level functions you need in a collaborative filtering application to assemble your data, get a model and train it with a Learner. We will go over those in order, but you can also check the collaborative filtering tutorial.
Gather the data
class TabularCollab
TabularCollab(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: TabularPandas
Instance of TabularPandas suitable for collaborative filtering (with no continuous variable).
This class exists only to reuse the internals of the tabular application; you will rarely need to use it directly.
class CollabDataLoaders
CollabDataLoaders(*loaders, path='.', device=None) :: DataLoaders
Base DataLoaders for collaborative filtering.
This class should not be used directly; one of the factory methods should be preferred instead. All those factory methods accept as arguments:
- valid_pct: the random percentage of the dataset to set aside for validation (with an optional seed)
- user_name: the name of the column containing the user (defaults to the first column)
- item_name: the name of the column containing the item (defaults to the second column)
- rating_name: the name of the column containing the rating (defaults to the third column)
- path: the folder where to work
- bs: the batch size
- val_bs: the batch size for the validation DataLoader (defaults to bs)
- shuffle: whether to shuffle the training DataLoader
- device: the PyTorch device to use (defaults to default_device())
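To make the roles of valid_pct and seed concrete, here is a minimal standard-library sketch of a random split. The helper random_split is hypothetical and written only for illustration; fastai's own splitter works on tensors and may differ in detail.

```python
import random

def random_split(n_rows, valid_pct=0.2, seed=None):
    """Return (train_idxs, valid_idxs) for a dataset with n_rows rows.

    Hypothetical helper: it mimics the idea of a random split,
    not fastai's actual implementation.
    """
    rng = random.Random(seed)       # a fixed seed makes the split reproducible
    idxs = list(range(n_rows))
    rng.shuffle(idxs)               # random permutation of the row indices
    cut = int(valid_pct * n_rows)   # number of rows held out for validation
    return idxs[cut:], idxs[:cut]

train_idxs, valid_idxs = random_split(100, valid_pct=0.2, seed=42)
print(len(train_idxs), len(valid_idxs))  # 80 20
```

Passing the same seed twice gives the same split, which is useful when comparing models on identical validation data.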
CollabDataLoaders.from_df
CollabDataLoaders.from_df(ratings, valid_pct=0.2, user_name=None, item_name=None, rating_name=None, seed=None, path='.', bs=64, val_bs=None, shuffle=True, device=None)
Create a DataLoaders suitable for collaborative filtering from ratings.
Let’s see how this works on an example:
path = untar_data(URLs.ML_SAMPLE)
ratings = pd.read_csv(path/'ratings.csv')
ratings.head()
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 73 | 1097 | 4.0 | 1255504951 |
| 1 | 561 | 924 | 3.5 | 1172695223 |
| 2 | 157 | 260 | 3.5 | 1291598691 |
| 3 | 358 | 1210 | 5.0 | 957481884 |
| 4 | 130 | 316 | 2.0 | 1138999234 |
dls = CollabDataLoaders.from_df(ratings, bs=64)
dls.show_batch()
| | userId | movieId | rating |
|---|---|---|---|
| 0 | 157 | 1265 | 3.0 |
| 1 | 130 | 2858 | 2.0 |
| 2 | 481 | 1196 | 5.0 |
| 3 | 105 | 597 | 3.5 |
| 4 | 128 | 597 | 5.0 |
| 5 | 587 | 5952 | 4.0 |
| 6 | 111 | 2571 | 5.0 |
| 7 | 105 | 356 | 3.0 |
| 8 | 77 | 5349 | 4.0 |
| 9 | 119 | 1240 | 4.0 |
CollabDataLoaders.from_csv
CollabDataLoaders.from_csv(csv, valid_pct=0.2, user_name=None, item_name=None, rating_name=None, seed=None, path='.', bs=64, val_bs=None, shuffle=True, device=None)
Create a DataLoaders suitable for collaborative filtering from csv.
dls = CollabDataLoaders.from_csv(path/'ratings.csv', bs=64)
Models
fastai provides two kinds of models for collaborative filtering: a dot-product model and a neural net.
class EmbeddingDotBias
EmbeddingDotBias(n_factors, n_users, n_items, y_range=None) :: Module
Base dot model for collaborative filtering.
The model is built with n_factors (the length of the internal vectors), n_users and n_items. For a given user and item, it grabs the corresponding weights and bias and returns
torch.dot(user_w, item_w) + user_b + item_b
Optionally, if y_range is passed, it applies a SigmoidRange to that result.
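The computation described above can be sketched in plain Python for a single user and item. The function names dot_bias and sigmoid_range and the toy numbers are illustrative assumptions; the real model works on batches of embedding tensors.

```python
import math

def sigmoid_range(x, low, high):
    """Squash x into the open interval (low, high) with a scaled sigmoid."""
    return (high - low) / (1.0 + math.exp(-x)) + low

def dot_bias(user_w, item_w, user_b, item_b, y_range=None):
    """Dot product of the two factor vectors plus both biases."""
    raw = sum(u * i for u, i in zip(user_w, item_w)) + user_b + item_b
    if y_range is None:
        return raw                       # unbounded score
    return sigmoid_range(raw, *y_range)  # bounded prediction

# Toy factors for one user and one item (n_factors = 3).
user_w, item_w = [0.1, -0.3, 0.5], [0.4, 0.2, 0.6]
raw = dot_bias(user_w, item_w, user_b=0.05, item_b=-0.02)
pred = dot_bias(user_w, item_w, user_b=0.05, item_b=-0.02, y_range=(0, 5))
print(raw, pred)
```

Because a sigmoid only approaches its bounds asymptotically, the bounded prediction always lies strictly between the two ends of y_range.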
x,y = dls.one_batch()
model = EmbeddingDotBias(50, len(dls.classes['userId']), len(dls.classes['movieId']), y_range=(0,5)).to(x.device)
out = model(x)
assert (0 <= out).all() and (out <= 5).all()
EmbeddingDotBias.from_classes
EmbeddingDotBias.from_classes(n_factors, classes, user=None, item=None, y_range=None)
Build a model with n_factors by inferring n_users and n_items from classes.
y_range is passed to the main init. user and item are the names of the keys for users and items in classes (they default to the first and second key respectively). classes is expected to be a dictionary mapping each key to a list of categories, like the result of dls.classes in a CollabDataLoaders:
dls.classes
{'userId': (#101) ['#na#',15,17,19,23,30,48,56,73,77...],
'movieId': (#101) ['#na#',1,10,32,34,39,47,50,110,150...]}
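As a rough illustration of how the sizes can be inferred from such a dictionary, the sketch below uses a small hand-written classes dictionary and a hypothetical helper infer_sizes; it follows the defaults described above but is not fastai's code.

```python
# Toy stand-in for dls.classes: each key maps to its list of categories.
classes = {
    'userId': ['#na#', 15, 17, 19, 23],
    'movieId': ['#na#', 1, 10, 32],
}

def infer_sizes(classes, user=None, item=None):
    """Return (n_users, n_items) from a classes dictionary.

    Hypothetical helper: user and item default to the first and
    second key respectively, as described in the text above.
    """
    keys = list(classes.keys())
    user = keys[0] if user is None else user
    item = keys[1] if item is None else item
    return len(classes[user]), len(classes[item])

n_users, n_items = infer_sizes(classes)
print(n_users, n_items)  # 5 4
```

Note that the '#na#' placeholder counts as a category, so each size is one more than the number of distinct ids in the toy data.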
Let’s see how it can be used in practice:
model = EmbeddingDotBias.from_classes(50, dls.classes, y_range=(0,5)).to(x.device)
out = model(x)
assert (0 <= out).all() and (out <= 5).all()
Two convenience methods are added to easily access the weights and bias when a model is created with EmbeddingDotBias.from_classes:
EmbeddingDotBias.weight
EmbeddingDotBias.weight(arr, is_item=True)
Weight for item or user (based on is_item) for all elements in arr.
The elements of arr are expected to be class names (which is why the model needs to be created with EmbeddingDotBias.from_classes).
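The lookup by class name can be pictured as a two-step process: map each name to its row index, then read that row from the embedding table. The sketch below uses plain lists and hypothetical names (make_index, item_weights, weight_for); it only illustrates the idea.

```python
def make_index(categories):
    """Map each class name to its position in the category list."""
    return {name: idx for idx, name in enumerate(categories)}

movie_classes = ['#na#', 1, 10, 32, 34]
movie2idx = make_index(movie_classes)

# Toy embedding table: one row of 2 factors per category.
item_weights = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

def weight_for(names):
    """Return the weight rows for the given class names."""
    return [item_weights[movie2idx[name]] for name in names]

print(weight_for([10, 34]))  # [[0.3, 0.4], [0.7, 0.8]]
```

Without the list of classes there would be no way to go from a movie id such as 10 to its row, which is why a model built from raw sizes alone cannot offer this lookup.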
mov = dls.classes['movieId'][42]
w = model.weight([mov])
test_eq(w, model.i_weight(tensor([42])))
EmbeddingDotBias.bias
EmbeddingDotBias.bias(arr, is_item=True)
Bias for item or user (based on is_item) for all elements in arr.
The elements of arr are expected to be class names (which is why the model needs to be created with EmbeddingDotBias.from_classes).
mov = dls.classes['movieId'][42]
b = model.bias([mov])
test_eq(b, model.i_bias(tensor([42])))
class EmbeddingNN
EmbeddingNN(emb_szs, layers, ps=None, embed_p=0.0, y_range=None, use_bn=True, bn_final=False, bn_cont=True, act_cls=ReLU(inplace=True)) :: TabularModel
Subclass TabularModel to create a NN suitable for collaborative filtering.
emb_szs should be a list of two tuples, one for the users, one for the items, each tuple containing the number of users/items and the corresponding embedding size (the function get_emb_sz can give a good default). All the other arguments are passed to TabularModel.
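To show the expected shape of emb_szs, here is a small sketch that builds the two tuples by hand. The sizing heuristic in emb_sz_rule is assumed to match the default used by get_emb_sz at the time of writing; treat it as an assumption and check your installed version. The helper make_emb_szs is hypothetical.

```python
def emb_sz_rule(n_cat):
    """Heuristic embedding size for a variable with n_cat categories.

    Assumption: this mirrors the default rule behind get_emb_sz.
    """
    return min(600, round(1.6 * n_cat ** 0.56))

def make_emb_szs(n_users, n_items):
    """Build [(n_users, user_emb_size), (n_items, item_emb_size)]."""
    return [(n_users, emb_sz_rule(n_users)), (n_items, emb_sz_rule(n_items))]

# 101 users and 101 items, as in the dls.classes output shown earlier.
emb_szs = make_emb_szs(101, 101)
print(emb_szs)  # [(101, 21), (101, 21)]
```

Unlike the dot-product model, the user and item embeddings do not need to have the same size here, because they are concatenated before going through the linear layers.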
emb_szs = get_emb_sz(dls.train_ds, {})
model = EmbeddingNN(emb_szs, [50], y_range=(0,5)).to(x.device)
out = model(x)
assert (0 <= out).all() and (out <= 5).all()
Create a Learner
The following function lets us quickly create a Learner for collaborative filtering from the data.
collab_learner
collab_learner(dls, n_factors=50, use_nn=False, emb_szs=None, layers=None, config=None, y_range=None, loss_func=None, opt_func=Adam, lr=0.001, splitter=trainable_params, cbs=None, metrics=None, path=None, model_dir='models', wd=None, wd_bn_bias=False, train_bn=True, moms=(0.95, 0.85, 0.95))
Create a Learner for collaborative filtering on dls.
If use_nn=False, the model used is an EmbeddingDotBias with n_factors and y_range. Otherwise, it’s an EmbeddingNN for which you can pass emb_szs (inferred from the dls with get_emb_sz if you don’t provide any), layers (defaults to [n_factors]), y_range, and a config that you can create with tabular_config to customize your model.
loss_func will default to MSELossFlat and all the other arguments are passed to Learner.
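For intuition about the default loss, the sketch below computes a mean squared error after flattening predictions and targets, so that a column of predictions can be compared with a flat list of ratings. The helpers flatten and mse_flat are hypothetical plain-Python stand-ins, not the MSELossFlat implementation.

```python
def flatten(values):
    """Flatten arbitrarily nested lists/tuples of numbers into one list."""
    out = []
    for v in values:
        if isinstance(v, (list, tuple)):
            out.extend(flatten(v))
        else:
            out.append(float(v))
    return out

def mse_flat(preds, targets):
    """Mean squared error computed on the flattened inputs."""
    p, t = flatten(preds), flatten(targets)
    if len(p) != len(t):
        raise ValueError("preds and targets must hold the same number of values")
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)

preds = [[3.5], [4.0], [2.0]]   # one prediction per row, shaped like a column
targets = [4.0, 4.0, 1.0]       # flat list of true ratings
loss = mse_flat(preds, targets)
print(loss)
```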
learn = collab_learner(dls, y_range=(0,5))
learn.fit_one_cycle(1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 2.521979 | 2.541627 | 00:00 |
©2021 fast.ai. All rights reserved.
Site last generated: Mar 31, 2021