Text data
Functions and transforms to help gather text data in a Datasets object
Backwards
Training a model on reversed text and ensembling it with a forward model can provide higher accuracy. All that is needed is a type_tfm that reverses the text as it is brought in:
reverse_text
reverse_text(x)
t = tensor([0,1,2])
r = reverse_text(t)
test_eq(r, tensor([2,1,0]))
Numericalizing
Numericalization is the step in which we convert tokens to integers. The first step is to build a token-to-index correspondence, which is called a vocab.
make_vocab
make_vocab(count, min_freq=3, max_vocab=60000, special_toks=None)
Create a vocab of max_vocab size from Counter count, keeping the items that appear at least min_freq times.
If there are more than max_vocab tokens, the ones kept are the most frequent.
Note: For performance when using mixed precision, the vocabulary is always made of size a multiple of 8, potentially by adding xxfake tokens.
count = Counter(['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'd'])
test_eq(set([x for x in make_vocab(count) if not x.startswith('xxfake')]),
set(defaults.text_spec_tok + 'a'.split()))
test_eq(len(make_vocab(count))%8, 0)
test_eq(set([x for x in make_vocab(count, min_freq=1) if not x.startswith('xxfake')]),
set(defaults.text_spec_tok + 'a b c d'.split()))
test_eq(set([x for x in make_vocab(count,max_vocab=12, min_freq=1) if not x.startswith('xxfake')]),
set(defaults.text_spec_tok + 'a b c'.split()))
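The behaviour described above can be approximated in plain Python. This is only a sketch, not the library's implementation: the function name sketch_make_vocab and the SPECIAL list are assumptions made for illustration (the real special tokens come from defaults.text_spec_tok).

```python
from collections import Counter

# Assumed stand-in for defaults.text_spec_tok (illustrative only).
SPECIAL = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld",
           "xxrep", "xxwrep", "xxup", "xxmaj"]

def sketch_make_vocab(count, min_freq=3, max_vocab=60000, special_toks=None):
    "Hypothetical re-implementation of the vocab-building rule."
    special = list(SPECIAL if special_toks is None else special_toks)
    # Keep the most frequent tokens that appear at least `min_freq` times.
    kept = [tok for tok, c in count.most_common(max_vocab)
            if c >= min_freq and tok not in special]
    # Special tokens come first, then the vocab is cut to `max_vocab`.
    vocab = (special + kept)[:max_vocab]
    # Pad with fake tokens so the size is a multiple of 8.
    n_fake = -len(vocab) % 8
    return vocab + ["xxfake"] * n_fake

count = Counter(['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'd'])
vocab = sketch_make_vocab(count)
```

With the default min_freq=3 only 'a' survives next to the special tokens, which mirrors the first test above.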
class TensorText
TensorText(x, **kwargs) :: TensorBase
Semantic type for a tensor representing text
class LMTensorText
LMTensorText(x, **kwargs) :: TensorText
Semantic type for a tensor representing text in language modeling
class Numericalize
Numericalize(vocab=None, min_freq=3, max_vocab=60000, special_toks=None) :: Transform
Reversible transform of tokenized texts to numericalized ids
num = Numericalize(min_freq=2)
num.setup(L('This is an example of text'.split(), 'this is another text'.split()))
start = 'This is an example of text '
If no vocab is passed, one is created at setup from the data, using make_vocab with min_freq and max_vocab.
start = 'This is an example of text'
num = Numericalize(min_freq=1)
num.setup(L(start.split(), 'this is another text'.split()))
test_eq(set([x for x in num.vocab if not x.startswith('xxfake')]),
set(defaults.text_spec_tok + 'This is an example of text this another'.split()))
test_eq(len(num.vocab)%8, 0)
t = num(start.split())
test_eq(t, tensor([11, 9, 12, 13, 14, 10]))
test_eq(num.decode(t), start.split())
num = Numericalize(min_freq=2)
num.setup(L('This is an example of text'.split(), 'this is another text'.split()))
test_eq(set([x for x in num.vocab if not x.startswith('xxfake')]),
set(defaults.text_spec_tok + 'is text'.split()))
test_eq(len(num.vocab)%8, 0)
t = num(start.split())
test_eq(t, tensor([0, 9, 0, 0, 0, 10]))
test_eq(num.decode(t), f'{UNK} is {UNK} {UNK} {UNK} text'.split())
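The encode/decode round trip shown in these tests can be pictured with a small plain-Python sketch. The class SketchNumericalize and the UNK value are hypothetical names used for illustration; this is not the Numericalize transform itself and it works on lists rather than tensors.

```python
UNK = "xxunk"  # assumed name of the unknown token

class SketchNumericalize:
    "Hypothetical token <-> id mapping with an unknown-token fallback."
    def __init__(self, vocab):
        self.vocab = list(vocab)
        # Reverse mapping from token to its index in the vocab.
        self.o2i = {tok: i for i, tok in enumerate(self.vocab)}
        self.unk_idx = self.o2i[UNK]

    def encode(self, tokens):
        # Tokens missing from the vocab map to the index of UNK.
        return [self.o2i.get(tok, self.unk_idx) for tok in tokens]

    def decode(self, ids):
        return [self.vocab[i] for i in ids]

num_sketch = SketchNumericalize([UNK, "is", "text"])
ids = num_sketch.encode("This is an example of text".split())
decoded = num_sketch.decode(ids)
```

Decoding is lossy for out-of-vocabulary tokens: they come back as the unknown token, as in the min_freq=2 test above.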
class LMDataLoader
LMDataLoader(dataset, lens=None, cache=2, bs=64, seq_len=72, num_workers=0, shuffle=False, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL
A DataLoader suitable for language modeling
dataset should be a collection of numericalized texts for this to work. lens can be passed to speed up the creation; otherwise, the LMDataLoader will do a full pass over the dataset to compute them. cache is used to avoid reloading items unnecessarily.
The LMDataLoader will concatenate all texts (possibly shuffled) into one big stream, split it into bs contiguous sequences, then go through those seq_len tokens at a time.
bs,sl = 4,3
ints = L([0,1,2,3,4],[5,6,7,8,9,10],[11,12,13,14,15,16,17,18],[19,20],[21,22,23],[24]).map(tensor)
dl = LMDataLoader(ints, bs=bs, seq_len=sl)
test_eq(list(dl),
[[tensor([[0, 1, 2], [6, 7, 8], [12, 13, 14], [18, 19, 20]]),
tensor([[1, 2, 3], [7, 8, 9], [13, 14, 15], [19, 20, 21]])],
[tensor([[3, 4, 5], [ 9, 10, 11], [15, 16, 17], [21, 22, 23]]),
tensor([[4, 5, 6], [10, 11, 12], [16, 17, 18], [22, 23, 24]])]])
dl = LMDataLoader(ints, bs=bs, seq_len=sl, shuffle=True)
for x,y in dl: test_eq(x[:,1:], y[:,:-1])
((x0,y0), (x1,y1)) = tuple(dl)
#Second batch begins where first batch ended
test_eq(y0[:,-1], x1[:,0])
test_eq(type(x0), LMTensorText)
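The stream-and-split behaviour can be sketched in plain Python on lists. The function lm_batches is a hypothetical helper written for this illustration, assuming no shuffling; it is not the LMDataLoader code, but it reproduces the batches of the first test above.

```python
def lm_batches(texts, bs, seq_len):
    "Hypothetical sketch: concatenate texts, split into bs rows, step by seq_len."
    stream = [tok for text in texts for tok in text]
    # Keep one token in reserve so the last input still has a target.
    usable = ((len(stream) - 1) // bs) * bs
    row_len = usable // bs
    batches = []
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        # Each row r reads from its own contiguous slice of the stream.
        xs = [stream[r * row_len + start: r * row_len + end] for r in range(bs)]
        # Targets are the inputs shifted by one token.
        ys = [stream[r * row_len + start + 1: r * row_len + end + 1] for r in range(bs)]
        batches.append((xs, ys))
    return batches

texts = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10],
         [11, 12, 13, 14, 15, 16, 17, 18], [19, 20], [21, 22, 23], [24]]
batches = lm_batches(texts, bs=4, seq_len=3)
```

Because each row continues where it stopped in the previous batch, a recurrent model can carry its hidden state from one batch to the next.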
Classification
For classification, we deal with the fact that texts don’t all have the same length by using padding.
class Pad_Input
Pad_Input(enc=None, dec=None, split_idx=None, order=None) :: ItemTransform
A transform that always takes tuples as items
pad_idx is used for the padding, and the padding is applied to the pad_fields of the samples. The padding is applied at the beginning if pad_first is True, and if backwards is added, the tensors are flipped.
test_eq(pad_input([(tensor([1,2,3]),1), (tensor([4,5]), 2), (tensor([6]), 3)], pad_idx=0),
[(tensor([1,2,3]),1), (tensor([4,5,0]),2), (tensor([6,0,0]), 3)])
test_eq(pad_input([(tensor([1,2,3]), (tensor([6]))), (tensor([4,5]), tensor([4,5])), (tensor([6]), (tensor([1,2,3])))], pad_idx=0, pad_fields=1),
[(tensor([1,2,3]),(tensor([6,0,0]))), (tensor([4,5]),tensor([4,5,0])), ((tensor([6]),tensor([1, 2, 3])))])
test_eq(pad_input([(tensor([1,2,3]),1), (tensor([4,5]), 2), (tensor([6]), 3)], pad_idx=0, pad_first=True),
[(tensor([1,2,3]),1), (tensor([0,4,5]),2), (tensor([0,0,6]), 3)])
test_eq(pad_input([(tensor([1,2,3]),1), (tensor([4,5]), 2), (tensor([6]), 3)], pad_idx=0, backwards=True),
[(tensor([3,2,1]),1), (tensor([5,4,0]),2), (tensor([6,0,0]), 3)])
x = pad_input([(TensorText([1,2,3]),1), (TensorText([4,5]), 2), (TensorText([6]), 3)], pad_idx=0)
test_eq(x, [(tensor([1,2,3]),1), (tensor([4,5,0]), 2), (tensor([6,0,0]), 3)])
test_eq(pad_input.decode(x[1][0]), tensor([4,5]))
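As a rough picture of what padding to the longest item means, here is a plain-Python sketch on lists. The helper pad_to_longest is hypothetical and only covers the pad_idx and pad_first options; it ignores pad_fields, backwards and tensor types.

```python
def pad_to_longest(seqs, pad_idx=0, pad_first=False):
    "Hypothetical sketch: pad every sequence to the length of the longest one."
    max_len = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        padding = [pad_idx] * (max_len - len(s))
        # Padding goes in front when pad_first is set, otherwise at the end.
        out.append(padding + list(s) if pad_first else list(s) + padding)
    return out

padded_end = pad_to_longest([[1, 2, 3], [4, 5], [6]])
padded_start = pad_to_longest([[1, 2, 3], [4, 5], [6]], pad_first=True)
```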
Pads x with pad_idx to length pad_len. If pad_first is false, all padding is appended to x until x has length pad_len. Otherwise, if pad_first is true, chunks of size seq_len are prepended to x, and the remainder of the padding is appended to x.
pad_chunk
pad_chunk(x, pad_idx=1, pad_first=True, seq_len=72, pad_len=10)
Pad x by adding padding by chunks of size seq_len
print('pad_first: ',pad_chunk(torch.tensor([1,2,3]),seq_len=3,pad_idx=0,pad_len=8))
print('pad_last: ',pad_chunk(torch.tensor([1,2,3]),seq_len=3,pad_idx=0,pad_len=8,pad_first=False))
pad_first: tensor([0, 0, 0, 1, 2, 3, 0, 0])
pad_last: tensor([1, 2, 3, 0, 0, 0, 0, 0])
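The split between prepended and appended padding comes down to a little arithmetic, sketched here in plain Python on lists. The function sketch_pad_chunk is a hypothetical illustration under the rule described above, not the library function.

```python
def sketch_pad_chunk(x, pad_idx=1, pad_first=True, seq_len=72, pad_len=10):
    "Hypothetical sketch: prepend whole seq_len chunks, append the remainder."
    n_pad = pad_len - len(x)
    if n_pad <= 0:
        return list(x)
    # Only whole multiples of seq_len may go in front of the sequence.
    n_front = (n_pad // seq_len) * seq_len if pad_first else 0
    return [pad_idx] * n_front + list(x) + [pad_idx] * (n_pad - n_front)

first = sketch_pad_chunk([1, 2, 3], seq_len=3, pad_idx=0, pad_len=8)
last = sketch_pad_chunk([1, 2, 3], seq_len=3, pad_idx=0, pad_len=8, pad_first=False)
```

Here five padding tokens are needed: one full chunk of three goes in front and the remaining two go at the end, matching the printed output above.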
pad_input_chunk is the version of pad_chunk that works over a list of lists.
pad_input_chunk
pad_input_chunk(samples, n_inp=1, pad_idx=1, pad_first=True, seq_len=72, pad_len=10)
Pad samples by adding padding by chunks of size seq_len
The difference with the base pad_input is that, when pad_first=True, most of the padding is applied at the beginning, but only in round multiples of seq_len; the rest of the padding is applied at the end. When pad_first=False, all of the padding is applied at the end. This is to work with SequenceEncoder with recurrent models.
pad_input_chunk([(TensorText([1,2,3,4,5,6]),TensorText([1,2]),1)], pad_idx=0, seq_len=3,n_inp=2)
[(TensorText([1, 2, 3, 4, 5, 6]), TensorText([0, 0, 0, 1, 2, 0]), 1)]
test_eq(pad_input_chunk([(tensor([1,2,3,4,5,6]),1), (tensor([1,2,3]), 2), (tensor([1,2]), 3)], pad_idx=0, seq_len=2),
[(tensor([1,2,3,4,5,6]),1), (tensor([0,0,1,2,3,0]),2), (tensor([0,0,0,0,1,2]), 3)])
test_eq(pad_input_chunk([(tensor([1,2,3,4,5,6]),), (tensor([1,2,3]),), (tensor([1,2]),)], pad_idx=0, seq_len=2),
[(tensor([1,2,3,4,5,6]),), (tensor([0,0,1,2,3,0]),), (tensor([0,0,0,0,1,2]),)])
test_eq(pad_input_chunk([(tensor([1,2,3,4,5,6]),), (tensor([1,2,3]),), (tensor([1,2]),)], pad_idx=0, seq_len=2, pad_first=False),
[(tensor([1,2,3,4,5,6]),), (tensor([1,2,3,0,0,0]),), (tensor([1,2,0,0,0,0]),)])
test_eq(pad_input_chunk([(TensorText([1,2,3,4,5,6]),TensorText([1,2]),1)], pad_idx=0, seq_len=2,n_inp=2),
[(TensorText([1,2,3,4,5,6]),TensorText([0,0,0,0,1,2]),1)])
Transform version of pad_input_chunk. This version supports types, decoding, and the other functionality of Transform.
class Pad_Chunk
Pad_Chunk(pad_idx=1, pad_first=True, seq_len=72, decode=True, **kwargs) :: DisplayedTransform
Pad samples by adding padding by chunks of size seq_len
Here is an example of Pad_Chunk:
pc=Pad_Chunk(pad_idx=0,seq_len=3)
out=pc([(TensorText([1,2,3,4,5,6]),TensorText([1,2]),1)])
print('Inputs: ',*[(TensorText([1,2,3,4,5,6]),TensorText([1,2]),1)])
print('Encoded: ',*out)
print('Decoded: ',*pc.decode(out))
Inputs: (TensorText([1, 2, 3, 4, 5, 6]), TensorText([1, 2]), 1)
Encoded: (TensorText([1, 2, 3, 4, 5, 6]), TensorText([0, 0, 0, 1, 2, 0]), 1)
Decoded: (TensorText([1, 2, 3, 4, 5, 6]), TensorText([1, 2]), 1)
pc=Pad_Chunk(pad_idx=0, seq_len=2)
test_eq(pc([(TensorText([1,2,3,4,5,6]),1), (TensorText([1,2,3]), 2), (TensorText([1,2]), 3)]),
[(tensor([1,2,3,4,5,6]),1), (tensor([0,0,1,2,3,0]),2), (tensor([0,0,0,0,1,2]), 3)])
pc=Pad_Chunk(pad_idx=0, seq_len=2)
test_eq(pc([(TensorText([1,2,3,4,5,6]),), (TensorText([1,2,3]),), (TensorText([1,2]),)]),
[(tensor([1,2,3,4,5,6]),), (tensor([0,0,1,2,3,0]),), (tensor([0,0,0,0,1,2]),)])
pc=Pad_Chunk(pad_idx=0, seq_len=2, pad_first=False)
test_eq(pc([(TensorText([1,2,3,4,5,6]),), (TensorText([1,2,3]),), (TensorText([1,2]),)]),
[(tensor([1,2,3,4,5,6]),), (tensor([1,2,3,0,0,0]),), (tensor([1,2,0,0,0,0]),)])
pc=Pad_Chunk(pad_idx=0, seq_len=2)
test_eq(pc([(TensorText([1,2,3,4,5,6]),TensorText([1,2]),1)]),
[(TensorText([1,2,3,4,5,6]),TensorText([0,0,0,0,1,2]),1)])
class SortedDL
SortedDL(dataset, sort_func=None, res=None, bs=64, shuffle=False, num_workers=None, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL
A DataLoader that goes through the items in the order given by sort_func
res is the result of sort_func applied on all elements of the dataset. You can pass it if available to make the init much faster by avoiding an initial pass over the whole dataset. For example, if sorting by text length (as in the default sort_func, called _default_sort), you should pass a list with the length of each element in dataset to res to take advantage of this speed-up.
To get the same init speed-up for the validation set, val_res (a list of text lengths for your validation set) can be passed to the kwargs argument of SortedDL. Below is an example to reduce the init time by passing a list of text lengths for both the training set and the validation set:
# Pass the training dataset text lengths to SortedDL
srtd_dl=partial(SortedDL, res = train_text_lens)
# Pass the validation dataset text lengths
dl_kwargs = [{},{'val_res': val_text_lens}]
# init our Datasets
dsets = Datasets(...)
# init our Dataloaders
dls = dsets.dataloaders(...,dl_type = srtd_dl, dl_kwargs = dl_kwargs)
If shuffle is True, this will slightly shuffle the results of the sort so that batches contain items of roughly the same size, but not in the exact sorted order.
ds = [(tensor([1,2]),1), (tensor([3,4,5,6]),2), (tensor([7]),3), (tensor([8,9,10]),4)]
dl = SortedDL(ds, bs=2, before_batch=partial(pad_input, pad_idx=0))
test_eq(list(dl), [(tensor([[ 3, 4, 5, 6], [ 8, 9, 10, 0]]), tensor([2, 4])),
(tensor([[1, 2], [7, 0]]), tensor([1, 3]))])
ds = [(tensor(range(random.randint(1,10))),i) for i in range(101)]
dl = SortedDL(ds, bs=2, create_batch=partial(pad_input, pad_idx=-1), shuffle=True, num_workers=0)
batches = list(dl)
max_len = len(batches[0][0])
for b in batches:
    assert(len(b[0])) <= max_len
    test_ne(b[0][-1], -1)
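The unshuffled ordering can be sketched in plain Python: sort the item indices by length, longest first, then cut them into batches. The helper sorted_batches is hypothetical and leaves out the shuffling and padding steps; it only shows why items of similar length end up together.

```python
def sorted_batches(items, bs, key=len):
    "Hypothetical sketch: group item indices into batches of similar length."
    # Longest items first, so the first batch is also the widest one.
    order = sorted(range(len(items)), key=lambda i: key(items[i]), reverse=True)
    return [order[i:i + bs] for i in range(0, len(order), bs)]

texts = [[1, 2], [3, 4, 5, 6], [7], [8, 9, 10]]
index_batches = sorted_batches(texts, bs=2)
```

Grouping by length keeps the amount of padding per batch small, which is the point of sorting before batching.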
TransformBlock for text
To use the data block API, you will need this building block for texts.
class TextBlock
TextBlock(tok_tfm, vocab=None, is_lm=False, seq_len=72, backwards=False, min_freq=3, max_vocab=60000, special_toks=None) :: TransformBlock
A TransformBlock for texts
For efficient tokenization, you probably want to use one of the factory methods. Otherwise, you can pass your custom tok_tfm that will deal with tokenization (if your texts are already tokenized, you can pass noop), a vocab, or leave it to be inferred on the texts using min_freq and max_vocab.
is_lm indicates if we want to use texts for language modeling or another task. seq_len is only necessary to tune if is_lm=False, and is passed along to pad_input_chunk.
TextBlock.from_df
TextBlock.from_df(text_cols, vocab=None, is_lm=False, seq_len=72, backwards=False, min_freq=3, max_vocab=60000, tok=None, rules=None, sep=' ', n_workers=2, mark_fields=None, tok_text_col='text', **kwargs)
Build a TextBlock from a dataframe using text_cols
Here is an example using a sample of IMDB stored as a CSV file:
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
imdb_clas = DataBlock(
blocks=(TextBlock.from_df('text', seq_len=72), CategoryBlock),
get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())
dls = imdb_clas.dataloaders(df, bs=64)
dls.show_batch(max_n=2)
 | text | category |
---|---|---|
0 | xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review nn xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it ‘s warm and gooey , but you ‘re not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n’t quite feel right . xxmaj victor xxmaj vargas suffers from a certain xxunk on the director ‘s part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an xxunk storyline would make the film critic proof . xxmaj he was right , but it did n’t fool me . xxmaj raising xxmaj victor xxmaj vargas is | negative |
1 | xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there ‘s just no getting around that , and it ‘s hard to actually put one ‘s feeling for this film into words . xxmaj it ‘s not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it ‘s easy to think that such a love story , as beautiful as any other ever told , could happen to you … a feeling you do n’t often get from other romantic comedies | positive |
vocab, is_lm, seq_len, min_freq and max_vocab are passed to the main init; the other arguments are passed to Tokenizer.from_df.
TextBlock.from_folder
TextBlock.from_folder(path, vocab=None, is_lm=False, seq_len=72, backwards=False, min_freq=3, max_vocab=60000, tok=None, rules=None, extensions=None, folders=None, output_dir=None, skip_if_exists=True, output_names=None, n_workers=2, encoding='utf8', **kwargs)
Build a TextBlock from a path
vocab, is_lm, seq_len, min_freq and max_vocab are passed to the main init; the other arguments are passed to Tokenizer.from_folder.
class TextDataLoaders
TextDataLoaders(*loaders, path='.', device=None) :: DataLoaders
Basic wrapper around several DataLoaders with factory methods for NLP problems
You should not use the init directly but one of the following factory methods. All those factory methods accept as arguments:
- text_vocab: the vocabulary used for numericalizing texts (if not passed, it’s inferred from the data)
- tok_tfm: if passed, uses this tok_tfm instead of the default
- seq_len: the sequence length used for batches
- bs: the batch size
- val_bs: the batch size for the validation DataLoader (defaults to bs)
- shuffle_train: if we shuffle the training DataLoader or not
- device: the PyTorch device to use (defaults to default_device())
TextDataLoaders.from_folder
TextDataLoaders.from_folder(path, train='train', valid='valid', valid_pct=None, seed=None, vocab=None, text_vocab=None, is_lm=False, tok_tfm=None, seq_len=72, backwards=False, bs=64, val_bs=None, shuffle=True, device=None)
Create from imagenet style dataset in path with train and valid subfolders (or provide valid_pct)
If valid_pct is provided, a random split is performed (with an optional seed) by setting aside that percentage of the data for the validation set (instead of looking at the grandparent folder). If a vocab is passed, only the folders with names in vocab are kept.
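The idea of a seeded random split can be sketched with the standard library. The helper random_split below is hypothetical and uses Python's own random module as an assumption; the library's splitter may draw its indices differently, so the exact items chosen will not match.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    "Hypothetical sketch: set aside valid_pct of the indices for validation."
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    # Returns (train indices, validation indices).
    return sorted(idxs[cut:]), sorted(idxs[:cut])

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)
```

Calling the function again with the same seed returns the same two index lists, which is what passing seed is for.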
Here is an example on the IMDB movie review dataset:
path = untar_data(URLs.IMDB)
dls = TextDataLoaders.from_folder(path)
dls.show_batch(max_n=3)
 | text | category |
---|---|---|
0 | xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero | pos |
1 | xxbos xxmaj this movie was recently released on xxup dvd in the xxup us and i finally got the chance to see this hard - to - find gem . xxmaj it even came with original theatrical previews of other xxmaj italian horror classics like “ xxunk “ and “ beyond xxup the xxup darkness “ . xxmaj unfortunately , the previews were the best thing about this movie . nn “ zombi 3 “ in a bizarre way is actually linked to the infamous xxmaj lucio xxmaj fulci “ zombie “ franchise which began in 1979 . xxmaj similarly compared to “ zombie “ , “ zombi 3 “ consists of a threadbare plot and a handful of extremely bad actors that keeps this ‘ horror ‘ trash barely afloat . xxmaj the gore is nearly non - existent ( unless one is frightened of people running around with | neg |
2 | xxbos xxup anchors xxup aweigh sees two eager young sailors , xxmaj joe xxmaj brady ( gene xxmaj kelly ) and xxmaj clarence xxmaj doolittle / xxmaj brooklyn ( frank xxmaj sinatra ) , get a special four - day shore leave . xxmaj eager to get to the girls , particularly xxmaj joe ‘s xxmaj lola , neither xxmaj joe nor xxmaj brooklyn figure on the interruption of little xxmaj navy - mad xxmaj donald ( dean xxmaj stockwell ) and his xxmaj aunt xxmaj susie ( kathryn xxmaj grayson ) . xxmaj unexperienced in the ways of females and courting , xxmaj brooklyn quickly enlists xxmaj joe to help him win xxmaj aunt xxmaj susie over . xxmaj along the way , however , xxmaj joe finds himself falling for the gal he thinks belongs to his best friend . xxmaj how is xxmaj brooklyn going to take | pos |
TextDataLoaders.from_df
TextDataLoaders.from_df(df, path='.', valid_pct=0.2, seed=None, text_col=0, label_col=1, label_delim=None, y_block=None, text_vocab=None, is_lm=False, valid_col=None, tok_tfm=None, tok_text_col='text', seq_len=72, backwards=False, bs=64, val_bs=None, shuffle=True, device=None)
Create from df in path with valid_pct
can optionally be passed for reproducibility. text_col
, label_col
and optionally valid_col
are indices or names of columns for texts/labels and the validation flag. label_delim
can be passed for a multi-label problem if your labels are in one column, separated by a particular char. y_block
should be passed to indicate your type of targets, in case the library did no infer it properly.
Along with this, you can specify the specific column the tokenized text are sent to with tok_text_col
. By default they are stored in a column named text
after tokenizing.
Here are examples on subsets of IMDB:
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/"texts.csv"); df.head()
 | label | text | is_valid |
---|---|---|---|
0 | negative | Un-bleeping-believable! Meg Ryan doesn’t even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh… Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! | False |
1 | positive | This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som… | False |
2 | negative | Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li… | False |
3 | positive | Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man’s life - interestingly enough the man’s entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie “Duty, Honor, Country” are not just mere words blathered from the lips of a high-brassed offic… | False |
4 | negative | This movie succeeds at being one of the most unique movies you’ve seen. However this comes from the fact that you can’t make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don’t want to feel slighted you’ll sit through this horrible film and develop a real sense of pity for the actors involved, they’ve all seen better days, but then you realize they actually got paid quite a bit of money to do this and you’ll lose pity for them just like you’ve alr… | False |
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/"texts.csv")
dls = TextDataLoaders.from_df(df, path=path, text_col='text', label_col='label', valid_col='is_valid')
dls.show_batch(max_n=3)
 | text | category |
---|---|---|
0 | xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review nn xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it ‘s warm and gooey , but you ‘re not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n’t quite feel right . xxmaj victor xxmaj vargas suffers from a certain xxunk on the director ‘s part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an xxunk storyline would make the film critic proof . xxmaj he was right , but it did n’t fool me . xxmaj raising xxmaj victor xxmaj vargas is | negative |
1 | xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there ‘s just no getting around that , and it ‘s hard to actually put one ‘s feeling for this film into words . xxmaj it ‘s not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it ‘s easy to think that such a love story , as beautiful as any other ever told , could happen to you … a feeling you do n’t often get from other romantic comedies | positive |
2 | xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of “ at xxmaj the xxmaj movies “ in taking xxmaj steven xxmaj soderbergh to task . nn xxmaj it ‘s usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh ‘s most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh ‘s main challenge . xxmaj strange , after xxunk years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside “ edgy “ projects . nn xxmaj none of this excuses him this present , almost diabolical | negative |
dls = TextDataLoaders.from_df(df, path=path, text_col='text', is_lm=True, valid_col='is_valid')
dls.show_batch(max_n=3)
 | text | text_ |
---|---|---|
0 | xxbos xxmaj sarah xxmaj xxunk is a dangerous xxmaj bitch ! xxmaj she ‘s beautiful , sexy , funny and talent , dark and demonic . i read the other ‘ comment ‘ on this show as well as the message board stuff and people just do n’t get it . xxmaj nothing that appears on xxup t.v . is an accident . xxmaj too much money , time and work is | xxmaj sarah xxmaj xxunk is a dangerous xxmaj bitch ! xxmaj she ‘s beautiful , sexy , funny and talent , dark and demonic . i read the other ‘ comment ‘ on this show as well as the message board stuff and people just do n’t get it . xxmaj nothing that appears on xxup t.v . is an accident . xxmaj too much money , time and work is put |
1 | do not have xxunk on about the xxunk that may xxunk them , we xxunk with their hopes and dreams without dwelling on the negative . xxmaj our xxmaj song is an emotionally satisfying film about growing up in the projects that refuses to see life in any terms other than possibility . xxbos xxmaj kubrick again puts on display his stunning ability to craft a perfect ambiance for a film . | not have xxunk on about the xxunk that may xxunk them , we xxunk with their hopes and dreams without dwelling on the negative . xxmaj our xxmaj song is an emotionally satisfying film about growing up in the projects that refuses to see life in any terms other than possibility . xxbos xxmaj kubrick again puts on display his stunning ability to craft a perfect ambiance for a film . xxmaj |
2 | myself . i want my money and time back . nn xxup do xxup not xxup watch xxup this xxup movie . nn xxmaj even if curiosity is xxunk you , stick xxunk xxunk in your eyes instead . xxmaj it will be much more enjoyable . xxmaj you have been warned ! xxbos xxmaj this is one of those movies that made me feel strongly for the need of making movies | . i want my money and time back . nn xxup do xxup not xxup watch xxup this xxup movie . nn xxmaj even if curiosity is xxunk you , stick xxunk xxunk in your eyes instead . xxmaj it will be much more enjoyable . xxmaj you have been warned ! xxbos xxmaj this is one of those movies that made me feel strongly for the need of making movies at |
TextDataLoaders.from_csv
TextDataLoaders.from_csv(path, csv_fname='labels.csv', header='infer', delimiter=None, valid_pct=0.2, seed=None, text_col=0, label_col=1, label_delim=None, y_block=None, text_vocab=None, is_lm=False, valid_col=None, tok_tfm=None, tok_text_col='text', seq_len=72, backwards=False, bs=64, val_bs=None, shuffle=True, device=None)
Create from csv file in path/csv_fname
Opens the csv file with header and delimiter, then passes all the other arguments to TextDataLoaders.from_df.
dls = TextDataLoaders.from_csv(path=path, csv_fname='texts.csv', text_col='text', label_col='label', valid_col='is_valid')
dls.show_batch(max_n=3)
 | text | category |
---|---|---|
0 | xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review nn xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it ‘s warm and gooey , but you ‘re not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n’t quite feel right . xxmaj victor xxmaj vargas suffers from a certain xxunk on the director ‘s part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an xxunk storyline would make the film critic proof . xxmaj he was right , but it did n’t fool me . xxmaj raising xxmaj victor xxmaj vargas is | negative |
1 | xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there ‘s just no getting around that , and it ‘s hard to actually put one ‘s feeling for this film into words . xxmaj it ‘s not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it ‘s easy to think that such a love story , as beautiful as any other ever told , could happen to you … a feeling you do n’t often get from other romantic comedies | positive |
2 | xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of “ at xxmaj the xxmaj movies “ in taking xxmaj steven xxmaj soderbergh to task . nn xxmaj it ‘s usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh ‘s most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh ‘s main challenge . xxmaj strange , after xxunk years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside “ edgy “ projects . nn xxmaj none of this excuses him this present , almost diabolical | negative |
©2021 fast.ai. All rights reserved.
Site last generated: Mar 31, 2021