Text Preprocessing
Tokenizer
```python
keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0)
```
Text tokenization utility class.
This class allows vectorizing a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, or based on tf-idf.
Arguments
- `num_words`: the maximum number of words to keep, based on word frequency. Only the most common `num_words - 1` words will be kept.
- `filters`: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the `'` character.
- `lower`: boolean. Whether to convert the texts to lowercase.
- `split`: str. Separator for word splitting.
- `char_level`: if True, every character will be treated as a token.
- `oov_token`: if given, it will be added to `word_index` and used to replace out-of-vocabulary words during `texts_to_sequences` calls.
By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the `'` character). These sequences are then split into lists of tokens, which are then indexed or vectorized.

`0` is a reserved index that won't be assigned to any word.
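As a concrete illustration of the indexing scheme described above (index 0 reserved, indices assigned by descending word frequency, the `oov_token` inserted first when given), here is a minimal pure-Python sketch. `build_word_index_sketch` is a hypothetical helper, not part of the Keras API, and its alphabetical tie-breaking is an assumption made for determinism:

```python
from collections import Counter

def build_word_index_sketch(texts, oov_token=None):
    """Illustrative sketch of a frequency-ordered word index.

    Index 0 is reserved; the most frequent word gets the lowest index.
    If oov_token is given, it is added to the index first.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    word_index = {}
    if oov_token is not None:
        word_index[oov_token] = 1
    # Sort by descending frequency; break ties alphabetically (an
    # assumption here, chosen so the sketch is deterministic).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for word, _ in ranked:
        word_index[word] = len(word_index) + 1
    return word_index
```

With this sketch, the most frequent word across the corpus always maps to index 1 (or 2 when an `oov_token` occupies index 1), and index 0 is never used.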
hashing_trick
```python
keras.preprocessing.text.hashing_trick(text, n, hash_function=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
```
Converts a text to a sequence of indexes in a fixed-size hashing space.
Arguments
- `text`: Input text (string).
- `n`: Dimension of the hashing space.
- `hash_function`: defaults to the Python `hash` function; can be 'md5' or any function that takes a string as input and returns an integer. Note that `hash` is not a stable hashing function, so it is not consistent across different runs, while 'md5' is a stable hashing function.
- `filters`: list (or concatenation) of characters to filter out, such as punctuation. Default: ``!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n``, which includes basic punctuation, tabs, and newlines.
- `lower`: boolean. Whether to set the text to lowercase.
- `split`: str. Separator for word splitting.
Returns
A list of integer word indices (uniqueness not guaranteed).

`0` is a reserved index that won't be assigned to any word.

Two or more words may be assigned to the same index, due to possible collisions in the hashing function. The probability of a collision depends on the dimension of the hashing space and the number of distinct objects.
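The hashing scheme can be sketched in plain Python using the stable 'md5' option described above. This is an illustrative sketch under stated assumptions (`hashing_trick_sketch` is a hypothetical name, and the filtering step is omitted for brevity), not the Keras implementation:

```python
import hashlib

def hashing_trick_sketch(text, n, lower=True, split=' '):
    """Illustrative sketch: map each word to an index in [1, n - 1] via md5.

    md5 is stable, so the same word maps to the same index across runs;
    distinct words may still collide onto one index.
    """
    if lower:
        text = text.lower()
    words = [w for w in text.split(split) if w]
    return [int(hashlib.md5(w.encode('utf-8')).hexdigest(), 16) % (n - 1) + 1
            for w in words]
```

Note that repeated words always receive the same index within and across runs, while two different words can land on the same index; a larger `n` makes such collisions less likely.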
one_hot
```python
keras.preprocessing.text.one_hot(text, n, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
```
One-hot encodes a text into a list of word indexes of size n.
This is a wrapper to the `hashing_trick` function, using `hash` as the hashing function; uniqueness of the word-to-index mapping is not guaranteed.
Arguments
- `text`: Input text (string).
- `n`: int. Size of vocabulary.
- `filters`: list (or concatenation) of characters to filter out, such as punctuation. Default: ``!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n``, which includes basic punctuation, tabs, and newlines.
- `lower`: boolean. Whether to set the text to lowercase.
- `split`: str. Separator for word splitting.
Returns
A list of integers in [1, n]. Each integer encodes a word (uniqueness not guaranteed).
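The wrapper behavior can be sketched as the hashing trick with Python's builtin `hash`, as the description above states. This is a simplified illustrative sketch (`one_hot_sketch` is a hypothetical name, filtering omitted), not the Keras code; because builtin `hash` is salted per process, the indices are only consistent within one run:

```python
def one_hot_sketch(text, n, lower=True, split=' '):
    """Illustrative sketch of one_hot as a hashing trick using builtin hash().

    Within a single run the same word always maps to the same index, but
    hash() of strings is randomized per process, so indices differ across
    runs (unlike the stable 'md5' option of hashing_trick).
    """
    if lower:
        text = text.lower()
    words = [w for w in text.split(split) if w]
    return [hash(w) % (n - 1) + 1 for w in words]
```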
text_to_word_sequence
```python
keras.preprocessing.text.text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
```
Converts a text to a sequence of words (or tokens).
Arguments
- `text`: Input text (string).
- `filters`: list (or concatenation) of characters to filter out, such as punctuation. Default: ``!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n``, which includes basic punctuation, tabs, and newlines.
- `lower`: boolean. Whether to convert the input to lowercase.
- `split`: str. Separator for word splitting.
Returns
A list of words (or tokens).
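The lowercase/filter/split pipeline described by these arguments can be sketched in plain Python. This is an illustrative sketch (`text_to_word_sequence_sketch` is a hypothetical name), not the Keras source:

```python
def text_to_word_sequence_sketch(text,
                                 filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                 lower=True, split=' '):
    """Illustrative sketch: lowercase, replace filtered chars, then split."""
    if lower:
        text = text.lower()
    # Map every filtered character to the split separator, then drop
    # the empty strings that consecutive separators produce.
    table = {ord(c): ord(split) for c in filters}
    text = text.translate(table)
    return [w for w in text.split(split) if w]
```

For example, `text_to_word_sequence_sketch('Hello, world!')` yields `['hello', 'world']`: the comma and exclamation mark are filtered out, the text is lowercased, and the remainder is split on spaces.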