Text Preprocessing

Tokenizer

    keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0)

Text tokenization utility class.

This class allows you to vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector where the coefficient for each token can be binary, based on word count, or based on TF-IDF.

Arguments

  • num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
  • filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.
  • lower: boolean. Whether to convert the texts to lowercase.
  • split: str. Separator for word splitting.
  • char_level: if True, every character will be treated as a token.
  • oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.

By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorized.

0 is a reserved index that won't be assigned to any word.
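
A minimal usage sketch of the Tokenizer workflow; the sample corpus, the num_words value, and the '<OOV>' token string are illustrative choices, not defaults of the API:

    from keras.preprocessing.text import Tokenizer

    corpus = [
        "The cat sat on the mat.",
        "The dog ate my homework.",
    ]

    # Keep only the most frequent words; rarer words map to the OOV token.
    tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
    tokenizer.fit_on_texts(corpus)

    print(tokenizer.word_index)                  # word -> integer index (0 is reserved)
    print(tokenizer.texts_to_sequences(corpus))  # each text as a list of indices
    print(tokenizer.texts_to_matrix(corpus, mode='binary'))  # one row per text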

hashing_trick

    keras.preprocessing.text.hashing_trick(text, n, hash_function=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

Converts a text to a sequence of indexes in a fixed-size hashing space.

Arguments

  • text: Input text (string).
  • n: Dimension of the hashing space.
  • hash_function: defaults to the Python hash function; can be 'md5' or any function that takes a string as input and returns an int. Note that hash is not a stable hashing function, so it is not consistent across different runs, while 'md5' is a stable hashing function.
  • filters: list (or concatenation) of characters to filter out, such as punctuation. Default: !"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n, which includes basic punctuation, tabs, and newlines.
  • lower: boolean. Whether to set the text to lowercase.
  • split: str. Separator for word splitting.

Returns

A list of integer word indices (uniqueness not guaranteed).

0 is a reserved index that won't be assigned to any word.

Two or more words may be assigned to the same index due to possible collisions in the hashing function. The probability of a collision depends on the dimension of the hashing space and the number of distinct objects.
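
A minimal usage sketch; n=50 and the sentence are illustrative. Passing 'md5' keeps the indices stable across runs, unlike the default hash:

    from keras.preprocessing.text import hashing_trick

    text = 'The quick brown fox jumped over the lazy dog.'
    indices = hashing_trick(text, n=50, hash_function='md5')
    print(indices)  # a list like [7, 31, 12, ...]; collisions are possible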

one_hot

    keras.preprocessing.text.one_hot(text, n, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

One-hot encodes a text into a list of word indexes of size n.

This is a wrapper around the hashing_trick function, using hash as the hashing function; uniqueness of the word-to-index mapping is not guaranteed.

Arguments

  • text: Input text (string).
  • n: int. Size of vocabulary.
  • filters: list (or concatenation) of characters to filter out, such as punctuation. Default: !"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n, which includes basic punctuation, tabs, and newlines.
  • lower: boolean. Whether to set the text to lowercase.
  • split: str. Separator for word splitting.

Returns

List of integers in [1, n]. Each integer encodes a word (uniqueness not guaranteed).
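
A minimal usage sketch (the vocabulary size and sentence are illustrative). Because this uses Python's hash, the mapping can change between interpreter runs unless PYTHONHASHSEED is fixed:

    from keras.preprocessing.text import one_hot

    text = 'The quick brown fox jumped over the lazy dog.'
    indices = one_hot(text, n=50)
    print(indices)  # integers in [1, 50]; repeated words share an index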

text_to_word_sequence

    keras.preprocessing.text.text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

Converts a text to a sequence of words (or tokens).

Arguments

  • text: Input text (string).
  • filters: list (or concatenation) of characters to filter out, such as punctuation. Default: !"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n, which includes basic punctuation, tabs, and newlines.
  • lower: boolean. Whether to convert the input to lowercase.
  • split: str. Separator for word splitting.

Returns

A list of words (or tokens).
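
A minimal usage sketch showing the default filtering and lowercasing (the sample sentence is illustrative):

    from keras.preprocessing.text import text_to_word_sequence

    print(text_to_word_sequence("Don't count, weigh!"))
    # ["don't", 'count', 'weigh'] -- punctuation stripped, text lowercased,
    # and the ' character preserved by the default filters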