2.2 Counting Words by Genre
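In 1, we saw a conditional frequency distribution where the condition was the section of the Brown Corpus, and for each condition we counted words. Whereas FreqDist() takes a simple list as input, ConditionalFreqDist() takes a list of pairs.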
>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
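Let's break this down and look at just two genres, news and romance. For each genre, we loop over every word in that genre, producing pairs consisting of the genre and the word: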
>>> genre_word = [(genre, word)
...               for genre in ['news', 'romance']
...               for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
So, as we can see in the code below, pairs at the beginning of the list genre_word will be of the form ('news', word), while those at the end will be of the form ('romance', word).
>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
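The ordering follows from how a nested comprehension runs: the first `for` is the outer loop, so every word of the first genre is emitted before the second genre starts. Here is a minimal sketch of that behaviour that needs no corpus download. The dictionary `toy_corpus` and its words are made up for illustration and merely stand in for `brown.words(categories=genre)`.

```python
# Hypothetical stand-in for brown.words(categories=genre); the words are made up.
toy_corpus = {
    "news": ["The", "jury", "said", "."],
    "romance": ["She", "was", "afraid", "."],
}

# Same shape as the genre_word comprehension: the outer loop runs over
# genres, the inner loop over the words of the current genre.
genre_word = [
    (genre, word)
    for genre in ["news", "romance"]
    for word in toy_corpus[genre]
]

print(genre_word[:2])   # first pairs all carry the 'news' condition
print(genre_word[-2:])  # last pairs all carry the 'romance' condition
print(len(genre_word))  # total pairs = sum of the per-genre word counts
```

The same reasoning accounts for the length reported earlier: the number of pairs is simply the number of word tokens in the two genres added together.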
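We can now use this list of pairs to create a ConditionalFreqDist, and save it in a variable cfd. As usual, we can type the name of the variable to inspect it, and verify that it has two conditions: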
>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']
Let's access the two conditions, and satisfy ourselves that each is just a frequency distribution:
>>> print(cfd['news'])
<FreqDist with 14394 samples and 100554 outcomes>
>>> print(cfd['romance'])
<FreqDist with 8452 samples and 70022 outcomes>
>>> cfd['romance'].most_common(20)
[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502),
('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993),
('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690),
('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
>>> cfd['romance']['could']
193
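To make the structure concrete, here is a rough standard-library sketch of the same idea: one counter per condition, built from (condition, sample) pairs. It only illustrates the concept and is not meant as a description of NLTK's internals. The names `pairs` and `toy_cfd` and the data are made up. In the printed summaries above, "samples" is the number of distinct words under a condition and "outcomes" is the total number of word tokens counted, which the sketch computes by hand.

```python
from collections import Counter, defaultdict

# Made-up (condition, sample) pairs, in the same shape as genre_word.
pairs = [
    ("news", "the"), ("news", "jury"), ("news", "the"),
    ("romance", "she"), ("romance", "could"), ("romance", "she"),
    ("romance", "could"), ("romance", "not"), ("romance", "she"),
]

# One Counter per condition, created the first time a condition is seen.
toy_cfd = defaultdict(Counter)
for condition, sample in pairs:
    toy_cfd[condition][sample] += 1

conditions = sorted(toy_cfd)         # the distinct conditions
romance = toy_cfd["romance"]         # an ordinary frequency table

n_samples = len(romance)             # distinct words under this condition
n_outcomes = sum(romance.values())   # total tokens under this condition
top = romance.most_common(2)         # most frequent words with their counts
could_count = romance["could"]       # count of a single word

print(conditions, n_samples, n_outcomes, top, could_count)
```

Indexing twice, first by condition and then by sample, is the same access pattern as `cfd['romance']['could']` above.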