2. Accessing Text Corpora and Lexical Resources - 2.2 Counting Words by Genre - Natural Language Processing with Python, 2nd Edition

2.2 Counting Words by Genre

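In Section 1, we saw a conditional frequency distribution in which each condition was a section (genre category) of the Brown Corpus, and the words were counted separately for each condition. Whereas FreqDist() takes a simple list as input, ConditionalFreqDist() takes a list of pairs.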

>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
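
To make the difference in input shape concrete, here is a minimal standard-library sketch. It is not NLTK's implementation: it only illustrates that a plain counter consumes a flat list of words, while a conditional counter consumes (condition, word) pairs and keeps one counter per condition. The toy words and the names `fd` and `toy_cfd` are illustrative assumptions.

```python
from collections import Counter, defaultdict

# A frequency distribution counts items taken from a flat list.
words = ["the", "jury", "said", "the", "election"]
fd = Counter(words)

# A conditional frequency distribution counts items taken from
# (condition, item) pairs, keeping a separate counter per condition.
pairs = [("news", "the"), ("news", "jury"), ("romance", "the"),
         ("romance", "kiss"), ("news", "the")]
toy_cfd = defaultdict(Counter)
for condition, word in pairs:
    toy_cfd[condition][word] += 1

print(fd["the"])               # 2
print(toy_cfd["news"]["the"])  # 2
print(sorted(toy_cfd))         # ['news', 'romance']
```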

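Let's break this down and look at just two genres, news and romance. For each genre, we loop over every word in that genre, producing pairs consisting of the genre and the word: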

>>> genre_word = [(genre, word)
...               for genre in ['news', 'romance']
...               for word in brown.words(categories=genre)]
>>> len(genre_word)
170576

So, as the code below shows, the pairs at the beginning of the list genre_word have the form ('news', word), while those at the end have the form ('romance', word).

>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
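
This ordering follows from how nested comprehension clauses run: the first `for` is the outer loop, so every pair for one genre is emitted before the next genre starts. The sketch below shows this on a made-up corpus. The dictionary `toy_corpus` is hypothetical and merely stands in for `brown.words(categories=genre)`.

```python
# Hypothetical stand-in for the corpus: genre -> list of word tokens.
toy_corpus = {
    "news": ["The", "Fulton", "County"],
    "romance": ["afraid", "not", "."],
}

# Nested comprehension: the genre loop is outermost, the word loop innermost.
toy_pairs = [(genre, word)
             for genre in ["news", "romance"]
             for word in toy_corpus[genre]]

# The same thing written as explicit nested loops.
loop_pairs = []
for genre in ["news", "romance"]:
    for word in toy_corpus[genre]:
        loop_pairs.append((genre, word))

print(toy_pairs[:2])            # [('news', 'The'), ('news', 'Fulton')]
print(toy_pairs[-2:])           # [('romance', 'not'), ('romance', '.')]
print(toy_pairs == loop_pairs)  # True
```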

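We can now use this list of pairs to create a ConditionalFreqDist and save it in a variable cfd. As usual, we can type the name of the variable to inspect it, and verify that it has two conditions: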

>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']

Let's access the two conditions and confirm that each one is just a frequency distribution:

>>> print(cfd['news'])
<FreqDist with 14394 samples and 100554 outcomes>
>>> print(cfd['romance'])
<FreqDist with 8452 samples and 70022 outcomes>
>>> cfd['romance'].most_common(20)
[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502),
('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993),
('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690),
('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
>>> cfd['romance']['could']
193
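
In the output above, "samples" are the distinct words seen under a condition, and "outcomes" are the total number of word tokens counted. The following minimal sketch reproduces the same three kinds of query with `collections.Counter` on a toy token list. The tokens and the name `romance_fd` are made up for illustration, so the numbers bear no relation to the Brown Corpus counts.

```python
from collections import Counter

# Made-up tokens standing in for one condition's words.
romance_tokens = ["she", "could", "not", "say", "what",
                  "she", "could", "see", "she", "."]
romance_fd = Counter(romance_tokens)

samples = len(romance_fd)            # number of distinct words
outcomes = sum(romance_fd.values())  # total number of tokens counted

print(samples, outcomes)          # 7 10
print(romance_fd.most_common(2))  # [('she', 3), ('could', 2)]
print(romance_fd["could"])        # 2
print(romance_fd["would"])        # 0 (an unseen word has a count of zero)
```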