1. Language Processing and Python - 3.4 Counting Other Things - Natural Language Processing with Python, 2nd Edition

3.4 Counting Other Things

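Counting words is useful, but we can count other things too. For example, we can look at the distribution of word lengths in a text by building a FreqDist from a long list of numbers, where each number is the length of the corresponding word in the text: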

>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> fdist = FreqDist(len(w) for w in text1)
>>> print(fdist)
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399,
 8: 9966, 9: 6428, 10: 3528, ...})
>>>

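We start by deriving a list of the lengths of the words in text1, and FreqDist then counts how many times each number occurs. The result is a distribution containing roughly a quarter of a million items (260,819 outcomes), each of which is a number corresponding to one word token in the text. But only 19 distinct items are being counted, because only 19 different word lengths occur: there are words of 1 character, 2 characters, and so on up to 18 characters, plus a single word of 20 characters, but none of 19 characters and none of 21 or more. One might wonder how frequent the different word lengths are (for example, how many words of length 4 appear in the text, or whether there are more words of length 5 than of length 4). We can find out as follows: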

>>> fdist.most_common()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
>>> fdist[3]
50223
>>> fdist.freq(3)
0.19255882431878046
>>>

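From this we see that the most frequent word length is 3, and that there are just over 50,000 words of length 3 (roughly 19% of all the word tokens in the book). Although we will not pursue it here, further analysis of word length might help us understand differences between authors, genres, or languages.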
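The same kind of count can be tried without the NLTK corpora. What follows is a minimal sketch, assuming that the standard library's `collections.Counter` is an acceptable stand-in for `FreqDist` (in NLTK 3, `FreqDist` is a subclass of `Counter`). The token list `tokens` and the helper `relative_freq` are made up for illustration and are not part of NLTK.

```python
from collections import Counter

# Hypothetical sample tokens, standing in for text1 (which needs the NLTK corpora).
tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago', ',',
          'never', 'mind', 'how', 'long']

# Count how often each word length occurs, as FreqDist(len(w) for w in text1) does.
length_counts = Counter(len(w) for w in tokens)


def relative_freq(counts, sample):
    """Share of all outcomes equal to `sample` (the quantity fdist.freq() reports)."""
    total = sum(counts.values())
    return counts[sample] / total if total else 0.0


# The most frequent length and its count, as with fdist.max() and fdist[...].
most_common_length, count = length_counts.most_common(1)[0]

print(length_counts.most_common())
print(most_common_length, count, relative_freq(length_counts, most_common_length))
```

With a real corpus loaded, replacing `tokens` by `text1` and `Counter` by `FreqDist` should give the figures shown in the session above.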

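Table 3.1 summarizes the functions defined in NLTK's frequency distribution class.

Table 3.1: Functions defined in NLTK's frequency distribution class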

>>> sent7
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the',
'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
>>> [w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
>>> [w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board',
'as', 'a', 'nonexecutive', 'director', '29', '.']
>>>

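All of these examples share a common pattern: [w for w in text if condition], where condition is a Python "test" that yields either true or false. In the cases shown in the previous code example, the condition is always a numerical comparison. However, we can also test various properties of words, using the functions listed in Table 4.2.

Table 4.2: Some word comparison operators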

>>> sorted(w for w in set(text1) if w.endswith('ableness'))
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', ...]
>>> sorted(term for term in set(text4) if 'gnt' in term)
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted(item for item in set(text6) if item.istitle())
['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', ...]
>>> sorted(item for item in set(sent7) if item.isdigit())
['29', '61']
>>>
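The string tests used above can also be tried on sent7 alone, with no corpus download. The snippet below is a small sketch under that assumption: the choice of tests and the dictionary name `tests` are illustrative only, while the methods themselves (`startswith`, `endswith`, `isalpha`, `istitle`) and the `in` operator are built into Python strings.

```python
# sent7 exactly as printed in the earlier session.
sent7 = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the',
         'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

# Each entry pairs a label with a test that returns True or False for one word.
# The selection of tests here is hypothetical, chosen only for illustration.
tests = {
    'startswith("V")': lambda w: w.startswith('V'),
    'endswith("r")': lambda w: w.endswith('r'),
    '"ar" in w': lambda w: 'ar' in w,
    'isalpha()': lambda w: w.isalpha(),
    'istitle()': lambda w: w.istitle(),
}

# Apply the [w for w in text if condition] pattern once per test.
results = {label: sorted(w for w in set(sent7) if test(w))
           for label, test in tests.items()}

for label, words in results.items():
    print(label, words)
```

Using `set(sent7)` removes the repeated comma before sorting, which matches the style of the examples above.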

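We can also create more complex conditions. If c is a condition, then not c is also a condition. If we have two conditions c1 and c2, we can combine them into a new condition using conjunction and disjunction: c1 and c2, c1 or c2.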
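As a quick illustration of negation, conjunction, and disjunction, here is a minimal sketch that again uses sent7. The three conditions and the variable names are chosen only for this example.

```python
# sent7 exactly as printed in the earlier session.
sent7 = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the',
         'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

# not c: tokens that are not purely alphabetic (punctuation, numbers, 'Nov.').
non_alpha = sorted(w for w in set(sent7) if not w.isalpha())

# c1 and c2: capitalized tokens that are longer than four characters.
long_titles = sorted(w for w in set(sent7) if w.istitle() and len(w) > 4)

# c1 or c2: tokens that are all digits, or that are a single character.
digits_or_single = sorted(w for w in set(sent7) if w.isdigit() or len(w) == 1)

print(non_alpha)
print(long_titles)
print(digits_or_single)
```

Note that 'Nov.' passes `istitle()` but is excluded from `long_titles` by the length test, which shows how `and` narrows a selection while `or` widens it.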

Note

Your Turn: Run the following examples and try to explain what is going on in each one. Then try to make up some conditions of your own.

>>> sorted(w for w in set(text7) if '-' in w and 'index' in w)
>>> sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10)
>>> sorted(w for w in set(sent7) if not w.islower())
>>> sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)