2.7 Unsimplified Tags

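Let's find the most frequent nouns of each noun part-of-speech type. The program in 2.2 finds all tags starting with NN and provides a few example words for each one. You will see that there are many variants of NN; the most important contain $ for possessive nouns and S for plural nouns (since plural nouns typically end in s). In addition, most of the tags have suffix modifiers: -NC for citations, -HL for words in headlines, and -TL for titles (a feature of Brown tags).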

    import nltk

    def findtags(tag_prefix, tagged_text):
        # Count words per tag, keeping only tags that start with tag_prefix
        cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                       if tag.startswith(tag_prefix))
        # Map each matching tag to its five most frequent words
        return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

    >>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
    >>> for tag in sorted(tagdict):
    ...     print(tag, tagdict[tag])
    ...
    NN [('year', 137), ('time', 97), ('state', 88), ('week', 85), ('man', 72)]
    NN$ [("year's", 13), ("world's", 8), ("state's", 7), ("nation's", 6), ("company's", 6)]
    NN$-HL [("Golf's", 1), ("Navy's", 1)]
    NN$-TL [("President's", 11), ("Army's", 3), ("Gallery's", 3), ("University's", 3), ("League's", 3)]
    NN-HL [('sp.', 2), ('problem', 2), ('Question', 2), ('business', 2), ('Salary', 2)]
    NN-NC [('eva', 1), ('aya', 1), ('ova', 1)]
    NN-TL [('President', 88), ('House', 68), ('State', 59), ('University', 42), ('City', 41)]
    NN-TL-HL [('Fort', 2), ('Dr.', 1), ('Oak', 1), ('Street', 1), ('Basin', 1)]
    NNS [('years', 101), ('members', 69), ('people', 52), ('sales', 51), ('men', 46)]
    NNS$ [("children's", 7), ("women's", 5), ("janitors'", 3), ("men's", 3), ("taxpayers'", 2)]
    NNS$-HL [("Dealers'", 1), ("Idols'", 1)]
    NNS$-TL [("Women's", 4), ("States'", 3), ("Giants'", 2), ("Bros.'", 1), ("Writers'", 1)]
    NNS-HL [('comments', 1), ('Offenses', 1), ('Sacrifices', 1), ('funds', 1), ('Results', 1)]
    NNS-TL [('States', 38), ('Nations', 11), ('Masters', 10), ('Rules', 9), ('Communists', 9)]
    NNS-TL-HL [('Nations', 1)]
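
The suffix modifiers in the output above can be peeled off a tag mechanically. The following is a minimal sketch, not part of NLTK: `split_brown_tag` and `SUFFIX_MODIFIERS` are hypothetical names introduced here for illustration, and the helper assumes that only the three suffixes named in the text (-NC, -HL, -TL) count as modifiers. Any other hyphenated tag is left untouched.

```python
# Suffix modifiers named in the text (assumption: only these three are handled).
SUFFIX_MODIFIERS = {
    "NC": "citation",
    "HL": "word in a headline",
    "TL": "word in a title",
}

def split_brown_tag(tag):
    """Split a Brown tag into (base_tag, [modifiers]).

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    modifiers = []
    base = tag
    while True:
        # Look at the part after the last hyphen, if there is one
        head, sep, tail = base.rpartition("-")
        if sep and head and tail in SUFFIX_MODIFIERS:
            modifiers.insert(0, tail)  # keep modifiers in their original order
            base = head
        else:
            break
    return base, modifiers

for tag in ["NN", "NN$-TL", "NN-TL-HL", "NNS$-HL"]:
    print(tag, split_brown_tag(tag))
```

Stripping from the right until no known suffix remains handles stacked modifiers such as NN-TL-HL, while the membership check keeps tags like the dash tag `--` intact.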

When we come to constructing part-of-speech taggers later in this chapter, we will use the unsimplified tags.