2. Accessing Text Corpora and Lexical Resources - 1.9 Loading Your Own Corpus - Natural Language Processing with Python, 2nd Edition

1.9 Loading Your Own Corpus

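If you have your own collection of text files and want to access them with the methods discussed above, you can load them easily with NLTK's PlaintextCorpusReader. Check where the files live on your file system; in the example below, we assume they are in the directory /usr/share/dict. Whatever the location, set the variable corpus_root to that directory. The second argument of the PlaintextCorpusReader initializer can be either a list of fileids, such as ['a.txt', 'test/b.txt'], or a pattern that matches all the fileids, such as '[abc]/.*\.txt' (see Section 3.4 for information about regular expressions).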

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
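The directory /usr/share/dict is not present on every system, so the following is a minimal sketch that builds a throwaway corpus in a temporary directory and loads it both ways: with a regular-expression pattern and with an explicit list of fileids. The file names and their contents are hypothetical and exist only for illustration. The sketch assumes the default word tokenizer of PlaintextCorpusReader, which splits on runs of word characters and punctuation.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Hypothetical toy corpus: file names and contents are made up for illustration.
toy_files = {
    "a.txt": "The quick brown fox.\n",
    "test/b.txt": "It jumps over the lazy dog.\n",
}

with tempfile.TemporaryDirectory() as corpus_root:
    # Write each toy file, creating subdirectories as needed.
    for fileid, text in toy_files.items():
        path = os.path.join(corpus_root, *fileid.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

    # Second argument as a regular expression over fileids.
    by_pattern = PlaintextCorpusReader(corpus_root, r".*\.txt")
    # Second argument as an explicit list of fileids.
    by_list = PlaintextCorpusReader(corpus_root, ["a.txt", "test/b.txt"])

    # Materialize results while the temporary directory still exists,
    # because the reader loads file contents lazily.
    pattern_fileids = by_pattern.fileids()
    list_fileids = by_list.fileids()
    words_a = list(by_pattern.words("a.txt"))
    words_b = list(by_list.words("test/b.txt"))

print(pattern_fileids)
print(words_a)
print(words_b)
```

Note that fileids are written with forward slashes relative to corpus_root, whatever the operating system, which is why 'test/b.txt' names a file inside the test subdirectory.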

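As another example, suppose you have your own copy of the Penn Treebank (release 3) on a local disk, under C:\corpora. We can use BracketParseCorpusReader to access this corpus. We set corpus_root to the location of the parsed Wall Street Journal component of the corpus, and give a file_pattern that matches the files contained in its subfolders (using forward slashes, even on Windows).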

>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
>>> file_pattern = r".*/wsj_.*\.mrg"
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()
['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]
>>> len(ptb.sents())
49208
>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]
['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the',
'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", 'Anastasio',
'Somoza', ',', 'as', 'imperial', 'as', 'Ferdinand', 'Marcos', 'of', 'the', 'Philippines',
'or', 'as', 'bloody', 'as', 'Haiti', "'s", 'Baby', 'Doc', 'Duvalier', '.']
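The Penn Treebank is licensed data, so the session above cannot be reproduced without a copy of it. As a rough stand-in, the sketch below writes a single hypothetical bracketed parse into a temporary directory laid out like the WSJ folders and reads it back with the same file pattern. The tree and the file name 00/wsj_0001.mrg are invented for illustration and are not real treebank content.

```python
import os
import tempfile

from nltk.corpus import BracketParseCorpusReader

# Hypothetical one-sentence file in bracketed format; not real treebank data.
toy_mrg = "( (S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .)) )\n"

with tempfile.TemporaryDirectory() as corpus_root:
    # Mimic the treebank layout: numbered subfolders holding .mrg files.
    os.makedirs(os.path.join(corpus_root, "00"))
    mrg_path = os.path.join(corpus_root, "00", "wsj_0001.mrg")
    with open(mrg_path, "w", encoding="utf-8") as f:
        f.write(toy_mrg)

    # The pattern uses forward slashes to reach into the subfolders.
    toy = BracketParseCorpusReader(corpus_root, r".*/wsj_.*\.mrg")

    # Materialize results before the temporary directory is removed.
    toy_fileids = toy.fileids()
    toy_sents = [list(sent) for sent in toy.sents()]
    toy_tree = toy.parsed_sents()[0]

print(toy_fileids)
print(toy_sents)
print(toy_tree.label())
```

Besides sents(), the reader also exposes parsed_sents(), which returns nltk.Tree objects, so the syntactic structure stays available alongside the plain token lists.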