4.3 ElementTree 接口
Python 的 ElementTree 模块提供了一种方便的方式访问存储在 XML 文件中的数据。ElementTree 是 Python 标准库(自从 Python 2.5)的一部分,也作为 NLTK 的一部分提供,以防你在使用 Python 2.4。
我们将使用 XML 格式的莎士比亚戏剧集来说明 ElementTree 的使用方法。让我们加载 XML 文件并检查原始数据,首先在文件的顶部,在那里我们看到一些 XML 头和一个名为play.dtd
的模式,接着是根元素 PLAY
。我们从 Act 1再次获得数据。(输出中省略了一些空白行。)
>>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
>>> raw = open(merchant_file).read()
>>> print(raw[:163]) ![[1]](/projects/nlp-py-2e-zh/Images/346344c2e5a627acfdddf948fb69cb1d.jpg)
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="shakes.css"?>
<!-- <!DOCTYPE PLAY SYSTEM "play.dtd"> -->
<PLAY>
<TITLE>The Merchant of Venice</TITLE>
>>> print(raw[1789:2006]) ![[2]](/projects/nlp-py-2e-zh/Images/f9e1ba3246770e3ecb24f813f33f2075.jpg)
<TITLE>ACT I</TITLE>
<SCENE><TITLE>SCENE I. Venice. A street.</TITLE>
<STAGEDIR>Enter ANTONIO, SALARINO, and SALANIO</STAGEDIR>
<SPEECH>
<SPEAKER>ANTONIO</SPEAKER>
<LINE>In sooth, I know not why I am so sad:</LINE>
我们刚刚访问了作为一个字符串的 XML 数据。正如我们看到的,在 Act 1 开始处的字符串包含 XML 标记 title、scene、stage directions 等。
下一步是作为结构化的 XML 数据使用ElementTree
处理文件的内容。我们正在处理一个文件(一个多行字符串),并建立一棵树,所以方法的名称是parse
并不奇怪。变量merchant
包含一个 XML 元素PLAY
。此元素有内部结构;我们可以使用一个索引来得到它的第一个孩子,一个TITLE
元素。我们还可以看到该元素的文本内容:戏剧的标题。要得到所有的子元素的列表,我们使用getchildren()
方法。
>>> from xml.etree.ElementTree import ElementTree
>>> merchant = ElementTree().parse(merchant_file) ![[1]](/projects/nlp-py-2e-zh/Images/346344c2e5a627acfdddf948fb69cb1d.jpg)
>>> merchant
<Element 'PLAY' at 0x10ac43d18> # [_element-play]
>>> merchant[0]
<Element 'TITLE' at 0x10ac43c28> # [_element-title]
>>> merchant[0].text
'The Merchant of Venice' # [_element-text]
>>> merchant.getchildren() ![[5]](/projects/nlp-py-2e-zh/Images/63a8e4c47e813ba9630363f9b203a19a.jpg)
[<Element 'TITLE' at 0x10ac43c28>, <Element 'PERSONAE' at 0x10ac43bd8>,
<Element 'SCNDESCR' at 0x10b067f98>, <Element 'PLAYSUBT' at 0x10af37048>,
<Element 'ACT' at 0x10af37098>, <Element 'ACT' at 0x10b936368>,
<Element 'ACT' at 0x10b934b88>, <Element 'ACT' at 0x10cfd8188>,
<Element 'ACT' at 0x10cfadb38>]
这部戏剧由标题、角色、一个场景的描述、字幕和五幕组成。每一幕都有一个标题和一些场景,每个场景由台词组成,台词由行组成,有四个层次嵌套的结构。让我们深入到第四幕:
>>> merchant[-2][0].text
'ACT IV'
>>> merchant[-2][1]
<Element 'SCENE' at 0x10cfd8228>
>>> merchant[-2][1][0].text
'SCENE I. Venice. A court of justice.'
>>> merchant[-2][1][54]
<Element 'SPEECH' at 0x10cfb02c8>
>>> merchant[-2][1][54][0]
<Element 'SPEAKER' at 0x10cfb0318>
>>> merchant[-2][1][54][0].text
'PORTIA'
>>> merchant[-2][1][54][1]
<Element 'LINE' at 0x10cfb0368>
>>> merchant[-2][1][54][1].text
"The quality of mercy is not strain'd,"
注意
轮到你来:对语料库中包含的其他莎士比亚戏剧,如《罗密欧与朱丽叶》或《麦克白》,重复上述的一些方法;方法列表请参阅nltk.corpus.shakespeare.fileids()
。
虽然我们可以通过这种方式访问整个树,使用特定名称查找子元素会更加方便。回想一下顶层的元素有几种类型。我们可以使用merchant.findall('ACT')
遍历我们感兴趣的类型(如幕)。下面是一个做这种特定标记在每一个级别的嵌套搜索的例子:
>>> for i, act in enumerate(merchant.findall('ACT')):
... for j, scene in enumerate(act.findall('SCENE')):
... for k, speech in enumerate(scene.findall('SPEECH')):
... for line in speech.findall('LINE'):
... if 'music' in str(line.text):
... print("Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text))
Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
Act 3 Scene 2 Speech 9: Fading in music: that the comparison
Act 3 Scene 2 Speech 9: And what is music then? Then music is
Act 5 Scene 1 Speech 23: And bring your music forth into the air.
Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
Act 5 Scene 1 Speech 23: And draw her home with music.
Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
Act 5 Scene 1 Speech 25: The man that hath no music in himself,
Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
Act 5 Scene 1 Speech 32: No better a musician than the wren.
不是沿着层次结构向下遍历每一级,我们可以寻找特定的嵌入的元素。例如,让我们来看看演员的顺序。我们可以使用频率分布看看谁最能说:
>>> from collections import Counter
>>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
>>> speaker_freq = Counter(speaker_seq)
>>> top5 = speaker_freq.most_common(5)
>>> top5
[('PORTIA', 117), ('SHYLOCK', 79), ('BASSANIO', 73),
('GRATIANO', 48), ('LORENZO', 47)]
我们也可以查看对话中谁跟着谁的模式。由于有 23 个演员,我们需要首先使用3中描述的方法将“词汇”减少到可处理的大小。
>>> from collections import defaultdict
>>> abbreviate = defaultdict(lambda: 'OTH')
>>> for speaker, _ in top5:
... abbreviate[speaker] = speaker[:4]
...
>>> speaker_seq2 = [abbreviate[speaker] for speaker in speaker_seq]
>>> cfd = nltk.ConditionalFreqDist(nltk.bigrams(speaker_seq2))
>>> cfd.tabulate()
ANTO BASS GRAT OTH PORT SHYL
ANTO 0 11 4 11 9 12
BASS 10 0 11 10 26 16
GRAT 6 8 0 19 9 5
OTH 8 16 18 153 52 25
PORT 7 23 13 53 0 21
SHYL 15 15 2 26 21 0
忽略 153 的条目,因为是前五位角色(标记为OTH
)之间相互对话,最大的值表示 Othello 和 Portia 的相互对话最多。