4 Writing Structured Programs - 4.5 More on Functions - Natural Language Processing with Python, 2nd Edition

4.5 More on Functions

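This section discusses more advanced features, which you may prefer to skip on a first reading of this chapter.

Functions as Arguments

So far the arguments we have passed to functions have been simple objects such as strings, or structured objects such as lists. Python also lets us pass a function as an argument to another function. This allows us to abstract out the operation and apply different operations to the same data. As the following examples show, we can pass the built-in function len() or a user-defined function last_letter() as an argument to another function: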

>>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
...         'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
>>> def extract_property(prop):
...     return [prop(word) for word in sent]
...
>>> extract_property(len)
[4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]
>>> def last_letter(word):
...     return word[-1]
>>> extract_property(last_letter)
['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']

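The objects len and last_letter can be passed around just like lists and dictionaries. Note that parentheses follow a function name only when we are calling the function; when we are simply treating the function as an object, they are omitted.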
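Because functions are ordinary objects, they can also be stored in data structures. The following is a minimal sketch of that idea; the dictionary name properties and its keys are invented for this illustration.

```python
def last_letter(word):
    return word[-1]

# Hypothetical lookup table: each value is a function object, not a call,
# so there are no parentheses after len or last_letter.
properties = {'length': len, 'last letter': last_letter}

# Call each stored function on the same word.
results = {name: prop('sense') for name, prop in properties.items()}
print(results)  # {'length': 5, 'last letter': 'e'}
```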

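Python gives us one more way to define functions as arguments to other functions: so-called lambda expressions. Suppose the last_letter() function above is not needed anywhere else, so there is no need to give it a name. We can equivalently write the following: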

>>> extract_property(lambda w: w[-1])
['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']

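Our next example illustrates passing a function to sorted(). When we call sorted() with a single argument (the list to be sorted), it compares the items directly using their default ordering. (Python 2 also accepted a comparison function such as the built-in cmp() as a second argument; Python 3 has neither.) However, we can supply our own function through the key parameter, for example to sort by decreasing length.

>>> sorted(sent)
[',', '.', 'Take', 'and', 'care', 'care', 'of', 'of', 'sense', 'sounds',
'take', 'the', 'the', 'themselves', 'will']
>>> sorted(sent, key=len, reverse=True)
['themselves', 'sounds', 'sense', 'Take', 'care', 'will', 'take', 'care',
'the', 'and', 'the', 'of', 'of', ',', '.']
>>> sorted(sent, key=lambda w: -len(w))
['themselves', 'sounds', 'sense', 'Take', 'care', 'will', 'take', 'care',
'the', 'and', 'the', 'of', 'of', ',', '.']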
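Code written for Python 2 often passes a two-argument comparison function to sorted(). One way to reuse such a function in Python 3 is to wrap it with functools.cmp_to_key, as in the sketch below. The function name by_decreasing_length is made up for this illustration.

```python
from functools import cmp_to_key

sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

def by_decreasing_length(x, y):
    # Hypothetical old-style comparison function: a negative result means
    # x sorts first, a positive result means y sorts first, zero is a tie.
    return len(y) - len(x)

with_cmp = sorted(sent, key=cmp_to_key(by_decreasing_length))
with_key = sorted(sent, key=len, reverse=True)

# Both sorts are stable, so ties keep their original order in each result.
print(with_cmp == with_key)  # True
```

A key function is usually the simpler choice when each item can be ranked on its own; cmp_to_key is mainly useful when the ordering is only available as a pairwise comparison.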

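Accumulative Functions

These functions start by initializing some storage, iterate over the input to process it, and finally return some object (a large structure or an aggregated result). A standard way to do this is to initialize an empty list, accumulate material in it, and then return the list, as shown in function search1() in 4.6.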

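def search1(substring, words):
    # Build the complete list of matches before returning it.
    result = []
    for word in words:
        if substring in word:
            result.append(word)
    return result

def search2(substring, words):
    # Generator version: hand back one match at a time.
    for word in words:
        if substring in word:
            yield word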

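The function search2() is a generator. Calling it returns a generator object; when the caller requests the first item, the body runs as far as the yield statement and pauses. The calling program receives the first word and does any necessary processing. Once the calling program is ready for another word, execution of the function resumes from where it stopped, until it reaches a yield statement again. This approach is typically more efficient, because the function only produces data as the calling program requires it, and does not need to allocate additional memory to store the output (compare the earlier discussion of generator expressions).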
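The pause-and-resume behaviour can be observed by pulling items one at a time with the built-in next(). This is a small sketch; the word list is made-up sample data rather than a corpus.

```python
def search2(substring, words):
    for word in words:
        if substring in word:
            yield word

words = ['grizzly', 'bear', 'puzzle', 'dizzy', 'cat']  # made-up sample data

matches = search2('zz', words)  # no searching has happened yet
first = next(matches)           # runs the body up to the first yield
second = next(matches)          # resumes after that yield, stops at the next
rest = list(matches)            # drains whatever is left

print(first, second, rest)  # grizzly puzzle ['dizzy']
```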

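Here is a more sophisticated example of a generator, which produces all permutations of a list of words. To force the permutations() function to generate all of its output, we wrap it in a call to list().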

>>> def permutations(seq):
...     if len(seq) <= 1:
...         yield seq
...     else:
...         for perm in permutations(seq[1:]):
...             for i in range(len(perm)+1):
...                 yield perm[:i] + seq[0:1] + perm[i:]
...
>>> list(permutations(['police', 'fish', 'buffalo']))
[['police', 'fish', 'buffalo'], ['fish', 'police', 'buffalo'],
 ['fish', 'buffalo', 'police'], ['police', 'buffalo', 'fish'],
 ['buffalo', 'police', 'fish'], ['buffalo', 'fish', 'police']]

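Note

The permutations function uses a technique called recursion, discussed below in 4.7. The ability to generate permutations of a set of words is useful for creating data to test a grammar (8.).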
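As a sanity check, the recursive generator can be compared with itertools.permutations from the standard library. This is only a sketch: the standard function yields tuples in a different order, so both results are converted to sorted lists of lists before comparing.

```python
from itertools import permutations as std_permutations

def permutations(seq):
    if len(seq) <= 1:
        yield seq
    else:
        for perm in permutations(seq[1:]):
            for i in range(len(perm) + 1):
                # Insert the first item at every position of each shorter permutation.
                yield perm[:i] + seq[0:1] + perm[i:]

words = ['police', 'fish', 'buffalo']
ours = sorted(permutations(words))
reference = sorted(list(p) for p in std_permutations(words))

print(len(ours), ours == reference)  # 6 True
```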

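Higher-Order Functions

Python provides some higher-order functions that are standard features of functional programming languages such as Haskell. We illustrate them here, alongside the equivalent expressions using list comprehensions.

Let's start by defining a function is_content_word(), which checks whether a word belongs to the open class of content words. We use this function as the first argument of filter(), which applies the function to each item in the sequence given as its second argument and retains only the items for which the function returns True.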

>>> def is_content_word(word):
...     return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']
>>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
...         'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
>>> list(filter(is_content_word, sent))
['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves']
>>> [w for w in sent if is_content_word(w)]
['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves']

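Another higher-order function is map(), which applies a function to every item in a sequence. It is a general version of the extract_property() function we saw in 4.5. Here is a simple way to find the average length of a sentence in the news section of the Brown Corpus, followed by an equivalent version that uses a list comprehension: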

>>> lengths = list(map(len, nltk.corpus.brown.sents(categories='news')))
>>> sum(lengths) / len(lengths)
21.75081116158339
>>> lengths = [len(sent) for sent in nltk.corpus.brown.sents(categories='news')]
>>> sum(lengths) / len(lengths)
21.75081116158339

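In the examples above we specified a user-defined function, is_content_word(), and a built-in function, len(). We can also supply a lambda expression. Here is a pair of equivalent examples that count the number of vowels in each word.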

>>> list(map(lambda w: len(list(filter(lambda c: c.lower() in "aeiou", w))), sent))
[2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0]
>>> [len([c for c in w if c.lower() in "aeiou"]) for w in sent]
[2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0]

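Solutions based on list comprehensions are usually more readable than solutions based on higher-order functions, and we have favored the former throughout this book.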
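When only a count is needed, one alternative is to pass a generator expression to sum(), which avoids building an intermediate list for each word. The sketch below assumes a helper named count_vowels, which is not part of the original text.

```python
sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

VOWELS = "aeiou"

def count_vowels(word):
    # Hypothetical helper: add 1 for every character that is a vowel.
    return sum(1 for c in word if c.lower() in VOWELS)

vowel_counts = [count_vowels(w) for w in sent]
print(vowel_counts)  # [2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0]
```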

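Named Arguments

When there are many parameters, it is easy to get confused about the correct order. Instead, we can refer to parameters by name, and even assign them default values to be used when the calling program does not supply them. The parameters can then be specified in any order, and can be omitted.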

>>> def repeat(msg='<empty>', num=1):
...     return msg * num
>>> repeat(num=3)
'<empty><empty><empty>'
>>> repeat(msg='Alice')
'Alice'
>>> repeat(num=5, msg='Alice')
'AliceAliceAliceAliceAlice'

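These are called keyword arguments. If we mix the two kinds of arguments, we must ensure that the unnamed arguments precede the named ones. It has to be this way, because unnamed arguments are identified by position. We can define a function that accepts an arbitrary number of unnamed and named arguments, and access them through an in-place sequence of arguments *args (a tuple inside the function) and an in-place dictionary of keyword arguments **kwargs. (Dictionaries are covered in 3.)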

>>> def generic(*args, **kwargs):
...     print(args)
...     print(kwargs)
...
>>> generic(1, "African swallow", monty="python")
(1, 'African swallow')
{'monty': 'python'}

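When *args appears as a function parameter, it corresponds to all of the function's unnamed arguments. Here is another illustration of this aspect of Python syntax, using the zip() function, which operates on a variable number of arguments. We use the variable name *song to show that there is nothing special about the name *args.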

>>> song = [['four', 'calling', 'birds'],
...         ['three', 'French', 'hens'],
...         ['two', 'turtle', 'doves']]
>>> list(zip(song[0], song[1], song[2]))
[('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')]
>>> list(zip(*song))
[('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')]

It should be clear from this example that typing *song is just a convenient shorthand, equivalent to typing out song[0], song[1], song[2].
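The same shorthand works for keyword arguments: a ** prefix unpacks a dictionary into named arguments at the call site. The following sketch reuses the repeat() function from earlier in this section; the variable names args and kwargs are arbitrary.

```python
def repeat(msg='<empty>', num=1):
    return msg * num

args = ('Alice', 2)                # positional values, in order
kwargs = {'num': 3, 'msg': 'Bob'}  # keyword values, in any order

from_args = repeat(*args)       # same as repeat('Alice', 2)
from_kwargs = repeat(**kwargs)  # same as repeat(num=3, msg='Bob')

print(from_args, from_kwargs)  # AliceAlice BobBobBob
```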

Here is another example of using keyword arguments in a function definition, along with three equivalent ways to call the function:

>>> def freq_words(file, min=1, num=10):
...     text = open(file).read()
...     tokens = word_tokenize(text)
...     freqdist = nltk.FreqDist(t for t in tokens if len(t) >= min)
...     return freqdist.most_common(num)
>>> fw = freq_words('ch01.rst', 4, 10)
>>> fw = freq_words('ch01.rst', min=4, num=10)
>>> fw = freq_words('ch01.rst', num=10, min=4)

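A side effect of having named arguments is that they make arguments optional. Thus we can leave out any argument for which we are happy with the default value: freq_words('ch01.rst', min=4), freq_words('ch01.rst', 4). Another common use of optional arguments is to provide a flag. Here is a revised version of the same function that reports its progress if the verbose flag is set: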

>>> def freq_words(file, min=1, num=10, verbose=False):
...     freqdist = nltk.FreqDist()
...     if verbose: print("Opening", file)
...     text = open(file).read()
...     if verbose: print("Read in %d characters" % len(text))
...     for word in word_tokenize(text):
...         if len(word) >= min:
...             freqdist[word] += 1
...             if verbose and freqdist.N() % 100 == 0: print(".", end="")
...     if verbose: print()
...     return freqdist.most_common(num)

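Caution!

Take care not to use a mutable object as the default value of a parameter. A series of calls to the function will share the same object, sometimes with bizarre results, as we will see later in the discussion of debugging.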
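The pitfall can be reproduced in a few lines. This is a minimal sketch with made-up function names (append_bad, append_good); the usual workaround, shown in the second function, is to default to None and create the list inside the function.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is then shared by every call that omits bucket.
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):
    # A fresh list is created on each call that omits bucket.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first_bad = append_bad('a')
second_bad = append_bad('b')
print(second_bad)               # ['a', 'b']
print(first_bad is second_bad)  # True

first_good = append_good('a')
second_good = append_good('b')
print(first_good, second_good)  # ['a'] ['b']
```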

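Caution!

If your program will work with a lot of files, it is a good idea to close each open file once it is no longer needed. Python closes an open file automatically if you use the with statement: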

>>> with open("lexicon.txt") as f:
...     data = f.read()
...     # process the data
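To see the automatic closing in action, the sketch below checks the file object's closed attribute inside and after the with block. It writes a throwaway temporary file first so that it does not depend on lexicon.txt existing; the file contents are invented.

```python
import os
import tempfile

# Create a small temporary file with made-up contents.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('sleep: V\nidea: N\n')
    path = tmp.name

with open(path) as f:
    data = f.read()
    inside = f.closed  # still open inside the block
after = f.closed       # closed as soon as the block ends

os.remove(path)        # clean up the temporary file
print(inside, after)   # False True
```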