Automatic Passive Voice Detection

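Description

This script depends on the NLTK library; thanks to everyone who contributed to it.

It identifies passive voice in English text, in both main clauses and subordinate clauses.

Detection rules:

  • If a sentence contains no past participle, it cannot be passive.
  • If the only past participle in a sentence is “been”, it cannot be passive, e.g. He has been a teacher since 2000.
  • If a sentence contains at least one past participle other than “been”, and every verb between that participle and the nearest preceding personal pronoun or noun is a form of “be”, the sentence contains passive voice.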

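For lack of a test set, the script has only been checked against a small amount of text, where its accuracy was acceptable. Anyone who has a large corpus annotated for passive voice is welcome to run this script on it and measure how often it is right.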
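If such a corpus were available, the check could be as simple as comparing the script's verdicts with the gold labels. The sketch below assumes a hypothetical list of (sentence, is_passive) pairs and takes the detector as a parameter; `evaluate` and the label format are illustrative and not part of the original script.

```python
def evaluate(detector, labelled_sentences):
    """Compare detector(sentence) with gold labels.

    labelled_sentences is an iterable of (sentence, is_passive) pairs
    (a hypothetical format, chosen for illustration).
    Returns accuracy, precision and recall as a dict.
    """
    tp = fp = tn = fn = 0
    for sentence, gold in labelled_sentences:
        predicted = bool(detector(sentence))
        if predicted and gold:
            tp += 1          # passive, and flagged as passive
        elif predicted and not gold:
            fp += 1          # active, but flagged as passive
        elif not predicted and gold:
            fn += 1          # passive, but missed
        else:
            tn += 1          # active, and left alone
    total = tp + fp + tn + fn
    return {
        'accuracy': (tp + tn) / total if total else 0.0,
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

With the script loaded, a call such as `evaluate(isPassive, corpus)` would report all three figures at once. Precision and recall are worth reading separately, since tagging errors can produce both false alarms and misses.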

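Main drawback: all of the work rests on NLTK's part-of-speech tagging, so tagging accuracy directly limits the accuracy of everything downstream. For the best results, a separate tagger with higher accuracy would probably have to be trained.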

Introduction

This script relies on NLTK; many thanks to its contributors.

As its name suggests, this script automatically detects passive voice in English text, whether it occurs in a main clause or a subordinate clause.

Rules of detection:

  • If there is no PP in a sentence, the sentence can’t be passive.
  • If the only PP in a sentence is “been”, the sentence can’t be passive, e.g. He has been a teacher since 2000.
  • If there is at least one PP other than “been” in a sentence, and every verb between that PP and the nearest preceding personal pronoun or noun is a form of “be”, the sentence is passive.

(PP is short for “past participle”)
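The three rules can be sketched on tokens that are already tagged, which keeps the logic separate from the tagger. The snippet assumes Penn Treebank tags of the kind NLTK produces (VBN for past participles, NN* for nouns, PRP for personal pronouns) and looks back to the nearest noun or personal pronoun. The name `is_passive_tagged` and the hand-tagged input are illustrative, and the sketch follows the rules as worded, so it does not include the script's extra tolerance for "do" and "have".

```python
BE_FORMS = {'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being'}


def is_passive_tagged(tagged):
    """Apply the three rules to a list of (word, tag) pairs."""
    # Rules 1 and 2: we need a past participle other than "been".
    candidates = [i for i, (word, tag) in enumerate(tagged)
                  if tag == 'VBN' and word.lower() != 'been']
    for end in candidates:
        # Rule 3: walk back to the nearest noun or personal pronoun.
        start = 0
        for i in range(end - 1, -1, -1):
            tag = tagged[i][1]
            if tag.startswith('NN') or tag == 'PRP':
                start = i + 1
                break
        # Every verb in between must be a form of "be".
        verbs = [w.lower() for w, t in tagged[start:end] if t.startswith('V')]
        if verbs and all(v in BE_FORMS for v in verbs):
            return True
    return False


# Hand-tagged example (tags written by hand, not produced by a tagger).
hunted = [('The', 'DT'), ('man', 'NN'), ('is', 'VBZ'),
          ('being', 'VBG'), ('hunted', 'VBN'), ('.', '.')]
print(is_passive_tagged(hunted))
```

Working on tagged tuples also makes the rules easy to unit-test, because a test can supply the tags it wants instead of depending on what a tagger happens to return.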

Due to a lack of large passive-voice-annotated corpora, the script has only been tested on a small amount of text, where its accuracy was acceptable.

A major drawback of this script is that it relies entirely on NLTK's POS tagger, whose accuracy determines the quality of everything that follows. To achieve the best results, a tailor-made tagger would probably have to be trained first.
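One way to prepare for a better tagger is to pass the tagger in as a parameter, so that a custom-trained one can be dropped in without touching the rules. In the sketch below, `find_participles` and `lookup_tagger` are hypothetical names; any callable that maps a list of words to (word, tag) pairs would fit, which is the shape `nltk.pos_tag` returns.

```python
def find_participles(words, tagger):
    """Return the words that the given tagger labels as past participles (VBN)."""
    return [word for word, tag in tagger(words) if tag == 'VBN']


def lookup_tagger(words):
    """Toy stand-in for a real tagger: a tiny hand-written lexicon.

    Unknown words default to 'NN'. This is for illustration only.
    """
    lexicon = {'the': 'DT', 'man': 'NN', 'is': 'VBZ', 'hunted': 'VBN'}
    return [(word, lexicon.get(word.lower(), 'NN')) for word in words]


print(find_participles(['The', 'man', 'is', 'hunted'], lookup_tagger))
```

With NLTK installed, the same function could be called as `find_participles(nltk.word_tokenize(sentence), nltk.pos_tag)`, and a retrained tagger would only need to expose the same call shape.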

 

The code is as follows:

import nltk
 
def isPassive(sentence):
    beforms = ['am', 'is', 'are', 'been', 'was', 'were', 'be', 'being']  # all forms of "be"
    aux = ['do', 'did', 'does', 'have', 'has', 'had']                    # NLTK tags "do" and "have" as verbs, so they are tolerated alongside "be".
    words = nltk.word_tokenize(sentence)
    tokens = nltk.pos_tag(words)
    tags = [i[1] for i in tokens]
    if tags.count('VBN') == 0:                                           # no PP, no passive voice.
        return False
    # gather all the PPs that are not "been".
    pos = [i for i in range(len(tags)) if tags[i] == 'VBN' and words[i].lower() != 'been']
    if not pos:                                                          # "been" is the only PP, still no passive voice.
        return False
    for end in pos:
        start = 0
        # walk back from the PP to the nearest noun or personal pronoun (which in most cases is the subject)
        for i in range(end - 1, -1, -1):
            if tags[i].startswith('NN') or tags[i] == 'PRP':
                start = i + 1
                break                                                    # stop at the nearest one instead of scanning on to the start of the sentence
        # get all the verbs in between
        verbs = [words[i].lower() for i in range(start, end) if tags[i].startswith('V')]
        if not verbs:                                                    # if there are no verbs in between, it's not passive
            continue
        # every verb must be a form of "be" or an auxiliary such as "do" or "have",
        # and at least one must be a form of "be"; otherwise "He has eaten" would count as passive.
        if all(v in beforms or v in aux for v in verbs) and any(v in beforms for v in verbs):
            return True
    return False
 
 
if __name__ == '__main__':
 
    samples = '''I like being hunted.
    The man is being hunted.
    Don't be frightened by what he said.
    I assume that you are not informed of the matter.
    Please be advised that the park is closing soon.
    The book will be released tomorrow.
    We're astonished to see the building torn down.
    The hunter is literally being chased by the tiger.
    He has been awesome since birth.
    She has been beautiful since birth.'''                                                   # "awesome" is wrongly tagged as PP. So the sentence gets a "True".
 
    sents = nltk.sent_tokenize(samples)
    for sent in sents:
        print(sent + '--> %s' % isPassive(sent))

GitHub repository: https://github.com/flycrane01/nltk-passive-voice-detector-for-English