French Named Entity Recognition Model Based on Deep Neural Networks
2019-08-01
YAN Hong, CHEN Xingshu, WANG Wenxian, WANG Haizhou, YIN Mingyong
CLC number: TP391.1
Document code: A
Abstract: In existing French Named Entity Recognition (NER) research, machine learning models mostly use the character-morphological features of words, while multilingual generic named entity models use the semantic features represented by word embeddings; neither comprehensively considers semantic, character-morphological and grammatical features. To address this shortcoming, a deep neural network based model, CGC-fr, was designed to recognize French named entities. Firstly, the word embedding, character embedding and grammar feature vector of each word were extracted from the text. Then, character features were extracted from the character embedding sequence of each word by a Convolutional Neural Network (CNN). Finally, a Bidirectional Gated Recurrent Unit network (BiGRU) and a Conditional Random Field (CRF) were used to label the named entities in French text according to the word embeddings, character features and grammar feature vectors. In the experiments, the F1 value of the CGC-fr model reaches 82.16% on the test set, which is 5.67, 1.79 and 1.06 percentage points higher than that of the NERC-fr, LSTM-CRF (Long Short-Term Memory network - CRF) and Char attention models respectively. The experimental results show that the CGC-fr model fusing the three features is more advantageous than the others.
Key words: Named Entity Recognition (NER); French; deep neural network; Natural Language Processing (NLP); sequence labeling
0 Introduction
Named Entity Recognition (NER) is the process of identifying the names or symbols of specific types of things in text [1]. It extracts the more meaningful person names, organization names, place names and so on, allowing downstream natural language processing tasks to obtain further information from the named entities. With globalization, information exchange between countries has become increasingly frequent. Compared with Chinese, foreign-language information has a stronger influence on how other countries view China, and multilingual public opinion analysis has emerged in response. Among non-English languages, French is relatively influential, so French text is one of the important targets of multilingual public opinion analysis. As a fundamental task of French text analysis, French NER is of non-negligible importance.
There has been relatively little research devoted specifically to French NER. Early work was mainly based on rules and dictionaries [2]; later, manually selected features were typically fed into machine learning models to recognize the named entities in text [3-7]. Azpeitia et al. [3] proposed the NERC-fr model, which uses a maximum entropy method to recognize French named entities with features including word suffixes, character windows, neighboring words, word prefixes, word length and initial capitalization. This approach achieved good results, but most of its features are morphological rather than semantic, and the lack of semantic features may have limited the model's recognition accuracy.
In recent years, deep neural networks have achieved good results in natural language processing. Hammerton [8] applied the Long Short-Term Memory (LSTM) network to English NER. Rei et al. [9] proposed the multilingual generic Char attention model, which uses an attention mechanism to fuse word embeddings and character embeddings and feeds the result into a Bidirectional Long Short-Term Memory (BiLSTM) network, producing named entities via sequence labeling. Lample et al. [10] proposed the LSTM-CRF model, a BiLSTM followed by a Conditional Random Field (CRF); it is also multilingual generic and uses word and character embeddings as features to recognize English named entities. When applied to French, however, LSTM-CRF falls well short of its English performance, possibly because it uses no language-specific grammatical features; after all, French grammar is considerably more complex than English grammar.
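As context, the character-attention fusion of Rei et al. [9] gates between a word embedding x and a character-derived embedding m. The following is a minimal NumPy sketch under assumed notation z = σ(W3 tanh(W1 x + W2 m)), fused = z ⊙ x + (1 − z) ⊙ m; the matrices W1, W2, W3 and the dimension d are illustrative, not values from the cited work.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def char_attention_fuse(x, m, W1, W2, W3):
    """Gate-based fusion of a word embedding x and a character-level
    embedding m: z = sigmoid(W3 tanh(W1 x + W2 m)),
    fused = z * x + (1 - z) * m (elementwise)."""
    z = sigmoid(W3 @ np.tanh(W1 @ x + W2 @ m))
    return z * x + (1.0 - z) * m

rng = np.random.default_rng(0)
d = 4                                    # illustrative embedding dimension
x = rng.normal(size=d)                   # word embedding
m = rng.normal(size=d)                   # character-derived embedding
W1, W2, W3 = (rng.normal(size=(d, d)) for _ in range(3))
fused = char_attention_fuse(x, m, W1, W2, W3)
```

Because each gate value lies in (0, 1), the fused vector is an elementwise convex combination of x and m, letting the model lean on character evidence for rare or misspelled words.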
To take semantic, character-morphological and grammatical features into account during extraction and recognize French named entities more accurately, this paper designs the CGC-fr model. The model uses word embeddings to represent the semantic features of the words in the text, uses a Convolutional Neural Network (CNN) to extract each word's character-morphological features from its character embeddings, and concatenates these with pre-extracted French grammatical features; the result is fed into a composite network combining a Bidirectional Gated Recurrent Unit (BiGRU) network and a CRF to recognize French named entities. CGC-fr makes full use of these features: experiments quantify the contribution of each feature, and comparisons with other models show that fusing the three features gives CGC-fr an advantage. In addition, this paper contributes a French dataset of 1005 articles and 29016 entities, enlarging the pool of French NER datasets so that subsequent research is less constrained by data availability.
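The per-word input described above can be sketched as follows: a CNN with max-over-time pooling turns a word's character-embedding sequence into a fixed-size character feature, which is concatenated with the word embedding and the grammar feature vector before entering the BiGRU. This is a minimal NumPy sketch; all dimensions, the filter bank, and the one-hot grammar vector are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def char_cnn_features(char_embs, filters):
    """Max-over-time pooling of 1-D convolutions over a word's
    character-embedding sequence.
    char_embs: [n_chars, d_char]; filters: [n_filters, width, d_char]."""
    n_chars, d_char = char_embs.shape
    n_filters, width, _ = filters.shape
    feats = np.full(n_filters, -np.inf)
    for f in range(n_filters):
        for start in range(n_chars - width + 1):
            window = char_embs[start:start + width]
            feats[f] = max(feats[f], np.sum(window * filters[f]))
    return feats

rng = np.random.default_rng(1)
d_char, n_filters, width = 8, 5, 3       # illustrative sizes
word = "Paris"
char_embs = rng.normal(size=(len(word), d_char))
filters = rng.normal(size=(n_filters, width, d_char))
char_feat = char_cnn_features(char_embs, filters)

# Per-word input to the BiGRU:
# [word embedding ; CNN character feature ; grammar feature]
word_emb = rng.normal(size=50)           # illustrative word-embedding size
grammar_feat = np.zeros(10)              # hypothetical one-hot grammar/POS vector
token_vec = np.concatenate([word_emb, char_feat, grammar_feat])
```

Max-over-time pooling makes the character feature independent of word length, which is what allows words of any length to share one BiGRU input dimension.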
4 Conclusion
This paper designs the CGC-fr deep neural network model for French NER and builds a French NER dataset. The CGC-fr model takes the word embeddings of the words in French text as semantic features, extracts each word's morphological features from its character embedding sequence, and combines them with grammatical features to recognize named entities. This increases the diversity of the features used in traditional statistical machine learning methods, enriches their content, and avoids the neglect of French grammar found in multilingual generic approaches. Experiments compare the contribution of each feature in the model and verify their effectiveness; CGC-fr is also compared with the maximum entropy model NERC-fr and the multilingual generic Char attention and LSTM-CRF models. The results show that the F1 value of CGC-fr is higher than all three, verifying the effectiveness of the proposed three-feature model on French NER and further improving the recognition rate of French named entities.
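At prediction time, the CRF layer mentioned above selects the best label sequence by Viterbi decoding over emission and transition scores. The following is a minimal sketch of linear-chain CRF decoding; the two-label toy scores are invented for illustration and are not values from the paper.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the label sequence maximizing
    sum_t emissions[t, y_t] + sum_{t>0} transitions[y_{t-1}, y_t]
    (linear-chain CRF decoding)."""
    T, K = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example with two labels (e.g. 0 = O, 1 = I): emissions favor label 0
# at step 0 and label 1 afterwards; transitions reward staying on a label.
emissions = np.array([[2.0, 0.0],
                      [0.0, 1.0],
                      [0.0, 2.0]])
transitions = np.array([[1.0, 0.0],
                        [0.0, 1.0]])
best = viterbi_decode(emissions, transitions)  # → [0, 1, 1]
```

Unlike per-token argmax over emissions alone, the transition scores let the CRF forbid or penalize invalid label sequences, which is why it is a common final layer for sequence labeling.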
Nevertheless, the model has limitations. The recognition rate for organization names in French text lags considerably behind that of the other two entity types, so the model does not handle entity types with highly variable forms very well. Moreover, compared with the relatively high accuracy of English NER, French NER still has considerable room for improvement.
References
[1] NADEAU D, SEKINE S. A survey of named entity recognition and classification[J]. Lingvisticae Investigationes, 2007, 30(1): 3-26.
[2] WOLINSKI F, VICHOT F, DILLET B. Automatic processing of proper names in texts[C]// Proceedings of the 7th Conference on European Chapter of the Association for Computational Linguistics. San Francisco, CA: Morgan Kaufmann Publishers, 1995: 23-30.
[3] AZPEITIA A, CUADROS M, GAINES S, et al. NERC-fr: supervised named entity recognition for French[C]// TSD 2014: Proceedings of the 2014 International Conference on Text, Speech and Dialogue. Berlin: Springer, 2014: 158-165.
[4] POIBEAU T. The multilingual named entity recognition framework[C]// Proceedings of the 10th Conference on European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2003: 155-158.
[5] PETASIS G, VICHOT F, WOLINSKI F, et al. Using machine learning to maintain rule-based named-entity recognition and classification systems[C]// Proceedings of the 39th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2001: 426-433.
[6] WU D, NGAI G, CARPUAT M. A stacked, voted, stacked model for named entity recognition[C]// Proceedings of the 7th Conference on Natural Language Learning at HLT. Stroudsburg, PA: Association for Computational Linguistics, 2003: 200-203.
[7] NOTHMAN J, RINGLAND N, RADFORD W, et al. Learning multilingual named entity recognition from Wikipedia[J]. Artificial Intelligence, 2013, 194:151-175.
[8] HAMMERTON J. Named entity recognition with long short-term memory[C]// Proceedings of the 7th Conference on Natural Language Learning at HLT. Stroudsburg, PA: Association for Computational Linguistics, 2003: 172-175.
[9] REI M, CRICHTON G, PYYSALO S. Attending to characters in neural sequence labeling models[J/OL]. arXiv preprint arXiv:1611.04361, 2016 [2016-11-14]. https://arxiv.org/abs/1611.04361.
[10] LAMPLE G, BALLESTEROS M, SUBRAMANIAN S, et al. Neural architectures for named entity recognition[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2016: 260-270.
[11] LE Q, MIKOLOV T. Distributed representations of sentences and documents[C]// Proceedings of the 31st International Conference on Machine Learning. New York: JMLR.org, 2014: 1188-1196.
[12] PENNINGTON J, SOCHER R, MANNING C. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1532-1543.
[13] SANTOS C D, ZADROZNY B. Learning character-level representations for part-of-speech tagging[C]// Proceedings of the 31st International Conference on Machine Learning. New York: JMLR.org, 2014: 1818-1826.
[14] CHO K, van MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1724-1734.
[15] SANG E F, VEENSTRA J. Representing text chunks[C]// Proceedings of the 9th Conference on European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 1999: 173-179.