Integrating Surface and Abstract Features for Robust Cross-Domain Chinese Word Segmentation
Author: Xiaoqing Li
Presenter: Zhao Anbang (赵安邦)
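Problem
• Cross-domain OOV (out-of-vocabulary) words
刑法 (criminal law), 民法 (civil law), 宪法 (constitutional law) ... 吸星大法 (a martial-arts technique from a wuxia novel)
• In different domains, the same character (here 酸, "acid" or "sour") has a different tag distribution
酸 (s), 酸的 (b), 酸性 (b)
硫酸 (e), 盐酸 (e), 硝酸 (e)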
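Solution
• Introduce a domain dictionary
Without bringing in new domain knowledge, this problem is hard to solve. Dictionaries are relatively easy to obtain, e.g. a chemical engineering term dictionary or a medical term dictionary.
Ways to use the dictionary:
(1) Mechanical matching
(2) Using information about the longest dictionary-matched word (widely used in discriminative segmentation methods)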
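Example
Sentence: 新华社报道。 ("Xinhua News Agency reports.")
Dictionary: 华社, 新华社
Length of the longest matched word: 3
Extracted feature: C0=华, L=3, m (presumably because 华 sits in the middle of the longest matched word 新华社)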
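The example above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name `longest_match_feature`, the `max_len` window, and the lowercase b/m/e/s position tags are assumptions made here, chosen only to reproduce the feature (C0=华, L=3, m) shown on the slide.

```python
def longest_match_feature(sentence, index, dictionary, max_len=4):
    """Return (char, L, tag) for sentence[index].

    L is the length of the longest dictionary word covering the character,
    and tag is the character's position in that word (b/m/e/s).
    Hypothetical helper: the name and the max_len window are assumptions.
    """
    best = None  # (length, start) of the longest covering word found so far
    for start in range(max(0, index - max_len + 1), index + 1):
        for end in range(index + 1, min(len(sentence), start + max_len) + 1):
            if sentence[start:end] in dictionary:
                length = end - start
                if best is None or length > best[0]:
                    best = (length, start)

    char = sentence[index]
    if best is None:
        return (char, 0, None)  # no dictionary word covers this character

    length, start = best
    offset = index - start
    if length == 1:
        tag = "s"
    elif offset == 0:
        tag = "b"
    elif offset == length - 1:
        tag = "e"
    else:
        tag = "m"
    return (char, length, tag)


dictionary = {"华社", "新华社"}
print(longest_match_feature("新华社报道。", 1, dictionary))
```

Although 华社 also matches at 华, the longer word 新华社 wins, so 华 is reported with length 3 and the middle position.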
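Related concepts
Surface features
N-gram probabilities
Abstract features
Whether a character takes the tag it has in its longest dictionary-matched word follows a distribution that is almost unchanged across domains. (mapping)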
Dictionary Coverage Status
• A set of five elements: {No-Dictionary-Word, No-Ambiguity, Crossed-Ambiguity, Included-Ambiguity, Mixed-Ambiguity}
Purpose: classifies the kind of ambiguity among the dictionary words matched for a character.
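One way the five statuses could be assigned is sketched below. The slides do not preserve the exact definitions, so the rules here are assumptions: no covering word gives No-Dictionary-Word, exactly one gives No-Ambiguity, nested covering words give Included-Ambiguity, partially overlapping ones give Crossed-Ambiguity, and both together give Mixed-Ambiguity. The function names and the `max_len` window are likewise hypothetical.

```python
def covering_spans(sentence, index, dictionary, max_len=4):
    """List (start, end) spans of dictionary words that cover sentence[index]."""
    spans = []
    for start in range(max(0, index - max_len + 1), index + 1):
        for end in range(index + 1, min(len(sentence), start + max_len) + 1):
            if sentence[start:end] in dictionary:
                spans.append((start, end))
    return spans


def coverage_status(sentence, index, dictionary, max_len=4):
    """Classify the ambiguity among dictionary words covering one character.

    The classification rules are assumptions, not taken from the slides.
    """
    spans = covering_spans(sentence, index, dictionary, max_len)
    if not spans:
        return "No-Dictionary-Word"
    if len(spans) == 1:
        return "No-Ambiguity"

    included = crossed = False
    for i, (s1, e1) in enumerate(spans):
        for s2, e2 in spans[i + 1:]:
            # Two spans covering the same character are either nested or crossing.
            if (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2):
                included = True
            else:
                crossed = True

    if included and crossed:
        return "Mixed-Ambiguity"
    return "Included-Ambiguity" if included else "Crossed-Ambiguity"


print(coverage_status("新华社报道。", 1, {"华社", "新华社"}))
print(coverage_status("结合成", 1, {"结合", "合成"}))
```

Under these assumed rules, 华 in 新华社 is covered by the nested words 华社 and 新华社, while 合 in 结合成 is covered by the overlapping words 结合 and 合成.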
Dictionary Coverage Status
• Example: Included-Ambiguity
Dictionary Coverage Status
• Example: Crossed-Ambiguity
Tag Matching Status
• A set of four elements: {Following-Longest-Word, Only-Following-Shorter-Word, Not-Following-Any-Word, Inapplicable}
Purpose: classifies the relation between a character's tag and the tags implied by its matched dictionary words.
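The four statuses can be illustrated with a small sketch. The slides give only the names, so the rules below are assumptions: a character "follows" a word when its actual tag equals the position tag (b/m/e/s) that the word would assign to it, and Inapplicable is used when no dictionary word covers the character. The helper names and the `(start, end)` span representation are hypothetical.

```python
def position_tag(start, end, index):
    """Tag that the word spanning [start, end) would assign to position index."""
    if end - start == 1:
        return "s"
    if index == start:
        return "b"
    if index == end - 1:
        return "e"
    return "m"


def tag_matching_status(spans, index, tag):
    """Relate a character's tag to the dictionary words covering it.

    spans: (start, end) pairs of matched dictionary words covering index.
    The decision rules are assumptions, not taken from the slides.
    """
    if not spans:
        return "Inapplicable"
    longest = max(end - start for start, end in spans)
    longest_tags = {position_tag(s, e, index) for s, e in spans if e - s == longest}
    if tag in longest_tags:
        return "Following-Longest-Word"
    shorter_tags = {position_tag(s, e, index) for s, e in spans if e - s < longest}
    if tag in shorter_tags:
        return "Only-Following-Shorter-Word"
    return "Not-Following-Any-Word"


# 华 (index 1) in 新华社报道。 is covered by 新华社 (0, 3) and 华社 (1, 3).
spans = [(0, 3), (1, 3)]
for tag in ("m", "b", "s"):
    print(tag, tag_matching_status(spans, 1, tag))
```

With these rules, tag m agrees with the longest word 新华社, tag b agrees only with the shorter word 华社, and tag s agrees with neither.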
Generative model derivation
• Conventional generative model
• Generative model with dictionary features added
Generative model derivation
Approximated as:
Generative model derivation
The abstract feature and the surface feature can be given different weights.
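The slide's weighted formula is not preserved in this transcript. As a rough illustration of weighting the two feature types, the sketch below assumes a log-linear interpolation of a surface probability and an abstract probability. The interpolation form, the function names, and the toy probabilities are all assumptions made for illustration, not the authors' formula or data.

```python
import math


def weighted_score(p_surface, p_abstract, weight=0.5):
    """Combine the two feature probabilities in log space.

    weight is the share given to the abstract feature; the surface
    feature receives the remainder. Assumed form, not from the slides.
    """
    return weight * math.log(p_abstract) + (1.0 - weight) * math.log(p_surface)


def best_tag(candidates, weight=0.5):
    """Pick the tag with the highest weighted score.

    candidates maps each tag to a (p_surface, p_abstract) pair.
    """
    return max(candidates, key=lambda tag: weighted_score(*candidates[tag], weight=weight))


# Toy numbers: the surface model prefers "b", the abstract model prefers "e".
candidates = {"b": (0.6, 0.1), "e": (0.3, 0.8)}
for w in (0.0, 0.5, 1.0):
    print(w, best_tag(candidates, weight=w))
```

Setting the weight to 0 uses the surface feature alone and setting it to 1 uses the abstract feature alone, so the weight controls how far dictionary-derived evidence can override the n-gram evidence.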
Generative model derivation
This model can be further integrated with a discriminative model, giving the following formula:
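Experiments
• Experimental setup
• Training corpus: PKU-News7 from CIPS-SIGHAN-2010
• In-domain test corpus: PKU-News testing corpus of SIGHAN-2005
• Cross-domain test corpora: corpora of CIPS-SIGHAN-2010 (literature, computer science, medicine, finance)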
Experiments
• Results for the generative model
B is the baseline system.
Abstract feature formula for G1:
Abstract feature formula for G2:
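Experiments
• Results for the generative + discriminative model
SBest is the baseline system (best results of SIGHAN 2005 (News) and CIPS-SIGHAN 2010 (other domains)).
ED is the discriminative system improved with the dictionary (Enhanced Discriminative).
EG (Generative), EI (Integrated)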