
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 125–134, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics

    Chinese Parsing Exploiting Characters

Meishan Zhang†, Yue Zhang‡∗, Wanxiang Che†, Ting Liu†
†Research Center for Social Computing and Information Retrieval
Harbin Institute of Technology, China
{mszhang, car, tliu}@ir.hit.edu.cn
‡Singapore University of Technology and Design
yue [email protected]

    Abstract

Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test set, and gives the best published results for Chinese parsing.

    1 Introduction

Characters play an important role in the Chinese language. They act as basic phonetic, morpho-syntactic and semantic units in a Chinese sentence. Frequently occurring character sequences that express certain meanings can be treated as words, while most Chinese words have syntactic structures. For example, Figure 1(b) shows the structure of the word "建筑业 (construction and building industry)", where the characters "建 (construction)" and "筑 (building)" form a coordination, and modify the character "业 (industry)".

∗Email correspondence.

[Figure 1: Word-based and character-level phrase-structure trees for the sentence "中国建筑业呈现新格局 (China's architecture industry shows new patterns)", where "l", "r" and "c" denote the directions of head characters (see Section 2). (a) CTB-style word-based syntax tree for "中国 (China) 建筑业 (architecture industry) 呈现 (show) 新 (new) 格局 (pattern)". (b) Character-level syntax tree with hierarchical word structures for "中 (middle) 国 (nation) 建 (construction) 筑 (building) 业 (industry) 呈 (present) 现 (show) 新 (new) 格 (style) 局 (situation)".]

However, computational processing of Chinese is typically based on words. Words are treated as the atomic units in syntactic parsing, machine translation, question answering and other NLP tasks. Manually annotated corpora, such as the Chinese Treebank (CTB) (Xue et al., 2005), usually have words as the basic syntactic elements (Figure 1(a)). This form of annotation does not give character-level syntactic structures for words, a source of linguistic information that is more fundamental and less sparse than atomic words.

In this paper, we investigate Chinese syntactic parsing with character-level information by extending the notation of phrase-structure (constituent) trees, adding recursive structures of characters for words. We manually annotate the structures of 37,382 words, which cover the entire CTB5. Using these annotations, we transform CTB-style constituent trees into character-level trees (Figure 1(b)). Our word structure corpus, together with a set of tools to transform CTB-style trees into character-level trees, is released at https://github.com/zhangmeishan/wordstructures. Our annotation work is in line with the work of Vadas and Curran (2007) and Li (2011), which provide extended annotations of Penn Treebank (PTB) noun phrases and CTB words (on the morphological level), respectively.

We build a character-based Chinese parsing model to parse the character-level syntax trees. Given an input Chinese sentence, our parser produces its character-level syntax tree (Figure 1(b)). With richer information than word-level trees, this form of parse tree can be useful for all the aforementioned Chinese NLP applications.

With regard to the task of parsing itself, an important advantage of character-level syntax trees is that they allow word segmentation, part-of-speech (POS) tagging and parsing to be performed jointly, using an efficient CKY-style or shift-reduce algorithm. Luo (2003) exploited this advantage by adding flat word structures, without manual annotation, to CTB trees, and building a generative character-based parser. Compared to a pipelined system, the advantages of a joint system include the reduction of error propagation and the integration of segmentation, POS tagging and syntax features. With hierarchical structures and head-character information, our annotated words are more informative than flat word structures, and hence can bring further improvements to phrase-structure parsing.

To analyze word structures in addition to phrase structures, our character-based parser performs word segmentation, POS tagging and parsing jointly. Our model is based on the discriminative shift-reduce parser of Zhang and Clark (2009; 2011), which is a state-of-the-art word-based phrase-structure parser for Chinese. We extend their shift-reduce framework, adding more transition actions for word segmentation and POS tagging, and defining novel features that capture character information. Even when trained using character-level syntax trees with flat word structures, our joint parser outperforms a strong pipelined baseline that consists of a state-of-the-art joint segmenter and POS tagger, and our baseline word-based parser. Our word annotations lead to further improvements to the joint system, especially for phrase-structure parsing accuracy.

[Figure 2: Inner word structures of "库存 (repertory)", "考古 (archaeology)", "科技 (science and technology)" and "败类 (degenerate)": (a) subject-predicate; (b) verb-object; (c) coordination; (d) modifier-noun.]

Our parser falls in line with recent work on joint segmentation, POS tagging and parsing (Hatori et al., 2012; Li and Zhou, 2012; Qian and Liu, 2012). Compared with related work, our model gives the best published results for joint segmentation and POS tagging, as well as for joint phrase-structure parsing, on standard CTB5 evaluations. With linear-time complexity, our parser is highly efficient, processing over 30 sentences per second with a beam size of 16. An open release of the parser is freely available at http://sourceforge.net/projects/zpar/, version 0.6.

    2 Word Structures and Syntax Trees

The Chinese language is a character-based language. Unlike in alphabetical languages, Chinese characters convey meanings, and the meaning of most Chinese words takes root in their characters. For example, the word "计算机 (computer)" is composed of the characters "计 (count)", "算 (calculate)" and "机 (machine)". An informal name for "computer" is "电脑", which is composed of "电 (electronic)" and "脑 (brain)".

Chinese words have internal structures (Xue, 2001; Ma et al., 2012). The way characters interact within words can be similar to the way words interact within phrases. Figure 2 shows the structures of the four words "库存 (repertory)", "考古 (archaeology)", "科技 (science and technology)" and "败类 (degenerate)", which demonstrate four typical syntactic structures of two-character words, including subject-predicate, verb-object, coordination and modifier-noun structures. Multi-character words can also have recursive syntactic structures. Figure 3 illustrates the structure of the word "卧虎藏龙 (crouching tiger hidden dragon)", which is composed of the two subwords "卧虎 (crouching tiger)" and "藏龙 (hidden dragon)", both having a modifier-noun structure.

[Figure 3: Character-level word structure of "卧虎藏龙 (crouching tiger hidden dragon)".]

The meaning of characters can be a useful source of information for the computational processing of Chinese, and some recent work has started to exploit this information. Zhang and Clark (2010) found that the first character in a Chinese word is a useful indicator of the word's POS, and made use of this information to help joint word segmentation and POS tagging.

Li (2011) studied the morphological structures of Chinese words, showing that 35% of the words in CTB5 can be treated as having morphemes. Figure 4(a) illustrates the morphological structures of the words "朋友们 (friends)" and "教育界 (educational world)", in which the characters "们 (plural)" and "界 (field)" can be treated as suffix morphemes. The influence of such morphology on Chinese dependency parsing was studied by Li and Zhou (2012).

The aforementioned work explores the influence of particular types of characters on Chinese processing, but not the full potential of complete word structures. We take one step further in this line of work, annotating the full syntactic structures of 37,382 Chinese words in the form of Figure 2 and Figure 3. Our annotation covers the entire vocabulary of CTB5. In addition to the difference in coverage (100% vs. 35%), our annotation is structurally more informative than that of Li (2011), as illustrated in Figure 4(b).

[Figure 4: Comparison between character-level and morphological-level word structures of "朋友们 (friends)" and "教育界 (educational world)": (a) morphological-level word structures, where "f" denotes a special mark for fine-grained words; (b) character-level word structures.]

Our annotations are binarized recursive word structures. For each word or subword, we specify its POS and head direction. We use "l", "r" and "c" to indicate the "left", "right" and "coordination" head directions, respectively. The "coordination" direction is mostly used in coordination structures, while a very small number of transliteration words, such as "奥巴马 (Obama)" and "洛杉矶 (Los Angeles)", have flat structures, and we use "coordination" for their left binarization. For leaf characters, we follow previous work on word segmentation (Xue, 2003; Ng and Low, 2004), and use "b" and "i" to indicate the beginning and non-beginning characters of a word, respectively.
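To make the annotation scheme concrete, the following sketch (illustrative Python; the encoding and names are our own, not the released corpus format) represents the annotated structure of "卧虎藏龙" from Figure 3 and recovers its head character by following the "l"/"r"/"c" directions, taking the left node's head for "c" as in footnote 3 of Section 3.1.

from dataclasses import dataclass
from typing import Optional

@dataclass
class WordNode:
    """A node in an annotated word structure (hypothetical encoding)."""
    pos: str                       # POS tag, e.g. "NN"
    label: str                     # "l"/"r"/"c" for internal nodes, "b"/"i" for leaves
    char: Optional[str] = None     # the character, for leaf nodes only
    left: Optional["WordNode"] = None
    right: Optional["WordNode"] = None

def head_char(node: WordNode) -> str:
    """Follow head directions down to the head character; for "c"
    (coordination) we take the left child, as in footnote 3."""
    if node.char is not None:
        return node.char
    child = node.right if node.label == "r" else node.left
    return head_char(child)

# "卧虎藏龙": two modifier-noun subwords combined by coordination (Figure 3).
wo_hu = WordNode("NN", "r",
                 left=WordNode("NN", "b", char="卧"),
                 right=WordNode("NN", "i", char="虎"))
cang_long = WordNode("NN", "r",
                     left=WordNode("NN", "i", char="藏"),
                     right=WordNode("NN", "i", char="龙"))
word = WordNode("NN", "c", left=wo_hu, right=cang_long)
print(head_char(word))  # 虎: the left subword's head, by the coordination rule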

The vast majority of words do not have structural ambiguities. However, the structure of some words may vary according to their POS. For example, "制服" means "dominate" when it is tagged as a verb, in which case the head is the left character; the same word means "uniform dress" when tagged as a noun, in which case the head is the right character. Thus the input to the word structure annotation is a word together with its POS. The annotation was conducted by three persons, with one person annotating the entire corpus and the other two checking the annotations.

Using our annotations, we can extend CTB-style syntax trees (Figure 1(a)) into character-level trees (Figure 1(b)). In particular, we mark the original nodes that represent POS tags in CTB-style trees with "-t", and insert our word structures as unary subnodes of the "-t" nodes. For the rest of the paper, we refer to the "-t" nodes as full-word nodes, all nodes above full-word nodes as phrase nodes, and all nodes below full-word nodes as subword nodes.
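As an illustration of this transformation, the following sketch (hypothetical Python over nested-tuple trees, not the authors' released conversion tools) marks each POS node with "-t" and hangs the annotated word structure beneath it.

def to_character_level(tree, word_structs):
    """Replace each (POS, word) leaf by a 'POS-t' full-word node whose
    single child is the annotated structure of (word, POS)."""
    label, children = tree
    if isinstance(children, str):            # preterminal: (POS, word)
        return (label + "-t", [word_structs[(children, label)]])
    return (label, [to_character_level(c, word_structs) for c in children])

# e.g. the NR word "中国" from Figure 1(b):
structs = {("中国", "NR"): ("NR-r", [("NR-b", "中"), ("NR-i", "国")])}
ctb = ("NP", [("NR", "中国")])
print(to_character_level(ctb, structs))
# ('NP', [('NR-t', [('NR-r', [('NR-b', '中'), ('NR-i', '国')])])])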

Our character-level trees contain additional syntactic information that is potentially useful for Chinese processing. For example, the head characters of words can be propagated up to phrase-level nodes, and serve as an additional source of information that is less sparse than head words. In this paper, we build a parser that yields character-level trees from raw character sequences. In addition, we use this parser to study the effects of our annotations on character-based statistical Chinese parsing, showing that they are useful in improving parsing accuracy.

    3 Character-based Chinese Parsing

To produce character-level trees for Chinese NLP tasks, we develop a character-based parsing model that jointly performs word segmentation, POS tagging and phrase-structure parsing. To our knowledge, this is the first transition-based system that jointly performs these three tasks. Trained using annotated word structures, our parser also analyzes the internal structures of Chinese words.

Our character-based Chinese parsing model is based on the work of Zhang and Clark (2009), a transition-based model for lexicalized constituent parsing. They use a beam-search decoder so that the transition action sequence can be globally optimized. The averaged perceptron with early update (Collins and Roark, 2004) is used to train the model parameters. Their transition system contains four kinds of actions: (1) SHIFT, (2) REDUCE-UNARY, (3) REDUCE-BINARY and (4) TERMINATE. The system produces binarized CFG trees in Chomsky normal form, and they present a reversible conversion procedure to map arbitrary CFG trees into binarized trees.

In this work, we remain consistent with their work, using the head-finding rules of Zhang and Clark (2008) and the same binarization algorithm.¹ We apply the same beam-search algorithm for decoding, and employ the averaged perceptron with early update to train our model.

¹We use a left-binarization process for flat word structures that contain more than two characters.
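Concretely, beam-search decoding with early-update perceptron training can be sketched as follows (an illustrative toy, not the authors' implementation: states are abstracted to action prefixes, legal-action checks are omitted, and the feature function is a stand-in).

from typing import Dict, List, Tuple

def feats(prefix: List[str], act: str) -> List[str]:
    """Toy feature function: previous action conjoined with current action."""
    prev = prefix[-1] if prefix else "<s>"
    return [f"prev={prev}&act={act}", f"act={act}"]

def decode_with_early_update(action_space, gold, weights, beam_size=4):
    """Beam search over action sequences; during training, stop as soon as
    the gold prefix falls out of the beam (early update, Collins and
    Roark, 2004), returning the gold and predicted prefixes to update on."""
    beam: List[Tuple[float, List[str]]] = [(0.0, [])]
    for i in range(len(gold)):
        cands = []
        for score, prefix in beam:
            for act in action_space:
                s = score + sum(weights.get(f, 0.0) for f in feats(prefix, act))
                cands.append((s, prefix + [act]))
        beam = sorted(cands, key=lambda c: -c[0])[:beam_size]
        if not any(prefix == gold[:i + 1] for _, prefix in beam):
            return gold[:i + 1], beam[0][1]      # early update point
    return gold, beam[0][1]

def perceptron_update(weights: Dict[str, float], gold_seq, pred_seq):
    """Standard perceptron update on the feature difference."""
    for seq, delta in ((gold_seq, 1.0), (pred_seq, -1.0)):
        for i, act in enumerate(seq):
            for f in feats(seq[:i], act):
                weights[f] = weights.get(f, 0.0) + delta

weights: Dict[str, float] = {}
gold = ["SHIFT-SEPARATE(NR)", "SHIFT-APPEND", "REDUCE-WORD"]
space = sorted(set(gold) | {"REDUCE-UNARY(NP)"})
for _ in range(5):                               # a few training passes
    gold_prefix, pred_prefix = decode_with_early_update(space, gold, weights)
    if gold_prefix != pred_prefix:
        perceptron_update(weights, gold_prefix, pred_prefix)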

We make two extensions to their work to enable joint segmentation, POS tagging and phrase-structure parsing from the character level. First, we modify the actions of the transition system for parsing the inner structures of words. Second, we extend the feature set for our parsing problem.

[Figure 5: A state in a transition-based model: a stack S = (..., S2, S1, S0) holding partial parse trees, with children S1l, S1r and S0l, S0r shown for S1 and S0, and a queue Q = (Q0, Q1, ...) of unprocessed characters.]

3.1 The Transition System

In a transition-based system, an input sentence is processed in a linear left-to-right pass, and the output is constructed by a state-transition process. We learn a model for scoring the transition Ai from one state STi to the next STi+1. As shown in Figure 5, a state ST consists of a stack S and a queue Q, where S = (···, S1, S0) contains partially constructed parse trees, and Q = (Q0, Q1, ···, Qn−j) = (cj, cj+1, ···, cn) is the sequence of input characters that have not been processed. The candidate transition action A at each step is defined as follows:

• SHIFT-SEPARATE(t): remove the head character cj from Q, pushing a subword node S′/cj² onto S, assigning S′.t = t. Note that the parse tree S0 must correspond to a full-word or a phrase node, and the character cj is the first character of the next word. The argument t denotes the POS of S′.

• SHIFT-APPEND: remove the head character cj from Q, pushing a subword node S′/cj onto S. cj will eventually be combined with all the subword nodes on top of S to form a word, and thus we must have S′.t = S0.t.

• REDUCE-SUBWORD(d): pop the top two nodes S0 and S1 off S, pushing a new subword node S′/(S1 S0) onto S. The argument d denotes the head direction of S′, of which the value can be "left", "right" or "coordination".³ Both S0 and S1 must be subword nodes, and S′.t = S0.t = S1.t.

²We use this compact notation for a tree node: the symbol before the slash represents the father node, and the symbols after the slash represent its children.

³For the head direction "coordination", we extract the head character from the left node.

Category    Feature templates                                                       When to apply

Structure   S0ntl S0nwl S1ntl S1nwl S2ntl S2nwl S3ntl S3nwl,                        All
features    Q0c Q1c Q2c Q3c Q0c·Q1c Q1c·Q2c Q2c·Q3c,
            S0ltwl S0rtwl S0utwl S1ltwl S1rtwl S1utwl,
            S0nw·S1nw S0nw·S1nl S0nl·S1nw S0nl·S1nl,
            S0nw·Q0c S0nl·Q0c S1nw·Q0c S1nl·Q0c,
            S0nl·S1nl·S2nl S0nw·S1nl·S2nl S0nl·S1nw·S2nl S0nl·S1nl·S2nw,
            S0nw·S1nl·Q0c S0nl·S1nw·Q0c S0nl·S1nl·Q0c,
            S0ncl S0nct S0nctl S1ncl S1nct S1nctl,
            S2ncl S2nct S2nctl S3ncl S3nct S3nctl,
            S0nc·S1nc S0ncl·S1nl S0nl·S1ncl S0ncl·S1ncl,
            S0nc·Q0c S0nl·Q0c S1nc·Q0c S1nl·Q0c,
            S0nc·S1nc·Q0c S0nc·S1nc·Q0c·Q1c
            start(S0w)·start(S1w) start(S0w)·end(S1w),                              REDUCE-SUBWORD
            indict(S1wS0w)·len(S1wS0w) indict(S1wS0w, S0t)·len(S1wS0w)

String      t−1·t0 t−2·t−1·t0 w−1·t0 c0·t0 start(w−1)·t0 c−1·c0·t−1·t0,             SHIFT-SEPARATE,
features    w−1 w−2·w−1 w−1 where len(w−1)=1, end(w−1)·c0,                          REDUCE-WORD
            start(w−1)·len(w−1) end(w−1)·len(w−1) start(w−1)·end(w−1),
            w−1·c0 end(w−2)·w−1 start(w−1)·c0 end(w−2)·end(w−1),
            w−1·len(w−2) w−2·len(w−1) w−1·t−1 w−1·t−2 w−1·t−1·c0,
            w−1·t−1·end(w−2) c−2·c−1·c0·t−1 where len(w−1)=1, end(w−1)·t−1,
            c·t−1·end(w−1) where c ∈ w−1 and c ≠ end(w−1)
            c0·t−1 c−1·c0 start(w−1)·c0·t−1 c−1·c0·t−1                              SHIFT-APPEND

Table 1: Feature templates for the character-level parser. The functions start(·), end(·) and len(·) denote the first character, the last character and the length of a word, respectively.

• REDUCE-WORD: pop the top node S0 off S, pushing a full-word node S′/S0 onto S. This reduce action generates a full-word node from S0, which must be a subword node.

• REDUCE-BINARY(d, l): pop the top two nodes S0 and S1 off S, pushing a binary phrase node S′/(S1 S0) onto S. The argument l denotes the constituent label of S′, and the argument d specifies the lexical head direction of S′, which can be either "left" or "right". Both S0 and S1 must be a full-word node or a phrase node.

• REDUCE-UNARY(l): pop the top node S0 off S, pushing a unary phrase node S′/S0 onto S. l denotes the constituent label of S′.

    • TERMINATE: mark parsing complete.

Compared to the set of actions in our baseline transition-based phrase-structure parser, we have made three major changes. First, we split the original SHIFT action into SHIFT-SEPARATE(t) and SHIFT-APPEND, which jointly perform the word segmentation and POS tagging tasks. Second, we add an extra REDUCE-SUBWORD(d) operation, which is used for parsing the inner structures of words. Third, we add REDUCE-WORD, which applies a unary rule to mark a completed subword node as a full-word node; the new node corresponds to a unary "-t" node in Figure 1(b). The sketch below illustrates the resulting action set.
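The following schematic sketch (illustrative Python with a deliberately simplified state; precondition checks and scoring are omitted, and this is not the released implementation) shows how the word-level actions build the unary "-t" structure of Figure 1(b).

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    pos: str                        # POS tag of the (sub)word, e.g. "NR"
    kind: str                       # "subword" | "fullword" | "phrase"
    label: str                      # e.g. "NR-b", "NR-r", "NR-t", "NP"
    children: tuple = ()
    char: Optional[str] = None      # set for leaf characters only

@dataclass
class State:
    stack: List[Node] = field(default_factory=list)
    queue: List[str] = field(default_factory=list)

def shift_separate(st: State, t: str) -> None:
    """Start a new word: consume one character as a "b" subword with POS t."""
    st.stack.append(Node(t, "subword", t + "-b", char=st.queue.pop(0)))

def shift_append(st: State) -> None:
    """Continue the current word: the new "i" subword inherits S0's POS."""
    pos = st.stack[-1].pos
    st.stack.append(Node(pos, "subword", pos + "-i", char=st.queue.pop(0)))

def reduce_subword(st: State, d: str) -> None:
    """Combine the top two subwords; d is "l", "r" or "c"."""
    s0, s1 = st.stack.pop(), st.stack.pop()
    assert s0.kind == s1.kind == "subword" and s0.pos == s1.pos
    st.stack.append(Node(s1.pos, "subword", s1.pos + "-" + d, (s1, s0)))

def reduce_word(st: State) -> None:
    """Mark a completed subword as a full word (a unary "-t" node)."""
    s0 = st.stack.pop()
    assert s0.kind == "subword"
    st.stack.append(Node(s0.pos, "fullword", s0.pos + "-t", (s0,)))

# Segment, tag and reduce "中国" into an NR full-word node:
st = State(queue=list("中国建筑业呈现新格局"))
shift_separate(st, "NR")            # 中 -> NR-b
shift_append(st)                    # 国 -> NR-i
reduce_subword(st, "r")             # (中 国) -> NR-r
reduce_word(st)                     # -> NR-t
print([n.label for n in st.stack], "".join(st.queue))
# ['NR-t'] 建筑业呈现新格局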

    3.2 Features

Table 1 shows the feature templates of our model. The feature set consists of two categories: (1) structure features, which encode the structural information of subwords, full words and phrases; (2) string features, which encode information about neighboring characters and words.

For the structure features, the symbols S0, S1, S2 and S3 represent the top four nodes on the stack; Q0, Q1, Q2 and Q3 denote the first four characters in the queue; S0l, S0r and S0u represent the left child and right child of a binary branching S0 and the single child of a unary branching S0, respectively; S1l, S1r and S1u represent the corresponding children of S1; n represents the type of a node, a binary value that indicates whether the node is a subword node; c, w, t and l represent the head character, word (or subword), POS tag and constituent label of a node, respectively. The structure features are mostly taken from the work of Zhang and Clark (2009). The head-character feature templates are novel, and are designed to encode head character information. In particular, the indict function indicates whether a word is in a tag dictionary, which is collected by extracting all multi-character subwords that occur more than five times in the training corpus.

For the string features, c0, c−1 and c−2 represent the current character and its previous two characters, respectively; w−1 and w−2 represent the two words preceding the current character, respectively; t0, t−1 and t−2 represent the POS tags of the current word and the previous two words, respectively. The string features are used for word segmentation and POS tagging, and are adapted from a state-of-the-art joint segmentation and tagging model (Zhang and Clark, 2010).

In summary, our character-based parser contains the word-based features of the constituent parser of Zhang and Clark (2009), the word-based and shallow character-based features of the joint word segmentation and POS tagging model of Zhang and Clark (2010), and additionally the deep character-based features that encode word structure information, which are first presented in this paper. The sketch below shows how a few of the string features could be instantiated.
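The following is a hypothetical sketch of some Table 1 string features fired at a SHIFT-SEPARATE step; the concrete feature-string encoding is our own, not the parser's internal format.

def shift_separate_features(w1, t1, w2, t2, c0, t0):
    """A few of the Table 1 string features for a SHIFT-SEPARATE(t0) step;
    w1/t1 and w2/t2 are the previous two words and their POS tags, and c0
    is the current character (the first character of the new word)."""
    feats = [
        f"t-1.t0={t1}.{t0}",                        # t−1 · t0
        f"t-2.t-1.t0={t2}.{t1}.{t0}",               # t−2 · t−1 · t0
        f"w-1.t0={w1}.{t0}",                        # w−1 · t0
        f"c0.t0={c0}.{t0}",                         # c0 · t0
        f"start(w-1).t0={w1[0]}.{t0}",              # start(w−1) · t0
        f"c-1.c0.t-1.t0={w1[-1]}.{c0}.{t1}.{t0}",   # c−1 · c0 · t−1 · t0
        f"w-1={w1}",                                # w−1
        f"w-2.w-1={w2}.{w1}",                       # w−2 · w−1
        f"end(w-1).c0={w1[-1]}.{c0}",               # end(w−1) · c0
    ]
    if len(w1) == 1:
        feats.append(f"w-1(len=1)={w1}")            # w−1, where len(w−1) = 1
    return feats

# Starting the word "呈现" (VV) after segmenting "中国/NR 建筑业/NN":
print(shift_separate_features("建筑业", "NN", "中国", "NR", "呈", "VV"))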

    4 Experiments

    4.1 Setting

We conduct our experiments on the CTB5 corpus, using the standard data split, with sections 1–270, 400–931 and 1001–1151 for training, sections 301–325 for system development, and sections 271–300 for testing. We apply the same preprocessing step as Harper and Huang (2011), so that unary chains with non-terminal yields are collapsed into single unary rules.

Since our model jointly performs word segmentation, POS tagging and phrase-structure parsing, we evaluate it on the three tasks, respectively. For word segmentation and POS tagging, the standard metrics of word precision, recall and F-score are used, where the tagging accuracy is the joint accuracy of word segmentation and POS tagging. For phrase-structure parsing, we use the standard parseval evaluation metrics of bracketing precision, recall and F-score. As our constituent trees are based on characters, we follow previous work and redefine the boundary of a constituent span by its start and end characters. In addition, we evaluate the performance on word structures, using the word precision, recall and F-score metrics. A word structure is correct only if the word and its internal structure are both correct. The sketch below illustrates the character-based redefinition of bracket spans.
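The character-based evaluation can be sketched as follows (illustrative Python: a label-multiset parseval over character offsets, not the exact evaluation script used in the paper).

from collections import Counter

def char_spans(tree, start=0):
    """Return (label, char_start, char_end) spans for each constituent;
    a tree is (label, children), where a child is a subtree or a word."""
    label, children = tree
    spans, offset = [], start
    for child in children:
        if isinstance(child, str):              # a word leaf
            offset += len(child)
        else:
            sub = char_spans(child, offset)
            spans.extend(sub)
            offset = sub[0][2]                  # end offset of that child
    spans.insert(0, (label, start, offset))
    return spans

def prf(gold, pred):
    """Precision, recall and F-score over span multisets."""
    g, p = Counter(gold), Counter(pred)
    correct = sum((g & p).values())
    prec, rec = correct / sum(p.values()), correct / sum(g.values())
    return prec, rec, 2 * prec * rec / (prec + rec)

# A segmentation error ("中国建" / "筑业") changes the character spans of
# the preterminals but not of the NP, so only the NP bracket is correct:
gold = char_spans(("NP", [("NR", ["中国"]), ("NN", ["建筑业"])]))
pred = char_spans(("NP", [("NR", ["中国建"]), ("NN", ["筑业"])]))
print(prf(gold, pred))                          # (0.333..., 0.333..., 0.333...)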

    4.2 Development Results

[Figure 6: Accuracies against the training epoch for (a) joint segmentation and POS tagging and (b) joint constituent parsing, with beam sizes 1, 4, 16 and 64.]

Figure 6 shows the accuracies of our model with different beam sizes, plotted against the training epoch. The performance of our model increases with the beam size, although the gains become smaller as the beam grows larger. Tested using gcc 4.7.2 and Fedora 17 on an Intel Core i5-3470 CPU (3.20GHz), the decoding speeds are 318.2, 98.0, 30.3 and 7.9 sentences per second with beam sizes 1, 4, 16 and 64, respectively. Based on this experiment, we set the beam size to 64 for the rest of our experiments.

The character-level parsing model has the advantage that deep character information can be extracted as features for parsing. For example, the head character of a word is exploited in our model. We conduct feature ablation experiments to evaluate the effectiveness of these features, and find that parsing accuracy decreases by about 0.6% when the head-character-related features (the novel feature templates in Table 1) are removed, which demonstrates their usefulness.

    4.3 Final Results

In this section, we present the final results of our model, and compare it to two baseline systems: a pipelined system, and a joint system that is trained with automatically generated flat word structures.

The baseline pipelined system consists of the joint segmentation and tagging model proposed by Zhang and Clark (2010) and the phrase-structure parsing model of Zhang and Clark (2009). Both models give state-of-the-art performance, and are freely available.⁴ The model for joint segmentation and POS tagging is trained with a beam of 16, since this achieves the best performance. The phrase-structure parsing model is trained with a beam of 64. We train the parsing model on POS tags generated automatically by 10-way jackknifing, which gives an increase of about 1.5% in parsing accuracy when testing on automatically segmented and POS-tagged input.

⁴http://sourceforge.net/projects/zpar/, version 0.5.

                       Task    P      R      F
Pipeline               Seg     97.35  98.02  97.69
                       Tag     93.51  94.15  93.83
                       Parse   81.58  82.95  82.26
Flat word              Seg     97.32  98.13  97.73
structures             Tag     94.09  94.88  94.48
                       Parse   83.39  83.84  83.61
Annotated              Seg     97.49  98.18  97.84
word structures        Tag     94.46  95.14  94.80
                       Parse   84.42  84.43  84.43
                       WS      94.02  94.69  94.35

Table 2: Final results on the test corpus.
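The 10-way jackknifing procedure can be sketched as follows (illustrative Python; train_tagger and tag are hypothetical stand-ins for the joint segmenter and tagger).

def train_tagger(sentences):
    """Hypothetical stand-in: train a joint segmenter and POS tagger."""
    return None

def tag(model, sentence):
    """Hypothetical stand-in: segment and tag one raw sentence."""
    return sentence

def jackknife(sentences, k=10):
    """Tag each of k folds with a model trained on the other k-1 folds,
    so the parser is trained on realistically noisy automatic tags."""
    folds = [sentences[i::k] for i in range(k)]
    auto = []
    for i, held_out in enumerate(folds):
        rest = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train_tagger(rest)
        auto.extend(tag(model, s) for s in held_out)
    return auto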

The joint system trained with flat word structures serves to test the effectiveness of our joint parsing system over the pipelined baseline, since flat word structures do not contain additional sources of information over the baseline. It is also used to test the usefulness of our annotations in improving parsing accuracy.

Table 2 shows the final results of our model and the two baseline systems on the test data. We can see that both character-level joint models outperform the pipelined system; our model with annotated word structures gives an improvement of 0.97% in tagging accuracy and 2.17% in phrase-structure parsing accuracy. The results also demonstrate that the annotated word structures are highly effective for syntactic parsing, giving an absolute improvement of 0.82% in phrase-structure parsing accuracy over the joint model with flat word structures.

Row "WS" in Table 2 shows the accuracy of hierarchical word-structure recovery by our joint system. This figure can be useful for high-level applications that consume the character-level trees produced by our parser, as it reflects the parser's capability in analyzing word structures. In particular, performance on OOV word structures is an important metric for our parser. The recall of OOV word structures is 60.43%; if we exclude the influence of segmentation and tagging errors, counting only the correctly segmented and tagged words, the recall is 87.96%.

    4.4 Comparison with Previous Work

In this section, we compare our model to previous systems on joint word segmentation and POS tagging, and on joint phrase-structure parsing.

Table 3 shows the results. Kruengkrai+ '09 denotes the results of Kruengkrai et al. (2009), a lattice-based joint word segmentation and POS tagging model; Sun '11 denotes a subword-based stacking model for joint segmentation and POS tagging (Sun, 2011), which uses a dictionary of idioms; Wang+ '11 denotes a semi-supervised model proposed by Wang et al. (2011), which additionally uses the Chinese Gigaword Corpus; Li '11 denotes a generative model that performs word segmentation, POS tagging and phrase-structure parsing jointly (Li, 2011); Li+ '12 denotes a unified dependency parsing model that performs joint word segmentation, POS tagging and dependency parsing (Li and Zhou, 2012); Li '11 and Li+ '12 exploit annotated morphological-level word structures for Chinese. Hatori+ '12 denotes an incremental joint model for word segmentation, POS tagging and dependency parsing (Hatori et al., 2012), which uses external dictionary resources including the HowNet Word List and page names from the Chinese Wikipedia; Qian+ '12 denotes a joint segmentation, POS tagging and parsing system using a unified framework for decoding, incorporating a word segmentation model, a POS tagging model and a phrase-structure parsing model (Qian and Liu, 2012); their word segmentation model is a combination of a character-based model and a word-based model. Our model achieves the best performance on both joint segmentation and tagging and joint phrase-structure parsing.

System                 Seg     Tag     Parse
Kruengkrai+ '09        97.87   93.67   –
Sun '11                98.17*  94.02*  –
Wang+ '11              98.11*  94.18*  –
Li '11                 97.3    93.5    79.7
Li+ '12                97.50   93.31   –
Hatori+ '12            98.26*  94.64*  –
Qian+ '12              97.96   93.81   82.85
Ours pipeline          97.69   93.83   82.26
Ours joint flat        97.73   94.48   83.61
Ours joint annotated   97.84   94.80   84.43

Table 3: Comparison of our final model with state-of-the-art systems, where "*" denotes that an external dictionary or corpus has been used.

Our final performance on constituent parsing is by far the best that we are aware of for the Chinese data, and even better than that of some state-of-the-art models that use gold segmentation. For example, the unlexicalized PCFG model of Petrov and Klein (2007) achieves 83.45%⁵ in parsing accuracy on the test corpus, and our pipelined constituent parsing model achieves 83.55% with gold segmentation. Both are lower than the performance of our character-level model, which is 84.43% without gold segmentation. The main difference between word-based and character-level parsing models is that the character-level model can exploit character features. This further demonstrates the effectiveness of characters in Chinese parsing.

⁵We rerun the parser and evaluate it using the publicly available code at http://code.google.com/p/berkeleyparser, since we have a preprocessing step for the CTB5 corpus.

    5 Related Work

Recent work on using the internal structure of words to help Chinese processing gives important motivation to our work. Zhao (2009) studied character-level dependencies for Chinese word segmentation by formalizing the segmentation task in a dependency parsing framework. Their results demonstrate that annotated word dependencies can be helpful for word segmentation. Li (2011) pointed out that a word's internal structure is very important for Chinese NLP. They annotated morphological-level word structures, and proposed a unified generative model to parse Chinese morphological and phrase structures. Li and Zhou (2012) also exploited the morphological-level word structures for Chinese dependency parsing, proposing a unified transition-based model to parse the morphological and dependency structures of a Chinese sentence in a single framework. The morphological-level word structures concern only prefixes and suffixes, which cover only 35% of the words in CTB. According to their results, the final performance of their model on word segmentation and POS tagging is below that of state-of-the-art joint segmentation and POS tagging models. Compared to their work, we consider character-level word structures for Chinese parsing, presenting a unified framework for segmentation, POS tagging and phrase-structure parsing, and achieve improved segmentation and tagging performance.

Our character-level parsing model is inspired by the work of Zhang and Clark (2009), a transition-based model with a beam-search decoder for word-based constituent parsing. Our work builds on the shift-reduce operations of their parser, introducing additional operations for segmentation and POS tagging. With this extension, our model can include all the features of their work, together with features for segmentation and POS tagging. In addition, we propose novel features related to word structures and to interactions between word segmentation, POS tagging and word-based constituent parsing.

Luo (2003) was the first to introduce character-based syntactic parsing, using it as a joint framework to perform Chinese word segmentation, POS tagging and syntactic parsing. They exploit a generative maximum-entropy model for character-based constituent parsing, and find that POS information is very useful for Chinese word segmentation, but that high-level syntactic information seems to have little effect on segmentation. Compared to their work, we use a transition-based discriminative model, which can benefit from large amounts of flexible features. In addition, instead of using flat structures, we manually annotate hierarchical tree structures of Chinese words for converting word-based constituent trees into character-based constituent trees.

Hatori et al. (2012) proposed the first joint model for word segmentation, POS tagging and dependency parsing, using a single transition-based model to perform the three tasks. Their work demonstrates that a joint model can improve the performance of all three tasks, particularly POS tagging and dependency parsing. Qian and Liu (2012) proposed a joint decoder for word segmentation, POS tagging and word-based constituent parsing, although they trained models for the three tasks separately, and reported better performance when using the joint decoder. In our work, we employ a single character-based discriminative model to perform segmentation, POS tagging and phrase-structure parsing jointly, and study the influence of annotated word structures.

    6 Conclusions and Future Work

We studied the internal structures of 37,382 Chinese words, analyzing them as recursive combinations of characters. Using these word structures, we extended the CTB into character-level trees, and developed a character-based parser that builds such trees from raw character sequences. Our character-based parser performs segmentation, POS tagging and parsing simultaneously, and significantly outperforms a pipelined baseline. We make both our annotations and our parser available online.

    In summary, our contributions include:

• We annotated the internal structures of Chinese words, which are potentially useful for character-based studies of Chinese NLP, and extended CTB-style constituent trees into character-level trees using our annotations.

• We developed a character-based parsing model that produces our character-level constituent trees. Our parser jointly performs word segmentation, POS tagging and syntactic parsing.

• We investigated the effectiveness of our joint parser over the pipelined baseline, and the effectiveness of our annotated word structures in improving parsing accuracy.

Future work includes investigating the use of our parser and annotations in further Chinese NLP tasks.

    Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) via grant 61133012, the National "863" Major Projects via grant 2011AA01A207, the National "863" Leading Technology Research Project via grant 2012AA011102, and SRG ISTD 2012 038 from Singapore University of Technology and Design.

References

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 111–118, Barcelona, Spain, July.

Mary Harper and Zhongqiang Huang. 2011. Chinese statistical parsing. Handbook of Natural Language Processing and Machine Translation.

Jun Hatori, Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. 2012. Incremental joint approach to word segmentation, POS tagging, and dependency parsing in Chinese. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1045–1053, Jeju Island, Korea, July.

Canasai Kruengkrai, Kiyotaka Uchimoto, Jun'ichi Kazama, Yiou Wang, Kentaro Torisawa, and Hitoshi Isahara. 2009. An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 513–521, Suntec, Singapore, August.

Zhongguo Li and Guodong Zhou. 2012. Unified dependency parsing of Chinese morphological and syntactic structures. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1445–1454, Jeju Island, Korea, July.

Zhongguo Li. 2011. Parsing the internal structure of words: A new paradigm for Chinese word segmentation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1405–1414, Portland, Oregon, USA, June.

Xiaoqiang Luo. 2003. A maximum entropy Chinese character-based parser. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 192–199.

Jianqiang Ma, Chunyu Kit, and Dale Gerdemann. 2012. Semi-automatic annotation of Chinese word structure. In Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 9–17, Tianjin, China, December.

Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based? In Proceedings of EMNLP 2004, pages 277–284, Barcelona, Spain, July.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404–411, Rochester, New York, April.

Xian Qian and Yang Liu. 2012. Joint Chinese word segmentation, POS tagging and parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 501–511, Jeju Island, Korea, July.

Weiwei Sun. 2011. A stacked sub-word model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1385–1394, Portland, Oregon, USA, June.

David Vadas and James Curran. 2007. Adding noun phrase structure to the Penn Treebank. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 240–247, Prague, Czech Republic, June.

Yiou Wang, Jun'ichi Kazama, Yoshimasa Tsuruoka, Wenliang Chen, Yujie Zhang, and Kentaro Torisawa. 2011. Improving Chinese word segmentation and POS tagging with semi-supervised methods using large auto-analyzed data. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 309–317, Chiang Mai, Thailand, November.

Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Nianwen Xue. 2001. Defining and Automatically Identifying Words in Chinese. Ph.D. thesis, University of Delaware.

Nianwen Xue. 2003. Chinese word segmentation as character tagging. International Journal of Computational Linguistics and Chinese Language Processing, 8(1).

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 562–571, Honolulu, Hawaii, October.

Yue Zhang and Stephen Clark. 2009. Transition-based parsing of the Chinese Treebank using a global discriminative model. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT'09), pages 162–171, Paris, France, October.

Yue Zhang and Stephen Clark. 2010. A fast decoder for joint word segmentation and POS-tagging using a single discriminative model. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 843–852, Cambridge, MA, October.

Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151.
