

Page 1: [IEEE Communication Technologies, Research, Innovation, and Vision for the Future (RIVF) - Hanoi, Vietnam (2010.11.1-2010.11.4)] 2010 IEEE RIVF International Conference on Computing

Efficient Syntactic Parsing with Beam Search

Huong Thanh Le, Lam Ba Do School of Information and Communication Technology

Hanoi University of Technology Hanoi, Vietnam

Nhung Thi Pham Advance and Nurture Laboratory

Hanoi, Vietnam

Abstract—Implementing a Vietnamese syntactic parser is a difficult task due to the complexity of the Vietnamese language. Most existing Vietnamese syntactic parsers are limited in the types of sentences they can analyze. This paper introduces a syntactic parser that can analyze a larger range of Vietnamese sentences in a reasonable time. The proposed parser uses a probabilistic Head-Driven Phrase Structure Grammar that enforces syntactic and semantic constraints on words in order to produce accurate results. The parsing algorithm combines an improved version of the CYK algorithm with beam search in order to reduce the search space. In our experiments, the parser achieved a precision of 82% and a recall of 74%.

Keywords-syntactic parsing; HPSG; beam search

I. INTRODUCTION

Syntactic parsing is a basic and important problem in natural language processing (NLP). The purpose of a syntactic parser is to derive the syntactic structure of sentences, in order to understand the text. A good parser can be integrated into NLP applications such as machine translation, text summarization, question answering systems, etc. to produce more accurate results.

There is a large body of research on syntactic parsing, especially for English. English parsers with high accuracy include the Stanford Parser [4], the CMU Link Parser [2], and the Charniak Parser [1]. These parsers have been under development for a long time and are improved continuously.

There are only a few works on Vietnamese syntactic parsing. Most of them use a small set of syntactic rules that does not cover all syntactic structures. As a result, only a few standard sentences can be parsed by these systems. Many syntactically correct sentences used in everyday life (e.g., “Tôi 20 tuổi/I am 20 years old”) cannot be analyzed successfully, and complex, compound and long sentences are not parsed correctly. In addition, these systems often return several syntactic trees for a long input sentence, among which only one or none is correct. This is the ambiguity problem in syntactic parsing.

English parsers cannot be applied directly to Vietnamese, as the two languages have different characteristics. Vietnamese is a monosyllabic language whereas English is a polysyllabic one. Therefore, word segmentation is necessary for Vietnamese but not for English. In addition, word order in Vietnamese differs from that in English. For example, the adjective follows the noun in a Vietnamese noun phrase (e.g., con mèo/cat đen/black), whereas it precedes the noun in an English noun phrase (e.g., black cat).

In this paper, we propose an approach to Vietnamese syntactic parsing that deals with the problems mentioned above, using Head-Driven Phrase Structure Grammar (HPSG) [8]. HPSG allows us to manage syntactic and semantic constraints through syntactic rules, word structures and phrase structures. For example, the word “ăn/eat” is an action verb that requires a Living Thing as the subject of the action. Based on this constraint, the sentence “Cái cốc đang ăn bánh/The cup is eating a cake” is rejected, since the word “cốc/cup” belongs to the object class rather than to the Living Thing class.

HPSG currently receives a lot of attention from the NLP community worldwide. However, only a few Vietnamese studies involve HPSG, such as [5] and [11], and these works have not yet produced a highly accurate parser. The research in [11] focuses on analyzing Vietnamese noun phrases. The analysis capability of the parser in [5] is limited since it uses only a small syntactic rule set (95 rules). This paper improves on [5] by using a larger syntactic rule set and an iterative CYK algorithm combined with beam search.

The rest of this paper is organized as follows. Section II introduces our method of representing words and syntactic rules using HPSG. Improvements to the syntactic parsing algorithm are presented in Section III. Section IV analyzes the ambiguity problem in Vietnamese syntactic parsing and proposes a solution to it. Our experimental results are discussed in Section V. Finally, conclusions and future work are given in Section VI.

II. REPRESENT VIETNAMESE SYNTACTIC RULES USING HPSG

In this section, we describe structures of a word and a syntactic rule for Vietnamese using HPSG.

A. Representing Vietnamese Word Structure using HPSG

HPSG [8] can be regarded as an extension of context-free grammar (CFG) that adds attributes to word structures and constraints to syntactic rules. The parsing process then relies on both the syntactic rules and the syntactic and semantic constraints.

HPSG uses an attribute value matrix (AVM) to represent word information, describing specific characteristics of a word such as its syntactic and semantic information. An AVM representing a word or a phrase can be very complex, as introduced in [9]. In order to reduce the complexity of the parsing algorithm, we design a simpler AVM, which focuses on principles of verb combinations. This is because the verb is the most important component that connects the other parts of the sentence. The AVM of a Vietnamese word is represented in Fig. 1, in which

• Phon: stores the word/phrase.

• Head: keeps information about the word/phrase. Head has three properties: Category (e.g., noun, verb), SubCategory (e.g., intransitive verb, proper noun), and CategoryMeaning. The CategoryMeaning refers to a node in a semantic tree developed by the Vietnam Lexicography Center [12].

• Spr and Comp: each has two properties, SubCategory and CategoryMeaning. Spr (Specifier) controls syntactic and semantic constraints on the previous word/phrase, whereas Comp (Complement) manages those constraints on the next one.

This work was partially funded by the Vietnamese Ministry of Education & Training as part of the project B2009-01-225.

978-1-4244-8075-3/10/$26.00 ©2010 IEEE

(a) The general AVM of a Vietnamese word:

    Phon   <text>
    Head   <Category, SubCategory, CategoryMeaning>
    Spr    <SubCategory, CategoryMeaning>
    Comp   <SubCategory, CategoryMeaning>

(b) The AVM of “ăn/eat” in the sentence “Tôi/I ăn/eat bánh/cake”:

    Phon   ăn (eat)
    Head   <V, Vt, Action>
    Spr    <N, LivingThing>
    Comp   <NP>

Figure 1. The AVM structure of a Vietnamese word

For example, the word “ăn/eat” in Fig. 1b has the constraint “Sub + V + Dob”. The subject (Sub) is a noun (N) with the category meaning “Living thing”. The direct object (Dob) is a noun phrase (NP) with no constraint on category meaning. If a word has no constraint on Spr or Comp, the corresponding value is left empty.
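The AVM and its constraint check can be sketched in Python as follows. This is only an illustration under stated assumptions: the class names `Constraint` and `AVM`, the helper `satisfies`, the toy hierarchy `PARENT`, and the category assigned to “tôi” are hypothetical and do not come from the system described here, whose semantic tree is taken from [12].

```python
from dataclasses import dataclass
from typing import Optional

# Toy semantic hierarchy (child -> parent). Hypothetical stand-in for the tree in [12].
PARENT = {"Person": "LivingThing", "Animal": "LivingThing", "Cup": "Object"}

def is_a(meaning, ancestor):
    """True if `meaning` equals `ancestor` or lies below it in the toy hierarchy."""
    while meaning is not None:
        if meaning == ancestor:
            return True
        meaning = PARENT.get(meaning)
    return False

@dataclass(frozen=True)
class Constraint:
    category: str                  # e.g. "N" or "NP"
    meaning: Optional[str] = None  # None means no semantic restriction

@dataclass(frozen=True)
class AVM:
    phon: str
    category: str
    sub_category: Optional[str] = None
    meaning: Optional[str] = None
    spr: Optional[Constraint] = None   # constraint on the previous word/phrase
    comp: Optional[Constraint] = None  # constraint on the next word/phrase

def satisfies(candidate, constraint):
    """Check a candidate AVM against an Spr or Comp constraint (empty = always ok)."""
    if constraint is None:
        return True
    if candidate.category != constraint.category:
        return False
    return constraint.meaning is None or is_a(candidate.meaning, constraint.meaning)

# The entry of Fig. 1b: subject must be a living-thing noun, complement any NP.
eat = AVM("ăn", "V", "Vt", "Action",
          spr=Constraint("N", "LivingThing"), comp=Constraint("NP"))
me = AVM("tôi", "N", meaning="Person")   # category chosen for illustration only
cup = AVM("cốc", "N", meaning="Cup")

print(satisfies(me, eat.spr), satisfies(cup, eat.spr))
```

Under these assumptions, “tôi” is accepted as the subject of “ăn” while “cốc” is rejected, which mirrors the “cup is eating a cake” example from the introduction.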

B. Syntactic Rule Set for Vietnamese

HPSG integrates syntactic and semantic constraints into syntactic rules. These constraints control the syntactic and semantic relations between words and phrases in a sentence. In our rule set, the left-hand side (LHS) of each syntactic rule is a grammatical unit, whereas the right-hand side (RHS) lists the components that constitute that unit. Each syntactic rule also records which component is the head, since the head carries the most important information of a phrase. The attribute values of the LHS are obtained by unification as follows:

• Phon is the concatenation of the Phon values of the components in the RHS.

• Head.Category is the syntactic category of the created phrase.

• Head.SubCategory is the SubCategory of the head component.

• Head.CategoryMeaning is the CategoryMeaning of the head component.

• If unification based on the Spr or Comp constraint of the head component has been carried out, the corresponding Spr or Comp value in the LHS is left empty. Otherwise, the LHS receives the Spr and Comp values of the head component.

In the Vietnamese word dictionary used in our system, verbs and adjectives carry constraint information (the values of Spr and Comp) about the components they can be unified with. When a verb or adjective is unified with other words/phrases, these constraints are checked. They are called default constraints and do not need to be mentioned explicitly in the syntactic rule.

Example: VP → V N, Head = 1

Head = 1 means that the head component of the created phrase (VP) is the first component (V) of the RHS of the rule.

As mentioned above, the syntactic rule sets of existing Vietnamese syntactic parsers are usually small. In addition, they are built manually, based on human experience and research documents about Vietnamese grammar, so they are subjective and insufficient to cover all Vietnamese syntactic structures. Because of these limitations, we use another method to create the rule set: syntactic rules are extracted automatically from a syntactically annotated corpus by our extraction module. The corpus, named VietTreebank, contains 9633 sentences [10] that were manually annotated by experts in the field. This process yielded 937 rules. In addition to the index of the head component, each rule stores its probability in the corpus, which is used as the score of the rule.
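One plausible way to obtain rule probabilities from a treebank is relative-frequency estimation, sketched below. The text does not spell out the estimator, so the formula count(LHS → RHS) / count(LHS), the function name `rule_probabilities`, and the toy rule occurrences are assumptions made for illustration.

```python
from collections import Counter

def rule_probabilities(rule_occurrences):
    """Estimate P(lhs -> rhs) as count(lhs -> rhs) / count(lhs).

    rule_occurrences: iterable of (lhs, rhs_tuple) pairs, one per use of a
    rule in the annotated trees (assumed to be produced by an extractor).
    """
    rule_counts = Counter(rule_occurrences)
    lhs_counts = Counter()
    for (lhs, _rhs), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Toy data: VP -> V N seen three times, VP -> V NP once.
occurrences = [("VP", ("V", "N"))] * 3 + [("VP", ("V", "NP"))]
probs = rule_probabilities(occurrences)
print(probs[("VP", ("V", "N"))], probs[("VP", ("V", "NP"))])
```

With this estimator the probabilities of all rules sharing one LHS sum to 1, which is what the inside and outside recursions of Section III assume.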

III. IMPROVE THE SYNTACTIC PARSING ALGORITHM

Parsing algorithms can be divided into two main approaches: top-down parsing (e.g., the Earley algorithm) and bottom-up parsing (e.g., the CYK algorithm). A top-down parser starts by expanding the non-terminal symbol representing the sentence into symbols representing phrases or word category labels. It terminates when all word category labels of the input sentence have been derived or when no further expansion is possible. A bottom-up parser works in the opposite direction. Starting with the word category labels of the input sentence, it applies syntactic rules to group consecutive words/phrases into larger grammatical units, until the largest grammatical unit, which is usually a sentence, is obtained.

Do and Le [5] developed a Vietnamese syntactic parser using HPSG and the Earley parsing algorithm. This algorithm is feasible with a small set of rules (95 rules). However, when the syntactic rule set grows to nearly 1000 rules, the approach reveals its weakness: the search space of the Earley algorithm becomes very large. Starting from the non-terminal symbol S, the Earley algorithm applies all rules whose LHS is S (e.g., S → NP VP). The grammatical units in the RHS are, in turn, expanded using all rules whose LHS is one of these units. Our rule set contains 937 rules, of which 234 have the LHS “S”, 259 have the LHS “NP” and 284 have the LHS “VP”. The number of cases in only the first two steps of the Earley algorithm is 234 * 259 * 284 = 17 212 104. Although several techniques have been used to reduce the number of cases considered, the search space is still very large. For this reason, the CYK algorithm is chosen in this research. Since the CYK algorithm matches groups of consecutive grammatical units from the input sentence against the RHS of the rules, the number of applicable rules is much smaller than for the Earley algorithm.

The CYK algorithm requires syntactic rules in Chomsky normal form (at most two units in the RHS). However, the syntactic rules extracted from the training corpus are not in this normal form, so either the rules or the algorithm must be adapted. We do not change the set of syntactic rules, since rules whose RHS has at most two components cannot represent Vietnamese grammar exactly. Instead, we improve the CYK algorithm. When the parser finds a rule whose first two RHS components equal the syntactic category labels of two consecutive grammatical components of the input string, this rule is applied. At the same time, a waiting variable (wait) stores the remaining components of the RHS of the rule. In the next steps, this waiting variable is matched against the remaining part of the input string. A detailed description of this improvement is reported in [3].
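The idea of the waiting variable can be sketched as a small agenda-driven chart parser. This is not the implementation reported in [3]; the item layout `(label, start, end, wait)`, the functions `combine` and `parse`, and the toy rules are assumptions chosen to show how a rule with more than two RHS components is completed step by step. Probabilities and AVM constraints are left out.

```python
def combine(left, right, rules):
    """Yield the items obtained by joining `left` with the adjacent unit `right`."""
    l_label, l_start, l_end, l_wait = left
    r_label, r_start, r_end, r_wait = right
    if l_end != r_start or r_wait:   # must be adjacent; right must be complete
        return
    if l_wait:
        # Virtual node: consume the next awaited RHS component.
        if l_wait[0] == r_label:
            yield (l_label, l_start, r_end, l_wait[1:])
    else:
        # Two complete units: match the first two RHS components of a rule
        # and store the rest of the RHS in wait.
        for lhs, rhs in rules:
            if rhs[:2] == (l_label, r_label):
                yield (lhs, l_start, r_end, rhs[2:])

def parse(tags, rules):
    """Return all chart items for a sequence of word category labels."""
    chart = set()
    agenda = [(tag, i, i + 1, ()) for i, tag in enumerate(tags)]
    while agenda:
        item = agenda.pop()
        if item in chart:
            continue
        for other in list(chart):
            agenda.extend(combine(item, other, rules))
            agenda.extend(combine(other, item, rules))
        chart.add(item)
    return chart

# Toy grammar with one ternary rule, following the S -> NP NP VP example.
rules = [("S", ("NP", "NP", "VP")), ("VP", ("V", "NP"))]
chart = parse(["NP", "NP", "V", "NP"], rules)
print(("S", 0, 2, ("VP",)) in chart, ("S", 0, 4, ()) in chart)
```

The item `("S", 0, 2, ("VP",))` is a virtual node that has seen its first two components and still waits for a VP; it becomes the complete item `("S", 0, 4, ())` once the VP over positions 2 to 4 is found.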

Although the search space of the CYK algorithm with a large set of syntactic rules is much smaller than that of the Earley algorithm, the CYK algorithm still faces a combinatorial explosion. There are many ways of combining consecutive components of a sentence into grammatical units, and in long sentences the number of combinations grows very fast. To solve this problem, the improved CYK algorithm is combined with beam search, which cuts off branches that are unlikely to lead to the correct parse tree.

The beam search algorithm uses a breadth-first approach to develop the parse tree. At each tree level, it generates all successor states and sorts them by their heuristic values. It then keeps only a fixed number of states at each parsing level. This fixed number is called the beam breadth. The smaller the beam breadth, the more states are cut. If the beam breadth is infinite, no state is cut, and beam search is equivalent to breadth-first search. The disadvantage of beam search is that it loses completeness (a syntactic tree may not be derived even though one exists) and optimality (the best result may not be found), because branches leading to the goal state can be cut during parsing. To mitigate this problem, an iterative CYK algorithm is used, which is described in Section III.B. The method of computing the heuristic value of a tree is presented in Section III.A.
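The pruning step of a fixed-breadth beam can be sketched as below. The function name `prune_to_beam` and the `(state, heuristic)` pair layout are hypothetical, and the sketch assumes that a larger heuristic value is better, as is the case for the log-probabilities used later in this section.

```python
import heapq

def prune_to_beam(states, breadth=None):
    """Keep the `breadth` states with the highest heuristic value.

    states: list of (state, heuristic) pairs; breadth=None models an
    infinite beam, which keeps everything (plain breadth-first search).
    """
    if breadth is None:
        return list(states)
    return heapq.nlargest(breadth, states, key=lambda pair: pair[1])

level = [("a", -1.0), ("b", -5.0), ("c", -2.0)]
print(prune_to_beam(level, 2))
```

With a breadth of 2 the state `"b"` is cut; if `"b"` was the only state leading to the correct tree, the parse is lost, which is the incompleteness described above.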

A. Estimate Heuristics Value of the Parse Tree

In our approach, the heuristic value of a tree is calculated from the inside and outside values of the parse tree. These probabilities are calculated by the inductive algorithm introduced in [7]. The method of calculating these values is briefly introduced here.

The inside value $\beta_j(p,q)$ is the total probability of all possible parse trees that cover the word string from $w_p$ to $w_q$ and whose root node is the non-terminal symbol $N^j$. This value is calculated recursively from the bottom up.

Figure 2. The tree corresponding to the rule Nj → Nr Ns

- Base case (for rules of the form $N^j \to w_k$):

$$\beta_j(k,k) = P(w_k \mid N^j_{kk}, G) = P(N^j \to w_k \mid G) \tag{1}$$

In formula (1), G stands for the grammar, and $N^j_{kk}$ means that the word at position k has the syntactic category $N^j$.

- Recursive case (for rules of the form $N^j \to N^r N^s$): the inside value is the sum, over all rules and split points, of the product of the rule's probability and the inside values of the subtrees rooted at $N^r$ and $N^s$.

$$\beta_j(p,q) = P(w_{pq} \mid N^j_{pq}, G) = \sum_{r,s} \sum_{d=p}^{q-1} P(N^j \to N^r N^s)\, \beta_r(p,d)\, \beta_s(d+1,q) \tag{2}$$
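Recursions (1) and (2) can be written down directly for a grammar in Chomsky normal form. The sketch below is a textbook-style illustration and not the parser's code: the function `inside`, the dictionary layout of the grammar, and the three-word toy grammar with its probabilities are assumptions.

```python
def inside(words, lexical, binary):
    """Inside values beta[(j, p, q)] with 1-based, inclusive word positions.

    lexical: {(N_j, word): prob} for rules N_j -> word         (base case, eq. 1)
    binary:  {(N_j, N_r, N_s): prob} for rules N_j -> N_r N_s  (recursion, eq. 2)
    """
    m = len(words)
    beta = {}
    for k, word in enumerate(words, start=1):
        for (j, w), prob in lexical.items():
            if w == word:
                beta[(j, k, k)] = beta.get((j, k, k), 0.0) + prob
    for span in range(2, m + 1):              # shorter spans first
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (j, r, s), prob in binary.items():
                for d in range(p, q):          # split point, d = p .. q-1
                    left = beta.get((r, p, d), 0.0)
                    right = beta.get((s, d + 1, q), 0.0)
                    beta[(j, p, q)] = beta.get((j, p, q), 0.0) + prob * left * right
    return beta

# Hypothetical toy grammar for "tôi ăn bánh"; probabilities are made up.
lexical = {("NP", "tôi"): 0.5, ("NP", "bánh"): 0.5, ("V", "ăn"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
beta = inside(["tôi", "ăn", "bánh"], lexical, binary)
print(beta[("S", 1, 3)])
```

For this toy sentence there is a single parse, so the inside value of S over the whole sentence is simply the product of the five rule probabilities used.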

The outside value $\alpha_j(p,q)$ is the total probability of generating trees that satisfy the following conditions: the beginning part consists of the words $w_1$ to $w_{p-1}$; the middle part is a tree rooted at $N^j_{pq}$; and the end part consists of the words $w_{q+1}$ to $w_m$ ($w_m$ is the last word of the sentence). The outside value is calculated recursively from the top down.

- Base case: the outside value of a tree that covers the whole sentence and whose root is the non-terminal symbol $N^j$:

$$\alpha_j(1,m) = \begin{cases} 1 & \text{if } j = 1 \ (N^1 \text{ is the start symbol, e.g., S}) \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

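- Recursive case: the node $N^j_{pq}$ can be either the left or the right branch of its parent node.

$$\begin{aligned}
\alpha_j(p,q) &= \sum_{f,\, g \neq j} \sum_{e=q+1}^{m} P(w_{1(p-1)}, w_{(q+1)m}, N^f_{pe}, N^j_{pq}, N^g_{(q+1)e}) + \sum_{f,g} \sum_{e=1}^{p-1} P(w_{1(p-1)}, w_{(q+1)m}, N^f_{eq}, N^g_{e(p-1)}, N^j_{pq}) \\
&= \sum_{f,\, g \neq j} \sum_{e=q+1}^{m} \alpha_f(p,e)\, P(N^f \to N^j N^g)\, \beta_g(q+1,e) + \sum_{f,g} \sum_{e=1}^{p-1} \alpha_f(e,q)\, P(N^f \to N^g N^j)\, \beta_g(e,p-1)
\end{aligned} \tag{4}$$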
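Recursions (3) and (4) can be sketched in the same style as the inside computation. The function `outside`, the toy grammar, and the hard-coded inside values `beta` (assumed to have been computed beforehand with recursion (2)) are illustrative assumptions. The toy grammar contains no rule of the form A → B B, so the restriction g ≠ j in (4) has no effect here and is not implemented.

```python
def outside(m, binary, beta, start="S"):
    """Outside values alpha[(j, p, q)] for a sentence of m words (1-based spans).

    binary: {(N_f, left, right): prob}; beta: precomputed inside values.
    """
    alpha = {(start, 1, m): 1.0}                 # base case, eq. (3)
    for span in range(m - 1, 0, -1):             # larger spans (parents) first
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (f, left, right), prob in binary.items():
                # Node is the left child; its sibling `right` covers q+1 .. e.
                for e in range(q + 1, m + 1):
                    add = (alpha.get((f, p, e), 0.0) * prob
                           * beta.get((right, q + 1, e), 0.0))
                    alpha[(left, p, q)] = alpha.get((left, p, q), 0.0) + add
                # Node is the right child; its sibling `left` covers e .. p-1.
                for e in range(1, p):
                    add = (alpha.get((f, e, q), 0.0) * prob
                           * beta.get((left, e, p - 1), 0.0))
                    alpha[(right, p, q)] = alpha.get((right, p, q), 0.0) + add
    return alpha

# Toy grammar and inside values for the three-word sentence "tôi ăn bánh".
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
beta = {("NP", 1, 1): 0.5, ("V", 2, 2): 1.0, ("NP", 3, 3): 0.5,
        ("VP", 2, 3): 0.5, ("S", 1, 3): 0.25}
alpha = outside(3, binary, beta)
print(alpha[("V", 2, 2)] * beta[("V", 2, 2)])
```

For every node of the single parse tree, the product of its outside and inside value equals the sentence probability, which gives a convenient sanity check for an implementation.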


Figure 3. Two possible trees that contain the tree Njpq

The heuristic value of the tree containing node $N^j_{pq}$ is the product of the inside value and the outside value:

$$\alpha_j(p,q)\, \beta_j(p,q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} \mid G) \cdot P(w_{pq} \mid N^j_{pq}, G) = P(w_{1m}, N^j_{pq} \mid G) \tag{5}$$

The improved CYK algorithm faces the following difficulties when calculating the inside and outside values:

1) The inside and outside values of each node are sums of products of rule probabilities. Since every probability is less than or equal to 1, these values are usually very small and approach 0.

2) The classical CYK algorithm never has an incomplete RHS (no wait variable is needed), so calculating the inside value is straightforward. With the improved CYK algorithm, the inside value cannot be calculated in the same way, because some elements of the rule may still be missing.

3) The CYK algorithm parses bottom-up, while the outside value is calculated from the root to the leaves. While parsing is still in progress, the parse tree is incomplete, so the outside value cannot be calculated.

The following solutions are proposed to deal with the above difficulties.

Solving the first difficulty:

The logarithm of the product of probabilities is used instead of the product itself. The inside value of the tree obtained by applying a particular rule is now the sum of the logarithm of the rule's probability and the inside values of the components on the right-hand side of the rule.

Solving the second difficulty:

In the parsing process of the improved CYK algorithm, the inside value can be calculated exactly for rules that have been applied completely (wait = “”). For example, when applying the rule S → NP NP VP, the inside value of the tree is:

inside(S(wait = “”)) = lg(P(S → NP NP VP)) + inside(NP) + inside(NP) + inside(VP) (6)

With an incomplete rule (wait ≠ “”), the system creates a tree whose root is a virtual node (a node that has wait ≠ “”). The inside value of this tree is calculated based on the following observations.

Considering the rule S → NP NP VP, if wait = “VP”, the inside value of the tree obtained by applying this rule is:

inside(S(wait = “VP”)) = lg(P(S → NP NP VP)) + inside(NP) + inside(NP) (7)

Since a product of probabilities is always less than or equal to 1, its logarithm is always less than or equal to 0. Therefore, the inside value of a node is always less than or equal to 0, and we have

lg(P(S → NP NP VP)) + inside(NP) + inside(NP) + inside(VP) ≤ lg(P(S → NP NP VP)) + inside(NP) + inside(NP) (8)

In other words, inside(S(wait = “”)) ≤ inside(S(wait = “VP”)).
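Formulas (6) to (8) can be checked numerically with a few lines of Python. The helper `log_inside` and all numeric values are made up for illustration; lg is taken to be the base-10 logarithm, which matches the powers of 10 used in the outside formulas.

```python
import math

def log_inside(rule_prob, found_children):
    """lg P(rule) plus the log inside values of the RHS components found so far."""
    return math.log10(rule_prob) + sum(found_children)

rule_prob = 0.1                   # assumed P(S -> NP NP VP)
np1, np2, vp = -1.0, -1.5, -2.0   # assumed log inside values, all <= 0

virtual = log_inside(rule_prob, [np1, np2])      # eq. (7), wait = "VP"
real = log_inside(rule_prob, [np1, np2, vp])     # eq. (6), wait = ""
print(virtual, real, real <= virtual)
```

Because every log inside value is at most 0, adding the missing component can only lower the score, so the virtual node's value is an upper bound on the value of the completed node.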

Therefore, if we ignore the inside value of the awaited components that have not yet been found, the inside value of a virtual node is greater than or equal to that of the corresponding real node (the node that has all its components). Consequently, if the inside value of the virtual node already puts the heuristic value below the beam search threshold, the real node would fall below the threshold as well. For this reason, we can use the inside value of the virtual node (based on the components found so far) to prune rules. If the virtual node passes the threshold, the beam search algorithm temporarily keeps it and re-evaluates it once the real node is completed.

Solving the third difficulty:

At the higher levels of the tree being constructed, the components have not yet been established, so the outside value cannot be calculated exactly before parsing has completed. To solve this problem, we do not calculate the outside value from the root node, but from the parent node of the current one, by estimating one step ahead. When parsing row i of the CYK table, we calculate the inside values of the nodes at level i of the tree, from the bottom up. To calculate the outside values, we parse row (i + 1) and calculate the inside values of that row. We can then calculate the outside values of the nodes in row i. The calculation of the outside value is shown in Fig. 4 and Fig. 5.

Figure 4. Simulation of calculating the outside value estimated one step ahead when wait = “” (tree fragment with nodes A, B, D, E, F, G and H, all with wait = “”)

Figure 5. Simulation of calculating the outside value estimated one step ahead when wait ≠ “” (tree fragment with nodes A, B, E, F, G and H with wait = “”, and the virtual node D with wait = “H”)

- If node D is not a virtual node, as in Fig. 4:

outside(D(wait = “”)) = lg[P(A → E D) * 10^inside(E) + P(B → D H) * 10^inside(H)] (9)

- If node D is a virtual node with node H as the missing child, as in Fig. 5:

outside(D(wait = “H”)) = lg(P(A → E D)) + inside(E) (10)

The outside values for rules that have more than three components are calculated by the same method.
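The one-step-ahead estimates (9) and (10) can be sketched as two small functions. The names `outside_complete` and `outside_virtual`, the pair layout of `parents`, and the numbers are assumptions; inside values are base-10 logarithms, so `10 ** s` converts them back to probabilities before they are summed.

```python
import math

def outside_complete(parents):
    """Eq. (9): parents is a list of (rule_prob, sibling_log_inside) pairs,
    one per parent rule in row i+1 that uses the node."""
    return math.log10(sum(prob * 10 ** sibling for prob, sibling in parents))

def outside_virtual(rule_prob, found_sibling_log_inside):
    """Eq. (10): only the parent rule whose sibling has been found contributes."""
    return math.log10(rule_prob) + found_sibling_log_inside

# Node D with parents A -> E D and B -> D H (probabilities are made up).
inside_E, inside_H = -1.0, -1.0
print(outside_complete([(0.5, inside_E), (0.5, inside_H)]))
print(outside_virtual(0.5, inside_E))
```

With a single parent rule, (9) reduces to lg(P) + inside(sibling), which has the same form as (10).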

Since logarithms are used to keep the result from becoming too small, the heuristic value now equals the sum of the inside value and the outside value. For each node in row i, we calculate its heuristic value and compare it to the maximum heuristic value in row i to decide whether to keep the node or remove it from the search space. If a node is removed, then all of its parent nodes in row (i+1) are removed as well.

Because the inside and outside values serve only as heuristic values for identifying promising parse trees during parsing, the modifications to their calculation described above do not affect the accuracy of the program.

B. Solve Limitations of the Beam Search Algorithm

Our program does not use a fixed-width beam. Instead, a threshold θ is used: a state is retained if its heuristic value is lower than the maximum heuristic value by no more than θ. The distances between heuristic values increase during the parsing process. Therefore, in order to keep enough nodes for parsing, each iterative step of the CYK algorithm increases the threshold by a fixed value. According to research by Li and Zong [6], if the level of node X in the tree is called Level(X), then

θ = 2.0 if Level(X) > 8
θ = 6.0 - Level(X)/2 if Level(X) ≤ 8 (11)

Based on this formula, we propose the method of increasing θ shown in the algorithm in Fig. 6, in which (n-i) is understood as the level of the node in the tree.
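Formula (11) is a simple piecewise function of the node level. A minimal sketch, with the hypothetical function name `level_threshold`:

```python
def level_threshold(level):
    """Threshold of formula (11): 6.0 - level/2 up to level 8, then constant 2.0."""
    return 2.0 if level > 8 else 6.0 - level / 2

for level in (0, 4, 8, 12):
    print(level, level_threshold(level))
```

The two branches agree at Level(X) = 8, where both give 2.0, so the threshold shrinks continuously from 6.0 to 2.0 as the level increases.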

A risk of using heuristics to reduce the search space is that the correct parse tree may be pruned even though it exists. In this case, we automatically increase θ and repeat the parsing process by calling the CYK algorithm again. This procedure is called the “iterative CYK algorithm”. Because beam search is a pruning algorithm guided by heuristics, it steers the parsing process toward the part of the search space that contains the true parse. Therefore, the CYK algorithm rarely needs to be repeated. The algorithm is shown below:

1. Init θ = 2, step = 0, maxStep = 3
2. While step < maxStep and the parse tree has not been found, call CYK again to re-parse the sentence with a new threshold θ:
   2.1. θ = θ + step
   2.2. For i = 1 to n-1:
      2.2.1. If (n-i ≥ 8+step) then θ is kept intact; else θ = max(8+step-(n-i)/2, 2)
      2.2.2. Create row i by merging cells in row i-1
      2.2.3. Update the inside value of components in row i-1
      2.2.4. Update the outside value of components in row i-1
      2.2.5. Update the heuristic value of components in row i-1
      2.2.6. max = max(heuristics) in row i-1
      2.2.7. Remove every component of row i-1 for which max - heuristics > θ
   2.3. If parsing is unsuccessful then step = step + 1

Figure 6. The iterative CYK algorithm
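The outer loop of Fig. 6 and the threshold-based pruning of one chart row can be sketched as follows. This is a simplified outline rather than the authors' program: `parse_once` is a hypothetical callback standing in for one full CYK pass, `keep_within_threshold` and `iterative_parse` are made-up names, and the per-row adjustment of θ is omitted.

```python
def keep_within_threshold(row, theta):
    """Keep the (node, heuristic) pairs whose heuristic value lies within
    theta of the best value in the row; remove the rest."""
    if not row:
        return []
    best = max(h for _node, h in row)
    return [(node, h) for node, h in row if best - h <= theta]

def iterative_parse(sentence, parse_once, theta=2.0, max_step=3):
    """Re-run `parse_once(sentence, theta)` with a wider threshold until it
    returns a tree or max_step passes have been made."""
    for step in range(max_step):
        theta += step                 # theta = theta + step, as in Fig. 6
        tree = parse_once(sentence, theta)
        if tree is not None:
            return tree, theta
    return None, theta

# Stub parser that only succeeds once the threshold is wide enough.
def stub(sentence, theta):
    return "tree" if theta >= 5 else None

print(keep_within_threshold([("a", -1.0), ("b", -2.5), ("c", -4.0)], 2.0))
print(iterative_parse("tôi ăn bánh", stub))
```

With the initial values of Fig. 6, the three passes use θ = 2, 3 and 5, so the stub parser succeeds on the last pass.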

IV. AMBIGUITY RESOLUTION

Ambiguity is a problem that frequently occurs in syntactic parsing. A sentence is syntactically ambiguous if it can be derived by different sets of rules. When the syntactic rule set is small, ambiguity occurs less often, but the parser does not have enough capacity to analyze the various types of sentences in natural language. When the rule set is larger, more rules can be applied to analyze one sentence, which makes the ambiguity problem more complex.

To reduce ambiguity in syntactic analysis, two methods are used in our system: (1) using the heuristic value calculated in the beam search algorithm to remove trees with small probability; and (2) using the AVMs of the HPSG grammar to remove analyses that do not satisfy syntactic and semantic constraints.

If only method (2) is used, the processing time is relatively large, because each word can have many different AVM structures, each of which corresponds to one meaning. Using only method (1) is not safe, because the correct analysis tree is not always the most probable tree. Therefore, to select the best output parse tree, we first remove a portion of the analysis trees and retain only trees with acceptable probability. Next, we use the AVMs to check the constraints between words or phrases and keep only the trees that satisfy them.

V. EXPERIMENTS AND EVALUATION The data set in our experiment is 800 sentences taken from

the Vietnamese syntactically-annotated corpus [10]. 262 of which are simple sentences which have only one Subject + Verb structure. The rest are long and complex ones, which contain many Subject + Verb structures or have many verbs. Raw texts in this corpus are collected from the Youth online daily newspaper, with a number of topics including social and politics. To evaluate the system performance, the annotated sentences were compared with the one analyzed by the syntactic parser. The precision, recall and F-score measures are calculated as



$$P = \frac{\text{number of correct constituents derived by system}}{\text{total number of constituents derived by system}} \qquad (12)$$

$$R = \frac{\text{number of correct constituents derived by system}}{\text{total number of constituents annotated by human}} \qquad (13)$$

$$F\text{-}score = \frac{2 \cdot P \cdot R}{P + R} \qquad (14)$$

We compare the system reported in this paper with our previous system in [5], which also uses the HPSG approach. The system in [5] does not use the iterative CYK algorithm with beam search. It works with a set of 95 syntactic rules. When we increase the number of syntactic rules to 937 rules (the number of rules in the current system), the program fails to work because of the large search space.

Table 1 shows the comparison between the system in [5] and the system reported in this paper.

TABLE I. MEASURE RESULTS IN TWO SYSTEMS

| System | P (%) | R (%) | F-score (%) |
|---|---|---|---|
| The parser in [5] | 48 | 44 | 46 |
| The parser reported in this paper | 82 | 74 | 78 |
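The measures in (12)-(14) can be illustrated with a short sketch. It assumes that a constituent is represented as a (label, start, end) triple and counts as correct when the identical triple appears in the human annotation; the paper does not state the exact matching criterion, so this representation and the function names are hypothetical. The last lines check that the F-scores in Table I follow from the reported P and R values.

```python
from typing import Set, Tuple

# Assumed representation: (phrase label, start word index, end word index).
Constituent = Tuple[str, int, int]


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall, as in equation (14)."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def evaluate(system: Set[Constituent], gold: Set[Constituent]):
    """Return (P, R, F-score) following equations (12)-(14)."""
    correct = len(system & gold)  # constituents found in both sets
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    return p, r, f_score(p, r)


# Toy example: 3 of 5 system constituents match the 4 annotated ones.
gold = {("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("AP", 3, 4)}
system = {("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("NP", 1, 2), ("NP", 3, 4)}
p, r, f = evaluate(system, gold)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.60 R=0.75 F=0.67

# Consistency check on Table I (values in percent, rounded to integers).
table_f_previous = round(f_score(48, 44))  # 46
table_f_current = round(f_score(82, 74))   # 78
```

Because the F-score is a harmonic mean, it lies between P and R and is closer to the smaller of the two, which is why the 8-point gap between precision and recall in the current system yields 78 rather than a higher value.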

Table 1 shows that the parser's analysis ability has improved significantly compared to the system in [5]. Most long and complex sentences cannot be processed by the system in [5]. In contrast, the system reported in this paper is able to analyze many different types of sentences found in practice, including complex sentences and long compound ones.

The accuracy of the system is P = 82%, R = 74% and F-score = 78%. The experimental results show that the techniques used in the system are reasonable and effective. The system often returns one syntactic tree for one input sentence. The tree returned by the system is often more accurate than the ones that were removed.

There are some cases in which the system cannot give the correct tree. For example, the sentence “nhiều cánh đồng nước vẫn còn trắng xóa/many fields still have dazzlingly white water” is not analyzed correctly by the system because it treats “cánh đồng/field nước/water” as a noun phrase. This sentence is difficult to analyze on the basis of syntactic rules alone. If a comma is added to the sentence so that it becomes “nhiều cánh đồng, nước vẫn còn trắng xóa/many fields still have dazzlingly white water”, the system is able to analyze it correctly. In addition, for long sentences that satisfy many syntactic combinations, the constraints in the AVMs can only solve part of the problem.

To deal with this problem, future work includes: (i) researching methods to detect phrase boundaries; and (ii) adding syntactic and semantic constraints at a deeper level without greatly reducing the program’s speed.

The closest work to ours for the English language is [14]. The authors of [14] used beam thresholding, unification filtering and hybrid parsing in probabilistic HPSG parsing. The precision and recall of their system are 87.85% and 86.85%, respectively.

We believe that the accuracy of our system can be improved in the future, when our proposed future work is implemented.

VI. CONCLUSION

In this paper, we introduced an approach to constructing a Vietnamese syntactic parser using HPSG to integrate syntactic and semantic information into words, phrases and rules. The system is capable of analyzing many types of Vietnamese sentences. We have proposed an improved CYK algorithm combined with the beam search strategy. The algorithm can process rules that are not in context-free grammar form. In addition, it can remove branches that have a low potential of leading to correct trees, based on an estimate of each tree’s probability. The ambiguity problem is handled by using probability estimation and constraints on the properties of verbs and adjectives. The experimental results with 800 sentences from online newspapers show a system accuracy of P = 82%, R = 74% and F-score = 78%.

In the future, we will continue to investigate methods to improve the speed and the accuracy of the program, as well as its ability to handle complex sentences. The proposed methods include detecting phrase boundaries and controlling syntactic and semantic constraints at a deeper level. We would also like to carry out experiments with a larger corpus to obtain a more objective evaluation of the system.

REFERENCES

[1] E. Charniak, “A Maximum-Entropy-Inspired Parser”, Proceedings of NAACL, 2000.

[2] D. Grinberg, J. Lafferty and D. Sleator, “A robust parsing algorithm for link grammars”, Proceedings of the Fourth International Workshop on Parsing Technologies, Prague, 1995.

[3] T.H. Le, H.Q. Pham, T.T. Nguyen, “An approach to automatically syntactic parsing in Vietnamese”, Journal of Informatics and Cybernetics, Vol 15, No. 4, 2000.

[4] D. Klein and C.D. Manning, “Fast Exact Inference with a Factored Model for Natural Language Parsing”, In Advances in Neural Information Processing Systems 15, Cambridge, MA: MIT Press, 2003.

[5] B.L. Do and T.H. Le, “Building Vietnamese syntactic parsing sytem using HPSG”, Proceeding of ICT.rda, Hanoi, 2008.

[6] X. Li and C. Zong, “An effective framework for Chinese Syntactic Parsing”, Proceedings of the International Conference on Signal Processing, Turkey, 2004, pp.276-279.

[7] C. D. Manning and H. Schutze, “Foundations of Statistical Natural Language Processing”, The MIT Press, 1999.

[8] C.J. Pollard and I. Sag, “Head-Driven Phrase Structure Grammar”, CSLI Publications/Cambridge University Press, 1994.

[9] R. Susanne, “The HPSG Formalism, Stanford University”, DOI=http://www-csli.stanford.edu/~sag/L221a/hand2-formal.pdf, 1995.

[10] P.T. Nguyen, X.L.Vu, T.M.H. Nguyen, V.H. Nguyen, H.P. Le, “Building a Large Syntactically-Annotated Corpus of Vietnamese”, Proceedings of the 3rd Linguistic Annotation Workshop (LAW), ACL-IJCNLP, 2009.

[11] N.T. Tran, T.T. Phan, “Parsing Vietnamese noun phrase using unification grammar”, Journal of Post and Telecommunications and Information Technology, 2006.

[12] Vietlex Semantic Tree, DOI=http://www.vietlex.com/resources/semanticTree.html, 2009.