
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 582–591, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics

Community Answer Summarization for Multi-Sentence Question with Group L1 Regularization

Wen Chan†, Xiangdong Zhou†, Wei Wang†, Tat-Seng Chua‡
†School of Computer Science, Fudan University, Shanghai, 200433, China
{11110240007,xdzhou,weiwang1}@fudan.edu.cn
‡School of Computing, National University of Singapore

Abstract

We present a novel answer summarization method for community Question Answering services (cQAs) to address the problem of "incomplete answer", i.e., the "best answer" of a complex multi-sentence question misses valuable information that is contained in other answers. In order to automatically generate a novel and non-redundant community answer summary, we segment the complex original multi-sentence question into several sub questions and then propose a general Conditional Random Field (CRF) based answer summary method with group L1 regularization. Various textual and non-textual QA features are explored. Specifically, we explore four different types of contextual factors, namely, the information novelty and non-redundancy modeling for local and non-local sentence interactions under question segmentation. To further unleash the potential of the abundant cQA features, we introduce the group L1 regularization for feature learning. Experimental results on a Yahoo! Answers dataset show that our proposed method significantly outperforms state-of-the-art methods on the cQA summarization task.

1 Introduction

Community Question and Answering services (cQAs) have become valuable resources for users to pose questions of their interests and share their knowledge by providing answers to questions. They perform much better than the traditional frequently asked questions (FAQ) systems (Jijkoun and Rijke, 2005; Riezler et al., 2007), which are just based on natural language processing and information retrieval technologies, due to the need for human intelligence in user generated contents (Gyongyi et al., 2007). In cQAs such as Yahoo! Answers, a resolved question often gets more than one answer and a "best answer" will be chosen by the asker or voted by other community participants. This {question, best answer} pair is then stored and indexed for further uses such as question retrieval. This performs very well in simple factoid QA settings, where the answers to factoid questions often relate to a single named entity like a person, time or location. However, when it comes to more sophisticated multi-sentence questions, it suffers from the problem of "incomplete answer". That is, such a question often comprises several sub questions in specific contexts and the asker wishes to get elaborated answers for as many aspects of the question as possible. In this case, the single best answer that covers just one or a few aspects may not be a good choice (Liu et al., 2008; Takechi et al., 2007). Since "everyone knows something" (Adamic et al., 2008), the use of a single best answer often misses valuable human generated information contained in other answers.

In an early study, Liu et al. (2008) reported that no more than 48% of the 400 best answers were indeed the unique best answers in the 4 most popular Yahoo! Answers categories. Table 1 shows an example of the "incomplete answer" problem from Yahoo! Answers¹. The asker wishes to know why his teeth bleed and how to prevent it. However, the best answer only gives information on the reason for the teeth bleeding.

¹ http://answers.yahoo.com/question/index?qid=20100610161858AAmAGrV


It is clear that some valuable information about the reasons for the gums bleeding and some solutions are presented in other answers.

Question

Why do teeth bleed at night and how do you prevent/stop it? This morning I woke up with blood caked between my two front teeth. This is the third morning in a row that it has happened. I brush and floss regularly, and I also eat a balanced, healthy diet. Why is this happening and how do I stop it?

Best Answer - Chosen by Asker

Periodontal disease is a possibility, gingivitis, or some gum infection. Teeth don't bleed; gums bleed.

Other Answers

Vitamin C deficiency!

Ever heard of a dentist? Not all the problems in life are solved on the Internet.

You could be brushing or flossing too hard. Try a brush with softer bristles or brushing/flossing lighter and slower. If this doesn't solve your problem, try seeing a dentist or doctor. Gums that bleed could be a sign of a more serious issue like leukemia, an infection, gum disease, a blood disorder, or a vitamin deficiency.

wash your mouth with warm water and salt, it will help to strengthen your gum and teeth, also salt avoid infection. You probably have weak gums, so just try to follow the advice, it works in many cases of oral problems.

Table 1: An example of a question with the incomplete answer problem from Yahoo! Answers. The "best answer" seems to miss valuable information and will not be ideal for re-use when a similar question is asked again.

In general, as noted in (Jurafsky and Martin, 2009), most interesting questions are not factoid questions. Users' needs require longer, more informative answers than a single phrase. In fact, it is often the case that a complex multi-sentence question could be answered from multiple aspects by different people focusing on different sub questions. Therefore we address the incomplete answer problem by developing a novel summarization technique that takes different sub questions and contexts into consideration. Specifically, we want to learn a concise summary from a set of corresponding answers as a supplement or replacement to the "best answer".

We tackle the answer summary task as a sequential labeling process under the general Conditional Random Fields (CRF) framework: every answer sentence in the question thread is labeled as a summary sentence or non-summary sentence, and we concatenate the sentences with the summary label to form the final summarized answer. The contribution of this paper is two-fold:

First, we present a general CRF based framework and incorporate four different contextual factors based on question segmentation to model the local and non-local semantic sentence interactions to address the problem of redundancy and information novelty. Various textual and non-textual question answering features are exploited in this work.

Second, we propose a group L1-regularization approach in the CRF model for automatic optimal feature learning to unleash the potential of the features and enhance the performance of answer summarization.

We conduct experiments on a Yahoo! Answers dataset. The experimental results show that the proposed model improves the performance significantly (in terms of precision, recall and F1 measures) as well as the ROUGE-1, ROUGE-2 and ROUGE-L measures as compared to state-of-the-art methods, such as Support Vector Machines (SVM), Logistic Regression (LR) and Linear CRF (LCRF) (Shen et al., 2007).

The rest of the paper is arranged as follows: Section 2 presents some definitions and a brief review of related research. In Section 3, we propose the summarization framework and then in Section 4 and 5 we detail the experimental setups and results respectively. We conclude the paper in Section 6.

2 Definitions and Related Work

2.1 Definitions

In this subsection we define some concepts that would be helpful to clarify our problems. First we define a complex multi-sentence question as a question with the following properties:

Definition: A complex multi-sentence question is one that contains multiple sub-questions.

In the cQAs scenario a question often consists of one or more main question sentences accompanied by some context sentences described by the asker. We treat the original question and context as a whole single complex multi-sentence question and obtain the sub questions by question segmentation. We then define the incomplete answer problem as:

Definition: The incomplete answer problem is one where the best answer of a complex multi-sentence question is voted to be below certain star ratings or the average similarity between the best answer and all the sub questions is below some thresholds.


We study the issues of the similarity threshold and the minimal number of stars empirically in the experimental section and show that they are useful in identifying questions with the incomplete answer problem.

2.2 Related Work

There exist several attempts to alleviate the answer completeness problem in cQA. One of them is to segment the multi-sentence question into a set of sub-questions along with their contexts, then sequentially retrieve the sub questions one by one, and return similar questions and their best answers (Wang et al., 2010). This strategy works well in general; however, as the automatic question segmentation is imperfect and the matched similar questions are likely to be generated in different contextual situations, this strategy often cannot combine multiple independent best answers of sub questions seamlessly and may introduce redundancy in the final answer.

On the general problem of cQA answer summarization, Liu et al. (2008) manually classified both questions and answers into different taxonomies and applied clustering algorithms for answer summarization. They utilized textual features for open and opinion type questions. Through exploiting metadata, Tomasoni and Huang (2010) introduced four characteristics (constraints) of summarized answers and combined them in an additive model as well as a multiplicative model. In order to leverage context, Yang et al. (2011) employed a dual wing factor graph to mutually enhance the performance of social document summarization with user generated content like tweets. Wang et al. (2011) learned online discussion structures such as the replying relationship by using general CRFs and presented a detailed description of their feature designs for sites and edges embedded in discussion thread structures. However, there is no previous work that explores complex multi-sentence question segmentation and its contextual modeling for community answer summarization.

Some other works examined the evaluation of the quality of features for answers extracted from cQA services (Jeon et al., 2006; Hong and Davison, 2009; Shah et al., 2010). In the work of Shah et al. (2010), a large number of features extracted for predicting the asker-rated quality of answers were evaluated using a logistic regression model. However, to the best of our knowledge, there is no work on evaluating the quality of features for community answer summarization. In our work we model the feature learning and evaluation problem as a group L1 regularization problem (Schmidt, 2010) over different feature groups.

3 The Summarization Framework

3.1 Conditional Random Fields

We utilize the probabilistic graphical model to solve the answer summarization task. Figure 1 gives some illustrations, in which the sites correspond to the sentences and the edges are utilized to model the interactions between sentences. Specifically, let x be the sentence sequence of all answers within a question thread, and y be the corresponding label sequence. Every component y_i of y has a binary value, with +1 for a summary sentence and -1 otherwise. Then under the CRF (Lafferty et al., 2001), the conditional probability of y given x obeys the following distribution:

$$p(y|x) = \frac{1}{Z(x)} \exp\Big(\sum_{v \in V,\, l} \mu_l\, g_l(v, y|_v, x) + \sum_{e \in E,\, k} \lambda_k\, f_k(e, y|_e, x)\Big), \qquad (1)$$

where Z(x) is the normalization constant called the partition function, g_l denotes the cQA feature function of site l, f_k denotes the function of edge k (modeling the interactions between sentences), μ and λ are respectively the weights of the site and edge functions, and y|_t denotes the components of y related to site (edge) t.
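To make the factorization in Equation 1 concrete, the following minimal sketch (our illustration, not the authors' implementation; the weight and feature bookkeeping is hypothetical) scores a candidate labeling by summing weighted site and edge feature values, and normalizes by brute-force enumeration over all ±1 labelings, which is only feasible for toy-sized threads:

```python
import itertools
import math

def log_score(labeling, site_feats, edge_feats, mu, lam):
    """Unnormalized log-score of Equation 1: weighted site features plus weighted edge features.

    site_feats[i] is a dict {feature_name: value} for sentence i;
    edge_feats[(i, j)] is a function of (y_i, y_j) returning a dict of edge feature values."""
    s = 0.0
    for i, feats in enumerate(site_feats):
        for name, value in feats.items():
            s += mu.get((name, labeling[i]), 0.0) * value   # mu: weights of site functions
    for (i, j), f in edge_feats.items():
        for name, value in f(labeling[i], labeling[j]).items():
            s += lam.get(name, 0.0) * value                  # lambda: weights of edge functions
    return s

def conditional_prob(labeling, site_feats, edge_feats, mu, lam):
    """p(y|x): exponentiate and divide by the partition function Z(x) (brute force)."""
    n = len(site_feats)
    z = sum(math.exp(log_score(list(y), site_feats, edge_feats, mu, lam))
            for y in itertools.product([-1, 1], repeat=n))
    return math.exp(log_score(labeling, site_feats, edge_feats, mu, lam)) / z
```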

3.2 cQA Features and Contextual Modeling

In this section, we give a detailed description of the different sentence-level cQA features and the contextual modeling between sentences used in our model for answer summarization.

Sentence-level Features

Different from the conventional multi-document summarization in which only the textual features are utilized, we also explore a number of non-textual author-related features (Shah et al., 2010) in cQAs.


The textual features used are:

1. Sentence Length: The length of the sentence in the answers with the stop words removed. A long sentence may contain more information.
2. Position: The sentence's position within the answer. If a sentence is at the beginning or at the end of an answer, it might be a generalization or viewpoint sentence and will be given higher weight in the summarization task.
3. Answer Length: The length of the answer to which the sentence belongs, again with the stop words removed.
4. Stopwords Rate: The rate of stop words in the sentence. If a sentence contains too many stop words, it is more likely a spam or chitchat sentence rather than an informative one.
5. Uppercase Rate: The rate of uppercase words. Uppercase words are often people's names, addresses or other named entities of interest to askers.
6. Has Link: Whether the sentence contains a hyperlink or not. The link often points to a more detailed information source.
7. Similarity to Question: Semantic similarity to the question and question context. It captures the semantic relevance to the question and question context.

The non-textual features used include:

8. Best Answer Star: The star rating of the best answer given by the asker or voters.
9. Thumbs Up: The number of thumbs-ups received by the answer which contains the sentence. Users often support an answer by giving it a thumbs-up after reading relevant or interesting information for their intentions.
10. Author Level: The star level of the author who gives the answer sentence. The higher the star level, the more authoritative the answerer is.
11. Best Answer Rate: The rate of answers by the author of the answer sentence that are annotated as the best answer.
12. Total Answer Number: The number of total answers by the author who gives the answer sentence. The more answers one gives, the more experience he or she acquires.
13. Total Points: The total points that the author who gives the answer sentence receives.

The previous literature (Shah et al., 2010) hinted that some cQA features, such as Sentence Length, Has Link and Best Answer Star, may be more important than others. We also expect that some features may be redundant when their most related features are given, e.g., the Author Level feature is positively related with the Total Points received by answerers, and Stopwords Rate is of little help when both Sentence Length (not including stop words) and Uppercase Rate are given. Therefore, to explore the optimal combination of these features, we propose a group L1 regularization term in the general CRF model (Section 3.3) for feature learning.

All features presented here can be extracted automatically from the Yahoo! Answers website. We normalize all these feature values to real numbers between 0 and 1 by dividing them by the corresponding maximal value of these features. These sentence-level features can be easily utilized in the CRF framework. For instance, if the rate of uppercase words is prominent or the position is close to the beginning or end of the answer, then the probability of the label +1 (summary sentence) should be boosted by assigning it with a large value.
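As a rough illustration of how a few of the textual features above could be computed and scaled to [0, 1], consider the sketch below; the stop-word list, the link pattern and the max-value normalization are simplified stand-ins for the authors' actual pipeline:

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}  # toy stop-word list

def sentence_features(sentence, position, answer_length):
    """A subset of the textual features (numbers refer to the list above)."""
    tokens = sentence.split()
    content = [t for t in tokens if t.lower() not in STOPWORDS]
    return {
        "sentence_length": len(content),                                       # feature 1
        "position": position,                                                  # feature 2
        "answer_length": answer_length,                                        # feature 3
        "stopwords_rate": 1 - len(content) / max(len(tokens), 1),              # feature 4
        "uppercase_rate": sum(t[0].isupper() for t in tokens) / max(len(tokens), 1),  # feature 5
        "has_link": float(bool(re.search(r"https?://", sentence))),            # feature 6
    }

def normalize(feature_dicts):
    """Scale every feature to [0, 1] by dividing by its maximum value over the dataset."""
    maxima = {}
    for d in feature_dicts:
        for k, v in d.items():
            maxima[k] = max(maxima.get(k, 0.0), v)
    return [{k: (v / maxima[k] if maxima[k] else 0.0) for k, v in d.items()}
            for d in feature_dicts]
```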

Contextual Modeling Under Question Segmentation

For cQA summarization, the semantic interactions between different sentence sites are crucial; that is, some context co-occurrences should be encouraged and others should be penalized to meet the requirements of information novelty and non-redundancy in the generated summary. Here we consider both local (sentences from the same answer) and global (sentences from different answers) settings. This gives rise to four contextual factors that we will explore for modeling the pairwise semantic interactions based on question segmentation. In this paper, we utilize a simple but effective lightweight question segmentation method (Ding et al., 2008; Wang et al., 2010). It mainly involves the following two steps:

Step 1. Question sentence detection: every sentence in the original multi-sentence question is classified into question sentence and non-question (context) sentence. The question mark and 5W1H features are applied.

Step 2. Context assignment: every context sentence is assigned to the most relevant question sentence. We compute the semantic similarity (Simpson and Crowe, 2005) between sentences or sub questions as:


Figure 1: Four kinds of the contextual factors are considered for answer summarization in our general CRF based models.

$$sim(x, y) = \frac{2 \times \sum_{(w_1, w_2) \in M(x, y)} sim(w_1, w_2)}{|x| + |y|} \qquad (2)$$

where M(x, y) denotes the synset pairs matched in sentences x and y; and the similarity between two synsets w_1 and w_2 is computed to be inversely proportional to the length of the path between them in WordNet.
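A rough sketch of Equation 2 using NLTK's WordNet interface is given below; the greedy one-to-one word matching and the use of path_similarity over only the first few synsets of each word are our assumptions, since the paper does not spell out the exact matching procedure:

```python
from nltk.corpus import wordnet as wn  # requires the WordNet corpus to be downloaded

def word_similarity(w1, w2):
    """Maximum path similarity over the first few synsets of two words (0 if none found)."""
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 0.0
    return max((a.path_similarity(b) or 0.0) for a in s1[:3] for b in s2[:3])

def sentence_similarity(x_tokens, y_tokens):
    """Equation 2: twice the sum of matched-pair similarities over the total length.

    Greedy matching: each word in x is paired with its most similar unused word in y."""
    if not x_tokens and not y_tokens:
        return 0.0
    unused, matched = list(y_tokens), 0.0
    for w1 in x_tokens:
        if not unused:
            break
        best = max(unused, key=lambda w2: word_similarity(w1, w2))
        matched += word_similarity(w1, best)
        unused.remove(best)
    return 2 * matched / (len(x_tokens) + len(y_tokens))
```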

One answer sentence may be related to more than one sub question to some extent. Thus, we define the replied question Q_ri as the sub question with the maximal similarity to sentence x_i: Q_ri = argmax_{Q_j} sim(x_i, Q_j). It is intuitive that different summary sentences aim at answering different sub questions. Therefore, we design the following two contextual factors based on the similarity of replied questions.

Dissimilar Replied Question Factor: Given two answer sentences x_i, x_j and their corresponding replied questions Q_ri, Q_rj. If the similarity² of Q_ri and Q_rj is below some threshold τ_lq, it means that x_i and x_j will present different viewpoints to answer different sub questions. In this case, it is likely that x_i and x_j are both summary sentences; we ensure this by setting the contextual factor cf_1 to a large value of exp(ν), where ν is a positive real constant often assigned the value 1; otherwise we set cf_1 to exp(−ν) for penalization.

$$cf_1 = \begin{cases} \exp(\nu), & y_i = y_j = 1 \\ \exp(-\nu), & \text{otherwise} \end{cases}$$

Similar Replied Question Factor: Given two answer sentences x_i, x_j and their corresponding replied questions Q_ri, Q_rj. If the similarity of Q_ri and Q_rj is above some upper threshold τ_uq, this means that x_i and x_j are very similar and likely to provide similar viewpoints to answer similar questions. In this case, we want to select either x_i or x_j as the answer. This is done by setting the contextual factor cf_2 such that x_i and x_j have opposite labels,

$$cf_2 = \begin{cases} \exp(\nu), & y_i * y_j = -1 \\ \exp(-\nu), & \text{otherwise} \end{cases}$$

² We use the semantic similarity of Equation 2 for all our similarity measurements in this paper.

Assuming that sentence x_i is selected as a summary sentence, and its next local neighborhood sentence x_{i+1} by the same author is dissimilar to it but relevant to the original multi-sentence question, then it is reasonable to also pick x_{i+1} as a summary sentence because it may offer new viewpoints by the author. Meanwhile, other local and non-local sentences which are similar to it above the upper threshold will probably not be selected as summary sentences, as they offer a similar viewpoint as discussed above. Therefore, we propose the following two kinds of contextual factors for selecting the answer sentences in the CRF model.

Local Novelty Factor: If the similarity of answer sentences x_i and x_{i+1} given by the same author is below a lower threshold τ_ls, but their respective similarities to the sub questions both exceed an upper threshold τ_us, then we will boost the probability of selecting both as summary sentences by setting:

$$cf_3 = \begin{cases} \exp(\nu), & y_i = y_{i+1} = 1 \\ \exp(-\nu), & \text{otherwise} \end{cases}$$

Redundancy Factor: If the similarity of answer sentences x_i and x_j is greater than the upper threshold τ_us, then they are likely to be redundant and hence should be given opposite labels. This is done by setting:

$$cf_4 = \begin{cases} \exp(\nu), & y_i * y_j = -1 \\ \exp(-\nu), & \text{otherwise} \end{cases}$$
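Putting the four factors together, a simplified sketch of how the pairwise potentials could be assigned is shown below; the threshold names mirror those in the text with ν = 1, but the exact wiring into the CRF edge features is our own simplification:

```python
import math

NU = 1.0  # the positive real constant nu, often assigned the value 1

def factor_value(encouraged):
    """exp(nu) when the encouraged labeling holds, exp(-nu) otherwise (penalization)."""
    return math.exp(NU) if encouraged else math.exp(-NU)

def contextual_factors(yi, yj, sim_replied_q, sim_sentences, adjacent_same_author,
                       sim_i_to_subq, sim_j_to_subq,
                       tau_lq=0.4, tau_uq=0.8, tau_ls=0.4, tau_us=0.8):
    """Return the factors that fire for a sentence pair (x_i, x_j) labeled (y_i, y_j).

    sim_replied_q: similarity of the two replied sub questions;
    sim_sentences: similarity of the two sentences themselves."""
    factors = {}
    if sim_replied_q < tau_lq:                               # cf1: dissimilar replied questions
        factors["cf1"] = factor_value(yi == yj == 1)
    if sim_replied_q > tau_uq:                               # cf2: similar replied questions
        factors["cf2"] = factor_value(yi * yj == -1)
    if (adjacent_same_author and sim_sentences < tau_ls
            and sim_i_to_subq > tau_us and sim_j_to_subq > tau_us):   # cf3: local novelty
        factors["cf3"] = factor_value(yi == yj == 1)
    if sim_sentences > tau_us:                               # cf4: redundancy
        factors["cf4"] = factor_value(yi * yj == -1)
    return factors
```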

Figure 1 gives an illustration of these four contextual factors in our proposed general CRF based model. The parameter estimation and model inference are discussed in the following subsection.

3.3 Group L1 Regularization for Feature Learning

In the context of the cQA summarization task, some features are intuitively more important than others. As a result, we group the parameters in our CRF model with their related features³ and introduce a group L1-regularization term for selecting the most useful features from the least important ones, where the regularization term becomes,

$$R(\theta) = C \sum_{g=1}^{G} \|\vec{\theta}_g\|_2, \qquad (3)$$

where C controls the penalty magnitude of the parameters, G is the number of feature groups, and θ_g denotes the parameters corresponding to the particular group g. Notice that this penalty term is indeed an L(1,2) regularization, because within every particular group we normalize the parameters in the L2 norm, while the weights of the whole groups are summed in L1 form.

Given a set of training data D = (x^(i), y^(i)), i = 1, ..., N, the parameters θ = (μ_l, λ_k) of the general CRF with the group L1-regularization are estimated using a maximum log likelihood function L as:

$$L = \sum_{i=1}^{N} \log\big(p_\theta(y^{(i)}|x^{(i)})\big) - C \sum_{g=1}^{G} \|\vec{\theta}_g\|_2, \qquad (4)$$

³ We note that every sentence-level feature discussed in Section 3.2 presents a variety of instances (e.g., sentences with longer or shorter lengths are different instances), and we may call them sub-features of the original sentence-level feature in the micro view. Every sub-feature has its corresponding weight in our CRF model, whereas in a macro view, those related sub-features can be considered as a group.

where N denotes the total number of training samples. We compute the log-likelihood gradient component of θ in the first term of Equation 4 as in usual CRFs. However, the second term of Equation 4 is non-differentiable when some particular ||θ_g||_2 becomes exactly zero. To tackle this problem, an additional variable is added for each group (Schmidt, 2010); that is, each norm ||θ_g||_2 is replaced with a variable α_g, subject to the constraint α_g ≥ ||θ_g||_2, i.e.,

$$L = \sum_{i=1}^{N} \log\big(p_\theta(y^{(i)}|x^{(i)})\big) - C \sum_{g=1}^{G} \alpha_g, \qquad \text{subject to } \alpha_g \geq \|\vec{\theta}_g\|_2, \ \forall g. \qquad (5)$$

This formulation transforms the non-differentiable regularizer into a simple linear function, and maximizing Equation 5 will lead to a solution of Equation 4 because it is a lower bound of the latter. Then, we add a sufficiently small positive constant ε when computing the L2 norm (Lee et al., 2006), i.e., $\|\vec{\theta}_g\|_2 = \sqrt{\sum_{j=1}^{|g|} \theta_{gj}^2 + \varepsilon}$, where |g| denotes the number of features in group g. To obtain the optimal value of the parameters θ from the training data, we use an efficient L-BFGS solver to solve the problem, and the first derivative for every feature j in group g is,

$$\frac{\delta L}{\delta \theta_{gj}} = \sum_{i=1}^{N} C_{gj}(y^{(i)}, x^{(i)}) - \sum_{i=1}^{N} \sum_{y} p(y|x^{(i)})\, C_{gj}(y, x^{(i)}) - \frac{2 C \theta_{gj}}{\sqrt{\sum_{l=1}^{|g|} \theta_{gl}^2 + \varepsilon}} \qquad (6)$$

where C_gj(y, x) denotes the count of feature j in group g of the observation-label pair (x, y). The first two terms of Equation 6 measure the difference between the empirical and the model expected values of feature j in group g, while the third term is the derivative of the group L1 prior.
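A sketch of the smoothed group penalty of Equation 3 and its per-group gradient (the regularization part of the derivative handed to the L-BFGS solver) is given below; the dictionary keyed by feature group is our own bookkeeping choice:

```python
import numpy as np

EPS = 1e-4  # the small positive constant epsilon added when computing the norm

def group_l1_penalty(theta_groups, C):
    """Smoothed Equation 3: C * sum_g sqrt(sum_j theta_gj^2 + eps)."""
    return C * sum(np.sqrt(np.sum(t ** 2) + EPS) for t in theta_groups.values())

def group_l1_gradient(theta_groups, C):
    """Gradient of the smoothed penalty w.r.t. each group: C * theta_g / sqrt(sum_j theta_gj^2 + eps)."""
    return {g: C * t / np.sqrt(np.sum(t ** 2) + EPS) for g, t in theta_groups.items()}

# usage sketch with two hypothetical feature groups
theta_groups = {"sentence_length": np.array([0.3, -0.1]), "has_link": np.array([0.7])}
print(group_l1_penalty(theta_groups, C=1.0))
print(group_l1_gradient(theta_groups, C=1.0))
```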

For inference, the labeling sequence can be obtained by maximizing the probability of y conditioned on x,

$$y^* = \arg\max_{y} p_\theta(y|x). \qquad (7)$$

We use a modification of the Viterbi algorithm to perform inference in the CRF with non-local edges, as previously used in (Galley, 2006). That is, we replace the edge connection z_t = (y_{t-2}, y_{t-1}, y_t) of the order-2 Markov model by z_t = (y_{N_t}, y_{t-1}, y_t), where y_{N_t} represents the label at the source of the non-local edge. Although this is an approximation of the exact inference, we will see that it works well for our answer summarization task in the experiments.
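As a rough stand-in for that procedure (not the order-2 modification of Galley (2006) described above), the sketch below approximates decoding with non-local edges by a two-pass scheme: a first standard Viterbi pass over the binary labels fixes provisional labels, and a second pass adds the non-local edge potentials evaluated against those provisional labels:

```python
def viterbi(site, trans, nonlocal_edges=None, provisional=None):
    """Binary-label Viterbi decoding.

    site[t][y] is the site log-score of label y at position t, trans[(yp, y)] is the
    local transition log-score, nonlocal_edges maps t -> source position N_t, and
    provisional supplies fixed labels used to score the non-local edges."""
    labels, n = (-1, 1), len(site)
    dp = [{} for _ in range(n)]
    back = [{} for _ in range(n)]
    for y in labels:
        dp[0][y] = site[0][y] + _nonlocal(0, y, nonlocal_edges, provisional, trans)
    for t in range(1, n):
        for y in labels:
            extra = site[t][y] + _nonlocal(t, y, nonlocal_edges, provisional, trans)
            best_prev = max(labels, key=lambda yp: dp[t - 1][yp] + trans[(yp, y)])
            dp[t][y] = dp[t - 1][best_prev] + trans[(best_prev, y)] + extra
            back[t][y] = best_prev
    y_last = max(labels, key=lambda y: dp[n - 1][y])
    path = [y_last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

def _nonlocal(t, y, nonlocal_edges, provisional, trans):
    """Score of the non-local edge into t, reusing the local transition table for simplicity."""
    if not nonlocal_edges or t not in nonlocal_edges or provisional is None:
        return 0.0
    return trans[(provisional[nonlocal_edges[t]], y)]

# two-pass approximation: decode without non-local edges, then re-decode with them
# first_pass = viterbi(site, trans)
# labels = viterbi(site, trans, nonlocal_edges, provisional=first_pass)
```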

4 Experimental Setting

4.1 Dataset

To evaluate the performance of our CRF based answer summarization model, we conduct experiments on the Yahoo! Answers archives dataset. The Yahoo! Webscope™ Program⁴ opens up a number of Yahoo! Answers datasets for interested academics in different categories. Our original dataset contains 1,300,559 questions and 2,770,896 answers in ten taxonomies from Yahoo! Answers. After filtering out the questions which have less than 5 answers and some trivial factoid questions using the features of (Tomasoni and Huang, 2010), we reduce the dataset to 55,132 questions. From this subset, we next select the questions with incomplete answers as defined in Section 2.1. Specifically, we select the questions where the average similarity between the best answer and all sub questions is less than 0.6 or where the star rating of the best answer is less than 4. We obtain 7,784 questions after this step. To evaluate the effectiveness of this method, we randomly choose 400 questions from the filtered dataset and invite 10 graduate students (not in the NLP research field) to verify whether a question suffers from the incomplete answer problem. We divide the students into five groups of two each. We consider a question an "incomplete answer question" only when it is judged by both members of a group to be the case. As a result, we find that 360 (90%) of these questions indeed suffer from the incomplete answer problem, which indicates that our automatic detection method is effective. These randomly selected 400 questions along with their 2,559 answers are then further manually summarized for evaluation of the answer summaries automatically generated by our model in the experiments.

⁴ http://sandbox.yahoo.com/
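The filtering step just described can be sketched as follows; the thread attributes (.answers, .best_answer, .sub_questions) are hypothetical, and sentence_similarity refers to the Equation 2 sketch in Section 3.2:

```python
def has_incomplete_answer(thread, min_answers=5, sim_threshold=0.6, min_stars=4):
    """Flag a question thread whose best answer looks incomplete (Section 2.1 criteria)."""
    if len(thread.answers) < min_answers:
        return False  # threads with fewer than 5 answers were already filtered out
    sims = [sentence_similarity(thread.best_answer.text.split(), sq.split())
            for sq in thread.sub_questions]
    avg_sim = sum(sims) / len(sims) if sims else 0.0
    return avg_sim < sim_threshold or thread.best_answer.stars < min_stars
```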

4.2 Evaluation Measures

When treating the summarization as a sequential binary classification problem, we can make use of the usual precision, recall and F1 measures (Shen et al., 2007) for classification accuracy evaluation.

In our experiments, we also compare the precision, recall and F1 score in the ROUGE-1, ROUGE-2 and ROUGE-L measures (Lin, 2004) for answer summarization performance.
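For reference, ROUGE-1 against a single reference summary reduces to clipped unigram-overlap counts; a minimal sketch (without the stemming and stop-word options of the official toolkit) is:

```python
from collections import Counter

def rouge_1(candidate_tokens, reference_tokens):
    """ROUGE-1 precision, recall and F1 of a candidate summary against one reference."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```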

5 Experimental Results

5.1 Summarization Results

We adopt the Support Vector Machine (SVM) and Logistic Regression (LR), which have been reported to be effective for classification, and the Linear CRF (LCRF), which is used to summarize ordinary text documents in (Shen et al., 2007), as baselines for comparison. To better illustrate the effectiveness of the question segmentation based contextual factors and the group L1 regularization term, we carry out the tests in the following sequence: (a) we use only the contextual factors cf_3 and cf_4 with the default L2 regularization (gCRF); (b) we add the replied question based factors cf_1 and cf_2 to the model (gCRF-QS); and (c) we replace the default L2 regularization with our proposed group L1 regularization term (gCRF-QS-l1). For the linear CRF system, we use all our textual and non-textual features as well as the local (exact previous and next) neighborhood contextual factors instead of the features of (Shen et al., 2007), for fairness. For the thresholds used in the contextual factors, we enforce τ_lq to be equal to τ_ls and τ_uq to be equal to τ_us to simplify the parameter setting (τ_lq = τ_ls = 0.4, τ_uq = τ_us = 0.8 in our experiments). We randomly divide the dataset into ten subsets (each subset with 40 questions and the associated answers) and conduct a ten-fold cross validation, where in each round nine subsets are used to train the model and the remaining one is used for testing. The precision, recall and F1 measures of these models are presented in Table 2.

Table 2 shows that our general CRF model based on question segmentation with group L1 regularization outperforms the baselines significantly in all three measures (gCRF-QS-l1 is 13.99% better than SVM in precision, 9.77% better in recall and 11.72% better in F1 score). We note that both SVM and LR, which just utilize the independent sentence-level features, do not behave very well here, and there is no statistically significant performance difference between them. We also find that LCRF, which utilizes the local context information between sentences, performs better than the LR method in precision and F1 with statistical significance. When we consider the general local and non-local contextual factors cf_3 and cf_4 for the novelty and non-redundancy constraints, gCRF performs much better than LCRF in all three measures, and we obtain further performance improvement by adding the contextual factors based on QS, especially in the recall measurement. This is mainly because we have divided the question into several sub questions, and the system is able to select more novel sentences than when just treating the original multi-sentence question as a whole. In addition, when we replace the default L2 regularization by the group L1 regularization for more efficient feature weight learning, we obtain a much better performance in precision while not sacrificing the recall measurement statistically.


Model        R1 P    R1 R    R1 F1   R2 P    R2 R    R2 F1   RL P    RL R    RL F1
SVM          79.2%   52.5%   63.1%   71.9%   41.3%   52.4%   67.1%   36.7%   47.4%
LR           75.2%↓  57.4%↑  65.1%↑  66.1%↓  48.5%↑  56.0%↑  61.6%↓  43.2%↑  50.8%↑
LCRF         78.7%-  61.8%↑  69.3%-  71.4%-  54.1%↑  61.6%↑  67.1%-  49.6%↑  57.0%↑
gCRF         81.9%↑  65.2%↑  72.6%↑  76.8%↑  57.3%↑  65.7%↑  73.9%↑  53.5%↑  62.1%↑
gCRF-QS      81.4%-  70.0%↑  75.3%↑  76.2%-  62.4%↑  68.6%↑  73.3%-  58.6%↑  65.1%↑
gCRF-QS-l1   86.6%↑  68.3%-  76.4%↑  82.6%↑  61.5%-  70.5%↑  80.4%↑  58.2%-  67.5%↑

Table 3: The Precision, Recall and F1 of ROUGE-1, ROUGE-2 and ROUGE-L for the baselines SVM, LR, LCRF and our general CRF based models (gCRF, gCRF-QS, gCRF-QS-l1). The down-arrow means performance degradation with statistical significance.

Model        Precision   Recall    F1
SVM          65.93%      61.96%    63.88%
LR           66.92%-     61.31%-   63.99%-
LCRF         69.80%↑     63.91%-   66.73%↑
gCRF         73.77%↑     69.43%↑   71.53%↑
gCRF-QS      74.78%↑     72.51%↑   73.63%↑
gCRF-QS-l1   79.92%↑     71.73%-   75.60%↑

Table 2: The Precision, Recall and F1 measures of the baselines SVM, LR, LCRF and our general CRF based models (gCRF, gCRF-QS, gCRF-QS-l1). The up-arrow denotes a performance improvement compared to the previous method (above) with statistical significance under a p value of 0.05; the short line '-' denotes no difference with statistical significance.


We also compute the Precision, Recall and F1 in ROUGE-1, ROUGE-2 and ROUGE-L measurements, which are widely used to measure the quality of automatic text summarization. The experimental results are listed in Table 3. All results in the table are the average of the ten-fold cross validation experiments on our dataset.

It is observed that our gCRF-QS-l1 model improves the performance in terms of precision, recall and F1 score on all three measurements of ROUGE-1, ROUGE-2 and ROUGE-L by a significant margin compared to other baselines due to the use of local and non-local contextual factors and factors based on QS with group L1 regularization. Since the ROUGE measures care more about the recall and precision of N-grams as well as common substrings to the reference summary rather than the whole sentence, they offer a better measurement in modeling the user's information needs. Therefore, the improvements in these measures are more encouraging than those of the average classification accuracy for answer summarization.

From the viewpoint of the ROUGE measures we observe that our question segmentation method can enhance the recall of the summaries significantly due to the more fine-grained modeling of sub questions. We also find that the precision of the group L1 regularization is much better than that of the default L2 regularization while not hurting the recall significantly. In general, the experimental results show that our proposed method is more effective than other baselines in answer summarization for addressing the incomplete answer problem in cQAs.


Figure 2: The accumulated weight of each site feature group under the group L1-regularization on our Yahoo! Answers dataset. The horizontal axis corresponds to the name of each feature group.

5.2 Evaluation of Feature Learning

For the group L1 regularization term, we set ε = 10⁻⁴ in Equation 6. To see how much the different textual and non-textual features contribute to community answer summarization, the accumulated weight of each group of sentence-level features⁵ is presented in Figure 2. It shows that textual features such as 1 (Sentence Length), 2 (Position), 3 (Answer Length) and 6 (Has Link) and non-textual features such as 8 (Best Answer Star), 12 (Total Answer Number) and 13 (Total Points) have larger weights, and thus play a significant role in the summarization task, as we intuitively expected; features 4 (Stopwords Rate), 5 (Uppercase Rate) and 9 (Thumbs Up) have relatively medium weights; and the other features like 7 (Similarity to Question), 10 (Author Level) and 11 (Best Answer Rate) have the smallest accumulated weights. The main reason that feature 7 (Similarity to Question) has a low contribution is that we have already utilized the similarity to the question in the contextual factors, so this similarity feature at a single site becomes redundant. Similarly, the features Author Level and Best Answer Rate are likely to be redundant when the other non-textual features (Total Answer Number and Total Points) are present together. The experimental results demonstrate that with the use of group L1-regularization we have learned a better combination of these features.

⁵ Note that we have already evaluated the contribution of the contextual factors in Section 5.1.

5.3 An Example of Summarized Answer

To demonstrate the effectiveness of our proposed method, Table 4 shows the generated summary for the example question previously illustrated in Table 1 in the introduction section. The best answer available in the system and the summarized answer generated by our model are compared in Table 4. It is found that the summarized answer contains more valuable information about the original multi-sentence question, as it better answers the reason for the teeth bleeding and offers some solutions for it. Storing and indexing this summarized answer in the question archives should provide a better choice for answer reuse in question retrieval for cQAs.

Question

Why do teeth bleed at night and how do you prevent/stop it? This morning I woke up with blood caked between my two front teeth. [...]

Best Answer - Chosen by Asker

Periodontal disease is a possibility, gingivitis, or some gum infection. Teeth don't bleed; gums bleed.

Summarized Answer Generated by Our Method

Periodontal disease is a possibility, gingivitis, or some gum infection. Teeth don't bleed; gums bleed. Gums that bleed could be a sign of a more serious issue like leukemia, an infection, gum disease, a blood disorder, or a vitamin deficiency. wash your mouth with warm water and salt, it will help to strengthen your gum and teeth, also salt avoid infection.

Table 4: Summarized answer by our general CRF based model for the question in Table 1.

6 Conclusions

We proposed a general CRF based community answer summarization method to deal with the incomplete answer problem for deep understanding of complex multi-sentence questions. Our main contributions are that we proposed a systematic way of modeling the semantic contextual interactions between answer sentences based on question segmentation, and that we explored both textual and non-textual answer features learned via a group L1 regularization. We showed that our method is able to achieve significant improvements in the performance of answer summarization compared to other baselines and previous methods on the Yahoo! Answers dataset. We plan to extend our proposed model with more advanced feature learning as well as to enrich our summarized answer with more available Web resources.

Acknowledgements

This work was supported by the NSFC under Grant No. 61073002 and No. 60773077.

References

L. A. Adamic, J. Zhang, E. Bakshy, and M. S. Ackerman. 2008. Knowledge sharing and Yahoo Answers: everyone knows something. Proceedings of WWW 2008.

Shilin Ding, Gao Cong, Chin-Yew Lin and Xiaoyan Zhu. 2008. Using Conditional Random Fields to Extract Contexts and Answers of Questions from Online Forums. Proceedings of ACL-08: HLT, pages 710–718.

Michel Galley. 2006. A Skip-Chain Conditional Random Field for Ranking Meeting Utterances by Importance. Proceedings of EMNLP 2006.

Z. Gyongyi, G. Koutrika, J. Pedersen, and H. Garcia-Molina. 2007. Questioning Yahoo! Answers. Technical report, Stanford InfoLab.

F. Maxwell Harper, Daphne Raban, Sheizaf Rafaeli, and Joseph A. Konstan. 2008. Predictors of Answer Quality in Online Q&A Sites. Proceedings of CHI 2008.

Liangjie Hong and Brian D. Davison. 2009. A Classification-based Approach to Question Answering in Discussion Boards. Proceedings of the 32nd ACM SIGIR Conference, pages 171–178.

Eduard Hovy, Chin Y. Lin, and Liang Zhou. 2005. A BE-based Multi-document Summarization with Sentence Compression. Proceedings of Multilingual Summarization Evaluation (ACL 2005 workshop).

Jiwoon Jeon, W. Bruce Croft, Joon Ho Lee and Soyeon Park. 2006. A Framework to Predict the Quality of Answers with Non-Textual Features. Proceedings of the 29th ACM SIGIR Conference, pages 228–235.

V. Jijkoun and M. de Rijke. 2005. Retrieving answers from frequently asked questions pages on the web. In CIKM.

Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition. Published by Pearson Education.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th ICML, pages 282–289.

S. Lee, H. Lee, P. Abbeel, and A. Ng. 2006. Efficient L1 Regularized Logistic Regression. In AAAI.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Proceedings of the ACL Workshop, pages 74–81.

Yandong Liu, Jiang Bian, and Eugene Agichtein. 2008. Predicting Information Seeker Satisfaction in Community Question Answering. Proceedings of the 31st ACM SIGIR Conference.

Yuanjie Liu, Shasha Li, Yunbo Cao, Chin-Yew Lin, Dingyi Han, and Yong Yu. 2008. Understanding and summarizing answers in community-based question answering services. Proceedings of the 22nd ICCL, pages 497–504.

S. Riezler, A. Vasserman, I. Tsochantaridis, V. Mittal, and Y. Liu. 2007. Statistical machine translation for query expansion in answer retrieval. Proceedings of the 45th Annual Meeting of ACL.

Mark Schmidt. 2010. Graphical Model Structure Learning with L1-Regularization. Doctoral Thesis.

Chirag Shah and Jefferey Pomerantz. 2010. Evaluating and Predicting Answer Quality in Community QA. Proceedings of the 33rd ACM SIGIR Conference.

Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang and Zheng Chen. 2007. Document Summarization using Conditional Random Fields. Proceedings of the 20th IJCAI.

Troy Simpson and Malcolm Crowe. 2005. WordNet.Net. http://opensource.ebswift.com/WordNet.Net

Mineki Takechi, Takenobu Tokunaga, and Yuji Matsumoto. 2007. Chunking-based Question Type Identification for Multi-Sentence Queries. Proceedings of the SIGIR 2007 Workshop.

Mattia Tomasoni and Minlie Huang. 2010. Metadata-Aware Measures for Answer Summarization in Community Question Answering. Proceedings of the 48th Annual Meeting of ACL, pages 760–769.

Hongning Wang, Chi Wang, ChengXiang Zhai, and Jiawei Han. 2011. Learning Online Discussion Structures by Conditional Random Fields. Proceedings of the 34th ACM SIGIR Conference.

Kai Wang, Zhao-Yan Ming and Tat-Seng Chua. 2009. A Syntactic Tree Matching Approach to Finding Similar Questions in Community-based QA Services. Proceedings of the 32nd ACM SIGIR Conference.

Kai Wang, Zhao-Yan Ming, Xia Hu and Tat-Seng Chua. 2010. Segmentation of Multi-Sentence Questions: Towards Effective Question Retrieval in cQA Services. Proceedings of the 33rd ACM SIGIR Conference, pages 387–394.

X. Xue, J. Jeon, and W. B. Croft. 2008. Retrieval models for question and answers archives. Proceedings of the 31st ACM SIGIR Conference.

Zi Yang, Keke Cai, Jie Tang, Li Zhang, Zhou Su, and Juanzi Li. 2011. Social Context Summarization. Proceedings of the 34th ACM SIGIR Conference.

Liang Zhou, Chin Y. Lin, and Eduard Hovy. 2006. Summarizing answers for complicated questions. Proceedings of the 5th International Conference on LREC, Genoa, Italy.
