
Designing an active learning based system for corpus annotation

Bertjan Busser
Tilburg University
The Netherlands
[email protected]

Roser Morante
Tilburg University
The Netherlands
[email protected]

Resumen: En este artículo revisamos algunos resultados de experimentos realizados con técnicas de Active Learning para establecer las bases del diseño de un sistema de anotación de corpus. A partir de los datos experimentales diseñamos un sistema modular que posibilita un aprendizaje rápido en una primera fase y que permite pasar a una segunda fase de aprendizaje más lento, pero más preciso. El sistema está diseñado para realizar una tarea de anotación de roles semánticos.
Palabras clave: aprendizaje activo, anotación de corpus, etiquetado de roles semánticos

Abstract: In this paper we review some Active Learning experimental results in order to establish the basis for designing an active learning based system for corpus annotation. Based on the experimental data we design a modular system that allows for fast learning initially, but is capable of switching to a slower and more precise learning strategy. The system is designed to perform a semantic role labelling task.
Keywords: active learning, corpus annotation, semantic role labelling

1 Introduction

In this paper we review some experimental evidence, as a starting point to propose a design for an Active Learning (AL) system for automatic Semantic Role Labelling (SRL) on a Spanish corpus.

AL is an appropriate method for annotating large amounts of raw data, capable of functioning within the constraints that a task like labelling a corpus for SRL imposes, such as the limited availability of labelled data and the cost of expert human annotators.

The data we present show that there might be a trade-off between learning faster and learning better. Accordingly, we design a modular system that allows for fast learning initially, but is capable of switching to a slower and more precise learning strategy.

In section 2 we introduce AL. In section 3 we describe the constraints imposed by the task of annotating a large corpus with minimal initial annotation. In section 4 we review some experimental results. Finally, in section 5 we propose a design for a system to annotate a large corpus using AL.

2 Active Learning

AL (Thompson, Califf, and Mooney, 1999) is a technique that uses example selection methods for algorithm training. It can be used with different algorithms. A main characteristic of AL for corpus annotation is that it allows one to start from a small annotated corpus, as opposed to traditional machine learning for natural language processing (NLP), which typically requires large amounts of annotated data to learn a task adequately.

Another important characteristic is that AL allows for human intervention at any time during the learning process, whereas the traditional approach is a batch-type process, which does not allow for influencing the learning process and thus the quality of the annotation.

2.1 General description of Active Learning

AL essentially provides Machine Learning with a feedback loop. The feedback loop is used to control the learning of the main learning algorithm. The two most straightforward applications are semi-automatic selection of examples to label, and reduction (or compression) of the initial dataset.

In both methods, dataset reduction and selection of examples for labelling, a decision about the utility of the current example is made based on the information the learning algorithm has already learned, as well as the characteristics of the current example.

The AL process is an iterative process, which typically starts out by training the learning algorithm on a labelled bootstrap corpus. Ideally, the labelled bootstrap corpus is as small as possible, often only a few examples. After the initial training, the AL process is applied to each of the remaining examples in turn. Ideally, if an example is accepted for further processing, this example is labelled (in the case of selecting examples for labelling), added to the bootstrap corpus, and the learning algorithm is retrained with the enhanced bootstrap corpus before processing further examples. This optimal case is often very computationally intensive. It is therefore more common to retrain after a fixed number of processed or accepted examples.
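As an illustration only, the generic pool-based loop just described could be sketched as below. The classifier interface, the `select` criterion, and the `oracle_label` (human annotation) function are placeholder assumptions, not part of the original paper.

```python
# Minimal sketch of the pool-based Active Learning loop described above.
# `classifier`, `select`, and `oracle_label` are hypothetical stand-ins for
# the learning algorithm, the example selection criterion, and the human
# annotator; they are assumptions for illustration.

def active_learning(classifier, bootstrap, pool, select, oracle_label,
                    batch_size=100, max_iterations=50):
    """Iteratively grow the labelled corpus from a small bootstrap set."""
    labelled = list(bootstrap)          # (features, label) pairs
    classifier.train(labelled)
    for _ in range(max_iterations):
        if not pool:
            break
        # Rank the remaining unlabelled examples with the selection criterion
        # (confidence, committee agreement, random, ...); here a lower score
        # means "more worth labelling".
        ranked = sorted(pool, key=lambda x: select(classifier, x))
        batch, pool = ranked[:batch_size], ranked[batch_size:]
        # A human labels the selected examples; they join the training data.
        labelled.extend((x, oracle_label(x)) for x in batch)
        # Retraining after every single example is the ideal but costly case;
        # retraining once per batch is the common compromise.
        classifier.train(labelled)
    return classifier, labelled
```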

The general AL algorithm is independent of the learning algorithm and example selection criteria used. It can, in principle, be combined with any learning algorithm and selection criterion, and thus allows for maximum optimization for specific tasks.

2.2 Learning strategies

AL can be combined with any learning strategy. It is independent of the learning algorithm(s) used, the parameter settings, the data characteristics (features), and even the number of classifiers. A classifier is a learning algorithm with a specific parameter setting, trained on specific data.

Using more than one classifier, also known as Query By Committee (Melville and Mooney, 2004), is a good way to avoid weaknesses of individual classifiers. A committee usually consists of an odd number of classifiers to avoid tie-breaking problems, and those classifiers should be as diverse as possible.

Another approach that uses multiple classifiers, and which can be combined with AL, is the use of agents. Agents are classifiers for a specific function, such as disambiguating one specific word (Word Sense Disambiguation) (Hendrickx and van den Bosch, 2001).

In choosing a learning algorithm for a practical AL task there are also practical considerations. An algorithm that is capable of incremental learning is preferable for AL, but not essential. An algorithm that needs disproportionately long training times may not be practical.

2.3 Example selection

There are many possible selection criteria, but the actual choice is often limited by the learning algorithm used. For instance, an often mentioned criterion is the confidence score. The confidence score of an example is the probability that the labelling of the current example is correct. However, if the algorithm used does not provide enough information to determine the confidence score, this criterion is not usable.

Another often used criterion is agreement, the proportion or number of individual classifiers that agree on a certain labelling of the current example. This criterion applies to Query By Committee.

Universally applicable criteria are rare;random selection is one.
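For concreteness, the three criteria mentioned in this section could be realized as in the hedged sketch below. The classifier interface (`predict`, `predict_proba`) is an assumed, generic one, not a specific library's API.

```python
import random
from collections import Counter

# Hedged sketches of the selection criteria discussed above; the classifier
# interface (predict / predict_proba) is an assumption for illustration.

def confidence(classifier, example):
    """Confidence score: probability that the predicted labelling is correct."""
    return max(classifier.predict_proba(example))

def committee_agreement(committee, example):
    """Agreement: proportion of committee members voting for the majority label."""
    votes = Counter(member.predict(example) for member in committee)
    return votes.most_common(1)[0][1] / len(committee)

def random_score(_classifier, _example):
    """Random selection: universally applicable, independent of the algorithm."""
    return random.random()
```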

3 Task-based Constraints

Large amounts of raw linguistic data are available. However, these data are not immediately useful for most purposes. Annotated data, on the other hand, are scarce and expensive. Having annotated corpora is essential for all NLP applications. The annotation task that we will focus on is an SRL task, for a large corpus (described in section 3.1). Semantic tagging of corpora is relevant for information extraction, automatic summarization, question-answering systems, and for all applications that need semantic interpretation.

SRL consists of recognizing the arguments of the verbs in a sentence and assigning semantic roles to them. For every verb, all the constituents in the sentence that have a semantic role as argument (Agent, Patient, Instrument, etc.) or as adjunct (Locative, Temporal, Manner, Cause, etc.) are labelled. Most of the existing SRL systems carry out the task in two stages: (1) Recognition of arguments: the sentence is analyzed syntactically in order to define the boundaries of the constituents that will be arguments of the verb. This is how the information to train classifiers is obtained. This stage requires preprocessing the text. (2) Labelling: classifier algorithms are used to assign roles automatically. These algorithms need training, which will be carried out with annotated corpora.
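The two-stage organisation can be pictured as two chained components, as in the sketch below. Both the argument recognizer and the role classifier are hypothetical placeholders; the paper does not prescribe a specific implementation.

```python
# Illustrative sketch of the usual two-stage SRL pipeline described above.
# `recognize_arguments` and `role_classifier` are assumed interfaces.

def label_semantic_roles(sentence, recognize_arguments, role_classifier):
    """Stage 1: find candidate argument constituents per verb.
    Stage 2: assign a semantic role label to each candidate."""
    labelled = []
    for verb, candidates in recognize_arguments(sentence):       # stage 1
        for constituent in candidates:
            role = role_classifier.predict((verb, constituent))  # stage 2
            labelled.append((verb, constituent, role))
    return labelled
```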

The reference for an SRL task is the CoNLL-2004 Shared Task (Carreras and Marquez, 2004), in which different machine learning systems competed to perform a pre-established SRL task on an English corpus. As Carreras and Marquez (2004) put it, for English, different machine learning models have been applied: purely probabilistic models, maximum entropy models, generative models, decision trees, support vector machines, memory-based learning, and voted perceptrons. The results obtained by these models are not accurate enough to apply them to real tasks, like the labelling of a 70 million word corpus. The highest result in the CoNLL-2004 Shared Task was an F1 score of 71.72 on the development set and of 69.49 on the test set, obtained by Hacioglu et al. (2004).

3.1 Corpus description

The corpus to which the techniques will be applied is the EFE corpus of Spanish, of 70,082,709 words, which contains the news items published by the EFE news agency during 1994. The corpus is made available by the TALP research group of the Universitat Politecnica de Catalunya (http://www.talp.upc.es/).

4 Review of experimental results on example selection

In this section we will review experimental results on example selection from various sources. Example selection has been, and is, of serious interest for the machine learning community, since correctly distinguishing relevant material from irrelevant material for learning is essential for effective learning. Consequently, there is a large body of literature related to this topic. We will focus on a few studies which compare selection methods on closely related tasks.

4.1 Data editing

Daelemans and Van den Bosch (2005) experiment with cps, class prediction strength, and random selection in editing data on three different NLP tasks. Data editing in this case is example selection. cps is an information measure, closely related to entropy and information gain. It measures the importance of an individual example, the case, for predicting the right answer in a group of similar examples. The authors apply data editing to three tasks: gplural, generating the German plural from a German singular noun; dimin, generating Dutch diminutives from a Dutch noun; and pp, attachment of a prepositional phrase to either a noun or a verb phrase in English.

The procedure applied is ranking the training data in three different ways: from low to high cps, from high to low cps, and randomly. The learning algorithm, in this case IB1 with MVDM, is then applied to the first 10%, 20%, 30%, ..., 100% of each ranked dataset, and tested against a constant, held-out test set. The results of these series of experiments are displayed in Figure 1.

Figure 1: From Daelemans and Van den Bosch (2005). Generalization accuracies by IB1 on gplural, dimin, and pp with incremental percentages of edited example tokens, according to the cps editing criterion, from high to low cps and vice versa, and using random incremental editing. (Three panels: German plural, Dutch diminutive formation, and English PP attachment; x-axis: % edited training instances; y-axis: accuracy.)

Summarizing these graphs, we note that in all cases random selection gives the steepest learning curves. In these graphs, 100% edited training instances means 0% training data, or, in other words, the beginning of learning. The progress of the learning process should therefore be read from right to left. The light gray line represents random selection, and starts out at the highest level at the extreme right of all graphs. Also, initially, it increases the fastest in all graphs.
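This editing procedure can be sketched as follows, assuming a generic learner with `train` and `accuracy` methods; the cps scores themselves are taken as given and passed in as a scoring function. Nothing here reproduces the exact setup of Daelemans and Van den Bosch (2005).

```python
import random

# Sketch of the incremental data-editing experiment described above. The
# learner interface (train / accuracy) and the scoring function are assumed.

def editing_curve(learner, training_data, test_data, scorer, reverse=False):
    """Rank the training data by a score (e.g. cps), then train on the first
    10%, 20%, ..., 100% of the ranking and record held-out accuracy."""
    ranked = sorted(training_data, key=scorer, reverse=reverse)
    curve = []
    for pct in range(10, 101, 10):
        subset = ranked[: len(ranked) * pct // 100]
        learner.train(subset)
        curve.append((pct, learner.accuracy(test_data)))
    return curve

def random_editing_curve(learner, training_data, test_data):
    """Random-editing baseline: the same procedure with a shuffled ranking."""
    shuffled = random.sample(training_data, len(training_data))
    # A constant key keeps the shuffled order (Python's sort is stable).
    return editing_curve(learner, shuffled, test_data, scorer=lambda x: 0)
```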

4.2 Active Learning topic agents

Figure 2: From Tong and Koller (2001). Average test set accuracy over the ten most frequently occurring topics when using a pool of 1000.

Figure 3: From Tong and Koller (2001). Average test set precision/recall break-even point over the ten most frequently occurring topics when using a pool of 1000.

Tong and Koller (2001) apply AL to training topic agents for automatically classifying the topic of newswire stories from the Reuters corpus. The agents are SVM-based classifiers trained on TFIDF-weighted word frequency vectors, and are binary classifiers (yes/no decision for a particular topic). The selection criteria used are Simple Margin, MaxMin Margin, Ratio Margin, and random selection.

SVMs model the instance space in hyperplanes, where each hyperplane corresponds to a certain decision, in this case a particular topic. The margin of this hyperplane is the distance from the center to the edge of the hyperplane. The center of this hyperplane might colloquially be interpreted as corresponding to the most prototypical instantiation of this topic.

Simple Margin is a selection criterion which minimizes the distance to a center, and therefore selects the examples which most closely correspond to the hypothetical prototype. MaxMin Margin and Ratio Margin both minimize the growth of either of the two opposing margins of the negative and positive hyperplanes. They typically select examples lying between the two hyperplanes in order to more precisely define their borders, albeit by different criteria.
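As a hedged illustration of margin-based querying (not Tong and Koller's exact implementation), the sketch below uses scikit-learn's SVC to pick, from a pool of unlabelled vectors, the example closest to the current decision hyperplane, in the spirit of Simple Margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hedged sketch of Simple Margin selection: query the pool example that lies
# closest to the current SVM hyperplane. This is an illustration only, not
# the implementation used by Tong and Koller (2001).

def simple_margin_query(X_labelled, y_labelled, X_pool):
    svm = SVC(kernel="linear")
    svm.fit(X_labelled, y_labelled)
    # decision_function gives a signed distance to the separating hyperplane
    # (up to scaling); the smallest absolute value is the closest example.
    distances = np.abs(svm.decision_function(X_pool))
    return int(np.argmin(distances))

# Usage, with random data standing in for TF-IDF vectors:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_lab, y_lab = rng.random((20, 5)), rng.integers(0, 2, 20)
    X_pool = rng.random((100, 5))
    print("query index:", simple_margin_query(X_lab, y_lab, X_pool))
```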

Summarizing these graphs, in Figure 2 random selection shows the steepest learning curve initially, when using accuracy as the performance measure. In Figure 3, using a precision/recall break-even point as the performance measure, random selection is the worst selection method.

4.3 Active Learning for reduction of data

In our experiments we have focused on reducing the data set size while keeping performance constant. In Figure 4 we present a graph showing results on Dutch grapheme-phoneme conversion, based on the TreeTalk-D system (Busser, 1998). We have applied various selection criteria to these data, including IB2 and two variants of our own implementation of AL.

IB2 is an ancestor of AL, which attempts data set reduction by first classifying each labelled example and only learning it if it was classified incorrectly. Our baseline variant does the same, but selects each example randomly from the whole set, and removes examples from the data set only when they are used for learning.
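A sketch of this IB2-style reduction is shown below, with a generic incremental learner assumed (`predict` and `learn` are placeholder methods): each labelled example is classified first and only stored when it is misclassified.

```python
# Sketch of IB2-style data set reduction described above: keep an example only
# if the current model misclassifies it. The learner interface is assumed.

def ib2_reduce(learner, labelled_examples):
    kept = []
    for features, label in labelled_examples:
        if learner.predict(features) != label:
            learner.learn(features, label)   # incrementally add the example
            kept.append((features, label))
        # correctly classified examples are discarded, reducing the data set
    return kept
```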

Our random variant is a combination of AL and a form of co-training (Blum and Mitchell, 1998). It uses a manually labelled test set to evaluate the quality of an automatically generated labelling. Like the previous technique, it selects a random example from the data set to learn, but it tests performance before and after learning it on a random test set of 1,000 examples. This random test set is generated separately for each example. If performance increases after learning, the example is removed from the data set; if not, it is unlearned. This method is also much more computationally expensive than the other two, so we were not able to complete the full curve.
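The random variant can be sketched as follows; `learn`, `unlearn`, and `accuracy` are assumed interfaces of an incremental learner, and the per-example random test set of 1,000 items mirrors the setup described above (the treatment of unlearned examples is a simplification).

```python
import random

# Sketch of the random learn/unlearn variant described above. The incremental
# learner interface (learn / unlearn / accuracy) is assumed for illustration.

def random_variant(learner, pool, evaluation_pool, test_size=1000):
    remaining = list(pool)
    while remaining:
        example = remaining.pop(random.randrange(len(remaining)))
        # A fresh random test set is drawn for every candidate example.
        test_set = random.sample(evaluation_pool, test_size)
        before = learner.accuracy(test_set)
        learner.learn(*example)
        after = learner.accuracy(test_set)
        if after <= before:
            learner.unlearn(*example)   # no improvement: undo the update
        # In this sketch the example leaves the pool either way; the original
        # description only states that improving examples are removed.
    return learner
```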

Figure 4: Active Learning for data set reduction in grapheme-phoneme conversion. (Curves shown: the random variant (rBAL), IB2, and the baseline variant.)

Summarizing these results, our random variant has a much more irregular curve than the other two. Based on this graph we would have to conclude that IB2 shows the best result, but the random variant might outperform it given enough resources.

In other, unpublished experiments we applied QBC with various selection criteria such as majority voting, thresholded majority voting, and thresholded entropy, but the results in these experiments were singularly unpromising.

4.4 Discussion

While there are many different selection criteria (we reviewed only a few of these), we propose to draw some conclusions which apply strictly to annotating a large corpus. The constraints in this task make this possible. Generally speaking, there are three constraints from a methodological perspective. The first, and most important, constraint is the limited amount of hand-labelled, “gold standard” data available. A second constraint is that the resulting system should attain reasonable performance levels as quickly as possible, both for practical reasons and for further experimentation. A third constraint is that in the resulting system it should ideally be possible to plug in various learning algorithms to facilitate experimentation; it is very hard to predict which algorithm is optimal for a certain task.

Furthermore, the task at hand, SRL, has two characteristics which function as constraints. SRL is usually approached as a two-phase process, (1) isolating relevant arguments, and (2) categorizing those arguments, which implies that the material to be hand-labelled must be carefully selected, taking into account the learning that will be applied; if agents are selected as the learning strategy, then the hand-labelled corpus has to contain a certain minimum number of hand-labelled examples for each agent to be able to attain statistical usability. A second consideration is that SRL is usually done on a corpus labelled with Part-Of-Speech (POS) tags and constituent tags (chunks). Since this labelling will have to be automated, the system will have to be able to (learn to) deal with errors in the POS tags and chunks.

The limited availability of hand-labelled data rules out bootstrapping the system with a relatively large amount of reliable data in order to attain reasonable performance immediately. Most likely it also rules out large-scale co-training, since this needs reasonable amounts of labelled data to initially train on, and reasonable amounts of labelled data to evaluate the system with. However, it will probably be possible to have one or a small number of small in-the-loop evaluation corpora, and a reasonable held-out test set to evaluate the system with.

We reviewed two sources that showed that random selection facilitates rapid learning of a system, and random selection is also truly independent of the algorithm used. However, based on this literature, caution is necessary, because different measurements lead to different results. Tong and Koller (2001) show that when using an F-score related performance measure, random selection does not necessarily learn either the fastest or the best. This might be caused by the characteristics of their algorithm and selection criteria, or by their task, or both. Informed selection criteria, such as MaxMin Margin or cps, introduce a certain bias to the system, distorting the example space. The effects of this bias may or may not result in a positive effect on performance, depending on a variety of factors, including the algorithm used and the specific task. Therefore it seems advisable to use several performance measures to be able to keep track of these effects.

The utility of examples, or in-the-loop performance, should ideally be measured on an independent corpus and not as internal consistency or inter-concept coherence based on the already learned concept representation, because, at least initially, the already learned concept representation might not be very accurate. Furthermore, an independent in-the-loop evaluation corpus allows for using performance measures that require a minimum number of observations to be statistically significant, such as F-score related measures.

5 A system for corpus annotation

Given the conclusions we reached in the previous section, in this section we propose a modular system taking into account the task characteristics, in particular the small amount of labelled data available. The system will initially be geared towards fast learning of the task, to allow for further experimentation. Switching learning strategy will be possible thanks to the modularity of the system.

5.1 Corpus and general system

To start with, the corpus has to be divided into the following subcorpora (a minimal sketch of this division follows the list below):

• “Gold Standard” corpus: the part of the corpus that is manually annotated. It will have approximately 200,000 words.

– Evaluation Corpus (EC): the part of the corpus which will be used to evaluate the system. In the machine learning literature this would be the held-out test set. We will use a 100,000-word corpus for this purpose.

– In-The-Loop Evaluation Corpus (ITL-EC): the part of the corpus on which the results of every iteration are evaluated, which should also be manually annotated. It does not necessarily have to be very large, but it is desirable that it is a valid random sample of the total example space, to make accurate measurements possible. We might use various randomly selected parts of this corpus for different evaluations in the process. As an upper bound we will use at most 50,000 words for this purpose.

– Seed Corpus (SC): this 50,000-word part of the “Gold Standard” corpus will be used to initially bootstrap the system.

• To-be-annotated corpus (AC): the part of the corpus that will be semiautomatically annotated. It will be divided into subparts of, for example, 1 million words, to facilitate processing.
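A minimal sketch of this division, assuming the corpus is available as a list of sentences (each a list of word tokens), is given below; the sizes follow the figures above, and the helper names are illustrative only.

```python
# Illustrative sketch of the corpus division described above. The word counts
# are those given in the text; splitting by whole sentences and the helper
# names are assumptions for illustration.

EC_WORDS, ITL_EC_WORDS, SEED_WORDS = 100_000, 50_000, 50_000

def take_words(sentences, budget):
    """Take whole sentences until roughly `budget` words are consumed."""
    taken, used = [], 0
    while sentences and used < budget:
        sentence = sentences.pop(0)
        taken.append(sentence)
        used += len(sentence)
    return taken

def split_corpus(sentences):
    sentences = list(sentences)
    ec = take_words(sentences, EC_WORDS)          # held-out evaluation corpus
    itl_ec = take_words(sentences, ITL_EC_WORDS)  # in-the-loop evaluation corpus
    seed = take_words(sentences, SEED_WORDS)      # seed corpus (SC)
    ac = sentences                                # to-be-annotated corpus (AC)
    return ec, itl_ec, seed, ac
```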

Active learning is an iterative process. In the first iteration the classifier is trained with the SC. Next, the classifier classifies a subpart of the AC. As a result, information will be obtained about which examples the system can automatically annotate and which examples the system cannot annotate.

It is necessary to establish a methodology to determine what counts as a classifiable and an unclassifiable example. We will use the In-The-Loop Evaluation Corpus to isolate productive examples, that is, automatically classified examples with which the system classifies more, or at least not fewer, examples in the In-The-Loop Evaluation Corpus correctly; those examples are taken to be correctly classifiable.

The examples that the system cannot classify productively will be annotated by a human and incorporated into the SC. As a check, we will also manually review a small random sample of the examples that the system can classify productively before entering them into the SC for the next iteration.

The EC is used to evaluate whether the classification results improve in every iteration. This might be used as a stopping criterion: if the performance does not increase anymore, the AL process can be stopped. If the performance is acceptable, the system might be switched to fully automatic classification. If it is not acceptable at that point, it is possible to switch strategies, for example by using a different algorithm or selection criterion, and continue the AL process.
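Putting these pieces together, one possible realisation of the annotation loop is sketched below. The classifier, the manual annotation step, and the productivity test are all assumed placeholder interfaces, not a finished design.

```python
# Sketch of one possible realisation of the annotation loop described above.
# classifier, annotate_manually, and is_productive are assumed interfaces.

def annotation_loop(classifier, seed_corpus, ac_subparts, itl_ec, ec,
                    annotate_manually, is_productive, max_iterations=20):
    sc = list(seed_corpus)
    best_score = 0.0
    for subpart in ac_subparts[:max_iterations]:
        classifier.train(sc)
        predictions = [(x, classifier.predict(x)) for x in subpart]
        for example, predicted_label in predictions:
            if is_productive(classifier, example, predicted_label, itl_ec):
                # Productive: accept the automatic label (a small random
                # sample of these is still reviewed manually).
                sc.append((example, predicted_label))
            else:
                # Unproductive: a human annotates the example.
                sc.append((example, annotate_manually(example)))
        score = classifier.evaluate(ec)   # held-out evaluation corpus
        if score <= best_score:
            break                         # stop criterion: no more improvement
        best_score = score
    return classifier, sc
```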

5.2 System details

Initially we will measure accuracy and an F-score related measure. Since not all algorithms support manipulating the decision threshold (see (Tong and Koller, 2001)), we will use either the traditional F-score with β = 1 or the true positive / false positive rate (Fawcett, 2005), which also allows for finding a break-even point between them, as in (Tong and Koller, 2001).
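For reference, these measures can be computed from raw counts as in the short sketch below; these are the standard definitions, not anything specific to the proposed system.

```python
# Standard definitions of the measures mentioned above, from raw counts.

def f_score(tp, fp, fn, beta=1.0):
    """F-score; with beta = 1 it is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def tpr_fpr(tp, fp, fn, tn):
    """True positive rate and false positive rate, as used in ROC analysis."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```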

Initially we will use random selection, due to the indications that it facilitates fast learning as well as being algorithm independent. However, we will do this exclusively on the productive, automatically classified examples, and always select all the manually labelled examples (see section 5.1). If the learning rate of the system drops, according to the evaluation per iteration, we can switch to another criterion, such as a productivity threshold.

All of the knowledge and most of the code necessary to implement such a system using k-NN is already available for Timbl (see http://ilk.uvt.nl). We will therefore start with the IB1 algorithm (pure k-NN) in the Timbl software package. We will explore the possibility of integrating SVMs (Support Vector Machines) in this system early in the project.

In all aspects we will keep the system sufficiently modular so that we can plug in other performance measures, selection criteria, and algorithms as needed.

Acknowledgements

We would like to thank Antal van den Bosch and Walter Daelemans for kindly providing the graphics and data in Figure 1, and Simon Tong and Daphne Koller for the graphics in Figures 2 and 3. We also would like to thank M. Antonia Martí and CLiC for their support with the Spanish data. Finally, we thank three anonymous reviewers for their comments.

References

Blum, A. and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers.

Busser, G.J. 1998. TreeTalk-D: A machine learning approach to Dutch word pronunciation. In P. Sojka, V. Matousek, K. Pala, and I. Kopecek, editors, Proceedings of the TSD Conference, pages 3–8, Czech Republic. Masaryk University.

Carreras, X. and Ll. Marquez. 2004. Introduction to the CoNLL-2004 shared task: Semantic role labeling. In Proceedings of the CoNLL Shared Task 2004, Boston MA, USA.

Daelemans, W. and A. Van den Bosch. 2005. Memory-Based Language Processing. Cambridge University Press, Cambridge, UK.

Fawcett, T. 2005. ROC graphs with instance-varying costs. Submitted to Pattern Recognition Letters, Special Issue on ROC Analysis in Pattern Recognition.

Hacioglu, K., S. Pradhan, J.H. Martin, and D. Jurafsky. 2004. Semantic role labeling by tagging syntactic chunks. In Proceedings of the CoNLL Shared Task 2004, Boston MA, USA.

Hendrickx, I. and A. van den Bosch. 2001. Dutch word sense disambiguation: Data and preliminary results. In Proceedings of Senseval-2, Toulouse.

Melville, P. and R.J. Mooney. 2004. Diverse ensembles for active learning. In Proceedings of the 21st International Conference on Machine Learning, pages 584–591, Banff, Canada.

Thompson, C.A., M.E. Califf, and R.J. Mooney. 1999. Active learning for natural language parsing and information extraction. In Proceedings of the 16th International Conference on Machine Learning, pages 406–414.

Tong, S. and D. Koller. 2001. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66.
