
The Annual Belgian-Dutch Machine Learning Conference

19-20 May 2008

Spa, Belgium

Editors
Louis Wehenkel
Pierre Geurts
Raphaël Marée


Benelearn is the annual machine learning conference of Belgium and The Netherlands. It serves as a forum for researchers to exchange ideas, present recent work, and foster collaboration in the broad field of Machine Learning and its applications. Benelearn 2008 is organised by the Systems and Modeling and Bioinformatics and Modeling research units of the Department of Electrical Engineering and Computer Science and GIGA-Research of the University of Liege. The conference takes place in the Solcress seminar center, within walking distance of the center of Spa, located in the Belgian Ardennes.

Conference Chair

Louis Wehenkel, Pierre Geurts, Raphael Maree
University of Liege, Belgium

Organizing Support

Michele Delville, Celine Dizier
Association des Ingenieurs de Montefiore (A.I.M.)

Programme Committee

H. Blockeel, K.U. Leuven
G. Bontempi, U.L. Bruxelles
W. Daelemans, Universiteit Antwerpen
P. Dupont, U.C. Louvain
D. Ernst, Universite de Liege
A. Feelders, Universiteit Utrecht
P. Geurts, Universite de Liege
B. Goethals, Univ. Antwerpen
T. Heskes, Radboud Univ. Nijmegen
B. Kappen, Radboud Univ. Nijmegen
J. Kok, Universiteit Leiden
B. Manderick, Vrije Universiteit Brussel
R. Maree, Universite de Liege
B. de Moor, K.U. Leuven
M. van Otterlo, Universiteit Twente
J. Piater, Universite de Liege
L. de Raedt, K.U. Leuven
J. Ramon, K.U. Leuven
Y. Saeys, Universiteit Gent
R. Sepulchre, Universite de Liege
M. van Someren, University of Amsterdam
A. Siebes, Universiteit Utrecht
Y. Smirnov, Universiteit Maastricht
K. Van Steen, Universite de Liege
J. Suykens, K.U. Leuven
A. van den Bosch, Universiteit Tilburg
K. VanHoof, Universiteit Hasselt
M. Verleysen, U.C. Louvain
P. Vitanyi, Centrum voor Wiskunde en Informatica
L. Wehenkel, Universite de Liege


Schedule

Monday, May 19th

9h15-9h30 Welcome

9h30-10h30 Invited Talk (Chair: Damien Ernst)

Susan Murphy, Machine Learning and Reinforcement Learning in Clinical Research p11

10h30-11h00 Coffee break

11h00-12h40 Session 1

Session 1.1 Reinforcement learning, planning, and games (Chair: Susan Murphy)

11h00-11h20 Tom Croonenborghs, Kurt Driessens, and Maurice Bruynooghe, Learning a transfer function for reinforcement learning problems p15

11h20-11h40 Boris Defourny, Damien Ernst, and Louis Wehenkel, Perturb and combine in sequential decision making under uncertainty p17

11h40-12h00 Raphael Fonteneau, Louis Wehenkel, and Damien Ernst, Variable selection for dynamic treatment regimes: a reinforcement learning approach p19

12h00-12h20 Jan Lemeire, An alternative approach for playing complex games like chess p21

Session 1.2 Graphical and relational models (Chair: Kristel Van Steen)

11h00-11h20 Luc De Raedt, ProbLog p23

11h20-11h40 Bernd Gutmann, Angelika Kimmig, Luc De Raedt, and Kristian Kersting, Parameter learning in probabilistic databases: a least squares approach p25

11h40-12h00 Ingo Thon, Niels Landwehr, and Luc De Raedt, CPT-L: an efficient model for relational stochastic processes p27

12h00-12h20 Vincent Auvray and Louis Wehenkel, Learning inclusion-optimal chordal graphs p29

12h20-12h40 Sourour Ammar, Philippe Leray, Boris Defourny, and Louis Wehenkel, Density estimation with ensembles of randomized poly-trees p31

12h40-14h00 Lunch break

14h00-15h00 Invited Talk (Chair: Raphael Maree)

Bill Triggs, Scene segmentation with latent topic Markov field models - and - Classification and dimensionality reduction using convex class models p11

15h00-15h30 Coffee break

15h30-17h30 Session 2

Session 2.1 Vision and speech (Chair: Bill Triggs)

15h30-15h50 Fabien Scalzo, Georgios Bebis, Mircea Nicolescu, and Leandro Loss, Evolutionary learning of feature fusion hierarchies p33

15h50-16h10 Cedric Simon, Jerome Meessen, and Christophe De Vleeschouwer, Using decision trees to build an event recognition framework for automated visual surveillance p35


16h10-16h30 Raphael Maree, Pierre Geurts, and Louis Wehenkel, Content-based image retrieval by indexing random subwindows with randomized trees p37

16h30-16h50 Herman Stehouwer, IGForest: From tree to forest p39

16h50-17h10 Marie Dumont, Raphael Maree, Pierre Geurts, and Louis Wehenkel, Fast image annotation with random subwindows p41

17h10-17h30 Thomas Drugman, Alexis Moinet, and Thierry Dutoit, On the use of machine learning in statistical parametric speech synthesis p43

Session 2.2 Feature selection and active learning (Chair: Yvan Saeys)

15h30-15h50 Yvan Saeys, Thomas Abeel, and Yves Van de Peer, Towards robust feature selection techniques p45

15h50-16h10 Marieke van Erp, Antal van den Bosch, Piroska Lendvai, and Steve Hunt, Feature selection techniques for database cleansing: knowledge-driven vs greedy search p47

16h10-16h30 Van Anh Huynh-Thu, Louis Wehenkel, and Pierre Geurts, Deriving p-values for tree-based variable importance measures p49

16h30-16h50 Robby Goetschalckx, Scott Sanner, and Kurt Driessens, Linear regression using costly features p51

16h50-17h10 Dirk Gorissen, Tom Dhaene, and Eric Laermans, Automatic regression modeling with active learning p53

17h10-17h30 Kurt De Grave, Jan Ramon, and Luc De Raedt, Active learning for primary drug screening p55

Evening Conference dinner


Tuesday, May 20th

9h30-10h30 Invited Talk (Chair: Pierre Geurts)

Johannes Furnkranz, Preference learning p12

10h30-11h00 Coffee break

11h00-12h40 Session 3

Session 3.1 Ranking and complex outputs (Chair: Johannes Furnkranz)

11h00-11h20 Willem Waegeman, Bernard De Baets, and Luc Boullart, When can we simplify a one-versus-one multi-class classifier to a single ranking? p57

11h20-11h40 Michael Rademaker, Bernard De Baets, and Hans De Meyer, Monotone Relabeling of Partially Non-Monotone Data: Restoring Regular or Stochastic Monotonicity p59

11h40-12h00 Pierre Geurts, Louis Wehenkel, and Florence d'Alche-Buc, Learning in kernelized output spaces with tree-based methods p61

12h00-12h20 Beau Piccart, Jan Struyf, and Hendrik Blockeel, Selective Inductive Transfer p63

12h20-12h40 Justus Piater, Fabien Scalzo, and Renaud Detry, Vision as inference in a hierarchical Markov network p65

Session 3.2 Semi-supervised learning, missing data, and automata (Chair: Pierre Dupont)

11h00-11h20 Jerome Callut, Kevin Francoisse, Marco Saerens, and Pierre Dupont, Semi-supervised Classification in Graphs using Bounded Random Walks p67

11h20-11h40 Amin Mantrach, Marco Saerens, and Luh Yen, The Sum-Over-Paths Covariance: A novel covariance measure p69

11h40-12h00 Jort Gemmeke, Classification on incomplete data using sparse representations: Imputation is optional p71

12h00-12h20 Yann-Michael De Hauwere, Peter Vrancx, and Ann Nowe, Multi-Agent State Space Aggregation using Generalized Learning Automata p73

12h20-12h40 Sicco Verwer, Mathijs de Weerdt, and Cees Witteveen, Efficiently learning timed models from observations p75

12h40-14h00 Lunch break

14h00-15h00 Invited Talk (Chair: Louis Wehenkel)

Gunnar Ratsch, Boosting, margins, and beyond p12

15h00-15h30 Coffee break

15h30-17h30 Session 4

Session 4.1 Bioinformatics (Chair: Gunnar Ratsch)

15h30-15h50 Thomas Abeel, Yvan Saeys, and Yves Van de Peer, ProSOM: Core promoter identification in the human genome p77

15h50-16h10 Sofie Van Landeghem, Yvan Saeys, Bernard De Baets, and Yves Van de Peer, Benchmarking machine learning techniques for the extraction of protein-protein interactions from text p79

16h10-16h30 Aalt-Jan van Dijk, Dirk Bosch, Cajo ter Braak, Sander van der Krol, and Roeland van Ham, Predicting sub-Golgi localization of glycosyltransferases p81

16h30-16h50 Vincent Botta, Pierre Geurts, Sarah Hansoul, and Louis Wehenkel, Prediction of genetic risk of complex diseases by supervised learning p83


16h50-17h10 Gilles Meyer and Rodolphe Sepulchre, Component analysis for genome-wide association studies p85

17h10-17h30 Fabien Scalzo, Peng Xu, Marvin Bergsneider, and Xiao Hu, Morphological Feature Extraction of Intracranial Pressure Signals via Nonlinear Regression p87

Session 4.2 Applications (Chair: Bernard Manderick)

15h30-15h50 Tim Van de Cruys, An extended NMF algorithm for word sense discrimination p89

15h50-16h10 Koen Smets, Bart Goethals, and Brigitte Verdonk, Automatic vandalism detection in Wikipedia: towards a machine learning approach p91

16h10-16h30 Bertrand Cornelusse, Louis Wehenkel, and Gerald Vignal, Supervised learning of short-term strategies for generation planning p93

16h30-16h50 Jean-Michel Dricot, Mathieu Van der Haegen, Yann-Ael Le Borgne, and Gianluca Bontempi, Performance evaluation of machine learning techniques for the localization of users in wireless sensor networks p95

16h50-17h10 Marc Ponsen, Jan Ramon, Kurt Driessens, Tom Croonenborghs, and Karl Tuyls, Bayes-relational learning of opponent models from incomplete information in no-limit poker p99

17h10-17h30 Francis wyffels, Benjamin Schrauwen, and Dirk Stroobandt, System modeling with Reservoir Computing p103

17h30-17h45 Closing


Invited talks


Machine Learning and Reinforcement Learning in Clinical Research

Susan A. Murphy
Institute for Social Research & Professor in Psychiatry, University of Michigan, USA

Abstract

This talk will survey some of the possible roles that machine learning researchers can play in informing and improving clinical practice. Clinical decision making, particularly when the patient has a chronic disorder, is adaptive. That is, the clinician must adapt and then readapt treatment type, combinations and dose to the waxing and waning of the patient's chronic disorder. This adaptation naturally occurs via clinical measurements of symptom severity, side effects, response to treatment, co-occurring disorders, etc. Currently most policies for guiding clinical decision making are informed primarily by expert opinion, with an informal use of clinical trial data and clinical databases. Some challenges in using trial data and databases are (1) there are usually many unknown causes of the patient observations; as a result, high quality mechanistic models for the "system dynamics" are found only in very special cases; and (2) clinical databases often include many associations that are not causal; hence a simplistic application of learning methods can lead to gross biases. In addition to the causal issues, measures of confidence are crucial in gaining acceptance of policies constructed from data. Some advances in these areas will be discussed; however, all of these are areas in which machine learning scientists could make a great impact.

Scene segmentation with latent topic Markov field models
and
Classification and dimensionality reduction using convex class models

Bill Triggs
Laboratoire Jean Kuntzmann (LJK) and CNRS, Grenoble, France

Abstract

The talk will be in two parts. In the first part I will present work with Jakob Verbeek on semantic-level scene segmentation by combining spatial coherence models, such as Markov and Conditional Random Fields, with latent topic based local image content models, such as Probabilistic Latent Semantic Analysis over bag-of-words representations. In the second part I will present some recent work with Hakan Cevikalp, Frederic Jurie and Robi Polikar on using simple convex approximations to high-dimensional classes for multi-class classification and discriminant dimensionality reduction.


Preference learning

Johannes Furnkranz
Knowledge Engineering Group, TU Darmstadt, Germany

Abstract

Preference Learning is a learning scenario that generalizes several conventional learning settings, such as classification, multi-label classification, and label ranking. In this talk, we will give a brief introduction to this developing research area, and will in the following focus on our work on explicit modeling of pairwise preferences. In this approach, we learn a separate model for each possible pair of labels, which is used to decide which of the two labels is preferred. The predictions of the pairwise models are then combined into an overall ranking of all possible options. The key advantages of this approach lie in the simplicity of the pairwise models, and the possibility to combine the pairwise models in various ways, which makes it possible to minimize different loss functions with the same set of trained classifiers. An obvious disadvantage is the complexity resulting from the need to train a quadratic number of classifiers. However, it can be shown that in many cases this problem can be efficiently solved. We will also briefly discuss extensions of the basic model for multilabel classification, for hierarchical classification, and for ordered classification.

Boosting, margins, and beyond

Gunnar Ratsch
Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany

Abstract

This talk will survey recent work on understanding Boosting in the context of maximizing the margin of separation. Starting with a brief introduction to Boosting in general and AdaBoost in particular, I will illustrate the connection to von Neumann's Minimax theorem and discuss AdaBoost's strategy to achieve a large margin. This will be followed by a presentation of algorithms which provably maximize the margin, are considerably quicker in maximizing the margin in practice, and implement the soft-margin idea to improve the robustness against noise. In the second part I will discuss how these techniques relate to other convex optimization techniques and how they are connected to Support Vector Machines. Finally, I will talk about the effects of the different key ingredients of Boosting and lessons learned from the application of such algorithms to real world problems.


Contributed abstracts


Learning a Transfer Function for Reinforcement Learning Problems

Tom Croonenborghs
Biosciences and Technology Department, KH Kempen University College, Geel, Belgium

Kurt Driessens, Maurice Bruynooghe
Declarative Languages and Artificial Intelligence Group, Katholieke Universiteit Leuven (KUL), Belgium

Abstract

The goal of transfer learning algorithms is to utilize knowledge gained in a source task to speed up learning in a different but related target task. Recently, several transfer methods for reinforcement learning have been proposed. A lot of them require a hand-coded mapping that relates features from one task to another. This paper proposes a method to learn such a mapping automatically from interactions with the environment. Preliminary experiments show that our approach can learn a meaningful mapping that can be used to speed up learning through the execution of transferred actions during exploration.

1. Introduction

An area where transfer learning is particularly important is the domain of Reinforcement Learning (RL). In RL (Sutton & Barto, 1998), an agent can observe its world and perform actions in it. The agent's learning task is to maximize the reward it obtains. At the start of the learning task, the agent has little or no information and is forced to perform random exploration. As a consequence, learning can become infeasible or too slow in practice for complex domains, and leveraging knowledge could increase the learning speed.

Recently, several approaches have been proposed to transfer knowledge between different reinforcement learning tasks. Often, a user-defined mapping is used to relate the new task to the task for which a policy was already learned, e.g. (Taylor et al., 2007; Torrey et al., 2006). There has been some work on learning such a mapping. For example, in (Liu & Stone, 2006) a graph-matching algorithm is used to find similarities between state variables in the source and target task. This approach, however, needs a complete and correct transition model for both tasks.

In this paper, transfer learning is achieved by considering transfer actions, i.e. actions transferred from the source task to the target task during the exploration phase of learning. To decide which action to transfer, the agent learns a function that predicts, for each source action, the probability that executing the transferred action is at least as good as executing the action which is best according to the agent's current utility function, and selects the one with the highest probability.

2. Using Exploration to Transfer Knowledge

The standard exploration policy of the agent is altered such that with a probability of 10% the agent will select and execute a transfer action. To select the transfer action in a state t of the target problem, we employ a transfer function p(s, t) that represents for each source state s the probability that the best action for s is at least as good for t as the best action according to the current approximation of the utility function. The transfer action executed in t is then the action performed on the state s for which p(s, t) is maximal, i.e., π_s(argmax_s p(s, t)), with π_s the source-task policy. Note that we assume for simplicity that actions in the source task are executable in the target task. In future work, we will extend our approach so that the transfer function maps (state, action) pairs and is thereby able to incorporate the relation between actions.
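The exploration scheme above is easy to prototype. The following Python sketch (not from the paper) shows one way to implement it; transfer_probability (the learned p(s, t)), source_policy (π_s) and current_q are hypothetical placeholder callables supplied by the surrounding agent.

```python
import random

def select_action(target_state, source_states, actions,
                  transfer_probability, source_policy, current_q,
                  transfer_rate=0.10, epsilon=0.05):
    """Exploration policy sketch: with probability `transfer_rate` execute a
    transfer action, i.e. the source-task action of the source state s that
    maximizes the learned transfer function p(s, t); otherwise fall back to
    an epsilon-greedy choice on the current utility (Q) approximation."""
    if random.random() < transfer_rate:
        # s* = argmax_s p(s, t); the transfer action is pi_s(s*)
        best_source = max(source_states,
                          key=lambda s: transfer_probability(s, target_state))
        return source_policy(best_source)
    if random.random() < epsilon:
        return random.choice(actions)
    # greedy action w.r.t. the current Q approximation
    return max(actions, key=lambda a: current_q(target_state, a))
```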

3. Learning the Transfer Function

At the end of every learning episode in the target task, a number of learning examples for the transfer function can be generated. For every transfer action a_t the agent executed (in a state s) during that episode, we compare the quality of the transfer action with the quality of its current policy using two different utility values. On the one hand, a Q-value Q_MC(s, a_t) can be obtained by backtracking the Q-values from the end of that episode to the step where a_t is executed. This is comparable to a Monte Carlo estimate of this Q-value. On the other hand, we let the agent learn a Q-value approximation Q of its current policy using a standard Q-learning type algorithm with generalization¹. The generated learning example for each executed transfer action then consists of the states in both the source and target task and a label for the example: "transfer" if Q_MC(s, a_t) ≥ max_a Q(s, a) and "no transfer" otherwise.
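A minimal sketch of this example-generation step, under the assumption that the transfer actions executed in an episode have been recorded together with the corresponding source and target states; q_mc and q_approx are hypothetical stand-ins for the Monte Carlo backup and the learned Q approximation.

```python
def make_transfer_examples(episode_transfers, actions, q_mc, q_approx):
    """For every transfer action a_t executed in state s during the episode,
    label the (source state, target state) pair 'transfer' if the Monte Carlo
    estimate Q_MC(s, a_t) is at least as good as the best action under the
    current Q approximation, and 'no transfer' otherwise."""
    examples = []
    for source_state, target_state, a_t in episode_transfers:
        best_current = max(q_approx(target_state, a) for a in actions)
        label = ("transfer" if q_mc(target_state, a_t) >= best_current
                 else "no transfer")
        examples.append((source_state, target_state, label))
    return examples
```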

4. Preliminary Experiments

To evaluate our approach, the target task consists of a sequence of three four-by-four rooms. The rooms are connected with doors in the bottom right of each room. The agent can only pass a door if it possesses a key of the same color as the door. The keys are placed at random locations in a room. The primitive actions available to the agent include four movement actions (up, down, left and right) and a pickup action that picks up the key that is located at the agent's location (if applicable). The agent can execute at most 500 actions per episode and receives a reward of 1 if it exits the last room and 0 otherwise. The state representation includes the dimensions of the different rooms, the locations and colors of the doors, the locations and colors of the keys, the keys the agent possesses, the agent's location and the goal location. The location consists of two coordinates, where a coordinate is determined by the relative position in the room and a certain offset for every room. In the source task, the agent has (successfully) learned how to navigate to the bottom right location in a four-by-four grid.

In a first experiment, instead of continuously learning the transfer function, we learned a single transfer function once with Tilde (Blockeel & De Raedt, 1998), based on learning examples created during the first 100 episodes in the target task and actions transferred from random source states. The algorithm was able to learn both the shift in coordinates between the different rooms and that successful transfer is very unlikely if the agent does not have the key needed to leave the current room.

We then restarted the agent in the environment with the learned transfer function. Figure 1 shows the average reward and number of actions per episode of a SARSA-learning agent, both with and without transfer. The numbers are obtained by freezing the current utility function and following a greedy test policy for 100 episodes every 50 episodes. We show results averaged over 10 runs.

¹ In the experiments, the SARSA algorithm is used to learn a Q-function with the TG algorithm, a first-order incremental decision tree learner.

Figure 1. Results in the multi-room grid world domain.

5. Future Work

Besides evaluating our approach in more detail, one important future direction is incorporating the quality of possible transfer into the agent's exploration policy. We would also like to replace the batch learning of the transfer function, as employed in our experiments, with a continuous, incremental approach.

Acknowledgments: Kurt Driessens is a postdoctoral research fellow of the FWO.

References

Blockeel, H., & De Raedt, L. (1998). Top-down induction of first order logical decision trees. Artificial Intelligence, 101, 285–297.

Liu, Y., & Stone, P. (2006). Value-function-based transfer for reinforcement learning using structure mapping. Proceedings of the Twenty-First National Conference on Artificial Intelligence (pp. 415–420).

Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA: The MIT Press.

Taylor, M. E., Stone, P., & Liu, Y. (2007). Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research, 8, 2125–2167.

Torrey, L., Shavlik, J., Walker, T., & Maclin, R. (2006). Skill acquisition via transfer learning and advice taking. Proceedings of the 17th European Conference on Machine Learning (pp. 425–436).


Perturb and Combine in Sequential Decision Making under Uncertainty

Boris Defourny, Damien Ernst, Louis Wehenkel

University of Liege, Department of Electrical Engineering and Computer Science, Grande Traverse 10, Sart-Tilman, B-4000 Liege, Belgium

1. Overview

In the context of sequential decision making, rules that adapt the decisions to the evolving state of information are often preferable to a fixed sequence of decisions. As finding such optimal rules is complex, an approximation of the information state is often called for, at the cost of some suboptimality (see Bertsekas, 2005).

In the classical discrete-time optimal control paradigm, the state of information is captured by state variables. An alternative approach (formalized in Section 2) consists in modeling the state of information by the history of a disturbance process that accounts for all the uncertainty over the planning horizon. Approximations over the disturbance process (in the context of convex optimization, see e.g. Rachev & Romisch, 2002) can simplify the representation of the information state without affecting the state space or the decision space.

If the disturbance space W has a finite number of elements, the disturbance process over T steps can be exactly represented by a disturbance tree of depth T and branching factor |W|. A decision rule mapping any particular outcome of a sequence of t − 1 disturbances to a decision at time t can also be fully described by assigning decisions to the nodes of the tree. However, the size of the tree grows exponentially with T.

The celebrated Perturb and Combine principle (tracing back to Breiman, 1996) recommends replacing the disturbance tree by an ensemble of incomplete, much smaller disturbance trees, built by a nondeterministic algorithm (described in Section 3). These partial representations of the disturbance process lead to distinct, partially specified decision rules that can be optimized in parallel. A first-stage decision obtained from a combination of their first-stage decisions can be implemented, assuming that subsequent decisions will be subject to renewed computations.

Preliminary experiments (reported in Section 4) suggest that this approach brings formidable time savings, while being only slightly suboptimal.

2. Planning over a disturbance tree

Let x_t ∈ X, u_t ∈ U, w_t ∈ W be the system state, the decision and the disturbance at time t. Let x_{t+1} = f_t(x_t, u_t, w_t) be the state transition equation. Let r_t(x_t, u_t, w_t) at 0 ≤ t < T and r_T(x_T) at t = T be the reward at time t. Optimal decision rules µ*_t, 0 ≤ t < T, maximize the expectation of a γ-discounted sum of the rewards:

J^{*}_{x_0} = \max_{\mu_t} \, E\Big\{ \sum_{t=0}^{T-1} \gamma^{t} r_t(x_t, u_t, w_t) + \gamma^{T} r_T(x_T) \Big\} .   (1)

In (1) the decision rule µ_t at time t is a mapping from histories h_t = [w_0, w_1, . . . , w_{t−1}] of the disturbance process to a decision u_t. At t = 0, h_0 is empty and µ_0 degenerates into a decision u_0. The state variable x_t is recovered by a chain of state transitions using the t decisions u_τ = µ_τ(h_τ) and the t disturbances w_τ, 0 ≤ τ < t, where h_τ and w_τ are extracted from h_t.

Under the convenient assumption that W is finite, there is a one-to-one correspondence between the nonterminal nodes of a disturbance tree of depth T and branching factor |W|, and all the possible histories h_t, 0 ≤ t ≤ T. It suffices to assign a distinct element of W to the |W| children of the nonterminal nodes, and view the t disturbances on the path from the root to a node n of depth t as the history associated to node n.

By assigning a decision u_n to the nonterminal nodes n, a mapping from histories to decisions, and hence a decision rule, can be fully specified on the tree.

The expectation in (1) can also be optimized on the tree. Assuming first that the decisions u_n are set, values of state variables x_n and rewards r_n are assigned to a node n of history h_n, starting from x_0 at the root (which has no associated reward) and using a chain of proper state transitions along the path to n. The probability of an element of W is assigned to the nodes with that element. The node probabilities are then used in a progressive computation of a weighted sum of node rewards, which ultimately gives the expectation S_µ of the γ-discounted sum of rewards under the u_n's. Such a procedure serves as an oracle for scoring the u_n's. A global solution to the maximization of S_µ over the u_n's gives (1), optimal u_n's, and hence an optimal decision rule, while suboptimal u_n's might still represent an acceptable suboptimal decision rule.
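As an illustration of this scoring oracle, the following sketch evaluates the expected discounted return of a fixed assignment of decisions on a disturbance tree. The dict-based node representation and the callables f, r and r_final are assumptions made for the example, not the authors' implementation.

```python
def expected_return(node, x, t, f, r, r_final, gamma, horizon):
    """Expected discounted return of the decisions stored in a disturbance tree.

    `node` is a dict {'decision': u, 'children': [(w, prob, child), ...]};
    depth `horizon` is terminal and only contributes the final reward r_T."""
    if t == horizon:
        return (gamma ** horizon) * r_final(x)
    u = node['decision']
    total = 0.0
    for w, prob, child in node['children']:
        step = (gamma ** t) * r(x, u, w, t)     # discounted reward r_t(x, u, w)
        nxt = f(x, u, w, t)                     # state transition f_t(x, u, w)
        total += prob * (step + expected_return(child, nxt, t + 1,
                                                f, r, r_final, gamma, horizon))
    return total
```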

3. Decision making on incomplete trees

The complete disturbance tree is cumbersome. A simple method for building an incomplete disturbance tree consists in sampling a number m of disturbances, with m a random number in {1, 2, . . . , q}, so as to create the children of its root. The number m′ ≤ m of distinct disturbances defines the number of children, while the sample multiplicities induce the probabilities assigned to them. The procedure is repeated for each node of depth t < T. The distribution of m, along with the tree depth T, determines the expected size of the tree.
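A possible realisation of this sampling procedure, using the same node representation as the evaluation sketch above; the distribution of m and the disturbance probabilities are supplied by the caller.

```python
import random
from collections import Counter

def build_incomplete_tree(depth, disturbances, probs, m_probs):
    """Sample an incomplete disturbance tree of the given depth.

    At every nonterminal node, draw m according to `m_probs` (a distribution
    over {1, ..., q}), sample m disturbances from `probs`, and turn the
    distinct values into children whose probabilities are the sample
    multiplicities divided by m. Decisions are left unset: they are the
    variables optimized afterwards."""
    if depth == 0:
        return {'children': []}                  # terminal node
    q = len(m_probs)
    m = random.choices(range(1, q + 1), weights=m_probs)[0]
    draws = random.choices(disturbances, weights=probs, k=m)
    children = [(w, count / m,
                 build_incomplete_tree(depth - 1, disturbances, probs, m_probs))
                for w, count in Counter(draws).items()]
    return {'decision': None, 'children': children}
```

With disturbances [-1, 0, 1], probs [0.25, 0.5, 0.25] and m_probs [0.6, 0.2, 0.2], this reproduces the sampling setting reported in Section 4.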

The partial representation of the disturbance process by an incomplete tree leads to spurious opportunities likely to be exploited during the optimization of the decisions. Hence the great appeal of the Perturb and Combine paradigm to mitigate this effect.

An optimal first-stage decision is always well-defined for each incomplete tree. Under the assumption that the decision space U is finite, the aggregation of these decisions can be done through a majority vote.

4. Results on a navigation benchmark

The benchmark will illustrate how an ensemble of small incomplete trees advantageously replaces a single complete one.

A robot is in a corridor, at some position x ∈ {1, 2, . . . , 19}. There are two exits, at x = 0 and x = 20, with distinct rewards r = 1 and r = 5 obtained when the robot enters one of these terminal positions. The robot can move in both directions. A disturbance also affects the move. Specifically, U = {−1, +1}, W = {−1, 0, 1} with P_w = {0.25, 0.50, 0.25}, and

x_{t+1} = \begin{cases} x_t + u_t + w_t & \text{if } 0 < x_t + u_t + w_t < 20, \\ 0 & \text{if } x_t + u_t + w_t \le 0, \\ 20 & \text{if } x_t + u_t + w_t \ge 20. \end{cases}

The horizon is T = 20 with a discount factor γ = 0.75, so that the robot is in a hurry. (With these parameters, the optimal decisions are u_t = −1 if 1 ≤ x_t ≤ 7, and u_t = +1 if 8 ≤ x_t ≤ 19.)
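For concreteness, the corridor dynamics can be simulated with a few lines of Python; the function below is an illustration of the benchmark as described, not code from the paper.

```python
import random

def corridor_step(x, u, rng=random):
    """One transition of the corridor benchmark: the robot moves by its
    decision u in {-1, +1} plus a disturbance w in {-1, 0, 1} drawn with
    probabilities {0.25, 0.50, 0.25}; positions 0 and 20 are terminal exits
    with rewards 1 and 5 respectively."""
    w = rng.choices([-1, 0, 1], weights=[0.25, 0.50, 0.25])[0]
    nxt = x + u + w
    if nxt <= 0:
        return 0, 1.0, True      # left exit reached
    if nxt >= 20:
        return 20, 5.0, True     # right exit reached
    return nxt, 0.0, False
```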

The proposed approach is implemented as follows. The decision u_0 is chosen by a majority vote over 10 incomplete trees on which the node decisions have been optimized using the Cross-Entropy method (Rubinstein & Kroese, 2004). The sampling of m disturbances in the tree building algorithm is done with P(m = 1) = 0.6, P(m = 2) = 0.2, P(m = 3) = 0.2, resulting in incomplete trees of about 1200 nodes in expectation (in contrast with the 5.2 · 10^9 nodes of the complete tree).

Simulations are carried out for the interesting initial positions x_0 = 6, 7, 8, 9, 10. The resulting decisions are respectively −1 (by 100% of the trees), −1 (80%), −1 (60%), +1 (60%), +1 (90%). The percentages suggest that some decisions are more reliable than others. (And in fact the decision −1 at x_0 = 8 is even suboptimal.) One could live with that, or carry out new simulations to ascertain the majority decision.

More technical details about the proposed approach are provided in (Defourny et al., 2008). Future work will focus on a broader set of benchmarks to evaluate the advantages of this approach on problems with large decision spaces and/or large disturbance spaces.

Acknowledgments

Damien Ernst is a Research Associate of the Belgian National Fund of Scientific Research (FNRS). This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its authors.

References

Bertsekas, D. (2005). Dynamic programming and suboptimal control: Survey from ADP to MPC. Proceedings of the 44th IEEE Conference on Decision and Control and European Control Conference (p. 50).

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.

Defourny, B., Ernst, D., & Wehenkel, L. (2008). Lazy planning under uncertainty by optimizing decisions on an ensemble of incomplete disturbance trees. Submitted for publication at EWRL08.

Rachev, S., & Romisch, W. (2002). Quantitative stability in stochastic programming: The method of probability metrics. Mathematics of Operations Research, 27, 792–818.

Rubinstein, R., & Kroese, D. (2004). The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Information Science and Statistics. Springer.


Variable selection for dynamic treatment regimes: a reinforcement learning approach

Raphael Fonteneau, Louis Wehenkel, Damien Ernst†

Department of Electrical Engineering and Computer Science and GIGA-Research, University of Liege, Grande Traverse 10, 4000 Liege, Belgium. † Research Associate FNRS.

1. Introduction

Nowadays, many diseases, such as HIV/AIDS, cancer, and inflammatory or neurological diseases, are seen by the medical community as chronic-like diseases, resulting in medical treatments that can last over very long periods. For treating such diseases, physicians often adopt explicit, operationalized series of decision rules specifying how drug types and treatment levels should be administered over time, which are referred to in the medical community as Dynamic Treatment Regimes (DTRs). Designing an appropriate DTR for a given disease is a challenging issue. Among the difficulties encountered, we can mention the complex dynamics of the human body interacting with treatments and other environmental factors, as well as the often poor compliance to treatments due to the side effects of some of the administered drugs. While DTRs are typically based on clinical judgment and medical insight, in recent years the biostatistics community has been investigating a new research field addressing specifically the problem of inferring DTRs in a well-principled way directly from clinical data gathered from patients under treatment. Among the results already published in this area, we mention (Murphy, 2005), which uses statistical tools for designing DTRs for psychotic patients.

2. Problem formulation

One possible approach to infer DTRs from the data collected through clinical trials is to formalize this problem as an optimal control problem for which most of the information available on the 'system dynamics' (the system here is the patient and the input of the system is the treatment) is 'hidden' in the clinical data. This type of problem has been extensively studied in Reinforcement Learning (RL), a subfield of machine learning (see e.g., (Ernst et al., 2005)). Its application to the DTR problem would consist of processing the clinical data so as to compute a closed-loop treatment strategy which takes as inputs all the various clinical indicators which have been collected from the patients. Using policies computed in this way may however be inconvenient for physicians, who may prefer DTRs based on as small as possible a subset of relevant indicators rather than on the possibly very large set of variables monitored through the clinical trial. In this research, we therefore address the problem of determining a small subset of indicators among a larger set of candidate ones, in order to infer convenient decision strategies by RL. Our approach is closely inspired by work on 'variable selection' for supervised learning.

3. Learning from a sample

We assume that the information available for designing DTRs is a sample of discrete-time trajectories of treated patients, i.e. successive tuples (x_t, u_t, x_{t+1}), where x_t represents the state of a patient at some time-step t and lies in an n-dimensional space X of clinical indicators, u_t is an element of the action space (representing treatments taken by the patient in the time interval [t, t + 1]), and x_{t+1} is the state at the subsequent time-step.

We further suppose that the responses of patients suffering from a specific type of chronic disease all obey the same discrete-time dynamics:

x_{t+1} = f(x_t, u_t, w_t), \quad t = 0, 1, \ldots

where the disturbances w_t are generated by the probability distribution P(w|x, u). Finally, we assume that one can associate to the state of the patient at time t and to the action at time t a reward signal r_t = r(x_t, u_t) ∈ R which represents the 'well-being' of the patient over the time interval [t, t + 1]. Once the choice of the function r_t = r(x_t, u_t) has been made (a problem known as preference elicitation), the problem of finding a 'good' DTR may be stated as an optimal control problem for which one seeks a policy leading to a sequence of actions u_0, u_1, . . . , u_{T−1} which maximizes, over the time horizon T ∈ N and for any initial state, the criterion:

R_T^{(u_0, u_1, \ldots, u_{T-1})}(x_0) = E_{\,w_t,\; t=0,1,\ldots,T-1} \left[ \sum_{t=0}^{T-1} r(x_t, u_t) \right]

One can show (see e.g., (Ernst et al., 2005)) that there exists a policy π*_T : X × [0, . . . , T − 1] → U which produces such a sequence of actions for any initial state x_0. To characterize these optimal T-stage policies, let us define iteratively the sequence of state-action value functions Q_N : X × U → R, N = 1, . . . , T, as follows:

Q_N(x, u) = E_w \left[ r(x, u) + \sup_{u' \in U} Q_{N-1}(f(x, u, w), u') \right]   (1)

with Q_0(x, u) = 0 for all (x, u) ∈ X × U. Dynamic programming theory implies that, for all t ∈ {1, . . . , T − 1} and x ∈ X, the policy

\pi^{\star}_T(t, x) = \arg\max_{u \in U} Q_{T-t}(x, u)

is a T-step optimal policy.

Exploiting directly (1) for computing the Q_N-functions is not possible in our context since f is unknown and replaced here by an ensemble of one-step trajectories

F = \{ (x_t^l, u_t^l, r_t^l, x_{t+1}^l) \}_{l=1}^{\#F},

where r_t^l = r(x_t^l, u_t^l). To address this problem, we exploit the fitted Q iteration algorithm, which offers a way for computing approximations of the Q_N-functions from the sole knowledge of F (Ernst et al., 2005). Notice that when used with tree-based approximators, as is the case in this paper, this algorithm offers good inference performance. Furthermore, we exploit the particular structure of these tree-based approximators in order to identify the most relevant clinical indicators among the n candidate ones.
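A compact sketch of fitted Q iteration on such a set of one-step trajectories. Here scikit-learn's ExtraTreesRegressor stands in for the tree-based approximator and actions are assumed to be encoded numerically; the learner and settings actually used by the authors may differ.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, actions, horizon, n_trees=50):
    """transitions: list of (x, u, r, x_next), with x an array of clinical
    indicators and u a discrete (numeric) action. Returns the fitted models
    approximating Q_1, ..., Q_horizon."""
    X = np.array([np.append(x, u) for x, u, _, _ in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    next_states = np.array([x_next for _, _, _, x_next in transitions])
    q_models = []
    targets = rewards.copy()                 # Q_1 targets: immediate rewards
    for _ in range(horizon):
        model = ExtraTreesRegressor(n_estimators=n_trees).fit(X, targets)
        q_models.append(model)
        # next targets: r + max_{u'} Q_N(x', u'), evaluated per candidate action
        next_q = np.column_stack([
            model.predict(np.column_stack([next_states,
                                           np.full(len(next_states), a)]))
            for a in actions])
        targets = rewards + next_q.max(axis=1)
    return q_models
```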

4. Selection of clinical indicators

As mentioned in Section 2, we propose to find a small subset of state variables (clinical indicators), the m (m ≪ n) most relevant ones with respect to a certain criterion, so as to create an m-dimensional subspace of X on which DTRs will be computed. The approach we propose for this exploits the tree structure of the Q_N-functions computed by the fitted Q iteration algorithm. More specifically, it evaluates the relevance of each state variable x_i by the score function:

S(x_i) = \frac{\sum_{N=1}^{T} \sum_{\tau \in Q_N} \sum_{\nu \in \tau} \delta(\nu, x_i)\, \Delta_{var}(\nu)\, |\nu|}{\sum_{N=1}^{T} \sum_{\tau \in Q_N} \sum_{\nu \in \tau} \Delta_{var}(\nu)\, |\nu|}

where ν is a nonterminal node in a tree τ (used to build the ensemble model representing one of the Q_N-functions), δ(ν, x_i) = 1 if x_i is used to split at node ν and 0 otherwise, Δ_var(ν) is the variance reduction when splitting node ν, and |ν| is the cardinality of the subset of tuples residing at node ν.

The approach then sorts the state variables x_i by decreasing values of their score so as to identify the m most relevant ones. A DTR defined on this subset of attributes is then computed by running the fitted Q iteration algorithm again on a 'modified F', where the state variables of x_t^l and x_{t+1}^l that are not among these m most relevant ones are discarded.

The algorithm for computing a DTR defined on a small subset of state variables is thus as follows:
(1) compute the Q_N-functions (N = 1, . . . , T) using the fitted Q iteration algorithm on F,
(2) compute the score function for each state variable, and determine the m best ones,
(3) run the fitted Q iteration algorithm on

\tilde{F} = \{ (\tilde{x}_t^l, u_t^l, r_t^l, \tilde{x}_{t+1}^l) \}_{l=1}^{\#\tilde{F}}, \qquad \tilde{x}_t = \tilde{M} x_t,

where \tilde{M} is an m × n boolean matrix with \tilde{m}_{i,j} = 1 if the state variable x_j is the i-th most relevant one and 0 otherwise.
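The scoring step can be sketched as follows, assuming the split information of every tree has already been extracted into (feature index, variance reduction, number of samples) tuples for its internal nodes; how these are obtained depends on the tree learner used.

```python
from collections import defaultdict

def variable_scores(tree_ensembles):
    """tree_ensembles: for each Q_N model, a list of trees, each tree given as
    a list of internal-node tuples (feature, delta_var, n_samples). Returns a
    dict mapping feature index to its normalized score S(x_i)."""
    numer = defaultdict(float)
    denom = 0.0
    for trees in tree_ensembles:          # one entry per Q_N
        for nodes in trees:               # one entry per tree tau
            for feature, delta_var, n_samples in nodes:
                weight = delta_var * n_samples
                numer[feature] += weight
                denom += weight
    return {feature: value / denom for feature, value in numer.items()}

def top_m_features(tree_ensembles, m):
    scores = variable_scores(tree_ensembles)
    return sorted(scores, key=scores.get, reverse=True)[:m]
```

Step (3) then amounts to keeping only the state columns returned by top_m_features and rerunning fitted Q iteration on the masked trajectories.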

5. Preliminary validation

The method has been tested on the 'car on the hill' problem, a classical benchmark in RL (Ernst et al., 2005). This problem, which has a (continuous) state space of dimension two (the position p and the speed s of the car), is originally a deterministic problem. We have added to these variables some non-informative components so as to set up an experimental protocol. In our trials, the algorithm described previously was able to identify s and p as the most informative variables, which is encouraging for our future work with real-life clinical data.

Acknowledgments

This paper presents research results of the Belgian Network BIOMAGNET (Bioinformatics and Modeling: from Genomes to Networks), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its authors.

References

Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503–556.

Murphy, S. (2005). An experimental design for the development of adaptive treatment strategies. Statistics in Medicine, 24, 1455–1481.


An Alternative Approach for Playing Complex Games like Chess

Jan Lemeire

ETRO Dept., Vrije Universiteit Brussel, Brussels, Belgium

Abstract

Computer algorithms for game playing rely on a state evaluation which is based on a set of features and patterns. Such an evaluation can, however, never capture the full complexity of games such as chess, since the impact of a feature or a pattern on the game outcome heavily relies on the game's context. It is a well-known problem in pattern-based learning that too many, too specialized patterns are needed to capture all possible situations. We hypothesize that a pattern should be regarded as an opportunity to attain a certain state during the continuation of the game, which we call the effect of a pattern. For correct game state evaluation, one should analyze whether the desired effects of the matched patterns can be reached. Patterns indicate opportunities to reach a more advantageous situation. Testing whether this is possible in the current context is performed through a well-directed game tree exploration. We argue that this approach comes closer to the human way of game playing.

1. Why the Evaluation Fails

Besides the abundant game playing research in optimizing the brute-force minimax search, much work is done on learning algorithms. They try to mimic human game playing. Explanation-based algorithms offer such an approach. In explanation-based learning (EBL), prior knowledge is used to analyze, or explain, how each observed training example satisfies the target concept (Mitchell et al., 1986). This explanation is then used to distinguish the relevant features of the training examples from the irrelevant, so that examples can be generalized based on logical rather than statistical reasoning. A pattern denotes an advantageous situation. The explanations must give the sufficient and necessary conditions for a pattern to be successful.

However, for a complex game like chess, patterns that have to capture all aspects of a game become too complex. Consider the task of learning to recognize chess positions - the explanations - in which "one's queen will be lost within the next few moves" - the pattern (Mitchell & Thrun, 1996). In a particular example, the queen could be lost due to a fork, in which "the white knight is attacking both the black king and queen". A fork is, however, hard to define correctly. One has to capture all situations in which the pattern leads to a successful outcome. All counter-plans that are available to the opponent for saving both its threatened pieces have to be excluded (Furnkranz, 2001, p. 25). A quasi-unlimited number of counter moves, generated by the context in which the pattern appears, exist that can neutralize the effects. Minton (1984) and Epstein et al. (1996) highlight the same problem of learning too many too specialized rules with explanation-based learning. Even in simple games, such as tic-tac-toe, 45 concepts were learned with 52 exception clauses (Fawcett & Utgoff, 1991).

For adequate pattern-based evaluation functions, the patterns must contain all information on the outcome of the game. This can be written as:

state ⊥⊥ outcome | Patterns(state) (1)

where Patterns(state) stands for the patterns that apply for state. All game-playing algorithms rely in one way or another on an evaluation of game states. A brute-force search tries to postpone an evaluation as much as possible; it explores all possible move sequences as far as possible into the future.

2. Alternative Approach

Our analysis is based on the observation that the outcome of a game is determined by the exact interaction of the patterns and heavily depends on the context of the game state. Trying to describe all the interactions leads, by the complexity of the game, to an enormous number of rules or patterns. We hypothesize that the influence of a pattern on the game outcome depends on the achievement of certain states during the continuation of the game. We call these states the effects of the pattern. The influence of a pattern on the game outcome is completely described by these effects. The game can be analyzed by the set of existing patterns and whether their effects can be achieved. The difference with the explanation-based approach is that we do not expect the game always to reach the effect in the presence of the pattern.

Figure 1. Game tree exploration by looking at patterns and their possible effects

Take the game tree of Fig. 1. Assume that the white player considers playing move a, by which he arrives at a position in which pattern 1 is true. He hopes to achieve one of the advantageous effects of the pattern. The black player sees two possible counter moves. If he chooses move c, however, white can collect the benefits of pattern 1 with move e. This is not possible if black chooses move d. White can then play f or g, but in both cases black neutralizes the threat of pattern 1 with moves h and j respectively. Both moves bring the game into a state in which the positive effects of the pattern cannot be attained anymore. Note that not all possible moves have to be explored. The game tree can be pruned effectively. Only the moves interfering with the pattern have to be explored. The other moves can be classified as irrelevant, since they do not bring white closer to the achievement of pattern 1's effect.

We have thus defined a new kind of generic knowledge: patterns together with their effects. However, an implementation of this approach needs a pattern engine that does not yet exist. We do not have a general way to describe, recognize, learn and reason with patterns.

3. Human-like Game Playing

Psychological studies have shown that the differences in playing strength between chess experts and novices are not so much due to differences in the ability to calculate long move sequences, but to which moves they start to calculate. Cowley and Byrne showed that chess experts rely on falsification (Cowley & Byrne, 2004). The results of the research show that chess masters were readily able to falsify their plans. They generated move sequences that falsified their plans more readily than novice players, who tended to confirm their plans. Our approach confirms this; it is based on plans and on falsification.

It’s well-known that humans have difficulties formallydefining the knowledge they use. Our approach can ex-plain this. A pattern only denotes an opportunity. A pre-cise description of the states in which it is successful isnot necessary, a well-directed tree search is used to con-firm or falsify the hypothesis. Our approach also explainswhy humans can reason about a game, why we can exactlypinpoint which actions were decisive in a game and why.

References

Cowley, M., & Byrne, R. M. J. (2004). Chess masters hypothesis testing. Proceedings of the 26th Annual Conference of the Cognitive Science Society, Mahwah, NJ (pp. 250–255).

Epstein, S. L., Gelfand, J. J., & Lesniak, J. (1996). Pattern-based learning and spatially-oriented concept formation in a multi-agent, decision-making expert. Computational Intelligence, 12, 199–221.

Fawcett, T., & Utgoff, P. E. (1991). A hybrid method for feature generation. Machine Learning: Proceedings of the Eighth International Workshop (pp. 137–141). Morgan Kaufmann.

Furnkranz, J. (2001). Machine learning in games: A survey. In J. Furnkranz and M. Kubat (Eds.), Machines that learn to play games, 11–59. Huntington, NY: Nova Science Publishers.

Gould, J., & Levinson, R. A. (1991). Method integration for experience-based learning (Technical Report UCSC-CRL-91-27). UCSC, Santa Cruz, CA.

Minton, S. (1984). Constraint-based generalization: Learning game-playing plans from single examples. Proceedings of the 2nd National Conference on Artificial Intelligence, Austin, TX (pp. 251–254).

Mitchell, T., Keller, R., & Kedar-Cabelli, S. (1986). Explanation-based generalization: A unifying view. Machine Learning, 1, 47–80.

Mitchell, T. M., & Thrun, S. B. (1996). Learning analytically and inductively. In D. M. Steier and T. M. Mitchell (Eds.), Mind matters: A tribute to Allen Newell, 85–110. Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc.


ProbLog and its Application to Link Mining in Biological Networks

Luc De Raedt

Dept. of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, POBox 2402, B-3001 Heverlee, Belgium

AbstractProbLog is a recently introduced probabilis-tic extension of Prolog (De Raedt et al.,2007). A ProbLog program defines a distri-bution over logic programs by specifying foreach clause the probability that it belongsto a randomly sampled program, and theseprobabilities are mutually independent. Thesemantics of ProbLog is then defined by thesuccess probability of a query in a randomlysampled program. It has been applied to linkmining and discovery in a large biological net-work. In the talk, I will also discuss variouslearning settings for ProbLog and link min-ing, in particular, I shall present techniquesfor probabilistic local pattern mining (Kim-mig & De Raedt, 2008), probabilistic expla-nation based learning (Kimmig et al., 2007)and theory compression from examples (DeRaedt et al., 2008).

Acknowledgments

This is joint work with Angelika Kimmig, Hannu Toivonen, Bernd Gutmann, Kate Revoredo and Kristian Kersting.

References

De Raedt, L., Kersting, K., Kimmig, A., Revoredo, K., & Toivonen, H. (2008). Compressing probabilistic Prolog programs. Machine Learning, 70, 151–168.

De Raedt, L., Kimmig, A., & Toivonen, H. (2007). ProbLog: A probabilistic Prolog and its application in link discovery. Proceedings of IJCAI (pp. 2462–2467).

Kimmig, A., & De Raedt, L. (2008). Local pattern mining in probabilistic databases. Submitted.

Kimmig, A., De Raedt, L., & Toivonen, H. (2007). Probabilistic explanation based learning. Proceedings of the 18th European Conference on Machine Learning (pp. 176–187).


Parameter Learning in Probabilistic Databases: A Least Squares Approach

Bernd Gutmann, Angelika Kimmig, Luc De Raedt

Dept. of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, POBox 2402, BE-3001 Heverlee, Belgium

Kristian Kersting

Fraunhofer IAIS, Schloß Birlinghoven, 53754 Sankt Augustin, Germany

Keywords: Learning, Graphs, Probabilistic Databases, Logic

Abstract

Probabilistic databases compute the success probabilities of queries. We introduce the problem of learning the parameters of the probabilistic database ProbLog. Given the observed success probabilities of a set of queries, we use a least squares approach to compute the probabilities attached to facts that have a low approximation error on the training data as well as on unseen examples.¹

1. Introduction

The statistical relational learning community has devoted a lot of attention to learning both the structure and parameters of probabilistic logics, cf. (Getoor & Taskar, 2007; De Raedt et al., 2008), but so far seems to have devoted little attention to the learning of probabilistic database formalisms. Probabilistic databases (Dalvi & Suciu, 2004; De Raedt et al., 2007) associate probabilities to facts, indicating the probabilities with which the facts hold. This information is then used to define and compute the success probability of queries or derived facts or tuples. Because probabilistic databases do not constitute a generative model, it has so far been unclear how to learn such databases. In this paper, we introduce the problem of learning the parameters of probabilistic databases from a set of queries together with their target probabilities. The approach is incorporated in the probabilistic database ProbLog (De Raedt et al., 2007), though it can easily be integrated in other probabilistic databases as well. ProbLog has been designed to support life scientists that mine a large network of biological entities in interactive querying sessions.

¹ This is a shortened version of an extended abstract submitted to the 6th International Workshop on Mining and Learning with Graphs (MLG2008).

Figure 1. (a) Probabilistic graph. (b) Learning curve with standard deviation, on test set for graph with 88 edges.

2. ProbLog

ProbLog is a simple probabilistic extension of Prolog introduced in (De Raedt et al., 2007). A ProbLog program consists, as Prolog, of a set of definite clauses. However, in ProbLog every fact c_i is labeled with the probability p_i that it is true, and those probabilities are assumed to be mutually independent. For ease of illustration, we will consider probabilistic graphs like the one in Figure 1(a) in the following, but the entire discussion carries over to arbitrary ProbLog programs. Such a probabilistic graph can be used to sample subgraphs by tossing a biased coin for each edge. The corresponding ProbLog program T = {p_1 : c_1, . . . , p_n : c_n} therefore defines a probability distribution over subgraphs L ⊆ L_T = {c_1, . . . , c_n} in the following way:

P(L | T) = \prod_{c_i \in L} p_i \prod_{c_i \in L_T \setminus L} (1 - p_i).


It is straightforward to add background knowledge in the form of Prolog clauses, say, the definition of a path by combining edges. We can then ask for the probability that there exists e.g. a path between nodes a and c in our probabilistic graph, i.e. the probability that a randomly sampled subgraph contains the edge from a to c, or the path from a to c via b (or both of them). Formally, the success probability Ps(q|T) of a query q in a ProbLog program T is defined as

$$P_s(q \mid T) = \sum_{L \subseteq L_T} P(q \mid L) \cdot P(L \mid T), \qquad (1)$$

where P(q|L) = 1 if there exists a θ such that L |= qθ, and P(q|L) = 0 otherwise. The success probability of query q corresponds to the probability that the query q is provable in a randomly sampled logic program. Due to the presence of multiple paths in samples, evaluating the success probability of ProbLog queries is computationally hard, see (De Raedt et al., 2007) for an approximation algorithm.
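For small programs, Eq. (1) can be evaluated directly by enumerating all subgraphs. The following Python sketch does this for a toy probabilistic graph; the edge probabilities and the path query are illustrative, and the code is a brute-force illustration of the semantics rather than ProbLog's approximation algorithm.

```python
from itertools import product

# Probabilistic edges of a toy graph (cf. Figure 1(a)); edge set and probabilities are illustrative.
edges = {('a', 'b'): 0.7, ('b', 'c'): 0.8, ('a', 'c'): 0.6, ('c', 'd'): 0.9}

def has_path(subgraph, source, target):
    """Depth-first search for a directed path in a set of edges."""
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(v for (u, v) in subgraph if u == node)
    return False

def success_probability(edges, query):
    """Brute-force Eq. (1): sum P(L|T) over all subgraphs L that entail the query."""
    items = list(edges.items())
    total = 0.0
    for keep in product([True, False], repeat=len(items)):
        subgraph = {e for (e, _), k in zip(items, keep) if k}
        weight = 1.0
        for (_, p), k in zip(items, keep):
            weight *= p if k else 1.0 - p
        if query(subgraph):
            total += weight
    return total

print(success_probability(edges, lambda g: has_path(g, 'a', 'c')))
```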

3. Parameter Learning

ProbLog does not provide a generative model for sampling queries (e.g. paths between nodes). Thus, we cannot directly apply standard maximum likelihood techniques for parameter estimation based on the EM algorithm. We consider parameter learning for ProbLog as a function optimization problem:

Definition 1 (ProbLog Parameter Learning) Given a set of training examples $\{q_i, p_i\}_{i=1}^{K}$, K > 0, where each qi ∈ H is a logical query with success probability pi, find a function h : H → [0, 1] with low approximation error on the training data as well as on unseen examples. H comprises all parameter assignments for a given logical program T.

We want to minimize the mean squared error (MSE):

$$MSE(T) = \frac{1}{K} \sum_{1 \leq i \leq K} \big( P_s(q_i \mid T) - p_i \big)^2. \qquad (2)$$

It is easy to show that minimizing the squared error in this case corresponds to finding a maximum likelihood hypothesis, provided that for each training example (qi, pi), a Gaussian error is included, i.e. pi = p(qi) + ei, with p(qi) the actual probability of query qi and ei drawn independently from a Gaussian with mean zero. We now derive the gradient of the MSE. Applying the sum and chain rule to Eq. (2) yields the partial derivative

$$\frac{\partial\, MSE(T)}{\partial p_j} = \frac{2}{K} \sum_{1 \leq i \leq K} \underbrace{\big( P_s(q_i \mid T) - p_i \big)}_{\text{Part 1}} \cdot \underbrace{\frac{\partial P_s(q_i \mid T)}{\partial p_j}}_{\text{Part 2}}. \qquad (3)$$

We apply standard gradient descent to minimize the MSE. Equation (3) can be evaluated by ProbLog inference directly (Part 1) and by slightly adapting the underlying techniques (Part 2).

4. Experiments

We implemented the gradient descent algorithm in Prolog (Yap-5.1.3). Since this is ongoing work, we primarily try to answer the question: does the gradient descent minimize the MSE?

As our test graph G, we used a real biological graph around 3 random Alzheimer genes, with 45 nodes and 88 edges, cf. (De Raedt et al., 2007). We randomly sampled 100 node pairs and calculated the probability that there exists a path between them using approximative ProbLog inference. We performed 5-fold cross-validation, initializing the parameters randomly, with a fixed seed for succeeding experiments. Figure 1(b) shows the learning curve for the test set. After 50 iterations, the MSE averaged over 5 folds is 0.00016 ± 0.00001 on the training set and 0.00107 ± 0.00065 on the test set, answering our question positively.

5. Conclusions

We introduced an approach to parameter learning for the probabilistic database ProbLog and successfully showed it at work on a real biological application. Interesting directions for future research include optimizing the learning algorithm and regularization-based cost functions. The latter enable domain experts to refine the probabilities of a database by stating examples.

Acknowledgments  AK and BG are supported by the Research Foundation-Flanders (FWO-Vlaanderen), KK by a Fraunhofer ATTRACT fellowship.

References

Dalvi, N. N., & Suciu, D. (2004). Efficient query evaluation on probabilistic databases. VLDB (pp. 864–875).

De Raedt, L., Frasconi, P., Kersting, K., & Muggleton, S. (Eds.). (2008). Probabilistic inductive logic programming - theory and applications, vol. 4911 of LNAI. Springer-Verlag.

De Raedt, L., Kimmig, A., & Toivonen, H. (2007). ProbLog: A probabilistic Prolog and its application in link discovery. IJCAI (pp. 2462–2467).

Getoor, L., & Taskar, B. (Eds.). (2007). Statistical relational learning. The MIT Press.


CPT-L: an Efficient Model for Relational Stochastic Processes

Ingo Thon, Niels Landwehr, Luc De Raedt

Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Heverlee, Belgium

Abstract
Agents that learn and act in real-world environments have to cope with both complex state descriptions and non-deterministic transition behavior of the world. Standard statistical relational learning techniques can capture this complexity, but are often inefficient. We present a simple probabilistic model for such environments based on CP-Logic. Efficiency is maintained by restriction to a fully observable setting.

1. Introduction
Artificial intelligence aims at developing agents that learn and act in complex environments. Realistic environments typically feature a variable number of objects, relations amongst them, and non-deterministic transition behavior. Standard probabilistic sequence models provide efficient inference and learning techniques, but typically cannot fully capture the relational complexity. On the other hand, statistical relational learning techniques are often too inefficient. In this paper, we present a simple model that occupies an intermediate position in this expressiveness/efficiency trade-off. More specifically, we contribute a novel representation, called CPT-L (for CPTime-Logic), that essentially defines a probability distribution over sequences of interpretations. Interpretations are relational state descriptions that are typically used in planning and many other applications of artificial intelligence. CPT-L is a variation of CP-logic (Vennekens et al., 2006), a recent expressive logic for modeling causality. By focusing on the sequential aspect and deliberately avoiding the complications that arise when dealing with hidden variables, CPT-L is more restricted, but also more efficient to use than its predecessor and alternative formalisms within the artificial intelligence and statistical relational learning literature.

This is clear when positioning CPT-L w.r.t. the few existing approaches that can probabilistically model sequences of relational state descriptions. First, standard SRL approaches (Getoor & Taskar, 2007) can be used in this setting by explicitly modeling time. However, such models are often intractable for complex sequential real-world domains. Second, relational STRIPS-based techniques (Zettlemoyer et al., 2005) are able to probabilistically model relational sequences. However, they are restricted by the fact that only one rule can "fire" at a particular point in time and thus only one aspect of the world can be changed. The key contributions of our work are 1) the introduction of the CPT-L model for representing probability distributions over sequences of interpretations, and 2) the observation that efficient inference is possible under the assumption of full observability and the restriction to sequential causal effects.

2. CPT-L
A relational interpretation I is a set of ground facts a1, ..., aN. A relational stochastic process defines a distribution P(I1, ..., IT) over sequences of interpretations of length T. The semantics of CPT-L is based on CP-logic, a probabilistic first-order logic that defines probability distributions over interpretations (Vennekens et al., 2006). CP-logic has a strong focus on causality and constructive processes: an interpretation is incrementally constructed by a process that adds facts which are probabilistic outcomes of other already given facts (the causes). CPT-L combines the semantics of CP-logic with that of (first-order) Markov processes. Causal influences only stretch from It to It+1 (Markov assumption), are identical for all time steps (stationarity), and all causes and outcomes are observable. Models in CPT-L are also called CP-theories, and are defined as follows:

Definition 1. A CPT-theory is a set of rules of the form

$$r = \underbrace{(h_1 : p_1) \vee \ldots \vee (h_n : p_n)}_{\mathrm{head}(r)} \leftarrow \underbrace{b_1, \ldots, b_m}_{\mathrm{body}(r)}$$

where the hi are logical atoms, the bi are literals (i.e., atoms or their negation) and pi ∈ [0, 1] are probabilities s.t. $\sum_{i=1}^{n} p_i = 1$.

We shall also assume all variables appearing in the head of the rule also appear in its body. The intuition behind a rule is that whenever the (grounded) body of the rule holds in the current state It, one of the (grounded) heads will hold in the next state It+1. In this way, the rule models


Figure 1. Large graph: per-sequence log-likelihood on training data as a function of the EM iteration. Small graph: running time of EM as a function of the number of blocks in the world model.

a (probabilistic) causal process, as the condition specified in the body causes one (probabilistically chosen) atom in the head to become true in the next time step. One of the main features of CPT-theories is that they are easily extended to include background knowledge, which can be any logic program (cf. (Bratko, 1990)). In the presence of background knowledge, we say that a ground rule is applicable in an interpretation It if its body b1θ, ..., bmθ can be logically derived from It and the logic program B. A CPT-theory defines a distribution over possible successor states, P(It+1 | It), in the following way. Let Rt = {r1, ..., rk} denote the set of all ground rules applicable in the current state It. Each ground rule applicable in It will cause one of its head elements to become true in It+1. More formally, a selection σ is a mapping from rules ri to indices ji denoting that head element $h_{i,j_i} \in \mathrm{head}(r_i)$ is selected. In the stochastic process to be defined, It+1 is a possible successor for the state It if and only if there is a selection σ such that $I_{t+1} = \{h_{1,\sigma(1)}, \ldots, h_{k,\sigma(k)}\}$. We say that σ yields It+1 from It, denoted $I_t \xrightarrow{\sigma} I_{t+1}$, and define

$$P(I_{t+1} \mid I_t) = \sum_{\sigma : I_t \xrightarrow{\sigma} I_{t+1}} P(\sigma) = \sum_{\sigma : I_t \xrightarrow{\sigma} I_{t+1}} \prod_{(r_i, j_i) \in \sigma} p_{j_i}, \qquad (1)$$

where $p_{j_i}$ is the probability associated with head element $h_{i,j_i}$ in ri. As for propositional Markov processes, the probability of a sequence I1, ..., IT given an initial state I0 is defined by

$$P(I_1, \ldots, I_T) = P(I_1) \prod_{t=0}^{T} P(I_{t+1} \mid I_t). \qquad (2)$$

Intuitively, it is clear that this defines a distribution over all sequences of interpretations of length T, much as in the propositional case.
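To make the semantics concrete, the following Python sketch enumerates all selections σ over a hand-written set of ground rules applicable in the current state and accumulates the successor-state distribution of Eq. (1); the rules and atoms are illustrative, not taken from the paper's blocks-world theory.

```python
from itertools import product
from collections import defaultdict

# Toy set of ground rules applicable in the current state I_t: each rule is a list of
# (head atom, probability) choices whose probabilities sum to 1 (illustrative only).
applicable_rules = [
    [('on(a,b)', 0.9), ('on(a,floor)', 0.1)],
    [('wet(b)', 0.3), ('dry(b)', 0.7)],
]

def successor_distribution(rules):
    """Enumerate every selection sigma (one head per applicable ground rule) and
    accumulate P(I_{t+1} | I_t) as in Eq. (1)."""
    distribution = defaultdict(float)
    for selection in product(*rules):
        next_state = frozenset(head for head, _ in selection)
        probability = 1.0
        for _, p in selection:
            probability *= p
        distribution[next_state] += probability  # several selections may yield the same state
    return distribution

for state, p in successor_distribution(applicable_rules).items():
    print(sorted(state), p)
```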

3. Experimental Evaluation
The proposed CPT-L model has been evaluated in a stochastic version of the well-known blocks world domain. The domain was chosen because it is truly relational and also serves as a popular artificial world model in agent-based approaches such as planning and reinforcement learning. Furthermore, it is an example of a domain in which multiple aspects of the world can change concurrently; for instance, a block can be moved from A to B while at the same time a stack collapses, spilling all of its blocks on the floor. In an experiment, we explore the convergence behavior of the EM algorithm for CPT-L. The world model is implemented by a (gold-standard) CPT-theory T, and a training set of 20 sequences of length 50 each is sampled from T. From this data, the parameters are re-learned using EM. Figure 1, large graph, shows the convergence behavior of the algorithm on the training data for different numbers of blocks in the domain, averaged over 15 runs. It shows rapid and reliable convergence. Figure 1, small graph, shows the running time of EM as a function of the number of blocks. The scaling behavior is roughly linear, indicating that the model scales well to reasonably large domains. Absolute running times are also low, with about 1 minute per EM iteration in a world with 50 blocks. This is in contrast to other, more expressive modeling techniques, which typically scale badly to domains with many objects. The difference in log-likelihood on an independent test set between the gold-standard model and the learned model was four orders of magnitude smaller than the difference to a random model. Manual inspection of the learned model also shows that parameter values are on average very close to those in the gold-standard model.

4. Conclusions and Future Work
We have introduced CPT-L, a probabilistic model for sequences of relational state descriptions. In contrast to other approaches that address this setting, the focus in CPT-L is on computational efficiency rather than maximal expressivity. The main interesting direction for future work is to further evaluate the representation power and scaling behavior of the model in challenging real-world domains.

References

Bratko, I. (1990). Prolog programming for artificial intelligence. Addison-Wesley. 2nd edition.

Getoor, L., & Taskar, B. (Eds.). (2007). Statistical relational learning. The MIT Press.

Vennekens, J., Denecker, M., & Bruynooghe, M. (2006). Representing causal information about a probabilistic process. Logics in Artificial Intelligence (pp. 452–464).

Zettlemoyer, L. S., Pasula, H., & Kaelbling, L. P. (2005). Learning planning rules in noisy stochastic worlds. AAAI-05 (pp. 911–918).


Learning Inclusion-Optimal Chordal Graphs

Vincent Auvray

Louis Wehenkel

GIGA-R and Department of Electrical Engineering and Computer Science, University of Liège, Belgium

Abstract

This abstract discusses a very simple and efficient algorithm to learn the chordal structure of a probabilistic model from data. The algorithm is a greedy hill-climbing search algorithm that uses the inclusion boundary neighborhood over chordal graphs. In the limit of a large sample size and under appropriate hypotheses on the scoring criterion, the algorithm will find a structure that is inclusion-optimal when the dependency model of the data-generating distribution can be perfectly represented by an undirected graph.

1. Introduction

This abstract considers the class of graphical models whose structure is a chordal graph, known as the class of decomposable models. A chordal graph is an undirected graph (UG) where every cycle comprising more than three edges has a chord. The class of dependency models defined by chordal graphs is the intersection of the class of directed acyclic graph (DAG) dependency models and the class of UG dependency models.

A greedy hill-climbing search algorithm is often used to learn the DAG structure of a Bayesian network. Different choices of search spaces and neighborhoods connecting the search space are possible. In particular, the search may proceed over the set of Markov equivalence classes of DAG structures by exploiting the inclusion boundary neighborhood (see (Chickering, 2002; Auvray & Wehenkel, 2002)). Under appropriate assumptions on the scoring criterion and on the data-generating distribution, a greedy algorithm using the inclusion boundary neighborhood returns an inclusion-optimal structure in the limit of a large sample size (see (Chickering & Meek, 2002) and (Castelo & Kočka, 2003)). Unfortunately, the size of the inclusion boundary of an equivalence class of a DAG structure is in the worst case exponential in the number of variables. The notion of inclusion boundary neighborhood

can also be defined over sets of chordal graphs. In this context, its size is bounded from above by the square of the number of variables and it can be computed easily.

In this abstract, we investigate the optimality properties of the greedy hill-climbing search algorithm using the inclusion boundary neighborhood to learn a chordal structure. As mentioned above in the case of DAG structures, a desirable property of a structure learning algorithm is to return an inclusion-optimal solution. We describe a local asymptotic consistency property of scoring criteria that ensures that a greedy search will produce an inclusion-optimal chordal structure when the independence relations holding in the data-generating distribution can be represented exactly by an undirected graph. For more details and omitted proofs, see (Auvray & Wehenkel, 2008).

2. Background

Consider an undirected graph G whose vertex set X is a set of random variables. Given disjoint sets A, B, C ⊆ X, we say that A and B are separated by C in G if all paths between a vertex in A and a vertex in B go through at least one vertex in C. The dependency model encoded by G consists of the marginal and conditional independence relations A ⊥ B | C such that A and B are separated by C in G. In the sequel, we sometimes identify an undirected graph and its dependency model.

Let us define the notion of inclusion-optimality for chordal graphs. Consider a particular dependency model M0. We say that a chordal dependency model M is inclusion-optimal for M0 if M0 ⊆ M and there is no chordal dependency model M′ such that M0 ⊆ M′ ⊂ M. This notion has a simple graphical interpretation: a chordal graph G encodes an inclusion-optimal dependency model for M0 if, and only if, (a) it does not encode any independence assumption that does not hold in M0 and (b) every chordal subgraph of G encodes such an incorrect independence assumption.


Let us present the notion of inclusion boundary for chordal graphs. The inclusion boundary of a chordal graph G is the set of chordal graphs H satisfying

• I(G) ⊆ I(H) and there is no chordal graph K such that I(G) ⊂ I(K) ⊂ I(H), or

• I(H) ⊆ I(G) and there is no chordal graph K such that I(H) ⊂ I(K) ⊂ I(G),

where I(C) denotes the dependency model (i.e. the set of conditional independencies) encoded by the chordal graph C. It is straightforward to describe graphically the inclusion boundary of a chordal graph G: it consists of the chordal graphs that differ from G by the addition or removal of a single edge. This is a consequence of the fact that, for any two chordal graphs G, H such that H is a subgraph of G, there exists a sequence of chordal graphs K0, ..., Kn such that K0 = H, Kn = G and Ki+1 is obtained from Ki by adding a single edge (see (Giudici & Green, 1999)).
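A minimal Python sketch of the resulting greedy search, assuming networkx for the chordality test and a user-supplied scoring criterion (e.g. an asymptotically consistent score such as the BIC of the fitted decomposable model); both the score and the first-improvement strategy are illustrative choices, not the authors' implementation.

```python
import itertools
import networkx as nx

def inclusion_boundary(graph):
    """Chordal graphs differing from `graph` by the addition or removal of a single edge."""
    for u, v in itertools.combinations(list(graph.nodes), 2):
        candidate = graph.copy()
        if candidate.has_edge(u, v):
            candidate.remove_edge(u, v)
        else:
            candidate.add_edge(u, v)
        if nx.is_chordal(candidate):
            yield candidate

def greedy_chordal_search(variables, score):
    """Greedy hill-climbing over chordal structures using the inclusion boundary
    neighborhood; `score` is a placeholder for a (locally) consistent scoring criterion."""
    current = nx.empty_graph(variables)          # the empty graph is chordal
    current_score = score(current)
    improved = True
    while improved:
        improved = False
        for neighbor in inclusion_boundary(current):
            s = score(neighbor)
            if s > current_score:
                current, current_score, improved = neighbor, s, True
                break                             # first-improvement hill climbing
    return current
```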

3. Inclusion-optimality of greedy search

Following the terminology of (Chickering & Meek, 2002), we say that a scoring criterion score(·) for chordal graphs is locally consistent for a dependency model I if, for all vertices a, b and chordal graphs G, H such that H is obtained from G by removing a − b, we have

1. $a \perp b \mid ne_G(a) \cap ne_G(b) \in I \Rightarrow \mathrm{score}(H) > \mathrm{score}(G)$,

2. $a \perp b \mid ne_G(a) \cap ne_G(b) \notin I \Rightarrow \mathrm{score}(G) > \mathrm{score}(H)$,

where neK(a) denotes the set of neighboring (i.e. adjacent) vertices of a in K.

Recall that a scoring criterion score(·) for a DAG dependency model encoded by G is decomposable if it can be written as a sum of functions that each depend on only one vertex and its parents, i.e.

$$\mathrm{score}(G) = \sum_{v \in V} f\big(v, pa_G(v)\big).$$

The following proposition holds.

Proposition 1. If score(·) is a scoring criterion over DAG dependency models that is decomposable and consistent for a dependency model I, then it is locally consistent for I when restricted to chordal graphs and

$$\mathrm{score}(G) - \mathrm{score}(H) = f\big(b, \{a\} \cup (ne_G(a) \cap ne_G(b))\big) - f\big(b, ne_G(a) \cap ne_G(b)\big),$$

for chordal graphs G and H such that H is obtained from G by removing the edge a − b.

In practice, scoring criteria over DAG dependency models only satisfy the consistency property asymptotically, in the limit of a large sample size. When restricted to chordal dependency models, such scoring criteria will only be locally consistent asymptotically.

The main result of this abstract can now be stated.

Proposition 2. If score(·) is a scoring criterion for chordal graphs that is consistent and locally consistent for a graph-isomorph dependency model I, then the local optima of score(·) w.r.t. the inclusion boundary neighborhood are inclusion-optimal for I.

Acknowledgments

This work presents research results of the Belgian Network BIOMAGNET funded by the Interuniversity Attraction Poles Programme. Vincent Auvray is supported by the "Action de recherche concertée" BIOMOD funded by the French-speaking Community of Belgium.

References

Auvray, V., & Wehenkel, L. (2002). On the construction of the inclusion boundary neighbourhood for Markov equivalence classes of Bayesian network structures. Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (pp. 26–35). Morgan Kaufmann.

Auvray, V., & Wehenkel, L. (2008). Learning inclusion-optimal chordal graphs. Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence. (to appear).

Castelo, R., & Kočka, T. (2003). On inclusion-driven learning of Bayesian networks. Journal of Machine Learning Research, 4, 527–574.

Chickering, D., & Meek, C. (2002). Finding optimal Bayesian networks. Proceedings of the 18th Annual Conference on Uncertainty in Artificial Intelligence (UAI-02) (pp. 94–102). San Francisco, CA: Morgan Kaufmann.

Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 507–554.

Giudici, P., & Green, P. J. (1999). Decomposable graphical Gaussian model determination. Biometrika, 86, 785–801.


Density estimation with ensembles of randomized poly-trees

Sourour Ammar 1,2, Philippe Leray 1, Boris Defourny 3, Louis Wehenkel 3

1 Laboratoire d'Informatique de Nantes Atlantique (LINA) UMR 6241, Ecole Polytechnique de l'Université de Nantes, France
2 Laboratoire d'Informatique, Traitement de l'Information et des Systèmes (LITIS) EA 4108, Institut National des Sciences Appliquées de Rouen, France
3 Department of Electrical Engineering and Computer Science & GIGA-Research, University of Liège, Belgium

1. Motivation

Learning of Bayesian networks aims at modeling the joint density of a set of random variables from a random sample of joint observations of these variables (Naïm et al., 2007). Such a graphical model may be used for elucidating the conditional independences holding in the data-generating distribution, for automatic reasoning under uncertainty, and for Monte-Carlo simulations. Unfortunately, currently available algorithms for Bayesian network structure learning are either restrictive in the kind of distributions they search for, or of too high computational complexity to be applicable in high-dimensional spaces.

Ensembles of weakly fitted randomized models have been studied intensively and used successfully in the supervised learning literature during the last two decades. Among the advantages of these methods, let us quote the improved scalability of their learning algorithms thanks to randomization and the improved predictive accuracy of the induced models thanks to their higher flexibility in terms of bias/variance trade-off. For example, ensembles of extremely randomized trees have been applied successfully to very complex high-dimensional tasks, such as image and sequence classification (Geurts et al., 2006).

In this work we explore the Perturb and Combine idea celebrated in supervised learning in the context of probability density estimation in high-dimensional spaces. We propose a new family of unsupervised learning methods of mixtures of large ensembles of randomly generated poly-trees. The specific feature of these methods is their scalability to very large numbers of variables and training instances. We explore various variants of these methods empirically on a set of discrete test problems of growing complexity.

2. Methods

2.1. Poly-Tree density models

Let X = {X1, ..., Xn} denote a finite set of discrete random variables.

A poly-tree model P for the density over X is defined by a directed acyclic graph whose skeleton is acyclic and connected and whose set of vertices is in bijection with X, together with a set of conditional densities PP(Xi | paP(Xi)), where paP(Xi) denotes the set of variables in bijection with the parents of Xi in P. It represents graphically the density factorization

$$P_P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P_P(X_i \mid pa_P(X_i)). \qquad (1)$$

Poly-tree models can be used for probabilistic inference over P(X1, ..., Xn) with a computational complexity linear in the number of variables n (Pearl, 1986).

One can define nested subclasses of poly-tree density models by imposing constraints on the maximum number p of parents of any node. In these subclasses, not only inference but also parameter learning is of linear complexity in the number of variables. The smallest such subclass is called the tree subspace, in which nodes have exactly one parent (p = 1).

2.2. Mixture models of poly-trees

A mixture model of m poly-tree models (P1, ..., Pm) is defined as a convex combination of the elementary poly-tree models, i.e.

$$P_M(X_1, \ldots, X_n) = \sum_{i=1}^{m} \mu_i P_{P_i}(X_1, \ldots, X_n), \qquad (2)$$

where $\mu_i \in [0, 1]$ and $\sum_{i=1}^{m} \mu_i = 1$.


While single poly-tree models impose restrictions on the kind of densities they can faithfully represent, mixtures of poly-trees are universal approximators.

2.3. Learning a random mixture from data

Let X be a set of discrete random variables, and $D = (x^1, \cdots, x^d)$ be a sample of joint observations $x^i = (x^i_1, \cdots, x^i_n)$ drawn from some data-generating distribution PG(X). Let P be the space of all possible poly-tree graphical structures defined over X.

Our generic procedure for generating a random mixture of poly-tree models from D is described by Algorithm 1; it receives as inputs X, D, m, and three procedures DrawPolytree, LearnPars, and ComputeWeight.

Algorithm 1 (Learning random poly-tree mixtures)

1. Repeat for i = 1, ..., m:
   (a) Pi = DrawPolytree(P)
   (b) For j = 1, ..., n: PPi(Xj | paPi(Xj)) = LearnPars(Pi, Xj, D)
   (c) μi = ComputeWeight(Pi, D, m)

2. Return $\left(\mu_i, \big(P_{P_i}(X_j \mid pa_{P_i}(X_j))\big)_{j=1}^{n}\right)_{i=1}^{m}$.
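As an illustration, the following Python sketch instantiates Algorithm 1 for the tree subclass (p = 1) with uniform weights: DrawPolytree is replaced by a simple random directed tree generator (a simplification of uniform poly-tree sampling), LearnPars by maximum a posteriori estimation with a symmetric Dirichlet prior, and ComputeWeight by 1/m. All names and parameter values are illustrative.

```python
import random
from collections import Counter

def draw_random_tree(n):
    """Random directed tree over variables 0..n-1: node 0 is the root and every other
    node receives a random already-placed node as parent (a simplification of
    uniform poly-tree sampling)."""
    parents = {0: None}
    for i in range(1, n):
        parents[i] = random.randrange(i)
    return parents

def learn_params(parents, data, arities, alpha=1.0):
    """Maximum a posteriori conditional probability tables with a symmetric
    Dirichlet (Laplace-style) prior of strength alpha."""
    cpts = {}
    for j, pa in parents.items():
        counts = Counter()
        for x in data:
            counts[(x[pa] if pa is not None else None, x[j])] += 1
        parent_values = range(arities[pa]) if pa is not None else [None]
        cpt = {}
        for pv in parent_values:
            total = sum(counts[(pv, v)] for v in range(arities[j])) + alpha * arities[j]
            for v in range(arities[j]):
                cpt[(pv, v)] = (counts[(pv, v)] + alpha) / total
        cpts[j] = cpt
    return cpts

def mixture_density(x, components, weights):
    """Evaluate Eq. (2): a convex combination of the tree factorizations of Eq. (1)."""
    total = 0.0
    for (parents, cpts), mu in zip(components, weights):
        p = 1.0
        for j, pa in parents.items():
            p *= cpts[j][(x[pa] if pa is not None else None, x[j])]
        total += mu * p
    return total

# Toy usage: n binary variables, d samples, a mixture of m random trees with uniform weights.
random.seed(0)
n, d, m = 5, 200, 20
arities = [2] * n
data = [tuple(random.randrange(2) for _ in range(n)) for _ in range(d)]
components = [(p, learn_params(p, data, arities)) for p in (draw_random_tree(n) for _ in range(m))]
weights = [1.0 / m] * m
print(mixture_density(data[0], components, weights))
```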

3. Experiments and preliminary results

In (Ammar et al., 2008) we report some first results with the above algorithm applied to datasets of size d = 1000 generated from discrete distributions with n = 8, which could be faithfully represented by a chain, a single tree, or a single poly-tree model.

In these simulations we have considered two different instances of DrawPolytree, namely a uniform draw over the class P of all poly-trees, and a uniform draw over the subclass P1 of trees. In order to achieve this for m ∈ {1, 2, ..., 1000}, we have used efficient algorithms for sampling trees given in (Quiroz, 1989).

For parameter learning, we used maximum a posteriori values given the dataset and structure, while assuming non-informative priors on the parameters. Concerning the μi, we used a uniform weighting strategy, i.e. ComputeWeight(Pi, D, m) = 1/m.

Overall, these results showed that the quality of the mixture models converges rather rapidly (i.e. for m ≈ 20), and that the poly-tree mixtures were slightly superior when targeting poly-tree data-generating distributions, while the mixtures of trees were superior in the other two cases. We also observed a slightly non-monotonic behavior of the model quality with growing values of m, which we suspect to be related to the uniform weighting scheme.

In the immediate future, we will carry out further, more systematic experiments on larger problems, spanning different versions of the algorithm.

In particular, we will consider non-uniform weighting schemes, by exploiting the score obtained for a given structure and dataset so as to downweight structures that fit the data-generating distribution less well. We will also consider sampling from the spaces Pp of poly-trees with a bounded number of parents.

Experiments will be made over a richer set of data-generating distributions, in particular ones that cannot be represented faithfully by a single poly-tree model. For instance, we will consider general directed acyclic graph models as data-generating distributions.

We will compare our algorithm in terms of sample and computational efficiency with Bayesian network structure learning and with algorithms targeting an optimal mixture of tree models (Meila-Predoviciu, 1999).

Subsequently, we plan to extend our approach to handle continuous variables and incomplete datasets.

Acknowledgments

This work presents research results of the Belgian Network BIOMAGNET (Bioinformatics and Modeling: from Genomes to Networks), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office.

References

Ammar, S., Leray, P., & Wehenkel, L. (2008). Estimation de densité par ensembles aléatoires de poly-arbres. Proceedings of JFRB.

Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63, 3–42.

Meila-Predoviciu, M. (1999). Learning with mixtures of trees. Doctoral dissertation, MIT.

Naïm, P., Wuillemin, P.-H., Leray, P., Pourret, O., & Becker, A. (2007). Réseaux bayésiens. Paris: Eyrolles. 3rd edition.

Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29, 241–288.

Quiroz, A. (1989). Fast random generation of binary, t-ary and other types of trees. Journal of Classification, 6, 223–231. Available at http://ideas.repec.org/a/spr/jclass/v6y1989i1p223-231.html.


Evolutionary Learning of Feature Fusion Hierarchies

Fabien Scalzo

Division of Neurosurgery, Geffen School of Medicine, University of California, Los Angeles, CA, USA

George Bebis, Mircea Nicolescu, Leandro Loss

Computer Vision Lab, Department of Computer Science, University of Nevada, Reno, NV, USA

Abstract

We present a hierarchical feature fusion model for image classification that is constructed by an evolutionary learning algorithm. The model has the ability to combine local patches whose location, width and height are automatically determined during learning. The representational framework takes the form of a two-level hierarchy which combines feature fusion and decision fusion into a unified model. The structure of the hierarchy itself is constructed automatically during learning to produce optimal local feature combinations. A comparative evaluation of different classifiers is provided on a challenging gender classification image database. It demonstrates the effectiveness of these Feature Fusion Hierarchies (FFH).

1. Introduction

The generalization of new image acquisition devices and the development of new feature extractors have recently increased the interest in combining complementary modalities or features to perform automatic image classification. Hierarchical approaches (Singh et al., 2008; Podolak, 2008; Kim & Oh, 2008) to image classification are particularly interesting for solving complex problems because they are capable of decomposing them into tasks that are often easier to tackle. However, these approaches often tend to manually define the structure of their hierarchy depending on the features involved (Tan & Triggs, 2007), and can only exploit a limited number of features. The current paper addresses these problems by presenting a framework that performs gender classification based on a large set of features extracted from facial images. The structure of the model as well as its parameters are estimated by a genetic learning algorithm that explores the space of possible hierarchies.

2. Feature Fusion Hierarchies

Feature Fusion Hierarchies (FFH) address the problem of fusing high-dimensional registered feature sets for image classification. The representational framework takes the form of a two-level hierarchy which combines local feature fusion and decision fusion into a unified model (Figure 1).

Given a feature set I(x, y, f), where (x, y) denotes a position in the image and f is a feature, the feature fusion level is defined as a set of compound features C. Each compound feature Ci combines a subset of features fCi ∈ f over a local window θCi. This fusion is done using a dimensionality reduction technique, denoted $R_i(I_{f_{C_i}, \theta_{C_i}})$, learned in a supervised way (e.g. LDA). A key property of this function Ri is to operate locally in the sense that it exploits local adaptive windows (Scalzo & Piater, 2007) whose parameters θCi = {x, y, Sx, Sy} are automatically adjusted during learning (position in the image (x, y), width Sx and height Sy). The output of the function $S_i = R_i(I_{f_{C_i}, \theta_{C_i}})$ corresponds to a lower-dimensionality response vector. An additional classifier is learned on top of the first level to form the second level D corresponding to the decision fusion. Its input data correspond to the compound feature outputs {S1, S2, ..., Sn} merged into a single vector S.

3. Learning of Fusion Hierarchies

A canonical genetic algorithm is used to explore the space of possible hierarchies (both the structure and the parameters are estimated). The optimal solution is the one that offers the best classification rate on the validation data and minimizes the number of features used as well as the size of the patches.

3.1. Genome Representation

Each evolving genome in the population is represented as a binary vector encoding the structure and the parameters of a specific hierarchy (Fig. 2).



Figure 1. Overview of a Feature Fusion Hierarchy (FFH).

A genome defines the hierarchy as a set of Nc combinations Ci. The structure of Ci corresponds to the subset of features that are combined, fCi = {f1, ..., fn}, whereas its parameters define the location (x, y) and size (Sx, Sy) of the local window in the image on which the fusion is performed.

Given n features at the first level, the structural part is represented as an n-length binary vector encoding the presence of the features in the combination. For the parameter part, the variables {x, y, Sx, Sy} are each represented as a b-bit vector.


Figure 2. A Feature Fusion Hierarchy made of four compound features (c1, c2, c3, c4) is encoded into a genome. The structural part and the parameters are embedded into a single binary vector.

3.2. Fitness, Crossover and Mutation

The fitness function fit(g) is used to evaluate each individual in the population. It is set proportional to the classification rate r of the genome-encoded hierarchy g, $fit(g) = r(g) + \alpha_1 n + \alpha_2 s^{-1}$, where n is the number of zeros in the structure part of the genome g and s is the total area covered by the patches. Parameters α1 and α2 are used respectively to favor combinations that use fewer features and that are defined over a smaller window. A bi-parental random crossover and a single-point mutation operator are used in our algorithm to produce new individuals.
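A small Python sketch of this fitness evaluation under an assumed genome layout (per compound feature: n structure bits followed by four b-bit window parameters); the layout, the classification_rate callable and the α values are illustrative, not the authors' exact encoding.

```python
def decode_genome(genome, n_features, bits_per_param=8, n_combinations=4):
    """Split a flat binary genome into compound features: for each combination,
    n_features structure bits followed by four bits_per_param-bit integers
    encoding the window (x, y, Sx, Sy). The layout is assumed for illustration."""
    chunk = n_features + 4 * bits_per_param
    combinations = []
    for c in range(n_combinations):
        g = genome[c * chunk:(c + 1) * chunk]
        features = [i for i, bit in enumerate(g[:n_features]) if bit]
        window = []
        for k in range(4):
            bits = g[n_features + k * bits_per_param:n_features + (k + 1) * bits_per_param]
            window.append(int(''.join(map(str, bits)), 2))
        combinations.append({'features': features, 'window': tuple(window)})
    return combinations

def fitness(genome, classification_rate, n_features, alpha1=1e-3, alpha2=1.0):
    """fit(g) = r(g) + alpha1 * n + alpha2 / s, where n counts the zero structure bits
    (unused features) and s is the total window area; `classification_rate` is a
    placeholder returning the validation accuracy of the decoded hierarchy, and the
    alpha values are illustrative."""
    combos = decode_genome(genome, n_features)
    n_zeros = sum(n_features - len(c['features']) for c in combos)
    area = sum(c['window'][2] * c['window'][3] for c in combos) or 1
    return classification_rate(combos) + alpha1 * n_zeros + alpha2 / area
```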

4. Experiments

The effectiveness of the proposed framework is evaluated on a gender classification problem. Given a set of 400 facial images (Sun et al., 2002) captured under various conditions, the task is to correctly identify the gender of the subject present in the image. Each image is convolved with 35 Gabor filters and 5 Laplacian filters to produce the initial feature set on which our Feature Fusion Hierarchies (FFH) are constructed. The classification results after a three-fold cross-validation are reported in Table 1 for LDA, SVM and KSR classifiers. It can be observed that the use of the Feature Fusion Hierarchies (FFH) significantly reduces the classification error of a PCA-based framework and outperforms the results obtained by the GA-PCA approach (Sun et al., 2002). This can be explained by the fact that our FFH approach exploits local features whereas GA-PCA computes the projections on the entire image.

                              LDA     SVM    KSR
PCA                           14.2%   8.9%   -
GA-PCA (Sun et al., 2002)     9%      4.7%   -
FFH (this paper)              7.2%    -      3.8%

Table 1. Results for three different classifiers are reported for PCA, GA-PCA (Sun et al., 2002) and the Feature Fusion Hierarchies (FFH).

References

Kim, Y., & Oh, I. (2008). Classifier ensemble selection using hybrid genetic algorithms. Pattern Recogn. Lett., 29, 796–802.

Podolak, I. T. (2008). Hierarchical classifier with overlapping class groups. Expert Syst. Appl., 34, 673–682.

Scalzo, F., & Piater, J. (2007). Adaptive patch features for object class recognition with learned hierarchical models. Beyond Patches.

Singh, R., Vatsa, M., & Noore, A. (2008). Hierarchical fusion of multi-spectral face images for improved recognition performance. Information Fusion, 9, 200–210.

Sun, Z., Bebis, G., Yuan, X., & Louis, S. J. (2002). Genetic feature subset selection for gender classification: A comparison study. IEEE International Conference on Image Processing.

Tan, X., & Triggs, B. (2007). Fusing Gabor and LBP feature sets for kernel-based face recognition. Analysis and Modeling of Faces and Gestures (pp. 235–249).


Using decision trees to build an event recognition framework for automated visual surveillance

Cedric Simon, Christophe De Vleeschouwer

Communication and Remote Sensing Lab, UCL, Louvain-La-Neuve, Belgium

Jerome Meessen

Multitel, Mons, Belgium

Abstract

This paper presents a classifier-based approach to recognize possibly sophisticated events in video surveillance. The aim of this work is to propose a flexible and generic event recognition system that can be used in a real-world context. Our system uses the ensemble of randomized trees procedure to model each event as a sequence of structured activity patterns, without using any tracking method. Experimental results demonstrate the robustness of the system toward artifacts and passers-by, and the effectiveness of its framework for event recognition applications in visual surveillance.

1. Overview

The growing number of cameras in public and private areas increases the interest of the image processing community in automated visual surveillance systems. Nonetheless, there is still a broad gap between what automated systems offer and the actual needs of the industry (Dee & Velastin, 2007). The system we propose aims at reducing this gap by using a coherent framework that links local and coarse features with meaningful concepts. Those local features are attributes representing the moving objects (the blobs) in the scene, at each frame. No tracking procedure nor intermediate reasoning (e.g. about occlusions) is needed, which leads to greater genericity. The framework we propose (Figure 1) relies on two main parts that are both based on the ensemble of randomized trees concept (Geurts et al., 2006).

In the first stage (section 2), sophisticated features are built by clustering the blobs according to their feature values at each frame. Those clusters are tagged,

Figure 1. Overview of the system (I: preprocessing of the video sequences, II: supervised clustering, III: supervised learning producing the event label).

and each tag is then associated with the appropriate frames in the video sequences. In the second stage (section 3), the events are classified based on the temporal distribution of those tags in the sequences. The main idea is to discriminate event classes by investigating the temporal relationships between the tags, for example by asking if there is a blob of one type before a blob of another type. This framework is inspired by Geman's work in (Amit & Geman, 1997).

2. Construction of elaborated features and definition of tags

From the blob’s features, we adopt an information-theoretic approach and cluster similar blobs by usingthe decision tree methodology. Two sets of clusteringtrees are built according to the type of attributes usedby the trees:


• In the first pool, attributes are based on a blob's features (i.e. its position, size and velocity)

• In the second pool, attributes are based on pairs of blobs' features (i.e. the distance between the blobs and their relative velocity). For efficiency, from all pairs of blobs in each frame, we keep only pairs having a relatively short distance between the blobs.

At each node of a tree, a question is selected by choosing one attribute, i.e. by picking one feature and one threshold that reduce as much as possible the class entropy within the node. Thereby, each blob (or pair of blobs) inherits the event class of the whole sequence it belongs to.

Once the trees are built, we tag each node of the trees, except for the root node. Hence, by definition, each tag corresponds to a specific combination of coarse and local features that characterizes each frame of the video sequences. Diversity of the tags can then be boosted by increasing the number of clustering trees NCT, while more specificity is obtained by increasing the depth of the trees. For instance, we observed in our experiments that the system achieved the best performance for NCT ≈ 10. This supervised clustering process allows each tag to be more discriminant regarding the event classes, while using only local information (each frame individually).

3. Modeling the spatio-temporal events with randomized trees

Once the elaborated features are computed, we use another set of randomized trees to model and classify the video sequences, based on the spatio-temporal arrangement of those advanced features. This is done by defining the two output branches of a tree node based on the presence/absence of a specific temporal arrangement A of tags in the video sequence. In order for these arrangements to be scale invariant (and better fit the notion of a visual surveillance event), coarse binary relations like "before", "after" and "at the same time" are used. An example of an arrangement would then be: tag x exists "before" tag y which exists "at the same time" as tag z.

At the root node Nr, a tag Tx from the full set T of tags is chosen. The question (Q0) is "Does Tx exist in the training sequences?". Those training samples for which QNr = 0 are in the "no" child node and we search again through T. Those samples for which QNr = 1 are in the "yes" child node and we then put Tx among the participating tags Tp. Further questions can be:

$$\exists T_i?\qquad \exists T_i \star T_j?\qquad \exists T_{j_x} \star T_{j_y}?\qquad \text{with } T_i \in T,\ T_j, T_{j_x}, T_{j_y} \in T_p,\ \star \in \{\prec, \mid, \succ\} \qquad (1)$$

with ≺ meaning 'before', | 'during' and ≻ 'after'.
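The sketch below illustrates, in Python, how such coarse temporal questions can be checked on a sequence represented as a list of per-frame tag sets; the representation and the example tags are illustrative.

```python
def tag_frames(sequence, tag):
    """Frame indices at which `tag` occurs; a sequence is a list of per-frame tag sets."""
    return [t for t, tags in enumerate(sequence) if tag in tags]

def before(sequence, tag_a, tag_b):
    """Tag_a 'before' tag_b: some occurrence of tag_a strictly precedes one of tag_b."""
    a, b = tag_frames(sequence, tag_a), tag_frames(sequence, tag_b)
    return bool(a) and bool(b) and min(a) < max(b)

def same_time(sequence, tag_a, tag_b):
    """Tag_a 'at the same time' as tag_b: the two tags co-occur in at least one frame."""
    a, b = set(tag_frames(sequence, tag_a)), set(tag_frames(sequence, tag_b))
    return bool(a & b)

# Toy node question: does tag 'x' exist before a frame where 'y' and 'z' co-occur?
sequence = [{'x'}, {'x', 'w'}, {'y', 'z'}, {'y'}]
co_occurrence_frames = [t for t, tags in enumerate(sequence) if {'y', 'z'} <= tags]
print(bool(tag_frames(sequence, 'x')) and bool(co_occurrence_frames)
      and min(tag_frames(sequence, 'x')) < max(co_occurrence_frames))
```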

4. Experiments and perspective

The described system was tested on two scenarios. The first one used simulated video surveillance events, while the second is a real-case scenario, which occurs in the entrance lobby of a public company. More information about those datasets can be found on the web.1 The results are encouraging for our framework, even if passers-by or bystanders significantly decrease the accuracy.

Further improvements could analyze how to use histograms of similarity between the blobs to enrich our elaborated features and gain some robustness, along with features characterizing each frame more globally (i.e. the number of blobs in the frame, the density of the moving objects in each frame, the average/variance of each feature, ...). We will also focus on using an active learning procedure in order to interact with the user, and to be able to enhance the system online.

References

Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9, 1545–1588.

Dee, H., & Velastin, S. (2007). How close are we to solving the problem of automated visual surveillance? A review of real-world surveillance, scientific progress and evaluative mechanisms. Machine Vision and Applications, Special Issue on Video Surveillance Research in Industry and Academia. Springer.

Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63, 3–42.

1http://www.tele.ucl.ac.be/~csimon/trajectories.html.


Content-based Image Retrieval by Indexing Random Subwindows with Randomized Trees

Raphael Maree

GIGA Bioinformatics, University of Liege, GIGA Tower B34 (+1), Avenue de l’Hopital, 1, 4000 Liege, Belgium

Pierre Geurts, Louis Wehenkel

Systems and Modeling, Department of Electrical Engineering and Computer Science & GIGA Research, University of Liege, Institut Montefiore B28, Grande Traverse 10, 4000 Liege, Belgium

Abstract

We propose a method for content-based image retrieval which exploits the similarity measure and indexing structure of totally randomized tree ensembles induced from a set of subwindows randomly extracted from a sample of images. The approach is quantitatively evaluated on various types of images with good results despite its conceptual simplicity and computational efficiency.

1. Introduction

In content-based image retrieval (CBIR), users want to retrieve images that share some similar visual elements with a query image, without any further text description, neither for images in the reference database nor for the query image (see Figure 1). To be practically valuable, a CBIR method should combine computer vision techniques that derive rich image descriptions with efficient indexing structures.

Figure 1. Illustration of the goal of a CBIR system for one query image (left) from the IRMA-2005 dataset.

Our method for CBIR is an extension of the method we proposed for image classification (Maree et al., 2005) and was published in (Maree et al., 2007), where more formal notations and experiment details are given.

2. Method

The different steps of our algorithm are now described.

2.1. Extraction of random subwindows

Square patches of random sizes are extracted at random locations in images, resized by bilinear interpolation to a fixed size (16 × 16), and described by HSV values (resulting in 768-dimensional feature vectors) for color images, or gray intensities (resulting in 256-dimensional feature vectors) for graylevel images. This provides a rich representation of images corresponding to various overlapping regions, both local and global, whatever the task and content of images. Using raw pixel values as descriptors avoids discarding potentially useful information while being generic and fast.
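A minimal Python sketch of this extraction step using NumPy and Pillow; the window count follows no prescription from the text and is illustrative, while the 16 × 16 output size and the HSV description follow the paragraph above. Function names are hypothetical.

```python
import numpy as np
from PIL import Image

def extract_random_subwindows(img, n_windows=100, out_size=16, rng=None):
    """Extract square patches of random size at random positions, resize them to
    out_size x out_size by bilinear interpolation, and return raw pixel descriptors
    (768 values for an HSV image, 256 for a graylevel image)."""
    rng = rng or np.random.default_rng()
    w, h = img.size
    descriptors = []
    for _ in range(n_windows):
        side = rng.integers(1, min(w, h) + 1)              # random square size
        x = rng.integers(0, w - side + 1)
        y = rng.integers(0, h - side + 1)
        patch = img.crop((x, y, x + side, y + side)).resize(
            (out_size, out_size), Image.BILINEAR)
        descriptors.append(np.asarray(patch, dtype=np.float32).ravel())
    return np.vstack(descriptors)

# Example (file name is illustrative):
# windows = extract_random_subwindows(Image.open("query.png").convert("HSV"))
```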

2.2. Indexing subwindows with totally randomized trees

We use ensembles of totally randomized trees (Geurts et al., 2006) for indexing the extracted random subwindows. The method recursively partitions the training sample of subwindows by randomly generated tests. Each test is chosen by selecting a random pixel component and a random cut-point in the range of variation of the pixel component in the subset of subwindows associated to the node to split. The development of a node is stopped as soon as either all descriptors are constant in the leaf or the number of subwindows in the leaf is smaller than a predefined number nmin. A number T of such trees are grown from the training sample. To be able to later perform image retrieval, at each leaf of each tree we record, for each image of the reference set that appears in the leaf, the number of its subwindows that have reached this leaf.


2.3. Inducing image similarities from tree ensembles

We first defined a similarity measure between any two subwindows by considering that two subwindows are very similar if they fall in a same leaf that has a very small subset of training subwindows. Then, we defined the similarity induced by an ensemble of T trees by considering that two subwindows are similar if they are considered similar by a large proportion of the trees. Given this similarity measure between subwindows, we derived a similarity between two images that is the average similarity between all pairs of their subwindows.

2.4. Image retrieval

The similarities between the query image and all reference images can then be computed by propagating into each tree all subwindows from the query image, and by incrementing, for each of its subwindows, each tree, and each reference image, the similarity between the query and reference images. It is then possible to rank the reference images according to their similarity to the query image, as illustrated by Figure 2.

Figure 2. Ranking of the reference images according to their similarity to the query image.
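The following Python sketch illustrates this retrieval step. It uses scikit-learn's RandomTreesEmbedding as a stand-in for the totally randomized trees and a simple 1/leaf-size weighting, which approximates (but is not identical to) the similarity measure defined above; all names and parameter values are illustrative.

```python
import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomTreesEmbedding

def build_index(X_ref, ref_image, n_trees=10, min_leaf=4):
    """X_ref: descriptors of all reference subwindows; ref_image: image id per subwindow.
    Fit the trees and record, per tree and leaf, how many subwindows of each
    reference image reached that leaf."""
    trees = RandomTreesEmbedding(n_estimators=n_trees, min_samples_leaf=min_leaf,
                                 max_depth=None).fit(X_ref)
    leaves = trees.apply(X_ref)                             # (n_subwindows, n_trees) leaf ids
    index = [defaultdict(lambda: defaultdict(int)) for _ in range(n_trees)]
    leaf_size = [defaultdict(int) for _ in range(n_trees)]
    for i in range(leaves.shape[0]):
        for t in range(n_trees):
            index[t][leaves[i, t]][ref_image[i]] += 1
            leaf_size[t][leaves[i, t]] += 1
    return trees, index, leaf_size

def query_scores(X_query, trees, index, leaf_size):
    """Accumulate per-reference-image similarity by propagating the query subwindows;
    subwindows falling in small leaves contribute more (a simplified weighting)."""
    scores = defaultdict(float)
    leaves = trees.apply(X_query)
    for i in range(leaves.shape[0]):
        for t in range(leaves.shape[1]):
            leaf = leaves[i, t]
            for image_id, count in index[t][leaf].items():
                scores[image_id] += count / leaf_size[t][leaf]
    return sorted(scores.items(), key=lambda kv: -kv[1])    # ranked reference images
```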

3. Results

We performed a quantitative evaluation of our method in terms of its retrieval accuracy on various image datasets with ground-truth labels. These databases contain images representing various buildings (ZuBuD), objects and scenes (META and UkBench), and radiographs (IRMA). In our experiments, we consider that an image is relevant to a query if it is of the same class as the query image, and irrelevant otherwise. Note that, while using class labels to assess accuracy, this information is not used during the

indexing phase. Results are shown in Table 1.

Dataset          #images ls/ts     Accuracy
ZuBuD            1005/115          96.52%
IRMA-2005        9000/1000         85.4%
UkBench          10200/10200       75.25%
META/UkBench     205763/10200      66.74%

Table 1. Image retrieval accuracy results.

4. Discussion

This method has nice properties that we will explore in future work for large-scale, distributed content-based image retrieval: the possibility of updating the model as new images come in, the capability of comparing new images using a model previously constructed from a different set of images, and its computational efficiency (due to the randomization used both in image description and indexing) while keeping overall good performances. Finally, let us note that our image similarity measure actually defines a kernel and it could thus be exploited in clustering or supervised learning with kernel methods.

5. Acknowledgements

Raphael Maree is supported by the GIGA (University of Liege) with the help of the Walloon Region and the European Regional Development Fund. PG is a research associate of the FNRS, Belgium. This work presents research results of the Belgian Network BIOMAGNET, funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The authors thank Vincent Botta for Figure 2.

References

Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63, 3–42.

Maree, R., Geurts, P., Piater, J., & Wehenkel, L. (2005). Random subwindows for robust image classification. Proc. IEEE CVPR (pp. 34–40).

Maree, R., Geurts, P., & Wehenkel, L. (2007). Content-based image retrieval by indexing random subwindows with randomized trees. Proc. 8th Asian Conference on Computer Vision (ACCV), LNCS (pp. 611–620). Springer-Verlag.


IGForest: From tree to forest

Herman Stehouwer

ILK Research Group, Faculty of Humanities, Tilburg University

Abstract

TiMBL is an implementation of k-nn containing several algorithms for classification. One of the implemented, and often used, classifiers is IGTree. IGTree is a fast trie-based approximation of k-nn. Because of its trie-based nature, IGTree can mismatch on a feature quite fast, resulting in sub-optimal classification compared to IB1. We present an early version of IGForest that tries to lessen the impact of this problem while retaining the behavior of this set of algorithms. IGForest consists of an ensemble of IGTrees. The performance of IGForest is compared to both IGTree and IB1 on a diverse set of NLP problems.

1. Introduction

IGTree is a fast approximation of k-nearest neighbor classification (Daelemans et al., 1997). IGTree allows for fast training and testing even with millions of examples. IGTree compresses a set of labeled examples into a decision tree structure similar to the classic C4.5 algorithm (Quinlan, 1993), except that throughout one level in the IGTree decision tree, the same feature is tested. Classification in IGTree is a simple procedure in which the decision tree is traversed from the root node down, and one path is followed that matches the actual values of the new example to be classified. If an end node is met, the outcome stored at the end node is generated as the classification. If the last visited node is a non-ending node, but no outgoing arcs match the value in the new example, the most likely outcome stored at that node is produced as the resulting classification.
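A minimal Python sketch of this classification walk (not the TiMBL implementation): features are tested in a fixed order, and on a mismatch the most likely outcome stored at the last visited node is returned. The trie and its values are illustrative.

```python
class Node:
    def __init__(self, default):
        self.default = default      # most frequent outcome among examples under this node
        self.children = {}          # feature value -> Node

def classify(root, feature_order, example):
    """Walk the trie from the root; stop on a mismatch or at a leaf and return the
    (most likely) outcome stored at the last visited node."""
    node = root
    for feature in feature_order:
        child = node.children.get(example[feature])
        if child is None:
            break                   # no matching arc: back off to this node's default
        node = child
    return node.default

# Tiny illustrative trie testing feature 0 first, then feature 1.
root = Node('NEG')
root.children['a'] = Node('POS')
root.children['a'].children['x'] = Node('NEG')
print(classify(root, [0, 1], ('a', 'y')))   # mismatch at the second feature -> 'POS'
```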

IGTree is typically able to compress a large example set into a lean decision tree with high compression factors, in reasonably short time, comparable to other compression algorithms. More importantly, IGTree's classification time depends only on the number of features (O(f)). Indeed, we observe high compression rates: trained on almost 2 million examples of a Dutch d/dt inflection task containing both part-of-speech and full-word features, IGTree builds a tree containing a mere 46,466 nodes, with which it can classify 17 thousand examples per second on a current computing server.

Well-known techniques for boosting classifier performance are ensemble creation methods such as bagging and boosting. In (Banfield et al., 2007) a comparison is given of several such combination techniques on different training sets. They compare randomised C4.5, random subspaces, random forests, AdaBoost.M1W, and bagging. Most of these training sets were taken from the UCI repository (Murphy & Aha, 1995). Their main conclusion is that ". . . for any given data set the statistically significantly better algorithms are likely to be more accurate, just not by a significant amount on that data set.".

We frequently see on large datasets, such as the CoNLL 2000 shared task (Tjong Kim Sang & Buchholz, 2000), that IB1 easily outperforms IGTree in terms of accuracy, at the cost of classification time, as can be clearly seen in Table 1. As the speed difference increases non-linearly with the amount of training data, it would be prohibitive to run IB1 on experiments with millions of training instances.

The algorithm proposed here tries to find the middle ground between IB1 and IGTree by creating a forest of semi-random IGTrees. This is motivated by the fact that we wish to retain some of the unique characteristics of IGTree, such as its resemblance to a basic language model in terms of smoothing (Zavrel & Daelemans, 1997), while still being a generic classifier supporting any number and type of features.

2. System Architecture

IGForest consists of a number of simple steps that generate an ensemble of IGTrees, which is then applied to


F-score TimeIB1 0.926 866IGTree 0.868 1IGForest 0.906 219

Table 1. F-score and total training and testing time inseconds on the CoNLL 2000 shared task (Tjong Kim Sang& Buchholz, 2000) by IB1, IGTree and IGForest using IB1as an arbiter. 100.000 training instances where used.

the data which is to be classified. These steps are dis-cussed briefly below. As should become clear IGForestis easy to parallelize by running IGTRee training andtagging in parallel. IGForest implements the followingsteps, that combined result in an ensemble classifier

1: Analyse Data The first step is the analysis of the incoming training instances. This includes generating a default set of feature weights (Gain Ratio) and determining the performance on the training instances of a simple most-frequent outcome baseline.

2: Generate IGTrees In the second step a fixed number of IGTree feature orders is generated semi-randomly. We use the default set of feature weights as a bias for the random order, i.e. if one feature has a default weight of half the total of the default weights it has a 50% chance of being picked first (see the sketch after this list).

These feature orders are filtered to ensure uniqueness and improved performance on the training instances using cross-validation. It is ensured that each ordering improves on the most-frequent outcome baseline. The forest will always contain the default IGTree.

3: Train Combination Method IGForest contains several combination methods. Most of these combination methods need some training.

The simplest of these is linear unweighted voting, which does not need training and simply takes a vote of the different trees.

Somewhat more complex is linear weighted voting, where the weight of the vote of each tree is determined by the performance on the training instances using cross-validation.

The final combination method is that of an arbiter which trains on a holdout part of the training instances. Any classifier can be used as an arbiter. We use either IGTree or IB1.

4: Apply Finally, all the separate IGTrees are applied to the unclassified instances, after which the combination method is applied, resulting in classified instances.
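As an illustration of steps 2 and 3, the sketch below shows one possible way to draw a semi-random feature order biased by the default (gain-ratio) weights and to combine tree predictions by weighted voting. It is a simplification under our own assumptions, not the actual IGForest code, and the helper names are ours.

import random
from collections import Counter

def semi_random_order(weights, rng=random.Random(0)):
    """Draw a feature order in which the chance of being picked next is
    proportional to the feature's default (gain-ratio) weight."""
    remaining = dict(weights)                # feature -> weight
    order = []
    while remaining:
        total = sum(remaining.values())
        r = rng.uniform(0, total)
        acc = 0.0
        for feature, weight in list(remaining.items()):
            acc += weight
            if r <= acc:
                order.append(feature)
                del remaining[feature]
                break
    return order

def weighted_vote(predictions, tree_weights):
    """Combine the class predicted by each tree, weighting every vote by the
    tree's cross-validated accuracy on the training instances."""
    scores = Counter()
    for prediction, weight in zip(predictions, tree_weights):
        scores[prediction] += weight
    return scores.most_common(1)[0][0]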

3. Preliminary Results

We have implemented a basic version of IGForest, which does not check for uniqueness of the trees, nor does it cull the trees that perform worse than the most-frequent outcome baseline. Nevertheless, the first results are promising. The results for the CoNLL 2000 task for IGForest using IB1 as an arbiter can be found in Table 1.

This basic version performs in between IB1 and IGTree on the CoNLL 2000 data. We observe similar numbers in other experiments, such as POS tagging on the Wall Street Journal. In experiments where IGTree already outperforms IB1, such as -d/-dt inflection in Dutch, we do not see such an improvement. IGTree outperforms IB1 in cases where the feature order is much more absolute than the feature weighting would suggest.

References

Banfield, R. E., Hall, L. O., Bowyer, K. W., & Kegelmeyer, W. P. (2007). A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Daelemans, W., Van den Bosch, A., & Weijters, A. (1997). IGTree: using trees for compression and classification in lazy learning algorithms. Artificial Intelligence Review, 11, 407–423.

Murphy, P., & Aha, D. W. (1995). UCI repository of machine learning databases – a machine-readable repository. Maintained at the Department of Information and Computer Science, University of California, Irvine. Anonymous ftp from ics.uci.edu in the directory pub/machine-learning/databases.

Quinlan, J. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Tjong Kim Sang, E., & Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. Proceedings of CoNLL-2000 and LLL-2000 (pp. 127–132).

Zavrel, J., & Daelemans, W. (1997). Memory-based learning: Using similarity for smoothing. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (pp. 436–443).


Fast Image Annotation with Random Subwindows and Multiple Output Randomized Trees

Marie Dumont [email protected]

Systems and Modeling, Department of Electrical Engineering and Computer Science & GIGA Research, University of Liege, Institut Montefiore B28, Grande Traverse 10, 4000 Liege, Belgium

Raphael Maree [email protected]

GIGA Bioinformatics, University of Liege, GIGA Tower B34 (+1), Avenue de l’Hopital, 1, 4000 Liege, Belgium

Pierre Geurts [email protected]
Louis Wehenkel [email protected]

Systems and Modeling, Department of Electrical Engineering and Computer Science & GIGA Research, University of Liege, Institut Montefiore B28, Grande Traverse 10, 4000 Liege, Belgium

Abstract

This work addresses image annotation, i.e. labelling pixels of an image with a class among a finite set of predefined classes. The proposed method extracts a sample of subwindows from a set of annotated training images to train a subwindow annotation model by tree-based ensemble methods. In one variant of the approach, the classifier is trained to label only the central pixel of the subwindow, while in a second variant the classifier is extended to annotate all subwindow pixels simultaneously. This approach is evaluated on four different problems, where its overall good performance and efficiency are highlighted.

1. Image annotation

In this work, we propose a supervised learning approach for the generic problem of image annotation. Given a training set of images with pixel-wise labelling (i.e. every pixel is hand-labelled with one class among a finite set of predefined classes), the goal is to build a model that will be able to predict accurately the class of every pixel of any new, unseen image.

2. Methods

To tackle this problem, our starting point is the results obtained by Maree et al. (2005) for image classification. Their method first randomly extracts a large set of image subwindows (or patches) and describes those by high-dimensional feature vectors composed of raw pixel values. Then, the method uses an ensemble of extremely randomized decision trees (Geurts et al., 2006a) to build a subwindow classification model. To predict the class of a new image, the method extracts random subwindows from this image, classifies these subwindows using the decision tree ensemble and then aggregates the subwindow classifications by majority voting to get a single class prediction for the whole image.

In the context of image annotation, we follow a similar scheme based on the extraction of random subwindows and the use of ensembles of randomized trees. We propose two approaches.

The first approach builds extremely randomized trees to predict the class of the central pixel of subwindows. To classify a pixel of an unseen image, a subwindow centered on that pixel is extracted and the pixel classifications obtained by applying the trees on the subwindow are aggregated by majority voting. We denote this method SCM for Subwindow Classification Model.

The second approach extends the classifier so as to predict the class of every subwindow pixel simultaneously. We propose to replace the score function used to evaluate and select splits during tree induction by the average over all output pixel classes of the information theoretic score used for standard classification problems. Once the model is built, to annotate a new image, several (overlapping) subwindows are randomly extracted from the image, a prediction is computed by the classifier for every subwindow pixel, and the final annotation of the image is obtained by taking for every image pixel the majority class among all class predictions that were obtained for this pixel. We call this method SCMMO for Subwindow Classification Model with Multiple Outputs.
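The multiple-output split score can be illustrated as follows: the score of a candidate split is the single-output information-theoretic score averaged over all output pixels. This is a schematic sketch, not the authors' implementation; the entropy-based gain and the data layout (one row per subwindow, one column per output pixel) are assumptions made here.

import numpy as np

def entropy(labels):
    """Shannon entropy of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def multi_output_score(Y, left_mask):
    """Average, over all output pixels, of the class-entropy reduction of a
    candidate split.  Y is an (n_subwindows, n_output_pixels) array of pixel
    labels and left_mask selects the subwindows sent to the left child."""
    n = Y.shape[0]
    n_left = int(left_mask.sum())
    n_right = n - n_left
    gains = []
    for j in range(Y.shape[1]):                        # one output per subwindow pixel
        h = entropy(Y[:, j])
        h_left = entropy(Y[left_mask, j]) if n_left else 0.0
        h_right = entropy(Y[~left_mask, j]) if n_right else 0.0
        gains.append(h - (n_left / n) * h_left - (n_right / n) * h_right)
    return float(np.mean(gains))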

3. Experiments

To assess the performance and usefulness of the proposed methods as a foundation for image annotation, we have evaluated them on four datasets representing various types of images (microscope imaging, photographs of natural scenes, etc.) and a large variety of classes. Table 1 reports results with both methods in terms of pixel misclassification error rate on these four problems. Table 2 reports computing times. Examples of annotated images are shown in Figure 1. Detailed results and description of the datasets can be found in (Dumont et al., 2008).

SCMMO is almost always significantly better than SCM, which was expected as SCM predictions do not exploit correlation between pixels. On three problems among the four, SCMMO results are in fact comparable to those obtained by the state of the art. Moreover, our methods are very attractive in terms of computing times.

Table 1. Error rates obtained on four datasets.

Database     SCM       SCMMO
Retina        7.53%     7.56%
Bronchial     3.42%     3.13%
Corel        49.43%    36.01%
Sowerby      14.96%    10.93%

Table 2. Computing times on two datasets.

            Sowerby                   Retina
Method      Training    Prediction    Training    Prediction
SCM          76.63 s     0.12 s        41.22 s     17.64 s
SCMMO       315.68 s     0.26 s       125.5 s      26.98 s

4. Conclusion

We have introduced and compared two generic image annotation methods: SCM and SCMMO. From our experiments on four distinct databases, SCMMO appears to be the most interesting one among the two proposed methods: it is almost always significantly better than SCM and obtains state-of-the-art results on three problems among four. We deem that the main merit of our approach is its good overall performance while remaining conceptually very simple and keeping computing times very low. Moreover, it might be exploited as a first, fast step for image annotation as its output predictions might be post-processed by various techniques at the local and/or global level.

Figure 1. From left to right: original image, manual annotation, annotation obtained with SCM, annotation obtained with SCMMO. From top to bottom: images from Retina, Bronchial, Corel, and Sowerby datasets.

Acknowledgments

MD is a research fellow of the FNRS, Belgium. RM is supported by the GIGA interdisciplinary cluster of Genoproteomics of the University of Liege with the help of the Walloon Region and the European Regional Development Fund. PG is a research associate of the FNRS, Belgium.

References

Dumont, M., Maree, R., Geurts, P., & Wehenkel, L. (2008). Fast image annotation with random subwindows and multiple output randomized trees. Submitted.

Geurts, P., Ernst, D., & Wehenkel, L. (2006a). Extremely randomized trees. Machine Learning, 63, 3–42.

Geurts, P., Wehenkel, L., & d'Alche-Buc, F. (2006b). Kernelizing the output of tree-based methods. ICML (pp. 345–352).

Maree, R., Geurts, P., Piater, J., & Wehenkel, L. (2005). Random subwindows for robust image classification. CVPR (pp. 34–40).


On the use of Machine Learning in Statistical Parametric Speech Synthesis

Thomas Drugman
Alexis Moinet
Thierry Dutoit [email protected]

Faculte Polytechnique de Mons, TCTS Lab, 31 Boulevard Dolez, 7000 Mons, Belgium

Abstract

Statistical parametric speech synthesis has recently shown its ability to produce natural sounding speech while keeping a certain flexibility for voice transformation without requiring a huge amount of data. This abstract presents how machine learning techniques such as Hidden Markov Models in generation mode or context oriented clustering with decision trees are applied in speech synthesis. Fields that are investigated in our laboratory to improve this method are also discussed.

1. HMM-based Speech Synthesis

Until about five years ago, synthetic speech was typically produced by concatenating frames of natural speech selected from a huge database, possibly applying signal processing to them so as to smooth discontinuities. In 2002, Tokuda et al. (K. Tokuda, 2002) proposed a system relying on HMM generation of speech parameters. Compared to the previous approach, this one has the advantage of allowing voice transformation without requiring a large amount of data, merely by adapting its statistics through a short training (A. W. Black & Tokuda, 2007). By voice transformation we here mean voice conversion towards a given target speaker or expressive/emotive speech production from the initial trained system.

The key idea of an HMM-based synthesizer is to generate sequences of speech parameters directly from the trained HMMs. The next subsections describe the two main steps in the block diagram of such a synthesizer (see Figure 1).

Figure 1. Block diagram of an HMM-based speech synthesizer

1.1. The training part

Training our system assumes that a large segmented speech database is available. Labels consist of a phonetic environment description. First, speech waveforms are decomposed into their source (glottal) and filter (vocal tract) components. Representative features are then extracted from both contributions. Since source modeling is composed either of continuous values or a discrete symbol (respectively during voiced and unvoiced regions), multi-space probability density HMMs have been proposed. Indeed this approach turns out to be able to model sequences of observations having a variable dimensionality.

Given these latter parameters and the labels, HMMs are trained using the Viterbi and Baum-Welch re-estimation algorithms. Up to that point this may seem very close to building a speech recognizer. Nevertheless, decision tree-based context clustering is here used to statistically model data appearing in similar contextual situations. Indeed, contextual factors such as stress-related, locational, syntactic or phone identity factors affect prosody (duration and source excitation characteristics) as well as spectrum. More precisely, an exhaustive list of possible contextual questions is first drawn up. Decision trees are then built for source, spectrum and duration independently (as factors have a different impact on them) using a maximum likelihood criterion. Probability densities for each tree leaf are finally approximated by a Gaussian mixture model.

1.2. The synthesis part

The text typed by the user is first converted into a sequence of contextual labels. From them, a path through context-dependent HMMs is computed using the duration decision tree. Source and spectrum parameters are then generated by maximizing the output probability. The incorporation of dynamic features makes the evolution of the coefficients more realistic and smooth. Speech is finally synthesized from the generated parameters by a signal processing operation.

2. Our ongoing research activities

Our main goal is to develop an efficient HMM-based speech synthesizer for French. For this, the ACAPELA group kindly provided us with their natural language processor. Since English and French both have their own phonological particularities, an adaptation of the questions used for the context oriented clustering was necessary.

Basically, our (ongoing) research activities focus on three main issues:

• Speech analysis: A major disadvantage of such a synthesizer is the "buzziness" of the produced speech. This is typically due to the parametrical representation of speech. To overcome this hindrance a particular interest is devoted to speech analysis. Our approach particularly investigates a method of source-filter deconvolution based on the zeros of the Z-transform (B. Bozkurt & Dutoit, 2007). In this way an estimation of the glottal signal and the vocal tract impulse response is achieved. Different models for the source (LF and CALM models) as well as for the spectrum (MLSA, LSP or MFCC coefficients) are tested and their perceptual quality is assessed.

• Intelligibility enhancement: In some applications speech has to be synthesized in adverse conditions (in cars, at the station, ...). Intelligibility consequently becomes of paramount importance (Langner & Black, 2005). If we can model the modifications occurring when speech is produced in noise (possibly implying a training step), a synthesizer with (adaptive) intelligibility enhancement could be built.

• Voice conversion: In voice conversion (Y. Stylianou & Moulines, 1995), the aim is to modify the source speaker's voice towards a particular target speaker given a limited dataset of his utterances. This approach implies studying a statistical transformation between the representation spaces of both speakers. This could allow us to easily generate new voices, including the production of more emotions and expressivity in speech.

Acknowledgments

Thomas Drugman is supported by the “Fonds National de la Recherche Scientifique”. The authors also would like to thank the ACAPELA group and the Walloon Region (grant #415911) for their support.

References

Black, A. W., Zen, H., & Tokuda, K. (2007). Statistical parametric speech synthesis. Proc. of ICASSP (pp. 1229–1232).

Bozkurt, B., Couvreur, L., & Dutoit, T. (2007). Chirp group delay analysis of speech signals. Speech Communication, 49(3), 159–176.

Tokuda, K., Zen, H., & Black, A. W. (2002). An HMM-based speech synthesis system applied to English. Proc. of IEEE SSW.

Langner, B., & Black, A. W. (2005). Improving the understandability of speech synthesis by modeling speech in noise. Proc. ICASSP.

Stylianou, Y., Cappé, O., & Moulines, E. (1995). Statistical methods for voice quality transformation. Proc. EUROSPEECH.


Towards robust feature selection techniques

Yvan Saeys [email protected]
Thomas Abeel [email protected]
Yves Van de Peer [email protected]

Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
Department of Molecular Genetics, Ghent University, Technologiepark 927, 9052 Gent, Belgium

Abstract

Robustness of feature selection techniques is a topic of recent interest, especially in high dimensional domains with small sample sizes, where selected feature subsets are subsequently analysed by domain experts to gain more insight into the problem modelled. In this work, we investigate the robustness of various feature selection techniques, and provide a general scheme to improve robustness using ensemble feature selection. We show that ensemble feature selection techniques show great promise for small sample domains, and provide more robust feature subsets than a single feature selection technique. In addition, we also investigate the effect of ensemble feature selection techniques on classification performance, giving rise to a new model selection strategy.

1. Introduction

During the past decade, the use of feature selection for knowledge discovery has become increasingly important in many domains that are characterized by a large number of features, but a small number of samples. Typical examples of such domains include text mining, computational chemistry and the bioinformatics and biomedical field, where the number of features (problem dimensionality) often exceeds the number of samples by orders of magnitude (Saeys et al., 2007). When using feature selection in these domains, not only model performance but also robustness of the feature selection process is important, as domain experts would prefer a stable feature selection algorithm over an unstable one when only small changes are made to the dataset.

Surprisingly, the robustness (stability) of feature selection techniques is an important aspect that has received only relatively little attention in the past. Recent works in this area mainly focus on the stability indices to be used for feature selection, introducing measures based on Hamming distance (Dunne et al., 2002), correlation coefficients (Kalousis et al., 2007), consistency (Kuncheva, 2007) and information theory (Krızek et al., 2007). The work of Kalousis et al. (2007) also presents an extensive comparative evaluation of feature selection stability over a number of high-dimensional datasets. However, most of these recent works only focus on the stability of single feature selection techniques.

In this work, we investigate whether ensemble feature selection techniques can be used to yield more robust feature selection, and whether combining multiple methods has any effect on the classification performance.

2. Methods

2.1. Quantification of robustness

Depending on the outcome of a feature selection technique, the result can be either a set of weights, a ranking, or a particular feature subset. In order to assess robustness, a subsampling scheme is used that generates k subsamples containing 90% of the original data. The robustness of a technique is then measured by the average over all pairwise similarity comparisons between the different feature selectors:

S_tot = 2 / (k(k−1)) · Σ_{i=1}^{k} Σ_{j=i+1}^{k} S(f_i, f_j)

where f_i represents the outcome of the feature selection method applied to subsample i (1 ≤ i ≤ k), and S(f_i, f_j) represents a similarity measure between f_i and f_j. Here, we focus on similarities between rankings - using the Spearman rank correlation coefficient - and subsets, using the Jaccard index (Kalousis et al., 2007) or the consistency index (Kuncheva, 2007).

Table 1. Robustness of the different feature selectors across the different datasets. Spearman correlation coefficient (Sp), Jaccard index (JC) and consistency index (CI) on a subset of 1% best features.

Dataset           SU               Relief           SVM RFE          RF
                  Single   Ens.    Single   Ens.    Single   Ens.    Single   Ens.
Colon      Sp     0.61     0.76    0.62     0.85    0.7      0.81    0.91     0.99
           JC     0.3      0.55    0.45     0.56    0.44     0.5     0.01     0.64
           CI     0.45     0.7     0.61     0.71    0.6      0.65    0.01     0.77
Leukemia   Sp     0.68     0.76    0.58     0.79    0.73     0.79    0.97     0.99
           JC     0.54     0.6     0.44     0.55    0.49     0.57    0.36     0.8
           CI     0.7      0.74    0.6      0.71    0.64     0.72    0.53     0.89
Lymphoma   Sp     0.59     0.74    0.49     0.76    0.77     0.81    0.96     0.99
           JC     0.37     0.55    0.42     0.56    0.43     0.46    0.22     0.73
           CI     0.53     0.7     0.58     0.71    0.6      0.63    0.35     0.84
Average    Sp     0.63     0.75    0.56     0.8     0.73     0.80    0.95     0.99
           JC     0.40     0.57    0.44     0.56    0.45     0.51    0.2      0.72
           CI     0.56     0.71    0.6      0.71    0.61     0.67    0.3      0.83
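As an illustration, S_tot could be computed as in the sketch below, using the Jaccard index for subsets and SciPy's Spearman rank correlation for rankings; this is our own illustrative code, not the evaluation scripts used in the paper.

from itertools import combinations
from scipy.stats import spearmanr

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def stability(outcomes, similarity):
    """Average pairwise similarity S_tot over the k outcomes of the feature
    selector, each obtained on one of the k subsamples of the data."""
    k = len(outcomes)
    total = sum(similarity(outcomes[i], outcomes[j])
                for i, j in combinations(range(k), 2))
    return 2.0 * total / (k * (k - 1))

# Selected subsets compared with the Jaccard index ...
subsets = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g3", "g4"}]
print(stability(subsets, jaccard))

# ... and full rankings compared with the Spearman rank correlation coefficient.
rankings = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
print(stability(rankings, lambda r1, r2: spearmanr(r1, r2).correlation))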

2.2. Ensemble feature selection

In order to improve the robustness of feature selection, an idea similar to ensemble learning can be used, where multiple classifiers are combined in order to improve performance. In this work, we construct an ensemble of feature selectors by bootstrapping the data, and creating a consensus feature selector that aggregates the results of the single feature selectors by rank summation.
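A minimal sketch of this ensemble scheme is given below, assuming a ranker function that returns the features ordered from most to least relevant; the bootstrap sampling and rank summation follow the description above, but the helper names and interfaces are our own.

import random

def ensemble_rank(X, y, ranker, n_bootstraps=40, rng=random.Random(1)):
    """Aggregate single rankings, computed on bootstrap samples of the data,
    into one consensus ranking by summing the rank of every feature."""
    n, n_features = len(y), len(X[0])
    rank_sums = [0] * n_features
    for _ in range(n_bootstraps):
        idx = [rng.randrange(n) for _ in range(n)]       # bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        ranking = ranker(Xb, yb)                         # feature indices, best first
        for rank, feature in enumerate(ranking):
            rank_sums[feature] += rank
    # A low rank sum means the feature was consistently ranked near the top.
    return sorted(range(n_features), key=lambda f: rank_sums[f])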

3. Results

Table 1 shows the robustness of a representative sample of feature selectors, including two filter based approaches (Symmetrical Uncertainty (SU) and Relief) and two embedded approaches (recursive feature elimination using an SVM (SVM RFE) and Random Forests (RF)). It can be observed that constructing an ensemble version of each feature selector significantly improves robustness.

However, considering only robustness of a feature selection technique is not an appropriate strategy to find good feature rankings or subsets; model performance should also be taken into account to decide which features to select. Therefore, feature selection needs to be combined with a classification model in order to get an estimate of the performance of the feature selector-classifier combination.

Our results show that in most cases, classification performance using ensemble feature selection is comparable to the performance using conventional feature selection techniques, or performs only slightly worse (data not shown). It turns out that the best trade-off between robustness and classification performance depends on the dataset at hand, giving rise to a new model selection strategy, incorporating both classification performance as well as robustness in the evaluation strategy by taking e.g. the harmonic mean of robustness and classification performance as a combined measure.

Acknowledgments

Y.S. is funded by a post-doctoral grant from the Research Foundation Flanders (FWO-Vlaanderen). T.A. is funded by a grant from the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen).

References

Dunne, K., Cunningham, P., & Azuaje, F. (2002). Solutions to instability problems with sequential wrapper-based approaches to feature selection (Technical Report TCD-2002-28). Dept. of Computer Science, Trinity College, Dublin, Ireland.

Kalousis, A., Prados, J., & Hilario, M. (2007). Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl. Inf. Syst., 12, 95–116.

Krızek, P., Kittler, J., & Hlavac, V. (2007). Improving stability of feature selection methods. Proceedings of the 12th International Conference on Computer Analysis of Images and Patterns (pp. 929–936).

Kuncheva, L. (2007). A stability index for feature selection. Proceedings of the 25th International Multi-Conference on Artificial Intelligence and Applications (pp. 309–395).

Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23, 2507–2517.


Feature Selection Techniques for Database Cleansing: Knowledge-driven vs. Greedy Search

Marieke van Erp [email protected]
Antal van den Bosch [email protected]
Piroska Lendvai [email protected]
Steve Hunt [email protected]

ILK, Tilburg University, P.O. Box 90153, NL-5000 LE Tilburg, The Netherlands

Abstract

A comparative study is presented in which a wrapper-based feature selection algorithm and a knowledge-driven feature selection approach are compared to each other on a database cleaning task. Results show that both perform better than a classifier that is trained on the full feature set. The knowledge-driven approach provides a considerable speed up of the experiments over the wrapper-based feature selection approach.

1. Introduction

In order for machine learning algorithms to achieve optimal performance on a particular task, it is important to present the machine learner with optimal training data, e.g., not containing any attributes that may distract the algorithm. Today's data sets often contain large numbers of attributes, hence it is often not possible for human experts to remove bad attributes from the data before it is presented to the machine learner, or there may not even be a human expert to do this. Automatic feature selection promises to take over this task from humans, but may come at a cost. Some algorithms, such as C4.5 (Quinlan, 1993), order features during classification, but more often feature selection takes place before the actual classification. Among these techniques is the wrapper approach (Kohavi & John, 1997), in which the algorithm that will be used in the classification task is run on different feature subsets to evaluate the 'goodness' of each set for the final classification task. The order of the feature subsets under scrutiny can be organised via various algorithms that browse the feature set search space in a smart way to avoid having to test every possible subset of features.

Another approach is to mimic the human expert who chooses features based on the world knowledge he/she has, by selecting features via the introduction of knowledge in the classification experiment, for instance via an ontology. Some work has been done to steer feature selection via a domain ontology for microarray gene expression data (Qi & Tang, 2006). This work applies a similar approach to the natural history domain.

The natural history domain harbours vast amounts of data, of which some is digitised (mostly manually). In this work a flat database from the Dutch National Museum for Natural History (Naturalis) is used. It contains 16,870 records organised into 39 columns that describe findings and characteristics of reptile and amphibian specimens, as well as how they are preserved in the collection and when they entered the collection. The data was entered manually by researchers at the museum and contains typos, spelling variations and incorrect values.

In earlier work, a machine learner was already applied to this data to detect and correct errors (Sporleder et al., 2006). Error detection and correction is treated as a classification task, in which it is assumed that the majority of the database fields are correct and can be used as training data to predict other database fields. In that work all features were used, which is regarded here as the baseline because it is known that the performance of the k nearest neighbour algorithm can deteriorate when it is trained on irrelevant features (Wettschereck et al., 1997).

2. Domain Knowledge

To express domain knowledge, an ontology was created for the reptiles and amphibians via the CIDOC-CRM scheme (international standard ISO 21127:2006). By adhering to the CIDOC-CRM structure it is easier to integrate the ontology with information sources from within the same institution, or even from other institutions (Crofts et al., 2007).

Table 1. Number of times each approach achieves highest accuracy over the other approaches

              all feats   hill climb   ontology
all feats         –        14 (2)      28 (3)
hill climb      23 (2)        –        29 (1)
ontology         8 (3)      9 (1)         –

3. Experimental setup

For the ontology-driven feature selection (ontology), 3 experiments were performed per database column, where the number of experiments represents the maximum distance between the concept to be predicted (i.e., the database column) and the concepts in the ontology (i.e., the features). The experiments were carried out in an incremental fashion: first all features within distance 1 to the focus column in the ontology were selected, then all the features within distance 2, etc.

The ontology-driven feature selection was compared to a greedy heuristic search, namely a bi-directional hill climber (hill climb). The search carried out by this method starts out with an empty feature set. During each iteration it adds or deletes the one feature that, in combination with the other features already selected, improves the classifier the most (Caruana & Freitag, 1994). This heuristic feature selection algorithm does a very thorough job and can lead to very good results, but is very expensive.
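For concreteness, such a bi-directional hill climber could be sketched as follows; evaluate stands for the (expensive) cross-validated accuracy of the classifier trained on a candidate feature set, and is an assumption of this sketch rather than part of the actual experimental code.

def hill_climb(all_features, evaluate):
    """Bi-directional hill climbing: starting from the empty set, repeatedly
    add or delete the single feature whose change improves the evaluation
    score the most, and stop when no single change improves it."""
    selected = frozenset()
    best_score = evaluate(selected)
    while True:
        best_candidate, best_candidate_score = None, best_score
        for feature in all_features:
            candidate = selected ^ {feature}   # add if absent, delete if present
            score = evaluate(candidate)
            if score > best_candidate_score:
                best_candidate, best_candidate_score = candidate, score
        if best_candidate is None:             # local optimum reached
            return selected, best_score
        selected, best_score = best_candidate, best_candidate_score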

All experiments were carried out in a leave-one-out cross validation setting, with the feature-weighted k-NN implementation in the TiMBL software package (Daelemans et al., 2007). Empty database fields were not included in the values that were to be predicted because in some cases a field is supposed to be empty, whereas in some cases it is not, which is out of the scope of this work.

4. Results

In Table 1 the number of times every approach 'wins' against the other approaches is given, where the rows represent the score of the approach against the others in each column. The numbers in parentheses are draws. 'All feats' is the baseline with the complete feature set.

5. Conclusions

For most database columns, there is an increase in accuracy when feature selection is applied. As expected, the heuristic feature selection approach performs best, but often the accuracy of the ontology-driven approach is very close. The added bonus of the ontology-driven approach is that it is very fast, whereas the hill climbing search took significantly more time to come up with a well performing feature set. The results vary, because there is much variety between the different database columns. Follow-up research will include investigating how these differences affect each approach, and whether it is possible to combine the knowledge-driven approach with the heuristic approach.

Acknowledgements

This research is funded by NWO, the Netherlands Organisation for Scientific Research, as part of the CATCH programme.

References

Caruana, R., & Freitag, D. (1994). Greedy attribute selection. Proceedings of the Eleventh International Conference on Machine Learning (pp. 28–36).

Crofts, N., Doerr, M., Gill, T., Stead, S., & Stiff, M. (2007). Definition of the CIDOC Conceptual Reference Model. The version 4.2.2 of the reference document.

Daelemans, W., Zavrel, J., Van der Sloot, K., & Van den Bosch, A. (2007). TiMBL: Tilburg Memory Based Learner, version 6.1, reference guide (Technical Report 07-07). ILK/Tilburg University.

Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324. Special issue on 'Relevance'.

Qi, J., & Tang, J. (2006). Gene Ontology Driven Feature Selection from Microarray Gene Expression Data. Proceedings of CIBCB'06 (pp. 1–7).

Quinlan, J. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Sporleder, C., van Erp, M., Porcelijn, T., & van den Bosch, A. (2006). 'Spotting the odd-one-out': Data-driven error detection and correction in textual databases. Proceedings of the EACL 2006 Workshop on Adaptive Text Extraction and Mining (ATEM). Trento, Italy.

Wettschereck, D., Aha, D. W., & Mohri, T. (1997). A review and comparative evaluation of feature-weighting methods for a class of lazy learning algorithms. Artificial Intelligence Review, special issue on Lazy Learning, 11, 273–314.


Deriving p-values for Tree-based Variable Importance Measures

Van Anh Huynh-Thu [email protected]
Louis Wehenkel [email protected]
Pierre Geurts [email protected]

Department of Electrical Engineering and Computer Science, GIGA-Research, Bioinformatics and Modeling Unit, B34, University of Liege, B-4000 Liege, Belgium

1. Motivation

In many applications, and especially in bioinformatics, an important task is the selection or ranking of some variables or features according to their relevance to predict some target variable, e.g. in order to identify genes or markers involved in a disease (see Saeys et al. (2007) for a recent review). Feature selection usually has two goals: improving the model accuracy by reducing overfitting, and providing a better understanding of the problem under study. In this research, we focus on the latter goal.

Univariate approaches in the form of statistical tests are widely used in this domain. These methods provide for each variable a so-called p-value that represents the probability of getting a value of some statistic as unusual as the observed one under the (null) hypothesis that there is no correlation between the variable and the target. These methods are statistically well founded and provide results that are intuitive and easily interpretable by practitioners. However, they sometimes make strong parametric hypotheses and they potentially miss important features that are only relevant in interaction with some other features.

On the other hand, machine learning brings several feature selection and ranking approaches that can be applied to a large variety of prediction problems and are able to model feature dependencies. Among them, variable importance measures derived from tree-based models have been used in many application domains (see Saeys et al. (2007) for some references in bioinformatics). These methods make no strong parametric hypotheses about the problem, can handle mixtures of categorical and numerical variables, and are often computationally very efficient. However, with respect to univariate statistical tests, this kind of importance measure is not so intuitive for practitioners. One consequence is that it is not easy to indicate an importance threshold in order to distinguish truly relevant from irrelevant variables. Furthermore, despite their popularity, there have not been many empirical studies of these measures in controlled experiments (Strobl et al. (2007) is an exception, but it focuses only on the detection of a univariate effect).

In this context, our goal in this research is two-fold: (i) improve interpretability by proposing a p-value to be associated with each variable in the ranking and (ii) provide a systematic study of these measures on artificial datasets in order to assess their strengths and limitations. Below, we briefly describe tree-based variable importance measures and how we propose to exploit random permutations to evaluate them. We then report some preliminary results on artificial data.

2. Tree-based importance measures

Tree-based algorithms, single trees or ensembles of trees, make it easy to compute an attribute importance measure for a given classification problem. Among the importance measures proposed in the literature, we use the information measure from (Wehenkel, 1998). Namely, at each test node n, we compute the total reduction of the class entropy due to the split, which is defined by:

I(n) = #S · H_C(S) − #S_t · H_C(S_t) − #S_f · H_C(S_f),   (1)

where S denotes the set of samples that reach node n, S_t (resp. S_f) denotes its subset for which the test is true (resp. false), H_C(·) is the Shannon entropy of the class frequencies in a subset, and # denotes the cardinality of a set of samples. The overall importance of an attribute is then computed by summing the I values of all tree nodes (of a single tree, or of an ensemble of trees) where this attribute is used to split. The importances are usually normalized for the different variables so that they sum up to 100%.
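The sketch below illustrates how the importance of Eq. (1) could be accumulated over the test nodes of a tree (or an ensemble of trees) and normalized to percentages; the node representation is an assumption of this illustration, not the authors' implementation.

import numpy as np
from collections import defaultdict

def class_entropy(labels):
    """Shannon entropy H_C of the class frequencies in a set of samples."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def node_importance(y_node, y_true_branch, y_false_branch):
    """Total class-entropy reduction I(n) of Eq. (1) for a single test node."""
    return (len(y_node) * class_entropy(y_node)
            - len(y_true_branch) * class_entropy(y_true_branch)
            - len(y_false_branch) * class_entropy(y_false_branch))

def attribute_importances(test_nodes):
    """Sum I(n) over all test nodes (possibly from several trees) per tested
    attribute and normalize so that the importances sum up to 100%.
    test_nodes is an iterable of (attribute, y_node, y_true, y_false) tuples."""
    importances = defaultdict(float)
    for attribute, y_node, y_true, y_false in test_nodes:
        importances[attribute] += node_importance(y_node, y_true, y_false)
    total = sum(importances.values())
    return {a: 100.0 * v / total for a, v in importances.items()}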

3. Evaluation by random permutations

The false discovery rate, FDR, has been introduced to control multiple testing issues in the context of univariate tests. It is defined as the proportion of irrelevant variables among all variables that are predicted as relevant by the model (Storey & Tibshirani, 2003). This concept could be applied directly to the variable importance ranking provided by a tree-based method. For an importance threshold v_imp, the FDR is estimated as the proportion of variables that receive an importance greater than v_imp under the null hypothesis that all variables are irrelevant. The null hypothesis can be simulated by randomly permuting the class labels in the learning sample.

One problem of this approach in a multivariate context is that it assumes that the relevance measures are independent of each other, and it has been shown that this assumption potentially leads to an overestimation of the FDR (see Listgarten and Heckerman (2007) for a discussion of this issue in the context of Bayesian networks). In order to overcome this problem, we propose an alternative measure based on the computation of a conditional p-value, which we define, for a variable with an importance v_imp, as the probability that it receives an importance equal to or greater than v_imp, under the hypothesis that this variable and all its successors in the ranking are irrelevant. This quantity can be estimated by random permutations as well, by permuting the values of those variables that have an importance equal to or smaller than v_imp instead of permuting the class labels, so as to keep the original relationship between the first variables and the target class unchanged.

Like the FDR, this conditional p-value has an intuitive interpretation. It should better reflect the relevance of a variable as it takes into account the dependence between the importances.
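A schematic sketch of this permutation scheme is given below; importance_fn stands for a function that refits the tree ensemble and returns the importance of every variable, and is an assumption of this illustration.

import numpy as np

def conditional_p_value(X, y, var, v_imp, importance_fn, ranking,
                        n_perm=1000, rng=np.random.default_rng(0)):
    """Estimate the conditional p-value of variable `var` whose importance is
    `v_imp`: the values of `var` and of all its successors in the ranking are
    permuted (class labels and higher-ranked variables stay untouched), the
    importances are recomputed, and we count how often the importance of
    `var` is at least as large as v_imp."""
    successors = ranking[ranking.index(var):]       # var and everything ranked below it
    hits = 0
    for _ in range(n_perm):
        Xp = X.copy()
        for j in successors:
            Xp[:, j] = rng.permutation(Xp[:, j])    # break any link with the target
        if importance_fn(Xp, y)[var] >= v_imp:
            hits += 1
    return hits / n_perm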

4. Preliminary results

To give some intuition about the proposed measures, we report here a small experiment on some artificial data. The dataset is composed of 100 objects and 20 variables. The first three are really relevant (and of decreasing importance), while the others are pure Gaussian noise. The problem is such that the third variable is only relevant in combination with the first two. Table 1 shows variable importances obtained with ensembles of extremely randomized trees (Geurts et al., 2006) and one standard statistical test (the Mann-Whitney U test) for comparison. FDRs and conditional p-values are estimated from 1000 random permutations.

Table 1. Variable rankings

         ET                              Mann-Whitney
Var.     Imp.   FDR    cond. p          Var.      p         FDR
feat1    17.7   0      0                feat1     4.5e-09   0
feat2    15.6   0      0                feat2     1.2e-07   0
feat3     6.7   0.36   0.06             feat5     4.1e-02   0.27
feat8     5.4   1      0.23             feat9     5.0e-02   0.25
feat9     4.8   1      0.60             feat7     6.2e-02   0.26
feat19    4.2   1      0.93             feat18    6.8e-02   0.23
feat5     4.1   1      0.94             feat19    7.2e-02   0.21
feat14    3.8   1      0.98             feat10    1.5e-01   0.39
feat6     3.8   1      0.98             feat3     3.1e-01   0.71
...                                     ...

We observe that the true ranking is well retrieved by the tree-based method while the univariate Mann-Whitney test fails at finding the third feature (as expected). The FDRs and conditional p-values usefully complement the ranking, and the conditional p-value is indeed less conservative than the FDR, giving the third variable a better chance of being considered relevant.

We are currently investigating how these measures behave when parameters of the problems (number of irrelevant and relevant features, learning sample size, etc.) and of the algorithms (splitting criterion, importance measure, etc.) are varied. In the light of these new findings, the same approach will be applied to real practical bioinformatics problems.

Acknowledgments

V.A. Huynh-Thu is recipient of a F.R.I.A. fellowship. P. Geurts is a Research Associate of the F.R.S.-FNRS. This work presents research results of the Belgian Network BIOMAGNET (Bioinformatics and Modeling: from Genomes to Networks), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office.

References

Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63, 3–42.

Listgarten, J., & Heckerman, D. (2007). Determining the number of non-spurious arcs in a learned DAG model: Investigation of a Bayesian and a frequentist approach. Proceedings of UAI. UAI Press.

Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23, 2507–2517.

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc Natl Acad Sci U S A, 100, 9440–9445.

Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8.

Wehenkel, L. (1998). Automatic learning techniques in power systems. Boston: Kluwer Academic.


Linear Regression using Costly Features

Robby Goetschalckx [email protected]

KULeuven - Department of Computer Science, Celestijnenlaan 200A, 3001 Heverlee, Belgium

Scott Sanner [email protected]

National ICT Australia, Tower A, 7 London Circuit, Canberra City ACT 2601, Australia

Kurt Driessens [email protected]

KULeuven - Department of Computer Science, Celestijnenlaan 200A, 3001 Heverlee, Belgium

1. Introduction

In this paper we consider the problem of linear regression where some features might only be observable at a certain cost. We assume that this cost can be expressed in the same units and thus be compared to the approximation error cost. The learning task becomes a search for the features that contain enough information to warrant their cost. Costs of features could reflect necessary time for complicated computations, the price of a costly experiment setup, the price an expert asks for advice, . . .

Sparsity in linear regression has been widely studied, cf. chapter 3 of (Hastie et al., 2001). In these methods, striving for sparsity in the parameters of the linear approximation leads to higher interpretability and less over-fitting. Our setting, however, is significantly different as sparsity is gained by only including features with a cost lower than their value.

Including costs of features for classification and regression has also been discussed in (Turney, 2000; Domingos, 1999; Goetschalckx & Driessens, 2007). In these approaches, binary tests are used in a decision tree if the expected information gain is worth more than the cost. The difference with this paper is that here we are able to deal with numerical features instead of only binary ones.

2. Problem Statement

The problem we try to solve can be stated as follows.

Given:

• a set X of examples; P(x) is the distribution from which examples x are drawn

• a target function Y : X → R which we want to approximate

• a set F of features: fi : X → R

• a cost of features: c : F × X → R

Find: a subset F of the features and weights w_i such that the accumulated cost, consisting of the approximation error and the feature cost, is minimized:

arg min_{F, w} Σ_{x∈X} P(x) [ |Y(x) − y(x)| + Σ_{f∈F} c(f, x) ]

where y(x) = w_0 + Σ_{f_i∈F} w_i f_i(x).

In other words, we try to find a linear combination of a subset of the features, where the absolute error of the approximation is compared to the cost of the set of features.

3. Cost Sensitive Forward Stagewise Regression

Many sparse linear regression methods are based on least-angle regression methods such as the lasso and forward stepwise regression (Efron et al., 2002). In these methods the cost function includes a term Σ_{f_i∈F} |w_i| to encourage lower weights and sparsity. For our setting, we simply change this to Σ_{f_i∈F} c(f_i, x).

We adapt an existing method, forward-stagewise regression, to solve the problem in an efficient way. We briefly describe this below and refer the reader to the detailed discussion in (Efron et al., 2002) for more information on this and related methods. The modified algorithm will be called C-FSR.

We start by gathering a batch of examples, together with all their feature values and their target values. We take the average target value and assign its value to w_0 (this normalizes the residuals). All features are normalized to have 0 mean and standard deviation equal to 1, and assigned to the feature vectors f_i. We assign the normalized target values to the vector of residuals r.

Cost-sensitive Forward-Stagewise Regression (C-FSR)

repeat:
  1. Find the feature f_i maximizing |f_i · r| − I[f_i ∉ F] · c_i, provided this score is > 0
  2. Increase its weight w_i by ε · sign(f_i · r)
  3. Recompute the residuals r
  4. F ← F ∪ {f_i}
until no such feature can be found.

In step 1, the cost is only included in the score if the feature is not yet introduced in the linear function.

It should be noted that C-FSR is a greedy selection approach. Because of this, the approximation obtained might not always be globally optimal. A local optimum is guaranteed, however, and the increase in prediction accuracy of using this greedy feature set is guaranteed to equal or exceed the cost of its computation.
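A minimal sketch of the C-FSR loop on a fixed batch of normalized examples is given below; the NumPy formulation, the epsilon default and the maximum iteration cap are our own illustrative choices and follow the pseudo-code above only schematically.

import numpy as np

def c_fsr(F, y, costs, eps=0.01, max_iter=10000):
    """Cost-sensitive forward-stagewise regression on a batch of examples.
    F is an (n_examples, n_features) matrix of normalized feature values
    (zero mean, unit standard deviation), y the normalized targets, and
    costs[i] the cost of feature i."""
    n_examples, n_features = F.shape
    w = np.zeros(n_features)
    used = np.zeros(n_features, dtype=bool)   # features whose cost was already paid
    r = y.copy()                              # residual vector
    for _ in range(max_iter):
        corr = F.T @ r                        # inner product of each feature with the residuals
        score = np.abs(corr) - np.where(used, 0.0, costs)
        i = int(np.argmax(score))
        if score[i] <= 0:                     # no feature is worth its cost any more
            break
        w[i] += eps * np.sign(corr[i])        # small stagewise step on the best feature
        used[i] = True
        r = y - F @ w                         # recompute the residuals
    return w, used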

4. Experiments

The setting we used for experiments is a simple domain, where there are seven different examples with different values. We provided seven indicator features f_i for 1 ≤ i ≤ 7, one for every example. f_1 and f_4 have a cost c, the others are cost-free.

We varied the value of c over a range of 0 to 0.5. With increasing c the agent cannot distinguish between x_1 and x_4 without paying a cost. C-FSR showed a clear-cut phase transition near c = 0.185 (the theoretically correct threshold value): with lower costs, the cost for the features is always paid; with higher costs the agent prefers not to use the costly features. This can be clearly seen in Figure 1.

Figure 1. Error the agent is aware of making versus the amount spent on costly features (RMSE and prediction cost as a function of c).

Here the cost the agent pays is compared with the prediction error the agent is aware of making (the norm of the final residual vector). For very low values of c, the agent actually computes both of the features while only one is needed. As the costs are indeed very low, this does not pose a real problem. For low values of c, the agent keeps paying for one of the features. For values higher than the threshold, the error the agent is aware of making is lower than the cost of extra information and the agent will not pay for the features. For larger domains similar tests have shown that the algorithm scales well, with running time linear in the number of features.

5. Conclusion

We introduced a novel sparse linear regression method, C-FSR, for the case where features have different costs. Empirically we have shown that C-FSR behaves near-perfectly on a test domain.

Acknowledgments

The research presented in this paper was funded by the Fund for Scientific Research (FWO Vlaanderen), and by National ICT Australia.

References

Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (pp. 155–164).

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2002). Least angle regression (Technical Report). Statistics Department, Stanford University.

Goetschalckx, R., & Driessens, K. (2007). Cost sensitive reinforcement learning. Proceedings of the workshop on AI Planning and Learning (pp. 1–5).

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer-Verlag.

Turney, P. (2000). Cost-sensitive learning bibliography. Institute for Information Technology, National Research Council, Ottawa, Canada.


Automatic Regression Modeling with Active Learning

Dirk Gorissen [email protected]

Tom Dhaene [email protected]

Eric Laermans [email protected]

Ghent University - IBBT, Department of Information Technology (INTEC), Gaston Crommenlaan 8, Bus 201, 9050 Ghent, Belgium

Abstract

Many complex, real world phenomena are difficult to study directly using controlled experiments. Instead, the use of computer simulations has become commonplace as a feasible alternative. However, due to the computational cost of these high fidelity simulations, the use of neural networks, kernel methods, and other surrogate modeling techniques has become indispensable. Surrogate models are compact and cheap to evaluate, and have proven very useful for tasks such as optimization, design space exploration, visualization, prototyping, and sensitivity analysis. Consequently, there is great interest in techniques that facilitate the construction of such regression models, while minimizing the computational cost and maximizing model accuracy. We present a fully automated machine learning toolkit for regression modeling and active learning to tackle these issues. We place a strong focus on adaptivity, self-tuning and robustness in order to maximize efficiency and make our algorithms and tools easily accessible to other scientists in computational science and engineering.

1. Introduction

For many problems from science and engineering it is impractical to perform experiments on the physical world directly (e.g., airfoil design, earthquake propagation). Instead, complex, physics-based simulation codes are used to run experiments on computer hardware. While allowing scientists more flexibility to study phenomena under controlled conditions, computer experiments require a substantial investment of computation time (one simulation may take many minutes, hours, days or even weeks) (Wang & Shan, 2007).

As a result, the use of various approximation methods that mimic the behavior of the simulation model as closely as possible has become standard practice. This work concentrates on the use of data-driven, global approximations using compact surrogate models (also known as metamodels, or response surface models (RSM)). Popular metamodel types include: neural networks, Kriging models, and Support Vector Machines (SVM).

Global surrogate models provide a fast and efficient way for the engineer to explore the relationship between parameters (design space exploration), study the influence of various boundary conditions on different optimization runs, or enable the simulation of large scale systems where this would normally be too cumbersome. For the last case a classic example is the full-wave simulation of an electronic circuit board. Electromagnetic modeling of the whole board in one run is almost intractable. Instead the board is modeled as a collection of small, compact, accurate replacement surrogate models that represent the different functional components (capacitors, resistors, ...) on the board.

2. Motivation

However, in order to come to an acceptable approximation, numerous problems and design choices need to be overcome: what data collection strategy to use (active learning), what model type is most applicable (model selection), how should model parameters be tuned (hyperparameter optimization), how to optimize the accuracy vs. computational cost trade-off, etc. Particularly important is the data collection strategy. Since data is computationally expensive to obtain, it is impossible to use traditional, one-shot, space-filling experimental designs. Data points must be selected iteratively, where the information gain will be the greatest. An intelligent sampling function is needed that minimizes the number of sample points selected in each iteration, yet maximizes the information gain of each iteration step. This is the process of active learning (Sugiyama & Ogawa, 2002), but it is also known as adaptive sampling or sequential design.

Figure 1. Automatic Adaptive Surrogate Modeling
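To illustrate the idea of adaptive sampling, the sketch below shows a generic sequential design loop; it is not the SUMO Toolbox itself, and the error-based criterion used here (the disagreement between two surrogates fitted on disjoint halves of the data) is just one of many possible choices.

import numpy as np

def adaptive_sampling(simulator, fit_surrogate, candidates,
                      n_init=10, budget=50, rng=np.random.default_rng(0)):
    """Generic sequential design loop.  Starting from a small random design,
    each iteration fits surrogates, scores all remaining candidate points by
    the disagreement between two surrogates fitted on disjoint halves of the
    data, and evaluates the expensive simulator only at the top-scoring point."""
    chosen = list(rng.choice(len(candidates), size=n_init, replace=False))
    X = candidates[chosen]
    y = np.array([simulator(x) for x in X])
    while len(chosen) < budget:
        half = len(chosen) // 2
        m1 = fit_surrogate(X[:half], y[:half])      # each returns a callable model
        m2 = fit_surrogate(X[half:], y[half:])
        disagreement = np.abs(m1(candidates) - m2(candidates))
        disagreement[chosen] = -np.inf              # never re-evaluate a chosen point
        best = int(np.argmax(disagreement))
        chosen.append(best)
        X = np.vstack([X, candidates[best]])
        y = np.append(y, simulator(candidates[best]))
    return fit_surrogate(X, y), X, y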

Together, this means that there is an overwhelming number of options available to the designer: different model types, different experimental designs, different model selection criteria, etc. However, in practice it turns out that the designer rarely tries out more than one subset of options. All too often, surrogate model construction is done in a one-shot manner. Iterative and adaptive methods, on the other hand, have the potential of producing a much more accurate surrogate at a considerably lower cost (fewer data points) (Busby et al., 2007). We present a state-of-the-art machine learning platform that provides an automatic, flexible and rigorous means to tackle such problems and that can easily be integrated in the engineering design process: the SUrrogate MOdeling (SUMO) Toolbox.

3. SUMO Toolbox

The SUMO Toolbox (Gorissen et al., 2006) is an adaptive tool that integrates different modeling approaches and implements a fully automated, adaptive global surrogate model construction algorithm. Given a simulation engine, the toolbox automatically generates a surrogate model within the predefined accuracy and time limits set by the user (see Figure 1). At the same time, there is no such thing as a 'one-size-fits-all' solution: different problems need to be modeled differently. Therefore the toolbox was designed to be modular and extensible, but not too cumbersome to use or configure. Different plugins are supported: model types (neural networks, SVMs, splines, ...), hyperparameter optimization algorithms (Pattern Search, Genetic Algorithm (GA), Particle Swarm Optimization (PSO), ...), active learning (density based, error based, gradient based, ...), and sample evaluation methods (local, on a cluster or grid). The behavior of each component is configurable through a central XML configuration file and components can easily be added, removed or replaced by custom, problem-specific implementations.

The difference with existing machine learning toolkits such as Rapidminer (formerly Yale), Spider, Shogun, Weka, and Plearn is that they are heavily biased towards classification and data mining (vs. regression), they assume data is freely available and cheap (no active learning), and they lack advanced algorithms for the automatic selection of the model type and model complexity.

Our approach has been successfully applied to a very wide range of fields, from combustion modeling in chemistry and metallurgy, semi-conductor modeling (electromagnetism), and aerodynamic modeling (aerospace), to structural mechanics modeling in the car industry. Its success is primarily due to its flexibility, self-tuning implementation, and its ease of integration into the larger computational science and engineering pipeline.

References

Busby, D., Farmer, C. L., & Iske, A. (2007). Hierarchical nonlinear approximation for experimental design and statistical data fitting. SIAM Journal on Scientific Computing, 29, 49–69.

Gorissen, D., Hendrickx, W., Crombecq, K., & Dhaene, T. (2006). Integrating grid computing and metamodelling. Proceedings of the 6th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2006) (pp. 185–192). Singapore.

Sugiyama, M., & Ogawa, H. (2002). Release from active learning/model selection dilemma: optimizing sample points and models at the same time. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02) (pp. 2917–2922).

Wang, G. G., & Shan, S. (2007). Review of metamodeling techniques in support of engineering design optimization. Journal of Mechanical Design, 129, 370–380.


Active Learning for Primary Drug Screening

Kurt De Grave, Jan Ramon and Luc De Raedt {Kurt.DeGrave,Jan.Ramon,Luc.DeRaedt}@cs.kuleuven.be

Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium

Abstract

We study the task of approximating the k best instances with regard to a function using a limited number of evaluations. We also apply an active learning algorithm based on Gaussian processes to the problem, and evaluate it on a challenging set of structure-activity relationship prediction tasks.

1. Introduction

High Throughput Screening (HTS) is a step in the drug discovery process, in which chemical compounds are screened against a biological assay. The goal of this step is to find a few lead compounds within the entire compound library that exhibit a very high activity in the assay. In this type of application, only partial information can be obtained by testing specific instances for their performance. Such tests correspond to experiments and can be quite expensive. The challenge then is to identify the best performing instances using as few experiments as possible.

2. Problem statement

Our work is especially motivated by the structure-activity relationship domain, where HTS approaches often assume the availability of a large but fixed library of chemical compounds. Hence, we assume the learner must select the next example from a finite pool.

We formally specify the problem as follows.

Given:

• a pool P of instances,

• an unknown function f that maps instances x ∈ P on their target values f(x),

• an oracle that can be queried for the target value of any example x ∈ P,

• the maximal number Nmax of queries,

• the number k of best examples searched for.

Find:

• the top k instances in P, that is, the k instances in P that have the highest values for f.

From a machine learning perspective, the key challenge is to determine the policy for selecting the next query to be asked on the basis of the already known examples. This policy will have to keep the right balance between exploring the whole pool of examples and exploiting those regions in the pool that look most promising.

3. Model and selection strategies

We will use a Gaussian process (GP) model for learning (Gibbs, 1997). Detailed explanations can be found in several textbooks, e.g. (Bishop, 2006). The GP model allows us to calculate the probability distribution of the target value t∗ of a new example x∗ given the tested examples XN and their measured target values TN:

t∗|XN , TN , x∗ ∼ N (t∗, var(t∗)) (1)

Different active learning strategies exist. In line with the customary goal of inducing a model with maximal accuracy on future examples, most approaches involve a strategy aiming to greedily improve the quality of the model in regions of the example space where its quality is lowest. One can select new examples for which the predictions of the model are least certain or most ambiguous. Depending on the learning algorithm, this translates to near decision boundary selection, ensemble entropy reduction, version space shrinking, and others. In our model, it translates to maximum variance on the predicted value or arg max(var(t∗)).

(Warmuth et al., 2003) found that in a highly skewed distribution, recall increases quickly when one selects examples for testing that are most likely to belong to the minority class. For our optimization problem we


will test the equivalent method of selecting the example that the current model predicts to have the best target value, or arg max(t∗). We will refer to this as the maximum predicted strategy.

Another strategy is to always choose the example for which the optimistic guess is maximal. The idea is not to test the example in the database where the predicted value t∗ is maximal, but the example where t∗ + koptimism · var(t∗) is maximal.

An alternative strategy is to select the example xN+1 that has the highest probability of improving the current solution, as described by (Lizotte et al., 2007). Let the current step be N, the aggregate value of the set of k best examples be ‖TN‖best−k, and the target value of the k-th best example be t#(k,N). We can evaluate this probability by computing the cumulative Gaussian

P(t∗ > t#(k,N)) = ∫_{t=t#(k,N)}^{∞} P(t∗ = t) dt.    (2)

We call this the maximum gain probability strategy.
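
Purely as an illustration (not the authors' implementation), the selection rules above could be written on top of a scikit-learn Gaussian process as follows; pool, X_tested, y_tested, k_opt and k are hypothetical names, and the predictive standard deviation is used as the uncertainty term.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct

def select_next(pool, X_tested, y_tested, strategy="optimistic", k_opt=1.0, k=10):
    # Fit a GP with a linear kernel on the compounds measured so far.
    gp = GaussianProcessRegressor(kernel=DotProduct()).fit(X_tested, y_tested)
    mean, std = gp.predict(pool, return_std=True)
    if strategy == "max_predicted":
        score = mean                          # arg max(t*)
    elif strategy == "max_variance":
        score = std                           # arg max(var(t*))
    elif strategy == "optimistic":
        score = mean + k_opt * std            # optimistic guess t* + k_optimism * sigma
    else:                                     # maximum gain probability
        t_kth = np.sort(y_tested)[-k]         # target value of the k-th best tested example
        score = norm.sf(t_kth, loc=mean, scale=std)   # P(t* > t_#(k,N))
    return int(np.argmax(score))              # index in the pool of the next query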

4. Experimental evaluation

We evaluated the algorithm on the US National Cancer Institute (NCI) 60 anticancer drug screen (NCI60) dataset (Shoemaker, 2006). A pool of 2,000 compounds was randomly selected from each assay.

We used a linear kernel, using for each compound 1024 Open Babel FP2 fingerprints as features. The algorithm was bootstrapped with measurements of ten random compounds. Each experiment was repeated 20 times for every assay and the results were averaged.

Table 1. Best strategy (attaining highest ‖TNmax‖best−10) and Wilcoxon signed-rank test p-value for the null hypothesis that the difference between the top-10 values of this strategy and those of the best strategy is on average 0. ε indicates that p < 10−10. Budget shown as % of pool size.

Budget               10%     15%     20%     25%
Max predicted (0σ)   0.251   0.684   0.040   0.021
Optimistic (0.5σ)    0.521   Best    0.251   0.111
Optimistic (1σ)      0.182   0.469   Best    Best
Optimistic (2σ)      0.618   0.958   0.179   0.298
Max variance (∞σ)    ε       ε       ε       ε
Max gain prob        Best    0.982   0.189   0.052
Random selection     ε       ε       ε       ε

From the results presented in Figure 1 and Table 1 one can see that all strategies, except maximum-variance, clearly perform much better than random example selection. The 1σ optimistic strategy performs best over the widest range of budgets.

Figure 1. The value of ‖TN‖best−10 as a function of the fraction of compounds tested (NCI60, k = 10; curves shown for max prediction, 0.5σ/1σ/2σ optimistic, max variance, max gain probability, and random selection). The vertical axis is scaled to place the aggregate target value of the overall k best compounds at one and the worst k compounds at zero.

5. Conclusions

To summarize: we introduced the best-k optimization problem in a machine learning context, we proposed an approach based on Gaussian processes to tackle it, and we applied it successfully to a challenging structure-activity relationship prediction task.

Acknowledgements

Kurt De Grave is supported by GOA 2003/08 "Inductive Knowledge Bases". Jan Ramon is a postdoctoral fellow of the Fund for Scientific Research (FWO) of Flanders. This research used HPC resources (http://ludit.kuleuven.be/hpc).

References

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Gibbs, M. (1997). Bayesian Gaussian processes for regression and classification. Doctoral dissertation.

Lizotte, D., Wang, T., Bowling, M., & Schuurmans, D. (2007). Automatic gait optimization with Gaussian process regression. Proceedings of IJCAI-07 (pp. 944–949).

Shoemaker, R. (2006). The NCI60 human tumour cell line anticancer drug screen. Nat. Rev. Cancer, 6, 813–823.

Warmuth, M. K., et al. (2003). Active learning with support vector machines in the drug discovery process. J. Chem. Inf. Comput. Sci., 43, 667–673.


When can we simplify a one-versus-one multi-class classifier to a single ranking?

Willem Waegeman [email protected]

Department of Electrical Energy, Systems and Automation, Ghent University, Technologiepark 913, B-9052 Ghent, Belgium

Bernard De Baets [email protected]

Department of Applied Mathematics, Biometrics and Process Control, Ghent University, Coupure links 653, B-9000 Ghent, Belgium

Luc Boullart [email protected]

Department of Electrical Energy, Systems and Automation, Ghent University, Technologiepark 913, B-9052 Ghent, Belgium

Abstract

We try to answer the question posed in the title from a theoretical point of view. To this end, sufficient conditions are derived for which a one-versus-one ensemble becomes ranking representable, i.e. conditions for which the ensemble can be reduced to a ranking or ordinal regression model such that a similar performance on training data is measured. As performance measure, we use the area under the ROC curve (AUC) and its reformulation in terms of graphs. By means of a graph-theoretic analysis of the problem, we are able to formulate necessary and sufficient conditions for ranking representability. For the three-class case, this results in a new type of transitivity for pairwise AUCs that can be verified by solving an integer quadratic program.

1. Introduction

Many machine learning algorithms for multi-class classification aggregate several binary classifiers to compose a decision rule. In the popular one-versus-one ensemble (Fürnkranz, 2002), a classifier is trained on each pair of categories, but do we really need such a complex model for every multi-class classification task? One might agree that different multi-class classification schemes have a different degree of complexity, but no consensus has been reached on which one to prefer. In this work we go one step further and investigate whether a one-versus-one multi-class model can

be simplified to a ranking model. We start from the assumption that the optimal complexity of a multi-class model is problem-specific. Reducing a one-versus-one ensemble to a ranking model can be seen as a quite drastic application of the bias-variance trade-off: a one-versus-one classification scheme is a complex model, resulting in a low bias and a high variance of the performance, while an ordinal regression model is a much simpler model, manifesting a high bias but a low variance. So, we do not claim that a one-versus-one scheme can always be reduced to a ranking model. We rather look for necessary and sufficient conditions for such a reduction.

2. Strict ranking representability

Let X denote the object space and Y the unordered set of r labels. We use the notation Y = {y1, ..., yr} to denote the respective categories. Furthermore, we formally define a one-versus-one model as a set F of r(r − 1)/2 ranking functions fkl : X → R with 1 ≤ k < l ≤ r. Thus, we consider one-versus-one schemes for which each binary classifier produces a continuous output resulting in a probability estimate or a ranking of the data for each pair of categories. A dataset of size n will be denoted D = {(x1, y1), ..., (xn, yn)}. Given a one-versus-one classification model, represented by a set F of r(r − 1)/2 pairwise ranking functions fkl, when can we reduce this model to a single ranking f : X → R that gives a better performance on unknown test data? Or, equivalently, when can we simplify the one-versus-one model to a ranking model without increasing the error on training data?


Having in mind the bias-variance trade-off, it would be appropriate to prefer the single ranking model over the one-versus-one scheme if the training error does not increase. In that case, the former model is complex enough to fit the data well in spite of having a lower variance over different training samples. In its most strict form, we can define ranking representability of a one-versus-one classification scheme as follows.

Definition 2.1. Let D ⊂ X × Y. We call a set F of pairwise ranking functions fkl strictly ranking representable on D if there exists a ranking function f : X → R such that for all yk, yl ∈ Y and any (xi, yk), (xj, yl) ∈ D

fkl(xi) ≤ fkl(xj)⇔ f(xi) ≤ f(xj) . (1)

We can define a unique directed graph for a set of pairwise rankings F, and strict ranking representability of F can be easily checked with a simple algorithm that verifies whether the corresponding graph is a DAG.
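
The construction of that graph is not spelled out in this abstract; the sketch below only illustrates the final check, testing whether a directed graph is acyclic with a standard Kahn-style topological sort.

from collections import defaultdict, deque

def is_dag(edges, nodes):
    # edges: iterable of (u, v) pairs; nodes: iterable of all vertices
    succ, indeg = defaultdict(list), {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v, d in indeg.items() if d == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return visited == len(indeg)   # all vertices removable <=> no directed cycle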

3. AUC ranking representability

It goes without saying that strict ranking representability has a very limited applicability to reduce one-versus-one multi-class schemes, since the condition is too strong to be satisfied in practice. When fitting r(r − 1)/2 functions to the data in a multi-class setting, it is unrealistic to think that all these functions will impose a consistent ranking, i.e. a ranking satisfying Eq. (1). Yet, is it really necessary to require strict ranking representability in order to exchange a one-versus-one model for a single ranking model? The answer is no, since we are interested in a good performance on independent test data. Therefore, demanding that a single ranking gives exactly the same result on training data as a one-versus-one scheme might be too strong a condition. An obvious relaxation is to require that a single ranking model yields the same performance on training data instead of requiring the same results. This makes a subtle difference since it is now allowed that both models make errors on different data objects, as long as the total error of both models is similar. As claimed above, the single ranking model should attain better results on independent test data when the bias-variance trade-off is taken into consideration.

The performance measure that we will consider is the pairwise AUC, which can be evaluated on a one-versus-one model as well as on a single ranking model. We will respectively use the notations Akl(F, D) and Akl(f, D) for the AUC obtained for categories yk and yl. AUC ranking representability is defined as follows.

Definition 3.1. Let D ⊂ X × Y. We call a set F of pairwise ranking functions fkl AUC ranking representable on D if there exists a ranking function f : X → R such that

Akl(F , D) = Akl(f,D) ∀k, l : 1 ≤ k < l ≤ r . (2)

A graph-theoretic reformulation of AUC ranking representability can be established by defining a set GAUC(F, D) of graphs such that F is AUC ranking representable if and only if GAUC(F, D) contains at least one acyclic graph. However, unlike strict ranking representability, it is far from trivial to verify whether a set F of pairwise rankings fkl is AUC ranking representable, since examining all graphs in GAUC(F, D) will be computationally intractable for large training samples. In the talk we will present a way to tackle the problem by using additional graph concepts and the framework of cycle transitivity (De Baets et al., 2006). Using this framework, we are able to define necessary conditions for AUC ranking representability, since the pairwise AUCs of an AUC ranking representable one-versus-one scheme are reciprocal relations coinciding with dice models (De Schuymer et al., 2003). These conditions can be easily verified in practice by analyzing the pairwise AUCs.

Conversely, sufficient conditions for AUC ranking representability can also be translated into the framework of cycle transitivity. To this end, a new type of cycle transitivity is introduced, leading to a verifiable sufficient condition for the three-class case. In this way, AUC ranking representability can be checked by solving an integer quadratic program.

Acknowledgment

Willem Waegeman is supported by a grant of the “Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen)”.

References

De Baets, B., De Meyer, H., De Schuymer, B., & Jenei, S. (2006). Cyclic evaluation of transitivity of reciprocal relations. Social Choice and Welfare, 26, 217–238.

De Schuymer, B., De Meyer, H., De Baets, B., & Jenei, S. (2003). On the cycle-transitivity of the dice model. Theory and Decision, 54, 164–185.

Fürnkranz, J. (2002). Round robin classification. Journal of Machine Learning Research, 2, 723–747.


Monotone Relabeling of Partially Non-Monotone Data: Restoring Regular or Stochastic Monotonicity

Michael Rademaker [email protected]
Bernard De Baets [email protected]
Department of Applied Mathematics, Biometrics and Process Control, Ghent University, Coupure links 653, 9000 Gent, Belgium

Hans De Meyer [email protected]
Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281 S9, 9000 Gent, Belgium

Abstract

We examine the problem of non-monotonicity in multi-criteria data, and formulate different optimal relabeling algorithms for different domain constraints. More exactly, we examine the case where labels are ordinal or lacking a distance function, and the case where such a distance function is applicable. Furthermore, we discuss the problem of stochastic monotonicity, and give optimal algorithms for relabeling for this type of domain knowledge. Of central importance is the transitivity of the non-monotonicity relation, which permits formulation of each of these relabeling problems as an independent set problem in a comparability graph. Network flow algorithms can then be applied in order to yield optimal solutions in all but one of the problems discussed.

1. Introduction

We consider the problem of noise in the form of non-monotonicity. Three options present themselves given a data set subject to such noise: keep the data set as it is, identify the noisy instances and remove them from the data set, or identify the noisy instances and relabel them. It is this last option we will discuss here. We will examine regular and stochastic monotonicity, and discuss the ways in which these problems can be translated to independent set problems in comparability graphs. Of key importance is the transitivity of the non-monotonicity relation, permitting the formulation of the problem as one solvable by network flow algorithms

and the formulation of L1 loss optimal relabeling algorithms.

2. Multi-criteria data, monotonicity and independent sets

In multi-criteria data, instances can be ordered w.r.t. the scores on the different criteria. One instance is said to be strictly better than another instance if it received a score that is at least as good on each of the criteria, with the preference being strict for at least one criterion. For two such instances, one can expect the better instance to receive a label (coming from the collection of labels L) that is at least as good as that of the worse instance. This background knowledge is known as the monotonicity requirement: an increase in criterion scores cannot result in a decrease in label.

Sometimes, assignment of a single label to each feature vector is too strict. In such cases, a distribution over the different labels is supplied for each feature vector, and regular monotonicity does not apply. Rather, we have stochastic monotonicity: the better feature vector should have a distribution that contains more higher labels than the worse feature vector. This is most easily seen on the basis of the cumulative distributions: the better feature vector has a cumulative distribution that takes higher values at higher labels, while the worse feature vector should take higher values for lower labels (see Section 3).

Of central importance is the fact that both regular and stochastic non-monotonicity are defined as transitive relations: if an instance or feature vector x is non-monotone w.r.t. an instance or feature vector y, it will also be non-monotone w.r.t. all instances or feature vectors which are better than y according to their


Figure 1. Graph representation of a partially non-monotone data set S (numbers represent class labels): (a) non-monotonicity graph, (b) condensed representation.

features, and worse according to their label(s).

The independent set concept is relevant to the discussion of non-monotonicity in defining an optimal clean-up (Rademaker et al., 2006). This problem from graph theory deals with a graph G, comprised of a set of vertices V and a set of edges E. Figure 1(a) can be seen as a graph representation of a data set S with instances S. The set of instances corresponds to the set of vertices of our graph, and the edges we show denote two instances to be non-monotone (Enm) w.r.t. each other. Though finding a maximum independent set is an NP-complete problem in the general case, it is solvable in our case: the transitivity inherent to the non-monotonicity relation means a graph such as the one in Figure 1(a) is actually a comparability graph by definition. The condensed representation of such a graph is shown in Figure 1(b). Through the use of a network flow algorithm, it is possible to determine the maximum independent set in O(|V|³) time (Möhring, 1985). We will show how the problems of restoring regular and stochastic monotonicity through the relabeling of instances can be solved by these methods. Network flow representations are possible and straightforward, but have been omitted here for lack of space.

3. Solving the relabeling problem

Suppose we have the label function d : S → L that returns the label an instance received in the data set, and the label distance function D : L × L → R that quantifies how far apart two labels are. We now translate the minimal L1 relabeling problem to a weighted maximum independent set problem. To this end, we add a number of instances to the data set: we copy each instance |L| − 1 times, and assign each of these copies a new label so that we end up with in total |L| copies, one for each label. The weight each of these instances receives is related to the label distance between the old and the new label. The original copy

Figure 2. Cumulative frequency label distributions: one step better, original distribution and one step worse (labels denoted by letters)

of the instance, let us call it aold, receives a weight of A = D(min(L), max(L)). Other instances receive a weight depending on their label, i.e. for a relabeled instance a′ this weight is A − D(d(a′), d(aold)). If we determine a maximum weighted independent set in this data set, we have a relabeling that minimizes the L1 loss.
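
A minimal sketch of this construction (hypothetical data structures, not the authors' implementation): each instance is expanded into |L| weighted copies, one per label; solving the resulting weighted maximum independent set problem via network flow is left out.

def build_weighted_copies(instances, labels, d, D):
    # instances: list of instance ids; labels: the collection L, ordered from worst to best
    # d: dict mapping instance -> current label; D: label distance function on L x L
    A = D(labels[0], labels[-1])                       # A = D(min(L), max(L))
    copies = []
    for a in instances:
        for lab in labels:                             # |L| copies per instance
            copies.append((a, lab, A - D(d[a], lab)))  # original copy (lab == d[a]) keeps weight A
    return copies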

An analogous technique can be used when stochastic monotonicity is demanded. Each feature vector has an original collection of labels. By stepwise relabeling all instances that received the worst label to the worst-but-one label, we can transitively shift the distribution to the better labels by relabeling instances. An analogous operation is possible to worsen the distribution by relabeling the best instances to the best-but-one label. As evidenced by Figure 2, which contains an original distribution and the next-best and next-worst relabeled version, such a collection of partially relabeled distributions can be ordered from best to worst, with the original distribution obviously being better than all those that have been constructed to be worse, and worse than those that have been constructed to be better. Weights can be assigned on the basis of the number of instances that still carry their original label (as a 0/1 loss equivalent), or on the basis of the number of times an instance needed to be relabeled in order to end up with the distribution in question (as an L1 loss equivalent). Illustrative examples will be provided, and extensions to simultaneously minimize the 0/1 loss and the L1 loss will be formulated.

References

Möhring, R. (1985). Algorithmic aspects of comparability graphs and interval graphs. Proceedings of the NATO Advanced Study Institute on Graphs and Order (pp. 41–101). Banff, Canada.

Rademaker, M., De Baets, B., & De Meyer, H. (2006). On the role of maximal independent sets in cleaning data sets for supervised ranking. Proceedings of the 2006 IEEE International Conference on Fuzzy Systems (pp. 7810–7815). Vancouver, BC, Canada.


Learning in Kernelized Output Spaces with Tree-Based Methods

Pierre Geurts [email protected]
Louis Wehenkel [email protected]

University of Liège, Department of EE and CS & GIGA-Research, Institut Montefiore, Sart Tilman B28, 4000 Liège, Belgium

Florence d'Alché-Buc [email protected]

University of Evry, IBISC, 523 Place des Terrasses de l'Agora, 91000 Evry, France

Abstract

The availability of structured data sets, from XML documents to biological structures such as sequences or graphs, has prompted the development of new learning algorithms able to handle complex structured output spaces. Motivated by the success of kernel methods for handling complex input spaces, one approach to handle complex output problems is to embed the outputs into a kernelized space and develop algorithms that can work in such a space by exploiting the kernel trick. In this abstract, we present our recent work with tree-based methods in this domain.

1. Learning in kernelized output spaces

The general problem of supervised learning may be formulated as follows: from a learning sample LS = {(xi, yi) | i = 1, . . . , NLS} with xi ∈ X and yi ∈ Y, find a function f : X → Y that minimizes the expectation of some loss function over the joint distribution of input/output pairs:

Ex,y{ℓ(f(x), y)}. (1)

Let us suppose now that we have a kernelized output space, i.e. an output space Y endowed with a kernel k : Y × Y → IR, and let us denote by φ : Y → H the feature map defined by k. In this paper, we consider problems where the loss function is defined as follows:

ℓ(y1, y2) = ||φ(y1) − φ(y2)||² = k(y1, y1) + k(y2, y2) − 2k(y1, y2),    (2)

which depends only on the output kernel.

The resolution of this problem usually involves two steps (see Figure 1):


Figure 1. Learning with kernelized outputs

1. Learn an intermediate function fφ : X → H that maps each input vector into the Hilbert space and minimizes the following loss:

Ex,y{||fφ(x) − φ(y)||2}. (3)

2. From a prediction fφ(x) in H, get a prediction f(x) in the original output space Y by solving:

f(x) = arg min_{y ∈ Y} ||φ(y) − fφ(x)||².    (4)

For this solution to be practically feasible, we need to be able to write the algorithm from pairwise dot-products only. Solutions for the learning stage are usually obtained by using the kernel trick at the output of some standard supervised learning method that can handle a vectorial output. For example, Cortes et al. (2005) propose a kernelization of the output of kernel ridge regression. The methods that we present below are obtained by kernelizing the output of (multiple output) regression trees and gradient boosting. The optimization problem (4) is the so-called preimage problem, whose solution is kernel specific.

Like other kernel methods, this formulation of the supervised learning problem has the advantage of decoupling the design of the algorithm from the design of the kernel. Given the extensive literature that exists on kernels, it allows one to develop generic algorithms that can potentially handle a large range of problems in terms of loss functions and output spaces.


2. Output kernel trees

In (Geurts et al., 2006), we propose a kernelization of the output of regression trees that we called OK3, for output kernel trees. The idea of standard regression trees is to recursively split the learning sample with binary tests based on the input variables, trying at each split to reduce as much as possible the (empirical) variance of the output in the left and right subsamples of learning cases corresponding to that split. Our kernelization of this method is based on replacing this variance by the average squared distance from the center of mass in H, which can be computed from kernel values only:

(1/N) ∑_{i=1}^{N} ||φ(yi) − (1/N) ∑_{j=1}^{N} φ(yj)||² = (1/N) ∑_{i=1}^{N} k(yi, yi) − (1/N²) ∑_{i,j=1}^{N} k(yi, yj)
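
As a purely illustrative sketch (not the OK3 implementation), this kernel-based variance and the corresponding split score can be computed from the output Gram matrix K (with entries k(yi, yj)) of the samples reaching a node:

import numpy as np

def kernel_variance(K):
    # Average squared distance to the centre of mass in H, from kernel values only:
    # (1/N) sum_i k(y_i, y_i) - (1/N^2) sum_{i,j} k(y_i, y_j).
    N = K.shape[0]
    return np.trace(K) / N - K.sum() / (N * N)

def split_score(K, left_mask):
    # Variance reduction of a candidate binary split, as in standard regression trees.
    N = K.shape[0]
    right_mask = ~left_mask
    n_l, n_r = left_mask.sum(), right_mask.sum()
    var_l = kernel_variance(K[np.ix_(left_mask, left_mask)]) if n_l else 0.0
    var_r = kernel_variance(K[np.ix_(right_mask, right_mask)]) if n_r else 0.0
    return kernel_variance(K) - (n_l / N) * var_l - (n_r / N) * var_r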

Predictions at tree leaves are then written as linear combinations of learning sample outputs, which makes expression (4) again computable from kernel values only.

The use of ensemble methods is usually necessary to make tree-based methods competitive with other methods in terms of accuracy. The extension of ensemble methods based on randomization, such as bagging or random forests, to kernelized output spaces is straightforward, as these methods only rely on score computations to grow the trees and combine predictions by simply averaging them. The extension of boosting methods is, however, generally not trivial, as these methods manipulate the outputs. In (Geurts et al., 2007b), we show however that a particular type of boosting algorithm, gradient boosting with square loss, can also be modified to handle a kernelized output space. Just like in standard classification and regression settings, the use of ensemble methods usually improves accuracy considerably with respect to single output kernel trees.

3. Supervised graph inference

One application of these methods is supervised graph inference. In this problem, we assume that we have a set of vertices {vi, i = 1, . . . , m} that are related to each other by a graph. Each vertex is furthermore described by an input feature vector x(vi) ∈ X and the goal is to find a function g(x(v), x(v′)) : X × X → {0, 1} that predicts as well as possible the existence of a connection between two (potentially new) vertices v and v′ described by their input vectors.

Our solution is based on a kernel embedding of the graph (e.g., by a diffusion kernel) and the application of output kernel trees in the resulting kernelized output space. This application gives a model fφ(.) and edge predictions are then obtained by thresholding the dot-product between the predictions of this model for two (new) input vectors:

g(x, x′) = 1(〈fφ(x), fφ(x′)〉 > kth).

Since fφ is expressed as a linear combination of outputs from the learning sample, this latter expression can again be computed from kernel values only.
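
A small sketch of this edge-prediction step (illustrative only; the weight vectors w(x) produced by the tree model and the threshold kth are assumptions): since fφ(x) = ∑i wi(x) φ(yi), the dot product reduces to a bilinear form in the output kernel matrix.

import numpy as np

def predict_edge(w_x, w_xprime, K, k_th):
    # w_x, w_xprime: weights over learning-sample outputs for the two input vectors
    # K: output kernel matrix with entries k(y_i, y_j) on the learning sample
    score = w_x @ K @ w_xprime     # <f_phi(x), f_phi(x')> computed from kernel values only
    return int(score > k_th)       # 1 if an edge is predicted between the two vertices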

In (Geurts et al., 2007a), we have applied this methodology to the supervised inference of two biological networks between yeast genes (protein-protein interaction network and enzyme network) from various information about these genes (microarray expression, localization, and phylogenetic information). Our approach yields competitive results with respect to existing methods.

Acknowledgments

PG is a research associate of the FRS-FNRS (Belgium). This work was done while he was a postdoc at IBISC (Evry, France) with support of the CNRS (France). FAB's research has been funded by Genopole (France). This work presents research results of the Belgian Network BIOMAGNET (Bioinformatics and Modeling: from Genomes to Networks), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office.

References

Cortes, C., Mohri, M., & Weston, J. (2005). A general regression technique for learning transductions. Proceedings of ICML 2005 (pp. 153–160).

Geurts, P., Touleimat, N., Dutreix, M., & d'Alché-Buc, F. (2007a). Inferring biological networks with output kernel trees. BMC Bioinformatics, 8(Suppl 2), S4.

Geurts, P., Wehenkel, L., & d'Alché-Buc, F. (2006). Kernelizing the output of tree-based methods. Proceedings of the 23rd International Conference on Machine Learning (ICML 2006) (pp. 345–352).

Geurts, P., Wehenkel, L., & d'Alché-Buc, F. (2007b). Gradient boosting for kernelized output spaces. Proceedings of ICML 2007.


Selective Inductive Transfer

Beau Piccart [email protected]
Jan Struyf [email protected]
Hendrik Blockeel [email protected]

Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium

Abstract

Multi-target models, which predict multiple target variables simultaneously, may predict some of the targets more accurately, and other targets less accurately, than a single-target model. This raises the question whether it is possible to find, for a given main target, a subset of the other targets that, when combined with the main target in a multi-target model, results in the most accurate model for the main target. We propose Selective Inductive Transfer, an algorithm that automatically finds such a subset.

1. Introduction

We consider simultaneous prediction of multiple variables, which is known as multi-target or multi-objective prediction. In this setting, the input is associated with a vector of target variables, and all of them need to be predicted as accurately as possible. Multi-target prediction is encountered, e.g., in ecological modelling, where the domain expert is interested in (simultaneously) predicting the frequencies of different organisms in agricultural soil or river water.

It has been shown that multi-target models can be more accurate than predicting each target individually with a separate single-target model (Caruana, 1997). This is a consequence of the fact that when the targets are related (e.g., if they represent frequently co-occurring organisms in the ecological modelling applications mentioned above), they can carry information about each other; the single-target approach is unable to exploit that information, while multi-target models naturally exploit it. This is a form of inductive transfer: the information a target carries about the other targets is transferred to those other targets.

Multi-target models do not, however, always lead to more accurate predictions. For a given target variable, the variable's single-target model may be more accurate

than the multi-target model. That is, inductive transfer from other variables can be beneficial, but it may also be detrimental to accuracy. Therefore, the subset of targets that, when combined with a given target (the main target) in a multi-target model, results in the most accurate model for the main target, may be non-trivial, i.e., different from the empty set and from the set of all targets. We call this set the support set for the main target.

2. Selective Inductive Transfer

We propose Selective Inductive Transfer (SIT), a greedy algorithm that approximates the support set for a given main target, and that works as follows (a minimal sketch in code follows the list below).

1. Initialize the support set to the empty set.

2. Consider the multi-target model with as targets the main target and the current support set. Use this model to predict only the main target and estimate the resulting accuracy. (We use 10-fold cross-validation to estimate accuracy.)

3. Add each candidate support target in turn to the current support set and estimate the main target's accuracy for each resulting multi-target model.

4. Select the candidate support target (if any) that yielded the largest increase in accuracy over the accuracy of the current support set and add it permanently to the support set. If no candidate improves accuracy, then return the current support set.

5. Go to step 2.
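
A minimal sketch of these steps (hypothetical names; evaluate(main, support) stands for the cross-validated accuracy of a multi-target model on the main target, and the actual implementation in Clus may differ):

def selective_inductive_transfer(main_target, candidates, evaluate):
    # evaluate(main_target, support_set) -> cross-validated accuracy for the main
    # target of a multi-target model trained on {main_target} | support_set.
    support = set()
    best_acc = evaluate(main_target, support)
    while True:
        gains = {t: evaluate(main_target, support | {t})
                 for t in candidates if t not in support}
        if not gains:
            return support
        best_t, best_t_acc = max(gains.items(), key=lambda kv: kv[1])
        if best_t_acc <= best_acc:        # no candidate improves accuracy: stop
            return support
        support.add(best_t)               # permanently add the best candidate
        best_acc = best_t_acc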

SIT has a number of advantages over related methods. First, other algorithms that exploit transfer selectively, such as the Task Clustering algorithm of Thrun and O'Sullivan (1996), often incorrectly assume transfer to be symmetric. This results in suboptimal models. SIT does not assume transfer to be symmetric; if a is a


support target for main target b, then b is not necessarily a support target for a. A second shortcoming of other methods, such as the η-MTL algorithm of Silver and Mercer (1996), is that they rely on heuristics, such as the linear correlation of the targets, to approximate transfer. Such heuristics may poorly approximate transfer. SIT measures transfer empirically (using cross-validation). Therefore, it does not suffer from this problem. A third advantage of SIT is that it is a general method that can be combined with any multi-target learner.

3. Experimental Evaluation

The aim of our experiments is to test to what extent SIT, for a given main target, succeeds in finding a good set of support targets. To this end, we compare SIT to two common baseline models: a single-target model for the main target (ST), and a multi-target model that includes all targets (MT).

SIT has been implemented in the decision tree induction system Clus, which also implements single- and multi-target regression trees (Blockeel et al., 1998), and is available as open source software from http://www.cs.kuleuven.be/~dtai/clus/.

We compare, for each target variable of 6 ecological modelling multi-target regression datasets, the predictive performance of a traditional single-target regression tree (STRT), a tree constructed by SIT with all other targets as candidate support targets, and a multi-target tree including all targets (MTRT).

Table 1 shows, for each method, the cross-validated Pearson correlation, averaged over all targets. (This measure is common in ecological modelling; we also use it as accuracy estimate in the internal cross-validation performed by SIT.) SIT performs significantly better than STRT for 5 out of 6 datasets and never performs significantly worse. It performs comparably to MTRT or significantly better than MTRT in 5 out of 6 datasets. For one dataset (Soil 2), SIT still significantly outperforms STRT, yet it is significantly worse than MTRT. Here, SIT selects too few of the targets (it selects 4.2 out of the 38 targets on average in the different folds). The most likely cause is the large number of targets. SIT implements a greedy hill-climbing search that stops too early and ends up in a local optimum for this dataset.

4. Conclusions & Further Work

We proposed Selective Inductive Transfer (SIT), an algorithm that searches for the set of support targets

Table 1. 10-fold cross-validated Pearson correlation averaged over all targets and five runs. •, ◦ denote a statistically significant improvement or degradation of SIT or MTRT over STRT. Significance is determined by the corrected resampled t-test (Nadeau & Bengio, 2003) with significance level 0.01. ∗, † denote a statistically significant improvement or degradation, respectively, of SIT over MTRT.

Dataset    STRT         SIT              MTRT
Sigmea     0.63±0.40    0.64±0.40        0.64±0.40
Soil 1     0.60±0.18    0.63±0.13•       0.64±0.13•
Soil 2     0.34±0.40    0.41±0.35•†      0.48±0.29•
Soil 3     0.19±0.23    0.24±0.23•       0.26±0.22•
Water 1    0.26±0.17    0.29±0.15•∗      0.27±0.15
Water 2    0.37±0.27    0.41±0.25•∗      0.39±0.23

that, when predicted together with the main target in a multi-target model, maximally improves the predictive performance with regard to the main target. Experiments show that, in all but one dataset, SIT finds a good set of support targets.

In further work, we plan to compare SIT to other methods that exploit transfer selectively and investigate alternative search strategies to the greedy hill-climbing approach implemented by SIT.

Acknowledgments: Beau Piccart is supported by project G.0255.08 “Efficient microprocessor design using machine learning” funded by the Research Foundation - Flanders (FWO-Vlaanderen). Jan Struyf and Hendrik Blockeel are post-doctoral fellows of the Research Foundation - Flanders (FWO-Vlaanderen).

References

Blockeel, H., De Raedt, L., & Ramon, J. (1998). Top-down induction of clustering trees. 15th Int'l Conf. on Machine Learning (pp. 55–63).

Caruana, R. (1997). Multitask learning. Mach. Learn., 28, 41–75.

Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Mach. Learn., 52, 239–281.

Silver, D., & Mercer, R. (1996). The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connect. Sci., 8, 277–294.

Thrun, S., & O'Sullivan, J. (1996). Discovering structure in multiple learning tasks: The TC algorithm. 13th Int'l Conf. on Machine Learning (pp. 489–497).


Vision as Inference in a Hierarchical Markov Network

Justus Piater [email protected]
Fabien Scalzo [email protected]
Renaud Detry [email protected]

University of Liège, INTELSIG Group, Grande Traverse 10, 4000 Liège – Sart Tilman, Belgium

Abstract

We present a snapshot of our current work on learning visual object models in the form of Markov networks, point out interesting relations to biological vision, and illustrate their performance on diverse tasks such as object recognition and 3D pose estimation.

1. Introduction

Cortical visual processing involves both bottom-up propagation of perceptual stimuli and modulation by top-down signals. Lee and Mumford (2003) suggested that the visual processing stream from the LGN via V1, V2 and V4 to IT might perform Bayesian inference within an undirected Markov chain. A cortical layer xi (say, V1) computes its activity P(xi | xk≠i) = P(xi | xi−1, xi+1) = P(xi−1 | xi) P(xi | xi+1)/Zi, that is, a posterior probability distribution given bottom-up input from xi−1 (LGN), under top-down priors xi+1 (V2). The parameters of the Markov network are specified via pairwise compatibility potentials ψ, where P(xi | xi−1, xi+1) = ψ(xi−1, xi) ψ(xi, xi+1)/Zi. These potentials must be learned from experience with the world; Lee and Mumford (2003) do not comment on how this might be done. A crucial aspect of this model is that ambiguities at low levels should persist and propagate upwards until they can be resolved by integrating larger-scale evidence or top-down expectations. As a biologically plausible implementation of inference with arbitrary, possibly multi-modal probability densities, Lee and Mumford (2003) suggest belief propagation using particle representations. Moreover, they provide a wealth of neurophysiological and psychophysical evidence for such a computational model.
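
As a toy, discrete illustration of this factorization only (the model described here actually works with particle-based densities), the layer posterior P(xi | xi−1, xi+1) ∝ ψ(xi−1, xi) ψ(xi, xi+1) can be computed as:

import numpy as np

def layer_posterior(psi_below, psi_above, state_below, state_above):
    # psi_below[a, b]: compatibility between layer i-1 state a and layer i state b
    # psi_above[b, c]: compatibility between layer i state b and layer i+1 state c
    unnorm = psi_below[state_below, :] * psi_above[:, state_above]
    return unnorm / unnorm.sum()   # P(x_i | x_{i-1}, x_{i+1}), normalized by Z_i

# Example: 3 possible states at layer i, with the neighbouring layers' states fixed.
psi_lo = np.array([[2.0, 1.0, 0.5], [0.5, 2.0, 1.0]])     # 2 states below, 3 at layer i
psi_hi = np.array([[1.0, 0.5], [0.5, 2.0], [2.0, 1.0]])   # 3 at layer i, 2 above
print(layer_posterior(psi_lo, psi_hi, state_below=0, state_above=1))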

2. Principle

We are currently developing representations and methods for visual inference that constitute, at least at the vague level of detail given above, a working computer

implementation of central aspects of Lee and Mumford's model. Without making explicit reference to specific cortical layers, our approach is based on a Markov network such as the didactic example shown in the figure, with vertices arranged in layers corresponding to those of Lee and Mumford's. Each vertex is a random variable representing the spatial probability density of the presence of a feature. At level 0, a primitive feature x0,j is the spatial probability density of a given type of locally observable feature. It is inferred from local image appearance yj via its observation potential φ(x0,j, yj). At higher levels, a compound feature (recursively) represents the presence of both of its children, and the compatibility potentials ψ represent pairwise spatial relationships. For example, in the figure, feature x3,1 represents the spatial probability density of features x2,1 and x2,3 occurring in the relative configuration encoded by ψ(x2,1, x3,1) and ψ(x2,3, x3,1).

We construct such networks, including their topology and compatibility potentials, using unsupervised learning. The input layer y, fixed at the outset, is successively exposed to visual stimuli. The system records the occurrences of known features (primitive or compound), as well as the spatial relations between them. When reoccurring constellations of features xi,a and xi,b are detected, the observed non-uniform spatial co-occurrence probability densities are turned into the compatibility potentials ψ(xi,a, xi+1,c) and ψ(xi,b, xi+1,c) with respect to a newly-instantiated feature xi+1,c, located between them.


3. Results

In practice, we arrive at networks of on the order of 10 layers and between 10 and 100 vertices per layer. What they represent depends on the training data. For example, we have trained networks that represent individual objects, from a fixed viewpoint or from a variety of viewpoints. If a network is trained on images of several distinct objects, higher-level vertices will specialize to become view-tuned cells of specific objects. If objects share common parts, these are likely to be represented by the same lower-level subgraphs.

Networks learned in this way can be instantiated on a given input stimulus by computing the observations y, optionally instantiating higher-level vertices according to prior expectations, and performing nonparametric belief propagation using particle representations (similar to Sudderth et al., 2003) throughout the network until convergence. Thanks to bidirectional propagation, the network converges to a globally coherent interpretation of the scene, where each vertex xi,j contains its best possible interpretation of its children, under the priors provided by the parents. We have used this procedure successfully for object detection, recognition, and pose estimation, in 2D and in 3D, as well as for inference of occluded object parts.

Acknowledgments

This work was supported by the Belgian National Fund for Scientific Research (FNRS) and by the EU Cognitive Systems project IST-FP6-IP-027657 PACO-PLUS.

References

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America, 20, 1434–1448.

Sudderth, E. B., Ihler, A. T., Freeman, W. T., & Willsky, A. S. (2003). Nonparametric belief propagation. Computer Vision and Pattern Recognition (pp. 605–612).


Semi-supervised Classification in Graphs using Bounded Random Walks

Semi-supervised learning, large graphs, betweenness measure, passage times

Jerome Callut [email protected]
Kevin Françoisse [email protected]
Marco Saerens [email protected]

UCL Machine Learning Group (MLG), Louvain School of Management, IAG, Université catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium

Pierre Dupont [email protected]

UCL Machine Learning Group (MLG), Department of Computing Science and Engineering, INGI, Université catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium

Abstract

This paper describes a novel technique, called D-walks, to tackle semi-supervised classification problems in large graphs. We introduce here a betweenness measure based on passage times during random walks of bounded lengths in the input graph. The class of unlabeled nodes is predicted by maximizing the betweenness with labeled nodes. This approach can deal with directed or undirected graphs with a linear time complexity with respect to the number of edges and the maximum walk length considered. Preliminary experiments on the CORA database show that D-walks outperforms NetKit (Macskassy & Provost, 2007) as well as the algorithm of Zhou et al. (2005), both in classification rate and computing time.

1. Introduction

This paper is concerned with semi-supervised classification of nodes in a graph. Given an input graph with some nodes being labeled, the problem is to predict the missing node labels. This problem has numerous applications such as classification of individuals in social networks, categorization of linked documents, or protein function prediction, to name a few.

Several approaches have been proposed to tackle semi-supervised classification problems in graphs. Kernel methods (Zhou et al., 2005; Tsuda & Noble, 2004) embed the nodes of the input graph into a Euclidean

feature space where a classifier, such as an SVM, can be estimated. Despite their good predictive performance, these techniques cannot easily scale up to large problems due to their high time complexity. NetKit is an alternative relational learning approach (Macskassy & Provost, 2007). It has a lower computational complexity but is conceptually less simple and may require fine-tuning several of its components.

The approach proposed in this paper, called D-walks, relies on random walks performed on the input graph seen as a Markov chain. More precisely, a betweenness measure, based on passage times during random walks of bounded length, is derived for each class (or label category). Unlabeled nodes are assigned to the category for which the betweenness is the highest. The D-walks approach has the following properties: (i) it has a linear time complexity with respect to the number of edges and the maximum walk length considered; such a low complexity makes it possible to deal with very large graphs, (ii) it can handle directed or undirected graphs, (iii) it can deal with multi-class problems, and (iv) it has a single hyper-parameter that can be tuned efficiently.

2. Discriminative random walks

We are given an input graph G containing a set of nodes N and edges E. The (possibly weighted) adjacency matrix is denoted A. The graph G is assumed partially labeled. The nodes in the labeled set L ⊂ N are assigned to a category from a discrete set Y. The unlabeled set is defined as U = N \ L.

Random walks in a graph can be modeled by a


discrete-time Markov chain (MC) describing the sequence of nodes visited during the walk. Each state of the Markov chain corresponds to a distinct node of the graph. The MC transition probability matrix is simply given by P = D−1A, with D the diagonal matrix of node degrees. We consider discriminative random walks (D-walks, for short) in order to define a betweenness measure used for classifying unlabeled nodes.

Definition 1 (D-walk) Given a MC defined on the state set N, a class y ∈ Y and a discrete length l > 1, a D-walk is a sequence of states q0, . . . , ql such that y_{q0} = y_{ql} = y and y_{qt} ≠ y for all 0 < t < l.

The notation D^y_l refers to the set of all D-walks of length l, starting and ending in a node of class y. We also consider D^y_{≤L}, referring to all D-walks up to a given length L. The betweenness function B_L(q, y) measures how much a node q ∈ U is located “between” nodes of class y ∈ Y. The betweenness B_L(q, y) is formally defined as the expected number of times the node q is reached during D^y_{≤L}-walks.

Definition 2 (D-walk betweenness) Given an unlabeled node q ∈ U and a class y ∈ Y, the D-walk betweenness function U × Y → R+ is defined as follows: B_L(q, y) ≜ E[pt(q) | D^y_{≤L}], where pt(q) is the passage times function N → R+ counting the number of times a node q has been visited.

This betweenness measure is related to the one proposed by Newman in (Newman, 2005). Our measure is however relative to a specific class y rather than to the whole graph. It also considers random walks up to a given length instead of unbounded walks. Bounding the walk length has two major benefits: (i) better classification results are generally obtained than with unbounded walks, and (ii) the betweenness measure can be computed very efficiently (in Θ(|E|L)) using forward and backward recurrences, similar to those used in the Baum-Welch algorithm for HMM parameter estimation. Finally, an unlabeled node q ∈ U is assigned to the class with the highest betweenness.
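
The forward-backward computation itself is not given in this abstract; as an illustration of the quantity being computed, the naive Monte Carlo sketch below estimates B_L(q, y) by simulating bounded walks (all names are hypothetical, and start nodes are drawn uniformly for simplicity).

import numpy as np

def dwalk_betweenness_mc(P, labels, y, L, n_walks=10000, seed=0):
    # P: row-stochastic transition matrix (n x n); labels: dict node -> class for labeled nodes.
    # Estimates B_L(q, y): expected passage times through each node during walks that start
    # and end in class-y nodes, do not visit a class-y node in between, and have length <= L.
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    class_y_nodes = [v for v, c in labels.items() if c == y]
    counts, n_dwalks = np.zeros(n), 0
    for _ in range(n_walks):
        q = rng.choice(class_y_nodes)             # start in a class-y node
        visits = np.zeros(n)
        for _ in range(L):
            q = rng.choice(n, p=P[q])
            if labels.get(q) == y:                # walk re-enters class y: a valid D-walk
                counts += visits
                n_dwalks += 1
                break
            visits[q] += 1                        # passage time of an intermediate node
    return counts / max(n_dwalks, 1)              # betweenness estimate for every node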

3. Experiments

We report here preliminary experiments performed on the Cora dataset (Macskassy & Provost, 2007) containing 3582 nodes classified under 7 categories. As this graph is fully labeled, node labels were randomly removed and used as test set. More precisely, we have considered 9 different proportions of labeled nodes in the graph: {0.1, 0.2, . . . , 0.9} and for each labeling rate, 10 random deletions were performed.

Comparative performances obtained with NetKit (Macskassy & Provost, 2007) and with the approach of Zhou et al. (2005) are also provided. The hyper-parameters of each approach have been tuned using ten-fold cross-validation. Figure 1 shows the correct classification rate on test data obtained by each approach for increasing labeling rates. The D-walks approach clearly outperforms its competitors on these data. The D-walks approach is also the fastest method. It typically requires 1.5 seconds of CPU time¹ for every graph classification, including the auto-tuning of its hyper-parameter L. NetKit takes about 4.5 seconds per graph classification and our implementation of the Zhou et al. approach typically takes several minutes. Large graphs (several millions of edges) were also successfully classified in a few minutes with D-walks, while neither NetKit nor the Zhou et al. method could be applied on such large graphs.

Figure 1. Classification rate of D-walks and two competing methods (NetKit, Zhou et al.) on the Cora dataset, as a function of the labeling rate. Error bars report standard deviations over 10 independent runs.

References

Macskassy, S. A., & Provost, F. (2007). Classification in networked data: A toolkit and a univariate case study. J. Mach. Learn. Res., 8, 935–983.

Newman, M. (2005). A measure of betweenness centrality based on random walks. Social Networks, 27, 39–54.

Tsuda, K., & Noble, W. S. (2004). Learning kernels from biological networks by maximizing entropy. Bioinformatics, 20, 326–333.

Zhou, D., Huang, J., & Schölkopf, B. (2005). Learning from labeled and unlabeled data on a directed graph. ICML '05: Proceedings of the 22nd International Conference on Machine Learning (pp. 1036–1043). New York, NY, USA: ACM.

¹ Intel Core 2 Duo 2.2 GHz with 2 GB of virtual memory.


The Sum-Over-Paths Covariance: A novel covariance measure between nodes of a graph

Graph mining, kernel on a graph, correlation measure, semi-supervised classification.

Amin Mantrach [email protected]
Institut de Recherches Interdisciplinaires et de Développements en Intelligence Artificielle (IRIDIA-CoDE), Université Libre de Bruxelles, B-1050 Brussels, Belgium

Marco Saerens [email protected]
Luh Yen [email protected]
UCL Machine Learning Group (MLG), Louvain School of Management, Université catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium

Abstract

This work introduces a link-based covariance measure between the nodes of a weighted, directed graph where a cost is associated to each arc. To this end, a probability distribution on all possible paths through the network is defined by minimizing the sum of the expected costs between all pairs of nodes while fixing the total relative entropy spread in the network. This results in a probability distribution on the set of paths such that long paths occur with a low probability while short paths occur with a high probability. The covariance measure is then computed according to this probability distribution: two nodes will be highly correlated if they often co-occur together on the same – preferably short – paths. The resulting covariance matrix between nodes (say n in total) is a Gram matrix and therefore defines a valid kernel matrix on the graph; it is obtained by inverting an n × n matrix. The proposed model could be used for various graph mining tasks such as computing betweenness centrality, semi-supervised classification, visualization, etc., as shown in the experimental section.

1. The sum-over-paths covariance measure

Basic notations and definitions. Consider a weighted directed graph or network, G, with a set of n nodes V (or vertices) and a set of arcs E (or edges). The graph is supposed to be strongly connected. To each arc linking node k and node k′, a number c_{kk′} > 0 is associated, representing the immediate cost of following this arc. The cost matrix C is the matrix containing the immediate costs c_{kk′}. In a first step, a random walk on this graph will be defined. The choice to follow an arc will be made according to transition probabilities representing the probability of jumping from a node k to another node k′ belonging to the set S(k) of neighboring nodes (successors S) that can be reached from node k. The transition probabilities defined on each node k will be denoted as p_{kk′} = P(k′|k) with k′ ∈ S(k). Furthermore, P will be the matrix containing the transition probabilities p_{kk′} as elements. If there is no arc between k and k′, we simply consider that c_{kk′} takes a large value, denoted by ∞; in this case, the corresponding transition probability will be set to zero, p_{kk′} = 0. The natural random walk on the graph will be defined by p^{ref}_{kk′} = c_{kk′}^{-1} / Σ_{k′} c_{kk′}^{-1} and the corresponding transition-probabilities matrix P^{ref}. In other words, in this natural random walk, the random walker chooses to follow a link with a probability proportional to the inverse of the immediate cost, therefore favoring links having a low cost. These transition probabilities will be used as reference probabilities later; hence the superscript ref.

Definition of the probability distribution on the set of paths. Let us first consider two nodes, an initial node i and a destination node j. We define the (possibly infinite) set of paths (including cycles) connecting these two nodes as R_{ij} = {℘_{r_ij}}. Thus, ℘_{r_ij} is path number r_ij, with path index r_ij ranging from 1 to ∞. Let us further define the set of all paths R = ∪_{ij} R_{ij} and a probability distribution on this set R representing the probability P(℘_{r_ij}) of following the path numbered r_ij. The main idea will be to use the probability distribution P(℘_{r_ij}) minimizing the expected cost-to-go among all the probability distributions having a fixed relative entropy with respect to the natural random walk on the graph.

Let us also denote as E_{r_ij} the total cost associated to the path r_ij, referred to as the energy associated to that path. We assume that the total cost associated to a path is additive, i.e. E(℘_{r_ij}) = Σ_{t=1}^{t_f} c_{k_{t-1} k_t}, where k_0 = i is the initial state and k_{t_f} = j is the destination state; t_f is the time (number of steps) needed to reach node j. Here, we assume that ℘_{r_ij} is a valid path from the initial state to the destination state, that is, every c_{k_{t-1} k_t} ≠ ∞ along that path.


We now have to find the path probabilities minimizing the sum of the expected energy for reaching node j when starting from i. In other words, we are seeking path probabilities P(℘_{r_ij}) minimizing Σ_{i,j=1}^{n} Σ_{r_ij=1}^{∞} P(℘_{r_ij}) E(℘_{r_ij}) subject to the constraint −Σ_{r_ij=1}^{∞} P(℘_{r_ij}) ln(P(℘_{r_ij})/P^{ref}(℘_{r_ij})) = J_0, where P^{ref}(℘_{r_ij}) represents the probability of following the path ℘_{r_ij} when walking according to the natural random walk, i.e. when using transition probabilities p^{ref}_{kk′}. Here, J_0 is provided a priori by the user, according to the desired degree of randomness he is willing to concede. By defining the Lagrange function

L = Σ_{i,j=1}^{n} Σ_{r_ij=1}^{∞} P(℘_{r_ij}) E(℘_{r_ij})
  + λ [ Σ_{i,j=1}^{n} Σ_{r_ij=1}^{∞} P(℘_{r_ij}) ln( P(℘_{r_ij}) / P^{ref}(℘_{r_ij}) ) + J_0 ]
  + μ [ Σ_{i,j=1}^{n} Σ_{r_ij=1}^{∞} P(℘_{r_ij}) − 1 ],    (1)

we obtain the following probability distribution

P(℘_{r_ij}) = P^{ref}(℘_{r_ij}) exp[−θE(℘_{r_ij})] / Σ_{i,j=1}^{n} Σ_{r_ij=1}^{∞} P^{ref}(℘_{r_ij}) exp[−θE(℘_{r_ij})]    (2)

where θ = 1/λ. Thus, as expected, short paths (having small E(℘_{r_ij})) are favoured in that they have a large probability of being followed. When θ → ∞, only shortest paths are considered in R, while when θ → 0 all paths corresponding to the natural random walk are taken into account.

Definition of the covariance measure. We now show that the sum-over-paths covariance measure can be computed from a key quantity, defined as

Z = Σ_{i,j=1}^{n} Σ_{r_ij=1}^{∞} P^{ref}(℘_{r_ij}) exp[−θE(℘_{r_ij})],    (3)

which corresponds to the partition function in statistical physics [2]. It can be shown that the partition function can easily be computed from the cost matrix by inverting a matrix [4].
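As a numerical illustration, the following Python/NumPy sketch computes Z on a small made-up graph. It assumes the closed form suggested by the randomized shortest-path framework of [4], in which the path sums of Eq. (3) are obtained from the matrix W with entries p^{ref}_{kk′} exp(−θ c_{kk′}) through (I − W)^{-1}; the three-node graph, its costs and the value of θ are arbitrary example values.

    import numpy as np

    # Toy strongly connected 3-node graph; an infinite cost means "no arc".
    C = np.array([[np.inf, 1.0, 4.0],
                  [1.0, np.inf, 1.0],
                  [4.0, 1.0, np.inf]])
    theta = 1.0

    # Natural random walk: transition probabilities proportional to 1/c_kk'.
    inv_c = np.where(np.isinf(C), 0.0, 1.0 / C)
    P_ref = inv_c / inv_c.sum(axis=1, keepdims=True)

    # W_kk' = p_ref_kk' * exp(-theta * c_kk'); exp(-inf) = 0 removes missing arcs.
    W = P_ref * np.exp(-theta * C)

    # Assumption (randomized shortest-path framework [4]): the sum over all paths
    # from i to j of P_ref(path) * exp(-theta * E(path)) is entry (i, j) of
    # (I - W)^{-1}, so the partition function of Eq. (3) is the sum of all entries.
    Z = np.linalg.inv(np.eye(C.shape[0]) - W).sum()
    print(Z)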

Indeed, the expected number of times the link k → k′ and the link l → l′ are traversed together along a path can be computed by taking the second-order derivative

η(k, k′; l, l′) = (1/θ²) ∂²(ln Z) / (∂c_{ll′} ∂c_{kk′})
  = Σ_{i,j=1}^{n} Σ_{r_ij=1}^{∞} P(℘_{r_ij}) δ(r_ij; k, k′) δ(r_ij; l, l′)
    − [ Σ_{i,j=1}^{n} Σ_{r_ij=1}^{∞} P(℘_{r_ij}) δ(r_ij; k, k′) ] [ Σ_{i,j=1}^{n} Σ_{r_ij=1}^{∞} P(℘_{r_ij}) δ(r_ij; l, l′) ]    (4)

where δ(r_ij; k, k′) indicates the number of times the link k → k′ is present in path number r_ij, and thus the number of times the link is traversed. This last quantity clearly corresponds to the covariance between link k → k′ and link l → l′.

Now, the covariance measure between node k′ and node l′ is simply defined as

cov(k′, l′) = Σ_{k,l=1}^{n} η(k, k′; l, l′),    (5)

which corresponds to the main quantity of interest. This quantity can thus easily be computed from the partition function (the calculations are similar to those appearing in [4]).

2. Preliminary experiments

Notice that the experiments investigating the betweenness and the visualization capabilities are not reported in this paper because of the lack of space.

Semi-supervised classification. Preliminary experiments on semi-supervised classification have been performed on the IMDb dataset (described in [3]). We compare the proposed sum-over-paths kernel to the regularization kernel proposed by [5], the commute-time kernel [1] and the diffusion map kernel [1] in a semi-supervised classification task of unlabeled nodes. An alignment procedure is used in order to classify the unlabeled nodes, as described in [5], for each of the four studied kernels. The hyper-parameters of each algorithm have been tuned by using a 5-fold cross-validation and the best value has been retained for the estimation of the classification rate on the test set. In order to reduce random effects on our results, the labeled nodes have been sampled randomly 10 times, and averaged results over these ten runs are finally reported.

The results (omitted because of the lack of space) show that the sum-over-paths kernel provides competitive results in comparison with the other standard kernels on a graph.

References

[1] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. A novel way of computing similarities between nodes of a graph, with application to collaborative recommendation. Proceedings of the 2005 IEEE/WIC/ACM International Joint Conference on Web Intelligence, pages 550-556, 2005.

[2] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620-630, 1957.

[3] S. A. Macskassy and F. Provost. Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8:935-983, 2007.

[4] M. Saerens, Y. Achbany, F. Fouss, and L. Yen. Randomized shortest-path problems: Two seemingly unrelated problems. Manuscript submitted for publication, 2008.

[5] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. Proceedings of the 22nd International Conference on Machine Learning, pages 1041-1048, 2005.


Classification on incomplete data using sparse representations: Imputation is optional

Jort Gemmeke [email protected]

Dept. of Linguistics, Radboud University, P.O. Box 9103, NL-6500 HD, Nijmegen, The Netherlands

Abstract

We present a non-parametric technique capable of performing classification directly on incomplete data, optionally performing imputation. The technique works by sparsely representing the available data in a basis of example data. Experiments on a spoken digit classification task show significant improvement over a baseline missing-data classifier.

1. Introduction

Classification on incomplete data is a challenging task because parametric techniques require that the dimensionality of the data does not change between training and classification, while non-parametric techniques which can handle incomplete data, such as k-nearest-neighbors, often deliver suboptimal classification accuracies. In practice, the missing data is often estimated prior to classification through imputation. Most imputation methods estimate the missing coefficients based on local information and/or do not fully exploit the structure of the underlying signal. Based on work in Compressed Sensing (Donoho, 2006; Candes, 2006) we present a non-parametric method which can not only perform classification directly on the available data but optionally imputes the missing data. The method is based on the premise that a signal can be sparsely represented in a basis of example signals (Yang et al., 2007) and that this sparse representation can be exactly recovered even if only a small part of the data is available (Zhang, 2006). We show the effectiveness of this approach on a spoken digit classification task.

2. Method

2.1. Sparse representation

We consider observation vector y of unknown class and dimensionality K to be a linear combination of feature vectors d_{i,n}, where the first index (1 ≤ i ≤ I) denotes one of I classes and the second index (1 ≤ n ≤ N_i) a specific exemplar vector of class i, with N_i the number of examples in that class. We write:

y = Σ_{i=1}^{I} Σ_{n=1}^{N_i} α_{i,n} d_{i,n}

with weights α_{i,n} ∈ R. The set of exemplars spans a K × N dimensional basis (N = N_1 + N_2 + ... + N_I):

A = (d_{1,1} ... d_{1,N_1} ... d_{I,1} ... d_{I,N_I})    (1)

Thus, we can express y as:

y = Ax (2)

with x an N-dimensional vector that ideally will be sparsely represented as x = [0 ... 0 α_{i,1} α_{i,2} ... α_{i,N_i} 0 ... 0]^T (i.e., most coefficients not associated with class i are zero).

Taking into account that the observation vector y is incomplete, we denote its available coefficients by y_a and the missing coefficients by y_m. Now we can solve the system of linear equations of Eq. 2 using only the available coefficients y_a and the basis A_a, formed by only retaining the rows of A indicated by the available coefficients. Research in the field of compressed sensing (Donoho, 2006; Candes, 2006) has shown that if x is sparse, x can be recovered exactly by solving:

min ||x||_1 subject to ||y_a − A_a x||_2 ≤ ε    (3)

with a small constant ε such that the error e satisfies ||e||_2 < ε, and ||·||_1 the ℓ1 norm.

2.2. Sparse classification (SC)

Following (Yang et al., 2007), we perform classification by comparing the support of y_a in parts of A_a associated with different classes i. In other words, we compare how well the various parts of x associated with different classes i can reproduce y_a. The reproduction error is called the residual. The residual of


class i is calculated by setting the coefficients of x not associated with i to zero while keeping the coefficients associated with i unchanged. Thus the residual is:

r_i(y_a) = ||y_a − A_a δ_i(x)||_2    (4)

with δ_i(x) the vector retaining only the coefficients of x that correspond to class i. The class c that is assigned to an observed vector y is the one that gives rise to the smallest residual:

c = argmin_i r_i(y_a).    (5)

2.3. Sparse imputation (SI)

Alternatively, one can use the sparse representation x to impute the missing coefficients. Without loss of generality we reorder y and A as in (Zhang, 2006) so that we can write:

y = [ y_a ; y_i ] = [ y_a ; A_m x ]    (6)

with A_m pertaining to the rows of A indicated by the missing coefficients in y, and y_i an estimate for the missing coefficients y_m. This yields a new observation vector y, after which the ordering can be restored.
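As a rough illustration of Sections 2.1-2.3, the sketch below classifies and imputes a synthetic incomplete observation. It approximates the constrained ℓ1 problem of Eq. (3) with scikit-learn's Lasso (a Lagrangian formulation rather than the constrained one), and the exemplar basis, the observation, the mask of available coefficients and the regularization strength are all made-up toy values, not the AURORA-2 setup.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    K, I, Ni = 40, 3, 25                         # feature dim, classes, exemplars/class
    A = rng.standard_normal((K, I * Ni))         # columns of A are exemplars (Eq. 1)
    labels = np.repeat(np.arange(I), Ni)         # class of each column of A

    y = A[:, 5] + 0.01 * rng.standard_normal(K)  # noisy exemplar of class 0
    available = rng.random(K) < 0.6              # True = coefficient is observed

    # Sparse representation from the available rows only (Lasso stands in for Eq. 3).
    lasso = Lasso(alpha=0.01, max_iter=10000, fit_intercept=False)
    lasso.fit(A[available], y[available])
    x = lasso.coef_

    # Sparse classification (SC): per-class residual of Eq. (4), decision of Eq. (5).
    residuals = []
    for i in range(I):
        x_i = np.where(labels == i, x, 0.0)      # delta_i(x)
        residuals.append(np.linalg.norm(y[available] - A[available] @ x_i))
    predicted_class = int(np.argmin(residuals))

    # Sparse imputation (SI): estimate the missing coefficients as A_m x (Eq. 6).
    y_imputed = y.copy()
    y_imputed[~available] = A[~available] @ x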

3. Experiments

We apply the described method to a missing-data spoken digit classification task (AURORA-2). In noisy speech, with digits represented by fixed-length observation vectors, coefficients are considered missing if their values (representing speech energy in a time-windowed spectrographic representation) are dominated by noise energy rather than speech energy. We explore the effectiveness of our approach by selecting the missing coefficients using knowledge of the true speech and noise signals. Using a setup described in detail in (Gemmeke & Cranen, 2008) we compare the sparse classification technique with a baseline, state-of-the-art HMM-classifier which performs imputation on a frame-by-frame basis (Van hamme, 2006). Additionally, we compare classification accuracies obtained by combining sparse imputation and the baseline classifier.

While not strictly linear as a function of signal-to-noise ratio (SNR), the percentage of missing coefficients ranges from 0% (clean speech) to 80-95% (at SNR −5 dB). Table 1 shows that the sparse classification and sparse imputation methods significantly outperform the baseline, frame-based classifier.

Table 1. AURORA-2 single digit classification accuracy.

                          SNR (dB)
method      clean    15     10      5      0     -5
Baseline     99.3   99.1   98.7   96.6   88.4   61.0
SC           98.4   98.4   98.0   97.5   95.8   91.0
SI           99.3   99.0   98.5   97.7   96.5   91.3

4. Discussion and conclusions

Results show that both sparse methods give considerable improvement over the baseline, suggesting that a correct sparse representation can be found even when the majority of the data is missing, provided the redundancy in the structure of the data is exploited by use of example whole-digit observation vectors. The slightly better results using sparse imputation rather than sparse classification seem to suggest that the sparse classification method does not generalize to observed digits as well as the HMM-based (parametric) approach.

Acknowledgments

The research of Jort Gemmeke was carried out in the MIDAS project, granted under the Dutch-Flemish STEVIN program.

References

Candes, E. J. (2006). Compressive sampling. Proceedings of the International Congress of Mathematicians.

Donoho, D. L. (2006). For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics, 59, 907-934.

Gemmeke, J., & Cranen, B. (2008). Noise robust digit recognition using sparse representations. Accepted for ISCA ITRW 2008.

Van hamme, H. (2006). Handling time-derivative features in a missing data framework for robust automatic speech recognition. Proceedings of IEEE ICASSP.

Yang, A. Y., Wright, J., Ma, Y., & Sastry, S. S. (2007). Feature selection in face recognition: A sparse representation perspective. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.

Zhang, Y. (2006). When is missing data recoverable? Technical Report.


Multi-Agent State Space Aggregation using Generalized Learning Automata

Yann-Michael De Hauwere [email protected]
Peter Vrancx [email protected]
Ann Nowe [email protected]

Computational Modeling Lab, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussel, Belgium

Abstract

A key problem in multi-agent reinforcement learning remains dealing with the large state spaces typically associated with realistic distributed agent systems. As the state space grows, agent policies become more and more complex and learning slows. One possible solution for an agent to continue learning in these large-scale systems is to learn a policy which generalizes over states, rather than trying to map each individual state to an action. In this paper we present a multi-agent learning approach capable of aggregating states, using associative reinforcement learners called generalized learning automata (GLA).

1. Introduction

Reinforcement learning (RL) has already been shown to be a powerful tool for solving single agent Markov Decision Processes (MDPs). Basic RL techniques are not suited for problems with very large state spaces, however, as they mostly rely on a tabular representation for policies, and enumerating all possible state-action pairs is not feasible (the so-called curse of dimensionality). Because of these issues, several extensions have been proposed to reduce the complexity of learning. Methods for representing the agent's policy such as neural networks, decision trees and other regression techniques are already widely used. To our knowledge, relatively little work has been done so far on extending RL for large state spaces to MAS.

2. Generalized Learning Automata

A Generalized Learning Automaton (GLA) is an associative reinforcement learning unit. The purpose of a GLA is to learn a mapping from given inputs or contexts to actions. At each time step the GLA receives an input which describes the current system state. Based on this input and its own internal state the unit then selects an action. This action serves as input to the environment, which in turn produces a response for the GLA. Based on this response the GLA then updates its internal state.

Formally a GLA can be represented by a tuple (X, A, β, u, g, T), where X is the set of possible inputs to the GLA and A = {a_1, ..., a_r} is the set of outputs or actions the GLA can produce. β ∈ [0, 1] denotes the feedback the automaton receives for an action. The real vector u represents the internal state of the unit. It is used in conjunction with a probability generating function g to determine the action probabilities, given an input x ∈ X. T is a learning algorithm which updates u, based on the current value of u, the given input, the selected action and the response β. In this paper we use a modified version of the REINFORCE (Williams, 1992) update scheme. In (Thathachar & Sastry, 2004) it is shown that this update mechanism converges to local maxima of f(u) = E[β|u], showing that the automata find a local maximum over the mappings that can be represented by the internal state in combination with the function g.

We propose to use the GLA described above in multi-agent reinforcement learning problems. In such a system each agent internally uses a set of GLA to learn the different regions in the state space where different actions are optimal. We use the following set-up for the GLA. With every action a_i ∈ A the automaton can perform, it associates a vector u_i. This results in an internal state vector u = [u_1^τ ... u_r^τ] (where τ denotes the transpose). With this state vector we use the Boltzmann distribution as probability generating function:


g(x, a_i, u) = exp(x^τ u_i) / Σ_j exp(x^τ u_j)    (1)

Of course, since this function is fixed in advance and the environment in general is not known, we have no guarantee that the GLA can represent the optimal mapping. For instance, when using the function given in Equation 1 with a 2-action GLA, the internal state vector represents a hyperplane. This plane separates context vectors which give a higher probability to action 1 from those which give a higher probability to action 2. If the sets of context vectors where different actions are optimal are not linearly separable, the GLA cannot learn an optimal mapping.
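To make the unit concrete, here is a minimal, illustrative GLA in Python with the Boltzmann probability generating function of Equation 1. The update rule below is a plain REINFORCE-style gradient step on E[β|u]; the paper uses a modified REINFORCE scheme whose exact form is not given here, so this update and the learning rate are assumptions made for illustration only.

    import numpy as np

    class GLA:
        """Minimal generalized learning automaton (illustrative sketch)."""

        def __init__(self, n_features, n_actions, lr=0.05):
            self.u = np.zeros((n_actions, n_features))   # one vector u_i per action
            self.lr = lr

        def action_probs(self, x):
            scores = self.u @ x                  # x^tau u_i for every action i
            scores -= scores.max()               # numerical stability
            p = np.exp(scores)
            return p / p.sum()                   # Boltzmann distribution (Eq. 1)

        def act(self, x):
            p = self.action_probs(x)
            return np.random.choice(len(p), p=p)

        def update(self, x, action, beta):
            # REINFORCE-style step: beta times the gradient of the log-probability
            # of the chosen action with respect to the internal state u.
            p = self.action_probs(x)
            grad = -np.outer(p, x)
            grad[action] += x
            self.u += self.lr * beta * grad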

[Diagram: n agents, each containing GLA 1..m, receive the state vector x(t) from the environment and return the joint action a(t).]

Figure 1. Learning set-up. Each agent receives a factored state representation as input. GLA decide the action to be performed.

To allow a learner to better represent the desired mapping from context vectors to actions, we can utilize systems composed of multiple GLA units. For instance, the output of multiple 2-action GLAs can be combined to allow learners to build a piecewise linear approximation of regions in the space of context vectors. In general, we can use systems which are composed of feedforward structured networks of GLA. In these networks, automata on one level use actions of the automata on the previous level as inputs. If the feedforward condition is satisfied, meaning that the input of an automaton does not depend on its own output, convergence to local optima can still be established (Phansalkar & Thathachar, 1995).

Figure 1 shows the general agent learning set-up. Each time step t a vector x(t) giving a factored representation of the current system state is generated. This vector is given to each individual agent as input. The agents internally use a set of GLA to map an action to the current state. The joint action a(t) of all agents

[Figure 2 plot: GLA regions in the (Agent 1 loc, Agent 2 loc) space, both axes from −1 to 1, showing Region I and Region II.]

Figure 2. Typical results for the approximations of the parabola learnt by the agents.

serves as input to the environment, which responds with a feedback β(t) that agents use to update the GLA.

3. Experiments

In a first simple experiment 2 agents move around randomly in a 2-dimensional state space. Each time step both agents receive the current state (x, y)-values as input. The agents have to learn which joint actions are optimal in different regions of the state space. In this case there are 2 regions determined by a parabola. In region I, given by the inside of the parabola, action (a1, a1) is optimal with a reward of 0.9. When the joint location of the agents falls outside the parabola, however, action (a2, a2) is optimal with reward 0.5. In both cases all other joint actions have a pay-off of 0.1.
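This pay-off structure can be written down directly; the sketch below is a plain transcription of the rewards above, with the parabola taken to be y = x² (the paper does not give its exact equation, so that boundary is an assumption).

    def joint_reward(location, action_1, action_2):
        """Reward of the first experiment: region I is the inside of the parabola."""
        x, y = location
        inside = y > x ** 2                      # assumed boundary of region I
        if inside and action_1 == 'a1' and action_2 == 'a1':
            return 0.9
        if not inside and action_1 == 'a2' and action_2 == 'a2':
            return 0.5
        return 0.1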

Both agents use a system consisting of 2 GLA, connected by an AND operation. Both GLA have 2 actions: 0 and 1. If the automata both choose 1, the agent performs its first action a1; otherwise it performs action a2. Figure 2 shows typical results for the approximations that the agents learn for the parabola.

References

Phansalkar, V., & Thathachar, M. (1995). Local and global optimization algorithms for generalized learning automata. Neural Computation, 7, 950-973.

Thathachar, M., & Sastry, P. (2004). Networks of Learning Automata: Techniques for Online Stochastic Optimization. Kluwer Academic Publishers.

Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.


Efficiently learning timed models from observations

Sicco Verwer, Mathijs de Weerdt, and Cees Witteveen [email protected]

Delft University of Technology, P.O. Box 5031, 2600 GA, Delft, the Netherlands

1. Learning timed models efficiently

This paper describes an efficient algorithm for learning a timed model from observations. The algorithm is based on the state merging method for learning a deterministic finite state automaton (DFA). This method and its problem have been the subject of many studies within the grammatical inference field, see e.g. (de la Higuera, 2005). Consequently, it enjoys a sound theoretical basis which can be used to prove properties of our algorithm. For example, while it has long been known that learning DFAs is NP-complete, it has been shown that DFAs can be learned in the limit from polynomial time and data (efficiently in the limit) using a state merging method.

A DFA is a language model. A language is a set of finite sequences of symbols σ = s_1 s_2 ... s_n known as strings. Adding time to a string can be done by giving every symbol a time value t ∈ N¹, resulting in a sequence of symbol-time value pairs τ = (s_1, t_1)(s_2, t_2) ... (s_n, t_n). Every time value in such a timed string represents the time that has elapsed since the occurrence of the previous symbol. A set consisting of timed strings is called a timed language. There are two main reasons why we want to learn a timed model instead of an untimed model. First, in many applications the data that is obtained by observing a system contains timestamps. For example, this is the case when time series data is obtained from sensors. Second, we believe that learning a timed model from such data is more efficient than learning an untimed model from this data. The reason is that, while it is possible to construct for any timed model an equivalent untimed model by sampling the time values, this untimed model is of size exponential in the size of the timed model. Thus, an efficient algorithm that learns a timed system using an untimed model is by definition an inefficient algorithm since it is exponential in time and data in the size of the timed model. In contrast, we show it is possible to learn certain types of timed models efficiently. Naturally, we want to focus on learning models that are efficiently learnable.

¹ It is more common to use R for time values. In practice, there always is a finite precision of time and hence this only makes the timed models unnecessarily complex.

We assume familiarity with the theory of languages and automata. The timed models we consider are known as timed automata (TA) (Alur & Dill, 1994). In these models, time is represented using a finite number of clocks. A clock can be thought of as a stopwatch that measures the time since it was last reset. When a transition of a TA executes (fires) it can optionally reset the value of a clock. This clock then measures the time since the last execution of this transition. The values of clocks are used to constrain the execution of transitions in a TA. A transition in a TA can contain a boolean constraint, known as a clock guard, that specifies when this transition is allowed to be executed depending on the values of clocks. If the clock values satisfy the constraint, the transition can be executed. Otherwise, it cannot be executed. Thus a transition is executed only if the TA is currently in its source (or parent) state, the current symbol is equal to the transition label, and the current clock values satisfy the transition's clock guard. In this way, the execution of a TA depends not just on the symbols, but also on the time values occurring in a timed string.

We focus on learning algorithms for a simple timed model known as a deterministic real-time automaton (DRTA) (Dima, 2001), based on the state merging method. A DRTA is a TA that contains a single clock that is reset by every transition. Hence, the execution of a DRTA depends on the symbols, and on the time between two consecutive events of a timed string. The reason for restricting ourselves to these simple TAs is that the aforementioned learnability results for DFAs can quite easily be adapted to the case of DRTAs. In other words, we can show that DRTAs are efficiently learnable using a state merging method. Due to the expressive power of clocks, we can also show this does not hold for deterministic TAs (DTAs) in general: any algorithm that tries to learn a DTA A will sometimes require an input set of size exponential in the size of A. Other methods we know of for learning timed models try to learn a sub-


class of DTAs known as event recording automata, see e.g. (Grinchtein et al., 2005). These automata are more powerful than DRTAs. However, due to this extra power an algorithm that learns these models is inefficient in the amount of data it requires.

We are currently finishing these (in)efficient learnability proofs. Our aim is to discover the most powerful timed model that is efficiently learnable.

2. Learning from observations

We constructed an algorithm for learning DRTAs from labeled data (Verwer et al., 2007). A high-level view of this algorithm is given in Algorithm 1. The algorithm starts with a prefix tree A that is constructed from the input set S. This is a DRTA such that: all the clock guards are equal to true (it disregards time values), there exists exactly one execution path to any state (it is a tree), and the execution of A on any timed string from S ends in a state in A. Starting from this prefix tree, our algorithm performs the most consistent merge or split as long as consistent merges or splits are possible. A merge of two states a and b replaces a and b with a new state c that contains all the input and output transitions of both a and b. Afterwards, it is possible that A contains some non-deterministic transitions. These are removed by continuously merging the target states of these transitions until none are left.

A transition splitting process is used to learn the clock constraints of the DRTA. A split of a transition d at time t removes d from A and adds a transition e that is satisfied by all the clock values that satisfy d up to time t, and a transition f that is satisfied by all the clock values that satisfy d starting at t. The timed strings from S that have an execution path over e and f are subsets of the timed strings from S that had an execution path over d. Therefore, the prefix tree is recalculated starting from the new transitions e and f.

Algorithm 1 State merging and splitting DRTAs
Require: A set of timed strings S.
Ensure: The result is a small consistent DRTA A.
  Construct a timed prefix tree A from S.
  while states can be merged or split consistently do
    Evaluate all possible merges and splits.
    Perform the most consistent merge or split.
  end while
  Return the constructed DRTA A.

There is one important part of Algorithm 1 that we left undefined: What does it mean to be consistent? In the original setting of our algorithm the answer to this question was simple: the algorithm got a labeled data set as input and consistency was defined using these labels. However, for many applications this setting is unrealistic: usually, only positive data is available. We adapted our algorithm to this setting by using statistics as a consistency measure for our models.

This seems a very natural thing to do and it has been done for the problem of learning probabilistic DFAs (Kermorvant & Dupont, 2002). The main problem to overcome is to come up with a good statistic for the (dis)similarity of two states based on the prefix trees of their suffixes. In our approach, two states in such a tree are treated as being similar if the distributions with which they generate (next) symbols are not significantly different. In addition, since we deal with a timed model, we require that the distributions generating the time values of these symbols are not significantly different. These properties are tested for every state in the prefix tree using Chi-square and Kolmogorov-Smirnov hypothesis tests. The result is many p-values, which we combine into a single test using a standard method for multiple hypothesis testing. The method that was used to learn DFAs does not make use of a multiple hypothesis testing method: it simply rejects if any of the tests fails.
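One possible reading of this consistency test is sketched below: the next-symbol counts of two candidate states are compared with a chi-square test, their time-value samples with a Kolmogorov-Smirnov test, and the resulting p-values are combined into a single decision. Since the text only mentions "a standard method for multiple hypothesis testing", the use of Fisher's method, the add-one smoothing of the contingency table and the significance level are illustrative assumptions.

    import numpy as np
    from scipy.stats import chi2_contingency, ks_2samp, combine_pvalues

    def states_similar(sym_counts_a, sym_counts_b, times_a, times_b, alpha=0.05):
        """Decide whether two states may be merged (illustrative consistency test)."""
        p_values = []

        # Next-symbol distributions should not differ significantly (chi-square).
        table = np.array([sym_counts_a, sym_counts_b]) + 1   # +1 avoids empty cells
        _, p_sym, _, _ = chi2_contingency(table)
        p_values.append(p_sym)

        # Time-value distributions should not differ significantly (KS test).
        _, p_time = ks_2samp(times_a, times_b)
        p_values.append(p_time)

        # Combine the individual tests into one decision (Fisher's method here).
        _, p_combined = combine_pvalues(p_values, method='fisher')
        return p_combined > alpha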

Initial tests of our algorithm on artificial data show that the use of timed models and multiple hypothesis testing is promising.

References

Alur, R., & Dill, D. L. (1994). A theory of timed automata. Theoretical Computer Science, 126, 183-235.

de la Higuera, C. (1997). Characteristic sets for polynomial grammatical inference. Machine Learning, 27, 125-138.

de la Higuera, C. (2005). A bibliographical study of grammatical inference. Pattern Recognition, 38, 1332-1348.

Dima, C. (2001). Real-time automata. Journal of Automata, Languages and Combinatorics, 6, 2-23.

Grinchtein, O., Jonsson, B., & Leucker, M. (2005). Inference of timed transition systems. Electronic Notes in Theoretical Computer Science.

Kermorvant, C., & Dupont, P. (2002). Stochastic grammatical inference with multinomial tests. ICGI '02 (pp. 149-160). Springer-Verlag.

Verwer, S., de Weerdt, M., & Witteveen, C. (2007). An algorithm for learning real-time automata. Benelearn '07 (pp. 128-135).


ProSOM: Core promoter identification in the human genome.

Thomas Abeel [email protected]
Yves Saeys [email protected]
Yves Van de Peer [email protected]

Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
Department of Molecular Genetics, Ghent University, Technologiepark 927, 9052 Gent, Belgium

Abstract

More and more genomes are being sequenced, and to keep up with the pace of sequencing projects, automated annotation techniques are required. One of the most challenging problems in genome annotation is the identification of the core promoter. Better core promoter prediction can improve genome annotation and can be used to guide experimental work.

Comparing the average structural profile of transcribed, promoter and intergenic sequences demonstrates that the core promoter has unique features that cannot be found in other sequences. We show that unsupervised clustering by using self-organizing maps can clearly distinguish between the structural profiles of promoter sequences and other genomic sequences. An implementation of this promoter prediction program, called ProSOM, is available and has been compared with the state-of-the-art.

1. Introduction

Currently, the genomic sequence of over 50 eukaryotic organisms is available, so it is important to automate the identification of genes and regulatory sequences.

The core promoter is the region immediately upstream of the transcription start site (TSS), where the transcription initiation complex assembles.

Core promoters have distinct features that can be used to distinguish them from other sequences. One such property models the local base-stacking energy. High values denote regions that destack or melt easily. Two regions that seem to melt easily are located around -30 from the TSS and on the TSS, and are embedded in a large-scale region that is significantly more stable. We used this large-scale feature in earlier work to predict promoter regions in a wide range of species.

We present a novel promoter prediction technique, called ProSOM, that uses an unsupervised self-organizing map (SOM) to distinguish core promoter regions from the rest of the genome.

2. Material and methods

2.1. Data

We used the human genome assembly (hg17, May 2004). The cap analysis gene expression (CAGE) dataset was retrieved from the Fantom3 project. It contains 123,400 unique TSSs for human. The Ensembl gene annotation has been retrieved using the BioMart tool for Ensembl release 37. Sequences and annotation were retrieved from the ENCODE project. For the training of the SOM we retrieved promoter, transcribed and intergenic sequences from DBTSS and Ensembl.

2.2. Structural profiles

The nucleotide sequence is converted into a sequence of numbers (i.e., a numerical profile). This is done by replacing each dinucleotide with its energy value, which is obtained from experimentally validated conversion tables. We have used the conversion tables for base-stacking energy from Florquin et al. 2005.
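A minimal sketch of this conversion is given below; the dinucleotide values in the lookup table are placeholders for illustration only, not the experimentally derived base-stacking energies of Florquin et al. (2005) that ProSOM actually uses.

    import numpy as np

    # Placeholder dinucleotide base-stacking energies (illustrative values only).
    STACKING_ENERGY = {
        'AA': -1.00, 'AC': -1.44, 'AG': -1.28, 'AT': -0.88,
        'CA': -1.45, 'CC': -1.84, 'CG': -2.17, 'CT': -1.28,
        'GA': -1.30, 'GC': -2.24, 'GG': -1.84, 'GT': -1.44,
        'TA': -0.58, 'TC': -1.30, 'TG': -1.45, 'TT': -1.00,
    }

    def structural_profile(sequence):
        """Replace each dinucleotide by its energy value to get a numerical profile."""
        seq = sequence.upper()
        return np.array([STACKING_ENERGY[seq[i:i + 2]] for i in range(len(seq) - 1)])

    profile = structural_profile("ATGCGTATAAAGGCT")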

2.3. Clustering and promoter prediction

The clustering technique we used is the self-organizing map (SOM), a special type of artificial neural network that can be used both for clustering and class prediction. A SOM consists of a rectangular grid of clusters, each of which has a weighted connection to every input node. In our case, the input nodes represent the different values of a structural profile associated to a potential promoter region. The SOM provides a


[Plots of average base-stacking-energy profiles (values roughly -0.15 to -0.05) for promoter, transcribed and intergenic sequences.]

Figure 1. The structural profile of promoter (left), transcribed (center) and intergenic (right) human sequences. The profiles are the averages over all sequences in the respective training sets. We used the base-stacking energy as physical property. The left panel shows the region [-200, 50] around the TSS, while for the other two panels there is no reference point and the locations are numbered from 0 to 250.

mapping from a higher-dimensional feature space (the structural profile) to a lower-dimensional cluster space.

2.4. Validation

The validation of our technique was done on two sequence sets: first on the entire human genome assembly (hg17, May 2004) and secondly on the ENCODE regions. For both sets we retrieved a set of experimentally characterized TSSs and a gene annotation.

An aggregate measure for the performance of a classifier that is often used in the machine learning field is the F-measure. This is the harmonic mean of the recall (sensitivity) and the precision (specificity): F = 2 · precision · recall / (precision + recall).

We have proposed a more objective way to assess the performance of a promoter prediction program (PPP), based on a genome-wide screening for TSSs. This technique is based on the CAGE datasets that have been described earlier. The dataset contains locations where transcription starts. A TP is a known site that has a prediction within 50 bp of the true TSS, a FN is a TSS without a prediction, and a FP is a prediction that has no associated TSS in the reference set within 50 bp.
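Under these definitions, the genome-wide screening could be scored as sketched below (toy positions on a single strand; this is one possible reading of the TP/FN/FP counts, not the exact evaluation script behind the results reported here).

    import numpy as np

    def evaluate_tss_predictions(true_tss, predicted, window=50):
        """Recall, precision and F for TSS predictions (positions in bp)."""
        true_tss, predicted = np.asarray(true_tss), np.asarray(predicted)
        tp = sum(np.any(np.abs(predicted - t) <= window) for t in true_tss)
        fn = len(true_tss) - tp
        fp = sum(not np.any(np.abs(true_tss - p) <= window) for p in predicted)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return recall, precision, f

    print(evaluate_tss_predictions([100, 980, 2500], [90, 1500, 2540]))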

3. Results

Figure 1 shows the average structural profile of base-stacking energy of the three datasets we used for training the SOM. The promoter sequences show a very striking profile with overall lower values than the other two graphs. It has two clear peaks at position -30 (TATA binding protein) and position 0 (TSS).

We used the trained SOM to predict promoter regions. To each cluster we attached a probability that a given sequence assigned to that cluster is a promoter. If the structural profile of a sequence maps to a cluster that has a probability equal to or above the threshold, we

Table 1. Evaluation of promoter prediction programs using the CAGE dataset with a maximum allowed distance of 50 bp.

program    recall   prec.    F
ProSOM      0.17    0.30    0.22
Eponine     0.14    0.35    0.20
EP3         0.11    0.27    0.16
ARTS        0.11    0.27    0.15
FirstEF     0.13    0.15    0.14

predict it as a promoter region.

To validate our predictions we use the CAGE-tag dataset and a set of genes from Ensembl. To compare with the state-of-the-art, we used a maximum allowed distance from the TSS of 50 bp. Table 1 shows the performance of ProSOM versus a number of other PPPs.

We also analyzed the ENCODE regions of the human genome in more detail. The ENCODE project tries to annotate one percent of the human genome in great detail. ProSOM gets an F-measure of 0.28 on this validation set.

4. Discussion and conclusion

Self-organizing maps provide an intuitive way to cluster DNA sequences. They are unique among unsupervised clustering techniques in their ability to distinguish core promoters from other sequences. We packaged this technique as a full-fledged promoter prediction tool, called ProSOM, that performs as well as the best existing software packages.

Acknowledgments

T.A. is funded by a grant from the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen). Y.S. is funded by a post-doctoral grant from the Research Foundation Flanders (FWO-Vlaanderen).

References

Abeel, T., Saeys, Y., Bonnet, E., Rouze, P., & Van de Peer, Y. (2008a). Generic eukaryotic core promoter prediction using structural features of DNA. Genome Research, 18, 310-323.

Abeel, T., Saeys, Y., Rouze, P., & Van de Peer, Y. (2008b). ProSOM: Core promoter prediction based on unsupervised clustering of DNA physical profiles. Bioinformatics, (in press).


Benchmarking machine learning techniques for the extraction of protein-protein interactions from text

Sofie Van Landeghem [email protected]
Yves Saeys [email protected]
Yves Van de Peer [email protected]

Department of Plant Systems Biology, VIB, 9052 Gent, Belgium
Department of Molecular Genetics, University of Ghent, 9052 Gent, Belgium

Bernard De Baets [email protected]

Department of Applied Mathematics, Biometrics and Process Control, University of Ghent, 9000 Gent, Belgium

Abstract

Accurately extracting information from text is a challenging discipline because of the complexity of natural language. We have studied state-of-the-art systems that extract biological relations from research articles. It has become clear that this field is still struggling with a heterogeneous collection of data sets, data formats and evaluation methods. While recent developments look promising, there is still plenty of room for improvement.

1. Introduction

In the field of life sciences it is vital to automatically link experimental results to data already published in online literature resources. Fully automated systems that extract biological knowledge from text have thus become a necessity. We have studied the feasibility of applying machine learning approaches for the extraction of protein-protein interactions (PPIs). During our comparative study, it became clear that there is a great need for the standardization of evaluation procedures.

2. Corpora

Over the past few years, different methods have been proposed to extract biological relations from text. The development of standard benchmarking data sets is a step forward towards meaningful comparison between these systems. Such corpora include LLL, AImed and BioInfer, which have all been published in different data formats. Only recently, software has been introduced to convert these and two smaller data sets into a common data format (Pyysalo et al., 2008), which facilitates comparison between different methods.

Figure 1. Dependency parse for 'The results show that myogenin heterodimerizes with E12 and E47 in vivo.'

3. PPI extraction

Sentences selected from biomedical text usually contain complex structures with multiple subordinate clauses. Interacting proteins often occur in a sentence with some distance between them. Therefore, pattern-based approaches and algorithms using word order suffer from low recall. On the other hand, techniques solely based on co-occurrence of named entities exhibit low precision. To better capture the semantics of a sentence, recent systems make use of information derived from dependency trees (see Fig. 1).

By extracting properties from dependency trees, explicit features can be obtained for each pair of proteins. These feature vectors are used by classifiers such as decision trees, BayesNet and SVM to identify sentences which express a protein-protein interaction. Useful features relate to lexical and syntactic information about the children and ancestors of the proteins in the tree, the presence of common interaction words and the depth of the named entities in the tree.
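As a toy illustration, the snippet below derives a few such features for one candidate protein pair from a hand-made, simplified dependency parse of the Figure 1 sentence (using networkx); the concrete feature set and the interaction-word list are illustrative only, not those of any of the benchmarked systems.

    import networkx as nx

    # Hand-made, simplified dependency parse of the Figure 1 sentence.
    tree = nx.Graph([("show", "results"), ("show", "heterodimerizes"),
                     ("heterodimerizes", "myogenin"), ("heterodimerizes", "E12"),
                     ("E12", "E47")])

    INTERACTION_WORDS = {"heterodimerizes", "binds", "interacts"}

    def pair_features(tree, root, prot1, prot2):
        """Illustrative explicit features for one candidate protein pair."""
        path = nx.shortest_path(tree, prot1, prot2)
        return {
            "path_length": len(path) - 1,
            "interaction_word_on_path": any(w in INTERACTION_WORDS for w in path),
            "depth_prot1": nx.shortest_path_length(tree, root, prot1),
            "depth_prot2": nx.shortest_path_length(tree, root, prot2),
        }

    features = pair_features(tree, "show", "myogenin", "E12")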


4. Ideas to improve benchmarking

4.1. Common set of benchmark data

A comparative study between different PPI extraction systems is a non-trivial task as different studies often benchmark on different data sets. The RelEx system of Fundel et al. (2006) has been reimplemented with the goal of evaluating it on different corpora (Pyysalo et al., 2008). An F-score of 0.77 was obtained when benchmarking on LLL, and a score between 0.41 and 0.44 when evaluated on AImed and BioInfer. We obtain similar results when applying the walk kernel of Kim et al. (2008) to the AImed data set, which results in an F-score of 0.44. In contrast, the original paper reports a score of 0.77 for the evaluation on LLL. This shows that for the same extraction method, performance can differ by up to 36% depending on the choice of the corpus. It is therefore meaningful to evaluate new algorithms on a collection of different data sets.

4.2. Instance extraction

When benchmarking on the same corpus, different preprocessing steps can yield different instances. Homodimers, which are self-interacting proteins, are sometimes simply discarded. A similar issue is raised by annotations which are nested. The ability of the preprocessing techniques to deal with such annotations influences the final number of instances in the data set and ultimately the performance of the system.

Most corpora do not deal with the construction of negative training data. It has become common practice to adopt the closed-world assumption, stating that no interaction exists between two entities when there is no annotated evidence. Even though AImed provides an explicit set of abstracts with no annotated interactions, these are not always used, resulting in different numbers of negative instances in the training set.

Ideally, abstracts for the testing phase should be completely hidden during training. Saetre et al. (2008) pointed out that some evaluations suffer from an artificial boost of performance by using features from the same sentence in both the training and testing steps of the machine learning algorithm. This boost of performance has been estimated at between 10 and 20%.

4.3. Counting true positives

The definition of true positives varies between different evaluation approaches. Most approaches consider every protein pair as an individual instance and evaluate whether an interaction is stated between these two particular entities. Some, however, state that an interaction between two proteins may be expressed in the same corpus by more than one instance. To extract a true interaction, retrieving one such instance suffices. The latter evaluation technique exhibits higher recall. Even though this technique may be useful for the evaluation of complete information retrieval systems, we feel the first is more representative for the subtask of extracting interactions between named entities from individual sentences.

4.4. Directed interactions

Finally, the PPI extraction task is not unambiguously defined across corpora. The LLL data set and BioInfer both consider the role of the different proteins in their interaction and discriminate between effectors and effectees. In AImed, however, protein-protein interactions are considered to be symmetrical. This has led to the common practice of treating LLL annotations as symmetrical as well, resulting in artificially higher precision rates.

5. Conclusions

The comparison of different PPI extraction methods is hindered by the lack of standard evaluation procedures. We have pointed out the main problems for such a comparative study and indicated some practical guidelines for setting up a meaningful evaluation.

Acknowledgments

SVL would like to thank the Special Research Fund (BOF) for funding her research. YS would like to thank the Research Foundation Flanders (FWO) for funding his research.

References

Fundel, K., Küffner, R., & Zimmer, R. (2006). RelEx – relation extraction using dependency parse trees. Bioinformatics, 23, 365-371.

Kim, S., Yoon, J., & Yang, J. (2008). Kernel approaches for genic interaction extraction. Bioinformatics, 24, 118-126.

Pyysalo, S., Airola, A., Heimonen, J., Björne, J., Ginter, F., & Salakoski, T. (2008). Comparative analysis of five protein-protein interaction corpora. BMC Bioinformatics, 9.

Saetre, R., Sagae, K., & Tsujii, J. (2008). Syntactic features for protein-protein interaction extraction. Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM 2007).


Predicting sub-Golgi localization of glycosyltransferases

Aalt D J van Dijk [email protected]

Applied Bioinformatics, PRI, Wageningen UR, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands

Dirk Bosch [email protected]

Metabolic Regulation, PRI, Wageningen UR, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands

Cajo J F ter Braak [email protected]

Biometris, PRI, Wageningen UR, Bornsesteeg 47, 6708 PD Wageningen, The Netherlands

Sander van der Krol [email protected]

Metabolic Regulation, PRI, Wageningen UR, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands

Roeland C H J van Ham [email protected]

Applied Bioinformatics, PRI, Wageningen UR, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands

Abstract

Proteins fulfill their biological role in specific cellular sub-compartments, and prediction of protein localization is an important topic within bioinformatics. Here we study localization of glycosyltransferases, which can reside in any of the cis-, medial-, or trans-Golgi compartments or in the Trans Golgi Network (TGN) compartment. This sub-Golgi localization is important for the order of reactions performed in glycosylation pathways, but it is currently poorly understood. We use a dataset of proteins with experimentally determined sub-Golgi localizations to develop a predictor, making use of a dedicated protein structure-based kernel in an SVM.

1. Experimental dataset

Golgi-localized glycosyltransferases contain one transmembrane helix, which is important for their correct sub-Golgi localization. A dataset of 59 proteins with known sub-Golgi localization was obtained. These were clustered (using the minimum variance method implemented in the R function hclust) based on their sequence identities in order to remove redundant sequences, which would otherwise result in an unjustifiably high performance of the resulting predictor. Based on the sharp rise of maximum inter-cluster similarity when using more clusters, 31 clusters were selected. Of these, 18 had only one entry and 13 had multiple entries with consistent localization, indicating that the available data is consistent. Additional training sequences were obtained based on sequence similarity (via ENSEMBL or BLAST), resulting in 107 sequences with cis-Golgi localization, 117 with medial-Golgi localization, 86 with trans-Golgi localization and 89 with TGN localization.

2. Kernel and SVM

We tested different kernels, both based on the linear sequence (string kernels) and on the modeled 3D-structure of the transmembrane domain (structure kernel). For both kernels, we applied a grouping of amino acids where amino acids were clustered following Shen (2007) into the following 7 groups: AGV, ILFP, YMTS, HNQW, RK, DE, and C. For the string kernel, we took as a starting point the conjoint triad string kernel (Shen, 2007). Triads were redefined to accommodate a fixed spacing of either 0 (the original triad definition) or 1, 2 or 3 (non-sequential triads), since such spacing determines alignment of residues to specific sides of the transmembrane helix. The structure kernel was designed based on observed residue contacts in 3D models of the helix. These models were obtained via structure calculations in CNS (Brunger, 1998).¹ Side-chain side-chain contacts were

¹ Dihedral angle and hydrogen bond restraints were defined, and the anneal.inp CNS-script was used to calculate ten structures for each helix. The lowest energy structure was used to obtain the kernel features.


counted using a distance cutoff of 3.5 Å, and each triplet of amino acids within this distance cutoff was counted as one occurrence of a triad.

As SVM implementation, SVMlight (Joachims, 1999) was applied. For each type of triad v_i a normalized count was defined as d_i = (f_i − min)/max, where f_i is the raw count and min (max) is the minimum (maximum) over all f_i. Since the number of training examples was relatively small compared to the dimension of the feature space, a linear kernel was expected to be powerful enough. Leave-one-out cross-validation was applied to optimize the parameter C, for which a grid [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30] was used. For the sake of completeness, the radial basis function (RBF) kernel was also tested, where the additional γ parameter was optimized on a grid [500, 200, 100, 50, 10, 5, 1, 0.1, 0.01, 0.001, 0.0005, 0.0001]. To obtain an unbiased performance estimation, nested cross-validation was used as described previously (Varma, 2006). The leave-one-out cross-validation was performed cluster-wise, meaning that all sequences in one cluster were removed simultaneously.
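The spaced conjoint-triad features could be computed along the following lines. The 7 amino-acid groups and the d_i = (f_i − min)/max normalization follow the text, but the way the fixed spacing places the three residues (here positions i, i+1+s, i+2+2s) and the example sequence are assumptions made for illustration.

    import numpy as np

    GROUPS = ['AGV', 'ILFP', 'YMTS', 'HNQW', 'RK', 'DE', 'C']
    AA2GROUP = {aa: g for g, group in enumerate(GROUPS) for aa in group}

    def triad_features(sequence, spacing=0):
        """Normalized counts of spaced triads over the 7 groups (7**3 features)."""
        seq = [AA2GROUP[a] for a in sequence]
        counts = np.zeros(7 ** 3)
        step = spacing + 1
        for i in range(len(seq) - 2 * step):
            a, b, c = seq[i], seq[i + step], seq[i + 2 * step]
            counts[a * 49 + b * 7 + c] += 1
        if counts.max() == 0:
            return counts
        return (counts - counts.min()) / counts.max()   # d_i = (f_i - min) / max

    x = triad_features("MLLRRKAVLGLAVCVLLACGLAHA", spacing=2)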

3. Results

To obtain a multiclass classification, three separate predictors were built: one for cis vs. the other three localizations, one for cis or medial vs. trans or TGN, and one for TGN vs. the other three localizations. This particular ordering was chosen because it coincides with the biologically relevant order cis-medial-trans-TGN. The cis/medial vs. trans/TGN predictor was used to test the performance of the various string kernels. Table 1 shows the prediction accuracies for the various kernels, and indicates that the string kernels that take structural features of the transmembrane domain into account perform better than kernels that do not (note that a spacing of 2 or 3 reflects the proximities of residues in 3D-space whereas a spacing of 1 does not). The 3D-structure based kernel has the best prediction performance. Randomly assigning class labels to each set of clustered sequences and retraining the SVM-predictor resulted in much lower performance (47% accuracy), showing that the performance obtained by the SVM-predictor is non-trivial.

The structure kernel was subsequently used for the other predictors, whose performance was comparable to that of the cis/medial vs. trans/TGN predictor. For each of these three predictors we also tested an RBF instead of a linear kernel, which gave comparable results (data not shown). The predictors were combined by using combinatorial logic, e.g. if for a given

Table 1. Classification accuracies for cis/medial vs trans/TGN prediction

Kernel               cis/medial   trans/TGN    all
String: spacing 1        64.0        43.0     55.2
String: spacing 0        73.4        58.5     67.2
String: spacing 3        64.2        76.6     69.4
String: spacing 2        64.3        84.3     73.0
Structure based          78.5        72.6     76.1

Table 2. Confusion table for combined predictor

                         Predicted
Experimental     cis   medial   trans   TGN
cis               6      3        1      2
medial            0      5        0      1
trans             0      0        3      0
TGN               1      2        1      5

sequence the cis/medial vs. trans/TGN predictor returns cis/medial and the cis vs. the rest predictor returns not cis, then the prediction would be medial. Table 2 shows the confusion table for the resulting predictor. The cross-validated prediction accuracy is 61%.

Application to a variety of glycosyltransferases demonstrates the power of our approach. For example, we obtain consistent predictions when comparing human-mouse orthologs, whereas applying a simple sequence-similarity based predictor results in much less consistent predictions. In addition, comparison with a large set of glycan structures, which are the products of the enzymatic actions of the glycosyltransferases, demonstrates a significant correlation between sub-Golgi localization and the predicted ordering of different steps in glycan biosynthesis.

References

Brunger, A. T., et al. (1998). Crystallography & NMR System: A new software suite for macromolecular structure determination. Acta Crystallographica Section D: Biological Crystallography, 54, 905-921.

Joachims, T. (1999). Making large-scale SVM learning practical. In: Advances in Kernel Methods - Support Vector Learning. MIT Press.

Shen, J. W., et al. (2007). Predicting protein-protein interactions based only on sequences information. Proceedings of the National Academy of Sciences of the United States of America, 104, 4337-4341.

Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7.


Prediction of genetic risk of complex diseases by supervised learning

Vincent Botta [email protected]
Pierre Geurts [email protected]
Louis Wehenkel [email protected]

Department of Electrical Engineering and Computer Science, GIGA-Research, University of Liege, B4000 Belgium

Sarah Hansoul [email protected]

Animal Genomics, GIGA-Research, University of Liege, B4000 Belgium

1. Whole genome association studies

The majority of important medical disorders (e.g. susceptibility to cancer, cardiovascular diseases, diabetes, Crohn's disease) are said to be complex. This means that these diseases are influenced by multiple, often interacting, environmental and genetic risk factors. The fact that individuals differ in terms of exposure to environmental as well as genetic factors explains the observed inter-individual variation in disease outcome (i.e. phenotype). The proportion of the phenotypic variance that is due to genetic factors (heritability) typically ranges from less than 10% to over 60% for the traits of interest. The identification of genes influencing susceptibility to complex traits reveals novel targets for drug development, and allows for the implementation of strategies towards personalized medicine.

Recent advances in marker genotyping technology allow for the genotyping of hundreds of thousands of Single Nucleotide Polymorphisms (SNPs) per individual at less than 0.1 eurocents per genotype. As a result, the identification of genomic regions (i.e. loci) that influence susceptibility to a given disease can now be obtained by means of so-called "whole genome association studies" (WGAS).

2. Supervised learning for WGAS

The basic idea behind a WGAS is to genotype a collection of affected (cases) and unaffected (controls) individuals for a very large number of SNPs spread over the entire genome. Genomic regions showing statistical differences between cases and controls are then detected using this dense collection of SNPs. From a machine learning point of view, the analysis of such a dataset is a binary classification problem with a very large number of raw symbolic variables, each one corresponding to a different SNP and having only three possible values (homozygous wild, heterozygous and homozygous mutant). On top of this very high p/n ratio, these problems are also generally highly noisy, and the raw input variables are strongly correlated (which is explained by the so-called linkage disequilibrium).

In this research we study two different representations of the input data for the application of supervised learning: on the one hand the raw genotype data, and on the other hand groups of strongly correlated SNPs (i.e. haplotype blocks), representing the observed combinations of about 10 to 100 phased genotypes between the recombination hotspots of the different chromosomes. We report an empirical study based on several simulated datasets where one or two independent or interacting causal mutations on a single chromosome are studied. We provide comparative results of different ensembles of randomized decision trees adapted to handle the particular nature of these two types of input variables. These methods are assessed in terms of their predictive power as well as their ability to help identify the genomic regions containing the causal mutations.

3. Methods

3.1. Dataset generation

We used the program gs (Li & Chen, 2008) to generate samples based on HapMap data (Consortium, 2003), so as to keep the linkage disequilibrium patterns similar to those in actual human populations, and focused on chromosome 5. The raw input variables were obtained by taking SNPs spaced by 10 kilobases from the HapMap pool to reproduce classical WGAS conditions, and the causal disease loci were removed from the input variables.

Using genotype penetrance tables, 5 different disease models were tested: two for the one-locus experiments,


Figure 1. Haplotype block statistics. Top: block modalities; Middle: block length; Bottom: scatter plot

and the three most common disease models with interactions (Li & Reich, 2000) for the two-locus case. We considered different noise and penetrance values.

In the first (raw) data representation, the different databases are composed of 14604 symbolic variables with 3 possible values. The second representation is a variant of the first where we group correlated variables into blocks (haplotype blocks chosen according to HapMap hotspots). This dramatically reduces the number of variables but also increases their number of modalities (up to a few hundred possible combinations when the sample comes from a broad population). In total, this yielded 1957 haplotype blocks. Figure 1 shows the histograms of block lengths and numbers of modalities.

3.2. Supervised learning

We evaluated Random Forests and Extra-Trees (see Geurts et al., 2006 for a precise description of these algorithms and related notions). These methods were customized in an ad hoc way to handle the datasets of the haplotype block variant. Various values of their two main meta-parameters (number of tested attributes and number of trees) were screened, while the trees were fully developed.

Learning was repeated 10 times on balanced learning sets (containing between 100 and 1000 controls and as many cases). All models were evaluated on the same independent and balanced test set of size 5000.

The predictive power was assessed using the mean area under the ROC curve and compared to the best possible theoretical AUCs, which were deduced from the selected disease model.

We ranked SNPs and haplotype blocks using variable importances based on information theory (see Wehenkel, 1998), and provide the mean rank of the SNPs adjacent to the causal mutations, or of the block(s) containing these mutation(s).
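As a rough illustration only (synthetic data and default scikit-learn settings; not the customized implementation used in this study), the raw-genotype workflow of training an ensemble, computing the AUC and ranking variables could be sketched as:

```python
# Illustrative sketch: Extra-Trees on a synthetic SNP matrix coded 0/1/2,
# with AUC evaluation and importance-based variable ranking. Toy data only.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_samples, n_snps = 2000, 500                        # toy sizes, not the 14604 SNPs of the study
X = rng.integers(0, 3, size=(n_samples, n_snps))     # genotypes: 0, 1, 2
# Toy phenotype driven by two "causal" SNPs plus noise (hypothetical model).
logit = 0.8 * X[:, 10] + 0.6 * X[:, 42] - 1.5
y = (rng.random(n_samples) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

model = ExtraTreesClassifier(n_estimators=500, criterion="entropy", random_state=0)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
ranking = np.argsort(model.feature_importances_)[::-1]   # most important SNPs first
print(f"test AUC = {auc:.3f}; top-ranked SNPs: {ranking[:5]}")
```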

4. Preliminary results

Preliminary results show good perspectives. In particular, the different methods obtain rather good AUCs compared with the theoretical upper bound derived from the disease models. The different methods are also able to predict and to localize the disease loci rather well. We also observed that, most often, the direct application of supervised learning to the raw genotype data provides slightly better results, both in terms of risk prediction and loci identification, than the application of these methods to haplotype blocks. This essentially suggests that further work should focus on a better determination of the haplotype block structure from the datasets themselves (rather than extrapolating these structures from other cohorts, as was the case in these first investigations).

Acknowledgments

This paper presents research results of the Belgian Network BIOMAGNET (Bioinformatics and Modeling: from Genomes to Networks), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its authors. Vincent Botta is the recipient of a F.R.I.A. fellowship. Sarah Hansoul is a postdoctoral research fellow of the F.R.S.-FNRS and Pierre Geurts is a Research Associate of the F.R.S.-FNRS.

References

Consortium, T. I. H. (2003). The International HapMap Project. Nature, 426, 789–796.

Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.

Li, J., & Chen, Y. (2008). Generating samples for association studies based on HapMap data. BMC Bioinformatics, 9, 44.

Li, W., & Reich, J. (2000). A complete enumeration and classification of two-locus disease models. Hum Hered, 50, 334–349.

Wehenkel, L. (1998). Automatic learning techniques in power systems. Kluwer Academic Publishers.


Component analysis for genome-wide association studies

Gilles Meyer [email protected]
Rodolphe Sepulchre [email protected]

Department of Electrical Engineering and Computer Science, GIGA Bioinformatics Platform, University of Liege, Belgium.

1. Introduction

This work illustrates the application of component analysis methods, such as principal component analysis (PCA) and independent component analysis (ICA), to the analysis of SNP databases. The problems of association mapping and population stratification are both addressed with these methods.

2. Component analysis

In the general framework of component analysis, the data matrix X ∈ R^(m×n) is approximated by the product of two lower-rank matrices A ∈ R^(m×k) and S ∈ R^(k×n) with k ≤ m:

X ≈ AS (1)

where S contains the reduced data and the columns of A are the directions spanning the subspace of the reduced data.

If the directions are constrained to be mutually orthogonal and computed to retain as much variance as possible from the original data, the factorization (1) corresponds to a principal component analysis of X.

Another possibility is to identify the directions that make the rows of S as statistically independent as possible. This objective is pursued in independent component analysis (Hyvarinen et al., 2001).

3. SNP databases

The human DNA sequence is about 3 billion base pairs of nucleotides (A-T-C-G) arranged into 23 chromosomes. Each individual has one pair of each chromosome: one is inherited from the maternal side and the other one from the paternal side.

In the world's population, there are about 10 million sites, or loci, that vary between individuals. Such loci, referred to as single nucleotide polymorphisms (SNPs), are natural candidates in the search for the causal differences responsible for diseases or other phenotypes of interest.

At one particular SNP locus, there are usually two alleles (specific nucleotides) observed across the population. Thus, a SNP database can be represented as a matrix whose elements can take 3 discrete values: 2 if the arbitrary reference allele is carried on each chromosome, 1 if the two different alleles are observed, and 0 if the non-reference allele is present on each chromosome. To date, an order of magnitude for these databases is the measurement of 10^6 SNPs for 10^3 individuals. These numbers are rapidly increasing with the development of cheaper technologies.

4. Association mapping

The analysis of SNP databases aims at finding loci biologically related to a measured phenotype, for example a particular disease.

The potential of ICA to perform such an analysis has been illustrated in (Dawy et al., 2005) on simulated data.

In this work, the method is applied to a real database concerned with the identification of loci involved in Crohn's disease.

5. Population stratification

A problem encountered in association mapping is the presence of individuals coming from different populations. This introduces a bias into the observed allele frequencies, leading to false discoveries in the mapping process.

In (Price et al., 2006), PCA is used to correct for this stratification effect by computing a component linked to the population structure and then removing it from the data. The issue will be discussed in the context of a large Crohn's disease database.
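A minimal sketch of this idea (toy genotype matrix; not the EIGENSTRAT implementation of Price et al.) is:

```python
# Toy sketch: correct a genotype matrix for one stratification axis by
# projecting out the leading principal component (in the spirit of
# Price et al., 2006; not their actual code).
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 1000)).astype(float)   # individuals x SNPs, coded 0/1/2

Xc = X - X.mean(axis=0)                  # center each SNP column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
axis = U[:, 0:1]                         # leading axis of variation across individuals
X_corrected = Xc - axis @ (axis.T @ Xc)  # remove its contribution from the data
```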


Acknowledgments

This work was supported by the Belgian National Fund for Scientific Research (FNRS) through a Research Fellowship at the University of Liege. This paper presents research results of the Belgian Network BioMAGNet (Bioinformatics and Modelling: from Genomes to Networks), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its author(s).

References

Dawy, Z., Sarkis, M., Hagenauer, J., & Mueller, J. C. (2005). A novel gene mapping algorithm based on independent component analysis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 381–384). Philadelphia.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. Wiley-Interscience.

Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., & Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38, 904–909.


Morphological Feature Extraction of Intracranial Pressure Signals via Nonlinear Regression

Fabien Scalzo [email protected]
Peng Xu [email protected]
Marvin Bergsneider [email protected]
Xiao Hu [email protected]

Division of Neurosurgery, Geffen School of Medicine, University of California, Los Angeles, CA, USA

Abstract

The management of many neurological disorders such as traumatic brain injuries relies on the continuous measurement of intracranial pressure (ICP). Following recent studies, the automatic analysis of ICP pulses seems promising for forecasting intracranial and cerebrovascular pathophysiological changes. The MOCAIP algorithm has recently been developed to automatically extract ICP morphological features (in terms of sub-peak positions) in real time. This paper extends MOCAIP by using a regression model instead of Gaussian priors during the peak designation to improve the accuracy of the process. Experimental evaluations conducted on real clinical data indicate that the use of a regression model significantly increases the peak designation accuracy.

1. Introduction

The management of many neurological disorders such as traumatic brain injuries relies on the continuous measurement of intracranial pressure (ICP). Following recent studies (Hu et al., 2008), variations of the ICP signal are linked to the development of intracranial hypertension and cerebral vasospasm, acute changes in the cerebral blood carbon dioxide (CO2) levels, and changes in the craniospinal compliance. Therefore, the automatic and continuous analysis of ICP features appears promising for better monitoring, understanding and forecasting of intracranial and cerebrovascular pathophysiological changes.

Processing ICP signals to extract features in a continuous and reliable way is, however, very challenging and beyond most state-of-the-art ICP analysis methods.

2. Previous Work

The MOCAIP algorithm (Hu et al., 2008) (Morphological Clustering and Analysis of ICP Pulse) has recently been developed to extract morphological changes of the ICP pulse in real time. The algorithm relies on the fact that the ICP waveform is triphasic (i.e. there are three sub-peaks in each ICP pulse). The MOCAIP algorithm offers several interesting properties: it is able to enhance ICP signal quality, to recognize legitimate ICP pulses and to detect the three sub-components in an ICP pulse. This last step is done by considering a set of peak candidates (extracted at curve inflexion points), and by identifying the three peaks among them. During this assignation, MOCAIP makes use of Gaussian priors to set the position of each peak such that the configuration maximizes the probability of observing the peaks given the prior distributions. However, this can be problematic because the position of the peaks within the pulse presents a large variation that translates into large-variance priors and weakens the effectiveness of the peak designation step.

3. Approach

This work introduces an extension of the MOCAIP algorithm to improve the accuracy of the peak designation. The innovative idea is to consider the location of the peaks (p1, p2, p3) as a function f(x) of the pulse signal (discretized as a vector x) (Fig. 1). To this end, a regression model is exploited during the peak designation, instead of the Gaussian priors, to extract the most likely location of each peak,

y = f(x)  (1)
⇔ (p1, p2, p3) = ax + b  (2)

Given a set of annotated training pulses, an efficient Spectral Regression (SR) analysis (Cai et al., 2007) is used to estimate the linear function f(x). Spectral Regression is a recent method which combines


Figure 1. A regression model f(x) is used to predict the positions a, b and c of the three peaks. The pulse is discretized and normalized into a vector x.

spectral graph analysis and ordinary regression. The main idea of Spectral Regression is to use eigenvectors of the affinity (i.e. item-item similarity) matrix to reveal a low-dimensional structure of high-dimensional data. In our framework, we use an RBF kernel to project the data into a higher dimensional space and thus capture the nonlinear relation between the pulse x and the position of the peaks y = (p1, p2, p3).

Once the peak positions have been predicted by the model, a nearest-neighbor matching algorithm is used to assign the candidates to the label of the closest prediction.
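A rough sketch of this two-stage designation (kernel ridge regression standing in here for the Spectral Regression solver of Cai et al.; pulses, labels and candidate positions are hypothetical toy data) might look like:

```python
# Rough sketch of the peak-designation step: regress peak positions from the
# pulse vector, then assign detected candidates to the closest prediction.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X_train = rng.random((500, 300))           # 500 normalized pulses, 300 samples each (toy)
Y_train = rng.integers(20, 280, (500, 3))  # annotated positions of p1, p2, p3 (toy)

model = KernelRidge(kernel="rbf", alpha=1.0, gamma=1e-3)
model.fit(X_train, Y_train)

x_new = rng.random((1, 300))               # a new pulse
p_pred = model.predict(x_new)[0]           # predicted positions (p1, p2, p3)

# Nearest-neighbor matching: assign each predicted peak to the closest
# candidate position detected at curve inflexion points.
candidates = np.array([35, 90, 150, 220])
designated = [candidates[np.argmin(np.abs(candidates - p))] for p in p_pred]
print(designated)
```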

4. Experiments

The effectiveness of the proposed extension is evaluated by measuring the accuracy of the algorithm in designating the three ICP peaks on real clinical data. To do so, we assume that the ICP pulses have been previously extracted using MOCAIP and that a set of candidate peaks has been detected.

The dataset used during our experiments contains 13611 ICP pulses that were extracted from the ICP signals of 66 patients. It is a particularly challenging dataset because, among the pulses, 1717 have a missing P1, 265 have a missing P2 and 34 have a missing P3. The average accuracy (in terms of True Positive (TP) and False Positive (FP) rates) is recorded using a five-fold cross-validation procedure. The results obtained by the proposed method are reported in Table 1 and compared to MOCAIP. Our extension achieves a very high true positive rate for correctly designating the first two peaks. The significant improvement in terms of True Positive and False Positive rates is confirmed by the combined accuracy (TP+TN)/(TP+FP+TN+FN); MOCAIP obtains 90%, 88%, and 87% for each peak, while the proposed extension obtains 97%, 98% and 88%.

Figure 2 illustrates detection results on four different pulses. We can observe that the detection is successful despite the large variability in shape of the intracranial pressure (ICP) pulses.

            P1 (TP, FP)   P2 (TP, FP)   P3 (TP, FP)
MOCAIP       91%, 18%      88%, 39%      86%, 53%
this work    99%, 13%      98%, 11%      88%, 56%

Table 1. Peak identification results in terms of True Positive (TP) and False Positive (FP) rates, reported for the MOCAIP algorithm and the proposed Spectral Regression (SR) extension.

Figure 2. Peak detection on four ICP pulses. The ground truth is marked as a green dot and the output of our framework is depicted as a cross.

Acknowledgments

The present work is supported in part by NINDS grants R21-NS055998, R21-NS055045, R21-NS059797 and R01-NS054881.

References

Cai, D., He, X., & Han, J. (2007). Spectral regression for efficient regularized subspace learning. IEEE International Conference on Computer Vision (ICCV'07).

Hu, X., Xu, P., Scalzo, F., Miller, C., Vespa, P., & Bergsneider, M. (2008). Morphological clustering and analysis of continuous intracranial pressure. Submitted to IEEE Transactions on Biomedical Engineering.


An Extended nmf Algorithm for Word Sense Discrimination

Tim Van de Cruys [email protected]

Humanities Computing, University of Groningen, Oude Kijk in ’t Jatstraat 26, 9712 EK Groningen

Abstract

Many words used in natural language are ambiguous: they have various senses. Traditional algorithms dealing with semantic similarity cannot cope with this ambiguity. We present an extension of a dimensionality reduction algorithm called non-negative matrix factorization that combines both 'bag of words' data and syntactic data, in order to find semantic dimensions according to which both words and syntactic relations can be classified. The use of three-way data allows one to determine which dimension(s) are responsible for a certain sense of a word, and to adapt the corresponding feature vector accordingly, 'subtracting' one sense to discover another one. The intuition is that the syntactic features of the syntax-based approach can be disambiguated by the topical dimensions found by the bag-of-words approach.

1. Introduction

Most work on semantic similarity relies on the distributional hypothesis (Harris, 1985). This hypothesis states that words that occur in similar contexts tend to be similar. With regard to the context used, two basic approaches exist. One approach makes use of 'bag of words' co-occurrence data; in this approach, a certain window around a word is used for gathering co-occurrence information. Bag-of-words methods are particularly good at finding topical similarity. One of the dominant methods of this kind is latent semantic analysis (lsa, Landauer et al., 1998).

The second approach uses a more fine-grained distributional model, focusing on the syntactic relations that words appear with. Typically, a large text corpus is parsed, and dependency triples are extracted.1

1 e.g. dependency relations that qualify apple might be 'object of eat' and 'adjective red'. This gives us dependency triples like <apple, obj, eat>.

Syntax-based methods are good at finding a tighter, synonym-like similarity. Note that the former approach does not need any kind of linguistic annotation, whereas for the latter, some form of syntactic annotation is needed.

In this research, a framework is explored that tries to combine the best of both approaches. The intuition is that the syntactic features of the syntax-based approach can be disambiguated by the topical dimensions found by the window-based approach.

2. Methodology

2.1. Extending Non-negative Matrix Factorization

Non-negative matrix factorization (nmf) (Lee & Seung, 2000) is a group of algorithms in which a non-negative matrix V of size n×m is factorized into two other matrices, W of size n×r and H of size r×m, subject to the constraint that W, H ≥ 0.

Typically, r is chosen much smaller than n and m, so that both instances and features are expressed in terms of a few components. In practice, the factorization is carried out through the iterative application of update rules.

Since we are interested in the classification of nouns according to both 'bag-of-words' context and syntactic context, we first construct three matrices that capture the co-occurrence frequency information for each mode. The first matrix contains co-occurrence frequencies of nouns cross-classified by dependency relations, the second matrix contains co-occurrence frequencies of nouns cross-classified by words that appear in the noun's context window, and the third matrix contains co-occurrence frequencies of dependency relations cross-classified by co-occurring context words.2

We then apply nmf to the three matrices, but we interleave the separate factorizations: the results of the former factorization are used to initialize the factorization of the next matrix. This implies that we need to initialize only three matrices at random; the other three are initialized by the calculations of the previous step. The process is represented graphically in Figure 1.

2 All co-occurrence information is extracted from the Twente Nieuws Corpus (Ordelman, 2002). The corpus has been parsed with the Dutch dependency parser Alpino (van Noord, 2006).

Figure 1. A graphical representation of the extended nmf

When the factorization is finished, the three modes (nouns, dependency relations and context words) are classified according to latent semantic dimensions.

2.2. Sense Subtraction

Next, we want to use the factorization created in the former step for word sense discrimination. The intuition is that we 'switch off' one dimension of an ambiguous word, to reveal possible other senses of the word. From matrix H, we know the importance of each syntactic relation given a dimension. With this knowledge, we can 'subtract' the syntactic relations that are responsible for a certain dimension from the original noun vector.

The last step is to determine which dimension(s) are responsible for a certain sense of the word. In order to do so, we embed our method in a clustering approach. First, a specific word is assigned to its predominant sense (i.e. the most similar cluster). Next, the dominant semantic dimension(s) for this cluster are subtracted from the word vector, and the resulting vector is fed to the clustering algorithm again, to see if other word senses emerge.
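A toy sketch of the sense-subtraction step (scikit-learn's plain NMF on random data, rather than the interleaved three-matrix factorization described above) could be:

```python
# Toy sketch of sense subtraction: factorize a noun-by-feature matrix with
# NMF and damp the features that load on one latent dimension. Random data
# stands in for corpus counts; not the paper's interleaved scheme.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((100, 400))            # 100 nouns x 400 syntactic features (toy counts)

model = NMF(n_components=10, init="nndsvda", max_iter=500)
W = model.fit_transform(V)            # noun loadings on 10 latent dimensions
H = model.components_                 # importance of each feature per dimension

noun = V[0].copy()                    # original feature vector of an ambiguous noun
dim = int(np.argmax(W[0]))            # its dominant latent dimension

# "Subtract" the sense: down-weight features in proportion to their
# importance for the dominant dimension, then clip at zero.
weights = H[dim] / (H[dim].max() + 1e-12)
noun_minus_sense = np.clip(noun - noun * weights, 0.0, None)
```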

3. Results

3.1. Example

Example (1) shows the top-10 similar words for the ambiguous proper name Barcelona, which may either refer to the Spanish city or to the Spanish football club. In (a), the results for the original vector are given; the two senses of Barcelona are clearly mixed up, showing cities as well as football clubs among the most similar nouns. In (b), where the 'football' dimension has been subtracted, only cities show up. In (c), where the 'cities' dimension has been subtracted, only football clubs remain.

(1) a. Barcelona, Arsenal, Inter, Juventus, Vitesse, Milaan 'Milan', Madrid, Parijs 'Paris', Wenen 'Vienna', Munchen 'Munich'

b. Barcelona, Milaan 'Milan', Munchen 'Munich', Wenen 'Vienna', Madrid, Parijs 'Paris', Bonn, Praag 'Prague', Berlijn 'Berlin', Londen 'London'

c. Barcelona, Arsenal, Inter, Juventus, Vitesse,Parma, Anderlecht, PSV, Feyenoord, Ajax

3.2. Evaluation

Our method has been embedded in an automatic clustering framework and evaluated against Dutch EuroWordNet (Vossen et al., 1999). Compared to Pantel and Lin (2002) – considered the state of the art in word sense discrimination – our method consistently scores higher with regard to precision, but lower with regard to recall (e.g. with a wordnet similarity threshold θ = 0.50, p = 69% and r = 56% for our method vs. p = 38% and r = 60% for Pantel and Lin's).

References

Harris, Z. (1985). Distributional structure. In J. J. Katz (Ed.), The philosophy of linguistics, 26–47. Oxford University Press.

Landauer, T., Foltz, P., & Laham, D. (1998). An introduction to Latent Semantic Analysis. Discourse Processes, 25, 259–284.

Lee, D. D., & Seung, H. S. (2000). Algorithms for non-negative matrix factorization. NIPS (pp. 556–562).

Ordelman, R. (2002). Twente Nieuws Corpus (TwNC). Parlevink Language Technology Group, University of Twente.

Pantel, P., & Lin, D. (2002). Discovering word senses from text. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 613–619). New York, NY, USA: ACM Press.

van Noord, G. (2006). At Last Parsing Is Now Operational. TALN06. Verbum Ex Machina. Actes de la 13e conférence sur le traitement automatique des langues naturelles (pp. 20–42). Leuven.

Vossen, P., et al. (1999). EuroWordNet: building a multilingual database with wordnets for several European languages.


Automatic Vandalism Detection in Wikipedia: Towards a Machine Learning Approach

Koen Smets [email protected]
Bart Goethals [email protected]
Brecht Verdonk [email protected]

Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium

Abstract

Since the end of 2006 several autonomous bots are, or have been, running on Wikipedia to keep the encyclopedia free from vandalism and other damaging edits. These expert systems, however, are far from optimal and should be improved to relieve the human editors from the burden of manually reverting such edits. We investigate the possibility of using machine learning techniques to build an autonomous system capable of distinguishing vandalism from legitimate edits. We highlight the results of a small but important step in this direction by applying commonly known machine learning algorithms using a straightforward feature representation. This study demonstrates that elementary features, which are also used by the current approaches to fight vandalism, are not sufficient to build such a system. They will need to be accompanied by additional information which, among other things, incorporates the semantics of a revision.

1. Experiments

We will discuss the setting of our machine learning experiment conducted on simplewiki, the Simple English version of Wikipedia. We first consider the labeling of the data and its representation. Thereafter we discuss the results of two learning algorithms put to the test: a Naive Bayes classifier on bags of words (BOW) (McCallum, 1996) and a combined classifier built using probabilistic sequence modeling (Bratko et al., 2006), also referred to in the literature as statistical compression models.

1.1. Labeling of Revisions

As a proof of concept, and because of space and time constraints, we run the preliminary machine learning experiments on Simple English Wikipedia, a user-contributed online encyclopedia intended for people whose first language is not English.

The data is labeled by inspecting comments that signal a revert action, i.e. an action which restores a page to a previous version. This approach closely resembles the identification of the set of revisions denoted by Priedhorsky et al. (2007) as Damaged-Loose, a superset of the revisions explicitly marked as vandalism (Damaged-Strict).

While labeling based on commented revert actions is a good first-order approximation, mislabeling cannot be excluded. If we regard vandalism as the positive class throughout this abstract, then there will be both false positives and false negatives. The former arise when reverts are misused for purposes other than fighting vandalism, such as undoing changes made without proper references or prior discussion. The latter occur when vandalism is corrected but not marked as reverted in the comment, or when vandalism remains undetected for a long time. Estimating the number of mislabelings is very hard and manual labeling is out of the question, considering the vast amount of data.

1.2. Revision Representation

In this case study we use the simplest possible data representation. As for ClueBot (Carter, 2007) and VoABot II, the two active vandal-fighting bots on Wikipedia nowadays, we extract raw data from the current revision and from the history of previous edits. In particular, for each revision we use its text, the text of the previous revision, the user groups (anonymous, bureaucrat, administrator, ...) and the revision comment. We also experimented with including the lengths of the revisions as extra features. The effect on overall performance is however minimal and thus we discard them in this analysis. Hence the focus here lies more on the content of an edit.

As the modified revision and the one preceding it differ only slightly, it makes sense to summarize an edit. Like ClueBot, we calculate the difference using the standard diff tool. Processing the output gives us three types of text: lines that were inserted, deleted or changed. As the changed lines only differ from each other in some words or characters, we again compare these using wdiff. Basically, this is the same as what users see when they compare revisions visually.
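A rough sketch (toy revisions and labels; not the authors' pipeline) of turning an edit into a bag-of-words example for a Naive Bayes classifier could be:

```python
# Toy sketch: summarize an edit as the lines it inserts/deletes and feed a
# bag-of-words representation to a multinomial Naive Bayes classifier.
import difflib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def edit_text(old: str, new: str) -> str:
    """Keep only inserted (+) and deleted (-) lines of a revision diff."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
    return " ".join(l[1:] for l in diff
                    if l[:1] in "+-" and l[:3] not in ("+++", "---"))

edits = [
    edit_text("Spa is a town in Belgium.", "Spa is a town in Belgium. SPAM SPAM!!!"),
    edit_text("The cat sat.", "The cat sat on the mat."),
]
labels = [1, 0]   # 1 = vandalism, 0 = legitimate (toy labels)

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(edits), labels)
print(clf.predict(vec.transform([edit_text("Intro.", "Intro. BUY PILLS!!!")])))
```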

1.3. Analysis and Discussion

Table 1 indicates the performance of a simplified version of ClueBot. It is lower than the performance of the original ClueBot, which relies on a user whitelist for trusted users and only reverts edits made by anonymous or new users (Carter, 2007). Table 2 shows the results of the machine learning experiments on a 40% test set.

1.3.1. BOW + Naive Bayes

The precision of the Naive Bayes classifier that only takes into account the revision diff features as bags of words, both with and without user group information and revision comments, is almost the same as in Table 1. A significant increase can be noticed in terms of recall and F1, especially when including user group information and comments.

1.3.2. Probabilistic Sequence Modeling

It is interesting to note that the recall is much higher, but that the precision drops unexpectedly. We lack a plausible explanation for this strange behaviour, but the effect can be diminished by setting the threshold parameter to a score higher than zero. This is shown in Figure 1, where we plot the precision/recall curves at varying thresholds for the probabilistic sequence models and for the Naive Bayes models, both with and without user groups and comments. The marks show the results when the log-ratio threshold is equal to 0. The tendency is that, despite the worse behaviour shown in Table 2, the overall accuracy measured in terms of precision and recall is better for the compression-based models than for the bag-of-words model using Naive Bayes.

Acknowledgements

Koen Smets is supported by a Ph.D. fellowship of the Research Foundation - Flanders (FWO).

Table 1. Performance of ClueBot (without user whitelist) on Simple English Wikipedia.

            ACC      PRE      REC      F1
ClueBot    0.9270   0.6114   0.1472   0.2372

Table 2. Results for Naive Bayes and Probabilistic Sequence Modeling.

(a) revision diff

        ACC      PRE      REC      F1
NB     0.9303   0.6166   0.2503   0.3561
PSM    0.8554   0.3117   0.7201   0.4351

(b) revision diff + comment + user groups

        ACC      PRE      REC      F1
NB     0.9314   0.5882   0.3694   0.4359
PSM    0.8436   0.3209   0.9171   0.4755

Figure 1. Precision/Recall curves.

References

Bratko, A., Cormack, G. V., Filipic, B., Lynam, T. R., & Zupan, B. (2006). Spam filtering using statistical data compression models. JMLR, 6, 2673–2698.

Carter, J. (2007). ClueBot and vandalism on Wikipedia. Unpublished. Available at http://24.40.131.153/ClueBot.pdf.

McCallum, A. K. (1996). Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering. Available at http://www.cs.cmu.edu/~mccallum/bow.

Priedhorsky, R., Chen, J., Lam, S. T. K., Panciera, K., Terveen, L., & Riedl, J. (2007). Creating, destroying, and restoring value in Wikipedia. Proceedings of the ACM SIGGROUP conference.


Supervised learning of short-term strategies for generation planning

Bertrand Cornelusse [email protected]
Louis Wehenkel [email protected]

Department of Electrical Engineering and Computer Science, University of Liege, 4000 Liege, Belgium

Gerald Vignal [email protected]

OSIRIS department, EDF R&D, 1 avenue du general de Gaulle, F-92141 Clamart Cedex, France

1. Short term electricity generation planning

Short term electricity generation planning aims at deciding which generation units will be in operation for a certain time period and how much they will produce. Typical generation pools contain a mix of generation units: classical thermal plants, nuclear plants, hydroelectric generators, wind turbines, etc. Wind turbines can be considered as non-controllable over a time horizon of one day and can be seen as negative loads. On the other hand, the operation of thermal plants and hydroelectric generators has to be planned. Although this problem can be formulated as a multi-stage stochastic programming problem, its complexity is by far too large to allow for an exact solution by available methods. In current practice, the generation plans are generally optimized deterministically, based on a demand forecast and generation unit availability assumptions. The generation plans are typically computed the day before they are executed and adjusted in real time by good-practice rules so as to cope with the differences between real-time conditions and forecasts.

We propose a simulation-based approach which uses deterministic optimization methods to compute optimal plannings for a set of possible scenarios and extracts, by machine learning, recourse strategies from these simulations for the next day.

2. Problem description

In the context of short term generation planning, the generation pattern must satisfy some coupling constraints (CC) linking all the generation units. First, as electricity cannot be stored in sufficient quantities, generation must always be close to demand. Secondly, for some more technical reasons, certain levels of ancillary services are required. The generation pool is typically divided into 2 categories: thermal units and hydroelectric generation valleys, themselves containing reservoirs, turbines and pumps. The operation of the thermal generation units is restricted by some dynamical constraints (DC), because the thermal units must stay in operation for a minimum duration, and because the levels of the hydro-reservoirs must stay between acceptable limits. One thus has to decide when to start and when to stop thermal units and to fix their set point if they operate; for the hydro-valleys, one has to decide when to use water to generate electricity, when to store some water by pumping and when to spill water out of dams.

The objective of the generating company is to minimize generation costs, including fuel costs, thermal unit start-up costs, and opportunity costs for the utilization of water in hydropower plants.

2.1. Mathematical model

For appropriate choices of the cost functions and for an appropriate formulation of the dynamical constraints, this problem can be modeled as a Mixed Integer Linear Program (MILP):

min_{p_i, s_i, p', s'}  Σ_{i∈I} C_i(p_i) + Σ_{t=1..T} C_P(p'_t) + Σ_{t=1..T} C_S(s'_t)   (1)

s.t.  p'_t = DP_t − Σ_{i∈I} p_{i,t},  ∀t ∈ {1, ..., T}   (2)
      s'_t = DS_t − Σ_{i∈I} s_{i,t},  ∀t ∈ {1, ..., T}   (3)
      p_i ∈ D_i,  s_i ∈ S_i,  ∀i ∈ I.   (4)

The optimization is performed over T time periods. The letter i indexes the set of generating units I. p_i (respectively s_i) is the production (ancillary services level) of unit i over the whole planning period (p_i = (p_{i,1}, ..., p_{i,T})). The constraints (4) indicate that p_i and s_i must stay inside sets encoding the DC. C_i(·) accounts for all the costs linked to the generation units. To model the CC, we define slack variables p' (2) and s' (3): p' is the mismatch between generation and demand, and s' is the mismatch between the required level of ancillary services and the level that is actually provided. These slack variables penalize the objective through the functions C_P(·) and C_S(·).

To solve this problem efficiently, we use a state-of-the-art branch-and-cut algorithm (ILOG, 2007).
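As a toy illustration of a model in the spirit of (1)-(4) (hypothetical costs and demands, binary commitment variables, a one-sided slack for the demand mismatch, and none of the real dynamical constraints; this is not the authors' EDF model), a few lines of PuLP could read:

```python
# Toy MILP in the spirit of (1)-(4): two thermal units, three periods,
# linear fuel costs, fixed running costs via binary on/off variables, and a
# penalized one-sided slack standing in for p'_t. Hypothetical numbers.
import pulp

units, periods = ["u1", "u2"], range(3)
fuel = {"u1": 20.0, "u2": 35.0}            # cost per MWh (toy)
fixed = {"u1": 200.0, "u2": 100.0}         # cost per period when committed (toy)
pmax = {"u1": 80.0, "u2": 60.0}
demand = {0: 70.0, 1: 120.0, 2: 90.0}      # DP_t (toy forecast)
C_P = 1000.0                               # penalty on unserved demand

prob = pulp.LpProblem("toy_generation_planning", pulp.LpMinimize)
p = {(i, t): pulp.LpVariable(f"p_{i}_{t}", 0) for i in units for t in periods}
on = {(i, t): pulp.LpVariable(f"on_{i}_{t}", cat="Binary") for i in units for t in periods}
slack = {t: pulp.LpVariable(f"slack_{t}", 0) for t in periods}

# Objective, cf. (1): generation costs plus penalized mismatch.
prob += (pulp.lpSum(fuel[i] * p[i, t] + fixed[i] * on[i, t]
                    for i in units for t in periods)
         + pulp.lpSum(C_P * slack[t] for t in periods))

for i in units:
    for t in periods:
        prob += p[i, t] <= pmax[i] * on[i, t]               # capacity only if committed
for t in periods:
    prob += slack[t] >= demand[t] - pulp.lpSum(p[i, t] for i in units)  # cf. (2)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], pulp.value(prob.objective))
```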

3. Learning of recourse strategies

As mentioned in Section 1, a unique planning is submitted one day before its execution, but recourse actions are allowed at predefined time steps during the day, say every two hours. Because a complete re-computation of the plannings is not achievable during the day, we propose to use the time that is available off-line to run simulations and to infer from them recourse strategies applicable in real time.

Consider a set D of demand patterns, and a set O of generation units which could become unavailable the next day. Let π∗ be the optimal planning associated with a reference demand scenario D∗ ∈ D. Suppose that we want to compute an optimal recourse strategy σ∗tr(ξtr) for a single a priori fixed recourse time tr ∈ {1, 2, ..., T}, i.e. we want to know the modifications to bring to all the units from time tr+1 to T once the real behavior ξtr of the system between time 1 and tr is known. Let S be the set of scenarios made of one demand of D and of a unit outage of O imposed at a time in {1, ..., tr}. First, we compute the plannings πs for each scenario s ∈ S by imposing the planning π∗ for times 1 to tr and using the optimization problem formulation of Section 2.1 to adjust the planning from time tr+1 to T. The difference πs − π∗ illustrates the impact of the demand variation and the unit outage of scenario s on the reference planning. We exploit these simulations to formulate a supervised learning problem in order to derive a function σ∗tr(ξtr) that will serve as a recourse strategy. The learning set is illustrated in Table 1.

Inputs:
• state of the system at tr,
• observed demand deviation from the forecast until tr,
• unit failure before tr,
• prediction time t > tr.

Output:
• ON/OFF status of unit i at t,
• and/or power of unit i at t.

Table 1. An item of the Learning set.

To simplify the learning phase, we propose to learn the adjustments separately for the different units and time steps t ∈ {tr+1, ..., T}. This approach has the advantage of leading to a (relatively) simple supervised learning formulation, but it raises two issues:

1. Since we apply a time decomposition, the unit DC may not be satisfied; an additional phase will be needed to impose them.

2. Since the strategies are learned unit by unit, the global view of the system is weakened and the global constraints are not satisfied; they will also have to be enforced a posteriori.

Although we focus here on the construction of a recourse strategy for a single period tr, we can derive strategies for multiple periods during the day by applying this procedure iteratively for an increasing sequence of tr ∈ {1, 2, ..., T}.

4. Validation of the learned strategies

We want to assess the optimality of the learned strategies and to compare them to the ones that are currently used. The latter comparison is difficult to achieve as long as this research is carried out through simulations on a reduced generation pool. On the other hand, we can evaluate our strategies on scenarios obtained by Monte-Carlo simulations and compare their cost to

• the reference planning perfectly adjusted to the uncertainties by re-optimization from time tr+1 to T (lower bound),

• the reference planning not adjusted (upper bound).

Acknowledgments

Bertrand Cornelusse is funded by the FRIA (Belgian Fund for Research in Industry and Agriculture). This paper presents research results of the Belgian Network DYSCO, funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its authors.

References

Carpentier, P., Cohen, G., Culioli, J., & Renaud, A. (1996). Stochastic optimization of unit commitment: a new decomposition framework. IEEE Transactions on Power Systems, 11, 1067–1073.

ILOG (2007). ILOG CPLEX 11.0 user’s manual.


Performance Evaluation of Machine Learning Techniques for the Localization of Users in Wireless Sensor Networks

Jean-Michel Dricot [email protected]
Van der Haegen [email protected]
Yann-Aël Le Borgne [email protected]
Gianluca Bontempi [email protected]
Machine Learning Group – Université Libre de Bruxelles – 1050 Bruxelles, Belgium

Abstract

In this paper, we introduce a novel and modular framework that is used to evaluate the performance of several user localization techniques in a wireless sensor environment. Three different stages are considered: (i) the signal acquisition and the corresponding distance model, (ii) the terminal positioning, and (iii) the filtering of the estimates over time. Moreover, we investigated how an accelerometer could be included in the filter models in order to further refine the accuracy.

The above-stated modules have been implemented in several variants and combined, and the corresponding performance is investigated.

1. Motivation

Our research is specific in that it does not aim at implementing a single localization technique but rather focuses on the development of a testbed suitable for evaluating the performance of different positioning approaches. The hardware we used includes a mesh of pre-deployed sensors (TMote Invent with CC2420 Chipcon radios and accelerometer) whose positions are known. A lightweight computer worn by the user embeds the localization software and collects the real-world data.

The localization process is achieved through consecutive steps, as presented in Figure 1. For

[Figure 1 diagram: a fixed sensor mesh and a mobile sensor/user terminal feeding radio signal collection, a signal-to-distance model, position estimation, motion vs. static classification, and filtering, leading to the localization output]

Figure 1. The modular architecture of the localization framework. The three modules on the left of the schema have multiple implementations.

each step, we provide multiple implementations. First, since the radio signal in an indoor environment presents large variations over time, different beaconing protocols are evaluated to average and de-noise the estimate of the signal strength. Second, we consider several variants of the multilateration technique (Savvides et al., 2003). These are: (i) the simple, (ii) the subsampled, (iii) the nearest-neighbour, and (iv) the weighted nearest-neighbour multilateration. Third, we investigate how the data issued by the accelerometer can be used to distinguish user mobility from natural body movement (which is considered as noise). The corresponding instantaneous position estimate and the data of the accelerometer are merged into a recursive filtering module. Three different Kalman filters (Brown & Hwang, 1996) are provided, each of them having a different underlying model.
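A bare-bones sketch of the simple multilateration variant (linearized least squares from assumed anchor positions and range estimates; not the testbed implementation) could be:

```python
# Minimal linearized least-squares multilateration sketch: estimate a 2D
# position from known anchor coordinates and (noisy) distance estimates.
import numpy as np

anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
dists = np.array([7.1, 7.0, 7.2, 7.3])   # e.g. from a signal-to-distance model (toy values)

# Subtract the last anchor's range equation to remove the quadratic unknowns,
# leaving a linear system A x = b in the position x = (px, py).
ref = anchors[-1]
A = 2.0 * (anchors[:-1] - ref)
b = (dists[-1] ** 2 - dists[:-1] ** 2
     + np.sum(anchors[:-1] ** 2, axis=1) - np.sum(ref ** 2))
position, *_ = np.linalg.lstsq(A, b, rcond=None)
print(position)   # estimated (px, py), near the center for these toy values
```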


2. Results

The experiments were conducted in the basement of an indoor building made of large corridors and metal structures. In a first round, the data issued from the accelerometer were processed. It has been noted that the variance of the acceleration is a good estimator to detect the user's behaviour, i.e., whether the user is static or in motion. In Figure 2, the variance of the acceleration along the x-axis and the y-axis is reported and the corresponding behaviour is annotated. A classification technique known as the Support Vector Machine (Shawe-Taylor & Cristianini, 2000) was used, with an overall accuracy of up to 90%.
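A small sketch of this classification step (synthetic variance features and scikit-learn's SVC; not the testbed code):

```python
# Toy sketch: classify static vs. mobile behaviour from the variance of the
# acceleration along x and y, using an SVM. Synthetic features only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
static = rng.normal([2.0, 2.0], 1.0, size=(50, 2))      # low-variance features
mobile = rng.normal([30.0, 40.0], 10.0, size=(50, 2))   # high-variance features
X = np.vstack([static, mobile])
y = np.array([0] * 50 + [1] * 50)                       # 0 = static, 1 = mobile

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.predict([[3.0, 2.5], [28.0, 45.0]]))          # likely [0, 1] for these toy clusters
```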

[Scatter plot of σ²_x vs. σ²_y (in m² s⁻⁴) for static and mobile samples]

Figure 2. Classification of the terminal mobility. It is characterized by the variance of the acceleration along the two axes of the mobile sensor.

[Histogram of the localization error (RMSE of the distance estimation, in m) for the multilateration algorithm]

Figure 3. Distribution of the RMSE of the position as evaluated by the multilateration algorithm.

Figure 5 presents the underlying model for the various filters we implemented. We will now focus

[Histogram of the localization error (RMSE of the distance estimation, in m) for multilateration combined with Kalman filtering]

Figure 4. Distribution of the RMSE of the position after filtering and fusion with the information of the accelerometer.

on the joint use of the accelerometer to further improve a Kalman filter.

Figure 3 reports the Root Mean Square Error (RMSE) of the position estimated by the multilateration technique without additional filtering. One can observe that its average positioning error is µ = 5.4 m and that this technique presents a significant variance (σ = 4.9 m). In a second experiment, shown in Figure 4, the samples collected at the accelerometer were used to further refine the filter model. In that case, the average RMSE of the estimation falls to 3.4 m while, at the same time, the standard deviation of the error is divided by 3, i.e., σ = 1.6 m. Our results suggest that the fusion of multiple sensors has a significant, positive impact on non-supervised user localization and tracking.

References

Brown, R., & Hwang, P. (1996). Introduction to random signals and applied Kalman filtering. Wiley Press.

Savvides, A., Park, H., & Srivastava, M. (2003). The n-hop multilateration primitive for node localization problems. Journal of Mobile Networking Applications.

Shawe-Taylor, J., & Cristianini, N. (2000). Support vector machines. Cambridge University Press.


[Figure 5 diagram: hidden real-life space (motion, position, speed at times t−1 and t) observed through multilateration (position/speed, observed position) and an accelerometer-based classification, with random acceleration and radio signal noise processes]

Figure 5. Underlying model for the design of the customizable Kalman filter. The variables are noted as ovals, the transition functions as squares, and the noise processes as a cloud. Note that the samples collected by the accelerometer are only used in selected implementations of the filter; they are therefore noted as dashed.


Bayes-Relational Learning of Opponent Models from Incomplete Information in No-Limit Poker

Marc Ponsen
MICC, Universiteit Maastricht, Netherlands

Jan Ramon
Kurt Driessens
Declarative Languages and Artificial Intelligence Group, Katholieke Universiteit Leuven (KUL), Belgium

Tom Croonenborghs
Biosciences and Technology Department, KH Kempen University College, Belgium

Karl Tuyls
Technische Universiteit Eindhoven, Netherlands

Abstract

We propose an opponent modeling approach for No-Limit Texas Hold'em poker that starts from a (learned) prior, i.e., general expectations about opponent behavior, and learns a relational regression tree function that adapts these priors to specific opponents. An important asset is that this approach can learn from incomplete information (i.e. without knowing all players' hands in training games).

1. Introduction

For many board and card games, computers have at least matched humans in playing skill. An exception is the game of poker, which offers new research challenges. The complexity of the game is threefold: poker is (1) an imperfect information game, with (2) stochastic outcomes, in (3) an adversarial multi-agent environment. One promising approach used for AI poker players applies an adaptive imperfect-information game-tree search algorithm to decide which actions to take based on expected value (EV) estimates (Billings et al., 2006). This technique (and related simulation algorithms) requires two estimations of opponent information to accurately compute the EV, namely a prediction of the opponent's outcome of the game and a prediction of the opponent's actions. Therefore learning an opponent model is imperative, and this model should include the possibility of using relational features for the game state and history.

In this paper we consider a relational Bayesian approach that uses a general prior (for outcomes and actions) and learns a relational regression tree to adapt that prior to individual players. Using a prior will both allow us to make reasonable predictions from the start and to adapt to individual opponents more quickly, as long as the choice of prior is reasonable.

2. Learning an Opponent Model

We learn an opponent model for players in the game of No-Limit Texas Hold'em poker. To make the model useful for an AI player, we must be able to learn this model from a limited amount of experience and (if possible) adapt the model quickly to changes in the opponent's strategy. An added, and important, difficulty in poker is that we must be able to learn this model given a large amount of hidden information. We propose to start the opponent model with a prior distribution over possible action choices and outcomes. We will allow the model to adapt to different opponents by correcting that prior according to observed experience.

Consider a player p performing the i-th action ai in a game. The player will take into account his hand cards, the board Bi at time point i and the game history Hi at time point i. The board Bi specifies both the identity of each card on the table (i.e., the community cards that apply to all players) and when they appeared, and Hi is the betting history of all players in the game. The player can fold, call or bet. For simplicity, we consider check and call to be in the same class, as well as bet and raise, and we do not consider


the difference between small and large calls or bets at this point.1

We limit the possible outcomes of a game rp for a player p to: 1) p folds before the end of the game (rp = lose), 2) p wins without showing his cards (rp = win) and 3) p shows his cards (rp = cards(X,Y)). This set of outcome values also allows us to learn from examples where we did not see the opponent's cards, registering these cases as win or lose, without requiring the identities of the cards held by the player. The learning task now is to predict the outcome for an opponent, P(rp|Bi,Hi), and the opponent action (given a guess about his hand cards), P(ai|Bi,Hi−1,rp).

2.1. Learning the Corrective Function

We propose a two-step learning approach. First, we learn functions predicting outcomes and actions for poker players in general. These functions are then used as a prior, and we learn a corrective function to model the behavior and statistics of a particular player. The key motivations for this are, first, that learning the difference between two distributions is an elegant way to learn a multi-class classifier (e.g. predicting distributions over 2+(52*53/2) possible outcomes) by generalizing over many one-against-all learning tasks, and, second, that even with only a few training examples from a particular player, accurate predictions are already possible.

In the following description, the term example refers to a tuple (i, p, ai, rp, Hi−1, Bi) of the action ai performed at step i by a player p, together with the outcome rp of the game, the board Bi and the betting history Hi−1.

Consider the mixture Dp+∗ of two distributions: the distribution D∗ of arbitrarily drawn examples from all players and the distribution Dp of arbitrarily drawn examples from a particular player p. Then, consider the learning problem of, given a randomly drawn example x from Dp+∗, predicting whether x originated from D∗ or from Dp. For a given learning setting (either predicting actions from outcomes or predicting outcomes from actions), it is easy to generate examples from D∗ and Dp, labeling them with ∗ or p, and learning the function P(Dp|x), giving for each example x the probability that the example is labeled with p. We do so by using the relational probability tree learner Tilde (Blockeel & De Raedt, 1998). From this learned 'differentiating' model, we can compute

1 We will consider the difference between small and large calls or bets as features in the learned corrective function.

the probability P(x|Dp) for every example x by using Bayes' rule:

P(x|Dp) = P(Dp|x) · P(x) / P(Dp)   (1)

Since we have chosen to generate as many examples for D∗ as for Dp in the mixture,

P(Dp) = P(D∗) = 1/2   (2)
P(x) = P(D∗) P(x|D∗) + P(Dp) P(x|Dp)   (3)

and substituting (2) and (3) into (1) gives:

P(x|Dp) = ( P(Dp|x) · ( (1/2) P(x|Dp) + (1/2) P(x|D∗) ) ) / (1/2)
        = P(Dp|x) P(x|Dp) + P(Dp|x) P(x|D∗).

From this, we easily get:

P(x|Dp) = P(x|D∗) · P(Dp|x) / (1 − P(Dp|x))   (4)

Here, P(x|D∗) is the learned prior and P(Dp|x) is the learned differentiating function.

Having now explained how to learn a player-specificprediction function given a prior, the question remainsas how to learn the prior. We learn the prior by (again)learning a differentiating function between a uniformdistribution and the distribution formed by all exam-ples collected from all players. Even though the uni-form distribution is not accurate, this is not really aproblem as sufficient training examples are available.

3. Experiments and Results

We observed cash games (max 9 players per game) played in an online poker room and extracted examples for players who played more than 2300 games. We randomly selected 20% of the games for the test set, while the remaining games were used to learn an opponent model. We learned one decision tree for all examples in the preflop phase and another for all remaining examples from the other phases, i.e., the postflop phases. The language bias used by Tilde (i.e., all possible tests for learning the decision tree) includes tests to describe the game history H_i at time i (e.g. game phase, number of remaining players, pot odds, previously executed actions, etc.), the board history B_i at time i, as well as tests that check for certain types of opponents that are still active in the game. For example, we may find tests such as "there is an opponent still to act in this round who is aggressive after the flop, and this player raised earlier in this game".
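The fragment below sketches, in plain Python, the kind of tests such a language bias could expose; the exact predicates available to Tilde are not listed in the paper, so every name here is a hypothetical stand-in for its relational tests.

```python
# Illustrative feature tests over the game state at step i; all names below are
# hypothetical stand-ins for the relational tests available to Tilde.
def game_state_features(phase, history, pot_odds, opponents):
    return {
        "phase": phase,                          # "preflop", "flop", "turn" or "river"
        "players_remaining": sum(1 for o in opponents if o["active"]),
        "pot_odds": pot_odds,
        "last_action": history[-1] if history else None,
        # relational-style test: an opponent still to act in this round who is
        # aggressive after the flop and raised earlier in this game
        "aggressive_raiser_to_act": any(
            o["active"] and not o["acted_this_round"]
            and o["aggressive_postflop"] and o["raised_this_game"]
            for o in opponents
        ),
    }
```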

To evaluate our learning strategy, we report the log-likelihoods of the learned distributions and compare them with reasonable priors. A model with a higher likelihood directly allows an algorithm to sample and estimate the actions and outcome more accurately.
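As a reminder of what this metric computes, the sketch below averages log P(observed value | context) over a test set; the model interface shown is an assumption for illustration, not the authors' evaluation code.

```python
import math

def average_log_likelihood(predict_distribution, test_examples):
    """Mean log-probability assigned to the observed outcome (or action) of each test example.

    `predict_distribution(x)` is assumed to return a dict mapping each possible value
    to its predicted probability, and `x.target` the observed value; both interfaces
    are illustrative only.
    """
    total = 0.0
    for x in test_examples:
        p = max(predict_distribution(x).get(x.target, 0.0), 1e-12)  # guard against log(0)
        total += math.log(p)
    return total / len(test_examples)
```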

Figures 1 and 2 plot log-likelihoods averaged over 8 players for different training set sizes. The priors are clearly better than uninformed priors (i.e., not using learning). After having observed 200 games, the likelihood in general improves with the size of the training set.

[Figure omitted: plot "Predicting the Outcome of a Player", with one learned and one prior curve per game phase (Preflop, Flop, Turn, River).]

Figure 1. Experiment predicting the outcome, given board and game history. The x-axis represents the number of games used for the training set, and the y-axis the averaged log-likelihood scores on examples in the test set.

[Figure omitted: plot "Predicting the Action of a Player", with one learned and one prior curve per game phase (Preflop, Flop, Turn, River).]

Figure 2. Experiment predicting the action, given outcome, board and game history. The axes are similar to those in Figure 1.

4. Conclusions

We presented a Bayes-relational opponent modeling system that predicts both actions and outcomes for human players in the game of No-Limit Texas Hold'em poker. Both these sources of opponent information are crucial for simulation and game-tree search algorithms, such as the adaptive tree search method of Billings et al. (2006). The Bayes-relational opponent modeling approach starts from prior expectations about opponent behavior and learns a relational regression-tree function that adapts these priors to specific opponents. Our experiments show that our model adapts to specific player strategies relatively quickly.

Acknowledgments

Marc Ponsen is sponsored by the Interactive Collaborative Information Systems (ICIS) project, supported by the Dutch Ministry of Economic Affairs, grant nr. BSIK03024. Jan Ramon and Kurt Driessens are postdoctoral fellows of the Research Foundation - Flanders (FWO).

References

Billings, D., Davidson, A., Schauenberg, T., Burch, N., Bowling, M., Holte, R. C., Schaeffer, J., & Szafron, D. (2006). Game-tree search with adaptation in stochastic imperfect-information games. The 4th Computers and Games International Conference (CG 2004), Revised Papers (pp. 21–34). Ramat-Gan, Israel: Springer.

Blockeel, H., & De Raedt, L. (1998). Top-down induction of first order logical decision trees. Artificial Intelligence, 101, 285–297.


System modeling with Reservoir Computing

Francis Wyffels [email protected]
B. Schrauwen [email protected]
D. Stroobandt [email protected]

Electronics and Information Systems Department, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium

Abstract

Reservoir Computing is a novel technique which can be applied to a wide range of applications. In this work we demonstrate that Reservoir Computing can be used for black-box nonlinear system modeling. We will use Reservoir Computing to model the output flow of a heating tank with variable dead-time.

1. Introduction

Many control engineering techniques, in particular Model Predictive Control strategies, are based on process models. These models are obtained from physical principles or are data-driven models (Camacho et al., 2007). Most of the data-driven models are black-box models based on Analog Neural Networks (Camacho et al., 2007), which cannot cope with problems that have a strong temporal aspect. Therefore, some research has focused on the use of Recurrent Neural Networks, which have memory due to the loops inside the network. Unfortunately, Recurrent Neural Networks are hard to train.

Reservoir Computing is a recently developed technique for the very fast training of Recurrent Neural Networks, which has been successfully used in many applications (Jaeger, 2001) such as speech recognition (Skowronski & Harris, 2007; Verstraeten et al., 2007), robot control (Antonelo et al., 2007) and time-series generation (Jaeger, 2001). To accomplish this, Reservoir Computing uses an untrained dynamic system (the reservoir); the desired function is implemented by a linear, memory-less mapping from the full instantaneous state of the dynamical system to the desired output, and this mapping can be trained using linear regression techniques such as ridge regression (Wyffels et al., 2008a). A schematic overview is given in Figure 1.

[Figure omitted: schematic showing K input nodes, a reservoir with N state nodes and L output nodes; dotted lines denote trained interconnections, solid lines random but fixed interconnections.]

Figure 1. Schematic overview of the Reservoir Computing technique.

In this work we will use Reservoir Computing to model the behavior of a nonlinear dynamical system with variable dead-time.
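To make the general idea concrete, a minimal echo-state-style reservoir with a ridge regression readout could be set up as follows. This is only a sketch: the reservoir size, input scaling, spectral radius and regularisation constant are arbitrary assumptions, and plain tanh neurons are used instead of the band-pass neurons employed later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random, fixed reservoir; only the linear readout is trained.
N, K = 400, 1                                            # state nodes and input nodes (sizes assumed)
W_res = rng.normal(size=(N, N))
W_res *= 0.95 / np.abs(np.linalg.eigvals(W_res)).max()  # spectral radius below 1 (near-stable)
W_in = rng.uniform(-0.1, 0.1, size=(N, K))

def run_reservoir(inputs):
    """Collect the reservoir states driven by an input sequence of shape (T, K)."""
    x, states = np.zeros(N), []
    for u in inputs:
        x = np.tanh(W_res @ x + W_in @ u)                # plain tanh neurons in this sketch
        states.append(x.copy())
    return np.array(states)

def train_readout(states, targets, ridge=1e-6):
    """Memory-less linear readout fitted with ridge regression."""
    return np.linalg.solve(states.T @ states + ridge * np.eye(N), states.T @ targets)

# Usage sketch: W_out = train_readout(run_reservoir(u_train), y_train)
#               y_pred = run_reservoir(u_test) @ W_out
```

Only the readout weights are learned; the reservoir and input weights stay fixed, which is what makes training fast.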

2. Experimental setup

The task at hand is the modeling of a heating tank with a variable cold water inlet and a hot water outlet. The heating element of the tank has a constant power; thus, the outlet temperature is controlled by varying the cold water flow. Because the temperature of the outlet flow is measured after flowing through a long, small pipe, the system has a variable dead-time, which adds an extra difficulty to the modeling task. A full description of the plant can be found in (Cristea et al., 2005).

In contrast to most modeling techniques, we make no assumptions about the plant, nor do we split the plant up into different parts. We model both the tank and the outlet pipe by using only one reservoir consisting of 400 randomly connected band-pass neurons (see (Wyffels et al., 2008b) for an introduction). The spectral radius was tuned to give the reservoir near-stable behavior. In order to give the reservoir more nonlinear properties, each neuron receives an auxiliary input with a constant bias. Because of the variable dead-time, we needed to increase the fading memory of the reservoir by adding feedback from the output to all the neurons. The readout function was trained using 10,000 samples of random input-output examples extracted by simulation. Next, the reservoir was left predicting 3,500 samples based on its input; the first 1,000 samples were discarded as warm-up to eliminate transient effects.
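A rough, self-contained sketch of this prediction protocol (again with plain tanh neurons rather than band-pass neurons, and with assumed weight scalings) could look like this; the 1,000-sample warm-up mirrors the description above, while everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 400, 1                                            # reservoir and input sizes
W_res = rng.normal(size=(N, N))
W_res *= 0.95 / np.abs(np.linalg.eigvals(W_res)).max()  # near-stable spectral radius
W_in = rng.uniform(-0.1, 0.1, size=(N, K))
W_fb = rng.uniform(-0.1, 0.1, size=(N, 1))               # output feedback to every neuron
bias = rng.uniform(-0.1, 0.1, size=N)                    # constant auxiliary bias input

def free_run(inputs, W_out, warmup=1000):
    """Drive the reservoir with the input and its own fed-back prediction; drop the transient."""
    x, y = np.zeros(N), np.zeros(1)
    predictions = []
    for u in inputs:                                     # inputs: array of shape (T, K)
        x = np.tanh(W_res @ x + W_in @ u + W_fb @ y + bias)
        y = np.atleast_1d(x @ W_out)
        predictions.append(y[0])
    return np.array(predictions)[warmup:]                # first `warmup` samples discarded
```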

[Figure omitted: outlet temperature (°C) versus time (samples).]

Figure 2. Validation of the model: real outlet temperature (gray solid line), predicted outlet temperature (dashed line).

3. Results

In Figure 2, a comparison of the desired outlet temperature and the predicted outlet temperature is given. Using the previously described reservoir configuration, the outlet temperature was predicted in response to a variable input flow which was not seen by the reservoir during training. One can see that the reservoir is able to give accurate predictions of the outlet temperature.

4. Conclusions

Many techniques for system modeling have been proposed in previous work, but most of them require substantial experience in the application domain or are difficult to train. In this work we showed that by using Reservoir Computing one is able to model a nonlinear system with variable dead-time based on input-output recordings of the plant. No knowledge of the application domain was needed. For future work, we wish to investigate the use of Reservoir Computing as a controller in control engineering tasks.

Acknowledgments

This work was partially funded by FWO Flanders project G.0317.05 and the Photonics@be Interuniversity Attraction Poles program (IAP 6/10), initiated by the Belgian State, Prime Minister's Services, Science Policy Office.

References

Antonelo, E. A., Schrauwen, B., & Campenhout, J. V. (2007). Generative modeling of autonomous robots and their environments using reservoir computing. Neural Processing Letters, 26, 233–249.

Camacho, E., Rubio, F., Berenguel, M., & Valenzuela, L. (2007). A survey on control schemes for distributed solar collector fields. Part I: Modeling and basic control approaches. Solar Energy, 81, 1240–1251.

Cristea, S., de Prada, C., & De Keyser, R. (2005). Predictive control of a process with variable dead-time. CD-Proceedings of the 16th IFAC World Congress.

Jaeger, H. (2001). The "echo state" approach to analysing and training recurrent neural networks (Technical Report GMD Report 148). German National Research Center for Information Technology.

Skowronski, M. D., & Harris, J. G. (2007). 2007 Special Issue: Automatic speech recognition using a predictive echo state network classifier. Neural Networks, 20, 414–423.

Verstraeten, D., Schrauwen, B., D'Haene, M., & Stroobandt, D. (2007). An experimental unification of reservoir computing methods. Neural Networks, 20, 391–403.

Wyffels, F., Schrauwen, B., & Stroobandt, D. (2008a). Regularization methods for reservoir computing. Proceedings of the International Conference on Artificial Neural Networks (ICANN). (Submitted).

Wyffels, F., Schrauwen, B., Verstraeten, D., & Stroobandt, D. (2008b). Band-pass reservoir computing. Proceedings of the International Joint Conference on Neural Networks.


Index

Abeel, 45, 77
Ammar, 31
Auvray, 29

Bebis, 33
Bergsneider, 87
Blockeel, 63
Bontempi, 95
Bosch, 81
Botta, 83
Boullart, 57
Bruynooghe, 15

Callut, 67
Cornelusse, 93
Croonenborghs, 15, 99

d'Alche, 61
De Baets, 57, 59, 79
De Grave, 55
De Hauwere, 73
De Meyer, 59
De Raedt, 23, 25, 27, 55
De Vleeschouwer, 35
de Weerdt, 75
Defourny, 17, 31
Detry, 65
Dhaene, 53
Dricot, 95
Driessens, 15, 51, 99
Drugman, 43
Dumont, 41
Dupont, 67

Ernst, 17, 19

Furnkranz, 12
Florent Gemmeke, 71
Fonteneau, 19
Francoisse, 67

Geurts, 37, 41, 49, 61, 83
Goethals, 91
Goetschalckx, 51
Gorissen, 53
Gutmann, 25

Hansoul, 83
Hu, 87
Hunt, 47
Huynh-Thu, 49

Kersting, 25

Kimmig, 25

Laermans, 53
Landwehr, 27
Le Borgne, 95
Lemeire, 21
Lendvai, 47
Leray, 31
Loss, 33

Mantrach, 69
Maree, 37, 41
Meessen, 35
Meyer, 85
Murphy, 11

Nicolescu, 33
Nowe, 73

Piater, 65
Piccart, 63
Ponsen, 99

Ratsch, 12
Rademaker, 59
Ramon, 55, 99

Saerens, 67, 69
Saeys, 45, 77, 79
Sanner, 51
Scalzo, 33, 65, 87
Schrauwen, 103
Sepulchre, 85
Simon, 35
Smets, 91
Stehouwer, 39
Stroobandt, 103
Struyf, 63

ter Braak, 81
Thon, 27
Triggs, 11
Tuyls, 99

Van de Cruys, 89
Van de Peer, 45, 77, 79
van den Bosch, 47
Van der Haegen, 95
van der Krol, 81
van Dijk, 81
van Erp, 47
van Ham, 81
Van Landeghem, 79
Verdonk, 91
Verwer, 75
Vignal, 93
Vrancx, 73

Waegeman, 57
Wehenkel, 17, 19, 29, 31, 37, 41, 49, 61, 83, 93
Witteveen, 75
wyffels, 103

Xu, 87

Yen, 69
