Machine Learning: Theory, Applications, Experiences WiML 2007 October 17, 2007 Royal Plaza Hotel Orlando, Florida www.wimlworkshop.org




Schedule

09:00 Registration, poster set-up, and continental breakfast

09:30 Welcome

09:45 Invited Talk: Machine Learning in Space Kiri L. Wagstaff, N.A.S.A.

10:15 A general agnostic active learning algorithm Claire Monteleoni, UC San Diego

10:35 Bayesian Nonparametric Regression with Local Models Jo-Anne Ting, University of Southern California

10:55 Coffee Break

11:15 Invited Talk: Applying machine learning to a real-world problem: real-time ranking of electric components Marta Arias, Columbia University

11:45 Generating Summary Keywords for Emails Using Topics. Hanna Wallach, University of Cambridge

12:05 Continuous-State POMDPs with Hybrid Dynamics Emma Brunskill, MIT

12:25 Spotlights

12:45 Lunch

14:20 Invited Talk: Randomized Approaches to Preserving Privacy Nina Mishra, University of Virginia

14:50 Clustering Social Networks Isabelle Stanton, University of Virginia

15:10 Coffee Break

15:30 Invited Talk: Applications of Machine Learning to Image Retrieval Sally Goldman, Washington University

16:00 Improvement in Performance of Learning Using Scaling Soumi Ray, University of Maryland Baltimore County

16:20 Poster Session

17:10 Panel/Open Discussion

17:40 Concluding Remarks


Invited Talks

Machine Learning in Space
Kiri L. Wagstaff, N.A.S.A.

Remote space environments simultaneously present significant challenges to the machine learning community and enormous opportunities for advancement. In this talk, I present recent work on three key issues associated with machine learning in space: on-board data classification and regression, on-board prioritization of analysis results, and reliable computing in high-radiation environments. Support vector machines are currently being used on-board the EO-1 Earth orbiter, and they are poised for adoption by the Mars Odyssey orbiter as well. We have developed techniques for learning scientist preferences for which subset of images is most critical for transmission, so that we can make the most use of limited bandwidth. Finally, we have developed fault-tolerant SVMs that can detect and recover from radiation-induced errors while performing on-board data analysis.

About the speaker: Kiri L. Wagstaff is a senior researcher at the Jet Propulsion Laboratory in Pasadena, CA. She is a member of the Machine Learning and Instrument Autonomy group, and her focus is on developing new machine learning methods that can be used for data analysis on-board spacecraft. She has applied these techniques to data being collected by the EO-1 Earth-orbiting spacecraft, Mars Odyssey, and Mars Pathfinder. She has also worked on crop yield prediction from orbital remote sensing observations, the fault protection system for the MESSENGER mission to Mercury, and automatic code generation for the Electra radio used by the Mars Reconnaissance Orbiter and the Mars Science Laboratory. She is very interested in issues such as robustness (developing fault-tolerant machine learning methods for high-radiation environments) and infusion (how can machine learning be used to advance science?). She holds a Ph.D. in Computer Science from Cornell University and is currently working on an M.S. in Geology from the University of Southern California.


Applying machine learning to a real-world problem: real-time ranking of electric components
Marta Arias, Columbia University

In this talk, I will describe our experience with applying machine learning techniques to a concrete real-world problem: the generation of rankings of electric components according to their susceptibility to failure. The system's goal is to aid operators in the replacement strategy for the most at-risk components and in handling emergency situations. In particular, I will address the challenge of dealing with the concept drift inherent in the electrical system and will describe our solution based on a simple weighted-majority voting scheme.
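The abstract does not spell out the voting scheme, so the following is only a minimal sketch of a generic weighted-majority rule for binary votes, not the ranking system described in the talk. The function names, the 0/1 labels, and the penalty factor `beta` are assumptions made for illustration. The idea it shows is that experts (for example, models trained on older data) lose weight each time they are wrong, which is one simple way to follow a drifting concept.

```python
def weighted_majority_predict(weights, votes):
    """Return the label (0 or 1) backed by the larger total weight; ties go to 1."""
    weight_for_one = sum(w for w, v in zip(weights, votes) if v == 1)
    weight_for_zero = sum(w for w, v in zip(weights, votes) if v == 0)
    return 1 if weight_for_one >= weight_for_zero else 0


def update_weights(weights, votes, truth, beta=0.5):
    """Multiply the weight of every expert that voted wrongly by beta (assumed 0 < beta < 1)."""
    return [w * beta if v != truth else w for w, v in zip(weights, votes)]


# Three hypothetical experts, initially trusted equally.
weights = [1.0, 1.0, 1.0]
votes = [1, 0, 0]
prediction = weighted_majority_predict(weights, votes)  # majority says 0
weights = update_weights(weights, votes, truth=1)        # the two wrong experts are halved
```

After one mistake the two wrong experts together carry the same weight as the single correct one, so a repeated vote would no longer be decided by simple head count.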

About the speaker: Marta Arias received her bachelor's degree in Computer Science from the Polytechnic University of Catalunya (Barcelona, Spain) in 1998. After that she worked for a year at Incyta S.A. (Barcelona, Spain), a company specializing in software products for Natural Language Processing applications. She then enrolled in the graduate student program at Tufts University, receiving her PhD in Computer Science in 2004. That same year she joined the Center for Computational Learning Systems of Columbia University as an Associate Research Scientist. Dr. Arias' research interests include the theory and application of machine learning.


Randomized Approaches to Preserving Privacy
Nina Mishra, University of Virginia, Microsoft Research

The Internet is arguably one of the most important inventions of the last century. It has altered the very nature of our lives -- the way we communicate, work, shop, vote, recreate, etc. The impact has been phenomenal for the machine learning community since both old and newly created information repositories, such as medical records and web click streams, are readily available and waiting to be mined. However, opposite these capabilities and advances is the basic right to privacy: On the one hand, in order to best serve and protect its citizens, the government should ideally have access to every available bit of societal information. On the other hand, privacy is a fundamental right and human need, which theoretically is served best when the government knows nothing about the personal lives of its citizens. This raises the natural question of whether it is even possible to simultaneously realize both of these diametrically opposed goals, namely, information transparency and individual privacy. Surprisingly, the answer is yes and I will describe solutions where individuals randomly perturb and publish their data so as to preserve their own privacy and yet large-scale information can still be learned. Joint work with Mark Sandler.
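As an illustration of how individually perturbed data can still reveal aggregate information, here is a minimal sketch of classic randomized response for a single yes/no attribute. It is a textbook technique chosen for illustration and is not claimed to be the scheme presented in the talk. The function names, the truth probability `p`, and the simulated population are all assumptions.

```python
import random


def perturb(bit, p, rng):
    """Report the true bit with probability p, otherwise report the flipped bit."""
    return bit if rng.random() < p else 1 - bit


def estimate_fraction(reports, p):
    """Unbiased estimate of the true fraction of 1s from perturbed reports.

    E[observed] = pi * p + (1 - pi) * (1 - p), solved for pi. Requires p != 0.5.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)


# Hypothetical population: 30% of 20,000 individuals hold the sensitive attribute.
rng = random.Random(0)
true_bits = [1] * 6000 + [0] * 14000
reports = [perturb(b, 0.75, rng) for b in true_bits]
estimate = estimate_fraction(reports, 0.75)
```

No single report is reliable evidence about its author, since a quarter of the answers are flipped, yet the population-level estimate concentrates around the true fraction as the number of reports grows.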

About the speaker: Nina Mishra is an Associate Professor in the Computer Science Department at the University of Virginia. Her research interests are in data mining and machine learning algorithms as well as privacy. She previously held joint appointments as a Senior Research Scientist at HP Labs, and as an Acting Faculty member at Stanford University. She was Program Chair of the International Conference on Machine Learning in 2003 and has served on numerous data mining and machine learning program committees. She also serves on the editorial boards of Machine Learning, IEEE Transactions on Knowledge and Data Engineering, IEEE Intelligent Systems and the Journal of Privacy and Confidentiality. She is currently on leave in Search Labs at Microsoft Research. She received a PhD in Computer Science from UIUC.


Applications of Machine Learning to Image Retrieval
Sally Goldman, Washington University

Classic Content-Based Image Retrieval (CBIR) takes a single non-annotated query image, and retrieves similar images from an image repository. Such a search must rely upon a holistic (or global) view of the image. Yet often the desired content of an image is not holistic, but is localized. Specifically, we define Localized Content-Based Image Retrieval as a CBIR task where the user is only interested in a portion of the image, and the rest of the image is irrelevant. We discuss our localized CBIR system, Accio!, that uses labeled images in conjunction with a multiple-instance learning algorithm to first identify the desired object and re-weight the features, and then to rank images in the database using a similarity measure that is based upon individual regions within the image. We will discuss both the image representation and multiple-instance learning algorithm that we have used in the localized CBIR systems that we have developed. We also look briefly at ways in which multiple-instance learning can be applied to knowledge-based image segmentation.

About the speaker: Dr. Sally Goldman is the Edwin H. Murty Professor of Engineering at Washington University in St. Louis and the Associate Chair of the Department of Computer Science and Engineering. She received a Bachelor of Science in Computer Science from Brown University in December 1984. Under the guidance of Dr. Ronald Rivest at the Massachusetts Institute of Technology, Dr. Goldman completed her Master of Science in Electrical Engineering and Computer Science in May 1987 and her Ph.D. in July 1990. Dr. Goldman's research is in the area of algorithm design and analysis and machine learning with a recent focus on applications to the area of content-based image retrieval. Dr. Goldman has received many teaching awards and honors including the Emerson Electric Company Excellence in Teaching Award in 1999, and the Governor's Award for Excellence in Teaching in 2001. Dr. Goldman and her husband, Dr. Ken Goldman, have just completed a book titled, A Practical Guide to Data Structures and Algorithms using Java.


Talks

A General Agnostic Active Learning Algorithm
Claire Monteleoni, UC San Diego

We present a simple, agnostic active learning algorithm that works for any hypothesis class of bounded VC dimension, and any data distribution. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends a scheme due to Cohn, Atlas, and Ladner to the agnostic setting (i.e. arbitrary noise), by (1) reformulating it using a reduction to supervised learning and (2) showing how to apply generalization bounds even for the non-i.i.d. samples that result from selective sampling. We provide a general characterization of the label complexity of our algorithm. This quantity is never more than the usual PAC sample complexity of supervised learning, and is exponentially smaller for some hypothesis classes and distributions. We also demonstrate improvements experimentally.

This is joint work with Sanjoy Dasgupta and Daniel Hsu. Currently in submission, but for a full version, please see UCSD tech report: http://www.cse.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2007-0898

Bayesian Nonparametric Regression with Local Models
Jo-Anne Ting, University of Southern California

We propose a Bayesian nonparametric regression algorithm with locally linear models for high-dimensional, data-rich scenarios where real-time, incremental learning is necessary. Nonlinear function approximation with high-dimensional input data is a nontrivial problem. An application example is a high-dimensional movement system like a humanoid robot, where real-time learning of internal models for compliant control may be needed. Fortunately, many real-world data sets tend to have locally low dimensional distributions, despite having high dimensional embedding (e.g., Tenenbaum et al. 2000, Roweis & Saul, 2000). A successful algorithm, thus, must avoid numerical problems arising potentially from redundancy in the input data, eliminate irrelevant input dimensions, and be computationally efficient to allow for incremental, online learning.

Several methods have been proposed for nonlinear function approximation, such as Gaussian process regression (Williams & Rasmussen, 1996), support vector regression (Smola & Schölkopf, 1998) and variational Bayesian mixture models (Ghahramani & Beal, 2000). However, these global methods tend to be unsuitable for fast, incremental function approximation. Atkeson, Moore & Schaal (1997) have shown that in such scenarios, learning with spatially localized models is more appropriate, particularly in the framework of locally weighted learning.


In recent years, Vijayakumar & Schaal (2000) have introduced a learning algorithm designed to fulfill the fast, incremental requirements of locally weighted learning, specifically targeting high-dimensional input domains through the use of local projections. This algorithm, called Locally Weighted Projection Regression (LWPR), performs competitively in its generalization performance with state-of-the-art batch regression methods. It has been applied successfully to sensorimotor learning on a humanoid robot for the purpose of executing fast, accurate movements in a feedforward controller.

The major issue with LWPR is that it requires gradient descent (with leave-one-out cross-validation) to optimize the local distance metrics in each local regression model. Since gradient descent search is sensitive to the initial values, we propose a novel Bayesian treatment of locally weighted regression with locally linear models that eliminates the need for any manual tuning of meta parameters, cross-validation approaches or sampling. Combined with variational approximation methods to allow for fast, tractable inference, this Bayesian algorithm learns the optimal distance metric value for each local regression model. It is able to automatically determine the size of the neighborhood data (i.e., the "bandwidth") that should contribute to each local model. A Bayesian approach offers error bounds on the distance metrics and incorporates this uncertainty in the predictive distributions. By being able to automatically detect relevant input dimensions, our algorithm is able to handle high-dimensional data sets with a large number of redundant and/or irrelevant input dimensions and a large number of data samples. We demonstrate competitive performance of our Bayesian locally weighted regression algorithm with Gaussian Process regression and LWPR on standard benchmark sets. We also explore extensions of this locally linear Bayesian algorithm to a real-time setting, to offer a parameter-free alternative for incremental learning in high-dimensional spaces.
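To make the role of the bandwidth concrete, here is a minimal sketch of plain (non-Bayesian) locally weighted linear regression for one-dimensional inputs, with a fixed, hand-chosen bandwidth. It is not the algorithm proposed in the abstract, which learns the distance metric automatically. The Gaussian weighting kernel, the function name, and the toy data are assumptions.

```python
import math


def lwr_predict(xs, ys, query, bandwidth):
    """Fit a weighted straight line around `query` and evaluate it there.

    Each training point is weighted by a Gaussian kernel of width `bandwidth`
    centred on the query, so nearby points dominate the local fit.
    Assumes at least one weight is non-zero (bandwidth not vanishingly small).
    """
    w = [math.exp(-0.5 * ((x - query) / bandwidth) ** 2) for x in xs]
    total = sum(w)
    mean_x = sum(wi * x for wi, x in zip(w, xs)) / total
    mean_y = sum(wi * y for wi, y in zip(w, ys)) / total
    sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (query - mean_x)


# Toy data on the line y = 2x + 1; a local linear fit should recover it.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
prediction = lwr_predict(xs, ys, query=2.5, bandwidth=1.0)
```

A small bandwidth makes the fit follow local structure closely but leaves few effective data points per model, while a large one approaches ordinary global linear regression. Choosing it by hand is the tuning step the abstract aims to remove.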

Generating Summary Keywords for Emails Using Topics.
Hanna Wallach, University of Cambridge

Email summary keywords, used to concisely represent the gist of an email, can help users manage and prioritize large numbers of messages. Previous work on email keyword selection has focused on a two-stage supervised learning system that selects nouns from individual emails using pre-defined linguistic rules [1]. In this work we present an unsupervised learning framework for selecting email summary keywords. A good summary keyword for an email message is not best characterized as a word that is unique to that message, but a word that relates the message to other topically similar messages. We therefore use latent representations of the underlying topics in a user's mailbox to find words that describe each message in the context of existing topics rather than selecting keywords based on a single message in isolation. We present and compare several methods for selecting email summary keywords, based on two well-known models for inferring latent topics: latent semantic analysis (LSA) and latent Dirichlet allocation (LDA).

Summary keywords for an email message are generated by selecting the words that are most topically similar to the words in the email. We use two approaches for selecting these words, one based on query-document similarity, and the other based on word association. Each approach may be used in conjunction with either LSA or LDA. We evaluate keyword quality by generating summaries for emails from twelve users in the Enron corpus and comparing each method's performance with a TF-IDF baseline. The quality of the keywords is assessed using two proxy tasks, in which the summaries are used in place of whole messages: recipient prediction and foldering. In the recipient prediction task, the keywords for each email are used to predict the intended recipients of the current message. In the foldering task, each user's email messages are sorted into folders using the selected keywords as features. Our topic-based methods outperform TF-IDF on both tasks, demonstrating that topic-based methods yield better summary keywords. By selecting keywords based on user-specific topics, we find summaries that represent each message in the context of the entire mailbox, not just that of a single message. Furthermore, combining the summary for an email with the email's subject improves foldering and recipient prediction results over those obtained using either summaries or subjects alone.

References:
[1] S. Muresan, E. Tzoukermann, and J. Klavans (2001). Combining linguistic and machine learning techniques for email summarization. CoNLL.

Continuous-State POMDPs with Hybrid Dynamics
Emma Brunskill, MIT

Partially observable Markov decision processes (POMDPs) provide a rich framework for describing many important planning problems that arise in situations with hidden state and stochastic actions. Most previous work has focused on solving POMDPs with discrete state, action and observation spaces. However, in a number of applications, such as navigation or robotic grasping, the world is most naturally represented using continuous states. Though any continuous domain can be described using a sufficiently fine grid, the number of discrete states grows exponentially with the dimensionality of the underlying state space. Existing discrete state POMDP algorithms can only scale up to the order of a few thousand states, beyond which they become computationally infeasible. Therefore, approaches for dealing efficiently with continuous-state POMDPs are of great interest.

Previous work (such as [1]) on planning for continuous-state POMDPs has typically modeled the world dynamics using a single linear Gaussian model to describe the effects of an action. Unfortunately, this model is not powerful enough to represent the multi-modal state-dependent dynamics that arise in a number of problems of interest. For example, in legged locomotion the different "modes" of walking and running are described best by significantly different dynamics. We instead employ a hybrid dynamics model for continuous-state POMDPs that can represent stochastic state-dependent distributions over a number of different linear dynamic models. We developed a new point-based approximation algorithm for solving these hybrid-dynamics POMDP planning problems that builds on Porta et al.'s continuous-state point-based approach [1]. One nice attribute of our algorithm is that by representing the value function and belief states using a weighted sum of Gaussians, the belief state updates and value function backups can be computed in closed form. An additional contribution of our work is a new procedure for constructing a better approximation of the alpha functions composing the value function. We conducted experiments on a set of small problems to illustrate how the representational power of the hybrid dynamics model allows us to address problems not previously solvable by existing continuous-state approaches. In addition, we examined the toy problem of a simulated robot searching blindly (no observations) for a power supply in a long hallway. This problem requires a variable level of representational granularity in order to perform well. Here our hybrid continuous-state planner outperforms a discrete state POMDP planner, demonstrating the potential of continuous-state approaches.

References:
[1] J. Porta, M. Spaan, N. Vlassis, and P. Poupart. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7:2329-2367, 2006.

Clustering Social Networks
Isabelle Stanton, University of Virginia

Social networks have gained popularity recently with the advent of sites such as MySpace, Friendster, Facebook, etc. The number of users participating in these networks is large, e.g., a hundred million in MySpace, and growing. These networks are a rich source of data as users populate their sites with personal information. Of particular interest in this paper is the graph structure induced by the friendship links.

A fundamental problem related to these networks is the discovery of clusters or communities. Intuitively, a cluster is a collection of individuals with dense friendship patterns internally and sparse friendships externally. There are many reasons to seek tightly-knit communities in networks; for instance, targeted marketing schemes can be designed based on clusters, and terrorist cells can be uncovered.

Existing clustering criteria are limited in that clusters typically do not overlap, all vertices are clustered and/or external sparsity is ignored. We introduce a new criterion that overcomes these limitations by combining internal density with external sparsity in a natural way. Our criterion does not require a strict partitioning of the data, which is particularly important in social networks, where one user may be a member of many communities.

This work focuses on the combinatorial properties of the new criterion. In particular, we bound the amount that clusters can overlap, as well as find a loose bound for the number of clusters in a graph. From these properties we have developed deterministic and randomized algorithms for provably finding the clusters, provided there is a sufficiently large gap between internal density and external sparsity. Finally, we perform experiments on real social networks that illustrate the effectiveness of the algorithm.

Improvement in Performance of Learning Using Scaling
Soumi Ray, University of Maryland Baltimore County

Reinforcement learning often requires many training iterations to get an optimal policy. We are interested in trying to speed up learning in a domain using scaling, which works as follows: partial learning is performed to learn a sub-optimal action value function, Q, in the domain using standard Q-learning for a few iterations. The Q-values of Q are then multiplied by a constant factor to scale the Q-values. Then learning continues using the scaled Q-values of the new Q-table as the initial values. Surprisingly, in many situations this scaling significantly reduces the number of iterations required to learn compared to learning without scaling.

We can summarize our method of scaling in the following steps:
1. Partial learning is done in the domain.
2. The Q-values of the partially learned domain are scaled, using a scaling factor decided manually.
3. Finally, learning in the domain is carried out using the new scaled Q-values.

This method can reduce the number of steps required to learn in the domain compared to learning without scaling. Two important aspects of scaling are the scaling factor and the time of scaling. If the scaling factor and the time of scaling are chosen correctly then we can get great improvements in the performance of learning in a domain. We have used 10×10 grid world domains with the starting position at the top left corner and the goal at the bottom right corner to run our experiments.
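The three steps can be sketched with tabular Q-learning on a 10×10 grid world with the start at the top left and the goal at the bottom right, as in the abstract. Everything else here is an assumption made for illustration: the reward of 1 at the goal and 0 elsewhere, the learning rate, discount and exploration rate, the episode counts, and the scaling factor of 2.0. The sketch shows the procedure only and makes no claim about the resulting speed-up.

```python
import random

SIZE = 10
START, GOAL = (0, 0), (SIZE - 1, SIZE - 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Move within the grid; reward 1 on reaching the goal, 0 otherwise (assumed)."""
    row = min(max(state[0] + action[0], 0), SIZE - 1)
    col = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (row, col)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def q_learning(q, episodes, rng, alpha=0.5, gamma=0.9, epsilon=0.1, max_steps=500):
    """Standard epsilon-greedy tabular Q-learning; updates q in place and returns it."""
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                best = max(q[state])
                a = rng.choice([i for i, v in enumerate(q[state]) if v == best])
            nxt, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
            if done:
                break
    return q


def scale_q(q, factor):
    """Return a new Q-table with every Q-value multiplied by a constant factor."""
    return {s: [factor * v for v in values] for s, values in q.items()}


rng = random.Random(0)
q = {(r, c): [0.0] * len(ACTIONS) for r in range(SIZE) for c in range(SIZE)}
q = q_learning(q, episodes=20, rng=rng)         # step 1: partial learning
q = scale_q(q, factor=2.0)                      # step 2: scale by a hand-chosen factor
q_final = q_learning(q, episodes=200, rng=rng)  # step 3: continue from the scaled table
```

The number of episodes before `scale_q` is called corresponds to the "time of scaling", and `factor` to the scaling factor, the two quantities the abstract identifies as decisive.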

A Theory of Similarity Functions for Clustering
Maria-Florina Balcan, Carnegie Mellon University

Problems of clustering data from pairwise similarity information are ubiquitous in Computer Science. Theoretical treatments typically view the similarity information as ground-truth and then design algorithms to (approximately) optimize various graph-based objective functions. However, in most applications, this similarity information is merely based on some heuristic: the true goal is to cluster the points correctly rather than to optimize any specific graph property. In this work, we initiate a theoretical study of the design of similarity functions for clustering from this perspective. In particular, motivated by recent work in learning theory that asks "what natural properties of a similarity function are sufficient to be able to learn well?" we ask "what natural properties of a similarity function are sufficient to be able to cluster well?"

We develop a notion of the clustering complexity of a given property (analogous to notions of capacity in learning theory), that characterizes its information-theoretic usefulness for clustering. We then analyze this complexity for several natural game-theoretic and learning-theoretic properties, as well as design efficient algorithms that are able to take advantage of them. We consider two natural clustering objectives: (a) list clustering: analogous to the notion of list-decoding, the algorithm can produce a small list of clusterings (which a user can select from) and (b) hierarchical clustering: the algorithm can produce a tree such that the desired clustering is some pruning of this tree (which a user could navigate). Our algorithms for hierarchical clustering combine recent learning-theoretic approaches with linkage-style methods.

This is joint work with Avrim Blum and Santosh Vempala.


Spotlights

Advancing Associative Classifiers - Challenges and Solutions
Luiza Antonie, University of Alberta

In the past years, associative classifiers, classifiers that use association rules, have started to attract attention. An important advantage that these classification systems bring is that, using association rule mining, they are able to examine several features at a time, while other state-of-the-art methods, like decision trees or naive Bayesian classifiers, consider each feature to be independent of the others. However, in real-life applications, the independence assumption is not necessarily true, and it was shown that correlations and co-occurrence of features can be very important. In addition, the associative classifiers can handle a large number of features, while other classification systems do not work well for high dimensional data. The associative classification systems proved to perform as well as, or even better than, other techniques in the literature. The associative classifiers are models that can be read, understood, and modified by humans and thus can be manually enriched with domain knowledge.

We have proposed the integration of new types of association rules and new methods to reduce the number of rules in the model. In our research work we studied the behaviour of associative classifiers when negative association rules, maximal and closed itemsets are employed. These types of association rules have not been used in associative classifiers before, thus bringing new challenges and opportunities to our work. Given that one advantage of the classifiers based on association rules is their readability, another direction that we investigated is reducing the number of association rules used in the classification model. Pruning of rules not only improves readability, but it may minimize overfitting of the model as well. Another challenge is the use of rules in the classification stage. We proposed a new technique where the system automatically learns how to use the rules.

Many applications can benefit from a good classification model. Given the readability of the associative classifiers, they are especially suited to applications where the model may assist domain experts in their decisions. The medical field is a good example where such applications may appear. Let us consider an example where a physician has to examine a patient. There is a considerable amount of information associated with the patient (e.g. personal data, medical tests, etc.). A classification system can assist the physician in this process. The system can predict if the patient is likely to have a certain disease or present incompatibility with some treatments. Considering the output of the classification model, the physician can make a better decision on the treatment to be applied to this patient. Given the transparency of our model, a health practitioner can understand how the classification model reached its decision.


Real-life applications are usually characterized by unbalanced datasets. Classes of interest may be under-represented, making the discovery of knowledge associated with them harder. We evaluated the performance of our system under these difficult conditions. We studied the performance of our classification model on real-life applications (mammography classification, text categorization, preterm birth prediction) where the classes of interest are typically under-represented.

This is joint work with my supervisors, Osmar R. Zaiane and Robert C. Holte.

Learning to Predict Prices in a Supply Chain Management Game
Shuo Chen, UC Berkeley

Economic decisions can benefit greatly from accurate predictions of market prices, but making such predictions is a difficult problem and an area of active research. In this paper, we present and compare several techniques for predicting market prices that we have employed in the Trading Agent Competition Supply Chain Management (TAC SCM) Prediction Challenge. These strategies include simple heuristics and various machine learning approaches, such as simple perceptrons and support vector regression. We show that the heuristic methods are very good, especially for predicting current prices, but that the machine learning techniques may be more appropriate for future price predictions.

Sonar Terrain Mapping with BDI Agents
Shivali Gupta, University of Maryland, Baltimore County

Mapping a constantly changing environment is a challenge that necessitates a team of agents working together. These agents must continually explore the terrain and assemble the map in a distributed fashion. In a real-world instance of this problem, such as a surveillance problem, agents have limited sensor and communication ranges, further compounding the difficulty.

Our solution is to create multiple “Explorer” agents and a centralized “Base station” agent using the BDI architecture. The BDI architecture provides a framework for agents that have their individual beliefs, desires and intentions (goals). The environment is rife with uncertainty given its continually changing nature, which makes the BDI architecture well suited to this problem. Mobile Explorer agents have limited range of communication and partial observability of the environment. The Base station agent is stable and it maintains the global map of the environment from the information of the Explorer agents. Explorer agents use the Base station’s global map (its beliefs about the world) to decide which area to explore next, and after exploration they send their updated map to the Base station agent. The Base station agent merges its copy with the information received from the Explorer agent. The Explorer agents must stay within communication range of each other to maintain a complete communication network between all agents and the base station.

The system models the environment as a grid of cells and the Base station assigns each cell a “Curiosity level”, based on how long it has been since that region was explored. A higher curiosity level implies that the cell has not been explored recently. Therefore, the curiosity level drives exploration toward regions of uncertainty. Explorer agents calculate a force vector,


$$\text{force\_vector} = \sum_{\text{for\_every\_cell}} \text{distance\_based\_penalty} \times \text{curiosity\_value} \times \text{unit\_vector} \qquad (1)$$

where distance_based_penalty is the inverse of the Manhattan distance between the agent and the cell. The resulting vector gives the direction to explore. This calculation ensures that the agents do not all move in the same direction at the same time. One of the major advantages of this distributed approach is that the failure of a single agent does not affect the system as a whole: if an Explorer agent fails, the other agents can still continue to explore the environment.

The results show that more agents prevent the average curiosity level from rising at a fast pace, and that the average eventually stabilizes after a limited number of Explorer agents explore the map. Another result shows that a distance penalty based on the Manhattan distance provides a better solution than one based on the Euclidean distance: the former allows Explorer agents to explore the local area around them as well as the outer edges of the map, whereas the latter localizes the search procedure. In our future work, we are interested in adding a learning mechanism to the algorithm which would enable Explorer agents to predict the changing behavior of the environment and how to explore it optimally. Learning would also enable Explorer agents to avoid obstacles in their environment.
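The force-vector calculation in Equation (1) can be sketched in Python as follows. This is only an illustration: the grid representation (a dictionary from cell coordinates to curiosity values), the function name `force_vector`, the Euclidean normalization of the unit vector, and the choice to skip the agent's own cell are all assumptions, since the abstract does not specify them.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]


def force_vector(agent: Cell, curiosity: Dict[Cell, float]) -> Tuple[float, float]:
    """Sum over all cells of penalty * curiosity * unit vector toward the cell."""
    fx, fy = 0.0, 0.0
    ax, ay = agent
    for (cx, cy), level in curiosity.items():
        dx, dy = cx - ax, cy - ay
        manhattan = abs(dx) + abs(dy)
        if manhattan == 0:
            continue  # assumption: the agent's own cell contributes no direction
        penalty = 1.0 / manhattan          # inverse Manhattan distance
        norm = (dx * dx + dy * dy) ** 0.5  # assumption: Euclidean-normalized unit vector
        fx += penalty * level * dx / norm
        fy += penalty * level * dy / norm
    return fx, fy


# Hypothetical 3-cell map: a nearby, mildly stale cell and a farther, very stale one.
grid = {(0, 0): 0.0, (1, 0): 1.0, (0, 2): 4.0}
print(force_vector((0, 0), grid))  # (1.0, 2.0)
```

In this toy map the farther cell still dominates the direction because its curiosity value is four times larger while its penalty is only halved.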

Online Learning for Offroad Robots
Raia Hadsell, NYU

We present a learning-based solution to the problem of long-range obstacle detection in autonomous robots. The system uses sparse traversability information from a stereo module to train a classifier online. The trained classifier can then predict the traversability of the entire scene. This learning strategy is called self-supervised, near-to-far learning, and, if it is done in an online manner, it allows the robot to adapt to changing environments and still accurately predict the traversability of distant areas.

A distance-normalized image pyramid makes it possible to efficiently train on each frame seen by the robot, using large windows that contain contextual information as well as shape, color, and texture. Traversability labels are initially obtained for each target using a stereo module, then propagated to other views of the same target using temporal and spatial concurrences, thus training the classifier to be view-invariant. A ring buffer simulates short-term memory and ensures that the discriminative learning is balanced and consistent. This long-range obstacle detection system sees obstacles and paths at 30-40 meters, far beyond the maximum stereo range of 12 meters, and adapts very quickly to new environments.
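The role of the ring buffer in online training can be illustrated with a minimal Python sketch. The abstract does not describe the actual data structure or classifier, so everything here is an assumption: one fixed-size buffer per class (to keep the classes balanced), uniform sampling from the buffers, and a plain logistic-regression update standing in for the real classifier. The names `BalancedRingBuffer` and `sgd_step` are hypothetical.

```python
from collections import deque
import math
import random


class BalancedRingBuffer:
    """Fixed-size memory per class; the oldest samples are overwritten first."""

    def __init__(self, capacity_per_class: int):
        self.buffers = {0: deque(maxlen=capacity_per_class),
                        1: deque(maxlen=capacity_per_class)}

    def add(self, features, label: int) -> None:
        self.buffers[label].append(list(features))

    def sample(self, rng: random.Random):
        # Pick a class uniformly among the non-empty ones, then a stored sample.
        labels = [l for l, buf in self.buffers.items() if buf]
        label = rng.choice(labels)
        return rng.choice(list(self.buffers[label])), label


def sgd_step(weights, features, label, lr=0.1):
    """One logistic-regression update on a single (features, label) pair."""
    z = sum(w * x for w, x in zip(weights, features))
    pred = 1.0 / (1.0 + math.exp(-z))
    return [w + lr * (label - pred) * x for w, x in zip(weights, features)]


# Usage: stereo-labelled samples arrive frame by frame; training draws from memory.
memory = BalancedRingBuffer(capacity_per_class=3)
for i in range(5):
    memory.add([float(i)], label=1)   # only the 3 most recent are kept
memory.add([0.5], label=0)
rng = random.Random(0)
weights = [0.0]
for _ in range(10):
    x, y = memory.sample(rng)
    weights = sgd_step(weights, x, y)
```

Because each class has its own bounded buffer, a long stretch of frames dominated by one label cannot crowd the other label out of the training stream.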

Experiments were run on the LAGR (Learning Applied to Ground Robots) robot platform. Both the robot and the reference "baseline" software were built by Carnegie Mellon University and the National Robotics Engineering Center. In this program, in which all participants are constrained to use the given hardware, the goal is to drive from a given start to a predefined (GPS) goal position through unknown, offroad terrain using only passive vision. Both qualitative and quantitative results are given by comparing the field performance of the robot with and without learning-based, long-range vision enabled.

Posters

Untitled
Mair Allen-Williams, University of Southampton

Two particular challenges faced by agents within dynamic, uncertain multi-agent systems are learning and acting in uncertain environments, and coordination with other agents about whom they may have little or no knowledge. Although uncertainty and coordination have each been tackled as separate problems, existing formal models for an integrated approach make a number of simplifying assumptions, and often have few guarantees. In this report we explore the extension of a Bayesian learning model into partially observable multi-agent domains. In order to implement such a model practically we make use of a number of approximation techniques. In addition to traditional methods such as repair sampling and state clustering, we apply graphical inference methods within the learning step to propagate information through partially observable nodes. We demonstrate the scalability of this approach with an ambulance rescue problem inspired by the RoboCup Rescue system.

Supervised Learning by Training on Aggregate Outputs
Janara Christensen, Carleton College

Supervised learning is a classic data mining problem where one wishes to be able to predict an output value associated with a particular input vector. We present a new twist on this classic problem where, instead of having the training set contain an individual output value for each input vector, the output values in the training set are only given in aggregate over a number of input vectors. This new problem arose from a particular need in learning on mass spectrometry data, but could easily apply to situations where data has been aggregated in order to maintain privacy. We provide a formal description of this new problem for both classification and regression. We then examine how k-nearest neighbor, neural networks, and support vector machines can be adapted for this problem.
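One way to make the regression version of this setting concrete is the following least-squares objective. It is an illustrative formulation only, not necessarily the formal description given in the poster: the symbols $S_g$ (the set of input vectors in group $g$), $y_g$ (the aggregate output observed for that group, assumed here to be a sum), and $f_w$ (a parametric predictor) are assumptions introduced for this sketch.

```latex
\[
  \hat{w} \;=\; \arg\min_{w} \sum_{g=1}^{G}
  \Bigl( y_g - \sum_{i \in S_g} f_w(x_i) \Bigr)^{2}
\]
\[
  f_w(x) = w^{\top} x
  \;\Longrightarrow\;
  \sum_{i \in S_g} f_w(x_i) = w^{\top} \Bigl( \sum_{i \in S_g} x_i \Bigr)
\]
```

For a linear predictor the second line shows that the problem reduces to ordinary least squares on the group-summed inputs. Nonlinear learners such as those named in the abstract do not simplify this way, which is part of what makes the aggregate setting different from standard supervised learning.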

Disparate Data Fusion for Protein Phosphorylation Prediction
Genetha Gray, Sandia National Labs

New challenges in knowledge extraction include interpreting and classifying data sets while simultaneously considering related information to confirm results or identify false positives. We discuss a data fusion algorithmic framework targeted at this problem. It includes separate base classifiers for each data type and a fusion method for combining the individual classifiers. The fusion method is an extension of current ensemble classification techniques and has the advantage of allowing data to remain in heterogeneous databases. In this poster, we focus on the applicability of such a framework to the protein phosphorylation prediction problem and show some numerical results.

Real Boosting a la Carte with an Application to Boosting Oblique Decision Trees
Claudia Henry, Université des Antilles et de la Guyane

In the past ten years, boosting has become a major field of machine learning and classification. We contribute to both its theory and its algorithms. We first unify a well-known top-down decision tree induction algorithm due to Kearns and Mansour, and discrete AdaBoost, as two versions of the same higher-level boosting algorithm. It may be used as the basic building block to devise simple provable boosting algorithms for complex classifiers. We provide one example: the first boosting algorithm for Oblique Decision Trees, an algorithm which turns out to be simpler, faster and significantly more accurate than previous approaches.

Multimodal Integration for Multiparty Dialogue Understanding: A Machine Learning Framework
Pei-Yun Sabrina Hsueh, University of Edinburgh

Recent advances in recording and storage technologies have led to huge archives of multimedia conversational speech recordings in widely ranging areas, such as clinical use, online sharing services, and meeting analysis. While it is straightforward to replay such recordings, finding information in the often lengthy archives has become more difficult. It is therefore essential to provide sufficient aids to guide users through the recordings and to point out the most important events that need their attention. My research concerns how to infer human communicative intention from low-level audio and video signals. In particular, I focus on identifying multimodal integration patterns (e.g., people tend to speak more firmly and address the whole group more often when they are making decisions) in human conversations, using approaches ranging from statistical analysis and empirical study to machine learning.

Past research has shown that the identified multimodal integration patterns are useful for recognizing local speaker intention in recorded speech, such as speech disfluency (e.g., false starts). My research attempts to recover speaker intentions that serve a more global communicative goal, such as "initiate-discussion" and "reach-decision." A learning framework that can identify characteristic features of different semantic classes has been developed. This framework has proven useful for automatic topic segmentation (and labeling) and automatic decision detection. The ultimate goal of this research is to enhance the current browsing and search utilities of multimedia archives.

A POMDP for Automatic Software Customization
Bowen Hui, University of Toronto

Providing personalized software for individuals has the potential to increase work productivity and user satisfaction. In order to accommodate a wide variety of user needs, skills, and preferences, today's software is typically packed with functionality suitable for everyone. As a result, the interface is complicated, functionalities are unexplored, and hence, unused, and users are dissatisfied with the product. Many attempts in the user adaptive systems literature have explored ways to customize software according to the inferred user needs.

Recent probabilistic approaches model the uncertainty in the application domain and typically optimize a single objective function, i.e., helping the user complete a task faster or interact with the interface more easily, but not both. A few exceptions exist that provide a principled treatment of the uncertainty and of the tradeoffs needed to satisfy multiple objectives. Nevertheless, existing work has done little to address three important issues:

* the interaction principles that govern the nature of the problem's objective functions
* the hidden user variables that explain observed preferences and behaviour
* the value of information available in the repeated, sequential nature of the interaction between the user and the system

We are interested in designing a software agent that assists the user by adapting the interface and suggesting task completion help. In particular, the sequential nature of human-computer interaction (HCI) naturally lends itself to modeling as a partially observable Markov decision process (POMDP). We propose to develop a customization POMDP that learns the type of user it is dealing with and adapts its behaviour in order to maximize expected rewards formulated by the interaction principles for that specific user. Overall, modeling the automatic customization problem as a POMDP enables the system to take optimal actions with respect to the value of information gain of an exploratory action and the immediate rewards obtained by exploitation. This approach provides a decision-theoretic treatment of balancing the opportunities to learn about the user against exploiting what the system already knows about the user.
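The idea of a POMDP that "learns the type of user it is dealing with" rests on the standard belief update, in which the system revises a probability distribution over hidden states after each action and observation. The sketch below shows that generic update in Python. The user types, the action, the observations, and all probabilities are hypothetical values chosen for illustration; they are not taken from the poster.

```python
def belief_update(belief, action, observation, transition, observation_model):
    """Standard POMDP belief update: b'(s') is proportional to O(o|s',a) * sum_s T(s'|s,a) b(s)."""
    states = list(belief)
    new_belief = {}
    for s_next in states:
        predicted = sum(transition[(s, action)][s_next] * belief[s] for s in states)
        new_belief[s_next] = observation_model[(s_next, action)][observation] * predicted
    total = sum(new_belief.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in new_belief.items()}


# Hypothetical model: the hidden user type does not change between steps.
transition = {
    ("novice", "suggest_help"): {"novice": 1.0, "expert": 0.0},
    ("expert", "suggest_help"): {"novice": 0.0, "expert": 1.0},
}
# Hypothetical observation model: novices accept suggested help more often.
observation_model = {
    ("novice", "suggest_help"): {"accepts": 0.8, "ignores": 0.2},
    ("expert", "suggest_help"): {"accepts": 0.2, "ignores": 0.8},
}

prior = {"novice": 0.5, "expert": 0.5}
posterior = belief_update(prior, "suggest_help", "accepts",
                          transition, observation_model)
```

With these numbers, a single accepted suggestion moves the belief in "novice" from 0.5 to 0.8. An exploratory action is valuable exactly when it is expected to sharpen this belief enough to improve later decisions.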

This work pools together techniques and insights from artificial intelligence and machine learning to construct and solve the POMDP. Specifically, we adopt methods from the Bayesian user modeling literature to construct a generic user model, the activity recognition literature to build a goal model of user activities, the HCI literature to formulate the reward model specifying user objectives, the preference elicitation literature to learn the user's utility function for adaptive systems, and the machine learning literature to populate model parameters with incomplete data and to do approximate inference. In addition to the development of the novel user model and reward model, a major contribution here is demonstrating that the customization POMDP is able to model real-world applications tractably and is able to adapt to different types of users quickly.

Using Probabilistic Graphical Models in Bio-Surveillance Research
Masoumeh Izadi, McGill University

Artificial intelligence methods can support and assist optimal use of clinical and administrative knowledge in diverse settings, from diagnostic assistance and detection of epidemics to improved efficiency of health care delivery processes. Probabilistic graphical models have been successfully used for many medical problems. We describe a decision support system in public health bio-surveillance research. A long line of research has shown that current outbreak detection methods are ineffective; they both raise false alarms and miss attacks. Our approach tries to bring us closer to an effective detection system that detects real attacks and only those. I show how Partially Observable Markov Decision Processes (POMDPs) can be applied to outbreak detection methods to improve the alarm function in the case of anthrax. Our results show that this method significantly outperforms existing solutions, in terms of both sensitivity and timeliness.

Incorporating a New Relational Feature in Online Handwritten Character Recognition
Sara Izadi, Concordia University

Artificial neural networks have shown good capabilities in performing classification tasks. However, classifier models used for learning in pattern classification are challenged when the differences between the patterns of the training set are small. Therefore, the choice of effective features is mandatory for reaching good performance. Statistical and geometrical features alone are not suitable for recognition of hand-printed characters due to variations in writing styles that may result in deformations of character shapes. We address this problem by using a relational context feature combined with a local descriptor for training a neural network-based recognition system in a user-independent online character recognition application. Our feature extraction approach provides a rich representation of the global shape characteristics in a considerably compact form. This new relational feature generally provides higher distinctiveness and robustness to character deformations, thus potentially increasing the recognition rate in a user-independent system. While enhancing the recognition accuracy, the feature extraction is computationally simple. We show that the ability to discriminate between handwritten characters is increased by adopting this mechanism, which provides input to the feed-forward neural network architecture. Our experiments on Arabic character recognition show results comparable with the state-of-the-art methods for online recognition of these characters.

Description Length and the Multiple Motif Problem
Anna Ritz, Brown University

Protein interactions drive many biological functions in the cell. A source protein can interact with several proteins; the specificity of this interaction is partly determined by the sequence around the binding site. In the 20-letter alphabet of protein sequences (denoting the 20 amino acids), a motif is a pattern that describes these binding preferences for a given protein. The motif-finding problem is to extract a motif from a set of sequences that interact with a given protein. The problem is solved by identifying statistically enriched patterns in this foreground set compared to a background set of non-interacting sequences. Finding such patterns is well-studied in Computational Biology.

Recent advances in technology require us to rethink the approach to the motif-finding problem. Mass spectrometry, for example, allows high-throughput measurements of multiple proteins interacting simultaneously. This creates a foreground set that is a mixture of motifs. The Multiple Motif problem is described as follows: find a collection of motifs, called a motif model, that best describes the foreground. The motif model is empty if the background distributions describe the foreground better than any set of patterns. A few algorithms to find multiple motifs exist, but they use either overly simplistic or overly descriptive motif representations. Overly simplistic motifs provide limited information about the structure of the data, while overly descriptive motifs use many parameters that require unrealistically large datasets. We use a representation between these extremes: some positions in a motif are exact, while others are restricted to a few letters.

When comparing motif models, we want to know which model describes the foreground the best. We use description length as a metric. Our goal is to learn the motif model that produces the most compact representation of the foreground by minimizing description length. Using minimum description length in this context circumvents some of the limitations of other representations. Each motif in the model must contribute to describing the foreground as concisely as possible, avoiding both redundancy and overfitting. Description length also gives a criterion for merging multiple exact motifs into a single, inexact motif, a task that is often ambiguous in other algorithms.
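The selection criterion described above can be written in the generic two-part form of minimum description length. This is a sketch of the general principle rather than the poster's actual encoding: the symbols $M$ (a motif model), $F$ (the foreground set), $L(\cdot)$ (code length in bits), and the acceptance rule for a candidate motif $m$ are assumptions introduced here.

```latex
\[
  \mathrm{DL}(M) \;=\; L(M) \;+\; L(F \mid M),
  \qquad
  M^{*} \;=\; \arg\min_{M} \mathrm{DL}(M)
\]
\[
  \text{add motif } m \text{ to } M \text{ only if}\quad
  \mathrm{DL}(M \cup \{m\}) \;<\; \mathrm{DL}(M)
\]
```

Under this reading, $L(M)$ charges for every motif and every parameter, which discourages redundant or overfitted motifs, while $L(F \mid M)$ rewards motifs that compress the foreground better than the background distribution alone. The empty model corresponds to encoding the foreground entirely with the background distribution.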

We describe the use of minimum description length to filter the results of known algorithms and to discover novel motifs in synthetic and real datasets. This is joint work with Benjamin Raphael and Gregory Shakhnarovich at Brown University.

Machine Translation with Self Organized Maps
Aparna Subramanian, University of Maryland, Baltimore County

I am investigating the idea of using Self Organizing Maps for the purposes of Machine Translation. Human translators seem to translate based on their knowledge of which words/phrases of one language best represent the translation of a word/phrase in another. While choosing these word/phrase equivalents, they rely on similarity in the underlying concept to which the two words/phrases in different languages correspond. This gives a good reason for a machine translation system to do something similar, i.e., to translate at a conceptual level. Conceptual relativism of languages indicates a good source to parameterize concepts for the purpose of translation. Self Organizing Maps (SOMs) can be used to formalize such concept categories and improve them by learning over time. Contextual information can also be captured in SOMs and be used for translation. The major challenges in the practical application of SOMs to problems such as translation, which require large vectors of concepts to be stored and processed, are speed and space. These can be addressed in at least the following two ways: SOMs stored and processed as a hierarchy of concepts, and SOMs maintained as different modules, each catering to a group of similar concepts. I plan to further investigate the feasibility of these methods.

One approach for translation therefore is to average over the contextual relevance of the given piece, e.g. sentence, over the whole conversation or text in the source language under consideration. This can be done using a SOM for contexts which learns with every input sentence in the text. The mapping of the input sentence in the SOM can then be used as input to the Word Category Map of the source language. The output/s of this exercise can be the input to the target language Word Category Map. The words/phrases that are the outcome of this step can be organized into a sentence using the context SOM for the target language and can be aided by the knowledge of the grammar for the target language.

The investigation is in its initial stages, though the idea appears promising because this kind of translation system has the capacity to evolve through learning and takes care of the pragmatics of the input. The approach also seems viable since there have been attempts in the past to use SOMs for Natural Language Processing in general. The present work is significant because the use of Self Organizing Maps for Machine Translation does not appear to have been explored, though it has been indicated as a possibility in previous work.

Policy Recognition for Multi-Player Tactical Scenarios
Gita Sukthankar, University of Central Florida

This research addresses the problem of recognizing policies given logs of battle scenarios from multi-player games. The ability to identify individual and team policies from observations is important for a wide range of applications including automated commentary generation, game coaching, and opponent modeling. We define a policy as a preference model over possible actions based on the game state, and a team policy as a collection of individual policies along with an assignment of players to policies. Given a sequence of input observations, O, (including observable game state and player actions), a set of player policies, P, and team policies, T, the goal is to identify the individual policies p that were employed during the scenario.

A team policy is an allocation of players to tactical roles and is typically arranged prior to the scenario as a locker-room agreement. However, circumstances during the battle (such as the elimination of a teammate or unexpected enemy reinforcements) can frequently force players to take actions that were a priori lower in their individual preference model. In particular, one difference between policy recognition in a tactical battle and typical plan recognition is that agents rarely have the luxury of performing a pre-planned series of actions in the face of enemy threat. This means that methods that rely on temporal structure, such as Dynamic Bayesian Networks (DBNs) and Hidden Markov Models, are not necessarily well-suited to this task. An additional challenge is that, over the course of a single scenario, one only observes a small fraction of the possible game states, which makes policy learning difficult.

This research explores a model-based system for combining evidence from observed events using the Dempster-Shafer theory of evidential reasoning. The primary benefit of this approach is that the model generalizes easily to different initial starting states (scenario goals, agent capabilities, number and composition of the team). Unlike traditional probability theory where evidence is associated with mutually-exclusive outcomes, the Dempster-Shafer theory quantifies belief over sets of events. We computed the average accuracy over the set of battles for each of the three rules of combination. We evaluate our Dempster-Shafer based approach on logs of real and simulated games played using Open Gaming Foundation d20, the rule system used by many popular tabletop games, including Dungeons and Dragons.
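To illustrate how Dempster-Shafer theory combines evidence over sets of hypotheses, the sketch below implements Dempster's rule, the classical rule of combination. The abstract does not name the three rules that were compared, so this shows only the best-known one. The policy names and mass values are hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule: multiply masses, keep non-empty intersections, renormalize."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("the two sources are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical frame of discernment: the policies a player might be following.
ALL = frozenset({"sniper", "medic", "scout"})

# Evidence from one observed event: fairly specific support for "sniper".
m1 = {frozenset({"sniper"}): 0.6, ALL: 0.4}
# Evidence from another event: rules out "scout" but cannot separate the rest.
m2 = {frozenset({"sniper", "medic"}): 0.7, ALL: 0.3}

belief = combine(m1, m2)
```

Unlike a probability distribution over single outcomes, the second source assigns mass to the set {sniper, medic} without splitting it between the two, and the remaining mass on the full set represents ignorance rather than a uniform guess.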

Advice-based Transfer in Reinforcement Learning
Lisa Torrey, University of Wisconsin

This report is an overview of our work on transfer in reinforcement learning using advice-taking mechanisms. The goal in transfer learning is to speed up learning in a target task by transferring knowledge from a related, previously learned source task. Our methods are designed to do so robustly, so that positive transfer will speed up learning but negative transfer will not slow it down. They are also designed to allow human teachers to provide simple guidance that increases the benefit of transferred knowledge. These methods allow us to push the boundaries of current work in this area and perform transfer between complex and dissimilar tasks in the challenging RoboCup simulated soccer domain.

Determining a Relationship Between Two Distinct Atmospheric Data Sets of Different Granularities
Emma Turetsky, Carleton College

Regression analysis is a classic data mining problem with many real-world applications. We present several methods of using data mining and statistical analysis to find a relationship between two different data sets: atmospheric particles (and their elemental constituents) and elemental carbon (EC). Specifically, we wish to determine which elements in the atmosphere cause elemental carbon, something that is common in industrial zones and large cities and can normally be found in exhaust fumes and areas where there is visible carbon. In order to do this, we used machine learning regression algorithms including SVM regression and Lasso regression as well as regular linear regression. We've created several models that correlate specific elements with the amount of elemental carbon in the atmosphere.

Inferring causal relationships between genes from steady state observations and topological ordering information
Xin Zhang, Arizona State University

The development of high-throughput genomic technologies, such as cDNA microarray and oligonucleotide chips, empowers researchers to reveal gene interactions. Mathematical modeling and in-silico simulation can be used to analyze gene interactions unambiguously, and to predict the network dynamic behavior in a systematic way. Various network inference models have been developed to identify gene regulatory networks using gene expression data, but none of them addresses the inference of causal relationships between genes, which is a very important issue in systems biology. Among the developed methods, the Inductive Causation (IC) algorithm has been proven to be effective for inferring causal relationships among variables. However, a simulation study in the context of gene regulatory networks shows that the IC algorithm, which uses only a single data source, results in low precision and recall rates. To improve the performance, we propose a joint learning scheme that integrates multiple data sources. We present a modified IC (mIC) algorithm that combines steady state data with partial prior knowledge of gene topological ordering information for jointly learning causal relationships among genes.

We perform three sets of experiments on synthetic datasets for learning causal relationships between genes using the IC and the mIC algorithms. Each experiment contains 100 randomly generated Boolean networks (DAGs), each of which contains 10 genes connected by proper functions, with the gene topological ordering information. The distribution of the network is generated based on the probability distribution of the root genes and the proper functions. The Monte Carlo sampling method is used to generate 200 samples in a dataset for each network based on the probability distribution. We compare the simulation results from the mIC algorithm with the ones from the IC algorithm.

From the simulation-based evaluation we conclude that (i) the IC algorithm does not work well for learning gene regulatory networks from steady state data alone, (ii) a better way of learning gene causal relationships from steady state data is to use additional knowledge such as gene topological ordering, and (iii) the precision and recall rates of the mIC algorithm are significantly improved compared with the IC algorithm, with statistical confidence of 95%. For randomly generated networks, the mIC algorithm works well for jointly learning the causal regulatory network by combining steady state data and gene topological ordering knowledge, with a precision rate greater than 60% and a recall rate greater than 50%.
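The precision and recall rates reported above compare the causal edges an algorithm infers with the edges of the true network. A minimal Python sketch of that comparison follows. The abstract does not state how edges were matched, so treating each causal relationship as a directed (cause, effect) pair and counting a reversed edge as an error are assumptions, and the gene names are hypothetical.

```python
def precision_recall(true_edges, predicted_edges):
    """Precision and recall of predicted directed edges against the true network."""
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    true_positives = len(true_edges & predicted_edges)
    precision = true_positives / len(predicted_edges) if predicted_edges else 0.0
    recall = true_positives / len(true_edges) if true_edges else 0.0
    return precision, recall


# Hypothetical 5-gene example; each pair is (cause, effect).
true_net = {("g1", "g2"), ("g2", "g3"), ("g1", "g4")}
inferred = {("g1", "g2"), ("g3", "g2"), ("g1", "g4"), ("g4", "g5")}

precision, recall = precision_recall(true_net, inferred)
```

Here two of the four inferred edges are correct (precision 0.5) and two of the three true edges are recovered (recall about 0.67). The reversed edge ("g3", "g2") counts against both measures under the stated assumption.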

We further apply the mIC algorithm to gene expression profiles used in the study of melanoma. 31 malignant melanoma samples were quantized to the ternary format such that the expression level of each gene is assigned to -1 (down-regulated), 0 (unchanged) or 1 (up-regulated). The 10 genes involved in this study are chosen from 587 genes from the melanoma dataset. The results showed that some of the important causal relationships associated with the WNT5A gene have been identified using the mIC algorithm, and those causal connections have been verified from the literature.

Workshop Organization

Organizers: Hila Becker, Columbia University; Bethany Leffler, Rutgers University

Faculty Advisor: Lise Getoor, University of Maryland, College Park

Reviewers: Hila Becker, Finale Doshi, Seyda Ertekin, Katherine Heller, Bethany Leffler, Özgür Şimşek, Jenn Wortmann

Thanks to our sponsors:

CRA Committee on the Status of Women in Computing Research

Princeton University