
MANAGEMENT SCIENCE
Vol. 55, No. 4, April 2009, pp. 664–684
ISSN 0025-1909 | EISSN 1526-5501
doi 10.1287/mnsc.1080.0952
© 2009 INFORMS

Active Feature-Value Acquisition

Maytal Saar-Tsechansky
McCombs School of Business, University of Texas at Austin, Austin, Texas 78712, [email protected]

Prem Melville
IBM T.J. Watson Research Center, Yorktown Heights, New York 10598, [email protected]

Foster Provost
Stern School of Business, New York University, New York, New York 10012, [email protected]

Most induction algorithms for building predictive models take as input training data in the form of feature vectors. Acquiring the values of features may be costly, and simply acquiring all values may be wasteful or prohibitively expensive. Active feature-value acquisition (AFA) selects features incrementally in an attempt to improve the predictive model most cost-effectively. This paper presents a framework for AFA based on estimating information value. Although straightforward in principle, estimations and approximations must be made to apply the framework in practice. We present an acquisition policy, sampled expected utility (SEU), that employs particular estimations to enable effective ranking of potential acquisitions in settings where relatively little information is available about the underlying domain. We then present experimental results showing that, compared with the policy of using representative sampling for feature acquisition, SEU reduces the cost of producing a model of a desired accuracy and exhibits consistent performance across domains. We also extend the framework to a more general modeling setting in which feature values as well as class labels are missing and are costly to acquire.

Key words: information acquisition; predictive modeling; active learning; active feature acquisition; data mining; machine learning; business intelligence; imputation; utility-based data mining

History: Received July 18, 2006; accepted April 25, 2008, by Ramayya Krishnan, information systems. Published online in Articles in Advance January 28, 2009.

1. Introduction

…the shift from relying on existing information collected for other purposes to using information collected specifically for research purposes is analogous to primitive man's shifting from food collecting to agriculture…. (Siegel and Fouraker 1960, p. 72)

Predictive models play a key role in numerous business intelligence tasks. Models are induced from historical data to predict customer behavior or to detect adversarial acts such as fraud. A critical factor affecting the knowledge captured by such a model is the quality of the information from which the model is induced—the "training data." In the context of predictive modeling, the quality of information pertains to the training sample's composition, the accuracy of the values, and the number of unknown values.

For many predictive modeling tasks, potentially pertinent information is not immediately available but can be acquired at a cost. Traditionally, information acquisition and inductive modeling are addressed independently; data are collected irrespective of the modeling objectives. However, information acquisition and predictive modeling in fact are mutually dependent: newly acquired information affects the model induced from the data, and the knowledge captured by the model can help determine what new information would be most useful to acquire (Simon and Lea 1974). We would like to take advantage of this relationship and develop feature-value acquisition policies for predictive model induction—procedures for evaluating and selecting feature-value acquisitions that will be used for model induction. Mookerjee and Mannino (1997) address a similar problem, where costly feature values of a test instance for which inference is requested are unknown and are acquired sequentially given an existing knowledge base. Here we study a complementary problem in which values must be acquired for induction.

The availability of generic, effective, and computationally feasible information-acquisition policies for model induction can affect business practices by transforming existing information-acquisition practices and by changing the manner by which firms interact with consumers. As an example, consider the generation of personalized recommendations to customers. Often, a recommender system's underlying predictive model employs customers' ratings of prior purchases as predictors of a customer's preference for a product she has not yet purchased. The availability of many ratings from a large number of customers is critical for successful induction of an accurate model of consumer preferences. However, without costly incentives, most consumers rarely provide this valuable feedback. To improve the model's predictive accuracy, it is infeasible to acquire feedback from all consumers about all products, even those they have already purchased. A better acquisition policy would determine which ratings from which customers would be most cost-effective to acquire via costly incentives, in order to obtain the desired modeling objective for the lowest cost (Huang 2007). Similar scenarios emerge in other modeling tasks where missing feature values can be acquired at a cost. These include modeling of medical treatment effectiveness and diagnosis from medical databases, where patient information, such as details on prior hospitalizations and prior medical tests, is notoriously incomplete. Intelligent information-acquisition policies can also dramatically change already established information-acquisition models: firms acquire bundles of psychographic, consumption, and lifestyle data periodically from third-party suppliers, such as Acxiom, to support business intelligence modeling for tasks such as risk scoring, customer retention, and personalized marketing. As with other information goods, a firm should consider how to bundle and price information on consumers. Effective acquisition policies will enable firms to identify and acquire different types of information for different consumers at potentially different prices, to enhance cost-effective modeling. These capabilities may also enable small firms to reap the benefits from business intelligence modeling, allowing them to enrich their potentially limited data by selectively acquiring useful information.

Given training data with missing feature values, an arbitrary classification-model induction algorithm, a set of prospective feature-value acquisitions, and the cost of acquiring each specific feature value (the cost of features may vary feature to feature and instance to instance), the general active feature-value acquisition (AFA) problem is to acquire feature values so as to obtain a desired performance level for minimum cost. In this paper we consider performance to be some function of the model's generalization accuracy.¹

¹ The model's expected accuracy over the population under consideration.

However, because we do not know a priori the population under consideration, the generalization performance cannot be known exactly, and we must estimate it from a sample. Thus, AFA policies cannot be provably optimal, so we employ a heuristic measure of the performance from feature-value acquisition.

Even given a heuristic measure of performance, in principle, identifying the feature-value set that yields the desired performance objective at a minimum cost requires considering all possible sets of prospective acquisitions. Unfortunately, it is not feasible to compute this for most interesting problems. Moreover, given a finite sample, both statistical learning theory and practical experience tell us that more search through possible models often leads to worse performance, because of problems of multiple comparisons (Vapnik 1998, Jensen and Cohen 2000). We will revisit this issue in §5. We therefore revise the objective of AFA for this paper as follows. Given a performance measure, we aim to identify the individual feature value to acquire next, in order to achieve the greatest improvement in the performance measure per unit cost, which implies a greedy, myopic acquisition policy.

The primary contributions of this paper are a general framework for addressing AFA and a specific method for solving AFA problems based on appropriate heuristics. To our knowledge, no prior work² addresses general AFA. We propose an acquisition policy that produces acquisition schedules iteratively based on estimates of the expected utility from different potential acquisitions. In principle this is straightforward, but the AFA setting renders utility estimation particularly challenging: estimations often must be made based on little available information and ought to be sensitive so as to capture the benefit to induction of individual feature values. Further, because the space of possible acquisitions can be immense, estimating the value of each potential acquisition may be computationally infeasible. We develop and study empirically the impact of measures for capturing the value to induction of single feature values in the presence of scarce data and propose different mechanisms to reduce the complexity of the estimation.

This expected-utility approach has several important advantages. The framework is general and can be applied to derive an acquisition schedule for any induction technique. This is important because due to the inherent bias of different modeling techniques and professional regulations in some industries, no single technique is applied across all problems. Another important advantage of the expected-utility approach is that it can be applied to improve any utility function derived from the model's predictive performance, such as estimated generalization accuracy, expected profit in a particular setting, or the expected cost of model error. Finally, this approach can utilize information about the varying cost of information to derive an acquisition schedule, not assuming that the cost of acquiring an unknown value is fixed for all features and/or all instances. Experimental results demonstrate that the resulting method provides significantly better models for a given cost than those obtained with other acquisition policies. Because the method utilizes acquisition cost information, it is particularly advantageous in challenging tasks for which there is significant variance across potential acquisitions with respect to their informativeness and their cost.

² With the exception of a short paper on our preliminary studies (Melville et al. 2005).

Finally, another contribution of this paper is an extension of the policy to a more general acquisition problem. For some modeling tasks, class labels (i.e., dependent variables' values) are missing as well as feature values, and either or both may be acquired at cost. We show that because our framework estimates the value to induction of different acquisitions, it allows the dependent variable to be treated as yet another feature. Thus, the AFA framework and method can be extended directly to address this new problem. In practice, the method interleaves the acquisition of class labels and feature values, based on the marginal expected value from each acquisition; we show it to be superior both to uniform acquisitions and to policies that consider the acquisition of only feature values or only class labels.

2. Active Feature-Value Acquisition

Assume a classifier induction problem where each instance is represented with n independent variables plus a discrete, dependent "class variable." The available data set of m instances can be represented by the "incomplete" matrix F, where F_{i,j} corresponds to the value of the jth feature of the ith instance, which may be missing. Missing elements in the matrix F represent missing feature values that can be acquired at a cost. In general, the cost of different feature values may vary, depending on the nature of the particular feature or of the instance for which the information is missing.

Algorithm 1. General Active Feature-Value Acquisition Framework

Given: F: initial (incomplete) instance-feature matrix; Y = {y_i : i = 1, …, m}: class labels for all instances; T: training set = {F, Y}; L: classifier induction algorithm; b: size of query batch; C: cost matrix for all instance-feature pairs.

1. Initialize the set of possible queries Q to {q_{i,j} : i = 1, …, m; j = 1, …, n, such that F_{i,j} is missing}
2. Repeat until stopping criterion is met
3.   Induce a classifier, M = L(T)
4.   ∀q_{i,j} ∈ Q, compute score(q_{i,j}, c_{i,j}, L, T)
5.   Select the subset, S, of the b feature values with the highest scores
6.   ∀q_{i,j} ∈ S
7.     Acquire values for F_{i,j}
8.   Remove S from Q
9. End Repeat
10. Return M = L(T)

We present an iterative, sequential acquisition framework, where at each acquisition phase, alternative acquisitions are evaluated to acquire the value of F_{i,j} at the cost C_{i,j} that provides the largest improvement per unit cost in the performance objective. The iterative framework for AFA is presented in Algorithm 1. The framework is independent of the classification modeling technique; it is given a learner, L, which includes a model induction algorithm and a missing-value treatment to allow for induction from the incomplete matrix F.³ At each phase a "score" is estimated and assigned to each potential acquisition, reflecting the estimated added value per unit cost of the acquisition. The acquisition with the highest score is selected, and the corresponding feature value is acquired; a particular approach for assigning scores to potential acquisitions will be described in detail in §2. Once a value is acquired, the training data and the information-acquisition cost are appropriately updated, and this process is repeated until some stopping criterion is met, e.g., a desirable model accuracy has been obtained. Often, many values must be acquired to obtain a desired performance level. To reduce the computational burden or based on domain constraints, at each iteration an AFA policy may acquire a "batch" of b ≥ 1 values. As before, computing the set that will result in the greatest improvement in the heuristic measure is computationally complex. In this paper, we select the values with the highest individual scores; the sensitivity to this choice is examined in §3.3.

³ Induction algorithms either include an internal mechanism for incorporating instances with missing feature values (Quinlan 1993) or require that missing values be imputed first. Henceforth, we assume that the induction algorithm includes or is coupled with some treatment for instances with missing values.
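To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch, not the authors' implementation. The train, score, and acquire callbacks are assumptions standing in for the induction algorithm L, the acquisition score of step 4 (e.g., the expected-utility measure developed below), and whatever data source supplies true values at a cost.

    def afa_loop(F, y, train, score, acquire, batch_size, n_phases):
        """Sketch of Algorithm 1. F is an m-by-n list of lists with None
        marking missing feature values; train(F, y) induces a model;
        score(q, model, F, y) estimates a query's value per unit cost;
        acquire(i, j) returns the true value of cell (i, j)."""
        # Step 1: every missing cell is a candidate query q_(i,j).
        Q = {(i, j) for i, row in enumerate(F)
             for j, value in enumerate(row) if value is None}
        model = train(F, y)                       # step 3 (initial model)
        for _ in range(n_phases):                 # step 2: stopping criterion
            if not Q:
                break
            # Steps 4-5: score all candidate queries, keep the best batch.
            batch = sorted(Q, key=lambda q: score(q, model, F, y),
                           reverse=True)[:batch_size]
            for i, j in batch:                    # steps 6-7: acquire values
                F[i][j] = acquire(i, j)
            Q.difference_update(batch)            # step 8: remove from pool
            model = train(F, y)                   # step 3: re-induce classifier
        return model                              # step 10

Here the stopping criterion is simply a fixed number of phases; a budget or a target accuracy would slot into the same place.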

We now present a method for AFA based on computing the value of the information that may be acquired. The central component of the computation also presents the main difficulties with its implementation: the computation of the value of information prior to acquisition, when only partial knowledge about the acquired information is available. We discuss three difficulties with the computation and present approximation techniques to address these difficulties. Together they comprise the proposed AFA method: sampled expected utility (SEU).

We estimate the value of a potential acquisition by its expected marginal contribution to predictive performance. Because the true value of the missing feature is unknown prior to its acquisition, it is necessary to estimate the potential impact of an acquisition for different possible acquisition outcomes. The acquisition with the highest information value will be the one that results in the maximum utility in expectation, given a model, a model induction algorithm, and a particular utility function. For the last, the objective may be to maximize the model's generalization accuracy, to maximize future profit, to minimize the costs incurred due to incorrect predictions, etc. A utility score captures the expected improvement from each potential acquisition. Assuming feature j has K distinct possible values v_1, …, v_K, the expected utility of the acquisition q_{i,j}, or "query" for short, is given by

    E(q_{i,j}) = \sum_{k=1}^{K} \mathcal{U}(F_{i,j} = v_k)\, P(F_{i,j} = v_k),    (1)

where P(F_{i,j} = v_k) is the probability that F_{i,j} has the value v_k, and U(F_{i,j} = v_k) is the utility of knowing (via acquisition) that the feature value F_{i,j} is v_k. The utility U(·) is the marginal improvement in performance per unit of acquisition cost:

    \mathcal{U}(F_{i,j} = v_k) = \frac{\mathcal{A}(F \mid F_{i,j} = v_k)}{C_{i,j}},    (2)

where A(F | F_{i,j} = v_k) is the change in value to induction from augmenting F with F_{i,j} = v_k, and C_{i,j} is the cost of acquiring F_{i,j}. This expected-utility policy therefore corresponds to selecting the query that will result in the estimated largest increase in performance per unit cost in expectation. If all feature costs are equal, this corresponds to selecting the query that would result in the classifier with the highest expected performance. Otherwise, expected utility allows several low-yield, high-margin acquisitions to be selected instead of one higher-yield acquisition with less expected improvement per unit cost.
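As a sketch of how Equations (1) and (2) combine into a single score for one query, consider the following Python fragment. The helpers estimate_distribution (for P(F_{i,j} = v_k)) and value_of_induction (for A(F | F_{i,j} = v_k)) are hypothetical placeholders for the estimators developed in §§2.1–2.2, not functions defined by the paper.

    def expected_utility(i, j, F, y, cost, estimate_distribution,
                         value_of_induction):
        """Equation (1): E(q_ij) = sum_k U(F_ij = v_k) P(F_ij = v_k),
        where Equation (2) sets U = A(F | F_ij = v_k) / C_ij."""
        expected = 0.0
        for v_k, p_k in estimate_distribution(i, j, F, y).items():
            F[i][j] = v_k                       # tentatively augment F with v_k
            delta = value_of_induction(F, y)    # change in value to induction
            F[i][j] = None                      # undo the tentative assignment
            expected += p_k * delta / cost[i][j]   # Equation (2), weighted
        return expected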

In principle, this approach would allow the estimation of the value of each possible acquisition and then the selection of acquisitions by ranking them by their information-value estimates. However, there are significant hurdles to its practical implementation. We introduce the challenges next, and then address each in turn in the following three subsections.

Challenge 1. Estimating contribution to induction. As outlined in Equation (2), for each query (∀q_{i,j} ∈ Q) computing expected utility requires the estimation of the value, A(·), to induction from the acquisition. Here we assume classification accuracy to be the performance metric of interest; as discussed, the framework applies to other goals such as minimizing misclassification cost or maximizing profit. As we will see, to estimate the expected improvement in classification performance, it is necessary to detect expected changes in the modeling technique's average class probability estimation that are conducive to improved classification accuracy. It turns out that the obvious measure, classification accuracy itself, is not sensitive enough to such changes.

Challenge 2. Estimating value distributions. For estimating the expected contribution of different acquisitions, a prerequisite is to estimate the conditional distribution P(F_{i,j} = v_k) for each missing value of F_{i,j}, as needed in Equation (1). We must identify an estimation mechanism appropriate for the AFA setting: many feature values may be missing, thereby rendering some modeling mechanisms more effective than others. For example, some mechanisms require that missing predictors or their distributions be estimated to produce a prediction. This adds yet another layer of estimation, which may undermine the model's prediction, sometimes significantly (Saar-Tsechansky and Provost 2007).

Challenge 3. Reducing the consideration set. Even if all unknown values were estimated accurately, selecting the best from all potential acquisitions would require estimating the utility of, in the worst case, mn queries. This would be very expensive computationally and is infeasible for most interesting problems.

2.1. Estimating an Acquisition's Contribution to Performance

Let us consider the measure to be used for estimating the value to induction from an acquisition, A(F | F_{i,j} = v_k). Let us assume for this discussion that acquiring new information aims to improve the model's classification accuracy, for a binary classification problem (thus, the decision threshold for maximum a posteriori classification is 0.5). Assuming that the model's contribution to induction is computed by averaging over a set of hold-out examples, this suggests a simple criterion for identifying an effective utility measure—prefer acquisitions that improve estimated accuracy. Let A(f) denote the value to induction estimated for a single hold-out instance, where the classifier's estimated probability that the corresponding instance belongs to the true class is f.

Criterion 1. For a given hold-out instance, A(f_1) ≻ A(f_2) if f_1 > θ and f_2 < θ, where f_i refers to the model's estimated probability that the given instance belongs to the true class, θ is the decision boundary, and a ≻ b denotes that a is better than b.


An obvious measure for this contribution is the model's classification accuracy itself (i.e., the estimated generalization accuracy) over the augmented sample F. However, as we will show, classification accuracy does not capture fine-grained changes in the models; therefore, we would like a more sensitive measure for evaluating the benefits from prospective acquisitions.

To understand why, it is necessary to examine the dynamics of the modeling setting. Specifically, the training data—and therefore the models induced—continuously change as new information is acquired. Rather than examine the classification performance of a particular model induced from one version of the data, it is useful to examine how new acquisitions affect the distribution of estimations induced from different likely variations of the training data. Friedman's analysis of the relationship between training data and classification error (Friedman 1997) examines how changes in an induction technique's average estimation of the probability of the true class affect the likelihood of classification error. For the sake of discussion, assume binary classification and let f(y | x) and f̂(y | x) denote the actual probability and the model's estimated probability that an instance belongs to class y, respectively, where x is the input vector of observable attributes. Following Friedman's analysis, the probability that the predicted class ŷ is estimated (erroneously) not to be the most likely class y can be approximated with a standard normal distribution by

    P(\hat{y} \neq y) = \tilde{\Phi}\left[\operatorname{sign}\left(f - \tfrac{1}{2}\right)\frac{E\hat{f} - \tfrac{1}{2}}{\sqrt{\operatorname{var}\hat{f}}}\right],    (3)

where Φ̃ is the upper tail area of the standard normal distribution, var f̂ denotes the estimation variance resulting from variations in the training sample, and E f̂ denotes the mean of the probability estimation f̂(y | x) generated by models induced from different variations of the training sample. Henceforth, we refer to E f̂ and var f̂ as the average probability estimation and estimation variance, respectively. When the average probability estimation leads to incorrect class prediction, and given a certain estimation variance, the likelihood of incorrect classification decreases as the average probability estimation of the true class increases. Equation (3) also reveals that the likelihood of correct classification can improve when the average probability estimation already leads to a correct prediction. As shown in Equation (3), given a certain estimation variance, var f̂, if the true class probability f and the average probability estimation E f̂ lead to the same (correct) classification, then the further from the decision boundary the average probability estimation E f̂ is, the higher the likelihood is of a reduction in classification error. This is because it is less probable for the estimation variance to cause an erroneous classification.
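A minimal numeric illustration of Equation (3) in Python, with invented scenario values: holding the estimation variance fixed, moving the average probability estimation of the true class further from the 0.5 boundary shrinks the upper tail area, and with it the chance that sampling variation flips the classification.

    import math

    def p_misclassify(f, mean_fhat, var_fhat):
        """Equation (3): upper tail area of the standard normal evaluated
        at sign(f - 1/2) * (E fhat - 1/2) / sqrt(var fhat)."""
        z = math.copysign(1.0, f - 0.5) * (mean_fhat - 0.5) / math.sqrt(var_fhat)
        return 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - Phi(z)

    # True class is the more likely one (f = 0.8); variance fixed at 0.01.
    print(p_misclassify(0.8, 0.6, 0.01))  # ~0.159
    print(p_misclassify(0.8, 0.7, 0.01))  # ~0.023: further from 0.5, safer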

This analysis suggests two criteria for an effective measure A of the utility from an acquisition. First, A should favor more extreme (correct) estimates of class membership probability.

Criterion 1′. For a given hold-out instance, A(f_1) ≻ A(f_2) iff f_1 > f_2, where f_i refers to the model's estimated probability that the given instance belongs to the true class.

This is a specialization of Criterion 1; Criterion 1 will always hold if Criterion 1′ does (but not vice versa). Second, for a correctly classified example, for a fixed-size change in the estimate of class membership probability, A should favor changes to estimates nearer to the decision boundary (note that in the analysis above this boundary is assumed to be 0.5):

Criterion 2. For a given hold-out instance, A(f_1) − A(f_1 + δ) ≻ A(f_2) − A(f_2 + δ), ∀δ such that 0 < δ ≤ min(1 − f_1, 1 − f_2) and θ < f_1 < f_2.

Based on these two criteria, we can assess the adequacy of different possibilities for the utility measure A, including intuitive alternatives such as estimated accuracy (error rate) or the estimate f̂ of the probability of class membership itself. Specifically, classification accuracy is not an adequate AFA utility measure because it does not satisfy either Criterion 1′ or Criterion 2 completely. This is illustrated for binary classification by the diamond line in Figure 1, which shows classification error (1 − accuracy) as a function of the model's estimated probability of the true class, assuming maximum a posteriori classification.

Let us consider an alternative AFA utility measure, Log Gain (LG) (also known as cross-entropy). For a model induced from a training set F, let f̂_F(y | x) be the probability estimated by the model that instance x belongs to class y, and let I(A) be an indicator function such that I = 1 if A is the correct class and I = 0 otherwise. Let

    LG(x) = \sum_{y} -\mathbb{I}(y) \log_2 \hat{f}_F(y \mid x);

Log Gain is "better" as its value decreases. Consider an evaluation data set of t instances; let the value of induction from an acquisition resulting in a training set F be given by the sum of Log Gains over the set's instances: A(F) = \sum_{e=1}^{t} LG(x_e). Hence, for each value v_k that feature F_{i,j} can take, we would induce a model from the augmented data set and compute this sum of Log Gains. As illustrated in Figure 1, Log Gain satisfies both criteria. It captures important changes in the model's estimation following an acquisition, which will allow the AFA policy to focus on acquisitions that decrease the likelihood of classification error: using Log Gain will result in higher scores for acquisitions that increase the average probability estimation of the true class when, on average, the model's class prediction is incorrect. It will also promote acquisitions that lead to more extreme probability estimations when the model's class prediction is accurate—reducing the risk of erroneous classification as new information is added to the training sample. In principle, other measures that promote these two objectives should benefit the AFA policy as well.

Figure 1 Log Gain and Classification Error vs. the Probability of the True Class

[Figure: Log Gain and classification error (1 − accuracy, the diamond line) plotted against the estimated probability of the true class, 0 to 1.0.]
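A direct transcription of the Log Gain measure into Python: only the term for the true class survives the indicator, so the sum reduces to a single negative log. The model interface (a predict_proba method returning a mapping from classes to probabilities) and the floor that guards against log(0) are my assumptions, not part of the paper's definition.

    import math

    def log_gain(true_class, predicted_probs):
        """LG(x) = -log2 fhat_F(true class | x); lower is better."""
        return -math.log2(max(predicted_probs[true_class], 1e-10))

    def sum_log_gain(model, eval_set):
        """A(F): the sum of Log Gains over the t evaluation instances.
        An acquisition that lowers this sum has improved the model."""
        return sum(log_gain(y, model.predict_proba(x)) for x, y in eval_set)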

2.2. Estimating Feature-Value Distribution

Let us now address the second term in Equation (1). We need to estimate the conditional probability distribution of a missing feature value given the known values. For each feature j, the probability P(F_{i,j} = v_k) in Equation (1) will be inferred from a model M_{i,j}, based on the other information available about instance i (what is known about the other features and the class).

Unfortunately, because of the AFA setting, the instances to which the feature-distribution-estimation model M_{i,j} would be applied may have many missing values. Considering an arbitrary predictive model, predictors whose values are required for inference (rather than for induction) may not be available. The loss in predictive accuracy stemming from the need to estimate missing predictors' values or their distributions can be avoided if the model incorporates only predictors whose values are known for this instance (Saar-Tsechansky and Provost 2007). However, for AFA this would entail considering a tremendous number of combinations, as at any point in an acquisition schedule different instances may include widely different sets of known feature values.

For the main results of this paper, to estimate the feature-value distribution model M_{i,j} we employ only one predictor: the class variable. This is because for our setting, the class is guaranteed to be known at inference time. In principle, one can employ any set of known predictors to estimate a missing feature-value distribution. In §3 we validate empirically the benefits of relying on known predictors exclusively. Conditioning on the class variable outperforms a simpler model that does not condition the missing feature distribution at all, because the latter does not take into account any instance-specific information in the estimation. A straightforward application of more complex modeling that captures the interactions among predictors but requires the estimation of unknown predictors for inference does not improve AFA performance.
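A minimal sketch of the class-conditional estimator: P(F_{i,j} = v_k) is taken as the frequency of v_k for feature j among training instances that share instance i's class label. The Laplace smoothing that keeps unseen values from getting zero probability is an assumption I add; the paper specifies only the conditioning on the known class.

    from collections import Counter

    def class_conditional_distribution(i, j, F, y, values_of_j):
        """Estimate P(F_ij = v | y_i) from instances with the same label."""
        observed = Counter(row[j] for row, label in zip(F, y)
                           if label == y[i] and row[j] is not None)
        total = sum(observed.values())
        # Laplace smoothing: every candidate value gets nonzero probability.
        return {v: (observed[v] + 1) / (total + len(values_of_j))
                for v in values_of_j}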

2.3. Reducing the Consideration Set

Estimating the expectation E(·) for each query, q_{i,j}, requires training one classifier for each possible value of F_{i,j}. Therefore, exhaustively evaluating all possible queries is infeasible for most interesting problems. One way to make this exploration tractable is by applying a fast method to identify a subset of all the possible queries that will subsequently be considered for acquisition. In particular, let the exploration parameter α (1 ≤ α ≤ mn/b) control the size of the sample to be considered for acquisition. (Recall that b is the batch size, m the number of examples, and n the number of features.) To acquire a batch of b queries, first a subsample of αb queries is selected from the available pool of prospective acquisitions; then the expected utility of each query in this subsample is evaluated using Equation (1). The value of α can be set depending on the amount of time the user is willing to spend on this process and the effectiveness of the selection scheme. We consider two fast methods to identify a subset of queries.

The first approach, uniform sampling (US), identifies a representative subset of missing feature values via a uniform random sample of queries. However, when the consideration set is drawn uniformly at random, particularly informative acquisitions may be left out of the consideration set. An alternative approach is to limit the consideration set to a subset of queries that are more likely to be informative for model induction than a query drawn at random. In particular, we propose selecting the consideration set of queries from particularly informative instances. This invokes the subproblem of what then constitutes an informative instance for model induction. We conjecture that acquired feature values are more likely to have an impact on classification accuracy when the acquired values belong to a misclassified example and, as such, embed predictive patterns that are not consistent with the current model. Next, correctly classified instances are more informative if their class prediction is uncertain. The use of uncertainty for active data acquisition originated in work on optimum experimental design (Federov 1972) and has been extensively applied in the active learning literature (Cohn et al. 1994, Saar-Tsechansky and Provost 2004). For a probabilistic model, a lack of discriminative patterns results in uncertain predictions, for which the model assigns similar likelihoods for class membership of different classes. Formally, for an instance x, let P_y(x) be the estimated probability that x belongs to class y, as predicted by the model. Then the uncertainty score is given by P_{y1}(x) − P_{y2}(x), where P_{y1}(x) and P_{y2}(x) are the first-highest and second-highest predicted class probability estimates, respectively. Motivated by this reasoning, error sampling (ES) ranks informative instances higher if they are misclassified by the current model. Next, ES ranks the correctly classified instances in increasing order of the uncertainty score.

We call the approaches in which uniform sampling and error sampling are used to reduce the set of missing values considered for acquisition sampled expected utility-US (SEU-US) and sampled expected utility-ES (SEU-ES), respectively.

3. Experimental Evaluation

We now present a comprehensive set of experiments that demonstrate the efficacy of AFA and the benefits of the measures described in §2.

3.1. Objectives and Methodology

We begin with the key empirical question of whether feature values can be acquired cost-effectively with AFA, compared with a default policy in which a representative set of feature-value acquisitions is drawn uniformly at random. Next, we present extensive empirical results that carefully examine the benefits of the measures we propose for AFA, compared with alternatives. These evaluations provide empirical support to the arguments in §2 regarding measures that are likely to be particularly effective for AFA. We then examine the upper-bound performance that could be obtained with an omniscient "oracle" and how close SEU is to this performance. In addition, we examine which of the two measures that SEU employs is closest to an omniscient measure for the corresponding quantity. These results suggest how improvements in each of SEU's estimations can contribute to SEU's overall performance so as to approach the performance of the oracle. Finally, we perform sensitivity analyses, exploring how different settings affect SEU's performance. These include SEU performance with different feature-value cost structures and parameters that determine the size of the initial sample provided to SEU, the number of examples it considers for acquisition, and the number of examples acquired in each acquisition phase.

3.1.1. Primary Results. To address the first question, we compare, as a function of acquisition cost, the classification performance obtained by the policies SEU-ES, SEU-US, and a policy (uniform) that selects acquisitions uniformly at random. This study also aims to examine the merits of the two approaches, US and ES, which we propose for reducing the set of prospective acquisitions considered by SEU.

We then demonstrate empirically the properties of the utility measure proposed for AFA in §2. The ability of SEU to rank potential acquisitions accurately will be affected by the accuracy of its estimates of the quantities in Equation (1), namely, the value to induction from a prospective acquisition of a value F_{i,j} = v_k (A(F | F_{i,j} = v_k)) and the estimated distribution of values for each unknown feature (P(F_{i,j} = v_k)). Specifically, in §2.1 we showed analytically that Log Gain effectively captures important changes in the estimated class probabilities following an acquisition that affect the likelihood of erroneous classification. Thus, Log Gain improves SEU's ability to identify acquisitions that are particularly likely to reduce classification error. In contrast, using classification accuracy as the utility measure does not capture changes in probability estimation, except when the estimated class membership also changes. To demonstrate this effect on SEU performance, we compare SEU with a modified policy that employs classification accuracy on the training set to estimate the value of prospective acquisitions. We refer to this policy as SEU-accuracy.

We also examine our choice for estimating the probability distribution over the values that a prospective acquisition may produce. As discussed in §2.2, prior research has concluded that employing only predictors whose values are known during inference improves prediction significantly. Based on these findings, in this study we conditioned the estimation on the (known) class label. To validate the merits of this approach, we compare it with two alternative methods that reflect two extremes with respect to reliance on predictors and their availability at inference time. The first is a simple approach that computes the unconditional frequency of feature values, based simply on their frequency in the training data. We refer to this approach as SEU-frequency. In the second approach, we use tree induction, employing all other features and the class label as predictors to estimate the probability distribution of a prospective acquisition. This approach, SEU-DT, aims to capitalize on the interactions between predictors for inference. However, as discussed in §2.2, inference is likely to suffer if predictors whose values are unknown must be estimated at inference time. The conditional distribution approach we employ in SEU lies between these two extremes—rather than relying on unknown values or estimating a simple unconditional distribution, SEU conditions the estimation on the known label.

3.1.2. Oracle Policies. To derive an upper bound for SEU's performance, we employ an omniscient policy (the oracle) that knows the true values of missing features to determine the feature-value acquisition that will lead to the greatest improvement in generalization performance. In addition, we assume the oracle has access to the held-out test data to compute the actual improvement in Log Gain following an acquisition. As with SEU, to render the evaluation feasible, rather than evaluate all possible acquisitions, the oracle selects the best acquisition among a sample of αb prospective acquisitions. Both policies select from the same set of prospective acquisitions.

We also present experiments that decompose the advantages conferred by the oracle over the imperfect estimations performed by SEU. Specifically, we decompose the relative advantages into the oracle's knowledge of the true model's performance measured over the held-out test data compared with SEU's estimation over the training set, and the oracle's knowledge of the true values of prospective acquisitions compared with SEU's estimation of the expected benefits from prospective acquisitions over all possible values that a missing feature may have. Recall that in SEU we estimate the distribution of a missing value to compute the benefits from its acquisition in expectation. If the actual values of prospective acquisitions were known, one could compute the benefit to model induction from acquiring the corresponding value directly rather than in expectation. To evaluate the upper-bound performance that can be obtained by SEU if the actual values of missing features were known, we constructed a new policy, the feature oracle, that has access to the true values of prospective acquisitions for the purpose of evaluating acquisitions. Both SEU and the feature oracle estimate the benefit to induction over the training set and evaluate the same set of prospective acquisitions in each acquisition phase. To evaluate the benefits from assessing the model's accuracy directly over the test data, we compare SEU's performance to that of the performance oracle—a policy in which the improvement in Log Gain is computed on the test data. We fix all other components of the policies so that the performance oracle and SEU employ the same measure to estimate feature-value distributions and evaluate the same consideration set of prospective acquisitions at each acquisition phase. We compare SEU to the performance oracle and feature oracle to estimate how improvements in each of SEU's estimations can contribute to SEU's overall performance so as to approximate the upper-bound performance.

3.1.3. Sensitivity Analyses. Finally, we explore the performance of SEU under different settings. The first of these evaluations considers the robustness of SEU's performance under different feature-value acquisition costs. SEU also incorporates several parameters, such as the size of the sample of feature values whose contributions to learning are evaluated by SEU, the number of feature values acquired in each acquisition phase, and the size of the initial sample provided to SEU for evaluating the contributions of prospective acquisitions. We explore in turn how each of these design parameters affects SEU's performance (Langley 2000, Hevner et al. 2004).

3.1.4. Experimental Setup. The empirical evaluations are performed over a set of data sets from a variety of domains. Four data sets⁴—Expedia, eToys, Priceline, and QVC—contain information about Web users and their visits to large retail websites. The target (dependent) variable indicates whether a user made a purchase during a visit. The predictors describe customers' surfing behaviors at the site as well as at other sites over time. We induce models to estimate whether a purchase will occur during a given session and employ the acquisition policies to estimate which unknown feature values are most cost-effective to acquire. These data sets contain both continuous and categorical features; therefore, for simplicity, when estimating value distributions we converted all the continuous features to categorical features using the discretization method of Fayyad and Irani (1993). The remaining data sets are available from the UC Irvine repository (Blake and Merz 1998) and pertain to a variety of domains.

The performance of each acquisition policy is evaluated over 10 independent runs of 10-fold cross-validation as follows. For each cross-validation run, the 10-fold partition was selected at random. In each fold of the cross-validation, all policies were provided with the same subset of initial feature values, drawn uniformly at random from the training portion. All the remaining feature values in the training data constitute the initial pool of potential acquisitions. At each acquisition phase, each policy acquires the values of a set of queries from the pool of prospective acquisitions; then a new model is induced and its classification accuracy is measured on the test data. This process is repeated until a desired number of feature values has been acquired. To reduce computation costs in the experiments, we acquire queries in fixed-size batches at each iteration. For problems for which learning requires more training information, we acquired a larger number of feature values at each phase. For each data set, we selected the initial random sample size to be such that the induced model performed at least better than assigning all instances to the majority class. We later explore the policy's performance for smaller numbers of initial feature values and for different batch sizes. The test data set contains complete instances to allow us to estimate the true generalization accuracy of the constructed model.

⁴ From the related study by Zheng and Padmanabhan (2006).


Table 1 Error Reductions of SEU Variants Compared to a Uniform Random Acquisition Policy

                   b          Initial sample
Data set      (batch size)  (no. of instances)   SEU-US   SEU-ES   SEU-accuracy   SEU-frequency   SEU-DT
Audiology         100             147            14.61*   19.72*      7.81*‡          8.38*‡      11.31*‡
Car                50           1,033            10.93*   11.11*      4.25*‡          8.14*‡       9.42*
eToys             100             125            19.11*   49.18*     10.17*‡         17.99*‡      16.14*‡
Expedia           100             350            10.04*   16.61*      6.38*‡          5.92*‡      11.03*
Lymph              20              38             7.20*    3.05*      5.56*‡          6.30*        6.70*
Priceline         100              75            10.69*    1.24†      4.46*‡          9.82*        9.17*‡
QVC               100             225             3.76*   14.82*     −0.20‡           2.83*        1.30*‡
Vote               10              59            18.30*    8.33*      6.23*‡         12.83*‡      15.32*‡
Average                                          11.83    15.50       5.58            9.02        10.04

Notes. SEU-US and SEU-ES are the consideration-set alternatives; SEU-accuracy is the alternative for A(F | F_{i,j} = v_k); SEU-frequency and SEU-DT are the alternatives for P(F_{i,j} = v_k).
*Policy is better than uniform, p < 0.05. †p < 0.06. ‡SEU-US is better than the alternative policy, p < 0.05.

We set the exploration parameter α to 10. For model induction we used J48 classification-tree induction, which is the Weka (Witten and Frank 1999) implementation of C4.5 (Quinlan 1993). Integral to this induction algorithm is a missing-value treatment, enabling induction from the incomplete data set. In addition, Laplace smoothing was used with J48 to improve class probability estimates.

We compare the performance of any two policies, A and B, by computing the percentage reduction in classification error rate obtained by A over B at each acquisition phase and report the average reduction over all acquisition phases. We refer to this average as the average percentage error reduction (Saar-Tsechansky and Provost 2004). The reduction in errors obtained with policy A over the errors of policy B is considered to be significant if the errors produced by policy A are fewer than the corresponding errors (i.e., at the same acquisition phase) produced by policy B, according to a paired t-test (p < 0.05) across all the acquisition phases. The learning curves that we present and the average percentage error reduction we report reflect the average performance of each policy over the 10 runs of 10-fold cross-validation.
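A sketch of the two evaluation statistics, assuming each policy's learning curve is given as a list of error rates aligned by acquisition phase; the paired t-test comes from scipy, which is my choice of tooling rather than anything specified by the paper.

    from scipy import stats

    def average_pct_error_reduction(errors_a, errors_b):
        """Mean over phases of the percentage error reduction of A over B."""
        reductions = [100.0 * (b - a) / b for a, b in zip(errors_a, errors_b)]
        return sum(reductions) / len(reductions)

    def a_significantly_better(errors_a, errors_b, alpha=0.05):
        """Paired t-test across acquisition phases for errors_A < errors_B."""
        t_stat, p_two_sided = stats.ttest_rel(errors_a, errors_b)
        return t_stat < 0 and p_two_sided / 2 < alpha  # one-sided test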

3.2. Results

3.2.1. Primary Results. Table 1 presents the average error reductions obtained by different SEU policies with respect to the uniform sampling policy, which acquires a representative set of feature values drawn uniformly at random. The number of acquisitions acquired at each acquisition phase, b, and the size of the initial sample are also presented in Table 1. In this and subsequent tables each significant value (p < 0.05) is marked with an asterisk (*).

Let us first examine whether SEU effectively decreases classification error compared with a uniform sampling policy. In Table 1, the fourth and fifth columns present the average error reductions obtained by SEU-US and SEU-ES with respect to uniform query sampling. Figure 2 presents the performance of the three policies on four data sets that exhibit the different patterns of performance we observe. For all data sets SEU builds more accurate models than uniform query sampling. The differences in performance on all data sets, except for SEU-ES on Priceline, are statistically significant.⁵ These results demonstrate that the expected utility framework and the specific methods we employ to estimate the expected improvement in performance are indeed effective for AFA: SEU selects queries that on average are more informative for induction than queries selected uniformly at random.

To underscore the advantage of using SEU, one can observe the cost benefit of using SEU to build a model exhibiting a desired performance level, compared with using a uniform acquisition policy. For example, for the eToys data set, uniform query sampling had to acquire approximately 1,800 feature values to obtain an accuracy of 94%. SEU-ES had to acquire fewer than 400 feature values to achieve the same accuracy. When data-acquisition costs are considerable, this could translate to substantial savings in the cost of building accurate models.

The results also indicate that the method employed to select the queries to be considered for acquisition can have a significant impact on the outcome. Both SEU-US and SEU-ES acquire useful feature values that significantly improve the model's performance. For some data sets, such as eToys and Audiology, SEU-ES selects significantly more informative acquisitions than SEU-US, suggesting that ES identifies a subset of acquisitions to be considered for acquisition that is superior, on average, to those drawn via uniform query sampling.

⁵ Note that for SEU-ES on Priceline, the improvement is at least significant at the 0.06 level.

Figure 2 Four Characteristic Patterns of the Improvement of Classification Accuracy as a Function of the Number of Feature Values Acquired, Assuming Uniform Feature Costs

[Figure: four panels (Car, eToys, Priceline, Audiology), each plotting accuracy vs. number of feature values acquired for SEU-US, SEU-ES, and Uniform.]

In other data sets—e.g., Priceline—SEU-US is preferable. Recall that ES selects entire instances, so all the missing feature values for these instances simultaneously become candidates for acquisition. Fleshing out a smaller set of examples may be more or less preferable in different domains. Also, if examples are incorrectly classified simply because they are outliers, ES will stumble because it will prefer all the unknown features for these examples. Moreover, if only a few features are relevant, selecting all the features will only dilute the candidate set.

The average error reduction obtained with SEU-US over all acquisition phases ranges between 3.76% and 19.11%. SEU-ES often results in even more substantial savings, but its performance is more varied than that of SEU-US. The average error reduction obtained by SEU-ES ranges between 1.24% and 49.18%. Because it forms the consideration set based on the entire instance, ES may sometimes fail to select instances with a highly informative feature value if the entire instance seems less informative than another instance.

In sum, both SEU policies provide considerable advantage over uniform query sampling. SEU-ES usually is the better of the two and can sometimes provide very substantial savings. SEU-US is more consistent and thus would be a more conservative choice. In the next section we expand on these results, focusing on the more conservative SEU-US policy. For the remainder of this paper, unless specified otherwise, SEU will refer to SEU-US.

We now compare SEU's performance to its performance using alternative utility measures, providing empirical support for the desired properties outlined in §2. First we examine empirically whether Log Gain is a better measure of prospective utility than is classification accuracy. The sixth column of Table 1 shows the average error reduction obtained by SEU-accuracy over uniform sampling. We mark with an asterisk (*) the data sets where SEU-accuracy is significantly better than uniform acquisition; the data sets for which SEU is significantly better than SEU-accuracy (all of them) are marked with a double dagger (‡).

Page 11: Active Feature-Value Acquisition - Prem · PDF filefprovost@stern.nyu.edu ... and Provost: Active Feature-Value Acquisition Management Science 55(4), pp ... business intelligence modeling

Saar-Tsechansky, Melville, and Provost: Active Feature-Value Acquisition674 Management Science 55(4), pp. 664–684, © 2009 INFORMS

Figure 3 Classification Accuracy vs. Log Gain for Estimating the Value of an Acquisition

[Panel (a) Car: accuracy vs. number of feature values acquired for SEU, SEU-accuracy, and Uniform.]

(b) Error reductions

Data set     Oracle-Log Gain vs. Oracle-Accuracy
Audiology    2.10∗
Car          8.34∗
eToys        0.70
Expedia      3.45∗
Lymph        6.83∗
Priceline    2.45∗
QVC          2.94∗
Vote         6.52∗

∗p < 0.05.

[Panel (c) Car: accuracy vs. number of feature values acquired for Performance Oracle-Log Gain, Performance Oracle-Accuracy, SEU, and Uniform.]

Note. The three comparisons are described in the text.

Figure 3(a) shows the performance of each policy, as well as of uniform sampling, for the Car data set. The improvements obtained by SEU (with Log Gain) can be substantially higher than those obtained by SEU-accuracy, up to more than 12% average error reduction. These results demonstrate that by capturing changes in the probability estimation, Log Gain indeed is able to select significantly more informative feature values to acquire, leading to better models on average.

A possible reason for the inferior performance obtained by SEU-accuracy may be the difficulty of precisely estimating classification accuracy using only the training data. In practice only the training data are available to the SEU policy, but it is useful to establish whether Log Gain is more informative even when an oracle computes the value of prospective acquisitions directly on the test data, or whether classification accuracy is preferable when it is estimated with sufficient precision, in spite of its step-like form. To address this question, we compared SEU's performance with versions of the policy where Log Gain and classification accuracy are measured on the held-out test data rather than on the training data. We refer to these policies as Performance Oracle-Log Gain and Performance Oracle-Accuracy, respectively. Figure 3(b) presents the error reduction obtained with Performance Oracle-Log Gain compared with Performance Oracle-Accuracy. For the Car data set, Figure 3(c) presents the performance obtained by the oracles, SEU, and the uniform acquisition policy. The results confirm the advantage from detecting changes in the model's probability estimation via Log Gain over classification accuracy—Log Gain is able to identify more informative acquisitions even when the impact of an acquisition is evaluated with absolute precision over the test data.

Now, let us turn to the value distribution estimation employed by SEU. Table 1 presents average reductions in error when using conditional distributions (SEU, fourth column) compared with using unconditional frequency estimation (SEU-frequency, seventh column) and tree induction (SEU-DT, eighth column). For the Audiology data set, Figure 4 shows the performance of SEU with each estimation approach.

Figure 4 Three Different Methods for Estimating the Feature-Value Distribution Compared with Uniform Acquisition

[Audiology: accuracy vs. number of feature values acquired for SEU, SEU-DT, SEU-frequency, and Uniform.]


Figure 5 Error Reduction of the Omniscient Oracle Compared with SEU

(a) Average error reductions

Data set     Oracle vs. SEU    SEU error reduction as % of oracle's
Audiology    9.47∗             49.21
Car          13.11∗            53.17
eToys        3.11∗             40.51
Expedia      4.09∗             42.89
Lymph        14.90∗            36.38
Priceline    0.63              56.54
QVC          2.42∗             47.31
Vote         0.66              62.43

∗p < 0.05.

[Panel (b) Car: accuracy vs. number of feature values acquired for Oracle, SEU, and Uniform.]

Note. SEU achieves about half the total error reduction over uniform acquisition.

These results suggest that unconditional distributions provide poorer estimates of the feature-value distribution than do the conditional distributions; the average improvement over the uniform policy across all data sets is 9.02%, compared with 11.83% obtained by SEU. For individual data sets, SEU's relative advantage with respect to SEU-frequency reached up to 6.23%. We also find that SEU is often better than or comparable to SEU-DT, which can capture more complex patterns but relies on predictors that may be unknown at inference time. Errors in estimating missing predictors or their distributions contribute cumulatively to prediction error; furthermore, because the induction technique implicitly assumes that all predictors will be available during inference, it is less likely than a technique relying exclusively on known predictors to capture alternative predictive patterns involving feature values that will be available during inference (Saar-Tsechansky and Provost 2007).
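For concreteness, the following minimal sketch (ours, not the paper's implementation) contrasts the unconditional frequency estimate used by SEU-frequency with the class-conditional estimate used by SEU; the data representation (a list of dictionaries with a "label" key) and the Laplace smoothing are illustrative assumptions.

```python
from collections import Counter

def value_distribution(train, feature, instance_label=None):
    """Estimate P(feature = v) from the observed training values.

    With instance_label=None this is the unconditional frequency
    estimate (as in SEU-frequency); passing the instance's class label
    restricts the counts to same-class instances, giving the
    class-conditional estimate used by SEU. Laplace smoothing keeps
    unseen values from receiving zero probability.
    """
    domain = {row[feature] for row in train if row[feature] is not None}
    observed = [row[feature] for row in train
                if row[feature] is not None
                and (instance_label is None or row["label"] == instance_label)]
    counts = Counter(observed)
    total = len(observed) + len(domain)  # Laplace smoothing
    return {v: (counts[v] + 1) / total for v in domain}
```

SEU-DT would instead induce a model (tree induction here) that predicts the missing feature from the instance's other known values, trading the simplicity above for the ability to capture more complex patterns.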

3.2.2. Oracle Policies. Let us now examine the upper-bound performance that can be obtained with SEU. Figure 5(a) shows the average error reduction obtained by the oracle, compared with SEU. Figure 5(b) shows the performance obtained by the oracle, SEU, and the uniform policies for the Car data set.

Figure 6 Oracle with Complete Knowledge of Feature Values (Only) Compared with SEU

(a) Average error reductions

Data set     Feature oracle vs. SEU
Audiology    −1.93
Car          3.87∗
eToys        −2.60
Expedia      4.00∗
Lymph        2.19∗
Priceline    −2.24
QVC          −2.08
Vote         0.64

∗p < 0.05.

[Panel (b) Car: accuracy vs. number of feature values acquired for Feature oracle, SEU, and Uniform.]

The oracle obtains between 0.63% and 14.9% average error reduction, and its benefit is statistically significant in most cases. We also measured the error reduction obtained by each of the oracle and SEU compared with uniform sampling, to determine what proportion of the improvement obtained by the oracle is obtained by SEU. In Figure 5(a) we denote this measure as "SEU error reduction as % of oracle's." SEU consistently achieves about half the "optimal" error reduction.

We can decompose the advantages conferred by each of the oracle's perfect measures of the quantities in Equation (1) over the corresponding imperfect estimations performed by SEU. Figure 6(a) presents the average error reduction obtained with the feature oracle compared with the SEU policy. For the Car data set, Figure 6(b) shows the performance of the feature oracle, SEU, and uniform sampling. As shown, acquisitions made by the SEU policy often result in models that perform comparably to those induced with acquisitions made by the feature oracle. The feature oracle performs statistically significantly better on only three data sets; in these cases the feature oracle's acquisitions lead to models that are, on average, between 2.19% and 4% more accurate than those produced by SEU.


Figure 7 Oracle with Complete Knowledge of Test-Set Performance (Only) Compared with SEU

(a) Average error reductions

Data set     Performance oracle vs. SEU
Audiology    5.63∗
Car          2.74∗
eToys        6.29∗
Expedia      3.56∗
Lymph        2.17∗
Priceline    2.64∗
QVC          −0.18
Vote         5.05∗

∗p < 0.05.

[Panel (b) Audiology: accuracy vs. number of feature values acquired for Performance oracle, SEU, and Uniform.]

Thus, for the purpose of choosing acquisitions based on computed expected utility, SEU estimates the distribution of missing values fairly well; however, in principle, there is some room for improvement before reaching the feature oracle's performance. As we show in §3, models that are expected to perform well in this setting are those that can produce an estimation of the distribution without the need to impute or to estimate the distributions of missing predictors. Approaches that are more computationally intensive (and generally more accurate) than the one we employ may improve the performance further (Saar-Tsechansky and Provost 2007).

To evaluate the benefits from assessing the model's accuracy directly over the test data, we compare SEU's performance to that of the performance oracle. Figure 7(a) presents the average percent error reduction obtained with the performance oracle compared with SEU. With the exception of a single data set, the performance oracle's ability to rank potential acquisitions by their impact on the test data results in models that are statistically significantly more accurate than those induced with SEU. The performance oracle produced models that are 2.17%–6.29% more accurate than those obtained with SEU.

Our decomposition of the oracle's performance suggests that its access to the held-out test data confers the most significant advantage over SEU. Unfortunately, in practice, this information cannot be available to an acquisition policy. SEU's estimation of the feature-value distribution already results in performance comparable to that of the feature oracle for most data sets. However, comparing the decomposed results to the results of the (complete) oracle, we see that there seems to be an interaction that leads to the overall error reduction being larger than the sum of the individual reductions.

3.3. Sensitivity Analyses
We now explore SEU's performance under different experimental settings. In the first set of experiments we evaluate SEU's performance when attributes vary widely both in the information they provide about the class and in their costs. To make the problem setting challenging, we constructed synthetic data in the following way. For the Lymph data set, which has 18 features, we added an equal number of binary features with randomly selected values that provide no information about the class. In addition, with each feature we associate a cost drawn uniformly at random from 1 to 100. We evaluated the policies' performances for five different assignments of feature costs.

Because uniform sampling does not take feature costs into account, we also compare SEU with a baseline strategy that does. This approach, cheapest first, selects feature values for acquisition in order of increasing cost. The results for all randomly assigned cost structures show that, for the same cost, SEU consistently builds more accurate models than the uniform policy. Figure 8 presents results for two representative cost structures. SEU's superiority is more substantial than that observed with uniform costs for the original data sets, because SEU's ability to capture the value of acquisitions per unit cost is more critical when features vary in both information value and cost. In contrast, the performance of cheapest first varies considerably across cost assignments. When there are highly informative features that are inexpensive, cheapest first of course performs quite well (e.g., Figure 8(a)), because its underlying assumption holds. In such cases, SEU does not perform as well because it imperfectly estimates the expected improvement from each acquisition. In contrast, when many inexpensive features are also uninformative (probably a more realistic scenario), cheapest first performs worse than the uniform policy (Figure 8(b)). SEU, however, estimates the trade-off between cost and expected improvement in performance, and although the estimation clearly is imperfect, it consistently selects better queries than random acquisition for all cost structures.
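As an illustration of this experimental setup, the sketch below (with hypothetical names; not the authors' code) draws a random cost per feature, as in the synthetic Lymph variant, and implements the cheapest-first baseline over missing (instance, feature) cells.

```python
import random

def assign_feature_costs(features, low=1, high=100, seed=0):
    """Draw one acquisition cost per feature, uniformly at random from
    [low, high], mirroring the synthetic cost assignments in Section 3.3."""
    rng = random.Random(seed)
    return {f: rng.randint(low, high) for f in features}

def cheapest_first(missing_cells, costs, budget):
    """Baseline policy: buy missing feature values in order of increasing
    cost until the budget is exhausted.
    missing_cells: iterable of (instance_id, feature) pairs."""
    acquired, spent = [], 0
    for cell in sorted(missing_cells, key=lambda c: costs[c[1]]):
        if spent + costs[cell[1]] > budget:
            break  # all remaining cells cost at least as much
        acquired.append(cell)
        spent += costs[cell[1]]
    return acquired
```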


Figure 8 Comparing Acquisition Policies Under Different Acquisition-Cost Structures

[Two panels, (a) Acquisition-cost structure 1 and (b) Acquisition-cost structure 2, plot accuracy against the total cost of feature values acquired for SEU, Cheapest first, and Uniform.]

We now examine in turn how each of SEU's parameters affects its performance. SEU requires some training data to estimate the expected contribution to induction of prospective acquisitions. In the experiments so far, we evaluated the performance of SEU once the model induced from the initial sample performed comparably to a majority classifier. Here, we also explore how SEU performs when it is provided with a smaller initial sample, until it exhausts the pool of potential acquisitions. As before, we initialized the training set to a random set of feature values, where the number of values equals 10 times the number of features for the corresponding data set. A representative pattern that we observed is presented in Figure 9(a) for the eToys data set. As shown, because of the small amount of training data, both policies require additional data to produce predictive patterns and improve performance. However, the acquisitions made by SEU are substantially more informative; thus, SEU acquires fewer features to achieve a given level of performance.

Figure 9 Changes to SEU's Parameters Have the Expected Effects on Performance

[Three panels plot accuracy against the number of feature values acquired: (a) exhaustive acquisition of values (SEU vs. Uniform); (b) SEU with different α values (α = 10, 20, 50) vs. Uniform; (c) SEU with different β values (β = 1, 20, 40) vs. Uniform.]

Thus, even when only a small number of values is available initially, it is preferable to employ SEU. In addition, as is typical with information-acquisition policies, once both policies exhaust the pool of acquisitions, the performances of the models they produce converge.


To alleviate computational costs, SEU evaluates only a subset of prospective acquisitions at each acquisition phase. The size of this consideration set, as determined by the parameter α, is likely to affect SEU's performance. For the eToys data set, Figure 9(b) presents SEU's performance for different consideration-set sizes. The results are quite intuitive—the larger the consideration set, the more likely it is that more informative feature values are identified and acquired, improving the model's performance. As shown, even for a small α value of 10, SEU makes more informative acquisitions on average than those selected by uniform sampling. If there are no computational constraints, however, a larger consideration set clearly is preferable.

The SEU policy evaluates the expected contribution of each individual feature-value acquisition. However, in practice, and in the experiments conducted in this study, more than a single feature value may be acquired simultaneously in each acquisition phase. Although we find that SEU is effective in this setting, it is important to explore how SEU with batch acquisitions performs compared to when a single value is acquired at each phase. For the Lymph data set, Figure 9(c) shows SEU's performance when a single value is acquired at each phase, compared to when multiple feature values are acquired simultaneously. Here as well the results are quite intuitive. Because SEU estimates the contribution of single values, it performs best when it acquires one value at a time; similarly, when batch acquisitions are performed, the smaller the batch size, the better SEU performs. When there are no significant computational constraints, the acquisition of an individual value at each iteration is preferable.
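The following sketch summarizes the acquisition loop these sensitivity analyses vary, with the consideration-set size α and the per-phase batch size (β in Figure 9(c)) as parameters; the callables (score, acquire, induce) are hypothetical stand-ins for the components described in §2, not the authors' code.

```python
import random

def seu_loop(pool, train, score, acquire, induce,
             alpha=50, beta=1, phases=100, seed=0):
    """Sketch of the outer SEU acquisition loop.  Each phase draws a
    consideration set of `alpha` candidate queries from the pool of
    missing values, scores each by expected utility per unit cost,
    purchases the top `beta`, and retrains the model.  A larger alpha
    makes finding highly informative values more likely; a smaller
    beta better matches the single-value assumption behind the scores.
    """
    rng = random.Random(seed)
    model = induce(train)
    for _ in range(phases):
        if not pool:
            break
        consider = rng.sample(pool, min(alpha, len(pool)))
        best = sorted(consider, key=lambda q: score(q, train, model),
                      reverse=True)[:beta]
        for query in best:
            train = acquire(train, query)  # obtain the true value at its cost
            pool.remove(query)
        model = induce(train)
    return model
```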

4. Active Information Acquisition
Now we demonstrate further the value of the expected utility framework by exploring its extension to a more general task. In principle, other sorts of information than feature values also may be acquired, at a cost, to benefit induction. We refer to the task of simultaneously evaluating the acquisition of any information pertinent to induction as active information acquisition (AIA). We consider the case of AIA in which both unknown feature values and class labels can be acquired. The motivation for this problem combines the motivations for traditional active learning (Cohn et al. 1994) and for active feature-value acquisition. Consider, for example, data used to model customers' responses to an offer. Different feature values may be missing for different customers, and the responses to offers for particular customers can be acquired at a cost. The latter costs may stem from contacting a customer or from the opportunity cost arising from presenting a suboptimal offer to a potential buyer. Given these acquisition costs, it would be beneficial for an AIA policy to suggest which acquisitions would be most effective.

The expected utility framework extends directly to handle this problem. The advantage of our approach is that it evaluates all acquisition types by the same measure—i.e., the marginal expected contribution to the predictive performance per unit cost. In Algorithm 1, missing classes can be included as potential queries in step 2. As a technical point, in this setting we cannot use class-conditional distributions to estimate the feature-value distributions of missing features (or classes), because we do not have class labels for all instances. Instead, we use an instance of the base learner (tree induction in this case) to estimate the value distribution of the feature under consideration, as done in SEU-DT in §3. We refer to this policy as SEU-AIA.
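A minimal sketch of this scoring step follows; the estimation components are passed in as callables, since the exact utility measure of Equation (1) and the distribution model are described elsewhere in the paper, and all names here are illustrative.

```python
def expected_utility_per_cost(query, train, cost,
                              estimate_distribution, retrain_with, utility):
    """Score one candidate query -- a missing feature value or, in
    SEU-AIA, a missing class label -- by expected utility per unit cost.

    estimate_distribution(train, query): dict mapping each possible
        answer to its estimated probability (via the base learner,
        as in SEU-DT, since class-conditional estimates need labels).
    retrain_with(train, query, value): the training data with the
        answer hypothetically filled in.
    utility(train): the utility (e.g., Log Gain) of the model induced
        from the given data.
    """
    distribution = estimate_distribution(train, query)
    expected = sum(prob * utility(retrain_with(train, query, value))
                   for value, prob in distribution.items())
    return expected / cost
```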

4.1. Determining the Consideration Set for AIA
To make the expected utility approach tractable, here too we reduce the set of candidate queries. However, in the AIA setting, drawing the consideration set uniformly at random would be biased against missing class labels—there typically are fewer class labels than feature values, and a uniform sample would tend to reflect this. Such a bias could be detrimental to induction because a class label tends to be much more informative than a single feature value. To reduce the set of queries evaluated by AIA, we employ a computationally inexpensive heuristic that aims to capture the relative information value of a prospective acquisition per unit cost, before it is explicitly computed by AIA. Specifically, we compute a weight for each prospective acquisition that is proportional to an estimate of the information conveyed by this value for model induction (described next), normalized by the value's cost.

More specifically, for candidate-set reduction, we evaluate the contribution to induction of all values of a given feature F_i by a cost-normalized variant of the information gain IG(F_i, L) (Quinlan 1993) of the feature F_i for class variable L. The information gain is given by

IG(F_i, L) = H(L) − H(L | F_i) = H(L) − Σ_j p(F_i = v_j) H(L | F_i = v_j),

where H(Z) denotes Shannon's information entropy of variable Z, and feature F_i can take one of the values v_1, ..., v_j. The information gain captures the reduction in the entropy of the class variable once the value of feature F_i is known. Thus, features that carry more information for determining the class have higher information gain and also are more valuable to induction. The score assigned to a feature value is the corresponding feature's information gain normalized by the feature value's cost. Thus, feature values with high information gain and lower costs are assigned higher weights.


These weights then guide the sampling of the consideration set.

In supervised learning, instances whose class labels are missing are not used for induction; thus, when a label is acquired, the known feature values of the respective instance also become available for induction. To capture the value to induction when a class label is acquired, we compute the sum of information gains over all the known features of the respective instance. Formally, let M_k denote the set of all known feature values for instance k; then the weight assigned to acquiring the label of instance k is given by Σ_{F_i ∈ M_k} IG(F_i, L).

The consideration set is composed of the prospective acquisitions with the highest weights. This selection scheme is tractable because it does not require intensive computations for each missing value, and it estimates the same information gain for all queries of a given feature. Once the consideration set is determined, SEU is applied to estimate the expected value of each individual query in the set.
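In code, the weighting scheme might look as follows; this is a sketch under the assumption of categorical features stored in per-instance dictionaries, and the (kind, instance, feature) query encoding is ours.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(L) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature):
    """IG(F_i, L) = H(L) - sum_j p(F_i = v_j) * H(L | F_i = v_j),
    computed over instances where both the feature and the label are known."""
    known = [r for r in rows
             if r.get(feature) is not None and r.get("label") is not None]
    if not known:
        return 0.0
    gain = entropy([r["label"] for r in known])
    for v in {r[feature] for r in known}:
        subset = [r["label"] for r in known if r[feature] == v]
        gain -= (len(subset) / len(known)) * entropy(subset)
    return gain

def query_weight(query, rows, costs):
    """Cost-normalized weight of a candidate acquisition: a feature-value
    query is weighted by its feature's information gain; a label query by
    the summed information gain of the instance's known features."""
    kind, instance, feature = query   # ("feature", k, f) or ("label", k, None)
    if kind == "feature":
        return information_gain(rows, feature) / costs[query]
    known = [f for f, v in rows[instance].items()
             if f != "label" and v is not None]
    return sum(information_gain(rows, f) for f in known) / costs[query]
```

The consideration set is then simply the m top-weighted candidates, e.g., sorted(candidates, key=lambda q: query_weight(q, rows, costs), reverse=True)[:m].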

4.2. SEU-AIA Performance
To assess the performance of SEU-AIA on this general information-acquisition task, we remove class labels and feature values from the training data uniformly at random. We compare the performance of SEU-AIA to acquiring missing values uniformly at random. For this set of experiments we assume that all features and class labels have the same cost; later we present experiments with nonuniform costs. The purpose of this comparison is to verify that SEU-AIA effectively estimates the expected contribution of missing values of both types, so as to rank them accurately and produce better predictive models for a given cost.

Figure 10(a) presents a summary of results comparing SEU-AIA with uniform sampling. On all data sets, acquiring information using SEU-AIA results in significantly better models than using uniform acquisition. Figures 10(b) and 10(c) show the results for Audiology and Expedia, respectively, demonstrating the substantial impact. SEU-AIA consistently acquires informative values for modeling, which results in models superior to those obtained by uniform acquisition. SEU-AIA evaluates and compares the different types of information effectively and provides a significant lift in predictive performance. To our knowledge, this is the first demonstration of an effective policy for this general information-acquisition problem.

To gain further insight into SEU-AIA's choice of acquisitions, we experimentally compare the performance of SEU-AIA with that of an active learning (AL) policy, which employs SEU-AIA but considers only class labels for acquisition, and with SEU, which evaluates only feature acquisitions.

Figure 10 Active Information Acquisition vs. Uniform Random Sampling When Both Features and Class Labels Must Be Acquired at a Cost

(a) Average error reductions

Data set     SEU-AIA vs. uniform
Audiology    21.29∗
Car          0.52∗
eToys        39.71∗
Expedia      15.97∗
Lymph        16.42∗
Priceline    28.54∗
QVC          23.47∗
Vote         20.05∗

∗p < 0.05.

[Panels (b) Audiology and (c) Expedia: accuracy vs. number of feature values/class labels acquired for SEU-AIA and Uniform.]

We perform these evaluations under different cost structures in which either class labels or feature values are significantly more expensive to acquire. These comparisons allow us to assess whether SEU effectively manages acquisitions of class labels and of feature values so as to produce models that are comparable or superior to those produced with either AL or AFA alone.


Figure 11 Different Cost Structures Favor Different Basic Acquisition Strategies

[Three panels plot accuracy for SEU-AIA, AL, SEU, and Uniform: (a) feature cost = label cost, against the number of feature values/class labels acquired; (b) label cost = 6 × feature cost and (c) feature cost = 6 × label cost, against the total cost of information acquired.]

Note. SEU-AIA shows robust performance, adjusting acquisitions based on costs.

Figure 11 shows classification accuracy as a function of acquisition cost for SEU-AIA, AL, SEU, and uniform acquisition under three different cost structures. Figure 11(a) shows results for a cost structure in which class labels and feature values are equally expensive, whereas Figures 11(b) and 11(c) present results for cost structures in which class labels or feature values, respectively, are significantly more expensive. Note that because there are fewer class labels than feature values, the curves describing AL performance may appear truncated, particularly if label costs differ greatly from feature costs.

Our results suggest that although policies that consider the acquisition of only one type of information can perform well for cost structures in which the values they acquire are informative and inexpensive, their performance is inconsistent, and they perform poorly under other cost structures. By contrast, SEU-AIA effectively handles the trade-off between the informativeness of different types of values and the cost of acquiring them, providing consistently good performance across all cost structures. For example, for the cost structure shown in Figure 11(a), no one type of information is consistently more cost-effective than the other, and it therefore is beneficial to alternate between acquisitions of class labels and of feature values. As shown in Figure 11(a), SEU-AIA performs better than each of the other policies, which consider only the acquisition of class labels or of feature values. This confirms that acquiring both feature values and class labels is more cost-effective. For the cost structures in which feature values are significantly more expensive than class labels, or vice versa, either AL or SEU, respectively, performs best. This is because the extreme cost structure leads to one type of acquisition being consistently more cost-effective than the other. Although SEU-AIA's estimation of the value of prospective acquisitions is imperfect, its performance in these settings approximates that of the better of AL or SEU, demonstrating that SEU-AIA accurately estimates the usefulness of acquisitions that are more cost-effective, regardless of their type. By contrast, the policies that consider the acquisition of only one type of information, namely AL and SEU, perform poorly in one of the settings. For example, in the cost setting shown in Figure 11(c), AL performs very well. However, AL performs poorly for the cost structure in Figure 11(b) because it does not consider the highly cost-effective feature values that can be acquired. Because it is not known a priori how policies that acquire only feature values or only class labels will perform under different cost structures, and given the varied performance of these policies, SEU-AIA is attractive, allowing one consistently to identify cost-effective acquisitions of different types.

5. Limitations and Future Work
Despite the effectiveness of the expected utility framework and the SEU policy, there are limitations that provide avenues for future work.


• We discussed in §1 why any policy for AFA must be heuristic, as generalization accuracy itself cannot be computed exactly. However, even for a heuristic measure of performance, whether more complex optimization over all possible acquisition sets would improve AFA remains an open and interesting question. We note that an optimization procedure to identify the best of all possible acquisition sets can be thought of as a multiple-comparison procedure (Jensen and Cohen 2000), using a finite set of training data to estimate the generalization performance of alternative models. Such settings may lead to pathologies such as overfitting and oversearch—indeed, there has been growing evidence that search for optimal "combinations" of factors for predictive modeling often is inferior to greedy search.

• In some settings it may be possible to acquire different sets of feature values for a single price. For example, access to different information sources—such as archived patient records of different care providers—may each be costly, but once a record is accessed, a set of values can be acquired for a single price. Thus, the problem becomes determining which set of values would be most cost-effective to acquire next. In principle, our framework can be applied directly. However, estimating the value distribution of all possible assignments for a given set and computing the value to induction from each assignment would be very inefficient in practice. One possibility is to assume statistical independence of the values in a given set, alleviating the joint value-distribution estimation. Subsequently, one may employ Monte Carlo estimation to draw value assignments for each prospective set to approximate the expected contribution to model induction (see the first sketch following this list).

• For estimating the value distribution of a prospective acquisition it is recommended to use only feature values that will be known at inference time (Saar-Tsechansky and Provost 2007). In principle, one can model this distribution using any set of known predictors. As we note earlier, this approach can be inefficient with most modeling techniques, because it may require inducing a different model to capture the interactions among each unique set of known predictors. One alternative is to employ the naïve Bayes assumption of conditional independence, which allows inference without the need to estimate the values or the distributions of missing predictors; in addition, it is trivial to marginalize any missing variables by not including them during inference (see the second sketch following this list). The models induced may be improved by using a missing-value treatment that takes into consideration the potential bias in the missingness pattern created by a selective (nonuniform) acquisition policy. In the empirical evaluations in this paper we employ the treatment integral to C4.5. It would be informative for future work to explore the performance with other missing-value treatments and modeling techniques.

• The expected utility framework allows one to incorporate performance objectives other than accuracy, such as the (economic) benefit from model use.

• If (once) enough values are available, feature selection may be applied to complement AFA. Prospective value acquisitions pertaining to features that are excluded by feature selection need not be considered for acquisition. Otherwise, AFA itself would have to learn that these values provide no utility.
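For the set-acquisition idea above, the suggested Monte Carlo approximation might be sketched as follows; the names are illustrative, and independence of the set's values is the stated assumption.

```python
import random

def sample_assignment(value_dists, rng):
    """Draw one joint assignment for a set of feature values, assuming
    the values are independent so the joint distribution factorizes.
    value_dists maps each feature to a dict of value -> probability."""
    return {f: rng.choices(list(d.keys()), weights=list(d.values()))[0]
            for f, d in value_dists.items()}

def monte_carlo_set_utility(value_dists, utility_of, n_samples=100, seed=0):
    """Approximate the expected contribution to induction of acquiring
    the whole set by averaging the utility of sampled assignments."""
    rng = random.Random(seed)
    return sum(utility_of(sample_assignment(value_dists, rng))
               for _ in range(n_samples)) / n_samples
```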
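And for the naïve Bayes alternative, a minimal sketch of why marginalizing missing predictors is trivial; categorical features, precomputed class priors, and conditional probability tables are assumed.

```python
import math

def nb_log_posteriors(instance, priors, cond_probs):
    """Unnormalized log posteriors log P(c) + sum_i log P(F_i = v_i | c),
    where the sum skips missing features.  Under conditional independence
    this skipping is exact marginalization (the omitted terms sum to one),
    so no missing predictor is imputed and no distribution over missing
    predictors needs to be estimated."""
    scores = {}
    for c, prior in priors.items():
        log_p = math.log(prior)
        for feature, value in instance.items():
            if value is not None:  # missing features simply drop out
                log_p += math.log(cond_probs[c][feature][value])
        scores[c] = log_p
    return scores
```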

6. Related Work
The problem of sequential information acquisition has been addressed in a variety of settings, with various types of information being acquired to satisfy a variety of objectives. To the best of our knowledge, the policies we present are the first approaches designed for the problem of incrementally acquiring feature values for inducing a general classifier when costs are specified for individual entries of F.

An early and highly influential stream of research pertains to the classic multiarmed bandit problem introduced by Robbins (1952). McCardle (1985) applies similar reasoning. The objective in the multiarmed bandit problem and the technology-adoption decision problem is the estimation of a single parameter or a set of independent parameters, e.g., the probability of success. Other researchers also have addressed information acquisition for decision making. Moore and Whinston (1986, 1987) develop a theoretical decision-making model for a decision maker who can acquire information at a cost to reduce the uncertainty associated with a given decision. Feature-value acquisition for case-retrieval and inference algorithms was treated by Mookerjee and Mannino (1997) and Mannino and Mookerjee (1999), who study sequential feature-value acquisition for test instances. They address a setting where all feature values of a test instance are missing but can be acquired sequentially during inference to minimize the overall acquisition costs. Mookerjee and Mannino also demonstrate that joining concept formation and retrieval strategy results in significantly lower acquisition cost during inference, compared to when the two phases are addressed independently. Differently from this prior work, we develop policies for acquiring information to improve predictive model induction.

The notion of information acquisition designed for predictive model induction has been addressed by several prior lines of work. Three decades ago, authors identified the significance of the interdependence between induction, as a search in the space of all possible concepts/models, and the selection of the training data used to direct the search.


Simon and Lea (1974) describe conceptually how induction involves simultaneous search of two spaces—the results of searching the model space can affect how training data will be sampled. Techniques from optimal experimental design (Federov 1972) and from active learning (Cohn et al. 1994, Freund et al. 1997) assume class labels are unknown; thus, in active learning complete instances are acquired to enhance learning. A problem related to active learning was addressed by Zheng and Padmanabhan (2002, 2006), where instances with incomplete feature values are not used for induction, and—similar to active learning—a policy is proposed to identify useful instances for which to acquire complete information. Melville et al. (2004) address the same problem but assume that incomplete instances can be used for induction, using some missing-value treatment. The approach we develop here for feature-value acquisition is inspired by the method proposed by Roy and McCallum (2001) for active learning. Roy and McCallum examine the expected improvement from acquiring class labels for naïve Bayes classifier induction. In our approach to AIA, presented in §4, we generalize the AFA problem to include the class variable as another feature to acquire, thus subsuming both AFA and traditional active learning.

Also closely related, Lizotte et al. (2003) study budgeted learning, where the amount to be spent toward feature-value acquisition is specified a priori. There are two main differences between AFA and the budgeted learning problem and policies proposed by Lizotte et al. First, the fundamental goals are different. On one hand, AFA seeks acquisitions that will give the best model for any intermediate investment in information acquisition; thus, the order of acquisitions is critical. Budgeted learning, on the other hand, seeks the best model under the assumption that the entire budget will be spent on acquisitions; thus, the order of acquisitions is largely unimportant (beyond the critical question of whether to acquire at all). The AFA setting is appealing because it allows reaching a performance goal on a fraction of the budget. The policies of Lizotte et al. (2003) also assume the induction of a naïve Bayes classifier, with its conditional independence assumption, and equal costs for acquiring the values of a feature regardless of the instance. These assumptions combine to enable queries of the form, "Acquire the value of feature j for any instance in class k." In the AFA setting, the applicability to a general classifier (not just naïve Bayes) as well as the potentially variable cost for entries in the matrix F requires the ability to assign different values to the acquisition of a given feature for different instances. We experiment with and discuss the implications of different cost structures in §3.3. The main implication of the conditional independence assumption employed by Lizotte et al. (2003) is that it substantially limits the number of queries to be considered by the policy. In principle, the formulation of the policies proposed by Lizotte et al. (2003) can be extended to consider unique benefits from each feature-value acquisition (as in Veeramachaneni et al. 2006); however, the computational complexity of that implementation may be prohibitive, particularly that of single-feature lookahead with a large lookahead. In §§2.3 and 4 we address means to reduce the consideration set of queries for the policies we propose. Williams et al. (2005) also address a similar problem; however, differently from the work above, they construct an acquisition policy designed specifically for a logistic regression model, assume that the modeling performance and acquisition costs must be given in the same units, and assume that the training data also define the complete set of instances for which prediction will be required.

such as CS-ID3 (Tan and Schlimmer 1990), alsoattempts to minimize the cost of acquiring featuresduring training; however, that work only consid-ers acquiring information for the current traininginstance. The LAC∗ algorithm (Greiner et al. 2002)acquires a random sample of complete instances inrepeated exploration phases that are intermixed withexploitation phases, using the current classifier to clas-sify instances economically.Work on feature (variable/model) selection (John

et al. 1994) assumes known feature values and selectsa subset of features to use for model induction. Dur-ing feature selection, the known feature values areused to estimate the relative contribution from includ-ing the feature values for all instances. In principle,feature selection procedures could be complementaryto an AFA policy, as discussed in §5.

7. Conclusions
We presented a general approach to active feature-value acquisition that acquires feature values based on the estimated expected improvement in model performance per unit cost. We also proposed a specific measure of the utility of a prospective acquisition that captures the benefit from the acquisition given the dynamics of modeling with incrementally changing training data, and a method for estimating the distributions of possible query results in the presence of missing values. We showed how this computationally intensive policy can be made faster and remain effective by constraining the search to a sample of potential feature-value acquisitions. The resulting technique, SEU, consistently yields better models per unit acquisition cost, compared with uniform query sampling. The technique's advantage is particularly apparent when feature values have varying information value and incur different costs.


We also studied SEU's component measures compared with alternatives and with omniscient oracles, which know exactly the quantities being estimated. These studies reveal that SEU's measure of prospective feature-value distributions is largely effective and that the greatest improvement in performance by the oracles is obtained when they know the exact impact of the different values that may be acquired. A sensitivity analysis of the method's parameters produced intuitive results—SEU performs better when the consideration set of prospective acquisitions is larger and when the number of values acquired at each phase is smaller.

Finally, we showed how the SEU framework for feature-value acquisition can be applied effectively to the more general information-acquisition problem in which missing (training) class labels and feature values both may be acquired at a cost. SEU is able to alternate between acquisitions of class labels and of feature values based on their relative contributions to learning per unit cost. In this general setting SEU produces better models for a given cost than a uniform policy or than policies that acquire only feature values or only class labels.

As we discussed in the introduction, the availability of a generic, effective, and computationally efficient policy for information acquisition offers opportunities to change the manner in which firms—which rely on consumer feedback to generate valuable business intelligence—interact with consumers to enhance data-driven intelligence cost-effectively. It also presents opportunities to transform business practices where information is acquired regularly in bundles: intelligent acquisition policies will allow firms to selectively identify only the most cost-effective values to acquire, from potentially different sources, to improve induction. Furthermore, such policies are likely to render information acquisition from third-party providers feasible for small businesses with limited budgets.

The 1960 Siegel and Fouraker quote with which we started the paper is as relevant as ever. The meaning of "research purposes" has changed dramatically in half a century, especially with the million-fold improvement in computing power per unit cost. In this paper we have shown new ways to bring this tremendous computing power to bear to evaluate the value of information to improve decision making.

Acknowledgments
The authors thank the anonymous reviewers for their substantive comments. The authors are especially grateful to the associate editor for excellent suggestions, and they thank Duy Vu for his generous help in enhancing the code. They also thank Saurabh Baji, Hasrat Godil, Prateek Gupta, and Priyank Mishra for their help in code development. The third author gratefully acknowledges the support of an NEC Faculty Fellowship.

References
Blake, C. L., C. J. Merz. 1998. UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn/MLRepository.html.
Cohn, D., L. Atlas, R. Ladner. 1994. Improving generalization with active learning. Machine Learn. 15(2) 201–221.
Fayyad, U. M., K. B. Irani. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. R. Bajcsy, ed. Proc. 13th Internat. Joint Conf. Artificial Intelligence, Morgan Kaufmann, San Francisco, 1022–1027.
Federov, V. 1972. Theory of Optimal Experiments. Academic Press, New York.
Freund, Y., H. S. Seung, E. Shamir, N. Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learn. 28(2/3) 133–168.
Friedman, J. H. 1997. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining Knowledge Discovery 1(1) 55–77.
Greiner, R., A. Grove, D. Roth. 2002. Learning cost-sensitive active classifiers. Artificial Intelligence 139(2) 137–174.
Hevner, A., S. March, J. Park, S. Ram. 2004. Design science in information systems research. MIS Quart. 28(1) 75–105.
Huang, Z. 2007. Selectively acquiring ratings for product recommendation. Proc. 9th Internat. Conf. Electronic Commerce (ICEC '07), ACM, New York, 379–388.
Jensen, D. D., P. R. Cohen. 2000. Multiple comparisons in induction algorithms. Machine Learn. 38(3) 309–338.
John, G. H., R. Kohavi, K. Pfleger. 1994. Irrelevant features and the subset selection problem. W. W. Cohen, H. Hirsh, eds. Proc. 11th Internat. Conf. Machine Learn. (ICML-94), Morgan Kaufmann, San Francisco, 121–129.
Langley, P. 2000. Crafting papers on machine learning. Proc. 17th Internat. Conf. Machine Learn. (ICML-2000), Morgan Kaufmann, San Francisco, 1207–1216.
Lizotte, D., O. Madani, R. Greiner. 2003. Budgeted learning of Naive-Bayes classifiers. C. Meek, U. Kjaerulff, eds. Proc. 19th Conf. Uncertainty in Artificial Intelligence (UAI-2003), Morgan Kaufmann, San Francisco.
Mannino, M. V., V. S. Mookerjee. 1999. Optimizing expert systems: Heuristics for efficiently generating low cost information acquisition strategies. INFORMS J. Comput. 11(3) 278–291.
McCardle, K. 1985. Information acquisition and the adoption of new technology. Management Sci. 31(11) 1372–1389.
Melville, P., M. Saar-Tsechansky, F. Provost, R. Mooney. 2004. Active feature-value acquisition for classifier induction. Proc. 4th IEEE Internat. Conf. Data Mining (ICDM-04), IEEE Computer Society, Washington, D.C.
Melville, P., M. Saar-Tsechansky, F. Provost, R. Mooney. 2005. An expected utility approach to active feature-value acquisition. Proc. 5th IEEE Internat. Conf. Data Mining (ICDM-05), IEEE Computer Society, Washington, D.C., 745–748.
Mookerjee, V. S., M. V. Mannino. 1997. Sequential decision models for expert system optimization. IEEE Trans. Knowledge and Data Engrg. 9(5) 675–687.
Moore, J. C., A. B. Whinston. 1986. A model of decision-making with sequential information-acquisition (part 1). Decision Support Systems 2(4) 285–307.
Moore, J. C., A. B. Whinston. 1987. A model of decision-making with sequential information-acquisition (part 2). Decision Support Systems 3(1) 47–72.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.


Robbins, H. 1952. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 55 527–535.
Roy, N., A. McCallum. 2001. Toward optimal active learning through sampling estimation of error reduction. C. E. Brodley, A. P. Danyluk, eds. Proc. 18th Internat. Conf. Machine Learn. (ICML-2001), Morgan Kaufmann, San Francisco, 441–448.
Saar-Tsechansky, M., F. Provost. 2004. Active sampling for class probability estimation and ranking. Machine Learn. 54 153–178.
Saar-Tsechansky, M., F. Provost. 2007. Handling missing values when applying classification models. J. Machine Learn. Res. 8 1623–1657.
Siegel, S., L. E. Fouraker. 1960. Bargaining and Group Decision-Making: Experiments in Bilateral Monopoly. McGraw-Hill, New York.
Simon, H. A., G. Lea. 1974. Problem solving and rule induction: A unified view. L. W. Gregg, ed. Knowledge and Cognition, Chap. 5. Erlbaum, Potomac, MD.
Tan, M., J. C. Schlimmer. 1990. Two case studies in cost-sensitive concept acquisition. T. Dietterich, W. Swartout, eds. Proc. 8th National Conf. Artificial Intelligence (AAAI-90), Association for the Advancement of Artificial Intelligence, Menlo Park, CA, 854–860.
Turney, P. D. 2000. Types of cost in inductive concept learning. T. Dietterich, D. Margineantu, F. Provost, P. Turney, eds. Proc. Workshop on Cost-Sensitive Learn., 17th Internat. Conf. Machine Learn., Stanford, CA, 15–21.
Vapnik, V. N. 1998. Statistical Learning Theory. John Wiley & Sons, New York.
Veeramachaneni, S., E. Olivetti, P. Avesani. 2006. Active sampling for detecting irrelevant features. W. W. Cohen, A. Moore, eds. Proc. 23rd Internat. Conf. Machine Learn. (ICML-2006), Pittsburgh, 961–968.
Williams, D., X. Liao, L. Carin. 2005. Active data acquisition with incomplete data. Technical report, Department of Electrical and Computer Engineering, Duke University, Durham, NC.
Witten, I. H., E. Frank. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.
Zheng, Z., B. Padmanabhan. 2002. On active learning for data acquisition. N. Zhong, P. S. Yu, eds. Proc. IEEE Internat. Conf. Data Mining (ICDM-2002), IEEE Computer Society, Washington, D.C., 562–569.
Zheng, Z., B. Padmanabhan. 2006. Selectively acquiring customer information: A new data acquisition problem and an active learning based solution. Management Sci. 52(5) 697–712.