
Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data

Sergey Redyuk*  Sebastian Schelter†  Tammo Rukat**  Volker Markl*  Felix Biessmann‡

{sergey.redyuk,volker.markl}@tu-berlin.de  [email protected]  {tammorukat,felix.biessmann}@gmail.com
* Technische Universität Berlin  † New York University  ** University of Oxford  ‡ Beuth University Berlin

ABSTRACT
When end users apply a machine learning (ML) model on new unlabeled data, it is difficult for them to decide whether they can trust its predictions. Errors or shifts in the target data can lead to hard-to-detect drops in the predictive quality of the model. We therefore propose an approach to assist non-ML experts working with pretrained ML models. Our approach estimates the change in prediction performance of a model on unseen target data. It does not require explicit distributional assumptions on the dataset shift between the training and target data. Instead, a domain expert can declaratively specify typical cases of dataset shift that she expects to observe in real-world data. Based on this information, we learn a performance predictor for pretrained black box models, which can be combined with the model and automatically warns end users in case of unexpected performance drops. We demonstrate the effectiveness of our approach on two models – logistic regression and a neural network – applied to several real-world datasets.

ACM Reference Format:
Sergey Redyuk, Sebastian Schelter, Felix Biessmann, Volker Markl. 2019. Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data. In Workshop on Human-In-the-Loop Data Analytics (HILDA'19), July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3328519.3329126

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
HILDA'19, July 5, 2019, Amsterdam, Netherlands
© 2019 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery.
ACM ISBN 978-1-4503-6791-2/19/07...$15.00
https://doi.org/10.1145/3328519.3329126

1 INTRODUCTION
New challenges emerge with the advent of machine learning (ML) as a central component in modern information infrastructures [4]. Problems in this space originate, for example, from developers having to handle unexpected shifts in the data that is fed into systems powered by machine learning. Consider the following exemplary scenario: A consultant with ML expertise creates an ML model for predicting the sales numbers of competitor products, and provides the final trained model to an engineering team from an e-commerce company. Imagine that some time later, this engineering team accidentally introduces errors in the web crawling code that fetches new target data for the model, and the error changes the scale of certain numeric attributes. The model will still output numeric predictions for unseen competitor products, but the guarantees about the reliability of these predictions are lost due to the unexpected shift in the data. Unfortunately, it is very difficult for the engineers to validate the predictions, as there is no automated way to retrieve the ground truth (the future sales numbers of the competitor in this scenario).
While ML experts have the specialized knowledge to debug models and predictions in such cases, there is a lack of automated, declarative tools for non-ML experts that help them decide whether they can trust the predictions of an ML model on new data.

We therefore propose an approach where domain experts and end users only need to declaratively specify types of potential shifts in the target data of their ML models. Based on these, we describe how to learn a performance predictor for an existing ML model, which can be shipped together with the model (Section 2). This predictor empowers end users, as it can automatically warn them in case of potential drops in the predictive quality of the model on unseen data. Our method is easy to implement and provides several advantages over existing work (Section 3) in this area: (i) it does not require explicit distributional assumptions on the dataset shift; (ii) our approach works with general black box models, which consume raw input data (e.g., relational tuples or unstructured data, such as text or images) and hide model details and internal feature representations. Examples of such black box models are scikit-learn pipelines, SparkML pipelines, or hosted models in cloud ML services.
The contributions of this paper are the following:

• We present an approach to automatically assess changes in the prediction scores of black box ML models on unseen, unlabeled, and potentially erroneous target data (Section 2).
• We validate our approach on various datasets, shifts, and models, and find that it predicts the true model score with an MAE of 0.01 in the majority of cases (Section 4).

2 APPROACH
Problem statement. We focus on generic black box ML models in the form of a classification function f : X → R^c, which takes an example x ∈ X and returns a prediction y ∈ R^c for the c classes. Here, x refers to raw input data for the black box model, e.g., a row from a relational table or dataframe, which can contain numerical, ordinal, or categorical variables, as well as unstructured data such as text and images. The black box model f will internally apply transformations to convert the raw input data to numerical features. The predictions y ∈ R^c are numeric vectors, and each dimension of y corresponds to the probability of one of the c classes. We refer to the data that is available for the ML model f at training time as the source dataset D_source of labeled examples (X, y), where y denotes the ground truth labels for the examples X from the c classes. During training, an ML expert tunes all parameters and hyperparameters of the black box model f to maximize a score ℓ = L(f(X), y), such as accuracy, precision, or area under the ROC curve (AUC).
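To make this setup concrete, the following minimal sketch (not from the paper) shows what such a black box model f might look like in practice: a scikit-learn pipeline over raw tuples whose featurization stays hidden inside the pipeline, together with the computation of a score ℓ = L(f(X), y). The column names and data are illustrative assumptions.

```python
# Hypothetical sketch of a "black box" classifier f that consumes raw tuples
# and hides its feature transformations; column names and data are made up.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({"age": [25, 38, 52, 46, 31, 60],
                  "hours_per_week": [40, 50, 38, 45, 20, 60],
                  "occupation": ["sales", "tech", "admin", "tech", "sales", "admin"]})
y = [0, 1, 0, 1, 0, 1]  # ground truth labels

f = Pipeline([
    ("featurize", ColumnTransformer([
        ("num", StandardScaler(), ["age", "hours_per_week"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["occupation"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
f.fit(X, y)

score = accuracy_score(y, f.predict(X))  # the score ℓ = L(f(X), y)
```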

We consider a typical use case, where an end user (who is not an ML expert) applies the trained black box model to another dataset D_target, referred to as the target dataset. This data is not available at training time and has no labels available at prediction time. Most ML models rely on the i.i.d. assumption, meaning that both the source and target data are sampled independently from the same stationary distribution. In real-world scenarios, however, D_target could be corrupted (and thereby violate this assumption) due to dataset shift, e.g., caused by errors induced by the data preparation code, by a distribution shift due to changes in the underlying real-world data generating process, or by adversarial attacks. In such cases, there is no guarantee about the reliability of the predictions. If the trained model is applied to target data that underwent a dataset shift, the resulting predictions can be arbitrarily wrong in principle. Moreover, the predictive performance of the model, measured by the score ℓ_target = L(f(X_target), y_target), could only be computed if we had access to the true labels y_target, which are unknown for D_target.

Algorithm 1 Training the performance predictor h.
Input: (D_train, D_test) disjoint partitions of D_source,
       black box model f pretrained on D_train
 1: (X_test, y_test) ← D_test
 2: ℓ_test ← L(f(X_test), y_test)
 3: M ← ∅
 4: for shift generator e ∈ E do
 5:   X_corrupt ← corrupt X_test ∈ D_test with e
 6:   Y_corrupt ← f(X_corrupt)
 7:   ℓ_corrupt ← L(Y_corrupt, y_test)
 8:   ϕ_corrupt ← percentiles(Y_corrupt)
 9:   M ← M ∪ {(ϕ_corrupt, ℓ_corrupt)}
10: end for
11: h ← train a regression model on M
12: return (h, ℓ_test)
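A minimal Python sketch of Algorithm 1 could look as follows, assuming a scikit-learn-style black box model f with predict and predict_proba methods, accuracy as the score L, and a RandomForestRegressor as the regression model (as used in Section 4). The helper names, the example Gaussian-noise shift generator, and the concrete percentiles are illustrative choices, not a prescribed implementation.

```python
# Sketch of Algorithm 1: training a performance predictor h for a black box model f.
# Assumes f exposes predict and predict_proba, that (X_test, y_test) is a held-out,
# labeled partition of the source data, and that accuracy serves as the score L.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score

def percentile_features(Y_scores, percentiles=(1, 5, 25, 50, 75, 95, 99)):
    # Univariate non-parametric summary of each output dimension of f.
    return np.percentile(Y_scores, percentiles, axis=0).flatten()

def gaussian_noise_shift(X, fraction, numeric_columns, rng, scale=3.0):
    # Example shift generator e: add Gaussian noise to a random fraction of rows
    # in the given numeric columns (loosely mirroring the shift from Section 4).
    X = X.copy()
    rows = rng.random(len(X)) < fraction
    for col in numeric_columns:
        X.loc[rows, col] += rng.normal(0.0, scale * X[col].std(), rows.sum())
    return X

def train_performance_predictor(f, X_test, y_test, shift_generators, n_runs=100, seed=0):
    rng = np.random.default_rng(seed)
    l_test = accuracy_score(y_test, f.predict(X_test))  # score on clean held-out data
    features, scores = [], []                           # the training set M
    for corrupt in shift_generators:                    # for each shift generator e
        for _ in range(n_runs):
            X_corrupt = corrupt(X_test, rng)            # randomized corruption of X_test
            Y_corrupt = f.predict_proba(X_corrupt)      # black box model outputs
            l_corrupt = accuracy_score(y_test, f.predict(X_corrupt))
            features.append(percentile_features(Y_corrupt))
            scores.append(l_corrupt)
    h = RandomForestRegressor(n_estimators=100).fit(features, scores)
    return h, l_test
```

A domain expert could then declaratively specify the expected corruptions as a list of such callables, e.g. shift_generators = [lambda X, rng: gaussian_noise_shift(X, 0.25, ["age"], rng)]; the column name and fraction are again hypothetical.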

The problem that we address in this work is to compute an estimate ℓ̂_target of the score ℓ_target which the black box model achieves on the target examples X_target, without access to the corresponding ground truth labels y_target. If the estimated score drops, the model should not be used, and an ML expert should be consulted to decide what to do next.

Our approach is based on the idea of (i) simulating dataset shifts, specified by a domain expert, which could occur in the target data of a trained model, and (ii) subsequently learning a model-specific performance predictor (in the form of a regression model) to estimate the impact of these shifts on the black box model's performance. We simulate dataset shifts on a held-out partition of the source dataset for which we know the ground truth labels. If the user-specified shifts capture the dataset shifts encountered in practice well, then the outputs of our predictor will be a reliable estimate of the performance of the black box model on the new data at hand, even though we do not know the true labels.
Training the performance predictor. The steps for training a performance predictor h for a particular black box model are shown in Algorithm 1. We assume that we are provided with a black box model f which has been trained on the source data. Furthermore, a domain expert specifies a set of shift generators E that produce certain types of randomized dataset shifts. We apply each shift generator e to the held-out data D_test from the training process, and thereby generate corrupted examples X_corrupt in line 5. We compute the actual prediction score ℓ_corrupt by comparing the model predictions Y_corrupt with the true labels y_test in line 7. Existing work suggests that the univariate distributions of the output dimensions of the matrix Y = f(X) are predictive of dataset shifts [2, 3]. We build on these findings and leverage a univariate non-parametric estimate of the distribution of each output dimension of f.
More concretely, we compute several percentiles ϕ_corrupt of the black box model outputs Y_corrupt = f(X_corrupt) as features for the performance predictor h. We record both the features ϕ_corrupt and the score ℓ_corrupt as training data for our regression model in the set M (lines 3-10). Finally, we train our performance predictor h on these examples in line 11 to predict the score ℓ_corrupt from the features ϕ_corrupt.
Performance prediction on unseen target data. Algorithm 2 shows how to apply our performance predictor h to unseen (and unlabeled) target data X_target. First, we compute the required features ϕ_target in the form of percentiles of the outputs Y_target which the model f assigns to the examples X_target. Next, we have our performance predictor h estimate the score ℓ̂_target = h(ϕ_target) of f on the target data, and compare this estimate to the expected test score ℓ_test, which we computed during predictor training. If the relative difference exceeds a user-defined threshold t (e.g., 5%), we issue a warning to the end user.

Algorithm 2 Prediction validation for unlabeled target data.
Input: target data X_target, black box model f, test score ℓ_test,
       performance predictor h, threshold t
 1: Y_target ← f(X_target)
 2: ϕ_target ← percentiles(Y_target)
 3: ℓ̂_target ← h(ϕ_target)
 4: return (|ℓ_test − ℓ̂_target| / ℓ_test) ≤ t
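A corresponding minimal sketch of Algorithm 2, reusing the hypothetical percentile_features helper from the Algorithm 1 sketch above:

```python
# Sketch of Algorithm 2: validating predictions on unlabeled target data X_target.
# Reuses the hypothetical percentile_features helper; the threshold t defaults to 5%.
def validate_predictions(f, X_target, h, l_test, t=0.05):
    Y_target = f.predict_proba(X_target)            # 1: black box model outputs
    phi_target = percentile_features(Y_target)      # 2: percentile features
    l_target_hat = h.predict([phi_target])[0]       # 3: estimated score on target data
    ok = abs(l_test - l_target_hat) / l_test <= t   # 4: relative deviation within threshold?
    if not ok:
        print(f"Warning: estimated score {l_target_hat:.3f} deviates from the "
              f"expected test score {l_test:.3f} by more than {t:.0%}.")
    return ok, l_target_hat
```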

3 RELATED WORK
Applying data management techniques in the ML space is a field of growing interest in recent years [4]. Several solutions have been proposed for validating ML models and their predictions. Most of these originate from either a statistical ML or a data management perspective. The former require assumptions on the nature of the dataset shift [2, 6], which often seem inapt to describe practically relevant data changes for engineers and domain experts, and the proposed methods often limit themselves to adapting a particular model or learning paradigm. Approaches from the data management community [1, 5] often require manual tuning, detailed knowledge of model internals and features, or do not directly address the problem of validating model predictions. For instance, Google's TFX platform [1] offers skew detection capabilities for featurized input data, but forces the end user to manually select an appropriate distance function and threshold for particular features. In short, these methods are not applicable in scenarios where we intend to handle pretrained black box models that consume raw input data.

4 EVALUATION
Datasets and models. We experiment on three publicly available datasets from different domains. The income¹ dataset contains 48,842 records about adult income data, and the target variable denotes whether a person earns more than 50K dollars per year. The heart² dataset contains 70,001 records about cardiovascular diseases, and the target variable denotes the presence of a heart disease. The tweets³ dataset comprises 20,002 tweets, and the target variable denotes whether or not a tweet has trolling character (e.g., is intended to insult someone). We leverage two popular classification approaches as black box models for our experiments: (i) we train a logistic regression model from scikit-learn (referred to as lr), using five-fold cross-validation with a grid search over regularization type and constant; (ii) we train a feed-forward neural network from tensorflow (referred to as dnn), comprised of two hidden layers with ReLU activation and a softmax output. We again use five-fold cross-validation and grid search over the size of the hidden layers.
Setup. For every experimental run, we randomly partition a dataset into one partition with 90% of the records designated as source data and another disjoint partition with 10% of the records designated as target data. Next, we train one of our black box models on the source data, as described in Section 2. We implement our performance prediction approach in Python with a RandomForestRegressor, and train a predictor for each black box model according to Algorithm 1. Afterwards, we apply shift generators to the unseen target data and create randomly corrupted versions of that data. We then have the performance predictor estimate the score on the unseen corrupted data, and compare this estimate to the actual score using the mean absolute error (MAE), which we can compute in this virtual setup because we have access to the true labels. We run experiments to predict both accuracy and AUC; for experimental runs measuring accuracy, we re-sample the data to have balanced classes to make the scores easier to interpret.
Types of errors and dataset shifts. We experiment with four different types of errors that we introduce into the unseen target data (illustrative sketches of such shift generators are given at the end of this section). For each type of error, we subsequently corrupt 0%, 5%, 25%, 50%, 75% and 99% of the records in the dataset, and repeat each configuration 100 times: (i) Missing values in categorical attributes – Figure 1(a): In our first experiment, we choose a random set of categorical columns, and introduce missing values at random into these columns.
¹ https://archive.ics.uci.edu/ml/datasets/adult
² https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
³ https://dataturks.com/projects/abhishek.narayanan/Dataset%20for%20Detection%20of%20Cyber-Trolls/

[Figure 1: Score estimation for various shifts, models, and datasets. The x-axis denotes the true model scores ℓ and the y-axis shows the corresponding absolute errors ℓ − ℓ̂ of the estimate ℓ̂ given by our performance predictor. Each plot group shows accuracy prediction in blue on the left and AUC prediction in magenta on the right. Panels: (a) score estimation in presence of missing values in categorical attributes; (b) score estimation in presence of Gaussian shift in numeric attributes; (c) score estimation in presence of swapped attributes; (d) score estimation in presence of adversarial trolling tweets. Reported per-panel MAE values range from 0.0043 to 0.0219 across the income, heart, and tweets datasets for the lr and dnn models.]

(ii) Gaussian shift in numeric attributes – Figure 1(b): We randomly choose a set of numeric columns, and corrupt a fraction of their values by adding Gaussian noise centered at the data point, with a standard deviation randomly scaled by a factor from the interval of 2 to 5. (iii) Swapped column values – Figure 1(c): We choose different pairs of categorical and numerical columns, and randomly swap the contained values between the columns. (iv) Adversarial attack – Figure 1(d): This experiment leverages the tweets dataset exclusively and simulates an adversarial attack. We assume that the authors of the trolling tweets try to fool our classifier by changing the spelling of their tweets, e.g., by converting the string "hello world" into "h3110 w041d".
Results. Figure 1 shows the results of the corresponding experiments. We report the accuracy and AUC scores that our approach estimated for our two black box models – logistic regression and the neural network – on the datasets and the introduced shifts. Our approach predicts the true performance score within an MAE of 0.01 in the majority of cases. In three cases (logistic regression on the income data with swapped columns, the neural network on the income data with missing values, and logistic regression on the adversarial attack in the tweets dataset), our approach underpredicts the accuracy of the black box model and only achieves an MAE of 0.02. In general, our predictions are close to the true performance score on the unseen data. We thereby enable end users to distinguish catastrophic model failures (such as cases with a low true score, where the model tends toward random guessing) from cases where the performance is only slightly affected by the shifts in the data.
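For illustration, shift generators of the kinds used above could be sketched as follows, operating on a pandas DataFrame of raw tuples; the column arguments, corruption fractions, and the character map are hypothetical examples, not the paper's exact implementation.

```python
# Illustrative sketches of shift generators like those described in Section 4.
import numpy as np

def missing_values_shift(X, fraction, categorical_columns, rng):
    # (i) Introduce missing values at random into a set of categorical columns.
    X = X.copy()
    rows = rng.random(len(X)) < fraction
    X.loc[rows, categorical_columns] = np.nan
    return X

def swapped_columns_shift(X, fraction, column_a, column_b, rng):
    # (iii) Swap the values of two columns for a random fraction of the rows.
    X = X.copy()
    rows = rng.random(len(X)) < fraction
    X.loc[rows, [column_a, column_b]] = X.loc[rows, [column_b, column_a]].values
    return X

LEET = str.maketrans({"e": "3", "l": "1", "o": "0", "r": "4"})

def trolling_spelling_shift(X, fraction, text_column, rng):
    # (iv) Simulate an adversarial spelling attack, e.g. "hello world" -> "h3110 w041d".
    X = X.copy()
    rows = rng.random(len(X)) < fraction
    X.loc[rows, text_column] = X.loc[rows, text_column].str.translate(LEET)
    return X
```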

5 FUTURE WORK
In future work, we plan to extend our approach to give more elaborate feedback to end users, empowering them to fix the data errors and arrive at reliable model predictions. We will furthermore investigate how well our method performs when combinations of different shifts and errors are present in the target data, and how well our predictor works in cases where it has not been trained on the correct type of shift.
This work was supported by the Moore-Sloan Data Science Environment at New York University, as well as the BZML Berlin, ECDF, and the HEIBRiDS Graduate School Berlin.

REFERENCES
[1] Denis Baylor, Eric Breck, Heng-Tze Cheng, and Martin Zinkevich. 2017. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. KDD (2017), 1387–1395.
[2] Zachary C. Lipton, Yu-Xiang Wang, and Alex Smola. 2018. Detecting and Correcting for Label Shift with Black Box Predictors. arXiv preprint arXiv:1802.03916 (2018).
[3] Stephan Rabanser, Stephan Günnemann, and Zachary C. Lipton. 2018. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. arXiv preprint arXiv:1810.11953 (2018).
[4] Sebastian Schelter, Felix Biessmann, Tim Januschowski, David Salinas, Stephan Seufert, and Gyuri Szarvas. 2018. On Challenges in Machine Learning Model Management. IEEE Data Engineering Bulletin (12 2018).
[5] Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating Large-Scale Data Quality Verification. PVLDB 11, 12 (2018), 1781–1794.
[6] Masashi Sugiyama, Neil D. Lawrence, Anton Schwaighofer, et al. 2017. Dataset Shift in Machine Learning. The MIT Press.