
The Bouncer Problem: Challenges to Remote Explainability

Erwan Le Merrer, Univ Rennes, Inria, CNRS, Irisa
[email protected]

Gilles Trédan, LAAS/CNRS
[email protected]

Abstract

The concept of explainability is envisioned to satisfy society's demands for transparency on machine learning decisions. The concept is simple: like humans, algorithms should explain the rationale behind their decisions so that their fairness can be assessed.

While this approach is promising in a local context (e.g., to explain a model during debugging at training time), we argue that this reasoning cannot simply be transposed to a remote context, where a model trained by a service provider is only accessible through its API. This is problematic, as it is precisely the target use case requiring transparency from a societal perspective.

Through an analogy with a club bouncer (who may provide untruthful explanations upon customer rejection), we show that providing explanations cannot prevent a remote service from lying about the true reasons leading to its decisions. More precisely, we prove the impossibility of remote explainability for single explanations, by constructing an attack on explanations that hides discriminatory features from the querying user.

We provide an example implementation of this attack. We then show that the probability that an observer spots the attack, using several explanations in an attempt to find incoherences, is low in practical settings. This undermines the very concept of remote explainability in general.

1 Introduction

Modern decision-making driven by black-box systems now impacts a significant share of our lives [9, 29]. Those systems build on user data, and range from recommenders [21] (e.g., for personalized ranking of information on websites) to predictive algorithms (e.g., credit default) [29]. This widespread deployment, along with the opaque decision processes of those systems, raises concerns about transparency for the general public and for policy makers [12]. This has translated in some jurisdictions (e.g., the United States of America and Europe) into a so-called right to explanation [12, 26], which states that the output decisions of an algorithm must be motivated.

Explainability of in-house models   An already large body of work is interested in the explainability of implicit machine learning models (such as neural network models) [2, 13, 20]. Indeed, those models show state-of-the-art performance when it comes to task accuracy, but they are not designed to provide explanations (or at least intelligible decision processes) when one wants to obtain more than the output decision of the model. In the context of recommendation, the expression "post hoc explanation" has been coined [32]. In general, current techniques for the explainability of implicit models take trained in-house models and aim at shedding light on the input features causing salient decisions in their output space. LIME [25], for instance, builds a surrogate model of a given black-box system that approximates its predictions around a region of interest; the surrogate is an explainable model by construction (such as a decision tree), so that it can explain a decision facing some input data. The amount of queries to the black-box model is assumed to be unbounded by LIME and others [11], permitting virtually exhaustive queries to it. This is what makes them suitable for the inspection of in-house models by their designers.

The temptation to explain decisions to users.   The temptation for corporations to apply the same reasoning in order to explain some decisions to their users is high. Indeed, this would support the public's desire for a more transparent and trusted web. Facebook, for instance, attempted to offer a form of transparency for the ad mechanism targeting its users, by introducing a "Why am I seeing this" button on received ads. For a user, the decision system is then remote, and can be queried only using inputs (their profile data) and the observation of system decisions. Yet, from a security standpoint, the remote server executing the service is untrusted by the users. Andreou et al. [4] recently empirically observed, in the case of Facebook, that those explanations are "incomplete and can be misleading", conjecturing that malicious service providers can use this incompleteness to hide the true reasons behind their decisions.

In this paper, we question the possibility of such an explanation setup, from a corporate, private model towards users: we go one step further than the observations of Andreou et al. [4], by demonstrating that remote explainability simply cannot be a reliable guarantee of the absence of discriminative features in use. In a remote black-box setup such as Facebook's, we show that a simple attack, which we coin the Public Relations (PR) attack, undermines remote explainability.

The bouncer problem as a parallel for hardness   For the sake of the demonstration, we introduce the bouncer problem as an illustration of the difficulty for users to spot malicious explanations. The analogy works as follows: picture a bouncer at the door of a club, deciding who may enter. When he issues a negative decision (refusing entrance to a given person), he also provides an explanation for this rejection. However, his explanation might be malicious, in the sense that it does not present the true reasons for this person's rejection. Consider for instance a bouncer discriminating against people based on the color of their skin. Of course he will not tell people he refuses them entrance based on that characteristic, since this is an offence under the law. He will instead invent a biased explanation that the rejected person is likely to accept.

The classic way to assess discrimination by the bouncer is for associations to run tests (following the principle of statistical causality [23], for instance): several persons attempt to enter, while they only vary in their attitude or appearance on the possibly discriminating feature (e.g., the color of their skin). Conflicting decisions by the bouncer are then an indication of possible discrimination, and are amenable to building a case for prosecution.

We make the parallel with bouncer decisions in this paper by demonstrating that a user cannot trust a single (one-shot) explanation provided by a remote model, and that the only solution to spot inconsistencies is to issue multiple requests to the service. Unfortunately, we also demonstrate that the problem is hard, in the sense that spotting an inconsistency in such a way is intrinsically no more efficient than an exhaustive search on a locally available model to identify a problem, which is an intractable process.

Rationale and organization of the paper   We build a general setup for remote explainability in the next section, with the purpose of representing the actions of a service provider and of users, facing model decisions and explanations. The fundamental blocks for the impossibility proof of a reliable remote explainability, and its hardness for multiple queries, are presented in Section 2. We present the bouncer problem, which users have to solve in order to detect malicious explanations by the remote service provider, in Section 3. We then illustrate the PR attack, which the malicious provider may execute to hide discriminative features from the explanations served to users, on decision trees (Section 4.1). We then practically address the bouncer problem by modeling a user trying to find inconsistencies in a provider's decisions based on the German Credit dataset and a neural network classifier, in Section 4.2. We discuss open problems in Section 5, before reviewing related work in Section 6 and concluding in Section 7. Since we show that remote explainability in its current form is undermined, this work aims to be a motivation for researchers to explore the direction of provable explainability, by designing new protocols, for instance involving cryptographic means (e.g., as in proofs of ownership for remote storage), or to build collaborative observation systems to spot inconsistencies and malicious explanation systems.

2 Explainability of remote decisions

In this work, we study classifier models that issue decisions given user data. We first introduce the setup we operate in: it is intended to be as general as possible, so that the results drawn from it apply widely.

2.1 General Setup

We consider a classifier C : X ↦ Y that assigns inputs x of the feature space X to a class C(x) = y ∈ Y. Without loss of generality, and to simplify the presentation, we will assume the case of a binary classifier: Y = {0, 1}; the decision is thus the output label returned by the classifier.

Discriminative features and classifiers   To produce a decision, classifiers rely on features (variables) as an input. Those are for instance the variables associated with a user profile on a given service platform (e.g., basic demographics, political affiliation, purchase behavior, residential profile [4]). In our model, we consider that the feature space contains two types of features: discriminatory and legitimate features. The use of discriminatory features allows exhibiting the possibility of a malicious service provider issuing decisions and biased explanations. This problem is also referred to as rationalization in a paper by Aivodji et al. [3]. This setup of course does not preclude considering honest providers that do not leverage such features. While we leave open the precise definition of discriminatory features, we note that any particular feature can be considered as discriminatory in this work. In this model, the principal property of a discriminatory feature is that the service provider does not want to admit its use. Two main reasons come to mind:

• Legal: the jurisdiction's law forbids decisions based on a list of criteria¹ which are easily found in classifiers' input spaces. A service provider risks prosecution upon admitting the use of those.

For instance, features such as age, sex, employment, or foreigner status are considered discriminatory in the work [14], which looks into the German Credit dataset, linking bank customer features to whether or not a credit is granted.

¹For instance in the U.K.: https://www.gov.uk/discrimination-your-rights.


Figure 1: Illustration of our model: we consider binary classifiers that map the input domain X to labels Y = {0, 1}. Some dimensions of the input space are discriminative (Xd), which induces a partition of the classifier space: legitimate classifiers Cl do not rely on discriminative features to issue a label (in green), while the others (that is, Cd) can rely on any feature (in red).

• Strategic: the service provider wants to hide the use of some features on which its decisions are based. This could be to hide some business secret from competitors (because of the accuracy-fairness trade-off [17], for instance), to avoid "reward hacking" from users biasing this feature, or to avoid bad press.

Conversely, any feature that is not discriminatory is coined legitimate.

Formally, we partition the classifier input space X along these two types of features: legitimate features Xl, which the model can legitimately exploit to issue a decision, and discriminative features Xd (please refer to Figure 1). In other words, X = (Xl, Xd), and any input x ∈ X can be decomposed as a pair of legitimate and discriminatory features x = (xl, xd). We assume the input contains at least one legitimate feature: Xl ≠ ∅.

We also partition the classifier space accordingly: let Cl ⊂ C be the space of legitimate classifiers (among all classifiers C), which do not rely on any feature of Xd to issue a decision. More precisely, we consider that a classifier is legitimate if and only if arbitrarily changing any discriminative input feature never changes its decision:

C ∈ Cl ⇔ ∀xl ∈ Xl, ∀(xd, x′d) ∈ Xd², C((xl, xd)) = C((xl, x′d)).

Observe that any legitimate classifier Cl could therefore simply be defined over the input subspace Xl ⊂ X. As a slight notation abuse, to stress that the value of discriminative features does not matter in this legitimate context, we write C((xl, ∅)), or C(x ∈ Xl), for the decision produced regardless of any discriminative feature. It follows that the space of discriminative classifiers complements the space of legitimate classifiers: Cd = C \ Cl.
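To make this definition concrete, legitimacy can be probed by sampling, as in the following minimal Python sketch (an illustration only, assuming the classifier is available as a callable; as discussed later, exhaustive verification is generally intractable and a remote observer only sees decisions):

import random

def looks_legitimate(classifier, legit_inputs, discr_values, trials=1000):
    # Sampling-based probe of the legitimacy definition: a classifier is
    # legitimate iff swapping the discriminative part xd of an input never
    # changes its decision. Returning False is a proof of non-legitimacy;
    # True only means that no violation was found among the sampled trials.
    for _ in range(trials):
        xl = random.choice(legit_inputs)                # a legitimate feature vector
        xd, xd_prime = random.sample(discr_values, 2)   # two discriminative values
        if classifier((xl, xd)) != classifier((xl, xd_prime)):
            return False
    return True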

We can now reframe the main research question we address: given a set of discriminative features Xd and a classifier C, can we decide whether C ∈ Cd in the remote black-box interaction model?

The black-box remote interaction model   We question the black-box remote interaction model (see e.g. [27]), where the classifier is exposed to users through a remote API. In other words, users can only query the classifier model with an input and obtain a label as an answer (e.g., 0 or 1). In this remote setup, users thus cannot collect any specific information about the internals of the classifier model, such as its architecture, its weights, or its training data. This corresponds to a security threat model where two parties interact with each other (the user and the remote service), and where the remote model is implemented on a server, belonging to the service operator, that is untrusted by the user.

2.2 Requirements for Remote Explainability

Explainability is often presented as a solution to increase the acceptance of AI [2], and to potentially prevent discriminative AI behaviour. Let us expose the logic behind this connection.

A simple definition of an explanation   First, we need to define what an explanation is, to go beyond Miller's definition as an "answer to a why-question" [18]. Since the topic of explainability is becoming a hot research field with (to the best of our knowledge) no consensus on a more technical definition of an explanation, we will propose, for the sake of our demonstration, that an explanation is coherent with respect to the standard modus ponens (⇒) rule. For instance, if explanation a explains decision b, it means that in context a, the decision produced will necessarily be b.

In this light, we directly observe the beneficial effect of such explanations in our parallel with club bouncing: while refusing someone, the bouncer may provide them with the reasons for that rejection; the person can then correct their behaviour in order to be accepted on the next attempt.

This property is also enough to prove non-discrimination: if a does not involve discriminating arguments (which can be locally checked by the user, as a is a sentence), and a ⇒ b, then decision b is not discriminative in case a. On the contrary, if a does involve discriminating arguments, then decision b is taken on a discriminative basis, and is therefore a discriminative decision. In other words, this property of an explanation is enough to reveal discrimination.

To sum up, any explanation framework that behaves "logically" (i.e., fits the modus ponens truth table), which is in our view a rather mild assumption, is enough to establish the discriminative basis of a decision. We believe this is the rationale of the statement "transparency can improve users' trust in AI systems". In the following, we include such explanations in our interaction model.

Requirements on the user side for checking explanations   A user that queries a classifier C with an input x gets two elements: the decision (inferred class) y = C(x) and an explanation a such that a explains y. Formally, upon request x, a user collects expC(x) = (y, a).

We assume that such a user can locally check that a is apropos: that a corresponds to their input x. Formally, we write a ∈ A(x). This allows us to formally write a non-discriminatory explanation as a ∈ A(xl). This forbids lying by explaining an input that is different from x.

We also assume that the user can locally check that the explanation is consequent: the user can check that a is compatible with y. This forbids crafting explanations that are incoherent w.r.t. the decision (like a bouncer who would explain why you can enter while leaving the door locked).

To produce such explanations, we assume the existence of an explanation framework expC producing explanations for classifier C (this could for instance be the LIME framework [25]). The explanation a explaining decision y in context x by classifier C is written a = expC(y, x).
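As an illustration, such a framework expC could be instantiated with the LIME tabular explainer; the sketch below is only an assumption about how it would be wired (it relies on the third-party lime package and a model exposing a predict_proba-like function), not a prescription of this paper:

import numpy as np
from lime.lime_tabular import LimeTabularExplainer  # third-party package, assumed installed

def make_explainer(X_train, feature_names):
    # Background data is needed by LIME to perturb inputs around x.
    return LimeTabularExplainer(X_train, feature_names=feature_names,
                                class_names=["0", "1"], mode="classification")

def exp_C(explainer, predict_proba, x, num_features=5):
    # Returns (y, a): the decision and an explanation a for it.
    y = int(np.argmax(predict_proba(x.reshape(1, -1))[0]))
    a = explainer.explain_instance(x, predict_proba, num_features=num_features)
    return y, a.as_list()  # a as a list of (feature condition, weight) pairs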

2.3 Limits of Remote Explainability: The PR (Public Relations) Attack

We articulate our demonstration of the limits of explainability in a remote setup by showing that a malicious service provider can hide the use of discriminating features for issuing its decisions, while conforming to the mild explainability framework described in the previous subsection.

Such a malicious provider thus wants to i) produce decisions based on discriminative features and ii) produce non-discriminatory explanations to avoid prosecution. A practical approach to explain a discriminative decision b while not revealing its discriminative nature is to simply omit discriminating arguments from its explanation a. This is what club bouncers may be tempted to do. Yet for classifiers, depending on the nature of the explanation, this might seem non-trivial. In the following we show that it is in fact feasible, by introducing a new attack on explanations.

A Generic Attack Against Remote Explainability   We coin this attack the Public Relations attack (noted PR). The idea is rather simple: upon reception of an input x, first compute the discriminative decision C(x). Then train a local surrogate model C′ that is non-discriminative, and such that C′(x) = y. Explain C′(x), and return this explanation along with C(x).
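The generic procedure can be summarized by the following sketch, where fit_legitimate_surrogate and explain are hypothetical helpers standing in for any surrogate-training routine and any explanation framework (they are assumptions used for illustration, not components defined in this paper):

def pr_attack_answer(C, x_l, x_d, fit_legitimate_surrogate, explain):
    # Public Relations (PR) attack: decide with the discriminative classifier C,
    # but explain with a legitimate surrogate that agrees with C on this input.
    y = C(x_l, x_d)                              # 1. discriminative decision
    C_prime = fit_legitimate_surrogate(x_l, y)   # 2. legitimate model with C_prime(x_l) = y
    assert C_prime(x_l) == y                     # coherence: the surrogate agrees with C on x
    a = explain(C_prime, x_l)                    # 3. explanation built on C_prime only
    return y, a                                  # decision from C, explanation from C_prime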

Figure 2 illustrates a decision based solely on legitimate features (A.), a provider giving an explanation that includes discriminatory features (B.), and the attack by a malicious provider (C.). In all three scenarios, a user queries a remote service with inputs x, and obtains decisions y, each along with an explanation. In case B., the explanation expC reveals the use of discriminative features Xd; this provider is prone to complaints. To avoid those, the malicious provider (C.) leverages the PR attack, by first computing C(x) using its discriminative classifier C. Then, based on the legitimate features xl of the input and its final (discriminative) decision y, it derives a classifier C′ for the explanation.

Core to the attack is the ability to derive such a classifier C′:

Definition 1 (PR attack). Given an arbitrary classifier C ∈ Cd, a PR attack is a function that finds, for an arbitrary input x, a classifier C′:

PR(C, x, C(x)) → C′,    (1)

such that C′ satisfies two properties:

• coherence: C′(xl) = y.

• legitimacy: C′ ∈ Cl.

Effectiveness of the attack: Observe that a = expC′(y, x) is apropos, since it directly involves x: a ∈ A(x). Since we have C′(x) = y, it is also consequent. Finally, observe that since C′ ∈ Cl, then a ∈ A(xl): a is non-discriminatory.

Existence of the attack: We note that crafting a classifier C′ satisfying the first property is trivial, since it only involves a single data point x. An example solution is the Dirac delta function of the form:

C′(x′) = C′((x′l, x′d)) = δ(x′l, xl) if y = 1, and 1 − δ(x′l, xl) if y = 0,

where δ is the Dirac delta function. Informally, this solution corresponds to explaining a decision by taking exactly all the features in the input as preponderant for the decision, and then exposing them in the explanation.
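A direct transcription of this construction in Python reads as follows (a minimal sketch with a toy profile; δ is the indicator that the legitimate parts of the two inputs are equal):

def dirac_surrogate(x_l, y):
    # Build the legitimate classifier C' from the single point (x_l, y):
    # C'((x'_l, x'_d)) = delta(x'_l, x_l) if y == 1, and 1 - delta(x'_l, x_l) if y == 0.
    # It ignores discriminative features entirely and satisfies C'(x_l) = y.
    def C_prime(x_prime_l, x_prime_d=None):      # x'_d is accepted but never used
        delta = 1 if x_prime_l == x_l else 0
        return delta if y == 1 else 1 - delta
    return C_prime

# Example: a rejection (y = 0) of profile x_l is "explained" by a model that
# rejects exactly this legitimate profile and accepts every other one.
C_prime = dirac_surrogate(("student", 22), y=0)
assert C_prime(("student", 22)) == 0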

One implementation of a PR attack   One can rely on state-of-the-art explanation frameworks, such as LIME, that can produce a decision tree as the explanation. Having access to a decision tree trivially fits the modus ponens we introduced as a basic explanation framework. Let us consider the following process, executed by a malicious provider. The discriminating classifier C is first turned into a decision tree, by building a surrogate model. Given this new C, we define C′ as the decision tree derived from C by removing all intermediary nodes relying on features in Xd. To remove those nodes, just take the branch that leads to decision y given xd, and prune the other branches. The resulting tree C′ will by definition have C′(xl) = y, since ∀x′l ∈ Xl, C′(x′l) = C((x′l, xd)). The resulting C′ is another decision tree. Moreover, since C′ does not involve any feature of Xd, its explanation is legitimate to the user (as opposed to case B.). This tree pruning algorithm is given in Section 4.1.

We have presented the framework and attack necessary to question the possibility of remote explainability. We next discuss the possibility for a user to spot that an explanation is malicious and obtained through a PR attack. We stress that if a user cannot, then the very concept of remote explainability is at stake.


Figure 2: (A.) A provider using a model that does not leverage discriminatory features: upon query x = (xl, ∅), it returns y = C(x) together with expC(y, (xl, ∅)). (B.) A discriminative model divulges its use of a discriminating feature: upon x = (xl, xd), it returns y and expC(y, (xl, xd)) to the discriminated user. (C.) The PR attack principle, undermining remote explainability: the black box first computes y = C((xl, xd)), then builds a local surrogate model C′ = PR(C, (xl, xd), y) for each new request x, such that C′(xl) = y, and explains y using expC′(y, (xl, ∅)); the user is both discriminated and fooled.

3 The bouncer problem: spotting PR attacks

We presented in the previous section a general setup for remote explainability. We now formalise our research question regarding the possibility for a user to spot an attack in that setup.

Definition 2 (The bouncer problem (BP)). Given ε requests, each returning a decision yi = C(xi) and an explanation expC(yi, xi), decide whether C ∈ Cd. We denote this problem BP(ε).

3.1 An Impossibility Result for One-Shot Explanations

We already know that using a single input point is insufficient:

Observation 1. BP(1) has no solution.

Proof. The Dirac construction above always exists.

Indeed, constructions like the introduced Dirac function, or the tree pruning, yield a PR attack that produces explainable decisions. Given a single explanation on model C′ (i.e., ε = 1), the user cannot distinguish between the use of a genuine model (C in case A.) and the use of a model crafted by a PR attack (C′ in case C.), since the explanation is consequent. This means that such a user cannot spot the use of hidden discriminatory features due to the PR attack by the malicious provider.

We observed that a user cannot spot a PR attack with BP(1). This is already problematic, as it gives a formal explanation of why Facebook's ad system cannot be trusted [4].

3.2 The Hardness of Multiple Queries for Explanation

To address the case BP(ε), we observe that a PR attack generates a new model C′ for each request; in consequence, an approach to detect the attack is to detect the impossibility (using multiple queries) of a single model C′ producing coherent explanations for a set of observed decisions. We here study this approach.

Interestingly, classifiers and bouncers share the property that their outputs are all mutually exclusive (each input is mapped to exactly one class). Thus we have In ⇒ ¬Out (with In and Out the positive and negative decisions to, for instance, enter a place). In that case it is impossible to have both a ⇒ In and a ⇒ Out. Note that non-mutually exclusive outputs are not bound by this rule.

A potential problem for the PR attack is a decision conflict, in which a could explain both b and its opposite ¬b. For instance, imagine a bouncer refusing you entrance to a club because, say, you have white shoes. Then, if the bouncer is coherent, he should refuse entrance to anyone wearing white shoes, and if you witness someone entering with white shoes, you could argue against the lack of coherence of the bouncer's decisions. We build on such incoherences to spot PR attacks.

In order to examine the case BP(ε) with ε > 1, we first define the notion of an incoherent pair:

Definition 3 (Incoherent Pair – IP). Let x1 = (x1l, x1d) and x2 = (x2l, x2d) ∈ X = Xl × Xd be two input points in the feature space. x1 and x2 form an incoherent pair for classifier C iff they both have the same legitimate feature values in Xl and yet end up being classified differently: x1l = x2l ∧ C(x1) ≠ C(x2). For convenience we write (x1, x2) ∈ IPC.
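Checking whether two queried points form an incoherent pair only requires their legitimate parts and the two observed decisions, as in this minimal sketch:

def is_incoherent_pair(x1, x2, y1, y2):
    # Definition 3: x1 and x2 (given as (x_l, x_d) tuples) form an incoherent
    # pair if they share the same legitimate features yet received different
    # labels y1 and y2 from the remote service.
    (x1_l, _), (x2_l, _) = x1, x2
    return x1_l == x2_l and y1 != y2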

Finding such an IP is a potent proof of a PR attack on the model by the provider. Indeed, we can formalise the intuitive reasoning "if you let others enter with white shoes, then this wasn't the true reason for my rejection":

Proposition 1. Only decisions resulting from a model crafted by a PR attack (1) can exhibit incoherent pairs: IPC ≠ ∅ ⇒ C ∈ Cd.


Proof. We prove the contrapositive form C ∉ Cd ⇒ IPC = ∅. Let C ∉ Cd. Therefore C ∈ Cl, and by definition: ∀xl ∈ Xl, ∀(xd, x′d) ∈ Xd², C((xl, xd)) = C((xl, x′d)). By contradiction, assume IPC ≠ ∅. Let (x1, x2) ∈ IPC: x1l = x2l ∧ C(x1) ≠ C(x2). This directly contradicts C ∈ Cl. Thus IPC = ∅.

We can show that there is always a pair of inputs allowing the detection of a discriminative classifier C ∈ Cd.

Proposition 2. A classifier C targeted by a PR attack, i.e., C ∈ Cd, always has at least one incoherent pair: C ∈ Cd ⇒ IPC ≠ ∅.

Proof. We prove the contrapositive form IPC = ∅ ⇒ C ∉ Cd. Informally, the strategy here is to prove that if no such pair exists, then decisions are not based on discriminative features in Xd, and thus the provider had no interest in conducting a PR attack on the model; the considered classifier is not discriminating.

Assume that IPC = ∅. Let x∅ ∈ Xd, and let Cl : Xl ↦ Y be a legitimate classifier such that Cl(xl) = C((xl, x∅)). Since IPC = ∅, we have ∀x1, x2 ∈ X, x1l = x2l ⇒ C(x1) = C(x2). In particular, ∀x = (xl, xd) ∈ X, C(x) = C((xl, x∅)) = Cl(xl). Thus C = Cl; since, by definition, a PR attack is only applied to a model that uses discriminatory features, this leads to C ∈ C \ Cd, i.e., C ∉ Cd.

This directly applies to our problem:

Proposition 3 (Detectability lower bound). BP(|X|) is solvable.

Proof. Straightforward: C ∈ Cd ⇒ IPC ≠ ∅, and since IPC ⊆ X × X, testing the whole input space will necessarily exhibit such an incoherent pair.

This last result is rather weakly positive: even though any PR attack is eventually detectable, in practice it is impossible to exhaustively explore the input space of modern classifiers due to their dimension. This remark further questions the opportunity of remote explainability.

4 Illustration and experiment

In this section, we first give an algorithm that implements a PR attack on binary decision trees, as an illustration. We then experiment on the German Credit dataset to quantify the hardness of finding incoherent pairs in a black-box setup.

4.1 Illustration using Decision Trees

In this section, we embody the previous observations and approaches in a concrete type of classifier based on decision trees. The choice of decision trees is motivated first by their recognised importance (e.g., C4.5 ranked number one among the top ten data mining algorithms [30]). Second, there is a wide consensus on their explainability, which is straightforward [20]: a path in the tree "naturally" lists the attributes considered by the algorithm to establish a classification. Finally, the simplicity of crafting PR attacks on them makes them good candidates for an illustration, and argues for the practical implementability of such an attack.

We denote by T the set of tree-based classifiers. We do not need any assumption on how the tree is built (e.g., C4.5 [24]). Regarding explainability, we here only need to assume that decision trees are explainable: ∀C ∈ T, expC exists.

Let C ∈ T ∩ Cd be a discriminatory binary tree classifier. Each internal node n ∈ V(C) tests incoming examples based on a feature n.label. Each internal node is connected to exactly two sons in the tree, named n.r and n.l for right and left. Depending on the (binary) result of this test, the example will continue on either of these paths. We denote the father of an internal node by n.father (the root node r is the only node such that r.father = ∅).

Algorithm 1 presents a PR attack on binary decision trees. To ease its presentation, we assume that, given an input x, n.r (right) is by convention always the branch taken after evaluating x on n. The algorithm starts by initializing the target decision tree C′ as a copy of C. Then, it selectively removes all nodes involving discriminative features, and replaces them with the subtree the target example x would take.

Algorithm 1: PR attack on a discriminative binary decision tree C

Input: C, x = (xl, xd)
 1  y = C(x)                                   // Find the discriminative decision
 2  Let n0, ..., nl be a breadth-first ordering of the nodes of C
 3  Let C′ = C                                 // Initialise the surrogate
 4  for node i = 0 to l do
 5      if ni.label ∈ Xd then
 6          C′.ni.father.r = ni.r              // Reconnect ni's father to its right son
 7          C′ = C′ \ ni                       // Remove the discriminating node
 8          C′ = C′ \ ni.l subtree             // Remove the left subtree
 9      else
10          C′.ni.l = y                        // Keep the legit node, add a dummy terminal node
11      end
12  end
13  return y, expC′(y, (xl, ∅))

To do so, Algorithm 1 removes each discriminative node ni by connecting ni−1 and ni+1. While this approach would be problematic in the general case (we would lose the ni.l subtree), in the context of x we know the explored branch is ni.r, so we simply reconnect this branch and replace the left subtree by a dummy output.
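For concreteness, the following runnable sketch mirrors Algorithm 1 on a toy tree encoded as nested Python dictionaries (an illustration under simplifying assumptions, not the authors' implementation):

def prune_discriminative(node, x, discriminative):
    # Return a copy of the decision tree in which every node testing a
    # discriminative feature is replaced by the subtree that input x follows
    # (the reconnection step of Algorithm 1). Leaves are plain labels.
    if not isinstance(node, dict):               # leaf, e.g. "Enter" or "Bounce"
        return node
    feature, test = node["feature"], node["test"]
    if feature in discriminative:
        branch = "yes" if test(x[feature]) else "no"
        return prune_discriminative(node[branch], x, discriminative)
    return {"feature": feature, "test": test,
            "yes": prune_discriminative(node["yes"], x, discriminative),
            "no": prune_discriminative(node["no"], x, discriminative)}

# Toy tree of Figure 3: disguised visitors enter if aged under 60,
# non-disguised visitors enter only with pink socks.
tree = {"feature": "disguised", "test": bool,
        "yes": {"feature": "age", "test": lambda a: a < 60,
                "yes": "Enter", "no": "Bounce"},
        "no": {"feature": "pink_socks", "test": bool,
               "yes": "Enter", "no": "Bounce"}}

x = {"disguised": True, "age": 40, "pink_socks": False}
surrogate = prune_discriminative(tree, x, discriminative={"age"})
# The surrogate no longer mentions "age": disguised visitors now simply "Enter".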

Figure 3: Illustration of a possible implementation (Algorithm 1) of the PR attack: instead of having to explain the use of a discriminative feature (age in this case) in the classifier C, two non-discriminative classifiers (C′, on the right) are derived; the three panels show C(x), C′(xl) | xd < 60, and C′(xl) | xd ≥ 60, built from the "Disguised?", "Age < 60?", and "Wears pink socks?" tests with "Enter"/"Bounce" leaves. Depending on the age feature in the request, a C′ is then selected to produce legitimate explanations. The two dashed circles represent an incoherent pair that users might seek in order to detect a malicious explanation.

An example is presented in Figure 3: the discriminative classifier C is queried for the explanation expC(C(x), x) of input x. To produce an answer without exposing a discriminative feature such as the age, the provider first applies Algorithm 1 on C, given the query x. If the age in x is below 60 (upper right in Figure 3), the node testing the age limit is simply pruned out of the explanation expC′, leading to an "Enter" decision. If the age is 60 or above, the explanation node is a legitimate one ("Disguised?"), leading to the "Bounce" decision. Neither explanation then exhibits the fact that the provider relied on a discriminative feature in C. This first illustrates that BP(1) has no solution; second, we see that BP(2), testing the age feature in Xd, has a solution: x1 = (c, 40, d) and x2 = (c, 70, d) are an instance of an incoherent pair (for arbitrary values c and d in the (disguised, age, pink-socks) input vector).

Proposition 4. Algorithm 1 implements a PR attack.

Proof. To prove the statement, we need to prove that:

• C′(xl) is defined;

• C′(xl) = y;

• C′ is explainable.

First, observe that any node of C′ containing a discriminative feature is removed at line 7. Thus, C′ only takes decisions based on features in Xl: C′(xl) is defined.

Second, observe that, by construction, since x = (xl, xd), any discriminative node n is replaced by its right (n.r) outcome, which is the one that corresponds to xd. In other words, ∀x′l ∈ Xl, C′(x′l) = C((x′l, xd)): C′ behaves like C with the discriminative features evaluated at xd. This holds in particular for xl: C′(xl) = C((xl, xd)) = C(x) = y.

Finally, observe that C′ is a valid decision tree. Therefore, according to our explainability framework, C′ is explainable.

Interestingly, the presented attack can be efficient, as it only involves pruning part of the target tree. In the worst case, this tree has Ω(2^d) elements, but in practice decision trees are rarely that big.

4.2 Finding IPs on a Neural Model: the German Credit dataset

We now take a closer look at the detectability of the attack, namely: how difficult is it to spot an IP? We illustrate this by experimenting on the German Credit dataset. Despite the low dimensionality of that dataset, we show that the probability of finding IPs, such as the one in Figure 3, is low.

Experimental setup   We leverage Keras over TensorFlow to learn a neural network-based model for the German Credit dataset [1]. This bank dataset classifies client profiles (1,000 of them), described by a set of attributes, as good or bad credit risks. Multiple techniques have been employed to model the credit risks on that dataset, ranging from 76.59% accuracy for an SVM to 78.90% for a hybrid between a genetic algorithm and a neural network [22].

The dataset is composed of 24 features (some categorical ones, such as sex or status, were converted to numerical values). This thus constitutes a low-dimensional dataset compared to current applications (observations in [4] reported up to 893 features for the sole application of ad placement on user feeds on Facebook).

The neural network we built is inspired by the one proposed in [16] in 2010, which reached 73.17% accuracy. It is a simple multi-layer perceptron, with a single hidden layer of 23 neurons (with sigmoid activations), and a single output neuron for the binary classification of the input profile as "risky" or not. In this experiment we use the Adam optimizer and a learning rate of 0.1 (leading to much better convergence than in [16]), with a validation split of 25%.
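A sketch of such a model in Keras follows (a reconstruction from the description above, not the authors' code; the preprocessing of the 24 columns into a numerical matrix X with labels y is assumed to have been done beforehand):

from tensorflow import keras

def build_credit_model(input_dim=24):
    # Multi-layer perceptron used in the experiment: one hidden layer of
    # 23 sigmoid units and one sigmoid output for the risky / not-risky label.
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(23, activation="sigmoid"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.1),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_credit_model()
# model.fit(X, y, epochs=100, validation_split=0.25)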


Figure 4: Percentage of label changes when swapping the discriminative features in the test-set data (detection probability, in % per request pair, for each discriminative variable: sex/status, age, employment, foreigner, and all four combined). These low values indicate the low probability of spotting a PR attack on the provider's model.

We create 30 models, with an average accuracy of 76.97% at 100 epochs on the validation set (with a standard deviation of 0.92%).

We then removed the columns containing a discriminating feature according to [14] (i.e., age, sex, employment, foreigner), trained 30 models with the same parameters, and obtained an average accuracy of 77.21% at 100 epochs (deviation of 1.36%). The relatively high variance observed on the final accuracy of a model training explains the higher accuracy score in this experiment, even averaged over 30 independent trainings.

We kept 50 profiles from the dataset as a test set, so we can perform our core experiment: the four discriminatory features of each of those profiles are sequentially replaced by those of the 49 remaining profiles, and each resulting test profile is fed to the model for prediction. (This permits testing the model with realistic values for those features; this process creates 2,450 profiles in which to search for an IP.) We count the number of times the output risk label switches, compared to the original untouched profile fed to the same model. We repeat this operation 30 times to observe deviations.
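The core counting loop of this experiment can be sketched as follows (a reconstruction of the described procedure, assuming a trained Keras-style model, the test profiles held in a NumPy array, and discr_cols listing the column indices of the four discriminatory features):

import numpy as np

def detection_probability(model, test_X, discr_cols):
    # Swap the discriminatory columns of each test profile with those of every
    # other test profile and count how often the predicted label flips; each
    # flip exposes an incoherent pair. With 50 profiles this tests 2,450 pairs.
    base = (model.predict(test_X) > 0.5).astype(int).ravel()
    switches, total = 0, 0
    for i in range(len(test_X)):
        for j in range(len(test_X)):
            if i == j:
                continue
            swapped = test_X[i].copy()
            swapped[discr_cols] = test_X[j][discr_cols]
            pred = int(model.predict(swapped[None, :])[0, 0] > 0.5)
            switches += int(pred != base[i])
            total += 1
    return switches / total   # label-change rate, i.e. detection probability per pair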

The low probability of finding IPs at random   Figure 4 depicts the proportion of label changes over the total number of test queries; recall that a label change between two such inputs constitutes an IP. We observe that if we change just one of the four features, we obtain on average 1.86%, 0.27%, 1.40%, and 2.27% label changes (for the employment, sex/status, age, and foreigner features, respectively), and 4.25% if the four features are changed simultaneously. (Standard deviations are 1.48%, 0.51%, 1.65%, 2.17% and 3.13%, respectively.)

Despite the low dimensionality of this dataset's inputs, the probability of finding IPs is thus low; it turns out that we can also compute an expectation of the number of queries a user needs to find such an IP. Users can query the service with inputs until they are confident enough that such a pair does not exist.

Figure 5: Confidence level as a function of the number of tested input pairs (the y-axis shows 1 − confidence, on a log scale), based on the German Credit detection probabilities of Figure 4, for each discriminative variable and for all four combined. The dashed line represents a 99% confidence level.

Assuming one seeks a 99% confidence level (that is, less than a one percent chance of falsely declaring a discriminating classifier non-discriminating), and using the detection probabilities of Figure 4, we can compute the associated p-values. A user testing a remote service under those hypotheses would need to craft respectively 490, 2,555, 368, 301 and 160 pairs (for the employment, sex/status, age, foreigner, and all-four cases, respectively) in the hope of deciding on the existence or not of an IP, as presented in Figure 5 (please note the log scale on the y-axis).
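One simple way to relate these detection probabilities to a number of queries is a geometric model in which each crafted pair independently reveals an IP with the per-pair probability of Figure 4; this is only a back-of-the-envelope reading of Figure 5, not necessarily the exact statistical test used to obtain the figures above:

import numpy as np

def miss_probability(p_detect, n_pairs):
    # Probability of having found no IP after n independently drawn pairs,
    # assuming each pair reveals one with probability p_detect; this is the
    # quantity decreasing with n on the (log-scale) y-axis of Figure 5.
    return (1.0 - p_detect) ** np.asarray(n_pairs, dtype=float)

# Per-pair detection probabilities, read from Figure 4 (rounded averages).
probs = {"employment": 0.0186, "sex/status": 0.0027, "age": 0.0140,
         "foreigner": 0.0227, "all four": 0.0425}
n = np.arange(0, 601)
curves = {name: miss_probability(p, n) for name, p in probs.items()}
# e.g. curves["all four"][160] is the residual miss probability after 160 pairs.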

This highlights the hardness of experimentally checking for PR attacks.

5 Discussion

We now list several consequences of the findings of this paper, and some open questions.

5.1 Findings and Applicability

We have shown that a malicious provider can always craft a fake explanation to hide its use of discriminatory features, by creating a surrogate model for providing an explanation to a given user. An impossibility result follows for the user detecting such an attack from a single explanation. Detection by a user, or a group of users, is possible only in the case of multiple and deliberate queries (BP(ε > 1)); this process may require an exhaustive search of the input space. We argue that with the current increase in the dimensionality of inputs for decision making, due to the progress made possible by deep learning, the probability of detecting such attacks will decrease accordingly.

We note that malicious providers have another advantage for covering PR attacks. Since multiple queries must be issued to spot inconsistencies via IPs, basic rate-limiting mechanisms for queries may block and ban the incriminated users. Defenses of this kind, preventing attacks on online machine learning services exposing APIs, are being proposed [15]. This adds another layer of complexity to the observation of misbehaviour.

5.2 Connection with Disparate Impact

We now briefly relate our problem to disparate impact: a recent article [10] proposes to adopt "a generalization of the 80 percent rule advocated by the US Equal Employment Opportunity Commission (EEOC)" as a criterion for disparate impact. This notion of disparate impact proposes to capture discrimination through the variation of the outcomes of an algorithm under scrutiny when applied to different population groups.

More precisely, let α be the disparity ratio. The authors propose the following formula, here adapted to our notations [10]:

α = P(y | xd = 0) / P(y | xd = 1),

where Xd = {0, 1} is the discriminative space, reduced to a binary discriminatory variable. Their approach is to consider that if α < 0.8, then the tested algorithm can be qualified as discriminative.

To connect disparate impact to our framework, we can conduct the following strategy. Consider a classifier C having a disparate impact α. We search for incoherent pairs as follows: first, pick x ∈ Xl, a set of legitimate features. Then take A = (x, 0), representing the discriminated group, and B = (x, 1), representing the undiscriminated group. Then test both C(A) and C(B): if C(A) ≠ C(B), then (A, B) is an IP. The probability of finding an IP with this approach is written P(IP).

We can develop: P(IP) = P(C(A) ≠ C(B)) = P(C(A) ∩ ¬C(B)) + P(¬C(A) ∩ C(B)) = P(C(A)) − P(C(A) ∩ C(B)) + P(C(B)) − P(C(A) ∩ C(B)). Since α = P(C(A)) / P(C(B)), we write: P(IP) = P(C(B))(1 + α) − 2 P(C(A) ∩ C(B)). Using conditional probabilities, we have P(C(A) ∩ C(B)) = P(C(B) | C(A)) · P(C(A)) = α · P(C(B)) · P(C(B) | C(A)). Thus P(IP) = P(C(B)) (1 + α − 2α · P(C(B) | C(A))). Since the conditional probability P(C(B) | C(A)) is difficult to assess without further hypotheses on C, let us investigate two extreme scenarios (a short numerical sketch follows the two scenarios below):

• Independence: C(A) and C(B) are completely independent events, even though A and B share their legitimate features x. This scenario, which is not very realistic, could model purely random decisions with respect to attributes from Xd. In this scenario, P(C(B) | C(A)) = P(C(B)).

• Dependence: C(A) ⇒ C(B): if A is selected despite its membership in the discriminated group (A = (x, 0)), then necessarily B must be selected, as it can only be "better" from C's perspective. In this scenario, P(C(B) | C(A)) = 1.
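The expression for P(IP) can be evaluated numerically under both scenarios; the sketch below shows how the curves of Figure 6 can be computed (the α values follow the figure's legend):

def p_ip(p_b, alpha, scenario):
    # P(IP) = P(C(B)) * (1 + alpha - 2 * alpha * P(C(B)|C(A))), where
    # P(C(B)|C(A)) = P(C(B)) under independence and 1 under dependence.
    q = p_b if scenario == "independence" else 1.0
    return p_b * (1 + alpha - 2 * alpha * q)

# In the dependence scenario the expression collapses to P(C(B)) * (1 - alpha).
for alpha in (0.1, 0.5, 0.8, 0.9, 1.0):
    print(alpha, p_ip(0.5, alpha, "dependence"), p_ip(0.5, alpha, "independence"))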

Figure 6: Probability of finding an Incoherent Pair (IP), as a function of P(C(B)), the probability of success for the non-discriminated group, for disparity ratios α ∈ {0.1, 0.5, 0.8, 0.9, 1} and for the two scenarios (Dependence, Independence).

Figure 6 represents the numerical evaluation of our two scenarios. First, it shows that the probability of finding an IP strongly depends on the probability of success for the non-discriminated group, P(C(B)). Indeed, since the discriminated group has an even lower probability of success, a low success probability for the non-discriminated group implies frequent cases where both C(A) and C(B) are failures, which does not constitute an incoherent pair.

In the absence of disparate impact, the two scenarios provide very different results: the independence scenario easily identifies incoherent pairs (which is coherent with the "random" nature of the independence assumption). This underlines the unrealistic nature of the independence scenario in this context. With a high disparate impact, however (e.g., α = 0.1), the discriminated group has a high probability of failure. Therefore the probability of finding an IP is very close to the simple probability of the non-discriminated group having a success, P(C(B)), regardless of the considered scenario.

The dependence scenario nicely illustrates a natural connection: the higher the disparate impact, the higher the probability of finding an IP. While this only constitutes a thought experiment, we believe it highlights possible connections with standard discrimination measures, and conveys the intuition that, in practice, the probability of finding incoherent pairs exposing a PR attack strongly depends on the intensity of the discrimination hidden by the attack.

5.3 Open Problems for Remote Explainability

On test efficiency   It is common for fairness assessment tools to leverage testing. As the features that are considered discriminating are often precisely identified [11, 14], the test queries for fairness assessment can be targeted, and some notion of efficiency in terms of the number of requests can be derived. This may be done by sampling the feature space in question, for instance (as in the work by Galhotra et al. [11]).

Yet, it appears that with current applications such as social networks [4], users spend a considerable amount of time online, producing more and more data that turns into features, which are also the basis for the generation of other meta-features. In that context, the full scope of features, discriminating or not, may not be clear to a user. This makes exhaustive testing even theoretically unreachable, due to the very likely incomplete picture of what providers are using to issue decisions. This is another challenge on the way to remote explainability, if providers are not willing to release a complete and precise list of all the attributes leveraged in their systems.

Towards a provable explainability?   Some other computing applications, such as data storage or intensive processing, have also had to question the possibility of malicious service providers in the past. Motivated by the plethora of offers in the cloud computing domain and the question of quality of service, protocols such as proofs of data possession [5] or proof-based verifiable computation [7] assume that the service provider might be malicious. A solution to still have services executed remotely in this context is then to rely on cryptographic protocols to formally verify the work performed remotely. To the best of our knowledge, no such provable process logic has been adapted to explainability. That is certainly an interesting development to come.

6 Related Work

As a consequence of the major impact of machine learning models in many areas of our daily life, the notion of explainability has been pushed by policy makers and regulators. Many works address the explainability of inspected model decisions in a local setup (please refer to surveys [8, 13, 20]), some specifically for neural network models [31], where the number of requests to the model is unbounded. Regarding the question of fairness, a recent work specifically targets the fairness and discrimination of in-house software, by developing a testing-based method [11].

The case of models available through a remote black-box interaction setup is particular, as external observers are bound to scarce data (labels corresponding to inputs, while being limited in the number of queries to the black box [28]). Adapting the explainability reasoning to models available in a black-box setup is of major societal interest: Andreou et al. [4] have shown that Facebook's explanations for their ad platform are incomplete and sometimes misleading. They also conjecture that malicious service providers can "hide" sensitive features used, by explaining decisions with very common ones. In that sense, our paper exposes the hardness of explainability in that setup, confirming that malicious attacks are possible. Milli et al. [19] provide a theoretical ground for reconstructing a remote model (a two-layer ReLU neural network) from its explanations and input gradients; if further research proves the approach practical for current applications, this technique may help to infer the use of discriminatory features by the service provider.

In the domain of security and cryptography, similar setups have generated a large body of work to solve the trust problem in remotely interacting systems. In proof of data possession protocols [5], a client executes a cryptographic protocol to verify the presence of its data on a remote server; the challenge that the storage provider responds to assesses the possession or not of some particular piece of data. Protocols can give certain or probabilistic guarantees. In proof-based verifiable computation [7], the provider returns the result of a queried computation, along with a proof of that computation. The client can then check that the computation indeed took place. Those schemes, along with this paper exhibiting attacks on remote explainability, motivate the need for the design of secure protocols.

Our work is complementary to classic discrimination detection in automated systems. In contrast to works on fairness [6] that attempt to identify and measure discrimination in systems, our work does not aim at spotting discrimination, as we have shown it can be hidden by the remote malicious provider. We instead target the occurrence of incoherent explanations produced by such a provider in its will to cover its behavior, which is of a completely different nature than fairness-based test suites. Galhotra et al. [11], inspired by statistical causality [23], for instance propose to create input datasets for observing discrimination on some specific features by the system under test.

More closely related to our work is the recent paper by Aivodji et al. [3], which introduces the concept of rationalization, in which a black-box algorithm is approximated by a surrogate model that is "fairer" than the original black box. In our terminology, they craft C′ models that optimise arbitrary fairness objectives. To achieve this, they explore decision tree models trained using the black-box decisions on a predefined set of inputs. This provides another argument against black-box explainability in a remote context. The main technical difference with our tree algorithm of Section 4.1 is that their surrogates C′ optimise an exterior metric (fairness) at the cost of some coherence (fidelity, in the authors' terminology). In contrast, our illustration in Section 4.1 produces surrogates with perfect coherence that do not optimise any exterior metric such as fairness. In our model, spotting an incoherence (i.e., the explained model produces a label while the black box produces the opposite one) would directly provide a proof of manipulation and reveal the trickery. Interestingly, the incoherent-pair approach fully applies in the context of their model surrogates, as incoherence arises as soon as more than one surrogate is used (regardless of the explanation). Our paper focuses on the user-side observation of explanations, and on users' ability to discover such attacks. We rigorously prove that single queries are not sufficient to determine a manipulation, and that the problem is hard even in the presence of multiple queries and observations.


7 Conclusion

In this paper, we studied explainability in a remote context, which is sometimes presented as a way to satisfy society's demand for transparency in the face of automated decisions. We prove that it is unwise to blindly trust those explanations: like humans, algorithms can easily hide the true motivations of a decision when asked for an explanation. To that end, we presented an attack that generates explanations hiding the use of an arbitrary set of features by a classifier. While this construction applies to any classifier queried in a remote context, we also presented a concrete implementation of the attack on decision trees. On the defensive side, we have shown that such a manipulation cannot be spotted from one-shot requests, which is unfortunately the nominal use case. However, the proof of such trickery (pairs of classifications that are not coherent) necessarily exists. We further evaluated, in a practical scenario, the probability of finding such pairs, which is low. The attack is thus arguably impractical to detect for an isolated user.

We conclude that this must, at the very least, question the whole concept of the explainability of a remote model operated by a third-party provider. A first research direction is to develop secure schemes in which the involved parties can trust the exchanged information about decisions and their explainability, as enforced by new protocols. A second line of research may be the collaboration of users' observations for spotting the attack in an automated way. We believe this is an interesting development to come, in relation with the promises of AI and automated decision processes.

Acknowledgements

The authors want to thank the bouncer of the Berghain club for having inspired this work.

References

[1] Statlog (German Credit Data) data set. https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data).

[2] A. Adadi and M. Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160, 2018.

[3] U. Aivodji, H. Arai, O. Fortineau, S. Gambs, S. Hara, and A. Tapp. Fairwashing: the risk of rationalization. In ICML, 2019.

[4] A. Andreou, G. Venkatadri, O. Goga, K. P. Gummadi, P. Loiseau, and A. Mislove. Investigating ad transparency mechanisms in social media: A case study of Facebook's explanations. In NDSS, 2018.

[5] G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D. Song. Provable data possession at untrusted stores. In CCS, 2007.

[6] R. Binns. Fairness in machine learning: Lessons from political philosophy. In FAT*, 2017.

[7] B. Braun, A. J. Feldman, Z. Ren, S. Setty, A. J. Blumberg, and M. Walfish. Verifying computations with state. In SOSP, 2013.

[8] A. Datta, S. Sen, and Y. Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pages 598–617. IEEE, 2016.

[9] P. B. de Laat. Algorithmic decision-making based on machine learning from big data: Can transparency restore accountability? Philosophy & Technology, 31(4):525–541, 2018.

[10] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.

[11] S. Galhotra, Y. Brun, and A. Meliou. Fairness testing: Testing software for discrimination. In ESEC/FSE, 2017.

[12] B. Goodman and S. Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine, 38(3):50–57, 2017.

[13] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):93, 2018.

[14] S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté. Rule protection for indirect discrimination prevention in data mining. In Modeling Decisions for Artificial Intelligence, 2011.

[15] J. Hou, J. Qian, Y. Wang, X.-Y. Li, H. Du, and L. Chen. ML Defense: Against prediction API threats in cloud-based machine learning service. In Proceedings of the International Symposium on Quality of Service, IWQoS '19, pages 7:1–7:10, New York, NY, USA, 2019. ACM.

[16] A. Khashman. Neural networks for credit risk evaluation: Investigation of different neural models and learning schemes. Expert Systems with Applications, 37(9):6233–6239, 2010.

[17] A. K. Menon and R. C. Williamson. The cost of fairness in binary classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, pages 107–118, 2018.

[18] T. Miller. Explanation in artificial intelligence: Insights from the social sciences. CoRR, abs/1706.07269, 2017.

[19] S. Milli, L. Schmidt, A. D. Dragan, and M. Hardt. Model reconstruction from model explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pages 1–9, New York, NY, USA, 2019. ACM.

[20] C. Molnar. Interpretable Machine Learning. 2019. https://christophm.github.io/interpretable-ml-book/.

[21] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy. Deep learning recommendation model for personalization and recommendation systems. CoRR, abs/1906.00091, 2019.

[22] S. Oreski and G. Oreski. Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Systems with Applications, 41(4, Part 2):2052–2064, 2014.

[23] J. Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.

[24] J. R. Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.

[25] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In KDD, 2016.

[26] A. D. Selbst and J. Powles. Meaningful information and the right to explanation. International Data Privacy Law, 7(4):233–242, 2017.

[27] F. Tramer, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Stealing machine learning models via prediction APIs. In Proceedings of the 25th USENIX Conference on Security Symposium, SEC '16, pages 601–618, Berkeley, CA, USA, 2016. USENIX Association.

[28] F. Tramer, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Stealing machine learning models via prediction APIs. In 25th USENIX Security Symposium (USENIX Security 16), pages 601–618, Austin, TX, 2016. USENIX Association.

[29] M. Veale. Logics and practices of transparency and opacity in real-world applications of public sector machine learning. In FAT/ML, 2017.

[30] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, S. Y. Philip, et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.

[31] C.-K. Yeh, J. Kim, I. E.-H. Yen, and P. K. Ravikumar. Representer point selection for explaining deep neural networks. In Advances in Neural Information Processing Systems 31, 2018.

[32] Y. Zhang and X. Chen. Explainable recommendation: A survey and new perspectives. CoRR, abs/1804.11192, 2018.