a methodology for direct and indirect discrimination prevention in data mining

A Methodology for Direct and Indirect Discrimination Prevention in Data Mining

Presented By: Rucha Bhutada

Guided By: Prof. M. R. Wanjari

Outline:

• Introduction
• Challenges
• Discrimination analysis
• Why discrimination
• Papers read
• Findings of the base paper
• Future plans

Introduction:

Data mining is an increasingly important technology for extracting useful knowledge hidden in large collections of data.

Data mining also carries negative social perceptions, notably:
• Potential privacy invasion
• Potential discrimination

If the training data sets are biased with respect to discriminatory (sensitive) attributes such as gender, race, or religion, discriminatory decisions may follow.

Challenges:

Handling both direct and indirect discrimination, instead of only direct discrimination.

Finding a good trade-off between discrimination removal and the quality of the resulting training data sets and data mining models.

Why this topic:

It is an extension of association rule mining, and a novel application of association rule mining in a social setting.

Most people do not want to be discriminated against on the basis of sensitive attributes.

It can be useful in deriving a discrimination-free rule base for decision-making systems, such as those used for insurance, loans, and hiring.

Example:

U.S. federal laws prohibit discrimination on the basis of: Race, Color, Religion, Nationality, Marital status, Age

In a number of settings:
• Credit/insurance scoring
• Sale, rental, and financing of housing
• Personnel selection and wage
• Access to public accommodations, education, nursing homes, adoptions, and health care

Papers read:

Sr. No. | Paper Name | Author | Year | Conclusion
1 | A Methodology for Direct and Indirect Discrimination Prevention in Data Mining | Sara Hajian and Josep Domingo-Ferrer | 2013 | To develop a new preprocessing discrimination prevention methodology
2 | Rule Protection for Indirect Discrimination Prevention in Data Mining | S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté | 2011 | To protect the decision rules made for discrimination
3 | Classification with No Discrimination by Preferential Sampling | F. Kamiran and T. Calders | 2010 | To refine the model of discrimination

Discussion On Findings Of Base Paper

Discrimination is the unfair or unequal treatment of people based on their membership in a category or a minority, without regard to individual merit.

Discrimination can be either direct or indirect:

Direct discrimination occurs when decisions are made based on sensitive attributes.

Indirect discrimination occurs when decisions are made based on non-sensitive attributes which are strongly correlated with biased sensitive ones.

Approach: Anti-discrimination techniques have been introduced in data mining:

- Discrimination discovery: Consists of supporting the discovery of discriminatory decisions hidden, either directly or indirectly, in a dataset of historical decision records.

- Discrimination prevention: Consists of inducing patterns that do not lead to discriminatory decisions even if the original data sets are biased.

Approach: (cont’d)

Preprocessing approach

• A data set is a collection of data objects (records) and their attributes.

• An item is an attribute along with its value, e.g., Race = black.

• An item set X is a collection of one or more items.

• The support of an item set, supp(X), is the fraction of records that contain the item set X.

• We say that a rule X → C is completely supported by a record if both X and C appear in the record.

• The confidence of a rule, conf(X → C), measures how often the class item C appears in records that contain X. Hence, if supp(X) > 0 then

$$\mathrm{conf}(X \to C) = \frac{\mathrm{supp}(X, C)}{\mathrm{supp}(X)}$$

• Support and confidence range over [0,1].
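The support and confidence definitions above can be illustrated with a minimal Python sketch. The records, the item encoding as "attribute=value" strings, and the function names `supp` and `conf` are assumptions made for illustration only; they are not taken from the base paper.

```python
from fractions import Fraction

# Hypothetical toy records: each record is a set of "attribute=value" items,
# including the class item (hire=yes / hire=no).
RECORDS = [
    {"foreign_worker=yes", "city=NYC", "hire=no"},
    {"foreign_worker=yes", "city=NYC", "hire=no"},
    {"foreign_worker=no", "city=NYC", "hire=yes"},
    {"foreign_worker=no", "city=NYC", "hire=no"},
    {"foreign_worker=no", "city=LA", "hire=yes"},
]


def supp(itemset, records):
    """Fraction of records that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for record in records if itemset <= record)
    return Fraction(hits, len(records))


def conf(premise, class_item, records):
    """conf(X -> C) = supp(X, C) / supp(X); only defined when supp(X) > 0."""
    support_x = supp(premise, records)
    if support_x == 0:
        raise ValueError("confidence is undefined when supp(X) = 0")
    return supp(set(premise) | {class_item}, records) / support_x


print(supp({"city=NYC"}, RECORDS))             # 4/5
print(conf({"city=NYC"}, "hire=no", RECORDS))  # 3/4
```

Exact fractions are used here so that the values can be compared with hand calculations; floats would work equally well.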

Approach: (cont’d):

• A frequent classification rule is a classification rule with support and confidence greater than respective specified lower bounds.

• The negated item set, ¬X, is an item set with the same attributes as X, but in which the attributes take any value except those they take in X.

Approach: (cont’d):

o Potentially Discriminatory and Nondiscriminatory Classification Rules

Let DI be the set of predetermined discriminatory items in DB (e.g., DI = {Foreign worker = yes, Race = black, Gender = female}). Frequent classification rules in FR (FR denotes the set of frequent classification rules) fall into one of the following two classes:

• A classification rule X → C is potentially discriminatory (PD) when X = A,B, with A ⊆ DI a nonempty discriminatory item set and B a nondiscriminatory item set. For example, {Foreign worker = yes, City = NYC} → Hire = no.

• A classification rule X → C is potentially nondiscriminatory (PND) when X = D,B is a nondiscriminatory item set. For example, {Zip = 10451, City = NYC} → Hire = no or {Experience = low, City = NYC} → Hire = no.

The word “potentially” means that a PD rule could probably lead to discriminatory decisions. Likewise, a PND rule could lead to discriminatory decisions in combination with some background knowledge.
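As a rough illustration of the PD/PND split, the following sketch labels a rule by checking whether its premise contains any item of DI. The item encoding and the function name `classify_rule` are hypothetical; the DI values mirror the example on the slide.

```python
# Predetermined discriminatory items, mirroring the example on the slide.
DI = {"foreign_worker=yes", "race=black", "gender=female"}


def classify_rule(premise, discriminatory_items=DI):
    """Split a premise X into (A, B) and label the rule.

    Returns a tuple (label, a, b) where `a` is the discriminatory part of the
    premise, `b` is the remaining part, and label is "PD" when `a` is
    nonempty and "PND" otherwise.
    """
    premise = set(premise)
    a = premise & set(discriminatory_items)
    b = premise - a
    label = "PD" if a else "PND"
    return label, a, b


print(classify_rule({"foreign_worker=yes", "city=NYC"})[0])  # PD
print(classify_rule({"zip=10451", "city=NYC"})[0])           # PND
```

Note that the class item C plays no role in this labeling; only the premise is inspected.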

Approach: (cont’d)

o Direct Discrimination Measure

Definition 1. Let A,B → C be a classification rule such that conf(B → C) > 0. The extended lift of the rule is

$$\mathrm{elift}(A,B \to C) = \frac{\mathrm{conf}(A,B \to C)}{\mathrm{conf}(B \to C)}$$

The idea here is to evaluate the discrimination of a rule as the gain of confidence due to the presence of the discriminatory items A in the premise.

Definition 2. Let α ∈ R be a fixed threshold and let A be a discriminatory item set. A PD classification rule c = A,B → C is α-protective w.r.t. elift if elift(c) < α. Otherwise, c is α-discriminatory.
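Definitions 1 and 2 can be checked on a small example. The sketch below assumes hypothetical toy records encoded as sets of "attribute=value" items; the function names and the threshold values are illustrative choices, not values from the base paper.

```python
from fractions import Fraction

# Hypothetical toy records; the class item is hire=yes / hire=no.
RECORDS = [
    {"foreign_worker=yes", "city=NYC", "hire=no"},
    {"foreign_worker=yes", "city=NYC", "hire=no"},
    {"foreign_worker=no", "city=NYC", "hire=yes"},
    {"foreign_worker=no", "city=NYC", "hire=no"},
    {"foreign_worker=no", "city=LA", "hire=yes"},
]


def conf(premise, class_item, records):
    """Share of records containing `premise` that also contain `class_item`."""
    premise = set(premise)
    covered = [record for record in records if premise <= record]
    if not covered:
        raise ValueError("confidence is undefined when supp(X) = 0")
    return Fraction(sum(class_item in record for record in covered), len(covered))


def elift(a, b, class_item, records):
    """elift(A,B -> C) = conf(A,B -> C) / conf(B -> C), for conf(B -> C) > 0."""
    base = conf(b, class_item, records)
    if base == 0:
        raise ValueError("elift is undefined when conf(B -> C) = 0")
    return conf(set(a) | set(b), class_item, records) / base


def is_alpha_protective(a, b, class_item, records, alpha):
    """A PD rule is alpha-protective when its elift is strictly below alpha."""
    return elift(a, b, class_item, records) < alpha


A, B, C = {"foreign_worker=yes"}, {"city=NYC"}, "hire=no"
print(elift(A, B, C, RECORDS))                     # 4/3
print(is_alpha_protective(A, B, C, RECORDS, 1.2))  # False
```

Here conf(A,B → C) = 1 and conf(B → C) = 3/4, so the extended lift is 4/3: the rule is α-discriminatory for α = 1.2 but α-protective for α = 1.5.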

Approach: (cont’d)

o Indirect Discrimination Measure

Definition 3. A PND classification rule r: D,B → C is a redlining rule if it could yield an α-discriminatory rule r′: A,B → C in combination with currently available background knowledge rules of the form rb1: A,B → D and rb2: D,B → A, where A is a discriminatory item set.

For example: {Zip = 10451, City = NYC} → Hire = no.

Approach: (cont’d)

o Data Transformation for Direct Discrimination

Direct Rule Protection:
- Converts an α-discriminatory rule into an α-protective rule

o Data Transformation for Indirect Discrimination

Indirect Rule Protection:
- Turns a redlining rule into a non-redlining rule
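To make the idea of direct rule protection concrete, here is a simplified sketch of one possible data transformation. It is an assumption made for illustration and may differ from the base paper's procedure in which records are selected and how many are changed: the class item is changed from ¬C to C in records that contain B but not A, one record at a time, until elift(A,B → C) falls below α. The records and all function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical toy records; the class item is hire=yes / hire=no.
RECORDS = [
    {"foreign_worker=yes", "city=NYC", "hire=no"},
    {"foreign_worker=yes", "city=NYC", "hire=no"},
    {"foreign_worker=no", "city=NYC", "hire=yes"},
    {"foreign_worker=no", "city=NYC", "hire=no"},
    {"foreign_worker=no", "city=LA", "hire=yes"},
]


def conf(premise, class_item, records):
    """Share of records containing `premise` that also contain `class_item`."""
    covered = [record for record in records if set(premise) <= record]
    if not covered:
        raise ValueError("confidence is undefined when supp(X) = 0")
    return Fraction(sum(class_item in record for record in covered), len(covered))


def elift(a, b, class_item, records):
    """elift(A,B -> C) = conf(A,B -> C) / conf(B -> C)."""
    return conf(set(a) | set(b), class_item, records) / conf(b, class_item, records)


def protect_rule(records, a, b, c, not_c, alpha):
    """Return a transformed copy of `records` (the input is left untouched).

    Assumed strategy: flip the class from `not_c` to `c` in records that
    contain B but not A, stopping as soon as elift(A,B -> C) < alpha or when
    no candidate record is left.
    """
    a, b = set(a), set(b)
    transformed = [set(record) for record in records]
    for record in transformed:
        if elift(a, b, c, transformed) < alpha:
            break
        if b <= record and not a <= record and not_c in record:
            record.remove(not_c)
            record.add(c)
    return transformed


A, B = {"foreign_worker=yes"}, {"city=NYC"}
PROTECTED = protect_rule(RECORDS, A, B, "hire=no", "hire=yes", alpha=1.2)
print(elift(A, B, "hire=no", RECORDS))    # 4/3
print(elift(A, B, "hire=no", PROTECTED))  # 1
```

Each flipped record raises conf(B → C) while leaving conf(A,B → C) unchanged, which is why the extended lift decreases. The loop may end without reaching the threshold if the candidates run out, so the result should be re-checked against α.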

Data sets:

• Adult data set: This data set consists of 48,842 records, split into a “train” part with 32,561 records and a “test” part with 16,281 records. The data set has 14 attributes (without class attribute).

• German Credit data set: We also used the German Credit data set. This data set consists of 1,000 records and 20 attributes (without class attribute) of bank account holders. This is a well-known real-life data set, containing both numerical and categorical attributes.

Result: (table 1)

• Misses cost (MC). This measure quantifies the percentage of rules among those extractable from the original data set that cannot be extracted from the transformed data set (side effect of the transformation process).

• Ghost cost (GC). This measure quantifies the percentage of rules among those extractable from the transformed data set that were not extractable from the original data set (side effect of the transformation process).
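Both information-loss measures reduce to set differences between the rules mined before and after the transformation. The sketch below assumes rules are represented as hashable (premise, class item) pairs and that an empty rule set yields a cost of 0; the function names and the example rules are hypothetical.

```python
def misses_cost(original_rules, transformed_rules):
    """Percentage of original rules that are no longer extractable."""
    original_rules, transformed_rules = set(original_rules), set(transformed_rules)
    if not original_rules:
        return 0.0  # assumption: no original rules means nothing can be missed
    missed = original_rules - transformed_rules
    return 100.0 * len(missed) / len(original_rules)


def ghost_cost(original_rules, transformed_rules):
    """Percentage of transformed-data rules that were not in the original."""
    original_rules, transformed_rules = set(original_rules), set(transformed_rules)
    if not transformed_rules:
        return 0.0  # assumption: no transformed rules means no ghost rules
    ghosts = transformed_rules - original_rules
    return 100.0 * len(ghosts) / len(transformed_rules)


# Hypothetical rule sets: each rule is (frozenset(premise), class_item).
R1 = (frozenset({"city=NYC"}), "hire=no")
R2 = (frozenset({"experience=low"}), "hire=no")
R3 = (frozenset({"zip=10451", "city=NYC"}), "hire=no")
R4 = (frozenset({"experience=high"}), "hire=yes")
R5 = (frozenset({"city=LA"}), "hire=yes")

ORIGINAL = {R1, R2, R3, R4}
TRANSFORMED = {R1, R2, R5}

print(misses_cost(ORIGINAL, TRANSFORMED))  # 50.0
print(ghost_cost(ORIGINAL, TRANSFORMED))   # one ghost rule out of three
```

In this example two of the four original rules are lost (MC = 50%), and one of the three rules mined from the transformed data is new (GC of about 33.3%).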

Result: (table 2)

Result: (tables 3 and 4)

Tables 3 and 4 show lower information loss in terms of the GC measure in the Adult data set than in the German Credit data set.

Future plans:

This methodology can be applied in the Indian context, for example:

• To check corruption

• To address gender discrimination

References:

1. S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté, “Rule Protection for Indirect Discrimination Prevention in Data Mining,” Proc. Eighth Int’l Conf. Modeling Decisions for Artificial Intelligence (MDAI ’11), pp. 211-222, 2011.

2. D. Pedreschi, S. Ruggieri, and F. Turini, “Discrimination-Aware Data Mining,” Proc. 14th ACM Int’l Conf. Knowledge Discovery and Data Mining (KDD ’08), pp. 560-568, 2008.

3. S. Ruggieri, D. Pedreschi, and F. Turini, “Data Mining for Discrimination Discovery,” ACM Trans. Knowledge Discovery from Data, vol. 4, no. 2, article 9, 2010.

4. S. Ruggieri, D. Pedreschi, and F. Turini, “DCUBE: Discrimination Discovery in Databases,” Proc. ACM Int’l Conf. Management of Data (SIGMOD ’10), pp. 1127-1130, 2010.

THANK YOU…!!!
