Measuring Model Biases in the Absence of Ground Truth
Osman Aka*, Google, U.S.A.
Ken Burke*, Google, U.S.A.
Alex Bäuerle†, Ulm University, Germany
Christina Greer, Google, U.S.A.
Margaret Mitchell‡, Ethical AI LLC, U.S.A.
ABSTRACT
The measurement of bias in machine learning often focuses on model performance across identity subgroups (such as man and woman) with respect to ground-truth labels [15]. However, these methods do not directly measure the associations that a model may have learned, for example between labels and identity subgroups. Further, measuring a model's bias requires a fully annotated evaluation dataset which may not be easily available in practice.
We present an elegant mathematical solution that tackles both issues simultaneously, using image classification as a working example. By treating a classification model's predictions for a given image as a set of labels analogous to a "bag of words" [17], we rank the biases that a model has learned with respect to different identity labels. We use {man, woman} as a concrete example of an identity label set (although this set need not be binary), and present rankings for the labels that are most biased towards one identity or the other. We demonstrate how the statistical properties of different association metrics can lead to different rankings of the most "gender biased" labels, and conclude that normalized pointwise mutual information (nPMI) is most useful in practice. Finally, we announce an open-sourced nPMI visualization tool using TensorBoard.
CCS CONCEPTS
• General and reference → Measurement; Metrics; • Applied computing → Document analysis.
KEYWORDS
datasets, image tagging, fairness, bias, stereotypes, information extraction, model analysis
*Equal contribution. †Work conducted during internship at Google. ‡Work conducted while author was at Google.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
AIES '21, May 19-21, 2021, Virtual Event, USA.
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8473-5/21/05 . . . $15.00
https://doi.org/10.1145/3461702.3462557
ACM Reference Format:
Osman Aka, Ken Burke, Alex Bäuerle, Christina Greer, and Margaret Mitchell. 2021. Measuring Model Biases in the Absence of Ground Truth. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES '21), May 19-21, 2021, Virtual Event, USA. ACM, New York, NY, USA, 22 pages. https://doi.org/10.1145/3461702.3462557
1 INTRODUCTION
The impact of algorithmic bias in computer vision models has been well-documented [c.f., 5, 24]. Examples of the negative fairness impacts of machine learning models include decreased pedestrian detection accuracy on darker skin tones [29], gender stereotyping in image captioning [6], and perceived racial identities impacting unrelated labels [27]. Many of these examples are directly related to currently deployed technology, which highlights the urgency of solving these fairness problems as adoption of these technologies continues to grow.
Many common metrics for quantifying fairness in machine learning models, such as Statistical Parity [11], Equality of Opportunity [15], and Predictive Parity [8], rely on datasets with a significant amount of ground truth annotations for each label under analysis. However, some of the most commonly used datasets in computer vision have relatively sparse ground truth [19]. One reason for this is the significant growth in the number of predicted labels. The benchmark challenge dataset PASCAL VOC introduced in 2008 had only 20 categories [12], while less than 10 years later, the benchmark challenge dataset ImageNet provided hundreds of categories [22]. As systems have rapidly improved, it is now common to use the full set of ImageNet categories, which number more than 20,000 [10, 26].
While large label spaces offer a more fine-grained ontology of the visual world, they also increase the cost of implementing groundtruth-dependent fairness metrics. This concern is compounded by the common practice of collecting training datasets from multiple online resources [20]. This can lead to patterns where specific labels are omitted in a biased way, either through human bias (e.g., crowdsourcing where certain valid labels or tags are omitted systematically) or through algorithmic bias (e.g., selecting labels for human verification based on the predictions of another model [19]). If the ground truth annotations in a sparsely labelled dataset are potentially biased, then the premise of a fairness metric that "normalizes" model prediction patterns to groundtruth patterns may be incomplete. In light of this difficulty, we argue that it is important to develop bias metrics that do not explicitly rely on "unbiased" ground truth labels.

arXiv:2103.03417v3 [cs.CV] 6 Jun 2021
In this work, we introduce a novel approach for measuring problematic model biases, focusing on the associations between model predictions directly. This has several advantages compared to common fairness approaches in the context of large label spaces, making it possible to identify biases after the regular practice of running model inference over a dataset. We study several different metrics that measure associations between labels, building upon work in Natural Language Processing [9] and information theory. We perform experiments on these association metrics using the Open Images Dataset [19], which has a large enough label space to illustrate how this framework can be generally applied; we note, however, that the focus of this paper is on introducing the relevant techniques, which do not require any specific dataset. We demonstrate that normalized pointwise mutual information (nPMI) is particularly useful for detecting the associations between model predictions and sensitive identity labels in this setting, and can uncover stereotype-aligned associations that the model has learned. This metric is particularly promising because:
• It requires no ground truth annotations.
• It provides a method for uncovering biases that the model itself has learned.
• It can be used to provide insight into per-label associations between model predictions and identity attributes.
• Biases for both low- and high-frequency labels can be detected and compared.
Finally, we announce an open-sourced visualization tool in TensorBoard that allows users to explore patterns of label bias in large datasets using the nPMI metric.
2 RELATED WORK
In 1990, Church and Hanks [9] introduced a novel approach to quantifying associations between words, based on mutual information [13, 23] and inspired by psycholinguistic work on word norms [28] that catalogue words that people closely associate. For example, subjects respond more quickly to the word nurse if it follows a highly associated word such as doctor [7, 14]. Church and Hanks' proposed metric applies mutual information to words using a pointwise approach, measuring co-occurrences of distinct word pairs rather than averaging over all words. This enables a quantification of the question, "How closely related are these words?" by measuring their co-occurrence rates relative to chance in the dataset of interest. In this case, the dataset of interest is a computer vision evaluation dataset, and the words are the labels that the model predicts.
This information-theoretic approach to uncovering word associations became a prominent method in the field of Natural Language Processing, applied to great effect in areas ranging from measuring topic coherence [1] to collocation extraction [4], although often requiring a good deal of preprocessing in order to incorporate details of a sentence's syntactic structure. Without preprocessing, however, this method simply measures word associations regardless of their order in sentences or relationship to one another, treating words as an unordered set of tokens (a so-called "bag of words") [16].
As we show, this simple approach can be newly applied to an emergent problem in the machine learning ethics space: the identification of problematic associations that an ML model has learned. This approach is comparable to measuring correlations, although the common correlation metric of Spearman Rank [25] operates on assumptions that are not suitable for this task, such as linearity and monotonicity. The related Kendall Rank Correlation [18] does not require such behavior, and we include comparisons with this approach.
Additionally, many potentially applicable metrics for this problem rely on simple counts of paired words, which does not take into consideration how the words are distributed with other words (e.g., sentence syntax or context); we elaborate on how this information can be formally incorporated into a bias metric in the Discussion and Future Work sections.
This work is motivated by recent research on fairness in machine learning (e.g., [15]), which at a high level seeks to define criteria that result in equal outcomes across different subpopulations. The focus in this paper is complementary to previous fairness work, honing in on ways to identify and quantify the specific problematic associations that a model may learn, rather than providing an overall measurement of a model's unfairness. It also offers an alternative to fairness metrics that rely on comprehensive ground truth labelling, which is not always available for large datasets.
Open Images Dataset. The Open Images dataset we use in this work was chosen because it is open-sourced, with millions of diverse images and a large label space of thousands of visual concepts (see [19] for more details). Furthermore, the dataset comes with pre-computed labels generated by a non-trivial algorithm that combines machine-generated predictions and human verification; this allowed us to focus on analysis of label associations (rather than training a new classifier ourselves) and uncover the most common concepts related to sensitive characteristics, which in this dataset are man and woman.
We now turn to a formal description of the problem we seek tosolve.
3 PROBLEM DEFINITION
We have a dataset D which contains image examples and labels generated by an image classifier. This classifier takes one image example and predicts "Is label y_i relevant to the image?" for each label in L = {y_1, y_2, ..., y_n}.¹ We infer P(y_i) and P(y_i, y_j) from D such that, for a given random image in D, P(y_i) is the probability of having y_i as a positive prediction and P(y_i, y_j) is the joint probability of having both y_i and y_j as positive predictions. We further assume that we have identity labels x_1, x_2, ..., x_k ∈ L that belong to some sensitive identity group for which we wish to compute a bias metric (e.g., man and woman as labels for gender).² These identity labels may even be generated by the same process that generates the labels y, as in our case where man, woman, and all other y are elements of L predicted by the classifier. We then measure bias with respect
¹For ease of notation, we use y and x rather than ŷ and x̂.
²For the rest of the paper, we focus on only two identity labels with notation x_1 and x_2 for simplicity; however, the identity labels need not be binary (or one-dimensional) in this way. The remainder of this work is straightforwardly extended to any number of identity labels by using the pairwise comparison of all identity labels, or by using a one-vs-all gap (e.g., A(x, y) − E[A(x′, y)], where x′ ranges over all other identity labels).
Figure 1: Label ranking shifts for metric-to-metric comparison. Each point represents the rank of a single label when sorted by G(y | x_1, x_2, A(·)) (i.e., the label rank by gap). The coordinates represent the rankings by gap for different association metrics A(·) on the x and y axes. A highly correlated plot along y = x would imply that the two metrics lead to very similar bias rankings according to G(·).
to these identity labels for all other labels y ∈ L. As the size of this label space |L| approaches tens of thousands and increases year by year for modern machine learning models, it is important to have a simple bias metric that can be computed and reasoned about at scale.
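Because the classifier's outputs are binary per label, P(y_i) and P(y_i, y_j) can be estimated by simple counting over the prediction matrix. A minimal sketch (the array layout and the function name are our own, not from the paper):

```python
import numpy as np

def marginal_and_joint(preds, i, j):
    """Estimate P(y_i) and P(y_i, y_j) from a binary prediction matrix.

    preds: (n_images, n_labels) array of 0/1 predictions, where
    preds[k, i] == 1 means label y_i was predicted for image k.
    """
    n = preds.shape[0]
    p_i = preds[:, i].sum() / n                   # P(y_i): fraction of images with y_i
    p_ij = (preds[:, i] * preds[:, j]).sum() / n  # P(y_i, y_j): fraction with both
    return p_i, p_ij

# Toy example: 4 images, 2 labels.
preds = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
p0, p01 = marginal_and_joint(preds, 0, 1)   # P(y_0) = 0.5, P(y_0, y_1) = 0.25
```

The same counting pass yields every marginal and joint needed below, so the metrics themselves reduce to arithmetic on these estimates.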
For ease of discussion in the rest of this paper, we denote any generic association metric A(x_i, y), where x_i is an identity label and y is any other label. We define an association gap for label y between two identity labels [x_1, x_2] with respect to the association metric A(x_i, y) as G(y | x_1, x_2, A(·)) = A(x_1, y) − A(x_2, y). For example, the association between the labels woman and Long Hair, A(woman, Long Hair), can be compared to the association between the labels man and Long Hair, A(man, Long Hair). The difference between them is the association gap for the label Long Hair:

G(Long Hair | woman, man, A(·)) = A(woman, Long Hair) − A(man, Long Hair)
We use this association gap G(·) as a measurement of the "bias" or "skew" of a given label across sensitive identity subgroups.
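Since G(·) is defined for any association metric A(·), it can be written once and reused with every metric considered later. A small illustrative sketch (the callable signature, built on marginals and joints, is our own convention):

```python
def association_gap(A, p_x1, p_x2, p_y, p_x1y, p_x2y):
    """G(y | x1, x2, A(.)) = A(x1, y) - A(x2, y), for any metric A that is
    expressible from the marginals P(x), P(y) and the joint P(x, y)."""
    return A(p_x1, p_y, p_x1y) - A(p_x2, p_y, p_x2y)

# Illustrative only: a raw co-occurrence "metric" (not one of the paper's metrics).
cooc = lambda p_x, p_y, p_xy: p_xy
g = association_gap(cooc, 0.4, 0.6, 0.1, 0.05, 0.02)   # 0.05 - 0.02 = 0.03
```

A positive gap means the label skews towards x1, a negative gap towards x2; the metrics below differ only in how A(·) is computed.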
The first objective we are interested in is the question "Is the prediction of label y biased towards either x_1 or x_2?". The second objective is then ranking the labels y by the size of this bias. If x_1 and x_2 both belong to the same identity group (e.g., man and woman are both labels for "gender"), then one may consider these measurements to approximate the gender bias.
We choose this gender example because of the abundance of these specific labels in the Open Images Dataset; however, this choice should not be interpreted to mean that gender representation is one-dimensional, nor that paired labels are required for the general approach. Nonetheless, this simplification is important because it allows us to demonstrate how a single per-label approximation of "bias" can be measured between paired labels, and we leave details of further expansions, including calculations across multiple sensitive identity labels, to the Discussion section.
3.1 Association Metrics
We consider several sets of related association metrics A(·) that can be applied given the constraints of the problem at hand: limited groundtruth, non-linearity, and limited assumptions about the underlying distribution of the data. All of these metrics share in common the general intuition of measuring how labels associate with each other in a dataset, but as we will demonstrate, they yield very different results due to differences in how they quantify this notion of "association".
We first consider fairness metrics as types of association metrics. One of the most common fairness metrics, Demographic (or Statistical) Parity [3, 11, 15], a quantification of the legal doctrine of Disparate Impact [2], can be applied directly for the given task constraints.³ Other metrics that are possible to adopt for this task include those based on Intersection-over-Union (IOU) measurements, and metrics based on correlation and statistical tests. We next describe these metrics in further detail and their relationship to the task at hand. In summary, we compare the following families of metrics:
• Fairness: Demographic Parity (DP).
• Entropy: Pointwise Mutual Information (PMI), Normalized Pointwise Mutual Information (nPMI).
• IOU: Sørensen-Dice Coefficient (SDC), Jaccard Index (JI).
• Correlation and Statistical Tests: Kendall Rank Correlation (τ), Log-Likelihood Ratio (LLR), t-test.
One of the important aspects of our problem setting is the counts of images with labels and label intersections, i.e., C(y), C(x_1, y), and C(x_2, y). These values can span a large range for different labels y in the label set L, depending on how common they are in the dataset. Some metrics are theoretically more sensitive to the frequencies/counts of the label y, as determined by their nonzero partial derivatives with respect to P(y) (see Table 2). However, as we further discuss in the Experiments and Discussion sections, our experiments indicate that in practice, metrics with non-zero partial derivatives are surprisingly better able to capture biases across a range of label frequencies than metrics with a zero partial derivative. Differential sensitivity to label frequency could be problematic in practice for two reasons:
(1) It would not be possible to compare G(y | x_1, x_2, A(·)) between different labels y with different marginal frequencies (counts) C(y). For example, the ideal bias metric should be able to capture gender bias equally well for both a common label such as Long Hair and a rare label such as Dido Flip, even though the first label is far more common than the second.
(2) The alternative, bucketizing labels by marginal frequency and setting distinct thresholds per bucket, would add significantly more hyperparameters and essentially amount to manual frequency-normalization.
The following sections contain basic explanations of these metrics for a general audience, with the running example of Long Hair, man, and woman. We leave further mathematical analyses of the metrics to the Appendix. However, integral to the application of nPMI in this task is the choice of normalization factor, and so we discuss this in further detail in the Normalizing PMI subsection.
Demographic Parity
G(y | x_1, x_2, DP) = P(y | x_1) − P(y | x_2)

³Other common fairness metrics, such as Equality of Opportunity, require both a model prediction and a groundtruth label, which makes the correct way to apply them to this task less clear. We leave this for further work.
Demographic Parity focuses on differences between the conditional probability of y given x_1 and x_2: how likely Long Hair is for man vs. woman.
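In terms of the estimated joints and marginals, the DP gap is one line of arithmetic. A hedged sketch (the probability values below are invented for illustration):

```python
def dp_gap(p_x1, p_x2, p_x1y, p_x2y):
    """G(y | x1, x2, DP) = P(y | x1) - P(y | x2), using P(y | x) = P(x, y) / P(x)."""
    return p_x1y / p_x1 - p_x2y / p_x2

# Invented probabilities: P(woman) = 0.30, P(man) = 0.40,
# P(woman, y) = 0.06, P(man, y) = 0.02 for some label y.
gap = dp_gap(0.30, 0.40, 0.06, 0.02)   # 0.20 - 0.05 = 0.15, skewed towards woman
```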
Entropy

G(y | x_1, x_2, PMI) = ln( P(x_1, y) / (P(x_1) P(y)) ) − ln( P(x_2, y) / (P(x_2) P(y)) )

Pointwise Mutual Information, adapted from information theory, is the main entropy-based metric studied here. In this form, we are analyzing the entropy difference between [x_1, y] and [x_2, y]. This essentially examines the dependence of, for example, the Long Hair label distribution on two other label distributions: man and woman.
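The PMI gap above can be sketched directly from the estimated probabilities; note that the ln P(y) terms cancel in the difference, which is why the gap's partial derivative with respect to P(y) is zero (the example values are invented):

```python
import math

def pmi(p_x, p_y, p_xy):
    """Pointwise mutual information: ln(P(x, y) / (P(x) P(y)))."""
    return math.log(p_xy / (p_x * p_y))

def pmi_gap(p_x1, p_x2, p_y, p_x1y, p_x2y):
    """G(y | x1, x2, PMI) = PMI(x1, y) - PMI(x2, y). The P(y) factors cancel,
    so the gap equals ln(P(x1, y) / P(x1)) - ln(P(x2, y) / P(x2))."""
    return pmi(p_x1, p_y, p_x1y) - pmi(p_x2, p_y, p_x2y)

g = pmi_gap(0.30, 0.40, 0.10, 0.06, 0.02)   # ln 2 - ln 0.5 = ln 4
```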
Remaining Metrics
We use the Sørensen-Dice Coefficient (SDC), which has the commonly used F1-score as one of its variants; the Jaccard Index (JI), a common metric in Computer Vision also known as Intersection over Union (IOU); the Log-Likelihood Ratio (LLR), a classic flexible comparison approach; the Kendall Rank Correlation, also known as the τ correlation, which is the particular correlation method that can be reasonably applied in this setting; and the t-test, a common statistical significance test that can be adapted to this setting [17]. Each of these metrics has different behaviours; however, we limit our mathematical explanation to the Appendix, as we found these metrics are either less useful in practice or behave similarly to other metrics in this use case.
3.2 Normalizing PMI
One major challenge in our problem setting is the sensitivity of these association metrics to the frequencies/counts of the labels in L. Some metrics are weighted more heavily towards common labels (i.e., large marginal counts, C(y)) in spite of differences in their joint probabilities with identity labels (P(x_1, y), P(x_2, y)). The opposite is true for other metrics, which are weighted towards rare labels with smaller marginal frequencies. In order to compensate for this problem, several different normalization techniques have been applied to PMI [4, 21]. Common normalizations include:
• nPMI_y: normalizing each term by ln P(y).
• nPMI_xy: normalizing the two terms by ln P(x_1, y) and ln P(x_2, y), respectively.
• PMI²: using P(x_1, y)² and P(x_2, y)² instead of P(x_1, y) and P(x_2, y), the normalization effects of which are further illustrated in the Appendix.
Each of these normalization methods has a different impact on the PMI metric. The main advantage of these normalizations is the ability to compare association gaps between label pairs [y_1, y_2] (e.g., comparing the gender skews of two labels like Long Hair and Dido Flip) even if P(y_1) and P(y_2) are very different. In the Experiments section, we discuss which of these is most effective and meaningful for the fairness and bias use case motivating this work.
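The normalized variants can be sketched as small wrappers around PMI. This assumes the standard nPMI formulation (normalizing by the negative log so that nPMI_xy is bounded in [-1, 1]); the function names are ours:

```python
import math

def pmi(p_x, p_y, p_xy):
    return math.log(p_xy / (p_x * p_y))

def npmi_y(p_x, p_y, p_xy):
    """PMI normalized by -ln P(y)."""
    return pmi(p_x, p_y, p_xy) / -math.log(p_y)

def npmi_xy(p_x, p_y, p_xy):
    """PMI normalized by -ln P(x, y); bounded in [-1, 1], with 1 at perfect
    co-occurrence and 0 at independence."""
    return pmi(p_x, p_y, p_xy) / -math.log(p_xy)

def pmi_squared(p_x, p_y, p_xy):
    """PMI^2: P(x, y)^2 replaces P(x, y) in the numerator."""
    return math.log(p_xy ** 2 / (p_x * p_y))
```

For example, when x and y always co-occur (P(x) = P(y) = P(x, y)), npmi_xy returns exactly 1 regardless of how rare the pair is, which is the frequency-robustness the gap rankings rely on.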
4 EXPERIMENTS
In order to compare these metrics, we use the Open Images Dataset (OID) [19] described above in the Related Work section. This dataset is useful for demonstrating realistic bias detection use cases because of the number of distinct labels that may be applied to the images (the dataset itself is annotated with nearly 20,000 labels). The label space is also diverse, including objects (cat, car, tree), materials (leather, basketball), moments/actions (blowing out candles, woman playing guitar), and scene descriptors and attributes (beautiful, smiling, forest). We seek to measure the gender bias as described above for each of these labels, and compare these bias values directly in order to determine which of the labels are most skewed towards associating with the labels man and woman.
In our experiments, we apply each of the association metrics A(x_i, y) to the machine-generated predictions for identity labels x_1 = man, x_2 = woman and all other labels y in the Open Images Dataset. We then compute the gap value G(·) between the identity labels for each label y and sort, providing a ranking of labels that are most biased towards x_1 or x_2. As we will show, sorting labels by this association gap creates different rankings for different association metrics. We examine which labels are ranked within the top 100 for the different association metrics in the Top 100 Labels by Metric Gaps subsection.
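The compute-gap-and-sort pipeline described above can be sketched end to end over a binary prediction matrix. This is our own minimal reconstruction, not the paper's implementation; any of the metric functions with signature (p_x, p_y, p_xy) plugs in:

```python
import numpy as np

def rank_labels_by_gap(preds, idx_x1, idx_x2, metric, top_k=100):
    """Sort all labels by G(y | x1, x2, metric) over a 0/1 prediction matrix."""
    n = preds.shape[0]
    p = preds.mean(axis=0)                             # marginals P(y)
    joint_x1 = (preds[:, [idx_x1]] * preds).sum(axis=0) / n   # P(x1, y) per label
    joint_x2 = (preds[:, [idx_x2]] * preds).sum(axis=0) / n   # P(x2, y) per label
    gaps = {}
    for y in range(preds.shape[1]):
        # Skip the identity labels themselves and labels never co-predicted.
        if y in (idx_x1, idx_x2) or joint_x1[y] == 0 or joint_x2[y] == 0:
            continue
        gaps[y] = (metric(p[idx_x1], p[y], joint_x1[y])
                   - metric(p[idx_x2], p[y], joint_x2[y]))
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy example: 4 images, labels [x1, x2, y2, y3], Demographic Parity as the metric.
preds = np.array([[1, 0, 1, 0],
                  [1, 0, 1, 1],
                  [0, 1, 0, 1],
                  [0, 1, 1, 1]])
dp = lambda p_x, p_y, p_xy: p_xy / p_x       # A(x, y) = P(y | x)
ranked = rank_labels_by_gap(preds, 0, 1, dp)
# ranked -> [(2, 0.5), (3, -0.5)]: label 2 skews towards x1, label 3 towards x2.
```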
4.1 Label Ranks
The first experiment we performed is to compute the association metrics and the gaps between them for different labels, i.e., A(x_1, y), A(x_2, y), and G(y | x_1, x_2, A(·)), over the OID dataset. We then sorted the labels by G(·) and studied how the ranking by gap differed between different association metrics A(·). Figure 1 shows examples of these metric-to-metric comparisons of label rankings by gap (all other metric comparisons can be found in the Appendix). We can see that a single label can have quite a different ranking depending on the metric.
When comparing metrics, we found that they grouped together in a few clusters based on similar ranking patterns when sorting by G(y | man, woman, A(·)). In the first cluster, pairwise comparisons between PMI, PMI², and LLR show linear relationships when sorting labels by G(·). Indeed, while some labels show modest changes in rank between these metrics, they share all of their top 100 labels, and more than 99% of label pairs maintain the same relative ranking between metrics. By contrast, there are only 7 labels in common between the top 100 labels of PMI² and SDC. Due to the similar behavior of this cluster of metrics, we chose to focus on PMI as representative of these 3 metrics moving forward (see the Appendix for further details on these relationships).
Similar results were obtained for another cluster of metrics: DP, JI, and SDC. All pairwise comparisons generated linear plots, with about 70% of the top 100 labels shared in common between these metrics when sorted by G(·). Furthermore, about 95% of pairs of those overlapping labels maintained the same relative ranking between metrics. Similar to the PMI cluster, we chose to focus on Demographic Parity (DP) as the representative metric from its cluster, due to its mathematical simplicity and prominence in fairness literature.
We next sought to understand how incremental changes to the counts of the labels, C(y), affect these association gap rankings in a real dataset (see Appendix Section E.2). To achieve this, we added a fake label to the real labels of OID, setting initial values for its
Metric   | Min/Max C(y)      | Min/Max C(x_1, y) | Min/Max C(x_2, y)
PMI      | 15 / 10,551       | 1 / 1,059         | 8 / 7,755
PMI²     | 15 / 10,551       | 1 / 1,059         | 8 / 7,755
LLR      | 15 / 10,551       | 1 / 1,059         | 8 / 7,755
DP       | 6,104 / 785,045   | 628 / 239,950     | 5,347 / 197,795
JI       | 4,158 / 562,445   | 399 / 144,185     | 3,359 / 183,132
SDC      | 2,906 / 562,445   | 139 / 144,185     | 2,563 / 183,132
nPMI_y   | 35 / 562,445      | 1 / 144,185       | 9 / 183,132
nPMI_xy  | 34 / 270,748      | 1 / 144,185       | 20 / 183,132
τ        | 6,104 / 785,045   | 628 / 207,723     | 5,347 / 183,132
t-test   | 960 / 562,445     | 72 / 144,185      | 870 / 183,132

Table 1: Minimum and maximum counts C(y) of the top 100 labels with the largest association gaps for each metric. Note that these min/max values for C(y) vary by orders of magnitude for different metrics. A larger version of this table is available in Appendix, Table 7.
counts and co-occurrences in the dataset, P(y), P(x_1, y), and P(x_2, y). Then we incrementally increased or decreased the count of label y, and measured whether its bias ranking in G would change relative to the other labels in OID. We repeated this procedure for different orders of magnitude of label count C(y) while keeping the ratio P(x_1, y)/P(x_2, y) constant.
This experiment allowed us to determine whether the theoretical sensitivities of each metric to label frequency P(y), as determined by partial derivatives ∂A(·)/∂P(y) (see Table 2), would hold in the context of real-world data, where the underlying distribution of label frequencies may not be uniform. If certain subregions of the label distribution are relatively sparse, for example, then the gap ranking of our hypothetical label may not change even if ∂A(·)/∂C(y) ≠ 0. However, in practice we do not observe this behavior in the tested settings (see Appendix for plots of these experiments), where label rank moves with label count roughly as predicted by the partial derivatives in Table 2. In fact, we observed that metrics with larger partial derivatives for x_1, x_2, or y often led to a larger change in rank. For example, slightly increasing P(x_1, y) when y always co-occurs with x_1 (i.e., P(y) = P(x_1, y)) affects ranking more for A = nPMI_y than for A = PMI (see Appendix).
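The synthetic-label procedure can be sketched as follows. This is a simplified reconstruction (function name, invented gap values, and the assumption that the fake label always co-occurs with one of the identity labels are ours):

```python
import math

def synthetic_rank(gaps, c_y, ratio, c_x1, c_x2, n, metric):
    """Rank a synthetic label y* would take among precomputed gap values,
    given its total count c_y and a fixed joint ratio C(x1, y*)/C(x2, y*).
    Assumes, for this sketch, that every image with y* also has x1 or x2."""
    c_x1y = c_y * ratio / (1 + ratio)   # split c_y between x1 and x2 by the ratio
    c_x2y = c_y - c_x1y
    g = (metric(c_x1 / n, c_y / n, c_x1y / n)
         - metric(c_x2 / n, c_y / n, c_x2y / n))
    return sum(1 for v in gaps if v > g)   # 0 means top-ranked

pmi = lambda p_x, p_y, p_xy: math.log(p_xy / (p_x * p_y))
other_gaps = [0.5, 1.0, 2.0]   # invented gap values standing in for the rest of OID

# Scaling C(y*) tenfold at a fixed joint ratio leaves the PMI gap, and hence
# the rank, unchanged, consistent with the gap's zero derivative w.r.t. P(y).
r_small = synthetic_rank(other_gaps, 100, 4.0, 1000, 2000, 10_000, pmi)
r_large = synthetic_rank(other_gaps, 1000, 4.0, 1000, 2000, 10_000, pmi)
```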
4.2 Top 100 Labels by Metric Gaps
When applying these metrics to fairness and bias use cases, model users may be most interested in surfacing the labels with the largest association gaps. If one filters results to a "top K" label set, then the normalization chosen could lead to vastly different sets of labels
Figure 2: Top 100 count distributions for PMI, nPMI_xy, and τ. The distribution of C(y), C(x_1, y), and C(x_2, y) for the top 100 labels sorted by gap G(y | x_1, x_2, A(·)) for PMI, nPMI_xy, and τ. The x-axis shows logarithmic-scaled bins and the y-axis the number of labels whose count values fall in each bin.
Metric    | ∂/∂P(y)                                              | ∂/∂P(x_1, y)                                            | ∂/∂P(x_2, y)
∂DP       | 0                                                    | 1/P(x_1)                                                | −1/P(x_2)
∂PMI      | 0                                                    | 1/P(x_1, y)                                             | −1/P(x_2, y)
∂nPMI_y   | ln(P(x_2|y)/P(x_1|y)) / (ln²(P(y)) P(y))             | 1/(ln(P(y)) P(x_1, y))                                  | −1/(ln(P(y)) P(x_2, y))
∂nPMI_xy  | 1/(ln(P(x_1, y)) P(y)) − 1/(ln(P(x_2, y)) P(y))      | (ln(P(y)) − ln(P(x_1))) / (ln²(P(x_1, y)) P(x_1, y))    | (ln(P(x_2)) − ln(P(y))) / (ln²(P(x_2, y)) P(x_2, y))
∂PMI²     | 0                                                    | 2/P(x_1, y)                                             | −2/P(x_2, y)
∂SDC      | see Appendix C                                       | 1/(P(x_1) + P(y))                                       | −1/(P(x_2) + P(y))
∂JI       | see Appendix C                                       | (P(x_1) + P(y)) / (P(x_1) + P(y) − P(x_1, y))²          | −(P(x_2) + P(y)) / (P(x_2) + P(y) − P(x_2, y))²
∂LLR      | 0                                                    | 1/P(x_1, y)                                             | −1/P(x_2, y)
∂τ        | see Appendix C                                       | (2 − 4/π) / √((P(x_1) − P(x_1)²)(P(y) − P(y)²))         | (4/π − 2) / √((P(x_2) − P(x_2)²)(P(y) − P(y)²))
∂t-test   | (√P(x_2) − √P(x_1)) / (2√P(y))                       | 1/√(P(x_1) P(y))                                        | −1/√(P(x_2) P(y))

Table 2: Metric orientations.
This table shows the partial derivatives of the metrics with respect to P(y), P(x_1, y), and P(x_2, y). This provides a quantification of the theoretical sensitivity of the metrics for different probability values of P(y), P(x_1, y), and P(x_2, y). We see similar experimental results for the metrics with similar orientations (see Appendix).
(e.g., as mentioned earlier, PMI² and SDC only shared 7 labels in their top 100 sets for OID).
To further analyze this issue, we calculated simple values for each metric's top 100 labels sorted by G(·): minimum and maximum values of C(y), C(x_1, y), and C(x_2, y), as shown in Table 1. The most salient point is that the clusters of metrics from the Label Ranks subsection also appear to hold in this analysis as well; PMI, PMI², and LLR have low C(y), C(x_1, y), and C(x_2, y) ranges, whereas DP, JI, and SDC have relatively high ranges. Another straightforward observation we can make is that the nPMI_y and nPMI_xy ranges are much broader than the first two clusters, and include the other metrics' ranges, especially for the joint co-occurrences C(x_1, y) and C(x_2, y).
To demonstrate this point more clearly, we plot the distributions of these counts for the PMI, nPMI_xy, and τ skews in Figure 2 (all other combinations can be found in the Appendix). These three metric distributions show that gap calculations based on PMI (blue distribution) exclusively rank labels with low counts in the top 100 most skewed labels, whereas τ calculations (green distribution) almost exclusively rank labels with much higher counts. The exception is nPMI_y and nPMI_xy (orange distribution); these two metrics are capable of capturing labels across a range of marginal frequencies. In other words, ranking labels by PMI gaps is likely to highlight rare labels, ranking by τ will highlight common labels, and ranking by nPMI_xy will highlight both rare and common labels.
An example of this relationship between association metric choice and label commonality can be seen in Table 3 (note, here we use Demographic Parity instead of τ because they behave similarly in this respect). In this table, we show the "Top 15 labels" most heavily skewed towards woman relative to man according to DP, unnormalized PMI, and nPMI_xy. DP almost exclusively highlights common labels predicted for over 100,000 images in the dataset (e.g., Happiness and Fashion), whereas PMI largely highlights rarer labels predicted for fewer than 1,000 images (e.g., Treggings and Boho-chic). By contrast, nPMI_xy highlights both common and rare labels (e.g., Long Hair as well as Boho-chic).
5 DISCUSSION
In the previous section, we first showed that some association metrics behave very similarly when ranking labels from the Open Images Dataset (OID). We then showed that the mathematical orientations and sensitivity of these metrics align with experimental results from OID. Finally, we showed that the different normalizations affect whether labels with high or low marginal frequencies are likely to be detected as having a significant bias according to G(y | x_1, x_2, A(·)) in this dataset. We arrive at the conclusion that the nPMI metrics are preferable to other commonly used association metrics in the problem setting of detecting biases without groundtruth labels.
What is the intuition behind this particular association metric as a bias metric? All of the studied entropy-based metrics (gaps in nPMI as well as PMI) approximately correspond to whether one identity label x_1 co-occurs with the target label y more often than another identity label x_2, relative to chance levels. This chance-level normalization is important because even completely unbiased labels would still co-occur at some baseline rate by chance alone.
The further normalization of PMI by either the marginal or joint probability (nPMI_y and nPMI_xy, respectively) takes this one step further in practice by surfacing labels with larger marginal counts at higher ranks in G(y | x_1, x_2, nPMI) alongside labels with smaller marginal counts. This is a somewhat surprising result, because in theory PMI should already be independent of the marginal frequency P(y) (because ∂PMI/∂P(y) = 0), whereas this derivative for nPMI is non-zero. When we examined this pattern in practice, we found that labels with smaller counts can achieve very large P(x_1, y)/P(x_2, y) ratios (and therefore very high bias rankings) merely by reducing the denominator to a single image example. PMI is unable to compensate for this noise, whereas the normalizations we use for nPMI allow us to capture a significant number of common labels in the top 100 labels by G(y | x_1, x_2, nPMI) in spite of this pattern. This result is indicated by the ranges in Table 1 and Figure 2, as well as the set of both common and rare labels for nPMI in Table 3.
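This small-count noise effect can be seen in a toy numeric illustration (all counts below are invented, not taken from OID): a 10-image label skewed 9-to-1 outranks a 10,000-image label skewed 7,000-to-3,000 under PMI, while nPMI_xy ranks the common label slightly higher.

```python
import math

pmi = lambda p_x, p_y, p_xy: math.log(p_xy / (p_x * p_y))
npmi_xy = lambda p_x, p_y, p_xy: pmi(p_x, p_y, p_xy) / -math.log(p_xy)

n = 1_000_000
p_man = p_woman = 0.2   # invented identity-label marginals

# Rare label: 10 images total, 9 co-occurring with woman, only 1 with man.
rare_pmi = pmi(p_woman, 10 / n, 9 / n) - pmi(p_man, 10 / n, 1 / n)
rare_npmi = npmi_xy(p_woman, 10 / n, 9 / n) - npmi_xy(p_man, 10 / n, 1 / n)

# Common label: 10,000 images, 7,000 with woman, 3,000 with man.
common_pmi = pmi(p_woman, 1e4 / n, 7e3 / n) - pmi(p_man, 1e4 / n, 3e3 / n)
common_npmi = npmi_xy(p_woman, 1e4 / n, 7e3 / n) - npmi_xy(p_man, 1e4 / n, 3e3 / n)

# PMI ranks the noisy rare label far above the common one (gap ln 9 vs ln(7/3)),
# driven almost entirely by the single-image denominator.
```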
Indeed, if the evaluation set is properly designed to match the distribution of use cases of a classification model "in the wild", then we argue that more common labels with a smaller p(x_1,y)/p(x_2,y) ratio are still critical to audit for biases. Normalization strategies must be titrated carefully to balance this simple ratio of joint probabilities against the label's rarity in the dataset.
An alternative solution to this problem could be to bucket labels by their marginal frequency. We argue this is suboptimal for two reasons. First, determining even a single threshold hyperparameter is a painful process when defining fairness constraints; systems that prevent models from being published when their fairness discrepancies exceed a threshold would then be required to titrate this threshold for every bucket. Second, bucketing labels by frequency is essentially a manual and discontinuous form of normalization; we argue that building normalization directly into the metric is a more elegant solution.
Finally, to enable detailed investigation of model predictions, we implemented and open-sourced a tool to visualize nPMI metrics as a TensorBoard plugin for developers⁴ (see Figure 3). It allows users to investigate discrepancies between two or more identity labels and their pairwise comparisons. Users can visualize probabilities, image counts, and sample and label distributions, and can filter, flag, and download these results.
⁴ https://github.com/tensorflow/tensorboard/tree/master/tensorboard/plugins/npmi
Rank | DP: Label y (Count)         | PMI: Label y (Count)       | nPMI_xy: Label y (Count)
0    | [omitted] (265,853)         | Dido Flip (140)            | [omitted] (610)
1    | [omitted] (270,748)         | Webcam Model (184)         | Dido Flip (140)
2    | [omitted] (221,017)         | Boho-chic (151)            | [omitted] (2,906)
3    | [omitted] (166,186)         | [omitted] (610)            | Eye Liner (3,144)
4    | Beauty (562,445)            | Treggings (126)            | Long Hair (56,832)
5    | Long Hair (56,832)          | Mascara (539)              | Mascara (539)
6    | Happiness (117,562)         | [omitted] (145)            | Lipstick (8,688)
7    | Hairstyle (145,151)         | Lace Wig (70)              | Step Cutting (6,104)
8    | Smile (144,694)             | Eyelash Extension (1,167)  | Model (10,551)
9    | Fashion (238,100)           | Bohemian Style (460)       | Eye Shadow (1,235)
10   | Fashion Designer (101,854)  | [omitted] (78)             | Photo Shoot (8,775)
11   | Iris (120,411)              | Gravure Idole (200)        | Eyelash Extension (1,167)
12   | Skin (202,360)              | [omitted] (165)            | Boho-chic (460)
13   | Textile (231,628)           | Eye Shadow (1,235)         | Webcam Model (151)
14   | Adolescence (221,940)       | [omitted] (156)            | Bohemian Style (184)

Table 3: Top 15 labels skewed towards woman. Ranking by the gap of label y between woman (x_1) and man (x_2), according to G(y|x_1, x_2, A(·)). Identity label names are omitted.
Figure 3: Open-sourced tool we implemented for bias investigation. In A, annotations are filtered by an association metric, in this case the nPMI difference (nPMI_gap) between male and female. In B, the distribution of association values. In C, the filtered annotations view. In D, different models or datasets can be selected for display. In E, a parallel coordinates visualization of selected annotations to assess where association differences come from.
6 CONCLUSION AND FUTURE WORK

In this paper we have described association metrics that can measure gaps (biases or skews) towards specific labels in the large label space of current computer vision classification models. These metrics do not require ground-truth annotations, which allows them to be applied in contexts where it is difficult to apply standard fairness metrics such as Equality of Opportunity [15]. According to our experiments, normalized pointwise mutual information (nPMI) is a particularly useful metric for measuring specific biases in a real-world dataset with a large label space, e.g., the Open Images Dataset.
This paper also introduces several questions for future work. The first is whether nPMI is similarly useful as a bias metric in small label spaces (e.g., credit and loan applications). Second, if we had exhaustive ground-truth labels for such a dataset, how would the sensitivity of nPMI in detecting biases compare to ground-truth-dependent fairness metrics? Finally, in this work we treated the labels predicted for an image as a flat set. However, just as sentences have rich syntactic structure beyond the "bag-of-words" model in NLP, images have rich structure and relationships between objects that are not captured by mere rates of binary co-occurrence. This opens up the possibility that within-image label relationships could be leveraged to better understand how concepts are associated in a large computer vision dataset. We leave these questions for future work.
REFERENCES
[1] Nikolaos Aletras and Mark Stevenson. 2013. Evaluating Topic Coherence Using Distributional Semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013). Association for Computational Linguistics, Potsdam, Germany, 13–22. https://www.aclweb.org/anthology/W13-0102
[2] Solon Barocas and Andrew D. Selbst. 2014. Big Data's Disparate Impact. SSRN eLibrary (2014).
[3] Richard Berk. 2016. A primer on fairness in criminal justice risk assessments. The Criminologist 41, 6 (2016), 6–9.
[4] G. Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction.
[5] Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification (Proceedings of Machine Learning Research, Vol. 81), Sorelle A. Friedler and Christo Wilson (Eds.). PMLR, New York, NY, USA, 77–91. http://proceedings.mlr.press/v81/buolamwini18a.html
[6] Kaylee Burns, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. 2018. Women also Snowboard: Overcoming Bias in Captioning Models. CoRR abs/1803.09797 (2018). arXiv:1803.09797 http://arxiv.org/abs/1803.09797
[7] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356, 6334 (2017), 183–186. https://doi.org/10.1126/science.aal4230
[8] Alexandra Chouldechova. 2016. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv:1610.07524 [stat.AP]
[9] Kenneth Ward Church and Patrick Hanks. 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16, 1 (1990), 22–29. https://www.aclweb.org/anthology/J90-1003
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[11] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. 2011. Fairness Through Awareness. CoRR abs/1104.3913 (2011). arXiv:1104.3913 http://arxiv.org/abs/1104.3913
[12] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2015. The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision 111, 1 (Jan. 2015), 98–136. https://doi.org/10.1007/s11263-014-0733-5
[13] Robert M. Fano. 1961. Transmission of Information: A Statistical Theory of Communications. American Journal of Physics 29 (1961), 793–794.
[14] A. G. Greenwald, D. E. McGhee, and J. L. Schwartz. 1998. Measuring individual differences in implicit cognition: the implicit association test. Journal of Personality and Social Psychology 74 (1998). Issue 6.
[15] Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of Opportunity in Supervised Learning. arXiv:1610.02413 [cs.LG]
[16] Zellig Harris. 1954. Distributional structure. Word 10, 2-3 (1954), 146–162. https://doi.org/10.1007/978-94-009-8467-7_1
[17] Dan Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, Upper Saddle River, NJ.
[18] M. G. Kendall. 1938. A New Measure of Rank Correlation. Biometrika 30, 1-2 (1938), 81–93. https://doi.org/10.1093/biomet/30.1-2.81
[19] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. 2018. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. CoRR abs/1811.00982 (2018). arXiv:1811.00982 http://arxiv.org/abs/1811.00982
[20] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. CoRR abs/1405.0312 (2014). arXiv:1405.0312 http://arxiv.org/abs/1405.0312
[21] François Role and Mohamed Nadif. 2011. Handling the Impact of Low Frequency Events on Co-Occurrence Based Measures of Word Similarity: A Case Study of Pointwise Mutual Information. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR, IC3K 2011). SciTePress, 218–223.
[22] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. https://doi.org/10.1007/s11263-015-0816-y
[23] Claude E. Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal 27, 3 (1948), 379–423.
[24] Jacob Snow. 2018. Amazon's Face Recognition Falsely Matched 28 Members of Congress With Mugshots. (2018).
[25] C. Spearman. 1904. The Proof and Measurement of Association Between Two Things. American Journal of Psychology 15 (1904), 88–103.
[26] Stanford Vision Lab. 2020. ImageNet. http://image-net.org/explore (2020). Accessed 6 Oct. 2020.
[27] Pierre Stock and Moustapha Cisse. 2018. ConvNets and ImageNet Beyond Accuracy: Understanding Mistakes and Uncovering Biases. In Proceedings of the European Conference on Computer Vision (ECCV). 498–512.
[28] M. P. Toglia and W. F. Battig. 1978. Handbook of Semantic Word Norms. Lawrence Erlbaum.
[29] Benjamin Wilson, Judy Hoffman, and Jamie Morgenstern. 2019. Predictive Inequity in Object Detection. CoRR abs/1902.11097 (2019). arXiv:1902.11097 http://arxiv.org/abs/1902.11097
APPENDIX

The following sections break down all of the different metrics we examined. In Section A, Gap Calculations, we present the calculations of the label association metric gaps, a measurement of the skew with respect to the given sensitive labels x_1 and x_2, such as woman (x_1) and man (x_2). In Section C, Metric Orientations, we present the derivations used to calculate their orientations, which represent the theoretical sensitivities of each metric to various marginal or joint label probabilities.
A GAP CALCULATIONS

• Demographic Parity (DP):

  DP_gap = p(y|x_1) - p(y|x_2)

• Sørensen-Dice Coefficient (SDC):

  SDC_gap = p(x_1,y)/(p(x_1) + p(y)) - p(x_2,y)/(p(x_2) + p(y))

• Jaccard Index (JI):

  JI_gap = p(x_1,y)/(p(x_1) + p(y) - p(x_1,y)) - p(x_2,y)/(p(x_2) + p(y) - p(x_2,y))

• Log-Likelihood Ratio (LLR):

  LLR_gap = ln(p(x_1|y)) - ln(p(x_2|y))

• Pointwise Mutual Information (PMI):

  PMI_gap = ln( p(x_1,y)/(p(x_1)p(y)) ) - ln( p(x_2,y)/(p(x_2)p(y)) )

• Normalized Pointwise Mutual Information, p(y) normalization (nPMI_y):

  nPMI_y_gap = ln( p(x_1,y)/(p(x_1)p(y)) ) / ln(p(y)) - ln( p(x_2,y)/(p(x_2)p(y)) ) / ln(p(y))

• Normalized Pointwise Mutual Information, p(x,y) normalization (nPMI_xy):

  nPMI_xy_gap = ln( p(x_1,y)/(p(x_1)p(y)) ) / ln(p(x_1,y)) - ln( p(x_2,y)/(p(x_2)p(y)) ) / ln(p(x_2,y))

• Squared Pointwise Mutual Information (PMI²):

  PMI2_gap = ln( p(x_1,y)²/(p(x_1)p(y)) ) - ln( p(x_2,y)²/(p(x_2)p(y)) )

• Kendall Rank Correlation (τ_b):

  τ_b = (n_c - n_d) / sqrt( (n_0 - n_1)(n_0 - n_2) )
  τ_b_gap = τ_b(v_1 = x_1, v_2 = y) - τ_b(v_1 = x_2, v_2 = y)

• t-test:

  t-test_gap = ( p(x_1,y) - p(x_1)p(y) ) / sqrt( p(x_1)p(y) ) - ( p(x_2,y) - p(x_2)p(y) ) / sqrt( p(x_2)p(y) )
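The gap calculations above can be collected into a single routine. The following is an illustrative sketch (our own code, not the authors' implementation) that computes the probability-based gaps from raw co-occurrence counts over a dataset of n images:

```python
import math

def gaps(c_x1y, c_x2y, c_x1, c_x2, c_y, n):
    """Association-metric gaps toward x1 over x2 for one target label y,
    computed from raw co-occurrence counts over n examples."""
    p_x1y, p_x2y = c_x1y / n, c_x2y / n
    p_x1, p_x2, p_y = c_x1 / n, c_x2 / n, c_y / n
    pmi1 = math.log(p_x1y / (p_x1 * p_y))   # PMI(x1; y)
    pmi2 = math.log(p_x2y / (p_x2 * p_y))   # PMI(x2; y)
    return {
        "DP":      p_x1y / p_x1 - p_x2y / p_x2,            # p(y|x1) - p(y|x2)
        "SDC":     p_x1y / (p_x1 + p_y) - p_x2y / (p_x2 + p_y),
        "JI":      (p_x1y / (p_x1 + p_y - p_x1y)
                    - p_x2y / (p_x2 + p_y - p_x2y)),
        "LLR":     math.log(p_x1y / p_y) - math.log(p_x2y / p_y),
        "PMI":     pmi1 - pmi2,
        "nPMI_y":  (pmi1 - pmi2) / math.log(p_y),
        "nPMI_xy": pmi1 / math.log(p_x1y) - pmi2 / math.log(p_x2y),
        # PMI^2 uses ln(p(x,y)^2 / (p(x)p(y))) = PMI + ln p(x,y).
        "PMI^2":   (pmi1 + math.log(p_x1y)) - (pmi2 + math.log(p_x2y)),
    }

# Hypothetical label: tagged 8,000 times overall in 1M images,
# 5,000 times alongside x1 and 2,000 times alongside x2.
g = gaps(5_000, 2_000, 300_000, 280_000, 8_000, 1_000_000)
```

The counts, like the label itself, are invented for illustration; in practice each row of a dataset's label co-occurrence table would be passed through this routine and the results ranked per metric.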
B MATHEMATICAL REDUCTIONS

• PMI:

PMI_gap = ln( p(x_1,y)/(p(x_1)p(y)) ) - ln( p(x_2,y)/(p(x_2)p(y)) )    (1)
        = ln( (p(x_1,y)/(p(x_1)p(y))) · (p(x_2)p(y)/p(x_2,y)) )    (2)
        = ln( (p(x_1,y)/p(x_1)) · (p(x_2)/p(x_2,y)) )    (3)
        = ln( p(y|x_1)/p(y|x_2) )    (4)
        = ln(p(y|x_1)) - ln(p(y|x_2))    (5)

• nPMI, p(y) normalization:

nPMI_gap = ln( p(x_1,y)/(p(x_1)p(y)) ) / ln(p(y)) - ln( p(x_2,y)/(p(x_2)p(y)) ) / ln(p(y))    (6)
         = ( ln(p(y|x_1)) - ln(p(y|x_2)) ) / ln(p(y))    (7)

• nPMI, p(x,y) normalization:

nPMI_gap = ln( p(x_1,y)/(p(x_1)p(y)) ) / ln(p(x_1,y)) - ln( p(x_2,y)/(p(x_2)p(y)) ) / ln(p(x_2,y))    (8)
         = ln( (p(x_1,y)/(p(x_1)p(y)))^(1/ln(p(x_1,y))) ) - ln( (p(x_2,y)/(p(x_2)p(y)))^(1/ln(p(x_2,y))) )    (9)
         = ln( (p(x_1,y)/(p(x_1)p(y)))^(1/ln(p(x_1,y))) / (p(x_2,y)/(p(x_2)p(y)))^(1/ln(p(x_2,y))) )    (10)
         = ln( (p(x_1,y)/p(x_1))^(1/ln(p(x_1,y))) / (p(x_2,y)/p(x_2))^(1/ln(p(x_2,y))) · p(y)^(1/ln(p(x_2,y)) - 1/ln(p(x_1,y))) )    (11)
         = ln( (p(x_1,y)/p(x_1))^(1/ln(p(x_1,y))) / (p(x_2,y)/p(x_2))^(1/ln(p(x_2,y))) ) + ln( p(y)^(1/ln(p(x_2,y)) - 1/ln(p(x_1,y))) )    (12)
         = ln( p(y|x_1)^(1/ln(p(x_1,y))) / p(y|x_2)^(1/ln(p(x_2,y))) ) + ln(p(y)) · ( 1/ln(p(x_2,y)) - 1/ln(p(x_1,y)) )    (13)
         = ln(p(y|x_1))/ln(p(x_1,y)) - ln(p(y|x_2))/ln(p(x_2,y)) + ln(p(y)) · ln( p(x_1,y)/p(x_2,y) ) / ( ln(p(x_1,y)) ln(p(x_2,y)) )    (14)
         = ln(p(x_2))/ln(p(x_2,y)) - ln(p(x_1))/ln(p(x_1,y)) + ln(p(y)) · ln( p(x_1,y)/p(x_2,y) ) / ( ln(p(x_1,y)) ln(p(x_2,y)) )    (15)
         = ln(p(x_2))/ln(p(x_2,y)) - ln(p(x_1))/ln(p(x_1,y)) + ln(p(y))/ln(p(x_2,y)) - ln(p(y))/ln(p(x_1,y))    (16)

• PMI²:

PMI2_gap = ln( p(x_1,y)²/(p(x_1)p(y)) ) - ln( p(x_2,y)²/(p(x_2)p(y)) )    (18)
         = ln( (p(x_1,y)²/(p(x_1)p(y))) · (p(x_2)p(y)/p(x_2,y)²) )    (19)
         = ln( (p(x_1,y)²/p(x_1)) · (p(x_2)/p(x_2,y)²) )    (20)
         = ln( p(y|x_1)p(x_1,y) / (p(y|x_2)p(x_2,y)) )    (21)
         = ln(p(y|x_1)) - ln(p(y|x_2)) + ln(p(x_1,y)) - ln(p(x_2,y))    (22)

• Kendall Rank Correlation (τ_b):

Formal definition:

τ_b = (n_c - n_d) / sqrt( (n_0 - n_1)(n_0 - n_2) )

where
- n_0 = n(n - 1)/2
- n_1 = Σ_i t_i(t_i - 1)/2
- n_2 = Σ_j u_j(u_j - 1)/2
- n_c is the number of concordant pairs
- n_d is the number of discordant pairs
- t_i is the number of tied values in the i-th group of ties for the first quantity
- u_j is the number of tied values in the j-th group of ties for the second quantity

New notation for our use case:

τ_b(v_1 = x_1, v_2 = y) = ( n_c(v_1 = x_1, v_2 = y) - n_d(v_1 = x_1, v_2 = y) ) / sqrt( (C(n, 2) - n_t(v = x_1)) · (C(n, 2) - n_t(v = y)) )

where
- n_c(v_1, v_2) = C(N_{v_1=0, v_2=0}, 2) + C(N_{v_1=1, v_2=1}, 2)
- n_d(v_1, v_2) = C(N_{v_1=0, v_2=1}, 2) + C(N_{v_1=1, v_2=0}, 2)
- n_t(v) = C(N_{v=0}, 2) + C(N_{v=1}, 2)
- C(·, 2) is the binomial coefficient "choose 2", and N_condition is the number of examples satisfying the condition
- n is the number of examples in the dataset

In terms of probabilities:

τ_b(v_1 = x, v_2 = y) = [ n²(1 - 2p(x) - 2p(y) + 2p(x,y) + 2p(x)p(y)) + n(2p(x) + 2p(y) - 4p(x,y) - 1) ] / [ n² sqrt( (p(x) - p(x)²)(p(y) - p(y)²) ) ]    (23)

τ_b_gap = τ_b(v_1 = x_1, v_2 = y) - τ_b(v_1 = x_2, v_2 = y)    (24)
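The reductions above are straightforward to verify numerically. A small sketch (our own code, with invented probabilities) checking the PMI reduction (definition vs. eq. 5) and the nPMI_xy reduction (definition vs. eq. 16):

```python
import math

# Invented operating point for one target label y and identities x1, x2.
p_x1, p_x2, p_y = 0.30, 0.28, 0.008
p_x1y, p_x2y = 0.005, 0.002

# PMI gap: definition vs. reduced form ln p(y|x1) - ln p(y|x2).
defn = (math.log(p_x1y / (p_x1 * p_y)) - math.log(p_x2y / (p_x2 * p_y)))
reduced = math.log(p_x1y / p_x1) - math.log(p_x2y / p_x2)
assert math.isclose(defn, reduced)

# nPMI_xy gap: definition vs. the fully expanded four-term form.
defn = (math.log(p_x1y / (p_x1 * p_y)) / math.log(p_x1y)
        - math.log(p_x2y / (p_x2 * p_y)) / math.log(p_x2y))
reduced = (math.log(p_x2) / math.log(p_x2y) - math.log(p_x1) / math.log(p_x1y)
           + math.log(p_y) / math.log(p_x2y) - math.log(p_y) / math.log(p_x1y))
assert math.isclose(defn, reduced)
```

Because the reductions are algebraic identities, the check holds at any operating point with valid probabilities, not just the values chosen here.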
C METRIC ORIENTATIONS

In this section, we derive the orientations of each metric, which can be thought of as the sensitivity of each metric to the given label probability. Table 4 provides a summary, followed by the longer forms.

Table 4: Metric Orientations

∂DP:       ∂/∂p(y) = 0
           ∂/∂p(x_1,y) = 1/p(x_1)
           ∂/∂p(x_2,y) = -1/p(x_2)

∂PMI:      ∂/∂p(y) = 0
           ∂/∂p(x_1,y) = 1/p(x_1,y)
           ∂/∂p(x_2,y) = -1/p(x_2,y)

∂nPMI_y:   ∂/∂p(y) = ln( p(y|x_2)/p(y|x_1) ) / ( ln²(p(y)) p(y) )
           ∂/∂p(x_1,y) = 1/( ln(p(y)) p(x_1,y) )
           ∂/∂p(x_2,y) = -1/( ln(p(y)) p(x_2,y) )

∂nPMI_xy:  ∂/∂p(y) = 1/( ln(p(x_2,y)) p(y) ) - 1/( ln(p(x_1,y)) p(y) )
           ∂/∂p(x_1,y) = ln( p(x_1)p(y) ) / ( ln²(p(x_1,y)) p(x_1,y) )
           ∂/∂p(x_2,y) = -ln( p(x_2)p(y) ) / ( ln²(p(x_2,y)) p(x_2,y) )

∂PMI²:     ∂/∂p(y) = 0
           ∂/∂p(x_1,y) = 2/p(x_1,y)
           ∂/∂p(x_2,y) = -2/p(x_2,y)

∂SDC:      ∂/∂p(y) = p(x_2,y)/(p(x_2) + p(y))² - p(x_1,y)/(p(x_1) + p(y))²
           ∂/∂p(x_1,y) = 1/(p(x_1) + p(y))
           ∂/∂p(x_2,y) = -1/(p(x_2) + p(y))

∂JI:       ∂/∂p(y) = p(x_2,y)/(p(x_2) + p(y) - p(x_2,y))² - p(x_1,y)/(p(x_1) + p(y) - p(x_1,y))²
           ∂/∂p(x_1,y) = (p(x_1) + p(y))/(p(x_1) + p(y) - p(x_1,y))²
           ∂/∂p(x_2,y) = -(p(x_2) + p(y))/(p(x_2) + p(y) - p(x_2,y))²

∂LLR:      ∂/∂p(y) = 0
           ∂/∂p(x_1,y) = 1/p(x_1,y)
           ∂/∂p(x_2,y) = -1/p(x_2,y)

∂τ_b:      ∂/∂p(y) = annoyingly long
           ∂/∂p(x_1,y) = (2 - 4/n) / sqrt( (p(x_1) - p(x_1)²)(p(y) - p(y)²) )
           ∂/∂p(x_2,y) = (4/n - 2) / sqrt( (p(x_2) - p(x_2)²)(p(y) - p(y)²) )

∂t-test:   ∂/∂p(y) = ( sqrt(p(x_2)) - sqrt(p(x_1)) ) / ( 2 sqrt(p(y)) )
           ∂/∂p(x_1,y) = 1/sqrt( p(x_1)p(y) )
           ∂/∂p(x_2,y) = -1/sqrt( p(x_2)p(y) )
• Demographic Parity:

DP_gap = p(y|x_1) - p(y|x_2)    (25)
∂DP_gap/∂p(y) = 0    (26)
∂DP_gap/∂p(x_1,y) = 1/p(x_1)    (27)
∂DP_gap/∂p(x_2,y) = -1/p(x_2)    (28)

• Sørensen-Dice Coefficient:

SDC_gap = p(x_1,y)/(p(x_1) + p(y)) - p(x_2,y)/(p(x_2) + p(y))    (29)
∂SDC_gap/∂p(y) = p(x_2,y)/(p(x_2) + p(y))² - p(x_1,y)/(p(x_1) + p(y))²    (30)
∂SDC_gap/∂p(x_1,y) = 1/(p(x_1) + p(y))    (31)
∂SDC_gap/∂p(x_2,y) = -1/(p(x_2) + p(y))    (32)

• Jaccard Index:

JI_gap = p(x_1,y)/(p(x_1) + p(y) - p(x_1,y)) - p(x_2,y)/(p(x_2) + p(y) - p(x_2,y))    (34)
∂JI_gap/∂p(y) = p(x_2,y)/(p(x_2) + p(y) - p(x_2,y))² - p(x_1,y)/(p(x_1) + p(y) - p(x_1,y))²    (35)
∂JI_gap/∂p(x_1,y) = (p(x_1) + p(y))/(p(x_1) + p(y) - p(x_1,y))²    (36)
∂JI_gap/∂p(x_2,y) = -(p(x_2) + p(y))/(p(x_2) + p(y) - p(x_2,y))²    (37)

• Log-Likelihood Ratio:

LLR_gap = ln(p(x_1|y)) - ln(p(x_2|y))    (39)
∂LLR_gap/∂p(y) = p(x_2,y)/(p(x_2|y)p(y)²) - p(x_1,y)/(p(x_1|y)p(y)²) = 0    (40)
∂LLR_gap/∂p(x_1,y) = 1/(p(x_1|y)p(y)) = 1/p(x_1,y)    (41)
∂LLR_gap/∂p(x_2,y) = -1/p(x_2,y)    (42)

• PMI:

PMI_gap = ln(p(y|x_1)) - ln(p(y|x_2))    (44)
∂PMI_gap/∂p(y) = 0    (45)
∂PMI_gap/∂p(x_1,y) = 1/p(x_1,y)    (46)
∂PMI_gap/∂p(x_2,y) = -1/p(x_2,y)    (47)

• PMI²:

PMI2_gap = ln(p(y|x_1)) - ln(p(y|x_2)) + ln(p(x_1,y)) - ln(p(x_2,y))    (48)
∂PMI2_gap/∂p(y) = 0    (49)
∂PMI2_gap/∂p(x_1,y) = 2/p(x_1,y)    (50)
∂PMI2_gap/∂p(x_2,y) = -2/p(x_2,y)    (51)

• nPMI, normalized by p(y):

nPMI_gap = ln(p(y|x_1))/ln(p(y)) - ln(p(y|x_2))/ln(p(y))    (52)
∂nPMI_gap/∂p(y) = ln(p(y|x_2))/( ln²(p(y)) p(y) ) - ln(p(y|x_1))/( ln²(p(y)) p(y) ) = ln( p(y|x_2)/p(y|x_1) ) / ( ln²(p(y)) p(y) )    (53)
∂nPMI_gap/∂p(x_1,y) = 1/( ln(p(y)) p(x_1,y) )    (54)
∂nPMI_gap/∂p(x_2,y) = -1/( ln(p(y)) p(x_2,y) )    (55)

• nPMI, normalized by p(x,y):

nPMI_gap = ln(p(x_2))/ln(p(x_2,y)) - ln(p(x_1))/ln(p(x_1,y)) + ln(p(y))/ln(p(x_2,y)) - ln(p(y))/ln(p(x_1,y))    (56)
∂nPMI_gap/∂p(y) = 1/( ln(p(x_2,y)) p(y) ) - 1/( ln(p(x_1,y)) p(y) )    (57)
∂nPMI_gap/∂p(x_1,y) = ln( p(x_1)p(y) ) / ( ln²(p(x_1,y)) p(x_1,y) )    (58)
∂nPMI_gap/∂p(x_2,y) = -ln( p(x_2)p(y) ) / ( ln²(p(x_2,y)) p(x_2,y) )    (59)

• Kendall rank correlation (τ_b):

τ_b_gap = τ_b(v_1 = x_1, v_2 = y) - τ_b(v_1 = x_2, v_2 = y)    (61)
∂τ_b_gap/∂p(x_1,y) = (2 - 4/n) / sqrt( (p(x_1) - p(x_1)²)(p(y) - p(y)²) )    (62)
∂τ_b_gap/∂p(x_2,y) = (4/n - 2) / sqrt( (p(x_2) - p(x_2)²)(p(y) - p(y)²) )    (63)

• t-test:

t-test_gap = ( p(x_1,y) - p(x_1)p(y) ) / sqrt( p(x_1)p(y) ) - ( p(x_2,y) - p(x_2)p(y) ) / sqrt( p(x_2)p(y) )    (65)
∂t-test_gap/∂p(y) = ( sqrt(p(x_2)) - sqrt(p(x_1)) ) / ( 2 sqrt(p(y)) )    (66)
∂t-test_gap/∂p(x_1,y) = 1/sqrt( p(x_1)p(y) )    (67)
∂t-test_gap/∂p(x_2,y) = -1/sqrt( p(x_2)p(y) )    (68)
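The orientations can be sanity-checked against central finite differences. A minimal sketch (our own code, with invented probabilities) for the PMI row of Table 4:

```python
import math

def pmi_gap(p_x1y, p_x2y, p_x1, p_x2, p_y):
    """PMI gap toward x1 over x2 for one target label y."""
    return math.log(p_x1y / (p_x1 * p_y)) - math.log(p_x2y / (p_x2 * p_y))

# Invented operating point.
p_x1, p_x2, p_y = 0.30, 0.28, 0.008
p_x1y, p_x2y = 0.005, 0.002
h = 1e-9  # finite-difference step

# d(PMI_gap)/dp(x1,y) should equal 1/p(x1,y).
d_joint = (pmi_gap(p_x1y + h, p_x2y, p_x1, p_x2, p_y)
           - pmi_gap(p_x1y - h, p_x2y, p_x1, p_x2, p_y)) / (2 * h)
assert math.isclose(d_joint, 1 / p_x1y, rel_tol=1e-4)

# d(PMI_gap)/dp(y) should equal 0: the ln p(y) terms cancel in the gap.
d_marginal = (pmi_gap(p_x1y, p_x2y, p_x1, p_x2, p_y + h)
              - pmi_gap(p_x1y, p_x2y, p_x1, p_x2, p_y - h)) / (2 * h)
assert abs(d_marginal) < 1e-4
```

The same pattern (perturb one probability, hold the others fixed) applies to every row of the table; only the gap function changes.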
D COMPARISON TABLES
Table 5: Mean/Std of top 100 Male Association Gaps. Scales of the top 100 (x_1 = man).

Metric   | Mean/Std C(y)            | Mean/Std C(x_1,y)
DP       | 101.38 (±199.32)         | 1.00 (±0.00)
nPMI_xy  | 24371.15 (±27597.10)     | 237.92 (±604.29)
nPMI_y   | 75160.45 (±119779.13)    | 5680.44 (±17193.19)
PMI      | 1875.86 (±3344.23)       | 2.43 (±4.72)
PMI²     | 831.43 (±620.02)         | 1.09 (±0.32)
SDC      | 676.33 (±412.23)         | 1.00 (±0.00)
τ_b      | 189294.69 (±149505.10)   | 24524.89 (±32673.77)
Table 6: Mean/Std of top 100 Male-Female Association Gaps. Counts for the top 100 gaps (x_1 = man, x_2 = woman).

Metric   | Mean/Std C(y)             | Mean/Std C(x_1,y)        | Mean/Std C(x_2,y)
DP       | 154960.81 (±147823.81)    | 71552.62 (±67400.44)     | 70398.07 (±56824.32)
JI       | 70403.46 (±86766.14)      | 29670.94 (±35381.09)     | 35723.13 (±39308.46)
LLR      | 1020.60 (±1971.77)        | 57.95 (±178.59)          | 558.36 (±1420.15)
nPMI_xy  | 13869.42 (±46244.17)      | 5771.73 (±23564.63)      | 9664.46 (±31475.19)
nPMI_y   | 23156.11 (±73662.90)      | 6632.79 (±25631.26)      | 10376.39 (±33559.27)
PMI      | 1020.60 (±1971.77)        | 57.95 (±178.59)          | 558.36 (±1420.15)
PMI²     | 1020.60 (±1971.77)        | 57.95 (±178.59)          | 558.36 (±1420.15)
SDC      | 67648.20 (±87258.41)      | 28193.83 (±35295.25)     | 34443.22 (±39501.21)
τ_b      | 134076.65 (±144354.70)    | 53845.80 (±58016.12)     | 56032.09 (±52124.92)
Table 7: Minimum/Maximum of top 100 Male-Female Association Gaps

Metric   | Min/Max C(y)      | Min/Max C(x_1,y)  | Min/Max C(x_2,y)
PMI      | 15 / 10,551       | 1 / 1,059         | 8 / 7,755
PMI²     | 15 / 10,551       | 1 / 1,059         | 8 / 7,755
LLR      | 15 / 10,551       | 1 / 1,059         | 8 / 7,755
DP       | 6,104 / 785,045   | 628 / 239,950     | 5,347 / 197,795
JI       | 4,158 / 562,445   | 399 / 144,185     | 3,359 / 183,132
SDC      | 2,906 / 562,445   | 139 / 144,185     | 2,563 / 183,132
nPMI_y   | 35 / 562,445      | 1 / 144,185       | 9 / 183,132
nPMI_xy  | 34 / 270,748      | 1 / 144,185       | 20 / 183,132
τ_b      | 6,104 / 785,045   | 628 / 207,723     | 5,347 / 183,132
t-test   | 960 / 562,445     | 72 / 144,185      | 870 / 183,132
E COMPARISON PLOTS

E.1 Overall Rank Changes

Table 8: Demographic Parity (di), Jaccard Index (ji), Sørensen-Dice Coefficient (sdc) Comparison

Table 9: Pointwise Mutual Information (pmi), Squared PMI (pmi_2), Log-Likelihood Ratio (ll) Comparison

Table 10: Demographic Parity (di), Pointwise Mutual Information (pmi), Normalized Pointwise Mutual Information (npmi_xy), Kendall τ_b (tau_b) Comparison

E.2 Movement Plots

Table 11: Demographic Parity (di), Jaccard Index (ji), Sørensen-Dice Coefficient (sdc) Comparison

Table 12: Pointwise Mutual Information (pmi), Squared PMI (pmi_2), Log-Likelihood Ratio (ll) Comparison

Table 13: Demographic Parity (di), Pointwise Mutual Information (pmi), Normalized Pointwise Mutual Information (npmi_xy), Kendall τ_b (tau_b) Comparison
Table 14: Example Top Results

Rank | DP: Label (Count)           | PMI: Label (Count)         | nPMI_xy: Label (Count)
0    | Female (265853)             | Dido Flip (140)            | [omitted] (610)
1    | Woman (270748)              | Webcam Model (184)         | Dido Flip (140)
2    | Girl (221017)               | Boho-chic (151)            | [omitted] (2906)
3    | Lady (166186)               | [omitted] (610)            | Eye Liner (3144)
4    | Beauty (562445)             | Treggings (126)            | Long Hair (56832)
5    | Long Hair (56832)           | Mascara (539)              | Mascara (539)
6    | Happiness (117562)          | [omitted] (145)            | Lipstick (8688)
7    | Hairstyle (145151)          | Lace Wig (70)              | Step Cutting (6104)
8    | Smile (144694)              | Eyelash Extension (1167)   | Model (10551)
9    | Fashion (238100)            | Bohemian Style (460)       | Eye Shadow (1235)
10   | Fashion Designer (101854)   | [omitted] (78)             | Photo Shoot (8775)
11   | Iris (120411)               | Gravure Idole (200)        | Eyelash Extension (1167)
12   | Skin (202360)               | [omitted] (165)            | Boho-chic (460)
13   | Textile (231628)            | Eye Shadow (1235)          | Webcam Model (151)
14   | Adolescence (221940)        | [omitted] (156)            | Bohemian Style (184)