"What We Can’t Measure, We Can’t Understand": Challenges toDemographic Data Procurement in the Pursuit of Fairness

McKane Andrus (mckane@partnershiponai.org), Partnership on AI
Elena Spitzer (emspitzer1@gmail.com), Partnership on AI
Jeffrey Brown (jeff@partnershiponai.org), Partnership on AI; Minnesota State University, Mankato
Alice Xiang* (alice.xiang@sony.com), Sony AI; Partnership on AI

Abstract

As calls for fair and unbiased algorithmic systems increase, so too does the number of individuals working on algorithmic fairness in industry. However, these practitioners often do not have access to the demographic data they feel they need to detect bias in practice. Even with the growing variety of toolkits and strategies for working towards algorithmic fairness, they almost invariably require access to demographic attributes or proxies. We investigated this dilemma through semi-structured interviews with 38 practitioners and professionals either working in or adjacent to algorithmic fairness. Participants painted a complex picture of what demographic data availability and use look like on the ground, ranging from not having access to personal data of any kind to being legally required to collect and use demographic data for discrimination assessments. In many domains, demographic data collection raises a host of difficult questions, including how to balance privacy and fairness, how to define relevant social categories, how to ensure meaningful consent, and whether it is appropriate for private companies to infer someone’s demographics. Our research suggests challenges that must be considered by businesses, regulators, researchers, and community groups in order to enable practitioners to address algorithmic bias in practice. Critically, we do not propose that the overall goal of future work should be to simply lower the barriers to collecting demographic data. Rather, our study surfaces a swath of normative questions about how, when, and whether this data should be procured, and, in cases where it is not, what should still be done to mitigate bias.

CCS Concepts

• Social and professional topics → Privacy policies; Systems planning; Socio-technical systems.

Keywords

demographic data, sensitive data, special category data, data privacy, fairness, anti-discrimination

* Research funded by the Partnership on AI. Senior author listed last.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

FAccT ’21, March 3–10, 2021, Virtual Event, Canada
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8309-7/21/03...$15.00
https://doi.org/10.1145/3442188.3445888

ACM Reference Format:
McKane Andrus, Elena Spitzer, Jeffrey Brown, and Alice Xiang. 2021. "What We Can’t Measure, We Can’t Understand": Challenges to Demographic Data Procurement in the Pursuit of Fairness. In Conference on Fairness, Accountability, and Transparency (FAccT ’21), March 3–10, 2021, Virtual Event, Canada. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3442188.3445888

1 Introduction

As the risk of algorithmic discrimination has increased in recent years, so too has the number of proposed fixes from the field of algorithmic fairness. Countless fairness strategies, metrics, and toolkits have been developed and, more rarely, integrated into algorithmic systems and products. Most of these innovations revolve around measuring and sometimes mitigating disparities in treatment and performance across a “sensitive” or “protected” attribute. Typically, this process requires access to data revealing this “sensitive” or “protected” attribute. Using open-access, technical toolkits as an example, the fairness methods in Fairlearn (Microsoft) [56], AI Fairness 360 (IBM) [9], What-if tool (Google) [79], LiFT (LinkedIn) [75], Aequitas [65], and FAT-forensics [71] all require access to demographic data. In many situations, however, information about demographics can be extremely difficult for practitioners to even procure [38, 76]. The toolkits that do not require demographic data access generally rely on simulated data (e.g., [23]) or approach the problem more from the angle of design (e.g., [7]).
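To make the dependence concrete, the sketch below uses Fairlearn's MetricFrame on invented toy data (the arrays and group labels are hypothetical, not drawn from any of the cited toolkits' documentation). The point is simply that a group fairness metric cannot be computed until a sensitive_features column is supplied.

```python
# Minimal sketch: group fairness metrics presuppose a sensitive-feature column.
# Toy data is hypothetical; Fairlearn's MetricFrame is used as an example API.
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, demographic_parity_difference

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                  # observed outcomes (toy)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                  # model decisions (toy)
gender = ["W", "W", "M", "M", "W", "M", "M", "W"]  # the demographic data at issue

# Per-group accuracy: impossible to compute without the `gender` column.
frame = MetricFrame(metrics=accuracy_score, y_true=y_true, y_pred=y_pred,
                    sensitive_features=gender)
print(frame.by_group)

# Disparity in selection rates across groups, again keyed on demographics.
print(demographic_parity_difference(y_true, y_pred, sensitive_features=gender))
```

Without the `gender` column, neither quantity is defined, which is exactly the bind the practitioners interviewed below describe.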

Demographic data stands apart from many of the other types of data fed into algorithmic systems in that its collection and use are socially and legally constrained [2]. Data protection regulation, such as the EU’s General Data Protection Regulation (GDPR) [25], builds in extra protections for data about certain attributes, deeming it off-limits for many use cases. Perceiving this potential conflict between data availability and making systems less discriminatory, legal scholars have argued for regulatory provisions for collecting demographic data to audit algorithmic bias [73, 80, 83, 85, 87].

On the other hand, computer science researchers have proposed a number of methods and techniques that try to bypass the collection of demographic data through inference and proxies (e.g., [30, 64, 86]), to only use demographic data during model training (e.g., [27, 46, 66, 84]), or to identify and mitigate algorithmic discrimination for computationally identified groups without the use of demographics at all (e.g., [12, 36, 50]). Coming at the problem from a privacy angle, others have sought to make demographic data collection and use more private and less centralized through various combinations of data sanitization, cryptographic privacy, differential privacy, and third-party auditors (e.g., [31, 43, 48, 49, 76]).


Acknowledging that these alternative methods exist, our goal was to understand how demographic data availability is encountered and dealt with in practice. By interviewing a wide range of practitioners dealing with concerns around algorithmic fairness, we strove to characterize the actual impediments to demographic data use as they emerge on the ground. We also assessed a number of concerns practitioners have around how to ensure that their demographic data collection process is responsible and ethical. Though there is some question as to whether algorithmic fairness techniques meaningfully address discrimination and injustice and thereby justify the collection and use of this sensitive data [5, 37, 47, 68], understanding how practitioners in industry think about and address the use of demographic data is an important piece of this debate.

2 Background and Related Research

Past work on the barriers to demographic data procurement and use has largely considered the question of legality. While it is clear that data protection regulations like GDPR are likely to inhibit the procurement, storage, and use of many demographic attributes, it is less clear what types of carve-outs might apply if the data is used for fairness purposes [28]. There is also an open question of whether or not anti-discrimination law permits the inclusion of demographic attributes in decision-making processes [11, 83]. Bogen et al. [13] survey the requirements of U.S. anti-discrimination law around the collection and usage of sensitive attribute data in employment, credit, and healthcare, and find that “there are few clear, generally accepted principles about when and why companies should collect sensitive attribute data for anti-discrimination purposes” [13, p. 2].

A more critical branch of scholarship is now interrogating the algorithmic fairness community’s conceptualization of demographic attributes themselves and how that propagates into notions of fairness. Work in this area has addressed race [12, 33], gender [32, 40, 67], and disability [10], calling into question what it means to even rely on these categories as a basis for assessing unfairness, and what harms are reproduced by relying on these infrastructures of categorization. Similar to this work, important contributions emerging from various academic fields and activist communities on the issue of justice in data use interrogate when and where it is acceptable to collect what types of data, and what degree of control data subjects should have over their data thereafter [16, 21, 57, 60, 61, 72]. These lines of work center questions of autonomy and representation around data and its use, going beyond the common legal standards of privacy and anti-discrimination.

Less work has been done, however, in understanding how practitioners confront issues around demographic data procurement and use, and whether the difficulties they encounter mirror those discussed above. There is a growing literature studying fairness practitioners themselves that has focused on the needs of public sector practitioners [77], the needs of private sector practitioners [38], and the organizational patterns they fit within [62]. These studies lay important groundwork in identifying the problem,¹ but with this paper we aim to provide more texture as to how practitioners themselves think about and navigate this gap, and, in turn, what paths forward would need to address in order to be effective.

¹ Holstein et al. found, for example, that a majority of their survey respondents “indicated that the availability of tools to support fairness auditing without access to demographics at an individual level would be at least ‘Very’ useful” [38, p. 8].

Table 1: Participant Roles

    Role                                     Count
    External Consultant (EC)                     4
    Leadership Role (LR)                         9
    Legal/Policy (LP)                            4
    Product/Program/Project Manager (PM)         7
    Researcher (R)                               4
    Tech Contributor (TC)                       10

Table 2: Participant Sectors

    Sector                    Participant IDs                            Count
    Ad Tech                   LP1, LP2                                       2
    Finance                   LR2, LR3, TC3, TC4                             4
    Healthcare                EC1, LR8, PM5, TC7                             4
    Hiring/HR                 LR1, LR4, LR7, PM7, TC10                       5
    Multi-Sector Technology   LR5, PM2, PM3, PM4, R1, R2, R3, TC2, TC6       9
    Social                    LP3, LP4, LR6, PM6, TC5, TC9                   6
    Telecom                   LR9, PM1, TC1                                  3
    Other                     EC2, EC3, EC4, R4, TC8                         5

3 Methods

Building off of themes from prior work in this area [1], we scoped an interview study looking at how demographic data procurement and use actually proceeds in practice. We interviewed 38 practitioners from 26 organizations, with 5 participants being the maximum number coming from a single organization. All participants either A) were involved in efforts to detect bias or unfairness in a model or product, B) were familiar with company policies relevant to the usage of demographic data, or C) were familiar with regulations or policies relevant to the collection or use of demographic data.

Most participants were from for-profit tech companies (34 out of 38). A slight majority (55%) of participants were from companies with more than 1,000 employees, with the rest from smaller organizations. 80% of participants worked in the US. Participants held a variety of positions within their organizations (Table 1) and came from a diverse set of sectors (Table 2). For the categories in Table 1, “External Consultant” includes outside auditors and advisors for various issues surrounding system bias and fairness. The “Leadership Role” category is used to describe participants overseeing the work of larger teams (e.g., “Director/Head of X”). “Tech Contributor” is an umbrella term for roles such as software engineers, engineering managers, and data scientists.

We employed a variety of recruitment methods. First, we identified individuals in the authors’ professional networks who would most likely have experience with attempting to implement algorithmic fairness techniques. These contacts were asked to participate themselves and were also encouraged to share our call for participants with relevant individuals in their own networks. Similarly, we distributed an interest form and study description to various organizational mailing lists. In order to broaden the search beyond our established networks, we also carried out searches for various participant archetypes through the LinkedIn Recruiter service [52]. As suggested by Maramwidze-Merrison [53], LinkedIn can be a prime mechanism for identifying and contacting specialists who would otherwise not be accessible to the researcher. By using a Recruiter account, we were able to send requests for participation to individuals with greater than two degrees of separation from the account holder. Given the typical use case of this medium, we clearly stated that we were not reaching out with an employment opportunity. In addition to these methods, we also conducted searches for news articles and organizational publications on topics related to algorithmic fairness and bias auditing in order to identify teams and authors to reach out to directly. Finally, following each interview we asked participants to refer any relevant contacts. With all of these recruitment methods, we attempted to sample participants from industries with varying degrees of regulation, as we expected this to be an important axis of analysis based on the existing literature.

Participation in the interviews involved one 60–75 minute video call. In some cases, we agreed to have multiple interviewees from the same organization participate in the same call. Before the call, we shared our consent and confidentiality practices with participants and then obtained verbal consent to record before starting the interview. We transcribed the audio recordings and deleted them after transcription. We redacted all findings to maintain the anonymity of participants and their institutions.

Since the term “demographic data” may mean different things to different people, we provided the following, rather broad, definition at the beginning of the interviews:

    Demographic data includes most regulated categories (e.g., “protected classes” and “sensitive attributes” like sex, race, national origin) as well as other less-protected classes (e.g., socioeconomic class and geography). Demographic data can even include data that might be used as proxies for these variables, such as likes or dislikes.

The interview questions focused on asking participants to walk us through examples of times when they had participated in successful or unsuccessful attempts to use demographic data for bias assessments. We probed for details about what data they wanted to use and what their intended purpose was. We asked about the availability or obtainability of the desired data. And we asked follow-up questions about what types of approvals or interactions with other teams were required for the collection or usage of this type of data.

In cases where the participant was in a legal/policy role, the questions were similar to the above, but framed from the vantage point of someone reviewing a request for data access or usage. In these interviews, we also asked additional questions about what legal or other risks they take into account when making decisions. In cases where participants were identified based on press reporting or organizational publications on algorithmic fairness issues, we sometimes included more tailored questions about the content of the publication.

After hearing about more specific cases, we also asked for more general reflections from participants on the trade-offs involved in this type of bias assessment work — e.g., “How do you balance user privacy concerns with the need for demographic data?” In addition, we heard some insightful reflections in response to more forward-looking questions such as “In your ideal world, what would bias assessment look like for you?”

To analyze the interviews, we used open and closed coding in MaxQDA. Open coding was used to parse and summarize an initial selection of 25% of the interviews. Open codes were then iteratively grouped and synthesized using thematic networks [3], with themes organized around either the challenges with or constraints on demographic data use. These themes and their subthemes then informed a set of closed codes that were applied and expanded upon over the corpus of transcripts.

4 Results Overview

Before delving into emergent themes, it is important to first note some of the broad trends we heard concerning the availability of demographic data. Almost every participant described access to demographic data as a significant barrier to implementing various fairness techniques, a result that mirrors the responses from Holstein et al. [38]. In terms of accessible data, virtually all companies that collected or commissioned their own data had access to gender and age and frequently used that data to detect bias. These categories of demographic data were regarded as entailing much less sensitivity and privacy risk than other categories. Outside of employment, healthcare, and the occasional financial service contexts, few practitioners had direct access to data on race, so most did not try to assess their systems and products for racial bias. When attempts were made, it was generally through proxies or inferred values, but such methods were rarely deployed in practice (conforming to previous findings by Holstein et al. [38] for demographic attributes more generally). Finally, practitioners in business-to-business (B2B) companies and external consultants generally had very limited access to sensitive data of any kind.

We organize the themes uncovered through our analysis into two types. The first type is the regulatory and organizational constraints on demographic data procurement and use, which are generally maintained by a network of actors outside the practitioner’s team (e.g., legal and compliance teams, policy teams, company leadership, external auditors, and regulators). The second type is the concerns surrounding demographic data procurement and use that the practitioners themselves surfaced or encountered during the course of their work. Having made this distinction, however, it is important to note that it was not uncommon for the constraints in the first category to inform the concerns of practitioners, or for the concerns of practitioners to feed into organizational policy.

5 Legal and Organizational Constraints on Demographic Data Procurement and Use

Looking first to the network of constraints surrounding the procurement and use of demographic data, we discuss three major factors at play. The first, and perhaps most obvious, is the various regimes of privacy laws and policies. Second, we consider the role of anti-discrimination laws and policies in determining whether demographic data is collected. Finally, we consider a suite of potential restrictions stemming from organizational priorities and practices.

5.1 Privacy Laws and Policies

GDPR-related notions like data-subject consent and the Privacy By Design standard of only collecting required data seemed to primarily guide legal, policy, and privacy teams’ handling of data requests, despite most participants being based in the U.S. As described by LP1, a legal practitioner in the Ad Tech space:

    “The general framework is that the rules apply to data about people. And then, one axis on there is ‘what does it mean that the data is about a person.’ [T]he old school way of looking at that was [whether the data was] identifiable, which meant it had their name or contact information, email address. The newer GDPR-like way to look at it is that even pseudonymous data — [i.e.,] I’m not [First Name, Last Name], I’m ABC789 (or my computer or my browser is) — even pseudonymous data is regulated.”

Consent and Special Category Data. For this work we left our discussions of “demographic data” open to any membership category of data relevant to our participants. Generally speaking, however, practitioners pointed to the categories defined by law as protected or sensitive. An important distinction that emerged for organizations with operations in Europe was between categories classified as “special” by GDPR and those that are not. Most notably, gender, age, and region, three categories not considered “special,” were available to most of the practitioners we spoke to, whereas race, a “special” category, was not. This is especially significant because race was seen to be a very salient axis of analysis for every domain, yet attempting to procure data on race was rare. The absence of this data was keenly felt by practitioners attempting to show the failure modes of various systems, as depicted by TC9, a product manager in the social media domain: “Every time we do one of these assessments, people ask, ‘Where’s race?’”

Though few participants saw the requirements set by GDPR around special category data as entirely prohibitive, many discussed how the general practice is to just avoid the risks of wide-scale special category data collection altogether. In the rare cases where special category data was procurable at a large scale, it was either through government datasets, through open-sourced data with self-identification, or by following explicitly applicable regulation and standards. There were also some reported cases where racial data was collected for user studies and experimentation, but this was done with informed consent and compensation, as well as strict legal review. The only domain in which we saw the attempted collection of racial data outside of the U.S. was hiring and HR.²

² Hiring and HR carve-outs for special category data in regional GDPR implementations tend to defer to employment law in determining what data can be collected and for what reasons, making it potentially less fraught than for other domains. See, e.g., the German Federal Data Protection Act Section 26 (3) [26].

Don’t Collect More Than What You Need. Beyond the legal risks incurred by potential GDPR violations, we also heard from practitioners that it is “the combination of GDPR and the ethos that GDPR lives in, which is an ethos of tech fear and nervousness and lack of trust” (PM3) that can lead to inaction. If public trust in data use is already low, the Privacy By Design standard of not collecting more data than you need can very easily take precedence over the largely experimental process of ensuring product fairness. Describing an experience voiced to us by multiple participants, TC5 talked of how “for the collection of new data, there’s deep conversations with policy [and] legal, and then we need to loop in external stakeholders [to ask], ‘does the need to be able to measure this seem to sufficiently balance out the user concerns for privacy?’” In many of our interviews, we heard practitioners recall cases where data requests would be stifled by this question of necessity.

Interestingly, practitioners reported that inferring demographic data is similarly frowned upon. One potential reason for this is the increasing prevalence of data access request rights through regulation such as GDPR and the California Consumer Privacy Act. LP4 discussed how with “data access requests, it’s still an open question under certain laws whether inferred data is user data and [thus] needs to be shared or not. And so [a] concern around that [is], ‘are people aware that this might be happening?’” In this way, the potential of forced revelation of inferred data may be encouraging corporations to be more conservative about what they infer.

Data Sharing. One final difficulty often presented by data privacy regulation in this space is the inaccessibility of demographic data for outside auditors or consultants as well as for business-to-business vendors of algorithmic tools. Though we did hear of cases where external auditors were brought into an organization and given pseudonymized access to data, we also heard of many cases where auditors, consultants, and vendors had to build their models or provide recommendations without ever seeing the data. For practitioners in this kind of situation, some spoke of the importance of publicly available datasets drawn from domains similar to the organizations they were assisting. By combining this external data with API access or knowledge of the infrastructures in use, these practitioners noted that they could point to specific, high-risk failure modes without needing to risk any legal privacy violations. In one unique case, a practitioner opted to forego the use of public/private data altogether, instead making the case for “pathways” of discrimination in conceptual Structural Causal Models (SCMs) built in concert with subject matter experts and other employees from the organization they were assisting. The SCMs were then used to inform the client of potential sources of bias in their data, despite not taking in any of that data.

Implications. While there are many other motivations for some companies to not actively collect “special category data,” GDPR has raised the bar for private organizations that wish to use this data for the purposes of bias detection or fairness assessments. Even if GDPR does not prohibit the collection of sensitive attributes for bias detection (as outlined in recent guidance from the UK Information Commissioner’s Office [74]), increasingly protective privacy regulation has empowered privacy and legal teams to err on the side of caution when it comes to data sensitivity. This is likely why many practitioners discussed bias detection efforts being denied if they required new data. As such, fairness practitioners either are pushed to inaction or have to be creative in how they assess and address bias, as we explore in Section 6 below.
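The "pathways" approach mentioned above can be illustrated without any individual-level data at all. The sketch below is our own, and the node names are hypothetical rather than taken from the participant's case; it shows how a purely conceptual causal graph, elicited from subject-matter experts, can enumerate candidate routes from a sensitive attribute to a decision so that each route can be reviewed for potential bias.

```python
# Hypothetical conceptual SCM: nodes and directed edges elicited from domain
# experts, with no individual-level data attached. Node names are illustrative.
causal_graph = {
    "race":              ["neighborhood", "referral_source"],
    "neighborhood":      ["credit_history"],
    "referral_source":   ["application_score"],
    "credit_history":    ["application_score"],
    "application_score": ["decision"],
    "decision":          [],
}

def directed_paths(graph, source, target, prefix=None):
    """Enumerate all directed paths from source to target via a simple DFS."""
    prefix = (prefix or []) + [source]
    if source == target:
        return [prefix]
    paths = []
    for child in graph.get(source, []):
        paths.extend(directed_paths(graph, child, target, prefix))
    return paths

# Each path is a candidate "pathway of discrimination" to review with experts.
for path in directed_paths(causal_graph, "race", "decision"):
    print(" -> ".join(path))
```

Nothing here requires the client's data; the value of the exercise lies in the conversation with domain experts about which edges exist and which paths deserve scrutiny.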

5.2 Anti-Discrimination Laws and Policies

Most often, the closest legal and policy teams come to thinking about fairness is through anti-discrimination policy. For this section, we first consider a few domains where we saw clear anti-discrimination standards interact with the collection and use of demographic data, and then move to looking at how this interaction plays out when liability for discrimination is less clear.

Finance. The few practitioners we spoke with from the Finance sector were quick to discuss the differences between mortgage products and credit-based products in the U.S. For the former, lenders are required by law to collect demographic attribute information, while for the latter they are nearly barred from doing so. In both of these cases, however, financial institutions are held to anti-discrimination standards by various watchdog institutions, such as the Consumer Financial Protection Bureau (CFPB). For products where the data required to ensure compliance is sparse or cannot be directly collected, these participants discussed how their institutions have followed the CFPB’s lead in inferring racial categories through the Bayesian Improved Surname Geocoding (BISG) [24] method. Even so, they reported never having direct access to these attributes themselves, inferred or otherwise. Instead, compliance teams with special data access will test models after they have been produced, sending them back if they are found to be discriminatory. When discussing this arrangement, TC4 suggested that “they kind of want the whole testing function to be a black box,” in part so that no intentional circumvention of compliance tests can occur. This can have the unfortunate effect, however, of leaving engineering teams in the dark about the impacts of their models.
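For readers unfamiliar with BISG, the update itself is a single application of Bayes' rule: a surname-based prior over race groups is combined with the geographic distribution of each group and renormalized. The sketch below uses invented probability tables (real deployments draw on Census surname and block-group data), so it is illustrative only, not any institution's compliance code.

```python
# Illustrative BISG-style inference with toy probability tables.
# Real implementations use Census surname lists and block-group demographics;
# the numbers below are invented for the sake of the example.

# P(race | surname), e.g. as derived from a Census surname file.
p_race_given_surname = {
    "GARCIA": {"white": 0.05, "black": 0.01, "hispanic": 0.92, "asian": 0.02},
    "SMITH":  {"white": 0.73, "black": 0.22, "hispanic": 0.02, "asian": 0.03},
}

# P(geography | race): share of each race group living in a given block group.
p_geo_given_race = {
    "block_A": {"white": 0.0004, "black": 0.0009, "hispanic": 0.0020, "asian": 0.0003},
    "block_B": {"white": 0.0011, "black": 0.0002, "hispanic": 0.0003, "asian": 0.0015},
}

def bisg(surname, block):
    """Posterior P(race | surname, block) proportional to P(race | surname) * P(block | race)."""
    prior = p_race_given_surname[surname]
    likelihood = p_geo_given_race[block]
    unnormalized = {r: prior[r] * likelihood[r] for r in prior}
    total = sum(unnormalized.values())
    return {r: round(v / total, 3) for r, v in unnormalized.items()}

print(bisg("GARCIA", "block_A"))
print(bisg("SMITH", "block_B"))
```

As the participants emphasized, the output is a probability distribution over groups rather than an observed attribute, which is part of why it is held by compliance teams rather than handed to engineering teams.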

For some types of discrimination, even clear regulation does not necessarily lead to measurement and mitigation. On the issue of regulations prohibiting financial service discrimination based on sexuality, LR3 noted that, “in practice, the regulators understand that data for certain classes are not easy to obtain and so they don’t make any enforcement. [...] So it’s one of those things where when the rubber hits the road, you have to figure out, is this actually something you can do?” Even for classes of data that are possibly easier to obtain, financial institutions can still be resistant to their collection depending on how anti-discrimination is enforced.

Human Resources. In the Hiring/HR domain, participants referenced clearer standards that they defer to. In the U.S. context, practitioners mentioned how the demographic categories on EEO-1 forms (gender and race/ethnicity) were often made available to them such that they could ensure their models’ adherence to the “80% rule”.³ Difficulties can arise, however, when the organization’s standard practices for bias and discrimination assessment go beyond the expectations, requirements, and norms for doing so in the country of business. When discussing doing business in a country outside their own, PM7 described how, “We talked about demographics and they were like, ‘This is not something that is important to us.’ [...] It becomes a real challenge to get customers even interested in why we do [model debiasing].” As a result of these types of concerns, participants discussed bringing on employment anti-discrimination experts specializing in the various regions they operate in. In terms of how procured demographic data often gets used, the teams we talked with that did model debiasing relied on specially collected and curated datasets, generally from client organizations, that have demographic attributes of reliable accuracy. Typically, this type of high-quality data is only used for the purposes of reducing discriminatory effects before deployment, as there are too many practical and legal constraints on including demographic data in deployment.

³ The EEOC’s adverse impact rule of thumb, where if the selection rate for one group is four-fifths that of another, it may point to deeper issues [17].
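For concreteness, the four-fifths check referenced in footnote 3 reduces to a ratio of selection rates. The sketch below uses hypothetical counts and is not any participant's actual compliance code; it simply flags a group whose selection rate falls below 80% of the most-selected group's rate.

```python
# Illustrative four-fifths ("80% rule") check with hypothetical counts.
selected = {"group_a": 60, "group_b": 30}   # applicants selected, per group (toy)
total    = {"group_a": 100, "group_b": 80}  # applicants overall, per group (toy)

rates = {g: selected[g] / total[g] for g in selected}
highest = max(rates.values())

for group, rate in rates.items():
    impact_ratio = rate / highest
    flag = "review" if impact_ratio < 0.8 else "ok"
    print(f"{group}: selection rate {rate:.2f}, impact ratio {impact_ratio:.2f} ({flag})")
```

Note that running even this trivial check presupposes that group membership is known for every applicant, which is the data-availability problem at the center of this paper.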

Healthcare. Healthcare was the third and final domain where we spoke to participants who pointed to concrete anti-discrimination policies. Though not necessarily codified in law, professionals from one organization noted that they try to strictly adhere to the Institute of Medicine’s Six Dimensions of Health Care Quality [41], one of which is health equity. This established what they saw as clear alignment between the goals of their organization and the push for anti-discrimination in algorithmic systems. Furthermore, practitioners reported that demographic data is often required for treatment purposes anyway, so obtaining it for fairness assessments is generally not an issue. Interestingly, this was also the main domain in which practitioners saw the potential of algorithmic fairness as extending beyond interventions to the algorithm or product. While in other domains societal bias could sometimes be seen as beyond the scope of the practitioners’ organization, practitioners in healthcare reported using their trained models to inform interventions anywhere in the patient treatment pipeline. In other words, uncovered unfairness did not necessarily have to stem from institutional practices or technical models to warrant addressing.

Less Regulated Domains. In cases outside of these domains, we saw an abundance of uncertainty around anti-discrimination policy that seemed to lead to risk-aversion and inaction. One of the biggest constraints participants described was that, even without clear anti-discrimination policies, there were still fears that procuring demographic data and using it to uncover discrimination without a clear plan to mitigate it — either because doing so might not align with the organization’s priorities or because there are no clear answers for how to resolve the issues — would open the door to legal liability. As PM4 described, “I think there’s always this sort of balancing test that the lawyers are applying, which is weighing whether the benefits of being proactive about testing for bias outweigh the risks associated with this leaking or this becoming scooped up in some sort of litigation.” Participants also expressed the sentiment that unless an organization is distinguishing itself by its commitment to fairness (however defined) or directly responding to outside pressure, it is often easier to just not collect the data and not call attention to any possible algorithmic discrimination.

Implications. As described in congressional testimony given by the GAO [81], financial institutions are not confident that they will not be held liable for disparities uncovered through the collection of demographic data, incentivizing inaction. In other domains as well, the spectre of increased discrimination liability looms large when it comes to demographic data collection. So long as there is a dearth of applicable legal precedent, we are relying on self-regulation in a context where often the safest move is to simply not act.

Drawing on the strategies seen in the healthcare domain, and as has been suggested elsewhere [1, 6], algorithmic fairness assessments can be used as a type of diagnostic tool for the whole network of actors and interactions, identifying salient disparities in intake, treatment, and outcome and pointing to possible causes for these disparities. This approach, however, goes well beyond what is currently required by anti-discrimination law (in cases where requirements even exist) and will likely require a shift in political will to meaningfully take root in other domains.

5.3 Organizational Priorities and Structures

As in any corporate setting, practitioners working on algorithmic fairness face constraints set by organizational structures, risk aversion, and the drive for profit. Though some practitioners described provisions for fairness practitioners to do their work unbeholden to the short-term goals of other product teams, many still described constraints imposed by organizational priorities and practices.

Fairness versus KPIs. Starting with issues arising from tenuous fairness commitments, multiple practitioners mentioned having to justify their bias analyses and calls for more or better data with expected improvements to key performance indicators (KPIs). In these cases, it was not uncommon for practitioners to find that their proposed interventions were orthogonal to certain KPIs, such that they had to make qualitative arguments about how different demographic groups, and in turn the company, would be better served by their interventions. Practitioners who had tried this approach divulged that if it failed, the work could just be abandoned. A similar constraint came from practitioners reporting that at times the cost of gathering more detailed or more representative data simply came with too high of a price tag and too long of a lead time to be deemed worthwhile.

Organizational Momentum. Another significant concern we heard was that often fairness and responsibility teams do not have clear pathways for pushing recommendations into practice. Other teams’ existing processes around data use and procurement are generally already well-established, and so it can be very difficult to figure out both how to plug in operationally as well as how much those processes can be overwritten or overloaded in the pursuit of fairness.

Public Relations (PR) Risk. A final constraint on this front is the need to avoid bad PR stemming from demographic data collection. Practitioners expressed concern that efforts to collect demographic data are likely to be seen as an extension of the frequently reported trends of data misappropriation and data misuse (see, e.g., [69]). Specifically addressing the issue of racial data collection at their organization, PM2 discussed how, from the public’s perspective, such data collection would be viewed with suspicion: “[It’s] just adding one more thing. [The public might say,] ‘now you’re asking me if I’m White or Black, why would you need that information?’ Right? ‘You’re already in trouble with data. How can I trust you with this?’ So I think it was an optics thing, like, we can’t even ask that. It’s kind of off the table.” The calculus can change dramatically, however, in cases where there is already a lot of public attention around a company’s biased services. As described by EC3, “public interest and attention and focus – it actually puts these companies at a tipping point where they can’t just do nothing, because even though they can say they’re complying with all the laws, clearly the public doesn’t feel like that’s adequate.” In cases such as these, participants reported that their companies would start up limited and measured data collection processes to address, at least in part, the source of public concern.

Implications. It is important to note that many of these constraints seem to exist on a spectrum of severity, largely dictated by the amount of buy-in from both leadership and the organization’s clientele to ensuring algorithmic fairness. As organizational practices around fairness shift and solidify, fairness practitioners will likely need to play a central role in making the argument for collecting demographic data despite these organizational risks [62].

Tying into Section 5.2, the PR risk of the appearance of discrimination is also likely to translate into some legal risk. Though this risk was rarely discussed by participants, past work by the authors has outlined some of the ways that attempts at algorithmic fairness might be litigated against for being discriminatory under certain legal interpretations of anti-discrimination law [83]. It remains to be seen, however, how these types of cases will play out in court.

6 Fairness Practitioners’ Concerns

Beyond business, legal, and reputational pressures from other parts of the organization, we see a number of sources of caution surrounding the procurement and subsequent use of demographic data on the fairness practitioner side.

6.1 Concerns Around Self-Reporting

Regarding the actual process of procuring demographic data, practitioners reported a number of concerns around the effectiveness of current collection methods.

Frequency and Accuracy of Responses. The most commonly reported concern about self-reported demographic data was that it is often unreliable or incomplete. In many cases, there are not strong incentives for individuals to respond to requests for their demographic data. Outside of government surveys and forms, participants noted that organizations often need to incentivize individuals to provide accurate data.⁴ In the case of the American financial industry, one participant discussed how although they are permitted by federal law to collect demographic data through optional surveys, the response rates are incredibly low. While it might have been possible to improve the response rate by making the intended use for the data clearer [20], this participant noted that more reliably collecting demographic data could actually increase the liability of the institution, as discussed in Section 5.2. As such, the sparse survey data was only used for validating data that had been compiled and inferred via other sources.

⁴ This is by no means a new problem; social scientists have long been studying how to increase survey response rates and the biases associated with non-responses [54].

There are many reasons why a survey might get a low response rate, but the most frequently discussed reason was the matter of trust. LR6 discussed how when their team was working on improving search diversity for their social media platform, they surveyed users asking, “would you want to give us more demographic information?” LR6 described how “the consensus was, ‘no, [if you collect this data,] then what are you not showing me? [...] Who are you to decide what I would like just because I am this person?’”

Within the healthcare domain, practitioners were acutely aware of the types of difficulties associated with accurate data collection, as they collect demographic data for treatment analysis and to enhance their ability to provide care. As LR8 described, “We think about data collection as data donation, it’s like donating blood. It takes effort, someone’s got to do it, right? We don’t have [an] automated way to collect all this information, so we have to think carefully about the most effective and efficient ways to collect [...] most of the data that we collect.”

Collection Procedure Difficulties. On top of data reliability concerns, how organizations would even go about the process of collecting demographic data is unclear. EC4, an expert who assists platform companies trying to deal with potential bias or discrimination, discussed how an organization might “want to do a disparate impact [analysis] on gender” when “[they] don’t have gender.” EC4 asked, “Are [they] going to throw up a [pop-up] and say, ‘Hey, tell us about your gender identity,’ to their whole user base? There’s some cost to doing that.” Due to the range of potential public responses and PR risks as described in Section 5.3, the reputational cost could even exceed the expense of designing and deploying such a pop-up. So, as EC4 put it, whether “any company is ever going to do that to a user base of millions of people that are already on their service” remains an open question.

Flexible Categories. Furthermore, some practitioners discussed how even reliably collected demographic data can lead to issues down the line. Most often, demographic data is self-reported by selecting only one answer from a set of predetermined options, which may or may not include an “Other” category. What individuals signify by selecting one of these options, however, is not necessarily consistent. TC9 outlined a number of the potential concerns that can arise here surrounding gender:

    “And I think one thing that I’m very conscious of is that anything can be misinterpreted. [...] So for example, we have four gender options, ‘male,’ ‘female,’ ‘unknown,’ and ‘custom gender’, where ‘custom gender’ is an opt-in, non-binary gender that a user can add a specific string for. ‘Unknown’ is you haven’t provided it or you’ve explicitly said you don’t want to tell us your gender. It’s a whole bunch of things. And even though this is not inferred data, I’m still very, very careful whenever I talk about gender analyses, to be really clear about what exactly we’re seeing here, because ‘custom gender’ does not mean non-binary, ‘unknown’ gender does not mean the person opted into not giving it to us.”

Similarly, other practitioners described exercising caution around the demographic data they do have access to, acknowledging the myriad identities that can fit within a single box on a form.

Implications. At a high level, when data subjects are wary of the reasons behind or potential outcomes of demographic data collection, it is reasonable to expect that response rates and response accuracy will be very poor. While it is likely to be difficult to transfer the healthcare style of thinking to other domains, posing data collection for anti-discrimination and fairness assessments as something akin to “data donation” might be a useful framing for increasing buy-in and response rates.

6.2 Concerns Around Proxies and Inference

Where practitioners’ concerns around the collection of demographic data mirrored some of the well-established pitfalls and risks of survey design, we see a rather different set of concerns around the use of proxies and inferred demographic data. For sensitive demographic features that are unlikely to be approved for collection by various oversight teams (often for reasons given in Section 5), we sometimes see practitioners infer attributes directly using available data or look at other available categories of data that have an implicit correlation with attributes of interest.

Prevalence of Proxies and Inferred Categories. When asked whether, and if so how, inferred demographics were used in any of their fairness analyses, we heard a wide range of responses from practitioners. Participants from U.S. financial institutions noted that they follow the standard set by the Consumer Financial Protection Bureau in using Bayesian Improved Surname Geocoding (BISG) to infer individual race [14]. Participants from domains where demographic data collection is mandated, such as the HR/Hiring space, on the other hand, balked at the question, as exemplified by the response of LR4: “We both A) don’t have a reason to do that in the line of what we’re doing, B) would be deeply opposed to doing that in this context. [...] Fascinating, I would want to know who else would do that in what context and why you would ever do that?”

Responses from participants in other domains were situated between these two extremes. Oftentimes inference is the only available option for conducting discrimination or bias assessments, but there are few standardized practices for doing so. Of the participants that discussed using demographic inference, it was largely for content data, such as in skin-tone classification for images or author gender prediction for pieces of text. Inferring demographics in other cases, while perhaps feasible, was seen as introducing privacy risks as well as dignity concerns to any fairness evaluations conducted with them. These concerns were especially heightened around sensitive attributes. As described by TC5, “In contexts of race or sexual orientation, or other really sensitive groups, we don’t view inferring attributes without fully informed user consent as a path forward. We think that all of the concerns are around whether [our organization] has the data. And so, if we were to try to pursue [acquiring the data], we would need to do it explicitly and, you know, with fully informed consent.” Other practitioners suggested that when inference is done at the group level (i.e., “X percent of this dataset is women”), that might resolve some of the consent and dignity concerns of classifying individuals.

When Inferred Categories are Preferred. Although inferred demographics are generally used in contexts where self-reported demographics are not available, an interesting dilemma arises in cases where inferred demographics are more suitable for the task. For instance, EC3 discussed how on one of the systems they worked on, there were potential issues based on how users perceived other users’ races: “So you’re doing perceived race. Then the question is, so if I’m a tech company and I want to study this racial experience gap, right, should I go around guessing everybody’s races?” While having labelers or an algorithm guess someone’s race is perhaps the best way to operationalize the concept of perceived race, EC3 and others made sure to point out that it opens the door to subsequent backlash given the sensitivity of these labels.

Treating Proxies as a Weak Signal. A unique set of concerns arises for proxies that are not explicitly treated as such. In some cases, when a salient attribute is inaccessible, practitioners use other available attributes to obtain some signal about potential discrimination or unfairness surrounding that more important attribute (e.g., using a subscription tier-level to point towards potential problems across socio-economic status). In cases such as these, practitioners were very clear that these proxies are not meant to be treated as the attribute itself: TC9 explained, “We’re very careful to not draw further conclusions, because we don’t want to make assumptions or directly infer [the attribute of interest], basically.” Instead these attributes are treated as a very rough signal, where if you notice large disparities, you should treat it as a call for deeper inspection.

Implications. Although the algorithmic fairness literature has frequently called out the dangers of using proxies for salient attributes or target variables in fairness assessments [8, 18, 42], this is often the only option practitioners have to make a quantitative argument for escalation and deeper analysis. The real risk that we see arising with this practice is when the use of the proxy becomes sufficiently normalized that practitioners no longer reflect on its inadequacies. Proxies should not just be plugged into standard bias mitigation techniques without first clearly outlining how they constitute a “measurement model” of the desired attribute [42]. Using them to uncover egregious discrepancies, however, might inspire interrogation of the full decision-making pipeline. This in turn could reveal more about the conditions of unfairness than would a technical tweak to the algorithm.
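One way to keep a proxy "weak" in the sense described above is to wire it in only as an escalation trigger rather than as a stand-in for the attribute itself. The sketch below is our own illustration; the subscription-tier proxy, the error rates, and the threshold are hypothetical, not drawn from any participant's system.

```python
# Hypothetical sketch: use a loose proxy (subscription tier) only to decide
# whether a deeper, human-led review is warranted, never as the attribute itself.
error_rates_by_tier = {"free": 0.18, "basic": 0.12, "premium": 0.07}  # toy numbers

ESCALATION_THRESHOLD = 0.05  # absolute error-rate gap that triggers review

gap = max(error_rates_by_tier.values()) - min(error_rates_by_tier.values())
if gap > ESCALATION_THRESHOLD:
    print(f"Error-rate gap of {gap:.2f} across tiers: escalate for a fuller "
          "review of the decision pipeline (proxy signal only, not a finding).")
else:
    print("No large disparity across the proxy; continue routine monitoring.")
```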

6.3 Relevance of Demographic Categories
Most types of fairness analysis require practitioners to identify and focus on discrete groups, but there is often discomfort and uncertainty around which groups warrant attention and how those groups should be defined. These concerns were especially salient within organizations and teams that had to build and maintain products and services for regions outside of their own.
Deficient Standards. Looking first to the definition of demographic categories, numerous practitioners discussed the inadequacy of the standard categories. This was especially a concern in industries where demographic data is collected according to governmental standards. PM7, a product manager in the hiring/HR domain, pointed to how they were actively "finding ways to gather more data outside of the employment law data," as "the categories were defined back in the sixties and are challenging these days. [...] You only get six boxes for your race. There's two boxes for gender. You know, it just doesn't work for a lot of people." For other domains, the established categories were sometimes used simply out of tradition or because it would be hard to change them.
Difficulties with Granular Representations. A common response to this problem was to try to make the categories more representative, but practitioners were generally unsure of where to draw the line. R3, who was tasked with expanding the demographic categories and variables of concern used by their organization, said of this issue: "You can go on and on and create a list that's like 20 or even a hundred long, but we need to set a limit, [...] like what's the reasonable limit on how many categories you want to put in such a question and what we want to recommend internally to like product teams to at least sample across." A common concern here is that statistical significance starts to slip away as the groups get smaller. Concerning asking more granular demographic questions, a practitioner from the Healthcare domain (LR8) asked, "how should we capture [...] relatively low frequency answers to those questions? [...] All of a sudden you become an N of 1 and you don't fit into any category and we can't figure out how to help you. Whereas, if we could have lumped you into another category, might we have been able to do better?"
What Categories Should Be Focused On? Whether or not the demographic categories are expanded before data collection, there still remains the question of which categories, or combinations of categories, are most relevant to consider for evaluation and deeper inspection. PM4 said of this problem, "[this] step of [asking], 'what demographic slices are most relevant,' that's very difficult to get because almost nobody feels like they have the authority to say, 'these are the ones that matter.'" While it might not require much extra effort or resourcing to assess bias across all available attributes, practitioners' teams were also generally responsible for informing algorithmic, design, and policy changes. On top of this, several practitioners recounted needing to find bias issues with groups at the intersection of multiple attributes as well [19, 37]. Considering all possible intersections of demographic attributes is generally not practical for a small team (e.g., five attributes with four options each already yield 4^5 = 1,024 total groups), so some practitioners had used written feedback from clients and users to help identify which groups their systems were not working well for.
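To make the combinatorial growth concrete, the short sketch below enumerates the intersectional subgroups implied by the five-attribute, four-option example above; the attribute names and the evaluation-set size are hypothetical and purely illustrative.

```python
from itertools import product

# Illustrative only: five hypothetical demographic attributes with four options
# each, mirroring the worked example in the text. All names are made up.
attributes = {
    "age_band": ["18-29", "30-44", "45-64", "65+"],
    "gender": ["woman", "man", "nonbinary", "self-described"],
    "race_ethnicity": ["group_a", "group_b", "group_c", "group_d"],
    "primary_language": ["lang_1", "lang_2", "lang_3", "lang_4"],
    "region": ["north", "south", "east", "west"],
}

# Each intersectional subgroup is one combination of options across attributes.
subgroups = list(product(*attributes.values()))
print(len(subgroups))  # 4**5 = 1024

# Even a sizable evaluation set spreads thin across that many cells.
n_eval_samples = 50_000
print(n_eval_samples / len(subgroups))  # ~49 samples per subgroup on average, before any skew
```

Once real-world skew is taken into account, many cells will be far smaller than this average, which is exactly the "N of 1" problem LR8 describes.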

Arguably the most straightforward approach to this problem is to just defer to legal or policy requirements around what groups to consider, but these often do not map onto practitioners' views of what is ethically or culturally salient. R1 notes that this can become a significant impediment to making systems fairer in practice: "I'd say that the main conflict is where we'll do some testing on a product and reveal issues. And we'll say 'look, this, it doesn't perform on this group.' Or, 'it makes these failures.' And then the policy team will say [something] like, 'well, we don't have a specific policy about this, so why are you testing for it?'" When these conflicts arise between the types of categories practitioners think are most salient and the categories that their organizations expect them to assess bias for, a substantial minority of participants reported having to just shelve their work on those categories.
Regional Differences. An additional concern reported by practitioners was that their companies' awareness of what categories are legally salient is often based on the laws of the region in which the company is located. Concerning the general focus on "race, ethnicity, gender, and maybe sexual orientation" at their organization, PM4 remarked, "I often think that the way we approach this is very U.S. centric, that our understanding of what demographic characteristics would be relevant in other countries and cultures is pretty rudimentary." Some organizations try to ameliorate this concern by introducing subject matter experts with understandings of regional and cultural standards, but this is often only if there are explicit anti-discrimination regimes they have to work within. In reference to their algorithmic HR support tools, LR1 said, "The main problem is local employment law. We don't have knowledge of local employment law, and every country is different. And so we always work with what we call a HR business partner, who is able to bring in labor experts if needed."
Convenience. Finally, when the relevant data is just not available for testing, practitioners may default to whatever category data is on hand. When discussing ensuring dataset diversity beyond gender, PM1 stated that "it's a complex problem because we are not allowed to collect such [demographic] data. So that's why gender is prioritized, because we can do something about that." Similar to the reliance on loose proxies, this issue of focusing on categories due to availability rather than social salience or public demand was a common concern relayed to us by participants.
Implications. Adhering to inapt categorization schemas can be an impediment to meaningfully understanding the types of bias present, as the demographic categories used can lead to dramatically disparate analyses and conclusions [33, 39]. Tackling this risk will likely require earnest attempts to grapple with specific issues in limited contexts, as the most salient forms of possible discrimination are likely to be missed by a "one-size-fits-all" approach that prioritizes scalability.

6.4 Ensuring Trustworthiness and Alignment
Though there is seemingly an alignment between model developer and data subject when implementing fairness interventions, practitioners discussed how this relationship can be difficult to bear out in practice due to both low public trust in technology companies and weak alignment between company and community goals.
Generating Well-Founded Trust. On the issue of trust, some practitioners in less-regulated industries expressed concerns about whether their organizations could take the steps needed to meaningfully garner public trust around demographic data collection. Often citing recent PR cycles around data misuse and mishandling (e.g., [51]), practitioners expressed concern that users and clients simply do not believe their data is going to be used in a way they have control over or for their benefit. We even heard from a few practitioners that they were not confident themselves that data collected for bias, fairness, or discrimination assessments would be sufficiently sheltered from other uses. One potential path forward here came from external auditors and consultants. EC4 noted that such external entities could require organizations to make enforceable, public commitments to use specially collected data narrowly, only for fairness and anti-discrimination purposes, in order to surmount concerns around misuse.

Some participants did, however, express hopes that their organizations might be moving in the direction of making clearer what data they would like to collect and why. A key element of trust-building, as one practitioner pointed out, is that the relationship needs to go both ways: users and clients provide the data needed to assess system performance, and organizations then follow through on their commitments, providing at least a summary of the results back to users and clients. As described by LR9, "trust is a word that is almost meaningless. Trustworthy is something else, and much more weighty. [...] Trustworth[iness] is showing capacity on both sides of the trust dynamic. [Saying] someone is trustworthy means that they can receive information and treat it respectfully. Trust alone is...toothless." There are likely to be many organizational and legal barriers to this type of trust-building exercise, but some practitioners felt that they might need to start exploring other ways of addressing fairness if the sensitive data they currently need cannot be collected in a trustworthy, consensual way.
Alignment with Data Subjects. Building on the notions of trustworthiness and consent, a few practitioners reported trying to reach a higher level of alignment with their data subjects. Concerning the use of demographic data in their research, R2 broached the question, "if you recognize that this system is working poorly for some demographic, how do you rectify that in a way that meaningfully pays attention and gives voice to this group that it's not working for?" By showing a commitment to centering the perspectives and needs of the groups that are either failed by or directly harmed by algorithmic systems, R2 and a few other participants suggested, it is more likely that individuals from those groups would be willing to share personal information, including demographic attributes.

Implications. Reinforcing practitioner perspectives that the public is wary of corporate data handling, recent surveys have shown that Americans are largely unclear on how their data gets used and are not convinced that it is used for their benefit [4]. R2's question around giving voice to groups that systems do not work for points to one possible path forward in making systems more trustworthy and resolving this trust gap. By ensuring that data subjects not only have a say in what data gets collected but are also involved in thinking about how the data should be used, it is both more likely that data will be employed in the interest of those it is about and that the fairness achieved will have material benefits for their lives [22, 29, 47, 55, 58, 78].

6.5 Mitigation Uncertainty
If practitioners are uncertain about how effectively they will be able to use demographic data and make changes based upon it, they can be deterred from trying to overcome the barriers and concerns described in previous sections.
What Should Be Done About Uncovered Bias. As discussed in Section 1, the algorithmic fairness literature provides a number of techniques for potentially mitigating detected bias, but practitioners noted that there are few applicable, touchstone examples of these techniques in practice. Moreover, the proposed methods often do not detail exactly what types of harms they are designed to mitigate, so it is not always clear which methods should be applied in which contexts. The only domains in which we saw practitioners proceed with confidence were those with relatively well-established definitions of what it means to make systems fair (e.g., Hiring/HR and Finance) and those where the solutions were not limited to algorithmic fixes (e.g., Healthcare). In the first case, precedent such as the 80% rule (see Section 5.2) gave practitioners confidence that they could appropriately mitigate uncovered bias. In the latter case, a lack of clear technical solutions did not hamper efforts to detect bias, since practitioners could report evidence of bias to relevant stakeholders and the necessary interventions could be made. As an example, LR8 discussed a case where a model showed that individuals who spoke English as a second language had worse predicted health outcomes associated with sepsis. Instead of pursuing a technical intervention, they found that some of the issues stemmed from an absence of Spanish-language materials on the infection. In this case, had the spoken-language data field either not been collected or been thrown out of the model, as a naive anti-discrimination policy might dictate, equitable interventions would have been inhibited.
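For readers less familiar with that precedent, the sketch below shows one common way the 80% (four-fifths) rule is operationalized as a selection-rate ratio check, assuming it refers to the EEOC four-fifths guideline on selection rates [17]; the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical hiring outcomes: one row per applicant, with a binary decision.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "selected": [1, 1, 0, 1, 1, 0, 0, 0, 1],
})

# Selection rate per group.
rates = df.groupby("group")["selected"].mean()

# Four-fifths rule: each group's selection rate should be at least 80% of the
# highest group's rate; ratios below 0.8 flag potential adverse impact.
ratios = rates / rates.max()
flagged = ratios[ratios < 0.8]
print(ratios)
print("Potential adverse impact for:", list(flagged.index))
```

Ratios below 0.8 are conventionally treated as evidence of potential adverse impact that warrants further review, not as a conclusive finding of discrimination.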

Uncertainty About Whether Proposed Interventions Will Be Adopted. Fairness practitioners often want to go beyond just the minimum requirements of corporate policy, but it is not always clear what will be permitted. One element of this is that, as described in Section 6.3, their work can simply run ahead of what the policy team has considered. When conducting fairness analyses, practitioners mentioned surfacing uncomfortable questions, such as what their responsibility is for historical inequalities that impact various performance metrics. When organizations are not prepared to answer these questions, it can be difficult for practitioners to know how to (and whether to) proceed with demographic data collection and use. A related issue is that when there are established policies on fairness, they generally seek to ensure all parties are treated "the same." Whether this is through equalizing performance metrics across groups, such as calibration or predictive parity, or by removing/ignoring demographic data entirely, practitioners worried that achieving equity might require more nuanced treatments of different groups. As argued by TC9, "I would like to make a little bit stronger of a stand on [being] anti-bias. Instead of just making sure our models aren't biased, [we should] potentially carve out a space for treating groups differently and treating the historically marginalized group with more care."

Implications. Though organizational policies around product fairness and anti-discrimination are becoming more common, our interviews revealed that product and research teams are still the ones left to grapple with what constitutes algorithmic fairness. While these teams are equipped to answer some of the questions that arise, optimizing for fairness can and will surface larger questions about the role of corporations in creating a more just world, so the task cannot be left to these teams alone. Furthermore, in the cases we heard about where these teams were given explicit guidance about how to resolve these dilemmas, it was usually via policies that attempted to treat every group the same. This strategy of trying to espouse a "view from nowhere" [34], however, often just means taking the perspective of the majority or those with power (e.g., multiple participants discussed hate-speech classifiers that would always classify texts containing the N-word as hate speech, implying that the speaker is not Black). Before new demographic data collection efforts are taken up for the sake of algorithmic fairness, we recommend that organizations first establish a clear goal for their fairness efforts and a plan for how to empower the relevant teams to achieve those goals.

7 Discussion
Having mapped out this dense web of challenges practitioners face when attempting to use demographic data in service of fairness goals, we do not believe that the next step should be to simply lower the barriers to collecting demographic data. To the contrary, many of the challenges participants raised highlight deep normative questions not only around how but also whether demographic data should be collected and used by algorithm developers. While the main goal of this paper was to explore the contours of these normative questions from the practitioner perspective, we take the rest of this section to consider three elements of possible paths forward: clearer legal requirements around data protection and anti-discrimination, privacy-respecting algorithmic fairness strategies, and meaningful agency of data subjects.

Looking first at legal provisioning, using demographic data to assess and address algorithmic discrimination will require clear carve-outs in data protection regulation. This could entail establishing official third-party data holders and/or auditors, or it could simply mean opening the door for private organizations to do this work themselves. During the course of this interview project, we did see the UK Information Commissioner's Office publish explicit guidance on how to legally use special category data to assess fairness in algorithmic systems [74], but it is not yet clear what impact this will have. The approach of relaxing demographic data protections by itself, however, will likely leave many of the other constraints and practitioner concerns unaddressed. Further still, anti-discrimination law does not necessarily provide incentives for using demographic attributes to inform algorithmic decision-making [11, 35, 73, 83]. As such, data protection carve-outs will likely need to coincide with updated protections around algorithmic discrimination. Looking to domains where demographic data collection is already legally provisioned (e.g., HR in the U.S.), we see that practitioner concerns do not dissolve so much as change shape (e.g., issues with the government categories), suggesting the importance of crafting this legislation in collaboration with practitioners.

Turning to algorithmic fairness strategies that maintain privacy, these are often pitched as a means of enabling data collection in low-trust environments, ensuring such data can only be used for assessing and addressing bias and discrimination. Veale and Binns [76] recommend a number of approaches that enlist third parties with higher trust to collect the data and conduct secure fairness analyses, fair model training, and knowledge sharing. Other proposed methods build on this third-party arrangement, ensuring cryptographic or differential privacy from the point of collection [43, 48]. Decentralized approaches such as federated learning could have similar benefits, as they would possibly eliminate the need for data sharing entirely, though methods for fairness analyses are less built out in this area [45]. While these methods can address some of the trust-related concerns discussed by practitioners, they can also exacerbate other challenges. By off-loading the responsibilities of collecting, handling, and processing this data, the third party would also inherit some of the practitioners' more socially-minded concerns, such as how to determine demographic saliency and how to meaningfully resolve detected unfairness. Depending on the nature of the third-party organization and the level of investment from the primary company, this deferral of responsibility could increase the likelihood that these questions simply remain unaddressed.
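As a rough illustration of what privacy at the point of collection can look like, the sketch below applies randomized response to a single binary demographic attribute. This is a generic local-privacy technique offered only as an example; it is not the specific mechanism proposed in [43, 48], and all parameters are hypothetical.

```python
import random

def randomized_response(true_value: bool, p_truth: float = 0.75) -> bool:
    """Report the true attribute value with probability p_truth; otherwise flip a fair coin."""
    if random.random() < p_truth:
        return true_value
    return random.random() < 0.5

def estimate_true_rate(reports: list[bool], p_truth: float = 0.75) -> float:
    """Debias the noisy reports, since E[observed] = p_truth * true_rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

# Simulated collection from a population where 30% hold the attribute.
random.seed(0)
population = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(v) for v in population]
print(round(estimate_true_rate(reports), 3))  # close to 0.3, recovered from noisy reports alone
```

Group-level bias metrics can then be computed from such debiased estimates, at the cost of added statistical noise, which is one reason these designs tend to be paired with larger samples or trusted aggregators.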

The final approach we consider is more firmly incorporating data subjects into the work around fairness. As a relevant and timely example of why this is important, throughout the course of this project we saw debates flare up around the use of racial data in COVID response plans [15, 44, 63, 70]. Opponents argue that without a strong commitment to actually intervening on and reversing systemic health inequalities, generating such data is more likely to perpetuate harm (e.g., through the construction of various negative narratives blaming disadvantaged communities for health disparities [70]) than to address it [59]. In a similar way, we might expect hasty, uncritical pushes for demographic data collection to leave the roots of discrimination unaddressed, possibly even covering deeper issues with the veneer of having reduced algorithmic bias. As such, we see more transparent, inclusionary practices surrounding demographic data and algorithmic fairness [47, 55, 82] as a path forward that can clearly address many of the issues outlined above. Though meaningfully taking steps to keep data subjects informed and to give them a voice in how their data is used is likely to be difficult and costly, doing so would mitigate privacy risks and a majority of practitioners' concerns. Data subject consent is a widely consistent legal basis for demographic data collection, and concerns around data reliability would be reduced as the collection process comes to reflect what LR8 referred to as "data donation." Furthermore, by deeply engaging various groups in the collection and use of demographic data, answers to questions around what the salient demographics are, as well as how to meaningfully address uncovered bias, will be closer at hand.

Acknowledgments
We want to thank all the busy interviewees who took the time to talk with us about their work and all the wonderful members of the Partnership on AI community who helped us scope this project.

References

[1] McKane Andrus and Thomas K. Gilbert. 2019. Towards a Just Theory of Measurement: A Principled Social Measurement Assurance Program for Machine Learning. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. ACM, Honolulu, HI, USA, 445–451. https://doi.org/10.1145/3306618.3314275

[2] McKane Andrus, Elena Spitzer, and Alice Xiang. 2020. Working to Address Algorithmic Bias? Don't Overlook the Role of Demographic Data. https://www.partnershiponai.org/demographic-data/

[3] Jennifer Attride-Stirling. 2001. Thematic networks: an analytic tool for qualitative research. Qualitative Research 1, 3 (Dec. 2001), 385–405. https://doi.org/10.1177/146879410100100307

[4] Brooke Auxier, Lee Rainie, Monica Anderson, Andrew Perrin, Madhu Kumar, and Erica Turner. 2019. Americans and privacy: Concerned, confused and feeling lack of control over their personal information. Pew Research Center: Internet, Science & Tech (blog). November 15, 2019.

[5] Chelsea Barabas. 2019. Beyond Bias: Re-Imagining the Terms of 'Ethical AI' in Criminal Law. SSRN Electronic Journal (2019). https://doi.org/10.2139/ssrn.3377921

[6] Chelsea Barabas, Madars Virza, Karthik Dinakar, Joichi Ito, and Jonathan Zittrain. 2018. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment. In Conference on Fairness, Accountability and Transparency. PMLR, 62–76.

[7] Bissan Barghouti, Corinne Bintz, Dharma Dailey, Micah Epstein, Vivian Guetler, Bernease Herman, Pa Ousman Jobe, Michael Katell, P. M. Krafft, Jennifer Lee, Shankar Narayan, Franziska Putz, Daniella Raz, Brian Robick, Aaron Tam, Abiel Woldu, and Meg Young. 2020. Algorithmic Equity Toolkit. https://www.aclu-wa.org/AEKit

[8] Solon Barocas and Andrew D. Selbst. 2016. Big Data's Disparate Impact. California Law Review 104 (2016), 671. https://heinonline.org/HOL/Page?handle=hein.journals/calr104&id=695&div=&collection=

[9] Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. 2018. AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. arXiv:1810.01943 [cs] (Oct. 2018). http://arxiv.org/abs/1810.01943

[10] Cynthia L. Bennett and Os Keyes. 2020. What Is the Point of Fairness?: Disability, AI and the Complexity of Justice. ACM SIGACCESS Accessibility and Computing 125 (March 2020), 1–1. https://doi.org/10.1145/3386296.3386301

[11] Jason R. Bent. 2020. Is Algorithmic Affirmative Action Legal? Georgetown Law Journal 108 (2020), 803.

[12] Sebastian Benthall and Bruce D. Haynes. 2019. Racial Categories in Machine Learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* '19. ACM Press, Atlanta, GA, USA, 289–298. https://doi.org/10.1145/3287560.3287575

[13] Miranda Bogen, Aaron Rieke, and Shazeda Ahmed. 2020. Awareness in practice: tensions in access to sensitive attribute data for antidiscrimination. In Proceedings of the 2020 conference on fairness, accountability, and transparency. 492–500.

[14] Consumer Financial Protection Bureau. 2014. Using publicly available information to proxy for unidentified race and ethnicity. (2014). https://www.consumerfinance.gov/data-research/research-reports/using-publicly-available-information-to-proxy-for-unidentified-race-and-ethnicity/

[15] Brandee Butler. 2020. For the EU to Effectively Address Racial Injustice, We Need Data. Al Jazeera.

[16] Marika Cifor, Patricia Garcia, TL Cowan, Jasmine Rault, Tonia Sutherland, Anita Say Chan, Jennifer Rode, Anna Lauren Hoffmann, Niloufar Salehi, and Lisa Nakamura. 2019. Feminist data manifest-no. https://www.manifestno.com/

[17] US Equal Employment Opportunity Commission et al. 1979. Questions and answers to clarify and provide a common interpretation of the uniform guidelines on employee selection procedures.

[18] Sam Corbett-Davies and Sharad Goel. 2018. The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning. arXiv:1808.00023 [cs] (Aug. 2018). http://arxiv.org/abs/1808.00023

[19] Kimberlé Crenshaw. 1989. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. U. Chi. Legal F. (1989), 139.

[20] Cara Crotty. 2020. Revised form for self-identification of disability released. https://www.constangy.com/affirmative-action-alert/revised-form-for-self-identification-of-disability

[21] d4bl. 2020. Data 4 Black Lives. https://d4bl.org/

[22] Roel Dobbe, Sarah Dean, Thomas Gilbert, and Nitin Kohli. 2018. A Broader View on Bias in Automated Decision-Making: Reflecting on Epistemology and Dynamics. arXiv:1807.00553 [cs, math, stat] (July 2018). http://arxiv.org/abs/1807.00553

[23] Alexander D'Amour, Hansa Srinivasan, James Atwood, Pallavi Baljekar, D. Sculley, and Yoni Halpern. 2020. Fairness is not static: Deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 conference on fairness, accountability, and transparency (FAccT '20). Association for Computing Machinery, New York, NY, USA, 525–534. https://doi.org/10.1145/3351095.3372878

[24] Marc N. Elliott, Peter A. Morrison, Allen Fremont, Daniel F. McCaffrey, Philip Pantoja, and Nicole Lurie. 2009. Using the Census Bureau's surname list to improve estimates of race/ethnicity and associated disparities. Health Services and Outcomes Research Methodology 9, 2 (2009), 69.

[25] European Parliament and Council of European Union. 2016. Regulation (EU) 2016/679 (General Data Protection Regulation). https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN

[26] German Bundestag. 2017. Federal Data Protection Act of 30 June 2017 (BDSG). 2097 pages. https://www.gesetze-im-internet.de/englisch_bdsg/englisch_bdsg.html

[27] Soheil Ghili, Ehsan Kazemi, and Amin Karbasi. 2019. Eliminating latent discrimination: Train then mask. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3672–3680.

[28] Bryce W. Goodman. 2016. A step towards accountable algorithms? Algorithmic discrimination and the European Union general data protection. In 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona. NIPS Foundation.

[29] Ben Green and Salomé Viljoen. 2020. Algorithmic Realism: Expanding the Boundaries of Algorithmic Thought. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. ACM, Barcelona, Spain, 19–31. https://doi.org/10.1145/3351095.3372840

[30] Maya Gupta, Andrew Cotter, Mahdi Milani Fard, and Serena Wang. 2018. Proxy Fairness. arXiv:1806.11212 [cs, stat] (June 2018). http://arxiv.org/abs/1806.11212

[31] Sara Hajian, Josep Domingo-Ferrer, Anna Monreale, Dino Pedreschi, and Fosca Giannotti. 2015. Discrimination- and Privacy-Aware Patterns. Data Mining and Knowledge Discovery 29, 6 (Nov. 2015), 1733–1782. https://doi.org/10.1007/s10618-014-0393-7

[32] Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M. Branham. 2018. Gender Recognition or Gender Reductionism?: The Social Implications of Embedded Gender Recognition Systems. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI '18. ACM Press, Montreal, QC, Canada, 1–13. https://doi.org/10.1145/3173574.3173582

[33] Alex Hanna, Emily Denton, Andrew Smart, and Jamila Smith-Loud. 2020. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 501–512.

[34] Donna Haraway. 1988. Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective. Feminist Studies 14, 3 (1988), 575–599. https://doi.org/10.2307/3178066

[35] Zach Harned and Hanna Wallach. 2019. Stretching human laws to apply to machines: The dangers of a 'Colorblind' Computer. Florida State University Law Review, Forthcoming (2019).

[36] Tatsunori B. Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. 2018. Fairness Without Demographics in Repeated Loss Minimization. arXiv:1806.08010 [cs, stat] (July 2018). http://arxiv.org/abs/1806.08010

[37] Anna Lauren Hoffmann. 2019. Where Fairness Fails: Data, Algorithms, and the Limits of Antidiscrimination Discourse. Information, Communication & Society 22, 7 (June 2019), 900–915. https://doi.org/10.1080/1369118X.2019.1573912

[38] Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé III, Miro Dudík, and Hanna Wallach. 2019. Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need? Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI '19 (2019), 1–16. https://doi.org/10.1145/3290605.3300830 arXiv:1812.05239

[39] Junia Howell and Michael O. Emerson. 2017. So What "Should" We Use? Evaluating the Impact of Five Racial Measures on Markers of Social Inequality. Sociology of Race and Ethnicity 3, 1 (Jan. 2017), 14–30. https://doi.org/10.1177/2332649216648465

[40] Lily Hu and Issa Kohler-Hausmann. 2020. What's sex got to do with machine learning? In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 513–513.

[41] Institute of Medicine (US) Committee on Quality of Health Care in America. 2001. Crossing the Quality Chasm: A New Health System for the 21st Century. National Academies Press (US), Washington (DC). http://www.ncbi.nlm.nih.gov/books/NBK222274/

[42] Abigail Z. Jacobs and Hanna Wallach. 2019. Measurement and Fairness. arXiv:1912.05511 [cs] (Dec. 2019). http://arxiv.org/abs/1912.05511

[43] Matthew Jagielski, Michael Kearns, Jieming Mao, Alina Oprea, Aaron Roth, Saeed Sharifi-Malvajerdi, and Jonathan Ullman. 2019. Differentially private fair learning. In International Conference on Machine Learning. PMLR, 3000–3008.

[44] LLana James. 2020. Race-Based COVID-19 Data May Be Used to Discriminate against Racialized Communities. http://theconversation.com/race-based-covid-19-data-may-be-used-to-discriminate-against-racialized-communities-138372

[45] Peter Kairouz et al. 2019. Advances and Open Problems in Federated Learning. arXiv:1912.04977 [cs, stat] (Dec. 2019). http://arxiv.org/abs/1912.04977

[46] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. 2011. Fairness-aware Learning through Regularization Approach. In 2011 IEEE 11th International Conference on Data Mining Workshops. 643–650. https://doi.org/10.1109/ICDMW.2011.83

[47] Michael Katell, Meg Young, Dharma Dailey, Bernease Herman, Vivian Guetler, Aaron Tam, Corinne Binz, Daniella Raz, and P. M. Krafft. 2020. Toward Situated Interventions for Algorithmic Equity: Lessons from the Field. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. ACM, Barcelona, Spain, 45–55. https://doi.org/10.1145/3351095.3372874

[48] Niki Kilbertus, Adrià Gascón, Matt J. Kusner, Michael Veale, Krishna P. Gummadi, and Adrian Weller. 2018. Blind Justice: Fairness with Encrypted Sensitive Attributes. arXiv:1806.03281 [cs, stat] (June 2018). http://arxiv.org/abs/1806.03281

[49] Satya Kuppam, Ryan Mckenna, David Pujol, Michael Hay, Ashwin Machanavajjhala, and Gerome Miklau. 2020. Fair Decision Making Using Privacy-Protected Data. arXiv:1905.12744 [cs] (Jan. 2020). http://arxiv.org/abs/1905.12744

[50] Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and Ed H. Chi. 2020. Fairness without Demographics through Adversarially Reweighted Learning. arXiv:2006.13114 [cs, stat] (June 2020). http://arxiv.org/abs/2006.13114

[51] Issie Lapowsky. 2019. How Cambridge Analytica Sparked the Great Privacy Awakening. Wired (March 2019). https://www.wired.com/story/cambridge-analytica-facebook-privacy-awakening/

[52] LinkedIn. [n.d.]. LinkedIn Recruiter: The Industry-Standard Recruiting Tool. https://business.linkedin.com/talent-solutions/recruiter

[53] Efrider Maramwidze-Merrison. 2016. Innovative Methodologies in Qualitative Research: Social Media Window for Accessing Organisational Elites for Interviews. 12, 2 (2016), 11.

[54] Kent H. Marquis, M. Susan Marquis, and J. Michael Polich. 1986. Response bias and reliability in sensitive topic surveys. J. Amer. Statist. Assoc. 81, 394 (1986), 381–389.

[55] Donald Martin Jr., Vinodkumar Prabhakaran, Jill Kuhlberg, Andrew Smart, and William S. Isaac. 2020. Participatory Problem Formulation for Fairer Machine Learning Through Community Based System Dynamics. arXiv:2005.07572 [cs, stat] (May 2020). http://arxiv.org/abs/2005.07572

[56] Microsoft. 2020. Fairlearn. https://github.com/fairlearn/fairlearn

[57] Stefania Milan and Emiliano Treré. 2019. Big Data from the South(s): Beyond Data Universalism. Television & New Media 20, 4 (May 2019), 319–335. https://doi.org/10.1177/1527476419837739

[58] Deirdre K. Mulligan, Joshua A. Kroll, Nitin Kohli, and Richmond Y. Wong. 2019. This Thing Called Fairness: Disciplinary Confusion Realizing a Value in Technology. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (Nov. 2019), 1–36. https://doi.org/10.1145/3359221 arXiv:1909.11869

[59] Mimi Onuoha. 2020. When Proof Is Not Enough. https://fivethirtyeight.com/features/when-proof-is-not-enough/

[60] Tawana Petty, Mariella Saba, Tamika Lewis, Seeta Peña Gangadharan, and Virginia Eubanks. 2018. Our Data Bodies: Reclaiming Our Data. June 15 (2018), 37.

[61] Stephanie Carroll Rainie, Tahu Kukutai, Maggie Walter, Oscar Luis Figueroa-Rodríguez, Jennifer Walker, and Per Axelsson. 2019. Indigenous data sovereignty. The State of Open Data: Histories and Horizons (2019), 300.

[62] Bogdana Rakova, Jingying Yang, Henriette Cramer, and Rumman Chowdhury. 2020. Where Responsible AI Meets Reality: Practitioner Perspectives on Enablers for Shifting Organizational Practices. arXiv:2006.12358 [cs] (July 2020). http://arxiv.org/abs/2006.12358

[63] Nani Jansen Reventlow. [n.d.]. Data collection is not the solution for Europe's racism problem. https://www.aljazeera.com/opinions/2020/7/29/data-collection-is-not-the-solution-for-europes-racism-problem/?gb=true

[64] Alexey Romanov, Maria De-Arteaga, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, Anna Rumshisky, and Adam Tauman Kalai. 2019. What's in a Name? Reducing Bias in Bios without Access to Protected Attributes. arXiv:1904.05233 [cs, stat] (April 2019). http://arxiv.org/abs/1904.05233

[65] Pedro Saleiro, Benedict Kuester, Loren Hinkson, Jesse London, Abby Stevens, Ari Anisfeld, Kit T. Rodolfa, and Rayid Ghani. 2019. Aequitas: A Bias and Fairness Audit Toolkit. arXiv:1811.05577 [cs] (April 2019). http://arxiv.org/abs/1811.05577

[66] P. Sattigeri, S. C. Hoffman, V. Chenthamarakshan, and K. R. Varshney. 2019. Fairness GAN: Generating Datasets with Fairness Properties Using a Generative Adversarial Network. IBM Journal of Research and Development 63, 4/5 (July 2019), 3:1–3:9. https://doi.org/10.1147/JRD.2019.2945519

[67] Morgan Klaus Scheuerman, Kandrea Wade, Caitlin Lustig, and Jed R. Brubaker. 2020. How We've Taught Algorithms to See Identity: Constructing Race and Gender in Image Databases for Facial Analysis. Proceedings of the ACM on Human-Computer Interaction 4, CSCW1 (May 2020), 1–35. https://doi.org/10.1145/3392866

[68] Andrew D. Selbst, Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. Fairness and Abstraction in Sociotechnical Systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* '19. ACM Press, Atlanta, GA, USA, 59–68. https://doi.org/10.1145/3287560.3287598

[69] Suranga Seneviratne. 2019. The Ugly Truth: Tech Companies Are Tracking and Misusing Our Data, and There's Little We Can Do. http://theconversation.com/the-ugly-truth-tech-companies-are-tracking-and-misusing-our-data-and-theres-little-we-can-do-127444

[70] Sachil Singh. 2020. Collecting race-based data during pandemic may fuel dangerous prejudices. https://www.queensu.ca/gazette/stories/collecting-race-based-data-during-pandemic-may-fuel-dangerous-prejudices

[71] Kacper Sokol, Alexander Hepburn, Rafael Poyiadzi, Matthew Clifford, Raul Santos-Rodriguez, and Peter Flach. 2020. FAT Forensics: A Python Toolbox for Implementing and Deploying Fairness, Accountability and Transparency Algorithms in Predictive Systems. Journal of Open Source Software 5, 49 (2020), 1904. https://doi.org/10.21105/joss.01904

[72] Linnet Taylor. 2017. What Is Data Justice? The Case for Connecting Digital Rights and Freedoms Globally. Big Data & Society 4, 2 (Dec. 2017), 205395171736335. https://doi.org/10.1177/2053951717736335

[73] Alexander Tischbirek. 2020. Artificial intelligence and discrimination: Discriminating against discriminatory systems. In Regulating artificial intelligence. Springer, 103–121.

[74] UK Information Commissioner's Office. 2020. What do we need to do to ensure lawfulness, fairness, and transparency in AI systems? https://ico.org.uk/for-organisations/guide-to-data-protection/key-data-protection-themes/guidance-on-ai-and-data-protection/what-do-we-need-to-do-to-ensure-lawfulness-fairness-and-transparency-in-ai-systems/

[75] Sriram Vasudevan and Krishnaram Kenthapadi. 2020. LiFT: A Scalable Framework for Measuring Fairness in ML Applications. arXiv:2008.07433 [cs] (Aug. 2020). https://doi.org/10.1145/3340531.3412705

[76] Michael Veale and Reuben Binns. 2017. Fairer Machine Learning in the Real World: Mitigating Discrimination without Collecting Sensitive Data. Big Data & Society 4, 2 (Dec. 2017), 205395171743530. https://doi.org/10.1177/2053951717743530

[77] Michael Veale, Max Van Kleek, and Reuben Binns. 2018. Fairness and Accountability Design Needs for Algorithmic Support in High-Stakes Public Sector Decision-Making. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI '18. ACM Press, Montreal, QC, Canada, 1–14. https://doi.org/10.1145/3173574.3174014

[78] Salome Viljoen. 2020. Democratic Data: A Relational Theory For Data Governance. SSRN Scholarly Paper ID 3727562. Social Science Research Network, Rochester, NY. https://doi.org/10.2139/ssrn.3727562

[79] James Wexler, Mahima Pushkarna, Tolga Bolukbasi, Martin Wattenberg, Fernanda Viégas, and Jimbo Wilson. 2020. The What-If Tool: Interactive Probing of Machine Learning Models. IEEE Transactions on Visualization and Computer Graphics 26, 1 (Jan. 2020), 56–65. https://doi.org/10.1109/TVCG.2019.2934619

[80] Williams, Brooks, and Shmargad. 2018. How Algorithms Discriminate Based on Data They Lack: Challenges, Solutions, and Policy Implications. Journal of Information Policy 8 (2018), 78. https://doi.org/10.5325/jinfopoli.8.2018.0078

[81] Orice M. Williams. 2008. Fair Lending: Race and Gender Data are Limited for Non-Mortgage Lending. Subcommittee on Oversight and Investigations, Committee on Financial Services, House of Representatives (2008). GAO-08-1023T.

[82] Pak-Hang Wong. 2020. Democratizing Algorithmic Fairness. Philosophy & Technology 33, 2 (June 2020), 225–244. https://doi.org/10.1007/s13347-019-00355-w

[83] Alice Xiang. 2021. Reconciling legal and technical approaches to algorithmic bias. Tennessee Law Review 88, 3 (2021).

[84] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. 2017. Fairness Constraints: Mechanisms for Fair Classification. In Artificial Intelligence and Statistics. PMLR, 962–970. http://proceedings.mlr.press/v54/zafar17a.html

[85] Tal Z. Zarsky. 2014. Understanding discrimination in the scored society. Washington Law Review 89 (2014), 1375.

[86] Yan Zhang. 2016. Assessing Fair Lending Risks Using Race/Ethnicity Proxies. Management Science 64, 1 (Nov. 2016), 178–197. https://doi.org/10.1287/mnsc.2016.2579

[87] Indre Žliobaite and Bart Custers. 2016. Using Sensitive Personal Data May Be Necessary for Avoiding Discrimination in Data-Driven Decision Models. Artificial Intelligence and Law 24, 2 (June 2016), 183–201. https://doi.org/10.1007/s10506-016-9182-5