Data Ethics in Data Science Education (plus: Science Data, Responsibly)
Bill Howe, University of Washington

Post on 14-Apr-2017

Page 1: Science Data, Responsibly

Data Ethics in Data Science Education

(plus: Science Data, Responsibly)

Bill Howe
University of Washington

Page 2: Science Data, Responsibly

05/03/2023 2

Plan

• Context: eScience Institute (1 min)
• Context: Data Science MOOC (3 min)
• Vignette on Teaching Data Ethics (5 min)
• Science Data, Responsibly (6 min)
  – Automated Curation
  – Viziometrics

Data, Responsibly @ Dagstuhl

Page 3: Science Data, Responsibly

• People
  – Research Staff (~4 100% Data Scientists, ~4 50% Research Scientists)
  – Postdocs (~12 at steady state)
  – Faculty (~9 Exec Committee, ~20 Steering Committee, ~100 Affiliates)
  – Administrative Staff (Program Managers, Finance, Admin)
• Programs
  – Short and long-term research, education programs ugrad/masters/PhD, software, research consulting
  – Leadership on all things data science around campus
• Funding
  – $700k / yr permanent appropriation from the state of WA
  – $32.8M for 5 years jointly with NYU and UC Berkeley from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation to build a “Data Science Environment”
  – $9M for 5 years from the Washington Research Foundation
  – $500k / yr from the Provost for half-lines for recruiting in relevant fields

Page 4: Science Data, Responsibly

Page 5: Science Data, Responsibly

Data Science Education

Audiences:
• Students: CS/Informatics (undergrads, grads), Non-Major (undergrads, grads)
• Non-Students: professionals, researchers

Programs:
• (2011) Data Science Certificate
• (2013) Data Science MOOC
• (2013) NSF IGERT Big Data PhD
• (2013) New CS Courses
• (2016) Data Science Masters
• (2015) Data Sci. for Social Good

Data Ethics being incorporated in all programs

Page 6: Science Data, Responsibly

Introduction to Data Science MOOC on Coursera

• Session 1, Spring 2013: 119,504 students
• Session 2, Summer 2014: 121,215 students

Page 7: Science Data, Responsibly

Participation numbers
• “Registered”: 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9000 (typical for a MOOC)
• “Passed”: 7022
• Forum threads: 4661
• Forum posts: 22,900

Fairly consistent with Coursera data across “hard” courses

Define success however you want
– Many love it in parts, start late, don’t turn in homework, etc.
– Learning rather than watching television

Page 8: Science Data, Responsibly

Syllabus
• Data Science Landscape (~1 week)
• Data Manipulation at Scale
  – Relational Databases (~1 week)
  – MapReduce (~1 week)
  – NoSQL (~1 week)
• Analytics
  – Statistics Topics (~1 week)
  – Machine Learning Topics (~2 weeks)
• Visualization (~1 week)
• Graph Analytics (~1 week)

Page 9: Science Data, Responsibly

2015: MOOC Recast as a 4-course “Specialization”

• Data Manipulation at Scale
  – Databases, Systems, Algorithms
• Practical Predictive Analytics
  – Stats (resampling methods, multiple hypothesis testing, more)
  – ML (rules/trees/forests, ensembles/boosting/bagging, SVMs, GD, eval…)
• Communicating Data Science
  – Visualization, ethics and privacy
• Capstone

Page 10: Science Data, Responsibly

VIGNETTE ON TEACHING DATA ETHICS

Page 11: Science Data, Responsibly

Alcohol Study, Barrow Alaska, 1979

Native leaders and city officials, worried about drinking and associated violence in their community, invited a group of sociology researchers to assess the problem and work with them to devise solutions.

Page 12: Science Data, Responsibly

Methods

• 10% representative sample (N=88) of everyone over the age of 15 using a 1972 demographic survey

• Interviewed on attitudes and values about use of alcohol

• Obtained psychological histories including drinking behavior

• Given the Michigan Alcoholism Screening Test (Seltzer, 1971)

• Asked to draw a picture of a person
  – Used to determine cultural identity

Page 13: Science Data, Responsibly

Results announced unilaterally and publicly

At the conclusion of the study, researchers formulated a report entitled “The Inupiat, Economics and Alcohol on the Alaskan North Slope”, which was released simultaneously through a press release and to the Barrow community. The press release was picked up by the New York Times, which ran a front-page story entitled “Alcohol Plagues Eskimos”.

Page 14: Science Data, Responsibly

Backlash

The results of the Barrow Alcohol Study in Alaska were revealed in the context of a press conference that was held far from the Native village, and without the presence, much less the knowledge or consent, of any community member who might have been able to present any context concerning the socioeconomic conditions of the village. Study results suggested that nearly all adults in the community were alcoholics. In addition to the shame felt by community members, the town’s Standard and Poor bond rating suffered as a result, which in turn decreased the tribe’s ability to secure funding for much needed projects.

Page 15: Science Data, Responsibly

Methodological Problems

“The authors once again met with the Barrow Technical Advisory Group, who stated their concern that only Natives were studied, and that outsiders in town had not been included.”

“The estimates of the frequency of intoxication based on association with the probability of being detained were termed "ludicrous, both logically and statistically.””

Edward F. Foulks, M.D., Misalliances In The Barrow Alcohol Study

Page 16: Science Data, Responsibly

Ethical Problems

• Participants were in control of neither their data nor the context in which the data were presented.

• Easy to demonstrate specific, significant harms:
  – Social: Stigmatization
  – Financial: Bond rating lowered

• Important: Nothing to do with individual privacy
  – No PII revealed at any point, to anyone
  – No violations of best practices in data handling
  – But even those who did not participate in the study incurred harm

Page 17: Science Data, Responsibly

Two Topics

• Social Component: Codes of Conduct
• Technical Component: Managing Sensitive Data

Page 18: Science Data, Responsibly

Ethical principles vs. ethical rules

• In the Barrow example, ethical rules were generally followed

• But ethical principles were violated: The researchers appear to have placed their own interests ahead of those of the research subjects, the client, and society

Page 19: Science Data, Responsibly

Principles: Codes of Conduct

• American Statistical Association
  – http://www.amstat.org/committees/ethics/
• Certified Analytics Professional
  – https://www.certifiedanalytics.org/ethics.php
• Data Science Association
  – http://www.datascienceassn.org/code-of-conduct.html

Page 20: Science Data, Responsibly

SCIENCE DATA, RESPONSIBLY

Page 21: Science Data, Responsibly

Science is a complete mess
• Reproducibility
  – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible
  – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015)
  – Ioannidis 2005: Why most published research findings are false
  – Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups

Page 22: Science Data, Responsibly

Science, 2015

Page 23: Science Data, Responsibly

Retractions are increasing…

Page 24: Science Data, Responsibly

Science is a complete mess
• Fraud
  – Diederik Stapel: 38 articles with fictitious data
  – Bharat Aggarwal: a huge number of images with evidence of manipulation

Page 25: Science Data, Responsibly

Bharat Aggarwal: alleged data manipulation

Page 26: Science Data, Responsibly

Science is a complete mess
• Public Trust
  – Churn: Chocolate, egg yolks, red meat, red wine, etc.
  – Climate change, vaccines

Page 27: Science Data, Responsibly
Page 28: Science Data, Responsibly
Page 29: Science Data, Responsibly

Vision: Validate scientific claims automatically
– Check for manipulation (manipulated images, Benford’s Law)
– Extract claims from papers
– Check claims against the authors’ data
– Check claims against related data sets
– Automatic meta-analysis across the literature + public datasets

• First steps
  – Automatic curation: Validate and attach metadata to public datasets
  – Longitudinal analysis of the visual literature
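
One of the manipulation checks named on this slide, Benford's Law, can be sketched in a few lines. Benford's Law says the leading digit d of many naturally occurring numbers appears with probability log10(1 + 1/d). The sketch below is only an illustration, not the pipeline the talk envisions: the function names and the choice of a chi-square statistic as the comparison are assumptions made for this example.

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a nonzero number, read from scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def benford_expected():
    """Benford's Law: P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's Law.

    Hypothetical helper for illustration: larger values mean a worse fit.
    """
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in benford_expected().items())

# Powers of two are a textbook Benford-like sequence; the integers 100..999
# have uniformly distributed leading digits and should fit badly.
benford_like = benford_chi_square([2 ** k for k in range(1, 501)])
uniform_like = benford_chi_square(range(100, 1000))
```

With 8 degrees of freedom, the 5% critical value of the chi-square distribution is about 15.5, so a statistic well above that would be a reason to look more closely at a dataset. It is a screening signal only: plenty of legitimate data does not follow Benford's Law.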

Page 30: Science Data, Responsibly

“DEEP” CURATION

Page 31: Science Data, Responsibly

Microarray experiments

Page 32: Science Data, Responsibly

Microarray samples submitted to the Gene Expression Omnibus

Curation is fast becoming the bottleneck to data sharing

Maxim Grechkin, Hoifung Poon

Page 33: Science Data, Responsibly

Can we use the expression data directly to curate algorithmically?

color = labels supplied as metadata
clusters = 1st two PCA dimensions on the gene expression data itself

The expression data and the text labels appear to disagree

Maxim Grechkin, Hoifung Poon
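
The comparison on this slide (project the expression data onto its first two principal components, then see whether the supplied labels line up with the clusters) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' method: the data are synthetic, the labels `tissue_a` and `tissue_b` are made up, PCA is approximated with plain power iteration, and `flag_disagreements` is a hypothetical helper that flags a sample when the nearest label centroid is not its own label.

```python
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=200, seed=0):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    rng = random.Random(seed)
    d = len(cov)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    return v, sum(a * b for a, b in zip(v, cv))

def pca_2d(rows):
    """Project rows onto (approximately) their first two principal components."""
    x = center(rows)
    cov = covariance(x)
    d = len(cov)
    comps = []
    for k in range(2):
        v, lam = top_component(cov, seed=k)
        comps.append(v)
        # Deflate: remove the component just found before looking for the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in comps] for r in x]

def flag_disagreements(points, labels):
    """Indices of samples whose nearest label centroid is not their own label."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    centroids = {lab: [sum(c) / len(ps) for c in zip(*ps)]
                 for lab, ps in groups.items()}

    def nearest(p):
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(p, centroids[lab])))

    return [i for i, (p, lab) in enumerate(zip(points, labels)) if nearest(p) != lab]

# Synthetic stand-in for expression profiles: two separated groups in 5 dimensions.
rng = random.Random(42)

def sample(shift):
    return [shift * (j < 2) + rng.gauss(0, 0.5) for j in range(5)]

profiles = [sample(0.0) for _ in range(10)] + [sample(10.0) for _ in range(11)]
# The last sample belongs to the second group but is deliberately mislabeled.
labels = ["tissue_a"] * 10 + ["tissue_b"] * 10 + ["tissue_a"]
suspects = flag_disagreements(pca_2d(profiles), labels)
```

On this well-separated toy input, the only sample flagged should be the deliberately mislabeled one. Real expression data is far noisier, which is part of why the slides that follow turn to learned classifiers rather than a centroid rule.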

Page 34: Science Data, Responsibly

Better Tissue Type Labels

[Diagram labels: Domain knowledge (Ontology); Expression data; Free-text Metadata; 2 Deep Networks (text, expr); SVM]

Maxim Grechkin, Hoifung Poon

Page 35: Science Data, Responsibly

Deep Curation

Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results.

[Figure panels: Free-text classifier; Expression classifier]

Maxim Grechkin, Hoifung Poon

Page 36: Science Data, Responsibly

Deep Curation: Our stuff wins, with no training data

[Chart: x-axis: amount of training data used; series: state of the art, our reimplementation of the state of the art, our dueling pianos NN]

Maxim Grechkin, Hoifung Poon

Page 37: Science Data, Responsibly

VIZIOMETRICS: COMPREHENDING VISUAL INFORMATION IN THE SCIENTIFIC LITERATURE

Human-Data Interaction

Page 38: Science Data, Responsibly

Step 1: Dismantling Composite Figures

Poshen Lee

ICPRAM 2015

Page 39: Science Data, Responsibly

Do high-impact papers have fewer equations, as indicated by Fawcett and Higginson? (Yes)

Poshen Lee, Jevin West

[Figure panels: high impact papers; low impact papers]

Page 40: Science Data, Responsibly

Do high-impact papers have more diagrams? (Yes)

Poshen Lee, Jevin West

Page 41: Science Data, Responsibly
Page 42: Science Data, Responsibly
Page 43: Science Data, Responsibly

TEACHING DATA ETHICS IN DATA SCIENCE

Page 46: Science Data, Responsibly

Lectures
• Data Science Context and Case Studies (~1 week)
• Data Management at Scale
  – Relational Databases (~1 week)
  – MapReduce (~1 week)
  – NoSQL (~1 week)
• Topics in Analytics
  – Permutation Methods, Bayesian Methods (~1 week)
  – Machine Learning Algorithms and Evaluation (~1 week)
• Visualization (~1 week)
• Graph Analytics (~1 week)
• Guest Lectures

Page 47: Science Data, Responsibly

Who took the course?

Page 48: Science Data, Responsibly

Who took the course?

Page 49: Science Data, Responsibly

Who took the course?

What programming language do you typically use?

??

Page 50: Science Data, Responsibly

Page 51: Science Data, Responsibly

Page 52: Science Data, Responsibly

Attrition, video lectures

[Chart: number of students watching videos by segment, ordered by time; y-axis from 0 to 50,000]

Page 53: Science Data, Responsibly

Attrition, assignments

[Chart: number of students completing assignments by part; x-axis: Twitter 1 to 6, Database 1 to 9, MapReduce 1 to 6, Kaggle, Tableau; y-axis from 0 to 18,000]

Page 54: Science Data, Responsibly
Page 55: Science Data, Responsibly

Who took the course?

In a directory with 1000 text files, you are asked to create a list of files that contain the word Drosophila
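
For reference, one way to answer this survey question is a short script. The sketch below is only one possible approach, and it makes assumptions the slide does not state: the files end in `.txt`, they are UTF-8 text, and a case-sensitive substring match is what is wanted. The function name `files_containing` is made up for this example.

```python
from pathlib import Path

def files_containing(directory, word="Drosophila"):
    """Return the sorted names of .txt files in `directory` whose text contains `word`."""
    matches = []
    for path in sorted(Path(directory).glob("*.txt")):
        # errors="ignore" skips undecodable bytes instead of aborting the scan.
        if word in path.read_text(encoding="utf-8", errors="ignore"):
            matches.append(path.name)
    return matches
```

Reading each file whole is fine at this scale (1000 files on one machine), which is the contrast the next question draws.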

Page 56: Science Data, Responsibly

Who took the course?

What if you were given a billion documents spread across many computers and asked to count the occurrences of a given phrase?
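
The shape of a MapReduce-style answer can be sketched on a single machine. In the sketch below, each inner list stands in for the documents held by one computer; that layout and the function names are assumptions for illustration, and a real deployment would run the per-partition step on the machines that hold the data and ship only the partial counts.

```python
def map_document(document, phrase):
    """Map step: count non-overlapping occurrences of `phrase` in one document."""
    return document.count(phrase)

def combine_partition(documents, phrase):
    """Local combine: total for the documents held by one (simulated) machine."""
    return sum(map_document(doc, phrase) for doc in documents)

def count_phrase(partitions, phrase):
    """Reduce step: add up the partial counts from every partition."""
    return sum(combine_partition(docs, phrase) for docs in partitions)

# Three simulated machines, each holding a few documents.
partitions = [
    ["big data is big", "data science"],
    ["no match here"],
    ["big data big data"],
]
total = count_phrase(partitions, "big data")
```

Because addition is associative and commutative, the partial counts can be combined in any order, which is what makes the problem easy to spread across many computers.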

Page 57: Science Data, Responsibly

“I left the company I co-founded in 2005 to do data analytics with Wibidata, with whom I was introduced as a result of their guest lecture in your course.