Science Data, Responsibly

Post on 14-Apr-2017

Data Ethics in Data Science Education

(plus: Science Data, Responsibly)

Bill Howe
University of Washington

Plan
* Context: eScience Institute (1 min)
* Context: Data Science MOOC (3 min)
* Vignette on Teaching Data Ethics (5 min)
* Science Data, Responsibly (6 min)
  * Automated Curation
  * Viziometrics

7/20/16, Data, Responsibly @ Dagstuhl

People
* Research Staff (~4 100% Data Scientists, ~4 50% Research Scientists)
* Postdocs (~12 at steady state)
* Faculty (~9 Exec Committee, ~20 Steering Committee, ~100 Affiliates)
* Administrative Staff (Program Managers, Finance, Admin)

Programs
* Short- and long-term research, education programs (undergrad/masters/PhD), software, research consulting
* Leadership on all things data science around campus

Funding
* $700k / yr permanent appropriation from the state of WA
* $32.8M for 5 years jointly with NYU and UC Berkeley from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation to build a Data Science Environment
* $9M for 5 years from the Washington Research Foundation
* $500k / yr from the Provost for half-lines for recruiting in relevant fields

7/20/16, Bill Howe, UW

We use this device to talk about this idea: the pi-shaped researcher.

Data Science Education

Audiences
* Students: CS/Informatics (undergrads, grads), Non-Major (undergrads, grads)
* Non-Students: professionals, researchers

Programs
* (2011) Data Science Certificate
* (2013) Data Science MOOC
* (2013) NSF IGERT Big Data PhD
* (2013) New CS Courses
* (2015) Data Sci. for Social Good
* (2016) Data Science Masters

Data Ethics being incorporated in all programs

Introduction to Data Science MOOC on Coursera
* Session 1, Spring 2013: 119,504 students
* Session 2, Summer 2014: 121,215 students

Participation numbers
* Registered: 119,517 (totally irrelevant)
* Clicked play in first 2 weeks: 78,589
* Turned in 1st homework: 10,663
* Completed all assignments: ~9000 (typical for a MOOC)
* Passed: 7022
* Forum threads: 4661
* Forum posts: 22,900

Fairly consistent with Coursera data across hard courses
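To make the funnel behind these participation numbers concrete, here is a minimal Python sketch that turns the counts on the slide into rates. The counts are copied from the slide; the helper name `funnel_rates` and the choice of denominators are illustrative assumptions, not something the talk prescribes.

```python
# Participation counts copied from the slide; stage order follows the slide.
funnel = [
    ("Registered", 119_517),
    ("Clicked play in first 2 weeks", 78_589),
    ("Turned in 1st homework", 10_663),
    ("Completed all assignments", 9_000),  # the slide gives "~9000"
    ("Passed", 7_022),
]


def funnel_rates(stages):
    """Return (stage, share of first stage, share of previous stage) tuples.

    Hypothetical helper for illustration only.
    """
    base = stages[0][1]
    previous = base
    rows = []
    for name, count in stages:
        rows.append((name, count / base, count / previous))
        previous = count
    return rows


rates = funnel_rates(funnel)
for name, of_registered, of_previous in rates:
    print(f"{name:32s} {of_registered:6.1%} of registered, "
          f"{of_previous:6.1%} of previous stage")
```

Measured against registrations, the pass rate looks tiny. Measured against the students who turned in the first homework (7022 of 10,663, roughly two thirds), it looks quite different, which is one way to read the "totally irrelevant" note next to the registration count.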

Define success however you want
* Many love it in parts, start late, don't turn in homework, etc.
* Learning rather than watching television

Syllabus
* Data Science Landscape (~1 week)
* Data Manipulation at Scale
  * Relational Databases (~1 week)
  * MapReduce (~1 week)
  * NoSQL (~1 week)
* Analytics
  * Statistics Topics (~1 week)
  * Machine Learning Topics (~2 weeks)
* Visualization (~1 week)
* Graph Analytics (~1 week)

2015: MOOC Recast as a 4-course Specialization
* Data Manipulation at Scale
  * Databases, Systems, Algorithms
* Practical Predictive Analytics
  * Stats (resampling methods, multiple hypothesis testing, more)
  * ML (rules/trees/forests, ensembles/boosting/bagging, SVMs, GD, eval)
* Communicating Data Science
  * Visualization, ethics and privacy
* Capstone

Vignette on Teaching Data Ethics

7/21/16, Bill Howe, UW

Alcohol Study, Barrow Alaska, 1979

Native leaders and city officials, worried about drinking and associated violence in their community, invited a group of sociology researchers to assess the problem and work with them to devise solutions.

Methods

* 10% representative sample (N=88) of everyone over the age of 15, using a 1972 demographic survey
* Interviewed on attitudes and values about use of alcohol
* Obtained psychological histories, including drinking behavior
* Given the Michigan Alcoholism Screening Test (Seltzer, 1971)
* Asked to draw a picture of a person (used to determine cultural identity)

Results announced unilaterally and publicly

At the conclusion of the study, the researchers formulated a report entitled "The Inupiat, Economics and Alcohol on the Alaskan North Slope", which was released simultaneously in a press release and to the Barrow community. The press release was picked up by the New York Times, which ran a front-page story entitled "Alcohol Plagues Eskimos".

Backlash

The results of the Barrow Alcohol Study in Alaska were revealed in the context of a press conference that was held far from the Native village, and without the presence, much less the knowledge or consent, of any community member who might have been able to present any context concerning the socioeconomic conditions of the village. Study results suggested that nearly all adults in the community were alcoholics. In addition to the shame felt by community members, the town's Standard & Poor's bond rating suffered as a result, which in turn decreased the tribe's ability to secure funding for much-needed projects.

Methodological Problems

The authors once again met with the Barrow Technical Advisory Group, who stated their concern that only Natives were studied, and that outsiders in town had not been included.

The estimates of the frequency of intoxication based on association with the probability of being detained were termed "ludicrous, both logically and statistically."

Edward F. Foulks, M.D., "Misalliances In The Barrow Alcohol Study"

Ethical Problems
* Participants were not in control of their data nor the context in which they were presented.
* Easy to demonstrate specific, significant harms:
  * Social: Stigmatization
  * Financial: Bond rating lowered

Important: Nothing to do with individual privacy
* No PII revealed at any point, to anyone
* No violations of best practices in data handling
* But even those who did not participate in the study incurred harm

Two Topics
* Social Component: Codes of Conduct
* Technical Component: Managing Sensitive Data

Ethical principles vs. ethical rules
* In the Barrow example, ethical rules were generally followed
* But ethical principles were violated: the researchers appear to have placed their own interests ahead of those of the research subjects, the client, and society

Principles: Codes of Conduct
* American Statistical Association: http://www.amstat.org/committees/ethics/
* Certified Analytics Professional: https://www.certifiedanalytics.org/ethics.php
* Data Science Association: http://www.datascienceassn.org/code-of-conduct.html

Responsibility to which parties?
* Society
* Employers and Clients
* Colleagues
* Research Subjects

ASA:
* Professionalism
* Responsibilities to Funders, Clients, Employers
* Responsibilities in Publications and Testimony
* Responsibilities to Research Subjects
* Responsibilities to Research Team Colleagues
* Responsibilities to Other Statisticians or Statistical Practitioners
* Responsibilities Regarding Allegations of Misconduct
* Responsibilities of Employers

Code of Conduct: Rules
* Competence
* Do what your client asks, unless it violates the law
* Communication with clients
* Confidential information
* Conflicts of interest
* Rule 7: More on conflicts of interest and confidentiality
* Rule 8: Scientific integrity
  * Interesting: If a data scientist reasonably believes a client is misusing data science to communicate a false reality or promote an illusion of understanding, the data scientist shall take reasonable remedial measures, including disclosure to the client, and including, if necessary, disclosure to the proper authorities. The data scientist shall take reasonable measures to persuade the client to use data science appropriately.
* Rule 9: Misconduct (follow the rules)

Science Data, Responsibly

Science is a complete mess

Reproducibility
* Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible
* Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015)
* Ioannidis 2005: "Why Most Published Research Findings Are False"
* Reinhart & Rogoff: global economic policy based on spreadsheet fuck-ups

Science, 2015

Retractions are increasing.

Science is a complete mess

Fraud
* Diederik Stapel: 38 articles with fictitious data
* Bharat Aggarwal: a huge number of images with evidence of manipulation

Bharat Aggarwal: alleged data manipulation

Science is a complete mess

Public Trust
* Churn: Chocolate, egg yolks, red meat, red wine, etc.
* Climate change, vaccines

Begley & Ellis, Nature 483, 531-533 (29 March 2012)

One goofy idea: Open up a market for reproducibility

Elizabeth Iorns, Science Exchange

Vision: Validate scientific claims automatically
* Check for manipulation (manipulated images, Benford's Law)
* Extract claims from papers
* Check claims against th