It Takes a Village to Solve a Problem in Data Science

It Takes a Village to Solve a Problem in Data Science. Daniel Russ, Ph.D., Kwan-Yuet Ho, Ph.D., and Melissa Friesen, Ph.D. The NIH SOCcer team. MD Data Science Meetup, June 19, 2017.

Upload: datasciencemd

Post on 21-Jan-2018


TRANSCRIPT

Page 1: It Takes a Village To Solve A Problem in Data Science

It Takes a Village to Solve a Problem in Data Science

Daniel Russ, Ph.D., Kwan-Yuet Ho, Ph.D., and Melissa Friesen, Ph.D.

The NIH SOCcer team

MD Data Science Meetup

June 19, 2017

Page 2: It Takes a Village To Solve A Problem in Data Science

National Institutes of Health

The nation's medical research agency

Extramural Research

Colleges & Universities

Intramural Research

The National Institutes of Health

NHGRI

CIT

NCI

NLM

Industry

80% 15%

27 Institutes and Centers

Me

Page 3: It Takes a Village To Solve A Problem in Data Science

SOCcer: A Case Study

U.S. Renal Cancer Study

New England Bladder Cancer Study

Montreal Lung Cancer Study

Other studies

Powered by

Page 4: It Takes a Village To Solve A Problem in Data Science

Occupation is interesting to NIH

As early as the 1770s, doctors realized that chimney sweeps were at increased risk of cancer.

How do we get occupation into cancer risk models?

Page 5: It Takes a Village To Solve A Problem in Data Science

2010 U.S. Standard Occupational Classification System

19-0000 - Life, Physical, and Social Science Occupations
    19-1000 - Life Scientists
        19-1010 - Agricultural and Food Scientists
            19-1011 - Animal Scientists
            19-1012 - Food Scientists and Technologists
            19-1013 - Soil and Plant Scientists
        19-1020 - Biological Scientists
    19-2000 - Physical Scientists
    …
21-0000 - Community and Social Service Occupations

https://www.bls.gov/soc/major_groups.htm
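The hierarchy above suggests that a detailed code can be collapsed to a coarser level by keeping its leading digits and zero-filling the rest. The following is a minimal sketch of that idea, assuming the `NN-NNNN` layout shown on the slide; the class `SocLevels` and the method `truncate` are hypothetical names, and the zero-fill rule may not hold for every group in the real SOC.

```java
/** Hypothetical helper: collapse a SOC code such as "19-1011" to a coarser level. */
final class SocLevels {
    private SocLevels() {}

    /**
     * Keeps the first {@code digits} digits of a SOC code and zero-fills the rest.
     * Assumes the "NN-NNNN" layout; this is an illustration, not an official SOC rule.
     */
    static String truncate(String socCode, int digits) {
        if (!socCode.matches("\\d{2}-\\d{4}")) {
            throw new IllegalArgumentException("Not a SOC code: " + socCode);
        }
        if (digits < 1 || digits > 6) {
            throw new IllegalArgumentException("digits must be between 1 and 6");
        }
        String raw = socCode.replace("-", "");                 // e.g. "191011"
        StringBuilder sb = new StringBuilder(raw.substring(0, digits));
        while (sb.length() < 6) {
            sb.append('0');                                    // zero-fill dropped digits
        }
        return sb.insert(2, '-').toString();                   // restore the hyphen
    }

    public static void main(String[] args) {
        // Walk 19-1011 (Animal Scientists) up the levels used later in the talk.
        for (int d : new int[] {2, 3, 5, 6}) {
            System.out.println(d + "-digit: " + truncate("19-1011", d));
        }
    }
}
```

For the code 19-1011 this walks up the same chain the slide lists: 19-0000, 19-1000, 19-1010, and 19-1011 itself.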

Page 6: It Takes a Village To Solve A Problem in Data Science

Occupation Coding

Given a job description, select the most appropriate occupational code.

“I know it when I see it”

– Justice Potter Stewart, Jacobellis v. Ohio

A plumber is a 47-2152: Plumbers, Pipefitters, and Steamfitters

Page 7: It Takes a Village To Solve A Problem in Data Science

Occupational Information

Many ways of describing an occupation

Changes depending on who is asking

Different levels of detail

Page 8: It Takes a Village To Solve A Problem in Data Science

Coders Results IMIS-100

47-2061 Construction Laborers
51-4051 Metal-Refining Furnace Operators and Tenders
51-2000 Assemblers and Fabricators
51-9000 Other Production Occupations

Cohen’s Kappa: 0.35 (6-digit), 0.67 (3-digit)

SOC Agreement Level | Example Job Title | Example SIC | Coder 1 and Coder 2 Codes (1st and 2nd Choices) | Count at Agreement Level
6-digit | mixer | 3269 | 47-2061, 47-2061, 47-2060 | 15
3-digit | small parts operator | 3692 | 51-2022, 51-2090, 51-2099 | 38
6-digit, 2nd choice | boston pot oper | 3691 | 51-4051, 47-2060, 51-4051 | 6
3-digit, 2nd choice | grinding-rice lake | 3321 | 51-4033, 51-9021, 51-9020 | 7
No agreement | Pouring floor labore | 3366 | 51-9198, 47-2051, 47-4000 | 23
Unable to code | n/a | 7692 | 99-0000, 99-0000 | 11

Russ DE, Ho KY, Johnson CA, et al. Computer-based coding of occupation codes for epidemiological analyses. Proc IEEE Int Symp Comput Based Med Syst 2014;2014:347–50.
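Cohen's kappa corrects the raw agreement between two coders for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed fraction of matching codes and p_e is the chance agreement computed from each coder's code frequencies. The sketch below shows that calculation for two parallel arrays of codes. The class and method names are hypothetical, and the example data in `main` is invented for illustration rather than taken from the study.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: Cohen's kappa for two coders who labeled the same items. */
final class CohenKappa {
    private CohenKappa() {}

    static double kappa(String[] coder1, String[] coder2) {
        if (coder1.length != coder2.length || coder1.length == 0) {
            throw new IllegalArgumentException("Need two non-empty arrays of equal length");
        }
        int n = coder1.length;
        int agree = 0;
        Map<String, Integer> counts1 = new HashMap<>();
        Map<String, Integer> counts2 = new HashMap<>();
        for (int i = 0; i < n; i++) {
            if (coder1[i].equals(coder2[i])) {
                agree++;
            }
            counts1.merge(coder1[i], 1, Integer::sum);
            counts2.merge(coder2[i], 1, Integer::sum);
        }
        double observed = (double) agree / n;                  // p_o
        double expected = 0.0;                                 // p_e
        for (Map.Entry<String, Integer> e : counts1.entrySet()) {
            double p1 = e.getValue() / (double) n;
            double p2 = counts2.getOrDefault(e.getKey(), 0) / (double) n;
            expected += p1 * p2;
        }
        // Returns NaN when both coders used one identical code for every item (p_e == 1).
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Invented example: four jobs, two coders.
        String[] c1 = {"47-2061", "47-2061", "51-4051", "51-4051"};
        String[] c2 = {"47-2061", "47-2061", "51-4051", "47-2061"};
        System.out.println("kappa = " + kappa(c1, c2));
    }
}
```

To measure agreement at a coarser level, such as the 3-digit figure quoted on the slide, each code would first be reduced to its leading digits before calling `kappa`.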

Page 9: It Takes a Village To Solve A Problem in Data Science

SOCcer is an Ensemble
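One common way to build such an ensemble is to let several primary classifiers each score a candidate code and then pass those scores through a logistic model. The sketch below assumes that design. The class `LogisticEnsemble`, its intercept, and its weights are hypothetical and are not taken from the SOCcer source.

```java
/** Hypothetical sketch: combine primary-classifier scores with a logistic model. */
final class LogisticEnsemble {
    private final double intercept;
    private final double[] weights;

    LogisticEnsemble(double intercept, double[] weights) {
        this.intercept = intercept;
        this.weights = weights.clone();   // defensive copy
    }

    /**
     * Combines the primary scores for one candidate code into a value in (0, 1).
     * primaryScores[i] is the score from the i-th primary classifier.
     */
    double combine(double[] primaryScores) {
        if (primaryScores.length != weights.length) {
            throw new IllegalArgumentException("Expected " + weights.length + " scores");
        }
        double z = intercept;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * primaryScores[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));   // logistic (sigmoid) function
    }
}
```

Calling `combine` once per candidate code and sorting by the result would give a ranked list of codes for a job description.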

Page 10: It Takes a Village To Solve A Problem in Data Science

SOCcer extends OpenNLP

Command line:

    opennlp soccertrain -model model.bin -train train.xml
    opennlp soccer model.bin < data.csv

SOCcerME
    SoccerModel train(InputStream xmlProperties)
    double[] eval(JobDescription job)

<<Interface>> EventBuilder
    Event build(JobDescription job)

JobTitleEventBuilder, JobTaskEventBuilder
    EventBuilder implementations

SoccerModel
    List<PrimaryModels> list
    CodingSystem cs
    CodingSystem.Level level
    double[] logisticWeights

SoftJaccardScoreTrainer
    init(TrainingParameters)
    doTrain(DataIndexer)
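The slide names a soft Jaccard score trainer but does not define the score. As a point of reference, the sketch below computes the plain Jaccard similarity between the token sets of two strings, which is the size of the intersection divided by the size of the union. This is a deliberate simplification and not the soft variant used by SOCcer; the class name and tokenization rule are assumptions.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: plain (not soft) Jaccard similarity on word tokens. */
final class JaccardSketch {
    private JaccardSketch() {}

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Returns |A ∩ B| / |A ∪ B| for the two token sets, or 0 if both are empty. */
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }
}
```

With exact token matching, "construction laborer" and "Construction Laborers" share only one of three distinct tokens, so near-identical titles score low. A soft variant would presumably give partial credit for close tokens such as "laborer" and "laborers".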

Page 11: It Takes a Village To Solve A Problem in Data Science

Training XML file

<soccermodel description="MD DataSci SOCcer Model">
  <codingsystem system="soc2010" level="detailed"/>
  <primarymodels>
    <model description="Soft Jaccard Score">
      <eventtrainer class="gov.nih.cit.ml.jaccard.JaccardScoreTrainer"/>
      <eventbuilder class="gov.nih.cit.ml.eventbuilder.JobTitleEventBuilder"/>
      … {training data, parameterMaps}
    </model>
    <model>…</model>
  </primarymodels>
  <logisticmodel>
    <trainingdata>..</trainingdata>
  </logisticmodel>
</soccermodel>
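A configuration in this shape can be inspected with the standard Java DOM parser. The sketch below lists the `class` attribute of every `eventbuilder` element. It only illustrates reading the file layout shown on the slide; the class `TrainingConfigReader` is hypothetical and this is not how SOCcer itself loads its models.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: list the event builder classes named in a training XML string. */
final class TrainingConfigReader {
    private TrainingConfigReader() {}

    static List<String> eventBuilderClasses(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("eventbuilder");
            List<String> classes = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                classes.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return classes;
        } catch (Exception e) {
            // Parsing failures are rethrown unchecked to keep the sketch short.
            throw new IllegalArgumentException("Could not parse training XML", e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<soccermodel description=\"demo\">"
              + "<primarymodels><model description=\"Soft Jaccard Score\">"
              + "<eventbuilder class=\"gov.nih.cit.ml.eventbuilder.JobTitleEventBuilder\"/>"
              + "</model></primarymodels></soccermodel>";
        System.out.println(eventBuilderClasses(xml));
    }
}
```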

Page 12: It Takes a Village To Solve A Problem in Data Science

Side effects

With minor refactoring, OpenNLP can have novel functionality:

• built-in ensemble classification.

• ensemble classification by voting.

• bring your own classifier!
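Ensemble classification by voting can be as simple as letting each classifier propose one code and returning the code proposed most often. The sketch below shows a plain majority vote. The class `MajorityVote` is hypothetical and does not correspond to an existing OpenNLP API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: majority vote over the codes proposed by several classifiers. */
final class MajorityVote {
    private MajorityVote() {}

    /**
     * Returns the most frequent prediction.
     * On a tie, the label that first reaches the winning count is returned.
     */
    static String vote(List<String> predictions) {
        if (predictions.isEmpty()) {
            throw new IllegalArgumentException("No predictions to vote on");
        }
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String p : predictions) {
            int c = counts.merge(p, 1, Integer::sum);   // updated count for this label
            if (c > bestCount) {
                best = p;
                bestCount = c;
            }
        }
        return best;
    }
}
```

Voting ignores how confident each classifier is, which is the main difference from an ensemble that combines numeric scores.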

Page 13: It Takes a Village To Solve A Problem in Data Science

SOCcer Agreement

Russ DE, et al. Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies. Occup Environ Med 2016;73:417–424.

We would like

We get

Agreement (U.S. Renal)

Level | 1.0 | 2.0
2 | 76.3 | 78.4
3 | 63.8 | 68.0
5 | 51.5 | 59.1
6 | 44.5 | 51.8

Page 14: It Takes a Village To Solve A Problem in Data Science

Real Effects

Automated Coding

Study Subject Code

Assisted by SOCcer

Translate from other Coding Systems to SOC 2010

Coding in other systems

Coding in other languages

Finding Occupations in Free Text (NLM)

Page 15: It Takes a Village To Solve A Problem in Data Science

Thank you

Stephen Ho, Melissa Friesen

And all the committers, PMC members, and users of