It takes a village to solve a problem in data science
Daniel Russ, Ph.D., Kwan-Yuet Ho, Ph.D.,
and Melissa Friesen, Ph.D.
The NIH SOCcer team
MD Data Science Meetup
June 19, 2017
National Institutes of Health
The nation's medical research agency
Extramural Research
Colleges & Universities
Intramural Research
The National Institutes of Health
NHGRI
CIT
NCI
NLM
Industry
80% 15%
27 Institutes and Centers
Me
SOCcer: A Case Study
U.S. Renal Cancer Study New England Bladder Cancer Study
Montreal Lung Cancer Study other studies
Powered by
Occupation is interesting to NIH
As early as the 1770s, doctors realized that chimney sweeps were at increased risk of cancer.
How do we get occupation into cancer risk models?
2010 U.S. Standard Occupational Classification System
19-0000 - Life, Physical, and Social Science Occupations
19-1000 - Life Scientists
19-1010 - Agricultural and Food Scientists
19-1011 - Animal Scientists
19-1012 - Food Scientists and Technologists
19-1013 - Soil and Plant Scientists
19-1020 - Biological Scientists
…
19-2000 - Physical Scientists
…
21-0000 - Community and Social Service Occupations
…
https://www.bls.gov/soc/major_groups.htm
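The listing above nests codes by how many leading digits they share. A minimal Java sketch of rolling a detailed code up to a coarser level follows. The class and method names are hypothetical, and zero-filling the dropped digits matches the codes shown above but is an assumption for the rest of the classification.

```java
import java.util.regex.Pattern;

/** Sketch: roll a 2010 SOC code up to a coarser level of the hierarchy. */
final class SocLevels {
    private static final Pattern SOC = Pattern.compile("\\d{2}-\\d{4}");

    /** Keeps the first `digits` digits of a code and zero-fills the rest (assumed rule). */
    static String atLevel(String code, int digits) {
        if (!SOC.matcher(code).matches()) {
            throw new IllegalArgumentException("Not a SOC code: " + code);
        }
        if (digits < 1 || digits > 6) {
            throw new IllegalArgumentException("digits must be between 1 and 6");
        }
        String raw = code.replace("-", "");            // "19-1011" -> "191011"
        StringBuilder sb = new StringBuilder(raw.substring(0, digits));
        while (sb.length() < 6) {
            sb.append('0');                            // pad the dropped digits
        }
        return sb.insert(2, '-').toString();           // restore the hyphen
    }

    public static void main(String[] args) {
        String code = "19-1011"; // Animal Scientists
        for (int digits : new int[] {2, 3, 5, 6}) {
            System.out.println(digits + " digits: " + atLevel(code, digits));
        }
    }
}
```

For 19-1011 this walks back up the listing: 19-0000, 19-1000, 19-1010, 19-1011.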
Occupation Coding
Given a job description, select the most appropriate occupational code.
“I know it when I see it”
– Justice Potter Stewart
Jacobellis v. Ohio
A plumber is a 47-2152
47-2152 Plumbers, Pipefitters, and Steamfitters
Occupational Information
Many ways of describing an occupation
Changes depending on who is asking
Different level of detail
Coders Results IMIS-100
47-2061 Construction Laborers
51-4051 Metal-Refining Furnace Operators and Tenders
51-2000 Assemblers and Fabricators
51-9000 Other Production Occupations
Cohen’s Kappa: 0.35 (6-digit), 0.67 (3-digit)
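Cohen's kappa corrects raw agreement for the agreement two coders would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement implied by each coder's label frequencies. A minimal Java sketch follows. The class name is hypothetical and the labels in `main` are invented toy data, not the IMIS-100 codes.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: Cohen's kappa for two coders labelling the same items. */
final class CohenKappa {
    static double kappa(String[] coder1, String[] coder2) {
        if (coder1.length != coder2.length || coder1.length == 0) {
            throw new IllegalArgumentException("Need two equal-length, non-empty label arrays");
        }
        int n = coder1.length;
        int agree = 0;
        Map<String, Integer> counts1 = new HashMap<>();
        Map<String, Integer> counts2 = new HashMap<>();
        for (int i = 0; i < n; i++) {
            if (coder1[i].equals(coder2[i])) {
                agree++;
            }
            counts1.merge(coder1[i], 1, Integer::sum);
            counts2.merge(coder2[i], 1, Integer::sum);
        }
        double observed = (double) agree / n;
        // Chance agreement: sum over labels of P(coder 1 picks it) * P(coder 2 picks it).
        double expected = 0.0;
        for (Map.Entry<String, Integer> e : counts1.entrySet()) {
            int other = counts2.getOrDefault(e.getKey(), 0);
            expected += (e.getValue() / (double) n) * (other / (double) n);
        }
        // Undefined (NaN) when both coders always use one identical label.
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        String[] coder1 = {"A", "A", "B", "B"}; // toy labels
        String[] coder2 = {"A", "B", "B", "B"};
        System.out.println(kappa(coder1, coder2));
    }
}
```

A 3-digit kappa would use the same function after truncating every code to its first three digits.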
SOC Agreement
Agreement level      Example job title     Example SIC   Coder 1 and Coder 2 codes (1st, 2nd choice)   Count
6-digit              mixer                 3269          47-2061  47-2061  47-2060                     15
3-digit              small parts operator  3692          51-2022  51-2090  51-2099                     38
6-digit 2nd choice   boston pot oper       3691          51-4051  47-2060  51-4051                     6
3-digit 2nd choice   grinding-rice lake    3321          51-4033  51-9021  51-9020                     7
No agreement         Pouring floor labore  3366          51-9198  47-2051  47-4000                     23
Unable to code       n/a                   7692          99-0000  99-0000                              11
Russ DE, Ho KY, Johnson CA, et al. Computer-based coding of occupation codes for
epidemiological analyses. Proc IEEE Int Symp Comput Based Med Syst 2014;2014:347–50.
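The agreement levels in the SOC Agreement table can be read as a cascade of checks. The Java sketch below is one possible reading, not the procedure from the cited paper: the order of the checks, the treatment of 99-0000, and which coder gave which code in the examples are all assumptions, and every name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: label how closely two coders agree on one job description. */
final class AgreementLevel {
    static final String UNABLE = "99-0000"; // assumed marker for "Unable to code"

    /** True when two SOC codes share their first `digits` digits. */
    static boolean sameAt(String a, String b, int digits) {
        String x = a.replace("-", "");
        String y = b.replace("-", "");
        return x.substring(0, digits).equals(y.substring(0, digits));
    }

    /** True when any choice of coder 1 matches any choice of coder 2. */
    static boolean anyPair(List<String> c1, List<String> c2, int digits) {
        for (String a : c1) {
            for (String b : c2) {
                if (sameAt(a, b, digits)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Each list holds a coder's 1st choice, then an optional 2nd choice. */
    static String classify(List<String> c1, List<String> c2) {
        String first1 = c1.get(0);
        String first2 = c2.get(0);
        if (first1.equals(UNABLE) || first2.equals(UNABLE)) return "Unable to code";
        if (sameAt(first1, first2, 6)) return "6-digit";
        if (sameAt(first1, first2, 3)) return "3-digit";
        if (anyPair(c1, c2, 6)) return "6-digit 2nd choice";
        if (anyPair(c1, c2, 3)) return "3-digit 2nd choice";
        return "No agreement";
    }

    public static void main(String[] args) {
        // Codes from the table; the split between coders is assumed.
        System.out.println(classify(Arrays.asList("47-2061"),
                                    Arrays.asList("47-2061", "47-2060")));
        System.out.println(classify(Arrays.asList("51-4051"),
                                    Arrays.asList("47-2060", "51-4051")));
    }
}
```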
SOCcer is an Ensemble
SOCcer extends
opennlp soccertrain -model model.bin -train train.xml
opennlp soccer model.bin < data.csv
SOCcerME
    SoccerModel train(InputStream xmlProperties)
    double[] eval(JobDescription job)
<<Interface>> EventBuilder
    Event build(JobDescription job)
SoccerModel
    List<PrimaryModels> list
    CodingSystem cs
    CodingSystem.Level level
    double[] logisticWeights
JobTitleEventBuilder (EventBuilder)
JobTaskEventBuilder (EventBuilder)
SoftJaccardScoreTrainer
    init(TrainingParameters)
    doTrain(DataIndexer)
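The `SoccerModel` fields above pair a list of primary models with a `double[] logisticWeights`. One plausible reading is a logistic combination of the primary model scores. The Java sketch below illustrates that idea only and is not SOCcer's code. The `PrimaryScorer` interface, the intercept stored in `weights[0]`, and the toy scorers are all assumptions.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: combine primary model scores with logistic weights. */
final class LogisticEnsembleSketch {
    /** Hypothetical primary model: scores one (job text, code) pair. */
    interface PrimaryScorer {
        double score(String jobText, String code);
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** weights[0] is an assumed intercept; weights[i + 1] belongs to models.get(i). */
    static double combine(List<PrimaryScorer> models, double[] weights,
                          String jobText, String code) {
        if (weights.length != models.size() + 1) {
            throw new IllegalArgumentException("Expected one weight per model plus an intercept");
        }
        double z = weights[0];
        for (int i = 0; i < models.size(); i++) {
            z += weights[i + 1] * models.get(i).score(jobText, code);
        }
        return sigmoid(z);
    }

    public static void main(String[] args) {
        // Toy scorers, invented for illustration.
        PrimaryScorer titleMatch = (text, code) ->
                text.contains("plumber") && code.equals("47-2152") ? 1.0 : 0.0;
        PrimaryScorer constant = (text, code) -> 0.5;
        List<PrimaryScorer> models = Arrays.asList(titleMatch, constant);
        double[] weights = {-1.0, 2.0, 0.0};
        System.out.println(combine(models, weights, "plumber", "47-2152"));
        System.out.println(combine(models, weights, "plumber", "47-2061"));
    }
}
```

Evaluating `combine` for every code in the coding system would give a score array comparable in shape to the `double[]` returned by `eval(JobDescription job)`.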
Training XML file
<soccermodel description="MD DataSci SOCcer Model">
  <codingsystem system="soc2010" level="detailed"/>
  <primarymodels>
    <model description="Soft Jaccard Score">
      <eventtrainer class="gov.nih.cit.ml.jaccard.JaccardScoreTrainer"/>
      <eventbuilder class="gov.nih.cit.ml.eventbuilder.JobTitleEventBuilder"/>
      … {training data, parameterMaps}
    </model>
    <model>…</model>
  </primarymodels>
  <logisticmodel>
    <trainingdata>..</trainingdata>
  </logisticmodel>
</soccermodel>
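A file like this names the trainer and event builder classes for each primary model. The Java sketch below pulls those class names out with the JDK's DOM parser. It is not SOCcer's loader: the reader class is hypothetical, and the sample string keeps only the elements shown on the slide, with straight quotes and the elided parts left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch: list the classes named in a SOCcer-style training file. */
final class SoccerConfigReader {
    // Trimmed copy of the slide's XML; elided parts are omitted.
    static final String SAMPLE =
        "<soccermodel description=\"MD DataSci SOCcer Model\">"
        + "<codingsystem system=\"soc2010\" level=\"detailed\"/>"
        + "<primarymodels><model description=\"Soft Jaccard Score\">"
        + "<eventtrainer class=\"gov.nih.cit.ml.jaccard.JaccardScoreTrainer\"/>"
        + "<eventbuilder class=\"gov.nih.cit.ml.eventbuilder.JobTitleEventBuilder\"/>"
        + "</model></primarymodels></soccermodel>";

    /** Returns the `class` attribute of every element with the given tag name. */
    static List<String> classNames(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tag);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse training XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(classNames(SAMPLE, "eventtrainer"));
        System.out.println(classNames(SAMPLE, "eventbuilder"));
    }
}
```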
Side effects
With minor refactoring, OpenNLP can have
novel functionality:
• built-in ensemble classification.
• ensemble classification by voting.
• bring your own classifier!
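Ensemble classification by voting can be sketched in a few lines of plain Java. This example does not use the OpenNLP API. The class name, the tie-breaking rule (the first code to reach the top count wins), and the votes in `main` are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: pick the code that most classifiers voted for. */
final class VotingSketch {
    static String majority(List<String> votes) {
        if (votes.isEmpty()) {
            throw new IllegalArgumentException("Need at least one vote");
        }
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String vote : votes) {
            int count = counts.merge(vote, 1, Integer::sum);
            if (count > bestCount) {   // strict '>' keeps the earlier leader on ties
                best = vote;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // One vote per hypothetical classifier.
        List<String> votes = Arrays.asList("47-2152", "47-2061", "47-2152");
        System.out.println(majority(votes));
    }
}
```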
SOCcer Agreement
Russ DE, et al. Computer-based coding of free-text job descriptions to efficiently
identify occupations in epidemiological studies. Occup Environ Med 2016;73:417–24.
We would like
We get
Agreement (USRenal)
Level   1.0    2.0
2       76.3   78.4
3       63.8   68.0
5       51.5   59.1
6       44.5   51.8
Real Effects
Automated Coding
Study Subject Code
Assisted by SOCcer
Translate from other Coding Systems to SOC 2010
Coding in other systems
Coding in other languages
Finding Occupations in Free Text (NLM)
Thank you
Stephen Ho, Melissa Friesen
And all the committers, PMC members, and users of