personal health train (pht) - how to select appropriate data in the patient's...
TRANSCRIPT
Selecting Appropriate Data for the Personal Health Train
Mark D. Wilkinson ([email protected])
BBVA-UPM Industry Chair on Biotechnology Isaac Peral/Marie Curie Distinguished Researcher
Universidad Politécnica de Madrid
Which “bit” of the train/track am I interested in?
In this frame of the PHT video, the train is being “scanned”
Meta descriptors of “questions” (analyses, data gathering, etc.)
Meta descriptor of data holdings inside the “locker”
Matching of question against data via metadata comparison
Putative Match!
Accomplished by the FAIR Data Point(s) and indexes of these
Importantly, this happens “in the open” (may involve humans!)
Also very interesting issues around informed consent…
A match of the question metadata against public “station” metadata tells the train to enter that station
to see if there are any relevant data points
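The metadata comparison described above can be sketched as a simple set overlap. This is only a minimal illustration, not how a FAIR Data Point or its index is actually implemented; the `Metadata` class, the `putative_match` function, and the concept labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metadata:
    """Public descriptor of a question (train) or of a data holding (station)."""
    name: str
    concepts: frozenset = field(default_factory=frozenset)


def putative_match(question: Metadata, station: Metadata) -> bool:
    """True when the station advertises every concept the question needs.

    The match is only putative: the metadata says relevant data may be
    present, not that the records inside the locker satisfy the question.
    """
    return question.concepts <= station.concepts


# Hypothetical descriptors, for illustration only.
question = Metadata(
    "kidney-rejection-study",
    frozenset({"blood creatinine", "kidney transplant"}),
)
stations = [
    Metadata("locker-A", frozenset({"blood creatinine", "kidney transplant", "blood pressure"})),
    Metadata("locker-B", frozenset({"blood pressure"})),
]

# Stations the train should enter to look for relevant data points.
stations_to_enter = [s.name for s in stations if putative_match(question, s)]
print(stations_to_enter)
```

Only public descriptors are compared here; nothing inside a locker is read at this stage.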
What happens inside the station, however,
is a “Black Box”
Now we are inside the station, i.e. a data repository or “locker”
All decisions from here onwards must be fully autonomous! No peeking!
How can this be?? Because a metadata match is not the same as a data match!
What is actually in the matched Locker will be unpredictable
Analytical algorithms/Q’s may have specific requirements
(data type, format)
that don’t match the data content in this locker
The desired data may not exist at all
(e.g. inclusion/exclusion criteria such as a specific type of clinical measurement, in the
context of a specific drug)
Metadata cannot describe everything about the data
(otherwise, it would be the data)
We require:
• Intelligent, autonomous matching of FAIR Data against analytical tools/workflows, both semantically and syntactically
• Automatic data reformatting, where necessary
• Automatic detection of “fillable gaps” in the data (and filling those gaps)
• Automatic staging of data for analysis
• Automatic execution of analysis (“analysis” may be a single algorithm or a workflow)
• Automatic collection of results, and all provenance metadata from the analysis
• Automatic purging of any identifiable/private data remaining in the output dataset
• No human intervention at any point!
This is happening in a “black box”
Between 2006-2008
my laboratory at St. Paul’s Hospital, Vancouver
created technologies to address exactly this problem
in the context of FAIR Data
(…but before FAIR was a “thing” ;-) )
SADI: Semantic Automated Discovery and Integration
A design-pattern for analytical tools that utilize FAIR Data
SHARE: Semantic Health And Research Environment
A multi-faceted “engine” that automatically assembles FAIR Data and uses it to
execute appropriate SADI tools to answer research questions
Original Purpose
Facilitate interoperability between globally-distributed Web Services
Re-Purpose
Facilitate interoperability between incoming PHT analyses and Locker data
SADI Defines a design pattern for the interface
to any analytical tool that consumes FAIR Data
Includes support for NanoPublication of the output from analyses
(i.e. SADI natively outputs FAIR data also)
SHARE
Query interpretation
Semantic reasoning over data
Analytical tool selection (SADI)
Workflow synthesis
Data reformatting
Data/Service matchmaking
Workflow execution
[Provenance capture]
Output formatting
Typical Analytical Tool (BMI Calculator)
Input: Height: 187, Weight: 89
Output: 25.5
Analytical Tool With SADI (BMI Calculator)
Input: Patient_09, with height 187 and weight 89
Output: Patient_09, with height 187, weight 89, and BMI 25.5, plus Provenance
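The contrast between the two BMI Calculator diagrams can be sketched in code. This is a rough illustration only: it assumes the slide's height is in centimetres and its weight in kilograms (the slide gives no units, but those reproduce the 25.5 shown), and the dictionary layout and function names are hypothetical rather than the SADI interface itself.

```python
def typical_bmi_tool(height_cm: float, weight_kg: float) -> float:
    """A typical tool: bare numbers in, a bare number out; the subject is lost."""
    return round(weight_kg / (height_cm / 100) ** 2, 1)


def sadi_style_bmi_tool(record: dict) -> dict:
    """SADI-style sketch: the output stays attached to the same input subject.

    The input record is returned with the new BMI property added, plus a
    provenance note saying which tool produced it.
    """
    bmi = typical_bmi_tool(record["height"], record["weight"])
    return {
        **record,
        "BMI": bmi,
        "provenance": {"generatedBy": "BMI Calculator"},
    }


# Values taken from the slide; units (cm, kg) are an assumption.
patient = {"id": "Patient_09", "height": 187, "weight": 89}

print(typical_bmi_tool(187, 89))     # a number with no context
print(sadi_style_bmi_tool(patient))  # the same patient, now carrying a BMI
```

Because the output is still about Patient_09, a downstream tool can consume it without anyone having to re-link the result to its source.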
SADI Tools are described by metadata that contain OWL models of their
Input and Output data, which must be FAIR
Data/Tool matching can be done by:
Exact match, or
Ontological reasoning
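The two matching modes can be contrasted with a toy example. It assumes a tiny, hypothetical chain of named classes; real OWL reasoning also covers class expressions and property restrictions, which this sketch does not attempt.

```python
# Hypothetical class hierarchy (child -> parent). A real system would load
# this from the OWL models published with each tool.
SUBCLASS_OF = {
    "UniProtRecord": "ProteinRecord",
    "ProteinRecord": "BiologicalEntity",
}


def ancestors(cls: str) -> list:
    """All superclasses of cls, nearest first."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def exact_match(data_class: str, tool_input_class: str) -> bool:
    """The data is typed with exactly the class the tool declares as input."""
    return data_class == tool_input_class


def reasoning_match(data_class: str, tool_input_class: str) -> bool:
    """The data's class is the tool's input class or one of its subclasses."""
    return exact_match(data_class, tool_input_class) or tool_input_class in ancestors(data_class)


print(exact_match("UniProtRecord", "ProteinRecord"))
print(reasoning_match("UniProtRecord", "ProteinRecord"))
```

Exact matching misses a tool that accepts any protein record when the data is typed more specifically; the reasoning step recovers that match.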
To understand SHARE
it is best to see it in action
These are 100% real, working examples of SHARE doing the
kinds of analyses that we expect the PHT to do…
For each SNP in each patient, where the SNP results in an altered protein product, we want to know the pathways that are
affected in that patient
SELECT ?gene ?pathway WHERE {
    uniprot:XXXXXX pred:isEncodedBy ?gene .
    ?gene ont:isParticipantIn ?pathway .
}
Start simply… Exact Match Discovery + Analysis
The patient who owns this locker is recorded as having a SNP variant that affects protein P47989 (UniProt). What pathways
are affected by this SNP?
SELECT ?gene ?pathway WHERE {
    uniprot:P47989 pred:isEncodedBy ?gene .
    ?gene ont:isParticipantIn ?pathway .
}
The PHT is now inside an individual locker
Give that query to SHARE
Tools carried in the PHT “car” (or in some circumstances, even external to the PHT)
are now matched against the data in the Locker, assembled into an analytical workflow,
and the workflow is executed
First: a tool is discovered that takes UniProt identifiers and maps them to their respective genes
Second: the appropriate data is selected from the data source (locker) and that tool is executed.
Third: the output from that tool is evaluated to ensure it is correct input to the tool that determines the pathways that a gene participates in
Fourth: that tool is executed, and the output is collected and formatted…
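The four steps above amount to chaining tools by their input and output types. The sketch below is a simplified illustration, not SHARE's implementation: the `Tool` class, the type names, and the placeholder lookups (which return made-up strings rather than real genes or pathways) are all hypothetical, and the planner is greedy with no backtracking.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tool:
    name: str
    consumes: str   # type of data the tool accepts
    produces: str   # type of data the tool emits
    run: Callable[[str], str]


# Placeholder tools standing in for real services; outputs are made up.
TOOLS = [
    Tool("protein-to-gene", "UniProtID", "Gene", lambda p: f"gene-for-{p}"),
    Tool("gene-to-pathway", "Gene", "Pathway", lambda g: f"pathway-for-{g}"),
]


def plan(start: str, goal: str, tools: list) -> Optional[list]:
    """Chain tools so that each one's output type feeds the next one's input."""
    chain, current = [], start
    while current != goal:
        step = next((t for t in tools if t.consumes == current and t not in chain), None)
        if step is None:
            return None  # no tool accepts the current data type
        chain.append(step)
        current = step.produces
    return chain


def execute(chain: list, value: str) -> str:
    """Run the tools in order, piping each output into the next tool."""
    for tool in chain:
        value = tool.run(value)
    return value


workflow = plan("UniProtID", "Pathway", TOOLS)
print([t.name for t in workflow])
print(execute(workflow, "P47989"))
```

The type check in `plan` plays the role of the third step: a tool is only appended when the previous output is valid input for it.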
That was a simple example
The PHT will encounter much more complex cases
Detect if the patient who owns this locker is rejecting their kidney transplant
If so, then collect their latest Blood Urea Nitrogen and Creatinine levels
SELECT ?patient ?bun ?creat
FROM <patient:locker>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}
Likely Rejecter:
A patient who has creatinine levels that are increasing over time
-- Mark D Wilkinson’s definition
FAIR does not equal “Predictable”!!
The information requested by a researcher is not always going to be recorded in a patient’s Personal Health Locker
or even in a hospital clinical database
at least, not always in the way they want it…
The PHT is going to have to deal with a wide range of scenarios, including data that has not been annotated in the
manner required to answer the question
We’re in the Black Box, we can’t ask for human assistance
The system must decide autonomously!
In this case, we will assume that the patient’s clinical information contains only a time-series of
blood creatinine measurements
“worst-case” scenario
No guidance whatsoever! Only raw, uninterpreted data.
…but there is sufficient info. to solve the problem!
My definition of a Likely Rejecter is encoded in a machine-readable document written in the OWL Ontology language
Basically:
“the regression line over creatinine measurements should have an increasing slope”
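The slope test in that definition can be sketched with an ordinary least-squares fit. This is an illustration under stated assumptions: the actual definition lives in OWL and is evaluated by SADI tools, whereas here the function names are hypothetical and the (day, creatinine) values are made up.

```python
def regression_slope(points: list) -> float:
    """Ordinary least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all measurements share the same time point")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def is_likely_rejecter(creatinine_series: list) -> bool:
    """Increasing regression slope over the creatinine time series."""
    return regression_slope(creatinine_series) > 0


# Made-up (day, creatinine) measurements, for illustration only.
rising = [(0, 1.0), (30, 1.2), (60, 1.5), (90, 1.9)]
falling = [(0, 1.9), (30, 1.5), (60, 1.2), (90, 1.0)]

print(is_likely_rejecter(rising))
print(is_likely_rejecter(falling))
```

Note that the regression sees only X/Y pairs; nothing in `regression_slope` knows the values are creatinine levels.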
SELECT ?patient ?bun ?creat
FROM <patient:locker>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}
SHARE examines the query and determines that it is looking for “Rejecters”
It checks if the “Rejecter” property is in the patient’s locker, and finds that it is not.
It examines the definition of “Rejecter” and matches each property (slope, intercept, etc.) with a SADI Tool. These are
pipelined into a workflow
Finally, it determines what data is available, and where that data can be piped into the workflow (semantic matching)
SHARE decides that it needs to do a
Linear Regression analysis
on the blood creatinine measurements
It finds a linear regression tool (SADI), repackages the data,
and executes the analysis
A screenshot of SHARE solving the Likely Rejecter query
How SHARE interprets the data varies throughout the execution of the analysis
Example?
Blood Creatinine measurements
were not dictated to only be
Blood Creatinine measurements
FAIR Data has the ‘qualities/properties’ that
allow one analytical tool to interpret
them as Blood Creatinine measurements
(e.g. to determine which patients are rejecting)
But the data also has the ‘qualities/properties’ that
allow another analytical tool to interpret them as
Simple X/Y coordinate data
(e.g. the Linear Regression calculation tool)
Because of the “I” in FAIR
FAIR Data is amenable to
autonomous
Interpretation
Reinterpretation
Reformatting
and (Re-)Integration
Because SADI Tools are defined in terms of the FAIR Data they operate on
And because the PHT will carry a limited number of such tools (selected by the researcher for their specific task)
We can rely on the PHT’s SHARE to undertake rapid, efficient, autonomous matchmaking between the
patient data, and the appropriate tools/workflows
inside the black box of the Patient locker
And this gives us…
http://www.flickr.com/people/faernworks/
One more example
Here, we address a problem that we know the PHT is going to encounter
ID    HEIGHT   WEIGHT   SBP    CHOL   HDL   BMI GR   SBP GR   CHOL GR   HDL GR
pt1   1.82     177      128    227    55    0        0        1         0
pt2   179      196      13.4   5.9    1.7   1        0        1         0
A legacy clinical dataset (from the 1970’s) used in our SHARE R&D studies
Height in m and cm
Chol in mmol/l and mg/l
...and other delicious weirdness
GOAL:
autonomous detection and resolution of conflicts
in the recorded measurement units between disparate clinical datasets
Rich data structures like this one can be “Projected”
from existing FAIR Data sources like the PH Locker
These become input to…
Unified SADI Tool for automated Unit conversion of any type
• Send it a dataset with mixed units
• (optional) tell it the harmonized unit you want back
• Returns you a dataset with harmonized units
Automatic semantic detection of the “nature” of the incoming unit type (e.g. “unit of pressure”)
Automatic conversion based on dimensionality and/or offset & multiplier
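Conversion by dimensionality plus offset and multiplier can be sketched as a lookup table of units expressed relative to a base unit. This is a minimal illustration, not the SADI unit-conversion tool: the `Unit` class, the unit labels, and the `convert` function are hypothetical, the pressure factors are the conventional definitions of mmHg and cmHg in pascal, and the temperature entries are included only to show a non-zero offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    dimension: str       # e.g. "pressure"; conversion across dimensions is refused
    multiplier: float    # value_in_base = value * multiplier + offset
    offset: float = 0.0


# Base units in this sketch: pascal for pressure, kelvin for temperature.
UNITS = {
    "Pa": Unit("pressure", 1.0),
    "kPa": Unit("pressure", 1000.0),
    "mmHg": Unit("pressure", 133.322387415),
    "cmHg": Unit("pressure", 1333.22387415),
    "K": Unit("temperature", 1.0),
    "degC": Unit("temperature", 1.0, 273.15),
}


def convert(value: float, source: str, target: str) -> float:
    """Convert via the base unit, after checking both units share a dimension."""
    src, dst = UNITS[source], UNITS[target]
    if src.dimension != dst.dimension:
        raise ValueError(f"cannot convert {src.dimension} to {dst.dimension}")
    base = value * src.multiplier + src.offset
    return (base - dst.offset) / dst.multiplier


# Harmonize a mixed-unit dataset into kilopascal.
mixed = [("cm_hg1", 15, "cmHg"), ("mm_hg2", 146, "mmHg")]
harmonized = [(rid, round(convert(v, u, "kPa"), 3), "kPa") for rid, v, u in mixed]
print(harmonized)
```

The dimension check is what stops a pressure from being silently converted into, say, a temperature; the multiplier and offset do the arithmetic once the dimensions agree.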
The researcher asking the question will define the clinical measurements of interest to them
including measurement units and inclusion/exclusion criteria
measure:HighRiskSystolicBloodPressure
measure:SystolicBloodPressure and (sio:hasMeasurement some (sio:Measurement and (“sio:has unit” value om:kilopascal) and (sio:hasValue some double[>= "18.7"^^double])))
Now we’re being specific: MUST be in kpascal and must be > 18.7
SHARE query
SELECT ?record ?convertedvalue ?convertedunit
FROM <patient:locker>
WHERE {
    ?record rdf:type measure:HighSystolicBloodPressure .
    ?record sio:hasMeasurement ?measurement .
    ?measurement sio:hasValue ?convertedvalue .
    ?record cardio:ExpertClassification ?riskgrade .
}
RecordID   Start Val   Start Unit   End Val   End Unit
cm_hg1     15          cmHg         19.998    KiloPascal
cm_hg2     14.6        cmHg         19.465    KiloPascal
mm_hg1     14.8        mmHg         19.731    KiloPascal
mm_hg2     146         mmHg         19.465    KiloPascal
Because HighSystolicBloodPressure was defined in kpascal, SHARE automatically told SADI to convert everything into kpascal
Final Note #1: Reproducibility & Scholarly Rigor
Different things can/will happen inside of different lockers, even in the context of
the same question
But these are black boxes!
SADI services natively output NanoPublications, therefore we have a detailed record of
provenance associated with EACH AND EVERY data point. We can peek inside the black box!
Final Note #2: Deployment
How do we get SHARE, the relevant SADI services,
and the workflows into the locker?
We are not alone…
Accurate, autonomous matchmaking between data and tools/workflows is tricky
…even if the data is FAIR!
SADI and SHARE were designed specifically to solve
this problem!
Specific Acknowledgements to:
Dr. Mikel Egaña Aranguren (SADI + Galaxy + Docker)
Dr. Soroush Samadian (clinical measurement unit conversion)
Luke McCarthy and Ben Vandervalk (SADI + SHARE)
Microsoft Research