personal health train (pht) - how to select appropriate data in the patient's...

Selecting Appropriate Data for the Personal Health Train

Mark D. Wilkinson

BBVA-UPM Industry Chair on Biotechnology Isaac Peral/Marie Curie Distinguished Researcher

Universidad Politécnica de Madrid

Which “bit” of the train/track am I interested in?


In this frame of the PHT video, the train is being “scanned”



Meta descriptors of “questions” (analyses, data gathering, etc.)



Meta descriptor of data holdings inside the “locker”


Matching of question against data via metadata comparison

Putative Match!


Accomplished by the FAIR Data Point(s) and indexes of these



Importantly, this happens “in the open” (may involve humans!)


Also very interesting issues around informed consent…

A match of the question metadata against public “station” metadata tells the train to enter that station

to see if there are any relevant data points
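The metadata-level match can be sketched as a simple set comparison. This is only an illustration: it assumes each station publishes a flat set of descriptor terms, and the names `StationMetadata`, `putative_matches`, and the descriptor strings are hypothetical, not part of any FAIR Data Point interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StationMetadata:
    """Public descriptor of a station (hypothetical structure)."""
    station_id: str
    data_types: frozenset


def putative_matches(required_types, stations):
    """Return IDs of stations whose public metadata covers every required type.

    Only metadata is compared here; nothing is known yet about whether the
    data inside the station will actually satisfy the analysis.
    """
    required = frozenset(required_types)
    return [s.station_id for s in stations if required <= s.data_types]


# Invented example stations and descriptor terms.
stations = [
    StationMetadata("locker-A", frozenset({"blood_creatinine", "blood_urea_nitrogen"})),
    StationMetadata("locker-B", frozenset({"genotype"})),
]
matches = putative_matches({"blood_creatinine"}, stations)
print(matches)
```

A station that passes this test is only a putative match: the train still has to enter it to find out whether usable data points exist.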

What happens inside the station, however,

is a “Black Box”

Now we are inside the station, i.e. a data repository or “locker”

All decisions from here onwards must be fully autonomous! No peeking!

How can this be? Because a metadata match is not the same as a data match!

What is actually in the matched Locker will be unpredictable

Analytical algorithms/Q’s may have specific requirements

(data type, format)

that don’t match the data content in this locker

The desired data may not exist at all

(e.g. inclusion/exclusion criteria such as a specific type of clinical measurement, in the

context of a specific drug)

Metadata cannot describe everything about the data

(otherwise, it would be the data)

We require:

• Intelligent, autonomous matching of FAIR Data against analytical tools/workflows, both semantically and syntactically
• Automatic data reformatting, where necessary
• Automatic detection of “fillable gaps” in the data (and filling those gaps)
• Automatic staging of data for analysis
• Automatic execution of analysis (“analysis” may be a single algorithm or a workflow)
• Automatic collection of results, and all provenance metadata from the analysis
• Automatic purging of any identifiable/private data remaining in the output dataset
• No human intervention at any point!

This is happening in a “black box”

Between 2006 and 2008

my laboratory at St. Paul’s Hospital, Vancouver

created technologies to address exactly this problem

in the context of FAIR Data

(…but before FAIR was a “thing” ;-) )

SADI: Semantic Automated Discovery and Integration

A design pattern for analytical tools that utilize FAIR Data

SHARE: Semantic Health And Research Environment

A multi-faceted “engine” that automatically assembles FAIR Data and uses it to execute appropriate SADI tools to answer research questions

Original Purpose

Facilitate interoperability between globally distributed Web Services

Re-Purpose

Facilitate interoperability between incoming PHT analyses and Locker data

SADI defines a design pattern for the interface

to any analytical tool that consumes FAIR Data

Includes support for NanoPublication of the output from analyses

(i.e. SADI natively outputs FAIR data also)

SHARE

• Query interpretation
• Semantic reasoning over data
• Analytical tool selection (SADI)
• Workflow synthesis
• Data reformatting
• Data/Service matchmaking
• Workflow execution
• [Provenance capture]
• Output formatting

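[Figure: Typical analytical tool vs. analytical tool with SADI. The typical BMI Calculator receives Height: 187 and Weight: 89 and returns the bare value 25.5. The BMI Calculator with SADI receives Patient_09 with its height (187) and weight (89), and returns Patient_09 with height, weight, BMI (25.5), and provenance attached.]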

SADI Tools are described by metadata that contain OWL models of their Input and Output data, which must be FAIR
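The difference between a typical tool and a SADI-style tool can be sketched in Python. This is a rough illustration, not the SADI API: plain dicts stand in for RDF nodes, the function and property names are hypothetical, and the units (height in cm, weight in kg) are an assumption that happens to reproduce the 25.5 shown in the BMI example.

```python
def bmi_plain(height_cm, weight_kg):
    """Typical tool: bare numbers in, bare number out; context is lost."""
    return round(weight_kg / (height_cm / 100) ** 2, 1)


def bmi_sadi_style(record):
    """SADI-style tool (sketch): the output stays attached to the input subject.

    `record` is a dict standing in for an RDF node. A new property is added
    to a copy of it, together with a simple provenance note.
    """
    output = dict(record)
    output["BMI"] = bmi_plain(record["height"], record["weight"])
    output["provenance"] = {"tool": "BMI Calculator", "inputs": ["height", "weight"]}
    return output


patient = {"id": "Patient_09", "height": 187, "weight": 89}
result = bmi_sadi_style(patient)
print(bmi_plain(187, 89), result["id"], result["BMI"])
```

Because the result is still a description of Patient_09, a downstream tool can consume it without any out-of-band knowledge of what the number 25.5 refers to.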



Data/Tool matching can be done by:

Exact match, or

Ontological reasoning
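The two matching modes can be contrasted with a small sketch. It assumes a toy class hierarchy held in a dict; real OWL reasoning is considerably richer (classes can be defined by property restrictions, not just named parents), and the class names and the `matches` function are invented for illustration.

```python
# Hypothetical class hierarchy: child -> parent (stands in for an OWL ontology).
SUBCLASS_OF = {
    "BloodCreatinineMeasurement": "TimeSeriesMeasurement",
    "TimeSeriesMeasurement": "XYDataPoint",
}


def ancestors(cls):
    """Yield cls and every superclass reachable through SUBCLASS_OF."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def matches(data_class, tool_input_class, reasoning=False):
    """Exact match compares names; reasoning also accepts any subclass."""
    if data_class == tool_input_class:
        return True
    return reasoning and tool_input_class in ancestors(data_class)


print(matches("BloodCreatinineMeasurement", "XYDataPoint"))
print(matches("BloodCreatinineMeasurement", "XYDataPoint", reasoning=True))
```

With exact matching the tool that expects `XYDataPoint` is never offered creatinine data; with reasoning enabled it is, because the data's class is subsumed by the tool's input class.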




To understand SHARE

it is best to see it in action

These are 100% real, working examples of SHARE doing the

kinds of analyses that we expect the PHT to do…

For each SNP in each patient, where the SNP results in an altered protein product, we want to know the pathways that are

affected in that patient

SELECT ?gene ?pathway WHERE {
    uniprot:XXXXXX pred:isEncodedBy ?gene .
    ?gene ont:isParticipantIn ?pathway .
}

Start simply… Exact Match Discovery + Analysis

The patient who owns this locker is recorded as having a SNP variant that affects protein P47989 (UniProt). What pathways

are affected by this SNP?


SELECT ?gene ?pathway WHERE {
    uniprot:P47989 pred:isEncodedBy ?gene .
    ?gene ont:isParticipantIn ?pathway .
}

The PHT is now inside an individual locker

Give that query to SHARE

Tools carried in the PHT “car” (or in some circumstances, even external to the PHT)

are now matched against the data in the Locker, assembled into an analytical workflow,

and the workflow is executed



First: a tool is discovered that takes UniProt identifiers and maps them to their respective genes

Second: the appropriate data is selected from the data source (locker) and that tool is executed.

Third: the output from that tool is evaluated to ensure it is correct input to the tool that determines the pathways that a gene participates in

Fourth: that tool is executed, and the output is collected and formatted…
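The four steps above can be sketched as a chain of tool lookups, one per predicate in the query. Everything here is a stand-in: the registry, the lookup tables, and the gene and pathway identifiers are invented placeholders, and the real SHARE engine discovers SADI tools through their OWL input/output descriptions rather than a dict.

```python
# Hypothetical lookup tables; the gene and pathway identifiers are placeholders,
# not real annotations for P47989.
GENE_OF = {"uniprot:P47989": ["gene:EXAMPLE_GENE"]}
PATHWAYS_OF = {"gene:EXAMPLE_GENE": ["pathway:EXAMPLE_PATHWAY"]}

# Hypothetical tool registry: predicate -> function from a subject to objects.
TOOLS = {
    "pred:isEncodedBy": lambda protein: GENE_OF.get(protein, []),
    "ont:isParticipantIn": lambda gene: PATHWAYS_OF.get(gene, []),
}


def run_chain(start, predicates):
    """Resolve a chain of triple patterns by discovering one tool per predicate.

    Each step feeds the previous step's outputs into the next tool and
    returns one row per complete path through the chain.
    """
    rows = [[start]]
    for predicate in predicates:
        tool = TOOLS[predicate]  # discovery: look up a tool by predicate
        rows = [row + [out] for row in rows for out in tool(row[-1])]
    return rows


rows = run_chain("uniprot:P47989", ["pred:isEncodedBy", "ont:isParticipantIn"])
print(rows)
```

If any step produces no output, the chain yields no rows, which mirrors a locker that simply has nothing to contribute to the question.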




That was a simple example

The PHT will encounter much more complex cases

Detect if the patient who owns this locker is rejecting their kidney transplant

If so, then collect their latest Blood Urea Nitrogen and Creatinine levels

SELECT ?patient ?bun ?creat
FROM <patient:locker>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}

Likely Rejecter:

A patient who has creatinine levels that are increasing over time

- Mark D. Wilkinson’s definition


FAIR does not equal “Predictable”!!

The information requested by a researcher is not always going to be recorded in a patient’s Personal Health Locker

or even in a hospital clinical database

at least, not always in the way they want it…


The PHT is going to have to deal with a wide range of scenarios, including data that has not been annotated in the

manner required to answer the question

We’re in the Black Box, we can’t ask for human assistance

The system must decide autonomously!


In this case, we will assume that the patient’s clinical information contains only a time-series of

blood creatinine measurements

“worst-case” scenario

No guidance whatsoever! Only raw, uninterpreted data. …but there is sufficient information to solve the problem!

My definition of a Likely Rejecter is encoded in a machine-readable document written in the OWL Ontology language

Basically:

“the regression line over creatinine measurements should have an increasing slope”
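As a rough illustration of what that definition computes, the following sketch fits an ordinary least-squares line to a creatinine time series and checks the sign of the slope. The function names and the sample values are invented for this example; in SHARE the definition lives in OWL and the regression is delegated to a SADI tool.

```python
def regression_slope(points):
    """Ordinary least-squares slope of y over x for (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all measurements share the same time point")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def is_likely_rejecter(creatinine_series):
    """Apply the definition: the regression line has an increasing slope."""
    return regression_slope(creatinine_series) > 0


# Invented example values: (day, creatinine level); units are not specified here.
rising = [(0, 1.0), (7, 1.2), (14, 1.5), (21, 1.9)]
stable = [(0, 1.1), (7, 1.0), (14, 1.1), (21, 1.0)]
print(is_likely_rejecter(rising), is_likely_rejecter(stable))
```

Note that the regression function knows nothing about creatinine: it only sees X/Y pairs, which is why the same data must be interpretable in more than one way.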

SELECT ?patient ?bun ?creat
FROM <patient:locker>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}

SHARE examines the query and determines that it is looking for “Rejecters”


It checks if the “Rejecter” property is in the patient’s locker, and finds that it is not.



It examines the definition of “Rejecter” and matches each property (slope, intercept, etc.) with a SADI Tool. These are

pipelined into a workflow




Finally, it determines what data is available, and where that data can be piped into the workflow (semantic matching)

SHARE decides that it needs to do a

Linear Regression analysis

on the blood creatinine measurements

It finds a linear regression tool (SADI), repackages the data, and executes the analysis

A screenshot of SHARE solving the Likely Rejecter query

How SHARE interprets the data varies throughout the execution of the analysis

Example?

Blood Creatinine measurements

were not dictated to only be

Blood Creatinine measurements


FAIR Data has the ‘qualities/properties’ that allow one analytical tool to interpret them as Blood Creatinine measurements

(e.g. to determine which patients are rejecting)


But the data also has the ‘qualities/properties’ that allow another analytical tool to interpret them as

Simple X/Y coordinate data

(e.g. the Linear Regression calculation tool)

Because of the “I” in FAIR

FAIR Data is amenable to autonomous Interpretation, Reinterpretation, Reformatting, and (Re-)Integration

Because SADI Tools are defined in terms of the FAIR Data they operate on

And because the PHT will carry a limited number of such tools (selected by the researcher for their specific task)

We can rely on the PHT’s SHARE to undertake rapid, efficient, autonomous matchmaking between the

patient data, and the appropriate tools/workflows

inside the black box of the Patient locker

And this gives us…

http://www.flickr.com/people/faernworks/

One more example

Here, we address a problem that we know the PHT is going to encounter

ID   HEIGHT  WEIGHT  SBP   CHOL  HDL  BMI GR  SBP GR  CHOL GR  HDL GR
pt1  1.82    177     128   227   55   0       0       1        0
pt2  179     196     13.4  5.9   1.7  1       0       1        0

A legacy clinical dataset (from the 1970’s) used in our SHARE R&D studies

Height in m and cm; Chol in mmol/l and mg/dl

...and other delicious weirdness

GOAL:

autonomous detection and resolution of conflicts in the recorded measurement units between disparate clinical datasets

Rich data structures like this one can be “Projected”

from existing FAIR Data sources like the PH Locker

These become input to…

Unified SADI Tool for automated Unit conversion of any type

• Send it a dataset with mixed units
• (optional) tell it the harmonized unit you want back
• Returns you a dataset with harmonized units

Automatic semantic detection of the “nature” of the incoming unit type (e.g. “unit of pressure”)

Automatic conversion based on dimensionality and/or offset & multiplier
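Conversion by dimensionality plus multiplier and offset can be sketched as a small lookup table. This is an assumption-laden illustration rather than the SADI unit-conversion service: the `UNITS` table and `convert` function are hypothetical, the base units are chosen arbitrarily, and only the conventional mmHg-to-kilopascal factor is taken as given.

```python
# Each unit maps to (dimension, multiplier, offset) relative to a base unit,
# so that base_value = value * multiplier + offset.
# Pressure base: kilopascal; length base: metre.
UNITS = {
    "kPa":  ("pressure", 1.0, 0.0),
    "mmHg": ("pressure", 0.133322387415, 0.0),
    "cmHg": ("pressure", 1.33322387415, 0.0),
    "m":    ("length", 1.0, 0.0),
    "cm":   ("length", 0.01, 0.0),
}


def convert(value, from_unit, to_unit):
    """Convert between units of the same dimension; refuse anything else."""
    from_dim, from_mul, from_off = UNITS[from_unit]
    to_dim, to_mul, to_off = UNITS[to_unit]
    if from_dim != to_dim:
        raise ValueError(
            f"cannot convert {from_unit} ({from_dim}) to {to_unit} ({to_dim})"
        )
    base = value * from_mul + from_off
    return (base - to_off) / to_mul


print(round(convert(15, "cmHg", "kPa"), 3))   # 19.998
print(round(convert(146, "mmHg", "kPa"), 3))  # 19.465
```

The two printed values agree with the cm_hg1 and mm_hg2 rows of the conversion results reported later in this deck. The dimension check is what stops a height in cm from ever being "converted" into a pressure.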

The researcher asking the question will define the clinical measurements of interest to them

including measurement units and inclusion/exclusion criteria

measure:HighRiskSystolicBloodPressure

measure:SystolicBloodPressure and
    (sio:hasMeasurement some
        (sio:Measurement and
            (“sio:has unit” value om:kilopascal) and
            (sio:hasValue some double[>= "18.7"^^double])))

Now we’re being specific: the value MUST be in kilopascals and must be >= 18.7

SELECT ?record ?convertedvalue ?convertedunit
FROM <patient:locker>
WHERE {
    ?record rdf:type measure:HighSystolicBloodPressure .
    ?record sio:hasMeasurement ?measurement .
    ?measurement sio:hasValue ?convertedvalue .
    ?record cardio:ExpertClassification ?riskgrade .
}

RecordID  Start Val  Start Unit  End Val  End Unit
cm_hg1    15         cmHg        19.998   KiloPascal
cm_hg2    14.6       cmHg        19.465   KiloPascal
mm_hg1    14.8       mmHg        19.731   KiloPascal
mm_hg2    146        mmHg        19.465   KiloPascal

SHARE query

Because HighSystolicBloodPressure was defined in kilopascals, SHARE automatically told SADI to convert everything into kilopascals

Different things can/will happen inside of different lockers, even in the context of

the same question

But these are black boxes!

SADI services natively output NanoPublications, therefore we have a detailed record of

provenance associated with EACH AND EVERY data point. We can peek inside the black box!

Final Note #1: Reproducibility & Scholarly Rigor

How do we get SHARE, the relevant SADI services, and the workflows into the locker?

Final Note #2: Deployment

We are not alone…

Accurate, autonomous matchmaking between data and tools/workflows is tricky

…even if the data is FAIR!

SADI and SHARE were designed specifically to solve this problem!

Specific Acknowledgements to:

Dr. Mikel Egaña Aranguren (SADI + Galaxy + Docker)

Dr. Soroush Samadian (clinical measurement unit conversion)

Luke McCarthy and Ben Vandervalk (SADI + SHARE)

Microsoft Research
