an identity crisis in the life sciences

An Identity Crisis in the Life Sciences Jun Zhao, Carole Goble and Robert Stevens The University of Manchester, UK Thanks to: Tom Oinn, Matthew Pocock, Daniele Turi And our users And the EPSRC


Post on 12-Jan-2016


TRANSCRIPT

Page 1: An Identity Crisis  in the Life Sciences

An Identity Crisis in the Life Sciences

Jun Zhao, Carole Goble and Robert Stevens
The University of Manchester, UK

Thanks to: Tom Oinn, Matthew Pocock, Daniele Turi

And our users
And the EPSRC

Page 2: An Identity Crisis  in the Life Sciences

UK e-Science project

Middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications.

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt

Page 3: An Identity Crisis  in the Life Sciences
Page 4: An Identity Crisis  in the Life Sciences
Page 5: An Identity Crisis  in the Life Sciences

Bioinformatics workflows

Taverna workflow workbench

collected metabolic pathway

computed BLAST report

computed BLAST report

• Data pipelines
• Collect data
• Compute data
• Frequently updated public resources
• Open world
• Get the same data product in different experiment context

Bioinformatician users

Page 6: An Identity Crisis  in the Life Sciences

[Figure: provenance graph in three layers: Concepts, Services, and Data generated by services/workflows.
Data nodes: urn:data1, urn:data2, urn:data12, urn:data:3, urn:data:f1, urn:data:f2, urn:hit1… to urn:hit50…, urn:hit5…, urn:hit8…, urn:hit10…
Invocation nodes: urn:BlastNInvocation3, urn:compareinvocation3, urn:invocation5
Concepts and types: Blast_report, SwissProt_seq, Sequence_hit, Find similar sequence, LSDatum, DatumCollection, literals
Properties: [input], [output], [instanceOf], [type], [hasHits], [contains], [hasName], [performsTask], [similar_sequence_to], [directlyDerivedFrom], [distantlyDerivedFrom]
Annotations: New sequence, Missed sequence]

Page 7: An Identity Crisis  in the Life Sciences

Fusion between different data models using shared concepts and shared data

[Figure: three graphs joined through shared LSIDs and shared concepts.
Graph 1: urn:BlastNInvocation3 (instanceOf Blast_service), urn:data2, urn:data:3, urn:genbank1… to urn:genbank50…, typed as Blast_report and DNA_sequence; properties inputOf, outputOf, createdFrom, contains_similiar_seq_to, instanceOf.
Graph 2: urn:williamsA, urn:williamsB, urn:run5, urn:run7, urn:data2; sources GenBank and UniProt; properties runOf, inputOf, createdBy.
Graph 3: the provenance graph from the previous slide (data, invocation and concept nodes with [input], [output], [instanceOf], [hasHits], [contains], [directlyDerivedFrom] and [distantlyDerivedFrom] links).
The graphs are linked by LSID.]

Add assertions, Add rules

Reason over assertions

Page 8: An Identity Crisis  in the Life Sciences

Putting Provenance to Use

• Single workflow
  – audit trail
  – recipe
• Multiple workflow runs (versions)
  – Aggregation - gathering
  – Integration - merging
  – Comparison - differencing

Page 9: An Identity Crisis  in the Life Sciences

Any idea?

• 30350027
• gi:30350027

Life Science Identifier
A ruddy great lump of RDF

Page 10: An Identity Crisis  in the Life Sciences

URIs for Data

urn:lsid:mygrid.ac.uk:data:49841:1

• Life Science Identifier
• Protocol for allocation and resolution
• Adopted by a range of data providers
• LSIDs for the data in the data providers' databases that we collect during workflow execution
• LSIDs for the data products we compute during workflow execution

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt

http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
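As an illustration of what the identifier on this slide encodes, here is a minimal Python sketch that splits an LSID into its parts, assuming the usual layout urn:lsid:authority:namespace:object with an optional trailing revision. The names `LSID` and `parse_lsid` are made up for this example and are not part of Taverna or any LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    # Expect the two-part prefix plus three or four components.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not a well-formed LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty LSID component in: {text!r}")
    return LSID(*parts[2:])


lsid = parse_lsid("urn:lsid:mygrid.ac.uk:data:49841:1")
print(lsid.authority, lsid.namespace, lsid.object_id, lsid.revision)
```

The shortened identifiers used on later slides (for example urn:lsid:tav:ic531) are slide abbreviations and would be rejected by this strict parser.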

Page 11: An Identity Crisis  in the Life Sciences

Having a BLAST in every workflow!

[Workflow diagram. Inputs: Seq, database, score. Processors: BLAST, BLAST_simplifer, GenBank_retrieve. Data products: BlastReport, a list of Sequences, GenBankReport.]

Page 12: An Identity Crisis  in the Life Sciences

Alignment of sequence AC005089

Page 13: An Identity Crisis  in the Life Sciences

Computed Collections and Collected data items

[Figure: three BLAST reports for the same query (SEQ). Each report is reduced by BLASTsimplifer to a listOf sequences: Sequence1, Sequence2, Sequence3 in the first two reports, and Sequence1, Sequence2, Sequence4 in the third.]

Page 14: An Identity Crisis  in the Life Sciences

[Figure: two BLAST reports for the same query (SEQ), each reduced by BLASTsimplifer to a listOf sequences: Sequence1, Sequence2, Sequence3 in one and Sequence1, Sequence2, Sequence4 in the other.
Labels: Equivalent data; Corresponding data; Data Co-references; Context of the workflow.]

Page 15: An Identity Crisis  in the Life Sciences

Aggregation of repeated runs

[Figure: provenance graphs of Run1 and Run2 placed side by side.
Identifiers: urn:lsid:tav:ic531, urn:lsid:tav:ic537, urn:lsid:tav:ic538, urn:lsid:tav:57b6, urn:lsid:tav:57b13, urn:lsid:tav:57b14
Types (rdf:type): BLASTReport, DNASeq
Properties: derivedFrom, refersTo
Shared reference: AC005089]

Page 16: An Identity Crisis  in the Life Sciences
Page 17: An Identity Crisis  in the Life Sciences

External Duplicates

gi:15145617

ac073846

urn:lsid:myg:ac073846

mmu:11423

Different providers

A replica

Different tool providers

Sequence

Page 18: An Identity Crisis  in the Life Sciences

LSID Assignment Process

[Architecture diagram. Components:
• Workflow enactor
• Provenance service
• Data service
• External domain service
• Taverna LSID Authority
• Data storage group: MySQL relational stores, KAVE, BAKLAVA, customized DBs, Jena/Sesame RDF store
Label: wfEvents]

Equivalent data in repeated runs
Duplicate ids for these data

Page 19: An Identity Crisis  in the Life Sciences

Provenance from two repeated runs

[Figure: two provenance graphs that share no nodes.
Run1: urn:lsid:tav:brpt1, urn:lsid:tav:seqcollection1, urn:lsid:tav:seq1
Run2: urn:lsid:tav:brpt2, urn:lsid:tav:seqcollection2, urn:lsid:tav:seq2
Properties: my:derivedFrom, my:hasElement]

No convergence
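The "no convergence" point can be reproduced with plain sets of triples. The sketch below is only illustrative: the edge directions (collection derived from report, collection has element sequence) are an assumption, since the slide shows the property names but the transcript does not preserve which nodes they connect.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Assumed edges; only the identifiers and property names come from the slide.
run1: Set[Triple] = {
    ("urn:lsid:tav:seqcollection1", "my:derivedFrom", "urn:lsid:tav:brpt1"),
    ("urn:lsid:tav:seqcollection1", "my:hasElement", "urn:lsid:tav:seq1"),
}
run2: Set[Triple] = {
    ("urn:lsid:tav:seqcollection2", "my:derivedFrom", "urn:lsid:tav:brpt2"),
    ("urn:lsid:tav:seqcollection2", "my:hasElement", "urn:lsid:tav:seq2"),
}


def nodes(triples: Set[Triple]) -> Set[str]:
    """All subjects and objects mentioned in a set of triples."""
    return {s for s, _, _ in triples} | {o for _, _, o in triples}


merged = run1 | run2                 # naive aggregation: set union
shared = nodes(run1) & nodes(run2)   # nodes that would join the two graphs
print(len(merged), shared)
```

Because every run mints fresh LSIDs, the union is just two disconnected graphs: nothing is shared, even though both runs handled the same sequence.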

Page 20: An Identity Crisis  in the Life Sciences

Execution duplicates

[Figure: the workflow BLAST → BlastReport → BLAST_simplifer → a list of Seq → GenBank_retrieve, executed twice.
BLAST report: urn:lsid:tav:brpt1
BLAST report: urn:lsid:tav:brpt2
Sequence1: urn:gb:seq1
Sequence1: urn:gb:seq1]

But hidden!!

Page 21: An Identity Crisis  in the Life Sciences

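Execution duplicates

[Figure: the workflow BLAST → BlastReport → BLAST_simplifer → a list of Seq → GenBank_retrieve, executed twice on SEQ1.
First execution: listOf urn:tav:seqc1 containing Sequence1 (urn:tav:seq1), Sequence2, Sequence3.
Second execution: listOf urn:tav:seqc2 containing Sequence1 (urn:tav:seq2), Sequence2, Sequence3.
Both copies of Sequence1 come from urn:gb:seq1.]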

Page 22: An Identity Crisis  in the Life Sciences

Managing identity co-reference

• Identity co-reference:
  – Identifying duplicate identities that refer to the same object, while keeping their context
• An approach:
  – An IDSet entity
    • Identity equivalence for collected data
    • Identity correspondence for computed data
  – An identity service
  – Identity normalisation and cleansing activity

Page 23: An Identity Crisis  in the Life Sciences

IDSet entity

• IDSet(BLASTrpt) = {{urn:tav:brpt1}, {urn:tav:brpt3}}

urn:gb:seq1

Sequence

Query by its identity

Query by its content

IDSet1

merge

IDSet created by another organization

IDSet3

urn:lsid:tav:brpt1

BLASTreport
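The merge step on this slide can be sketched as follows, assuming an IDSet is simply a set of identifiers that co-refer and that two IDSets are merged when they share at least one identifier. `merge_idsets` and the identifier urn:lsid:example:seqA (standing in for an IDSet created by another organization) are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set

IDSet = FrozenSet[str]


def merge_idsets(idsets: Iterable[IDSet]) -> List[IDSet]:
    """Merge IDSets that share at least one identifier."""
    merged: List[Set[str]] = []  # invariant: groups are pairwise disjoint
    for idset in idsets:
        current = set(idset)
        remaining: List[Set[str]] = []
        for group in merged:
            if group & current:
                current |= group      # overlapping group is absorbed
            else:
                remaining.append(group)
        remaining.append(current)
        merged = remaining
    return [frozenset(group) for group in merged]


ours = frozenset({"urn:gb:seq1", "urn:lsid:tav:seq1"})
theirs = frozenset({"urn:gb:seq1", "urn:lsid:example:seqA"})  # other organization
unrelated = frozenset({"urn:lsid:tav:brpt1"})

result = merge_idsets([ours, unrelated, theirs])
print(result)
```

Here `ours` and `theirs` collapse into one IDSet because both contain urn:gb:seq1, while the BLAST report identifier stays in its own set.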

Page 24: An Identity Crisis  in the Life Sciences

Extended Architecture

[Architecture diagram, extending the LSID assignment architecture with an identity service. Components:
• Workflow enactor
• Provenance service
• Data service
• External domain service
• Taverna LSID Authority
• Identity service
• Data storage group: MySQL relational stores, BAKLAVA, customized DBs
• KAVE: Jena/Sesame RDF store, MySQL relational store
• KAVE+: Identity store
Label: wfEvents]

Page 25: An Identity Crisis  in the Life Sciences

Identifying collected product

[Diagram: KAVE+ containing the Identity service and the Identity store, handling the collected item urn:gb:seq1.]

1. Receive an identity (urn:gb:seq1)
2. Look for or create its IDSet
3. Store the id and the IDSet
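A minimal in-memory sketch of these three steps, assuming the identity store is a dictionary from identifier to IDSet and that a collected item arrives with both its source identifier (urn:gb:seq1) and the identifier minted for it in the current run. The class name, method names and two-argument signature are illustrative, not the KAVE+ interface.

```python
from typing import Dict, Set


class IdentityService:
    """Toy identity store: every known identifier points at a shared IDSet."""

    def __init__(self) -> None:
        self._idset_of: Dict[str, Set[str]] = {}

    def register(self, source_id: str, local_id: str) -> Set[str]:
        # Step 1: receive the identities (the arguments).
        # Step 2: look for the IDSet of the source identifier, or create it.
        idset = self._idset_of.setdefault(source_id, {source_id})
        # Step 3: store the new identifier together with that IDSet.
        idset.add(local_id)
        self._idset_of[local_id] = idset
        return idset

    def idset_for(self, identifier: str) -> Set[str]:
        return self._idset_of.get(identifier, {identifier})


service = IdentityService()
service.register("urn:gb:seq1", "urn:tav:seq1")          # first run
idset = service.register("urn:gb:seq1", "urn:tav:seq2")  # repeated run
print(sorted(idset))
```

After the repeated run, both run-specific identifiers resolve to the same IDSet, so a query by either identity finds the other.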

Page 26: An Identity Crisis  in the Life Sciences

Identifying a collection product

[Diagram: KAVE+ containing the Identity service and the Identity store, handling the collection SEQ2 (listOf Seq1, Seq2, Seq3).
Labels: Receive an identity; Look for equivalent collection; Look for or create its IDSet; Store the id and the IDSet.
Incoming identity: urn:lsid:seqc2
Resulting IDSet: urn:lsid:seqc1, urn:lsid:seqc2]
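One way to "look for an equivalent collection" is to compare collections by the identities of their elements after those have been resolved through the identity store. The sketch below assumes exactly that, and also assumes order matters because the collection is a list; the per-run element identifiers (urn:tav:r1:seq1 and so on) and the `representative` mapping are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical result of resolving collected items: each per-run
# identifier mapped to one representative of its IDSet.
representative: Dict[str, str] = {
    "urn:tav:r1:seq1": "urn:gb:seq1", "urn:tav:r2:seq1": "urn:gb:seq1",
    "urn:tav:r1:seq2": "urn:gb:seq2", "urn:tav:r2:seq2": "urn:gb:seq2",
    "urn:tav:r1:seq3": "urn:gb:seq3", "urn:tav:r2:seq3": "urn:gb:seq3",
}

# Key: resolved element identities, in order. Value: IDSet of collections.
collection_idsets: Dict[Tuple[str, ...], List[str]] = {}


def register_collection(collection_id: str, elements: Sequence[str]) -> List[str]:
    """Find the IDSet of collections with the same resolved elements, or create it."""
    key = tuple(representative.get(e, e) for e in elements)
    idset = collection_idsets.setdefault(key, [])
    if collection_id not in idset:
        idset.append(collection_id)
    return idset


register_collection(
    "urn:lsid:seqc1", ["urn:tav:r1:seq1", "urn:tav:r1:seq2", "urn:tav:r1:seq3"])
corresponding = register_collection(
    "urn:lsid:seqc2", ["urn:tav:r2:seq1", "urn:tav:r2:seq2", "urn:tav:r2:seq3"])
print(corresponding)
```

Since collections are computed rather than collected, the two identifiers end up as corresponding data in the deck's terminology, not as replicas of one external record.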

Page 27: An Identity Crisis  in the Life Sciences

Putting the Identity Service to Use

Provenance Integration

Provenance Aggregation

Identity Management

Provenance Normalization

Run1: b1, c1, s1

Run2: b2, c2, s2
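Provenance normalization can be pictured as rewriting every identifier to a representative of its IDSet before aggregating. In this sketch the node names b1, c1, s1, b2, c2, s2 come from the slide; the edges between them and the `representative` mapping (which would come from the identity service) are assumptions.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def normalise(triples: Set[Triple], representative: Dict[str, str]) -> Set[Triple]:
    """Replace each subject and object by the representative of its IDSet."""
    return {
        (representative.get(s, s), p, representative.get(o, o))
        for s, p, o in triples
    }


# Assumed edges for the two runs.
run1: Set[Triple] = {("c1", "my:derivedFrom", "b1"), ("c1", "my:hasElement", "s1")}
run2: Set[Triple] = {("c2", "my:derivedFrom", "b2"), ("c2", "my:hasElement", "s2")}

# Assumed output of identity management: run 2 ids map onto run 1 ids.
representative = {"b2": "b1", "c2": "c1", "s2": "s1"}

aggregated = normalise(run1 | run2, representative)
print(len(run1 | run2), len(aggregated))
```

Before normalization the aggregated graph has four triples in two disconnected parts; afterwards the runs converge onto two triples.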

Page 28: An Identity Crisis  in the Life Sciences

Discussion

• Scalability issues:
  – Normalizing provenance graphs
  – Building IDSet for collections with multiple hierarchies
• Open world data type-free context
• Use experimental context more effectively: workflows are not independently executed.
• Granularity of identity
• Identity aware operations in workflow
• Multiple naming schemes
• Migration duplicates
• Compacting data results

Page 29: An Identity Crisis  in the Life Sciences

Conclusion

• Combining provenance depends on finding points of commonality, such as data identity.
• Duplicate identities will occur in an open world
• Hard to achieve uniqueness without community commitment
• Different types of equivalent objects
• How much can be avoided?
• And how much has to be repaired?