Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Open data, compound repurposing, and rare diseases Andrew Su, Ph.D. @andrewsu [email protected] http://sulab.org January 30, 2017 Slides: slideshare.net/andrewsu

Upload: andrew-su

Post on 12-Apr-2017

Category:

Science


TRANSCRIPT

Page 1: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Open data, compound repurposing, and rare diseases

Andrew Su, Ph.D. @andrewsu [email protected] http://sulab.org

January 30, 2017

Slides: slideshare.net/andrewsu

Page 2: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

[Venn diagram of three fields (Programmer/Comp sci, Statistician/Mathematician, Biologist) with overlap regions labeled Data scientist, Bioinformatician, and Biostatistician]

Adapted from http://blog.fejes.ca/?p=2418

…teach STEM students the importance of connecting computational, mathematical, and natural sciences.

Page 3: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Credit: http://www.slideshare.net/PhRMA/rare-disease-infographics

Page 4: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Credit: http://www.slideshare.net/PhRMA/rare-disease-infographics

Page 5: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Rare disease case study #1

Photo: Retta Beery

Page 6: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Bainbridge et al., STM, 2011

Page 7: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Photo: Retta Beery

Page 8: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Rare disease case study #2

Page 9: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

… but no obvious treatments

Page 10: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Bainbridge et al., STM, 2011

SPR

Page 11: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

What differentiates SPR and NGLY1?

SPR

Page 12: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Sarah Olmstead, https://flic.kr/p/364dZW

NGLY1

Page 13: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

NGLY1 (11 PubMed articles)
Congenital disorders of glycosylation (822)
PNGase (686)
ERAD (1,330)
glycosylation (48,862)
alacrima (164)
Genetic interactors (3,016)
symptoms (109,928)

25 million articles in PubMed

Page 14: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

The biomedical literature is massive…

[Chart: number of new PubMed-indexed articles per year, 1985 to 2015; y-axis from 0 to 1,400,000]

Page 15: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

… but it is very hard to query and compute

Page 16: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

… but it is very hard to query and compute

Imatinib, Crizotinib, Erlotinib, Gefitinib, Sorafenib, Lapatinib, Dasatinib

AND

Acute myeloid leukemia, Acute lymphoblastic leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia, Hodgkin lymphoma, Non-Hodgkin lymphoma, Myeloma, …

Synonyms: Gleevec, Glivec, STI-571, STI 571, STI571, ST1571, ST 1571, CGP-57148, CGP 57148, CGP57148, CGP57148B, …
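
One reason such a search is hard to write by hand is that every drug name has to be expanded into its synonyms before the drug list is combined with the disease list. A minimal sketch of that expansion follows. The `or_group` and `build_query` helpers and the `SYNONYMS` table are hypothetical names introduced for illustration; only the Gleevec/STI-571 spellings come from the slide, and the output is a plain Boolean string rather than the syntax of any particular search engine.

```python
def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    quoted = ['"{}"'.format(t) for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(drugs, diseases, synonyms=None):
    """Build '(drug OR synonym ...) AND (disease OR ...)' as a plain string."""
    synonyms = synonyms or {}
    drug_terms = []
    for drug in drugs:
        drug_terms.append(drug)
        # Append every known alternative spelling of this drug.
        drug_terms.extend(synonyms.get(drug, []))
    return or_group(drug_terms) + " AND " + or_group(diseases)


# Spellings taken from the slide; the table structure itself is an assumption.
SYNONYMS = {"Imatinib": ["Gleevec", "Glivec", "STI-571", "STI571"]}

query = build_query(
    ["Imatinib", "Dasatinib"],
    ["Chronic myelogenous leukemia", "Myeloma"],
    SYNONYMS,
)
print(query)
```

Even this two-drug, two-disease example produces eight quoted terms, which suggests how quickly a complete query over all kinase inhibitors and all hematologic cancers would grow.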

Page 17: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

… but it is very hard to query and compute

Human genes referred to as “PAP”

EntrezGene ID   HGNC symbol   Description
10884           MRPS30        mitochondrial ribosomal protein S30
10914           PAPOLA        poly(A) polymerase alpha
11333           PDAP1         PDGFA associated protein 1
11334           TUSC2         tumor suppressor candidate 2
130120          REG3G         regenerating islet-derived 3 gamma
5068            REG3A         regenerating islet-derived 3 alpha
50807           ASAP1         ArfGAP with SH3 domain, ankyrin repeat and PH domain 1
55              ACPP          acid phosphatase, prostate
8853            ASAP2         ArfGAP with SH3 domain, ankyrin repeat and PH domain 2
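
The ambiguity in this table can be made concrete with a small lookup. The sketch below assumes a hypothetical alias index (`build_alias_index`, `resolve`) keyed on the shared name; the identifier/symbol pairs are copied from the table, and everything else is illustrative rather than a description of any real gene-annotation service.

```python
from collections import defaultdict

# (Entrez Gene ID, HGNC symbol) pairs from the table; all are called "PAP".
PAP_GENES = [
    (10884, "MRPS30"), (10914, "PAPOLA"), (11333, "PDAP1"),
    (11334, "TUSC2"), (130120, "REG3G"), (5068, "REG3A"),
    (50807, "ASAP1"), (55, "ACPP"), (8853, "ASAP2"),
]


def build_alias_index(alias, genes):
    """Map one upper-cased alias to every gene that is known by it."""
    index = defaultdict(list)
    for entrez_id, symbol in genes:
        index[alias.upper()].append((entrez_id, symbol))
    return index


def resolve(index, name):
    """Return all candidate genes for a name; more than one means ambiguous."""
    return sorted(index.get(name.upper(), []))


index = build_alias_index("PAP", PAP_GENES)
candidates = resolve(index, "pap")
print(len(candidates), "candidate genes for 'PAP'")
```

A stable identifier such as the Entrez Gene ID picks out exactly one row, which is why the name alone is a poor key for querying.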

Page 18: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Biomedical research relies on effective KNOWLEDGE MANAGEMENT

Photo: Pietro Bellini, https://flic.kr/p/k5jmja

Page 19: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Information extraction from biomedical text

1. Identify biomedical concepts in text

… We report a case of familial systemic mastocytosis with the rare KIT K509I germ line mutation. In vitro treatment with imatinib, dasatinib and PKC412 reduced cell viability of primary mast cells harboring KIT K509I mutation. Both patients with familial systemic mastocytosis had remarkable hematological and skin improvement after three months of imatinib treatment.

Leuk Res. 2014 Oct;38(10):1245-51. doi: 10.1016/j.leukres.

GENES

DISEASES

DRUGS

VARIANTS

Page 20: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Information extraction from biomedical text

1. Identify biomedical concepts in text
2. Identify relationships between concepts

Concepts: imatinib, dasatinib, PKC412, Familial systemic mastocytosis, KIT, K509I

Relationship labels: Mutation of, Mutation causes, causes, treats, inhibits
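
Step 1 can be illustrated with a simple dictionary lookup over the example abstract. This is only a sketch under stated assumptions: the `CONCEPTS` dictionary and the `tag_concepts` helper are hypothetical, exact string matching stands in for whatever recognition method is actually used, and step 2 (relationships) is not attempted here.

```python
import re

# Tiny hand-made dictionary; a real system would need large vocabularies.
CONCEPTS = {
    "imatinib": "DRUG",
    "dasatinib": "DRUG",
    "PKC412": "DRUG",
    "familial systemic mastocytosis": "DISEASE",
    "KIT": "GENE",
    "K509I": "VARIANT",
}


def tag_concepts(text):
    """Return sorted (start, end, matched_text, type) tuples for dictionary hits."""
    hits = []
    for term, kind in CONCEPTS.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), kind))
    return sorted(hits)


text = ("We report a case of familial systemic mastocytosis with the rare "
        "KIT K509I germ line mutation. In vitro treatment with imatinib, "
        "dasatinib and PKC412 reduced cell viability.")

mentions = tag_concepts(text)
types_found = {kind for _, _, _, kind in mentions}
for start, end, matched, kind in mentions:
    print(kind, matched, start, end)
```

Plain matching like this breaks down on synonyms and ambiguous names (the word "kit", for instance, would be tagged as a gene), which is part of why concept recognition is a hard problem.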

Page 21: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.

Page 22: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

http://www.navy.mil/management/photodb/photos/101104-N-6383T-508.jpg

Page 23: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

The Gene Wiki project, circa 2008

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Page 24: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Page 25: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Gene-disease annotation databases

Lissencephaly

Query: Reelin (RELN)

Page 26: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Gene-disease annotation databases

Lissencephaly; Familial Temporal Lobe Epilepsy

Query: Reelin (RELN)

Page 27: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Gene-disease annotation databases

Lissencephaly; Familial Temporal Lobe Epilepsy; Otosclerosis; Schizophrenia

Query: Reelin (RELN)

Page 28: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Gene-disease annotation databases

27 “diseases”

Lissencephaly; Familial Temporal Lobe Epilepsy; Otosclerosis; Schizophrenia; Bipolar Disorder; Autistic Disorder; Alzheimer Disease; Schizophrenic Psychology; Breast Neoplasms; …

Child Development Disorders, Pervasive; Cognition; Cognition Disorders; Dominance, Cerebral; Executive Function; Field Dependence-Independence; Functional Laterality; Choice Behavior; Precursor T-Cell Lymphoblastic Leukemia-Lymphoma

Psychotic Disorders; Attention; Attention Deficit Disorder with Hyperactivity; Memory; Memory, Short-Term; Mental Disorders; Task Performance and Analysis; Tobacco Use Disorder; Weight Gain; Schizophrenia, Paranoid

Query: Reelin (RELN)

Page 29: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

is to data

is to text

Provide a database of the world’s [biomedical] knowledge that anyone can edit

- Denny Vrandečić

Page 30: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Q13561329 (http://www.wikidata.org/wiki/Q13561329)

Subclass of (Property:P279): Protein (Q8054)
Regulates (Property:P128): Neural development (Q1345738)
Physically interacts with (Property:P129): VLDL receptor (Q1979313); Amyloid beta A4 (Q423510)
Decreased expression in (Property:P1910): Schizophrenia (Q41112); Bipolar disorder (Q131755)

Page 31: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Q13561329 (https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q13561329&format=json)

Property:P279: Q8054
Property:P128: Q1345738
Property:P129: Q1979313; Q423510
Property:P1910: Q41112; Q131755
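
The API address on this slide returns the item as JSON, so the same statements can be read by a program. The sketch below builds that URL and pulls target item IDs out of a response-shaped dictionary. `entity_url` and `claim_targets` are hypothetical helper names, and `sample` is hand-written in the general shape of a `wbgetentities` response (it is not fetched data and omits most fields), so nothing here requires network access.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def entity_url(qid):
    """Build the wbgetentities URL shown on the slide for one item ID."""
    params = {"action": "wbgetentities", "ids": qid, "format": "json"}
    return API + "?" + urlencode(params)


def claim_targets(response, qid, prop):
    """List the item IDs that `qid` points to through property `prop`."""
    claims = response["entities"][qid].get("claims", {})
    targets = []
    for statement in claims.get(prop, []):
        value = statement["mainsnak"].get("datavalue", {}).get("value")
        # Item-valued statements carry the target ID under "id".
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])
    return targets


# Hand-written sample in the shape of a response (an assumption, not fetched).
sample = {"entities": {"Q13561329": {"claims": {
    "P279": [{"mainsnak": {"datavalue": {"value": {"id": "Q8054"}}}}],
    "P1910": [
        {"mainsnak": {"datavalue": {"value": {"id": "Q41112"}}}},
        {"mainsnak": {"datavalue": {"value": {"id": "Q131755"}}}},
    ],
}}}}

url = entity_url("Q13561329")
print(url)
print(claim_targets(sample, "Q13561329", "P1910"))
```

Working with identifiers rather than labels is what makes the data computable: Q41112 means the same thing in every language edition.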

Page 32: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Page 33: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Seeding Wikidata with biomedical data

• All human, mouse genes and proteins
• All Gene Ontology terms
• All FDA approved drugs
• 9,000+ human diseases
• 120 reference microbial genomes

Mitraka et al (2015) Semantic Web Applications for the Life Sciences
Burgstaller-Muehlbacher et al (2016) Database
Putman et al (2016) Database

Page 34: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Centralizing key data storage

287 language editions of Wikipedia

Bioinformatics community

Toxicology community

Epidemiology community… …

Page 35: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

“Show all tyrosine kinase inhibitors that are used to treat hematologic cancers.”

Page 36: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

“Show all human membrane proteins associated with colorectal cancer.”

Page 37: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

“Show all monoclonal antibodies used to treat melanoma.”
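
Questions like these become answerable once drugs, drug classes, and diseases are linked items. As a rough sketch, the helper below assembles a SPARQL query string for "all members of a drug class used to treat a disease". It assumes that Wikidata models class membership with P31 (instance of) and P279 (subclass of) and treatment with P2175 (medical condition treated); the `treatments_query` name is hypothetical, and the item IDs passed in the example are placeholders, not the real IDs for monoclonal antibodies or melanoma.

```python
SPARQL_TEMPLATE = """
SELECT ?drug ?drugLabel WHERE {{
  ?drug wdt:P31/wdt:P279* wd:{drug_class} .
  ?drug wdt:P2175 wd:{disease} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def treatments_query(drug_class_qid, disease_qid):
    """Fill the template after checking both arguments look like item IDs."""
    for qid in (drug_class_qid, disease_qid):
        if not (qid.startswith("Q") and qid[1:].isdigit()):
            raise ValueError("not a Wikidata item ID: " + qid)
    return SPARQL_TEMPLATE.format(
        drug_class=drug_class_qid, disease=disease_qid
    ).strip()


# Placeholder IDs: replace with real item IDs looked up on wikidata.org.
query = treatments_query("Q111", "Q222")
print(query)
```

The resulting string could be submitted to a SPARQL endpoint; whether it returns the expected drugs depends on how the underlying statements were actually entered.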

Page 38: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Crowdsourcing via Citizen Science

Biomedical Linked Open Data

Page 39: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Source: https://wilsoncommonslab.org/2014/03/06/calling-all-supporters

Page 40: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?

Page 41: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Experts versus crowd for concept identification

593 PubMed abstracts

6,900 mentions of “disease concepts”

F = 0.87

F = 0.78

$$$

Page 42: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Experts versus crowd for concept identification

593 PubMed abstracts

6,900 mentions of “disease concepts”

F = 0.87

F = 0.87

$$$

• 9 days
• 145 workers
• Total: $630.96
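
The F values on these slides summarize how closely one set of annotations matches another. The slides do not say how F was computed, so the sketch below assumes the common balanced F-measure (harmonic mean of precision and recall) with exact span matching; the `f_measure` helper and the spans are made up purely to exercise the arithmetic.

```python
def f_measure(gold, predicted):
    """Balanced F-measure between two collections of annotation spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)  # share of predictions that are right
    recall = true_pos / len(gold)          # share of gold spans that were found
    return 2 * precision * recall / (precision + recall)


# Made-up (abstract_id, start, end) spans, not data from the study.
expert = {(1, 0, 8), (1, 20, 31), (2, 5, 14), (2, 40, 52)}
crowd = {(1, 0, 8), (1, 20, 31), (2, 5, 14), (3, 7, 12)}

score = f_measure(expert, crowd)
print(round(score, 2))
```

With three of four spans shared in each direction, precision and recall are both 0.75 and so is F.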

Page 43: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

http://mark2cure.org

Page 44: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

              Paid crowdsourcing ($$$)   Citizen Science (“Help science, please”)
F             0.87                       0.84
Duration      9 days                     28 days
Workers       145                        212
Total cost    $630.96                    $0

Page 45: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

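Does Citizen Science scale?

\[
\text{volunteers needed}
= \frac{\text{number of annotation events per year}}{\text{number of annotation events per year per volunteer}}
= \frac{1{,}000{,}000\ \text{articles} \times 10\ \text{AE/article}}{\dfrac{10{,}275\ \text{AE} \times 365\ \text{days}}{212\ \text{annotators} \times 28\ \text{days}}}
\approx 15{,}828
\]

AE = Annotation events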
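
The estimate can be reproduced in a few lines. The numbers below are the ones on the slide; the `volunteers_needed` function name and the choice to round up to a whole person are assumptions made for this sketch.

```python
import math


def volunteers_needed(articles_per_year, ae_per_article,
                      observed_ae, observed_annotators, observed_days):
    """Annotation events needed per year divided by the yearly rate per volunteer."""
    ae_per_year = articles_per_year * ae_per_article
    # Scale the observed pilot rate to one volunteer over 365 days.
    ae_per_volunteer_year = observed_ae * 365 / (observed_annotators * observed_days)
    return math.ceil(ae_per_year / ae_per_volunteer_year)


needed = volunteers_needed(1_000_000, 10, 10_275, 212, 28)
print(needed)  # 15828
```

The pilot rate works out to roughly 632 annotation events per volunteer per year, so ten million events a year need just under sixteen thousand volunteers.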

Page 46: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Does Citizen Science scale?

15,828 volunteers

needed

200,000 volunteers

460,000 volunteers

37,000 volunteers

1,000,000+ volunteers

Page 47: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Mapping the biomedical network around NGLY1

NGLY1

Page 48: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

http://mark2cure.org

Page 49: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

A preliminary view of the NGLY1-focused biological network

1,200 contributors
3,200 documents
787,400 annotations

Page 50: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Finding new indications for existing drugs or therapies

Raynaud’s Syndrome

Fish oil

Abnormal platelet activity

Abnormal blood viscosity

High blood viscosity

Elevated RBC rigidity

Vasodilation

Low blood triglycerides

Increased prostacyclins

Page 51: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Finding new indications for existing drugs or therapies

Page 52: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Finding new indications for existing drugs or therapies

Raynaud’s Syndrome

Fish oil

Abnormal platelet activity

Abnormal blood viscosity

High blood viscosity

Elevated RBC rigidity

Vasodilation

Low blood triglycerides

Increased prostacyclins

[Diagram labels: A, C, and several intermediate B nodes]
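
The A, B, and C labels suggest the pattern often called the ABC model of literature-based discovery: two concepts that are never linked directly may still be connected through shared intermediate terms. The sketch below finds those shared intermediates in a small undirected graph. The node names come from the slide, but which pairs are linked is an illustrative assumption, and `abc_candidates` is a hypothetical helper rather than the method used in the talk.

```python
from collections import defaultdict


def abc_candidates(edges, a, c):
    """Return the B terms linked to both A and C when A and C share no direct edge."""
    neighbors = defaultdict(set)
    for left, right in edges:
        neighbors[left].add(right)
        neighbors[right].add(left)
    if c in neighbors[a]:
        return []  # already directly connected, nothing new to infer
    return sorted(neighbors[a] & neighbors[c])


# Illustrative co-occurrence edges; which pairs are linked is an assumption.
EDGES = [
    ("Fish oil", "High blood viscosity"),
    ("Fish oil", "Abnormal platelet activity"),
    ("Fish oil", "Low blood triglycerides"),
    ("Raynaud's Syndrome", "High blood viscosity"),
    ("Raynaud's Syndrome", "Abnormal platelet activity"),
    ("Raynaud's Syndrome", "Vasodilation"),
]

shared = abc_candidates(EDGES, "Fish oil", "Raynaud's Syndrome")
print(shared)
```

The same traversal applies to any knowledge network: start from a disease or gene, collect its intermediate concepts, and look for drugs that touch the same intermediates.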

Page 53: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

A preliminary view of the NGLY1-focused biological network

[Network diagram with nodes labeled A, B, and C]

Page 54: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Biomedical research relies on effective KNOWLEDGE MANAGEMENT

Photo: Pietro Bellini, https://flic.kr/p/k5jmja

Page 55: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Paul Pavlidis,

UBC

Lynn Schriml,

U Maryland

Matt and Cristina Might,

Crowd volunteers and partners

(Salomon) (Lotz)

(Yang, Maximov) (Topol)

Page 56: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University

Louis Gioia

Julee Adesara

Toby Li

Karthik G

Erick Scott

Adam Mark

Kevin Xin

Jake Bruggemann

Mike Mayers

Andra Waagmeester

Max Nanis

Cyrus Afrasiabi

Ian MacLeod

Julia Turner

Ginger Tsueng

Sebastien Lelong

Erik Clarke

Jennifer Fouquier

Ben Good

Chunlei Wu

Shirley Willis

Tobias Meissner

Katie Fisch

Sandip Chatterjee

Ramya Gamini

Greg Stupp

Sebastian Burgstaller

Tim Putman

Nuria Queralt Rosinach

Sal Loguercio

Page 57: Open data, compound repurposing, and rare diseases -- Point Loma Nazarene University
