a deep learning based approach for genetic risk prediction · a deep learning based approach for...

38
A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research Translational Institute [email protected], @RaquelDiasSRTI Ali Torkamani, PhD. [email protected], @ATorkamani

Upload: others

Post on 19-Oct-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

A deep learning based approach for genetic risk prediction

Raquel Dias, PhD.

Senior Staff Scientist

Scripps Research Translational Institute

[email protected], @RaquelDiasSRTI

Ali Torkamani, PhD.

[email protected], @ATorkamani

Page 2: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Whole Genome Sequencing vs. Genotype array

Full Data

(whole genome sequencing)Sparse Data

(genotype array)

0 0 ? ? ? 0 0 1 1 ? ? 0 ? ? ? ?

0 0 ? ? ? 0 0 1 1 ? ? 0 ? ? ? ?

0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?

0 0 ? ? ? 0 0 1 1 ? ? 1 ? ? ? ?

0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?

0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?

0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1

0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1

0 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1

0 0 1 0 0 0 0 1 1 0 0 1 0 1 1 0

0 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1

0 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1

Page 3: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Whole Genome Sequencing vs. Genotype array

Full Data

~80M genetic variantsSparse Data

~4 million genetic

0 0 ? ? ? 0 0 1 1 ? ? 0 ? ? ? ?

0 0 ? ? ? 0 0 1 1 ? ? 0 ? ? ? ?

0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?

0 0 ? ? ? 0 0 1 1 ? ? 1 ? ? ? ?

0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?

0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?

0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1

0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1

0 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1

0 0 1 0 0 0 0 1 1 0 0 1 0 1 1 0

0 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1

0 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1

Page 4: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Genetic imputation problem

...

...

...

...

... 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 ...

... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...

... 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 ...

... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...

Referencehaplotypes

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

HapMap or1,000 Genomes(whole genome)

Cases andControls typedGenotype array

Prediction

...

...

...

...

Studygenotypes

Page 5: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

A typical imputation approach

MultiethnicHaplotypeReference

Consortium(HRC)

Studygenotypes

Referencepanel

Linkage disequilibrium(LD r2) structure

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

0 0 ? 0 0 1 1 ? 0 ? ? ? 0 ? 1 ? 1

Mapping

Prediction...

...

...

...

... 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 ...

... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...

... 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 ...

... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...

...

...

...

...

Page 6: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

A typical imputation approach

Muli-ethinicHaplotypeReference

Consortium(HRC)

Studygenotypes

Referencepanel

0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1

0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1

0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1

0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1

0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1

0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1

0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1

Mapping

Prediction...

...

...

...

... 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 ...

... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...

... 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 ...

... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...

...

...

...

...

Linkage disequilibrium(LD r2) structure

Page 7: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Polygenic Risk Score (PRS)

Page 8: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Polygenic Risk Calculation

Design100,000+ subjects

ResultsMillions of known variants

Polygenic Risk ScoreCumulative sum

w/ Trait* w/o Trait

Σ

*Trait can often be heterogeneous e.g. coronary artery = heart attack, stroke, bypass surgery, etc.

Page 9: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Objectives

1. More accurate and faster imputation

2. Find important genetic variants

3. Better polygenic risk score calculation

Page 10: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Our proposed approach

v

1

1

0

0

2

1

1

2

0

2𝑤 𝑤′

Input

layer

Encoding

Hidden

layer

Output

layer

Decoding

Page 11: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Denoising autoencoder for image restoration

Noise

Mask

Bigdeli, Siavash Arjomand, and Matthias Zwicker. "Image restoration using autoencoding priors." arXiv preprint arXiv:1703.09964 (2017).

Wang, Ruxin, and Dacheng Tao. "Non-local auto-encoder with collaborative stabilization for image restoration." IEEE Transactions on Image Processing 25.5 (2016): 2117-2129.

Page 12: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Genotype imputation case study example

Ground truth

(whole genome sequencing)Masked input

(genotype array)

0 0 ? ? ? 0 0 1 1 ? ? 0 ? ? ? ?

0 0 ? ? ? 0 0 1 1 ? ? 0 ? ? ? ?

0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?

0 0 ? ? ? 0 0 1 1 ? ? 1 ? ? ? ?

0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?

0 1 ? ? ? 1 0 1 1 ? ? 0 ? ? ? ?

0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1

0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1

0 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1

0 0 1 0 0 0 0 1 1 0 0 1 0 1 1 0

0 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1

0 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1

Mask

Page 13: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Case study: 9p21.3 region of the genome

• Length: 59846 bp

• 846 genetic variants in reference panel (whole genome data)• Approx. 200 common variants

• Approx. 600 rare variants

• Only 17-47 variants in genotype array!!!

• Strong association to coronary artery disease (CAD)

• Genotyped and sequenced in many studies

Page 14: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Training on the reference panel:Data augmentation strategy

...

...

...

... 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 ...

... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...

... 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 ...

Reference

Whole Genome

...

...

...

...

...

...... 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 ...

... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...

... 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 ...

Masked

input

...

...

...

...

...

...

... 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 ...

... 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 ...

... 0 1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 ...

Reconstructed

Output

...

...

...

Mask

Autoencoder

Page 15: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Customized Sparsity Loss Function

Sparsity loss with Kullback-Leibler (KL) / cross entropy element:

Customized loss adjusted for hidden activation sparsity:

Mean Squared Error:

𝐷𝐾𝐿(𝜌| ො𝜌 = 𝜌 ∗ log𝜌

ො𝜌+ 1 − 𝜌 ∗ log

1 − 𝜌

1 − ො𝜌

𝑙𝑜𝑠𝑠 = 𝑀𝑆𝐸 + 𝛽 ∗

𝑖=1

𝑛

𝐷𝐾𝐿(𝑖)

𝑀𝑆𝐸 =1

𝑛

𝑖=1

𝑛

(𝑦𝑖 − ො𝑦𝑖)

Page 16: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Hyper parameters to be optimized

• b

• r

• Activation functions

• L1/L2 regularizers

• Learning rate

• Batch size

Page 17: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Parallel Grid SearchHyperparameter optimization approach

10000 X training

Grid samples

100 X grid

search samplesHyperparameter

combinations 9 GPUs available:

- 7 GTX 1080,

- 1 Titan V,

- 1 Titan Xp

1

860 hours

(sequential run,

100 epochs)

2

100

4

3

2

1

10000

9999

10000 X training

Grid samplesTrained model

performance

4

3

2

1

10000

9999

Accuracy, loss

Sparsity, MSE

Page 18: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Grid Search Results: training accuracy

Page 19: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Grid Search results: assessing best hyperparameter values

Page 20: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

b

r

b

r

Learning rate

Effect of hyper parameter values in training accuracy

Pe

arson

corre

lation

(r2)

Page 21: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

b

r

b

r

Learning rate

Effect of hyper parameter values in training accuracy

Pe

arson

corre

lation

(r2)

Page 22: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

b

r

b

r

Learning rate

Effect of hyper parameter values in training accuracy

Pe

arson

corre

lation

(r2)

Page 23: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

b

r

b

r

Learning rate

Effect of hyper parameter values in training accuracy

Pe

arson

corre

lation

(r2)

Page 24: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Optimizing batch size: training accuracy

Learning steps

Acc

ura

cy

Loss

Learning steps

10 batches 50 batches 100 batches 1000 batches

Page 25: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Optimizing batch size: training run time

Run time (hours)

Acc

ura

cy

10 batches 50 batches 100 batches 1000 batches

Page 26: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Testing on multiple case studies

• Atherosclerosis Risk in Communities (ARIC) • More than 3000 samples

• Whole genome sequencing (846 variants, 0% mask, ground truth)

• Affymetrix 6.0 genotype array (17 variants, 98% mask, input data)

• Framingham Heart Study (FHS)• More than 500 samples

• Whole genome sequencing (846 variants, 0% mask, ground truth)

• Illumina 500K genotype array (47 variants, 95% mask, input data)

• Illumina 5M (93 variants, 89% mask, input data)

Page 27: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Accuracy in additional case studies: Proposed approach versus common statistic methodology

Performance: rare variantsPerformance: all variants

Page 28: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Accuracy in additional case studies: Proposed approach versus common statistic methodology

Performance: common variantsPerformance: all variants

Page 29: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Run time: Proposed approach versus common statistic methodology

Page 30: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Linkage disequilibrium structure: ARIC

Pre

dic

tio

nG

rou

nd

tru

th

All variants Rare variants Common variants

Linkage

dise

qu

ilibriu

m(LD

) r2

Page 31: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Linkage disequilibrium structure: FHS

Pre

dic

tio

nG

rou

nd

tru

th

All variants Rare variants Common variants

Linkage

dise

qu

ilibriu

m(LD

) r2

Page 32: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Interpretability: identifying representative genetic variants

Maximal information criteria

Page 33: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Conclusions

• Grid search was able to find high accuracy models (>0.90)

• Hyperparameters played an important role in training performance

• Reconstruction of genetic variants from very sparse data with high accuracy (>0.80)

• Superior computational performance, faster predictions

• Fine parameter tuning may be necessary

Page 34: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Future steps• Expand to other genomic regions, fine parameter tuning

• Use imputation autoencoder results as input for polygenic risk score calculation

𝑤 𝑤′

Input

layerHidden

layer

Imputed

genotypes

𝑤 𝑤′

Input

layerHidden

layer

Predicted

Genetic

Risk

Imputation Autoencoder Feed Forward Neural Network

Page 35: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Future steps

• Focal loss to compensate for rare variants

Lin, Tsung-Yi, et al. "Focal loss for dense object detection." Proceedings of the IEEE international conference on computer vision. 2017.

Page 36: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Limitations: expanding the methodology to other genomic regions

Page 37: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Acknowledgements

Johnny IsraeliCarla LeibowitzFernanda FoertterBrian WelkerDavid Nola

Ali Torkamani PhDShang-Fu ChenElias Salfati PhDDoug EvansShuchen LiuAlex LippmanNathan W. PhDEmily Spencer PhDEric Topol MD

Contact:[email protected], @[email protected], @Atorkamani

NVIDIA collaborators

NIH/NCRR flagship CTSA Grant

Funding

Page 38: A deep learning based approach for genetic risk prediction · A deep learning based approach for genetic risk prediction Raquel Dias, PhD. Senior Staff Scientist Scripps Research

Thanks for your attention!!