100,000 genomes project

The 100,000 Genomes Project David Montaner Bioinformatics Department [email protected] Valencia University, October 6 th 2016

Upload: david-montaner

Post on 15-Apr-2017


Page 1: 100,000 Genomes Project

The 100,000 Genomes Project
David Montaner
Bioinformatics Department
[email protected]

Valencia University, October 6th 2016

Page 2: 100,000 Genomes Project

Talk Outline

1. Introduction & Background
2. Pipelines
3. Systems and Databases
4. Cancer
5. Rare Diseases

Page 3: 100,000 Genomes Project


The 100,000 Genomes Project

Genomics England & Partners

Page 4: 100,000 Genomes Project

Genomics England

Who are we & what are we doing?

• Owned by the Department of Health, UK
• Set up to deliver the 100,000 Genomes Project: 100,000 whole genome sequences of National Health Service (NHS) patients with:
  • Rare Diseases (and family members)
  • Cancer

Aims:
• Create an ethical and transparent programme based on consent
• Establish the infrastructure, human capacity & capability to set up a genomic medicine service for the NHS and bring benefit to patients
• Enable new scientific discovery and medical insights, and add to the already extensive databases on human variation
• Work with the National Health Service (NHS), academics and industry to make the UK a world leader in Genomic Medicine

Generate health & wealth

Page 5: 100,000 Genomes Project

• Sequence 100,000 genomes
• Cancer and rare genetic disease
• Capture data delivered electronically, store it securely and analyse it
  • within an English data centre (reading library)
• Combine genomes with extracted clinical information for analysis, interpretation, and aggregation
• Create capacity, capability and legacy in personalised medicine for the UK

Goals of Genomics England

1. To bring benefit to NHS patients

2. To enable new scientific discovery and medical insights

3. To create an ethical and transparent programme based on consent

4. To kickstart the development of a UK genomics industry

Page 6: 100,000 Genomes Project

Inception of the 100,000 genomes project (2012, 2014)

“If we get this right, we could transform how we diagnose and treat our most complex diseases not only here but across the world” (December 2012)

“I am determined to do all I can to support the health and scientific sector to unlock the power of DNA, turning an important scientific breakthrough into something that will help deliver better tests, better drugs and above all better care for patients.” (August 2014)

Page 7: 100,000 Genomes Project

Schedule

2012-2014: consortium creation
2014-2015: pilot studies
2016-2015: main project

Page 8: 100,000 Genomes Project

Where are we?

London

Page 9: 100,000 Genomes Project

Where are we?

London: Management, all data storage
Cambridge: Software for genomic data storage
Oxford: Software for clinical data storage and collection

Page 10: 100,000 Genomes Project

Recruitment and clinical interface
13 “GMCs”, Scotland and Northern Ireland

• Genomic Medicine Centres
  • Networks of NHS hospitals including genomics labs
  • 13 “Lead organisations” plus 71 “Local Delivery Partners”
  • Contracted by NHS England
  • Cover recruitment, data and return of results
• Scotland
  • Doing own sequencing
• Northern Ireland
  • Similar to a GMC
  • Contracted by NI payer

Page 11: 100,000 Genomes Project

The Journey of a Genome

Consent & Sample collection → DNA extraction → Bio-repository → Sequencing → Variant Calling → Interpretation → Feedback to clinician → Validation → Treatment

Page 12: 100,000 Genomes Project

The Journey of a Genome: Partners

Consent & Sample collection → DNA extraction → Bio-repository → Sequencing → Variant Calling → Interpretation → Feedback to clinician → Validation → Treatment

Partners:
• Genomic Medicine Centres (GMCs): 13x NHS organisations
• Sequencing: HiSeq X Ten
• Genomics England Clinical Interpretation Partnerships (GeCIPs): collaborations of clinicians & academics, > 2,000 researchers
• Clinical interpretation companies: Omicia, Congenica, Nextcode

Page 13: 100,000 Genomes Project

GENE Consortium

• Working together on a year-long Industry Trial involving a selection of whole genome sequences across cancer and rare diseases

• Aims to identify most effective and secure way to accelerate development of new diagnostics and treatments for patients 

• Working in a pre-competitive environment

AbbVie
Alexion Pharmaceuticals
AstraZeneca
Berg Health
Biogen
Dimension Therapeutics
GSK
Helomics
NGM Biopharmaceuticals
Roche
Takeda

Genomics Expert Network for Enterprises

Page 14: 100,000 Genomes Project

Simplified Workflow

[Workflow diagram. Boxes: BAM file from Illumina; QC1; QC2; Variant Calling pipelines: VCF file; Variant Annotation; Tiering of variants; Dispatch; Clinical Interpretation; QC Portal; Reporting portal; Medical review; Validation; Genomic Medicine Centre (GMC)]

Page 15: 100,000 Genomes Project

Bioinformatics Team Role

Consent & Sample collection → DNA extraction → Biorepository → Sequencing → Variant Calling → Interpretation → Feedback to clinician → Validation → Treatment

Page 16: 100,000 Genomes Project

Genomics Education

Health Education England
• MSc in Genomic Medicine
  • 10 Universities across the UK
• Online training courses and resources
  • The fundamentals of genomics
  • Sample handling and DNA extraction
  • Bioinformatics
  • How to support patients through the consent process

Genomics England Communications Team

Page 17: 100,000 Genomes Project

Update on numbers: at about 10%

• >10,000 genomes received
• >1PB of primary data
• >1.3M files received or generated and indexed
• 200M germline variants databased
• 48M somatic variants databased
• 70,000 HPO terms asserted
• >450,000 hospital episodes

Page 18: 100,000 Genomes Project

100,000 Genomes

Huge Amount of Data

• Rare Disease
  • Each Genome: 100GB
  • Trio is preferred so 300GB per participant
  • x 50,000 participants = 15,000,000GB total
• Cancer
  • Germline: 100GB
  • Tumour: 200GB
  • 300GB per patient
  • x 25,000 participants = 7,500,000GB total
• 10,000,000GB = 10 Petabytes
• Expecting around 30 Petabytes

10 Billion Photos = 1.5 Petabytes

Data Processed in 1 day = 20 Petabytes
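The storage totals on this slide are plain products of data per participant and number of participants. A minimal sketch of that arithmetic, assuming decimal units (1 Petabyte = 1,000,000 gigabytes); the function name is illustrative and not part of any Genomics England tool:

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 10**6 GB


def total_petabytes(gb_per_participant, participants):
    """Total storage in petabytes for a cohort."""
    return gb_per_participant * participants / GB_PER_PB


# Rare disease: a trio of three 100 GB genomes per participant.
rare_disease_pb = total_petabytes(3 * 100, 50_000)   # 15.0
# Cancer: 100 GB germline plus 200 GB tumour per patient.
cancer_pb = total_petabytes(100 + 200, 25_000)       # 7.5

print(rare_disease_pb, cancer_pb, rare_disease_pb + cancer_pb)
```

Under these per-participant figures the two programmes come to 22.5 Petabytes of primary data, which sits below the roughly 30 Petabytes the slide expects overall.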

Page 19: 100,000 Genomes Project


Pipelines

Page 20: 100,000 Genomes Project

bertha_default 1.1.0

Workflow diagram

Stages: Data intake → Single Sample QC & Processing → Multi-sample QC → Analysis → Interpretation Request Dispatched (Interpretation API)

Tasks shown in the diagram:
• Delivery API, Integrity Check, MD5 Check, Validate BAM Picard, Fix Permissions
• Intake QC, Intake QC Check Point
• Filtered Bamstats, Unfiltered Bamstats, Q30 Bamstats, VCF QC
• Plot Filtered Bamstats, Generate Filtered Metrics Bamstats, Plot Unfiltered Bamstats, Generate Q30 Metrics Bamstats, QC Stats Post-processing
• Somatic VCF re-headering, Tumour Cross Sample Contamination, Cross Species Contamination, Depth of Coverage, Concordance check
• Cross Sample Contamination, Single-Sample QC Check Point
• Merge Array Genotypes, Identity by Descent, Mendelian Inconsistency Rate, Sex Check, Multi-Sample QC Check Point
• Consent Check Point
• Variant Calling, Variant Normalisation
• Tumour Ploidy, Tumour Purity, Tumour Clonality, Mutation Signature, Viral Insertions, Actionable Mutation Coverage, SNV & Indel Refinement, Mutation Burden, Inbreeding Coefficient, Homozygosity Runs
• Variant Annotation, Variant Tiering, Interpretation Dispatch, Exomiser
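The intake stage of this workflow includes an MD5 check on delivered files. As a rough illustration of what such a check involves (not the Bertha implementation; the function names and the idea of comparing against a checksum supplied with the delivery are assumptions), a delivered file can be hashed in chunks and compared with the expected digest:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for very large BAM files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def passes_md5_check(path, expected_md5):
    """True if the file's digest matches the checksum delivered with it."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file that fails this comparison would be stopped at the intake check point before any QC or variant calling is attempted.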

Page 21: 100,000 Genomes Project

Bertha: Distributed Workflow Management System

Interpretation Dispatch

Message Broker

Tracking DB

Job Scheduler

Dashboard

Delivery API

Auditor

Orchestrator

Grid Consumer

Oxford Bus

Page 22: 100,000 Genomes Project

6 node Hadoop cluster:
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Millisecond response times for regional queries
• Whole genome filtering queries for all individuals within seconds

OpenCGA: storage

Extensive capabilities to query across genotype and phenotype relationships

https://github.com/opencb/opencga

Page 23: 100,000 Genomes Project

To be fully GA4GH compatible from v1.0

global data standards for Genomics - http://ga4gh.org/

Page 24: 100,000 Genomes Project

Clinical data

+ 150 tables (+2000 variables)

• Administrative & Consent
• Clinical / medical reviews
• Imaging, blood & non genetic tests
• Disease status and phenotype
• Family & pedigree
• Treatments and clinical history

Security and logs: CMCs access here

Catalog

Bioinformatics

Oxford

Page 25: 100,000 Genomes Project

OpenCGA - Catalog

Metadata store and A&A for OpenCGA
• Manages roles, groups, ACLs
• Audit log
• LDAP integration
• Arbitrary schemas (annotation sets)

Page 26: 100,000 Genomes Project

Cellbase: annotation

Reference Genomic data warehouse

• Compared in testing against VEP
  • More than 99.999% similarity in consequence types
• Phased annotation implemented for MNVs
• Initial structural variation annotation
• Can annotate 4-5 families per hour (>8000 variants/s) on a single database instance
• Will have (very soon) an R package similar to biomaRt

Page 27: 100,000 Genomes Project

PanelApp

https://panelapp.extge.co.uk/crowdsourcing/PanelApp

Page 28: 100,000 Genomes Project

Panel list

https://panelapp.extge.co.uk/crowdsourcing/PanelApp/

Page 29: 100,000 Genomes Project

Platform for interpretation

Page 30: 100,000 Genomes Project

● Filter and classify variants
● Well-defined rules, stable across the project
● General, it works for any family configuration
● Implemented using VCF/Cellbase or OpenCGA
● Based on GA4GH variant model
● Uses pedigrees as defined at Genomics England (based on phenotips format)
● Uses PanelApp as source of gene panels

Variant Tiering

Page 31: 100,000 Genomes Project

Variant Tiering

Filters applied to each variant:
• The variant allele is not commonly found in the general healthy population (set criteria for allele frequency filter)
• Familial segregation
• Allelic state matches known mode of inheritance for the gene and disorder (moi required)

Impact of the variant?
• Known Pathogenic (not implemented)
• Expected pathogenic (set criteria: transcript_ablation, splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant)
  • Is the variant in a gene in the Virtual Gene Panel (green list) for that disorder? Yes: Tier 1. No: Tier 3.
• Other coding impact (set criteria: inframe_insertion, inframe_deletion, missense_variant, transcript_amplification, splice_region_variant, incomplete_terminal_codon_variant)
  • Is the variant in a gene in the Virtual Gene Panel (green list) for that disorder? Yes: Tier 2. No: Tier 3.
• Other: does not fit any of the other criteria
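The decision tree on this slide can be sketched as a small function. This is only an illustration under stated assumptions: the variant is taken to have already passed the allele frequency, segregation and mode of inheritance filters; the function name and arguments are hypothetical; and the branch-to-tier mapping (Tier 1 or Tier 2 inside the green list, Tier 3 outside it, no tier otherwise) is a reading of the flow chart rather than Genomics England code.

```python
EXPECTED_PATHOGENIC = {
    "transcript_ablation", "splice_donor_variant", "splice_acceptor_variant",
    "stop_gained", "frameshift_variant", "stop_lost", "initiator_codon_variant",
}
OTHER_CODING = {
    "inframe_insertion", "inframe_deletion", "missense_variant",
    "transcript_amplification", "splice_region_variant",
    "incomplete_terminal_codon_variant",
}


def tier_variant(consequences, gene, green_genes):
    """Assign a tier to a variant that already passed the upstream filters.

    consequences: consequence terms annotated for the variant
    gene:         gene symbol the variant falls in
    green_genes:  green-list genes of the virtual gene panel for the disorder
    """
    terms = set(consequences)
    in_panel = gene in green_genes
    if terms & EXPECTED_PATHOGENIC:
        return "Tier 1" if in_panel else "Tier 3"
    if terms & OTHER_CODING:
        return "Tier 2" if in_panel else "Tier 3"
    return None  # fits none of the impact criteria: left untiered here


panel = {"GENE_A"}  # placeholder gene symbols, not a real panel
print(tier_variant(["stop_gained"], "GENE_A", panel))
print(tier_variant(["missense_variant"], "GENE_B", panel))
```

Keeping the consequence sets as plain data mirrors the "well-defined rules, stable across the project" goal: the rules can be reviewed without reading the control flow.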

Page 32: 100,000 Genomes Project


The Cancer Programme

Page 33: 100,000 Genomes Project

Cancer

Which cancers?
• Lung
• Breast
• Colon
• Prostate
• Ovary
• Hematological malignancies (CLL)
• Pediatric Cancers

Why sequence?
• Disease of disordered genomes
• >200 driver genes known
• Stratified Management/targeted therapy
• Complications: Heterogeneity

Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)

Page 34: 100,000 Genomes Project

Sequencing cancer genomes

Tumour genome → Tumour variants
Germline genome → Germline variants
Somatic variation = Tumour variants − Germline variants
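The diagram on this slide treats somatic variation as what is found in the tumour genome but not in the matched germline genome. A conceptual sketch of that subtraction, assuming each variant is represented as a (chromosome, position, ref, alt) tuple; the function name is illustrative, and production somatic callers model tumour and normal reads jointly rather than subtracting call sets:

```python
def somatic_variants(tumour_variants, germline_variants):
    """Variants called in the tumour that are absent from the germline sample."""
    return set(tumour_variants) - set(germline_variants)


# Toy call sets; coordinates are made up for illustration.
germline = {("1", 1000, "A", "G"), ("2", 5000, "C", "T")}
tumour = {("1", 1000, "A", "G"), ("2", 5000, "C", "T"), ("7", 42, "G", "A")}

print(somatic_variants(tumour, germline))
```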

Page 35: 100,000 Genomes Project

Coverage

High Depth

ATGCGTTCGATGAGTGATGAAACCCATGATGGATGCCGATGAGATGATG

Germline Samples: 35x Coverage
• Rare Disease Participants
• Cancer “Normal”

Cancer Samples: 75x Coverage
• Cancer “Tumour” Samples

Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)

Page 36: 100,000 Genomes Project

Coverage

Why Higher Depth for Cancer?
• Normal Contamination
• Clonality/Heterogeneity
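A back-of-the-envelope calculation shows why normal contamination and clonality call for deeper sequencing. The sketch below assumes a simplified model that is not from the slides: a heterozygous somatic SNV in a diploid region, so the expected variant allele fraction is purity × clonal fraction / 2. The function name and the example purity and clonal fraction values are hypothetical.

```python
def expected_alt_reads(depth, purity, clonal_fraction=1.0):
    """Expected number of reads carrying a heterozygous somatic SNV.

    depth:           mean sequencing depth at the site
    purity:          fraction of cells in the sample that are tumour cells
    clonal_fraction: fraction of tumour cells carrying the variant
    """
    return depth * purity * clonal_fraction / 2


# 50% tumour purity, variant present in half of the tumour cells.
print(expected_alt_reads(35, 0.5, 0.5))  # 4.375 supporting reads at 35x
print(expected_alt_reads(75, 0.5, 0.5))  # 9.375 supporting reads at 75x
```

With only a handful of supporting reads at 35x, a subclonal variant in an impure sample is hard to tell apart from sequencing error; roughly doubling the depth doubles the expected evidence.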

Page 37: 100,000 Genomes Project

Cancer Pilot

Fresh Frozen vs Formalin-fixed, paraffin-embedded (FFPE) tumour samples

• Resections/Biopsies are routinely fixed in formalin and embedded in paraffin
  • Causes DNA damage
  • Difficult to extract DNA
• Fresh frozen logistically difficult & not trusted to maintain morphology

Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)

Page 38: 100,000 Genomes Project

Cancer Pilot

Problems with FFPE

• Difficulty in obtaining long fragments
  • “Random” DNA damage
  • “Cross-links” DNA, which can be reversed, but currently at high temperatures
• Chimeric fragments in library preparation

[Diagram: under heat, A-T rich repetitive regions re-anneal causing chimeric reads; GC rich regions are more robust]

Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)

FFPE = Formalin-fixed, paraffin-embedded tumour samples

Page 39: 100,000 Genomes Project

Read Alignment

Page 40: 100,000 Genomes Project

CG Content

Page 41: 100,000 Genomes Project

FF Copy Number Data

Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)

Page 42: 100,000 Genomes Project

FFPE Copy Number Data

Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)

Page 43: 100,000 Genomes Project

Fraction of overlapping SNVs in FF and FFPE samples from 5 trios

Page 44: 100,000 Genomes Project

Improving FFPE Sequencing

What can we do?

Stages: Procedure → Fixation → DNA Extraction → Library Preparation

• Cold Ischaemic Time
• Storage Conditions
• Time of Fixation
• Size of Sample
• pH of Fixative
• Temperature of De-crosslinking
• Addition of Salt

Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)

FFPE = Formalin-fixed, paraffin-embedded tumour samples

Page 45: 100,000 Genomes Project

Cancer reports

• Quality metrics pre- and post-sequencing
• A small number of clinically actionable mutations
• Germline results which affect cancer development
• Remainder of results are mostly of research interest for now, but in future may assist:
  • Drug development
  • Targeted treatment selection
  • Prediction of prognosis
  • Monitoring of disease progression

Page 46: 100,000 Genomes Project


Rare Disease Programme

Page 47: 100,000 Genomes Project


Page 48: 100,000 Genomes Project

The case for whole genomes

• Severe intellectual disability occurs in 0.5% of newborns
• Whole-genome sequencing at 80x in 50 parent-offspring trios with no diagnosis for their severe intellectual disability
• Overall 62% increase in diagnostic yield with WGS
• Most diagnoses were for de novo dominant mutations, roughly equally divided between SNVs and CNVs

Gilissen et al (2014), Nature PMID: 24896178

Page 49: 100,000 Genomes Project

Why make a genetic diagnosis?

For a patient with rare disease
• Understand why their condition happened
• More accurate knowledge of how it might develop in future
• Possible treatment avenues
• Early intervention may help avoid disability
• Contact with others with the same condition

For the family
• Predict whether family members will get the condition
• Offer screening/treatment to prevent it
• Reproductive decisions

For medical research
• Further our understanding of disease mechanisms
• Novel drug development or drug repurposing

Page 50: 100,000 Genomes Project

Rare disease programme

• Over 200 disorders so far

Data model: describes the clinical information to be collected for each disorder

Disorders nominated by the NHS and academia

Eligibility & Exclusion criteria for recruitment: rare, Mendelian, unmet clinical diagnostic need, prior genetic testing

Virtual Gene panel to aid analysis

Challenges

• Equity of diseases for inclusion

• Tightness of criteria for patient inclusion

• Equity of WGS consumption per phenotype

Page 51: 100,000 Genomes Project

The biggest challenge?

Interpretation
• ~5-10 million variants in our genome
• ~3.5 million “known” SNPs
• ~0.5 million “novel” SNPs
• ~0.5 million small indels
• ~1000 large (>500bp) CNVs
• ~20,000-25,000 coding variants
• ~9,000-11,000 non-synonymous
• 92 rare missense variants (MAF <0.1%)
• 5 rare truncating variants (MAF <0.1%)
• 0-2 de novo variants

Page 52: 100,000 Genomes Project

What information is needed?


To aid interpretation of variants

• Allele frequency: How common is the variant in the ‘healthy’ population?

• Familial segregation: Is the variant present in the family members with the disorder, and not in those without it?

• Mode of inheritance: Does the pattern fit with the inheritance within the family and what is known about the gene?

• Likely consequence: Does the variant cause a change in the protein sequence likely to affect function?

• Gene panel: Is the variant in a gene associated with causing the disorder?

• Known pathogenicity? Has the variant been seen before in people with the same disease?
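Two of the questions above, allele frequency and familial segregation, translate directly into simple checks. The sketch below is a hedged illustration: the function names are hypothetical, the 0.1% frequency cut-off is borrowed from the MAF threshold quoted elsewhere in the talk and is an assumption here, and family members are identified by made-up labels.

```python
def is_rare(allele_frequency, max_af=0.001):
    """True if the variant is rarer than the cut-off in the reference population."""
    return allele_frequency < max_af


def segregates_with_disease(carriers, affected, unaffected):
    """True if every affected relative carries the variant and no unaffected one does."""
    carriers, affected, unaffected = set(carriers), set(affected), set(unaffected)
    return affected <= carriers and not (carriers & unaffected)


# Toy family: proband and an affected sibling carry the variant, parents do not.
print(is_rare(0.0002))
print(segregates_with_disease(
    carriers={"proband", "sibling"},
    affected={"proband", "sibling"},
    unaffected={"mother", "father"},
))
```

The strict segregation rule shown here ignores incomplete penetrance; a real pipeline would also take the mode of inheritance into account.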

Page 53: 100,000 Genomes Project

Rare Diseases

Genetic data checks and analyses

Gender
• X chromosome homozygosity, Y chromosome genotyping rate
• Copy number for X and Y chromosomes

Relatedness
• Mendelian error checking for parent-child pairs
• IBD sharing estimation for all participants

Inbreeding/excess homozygosity
• Observed vs expected homozygosity

Ancestry
• Multidimensional scaling

Dr Katherine Smith, Lead Analyst for Rare Disorders (Bioinformatics)
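Mendelian error checking for a parent-child pair rests on one rule: at every site the child must share at least one allele with the parent. A minimal sketch, assuming genotypes are given as pairs of allele indices (0 = reference, 1 = alternate) with None for missing calls; the function names are illustrative and not taken from the project's pipeline:

```python
def is_mendelian_error(parent_gt, child_gt):
    """True if parent and child share no allele at this site."""
    return not (set(parent_gt) & set(child_gt))


def mendelian_error_rate(parent_gts, child_gts):
    """Fraction of jointly called sites that are inconsistent for the pair."""
    pairs = [
        (p, c) for p, c in zip(parent_gts, child_gts)
        if p is not None and c is not None  # skip sites with a missing call
    ]
    if not pairs:
        return 0.0
    errors = sum(is_mendelian_error(p, c) for p, c in pairs)
    return errors / len(pairs)


parent = [(0, 0), (0, 1), (1, 1), (0, 0)]
child = [(0, 1), (1, 1), (0, 0), (0, 0)]
print(mendelian_error_rate(parent, child))  # 0.25: only the third site is inconsistent
```

A true parent-child pair should show an error rate close to the genotyping error rate, so a high value points to a sample swap or a mis-recorded pedigree.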

Page 54: 100,000 Genomes Project

Rare Disease Pilot

4800 people

Primary Data
• 4,128 participants data cleansed (15,065 including family members)
• 149 different conditions
• 56,004 HPO terms used
  • 12,966 terms present
  • 43,088 terms absent

Secondary Data
• Hospital Episodes
  • 250,000 records
  • 11,910 - Accident Dept
  • 37,479 - Inpatient
  • 199,418 - Outpatients

Page 55: 100,000 Genomes Project

Rare disease pilot – 4,919 samples


Page 56: 100,000 Genomes Project

Relatedness checking


Page 57: 100,000 Genomes Project

Georgia

Georgia and her family. Image courtesy of Great Ormond Street Hospital.

• Undiagnosed condition that included physical and mental developmental delay, a rare eye condition affecting sight, impaired kidney function, verbal dyspraxia.

• Through enrolling in the project, a mutation in a single gene was found in Georgia’s genome which is likely to be the cause of her condition.

• Provides a molecular diagnosis for her condition for the first time.

Maria Bitner-Glindzicz – Great Ormond Street Hospital

http://www.genomicsengland.co.uk/first-children-recieve-diagnoses-through-100000-genomes-project/

Page 58: 100,000 Genomes Project

Jessica

Jessica and her family. Image courtesy of Great Ormond Street Hospital.

Mum, Kate Palmer:

“Now that we have this diagnosis there are things that we can do differently almost straight away. Her condition is one that has a high chance of improvement on a special diet, which means that her medication dose is likely to decrease and her epilepsy may be more easily controlled. Hopefully she might have better balance so she can be more stable and walk more…”

“…More than anything the outcome of the project has taken the uncertainty out of life for us and the worry of not knowing what was wrong. It has allowed us to feel like we can take control of things and make positive changes for Jessica. It may also open doors to other research projects that we can go on. These could be more specific to her condition and we are hopeful that they could one day find a cure.”

http://www.genomicsengland.co.uk/first-children-recieve-diagnoses-through-100000-genomes-project/

Page 59: 100,000 Genomes Project


Thank you!