CIBERER Exome Server (CES) The server of the Spanish Population Variability
Joaquín Dopazo, PhD Department of Computational Genomics, CIPF, Valencia
Hospital Universitario La Paz, Madrid
April 28, 2014
Why is it interesting to have a Spanish exome variant repository?
Rationale: Local variability is more important than previously thought. The existence of numerous local rare variants, many of them (apparently) deleterious, hampers the prioritization of disease variants.
Data recycling: CIBERER has accumulated a large number of samples that can be used as (pseudo)controls of the normal population.
Pipeline of data analysis
• Stage labels in the diagram: Primary processing, Primary analysis, Secondary analysis
• Processing steps: Initial QC (FASTQ file), Mapping (BAM file), Variant calling (VCF file)
• Successive filtering: Variant annotation, Filtering by effect, Filtering by MAF, Filtering by family segregation
• Gene prioritization: Knowledge-based prioritization, Proximity to other known disease genes, Functional proximity, Network proximity, Burden tests, Other prioritization methods
Use known variants and their population frequencies to filter out polymorphisms (sources shown: 1000 genomes, EVS, local variants).
• Typically dbSNP, 1000 genomes and the 6515 exomes from the ESP are used as sources of population frequencies.
• We selected 75 local controls to add an extra filtering step to the analysis pipeline.
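The extra filtering step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the 1% threshold, the variant names and all frequency values are assumptions made up for the example.

```python
def passes_frequency_filter(mafs, max_maf=0.01):
    """Keep a variant only if no source reports a MAF above max_maf.

    mafs maps a source name to a MAF, or to None when the variant
    is absent from that source.
    """
    return all(maf is None or maf <= max_maf for maf in mafs.values())


def filter_candidates(candidates, sources, max_maf=0.01):
    """Return the names of the variants that survive the chosen sources."""
    kept = []
    for name, mafs in candidates.items():
        selected = {source: mafs.get(source) for source in sources}
        if passes_frequency_filter(selected, max_maf):
            kept.append(name)
    return kept


# Hypothetical candidate variants (illustrative values only).
candidates = {
    "variant_1": {"1000g": None, "ESP": None, "local": None},
    "variant_2": {"1000g": 0.001, "ESP": None, "local": 0.08},
    "variant_3": {"1000g": 0.15, "ESP": 0.12, "local": 0.20},
}

without_local = filter_candidates(candidates, ["1000g", "ESP"])
with_local = filter_candidates(candidates, ["1000g", "ESP", "local"])
print(without_local)  # ['variant_1', 'variant_2']
print(with_local)     # ['variant_1']
```

In this toy case variant_2 looks rare in the general references but is common among the local controls, so only the local step removes it.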
Novembre et al., 2008. Genes mirror geography within Europe. Nature
Comparison of Spanish controls to 1000g
How important do you think local information is to detect disease genes?
Filtering with or without local variants
Number of genes as a function of the number of individuals in the study of a dominant disease (autosomal dominant retinitis pigmentosa)
The use of local variants makes an enormous difference
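One common way to obtain such a curve for a dominant disease, shown here only as an assumed illustration and not necessarily the method behind the plot, is to keep the genes that carry a surviving variant in every affected individual added so far. The gene names below are placeholders.

```python
def candidate_gene_counts(genes_per_individual):
    """Number of genes shared by the first 1, 2, ..., n individuals."""
    shared = None
    counts = []
    for genes in genes_per_individual:
        # Intersect with the genes still shared by earlier individuals.
        shared = set(genes) if shared is None else shared & set(genes)
        counts.append(len(shared))
    return counts


# Hypothetical genes with surviving variants in three affected individuals.
affected = [
    {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    {"GENE_A", "GENE_B", "GENE_D"},
    {"GENE_A", "GENE_D", "GENE_E"},
]

print(candidate_gene_counts(affected))  # [4, 3, 2]
```

Stricter filtering shrinks each individual's gene set, so the shared list drops faster as individuals are added.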
What do we know about the variability of the Spanish population?
Using CIBERER families to create a first version of the database of local variability of the Spanish population
• In each family we select two unrelated members (preferably the parents)
• If the parents are not available, one of the children (unaffected, if possible) is selected
• A total of 75, out of the 136 samples available among the families analyzed in the BiER, were initially selected.
• Variant files (VCF) were obtained following the same pipeline (with missing values included) and merged.
• Genotype proportions and MAFs were obtained for all the variable positions. ONLY this information is used in the web server.
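The aggregation step for one variable position can be sketched as below. This is a hedged example: treating "./." as a missing call that is excluded from the denominators is an assumption, and the genotype list is invented for illustration.

```python
from collections import Counter

GENOTYPES = ("0/0", "0/1", "1/1")


def summarize_position(calls):
    """Aggregate per-sample genotype calls into counts, proportions and MAF.

    Calls outside GENOTYPES (for example "./.") are treated as missing.
    Returns None when no sample has a called genotype.
    """
    counts = Counter(call for call in calls if call in GENOTYPES)
    called = sum(counts.values())
    if called == 0:
        return None
    alt_alleles = counts["0/1"] + 2 * counts["1/1"]
    alt_freq = alt_alleles / (2 * called)
    return {
        "counts": {g: counts[g] for g in GENOTYPES},
        "proportions": {g: counts[g] / called for g in GENOTYPES},
        "maf": min(alt_freq, 1 - alt_freq),
    }


# Hypothetical calls for ten samples at one position.
calls = ["0/0"] * 5 + ["0/1"] * 2 + ["1/1"] + ["./."] * 2
summary = summarize_position(calls)
print(summary["counts"])       # {'0/0': 5, '0/1': 2, '1/1': 1}
print(summary["proportions"])  # {'0/0': 0.625, '0/1': 0.25, '1/1': 0.125}
print(summary["maf"])          # 0.25
```

Only the returned summary would need to be stored; the individual calls can be discarded after aggregation.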
Samples used
UNIT    n     %
U723   12  16.0
U737   11  14.7
U759    2   2.7
U705   10  13.3
U720   12  16.0
U732    1   1.3
U755    3   4.0
U746    9  12.0
U728    2   2.7
U729    3   4.0
U703    7   9.3
U718    1   1.3
U730    2   2.7
Total  75 100.0
DISEASE                                               n      %
3-Methylglutaconic aciduria                          11   14.7
Atypical fracture                                     4    5.3
Autosomal dominant non-syndromic hearing loss         1    1.3
Autosomal recessive non-syndromic hearing loss        1    1.3
BCKDK-deficiency disease                              2    2.7
CMT                                                   1    1.3
Congenital disorder of glycosylation types I and II   8   10.7
CoQ disease                                           3    4.0
CoQ10 deficiency and DNA depletion                    3    4.0
CoQ10 deficiency                                      2    2.7
Inherited Metabolic Disease                           2    2.7
MMD (Multiple deletion of mitochondrial DNA)          4    5.3
MSUD (Maple Syrup Urine Disease)                      1    1.3
Opitz                                                 8   10.7
Pelizaeus-like                                        2    2.7
RCD (Respiratory complexes deficiency)                8   10.7
Retinitis pigmentosa                                 11   14.7
Usher                                                 3    4.0
Total                                                75  100.0
Charts: Gender (man, woman) and Phenotype (affected, healthy)
Variability spectrum of the Spanish population
A total of 131,897 variant positions unique to the Spanish population were detected across the 75 samples. Approximately 90,000 were singletons. 51,295 variants are non-synonymous changes and 18,450 are synonymous changes (a singleton-driven pattern, opposite to that of the variants shared with 1000g and EVS, which come from polymorphic positions).
The CIBERER Exome Server (CES): the first repository of variability of the Spanish population
http://ciberer.es/bier/exome-server/
Only one other similar initiative exists: the GoNL (http://www.nlgenome.nl/)
Information provided
• Genotypes in the different reference populations
• Genomic coordinates, variation, and gene
• SNP id, if any
• PolyPhen and SIFT pathogenicity indexes
• Phenotype, if available
Variants can also be seen in their genomic context
GenomeMaps viewer (Medina et al., 2013, NAR) embedded in the application.
GenomeMaps is the official genome viewer of the ICGC (http://dcc.icgc.org/)
Occurrence of pathological variants in the “normal” population
• The reference genome is mutated
• Nine carriers in 1000 genomes
• One affected individual and 73 carriers in EVS
Current usage options
• Query
• Configuration of the display
• Genomic context
Spanish variability database. FAQ
What is stored in the database?
ONLY frequencies of the genotypes observed in the positions in which
variants have been found in at least one individual. This information is
obtained from Spanish unrelated individuals.
What information is provided by the database?
Aggregated information on the genotype frequencies of the variable positions
in the gene(s) requested.
Is it possible to know whether a particular individual is stored in the database?
No, unless you sequence the individual and check whether the genotypes
are compatible with the frequencies in the database, but that seems stupid because
you would already have the information you were after.
Let's imagine that I am stupid and managed to find out that the individual is in
the database: can I retrieve her/his genome?
No, it is impossible from the aggregated information.
Spanish variability database. FAQ
Who can contribute?
Anyone (especially if you are sequencing with public resources)
What do you need to submit?
Anonymized files of variants (VCF: Variant Call Format)
Why VCFs?
Because we need to check that your contribution contains no relatives of
the individuals in the database
What’s next?
• Strategic steps:
– Populating the database with contributions from CIBERER and external groups. Future project: SPANEx
– Opening the database
• Technical steps:
– Automatic access to the local variability data via web services
– Use in gene discovery pipelines
– Use for the interpretation of incidental findings in diagnostic panels
DB of Spanish variants (DBSV)
Table of Spanish Frequencies (TSF)
Chr  Position  Ref  Alt  0/0  0/1  1/1
1    1365313   A    T     75    0    0
1    1484884   G    A     70    4    1
2    326252    T    C     25   35   15
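CES input and use (diagram)
• CES input: external VCFs go through two checks, "Unrelated? (DBSV)" and "Spanish? (TSF)", each with a YES/NO outcome, before reaching the counts
• Other labels in the diagram: CES use, Other countries, Internal, Regional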
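One possible reading of the input diagram is sketched below. The two checks are passed in as plain booleans because the source does not describe how relatedness or origin is actually tested, and the new sample's genotypes are invented. The starting counts are the three rows of the frequency table shown above.

```python
GENOTYPES = ("0/0", "0/1", "1/1")


def add_contribution(tsf, genotypes, unrelated, spanish):
    """Return (accepted, table); counts change only if both checks pass.

    tsf maps (chr, position, ref, alt) to genotype counts.
    genotypes maps the same keys to the new sample's genotype.
    """
    if not (unrelated and spanish):
        return False, tsf
    updated = {key: dict(counts) for key, counts in tsf.items()}
    for key, genotype in genotypes.items():
        # A new position starts from zero; a real table would also have
        # to backfill 0/0 for earlier individuals (out of scope here).
        row = updated.setdefault(key, dict.fromkeys(GENOTYPES, 0))
        row[genotype] += 1
    return True, updated


tsf = {
    ("1", 1365313, "A", "T"): {"0/0": 75, "0/1": 0, "1/1": 0},
    ("1", 1484884, "G", "A"): {"0/0": 70, "0/1": 4, "1/1": 1},
    ("2", 326252, "T", "C"): {"0/0": 25, "0/1": 35, "1/1": 15},
}

# Hypothetical genotypes of one contributed sample.
new_sample = {
    ("1", 1365313, "A", "T"): "0/0",
    ("1", 1484884, "G", "A"): "0/1",
    ("2", 326252, "T", "C"): "1/1",
}

accepted, tsf_after = add_contribution(tsf, new_sample, unrelated=True, spanish=True)
rejected, tsf_same = add_contribution(tsf, new_sample, unrelated=False, spanish=True)
print(accepted, tsf_after[("1", 1484884, "G", "A")])
print(rejected, tsf_same is tsf)
```

The function returns a new table instead of modifying the input, so a rejected contribution leaves the stored counts untouched.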
Future of the database of variation in the Spanish population
CIBERER exome server roadmap
• Phase I: CIBERER, 76 samples (unaffected)
• Phase II: CES II, 76+269+X samples (mixed), adding MGP, 269 samples (healthy controls), and more CIBERER samples
• Phase III: CES II, 1000+76+269+X samples (mixed), adding SPANEX: 1000 exomes
• Contribution sources: CIBERER contributions, SPANEx contributions
• Timeline labels: 2014-June, 2014, 2015
Future utilization. Access via web services
Access to aggregated data of variation and genotype frequencies. Therefore, no confidentiality or privacy issues are associated.
Spanish variation database
CellBase (Bleda et al., 2012, NAR): our data server system, now at the EBI
BiERapp: the interactive filtering tool for easy candidate prioritization
http://bierapp.babelomics.org
• Panel (real or virtual) manager: tool for defining panels
• New filter based on local population variant frequencies
• Diagnostic mutations
• If no diagnostic variants appear, then secondary findings can be studied
http://team.babelomics.org
Take home message
• Local variability is critical for distinguishing real pathologic variants from local polymorphisms
• CES will be populated with the SPANEX project (M.A. Moreno talk)
• CES is the starting point of a more ambitious crowdsourcing project that aims at constructing a high-resolution map of the Spanish population variation
• Contributions to CES comply with confidentiality requirements. No patient information is shared, only statistical information.
The Computational Genomics Department at the Centro de Investigación Príncipe Felipe (CIPF),
Valencia, Spain, and… ...the INB, National Institute of Bioinformatics (Functional Genomics Node) and the CIBERER Network of Centers for Rare Diseases, and…
...the Medical Genome Project (Sevilla)
@xdopazo
@bioinfocipf