![Page 1: 104540 VO/2 Bioinformatik SS2020 · 2020. 6. 5. · Data integration methods VI Data Integration. 2 ... • centralized platform to visualize domain architecture, post-translational](https://reader036.vdocuments.site/reader036/viewer/2022081410/60a049bbf5d565330331c293/html5/thumbnails/1.jpg)
Hubert Hackl
Biocenter, Institute of Bioinformatics
Medical University of Innsbruck
Innrain 80, 6020 Innsbruck, Austria
Tel: +43-512-9003-71403
Email: [email protected]
URL: http://icbi.at
104540 VO/2 Bioinformatik SS2020
High throughput methods
Biological databases
Data integration methods
VI Data Integration
Whole genome sequencing
• Few large contigs are better than many small contigs
• N50 = length of the smallest contig in the set of largest contigs that together cover 50% of the assembly
• Maximal and average length of contigs

• Few scaffolds with a high number of contigs are better than many with few
• S50 (defined analogously to N50)
• Maximal and average length of contigs
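The N50 definition above can be made concrete with a few lines of Python. This is a minimal sketch: the function name `n50` and the example lengths are made up for illustration, and the same function would give S50 if passed scaffold lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)   # largest contigs first
    half = sum(ordered) / 2                   # 50% of the total assembly length
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            # Smallest contig in the set of largest contigs covering 50%
            return length
    raise ValueError("no contig lengths given")


contigs = [80, 70, 50, 40, 30, 20, 10]  # made-up contig lengths (total 300)
print(n50(contigs))  # 80 + 70 = 150 reaches half of 300, so this prints 70
```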
Green et al., Nature Genet, 2001
• Assembly in O(n^2), where n is the number of reads
• Sequencing errors (e.g. homopolymers)
• Repeats of different lengths
• Regions with no or limited coverage
• Finishing: gap closure
Greedy algorithm
• Find the shortest common superstring T for reads {s1, s2, …}
• An exact solution is very time consuming (NP-hard) but can be approximated by a greedy algorithm
• Successfully used for small genomes (e.g. bacteria)
• CAP3, SSAKE, VCAKE, SHARCGS
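The greedy idea can be sketched as follows: repeatedly merge the two reads with the largest suffix-prefix overlap until one string remains. This is a toy illustration under strong assumptions (error-free reads, a single strand, exact overlaps). It is not how CAP3 or the other tools listed are implemented, and the function names and reads are hypothetical.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_superstring(reads):
    """Approximate the shortest common superstring by greedy merging."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    if not reads:
        return ""
    while len(reads) > 1:
        # Pick the ordered pair (a, b) with the largest overlap.
        size, a, b = max(
            ((overlap(a, b), a, b) for a, b in permutations(reads, 2)),
            key=lambda item: item[0],
        )
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[size:])  # merge, writing the overlap only once
    return reads[0]


toy_reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_superstring(toy_reads))  # prints ATTAGACCTGCCGGAATAC
```

The pairwise overlap search in every round is what makes this approach quadratic in the number of reads, and the greedy choice can be misled by repeats.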
Overlap-layout-consensus
• Graph in which the nodes represent the reads and an edge connects two nodes if the corresponding reads overlap
• Identify a path through the graph that contains all the nodes (a Hamiltonian path)
• Arachne, Celera Assembler, Newbler, Minimus, Edena
Genome assembly
De Bruijn graphs
• Break up each read into a collection of overlapping k-mers.
• Each k-mer is represented in the graph as an edge connecting two nodes corresponding to its (k-1)-bp prefix and suffix, respectively.
• A path that uses all the edges, and thus all the information obtained from the reads, is a solution to the assembly problem (Eulerian path).
• Repeats
• Euler, Velvet, ALLPATHS, ABySS
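A minimal sketch of the de Bruijn approach, assuming error-free reads and a graph that actually contains an Eulerian path. The Eulerian path is found with Hierholzer's algorithm, a standard choice that the slides do not name. The function names and toy reads are hypothetical, and real assemblers such as Velvet add error correction and repeat handling on top.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer node to the nodes reached by its distinct k-mers."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:          # one edge per distinct k-mer
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in graph if len(graph[n]) - in_deg[n] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: emit node
    return path[::-1]


def spell(path):
    """Reconstruct the sequence spelled by a path of (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


toy_reads = ["ATGCGT", "GCGTAC"]  # made-up overlapping reads
graph = de_bruijn_edges(toy_reads, k=4)
print(spell(eulerian_path(graph)))  # prints ATGCGTAC
```

In this toy case every (k-1)-mer is unique, so the path is unambiguous. With repeats longer than k-1, several Eulerian paths exist and the assembly is no longer unique.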
Genome assembly
Compeau et al., Nature Biotech, 2011
Personal genomes
Sequencing of the genomes of fraternal twins diagnosed with a movement disorder
6,000 million nucleotides (diploid human genome)
1.63 million single-base variants shared by the twins that differ from the reference human genome
9,531 variants that code for proteins
4,605 variants that change the amino-acid sequence
77 rare variants (which are more likely to cause disease)
3 candidate genes
1 gene linked to the disorder

Bainbridge et al., Sci Transl Med, 2011
Maher, Nature, 2011
Exome sequencing
• About 85% of disease-causing mutations are in the exome
• The exome makes up ~1% of the genome
Bioinformatics analysis
• Sequence quality
• Alignment (e.g. BWA-MEM)
• Filter (mapped reads, duplicates, exome, local alignment around DIPs)
• Variant detection (SNP, DIP, homo/heterozygous splitter)
• Annotation and visualization (IGV)

Somatic variant calling (Mutect2)
Copy number variation
Bisulfite sequencing (DNA methylation)
• Lower sequence complexity can cause problems (e.g. primer design)
• Incomplete conversion
• Degradation of DNA during bisulfite treatment
HT sequencing (PCR)
• Bisulfite converts cytosine (C) residues to uracil (U)
• Leaves 5-methylcytosine (5mC) residues unaffected
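The conversion rule above can be simulated in a few lines. This is a simplified sketch that assumes complete conversion, a single strand, and that uracil is read as thymine (T) after PCR amplification. The function names, the sequence and the methylated positions are made up for illustration.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on one strand.

    Unmethylated C becomes U (read as T after PCR); 5mC stays C.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, converted_read):
    """Reference C positions that still read as C are called methylated."""
    return [
        i
        for i, (ref_base, read_base) in enumerate(zip(reference, converted_read))
        if ref_base == "C" and read_base == "C"
    ]


reference = "ACGTCCGA"                       # made-up sequence
read = bisulfite_convert(reference, [1, 5])  # positions 1 and 5 carry 5mC
print(read)                                  # prints ACGTTCGA
print(call_methylation(reference, read))     # prints [1, 5]
```

The converted read contains fewer C residues than the reference, which is the reduced sequence complexity mentioned above. An incompletely converted C would be indistinguishable from a methylated one in this scheme.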
Yeast two-hybrid (Y2H)
Protein-protein interaction (Y2H)
TF-DNA interaction (Y1H)
Synthetic lethal interactions
eQTL
• An expression profile can be mapped to gene expression Quantitative Trait Loci (eQTL) by linkage or association methods.
• QTLs are stretches of DNA containing or linked to the genes that underlie a quantitative trait (phenotype, characteristics)
hot spots
LC-MS/MS
Spectrum
Peptide fragment fingerprinting (PFF)
Quantitative proteomics
ICAT: Isotope-Coded Affinity Tags
SILAC: Stable isotope labeling with amino acids in cell culture
Tissue microarray
FACS
Examples of public databases
• National Center for Biotechnology Information (NCBI)
GenBank
• European Bioinformatics Institute (EBI) and Sanger Center
Ensembl
• The Molecular Biology Database Collection
http://www3.oup.co.uk/nar/database/c/
National Institutes of Health
National Library of Medicine (NLM)
National Center for Biotechnology Information (NCBI)
Databases at the NCBI
• PubMed
• Protein
• Nucleotides
• Structure
• Genome
• Books
• Cancer Chromosomes
• Conserved Domains
• 3D Domains
• Gene
• Genome Project
• dbGaP
• GEO Profiles
• GEO Datasets
• GENSAT
• HomoloGene
• Journals
• MeSH
• NLM Catalog
• OMIA
• OMIM
• PMC
• PopSet
• Probe
• Protein Clusters
• SNP
• Taxonomy
• UniGene
• UniSTS
Linking within Entrez
GenBank
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
• Full release of GenBank every 2 months.
• Incremental and cumulative releases: daily.
• GenBank is only available from the Internet.
GenBank
GenBank Flat file
RefSeq
• Best, comprehensive, non-redundant set of sequences
• For genomic DNA, transcript (RNA), and protein
• For major research organisms (2645 organisms)
• Based on GenBank derived sequences
• Ongoing curation by NCBI staff and collaborators, with review status indicated on each record
• Identifiers:
  NT_  Genomic contig
  NM_  mRNA
  NP_  protein
  NR_  Non-coding RNA
  XM_  mRNA (automatic annotation)
  XP_  protein (automatic annotation)
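The prefix scheme lends itself to a simple lookup. The sketch below covers only the six prefixes listed above (RefSeq defines more), and the accession strings in the example are made up for illustration rather than taken from real records.

```python
# Prefix -> (molecule type, produced by automatic annotation?)
# Only the prefixes listed on the slide are included here.
REFSEQ_PREFIXES = {
    "NT_": ("genomic contig", False),
    "NM_": ("mRNA", False),
    "NP_": ("protein", False),
    "NR_": ("non-coding RNA", False),
    "XM_": ("mRNA", True),
    "XP_": ("protein", True),
}


def classify_refseq(accession):
    """Return (molecule type, is_automatic) for a RefSeq-style accession."""
    prefix = accession[:3].upper()
    if prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unknown RefSeq prefix in {accession!r}")
    return REFSEQ_PREFIXES[prefix]


# Hypothetical accession strings, used only to show the prefix lookup.
for acc in ["NM_000000.1", "XP_000000.1"]:
    molecule, automatic = classify_refseq(acc)
    print(acc, molecule, "automatic" if automatic else "not marked automatic")
```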
Gene
• A record represents a single gene from an organism
• Gene-specific information such as map, sequence, expression, structure, function, homology and publications
• Includes data for all organisms that have RefSeq genome records
• Official gene symbol and gene name are used
Gene ID 5091
Official Symbol PC
Official Full Name pyruvate carboxylase
For human, provided by the HUGO Gene Nomenclature Committee (HGNC)
PubMed
• The database was designed to provide access to citations (with abstracts) from biomedical journals.
• PubMed has more than 15 million MEDLINE journal article references and abstracts (~1960-2008)
• 700 million searches per year over the web
• Linking feature to provide access to full-text journal articles at web sites of participating publishers, as well as to other related web resources.
Online Mendelian Inheritance in Man (OMIM)
• Is a timely, authoritative compendium of bibliographic material and observations on inherited disorders and human genes.
• Curation of the database and editorial decisions take place at The Johns Hopkins University School of Medicine.
• OMIM provides authoritative free text overviews of genetic disorders and gene loci that can be used by clinicians, researchers, students, and educators.
Gene expression databases
• Public microarray data repositories
ArrayExpress (AE): www.ebi.ac.uk/arrayexpress/
Gene Expression Omnibus (GEO): www.ncbi.nlm.nih.gov/geo/
Protein sequences
Three-dimensional Protein Structures
Protein-protein interaction
• HPRD: centralized platform to visualize domain architecture, post-translational modifications, interaction networks and disease association (gold standard for PPI)
• 36,500 unique PPIs annotated for 25,000 proteins (2007).
• > 50% of molecules annotated in HPRD have at least one PPI
• 10% have more than 10 PPIs.
• 3 categories of experiments for PPIs: in vitro, in vivo and yeast two-hybrid (Y2H).
Genome Browsers
Ensembl
Core database
– Normalized
– Each data point stored only once
– Quick updates
– Minimal storage requirements
– BUT: many tables, many joins for complicated queries, slow for data mining questions

Mart database (EnsMart)
– De-normalized
– Tables with ‘redundant’ information
– Query-optimized
– Fast and flexible
Comparative Genomics
Genomes change over time
Definitions
Homologs: A – B – C
Orthologs: B1 – C1
Paralogs: C1 – C2 – C3
Inparalogs: C2 – C3
Outparalogs: B2 – C1
Xenologs: A1 – AB1
Protein A
Ortholog prediction
Ortholog databases
• YOGY (eukarYotic OrtholoGY) is a web-based resource that integrates 5 independent resources (Sanger)
• COG (Clusters of Orthologous Groups of proteins) and KOG for 7 eukaryotic genomes (NCBI)
• Inparanoid (Stockholm Bioinformatics Center)
• HomoloGene (NCBI)
• OrthoMCL uses the Markov Clustering algorithm (University of Pennsylvania)
Rhodes et al., Nat Biotechnol, 2005
Probabilistic model for data integration
DIP Coexpression GO Interpro
HPRD (Gold standard)
Y N
Example protein-protein interaction network
Naïve Bayes model
$$LR(f_1..f_n) = \frac{P(f_1..f_n|pos)}{P(f_1..f_n|neg)} = \prod_{i=1}^{n} \frac{P(f_i|pos)}{P(f_i|neg)} = \prod_{i=1}^{n} LR(f_i)$$
Evidence (data sets) f1..fn
Prior odds Oprior = P(pos)/P(neg)
Posterior odds Opost = P(pos|f1..fn)/P(neg|f1..fn)
Likelihood ratio LR (f1..fn) = P(f1..fn|pos)/P(f1..fn|neg)
Opost = Oprior * LR (f1..fn)
Log-likelihood score LLS=log LR (f1..fn)
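The naïve Bayes combination can be illustrated numerically. Under the independence assumption the combined likelihood ratio is the product of the per-feature ratios, and the posterior odds follow from Opost = Oprior * LR. The likelihood ratios, feature names and prior odds below are invented for illustration and are not values from Rhodes et al.

```python
import math


def combined_likelihood_ratio(feature_lrs):
    """LR(f1..fn) as the product of per-feature likelihood ratios."""
    return math.prod(feature_lrs)


def posterior_odds(prior_odds, feature_lrs):
    """Opost = Oprior * LR(f1..fn)."""
    return prior_odds * combined_likelihood_ratio(feature_lrs)


def log_likelihood_score(feature_lrs):
    """LLS = log LR(f1..fn), computed as a sum of logs."""
    return sum(math.log(lr) for lr in feature_lrs)


# Hypothetical per-feature likelihood ratios for one protein pair.
example_lrs = {"coexpression": 4.0, "shared GO term": 2.5, "domain pair": 0.5}
prior = 0.01  # assumed prior odds P(pos)/P(neg)

odds = posterior_odds(prior, example_lrs.values())
probability = odds / (1 + odds)  # convert odds back to a probability
print(odds, probability, log_likelihood_score(example_lrs.values()))
```

A feature with LR below 1, like the third one here, counts as evidence against an interaction and lowers the posterior odds.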
Integration of datasets
Fraser AG, Marcotte EM. Nat Genet (2004)
Meta analysis: combination of p-values
• Get a list of significant p-values from different types of experiments (data)
• There are a number of (weighted) statistical measures to combine p-values from k datasets:
  Fisher’s χ2: $\chi^2 = -2 \sum_{i=1}^{k} w_i \log(p_i)$ (df = 2k)
  Mudholkar-George’s T: $T = f(k) \sum_{i=1}^{k} w_i \log\left(\frac{p_i}{1-p_i}\right)$
  Liptak-Stouffer’s Z: $Z = \frac{1}{\sqrt{\sum_{i=1}^{k} w_i^2}} \sum_{i=1}^{k} w_i \Phi^{-1}(1-p_i)$
• Derive networks based on overall p-values
  Intersection: $\min_{i=1..k}(p_i)$
  Union: $\max_{i=1..k}(p_i)$
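Two of the combination rules can be sketched with the Python standard library alone. Assumptions in this sketch: Fisher's method is shown unweighted (all w_i = 1), because the chi-square reference distribution with df = 2k applies to that case, and the p-values must lie strictly between 0 and 1. The function names and input values are made up for illustration.

```python
import math
from statistics import NormalDist


def fisher_combined(pvalues):
    """Unweighted Fisher: chi2 = -2 * sum(log p_i), df = 2k."""
    k = len(pvalues)
    chi2 = -2.0 * sum(math.log(p) for p in pvalues)
    # For even df = 2k the chi-square survival function has a closed form:
    # exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = chi2 / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return chi2, math.exp(-half) * total


def stouffer_combined(pvalues, weights=None):
    """Liptak-Stouffer: Z = sum(w_i * Phi^-1(1 - p_i)) / sqrt(sum(w_i^2))."""
    if weights is None:
        weights = [1.0] * len(pvalues)
    norm = NormalDist()  # standard normal, Phi
    z = sum(w * norm.inv_cdf(1.0 - p) for w, p in zip(weights, pvalues))
    z /= math.sqrt(sum(w * w for w in weights))
    return z, 1.0 - norm.cdf(z)  # one-sided combined p-value


pvals = [0.04, 0.10, 0.20]  # made-up p-values from three datasets
print(fisher_combined(pvals))
print(stouffer_combined(pvals, weights=[2.0, 1.0, 1.0]))
```

With a single p-value both functions return that p-value unchanged, which is a quick sanity check on the formulas.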
Technologies for data integration
data sources (databases)
Presentation layer (user)
mediator
wrapper
direct links
data warehouse
data marts
OLAP tools (cubes)
data mining
extract, transform, load (ETL)
databases
Linking (relational database, SQL)
Mediator-based approach (federated databases)
Data warehouse
Semantic integration: agents, RDF, OWL, SPARQL, ontologies