what is bioinformatics?hpc.ilri.cgiar.org/beca/training/ilri-aau_2015/clc/k...1000 genomes project...



What is Bioinformatics?

■  “Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline.” - NCBI

■  “The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.” - NCBI

http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html


© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458

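DNA Sequencing

CHAIN TERMINATOR

⇒ A 3’ hydroxyl group is essential for chain elongation; a chain terminator (dideoxynucleotide) lacks it, so synthesis stops wherever one is incorporated

[Figure labels: 5’, 3’]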


Capillary Gel Electrophoresis

⇒ The sequencing reaction is run out in a single capillary gel.

⇒ The gel is scanned by a laser.

⇒ The sequence is read automatically using computer software from the pattern of different wavelengths emitted by the fluorescent dyes.
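The read-out step above can be illustrated with a toy simulation. This is only a sketch: it assumes one terminated fragment per position, a perfect separation by length, and a made-up dye colour for each base (the colour names and function names are illustrative, not those of any instrument or software).

```python
import random

# Dye colour per terminating base; these colour names are illustrative assumptions.
DYE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
DYE_TO_BASE = {colour: base for base, colour in DYE.items()}


def terminated_fragments(new_strand):
    """Return one (length, dye) pair for every position at which a
    chain terminator could stop synthesis of the new strand."""
    return [(i + 1, DYE[base]) for i, base in enumerate(new_strand)]


def read_sequence(fragments):
    """Order fragments by length, as electrophoresis does, and translate
    the dye seen at each length back into a base."""
    return "".join(DYE_TO_BASE[dye] for _, dye in sorted(fragments))


strand = "GATTACA"
fragments = terminated_fragments(strand)
random.shuffle(fragments)  # fragments are mixed together in the reaction
print(read_sequence(fragments))
```

Sorting by fragment length stands in for the capillary run; the dye lookup stands in for the laser scan.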


Automated sequencers: ABI 3700

■  Made by Applied Biosystems

■  Most widely used automated sequencers:

–  96 capillaries

–  robot loading from 384-well plates

■  Two to three hours per run

■  600–700 bases per run

[Figure labels: 96-well plate; robotic arm and syringe; 96 glass capillaries; load bar]


Workflow of conventional vs. second-generation sequencing

High-throughput shotgun Sanger sequencing (96 or 384 long reads per run)

1) Template amplification

2) Sanger cycle seq

3) Capillary electrophoresis

Cyclic array shotgun sequencing (millions of short reads per run)

1) Template immobilization

2) (Template amplification)

3) Seq by synthesis or hybridization


Illumina

Figure from M. Metzker, Nat Rev Genet, Jan. 2010


Cost of Sequence per megabase


Benefits of Next-gen sequencing

https://genomevolution.org/wiki/images/1/16/Plant_Genome_Growth.png


Why do we sequence?

■  Genome Annotation:

A complete genome sequence provides us with the raw data to construct a "parts list".

■  Comparative Genomics:

Conserved regions in the genome are more likely to play an important role in the biology of the species.

■  Functional Genomics:

Sequencing the RNA gives us insight into the transcriptionally active regions of the genome.

■  Population Genetics and Genomics:

Genetic structure and diversity reveal the history and distribution of phenotypic traits (e.g. disease susceptibility alleles).

■  Genetic Analysis:

Map and characterize the molecular basis of allelic variants.


We have the genome sequence, now what?

●  Well...!

●  We don’t know how many genes there are!

●  We don’t know where they are!

●  We don’t know what they do!


Definitions of Annotation

■  Interpreting raw sequence data into useful biological information

■  Information attached to genomic coordinates with start and end points; it can occur at different levels

■  Addition of as much reliable and up-to-date information as possible to describe a sequence

■  Identification, structural description, characterization of putative protein products and other features in primary genomic sequence
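The second definition, information attached to coordinates, maps naturally onto a small record type. The sketch below is illustrative only: the `Feature` class, its field names, the 1-based inclusive coordinate convention, and the example features are all assumptions, not a format used by the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    seqid: str      # chromosome or contig name
    start: int      # 1-based, inclusive (an assumed convention)
    end: int        # 1-based, inclusive
    kind: str       # level of annotation, e.g. "gene" or "exon"
    note: str = ""  # free-text description


def overlapping(features, seqid, start, end):
    """Return the features on seqid that overlap the interval [start, end]."""
    return [
        f for f in features
        if f.seqid == seqid and f.start <= end and f.end >= start
    ]


# Hypothetical annotations at two levels (gene and exon).
features = [
    Feature("chr1", 100, 900, "gene", "hypothetical gene A"),
    Feature("chr1", 100, 250, "exon"),
    Feature("chr1", 700, 900, "exon"),
    Feature("chr2", 50, 400, "gene", "hypothetical gene B"),
]

hits = overlapping(features, "chr1", 200, 300)
for f in hits:
    print(f.kind, f.start, f.end, f.note)
```

Because every annotation carries its own start and end, gene-level and exon-level information can sit side by side and be queried by region.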


Genome annotation

Two Main Levels

•  Structural annotation = Nucleotide-Protein level annotation. Finding genes and other biologically relevant sites, thus building up a model of the genome as objects with specific locations

•  Functional annotation = Objects are used in database searches (and experiments); the aim is attributing biologically relevant information to the whole sequence and to individual objects

Large-scale genome analysis projects

•  Rate-limiting step is annotation
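As a rough illustration of structural annotation, the sketch below scans the forward strand for open reading frames (ATG to the next in-frame stop codon). It is a deliberately naive stand-in, not how production gene finders work: it ignores the reverse strand, introns, and any statistical model, and the function name and `min_codons` parameter are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs, 0-based and half-open, for forward-strand
    ORFs that run from an ATG to the next in-frame stop codon.
    min_codons counts the start and stop codons as well."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

Each reported ORF is an "object with a specific location" in the sense used above; functional annotation would then take such objects into database searches.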


How do we get from here …


to here,


Summary of gene annotation steps


Gene prediction through comparative genomics

■  Highly similar (conserved) regions between two genomes are probably functional, or else they would have diverged

■  If genomes are too closely related, all regions are similar, not just genes

■  If genomes are too far apart, corresponding regions may be too dissimilar to be found
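The idea in the bullets above can be sketched as a sliding-window identity scan over two sequences that have already been aligned. This is a minimal illustration, assuming equal-length aligned inputs with "-" for gaps; the function names, window size, step, and 0.8 threshold are arbitrary choices for the example, not values from the slides.

```python
def window_identity(a, b, window=10, step=5):
    """Return (window_start, fraction_identical) for each window of two
    aligned, equal-length sequences. Gap characters never count as matches."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    scores = []
    for s in range(0, len(a) - window + 1, step):
        matches = sum(
            x == y and x != "-"
            for x, y in zip(a[s:s + window], b[s:s + window])
        )
        scores.append((s, matches / window))
    return scores


def conserved_windows(a, b, window=10, step=5, threshold=0.8):
    """Keep only the windows whose identity reaches the threshold."""
    return [(s, f) for s, f in window_identity(a, b, window, step) if f >= threshold]


# Toy aligned sequences: the first half is identical, the second half differs.
seq_a = "ACGTACGTACAAAAAAAAAA"
seq_b = "ACGTACGTACCCCCCCCCCC"
print(window_identity(seq_a, seq_b))
print(conserved_windows(seq_a, seq_b))
```

With two very closely related genomes nearly every window would pass the threshold, and with very distant ones almost none would, which is the trade-off the slide describes.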


Mouse-human comparison

From: J.W. Thomas et al - Nature 14 August 2003


The ENCODE Project Consortium (2011) A User’s Guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol 9(4)


Automated Manual Merged


Basic Distributed Annotation Systems (DAS)


Contents of an Integrated Database

■  Genome and Functional Annotation: Predicted genes, GO, MIPS FunCat

■  Data to support modeling efforts: Protein-protein interactions, Protein-DNA interactions, Pathways (KEGG, AraCyc)

■  Experimental Data: Microarray, ChIP-chip


Bioinformaticians integrate the data into one database

1) Find the data. Decentralized databases. Data in different formats.

2) Convert to a common format. XML is a good idea (SBML).

3) Data integration. Manual: Excel sheet comparisons (Biologists). Automated: Perl Scripts (Informatician). Database: Queries e.g. SQL (High-production labs).

4) Gene list intersect (Experiments, Function, Models, Annotation).

5) Modeling. Biological function in Gene list. Need visualization and network modeling tools.
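Steps 2 to 4 can be sketched in a few lines. The slide mentions Perl scripts for automated integration; the example below uses Python instead, purely for illustration. The three inputs, their formats, the gene identifiers, and the helper names are all made up for this sketch.

```python
import csv
import io

# Hypothetical sources in three different formats.
expression_csv = "gene_id,fold_change\nAT1G01010,2.5\nAT1G01020,0.4\nAT1G01030,3.1\n"
interaction_tsv = "AT1G01010\tAT1G01030\nAT1G01040\tAT1G01010\n"
annotation = {"AT1G01010": "transcription factor", "AT1G01040": "RNA helicase"}


def genes_from_csv(text, column):
    """Common format, source 1: the set of gene IDs in a CSV column."""
    return {row[column] for row in csv.DictReader(io.StringIO(text))}


def genes_from_pairs(text):
    """Common format, source 2: the set of gene IDs seen in interaction pairs."""
    return {gene for line in text.splitlines() if line for gene in line.split("\t")}


experiment_genes = genes_from_csv(expression_csv, "gene_id")
interaction_genes = genes_from_pairs(interaction_tsv)
annotated_genes = set(annotation)

# Gene list intersect: genes supported by every source.
shared = sorted(experiment_genes & interaction_genes & annotated_genes)
for gene in shared:
    print(gene, annotation[gene])
```

Reducing every source to a set of identifiers is the "common format" here; a production setup would more likely load the sources into a database and express the intersect as a query.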


UCSC browser


Examples of Large Genome Projects

■  1000 Genomes Project (www.1000genomes.org). An effort to sequence the genomes of 1000 people to identify genetic variants present in at least 1% of the human population.

■  1001 Arabidopsis thaliana Genomes Project (www.1001genomes.org). Studies the genomes and phenotypes of 1001 strains to explain differences in phenotype caused by adaptation to different conditions.

■  Metagenomics (http://commonfund.nih.gov/hmp/): Sequencing of DNA samples from environments, for example the mouth, skin, and digestive system, to identify the different bacterial species present.
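The 1% figure in the 1000 Genomes bullet is usually read as an allele frequency threshold. The sketch below shows how such a frequency could be computed from diploid genotype calls; the "0/1" genotype encoding (0 = reference allele, 1 = alternate allele), the function name, and the sample data are assumptions made for this illustration, not project data.

```python
from collections import Counter


def alt_allele_frequency(genotypes):
    """Fraction of alternate alleles among diploid genotypes written as
    "0/0", "0/1" or "1/1" (0 = reference, 1 = alternate)."""
    counts = Counter(allele for gt in genotypes for allele in gt.split("/"))
    total = counts["0"] + counts["1"]
    return counts["1"] / total if total else 0.0


# Hypothetical calls at one site for 100 individuals.
genotypes = ["0/0"] * 97 + ["0/1"] * 2 + ["1/1"]
frequency = alt_allele_frequency(genotypes)
print(f"alternate allele frequency: {frequency:.3f}")
print("at or above 1%:", frequency >= 0.01)
```

Here 4 of the 200 sampled chromosomes carry the alternate allele, so the variant clears a 1% threshold.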


Your genome

■  Personal Genome Sequencing: Several companies provide a service where you can submit your DNA to be sequenced. This can help you learn more about your heritage and about which diseases you are susceptible to.

■  Medical Genomic Studies: There is already a collection of genetic testing procedures that look for specific genes. Unfortunately, they are not accurate, which can result in individuals making bad decisions. The hope is that, with more genes, we can make better and more informed decisions.