NCBI Bioinformatics Workshop Rabat, Morocco 2012

Upload: elmer

Post on 25-Feb-2016


TRANSCRIPT

Page 1: NCBI Bioinformatics Workshop

NCBI Bioinformatics Workshop

Rabat, Morocco 2012

Page 2: NCBI Bioinformatics Workshop

What is Bioinformatics?

Bioinformatics is the application of information technology to the field of molecular biology. The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of informatic processes in biotic systems. Its primary use since at least the late 1980s has been in genomics and genetics, particularly in those areas of genomics involving large-scale DNA sequencing. Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.

Wikipedia

Page 3: NCBI Bioinformatics Workshop

What is NCBI?

• Create automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics.

• Perform research into advanced methods of analyzing and interpreting molecular biology data.

• Enable biotechnology researchers and medical care personnel to use the systems and methods developed.

On November 4, 1988, President Ronald Reagan signed the Health Omnibus Programs Extension Act, which created the National Center for Biotechnology Information as part of the National Library of Medicine at NIH.

Page 4: NCBI Bioinformatics Workshop

History of molecular biology

1860s Genetics: Gregor Mendel discovered that genes determine the characteristics of an organism and that genes are passed to children from both parents.

1943 Molecular biology: Oswald Avery and colleagues showed that the DNA molecule might store the genes.

1962 Nobel Prize: James Watson, Francis Crick, Maurice Wilkins (Rosalind Franklin)

1970 Central Dogma: first stated by Francis Crick in 1958 and restated by him in Nature in 1970.

Page 5: NCBI Bioinformatics Workshop

Central Dogma of molecular biology

The central dogma of molecular biology was first enunciated by Francis Crick in 1958 and restated in a Nature paper published in 1970. The general transfers describe the normal flow of biological information: DNA can be copied to DNA (DNA replication), DNA information can be copied into mRNA (transcription), and proteins can be synthesized using the information in mRNA as a template (translation).

Does the central dogma still stand? Koonin EV. Biol Direct. 2012 Aug 23;7(1):27. [Epub ahead of print]
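The three general transfers on this slide (replication, transcription, translation) can be illustrated with a few lines of Python. This is only a sketch and not part of the original slides: the function names are made up for illustration, the codon table is deliberately partial (only the four codons used by the example sequence), and the example DNA string is invented.

```python
# Illustrative sketch of DNA -> DNA, DNA -> mRNA, and mRNA -> protein.
# All names and the example sequence are assumptions, not from the slides.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Partial codon table: only the codons needed for the example below.
CODON_TABLE = {
    "AUG": "M",  # methionine (usual start codon)
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop codon
}


def replicate(coding_strand):
    """DNA -> DNA: return the complementary strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(coding_strand))


def transcribe(coding_strand):
    """DNA -> mRNA: the mRNA equals the coding strand with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """mRNA -> protein: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTTGGTAA"          # invented example sequence
mrna = transcribe(dna)
print(replicate(dna))         # TTACCAAGCCAT
print(mrna)                   # AUGGCUUGGUAA
print(translate(mrna))        # MAW
```

A real translation routine would use the full 64-codon genetic code and handle reading frames and ambiguous bases; the sketch only shows the direction of information flow.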

Page 6: NCBI Bioinformatics Workshop

History of biotechnology

1590 The microscope is invented by Janssen
1675 Leeuwenhoek discovers protozoa and bacteria
1855 The Escherichia coli bacterium is discovered (a major research and production tool for biotechnology)
1879 Flemming discovers chromatin, rod-like structures in the cell nucleus, later called 'chromosomes'
1942 The electron microscope is used to identify and characterize a bacteriophage, a virus that infects bacteria
1953 Watson and Crick reveal the three-dimensional structure of DNA
1973 Cohen and Boyer perform the first successful recombinant DNA experiment, using bacterial genes
1983 The polymerase chain reaction (PCR) technique is invented
1995 The first bacterial genome is sequenced by whole-genome shotgun technology
2001 The sequence of the human genome is published in Science and Nature, making it possible for researchers all over the world to begin developing treatments
2005 Next-generation sequencing: Illumina, MiSeq, Ion Torrent, PacBio

Page 7: NCBI Bioinformatics Workshop

History of Bioinformatics

Sequence databases
1960 – Margaret Dayhoff collected sequences in a database that later became PIR
1974 – GenBank; 1980 – EMBL (ENA); 1984 – DDBJ; 1984 – SwissProt

Sequence comparison
1970 – Needleman-Wunsch global pairwise alignment
1972 – Smith-Waterman local alignment
1973 – multiple alignment

Database searches by sequence similarity
1988 – FASTA by Pearson and Lipman
1990 – BLAST by Altschul, Gish, Lipman

Text search and retrieval system
1990 – Entrez designed by Lipman and Benson

Algorithms
• Gene prediction
• Protein structure
• Hidden Markov Model
• Clustering
• Trees

Page 8: NCBI Bioinformatics Workshop

Problem Solving

[Diagram labels: Hypothesis, Model, Experiment, Data, Data management, Analysis, Visualization, Interpretation, Validation]

For every complex problem, there is an answer that is clear, simple, and wrong… - H. L. Mencken

Page 9: NCBI Bioinformatics Workshop

ROC curve analysis

Receiver Operating Characteristic (ROC) curve analysis (Metz, 1978; Zweig & Campbell, 1993)
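The slide only names the technique, so the following is a minimal sketch of what an ROC analysis computes, not the presenter's method. It assumes binary labels (1 = positive, 0 = negative), scores where higher means "more likely positive", and at least one example of each class. The function names and the toy data are invented for illustration.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points obtained by
    lowering a decision threshold through the sorted scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: three true hits and three false hits with classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(round(auc(curve), 3))   # 0.889
```

An area of 1.0 means every positive is ranked above every negative, while 0.5 corresponds to random ranking, which is why the area is a common summary when comparing search or prediction methods.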

Page 10: NCBI Bioinformatics Workshop

Challenges in Computational Biology

Protein

Protein structure prediction

Homology searches

Multiple alignments and phylogenetic tree

Genome assembly and annotation

Page 11: NCBI Bioinformatics Workshop

Challenging issues in Bioinformatics

• Data management
– processing, storage
– accuracy (high throughput, low quality)
– search and retrieval
– presentation

• Data analysis
– algorithms
– statistical techniques

• Simulation
– modeling and prediction
– parameter estimation
– prediction accuracy

Page 12: NCBI Bioinformatics Workshop

NCBI mission: discovery initiative

NCBI Analysis

Search

Visualization

Validation

Page 13: NCBI Bioinformatics Workshop

What is GenBank? NCBI’s Primary Sequence Database

• Nucleotide-only sequence database

• Archival in nature
– Historical
– Reflective of submitter point of view (subjective)
– Redundant

• GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– ftp accounts (genome data)

• Three collaborating databases
– GenBank
– DNA Data Bank of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL) Database

Page 14: NCBI Bioinformatics Workshop

Sequence Databases

[Diagram labels: Sequencing Centers, Labs, GenBank (updated ONLY by submitters); Algorithms, UniGene, Curators, RefSeq, Genome Assembly (updated continually by NCBI)]

Page 15: NCBI Bioinformatics Workshop

Next Generation Sequencing

Page 16: NCBI Bioinformatics Workshop

Next Generation Sequencing

Page 17: NCBI Bioinformatics Workshop

NGS produces a lot of data

Page 18: NCBI Bioinformatics Workshop

Information retrieval

NCBI Discovery initiative

Page 19: NCBI Bioinformatics Workshop

Entrez Search and retrieval system

"From a computer in the comfort of your own home or from one in your neighborhood library, you will be able to access timely and accurate information. Already 30,000 people a day are using MEDLINE. By making it more accessible -- free and private -- we can increase that number many times over."

Vice President Gore 1997
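The slides present Entrez as a web search and retrieval system and do not cover programmatic access. As an assumption added here, the sketch below shows how an Entrez query could be expressed through NCBI's E-utilities web interface (`esearch.fcgi` to find record IDs, `efetch.fcgi` to retrieve records). It only builds the request URLs and performs no network access. The helper names, the search term, and the example accession are illustrative choices.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed entry point for this sketch).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL: search database `db` for `term`."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids, rettype, retmode="text"):
    """Build an EFetch URL: retrieve the records with the given IDs."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Hypothetical usage: a PubMed text search, then a GenBank flat-file fetch.
print(esearch_url("pubmed", "central dogma"))
print(efetch_url("nuccore", ["NM_000546"], rettype="gb"))
```

Keeping URL construction separate from the actual HTTP request makes the query easy to inspect and test before any request is sent to NCBI.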

Page 20: NCBI Bioinformatics Workshop

Improve information retrieval

Add links, filters

Related information

Page 21: NCBI Bioinformatics Workshop

Rescuing Zero-Result PubMed Searches

[Chart comparing 2008 and 2011. Labels: Unassisted; Zero-result rescued by spelling; Gene sensor; Citation sensor/Hydra; Auto-complete; 16% of all PubMed searches; 19% improvement; 37% improvement]

Page 22: NCBI Bioinformatics Workshop

Sequence analysis

Page 23: NCBI Bioinformatics Workshop

Visualization

Page 24: NCBI Bioinformatics Workshop

NCBI Bioinformatics Workshop 2009

Page 25: NCBI Bioinformatics Workshop

NCBI Bioinformatics Workshop 2011