(c) m gerstein '06, gerstein.info/talks 1 cs/cbb 545 - data mining introduction to...

37
(c) M Gerstein '06, gerstein.info/talks CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,01.18 14:30-15:45)

Upload: rosalind-whitehead

Post on 11-Jan-2016

215 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

1

CS/CBB 545 - Data MiningIntroduction to Bioinformatics

Mark Gerstein, Yale Universitygersteinlab.org/courses/545

(class 2007,01.18 14:30-15:45)

Page 2: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

2

Data Mining• Importance of knowing the Data• Best approaches often require detailed domain

knowledge – non anonomymized aol data, netflix challenge

Page 3: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

3

Bioinformatics represents one of the biggest "open" areas for mining

• Genomics & Astronomy• Finance, marketing, credit-card fraud• Security and Intelligence

• Relation to experimental sciences

Page 4: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

4

General Intro. & background on bioinformatics

Page 5: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

5

Bioinformatics

BiologicalData

ComputerCalculations

+

Page 6: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

6

What is Bioinformatics?

• (Molecular) Bio - informatics• One idea for a definition?

Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

• Bioinformatics is a practical discipline with many applications.

Core

Page 7: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

7

What is the Information?Molecular Biology as an Information Science

• Central Dogmaof Molecular Biology DNA -> RNA -> Protein -> Phenotype -> DNA

• Molecules Sequence, Structure, Function

• Processes Mechanism, Specificity, Regulation

• Central Paradigmfor Bioinformatics

Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype

• Large Amounts of Information Standardized Statistical

•Genetic material•Information transfer (mRNA)•Protein synthesis (tRNA/mRNA)•Some catalytic activity

•Most cellular functions are performed or facilitated by proteins.

•Primary biocatalyst

•Cofactor transport/storage

•Mechanical motion/support

•Immune protection

•Control of growth/differentiation

(idea from D Brutlag, Stanford, graphics from S Strobel)

Page 8: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

8

Molecular Biology Information - DNA

• Raw DNA Sequence Coding or Not? Parse into genes?

4 bases: AGCT ~1 K in a gene, ~2 M in genome

~3 Gb Human

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .

. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

Page 9: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

9

Central Dogma

Page 10: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

10

Molecular Biology Information: Protein Sequence

• 20 letter alphabet ACDEFGHIKLMNPQRSTVWY but not BJOUXZ

• Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain

• ~200 K known protein sequencesd1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSId8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSId4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESId3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF

d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPd8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKPd4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKAd3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPd8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKPd4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKAd3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV

Page 11: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

11

Molecular Biology Information:Macromolecular Structure

• DNA/RNA/Protein Almost all protein

(RNA Adapted From D Soll Web Page, Right Hand Top Protein from M Levitt web page)

Page 12: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

12

Molecular Biology Information: Protein Structure Details

• Statistics on Number of XYZ triplets 200 residues/domain -> 200 CA atoms, separated by 3.8 A Avg. Residue is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic A

o => ~1500 xyz triplets (=8x200) per protein domain 10 K known domain, ~300 folds

ATOM 1 C ACE 0 9.401 30.166 60.595 1.00 49.88 1GKY 67ATOM 2 O ACE 0 10.432 30.832 60.722 1.00 50.35 1GKY 68ATOM 3 CH3 ACE 0 8.876 29.767 59.226 1.00 50.04 1GKY 69ATOM 4 N SER 1 8.753 29.755 61.685 1.00 49.13 1GKY 70ATOM 5 CA SER 1 9.242 30.200 62.974 1.00 46.62 1GKY 71ATOM 6 C SER 1 10.453 29.500 63.579 1.00 41.99 1GKY 72ATOM 7 O SER 1 10.593 29.607 64.814 1.00 43.24 1GKY 73ATOM 8 CB SER 1 8.052 30.189 63.974 1.00 53.00 1GKY 74ATOM 9 OG SER 1 7.294 31.409 63.930 1.00 57.79 1GKY 75ATOM 10 N ARG 2 11.360 28.819 62.827 1.00 36.48 1GKY 76ATOM 11 CA ARG 2 12.548 28.316 63.532 1.00 30.20 1GKY 77ATOM 12 C ARG 2 13.502 29.501 63.500 1.00 25.54 1GKY 78

...ATOM 1444 CB LYS 186 13.836 22.263 57.567 1.00 55.06 1GKY1510ATOM 1445 CG LYS 186 12.422 22.452 58.180 1.00 53.45 1GKY1511ATOM 1446 CD LYS 186 11.531 21.198 58.185 1.00 49.88 1GKY1512ATOM 1447 CE LYS 186 11.452 20.402 56.860 1.00 48.15 1GKY1513ATOM 1448 NZ LYS 186 10.735 21.104 55.811 1.00 48.41 1GKY1514ATOM 1449 OXT LYS 186 16.887 23.841 56.647 1.00 62.94 1GKY1515TER 1450 LYS 186 1GKY1516

Page 13: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

13

Molecular Biology Information:

Whole Genomes• The Revolution Driving Everything

Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F.,

Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A.,

Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-

genome random sequencing and assembly of Haemophilus influenzae rd."

Science 269: 496-512.

(Picture adapted from TIGR website, http://www.tigr.org)

• Integrative Data1995, HI (bacteria): 1.6 Mb & 1600 genes done

1997, yeast: 13 Mb & ~6000 genes for yeast

1998, worm: ~100Mb with 19 K genes

1999: >30 completed genomes!

2003, human: 3 Gb & 100 K genes...

Genome sequence now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.

-- G A Pekso, Nature 401: 115-116 (1999)

Page 14: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

14

Genomes highlight

the Finiteness

of the “Parts” in Biology

Bacteria, 1.6 Mb,

~1600 genes [Science 269: 496]

Eukaryote, 13 Mb,

~6K genes [Nature 387: 1]

1995

1997

1998

Animal, ~100 Mb,

~20K genes [Science 282:

1945]

Human, ~3 Gb,

~25K genes

2000?

real thing, Apr ‘00

‘98 spoof

Page 15: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

15

Gene Expression Datasets: the Transcriptome

Also: SAGE; Samson and Church, Chips; Aebersold, Protein Expression

Young/Lander, Chips,Abs. Exp.

Brown, array, Rel. Exp. over Timecourse

Snyder, Transposons, Protein Exp.

Page 16: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

16

2nd gen.,Proteome

Chips (Snyder)

The recent advent and subsequent onslaught of microarray data

1st generation,Expression

Arrays (Brown)

Page 17: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

17

Other Whole-Genome

Experiments

Systematic Knockouts

Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J. D., Bussey, H., Chu, A. M., Connelly, C., Davis, K., Dietrich, F., Dow, S. W., El Bakkoury, M., Foury, F., Friend, S. H., Gentalen, E., Giaever, G., Hegemann, J. H., Jones, T., Laub, M., Liao, H., Davis, R. W. & et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6

2 hybrids, linkage maps

Hua, S. B., Luo, Y., Qiu, M., Chan, E., Zhou, H. & Zhu, L. (1998). Construction of a modular yeast two-hybrid cDNA library from human EST clones for the human genome protein linkage map. Gene 215, 143-52

For yeast: 6000 x 6000 / 2 ~ 18M interactions

Page 18: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

18

Large-scale characterization of yeast gene phenotype using

molecular barcodes

Page 19: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

19

Molecular Biology Information:Other Integrative Data

• Information to understand genomes Metabolic Pathways

(glycolysis), traditional biochemistry

Regulatory Networks Whole Organisms

Phylogeny, traditional zoology

Environments, Habitats, ecology

The Literature (MEDLINE)

• The Future....

(Pathway drawing from P Karp’s EcoCyc, Phylogeny from S J Gould, Dinosaur in a Haystack)

Page 20: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

20

What is Bioinformatics?

• (Molecular) Bio - informatics• One idea for a definition?

Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

• Bioinformatics is a practical discipline with many applications.

Page 21: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

21

Large-scale Information:GenBank Growth

Year Base Pairs Sequences1982 680338 6061983 2274029 24271984 3368765 41751985 5204420 57001986 9615371 99781987 15514776 145841988 23800000 205791989 34762585 287911990 49179285 395331991 71947426 556271992 101008486 786081993 157152442 1434921994 217102462 2152731995 384939485 5556941996 651972984 10212111997 1160300687 17658471998 2008761784 28378971999 3841163011 48645702000 8604221980 7077491

GenBank Data

Page 22: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

22

Large-scale Information:Explonential Growth of Data Matched by Development of Computer Technology

• CPU vs Disk & Net As important as the

increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial

• Driving Force in Bioinformatics

(Internet picture adaptedfrom D Brutlag, Stanford)

Str

uct

ure

s in

PD

B

0500

10001500200025003000350040004500

1980 1985 1990 19950

20

40

60

80

100

120

1401979 1981 1983 1985 1987 1989 1991 1993 1995

CP

U In

stru

ctio

nT

ime

(ns)Num.

Protein DomainStructures

Internet Hosts

Page 23: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

23

PubMed publications with title “microarray”

0

500

1000

1500

2000

2500

3000

1998 2000 2002 2004

Per YearCumulative

Nu

mb

er o

f P

aper

s

Page 24: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

24

Features per Slide

F

eatu

res

per

chip

oligo features

transistors

Page 25: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

25

The Dropping Cost of Sequencing

• Adapted from Technology Review (Sept./Oct. 2006)

Page 26: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

26

Bioinformatics is born!

(courtesy of Finn Drablos)

Page 27: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

27

What is Bioinformatics?

• (Molecular) Bio - informatics• One idea for a definition?

Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

• Bioinformatics is a practical discipline with many applications.

Page 28: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

28

Organizing Molecular Biology

Information:Redundancy and

Multiplicity• Different Sequences Have the Same

Structure• Organism has many similar genes• Single Gene May Have Multiple Functions• Genes are grouped into Pathways• Genomic Sequence Redundancy due to the

Genetic Code• How do we find the similarities?.....

(idea from D Brutlag, Stanford)

Integrative Genomics - genes structures functions pathways expression levels regulatory systems ….

Core

Page 29: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

29

Where does "mining" fit in Science?

Page 30: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

30

Bioinformatics as New Paradigm forScientific Computing

• Physics

Prediction based on physical principles

EX: Exact Determination of Rocket Trajectory

Emphasizes: Supercomputer, CPU

• Biology Classifying information and

discovering unexpected relationships

EX: Gene Expression Network

Emphasizes: networks, “federated” database

Core

Page 31: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

31

Statistical Analysis

vs. Classical Physics

Bioinformatics, Genomic Surveys

Vs.

Chemical Understanding, Mechanism, Molecular Biology

How Does Prediction Fit into the Definition?

Page 32: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

32

Differences between Mining in Science vs Other Contexts

• Biology & Chemistry are Experimental Sciences Goal is construct an experiment that illuminates fundamental

mechanism Correlation is a means not a goal In contrast, in social sciences one can't readily uncover mechanism

• Nevertheless genome scale data changed things So much data that it contained high-dimensional non-obvious

"patterns" that could be teased out by mining

• Data mining is best as "Target Selection" to suggest experiments but if one does good experiments one won't need careful statistics

Page 33: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

33

Practical Mining

Page 34: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

34

Outline for the bioinformatics part of the class

• Illustrate a concrete problem in bioinformatics Predicting gene function (and phenotype)

o Representing genes in terms of networkso Network analysis and prediction

• Why networks are useful for representing information Mining as representing complex information relative to simple generative

models Scale-free models for biological networks

• Bayesian Approaches to Network and Function Prediction Predicting protein interactions

• Spectral Approaches to Phenotype Prediction Using PCA to analyze gene expression data and classify cancers

Page 35: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

35

~ Outline

• Sequences Alignment

o non-exact string matching, gaps Multiple Alignment and Consensus Patterns

o How to align more than one sequence and then fuse the result in a consensus representation

o Transitive Comparisons

o HMMs, Profiles, Motifs Scoring schemes and Matching statistics

o How to tell if a given alignment or match is statistically significant Evolutionary Issues

o Rates of mutation and change

Page 36: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

36

~ Outline

• Sequence / Structure Secondary Structure “Prediction” Tertiary Structure Prediction

o Fold Recognition

o Threading

o Ab initio

• Structures Basic Protein Geometry and Least-Squares Fitting Docking and Drug Design as Surface Matching

o Calculation of Volume and Surface Structural Alignment

o Aligning sequences on the basis of 3D structure.

Page 37: (c) M Gerstein '06, gerstein.info/talks 1 CS/CBB 545 - Data Mining Introduction to Bioinformatics Mark Gerstein, Yale University gersteinlab.org/courses/545

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

37

Basis for the topic selection

• Sequence and Structure are classic bioinformatics They are the most prevalent problems

• More difficult to represent for mining Require specialized techniques not covered (e.g. HMMs) Networks and function prediction most similar to classic mining

problems

• Covered Sequence and Structure in MBB/CS/CBB 452/752 Didn't cover networks for function prediction