practice (molecular biology) theory: central dogma methods: separation, visualization

Bioinformatics: Theory and Practice – Striking a Balance

(a plea for teaching, as well as doing, Bioinformatics)

Practice (Molecular Biology)

Theory: Central Dogma

Methods: separation, visualization

Experiment as “Art”

Theory (population/statistical genetics)

Theory: 80+ years of Mathematical Biology

Methods: Ag, RFLPs, SNPs…

Bioinformatics

Theory: 40 years of algorithms, information theory

20+ years of statistics

Current practice

My ideal

The spectrum of experimental Biology

Practice – Theory

Teaching and Bioinformatics

What is the goal?

• Learning Biology / learning Computer Science

• Becoming "computer literate" (scripting/programming)

• Exploring uncertainty

– experimental shortcomings

– computational biases

• Utility – getting something done

Bioinformatics is challenging because biology is complicated and idiosyncratic

Biology: A “clean” experiment – internal positive and negative controls

Southern blot of human class-mu Glutathione transferase genes from individuals with low (-) or high (+) GT-tSBO activity.

Bands found with high GT-tSBO (GSTM1)

RFLP independent of GT-tSBO

• When GSTM1 is present, it is detected

• When it is not detected, it is absent

Bioinformatics – ambiguity or computational error?

D3BUQ5 is “clearly” homologous to GSTA6_RAT, aligning from beginning to end

• Does it have a GST_C domain?

• Does it have glutathione transferase activity?

• Could it be a steroid isomerase? Prostaglandin synthetase?

Why is Bioinformatics “hard”?

Bioinformatics is at the intersection of Biology, Computer Science, and Statistics

• What is interesting to Computer Scientists – algorithms, optimality – is less relevant to Biologists (textbook bias)

• “Irrelevant” parameters for Computer Scientists – DNA vs protein – are important in practice

• Statistics are central, and the statistical perspective is not well integrated into either Biology or CS curricula

• The biological assumptions behind a “null hypothesis” are rarely explicit and often idealistic

• Biologists do experiments (CS folks like theory). If it works, use it.

Bioinformatics uses "hard/true/reproducible" techniques to solve "soft/ambiguous/varying" biological questions.
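One way to make the hidden assumptions of a “null hypothesis” visible is to build the null by hand. The sketch below is an illustration only, not a method from these slides: it scores two equal-length sequences by counting identical positions (no gaps, no scoring matrix), then shuffles one sequence many times to estimate how often chance alone does as well. The function names, the scoring scheme, and the default of 1000 shuffles are all assumptions chosen for brevity. Note what the shuffle assumes: residue composition is preserved, but every position is treated as independent, which real proteins are not.

```python
import random


def identity_count(a, b):
    """Count positions where two equal-length sequences carry the same residue."""
    return sum(1 for x, y in zip(a, b) if x == y)


def shuffle_p_value(query, subject, n_shuffles=1000, seed=0):
    """Estimate P(score >= observed) under a shuffled-sequence null.

    Hypothetical helper: the null model keeps the subject's composition
    but destroys residue order, so it assumes independent positions.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = identity_count(query, subject)
    residues = list(subject)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if identity_count(query, residues) >= observed:
            at_least_as_good += 1
    # Add-one correction: the estimate can never be exactly zero.
    return (at_least_as_good + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    # Made-up toy sequences, not real proteins.
    print(shuffle_p_value("MKVLAAGIVG", "MKVLSAGIVG"))
```

Changing the null (for example, shuffling in blocks, or sampling from database composition instead) changes the p-value, which is the point: the statistic is reproducible, but the biological assumption behind it is a choice.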

A teaching "opportunity"


Alberts is wrong about sequence similarity (three times in three claims)

“With such a large number of proteins in the database, the search programs find many nonsignificant matches, resulting in a background noise level that makes it very difficult to pick out all but the closest relatives. Generally speaking, one requires a 30% identity in sequence to consider that two proteins match. However, we know the function of many short signature sequences ("fingerprints"), and these are widely used to find more distant relationships.”

– Alberts, Molecular Biology of the Cell (5th ed, 2007) p. 139

• Sequences producing statistically significant alignments ALWAYS share a common structure

• Many significant alignments share < 30% identity (<25% identity is routine, and <20% identity can be significant)

• In the absence of significant similarity, “fingerprints” should never be trusted.
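The gap between percent identity and statistical significance can be shown with a few lines of arithmetic. The sketch below is a simplified illustration, not the calculation any particular search program performs: `percent_identity` counts matches over aligned, ungapped columns, and `expected_hits` applies the Karlin-Altschul relation E = m · n · 2^(−S') for a bit score S', using raw query and database lengths (real programs use corrected "effective" lengths). The function names and the example lengths are assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance alignments with at least this bit score.

    Simplification (assumption): uses raw lengths, not effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: 250-residue query against 10**8 database residues.
    for bits in (30, 40, 50, 60):
        print(bits, expected_hits(bits, 250, 10**8))
```

Because the expectation depends on the score and the search space rather than on identity, a long alignment at 20-25% identity can accumulate enough score to be significant, while a short match at much higher identity may not.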

How can we teach better?

• Discuss the strengths and weaknesses of data resources

• Examine how published protocols go out of date (or are optimized for different problems). Examine potential weaknesses – what do the protocols assume?

• Review high-profile papers with mistaken conclusions to understand what went wrong.

Biology 4XXX – Bioinformatics and Functional Genomics

3hr lecture, 1hr lab

• Introduction to Unix / Perl (Python) scripting / web resources

– programming by imitation

• Similarity searching / domain identification

– homology, scoring matrices

– errors in domain annotation (why)

• Multiple sequence alignment

– sequences vs domains

• Evolutionary tree-building

– finding the best tree

– evaluating alternative trees

– where is the uncertainty (why)

• Introduction to 'R' statistical language

– programming by imitation

• Expression analysis

– read mapping, read counting

• Motif extraction, mapping

– motif independence?

• Pathway analysis – gene enrichment

• Gene models and alternative splicing

– which gene/splicing models supported?

Computational and Comparative Genomics

Oct 29 – Nov 4, 2014

(application deadline July 15, 2014)
