Bioinformatics: Theory and Practice – Striking a Balance
(a plea for teaching, as well as doing, Bioinformatics)
The spectrum of experimental Biology: Practice – Theory
Practice (Molecular Biology)
• Theory: Central Dogma
• Methods: separation, visualization
• Experiment as “Art”
Theory (population/statistical genetics)
• Theory: 80+ years of Mathematical Biology
• Methods: Ag, RFLPs, SNPs…
Bioinformatics
• Theory: 40 years of algorithms, information theory; 20+ years of statistics
Current practice / My ideal
Teaching and Bioinformatics
What is the goal?
• Learning Biology / learning Computer Science
• Becoming "computer literate": scripting/programming
• Exploring uncertainty
– experimental shortcomings
– computational biases
• Utility – getting something done
Bioinformatics is challenging because biology is complicated and idiosyncratic
Biology: A “clean” experiment – internal positive and negative controls
Southern blot of human class-mu Glutathione transferase genes from individuals with low (-) or high (+) GT-tSBO activity.
Bands found with high GT-tSBO (GSTM1)
RFLP independent of GT-tSBO
• When GSTM1 is present, it is detected
• When it is not detected, it is absent
Bioinformatics – ambiguity or computational error?
D3BUQ5 is “clearly” homologous to GSTA6_RAT, aligning from beginning to end
• Does it have a GST_C domain?
• Does it have glutathione transferase activity?
• Could it be a steroid isomerase? Prostaglandin synthetase?
Why is Bioinformatics “hard”?
Bioinformatics is at the intersection of Biology, Computer Science, and Statistics
• What is interesting to Computer Scientists – algorithms, optimality – is less relevant to Biologists (textbook bias)
• “Irrelevant” parameters for Computer Scientists – DNA vs protein – are important in practice
• Statistics are central, and the statistical perspective is not well integrated into either Biology or CS curricula
• The biological assumptions behind a “null hypothesis” are rarely explicit and often idealistic
• Biologists do experiments (CS folks like theory). If it works, use it.
Bioinformatics uses "hard/true/reproducible" techniques to solve "soft/ambiguous/varying" biological questions.
A teaching "opportunity"
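One way to make a “null hypothesis” explicit in class is a shuffling test. The sketch below is an illustration only, not something from the original slides: it counts ungapped identities between two equal-length sequences and compares that count with the counts obtained after shuffling one sequence. The function names, the toy peptide, and the choice of shuffling as the null model are all assumptions. Shuffling treats residues as independent and preserves only composition, which is exactly the kind of idealized biological assumption the bullet above refers to.

```python
import random


def ungapped_identities(a, b):
    """Count positions where two equal-length sequences carry the same residue."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_p_value(a, b, n_shuffles=1000, seed=0):
    """Estimate how often a shuffled copy of b matches a at least as well as b does.

    Null model (an assumption): residues of b are exchangeable, so only
    composition matters. Real sequences violate this (repeats, domains,
    low-complexity regions), so the estimate can be optimistic.
    """
    rng = random.Random(seed)
    observed = ungapped_identities(a, b)
    shuffled = list(b)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if ungapped_identities(a, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    toy = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical peptide, for illustration only
    print(shuffle_p_value(toy, toy))
```

A useful classroom question is what changes if the null model preserves dipeptide composition instead, or if the sequences contain a low-complexity region.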
Alberts is wrong about sequence similarity (three times in three claims)
“With such a large number of proteins in the database, the search programs find many nonsignificant matches, resulting in a background noise level that makes it very difficult to pick out all but the closest relatives. Generally speaking, one requires a 30% identity in sequence to consider that two proteins match. However, we know the function of many short signature sequences ("fingerprints"), and these are widely used to find more distant relationships.”
– Alberts, Molecular Biology of the Cell (5th ed, 2007) p. 139
• Sequences producing statistically significant alignments ALWAYS share a common structure
• Many significant alignments share < 30% identity (<25% identity is routine, and <20% identity can be significant)
• In the absence of significant similarity, “fingerprints” should never be trusted.
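The contrast between percent identity and statistical significance can be sketched in a few lines. The example below is a rough illustration under stated assumptions: it uses the standard relation between a bit score S' and the expected number of chance hits, E = m · n · 2^(-S'), where m is the query length and n the database length, and it ignores the length corrections that real search programs apply. The function names and the numbers in the usage section (bit score, lengths) are hypothetical, chosen only to show the arithmetic.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither aligned sequence has a gap ('-')."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def expected_chance_hits(bit_score, query_len, db_len):
    """E-value from a bit score: E = m * n * 2**(-S').

    Simplification (an assumption): no edge-effect or effective-length
    correction is applied, so this is only an approximation.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical values: a 200-residue query, a database of 1e8 residues,
    # and an alignment scoring 60 bits.
    print(expected_chance_hits(60, 200, 1e8))
    print(percent_identity("ACDE-F", "ACQEGF"))
```

The point for students is that significance depends on the alignment score and the size of the search space; percent identity does not appear in the formula at all, which is why a long alignment at low identity can still be highly significant.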
How can we teach better?
• Discuss the strengths and weaknesses of data resources
• Examine how published protocols go out of date (or are optimized for different problems). Examine potential weaknesses – what do the protocols assume?
• Review high-profile papers with mistaken conclusions to understand what went wrong.
Biology 4XXX – Bioinformatics and Functional Genomics
3hr lecture, 1hr lab
• Introduction to Unix / perl (python) scripting / web resources
– programming by imitation
• similarity searching / domain identification
– homology, scoring matrices
– errors in domain annotation (why)
• multiple sequence alignment
– sequences vs domains
• evolutionary tree-building
– finding the best tree
– evaluating alternative trees
– where is the uncertainty (why)
• Introduction to 'R' statistical language
– programming by imitation
• Expression analysis
– read mapping, read counting
• Motif extraction, mapping
– motif independence?
• Pathway analysis
– gene enrichment
• Gene models and alternative splicing
– which gene/splicing models supported?
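For the gene enrichment topic in the outline above, a "programming by imitation" starting point might look like the following. This is a minimal sketch, not course material from the slides: it computes a one-sided hypergeometric p-value for the overlap between a selected gene list and a pathway. The function name, parameter names, and the counts in the usage section are hypothetical. The test assumes genes are drawn independently from a well-defined universe, and choosing that universe is itself a modeling decision worth discussing.

```python
from math import comb


def enrichment_p_value(universe, annotated, selected, overlap):
    """P(X >= overlap) under a hypergeometric null.

    universe  -- total number of genes considered
    annotated -- genes in the universe that belong to the pathway
    selected  -- genes in the list being tested (e.g. differentially expressed)
    overlap   -- selected genes that also belong to the pathway
    """
    largest_possible = min(annotated, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, largest_possible + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 20,000 genes, a 100-gene pathway,
    # 200 selected genes, 10 of them in the pathway.
    print(enrichment_p_value(20000, 100, 200, 10))
```

With these counts about one overlapping gene is expected by chance, so ten is a strong signal under this null. Whether the null is reasonable (gene length bias, correlated expression, incomplete annotation) is a separate question.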
Computational and Comparative Genomics
Oct 29 – Nov 4, 2014
(application deadline July 15, 2014)