Biology 162: Computational Genetics Fall 2004 Todd Vision Assistant Professor Department of Biology, UNC Chapel Hill

Post on 21-Dec-2015



Biology 162: Computational Genetics

Fall 2004

Todd Vision, Assistant Professor

Department of Biology, UNC Chapel Hill


Bioinformatics vs computational genetics

• Bioinformatics: The application of computing technology to molecular biology

• Computational genetics: The interdisciplinary intersection of genetics, computer science and statistics


Course emphasis

• Data analysis in molecular genetics

• We will not cover
  – Developments in IT hardware
  – Analysis of protein structure
  – Modeling of metabolic pathways, cells, tissues, organs, etc. (i.e., systems biology)


Prerequisites

• Bio 50: Molecular Biology and Genetics
  – Gene/protein structure and expression
  – Principles of inheritance

• Comp Sci 14: Introduction to Programming
  – Algorithms and their design
  – Fundamental programming skills

• Stat 31: Introduction to Statistics
  – Probability and distributions
  – Hypothesis testing and parameter estimation


Related courses at UNC

• Biology 170/Math 107, Mathematical and Computational Models in Biology (Tim Elston and Maria Servedio)

• Summer courses in
  – Computer Science

• Graduate courses in
  – Bioinformatics and Computational Biology
  – Biostatistics
  – School of Pharmacy


Readings

• Gibson and Muse, A Primer of Genome Science, Sinauer Associates.
  – Available in Student Bookstore
  – Primarily covers genomic technologies
  – Brief on computational/statistical aspects

• Supplemental papers
  – Handed out in class or posted on Blackboard
  – Includes
    • More detail on computational/statistical aspects
    • Papers which you will review for class assignments


https://blackboard.unc.edu


Computer labs / Problem sets

• Thursdays 3:30-4:30 in Wilson 132

• Assignments are due the following Tuesday

• Purpose:
  – Familiarity with genomic databases and tools
    • Functional and evolutionary sequence analysis
    • Gene expression analysis
    • Mapping of genomes and complex traits
  – Comfort with command-line tools and computing
  – Exercise of scientific reasoning and biological judgement
  – No programming required (but learn Perl anyway!)


Research paper

• Critical review of the computational challenges involved in assembly of the human genome

• Based on opposing articles from the main players in the drama

• Paper will be judged on
  – Understanding of content
  – Critical and synthetic reasoning
  – Clarity of scientific writing


Late policy

• Assignments are due at beginning of class on the due date

• Late assignments receive half-credit

• Exceptions can be made but require more than 24 hours' notice


Group work

• You are encouraged to work together on most assignments (some exceptions)

• What you turn in should be your own
  – Show your work
  – Be able to defend your answers

• Know and love the UNC Honor Code
  – http://honor.unc.edu


Exams

• Two midterms

• Final exam will be cumulative

• May include material from labs/problem sets, readings and lectures

• Most questions will be similar to those on lab/problem sets

• You will receive a study guide in advance


Grading

• 10 Labs/problem sets - 50% (5% each)

• Review paper - 10%

• Midterms - 20% (10% each)

• Final exam - 20%

• Final grades
  – No curve, point divisions at discretion of instructor
  – Different divisions for undergraduate/graduate students


Computer lab server: Biolinux

• All necessary analysis software is installed

• Dell PowerEdge server
  – Red Hat Linux operating system
  – 2 Xeon processors
  – 2 GB RAM
  – 60 GB disk space

• Requires an ONYEN for login

• Uses AFS file space


Connecting to Biolinux

• biolinux.bio.unc.edu (IP 152.2.66.25)

• Windows
  – Zip archive contains necessary connection software

• MacOSX
  – X11 for graphical sessions
  – Fugu for secure ftp

• Linux/Solaris/etc.
  – Should work as is


https://onyen.unc.edu


http://cilantro.bio.unc.edu/biolinux


Cretaceous Park?

• In 1994, researchers reported a remarkably well-preserved Cretaceous dinosaur fossil.

• DNA was extracted
  – Care was taken to prevent contamination

• Specific regions were amplified
  – 20 different PCR primer pairs used, including 6 pairs from mitochondrial cytB
  – How would you design primers for dinosaur DNA?
  – All yielded products in mammals, birds and reptiles
  – Only one cytB pair yielded a product from the fossil
  – Negative controls did not reveal contamination


Cretaceous Park?

• One cytB fragment amplified

• 9 sequences obtained from two bone samples
  – Variability was present within and between the two samples; no two sequences were identical

• Consensus sequences used to search for homologs
  – Genbank (215,000 sequences)
  – BLAST

• Measured percent identity

• Closest matches were ~70% identical
  – Equidistant to mammals, birds, and reptiles
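The consensus and percent-identity steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the original study: the function names, the majority-rule consensus, the choice to skip gapped columns, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Majority-rule consensus of equal-length, pre-aligned sequences."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # For each alignment column, keep the most frequent character.
    # Ties go to the character seen first in that column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_seqs)
    )


def percent_identity(a, b):
    """Percent of compared columns in which two aligned sequences agree.

    Columns with a gap ('-') in either sequence are skipped; this is one of
    several conventions in use, chosen here only for simplicity.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


# Toy clone sequences, invented for illustration (not the fossil data).
clones = ["ACGTTGCA", "ACGTAGCA", "ACCTTGCA"]
cons = consensus(clones)
print(cons)                                # ACGTTGCA
print(percent_identity(cons, "ACGTTGAA"))  # 87.5
```

Percent identity by itself says nothing about which lineage a sequence belongs to, which is why a match that is roughly equidistant to several groups deserves a closer look.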


Cretaceous Park?

• One would expect dinosaur DNA to be most similar to that of birds, and then crocodilians

• Other authors reanalyzed the data
  – Multiple alignment
  – Protein sequence scoring matrix
  – Phylogenetic analysis

• All concluded that the DNA was clearly mammalian, possibly human

• One group showed that similar sequences could be amplified from human nuclear DNA


Cretaceous Park?

• Three possibilities
  – Preparation of human nuclear DNA could have been contaminated by dinosaur DNA
  – Dinosaurs and humans might have hybridized during the Cretaceous
  – Dinosaur extracts were contaminated by human DNA

• Study revealed an interesting aspect of human molecular evolution, but not much about dinosaurs

• Lesson learned: naïve computational analysis can lead to very misguided conclusions!


Discussion question

• You are given the sequence of a new gene and asked to determine its function.

• How would you begin?
  – What ‘wet lab’ approaches are possible?
  – What ‘in silico’ approaches are possible?
  – What approaches might require both wet lab and in silico components?


Biological topics

• Sequence alignment and assembly

• Sequence homology searching

• Sequence evolution and phylogenetics

• Finding genes and other features

• Patterns of gene expression

• Genetic mapping

• Dissecting genetic diseases and quantitative traits


Computational topics

• Dynamic programming

• Regular expressions and suffix trees

• Markov chains

• Hidden Markov models and machine learning

• Techniques for clustering and classification

• Maximum likelihood and Bayesian statistics

• Graph traversal


Some informatics tools

• Genbank, Uniprot, and major sequence repositories

• InterPro and protein signature dBs

• Gene Ontology

• Model organism genome databases (SGD, FlyBase, Ensembl)

• A sampling of software programs
  – Chosen primarily for pedagogical utility


Genomics

• Genetics on lots of genes?

• Hypothesis-free science?

• Some technologies

• Enabled by
  – Robotics
  – Computers


Genome database examples

• Primary databases
  – Genbank/EMBL/DDBJ

• Secondary databases
  – Pfam (protein domains)

• Organism-specific
  – SGD (yeast genomics)

• Specialized dBs
  – OMIM (human genetic disorders)

• Annual database issue of Nucleic Acids Research: http://www3.oup.co.uk/nar/database/c/


Growth of Genbank


http://www.expasy.org/cgi-bin/show_thumbnails.pl?2


First bacterial genome: 1995

• Haemophilus influenzae (TIGR)
  – 1.8 x 10^6 bp shotgun assembly
  – Required 9 months of computer time

• Now there are hundreds
  – 160 Bacterial
  – 19 Archaeal
  – 32 Eukaryotic

• Over a thousand projects ongoing

• And a bacterial genome takes only days to sequence and assemble


Tree of life


More protein families await


Other types of genomic data

• Spatiotemporal gene expression

• Alternative transcription

• Genetic knockout/overexpression phenotypes

• Genetic variability
  – Molecular polymorphism

• Phenotypic variation / disease

• Comparative data / molecular evolution

• Protein
  – Structure, including modifications
  – Interactions with other molecules

• Metabolic profiling, etc., etc.


Algorithmic/statistical innovations

• The most fundamental and heavily used application in the field is pairwise alignment
  – Smith-Waterman algorithm (1981)
    • Still too slow for general database search
  – BLAST (1990)
    • Made database search of 10^7-10^8 sequences feasible
    • Statistical ranking of each alignment

• Statistical methods in molecular evolution <25 yrs old

• Modern genetic mapping methods ~15 yrs old
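To make the pairwise-alignment idea concrete, here is a minimal Python sketch of Smith-Waterman local alignment by dynamic programming. It is an illustration under stated assumptions rather than a production implementation: the scoring values (match +2, mismatch -1, gap -2), the linear gap penalty, and the toy sequences are arbitrary choices for the example. The nested loops fill one cell for every pair of positions, which is why the method is too slow for searching a large database directly.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned piece of a, aligned piece of b).

    Scores and the linear gap penalty are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh local alignment
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Toy sequences, invented for illustration.
print(smith_waterman("TTTACGTAAA", "GGACGTCC"))  # (8, 'ACGT', 'ACGT')
```

Heuristic tools such as BLAST avoid filling the whole table for every database sequence, trading guaranteed optimality for speed.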


Things to review

• Chemical differences among amino acids

• Prokaryotic and eukaryotic gene structure

• The central dogma

• Anatomy of a typical protein


Reading for Thursday

• Gibson and Muse, Ch. 1, Genome Projects, pp. 1-58.