
CS691K Bioinformatics Kulp Lecture Notes #0

Molecular & Cell Biology

Fall 2005

[email protected]

Logistics

• Syllabus distributed

– Class taught in 3 stages by faculty in CS, math/stats, and microbio

– Grades will be based on up to six homework assignments

– Office hours on syllabus. All faculty are readily available by email. We are happy to discuss the class with you personally.

– Not all notes will be available online - you should attend all lectures and take good notes

• Diverse group of students

• Emphasis will be on understanding methods and practical use of existing bioinformatics tools

• Why are you here? What is your background? What are you hoping to get out of this class? Please sign the email sheet!

• Homework will involve the use of the unix ED-LAB computers. There will be a special meeting on WEDNESDAY, SEPTEMBER 14 for novice unix users.

What is Bioinformatics

• Computational Biology: The use of algorithmic, mathematical, and statistical methods to analyze genome sequences (i.e. DNA, RNA, protein) and derived data (e.g. expression, NMR, etc.)

• Informatics: The software and data management methodologies for storing, retrieving, and integrating such data

• Data Mining / In-silico Biology: Hypothesis generation and testing from genome data sets

Topics

• Detecting similar sequences (homology)

– Pairwise and multiple sequence alignment

– Protein function/structure prediction

• Sequence pattern modeling and recognition

– Motif discovery

– Gene finding

• Analyzing high-dimension data

– Function prediction, target discovery, etc. from gene expression

• Constructing trees

– Phylogenetics

• Informatics and integration

– Genome biology

The Cell

• Prokaryotes are unicellular with minimal compartments - bacteria, archaea

• Eukaryotes have many organelles, including the nucleus, and can typically reproduce sexually. Many are multicellular with differentiated cell types - all higher organisms including mammals, birds, fish, invertebrates, mushrooms, and plants, as well as unicellular yeast. Roughly 30,000,000,000,000 cells in a human (estimates range from 10^13 to 10^14).

The Cell

• The cell is composed of and makes thousands of proteins, e.g.

– the cell membrane is made of a layer of lipids and proteins.

– There are special proteins embedded in the membrane as channels and pumps

– And the cell makes (synthesizes) proteins

• “DNA makes RNA, RNA makes proteins, and proteins make us!” F. Crick

• The cell is a chemical catalytic machine

• Networks:

– One type of network is the metabolic network, describing catalytic reactions for the consumption or synthesis of products necessary for life. Many of these are fairly well understood. (e.g. photosynthesis)

– Another type of network is the signaling network, where information is conveyed about the environment. These are partially understood. (e.g. protein kinases are involved in cell differentiation and cell death)

• From KEGG (http://www.genome.ad.jp/kegg/pathway.html)

The Cell - Genetic Information

• There is a third major type of network: genetic information processing. We will focus on these networks.

• To understand this:

– We describe the nature of DNA

– Tangentially mention homology and conservation

– Then discuss the process of transcription and translation

DNA Structure - Eukaryotic Chromosome

• DNA - a string of nucleotides, each carrying one of four bases (Adenine, Guanine, Cytosine, and Thymine)

• Regular, long, stable, oriented, double-stranded, helical structure

• Humans: 23 pairs of chromosomes. Total ~3B “bases” (x2)

• DNA resides in nucleus in eukaryotes

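DNA Structure

• Always: chemical pairing of A-T and C-G. Thus, strands are complementary.

• The two chains run in opposite directions (antiparallel); each is read 5’ to 3’

(Figure: the two strands of the double helix, with the 5’ and 3’ ends labeled in opposite orientations.)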
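The pairing and orientation rules can be sketched in a few lines of Python. This is only an illustration: the function name `reverse_complement` and the example sequence are assumptions, not part of the lecture.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written 5' to 3'.

    The strands are antiparallel, so the partner strand is read
    in the reverse direction of the input strand.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

print(reverse_complement("ATGCCG"))  # CGGCAT
```

Applying the function twice returns the original strand, which is one way to see that the two strands carry the same information.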

Prokaryotic Chromosomes

• Prokaryotes (and mitochondria) typically have one circular chromosome

• The figure shows the E. coli genome, with orange and yellow bars indicating the positions of the genes on the two strands.

RNA

• RNA is a similar molecule composed of 4 nucleotides (A, C, G, and U)

• Single-stranded.

• Can base-pair with DNA (synthesis)

• Can self-base-pair and fold

DNA Replication

• We won’t be discussing the details of DNA replication. There are 2 processes of cell division in which DNA is replicated:

– Mitosis for normal cell duplication

– Meiosis for gametes for sexual reproduction - single, recombined chromosomes

• In both processes, DNA is copied by breaking double-strand (dsDNA) into single-strands (ssDNA) at origins of replication and synthesizing a complementary copy from the template.

– ~3B bp / (50 bp/sec * 15K origins) = ~1 hr to replicate the human genome

• Problem:

– How does DNA polymerase find the origins? Are there sequence patterns?
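The replication-time estimate is simple arithmetic, spelled out here as a sketch. It assumes, as the slide's estimate implicitly does, that all origins are active at the same time and that each contributes 50 bp/sec. The genome size of ~3B bases comes from the chromosome slide, and the function name is hypothetical.

```python
GENOME_BP = 3_000_000_000    # ~3B bases in the human genome
RATE_BP_PER_SEC = 50         # synthesis rate per origin (from the slide)
ORIGINS = 15_000             # number of origins of replication (from the slide)

def replication_time_seconds(genome_bp: float, rate: float, origins: int) -> float:
    """Time to copy the genome if every origin works in parallel at `rate`."""
    return genome_bp / (rate * origins)

seconds = replication_time_seconds(GENOME_BP, RATE_BP_PER_SEC, ORIGINS)
print(f"{seconds:.0f} s = {seconds / 3600:.2f} h")  # 4000 s = 1.11 h
```

With a single origin the same calculation gives about 6 x 10^7 seconds, close to two years, which is why many origins are needed.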

The Tree of Life

Single common ancestral genome!

DNA Conservation and Variation

• Mutations occur in DNA due to environmental effects (e.g. radiation) and random mistakes during synthesis. Usually just single nucleotides are changed, sometimes there are large rearrangements.

• Those changes occurring in somatic (non-sex) cells cause local damage, usually cell death, but can cause cancer. (Search for the common mutations that cause different types of cancers.)

• Those changes occurring in gametes can be inherited and if favorable can become “fixed”

• Variation in non-functional (junk) DNA tends to “drift”, whereas functional DNA (e.g. containing genes) tends to remain “conserved”.

• Problems:

– Given a set of sequences from different organisms:

• Identify and align sequences from a common ancestor (homologous)

• What are the important (conserved) parts?

• What was the evolutionary history? (Reconstruct the “tree”)

– Given a model organism (e.g. mouse, yeast, fruitfly, etc.), find the orthologous locus in human
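As a rough illustration of the "what are the conserved parts?" problem, the sketch below scores each column of an already-aligned set of sequences by how many sequences share the most common residue. The toy sequences and the function names are invented for illustration. Producing the alignment in the first place is the harder problem, and real conservation measures are more refined than this.

```python
from collections import Counter

def column_conservation(aligned: list[str]) -> list[float]:
    """Fraction of sequences sharing the most common residue in each column.

    `aligned` holds equal-length rows of a multiple alignment; '-' is a gap
    and never counts as a shared residue.
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(aligned))
    return scores

def conserved_positions(aligned: list[str], threshold: float = 1.0) -> list[int]:
    """Column indices whose conservation score reaches `threshold`."""
    return [i for i, s in enumerate(column_conservation(aligned)) if s >= threshold]

# Toy alignment (made-up data): column 3 varies, column 7 has one change.
toy = ["ACGTTGCA",
       "ACGATGCA",
       "ACG-TGCT"]
print(conserved_positions(toy))  # [0, 1, 2, 4, 5, 6]
```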

Examples of Sequence Conservation

• A segment from the RNA needed for protein synthesis - a fundamental process in all life forms. It is conserved across all 3 major branches of the tree of life.

• A multiple alignment of homologous protein sequences. Colors indicate different classes of amino acids. Dots are inserts/deletes.

DNA contains “GENES”

• Genes are hereditary units of DNA

– We now know that, for the most part, genes are regions that “code” for proteins

• Proteins are derived from DNA according to the “central dogma”: DNA => RNA => Protein

– Like DNA replication, DNA is opened into two single strands.

– Using a ssDNA as a template, a complementary copy of RNA is synthesized for a small region of the genome (1,000-100,000 nt)

– The RNA is processed and transported (more about that in later lectures)

– Each triple of RNA (codon) is translated to one of 20 amino acids, creating a polypeptide chain, which folds into a protein

• Problems:

– How does the cell know where to find a gene? (Sequence patterns?)

– How does RNA transcription know when to stop? (Patterns?)

– How is RNA edited?
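The transcription step (a complementary RNA copy made from a ssDNA template) can be sketched as follows. The function name `transcribe` and the example template are assumptions for illustration. The sketch ignores where transcription starts and stops and any RNA processing, which are exactly the open problems listed above.

```python
# RNA pairs with the DNA template the same way DNA does, except that
# RNA uses U where DNA would use T.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template_5to3: str) -> str:
    """RNA complementary to a DNA template strand, both written 5' to 3'.

    The RNA is antiparallel to its template, so the template is walked
    in reverse while each base is complemented.
    """
    return "".join(DNA_TO_RNA[base] for base in reversed(template_5to3.upper()))

print(transcribe("AAGCAT"))  # AUGCUU
```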

“Central Dogma” - DNA - RNA - Protein

©1998 by Alberts, Bray, Johnson, Lewis, Raff, Roberts, Walter

Codon Translation

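• Each triplet translates to exactly one amino acid (or to a stop signal). For example, CUU is Leucine.

• There are 4*4*4=64 possible codons: 61 translate into the 20 amino acids, so most amino acids are encoded by several codons, and 3 signal the end of translation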

• This translation table is fixed for almost all life
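A minimal sketch of the table in Python, assuming the standard genetic code written as a compact 64-letter string (one-letter amino acid codes, `*` for stop, codons ordered with bases U, C, A, G). The names `CODON_TABLE` and `translate` are illustrative, and `translate` simply reads from the first base and stops at the first stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna: str) -> str:
    """Translate RNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(len(CODON_TABLE))                     # 64
print(CODON_TABLE["CUU"])                   # L (Leucine)
print(len(set(AMINO_ACIDS) - {"*"}))        # 20
print(translate("AUGCUUUAA"))               # ML
```

Three of the 64 entries are stop codons, so 61 codons are shared among 20 amino acids and most amino acids have more than one codon.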

Cell Differentiation

• Eukaryotes have many different cell types (skin, muscle, neurons, etc.) that each play a different role.

• To accomplish the cell’s role, different genes must be activated

• Problems:

– How are genes activated? What regulatory patterns are in the DNA?

– What genes control other genes? What network associations among genes can be found?

– What genes are “differentially expressed”?
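One common first look at differential expression is the log2 ratio of a gene's expression level between two cell types. The sketch below uses made-up expression values and hypothetical gene names, and the threshold and pseudocount are arbitrary choices. A real analysis would use replicate measurements and a statistical test rather than a bare cutoff.

```python
import math

def log2_fold_change(level_a: float, level_b: float, pseudocount: float = 1.0) -> float:
    """log2 of (level_b / level_a); the pseudocount avoids division by zero."""
    return math.log2((level_b + pseudocount) / (level_a + pseudocount))

def differentially_expressed(expr_a: dict, expr_b: dict, min_abs_log2fc: float = 1.0) -> dict:
    """Genes whose expression differs by at least the given log2 fold change."""
    result = {}
    for gene in expr_a:
        if gene in expr_b:
            change = log2_fold_change(expr_a[gene], expr_b[gene])
            if abs(change) >= min_abs_log2fc:
                result[gene] = change
    return result

# Invented expression levels for two cell types.
skin = {"geneA": 7, "geneB": 100, "geneC": 31}
muscle = {"geneA": 63, "geneB": 105, "geneC": 3}
print(differentially_expressed(skin, muscle))  # {'geneA': 3.0, 'geneC': -3.0}
```

A value of 3.0 means roughly eightfold higher expression in the second cell type, and -3.0 means roughly eightfold lower.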

Cell Differentiation

Differential Expression

• Interleukin 1 alpha expressed in different cell types

Protein Sequence, Structure, Function

• Lastly, given a protein sequence, what is the 3-D structure and function?

• The most common approach is to exploit conservation (see earlier)

• Problem:

– Find similar proteins to my query protein. Maybe I can assign structure or function to my new query protein, if structure or function is already known for a homologous protein. (Sequence similarity searching, protein family modeling)
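To make the search problem concrete, here is a deliberately crude sketch that ranks database proteins by the fraction of 3-letter subsequences (k-mers) they share with the query. This is an alignment-free stand-in chosen for brevity, not how tools such as BLAST work. The sequences, names, and the choice of k are all invented for illustration.

```python
def kmers(seq: str, k: int = 3) -> set:
    """All overlapping length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two k-mer sets (0 = nothing shared, 1 = identical sets)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

def rank_hits(query: str, database: dict, k: int = 3) -> list:
    """(score, name) pairs for every database entry, best match first."""
    scored = [(kmer_similarity(query, seq, k), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)

# Made-up protein sequences; protA differs from the query by one residue.
database = {"protA": "MKTAYIAKQK", "protB": "GGSLLVPWWE"}
for score, name in rank_hits("MKTAYIAKQR", database):
    print(f"{name}: {score:.2f}")
```

If the best hit has known structure or function, that annotation becomes a hypothesis for the query protein.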

Protein Structure

Further Reading

• Many online intros to genome biology

– E.g. http://www.ncbi.nlm.nih.gov/About/primer/

• Any molecular biology text

– E.g. Molecular Biology of the Cell by Alberts et al. or Genomes by Brown.