Topics in Bioinformatics
CS832b
Bin Ma
Lecture 1: Basic
Three molecules we will study
• DNA
  • A string over alphabet {A,C,G,T}
• RNA
  • Primary structure – a string over alphabet {A,C,G,U}
  • Secondary and tertiary structures
• Protein
  • Primary structure – a string over alphabet {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
  • Secondary and tertiary structures
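Since each primary structure is just a string over a fixed alphabet, a quick sanity check is to test which alphabets a given string fits. The following is a minimal sketch; the function name `possible_types` is made up for illustration and is not part of the lecture.

```python
# Alphabets as listed on the slide.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
PROTEIN_ALPHABET = set("ARNDCQEGHILKMFPSTWYV")


def possible_types(sequence):
    """Return the molecule types whose alphabet contains every symbol of `sequence`."""
    symbols = set(sequence.upper())
    alphabets = {
        "DNA": DNA_ALPHABET,
        "RNA": RNA_ALPHABET,
        "protein": PROTEIN_ALPHABET,
    }
    # A string fits an alphabet when its symbol set is a subset of that alphabet.
    return [name for name, alphabet in alphabets.items() if symbols <= alphabet]


print(possible_types("AGTAGCCTATGCGA"))  # ['DNA', 'protein']
print(possible_types("AGUAGC"))          # ['RNA']
print(possible_types("MCSYIRYDTPK"))     # ['protein']
```

Note that A, C, G and T are also amino acid letters, so a DNA string always passes the protein check as well; the alphabet alone cannot tell the two apart.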
DNA
5’…AGTAGCCTATGCGA…3’
   ::::::::::::::
3’…TCATCGGATACGCT…5’
5’…AGTAGCCTATGCGA…3’
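Because the two strands pair A with T and C with G and run in opposite directions, writing one strand 5’ -> 3’ is enough to recover the other. A minimal sketch of that computation (the helper names are illustrative only):

```python
# Base-pairing rule: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Complement each base; the result reads 3'->5' if the input reads 5'->3'."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand):
    """Return the opposite strand written in the usual 5'->3' direction."""
    return complement(strand)[::-1]


top = "AGTAGCCTATGCGA"
print(complement(top))          # TCATCGGATACGCT  (bottom strand, 3'->5')
print(reverse_complement(top))  # TCGCATAGGCTACT  (bottom strand, 5'->3')
```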
>CHRXGATCACCTGACATCAGGAGTTCAAGACCAGCCTGCCAACGTGGTGAAACCCCATCTCTACTAAAAATAGGAAATTCACCTGGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGAAGAATCGCTTGAACCCAGGAGGTGGAGATTGCACTGAGCTGAGATCACGCCACTGCGCTCCAGCCTGGGTGACAGAGCAAGACTCCATAAAAAAAAAAATTATAACCTAATGATTAAATACTGTAGGGAAGAGCTTACCACAATTGCTGGCCCATGGCCAATGCTGGGTATAAGACAGCTACTGCAAACAACCATGATGATGATACATCTCTTGTGTAGGGTTAGGTTGTTTGAGACACATTCTATGCTCCTTGATTTGATTGGAAGGTACCTTGGTTCCTTGGGGACTTGGAGGTGACGAAAGCCTCCCTGGGGACAAAACTCACCTTCACTTCTCTAATATCAAGCTTCAGCAACCTGCTCCAGCTACAGCACAGGGTTGGACAGGCCCAACAACAGAGGAAATCCACAAAGTGTGTCTTGACACATACATCCACGGGGTCTAACGAGGTGAGGCCAATGACTGCTTCCACACACCCCAGCCAGACTCTGACTTCACTCCCGGCAGGTTTCAGTAGACTTGGCAGCAGTTGGAGCGAGCTGGCTTCTTGCGGTAGGCAGCCATGTTGGAAGAGCTCCCAATAGTCCTCGTTTCCTGGTAATCTCATGCTTGGATCATCTTCTTCTCTTGAGTGAAGAGAAGAACTGCAGAGAGAGACAGAGACAGAGAGACAGATCACAGGGGCAGTTTCCCCCATACTGTTCTCAAGATAAATGAGTCAACTCTTACACCTCTTTTCTCTGGTGTAAAACAAGGCTGGTGAACAGGCAGAGAGAACTGGGGTGTTGGAGTAGCATTGACCTTCCTTCTTCATCCCTCTATAATCTCTCCTAGTGCAGGAGTAGGAAAACTAAAAATCACACGTCTGATCATCTGTGATCTCAGAGTCTTGGACAAGCCTTGCTTGCCAATCAGCAGGGATGGGAGTTGGAGCCATCTCCAAGTGTCCCCCCACAAATCTATGTCCACCTGGAAGTTTCAAATGCAACTTTATTTGGGAAAGGCAATTTTGCAAATGTTATTAAGTGAAGGATCTAGGGATGAGATCATCCTGGAGTAGGGTGGGTCCTAGGTCAAATGACAGGAAATCTGCCCACCTCGGCCTCCCAAAGTGCTGGGATTACAGGCATGAGCCACCAAACCTGGCCTATCATTGATTTAATGATTAATACGGTTAGGCTCTGTGTCCCCACCCAAATCTCATCTCAAATTGTAATTCCCATGTGTCCAGGGAGGGAGCTTGTGGAAGGTGATTGGATCACAGGGGCAGTTTTTGTCATGCTGTTCTCATGATAAATGAGTCAATTCTCAGAAGAGATGATGGTTTTAAAGTGTGGCACTTCTTTGCTCTCTTGCTCTCTCTCTCTCCTGAGTAGACTGGCTCATTCTTTCTACTGGTTACAAGCAATAGAAGTGATAACAAAATTGATGGTTTCTCATTTCCTAAATGGTACCAGTGGATTCCTGGTTTCCTCTCTCTCTCTTCTCTCTCTCTATCAACTTTTCCCTCAATCTCTCTATCAACCTCCCTCTCTCTCAATCTCAATCTCTCTCAGTCTCATTCTCAATCTCTTTTGCTCAATCTCTTTCTCAGCTTCTCTCCCTCAATTTCTCTTTTGCAACTTCTCTCTCTCAGTCTGTGTCTCTCAATCTCCCTCTCTCAATCTCTCTTGTAGTCTCCCTGTCTCTCATACTCTCTCTGTTTCTGTCTGTCTCTGCCCTTGCTCTAGGGAAAGCAAGTTCTTATGCTGTAAGTTCTCCTGTAAAAAGGTCCACATGATACGGAACTGGCCATCTTTGGCCAACATGAGTGAGTTTAGAAGTGTGCCTTTCACCAGTTGAGCCTTCAAATGAGATCCCAGCCCTGG
ATGACACAGTGACAGTAACCTGCTAGGAACTGTGAACCAGAGGCACCCAGCCAAGCTGCTCCCAGACTCCCAACCCAGTGAAACCATAAGATAATAAATGCATGTTGTTTTAAGCTGCTAAGTTTGGGGGTCACTTGTTACACAGCAACAGCTGACTCATACATTTTCTTTGAAATTGATTTCCACTTCTGTCACCAGCATCATTCCATAAATTTGCTCTATGTGCATTGCTGACCTGCAGTAGAAGTTTTGGAGAAGTGAACCACATCCCCTTATCTGCCATTTGACAGCAAGCAGCCTCAAACATTCATAATTTCTTTCCTGACTCTCCACTCCACACTGTTGCCTGCCTTCCTGGTTCCAGATCTTTGGATCTGGACTGACACCTGGGCACTGTCATAGGCATCCGTGTGAAGAGACCACCAACAGGCTCTGTGTGAGCAATAAAGCTTTTTAATCACCTGGGTGCAGGTGGGCTGATTCTGAAAAGAGAGTCAGCAAAGAGTGGTGGGATTATCATTAGTTCTTATAGGTTCGGGATAGGTGGTGGAGTTAGGAGCAATTTTTTGTGGGCAGGGAGTGGATCTTACAAAGGACATTCTCAAGGGTGGGGATGATTTTACAAAGTACCTTCTTAAGGGCGGGGGAGGATATTACAAAGTACCTTCTCAAGGGTGGGGATGATTTTACAAAGTACCTTCTTAAGGGCGGGGGAGGATATTACAAAGTACCTTCTCAAGGGTGGGGGTGGATATTACAAAGTACCTTCTTAAGGGCAGGGGAGGATATTACAAAGTACCTTCTCAAGGGGGGGGATGATTTTACAAAGTACCTTCTTAAGGGCGGGGGAGGATATTACAAAGTACCTTCTCAAGGGTGGGGGTGGATATTAGAAAGTACCTTCT
• The human genome has 23 pairs of chromosomes; chromosome X is one of the two sex chromosomes.
• Chromosome X has about 162 million base pairs.
Genome Sizes
Species                               Size in bps
Amoeba dubia                          670,000,000,000
Homo sapiens                          3,400,000,000
Drosophila melanogaster               180,000,000
Mycoplasma genitalium                 580,000
Human immunodeficiency virus type 1   9,750
Protein and Amino Acids
Protein
Figure: GOT (E. coli)
A protein sequence
>gi|7228451|dbj|BAA92411.1| EST AU055734(S20025) corresponds to a region …
MCSYIRYDTPKLFTHVTKTPPKNQVSNSINDVGSRRATDRSVASCSSEKSVGTMSVKNASSISFEDIEKSISNWKIPKVN
IKEIYHVDTDIHKVLTLNLQTSGYELELGSENISVTYRVYYKAMTTLAPCAKHYTPKGLTTLLQTNPNNRCTTPKTLKWD
EITLPEKWVLSQAVEPKSMDQSEVESLIETPDGDVEITFASKQKAFLQSRPSVSLDSRPRTKPQNVVYATYEDNSDEPSI
SDFDINVIELDVGFVIAIEEDEFEIDKDLLKKELRLQKNRPKMKRYFERVDEPFRLKIRELWHKEMREQRKNIFFFDWYE
SSQVRHFEEFFKGKNMMKKEQKSEAEDLTVIKKVSTEWETTSGNKSSSSQSVSPMFVPTIDPNIKLGKQKAFGPAISEEL
VSELALKLNNLKVNKNINEISDNEKYDMVNKIFKPSTLTSTTRNYYPRPTYADLQFEEMPQIQNMTYYNGKEIVEWNLDG
FTEYQIFTLCHQMIMYANACIANGNKEREAANMIVIGFSGQLKGWWNNYLNETQRQEILCAVKRDDQGRPLPDRDGNGNP
TELKEGFHMEEKDEPIQEDDQVVGTIQKYTKQKWYAEVMYRFIDGSYFQHITLIDSGADVNCIREDEILDQLVQTKREQV
VNSIYLHDNSFPKSMDLPDQKITEKRAKLQDIPHHEERLLDYREKKSRDGQDKLPMEVEQSMATNKNTKILLRAWLLST
A protein sequence may have from a few hundred to several thousand amino acids.
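The record above is in FASTA format: a header line starting with ">" followed by the sequence wrapped over several lines. A minimal sketch of a reader for such text is given below; `parse_fasta` is a hypothetical helper written for this note, not a library function, and the two-record input is a toy example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (made up for illustration).
example = """>seq1 first toy record
MCSYIRYD
TPKLF
>seq2 second toy record
AGTAGCC
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```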
RNA
Animal cell
Figure labels: nucleus, chromatin, mitochondrion, nucleolus (rRNA synthesized), plasma membrane, cell coat, cytoplasm
Protein synthesis
Genetic code
..ATT CAC AGT GGA..
   I   H   S   G
Notes on translation
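• Reading frame
• Start and end codon
• Third base often not important (codons that differ only in the third base frequently code for the same amino acid)
• Read in the 5’ -> 3’ direction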
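The points above can be illustrated with a small translation routine. This is a sketch under stated assumptions: it uses the standard genetic code written on DNA codons (T instead of U), marks stop codons with "*", and the `frame` parameter is an illustrative way to pick the reading frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" is a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate `dna` 5'->3', three bases at a time, starting at offset `frame`."""
    dna = dna.upper()
    protein = []
    # Stop before a trailing fragment shorter than one codon.
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE[dna[i:i + 3]])
    return "".join(protein)


print(translate("ATTCACAGTGGA"))           # IHSG (the example from the slide)
print(translate("ATTCACAGTGGA", frame=1))  # FTV  (a different reading frame)

# Third base not important here: all four GGx codons give glycine.
print({CODON_TABLE["GG" + base] for base in BASES})  # {'G'}
```

Shifting the reading frame by one base changes every codon, which is why the frame has to be fixed before translation makes sense.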
DNA replication
The Central Dogma of Molecular Biology
DNA (genotype) --transcription--> RNA --translation--> Protein (phenotype)
replication: DNA -> DNA
Exception – retroviruses
DNA (genotype) --transcription--> RNA --translation--> Protein (phenotype)
RNA --reverse transcription--> DNA (retroviruses)
replication: DNA -> DNA
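At the string level, transcription and its retroviral reverse can be sketched very simply. The simplification assumed here is that the mRNA has the same sequence as the coding (sense) DNA strand with U in place of T; the function names are illustrative.

```python
def transcribe(coding_strand):
    """DNA -> RNA: the mRNA matches the coding strand with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def reverse_transcribe(rna):
    """RNA -> DNA, the direction used by retroviruses: U replaced by T."""
    return rna.upper().replace("U", "T")


mrna = transcribe("ATTCACAGTGGA")
print(mrna)                      # AUUCACAGUGGA
print(reverse_transcribe(mrna))  # ATTCACAGTGGA
```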
Figure labels: DNA (Genotype), Protein, Phenotype, Biology
Genes
• One gene encodes one protein (or sometimes an RNA).
• Like a program, a gene starts with a start codon (e.g. ATG); each following group of three bases (a codon) then codes for one amino acid, and a stop codon (e.g. TGA) marks the end of the gene.
• Genes are dense in prokaryotes and sparse in eukaryotes.
• In the middle of a eukaryotic gene there are introns, which are spliced out (as junk) after transcription. The parts that are kept are called exons. Locating genes and their exons in a genome is the task of gene finding.
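The "start codon, codons, stop codon" picture suggests the simplest possible gene finder: scan each reading frame for ATG followed in-frame by a stop codon. The sketch below does only that, on one strand and without introns, so it is closer to the prokaryotic case; `find_orfs` and `min_codons` are illustrative names, not an established tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, orf) for each open reading frame on the given strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


print(find_orfs("CCATGAAATGGTGACC"))  # [(2, 14, 'ATGAAATGGTGA')]
```

Real gene finding in eukaryotes is much harder than this, because the coding sequence is interrupted by introns and an in-frame scan of the genomic DNA no longer works.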
Introns and Exons
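As a string operation, splicing removes the intron intervals from the transcript and joins the remaining exons. The sketch below assumes the intron coordinates are already known (finding them is the hard part); the sequence and coordinates are made up for illustration, with the intron in lower case starting with "gu" and ending with "ag" as introns typically do.

```python
def splice(pre_mrna, introns):
    """Remove introns, given as half-open (start, end) intervals, and join the exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)


# Toy transcript: exon 1, intron (lower case), exon 2.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "UGGUGA"
print(splice(pre_mrna, [(6, 18)]))  # AUGGCCUGGUGA
```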
Jumping genes
• Some genes (transposons) can move, or "jump", to a different position in the genome.
Gene related diseases
• Hemophilia: gene on the X chromosome.
• Sickle-cell anemia: a single-nucleotide mutation in the first exon of the beta-globin gene (it removes a restriction-enzyme cutting site). 1 in 12 African Americans is a carrier; homozygotes have the disease.
• BRCA1 gene (chr. 17q): responsible for about ½ of inherited breast cancer (inherited cases are about 10% of breast cancer).
• Fragile X syndrome (intellectual disability): 1 in 1250 males, 1 in 2500 females (dominant, but females have a partially expressed good copy of the gene). FMR-1 gene: more than 200 tri-nucleotide repeats cause the disease.
• P53 gene: chr. 17p, implicated in about ½ of all cancers.
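The Fragile X criterion is a counting problem on strings: find the longest run of back-to-back copies of a three-base unit and compare it with the threshold of 200. A minimal sketch follows; the slide does not name the repeated unit, so the default "CGG" (the unit known to expand in FMR-1) and the function name are assumptions added here, and the test sequences are synthetic.

```python
import re


def longest_repeat_run(sequence, unit="CGG"):
    """Return the largest number of consecutive copies of `unit` in `sequence`."""
    pattern = re.compile("(?:%s)+" % re.escape(unit))
    runs = [len(match.group(0)) // len(unit)
            for match in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


# Synthetic examples, not real FMR-1 data.
short_allele = "AA" + "CGG" * 5 + "T" + "CGG" * 2
long_allele = "AA" + "CGG" * 230 + "TT"

print(longest_repeat_run(short_allele))        # 5
print(longest_repeat_run(long_allele) > 200)   # True: above the disease threshold
```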