Genes and Genomes

Post on 18-Jan-2016






  • Genes and Genomes

  • Genome On Line Database (GOLD)
    - 243 published complete genomes
    - 536 prokaryotic ongoing genomes
    - 434 eukaryotic ongoing genomes
    - December 2004: 1245 genome projects

  • Common Genome Browsers


    - Eukaryote only: UCSC, Ensembl
    - Prokaryote only: MGV, TIGR


  • What can we learn from genomes?
    - Genes
    - Splice variants
    - Variation analysis
    - Promoters
    - Comparative genomics
    - Evolution

  • Alternative splice variants

  • Looking for genes in genomes
    - Existing mRNA and EST data
    - Gene prediction programs
    - Comparative genomics

  • ESTs (Expressed Sequence Tags)
    - cDNA provides one of the best tools for identifying genes in a genome.
    - For unsequenced genomes, ESTs were the primary source for identifying genes.
    - Basic strategy: select cDNA clones at random and perform a single automated read from one or both ends of the transcript. Many clones will be redundant.
    - Very cost effective.
    - ESTs are short (400-600 bases) and relatively inaccurate (about 2% error).
    - ESTs are matched to known genes using a relatively small region of sequence alignment.
    - Used to discover genes, alternative splice variants, etc.

  • Problems with ESTs
    - Incomplete coverage: bias toward high copy number genes
    - Experimental mistakes: not always reliable
    - Enrichment of 3' ends of genes
    - High representation of cancer cells

    Usage of ESTs
    - Predicting coding regions
    - Detecting alternative splicing
    - Clustering to form genes

  • RefSeq database (NCBI)
    The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

    RefSeq standards serve as the basis for medical, functional, and diversity studies; they provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses.

  • Gene Finding Approaches
    - Learn the characteristics of known genes.
    - Search for new genes using these characteristics.
    - Different types of genes have different characteristics.

    Prediction Status
    - The problem of gene prediction is still very much open, even in well studied genomes.
    - The estimated number of genes in human keeps changing.

  • Gene Finding
    Input
    - Chromosomal genetic sequence
    Output
    - Region which encodes a gene
    - Strand and reading frame
    - Start and end of the coding sequence
    - Exon-intron boundaries

  • Prokaryotes vs. Eukaryotes
    Prokaryotes and eukaryotes require different gene finding strategies.
    Prokaryotes:
    - The genome is compact (shorter intergenic regions, no introns).
    - Several genes may reside on the same mRNA, in different reading frames.
    - Promoter regions are more conserved.
    Eukaryotes:
    - Large genomes; intron/exon structure; alternative splicing; pseudogenes; very long intergenic regions.
    - The human genome: average gene ~27,800 bases, with about 8 exons of ~100 bases each and introns of 100-30,000 bases.

  • ORF Finding
    - Open reading frames (ORFs) are sequences that presumably code for proteins.
    - How can ORFs be detected?
      - All reading frames are checked.
      - Search for initiation and termination codons within a sequence.
    - Are these codons totally conserved?
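
    The search described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: it assumes ATG as the only initiation codon and TAA/TAG/TGA as the only termination codons, which is a simplification since these codons are not totally conserved. The function name `find_orfs` and the `min_codons` parameter are hypothetical names chosen for the example.

```python
from typing import List, Tuple

START = "ATG"                      # assumed single initiation codon
STOPS = {"TAA", "TAG", "TGA"}      # assumed termination codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 2) -> List[Tuple[str, int, str]]:
    """Scan all six reading frames and return (strand, frame, orf) tuples.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon; min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOPS:
                    orf = s[start:i + 3]           # include the stop codon
                    if len(orf) // 3 >= min_codons:
                        orfs.append((strand, frame, orf))
                    start = None                   # look for the next ORF
    return orfs


if __name__ == "__main__":
    for strand, frame, orf in find_orfs("CCATGAAATAGCC"):
        print(strand, frame, orf)
```

    Frames on the "-" strand are numbered relative to the reverse-complemented sequence. Real genomes contain many spurious ORFs of this kind, which is why the compositional signals on the following slides are needed.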

  • Protein-coding Gene Characteristics
    - GC content
    - Uneven codon usage
    - Amino acid bias
    - Species-preferred codons
    - Promoter and splicing signals
    These characteristics may aid in:
    - Prediction
    - Validation

  • Codon Usage
    - Coding DNA is not a random choice among the possible codons for each amino acid.
    - It is an ordered list of codons that reflects evolutionary origin and constraints related to gene expression.
    - Each species has its own coding preferences (codon usage).

  • The genetic code
    - Each amino acid is coded by 3 nucleotides, called a codon.
    - Code redundancy: most amino acids are coded by several codons.
    - The 64 triplets code for 20 amino acids and 3 stop signals (61 amino acid codons plus 3 stop codons).

    The Genetic Code
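
    The redundancy of the code can be checked directly. The sketch below builds the standard genetic code from a compact 64-letter string in T, C, A, G order and counts how many codons map to each amino acid. The names `CODON_TABLE` and `translate` are hypothetical helpers for this example, and only the standard code is assumed (some organisms and organelles use variants).

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    degeneracy = Counter(CODON_TABLE.values())
    print(len(CODON_TABLE), "codons,", len(degeneracy) - 1, "amino acids")
    print("stop codons:", sorted(c for c, aa in CODON_TABLE.items() if aa == "*"))
    print("codons for leucine:", degeneracy["L"])
    print(translate("ATGGCCTGA"))
```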

  • Codon Usage Database
    The site provides:
    - Codon usage tables per organism
    - Computation of codon usage for query coding sequences
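
    Computing codon usage for a query coding sequence amounts to counting in-frame triplets. The following is a minimal sketch of that idea, not the database's own code; the function name `codon_usage` is hypothetical, and the sequence is assumed to start in frame at its first base.

```python
from collections import Counter
from typing import Dict


def codon_usage(cds: str) -> Dict[str, float]:
    """Return the fraction of each codon among all codons in the sequence.

    Assumes the sequence starts at codon position 1; trailing bases that
    do not fill a complete codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    usage = codon_usage("ATGGCTGCTGCCTAA")
    for codon, fraction in sorted(usage.items()):
        print(codon, round(fraction, 2))
```

    In this toy sequence alanine is coded twice by GCT and once by GCC. A preference of that kind, measured over many genes, is what a per-organism codon usage table summarizes.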

  • Codon Usage Preferences
    Codon usage differs between highly and weakly expressed genes. E. coli genes were divided into 3 groups based on their codon usage:
    - regular genes (70%)
    - highly expressed genes (15%)
    - horizontally transferred genes (15%)
    There are strong preferences in ORFs for specific codon pairs and for specific codons near terminators.

    The base in the third position of each codon tends to repeat itself within the same ORF.

  • Sequence Signals
    Prokaryotes:
    - Promoter (-35 and -10 from the transcription start site, TSS)
    - The ribosome binding site (Shine-Dalgarno) is conserved. It is located around position -15, upstream of the AUG.

    Eukaryotes:
    - Transcription signals: TATA box (about -30 from the TSS), cap signal, polyadenylation site. Any signal may be missing.
    - Translation signals: Kozak signal (immediately upstream of the ATG).
    - Splicing signals: recognized by the spliceosome. Introns usually start with GT and end with AG.
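
    The GT...AG rule alone gives a very crude intron detector, sketched below. It enumerates every span that starts with GT and ends with AG within a length window. The function name `candidate_introns` and its parameters are hypothetical, and the default window simply reuses the 100-30,000 base intron range quoted earlier in these slides. Such a scan over-predicts heavily, so real tools score donor and acceptor sites rather than just matching dinucleotides.

```python
import re
from typing import List, Tuple


def candidate_introns(seq: str, min_len: int = 100,
                      max_len: int = 30000) -> List[Tuple[int, int]]:
    """Return (start, end) half-open spans that begin with GT and end with AG."""
    seq = seq.upper()
    # Lookahead patterns so that overlapping matches are all reported.
    donors = [m.start() for m in re.finditer(r"(?=GT)", seq)]
    acceptors = [m.start() + 2 for m in re.finditer(r"(?=AG)", seq)]
    return [
        (d, a)
        for d in donors
        for a in acceptors
        if min_len <= a - d <= max_len
    ]


if __name__ == "__main__":
    seq = "AAGTCCCCAGTT"
    for start, end in candidate_introns(seq, min_len=6):
        print(start, end, seq[start:end])
```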

  • Computational Approaches to Prediction
    - Various computational methods are used, including decision trees, neural nets, Markov models and hidden Markov models (HMMs).
    - A model is learned from known genes and then applied to genomic sequences.
    - Each genome defines its own model.

  • Markov Models
    - Probabilistic approach.
    - Modeled by states and the probability of transition from one state to the next.
    - The probability of being at state X in step i depends only on the state reached at step i-1.

    It has been found that ORFs have a reading-frame-specific hexamer (6-mer) composition.
    - The probability of the 6th base can therefore be computed from the previous 5.
    - The probability that a sequence is an ORF in a specific reading frame can be computed from its 6-mer composition.
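
    The hexamer idea corresponds to a fifth-order Markov model with one set of probabilities per codon position. The sketch below trains such a model on known coding sequences and scores a query in a chosen frame. It is an illustration under simple assumptions: training sequences start at codon position 1, unseen contexts are handled with add-one pseudocounts, and the names `train` and `log_prob` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Iterable, List

ORDER = 5  # the previous 5 bases predict the 6th


def train(coding_seqs: Iterable[str]) -> List[dict]:
    """Count 6-mers separately for each codon phase (0, 1, 2).

    The result is indexed as counts[phase][context][next_base].
    """
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(ORDER, len(seq)):
            counts[i % 3][seq[i - ORDER:i]][seq[i]] += 1
    return counts


def log_prob(seq: str, counts: List[dict], frame: int = 0,
             pseudo: float = 1.0) -> float:
    """Log-probability of seq under the model, assuming it starts in `frame`."""
    seq = seq.upper()
    total = 0.0
    for i in range(ORDER, len(seq)):
        phase = (i - frame) % 3
        context = counts[phase].get(seq[i - ORDER:i], {})
        numerator = context.get(seq[i], 0) + pseudo
        denominator = sum(context.values()) + 4 * pseudo
        total += math.log(numerator / denominator)
    return total


if __name__ == "__main__":
    model = train(["ATGGCTGCTGCTGCTGCTGCTTAA"])
    query = "GCTGCTGCTGCT"
    for frame in range(3):
        print(frame, round(log_prob(query, model, frame), 3))
```

    On this toy training set the query scores best in frame 0, the frame it shares with the training sequence. A real model would be trained on many known genes of the same genome.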

  • Grail II for finding Exons (Neural Network)
    Input layer:
    - Score of 6-mers
    - Score of 6-mers in flanking region
    - Markov model score
    - GC composition
    - GC composition in flanking region
    - Score for splicing acceptor
    - Score for splicing donor
    Hidden layer
    Output: exon score

  • GenScan (HMM)
    - One of the most accurate programs
    - Best for human/vertebrate sequences
    - Markov parameters for different regions:
      - Introns in 3 phases
      - Exons: first, intermediate, last
      - Promoter region
      - 3' and 5' untranslated regions
      - Intergenic regions

  • HMM for a GC-rich intronic region
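
    To make the HMM idea concrete, here is a toy two-state model decoded with the Viterbi algorithm. It is only a sketch: the two states, the transition probabilities and the emission probabilities are made-up illustrative values, not parameters from GenScan or from the figure on this slide.

```python
import math
from typing import List

STATES = ("GC_rich", "background")
# All probabilities below are illustrative assumptions.
START_P = {"GC_rich": 0.5, "background": 0.5}
TRANS = {
    "GC_rich": {"GC_rich": 0.9, "background": 0.1},
    "background": {"GC_rich": 0.1, "background": 0.9},
}
EMIT = {
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq: str) -> List[str]:
    """Return the most probable state path for a DNA sequence (log space)."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START_P[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        prev = score
        score, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][base]))
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("ATATATATGCGCGCGCGC"))
```

    Gene finders of this family use many more states (exons, introns in each phase, promoters, untranslated regions), but the decoding principle is the same.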

  • The General Scheme
    1. Obtain new genomic DNA sequence.
    2. i. Translate in all 6 reading frames and compare to protein databases.
       ii. Perform a database similarity search against the expressed sequence tag (EST) database of the same organism, or against cDNA sequences if available.
    3. Use a gene prediction program to locate genes.
    4. Analyze regulatory sequences and signals in the gene. This can help characterize putative genes.

  • Other Gene Finding Tools
    GeneMark (prokaryote, eukaryote) (bacteria, archaea) (human, mouse, arabidopsis) (vertebrate, C. elegans)

  • Prediction EvaluationPrediction tools are compared using two criteria:

    - Sensitivity: the percentage of true genes in the genome that are correctly predicted, TP / (TP + FN).
    - Specificity: the percentage of true non-genes that are correctly predicted as non-genes, TN / (TN + FP).

    Both need to be high, results vary from genome to genome

    Accuracy comparisons tested on vertebrates
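
    The two criteria can be computed at the nucleotide level as in the sketch below, which follows the definitions on this slide. The representation of annotations as sets of coding positions, and the function names, are assumptions made for the example. Note that some gene-prediction benchmarks report TP / (TP + FP) under the name "specificity" instead, so it is worth checking which definition a comparison uses.

```python
from typing import Set, Tuple


def confusion_counts(true_coding: Set[int], predicted_coding: Set[int],
                     length: int) -> Tuple[int, int, int, int]:
    """Return (TP, FN, FP, TN) over positions 0..length-1."""
    tp = len(true_coding & predicted_coding)
    fn = len(true_coding - predicted_coding)
    fp = len(predicted_coding - true_coding)
    tn = length - tp - fn - fp
    return tp, fn, fp, tn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true coding positions that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-coding positions predicted as non-coding: TN / (TN + FP)."""
    return tn / (tn + fp) if tn + fp else 0.0


if __name__ == "__main__":
    tp, fn, fp, tn = confusion_counts({0, 1, 2, 3}, {2, 3, 4}, length=10)
    print("sensitivity:", sensitivity(tp, fn))
    print("specificity:", round(specificity(tn, fp), 3))
```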


  • Functional RNA Genes
    - RNA genes are transcribed but not translated, so no codon preference exists.
    - How can rRNA, tRNA and small RNA genes be predicted?
    - Promoter regions can be characterized, but this remains a big challenge.
    - RNA secondary structure is important. It can be predicted using RNA structure prediction tools (e.g. the MFOLD tool).
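
    As a rough illustration of secondary structure prediction, the sketch below uses the classic Nussinov base-pair maximization recurrence. This is a deliberately simpler stand-in, not the method used by MFOLD, which minimizes free energy with experimentally derived parameters. The function name `max_base_pairs` and the `min_loop` parameter are hypothetical, and G-U wobble pairs are assumed to be allowed.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (Nussinov dynamic programming).

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    rna = rna.upper().replace("T", "U")
    n = len(rna)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                    # case: base j is unpaired
            for k in range(i, j - min_loop):       # case: base j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAACCC"))
```

    For the hairpin-like sequence GGGAAACCC the three G-C pairs close a loop of three A bases.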

  • Comparative genomics
    - Finding orthologs
    - Looking for genes in one species not found in another
    - Searching for conserved regulatory elements
    - Gene clusters
    - Conserved regulatory networks

  • Conservation of the IGFALS gene (insulin-like growth factor binding protein, acid labile subunit) between human and mouse.