decoding encode


Post on 18-Feb-2016








  • Decoding ENCODE

    Jim Kent, University of California Santa Cruz

  • ENCODE Timeline

    • ENCyclopedia of DNA Elements. Attempt to catalog as many functional elements in human genome as possible using current technologies.
    • Pilot project - finished 2007, covered 1% of genome.
    • Production project - ramping up now. Genome-wide. Should have major amounts of data in 6 months.

  • ENCODE Experiments

    • Chromatin state:
      • DNA Hypersensitivity assays
      • Chromatin Immunoprecipitation (ChIP)
        • Histones in various methylation states
        • Sequence-specific transcription factors
      • DNA methylation
      • Chromatin conformation capture (5C)
    • Functional RNA discovery
      • Nuclear & cytoplasmic, short & long
      • RNA Immunoprecipitation
    • Comparative Genomics
    • Human curated gene annotation

  • Role of UCSC

    • Display data in context of what else is known on the UCSC Genome Browser and in other tools.
    • Facilitate analysis of the data with both Web-based and command line tools.

  • A Peek at the Pilot Project

  • ENCODE pilot data at

  • Correlation at gene starts in enr221

  • Transcription at enm221

  • ENCODE Chromatin Immunoprecipitation

  • Scientific Highlights of Pilot

    • Transcription:
      • Lots of transcription outside of known genes.
      • Outside of known genes transcribed areas not very well conserved across species.
      • Lots of rare splice variants, also poorly conserved.
    • DNA/Protein Interactions
      • Good correlation between histone markers, gene starts, and _active_ transcription.
      • Lots of occupied transcription factor binding sites not conserved, near promoters etc.
    • Biological noise?
      • Main controversy was whether to explain much of the data as biological noise that was tolerated but not necessary for function.

  • From Pilot to Production Phase

  • ENCODE Production Phase

    • Moving from microarray based assays to assays based on next-generation sequencing. (ChIP-chip to ChIP-seq)
    • Genome-wide rather than regional.
    • Broader set of cell lines used more consistently between labs.
    • Broader set of antibodies.
    • Some new technology development continues.

  • ENCODE Cell Lines

    • Tier 1 - used in ALL experiments
      • GM12878 (lymphoblastoid cell line)
      • K562 (chronic myeloid leukemia)
    • Tier 2 - used in most experiments
      • HepG2 (hepatocellular carcinoma)
      • HeLa-S3 (cervical carcinoma)
      • HUVEC (umbilical vein endothelial cells)
      • Keratinocyte (normal epidermal cells)
      • Likely will do an embryonic stem cell line too.
    • Tier 3 - used in one or two experiments
      • Many of these for assays such as DNase hypersensitivity and RNA measurements, where we don't have to do a separate experiment for each antibody.

  • Simple Model of Eukaryotic Transcription Regulation

    • Initially chromatin is opened to allow transcription factors to access DNA.
    • Multiple transcription factors bind to DNA in combination.
    • Most factors have such small DNA binding sites that by themselves they are not specific, nor is the binding even stable.
    • The right combination of factors in open chromatin leads to active transcription starting at the initiation complex.
    • With the ENCODE experiments we can directly test most aspects of this model.

  • Chromatin Experiments

    • In general applied across a large number of cell lines.
    • DNaseI hypersensitivity
    • Formaldehyde Assisted Isolation of Regulatory Elements
    • Methylation of CpG Islands
    • ChIP-seq of relevant factors
      • H3K4me1,2,3, H3K9me3, H4K20me3, H3K27me3, H3K36me3, RPol-II, etc.

  • Transcription Factor ChIP

    • Many antibodies in modest number of cell lines.
    • Limited by good antibodies, hope for 100 or more.
    • Current good antibodies include
      • E2F1, E2F4, E2F6, KAP1, L3MBTL2, STAT1, CtBP1, CtBP2, SETDB1, ZNF180, ZNF239, ZNF263, ZNF266, ZNF317, ZNF342
    • Part of project pipeline for raising and testing antibodies.

  • RNA measurement

    • RNA-seq of poly-A selected RNA to measure mRNA levels in many cell lines.
    • Sequencing of G-cap selected tags (CAGE)
    • Sequencing 5' and 3' ends (paired end tags)
    • Measurement of RNAs of several types in several cell compartments of a few cell lines.
      • Long/short, polyA/nonPolyA, associated with proteins/not associated with proteins
      • Nucleus, cytosol, polysomes, chromatin, nucleolus

  • New Pilot Projects Starting to Sprout

  • New Pilot Projects

    • Immunoprecipitation of RNA binding proteins/RNA sequencing.
    • Mapping silencers and enhancers with transient transfection assays.
    • Computational identification of active promoters.
    • Deep comparative sequencing in targeted regions and conservation analysis.
    • Chromatin Conformation Capture Carbon Copy (5C) to capture long range regulatory elements and their targets.

  • ENCODE Timeline

    • Grants funded for 4 years starting Sept 2007.
    • First production data just now starting to roll into UCSC, not quite ready for public display.
    • Data should accumulate quickly over next few years.

  • Data Release Policy

    • Once we have reproducible data (where at least 2 of 3 replicates agree), it should be released to the public within a month.
    • Data is still considered pre-publication!
      • OK to publish a paper using data on a few genes.
      • Please wait for consortium papers before publishing papers doing full genome analysis.
      • Anyone can join the ENCODE consortium analysis group to help us write the papers.
      • We have just ~1 year after data release to write papers; after that it is fair game to publish full genome analysis.
    • If in doubt please contact the consortium via UCSC.

  • Web Works for Mice and Men

  • Mouse ES Cell Chromatin IP

    • Brad Bernstein lab ChIP-seq based experiment on methylated histones now on UCSC Genome Browser.
    • Shows some of the user interfaces that will be used for the ENCODE data.

  • List of mouse chromatin subtracks.

  • Signal densities of entire mouse chromatin data set.

  • The unending quest for genes

  • Gencode Project

    • Project to define structure (exons and introns) for all common splice variants of all genes.
    • Human curators merge many lines of evidence including
      • Computational gene predictions
      • RNA/DNA alignments
      • Paired end tags
      • Cross-species alignments
      • Possibly chromatin state data
    • PI is Tim Hubbard
    • Much of the work done by Havana group

  • Data Mining with Table Browser

  • Table Browser

    • Complete access to UCSC Database with results in tab-delimited format.
    • Method for creating custom tracks by combining and filtering existing tracks.

    Sample query - getting a table of Ensembl gene coordinates and associated Superfamily annotations.

  • Selected fields from related tables results: Ensembl Gene (ensGene) and Superfamily Description (sfDescription).

  • Table Browser Filters

    Getting list of Ensembl genes that have SH3 domains.

  • Table Browser Intersection

    Getting list of Ensembl genes that don't intersect UCSC Known Genes.
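The "does not intersect" test behind a query like this comes down to a range-overlap check. A minimal sketch in C, assuming both gene sets are on the same chromosome and stored as 0-based half-open ranges; `struct range` and `rangesWithoutOverlap` are hypothetical names for illustration and are not taken from the Table Browser or the UCSC code base:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal record: one feature on one chromosome,
 * in 0-based half-open coordinates [start, end). */
struct range {
    int start;
    int end;
};

/* Two half-open ranges overlap when each starts before the other ends. */
static int rangeOverlaps(const struct range *a, const struct range *b)
{
    return a->start < b->end && b->start < a->end;
}

/* Copy into out[] every range of a[] that overlaps nothing in b[].
 * Returns the number of ranges written; out must have room for aCount.
 * Brute force, O(aCount * bCount), to keep the sketch short. */
size_t rangesWithoutOverlap(const struct range *a, size_t aCount,
                            const struct range *b, size_t bCount,
                            struct range *out)
{
    size_t kept = 0;
    assert(out != NULL);
    for (size_t i = 0; i < aCount; ++i) {
        int hit = 0;
        for (size_t j = 0; j < bCount && !hit; ++j)
            hit = rangeOverlaps(&a[i], &b[j]);
        if (!hit)
            out[kept++] = a[i];
    }
    return kept;
}
```

With half-open coordinates, a range ending at 600 and one starting at 600 touch but do not overlap, so no off-by-one adjustment is needed. A real tool would likely sort or index the ranges rather than compare every pair.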

  • Custom Track Output

    • Useful for visualizing results of queries in genome browser.
    • The way to produce more complex queries.

    Here we look at how well genes that are Ensembl but not UCSC are conserved across species.

  • Results:

    • 681/3,329 (20%) of Ensembl genes not in UCSC Known Genes are also not conserved.
    • 1,728/33,666 (5%) of Ensembl genes in general are not conserved.

  • UCSC Gene Sorter

    • Swiss army knife for dealing with gene sets.
    • Highlights relationships and connections between genes.
    • Powerful data mining tool.

  • Cytochrome P450 - a gene family important in drug metabolism. The family is related in many ways. Sorted by protein homology.

  • Various sorting methods let you focus on different types of relationships between genes.

  • Sorting by gene distance is a quick way to browse candidate genes in a region.
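Sorting by distance can be pictured as computing each gene's gap to the selected gene and then ordering by that gap. A minimal sketch in C, assuming all genes are on the same chromosome as the selected gene; `struct geneHit` and its functions are hypothetical and do not reflect how the Gene Sorter is actually implemented:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record for one gene on the same chromosome as the
 * selected gene; coordinates are 0-based, half-open [start, end). */
struct geneHit {
    const char *name;
    int start;
    int end;
    int distance;   /* filled in by geneHitSortByDistance */
};

/* Gap in bases between two ranges; 0 if they overlap or touch. */
static int rangeGap(int aStart, int aEnd, int bStart, int bEnd)
{
    if (aEnd <= bStart)
        return bStart - aEnd;
    if (bEnd <= aStart)
        return aStart - bEnd;
    return 0;
}

/* qsort comparator: ascending distance. */
static int geneHitCmpDistance(const void *va, const void *vb)
{
    const struct geneHit *a = va;
    const struct geneHit *b = vb;
    return (a->distance > b->distance) - (a->distance < b->distance);
}

/* Sort genes[] so the ones nearest to [selStart, selEnd) come first. */
void geneHitSortByDistance(struct geneHit *genes, size_t count,
                           int selStart, int selEnd)
{
    if (count == 0)
        return;
    assert(genes != NULL);
    for (size_t i = 0; i < count; ++i)
        genes[i].distance = rangeGap(genes[i].start, genes[i].end,
                                     selStart, selEnd);
    qsort(genes, count, sizeof genes[0], geneHitCmpDistance);
}
```

The selected gene's coordinates are passed by value rather than as a pointer into the array, because qsort moves elements while it runs.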

  • Clicking on row # or gene name selects that gene.

  • Configuration page controls column order and display options.

  • Also you can upload your own columns here.

  • Controlling expression display

  • GNF Atlas 2 column in median of replicates mode. Actual column includes 79 tissues, slide only fits first half.

  • Sorting based on expression similarity to selected gene.

  • The filters page turns the Family Browser into a powerful data mining tool.

  • Candidate Pancreatic Islet Membrane Genes

    GO-annotated membrane proteins that are expressed at least 8X in pancreatic islet cells and no more than 4X elsewhere outside of pancreas and central nervous system. These might be good candidates for targets of the autoimmune response that can cause Type I diabetes.
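The filter described here combines three conditions, which can be written as a single predicate. A minimal sketch in C, assuming expression is available as one fold value per tissue and that each tissue has already been classified; the types, the function name, and the reading of "8X" as a plain fold value are illustrative assumptions, not the Gene Sorter's own data model:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tissue classification for this one query. */
enum tissueKind {
    tissueIslet,    /* pancreatic islet cells */
    tissueExempt,   /* rest of pancreas and central nervous system */
    tissueOther     /* everything else */
};

struct tissueLevel {
    enum tissueKind kind;
    double fold;    /* expression as a fold value, e.g. 8.0 for "8X" */
};

/* Returns 1 if the gene passes all three conditions:
 *   - it is annotated as a membrane protein,
 *   - at least one islet sample reaches minIsletFold,
 *   - no non-exempt tissue exceeds maxOtherFold.
 * Returns 0 otherwise. */
int isIsletCandidate(int isMembrane,
                     const struct tissueLevel *levels, size_t count,
                     double minIsletFold, double maxOtherFold)
{
    int isletHigh = 0;
    assert(levels != NULL || count == 0);
    if (!isMembrane)
        return 0;
    for (size_t i = 0; i < count; ++i) {
        switch (levels[i].kind) {
        case tissueIslet:
            if (levels[i].fold >= minIsletFold)
                isletHigh = 1;
            break;
        case tissueOther:
            if (levels[i].fold > maxOtherFold)
                return 0;
            break;
        case tissueExempt:
            break;  /* pancreas and CNS are not constrained */
        }
    }
    return isletHigh;
}
```

For the query on the slide the thresholds would be `minIsletFold = 8.0` and `maxOtherFold = 4.0`.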

  • Direct Data Access

  • FTP or HTTP Download

    • Sequence
    • Multiple genome alignments
    • Wiggle track data.
    • Database as tab-separated files
    • Follow downloads link from http://genome.ucsc.edu
    • Via

  • Public MySQL Access

    • Query mirror of our database directly
      • Host: genome-mysql.cse.ucsc.edu
      • User: genome
      • No password needed
    • Best to use table browser to find relevant tables in many cases.
    • Some tables are split by chromosome: chr1_est, chr2_est, etc.
    • Some data (genome sequence, multiple alignments, wiggles) are in files just referenced by SQL tables.
    • For some purposes easier to use via UCSC C library code than via SQL.
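Because some tables are split by chromosome, client code has to build the table name before it can query. A minimal sketch in C that only formats strings; the helper names are hypothetical, and connecting to the server (for example through a MySQL client library) is left out:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the name of a per-chromosome table such as "chr1_est" from a
 * chromosome name and a root table name.
 * Returns 1 on success, 0 if an input is empty or buf is too small. */
int splitTableName(char *buf, size_t bufSize,
                   const char *chrom, const char *rootName)
{
    int n;
    assert(buf != NULL && chrom != NULL && rootName != NULL);
    if (strlen(chrom) == 0 || strlen(rootName) == 0)
        return 0;
    n = snprintf(buf, bufSize, "%s_%s", chrom, rootName);
    return n >= 0 && (size_t)n < bufSize;
}

/* Build a row-count query against one split table.
 * The names are pasted in unescaped, so use trusted input only. */
int splitTableCountQuery(char *buf, size_t bufSize,
                         const char *chrom, const char *rootName)
{
    char table[128];
    int n;
    if (!splitTableName(table, sizeof table, chrom, rootName))
        return 0;
    n = snprintf(buf, bufSize, "select count(*) from %s", table);
    return n >= 0 && (size_t)n < bufSize;
}
```

Looping over chromosome names and calling `splitTableCountQuery` for each one covers every split table for a given root name.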

  • The Sordid Details of the UCSC Genome Informatics Code Base

    Download via modules require MySQL to be installed.

  • Lagging Edge Software

    • C language - compilers still available!
    • CGI Scripts - portable if not pretty.
    • SQL database - at least MySQL is free.

  • Problems with C

    • Missing booleans and strings.
    • No real objects.
    • Must free things.

  • Coping with Missing Data Types in C

    • #define boolean int
    • Fixing lack of real string type much harder
    • lineFile/common modules and autoSql code generator make parsing files relatively painless
    • dyString module not a horrible string class
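To show why the string type is the harder of the two gaps, here is a minimal sketch in C of a growable string next to the one-line boolean. The `growString` type and its functions are hypothetical and written for this illustration; they are not the dyString module's API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define boolean int   /* the one-line fix from the slide */
#ifndef TRUE
#define TRUE 1
#endif
#ifndef FALSE
#define FALSE 0
#endif

/* Hypothetical growable string: the buffer doubles as text is appended. */
struct growString {
    char *text;        /* always NUL-terminated */
    size_t length;     /* characters in use, not counting the NUL */
    size_t capacity;   /* bytes allocated for text */
};

/* Allocate an empty string; returns NULL if out of memory. */
struct growString *growStringNew(void)
{
    struct growString *gs = malloc(sizeof *gs);
    if (gs == NULL)
        return NULL;
    gs->capacity = 16;
    gs->length = 0;
    gs->text = malloc(gs->capacity);
    if (gs->text == NULL) {
        free(gs);
        return NULL;
    }
    gs->text[0] = '\0';
    return gs;
}

/* Append s, growing the buffer as needed. Returns FALSE if out of memory,
 * in which case the existing contents are left unchanged. */
boolean growStringAppend(struct growString *gs, const char *s)
{
    size_t add, needed;
    assert(gs != NULL && s != NULL);
    add = strlen(s);
    needed = gs->length + add + 1;
    if (needed > gs->capacity) {
        size_t newCap = gs->capacity;
        char *bigger;
        while (newCap < needed)
            newCap *= 2;
        bigger = realloc(gs->text, newCap);
        if (bigger == NULL)
            return FALSE;
        gs->text = bigger;
        gs->capacity = newCap;
    }
    memcpy(gs->text + gs->length, s, add + 1);   /* copies the NUL too */
    gs->length += add;
    return TRUE;
}

/* Free the string and set the caller's pointer to NULL. */
void growStringFree(struct growString **pGs)
{
    if (pGs != NULL && *pGs != NULL) {
        free((*pGs)->text);
        free(*pGs);
        *pGs = NULL;
    }
}
```

The boolean takes one line, while the string needs allocation, growth, and an explicit free, which is also the "must free things" problem in miniature.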

  • Object Oriented Programming in C

    • Build objects around structures.
    • Make families of functions with names that start with the structure name, and th