2014 whitney-research


Upload: ctitusbrown

Post on 10-May-2015

TRANSCRIPT

  • 1. Like the dog that caught the bus: now what? Sequencing, Big Data, and Biology. C. Titus Brown, Assistant Professor, MMG, CSE, BEACON, Michigan State University. Feb 2014. [email protected]

2. The challenges of non-model transcriptomics
Missing or low-quality genome reference.
Evolutionarily distant.
Most extant computational tools focus on model organisms:
  Assume low polymorphism (internal variation)
  Assume reference genome
  Assume somewhat reliable functional annotation
  More significant compute infrastructure
and cannot easily or directly be used on critters of interest.

3. Isoform analysis: some easy

4. Isoform analysis: some hard
Counting methods mostly rely on the presence of unique sequence to which to map.

5. Types of Alternative Splicing
40% 25%
Enables analyses that are otherwise completely impossible.

26. Solution 2: Partitioning transcripts into transcript families
Transcript family
Pell et al., 2012, PNAS

27. Transcriptome results - lamprey
Started with 5.1 billion reads from 50 different tissues.
(4 years of computational research, and about 1 month of compute time, GO HERE)
Ended with:

28. Lamprey transcriptome basic stats
616,000 transcripts (!)
263,000 transcript families (!)
(This seems like a lot.)

29. Lamprey transcriptome basic stats
616,000 transcripts
263,000 transcript families
Only 20,436 transcript families have transcripts > 1 kb (compare with mouse: 17,331 of 29,769 genes are > 1 kb).
So, estimation by thumb ~ not that off, for long transcripts.

30. Validation
Assume computers lie. How do we judge precision & recall?
1) Homology! Do we see sequence similarity to e.g. mouse sequences?
2) Orthogonal data sets and analyses. For example, look at sperm genome, or independently cloned CDS.

31. Evolution: mouse
58,000 lamprey transcript families have some matches to mouse.
10,000 putative orthologs (reciprocal best hits).
So that's a pretty good sign (expecting about ~30k total genes).
Conclusion: These numbers feel good to me; hard to know what to expect after ~350-500 mya.

32. Orthogonal data set: pm2 (liver genome)
64% of our new transcript families have a match in pm2.
71% of conserved transcript families have a match in pm2.
83% of long transcripts have a match in pm2.
Good; we don't expect 100%, because we know pm2 is probably missing stuff. So that means:
Conclusion: At least 64% of transcript families are really lamprey (and > 83% of the long transcripts!)

33. Orthogonal data set: sperm genome
94.2% of ref-based transcripts have a match in sperm genome.
98.2% of full-length cDNAs have a match in sperm genome.
So sperm genome is pretty good for cross validation.
But only 71% of our new transcript families have a match in sperm genome. ??

34. Orthogonal data set: sperm genome
94.2% of ref-based transcripts have a match in sperm genome.
98.2% of full-length cDNAs have a match in sperm genome.
New transcriptome:
  71% of transcript families have a match in sperm genome.
  92% (!!) of long transcript families have a match in sperm genome.
(Since the sperm genome is low coverage, this length dependence makes sense the longer the

35. Orthogonal data set: sperm genome
94.2% of ref-based transcripts have a match in sperm genome.
98.2% of full-length cDNA have a match in sperm genome.
New transcriptome:
  71% of new transcriptome families have a match in sperm genome.
  92% (!!) of long transcript families have a match in sperm genome.
Conclusion: Our is poorer than but comparable

36. Orthogonal data set: full-length cDNAs
We can look at both precision and recall by asking:
  Are known sequences represented completely by a single transcript? (best match)
  Are known sequences covered by one or more transcripts? (total matches)
70% 90%

37. Best matches not great.

38. Total matches better!

39. Ref-based (lamp0) best are better than new assembly (lamp3)

40. lamp3 total is better than lamp0

41. Conclusions from full-length cDNA
Ref-based data set has longer best matches (better precision; less fragmented).
De novo assembly is more sensitive overall (better recall; contains more real sequences).

42.
Mapping percentages (with orthogonal data)
Ona Bloom generated more data; how much maps?
Ref-based  New/all  New/long
BR  SC
29.20% 42.94% 100.00% 100.00% 45.99% 46.89%
Conclusion: Ref-based is considerably less complete than new, de novo transcriptome assembly.

43. Lamprey transcriptome conclusions
A substantial portion of the new transcriptome seems good:
  58k transcript families with mouse homology, 10k orthologs; 20k transcript families with transcripts > 1 kb.
  Good matches to liver genome & sperm genome.
  Reasonable numbers ~mouse.
  Much (!) better than ref-based for mapping. (2x as good)
But!
  Poor recall of known full-length cDNA !?
  240k partitions with only small sequences !? => microbial contamination?

44. Separate question: how much of the pm2 genome is missing??
64% of lamp3 transcript families match to pm2.
82.5% of long transcript families match to pm2.
71% of lamp3 transcript families conserved with mouse match in pm2.
Conclusion I: Probably about 30% of genic sequence is missing.

45. Separate question: how much of the pm2 genome is missing??
64% of lamp3 transcript families match to pm2.
82.5% of long transcript families match to pm2.
71% of lamp3 transcript families conserved with mouse match in pm2.
22.5% of sperm genome contigs have no hits in pm2.
Conclusion II (firmer): About 30% of single-copy sequence is missing.

46. CEGMA based completeness estimates (Core eukaryotic genes)
                          Number seqs   Completeness / 100% matches   Completeness / partial matches
lamp3 entire              620k          70.6                          96.4
lamp3 all ORFs > 80aa     269k          46.4                          89
lamp3 longest ORF in tr   80k           41.1                          77.8
lamp0                     11k           44.7                          62.5
Camille Scott

47. Looking at the Molgula
Putnam et al., 2008, Nature. Modified from Swalla 2001

48. What do these animals look like?
Molgula oculata; Molgula oculata; Molgula occulta; Ciona intestinalis

49. Tail loss and notochord genes
a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta
Notochord cells in orange
Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208, 15 November 1996

50.
Diginorm applied to Molgula embryonic mRNAseq
Sample              No. of reads   Reads kept    Percentage kept
M. occulta F+3      42,174,510     15,642,268    ?
M. occulta F+3      50,018,302     6,012,894     ?
M. occulta F+4      44,948,983     3,499,935     ?
M. occulta F+5      53,692,296     2,993,715     ?
M. occulta F+6      45,782,981     2,774,342     ?
M. occulta Total    236,617,072    30,923,154    13%
M. oculata F+3      47,045,433     10,754,899    ?
M. oculata F+4      52,890,938     3,949,489     ?
M. oculata F+6      50,156,895     2,874,196     ?
M. oculata Total    150,093,266    17,578,584    11.70%

51. Question: does normalization lose transcript information?
[Figure: reciprocal best hits vs. Ciona (C. intestinalis), diginorm vs. raw assemblies, for M. occulta and M. oculata; the per-region counts are fused in the transcript and not recoverable]
Reciprocal best hit vs. Ciona. Blast e-value cutoff: 1e-6
Elijah Lowe

52. Transcriptome assembly thoughts
We can (now) assemble really big data sets, and get pretty good results.
We have lots of evidence (some presented here :) that some assemblies are not strongly affected by digital normalization.

53. Practical implications of diginorm
Data is (essentially) free;
For some problems, analysis is now cheaper than data gathering (i.e. essentially free);
plus, we can run most of our approaches in the cloud.

54. 1. khmer-protocols
Effort to provide standard cheap assembly protocols for the cloud.
Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
Open, versioned, forkable, citable.
Steps shown: Read cleaning; Diginorm; Assembly; Annotation; RSEM differential expression

55. CC0; BSD; on github; in reStructuredText.

56. 2. Data availability is important for annotating distant sequences
[Chart legend: no similarity; Anything else; Mollusc; Cephalopod]

57. Can we incentivize data sharing?
~$100-$150/transcriptome in the cloud
Offer to analyze people's existing data for free, IFF they open it up within a year.
See: CephSeq white paper. Dead Sea Scrolls & Open Marine Transcriptome Project blog post;

58.
First results: Loligo genomic/transcriptome resources
Putting other people's sequences where my mouth is:

59. Tools to routinely update metazoan orthology/homology relationships
> 100 mRNAseq data sets already;
Build interconnections between them via homology;
Build tools to update interconnections as new data sets arrive.
Provide raw data, processed data, underlying tools, simple Web interface, all CC0/in da cloud/open/reproducible.
(Question: what biology problems could we tackle?)

60. Research singularity
The data a researcher generates in their lab constitutes an increasingly small component of the data used to reach a conclusion.
Corollary: The true value of the data an individual investigator generates should be considered in the context of aggregate data.
Even if we overcome the social barriers and incentivize sharing, we are, needless to say, not remotely prepared for sharing all the data.

61. We practice open science!
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog (titus brown blog)
Twitter: @ctitusbrown
Grants on Lab Web site: http://ged.msu.edu/research.html
Preprints: on arXiv, q-bio: diginorm arxiv

62. Acknowledgements
Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish, Michael Crusoe, Leigh Sheneman
Collaborators: Josh Rosenthal (UPR); Weiming Li (MSU); Ona Bloom (Feinstein); Jen Morgan (MBL); Joe Buxbaum (MSSM)
Funding: USDA NIFA; NSF IOS; NIH; BEACON.
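
Worked check for slide 50: the two "Total" percentages in the diginorm table (13% and 11.70%) can be recomputed from the per-sample read counts. The sketch below is a minimal illustration in Python, not code from the talk; the names `SAMPLES`, `percent_kept`, and `species_summary` are hypothetical, and the counts are transcribed from the slide as (number of reads, reads kept).

```python
# Read counts transcribed from the slide 50 table.
# Each tuple is (number of reads, reads kept after diginorm).
# The dictionary layout and all names here are illustrative assumptions.
SAMPLES = {
    "M. occulta": [
        (42_174_510, 15_642_268),
        (50_018_302, 6_012_894),
        (44_948_983, 3_499_935),
        (53_692_296, 2_993_715),
        (45_782_981, 2_774_342),
    ],
    "M. oculata": [
        (47_045_433, 10_754_899),
        (52_890_938, 3_949_489),
        (50_156_895, 2_874_196),
    ],
}


def percent_kept(total_reads, kept_reads):
    """Return the percentage of reads retained."""
    return 100.0 * kept_reads / total_reads


def species_summary(counts):
    """Sum per-sample counts and return (total, kept, percent kept)."""
    total = sum(t for t, _ in counts)
    kept = sum(k for _, k in counts)
    return total, kept, percent_kept(total, kept)


if __name__ == "__main__":
    for species, counts in SAMPLES.items():
        total, kept, pct = species_summary(counts)
        print(f"{species}: {kept:,} of {total:,} reads kept ({pct:.2f}%)")
```

The summed totals match the "Total" rows on the slide, and the resulting percentages agree with the reported 13% and 11.70% after rounding. The same `percent_kept` function applies to any single sample row whose percentage is shown as "?" in the table.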