Post on 26-Feb-2016

Page 1: PHANG LAB TALK

PHANG LAB TALK

Tzu L Phang, Ph.D.
Assistant Professor
Department of Medicine
Division of Pulmonary Sciences & Critical Care Medicine

Page 2: PHANG LAB TALK

What I do:

• Perform high-throughput data analysis for the scientific community: microarray and Next Generation Sequencing datasets
• Provide analysis solutions for experts and novice users alike
• Develop multimedia approaches to disseminate translational science education
• Study the role of long non-coding RNA (second talk)
• Establish the Bioinformatics Consultation and Analysis Core to help researchers and scientists design, analyze and interpret their experiments

Page 3: PHANG LAB TALK

Today’s Talk Layout

• The center of my universe:
  – R and Bioconductor
• Collaboration with Biologists
• 5x5: a simple way to teach and contribute
• Next Generation Sequencing (NGS)

Page 4: PHANG LAB TALK

Today’s Talk Layout

• The center of my universe:
  – R and Bioconductor
• Collaboration with Biologists
• 5x5: a simple way to teach and contribute
• Next Generation Sequencing (NGS)

Page 5: PHANG LAB TALK

R

r-project.org

Page 6: PHANG LAB TALK

R is hot

http://blog.revolutionanalytics.com/r-is-hot/

Page 7: PHANG LAB TALK

R in the media

Page 8: PHANG LAB TALK

Bioconductor

• www.bioconductor.org
• Statistical tools in R for high-throughput data analysis
• 6-month update cycle; release 2.10 has 554 software packages (45 new)
• Analysis workflows
  – Oligonucleotide Arrays
  – Sequence Analysis
  – Variants
  – Accessing Annotation Data
  – High-throughput Assays

Page 9: PHANG LAB TALK

The Website

www.bioconductor.org

Page 10: PHANG LAB TALK

Categories

Page 11: PHANG LAB TALK

Categories (cont.)

Page 12: PHANG LAB TALK

• Typical Analysis Routine

Page 13: PHANG LAB TALK

R is easy

Page 14: PHANG LAB TALK

Result output

Page 15: PHANG LAB TALK

Other Resources

http://www.rseek.org/
http://www.statmethods.net/
http://crantastic.org/
http://stackoverflow.com/

Page 16: PHANG LAB TALK

Today’s Talk Layout

• The center of my universe:
  – R and Bioconductor
• Collaboration with Biologists
• 5x5: a simple way to teach and contribute
• Next Generation Sequencing (NGS)

Page 17: PHANG LAB TALK

Collaboration

• >1000 microarray chips / year
• Affymetrix & Illumina platforms
• Next Generation Sequencing: 25 free pilot projects
• Serve the Rocky Mountain region scientific community

Page 18: PHANG LAB TALK

Collaboration - tips

• Don’t be a data analyst; be a co-investigator
• Suggest analysis approaches that are not obvious
• Focus on the result, not the method
• Always look for grant-writing opportunities
• Understand the technical & biological system as thoroughly as possible; you will be surprised what biologists miss informatically

Page 19: PHANG LAB TALK

Example 1: Classification of Pituitary Tumors

• Pituitary tumors are the most common type of brain tumor, found in 20% of people at autopsy and in 1/10,000 persons clinically. Based upon 2010 figures of a veteran population of 22.7 million, this translates into >225,000 veterans with pituitary tumors.

• Currently no medical therapies exist for these tumors and surgical resection is the treatment of choice. Recurrence rates approach 40%.

• Understanding of the pathways to tumorigenesis and markers of aggressiveness and risk of recurrence would alter the intensity and cost of clinical care and may provide novel candidates and pathways to explore for new treatment options for these patients

Page 20: PHANG LAB TALK
Page 21: PHANG LAB TALK

Principal Component Analysis

Page 22: PHANG LAB TALK

Potential markers

Page 23: PHANG LAB TALK

Outputs

Page 24: PHANG LAB TALK

Example 2: Explore the artistic side!

Page 25: PHANG LAB TALK
Page 26: PHANG LAB TALK

Example 3: Unconventional Usage

Page 27: PHANG LAB TALK
Page 28: PHANG LAB TALK

Introduction

• Crohn’s Disease (CD) is an Inflammatory Bowel Disease (IBD) affecting up to one million Americans (ages 15 to 30).
• Discordance between monozygotic twins affected by CD provides evidence for an epigenetic role in the etiology of the disease.
• We combined 2 microarray technologies to study these roles:
  – CHARM array (Comprehensive High-throughput Array for Relative Methylation)
  – Gene expression (Affymetrix Gene 1.0 ST)

Page 29: PHANG LAB TALK
Page 30: PHANG LAB TALK
Page 31: PHANG LAB TALK

Research Informatics Integrated Core (RIIC)

Michael G. Kahn, MD, PhD
CCTSI Co-Director & RIIC Core Director

[email protected]

Page 32: PHANG LAB TALK

RIIC Organizational Model

Michael Kahn

• Thomas Yaeger
  – Web site
  – Portal applications
  – Virtual server farm
  – Research LIS implementation
  – Desktop support
• Jessica Bondy (Cancer Center Informatics Core Director)
  – REDCap, REDCap Survey
  – Data Management Best Practices
• Michael Kahn
  – Third Thursday @ Three Thirty Three Informatics Seminar Series
  – Secondary database and analysis service
• Tzu Phang
  – 5x5s
  – Video Tutorials
  – Bioinformatics Tools Tutorials
• Steve Ross
  – Community Engagement Informatics Liaison

Page 33: PHANG LAB TALK

http://cctsi.ucdenver.edu/RIIC/Pages/ConsultationDataAnalysis.aspx

Page 34: PHANG LAB TALK
Page 35: PHANG LAB TALK
Page 36: PHANG LAB TALK

5X5

http://cctsi.ucdenver.edu/5x5

Page 37: PHANG LAB TALK

Demonstration

http://gcrc.ucdenver.edu/Videos/Informatics/5x5/SocialNetworking5x5.wmv

Page 38: PHANG LAB TALK

Tools

Page 39: PHANG LAB TALK

Podcast

Page 40: PHANG LAB TALK
Page 41: PHANG LAB TALK

TIES – Translational Informatics Education Support

• Bridging the gap in translational research through education
• Training biologists in informatics
• Enhance collaboration through education and knowledge exchange
• Bring awareness of the latest technical advances
• Disseminate knowledge through innovation

Page 42: PHANG LAB TALK

Next Generation Sequencing

The future is here ….

Page 44: PHANG LAB TALK
Page 45: PHANG LAB TALK

Paradigm Shift

• Standard “Sanger” sequencing
  – 96 samples/day
  – Read length ~650 bp
  – Total = 450,000 bases of sequence data
• 454, the game changer!
  – ~400,000 different templates (reads)/day
  – Read length ~250 bp (at that time)
  – Total = 100,000,000 bases of sequence data
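The scale of the shift can be checked with simple arithmetic. The sketch below multiplies out the 454 figures quoted on the slide and compares the two stated daily totals. The Sanger total is taken as stated on the slide rather than recomputed, since 96 × 650 alone does not reach 450,000 and the slide does not say how many runs per day it assumes. All numbers are approximate and the function name is illustrative.

```python
def total_bases(reads_per_day: int, read_length_bp: int) -> int:
    """Bases produced per day = number of reads x read length."""
    return reads_per_day * read_length_bp


# Figures quoted on the slide (approximate).
roche_454_total = total_bases(400_000, 250)  # the slide states 100,000,000
sanger_total_on_slide = 450_000              # taken from the slide as stated

# Fold increase in daily output implied by the two stated totals.
fold_increase = roche_454_total / sanger_total_on_slide

if __name__ == "__main__":
    print(f"454 bases/day: {roche_454_total:,}")
    print(f"Fold increase over the stated Sanger total: {fold_increase:.0f}x")
```

On these numbers a single 454 instrument produced a couple of hundred times more sequence per day than the stated Sanger total.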

Page 46: PHANG LAB TALK

The second generation

Roche (454) http://454.com/
  – First on the market
  – Emulsion PCR and pyrosequencing
Illumina (Solexa) http://www.illumina.com/
  – Second on the market
  – Bridge PCR and polymerase-based SBS
ABI (SOLiD) http://solid.appliedbiosystems.com/
  – Third on the market
  – Emulsion PCR and ligase-based sequencing

Page 47: PHANG LAB TALK

Single molecule sequencing

Helicos Biosciences http://helicosbio.com
  – true Single Molecule Sequencing technology
Pacific Biosciences http://www.pacificbiosciences.com
  – Single Molecule Real Time sequencing

Page 48: PHANG LAB TALK

Portable Sequencer

• Ion Torrent

Page 49: PHANG LAB TALK

Others

Polonator http://www.polonator.org
  – Emulsion PCR and ligase-based sequencing
  – Used in the Personal Genome Project
  – Open platform, open source
  – Cheap/affordable
Complete Genomics http://www.completegenomics.com
  – Specializing in human genome sequencing

Page 50: PHANG LAB TALK

Type of read data

• Base Space or Color Space
• Paired end or single end
• Stranded or Unstranded
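To make the base space versus color space distinction concrete, here is a minimal sketch of the di-base (two-base) color code used by SOLiD-style reads: each color digit describes the transition between two adjacent bases. As a simplification, the encoded read here is anchored on its own first base; real instrument output is anchored on a primer base instead. The function names are illustrative, not part of any vendor tool.

```python
# 2-bit base codes; XOR of two adjacent codes gives the color digit (0-3):
# 0 = same base, 1 = A<->C or G<->T, 2 = A<->G or C<->T, 3 = A<->T or C<->G.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def encode_color_space(seq: str) -> str:
    """Return the first base followed by one color digit per adjacent base pair."""
    colors = [BASE_TO_CODE[a] ^ BASE_TO_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_color_space(cs_read: str) -> str:
    """Rebuild the base sequence from an anchor base plus color digits."""
    bases = [cs_read[0]]
    for digit in cs_read[1:]:
        previous = BASE_TO_CODE[bases[-1]]
        bases.append(CODE_TO_BASE[previous ^ int(digit)])
    return "".join(bases)


if __name__ == "__main__":
    encoded = encode_color_space("ATGGCA")
    print(encoded, "->", decode_color_space(encoded))
```

Because each base is decoded from the one before it, a single wrong color digit changes every base downstream, which is one reason color-space reads are usually aligned in color space rather than converted first.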

Page 51: PHANG LAB TALK

Short Reads

• Short reads from NGS are challenging (Solexa ~36 bp, now HiSeq 100 bp single pass)
  – Very hard to assemble a whole genome
  – Especially in repeat regions
• Requires many-fold coverage
• New and faster algorithms are needed for many traditional bioinformatics operations
• Reads are getting longer (2x250), another moving target
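"Many-fold coverage" can be put into numbers with the usual back-of-envelope relation: mean coverage = (number of reads × read length) / genome size. The sketch below is an illustration, not a sequencing-design tool; the 3 Gb genome size and the 30x target are assumptions chosen only for the example.

```python
import math


def mean_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    genome = 3_000_000_000  # assumed genome size (roughly human scale)
    for read_length in (36, 100, 250):
        n = reads_needed(30, read_length, genome)
        print(f"{read_length:>3} bp reads: {n:,} reads for 30x")
```

Longer reads cut the number of reads needed for the same coverage, which is part of why read length is a moving target for analysis software.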

Page 52: PHANG LAB TALK

Applications

• An explosion of scientific innovation!
• New uses not directly foreseen by the original developers of the technology
• Some envision the beginning of the next revolution, like PCR: an NGS machine in every lab!
• Cheap high-volume sequencing means revisiting data collection and management systems

Page 53: PHANG LAB TALK

RNA Sequencing

• “Digital Gene Expression” or “RNA-Seq”
• Truly accurate gene expression measurements
  – Can replace gene expression microarrays
    • 25% more sensitive
    • Does not rely on hybridization (no %GC bias, no cross-hybridization between related genes)
• Discover novel genes (and other kinds of RNA molecules)
  – One experiment found that 34% of human transcripts were not from known genes
    • Sultan et al, Science. 2008 Aug 15;321(5891):956-60.

Page 54: PHANG LAB TALK

Why is RNA-Seq better than microarray?

• No predefined gene annotation, which makes discovering novel transcripts possible
• Low, if any, background
• Large dynamic range of expression levels, with no upper limit for quantification
• Reveals sequence variation, such as SNPs, in the transcript region
• In Helicos single molecule sequencing there is no PCR step, which removes amplification bias

Page 55: PHANG LAB TALK

More information from RNA

• Can capture true alternative splicing information
  – Sequence of splice junctions
    • One study found 4,096 previously unknown splice junctions in 3,106 human genes
  – Different transcription start and end points for RNA molecules
• Allelic variation (SNPs)
• Small RNAs

Page 56: PHANG LAB TALK
Page 57: PHANG LAB TALK

Bottleneck: Data Analysis

Page 58: PHANG LAB TALK

Informatics is the Bottleneck

• Scientists are currently able to generate sequence data much faster/more easily than they are able to analyze it

• Customized analysis / Bioinformatics consulting is needed for every project

Page 59: PHANG LAB TALK

Bioinformatics Challenges

• Need for a large amount of CPU power
  – Informatics groups must manage compute clusters
  – Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
  – Another level of software complexity and challenges to interoperability
• VERY large text files (millions of lines long)
  – Can’t do ‘business as usual’ with familiar tools such as Microsoft Excel
  – Impossible memory usage and execution time
• Sequence quality filtering
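One common way around the memory problem is to stream a file one record at a time instead of loading it whole. The sketch below applies that idea to quality filtering of FASTQ reads. It assumes plain four-line records and Phred+33 quality encoding (older Illumina files used an offset of 64), and the mean-quality threshold of 20 is an arbitrary example value, not a recommendation from the talk.

```python
import io
from typing import Iterator, TextIO, Tuple

FastqTuple = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[FastqTuple]:
    """Yield one record at a time so memory use stays constant."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string under the assumed ASCII offset."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_by_quality(handle: TextIO, min_mean_q: float = 20.0,
                      offset: int = 33) -> Iterator[FastqTuple]:
    """Keep only reads whose mean quality reaches the threshold."""
    for header, seq, qual in read_fastq(handle):
        if qual and mean_phred(qual, offset) >= min_mean_q:
            yield header, seq, qual


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
    for record in filter_by_quality(demo):
        print(record)
```

With a real file, pass an open file handle in place of the in-memory demo; the generator never holds more than one record at a time.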

Page 60: PHANG LAB TALK

Auer P. Statistical design and analysis of RNA sequencing data. Genetics. 2010.

Page 61: PHANG LAB TALK

Data formats

• Images
• “raw” basecalls with quality scores
• Sequence reads aligned to reference genomes
• Assemblies (contigs)
• Variants (SNPs, indels, copy number variants)

Page 62: PHANG LAB TALK

Hexadecimal mode

Decimal mode

Page 63: PHANG LAB TALK

FASTQ

Raw
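A FASTQ record is four lines: a header starting with "@", the sequence, a "+" separator, and one quality character per base. The sketch below parses a single record and converts each quality character to an error probability using P = 10^(-Q/10). The record shown is made up for illustration, and the Phred+33 offset is an assumption (older Illumina pipelines used 64).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FastqRecord:
    name: str
    sequence: str
    quality: str


def parse_record(lines: List[str]) -> FastqRecord:
    """Build a record from exactly four lines of a FASTQ file."""
    if len(lines) != 4 or not lines[0].startswith("@") or not lines[2].startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(lines[1]) != len(lines[3]):
        raise ValueError("sequence and quality lengths differ")
    return FastqRecord(name=lines[0][1:], sequence=lines[1], quality=lines[3])


def error_probabilities(quality: str, offset: int = 33) -> List[float]:
    """Per-base probability that the base call is wrong: P = 10^(-Q/10)."""
    return [10 ** (-(ord(c) - offset) / 10) for c in quality]


if __name__ == "__main__":
    # Hypothetical record, not taken from the slides.
    rec = parse_record(["@read1", "ACGT", "+", "II5!"])
    for base, p in zip(rec.sequence, error_probabilities(rec.quality)):
        print(base, f"{p:.4f}")
```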

Page 64: PHANG LAB TALK

SAM format

Example

Pileup format

Page 65: PHANG LAB TALK

QNAME

FLAG

RNAME

POS

MAPQ

CIGAR

MRNM

MPOS

ISIZE

SEQ

QUAL
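The eleven names above are the mandatory tab-separated columns of a SAM alignment line, in order. Later versions of the SAM specification renamed MRNM, MPOS and ISIZE to RNEXT, PNEXT and TLEN; the sketch keeps the names used on the slide. The example line is hypothetical, and the two helper functions test the standard FLAG bits 0x4 (read unmapped) and 0x10 (read on the reverse strand).

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    mrnm: str    # mate reference name
    mpos: int    # mate position
    isize: int   # inferred insert size
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into its 11 mandatory fields (optional tags ignored)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    q, f, r, p, m, c, mr, mp, isz, s, ql = fields[:11]
    return SamRecord(q, int(f), r, int(p), int(m), c, mr, int(mp), int(isz), s, ql)


def is_unmapped(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x4)


def is_reverse_strand(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x10)


if __name__ == "__main__":
    # Hypothetical alignment line for illustration only.
    line = "read1\t16\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n"
    rec = parse_sam_line(line)
    print(rec.rname, rec.pos, rec.cigar, "reverse" if is_reverse_strand(rec) else "forward")
```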

Page 66: PHANG LAB TALK

CIGAR

• M : match/mismatch
• I : Insertion compared with reference
• D : Deletion compared with reference
• N : Skipped bases on reference
• S : soft clipping (unaligned)
• H : hard clipping
• P : padding
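A CIGAR string is a run of length/operation pairs, for example 30M2I10M. The sketch below splits a string into those pairs and works out how many reference bases the alignment spans (M, D and N consume the reference) and how many read bases it accounts for (M, I and S consume the read). It handles only the seven operations listed above, and the example string is made up.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything other than the seven listed operations.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment (M, D, N)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN")


def query_length(cigar: str) -> int:
    """Number of read bases described by the alignment (M, I, S)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS")


if __name__ == "__main__":
    example = "5S30M2I10M1D8M100N20M"  # hypothetical spliced alignment
    print(parse_cigar(example))
    print("reference span:", reference_span(example))
    print("read length:", query_length(example))
```

The N operation is what makes spliced RNA-Seq alignments possible: the 100N in the example skips an intron-sized stretch of reference without consuming any read bases.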

Page 67: PHANG LAB TALK
Page 68: PHANG LAB TALK
Page 69: PHANG LAB TALK
Page 70: PHANG LAB TALK
Page 71: PHANG LAB TALK
Page 72: PHANG LAB TALK
Page 73: PHANG LAB TALK

File Size

• s_1_ILS4_sequence.txt [5.2 GB]
• s_1_ILS4_sequence.fastq [3.3 GB]
• s_1_ILS4_sequence.sam [4.5 GB]
• s_1_ILS4_sequence.bam [995 MB]
• s_1_ILS4_sequence.sorted.bam [696 MB]

Page 74: PHANG LAB TALK

The Bible

Page 75: PHANG LAB TALK

Utility Tools

• SAMtools
• Picard
• USeq
• etc.

Page 76: PHANG LAB TALK

Bioconductor Solution

Page 77: PHANG LAB TALK
Page 78: PHANG LAB TALK

A demonstration

Page 79: PHANG LAB TALK
Page 80: PHANG LAB TALK

Secondary Tools

• Laboratory Management
• Data mining and visualization
• Project management for genome assembly
• Pathway mapping (functional analysis of groups of genes)
• Motif finding (for ChIP-Seq)

Page 81: PHANG LAB TALK

Integration

• Integrate information from different technologies on a single genome map
  – Genetic variation
  – Gene expression (mRNA levels)
  – Alternative splicing
  – Transcription factor binding
  – Methylation/histone status
  – Small RNA levels (gene regulatory molecules)
  – Non-coding RNA levels!

Page 82: PHANG LAB TALK

Speed/Efficiency

• New emphasis on efficient data structures and algorithms

• Use of “old style” tools such as grep/sed/awk
• Machine language programming
• Currently a huge burst of programming creativity in an “anything goes” environment
• A desperate scramble for tools that work
• Huge duplication of effort in programming, but also in evaluating new software

Page 83: PHANG LAB TALK
Page 84: PHANG LAB TALK
Page 85: PHANG LAB TALK

Amazon Web Services

http://aws.amazon.com/education/

Page 86: PHANG LAB TALK

Future Directions

• Sequencing will continue to get much faster and cheaper, by 4-10x per year for several more years.

• Affordable complete human genome sequencing will be available as a clinical diagnostic tool within 2-3 years.

• Data storage and analysis bottleneck
• Data security/privacy issues

Page 88: PHANG LAB TALK

Field Trip

Page 89: PHANG LAB TALK
Page 90: PHANG LAB TALK
Page 91: PHANG LAB TALK