9/20/2011
1
GE, miRNA, Exon (non-NGS data) with Partek Genomics Suite 6.6
Jean Jasinski, Ph.D.
Field Application Scientist
Agenda
University of Minnesota, Tuesday, September 20, 2011
10:00 a.m. – noon Gene Expression, miRNA, Exon, CNV, ChIP-Chip: What’s new in PGS 6.6, Overview of Array Assays & Analysis
12:00 – 1:00 Lunch Break (Pizza)
1:00 – 3:00 Partek Flow, PGS, and NGS analysis
Who is Partek?
• Founded in 1993
• Build tools for statistics & visualization
• Focused on genomics
• Thousands of customers worldwide
• Worldwide, world-class customer support
• We are growing – Job openings
What is Partek Genomics Suite™?
• Desktop software - no server required
• Graphic user interface - no scripting needed
• Designed for biologists & bioinformaticians
• Intuitive workflows
• Windows, Mac, Linux
• Competitively priced
ONE Software for…any assay
• Gene Expression
• RNA-seq
• sRNA-seq
• Exon
• DNA-seq
• miRNA
• ChIP-chip
• CN/LOH/ASCN
• Methylation
• ChIP-Seq
Partek® Genomics Suite™
ONE software …any technology
RT-PCR
Partek® Genomics Suite™
Partek GS™ Statistical Toolbox
• Powerful Statistics
• Parametric
• Non‐parametric
• ANOVA, Welch’s ANOVA, Repeated Measures ANOVA
• Fisher’s Exact test, chi‐square test
• Power Analysis
• Survival Analysis
• Model Prediction
• Correlation tests
• Multiple Test Corrections
• ..More
Overview of Layout
Tool Layout
Spreadsheet (unlimited size)
Summary
Help, Tutorials, and Software Updates
• Recorded presentations
• Tutorials with datasets
• White Papers
Partek Webinars
• Statistical Concepts Laying the Foundation for Microarray & NGS Data Analysis (October 26, 2010): We will cover the statistical concepts at the core of microarray and NGS data analysis in Partek GS, from a biological perspective.
• Clinical Applications & Microarray Data Analysis (November 18, 2010): Learn about classification algorithms, survival analysis, biomarker identification and more.
• Statistical Tools for Next-Generation Sequencing Data Analysis (December 9, 2010): We will describe the statistical tools in Partek GS that help biologists extract relevant information from NGS data.
• iDEA Webinar: Analysis of a Multi-Assay Next Generation Sequencing Study (August 24, 2011)
New in 6.6
• BAM file direct import (no .pdata)
• Hierarchical clustering
• Methylation Workflow (NGS)
• Profile Trellis
• Exporting Zipped Project
• Self Organizing Maps
Don’t be fooled by the “beta” version; upgrade today (esp. NGS users)
BAM file import (no more .pdata files)
Convert SAM to BAM
Sort & Index (PGS may write a new indexed BAM file; write permission is needed in the folder)
New Hierarchical Clustering
Copyright © 2011 Partek Incorporated. All rights reserved.
Gene Patterns by Category – Profile Trellis
SOM Fingerprinting – Look for Clustering Genes
• Expression levels are standardized
• Relative difference in gene expression is shown by color: red is high, blue is low, green indicates no change
• For example, the lower left corner is high in the normal cell line MCF10A, but the same area is low in BT474
Exporting Zipped Projects
• Preserves parent/child relationship
• Includes annotation files
• Single file can be moved from one system to another (must unzip manually)
• Archive completed projects
Methylation Hilbert Curve
Densely methylated
Centromere – no reads
End of data – padded with zeros
A way to view one-dimensional data in two-dimensional space.
Locality-preserving behavior: points close to each other in 1-D space are, on average, close to each other in 2-D space.
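A minimal sketch of how a one-dimensional position can be mapped onto a Hilbert curve, using the widely published index-to-coordinate conversion. This is an illustration only, not Partek's implementation; the function names `hilbert_d2xy` and `hilbert_layout` are hypothetical. Positions past the end of the data are filled with zeros, as in the plot.

```python
def hilbert_d2xy(order, d):
    """Map index d along the curve to (x, y) on a 2**order by 2**order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate the quadrant so consecutive indices stay adjacent.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_layout(values, order):
    """Place 1-D values on the grid; cells past the end of the data stay 0."""
    side = 1 << order
    grid = [[0] * side for _ in range(side)]
    for d in range(side * side):
        x, y = hilbert_d2xy(order, d)
        grid[y][x] = values[d] if d < len(values) else 0
    return grid


if __name__ == "__main__":
    for row in hilbert_layout([5, 6, 7], order=1):
        print(row)
```

Because consecutive indices always land in neighboring cells, a densely methylated stretch of the chromosome shows up as a compact patch rather than a thin line.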
Gene Expression | microRNA | Exon | Copy Number | ChIP-chip | Methylation
Microarray Data Assays
Data File Types
File > Import:
• Text Files (.txt, .csv)
• Vendor formats: Affymetrix, Illumina, Agilent, Nimblegen, Nanostring
• GEO (Gene Expression Omnibus)
• ODBC Compliant Databases
• Partek Express Study File
• Excel – Save as .csv or .txt
Import – Affymetrix CEL files
Normalization default: RMA
• Background correction
• Quantile normalization
• Log2 transformation
• Median polish probeset summarization
Options (Customize…)
• Partek defaults (mean summarization)
• Adjust for GC content or probe sequence
• Import control probes
• Exclude/include probes
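The RMA steps listed above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Partek or Affymetrix implementation: background correction is omitted, ties in quantile normalization are broken by probe order, and all function names are hypothetical.

```python
import math
from statistics import median


def quantile_normalize(samples):
    """Give every sample the same intensity distribution.

    samples: one list of probe intensities per sample, all the same length.
    """
    n_probes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean intensity at each rank across samples.
    rank_means = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_probes)
    ]
    normalized = []
    for sample in samples:
        order = sorted(range(n_probes), key=lambda i: sample[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized


def log2_transform(samples):
    """Log2-transform every intensity."""
    return [[math.log2(v) for v in s] for s in samples]


def median_polish_summary(probeset, n_iter=10):
    """Summarize one probeset; rows are probes, columns are samples (log2)."""
    n_rows, n_cols = len(probeset), len(probeset[0])
    resid = [row[:] for row in probeset]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep out row (probe) medians.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        # Sweep out column (sample) medians.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
    # One summarized expression value per sample.
    return [overall + c for c in col_eff]
```

Median polish is robust to a single misbehaving probe, which is one reason it is the RMA default rather than the mean summarization listed under the Partek defaults.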
Import – Agilent Data
Import ‐ Illumina Expression Data
• Export Partek Project from BeadStudio or Genome Studio (.ppj file)
• Partek plug-in available for download through website
• Tutorial available on website to install plug-in
Import – Nanostring Data
Files (rcc) imported
Normalization method
Background Subtraction
Import - Nimblegen Data
1) Import individual files
• .pair, .calls, .ngd, .txt
2) Import Nimblegen Project Directory
• Subdirectories
• Specify annotation file
• Choose Species
Import – Taqman RQ
• Import Ct values
• Set undetermined values
Import – SOLiD SAGE
• Counts are already mapped to known locations (transcripts)
• Transformation?
Workflow
• Import & Normalization
• QA/QC – Exploratory Analysis
• Analysis
• Visualization
• Biological Interpretation
• Genomic Integration
• Gene Expression with Copy Number
• MicroRNA Integration
Assign Sample Attributes
• There are many ways to assign sample information (treatments, phenotype, other clinical information)
1) From a “sampleInfo” file (Affymetrix Import)
2) By creating treatment/phenotype groups and dragging the samples into the appropriate group
3) By splitting apart the filename
4) By manually adding columns and filling them in (similar to Excel)
1) Assigning sampleInfo file during Import
.fmt format with associated .txt or binary (i.e. SampleInfo.txt.fmt)
There must be a “key” column which identifies the filename (exactly – cAsE sEnsITive)
1) Creating a sampleInfo File
Multiple Chip Types – 2 x 250K
Template columns: ChipType1, ChipType2, Attribute1 (e.g. SubjX_A1), Attribute2 (e.g. SubjX_A2)
NSP STY Type Subject
CRL-2325D_NSP.CEL CRL-2325D_STY.CEL Normal 1
CRL-2324D_NSP.CEL CRL-2324D_STY.CEL Tumor 1
CRL-5957D_NSP.CEL CRL-5957D_STY.CEL Normal 2
CRL-5868D_NSP.CEL CRL-5868D_STY.CEL Tumor 2
CCL-256.1D_NSP.CEL CCL-256.1D_STY.CEL Normal 3
……. ……. …… ……
2) Add attributes from existing column
2) Getting Sample Attributes From Filename
Specify what character(s) separate the factors
Specify factor names and type (e.g. categorical or numeric)
Example filename: “TisMap_Brain_01_v1_WTGene1.CEL”
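The idea of splitting a filename into sample attributes can be sketched as follows. The separator and the factor names (`Study`, `Tissue`, and so on) are assumptions chosen for illustration; in PGS these are specified in the dialog, and `attributes_from_filename` is a hypothetical helper.

```python
def attributes_from_filename(filename, separator="_", names=None):
    """Split a filename into factors on the given separator."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension (.CEL)
    parts = stem.split(separator)
    if names is None:
        names = [f"Factor{i + 1}" for i in range(len(parts))]
    return dict(zip(names, parts))


if __name__ == "__main__":
    # Factor names below are hypothetical labels for the example filename.
    print(attributes_from_filename(
        "TisMap_Brain_01_v1_WTGene1.CEL",
        names=["Study", "Tissue", "Replicate", "Version", "Array"],
    ))
```

This only works when every filename follows the same pattern; a file with a missing field would shift all later factors by one position.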
3) Assigning Sample Attributes after Import
3) “Drag & Drop” Specification of Groups
Name the new attribute and all categories of that attribute.
Group samples by dragging and dropping
(e.g. attribute name is “TissueType”, and categories are “Tumor” and “Normal”)
Edit Sample Information
Factors vs. Response Variables
• There are 2 fundamental types of measurements per sample:
• Treatment/Phenotype information (factor variable)
• Any variable that describes the samples is a factor variable
• The organism’s “response” to treatment (response variable)
FACTORS RESPONSE
Fixed vs. Random Variables (mixed model ANOVA)
Imagine repeating the experiment 20 years from now: would the same levels of each factor be used again?
• Tissue – Yes, the same tissues would be used – Fixed effect
• Gender – Yes, the same genders would be used again – Fixed effect
• Scan Date / Subject – No, samples would be taken from other subjects – Random effect (red)
Exploratory Analysis
Exploratory Analysis, PCA
• Points close together are similar across genome
• Outliers?
• Batch effect?
• Help set up ANOVA model
• Edit Plot Properties
Linear transformation to Convert n original variables into 3 dimensions
Exploratory Analysis
View>Box & Whiskers>Rows(Response)
Outliers? Normal Distribution?
Finding Differentially Expressed Genes: ANOVA Model
• ANOVA – measure the effects of multiple experimental factors (phenotypes) on expression levels
• Assumptions
• Normal distribution
• Approximately equal variance between groups
• Sample independence
• Balanced & unbalanced designs | random & fixed effects (mixed-model ANOVA) | nested factors | any number of categorical effects or numeric values | interactions
• P-values & F ratio are displayed by default
• Contrasts & fold change
• Batch effect
Differential Expression
• ANOVA will partition variability due to the factors in the model
• Test on every gene/probeset on chip
• Cross Tabs – balance of samples
• Advanced Tab for mixed-model ANOVA (model statistics, REML)
Contrasts / Pair-wise Comparisons
• Select Factor / Interaction drop down
• Log2 transformed?
• Keep control consistency
• Report other statistics
• Contrasts added
• Fold change, ratio
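Why the "Log2 transformed?" setting matters can be shown with a short sketch. It assumes the group means are on the log2 scale, and it uses a common signed fold-change convention (ratios below 1 are reported as negative reciprocals); whether PGS uses exactly this convention is an assumption here, and `fold_change` is a hypothetical helper.

```python
def fold_change(mean_log2_case, mean_log2_control):
    """Return (ratio, signed fold change) from two log2-scale group means."""
    # A difference of log2 means corresponds to a ratio on the linear scale.
    ratio = 2 ** (mean_log2_case - mean_log2_control)
    # Signed convention: down-regulation is shown as a negative fold change.
    signed = ratio if ratio >= 1 else -1 / ratio
    return ratio, signed


if __name__ == "__main__":
    print(fold_change(10.0, 8.0))
    print(fold_change(8.0, 10.0))
```

If the data were not actually log2 transformed, exponentiating the difference would give a meaningless ratio, which is why the contrast dialog asks.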
Results of ANOVA
Plot Sources of Variation
Examine each factor’s contribution to variability in the response variables
Error is the amount of variability NOT explained by the model
Columns taller than Error column may be significant
Mean or Median
Create Gene List
• Each factor/contrast of model is listed
• Cutoff for FDR p-value & fold change
• # of genes that pass
• Configure default settings
• Save list or temporary
• Advanced tab
Tools > List Manager
Multiple Test Corrections
• FDR (False Discovery Rate) is the proportion of false positives among all positives.
• Partek implements the “step up” FDR method by default (Benjamini & Hochberg, 1995).
• Additional methods – “step down” FDR, q‐Value, Dunn‐Sidak, Bonferroni
Stringency, from lenient to restrictive: uncorrected p-value; FDR (step up/down) & q-value; FWER (Bonferroni/Dunn-Sidak)
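A minimal sketch of the Benjamini & Hochberg step-up adjustment, with Bonferroni alongside for comparison. It is written from the published procedure rather than from Partek's code, and the function names are hypothetical.

```python
def bh_step_up(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni(pvalues):
    """Bonferroni adjusted p-values (controls the family-wise error rate)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.005]
    print(bh_step_up(p))
    print(bonferroni(p))
```

On the same input the Bonferroni values are never smaller than the step-up FDR values, which matches the lenient-to-restrictive ordering.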
Hierarchical Clustering (6.5 style)
• Parent/child spreadsheet relationship
• Change plot properties
• Color, labels, text legend
*Change default color scheme: Edit > Preferences > Colors tab
GO Enrichment
microRNA
miRNA
• Short (~22 nucleotide) sequences that bind to complementary sequences in the 3’ UTR of multiple target mRNAs
• Important post-transcriptional regulators of gene expression, usually silencing
• Even diagnostic of some conditions
• Each microRNA can target hundreds of mRNAs
• miRNA analysis allows a more complete view of the biological system
microRNA Workflow
• Import - Vendor neutral support
• QA/QC & Exploratory Analysis
• Analysis – Diff. Expressed microRNAs
• Easy integration of GX & microRNA
1. Combine miRNA analysis with GX
2. Find miRNAs targeting genes of interest
3. Find miRNAs which correlate with targets
*Also available as an option in the GX workflow
miRNA Analysis
Import
microRNA Properties
• miRNA analysis is similar to gene expression analysis
• After importing the data, make sure the species is defined (File > Properties)
• If the marker ID is the miRNA name, the annotation file is optional
Normalization
• As the “best” normalization for miRNA data is still under discussion, here are a few normalizations available in Partek
• Quantile normalization or full RMA
• Normalization to control probes/genes
• Normalization to 3rd quartile
• Loess
• Mathematical transformations such as logarithm
Exploratory analysis (QA/QC)
• PCA analysis, histogram analysis, clustering and much more
• Visualize natural sample grouping
• Distribution of data
• Preliminary analysis for ANOVA
Statistical analysis
• ANOVA to find differentially expressed microRNAs
• Mixed model, unlimited number of numeric/categorical factors
• Assumptions of ANOVA met
• Pair-wise comparisons
microRNA Target Databases
Other: not listed (e.g., PicTar)
Custom Format: Tab-Delimited file of microRNA-mRNA target pairs
mir mrna
mir1 GENE1
…. …
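A custom target file in the tab-delimited format above can be read and combined with result lists along these lines. This is only a sketch of the idea, not how PGS does it internally; the column names `mir` and `mrna` come from the example, while the function names and the sample identifiers are hypothetical.

```python
import csv
import io


def load_target_pairs(handle):
    """Read tab-delimited microRNA-mRNA pairs into {mir: set of mRNAs}."""
    targets = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        targets.setdefault(row["mir"], set()).add(row["mrna"])
    return targets


def combine_with_targets(sig_mirnas, sig_genes, targets):
    """For each significant miRNA, list its targets that are also significant."""
    sig_genes = set(sig_genes)
    return {m: sorted(targets.get(m, set()) & sig_genes) for m in sig_mirnas}


if __name__ == "__main__":
    # Hypothetical file content in the custom format.
    text = "mir\tmrna\nmir1\tGENE1\nmir1\tGENE2\nmir2\tGENE3\n"
    pairs = load_target_pairs(io.StringIO(text))
    print(combine_with_targets(["mir1", "mir2"], ["GENE1", "GENE3"], pairs))
```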
Integrative Genomics
Data is associated using target prediction databases
• By default: TargetScan or miRBase
Partek supports three integration experiment designs:
1. Separate GX and miRNA analysis (no statistics): combine the results of the two analyses
2. Differential gene expression analysis only: find miRNAs which target changed genes
3. Paired GX and miRNA analysis: correlate the expression of genes and the miRNAs which target them
Separate GX and miRNA analysis
1. Separate gene expression and miRNA analysis
What do I need? Two datasets: gene expression and miRNA. These do not have to be done on the same samples.
What will I get? A combination of the statistical results from both analyses.
What am I testing? Are my significantly changed genes the target of significantly changed miRNAs?
Combine miRNAs with their mRNA Targets
• Choose differentially expressed microRNA spreadsheet
• Select column with microRNAs
• Choose mRNA result spreadsheet (ANOVA or differentially expressed genes)
• Select Gene Symbol
1. Separate GX and miRNA analysis
Use case: I have run a gene expression experiment and a miRNA experiment. I want to see if my miRNAs of interest correspond with genes of interest.
If you do not yet have gene expression data, you can get a list of gene targets of miRNAs
Putative Targets
Get all the targets of the microRNAs
Differential gene expression analysis
2. Differential gene expression analysis only
What do I need? Any gene expression experiment can be used.
What will I get? microRNAs that target a disproportionately high number of significant mRNAs.
What am I testing? Are my significantly changed genes the target of specific miRNAs?
2. Over-represented microRNA targets: to test if significantly changed genes are the target of specific miRNAs
Fisher’s Exact test: the p-value indicates the over-representation of targets within the gene list
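The over-representation test can be sketched as a one-sided Fisher's exact test, which is the upper tail of a hypergeometric distribution. Whether PGS reports a one-sided or two-sided p-value is not stated on the slide, so the one-sided form is an assumption; `enrichment_p` and its parameter names are hypothetical.

```python
from math import comb


def enrichment_p(n_sig_targets, n_sig, n_targets, n_total):
    """Probability of seeing at least n_sig_targets targets by chance.

    n_total:   all genes tested
    n_targets: genes targeted by the miRNA
    n_sig:     genes in the significant list
    """
    upper = min(n_sig, n_targets)
    favorable = sum(
        comb(n_targets, k) * comb(n_total - n_targets, n_sig - k)
        for k in range(n_sig_targets, upper + 1)
    )
    return favorable / comb(n_total, n_sig)


if __name__ == "__main__":
    # 10 genes tested, 4 are targets, 3 are significant, all 3 are targets.
    print(enrichment_p(3, 3, 4, 10))
```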
Differential Expression Results
The smaller the enrichment p-value, the more over-represented the miRNA’s targets are among the significant genes
miR-124 has been found to be the most abundant microRNA expressed in neuronal cells
Paired GX and miRNA analysis
3. Paired GX and miRNA analysis:
What do I need? A gene expression and miRNA analysis using the same samples in both assays. Sample IDs must match.
What will I get? The correlation of miRNAs with their targets.
What am I testing? Does miRNA abundance affect mRNA abundance?
Paired GX and miRNA analysis
• Results:
• Negative correlation = high level of microRNA associated with low expression of targeted gene
• Positive correlation = high level of microRNA associated with high expression of targeted gene
Each row is a miRNA with its targeted mRNA pair
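A minimal sketch of the paired analysis for one miRNA and one target, matching samples by ID before correlating. The slide does not say which correlation coefficient is used, so Pearson is an assumption here; the function names and sample IDs are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def paired_correlation(mirna_by_sample, gene_by_sample):
    """Correlate one miRNA with one target using only the shared sample IDs."""
    shared = sorted(set(mirna_by_sample) & set(gene_by_sample))
    return pearson(
        [mirna_by_sample[s] for s in shared],
        [gene_by_sample[s] for s in shared],
    )


if __name__ == "__main__":
    mirna = {"S1": 1.0, "S2": 2.0, "S3": 3.0, "S4": 9.0}
    gene = {"S1": 6.0, "S2": 4.0, "S3": 2.0}
    # S4 has no gene expression measurement, so it is ignored.
    print(paired_correlation(mirna, gene))
```

A strongly negative value is what would be expected if the miRNA silences its target.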
Correlation Scatter Plot
<Right+Click> row header
Gene Expression | microRNA | Exon | Copy Number | ChIP-chip | Methylation
Workflow
Gene Level Analysis
(summarize to genes)
Alternative Splicing Analysis
(Exon level analysis)
*Import through the workflow will be exon-level analysis
Import
Gene Level analysis – Summarize Exons
Differential Expression – Gene Level
• Summarize exons to genes
• ANOVA
• Detect differential gene expression
Alternative splicing
[Figure: expression of Exon A and Exon B in groups D and N, contrasting differential expression with no alternative splicing against alternative splicing]
Alternative splicing analysis – ANOVA
Specify ANOVA model
Filter out non‐expressing probes
Three Views into Expression
Exon Level
Gene Level
Differentially Expressed Exons
Differentially Expressed Genes
Alt Splice Candidates
Gene View
Access from Alt-Splice Result Spreadsheet
Filtered - translucent
Gene Expression | microRNA | Exon | Copy Number | ChIP-chip | Methylation
What is copy number variation?
• Copy Number Variation (CNV) is a segment of DNA in which copy number differences have been found by comparison of two or more genomes.
• CNV is caused by genomic rearrangements such as deletions, duplications, inversions and translocations of particular genetic regions
• CNV can be detected by using a collection of closely spaced genomic markers to measure the abundance of DNA across multiple samples and comparing it against a reference.
• Range 1kb to several Mb
Amp Del
Normal
Standard Copy Number Processing Workflow
Copy Number/LogRatio
Detect regions on each sample
Analysis on regions across samples
Import from Affymetrix, Agilent, Illumina, NimbleGen, etc…
Find genes that overlap with regions
Import Allele Intensity (Affy .cel files)
Genomic integration
Biological interpretation
Visualize data at any of the steps
Paired/Unpaired Copy Number Creation
PAIRED
• Two samples (case/control) taken from each subject
• The normal sample is the baseline for the case sample for each subject
• Output copy number values only for the case sample in each subject
UNPAIRED
Affy: SNP6, SNP5, 100K, 500K
Illumina: 1M, Omni1-Quad
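For the paired design, one common way to turn intensities into copy number is to scale the case/normal ratio by the two copies expected in a diploid baseline. The sketch below assumes that convention; it is not taken from the PGS documentation, and `paired_copy_number` is a hypothetical helper.

```python
import math


def paired_copy_number(case_intensity, normal_intensity, baseline_copies=2):
    """Return (estimated copy number, log2 ratio) for one marker."""
    ratio = case_intensity / normal_intensity
    # The normal sample is assumed to carry baseline_copies at this marker.
    return baseline_copies * ratio, math.log2(ratio)


if __name__ == "__main__":
    print(paired_copy_number(300.0, 200.0))
    print(paired_copy_number(100.0, 100.0))
```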
Baseline Choices
Better ability to detect true copy number
More robustness to sources of noise
• Unpaired, universal reference: large HapMap baseline run in third-party lab
• Paired: DNA and reference from same patient
• Unpaired, experimental reference: reference from similar samples run in same lab
• Unpaired, lab reference: reference from larger unrelated group of samples run in same lab
Import Copy Number Assays
Agilent
• Feature Extraction
• Choose LogRatio to import
• Change the log base to 2
Nimblegen
• .pair in raw data folder (LOESS – one color to the other), or
• *normalized.txt in processed data folder (output – corrected logratio)
Illumina
• Partek plug-in for GenomeStudio
• Log R Ratio (allele intensities)
• B allele frequency
• Genotype calls
Affymetrix
• .CEL (allele intensities)
• .CHP (genotype calls)
• Create Allele Ratio
Import Affymetrix MIP Chip
File > Import > Affymetrix > MIP Copy Number text file
• Choose input file, annotation file
• Three files to choose from: ASCN, Total Copy Number, Allele Ratio
• Values pre‐calculated by Affymetrix
• Must adjust Copy Number values below zero to use analysis options
• Set values below zero to small number
Choose Sample ID
• Necessary for integration with other integrative analysis (e.g., LOH, gene expression)
• Not recommended to use file name because of the different extension (i.e. CEL & CHP)
How to find regions of CNV? (Amplifications & Deletions)
• Monitoring trends across multiple adjacent markers
• Define chromosomal breakpoints where these trends in chromosomal abundance change
• Methods in Partek:
• Hidden Markov Model
• Genomic segmentation
2 “normal”
Partek Genomic Segmentation
• Find a breakpoint that produces different neighboring regions
Segmentation Parameters
• Specify minimum number of genomic markers
• Two-sided t-test to compare two neighboring regions
• The significance and the amount of change decide whether a breakpoint is inserted
Region Report
• Two one-sided t-tests compare the mean of the region with the expected range to determine aberration status
• Expected range: the range around each expected copy number. In a diploid region, the expected range would be 2 +/- 0.3, which is from 1.7 to 2.3.
Signal to Noise: values 2.6 and 2.2 differ by (2.6 - 2.2) = 0.4 > 0.3
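The role of the expected range can be illustrated with a deliberately simplified sketch that compares a region mean directly against 2 +/- 0.3. The region report itself uses one-sided t-tests, which are not reproduced here; the status labels and the `classify_region` helper are hypothetical.

```python
def classify_region(mean_copy_number, expected=2.0, tolerance=0.3):
    """Label a region by where its mean falls relative to the expected range."""
    if mean_copy_number > expected + tolerance:
        return "amplification"
    if mean_copy_number < expected - tolerance:
        return "deletion"
    return "unchanged"


if __name__ == "__main__":
    for mean in (2.6, 2.2, 1.5):
        print(mean, classify_region(mean))
```

A real report also accounts for the variability and the number of markers in the region, so a mean just outside the range is not automatically called aberrant.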
HMM vs Segmentation
• HMM: good on homogeneous samples with anticipated states (copy number)
• Segmentation: good for heterogeneous samples when you don’t know the copy number state
Segmentation Result Spreadsheet
• One row per segment per sample
• First 3 columns are the genomic location: chromosome, start, end
• Copy Number status is based on the report parameters
GC Wave Correction on Copy Number
Adjust copy number/logratio based on local GC content
• Need reference genome in .2bit format
Diskin et al., Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms, Nucleic Acids Res., 2008, 36(19)
Analyze Detected Segments
[Figure: regions of samples 1 to 5 aligned and combined into result regions]
Analyze regions across multiple samples
Detect changes by phenotype
• An imbalance between samples indicates increased significance between levels
• Can have a minimum number of probes less than the Segmentation default
Segment‐analysis Spreadsheet
Copy number status information among all the samples
Detect changes by category (i.e. phenotype) ‐ Chi‐square results
Detailed information on each sample for each region (i.e. average copy number)
Plot Detected Regions
Karyoview (Histogram View): sample frequency on aberration regions
Classification View: view each region in each sample separately
Create Region List
• Specify criteria to filter down to interesting regions based on
• p-value
• length
• number of markers
• chromosome
• number of aberration samples
Find Overlapping Genes
• Annotations can be attached onto any region spreadsheet
• The annotation source can be specified or mapped to custom annotations
• Region can be extended on both ends
• Output to a new gene list spreadsheet or to a new column in the selected region spreadsheet
Gene Overlap Result
Output to a new gene list spreadsheet or to a new column in the selected region spreadsheet
• Region Overlap: the intersection in base pairs divided by the size of the region
• Gene Overlap: the intersection in base pairs divided by the size of the gene
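The two overlap fractions can be computed as in the sketch below. Half-open coordinates (start inclusive, end exclusive) are an assumption made for simplicity, and `overlap_fractions` is a hypothetical helper.

```python
def overlap_fractions(region, gene):
    """Return (region overlap, gene overlap) for two (start, end) intervals."""
    (r_start, r_end), (g_start, g_end) = region, gene
    # Length of the shared stretch in base pairs; 0 if they do not touch.
    intersection = max(0, min(r_end, g_end) - max(r_start, g_start))
    return intersection / (r_end - r_start), intersection / (g_end - g_start)


if __name__ == "__main__":
    # Gene hangs over the end of the region: gene overlap below 1.
    print(overlap_fractions((100, 200), (150, 250)))
    # Gene lies fully inside the region: gene overlap equals 1.
    print(overlap_fractions((100, 200), (120, 160)))
```

A gene overlap below 1 means the region boundary falls inside the gene.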
Right click on a row header to browse to location
Finding Genes on Breakpoints
• Add refseq genes to shared segments in a new sheet
• Any gene with a “gene overlap” of less than 1 is a potential breakpoint gene
• These could be fusion genes depending on where the breakpoint overlaps with the gene and possibly be “driving” a phenotype
Amplification
Test for Known Abnormalities
• Input files:
• Filtered segmentation/HMM result spreadsheet
• Abnormality database
• If overlap = positive
• Output: each row is a feature tested in each sample
Cluster Genome
Copy number spreadsheet is used to verify how the samples are clustered on the whole genome or selected chromosomes
• Default shows clustering on chromosome 1
• Click the Show All button to cluster on the whole genome
• Combine left and right clicks on the chromosome numbers to select chromosomes
Chromosome View
Copy Number & LOH
• Regions of Copy Number Amplification and Deletion
• Regions of LOH
• Combined regions of CN and LOH
• Overlapping genes
LOH & CNLOH
[Figure labels: Amplification; Deletion; Del w/ LOH; Copy neutral LOH; Amp w/ LOH]
Gene Expression | microRNA | Exon | Copy Number | ChIP-chip | Methylation
ChIP-ChIP
• Import
• Goal: Detect regions enriched by transcription factor chromatin immunoprecipitation (ChIP)
• Motif Discovery
• Find Genes that overlap enriched regions
Import, normalization
Affymetrix
• Adjust for probe sequence
• RMA background correction
• Quantile normalization
• Log base 2 transformation
Nimblegen
• .pair, .ngd, .pos
• Define species
• Choose annotation
Agilent
• Feature Extraction files
• LogRatio
• Transform Log(10) to Log(2)
More dense assays Less dense assays
ANOVA or t-test
ANOVA
• More complex model
• Batch effect
• Remember to calculate t-statistic
T-test
• 1 factor model
• Faster
• Typical experiments only have 1 comparison
• Result is t-statistic
The MAT algorithm (Johnson et al., PNAS, 2006) is used to find regions of binding in tiling experiments:
1. Estimating probe level t‐statistics
2. Using the trimmed mean of probe‐level t‐statistics in a window of fixed genomic length to generate MAT scores
3. An empirical distribution is used to determine MAT score significance by sampling windows from the original data
4. After identifying regions of a specified target length as significant, combine with other close regions
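Step 2 above can be sketched as a trimmed mean of probe-level t-statistics inside a window of fixed genomic length. This follows only the description on the slide: any additional scaling in the published method, the empirical significance step, and the merging of regions are left out. The window length of 600 bases and the 10% trim are hypothetical defaults, as are the function names.

```python
def trimmed_mean(values, trim=0.1):
    """Mean after dropping the given fraction of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    core = ordered[k:len(ordered) - k] if len(ordered) > 2 * k else ordered
    return sum(core) / len(core)


def window_scores(positions, tstats, window=600, trim=0.1):
    """One score per probe: trimmed mean of t-statistics within the window.

    positions: genomic coordinate of each probe on one chromosome
    tstats:    probe-level t-statistic for each probe, same order
    """
    scores = []
    for center in positions:
        in_window = [
            t for p, t in zip(positions, tstats)
            if abs(p - center) <= window / 2
        ]
        scores.append(trimmed_mean(in_window, trim))
    return scores


if __name__ == "__main__":
    print(window_scores([0, 100, 200, 1000], [1.0, 2.0, 3.0, 10.0],
                        window=400, trim=0.0))
```

Trimming keeps a single extreme probe from dominating the window score.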
Detect Regions of Significance (MAT)
MAT
MAT Algorithm Results
P‐value(region): the p‐value of the most significant window in the region
Fraction of negatively enriched: proportion of false-positive probes in a region = (# of probes not significant) / (# of probes in reported region). Regions with a high value are less confident, or are caused by a large number of false positives.
MAT-score: maximum MAT score for this region
• pos(+) = trimmed mean of t-statistics from specified contrast was positive
• neg(-) = trimmed mean of t-statistics from specified contrast was negative
Create Region List
• Positive MAT score means one group is enriched over another
• Filter by MAT score > 0
• Filter by p-value
• Intersection between two region lists
Motif Discovery
De novo: Gibbs Motif Sampler
Known: JASPAR Database
Find Overlapping Genes
*PAZAR coming soon
Distance of Methylation to TSS
Thank you for your attention
Hungry???