Microarray Data Analysis in GeneSpring GX 11 (2016-08-30)
GeneSpring GX 11 Genome Browser
Navigation:
• Intelligent navigation by jumping to the next entity of interest (copy number region, SNP in list, etc.)
• Zoom in by dragging a box in the Chromosome Navigator or within the tracks
• Pan around the region of interest
• Search for an entity of interest by annotation (gene symbol, coordinates, etc.)
GeneSpring GX 11 Genome Browser
• Multiple samples or conditions can be displayed as individual tracks or “merged” in the same track
• Data from different experiment types can be displayed in same browser and “merged”
• Plot raw and normalized intensity values, copy number, LOD, and other list associated values
• Select multiple annotation tracks to be displayed (e.g. miRNA, CpG islands, CNVs from DGV, etc.)
Merge tracks 1 and 2
Genome Browser Terminology
• Organism: Species (Scientific and Common Name, Taxonomy ID)
• Build: Chromosomal number, size, and optional banding patterns (cytoband file)
• Features: Annotation Tracks
Hierarchy is
Organism
Build (one or more per organism)
Feature (one or more per build)
Adding Information to Genome Browser
• Create the organism if not present.
Human, mouse, rat, C. elegans available from Agilent server.
• Create a [new] build using a cytoband file.
• Add annotation tracks (.BED files or advanced import). Drag
and drop (text files) or menu-driven.
Where to get the data? UCSC is a good source.
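For orientation, a BED annotation track is a tab-separated text file whose first three columns are chromosome, start (0-based) and end, optionally followed by a feature name. The sketch below reads such lines outside GeneSpring GX; the function name, the class name and the two example features are illustrative assumptions, not something taken from the tool.

```python
from typing import Iterable, List, NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive
    name: str


def parse_bed(lines: Iterable[str]) -> List[BedFeature]:
    """Parse minimal BED lines, skipping header and comment lines."""
    features = []
    for line in lines:
        line = line.strip()
        # "track" and "browser" lines are display headers, not features.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t") if "\t" in line else line.split()
        name = fields[3] if len(fields) > 3 else ""
        features.append(BedFeature(fields[0], int(fields[1]), int(fields[2]), name))
    return features


# Illustrative lines only; the names and coordinates are not from the slides.
example = [
    'track name="CpG islands (illustrative)"',
    "chr1\t28735\t29810\tCpG_island_1",
    "chr1\t135124\t135563\tCpG_island_2",
]
features = parse_bed(example)
for f in features:
    print(f.chrom, f.start, f.end, f.name, "length:", f.end - f.start)
```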
Useful Manipulations
• Overlay or merge them (select the tracks to be overlaid
and use:
• Resize tracks to fit in window (right click in track window
and Change Track Size).
Why would I use this?
The Genome Browser is a visualization tool for annotations
as well as data.
For expression studies, you might view your expression data in the context of chromosomal location (up- or down-regulated genes that are physically adjacent might be due to chromosomal aberrations).
Does your down-regulated gene have a known CpG island?
Visual thinkers?
GeneSpring GX 11:
Results Interpretations
GO Analysis (Function, Process, Location)
GSEA (Gene Set Enrichment Analysis)
GSA (Gene Set Analysis)
Pathway Analysis
Find Significant Pathways (BioPax.org)
Link to IPA (Ingenuity Systems)
Extract relation via NLP (mine literature)
MeSH Pathway Builder
Gene Ontology
Gene Ontology is a controlled vocabulary that describes gene products in
terms of their associated cellular components, molecular functions, and
biological processes in a species-independent manner:
• Cellular Component: Where does the gene product act? Ex: cytoplasm,
extracellular matrix, etc
• Molecular Function: What is the activity of the gene product? Ex: insulin
receptor binding, drug transport activity, etc
• Biological Process: A series of events accomplished by one or more
ordered assemblies of molecular functions. Ex: cell division, cell motility,
etc.
GeneSpring GX is packaged with GO terms and their relationships, as provided in files from the Gene Ontology Consortium
• Ontology files are periodically updated and provided as data updates in the tool (Annotations > Update Technology Annotations > From Agilent Server)
Is there a significant enrichment of my genes of
interest in a particular GO term?
Differentially expressed genes
identified from statistical analysis
GO Analysis Enrichment Scoring
The p-value calculated for each GO term is the probability that at least as many of your genes of interest would fall into that category just by chance
If 400 out of 40,000 entities on the array were found in the cell division category, what is the likelihood that you would find 40 out of 400 entities in the input Entity List in the cell division category just by chance?
GeneSpring GX uses the standard hypergeometric distribution and the following information to compute the p-value:
• Number of entities in the Entity List with the particular GO term and its children
• Number of entities with the GO term on entire array
• The total number of entities in the Entity List
• The total number of entities on array
For an unbiased calculation, multiple probes corresponding to the same Entrez ID (gene) are collapsed before enrichment scoring is performed
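The cell division example above can be worked through with a small hypergeometric calculation. This is a minimal sketch, assuming the p-value is the upper tail (the chance of seeing at least the observed number of hits); the function and argument names are illustrative, and GeneSpring GX's own implementation is not shown in these slides.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    k: entities in the Entity List annotated with the GO term
    n: total entities in the Entity List
    K: entities on the array annotated with the GO term
    N: total entities on the array
    """
    upper = min(n, K)
    # Exact integer count of favourable draws, divided once at the end.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Slide example: 400 of 40,000 array entities are in "cell division",
# and 40 of the 400 entities in the input list carry that term.
p = enrichment_p_value(k=40, n=400, K=400, N=40_000)
print(f"p-value = {p:.3g}")
```

By chance alone one would expect about 400 × 400 / 40,000 = 4 entities from the list in the category, so observing 40 gives a very small p-value.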
GO Ontology
Gene X
• Multiple null hypotheses are tested (one for each term tested), so a form of correction should be applied
• Hypotheses are not independent, as GO terms are related within a hierarchy (each GO term can have multiple parents)
• GeneSpring GX applies the Benjamini-Yekutieli correction, which takes into account the dependency among the GO categories
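The textbook Benjamini-Yekutieli adjustment can be sketched as follows. It is the Benjamini-Hochberg step-up procedure with an extra factor c(m) = 1 + 1/2 + ... + 1/m that allows for dependent tests. This illustrates the published procedure only; whether GeneSpring GX computes it in exactly this way is an assumption, and the function name is hypothetical.

```python
def benjamini_yekutieli(p_values):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic-sum factor that distinguishes BY from Benjamini-Hochberg.
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m * c_m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Three hypothetical GO-term p-values.
print(benjamini_yekutieli([0.01, 0.02, 0.03]))
```

With three tests, c(3) is about 1.83, so each adjusted value is larger than the corresponding Benjamini-Hochberg value by that factor.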
GSEA (Gene Set Enrichment Analysis)
GSEA interrogates genome-wide expression profiles from
samples belonging to two different classes (e.g., normal and
tumor) and determines whether genes in a pre-defined gene set
correlate with class distinction.
Useful for hypothesis generation (before analysis) as well as
post-statistical analysis.
Reference: Subramanian et al. Gene set enrichment analysis:
A knowledge-based approach for interpreting genome-wide
expression profiles. PNAS. September 30, 2005, 10.1073
GSEA
GSEA can use either BROAD lists or any lists that you create.
Download gene sets at: http://www.broad.mit.edu/gsea/
Broad Institute has defined five categories of gene sets:
• C1- Grouped based on cytogenetic location.
• C2- Functional lists. ~1000 gene lists corresponding to pathways or functional processes (for example, two genes that are both involved in inflammatory response can be in the same list)
• C3- Regulation lists. Grouped by promoter analysis. Genes in a list are regulated by the same motif (the transcription factor may or may not be known); they share the same binding motif and are therefore assumed to be co-regulated.
• C4- Proximity to known oncogenes and tumor suppressors. For example, all the neighbors of BRCA.
• C5- GO gene sets
Gene Symbol is a required annotation field.
GSEA Method
1. Rank genes based on the correlation between their
expression intensities and class distinction
• Genes that differ most in their expression between the two classes will
appear at the top and bottom of the list
• Assumption is that genes related to the phenotypic distinction of the
classes will tend to be found at the top and bottom of the list
2. Calculate enrichment score (ES) to reflect the degree of
overrepresentation of genes in a particular gene set at the
top and bottom of the entire ranked list
3. Derive p-value for the ES to estimate its significance level
4. Adjust p-value for multiple testing
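Step 2 can be illustrated with a running-sum enrichment score in the spirit of Subramanian et al.: walk down the ranked list, step up at each gene in the set (weighted by its score), step down at each gene outside it, and report the largest deviation from zero. This is a simplified sketch with hypothetical names and toy data; it omits the permutation-based p-value and the multiple-testing adjustment of steps 3 and 4.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene symbols sorted by correlation with class distinction
    scores:       the ranking metric for each gene, in the same order
    gene_set:     set of gene symbols to test
    weight:       exponent on |score|; 0 gives an unweighted running sum
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = hits.count(False)
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit_weight = sum(hit_weights)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, best = 0.0, 0.0
    for w, h in zip(hit_weights, hits):
        running += (w / total_hit_weight) if h else (-1.0 / n_miss)
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


# Toy ranked list: positive scores = higher in class 1, negative = higher in class 2.
genes = ["A", "B", "C", "D"]
metric = [2.0, 1.0, -1.0, -2.0]
print(enrichment_score(genes, metric, {"A", "B"}))  # set at the top of the list
print(enrichment_score(genes, metric, {"C", "D"}))  # set at the bottom of the list
```

A gene set concentrated at the top of the list gives a positive score, one concentrated at the bottom gives a negative score.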
GSEA Input Parameters
All Entities list can be used as input list
Multiple pairs of conditions can be analyzed at the same time
GSEA Input Parameters
Minimum number of matching Genes: minimum number of genes that must match between the Gene Set and the input Entity List for the Gene Set to be tested
Maximum number of permutations: maximum number of permutations to be performed for p-value computation
Search Options: Entity Lists can be used as Gene Sets. Click Advanced Search and Next>> to search for an Entity List to use
BROAD Gene Sets: choose the Gene Sets to use for analysis
Identifiers Necessary for GSEA
Technology must contain Gene Symbol
Columns that must be marked in custom technology to perform
GSEA:
• Annotation file must contain a column (Column X) containing Gene Symbol
– Column X must be marked “Gene Symbol”
– Select “Gene Symbol” mark from the drop-down menu while creating
Custom technology.
Gene Set Analysis (GSA)
GSEA and GSA share the same idea: it is more powerful to take a genome-wide approach, ranking genes by their correlation with the phenotypes being tested and checking whether any gene set is enriched at the top or bottom of that ranked list.
The approaches they use to determine whether there is enrichment differ in a few key ways:
1. GSA differs from GSEA in the method of calculating the test statistic
2. GSA uses a different approach for estimation of false discovery rates
3. GSA can handle multiple classes
The algorithm and the computation of the associated metric are detailed in the paper
http://www-stat.stanford.edu/~tibs/ftp/GSA.pdf
Pathway Analysis
Two types of pathway analysis in GeneSpring GX:
1. Pathway Enrichment Analysis: (Statistical)
“Find Significant Pathways” Tool
(via BioPAX format pathways)
2. Network Analysis: (Visualization)
“Pathway Analysis” Tool
Survey relationships from published literature OR
View networks based on experimental data
Importing and Visualizing BioPAX Pathways
BioPAX (Biological Pathway Exchange) is a standard pathway
data exchange format.
Pathways in the BioPAX format have the extension .owl
Allows GeneSpring users to import pathway data from KEGG, Reactome and many other standard pathway sites.
A database for any organism of interest (e.g. rice, zebrafish, chimpanzee, dog) can be created using BioPAX files
Find Significant Pathways Tool
Is there a significant enrichment of my genes of interest in a particular
pathway?
Analysis will be performed on every pathway that has been imported into
GeneSpring GX and every pathway created in GeneSpring GX
Find Significant Pathways Tool
• Click Finish to add all significant pathways into the currently active experiment
• Double Click on a pathway to open it in the viewer
Adding Pathways to Experiment
Any pathways imported into GeneSpring GX can be searched for and subsequently added to active experiment
• Search > Pathways
• Select Pathways to add
• Click on Add selected pathways to active experiment icon
[Screenshot callouts: “Add selected pathways to active experiment” icon; selected pathways]
Using Pathways to View PubMed Abstract Relationships
Create and view pathways based on MeSH terms.
Start with empty pathway, add terms, and expand network.
Use NLP tool to extract information from particular articles
(full-length) and then display relationships.
Viewing Pathways from Experimental Data
Use Network Building (Pathway Analysis) to answer
questions like:
How do the differentially regulated genes relate to each
other? Are they directly related? Or are they related via
intermediate proteins?
What are the common regulators of this set of genes?
Which small molecules might interact with a gene or set
of genes?
Relation from NLP inference
UBE2L3 up-regulates the
expression of MT1G. VEGFA
modulates this up-regulation of
expression
Example – advanced network building
Find all “protein targets” in the given list that are modified by “drugs”
Solution:
Algorithm: Expand the given list of proteins
Filter:
• Entity type = Small molecules
• Relation type = Binding, Regulation, Expression
Small molecules (drugs)
Proteins
Essential constituents of a network
• Nodes (molecular and biological entities)
• Relations (biological relations between entities)
• Edges (regulatory effects of a node on another)
Entities
Relations
Edges
Pathway Viewer
Proteins with blue halo are in currently selected list
Hover over proteins or connections for information
Pathway Viewer
Expression information can be overlaid on the pathway
• Right click and select properties
• Choose interpretation overlap and the appropriate interpretation
Fold Change Data Overlay
• Fold Change data may also be overlaid on pathways.
• Right click in open space of pathway.
• Choose Overlay Properties and select an entity list that
contains FC.
• Overlay Column: change to FC Absolute.
Pathway Relation Database
The general computer requirements are the same as GeneSpring GX, but there are additional space requirements.
Database Size:
Infrastructure database 150 MB
Interaction databases: full install ~4 GB
Relations and entities are organized into separate databases for each organism
• Human- >1.4 million relations
• Mouse- 674,725 relations
• Rat- 767,296 relations
• Drosophila- 82,090 relations
• C. elegans- 43,122 relations
• Yeast- 94,992 relations
• E. coli- 10,876 relations
• Arabidopsis- 23,918 relations
More than 16 million abstracts were parsed.
NLP generated relations
NLP extraction pipeline: The majority of the GeneSpring pathway database relations are derived from published PubMed abstracts using text mining.
[Diagram: PubMed text and a dictionary (proteins, enzymes, etc.; molecular functions and processes) feed the text-mining pipeline: input sentence, entity recognition, tagged sentence, syntax and semantics, inferencing (apply grammar rules to derive interactions), interactions]
Methods of Pathway Analysis in GeneSpring GX 11
Simple Analysis
• This option allows the user to explore the most common functionalities of a pathway
analysis.
Advanced Analysis
• This option aims to allow the user to explore all the functionalities of a pathway analysis
in detail and change the settings at every step of the analysis, as required.
Building Pathways
Algorithms:
Direct – connects nodes in the given list
Expand – stepwise expansion of an initial set of nodes to include the first-degree neighbors
Shortest connect – the minimum number of steps to trace a continuous path among all nodes in a list
Filters:
Quality – confidence score assigned to the relations
Connectivity – number of neighbors of each node (one can use this filter to rank the neighboring entities and limit the number of entities in a pathway view)
• Total number of neighbors (Global)
• Number of neighbors within a given network (Local)
Type – Class of entities or relations
• Entity types (proteins, small molecules, enzymes, etc.)
• Relation types (binding, protein modifications, expression, transport, etc.)
Pathways are built using a combination of Algorithms and Filter parameters
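The way an algorithm and filters combine can be sketched on a toy relation table. The example below covers only Direct and Expand with Type and Quality filters; the Relation fields, function names and toy entities (A, B, C, D, drug_X) are hypothetical and do not reflect how GeneSpring GX stores its relation database.

```python
from typing import List, NamedTuple, Optional, Set


class Relation(NamedTuple):
    source: str
    target: str
    rel_type: str   # e.g. "Binding", "Regulation", "Expression"
    quality: float  # confidence score assigned to the relation


def _passes(rel: Relation, allowed_types: Optional[Set[str]], min_quality: float) -> bool:
    """Type and Quality filters applied to a single relation."""
    type_ok = allowed_types is None or rel.rel_type in allowed_types
    return type_ok and rel.quality >= min_quality


def direct(relations, entities, allowed_types=None, min_quality=0.0) -> List[Relation]:
    """Direct: keep relations whose two endpoints are both in the list."""
    return [r for r in relations
            if r.source in entities and r.target in entities
            and _passes(r, allowed_types, min_quality)]


def expand(relations, entities, allowed_types=None, min_quality=0.0) -> List[Relation]:
    """Expand: keep relations touching the list, adding first-degree neighbors."""
    return [r for r in relations
            if (r.source in entities or r.target in entities)
            and _passes(r, allowed_types, min_quality)]


def nodes_of(relations) -> Set[str]:
    """All entities that appear in a set of relations."""
    return {n for r in relations for n in (r.source, r.target)}


# Hypothetical toy relations.
relations = [
    Relation("A", "B", "Binding", 0.9),
    Relation("B", "C", "Expression", 0.4),
    Relation("drug_X", "A", "Regulation", 0.8),
    Relation("C", "D", "Transport", 0.7),
]
my_list = {"A", "B"}
print(direct(relations, my_list))
print(sorted(nodes_of(expand(relations, my_list, min_quality=0.5))))
```

Raising the quality threshold drops the low-confidence B to C relation, so C is no longer pulled into the expanded network.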
Simple Pathway Analysis
Predefined Algorithms determine the type of analysis performed:
• Direct Interactions: Finds relations that connect the entities in the selected entity list.
• Network Targets: Finds downstream entity targets that connect two or more entities from the original list.
• Network Regulators: Finds upstream entity regulators that connect to two or more entities from the original list.
• Shortest Connect: Finds the smallest set of relations that will connect all entities in a given list into a single network.
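Network Regulators and Network Targets can be pictured as grouping directed relations by their upstream or downstream end and keeping only entities linked to two or more members of the original list. The sketch below uses hypothetical entities and function names, and it assumes every relation is a simple (regulator, target) pair.

```python
from collections import defaultdict


def network_regulators(directed_relations, entities, min_connected=2):
    """Upstream entities that regulate at least min_connected list members."""
    targets_by_regulator = defaultdict(set)
    for source, target in directed_relations:
        if target in entities:
            targets_by_regulator[source].add(target)
    return {r: t for r, t in targets_by_regulator.items() if len(t) >= min_connected}


def network_targets(directed_relations, entities, min_connected=2):
    """Downstream entities targeted by at least min_connected list members."""
    regulators_by_target = defaultdict(set)
    for source, target in directed_relations:
        if source in entities:
            regulators_by_target[target].add(source)
    return {t: r for t, r in regulators_by_target.items() if len(r) >= min_connected}


# Hypothetical (regulator, target) pairs.
edges = [
    ("R1", "A"), ("R1", "B"),   # R1 regulates both list members
    ("R2", "A"),                # R2 regulates only one
    ("A", "T1"), ("B", "T1"),   # T1 is a shared downstream target
    ("B", "T2"),
]
entity_list = {"A", "B"}
print(network_regulators(edges, entity_list))
print(network_targets(edges, entity_list))
```

Only R1 and T1 are reported, because R2 and T2 each connect to a single entity from the list.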
Simple Pathway Analysis
Preview pane for pathway shows if any relationships were found and how many new entities were added.
Click Next to save Pathway in Experiment
Modifying Pathway
• Save Pathway View
• Selection Mode
• Zoom Mode
• Pan Mode
• Zoom to fit view to visible area
• Zoom Selected Region
• Zoom In
• Zoom Out
• Select all
• Remove unlinked entities from the view
• Copy selection
• Paste selection
• Undo
• Redo
• Change Layout
Viewing Relationships and References
Double click on any node to discover more about the
relationship and source of information
Advanced Pathway Analysis
In Advanced Analysis, there are 3 algorithms to choose from:
• Direct: Finds relations that connect the entities in the selected entity list.
• Expand: Expands the existing network to include the first-degree neighbors
of the selected entities.
• Shortest Connect: Finds the smallest set of relations that would connect a
set of entities into a single network. Note that some intermediate entities may
be introduced in this process
Advanced Pathway Analysis
Direct Interactions – Choose the types of relationships that are
of interest
Advanced Pathway Analysis
Expand Interactions – Choose the types of relationships and
entities that are of interest
Advanced Pathway Analysis
Shortest Connect – Choose the types of relationships and
entities that are of interest
Deleting Pathways
Deleting a pathway will remove the pathway from the database within GeneSpring
Choose “Remove pathway” if you don't want to delete the pathway permanently
Identifiers Necessary for Viewing Pathways and Find Significant Pathways Tool
Technology must contain Entrez Gene ID and/or SwissProt
Columns that must be marked in custom technology to view pathways and use the Find Significant Pathways tool:
• Annotation file must contain a column X containing Entrez Gene IDs and/or a column Y containing SwissProt IDs
– Column X must be marked “Entrez Gene ID”
– Column Y must be marked “SwissProt”
Easier way of selecting multiple Entity Lists for Venn Diagram
Entity List Selection window for Venn Diagram automatically opens to display all Entity Lists for all open experiments
Multiple Entity Lists can be selected from window at once (Ctrl click) to display in Venn Diagram
Entity Lists can also be dragged and dropped into Venn Diagram
Select Entity Lists from window OR drag and drop Entity Lists