geWorkbench Hands-On Training

Session Date:

Session Length:

Target Audience:

Trainer:

Developer Subject Matter Expert:


geWorkbench

geWorkbench is being developed at the Joint Centers for Systems Biology, Columbia University

This work is supported by the NCI caBIG and the NIH NCBC programs.


► This training is designed for a user who is new to geWorkbench.

► Target Audience: Researchers and students interested in microarray gene expression experiment analysis.

► The attendee is expected to have basic computer and biological knowledge.

► Note – this is not a complete introduction to all geWorkbench components. The primary goal is to describe those features developed for caBIG during Year 1 of the project, and the context in which they are used.

Session Details:


These slides are suitable for use in:

► Classroom Training

► Centra – Online Classroom

► Web-based Delivery

Session Details:

Overview of the Training Environment


► geWorkbench requires the Sun Java JRE 1.5 environment to be installed on your local machine.

► geWorkbench requires significant memory. At least 1 GB is recommended, especially if larger datasets are being read in or hierarchical clustering will be done.

► Windows, Linux and Mac/PowerPC versions of geWorkbench are available.

► See www.geworkbench.org for full details.

Session Details:

Hardware and Software


By the end of the training session participants should:

► Have a basic understanding of the purpose and aims of geWorkbench.

► Be able to set program preferences and load microarray data from local and remote sources.

► Understand how data files are organized into Projects, and how subsets of data can be formed and used.

► Use filtering and normalization components to prepare data.

► Analyze and view data using a number of new components.

Session Details:

Session Goals


► Introduction

► Tutorial Data

► Part 1 – Data management

♦ Lesson 1: Basics of the graphical interface

♦ Lesson 2: Setting Preferences

♦ Lesson 3: Projects and Data Files

♦ Lesson 4: Working with Data Subsets

♦ Lesson 5: Working with Remote Sources

► Part 2 – Data manipulation

♦ Lesson 6: Normalization

♦ Lesson 7: Filtering

♦ Lesson 8: Experiment Annotations

► Part 3 – Analysis and display

♦ Lesson 9: The Scatter Plot component

♦ Lesson 10: Expression Value Distribution

♦ Lesson 11: Reverse Engineering

♦ Lesson 12: Gene Annotation and Pathway Viewing

♦ Lesson 13: Hierarchical Clustering Analysis

♦ Lesson 14: ANOVA

► Part 4 – Workflow execution

♦ Lesson 15: caSCRIPT Editor

Session Details:

Outline of lessons


Introduction

Introduction


► This section will describe in general the capabilities of geWorkbench in the following areas:

♦ Microarray analysis.

♦ Sequence analysis.

♦ Access to remote data and services.

► A complete description of geWorkbench and online tutorials are available at www.geworkbench.org.

Introduction:

Overview


geWorkbench – a platform for tool and data integration

► geWorkbench is an open-source bioinformatics platform that provides an extensive collection of tools for the management, analysis, visualization and annotation of biomedical data.

► geWorkbench has been designed with a plug-in framework. As new techniques are developed and implemented, they can be added to geWorkbench.

► geWorkbench aims to allow different tools to easily work together, such as using microarray analysis to obtain a list of interesting genes, and then retrieving their coding or upstream sequences and using these in BLAST, pattern discovery, or transcription factor binding motif searches.

Introduction:

Overview


► Obtaining data from local or remote data sources

► Filtering and normalization

► Basic statistical analysis

► Clustering (Hierarchical, SOM)

► Gene Ontology analysis

► Reverse Engineering

► Visualization using many common tools

♦ Scatter Plot

♦ Volcano Plot

♦ Expression Profiles

♦ Expression Value Distribution

♦ Color Mosaic

♦ Dendrogram

geWorkbench supports many kinds of operations on microarray data:

Introduction:

Microarray data


► BLAST

► Pattern Discovery

► Transcription Factor Mapping

► Syntenic Region Analysis

geWorkbench also provides capabilities for working with sequence data:

Introduction:

Sequence data


► There are many biomedical data sources and computational services available through the internet. geWorkbench strives to make remote data and services directly available on the desktop, integrated with its own local tools.

► External sources provide expression data, sequences and annotation:

♦ Microarray gene expression repositories (caArray)

♦ Gene annotation web pages (via CGAP)

♦ DNA Sequence retrieval (UC Santa Cruz)

♦ Pathway diagrams (BioCarta via caBIO database at NCI)

Introduction:

External data services


geWorkbench also provides a gateway to several computational services, including some hosted on Columbia servers and clusters.

► BLAST – search for sequences similar to a query sequence.

♦ Access is provided both to a Columbia server and the NCBI BLAST service.

► Pattern Discovery – find repeated patterns in a group of sequences.

► Synteny – compare regions of one chromosome against another.

► Through the caGRID project, additional remote services are being added:

♦ Hierarchical clustering – tree-like grouping by expression similarity.

♦ SOM (Self-Organizing Maps) – divide expression profiles into a limited number of bins.

♦ ARACNE – regulatory network reverse engineering.

Introduction:

External computational services


Tutorial Data

Tutorial Data


► In this section we describe the downloadable tutorial data files. This is primarily a reference section. Other files are included in the data directory of the program itself.

► The data can be downloaded from http://wiki.c2b2.columbia.edu/workbench/index.php/Download

► There are several file types

♦ Microarray

♦ Affymetrix MAS5/GCOS format files – a single file per array, as produced by Affymetrix software.

♦ The geWorkbench data matrix format, which merges all expression data from a set of experiments into a single file. By default it uses the ending “.exp”.

♦ Genepix two-color array experiments (in base download).

♦ Sequence

♦ DNA and protein sequence files in FASTA format.
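
For orientation, a FASTA file is plain text in which each record starts with a header line beginning with ">" and is followed by one or more lines of sequence. The sketch below is only an illustration of that layout, not geWorkbench's own reader; the function name and the two sample records are made up for the example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, not taken from the tutorial files.
example = """>seq1 example DNA
ACGTAC
GGTA
>seq2 example protein
MKV
"""
records = parse_fasta(example)
```

Each record's sequence lines are joined into one string, so a sequence wrapped over several lines reads back as a single sequence.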

Tutorial Data:

Overview


All data sets used in the tutorials are available from the download area of the geWorkbench website

(http://wiki.c2b2.columbia.edu/workbench/index.php/Download).

The file "tutorial_data.zip" contains the following files:

► cardiogenomics.med.harvard.edu/ – contains 10 individual MAS5/GCOS format data files.

► webmatrix2_quantile_log2_dev1.2_mv0.exp – a geWorkbench "exp" format matrix file containing filtered, normalized data. This data originally derives from the file "webmatrix2.exp".

► NM_024426-Wilms.fasta – a GenBank nucleotide sequence file.

► NP_077744-Wilms.fasta – a GenBank protein sequence file.

► H1H5_HistoneDB_NHGRI.fasta – contains H1 and H5 histone sequences from the NHGRI.

► cluster_tree_total_pearsons_84_markers.csv – contains a list of genes derived from hierarchical clustering.

► 64of84ClusterPearsonsSeqs.fasta – contains upstream DNA sequences derived from a subset of the above genes.

Tutorial Data:

Data files


The example MAS5 format data files were obtained from the following site at Harvard University: http://cardiogenomics.med.harvard.edu/project-detail?project_id=229

A number of MAS5 format data files are available there. The specific project is the "Belgium Dataset of Aortic Stenosis, Congestive Cardiomyopathy and Normal LV Function", and the data is downloadable from: http://cardiogenomics.med.harvard.edu/groups/proj1/pages/download_Hs-belgium.html

An abstract describing the study is also available at: http://cardiogenomics.med.harvard.edu/groups/proj2/pages/Hs-belgium_home.html

Tutorial Data:

About the Cardiogenomics Microarray Dataset


Generation of the "webmatrix2_quantile_log2_dev1.2_mv0.exp" dataset.

The file "webmatrix2.exp", available in the Download area, contains results from 100 Affymetrix HG-U95Av2 chips containing B-cell samples from numerous different disease states. 12,600 probes are represented.

For use in these tutorials we normalized and filtered the data. The steps on the next page are just an example of how filtering and normalization can be used, and each dataset should be handled according to the type of analysis being undertaken and its goals.

Tutorial Data:

Generation of example microarray dataset


The dataset was created through the following steps:

1. Normalization: Quantile normalization.

2. Normalization: Log2 transformation.

3. Filtering: Deviation filter with Deviation bound of 1.2.

4. Filtering: Missing values filter with maximum number of missing arrays of 0.

The result of performing these steps is available as the file "webmatrix2_quantile_log2_dev1.2_mv0.exp", found in the tutorial data file "tutorial_data.zip".
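
Steps 2 to 4 can be sketched in a few lines of Python. This is a rough illustration only, not the geWorkbench implementation. It assumes that "deviation" means the standard deviation of a marker across arrays, that markers below the bound are dropped, and that a missing value is represented by None. The quantile normalization of step 1 is left out here, and the sample probes are invented.

```python
import math
import statistics


def log2_transform(matrix):
    """Apply log2 to every positive value; None marks a missing value."""
    return {marker: [math.log2(v) if v is not None and v > 0 else None
                     for v in row]
            for marker, row in matrix.items()}


def deviation_filter(matrix, bound):
    """Keep markers whose standard deviation across arrays is >= bound."""
    kept = {}
    for marker, row in matrix.items():
        present = [v for v in row if v is not None]
        if len(present) >= 2 and statistics.pstdev(present) >= bound:
            kept[marker] = row
    return kept


def missing_values_filter(matrix, max_missing):
    """Keep markers with at most max_missing missing values."""
    return {marker: row for marker, row in matrix.items()
            if sum(v is None for v in row) <= max_missing}


# Invented example data: marker -> values on four arrays.
data = {
    "probe_a": [2.0, 64.0, 2.0, 64.0],   # varies strongly
    "probe_b": [8.0, 8.0, 9.0, 8.0],     # nearly constant
    "probe_c": [1.0, None, 256.0, 1.0],  # has a missing value
}
result = missing_values_filter(
    deviation_filter(log2_transform(data), bound=1.2), max_missing=0)
```

With these inputs only probe_a survives: probe_b varies too little to pass the deviation bound, and probe_c is dropped because it has a missing value.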

Tutorial Data:

Generation of example microarray dataset


Part 1: Data Management

Data Management



Objectives

The objective of Part 1 is to learn the basic operation of geWorkbench. This includes understanding the layout of the graphical interface in four main functional regions, and setting user preferences. The loading of local and remote data files will be demonstrated. Perhaps of most importance is understanding how geWorkbench allows data to be divided into subsets, both for setting up analyses and utilizing their results.

After completing Part 1, you should be able to:

1. Load microarray data into geWorkbench from local and remote sources, and set display preferences.

2. Understand how the data can be organized into projects and manipulated using sets.


Lesson 1: Basics of the graphical interface

Lesson 2: Setting Preferences

Lesson 3: Projects and Data Files

Lesson 4: Working with Data Subsets

Lesson 5: Working with Remote Sources


Lesson outline


Basics of the graphical interface.

Lesson 1: Basics of the graphical interface:


The graphical user interface for geWorkbench is divided into four major sections

1. Data management Workspace and Projects (upper left).

2. Marker and Array/Phenotype set selection and management (lower left).

3. Visualization tools (upper right).

4. Analytical tools (lower right).

Areas 2, 3 and 4 are defined for convenience. The actual placement of a given component into any of these three areas is controlled by a configuration file and can be customized as desired.

Lesson 1: Basics of the graphical interface

The four areas of the GUI


Menu bar

► The GUI provides a menu bar at top with a standard choice of commands.

► Many commands that are available in the menu bar are also available by right-clicking on data objects.

Data management area (area 1)

► Working with geWorkbench involves creating a project within the top-level Workspace.

► Opened data files and the results of analysis are stored within a Project.

► Multiple projects can be used within a workspace to organize data.

► A workspace and all the projects and data within it can be saved and later reloaded.


Menu bar and data management area


Set selection and management (area 2)

► geWorkbench allows sets of markers (gene probes) and of arrays/phenotypes to be defined and used. This allows the application to:

♦ analyze only a desired subset of the data

♦ Return lists of genes from one module which can then be used in another module, e.g. a list of genes returned by a t-test of differential expression can then be further investigated through sequence retrieval and analysis.


Set selection area


Visualization and Analysis tools (areas 3 and 4)

► To simplify the display area, only the visualization and analysis components relevant to the type of dataset currently selected in the Project Folders area (area 1) are displayed.

► Thus choosing a microarray dataset will result in a different set of tabs being displayed as compared with those seen when a nucleotide sequence file is selected.

► When a new data file is loaded, or an analysis produces a new data set, not only is it added to the Project area (area 1), but an appropriate viewer in the Visualization area (area 3) is automatically selected.

► A selection of visualization and analysis tools will be demonstrated in the following sections.


Visualization and analysis areas


Setting Preferences

Lesson 2: Setting Preferences


Preferences

► The Preferences selection in the Tools menu allows users to specify how certain aspects of the system will behave.

► Once the preferences are set, they are persistent between application sessions.

► From the main menu, click on Tools > Preferences.

Modifying Settings


Modifying settings


Modifying Settings

► Text Editor: The editor selected will be used to open and inspect data sets loaded in a project. Notepad is the default setting.

► Visualization: The color scheme to be applied to color mosaic images.

♦ Absolute: (default) Values are scaled against the largest absolute value found in the dataset, with positive values red and negative green.

♦ Relative: Each marker is mean-variance normalized across all arrays. A red-blue color scheme is used, with red showing positive and blue negative values.

► Genepix Value Computation: Specifies how to compute the value displayed for a Genepix array. The default setting is Option 1 (Mean F635 - Mean B635) / (Mean F532 - Mean B532).
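
The default Genepix value (Option 1) is a background-subtracted ratio of the two channels. A minimal sketch of that arithmetic follows; the function name and the numbers are invented, and returning None for a zero denominator is an assumption of this example, not documented geWorkbench behavior.

```python
def genepix_ratio(mean_f635, mean_b635, mean_f532, mean_b532):
    """(Mean F635 - Mean B635) / (Mean F532 - Mean B532).

    Returns None when the 532 channel has no signal above background,
    since the ratio is undefined in that case (assumed handling).
    """
    denominator = mean_f532 - mean_b532
    if denominator == 0:
        return None
    return (mean_f635 - mean_b635) / denominator


# Invented spot intensities: 1000 above background at 635, 500 at 532.
ratio = genepix_ratio(1200.0, 200.0, 700.0, 200.0)
```

Here the spot is twice as bright in the 635 channel as in the 532 channel once background is subtracted, so the displayed value would be 2.0.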


Modifying settings


► The relative display performs its own transformation on the data just for purposes of visualization. The underlying data is not changed.

► The relative selection for the Microarray Viewer preference will give odd-looking results if only a small number of arrays are loaded (e.g. 2). This is because with only two values, each point will be at a color extreme – either blue or red.

► Changing the Microarray Viewer relative/absolute preference will not take effect until the next time a data set is loaded.
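
The two-array effect noted above can be reproduced with a short sketch. It assumes that "mean-variance normalized" means subtracting the marker's mean and dividing by its standard deviation (population form); the exact formula used by geWorkbench is not stated in these slides.

```python
import statistics


def relative_scale(values):
    """Mean-variance normalize one marker's values across arrays."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # a constant marker has no spread
    return [(v - mean) / sd for v in values]


# With only two arrays, every marker lands at the two extremes.
two_arrays = relative_scale([5.0, 9.0])
# With more arrays, intermediate values (and colors) become possible.
many_arrays = relative_scale([5.0, 6.0, 7.0, 8.0, 9.0])
```

For two arrays the result is always -1 and +1, whatever the original values were, which is why the display shows only the strongest blue and red.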


Notes


Projects and Data Files

Lesson 3: Projects and Data Files


geWorkbench supports a number of data file formats, including:

For Microarrays:

► Affymetrix MAS5/GCOS text files.

► Affymetrix File Matrix - this is the native file type created by geWorkbench, and contains a data matrix from any number of experiments merged together.

► RMA Express File - RMA Express is a sophisticated tool for combining data from multiple Affymetrix chips. It is not a part of geWorkbench.

► Genepix Files – created by a popular analysis program for two color arrays.

For Sequence:

► FASTA Files. DNA or protein sequence files in FASTA format.

► Pattern Files – created by the Pattern Discovery component.


File types


In this example, we will load 10 individual Affymetrix MAS5 format files, merging them into a single dataset.

1. Create a Project. All data must belong to a project. Right-click on the Workspace entry in the Project Folders window at upper left to create a new project.

2. Next, right-click on the new Project entry and select Open Files.


Opening a file


3. Select file type Affymetrix MAS5/GCOS as shown.

4. Make sure to check the Merge files checkbox.

5. Select 10 MAS5 format text files from the tutorial data directory.

6. Click Open.

The chip type HG_U95Av2 is recognized...


Loading and merging data


The merged dataset is listed in the Project folder. The data is displayed, in single array format, in the Microarray Viewer. Note we have increased the intensity slider to maximum here.
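
Conceptually, merging turns one list of measurements per array into a single marker-by-array matrix. The sketch below illustrates that idea only. It assumes each file has already been reduced to a mapping from probe name to value (reading the MAS5 text layout is not shown), and the array names and values are invented.

```python
def merge_arrays(arrays):
    """Merge per-array {probe: value} mappings into one matrix.

    `arrays` maps an array name to that array's measurements. The result
    has one row per probe and one column per array, in input order.
    Probes absent from an array are filled with None.
    """
    array_names = list(arrays)
    probes = sorted({p for values in arrays.values() for p in values})
    matrix = {p: [arrays[name].get(p) for name in array_names]
              for p in probes}
    return array_names, matrix


# Invented values; the array names only mimic the tutorial's naming.
arrays = {
    "JB-ccmp_1": {"1000_at": 120.5, "1001_at": 33.0},
    "JB-n_1": {"1000_at": 98.2, "1001_at": 41.7},
}
names, merged = merge_arrays(arrays)
```

Each row of `merged` then holds one probe's values across all loaded arrays, which is the shape the later normalization and filtering steps work on.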


Viewing data


► The merged dataset can be given a shorter name.

♦ Right click on the merged dataset and select Rename.

♦ Enter a new dataset name, e.g. merged_cardio.

► The dataset can also be saved to disk for later reuse.

♦ Right-click on the merged dataset and select Save.

♦ Enter a filename.


Renaming and saving a merged dataset


Working with subsets of data

Lesson 4: Working with Subsets of Data


► geWorkbench makes extensive use of sets of markers (genes) or arrays.

► Sets can be defined by the user, or may be created as a result of an analysis.

► Sets of arrays can be used to distinguish between different experimental states, for example as part of a statistical analysis.

♦ The t-test requires two states be defined for comparison.

► Sets of markers are returned from various analysis routines. For example, the t-test returns a list of markers showing significant differential expression, and after hierarchical clustering, the markers in a subtree of the resulting dendrogram can be saved.

► geWorkbench supports groupings of sets. Each such group can contain different sets of markers or arrays.


Background


In this tutorial you will learn:

► How to create a set of arrays.

► How to mark a set of arrays as "Active".

► How to classify a set of arrays, e.g. as "case" vs. "control".

► How arrays can be grouped in different ways with descriptive tags.


Overview


The first example here will use the same data files read in and merged in the previous lesson (Projects and Data Files).

The second example will use the tutorial file webmatrix2_quantile_log2_dev1.2_mv0.exp


Preparation


First, we will select and label arrays which contain samples from the congestive cardiomyopathy disease state:

We will leave the arrays in the default group. However, you can create a new group by pushing the New button in Array/Phenotype Sets, located at the lower left of the application (arrow labeled New).

1. In the Arrays/Phenotypes component, select the six arrays beginning with JB-ccmp, which represent the samples from the congestive cardiomyopathy disease state.

2. Right click, select Add to Set.



Assigning arrays to sets

New


3. Enter "CCMP" in the input box and click OK.


4. Next, similarly label the arrays beginning with JB-n as "Normal".

The Array/Phenotype Sets component will now show the two sets added:



Assigning arrays to sets


The boxes next to the set name can be checked to indicate that a set of arrays is "Active". Various analysis and visualization components can be set to only use/display activated arrays or markers.


Activating sets

Note – if no Marker sets are explicitly activated, then all Markers are implicitly active. The same applies to Arrays.
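
The activation rule in the note can be written down as a small sketch. It is a plain-Python illustration of the rule as stated here, not geWorkbench code. The function name and the array names are hypothetical, and using the union of all activated sets is an assumption.

```python
def active_items(all_items, sets, active_names):
    """Return the items an analysis should use.

    If no set is activated, every item is implicitly active. Otherwise the
    union of the activated sets is used, in the original item order.
    """
    if not active_names:
        return list(all_items)
    chosen = set()
    for name in active_names:
        chosen.update(sets[name])
    return [item for item in all_items if item in chosen]


# Hypothetical array names and the two sets defined in this lesson.
arrays = ["JB-ccmp_1", "JB-ccmp_2", "JB-n_1", "JB-n_2"]
array_sets = {
    "CCMP": ["JB-ccmp_1", "JB-ccmp_2"],
    "Normal": ["JB-n_1", "JB-n_2"],
}
everything = active_items(arrays, array_sets, [])
normal_only = active_items(arrays, array_sets, ["Normal"])
```

With no box checked, all four arrays are used; checking only "Normal" restricts the analysis to the two normal arrays.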


For statistical tests such as the t-test, Case and Control groups can be specified.

1. Left-click on the thumb-tack icon in front of the phenotype name.

2. Select Case to specify the disease arrays as the "Case". The remaining "Normal" arrays are by default considered Control.



Classifying a set


3. A red thumbtack indicates an array set has been marked as "Case".
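
To show why two classified groups are needed, here is a sketch of a two-sample t statistic computed for a single marker from its Case and Control values. Welch's unequal-variance form is used as an assumption; these slides do not say which variant geWorkbench applies. The expression values are invented, and groups with zero variance are not handled.

```python
import math
import statistics


def welch_t(case, control):
    """Welch's two-sample t statistic for one marker.

    Each group needs at least two values and a non-zero variance.
    """
    mean_diff = statistics.mean(case) - statistics.mean(control)
    standard_error = math.sqrt(
        statistics.variance(case) / len(case)
        + statistics.variance(control) / len(control))
    return mean_diff / standard_error


# Invented log2 expression values for one marker.
case_values = [8.1, 7.9, 8.3, 8.0]      # arrays marked as Case
control_values = [6.0, 6.2, 5.9, 6.1]   # remaining arrays (Control)
t_stat = welch_t(case_values, control_values)
```

A large positive value indicates higher expression in the Case arrays. Repeating the calculation for every marker and keeping the significant ones is what produces a returned marker set.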



Classifying a set


► Different groups of sets can be made, both for Markers and for Arrays. They may differ in membership or in how members are named (e.g. amount of detail).

► Here we show how several different groupings are defined in the example data file "webmatrix2_quantile_log2_dev1.2_mv0.exp".

► After loading this file into geWorkbench as type "Affymetrix File Matrix", four groups can be seen in the Arrays/Phenotypes group pulldown menu at right.


Using multiple array groups


If we choose the group called "Class", the sets of arrays at right are displayed:




If instead we choose the group "Cell Line", a different grouping of the same arrays is seen:




Working with Remote Data Sources

Lesson 5: Working with Remote Data Sources


geWorkbench can retrieve microarray data from certain remote data sources, primarily from instances of the NCI's caArray database.

The Open File dialog allows remote sources to be added to the list of those available either manually or through discovery using grid services.

Right-clicking on Project will bring up the Open File dialog.

Click the Remote radio button. The Open File dialog window will be expanded to include remote sources.

Entries (locations, parameters) for non-grid services can be edited.


The remote Open File dialog


After clicking Remote, five additional controls appear:

1. Remote source selector – choose from available Remote Resources.

2. Go button – accesses the Remote Source that you selected.

3. Query button – specifies search criteria for retrieving only a subset of available experiments.

4. Add A New Resource button – opens the Data Source Definition Page used to add Remote Data.

5. Edit button – edits Remote Source Parameters.


The remote Open File dialog



Click on the Go button next to the caArray data source at the bottom of the dialog. All available caArray experiments will be displayed.


Loading data from a remote instance of caArray


1. Here we depict the experiment ending in *36540. The number of derived bioassays, 4, is displayed, along with the experiment information.

2. To start retrieving the bioassays themselves, right-click on the experiment and press Get bioassays. This will download the list of available bioassays into geWorkbench.

Select an experiment that has bioassays


Selecting an experiment



To Retrieve Bioassay Data

Select the desired arrays and push the Open button. (You might want to first select just one, as each can take several minutes to download.)


Retrieving bioassay data


To Retrieve Bioassay Data Based on Search Criteria


Searching for specific types of bioassay data

1. Click on the Query button.

2. Select “Experiments” from the available search categories.

3. Select one or more fields (like “Tissue Type”, “Chip Platform”, etc) and enter a desired search value; some fields (like “Tissue Type”) assume values from a pick list while others accept free text.


To Retrieve Bioassay Data Based on Search Criteria


Searching for specific types of bioassay data (cont.)

1. Click on the Search button.

2. The list of available experiments is updated to include only those that meet all the search criteria specified by the user.
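
The "meets all the search criteria" behavior is a logical AND over the chosen fields. A minimal sketch of that logic follows. The field names come from the Query screen described above, while the experiment records, their values and the function name are invented, and exact matching is an assumption.

```python
def filter_experiments(experiments, criteria):
    """Keep experiments whose fields match every criterion (logical AND)."""
    return [exp for exp in experiments
            if all(exp.get(field) == value
                   for field, value in criteria.items())]


# Invented experiment records for illustration.
experiments = [
    {"name": "exp_1", "Tissue Type": "heart", "Chip Platform": "HG_U95Av2"},
    {"name": "exp_2", "Tissue Type": "heart", "Chip Platform": "other"},
    {"name": "exp_3", "Tissue Type": "liver", "Chip Platform": "HG_U95Av2"},
]
matches = filter_experiments(
    experiments, {"Tissue Type": "heart", "Chip Platform": "HG_U95Av2"})
unfiltered = filter_experiments(experiments, {})
```

Adding a criterion can only shrink the list, and an empty set of criteria returns every experiment.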


To Remove the Search Filter and Retrieve All Bioassay Data Again


Searching for specific types of bioassay data (cont.)


1. Bring up the Query screen again.

2. Click on the Clear All button to clear all search fields.

3. Click on the Search button.

4. The full list of experiments is displayed again.



To add a remote source

1. Click on the Add A New Resource button.

2. Fill in the Data Source definition page. URL and Short Name are required fields.

3. Click on the OK button. The configuration is set up to automatically reflect your additional Data Source.


To modify a remote source

1. Click on the Edit button.

2. Make the changes that you need.

3. Click on the OK button.



Adding or modifying a remote source



Review

Part 1 covered the basics of the layout of geWorkbench, loading data, setting preferences, and the use of sets of arrays and markers to organize data.


1. Locate the different working areas of the application GUI.

2. Load microarray data from local and remote sources, and create a merged dataset for further analysis.

3. Use search criteria to filter the set of experiments retrieved from a remote data source.

4. Set the display preferences.

5. Use sets of arrays and markers to organize data for analysis and to convey results from one tool to another.


Part 2: Data Manipulation

Data Manipulation



Objectives

The objective of Part 2 is to learn a few of the basic techniques available in geWorkbench for microarray data normalization and filtering. This section will also cover the manual and automatic annotation of datasets.


1. Normalize a microarray dataset.

2. Filter out unwanted data points, such as low quality or missing data.

3. Use and create new dataset annotations.


Lesson 6: Normalization

Lesson 7: Filtering

Lesson 8: Experiment Annotations

Part 2: Data Manipulation:

Lesson outline


Normalization

Lesson 6: Normalization


Normalization is used to reduce the effects of systematic variations between arrays, such as variations in hybridization, scanning, sample concentration etc. The aim is to make the data from different chips more comparable.

geWorkbench supports a number of basic types of normalization. In this section, two will be described: Housekeeping Gene normalization, and Quantile normalization.


Overview


► Housekeeping genes are those thought to express at a relatively constant level.

► They can be used to provide a reference point against which to normalize.

► Using multiple housekeeping genes can lower the effect if one or more of them is actually varying with the experimental conditions. geWorkbench uses the average expression of all selected housekeeping genes as the normalization factor.

► To perform a housekeeping gene normalization:

♦ Load or select a dataset, such as the merged_cardio set created earlier.

♦ For this example, first perform a log2 normalization on the dataset. This will reduce the dominance of the more highly expressed genes, and the result will be similar to performing a geometric mean normalization.

♦ In the Housekeeping Gene normalization component, the Load button allows a predefined list of genes to be loaded. The supplied file “housekeeping_marker_list.csv” is a list of 26 such genes applicable to the Affymetrix HG_U95Av2 chip type.


Housekeeping gene normalization


► Performing the normalization

♦ Loaded genes can be moved to and from the active list using the arrow buttons. Here all 26 have been chosen, but you would likely select just a few, perhaps based on experiments using other techniques (1).

♦ Press the Normalize button. The current dataset will be normalized.
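
The idea behind housekeeping gene normalization can be sketched as follows. The sketch works on log2 data and subtracts, for each array, the average log2 value of the selected housekeeping markers, which is equivalent to dividing the raw values by their geometric mean. Whether geWorkbench subtracts or divides, and on which scale, is not stated in these slides, so treat this as an assumption. Marker names and values are invented.

```python
def housekeeping_normalize(log_matrix, housekeeping):
    """Subtract, per array, the mean log2 value of the housekeeping markers.

    `log_matrix` maps marker -> list of log2 values, one per array.
    """
    n_arrays = len(next(iter(log_matrix.values())))
    # One normalization factor per array.
    factors = [
        sum(log_matrix[g][j] for g in housekeeping) / len(housekeeping)
        for j in range(n_arrays)
    ]
    return {marker: [v - factors[j] for j, v in enumerate(row)]
            for marker, row in log_matrix.items()}


# Invented log2 data: the second array is brighter by one unit overall.
log_data = {
    "hk1": [10.0, 11.0],
    "hk2": [12.0, 13.0],
    "gene_x": [8.0, 9.0],
}
normalized = housekeeping_normalize(log_data, ["hk1", "hk2"])
```

After normalization gene_x has the same value on both arrays, because the overall brightness difference seen in the housekeeping markers has been removed.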


Housekeeping gene normalization


► Quantile normalization is used to make the expression profile of each array the same. It is the relative position of each gene in a list ordered by expression value that now varies on each array.

► The assumption is that the real expression profile on each array is quite similar.

► Quantile normalization at the Affymetrix probe level is a feature of the advanced analysis technique called RMA. Quantile normalization in geWorkbench is applied at the gene (probeset) level.

► To perform a Quantile normalization:

♦ Load or select a dataset, such as the merged_cardio set created earlier.

♦ Go to the Normalization component in the Analysis area.


Quantile normalization


► To perform a Quantile normalization (cont.):

♦ Choose an Averaging method for handling missing values.

♦ Mean profile marker – average for marker across all arrays.

♦ Mean microarray value – average for array across all markers.

♦ Push the Normalize button. The current dataset will be normalized.
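
For readers who want to see the mechanics, below is a compact sketch of the general quantile normalization procedure. It is a textbook version written for illustration, not the geWorkbench code. It assumes a complete matrix, so the averaging methods for missing values are not covered, and tied values are ranked in order of appearance rather than averaged.

```python
def quantile_normalize(matrix):
    """Quantile-normalize rows (markers) by columns (arrays)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # Reference distribution: mean of the k-th smallest value over arrays.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[k] for col in sorted_cols) / n_cols
                 for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Marker indices ordered from lowest to highest value on array j.
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result


# Invented data: four markers (rows) on three arrays (columns).
data = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(data)
```

After the call every array contains exactly the same set of values. Only the position of each marker within that shared distribution differs from array to array.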


Quantile normalization


► (1) Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Vandesompele et al. Genome Biology 2002, 3(7)


References


Filtering

Lesson 7: Filtering


Filtering is used to remove data from datasets. The data may be removed due to being of low quality, of low interest (unvarying), or may have been flagged by another program as being absent or unreliable.

Lesson 7: Filtering

Overview


► GenePix is a software platform used for analyzing spotted two-color arrays. It produces its own file format with the suffix .gpr.

► The file can include flags on individual data points, indicating e.g. bad or missing data.

► geWorkbench can filter out these flagged data points.

► To perform GenePix flags filtering:

♦ Load a GenePix format file, such as 21161 neu10.gpr. This is included in the geWorkbench data directory.

♦ In the Analysis area, go to the Filtering component and select GenePix Flags Filter.

♦ The list of available flags is presented. Choose a flag such as “bad” to filter on by checking its box.

♦ Push Filter.

Lesson 7: Filtering

GenePix Flags filtering


► Filtered-out values are colored yellow in the Tabular Microarray Viewer, indicating they are now classified as Missing Values in geWorkbench.

► Such values can be removed entirely from the dataset through use of the Missing Values Filter (not shown).
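
The effect of flag filtering can be pictured with a short sketch: values carrying a selected flag are replaced by a missing-value marker while all other values are left alone. The flag labels, the use of None for a missing value and the sample numbers are assumptions made for this illustration.

```python
def flag_filter(values, flags, flags_to_remove):
    """Replace values whose flag is selected with None (a missing value)."""
    return [None if flag in flags_to_remove else value
            for value, flag in zip(values, flags)]


# Invented spot values and flag labels for one array.
values = [1.5, 0.2, 3.1, 0.9]
flags = ["good", "bad", "good", "absent"]
filtered = flag_filter(values, flags, {"bad"})
missing_count = sum(v is None for v in filtered)
```

Only the spot flagged "bad" becomes missing here; the "absent" spot is kept because that flag was not selected.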

Lesson 7: Filtering

GenePix Flags filtering


Experiment Annotations

Lesson 8: Experiment Annotations


► Three components provide for automatic and manual annotation of datasets.

♦ Dataset Annotation – allows the user to type in comments on a dataset.

♦ Dataset History - automatically records data transformation steps.

♦ Experiment Info – information about the makeup of the dataset, e.g. the files that were merged to create it.

► Shown on the next slide are annotations for the dataset used in the Housekeeping Gene normalization example.

► A text file can also be read in to the Dataset Annotation component using the Load Custom Data Annotations button.


Three annotation components


The three annotation components

♦ Dataset Annotation (text entered by hand)

♦ Dataset History

♦ Experiment Info


Three annotation components



Review

In Part 2 we covered microarray data normalization and filtering. We also saw how geWorkbench keeps a record of each data transformation, and how annotations can be added to an experimental dataset by hand or from a file.


1. Normalize a microarray dataset using tools such as Housekeeping Genes Normalization and Quantile normalization.

2. Filter unwanted data points out, for example flagged points from a GenePix datafile.

3. View dataset annotations created automatically by geWorkbench when a dataset is transformed.

4. Create new dataset annotations by hand.


Part 3: Analysis and Display

Objectives

The objective of Part 3 is to introduce some of the major tools for microarray data analysis and display found in geWorkbench. The Scatter Plot and Expression Value Distribution (EVD) components are used to inspect microarray data, for example to evaluate data quality and the effectiveness of normalization and filtering. The Reverse Engineering component can be used to examine relationships between the expression pattern of a chosen gene and others in the dataset. Lists of genes which result from analysis steps can be evaluated through annotations and Pathway diagrams retrieved using the Marker Annotations component.


1. Use the Scatter Plot and Expression Value Distribution components to examine microarray datasets.

2. Run Reverse Engineering on a microarray dataset to find interactions with a chosen hub gene, and

3. Retrieve gene annotations and pathway diagrams using the Marker Annotations component and view them.



Analysis and Display


Lesson 9: The Scatter Plot component

Lesson 10: Expression Value Distribution

Lesson 11: Reverse Engineering

Lesson 12: Gene Annotation and Pathway Viewing

Lesson 13: Hierarchical Clustering

Lesson 14: Analysis of Variance


Lesson outline


The Scatter Plot component

Lesson 9: The Scatter Plot component


► The Scatter Plot examines the relationship between two datasets. Two types of comparisons can be made: one gene probe against a second on every chip (Marker option), or every gene probe against itself on two chips (Array option). Up to 6 graphs can be shown.

Two marker plots are shown here. The marker AFFX-BioC-5_at is on the x-axis while the markers AFFX-BioB-5_at and AFFX-BioC-3_at are on the y-axes.


Overview


1. You can use the dataset loaded in the previous example, or open the tutorial data file webmatrix2_quantile_log2_dev1.2_mv0.exp.

2. In the scatter plot component, select the Marker or Array tab to choose the type of comparison. The above picture used Marker.

3. Highlight a reference marker or array. The second and any following items selected will result in a graph being drawn, up to a limit of six.


Using the Scatter Plot


1. This tab switches between Marker/Marker and Array/Array plots.

2. Markers or arrays available for selection.

3. The first marker or array selected is placed on the x-axis and is highlighted in black. A different marker or array can be placed on the x-axis by right-clicking the marker/array name and choosing Put on X-Axis.

Basic Usage: The steps of basic usage are indicated with numbers in the screenshot

4. Subsequent selections of markers or arrays after a marker/array is on the x-axis results in the creation of a chart. Plotted markers/arrays are highlighted in grey. Clicking again on one of these markers/arrays results in the plot being removed.


Basic usage

87

5. Clicking the Rank Statistics Plot checkbox transforms the data for analysis. The x and y values are sorted and plotted according to their rank.

6. By default, a black reference line with slope 1 is displayed in each chart. This may be turned off with the Reference Line checkbox. Also, the slope of the line may be adjusted in the Slope textbox.

Basic Usage: continued

7. The Clear Charts button removes all charts and removes the x-axis selection. The Print button prints the charts after allowing the user to adjust the page setup and choose a printer. The Image Snapshot button captures the charts as an image and places it into the project underneath the current data set.
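The rank transformation described in step 5 can be sketched in a few lines of Python. This is an illustration only, not geWorkbench code: the function name `ranks` and the sample values are hypothetical, and giving tied values their average rank is an assumption, since the slides do not say how ties are handled.

```python
def ranks(values):
    """Return the 1-based rank of each value; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


# Hypothetical expression values for two markers across four arrays.
x = [7.1, 3.4, 9.8, 3.4]
y = [6.9, 2.0, 11.5, 4.1]

# Each array is plotted at (rank of x, rank of y) instead of (x, y).
rank_points = list(zip(ranks(x), ranks(y)))
```

Plotting ranks rather than raw values makes the chart insensitive to the scale of the measurements, so a monotonic relationship between two markers appears as points near the reference line.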


Basic usage

88

Each chart can be manipulated by right-clicking anywhere in the plot area. This brings up a menu that allows the chart to be individually saved as an image or printed, zoomed, and its visual properties adjusted.

Set Selections

Markers or microarrays that are members of active sets will be plotted with unique visual properties. These selections are managed for arrays and markers in the Phenotype and the Marker components, respectively. Consider an example where two sets are activated in the Phenotype component:


Options and sets

89

Here we compare the expression of two genes across all arrays. The two selected sets of arrays are colored blue and green. Because the “All Arrays” box is also checked, the remaining arrays are also displayed, in red:


Example plot, all arrays

90

The visual properties of a set of markers or arrays may be adjusted. From within the Array or Marker component, right-click a set and choose Change Visual Properties.

A dialog opens that allows the shape and color to be changed for that set. These properties are honored in the Scatter Plot as well as in other geWorkbench components.


Set options

91

Expression Value Distribution

Lesson 10: Expression Value Distribution

92

► The Expression Value Distribution component plots a histogram of binned expression values for all genes, or for a selected subset, on one or more arrays.

► A slider (at bottom) can be used to step between each array in the current dataset.

► A subset of markers within a given expression range can be selected using movable sliders (Select values from and Select values to) and added to a Marker Set using the Add to Set button.

► A t-test can be used to detect markers with significantly different expression. A Case set of arrays must be activated in the Arrays component (remaining arrays are by default Control).

► Image Snapshot saves an image of the graph to the Project Folders component.

► Mouse-over annotations can be activated by pressing the lightbulb.

► An array from the Housekeeping Genes Normalization example is displayed in the following picture:


Features

93

Normalized, log2 transformed data (Housekeeping Gene Normalizer example)


Example graph

94

Display options for the EVD diagram.

► Right-click on the EVD diagram to obtain the following list of display and manipulation options.


Display options

95

► The Arrays/Phenotypes component allows the dataset to be divided into sets of arrays, which can be named and classified (e.g. as Case/Control).

► Select a group (e.g. CCMP arrays), right-click, and select Add to Set.


Working with activated datasets

96

► The set CCMP is active. The “One color per array” checkbox is checked, so each array is shown in a different color.

► The base array, shown in red, is selected using the array slider.


Displaying an activated set

97

Results of a t-test on CCMP vs Normal arrays

► Now both the CCMP and Normal array sets are active. CCMP has been marked Case.

► The t-test button is active, showing the t-statistic distribution.
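As a rough illustration of what a per-marker t-statistic measures, the sketch below compares the expression of one marker in a Case set against a Control set. It is not geWorkbench code: the function name `welch_t` and the sample values are hypothetical, and the choice of Welch's unequal-variance form is an assumption, since the slides do not state which t-test variant the component uses.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(case, control):
    """Welch's t-statistic for one marker: difference in group means
    divided by the standard error of that difference."""
    standard_error = sqrt(variance(case) / len(case)
                          + variance(control) / len(control))
    return (mean(case) - mean(control)) / standard_error


# Hypothetical log2 expression values of one marker.
case_values = [2.0, 4.0, 6.0]      # arrays in the Case set
control_values = [1.0, 2.0, 3.0]   # remaining arrays, treated as Control

t_statistic = welch_t(case_values, control_values)
```

Repeating this calculation for every marker yields the distribution of t-statistics; markers far out in either tail are the candidates for differential expression.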


t-test

98

Reverse Engineering

Lesson 11: Reverse Engineering

99

► The primary use of the Reverse Engineering component is to infer regulatory interactions between genes and gene products.

► The Reverse Engineering component uses the information theory concept of mutual information to find these interactions.

♦ Mutual information here means the information that the expression pattern of one gene carries about the expression of another gene - it is a pairwise calculation.

♦ Mutual Information is in principle more sensitive and flexible than a simple correlation calculation.

♦ It is also invariant under certain data transformations, such as log transformations.
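The properties listed above can be made concrete with a small sketch. It is an illustration only: the equal-frequency binning estimator, the function names `discretize` and `mutual_information`, and the sample profiles are all assumptions, and the estimator geWorkbench actually uses is not described in these slides.

```python
from collections import Counter
from math import log, log2


def discretize(values, bins):
    """Assign each value to one of `bins` equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = rank * bins // len(values)
    return labels


def mutual_information(x, y, bins=3):
    """Estimate the mutual information (in bits) between two profiles."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # Sum p(a,b) * log2( p(a,b) / (p(a) * p(b)) ) over occupied cells.
    return sum((count / n) * log2(count * n / (px[a] * py[b]))
               for (a, b), count in joint.items())


# Hypothetical expression profiles of two genes across six arrays.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

mi_raw = mutual_information(gene_a, gene_b)
mi_log = mutual_information([log(v) for v in gene_a], gene_b)
```

Because this estimator depends only on the rank order of the values, `mi_raw` and `mi_log` are identical, which illustrates the invariance under log transformation mentioned above.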


Overview

100

Larger datasets, containing more arrays per marker, will yield greater sensitivity and better statistical support.

Full-scale runs of reverse engineering algorithms, which compare all markers against each other on datasets containing several hundred microarrays, are typically performed on large cluster computers and are not feasible on a desktop machine.


Overview

101

► As typically used in geWorkbench, the Reverse Engineering component calculates the Mutual Information score between a single hub gene and all other N markers in the dataset.

► In a second step, a subset containing the best M markers is chosen (with a current limit of 100), and a complete pairwise mutual information calculation is performed between them, covering M(M-1)/2 marker pairs (roughly MxM/2).

► The network resulting from this calculation can be displayed as a branched tree of interactions within the Cytoscape component.
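The two-step procedure can be sketched as follows. This is a simplified illustration, not the component's implementation: the function names, the sample profiles and the use of absolute Pearson correlation as a stand-in score are assumptions (the Profiler's default score is mutual information). The cutoff of 0.2 and the limit of 100 markers follow the values given in these slides.

```python
from itertools import combinations
from math import sqrt


def abs_pearson(x, y):
    """Absolute Pearson correlation, a stand-in interaction score.
    Constant profiles (zero variance) are not handled in this sketch."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return abs(cov / (norm_x * norm_y))


def profile_hub(hub, profiles, score=abs_pearson, cutoff=0.2, limit=100):
    """Step 1: score every marker against the hub and keep the best ones."""
    scored = [(score(profiles[hub], values), name)
              for name, values in profiles.items() if name != hub]
    best = sorted((pair for pair in scored if pair[0] > cutoff), reverse=True)
    return [name for _, name in best[:limit]]


def pairwise_scores(markers, profiles, score=abs_pearson):
    """Step 2: score each unordered pair once, M(M-1)/2 pairs in total."""
    return {(a, b): score(profiles[a], profiles[b])
            for a, b in combinations(markers, 2)}


# Hypothetical expression profiles across four arrays.
profiles = {
    "hub": [1.0, 2.0, 3.0, 4.0],
    "a": [2.0, 4.0, 6.0, 8.5],
    "b": [4.0, 3.0, 2.0, 1.0],
    "c": [1.0, -1.0, -1.0, 1.0],   # uncorrelated with the hub, so dropped
}
selected = profile_hub("hub", profiles)
network_scores = pairwise_scores(selected, profiles)
```

With the limit of M = 100 markers, the second step evaluates 100 x 99 / 2 = 4950 pairs, which is why it remains practical on a desktop machine while an all-against-all run over thousands of markers is not.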


Reverse Engineering in the context of geWorkbench

102

► A dataset containing multiple arrays (the more the better) should be loaded into geWorkbench. If data is loaded from separate files, it should be merged into a single microarray dataset. See the section Projects and Data Files.

► For this example we will load the tutorial dataset "webmatrix2_quantile_log2_dev1.2_mv0.exp".

♦ This contains a set of 100 experiments on Affymetrix HG_U95Av2 chips. This filtered dataset has been reduced to 2226 markers.


Prerequisites

103

1. In the upper right section of geWorkbench find the Reverse Engineering component. It should by default be displaying the Profiler tab.

2. In the Markers component search box, on the left side of the geWorkbench interface, enter 1973 and hit enter. This will find the marker 1973_s_at, which is the c-Myc gene, a well-known transcription factor with many interactions.

3. Click on this marker in the list. This will enter the marker into the Hub Gene Label field of the Profiler.


Profiler - selecting a hub gene

104

4. The default setting in the Profiler is Mutual Information (fast). With this selected, hit Analyze(2D). This will return a list of all markers having an MI score greater than the cutoff value (the default is 0.2).


Profiler – Analyze 2D

105

Options

► Pearson - Uses a Pearson correlation function to calculate the interaction scores.


Profiler - Options

106

5. After the Mutual Information algorithm has been run, an adjacency matrix will be placed in the Projects Folder:


Profiler – data output

107

► If a smaller network is desired, a set of markers can be highlighted in the list originally returned. Only this selected subset, up to 100 markers, will then be used if "Create Network" is pressed.

► By right-clicking and selecting "Add to Set", the selected group can also be added to the Markers component as a new set which can be used in other components (sequence retrieval, annotation retriever etc.).


Profiler – adding returned markers to a set

108

6. Hit the Create Network button.

a) A network will be displayed based on the top 100 markers interacting with c-Myc. As described above, the MI algorithm is run again on these M=100 markers, in order to measure interactions between each pair.

b) Each marker is then connected via an edge with the marker it most strongly interacts with, with the chosen hub-gene at the center.
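One way to picture step 6b is the sketch below, which links every non-hub marker to the partner it scores highest with. It is a guess at the idea rather than the actual geWorkbench algorithm: the function name `strongest_partner_edges`, the marker names and the scores are hypothetical, and the real component may place the hub and break ties differently.

```python
def strongest_partner_edges(scores, hub):
    """Connect each non-hub marker to the marker it interacts with most.

    `scores` maps frozenset({marker_a, marker_b}) to an interaction score
    and is assumed to contain every pair of markers.
    """
    markers = set().union(*scores)
    edges = set()
    for marker in sorted(markers - {hub}):
        others = sorted(markers - {marker})
        partner = max(others,
                      key=lambda other: scores[frozenset((marker, other))])
        edges.add(frozenset((marker, partner)))
    return edges


# Hypothetical pairwise scores between the hub and two other markers.
scores = {
    frozenset(("hub", "A")): 0.9,
    frozenset(("hub", "B")): 0.5,
    frozenset(("A", "B")): 0.7,
}
edges = strongest_partner_edges(scores, "hub")
```

Here marker A attaches directly to the hub, while marker B attaches to A because its score with A (0.7) exceeds its score with the hub (0.5), giving a branched tree rather than a star.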


Profiler – Create Network

109

7. The resulting network is displayed in the Cytoscape viewer.


Cytoscape viewer

110

8. The visualization in Cytoscape can be improved by going to the Layout menu, and choosing yFiles->organic:


Cytoscape viewer layout

111

9. Within the network created in Cytoscape, one can select the central gene, and then on the Cytoscape menu choose Select->Nodes->First Neighbors of selected nodes.


Cytoscape viewer layout

112

10. The first neighbors will be highlighted in the graph.

11. They are also added as a new set in the Markers component.


Cytoscape viewer – choosing genes

113

12. Return to the main Reverse Engineering component by clicking on the original dataset in the Project Folders component.

13. Select the first (highest MI score) marker on the list and the graph shown below is drawn in the Motif Location Histogram display. This shows a plot of the expression values on each array for the selected hub marker vs any other marker selected in the list.


Motif Location Histogram

114

Gene Annotation and Pathway Viewing

Lesson 12: Gene Annotation and Pathway Viewing

115

► The Marker Annotations component retrieves information for selected markers (genes) using caBIO.

♦ Links to CGAP annotation pages are listed under Gene.

♦ Links to BioCarta pathway diagrams are listed under Pathway.

♦ Clicking on the pathway links will display the SVG pathway diagrams in the caBIO Pathways viewer.


The Marker Annotation component

116

1. Load a set of markers from the tutorial data into the Markers component:

a) Press Load set

b) Locate the file “cluster_tree_total_pearsons_84_markers.csv” and load it.

c) Activate the set by checking the box in front of its entry. Here it has been renamed to “Cluster tree”.

2. In the Marker Annotations component, press Retrieve annotations.

3. Click on a Gene or Pathway link to view the annotations.


Marker Annotations - retrieving

117

► The list of markers can be sorted by Gene or by Pathway name by clicking on the column headings.


Marker Annotations - display

118

A CGAP annotation page displayed in a web browser window.


CGAP annotations

119

A BioCarta pathway displayed in the caBIO Pathways component.


BioCarta Pathway display

120

Hierarchical Clustering

Lesson 13: Hierarchical Clustering

121

► Hierarchical clustering can be used to identify trends in the data by grouping together genes and/or microarrays that share common expression patterns.

► Like many of the analyses available in geWorkbench, hierarchical clustering is carried out in 2 steps:

► Analysis setup and execution: the Analysis Panel is used to specify the parameter settings to be used when invoking the hierarchical clustering algorithm.

► Visualization of analysis results: the Dendrogram module is used to visualize the clusters generated by the analysis.

► The hierarchical clustering algorithm can be executed in 2 modes:

► Local: the computation takes place on the user’s machine (the same computer on which geWorkbench is running).

► Remote: using the caGrid infrastructure, the computation can be outsourced to any computer running a grid-enabled version of the hierarchical clustering code.


Overview

122


Set up the analysis parameters

• The Analysis Panel is located in the lower portion of the application’s user interface; locate and select the tab titled “Analysis”.

• Within the parameters portion of the interface you can specify the values for the 3 parameters applicable to the hierarchical clustering analysis: “Clustering Method”, “Clustering Dimension” and “Clustering Metric”. The values for these parameters will determine, respectively, (1) how clusters get agglomerated, (2) if the analysis should cluster markers, arrays or both, and (3) what distance metric to use for assessing similarity between clusters.

• Set the values of the parameters as shown above.

• From the list of available analyses, select the one titled Hierarchical Clustering.
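To show how the clustering method and metric parameters interact, here is a minimal agglomerative clustering sketch. It is not the geWorkbench implementation: the function and variable names are hypothetical, the three linkage options and the Euclidean default are assumed examples of a "Clustering Method" and "Clustering Metric", and the naive search over all cluster pairs is only suitable for small inputs. It requires Python 3.8 or later for `math.dist`.

```python
from itertools import combinations
from math import dist

# How the distance between two clusters is derived from member distances.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda distances: sum(distances) / len(distances),
}


def hierarchical_clustering(profiles, method="average", metric=dist):
    """Merge the two closest clusters repeatedly; return the merge history."""
    linkage = LINKAGE[method]
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        def cluster_distance(pair):
            a, b = pair
            return linkage([metric(profiles[p], profiles[q])
                            for p in a for q in b])

        a, b = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((a, b, cluster_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical marker profiles across two arrays.
profiles = {"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [5.0, 5.0]}
merges = hierarchical_clustering(profiles, method="average")
```

The merge history is the information a dendrogram draws: each entry joins two branches at a height equal to the recorded distance. Clustering arrays instead of markers amounts to running the same procedure on the transposed data, which is the role of the "Clustering Dimension" parameter.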

123


Select Local or Remote execution

• To select a compute server (the piece of software which will carry out the actual computations), select the Services tab.

• To specify local or remote execution click (respectively) on the Local or Grid radio button.

• If the Local option is selected, a locally running version of the hierarchical clustering code will be executed. To select among a list of available grid-enabled hierarchical clustering servers, select the radio button next to the Grid option (as shown above).

• geWorkbench will query a caGrid Index Service in order to find out which grid-enabled servers are available. The application comes pre-configured with a default Index Service address. To change this default, click on the “Change Index Service” link and enter the host URL and port for the new Index service to use.

124


Select Local or Remote execution (cont.)

• To retrieve the available hierarchical clustering services, click on the button titled “Grid Services”.

• Select the radio button next to the service that you would like to use. Details about the selected service (including its URL, the host institution, etc.) appear in the bottom portion of the interface. Your selection will be remembered the next time you use geWorkbench.

• The list of discovered services is displayed here.

• Return to the Parameters tab and click the Analyze button to initiate the clustering (using the compute server you designated).

125


Dendrogram

• Upon completion of the analysis a tree node representing the analysis results is created in the Project Folders pane. It appears as a child node under the microarray set that was clustered.

• By clicking on the results node, the resulting hierarchical cluster can be visualized as a dendrogram in the upper right part of the user interface. In this view, horizontal mosaic blocks correspond to markers and vertical blocks correspond to arrays.

126


Dendrogram (cont.)

• To select a marker cluster first check the “Enable Selection” box.

• The view can be adjusted to focus on a particular marker or array cluster.

• Mouse-over to highlight a marker cluster of interest and right-click.

• The dendrogram view is updated to include only the selected markers. Further, by right-clicking on the view and selecting “Add to set”, the selected markers can be grouped into a marker set.

127

ANOVA (Analysis of Variance)

Lesson 14: ANOVA

128


Overview

► The ANOVA (Analysis of Variance) algorithm is used to determine whether any significant difference exists in the means of independent groups of data. ANOVA is an extension of the t-test to more than two experimental conditions. In geWorkbench each group comprises gene expression microarray measurements from various samples, and one is interested in identifying genes whose mean expression is significantly different across the various groups.

► Currently, only one-way ANOVA is implemented. The user is initially required to enter the number of groups, following which a sample grouping panel similar to the t-test panel, with the appropriate number of groups, is created. Samples can be assigned to any group or excluded from the analysis. F-statistics are calculated for each gene, and a gene is considered significant if the p-value associated with its F-statistic is smaller than the user-specified alpha, or critical p-value.

► The compute code for the ANOVA analysis used in geWorkbench has been adapted from the ANOVA component in the MeV software from TIGR (http://www.tm4.org/mev.html).
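The F-statistic for a single gene in a one-way ANOVA can be sketched as below. This is the textbook formula, not the MeV-derived code used by geWorkbench; the function name `anova_f` and the sample values are hypothetical. Converting F into a p-value requires the F distribution with (k - 1, n - k) degrees of freedom, which is left out of this sketch.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-statistic: between-group variance over within-group variance."""
    k = len(groups)                         # number of groups
    n = sum(len(group) for group in groups) # total number of samples
    grand_mean = mean(value for group in groups for value in group)
    between = sum(len(group) * (mean(group) - grand_mean) ** 2
                  for group in groups)
    within = sum((value - mean(group)) ** 2
                 for group in groups for value in group)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical expression values of one gene in three groups of arrays.
gene_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_statistic = anova_f(gene_groups)
```

A large F means the group means differ by more than the scatter within the groups would explain, which is what marks a gene as a candidate for differential expression across conditions.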

129


Analysis Setup

• Load a microarray set in the Project Folders component.

• Use the Arrays/Phenotypes component to define 3 or more groups of arrays upon which ANOVA will operate (the groups need to be activated).

• In the Analysis component, select “Anova Analysis” among the available analyses.

• Enter the desired run parameters in the parameters panel.

• Click on the Analyze button to initiate the analysis.

130


ANOVA Results Display

Tabular Viewer: This Visual Area Component displays a read-only spreadsheet view of the significant genes sorted by p-value in ascending order (from most significant to least significant). In this view, the columns displayed can be altered in the preference window (click on the “Display Preferences” button), and the display can be sorted by the values in any column.

The table is exportable in .csv format.

The following columns can be displayed:

Marker Name: The name of the marker that is deemed significant according to the analysis.

F-Statistic: The value of the statistic calculated by the ANOVA test.

P-Value: The probability of observing an F-statistic at least this large under the null hypothesis.

For each group:

• Mean is the mean expression value of the marker in that group.

• Std is the standard deviation of the marker expression measurements in that group.

131


ANOVA Results Display (cont.)

Color Mosaic: In this view, a color spectrum is used to indicate the relative magnitudes of the measurements. The arrays (columns) are grouped by input group membership, i.e. set 1, set 2, etc. Each row corresponds to a marker, and markers are displayed in ascending order of p-value (from most significant to least significant).

Figure labels: Marker names, P-values

132


Review

Part 3 described several tools for the analysis and display of microarray data.

Having completed Part 3, you should be able to:

1. Use the Scatter Plot and Expression Value Distribution components to examine microarray datasets.

2. Run Reverse Engineering on a microarray dataset to find interactions with a chosen hub gene.

3. Retrieve gene annotations and pathway diagrams.

4. Run Hierarchical Clustering analysis (either remotely or locally) to discover trends in the data and visualize and interact with the resulting dendrogram.

5. Run an Analysis of Variance to discover genes differentially expressed in a collection of 3 or more experimental conditions.

133

Part 4: Workflow Execution

Objectives

• The objective of Part 4 is to demonstrate how to define and execute (in batch mode) analysis workflows involving the sequential invocation of multiple geWorkbench modules and caGrid analytical services. geWorkbench uses a specially developed script language, caScript, for coding workflows. The language itself will not be described in this presentation; if you are interested in caScript, the syntax and semantics of the language are explained in the Software Requirements and Specification document: http://cabigcvs.nci.nih.gov/viewcvs/viewcvs.cgi/caworkbenchcabig/Requirements/geWorkbench_cagrid_SRS_final.pdf


1. Load, edit and save caScript workflow files.

2. Execute caScript workflow files.

134


Workflow Execution

135

Lesson 15: caScript Editor


Lesson outline

136

The caScript Editor

Lesson 15: The caScript Editor

137

► In many settings it is desirable to be able to codify a sequence of data processing steps so that they can be re-executed at a future time.

► geWorkbench uses a scripting language, caScript, to allow users to express such analysis workflows. caScript provides direct access to geWorkbench module functionality that has been explicitly exposed for scripting. It also allows programmatic invocation of caGrid-enabled analytical services.

► The caScript Editor facilitates the authoring of workflows by:

► Providing an editor environment in which to compose scripts.

► Supporting loading and editing of script files.

► Listing all geWorkbench modules and their corresponding methods that have been exposed to caScript.

► Listing all available caGrid analytical services and their methods that can be invoked by caScript.


Overview

138

• The caScript Editor component is located in the bottom right portion of the interface. It can be accessed by selecting the tab titled caSCRIPT.

The Editor interface is divided into three main areas:

1. The editor window where scripts are edited. A script can be authored de novo or loaded from a file on disk by clicking on the Open File button. The contents of the editor can be saved to disk as a script file by clicking on the Save to File button.

2. The list of components that are available for invocation by caScript. This list contains all loaded geWorkbench modules (they appear under the tab “Local”) as well as available caGrid services (under the tab “Grid”).

3. The list of methods that each component/service exposes to caScript. For each method its name, input and output parameter types are displayed, to facilitate script authoring.


The Editor Environment


139

• The grid services accessible to caScript are discovered by querying a caGrid index service. The Grid tab of the component provides a space for entering the URL of this service.


The Editor Environment – Grid Services

• Clicking on the Discover button will retrieve all services registered with the specified Index service (in the screenshot below only one such service is found, a Hierarchical Clustering service).

140

• After editing is complete, the script can be executed by clicking on the “Execute” button.

• At any point during execution, a script can be stopped by clicking on the “Stop Execution” button.


Script Execution

• It is possible that a script involves the execution of methods which engage parts of the geWorkbench graphical user interface (e.g., when opening an Affymetrix gene expression file, you will be prompted to specify the location of the associated annotations file). In such cases script execution does not take place in a purely batch mode but rather requires interaction with the user.

• At present the caScript editor component is at a prototype stage; although it is fully functional in its ability to execute scripts, it has limitations that reduce its usability, such as:

• There are still many geWorkbench modules that have not exposed methods to caScript.

• There is no visual feedback to the user indicating the progress of an executing script.

141


Review

• Part 4 described how geWorkbench supports the authoring of workflows to facilitate the reproducible execution of data processing pipelines. caScript is the scripting language used for expressing such workflows.

Having completed Part 4, you should be able to:

1. Locate appropriate documentation that contains further details about the syntax and semantics of the caScript language.

2. Create scripts, save them to files on disk, and load them from files on disk.

3. Identify geWorkbench modules and caGrid services that are accessible to caScript, along with the precise methods that are available for invocation.

4. Start and cancel script execution.

142

For further information….

For further information about geWorkbench, including complete online tutorials, please see:

www.geworkbench.org
