
Post on 27-Dec-2015


1

geWorkbench Hands-On Training

Session Date:

Session Length:

Target Audience:

Trainer:

Developer Subject Matter Expert:

2

geWorkbench

geWorkbench is being developed at the Joint Centers for Systems Biology, Columbia University

This work is supported by the NCI caBIG and the NIH NCBC programs.

3

► This training is designed for a user who is new to geWorkbench.

► Target Audience: Researchers and students interested in microarray gene expression experiment analysis.

► The attendee is expected to have basic computer and biological knowledge.

► Note – this is not a complete introduction to all geWorkbench components. Its primary goal is to describe those features specified in the geWorkbench Year 1 SOW, and the context in which they can be used.

Session Details:

4

These slides are suitable for use in:

► Classroom Training

► Centra – Online Classroom

► Web-based Delivery

Session Details:

Overview of the Training Environment

5

► geWorkbench requires the Sun Java JRE 1.5 environment to be installed on your local machine.

► geWorkbench requires significant memory. At least 1 GB is recommended, especially if larger datasets are being read in or hierarchical clustering will be done.

► Windows, Linux and Mac/PowerPC versions of geWorkbench are available.

► See www.geworkbench.org for full details.

Session Details:

Hardware and Software

6

By the end of the training session, participants should:

► Have a basic understanding of the purpose and aims of geWorkbench.

► Be able to set program preferences and load microarray data from local and remote sources.

► Understand how data files are organized into Projects, and how subsets of data can be formed and used.

► Use filtering and normalization components to prepare data.

► Analyze and view data using a number of new components.

Session Details:

Session Goals

7

► Introduction

► Tutorial Data

► Part 1 – Data management

♦ Lesson 1: Basics of the graphical interface

♦ Lesson 2: Setting Preferences

♦ Lesson 3: Projects and Data Files

♦ Lesson 4: Working with Data Subsets

♦ Lesson 5: Working with Remote Sources

► Part 2 – Data manipulation

♦ Lesson 6: Normalization

♦ Lesson 7: Filtering

♦ Lesson 8: Experiment Annotations

► Part 3 – Analysis and display

♦ Lesson 9: The Scatter Plot component

♦ Lesson 10: Expression Value Distribution

♦ Lesson 11: Reverse Engineering

♦ Lesson 12: Gene Annotation and Pathway Viewing

Session Details:

Outline of lessons

8

Introduction

Introduction

9

► This section will describe in general the capabilities of geWorkbench in the following areas:

♦ Microarray analysis.

♦ Sequence analysis.

♦ Access to remote data and services.

► A complete description of geWorkbench and online tutorials are available at www.geworkbench.org.

Introduction:

Overview

10

geWorkbench – a platform for tool and data integration

► geWorkbench is an open-source bioinformatics platform that provides an extensive collection of tools for the management, analysis, visualization and annotation of biomedical data.

► geWorkbench has been designed with a plug-in framework. As new techniques are developed and implemented, they can be added to geWorkbench.

► geWorkbench aims to allow different tools to easily work together, such as using microarray analysis to obtain a list of interesting genes, and then retrieving their coding or upstream sequences and using these in BLAST, pattern discovery, or transcription factor binding motif searches.

Introduction:

Overview

11

► Obtaining data from local or remote data sources

► Filtering and normalization

► Basic statistical analysis

► Clustering (Hierarchical, SOM)

► Gene Ontology analysis

► Reverse Engineering

► Visualization using many common tools

♦ Scatter Plot

♦ Volcano Plot

♦ Expression Profiles

♦ Expression Value Distribution

♦ Color Mosaic

♦ Dendrogram

geWorkbench supports many kinds of operations on microarray data:

Introduction:

Microarray data

12

► BLAST

► Pattern Discovery

► Transcription Factor Mapping

► Syntenic Region Analysis

geWorkbench also provides capabilities for working with sequence data:

Introduction:

Sequence data

13

► There are many biomedical data sources and computational services available through the internet. geWorkbench strives to make remote data and services directly available on the desktop, integrated with its own local tools.

► External sources provide expression data, sequences and annotation:

♦ Microarray gene expression repositories (caArray)

♦ Gene annotation web pages (via CGAP)

♦ DNA Sequence retrieval (UC Santa Cruz)

♦ Pathway diagrams (BioCarta via caBIO database at NCI)

Introduction:

External data services

14

geWorkbench also provides a gateway to several computational services, including some hosted on Columbia servers and clusters.

► BLAST – search for sequences similar to a query sequence.

♦ Access is provided both to a Columbia server and the NCBI BLAST service.

► Pattern Discovery – find repeated patterns in a group of sequences.

► Synteny – compare regions of one chromosome against another.

► Through the caGRID project, additional remote services are being added:

♦ Hierarchical clustering – tree-like grouping by expression similarity.

♦ SOM (Self-Organizing Maps) – divide expression profiles into a limited number of bins.

♦ ARACNE – regulatory network reverse engineering.

Introduction:

External computational services

15

Tutorial Data

Tutorial Data

16

► In this section we describe the downloadable tutorial data files. This is primarily a reference section. Other files are included in the data directory of the program itself.

► The data can be downloaded from http://wiki.c2b2.columbia.edu/workbench/index.php/Download

► There are several file types

♦ Microarray

♦ Affymetrix MAS5/GCOS format files – a single file per array, as produced by Affymetrix software.

♦ The geWorkbench data matrix format, which merges all expression data from a set of experiments into a single file. By default it uses the ending “.exp”.

♦ Genepix two-color array experiments (in base download).

♦ Sequence

♦ DNA and protein sequence files in FASTA format.

Tutorial Data:

Overview

17

All data sets used in the tutorials are available from the download area of the geWorkbench website

(http://wiki.c2b2.columbia.edu/workbench/index.php/Download).

The file "tutorial_data.zip" contains the following files:

cardiogenomics.med.harvard.edu/ Contains 10 individual MAS5/GCOS format data files.

webmatrix_quantile_log2_dev1.2_mv0.exp A geWorkbench "exp" format matrix file containing filtered, normalized data. This data originally derives from the file "webmatrix2.exp".

NM_024426-Wilms.fasta A GenBank nucleotide sequence file.

NP_077744-Wilms.fasta A GenBank protein sequence file.

H1H5_HistoneDB_NHGRI.fasta Contains H1 and H5 histone sequences from the NHGRI.

cluster_tree_total_pearsons_84_markers.csv Contains a list of genes derived from hierarchical clustering.

64of84ClusterPearsonsSeqs.fasta Contains upstream DNA sequences derived from a subset of the above genes.

Tutorial Data:

Data files

18

The example MAS5 format data files were obtained from the following site at Harvard University: http://cardiogenomics.med.harvard.edu/project-detail?project_id=229

A number of MAS5 format data files are available there. The specific project is the "Belgium Dataset of Aortic Stenosis, Congestive Cardiomyopathy and Normal LV Function", and the data is downloadable from: http://cardiogenomics.med.harvard.edu/groups/proj1/pages/download_Hs-belgium.html

An abstract describing the study is also available at: http://cardiogenomics.med.harvard.edu/groups/proj2/pages/Hs-belgium_home.html

Tutorial Data:

About the Cardiogenomics Microarray Dataset

19

Generation of the "webmatrix_quantile_log2_dev1.2_mv0.exp" dataset.

The file "webmatrix2.exp", available in the Download area, contains results from 100 Affymetrix HG-U95Av2 chips containing B-cell samples from numerous different disease states. 12,600 probes are represented.

For use in these tutorials we normalized and filtered the data. The steps on the next page are just an example of how filtering and normalization can be used, and each dataset should be handled according to the type of analysis being undertaken and its goals.

Tutorial Data:

Generation of example microarray dataset

20

The dataset was created through the following steps:

1. Normalization: Quantile normalization.

2. Normalization: Log2 transformation.

3. Filtering: Deviation filter with Deviation bound of 1.2.

4. Filtering: Missing values filter with maximum number of missing arrays of 0.

The result of performing these steps is available as the file "webmatrix_quantile_log2_dev1.2_mv0.exp", found in the tutorial data file "tutorial_data.zip".
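The four steps above can be sketched in a few lines of plain Python. This is an illustration only, not the geWorkbench implementation: the function names are hypothetical, the deviation filter is assumed to drop markers whose standard deviation across arrays is at or below the bound, and missing values are represented as None. The normalization steps assume complete, positive data.

```python
import math
from statistics import mean, pstdev


def quantile_normalize(matrix):
    """Give every array (column) the same distribution of values.

    matrix is a list of rows (markers); each row holds one value per array.
    Ties are broken by row order, which is a simplification.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # For each column, the row indices in ascending order of value.
    order = [sorted(range(n_rows), key=lambda r: matrix[r][c]) for c in range(n_cols)]
    # Reference distribution: mean of the sorted columns, rank by rank.
    reference = [mean(matrix[order[c][k]][c] for c in range(n_cols)) for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        for k, r in enumerate(order[c]):
            result[r][c] = reference[k]
    return result


def log2_transform(matrix):
    """Replace every (positive) value by its base-2 logarithm."""
    return [[math.log2(v) for v in row] for row in matrix]


def deviation_filter(matrix, bound):
    """Keep markers whose standard deviation across arrays exceeds bound (assumed rule)."""
    return [row for row in matrix if pstdev(row) > bound]


def missing_values_filter(matrix, max_missing=0):
    """Keep markers that are missing (None) in at most max_missing arrays."""
    return [row for row in matrix if sum(v is None for v in row) <= max_missing]


def prepare(matrix, deviation_bound=1.2, max_missing=0):
    """Apply the four steps in the order listed on the slide."""
    normalized = log2_transform(quantile_normalize(matrix))
    return missing_values_filter(deviation_filter(normalized, deviation_bound), max_missing)


raw = [[5.0, 4.0], [2.0, 1.0], [3.0, 6.0]]   # 3 markers x 2 arrays, toy values
print(quantile_normalize(raw))
print(prepare(raw, deviation_bound=0.1))
```

After quantile normalization both toy arrays contain the same set of values; only the marker that holds each value differs between arrays.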

Tutorial Data:

Generation of example microarray dataset

21

Part 1: Data Management

Data Management

22

Part 1: Data Management

Objectives

The objective of Part 1 is to learn the basic operation of geWorkbench. This includes understanding the layout of the graphical interface in four main functional regions, and setting user preferences. The loading of local and remote data files will be demonstrated. Perhaps of most importance is understanding how geWorkbench allows data to be divided into subsets, both for setting up analyses and utilizing their results.

After completing Part 1, you should be able to:

1. Load microarray data into geWorkbench from local and remote sources, and set display preferences.

2. Understand how the data can be organized into projects and manipulated using sets.

23

Lesson 1: Basics of the graphical interface

Lesson 2: Setting Preferences

Lesson 3: Projects and Data Files

Lesson 4: Working with Data Subsets

Lesson 5: Working with Remote Sources

Part 1: Data Management

Lesson outline

24

Basics of the graphical interface.

Lesson 1: Basics of the graphical interface:

25

The graphical user interface for geWorkbench is divided into four major sections

1. Data management Workspace and Projects (upper left).

2. Marker and Array/Phenotype set selection and management (lower left).

3. Visualization tools (upper right).

4. Analytical tools (lower right).

Areas 2, 3 and 4 are defined for convenience. The actual placement of a given component into any of these three areas is controlled by a configuration file and can be customized as desired.

Lesson 1: Basics of the graphical interface

The four areas of the GUI

26

Menu bar

► The GUI provides a menu bar at top with a standard choice of commands.

► Many commands that are available in the menu bar are also available by right-clicking on data objects.

Data management area (area 1)

► Working with geWorkbench involves creating a project within the top-level Workspace.

► Opened data files and the results of analysis are stored within a Project.

► Multiple projects can be used within a workspace to organize data.

► A workspace and all the projects and data within it can be saved and later reloaded.

Lesson 1: Basics of the graphical interface

Menu bar and data management area

27

Set selection and management (area 2)

► geWorkbench allows sets of markers (gene probes) and of arrays/phenotypes to be defined and used. This allows the application to:

♦ analyze only a desired subset of the data

♦ Return lists of genes from one module which can then be used in another module, e.g. a list of genes returned by a t-test of differential expression can then be further investigated through sequence retrieval and analysis.

Lesson 1: Basics of the graphical interface

Set selection area

28

Visualization and Analysis tools (areas 3 and 4)

► To simplify the display area, only the visualization and analysis components relevant to the type of dataset currently selected in the Project Folders area (area 1) are displayed.

► Thus choosing a microarray dataset will result in a different set of tabs being displayed as compared with those seen when a nucleotide sequence file is selected.

► When a new data file is loaded, or an analysis produces a new data set, not only is it added to the Project area (area 1), but an appropriate viewer in the Visualization area (area 3) is automatically selected.

► A selection of visualization and analysis tools will be demonstrated in the following sections.

Lesson 1: Basics of the graphical interface

Visualization and analysis areas

29

Setting Preferences

Lesson 2: Setting Preferences

30

Preferences

► The Preferences selection in the Tools menu allows users to specify how certain aspects of the system will behave.

► Once the preferences are set, they are persistent between application sessions.

► From the main menu, click on Tools > Preferences.

Modifying Settings

Lesson 2: Setting Preferences

Modifying settings

31

Modifying Settings

► Text Editor: The editor selected will be used to open and inspect data sets loaded in a project. Notepad is the default setting.

► Visualization: The color scheme to be applied to color mosaic images.

♦ Absolute: (default) Values are scaled against the largest absolute value found in the dataset, with positive values red and negative green.

♦ Relative: Each marker is mean-variance normalized across all arrays. A red-blue color scheme is used, with red showing positive and blue negative values.

► Genepix Value Computation: Specifies how to compute the value displayed for a Genepix array. The default setting is Option 1 (Mean F635 - Mean B635) / (Mean F532 - Mean B532).
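The Option 1 formula above can be written out as a small function. This is a sketch only; the function name is hypothetical, and returning None when the background-subtracted 532 nm signal is not positive is an assumption, since the slides do not say how geWorkbench treats that case.

```python
def genepix_ratio(mean_f635, mean_b635, mean_f532, mean_b532):
    """(Mean F635 - Mean B635) / (Mean F532 - Mean B532) for one spot.

    Returns None when the 532 nm channel does not exceed its background,
    because the ratio is then undefined or not meaningful (assumed handling).
    """
    red = mean_f635 - mean_b635      # background-subtracted 635 nm signal
    green = mean_f532 - mean_b532    # background-subtracted 532 nm signal
    if green <= 0:
        return None
    return red / green


print(genepix_ratio(1200, 200, 700, 200))   # foreground/background values are made up
```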

Lesson 2: Setting Preferences

Modifying settings

32

► The relative display performs its own transformation on the data just for purposes of visualization. The underlying data is not changed.

► The relative selection for the Microarray Viewer preference will give odd-looking results if only a small number of arrays are loaded (e.g. 2). This is because with only two values, each point will be at a color extreme – either blue or red.

► Changing the Microarray Viewer relative/absolute preference will not take effect until the next time a data set is loaded.
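The difference between the two color schemes, and the two-array effect noted above, can be illustrated with a short sketch. The function names are hypothetical, and the population standard deviation is an assumption (geWorkbench may use a different estimator); with the sample standard deviation the two values would still be equal in magnitude and opposite in sign.

```python
from statistics import mean, pstdev


def absolute_scale(matrix):
    """Scale every value by the largest absolute value in the dataset (range -1..1)."""
    largest = max(abs(v) for row in matrix for v in row)
    if largest == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[v / largest for v in row] for row in matrix]


def relative_scale(matrix):
    """Mean-variance normalize each marker (row) across all arrays."""
    scaled = []
    for row in matrix:
        m, s = mean(row), pstdev(row)
        scaled.append([0.0 if s == 0 else (v - m) / s for v in row])
    return scaled


two_arrays = [[3.0, 9.0], [100.0, 101.0]]   # 2 markers x 2 arrays, toy values
print(absolute_scale(two_arrays))
print(relative_scale(two_arrays))
```

With only two arrays, every marker with differing values maps to one negative and one positive value of equal size, however small the original difference, which is why the relative display shows only the color extremes. The functions return new lists, mirroring the point that the underlying data is not changed.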

Lesson 2: Setting Preferences

Notes

33

Projects and Data Files

Lesson 3: Projects and Data Files

34

geWorkbench supports a number of data file formats, including:

For Microarrays:

► Affymetrix MAS5/GCOS text files.

► Affymetrix File Matrix - this is the native file type created by geWorkbench, and contains a data matrix from any number of experiments merged together.

► RMA Express File - RMA Express is a sophisticated tool for combining data from multiple Affymetrix chips. It is not a part of geWorkbench.

► Genepix Files – created by a popular analysis program for two color arrays.

For Sequence:

► FASTA Files. DNA or protein sequence files in FASTA format.

► Pattern Files – created by the Pattern Discovery component.

Lesson 3: Projects and Data Files

File types

35

In this example, we will load 10 individual Affymetrix MAS5 format files, merging them into a single dataset.

1. Create a Project. All data must belong to a project. Right-click on the Workspace entry in the Project Folders window at upper left to create a new project.

2. Next, right-click on the new Project entry and select Open Files.

Lesson 3: Projects and Data Files

Opening a file

36

3. Select file type Affymetrix MAS5/GCOS as shown.

4. Make sure to check the Merge files checkbox.

5. Select 10 MAS5 format text files from the tutorial data directory.

6. Click Open.

The chip type HG_U95Av2 is recognized...
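Conceptually, merging turns one file per array into a single marker-by-array matrix. The sketch below illustrates that idea only. It assumes a simplified two-column, tab-separated layout (probe ID, signal) rather than the real MAS5/GCOS text format, and the function and array names are hypothetical.

```python
import csv
import io


def read_array(handle):
    """Read one simplified per-array file: probe ID and signal, tab-separated."""
    return {row[0]: float(row[1]) for row in csv.reader(handle, delimiter="\t") if row}


def merge_arrays(named_arrays):
    """Merge {array_name: {probe: signal}} into (array_names, {probe: [signals]}).

    A probe absent from one array gets None in that column.
    """
    names = list(named_arrays)
    probes = sorted({p for values in named_arrays.values() for p in values})
    matrix = {p: [named_arrays[n].get(p) for n in names] for p in probes}
    return names, matrix


# In-memory stand-ins for two per-array files (values are made up).
file_a = io.StringIO("AFFX-BioB-5_at\t120.5\nAFFX-BioC-5_at\t340.0\n")
file_b = io.StringIO("AFFX-BioB-5_at\t98.0\n")

names, merged = merge_arrays({"array_A": read_array(file_a), "array_B": read_array(file_b)})
print(names)
print(merged)
```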

Lesson 3: Projects and Data Files

Loading and merging data

37

The merged dataset is listed in the Project folder. The data is displayed, in single array format, in the Microarray Viewer. Note we have increased the intensity slider to maximum here.

Lesson 3: Projects and Data Files

Viewing data

38

► The merged dataset can be given a shorter name.

♦ Right click on the merged dataset and select Rename.

♦ Enter a new dataset name, e.g. merged_cardio.

► The dataset can also be saved to disk for later reuse.

♦ Right-click on the merged dataset and select Save.

♦ Enter a filename.

Lesson 3: Projects and Data Files

Renaming and saving a merged dataset

39

Working with subsets of data

Lesson 4: Working with Subsets of Data

40

► geWorkbench makes extensive use of sets of markers (genes) or arrays.

► Sets can be defined by the user, or may be created as a result of an analysis.

► Sets of arrays can be used to distinguish between different experimental states, for example as part of a statistical analysis.

► The t-test requires two states be defined for comparison.

► Sets of markers are returned from various analysis routines. For example, the t-test returns a list of markers showing significant differential expression, and after hierarchical clustering, the markers in a subtree of the resulting dendrogram can be saved.

► geWorkbench supports groupings of sets. Each such group can contain different sets of markers or arrays.

Lesson 4: Working with Subsets of Data

Background

41

► How to create a set of arrays.

► How to mark a set of arrays as "Active".

► How to classify a set of arrays, e.g. as "case" vs. "control".

► How arrays can be grouped in different ways with descriptive tags.

In this tutorial you will learn

Lesson 4: Working with Subsets of Data

Overview

42

The first example here will use the same data files read in and merged in the previous lesson (Projects and Data Files).

The second example will use the tutorial file webmatrix_quantile_log2_dev1.2_mv0.exp

Lesson 4: Working with Subsets of Data

Preparation

43

First, we will select and label arrays which contain samples from the congestive cardiomyopathy disease state:

We will leave the arrays in the default group, however you can create a new group by pushing the New button on Array/Phenotype Sets at lower left.

1. In the Arrays/Phenotypes component, select the six arrays beginning with JB-ccmp, which represent the samples from the congestive cardiomyopathy disease state.

2. Right click, select Add to Set.


Lesson 4: Working with Subsets of Data

Assigning arrays to sets

44

3. Enter "CCMP" in the input box and click OK.

4. Next, similarly label the arrays beginning with JB-n as "Normal".

The Array/Phenotype Sets component will now show the two sets added:


Lesson 4: Working with Subsets of Data

Assigning arrays to sets

45

The boxes next to the set name can be checked to indicate that a set of arrays is "Active". Various analysis and visualization components can be set to only use/display activated arrays or markers.

Lesson 4: Working with Subsets of Data

Activating sets

Note – if no Marker sets are explicitly activated, then all Markers are implicitly active. The same applies to Arrays.
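The activation rule in the note can be expressed as a small helper. This is a sketch of the rule as stated, not geWorkbench code; the function name and the array names are hypothetical.

```python
def active_items(all_items, sets, active_names):
    """Return the markers or arrays a component should use, in dataset order.

    all_items    -- every marker (or array) in the dataset
    sets         -- mapping of set name to its members
    active_names -- names of the sets whose boxes are checked
    """
    if not active_names:
        return list(all_items)            # nothing activated: everything is active
    chosen = set()
    for name in active_names:
        chosen.update(sets[name])
    return [item for item in all_items if item in chosen]


arrays = ["a1", "a2", "a3", "a4"]                         # hypothetical array names
array_sets = {"CCMP": ["a1", "a2"], "Normal": ["a3"]}

print(active_items(arrays, array_sets, []))               # no set checked
print(active_items(arrays, array_sets, ["CCMP"]))         # one set checked
```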

46

For statistical tests such as the t-test, Case and Control groups can be specified.

1. Left-click on the thumb-tack icon in front of the phenotype name.

2. Select Case to specify the disease arrays as the "Case". The remaining "Normal" arrays are by default considered Control.


Lesson 4: Working with Subsets of Data

Classifying a set

47

3. A red thumbtack indicates an array set has been marked as "Case".
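To show why two classified sets are needed, here is a minimal sketch of a per-marker two-sample comparison. Welch's t statistic is used as a stand-in; the slides do not say which t-test variant geWorkbench applies, and the function names, array names and values are all hypothetical. No p-value is computed.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(case, control):
    """Welch's t statistic for one marker (needs at least 2 values per group)."""
    se = sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / se


def t_by_marker(data, arrays, case_set, control_set):
    """Split each marker's values by array set and compute the statistic."""
    case_idx = [i for i, a in enumerate(arrays) if a in case_set]
    control_idx = [i for i, a in enumerate(arrays) if a in control_set]
    return {
        marker: welch_t([values[i] for i in case_idx], [values[i] for i in control_idx])
        for marker, values in data.items()
    }


arrays = ["case_1", "case_2", "case_3", "ctrl_1", "ctrl_2", "ctrl_3"]
data = {"marker_X": [8.0, 9.0, 10.0, 4.0, 5.0, 6.0]}      # toy log2 values
print(t_by_marker(data, arrays, {"case_1", "case_2", "case_3"}, {"ctrl_1", "ctrl_2", "ctrl_3"}))
```

A marker whose values do not vary within either group would give a zero standard error; a real implementation would need to handle that case.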


Lesson 4: Working with Subsets of Data

Classifying a set

48

► Different groups of sets can be made, both for Markers and for Arrays. They may differ in membership or in how members are named (e.g. amount of detail).

► Here we show how several different groupings are defined in the example data file "webmatrix_quantile_log2_dev1.2_mv0.exp".

► After loading this file into geWorkbench as type "Affymetrix File Matrix", four groups can be seen in the Arrays/Phenotypes group pulldown menu at right.

Lesson 4: Working with Subsets of Data

Using multiple array groups

49

If we choose the group called "Class", the sets of arrays at right are displayed:

Lesson 4: Working with Subsets of Data

Using multiple array groups

50

If instead we choose the group "Cell Line", a different grouping of the same arrays is seen:

Lesson 4: Working with Subsets of Data

Using multiple array groups

51

Working with Remote Data Sources

Lesson 5: Working with Remote Data Sources

52

geWorkbench can retrieve microarray data from certain remote data sources, primarily from instances of the NCI's caArray database.

The Open File dialog allows remote sources to be added to the list of those available either manually or through discovery using grid services.

Right-clicking on Project will bring up the Open File dialog.

Click the Remote radio button. The Open File dialog window will be expanded to include remote sources.

Entries (locations, parameters) for non-grid services can be edited.

Lesson 5: Working with Remote Data Sources

The remote Open File dialog

53

After clicking Remote, four additional buttons appear:

1. Remote source selector – choose from available Remote Resources.

2. Go button - Accesses the Remote Source that you selected.

3. Add A New Resource button - Opens the Data Source Definition Page used to add Remote Data.

4. Edit button - Edits Remote Source Parameters.


Lesson 5: Working with Remote Data Sources

The remote Open File dialog

54

Click on the Go button next to the caArray data source at the bottom of the dialog. All available caArray experiments will be displayed.

Lesson 5: Working with Remote Data Sources

Loading data from a remote instance of caArray

55

1. Here we depict the experiment ending in *99049. The number of derived bioassays, 12, is displayed, along with the experiment information. (A new dataset, "Public Rembrandt" has 53 bioassays available).

2. To start retrieving the bioassays themselves, right-click on the experiment and press Get bioassays. This will download the list of available bioassays into geWorkbench.

Select an experiment that has bioassays


Lesson 5: Working with Remote Data Sources

Selecting an experiment

56

To Retrieve Bioassay Data

Select the desired arrays and push the Open button.

(You might want to first select just one, as each can take several minutes to download).

Lesson 5: Working with Remote Data Sources

Retrieving bioassay data

57

To add a remote source

1. Click on the Add A New Resource button.

2. Fill in the Data Source definition page. URL and Short Name are required fields.

3. Click on the OK button. The configuration is set up to automatically reflect your additional Data Source.


To modify a remote source

1. Click on the Edit button.

2. Make the changes that you need.

3. Click on the OK button.

Lesson 5: Working with Remote Data Sources

Adding or modifying a remote source

58

Part 1: Data Management

Review

In Part 1, we covered

1. The basic layout of geWorkbench

2. Loading microarray data from local and remote sources, and creating a merged dataset.

3. Setting display preferences.

4. How geWorkbench uses sets of arrays and markers to organize data for analysis and to convey results from one tool to another.

59

Part 2: Data Manipulation

Data Manipulation

60

Part 2: Data Manipulation

Objectives

The objective of Part 2 is to learn a few of the basic techniques available in geWorkbench for microarray data normalization and filtering. This section will also cover the manual and automatic annotation of datasets.

After completing Part 2, you should be able to:

1. Normalize a microarray dataset.

2. Filter out unwanted data points, such as low quality or missing data.

3. Use and create new dataset annotations.

61

Lesson 6: Normalization

Lesson 7: Filtering

Lesson 8: Experiment Annotations

Part 2: Data Manipulation:

Lesson outline

62

Normalization

Lesson 6: Normalization

63

Normalization is used to reduce the effects of systematic variations between arrays, such as variations in hybridization, scanning, sample concentration etc. The aim is to make the data from different chips more comparable.

geWorkbench supports a number of basic types of normalization. In this section, two will be described: Housekeeping Gene normalization, and Quantile normalization.

Lesson 6: Normalization

Overview

64

► Housekeeping genes are those thought to express at a relatively constant rate.

► They can be used to provide a reference point against which to normalize.

► Using multiple housekeeping genes can lower the effect if one or more of them is actually varying with the experimental conditions. geWorkbench uses the average expression of all selected housekeeping genes as the normalization factor.

► To perform a housekeeping gene normalization:

♦ Load or select a dataset, such as the merged_cardio set created earlier.

♦ For this example, first perform a log2 normalization on the dataset. This will reduce the dominance of the more highly expressed genes.

♦ In the Housekeeping Gene normalization component, the Load button allows a predefined list of genes to be loaded. The supplied file “housekeeping_marker_list.csv” is a list of 26 such genes applicable to the Affymetrix HG_U95Av2 chip type.

Lesson 6: Normalization

Housekeeping Gene normalization

65

► Performing the normalization

♦ Loaded genes can be moved to and from the active list using the arrow buttons.

♦ Press the Normalize button. The current dataset will be normalized.
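The idea of a per-array normalization factor can be sketched as follows. The slides state only that the average expression of the selected housekeeping genes is the factor; applying it by division is an assumption here (subtraction would be a natural alternative for log-transformed data), and the function and marker names are hypothetical.

```python
from statistics import mean


def housekeeping_normalize(data, housekeeping):
    """Divide each array by the mean of its housekeeping markers (assumed rule).

    data         -- {marker: [value per array]}
    housekeeping -- marker IDs to average; IDs not in the dataset are ignored
    """
    present = [m for m in housekeeping if m in data]
    if not present:
        raise ValueError("no housekeeping markers found in the dataset")
    n_arrays = len(next(iter(data.values())))
    # One factor per array: the average of the housekeeping markers on that array.
    factors = [mean(data[m][a] for m in present) for a in range(n_arrays)]
    return {m: [v / f for v, f in zip(values, factors)] for m, values in data.items()}


data = {"hk_1": [2.0, 4.0], "hk_2": [4.0, 8.0], "gene_X": [6.0, 6.0]}   # toy values
print(housekeeping_normalize(data, ["hk_1", "hk_2"]))
```

In the toy data the second array reads twice as high on both housekeeping markers, so its values are scaled down by twice as much as those of the first array.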

Lesson 6: Normalization

Housekeeping Gene normalization

66

► Quantile normalization is used to make the expression profile of each array the same. It is the relative position of each gene in a list ordered by expression value that now varies on each array.

► The assumption is that the real expression profile on each array is quite similar.

► Quantile normalization at the Affymetrix probe level is a feature of the advanced analysis technique called RMA. Quantile normalization in geWorkbench is applied at the gene (probeset) level.

► To perform a Quantile normalization:

♦ Load or select a dataset, such as the merged_cardio set created earlier.

♦ Go to the Normalization component in the Analysis area.

Lesson 6: Normalization

Quantile normalization

67

► To perform a Quantile normalization (cont.):

♦ Choose an Averaging method for handling missing values.

♦ Mean profile marker – average for marker across all arrays.

♦ Mean microarray value – average for array across all markers.

♦ Push the Normalize button. The current dataset will be normalized.
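The two averaging methods for missing values differ only in the direction of the average. A minimal sketch, with missing values represented as None and a hypothetical function name:

```python
from statistics import mean


def fill_missing(matrix, method="marker"):
    """Replace None using the marker (row) mean or the array (column) mean.

    matrix is a list of rows (markers), one value per array.
    Raises statistics.StatisticsError if a whole row or column is missing.
    """
    if method not in ("marker", "array"):
        raise ValueError("method must be 'marker' or 'array'")
    filled = []
    for row in matrix:
        new_row = []
        for col, value in enumerate(row):
            if value is None:
                if method == "marker":
                    # Mean profile marker: average for this marker across all arrays.
                    pool = [v for v in row if v is not None]
                else:
                    # Mean microarray value: average for this array across all markers.
                    pool = [r[col] for r in matrix if r[col] is not None]
                value = mean(pool)
            new_row.append(value)
        filled.append(new_row)
    return filled


matrix = [[1.0, None, 3.0], [4.0, 6.0, 8.0]]   # toy values
print(fill_missing(matrix, "marker"))
print(fill_missing(matrix, "array"))
```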

Lesson 6: Normalization

Quantile normalization

68

Filtering

Lesson 7: Filtering

69

Filtering is used to remove data from datasets. The data may be removed due to being of low quality, of low interest (unvarying), or may have been flagged by another program as being absent or unreliable.

Lesson 7: Filtering

Overview

70

► GenePix is a software platform used for analyzing spotted two-color arrays. It produces its own file format with the suffix .gpr.

► The file can include flags on individual data points, indicating e.g. bad or missing data.

► geWorkbench can filter out these flagged data points.

► To perform GenePix flags filtering:

♦ Load a GenePix format file, such as 21161 neu10.gpr. This is included in the geWorkbench data directory.

♦ In the Analysis area, go to the Filtering component and select GenePix Flags Filter.

♦ The list of available flags is presented. Choose a flag such as “bad” to filter on by checking its box.

♦ Push Filter.

Lesson 7: Filtering

GenePix Flags filtering

71

► Filtered-out values are colored yellow in the Tabular Microarray Viewer, indicating they are now classified as Missing Values in geWorkbench.

► Such values can be removed entirely from the dataset through use of the Missing Values Filter (not shown).
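The effect of flag filtering can be sketched as turning flagged values into missing values. The numeric flag codes below are an assumption and should be checked against the GenePix documentation; the function name and data are hypothetical.

```python
# Assumed mapping from numeric flag codes in a .gpr file to flag names.
FLAG_NAMES = {-100: "bad", -75: "absent", -50: "not found", 0: "unflagged", 100: "good"}


def apply_flag_filter(values, flags, selected):
    """Mark values whose flag name is in selected as missing (None)."""
    return [
        None if FLAG_NAMES.get(flag) in selected else value
        for value, flag in zip(values, flags)
    ]


values = [1.8, 0.4, 2.2, 0.9]     # toy spot values
flags = [0, -100, 100, -75]

filtered = apply_flag_filter(values, flags, {"bad"})
print(filtered)
print(sum(v is None for v in filtered), "value(s) now missing")
```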

Lesson 7: Filtering

GenePix Flags filtering

72

Experiment Annotations

Lesson 8: Experiment Annotations

73

► Three components provide for automatic and manual annotation of datasets.

♦ Dataset Annotation – allows the user to type in comments on a dataset.

♦ Dataset History - automatically records data transformation steps.

♦ Experiment Info – information about the makeup of the dataset, e.g. the files that were merged to create it.

► Shown on the next slide are annotations for the dataset used in the Housekeeping Gene normalization example.

► A text file can also be read in to the Dataset Annotation component using the Load Custom Data Annotations button.

Lesson 8: Experiment Annotations

Three annotation components

74

The three annotation components

♦ Dataset Annotation (text entered by hand)

♦ Dataset History

♦ Experiment Info

Lesson 8: Experiment Annotations

Three annotation components

75

Part 2: Data Manipulation

Review

In Part 2 we covered microarray data normalization and filtering. We also saw how geWorkbench keeps a record of each data transformation, and how annotations can be added to an experimental dataset by hand or from a file.

After completing Part 2, you should be able to:

1. Normalize a microarray dataset using tools such as Housekeeping Genes Normalization and Quantile normalization.

2. Filter unwanted data points out, for example flagged points from a GenePix datafile.

3. View dataset annotations created automatically by geWorkbench when a dataset is transformed, and

4. create new dataset annotations by hand.

76

Part 3: Analysis and Display

Objectives

The objective of Part 3 is to introduce some of the major tools for microarray data analysis and display found in geWorkbench. The Scatter Plot and Expression Value Distribution (EVD) components are used to inspect microarray data, for example to evaluate data quality and the effectiveness of normalization and filtering. The Reverse Engineering component can be used to examine relationships between the expression pattern of a chosen gene and others in the dataset. Lists of genes which result from analysis steps can be evaluated through annotations and Pathway diagrams retrieved using the Marker Annotations component.

After completing Part 3, you should be able to:

1. Use the Scatter Plot and Expression Value Distribution components to examine microarray datasets.

2. Run Reverse Engineering on a microarray dataset to find interactions with a chosen hub gene, and

3. Retrieve gene annotations and pathway diagrams using the Marker Annotations component and view them.

77

Part 3: Analysis and Display

Analysis and Display

78

Lesson 9: The Scatter Plot component

Lesson 10: Expression Value Distribution

Lesson 11: Reverse Engineering

Lesson 12: Gene Annotation and Pathway Viewing

Part 3: Analysis and Display

Lesson outline

79

The Scatter Plot component

Lesson 9: The Scatter Plot component

80

► The Scatter Plot examines the relationship between two datasets. Two types of comparisons can be made: one gene probe against a second on every chip (Marker option), or every gene probe against itself on two chips (Array option). Up to 6 graphs can be shown.

Two marker plots are shown here. The marker AFFX-BioC-5_at is on the x-axis while the markers AFFX-BioB-5_at and AFFX-BioC-3_at are on the y-axes.

Lesson 9: The Scatter Plot component

Overview

81

1. You can use the dataset loaded in the previous example, or open the tutorial data file webmatrix_quantile_log2_dev1.2_mv0.exp.

2. In the scatter plot component, select the Marker or Array tab to choose the type of comparison. The above picture used Marker.

3. Highlight a reference marker or array. The second and any following items selected will result in a graph being drawn, up to a limit of six.
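The two comparison types differ in which slice of the expression matrix supplies the points. A minimal sketch, with hypothetical function names, array names and values:

```python
def marker_plot_points(data, x_marker, y_marker):
    """Marker option: one point per array, (value of x_marker, value of y_marker)."""
    return list(zip(data[x_marker], data[y_marker]))


def array_plot_points(data, arrays, x_array, y_array):
    """Array option: one point per marker, (value on x_array, value on y_array)."""
    xi, yi = arrays.index(x_array), arrays.index(y_array)
    return [(values[xi], values[yi]) for values in data.values()]


arrays = ["array_A", "array_B", "array_C"]
data = {                                       # {marker: [value per array]}, toy values
    "AFFX-BioC-5_at": [7.1, 7.4, 6.9],
    "AFFX-BioB-5_at": [6.0, 6.2, 5.8],
}

print(marker_plot_points(data, "AFFX-BioC-5_at", "AFFX-BioB-5_at"))   # 3 points
print(array_plot_points(data, arrays, "array_A", "array_B"))          # 2 points
```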

Lesson 9: The Scatter Plot component

Using the Scatter Plot

82

1. This tab switches between Marker/Marker and Array/Array plots.

2. Markers or arrays available for selection.

3. The first marker or array selected is placed on the x-axis and is highlighted in black. A different marker or array can be placed on the x-axis by right-clicking the marker/array name and choosing Put on X-Axis.

Basic Usage: The steps of basic usage are indicated with numbers in the screenshot

4. Once a marker/array is on the x-axis, each subsequent selection of a marker or array creates a chart. Plotted markers/arrays are highlighted in grey. Clicking again on one of these markers/arrays removes its plot.

Lesson 9: The Scatter Plot component

Basic usage

83

5. Clicking the Rank Statistics Plot checkbox transforms the data for analysis. The x and y values are sorted and plotted according to their rank.

6. By default, a black reference line with slope 1 is displayed in each chart. This may be turned off with the Reference Line checkbox. Also, the slope of the line may be adjusted in the Slope textbox.

Basic Usage: continued

7. The Clear Charts button removes all charts and clears the x-axis selection. The Print button prints the charts after allowing the user to adjust the page setup and choose a printer. The Image Snapshot button captures the charts as an image and places it into the project underneath the current data set.
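The rank transformation behind the Rank Statistics Plot option can be sketched as follows. This is only one plausible reading of the description in step 5: it assumes that the x and y values are ranked independently and that tied values share their average rank. How geWorkbench actually handles ties is not stated in these slides, and the data values are invented.

```python
def ranks(values):
    """Return the 1-based rank of each value (ties share the average rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

# Hypothetical expression values for two markers across four arrays.
x = [8.1, 9.4, 7.7, 9.4]
y = [120.0, 950.0, 60.0, 400.0]

# Each plotted point becomes (rank of x value, rank of y value).
rank_points = list(zip(ranks(x), ranks(y)))
print(rank_points)
```

Plotting ranks instead of raw values makes the chart insensitive to differences in scale between the two axes.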

Lesson 9: The Scatter Plot component

Basic usage

84

Each chart can be manipulated by right-clicking anywhere in the plot area. This brings up a menu that allows the chart to be individually saved as an image, printed, or zoomed, and its visual properties adjusted.

Set Selections

Markers or microarrays that are members of active sets will be plotted with unique visual properties. These selections are managed for arrays and markers in the Phenotype and the Marker components, respectively. Consider an example where two sets are activated in the Phenotype component:

Lesson 9: The Scatter Plot component

Options and sets

85

Here we compare the expression of two genes across all arrays. The two selected sets of arrays are colored blue and green. Because the “All Arrays” box is also checked, the remaining arrays are also displayed, in red:

Lesson 9: The Scatter Plot component

Example plot, all arrays

86

The visual properties of a set of markers or arrays may be adjusted. From within the Array or Marker component, right-click a set and choose Change Visual Properties.

A dialog opens that allows the shape and color to be changed for that set. These properties are honored in the Scatter Plot as well as in other geWorkbench components.

Lesson 9: The Scatter Plot component

Set options

87

Expression Value Distribution

Lesson 10: Expression Value Distribution

88

► The Expression Value Distribution component plots a histogram of binned expression values, for either all genes or a selected subset, on one or more arrays.

► A slider (at bottom) can be used to step between each array in the current dataset.

► A subset of markers within a given expression range can be selected using movable sliders (Select values from and Select values to) and added to a Marker Set using the Add to Set button.

► A T-Test can be used to detect markers with significantly different expression. A Case set of arrays must be activated in the Arrays component (remaining arrays are by default Control).

► Image Snapshot saves an image of the graph to the Project Folders component.

► Mouse-over annotations can be activated by pressing the lightbulb.

► An array from the Housekeeping Genes Normalization example is displayed in the following picture:

Lesson 10: Expression Value Distribution

Features

89

Normalized, log2 transformed data (Housekeeping Gene Normalizer example)
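The binning and range selection described under Features can be sketched as below. The bin width, the marker names and the values are assumptions chosen for illustration; the bin boundaries geWorkbench actually uses are not given in these slides.

```python
import math

def histogram(values, bin_width=1.0):
    """Count values per bin; each bin is keyed by its lower edge."""
    counts = {}
    for v in values:
        lower_edge = math.floor(v / bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))

def select_range(array_values, low, high):
    """Return the markers whose value lies within [low, high]."""
    return [m for m, v in array_values.items() if low <= v <= high]

# Hypothetical log2 expression values for one array.
one_array = {"m1": 6.2, "m2": 6.9, "m3": 7.4, "m4": 9.1, "m5": 9.8, "m6": 12.3}

print(histogram(one_array.values()))
print(select_range(one_array, 7.0, 10.0))
```

The second call mimics the two sliders: the markers it returns are the ones that would be added to a Marker Set.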

Lesson 10: Expression Value Distribution

Example graph

90

Display options for the EVD diagram.

► Right-click on the EVD diagram to obtain the following list of display and manipulation options.

Lesson 10: Expression Value Distribution

Display options

91

► The Arrays/Phenotypes component allows the dataset to be divided into sets of arrays, which can be named and classified (e.g. as Case/Control)

► Select a group (e.g. CCMP arrays), right-click, and select Add to Set.

Lesson 10: Expression Value Distribution

Working with activated datasets

92

► The set CCMP is active. The “One color per array” checkbox is checked, so each array is shown in a different color.

► The base array, shown in red, is selected using the array slider.

Lesson 10: Expression Value Distribution

Displaying an activated set

93

Results of a t-test on CCMP vs Normal arrays

► Now both the CCMP and Normal array sets are active. CCMP has been marked Case.

► The t-test button is active, showing the t-statistic distribution.
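For readers unfamiliar with the statistic being plotted, the sketch below computes a per-marker t-statistic between Case and Control arrays. The slides do not say which variant of the t-test geWorkbench uses; Welch's unequal-variance form is assumed here purely for illustration, and the marker names and values are invented.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(case, control):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / standard_error

# Hypothetical log2 values: marker -> (Case arrays, Control arrays).
data = {
    "marker_up":   ([9.0, 9.5, 10.0, 9.5], [7.0, 7.5, 6.5, 7.0]),
    "marker_flat": ([8.0, 8.4, 7.6, 8.0], [8.1, 7.9, 8.3, 7.7]),
}

t_statistics = {m: welch_t(case, control) for m, (case, control) in data.items()}
for marker, t in t_statistics.items():
    print(f"{marker}: t = {t:.2f}")
```

A histogram of such values over all markers gives a t-statistic distribution of the kind shown on this slide; markers far from zero are the candidates for differential expression.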

Lesson 10: Expression Value Distribution

t-test

94

Reverse Engineering

Lesson 11: Reverse Engineering

95

► The primary use of the Reverse Engineering component is to infer regulatory interactions between genes and gene products.

► The Reverse Engineering component uses the information theory concept of mutual information to find these interactions.

♦ Mutual information here means the information that the expression pattern of one gene carries about the expression of another gene - it is a pairwise calculation.

♦ Mutual Information is in principle more sensitive and flexible than a simple correlation calculation.

♦ It is also invariant under certain data transformations, such as log transformations.
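The points above can be illustrated with a small mutual information estimate. This is a minimal sketch and not the estimator used by geWorkbench, which these slides do not describe: it assumes a simple plug-in estimate over equal-frequency (rank-based) bins, and the expression values are invented. Because binning is done by rank, any strictly increasing transformation such as log2 leaves the result unchanged.

```python
import math
from collections import Counter

def rank_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins

def mutual_information(x, y, n_bins=3):
    """Plug-in estimate of mutual information (in bits) from binned data."""
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # Sum of p(a,b) * log2( p(a,b) / (p(a) * p(b)) ) over occupied cells.
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )

# Hypothetical expression values for two genes across nine arrays.
gene_a = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0]
gene_b = [3.0, 5.0, 4.0, 30.0, 20.0, 25.0, 90.0, 70.0, 80.0]

mi_raw = mutual_information(gene_a, gene_b)
mi_log = mutual_information([math.log2(v) for v in gene_a], gene_b)
print(mi_raw, mi_log)
```

With equal-width bins instead of rank-based bins, the estimate would generally change after a log transformation, so the choice of binning matters for the invariance property.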

Lesson 11: Reverse Engineering

Overview

96

Larger datasets, containing more arrays per marker, will yield greater sensitivity and better statistical support.

Full-scale runs of reverse engineering algorithms, which compare all markers against each other and typically use datasets containing several hundred microarrays, are usually performed on large cluster computers and are not feasible on a desktop machine.

Lesson 11: Reverse Engineering

Overview

97

► As typically used in geWorkbench, the Reverse Engineering component calculates the Mutual Information score between a single hub gene and all other N markers in the dataset.

► In a second step, a subset containing the best M markers is chosen (with a current limit of 100), and a complete pairwise mutual information calculation (about M×M/2 comparisons) is performed between them.

► The network resulting from this calculation can be displayed as a branched tree of interactions within the Cytoscape component.
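The two-step procedure described above can be sketched as follows. To keep the example short, the interaction score is the absolute Pearson correlation (the Profiler also offers a Pearson option) rather than mutual information; the function names, the `top_m` and `cutoff` parameters and the data are hypothetical and do not reflect geWorkbench internals.

```python
from itertools import combinations
from math import sqrt

def abs_pearson(x, y):
    """Absolute Pearson correlation, used here as a stand-in interaction score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return abs(cov / (sx * sy))

def hub_analysis(expression, hub, score, top_m, cutoff=0.0):
    """Step 1: score the hub against every other marker and keep the best top_m.
    Step 2: score every pair among the kept markers."""
    hub_scores = {
        m: score(expression[hub], values)
        for m, values in expression.items() if m != hub
    }
    kept = sorted(
        (m for m, s in hub_scores.items() if s > cutoff),
        key=hub_scores.get, reverse=True,
    )[:top_m]
    pair_scores = {
        (a, b): score(expression[a], expression[b])
        for a, b in combinations(kept, 2)
    }
    return hub_scores, kept, pair_scores

# Hypothetical expression profiles across five arrays.
expression = {
    "hub": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g1":  [2.1, 3.9, 6.2, 8.0, 9.9],
    "g2":  [5.0, 4.1, 2.9, 2.2, 1.0],
    "g3":  [3.0, 1.0, 4.0, 1.0, 3.0],
    "g4":  [1.0, 2.5, 2.0, 4.5, 4.0],
}

hub_scores, kept, pair_scores = hub_analysis(expression, "hub", abs_pearson, top_m=3)
print(kept)
print(len(pair_scores))
```

The first step costs N score evaluations and the second M(M-1)/2, which is why restricting the second step to a small M keeps the calculation feasible on a desktop machine.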

Lesson 11: Reverse Engineering

Reverse Engineering in the context of geWorkbench

98

► A dataset containing multiple arrays (the more the better) should be loaded into geWorkbench. If data is loaded from separate files, it should be merged into a single microarray dataset. See the section Projects and Data Files.

► For this example we will load the tutorial dataset "webmatrix_quantile_log2_dev1.2_mv0.exp".

♦ This contains a set of 100 experiments on Affymetrix HG_U95Av2 chips. This filtered dataset has been reduced to 2226 markers.

Lesson 11: Reverse Engineering

Prerequisites

99

1. In the upper right section of geWorkbench find the Reverse Engineering component. By default it should be displaying the Profiler tab.

2. In the Markers component search box, on the left side of the geWorkbench interface, enter 1973 and press Enter. This will find the marker 1973_s_at, which is the c-Myc gene, a well-known transcription factor with many interactions.

3. Click on this marker in the list. This will enter the marker into the Hub Gene Label field of the Profiler.

Lesson 11: Reverse Engineering

Profiler - selecting a hub gene

100

4. The default setting in the Profiler is Mutual Information (fast). With this selected, hit Analyze(2D). This will return a list of all markers having an MI score greater than the cutoff value (the default is 0.2).

Lesson 11: Reverse Engineering

Profiler – Analyze 2D

101

Options

► Pearson - Uses a Pearson correlation function to calculate the interaction scores.

Lesson 11: Reverse Engineering

Profiler - Options

102

5. After the Mutual Information algorithm has been run, an adjacency matrix will be placed in the Projects Folder:

Lesson 11: Reverse Engineering

Profiler – data output

103

► If a smaller network is desired, a set of markers can be highlighted in the list originally returned. Only this selected subset, up to 100 markers, will then be used if "Create Network" is pressed.

► By right-clicking and selecting "Add to Set", the selected group can also be added to the Markers component as a new set which can be used in other components (sequence retrieval, annotation retriever etc.).

Lesson 11: Reverse Engineering

Profiler – adding returned markers to a set

104

6. Hit the Create Network button.

a) A network will be displayed based on the top 100 markers interacting with c-Myc. As described above, the MI algorithm is run again on these M=100 markers, in order to measure interactions between each pair.

b) Each marker is then connected via an edge with the marker it most strongly interacts with, with the chosen hub-gene at the center.
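One way to read step b) is sketched below: given pairwise scores, every marker contributes a single edge to its highest-scoring partner. This is an interpretation of the slide text, not the documented geWorkbench algorithm, and the function name and scores are hypothetical.

```python
def strongest_partner_edges(scores):
    """For every marker, add an edge to the marker it scores highest with.

    scores maps an unordered pair (a, b) to a symmetric interaction score.
    """
    best = {}  # marker -> (score, partner)
    for (a, b), s in scores.items():
        for marker, partner in ((a, b), (b, a)):
            if marker not in best or s > best[marker][0]:
                best[marker] = (s, partner)
    # frozenset removes duplicate edges such as A-B and B-A.
    return {frozenset((m, p)) for m, (_, p) in best.items()}

# Hypothetical pairwise scores among a hub and three markers.
scores = {
    ("hub", "g1"): 0.9, ("hub", "g2"): 0.7, ("hub", "g3"): 0.3,
    ("g1", "g2"): 0.4, ("g1", "g3"): 0.5, ("g2", "g3"): 0.2,
}

edges = strongest_partner_edges(scores)
print(sorted(tuple(sorted(e)) for e in edges))
```

In this toy case g3 attaches to g1 rather than directly to the hub, which is how a branched tree of interactions can arise.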

Lesson 11: Reverse Engineering

Profiler – Create Network

105

7. The resulting network is displayed in the Cytoscape viewer.

Lesson 11: Reverse Engineering

Cytoscape viewer

106

8. The visualization in Cytoscape can be improved by going to the Layout menu, and choosing yFiles->organic:

Lesson 11: Reverse Engineering

Cytoscape viewer layout

107

9. Within the network created in Cytoscape, one can select the central gene, and then on the Cytoscape menu choose Select->Nodes->First Neighbors of selected nodes.

Lesson 11: Reverse Engineering

Cytoscape viewer layout

108

10. The first neighbors will be highlighted in the graph.

11. They are also added as a new set in the Markers component.

Lesson 11: Reverse Engineering

Cytoscape viewer – choosing genes

109

12. Return to the main Reverse Engineering component by clicking on the original dataset in the Project Folders component.

13. Select the first (highest MI score) marker on the list and the graph shown below is drawn in the Motif Location Histogram display. This shows a plot of the expression values on each array for the selected hub marker vs any other marker selected in the list.

Lesson 11: Reverse Engineering

Motif Location Histogram

110

Gene Annotation and Pathway Viewing

Lesson 12: Gene Annotation and Pathway Viewing

111

► The Marker Annotations component retrieves information for selected markers (genes) using caBIO.

♦ Links to CGAP annotation pages are listed under Gene.

♦ Links to BioCarta pathway diagrams are listed under Pathway.

♦ Clicking on the pathway links will display the SVG pathway diagrams in the caBIO Pathways viewer.

Lesson 12: Gene Annotation and Pathway Viewing

The Marker Annotation component

112

1. Load a set of markers from the tutorial data into the Markers component:

a) Press Load set

b) Locate the file “cluster_tree_total_pearsons_84_markers.csv” and load it.

c) Activate the set by checking the box in front of its entry. Here it has been renamed to “Cluster tree”.

2. In the Marker Annotations component, press Retrieve annotations.

3. Click on a Gene or Pathway link to view the annotations.

Lesson 12: Gene Annotation and Pathway Viewing

Marker Annotations - retrieving

113

► The list of markers can be sorted by Gene or by Pathway name by clicking on the column headings.

Lesson 12: Gene Annotation and Pathway Viewing

Marker Annotations - display

114

A CGAP annotation page displayed in a web browser window.

Lesson 12: Gene Annotation and Pathway Viewing

CGAP annotations

115

A BioCarta pathway displayed in the caBIO Pathways component.

Lesson 12: Gene Annotation and Pathway Viewing

BioCarta Pathway display

116

Part 3: Analysis and Display

Review

Part 3 described several tools for the analysis and display of microarray data.

Having completed Part 3, you should be able to:

1. Use the Scatter Plot and Expression Value Distribution components to examine microarray datasets.

2. Run Reverse Engineering on a microarray dataset to find interactions with a chosen hub gene, and

3. Retrieve gene annotations and pathway diagrams.

117

For further information…

For further information about geWorkbench, including complete online tutorials, please see:

www.geworkbench.org
