1 geWorkbench Hands-On Training Session Date: Session Length: Target Audience: Trainer: Developer Subject Matter Expert :

Upload: brittney-owen

Post on 27-Dec-2015


geWorkbench Hands-On Training

Session Date:

Session Length:

Target Audience:

Trainer:

Developer Subject Matter Expert:


geWorkbench

geWorkbench is being developed at the Joint Centers for Systems Biology, Columbia University.

This work is supported by the NCI caBIG and the NIH NCBC programs.


► This training is designed for a user who is new to geWorkbench.

► Target Audience: Researchers and students interested in microarray gene expression experiment analysis.

► The attendee is expected to have basic computer and biological knowledge.

► Note – this is not a complete introduction to all geWorkbench components. Its primary goal is to describe those features specified in the geWorkbench Year 1 SOW, and the context in which they can be used.

Session Details:


These slides are suitable for use in:

► Classroom Training

► Centra – Online Classroom

► Web-based Delivery

Session Details:

Overview of the Training Environment


► geWorkbench requires the Sun Java JRE 1.5 environment to be installed on your local machine.

► geWorkbench requires significant memory. At least 1 GB is recommended, especially if larger datasets are being read in or hierarchical clustering will be done.

► Windows, Linux and Mac/PowerPC versions of geWorkbench are available.

► See www.geworkbench.org for full details.

Session Details:

Hardware and Software


By the end of the training session participants should:

► Have a basic understanding of the purpose and aims of geWorkbench.

► Be able to set program preferences and load microarray data from local and remote sources.

► Understand how data files are organized into Projects, and how subsets of data can be formed and used.

► Use filtering and normalization components to prepare data.

► Analyze and view data using a number of new components.

Session Details:

Session Goals


► Introduction

► Tutorial Data

► Part 1 – Data management

♦ Lesson 1: Basics of the graphical interface

♦ Lesson 2: Setting Preferences

♦ Lesson 3: Projects and Data Files

♦ Lesson 4: Working with Data Subsets

♦ Lesson 5: Working with Remote Sources

► Part 2 – Data manipulation

♦ Lesson 6: Normalization

♦ Lesson 7: Filtering

♦ Lesson 8: Experiment Annotations

► Part 3 – Analysis and display

♦ Lesson 9: The Scatter Plot component

♦ Lesson 10: Expression Value Distribution

♦ Lesson 11: Reverse Engineering

♦ Lesson 12: Gene Annotation and Pathway Viewing

Session Details:

Outline of lessons


Introduction


► This section will describe in general the capabilities of geWorkbench in the following areas:

♦ Microarray analysis.

♦ Sequence analysis.

♦ Access to remote data and services.

► A complete description of geWorkbench and online tutorials are available at www.geworkbench.org.

Introduction:

Overview


geWorkbench – a platform for tool and data integration

► geWorkbench is an open-source bioinformatics platform that provides an extensive collection of tools for the management, analysis, visualization and annotation of biomedical data.

► geWorkbench has been designed with a plug-in framework. As new techniques are developed and implemented, they can be added to geWorkbench.

► geWorkbench aims to allow different tools to easily work together, such as using microarray analysis to obtain a list of interesting genes, and then retrieving their coding or upstream sequences and using these in BLAST, pattern discovery, or transcription factor binding motif searches.

Introduction:

Overview


geWorkbench supports many kinds of operations on microarray data:

► Obtaining data from local or remote data sources

► Filtering and normalization

► Basic statistical analysis

► Clustering (Hierarchical, SOM)

► Gene Ontology analysis

► Reverse Engineering

► Visualization using many common tools

♦ Scatter Plot

♦ Volcano Plot

♦ Expression Profiles

♦ Expression Value Distribution

♦ Color Mosaic

♦ Dendrogram

Introduction:

Microarray data


geWorkbench also provides capabilities for working with sequence data:

► BLAST

► Pattern Discovery

► Transcription Factor Mapping

► Syntenic Region Analysis

Introduction:

Sequence data


► There are many biomedical data sources and computational services available through the internet. geWorkbench strives to make remote data and services directly available on the desktop, integrated with its own local tools.

► External sources provide expression data, sequences and annotation:

♦ Microarray gene expression repositories (caArray)

♦ Gene annotation web pages (via CGAP)

♦ DNA Sequence retrieval (UC Santa Cruz)

♦ Pathway diagrams (BioCarta via caBIO database at NCI)

Introduction:

External data services


geWorkbench also provides a gateway to several computational services, including some hosted on Columbia servers and clusters.

► BLAST – search for sequences similar to a query sequence.

♦ Access is provided both to a Columbia server and the NCBI BLAST service.

► Pattern Discovery – find repeated patterns in a group of sequences.

► Synteny – compare regions of one chromosome against another.

► Through the caGRID project, additional remote services are being added:

♦ Hierarchical clustering – tree-like grouping by expression similarity.

♦ SOM (Self-Organizing Maps) – divide expression profiles into a limited number of bins.

♦ ARACNE – regulatory network reverse engineering.

Introduction:

External computational services


Tutorial Data


► In this section we describe the downloadable tutorial data files. This is primarily a reference section. Other files are included in the data directory of the program itself.

► The data can be downloaded from http://wiki.c2b2.columbia.edu/workbench/index.php/Download

► There are several file types:

♦ Microarray

  ♦ Affymetrix MAS5/GCOS format files – a single file per array, as produced by Affymetrix software.

  ♦ The geWorkbench data matrix format, which merges all expression data from a set of experiments into a single file. By default it uses the ending “.exp”.

  ♦ Genepix two-color array experiments (in base download).

♦ Sequence

  ♦ DNA and protein sequence files in FASTA format.

Tutorial Data:

Overview


All data sets used in the tutorials are available from the download area of the geWorkbench website

(http://wiki.c2b2.columbia.edu/workbench/index.php/Download).

The file "tutorial_data.zip" contains the following files:

cardiogenomics.med.harvard.edu/ – Contains 10 individual MAS5/GCOS format data files.

webmatrix_quantile_log2_dev1.2_mv0.exp – A geWorkbench "exp" format matrix file containing filtered, normalized data. This data originally derives from the file "webmatrix2.exp".

NM_024426-Wilms.fasta – A GenBank nucleotide sequence file.

NP_077744-Wilms.fasta – A GenBank protein sequence file.

H1H5_HistoneDB_NHGRI.fasta – Contains H1 and H5 histone sequences from the NHGRI.

cluster_tree_total_pearsons_84_markers.csv – Contains a list of genes derived from hierarchical clustering.

64of84ClusterPearsonsSeqs.fasta – Contains upstream DNA sequences derived from a subset of the above genes.

Tutorial Data:

Data files


The example MAS5 format data files were obtained from the following site at Harvard University: http://cardiogenomics.med.harvard.edu/project-detail?project_id=229

A number of MAS5 format data files are available there. The specific project is the "Belgium Dataset of Aortic Stenosis, Congestive Cardiomyopathy and Normal LV Function", and the data is downloadable from: http://cardiogenomics.med.harvard.edu/groups/proj1/pages/download_Hs-belgium.html

An abstract describing the study is also available, at: http://cardiogenomics.med.harvard.edu/groups/proj2/pages/Hs-belgium_home.html

Tutorial Data:

About the Cardiogenomics Microarray Dataset


Generation of the "webmatrix_quantile_log2_dev1.2_mv0.exp" dataset.

The file "webmatrix2.exp", available in the Download area, contains results from 100 Affymetrix HG-U95Av2 chips containing B-cell samples from numerous different disease states. 12,600 probes are represented.

For use in these tutorials we normalized and filtered the data. The steps on the next page are just an example of how filtering and normalization can be used, and each dataset should be handled according to the type of analysis being undertaken and its goals.

Tutorial Data:

Generation of example microarray dataset


The dataset was created through the following steps:

1. Normalization: Quantile normalization.

2. Normalization: Log2 transformation.

3. Filtering: Deviation filter with Deviation bound of 1.2.

4. Filtering: Missing values filter with maximum number of missing arrays of 0.

The result of performing these steps is available as the file "webmatrix_quantile_log2_dev1.2_mv0.exp", found in the tutorial data file "tutorial_data.zip".
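The four steps above can be sketched in plain Python to show what each one does to a data matrix. This is only an illustration under stated assumptions, not geWorkbench's own code: the function names are hypothetical, ties in quantile normalization are broken by row order, and the deviation filter is assumed to drop markers whose standard deviation across arrays falls below the bound.

```python
import math

def quantile_normalize(matrix):
    """Give every array (column) the same distribution of values.

    matrix is a list of rows (markers); each row holds one value per array.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # For each array, list the row indices from smallest to largest value.
    order = [sorted(range(n_rows), key=lambda r: matrix[r][c]) for c in range(n_cols)]
    # Reference distribution: mean of the k-th smallest values across arrays.
    reference = [
        sum(matrix[order[c][k]][c] for c in range(n_cols)) / n_cols
        for k in range(n_rows)
    ]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        for k, r in enumerate(order[c]):
            result[r][c] = reference[k]
    return result

def log2_transform(matrix):
    """Replace every (positive) value with its base-2 logarithm."""
    return [[math.log2(v) for v in row] for row in matrix]

def deviation_filter(matrix, bound):
    """Keep markers whose standard deviation across arrays is at least bound
    (assumed meaning of the 'Deviation bound' parameter)."""
    kept = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        if sd >= bound:
            kept.append(row)
    return kept

def missing_values_filter(matrix, max_missing):
    """Keep markers with at most max_missing missing (None) values."""
    return [row for row in matrix if sum(v is None for v in row) <= max_missing]

# Toy data: three markers measured on two arrays, no missing values.
raw = [
    [1.0, 8.0],
    [4.0, 2.0],
    [16.0, 32.0],
]
step1 = quantile_normalize(raw)
step2 = log2_transform(step1)
step3 = deviation_filter(step2, bound=0.5)  # the tutorial dataset used 1.2
step4 = missing_values_filter(step3, max_missing=0)
```

In this toy run the third marker ends up with identical values on both arrays after normalization, so the deviation filter removes it; the missing values filter then leaves the remaining two markers unchanged.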

Tutorial Data:

Generation of example microarray dataset


Part 1: Data Management

Data Management


Part 1: Data Management

Objectives

The objective of Part 1 is to learn the basic operation of geWorkbench. This includes understanding the layout of the graphical interface in four main functional regions, and setting user preferences. The loading of local and remote data files will be demonstrated. Perhaps of most importance is understanding how geWorkbench allows data to be divided into subsets, both for setting up analyses and utilizing their results.

After completing Part 1, you should be able to:

1. Load microarray data into geWorkbench from local and remote sources, and set display preferences.

2. Understand how the data can be organized into projects and manipulated using sets.


Lesson 1: Basics of the graphical interface

Lesson 2: Setting Preferences

Lesson 3: Projects and Data Files

Lesson 4: Working with Data Subsets

Lesson 5: Working with Remote Sources

Part 1: Data Management

Lesson outline


Basics of the graphical interface.

Lesson 1: Basics of the graphical interface:


The graphical user interface for geWorkbench is divided into four major sections:

1. Data management Workspace and Projects (upper left).

2. Marker and Array/Phenotype set selection and management (lower left).

3. Visualization tools (upper right).

4. Analytical tools (lower right).

Areas 2, 3 and 4 are defined for convenience. The actual placement of a given component into any of these three areas is controlled by a configuration file and can be customized as desired.

Lesson 1: Basics of the graphical interface

The four areas of the GUI


Menu bar

► The GUI provides a menu bar at top with a standard choice of commands.

► Many commands that are available in the menu bar are also available by right-clicking on data objects.

Data management area (area 1)

► Working with geWorkbench involves creating a project within the top-level Workspace.

► Opened data files and the results of analysis are stored within a Project.

► Multiple projects can be used within a workspace to organize data.

► A workspace and all the projects and data within it can be saved and later reloaded.

Lesson 1: Basics of the graphical interface

Menu bar and data management area


Set selection and management (area 2)

► geWorkbench allows sets of markers (gene probes) and of arrays/phenotypes to be defined and used. This allows the application to:

♦ Analyze only a desired subset of the data.

♦ Return lists of genes from one module for use in another. For example, a list of genes returned by a t-test of differential expression can be investigated further through sequence retrieval and analysis.

Lesson 1: Basics of the graphical interface

Set selection area


Visualization and Analysis tools (areas 3 and 4)

► To simplify the display area, only the visualization and analysis components relevant to the type of dataset currently selected in the Project Folders area (area 1) are displayed.

► Thus choosing a microarray dataset will result in a different set of tabs being displayed as compared with those seen when a nucleotide sequence file is selected.

► When a new data file is loaded, or an analysis produces a new data set, not only is it added to the Project area (area 1), but an appropriate viewer in the Visualization area (area 3) is automatically selected.

► A selection of visualization and analysis tools will be demonstrated in the following sections.

Lesson 1: Basics of the graphical interface

Visualization and analysis areas


Setting Preferences

Lesson 2: Setting Preferences


Preferences

► The Preferences selection in the Tools menu allows users to specify how certain aspects of the system will behave.

► Once the preferences are set, they are persistent between application sessions.

► From the main menu, click on Tools > Preferences.

Modifying Settings

Lesson 2: Setting Preferences

Modifying settings


Modifying Settings

► Text Editor: The editor selected will be used to open and inspect data sets loaded in a project. Notepad is the default setting.

► Visualization: The color scheme to be applied to color mosaic images.

♦ Absolute: (default) Values are scaled against the largest absolute value found in the dataset, with positive values red and negative green.

♦ Relative: Each marker is mean-variance normalized across all arrays. A red-blue color scheme is used, with red showing positive and blue negative values.

► Genepix Value Computation: Specifies how to compute the value displayed for a Genepix array. The default setting is Option 1 (Mean F635 - Mean B635) / (Mean F532 - Mean B532).
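The default Genepix formula can be written out as a small function. This is a minimal sketch of the arithmetic only; the function name and the handling of a zero denominator are assumptions, since the slides do not say how geWorkbench treats that case.

```python
def genepix_ratio(mean_f635, mean_b635, mean_f532, mean_b532):
    """Option 1: (Mean F635 - Mean B635) / (Mean F532 - Mean B532).

    Each channel's mean background (B) is subtracted from its mean
    foreground (F) before the two channels are divided.
    """
    denominator = mean_f532 - mean_b532
    if denominator == 0:
        # Ratio is undefined; returning None is an illustrative choice.
        return None
    return (mean_f635 - mean_b635) / denominator

# Hypothetical spot: 635 nm channel 1200 over background 200,
# 532 nm channel 700 over background 200.
example_value = genepix_ratio(1200, 200, 700, 200)
```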

Lesson 2: Setting Preferences

Modifying settings


► The relative display performs its own transformation on the data just for purposes of visualization. The underlying data is not changed.

► The relative selection for the Microarray Viewer preference will give odd-looking results if only a small number of arrays are loaded (e.g. 2). This is because with only two values, each point will be at a color extreme – either blue or red.

► Changing the Microarray Viewer relative/absolute preference will not take effect until the next time a data set is loaded.
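The two-array effect described above follows directly from mean-variance normalization. The sketch below assumes the population standard deviation is used (the slides do not specify which form geWorkbench applies) and uses a hypothetical function name.

```python
import math

def mean_variance_normalize(values):
    """Center one marker's values on their mean and scale by their
    standard deviation (population form, an assumption)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        return [0.0] * len(values)  # constant marker: nothing to scale
    return [(v - mean) / sd for v in values]

# With only two arrays, any two distinct values map to the same pair of
# extremes, no matter how small the original difference was.
large_difference = mean_variance_normalize([5.0, 9.0])
small_difference = mean_variance_normalize([100.0, 101.0])
```

Both calls return one value at the negative extreme and one at the positive extreme, which is why every point is drawn fully blue or fully red when only two arrays are loaded.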

Lesson 2: Setting Preferences

Notes


Projects and Data Files

Lesson 3: Projects and Data Files


geWorkbench supports a number of data file formats, including:

For Microarrays:

► Affymetrix MAS5/GCOS text files.

► Affymetrix File Matrix - this is the native file type created by geWorkbench, and contains a data matrix from any number of experiments merged together.

► RMA Express File - RMA Express is a sophisticated tool for combining data from multiple Affymetrix chips. It is not a part of geWorkbench.

► Genepix Files – created by a popular analysis program for two color arrays.

For Sequence:

► FASTA Files. DNA or protein sequence files in FASTA format.

► Pattern Files – created by the Pattern Discovery component.

Lesson 3: Projects and Data Files

File types


In this example, we will load 10 individual Affymetrix MAS5 format files, merging them into a single dataset.

1. Create a Project. All data must belong to a project. Right-click on the Workspace entry in the Project Folders window at upper left to create a new project.

2. Next, right-click on the new Project entry and select Open Files.

Lesson 3: Projects and Data Files

Opening a file


3. Select file type Affymetrix MAS5/GCOS as shown.

4. Make sure to check the Merge files checkbox.

5. Select 10 MAS5 format text files from the tutorial data directory.

6. Click Open.

The chip type HG_U95Av2 is recognized...
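Conceptually, merging turns several one-array files into a single marker-by-array matrix. The sketch below shows that idea only. It assumes a simplified tab-separated layout (a header row, then probe ID and signal), which is not the full MAS5/GCOS format, and the array names and values are hypothetical.

```python
import csv
import io

def read_array(handle):
    """Read one per-array file into {probe_id: signal}.

    Assumes a header row followed by tab-separated probe ID and signal.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    return {row[0]: float(row[1]) for row in reader if row}

def merge_arrays(named_arrays):
    """Combine [(array_name, {probe: signal}), ...] into one matrix.

    Returns (array_names, {probe: [signal per array]}); None marks a
    probe that is absent from a given array.
    """
    names = [name for name, _ in named_arrays]
    probes = []
    seen = set()
    for _, values in named_arrays:
        for probe in values:
            if probe not in seen:
                seen.add(probe)
                probes.append(probe)
    matrix = {p: [values.get(p) for _, values in named_arrays] for p in probes}
    return names, matrix

# Two tiny in-memory "files" standing in for files on disk.
file_a = io.StringIO("Probe\tSignal\n1000_at\t120.5\n1001_at\t33.0\n")
file_b = io.StringIO("Probe\tSignal\n1000_at\t98.0\n1001_at\t41.5\n")
names, merged = merge_arrays([
    ("array_A", read_array(file_a)),
    ("array_B", read_array(file_b)),
])
```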

Lesson 3: Projects and Data Files

Loading and merging data


The merged dataset is listed in the Project folder. The data is displayed, in single array format, in the Microarray Viewer. Note that we have increased the intensity slider to maximum here.

Lesson 3: Projects and Data Files

Viewing data


► The merged dataset can be given a shorter name.

♦ Right-click on the merged dataset and select Rename.

♦ Enter a new dataset name, e.g. merged_cardio.

► The dataset can also be saved to disk for later reuse.

♦ Right-click on the merged dataset and select Save.

♦ Enter a filename.

Lesson 3: Projects and Data Files

Renaming and saving a merged dataset


Working with subsets of data

Lesson 4: Working with Subsets of Data


► geWorkbench makes extensive use of sets of markers (genes) or arrays.

► Sets can be defined by the user, or may be created as a result of an analysis.

► Sets of arrays can be used to distinguish between different experimental states, for example as part of a statistical analysis.

► The t-test requires two states be defined for comparison.

► Sets of markers are returned from various analysis routines. For example, the t-test returns a list of markers showing significant differential expression, and after hierarchical clustering, the markers in a subtree of the resulting dendrogram can be saved.

► geWorkbench supports groupings of sets. Each such group can contain different sets of markers or arrays.

Lesson 4: Working with Subsets of Data

Background


In this tutorial you will learn:

► How to create a set of arrays.

► How to mark a set of arrays as "Active".

► How to classify a set of arrays, e.g. as "case" vs. "control".

► How arrays can be grouped in different ways with descriptive tags.

Lesson 4: Working with Subsets of Data

Overview


The first example here will use the same data files read in and merged in the previous lesson (Projects and Data Files).

The second example will use the tutorial file webmatrix_quantile_log2_dev1.2_mv0.exp

Lesson 4: Working with Subsets of Data

Preparation


First, we will select and label arrays which contain samples from the congestive cardiomyopathy disease state:

We will leave the arrays in the default group; however, you can create a new group by clicking the New button on Array/Phenotype Sets at lower left.

1. In the Arrays/Phenotypes component, select the six arrays beginning with JB-ccmp, which represent the samples from the congestive cardiomyopathy disease state.

2. Right-click and select Add to Set.

Lesson 4: Working with Subsets of Data

Assigning arrays to sets


3. Enter "CCMP" in the input box and click OK.

4. Next, similarly label the arrays beginning with JB-n as "Normal".

The Array/Phenotype Sets component will now show the two sets added:

Lesson 4: Working with Subsets of Data

Assigning arrays to sets


The boxes next to the set name can be checked to indicate that a set of arrays is "Active". Various analysis and visualization components can be set to only use/display activated arrays or markers.

Lesson 4: Working with Subsets of Data

Activating sets

Note – if no Marker sets are explicitly activated, then all Markers are implicitly active. The same applies to Arrays.
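The activation rule in the note can be expressed in a few lines. This is a sketch of the logic as described here, with a hypothetical function name and hypothetical array names; it does not reflect geWorkbench's internal data structures.

```python
def active_items(all_items, sets, active_set_names):
    """Return the markers (or arrays) that an analysis should use.

    sets maps a set name to its members; active_set_names lists the sets
    whose boxes are checked. If nothing is checked, everything is active.
    """
    if not active_set_names:
        return list(all_items)
    chosen = set()
    for name in active_set_names:
        chosen.update(sets[name])
    # Preserve the original ordering of the dataset.
    return [item for item in all_items if item in chosen]

arrays = ["ccmp_1", "ccmp_2", "n_1", "n_2"]
array_sets = {"CCMP": ["ccmp_1", "ccmp_2"], "Normal": ["n_1", "n_2"]}

nothing_checked = active_items(arrays, array_sets, [])
only_ccmp = active_items(arrays, array_sets, ["CCMP"])
```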


For statistical tests such as the t-test, Case and Control groups can be specified.

1. Left-click on the thumb-tack icon in front of the phenotype name.

2. Select Case to specify the disease arrays as the "Case". The remaining "Normal" arrays are by default considered Control.


Lesson 4: Working with Subsets of Data

Classifying a set


3. A red thumbtack indicates an array set has been marked as "Case".
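To show why a t-test needs the Case and Control classification, the sketch below computes a t statistic for a single marker from its values in the two groups. It uses the Welch (unequal variance) form as an assumption; the slides do not state which variant geWorkbench applies, and the expression values are made up.

```python
import math
from statistics import mean, variance

def welch_t(case, control):
    """t statistic comparing one marker's values in Case vs. Control arrays.

    Uses sample variances and does not assume equal variance in the groups.
    """
    standard_error = math.sqrt(
        variance(case) / len(case) + variance(control) / len(control)
    )
    if standard_error == 0:
        return None  # no variation in either group; statistic undefined
    return (mean(case) - mean(control)) / standard_error

# Hypothetical log2 expression values for one marker.
case_values = [8.0, 9.0, 10.0]     # arrays in the set marked "Case"
control_values = [5.0, 6.0, 7.0]   # remaining arrays, treated as Control
t_value = welch_t(case_values, control_values)
```

A large positive or negative statistic indicates that the marker differs between the two classes; the test is repeated for every marker to produce the list of differentially expressed genes.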


Lesson 4: Working with Subsets of Data

Classifying a set


► Different groups of sets can be made, both for Markers and for Arrays. They may differ in membership or in how members are named (e.g. amount of detail).

► Here we show how several different groupings are defined in the example data file "webmatrix_quantile_log2_dev1.2_mv0.exp".

► After loading this file into geWorkbench as type "Affymetrix File Matrix", four groups can be seen in the Arrays/Phenotypes group pulldown menu at right.

Lesson 4: Working with Subsets of Data

Using multiple array groups


If we choose the group called "Class", the sets of arrays at right are displayed:

Lesson 4: Working with Subsets of Data

Using multiple array groups


If instead we choose the group "Cell Line", a different grouping of the same arrays is seen:

Lesson 4: Working with Subsets of Data

Using multiple array groups


Working with Remote Data Sources

Lesson 5: Working with Remote Data Sources


geWorkbench can retrieve microarray data from certain remote data sources, primarily from instances of the NCI's caArray database.

The Open File dialog allows remote sources to be added to the list of those available either manually or through discovery using grid services.

Right-clicking on Project will bring up the Open File dialog.

Click the Remote radio button. The Open File dialog window will be expanded to include remote sources.

Entries (locations, parameters) for non-grid services can be edited.

Lesson 5: Working with Remote Data Sources

The remote Open File dialog


After clicking Remote, four additional buttons appear:

1. Remote source selector – choose from available Remote Resources.

2. Go button - Accesses the Remote Source that you selected.

3. Add A New Resource button - Opens the Data Source Definition Page used to add Remote Data.

4. Edit button - Edits Remote Source Parameters.


Lesson 5: Working with Remote Data Sources

The remote Open File dialog


Click on the Go button next to the caArray data source at the bottom of the dialog. All available caArray experiments will be displayed.

Lesson 5: Working with Remote Data Sources

Loading data from a remote instance of caArray


1. Here we depict the experiment ending in *99049. The number of derived bioassays, 12, is displayed, along with the experiment information. (A new dataset, "Public Rembrandt", has 53 bioassays available.)

2. To start retrieving the bioassays themselves, right-click on the experiment and press Get bioassays. This will download the list of available bioassays into geWorkbench.

Select an experiment that has bioassays.

Lesson 5: Working with Remote Data Sources

Selecting an experiment


To Retrieve Bioassay Data

Select the desired arrays and push the Open button.

(You might want to first select just one, as each can take several minutes to download.)

Lesson 5: Working with Remote Data Sources

Retrieving bioassay data


To add a remote source

1. Click on the Add A New Resource button.

2. Fill in the Data Source definition page. URL and Short Name are required fields.

3. Click on the OK button. The configuration is set up to automatically reflect your additional Data Source.
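The required-field rule for a data source definition can be illustrated with a small validation sketch. Only the URL and Short Name requirements come from the slides; the class, its method, the host check, and the example address are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class RemoteSource:
    """Hypothetical stand-in for one entry on the Data Source definition page."""
    short_name: str
    url: str

    def validate(self):
        """Return a list of problems; an empty list means the entry is usable."""
        problems = []
        if not self.short_name.strip():
            problems.append("Short Name is required")
        if not self.url.strip():
            problems.append("URL is required")
        elif not urlparse(self.url).netloc:
            # Extra sanity check, not something the slides require.
            problems.append("URL has no host part")
        return problems

incomplete = RemoteSource(short_name="", url="").validate()
complete = RemoteSource(short_name="caArray test", url="http://example.org/caarray").validate()
```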


To modify a remote source

1. Click on the Edit button.

2. Make the changes that you need.

3. Click on the OK button.

Lesson 5: Working with Remote Data Sources

Adding or modifying a remote source


Part 1: Data Management

Review

In Part 1, we covered

1. The basic layout of geWorkbench

2. Loading microarray data from local and remote sources, and creating a merged dataset.

3. Setting display preferences.

4. How geWorkbench uses sets of arrays and markers to organize data for analysis and to convey results from one tool to another.


Part 2: Data Manipulation

Data Manipulation


Part 2: Data Manipulation

Objectives

The objective of Part 2 is to learn a few of the basic techniques available in geWorkbench for microarray data normalization and filtering. This section will also cover the manual and automatic annotation of datasets.

After completing Part 2, you should be able to:

1. Normalize a microarray dataset.

2. Filter out unwanted data points, such as low quality or missing data.

3. Use and create new dataset annotations.


Lesson 6: Normalization

Lesson 7: Filtering

Lesson 8: Experiment Annotations

Part 2: Data Manipulation:

Lesson outline


Normalization

Lesson 6: Normalization

Normalization is used to reduce the effects of systematic variations between arrays, such as variations in hybridization, scanning, sample concentration etc. The aim is to make the data from different chips more comparable.

geWorkbench supports a number of basic types of normalization. In this section, two will be described: Housekeeping Gene normalization, and Quantile normalization.

Lesson 6: Normalization

Overview

► Housekeeping genes are those thought to express at a relatively constant rate.

► They can be used to provide a reference point against which to normalize.

► Using multiple housekeeping genes reduces the impact if one or more of them actually varies with the experimental conditions. geWorkbench uses the average expression of all selected housekeeping genes as the normalization factor.

► To perform a housekeeping gene normalization:

♦ Load or select a dataset, such as the merged_cardio set created earlier.

♦ For this example, first perform a log2 normalization on the dataset. This will reduce the dominance of the more highly expressed genes.

♦ In the Housekeeping Gene normalization component, the Load button allows a predefined list of genes to be loaded. The supplied file “housekeeping_marker_list.csv” is a list of 26 such genes applicable to the Affymetrix HG_U95Av2 chip type.

Lesson 6: Normalization

Housekeeping Gene normalization

► Performing the normalization

♦ Loaded genes can be moved to and from the active list using the arrow buttons.

♦ Press the Normalize button. The current dataset will be normalized.

Lesson 6: Normalization

Housekeeping Gene normalization
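The idea behind the housekeeping gene steps above can be sketched in a few lines of Python. This is only an illustration, not geWorkbench's own code: the slides state that the average expression of the selected housekeeping genes is the normalization factor, but how that factor is applied (here, subtracted per array on the log2 scale and re-centered on the mean factor) is an assumption, as are the function and marker names.

```python
import math

def log2_transform(matrix):
    """Return log2 of every value. Rows are markers, columns are arrays."""
    return [[math.log2(v) for v in row] for row in matrix]

def housekeeping_normalize(matrix, markers, housekeeping):
    """Center each array on the mean of its housekeeping markers (log scale)."""
    rows = [row for name, row in zip(markers, matrix) if name in housekeeping]
    if not rows:
        raise ValueError("no housekeeping markers found in the dataset")
    n_arrays = len(matrix[0])
    # One factor per array: average expression of the selected housekeeping genes.
    factors = [sum(r[j] for r in rows) / len(rows) for j in range(n_arrays)]
    grand = sum(factors) / n_arrays
    # Assumption: subtracting on the log scale (equivalent to dividing raw values).
    return [[row[j] - factors[j] + grand for j in range(n_arrays)] for row in matrix]

# Hypothetical data: the second array is uniformly twice as bright as the first.
markers = ["geneA", "hk1", "hk2"]
raw = [[8.0, 16.0], [4.0, 8.0], [16.0, 32.0]]
normalized = housekeeping_normalize(log2_transform(raw), markers, {"hk1", "hk2"})
```

After normalization the two arrays in this toy dataset agree for every marker, because the only difference between them was a uniform brightness factor.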

► Quantile normalization gives every array the same distribution of expression values. What still differs from array to array is the relative position of each gene in a list ordered by expression value.

► The assumption is that the real expression profile on each array is quite similar.

► Quantile normalization at the Affymetrix probe level is a feature of the advanced analysis technique called RMA. Quantile normalization in geWorkbench is applied at the gene (probeset) level.

► To perform a Quantile normalization:

♦ Load or select a dataset, such as the merged_cardio set created earlier.

♦ Go to the Normalization component in the Analysis area.

Lesson 6: Normalization

Quantile normalization

► To perform a Quantile normalization (cont.):

♦ Choose an Averaging method for handling missing values.

♦ Mean profile marker – average for marker across all arrays.

♦ Mean microarray value – average for array across all markers.

♦ Push the Normalize button. The current dataset will be normalized.

Lesson 6: Normalization

Quantile normalization
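For readers who want to see what quantile normalization does to the numbers, here is a minimal Python sketch. It is not the geWorkbench implementation. Missing values are represented as None and filled with either the marker mean or the array mean, mirroring the two averaging options above; tie handling and the function names are assumptions made for illustration.

```python
def fill_missing(matrix, method="marker"):
    """Replace None with the marker (row) mean or the array (column) mean."""
    n_rows, n_cols = len(matrix), len(matrix[0])

    def mean(values):
        present = [v for v in values if v is not None]
        return sum(present) / len(present)  # fails if every value is missing

    row_means = [mean(row) for row in matrix]
    col_means = [mean([matrix[i][j] for i in range(n_rows)]) for j in range(n_cols)]
    return [
        [(row_means[i] if method == "marker" else col_means[j]) if v is None else v
         for j, v in enumerate(row)]
        for i, row in enumerate(matrix)
    ]

def quantile_normalize(matrix):
    """Give every array (column) the same distribution of values."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[i][j] for i in range(n_rows)] for j in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference profile: mean of the k-th smallest value across all arrays.
    reference = [sum(col[k] for col in sorted_cols) / n_cols for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]  # ties are broken by position
    return result

# Hypothetical 4-marker, 3-array dataset with one missing value.
data = [[5.0, 4.0, 3.0], [2.0, None, 4.0], [3.0, 4.0, 6.0], [4.0, 2.0, 8.0]]
filled = fill_missing(data, method="marker")
qn = quantile_normalize(filled)
```

Every column of `qn` contains exactly the same set of values; only the order of the markers within each column differs.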

Filtering

Lesson 7: Filtering

Filtering is used to remove data from datasets. The data may be removed due to being of low quality, of low interest (unvarying), or may have been flagged by another program as being absent or unreliable.

Lesson 7: Filtering

Overview

► GenePix is a software platform used for analyzing spotted two-color arrays. It produces its own file format with the suffix .gpr.

► The file can include flags on individual data points, indicating e.g. bad or missing data.

► geWorkbench can filter out these flagged data points.

► To perform GenePix flags filtering:

♦ Load a GenePix format file, such as 21161 neu10.gpr. This is included in the geWorkbench data directory.

♦ In the Analysis area, go to the Filtering component and select GenePix Flags Filter.

♦ The list of available flags is presented. Choose a flag such as “bad” to filter on by checking its box.

♦ Push Filter.

Lesson 7: Filtering

GenePix Flags filtering

► Filtered-out values are colored yellow in the Tabular Microarray Viewer, indicating they are now classified as Missing Values in geWorkbench.

► Such values can be removed entirely from the dataset through use of the Missing Values Filter (not shown).

Lesson 7: Filtering

GenePix Flags filtering
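The two-stage behavior described above (flagged points first become missing values, and a separate filter can then remove them) can be illustrated with a short Python sketch. The flag labels, function names, and the rule of dropping any marker with a missing value are assumptions for illustration; they do not reproduce the .gpr file format or geWorkbench's filter settings.

```python
def apply_flag_filter(values, flags, flagged):
    """Turn every value whose flag is in `flagged` into a missing value (None)."""
    return [
        [None if f in flagged else v for v, f in zip(value_row, flag_row)]
        for value_row, flag_row in zip(values, flags)
    ]

def drop_markers_with_missing(markers, values, max_missing=0):
    """Keep only markers with at most `max_missing` missing values."""
    kept = [(m, row) for m, row in zip(markers, values)
            if sum(v is None for v in row) <= max_missing]
    return [m for m, _ in kept], [row for _, row in kept]

# Hypothetical three-marker, two-array example.
markers = ["m1", "m2", "m3"]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flags = [["good", "good"], ["bad", "good"], ["good", "absent"]]

masked = apply_flag_filter(values, flags, flagged={"bad"})
kept_markers, kept_values = drop_markers_with_missing(markers, masked)
```

Only the "bad" flag was selected, so the "absent" point on m3 is left untouched and m3 survives the second step.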

Experiment Annotations

Lesson 8: Experiment Annotations

► Three components provide for automatic and manual annotation of datasets.

♦ Dataset Annotation – allows the user to type in comments on a dataset.

♦ Dataset History - automatically records data transformation steps.

♦ Experiment Info – information about the makeup of the dataset, e.g. the files that were merged to create it.

► Shown on the next slide are annotations for the dataset used in the Housekeeping Gene normalization example.

► A text file can also be read in to the Dataset Annotation component using the Load Custom Data Annotations button.

Lesson 8: Experiment Annotations

Three annotation components

The three annotation components

♦ Dataset Annotation (text entered by hand)

♦ Dataset History

♦ Experiment Info

Lesson 8: Experiment Annotations

Three annotation components

Part 2: Data Manipulation

Review

In Part 2 we covered microarray data normalization and filtering. We also saw how geWorkbench keeps a record of each data transformation, and how annotations can be added to an experimental dataset by hand or from a file.

After completing Part 2, you should be able to:

1. Normalize a microarray dataset using tools such as Housekeeping Genes Normalization and Quantile normalization.

2. Filter unwanted data points out, for example flagged points from a GenePix datafile.

3. View dataset annotations created automatically by geWorkbench when a dataset is transformed.

4. Create new dataset annotations by hand.

Part 3: Analysis and Display

Objectives

The objective of Part 3 is to introduce some of the major tools for microarray data analysis and display found in geWorkbench. The Scatter Plot and Expression Value Distribution (EVD) components are used to inspect microarray data, for example to evaluate data quality and the effectiveness of normalization and filtering. The Reverse Engineering component can be used to examine relationships between the expression pattern of a chosen gene and others in the dataset. Lists of genes which result from analysis steps can be evaluated through annotations and Pathway diagrams retrieved using the Marker Annotations component.

After completing Part 3, you should be able to:

1. Use the Scatter Plot and Expression Value Distribution components to examine microarray datasets.

2. Run Reverse Engineering on a microarray dataset to find interactions with a chosen hub gene, and

3. Retrieve gene annotations and pathway diagrams using the Marker Annotations component and view them.

Part 3: Analysis and Display

Analysis and Display

Lesson 9: The Scatter Plot component

Lesson 10: Expression Value Distribution

Lesson 11: Reverse Engineering

Lesson 12: Gene Annotation and Pathway Viewing

Part 3: Analysis and Display

Lesson outline

The Scatter Plot component

Lesson 9: The Scatter Plot component

► The Scatter Plot examines the relationship between two datasets. Two types of comparisons can be made: one gene probe against a second on every chip (Marker option), or every gene probe against itself on two chips (Array option). Up to 6 graphs can be shown.

Two marker plots are shown here. The marker AFFX-BioC-5_at is on the x-axis while the markers AFFX-BioB-5_at and AFFX-BioC-3_at are on the y-axes.

Lesson 9: The Scatter Plot component

Overview

1. You can use the dataset loaded in the previous example, or open the tutorial data file webmatrix_quantile_log2_dev1.2_mv0.exp.

2. In the scatter plot component, select the Marker or Array tab to choose the type of comparison. The above picture used Marker.

3. Highlight a reference marker or array. The second and any following items selected will result in a graph being drawn, up to a limit of six.

Lesson 9: The Scatter Plot component

Using the Scatter Plot

1. This tab switches between Marker/Marker and Array/Array plots.

2. Markers or arrays available for selection.

3. The first marker or array selected is placed on the x-axis and is highlighted in black. A different marker or array can be placed on the x-axis by right-clicking the marker/array name and choosing Put on X-Axis.

Basic Usage: The steps of basic usage are indicated with numbers in the screenshot

4. Each subsequent selection of a marker or array, once a marker/array is on the x-axis, results in the creation of a chart. Plotted markers/arrays are highlighted in grey. Clicking again on one of these markers/arrays removes its plot.

Lesson 9: The Scatter Plot component

Basic usage

5. Clicking the Rank Statistics Plot checkbox transforms the data for analysis. The x and y values are sorted and plotted according to their rank.

6. By default, a black reference line with slope 1 is displayed in each chart. This may be turned off with the Reference Line checkbox. Also, the slope of the line may be adjusted in the Slope textbox.

Basic Usage: continued

7. The Clear Charts button removes all charts and clears the x-axis selection. The Print button prints the charts after allowing the user to adjust the page setup and choose a printer. The Image Snapshot button captures the charts as an image and places it into the project underneath the current data set.

Lesson 9: The Scatter Plot component

Basic usage
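The Rank Statistics Plot option in step 5 can be pictured as replacing each value by its rank before plotting. The Python sketch below shows one plausible reading of that transformation (ranks computed separately for the x and y values, starting at 1, ties broken by position). It is an illustration under those assumptions, not the component's actual code.

```python
def ranks(values):
    """Return the 1-based rank of each value; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

# Hypothetical expression values for two markers across four arrays.
x = [0.5, 12.0, 3.1, 7.4]
y = [110.0, 95.0, 400.0, 120.0]

# Each array becomes one point: (rank on x-axis marker, rank on y-axis marker).
rank_points = list(zip(ranks(x), ranks(y)))
```

Plotting ranks instead of raw values removes the influence of scale and of a few very large values, which can make a monotonic relationship easier to see.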

Each chart can be manipulated by right-clicking anywhere in the plot area. This brings up a menu that allows the chart to be individually saved as an image or printed, zoomed, and its visual properties adjusted.

Set Selections

Markers or microarrays that are members of active sets will be plotted with unique visual properties. These selections are managed for arrays and markers in the Phenotype and the Marker components, respectively. Consider an example where the two sets are activated in the Phenotype component:

Lesson 9: The Scatter Plot component

Options and sets

Here we compare the expression of two genes across all arrays. The two selected sets of arrays are colored blue and green. Because the “All Arrays” box is also checked, the remaining arrays are also displayed, in red:

Lesson 9: The Scatter Plot component

Example plot, all arrays

The visual properties of a set of markers or arrays may be adjusted. From within the Array or Marker component, right click a set and choose Change Visual Properties.

A dialog opens that allows the shape and color to be changed for that set. These properties are honored in the Scatter Plot as well as in other geWorkbench components.

Lesson 9: The Scatter Plot component

Set options

Expression Value Distribution

Lesson 10: Expression Value Distribution

► The expression value distribution component plots a histogram of binned expression values for selected or all the genes on one or more arrays.

► A slider (at bottom) can be used to step between each array in the current dataset.

► A subset of markers within a given expression range can be selected using movable sliders (Select values from and Select values to) and added to a Marker Set using the Add to Set button.

► A T-Test can be used to detect markers with significantly different expression. A Case set of arrays must be activated in the Arrays component (remaining arrays are by default Control).

► Image Snapshot saves an image of the graph to the Project Folders component.

► Mouse-over annotations can be activated by pressing the lightbulb button.

► An array from the Housekeeping Genes Normalization example is displayed in the following picture:

Lesson 10: Expression Value Distribution

Features
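The two core operations listed above, binning expression values into a histogram and selecting the markers that fall inside a value range, can be sketched as follows. The bin count, the equal-width binning rule, and all names are assumptions chosen for illustration; the component's own binning may differ.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    if width == 0:  # all values identical: everything lands in the first bin
        counts[0] = len(values)
        return counts
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[k] += 1
    return counts

def select_in_range(markers, values, low, high):
    """Return the markers whose value lies between the two slider positions."""
    return [m for m, v in zip(markers, values) if low <= v <= high]

# Hypothetical expression values for six markers on one array.
markers = ["m1", "m2", "m3", "m4", "m5", "m6"]
values = [1.0, 2.0, 2.5, 7.0, 9.0, 10.0]

counts = histogram(values, n_bins=3)
selected = select_in_range(markers, values, low=2.0, high=7.0)
```

The list returned by `select_in_range` corresponds to what would be sent to a marker set with the Add to Set button.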

Normalized, log2 transformed data (Housekeeping Gene Normalizer example)

Lesson 10: Expression Value Distribution

Example graph

Display options for the EVD diagram.

► Right-click on the EVD diagram to obtain the following list of display and manipulation options.

Lesson 10: Expression Value Distribution

Display options

► The Arrays/Phenotypes component allows the dataset to be divided into sets of arrays, which can be named and classified (e.g. as Case/Control)

► Select a group (e.g. CCMP arrays), right-click, and select Add to Set.

Lesson 10: Expression Value Distribution

Working with activated datasets

► The set CCMP is active. The “One color per array” checkbox is checked, so each array is shown in a different color.

► The base array, shown in red, is selected using the array slider.

Lesson 10: Expression Value Distribution

Displaying an activated set

Results of a t-test on CCMP vs Normal arrays

► Now both the CCMP and Normal array sets are active. CCMP has been marked Case.

► The t-test button is active, showing the t-statistic distribution.

Lesson 10: Expression Value Distribution

t-test
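To make the Case versus Control comparison concrete, the sketch below computes a two-sample t-statistic for each marker. The slides do not say which variant of the t-test geWorkbench uses; the unequal-variance (Welch) form shown here is an assumption, as are the marker names and values.

```python
import math
import statistics

def t_statistic(case, control):
    """Welch two-sample t-statistic (assumed variant) for one marker."""
    mean_diff = statistics.mean(case) - statistics.mean(control)
    # Standard error from the sample variances of the two groups.
    se = math.sqrt(statistics.variance(case) / len(case)
                   + statistics.variance(control) / len(control))
    return mean_diff / se  # undefined if both groups have zero variance

# Hypothetical log2 expression values: marker -> (Case arrays, Control arrays).
dataset = {
    "marker_up": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    "marker_flat": ([2.0, 3.0, 4.0], [2.0, 3.0, 4.0]),
}

t_values = {name: t_statistic(case, control)
            for name, (case, control) in dataset.items()}
```

A histogram of `t_values` over all markers is the kind of t-statistic distribution the component displays; markers far from zero are candidates for differential expression.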

Reverse Engineering

Lesson 11: Reverse Engineering

► The primary use of the Reverse Engineering component is to infer regulatory interactions between genes and gene products.

► The Reverse Engineering component uses the information theory concept of mutual information to find these interactions.

♦ Mutual information here means the information that the expression pattern of one gene carries about the expression of another gene - it is a pairwise calculation.

♦ Mutual Information is in principle more sensitive and flexible than a simple correlation calculation.

♦ It is also invariant under certain data transformations, such as log transformations.

Lesson 11: Reverse Engineering

Overview
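A small Python sketch can make the pairwise mutual information calculation, and its invariance under a log transformation, more tangible. The estimator below bins each gene's values by rank into equal-sized bins and applies the standard formula I(X;Y) = Σ p(x,y) log[p(x,y) / (p(x) p(y))]. The binning scheme, the bin count, and the function names are assumptions; the slides do not describe the estimator geWorkbench actually uses.

```python
import math
from collections import Counter

def rank_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def mutual_information(x, y, n_bins=3):
    """Binned estimate of the mutual information (in nats) between x and y."""
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # (c/n) * log( (c/n) / ((px/n) * (py/n)) ) summed over occupied cells.
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

# Hypothetical profiles across nine arrays: y depends on x nonlinearly.
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0]
y = [v ** 2 for v in x]

mi_raw = mutual_information(x, y)
mi_log = mutual_information([math.log2(v) for v in x], y)
```

Because the bins depend only on the rank order of the values, a log transformation of either gene leaves the score unchanged, so `mi_raw` and `mi_log` are identical here.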

Larger datasets, containing more arrays per marker, will yield greater sensitivity and better statistical support.

Full-scale runs of reverse engineering algorithms, which compare all markers against each other and usually involve datasets containing several hundred microarrays, are typically performed on large cluster computers and are not feasible on a desktop machine.

Lesson 11: Reverse Engineering

Overview

► As typically used in geWorkbench, the Reverse Engineering component calculates the Mutual Information score between a single hub gene and all other N markers in the dataset.

► In a second step, a subset containing the best M markers is chosen (with a current limit of 100), and a complete pairwise mutual information calculation (roughly M×M/2 pairs) is performed between them.

► The network resulting from this calculation can be displayed as a branched tree of interactions within the Cytoscape component.

Lesson 11: Reverse Engineering

Reverse Engineering in the context of geWorkbench
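The two-step procedure above can be outlined in Python to show how the work is reduced. The sketch takes hub-versus-marker scores as given, keeps the best M above a cutoff, and enumerates the marker pairs that the second step would score. The function names and example scores are hypothetical; the cutoff of 0.2 and the limit of 100 are the defaults quoted elsewhere in these slides.

```python
from itertools import combinations

def select_top_markers(hub_scores, cutoff=0.2, limit=100):
    """Step 1 result: markers scoring above the cutoff, best first, at most `limit`."""
    passing = [(m, s) for m, s in hub_scores.items() if s > cutoff]
    passing.sort(key=lambda pair: pair[1], reverse=True)
    return [m for m, _ in passing[:limit]]

def pairs_to_score(markers):
    """Step 2 workload: every unordered pair of selected markers."""
    return list(combinations(markers, 2))

# Hypothetical scores between the hub gene and four other markers.
hub_scores = {"g1": 0.9, "g2": 0.15, "g3": 0.4, "g4": 0.25}
top = select_top_markers(hub_scores, limit=3)
pairs = pairs_to_score(top)

# With the full limit of M = 100 markers there are M*(M-1)/2 distinct pairs.
full_workload = len(pairs_to_score(list(range(100))))
```

For M = 100 this gives 4950 pairs, which is why the second step stays feasible on a desktop machine while an all-against-all run on thousands of markers does not.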

► A dataset containing multiple arrays (the more the better) should be loaded into geWorkbench. If data is loaded from separate files, it should be merged into a single microarray dataset. See the section Projects and Data Files.

► For this example we will load the tutorial dataset "webmatrix_quantile_log2_dev1.2_mv0.exp".

♦ This contains a set of 100 experiments on Affymetrix HG_U95Av2 chips. This filtered dataset has been reduced to 2226 markers.

Lesson 11: Reverse Engineering

Prerequisites

1. In the upper right section of geWorkbench, find the Reverse Engineering component. It should by default be displaying the Profiler tab.

2. In the Markers component search box, on the left side of the geWorkbench interface, enter 1973 and hit Enter. This will find the marker 1973_s_at, which is the c-Myc gene, a well-known transcription factor with many interactions.

3. Click on this marker in the list. This will enter the marker into the Hub Gene Label field of the Profiler.

Lesson 11: Reverse Engineering

Profiler - selecting a hub gene

4. The default setting in the Profiler is Mutual Information (fast). With this selected, hit Analyze(2D). This will return a list of all markers having an MI score greater than the cutoff value (the default is 0.2).

Lesson 11: Reverse Engineering

Profiler – Analyze 2D

Options

► Pearson - Uses a Pearson correlation function to calculate the interaction scores.

Lesson 11: Reverse Engineering

Profiler - Options
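For comparison with mutual information, the Pearson correlation coefficient can be written out directly. The sketch below is a plain textbook implementation, not geWorkbench code, and the example profiles are hypothetical. It also shows the limitation that motivates mutual information: a strong but nonlinear dependence can give a Pearson score of zero.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    # Undefined (division by zero) if either profile is constant.
    return covariance / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2 * v + 1 for v in x]      # perfectly linear in x
parabola = [v ** 2 for v in x]       # fully determined by x, but not linear

r_linear = pearson(x, linear)
r_parabola = pearson(x, parabola)
```

Here `r_linear` is 1 while `r_parabola` is 0, even though the second profile is completely determined by the first.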

5. After the Mutual Information algorithm has been run, an adjacency matrix will be placed in the Project Folders component:

Lesson 11: Reverse Engineering

Profiler – data output

► If a smaller network is desired, a set of markers can be highlighted in the list originally returned. Only this selected subset, up to 100 markers, will then be used if "Create Network" is pressed.

► By right-clicking and selecting "Add to Set", the selected group can also be added to the Markers component as a new set which can be used in other components (sequence retrieval, annotation retriever etc.).

Lesson 11: Reverse Engineering

Profiler – adding returned markers to a set

6. Hit the Create Network button.

a) A network will be displayed based on the top 100 markers interacting with c-Myc. As described above, the MI algorithm is run again on these M=100 markers, in order to measure interactions between each pair.

b) Each marker is then connected via an edge with the marker it most strongly interacts with, with the chosen hub gene at the center.

Lesson 11: Reverse Engineering

Profiler – Create Network
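The rule in step 6(b), where each marker is joined to the one marker it interacts with most strongly, can be sketched in Python. The pairwise scores, marker names, and data layout (a dictionary keyed by unordered pairs) are assumptions for illustration; this is not the code geWorkbench uses to build the network.

```python
def strongest_partner_edges(markers, score):
    """Connect every marker to the partner with which it has the highest score."""
    edges = set()
    for m in markers:
        others = [o for o in markers if o != m]
        best = max(others, key=lambda o: score[frozenset((m, o))])
        edges.add(frozenset((m, best)))  # undirected edge; duplicates collapse
    return edges

# Hypothetical pairwise MI scores among the hub (MYC) and three markers.
markers = ["MYC", "g1", "g2", "g3"]
score = {
    frozenset(("MYC", "g1")): 0.9,
    frozenset(("MYC", "g2")): 0.5,
    frozenset(("MYC", "g3")): 0.3,
    frozenset(("g1", "g2")): 0.6,
    frozenset(("g1", "g3")): 0.2,
    frozenset(("g2", "g3")): 0.4,
}

edges = strongest_partner_edges(markers, score)
```

In this toy example g2 attaches to g1 rather than directly to the hub, and g3 attaches to g2, which gives the branched, tree-like shape described for the Cytoscape display.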

7. The resulting network is displayed in the Cytoscape viewer.

Lesson 11: Reverse Engineering

Cytoscape viewer

8. The visualization in Cytoscape can be improved by going to the Layout menu, and choosing yFiles->organic:

Lesson 11: Reverse Engineering

Cytoscape viewer layout

9. Within the network created in Cytoscape, one can select the central gene, and then on the Cytoscape menu choose Select->Nodes->First Neighbors of selected nodes.

Lesson 11: Reverse Engineering

Cytoscape viewer layout

10. The first neighbors will be highlighted in the graph.

11. They are also added as a new set in the Markers component.

Lesson 11: Reverse Engineering

Cytoscape viewer – choosing genes

12. Return to the main Reverse Engineering component by clicking on the original dataset in the Project Folders component.

13. Select the first (highest MI score) marker on the list. The graph shown below is then drawn in the Motif Location Histogram display. It plots the expression values on each array for the selected hub marker against those of any other marker selected in the list.

Lesson 11: Reverse Engineering

Motif Location Histogram

Gene Annotation and Pathway Viewing

Lesson 12: Gene Annotation and Pathway Viewing

► The Marker Annotations component retrieves information for selected markers (genes) using caBIO.

♦ Links to CGAP annotation pages are listed under Gene.

♦ Links to BioCarta pathway diagrams are listed under Pathway.

♦ Clicking on the pathway links will display the SVG pathway diagrams in the caBIO Pathways viewer.

Lesson 12: Gene Annotation and Pathway Viewing

The Marker Annotation component

1. Load a set of markers from the tutorial data into the Markers component:

a) Press Load set.

b) Locate the file “cluster_tree_total_pearsons_84_markers.csv” and load it.

c) Activate the set by checking the box in front of its entry. Here it has been renamed to “Cluster tree”.

2. In the Marker Annotations component, press Retrieve annotations.

3. Click on a Gene or Pathway link to view the annotations.

Lesson 12: Gene Annotation and Pathway Viewing

Marker Annotations - retrieving

► The list of markers can be sorted by Gene or by Pathway name by clicking on the column headings.

Lesson 12: Gene Annotation and Pathway Viewing

Marker Annotations - display

A CGAP annotation page displayed in a web browser window.

Lesson 12: Gene Annotation and Pathway Viewing

CGAP annotations

A BioCarta pathway displayed in the caBIO Pathways component.

Lesson 12: Gene Annotation and Pathway Viewing

BioCarta Pathway display

Part 3: Analysis and Display

Review

Part 3 described several tools for the analysis and display of microarray data.

Having completed Part 3, you should be able to:

1. Use the Scatter Plot and Expression Value Distribution components to examine microarray datasets.

2. Run Reverse Engineering on a microarray dataset to find interactions with a chosen hub gene, and

3. Retrieve gene annotations and pathway diagrams.

For further information….

For further information about geWorkbench, including complete online tutorials, please see:

www.geworkbench.org