MICROARRAY DATA ANALYSIS TOOL USING JAVA AND R

By

Vasundhara Akkineni

B.Tech, University of Madras, 2003

A Thesis Submitted to the Faculty of the

Graduate School of the University of Louisville

In Partial Fulfillment of the Requirements

for the Degree of

MASTER OF SCIENCE

Department of Computer Engineering and Computer Science

University of Louisville

Louisville, Kentucky

May 2006

MICROARRAY DATA ANALYSIS USING JAVA AND R

By

Vasundhara Akkineni B.Tech, University of Madras, 2003

A Thesis Approved on

April 14, 2006

By the following Thesis Committee:

Dr. Eric C. Rouchka, Thesis Director

Dr. Dar-jen Chang

Dr. Thomas Knudsen

DEDICATION

Dedicated to my parents

Mr. Sarat Kumar Akkineni

and

Mrs. Surya Rani Akkineni

Thanks for everything, papa and ma.

ACKNOWLEDGEMENTS

First, I would like to thank my thesis director, Dr. Eric C. Rouchka for his direction,

assistance and guidance. I would also like to thank the members of my thesis committee,

Dr. Dar-jen Chang and Dr. Thomas Knudsen for their time. I thank Tim Hardin,

Elizabeth Cha, Yamini Rudraraju and Eric Stutzenberger for making the Bioinformatics

lab an enjoyable place to come to each day. Additional thanks to the other members of

the Bioinformatics Research Group. I thank all the friends I have made over the years at

the University of Louisville. I have learned a lot from each one of you. Finally, I thank

my parents with all due respect for their love, support and encouragement. Support for

this project was provided by NIH-NCRR grant # P20 RR16481 (Nigel Cooper, PI).

ABSTRACT

MICROARRAY DATA ANALYSIS USING JAVA AND R

VASUNDHARA AKKINENI

APRIL 14, 2006

Microarray technology has become an essential tool in functional genomics for

monitoring the expression of many genes in parallel. Gene expression values obtained

from microarray experiments help biologists to understand the way in which a cell

responds to varying conditions (including, but not limited to development over time,

response to environmental stimuli, or disease states) by analyzing the increase or

decrease in the expression level of genes. We have developed web-based software that

provides biologists with several statistical solutions for analyzing gene expression data.

This platform-independent Java servlet first performs normalization of the gene

expression values in order to eliminate any systematic bias in the measured intensity

values arising from the microarray process. Several normalization methods, such as Total

Intensity Normalization, Median Normalization and Lowess Normalization, have been

implemented. After normalization, visualization of the experimental data can be

performed using scatter plots, MA plots, RI plots and image maps of the intensity ratios.

For detection of differentially expressed genes, the software provides fold-change

detection and t-test techniques. The tool also lets users

create a workflow of the different analysis tools used to study the uploaded

data. All the statistical routines used in this software were developed in R and are called from

Java code. This software is a freely available tool for the statistical analysis of microarray

experiments.

TABLE OF CONTENTS

DEDICATION

ACKNOWLEDGEMENTS

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

1. INTRODUCTION

1.1 Overview of molecular biology

1.2 DNA

1.3 RNA

1.4 mRNA

1.5 Gene

1.6 Central Dogma of Molecular Biology

1.7 MicroRNA (miRNA)

1.8 Microarrays

2. MICROARRAY ANALYSIS TECHNIQUES

2.1 Microarray data analysis

2.2 Log ratios

2.3 Normalization

2.3.1 Total intensity normalization

2.3.2 Median normalization

2.3.3 Lowess normalization

2.4 Scatter plot

2.5 MA plot

2.6 RI plot

2.7 Difference between MA and RI plots

2.8 Identifying differentially expressed genes

2.8.1 Fold change

2.9 Clustering

2.10 Types of clustering

2.10.1 Hierarchical clustering

2.10.2 Dendrogram

2.10.3 Heat maps

3. LITERATURE REVIEW

3.1 Bioconductor

3.2 TM4 and MIDAS

3.3 BASE: BioArray Software Environment

3.4 WebArray: an online platform for microarray data analysis

3.5 SNOMAD (Standardization and Normalization of MicroArray Data)

4. IMPLEMENTATION SPECIFICS

4.1 R

4.1.1 Statistics and R

4.1.2 R and Windows™

4.2 Rserve

4.2.1 Installation of Rserve

4.3 Java

4.3.1 Java language

4.3.2 Java platform

4.4 Java servlets

4.5 JSP (Java Server Pages)

4.6 JDBC (Java Database Connectivity)

4.7 MySQL

4.8 Apache Tomcat

5. OBJECTIVES AND RESULTS

5.1 Uploading experiment data

5.2 Normalization methods

5.3 Data visualization

5.3.1 Scatter plot

5.3.2 MA plot

5.3.3 RI plot

5.4 Creating a process pipeline

5.5 Identifying genes of interest

5.5.1 Fold change cut-off

5.6 Clustering genes

5.7 Top and bottom intensity ratios

5.8 Search genes

5.9 Output results as files

6. CONCLUSIONS

6.1 Possible improvements

REFERENCES

APPENDICES

CURRICULUM VITAE

LIST OF TABLES

Table 2.1 - Gene expression matrix with raw gene expression data

Table 2.2 - Gene expression matrix with intensity ratio values

Table 2.3 - Gene expression matrix with log2 intensity ratio values

Table 3.1 - Bioconductor packages

Table 5.1 - Color coding scheme for differentially expressed genes using fold change

LIST OF FIGURES

Figure 1.1 - An overview of the formation of proteins

Figure 1.2 - DNA double helix structure

Figure 1.3 - Formation of mRNA

Figure 1.4 - Central Dogma of Molecular Biology

Figure 1.5 - Preparation of microarrays

Figure 2.1 - Process of obtaining a gene expression matrix

Figure 2.2 - Effects of lowess normalization

Figure 2.3 - A scatter plot

Figure 2.4 - An MA plot

Figure 2.5 - An RI plot

Figure 2.6 - Construction of a two-dimensional dendrogram representing a hierarchical cluster of related genes

Figure 2.7 - A heat map with a dendrogram and a color key

Figure 4.1 - R command line interface on startup

Figure 4.2 - Java servlet execution process

Figure 4.3 - The three-tier architecture of a JDBC connection

Figure 5.1 - Sample data file for analysis

Figure 5.2 - Process of uploading data files

Figure 5.3 - Normalized data file using total intensity normalization

Figure 5.4 - Normalized data file using median normalization

Figure 5.5 - Total intensity normalization scatter plots and text files

Figure 5.6 - n × n matrix of scatter plots with a zoomed out portion for two specific experiments

Figure 5.7 - n × n matrix of MA plots with a zoomed out portion for two specific experiments

Figure 5.8 - n × n matrix of RI plots with a zoomed out portion for two specific experiments

Figure 5.9 - Steps for pipelining analysis

Figure 5.10 - Pipeline screen to define a sequence of routines to be performed on the data

Figure 5.11 - Results screen after submitting the pipeline screen

Figure 5.12 - Color based image map of the gene’s intensity ratios between two experiments

Figure 5.13 - Individual gene details

Figure 5.14 - Heat map with a dendrogram to represent expression clusters

Figure 5.15 - User selected columns for calculating the intensity ratios and the number of top/bottom genes needed

Figure 5.16 - Results showing top 10 ratios and bottom 25 ratios between two experiments

Figure 5.17 - Search screen with the list of genes from the uploaded data file

Figure 5.18 - Output screen for a search done on gene information

1. INTRODUCTION

Two complementary advances, one in knowledge and one in technology, are

greatly facilitating the study of gene expression and the discovery of the roles played by

specific genes in the development of disease. As a result of the Human Genome

Project[1], there has been an explosion in the amount of information available about the

DNA sequence of the human genome, including identification of a large number of genes

within these previously unknown sequences. The challenge currently faced by scientists

is to find a way to organize and catalog this vast amount of information into a usable

form. The full impact of the Human Genome Project will be realized only after the

functions of the new genes are discovered.

With this vast amount of information comes the need for tools to make sense of

the data. This need led to the second advance, DNA microarray technology, which facilitates

the identification and classification of DNA sequence information and the assignment of

functions to these new genes. With the invention of the DNA chip,

researchers have gone from looking at genes one at a time to tens of thousands at a

time[2]. In order to really understand a genome, scientists need to understand how genes

interact with each other and which genes are expressed under different conditions. This can

be done by measuring the amount of each mRNA present in the cell. Microarrays enable

us to measure this for thousands of genes simultaneously. With the aid of a computer, the

amount of mRNA bound to the spots on the microarray is precisely measured, generating

a profile of gene expression in the cell. Microarrays generate huge amounts of valuable

data and the handling and analysis of such data is becoming one of the major bottlenecks

in the utilization of the technology. The raw microarray data are images, which have to

be transformed into gene expression matrices—tables where rows represent genes,

columns represent various samples such as tissues or experimental conditions, and

numbers in each cell characterize the expression level of the particular gene in the

particular sample. These matrices have to be analyzed further if any knowledge about the

underlying biological processes is to be extracted. This analysis forms the basis of this thesis:

microarray data analysis.

The data analysis process constitutes the analysis of the gene expression matrix

using either supervised or unsupervised methods. Among the many statistical packages

available for data analysis, R is a statistical package that is widely used for the

analysis of microarray data[3]. Several open source software packages are available that perform

data analysis using R functionality as their base. Most of these packages either require

some hands-on programming experience and syntactical knowledge of the software in

order to perform the analysis of the microarray data, or are platform dependent and not

universally available for all types of users.

The Bioinformatics Research Group (BRG) [http://kbrin.a-bldg.louisville.edu/brg/],

a collaboration between the Speed School of

Engineering and the School of Medicine at the University of Louisville, initiated the

development of user-friendly software that can be used by biologists who

generally lack programming knowledge. This thesis concentrates on developing a

web-based Java tool that allows users to upload their data files in the format of a gene

expression matrix and then performs normalization of the data, produces plots to

visualize the data, performs clustering of similar patterns of differentially expressed genes

and lets users save their results to a text file.

It should be noted that a good understanding of these methods and the biology

behind the data is needed to choose the most appropriate method for solving a particular problem.

The rest of chapter one is devoted to an overview of molecular biology, including a

discussion of DNA, RNA, genes and microarrays. Chapter two discusses microarray

analysis techniques, including an overview of log ratios, normalization, visualization

plots, differentially expressed genes using the fold change method, and clustering.

Chapter three is a literature review of existing microarray data analysis software, their

drawbacks and how the system being developed caters to the needs of the user who lacks

programming expertise. Chapter four gives a detailed description of the software used for

the development of the microarray analysis tool. Installation and implementation

specifics are also covered in a detailed manner. An overall discussion of the system being

developed in the form of its objectives and the results obtained is dealt with in chapter

five. Conclusions and further improvements to the microarray data analysis tool are

discussed in chapter six. A detailed glossary of terms is also available as part of this

thesis for the reader’s reference.

1.1 Overview of molecular biology

Every cell in an organism contains a full set of chromosomes and identical genes.

At a given point of time, only a subset of these genes is active. These genes define certain

unique properties of a cell type. The information contained in the DNA is transcribed into

messenger RNA (mRNA) molecules, which are then translated into proteins, which

perform most of the important functions of the cell. Figure 1.1 illustrates this process.

[Figure 1.1 labels: Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA, single strand), Protein]

Figure 1.1 - An overview of the formation of proteins

1.2 DNA

Deoxyribonucleic Acid (DNA) is the basis for the building blocks encoding the

information of life. A single stranded DNA molecule, called a polynucleotide or

oligomer, is a chain of small molecules called nucleotides. There are four different

nucleotides, or bases: adenine (A), cytosine (C), guanine (G) and thymine (T).

By stringing together a simple alphabet of four characters, we can encode

enough information to create a complex organism. The ends of the polynucleotide are

marked either 3’ or 5’. The general convention is to label the coding strand from 5’ to 3’

(left to right). For instance, the following is a polynucleotide:

5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’

DNA can be either single-stranded or double-stranded. When DNA is

double-stranded, the second strand is referred to as the reverse complement strand.

Complementary bases are determined by which pairs of nucleotides can form bonds

between them. In the case of DNA, A binds to T, and C binds to G. For the

polynucleotide given above, the double-stranded polynucleotide is as follows:

5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’

| | | | | | | | | | | | | | | |

3’ C←A←T←T←T←C←A←G←G←G←C←A←A←T←C←G 5’
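The base-pairing rule above can be illustrated with a short Python sketch. It is not part of the thesis software; the function names `complement` and `reverse_complement` are chosen here for illustration only. It takes the example polynucleotide, derives the paired strand shown in the diagram (read 3’ to 5’), and then reverses it to give the reverse complement in the conventional 5’ to 3’ orientation.

```python
# Watson-Crick pairing for DNA: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(sequence):
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in sequence.upper())


def reverse_complement(sequence):
    """Return the complementary strand written 5' to 3'."""
    return complement(sequence)[::-1]


top_strand = "GTAAAGTCCCGTTAGC"  # the 5' to 3' example from the text

# Lower strand of the diagram, read left to right (3' to 5').
print(complement(top_strand))
# The same strand written in the conventional 5' to 3' direction.
print(reverse_complement(top_strand))
```

Applying `reverse_complement` twice returns the original sequence, which is a convenient sanity check for any such routine.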

Two complementary polynucleotide chains form a stable structure known as the

DNA double helix (Figure 1.2). Using the double-stranded molecule as a template,

proteins are produced for active genes with the help of RNA molecules.

Image source: http://www.genecrc.org/site/lc/lc2b.htm

Figure 1.2 - DNA double helix structure

1.3 RNA

Ribonucleic Acid (RNA) is similar to DNA in that it is constructed from

nucleotides. However, instead of thymine (T), an alternative base uracil (U) is found in

RNA. RNA can be found as double-stranded or single-stranded, and can also be part of a

hybrid helix where one strand is an RNA strand and the other is a DNA strand. RNA is

important in the cell and contributes in a variety of ways. One of the most important

roles of RNA is in protein synthesis. Two of the major RNA molecules involved in

protein synthesis are messenger RNA (mRNA) and transfer RNA (tRNA).

1.4 mRNA

Messenger RNA (mRNA) is a linear molecule encoding genetic information

copied from DNA molecules. DNA is copied into a single stranded mRNA molecule by

the transcription process. This occurs as follows. Genes consist of coding regions called

exons and non-coding regions called introns. mRNA processing removes introns and

splices the exons together. Processed mRNA can be translated into a protein sequence.
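As a rough illustration of transcription and mRNA processing, the following Python sketch treats sequences as plain strings. The function names and the exon coordinates are hypothetical and chosen only for this example; real exon boundaries come from genome annotation. The sketch relies on the fact that the mRNA carries the same sequence as the coding strand, with uracil (U) in place of thymine (T).

```python
from typing import Sequence, Tuple


def transcribe(coding_strand: str) -> str:
    """Return the RNA with the coding strand's sequence, using U instead of T."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna: str, exons: Sequence[Tuple[int, int]]) -> str:
    """Join the exon regions and drop everything between them (the introns).

    Each exon is a (start, end) pair, 0-based with an exclusive end.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


gene = "GTAAAGTCCCGTTAGC"           # coding strand from section 1.2
pre_mrna = transcribe(gene)          # unprocessed transcript

# Hypothetical exon coordinates, for illustration only:
# positions 5 to 8 are treated as an intron and removed.
exons = [(0, 5), (9, 16)]
mature_mrna = splice(pre_mrna, exons)

print(pre_mrna)
print(mature_mrna)
```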

Source: http://www.ebi.ac.uk/microarray/biology_intro.html

Figure 1.3 - Formation of mRNA

Therefore, in order to determine which genes are active in a cell (i.e., those that are

producing a protein product) one can measure the amount of mRNA present. This gives

an approximation of the activity of individual genes in a cell.

1.5 Gene

A gene can be described as the physical and functional unit of heredity that carries

information from one generation to the next[4]. A gene can be thought of as the DNA

sequence necessary for the synthesis of a functional protein or RNA molecule. Proteins

are important components of the body that determine how the different kinds of

molecules in the body are organized and act. Thus, proteins play a key role in the way we

look and also in making each of us a unique individual. Genes are expressed as proteins through a

complex process consisting of two main steps. First, each gene (DNA) is converted

(transcribed) into messenger RNA (mRNA), which serves as a template for protein

synthesis. Second, the resulting mRNA guides the synthesis of a protein through a process

called translation. Thus, isolating the mRNA helps us find the expressed genes in the

human genome.

1.6 Central Dogma of Molecular Biology

The Central Dogma of Molecular Biology states that the region of a double

stranded DNA molecule that corresponds to a gene is copied, or transcribed, to a

complementary single stranded mRNA molecule[5]. The single stranded mRNA

molecule then gets translated to a protein (Figure 1.4). If mRNA molecules can be

identified, the expression level of the corresponding genes can be determined.

Source: http://www.accessexcellence.org/RC/VL/GG/images/central.gif

Figure 1.4 - Central Dogma of Molecular Biology

1.7 MicroRNA (miRNA)

A miRNA is a form of single-stranded RNA which is typically 20-25 nucleotides

long, and is thought to regulate the expression of other genes[6]. miRNAs are RNA genes

which are transcribed from DNA, but are not translated into protein. The DNA sequence

that codes for a miRNA gene is longer than the miRNA. This DNA sequence includes the

miRNA sequence and an approximate reverse complement. When this DNA sequence is

transcribed into a single stranded RNA molecule, the miRNA sequence and its reverse-

complement base pair to form a double stranded hairpin loop which is a primary miRNA

structure (pri-miRNA). Pri-miRNAs are processed in the nucleus into hairpin RNAs

called pre-miRNAs. The pre-miRNA molecule is then actively transported out of the

nucleus by a carrier protein and processed into the mature miRNA, which binds to a

target mRNA with a complementary or partially complementary sequence. Through a

mechanism that is not fully characterized, the

bound mRNA remains untranslated, resulting in reduced expression of the corresponding

gene.

The function of miRNAs appears to be in gene regulation. miRNAs have been

reported to be critical in the development of organisms; they are differentially expressed

in tissues and are involved in viral infection processes. In the past two to three years, a

great deal of effort has gone into understanding how, when and where miRNAs are

produced and function in cells, tissues and organisms. Several research groups have

provided evidence that miRNAs may act as key regulators of processes as diverse as

early development, cell proliferation, cell death (apoptosis), fat metabolism and cell

differentiation. There is speculation that the role of miRNAs in regulating gene

expression could be as important as that of transcription factors. The discovery of

miRNAs and their functions has added insight into how gene regulation is much more

complex than the Central Dogma of Molecular Biology previously led biologists to

believe.

1.8 Microarrays

Microarrays, developed in the lab of Professor Patrick Brown at Stanford in the

early 1990s, took molecular biology by storm[7]. They are small slides spotted with

fixed samples of DNA, each for a different gene. When a researcher prepares a labeled

cell extract and incubates it with the slide, messenger RNAs in the sample anneal to the fixed

DNA, showing which genes in the sample are active. Microarray technology helps to

identify genes that are expressed under different conditions, such as during the stages of a

cell cycle, under different environmental conditions, under diseased states at a particular

time, or in different tissue or cell types. A microarray is typically a glass slide onto

which DNA, cDNA or oligonucleotide molecules are attached at fixed locations (spots).

There may be tens of thousands of spots on an array, each containing a huge

number of identical DNA molecules of varying lengths. For gene expression studies, each

of these molecules ideally should uniquely identify a single gene in the genome.

Microarrays are used to compare gene expression levels in two different samples, for

example, a cell in a healthy state and a diseased state. A microarray employs the ability of

a given mRNA molecule to bind specifically to, or hybridize to, the complementary DNA

from which it originated.

Source: http://www.bioteach.ubc.ca/MolecularBiology/microarray/index.htm

Figure 1.5 - Preparation of microarrays

A microarray contains many DNA sequences, and the expression levels of

thousands of genes can be determined in a single experiment by measuring the amount of

mRNA bound onto each spot of the array. Arranged systematically, the particular

sequences can be identified by the location of the spots on the slide.

For two channel experiments, the relative abundance of each of the gene-specific

sequences in two RNA samples (test and reference) may be estimated by fluorescently

labeling the samples, mixing them and hybridizing them to the sequences on the glass

slide. The two samples of mRNA from the cells (target) are reverse transcribed into

cDNA, and labeled using two different dyes (red Cyanine 5 and green Cyanine 3).

Usually, the reference sample is labeled with Cy3 and the test sample with Cy5. The mixture

reacts with the spotted cDNA sequences (probes). This results in cDNA sequences from

the targets and the probes base-pairing with one another. After this hybridization step is

complete, the microarray is placed in a scanner consisting of lasers with different

wavelengths, a microscope and a camera. The slide is scanned twice, first using one

colored laser and then the other. Laser light excites the fluorescent dyes: Cy3 is excited

by green laser light and Cy5 is excited by red laser light[4]. Green spots indicate that the

test sample is less abundant than the reference sample, red spots indicate that the

test sample is more abundant than the reference sample, and yellow spots mean that

there is no change in the activity level between the test and reference

samples. Black represents areas where neither the test nor the reference sample has bound to

the probe DNA. The process of creating and labeling a microarray can be observed in

Figure 1.5.
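The color interpretation described above can be sketched as a simple rule on the two channel intensities of a spot. This is only an illustration: the function name `spot_color`, the two-fold cut-off and the background level are assumptions made for this example and are not values taken from the scanner or from the software described in this thesis.

```python
def spot_color(test_cy5, reference_cy3, fold=2.0, background=50.0):
    """Classify a spot from its test (Cy5, red) and reference (Cy3, green) signal.

    'fold' and 'background' are illustrative thresholds, not calibrated values.
    """
    # Neither sample produced a signal above background: nothing bound here.
    if test_cy5 <= background and reference_cy3 <= background:
        return "black"

    # Floor both channels at the background level so the ratio is always defined.
    ratio = max(test_cy5, background) / max(reference_cy3, background)

    if ratio >= fold:
        return "red"      # test sample more abundant than reference
    if ratio <= 1.0 / fold:
        return "green"    # test sample less abundant than reference
    return "yellow"       # roughly equal abundance in both samples


for cy5, cy3 in [(2000, 500), (500, 2000), (1000, 1100), (10, 20)]:
    print(cy5, cy3, spot_color(cy5, cy3))
```

Flooring at the background level is one simple way to avoid dividing by zero for very weak spots; other background correction schemes are possible.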

With the central dogma of molecular biology, genes, and

microarrays, their preparation process and uses now introduced, the next chapter presents microarray

data analysis and the techniques used to analyze microarray data to obtain useful

information and knowledge about the underlying biological processes.


2. MICROARRAY ANALYSIS TECHNIQUES

2.1 Microarray data analysis

Analysis of microarray data is performed to identify which genes are involved in

the process being studied. It involves statistical analysis by various graphical and

numerical means to select differentially expressed (DE) genes or to find groups of genes

whose expression profiles can reliably classify the different RNA sources into

meaningful groups. The analysis of gene expression data is performed by constructing the

gene expression matrix that describes spot quantitations from different hybridizations.

The process of constructing a gene expression matrix from the raw microarray data is

summarized in Figure 2.1.

Figure 2.1 - Process of obtaining a gene expression matrix


A gene expression matrix is a matrix in which the first column holds the gene names and
each subsequent column corresponds to an experimental condition. Each cell value usually
represents the expression value of the gene in the given experiment. Table 2.1 shows a
gene expression matrix with sample gene expression values.

Gene      Exp 1     Exp 2     Exp 3     Exp 4     Exp 5     Exp 6
Gene 1    403       409.3     611.5     569.2     536.6     580.2
Gene 2    757.3     574.4     826.7     595.3     755.2     956
Gene 3    284.4     327.3     421.6     336.6     391.3     412.6
Gene 4    2314.2    1685.3    2264.7    2204.1    2233.1    2458.4
Gene 5    1574.5    1273      1484.6    1321.2    1474.7    1774.1
Gene 6    2333.7    1796.8    2464.5    2372.5    2095.9    2735.7

Table 2.1 - Gene expression matrix with raw gene expression data

The cell values can also be the intensity ratios of the particular experiment with a

preset experiment value. In the gene expression matrix in Table 2.2, the cell values are

the intensity ratios of all the other experiments with experiment 1, calculated from the

expression matrix of Table 2.1.


Gene      Exp 2/Exp 1   Exp 3/Exp 1   Exp 4/Exp 1   Exp 5/Exp 1   Exp 6/Exp 1
Gene 1    1.015633      1.51737       1.412407      1.331514      1.439702
Gene 2    0.758484      1.091641      0.786082      0.997227      1.26238
Gene 3    1.150844      1.482419      1.183544      1.375879      1.450774
Gene 4    0.728243      0.97861       0.952424      0.964955      1.062311
Gene 5    0.808511      0.942903      0.839124      0.936615      1.12677
Gene 6    0.769936      1.056048      1.016626      0.898102      1.172259

Table 2.2 - Gene expression matrix with intensity ratio values

Data analysis is based on the hypothesis that there are biologically relevant

patterns to be discovered in the data. The microarray data analysis process depends on the

analysis of the gene expression matrix using both supervised and unsupervised methods.

Most data analysis methods use raw expression values, intensity ratios or both for their

analysis routines. Data analysis methods take these huge sets of data as input and produce

both visual and numerical results for interpretation and further analysis. The most
commonly used microarray data analysis methods include log ratios of the gene intensity
data, which spread the values across a given range; normalization, which identifies and
removes bias from the data; diagnostic plots, which visualize the microarray data; methods
to identify differentially expressed genes; and clustering of genes with similar behavior
patterns. Each of these analysis techniques is discussed in detail in the following
sections.


2.2 Log ratios

A logarithmic transformation produces a continuous spectrum of values and treats

up and down regulated genes evenly across a range [8]. Equation 2.1 shows the formula

for calculating log2 ratios.

$$ X_i = \log_2\left(\frac{R_i}{G_i}\right) $$ [Equation 2.1]

where i = 1, 2, ..., N_genes, and R_i and G_i are the measured intensity values for gene i
from two different experimental conditions.

With log2 values, X = 0 represents equal expression, X = 1 represents up regulation by a
factor of 2, X = -1 down regulation by a factor of 2, X = 2 up regulation by a factor of 4,
and so on. Additionally, calculating log2 values spreads the values more evenly across the
intensity range, provides better visualization of the data, and tends to make the
variability of the data more constant over the intensity range[9]. Table 2.3 gives the
log2 values of the intensity ratios calculated in Table 2.2.

Gene      Exp 2/Exp 1    Exp 3/Exp 1    Exp 4/Exp 1    Exp 5/Exp 1     Exp 6/Exp 1
Gene 1    0.02237883     0.60157266     0.49815582     0.413067216     0.52577046
Gene 2    -0.39880918    0.12649896     -0.34724803    -0.004006164    0.33614569
Gene 3    0.20269214     0.56795340     0.24311371     0.460353645     0.53682236
Gene 4    -0.45750812    -0.03119360    -0.07032387    -0.051465694    0.08720612
Gene 5    -0.30666134    -0.08481948    -0.25304488    -0.094472262    0.17219357
Gene 6    -0.37718928    0.07867587     0.02378897     -0.155049228    0.22929092

Table 2.3 - Gene expression matrix with log2 intensity ratio values
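
The step from raw intensities to intensity ratios and then to log2 ratios can be sketched in a few lines of Java. This is only an illustration of Equation 2.1 applied to the Gene 1 row of Table 2.1; the class and method names are hypothetical and are not taken from the tool described in this thesis.

```java
import java.util.Arrays;

/** Sketch: intensity ratios and log2 ratios for one row of a gene expression matrix. */
final class LogRatioSketch {

    /** Base-2 logarithm via the change-of-base rule. */
    static double log2(double value) {
        return Math.log(value) / Math.log(2.0);
    }

    /** Ratios of experiments 2..n against experiment 1 (the first element). */
    static double[] ratiosToFirst(double[] intensities) {
        double[] ratios = new double[intensities.length - 1];
        for (int j = 1; j < intensities.length; j++) {
            ratios[j - 1] = intensities[j] / intensities[0];
        }
        return ratios;
    }

    /** Element-wise log2 of a vector of ratios (Equation 2.1). */
    static double[] log2All(double[] ratios) {
        double[] out = new double[ratios.length];
        for (int j = 0; j < ratios.length; j++) {
            out[j] = log2(ratios[j]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Gene 1 from Table 2.1
        double[] gene1 = {403, 409.3, 611.5, 569.2, 536.6, 580.2};
        double[] ratios = ratiosToFirst(gene1);
        System.out.println(Arrays.toString(ratios));
        System.out.println(Arrays.toString(log2All(ratios)));
    }
}
```

Up to rounding, the two printed rows should correspond to the Gene 1 rows of Table 2.2 and Table 2.3.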


2.3 Normalization

The goal of normalization is to identify and remove any systematic bias in the

measured fluorescence intensities, arising from variation in the microarray process rather

than from biological differences between the RNA samples or the printed probes[10;11].

Sources of bias include:

• labeling efficiencies of the dyes

• different amounts of Cy3 and Cy5 labeled mRNA

• scanning parameters

• spatial or plate effects, print tip effects, etc.

In the normalization process, a normalization factor (also referred to as a scaling
factor) is calculated, and every value of an experiment is multiplied by it. The factor can
be applied to either of the two experiments being compared. This is equivalent to
subtracting a constant from the logarithm of the expression ratio.

2.3.1 Total intensity normalization

Total intensity normalization computes the normalization factor by summing the

measured intensities in both the experiments considered[10]. This is shown in Equation

2.2,

$$ N_{total} = \frac{\sum_{k=1}^{N_{array}} R_k}{\sum_{k=1}^{N_{array}} G_k} $$ [Equation 2.2]

where N_array is the total number of genes, and G_k and R_k are the measured intensity
values of the kth gene in the two experiments. The intensities are then rescaled such that
G_k' = N_total G_k and R_k' = R_k, and the normalized expression ratio for each feature is
calculated (Equation 2.3).

$$ T_k' = \frac{R_k'}{G_k'} = \frac{1}{N_{total}} \frac{R_k}{G_k} $$ [Equation 2.3]

This is equivalent to

$$ \log_2(T_k') = \log_2(T_k) - \log_2(N_{total}) $$
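
A minimal Java sketch of total intensity normalization, following Equations 2.2 and 2.3, is given here for illustration. The class name, method names and sample intensities are assumptions made for this example and do not come from the tool itself.

```java
import java.util.Arrays;

/** Sketch of total intensity normalization (Equations 2.2 and 2.3). */
final class TotalIntensitySketch {

    /** N_total = sum(R_k) / sum(G_k), summed over all genes on the array. */
    static double normalizationFactor(double[] r, double[] g) {
        double sumR = 0.0;
        double sumG = 0.0;
        for (int k = 0; k < r.length; k++) {
            sumR += r[k];
            sumG += g[k];
        }
        return sumR / sumG;
    }

    /** Normalized ratios T_k' = R_k / (N_total * G_k). */
    static double[] normalizedRatios(double[] r, double[] g) {
        double nTotal = normalizationFactor(r, g);
        double[] t = new double[r.length];
        for (int k = 0; k < r.length; k++) {
            t[k] = r[k] / (nTotal * g[k]);
        }
        return t;
    }

    public static void main(String[] args) {
        // Made-up intensities in which one channel is uniformly twice as bright.
        double[] r = {200, 400, 600};
        double[] g = {100, 200, 300};
        System.out.println(normalizationFactor(r, g));
        System.out.println(Arrays.toString(normalizedRatios(r, g)));
    }
}
```

In this made-up data set the R channel is exactly twice the G channel for every gene, so the factor removes the whole difference and every normalized ratio becomes 1.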

2.3.2 Median normalization

In median normalization, the normalization factor is found by calculating the median of
the array in question. Hence the equation becomes

$$ \log_2(T_k') = \log_2(T_k) - median_a $$

where a is the experiment array and median_a is its median. The advantage of median
normalization is that it is insensitive to outliers, which occur commonly in microarray
data sets.
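
The following Java sketch illustrates median normalization. It assumes that the median is taken over the log2 ratios of the array, which the text above does not state explicitly, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Sketch of median normalization of log2 ratios (median taken over the array's log2 ratios). */
final class MedianNormalizationSketch {

    /** Median of a vector; the input array is left unmodified. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
    }

    /** log2(T_k') = log2(T_k) - median_a for every gene k. */
    static double[] normalize(double[] log2Ratios) {
        double m = median(log2Ratios);
        double[] out = new double[log2Ratios.length];
        for (int k = 0; k < log2Ratios.length; k++) {
            out[k] = log2Ratios[k] - m;
        }
        return out;
    }

    public static void main(String[] args) {
        // The value 8.0 plays the role of an outlier; it does not move the median.
        double[] log2Ratios = {0.5, 1.0, 1.5, 8.0, 0.9};
        System.out.println(Arrays.toString(normalize(log2Ratios)));
    }
}
```

After normalization the median of the log2 ratios is zero, and the single large value has no influence on the correction applied to the other genes.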

2.3.3 Lowess normalization

Lowess stands for Locally Weighted Linear Regression. It is closely related to Loess, and
the two names are often used interchangeably, although Lowess uses a linear regression
model whereas Loess uses a quadratic regression model. The lowess normalization procedure
subtracts a lowess regression curve from the data to normalize it[10;12].


Figure 2.2 - Effects of lowess normalization

A lowess curve is first drawn on the RI plot. The curve is calculated by a regression
process that expresses the dependence of the ratio on the intensity in mathematical form.

The dependence y(x_k) for each gene k is calculated by observing its distance from the
curve. On subtracting the dependence from the observed log2 ratios, the equation becomes:

$$ \log_2(T_k') = \log_2(T_k) - y(x_k) = \log_2(T_k) - \log_2\left(2^{y(x_k)}\right) $$ [Equation 2.4]

Figure 2.2 shows the effect of lowess regression on a set of data. The plot on the right

hand side is the RI plot itself and the plot on the left hand side is the RI plot fitted with

the lowess curve. Lowess detects the systematic deviations in the RI plot and corrects

them by carrying out a local weighted linear regression function given by Equation 2.4,

and uses this function, point by point, to correct the measured ratio values. The results of

applying such a lowess correction can be seen in the left hand side plot of Figure 2.2.

Lowess analysis is used as a normalization method that can remove intensity-dependent
effects in the log2 ratio values.
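
To make the correction of Equation 2.4 concrete, the sketch below fits a simplified lowess curve and subtracts it from the log2 ratios. It is a single-pass version with tricube weights and no robustness iterations, so it is not equivalent to the lowess routine provided by R; the span parameter, the class name and the method names are assumptions made for this example.

```java
import java.util.Arrays;

/** Sketch: single-pass lowess (tricube weights, no robustness iterations). */
final class LowessSketch {

    /** Fitted value y(x[i]) at every point, from a weighted local straight-line fit. */
    static double[] fit(double[] x, double[] y, double span) {
        int n = x.length;
        // Number of neighbours in each local window (at least 3, at most n).
        int q = Math.min(n, Math.max(3, (int) Math.ceil(span * n)));
        double[] fitted = new double[n];
        for (int i = 0; i < n; i++) {
            double[] dist = new double[n];
            for (int j = 0; j < n; j++) {
                dist[j] = Math.abs(x[j] - x[i]);
            }
            double[] sorted = dist.clone();
            Arrays.sort(sorted);
            double h = sorted[q - 1]; // bandwidth: distance to the q-th nearest point

            double sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0;
            for (int j = 0; j < n; j++) {
                double w;
                if (h == 0.0) {
                    w = (dist[j] == 0.0) ? 1.0 : 0.0;
                } else {
                    double u = dist[j] / h;
                    w = (u < 1.0) ? Math.pow(1.0 - u * u * u, 3) : 0.0; // tricube weight
                }
                sw += w;
                swx += w * x[j];
                swy += w * y[j];
                swxx += w * x[j] * x[j];
                swxy += w * x[j] * y[j];
            }
            double denom = sw * swxx - swx * swx;
            if (Math.abs(denom) < 1e-12) {
                fitted[i] = swy / sw; // degenerate window: fall back to the weighted mean
            } else {
                double slope = (sw * swxy - swx * swy) / denom;
                double intercept = (swy - slope * swx) / sw;
                fitted[i] = intercept + slope * x[i];
            }
        }
        return fitted;
    }

    /** log2(T_k') = log2(T_k) - y(x_k), with x the log intensity and y the fitted curve. */
    static double[] normalize(double[] logIntensity, double[] log2Ratio, double span) {
        double[] curve = fit(logIntensity, log2Ratio, span);
        double[] out = new double[log2Ratio.length];
        for (int k = 0; k < out.length; k++) {
            out[k] = log2Ratio[k] - curve[k];
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up data with a purely intensity-dependent trend in the ratios.
        double[] x = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = 0.5 * x[i] - 1.0;
        }
        System.out.println(Arrays.toString(normalize(x, y, 0.5)));
    }
}
```

When the log2 ratios depend on the intensity through a pure trend, as in the made-up data above, the correction removes the trend and leaves values close to zero. Genes that deviate from the local trend keep their deviation.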

2.4 Scatter plot

The scatter plot is an important graphical tool for studying the spread and linearity

of data[8]. In its simplest form, two variables are plotted along the axes, and marks are


drawn according to these coordinates. The intensity values of genes under different

experiments can be depicted as a scatter plot. A scatter plot is straightforward, but very

high correlation between the two experimental intensity values makes the features of the

plot difficult to discern. In an ideal scatter plot, all the spots are clustered around the

diagonal line representing y=x. Figure 2.3 shows a scatter plot with most of the data

points clustered around the diagonal line.

Figure 2.3 - A scatter plot

2.5 MA plot

An MA plot is a scatter plot with transformed axes[8]. The x-axis shows the average of
the log intensities of the two experiments (A); the y-axis shows the log ratio of the two
experiments (M). MA plots are used to identify spot artifacts and detect
intensity-dependent patterns in the log ratios. Since the interest lies in deviations of the
points from the diagonal line, it is beneficial to rotate and re-scale the axes as in
the MA plot. The MA plot serves to increase the room available to represent the range of
differential expression and makes it easier to see non-linear relationships between the log
intensities. The MA plot in Figure 2.4 shows the differentially expressed genes more
clearly than the scatter plot in Figure 2.3. If an MA plot clearly shows a dependence of
the log ratio M on the overall spot intensity A, this suggests that an intensity-dependent
('A'-dependent) normalization method may be preferable.

Figure 2.4 - An MA plot

$$ M = \log_2 \mathrm{Exp1} - \log_2 \mathrm{Exp2} $$

$$ A = \frac{1}{2}\left(\log_2 \mathrm{Exp1} + \log_2 \mathrm{Exp2}\right) $$
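
The two coordinates of an MA plot can be computed per gene as in the following Java sketch. The class and method names are illustrative only.

```java
/** Sketch: M and A coordinates of a single spot for an MA plot. */
final class MaPlotSketch {

    static double log2(double value) {
        return Math.log(value) / Math.log(2.0);
    }

    /** M = log2(Exp1) - log2(Exp2), the log ratio. */
    static double m(double exp1, double exp2) {
        return log2(exp1) - log2(exp2);
    }

    /** A = (log2(Exp1) + log2(Exp2)) / 2, the average log intensity. */
    static double a(double exp1, double exp2) {
        return 0.5 * (log2(exp1) + log2(exp2));
    }

    public static void main(String[] args) {
        // Gene 1, experiments 1 and 2 from Table 2.1
        System.out.println("M = " + m(403, 409.3) + ", A = " + a(403, 409.3));
    }
}
```

A gene with equal intensity in both experiments has M = 0 whatever its A value, which is why unchanged genes are expected to scatter around the horizontal line M = 0.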

2.6 RI plot

A ratio-intensity (RI) plot is also a scatter plot like the MA plot that shows the

intensity specific effects for all the genes by plotting the log ratio as a function of the

product of the intensities[9;12]. RI plots are used to determine if there is a rough

correlation between the total intensity of a spot and its ratio. The easiest way to visualize

intensity-dependent effects, and the starting point for the lowess analysis described in

section 2.3.3, is to plot the measured log2 (Exp 1/Exp 2) for each gene as a function of the

log10 (Exp 1*Exp 2) product intensities. This ‘R-I’ plot can reveal intensity-specific

artifacts in the log2 (ratio) measurements, which can be eliminated using the lowess
normalization method. Under the assumption that most genes are not differentially

expressed, most of the points in the RI plot should fall along the horizontal line. Figure

2.5 shows an RI plot where a large number of genes which are not differentially

expressed fall along the horizontal line, and a number of differentially expressed genes

are scattered away from the horizontal line.

Figure 2.5 - An RI plot

2.7 Difference between MA and RI plots

The MA plot and the RI plot are both used to check whether the data exhibit an
intensity-dependent structure, and scientists use the two plots interchangeably.

In an MA plot, plot M = log2 (R/G) vs. A = (1/2) log2 (R*G)

In an RI plot, plot R = log2 (R/G) vs. I = log10 (R*G)

where R and G are the intensities measured in two different experiments.

Because the RI plot looks very similar to the MA plot, the choice of plot is a source of
confusion. MA plots are similar to RI plots, but they are not the same. RI plots are most
commonly used to show the effect of lowess normalization. MA plots are used instead of
scatter plots because they increase the room available to represent the range of
differential expression and make it easier to see non-linear relationships between the log
intensities.
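
The relationship between the horizontal axes of the two plots can be made explicit with a short derivation that uses only the definitions given above. The vertical axis, log2 (R/G), is the same in both plots.

```latex
\begin{align*}
A &= \tfrac{1}{2}\log_2(R \cdot G), \\
I &= \log_{10}(R \cdot G)
   = \log_{10}(2)\,\log_2(R \cdot G)
   = 2\log_{10}(2)\,A
   \approx 0.602\,A .
\end{align*}
```

Under these definitions the two plots therefore contain the same points; the RI plot is an MA plot whose horizontal axis has been rescaled by a constant factor.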

2.8 Identifying differentially expressed genes

One of the main goals of microarray experiments is to identify differentially

expressed (DE) genes[11]. It is practical to identify a limited number of genes that are
the most likely candidates. This set of DE genes can then be analyzed further using
clustering and other techniques.

2.8.1 Fold change

Fold change detection is a simple approach in which a fixed fold-change cutoff
interval is used to find genes that are differentially expressed[10;13].

$$ \text{Fold change} = \frac{\text{Expression value in sample 1}}{\text{Expression value in sample 2}} $$ [Equation 2.5]

If a gene's experimental log-ratio exceeds the upper boundary of the cutoff interval, it
is marked as significant and over expressed. If it falls below the lower boundary, it is
marked as significant and under expressed. Genes with experimental log-ratios inside the
interval are marked as showing regular behavior. Important questions in the fold change
method are which cutoff should be used and whether the cutoff should be the same for all
genes. Though it is a very straightforward method for classifying genes, the fold change
method has the disadvantage of not considering variability. Hence, genes with large
variances are more likely to make the cutoff just because of noise. For poorly expressed
genes, small changes in intensity can lead to large calculated fold changes. Moreover, it
is not a statistically based method.
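
The classification rule described above can be sketched in Java as follows. The sketch assumes a cutoff interval that is symmetric in log2 space, from -log2(c) to +log2(c) for a fold-change cutoff c. The class, enum and method names are hypothetical.

```java
/** Sketch: fold-change classification with a symmetric log2 cutoff interval. */
final class FoldChangeSketch {

    enum Call { OVER_EXPRESSED, UNDER_EXPRESSED, UNCHANGED }

    /**
     * Classifies one gene from its expression values in two samples.
     * foldCutoff = 2.0 means a two-fold change in either direction is required.
     */
    static Call classify(double sample1, double sample2, double foldCutoff) {
        double logRatio = Math.log(sample1 / sample2) / Math.log(2.0);
        double bound = Math.log(foldCutoff) / Math.log(2.0);
        if (logRatio > bound) {
            return Call.OVER_EXPRESSED;
        }
        if (logRatio < -bound) {
            return Call.UNDER_EXPRESSED;
        }
        return Call.UNCHANGED;
    }

    public static void main(String[] args) {
        System.out.println(classify(900, 300, 2.0)); // three-fold increase
        System.out.println(classify(500, 400, 2.0)); // inside the interval
        System.out.println(classify(12, 4, 2.0));    // weak spot, small absolute change
    }
}
```

The last call illustrates the weakness noted above: a difference of only 8 intensity units on a poorly expressed gene yields a three-fold change and passes the cutoff, just like the much larger change from 300 to 900.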

2.9 Clustering

Microarray experiments deal with a large amount of data, which has to be stored

and analyzed. Therefore a general idea is to reduce the dimensionality of the data. The

basic concepts in clustering are to try to identify and group together similarly expressed

genes and then try to correlate and interpret the observations at the biological end. The

basic principles in gene clustering are:

1. Organize the data into a small number of homogeneous groups.

2. Find similar expression patterns of genes. Both low and high expression level

genes can be placed in the same cluster if their expression profiles have similar

shape.

2.10 Types of clustering

Clustering can be hierarchical or flat, as well as agglomerative or divisive[10].

Agglomerative processes start out by considering each object as a separate cluster and

proceed to group the most similar objects in an iterative fashion until all the data are

included. Divisive methods start out with the complete set of data as one large group, or

cluster, and proceed by partitioning the objects starting with those that are most

dissimilar. Based on their background principle, the different types of clustering methods

available are Hierarchical agglomerative clustering[9;10], Hierarchical divisive

clustering[9;10], k-means clustering and self-organizing maps (SOMs)[9;10;13].


2.10.1 Hierarchical clustering

The clustering method used for analysis in this tool is hierarchical clustering. The
hierarchical clustering algorithm uses a bottom-up approach: starting with each gene in a
cluster of its own, it iteratively joins the two closest clusters[9;10;13]. After each
step, the distances between the newly formed cluster and the other clusters are
recalculated. For a set of N genes to be clustered and an N × N distance matrix,
hierarchical clustering is performed as follows:

1. Assign each gene to a cluster of its own.

2. Find the closest pair of clusters and merge them into a single cluster.

3. Compute the distances between the new cluster and each of the old clusters.

4. Steps 2 and 3 are repeated until all the genes are clustered.
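
The four steps above can be sketched in Java as follows. This is a naive illustration, not the routine used by the tool (which relies on R for clustering): it uses Euclidean distance between expression profiles, takes the distance between two clusters to be the shortest distance between any two of their members, and recomputes cluster distances on demand instead of updating a stored distance matrix. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: naive agglomerative clustering with shortest-distance (single) linkage. */
final class SingleLinkageSketch {

    /** Euclidean distance between two expression profiles. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            sum += (a[k] - b[k]) * (a[k] - b[k]);
        }
        return Math.sqrt(sum);
    }

    /** Shortest distance from any member of one cluster to any member of the other. */
    static double linkage(List<Integer> a, List<Integer> b, double[][] profiles) {
        double min = Double.POSITIVE_INFINITY;
        for (int i : a) {
            for (int j : b) {
                min = Math.min(min, distance(profiles[i], profiles[j]));
            }
        }
        return min;
    }

    /** Returns the cluster formed by each merge, in merge order (gene indices are 0-based). */
    static List<List<Integer>> cluster(double[][] profiles) {
        // Step 1: assign each gene to a cluster of its own.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int g = 0; g < profiles.length; g++) {
            List<Integer> single = new ArrayList<>();
            single.add(g);
            clusters.add(single);
        }
        List<List<Integer>> merges = new ArrayList<>();
        // Step 4: repeat until all genes are in one cluster.
        while (clusters.size() > 1) {
            // Steps 2 and 3: find the closest pair, computing distances as needed.
            int bestI = 0;
            int bestJ = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j), profiles);
                    if (d < best) {
                        best = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            List<Integer> merged = new ArrayList<>(clusters.get(bestI));
            merged.addAll(clusters.get(bestJ));
            clusters.remove(bestJ); // index removal; bestJ > bestI, so bestI stays valid
            clusters.set(bestI, merged);
            merges.add(merged);
        }
        return merges;
    }

    public static void main(String[] args) {
        // Five made-up genes, each with a one-experiment profile.
        double[][] profiles = {{0.0}, {1.0}, {5.0}, {6.5}, {20.0}};
        System.out.println(cluster(profiles));
    }
}
```

The order of the merges, together with the distance at which each merge happens, is the information a dendrogram displays.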

The distance between two clusters is taken to be the shortest distance from any
member of one cluster to any member of the other cluster (single linkage). Hierarchical
clustering has become popular for the following reasons:

• Hierarchical clustering techniques are meaningful to cluster data at the experiment

level rather than at the level of individual genes. Such experiments are most often

used to identify similarities in overall gene expression patterns in the context of

different treatment regimens.

• The analysis reveals groups of similar genes that can be studied in greater depth.

• It is possible to visualize the data in a hierarchical way using interactive computer

programs.

While intuitively appealing as a method, hierarchical clustering is not an efficient
method for very large gene expression matrices, as the full distance matrix of all
pair-wise distances has to be calculated in advance, which for n objects takes on the
order of n^2 steps. Hierarchical clustering is also less suitable for noisy data.

2.10.2 Dendrogram

Hierarchical clustering can be represented as a tree called a dendrogram[9;10].

Source 1: http://www.awprofessional.com/articles/article.asp?p=357695&seqNum=4&rl=

Figure 2.6 - Construction of a two-dimensional dendrogram representing a hierarchical cluster of related genes

Cutting the dendrogram at a particular height gives the different clusters and the
size of the clusters. The dissimilarity of the clusters is proportional to the length of the
vertical lines projecting from each cluster. Figure 2.6 is an example of how a dendrogram
of clusters is obtained.

Each column represents a different experiment, each row a different spot on the
microarray. The height of each link is inversely proportional to the strength of the
correlation. Relative correlation strengths are represented by integers in the
accompanying chart sequence. Genes 1 and 2 are most closely coregulated, followed by
genes 4 and 5. The regulation of gene 3 is more closely linked with the regulation of
genes 4 and 5 than any remaining link or combination of links. The strength of the
correlation between the expression levels of genes 1 and 2 and the cluster containing
genes 3, 4, and 5 is the weakest (relative score of 10). (Adapted from: Jeffrey Augen,
"Bioinformatics and Data Mining in Support of Drug Discovery," Handbook of Anticancer
Drug Development. D. Budman, A. Calvert, E. Rowinsky, editors. Lippincott Williams
and Wilkins. 2003)

2.10.3 Heat maps

A heat map is a color image with a dendrogram attached to the left side and to the
top of the image[10;14]. The rows and columns plotted in the heat map are re-ordered
based on the restrictions imposed by the dendrogram. Each row in the heat map represents
a gene and the columns represent the different experiments to which the gene is
subjected. The colors in the heat map simply represent the values in the gene expression
matrix. One can observe from a heat map (Figure 2.7) that genes with similar gene
expression profiles (i.e. strings of similar colors) are grouped close together.

Figure 2.7 - A heat map with a dendrogram and a color key

The next chapter is a literature review of existing microarray data analysis tools,

and their advantages and disadvantages in terms of ease of use, availability and

functionality. It also includes a discussion about the motivation for the developed

microarray data analysis tool.


3. LITERATURE REVIEW

There are several commercial and non-commercial solutions as well as a growing

body of freely available open source software for analyzing microarray data. A review of

some popular open source microarray data analysis tools is presented here including

Bioconductor, TM4, MIDAS, BASE, WebArray and SNOMAD.

3.1 Bioconductor

Bioconductor is an open source project for computational biology[15]. The main
focus is to deliver high-quality infrastructure and end-user tools for expression analysis.
Bioconductor is built completely on R[3;14] and R packages. A list of the different types
of packages available is given in Table 3.1. In addition to providing genomic data
analysis tools, Bioconductor has excellent integrated, dynamic documentation. Each
Bioconductor package contains at least one vignette, a document that provides a textual,
task oriented description of the package's functionality and can be used interactively.
Although initial efforts focused primarily on DNA microarray data analysis, many of the
software tools are general and can be used broadly for the analysis of genomic and
expression data. Bioconductor has adopted object-oriented programming as its primary
programming paradigm.

The main features of the Bioconductor project are:

• Use R to provide a wide range of statistical and graphical methods for the
analysis of genomic data.

• Help integrate biological literature data from PubMed and LocusLink with the
analysis of genomic data.

• Allow the development of extensible, scalable and interoperable software.

• Provide high quality documentation and reproducible research.

• Provide training in computational and statistical methods for the analysis of
genomic data.

Task                             Packages
General programming tools        Biobase, graph, tkWidgets, reposTools, rhdf5
Annotation                       AnnBuilder, annotate
Graphics                         geneplotter, hexbin
Preprocessing microarray data    affy, marrayClasses, marrayInput, marrayNorm, marrayPlots, marrayTools
Differential gene expression     genefilter, multtest, ROC

Table 3.1 - Bioconductor packages

Although Bioconductor has the advantage of building on the existing toolkit of
statistical applications, it is command line based, which is imposing for many users. The
tkWidgets package provides some functionality for creating GUIs, but even that requires
additional programming.

The need for an analysis tool with a simple user interface, which does not require
any kind of programming skills, and instead lets the user upload data for analysis and
download the results in a point-and-click fashion, is the main factor in developing the
web-based application which is the focus of this thesis.

3.2 TM4 and MIDAS

The TM4 microarray analysis suite of tools was developed to provide the
microarray community with a comprehensive set of tools to handle all aspects of the
microarray process[16]. The TM4 suite of tools consists of four major applications:
Microarray Data Manager (MADAM), TIGR_Spotfinder, Microarray Data Analysis System
(MIDAS), and Multiexperiment Viewer (MeV). Since the focus of this project is
microarray data analysis, the discussion is confined to MIDAS.

MIDAS: Microarray Data Analysis System

MIDAS is a Java application which provides users an intuitive interface to design
analysis processes combining one or more normalization and filtering steps. MIDAS
reads ".tav" files (TIGR ArrayViewer file type, which is an eight-column, tab-delimited
text format developed at TIGR for the purposes of storing the intensity values of the
spots on a single slide) generated by TIGR Spotfinder or retrieved from the database via
MADAM. Normalization modules include lowess and total intensity normalization. It
also includes background- and quality-control trimming, replicate analysis and filtering,
and the identification of differentially expressed genes using intensity dependent Z-scores
and user defined fixed fold-change cut-offs. MIDAS provides scatter plots that illustrate
the effects of each algorithm on the data. When the normalization and filtering steps are
complete, MIDAS outputs the data in tav format. While TM4 overcomes some of the
limitations of a command-line driven system, it has the disadvantage of requiring users to
maintain current copies of the software locally and to update the system as it evolves.

Thus the need for an analysis tool which can accept data as a simple text file,
instead of program specific formats, and which can be accessed through the web, instead of
maintaining local copies of the program on a user's computer, has been another
motivating factor in developing this application.

3.3 BASE: BioArray Software Environment

BASE was developed using a web-based approach which closely integrates a data
management system with a data analysis system[16]. Since expression analysis tools are
evolving rapidly, BASE has a plug-in architecture that allows new modules to be easily
added for data transformation, analysis, or visualization. BASE incorporates a data
analysis interface that allows users to define an analysis method that passes data through
multiple routines and to create transformed datasets and subsets. This allows the original
unmodified data to be analyzed in a number of ways to create multiple analyses. BASE
allows data to be visualized in a variety of ways. Unmodified and transformed datasets
can be plotted interactively as scatter plots, displayed in histograms, or viewed as tables.
Though BASE minimizes the software update problem through its web-based approach,
it has the disadvantage that it loses a good deal of the graphical functionality that local
applications can provide.

The motivation for creating a pipeline process in the application being developed
comes from the analysis method of BASE. Also, though not yet implemented, the
integration of the data analysis module with a data management system as done in BASE
is a good future improvement.

3.4 WebArray: an online platform for microarray data analysis

WebArray offers a convenient platform for biologists to access several cutting-edge
microarray data analysis tools[17]. WebArray runs on a LAMP (Linux + Apache + MySQL +
Python) system. Background computations are mostly done by R scripts. The currently
implemented functions of WebArray are based on the limma (Linear Models for Microarray
Analysis) and affy packages from Bioconductor, the spacings LOESS histogram (SPLOSH)
method, a PCA-assisted normalization method and a genome mapping method. WebArray
incorporates these packages and provides a user-friendly interface for accessing a wide
range of key functions of limma and others, such as spot quality weight, background
correction, graphical plotting, normalization, linear modeling, empirical Bayes statistical
analysis, false discovery rate (FDR) estimation, and chromosomal mapping for genome
comparison. Microarray analysis using WebArray can be executed in three steps: 1)
uploading and managing files; 2) selecting datasets and methods for analysis; and 3)
browsing results. A good help document is also available with detailed annotation of all
the functions of WebArray. Thus WebArray is an excellent free, open source package for
microarray analysis that can be used by an average biologist after some training.

3.5 SNOMAD (Standardization and Normalization of MicroArray Data)

SNOMAD is an interactive, user-friendly web-application which can be accessed
freely via the internet with any standard HTML browser[18]. SNOMAD is a collection of
algorithms for the normalization and standardization of gene expression datasets derived
from diverse sources. In addition to the regular transformations and visualization tools,
SNOMAD includes two non-linear transformations which correct bias and variance
which are non-uniformly distributed across the range of microarray element signal
intensities: 1) local mean normalization; and 2) local variance correction (Z-score
generation using locally calculated standard deviation).

The SNOMAD tool is available at

http://pevsnerlab.kennedykrieger.org/snomad.htm. No programming expertise or

software installation is required. Users can upload their gene expression data and specify

the transformations they wish to apply on their data. Results come in the form of both a

text file containing numeric values and image files of graphs of the data corresponding to

all the transformations.

WebArray and SNOMAD are two user-friendly tools available for microarray data
analysis. However, both offer limited functionality, confined to a fixed set of routines
that the user can perform on the data, and neither allows new R programs to be added to
the existing system. As a result, biologists tend to use multiple tools to obtain the
required results. This lack of extensibility formed another motivation for the development
of the application under discussion. Together, the factors discussed above led to the
development of the current application, which addresses the community-driven need for an
easy-to-use, readily available and extensible microarray data analysis tool that uses R
routines for analysis.


4. IMPLEMENTATION SPECIFICS

In this chapter, the software implementation specifics for the microarray analysis
tool are discussed. A brief introduction to the R package, which forms the base for the
statistical analysis of the microarray data, is given. A description of Java, which has been
used to develop the user interface, and information about Rserve, which is the plug-in
used to connect to R from Java, are also given. Some background about the MySQL database
and the JDBC connection needed to connect to a database from Java code is also
provided. A clear understanding of the software is needed to understand the
implementation techniques discussed in this thesis.

4.1 R

R is a powerful software environment for data manipulation, calculation and
graphical display. It is a GNU General Public License project similar to the S language.
The name is partly based on the first names of the first two authors (Robert Gentleman
and Ross Ihaka), and partly a play on the name of the Bell Labs language 'S'[3;14].
R supports a wide range of statistical techniques including descriptive statistics,
linear and nonlinear modeling, classical statistical tests, probability distributions, analysis
of variance (ANOVA), time series analysis, classification, clustering, robust regression
and maximum likelihood.

R is extensible via user defined functions written in its own language, or through

the use of dynamically loaded modules written in other languages. It can be used with

Linux, UNIX and Microsoft Windows™.

4.1.1 Statistics and R

Most of the statistical techniques have been built into the base R environment and

many more are supplied in the form of packages. There are about 10 packages called

standard packages which are supplied with R and many more can be downloaded from

the Comprehensive R Archive Network (CRAN) website (http://cran.r-project.org).

The major difference between R and other statistical systems is that in R, the
statistical analysis is performed as a sequence of steps with the results of every step
stored in objects. In systems like SAS and SPSS copious output is obtained from a
regression or analysis whereas R will give minimal output and store the results in a fit
object for further processing by R functions.

4.1.2 R and Windows™

The latest version of R for Windows™ can be downloaded from the CRAN
website. The version used for development of this project is R 2.1.1. A full installation of
R on Windows™ takes up to 50 MB of disk space. To install, double click on the icon for
rw2011.exe and follow the instructions. R installed in this way can be started from the
start menu or by double clicking the R shortcut. To add packages to the existing R
system, download the packages from the CRAN website and unzip them into the
R/rw2011/library folder directly.

Figure 4.1 - R command line interface on startup

4.2 Rserve

Rserve [19] is a TCP/IP server which allows other programs to use R facilities
from various languages without the need to initialize R or link against the R library.
Rserve supports remote connection, authentication and file transfer. Typical use of Rserve
is to integrate an R backend for computing statistical models, plots, etc., from other
applications.

The features of Rserve include:

• R initialization is not necessary.

• Most R data types are converted into native data types.

• Connections persist until they are closed.

• Offers client independence since the client is not linked to R.

• Rserve provides some basic security in the form of encrypted user/password
authentication.

• Rserve allows transferring files between the client and the server.

Rserve itself is the server which responds to requests from the clients. It listens
for incoming connections and processes incoming requests. A client framework, JRclient,
was also developed. JRclient is a client suite, written in Java, which allows a Java
application to access Rserve. It provides automatic type translation for most objects
such as int, double, arrays, string or vector, and classes for special R objects such as
RList, RBool, etc. Separating the client side from the server side allows multi-threading
to be handled better than when linking to the R library directly.

4.2.1 Installation of Rserve

R 1.5.0 or higher needs to be installed on your system in order to be able to use
Rserve. Rserve works on Linux, Solaris, AIX and Windows™. The Windows™ version
of Rserve was used for development. Although Rserve works on Windows™, it is not the
recommended platform since Windows™ lacks important features that make the separation of
namespaces possible. Therefore Rserve for Windows™ allows only one connection at a
time and all subsequent connections share the same namespace.

Installation process for Windows™:

1. Make sure to download the proper binary based on the version of R.

2. Copy the binary Rserve.exe to the same directory where R.dll is located. By
default it is in the R\rw2011\bin folder.

3. Run Rserve.exe to start the server and to make connections to R.

Rserve was developed by Simon Urbanek, a researcher at AT&T Research labs.
Anyone interested in contributing to the project can do so since it is released under the GPL
(General Public License).

4.3 Java

This microarray data analysis tool was mostly developed using Java in order to
provide a platform independent solution. The IDE (Integrated Development
Environment) used for code development is Eclipse SDK 3.1.1. Other IDEs that can be
used are Borland’s JBuilder and NetBeans.

4.3.1 Java language

Java, as described by Sun Microsystems, is a simple, object-oriented, distributed,
interpreted, robust, secure, architecture neutral, portable, high-performance,
multithreaded, and dynamic language[20]. A program written in Java is both compiled
and interpreted. A Java compiler generates an architecture independent object file
executable on any system supporting the Java runtime environment. The object code
consists of bytecode instructions designed to be both easy to interpret on any machine
and easily translated into native machine code at load time. So compilation takes place
only once, while interpretation occurs each time the program is executed.

4.3.2 Java platform

A platform is the hardware or software environment in which a program runs.
Some popular platforms like Windows™, Linux, Mac OS, etc. are a combination of the
operating system and the underlying hardware. The Java platform differs from these
platforms based on the fact that it is a software-only platform that runs on top of other
hardware-based platforms.

The Java platform has two parts:

1. The Java Virtual Machine (JVM)

2. The Java Application Programming Interface (API)

The JVM is the interpreter and the runtime system, which lets Java programs run
on any hardware-based platform to which it has been ported. The API is a large
collection of ready-made software components that provide several capabilities. It is
grouped into libraries of related classes and interfaces. These libraries are also
known as packages[20].

4.4 Java servlets

Servlets[21] are Java programs that run on a web server and build web pages.
Servlets provide a component-based, platform-independent method for building
web-based applications. Servlets are server- and platform-independent, which leaves us free to
select any server, platform and tools for running our application.

Source: http://cs.nmu.edu/~jeffhorn/Classes/CS122/Figures/javaTranslation.gif

Figure 4.2 - Java servlet execution process

Servlets are class files which handle GET and POST requests. GET requests are
the usual type of browser requests for web pages in HTTP, issued when a user types a URL on
the address line or follows a link from a web page. Servlets also handle POST requests,
which are generated when someone submits an HTML form that specifies
method=POST. To be a servlet, a class should extend HttpServlet and override the doGet
and the doPost methods to handle the GET and POST requests respectively. Both these
methods take two arguments: an HttpServletRequest and an HttpServletResponse. The
HttpServletRequest has methods to handle all incoming information such as form data,
HTTP request headers, and the client’s hostname. The HttpServletResponse lets you
specify outgoing information such as HTTP status codes, content-type and cookies, and most
importantly lets you send document content back to the client. The two important
packages that have to be imported into the servlet file are:

1. javax.servlet (for ServletException) and

2. javax.servlet.http (for HttpServlet, HttpServletRequest and HttpServletResponse).

4.5 JSP (Java Server Pages)

Java Server Pages[20;21] is a technology that lets you mix regular, static HTML with
dynamically-generated HTML. You simply write the regular HTML in the normal
manner, using whatever web-page building tools you normally use. You can then enclose
the code for the dynamic parts in special tags which start with “<%” and end with “%>”.
A JSP is saved with a .jsp extension and it can be invoked just like any other normal web
page. Though it appears to be a normal HTML file, a JSP acts like a servlet behind the
scenes.

4.6 JDBC (Java Database Connectivity)

JDBC[20;21] defines how a Java program can communicate with a database.
The JDBC API provides two packages: java.sql and javax.sql. By using the JDBC API, one can
connect to a database, send queries to the database and process the results.

The JDBC architecture defines the different layers involved in working with a database from Java:

1. JDBC API interfaces and classes, which are at the topmost layer (to work with Java)

2. A driver, which is at the middle layer (maps Java to the database specific language)

3. A database at the bottom (to store physical data)

Source: http://www.dbmsmag.com/9610i06.html#figure1

Figure 4.3 - The three-tier architecture of a JDBC connection

The three main interfaces provided by the JDBC API to work with databases are:

1. Connection interface provides database connection functionality.

2. Statement interface provides SQL query representation and execution
functionality.

3. ResultSet interface provides functionality for retrieving the data which comes
from the execution of a SQL query using Statement.

4.7 MySQL

MySQL is a very popular open source database server which is commonly used in
conjunction with web technology to create dynamic and powerful server applications[22].
The database has been designed for speed, which would be useful in large transactions.
MySQL is one of the most widely installed open source databases, a well respected product that is
more than capable of commercial operation. Google, for example, is reported to make use of
MySQL technology[23]. MySQL offers most of the functionality one will

expect from an RDBMS. It ensures that transactions comply with the ACID model
(Atomicity, Consistency, Isolation, and Durability), allows the building of indexes,
supports standard data types and allows for database replication, among other features.
One area where MySQL has fallen short, particularly in older releases, is the lack of
certain features like sub-queries, constraints, views, cursors and objects. MySQL is fast,
easy to use and open source, and if the application is a web application then MySQL meshes
in perfectly with most of the web development languages. When using MySQL with Java, the
MySQL Connector/J driver needs to be downloaded from MySQL’s website
[http://www.mysql.com/products/connector/j/].

4.8 Apache Tomcat

Apache Tomcat (also codenamed Catalina) is a standalone Web server used as a
development server on your desktop. The Tomcat server [21] is a Java based web
application container that was created to run Servlets and Java Server Pages (JSP) in web
applications. Java must be installed for Tomcat to operate.

Tomcat organizes all the parts of a web application such as static web pages,
servlets and JSP into a single directory structure. It can be thought of as a container
which acts as the deployment directory where you can place all your web application files
for them to execute without any hassles.

The root folder is the deployment folder where all the static html files and JSPs
can be placed. The Servlets are placed in the ROOT/WEB-INF/classes folder. Pros of
Tomcat are that it is an open source project, stays on top of the Servlet API
developments, and works extremely well. The cons are that it is not the fastest
implementation and that you are on your own for support.

In the next chapter, a detailed discussion of the developed microarray data
analysis tool is presented. The discussion covers the objectives of the analysis tool
and the results that have been achieved, along with screen shots of the system for easier
understanding.

5. OBJECTIVES AND RESULTS

The aim of this thesis was to develop a freely available, platform independent
application for visualization, normalization and analysis of microarray experiments and
also a tool which will guide the users through the steps of normalization and data analysis,
such as identifying differentially expressed genes and grouping those differentially
expressed genes into clusters of genes exhibiting similar behavior.

R, the freely available statistical package, can be used to perform all
types of analysis on microarray data, but it has the disadvantage of being a command line
based package which requires the biologists to know the syntax of various commands and
also requires the users to be familiar with programming techniques and concepts. The
users have to, in short, be well versed in R to perform efficient data analysis. Thus the
motivation for the tool comes from the need for an easy to use, point and click kind of
interface which is easily accessible over the internet, to which users can easily connect,
upload their data files tailored to a particular format and get both numeric and visual
results for interpreting the data without having to worry about the intricacies of
programming.

I will now discuss the different objectives of the tool and the way they were
implemented, and also the results of normalizing and analyzing the microarray
data using the software tool, which was developed during the course of this work.

5.1 Uploading experiment data

The user can upload experimental data as a text file which, more importantly, must be in
the format of a gene expression matrix. The data text files should contain a listing of all
the genes and the expression values of the genes for different experiments (which may be
different biological conditions, different time-points, etc).

Figure 5.1 - Sample data file for analysis

As shown in figure 5.1, the gene names form the first column, the other columns
are the different experiments and the numerical values represent the gene expression data
for each gene under the different experiments. The data file can be uploaded through a
user friendly web page (figure 5.2). It is saved in the working directory of
the application on the server and is used for all further analysis.

Figure 5.2 - Process of uploading data files
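A minimal sketch of how a file in this layout could be read is given below. It assumes a tab-delimited file whose first row names the experiments; the delimiter, the header row and the class name ExpressionMatrix are assumptions made for illustration and are not taken from the tool's source code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for a parsed gene expression matrix. */
final class ExpressionMatrix {
    final List<String> experiments = new ArrayList<>();
    final List<String> genes = new ArrayList<>();
    final List<double[]> values = new ArrayList<>();

    /** Parses tab-delimited text: a header row first, then one gene per row. */
    static ExpressionMatrix parse(String text) {
        ExpressionMatrix m = new ExpressionMatrix();
        String[] lines = text.split("\\R");
        String[] header = lines[0].split("\t");
        // Column 0 of the header labels the gene column; the rest are experiments.
        for (int j = 1; j < header.length; j++) {
            m.experiments.add(header[j].trim());
        }
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = lines[i].split("\t");
            double[] row = new double[cells.length - 1];
            for (int j = 1; j < cells.length; j++) {
                row[j - 1] = Double.parseDouble(cells[j].trim());
            }
            m.genes.add(cells[0].trim());
            m.values.add(row);
        }
        return m;
    }
}
```

Keeping the gene names and the numeric rows in parallel lists mirrors the layout of figure 5.1, where the first column identifies the gene and every other column is one experiment.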

5.2 Normalization methods

A wide choice of common normalization methods is offered to the user to remove
the systematic bias within the data. It is also possible to add newly developed
normalization techniques at a later stage.

All the previously discussed normalization methods can be applied to the data
from the uploaded file. The normalization can be applied to more than one experimental
column or to all the experimental columns. In the case of multiple experimental columns,
all the columns are normalized with respect to the first experimental column using the
normalization method selected. The normalized data can be downloaded as a text file for
reference or for input to another system.

Figure 5.3 - Normalized data file using total intensity normalization

Figure 5.4 - Normalized data file using median normalization

Figure 5.3 shows an example of total intensity normalization, which can be compared to
figure 5.4 (median normalization). The slight variation in the normalized data from the
different methods can be observed.
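The idea of normalizing a column with respect to the first experimental column can be sketched as follows. This is an illustration under stated assumptions: it uses the common scaling form of both methods (a single factor per column, derived from the sums or the medians), and the class and method names are hypothetical. The exact formulas used by the tool are the ones given in the earlier discussion of normalization methods.

```java
import java.util.Arrays;

/** Hypothetical helpers that rescale one column against a reference column. */
final class Normalizer {
    /** Total intensity: scale so that the column sum equals the reference sum. */
    static double[] totalIntensity(double[] reference, double[] column) {
        double factor = Arrays.stream(reference).sum() / Arrays.stream(column).sum();
        return Arrays.stream(column).map(v -> v * factor).toArray();
    }

    /** Median: scale so that the column median equals the reference median. */
    static double[] median(double[] reference, double[] column) {
        double factor = medianOf(reference) / medianOf(column);
        return Arrays.stream(column).map(v -> v * factor).toArray();
    }

    /** Median of the values; the input array is left unchanged. */
    static double medianOf(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

Because the two methods derive their factor from different summaries of the column, the normalized values differ slightly, which is the variation visible between figures 5.3 and 5.4.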

5.3 Data visualization

To get an idea about the condition of the data sets or the effects of different

normalization methods, different means of graphical display such as scatter plots, MA

plots and RI plots have been implemented.
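For orientation, the coordinates of an MA plot can be computed as in the sketch below. It assumes the standard definition, M = log2(R/G) and A = (log2 R + log2 G)/2, where R and G stand for the intensities of one gene in the two experiments being compared; the class name and parameter names are hypothetical.

```java
/** Hypothetical helper computing MA-plot coordinates for one gene. */
final class MaTransform {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** M: log2 ratio of the two intensities (vertical axis). */
    static double m(double r, double g) {
        return log2(r / g);
    }

    /** A: mean of the two log2 intensities (horizontal axis). */
    static double a(double r, double g) {
        return 0.5 * (log2(r) + log2(g));
    }
}
```

A gene with equal intensity in both experiments has M = 0, so unbiased data scatter around the horizontal axis of the plot.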

5.3.1 Scatter plot

Plotting the log2 intensity values of one experiment condition versus the log2
intensity values of another experiment condition is a common way to display the
distribution of the data. The normalization step gives two forms of output. One is the data
file with the normalized data and the second is a scatter plot of the normalized data.

Figure 5.5 - Total intensity normalization scatter plots and text files

The graphical result of the normalization step is an n × n matrix of scatter plots
where n is the number of experimental columns to be normalized (figure 5.5 and 5.6).
The n × n matrix of scatter plots consists of scatter plots of each experiment versus itself
and all the other experiments selected. Thus it can be observed from the results that the
scatter plot of an experiment versus itself is a straight line passing through the origin. The
n × n matrix of the scatter plots has an image area map defined on it which lets users
zoom in on a particular scatter plot for better viewing of the plots.

Figure 5.6 - n × n matrix of scatter plots with a zoomed out portion for two specific
experiments

5.3.2 MA plot

An alternative to the scatter plot is the MA plot with transformed axes to provide
intensity information. This tool produces MA plots for both normalized and raw data.
The MA plots are also produced as an n × n matrix of individual MA plots where n is the
number of experiments selected (figure 5.7). The image area map logic has also been
used for the MA plot, which lets the user zoom in on a plot in a separate window.

Figure 5.7 - n × n matrix of MA plots with a zoomed out portion for two specific
experiments

5.3.3 RI plot

The tool produces RI plots for both raw and normalized data. The n × n matrix of
RI plots and the image area map logic have been used for this visualization technique as
well, which lets the users view the RI plots for all the n experiments at the same time
and also zoom in on a particular experiment’s RI plot (figure 5.8).

Figure 5.8 - n × n matrix of RI plots with a zoomed out portion for two specific
experiments

5.4 Creating a process pipeline

This software tool allows the user to create a process pipeline where the user can
select a set of routines to be performed on the data uploaded. The pipeline window is
reached after uploading the data file (figure 5.9). On uploading the file, the
background code parses out the different experiments conducted and displays a window
to the user with all the experiments listed, from which the user can select those
experimental columns on which he/she wants the analysis done. After selecting the
experimental columns for further analysis, the user interface window submits into a
process pipeline screen, where a variety of operations are listed from which the user
can select the processes of his/her choice and form a sequence of steps to be performed on the
data. The sequence of steps to reach the pipeline step is discussed in detail below.

Figure 5.9 - Steps for pipelining analysis

• The columns parsed out from the uploaded file, which represent the different experimental columns.

• The columns transferred here by clicking on the double right arrow button will be subjected to further analysis routines.

• On clicking the Submit button, the window which lets the user select a pipeline of processes to be performed on the data uploaded will open up.

Figure 5.10 - Pipeline screen to define a sequence of routines to be performed on the
data

• The three categories of routines available for analyzing the data.

• Button for transferring routines from LHS to RHS.

• The list box with the sequence of processes to be performed on the data.

• Buttons to change the order of the processes or to delete a selected process.

• Submit to get results.

In the pipeline window as shown in figure 5.10, there are three categories of
processes listed as list boxes, which the user can perform on the uploaded data. The first
category is the normalization routines, from which the user can select a particular
normalization method and apply it to the data. The second category is the various
visualization plots available to view the distribution of data. The third list box forms the
third category of processes called “Ratios and Clustering”, which consists of finding the
top and bottom intensity ratios between experiments, a color coded representation of the
differentially expressed genes by the fold change method and finally the clustering of the
most differentially expressed genes into groups with the same pattern of behavior. The
button with the right arrow by the side of each list box of methods is used to move the selected
method into the pipeline list box on the right hand side. The “Up”, “Down” and “Delete”
buttons on the right hand side of the pipeline window are used to change the sequence of
the processes lined up in the “Selected Order of Execution” list box. In order to move a
particular process up towards the beginning of the pipeline, the user has to select the process
and click on the “Up” button. Similarly, for moving a particular process to a later stage of
the pipeline, the user has to select the process and click on the “Down” button. The
“Delete” button is used for deleting a particular process from the pipeline. On clicking
the “Submit” button the pipeline is taken into the system and the sequence of steps is
performed on the data. On completion of the pipeline, the results screen is displayed as a
separate window as shown in figure 5.11.

Figure 5.11 - Results screen after submitting the pipeline screen

• Click on this link to get the normalized data file.

• Click on the links to get visual results.
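The behavior of the “Selected Order of Execution” list described in this section can be modeled with a small ordered list, as sketched below. The class name Pipeline is hypothetical, and the sketch assumes that one click on “Up” or “Down” moves the selected process by a single position; the tool's actual user interface code may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of the "Selected Order of Execution" list box. */
final class Pipeline {
    private final List<String> steps = new ArrayList<>();

    /** Appends a routine chosen from one of the three category list boxes. */
    void add(String step) {
        steps.add(step);
    }

    /** "Up": moves the step at index one position earlier, if possible. */
    void up(int index) {
        if (index > 0 && index < steps.size()) {
            Collections.swap(steps, index, index - 1);
        }
    }

    /** "Down": moves the step at index one position later, if possible. */
    void down(int index) {
        if (index >= 0 && index < steps.size() - 1) {
            Collections.swap(steps, index, index + 1);
        }
    }

    /** "Delete": removes the step at index, if it exists. */
    void delete(int index) {
        if (index >= 0 && index < steps.size()) {
            steps.remove(index);
        }
    }

    /** The order in which the routines would be run on "Submit". */
    List<String> order() {
        return Collections.unmodifiableList(steps);
    }
}
```

On “Submit”, the routines would then be executed in the order returned by order().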


5.5 Identifying genes of interest

One of the most important goals of microarray technology is the search for new
target genes. Methods have been provided to detect differentially expressed (DE) genes.
The techniques implemented in these methods are fold change detection and the statistical
t-test.

5.5.1 Fold change cut-off

A certain threshold value is set, with which the log2 values of the intensity ratios
of a gene between different experiments are compared. If a gene’s log2 intensity ratio
value exceeds the threshold value, it is marked as differentially expressed (DE) in that
given experiment. In this software tool, various colors are assigned to represent the genes
based on the gene’s log2 intensity ratio values.

The assignment of a color was done for easier interpretation of data. The different ranges
of the log2 intensity ratio values and the corresponding color coding assigned are given
below.

Cutoff range              Color coding in RGB (red, green, blue)
< -2.0                    (153, 0, 0)
≥ -2.0 and < -1.5         (255, 0, 0)
≥ -1.5 and < -1.0         (255, 51, 0)
≥ -1.0 and < -0.7         (255, 103, 0)
≥ -0.7 and < -0.5         (255, 204, 51)
≥ -0.5 and < -0.3         (255, 204, 102)
≥ -0.3 and < -0.1         (255, 255, 204)
≥ -0.1 and < 0.0          (255, 255, 230)
≥ 0.0 and < 0.1           (235, 255, 230)
≥ 0.1 and < 0.3           (204, 255, 153)
≥ 0.3 and < 0.5           (153, 255, 0)
≥ 0.5 and < 0.7           (103, 255, 0)
≥ 0.7 and < 1.0           (51, 255, 0)
≥ 1.0 and < 1.5           (51, 204, 0)
≥ 1.5 and < 2.0           (0, 153, 0)
≥ 2.0                     (0, 51, 0)

Table 5.1 - Color coding scheme for differentially expressed genes using fold change

The process of assigning colors to the different genes for a given experiment is as
follows. The intensity ratios of all the genes between the two experiments the user has
selected are calculated. The log2 values of the intensity ratios are then calculated. The
log2 values are then compared with the ranges in Table 5.1 to determine a range, and
then a corresponding color is assigned to the gene, thus categorizing it as either regularly
expressed, under expressed or over expressed. All the genes are plotted as a matrix of
color images. When a user moves the mouse over a particular color image, the gene
corresponding to it and the intensity value are provided as a tool tip (figure 5.12). On
clicking a particular color box, a window opens up where the gene name, the intensity
ratios of the gene in other experiments and a line plot of the gene’s expression value in all
the experiments can be viewed (figure 5.13).

Figure 5.12 - Color based image map of the gene’s intensity ratios between two
experiments

Figure 5.13 - Individual gene details
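The range lookup of Table 5.1 can be sketched as a search over the ordered range boundaries. The class and method names below are hypothetical, and the RGB values are the ones read from the table; the red component 103 of the range from 0.5 to 0.7 is a reconstruction of a value that is hard to read in the table.

```java
/** Hypothetical lookup of the Table 5.1 color for a log2 intensity ratio. */
final class FoldChangeColor {
    // Exclusive upper bounds of the first 15 ranges; the 16th range is >= 2.0.
    private static final double[] UPPER = {
        -2.0, -1.5, -1.0, -0.7, -0.5, -0.3, -0.1, 0.0,
        0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0
    };

    // One RGB triple per range, in the same order as the rows of Table 5.1.
    private static final int[][] RGB = {
        {153, 0, 0}, {255, 0, 0}, {255, 51, 0}, {255, 103, 0},
        {255, 204, 51}, {255, 204, 102}, {255, 255, 204}, {255, 255, 230},
        {235, 255, 230}, {204, 255, 153}, {153, 255, 0}, {103, 255, 0},
        {51, 255, 0}, {51, 204, 0}, {0, 153, 0}, {0, 51, 0}
    };

    /** log2 of the intensity ratio between two experiments for one gene. */
    static double log2Ratio(double intensityA, double intensityB) {
        return Math.log(intensityA / intensityB) / Math.log(2.0);
    }

    /** Returns the {red, green, blue} triple of the range containing the value. */
    static int[] colorFor(double log2Ratio) {
        for (int i = 0; i < UPPER.length; i++) {
            if (log2Ratio < UPPER[i]) {
                return RGB[i].clone();
            }
        }
        return RGB[RGB.length - 1].clone();
    }
}
```

Red shades mark negative log2 ratios (under expression), green shades mark positive ones (over expression), and the pale colors around zero correspond to regularly expressed genes.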


5.6 Clustering genes

Clustering of genes allows biologists to identify genes which exhibit similar
behavior patterns over a set of experiments. Modules which perform clustering of genes
using the hierarchical clustering technique have been implemented. The outputs obtained
from the clustering module are a few groups which contain genes which behave similarly.

This software tool performs clustering in the following way. The background
code calculates the intensity ratios of the gene expression values for all the experimental
columns selected by the user. The intensity ratios are calculated with respect to the very
first experimental column. Then the background code checks for those genes whose log2
intensity ratios across all the experiments are greater than the upper limit of 2.0. It selects
all those genes and then applies the hierarchical clustering technique in order to cluster them
into groups of genes with the same pattern of behavior. The clusters of genes are displayed
using a heat map which uses a dendrogram to show the gene clusters and also a color key
to provide a color-based illustration of the gene intensity ratio values.

In the case of huge data files, the number of genes selected for clustering may be
large and, though the heat map can accommodate all the genes and their corresponding
clusters, it appears very clumsy and illegible. In order to solve this problem, the top fifty
most differentially expressed genes can be selected and then clustered into similar pattern
groups. This way, the user is not looking at a large number of genes clustered together,
but can expect a more refined clustering of the most differently behaving genes.
By employing such a clustering method, the tool tends to overlook the other gene clusters
which might be carrying some significant information. It should be understood that the
clustering information of all the genes is present and the tool is applying the clustering
routine to all the genes and not just the top fifty genes. But the genes used for
displaying the clustered information are the ones which are most differentially expressed,
in order to make the heat map (Figure 5.14) easier to view and understand.

Figure 5.14 - Heat map with a dendrogram to represent expression clusters

5.7 Top and bottom intensity ratios

The tool has two methods implemented for finding the top and bottom 10, 25, 50,
75 and 100 genes based on their log2 intensity ratios between different experiments. The
user can select the experiments for which the ratio has to be calculated and also select
the number of top/bottom ratios needed, which can be any of 10, 25, 50, 75 and 100.
Both these methods return an ordered list of genes and their intensity ratio values. The
purpose of these two methods is to provide a quick way to determine those genes which
are the most over/under expressed in a set of selected experiments.

Figure 5.15 - User selected columns for calculating the intensity ratios and the
number of top/bottom genes needed

The result screens obtained for the selections done in figure 5.15 are shown in figure
5.16. The results are displayed in the form of a table with the gene name and its
log2 intensity ratio value.

Figure 5.16 - Results showing top 10 ratios and bottom 25 ratios between two
experiments
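Selecting the top or bottom genes amounts to sorting the genes by their log2 intensity ratio and keeping the first n entries, as in the sketch below. The class name RatioRanking and the use of a map from gene name to ratio are assumptions for illustration and are not taken from the tool itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical ranking of genes by their log2 intensity ratio. */
final class RatioRanking {
    /** Returns the n genes with the highest ratios, highest first. */
    static List<String> top(Map<String, Double> log2Ratios, int n) {
        return rank(log2Ratios, n, Comparator.reverseOrder());
    }

    /** Returns the n genes with the lowest ratios, lowest first. */
    static List<String> bottom(Map<String, Double> log2Ratios, int n) {
        return rank(log2Ratios, n, Comparator.naturalOrder());
    }

    private static List<String> rank(Map<String, Double> ratios, int n,
                                     Comparator<Double> order) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(ratios.entrySet());
        entries.sort(Map.Entry.comparingByValue(order));
        List<String> genes = new ArrayList<>();
        // Keep at most n entries; fewer if the data set is smaller than n.
        for (Map.Entry<String, Double> e : entries.subList(0, Math.min(n, entries.size()))) {
            genes.add(e.getKey());
        }
        return genes;
    }
}
```

With n restricted to 10, 25, 50, 75 or 100, the two methods give the ordered lists shown in figure 5.16.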

5.8 Search genes

A module has been implemented for searching genes from thousands of rows of
gene expression information (figure 5.17). The search is done by the gene name and
returns as output a line plot showing the behavior of the gene expression for the different
experimental conditions. The search also returns the results of a t-test on the gene
expression profile, which consist of the confidence interval for a regular gene expression
value and also the up and down regulated gene expression values based on the confidence
interval.

Figure 5.17 - Search screen with the list of genes from the uploaded data file

The search is an intelligent one which does not require the user to scroll through
thousands of rows of gene names. Keying in the first few letters of the gene name will
highlight the gene in the displayed list. On submitting the search form by clicking on the
“View Graph” button, the user can view in a new window a line plot of the gene
expression values for all the experiments in the data file, and a color coded representation
of the results of a t-test on the gene’s expression profile. The resulting screen is shown in
figure 5.18.

Figure 5.18 - Output screen for a search done on gene information
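The type-ahead behavior of the search list can be illustrated with a simple prefix match. This is a sketch only: the class name GeneSearch is hypothetical, the case-insensitive comparison is an assumption, and in the tool itself the highlighting may well be handled by the browser rather than by server side code.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical type-ahead lookup over the list of gene names. */
final class GeneSearch {
    /** Returns the index of the first gene starting with the typed prefix, or -1. */
    static int firstMatch(List<String> geneNames, String typed) {
        String prefix = typed.toLowerCase(Locale.ROOT);
        for (int i = 0; i < geneNames.size(); i++) {
            if (geneNames.get(i).toLowerCase(Locale.ROOT).startsWith(prefix)) {
                return i;
            }
        }
        return -1;
    }
}
```

The returned index identifies the list entry to highlight; a result of -1 means that no gene name begins with the letters typed so far.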

5.9 Output results as files

The normalized data, the clustered groups of genes and the intensity ratios of all
the genes are available as text files, which can be saved onto the local system of the user
for further usage or reference. The scatter plots, MA plots, RI plots and the heat maps of
clustered genes can all be saved locally as JPEG image files.

The above discussion of the results and features of the microarray data analysis
tool shows that it is a tool which will be intuitive for all users who have a basic
understanding of normalization and data analysis. It will be a very handy tool for all
kinds of users, mainly biologists and software developers. Its simple user interface and
easily accessible results, supported by a help manual, will make the analysis of microarray
data a simpler and easier task.

In the next chapter, some of the potential improvements that can be added to the
tool are discussed, as well as the conclusions reached about optimal usage of the tool.

6. CONCLUSIONS

In this thesis, a platform-independent and versatile software tool for normalizing
and analyzing microarray data was developed. The software meets the requirements that
were originally set forth for a tool of this nature. The interface is user-friendly. The
system is available for access to multiple users simultaneously upon user authentication.
The web-based front end is accessible from any web browser and handles all user
interaction. The program with its current functionality can be used by biologists for data
analysis, but there are certain improvements still in progress.

The presented program handles a wide range of functions as listed below:

• Since normalization of data is an important concern, the software tool provides
different means of normalizing the data. The possibilities are total intensity
normalization, median normalization and lowess normalization.

• The effects of normalization can be observed by useful graphical plots like scatter
plots, MA plots and RI plots. All plots can be created before and after
normalization.

• The system provides the capability of creating a pipeline of processes to be
performed on the microarray data. The user can also subject the data to specific
individual analysis techniques in case he/she does not want to create a pipeline.

• To detect differentially expressed genes, this tool provides simple fold-change
detection and also statistical tests like the t-test. The detected target genes can be
printed to a file for further analysis.

• Clustering methods are widely used for analysis and are also useful for reducing
the amount of microarray data to a subset of genes, usually to those which are
most variable between different experimental samples. This has been achieved by
using the hierarchical clustering method.

• The system is expandable for further development.

6.1 Possible improvements

As with most software tools, this tool has a lot of scope for improvement.
Since new methods for normalization and analysis are continually being developed, it may be
useful to adapt the present system to these needs.

Possible improvements would be:

• Further normalization methods

• Additional diagnostic plots (QQ-plot, volcano plots)

• ANOVA

• User management

• More sophisticated methods for detection of differentially expressed genes

• Compatibility with multiple types of input data files

Many suggestions for improvement have been provided by potential users of the
tool. As far as issues are concerned, most of them have been solved and some are still in
progress.

In conclusion, this microarray data analysis tool developed using Java and R is a
community driven solution developed to help make the analysis of microarray data
simpler and more efficient. Additional functionality will be added with the continuing
development by other members of the Bioinformatics Research Group (BRG).

REFERENCES

[1] International Human Genome Sequencing Consortium, "Finishing the euchromatic sequence of the human genome," Nature, vol. 431, pp. 931-945, Oct. 2004.

[2] Tom A. van de Goor, "A History of DNA Microarrays," Advanstar Communications Inc., 2005.

[3] Peter Dalgaard, Introductory Statistics with R. Springer, 2002.

[4] Jeff Augen, "Bioinformatics and Transcription," in Bioinformatics in the Post-Genomic Era: Genome, Transcriptome, Proteome, and Information-Based Medicine. Addison Wesley Professional, 2004.

[5] F. Crick, "Central dogma of molecular biology," Nature, vol. 227, no. 5258, Aug. 1970.

[6] G. Ruvkun, "Molecular biology. Glimpses of a tiny RNA world," Science, vol. 294, no. 5543, pp. 797-799, Aug. 2001.

[7] Dan Cray, "Gene Detective," 2001.

[8] Tomi Pasanen, Janna Saarela, Ilana Saarikko, Teemu Toivanen, Martti Tolvanen, Mauno Vihinen, and Garry Wong, DNA Microarray Data Analysis. Picaset Oy, 2003.

[9] Dov Stekel, Microarray Bioinformatics. Cambridge University Press, 2003.

[10] Helen C. Causton, John Quackenbush, and Alvis Brazma, A beginner's guide. Microarray Gene expression data analysis. Blackwell Publishing, 2003.

[11] Gordon K. Smyth, Yee Hwa Yang, and Terry Speed, "Statistical issues in cDNA Microarray Data Analysis," Functional Genomics: Methods and Protocols, vol. 224, no. Methods in Molecular Biology, pp. 111-136, 2003.

[12] John Quackenbush, "Microarray data normalization and transformation," Nature Genetics Supplement, vol. 32, no. December 2002, pp. 496-501, 2002.

[13] Steen Knudsen, Guide to Analysis of DNA Microarray Data, Second ed. John Wiley & Sons, 2004.

[14] W.N. Venables, D.M. Smith, and R Development Core Team, An Introduction to R. Network Theory Ltd, 2004.

[15] Robert C. Gentleman, Wolfgang Huber, Vincent Carey, Rafael Irizarry, and Sandrine Dudoit, Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health Series). Springer, 2005.

[16] Sandrine Dudoit, Robert C. Gentleman, and John Quackenbush, "Open Source Software for the Analysis of Microarray Data," Biotechniques, vol. 34, no. March 2003, p. s45-s51, 2003.

[17] Xiaoqin Xia, Michael McClelland, and Yipeng Wang, "WebArray: an online platform for microarray data analysis," BMC Bioinformatics, vol. 6:306, Dec. 2005.

[18] Carlo Colantuoni, George Henry, Scott Zeger, and Jonathan Pevsner, "SNOMAD (Standardization and NOrmalization of MicroArray Data): web-accessible gene expression data analysis," Bioinformatics, vol. 18, no. 11, pp. 1540-1541, 2002.

[19] Simon Urbanek, "Rserve -- A Fast Way to Provide R Functionality to Applications," 2003.

[20] David Flanagan, Java in a Nutshell. O'Reilly & Associates, Inc., 1997.

[21] Bruce W. Perry, Java Servlet & JSP Cookbook. O'Reilly Media Inc, 2004.

[22] Julie C. Meloni, PHP, MySQL and Apache. Sams Publishing, 2003.

[23] Robert McMillan, "Loosen the reins, says Google CEO,", 11 ed 2005.

GLOSSARY

ANOVA (Analysis Of Variance) is a collection of statistical models and their associated procedures which compare means by splitting the overall observed variance into different parts.

cDNA (Complementary DNA) DNA synthesized from mRNA or DNA by reverse transcriptase, often synthesized from a cellular extract.

Chromosomes The DNA which carries genetic information in cells is normally packaged in the form of one or more large macromolecules called chromosomes. Most multicellular organisms have several chromosomes, which together comprise the genome. Sexually reproducing organisms have two copies of each chromosome, one from each parent.

CRAN (Comprehensive R Archive Network). A network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R.

DNA (Deoxyribonucleic Acid) is a nucleic acid, usually in the form of a double helix, that contains the genetic instructions specifying the biological development of all cellular forms of life, and most viruses.

Exons The coding regions of DNA.

Fold change The ratio of gene expression between two samples in a microarray experiment.

Gene Units of heredity in living organisms. They are encoded in the organism’s genetic material (usually DNA or RNA), and control the physical development and behavior of the organism.

Genome The whole hereditary information of an organism encoded in the DNA. It includes both genes and non-coding sequences.

Human Genome Project

A project initiated by the government of the United States for DNA sequencing of the human genome.

Hybridization The process of combining complementary, single stranded nucleic acids into a single molecule. Nucleotides will bind to their complement under normal conditions, so two perfectly complementary strands will bind to each other readily.

IDE (Integrated Development Environment). Environment used for developing code for any application.

Image map

A list of coordinates relating to a specific image, created in order to hyperlink areas of the image to various destinations (as opposed to a normal image link, in which the entire area of the image links to a single destination).

Introns Non-coding regions of DNA.

Java API (Java Application Programming Interface). Collection of ready-made software components that provide several capabilities.

JDBC (Java Database Connectivity). Set of communication protocols between a Java program and a database.

JPEG (Joint Photographic Experts Group) is a commonly used standard method of lossy compression for photographic images. The file extensions for this format are .jpeg or .jpg.

JSP (Java Server Pages). A Java programming technology which mixes static HTML with dynamically-generated HTML.

JVM (Java Virtual Machine). An interpreter and runtime system which lets Java programs run on any platform it has been ported to and where it has already been installed.

LAMP (Linux + Apache + MySQL + Python). A platform used by the microarray data analysis tool, WebArray.

LocusLink Single query interface to sequence and descriptive information about genetic loci.

Microarray A collection of microscopic DNA spots attached to a physical substrate such as glass, plastic or silicon chip forming an array. Microarrays are hybridized with labeled samples and then scanned and analyzed to generate data.

Microarray experiment

An experiment studying gene expression in a system under controlled conditions with respect to factors such as time, stimulus, developmental stage, or dosage on a sample.

MIDAS (Microarray Data Analysis System). Forms a part of the TM4 suite. A Java application which provides users an intuitive interface to design analysis processes combining one or more normalization and filtering steps.

miRNA (micro RNA) is a form of single-stranded RNA which is 20-25 nucleotides long and which regulates the expression of other genes.

mRNA (messenger RNA) is RNA that encodes and carries information from DNA during transcription to sites of protein synthesis to undergo translation in order to yield a gene product. The amount of any particular type of mRNA in a cell reflects the extent to which a gene has been “expressed”.

MySQL A very popular open source database server.

Normalization The process used to standardize microarray data by removing the effect of all sources of non-biological variation from the microarray data, making them comparable.

Oligo (Oligonucleotide) Short sequence of nucleotides (<80 base pairs), always single stranded, to be used as probes or spots.

Protein A complex, high-molecular-weight, organic compound that consists of amino acids joined by peptide bonds. Proteins perform a wide variety of structural and biological functions of all living cells and viruses.

PubMed A service of the U.S. National Library of Medicine that includes over 16 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s.

RNA (Ribonucleic acid) A class of nucleic acids that consist of nucleotides containing the bases adenine (A), guanine (G), cytosine (C), and uracil (U). An RNA molecule is typically single-stranded and can pair with DNA, another RNA molecule, or form secondary structure by hybridizing to itself.

Rserve Rserve is a TCP/IP server which allows other programs to use R facilities without the need to initialize or link against the R library.

Servlet Java program that runs on a web server and which is used to build web pages.

SQL (Structured Query Language). A standard computer language for accessing and manipulating databases.

TM4 A suite of analysis tools developed to handle all aspects of the microarray process. Includes four major applications: Microarray Data Manager (MADAM), TIGR_Spotfinder, Microarray Data Analysis System (MIDAS), and Multiexperiment Viewer (MeV).

Transcription The process in which transfer of genetic information from DNA into RNA takes place. It is the beginning of the process that ultimately leads to the translation of the genetic code into a protein.

Translation The second process of protein synthesis, in which mRNA is decoded to produce a specific protein according to the rules specified by the genetic code. Translation is preceded by transcription.

tRNA A small RNA chain (approximately 75 bp in length) that transfers a specific amino acid to a growing polypeptide chain at the ribosomal site of protein synthesis during translation.

CURRICULUM VITAE

VASUNDHARA AKKINENI

Date of Birth May 20, 1982

Place of Birth Madras, India

Undergraduate Study University of Madras, Madras, India

B.Tech. Information Technology

(1999-2003)

Graduate Study University of Louisville, Louisville, Kentucky

M.S. Computer Engineering and Computer Science

(2003-2006)

Experience IT Digitization Intern, GE Energy, Atlanta, GA

(Jan, 2005 - July, 2005)

QA Analyst Intern, Yum! Brands Inc, Louisville, KY

(June, 2004 - Dec, 2004)

Student Assistant, University of Louisville, Louisville, KY

(Aug, 2003 - May, 2004)
