
MICROARRAY DATABASE RESOURCE FOR DESIGNING CUSTOM MICROARRAYS

By

Ravi Shrikanth Gundlapalli M.S. University of Louisville, 2005

A Thesis Submitted to the Faculty of the

Graduate School of the University of Louisville in Partial Fulfillment of the Requirements

for the Degree of

MASTER OF SCIENCE

Department of Computer Engineering and Computer Science University of Louisville

Louisville, Kentucky

December 2005

DEDICATION

This thesis is dedicated to my parents

Mr. Subbarayudu Gundlapalli

and

Ms. Vijaya Kumari Gundlapalli

whose love and support made this possible for me.

ACKNOWLEDGEMENTS

I owe this work to many people. I would like to thank my mother, who has been my inspiration all through my education. Many thanks are due to Dr. Eric C. Rouchka for the instruction and guidance provided during this long process. I would also like to thank the members of my thesis committee, Dr. Nigel G. F. Cooper and Dr. Ahmed Desoky. Additional thanks go out to Tim Hardin, Elizabeth Cha, Vasundhara Akeneni and other members of the Bioinformatics Research Group. The work for this thesis was performed in the University of Louisville’s Bioinformatics Laboratory. Support for this work, as well as equipment for the Bioinformatics Laboratory, is provided through the Kentucky Biomedical Research Infrastructure Network (NIH-NCRR grant # P20 RR16481; Nigel Cooper, PI).

ABSTRACT

MICROARRAY DATABASE RESOURCE FOR DESIGNING CUSTOM

MICROARRAYS

RAVI SHRIKANTH GUNDLAPALLI

NOVEMBER 29, 2005

Global gene expression monitoring with microarrays provides a vast amount of biological information. Computers play a central role in many aspects of microarray analysis, including the layout design of customized arrays that printing robots can use to create microarrays for a myriad of different systems. We have developed an integrated software package called “MIDAR” that provides optimized design solutions for customized microarrays based on user-specified inputs. These include the genes and gene groups, the number of replicates and the type of chip design (oligo or cDNA). Users can additionally associate experimental data with stored chip designs. This readily provides researchers with gene expression levels and an interactive image map that allows the user to view the oligo/primer details and list specific data for each spot. MIDAR incorporates the Gene Ontology (GO) database to allow selection of genes based on a common characteristic such as biological process, molecular function, or cellular component. MIDAR is a fully featured, community-driven solution to the challenge of microarray design, data management and analysis. The main contribution of this thesis is the design of solutions for custom microarrays and the incorporation of visual analysis approaches.

TABLE OF CONTENTS

DEDICATION …………………………………………………………………………iii

ACKNOWLEDGEMENTS…………………………………………………………….iv

ABSTRACT...…………………………………………………………………………..v

LIST OF FIGURES……………………………………………………………………..vii

NOMENCLATURE………………………………………………………………….…viii

1. INTRODUCTION….…………………………………………………………....1

2. LITERATURE REVIEW…...…………………………………………………...5

2.1 GENES…………………………………………………………………..5

2.2 GENETIC CODE………………………………………………………..10

2.3 CENTRAL DOGMA OF MOLECULAR BIOLOGY.…….…………...11

2.4 THE HUMAN GENOME PROJECT……………..…………………….13

2.5 GENE DISCOVERY THROUGH EXPRESSED SEQUENCE TAGS...15

2.6 dbEST : A DESCRIPTIVE CATALOG OF ESTs...…………………….18

2.7 GENE EXPRESSION PROFILING……………………………………..18

2.8 HYBRIDIZATION AND GENE EXPRESSION………..……………...20

3. MICROARRAYS..………………………………………………………………21

3.1 COMMERCIAL ARRAYS...……………………………………………25

3.2 EXPERIMENTAL DESIGN…………………………………………….29

3.3 APPLICATION OF MICROARRAYS…………………………………35

3.4 AVAILABLE MICROARRAY TOOLS...……………………………...35

4. MIDAR – DESIGN ASPECTS………………………………………………….38

5. MIDAR – A SCENARIO…..……………………………………………………51

6. FUTURE – MICROARRAYS AND MIDAR…..………………………………56

REFERENCES………………………………………………………………………58

CURRICULUM VITAE..…………………………………………………………….60

LIST OF FIGURES

Figure 1.1 - Overview of Microarray Analysis….……………………………………….2

Figure 2.1 - Structure of a Gene………………………………………………..………...6

Figure 2.2 - Exons and Introns in DNA Sequence……………………………………….7

Figure 2.3 - Spiral Structure of DNA…………………………………………………….8

Figure 2.4 - mRNA Processing…………………………………………………………..9

Figure 2.5 - Central Dogma of Molecular Biology………………………………………12

Figure 2.6 - Overview of Protein Synthesis……………………………...………………16

Figure 2.7 - Overview of how ESTs are Generated……………...………………………17

Figure 3.1 - Image of a Portion of a Microarray that has been Hybridized and Scanned..22

Figure 3.2 - Combinatorial Chemistry Approaches to Microarray Manufacturing…...…24

Figure 3.3 - Distribution of Types of Microarrays…………………………….………...27

Figure 3.4 - Experimental Approach for Microarray Analysis………….……………….30

Figure 3.5 - Five Steps of Microarray Analysis……………………………….………....32

Figure 3.6 - Hybridizing the Two Samples………………………….…………………...34

Figure 3.7 - Scanning the Microarray.…………………………………………………...34

Figure 4.1 - Database Design……………….……………………………………………42

Figure 4.2 – Data Header for Uploading Gene Information……………………………..43

Figure 4.3 – Explanation of Code………………………………………………………..45

Figure 5.1 - Screenshot Depicting User Input for Chip Design…….……………………51

Figure 5.2 - Screenshot Depicting the Custom Array Design and the Primer/Oligo

Information of a Specific Spot on the Array……….…………………………………….52

Figure 5.3 - Screenshot Depicting Marked Spots Associated with a Gene of Interest.….53

Figure 5.4 - Screenshot Depicting the Activated Custom Array.………………………..54

Figure 5.5 - Screenshot Depicting a Comparative Analysis of a Single Custom Array

from Two Different Experiments..……..………………………………………………...55

NOMENCLATURE

Affymetrix: A company that revolutionized the manufacturing process for microarrays. It uses semiconductor fabrication techniques to produce high-density arrays called GeneChips.

Agilent: Another popular manufacturer of microarrays, with research areas in MEMS, nanotechnology and life sciences.

Amino acids: The building blocks of proteins.

BASE: Abbreviation for BioArray Software Environment, a free web-based solution for analyzing the large amounts of data obtained from microarrays.

cDNA: Abbreviation for complementary DNA, a stable DNA copy of mRNA that contains only the coding regions of the gene. Its sequence is the reverse complement of the mRNA.

Codon: A single unit of the genetic code, made up of three (triplet) nucleotide bases in a DNA or RNA molecule, that specifies a single amino acid.

DNA: The molecule that encodes genetic information. DNA is a double-stranded molecule made of two twisting, paired strands held together by weak bonds between base pairs of nucleotides.

EST: Abbreviation for Expressed Sequence Tag, a short sequence obtained from either end of a cDNA clone.

Exon: The coding regions of DNA.

Gene: The fundamental physical and functional unit of heredity. A gene is an

ordered sequence of nucleotides located in a particular position within the

genome that encodes a specific functional product (i.e., a protein or RNA

molecule).

Genetic Code: The sequence of nucleotides, coded in triplets (codons) along the mRNA, that determines the sequence of amino acids in protein synthesis. A gene’s DNA sequence can be used to predict the mRNA sequence, and the genetic code can in turn be used to predict the amino acid sequence.

Genome: All the genetic material of a particular organism. Its size is generally given as its total number of base pairs or as its total number of genes.

Genomic Era: The new era in genetic research, featuring rapid acquisition and integration of increasingly advanced genetic information resulting from the progress and completion of the Human Genome Project.

Human Genome Project: A project initiated by the government of the United States to sequence the DNA of the human genome.

Intron: The non-coding regions of DNA.

LAD: Abbreviation for Longhorn Array Database, an open-source, MIAME-compliant version of the Stanford Microarray Database (SMD).

MADAM: Abbreviation for MicroArray Data Management, a software package available from The Institute for Genomic Research (TIGR). It helps in tracking experimental parameters and results associated with microarrays.

Microarray: An ordered array of microscopic elements on a planar substrate that allows the specific binding of genes or gene products.

mRNA: A molecule that can move from the nucleus to the cytoplasm of cells that

serves as the crucial connecting message between information contained

in the gene and protein synthesis. The structure of RNA is similar to that

of DNA. The mRNA molecule serves as a template for the specific amino

acid sequence of a protein.

Nucleotide: The basic subunits of DNA or RNA. Thousands of nucleotides are linked

to form a DNA or RNA molecule. The four nucleotides in DNA contain

the bases adenine (A), guanine (G), cytosine (C), and thymine (T). In

nature, base pairs form only between A and T and between G and C; thus

the base sequence of each single strand can be deduced from that of its

partner.

Oligonucleotide: Small single-stranded segments of DNA, typically 20-30 nucleotide bases in size, that are synthesized in vitro.

PCR: Abbreviation for Polymerase Chain Reaction, an in vitro technique in which a DNA polymerase enzyme repeatedly copies a target DNA segment, amplifying it through cycles of strand separation, primer binding and extension.

Perl: A programming language created by Larry Wall that borrows heavily from C, sed, awk, shell scripting and others, and that is gaining immense popularity among open-source developers.

PHP: Another widely used open-source programming language, primarily for server-side programming and developing dynamic web pages.

Photolithography: A process used in semiconductor device fabrication to transfer a pattern from a photomask to the surface of a substrate.

Probe: Labeled molecule in solution that reacts with a complementary target molecule on the substrate.

Protein: A large molecule composed of one or more chains of amino acids in a

specific order; the order is determined by the base sequence of nucleotides

in the gene that codes for the protein. Proteins are required for the

structure, function and regulation of the body’s cells, tissues and organs,

and each protein has unique functions. Examples are hormones, enzymes

and antibodies.

Ribosome: A cytoplasmic organelle that serves as the molecular machine on which

polypeptide synthesis from mRNA occurs.

Sequencing: Determination of the order of nucleotides (base sequences) in a DNA or

RNA molecule.

SMD: Abbreviation for Stanford Microarray Database, a database used to publish microarray data and to provide public access to it.

SNP: Abbreviation for single nucleotide polymorphism, a common sequence variant containing a one-base-pair change relative to the normal gene.

TM4: A suite of software available from The Institute for Genomic Research (TIGR), developed to address all data associated with microarray design and analysis.

Transcription: The synthesis of an mRNA copy from a sequence of DNA (a gene), the

first step in gene expression.

Translation: The process in which the genetic code carried by mRNA directs the

synthesis of proteins from amino acids.

tRNA: A class of RNA that recognizes the triplet nucleotide coding sequences of mRNA

and carries the appropriate amino acid to the ribosomes, where proteins are assembled

according to the genetic code carried by mRNA.

1. INTRODUCTION

Imagine trying to solve a jigsaw puzzle without knowing if you have all of the

pieces. This is the dilemma faced by scientists in the field of molecular medicine when

attempting to understand how human genes and their protein products interact with one

another to lead to normal biological functions, how these functions can break down in

various disease states, and how normal functions can be restored through molecular

intervention. My dilemma could partly be attributed to my background being computer

engineering and not biology. The following material is a basic overview of genomics and

is meant to define the boundaries of a puzzle whose solution is going to be my thesis

work and is within my grasp, though not in my hands.

With the completion of the Human Genome Project [1] and the cataloguing of thousands of genes (estimates ranging from approximately 64,000 to 80,000 genes have been advanced [2]), the biggest challenge for scientists now is to understand how these genes work: what regulates their activity and how they interact with each other and the environment. In other words, huge amounts of data relevant to the puzzle are available, but solving the puzzle remains a bioinformatics challenge.

The challenge of determining the function of genes was taken up, and with it came the capability to analyze the expression levels of thousands of genes simultaneously, quickly and efficiently, in a single experiment. Thousands of genes and their products in a given living organism function in a complicated and orchestrated way that creates the mystery of life. However, the traditional methods in molecular biology generally work on a one-gene-per-experiment basis [3], which means that throughput is very limited and the whole picture of gene function is hard to obtain. Microarray technology can monitor the whole genome on a single chip, so that researchers can obtain a better picture of the interactions among thousands of genes simultaneously.

Microarray technology is widely considered the best available method for examining global aspects of genomic data and gene expression profiling. Microarray data have added another dimension to bioinformatics, with nearly unlimited possibilities for comparing expression changes under various conditions.

Source: - bldg6.arsusda.gov/benlab/ microarrays002.jpg

Figure 1.1 – Overview of Microarray Analysis

Microarray analysis (Fig. 1.1) involves labeled DNA from two different samples, one a normal sample and the other a test sample. The two samples are combined on a slide, typically the size of a microscope slide, where they bind (hybridize) to the complementary sequences spotted on it. The hybridized slide is then scanned, and the resulting information is fed to a computer for further analysis.
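As a rough illustration of the computer analysis step, the scanned intensities of the test and normal samples at each spot are commonly compared as a log2 ratio. The sketch below assumes this standard two-channel comparison; the function name, gene names and intensity values are hypothetical and are not taken from MIDAR.

```python
import math


def log2_ratio(test_intensity, reference_intensity):
    """Return log2(test / reference) for one spot.

    A positive value means higher signal in the test sample,
    a negative value means higher signal in the normal sample.
    """
    if test_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(test_intensity / reference_intensity)


# Hypothetical scanned intensities: gene -> (test, normal).
spots = {
    "geneA": (1200.0, 300.0),
    "geneB": (500.0, 500.0),
    "geneC": (150.0, 600.0),
}

ratios = {gene: log2_ratio(test, ref) for gene, (test, ref) in spots.items()}

for gene, ratio in ratios.items():
    print(f"{gene}: log2 ratio = {ratio:+.2f}")
```

A ratio near zero suggests similar expression in both samples; in practice the raw intensities would first be background-corrected and normalized, which this sketch omits.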

Computers play a central role in many aspects of microarray analysis, including

design informatics. Microarray printing robots can be directed through software to

manufacture microarrays with myriad physical characteristics, and this design flexibility

is critical for the study of different methods, and the development of new assays. Though

this software is most pertinent to researchers who make their own arrays, those who

purchase off-the-shelf microarrays are also served well by understanding design

informatics. The location of the microarray on the printing substrate, the number of

features and genes, the number of pins and samples, and content maps are among the

topics addressed in this software.

A critical aspect of microarray production is a design that optimizes space to produce high-density arrays for a given set of gene samples and a given number of replicates. The software supplied with robotic spotters translates user input parameters into a set of instructions in the robot's language for building arrays. These software packages do not offer design capabilities in which spotting parameters and grid configurations can be chosen for a given set of samples and replicates. At present, such solutions have to be derived manually in most academic laboratories.
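The design problem described above can be sketched as a small search over grid configurations. This is a minimal illustration, not the MIDAR algorithm: the function name, the row and column limits, and the ranking rule (fewest unused positions, then the most nearly square grid) are assumptions made for the example.

```python
def grid_options(num_genes, replicates, max_rows, max_cols):
    """List (rows, cols, unused) grids that can hold every spot.

    Each gene is spotted `replicates` times. Grids are ranked by the
    number of unused positions, then by how close they are to square.
    """
    spots = num_genes * replicates
    options = []
    for rows in range(1, max_rows + 1):
        cols = -(-spots // rows)  # ceiling division: columns needed for this row count
        if cols <= max_cols:
            options.append((rows, cols, rows * cols - spots))
    return sorted(options, key=lambda o: (o[2], abs(o[0] - o[1])))


# Hypothetical request: 96 genes, 3 replicates each, on a 20 x 20 spot area.
options = grid_options(num_genes=96, replicates=3, max_rows=20, max_cols=20)
for rows, cols, unused in options[:3]:
    print(f"{rows} rows x {cols} cols, {unused} unused positions")
```

A real design tool would also have to account for the number of pins, the spot spacing and the placement of replicates, none of which is modeled here.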

User-friendly software that can be used by experts and novices alike would simplify and speed up the design of microarrays. At the University of Louisville, the Speed School of Engineering and the School of Medicine collaborate to address various bioinformatics issues through the Bioinformatics Research Group (BRG) [kbrin.a-bldg.louisville.edu/brg/]. BRG initiated a multifaceted project, the Microarray Database Resource (MIDAR), in which researchers and students of the group investigate efficient ways to design, analyze and store biological data resulting from experiments conducted on microarrays.

While the MIDAR system will include the ability to design, store and analyze

microarray experiments from various custom and commercial packages, this thesis work

will concentrate on developing optimized design techniques for microarray chips

followed by their simulation with the experimental values.

2. LITERATURE REVIEW

The genetic blueprint is carried in the genome, an improbable assembly of DNA

bases, genes, and chromosomes. Cells pass an exact copy of the genome to other cells

during cell division, and the blueprint is inherited during reproduction. Human genomes

are structurally complex, and minor changes in the sequence can produce disease. Each

cell in the human body contains the same genomic sequence as every other cell, but gene

expression varies greatly from cell to cell. Transcription and translation convert gene

information into proteins, and DNA replication synthesizes exact copies of the genomic

sequence. DNA sequencing technology [1;4] has recently afforded the sequence of the

entire human genome, as well as sequences of dozens of other organisms, including

viruses, bacteria, yeast, worms, insects, plants and rodents. This chapter provides a

survey of genes and genomes, and offers a glimpse into the exciting and fast-moving field of

genomics.

2.1 GENES

All humans are basically the same, yet we are unique, with different traits that allow us to stand out as individuals. Some are tall, some are short, some are fair, and some are dark. These physical similarities and differences are due to similarities and differences in our genetic instructions. Our own set of genetic instructions, our genes, inherited from our parents, determines our particular traits.

Genes are instructions inside you that tell your body what to look like and how to work. There are genes that tell your hair to be curly or straight, genes that tell your body to grow tall (or not so tall!), genes that tell your stomach how to digest food, and genes that account for every little detail of your body! Of course, your body is also affected by the things you do and the things that go on around you. If you dye your hair, you will look different. If you do not eat a healthy diet, you might not grow as tall. We call the things outside your body that can affect it ‘environmental influences’ [5].

Source: www.biotec.or.th/Genome/ whatGenome.html

Figure 2.1 – Structure of a Gene

Technically, genes are continuous segments of genomic DNA constructed from four nucleotide building blocks (Fig. 2.1). Each gene encodes a specific mRNA and protein, the latter of which imparts biological function in the cell. Genes in higher eukaryotes, such as humans, contain exons and introns [6]. An exon is a gene segment that is copied into mRNA and maintained after mRNA processing, and an intron is a gene segment that is copied into mRNA but removed from the mature mRNA before protein synthesis (Fig. 2.2). Genes in lower eukaryotes, such as yeast, are essentially devoid of introns, and bacteria do not contain any introns at all. The presence of introns in complex organisms, but not in simple systems, suggests an evolutionary role for these noncoding gene sequences, potentially in alternative splicing.

Figure 2.2 – Exons and Introns in DNA Sequence

Genes are composed of double-stranded DNA, and gene size is measured in a unit

known as the base pair, corresponding to one pair of nucleotides in double-stranded DNA. The

genetic blueprint of virtually every organism in the biosphere is stored in the

biopolymeric molecule known as deoxyribonucleic acid (DNA). DNA is composed of a

long string of nucleotides, each of which contains one of four bases (A, G, C or T), a

deoxyribose sugar, and a phosphate group [7]. Nucleotides are joined together in a

covalent manner in the cell to build linear DNA sequences. A typical human gene and

chromosome contain approximately 20,000 and 100,000,000 nucleotides [2] respectively.

Different nucleotides can be strung together to form a polynucleotide. However, the ends

of the polynucleotide are different, meaning that each polynucleotide sequence will have

directionality. The ends of the polynucleotide are marked either 3’ or 5’. The general

convention is to label the coding strand from 5’ to 3’ (left to right).

DNA can be either single-stranded or double-stranded. DNA chains that bond to each other through A-T and G-C interactions are known as complementary strands. The chemical process by which complementary strands bind into a double-stranded molecule is known as hybridization. Complementary strands of DNA form a spiral molecule, the DNA double helix (Fig. 2.3), in which the two interwoven chains coil around a central axis like a spiral staircase and form a stable structure. For example, the single strand 5’ GTAAAGTCCCGTTAGC 3’ pairs with its complementary strand as follows:

5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’

| | | | | | | | | | | | | | | |

3’ C←A←T←T←T←C←A←G←G←G←C←A←A←T←C←G 5’
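The pairing shown above follows mechanically from the A-T and G-C rules, so it can be reproduced with a few lines of code. The sketch below is only an illustration; the function names are chosen for this example.

```python
# Watson-Crick pairing rules for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Base-by-base complement, written in the same left-to-right order as the input.

    For an input written 5' to 3', the result therefore reads 3' to 5'.
    """
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Complementary strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


top_strand = "GTAAAGTCCCGTTAGC"                       # written 5' to 3'
bottom_3_to_5 = complement(top_strand)              # aligned under the top strand
bottom_5_to_3 = reverse_complement(top_strand)      # conventional 5' to 3' notation

print("5'", top_strand, "3'")
print("3'", bottom_3_to_5, "5'")
```

Because of the directionality of each strand, sequence databases usually store the second strand as the reverse complement rather than as the aligned complement.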

Source: www.genecrc.org/site/ lc/lc2b.htm

Figure 2.3 - Spiral Structure of DNA

The DNA encodes the genetic blueprint of an organism, and the blueprint is

converted into protein information using ribonucleic acid (RNA) as a molecular

intermediary. Certain classes of RNA also play a role in protein synthesis. RNA is similar

to DNA, but possesses some novel structural, physical and functional properties.

Ribonucleotides are identical to deoxyribonucleotides, except that the RNA building

blocks contain uracil (U) instead of thymine (T) and ribose instead of deoxyribose as the

sugar. RNA chains are generally much shorter than DNA chains; most RNA molecules

contain 70-10,000 ribonucleotides [8]. Unlike DNA, which forms a double stranded

double helix, nearly all RNA molecules are single stranded. RNA is important in the cell

and contributes in a variety of ways. Two of the major RNA molecules involved in

protein synthesis are messenger RNA (mRNA) and transfer RNA (tRNA).

mRNA encodes the genetic information as copied from the DNA molecules. Transcription is the process in which DNA is copied into an RNA molecule. The resulting linear molecule is an mRNA transcript (Fig. 2.4). In eukaryotic cells, the mRNA needs to be modified before it can be translated into a protein. Most eukaryotic genes are organized in pieces, in which coding regions, called exons, are interspersed with noncoding regions, called introns. One of the steps in processing the mRNA is to remove the intronic regions and to splice together the coding, or exonic, regions. The processed mRNA can then be transported out of the nucleus and translated into a protein sequence.
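Transcription and splicing as described above can be mimicked on strings. This is a simplified sketch under stated assumptions: the gene sequence and the exon coordinates are invented for the example, and the transcript is modeled as the coding strand with U in place of T.

```python
def transcribe(coding_strand):
    """Model the primary transcript as the coding strand with U replacing T."""
    return coding_strand.replace("T", "U")


def splice(pre_mrna, exons):
    """Join the exon segments of a transcript.

    `exons` is a list of (start, end) index pairs, end exclusive,
    given in their order along the transcript.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical gene: two exons separated by one intron.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTCAG", "GAATAA"
gene = exon1 + intron + exon2
exon_positions = [
    (0, len(exon1)),
    (len(exon1) + len(intron), len(gene)),
]

pre_mrna = transcribe(gene)                       # still contains the intron
mature_mrna = splice(pre_mrna, exon_positions)    # intron removed

print("pre-mRNA:   ", pre_mrna)
print("mature mRNA:", mature_mrna)
```

Real mRNA processing also adds a 5' cap and a poly-A tail, which are left out of this sketch.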

Source: http://departments.oxy.edu/biology/Stillman/bi221/111300/processing_of_hnrnas.htm

Figure 2.4 - mRNA Processing

tRNA molecules fold into a well-defined three-dimensional structure, which is critical in the creation of proteins. Attached to each tRNA molecule is an amino acid (discussed below). The amino acid to be attached is determined by a three-base sequence called the anticodon, which is complementary to a codon in the mRNA. Translation is the process in which the nucleotide base sequence of the processed mRNA is used to order and join the amino acids into a protein with the help of ribosomes and tRNA.

2.2 GENETIC CODE

The fundamental unit of genetic information is the codon. A codon, or triplet, contains three successive nucleotides that are read by the cellular machinery in the 5’ to 3’ direction to specify an amino acid. There are 64 possible combinations of the four nucleotides, and all 64 codons are used in the cell. The cellular “conversion table” between codon and amino acid is known as the genetic code (Table 2.1). Of the possible combinations, 61 codons specify the 20 amino acids, and the remaining 3 are stop codons, which signal termination of protein synthesis.

Thus genetic information flows from DNA, which directs the synthesis of RNA, to RNA, which in turn directs the synthesis of protein. This flow of genetic information from nucleic acids to protein has been called the Central Dogma of Molecular Biology [9] (Fig. 2.5).

First position (5’)   Second position (middle)      Third position (3’)
                      A      G      C      T

A                     lys    arg    thr    ile      A
A                     lys    arg    thr    met      G
A                     asn    ser    thr    ile      C
A                     asn    ser    thr    ile      T

G                     glu    gly    ala    val      A
G                     glu    gly    ala    val      G
G                     asp    gly    ala    val      C
G                     asp    gly    ala    val      T

C                     gln    arg    pro    leu      A
C                     gln    arg    pro    leu      G
C                     his    arg    pro    leu      C
C                     his    arg    pro    leu      T

T                     stop   stop   ser    leu      A
T                     stop   trp    ser    leu      G
T                     tyr    cys    ser    phe      C
T                     tyr    cys    ser    phe      T

Table 2.1 – The Genetic Code
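The genetic code is in effect a lookup table from codons to amino acids, which makes it easy to express in code. The sketch below builds the standard code for the DNA alphabet, using one-letter amino acid symbols and "*" for stop; the variable and function names and the example sequence are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, listed with each codon position cycling through T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = GENETIC_CODE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


stop_codons = sorted(c for c, aa in GENETIC_CODE.items() if aa == "*")
print(len(GENETIC_CODE), "codons,", len(stop_codons), "stop codons:", stop_codons)
print(translate("ATGAAATGGTAA"))  # hypothetical coding sequence
```

Counting the entries confirms the figures quoted in the text: 4 x 4 x 4 = 64 codons, of which 61 specify the 20 amino acids and 3 are stop codons.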

2.3 CENTRAL DOGMA OF MOLECULAR BIOLOGY

Returning to genes, where we started our discussion, a gene can be thought of as the DNA sequence necessary for the synthesis of a functional protein. A genome is a living organism’s entire DNA. It is the complete set of genetic instructions for building, running and maintaining that organism. Every species has its own genome. Simple organisms such as bacteria have a simple genome, whereas a complex organism such as a human has a relatively large genome with about 30,000 genes [2]. Interestingly, in any two humans, 99.9% of the DNA is identical [2].

DNA → RNA → PROTEIN

Source: http://www.people.virginia.edu/~rjh9u/dnaprot.html

Figure 2.5 - Central Dogma of Molecular Biology

The entire set of genetic instructions is so large that the 0.1% variation allows for millions of possible differences among individuals. These tiny fractions of DNA where variation occurs lead to the enormous diversity that makes each of us unique. Yet the same variation that causes the differences in our appearance also leads to differences in our likelihood of getting any particular disease. Knowledge about the effects of DNA variation between individuals can lead to a better understanding of disease and to advances in medicine.
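The "millions of differences" figure can be checked with simple arithmetic. The sketch below assumes a human genome of roughly 3 billion base pairs, an approximate value that is not stated in the text above; the variable names are illustrative only.

```python
# Assumption: approximate size of the human genome in base pairs.
GENOME_SIZE_BP = 3_000_000_000

# From the text: any two humans share 99.9% of their DNA.
IDENTITY_PERCENT = 99.9
variation_fraction = (100 - IDENTITY_PERCENT) / 100

# Expected number of positions that differ between two individuals.
differing_positions = round(GENOME_SIZE_BP * variation_fraction)

print(f"About {differing_positions:,} differing base pairs")
```

Under this assumption a 0.1% difference corresponds to about three million base pairs, which is consistent with the statement that the variation allows for millions of differences.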

The cells of our bodies are made of different kinds of molecules, such as water, minerals, proteins, sugars, fats and DNA. Of these, proteins are particularly important because they are the fundamental components of the body that determine how all of the molecules are organized and how they act. Thus, proteins play a key role in the way we look and in the way we grow. DNA acts as a molecular code for making these proteins. The DNA in each gene provides the instructions for making one protein or, sometimes, a few related proteins. However, only about 1/60th of the entire genome directly codes, or provides the instructions, for making proteins [5]. The rest of the DNA in our genomes helps direct when and where in the body each gene should be used. Taken together, all of the DNA of the genome can be thought of as a blueprint for a human being.

In the past, doctors and scientists did not have the benefit of a human genetic blueprint to help them better understand sickness and develop appropriate treatments. The situation resembles the repair of a house: when a house needs repair or maintenance, mechanics and engineers can consult the blueprint when analyzing a problem and so avoid unnecessary work or, more importantly, avoid worsening the problem. Similarly, the blueprint of the human body provided through the Human Genome Project will help analyze problems when something goes wrong with a person. In the future, when a doctor is treating someone who is sick, he or she will be able to consult the patient’s genetic blueprint to determine which variations of genes that patient has and prescribe the treatment most likely to be effective for that individual. This will also help doctors avoid prescribing a drug that could cause a serious side effect.

2.4 THE HUMAN GENOME PROJECT

Recognizing the potential of the human genome, the United States started the Human Genome Project in 1990 with the ambitious goal of sequencing the human genome within fifteen years [4]. The project’s new research strategies and experimental technologies have generated a steady stream of ever-larger and more complex genomic data sets that have poured into public databases and have transformed the study of virtually all life processes. The genomic approach of technology development and large-scale generation of community resource data sets has introduced an important new dimension into biological and biomedical research. Interwoven advances in genetics, comparative genomics, high-throughput biochemistry and bioinformatics are providing biologists with a markedly improved repertoire of research tools that will allow the functioning of organisms in health and disease to be analyzed and comprehended at an unprecedented level of molecular detail.

A key observation bearing on the project is that only about 2% of the total bases make up the protein-coding portions of our genes; the remaining 98% is of unknown function and is often referred to as junk DNA [10]. Thus, sequencing the whole genome may not be the most efficient way to generate a catalog of human genes. As Brenner put it, “If something like 98% of the genome is junk, then the best strategy would be to find the important 2% and sequence it first” [10].

A number of investigators advocated large-scale sequencing of the transcription products of genes, in the form of complementary DNA (cDNA) clones, as a prelude to sequencing the entire human genome. The era of high-throughput cDNA sequencing was initiated in 1991 by a landmark study from Venter and colleagues [11]. The basic strategy involves selecting cDNA clones at random and performing a single automated sequencing read from one or both ends of their inserts. They introduced the term “Expressed Sequence Tag” (EST) to refer to this new class of sequence, which is characterized by being short (typically around 400 bases) and relatively inaccurate (around 2% error).

2.5 GENE DISCOVERY THROUGH EXPRESSED SEQUENCE TAGS

ESTs provide researchers with a quick and inexpensive route for discovering new genes, for obtaining data on gene expression and regulation, and for constructing genome maps. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these tags to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge associated with identifying genes from genomic sequences varies among organisms and depends upon genome size as well as the presence or absence of introns, the intervening DNA sequences that interrupt the protein-coding sequence of a gene. Most of the human genome is composed of introns interspersed with relatively few DNA coding sequences, or genes, which makes gene identification difficult in humans.

Genes are expressed as proteins, a complex process composed of two main steps.

As seen in the central dogma of molecular biology, each gene (DNA) must be converted,

or transcribed, into messenger RNA (mRNA), RNA that serves as a template for protein

synthesis. The resulting mRNA then guides the synthesis of a protein through a process

called translation. Interestingly, mRNAs in a cell do not contain sequences from the

regions between genes, nor from the non-coding introns that are present within many

genes. Therefore, isolating mRNA is key to finding expressed genes in the vast expanse

of the human genome (Fig. 2.6).

Source: http://www.ncbi.nlm.nih.gov/About/primer/est.html

Figure 2.6 - Overview of Protein Synthesis
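As a rough illustration of the two steps described above, the following Python sketch splices the introns out of a toy gene, transcribes it to mRNA, and translates the result. The gene sequence, the exon coordinates, and the function names are invented for illustration; only the codon table (the standard genetic code) is not an assumption.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe_and_splice(gene, exons):
    """Return the mature mRNA for a gene given as its coding-strand DNA.

    exons is a list of (start, end) half-open coordinates; everything
    outside those ranges (the introns) is dropped, then T is written as U.
    """
    mature = "".join(gene[start:end] for start, end in exons)
    return mature.replace("T", "U")


def translate(mrna):
    """Translate an mRNA from its first base until a stop codon."""
    dna = mrna.replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical gene: two exons separated by a 12-base intron.
toy_gene = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA"
toy_mrna = transcribe_and_splice(toy_gene, [(0, 6), (18, 24)])
print(toy_mrna, translate(toy_mrna))
```

The mRNA no longer contains the intron, which is why isolating mRNA gives direct access to the expressed sequence.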

The problem, however, is that mRNA is very unstable outside of a cell; therefore,

scientists use special enzymes to convert it to complementary DNA (cDNA) [11]. cDNA

is a much more stable compound and, importantly, because it was generated from a

mRNA in which the introns have been removed, cDNA represents only expressed DNA

sequence. Once cDNA representing an expressed gene has been isolated, scientists can

then sequence a few hundred nucleotides from either end of the molecule to create two

different kinds of ESTs. Sequencing only the beginning portion of the cDNA produces

what is called a 5' EST (Fig 2.7). A 5' EST is obtained from the portion of a transcript

that usually codes for a protein. These regions tend to be conserved across species and do

not change much within a gene family. Sequencing the ending portion of the cDNA

molecule produces what is called a 3' EST (Fig 2.7). Because these ESTs are generated

from the 3' end of a transcript, they are likely to fall within non-coding or untranslated

regions (UTRs), and therefore tend to exhibit less cross-species conservation than do

coding sequences [11].

Source: http://www.ncbi.nlm.nih.gov/About/primer/est.html

Figure 2.7 - Overview of how ESTs are generated
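The generation of 5' and 3' ESTs from a cDNA can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the cDNA is represented in the same orientation as the mRNA, each read is a fixed number of bases, sequencing errors are ignored, and the 3' read is reported as the reverse complement because it is read from the opposite strand. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def mrna_to_cdna(mrna):
    """Model reverse transcription: the cDNA written in mRNA orientation."""
    return mrna.upper().replace("U", "T")


def est_pair(cdna, read_len=400):
    """Return (5' EST, 3' EST) as single reads from each end of the cDNA."""
    five_prime = cdna[:read_len]
    # Assumption: the 3' read comes off the opposite strand.
    three_prime = reverse_complement(cdna[-read_len:])
    return five_prime, three_prime


cdna = mrna_to_cdna("AUGGCCAAAUGA")
print(est_pair(cdna, read_len=4))
```

With a realistic read length of a few hundred bases, a long transcript yields two non-overlapping tags, one mostly coding and one mostly untranslated.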

Since ESTs represent a copy of just the part of a genome that is expressed,

they have proven themselves again and again as powerful tools in the hunt for genes

involved in hereditary diseases. ESTs also have a number of practical advantages in that

their sequences can be generated rapidly and inexpensively, only one sequencing

experiment is needed for each cDNA generated, and they do not have to be checked for

sequencing errors because mistakes do not prevent identification of the gene from which

the EST was derived.
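The claim that sequencing mistakes do not prevent gene identification can be illustrated with a small sketch. It assumes a toy catalog of two invented gene sequences and an EST carrying a 2% substitution error rate; real identification would use an alignment tool such as BLAST rather than this position-by-position comparison.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter of two sequences."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


def best_match(est, genes):
    """Return the name of the gene whose sequence is most similar to the EST."""
    return max(genes, key=lambda name: identity(est, genes[name]))


# Hypothetical 100-base gene sequences.
genes = {"geneA": "ACGT" * 25, "geneB": "TTGCA" * 20}

# An EST read from geneA with 2 substitution errors in 100 bases (2%).
read = list(genes["geneA"])
read[10] = "T"
read[60] = "C"
est = "".join(read)

print(best_match(est, genes), identity(est, genes["geneA"]))
```

Even with the errors, the read is far closer to its source gene than to the unrelated sequence.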

ESTs employed for gene identification - To find a disease gene using this

approach, scientists first use observable biological clues to identify ESTs that may

correspond to disease gene candidates. Scientists then examine the DNA of disease

patients for mutations in one or more of these candidate genes to confirm gene

identity [12]. Using this method, scientists have already isolated genes involved in

Alzheimer's disease, colon cancer, and many other diseases. It is easy to see why ESTs

will pave the way to new horizons in genetic research.

Thanks to ESTs, there has been an explosion of information about DNA sequences of the

human genome. Scientists have identified a large number of novel genes using ESTs.

Although important goals of any sequencing project may be to obtain a genomic

sequence and identify a complete set of genes, the ultimate goal is to gain an

understanding of when, where and how a gene is turned on, a process commonly

referred to as gene expression.

2.6 dbEST: A DESCRIPTIVE CATALOG OF ESTs

Scientists at NCBI created dbEST [13] to organize, store, and provide access to

the great mass of public EST data that has already accumulated and that continues to

grow daily. Using dbEST, a scientist can access not only data on human ESTs but

information on ESTs from over 300 other organisms as well. Whenever possible, NCBI

scientists annotate the EST record with any known information. For example, if an EST

matches a DNA sequence that codes for a known gene with a known function, that gene's

name and function are placed on the EST record. Annotating EST records allows public

scientists to use dbEST as an avenue for gene discovery. By using a database search tool,

such as NCBI’s BLAST, any interested party can conduct sequence similarity searches

against dbEST.

2.7 GENE EXPRESSION PROFILING

Chips containing hundreds, thousands, or even tens of thousands of genes

arranged in parallel are revealing the function of genes and relationships between genetic

and biochemical pathways that are virtually impossible to discern by any other means.

Organisms express their genes at a relatively constant rate until the products encoded by

these genes are needed for a specific function. When needed, genes are activated or

repressed rapidly and in a dramatic fashion changing by 10-, 100- or even 1000-fold or

more depending on the particular gene and the strength of the regulatory cue [8]. The

way gene expression changes in response to a wide spectrum of signals, including

hormones, chemicals, nutrients, stress, changes in cell division and development, light

stimulation and the like, provides a gene expression profile that is characteristic for a given

physiological state.

Because gene expression correlates specifically and tightly with function, it is

possible to infer the function of genes and the interaction of pathways by documenting

which of those genes are turned up or down in a given physiological state. Gene

regulation provides a selective evolutionary advantage by conserving cellular building

blocks (e.g., nucleotides, amino acids) and enzymatic machinery (e.g., transcription

factors, polymerase) until they are needed, and by allowing the organism to adapt to a

plethora of different environmental conditions to which it is exposed during its lifetime.

When cells are subjected to elevated temperature, for example, heat shock genes

are activated to protect cellular proteins from thermal damage [8]. Disease states, drug

treatment, different developmental stages, and many other processes can be examined by

cataloging gene expression profiles. Understanding how genes are expressed in normal

and diseased cells can help uncover novel potential targets for therapies. The logical

extension of this concept is to build comprehensive gene expression databases for each

organism that contain expression profiles for each gene across thousands of different

conditions, thereby allowing biological exploration to take place predominantly by means

of a computer.

2.8 HYBRIDIZATION AND GENE EXPRESSION

A single stranded DNA molecule with a known sequence is labeled with a

radioactive isotope or fluorescent dye. This is used as a “probe” to detect a fragment of

DNA or mRNA with the complementary sequence. In order to determine if a gene is

expressed in a particular tissue, the following “Northern blot” procedure could be

used:

1 Make a fluorescent dye probe by using a small piece of gene A

2 Isolate mRNA from all the tissues of interest

3 Bring the mRNA to a solid medium (like nylon filter)

4 Hybridize the probe to the filter

5 If gene A is expressed in the tissue, see a fluorescent signal on the filter

6 Thus one could detect the presence of a particular DNA or RNA

But the process above and most other traditional methods in molecular biology

generally work on a “one gene in one experiment” basis, which means that the

throughput is very limited and the “whole picture” of gene function is hard to obtain.

There is a need for methods that can handle these huge sets of available data in a global

fashion, and that can analyze such large systems.
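The probe-based detection described in this section amounts to a test for sequence complementarity. The sketch below is a minimal model that assumes perfect complementarity is required for a signal (real hybridization tolerates some mismatches); the tissue names, sequences, and function names are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def probe_binds(probe_dna, mrna):
    """True if the single-stranded probe is complementary to part of the mRNA."""
    target = mrna.upper().replace("U", "T")
    return reverse_complement(probe_dna.upper()) in target


def tissues_expressing(probe_dna, tissue_mrnas):
    """Return the tissues in which at least one mRNA gives a signal."""
    return sorted(
        tissue
        for tissue, mrnas in tissue_mrnas.items()
        if any(probe_binds(probe_dna, m) for m in mrnas)
    )


# Hypothetical mRNA pools; the probe is complementary to GCCAAA.
samples = {"liver": ["AUGGCCAAAUGA"], "brain": ["AUGCCCGGGUAA"]}
print(tissues_expressing("TTTGGC", samples))
```

Each call answers the question for one gene only, which is the throughput limit that motivates the microarray approach in the next chapter.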

3. MICROARRAYS

With the completion of the Human Genome Project and the cataloguing of thousands of

genes already, the biggest challenge to scientists now is to understand how these genes

work – what regulates their activity and how they interact with each other and the

environment. Microarray technology, which is reshaping molecular biology, gives scientists the

capability to analyze the expression levels of thousands of genes simultaneously in a single

experiment, quickly and efficiently. As we have seen earlier, the thousands of genes and

their products in a given living organism function in a complicated and orchestrated way

that creates the mystery of life. However, the traditional methods in molecular biology

generally work on a “one gene in one experiment” basis, which means that the throughput

is very limited and the “whole picture” of gene function is hard to obtain. Microarray

technology monitors the whole genome on a single chip so that researchers can have a

better picture of the interactions among thousands of genes simultaneously.

The magic of microarray analysis is sweeping through the agricultural and

medical sciences, replacing traditional biological assays based on gels, filters, and

purification columns with small glass chips containing tens of thousands of DNA and

protein sequences. Microarrays function like biological microprocessors, enabling the

rapid and quantitative analysis of gene expression patterns, patient genotypes, drug

mechanisms, and disease onset and progression on a genomic scale.

A microarray is a small analytical device that allows genomic exploration with

speed and precision unprecedented in the history of biology. Schena and co-workers first

developed them at Stanford University in the early 1990s [14]. Glass chips containing tens

of thousands of genes are used to examine fluorescent samples prepared by labeling

messenger RNA (mRNA) from cells, tissues and other biological sources. Molecules in

the fluorescent sample react with cognate sequences on the chip, causing each spot to

glow with intensity proportional to the activity of the expressed gene. The enormous capacity of

these miniature devices allows the analysis of the entire human genome in a single

experiment. Since patterns of gene expression correlate strongly with function,

microarrays are providing unprecedented information on human disease, aging, drug and

hormone action, mental illness, diet, and many other clinical matters. Microarrays can

also be used to find alterations in gene sequences, paving the way for a new era of genetic

screening, testing and diagnostics.

Source: http://www.soe.ucsc.edu/~sugnet/microarray/microarray_FAQ.html

Figure 3.1 - Image of a Portion of a Microarray that has been Hybridized and

Scanned

A microarray is an ordered array of microscopic elements on a planar substrate

that allows specific binding of genes or gene products. Microarray is a new scientific

word derived from the Greek word mikro (small) and the French word arayer

(arranged) [8]. Microarrays, also known as biochips, DNA chips, and gene chips, contain

a collection of small elements or spots arranged in rows and columns. To qualify as a

microarray, the analytical device must be (1) ordered, (2) microscopic, (3) planar, and (4)

specific. Devices that fulfill only a subset of these criteria do not afford the advantages of

microarrays, do not qualify as microarrays, and should not be considered as such.

A typical microarray would consist of a regular microscope glass slide that

contains thousands of microscopic quantities of PCR products of cDNA [14] or synthetic

oligonucleotides of genes. Each spot should represent one specific exon or one gene in

the genome. Thanks to the recent development of high-speed robotic printing, once all the

nucleotide sequences are ready, mass production of microarray slides is now possible for

many different experiments. A computer then precisely measures the amount of sample

(mRNA) bound to each spot on the microarray, generating a profile of gene expression in

the cell. Microarray technology takes advantage of two basic technologies.

1) One is binding between a single-stranded DNA sequence and its complementary

sequence (base pairing or hybridization, i.e. A-T and G-C for DNA; A-U and G-C

for RNA)

2) Another is using a fluorescent probe to visualize differences in cDNA level, which in

turn represents mRNA level.

Currently microarrays come in many different types but there are two main

fabrication methods.

1. Synthesize DNA probes separately, using PCR for cDNAs [15] or chemical

synthesis for oligonucleotides. Then a robot is used to spot these DNA probes

onto microarrays into very small grids. The substrate for microarrays can be glass,

plastic or even nylon membranes. Most labs use glass microscope slides since this

method is comparatively cheap and flexible. Some related technologies use ink-jet

like printers to spray oligonucleotide probes on the microarrays.

2. Synthesize DNA oligonucleotides [3] directly on the microarray using UV-masks

and photo-activated chemistry (Fig. 3.2). Currently the company Affymetrix is the

only commercial company using combinatorial chemistry approaches. The

technique used is as follows: deprotect sites that will have the next base (A,C,T,

or G) bound to them using UV light, then bind the next base to those sites and

repeat with a different base. To direct which sites will be deprotected, Affymetrix

uses a photolithographic mask which only lets the UV light activate certain sites.

Image source: - http://www.cse.ucsc.edu/~sugnet/microarray/microarray_FAQ.html

Figure 3.2 – Combinatorial Chemistry Approaches to Microarray

Manufacturing

Using this technology Affymetrix is able to build up very large arrays of

oligonucleotides in parallel. However, due to limited synthesis efficiency, the longest

oligonucleotide probes that Affymetrix makes are 25 nucleotides long.
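The deprotect-and-couple cycle described above can be simulated to show how one mask per step builds many different probes in parallel. This is a conceptual sketch only: the fixed A, C, G, T cycling order, the function name, and the three short probes are assumptions made for illustration and do not describe Affymetrix's actual mask design.

```python
def synthesize(probes, base_order="ACGT"):
    """Grow all probes in parallel, one base per masked coupling step.

    Returns the finished sequences and the list of (base, mask) steps,
    where mask[i] is True if site i was deprotected for that base.
    """
    if any(set(p) - set(base_order) for p in probes):
        raise ValueError("probes may only contain the bases in base_order")
    grown = [""] * len(probes)
    masks = []
    while grown != list(probes):
        for base in base_order:
            # Deprotect only the sites whose next required base is this one.
            mask = [
                len(g) < len(p) and p[len(g)] == base
                for g, p in zip(grown, probes)
            ]
            if any(mask):
                masks.append((base, mask))
                grown = [g + base if m else g for g, m in zip(grown, mask)]
    return grown, masks


finished, steps = synthesize(["ACG", "CAT", "GG"])
print(finished, len(steps))
```

Under this cycling scheme a set of 25-mers needs at most 4 x 25 = 100 coupling steps regardless of how many probes are on the array, which is what makes the approach attractive for very large arrays.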

3.1 COMMERCIAL ARRAYS

Various companies manufacture commercial microarrays.

Here are three of the popular commercial microarray platforms:

• Affymetrix (GeneChip®)

• Agilent Technologies (Agilent SurePrint)

• Amersham Biosciences (CodeLink™)

Over the past few years oligonucleotide GeneChip arrays, commercially produced

by Affymetrix, have become widely used by the scientific community to study genome

wide gene expression. More recently, Agilent and Amersham Biosciences have each

commercialized a new microarray platform, also based on oligonucleotides. Both

technologies give high quality, good resolution and advanced gene information. Because

these technologies are, or will become, standards in life science research, it is important

to provide the scientific community with a privileged access to these technologies.

Affymetrix: Leveraging technologies adapted from the semiconductor industry,

the manufacture of GeneChip arrays uses photolithography and solid-phase chemistry to

produce arrays containing hundreds of thousands of oligonucleotide probes packed at

extremely high densities. The probes are designed to maximize sensitivity, specificity,

and reproducibility, allowing consistent discrimination between specific and background

signals, and between closely related target sequences. Affymetrix also provides the

researcher access to array content information, including probe sequences and gene

annotations via the NetAffx™ Analysis Center. This center enables researchers to

correlate their GeneChip® array results with array design and annotation information

(www.affymetrix.com).

Agilent Technologies prints high-quality oligonucleotide microarrays using

SurePrint technology. These oligonucleotide microarrays are manufactured using

Agilent’s non-contact in situ synthesis process of printing 60-mer probes, base-by-base.

Up to 22,000 oligonucleotides per microarray can currently be synthesized. Each

microarray is uniquely bar-coded and data about the microarray and gene identification is

stored in an accompanying GEML microarray layout. GEML, or Gene Expression

Mark-up Language, is an XML-based open standard format that preserves expression

profile information consistently even when used under different database schemes,

allowing researchers to compare new data with existing data from other microarray

platforms (www.agilent.com).

CodeLink™ Bioarray Platform offers a high-precision microarray solution in a

high-density format. Presynthesized and functionally validated 30-mer oligonucleotide

probes are piezo-electrically deposited onto a proprietary 3-D aqueous gel matrix.

Attachment is accomplished through covalent interaction between the amine-modified

group present on the 5' end of the oligonucleotide and the activated functional group

present in the gel matrix. The 3-D gel matrix provides an aqueous environment, allowing

for maximal interaction between probe and target. The 3-D platform achieves sensitivity

down to approximately one transcript per cell, and a minimum detectable fold change as

low as 1.3-fold with 95% confidence (www.amershambiosciences.com).

Different types of microarrays - Oligonucleotide microarrays are one of the most

commonly used types of microarray, finding wide use in a variety of applications, including gene

expression profiling and genotyping. Oligonucleotides are single stranded 15- to 70-nucleotide

molecules made by chemical synthesis, and these synthetic targets produce

high specificity and good signal strength in hybridization reactions. More than one

quarter of all microarray publications to date use oligonucleotides as the target molecules

[8]. Complementary DNA and oligonucleotide microarrays both exploit the chemical

process of hybridization to generate microarray signals. Oligonucleotide and cDNA

microarrays fall into a broader category known as nucleic acid microarrays (Fig 3.3),

which encompasses microarrays containing any type of DNA or RNA as the target

material.

Figure 3.3 – Distribution of Types of Microarrays

Differences between different types of microarrays - The different types of

microarrays each have their own peculiarities and no one has published any sort of study

rigorously comparing the different technologies. However there are some inherent

strengths and weaknesses to each technology.

• The Affymetrix chemistry is great, but their technology is very expensive and

fairly inflexible. The basic Fluidics station and scanner are over $100,000 and

then each GeneChip is around $5,000. The reason that it is inflexible is that if

you do not like their arrays and want to make your own, the cost for a new

photolithographic mask can be over a million dollars. That said, the technology is

very robust and Affymetrix's chemistry is reproducible and allows the detection of

SNPs and other small features in the DNA.

One thing to note is that Affymetrix is not currently doing co-hybridizations,

which makes it very important, and challenging, to normalize between the

experimental and control gene chips. That is to say that Affymetrix does not

produce ratios; each probe produces only an absolute intensity.

• Spotting DNA on glass microscope slides is relatively inexpensive and very

flexible. However the spotting process itself is inherently variable. Also most

microarrays produced in this manner use cDNAs as their probes. Using cDNAs

has a couple of technical problems.

1. You need a copy of that DNA to start with so you can use PCR to produce

your probes. This is a major point as we know the sequence of whole

genomes but we do not have unique cDNA libraries that span genomes.

2. When using a cDNA as a probe you get a very long sequence to bind to,

which makes it impossible to discern between genes that are more than

80% similar, let alone to detect SNPs.

It is possible to spot oligos on glass slides and save yourself a lot of PCR and

avoid the above limitations.
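The 80% similarity limit mentioned above suggests a simple screening check when choosing probe sequences. The sketch below assumes the two sequences are already aligned and of equal length, and treats percent identity above the threshold as a cross-hybridization risk; the function names and example sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def may_cross_hybridize(a, b, threshold=80.0):
    """True if a long probe is unlikely to tell the two sequences apart."""
    return percent_identity(a, b) > threshold


print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
print(may_cross_hybridize("ACGTACGTAC", "ACGTAGGTTT"))
```

Shorter oligonucleotide probes can be placed in the regions where two related genes differ, which is how they avoid this limitation.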

The technique that allows the spotting technology to sidestep the issue of

variability in spotting and other concerns is the use of co-hybridizations. This technique

is covered in greater detail later in this document but the main concept is to use relative

RNA expression levels instead of absolute expression levels. To accomplish this, two

separate RNA samples are used: an "experimental" and a "reference". Each RNA is

labeled with a different fluorescent dye, and then the two samples are mixed and

hybridized at the same time to the microarray. When the microarray is scanned, the number

of photons in the experimental dye's spectrum is compared to the number of photons in

the reference dye's spectrum. Many variations in spot size, probe concentration and other

issues are cancelled out in this manner.
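A short sketch shows why taking ratios cancels spot-to-spot variation. It assumes that spot size or probe concentration scales both dye channels by the same factor, and it reports the ratio on a log2 scale, a common convention that is not specified in the text; the intensity values and the `floor` parameter are invented.

```python
import math


def log2_ratios(experimental, reference, floor=1.0):
    """Per-spot log2(experimental / reference), guarding against zeros."""
    return [
        math.log2(max(e, floor) / max(r, floor))
        for e, r in zip(experimental, reference)
    ]


experimental = [400.0, 100.0, 50.0]
reference = [100.0, 100.0, 200.0]
print(log2_ratios(experimental, reference))

# A spot printed three times larger brightens both channels equally,
# so the ratio is unchanged.
bigger = log2_ratios([3 * e for e in experimental], [3 * r for r in reference])
print(bigger)
```

A value of 2 means fourfold higher expression in the experimental sample, 0 means no change, and -2 means fourfold lower.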

3.2 EXPERIMENTAL DESIGN

Microarray analysis differs from traditional research in a number of striking ways,

one of which is the relationship between the amount of experimental time required and

the amount of data obtained. Traditional experimental approaches based on gels and

filter blots required a relatively large amount of experimental time (Fig 3.4) to obtain a

small volume of data, whereas microarray analysis affords vast quantities of data with

relatively little experimental time. Microarrays purchased commercially provide an

extreme example, allowing a single researcher to generate millions of data points (Fig

3.4) in a few weeks. This paradigm shift and upside down relationship between

experimentation and data output places tremendous importance on sound experimental

design in microarray analysis. Properly designed experiments that include the right

experimental components and controls enable researchers to avoid the data avalanche that

can quickly bury the uninitiated.

Every microarray experiment should contain a positive control, a negative control,

and an experimental component. A positive control is a microarray element or substrate

that provides a readable signal, irrespective of the results obtained from the experimental

component of the assay. Readable results from the positive controls greatly improve the

capacity to evaluate the experimental data, particularly if negative results are obtained

from the experimental components.

[Figure: two schematic plots contrasting Microarray Analysis and Traditional Research in terms of data obtained versus experimental time]

Figure 3.4 – Experimental Approach for Microarray Analysis

Intense signals from the positive controls exclude trivial explanations for a failed

experiment, such as a defect in hybridization, washing, scanning, or data analysis. No

formal conclusions can be drawn from a negative result in a microarray experiment

unless the positive controls produce readable signals.

A negative control is a microarray element or substrate that provides little or no

readable signal, irrespective of the results obtained from the experimental component of

the assay. Negative controls add confidence to the experimental data by excluding or

reducing the possibility that a nonspecific biochemical event (e.g., cross-hybridization)

is producing the signals at the experimental locations. It is risky to draw conclusions

concerning the experimental components of a microarray assay if the assay does not

include one or more negative controls.
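The role of the two kinds of controls can be expressed as a simple acceptance check run before any experimental spot is interpreted. The intensity thresholds and the (kind, intensity) spot representation below are assumptions chosen for illustration; suitable cutoffs depend on the scanner and protocol.

```python
def assay_is_interpretable(spots, min_positive=1000, max_negative=200):
    """Return True only if the controls behaved as expected.

    spots is a list of (kind, intensity) pairs, where kind is
    "positive", "negative", or "experimental".
    """
    positives = [i for kind, i in spots if kind == "positive"]
    negatives = [i for kind, i in spots if kind == "negative"]
    if not positives or not negatives:
        return False  # a missing control type makes the result unsafe to read
    readable = min(positives) >= min_positive  # hybridization and scanning worked
    specific = max(negatives) <= max_negative  # no nonspecific signal
    return readable and specific


good_run = [("positive", 5000), ("negative", 50), ("experimental", 20)]
failed_run = [("positive", 100), ("negative", 50), ("experimental", 20)]
print(assay_is_interpretable(good_run), assay_is_interpretable(failed_run))
```

In the first run a weak experimental spot can be read as a true negative result; in the second it cannot, because the positive control itself failed.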

The experimental component of a microarray assay corresponds to the new

information that is sought in a given experiment. The experimental data contain

information regarding gene expression patterns, genotypes, and other biological

processes or pathways. Microarray assays that contain positive controls, negative

controls, and an experimental component yield reliable experimental data that can be

quantified, mined, and modeled using an increasingly powerful collection of software tools.

Experimental design can be simplified by understanding the five basic steps in the

microarray analysis cycle: a biological question, sample preparation, a biochemical

reaction, detection, and data analysis and modeling [8] (Fig 3.5).

[Figure: the microarray analysis “lifecycle”, a cycle of five numbered steps: (1) Biological Question: How do patterns of gene expression compare in root and leaf tissue? (2) Sample Preparation: mRNA isolation, probe labeling, PCR, microarray manufacture. (3) Biochemical Reaction: hybridization, substrate processing, blocking, washing. (4) Detection: select channels, laser settings, produce images. (5) Data Analysis & Modeling: quantitate data, calculate ratios, cluster.]

Figure 3.5 - Five Steps of the Microarray Analysis Cycle

Microarrays are used in the following manner - The basic protocol [16] is as follows:

1. Isolate the RNA you are interested in and the RNA from your control. The RNA

can come from any cells. It is important to realize though that the RNA from

tissues or any heterogeneous cells may lead to results that reflect changes in the

composition of the sample rather than in changes due to the experimental

hypothesis.

2. Label the RNA. Usually this means performing a reverse transcriptase reaction

and incorporating dye that has been linked to a DNA nucleotide. However, some

protocols, e.g. Affymetrix's, call for an amplification of the RNA and labeling of

the RNA itself. For microarrays on nylon membranes usually the label is

radioactive.

Figure 3.6 – Hybridizing the Two Samples

3. Hybridize the labeled target to the microarray (Fig 3.6). This consists of placing a

solution containing the labeled target on the microarray and letting it sit for a

period of hours. This allows a given target to find its probe on the microarray and

bind to it. Usually this is carried out at a specific temperature to minimize

non-specific binding of target to the probes on the microarray.

4. Remove the hybridization solution and wash the microarray. The washing can be

done at different salt and detergent concentrations to minimize non-specific

binding. In general solutions with lower salt concentrations weaken the DNA base

pairing and are referred to as "more stringent" and vice versa for higher salt

concentrations.

5. Once the microarray has been washed it is time to scan the microarray. Scanning

is just quantifying how much target is bound to each DNA probe on the microarray.

Most microarrays use fluorescent dyes and are scanned in the following manner

(Fig 3.7):

Figure 3.7 – Scanning the Microarray

1. A laser is used to excite the fluorescent dye; the photons coming from the

dye are captured using lenses to focus the light and a photomultiplier tube

(PMT) to quantify how many photons are being captured.

2. The resulting number for that section of the microarray is translated into

one pixel of a 16-bit .tiff file. The more pixels per centimeter, the better

the resolution of the resulting .tiff image. It is important to note that these .tiff

files are uncompressed, and file formats like .jpeg and .gif, which

can lose image data, should not be used for storage of results.

3. The resulting image is analyzed by finding the spots and comparing the

differences between chips (if the hybridization contained only one fluor)

or the ratio of the two fluors for co-hybridization experiments. How these

differences are normalized, compared and interpreted is beyond the scope

of this document.
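The scanning steps above can be sketched as two small operations: turning a photon count into a 16-bit pixel value, and summarizing the pixels of one spot. The mean-minus-median-background summary and the rule of rejecting saturated spots are common simplifications assumed here for illustration; they are not taken from any particular image analysis package.

```python
from statistics import mean, median

MAX_16BIT = 2**16 - 1  # largest value a 16-bit pixel can hold (65535)


def to_pixel(count):
    """Clamp a PMT reading into the range of a 16-bit pixel."""
    return min(MAX_16BIT, max(0, round(count)))


def spot_signal(spot_pixels, background_pixels):
    """Background-subtracted spot intensity, or None if the spot is saturated."""
    if any(p >= MAX_16BIT for p in spot_pixels):
        return None  # saturated pixels no longer reflect the true amount bound
    return max(0.0, mean(spot_pixels) - median(background_pixels))


print(to_pixel(70000))
print(spot_signal([1000, 1200, 1100], [90, 100, 110]))
```

Because each pixel holds at most 65535, very bright spots saturate, which is one reason laser and PMT settings matter at the detection step.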

3.3 APPLICATION OF MICROARRAYS

Two trends in microarray research are the diversification of the assays and the

worldwide spread of the technology. Gene expression applications account for 81% of

the scientific publications to date [8], but microarrays are being used for many other

purposes, including genotyping, tissue analysis, and protein studies. Microarray assays

for genetic and infectious diseases may improve health care by providing rapid and

affordable genotyping data for treatable and curable illnesses.

3.4 AVAILABLE MICROARRAY TOOLS

Currently there are numerous tools available for microarray analysis, but there are

very few available for the design of microarrays. What follows is a list of a few of the

available tools and their properties.

Microarray DAta Management (MADAM) - MADAM [17] is software available

from The Institute for Genomic Research (TIGR). It guides users through the microarray

process from RNA procurement to data analysis, offering intelligent forms to simplify the

tracking of experimental parameters and results that are essential for the interpretation of

expression results in downstream analyses. MADAM is platform independent and has

been tested on Microsoft Windows, Linux, Unix, and Mac OS X successfully. Additional

tools (TIGR Spotfinder, MIDAS, and MeV) are made available along with MADAM, together

making up the TM4 software suite (S. Dudoit, R.C. Gentleman and J. Quackenbush).

MIDAS provides for normalization of data and MeV provides for analysis of normalized

data whereas Spotfinder helps in image analysis.

TM4 comes close to our requirements but falls short in that it only helps in tracking

the experimental parameters rather than helping to provide them, and thus does not

support chip design. TM4 does not provide for designing chips, does not allow for

selection of the genes that go on the chip, and does not allow for other design

parameters to be set. However, it is capable of tracking them.

BioArray Software Environment (BASE) - BASE is a comprehensive free web

based database solution for the massive amounts of data generated by microarray analysis.

It was developed at the Department of Theoretical Physics, Lund University [18]. It has

been designed entirely using free software – Linux OS, MySQL database, Apache web

server, Java/C++/PHP languages. It manages biomaterial information, raw data and

images and provides integrated and “plug-in”-able normalization, data viewing and

analysis tools. BASE can be installed on a local server, which can be accessed via any

web browser using personal logins with administered access levels. It allows for data to

be visualized using various plots, histograms and tables.

Again, BASE deals entirely with the analysis aspect and does it well but does not

provide any support for the design aspect.

Stanford Microarray Database - SMD [19] is a research tool for hundreds of

Stanford researchers and their collaborators. It provides unrestricted access to

microarray data published by SMD users. It has the ability to store, retrieve, display and

analyze the complete raw data produced by various microarray platforms and image

analysis software packages. Software has been implemented to increase the ease with

which data from SMD can be published adhering to accepted standards, as well as to

increase the accessibility of published microarray data to the general public.

Longhorn Array Database (LAD – Patrick J Killion, Gavin Sherlock, Vishwanath

R Iyer) - The Longhorn Array Database (LAD) [20] is a MIAME [21] compliant

microarray database that operates on PostgreSQL and Linux. It is a fully open source

version of the Stanford Microarray Database (SMD), one of the largest microarray

databases. LAD provides a simple, free, open, reliable and proven solution for storage

and analysis of two-color microarray data. LAD stores raw and normalized data from

microarray experiments, as well as their corresponding image files. In addition, LAD

provides interfaces for data retrieval, analysis, and visualization.

SMD and LAD both provide for features that are helpful for data analysis. None

of the above tools provides for both the design and the analysis of microarrays. Having

both aspects in one place helps researchers study microarrays effectively.

4. MIDAR – DESIGN ASPECTS

Personal computers have taken up residence on the desks of virtually every

scientist in the world, and larger workstations are common equipment in most research

departments. This chapter explores the myriad different faces of MIDAR as a tool for

electronic resources and biological databases, sequence and design informatics, data

quantitation, mining and modeling.

Electronic Resources - The linking of microarray scientists worldwide via the

internet is speeding technological advance like never before. The internet, which supports

the World Wide Web (WWW) and electronic mail (e-mail), allows scientists to share

microarray protocols and data, obtain commercial products, and send electronic messages

quickly and economically. A host of electronic support services, including an electronic

library of microarray citations and PubMed, are provided in MIDAR to assist scientists

in keeping pace with the rapidly expanding scientific literature.

Biological Databases - MIDAR allows microarray scientists to access biological

databases with extensive content on genes and genomes. These databases are useful in

selecting sequences for microarray manufacture and in interpreting the results of

microarray experiments. This section describes the backend of MIDAR, various available

genomic databases and how they are being implemented.

Microarray scientists commonly use three types of nucleotide sequence databases.

Many laboratories, companies and universities have small in-house databases of select

content, which is available on a limited basis. These sequence databases are useful for

focused projects, but tend to be somewhat limited in terms of the amount of content.

Several large biotechnology companies, including Celera Genomics (Rockville, MD) and

Incyte Genomics (Palo Alto, CA), possess large private databases of genomic and

expressed sequence tags (EST) information, providing data on a paid basis. The third

source of sequence information for microarray experimentation is found in GenBank

(www.ncbi.nlm.nih.gov), a public sequence database maintained by NCBI. For MIDAR,

we have implemented the first and last types of nucleotide sequence database: small

in-house databases and public sequence databases. The small in-house database serves

the needs of individual laboratories. Scientists can upload the genes in their area of study

as custom gene groups and use them for chip design. After all, MIDAR is intended for

custom chips, that is, chips carrying only the subset of genes that a laboratory is

interested in for its research. Public databases cater to the general needs of scientists. We

have incorporated the Gene Ontology (GO) database, since it is a standard resource

recognized by most laboratories.

In GO, genes are organized into groups based on biological process, molecular

function or cellular component, allowing users to choose genes based on their

characteristics. This feature is very helpful in microarray design because microarrays are

used to study genes that share common characteristics. The use of GO terms

by several collaborating databases facilitates uniform queries across them. The controlled

vocabularies are structured so that you can query them at different levels: for example,

you can use GO to find all the gene products in the mouse genome that are involved in

signal transduction, or you can zoom in on all the receptor tyrosine kinases. This structure

also allows annotators to assign properties to gene products at different levels, depending

on how much is known about a gene product.

One of the future directions for MIDAR is to use KEGG [22] pathways for

gene selection. KEGG pathways are becoming increasingly popular and would readily

provide the pathways that most researchers are interested in. The KEGG pathway

database is a collection of graphical diagrams (KEGG pathway maps) representing

molecular interaction networks in various cellular processes. Each reference pathway is

manually drawn and updated regularly. Organism-specific pathways are computationally

generated based on the KEGG Orthology (KO) assignment in individual genomes.

Although these databases provide gene information, we still need to store the chip

information, primer and oligonucleotide sequences, and other information concerning chip

design. We have therefore added a few tables to one of the databases to hold custom chip

information.

Database design - For every chip that is designed, we have to store a unique id,

a description, the designer's id, the genes involved, their concentration levels, the number

of replicates, and the number of columns and rows. We also have to store the x, y

coordinates of each gene on the chip so that the layout can be reproduced when the chip

is reopened. Storing all of this in one table would lead to large amounts of redundant

data. After applying normalization techniques, we were able to place the attributes in two

separate tables with no redundant data. There was a similar problem with storing the

primer/oligo information of genes. Since there could be more

than one primer/oligo sequence for a particular gene, storing them along with other gene

attributes would lead to redundant data. Applying the same normalization techniques,

we were able to group these attributes into separate tables.
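The split between chip-level and spot-level attributes described above can be illustrated with a minimal sketch. It is written in Python with the standard sqlite3 module purely for illustration and is not the MIDAR implementation. The table and column names are a subset of those in Figure 4.1, while the column types, the composite key, the in-memory database and the sample values are assumptions.

```python
import sqlite3

# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Chip (                 -- one row per chip
    ChipId          INTEGER PRIMARY KEY,
    ChipDescription TEXT,
    ChipDesignerId  INTEGER,
    NumCols         INTEGER,
    NumReplicates   INTEGER
);
CREATE TABLE ChipGenes (            -- one row per spot on a chip
    ChipId          INTEGER REFERENCES Chip(ChipId),
    GeneId          INTEGER,
    ConcentrationId INTEGER,
    XCoord          INTEGER,
    YCoord          INTEGER,
    PRIMARY KEY (ChipId, XCoord, YCoord)   -- assumed key
);
""")

# Chip-level attributes are stored once ...
conn.execute("INSERT INTO Chip VALUES (1, 'demo chip', 7, 2, 2)")
# ... while each spot only repeats the ChipId (made-up sample spots).
spots = [(1, 101, 1, 0, 0), (1, 101, 1, 1, 0),
         (1, 102, 1, 0, 1), (1, 102, 1, 1, 1)]
conn.executemany("INSERT INTO ChipGenes VALUES (?, ?, ?, ?, ?)", spots)

# Re-opening a chip: fetch the stored coordinates to redraw the layout.
layout = conn.execute(
    "SELECT XCoord, YCoord, GeneId FROM ChipGenes "
    "WHERE ChipId = ? ORDER BY YCoord, XCoord", (1,)
).fetchall()
chip_rows = conn.execute("SELECT COUNT(*) FROM Chip").fetchone()[0]
```

With this arrangement the chip description and designer are stored once, however many spots the chip has.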

As the work progressed, further requirements emerged and new tables had to be

added. These requirements also added new attributes to the existing tables. Although this

meant redesigning the entire database, the change was necessary. The final database

design, after all of these changes, is shown in Figure 4.1.

Designing the front-end - After designing the database, it was time to design a

front-end based on these databases. The front-end had to meet two requirements:

populating the databases and supporting the design of custom chips.

Populating the databases - The GO database is available as an SQL dump at

http://www.godatabase.org/dev/database/. It can be downloaded at any time and the

database constructed from it. That leaves the small in-house database. Two things need to

be uploaded: the gene attributes and the primer/oligo information. To upload the gene

information, we designed forms whose only requirement is that the data be in a

tab-delimited format and follow the specified data header. Different laboratories follow

varied conventions in representing gene data, so uploading data would be nearly

impossible without a common standard. For this purpose, a header was developed that

defines the gene attributes and the order in which they must appear. This approach makes

it possible to upload data from different laboratories. The data header is shown in

Figure 4.2.

Users can upload primer/oligo information in the same file or use MPrime [23]

Concentration: ConcentrationId, Value, Units

Gene: GeneId, Description, GBAcc, GBID, GeneCode, Seq, molType, GeneGroup, GeneSubgroup, GeneSubunit, CuratedCitations, Diseases

Primers: PrimerId, GeneId, LeftPrimerSeq, RightPrimerSeq, ProductSeq, LeftPrimerLen, RightPrimerLen, RightPrimerBeg, RightPrimerEnd, LeftPrimerBeg, LeftPrimerEnd, LeftPrimerTM, RightPrimerTM, LeftPrimerGC, RightPrimerGC, Methods

Oligos: OligoId, GeneId, OligoSeq, OligoLen, OligoBeg, OligoEnd, OligoTM, OligoGC, Methods

Person: PersonId, LastName, FirstName, Address1, Address2, City, State

ChipGenes: ChipId, GeneId, Primer_OligoId, ConcentrationId, XCoord, YCoord

Chip: ChipId, ChipDescription, ChipPIID, ChipDesignerId, ChipType, NumSpots, NumCols, GeneGroup, NumReplicates

Figure 4.1- Database Design

to generate primer/oligo information. MPrime is an external program, designed

by Dr. Rouchka and his team, that generates primers and oligos from a gene sequence.

Primer and oligo sequences are required for microarray design, since they, and not the

whole gene, are what is actually placed on the chip. Generating primer and oligo

sequences is therefore essential in MIDAR. MPrime was readily available and provided

for primer and oligo sequence generation. Because it was developed by our own group,

incorporating it into this project was straightforward. Moreover, code reuse is standard

practice and is encouraged by object-oriented design.

Data Header

GeneCode

GBAcc

Description

Oligo Sequence/s

Left Primer Sequence

Right Primer Sequence

Figure 4.2 – Data Header for Uploading Gene Information
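The header check described above can be sketched as follows. This is an illustrative Python sketch and not the MIDAR upload form. The column names and their order are taken from Figure 4.2. The function name read_gene_file, the exact-match rule for the header line and the sample row are assumptions, and the sketch does not guess how several oligo sequences are packed into one column.

```python
import csv
import io

# Column order taken from Figure 4.2.
EXPECTED_HEADER = [
    "GeneCode", "GBAcc", "Description",
    "Oligo Sequence/s", "Left Primer Sequence", "Right Primer Sequence",
]

def read_gene_file(handle):
    """Yield one dict per gene row after validating the header line."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header!r}")
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(EXPECTED_HEADER):
            raise ValueError(
                f"line {line_no}: expected {len(EXPECTED_HEADER)} fields, "
                f"got {len(row)}")
        yield dict(zip(EXPECTED_HEADER, row))

# Made-up sample file with one gene row.
sample = io.StringIO(
    "\t".join(EXPECTED_HEADER) + "\n"
    "G1\tX00001\tdemo gene\tACGT\tAAC\tGGT\n"
)
genes = list(read_gene_file(sample))
```

Rejecting a file whose header differs is one simple way of enforcing the common standard before any row reaches the database.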

MPrime was developed in C/C++, whereas MIDAR is being developed in VB.Net.

The two languages come from different programming paradigms, one being

object-oriented and the other event-driven. After the initial excitement about MPrime's

reusability and availability, this was somewhat discouraging. Moreover, VB.Net is not one

of the best languages for searching and processing strings that are tens of thousands of

characters long. However, after some research on the internet, I found more than one

way to incorporate C++ code into my application.

Incorporation of C++ code into VB.NET - One way of doing this is to build a

DLL (Dynamic Link Library) from the C++ code, whose exported functions can then be

declared and called from VB.Net code. A few additions to the code were needed, as

described below. A typical C++ program consists of two kinds of file: a .cpp file and a .h file.

The .cpp file contains the implementation of the program, while the .h file contains the

declarations of the member variables and functions. To use a C++ program from

VB.Net, one further file has to be defined: a .Def file. This acts as an interface and

exposes the listed functions to VB.Net code.

A sample .Def file would look like the following:

LIBRARY [DLL NAME]

EXPORTS

[FUNCTION NAME]

More than one function can be exported. Using the DLL together with these

function names, the functions can be called wherever they are required. The .h file

needs a small addition too. The _stdcall keyword has to be added to the function

declaration, after the return type and before the function name. This ensures that the

function uses the same convention as VB for passing arguments and managing the

stack. A sample declaration would look like the following:

int _stdcall function1() { return 1; }

The above code is compiled as a DLL, and the program is then available for external use

from VB.Net, VC++ or any other caller that can load a DLL.

The above DLL can be used in VB.Net by making an API call. A sample API call

is shown below and explained in Figure 4.3. Once the API call has been declared, the

function can be called like any other function.

Private Declare Sub Sleep Lib "Kernel32.dll" (ByVal dwMilliseconds As Integer)

“Private” The method can only be used within the code block in which it is declared. This is for VB’s sake only and is irrelevant to the DLL.

“Declare” This is only a method header, not the entire method.

“Sub” This is a subroutine and does not return a value.

“Sleep” The name of the method. It does not have to match the method name in the DLL, but if it differs an “Alias” clause must be added to the declaration; for simplicity’s sake it is best to use the same name.

“Lib” Indicates that the library name (DLL file) follows.

“Kernel32.dll” The name of the DLL file that contains the method.

“(” Starts the list of parameters (if any) that need to be passed to the method.

“ByVal” The value of the variable is passed to the method, not its address in memory. It is important to get this right in the declaration, since a mismatch can cause serious problems later on.

“dwMilliseconds” The name of the parameter. This is for VB’s sake only; it does not have to match the name declared in the DLL, although it makes sense for it to do so, and no alias is needed if it differs.

“As Integer” The type of the parameter, and thus how much memory is sent to the DLL when a value is passed. It is very important that this matches the DLL. The Win32 Sleep function takes a 32-bit value, which is Integer in VB.Net; Long, which was the 32-bit type in Visual Basic 6, is 64 bits wide in VB.Net.

“)” Ends the parameter list. Nothing follows, since a subroutine declaration has no return value.

Figure 4.3 – Explanation of Code

Another way of doing the same is to build an executable from the C++ program and

call the executable from VB.Net code. This does not give as much flexibility as the

first method, which gives access to the program's functions and runs the C++ code

“in-process”. For MIDAR, however, there is not much need for this flexibility, so I

adopted the second method, owing to its simplicity and because it meets the requirements

without much work. The second method requires a System.Diagnostics.Process object to

launch the executable. The gene sequences are written to a file, which serves as the

input. Similarly, the output is written to a file, from which the primer/oligo sequences

can be uploaded to the database. This completes the process of uploading gene

information and primer/oligo sequences. Now let us look into designing the custom chip.
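The file-in, file-out pattern described above can be sketched as follows. The sketch is in Python rather than VB.Net and is only an illustration of the pattern. The helper run_tool and the convention of passing the input and output paths as the last two command-line arguments are assumptions, and a tiny stand-in script is used in place of MPrime, whose actual command line is not reproduced here.

```python
import os
import subprocess
import sys
import tempfile

def run_tool(command, sequences):
    """Write sequences to a file, run `command <infile> <outfile>`,
    and return the lines the tool wrote to the output file."""
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, "genes.txt")
        out_path = os.path.join(tmp, "primers.txt")
        with open(in_path, "w") as f:
            f.write("\n".join(sequences) + "\n")
        # check=True raises CalledProcessError if the tool exits with an error.
        subprocess.run(list(command) + [in_path, out_path], check=True)
        with open(out_path) as f:
            return f.read().splitlines()

# Stand-in "tool" (NOT MPrime): copies the first 5 bases of every input line.
STAND_IN = [sys.executable, "-c",
            "import sys\n"
            "src, dst = sys.argv[1], sys.argv[2]\n"
            "with open(src) as i, open(dst, 'w') as o:\n"
            "    for line in i:\n"
            "        o.write(line.strip()[:5] + '\\n')\n"]

result = run_tool(STAND_IN, ["ACGTACGTAC", "TTGACCA"])
```

The calling program only needs to agree with the tool on the two file formats, which is why this route needs so little integration work compared with the DLL route.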

An input form is provided for the user to choose the set of genes required, the

concentration levels, the number of replicates and a few other parameters required for

chip design. MIDAR sits on top of a very large gene database, and extracting the needed

information requires complex queries that must nevertheless retrieve data efficiently.

The GO database spreads vast amounts of gene information over several tables. An

example query that lists all the genes exhibiting a common biological process is:

"select gene_product.symbol from term as rchild, term as ancestor, graph_path, association, gene_product " _

& "where graph_path.term2_id = rchild.id and " _

& "graph_path.term1_id = ancestor.id and " _

& "association.term_id = rchild.id and " _

& "association.gene_product_id = gene_product.id and " _

& "ancestor.name = '" & strFunction & "' order by symbol"

where strFunction is a chosen term name, such as …

The query involves multiple joins, including a self-join of the term table. The term table

is first joined with itself through the graph_path table to obtain the ancestor-descendant

relationships between terms. The result is then joined with the association and

gene_product tables to obtain the genes associated with a term.
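The join described above can be tried on a toy database. The sketch below uses Python's sqlite3 module with a cut-down schema that contains only the tables and columns named in the query. The column types, the sample terms and the gene symbols are made up for illustration, and the real GO database has many more columns. The term name is passed as a bound parameter rather than concatenated into the SQL string, which is a substitution made here for safety and not something stated in the original code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE association (term_id INTEGER, gene_product_id INTEGER);
""")
conn.executemany("INSERT INTO term VALUES (?, ?)",
                 [(1, "signal transduction"), (2, "toy child term"),
                  (3, "unrelated term")])
# Toy graph_path rows are (ancestor, descendant); (1, 1) lets a term match itself.
conn.executemany("INSERT INTO graph_path VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2), (3, 3)])
conn.executemany("INSERT INTO gene_product VALUES (?, ?)",
                 [(10, "geneB"), (11, "geneA"), (12, "geneC")])
conn.executemany("INSERT INTO association VALUES (?, ?)",
                 [(1, 10), (2, 11), (3, 12)])

QUERY = """
SELECT gene_product.symbol
FROM term AS rchild, term AS ancestor, graph_path, association, gene_product
WHERE graph_path.term2_id = rchild.id
  AND graph_path.term1_id = ancestor.id
  AND association.term_id = rchild.id
  AND association.gene_product_id = gene_product.id
  AND ancestor.name = ?
ORDER BY symbol
"""

def genes_for_term(term_name):
    """Return symbols of genes annotated to the term or any of its descendants."""
    return [row[0] for row in conn.execute(QUERY, (term_name,))]

symbols = genes_for_term("signal transduction")
```

In this toy data the result contains the gene annotated directly to the chosen term and the gene annotated to its child term, but not the gene attached to the unrelated term.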

As the project progressed, the list of genes had to be refined further. The above

query lists all genes pertaining to a term across all genomes. What users are really

interested in is the list of genes pertaining to one genome rather than all genomes.

Restricting the query in this way would add a few more joins and slow it down

considerably. To avoid this, we split the process into two queries. The first query lists

only the terms associated with a specific genome. The second query lists the genes

pertaining to the chosen term. The choice was between one big query involving a large

number of joins and two smaller queries, each with a manageable number of joins. We

chose the latter so that the response time remains predictable and better than that of one

big query.

A search text box is provided for users to search for terms of specific interest. The

search is dynamic: the list of terms refreshes with every character typed, bringing the user

closer to the desired term.

Layout of selected genes on the custom array - Serial and Random - In a serial

arrangement, all the genes, along with their replicates, are placed sequentially one after

the other in the order in which they were selected. In a random layout, the genes and

their replicates are placed randomly; that is, a gene and each of its replicates can be

anywhere on the chip. The choice of layout depends on the user's requirements. If the

user is not concerned about where on the chip a gene and its replicates fall, the serial

arrangement is sufficient. It has to be understood, however, that spots located in the

corners of the chip may not show the expression they should, owing to their location. If

a gene and its replicates are all placed together in one corner and show no expression,

there are two possibilities: either they are genuinely not expressed, or they are expressed

but appear dimmed because of their corner location. A random layout spreads the

replicates across the chip and so reduces this ambiguity. The choice between a serial and

a random layout therefore rests with the user.
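The two layouts can be sketched in a few lines. This is an illustrative Python sketch and not the MIDAR code. The function name layout_spots, the row-by-row filling of the grid and the use of a seeded shuffle for the random layout are assumptions.

```python
import random

def layout_spots(genes, replicates, n_cols, mode="serial", seed=None):
    """Return {(x, y): gene} for a chip with n_cols columns.

    serial: replicates of each gene occupy consecutive spots, in selection order.
    random: the same spots are shuffled before being assigned positions.
    """
    spots = [g for g in genes for _ in range(replicates)]
    if mode == "random":
        random.Random(seed).shuffle(spots)
    elif mode != "serial":
        raise ValueError(f"unknown layout mode: {mode}")
    # Fill the grid row by row: spot i goes to column i % n_cols, row i // n_cols.
    return {(i % n_cols, i // n_cols): g for i, g in enumerate(spots)}

serial = layout_spots(["A", "B", "C"], replicates=2, n_cols=3)
shuffled = layout_spots(["A", "B", "C"], replicates=2, n_cols=3,
                        mode="random", seed=1)
```

Both layouts occupy the same grid positions and contain the same spots; only the assignment of genes to positions differs.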

As discussed earlier, a typical custom array could contain tens of thousands of

genes and potentially upwards of a hundred thousand spots on a single microarray. MIDAR

is being developed in VB.NET, and choosing the type of control to represent a spot on

the array was a difficult decision. The spots needed a click event, so that a mouse click

on a spot would let the user view the oligo/primer details and other data specific to that

spot. The initial decision was therefore to use a button: it has a click event, and it has a

background property with which the intensity value could be represented. It also has a

tag property, with which a mouse-over event could quickly show brief information such as

the gene name and concentration level. The choice seemed ideal until we built chips with

a large number of spots, a typical number being five thousand. We needed chips that

could hold at least ten thousand spots. Hence the decision was reversed, and we chose to

draw the spots with graphics instead.

Each spot on the array is a circle drawn using the graphics pen. This approach

could generate ten thousand spots almost instantly and performed equally well for even

larger numbers of spots, so it seemed to be a good choice. However, a drawn circle has no

click event; it is just a drawing on the form. There is no background or tag associated with it.

It provided scalability but lacked the necessary features. To add this

functionality, I implemented arrays to store all the information and associated the array

index with the position of the drawing on the chip. The position was in turn tied to the

pixels of the spot. Thus a mouse click event on the form can be related to the array by

taking the pixel coordinates of the click and transforming them into a location and hence

an array index. The transformations took some time to understand at first, but once the

logic was clear it was easy to incorporate in the code. Overall, choosing a graphic circle

was a good decision, as it provided everything needed, although it was tedious initially.

A custom array contains tens of thousands of genes. One would want to refer readily to a

particular gene or gene group, and to know where the gene and its replicates are placed

on the chip, what the intensity values are, what the corresponding oligonucleotide

sequences are, and so on. MIDAR provides this feature. The user specifies the gene or

gene group of interest and a color with which to mark it, and the matching spots are then

marked on the array. Users can find all the information they need by simply clicking on

the marked spots.
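The pixel-to-index transformation described above can be sketched as follows. This is a Python illustration of the idea and not the VB.Net code used in MIDAR. The cell size, the margin, the row-by-row numbering of spots and the name spot_index_at are assumptions, and for simplicity the whole square cell around a circle is treated as clickable.

```python
def spot_index_at(px, py, n_cols, n_rows, cell=10, margin=5):
    """Map a click at pixel (px, py) to a spot index, or None if outside the grid.

    Spots are assumed to sit in square cells of `cell` pixels, offset from the
    form's top-left corner by `margin` pixels, and to be numbered row by row.
    """
    col = (px - margin) // cell
    row = (py - margin) // cell
    if px < margin or py < margin or col >= n_cols or row >= n_rows:
        return None
    return row * n_cols + col

# Parallel list holding per-spot information, indexed like the drawing.
spot_info = [f"gene{i}" for i in range(6)]   # 3 columns x 2 rows

clicked = spot_index_at(27, 18, n_cols=3, n_rows=2)
details = spot_info[clicked]
```

Because the lookup is pure arithmetic, its cost does not depend on the number of spots drawn, which is what makes the drawn-circle approach scale where one control per spot did not.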

Experiment data obtained from a microarray contain intensity values

corresponding to each spot on the custom array. These values vary a great deal; the

range can at times span several thousand. To represent such a wide range of intensity

values, the software has only 255 color levels available. The data therefore need to be

normalized, but because the range is so wide, values normalized directly are not

distributed uniformly across the color levels. A sorting approach was therefore chosen.

Using this approach, all the intensities are ranked according to their values, and the

ranks are then normalized to 255 colors. The result closely resembles a real expressed

custom array.
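One reading of the sorting approach described above is sketched below in Python. It is an illustration under stated assumptions and not the MIDAR code: equal intensities are assumed to share a color level, the lowest rank is mapped to 0 and the highest to 255, and the name rank_to_levels and the sample intensities are made up.

```python
def rank_to_levels(intensities, levels=255):
    """Map each intensity to a color level in 0..levels based on its rank."""
    distinct = sorted(set(intensities))
    if len(distinct) == 1:
        return [0 for _ in intensities]
    # Rank of each distinct value, scaled so that the largest maps to `levels`.
    rank = {value: i for i, value in enumerate(distinct)}
    top = len(distinct) - 1
    return [round(rank[v] * levels / top) for v in intensities]

raw = [12, 40000, 150, 900, 150]
shades = rank_to_levels(raw)
```

For these sample values a direct linear scaling would place 150 and 900 at levels of about 1 and 6, almost indistinguishable from the lowest spot, whereas the rank-based mapping spreads them evenly between 0 and 255.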

Microarray analysis generally involves conducting experiments on the same chip

at different time points and under different conditions. All of this data needs to be studied

to understand which genes are expressed under which conditions. We currently

provide for comparison of custom arrays across different time frames. Users can

associate one set of experiment values with the chip in one color (green), and then

associate another set of experiment values in a different color (red). A logarithmic ratio

of the two sets of experiment values is then computed and associated with the chip. The

spots that are close to either green or red are the ones of interest, as the corresponding

genes are expressed under one condition and under-expressed under the other. The spots

close to yellow are expressed under both conditions and may not be of much interest.
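The comparison described above can be sketched as a mapping from a log ratio to a color. This Python sketch is an illustration and not the MIDAR code. The use of base 2, the clamp at a log ratio of plus or minus 3, the linear blend between green, yellow and red, and the name ratio_color are all assumptions.

```python
import math

def ratio_color(green_value, red_value, max_abs_log2=3.0):
    """Return an (R, G, B) color for a spot from two positive intensity values.

    log2(red/green) near 0 gives yellow; strongly positive ratios tend to red,
    strongly negative ratios tend to green.
    """
    m = math.log2(red_value / green_value)
    # Clamp to [-max_abs_log2, +max_abs_log2] and rescale to [-1, 1].
    t = max(-max_abs_log2, min(max_abs_log2, m)) / max_abs_log2
    red = 255 if t >= 0 else round(255 * (1 + t))
    green = 255 if t <= 0 else round(255 * (1 - t))
    return (red, green, 0)

equal = ratio_color(500, 500)        # same intensity in both experiments
more_red = ratio_color(100, 800)     # much stronger in the "red" experiment
more_green = ratio_color(800, 100)   # much stronger in the "green" experiment
```

Taking the logarithm makes an eightfold increase and an eightfold decrease equally far from yellow, which a plain ratio would not.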

5. MIDAR – A SCENARIO

Let us start with designing a custom microarray. The application opens with an

input form (Fig. 5.1), in which the user can specify the required inputs. After specifying

the chip description and ownership, the user can choose genes either from the standard

GO database or from the custom gene groups. The genes in GO are grouped in terms of

biological process, cellular component and molecular function. The user starts by

specifying the genome type of interest, followed by the GO term type. The user

can then scroll down the list or search for a GO term using the search box. The same

applies to selecting custom gene groups. The user can then specify other parameters such

as number of replicates, positive and negative controls, and number of columns and rows.

Figure 5.1 – Screenshot Depicting User Input for Chip Design

After specifying the input parameters, the user needs to submit them for the chip

design by clicking on the “Submit Chip Data” button. The following figure (Fig. 5.2) is

generated showing the chip design along with the chip description and other information.

As the user moves the mouse over the spots on the chip, the gene present at each spot

is displayed. Clicking on any spot of specific interest lists further details, such as the

primer/oligo sequences and their corresponding data. The user can also connect

to the NCBI webpage pertaining to that specific spot.

Figure 5.2 – Screenshot Depicting the Custom Array Design and the Primer/Oligo

Information of a Specific Spot on the Array

A custom array contains a large number of spots. One would want to know where a

particular gene or gene group is located on the chip. The following screenshot (Fig. 5.3)

shows this feature. Users type the genes of interest into the input box,

choose a color to mark them and submit the information. The application then

marks all the spots on the array associated with the gene or gene group.

Figure 5.3 – Screenshot Depicting Marked Spots Associated with a Gene of Interest

The following screenshot (Fig. 5.4) shows the result after activating the custom array.

Users associate the custom array chip design with the experiment values obtained by

conducting the experiment on a real custom array of the same design. The color of the spots

shows the intensity values obtained by scanning the custom array after the experiment

was done. The intensity depicts the expression level of a gene, and one can readily see

which genes are overexpressed and which are underexpressed.

Figure 5.4 – Screenshot Depicting the Activated Custom Array

MIDAR currently provides for comparison of the intensity values of a custom array

obtained under two different conditions (Fig. 5.5). This feature readily identifies the

genes of interest for this custom array. The spots that are closer to green or closer to red

indicate genes that are expressed in one sample and under-expressed in the other.

The other spots (closer to yellow) are expressed in both samples and are thus not of

interest.

Figure 5.5 – Screenshot Depicting a Comparative Analysis of a Single Custom Array

from Two Different Experiments

6. FUTURE – MICROARRAYS AND MIDAR

Human beings are gene machines, responsive to a myriad of genetic and

environmental signals, and these signals are integrated by each person to produce

resultant phenotypes that manifest as mental and physical wellness, normal and aberrant

behavior, disease predisposition, longevity, and so forth. Medical practice has succeeded

in identifying and curing disease using a host of clinical procedures, but medicine is

lacking in the sense that many illnesses, therapies, and drugs are poorly understood at the

genomic level.

The use of microarray technology in a clinical setting holds the promise of

providing detailed molecular information to the physician, enabling more informed

decision-making and better health care. Chips could be used to profile patients for

predisposition to drug and alcohol abuse, eating disorders, depression and schizophrenia.

The biochemical and physiological changes that accompany physical exercise, poor diet,

substance abuse, antidepressant and antipsychotic medications, and many other

environmental and chemical stimuli that alter body chemistry are probably detectable by

microarray. As microarray analysis is used extensively in clinical research and genetic

screening, testing and diagnostics, it makes sense to consider whether future visits to the

doctor’s office might include one or more microarray tests as components of the physical

examination. Microarrays have great potential, and a general awareness needs to be created

among young students. MIDAR can serve as a model for a virtual “in silico” experiment.

We would need to be able to apply a virtual probe set in which most of the genes on the

chip have matching probes that are assigned random values. We would then "spike" the

values for our genes of interest, say 20 genes or so. We therefore need to develop two tools

that would allow this in silico modeling. These would include:

• a way of generating the virtual probe set and assigning random values for

all genes in the probe set

• a way of generating a sub list from the virtual probe set for our genes of

interest, and a way of re-assigning or "spiking" their values.
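The two proposed tools can be sketched as follows. This is a hypothetical Python sketch of future work and not an existing MIDAR feature. The function names, the uniform distribution of the random values, the value range and the single spiked value are all assumptions made for illustration.

```python
import random

def make_virtual_probe_set(gene_ids, low=0.0, high=1000.0, seed=None):
    """Assign a random value to every gene in the probe set."""
    rng = random.Random(seed)
    return {g: rng.uniform(low, high) for g in gene_ids}

def spike(probe_set, genes_of_interest, value):
    """Return a copy of probe_set with the chosen genes re-assigned to `value`."""
    missing = [g for g in genes_of_interest if g not in probe_set]
    if missing:
        raise KeyError(f"not in probe set: {missing}")
    spiked = dict(probe_set)
    for g in genes_of_interest:
        spiked[g] = value
    return spiked

genes = [f"gene{i}" for i in range(100)]
baseline = make_virtual_probe_set(genes, seed=42)
favorites = genes[:20]                      # the 20 or so genes of interest
experiment = spike(baseline, favorites, value=5000.0)
```

Keeping the baseline and the spiked set as separate dictionaries would let a class compare the two side by side, in the same way that two real experiments are compared on one chip.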

We see this particular use of the tool as a way to market it to biology classes throughout

the US and beyond.

MIDAR also needs to expand on the analysis of microarrays. Scientists

seeking to harness the potential of microarrays are often challenged by the prodigious

quantities of data produced. Support is needed for quantizing and

normalizing these huge datasets. Data obtained from microarrays reflect various

conditions (time course, drug dosage, tissue specificity) and different time frames. If the

datasets are to be compared and studied, they need to be normalized so that they all stand

on the same footing. Various data visualization techniques also need to be developed to

better illustrate the analysis of experiment data.

In all, MIDAR is a community-driven solution developed to aid the design and

testing of custom chips and the evaluation and analysis of commercial chips.

REFERENCES

[1] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304-1351, Feb. 2001.

[2] C. Fields, M. D. Adams, O. White, and J. C. Venter, "How many genes in the human genome?," Nat. Genet., vol. 7, no. 3, pp. 345-346, July 1994.

[3] J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen, Y. A. Su, and J. M. Trent, "Use of a cDNA microarray to analyse gene expression patterns in human cancer," Nat. Genet., vol. 14, no. 4, pp. 457-460, Dec. 1996.

[4] E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, no. 6822, pp. 860-921, Feb. 2001.

[5] "From Blueprint to you," NCBI, 2005.

[6] W. Gilbert, "Why genes in pieces?," Nature, vol. 271, no. 5645, p. 501, Feb. 1978.

[7] J. D. Watson and F. H. Crick, "A structure for deoxyribose nucleic acid. 1953," Nature, vol. 421, no. 6921, pp. 397-398, Jan. 2003.

[8] F. Crick, "Central dogma of molecular biology," Nature, vol. 227, no. 5258, pp. 561-563, Aug. 1970.

[9] S. Brenner, "The human genome: the nature of the enterprise," Ciba Found. Symp., vol. 149, pp. 6-12, 1990.

[10] M. D. Adams, J. M. Kelley, J. D. Gocayne, M. Dubnick, M. H. Polymeropoulos, H. Xiao, C. R. Merril, A. Wu, B. Olde, R. F. Moreno, et al., "Complementary DNA sequencing: expressed sequence tags and human genome project," Science, vol. 252, no. 5013, pp. 1651-1656, June 1991.

[11] M. S. Boguski, T. M. Lowe, and C. M. Tolstoshev, "dbEST--database for "expressed sequence tags"," Nat. Genet., vol. 4, no. 4, pp. 332-333, Aug. 1993.

[12] E. D. Green and P. Green, "Sequence-tagged site (STS) content mapping of human chromosomes: theoretical considerations and early experiences," PCR Methods Appl., vol. 1, no. 2, pp. 77-90, Nov. 1991.

[13] M. Schena, Microarray Analysis. John Wiley & Sons, 2005.

[14] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, "Quantitative monitoring of gene expression patterns with a complementary DNA microarray," Science, vol. 270, no. 5235, pp. 467-470, Oct. 1995.

[15] M. S. Boguski, C. M. Tolstoshev, and D. E. Bassett, Jr., "Gene discovery in dbEST," Science, vol. 265, no. 5181, pp. 1993-1994, Sept. 1994.

[16] D. J. Lockhart, H. Dong, M. C. Byrne, M. T. Follettie, M. V. Gallo, M. S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and E. L. Brown, "Expression monitoring by hybridization to high-density oligonucleotide arrays," Nat. Biotechnol., vol. 14, no. 13, pp. 1675-1680, Dec. 1996.

[17] L. Zhang, W. Zhou, V. E. Velculescu, S. E. Kern, R. H. Hruban, S. R. Hamilton, B. Vogelstein, and K. W. Kinzler, "Gene expression profiles in normal and cancer cells," Science, vol. 276, no. 5316, pp. 1268-1272, May 1997.

[18] L. H. Saal, C. Troein, J. Vallon-Christersson, S. Gruvberger, A. Borg, and C. Peterson, "BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data," Genome Biol., vol. 3, no. 8, p. SOFTWARE0003, July 2002.

[19] C. A. Ball, I. A. Awad, J. Demeter, J. Gollub, J. M. Hebert, T. Hernandez-Boussard, H. Jin, J. C. Matese, M. Nitzberg, F. Wymore, Z. K. Zachariah, P. O. Brown, and G. Sherlock, "The Stanford Microarray Database accommodates additional microarray platforms and data formats," Nucleic Acids Res., vol. 33, no. Database issue, pp. D580-D582, Jan. 2005.

[20] P. J. Killion, G. Sherlock, and V. R. Iyer, "The Longhorn Array Database (LAD): an open-source, MIAME compliant implementation of the Stanford Microarray Database (SMD)," BMC Bioinformatics, vol. 4, p. 32, Aug. 2003.

[21] T. B. Knudsen and G. P. Daston, "MIAME guidelines," Reprod. Toxicol., vol. 19, no. 3, p. 263, Jan. 2005.

[22] M. Kanehisa and S. Goto, "KEGG: kyoto encyclopedia of genes and genomes," Nucleic Acids Res., vol. 28, no. 1, pp. 27-30, Jan. 2000.

[23] E. C. Rouchka, A. Khalyfa, and N. G. Cooper, "MPrime: efficient large scale multiple primer and oligonucleotide design for customized gene microarrays," BMC Bioinformatics, vol. 6, p. 175, July 2005.

CURRICULUM VITAE

RAVI SHRIKANTH GUNDLAPALLI

Date of Birth July 20, 1981

Place of Birth Hyderabad, India

Undergraduate Study Jawaharlal Nehru Technological University, B.Tech. Computer Science and Information Technology, 1999-2003

Graduate Study University of Louisville, M.S. Computer Engineering and Computer Science, 2003-2005

Experience Intern, GE Energy - Somersworth NH (September 2004 – August 2005)

Student Assistant – University of Louisville (June 2004 – August 2004)
