Exploitation of sequence and DNA microarray data

Order no.: 00000
Université d’Evry-Val d’Essonne
THESIS
Submitted for the degree of Doctor in Computer Science, Speciality: Bioinformatics
by Ana Carolina Elisa FIERRO GUTIERREZ
Doctoral School: Des Génomes aux Organismes
Exploitation of sequence and DNA microarray data for the study of the transcriptome
To be defended on 20 November 2007, before a jury composed of:
Tijl De Bie, Reviewer
Philippe Dessen, Reviewer
Pascal Barbry, Examiner
Kathleen Marchal, Examiner
Maurice Wegnez, Examiner
Gilles Bernot, Thesis co-supervisor
Nicolas Pollet, Thesis co-supervisor




Acknowledgements


Introduction

The genome of a species is a "fundamental invariant" that carries information transmitted from generation to generation, ensuring the production of the biological macromolecules needed for the physiology of organisms. From a molecular point of view, genome expression proceeds through the transcription of DNA into RNAs, obligatory intermediates that allow the synthesis of biological macromolecules. Consequently, measuring the number and type of RNA molecules yields clues about the function of the corresponding genes. Over the last two decades, various techniques have been developed to make such measurements on a genome-wide scale, because the transcriptomic approach is simpler than studying gene expression directly through protein quantification.

However, the transcriptome has a level of complexity that was underestimated when this type of study began. Massive cDNA sequencing experiments and high-throughput transcriptome studies have revealed this complexity, which includes alternative splicing, non-coding RNAs and microRNAs, among others. In addition, DNA microarrays have contributed substantially to functional genomics because they have made it possible to acquire numerous measurements that give clues about gene expression. They are nowadays one of the preferred means for high-throughput studies.

The objective of this work was to analyse the transcriptome in the biological context of Xenopus tropicalis metamorphosis, in order to fill existing gaps in information by making optimal use of the available resources. This thesis is thus an illustration of applied bioinformatics and shows several aspects of contemporary methodologies.

Xenopus tropicalis has become a model organism for amphibian genomics because it has a simpler genome than the classical model Xenopus laevis. At the beginning of this thesis, resources were still very limited, but in recent years more than a million EST sequences have been made available and the genome of this organism is being sequenced. However, the nervous system and metamorphosis remain to be explored, which is why two techniques for analysing this transcriptome were used: partial cDNA sequencing (ESTs), which is sequence-based, and DNA microarrays, which are hybridization-based.

The first chapter introduces the techniques most widely used for large-scale transcriptome studies. cDNA sequencing and DNA microarrays are described in more detail in order to present how the experimental data are obtained. The aim of this chapter is to describe the concepts needed to understand the analyses described thereafter.

The second chapter presents the usefulness of cDNA sequences for analysing the Xenopus tropicalis transcriptome. More specifically, the nervous system during embryogenesis and metamorphosis is explored in order to produce a resource for the functional genomics of this organism. This chapter describes the biological analysis (article published in BMC Genomics) as well as the web resource developed and the steps followed to process the data.

The third chapter shows the use of DNA microarrays to study gene expression during X. tropicalis metamorphosis. We describe the bioinformatic challenges encountered during this study as well as the biological problem.


The fourth chapter deals with the use of different experimental microarray strategies and the reconstruction of expression profiles. The potential advantages of alternative strategies depend largely on the success of the profile reconstruction. It is therefore necessary to evaluate microarray analysis methods in order to determine which approach is best. The study carried out to compare methods for reconstructing expression profiles from complex experimental designs is presented (article submitted to BMC Bioinformatics).

The last chapter offers a review of ESTs, DNA microarrays and the state of the art in integrating data from these techniques. Public databases open the way to integration, so as to make optimal use of the transcriptional resources that a research laboratory can obtain. The chapter describes three routes to integration: integration of data produced by several laboratories or research groups, integration of data across multiple organisms, and integration with a variety of other sources, such as ChIP-on-chip data, protein interaction studies, motif searches, etc.

Trained as a civil engineer specialized in computer science and a graduate of the Universidad de Chile (Chile), I came to study in France within the DEA "Applications des Mathématiques et de l'Informatique à la Biologie" (Génopole). Following this training, I chose to continue my studies and undertake a doctoral thesis. This work was designed to be multidisciplinary, with joint computer-science and biology supervision. It was carried out in the laboratory of the Programme d’Epigénomique in Evry (G. Bernot) for the computer-science part and in the Développement et Evolution laboratory in Orsay (N. Pollet), where I acquired the biological knowledge and learned about the computational needs of biologists. This laboratory also provided the data analysed in this thesis. Finally, an internship in Kathleen Marchal's bioinformatics team, at the Department of Microbial and Molecular Sciences of the Katholieke Universiteit Leuven (Belgium), allowed me to deepen my knowledge of microarray analysis methods and to carry out the comparative study presented in this thesis.


Table of contents

CHAPTER 1: LARGE-SCALE TECHNIQUES TO EXPLORE THE TRANSCRIPTOME
1.1 Large-scale techniques
1.2 Partial cDNA sequencing: Expressed Sequence Tags (ESTs)
1.3 Microarray technology
1.3.1 Two-channel arrays
1.3.2 Single-channel arrays
1.3.3 Experimental design
1.3.4 Microarray analysis

CHAPTER 2: EXPLORING THE TRANSCRIPTOMES USING ESTS
2.1 Exploring the nervous system transcriptomes during embryogenesis and metamorphosis in Xenopus tropicalis using EST analysis
2.2 XTScope: Xenopus tropicalis EST, a web resource for the nervous system
Database content and data production
Implementation and Architecture
Web interface
Extensions
Conclusion

CHAPTER 3: MICROARRAYS TO STUDY THE TRANSCRIPTOME
3.1 Bioinformatic issues
Experimental design
Analysis steps for the microarray experiment
3.2 Xenopus tropicalis metamorphosis transcriptomes analysis using microarrays

CHAPTER 4: EVALUATION OF TIME PROFILE RECONSTRUCTION FROM COMPLEX TWO-COLOR MICROARRAY DESIGNS

CHAPTER 5: FROM GENE EXPRESSION PROFILES TOWARDS DATA INTEGRATION
5.1 Background
5.2 Gene expression profiles from transcriptomic experiments
Gene expression profiles from microarrays
Gene expression profiles from ESTs
ESTs versus Microarrays
5.3 Data Integration
Data integration within the same organism
Data integration across species
Integration across different data sources

CONCLUSION AND PERSPECTIVES

REFERENCES


Chapter 1: Large-scale techniques to explore the transcriptome

The aim of this chapter is to describe the techniques most commonly used to study the transcriptome. Partial cDNA sequencing and DNA microarray technology are described in more detail, because the experimental data analysed in this thesis come from these two techniques. The basic principles and technical aspects are covered in the chapter.


1.1 Large-scale techniques

A whole range of techniques enables the measurement of transcript levels on a genome-wide scale. Whereas microarray analysis is hybridization-based, others are sequence- and/or fragment-based. ESTs, SAGE, and MPSS are examples of sequence-based techniques, whereas cDNA-AFLP is fragment-based (for expression level) and sequence-based (for identity).

In the microarray technique (DeRisi et al., 1996; Shalon et al., 1996), labelled cDNA targets representing the mRNA population of interest are hybridized with a large number of probes that have been immobilized on a substrate, such as a glass, plastic or silicon chip, forming an array. The purpose is expression profiling: monitoring expression levels for thousands of transcripts simultaneously. Measuring gene expression using microarrays is relevant to many areas of biology and medicine, such as studying treatments, disease, and developmental stages. For example, microarrays can be used to identify disease genes by comparing gene expression in diseased and normal cells.

Gene expression measurement by high-throughput partial sequencing of cDNAs, also called Expressed Sequence Tags or ESTs (Adams et al., 1991; Okubo and Matsubara, 1997), involves counting the ESTs that are sequenced per transcript. As it does not rely on previous sequence information, it has mainly been a valuable technique for “gene discovery”.

Serial Analysis of Gene Expression or SAGE (Velculescu et al., 1995) reduces the DNA sequencing effort by sequencing concatenated tags derived from transcripts. SAGE is based on counting sequence tags of 14 bp from cDNA libraries. Unlike ESTs, SAGE requires that the genome sequence of the organism or a substantial cDNA sequence database be available in order to identify the corresponding genes. To facilitate target identification, the LongSAGE method was developed by Saha et al. (2002). LongSAGE generates 21 bp tags, which allow unique assignment of tags to genomic sequences.
However, as for the EST-based approach, quantification of lowly expressed genes requires sequencing a large number of tags, which implies a high cost.

Massively Parallel Signature Sequencing or MPSS (Brenner et al., 2000) improves on SAGE as it is a parallel sequencing method that can generate 100-1000 short sequence signatures in a single analysis. It also generates longer (16-20 bp) signatures to make gene identification more accurate. However, this method is technically demanding.

The cDNA-AFLP technique (Bachem et al., 1996) applies the standard Amplified Fragment Length Polymorphism or AFLP protocol (Vos et al., 1995), as described for genomic DNA, to a cDNA template. This low-cost procedure involves cleavage of the cDNA population by two restriction enzymes, followed by adaptor ligation to these fragments to allow PCR amplification. The amplified fragments are then displayed as a banding pattern on a sequencing gel. The differences in the intensity of the bands provide a good measure of the relative differences in the level of gene expression. cDNA-AFLP does not require prior sequence information. Furthermore, the sensitivity and specificity of the method allow the detection of poorly expressed genes and the determination of subtle differences in transcriptional activity. However, cDNA-AFLP needs a large number of PCR reactions to generate a global overview of gene expression.
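The tag lengths quoted above for SAGE (14 bp) and LongSAGE (21 bp) can be put in perspective with a back-of-the-envelope calculation, and the counting principle shared by the sequence-based methods can be shown on a toy library. This is only a sketch: it assumes tags are drawn from all four bases at every position (real transcriptomes are not uniform), the "tags per million" scaling is an illustrative normalization not taken from the text, and the function names and tag sequences are made up.

```python
from collections import Counter

def tag_space(length: int) -> int:
    """Number of distinct DNA tags of the given length (4 bases per position)."""
    return 4 ** length

def tags_per_million(tags):
    """Count identical tags and scale each count to tags per million sequenced."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count * 1_000_000 / total for tag, count in counts.items()}

# Size of the tag space for SAGE (14 bp) and LongSAGE (21 bp).
for name, length in [("SAGE", 14), ("LongSAGE", 21)]:
    print(f"{name}: {tag_space(length):,} possible tags")

# Made-up toy library of four sequenced 14 bp tags.
library = ["CATGAAACCCGGGT", "CATGAAACCCGGGT", "CATGTTTGGGAAAC", "CATGAAACCCGGGT"]
profile = tags_per_million(library)
print(profile)
```

The much larger space of 21 bp tags is one way to see why longer tags are less likely to match several genomic locations by chance.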


Of all the techniques described, microarrays are the most commonly used to study gene expression in living organisms. Microarrays have become an important tool for biological studies, as demonstrated by the exponential growth in microarray-related publications. The identification of ESTs has nevertheless proceeded rapidly, with approximately 37 million ESTs now available in public databases (e.g. GenBank 7/2006), offering a rich data source for gene expression profiling. In many cases, in particular when the genome sequence of an organism is unknown, EST sequencing is used before microarray technology, since probe design requires transcript sequences to be known in advance.

1.2 Partial cDNA sequencing: Expressed Sequence Tags (ESTs)

Expressed Sequence Tags (ESTs) are DNA sequences obtained by sequencing the 5’ and/or 3’ ends of randomly isolated cloned cDNAs representing gene transcripts, i.e. they correspond to a portion of an entire transcript. In this section we describe the process from obtaining the mRNA samples, through the sequencing step, to the reconstruction of the mRNA sequences.

Using mRNA to generate cDNA libraries

Isolating mRNA from specific tissues or the whole organism is key to finding expressed genes in the vast expanse of the genome. The problem, however, is that mRNA is unstable outside of a cell. Moreover, there is no way to sequence RNA molecules directly in a manner analogous to the sequencing of DNA molecules. Therefore, an enzyme called reverse transcriptase is needed to convert mRNA to complementary DNA (cDNA). cDNA is a much more stable compound and, importantly, because it is generated from an mRNA from which the introns have been removed, cDNA represents only expressed DNA sequence. However, the reverse transcriptase will eventually fall off the template, which terminates cDNA synthesis and generates fragments that represent portions of the original mRNA.
A cDNA library is a collection of cDNA molecules generated from the mRNAs contained within a cell or tissue. The complexity of a cDNA library can be estimated from the number of cDNA clones obtained. This complexity can range from tens of thousands to more than a million cDNA clones. Usually, several hundred to several thousand clones are isolated at random and tag-sequenced from a given cDNA library.

The first step is the isolation and purification of mRNA from a tissue sample. Most often, those RNA molecules with a poly(A) tail are isolated. A poly(T) primer is used to start the reverse transcription (see Figure 1). Then, the cDNA molecules are stored in a vector, which is a circular piece of DNA used to stably propagate cDNA clones within a host (typically bacteria).


Figure 1: Construction of a cDNA library from a tissue sample. mRNA is isolated and purified. Then a poly(T) primer is used to make cDNA copies.

cDNA sequencing: Expressed Sequence Tags

A single sequencing read is obtained from a given clone, from one or both ends of the cDNA insert, using universal primers which are complementary to the vector at the multiple cloning site. Sequence data in the form of trace files are produced by automatic sequencers. The traces are usually displayed in the form of chromatograms consisting of four curves of different colors, each curve representing the signal for one of the four bases (see Figure 2).

Figure 2: The raw data in a chromatogram file can be viewed as four sets of overlapping peaks, one each for the A, C, G and T sequencing reactions. Each curve represents a base, and for each position the corresponding nucleotide needs to be inferred from the peaks.

An idealized trace would consist of evenly spaced, non-overlapping peaks, each corresponding to the labeled fragments that terminate at a particular base in the sequenced strand. Real traces, however, deviate from this ideal for a variety of reasons. The corresponding nucleotides are determined in the base-calling step.

Base calling and preprocessing

The purpose of base calling is to determine the nucleotide sequence on the basis of the multi-color peaks in the sequence trace (see Figure 2), i.e. the processed trace is translated into a sequence of bases. As mentioned before, traces (and regions within a trace) are of variable quality, and therefore the fidelity of "called" nucleotides is also variable. The accuracy of each called base is expressed by a base quality value. Base-calling programs, such as Phred (Ewing et al., 1998a; Ewing et al., 1998b), provide these base quality values to help evaluate sequence accuracy realistically.
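Base quality values of the Phred type are conventionally related to the estimated probability P that a base call is wrong by Q = -10 log10(P), so a quality of 20 corresponds to one expected error in 100 bases. The text above does not spell out this relation, so the sketch below is an illustration of the standard convention rather than of Phred's internals; the function names and the quality list are made up.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred-style quality value Q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred-style quality value for an error probability P."""
    return -10 * math.log10(p)

def expected_errors(qualities) -> float:
    """Expected number of miscalled bases in a read, given per-base qualities."""
    return sum(phred_to_error_prob(q) for q in qualities)

# Made-up per-base qualities for a short stretch of a read.
read_qualities = [10, 20, 30, 40]
for q in read_qualities:
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
print("Expected errors:", round(expected_errors(read_qualities), 4))
```

Summing the per-base error probabilities gives a simple way to compare the reliability of whole reads or of regions within a read.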


The “called” sequences are subject to errors caused by compressions and base-calling problems, resulting in frameshifts or wrong calls. Sequences have regions of high quality very close to regions of low quality. For these reasons, sequence quality needs to be assessed before any further analysis step such as EST assembly. EST data are generally considered to be more sensitive to sequencing errors because the depth of sequencing is variable for a given transcript. However, compared to genome assembly, repeats are less of a problem for assembling an individual gene, since the coding sequence of a gene is unlikely to contain repeats. A commonly used program for repeat and vector masking is RepeatMasker (http://www.repeatmasker.org). Publicly available packages such as Lucy (Chou and Holmes, 2001) can be used to perform quality assessment: they mask repeats and vector sequences and remove poor-quality regions.

Only 300-500 readable bases are produced from each sequence read, and yet a full gene transcript may be several thousand bases long. ESTs thus provide a "tag-level" association with an expressed gene sequence, and an assembly step is needed in order to reconstruct the cDNA sequence.

EST assembly and annotation

Sequence assembly refers to aligning and merging many fragments of a much longer DNA sequence in order to reconstruct the original sequence. Since ESTs represent mRNA sequences, ideally the assembly step will reconstruct one sequence for each mRNA. However, the assembly step is complicated by features such as read quality, repeated regions, alternative splicing, single-nucleotide polymorphisms, and post-transcriptional modification. The assembler returns contigs and singletons, where contigs are consensus sequences resulting from the alignment of several ESTs, while singletons are ESTs that could not be aligned with other ESTs.
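The distinction between contigs and singletons can be illustrated with a deliberately simplified clustering of reads by exact suffix-prefix overlap. This is a toy sketch, not the algorithm of any real assembler: it ignores sequencing errors, quality values and reverse-complement reads, it only groups reads without building a consensus sequence, and the function names, the minimum overlap of 5 bases and the read sequences are all made up for the example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def cluster_reads(reads, min_len=5):
    """Group reads that are linked by overlaps; return (clusters, singletons)."""
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j and overlap(a, b, min_len) > 0:
                parent[find(i)] = find(j)

    groups = {}
    for i, read in enumerate(reads):
        groups.setdefault(find(i), []).append(read)
    clusters = [g for g in groups.values() if len(g) > 1]
    singletons = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, singletons

# Made-up reads: the first two share the 5-base overlap "CATTG".
reads = ["ATGGCCATTG", "CATTGTAATG", "GGGTTTCCCA"]
clusters, singletons = cluster_reads(reads, min_len=5)
print("clusters:", clusters)
print("singletons:", singletons)
```

In a real assembler each cluster would then be aligned and collapsed into a consensus contig, while unclustered reads are reported as singletons.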
Examples of assemblers used for EST assembly are Phrap (http://www.phrap.org) and CAP3 (Huang and Madan, 1999). Other assemblers, like Arachne (Batzoglou et al., 2002), are designed for genomic assemblies.

Once the ESTs are assembled into contigs and singletons, the aim is to identify the gene to which each corresponds. In the annotation step, contigs are compared to known data to determine similarities and putative functions. Usually contigs are compared by sequence similarity to public databases, such as the non-redundant database at NCBI or SWISS-PROT (Boeckmann et al., 2003), in order to assign contigs to known genes. When the sequence similarity is low, protein domains are used to extract more information from the contig sequence, using tools like InterProScan (Quevillon et al., 2005).

Given the large number of public ESTs for several organisms, some public gene indices are available, such as UniGene at NCBI and the Gene Indices at DFCI (formerly TIGR) (Quackenbush et al., 2000). These gene indices provide contigs or clusters based on the current publicly available EST data. More details about EST preprocessing and assembly can be found in Chapter 2.

Utility of ESTs

The current understanding of the gene set of human and other organisms includes thousands of genes whose existence is based solely on EST evidence. In this respect, ESTs become a tool to refine the predicted transcripts for those genes, which leads to prediction of their protein products, and eventually of their function.

Since cDNA libraries can be prepared from different tissues or developmental stages of a single organism, this approach can be useful for the construction of catalogues of tissue-specific or stage-specific genes. The identification of ESTs has proceeded rapidly, with approximately 37 million ESTs now available in public databases (e.g. GenBank 7/2006). ESTs contain enough information to design precise probes for DNA microarrays that can then be used to determine gene expression. In addition, the situation in which those ESTs are obtained (tissue, organ, disease state such as cancer) gives information on the conditions in which the corresponding gene is active. This information is also useful to assess gene expression.

1.3 Microarray technology

Microarrays enable the quantification of transcript levels on a global scale: transcript abundance is measured for thousands of molecules simultaneously. Microarray technology quantifies RNAs that may or may not be translated into active proteins, but measuring gene expression genome-wide at the protein level is more difficult.

Microarrays fall mainly into two classes: two-channel arrays and single-channel arrays. For two-channel arrays, two labeled samples are hybridized onto a single slide, resulting in two separate intensities. These intensity values are often reported as log-ratios. For single-channel arrays, only one sample is hybridized, which results in absolute measurements. This major difference implies that data from the two types of platform have to be treated differently and require specific normalization methods. A further categorization can be made based on the probes used on the slides.

In the following sections the different microarray types are presented. We restrict ourselves to describing the analysis steps for the technique that was used in this work.

1.3.1 Two-channel arrays

The first step in building a microarray is to select the probes that will be printed on the arrays. In two-channel microarrays the probes can be DNA molecules (or analogs such as PNA or LNA) of known or unknown sequence: oligonucleotides, cDNAs or small fragments of PCR products corresponding to mRNAs. Oligonucleotide arrays are created either by pre-synthesizing the oligonucleotides and printing them on the substrate, or by synthesizing the oligonucleotides in situ directly on the substrate.

The DNA molecules are spotted on the array by a set of pins. These pins dip into the DNA mixture to take up a small amount of DNA and then deposit it on the microarray surface by contact with the substrate. When the DNA molecules used are double-stranded, the array is heated so that the DNA is denatured and can bind to complementary strands.

In microarray technology, the targets are always cDNA molecules (usually single-stranded) obtained from mRNA through reverse transcription and amplified by PCR to obtain sufficient material to print on a slide.


Two-channel arrays are hybridized with cDNA targets from the two samples to be compared (e.g. patient and control). These two samples are labeled with a red and a green fluorescent dye, called Cy5 and Cy3 respectively. If a cDNA target is present in one or both of the samples, it will bind to its complementary probe. If it is present in both samples, the spot will emit both a Cy3 and a Cy5 fluorescent signal. If it is present in only one of the samples, either a Cy5 or a Cy3 intensity will be measured, depending on which sample it was found in. If it is absent from both samples, it will not hybridize and no signal will be emitted. Before the intensities can be measured, the array is washed to remove the unbound targets. A scanner is then used to acquire the Cy3 and Cy5 signals. These signals then have to be preprocessed for quantification, which is described in section 3.4, and they are often reported as log-ratios.
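The log-ratios mentioned above can be computed per spot from the two scanned intensities. The sketch below uses the common M/A convention, with M the log2 ratio of Cy5 to Cy3 and A the average log2 intensity; this naming and the choice of Cy5 as numerator are conventions assumed for the example, not something fixed by the text, and the gene names and intensities are made up. No background correction or normalization is applied.

```python
import math

def log_ratio(cy5: float, cy3: float) -> float:
    """M value: log2 of the red (Cy5) over green (Cy3) intensity."""
    return math.log2(cy5 / cy3)

def mean_log_intensity(cy5: float, cy3: float) -> float:
    """A value: average of the two log2 intensities."""
    return 0.5 * (math.log2(cy5) + math.log2(cy3))

# Made-up spot intensities: (gene, Cy5, Cy3).
spots = [("geneA", 4000.0, 1000.0), ("geneB", 500.0, 2000.0), ("geneC", 1200.0, 1200.0)]
for gene, cy5, cy3 in spots:
    m = log_ratio(cy5, cy3)
    a = mean_log_intensity(cy5, cy3)
    print(f"{gene}: M = {m:+.2f}, A = {a:.2f}")
```

On the log2 scale a fourfold increase and a fourfold decrease are symmetric (+2 and -2), which is one reason ratios are usually reported this way.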

Figure 3: Two-channel microarray experiment. cDNA is synthesized from two different RNA samples. The two samples are labeled with Cy3 (green) and Cy5 (red) and hybridized on the same array. The array is scanned and the image is then analysed. If a gene is expressed in a sample and present on the array, its cDNA will bind to the corresponding probe and a Cy3 or Cy5 signal will be emitted, depending on the sample in which the gene is expressed.

1.3.2 Single-channel arrays

For single-channel arrays, only one sample is hybridized, which results in more or less absolute measurements. In this case, the probes are designed to match parts of the sequence of known or predicted mRNAs. Designs that cover complete genomes are commercially available from companies such as GE Healthcare, Affymetrix, Ocimum Biosolutions, or Agilent. These microarrays give estimates of the absolute value of gene expression, and therefore the comparison of two conditions requires the use of two separate microarrays.


Affymetrix is probably the most widely used short-oligonucleotide single-channel platform and, with its specific probe design, it requires a completely different normalization strategy. Affymetrix uses short oligonucleotides that are synthesized on the slide by using a set of masks. By covering the slide with a mask, a selection of positions on the chip is exposed, and light is then used to activate these unprotected sites. At all these activated positions nucleotides will bind, resulting in the synthesis of one nucleotide at the positions chosen with the mask. A new mask is then applied and the same process is repeated. After several rounds, the desired set of probes is obtained. Many different masks are required in this way, which makes the Affymetrix chip rather expensive.

On an Affymetrix chip, a gene is no longer represented by one single DNA molecule, but by a probe set consisting of 11 to 20 probe pairs. Each probe pair is composed of two short oligonucleotides of length 25 nt. One matches a part of the sequence of the gene and is called the perfect match (PM). The second oligonucleotide has the same sequence as the perfect match, except for a single mismatch in the middle of the oligonucleotide (at the 13th position), and is therefore called the mismatch (MM) probe. These mismatch probes are assumed to measure nonspecific binding and are therefore often used as a kind of background correction. A disadvantage of this setup is that these oligonucleotides are so short that they are sometimes not gene-specific.
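The probe-pair layout described above can be sketched in a few lines. The example assumes that the mismatch is created by replacing the 13th base with its complement, which the text does not specify, and it summarizes a probe set by a naive average of PM minus MM differences, which is only an illustration of using MM as background and not the normalization of any particular software package. The probe sequence, the intensities and the function names are made up.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def mismatch_probe(pm: str, position: int = 13) -> str:
    """Return the MM probe: the 25-mer PM with the base at the 1-based position complemented."""
    if len(pm) != 25:
        raise ValueError("expected a 25-nt perfect-match probe")
    i = position - 1
    return pm[:i] + COMPLEMENT[pm[i]] + pm[i + 1:]

def probe_set_signal(pm_intensities, mm_intensities) -> float:
    """Naive probe-set summary: mean of the PM - MM differences over all probe pairs."""
    diffs = [pm - mm for pm, mm in zip(pm_intensities, mm_intensities)]
    return sum(diffs) / len(diffs)

# Made-up 25-nt perfect-match probe.
pm = "ACGT" * 6 + "A"
mm = mismatch_probe(pm)
print(pm)
print(mm)

# Made-up intensities for a probe set of two probe pairs.
print(probe_set_signal([100.0, 200.0], [40.0, 60.0]))
```

Because only one base in 25 differs, the MM probe is expected to capture much of the same nonspecific binding as its PM partner, which is the rationale for subtracting it.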

1.3.3 Experimental design

Probe design and hybridizations

Perhaps the most important concern related to microarrays is that all current technologies are based on the fundamental assumption that most microarray probes produce specific signals under a single, rather permissive hybridization condition. Specificity, in the context of DNA microarrays, refers to the ability of a probe to bind a unique target sequence. A specific probe should provide a signal that is proportional to the amount of the target sequence only. A non-specific probe will provide a signal that is influenced by the presence of other molecules. The specificity of a probe can be diminished by cross-hybridization, a phenomenon in which sequences that are not strictly complementary bind to each other. Cross-hybridization is also called non-specific hybridization. Other factors, such as the formation of secondary structure and the melting temperature of probes, may also cause hybridization errors, which reduce experimental accuracy.

The advantage of cDNA microarrays is their low cost compared to oligonucleotide arrays, and they were therefore often used in the academic world. However, it is difficult to obtain full-genome coverage, and highly similar transcripts may cause cross-hybridization. Long oligonucleotide arrays can overcome these problems by selecting regions within genes in such a way that they offer better specificity.

Signals observed from the immobilized probes vary within a range. The sensitivity threshold of microarray measurements defines the concentration range in which accurate measurements can be made. Some attempts have been made to assess the dynamic range of microarrays. The detection limit of current microarray technology seems to be between one and ten copies of mRNA per cell (Allemeersch et al., 2005; Draghici et al., 2006).


RNA spike-in experiments
An RNA spike-in is an RNA transcript used to calibrate measurements in a DNA microarray experiment. Each spike-in is designed to hybridize with a specific control probe on the target array. Manufacturers of commercially available microarrays typically offer companion RNA spike-in "kits". Known amounts of RNA spike-ins are mixed with the experimental sample during preparation. The measured degree of hybridization between the spike-ins and the control probes can subsequently be used to normalize the hybridization measurements of the sample RNA. In addition to normalization, spike-in experiments make it possible to quantify the accuracy of microarrays, i.e. to compare the observed measurement with the real RNA concentration for the spike probes.

Experimental design for two-channel arrays
In two-channel arrays, mRNAs extracted from two conditions (samples) are hybridized simultaneously on a given array. Which conditions to pair on the same array is a non-trivial issue and relates to the choice of the "microarray design". When several conditions need to be studied, for instance 10 conditions, co-hybridizing all possible pairs is not tractable. A good microarray design estimates the parameters of interest while making efficient use of the available material (number of arrays, amount of mRNA available, or other cost considerations) (Yang and Speed, 2002). The "reference design" is the most commonly used design for two-channel microarrays: each experimental sample is hybridized against a common reference sample (Figure 4a). Its advantages are that ratios are easy to interpret and that it extends easily to other experiments, provided the common reference is preserved. However, half of the resources are used to measure the reference sample, which usually has no biological interest, and two samples of interest can only be compared indirectly.

For comparing a number of samples of equal interest and high quality, a design that uses a large number of direct sample-to-sample comparisons is, from a theoretical perspective, the most accurate for the cost. The simplest of these is the "loop" design, where each sample is hybridized to two different samples in two different dye orientations (Figure 4b). The drawback is that if one chip fails, or is of poor quality, the error variance for all estimates is doubled. In addition, loops are inefficient compared to the reference design for a large number of samples because some pairs of samples ("varieties") are too far apart in the loop (Kerr and Churchill, 2001). There are other efficient designs that are also robust to failure, such as the interwoven loop design, which combines direct and indirect comparisons and can give more robust and precise estimates (Figure 4c).
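The designs discussed above can be compared with a short Python sketch. It enumerates the arrays needed for all pairwise comparisons, for a common reference design and for a loop design, and checks dye balance. Each array is written as a tuple (Cy3 sample, Cy5 sample); the sample names, function names and the reference label "Ref" are illustrative assumptions.

```python
from itertools import combinations


def all_pairs(samples):
    """Every possible direct comparison (one array per pair)."""
    return list(combinations(samples, 2))


def reference_design(samples, reference="Ref"):
    """Each sample is co-hybridized with the common reference."""
    return [(reference, s) for s in samples]


def loop_design(samples):
    """Each sample is hybridized to the next one; the last closes the loop."""
    n = len(samples)
    return [(samples[i], samples[(i + 1) % n]) for i in range(n)]


def dye_swap(design):
    """Add the swapped hybridization for every array of a design."""
    return design + [(cy5, cy3) for cy3, cy5 in design]


def is_balanced(design, samples):
    """True if every sample is labelled equally often with Cy3 and Cy5."""
    return all(
        sum(cy3 == s for cy3, _ in design) == sum(cy5 == s for _, cy5 in design)
        for s in samples
    )


samples = [f"T{i}" for i in range(1, 11)]  # 10 hypothetical conditions
```

With 10 conditions, all pairwise comparisons would need 45 arrays, whereas the reference and loop designs each need 10. In this sketch the loop is balanced by construction, while the reference design only becomes balanced after dye swapping, which doubles the number of arrays.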


Figure 4: Examples of microarray designs. Circles represent samples or time points (T1 to T7, plus a common reference), and arrows represent a direct hybridization between two samples. The arrows point from the sample labelled with Cy3 to the sample labelled with Cy5. (a) Common reference design. (b) Loop design. (c) Interwoven loop design.

Independent of the chosen design, there are two fundamental principles of good design: balance and replication. In a balanced design, every sample is labelled equally often in the red and green channels. For example, the loop design is by definition balanced, and swapped hybridizations provide balance for reference designs. Replication improves the precision of the estimates. The number of replicates depends on the goals of the study, the resources, and the reliability of the technology.

1.3.4 Microarray analysis
Because single-channel and two-channel microarrays differ, the data coming from the two types of platform have to be treated distinctly and require specific normalization methods. Here we focus on the analysis of two-channel microarrays, since the data presented in this work come from this type of array.

Image acquisition and analysis
After the hybridization experiments, scanning the slide is the first step of data analysis, followed by the extraction of the raw intensities from the images. There are four basic steps in image acquisition and analysis: scanning, spot recognition (or gridding), segmentation and intensity extraction.

In the scanning step, there are two prerequisites for obtaining a high-quality image. First, all the previous steps have to be performed to the highest possible standards, so that the images are as little affected by contamination as possible and show consistent spots with high signal-to-noise ratios. Second, the choice of scanning parameters is important: the photomultiplier tube (PMT) gain settings are adjusted to balance the overall intensities between the two channels.

Spot recognition is not a difficult task for most image analysis software. It consists in laying the grid, i.e. finding where the printed spots ought to be in the image.

Segmentation is the process used to differentiate the foreground pixels (i.e. the true signal) of a spot from the background pixels. Proper segmentation can be difficult because the spot morphology in a poor-quality image can vary substantially and the background can be high. Several algorithms for segmentation and background estimation exist and are implemented in different image analysis software packages.


After segmentation, the pixel intensities within the foreground and background masks are averaged separately to give the respective signals. The median or other intensity extraction methods can also be used.

Data pre-processing: Quality assessment
The data extracted in the image acquisition step need to be pre-processed to exclude poor-quality spots, and normalized to remove as many systematic errors as possible before downstream analysis. A common criterion is to exclude any spot with an intensity lower than the background plus two standard deviations. This criterion eliminates spots whose intensity is similar to the background intensity range. The intensities should also be log-transformed. When ratios are used, this log-transformation puts upregulated and downregulated values on the same scale. A background subtraction is usually performed, but many researchers have found that background subtraction adds noise to the measurements. On the other hand, ignoring the background seems wrong in principle; at the moment there is no consensus.

Data pre-processing: Normalization
In a self-self experiment, i.e. when two identical samples are hybridized on the same array, one expects the values to lie along the diagonal when the log2 Cy5 intensities are plotted against the log2 Cy3 intensities. This is not the case, although biologically there is no difference between the samples. This variation may be a consequence of the different labeling efficiencies and scanning properties of the Cy3 and Cy5 dyes; of different scanning parameters, such as PMT settings; or of print-tip, spatial, or PCR plate effects. The purpose of normalization is to minimize systematic variation in the measured gene expression levels of two co-hybridized samples, so that biological differences can be more easily distinguished. The utility of normalization can be visualized in a ratio-intensity plot, or MA-plot (Figure 5). In an MA-plot the log ratios (M values) are plotted against the mean log-intensities (A values).

The MA-plot also shows the dye effects: one expects the plot to be centered around M = 0, i.e. the proportions of up- and down-regulated genes to be similar, which is clearly not the case before normalization. The commonly used Loess normalization is a non-linear method that performs a local, intensity-dependent normalization (see Figure 5). The idea is to fit a curve to the M values as a function of the A values, and then to correct the ratios based on this curve. In addition, normalization algorithms such as Loess can be applied either globally (to the entire data set) or locally (to some physical subset of the data). For spotted arrays, local normalization is often applied to each group of array elements deposited by a single spotting pen (print-tip group). To remove print-tip effects, the data are split into groups printed by the same print tip, a separate Loess curve is fitted for each group, and the intensities are corrected using the corresponding Loess curve. This local normalization has the advantage that it can help correct for systematic spatial variation in the array, including inconsistencies among the spotting pens used to make the array, variability in the slide surface, and local differences in hybridization conditions across the array.
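The pre-processing steps described above can be sketched in plain Python: the background-plus-two-standard-deviations filter, the computation of M and A values, and an intensity-dependent correction applied globally or per print-tip group. The local fit below is a simplified stand-in for Loess (a tricube-weighted local linear regression without robustness iterations), not the exact procedure used in this work; all function names and the default span are assumptions.

```python
import math


def passes_quality(foreground, bg_mean, bg_sd):
    """Keep a spot only if its intensity exceeds background + 2 SD."""
    return foreground > bg_mean + 2 * bg_sd


def ma_values(red, green):
    """Per-spot log ratio M and mean log-intensity A (red = Cy5, green = Cy3)."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a


def local_fit(a, m, span=0.5):
    """Fitted M at every A from a tricube-weighted local linear regression."""
    n = len(a)
    k = max(2, math.ceil(span * n))  # number of neighbours per local fit
    fitted = []
    for x0 in a:
        nearest = sorted(range(n), key=lambda i: abs(a[i] - x0))[:k]
        dmax = max(abs(a[i] - x0) for i in nearest) or 1.0
        sw = swu = swy = swuu = swuy = 0.0
        for i in nearest:
            u = a[i] - x0  # centre the A values on the target spot
            w = (1 - (abs(u) / dmax) ** 3) ** 3
            sw += w
            swu += w * u
            swy += w * m[i]
            swuu += w * u * u
            swuy += w * u * m[i]
        denom = sw * swuu - swu ** 2
        if abs(denom) < 1e-12:  # degenerate fit: use the weighted mean
            fitted.append(swy / sw)
        else:  # the intercept is the prediction at u = 0
            fitted.append((swuu * swy - swu * swuy) / denom)
    return fitted


def loess_normalize(m, a, span=0.5):
    """Subtract the fitted intensity-dependent trend from the M values."""
    return [mi - fi for mi, fi in zip(m, local_fit(a, m, span))]


def print_tip_normalize(m, a, tips, span=0.5):
    """Apply the correction separately within each print-tip group."""
    out = [0.0] * len(m)
    for tip in set(tips):
        idx = [i for i, t in enumerate(tips) if t == tip]
        corrected = loess_normalize([m[i] for i in idx], [a[i] for i in idx], span)
        for i, c in zip(idx, corrected):
            out[i] = c
    return out
```

After such a correction the M values are centred around zero along the whole intensity range, which is what the right-hand panel of an MA-plot is expected to show.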


Figure 5: Ratio-intensity or MA-plot before (left) and after (right) Loess normalization.

An additional normalization, the between-array normalization, may be useful for comparisons across arrays; scale and quantile normalization are two examples. Scale normalization is a linear method that scales the log-ratios (M values) of a series of arrays so that each array has the same median absolute deviation. Quantile normalization makes the distribution of probe intensities the same for all arrays; unlike scale normalization, it is applied to log-intensities rather than log-ratios.

Estimating magnitude and significance of differential gene expression
Measurements are noisy, which is inherent to the hybridization technique. To assess the significance of gene profiles, replicated experiments and statistical tools are needed. After data pre-processing and normalization, the next stage is to fit a statistical model to estimate the magnitude and significance of relative gene expression across samples. Several methods can deal with complex microarray designs. These methods fit a model to estimate the relative gene expression and error terms, which can then be used to identify significantly differentially expressed genes. They can be classified according to the type of model they use.

LIMMA (Linear Models for Microarray Data) is a gene-specific method, because it fits a linear model for each gene separately. LIMMA uses normalized log-ratios as input data. To test for differential expression, the gene-specific variance estimates are improved in a Bayesian way by using the information from all genes (Smyth, 2004).

Another statistical tool used in microarray analysis is the analysis of variance (ANOVA). Researchers first used global ANOVA models (Kerr et al., 2000), where a single model is applied to the whole data set at once. However, global models are computationally expensive and were subsequently replaced by two-stage models.
In a two-stage model, a first global model is applied to the whole data set to estimate the gene-independent terms. The residuals of the global model become the data for the second stage, which is applied to one gene at a time. Wolfinger et al. (2001) introduced ANOVA mixed models for microarrays. In a mixed model, some of the effects in the experimental design are treated as random samples from a population. ANOVA models use separate log-intensity values for each channel, as spot effects are explicitly incorporated, and they return normalized absolute expression levels for each channel separately. The ANOVA F tests are designed to detect any pattern of differential expression among several conditions by comparing the variation among replicate samples within and between conditions. There are several variations of the F test, and software tools such as MAANOVA can compute appropriate F statistics for mixed models. More details about methods to obtain gene expression profiles can be found in Chapter 4.

Clustering of gene expression profiles
Clustering is a useful exploratory technique for suggesting resemblances among groups of genes. It is essentially a grouping technique that aims to find patterns in the data that are not predicted by the experimenter's current knowledge or preconceptions. Different procedures emphasize different types of similarity and therefore give different clusters. Most clustering programs offer several distance measures (Euclidean and Manhattan distances), some relational measures (correlation, and sometimes relative distance), and mutual information. Standard clustering techniques, such as hierarchical clustering, K-means, and self-organizing maps, are applied to group together the gene profiles with similar patterns across the conditions. Moreover, advanced algorithms specifically fine-tuned for biological applications have also been developed.
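As an illustration of the clustering step described above, the following Python sketch groups expression profiles by average-linkage hierarchical clustering with a correlation-based distance. It is a minimal example under stated assumptions rather than the procedure used in this work: the gene names and profiles are invented, and profiles are assumed to be non-constant (a constant profile would make the correlation undefined).

```python
import math


def pearson(x, y):
    """Pearson correlation between two profiles of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    """0 for perfectly correlated profiles, 2 for anti-correlated ones."""
    return 1.0 - pearson(x, y)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average distance over all pairs of members
                dists = [
                    correlation_distance(profiles[g], profiles[h])
                    for g in clusters[i]
                    for h in clusters[j]
                ]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


# Hypothetical profiles across four conditions
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 5.0, 2.0],
}
clusters = average_linkage(profiles, n_clusters=2)
```

Because the distance is based on correlation, geneA and geneB fall in the same cluster despite their different magnitudes, which illustrates how the choice of distance measure determines the resulting groups.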


Chapter 2: Exploring the transcriptomes using ESTs
A large-scale partial cDNA sequencing project was initiated to create a functional genomics resource for Xenopus tropicalis. The cDNAs derived from genes expressed in the nervous system during embryogenesis and metamorphosis of X. tropicalis were studied. This chapter describes the analysis of the ESTs and the construction of a gene index from a collection of about 50,000 high-quality sequences. These ESTs are estimated to represent 9,693 transcripts derived from about 6,000 genes. The analysis carried out for this thesis consisted of the pre-processing and assembly of the ESTs and the annotation of the sequences. The annotation of the contigs (assembled ESTs) included the prediction of protein domains, functional classification based on Gene Ontology, and the identification of alternative splicing events. In addition, the analyses carried out by the biologists made it possible to identify non-coding RNAs, to derive gene expression profiles from EST counts, and to use these profiles to define transcripts specific to the metamorphic stages of development. All the data have been made publicly available through the web application "XTScope" (http://indigene.ibaic.u-psud.fr/EST), whose search interface gives rapid access to the EST and contig sequences, to splicing and expression data, and to statistics and related information. The design and implementation of XTScope are also part of the work carried out in this thesis.


Although an EST represents only a small portion of a coding sequence, en masse these partial sequence data are of substantial utility. EST collections are a relatively quick and inexpensive route for discovering new genes and confirming coding regions in genomic sequence; they provide the basis for the development of microarrays and can be used to measure transcriptome activity. These small fragments need to be assembled to reconstruct the transcribed gene sequence, after which the corresponding gene and its putative function can be identified. When the aim is to measure transcriptome activity, gene profiles across different libraries are obtained from the analysis. We have studied the nervous system of Xenopus tropicalis through ESTs to provide a functional genomics resource on genes expressed in the nervous system during early embryogenesis and metamorphosis. The first section of this chapter corresponds to the article published in BMC Genomics, which presents the biological analysis of the gene index derived from the ESTs. The second section presents the bioinformatic and technical aspects that I developed for this thesis: the workflow used to construct a gene index from the raw EST reads and the web resource developed to query this gene index.

2.1 Exploring the nervous system transcriptomes during embryogenesis and metamorphosis in Xenopus tropicalis using EST analysis


BMC Genomics

Research article (Open Access)

Exploring nervous system transcriptomes during embryogenesis and metamorphosis in Xenopus tropicalis using EST analysis

Ana C Fierro*1,2,5, Raphaël Thuret1,2, Laurent Coen3, Muriel Perron1,2, Barbara A Demeneix3, Maurice Wegnez1,2, Gabor Gyapay4, Jean Weissenbach4, Patrick Wincker4, André Mazabraud1,2 and Nicolas Pollet*1,2,5

Address: 1CNRS UMR 8080, F-91405 Orsay, France, 2Univ Paris Sud, F-91405 Orsay, France, 3CNRS UMR 5166, Evolution des Régulations Endocriniennes, USM 501, Département Régulations, Développement et Diversité Moléculaire, Muséum National d'Histoire Naturelle, 7 rue Cuvier, 75231 Paris Cedex 5, France, 4Genoscope and CNRS UMR 8030, 2 rue Gaston Crémieux CP5706, 91057 Evry, France and 5Programme d'Épigénomique, Univ Evry, Tour Évry 2, 10è étage, 523 Terrasses de l'Agora, 91034 Evry cedex, France


* Corresponding authors

Abstract

Background: The western African clawed frog Xenopus tropicalis is an anuran amphibian species now used as a model in vertebrate comparative genomics. It provides the same advantages as Xenopus laevis but is diploid and has a smaller genome of 1.7 Gbp. Therefore X. tropicalis is more amenable to systematic transcriptome surveys. We initiated a large-scale partial cDNA sequencing project to provide a functional genomics resource on genes expressed in the nervous system during early embryogenesis and metamorphosis in X. tropicalis.

Results: A gene index was defined and analysed after the collection of over 48,785 high quality sequences. These partial cDNA sequences were obtained from an embryonic head and retina library (30,272 sequences) and from a metamorphic brain and spinal cord library (27,602 sequences). These ESTs are estimated to represent 9,693 transcripts derived from an estimated 6,000 genes. Comparison of these cDNA sequences with protein databases indicates that 46% contain their start codon. Further annotation included Gene Ontology functional classification, InterPro domain analysis, alternative splicing and non-coding RNA identification. Gene expression profiles were derived from EST counts and used to define transcripts specific to metamorphic stages of development. Moreover, these ESTs allowed identification of a set of 225 polymorphic microsatellites that can be used as genetic markers.

Conclusion: These cDNA sequences permit in silico cloning of numerous genes and will facilitate studies aimed at deciphering the roles of cognate genes expressed in the nervous system during neural development and metamorphosis. The genomic resources developed to study X. tropicalis biology will accelerate exploration of amphibian physiology and genetics. In particular, the model will facilitate analysis of key questions related to anuran embryogenesis and metamorphosis and its associated regulatory processes.

Published: 16 May 2007

BMC Genomics 2007, 8:118 doi:10.1186/1471-2164-8-118

Received: 17 November 2006
Accepted: 16 May 2007

This article is available from: http://www.biomedcentral.com/1471-2164/8/118

© 2007 Fierro et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Background
Xenopus tropicalis is now an anuran amphibian reference genome for vertebrate comparative genomics. It presents the same advantages as Xenopus laevis but has a smaller genome of 1.7 Gbp and a shorter generation time [1]. Moreover, while X. laevis is an allotetraploid derived from an allopolyploidization event, X. tropicalis is diploid [2,3]. Even though phylogenetic studies indicate that 30 to 50 million years of evolution separate the two species [3,4], it has been shown that most methods and resources developed for X. laevis can be readily applied to X. tropicalis [5]. Thus, the genome of X. tropicalis was selected to explore amphibian genome characteristics by whole-genome shotgun sequencing [6].

Working on X. laevis constitutes a challenge when dealing with large-scale transcriptomics, such as microarray experiments or systematic cDNA sequencing. This is because some X. laevis genes are present as diploids, while others form pairs of paralogs (also called "pseudoalleles") that have been conserved with various degrees of divergence, generally less than 10% [7]. On a genomic scale, recent data have led to the estimation of 12% as the minimal fraction of paralogous gene pairs kept after allotetraploidization [8]. However, this estimate is based on the application of strict and conservative criteria: less than 98% nucleotide similarity and 93% mean similarity between paralogs. Therefore, it is likely that more than 12% of paralogs are indeed active genes in X. laevis. Moreover, such pairs of genes may have distinct expression patterns [7]. An estimated 14% of paralogs show distinct expression profiles based on EST counts [8]. Given these complications, it follows that the X. tropicalis genome is more amenable to systematic transcriptome surveys than that of X. laevis.

Transcriptome analysis relies heavily on cDNA analysis. Collections of cDNA sequences have multiple uses for the molecular geneticist. They can be used to establish transcript catalogues [9-11] and to provide experimental evidence when building gene models from genomic sequence, particularly for 5' and 3' untranslated sequences [12]. Further, they can be used to provide global views on genome expression in a given cell type by the estimation of the abundance of the different mRNA species (through signatures as in [13]) and therefore can help decipher physiological roles played by a given gene product. Finally, partial cDNA sequences (ESTs) are used to identify full-length clones containing the entire open reading frame for each transcript [14].

We initiated an EST program so as to provide a functional genomics resource for X. tropicalis containing sequences from the highest possible number of genes expressed in the nervous system. We report the construction of such a gene index and its assessment after the collection of 48,785 partial cDNA sequences. These ESTs are estimated to represent 6,000 genes that were annotated through sequence similarity searches, protein domain searches and Gene Ontology functional classification. Gene expression profiles were derived from EST counts and used to identify transcripts differentially expressed at metamorphic stages of development. A set of polymorphic intragenic microsatellite markers was deduced from the analysis of ESTs derived from distinct strains of X. tropicalis. We expect that this resource will be valuable for further molecular genetics experiments.

Results and discussion

Construction of cDNA libraries and normalization
Two X. tropicalis cDNA libraries were constructed for this project. The first, designated xthr, was derived from dissected retinas and heads of young tadpoles (Nieuwkoop and Faber st. 25–35). About 500 retinas were dissected from stage 32 X. tropicalis embryos, a stage where differentiating retinal neurons are getting organized into layers. Because these retinas yielded only a small amount of polyA+ RNA, the library was enriched by the addition of mRNA from heads of embryos of the same developmental stage. The second library, designated xtbs, was made from central nervous systems of metamorphosing tadpoles. Brains and spinal cords were dissected from tadpoles between stage 58 and 64, the period covering the whole of Xenopus metamorphosis. To build the library, and with the aim of respecting the relative proportion of nervous tissue obtained at the different stages, samples from six animals were pooled for each stage between 58–61 and from three animals for each stage between 62–64. All these tissues were combined and the mRNA extracted for preparation of the xtbs library. The SMART technology (Clontech) was used to enrich the representation of full-length cDNA clones (defined here as a copy of the transcript sequences between the 5' cap and a polyA tail).

To increase the information derived from EST projects, it is necessary to sample complex or normalised cDNA libraries with few overrepresented cDNA clones (observed individually with a frequency greater than 1%). To evaluate the quality of our libraries, samples of 1,989 cDNAs from xthr and 1,694 cDNAs from xtbs were partially sequenced (see Methods) to obtain 4,120 ESTs. Next, a normalization step was performed to increase the diversity of sequence tags. We used a set of 53 oligonucleotides (35-mers) corresponding to highly represented clones (≥ 1%, see Methods) in hybridizations on high-density colony filters (see Additional file 1). A total of 22,561 clones were scored as positives (20% in both libraries) with an estimated false positive level of 0 and 3%, and an estimated false negative level of 38 and 10% for the xthr and xtbs libraries, respectively. The negatively scored clones were re-arrayed to further the project using both 5' and 3' sequencing. The further sequencing of cDNA clones provided 48,785 high-quality sequences derived from 27,806 clones after trimming 57,874 reads (including the 4,120 ESTs of the pre-normalisation step, Table 1, see Methods). Both 5' and 3' end sequences were read for 75% of the cDNA clones, therefore reducing the difficulties associated with EST clustering. Moreover, this strategy helps to determine the choice of a given cDNA clone for further experiments, whether it be full-length cDNA sequencing, overexpression studies or complementary RNA in vitro synthesis.

To determine if the normalization process was successful, the number of sequences containing each oligonucleotide probe was counted before and after normalization (Fig. 1). Before normalization, the 53 clusters from which the probes were derived accounted for 18% of the 4,120 ESTs. This fraction dropped to 1% after normalization, confirming the efficacy of the method. Of the 48 clusters corresponding to nuclear genes, 18 (37%) have 20 or more corresponding ESTs and 17 (35%) have 40 or more ESTs after normalization. We conclude that the abundance of ESTs after normalisation was sufficient in the majority of cases. Even though this strategy requires re-arraying, there is no bias due to insert length compared to normalization by re-association [15] and therefore constitutes a useful alternative.

EST assembly
We analyzed these sequences with PHRAP [16] to build contigs out of the overlapping and redundant sequences (Table 1). A total of 31,767 sequences were assembled into 8,756 contigs. These were further grouped by virtue of clone links into 6,547 unique groups (scaffolds). Taking into account the 2,982 singletons issued from 2,304 clones, a total of 9,693 transcript sequences were identified. We compared our results to the global clustering of all X. tropicalis ESTs (including ours) by the UniGene pipeline and the DFCI Gene Index. In UniGene, our set of ESTs belongs to 7,778 groups made of between 1 and 220 clones. Similarly, the DFCI Xenopus tropicalis Gene Index clustered these ESTs in 9,350 TCs and 1,160 singletons.
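Grouping contigs "by virtue of clone links" can be sketched with a small union-find procedure in Python. The sketch assumes that two contigs are linked whenever they contain ESTs (typically the 5' and 3' reads) of the same cDNA clone; the data structure, function name and identifiers are illustrative and do not reproduce the pipeline actually used.

```python
def group_by_clone_links(contig_clones):
    """Group contigs that share at least one cDNA clone.

    contig_clones maps a contig name to the set of clone identifiers
    whose ESTs were assembled into that contig (hypothetical input).
    """
    parent = {contig: contig for contig in contig_clones}

    def find(contig):
        # follow parent links to the group representative (path halving)
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    first_contig = {}  # first contig in which each clone was seen
    for contig, clones in contig_clones.items():
        for clone in clones:
            if clone in first_contig:
                parent[find(contig)] = find(first_contig[clone])
            else:
                first_contig[clone] = contig

    groups = {}
    for contig in contig_clones:
        groups.setdefault(find(contig), []).append(contig)
    return sorted(sorted(group) for group in groups.values())


example = {
    "contig1": {"cloneA", "cloneB"},
    "contig2": {"cloneB"},             # linked to contig1 through cloneB
    "contig3": {"cloneC"},
    "contig4": {"cloneC", "cloneD"},   # linked to contig3 through cloneC
    "contig5": {"cloneE"},             # no link: stays on its own
}
scaffolds = group_by_clone_links(example)
```

In this toy example five contigs collapse into three groups. The counts in Table 1 appear consistent with the same logic: 6,547 ungrouped contigs, 842 contig groups and 2,304 singleton clones add up to the 9,693 transcripts reported.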

The majority of clusters (66%) contained three or fewer ESTs. Only 11 contigs were composed of more than 100 sequences (see Additional file 2) and the largest contig contained 159 sequences. Most of the corresponding gene products (23/50) are ribosomal proteins, the others being proteins involved in basic cellular processes (tubulin, elongation factor 1 alpha). Two noteworthy exceptions are myelin basic protein (contig8746) and metallothionein (contig8708), for which transcripts are found almost exclusively in the nervous system.

The sequence redundancy (number of ESTs/cluster) of the xthr and xtbs libraries was compared to other X. tropicalis cDNA libraries represented in dbEST (see Additional file 3). A statistically significant difference at the 1% level of significance indicates that the complexity is higher for adult-type cDNA libraries, whether or not a size fractionation was performed. Amongst cDNA libraries prepared from embryonic or larval stages of development, the complexity of the xtbs library ranks first, while the complexity of the xthr library is close to the mean value.

Sequences were assembled into contigs of up to 3 kb in size (hsp90 transcript, Contig 8575, see Additional file 4), but the mean contig length of 745 bp indicated that most of them cover only parts of the cognate transcript sequence.

Table 1: Xenopus tropicalis EST project statistics

                                           xthr     xtbs     xthr and xtbs
Number of sequence reads obtained          30272    27602    57874
Number of clone sequences obtained         16548    14901    31449
Number of valid sequences                  26440    22345    48785
Number of clones with valid sequences      15540    12266    27806
Number of clones with 5' and 3' EST         9354    11486    20840
Number of clones with 5' EST only           4831        0     4831
Number of clones with 3' EST only           1355      780     2135
Average trimmed EST length                   522      546      534
Number of contigs                           4327     4002     8756
Number of contigs groups                     497      289      842
Number of contigs grouped                   1210      649     2209
Number of unique contigs                    3117     3353     6547
Number of clones in contigs                 9616     9268    15642
Number of singletons                        7958     4262    17018
Number of clones in singletons              5924     2998    12164
Number of putative transcripts              9538     6640    19553
Max. assembled sequence length              3028     3144     3028
Average assembled sequence length            732      782      745
Max. assembled sequence size                 147      144      159
Average assembled sequence size                6        5        5
Number of contigs containing
  1                                          509       97     1152
  2                                         1779     2155     3730
  3                                          452      362      905
  4–5                                        587      545     1159
  6–10                                       505      476      952
  11–20                                      283      207      479
  21–30                                       91       90      174
  31–50                                       80       41      118
  50–100                                      31       26       76
  >100                                        10        3       11

To assess the fraction of clones likely to be full-length, we estimated the number of sequences in our dataset that extend over the 5' or 3' end of complete cDNA sequences (Figure 2; 1,945 entries from the X. tropicalis Xenopus Gene Collection [8], Xt-XGC, and 2,963 entries from the Sanger Institute [17]). Using conservative criteria, at least one 5' EST was found to provide additional 5' upstream or 3' downstream sequence for 854 complete cDNAs (17.4% of the set). Using the same criteria but only on contig sequences, further sequence information was obtained on 355 complete cDNAs (7.2% of the set). Of these full-length cDNAs, 82 are completely matched by 122 contigs, and the latter are all longer. These results provide an indication of the added value of this sequence resource in the framework of the delineation of gene structure, especially with respect to the determination of the transcription initiation site.

Another way to assess the fraction of clones likely to be full-length has been described by Gilchrist et al. [17]. Using this method on all X. tropicalis cDNA sequences (version Xt6 [18]), the xthr and xtbs libraries were found to contain respectively 42% and 37% of full-length clones (MJ Gilchrist, personal communication). The mean fraction of full-length clones across all libraries is 18%. Therefore, we conclude that our normalization procedure did not impair the proportion of full-length clones compared to non-normalized libraries.

Sequence annotations

In order to further analyse our dataset, we compared our contigs to ENSEMBL predicted transcripts. Altogether, 4,437 contigs (52%) and 1,423 singletons (48%) matched 4,083 transcripts from 3,703 ENSEMBL predicted genes (15%). The extent of the underclustering of our ESTs was estimated from these numbers and used to calculate that our whole EST set represents about 6,000 genes. We conclude that our cDNA sequence collection significantly improved annotation of the X. tropicalis genome sequence. Similarly, we compared our dataset to

Figure 1: Assessment of normalization effectiveness. Histogram showing the percentage of sequences matching each oligonucleotide used in the procedure of normalization by hybridization. Bars represent percentages of positive clones calculated before (grey bars) and after normalization (black bars). Data before normalization were obtained after partial sequencing of 1,989 cDNAs from xthr and 1,694 cDNAs from xtbs. Note the relatively high abundance of cell-type specific transcripts such as gamma crystallin (crystallinG1) or neurogranin (underlined on the figure).


2,402 X. tropicalis RefSeq mRNA sequences. We found that 2,230 contigs (26%) and 484 singletons (16%) matched 1,342 RefSeq entries (56%). These figures suggest that further extensive sequencing of putative full-length cDNA clones from our collection will be of great use in order to cover the entire Xenopus gene set.

We next estimated the proportion of our cDNA sequences representing mRNA molecules produced by a splicing event and hence most likely to correspond to physiological products. We used "exonerate" to compute alignments between cDNA and genomic sequences. We retained only the alignments satisfying the thresholds of 95% identity and 90% coverage. Evidence of splicing was found for 5,025 contigs (65%) out of 7,718 contigs aligned to the genome. Of the 2,693 remaining contigs, only 274 are significantly similar to a protein sequence, and it is likely that the others represent 3' untranslated regions, often encoded by a single exon in vertebrates.
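The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions: the `Alignment` record, its field names and the toy values are hypothetical, one best alignment per contig is assumed, and the thresholds are treated as inclusive. It does not reproduce the exonerate output format.

```python
from dataclasses import dataclass

MIN_IDENTITY = 95.0  # percent identity threshold quoted in the text
MIN_COVERAGE = 90.0  # percent coverage threshold quoted in the text


@dataclass
class Alignment:
    """One cDNA-to-genome alignment (hypothetical record layout)."""
    contig: str
    identity: float   # percent identity over the aligned region
    coverage: float   # percent of the contig length that is aligned
    exon_count: int   # number of aligned blocks separated by introns


def summarize_splicing(alignments):
    """Return (aligned, spliced) counts among alignments passing both thresholds."""
    kept = [a for a in alignments
            if a.identity >= MIN_IDENTITY and a.coverage >= MIN_COVERAGE]
    spliced = [a for a in kept if a.exon_count > 1]
    return len(kept), len(spliced)


# Toy data, not taken from the study.
toy = [
    Alignment("contigA", 99.1, 97.0, 4),  # kept, spliced
    Alignment("contigB", 96.0, 92.5, 1),  # kept, single exon
    Alignment("contigC", 93.0, 99.0, 3),  # rejected: identity too low
    Alignment("contigD", 98.0, 85.0, 2),  # rejected: coverage too low
]
aligned, spliced = summarize_splicing(toy)
print(aligned, spliced)

# Percentage implied by the figures reported in the text (5,025 of 7,718).
print(round(100 * 5025 / 7718))
```

With real data, the two counts returned by `summarize_splicing` would correspond to the "aligned" and "spliced" contig totals quoted in the paragraph.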

Next, two complementary methods were used to find evidence for alternative splicing. Using genomic sequences (see Methods) we predicted 111 cases of alternative splicing, including conserved ones such as Clathrin light chain (contig7735 and contig8467, see Additional file 5). Using alignments on protein and genomic sequences we found 58 cases (such as elrD represented by contig7817, see Additional file 6).

Our set of transcript sequences was annotated using similarity searches in nucleotide and protein databases and motif searches (Table 2). Of the 8,756 contigs, 62% have more than 70% nucleotide similarity to previously described X. laevis regular entries, and may be considered as "known" Xenopus genes. Of these sequences, 4,426 had significant similarity to 2,803 protein sequences in the Swiss-Prot database, and 5,506 to 3,571 clusters of the Uniref90 database. We identified 212 sequences corresponding to the Xenopus orthologs of human disease genes (see Additional file 7). Further molecular studies on these genes in Xenopus will be useful for understanding the physiopathology of these diseases.

Putative coding regions were identified using frame-search, and corresponding protein sequences were annotated using InterProScan, allowing for an automatic Gene Ontology annotation (Table 3).

Several known genes specifically expressed in the eye were identified, including different crystallins (beta, gamma and mu isoforms), vsx1 (visual system homeobox 1), pax6 (paired-box protein 6), rdgb (retinal degeneration B homolog) and rgr (RPE-retinal G protein-coupled receptor). Well-characterised central nervous system specific genes were identified as well, notably elrC, mbp (myelin basic protein) and plp (myelin proteolipid protein 1). The corresponding cDNAs will provide useful differentiation markers for X. tropicalis.

A significant number of the contigs (37%) had no significant similarities to previously described genes, and may represent transcribed pseudogenes, non-coding RNA sequences and undescribed genes. Indeed, comparing our sequences to non-coding RNA sequences (microRNA from RFAM, or ncRNA from the H-INV datasets) we found 2 microRNA precursors (contig7127 and contig7850, encoding mir-9-1 and mir-124a respectively) as well as E3 (Contig2965) and 5SN4 (Contig5668) snoRNAs. Contig7127 (452 nt) is derived from the assembly of 6 ESTs derived from 3 distinct cDNA clones of the xtbs library. The alignment of the contig7127 sequence on the X. tropicalis genome sequence reveals 100% identity and indicates that one splicing event is required. Thus, contig7127 represents a bona fide neural transcript of the mir-9-1 gene. Contig7850 (800 nt) is derived from the assembly of 10 ESTs derived from 6 cDNA clones (one from xtbs and 5 from xthr libraries). Four of these clones are identical and characterized by a 409 bp cDNA, while two are longer and have their 3' end ESTs as singletons that map to the same scaffold region.

Figure 2: Added-value of xthr and xtbs 5' ESTs. 5' cDNA sequences were compared to 4,908 complete X. tropicalis cDNA sequences from XGC and Sanger Institute. When an EST matched unambiguously (>95% identity over more than 50 nt in the same orientation) one of these cDNAs, the position of its first residue (X axis) was plotted as a function of the cDNA size (Y axis). Each dot represents the result of an alignment. A position of 0 on the X axis indicates identical 5' ends between the EST and cDNA. Negative values indicate that the EST extends further 5', positive values greater than the cDNA length indicate that the EST extends further 3', and positive values smaller than the cDNA size indicate the 5' EST position relative to the cDNA.


Expression profiles

Other collections of X. tropicalis ESTs are ongoing [8,17,19] using a variety of cDNA libraries made from adult tissues or embryos at different stages of development. Hence, we undertook an in silico analysis of gene expression profiles estimated from EST counts [20].

In a first analysis, we searched for transcripts identified by ESTs derived predominantly from our cDNA libraries. We identified 99 and 238 cDNAs found prominently in the heads and retinas of tailbuds or brain and spinal cord of tadpole, respectively (see Additional file 8 and Additional file 9) and 25 clones found predominantly in both structures. These clones are likely to represent genes differentially expressed in the retina or the central or peripheral nervous system during metamorphosis. The study of these genes in Xenopus could well improve our knowledge of CNS development and function in vertebrates.

Metamorphosis

In a second analysis, we explored the metamorphosis transcriptomes using expression profiles derived from EST counts. It is known that amphibian metamorphosis brings about unique regulations triggered by thyroid hormones during late vertebrate development, but relatively few genes are characterised as playing regulatory roles in this process [21]. We extracted 4,187 UniGene clusters containing at least one EST from the xtbs cDNA library. Similarly, we fetched data from 592 UniGene clusters containing at least one and up to 132 ESTs from another library derived from a metamorphic stage of development. Combining both sets gives 4,779 UniGene clusters (13% of all clusters). To generate a useful expression matrix, an initial filtration step was performed whereby clusters composed of fewer than 10 ESTs were removed, leading to a set of 3,422 UniGene clusters. We used the GT test [22] to rank profiles in three categories: strong (64 clusters with GT > 0.66), medium (803 clusters with 0.33 < GT < 0.66) or weak (2,555 clusters with GT < 0.33) differential expression. Because UniGene is prone to overclustering we focused our analysis on the 64 clusters corresponding to genes with a strong differential expression and analysed their expression profiles using hierarchical clustering (Figure 3). Twelve characteristic expression profiles are observed, corresponding to peaks of expression that are tissue-specific (brain, intestine, kidney, heart, lung, skeletal muscle, skin) or stage-specific (egg, tailbud, tadpole, metamorphosis). The corresponding genes are potential differentiation markers that can be useful in developmental studies and can easily be checked by in situ hybridization on embryos. Only one transcript tagged by ESTs derived solely from a metamorphic stage was identified. This transcript codes for preprocaerulein type-4; it is characterized by 40 ESTs derived from 24 cDNA clones issued from 6 libraries made from stage 62 and 64 tadpoles, i.e. representing late metamorphosis stages. Caerulein is a peptide found predominantly in skin secretions. It belongs to the gastrin/cholecystokinin family of neuropeptides, and may play a role as an antimicrobial molecule [23]. This finding is discussed later.
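The filtering and ranking steps described above can be illustrated with a short sketch. It assumes that GT values have already been computed (for example by IDEG6) and only shows the 10-EST filter and the binning with the 0.33 and 0.66 cut-offs. The cluster identifiers, counts and GT values are made up, and the handling of values exactly equal to a cut-off is an assumption.

```python
def filter_clusters(est_counts, min_ests=10):
    """Drop clusters with fewer than min_ests ESTs summed over all libraries."""
    return {cid: counts for cid, counts in est_counts.items()
            if sum(counts) >= min_ests}


def gt_category(gt):
    """Bin a GT value with the 0.33 and 0.66 cut-offs quoted in the text.

    How exact boundary values are assigned is an assumption of this sketch.
    """
    if gt > 0.66:
        return "strong"
    if gt > 0.33:
        return "medium"
    return "weak"


# Hypothetical EST counts per library for three made-up cluster identifiers.
toy_counts = {
    "clusterA": [0, 12, 1],
    "clusterB": [2, 3, 1],   # only 6 ESTs in total, removed by the filter
    "clusterC": [5, 5, 5],
}
kept = filter_clusters(toy_counts)

# Hypothetical GT values, as they might be read from a statistics tool's output.
toy_gt = {"clusterA": 0.81, "clusterC": 0.12}
categories = {cid: gt_category(toy_gt[cid]) for cid in kept}
print(categories)
```

The three category sizes reported in the text (64, 803 and 2,555) add up to the 3,422 clusters that pass the filter.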

Since there are currently ten times more ESTs in cDNA libraries derived from metamorphic stages of development in X. laevis than in X. tropicalis, we did a similar survey of the expression profiles of transcripts in X. laevis.

We extracted 6,297 UniGene clusters (24% of all clusters) containing at least one and up to 710 ESTs from at least one cDNA library prepared from metamorphic tadpoles. This corresponds to 24,262 ESTs made from four cDNA libraries: limb, tail, intestine and tadpole (NF stage 62). The level of expression of each transcript was estimated by counting the ESTs belonging to the corresponding UniGene cluster. The 26 clusters containing the highest number of ESTs, and hence corresponding to the most highly expressed genes during metamorphosis, are listed in Table 4. We expected to find either ubiquitously or differentially expressed categories among these highly abundant transcripts at metamorphosis stages. Indeed, 16 of these 25 UniGene clusters are characterized by a restricted expression in the tail, limb, or heart (Table 4). Interestingly, it is known that 11 are expressed in the muscle cells that compose most of the tail and limb. One remarkable case is a gene coding for a protein involved in freeze-tolerance found predominantly in metamorphic limbs, but expressed in a variety of other tissues both embryonic (starting at gastrula stage) and adult (nearly all adult tissues sampled with the exception of ovary, testis and lung). This can be an artefact due to the handling of tissues at the time of RNA extraction. Alternatively, this may reflect the induction by stress-related hormones (glucocorticoids) during metamorphosis.

Table 2: Results of sequence comparisons

BLASTX: SwissProt, Uniref 90
BLASTN: X. laevis reg. entries, X. laevis UniGene, X. tropicalis UniGene, X. tropicalis genome, X. tropicalis cDNA

ND 2034
All hits | 4426 51% | 5506 63% | 5447 62% | 7175 82% | 7865 90% | 8703 99% | 3877 44% | All hits
>= 90% | 1594 18% | 2753 31% | 744 8% | 864 10% | 7199 82% | 8371 96% | 3660 42% | >95%
70 – 90% | 1705 19% | 1917 22% | 2963 34% | 3843 44% | 454 5% | 302 3% | 217 2% | 90 – 95%
< 70% | 1127 13% | 836 10% | 1740 20% | 2468 28% | 212 2% | 30 0% | 0 0% | < 90%
No similarity | 4330 49% | 3250 37% | 3309 38% | 1581 18% | 891 10% | 53 1% | 4879 56% | No similarity

We then carried out an in silico reconstruction of the transcriptional profile of X. laevis metamorphic genes using the IDEG suite of statistical tests. We removed clusters composed of fewer than 10 ESTs, leading to a set of 3,599 UniGene clusters. The GT test ranked profiles in three categories: strong (167 clusters with GT > 0.66), medium (1,300 clusters with 0.33 < GT < 0.66) or weak (2,132 clusters with GT < 0.33) differential expression.

Of the 167 clusters corresponding to genes with a strong differential expression, only 30 are composed of at least 2 ESTs derived solely from metamorphic or adult tissues, four of which bear no similarity to known proteins. We analysed the expression profiles of these metamorphic genes using unsupervised hierarchical clustering. The resulting clusters could be interpreted along the predominant expression domains (see Additional file 10). Three clusters (metamorphic, tadpole and limb) correspond to larval stages of development and are made of eight, three and four genes, respectively (Fig. 4). These genes are promising candidates, potentially playing important roles during this late developmental event. Below, we describe briefly what is known about each of these genes.

A larval beta chain of globin is among the metamorphic cluster together with an alpha chain, an indication of the relevance of our analysis. The comparison of Xl.56714 EST sequences with known proteins shows that they resemble cell surface receptors of the SLAM (Signalling Lymphocytic Activation Molecule) family. The SLAM receptors regulate immune cell activation. Indeed, it is known that immune system remodelling is a major event of metamorphosis [24]. The gene corresponding to Xl.56714 is expressed in metamorphic tadpoles (including tail and intestine) as well as in the adult kidney. However we could not detect significant similarities to any known gene sequences or proteins. Alpha-1 antichymotrypsin (a plasma protease inhibitor) is highly expressed during metamorphosis and found in adult liver. This correlates with the associated stress condition that occurs during tadpole transformations. The gene encoding alpha-2-HS-glycoprotein (also named fetuin) steadily increases in expression from tailbud stage up to metamorphosis. This gene product is secreted in plasma and plays a physiological role during mammalian fetal development, especially in mineralization and growth. A known Xenopus gene encoding a small peptide named PYLa is found exclusively in a cDNA library prepared from stage 62 tadpoles. As for the preprocaerulein transcripts in X. tropicalis, the PYLa transcripts are abundant in metamorphic stages, with ESTs found in limbs and whole tadpoles. Both caerulein and PYLa peptides may be secreted from skin glands and exert antimicrobial activities. This finding corroborates a previous report on caerulein expression [25]. Remarkably, skin glands are known to express a cocktail of signalling peptides, including neuropeptides such as xenopsin, thyrotropin-releasing hormone and PGLa. Whether these peptides play specific roles in the context of metamorphosis is unknown. The cluster Xl.24674 corresponds to a gene resembling uromodulin.

Table 3: GO Molecular function classification.

Gene ontology term N

All molecular function terms 1775
->antioxidant activity 14
->binding 623
-->calcium ion binding 89
-->carbohydrate binding 7
-->lipid binding 10
-->nucleic acid binding 350
--->DNA binding 135
---->chromatin binding 0
---->transcription factor activity 31
--->RNA binding 40
--->translation factor activity, nucleic acid binding 58
-->nucleotide binding 51
-->oxygen binding 0
-->protein binding 52
--->cytoskeletal protein binding 25
---->actin binding 23
-->receptor binding 28
->catalytic activity 455
-->hydrolase activity 110
--->nuclease activity 9
--->peptidase activity 66
--->phosphoprotein phosphatase activity 6
-->kinase activity 56
--->protein kinase activity 33
-->transferase activity 112
->chaperone regulator activity 0
->enzyme regulator activity 37
->motor activity 15
->nutrient reservoir activity 0
->signal transducer activity 48
-->receptor activity 13
-->receptor binding 28
->structural molecule activity 398
->transcription regulator activity 40
->translation regulator activity 58
->transporter activity 120
-->electron transporter activity 39
-->ion channel activity 11
-->neurotransmitter transporter activity 3
->triplet codon-amino acid adaptor activity 0


Figure 3: Digital expression profiles of X. tropicalis transcripts differentially expressed at metamorphosis. Each line gives the expression profile of a given transcript represented by a UniGene cluster. The expression is deduced from counting the occurrence of ESTs derived from a given cDNA library. The level of expression is colour coded in blue shades: dark blue means evidence for high levels of transcripts and white means no evidence for expression. On the left, clusters of expression profiles are delineated by a vertical bar labelled with the associated characteristic domain of expression. On the right, the cluster name and its annotation (i.e. the corresponding gene product description as deduced from sequence similarity analysis) are given. Each column corresponds to a category of tissue or stage of development: 8 adult tissues and 6 stages of development. Note that a given category may correspond to several cDNA libraries. Here, only clusters showing evidence of differential expression were used to build the matrix of expression. This matrix was analysed by hierarchical clustering on the expression profile dimension using CLUSTER 3.0 as described in the methods section.

Figure 3 column labels (tissues and stages): adipose tissue, brain, digestive, genitourinary, head, heart, lung, skeletal muscle, skin, egg, gastrula, neurula, tailbud, tadpole, metamorphosis.

Figure 3 row labels (UniGene cluster and annotation):
Str.20603 keratin 5
Str.40562 keratin 14
Str.49404 keratin 5
Str.41834 keratin 14
Str.51481 transcribed locus
Str.41889 26 serine protease
Str.41838 transcribed locus
Str.51637 acyl-CoA dehydrogenase
Str.29235 G protein pathway suppressor 2
Str.4962 slc7a7 Solute carrier family 7
Str.8295 elastase I
Str.5512 ctrb2: chymotrypsinogen B2
Str.26623 lactotransferrin
Str.26669 arginosuccinate synthetase (ass)
Str.38826 apolipoprotein A-I precursor
Str.7018 frzb: Frizzled-related protein (frzb)
Str.8426 myl3: Myosin, light polypeptide 3
Str.24081 transcribed locus
Str.2078 NADH dehydrogenase
Str.49042 cytochrome-c oxidase
Str.21355 progonadoliberin-2 precursor
Str.22092 transcribed locus
Str.49431 preprocaerulein type-4
Str.5804 kinesin light chain
Str.17081 myosin heavy chain, fast skeletal muscle, embryo
Str.8341 collagen, type V, alpha 3
Str.21728 transcribed locus
Str.5815 DNA-binding protein A; Y box protein 3
Str.35909 myosin, heavy polypeptide 2, fast muscle
Str.37351 ribosomal protein L3-like
Str.27777 mogat2: Monoacylglycerol O-acyltransferase 2
Str.3439 eukaryotic translation initiation factor 2-alpha
Str.21815 erythrocyte membrane protein band 4.9 (dematin)
Str.28019 discs, large homolog 4
Str.14347 RAB40B, member RAS oncogene family
Str.21084 C14orf37
Str.21253 transcribed locus
Str.52283 Rho-related GTP-binding protein RhoI
Str.21114 neurensin 1
Str.21299 phytanoyl-CoA 2-hydroxylase interacting protein
Str.20386 myelin basic protein
Str.14909 ADP-ribosylation factor 3
Str.28128 microtubule-associated protein 2
Str.21155 neuronal pentraxin II
Str.21068 transcribed locus
Str.21392 transmembrane protein 59
Str.28868 centaurin gamma 1
Str.21176 neurexin 1, isoform alpha
Str.28022 fibrinogen C domain containing 1
Str.20925 protein tyrosine phosphatase, receptor type, N
Str.21977 kinesin-like protein KIF1A
Str.15361 arrestin beta 1, isoform B
Str.38310 latrophilin 1
Str.20880 contactin associated protein 1
Str.21943 transcribed locus
Str.21088 EPH receptor A8
Str.21930 transcribed locus
Str.8446 EF-hand calcium binding protein 1
Str.38334 lysosomal associated protein transmembrane 4 beta
Str.49508 myelin proteolipid protein
Str.26607 C2orf21
Str.21007 glutamate receptor, ionotrophic, AMPA 4
Str.38154 calcium channel, voltage-dependent, alpha 2/delta 1
Str.5352 prickle-like 2

Figure 3 expression domain labels: skin, kidney, intestine, heart, tailbud, tadpole, metamorph, muscle, lung, brain.


Table 4: Highly expressed metamorphic transcripts.

UniGene ID Met ESTs Total ESTs cDNA source Note Restricted expression Description

Xl.24656_a 710 1405 b;h;l;li;ot;sk;t;wb R ubiquitous tail actin alpha 1 skeletal muscle

Xl.1115__b 248 653 et;h;he;l;ot;sk;t;wb R tail, limb tail actin alpha 1 skeletal muscle

Xl.17432 560 4290 b;et;ey;fb;h;he;k;l;li;lu;ot;ov;s;sk;t;te;th;wb U ubiquitous - eukaryotic translation factor 1 alpha, somatic form

Xl.5860 397 1008 b;he;l;ot;s;sk;t;te;wb U ubiquitous - creatine kinase, muscle

Xl.11405 280 751 b;et;l;t;wb R tail, limb, tailbud tail Calcium ATPase at 60A, cardiac muscle

Xl.24815 238 510 b;l;t;th;wb R limb, tadpole limb myosin, heavy polypeptide 13, skeletal muscle

Xl.47042 227 468 b;et;ey;fb;he;k;l;li;s;sk;t;th;wb R limb metamorphosis collagen, type I, alpha 1

Xl.25492 217 833 l;ot;sk;wb R limb, tadpole limb metamorphosis oncomodulin

Xl.7551 211 1333 b;et;ey;fb;h;he;k;l;li;lu;ot;ov;s;sk;t;te;th;wb U ubiquitous - eukaryotic translation elongation factor 2

Xl.28832 205 474 et;h;h;l;t;wb R tail, limb tail metamorphosis actin alpha skeletal muscle

Xl.4138__a 198 1376 b;et;ey;fb;h;k;l;li;lu;ot;ov;s;sk;t;te;th;wb U ubiquitous - actin cytoplasmic 1

Xl.29221_b 98 644 b;et;ey;fb;he;k;l;li;lu;ot;ov;s;sk;t;te;th;wb U thymus - actin cytoplasmic 1

Xl.5146 168 360 b;et;ey;fb;he;k;l;li;s;t;th;wb R limb limb metamorphosis freeze tolerance-associated protein FR47

Xl.8842 164 1033 b;et;ey;fb;h;he;k;l;li;lu;ot;ov;s;sk;t;te;th;wb N limb, thymus, spleen_PHA - ribosomal protein L3

Xl.1032 158 243 b;l;ot;sk;t;wb R tadpole limb metamorphosis fast skeletal troponin C beta

Xl.1055 151 288 b;et;h;l;ot;sk;t;th;wb R limb, tailbud, tadpole limb metamorphosis myosin light chain 1, fast skeletal muscle isoform

Xl.24699 134 426 b;ey;he;l;lu;t;th;wb R limb, heart heart metamorphosis actin alpha cardiac

Xl.7875 120 975 b;et;ey;fb;h;he;k;l;lu;ot;ov;s;sk;t;te;th;wb D ubiquitous - solute carrier family 25 (mitochondrial carrier; adenine nucleotide translocator), member 5

Xl.1728 120 884 b;et;ey;fb;h;he;k;l;li;lu;ot;ov;s;sk;t;te;th;wb N limb, tadpole, olfactory epith. - acidic ribosomal protein P0

Xl.8672 116 756 b;et;ey;fb;h;he;k;l;li;lu;ot;ov;s;sk;t;te;th;wb N limb - ribosomal protein L4

Xl.49126 115 272 b;ey;fb;l;lu;ot;s;sk;t;te;wb D ?ubiquitous limb metamorphosis procollagen, type I, alpha 2

Xl.995 113 440 b;et;ey;fb;h;he;k;l;lu;ot;ov;s;sk;t;te;th;wb N limb - glyceraldehyde-3-phosphate dehydrogenase

Xl.3463 109 554 b;et;ey;fb;h;he;k;l;li;lu;ot;ov;s;sk;t;te;th;wb R limb metamorphosis guanine nucleotide binding protein, beta 2, related sequence 1

Xl.118 102 274 b;et;fb;h;k;l;lu;ot;s;sk;t;wb R limb metamorphosis tropomyosin 1 alpha chain

Xl.1464 100 537 fb;l;li;ot;s;sk;t;wb R tail, tadpole tail myosin, heavy polypeptide 4, skeletal muscle

Xl.395 94 479 li;lu;th;wb R liver liver serum albumin 74 kDa

Xl.938 94 221 b;ey;fb;l;ot;sk;t;te;wb R limb limb metamorphosis tropomyosin beta chain, skeletal muscle

Xl.23399 94 156 b;l;ot;sk;t;wb D ?ubiquitous tail metamorphosis aspartyl beta-hydroxylase

b: brain; et: embryonic tissue; ey: eye; fb: fat body; h: head; he: heart; k: kidney; l: limb; li: liver; lu: lung; ot: other; ov: ovary; s: spleen; sk: skin; t: tail; te: testis; th: thymus; wb: whole body. Allogenes are written in italics. U: ubiquitous; D: no differential; N: no restricted expression; R: restricted expression.


Corresponding transcripts are found in metamorphic intestine and whole tadpole. In mammals, uromodulin is excreted in urine and plays a role in the cellular defense response. A gene encoding a stomatin homolog is highly expressed in intestine during metamorphosis. Stomatin is a membrane protein regulating cation exchange and cytoskeletal attachment. Among the genes represented in the metamorphic limb cluster are a keratin and two clusters (Xl.57017 and Xl.57064) annotated as lacking significant similarities to known proteins. In the tadpole cluster, a carboxyl ester lipase is found expressed in tadpole and in metamorphic intestine. Mucin 2 is another gut protein highly expressed in tadpole, as well as carboxypeptidase.

Taken together, these expression profiles, based on EST counts, reveal certain genes that are up-regulated during metamorphosis and are possible targets of the thyroid hormone signalling pathway.

Polymorphisms

The cDNA sequences we produced are derived from the Adiopodoumé strain of X. tropicalis, originating from the Ivory Coast. The genomic and most other cDNA sequencing efforts are made on the N strain from Nigeria or a distinct IC strain from Ivory Coast. We therefore looked for polymorphisms that could be used in genetic mapping experiments or to discriminate strain variants from mutations obtained by ENU mutagenesis. We identified 8 SNPs derived from mitochondrial genes, three of which (snp1A, snp2A, snp6G) are specific to the Adiopodoumé strain (see Additional file 11). The presence of shared alleles for 5 SNPs indicates the close relationship with the N strain as already reported by Evans et al. 2004. We searched for novel polymorphism markers made of di-, tri-, tetra- and pentanucleotide sequence repeats present in our EST collection. We found from two to ten alleles in 225 markers derived from 212 contigs/EST clusters. A subset of 107 markers is potentially highly informative since two or more alleles are observed at high frequencies (see Additional file 12). The dinucleotide repeats AT and TA are the most common, accounting for 137 markers. These intragenic markers should be useful once placed on a genetic linkage map.
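The search for di- to pentanucleotide repeats can be sketched with a regular expression. This is a minimal illustration in Python and not the custom perl scripts used in the study; the function name, the minimum number of repeats (5) and the exclusion of single-base runs are assumptions made for the example.

```python
import re


def find_microsatellites(seq, min_repeats=5):
    """Return (start, motif, repeat_count) for tandem repeats of 2 to 5 nt motifs.

    min_repeats is an assumed threshold; the source does not state one.
    """
    # A lazily matched motif of 2 to 5 bases followed by at least
    # (min_repeats - 1) further copies of the same motif.
    pattern = re.compile(r"([ACGT]{2,5}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Toy sequence containing five copies of the dinucleotide AT.
print(find_microsatellites("GGCATATATATATGCC"))
```

Comparing the repeat counts found at the same locus in different ESTs would then give the number of alleles per marker, which is the quantity reported in the paragraph above.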

This dataset will provide an invaluable tool for exon definition when the X. tropicalis genome sequence is finally determined. The results presented here are available through a database on our web site [26]. Users can carry out BLAST and other searches based on GO classification, InterProScan results, and expression information. The cDNA sequences have been deposited with GenBank/EMBL/DDBJ (accession numbers CN072222 – CN121006) and clones are available upon request.

Conclusion

Large-scale cDNA sequencing has provided invaluable resources to decipher vertebrate genome structure and

Figure 4: Digital expression profiles of X. laevis transcripts differentially expressed at metamorphosis. Using the same representation as in Fig. 3, three clusters are depicted that are associated with differential expression at metamorphosis (top cluster), tadpole stage (middle) and in the forming limb (bottom).

Figure 4 column labels (tissues and stages): animal cap, brain, eye, fat body, genitourinary, head, heart, limb, liver, lung, lymphoreticular, skin, female germ line, blastula, gastrula, gastrula/neurula cusp, neurula, tailbud, tadpole, metamorphosis.

Figure 4 row labels (UniGene cluster and annotation), in three groups:
Xl.56714 transcribed locus cDNA clone IMAGE:6871962
Xl.23966 Alpha-1-antichymotrypsin precursor
Xl.24530 Xl Larval beta II globin
Xl.1127 Xl Hemoglobin alpha chain
Xl.24560 alpha-2-HS-glycoprotein
Xl.144 Xl PYLa precursor
Xl.24674 uromodulin
Xl.26509 Xl stomatin

Xl.51631 Keratin, type II cytoskeletal 1
Xl.57017 transcribed locus cDNA clone XL081o03
Xl.57064 transcribed locus cDNA clone IMAGE:8636
Xl.34936 Xl preprocaerulein type I

Xl.24591 Xl carboxyl ester lipase
Xl.24594 mucin 2 precursor, intestinal
Xl.26001 carboxypeptidase B

Figure 4 cluster labels: metamorphosis, tadpole, limb.


function. Recent studies on cDNA sequencing with deep coverage provide fundamental knowledge on the complexity of transcriptomes in mammals [27]. Here, we provide information on the transcript sequence and expression of an estimated 6,000 genes in X. tropicalis. A web resource [26] is available with associated annotations. The genetic resources stemming from the cDNA sequencing project described here can be used in diverse research projects, including vertebrate comparative genomics, studies on evolution and development, cell biology and developmental genetics. More specifically, studies of retinogenesis and of the remodelling of the central nervous system during metamorphosis will benefit from this cDNA resource.

We are currently undertaking full cDNA insert sequencing for a set of non-redundant clones, as well as characterizing their expression using a whole-mount in situ hybridization screen [28,29]. The genomic resources developed to study X. tropicalis biology are crucial to explore amphibian physiology and genetics, as this model system provides excellent characteristics for addressing key questions related to anuran metamorphosis and its associated regulatory processes.

Methods

Embryo and tissue dissection

Embryos of the Xenopus tropicalis Adiopodoumé strain were obtained from parents from the Geneva collection [30].

From each of the two libraries made, 58,368 clones were picked, arrayed in microtiter plates and gridded on high-density nylon filters. A sample of 1,989 cDNA clones from the xthr library and 1,694 clones from xtbs were partially sequenced to obtain 4,120 ESTs. This step provided a quality assessment of the two libraries, showing the absence of clones of bacterial origin and few ribosomal (0.45% in xthr, 0% in xtbs) and mitochondrial contaminants (1.2 and 3.6% in xthr and xtbs, respectively). This procedure provided information on overrepresented clones, which were then removed before further sequencing of up to 30,000 clones. These ESTs were grouped into 1,985 clusters.

Normalization of cDNA libraries

Two pools of 25 and 28 oligonucleotide probes of 35 nt in length were labelled using Terminal Transferase (New England Biolabs) and P33-dATP (Amersham). Labelled oligonucleotides were hybridised on two high-density filters representing the xthr and xtbs libraries as in [31]. Positive clones were identified using X-Digitize software on images acquired using a phosphorimager.

High-throughput sequencing, assembling

The reactions were performed with a Big-Dye terminator cycle sequencing kit and analyzed on ABI-3700 and ABI-3730 instruments. Sequences were base-called using PHRED, then trimmed using LUCY and custom perl scripts. Sequences shorter than 100 bp were discarded, as well as those identified as derived from ribosomal and mitochondrial RNAs. PHRAP was used to assemble the sequences taking into account quality scores; further clustering was obtained by scaffolding using mate-pair information. We retained scaffolds only if two clone links were available (excepting singletons) and if the orientation of the reads was consistent.
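The scaffold retention rule can be illustrated with a small sketch. The tuple layout, the function name and the toy clone identifiers are hypothetical, orientation consistency is assumed to have been determined beforehand, and the exception made for singletons in the text is not modelled here.

```python
from collections import Counter


def supported_links(mate_links, min_links=2):
    """Return the contig pairs joined by at least min_links consistent clone links.

    mate_links: iterable of (clone_id, contig_5p, contig_3p, consistent) tuples,
    where contig_5p and contig_3p hold the 5' and 3' reads of one clone and
    consistent says whether the read orientations agree (hypothetical layout).
    """
    counts = Counter()
    for clone_id, contig_a, contig_b, consistent in mate_links:
        if contig_a == contig_b or not consistent:
            continue  # both reads in the same contig, or conflicting orientation
        counts[frozenset((contig_a, contig_b))] += 1
    return {pair for pair, n in counts.items() if n >= min_links}


# Toy links, not taken from the study.
links = [
    ("clone1", "A", "B", True),
    ("clone2", "A", "B", True),   # second consistent link between A and B
    ("clone3", "A", "C", True),   # single link only
    ("clone4", "B", "C", False),  # inconsistent orientation
    ("clone5", "B", "C", True),   # single consistent link only
]
print(supported_links(links))
```

Only the pair A and B is retained in this toy case, since it is the only one supported by two clones with consistent read orientation.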

Annotation

Repetitive sequences were masked using CENSOR. Contigs and singletons were used as queries in BLASTX and BLASTN searches of Swissprot, Uniprot, Unigene (rel 70) and Xenopus tropicalis JGI assembly v4.1 databases on the INFOBIOGEN server. The February 2005 release of ENSEMBL was used. Framefinder was used to identify coding sequences, and protein domains were searched using INTERPROSCAN. The results of sequence assembly, scaffolding and annotation were loaded into a custom-made MySQL database. A web interface was developed using PHP scripts.

Assessment of the fraction of clones likely to be full-length

5' ESTs or contigs were compared to full-length cDNAs using BLASTN. Alignments longer than 50 nt and exhibiting more than 95% identity were selected. Overlap between query and subject sequences was scored only when the alignment encompassed up to the last nucleotide at the 5' or 3' end of the sequences.
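One possible reading of this scoring rule is sketched below. The `Hit` record and its field names are hypothetical, coordinates are assumed to be 1-based and inclusive with query and subject in the same orientation, and the interpretation of "extension" (unaligned query residues beyond a cDNA end that the alignment reaches) is an assumption, not the authors' stated procedure.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLASTN alignment of an EST or contig (query) to a cDNA (subject)."""
    aln_len: int     # alignment length in nt
    identity: float  # percent identity
    q_start: int     # first aligned query position
    q_end: int       # last aligned query position
    q_len: int       # query length
    s_start: int     # first aligned subject position
    s_end: int       # last aligned subject position
    s_len: int       # subject length


def is_selected(hit):
    """Selection thresholds quoted in the text: >50 nt and >95% identity."""
    return hit.aln_len > 50 and hit.identity > 95.0


def extension(hit):
    """Return (nt added 5' of the cDNA, nt added 3' of the cDNA) by the query."""
    if not is_selected(hit):
        return (0, 0)
    # Score an end only if the alignment reaches that end of the cDNA.
    five_prime = hit.q_start - 1 if hit.s_start == 1 else 0
    three_prime = hit.q_len - hit.q_end if hit.s_end == hit.s_len else 0
    return (five_prime, three_prime)


# Toy hit: the EST has 40 unaligned nt upstream of the first cDNA nucleotide.
toy = Hit(aln_len=300, identity=99.0, q_start=41, q_end=340, q_len=340,
          s_start=1, s_end=300, s_len=1500)
print(extension(toy))
```

Under these assumptions, a non-zero first value corresponds to an EST that provides additional 5' upstream sequence for the cDNA.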

Alternative splicing
Alignments between cDNA and genomic sequences (masked) were computed using exonerate. Evidence for alternative splicing was found when two contigs were aligned to the same genomic region with at least 95% mean overlap but with a different number of exons.
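A minimal sketch of this first criterion follows. The text does not define "mean overlap"; the sketch assumes it is the average of the two fractions of each contig's genomic span covered by the other, and the `GeneModel` record is a hypothetical stand-in for parsed exonerate output.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class GeneModel:
    contig: str
    scaffold: str   # name of the genomic scaffold
    exons: list     # (start, end) genomic coordinates, start <= end

    @property
    def span(self):
        return min(s for s, _ in self.exons), max(e for _, e in self.exons)

def mean_overlap(a, b):
    """Mean of the fractions of each genomic span covered by the other (assumed definition)."""
    if a.scaffold != b.scaffold:
        return 0.0
    (a0, a1), (b0, b1) = a.span, b.span
    shared = min(a1, b1) - max(a0, b0) + 1
    if shared <= 0:
        return 0.0
    return (shared / (a1 - a0 + 1) + shared / (b1 - b0 + 1)) / 2

def candidate_splicing_pairs(models, min_overlap=0.95):
    """Pairs of contigs on the same region with a different number of exons."""
    return [
        (a.contig, b.contig)
        for a, b in combinations(models, 2)
        if mean_overlap(a, b) >= min_overlap and len(a.exons) != len(b.exons)
    ]


models = [
    GeneModel("TC1", "scaffold_1", [(100, 200), (300, 400), (500, 600)]),
    GeneModel("TC2", "scaffold_1", [(100, 200), (500, 600)]),   # middle exon skipped
    GeneModel("TC3", "scaffold_2", [(100, 600)]),
]
pairs = candidate_splicing_pairs(models)
```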

Alignments between cDNA and UNIREF protein sequences were computed using BLASTX. Alignments characterized by a gap (at least 10 aa) introduced in the contig sequence were retrieved. Alternative splicing was considered present if the contig sequence including the gap could be aligned to the genomic sequence.
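The gap criterion can be sketched on the aligned query string of a gapped BLASTX alignment, where positions missing from the translated contig appear as `-`. Whether the original analysis worked on this string or on the coordinates of neighbouring alignment blocks is not stated, so the approach and the function name below are assumptions.

```python
import re

def long_gaps(aligned_query, min_gap=10):
    """Return (start, length) for every run of '-' spanning at least `min_gap` columns."""
    pattern = re.compile("-{%d,}" % min_gap)
    return [(m.start(), len(m.group())) for m in pattern.finditer(aligned_query)]


# Translated contig aligned to a protein: a 12-residue gap and a 3-residue gap.
aligned = "MKTLLV" + "-" * 12 + "GHEAR" + "---" + "PLW"
gaps = long_gaps(aligned)
```

Only the first gap reaches the 10 aa threshold; such a contig would then be checked against the genomic sequence as described above.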

BMC Genomics 2007, 8:118 http://www.biomedcentral.com/1471-2164/8/118

Expression profiles
EST counts were downloaded from UniGene (release 70 for X. laevis and 32 for X. tropicalis). Relevant profiles were extracted using custom Perl scripts. GT test was run using the IDEG6 software [22]. Single hierarchical clustering was performed using Cluster 3.0 software [32]. We used absolute correlation similarity metrics followed by complete clustering on mean centered gene expression profiles. Results were visualised using TreeView.
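To make the clustering settings concrete, the sketch below mean-centres each profile, uses 1 - |r| (r being the Pearson correlation) as distance, and merges clusters by complete linkage. It is a naive pure-Python illustration, not the Cluster 3.0 implementation; reading "absolute correlation" as 1 - |r| is an assumption, and the gene names and values are invented.

```python
from math import sqrt

def mean_center(profile):
    m = sum(profile) / len(profile)
    return [x - m for x in profile]

def abs_correlation_distance(x, y):
    """1 - |Pearson r|: correlated and anti-correlated profiles are both close."""
    x, y = mean_center(x), mean_center(y)
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return 1.0 - abs(num / den) if den else 1.0

def complete_linkage(profiles, n_clusters):
    """Naive agglomerative clustering; cluster distance = largest pairwise distance."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(
                    abs_correlation_distance(profiles[a], profiles[b])
                    for a in clusters[i] for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],   # anti-correlated with geneA
    "geneD": [1.0, 5.0, 1.0, 5.0],
}
clusters = complete_linkage(profiles, n_clusters=2)
```

With an absolute correlation metric, geneC groups with geneA and geneB even though it varies in the opposite direction, which is the practical consequence of this choice of metric.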

Polymorphisms
Mitochondrial SNPs were identified on a collection of mitochondrial ESTs downloaded from dbEST, JGI [33] and from our own set. These ESTs were assembled using CAP3 and analyzed using visualization software [34]. Microsatellites were identified using custom Perl scripts.

Authors' contributions
RT, LC and MP carried out laboratory and data analysis. AF wrote and ran the EST processing pipeline, including EST assembly and annotation, and is responsible for the web-available database. GG, JW and PW managed and conducted the sequencing experiments. BD, MW and AM participated in the coordination of the study. NP participated in the conception and design of the study, carried out laboratory and data analysis and drafted the manuscript.

Additional material

Additional File 1: Oligonucleotides used for normalization. The table provided lists the oligonucleotide identifier, corresponding gene, number of corresponding ESTs, oligonucleotide sequence and Tm. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S1.xls]

Additional File 2: Top 50 of contigs according to the number of constituent ESTs. The table provided lists the Contig identifier, the number of constituent sequence reads, the contig length and the description of the best Swissprot hit identified by blastx. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S2.xls]

Additional File 3: Analysis of X. tropicalis cDNA libraries complexity. The data provided represent the description and complexity of X. tropicalis cDNA libraries sampled by more than 20,000 ESTs. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S3.xls]

Additional File 4: Top 50 of contigs according to their size. The table provided lists the Contig identifier, the contig length and the description of the best Swissprot hit identified by blastx. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S4.xls]

Additional File 5: Alternative splicing detected by a different number of exons in contigs aligned in the same genomic region. The table provided lists the Contig identifier, the number of constituent exons, the description of the best protein hit identified by blastx. Each line describes one alternative splicing case. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S5.xls]

Additional File 6: Alternative splicing detected by a gap in the alignment against UniRef100. The table provided lists the Contig identifier, the identifier of the protein evidencing an alternative splicing event, and the identifier of the protein showing the highest similarity by blastx. Each line describes one alternative splicing case. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S6.xls]

Additional File 7: Xenopus tropicalis genes related to human disease genes. The table provided lists the human disease trait, corresponding human gene name, chromosomal location, OMIM identifier and the corresponding Xenopus tropicalis Contig identifier. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S7.xls]

Additional File 8: cDNA clones found specifically in library xthr. The table provided lists the Contig identifier and the description of the best protein hit identified by blastx. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S8.xls]

Additional File 9: cDNA clones found specifically in library xtbs. The table provided lists the Contig identifier and the description of the best protein hit identified by blastx. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S9.xls]

Additional File 10: Digital expression profiles of X. laevis transcripts. Using the same formalism as in Fig. 3, all X. laevis Unigene clusters associated with differential expression at metamorphosis are depicted. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S10.eps]

Additional File 11: Mitochondrial SNPs. The table lists the occurrence of given alleles of mitochondrial SNPs in the adiopodoume and N strains, in association with the corresponding mitochondrial gene. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S11.xls]


Acknowledgements
We thank L. Du Pasquier for the gift of X. tropicalis animals and his continuous support. This research was funded by grants from l'Association pour la Recherche contre le Cancer, le Centre National de la Recherche Scientifique, le Ministère de l'Education, de la Recherche (French Xenopus Stock Center) et de la Technologie and the University of Paris Sud.

References
1. Amaya E, Offield MF, Grainger RM: Frog genetics: Xenopus tropicalis jumps into the future. Trends Genet 1998, 14(7):253-255.
2. Bisbee CA, Baker MA, Wilson AC, Haji-Azimi I, Fischberg M: Albumin phylogeny for clawed frogs (Xenopus). Science 1977, 195(4280):785-787.
3. Evans BJ, Kelley DB, Tinsley RC, Melnick DJ, Cannatella DC: A mitochondrial DNA phylogeny of African clawed frogs: phylogeography and implications for polyploid evolution. Mol Phylogenet Evol 2004, 33(1):197-213.
4. Evans BJ, Kelley DB, Melnick DJ, Cannatella DC: Evolution of RAG-1 in polyploid clawed frogs. Mol Biol Evol 2005, 22(5):1193-1207.
5. Khokha MK, Chung C, Bustamante EL, Gaw LW, Trott KA, J. Y, Lim N, Lin JC, Taverner N, Amaya E, Papalopulu N, Smith JC, Zorn AM, Harland RM, Grammer TC: Techniques and probes for the study of Xenopus tropicalis development. Dev Dyn 2002, 225:499-510.
6. Richardson P, Chapman J: The Xenopus tropicalis genome project. Current Genomics 2003, 4:645-652.
7. Graf JD, Kobel HR: Genetics of Xenopus laevis. Methods Cell Biol 1991, 36:19-34.
8. Morin RD, Chang E, Petrescu A, Liao N, Griffith M, Chow W, Kirkpatrick R, Butterfield YS, Young AC, Stott J, Barber S, Babakaiff R, Dickson MC, Matsuo C, Wong D, Yang GS, Smailus DE, Wetherby KD, Kwong PN, Grimwood J, Brinkley CP 3rd, Brown-John M, Reddix-Dugue ND, Mayo M, Schmutz J, Beland J, Park M, Gibson S, Olson T, Bouffard GG, Tsai M, Featherstone R, Chand S, Siddiqui AS, Jang W, Lee E, Klein SL, Blakesley RW, Zeeberg BR, Narasimhan S, Weinstein JN, Pennacchio CP, Myers RM, Green ED, Wagner L, Gerhard DS, Marra MA, Jones SJ, Holt RA: Sequencing and analysis of 10,967 full-length cDNA clones from Xenopus laevis and Xenopus tropicalis reveals post-tetraploidization transcriptome remodeling. Genome Res 2006, 16(6):796-803.
9. Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee NH, Kirkness EF, Weinstock KG, Gocayne JD, White O, et al.: Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature 1995, 377(6547 Suppl):3-174.
10. Houlgatte R, Mariage-Samson R, Duprat S, Tessier A, Bentolila S, Lamy B, Auffray C: The Genexpress Index: a resource for gene discovery and the genic map of the human genome. Genome Res 1995, 5(3):272-304.
11. Marra M, Hillier L, Kucaba T, Allen M, Barstead R, Beck C, Blistain A, Bonaldo M, Bowers Y, Bowles L, Cardenas M, Chamberlain A, Chappell J, Clifton S, Favello A, Geisel S, Gibbons M, Harvey N, Hill F, Jackson Y, Kohn S, Lennon G, Mardis E, Martin J, Mila L, McCann R, Morales R, Pape D, Person B, Prange C, Ritter E, Soares M, Schurk R, Shin T, Steptoe M, Swaller T, Theising B, Underwood K, Wylie T, Yount T, Wilson R, Waterston R: An encyclopedia of mouse genes. Nat Genet 1999, 21(2):191-194.
12. Wei C, Brent MR: Using ESTs to improve the accuracy of de novo gene prediction. BMC Bioinformatics 2006, 7:327.
13. Okubo K, Hori N, Matoba R, Niiyama T, Fukushima A, Kojima Y, Matsubara K: Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nat Genet 1992, 2(3):173-179.
14. Gomez SM, Eiglmeier K, Segurens B, Dehoux P, Couloux A, Scarpelli C, Wincker P, Weissenbach J, Brey PT, Roth CW: Pilot Anopheles gambiae full-length cDNA study: sequencing and initial characterization of 35,575 clones. Genome Biol 2005, 6(4):R39.
15. Bonaldo MF, Lennon G, Soares MB: Normalization and subtraction: two approaches to facilitate gene discovery. Genome Res 1996, 6(9):791-806.
16. Ewing B, Green P: Analysis of expressed sequence tags indicates 35,000 human genes. Nat Genet 2000, 25(2):232-234.
17. Gilchrist MJ, Zorn AM, Voigt J, Smith JC, Papalopulu N, Amaya E: Defining a large set of full-length clones from a Xenopus tropicalis EST project. Dev Biol 2004, 271(2):498-516.
18. Wellcome X. tropicalis Full-Length Database [http://informatics.gurdon.cam.ac.uk/online/xt-fl-db.html]
19. Klein SL, Strausberg RL, Wagner L, Pontius J, Clifton SW, Richardson P: Genetic and genomic tools for Xenopus research: The NIH Xenopus initiative. Dev Dyn 2002, 225(4):384-391.
20. Ewing RM, Ben Kahla A, Poirot O, Lopez F, Audic S, Claverie JM: Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res 1999, 9(10):950-959.
21. Tata JR: Amphibian metamorphosis as a model for the developmental actions of thyroid hormone. Mol Cell Endocrinol 2006, 246(1-2):10-20.
22. Romualdi C, Bortoluzzi S, Danieli GA: Detecting differentially expressed genes in multiple tag sampling experiments: comparative evaluation of statistical tests. Hum Mol Genet 2001, 10(19):2133-2141.
23. Gibson BW, Poulter L, Williams DH, Maggio JE: Novel peptide fragments originating from PGLa and the caerulein and xenopsin precursors from Xenopus laevis. J Biol Chem 1986, 261(12):5341-5349.
24. Izutsu Y, Tochinai S, Maeno M, Iwabuchi K, Onoe K: Larval antigen molecules recognized by adult immune cells of inbred Xenopus laevis: partial characterization and implication in metamorphosis. Dev Growth Differ 2002, 44(6):477-488.
25. Seki T, Kikuyama S, Yanaihara N: Development of Xenopus laevis skin glands producing 5-hydroxytryptamine and caerulein. Cell Tissue Res 1989, 258(3):483-489.
26. Xtscope [http://indigene.ibaic.u-psud.fr/EST]
27. Maeda N, Kasukawa T, Oyama R, Gough J, Frith M, Engstrom PG, Lenhard B, Aturaliya RN, Batalov S, Beisel KW, Bult CJ, Fletcher CF, Forrest AR, Furuno M, Hill D, Itoh M, Kanamori-Katayama M, Katayama S, Katoh M, Kawashima T, Quackenbush J, Ravasi T, Ring BZ, Shibata K, Sugiura K, Takenaka Y, Teasdale RD, Wells CA, Zhu Y, Kai C, Kawai J, Hume DA, Carninci P, Hayashizaki Y: Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genet 2006, 2(4):e62.
28. Pollet N, Muncke N, Verbeek B, Li Y, Fenger U, Delius H, Niehrs C: An atlas of differential gene expression during early Xenopus embryogenesis. Mech Dev 2005, 122(3):365-439.
29. Pollet N, Schmidt HA, Gawantka V, Vingron M, Niehrs C: Axeldb: a Xenopus laevis database focusing on gene expression. Nucleic Acids Res 2000, 28(1):139-140.
30. Rungger D: Xenopus helveticus, an endangered species? Int J Dev Biol 2002, 46(1):49-63.
31. Bulle F, Chiannilkulchai N, Pawlak A, Weissenbach J, Gyapay G, Guellaen G: Identification and chromosomal localization of human genes containing CAG/CTG repeats expressed in testis and brain. Genome Res 1997, 7(7):705-715.
32. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998, 95(25):14863-14868.
33. JGI X. tropicalis v4.1 home [http://genome.jgi-psf.org/Xentr4/Xentr4.download.html]
34. SNP/INDEL Discovery Pipeline based on CAP3 assembly [http://cgpdb.ucdavis.edu/SNP_Discovery/]

Additional File 12: Highly informative intragenic microsatellite markers. The table lists allelic data for a set of intragenic microsatellite markers, including the Contig ID, corresponding UniGene cluster ID, number of alleles and type of microsatellite. Contigs/ESTs/UG clusters for which at least two alleles have a frequency higher than the mean (calculated as the total number of ESTs divided by the number of alleles) or higher than 33% are shown in bold. Allele number and frequency are shown in bold if the frequency is higher than the mean or higher than 33%. A * indicates that more than one repeat polymorphism was observed for that cluster. [http://www.biomedcentral.com/content/supplementary/1471-2164-8-118-S12.xls]


2.2 XTScope: Xenopus tropicalis EST, a web resource for the nervous system

In recent years Xenopus tropicalis has emerged as a good candidate for vertebrate comparative genomics, owing to a simpler diploid genome and a shorter generation time than Xenopus laevis, the classic model for developmental biology. To explore amphibian genome characteristics, the genome of X. tropicalis is currently being sequenced, and more than 1 million ESTs from several tissues and developmental stages are available at NCBI. Until now, however, the nervous system has not been completely covered by these EST libraries. The previous section presented, from a biological point of view, the EST project carried out to study the nervous system of X. tropicalis. In such a project, the bioinformatic tasks are essential to derive useful biological information from the raw data (chromatograms). As described previously, ESTs were assembled to create a gene index, and the assembled cDNA sequences were then annotated to identify the corresponding genes. This information also needs to be available to the scientific community. The collected information has therefore been stored in a public database accessible through a web application, XTScope. XTScope is a web resource that provides information on expressed sequence data from the tissues studied in this project. This section first describes the database content and data production, i.e. the information stored in XTScope and how it was obtained from the ESTs. It then presents the implementation and architecture of XTScope. Finally, the web interface is described to show the utility of this tool. XTScope is freely accessible at http://indigene.ibaic.u-psud.fr/EST.

Database content and data production
XTScope contains: (i) 11,738 tentative consensus (TC) sequences assembled from 48,785 ESTs; (ii) clusters (scaffolds) of TCs based on clone links; (iii) alignments against the genome sequence; (iv) functional annotations based on BLAST similarity searches; (v) open reading frame predictions; (vi) protein domain predictions and Gene Ontology (GO) annotations. This information was derived entirely from the raw EST data collected from two libraries covering two developmental stages: embryogenesis and metamorphosis. The first library was derived from dissected retinas and heads of young tadpoles, while the second was made from the central nervous systems (brains and spinal cords) of metamorphosing tadpoles.

Several bioinformatic steps are needed to obtain useful information from ESTs (see chapter 1). The process can be divided into three steps: (1) EST preprocessing, which extracts the high-quality regions from the raw data and removes nucleotides unrelated to the mRNA sequence of interest, such as vector, mitochondrial and polyA sequences; (2) EST assembly, which reconstructs the original transcript sequences and produces tentative consensus (TC) sequences; and (3) TC annotation, which identifies the corresponding genes and extracts as much information as possible.
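Two of the preprocessing operations, polyA removal and the 100 bp minimum length applied in this project, can be sketched as follows. This is an illustration rather than the Perl script actually used: the minimum run of 10 A's (or leading T's for reads taken from the 3' end) is an assumed parameter, and base-calling, quality trimming and vector removal are supposed to have been done beforehand.

```python
import re

MIN_LENGTH = 100  # bp; shorter sequences are discarded
# A terminal run of A's, or a leading run of T's for reads from the 3' end.
# The minimum run length of 10 is an illustrative assumption.
POLY_A_TAIL = re.compile(r"A{10,}$")
POLY_T_HEAD = re.compile(r"^T{10,}")

def trim_poly_a(seq):
    """Remove a polyA tail and/or a leading polyT stretch."""
    seq = seq.upper()
    seq = POLY_A_TAIL.sub("", seq)
    return POLY_T_HEAD.sub("", seq)

def preprocess(reads):
    """reads: dict of read name -> sequence already trimmed for quality and vector."""
    kept = {}
    for name, seq in reads.items():
        trimmed = trim_poly_a(seq)
        if len(trimmed) >= MIN_LENGTH:
            kept[name] = trimmed
    return kept


reads = {
    "read_1": "ACGT" * 30 + "A" * 25,   # 120 nt left after trimming: kept
    "read_2": "ACGT" * 20 + "A" * 30,   # 80 nt left after trimming: discarded
}
clean = preprocess(reads)
```

The length filter is applied after trimming, so a long read that consists mostly of polyA is discarded as well.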


I implemented a specific workflow to extract the EST sequences from the chromatograms, reconstruct the corresponding mRNA sequences and annotate them (see Figure 1).

EST preprocessing
The sequencing step generated 57,874 reads (chromatograms) from the two libraries. Phred (Ewing et al., 1998a; Ewing et al., 1998b) was used for base-calling, i.e. to determine the sequence from each chromatogram (see chapter 1). The “called” sequences are subject to errors and contain both high- and low-quality regions. Since Phred only removes low-quality regions from the ends, and low-quality regions may remain in the middle of sequences, LUCY (Chou and Holmes, 2001) was used to identify the longest high-quality region of each read. LUCY was also used to identify and remove vector sequence. PolyA tails were removed by a custom Perl script, and sequences shorter than 100 bp were discarded, as were those identified as derived from ribosomal and mitochondrial RNAs. This step yielded 48,785 high-quality sequences that were used in the assembly step.

EST assembly
An EST represents only a small part of a given transcript; the assembly step therefore attempts to reconstruct the original transcript sequence (see chapter 1). High-quality EST sequences from both libraries were assembled together into tentative consensus sequences using the PHRAP assembler (http://www.phrap.org). This step generated 8,756 contigs and 2,982 singletons. In theory, a contig should represent an individual gene, but reads from the same clone may fall into different TCs during assembly. To correct for this and to group together the sequences coming from the same gene, the TCs were further grouped by virtue of clone links into 6,547 unique groups (scaffolds). These clusters were obtained by scaffolding using mate-pair information. We retained scaffolds only if two clone links were available (excluding singletons) and if the orientation of the reads was consistent. Inconsistent links are also kept in the database and are displayed on the web site as “gap problem”.

TC annotation
Functional annotation was performed on both contigs and singletons to identify the corresponding genes and determine their putative function. It is based on similarities (E-value ≤ 0.001) with known proteins, found by BLAST (Altschul et al., 1990; Altschul et al., 1997) searches against the Swiss-Prot and TrEMBL (Boeckmann et al., 2003) and the UniRef90 and UniRef100 (Wu et al., 2006) databases. Before any search, contigs and singletons were masked for repetitive sequences using CENSOR (Kohany et al., 2006). Similarity with known proteins also indicates the putative coding region (ORF) of a TC, but the exact region cannot be determined by this method. For this task we used Framefinder (Slater, 2000), a tool based on hexamer frequency statistics. Predicted coding regions were further annotated to identify protein domains, which is useful for contigs having low or no similarity with known proteins. We used InterPro (Mulder et al., 2005) with the tool InterProScan (Quevillon et al., 2005), since it integrates the major protein signature databases into a single resource, such as PROSITE (Hulo et al., 2006), which uses regular expressions and profiles, ProDom (Bru et al., 2005), which uses automatic sequence clustering, and Pfam (Finn et al., 2006), which uses hidden Markov models (HMMs). InterPro also associates protein domains with Gene Ontology (GO) categories. Programs like Framefinder are useful for predicting coding regions when no known homologues are available, but they need a training set, which is a limitation for organisms lacking annotated sequence data. More recently we used TargetIdentifier (Min et al., 2005) to predict coding regions. This algorithm is based on BLASTX alignments with TrEMBL.

Masked TCs (contigs and singletons) were aligned against the genomic sequence of Xenopus tropicalis (JGI assembly v4.1) to reconstruct the respective gene models. For this purpose we used EXONERATE (Slater and Birney, 2005), the program used by Ensembl. The BLASTN alignment against the genomic sequence is also available in XTScope for information.
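The similarity threshold used for functional annotation (E-value ≤ 0.001) and the ranking of the retained hits can be sketched as below. The sketch assumes that the BLAST results are available in the standard 12-column tabular format; the accession strings are invented, and the real pipeline parsed BLAST reports with BioPerl, so this is an illustration only.

```python
from collections import defaultdict

MAX_EVALUE = 1e-3  # threshold stated in the text

def ranked_hits(tabular_lines, max_evalue=MAX_EVALUE):
    """Parse 12-column tabular BLAST lines and rank the significant hits per query.

    Columns assumed: query, subject, % identity, length, mismatches, gap opens,
    q.start, q.end, s.start, s.end, E-value, bit score.
    """
    best = defaultdict(dict)  # query -> subject -> (evalue, bitscore)
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        previous = best[query].get(subject)
        if previous is None or bitscore > previous[1]:
            best[query][subject] = (evalue, bitscore)

    ranking = {}
    for query, subjects in best.items():
        # Highest bit score first; rank 1 is the best similarity hit.
        ordered = sorted(subjects.items(), key=lambda kv: -kv[1][1])
        ranking[query] = [
            (rank, subject, evalue)
            for rank, (subject, (evalue, _)) in enumerate(ordered, 1)
        ]
    return ranking


lines = [
    "TC1\tACC_A\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t250",
    "TC1\tACC_B\t60.0\t150\t60\t2\t10\t460\t5\t155\t1e-5\t80",
    "TC1\tACC_C\t30.0\t50\t35\t1\t1\t150\t1\t50\t0.5\t30",    # above threshold
    "TC2\tACC_D\t25.0\t40\t30\t0\t1\t120\t1\t40\t2.0\t25",    # above threshold
]
annotation = ranked_hits(lines)
```

A TC with no hit below the threshold, such as TC2 here, simply does not appear in the result and remains unannotated at this stage.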

Figure 1: EST preprocessing and annotation workflow.

Putative cases of alternative splicing, obtained by two different methods, are stored in XTScope. First, to detect alternative forms of the same gene expressed in the nervous system, we used the alignments between TCs and (masked) genomic sequences. Evidence for alternative splicing was found when two TCs were aligned to the same genomic region with at least 95% mean overlap but with a different number of exons or different splice sites. Second, we used an additional approach to detect alternative forms of the same gene that may be expressed in other tissues. Since no large set of protein sequences is available for X. tropicalis, we used the UniRef database. We computed alignments between cDNA and UniRef protein sequences using BLASTX. Alignments characterized by a gap (at least 10 aa) introduced in the contig sequence were retrieved. Alternative splicing was considered present if the contig sequence including the gap could be aligned to the genomic sequence.

ESTs derived predominantly from our cDNA libraries are available in XTScope (for more details see section 2.1). These clones are likely to represent genes differentially expressed in the retina or in the central or peripheral nervous system during metamorphosis. These computations were performed by biologists and added to the XTScope database.


Implementation and Architecture
The XTScope architecture that I developed consists of a relational database and a web interface that external users can query. The database was implemented in MySQL and contains the information generated by the workflow described above. The web interface was created using HTML and PHP scripts that dynamically execute MySQL queries. It runs under an Apache web server on a Linux system. The web interface also provides links to the external databases used to perform the TC annotation (Figure 2).

Figure 2: XTScope architecture. The XTScope database contains the information about high-quality ESTs, the tentative consensus sequences and their annotation. This information is accessible through the XTScope website, which also provides links to external databases.

An entity-relationship data model was designed to store the information generated by the three steps of the EST preprocessing and annotation workflow (Figure 3). In-house scripts were used to parse the results and load them into the database. Dedicated Perl scripts were developed for the output formats of the PHRAP assembler, InterProScan, Framefinder, Exonerate (genomic alignment) and BLAST reports. BioPerl was mainly used to parse BLAST output files. The current implementation makes it easy to include new BLAST results. The alternative splicing cases were detected using the data contained in the XTScope database and BLAST results.

The TC annotation step generates a large amount of data that needs to be quickly accessible from PHP scripts. To improve the performance of MySQL queries, I used a unique ID generating mechanism to identify elements in tables and to establish the relationships between tables. Moreover, additional indices were added to tables, in particular those related to BLAST results.

The web application is organized in several modules, as described in the next section. Each module contains HTML forms to query the database and/or predefined queries to retrieve the information. The website navigation was implemented in a separate PHP library, which facilitates the addition of new modules to XTScope. The main HTML report contains the detailed information about TCs. This report can be accessed through the search module using an HTML form, or directly by a URL indicating the TC id, which allows external websites to link directly to the XTScope contig report.
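The use of numeric identifiers as join keys and of additional indices on the BLAST tables can be illustrated with a small schema. The sketch uses SQLite through Python's standard library as a stand-in for MySQL, and every table name, column name and record in it is hypothetical; it does not reproduce the actual XTScope data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contig (
    contig_id INTEGER PRIMARY KEY,      -- numeric identifier used as join key
    name      TEXT NOT NULL,
    length    INTEGER NOT NULL
);
CREATE TABLE blast_hit (
    hit_id      INTEGER PRIMARY KEY,
    contig_id   INTEGER NOT NULL REFERENCES contig(contig_id),
    db_name     TEXT NOT NULL,          -- e.g. 'Swiss-Prot'
    accession   TEXT NOT NULL,
    description TEXT,
    evalue      REAL NOT NULL,
    hit_rank    INTEGER NOT NULL
);
-- Secondary index so that the hits of one contig are found without a full scan.
CREATE INDEX idx_blast_hit_contig ON blast_hit (contig_id, hit_rank);
""")

conn.execute("INSERT INTO contig VALUES (1, 'TC1', 1250)")
conn.executemany(
    "INSERT INTO blast_hit VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "Swiss-Prot", "ACC_A", "hypothetical protein A", 1e-60, 1),
        (2, 1, "Swiss-Prot", "ACC_B", "hypothetical protein B", 1e-12, 2),
    ],
)

def best_hit(connection, contig_name):
    """Return (accession, description, evalue) of the top-ranked hit, or None."""
    return connection.execute(
        """SELECT h.accession, h.description, h.evalue
           FROM contig AS c JOIN blast_hit AS h ON h.contig_id = c.contig_id
           WHERE c.name = ? ORDER BY h.hit_rank LIMIT 1""",
        (contig_name,),
    ).fetchone()


top = best_hit(conn, "TC1")
```

Joining on a compact integer key and indexing the foreign key of the largest table are the two measures described above; the composite index also covers the ordering by hit rank.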


Figure 3: The entity-relationship data model for the XTScope database. Only the attributes used as identifiers (primary and foreign keys) are displayed. Tables are grouped by annotation procedure. The blue box contains the tables with the protein domain identification and GO terms. The black box contains the tables related to scaffold construction. The red box groups the tables with BLAST results, and the green box groups the tables with the genomic alignment results.

Of the external information, we stored locally only what was used for the sequence annotation (protein identifiers, protein domain descriptions, etc.). The accession numbers of the best similarity hits in the UniRef, Swiss-Prot and TrEMBL databases are stored in the XTScope database and are used to build cross-references to these external databases. The Ensembl website is linked through the alignments between genomic sequences and TCs.

Web interface
The XTScope web application supports data retrieval through a search engine and provides useful information about the tentative consensus sequences. To design the web interface, we identified several ways of querying the database that may be of interest to users. First, a complete report for each TC is necessary, containing detailed information about its assembly and annotation. Users must be able to find TCs by searching the annotations with different criteria. Statistics about the assembly, transcripts specific to our libraries and cases of alternative splicing must be easy to retrieve. Following these criteria, we divided the application into four modules: (i) a ‘Search’ engine and three informative modules, (ii) ‘Statistic’, (iii) ‘Expression’ and (iv) ‘Splicing’. Subsequently, as an addition to the implementation that I developed, a new module was incorporated into the web site; it is described in the ‘Extensions’ section.

The ‘Search’ module provides several options. A basic search uses the TC identifier, a number that ranges from 1 to the total number of contigs and singletons. For biologists working with physical clones, the clone identifier and plate number are also available as search criteria. For general users, gene annotations may be more interesting: it is possible to search for TCs corresponding to a given protein, TCs containing a specific protein domain or TCs associated with a given GO functional category.

Searching by the TC identifier generates a ‘Contig report page’ that provides detailed information about the assembly and annotations related to a TC sequence. This report displays the nucleotide sequence, the ORF predictions and the scaffold group based on clone link information. The ESTs forming the consensus sequence and the gene model based on genomic alignments are displayed in descriptive and graphic form (see Figure 4). The annotations based on sequence similarities are presented as summary tables, and each protein identifier links to the corresponding external record (Swiss-Prot, UniProt or TrEMBL).

Figure 4: Part of the Contig Report page for a given contig. The contig sequence is shown at the top. The open reading frame predictions and protein domain predictions are displayed as lines with a size proportional to the contig length, at their relative position within the contig sequence. The positions of the assembled ESTs are shown as grey or green lines, and more information is given in a table. ESTs are displayed as green lines when the 5’ and the 3’ EST from the same clone fall in the same contig.


The clone name or the plate name can be used to retrieve the TCs into which the corresponding ESTs were assembled. The GO classification for molecular functions can be browsed to retrieve the TCs associated with GO codes. Finally, XTScope allows more complex queries to retrieve the TCs containing a specific annotation. Since protein descriptions are free text, the web application allows full-text searches: one or more words, exact phrases, etc. can be used to search the protein descriptions. Full-text searches can be performed on the protein descriptions of the BLAST results or on the domain descriptions of the predicted InterPro domains (Figure 5). All the search functions and informative pages of XTScope display the TC identifier with a link to the corresponding ‘Contig report page’.

The ‘Statistic’ module of XTScope provides basic information about the assembly. It displays the list of the longest contig sequences as well as the list of the largest contigs.

The ‘Expression’ module provides information on differential expression in the nervous system. This module contains the list of ESTs predominantly found in our cDNA libraries. The complete list shows the ESTs and their associated contigs, while the summary version displays only the contigs concerned.

The ‘Splicing’ module retrieves the predicted cases of alternative gene forms. Three lists are available: predictions based on a different number of exons, on different splice sites, or on evidence of gaps in the alignment against known proteins.

Figure 5: Full-text searches in XTScope. The result page of a full-text search is a report with the TC/Contig identifier, the protein description containing the query words and the ‘Hit rank’, i.e. whether the protein description matching the query words corresponds to the best similarity hit, the second hit or a lower rank. The TC or Contig identifier provides a link to the respective ‘Contig report page’.
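The kind of report produced by a full-text search (TC identifier, matching description, hit rank) can be mimicked with a few lines of code. How XTScope implements the search is not specified here, so the sketch below is only an illustration: the record layout, the quoting convention for exact phrases and the example descriptions are all assumptions.

```python
def full_text_search(hits, query):
    """Search protein descriptions for all query words or for one exact phrase.

    hits: list of (tc_id, hit_rank, description) tuples (hypothetical layout).
    A query wrapped in double quotes is matched as an exact phrase; otherwise
    every word must occur in the description. Matching is case-insensitive.
    """
    query = query.strip()
    if len(query) >= 2 and query[0] == query[-1] == '"':
        phrase = query[1:-1].lower()

        def matches(text):
            return phrase in text
    else:
        words = query.lower().split()

        def matches(text):
            tokens = text.split()
            return all(word in tokens for word in words)

    report = [
        (tc_id, description, hit_rank)
        for tc_id, hit_rank, description in hits
        if matches(description.lower())
    ]
    # Best-ranked hits first, then by TC identifier.
    return sorted(report, key=lambda row: (row[2], row[0]))


hits = [
    (12, 1, "Thyroid hormone receptor alpha"),
    (12, 2, "Retinoic acid receptor beta"),
    (40, 3, "Thyroid peroxidase"),
    (7, 1, "Glutamate receptor 1"),
]
by_words = full_text_search(hits, "thyroid receptor")
by_phrase = full_text_search(hits, '"hormone receptor"')
```

Reporting the hit rank alongside each match lets the user see whether the query words describe the best similarity hit of a TC or only a weaker one.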


Extensions
The XTScope database has already been extended to include more annotations for the tentative consensus sequences. Contigs and singletons were compared to full-length clone sequences from Xenopus tropicalis (Gilchrist et al., 2004), which are named ‘Gurdon’s contigs’. These full-length sequences were aligned to the genome sequence, and they can also be retrieved using the module “Gurdon contigs on Genome”. Moreover, new annotations are available based on similarities with UniGene sequences for Xenopus laevis and Xenopus tropicalis (UniGene at the NCBI).

Conclusion
The XTScope database has been developed to provide information about the transcriptome of the nervous system of Xenopus tropicalis. The collection described here contains 11,738 tentative consensus sequences assembled from 48,785 high-quality EST sequences. TCs with their respective annotations are available through our web application. The database system provides a tool that will enable the Xenopus community to take advantage of the specific information collected from the nervous system during embryogenesis and metamorphosis. Thanks to the modular design of both the database and the web application, XTScope can easily incorporate new information related to Xenopus research. This flexibility is demonstrated by the extensions already incorporated in XTScope, as described in the previous section.


Chapter 3: Microarrays to study the transcriptome

In this chapter, DNA microarrays are used to study the transcriptome of Xenopus tropicalis during metamorphosis. More precisely, we are interested in the genes regulated during metamorphosis and in how the thyroid hormone receptors act on the control of transcription. To better understand the gene expression programs specific to particular tissues, six developmental stages were used to produce time series for three different organs. The chapter first describes the bioinformatic analysis performed on the microarray data. A series of steps must be carried out before microarray data can be analysed. The steps performed in this thesis are: (1) a contribution to the choice of the experimental strategy, that is, which samples are hybridized on the same slide (array), in order to use the available resources efficiently; (2) once the slides have been hybridized and scanned, an assessment of data quality and a normalization of the data to reduce existing biases; (3) finally, the reconstruction of the expression profiles using appropriate statistical tools. After these steps, the analyses carried out by the biologists identified 802 genes differentially expressed in the central nervous system, the tail and the liver. The biological analysis of the expression profiles is described for each organ studied. In this way, the structure and the dynamics of the transcriptional networks of amphibian metamorphosis are illustrated on a genomic scale.


The power of microarrays to analyse thousands of genes in parallel has significantly increased the speed of experimental progress. Microarrays are used in all fields of biology, on plants, animals and humans, to address a variety of biological questions. Expression profiles can be obtained from diverse developmental stages, under different environmental stress conditions, or in different disease states. The general goal of all these experiments is to determine the function of genes, their regulation and their interactions with other genes. Gene function is mainly inferred by assuming that genes sharing approximately the same expression pattern are likely to have a similar biological function. Therefore, the classical output of a microarray experiment consists of a number of clusters, each grouping genes with a similar behaviour under different conditions. This study monitors gene expression through several developmental stages during the metamorphosis of Xenopus tropicalis. Three tissues were studied independently, generating three time series (one for each tissue) that each cover six developmental stages. The first section of this chapter describes the bioinformatic aspects that I developed for this thesis: the contribution to the experimental design, the quality assessment of the microarray data, the choice of the normalization method and the profile reconstruction. The second section presents the biological application (article in preparation).

3.1 Bioinformatic issues

Experimental design

The first challenge in this study was the experimental design for the hybridizations. The available resources were mainly constrained by the number of arrays available to perform the experiments. Although replication is crucial for statistical analysis and reliable results, it requires more resources for each experiment. Three replicates are usually used in microarray studies; however, a ‘common reference’ design then requires 18 arrays (slides) for a time-series experiment with 6 conditions, as in our case. Because the experiment consists of 3 time series, 54 slides would have to be used, in addition to the slides used for tests and controls. Recent articles about experimental design point to alternative designs that are more efficient in terms of resources (number of slides and RNA samples) and that can answer the biological question with similar statistical performance (minimizing the observed variance) (Yang and Speed, 2002; Kerr and Churchill, 2001). One of the main criticisms of the reference design is that half of the resources are used to measure a condition that, in most cases, has no biological interest. An interwoven design uses half of the resources to generate gene profiles, requiring 9 slides instead of 18 to measure 6 time points with 3 technical replicates for each time point. For this reason, we chose an interwoven design for the three time series under study (Figure 1). Although the design used differs slightly from the optimal design described by Kerr and Churchill (2001), it performed well in our biological application.
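The slide counts quoted above follow from simple arithmetic, which can be checked with a short Python sketch. The function names are illustrative and not part of the thesis. The sketch assumes that, in the common reference design, every measurement of a time point occupies one slide (the other channel carries the reference), whereas in the interwoven design every slide carries two time points of interest.

```python
def reference_design_slides(n_conditions: int, n_replicates: int) -> int:
    """Slides needed when every sample is hybridized against a common reference."""
    return n_conditions * n_replicates


def interwoven_design_slides(n_conditions: int, n_replicates: int) -> int:
    """Slides needed when both channels of every slide carry a sample of interest.

    Each slide contributes one measurement to two different time points, so the
    same number of measurements fits on half as many slides.
    """
    total_measurements = n_conditions * n_replicates
    if total_measurements % 2:
        raise ValueError("an odd number of measurements cannot be paired on slides")
    return total_measurements // 2


n_time_points, n_replicates, n_series = 6, 3, 3

per_series_reference = reference_design_slides(n_time_points, n_replicates)
per_series_interwoven = interwoven_design_slides(n_time_points, n_replicates)

# Totals exclude the additional test, control and self-self slides.
print("reference: ", per_series_reference, "per series,",
      per_series_reference * n_series, "in total")
print("interwoven:", per_series_interwoven, "per series,",
      per_series_interwoven * n_series, "in total")
```

The counts per series (18 versus 9) and the total of 54 slides for the reference design match the figures given in the text.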


Figure 1: Interwoven design used in our study. Circles represent samples or time points (T1 to T6), and arrows represent a direct hybridization between two samples. The arrows point from the time point labelled with Cy3 to the time point labelled with Cy5.

Analysis steps for the microarray experiment

The standard analysis of a microarray experiment starts from the data as they come out of the scanner. These ‘raw’ data are tab-delimited files that contain the signal intensities and a number of other characteristics of the spots, such as spot size and a quality flag for each spot. Before these data can actually be analysed, some quality assessment and normalization steps are required (see chapter 1).

Quality assessment

The quality assessment can help to discover serious quality problems, or even mistakes that occurred during the labelling and hybridization steps. If the quality assessment does not reveal any serious irregularities, the analysis can continue and the data can be normalized. In this step we evaluated the background signal, the intensities relative to the background signal, and the intensities themselves to detect bias related to dye effects. First, we checked the distribution of the background signal on the arrays to detect regions where spots may have unreliably high signals. Plots of the background distribution showed that for some arrays these regions are very small (Figure 2, top), whereas for others the noise is higher (Figure 2, bottom).

Figure 2: Background signal distribution for two arrays. Left images correspond to the Cy5 channel (red) and right images to the Cy3 channel (green).


The assessment of the intensity relative to the background signal indicates how well the intensities differ from the background, and therefore how well the array and the experimental procedure measure gene expression. The mean and median of the intensities and of the background were computed for the Cy3 and Cy5 channels, together with the histograms of the background-subtracted signal intensities (Figure 3). The number of spots with intensity values above 2 times the standard deviation of the background indicates how many spots give a clear expression signal (Table 1).
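The spot count described above can be sketched in a few lines of Python. This is only one possible reading of the criterion: the sketch assumes that a spot is counted when its background-corrected intensity exceeds twice the standard deviation of the background values of the array. The function name and the toy values are hypothetical and are not thesis data.

```python
from statistics import pstdev


def count_clear_spots(foreground, background, n_sd=2.0):
    """Count spots whose background-corrected intensity exceeds n_sd * SD(background).

    foreground, background: per-spot intensities for one channel of one array.
    """
    if len(foreground) != len(background):
        raise ValueError("foreground and background must have the same length")
    threshold = n_sd * pstdev(background)
    return sum(1 for fg, bg in zip(foreground, background) if fg - bg > threshold)


# Toy values for five spots of one channel (illustrative only).
cy5_foreground = [1200.0, 310.0, 95.0, 4000.0, 130.0]
cy5_background = [100.0, 90.0, 110.0, 120.0, 80.0]

print(count_clear_spots(cy5_foreground, cy5_background))
```

Applied to each channel of each array, a function of this kind yields one pair of counts per array, in the same layout as Table 1.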

Figure 3: Histogram of background-corrected intensities for the Cy5 (red) and Cy3 (green) channels. The overall mean and median of the corrected intensities are indicated.

Array           Cy5 (red)   Cy3 (green)
66CMEvs66CME    5629        5914
56CMEvs57CME    5285        4993
55CMEvs57CME    4926        4101
57CMEvs64CME    2939        5205
62CMEvs55CME    5034        5117
66CMEvs55CME    5717        6064
64CMEvs56CME    5356        5566
64CMEvs66CME    5205        5841
62CMEvs56CME    4663        5518
66CMEvs62CME    5154        5980

Table 1: Number of spots with intensities above 2 × the standard deviation of the background. The values correspond to the brain (CME) time series, from a total of 6272 spots, including control spots.

To assess the presence of dye effects, we used the plot of the ratios versus intensities (MA-plot). In our dataset, the bias due to the dye depends on the array. For most arrays a normalization step is strongly recommended (see Figure 4).

Figure 4: MA-plot for the array hybridizing stages 64 and 56. Control spots are labeled with colors.
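For reference, the two coordinates of an MA-plot are conventionally defined from the red (R, Cy5) and green (G, Cy3) intensities of a spot as M = log2(R/G) and A = (1/2) log2(R × G). The following Python sketch computes them for paired intensities. It assumes background-corrected input and simply skips spots with a non-positive intensity in either channel; this handling is an illustrative choice and is not taken from the thesis.

```python
import math


def ma_values(red, green):
    """Return a list of (M, A) pairs for paired channel intensities.

    M = log2(R / G) is the log-ratio, A = 0.5 * log2(R * G) the average
    log-intensity. Spots with a non-positive intensity are skipped.
    """
    pairs = []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            continue
        m = math.log2(r / g)
        a = 0.5 * math.log2(r * g)
        pairs.append((m, a))
    return pairs


# Toy intensities (illustrative only): the first spot is four times brighter in red.
print(ma_values([8.0, 4.0, 0.0], [2.0, 4.0, 1.0]))
```

A dye bias appears in such a plot as a systematic departure of M from zero, often varying with A.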


Normalization step

There are many sources of systematic variation in microarray experiments that affect the measured gene expression levels. Normalization serves to remove bias of a non-biological nature from the data. Two-color cDNA microarray experiments are comparative in nature; therefore, commonly used normalization methods such as loess focus on adjusting the log-intensity ratios between the red and the green channels. Loess (Yang et al., 2002) is based on the hypothesis that most of the genes are not differentially expressed. In our experiments, 3000 genes of Xenopus tropicalis were selected to be printed on the slides (dedicated array). Since the selected genes are a small portion of the total number of genes of X. tropicalis and we do not know how many of them are expressed in the tested conditions, we cannot assume that most of them are not differentially expressed, and therefore we do not expect the hypothesis underlying normalization to hold for our data. For this reason, several normalization methods were tested to select the most appropriate one for our experiment. We tested three methods provided by the Limma package (Smyth, 2004): loess for global normalization; print-tip loess, which corrects each print-tip group separately; and robust spline, which is a compromise between print-tip and global loess normalization, with 5-parameter regression splines used in place of the loess curves. First, we assessed how well these methods corrected dye and spatial bias. MA-plots were used to check the dye bias corrections. Although the methods give different results, all of them corrected the dye bias (Figure 5). Evaluating the ratio distribution per print-tip group showed the advantage of local normalization methods (print-tip loess and robust spline) over the global method (loess), since the bias due to the print-tip group was better corrected (Figure 6).
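The difference between global and print-tip normalization can be illustrated with a deliberately simplified Python sketch. It replaces the intensity-dependent loess fit by plain median centring, which is much cruder than the methods provided by Limma; the point is only to show why a correction computed per print-tip group removes a group-specific offset that a single global correction leaves in place. Function names and values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def center_globally(m_values):
    """Subtract the overall median log-ratio from every spot."""
    overall = median(m_values)
    return [m - overall for m in m_values]


def center_by_group(m_values, groups):
    """Subtract from each log-ratio the median log-ratio of its print-tip group."""
    by_group = defaultdict(list)
    for m, g in zip(m_values, groups):
        by_group[g].append(m)
    medians = {g: median(values) for g, values in by_group.items()}
    return [m - medians[g] for m, g in zip(m_values, groups)]


# Toy log-ratios: print-tip group 1 is shifted upwards, group 2 downwards.
m_raw = [0.5, 0.7, 0.6, -0.4, -0.2, -0.3]
tips = [1, 1, 1, 2, 2, 2]

m_global = center_globally(m_raw)
m_local = center_by_group(m_raw, tips)

print("group 1 median after global centring:", median(m_global[:3]))
print("group 1 median after per-group centring:", median(m_local[:3]))
```

After global centring the two groups remain offset from each other, whereas after per-group centring both groups are centred on zero, which is the behaviour examined with boxplots per print-tip group.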

Figure 5: Comparison of non-normalized, print-tip loess, robust spline and loess normalized data using the MA-plot.

Each time series includes a self-self array, i.e. an array on which the same sample/condition is labelled with both the Cy3 and the Cy5 dye and co-hybridized. For these data there is no differential expression and, consequently, all log-ratios are expected to be close to zero in the MA-plot. Again, local normalization methods performed better than global normalization (data not shown).


Figure 6: Boxplot of ratios per print-tip group.

Finally, normalization methods were also compared based on duplicated spots. Since each oligo was spotted in duplicate in two separate grids, the correlation between duplicated spots should be high if the normalization step corrected the bias. The correlation and the standard deviation between duplicated spots confirm the previous results: local normalization methods are best suited for this dataset. Other methods exist in addition to the normalization methods implemented in the Limma package. A method specially designed for dedicated arrays (Wang et al., 2005) was tested in a preliminary comparison, but it gave results similar to loess, since both are global normalization methods (data not shown).

Profile reconstruction

The interwoven design used in this study requires complex analysis procedures to reconstruct the factor of interest (e.g. a gene profile across a time series) from the normalized data. Some R packages are available to estimate the time profiles and to assess differentially expressed genes. The Limma package implements a gene-specific linear model to estimate the time profiles and an empirical Bayes method to determine differential gene expression. Maanova, on the other hand, implements a two-stage (fixed or mixed) model (http://www.jax.org/staff/churchill/labsite/software/anova) and uses F tests to assess differential expression. We chose Maanova to carry out our data analysis, as it also includes functionalities to perform clustering analysis (for a complete comparison see chapter 4).

Data analysis

The biological application, including the analysis, the interpretation of the clusters and the biological mechanisms behind these expression profiles, is presented in the following section.
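The principle of profile reconstruction from an interwoven design can be sketched with ordinary least squares. This is a simplified stand-in, not the two-stage model of Maanova or the empirical Bayes approach of Limma. It assumes that, for one gene, the normalized log-ratio of an array equals the difference between the profile values of its two samples plus noise, and it fixes the first stage at zero. The nine stage pairs are read from the array names of the brain series in Table 1; treating the first stage of each name as the Cy5 sample is an assumption, and the log-ratios are simulated.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design is not connected: profile cannot be estimated")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def reconstruct_profile(stages, arrays, log_ratios):
    """Least-squares profile of one gene, relative to stages[0].

    arrays: list of (cy5_stage, cy3_stage) pairs, one per array.
    log_ratios: one normalized log-ratio per array, modelled as
    profile[cy5_stage] - profile[cy3_stage] + noise.
    """
    free = stages[1:]  # the first stage is the baseline, fixed at 0
    index = {s: k for k, s in enumerate(free)}
    rows = []
    for cy5, cy3 in arrays:
        row = [0.0] * len(free)
        if cy5 in index:
            row[index[cy5]] += 1.0
        if cy3 in index:
            row[index[cy3]] -= 1.0
        rows.append(row)
    p = len(free)
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, log_ratios)) for i in range(p)]
    estimates = solve_linear_system(xtx, xty)
    profile = {stages[0]: 0.0}
    profile.update(zip(free, estimates))
    return profile


stages = [55, 56, 57, 62, 64, 66]
arrays = [(56, 57), (55, 57), (57, 64), (62, 55), (66, 55),
          (64, 56), (64, 66), (62, 56), (66, 62)]

# Simulated, noise-free log-ratios generated from a hypothetical profile.
true_profile = {55: 0.0, 56: 0.5, 57: 1.0, 62: 2.0, 64: 1.5, 66: 1.0}
log_ratios = [true_profile[a] - true_profile[b] for a, b in arrays]

print(reconstruct_profile(stages, arrays, log_ratios))
```

With noisy data, the same system averages the direct and indirect comparisons between two stages, which is how an interwoven design recovers a full time profile from nine arrays.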


3.2 Xenopus tropicalis metamorphosis transcriptomes analysis using microarrays


Xenopus tropicalis metamorphosis transcriptomes analysis using microarrays

Raphaël Thuret1,2*, Ana Carolina Fierro1,2* and Nicolas Pollet1,2#

1 CNRS UMR 8080, F-91405 Orsay, France.
2 Univ Paris Sud, F-91405 Orsay, France.

* First co-authors
# Corresponding author

Fax: +33 169154949
Phone: +33 169157273
e-mail: [email protected]
Address: Laboratoire Développement et Evolution, CNRS UMR 8080, Bat 445, Université Paris-Sud, 91405 Orsay cedex, France.


Abstract

Background: Amphibian metamorphosis is a developmental process that played a critical role during vertebrate evolution. Frog metamorphosis is triggered by thyroid hormones, which act on cells via their receptors, which are ligand-binding transcription factors. Here we asked which genes are regulated during metamorphosis and how the thyroid hormone receptors mediate the specific modulation of transcription required to change cell fate during metamorphosis. We used Xenopus tropicalis as a model because it benefits from a recent genome project and enables genetic manipulations.

Results: To gain insights into the tissue-specific gene expression programs, profiles of gene expression during metamorphosis were obtained in three different organs taken at six time points. A total of 802 genes differentially expressed in the central nervous system, the tail or the liver were identified. Analysis of the expression profiles provides evidence for up- and down-regulations that are either constant or transient. We show that the repertoire of differential expression of each organ is mainly non-overlapping with that of the others, and that this is due to the activity of a specific set of transcription regulators for each organ. We also identify putative Thyroid hormone Responsive Elements by in silico promoter studies.

Conclusions: This report illustrates both the structure and the dynamics of amphibian metamorphosis transcriptional networks on a genomic scale. Moreover, this study provides a foundation for functional genomics of amphibian metamorphosis using Xenopus tropicalis as a model to study the roles of thyroid hormone during vertebrate development.


Background

Amphibian metamorphosis is a unique and complex developmental process that brings about dramatic biochemical and morphological changes. Each tadpole organ is modified during this transition from a larval to an adult organism. For example, cells from the tail are fated to die by apoptosis, while those from the limb buds proliferate, and specific cell types of the nervous system, the intestine or the skin enter a differentiation process. Thyroid hormones (TH) are necessary and sufficient to trigger metamorphosis. This is why metamorphosis is a unique model to study the genomic effects of TH. These hormones act on cells via their receptors, the Thyroid Hormone Receptors (THR), which are ligand-binding transcription factors of the nuclear receptor superfamily [1, 2]. THR act as heterodimers with Retinoid X Receptors (RXR) and bind their hormone responsive elements (Thyroid hormone Responsive Elements, TRE) present in the regulatory regions of transcription units [3]. Before metamorphosis, in the absence of hormone, THR act as transcriptional repressors of their target genes by recruiting co-repressor complexes. When TH is produced, it binds to THR and thus activates the transcription of target genes by replacing co-repressor complexes with co-activator complexes. These receptors therefore modify cell fates through a transcriptional reprogramming that operates by chromatin modifications. The nature of this reprogramming depends on positional cues and cell type.

Gene regulation by TH in the context of metamorphosis has been studied using differential display screens on cultured cells, tail, brain and intestine [4-8]. More recently, microarray studies have been reported [9-11]. Overall, two waves of transcriptional activation have been identified. The first wave corresponds to the regulation of direct TH-response genes. The second wave corresponds to the regulation of late TH-response genes that trigger most of the morphological changes occurring during metamorphosis. Direct TH-response genes encode proteases, proteins involved in various metabolic processes, and transcription factors including TRβ itself, THbZIP, BTEB and Fra2 (for review see [12]). These direct TH-response factors would regulate the transcription of the late-response genes.

Most studies on metamorphosis use the anuran amphibian model Xenopus laevis. However, the X. laevis genome is pseudotetraploid, and it is common to find one gene represented by two paralogous loci (“allogenes”) in X. laevis. Furthermore, nucleotide sequence similarity between “allogenes” is variable and there is no genomic sequence to help decipher a global view. This feature of the X. laevis genome implies that when the expression of highly similar “allogenes” differs in time or space, the results of nucleic acid hybridization experiments are impaired. On a genomic scale, such a technical limitation has important consequences. X. tropicalis is a better model for such transcriptomic studies [13, 14]. With more than a million ESTs (Expressed Sequence Tags) and the availability of the genome sequence, large-scale surveys of gene expression are easier to tackle and interpret in this diploid anuran.

We took advantage of the X. tropicalis model to answer the following questions: which genes are regulated during metamorphosis, and how does TH/THR mediate the specific modulation of transcription required to accomplish the various cell fate changes during metamorphosis? To gain insights into the tissue-specific gene expression programs, profiles of gene expression during metamorphosis were studied in different tissues with microarrays. The transcription factors expressed in the different organs were searched to identify direct TH-response genes by in silico promoter studies.


Repositories of X. tropicalis nucleotide sequences were mined to build a long-oligonucleotide microarray representing 2902 different genes. This microarray was then used to study physiological metamorphosis in the tail, central nervous system and liver of X. tropicalis. After analysis, the sets of differentially expressed genes were compared to identify similarities and differences in the transcriptional programs of the three organs. The availability of genomic sequences allowed us to search for TRE in the promoters of regulated genes and to potentially identify TH-regulated genes.

Results

Experimental design for a time-course study on gene expression during metamorphosis

Gene expression modifications were monitored in three dissected organs during metamorphosis to study common and specific transcriptional regulations. Our study is focused on the repertoire of transcription factors that can be regulated in each organ. Time-course experiments were performed on organs accomplishing different metamorphic fates: central nervous system (brain and spinal cord, CNS), tail and liver. These organs were selected because they are respectively partially remodeled, totally resorbed, or subject to a metabolic shift (from ammonotelism to ureotelism) during metamorphosis. The choice of metamorphic stages to sample was driven by results gathered from the literature. We selected the metamorphic period where gene expression changes were the most important for each organ. In the tail, major gene expression changes are observed between NF (Nieuwkoop and Faber) stages 59 and 64 [5]. Accordingly, we chose NF stages 58, 61, 62, 63, 64 and 65 to study gene expression in the tail during metamorphosis. In the central nervous system, innervation changes tend to occur throughout the metamorphic stages. We chose to cover the period with NF stages 55, 56, 57, 62, 64 and 66 [7]. In the liver, samples covering the whole metamorphic process (NF 55, NF 58, NF 60, NF 63, NF 65 and NF 66) were selected since few data on gene expression are available [15]. The 6 different samples were compared using an interwoven loop design [16, 17]. This experimental design offers a good compromise between the number of arrays to use (6 different samples are compared in 9 hybridization experiments) and the statistical representation of each signal (variance minimization). Each sample is subjected to three direct and two indirect comparisons. The indirect comparisons are made using two independent paths (see the design in fig 1.B for an example: NF 55 and NF 58 are not directly compared, but their difference in expression level can be estimated independently through the comparisons with NF 60 and with NF 63).

Analysis of gene expression profiles in the tail

Previous studies of tail metamorphosis [5, 6, 11, 18] showed that different biological processes (such as matrix metalloproteinase expression, maintenance of tissue contraction, and regulation of apoptosis via the mitochondrial pathway) are engaged to resorb tail tissues in response to thyroid hormone. Our microarray study of tail resorption during metamorphosis identified a set of 381 genes differentially expressed between NF stages 58 and 65. We analysed the corresponding gene expression profiles using K-means clustering into ten groups. The results are presented in figure 1. Four broad expression patterns can be observed. Transient up-regulation is characteristic of clusters 1 and 2. Constantly up-regulated genes between NF stage 58


and 65 are found in clusters 3, 4 and 5. Cluster 6 is the only group of transiently down-regulated genes. Finally, genes constantly repressed starting from NF stage 58 are gathered in clusters 7, 8, 9 and 10. We analyzed the molecular functions and biological processes associated with each cluster using Gene Ontology term enrichment (Table 1).

Transitory up-regulated profiles: Transitory up-regulation is associated with the Wnt signaling pathway, regulation of transcription, cellular metabolism and cell cycle (figure 1, clusters 1 and 2). Metabolism is more associated with cluster 2 and transcriptional regulation with cluster 2. A complex modulation of the Wnt pathway is observed: both positive (axn, porca, wnt10, frz1 and tcf1c) and negative (wif1) regulators are found in clusters 1 and 2. Coactivators such as hmgn3, hacs and psmc5 account for a global stimulation of transcription. Homeobox genes (tlx1, hoxb7, lhx1, six2) and zinc finger proteins (zn184, zg7, zo71) are recruited for gene-specific transcriptional regulation. Cell cycle checkpoint proteins mcm6, rfa1 and bub3 are found in cluster 2. Modifications of oxidative enzymes of the mitochondria (ldha, succa, gapdh, isocitdhmp) are most likely associated with the activation of apoptosis.

Constantly up-regulated profiles: Several genes previously reported as being transcriptionally modulated during metamorphosis showed up as constantly up-regulated (fig 1, clusters 3, 4 and 5): deiod2 (deiodinase 2, an enzyme catalyzing the conversion of T4 to T3; cluster 5), gata1a (a transcription factor regulating the switch from the larval to the adult hemoglobin; cluster 5), intb1 (integrin β1, a receptor of extra-cellular matrix proteins; cluster 4), nr2b2 (RXRβ, an obligatory partner of thyroid hormone receptors; cluster 5) and prlr (prolactin receptor, playing a role in antagonizing TH actions; cluster 4). These results indicate the efficiency of our experiments in capturing physiological transcriptional regulations. The biological processes found most significantly enriched among these constantly up-regulated genes are response to stress, regulation of transcription, and protein catabolism (table 1). The extra-cellular matrix is remodeled in the tail during metamorphosis [5, 6]. Indeed, we observed up-regulated genes playing a role in proteolysis: proteases mostly found in cluster 1 (mmp2, ctsb, cstd, ctsl, elas3bp) and ubiquitin conjugation (polyubiquitin). The same genes and others (nfat, lhx2 or bmp2r) are involved in the response to stress. This response can be related to the known links between the thyroid and stress axes [19] and to the roles played by the immune system during tail resorption [20, 21]. Among the transcriptional regulators, we found both global and specific modifiers. In the first category, co-activators or chromatin modifiers were found, such as tcp4, hmgb1, pc2, p66, runx2, and notably nr2b2. In the second category, we found several known or orphan zinc finger proteins (zo72, xfin, zo6, zn343, zg5, zn207) as well as transcription factors whose role during tail resorption is unexpected (tbx6, anf1, fd4pr and twn).

Transitory down-regulated profiles: Regulation of transcription and macromolecule biosynthesis are associated with transitory down-regulation (figure 1, cluster 6). Both global (anm1, hdac1, cbx6, chd1l, hmg14, hmgn1, z297b) and specific transcriptional regulators (foxd3, vax1, tbx6l, znf84) share this expression profile. We can observe gene products playing a role in distinct growth factor signalling pathways, such as frz8 (frizzled 8, a receptor of WNT) and fgfr2 (an FGF receptor), as well as hdgf (Hepatoma-Derived Growth Factor).
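The Gene Ontology term enrichment used above to annotate the clusters can be illustrated with a one-sided hypergeometric test. The manuscript does not state which test was applied, so this choice is an assumption; it is a common way of asking whether a GO term is carried by more genes of a cluster than expected by chance. The function name is hypothetical, and in the example only the number of genes on the array (2902) comes from the text, while the other counts are invented.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_cluster, n_hits):
    """One-sided hypergeometric p-value P(X >= n_hits).

    n_universe: genes represented on the array
    n_annotated: genes of the array annotated with the GO term
    n_cluster: genes in the cluster
    n_hits: genes of the cluster annotated with the GO term
    """
    upper = min(n_annotated, n_cluster)
    total = comb(n_universe, n_cluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_cluster - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 150 annotated genes on the array, a cluster of 40 genes,
# 8 of which carry the term (about 2 would be expected by chance).
p_value = hypergeom_enrichment_p(2902, 150, 40, 8)
print(f"p = {p_value:.3g}")
```

In practice one p-value is computed per GO term and per cluster, and the resulting values are usually corrected for multiple testing before terms are reported as enriched.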
Constantly down-regulated profiles: Among the constantly down-regulated clusters (fig 1, clusters 7, 8, 9 and 10), regulation of transcription, regulation of progression through the cell cycle, and transport are the prominent processes found regulated (table 1).


Cluster 8 deals mainly with transcriptional regulation. Some known or probable co-activators such as ada3, myst1, hmgb1rs15, usf1, smarcc1, edf1 or pc1 are shut down and therefore alter transcription on a global level. Specific transcription factors are also found down-regulated, such as homeobox genes (optx2, msx2, gsc, mxr, shox2, dlx1), known or orphan zinc finger proteins (zbt17, zbt34, zg8) or others (junb, esr6e, olig2, olig3). In cluster 7, the regulation deals mainly with actin cytoskeleton function, especially the muscular actin network (act3, mle1, mlrs, tnnc2, tnnt3, tpm1). This is consistent with the disappearance of tail muscle, which occurs essentially between NF stages 57 and 63 [22]. The cell cycle is regulated by turning down genes involved in the initiation of mitosis (dpk, gppsup2 and cdcp6) or in the inhibition of cell cycle arrest (cmyc2, ppase2cd). Other genes modulating cell proliferation are also down-regulated (bhh or 143ga). Cluster 9 represents a global down-regulation of cellular physiological processes (macromolecule biosynthesis and metabolism). The observation of housekeeping genes being turned down could be interpreted either as a direct repression or as a consequence of tissue resorption. Cluster 10 is associated with transport and regulation of cellular metabolism.

Analysis of gene expression profiles in the central nervous system

Morphological changes in the nervous system during metamorphosis were previously described [23-27]. However, only a few studies on the gene expression program of the CNS have been done [7, 28]. One of the major difficulties is the identification of the specific transcriptional program of each cell type composing the central nervous system. A set of 502 genes was identified as being differentially expressed in the CNS between NF stages 55 and 66. The gene expression profiles were clustered into 8 groups by K-means (fig 2). Three classes of gene profiles can be seen in the clustering results. Transient up-regulation is observed for genes in clusters 1 and 2. Constantly up-regulated genes are in clusters 3 and 4, while constantly down-regulated genes belong to clusters 5, 6 and 7. The remaining cluster (Cluster H) is composed of differentially expressed genes exhibiting a divergent profile. Results of the GO (Gene Ontology) term enrichment analysis for the three global expression patterns observed are presented in Table 2.

Transitory up-regulated profiles: Transiently up-regulated genes (fig 2, clusters 1 and 2) are implicated in the regulation of the cell cycle, notably in cell cycle arrest (skb1, tsc2) or proliferation inhibition (eststferf). These processes are in agreement with precursor cells engaging in differentiation. Cluster 1 is particularly enriched in transcription factors, in particular thbzip, a transcription factor regulated by TH during metamorphosis. nkx2.1, usf1 and sox1 were found similarly expressed, together with unknown zinc finger proteins (zn561, zn300).

Constantly up-regulated profiles: Analysis of the up-regulated clusters 3 and 4 revealed the expression of genes already identified as direct targets of thyroid hormone: nr1a2b1 (TRβ, thyroid hormone receptor up-regulated during metamorphosis, cluster 3) and bzip (fra2, direct TH-response transcription factor, cluster 4). The glucocorticoid receptor (gr, cluster 4) is another gene up-regulated during metamorphosis in the brain and identified in this category [19]. Transcription factor activity, Wnt signaling pathway and glucose metabolism are the most significant GO terms enriched in this group. Wnt signaling is essentially found in cluster 4 together with transcription factors. Cation transporter activity and neurophysiological processes are significantly enriched terms in clusters 3 and 4. Genes encoding neurotransmitters or


neurotransmitter receptors are found up-regulated: oxt, chrnb2, chrnb4 in cluster 3 or chrm2, act1, adra2, gabt4, gabarapl1, grm7, grin1 in cluster 4. This is concordant with a new wave of neurogenesis identified by the expression of different transcription factors involved in this process (xotx2, zic1, ngn1 and co-regulators pcafa and pcafb) as well as molecules implicated in axon growth or guidance (epha2, elfa1, ephrinb3, ntrn1). Global regulators of transcription (co-factors nspc1, bcap37, smarcb1), homeobox proteins (lbhx1, cad2, mxr, hoxb7, gsc) or uncharacterized zinc finger proteins (zn343, zf260) were found up-regulated. All these transcriptional regulators were not implicated in metamorphosis so far. Constantly down-regulated profiles: Chromatin silencing, regulation of translation, transcription factor activity and regulation of cell cycle are GO terms found significantly overrepresented in constantly down regulated clusters 5, 6 and 7 (Fig2, table 2). While we found structural components of ribosome as being represented mostly in clusters 5 and 6, chromatin silencing is more associated with cluster 6. Enhancement of gene expression in the CNS by inactivation of co-repressor activity is evidenced by the down-regulation of several chromatin modifiers such as the CBX genes (cbx1, cbx3 and cbx5), hdac1 or ctcf. We observed the inhibition of HLH binding proteins id3 and id4. Moreover, many transcription factors are also down regulated such as homeobox transcription factors of the HoxC family (hoxc4, hoxc6 and hoxc9), zinc finger proteins (zf161, zn161, zo6 and zo71) or other factors (sox2, zic3). Causes of the global down-regulation of 11 ribosomal proteins are poorly understood. This is correlated with the down-regulation of several translational factors such as if5, if36 if2b or sui1. However, this could be interpreted as the consequence of specific cell types losses. 
Cluster H: Among the differentially expressed genes exhibiting divergent profiles, we observed key regulators such as jagged1 (jdd11, a NOTCH ligand), smoothened (smo, a hedgehog receptor), igfr1 (insulin-like growth factor 1 receptor), derriere (a TGFβ family member) and frz2 (secreted frizzled 2, a secreted WNT inhibitor). Transcription factors such as p3f2b (pou3), cdx1 and mad4 were found to be similarly expressed but have not been reported as playing a role during metamorphosis.

Analysis of gene expression profiles in the liver

The gene expression cascade occurring in the liver during amphibian metamorphosis is poorly described. The morphological changes of liver cells have been widely documented in the past [23, 25, 29], and modifications of metabolism, especially the transition from ammono- to ureotelism, have been reported in amphibians [29-32]. One study reported the identification of 20 genes as being differentially expressed in the liver [15]. Indeed, few genes have been implicated in the transcriptional regulation of the metabolic shift occurring in the liver during metamorphosis. We report a set of 133 genes identified as being differentially expressed in the liver between NF stages 55 and 66. The corresponding expression profiles were clustered by K-means into 6 different clusters (Figure 3) that can be further grouped into three categories. Three of these clusters are composed of genes showing a transient (clusters 1 and 2) or constant (cluster 3) up-regulation during the period studied. Cluster 4 gathers genes with a transitory down-regulated profile. Clusters 5 and 6 are made of genes showing a constant down-regulation from NF stage 63 to 66 and from NF stage 58 to 66, respectively. Results of the GO term enrichment analysis for the three global expression patterns observed are presented in Table 3.
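As a rough illustration of how such profiles can be read, the sketch below rebuilds a profile by subtracting the initial-stage log2 value from the values of the other stages, as stated in the Materials and methods, and then assigns a coarse label to it. The labelling rule, the threshold of 1 (a two-fold change), the function names and the intensity values are assumptions made for this example only; the clusters reported here were obtained by K-means, not by this rule.

```python
from typing import Dict


def reconstruct_profile(log2_by_stage: Dict[int, float], baseline_stage: int) -> Dict[int, float]:
    """Express every stage relative to the baseline stage (log2 scale)."""
    baseline = log2_by_stage[baseline_stage]
    return {stage: value - baseline for stage, value in sorted(log2_by_stage.items())}


def label_profile(profile: Dict[int, float], threshold: float = 1.0) -> str:
    """Coarse, hypothetical labelling of a baseline-subtracted profile."""
    values = [value for _, value in sorted(profile.items())]
    final = values[-1]
    if final >= threshold:
        return "constant up"        # still above threshold at the last stage
    if final <= -threshold:
        return "constant down"      # still below -threshold at the last stage
    if max(values) >= threshold:
        return "transient up"       # rose above threshold, then came back
    if min(values) <= -threshold:
        return "transient down"     # dropped below -threshold, then came back
    return "unchanged"


# Hypothetical log2 intensities for one liver transcript at four NF stages.
example_log2 = {55: 8.0, 58: 9.5, 63: 9.6, 66: 8.2}
example_profile = reconstruct_profile(example_log2, baseline_stage=55)
example_label = label_profile(example_profile)
print(example_profile, example_label)
```

A fixed threshold is only a reading aid; it ignores the replicate variance that the mixed model analysis takes into account.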


Transitory up-regulated profiles: GO terms significantly enriched in these clusters deal essentially with transcription regulation (Table 3; Figure 3, clusters 1 and 2). The global co-regulators smarcc1, nap1 and piaspg are up-regulated. smarcc1 may act either as a co-activator or as a co-repressor. nap1 regulates cell proliferation, and numerous signaling pathways recruit piaspg for transcriptional read-out. We also identify smad10, a regulator of TGFβ signaling, and junb, a transcriptional regulator expressed in response to growth factors, as well as homeobox proteins (hoxc8, pprx2 and six3). Two zinc finger proteins potentially regulating transcription are also found in these clusters (zn206 and zo8). Response to stress is also identified as a significant process in the transitory up-regulated clusters (Table 3). The genes involved act at the level of protein degradation, cytoskeleton stability, protein metabolism and transcription. The matrix metalloprotease mmp21 and prspsn are proteins involved in cellular catabolism; eplin and stmn2 are involved in the regulation of cytoskeleton stability; grp58 and ec5218 act on protein metabolism; and junb, stat3 and six3 are transcriptional regulators associated with the stress response.

Constantly up-regulated profiles: Several genes encoding regulatory functions were found in cluster 3 (Figure 3). frz1 is a receptor for Wnt ligands. xr11 is an anti-apoptotic molecule of the Bcl superfamily. ngn1 is a transcriptional regulator of the bHLH superfamily playing a role in the specification of sensory neurons; its function in the liver has never been reported.

Transitory down-regulated cluster: Transcription factor activity is the prominent term enriched in this cluster (Figure 3, cluster 4; Table 3). The gene nfib1 is found in this cluster. It encodes a transcriptional activator involved in proliferation and identified as up-regulated by TH in the intestine during metamorphosis.
Nuclear receptors (nr2f2 and nr6a1), the homeobox protein hoxb7 and forkhead box proteins (foxf1a, fd4) are similarly expressed in this cluster, together with tcf1c (a transcriptional modulator of the Wnt signaling pathway), pparb (responsible for β-oxidation of fatty acids), tcea1 and tcea2 (transcription elongation factors) and zinc finger proteins (znf85, zo6, zn11b).

Constantly down-regulated profiles: GO analysis of the constantly down-regulated clusters highlights components of the ribosome, chromatin modification and nucleic acid binding as significantly enriched GO terms (Figure 3, clusters 5 and 6). Numerous histones are identified as down-regulated (h1h1t, h2h4, h1h2ac, h1h3c, h4h4), suggesting a modification of the histone isoforms. As already observed in the CNS, several ribosomal proteins are down-regulated in the liver during metamorphosis (rpl8, l5a, rl15, rs23, rs11, rpl21, sui1).

Quantitative PCR validation

QRT-PCR experiments were performed to independently assess the results of our microarray hybridizations. Since ef1α was not identified as differentially expressed in our experiments and has previously been used as a reference for the quantification of TH receptors by RT-PCR [33], it appears to be a good reference for ∆∆Ct experiments. Two sets of genes were tested by QRT-PCR. The first is composed of up- or down-regulated genes identified in the three organs (aldo2, frz1, retdh1, stmn3 and wif1; Figure 4A). The second is made of genes regulated during metamorphosis (nr1a1a, nr1a2b1 and thbzip; Figure 4B) and only partially identified by our microarray studies. Expression levels were tested for two metamorphic stages. For 11 out of 15 gene-organ combinations, we observed a similar expression between microarray and QRT-PCR (Figure 4A). In four cases the results are discordant (stmn3 in tail; aldo2, stmn3 and wif1 in liver). It should be noted that in six cases the ratios observed were less than 1 (i.e.


less than a two-fold change) and that we selected a subset of genes for which the differential expression was not the strongest but was observed in the three organs. We conclude that our microarray data are for the most part confirmed by QRT-PCR experiments, especially for the tail and CNS experiments. Our liver and tail microarray data do not provide evidence for significant changes in the expression of TRβ (nr1a2b1) and thbzip. However, QRT-PCR results revealed their regulation during the period studied in these two tissues (Figure 4B). We confirm the regulation of nr1a2b1 and thbzip observed in the CNS between NF stages 55 and 66 (Figure 4B). Similarly, nr1a1 (TRα) was found differentially expressed in our tail microarray data but not in liver and CNS. QRT-PCR experiments showed that the level of this transcript increases in the liver (2.4-fold change).

As reported in [7, 10, 34, 35], ribosomal protein L8 is often chosen as a reference for RT-PCR or Northern blot experiments in metamorphosis studies. We observed a down-regulated profile for this gene in our experiments on CNS and liver (cluster 6 in Figure 2 and cluster 6 in Figure 3, respectively). To extend these observations, we quantified the abundance of this transcript in the CNS at four different metamorphic stages. A slight down-regulation of this transcript was detected by QRT-PCR (supplementary figure 1), corresponding to a quantity of mRNA varying from 1.32- to 2.97-fold, thus confirming our microarray results. We conclude that the rpl8 transcript level does decrease in the CNS during metamorphosis. This finding is in agreement with the observed down-regulation of ribosomal proteins in liver and CNS samples. We strongly suggest not using rpl8 as a reference when conducting expression surveys during amphibian metamorphosis.
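For readers unfamiliar with the ∆∆Ct calculation behind these fold changes, a minimal sketch follows. It assumes a perfect amplification efficiency (one doubling per cycle) and takes the earlier stage as calibrator; the function name and the Ct values are hypothetical and are not measurements from this study.

```python
from statistics import mean
from typing import Sequence


def fold_change_ddct(
    target_ct_sample: Sequence[float],
    reference_ct_sample: Sequence[float],
    target_ct_calibrator: Sequence[float],
    reference_ct_calibrator: Sequence[float],
) -> float:
    """Relative expression by the comparative Ct method, 2 ** (-ddCt).

    Each argument holds the replicate Ct values (e.g. a triplicate) of the
    target gene or of the reference gene (e.g. ef1a) in one condition.
    """
    # Normalise the target gene to the reference gene in each condition.
    delta_ct_sample = mean(target_ct_sample) - mean(reference_ct_sample)
    delta_ct_calibrator = mean(target_ct_calibrator) - mean(reference_ct_calibrator)
    # Compare the sample to the calibrator condition.
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2 ** (-delta_delta_ct)


# Hypothetical triplicates: the target crosses the threshold one cycle
# earlier at the later stage while the reference gene does not move.
fold = fold_change_ddct(
    target_ct_sample=[24.0, 24.0, 24.0],
    reference_ct_sample=[18.0, 18.0, 18.0],
    target_ct_calibrator=[25.0, 25.0, 25.0],
    reference_ct_calibrator=[18.0, 18.0, 18.0],
)
print(fold)
```

The calculation also shows why a reference gene that itself changes during metamorphosis, as observed here for rpl8, would bias every fold change computed against it.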
Comparisons between the tail, CNS and liver gene expression profiles

To gain insight into the different fates elicited by TH during metamorphosis, we compared the repertoires of differential gene expression between the three organs sampled (Figure 5). Globally, 797 genes are found differentially expressed in at least one of the three organs (Figure 5A). The proportions of genes found differentially expressed exclusively in one organ are 41% for the liver, 72% for the CNS and 61% for the tail (Figure 5A). Similar proportions are found when only transcription factors are taken into account (liver: 47%; CNS: 73%; tail: 61%; Figure 5D). This indicates that the repertoire of differential expression of each organ is mainly non-overlapping with those of the others, and is due to the activity of a specific set of transcription regulators in each tissue. Nearly two-thirds of the genes that are found differentially expressed are up-regulated (513 genes, Figure 5B). Only 2% of these 513 genes are common to the three organs. This set of 10 genes commonly up-regulated by thyroid hormone could represent a part of the genes promoting the onset of metamorphosis in all organs. Tail and CNS express a common set of 44 genes, whereas CNS and liver express only 5 genes in common. This underscores the similarity of the tail and CNS transcriptional programmes, and their difference from that of the liver. Looking at down-regulated genes (Figure 5C) shows that no genes are shared in this category between the three tissues. Common sets of genes shared by two organs are small. However, 12 genes are co-regulated between CNS and liver, and these transcripts essentially encode ribosomal proteins. We compared the common sets of transcription regulators expressed in these three tissues (Figure 5D). Each organ expresses relatively specific sets of transcription factors (17 TFs in liver, 67 in CNS and 48 in tail). Only 9 transcription factors are expressed in tail, CNS and liver.
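The organ-exclusive proportions and shared sets quoted in this comparison are plain set operations on the lists of differentially expressed genes. The sketch below shows one way to compute them; the gene identifiers are placeholders, not genes from this study, and the function names are hypothetical.

```python
from typing import Dict, Set


def exclusive_fraction(organ: str, gene_sets: Dict[str, Set[str]]) -> float:
    """Fraction of an organ's differentially expressed genes found in no other organ."""
    others: Set[str] = set()
    for name, genes in gene_sets.items():
        if name != organ:
            others |= genes
    own = gene_sets[organ]
    return len(own - others) / len(own)


def shared_by_all(gene_sets: Dict[str, Set[str]]) -> Set[str]:
    """Genes differentially expressed in every organ."""
    return set.intersection(*gene_sets.values())


# Placeholder identifiers standing in for differentially expressed genes.
toy_sets = {
    "liver": {"g1", "g2", "g3", "g9"},
    "CNS": {"g3", "g5", "g9"},
    "tail": {"g5", "g7", "g8", "g9"},
}

for organ_name in toy_sets:
    print(organ_name, exclusive_fraction(organ_name, toy_sets))
print("shared by all organs:", shared_by_all(toy_sets))
```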
Only 2 common specific genes are found between CNS and liver,


14 are common to CNS and tail. Our interpretation is that the transcriptional control of metamorphosis is very different in each organ subjected to the influence of thyroid hormone. This raises once again the question of how a single molecule acting on gene expression can elicit such different gene expression programs. Some genes were found to be up-regulated in some organs and down-regulated in others. The result of comparing these gene expression profiles is shown in Table 5. For example, Zo6, a potential transcriptional regulator, is expressed in the three organs. This gene is strongly up-regulated in the tail but strongly repressed in the CNS. The same holds for Tcp4 (a transcriptional co-activator) and Hsp70 (a chaperone). Such results show how different metamorphic programs can be between organs. The fact that a gene is up-regulated in one tissue but down-regulated in others, especially in the case of transcriptional regulators, illustrates how thyroid hormone can govern different gene regulations in different organs. Such genes, showing different expression programs in different tissues, could enhance some biological processes and thus facilitate the accomplishment of the metamorphic program in a given organ.

Promoter Studies

We next sought to identify potential thyroid hormone response sites in the promoters of differentially expressed genes. Thyroid hormone regulates transcription via thyroid hormone receptors. These receptors heterodimerise with RXR receptors and recognize Thyroid hormone Responsive Elements (TREs) on DNA. TREs are sequences composed of two direct hexamer repeats (preferentially AGGTCA) spaced by 4 nucleotides (Direct Repeat 4, DR4 motif). TREs have been identified at various positions around the start site of genes directly regulated by thyroid hormone, for example upstream of the THbZIP coding sequence [36] or in the first intron of the Stromelysin 3 gene [37].
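The DR4 architecture described above (two direct hexamer repeats separated by a 4-nucleotide spacer) can be made concrete with a naive scanner. This is a sketch only: it counts mismatches against a fixed half-site on both strands and is not the Hidden Markov Model approach of NHRscan used in this study. The half-site is a parameter; the default AGGTCA (the commonly cited nuclear receptor consensus), the mismatch tolerance and the function names are assumptions of this example.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (N stays N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_dr4(
    seq: str, half_site: str = "AGGTCA", spacer: int = 4, max_mismatches: int = 1
) -> List[Tuple[int, str, str, int]]:
    """Return (forward-strand start, strand, matched window, mismatches) per hit."""
    k = len(half_site)
    width = 2 * k + spacer
    forward = seq.upper()
    hits = []
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            repeats = window[:k] + window[k + spacer:]   # the two half-sites
            mismatches = sum(a != b for a, b in zip(repeats, half_site * 2))
            if mismatches <= max_mismatches:
                # Convert reverse-strand offsets back to forward coordinates.
                start = i if strand == "+" else len(s) - i - width
                hits.append((start, strand, window, mismatches))
    return hits


# Toy sequence containing one perfect DR4; masked bases (N) never match.
toy_promoter = "TTT" + "AGGTCA" + "CTGA" + "AGGTCA" + "GGG"
dr4_hits = find_dr4(toy_promoter)
print(dr4_hits)
```

Because repeat-masked positions are written as N and N never matches a half-site base, masked stretches cannot produce hits in this sketch.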
We took advantage of the genomic sequence of X. tropicalis to study the promoter regions of genes identified as differentially expressed during metamorphosis. Although promoter characterization remains difficult in X. tropicalis due to the lack of proper genome annotation, upstream regions of predicted transcripts can be retrieved using tools like ENSEMBL Biomart [38] or TOUCAN [39]. We extracted 10 kb upstream and 5 kb downstream of the start codon for each Xenopuce transcript showing a significant alignment to X. tropicalis predicted transcripts. These 15 kb were masked for repeated sequences with Censor 4.1 and then analyzed with NHRscan, a program that identifies binding sites of nuclear hormone receptors using a Hidden Markov Model [40]. Comparing the DR4 sites identified in the promoter regions of differentially expressed genes in each organ with those of a random set of 200 sequences revealed no difference in distribution (Figure 6A). This suggests that other factors could be required for the proper regulation of thyroid hormone response genes. As shown for X. laevis TRβ and THbZIP, several potential TRE sites are identified upstream of their cognate coding sequences, but only one or two are known to be functional [36, 41]. We analyzed the distribution of the number of DR4 sites among promoter sequences. The DR4 distribution by cluster differs from the one observed globally. Some clusters have a larger proportion (more than 15%) of sequences with 3 or more sites (clusters 1, 4, 6, 7 and H in CNS data, clusters 2 and 3 in liver data, and clusters 1, 2, 5, 7 and 8 in tail data; Figure 6B). The distribution of motifs was statistically assessed by comparing DR4 distributions between these clusters and the others. No difference is seen in the proportion of sequences showing fewer than 3 DR4 sites. Nonetheless, the statistical difference is robust between clusters showing more than 15% of sequences with 3 or more DR4 sites and the others. Interestingly, these clusters contain identified direct


response genes of thyroid hormone (NR1A2B1 in CNS cluster 1, MMP2 in tail cluster 1). These clusters could contain direct target genes of thyroid hormone receptors.

Discussion

1. General features of the expression data

Here we have characterized the temporal expression profiles of 802 transcripts found significantly regulated in the central nervous system, tail or liver during metamorphosis in X. tropicalis. We showed that most transcriptional regulations during metamorphosis are due to a combination of ubiquitous and organ-specific factors. The global extent of the transcriptional changes triggered by thyroid hormone during metamorphosis was shown using the biological processes defined in the Gene Ontology thesaurus. Indeed, the functional annotation of the genes belonging to the same cluster enabled us to find statistically significant overrepresentation of processes such as cell cycle, apoptosis, neurogenesis and metabolism. Similarly, the regulation by TH of specific signalling pathways (Wnt, Notch) was observed. A recurrent theme is the modulation of transcription, either activation or inhibition, using either co-regulators or specific factors. Transcriptional regulation of hox genes was observed and might play an important role in the respecification of cell identity during metamorphosis. We conclude that metamorphosis can be conceived, in terms of regulation, essentially as a process of transcriptional reprogramming. From our in silico promoter characterization, we observed that one-third of the genes identified are presumed to lack canonical DR4 elements in their regulatory regions and might be entirely controlled by factors other than TH receptors. We cannot exclude that thyroid hormone response elements (TREs) exist in the promoters of these genes, but either they are located further away from the transcriptional unit or they are non-canonical DR4 elements. For the remaining two-thirds of the genes, at least one DR4 was identified in silico.
These transcription units are putative direct targets of THR, and further experiments are required to validate these predictions.

2. TF composing the metamorphic pathway

Validation of Shi’s hypothesis

An important question regarding metamorphosis is how thyroid hormone drives different cellular processes such as apoptosis, proliferation and differentiation. YB Shi proposed three different models for the transcriptional control of metamorphosis (Figure 7, [12]). In the first, a common set of transcription factors is required to achieve metamorphosis through the regulation of different target genes in the different tissues. The second model proposes that tissue-specific sets of transcription factors allow the achievement of metamorphosis. The third model is a combination of the two previous ones. Our data are suggestive of a composite model (i.e. ubiquitous factors and specific factors). Indeed, we found two populations of transcription factors differentially expressed during metamorphosis. The first is a minimal set which is common to the three organs and could represent a part of a shared transcriptional program. This observation is consistent with what has already been observed. Only four transcription regulators common to at least three organs (brain, tail and intestine) have been identified so far during amphibian metamorphosis. These are TRβ, bZIP, BTEB and THbZIP [5, 7, 8]. These genes are direct TH-response genes and thus form the core of the shared transcriptional regulation program during metamorphosis. Our data implicate new common transcription factors, but functional studies proving a direct T3 regulation of these factors are still needed.


Another notable observation is that the number of shared transcription factors is larger between tail and central nervous system than between liver and the other tissues. This underlines the fact that the transformations occurring in the liver are very different from those occurring in the tail or central nervous system, and hence that the regulation of gene expression differs greatly between organs undergoing very different metamorphic transformations.

3. Homeobox genes expression and thyroid hormone signaling

Numerous Hox and homeobox genes have been identified as differentially expressed during metamorphosis in liver (HoxB7, HoxC8, Six3 and Prrx2), tail (Anf1, Twn, Dlx1, Shox2, Mxr, Optx2, Msx2, Gsc, Lhx2, Tlx1, HoxB7, Lhx1, Six2, Brn3A, Vax1) and CNS (Cad2, Mxr, Otx5B, Nkx2.1, HoxB7, Gsc, Brn3A, Hhex1, HoxC9, HoxC4, Pbx2, HoxC6, Meis1, Twn and Cad1). It is now well established that hox genes remain active during adult life in cell populations [42, 43]. HoxA9, 10 and 11 have been implicated in these processes, giving positional identity to the cells where these genes are expressed. Regulation of hox genes during this period is mainly due to the endocrine system. Retinoic acid, vitamin D, estrogen and progesterone have been shown to directly regulate hox gene transcription [43]. Moreover, response elements to these hormones have been identified upstream of various Hox transcription units [44-47]. As in adult tissues showing cyclic plasticity during life, hox genes could play a predominant role during metamorphosis by redefining cell position in tissues subjected to changes, especially in the CNS. Until now, the regulation of hox genes during metamorphosis has been an unexplored theme. But our data, together with other lines of evidence, suggest a role for hox genes during metamorphosis. First, nuclear receptors, and especially retinoic acid receptors, are well known to regulate hox gene transcription. DNA binding elements of these receptors are structurally similar, i.e.
direct repeats separated by several nucleotides. DR4 elements are characterized as binding sites of TR/RXR heterodimers. Such protein complexes could recognize other binding sites of the same family, such as DR0 or DR1 [48]. Therefore genes known to be regulated by retinoic acid or other nuclear receptors, such as the hox genes, could also be regulated by thyroid hormone. Moreover, the overexpression of thyroid hormone receptor α1 disrupts the retinoic acid signaling necessary for proper Hox gene expression, and a direct competition between TR/RXR and RAR/RXR on genomic DNA has been shown to explain this observation [49]. Altogether, the hypothesis of a direct regulation of hox genes by thyroid hormone during metamorphosis is plausible and remains to be tested experimentally.

4. Comparison with other microarray data on Xenopus metamorphosis

A recent study [11] reported gene expression profiling during induced metamorphosis in X. laevis. In this study, pre-metamorphic tadpoles were treated with thyroid hormone and gene expression in the brain, tail and hind limbs was monitored using microarrays. The gene expression profiles characterized in our study and in that of Das et al. were compared. Identification of orthologous genes between X. laevis and X. tropicalis was only possible for 60 to 70% of our genes identified as differentially expressed, depending on the experiment (CNS, liver or tail, see Table 5). Our results were then separated into up- or down-regulated categories and compared to the results for each organ of the Das et al. study, finally resulting in nine different comparisons. A small proportion of genes are common to both experiments. For example, only 52% of the genes identified as differentially expressed in our study on CNS are found in the Das brain data. Of these, 37% share a similar pattern of expression. Similarly, of the 35% of genes identified as differentially


expressed in the tail in our study and found in the Das tail data, 69% show a similar pattern between the two datasets. The remaining genes show an opposite regulation behavior. Different reasons can explain such differences. First, the studies were not conducted on exactly the same organs (i.e. brain versus complete CNS). Second, even if TRβ and THbzip have been shown to be regulated similarly in physiological metamorphosis and under T3 treatment, there is no evidence that T3 treatment can mimic the physiological roles of TH during metamorphosis. Moreover, premetamorphic tadpoles treated with T3 do not undergo a complete metamorphosis. Thus, the differences observed between the Das dataset and ours are logical consequences of the combined differences in experimental set-up and technique. However, the complementarity of these experiments is useful to determine which genes are most likely under the influence of thyroid hormone during Xenopus metamorphosis.

Promoter Characterization

Promoter analysis remains difficult despite huge efforts of the community to annotate this kind of sequence. Many tools are now available, using different methods to characterize TF binding sites or even to extract regulation models from the upstream sequences of transcripts [50]. We took advantage of the X. tropicalis genome to try to identify regulators of gene expression during metamorphosis. Here, we focused on TREs by analyzing the DR4 distribution in upstream regions of genes identified as differentially expressed. Results showed that DR4 motifs do not seem to be particularly over-represented in our subset of sequences compared to a random set of sequences extracted from X. tropicalis ENSEMBL. A closer look at the results shows that the DR4 distribution is not homogeneous among clusters: some of them contain more sequences with at least 3 DR4 motifs and could correspond to early response genes. Genes known to be directly regulated by thyroid hormone signaling belong to these clusters.
But experimental characterization of these sequences needs to be conducted to validate DR4 functionality. The distribution of TFBS in promoter sequences should also be analyzed. This task promises to be tricky, since little is known about TFBS in the Xenopus genus and phylogenetic footprinting is not easily applicable, as no genome of a closely related species has been sequenced. An interesting approach could be to analyse ECR (Extremely Conserved Regions), but how relevant these data are to transcription regulation is not yet determined. Another interesting tool still in development is an X. tropicalis genomic array. For the moment, only 3000 genes are planned to be represented on this array, but it could be an extremely useful tool for TFBS studies, especially during metamorphosis, where a lot of work has already been done on chromatin immunoprecipitation.

Materials and methods

1. Microarray design and genes annotation

Xenopuce is a project gathering the efforts of four laboratories interested in different aspects of Xenopus development. The aim of this project was to build a long-oligonucleotide microarray representing 3000 genes of interest to each consortium member. We initially drew up a list of biological processes characteristic of early development and metamorphosis, corresponding to the research topics of each team. Gene products playing a role in these processes were compiled using different methods, from manual listing to global selection of genes belonging to a given Gene


Ontology term. A vertebrate accession number is associated with each item, allowing the identification of X. tropicalis orthologs. The reciprocal best BLAST hit method was used to identify orthologous genes using the available X. tropicalis sequence data, especially EST sequences produced from the head and retinas of young tadpoles and the nervous system of metamorphic tadpoles (xtscope project, Thuret et al 2007). Orthologous transcript identification was possible for 1639 genes. A specific search for genes encoding DNA binding domains was made in EST databases; 563 potential transcriptional regulators were identified in this way. Finally, we included potential full-length cDNA sequences encoding genes of unselected biochemical function or biological process. These cDNAs came either from the XGC project (302 sequences) or from our own cDNA sequencing efforts (404 sequences). This set of 2908 sequences was transmitted to MWG Biotech for oligonucleotide design and spotting on glass slides. One or two 50-nt-long oligonucleotides were selected to represent each cDNA sequence. Annotation of the microarray transcripts has three levels. First, a meta-category and a category were associated with each gene while the list was being compiled. This annotation is biologically oriented, since each list designer chose genes identified as playing a role in his or her research topic. Second, each Xenopuce sequence was compared by BLASTP against the SWISSPROT/UNIPROT database. Each transcript is thus associated with a SWISSPROT/UNIPROT entry including the protein definition and function. Finally, systematic GO annotation of Xenopuce transcripts was carried out using Gotcha [51]. This software assigns a GO annotation to a nucleotide or protein sequence by BLAST homology. The accuracy of the GO annotation depends on the similarity threshold accepted; in our case, this threshold was set to 40%. Information on the composition and annotation of the Xenopuce microarray is available on our website [52].

2. Animal staging and RNA preparation for microarray experiments

Xenopus tropicalis tadpoles were raised to metamorphosis and then staged according to the Nieuwkoop and Faber developmental table [53]. Ten tadpoles of each NF stage were dissected, and isolated organs were kept in RNAlater at –20°C until RNA preparation. RNAs were then extracted using Trizol™ (Invitrogen™) coupled with Phase Lock Gel (Eppendorf™). After DNase digestion (Ambion Turbo DNase™), RNAs were purified with the Megaclear™ kit (Ambion) and their quality was assessed using an Agilent 2100 Bioanalyzer with the RNA 6000 Nano LabChip kit.

3. Probes preparation and microarray hybridization

10 µg of total RNA were used to prepare probes with the Invitrogen SuperScript™ Indirect cDNA labeling system (using polyA and random hexamer primers) according to the manufacturer’s protocol. These probes were then coupled to Amersham Cy3 or Cy5 monofunctional reactive dye following the Invitrogen protocol. Probe quality was then assessed on a 1% agarose minigel on a glass slide scanned in a Genetac LS IV scanner, and probes were quantified with a Nanodrop ND-1000 spectrophotometer. Dye quantities were then equilibrated for hybridization by quantity of fluorescence per ng of cDNA. Probes were then dried by speedvac, re-suspended in 35 µL of MWG™ hybridization buffer and placed between slide and coverslip on pre-saturated arrays according to the manufacturer’s protocol (QMT ref). Hybridizations were conducted in hybridization chambers (ref) at 45 °C for 20 h. Slides were then washed once in 2X SSC 0.1% SDS for


5 min and twice in 1X SSC and 0.5X SSC for 5 min. Hybridized arrays were then dried by centrifugation at 500 rpm in slide baskets and kept in the dark until scanning.

4. Data retrieval, normalization and analysis

Microarrays were scanned using an Axon scanner. Gpr files were created with Genepix 3.0, and data normalization was conducted under R/Bioconductor [54, 55] using LIMMA [56]. Flagged spots were excluded before print-tip loess normalization was applied to the raw signals. Logs of absolute intensities were extracted to identify differentially expressed genes with the MAANOVA package [57], using array replicates as separate datasets. Mixed model analysis was conducted taking into account the following parameters: Dye, Array and Sample. Transcripts showing a differential expression for the three statistical tests were then retained, and the log2 absolute expression data for each sample were retrieved. Expression profiles were finally reconstructed by subtracting the initial stage data from the data of the other stages (NF stage 55 for central nervous system and liver, NF stage 58 for tail). Hierarchical and K-means clustering were carried out using Cluster [58] ([59] for the Mac OS X adaptation) and visualized with TreeView [60]. GO term enrichment evaluation was performed with STEM [61].

5. Quantitative RT-PCR

Primer picking was performed with Primer Express. PCR fragments were chosen around introns to avoid genomic DNA amplification. Table xx lists the primer sequences of the genes tested by QRT-PCR and their location on the X. tropicalis genome version 4.1. Independent RNA isolations were prepared using the same protocol as for microarrays. Five tadpoles of each tested stage (NF 55, 57, 62 and 64) were dissected. RNA quality was assessed on an Agilent 2100 Bioanalyzer with the RNA 6000 Nano LabChip kit. Reverse transcription was conducted with the Invitrogen Superscript III kit following the manufacturer’s protocol.
Quantitative RT-PCR experiments were then performed on an ABI Prism 7900 HT using SYBR Green PCR Master Mix according to the manufacturer’s protocol. The efficiency of each primer pair was assessed by absolute quantification experiments on 5 orders of dilution (from 100 ng to 0.01 ng). Relative quantification experiments using the ∆∆Ct method were then performed using EF1α as the reference gene. Standard PCR conditions (40 cycles, 15 s at 94°C, 1 min at 60°C, primers) were used. Amplification of each gene in each sample was carried out in triplicate.

6. Promoter studies

Microarray oligonucleotide sequences were aligned by BLAST to X. tropicalis ENSEMBL predicted transcripts to identify translation initiation sites. ENSEMBL transcript IDs were then used in TOUCAN [39] to retrieve 10 kb upstream and 5 kb downstream of the start codon. Genomic sequences were masked for repeated sequences with Censor 4.1 and then analyzed with NHRscan [40] to identify potential TRE sites (DR4). These sites were then validated with fuzznuc, since NHRscan eliminates stretches of Ns. After verification, DR4 sites were counted and their positions kept for further PCR primer design.

List of abbreviations

CNS: Central Nervous System; DR4: Direct Repeat 4; EST: Expressed Sequence Tag; GO: Gene Ontology; NF: Nieuwkoop and Faber; RXR: Retinoid-X-Receptor; TF: Transcription Factor; TH: Thyroid Hormone; THR: Thyroid Hormone Receptor; TRE: Thyroid hormone Responsive Element.


Acknowledgements

We thank L. Du Pasquier for the gift of X.tropicalis animals and his continuous support. This research was funded by grants from le Centre National de la Recherche Scientifique, le Ministère de l’Education, de la Recherche et de la Technologie (French Xenopus Stock Center), the University of Paris Sud and the European Community FP6 (X-omics coordinated action No. 512065). We thank the Department of Energy’s Joint Genome Institute for the availability and the use of X. tropicalis genomic sequences. We thank Christophe de Medeiros for taking care of the animals. We acknowledge the technical support from the Plateforme d’Instrumentation et de Compétences en Transcriptomique of INRA Jouy-en-Josas and especially Sophie Pollet and Emmanuelle Zalachas for their assistance as well as the GODMAP microarray facility of CNRS Gif-sur-Yvette. We acknowledge the assistance of David Du Pasquier, Laurent Coen, Catherine Jessus, Yann Audic, Muriel Perron, De-Li Shi for their contribution in the gene selection process for the Xénopuces project. Thanks to André Mazabraud and Maurice Wegnez for general support.

Bibliography

1. Sap J, Munoz A, Damm K, Goldberg Y, Ghysdael J, Leutz A, Beug H, Vennstrom B: The c-erb-A protein is a high-affinity receptor for thyroid hormone. Nature 1986, 324(6098):635-640.

2. Tsai MJ, O'Malley BW: Molecular mechanisms of action of steroid/thyroid receptor superfamily members. Annu Rev Biochem 1994, 63:451-486.

3. Umesono K, Evans RM: Determinants of target gene specificity for steroid/thyroid hormone receptors. Cell 1989, 57(7):1139-1146.

4. Wang Z, Brown DD: A gene expression screen. Proc Natl Acad Sci U S A 1991, 88(24):11505-11509.

5. Wang Z, Brown DD: Thyroid hormone-induced gene expression program for amphibian tail resorption. J Biol Chem 1993, 268(22):16270-16278.

6. Brown DD, Wang Z, Furlow JD, Kanamori A, Schwartzman RA, Remo BF, Pinder A: The thyroid hormone-induced tail resorption program during Xenopus laevis metamorphosis. Proc Natl Acad Sci U S A 1996, 93(5):1924-1929.

7. Denver RJ, Pavgi S, Shi YB: Thyroid hormone-dependent gene expression program for Xenopus neural development. J Biol Chem 1997, 272(13):8179-8188.

8. Shi YB, Brown DD: The earliest changes in gene expression in tadpole intestine induced by thyroid hormone. J Biol Chem 1993, 268(27):20312-20317.

9. Helbing CC, Werry K, Crump D, Domanski D, Veldhoen N, Bailey CM: Expression profiles of novel thyroid hormone-responsive genes and proteins in the tail of Xenopus laevis tadpoles undergoing precocious metamorphosis. Mol Endocrinol 2003, 17(7):1395-1409.

10. Veldhoen N, Crump D, Werry K, Helbing CC: Distinctive gene profiles occur at key points during natural metamorphosis in the Xenopus laevis tadpole tail. Dev Dyn 2002, 225(4):457-468.

11. Das B, Cai L, Carter MG, Piao YL, Sharov AA, Ko MS, Brown DD: Gene expression changes at metamorphosis induced by thyroid hormone in Xenopus laevis tadpoles. Dev Biol 2006, 291(2):342-355.

12. Shi YB: Amphibian Metamorphosis. New York: Wiley-Liss; 2000.

13. Beck CW, Slack JM: An amphibian with ambition: a new role for Xenopus in the 21st century. Genome Biol 2001, 2(10):REVIEWS1029.


14. Carruthers S, Stemple DL: Genetic and genomic prospects for Xenopus tropicalis research. Semin Cell Dev Biol 2006, 17(1):146-153.

15. Lyman DF, White BA: Molecular cloning of hepatic mRNAs in Rana catesbeiana responsive to thyroid hormone during induced and spontaneous metamorphosis. J Biol Chem 1987, 262(11):5233-5237.

16. Wit E, McClure J: Statistical adjustment of signal censoring in gene expression experiments. Bioinformatics 2003, 19(9):1055-1060.

17. Kerr MK, Churchill GA: Experimental design for gene expression microarrays. Biostatistics 2001, 2(2):183-201.

18. Berry DL, Schwartzman RA, Brown DD: The expression pattern of thyroid hormone response genes in the tadpole tail identifies multiple resorption programs. Dev Biol 1998, 203(1):12-23.

19. Krain LP, Denver RJ: Developmental expression and hormonal regulation of glucocorticoid and thyroid hormone receptors during metamorphosis in Xenopus laevis. J Endocrinol 2004, 181(1):91-104.

20. Izutsu Y, Tochinai S, Maeno M, Iwabuchi K, Onoe K: Larval antigen molecules recognized by adult immune cells of inbred Xenopus laevis: partial characterization and implication in metamorphosis. Dev Growth Differ 2002, 44(6):477-488.

21. Watanabe M, Ohshima M, Morohashi M, Maeno M, Izutsu Y: Ontogenic emergence and localization of larval skin antigen molecule recognized by adult T cells of Xenopus laevis: Regulation by thyroid hormone during metamorphosis. Dev Growth Differ 2003, 45(1):77-84.

22. Nakajima K, Yaoita Y: Dual mechanisms governing muscle cell death in tadpole tail during amphibian metamorphosis. Dev Dyn 2003, 227(2):246-255.

23. Dodd MHI, Dodd JM: The Biology of Metamorphosis. In: Physiology of the Amphibia. Edited by Lofts B. New York: Academic Press; 1976: 467-599.

24. Kollros JJ: Transitions in the nervous system during amphibian metamorphosis. In: Metamorphosis: a problem of developmental biology, 2nd edition. Edited by Gilbert LI, Frieden E. New York: Plenum Press; 1981: 445-459.

25. Fox H: Amphibian Morphogenesis. Clifton, N.J.: Humana Press; 1983.

26. Gona AG, Hauser KF, Uray NJ: Ultrastructural studies on Purkinje cell maturation in the cerebellum of the frog tadpole during spontaneous and thyroxine-induced metamorphosis. Brain Behav Evol 1982, 20(3-4):156-171.

27. Tata JR: Gene expression during metamorphosis: an ideal model for post-embryonic development. Bioessays 1993, 15(4):239-248.

28. Denver RJ: The molecular basis of thyroid hormone-dependent central nervous system remodeling during amphibian metamorphosis. Comp Biochem Physiol C Pharmacol Toxicol Endocrinol 1998, 119(3):219-228.

29. Atkinson BG: Metamorphosis: Model systems for studying gene expression in postembryonic development. Dev Genet 1994, 15:313-319.

30. Chen Y, Hu H, Atkinson BG: Characterization and expression of C/EBP-like genes in the liver of Rana catesbeiana tadpoles during spontaneous and thyroid hormone-induced metamorphosis. Dev Genet 1994, 15(4):366-377.

31. Weber R: Biochemistry of amphibian metamorphosis. In: The biochemistry of animal development. Edited by Weber R, vol. 2. New York: Academic Press; 1967: 227-301.


32. Underhay EE, Baldwin W: Nitrogen excretion in tadpoles of Xenopus laevis Daudin. Biochem J 1955, 61:544-547.

33. Opitz R, Lutz I, Nguyen NH, Scanlan TS, Kloas W: Analysis of thyroid hormone receptor betaA mRNA expression in Xenopus laevis tadpoles as a means to detect agonism and antagonism of thyroid hormone action. Toxicol Appl Pharmacol 2006, 212(1):1-13.

34. Rowe I, Coen L, Le Blay K, Le Mevel S, Demeneix BA: Autonomous regulation of muscle fibre fate during metamorphosis in Xenopus tropicalis. Dev Dyn 2002, 224(4):381-390.

35. Kuiper GG, Klootwijk W, Morvan Dubois G, Destree O, Darras VM, Van der Geyten S, Demeneix B, Visser TJ: Characterization of recombinant Xenopus laevis type I iodothyronine deiodinase: substitution of a proline residue in the catalytic center by serine (Pro132Ser) restores sensitivity to 6-propyl-2-thiouracil. Endocrinology 2006, 147(7):3519-3529.

36. Furlow JD, Brown DD: In vitro and in vivo analysis of the regulation of a transcription factor gene by thyroid hormone during Xenopus laevis metamorphosis. Mol Endocrinol 1999, 13(12):2076-2089.

37. Fu L, Tomita A, Wang H, Buchholz DR, Shi YB: Transcriptional regulation of the Xenopus laevis Stromelysin-3 gene by thyroid hormone is mediated by a DNA element in the first intron. J Biol Chem 2006, 281(25):16870-16878.

38. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14(1):160-169.

39. Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, De Moor B: Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res 2003, 31(6):1753-1764.

40. Sandelin A, Wasserman WW: Prediction of nuclear hormone receptor response elements. Mol Endocrinol 2005, 19(3):595-606.

41. Machuca I, Esslemont G, Fairclough L, Tata JR: Analysis of structure and expression of the Xenopus thyroid hormone receptor-beta gene to explain its autoinduction. Mol Endocrinol 1995, 9(1):96-107.

42. Magli MC, Largman C, Lawrence HJ: Effects of HOX homeobox genes in blood cell differentiation. J Cell Physiol 1997, 173(2):168-177.

43. Taylor HS, Igarashi P, Olive DL, Arici A: Sex steroids mediate HOXA11 expression in the human peri-implantation endometrium. J Clin Endocrinol Metab 1999, 84(3):1129-1135.

44. Akbas GE, Song J, Taylor HS: A HOXA10 estrogen response element (ERE) is differentially regulated by 17 beta-estradiol and diethylstilbestrol (DES). J Mol Biol 2004, 340(5):1013-1023.

45. Marshall GM, Cheung B, Stacey KP, Norris MD, Haber M: Regulation of retinoic acid receptor alpha expression in human neuroblastoma cell lines and tumor tissue. Anticancer Res 1994, 14(2A):437-441.

46. Popperl H, Featherstone MS: Identification of a retinoic acid response element upstream of the murine Hox-4.2 gene. Mol Cell Biol 1993, 13(1):257-265.

47. Ogura T, Evans RM: Evidence for two distinct retinoic acid response pathways for HOXB1 gene regulation. Proc Natl Acad Sci U S A 1995, 92(2):392-396.

48. Shin DJ, Plateroti M, Samarut J, Osborne TF: Two uniquely arranged thyroid hormone response elements in the far upstream 5' flanking region confer direct thyroid hormone regulation to the murine cholesterol 7alpha hydroxylase gene. Nucleic Acids Res 2006, 34(14):3853-3861.

49. Essner JJ, Johnson RG, Hackett PB, Jr.: Overexpression of thyroid hormone receptor alpha 1 during zebrafish embryogenesis disrupts hindbrain patterning and implicates retinoic acid receptors in the control of hox gene expression. Differentiation 1999, 65(1):1-11.

50. GuhaThakurta D: Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res 2006, 34(12):3585-3598.

51. Martin DM, Berriman M, Barton GJ: GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 2004, 5:178.

52. Xenopuce: http://indigene.ibaic.u-psud.fr/xenopuce ; 2005.

53. Nieuwkoop PD, Faber J: Normal Table of Xenopus laevis (Daudin). A systematical and chronological survey of the development from fertilized egg till the end of metamorphosis. New York: Garland; 1994.

54. R package: statistics for microarray analysis: http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html.

55. Bioconductor: http://www.bioconductor.org.

56. Smyth GK, Speed T: Normalization of cDNA microarray data. Methods 2003, 31(4):265-273.

57. Wu H, Kerr M, Cui X, Churchill G: MAANOVA: a software package for the analysis of spotted cDNA microarray experiments. In.

58. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998, 95(25):14863-14868.

59. de Hoon MJ, Imoto S, Nolan J, Miyano S: Open source clustering software. Bioinformatics 2004, 20(9):1453-1454.

60. Saldanha AJ: Java Treeview--extensible visualization of microarray data. Bioinformatics 2004, 20(17):3246-3248.

61. Ernst J, Bar-Joseph Z: STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics 2006, 7:191.

Figure legends

Figure 1: Clustergrams of tail gene expression profiles during metamorphosis. A. Unsupervised K-means clustering results: K-means clustering was done with 10 classes. Clusters are sorted according to their expression profiles: constantly up-regulated (clusters 1, 2 and 3), transiently up-regulated (clusters 4 and 5), constantly down-regulated (clusters 7, 8, 9 and 10) and transiently down-regulated (cluster 6). The first time point was used as the reference. Histograms at the bottom of each cluster represent the mean of log2 expression values of genes belonging to the cluster at each stage studied. Gene symbols are annotated on the right of each cluster and color-coded as follows: purple: transcription factor involved in proliferation; grey: transcription factor involved in differentiation; turquoise: transcription factor involved in apoptosis; dark blue: genes involved in proliferation/differentiation; brown: gene involved in proliferation/apoptosis; red: transcription factor; green: genes involved in proliferation; light blue: genes involved in differentiation; ochre: genes involved in apoptosis; black: remaining genes. Gene descriptions are given in supplementary table 1. B. The correspondence between intensities (expressed as log2 ratio) and color, the metamorphic stages studied according to NF, and a scheme of the experimental design are represented.


Figure 2: Clustergram of central nervous system (CNS) gene expression profiles during metamorphosis. A. Unsupervised K-means clustering results: K-means clustering was done with 8 classes. Clusters are sorted according to their expression profiles: up-regulated, transiently up-regulated, down-regulated and transiently down-regulated. The last cluster, showing no particular profile, was clustered using a hierarchical method. The first time point was used as the reference. Histograms at the bottom of each cluster represent the mean of log2 expression values of genes belonging to the cluster at each stage studied. Gene symbols are annotated on the right of each cluster and color-coded as follows: orange: genes involved in proliferation/differentiation/apoptosis; pink: transcription factor involved in proliferation/apoptosis; purple: transcription factor involved in proliferation; grey: transcription factor involved in differentiation; turquoise: transcription factor involved in apoptosis; brown: gene involved in proliferation/apoptosis; blue green: genes involved in differentiation/apoptosis; red: transcription factor; green: genes involved in proliferation; ochre: genes involved in apoptosis; black: remaining genes. Gene descriptions are given in supplementary table 1. B. The correspondence between intensities (expressed as log2 ratio) and color, the metamorphic stages studied according to NF, and a scheme of the experimental design are represented.

Figure 3: Clustergrams of liver gene expression profiles during metamorphosis. A. Unsupervised K-means clustering results: K-means clustering was done with 6 classes. Clusters are sorted according to their expression profiles: up-regulated, transiently up-regulated, down-regulated and transiently down-regulated. The first time point was used as the reference. Histograms at the bottom of each cluster represent the mean of log2 expression values of genes belonging to the cluster at each stage studied. Gene symbols are annotated on the right of each cluster and color-coded as follows: pink: transcription factor involved in proliferation/apoptosis; purple: transcription factor involved in proliferation; grey: transcription factor involved in differentiation; turquoise: transcription factor involved in apoptosis; brown: gene involved in proliferation/apoptosis; red: transcription factor; green: genes involved in proliferation; ochre: genes involved in apoptosis; black: remaining genes. Gene descriptions are given in supplementary table 1. B. The correspondence between intensities (expressed as log2 ratio) and color, the metamorphic stages studied according to NF, and a scheme of the experimental design are represented.

Figure 4: Validation of microarray results by quantitative RT-PCR. For each organ, the log2 of microarray (MA) expression values (curves) and the QRT-PCR log2 of RQ (histograms) were plotted. The NF stages studied are indicated at the bottom of each graph. Asterisks (*) mark stages that were tested by QRT-PCR. Since stage 57 and stage 62 were not analyzed on microarrays in tail and liver, respectively, MA values for these time points were extrapolated as the mean of the neighboring values (dashed lines). A. Genes identified as regulated in every organ. B. Genes known to be regulated by thyroid hormone but identified as regulated in only one organ. C. Absolute quantification of the RPL8 gene in CNS. Mean values of the difference in Ct between NF stage 55 and the other stages are plotted for three different quantities of cDNA.

Figure 5: Venn diagram representations of the overlap between sets of differentially expressed genes in liver, tail and central nervous system during metamorphosis. In A, all genes were considered; in B, only up-regulated genes; in C, only down-regulated genes; and in D, only transcription factors.

Figure 6: Analysis of DR4 distribution in 15 kb of sequence around the start codon of differentially expressed genes. A. The global distribution of DR4 motifs in genes is represented. The Xenopuce random set is a set of 50 sequences corresponding to genes not identified as differentially expressed in the experiments performed. The random set of ENSEMBL transcripts comprises the genomic regions of a set of 200 ENSEMBL transcripts. B. Representation of the proportion of sequences containing 1, 2, 3 or more DR4 motifs per cluster.

Figure 7: Hypothesis of the transcriptional control of tissue-specific metamorphic programs (adapted from Shi 2000). In each case, TR/RXR heterodimers repress TH-controlled gene expression until TH is available in the cell. Once complexed with TR/RXR, TH enhances expression of ubiquitous (A) or tissue-specific (B) transcription factors, allowing expression of late tissue-specific response genes. A composite model of A and B is presented in C. In this third model, early-response ubiquitous and tissue-specific transcription factors control expression of late response genes.

Table 1: Tail-related GO terms identified with a maximum p-value of 0.05 for each class, sorted by biological process (P), molecular function (F) and cellular component (C).

Table 2: CNS-related GO terms identified with a maximum p-value of 0.05 for each class, sorted by biological process (P), molecular function (F) and cellular component (C).

Table 3: Liver-related GO terms identified with a maximum p-value of 0.05 for each class, sorted by biological process (P), molecular function (F) and cellular component (C).

Table 4: Expression profiles of genes differently regulated in two or more organs during metamorphosis. The organ-specific expression profile is shown for each gene.

Table 5: Comparison of microarray studies on natural and induced metamorphosis.

Table 6: Primer sequences for genes tested by QRT-PCR. The location on the X. tropicalis genome and the primer sequences are given for each gene used in QRT-PCR experiments.


Chapter 4: Evaluation of time profile reconstruction from complex two-color microarray designs

Material constraints on the microarray-based study of the transcriptome led us to choose an experimental strategy of the “interwoven design” type. Given the limits on time, material and money, this option was unavoidable, but it requires an analysis method to reconstruct the information of interest (expression profiles). The potential advantages of these alternative microarray strategies depend largely on the success of the profile reconstruction. It is therefore necessary to evaluate microarray analysis methods in order to determine which approach performs best. For this reason, we compared the extent to which different linear models are able to reconstruct similar expression profiles.


It has previously been shown that complex two-color microarray designs (e.g., loop design, interwoven design) offer advantages over the commonly used reference design: at the same cost, more balanced measurements in the number of replicates per condition can be obtained. More and more laboratories use these complex designs. What is often ignored, however, is that such designs require more complex analysis procedures to reconstruct the factor of interest (e.g. a gene profile across a time series) from the data. Reconstruction of the gene profile is a critical step in the analysis, as the practical usefulness of complex designs depends on how well analysis methods are able to retrieve this factor of interest from the data. In this study we performed an exhaustive comparison between different profile reconstruction methods (article submitted to BMC Bioinformatics).


Evaluation of time profile reconstruction from complex two-color microarray designs

Ana Fierro1,2,3, Raphael Thuret1,2, K. Engelen4, Gilles Bernot3, Kathleen Marchal4§, Nicolas Pollet1,2,3

1CNRS UMR 8080, Laboratoire Développement et Evolution, Bat 445, F-91405 Orsay, France

2Univ Paris Sud, F-91405 Orsay, France

3Programme d'Epigenomique – Genopole, Univ Evry, Tour Evry-2, Place des terrasses, 91000 Evry, France

4Dep Microbial and Molecular Sciences, K.U.Leuven, Kasteelpark Arenberg 20, 3000 Leuven, Belgium

§Corresponding authors



Abstract

Background

As an alternative to the frequently used “reference design” for two-channel microarrays, other designs have been proposed. These designs have been shown to be more profitable from a theoretical point of view (more replicates of the conditions of interest for the same number of arrays). However, the interpretation of the measurements is less straightforward and a reconstruction method is needed to convert the observed ratios into the genuine profile of interest (e.g. a time profile). The potential advantages of using these alternative designs thus largely depend on the success of the profile reconstruction. Therefore, we compared to what extent different linear models agree with each other in reconstructing expression ratios and corresponding time profiles from a complex design.

Results

On average, the correlation between the estimated ratios was high, and all methods agreed with each other in predicting the same profile, especially for genes whose expression profile showed a large variance across the different time points. When assessing the similarity in profile shape, it appears that the more similar the underlying principles of the methods (model and input data), the more similar their results. Methods with a dye effect seemed more robust against array failure. The influence of a different normalization was not drastic and was independent of the method used. The accuracy of the different methods in estimating the true profiles was assessed using a spike-in experiment: all of the tested methods reconstructed very similar profiles. They failed to approximate the true expression ratios only when ratios had to be estimated from low-intensity signals (corresponding to low spike-in concentrations).

Conclusions

Including a dye effect, as in the methods lmbr_dye, anovaFix and anovaMix, compensates for residual dye-related inconsistencies in the data and renders the results more robust against array failure. Including random effects only makes sense if a design is used with a sufficient number of replicates; otherwise it deteriorates the results. Because of this, we believe lmbr_dye, anovaFix and anovaMix are most appropriate for practical use.

Background

Microarray experiments have become an important tool for biological studies, allowing the quantification of thousands of mRNA levels simultaneously. They are now routinely applied in molecular biology practice.

In contrast to Affymetrix-based technology, in two-channel microarray assays mRNA extracted from two conditions is hybridised simultaneously on a given microarray. Which conditions to pair on the same array is a non-trivial issue and relates to the choice of the “microarray design”. The most intuitively interpretable and frequently used design is the “reference design”, in which a single, fixed reference condition is chosen against which all conditions are compared. Alternatively, other designs have been proposed (e.g. a loop design). From a theoretical point of view, these alternative designs usually offer, at the same cost, more balanced measurements in the number of replicates per condition than a common reference design. On theoretical grounds, they are thus potentially more profitable [1, 2]. For instance, a loop design would outperform the common reference design when searching for differentially expressed genes [3]. However, the drawback of such alternative designs is that the interpretation of the measurements becomes less straightforward. More complex analysis procedures are needed to reconstruct the factor of interest (genes being differentially expressed between two particular conditions, a time profile, etc.), so that the practical usefulness of a design depends mainly on how well analysis methods are able to retrieve this factor of interest from the data. Such analysis requires removing systematic biases from the raw data by the appropriate normalization steps and combining replicate values to reconstruct the factor of interest.

When focusing on profiling the changes in gene expression over time, the factor of interest is the time profile [1, 2]. For such a time series experiment, the “reference design”, where, for instance, time point zero is chosen as the common reference, has a straightforward interpretation: for each array, a gene’s mean ratio between replicates directly represents the change in expression of that gene relative to the first time point. However, when using an alternative design, such as an interwoven design, mean ratios represent the mutual comparison between distinct (sometimes consecutive) time points. A reconstruction procedure is needed to obtain the time profile from the observed ratios [3-5].
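To make the reconstruction step concrete, the following Python sketch estimates a time profile relative to the first time point by ordinary least squares on one gene's log-ratios. It is a minimal illustration under stated assumptions, not one of the methods compared in this study: the array layout `ARRAYS` is a hypothetical interwoven design (9 arrays, 6 time points, 3 measurements per time point), the function names are invented for the example, and the model contains no dye or array effects.

```python
def solve(a, b):
    """Solve the square linear system a.x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]             # partial pivoting
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def reconstruct_profile(arrays, log_ratios, n_timepoints):
    """Least-squares estimate of log2(T_k / T_1) for k = 1..n_timepoints.

    arrays:     list of (cy5_timepoint, cy3_timepoint), numbered from 1
    log_ratios: observed log2(Cy5/Cy3) of one gene, one value per array
    """
    p = n_timepoints - 1                  # T_1 is the reference (fixed at 0)
    rows = []
    for cy5, cy3 in arrays:
        row = [0.0] * p
        if cy5 > 1:
            row[cy5 - 2] = 1.0
        if cy3 > 1:
            row[cy3 - 2] = -1.0
        rows.append(row)
    # Normal equations (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, log_ratios)) for i in range(p)]
    return [0.0] + solve(xtx, xty)


# Hypothetical interwoven design, not the design of Figure 1a.
ARRAYS = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 4), (5, 2), (3, 6)]
true_profile = [0.0, 1.0, 2.0, 1.5, 0.5, -1.0]          # made-up log2 profile
observed = [true_profile[cy5 - 1] - true_profile[cy3 - 1] for cy5, cy3 in ARRAYS]
estimate = reconstruct_profile(ARRAYS, observed, 6)
print([round(v, 3) for v in estimate])
```

Because the simulated ratios are noise-free, the estimate should coincide with the made-up profile; with real data, the replicated and indirect comparisons are averaged by the least-squares fit.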

Several profile reconstruction methods are available for complex designs. They all rely on linear models and, for the purpose of this study, we subdivided them into “gene-specific” and “two-stage” methods. Gene-specific profile reconstruction methods apply a linear model to each gene separately. The underlying linear model is usually designed only for reconstructing a specific gene profile from a complex design, not for normalizing the data. As a result, normalized log-ratios are used as input to these methods (see ‘Methods’). Examples of these methods are described by Vinciotti, et al. (2005) [3] and Smyth, et al. (2004) (Limma) [4]. Two-stage profile reconstruction methods, on the other hand, first apply a single linear model to all data simultaneously, i.e. the model is fitted on the dataset as a whole. These models use the separate log-intensity values for each channel, as spot effects are explicitly incorporated. They return normalized absolute expression levels for each channel separately, which can then be used to reconstruct the required time profile by a second-stage gene-specific model. An example of such a two-stage method is implemented in the Maanova package [6].

So far, comparative studies have focused on the ability of different methods to reconstruct “genes being differentially expressed” from different two-color array based designs [7-9] or on the ratio estimation between two particular conditions [5]. In this study, we aimed at performing a comparative study focusing on the time profile as the factor of interest to be reconstructed from the data.

We compared to what extent five existing profile reconstruction methods (lmbr, lmbr_dye, limmaQual, anovaFix, and anovaMix; see ‘Methods’ for details) were able to reconstruct similar profiles from data obtained by two-channel microarrays using either a loop design or an interwoven design. We assessed similarities between the methods, their sensitivity towards using alternative normalizations and their robustness against array failure. Using a spike-in experiment, we were able to assess the accuracy of the time profiles estimated by each of the methods.

Results

Assessing the influence of the used methodology on the profile reconstruction

We compared to what extent the different methods agreed with each other in 1) estimating the changes in gene expression relative to the first time point (i.e. the log-ratios of each single time point and the first time point) and 2) estimating the overall gene-specific profile shapes. Results were evaluated using two test sets, each of which represents a different complex design.

The first dataset was a time series experiment consisting of 6 time points measured on 9 arrays using an interwoven design (Figure 1a). This design resulted in three replicate measurements for each time point, with alternating dyes. As a second test, a smaller loop design was derived from the previous dataset by picking the combination of five arrays that connect five time points in a single loop (Figure 1b). A balanced loop is obtained with two replicates per condition, for which each condition is labeled once with the red and once with the green dye (see ‘Methods’).

The balance with respect to the dyes (present in the loop design) ensures that the effect of interest is not confounded with other sources of variation. In this study, the effect of interest corresponds to the time profile. The replication (as present in the interwoven design) improves the precision of the estimates and provides the essential degrees of freedom for error estimation [2]. Moreover, the interwoven design not only has more replicates, but also increases the possible paths to join any two conditions in the design. As they have different characteristics, using both datasets allows us to assess the reconstruction process under two different settings, while the RNA preparations for both designs are the same.
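The notions of replication and dye balance used above can be checked mechanically for any design. The sketch below counts, for each condition, how often it is labeled with each dye. Both array lists are hypothetical layouts written for this illustration (the actual layouts are those of Figure 1, which is not reproduced here), and the function names are assumptions.

```python
from collections import Counter


def dye_counts(arrays):
    """Map each condition to (times labeled Cy5/red, times labeled Cy3/green)."""
    red = Counter(cy5 for cy5, _ in arrays)
    green = Counter(cy3 for _, cy3 in arrays)
    return {c: (red[c], green[c]) for c in sorted(set(red) | set(green))}


def is_dye_balanced(arrays):
    """True when every condition is labeled equally often with each dye."""
    return all(r == g for r, g in dye_counts(arrays).values())


# Hypothetical layouts: (Cy5 condition, Cy3 condition) per array.
LOOP = [("T1", "T3"), ("T3", "T5"), ("T5", "T6"), ("T6", "T4"), ("T4", "T1")]
INTERWOVEN = [("T1", "T2"), ("T2", "T3"), ("T3", "T4"), ("T4", "T5"), ("T5", "T6"),
              ("T6", "T1"), ("T1", "T4"), ("T5", "T2"), ("T3", "T6")]

for name, design in (("loop", LOOP), ("interwoven", INTERWOVEN)):
    replicates = {c: r + g for c, (r, g) in dye_counts(design).items()}
    print(name, "replicates:", replicates, "balanced:", is_dye_balanced(design))
```

A design with three measurements per condition cannot be dye-balanced, since an odd number of labelings cannot be split evenly between two dyes.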

Effect of profile reconstruction methods on the ratio estimates

We first assessed to what extent the different methods agreed with each other in estimating similar log-ratios for each single gene at each single time point. To this end, we calculated the overall correlation per time point between the gene expression ratios estimated by each pair of methods. Table 1 gives the results for all mutual comparisons between the methods tested for the loop design. Irrespective of which two methods were compared, the correlation between the estimated ratios was high on average, ranging from 0.94 to 0.98 (Table 1, mean column). Moreover, this high average correlation is due to a high correlation of all individual ratios throughout the complete ratio range (see supplementary Figure S1), with only a few outliers (genes for which a rather different ratio estimate was obtained, depending on the method used). Note that for the loop design, there was no difference between the results of lmbr and lmbr_dye due to the balanced nature of this design (see ‘Methods’ section).
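The per-time-point comparison described above can be sketched as follows, assuming that the correlation is a Pearson correlation computed across genes (the text does not name the coefficient). The two small matrices of log-ratios are invented for the example and stand in for the estimates of two methods.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_per_timepoint(ratios_a, ratios_b):
    """ratios_x[g][t]: log-ratio of gene g at time point t versus the first one.

    Returns one correlation per time point, computed across all genes.
    """
    n_timepoints = len(ratios_a[0])
    return [pearson([gene[t] for gene in ratios_a],
                    [gene[t] for gene in ratios_b])
            for t in range(n_timepoints)]


# Made-up estimates for 4 genes and 2 time points from two methods.
method_a = [[1.0, 2.0], [0.5, -1.0], [-1.0, 0.3], [2.0, 1.0]]
method_b = [[2.1, 2.2], [1.1, -0.8], [-1.9, 0.1], [4.1, 1.3]]
per_timepoint = correlation_per_timepoint(method_a, method_b)
print([round(r, 3) for r in per_timepoint],
      "mean:", round(sum(per_timepoint) / len(per_timepoint), 3))
```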

For this loop design, the ratio estimates T3/T1 and T4/T1 obtained by the different methods are overall more correlated than the estimates of T5/T1 and T6/T1. As can be expected, direct estimates, i.e. estimates of a ratio for which the measurements were assessed on the same array (see Figure 1b: ratios T3/T1 and T4/T1), are more consistent than indirect estimates, i.e. estimates for which the measurements were assessed on different arrays (see Figure 1b: ratios T5/T1 and T6/T1). A similar observation was already made by Kerr and Churchill (2001), and Yang and Speed (2002). For a loop design, both the ANOVA (two-stage) [2] and the gene-specific methods [10] have trouble estimating ratios between conditions not measured on the same array (indirect estimates). The larger the loops (the longer the paths) between indirectly measured pairs of conditions, the less precise the estimates will be.

For the interwoven design, the correlation between ratio estimates, obtained by any

pair of two different methods was even higher, with values ranging from 0.95 to 0.99

(see supplementary Table S1). For this unbalanced design, the ratio estimates for the

lmbr_dye and the lmbr methods were no longer exactly the same. The difference in

consistency between direct and indirect ratio estimates was not obviously visible for

this design.

Effect of profile reconstruction methods on the profile shape

A high average correlation between the ratio estimates obtained by the different

methods at each single time point is a first valuable assessment. However, it is

biologically more important that gene specific profiles reconstructed by the different

methods exhibit the same tendency over time. Therefore, we also compared to what

extent profile shapes estimated by each of the methods differed from each other. This

was done by computing the mean similarity between profile estimates obtained by any

combination of two methods (Table 2a).

Figure 2 shows a few illustrative examples of profiles estimated by the different

methods. For the ribosomal gene “L22” (Figure 2a), irrespective of the method,

highly similar profiles were obtained. However, for the MGC85244 gene (Figure 2c),

the observed degree of similarity between profiles derived by each of the different

methods is much lower, especially for the last two time points.

Table 2a summarizes the results of the profile comparison expressed as average

profile similarities across all genes. The similarity was computed with the cosine

similarity measure after mean centering the profiles (see ‘Methods’). It ranges from -1

(anti-correlation) to 1 (perfect correlation), 0 being no correlation. Also here, the

overall correlation between different methods was not drastically different. From this

table, it appears that the more similar the underlying principles of the used methods

(both the model and the input data) are, the more correlated their results. Indeed,

correlations between profiles estimated by either limmaQual and lmbr (both gene

specific models without dye effect), or anovaMix and anovaFix (both two stage

models) are high. The most divergent correlations are observed when comparing a

gene-specific method (more specifically lmbr, or limmaQual) with a two-stage

method (anovaFix or anovaMix). When using lmbr_dye on the interwoven design, it

behaves somewhere in between: although it is a single gene model, it includes a dye

effect just like the two stage models. This does not apply for the loop design due to

its dye-balance (lmbr and lmbr_dye give the same results for balanced designs; see

‘Methods’).
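As an illustration of the similarity measure used for Table 2a, the following sketch computes the cosine similarity of two profiles after mean centering each of them. The function name and the example profiles are hypothetical, and the handling of a perfectly flat profile is an assumption made here, not something stated in the text.

```python
from math import sqrt

def centered_cosine(profile_a, profile_b):
    """Cosine similarity between two profiles after mean centering each one."""
    mean_a = sum(profile_a) / len(profile_a)
    mean_b = sum(profile_b) / len(profile_b)
    a = [v - mean_a for v in profile_a]
    b = [v - mean_b for v in profile_b]
    norm_a = sqrt(sum(v * v for v in a))
    norm_b = sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a perfectly flat profile has no defined shape
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# Same shape at a different absolute level, and a mirrored shape.
same_shape = centered_cosine([0.0, 1.0, 2.0, 1.0], [5.0, 6.0, 7.0, 6.0])
opposite_shape = centered_cosine([0.0, 1.0, 2.0, 1.0], [2.0, 1.0, 0.0, 1.0])
```

Because of the centering, the measure ignores the absolute expression level and only reflects the shape of the profile, which is why two profiles offset by a constant are scored as identical.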

Differences in the input data (log ratio versus log expression values) and alterations in

the underlying model (including a dye or random effect) are confounded in affecting

the final result. Therefore, in order to assess in more detail the specific effect of

including either a dye or a random effect in the model, we compared results between

methods that share the same input data.

To assess the influence of including a dye effect on profile estimation, we compared

the results of the gene-specific methods (see Table 2a, the first two rows). Including a

dye effect (present in lmbr_dye but not in limmaQual and lmbr) has a strong effect

under the unbalanced interwoven design (seen as a decrease in correlation between

lmbr_dye and the other single gene methods). For the loop design this effect is

non-existent because of the loop design’s balance with respect to the dyes (see ‘Methods’).

The mere impact of including a random effect in the model can be assessed by

comparing results of anovaFix and anovaMix. Indeed, they both contain the same

input data, the same normalization procedure, and the same model except for the

random effect. Seemingly, inclusion of the random effect has a higher influence on

the loop design than on the interwoven design.

Usually in a microarray experiment, an important proportion of the genes does not

change its expression significantly under the conditions tested (global normalization

assumption), exhibiting a “flat” profile. We wondered whether removing such flat

genes with a noisy profile would affect the similarity in profile estimation between

the different methods. Indeed, because the cosine similarity with centering only

measures the similarity in profile shape, regardless of its absolute expression level, the

higher level of similarity we observe between the methods might be due to a high

level of random correlation between the “flat” profiles. Therefore, we applied a

filtering procedure by removing those genes for which the profile variance over the

different time points was lower than a certain threshold (a range of threshold values

going from 0.2 to 0.4 was tested). The similarity was assessed for any pair of cognate

profile estimates if at least one of the two profiles passed the filter threshold (Table 2b

for the variance threshold of 0.4, results for the other thresholds can be found in the

supplementary information, see Table S2).
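The filtering rule can be sketched as follows. This is a hedged illustration only: the gene names and profiles are invented, and the use of the sample variance (`statistics.variance`) is an assumption, since the text does not state which variance estimator was applied.

```python
from statistics import variance

def passes_filter(profile, threshold):
    """A profile passes when its variance over time is not below the threshold."""
    return variance(profile) >= threshold

def genes_to_compare(profiles_a, profiles_b, threshold):
    """Keep genes for which at least one of the two cognate profiles passes."""
    return [
        gene for gene in profiles_a
        if passes_filter(profiles_a[gene], threshold)
        or passes_filter(profiles_b[gene], threshold)
    ]

# Hypothetical profiles estimated by two methods for three genes.
method_a = {
    "gene_flat": [0.0, 0.1, -0.1, 0.0, 0.05],
    "gene_changing": [0.0, 1.0, 2.0, 1.0, 0.0],
    "gene_mixed": [0.0, 0.1, -0.1, 0.0, 0.05],
}
method_b = {
    "gene_flat": [0.05, 0.0, -0.1, 0.1, 0.0],
    "gene_changing": [0.1, 0.9, 2.1, 1.0, 0.0],
    "gene_mixed": [0.0, 1.0, 2.0, 1.0, 0.0],
}
kept = genes_to_compare(method_a, method_b, threshold=0.4)
```

Only the gene that is flat according to both methods is discarded; a gene that is flat for one method but variable for the other is still compared.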

Overall, the results obtained with each of the different variance thresholds confirmed

the observations of Table 2a: 1) the more similar the models and input data, the more

similar the methods behaved (two-stage methods differed most from limmaQual

followed by lmbr in estimating the gene profiles), 2) including a dye effect has a

pronounced effect in an interwoven design (in a loop design there is no distinction due

to the balance with respect to the dyes; see ‘Methods’), 3) including a random effect

has most influence on the loop design. In addition, it seems that, the more flat profiles

are filtered from the dataset, the more similar the results obtained by each of the

different methods become.

The effect of array failure on the profile reconstruction

In practice, when performing a microarray experiment some arrays might fail with

their measurements falling below standard quality. When these bad measurements are

removed from the analysis, the complete design and the results inferred from it will be

affected. Here we evaluated this issue experimentally by simulating array defects. In a

first experiment, the interwoven design (dataset 1) was considered as the original

design without failure. We tested 9 different, possible situations of failure, by each

time removing a single array from the design, resulting in 9 reduced datasets. The

same test was performed with the loop design (dataset 2).
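The generation of the reduced datasets can be sketched as a leave-one-array-out procedure. The nine-array layout used here is hypothetical: it measures each of six time points three times, as the interwoven design does, but it is not the layout of Figure 1a.

```python
from collections import Counter

def leave_one_array_out(arrays):
    """Return one reduced design per array, each missing exactly that array."""
    return [arrays[:i] + arrays[i + 1:] for i in range(len(arrays))]

def replicates_per_condition(arrays):
    """Count how many times each condition is measured across both channels."""
    return Counter(condition for array in arrays for condition in array)

# Hypothetical layout: each array is (Cy5 condition, Cy3 condition).
full_design = [
    ("T1", "T2"), ("T2", "T3"), ("T3", "T4"),
    ("T4", "T5"), ("T5", "T6"), ("T6", "T1"),
    ("T1", "T4"), ("T2", "T5"), ("T3", "T6"),
]
reduced_designs = leave_one_array_out(full_design)
```

Each reduced design loses one replicate for the two conditions hybridized on the removed array, which is what makes smaller designs more vulnerable to a single failure.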

We compared for each of the different profile reconstruction methods, the mean

similarity between the ratios obtained either with the full dataset or with each of the

reduced datasets (9 comparisons). Table 3 summarizes the results for the interwoven

design, and Table 4 for the loop design.

For the interwoven design (Table 3), it appears that in general removing one array

from the original design did not really affect the ratio reconstruction. For all methods,

ratio estimates tend to be more affected when an array measuring the reference time

point was removed (T1) (Table 3). Overall the two-stage methods, and in particular

anovaMix, seemed most robust against array failure, while limmaQual was most

sensitive (Table 3). Methods including a dye effect were more robust against array

failure. Similar results were obtained when the effect of array failure was assessed on

the similarity in profiles (see supplementary Table S3).

For the loop design, the situation was quite different (Table 4). Note that here, the

lmbr_dye and limmaQual methods were not used for profile reconstruction as the

reduced datasets did not contain sufficient information for estimating all the model

parameters. For both lmbr_dye and limmaQual, the linear models lose their main differing

characteristics compared to lmbr (see ‘Methods’ section). For all remaining methods

removing one array from the design affected the results considerably more than was

the case for the interwoven design. Two-stage methods were the most robust, but in

this design anovaMix performs slightly worse than anovaFix. The lmbr method turned

out to be very sensitive to array failure, giving a mean similarity around 0.2,

indicating no correlation between profiles estimated with and without array failure

(see supplementary Table S4).

Note that overall, all methods seem to be more robust to array failure under the

interwoven design than under the loop design. This is to be expected as the former

design contains more replicates.

Consistency of the methods under different normalization procedures

In the previous section we compared profiles and ratio estimates obtained by the

different methods after applying default normalization steps. However, other

normalization strategies are possible, and could potentially affect the outcome. To

assess the influence of using alternative normalization procedures, we compared

profiles reconstructed from data normalized with 1) print tip Loess without additional

normalization step (the default setting for anovaMix and anovaFix as used throughout

this paper), 2) print tip Loess with a scale-based normalization between arrays [13],

and 3) print tip Loess with a quantile-based between array normalization (the default

normalization for lmbr, lmbr_dye, and limmaQual as used throughout this paper) [12,

14].
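To make the third option concrete, a quantile-based between-array normalization can be sketched in a few lines. This is a simplified plain Python illustration, not the limma implementation; in particular, ties are broken by position here, which is an assumption of this sketch.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    arrays: list of equally long lists, one list of values per array.
    """
    n = len(arrays[0])
    sorted_values = [sorted(values) for values in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(column[k] for column in sorted_values) / len(arrays) for k in range(n)
    ]
    normalized = []
    for values in arrays:
        order = sorted(range(n), key=lambda idx: values[idx])
        result = [0.0] * n
        for rank, idx in enumerate(order):
            result[idx] = reference[rank]  # ties are broken by position
        normalized.append(result)
    return normalized

# Two hypothetical arrays with three spots each.
normalized = quantile_normalize([[2.0, 4.0, 1.0], [5.0, 9.0, 7.0]])
```

After the transformation every array contains exactly the same set of values, and only the ranking of the spots within each array is preserved.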

Table 5 shows, for each of the different methods, the mean similarity between

reconstructed profiles derived from differently normalized datasets. Overall, the

influence of the normalization was not drastic. More importantly, the influence of the

additional normalization steps seemed independent of the method used (similar

influences were observed for all methods). When assessing the similarity in ratio

estimates instead of profile estimates, similar results were obtained (data not shown).

Accuracy of estimation

So far we only assessed to what extent changes in the used methodologies or

normalization steps affected the inferred profiles. This, however, does not give any

information on the accuracy of the methods, i.e., which of these methods is able to

best approximate the true time profiles. Assessing the accuracy is almost impossible

as usually the true underlying time profile is not known. However, datasets that

contain external controls (spikes) could prove useful in this regard. Spikes are added

to the hybridisation solution in known quantities, so that we have a clear view of their

actual profile. In the following analysis, we used such a spike-in experiment to test the

accuracy of each of the profile reconstruction methods [15]. For the technical details

of this dataset we refer to ‘Methods’ and Table 6.

As lmbr, lmbr_dye and limmaQual gave exactly the same results using this

balanced design, we further assessed to what extent lmbr, anovaFix and anovaMix

agreed with each other in estimating similar profiles. The profile was reconstructed

using one specific spike concentration as a reference point. Figure 3 shows the results

for two representative spikes using the 10 cpc (copies per cell) measurement as a

reference. Similar results were obtained for the other spikes (data not shown). The

three compared methods reconstruct very similar profiles for both representative

spikes. This is consistent with our previous observations where these three methods

were very consistent with each other (Table 1 and 2 in the balanced loop design).

Judging to which extent these methods were able to approximate the true underlying

profile (i.e. approximating the true ratios and not only the profile behaviour), it

appeared that the tested linear methods started failing when ratios were to be

estimated from low intensity signals (corresponding to low spike-in concentration). As

shown in Figure 3, differences between low concentrations can no longer be

accurately detected by the linear methods, giving estimation around zero instead of

approximating the true range of ratio concentrations (which should be between –

log2(0.001) and –log2(1)). Clearly, ratios were overestimated relative to their true

values. In contrast, at the high concentration range, ratios were consistently

underestimated for all three methods (Figure 3).

Figure 4 shows spike-in profiles reconstructed by using a more extreme concentration

as reference for the reconstructed ratio profile. In the high concentration range (from

10 to 10,000 cpc) the shape of the estimated profile is highly similar to the expected

shape (Figure 4a, dotted line), but the estimated values depend on the concentration

used as reference. When the ratios were estimated using the maximum concentration

of 10,000 cpc as reference, we observed a high correlation between the estimated

ratios and the true profile (Figure 4a). However, when the lowest concentration of 0

cpc was used as reference point, reconstructed ratio profiles were highly

underestimated (Figure 4b). These results illustrate how, when estimating ratio

profiles, the observed intensity of the reference point does not drastically disturb the

profile shape, but can largely bias the accuracy of the estimated ratio.
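For the expected (true) profile, the role of the reference can be made explicit with a small sketch. The concentrations below are illustrative values within the range of the spike-in experiment; 0 cpc is left out because its log-ratio is undefined. The function names are hypothetical.

```python
from math import log2

def expected_log_ratios(concentrations, reference_index):
    """Expected log2 ratio of every concentration to the chosen reference."""
    reference = concentrations[reference_index]
    return [log2(c / reference) for c in concentrations]

def mean_center(profile):
    """Subtract the profile mean so that only the shape remains."""
    mean = sum(profile) / len(profile)
    return [v - mean for v in profile]

spike_cpc = [10, 100, 1000, 10000]  # illustrative copies per cell
relative_to_lowest = expected_log_ratios(spike_cpc, 0)
relative_to_highest = expected_log_ratios(spike_cpc, 3)
offsets = [low - high for low, high in zip(relative_to_lowest, relative_to_highest)]
```

In the ideal case, changing the reference shifts every ratio by the same constant and leaves the centered profile unchanged. A measurement bias at the reference point therefore propagates to all estimated ratios while the shape is preserved, which is consistent with the behaviour seen in Figure 4.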

Discussion

In this study, we evaluated the performance of five methods based on linear models in

estimating gene expression ratios and reconstructing time profiles from complex

microarray experiments. From a theoretical viewpoint, two major differences can be

distinguished between the methods selected for this study: 1) differences related to

alterations in the input data: the selected two-stage methods make use of the log-

intensity values while the gene-specific methods use log-ratios, 2) differences related

to the model characteristics: some of the models include an explicit dye effect

(lmbr_dye, anovaFix and anovaMix) or an explicit random effect (anovaMix).

Although Kerr [5] assumed that observed differences in estimates obtained by

different models are due to the differences in model characteristics, rather than to the

input data, we cannot clearly make this distinction. Indeed, the way the error-term is

modelled influences the statistical inference and hence the use of log-intensities or

log-ratios does cause a difference between models [5]. However, when focusing on

results obtained between methods with similar input data, we can assess, to some

extent, the effect of different model specificities. In the following sections, some of

these effects are discussed in more detail.

The inclusion of the dye effect

In general we observed that the gene specific methods without dye effect on the one hand,

and the two-stage models with dye effect on the other, gave results that were more similar

within each group than between the two groups. Lmbr_dye (a gene specific model with dye effect) is

situated somewhere in between when the design is unbalanced with respect to the

dyes. Indeed, the gene specific models lmbr and limmaQual contain a combination of

log-ratios plus an error term. However, when adding a dye effect to these models as is

the case of lmbr_dye, the formulations and estimations converge with those of the

two-stage ANOVA models for unbalanced designs.

Originally, Vinciotti, et al. (2005) [3] and Wit, et al. (2005) [16] added the dye effect

for purposes of data normalization when one is working with non-normalized data.

From our results, we also noted a practical advantage of including a dye effect even

with normalized data. The fact that adding a dye effect showed pronounced

differences for a dye-unbalanced design indicates that, despite the data being

normalized, there are still dye-related inconsistencies in the data that might partially

be compensated for by including a dye effect. Moreover, models with dye effects

seemed more robust in estimating log-ratios from a design disturbed by array failure.

Therefore, when working with unbalanced designs, it is advisable to include a dye

effect, not only for the two-stage ANOVA models, as was also suggested by

Wolfinger (2001) [17], Kerr (2003) [5], and Kerr and Churchill (2001) [2], but also

for gene specific models based on log-ratios.

Mixed models versus Fixed models

Several studies advise the users to model the spot-gene or array-gene effects as

random variables [9, 17]. We observed that under the loop design (with 5 arrays),

profiles estimated by anovaMix and anovaFix diverged. We also noticed that, for the

loop design anovaMix had a lower capacity than anovaFix to handle array failures.

For the interwoven design with 9 arrays these effects were less pronounced. Probably,

the loop design used in our study does not contain a sufficient number of arrays to

allow for the estimation of the spot-gene effect when using a mixed anova model. As

a result, ratios and time profiles estimated by anovaMix are less reliable for an

experiment with few arrays than when using the anovaFix model in similar

conditions.

The effect of using alternative normalization steps on the methods’ performance

We tested the influence of using additional normalization steps. Differently

normalized data give different results, but the effects were not dramatic. Moreover,

they had the same influence on all methods, indicating that all methods were equally

sensitive to changes in the normalization.

Accuracy of estimated ratios

Based on spike-in experiments for two-channel microarrays, we could also assess to

what extent the estimated ratios approximated the true ratios (i.e., the accuracy of the

estimated ratios). We observed that all five tested linear methods generated biased

estimations, consistently overestimating changes in expression relative to a reference

with low mRNA-concentration. These results proved to be independent of the

method used (gene specific or two-stage) and of the number of effects included in the

model.

Conclusions

On average the correlation between the estimated ratios was high, and all methods

more or less agreed with each other in predicting the same profile. The similarity in

profile estimation between the different methods improved with an increasing

variance of the expression profiles.

We observed that when dealing with unbalanced designs, including a dye effect, such

as in the methods lmbr_dye, anovaFix and anovaMix, seems to compensate for

residual dye related inconsistencies in the data (despite an earlier normalization step).

Adding a dye effect also renders the results more robust against array failure.

Including random effects only makes sense if a design is used with a sufficient

number of replicates, otherwise it deteriorates the results.

The accuracy of the different methods in estimating the true profiles was assessed

using a spike-in experiment: all of the tested methods reconstructed very similar

profiles. Only when ratios were to be estimated from low intensity signals

(corresponding to low spike-in concentration) did they fail to approximate the true

expression ratios.

In conclusion, because of their robustness against imbalances in the design and array

failure, we believe lmbr_dye, anovaFix and anovaMix are most appropriate for

practical use (given a sufficient number of replicates in case of the latter).

Methods

Microarray data

The first dataset used in this study was a temporal Xenopus tropicalis expression

profiling experiment. The array used consisted of 3000 oligos of 50mers,

corresponding to 2898 unique X. tropicalis gene sequences and negative control spots

(Arabidopsis thaliana probes, blanks and empty buffer controls). Each oligo was

spotted in duplicate on each array in two separated grids. On each grid,

oligonucleotides were spotted in 16 blocks of 14 x 14 spots. Pairs of duplicated

oligos on the two grids of the same gene sequence were treated as replicates during

analysis, corresponding to a total of 2999 different duplicated measurements (a few

oligos were spotted multiple times on the arrays). MWG Biotech performed

oligonucleotide design, synthesis and spotting. X. tropicalis gene sequences were

derived from the assembly of public and in-house expressed sequence tags. The

temporal expression of X. tropicalis during metamorphosis was profiled at 6 time

points, using an experimental design consisting of 9 arrays. Each time point was

measured three times, with alternating dyes as shown in Figure 1a. This interwoven

design was used as a first test set.

From this original design a second test set containing a smaller loop design was

derived by picking the combinations of five arrays that connect five time points in a

single loop (Figure 1b) and with the first time point as a reference. This results in a

balanced loop design.

A publicly available spike-in experiment [18] was used as a third test set. This dataset

contains 13 spike-ins, or control clones spiked with known concentrations. The

control clones were spiked at different concentrations for each of the 7 conditions,

where each spike describes a specific temporal profile (Table 9).

The microarray design used for the spike-in experiment was a common reference

design, with dye swap for each condition, and the concentrations of the spikes ranged

from 0 to 10,000 copies per cellular equivalent (cpc), assuming that the total RNA

contained 1% poly(A) mRNA and that a cell contained on average 300,000

transcripts. This concentration range covered all biologically relevant transcript

levels.

Probe preparation and microarray hybridization

10 µg of total RNA were used to prepare probes. Labeling was performed with the

Invitrogen SuperScript™ Indirect cDNA labeling system (using polyA and random

hexamer primers) using the Amersham Cy3 or Cy5 monofunctional reactive dyes.

Probe quality was assessed on an agarose minigel and quantified with a Nanodrop

ND-1000 spectrophotometer. Dye quantities were equilibrated for hybridization by

the amount of fluorescence per ng of cDNA. The arrays were hybridized for 20 h at

45 °C according to the manufacturer’s protocol (QMT ref). Washing was performed in

2X SSC 0.1% SDS at 42°C for 5 min and then twice at room temperature, in 1X SSC and in

0.5X SSC, each time for 5 min. Arrays were scanned using a GenePix Axon scanner.

Microarray normalization

The raw intensity data were used for further normalization. No background

subtraction was performed. Data were log-transformed and the intensity dependent

dye or condition effects were removed by using a local linear fit loess on these log-

transformed data (Printtiploess command with default settings as implemented in the

limma BioConductor package [13]). As this loess fit not only normalizes the data but

also linearizes them, applying it before profile reconstruction is a prerequisite as all

linear models used for profile reconstruction assume non linearities to be absent from

the data.

For the gene specific methods (lmbr, lmbr_dye and limmaQual), Loess corrected log-

ratios (per print tip) were subjected to an additional quantile normalization step [4, 12]

as suggested by Vinciotti et al. (2005) [3] in order to improve the intercomparability

between arrays. It equalizes the distribution of probe intensities for each array in a set

of arrays. For the two-stage profile reconstruction methods (anovaFix and anovaMix),

corrected log-intensities for the red ($R_{CORR}$) and green ($G_{CORR}$) channels were

calculated from the Loess corrected log-ratios ($M_{CORR}$; no additional quantile

normalization was done for the two-stage methods) and mean absolute intensities ($A$)

as follows: $R_{CORR} = A + M_{CORR}/2$, and $G_{CORR} = A - M_{CORR}/2$.
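The back-transformation from a corrected log-ratio and a mean log-intensity to the two channel values is a one-line computation per channel. The sketch below uses a hypothetical function name and made-up values.

```python
def channels_from_ma(m_corr, a):
    """Recover corrected red and green log-intensities from M (log-ratio) and A."""
    r_corr = a + m_corr / 2.0
    g_corr = a - m_corr / 2.0
    return r_corr, g_corr

# Hypothetical spot: corrected log-ratio 1.0 and mean log-intensity 10.0.
r_corr, g_corr = channels_from_ma(1.0, 10.0)
```

By construction the difference of the two channels returns the corrected log-ratio and their mean returns the unchanged mean intensity, so the loess correction is carried over to the log-intensities used by the two-stage methods.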

Used profile reconstruction methods

Available R implementations (BioConductor [19]) of the presented methods were

used to perform the analyses.

Gene specific methods based on log-ratios

Gene specific profile reconstruction methods apply a linear model on each gene

separately. The goal is to estimate the true expression differences between the mRNA

of interest and the reference mRNA, from the observed log-ratios. The presented

models assume that the expression values have been appropriately pre-processed and

normalized [3, 20]. The three selected gene-specific models for this study are:

1) lmbr, the linear model described by Vinciotti et al. (2005) [3]:

An observation $y_{jk}$ is the log-ratio of condition j and condition k. For each gene a

vector of n observations $y = (y_1, \ldots, y_n)$ can be represented as

$$y = X\mu + \varepsilon$$

where X is the design matrix defining the relationship between the values

observed in the experiment and a set of independent parameters

$\mu = (\mu_{12}, \mu_{13}, \ldots, \mu_{1T})$ representing true expression differences, and $\varepsilon$ is a vector of

errors. The parameters in $\mu$ are arbitrarily chosen this way for estimation

purposes; all expression differences between other conditions i and k can be

calculated from these parameters as $\mu_{ik} = \mu_{1k} - \mu_{1i}$. The goal is to obtain estimates

$\hat{\mu}$ of the true expression differences separately for each gene. Given the

assumptions behind the linear model, the least squares estimator for $\mu$ is [3]

$$\hat{\mu} = (X^t X)^{-1} X^t y$$

2) lmbr_dye, an extension of lmbr including a general dye effect:

The previous model can be extended to include a gene-specific dye effect [3]

$$y = X\mu + D + \varepsilon$$

where D is a vector of n times a constant value representing the gene-specific dye

effect $\delta$. Alternatively, one could write $y = X_D \mu_D + \varepsilon$ where $X_D$ is the design

matrix X with an extra column of ones, and $\mu_D = (\mu_{12}, \mu_{13}, \ldots, \mu_{1T}, \delta)$. Note that

in the case of dye-balanced designs, the addition of a dye effect will not yield any

different estimators for the contrasts of interest. In a balanced design, each column

of X will have an equal amount of 1’s and -1’s. I.e. the ith column of X,

corresponding to the true expression difference $\mu_{1i}$, reflects how condition i was

measured an equal number of times with both dyes. As such, the positive and

negative influences of the dye effect will cancel each other out in the estimation of

true expression differences. The use of lmbr_dye will thus only render different

results compared to lmbr when using it to analyze unbalanced experiments.

In order to estimate all parameters, the matrix $X_D$ must be of full rank. If the

column representing the dye effect is not linearly independent, the matrix is rank

deficient. This situation occurs for example when an array is removed from the

loop design used in this paper. In this case, there are an infinite number of

possible least squares parameter estimates. Since we expect a single set of

parameters, a constraint must be applied (this is done on the dye effect) in which

case the true expression estimates are the same as for lmbr.

Lmbr and lmbr_dye were implemented in the R language using the function ‘lm’

for linear least squares regression.

3) limmaQual, the limma model [4, 20, 21] including an array quality adjustment:

The quality adjustment assigns a low weight to poor quality arrays, which can be

included in the inference. The approach is based on the empirical reproducibility

of the gene expression measures from replicated arrays, and it can be applied to

any microarray experiment. The linear model is similar to the model described by

Vinciotti et al. (2005), but the variance of the observations y includes the weight

term. In this case, the weighted least squares estimator $\hat{\mu}$ is [20]:

$$\hat{\mu} = (X^t \Sigma^{-1} X)^{-1} X^t \Sigma^{-1} y$$

where $\Sigma$ is the diagonal matrix of weights.

The weights in the limmaQual model are the inverse of estimated array variances,

down weighting the observations of lower quality arrays in order to increase the

power to detect differential expression. The method is restricted for use on data

from experiments that include at least two degrees of freedom. When testing the

array failure in case of the loop design, there is no array level replication for two

of the conditions so the array quality weights can not be estimated: the limma

function returns a perfect quality for all the arrays (in this case ∑ is a diagonal

matrix of 1’s).

The fit is by generalized least squares allowing for correlation between duplicate

spots or related arrays, implemented in an internal function of the limma package.
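The relation between lmbr and lmbr_dye on a dye-balanced design can be checked numerically. The following sketch is not the R code used in this study (which relied on ‘lm’); it solves the normal equations in plain Python for a hypothetical four-condition loop with made-up log-ratios, and all function names are illustrative.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def least_squares(x, y):
    """Ordinary least squares via the normal equations (X^t X) mu = X^t y."""
    columns = list(zip(*x))
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(p * v for p, v in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)

# Hypothetical balanced loop T1 -> T2 -> T3 -> T4 -> T1; columns are mu_12, mu_13, mu_14.
loop_x = [
    [1.0, 0.0, 0.0],    # array 1: T2 versus T1
    [-1.0, 1.0, 0.0],   # array 2: T3 versus T2
    [0.0, -1.0, 1.0],   # array 3: T4 versus T3
    [0.0, 0.0, -1.0],   # array 4: T1 versus T4
]
loop_y = [1.2, 0.3, -0.5, -0.6]  # made-up log-ratios for one gene

mu_plain = least_squares(loop_x, loop_y)                        # lmbr-like fit
mu_dye = least_squares([row + [1.0] for row in loop_x], loop_y)  # extra dye column
```

In this example every column of X sums to zero, so the column of ones is orthogonal to it and the first three entries of `mu_dye` coincide with `mu_plain`, the last entry being the dye effect. If one array is dropped from such a loop, the matrix with the dye column has more parameters than observations and `solve` raises a `ValueError`, which mirrors the rank deficiency discussed for lmbr_dye.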

Two-stage methods based on the log-intensity values

The selected methods correspond to ANOVA (Analysis of variance) models. They

can normalize microarray data and provide estimates of gene expression levels that

are corrected for potential confounding effects.

Since the global methods are computationally time-consuming, we selected two-stage

methods that apply a first stage on all data simultaneously and a second stage on a

gene by gene level. These models use partially normalized data as input (i.e., the

separate log-intensity values for each channel), as spot effects are explicitly

incorporated. They return normalized absolute expression levels for each channel

separately (i.e. no ratios), which can then be used to reconstruct the required time

profile.

4) anovaFix, two-stage ANOVA with fixed effects [6]:

We denote the loess-normalized log-intensity data by $y_{ijkgr}$ that represents the

measurement observed in the array i, labeled with the dye j, representing the time

point k, from gene g and spot r. The first stage is the normalization model:

$$y_{ijkgr} = \mu + A_i + D_j + AD_{ij} + r_{ijkgr}$$

where the term $\mu$ captures the overall mean. The other terms capture the overall

effects due to arrays (A), dyes (D) and labelling reactions (AD). This step is called

“normalization step” and it accounts for experiment systematic effects that could

bias inferences made on the data from the individual genes. The residual of the

first stage is the input for the second stage, which models the gene-specific

effects:

$$r_{ijkgr} = G + SG_r + DG_j + VG_k + \varepsilon_{ijkgr}$$

Here G captures the average effect of the gene. The SG effect captures the spot-

gene variation and we used it instead of the more global AG array-gene effect.

The use of this effect obviates the need for intensity ratios. DG captures specific

dye-gene variation and VG (variety-gene) is the effect of interest, the effects due

to the time point measured. The Maanova fixed model computes least squares

estimators for the different effects.

5) anovaMix, two-stage ANOVA with mixed effects [6, 17]:

The model applied is exactly the same as anovaFix, but in this case the SG effect

was treated as a random variable, meaning that if the experiment were to be

repeated, the random spot effects would not be exactly reproduced, but they would

be drawn from a hypothetical population. A mixed model, where some variables

are treated as random, allows for including multiple sources of variation.

We used the default method to solve the mixed model equation, the REML

(restricted maximum likelihood) method. Duplicated spots were treated as

independent measurements of the same gene. For the Maanova and limma packages

the option to do so is available; for lmbr and lmbr_dye duplicated spots were

taken into account by the design matrix.
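The first stage of the two-stage models can be illustrated with a short sketch. It relies on the observation that a model containing the overall mean, the array effect, the dye effect and the full array-by-dye interaction is saturated over the array-dye cells, so its least squares fitted value is simply the mean of each cell. This is an illustration under that assumption, with a hypothetical function name and made-up data; it is not the Maanova implementation.

```python
from collections import defaultdict

def stage_one_residuals(measurements):
    """Residuals of the normalization model for (array, dye, value) records.

    The records are pooled over all genes; each value is reduced by the mean
    of its (array, dye) cell, which is the least squares fit of mu + A + D + AD.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for array, dye, value in measurements:
        sums[(array, dye)] += value
        counts[(array, dye)] += 1
    return [
        value - sums[(array, dye)] / counts[(array, dye)]
        for array, dye, value in measurements
    ]

# Made-up log-intensities: two spots per channel on array 1, two on array 2 (Cy5).
records = [
    (1, "Cy5", 10.0), (1, "Cy5", 12.0),
    (1, "Cy3", 9.0), (1, "Cy3", 11.0),
    (2, "Cy5", 8.0), (2, "Cy5", 9.0),
]
residuals = stage_one_residuals(records)
```

These residuals are what the second, gene-specific stage takes as input to estimate the spot-gene, dye-gene and variety-gene effects.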

Profile reconstruction

Applying the gene-specific methods mentioned above results in estimated differences

in log-expression between a test and a reference condition or in log-ratios. To

reconstruct a time profile from the different designs, the first time point was chosen as

the reference. A gene-specific reconstructed profile thus consists of a vector which

contains as entries ratios of the measured expression level of that gene at each time

point except the first, relative to its expression value at the first time point. For

instance, for the loop design shown in Table 1 the profile contains 4 ratios.

In contrast to the gene-specific methods, two-stage methods estimate the absolute

gene expression level for each time point rather than log-ratios. In this case, for the

loop design shown in Table 1, the profile contains 5 gene expression levels.

Comparison of profile reconstruction

To assess the influence of using different methodologies on the profile reconstruction,

the following similarity measures were used to compare the consistency in

reconstructing profiles for the same gene between the compared methods:

1. Overall similarity in the estimated ratios: we assessed the similarity between the

estimations of each single ratio of the time profile generated by two methods

using the Pearson correlation. Since two-stage methods estimate gene expression

levels (variety-gene effect in the model) instead of log-ratios, we converted these

absolute values into log-ratios by subtracting from the absolute expression levels

estimated for each of the conditions the estimated level of the first time point (the

reference).

2. Profile shape similarity: the profile shape reflects the expression behaviour of a

gene over time. For each single gene, we computed the mutual similarity between

profile estimates obtained by any combination of two methods. To make profiles

consisting of log-ratios obtained by the gene-specific methods comparable with

the profiles estimated by the two-stage methods, we extended the log-ratios profile

by adding as a first time point a zero. This represents the contrast between the

expression value of the first time point against itself in log-scale (Figure 2).

Profile Similarity

The mutual similarity was computed as the cosine similarity, which corresponds to

the cosine of the angle between two vectors representing genes i and j with profiles Pi and Pj.

$$\cos(P_i, P_j) = \frac{P_i \cdot P_j}{\lVert P_i \rVert \, \lVert P_j \rVert}$$

All profiles were mean centred, i.e. data have been shifted by the mean of the profile

ratios to have an average of zero for each gene, prior to computing the cosine

similarity. With centred data, the cosine similarity can also be viewed as the

correlation coefficient, and it ranges between -1 (opposite shape) and 1 (similar

shape), 0 being no correlation. The cosine similarity only considers the angle between

the vectors focusing on the shape of the profile. As a result, it ignores the magnitude

of ratios of the profiles, resulting in relatively high similarities for false positives (i.e.

“flat profiles”, genes that do not change their expression profile over time, but for

which the noise profile corresponds by chance to other gene profiles).

No variance normalization was performed on the profiles to preserve their shape.

Instead of normalizing by the variance, the profiles were filtered using the standard

deviation.
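
The profile shape comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names and the example values are hypothetical. A log-ratio profile is extended with a zero for the first time point, both profiles are mean centred, and the cosine similarity is computed.

```python
from math import sqrt


def mean_centre(profile):
    """Shift a profile so that its entries average to zero."""
    mean = sum(profile) / len(profile)
    return [x - mean for x in profile]


def cosine_similarity(p, q):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = sqrt(sum(x * x for x in p)) * sqrt(sum(y * y for y in q))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def profile_similarity(log_ratio_profile, level_profile):
    """Compare a gene-specific (log-ratio) profile with a two-stage profile.

    The log-ratio profile is extended with a leading zero, the log-ratio of
    the first time point against itself, so that both profiles have one
    entry per time point.
    """
    extended = [0.0] + list(log_ratio_profile)
    return cosine_similarity(mean_centre(extended), mean_centre(level_profile))


# Hypothetical estimates for one gene over five time points.
log_ratios = [0.5, 2.0, 1.5, -0.5]   # relative to the first time point
levels = [7.5, 8.0, 9.5, 9.0, 7.0]   # expression levels from a two-stage method
similarity = profile_similarity(log_ratios, levels)
```

In this example the two profiles differ only by a constant offset, which mean centring removes, so their similarity is 1 up to rounding error.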

Filtering flat profiles

Constitutively expressed genes or genes for which the expression did not significantly

change over the conditions were filtered out by removing genes whose variation in

expression (standard deviation) over the different conditions was lower than a fixed threshold (the following

thresholds were tested: 0.1, 0.2, 0.4). A pairwise similarity comparison was

made for all cognate profile estimates that were above the filtering threshold in at

least one of the two methods compared. Similar results were obtained when applying

as a filter that all cognate profile estimates had to be above the filtering threshold in

both methods compared (data not shown).
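
The filtering rule can be sketched as below, assuming that the threshold is applied to the sample standard deviation of each profile; the function name, the data layout and the example values are hypothetical.

```python
from statistics import stdev


def genes_to_compare(profiles_a, profiles_b, threshold):
    """Select genes for a pairwise comparison between two methods.

    profiles_a, profiles_b: dicts mapping a gene name to the profile estimated
    by each method. A gene is kept when the standard deviation of its profile
    is above the threshold for at least one of the two methods.
    """
    kept = []
    for gene in profiles_a.keys() & profiles_b.keys():
        if stdev(profiles_a[gene]) > threshold or stdev(profiles_b[gene]) > threshold:
            kept.append(gene)
    return sorted(kept)


# Hypothetical profiles estimated by two methods.
method_a = {
    "flat": [0.0, 0.05, -0.05, 0.0, 0.02],
    "changing": [0.0, 1.0, 2.0, 1.0, 0.0],
}
method_b = {
    "flat": [0.0, 0.01, 0.0, -0.01, 0.0],
    "changing": [0.0, 0.1, 0.2, 0.1, 0.0],
}
selected = genes_to_compare(method_a, method_b, threshold=0.4)
```

Replacing `or` with `and` would give the stricter variant in which both profile estimates have to be above the threshold.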

Authors' contributions

AF performed the analysis and wrote the manuscript. RT and NP were responsible for

the microarray hybridization. NP and GB critically read the draft. KE contributed to

the analysis. KM coordinated the work and revised the manuscript. All authors read

and approved the final manuscript.

References

1. Glonek GF, Solomon PJ: Factorial and time course designs for cDNA

microarray experiments. Biostatistics 2004, 5(1):89-111.

2. Kerr MK, Churchill GA: Experimental design for gene expression

microarrays. Biostatistics 2001, 2(2):183-201.

3. Vinciotti V, Khanin R, D'Alimonte D, Liu X, Cattini N, Hotchkiss G,

Bucca G, de Jesus O, Rasaiyaah J, Smith CP et al: An experimental

evaluation of a loop versus a reference design for two-channel

microarrays. Bioinformatics 2005, 21(4):492-501.

4. Smyth GK: Linear models and empirical bayes methods for assessing

differential expression in microarray experiments. Stat Appl Genet Mol

Biol 2004, 3.

5. Kerr MK: Linear models for microarray data analysis: hidden

similarities and differences. J Comput Biol 2003, 10(6):891-901.

6. MAANOVA software

[http://www.jax.org/staff/churchill/labsite/software/anova]

7. Tsai CA, Chen YJ, Chen JJ: Testing for differentially expressed genes

with microarray data. Nucleic Acids Res 2003, 31(9):e52.

8. Marchal K, Engelen K, De Brabanter J, Aert S, De Moor B, Ayoubi T,

Van Hummelen P: Comparison of different methodologies to identify

differentially expressed genes in two-sample cDNA microarrays. J

Biological Systems 2002, 10(4):409-430.

9. Cui X, Churchill GA: Statistical tests for differential expression in cDNA

microarray experiments. Genome Biol 2003, 4(4):210.

10. Yang YH, Speed T: Design issues for cDNA microarray experiments. Nat

Rev Genet 2002, 3(8):579-588.

11. Smyth GK, Speed T: Normalization of cDNA microarray data. Methods

2003, 31(4):265-273.

12. Yang YH, Thorne NP: Normalization for two-color cDNA microarray

data. In: Science and Statistics: A Festschrift for Terry Speed, IMS Lecture

Notes-Monograph Series Edited by DR G, vol. 40; 2003: 403-418.

13. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP:

Normalization for cDNA microarray data: a robust composite method

addressing single and multiple slide systematic variation. Nucleic Acids

Res 2002, 30(4):e15.

14. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of

normalization methods for high density oligonucleotide array data based

on variance and bias. Bioinformatics 2003, 19(2):185-193.

15. Hilson P, Allemeersch J, Altmann T, Aubourg S, Avon A, Beynon J,

Bhalerao RP, Bitton F, Caboche M, Cannoot B et al: Versatile gene-

specific sequence tags for Arabidopsis functional genomics: transcript

profiling and reverse genetics applications. Genome Res 2004,

14(10B):2176-2189.

16. Wit E, Nobile A, Khanin R: Near-optimal designs for dual-channel

microarray studies. Applied Statistics 2005, 54(5):817-830.

17. Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel

P, Afshari C, Paules RS: Assessing gene significance from cDNA

microarray expression data via mixed models. J Comput Biol 2001,

8(6):625-637.

18. Allemeersch J, Durinck S, Vanderhaeghen R, Alard P, Maes R, Seeuws

K, Bogaert T, Coddens K, Deschouwer K, Van Hummelen P et al:

Benchmarking the CATMA microarray. A novel tool for Arabidopsis

transcriptome analysis. Plant Physiol 2005, 137(2):588-601.

19. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S,

Ellis B, Gautier L, Ge Y, Gentry J et al: Bioconductor: open software

development for computational biology and bioinformatics. Genome Biol

2004, 5(10):R80.

20. Ritchie ME, Diyagama D, Neilson J, van Laar R, Dobrovic A, Holloway

A, Smyth GK: Empirical array quality weights in the analysis of

microarray data. BMC Bioinformatics 2006, 7:261.

21. Smyth GK, Michaud J, Scott HS: Use of within-array replicate spots for

assessing differential expression in microarray experiments.

Bioinformatics 2005, 21(9):2067-2075.

Acknowledgements

We thank Dr. Filip Vandenbussche for his comments on earlier versions of this

manuscript. AF was partially supported by a French Embassy - CONICYT (Chile) grant.

This work was also partially supported by 1) IWT: SBO-BioFrame, 2) Research

Council KULeuven: EF/05/007 SymBioSys, IAP BioMagnet,

Figures

Figure 1 - Experimental microarray designs used in this study

Circles represent samples or time points, and arrows represent a direct hybridization

between two samples. The arrows point from the time point labeled with Cy3 to the

time point labeled with Cy5. (a) Interwoven design (first dataset). Grey arrows were

removed to generate a single loop design (see (b)). (b) Loop design (second dataset).

Figure 2 - Examples of reconstructed profiles for two representative genes from the interwoven design

For the gene specific methods (based on log-ratios estimate) the ratios are expressed

relative to T1 and the ratio T1/T1 is set to zero. For the two-stage (ANOVA) methods,

the estimated VG effect (gene-variety) is plotted. (a) Estimated profiles for the

Ribosomal like gene under the interwoven design. (b) Reconstructed time profile for

the gene MGC85244 under the interwoven design.

Figure 3 - Spike-in profiles. Spikes were spotted at different concentrations on the arrays, resulting in a spike profile (which can be treated as a mimic of a time profile)

Reconstructed ratio profiles of spikes 7 and 8 are plotted. The concentration of 10 cpc

(copies per cell) was used as the reference point. Estimated log-ratios were sorted

from low to high concentrations. The dotted line corresponds to the expected log-ratio

for the known concentration.

Figure 4 - Spike-in profiles using an extreme concentration as baseline for profile reconstruction

Estimated log-ratios were sorted from low to high concentrations. The dotted line

corresponds to the log-ratios expected based on the known concentration. (a) For the

representative spikes (7 and 8) the baseline (reference) used to calculate log-ratios

was the concentration of 10,000 cpc. (b) Same as (a), but log-ratios were calculated

with the signals observed at a concentration of 0 cpc as baseline.

Tables

Table 1 - Pairwise correlation between ratios estimated by each pair of applied methods (column 1 and column 2) for a loop design

Each ratio corresponds to a time dependent change in expression as compared to the

first time point (T1). The last column corresponds to the mean correlation of the 4

estimates. Since the loop design is balanced with respect to the dyes, the results for

lmbr and lmbr_dye were the same (see ‘Methods’ section), which is why they are not

treated differently.

Method1 Method2 T3/T1 T4/T1 T5/T1 T6/T1 mean

lmbr/lmbr_dye limmaQual 0.9966 0.9998 0.9648 0.9909 0.9880

lmbr/lmbr_dye anovaFix 0.9899 0.9913 0.9721 0.9856 0.9848

lmbr/lmbr_dye anovaMix 0.9829 0.9751 0.9420 0.9549 0.9637

limmaQual anovaFix 0.9889 0.9918 0.9298 0.9775 0.9720

limmaQual anovaMix 0.9810 0.9758 0.8950 0.9467 0.9496

anovaFix anovaMix 0.9936 0.9847 0.9726 0.9694 0.9801

Table 2 - Mean similarity between profiles for both the interwoven and the loop design

Values in the table correspond to the similarity between any two methods, expressed

as the mean profile similarity of the genes. Since the loop design is balanced with

respect to the dyes, the results for lmbr and lmbr_dye were the same (see ‘Methods’

section), which is why they are not treated differently.

A) No filtering applied; similarity is assessed for all 2999 profile estimates. B) A

filtering threshold of 0.4 is applied to all profiles estimated by each of the methods; a

pairwise similarity comparison is made for all cognate profile pairs estimated by

the two methods compared for which at least one profile is above the filtering

threshold.

A          Interwoven design                        Loop design
           lmbr    lmbr_dye  limmaQual  anovaFix    lmbr/lmbr_dye  limmaQual  anovaFix
lmbr_dye   0.9477  -         -          -           1.0000         -          -
limmaQual  0.9940  0.9359    -          -           0.9844         -          -
anovaFix   0.9321  0.9572    0.9157     -           0.9514         0.9252     -
anovaMix   0.9138  0.9373    0.8989     0.9767      0.9186         0.8934     0.9611

B          Interwoven design                        Loop design
           lmbr    lmbr_dye  limmaQual  anovaFix    lmbr/lmbr_dye  limmaQual  anovaFix
lmbr_dye   0.9834  -         -          -           1.0000         -          -
limmaQual  0.9980  0.9799    -          -           0.9920         -          -
anovaFix   0.9800  0.9948    0.9755     -           0.9961         0.9851     -
anovaMix   0.9748  0.9905    0.9701     0.9957      0.9866         0.9728     0.9905

Table 3 - Assessing the effect of array failure on estimated ratios for the interwoven design

The different methods for which the influence of the failure was assessed are

represented in the columns. Each row shows the mean correlation between the

corresponding estimated ratios from the complete design and those obtained from a

defect design (where one array was removed compared to the complete design).

Mean: shows the overall mean correlation for a given method.

Array removed  lmbr  lmbr_dye  limmaQual  anovaFix  anovaMix  Conditions affected

1 0.9615 0.9812 0.9623 0.9824 0.9784 T2/T3

2 0.8905 0.9277 0.8768 0.9317 0.9431 T1/T3

3 0.9606 0.8790 0.9521 0.8951 0.8943 T3/T5

4 0.9080 0.9353 0.8821 0.9478 0.9539 T4/T1

5 0.9601 0.9632 0.9505 0.9569 0.9592 T6/T1

6 0.9773 0.9863 0.9742 0.9852 0.9847 T5/T2

7 0.9238 0.9585 0.9317 0.9549 0.9662 T5/T6

8 0.9615 0.9599 0.9615 0.9617 0.9690 T2/T4

9 0.9816 0.9859 0.9613 0.9836 0.9845 T4/T6

Mean 0.9472 0.9530 0.9392 0.9555 0.9592

Table 4 - Assessing the effect of array failure on estimated ratios for a loop design

The different methods for which the influence of the failure was assessed are

represented in the columns. Each row shows the mean correlation between the

corresponding estimated ratios from the complete design and those obtained from a

defect design (where one array was removed compared to the complete design).

Mean: shows the overall mean correlation for a given method. Lmbr_dye and

limmaQual were not evaluated as in this particular case they lose their main differing

characteristic compared to lmbr.

Array removed lmbr anovaFix anovaMix Conditions affected

1 0.5964 0.6461 0.5312 T1/T3

2 0.7161 0.7545 0.6559 T3/T5

3 0.6713 0.8883 0.7606 T5/T6

4 0.5697 0.7883 0.6637 T6/T4

5 0.4359 0.6042 0.5534 T4/T1

Mean 0.5979 0.7363 0.6330

Table 5 - Effect of additional normalization procedures on estimating gene profiles from respectively an interwoven design and a loop design

Similarities between profiles were assessed using the mean cosine similarity measure.

Rows indicate the different normalization procedures. Columns indicate the different

models used to reconstruct the profiles.

Interwoven design

Normalization methods  lmbr    lmbr_dye  limmaQual  anovaFix  anovaMix
none/quantile          0.9602  0.9576    0.9599     0.9605    0.9584
none/scale             0.9900  0.9881    0.9914     0.9867    0.9848
scale/quantile         0.9540  0.9531    0.9557     0.9547    0.9566

Loop design

Normalization methods  lmbr    lmbr_dye  limmaQual  anovaFix  anovaMix
none/quantile          0.9520  0.9520    0.9534     0.9601    0.9481
none/scale             0.9822  0.9822    0.9793     0.9835    0.9778
scale/quantile         0.9355  0.9355    0.9361     0.9433    0.9370

Table 6 - Concentration (copies per cell) of the control clones spiked

From the total of 14 arrays, 7 were hybridized with the respective spike mixes labeled

in Cy5 against the reference mix labeled in Cy3. The remaining 7 arrays were

hybridized with the spike mixes labeled in Cy3 against the reference mix labeled in

Cy5. Spike 11a was removed from analysis due to quality issues (Allemeersch et al.,

2005 [18]).

Spike No.  Spike Mix 1  Spike Mix 2  Spike Mix 3  Spike Mix 4  Spike Mix 5  Spike Mix 6  Spike Mix 7  Reference Mix
1, 2       10,000       0            0.1          1            10           100          1,000        100
3, 4       1,000        10,000       0            0.1          1            10           100          100
5, 6       100          1,000        10,000       0            0.1          1            10           100
7, 8       10           100          1,000        10,000       0            0.1          1            100
9, 10      1            10           100          1,000        10,000       0            0.1          100
11, 11a    0.1          1            10           100          1,000        10,000       0            100
12, 13     0            0.1          1            10           100          1,000        10,000       100

Additional files

Additional file 1 – Figure S1

Comparison of corresponding ratios estimated by two linear methods (lmbr and

anovaMix) using the loop design. The line indicates the identity between both

methods and most of the points are situated near this identity line.

Additional file 2 – Table S1

Pairwise correlation between ratios estimated by each pair of methods (columns 1 and

2) for the interwoven design. The ratios correspond to the change in expression

compared to the first time point. The last column corresponds to the mean correlation

of the 5 estimations.

Additional file 3 – Table S2

Mean similarity between profiles for both the interwoven and loop design using

different filtering thresholds. Values in the table correspond to the similarity between

any two methods, expressed as the mean profile similarity of the genes. Since the loop

design is balanced with respect to the dyes, the results for lmbr and lmbr_dye were the

same (see ‘Methods’ section), which is why they are not treated differently. A) No

filtering applied, similarity is assessed for all 2999 profile estimates, B) a filtering

threshold (SD) is used on all profiles estimated by each of the methods, a pairwise

similarity comparison is made for all cognate profile pairs estimated by each of the

two methods compared, for which at least one profile is above the filtering threshold

(SD >0.1, 0.2, 0.4 respectively).

Additional file 4 – Table S3

Assessing the effect of array failure in reconstructing profiles from an interwoven

design. Profile similarities were assessed using the cosine similarity. The different

methods for which the influence of the failure was assessed are represented in the

columns. Each row shows the mean cosine similarity between the corresponding

profiles estimated from the complete design and those obtained from a defect design

(where one array was removed compared to the complete design). Mean: shows the

overall mean similarity for a given method.

Additional file 5 – Table S4

Assessing the effect of array failure in reconstructing profiles from a loop design.

Profile similarities were assessed using the cosine similarity. The different methods

for which the influence of the failure was assessed are represented in the columns.

Each row shows the mean cosine similarity between the corresponding profiles

estimated from the complete design and those obtained from a defect design (where

one array was removed compared to the complete design). Mean: shows the overall

mean similarity for a given method.

[Figure 1: experimental designs, panels (a) and (b); nodes are the time points T1 to T6.]

[Figure 2: panels (a) and (b).]

[Figure 3]

[Figure 4]

Additional files provided with this submission:

Additional file 1: figure s1_rev.pdf, 46K
http://www.biomedcentral.com/imedia/7346383481527797/supp1.pdf
Additional file 2: table s1_rev.pdf, 104K
http://www.biomedcentral.com/imedia/1590033140152779/supp2.pdf
Additional file 3: table s2_rev.pdf, 32K
http://www.biomedcentral.com/imedia/9445176201527790/supp3.pdf
Additional file 4: table s3_rev.pdf, 105K
http://www.biomedcentral.com/imedia/1392800909152779/supp4.pdf
Additional file 5: table s4_rev.pdf, 88K
http://www.biomedcentral.com/imedia/1151875245152779/supp5.pdf

Chapter 5: From gene expression profiles towards data integration

Gene expression is no longer a gene-by-gene question, but an interpretation of profiles generated by large-scale techniques. This chapter presents the link between the acquisition of expression profiles, from either ESTs or DNA microarrays, and data integration. The use of ESTs has allowed the integration of data produced by several laboratories/research groups to compare different tissues or conditions of an organism. Although the integration of microarray data is not possible in a very direct way, recent studies have been conducted in this direction. In addition, data integration across multiple organisms has been carried out with both types of methodology (DNA microarrays and ESTs). However, expression analysis is not limited to these data sources and can be combined with a multiplicity of other sources, such as ChIP-on-chip data, protein interaction studies, motif searches, etc., for an even more ambitious goal: the inference of gene regulatory networks.

This chapter describes two methodologies for gene expression profiling and the state of the art of data integration. Microarrays and expressed sequence tags (ESTs) enable large-scale studies, each with its own advantages and limitations. The EST-based approach has allowed the integration of data produced by diverse laboratories/research groups to compare different tissues or conditions of the same organism. Although this integration is not straightforward for microarray data, recent studies have been oriented in this direction. In addition, data integration between related organisms has been carried out with both methodologies. The analysis of gene expression is not limited to these data sources, and it can be combined with multiple other sources such as ChIP-on-chip data, protein interactions, motifs, etc. for a more ambitious aim: the inference of gene regulatory networks.

5.1 Background

The advent of whole-genome sequencing and high-throughput experimental technologies transformed biological research. Gene expression studies have changed from a gene-by-gene basis to large-scale experiments, where thousands of genes are monitored at the same time. A huge challenge lies in interpreting these large-scale data sets and thereby deriving fundamental and applied biological information about whole systems. Gene expression at genome-wide scale has mainly been studied based on the transcriptome. Transcriptomic data provide information about both the presence and the relative abundance of RNA transcripts, thereby indicating the genes that are being actively transcribed within the cell. Although the eventual gene function depends on the protein translated from the mRNA, the direct measurement of protein levels remains non-trivial at a large scale due to detection limits and a more tedious analysis, while several methods to estimate mRNA expression levels have been developed.
Although various levels of post-transcriptional control, which might rival transcriptional control in importance, are not captured by these analyses, these methods provide crucial information regarding the expression state, or primary genomic readout, of the cell. Since the second half of the 1990s, a large number of genome-wide studies have been described that examine the dynamics of gene expression in many model systems and environments. Two major strategies have been used: the first relies on hybridization techniques (for instance, microarrays), and the second is based on sequencing, using techniques such as cDNA-AFLP (Bachem et al., 1996), serial analysis of gene expression (SAGE) (Velculescu et al., 1995) and expressed sequence tags (ESTs) (Adams et al., 1991). Both sequence-based and hybridization-based techniques have been applied to many model systems, as well as in applications such as the study of genes predominantly expressed in certain tissues or conditions, the classification of molecular subtypes, and the monitoring of the transcriptional response to pathogens. Although they are very useful, applying these techniques is costly and laborious. As a result, the number of conditions, time points or tissues under study varies from a small number to hundreds of samples, depending on the available resources of the research unit. Standard data analysis focuses on the experiments performed in each particular lab, and validation relies on low-throughput gene-by-gene techniques, such as RT-PCR or northern blot, with which only a small number of genes can be confirmed. However, when all these relatively small-scale experiments are taken together, a much bigger compendium of data can be obtained, covering a wide range of experimental conditions. The more publication policies require the mandatory deposition of high-throughput experiments in public databases upon publication, the richer the public resources become. For molecular biologists, having access to

this wealth of data and viewing their own small-scale analysis in the light of what is already available can be very informative. However, much of the public information remains unexploited as it is far from clear how these datasets can be integrated with the small-scale analysis from a particular laboratory. Integration of gene expression data can cover different axes: data integration across platforms, across species and across data sources. Datasets produced by other researchers on the same organism, even on the same pathway, can be very useful but might have been generated using different platforms (e.g. Affymetrix, cDNA microarrays). Data generated for other related species, in particular for related well-annotated model organisms, can not only help with annotating one's own data but also allow more complex comparative and evolutionary analyses to be performed. Finally, other large-scale data sources are available, such as whole-genome sequence data and protein-protein interaction data. These diverse data sources can be combined to provide a better understanding of biological processes. Here an overview is presented of two representative techniques for generating gene expression data, microarrays and ESTs. In addition, the potential for integrating these data or combining them with other data sources is discussed.

5.2 Gene expression profiles from transcriptomic experiments

Gene expression profiles from microarrays

Microarrays are the main technology for large-scale transcriptional gene expression profiling. In this technique, labelled cDNA targets representing the mRNA population of interest are hybridized with a large number of probes that have been immobilized on a substrate (DeRisi et al., 1996; Shalon et al., 1996). Several microarray platforms exist, which can be classified as one- or two-channel arrays depending on the methodology followed during the hybridization reaction: in one-channel arrays a single color label is used and only one sample can be measured at a time, while in two-channel techniques differential labelling is used, allowing two samples to be hybridized at the same time. Each platform uses different optimised sample preparation, labelling, hybridization and scanning protocols. The downstream data analysis, including image quantification, normalization and the determination of statistically and biologically significant differential gene expression, can also vary from analysis to analysis. In theory, the signal intensity detected for a probe is proportional to the mRNA abundance of the gene. The accuracy of microarrays can be assessed using spike-in experiments, where the true concentration of the spike-in probes is known in advance (van Bakel and Holstege, 2004). For instance, in Affymetrix data, the agreement between observed and actual fold changes is significant when the probe sets in the lowest quartile of signal intensity are filtered out. However, measurements are less reliable for genes expressed at low levels (Draghici et al., 2006). Despite the wide use of microarrays for gene expression profiling, this technique suffers from some limitations. The probes in a microarray are designed to represent unique gene sequences, but cross-hybridization between related gene sequences may occur (e.g. between genes belonging to a gene family or genes with common functional domains).
This limitation can be minimized, to a degree that depends on the completeness of the available sequence information, by using specific oligonucleotide probes instead of cDNA; these probes should be carefully designed to avoid cross-hybridization. Additionally, all array platforms require significant

amounts of mRNA, which are not always available, for the preparation of fluorescently labeled targets.

Gene expression profiles from ESTs

Expressed sequence tags (ESTs) are short segments (about 200-900 nucleotides) obtained by sequencing the 5’ and/or 3’ ends of randomly isolated gene transcripts that have been converted into cDNA (Adams et al., 1991). These cDNA clones are obtained from an mRNA library, usually representing transcripts of a specific tissue or the whole organism. The information extracted from ESTs has been used primarily for genome annotation, allowing the discovery of new genes, alternatively spliced forms of the same gene, splice sites, polymorphisms, etc. ESTs are especially useful in organisms for which the complete genome sequence is not available. As an example, there are 8 species with more than 1 million sequenced tags available in dbEST (release February 2007). This amount of available ESTs can also be used to quantify gene expression, since ESTs from a specific library represent a random sample of mRNA. Consequently, ESTs can be used for the detection of differentially expressed genes, and for gene expression profiling. Usage of ESTs for quantitative expression profiling assumes that the cDNA clone frequency is, in principle, proportional to the expression level of its corresponding transcript in the sampled tissue, or that the number of ESTs from a gene represents the mRNA abundance in the tissue. As a result, normalized libraries, where frequencies of clones representing abundant and rare transcripts are normalized with respect to one another, and libraries where ESTs have been selected before submission are not suitable for quantitative gene profiling (Ewing et al., 1999). Also, EST sequences usually do not cover the complete gene but correspond to a fraction of a gene: the same gene might thus be represented by many different EST sequences. Therefore quantitative EST profiling requires that genes are correctly associated with all their representative ESTs.
This is achieved using existing gene indices such as UniGene at NCBI or the Gene Indices from TIGR, or by assembling ESTs to cluster together the fragments corresponding to the same gene. Once quantitative EST data are available, one can search for differentially expressed genes using the appropriate statistical tests (Fisher’s test (Aouacheria et al., 2006), the A-C test (Audic and Claverie, 1997) and the Stekel test (Stekel et al., 2000) are customarily used). Gene expression profiles, i.e. changes in the expression of a gene over different conditions, developmental stages and tissues, can on the other hand be obtained by simple EST counting in the selected libraries (Ewing et al., 1999). Genes are subsequently grouped according to their expression pattern using existing clustering techniques (Ewing et al., 1999). The reliability of the estimated quantitative information depends on the number of sequenced ESTs. When the number of ESTs is high, there is more evidence about the expressed genes (Ewing et al., 1999). To detect differentially expressed genes, statistical tests need a large sample size to reach statistical significance for small variations (Audic and Claverie, 1997). In the case of quantitative expression profiling, two approaches have been used to guarantee good-quality profiles: choosing a minimum number of ESTs per gene (for example 5 or 6 ESTs) (Ronning et al., 2003; Ewing et al., 1999), or choosing libraries with a minimum sequencing depth (e.g. more than 10,000 ESTs per library) (Kawaura et al., 2005). Furthermore, absolute EST counts can vary widely between libraries because of differences in library size. A

116

Page 117: Exploitation de données de séquences et de puces à ADN

normalization step may be necessary to convert raw EST counts into values relative to the library size in order to compensate for this difference (Kawaura et al., 2005).

Expression profiling based on EST counts has contributed substantially to the analysis of newly sequenced libraries. For instance, large numbers of ESTs from diverse developmental stages of potato (Ronning et al., 2003) or from tissues covering the wheat life cycle (Ogihara et al., 2003) have been analysed to identify broad patterns of gene expression and to group genes with similar behavior, which may help to infer the function of anonymous genes.

Public EST databases are also a valid data source for quantitative expression profiling. Ewing et al. (1999) presented one of the first attempts to combine public EST data originating from diverse studies. In order to identify co-expressed genes in rice, they compiled a compendium covering 10 libraries corresponding to different tissues, each with about a thousand ESTs. Despite the small number of sequences per library, clustering genes with a similar profile over the different libraries allowed them to group together genes with similar functions. More recent studies, such as that of Kawaura et al. (2005), who focused on two gene families of wheat, have covered larger libraries, usually containing more than 10,000 ESTs.

Several tools exist for EST-based profiling, although they are still few compared with the tools available for microarray-based profiling. For example, DigiNorthern (Wang and Liang, 2003), the GBA server (Wu et al., 2005) and TissueInfo (Skrabanek and Campagne, 2001) allow the determination of gene expression profiles for human and mouse, both organisms with more than 4 million EST sequences in dbEST (release March 2007). DigiNorthern and TissueInfo display the expression profile of a user-supplied query gene based on EST counts, while the GBA server can find genes that have similar expression patterns, based on Fisher’s exact test, and can automatically link gene annotation from multiple databases. GO-Diff (Chen et al., 2006) is the first software integrating EST profiles with GO terms to infer functional differences between two EST-derived transcriptomes.

The overall effort and cost of EST sequencing clearly preclude quantitative expression profiling as a general tool for expression analysis; nevertheless, as more EST collections are created, this approach becomes a valuable means of extracting expression information from available EST collections (Fei et al., 2004).
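
The two quantitative steps described in this section, scaling raw EST counts by library size and testing one gene for differential expression between two libraries, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited study: the function names, the scaling to 10,000 tags and the example counts are hypothetical, and the two-sided p-value sums all tables that are no more probable than the observed one, which is one common convention for Fisher’s exact test.

```python
from math import comb


def ests_per_10k(est_count: int, library_size: int) -> float:
    """Scale a raw EST count to ESTs per 10,000 sequenced tags."""
    return 10_000 * est_count / library_size


def fisher_exact_two_sided(a: int, n1: int, b: int, n2: int) -> float:
    """Two-sided Fisher's exact test for one gene sampled in two libraries.

    a, b   -- ESTs assigned to the gene in library 1 and library 2
    n1, n2 -- total number of ESTs sequenced in each library
    """
    k = a + b  # ESTs of the gene over both libraries

    def prob(x: int) -> float:
        # Hypergeometric probability that x of the k ESTs come from library 1
        return comb(n1, x) * comb(n2, k - x) / comb(n1 + n2, k)

    p_observed = prob(a)
    lowest, highest = max(0, k - n2), min(k, n1)
    # Sum over all tables that are no more probable than the observed one
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts: 12 ESTs out of 24,000 versus 2 ESTs out of 20,000
print(ests_per_10k(12, 24_000), ests_per_10k(2, 20_000))
print(fisher_exact_two_sided(12, 24_000, 2, 20_000))
```

With a single count per gene and library, as is the case for ESTs, the test operates directly on the 2x2 table of gene versus non-gene tags; no replicate variance is involved.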

ESTs versus Microarrays

The nature of the gene expression data that can be acquired from microarrays and from ESTs differs. Microarray experiments are repeated to obtain replicated measurements of the same sample, whereas the number of ESTs per gene in a given condition is a single value. For this reason, the statistical tests used to detect differentially expressed genes differ between the two data types.

Irrespective of the technique used, both EST-based and microarray-based profiling will most likely fail to detect weakly expressed genes. For EST-based profiling, the sensitivity of the expression analysis depends on the sequencing depth of the libraries. Expression profiles can only be derived if sufficient numbers of ESTs are generated (Ewing et al., 1999), since the probability of capturing and sequencing lowly expressed genes increases with the sequencing depth of the library. For microarray technology, genes expressed at low levels produce less reliable measurements. The sensitivity threshold of microarray measurements defines the concentration range in which accurate measurements can be made; microarrays fail to produce meaningful measurements below that threshold. The detection limit of current
microarray technology seems to be between one and ten copies of mRNA per cell (Allemeersch et al., 2005). Although this sensitivity is impressive, it might still be insufficient to detect relevant changes in low-abundance genes, such as transcription factors (Draghici et al., 2006).

Another problem common to EST-based and microarray-based profiling is specificity. Microarray probes can be too short or ill-designed and will therefore lead to cross-hybridization. This frequent situation, in which different cDNAs hybridize to the same probe and give rise to “gene specificity” problems, is most pronounced for orthologs and paralogs belonging to the same protein family. The EST-gene association can also be ambiguous. ESTs derived from the 5’ and 3’ extreme ends of a long transcript will form separate contigs if they do not overlap and, as a result, will erroneously be assigned to different genes. In theory, EST sampling should be more precise than standard RNA hybridization methods in pinpointing both qualitative and quantitative differences in the expression of individual loci within a gene family, provided that each tissue (library) is sampled deeply enough to yield a high number of ESTs (Fernandes et al., 2002). Accurate EST sequencing theoretically allows the expression of closely related genes to be distinguished on the basis of a few nucleotide polymorphisms. However, much depends on which gene index is used to associate ESTs with their corresponding genes. For instance, UniGene tends to over-cluster and may place ESTs from more than one gene in the same cluster.

In general, whereas expression data measured with microarrays arise from a single large experiment, the ESTs used in expression profiling come from public databases, where ESTs are generated by different laboratories. This is one of the advantages of EST-based profiling, and many studies of this type already exist.
In contrast to the difficulty of comparing microarray data across array platforms, unbiased EST libraries can easily be combined and compared. Although such integration is not straightforward for microarray data, recent studies have been oriented in this direction.

5.3 Data Integration

Data integration within the same organism

Both profiling techniques, those based on ESTs and those based on hybridization, are nowadays commonly used by many research groups. Viewing one’s own data in the light of other publicly available data on the same organism not only helps to validate the results obtained, since the number of replicates per experiment is usually still limited, but can also extend the scope of a study to conditions that could not be covered by the available budget. Integrating as many data from as many conditions as possible will eventually lead to an understanding of an organism at a global level.

For EST-based profiling, data integration is fairly straightforward within the same organism. In fact, the previously described tools for ESTs (GBA server, DigiNorthern, TissueInfo) allow the combination of data from different libraries, tissues, developmental conditions or disease stages, which originate from diverse studies and laboratories and can easily be obtained from public repositories such as dbEST at NCBI.
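
Combining ESTs from several public libraries amounts to building a gene-by-tissue count table. The sketch below is only an illustration of that bookkeeping: the function name, the dictionary-based inputs and the toy identifiers are hypothetical and do not correspond to the data model of dbEST or of any of the tools cited above. It skips normalized libraries, which are unsuitable for quantitative profiling, as well as ESTs without a gene assignment and libraries without a curated tissue label.

```python
from collections import Counter, defaultdict


def pool_est_counts(est_library, est_gene, library_tissue, normalized=frozenset()):
    """Count ESTs per gene and tissue, pooled over all usable libraries.

    est_library    -- {EST id: library id}
    est_gene       -- {EST id: gene id}, e.g. taken from a gene index
    library_tissue -- {library id: curated tissue or condition label}
    normalized     -- ids of normalized libraries, which are skipped
    """
    table = defaultdict(Counter)
    for est, library in est_library.items():
        if library in normalized:
            continue  # clone frequencies no longer reflect abundance
        gene = est_gene.get(est)
        tissue = library_tissue.get(library)
        if gene is None or tissue is None:
            continue  # unassigned EST or library without a curated label
        table[gene][tissue] += 1
    return {gene: dict(counts) for gene, counts in table.items()}


# Toy input with hypothetical identifiers
est_library = {"est1": "libA", "est2": "libA", "est3": "libB",
               "est4": "libC", "est5": "libB"}
est_gene = {"est1": "geneX", "est2": "geneX", "est3": "geneX", "est4": "geneY"}
library_tissue = {"libA": "brain", "libB": "liver", "libC": "brain"}

counts = pool_est_counts(est_library, est_gene, library_tissue, normalized={"libC"})
print(counts)
```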

In order to effectively link EST data from diverse studies and laboratories, two pieces of information are needed. Firstly, the tissue or condition sampled by the library to which an EST belongs must be clearly specified. This requires manual curation of library labels (Pao et al., 2006) and elimination of the normalized libraries (see above). Secondly, each EST should be associated with its corresponding gene. Existing databases such as UniGene (NCBI) and the Gene Indices (TIGR, now DFCI) are often used to obtain the link between ESTs and genes (Pao et al., 2006), but many research groups also make the effort to build an in-house assembly of the selected ESTs (Kawaura et al., 2005). EST profiling, i.e. quantifying expression across the diverse conditions, is then based on EST counting (Ewing et al., 1999; Fei et al., 2004; Kawaura et al., 2005).

A different situation arises for microarray-based profiling. The existence of several technologies for measuring gene expression makes cross-technology comparison a more complex issue. Different platforms require different sample preparation protocols and different data normalization algorithms, making it difficult to compare data sets. To evaluate the impact and the feasibility of cross-platform integration, the reproducibility within and across platforms needs to be assessed. Some recent studies have compared different microarray platforms such as Affymetrix, cDNA and oligo arrays (Bammler et al., 2005; Irizarry et al., 2005; Larkin et al., 2005; Tan et al., 2006). Within platforms, reproducibility is high when the data come from the same laboratory (Bammler et al., 2005; Tan et al., 2006). Both good and bad performance have been observed with the same platform (two-channel oligo arrays), showing the importance of considering lab-to-lab variability (Irizarry et al., 2005).
Reproducibility across laboratories was generally poor, but it increased markedly when standardized protocols were implemented for RNA labeling, hybridization, microarray processing, data acquisition and data normalization (Bammler et al., 2005). Reproducibility across platforms, on the other hand, has been reported as poor by some studies (Bammler et al., 2005; Tan et al., 2006), while others reported consistent results between Affymetrix and cDNA arrays (Larkin et al., 2005). Discrepancies between platforms may be caused by lowly expressed genes, for which the weak signal is less reliable (Draghici et al., 2006). Another possibility is that genes exhibiting differences across platforms represent splice variants (Larkin et al., 2005).

As with EST profiling, combining microarray data originating from different laboratories and platforms requires establishing a unique link between each gene and the probes on the different arrays; i.e. for each gene, corresponding probes should be present on the different platforms, whether they are cDNA probes, oligonucleotides or Affymetrix probe sets. Because no single gene nomenclature or sequence identifier is used by all platforms, a database linking the different identifiers belonging to each gene is required (Svensson et al., 2003). The probe-gene association depends on the current knowledge of genes and can thus change over time. For this reason, it is more useful to update the gene-probe or gene-EST assignment dynamically. For instance, CleanEx (Praz et al., 2004) is an integrative tool based on “expression targets”. A target consists of a sequence tag, a clone name or a set of oligonucleotide sequences from an Affymetrix probe set. These targets are associated with their corresponding genes using sequence matching algorithms and other frequently updated information resources, such as UniGene. Despite such tools, the probe-gene or EST-gene association is not always successful.
Indeed, as long as the reference gene catalogue of a model organism is incomplete, some targets will not match any gene, while other targets may hit multiple genes. To overcome this limitation, CleanEx adds a quality-control flag to indicate that the mapping is ambiguous (Praz et al., 2004).
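
The three outcomes described above (a target matching exactly one gene, several genes, or none) can be made explicit with a small status flag. The following sketch is purely illustrative: the status names, the function and the input format are assumptions made here and are not the flags or the interface actually used by CleanEx.

```python
from enum import Enum


class MappingStatus(Enum):
    UNIQUE = "unique"        # the target matches exactly one gene
    AMBIGUOUS = "ambiguous"  # the target matches several genes
    UNMATCHED = "unmatched"  # the target matches no known gene


def classify_targets(target_hits):
    """Attach a quality flag to each target-to-gene mapping.

    target_hits -- {target id: iterable of matching gene ids}, e.g. the
                   result of a sequence-matching step (hypothetical input)
    """
    flagged = {}
    for target, genes in target_hits.items():
        unique_genes = sorted(set(genes))  # drop duplicate hits
        if not unique_genes:
            status = MappingStatus.UNMATCHED
        elif len(unique_genes) == 1:
            status = MappingStatus.UNIQUE
        else:
            status = MappingStatus.AMBIGUOUS
        flagged[target] = (status, unique_genes)
    return flagged


# Toy input with hypothetical probe and gene identifiers
hits = {"probe1": ["geneA"], "probe2": ["geneA", "geneB", "geneA"], "probe3": []}
flags = classify_targets(hits)
print(flags)
```

Keeping the flag next to the mapping, rather than silently dropping ambiguous targets, lets a later analysis decide whether to exclude them; because the gene catalogue changes over time, the classification would need to be recomputed whenever the underlying matches are updated.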

Despite the mentioned limitations for cross-platform integration, several tools for this type of data integration exist (Paananen et al., 2006; Praz et al., 2004; Tsai et al., 2001).

Data integration across species

Data integration across platforms is not limited to the analysis of a single organism; it also offers the potential to integrate data across species. Whether based on ESTs or on hybridizations, the rapid accumulation of large-scale transcriptional data from multiple species provides a great opportunity to study biological systems. Cross-species analysis allows the study of conserved gene functions and regulation (Pao et al., 2006) and facilitates the functional comparison of gene family members, and thus the assessment of functionally conserved homologs across species (Fei et al., 2004). One important application is the study of human development and diseases through the analysis of gene expression in model organisms, particularly mouse and rat (Grigoryev et al., 2004; Tsai et al., 2001). Finally, cross-species comparisons open the possibility of studying the evolution of biological systems (Rifkin et al., 2003), of drawing evolutionary inferences concerning specific biological processes and of studying the global properties of expression networks (Bergmann et al., 2004).

In the previous section we described how the EST-gene and probe-gene associations are needed to integrate data from diverse studies and platforms. For cross-species data integration, the association between genes of different species poses an additional challenge. Studies based on the EST approach have used well-known gene indices, such as UniGene and HomoloGene, to associate genes between species (Pao et al., 2006). For microarray-based profiling, several tools, such as CROPPER (Paananen et al., 2006) and RESOURCERER (Tsai et al., 2001), are available to handle this association and to facilitate data integration in both cross-platform and cross-species studies. These tools are also based on well-known gene indices that serve to identify the genes associated with probes or ESTs and their corresponding orthologs in multiple species.
These gene indices include the Ensembl database (CROPPER) and the TIGR Gene Indices with EGO (TIGR Eukaryotic Gene Orthologs) (RESOURCERER). The two tools differ not only in the gene index used, but also in the linked information. RESOURCERER works with a restricted number of preprocessed gene expression datasets, while CROPPER allows users to process their own datasets. The advantage of a restricted dataset is that manually curated preprocessing can be applied, but it also limits the scope of the resource.

Ortholog association across species is not a simple task. In moderately related species, many orthologs are no longer in a simple one-to-one relation, and when alternative splicing and EST assembly errors are taken into account, a common unique-transcript set between two species becomes very difficult to establish. Another option is to use a higher-level structure such as the Gene Ontology instead of the direct gene-EST association. GO-Diff (Chen et al., 2006) is a tool for EST-based profiling that uses the GO structure to organize transcripts into functional groups and perform meaningful comparisons. However, GO-Diff lacks the potential of a gene-by-gene association for studying specific gene families or the evolution of biological processes.

Profile comparisons across species have been carried out using both EST and microarray data. An example based on EST data is presented by Pao et al. (2006), who compared the profiles of human and mouse orthologous genes. Their comparison suggests that tissue-specific orthologs tend to have more similar expression patterns than those lacking significant tissue specificity. They also observed orthologs with significantly different profiles. Many factors, such as heterogeneity of the tissue samples used to construct EST libraries and
insufficient ESTs for these genes, could contribute to these significant disparities (Pao et al., 2006). Inaccurate ortholog pairing is also a potential source of error. However, gene expression profiles can help to find functionally conserved homologs beyond sequence similarity. Instead of using a direct ortholog assignment, Fei et al. (2004) compared tomato gene profiles with the profiles of their top six gene hits in Arabidopsis, based on sequence similarity. They observed some cases in which sequence similarity alone could not identify the potential functional homolog, while the gene expression profile pointed to better candidates. Thus, combining sequence and digital expression data can assist in the identification of functionally conserved homologs across species.

Data integration using microarray datasets has been applied in other contexts. Studies of specific biological mechanisms have been performed using closely related species as well as evolutionarily more distant ones. Six members of the Drosophila subgroup were used by Rifkin et al. (2003) to study the evolution of gene expression during the start of metamorphosis. On the other hand, four distant species (rat, mouse, dog and human) were used by Grigoryev et al. (2004) to study ventilator-associated lung injury. They selected orthologous genes that changed in a statistically significant way during the studied process and exhibited similar patterns of expression across all species. Linking these orthologous genes increased the statistical power of the gene expression analysis and made it possible to identify candidate genes that would otherwise have remained unnoticed.

Beyond the study of specific biological mechanisms, Bergmann et al. (2004) performed a genome-wide comparative study of six evolutionarily distant organisms: S. cerevisiae, C. elegans, E. coli, A. thaliana, D. melanogaster and H. sapiens. All these model organisms are well annotated and large datasets of expression profiles are available.
They showed that functionally related sets of genes are frequently coexpressed in multiple organisms. Based on these results, they compared the global topological properties of the transcription networks derived from the expression data, using a graph-theoretical approach. This analysis revealed that, despite differences in the regulation of individual gene groups, the expression data of all organisms share the same large-scale properties.

Some limitations of cross-species comparisons nevertheless remain. Current microarray data repositories are unbalanced in terms of the species represented. For example, data from human and yeast are much more abundant than those from fly and E. coli, and the experimental conditions also vary between species. Some drawbacks, such as the incomplete representation of genes on different array platforms and tissue-specific gene expression, can be overcome by careful selection of array platforms and experimental models, respectively, as well as by further improvements or refinements of the platforms themselves (Grigoryev et al., 2004).
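
The comparison of expression profiles between orthologs discussed in this section can be illustrated with a simple similarity score. The sketch below uses the Pearson correlation over a shared list of tissues; this measure is an assumption chosen here for illustration and is not necessarily the one used in the cited studies, and all function names and identifiers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles (None if one is flat)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # a constant profile has no defined correlation
    return sxy / sqrt(sxx * syy)


def ortholog_correlations(profiles_a, profiles_b, orthologs, tissues):
    """Correlate the profiles of ortholog pairs over a shared tissue list.

    profiles_a, profiles_b -- {gene id: {tissue: expression value}} per species
    orthologs              -- iterable of (gene in species A, gene in species B)
    tissues                -- tissues sampled in both species, in a fixed order
    """
    result = {}
    for gene_a, gene_b in orthologs:
        if gene_a in profiles_a and gene_b in profiles_b:
            x = [profiles_a[gene_a].get(t, 0.0) for t in tissues]
            y = [profiles_b[gene_b].get(t, 0.0) for t in tissues]
            result[(gene_a, gene_b)] = pearson(x, y)
    return result
```

Pairs with a low or negative score would then need to be checked against the error sources listed above (tissue heterogeneity, too few ESTs, inaccurate ortholog pairing) before being interpreted as a biological difference.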

Integration across different data sources

Gene expression datasets are the most commonly available functional genomic data, but in recent years other types of high-throughput functional genomic data have become available. These additional genome-scale, or ‘omics’, data sources include genomics (DNA and protein sequences), proteomics (cellular levels of proteins), metabolomics (the set of metabolites of the cell), localizomics (subcellular location of proteins), protein-DNA interactomics (interactions between transcription factors and their target promoters), protein-protein interactomics, fluxomics (flux of metabolites through enzymatic reactions) and
phenomics (determination of cellular fitness or viability in response to genetic and/or environmental perturbations). While gene expression data are an excellent tool, these data alone often lack the degree of specificity needed to draw accurate biological conclusions. Data integration has been shown to increase the accuracy of gene function prediction compared with a single high-throughput method (Troyanskaya, 2005). This section first describes the integration of different types of transcriptomic data, continues with the integration of genomic and transcriptomic datasets, and finally gives an overview of the integration of other omic data with gene expression profiling.

Correlation between ESTs and microarray data

Although both are used to measure gene expression and theoretically describe the same process of transcriptional regulation, the use of microarrays and of ESTs for expression profiling has mainly been developed independently. The question remains to what extent the two techniques agree in describing similar transcriptional processes. Comparative studies focusing on the detection of differentially expressed genes among tissues showed clear differences between the results obtained by EST and by microarray profiling (Fernandes et al., 2002; Pao et al., 2006). However, when the p-value threshold used to detect differentially expressed genes from ESTs was made more stringent, the correlation between EST-based and microarray-based profiling results became more evident, although the degree to which this occurred varied with tissue type. In particular, for human, the correlation between microarray-based and EST-based profiling results was much higher for some tissues, such as brain, than for others, such as pancreas or ovary (Huminiecki et al., 2003; Pao et al., 2006). A similarly low agreement between experimental techniques describing similar cellular phenomena was observed when comparing protein profiling techniques.
Inherent properties of the different techniques, such as the incomplete EST sampling of low-abundance transcripts and cross-hybridization on microarrays, may cause this difference. Huminiecki et al. (2003) observed that EST-based and microarray-based profiles correlate relatively well when large numbers of ESTs are available or when the cellular composition of the tissue is simple.

Sequence and expression data / Genomic and Transcriptomic

In cross-platform and cross-species studies, the integration of sequence and transcriptional data is used implicitly. Indeed, sequence similarity is used to associate genes across platforms and putative orthologous genes across species (CROPPER and RESOURCERER). Beyond the association of genes across species, combining sequence and expression data has been shown to improve functional gene annotation (Bergmann et al., 2004; Fei et al., 2004; Zhou and Gibson, 2004). Functional annotation based on sequence similarity alone is limited, because an ORF can have several close homologues, some of which may have evolved different functions. Conversely, the sequence of an ORF may have diverged beyond recognition although the gene has maintained its function. Annotation based on expression alone is also limited: owing to biological interference and noise in the expression data, an inferred coexpression could be accidental and may not necessarily reflect similar function. Combining expression and sequence data may help to overcome these limitations (Bergmann et al., 2004).

In addition to functional annotation, the combined use of sequence and gene expression data has been applied to the identification of regulatory elements. Identifying a promoter and the regulatory elements that it contains is one of the major challenges in bioinformatics.

Because co-expressed genes tend to behave similarly, they are expected to be co-regulated. Under the simplifying assumption that this co-regulation occurs at the transcriptional level, co-expressed genes should contain similar cis-regulatory elements in their promoter regions. Several tools exist to search de novo for regulatory motifs in the upstream regions of co-expressed genes (usually determined from expression data), such as MEME (Bailey and Elkan, 1995), Motif Sampler (Thijs et al., 2001) and AlignACE (Roth et al., 1998).

Combining diverse omic data sources

Beyond the identification of promoter regions, many studies aim to identify gene regulatory networks. A first advantage of integrating different data sources is that it helps to overcome the limitations of each single data source, such as the variability, or ‘noise’, of microarray gene expression data and the general difficulty of identifying the often degenerate regulatory-motif signatures in genomic sequences. Secondly, integrating more data sources gives a more global view of the organism. For example, sequence-based motif discovery methods cannot identify which transcription factor binds the motifs. By integrating sequence-based motif data with expression or ChIP-chip data, it becomes possible to identify complete regulatory modules (the set of coexpressed genes, the regulator responsible for the overexpression and the sequence motif responsible for the binding of the regulator). Protein-DNA interaction data, obtained using chromatin immunoprecipitation experiments (ChIP-chip), can connect specific transcription factors to a large number of binding sites. Wang et al. (2005) integrated sequence and transcriptional data to find regulatory motifs, subsequently allowing for the inclusion of either ChIP-chip or transcription-factor-perturbation experiments to delineate the regulatory network.
Bar-Joseph et al. (2003) integrated protein-DNA interaction data with transcriptomic data. They described an algorithm for discovering gene modules, where a gene module is defined as a set of coexpressed genes to which the same set of transcription factors binds. ReMoDiscovery (Lemmens et al., 2006) is a tool that integrates motif data in addition to protein-DNA interaction and transcriptomic data. Functional information from expression data combined with DNA-binding data makes it possible to link genes explicitly to the factors that regulate them. Other interactomic data, such as protein-protein interaction data, have been used by several groups in addition to gene expression data (Tanay et al., 2004). Localization data were used by Hartemink et al. (2002) to reduce noise in regulatory network models. More detailed reviews of methods for omic data integration and their applications can be found in Troyanskaya (2005), Joyce and Palsson (2006) and De Keersmaecker et al. (2006).
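
The definition of a gene module given above (a set of coexpressed genes to which the same set of transcription factors binds) can be turned into a naive selection procedure. The sketch below is not the algorithm of Bar-Joseph et al. (2003) nor that of ReMoDiscovery; it is a deliberately simplified illustration in which genes are first grouped by identical sets of bound factors and a group is kept only if all its gene pairs are correlated above a threshold. The function names, input formats and the thresholds `min_corr` and `min_genes` are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if one is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def candidate_modules(binding, expression, min_corr=0.8, min_genes=2):
    """Group genes bound by the same transcription factors and coexpressed.

    binding    -- {gene id: set of transcription factors bound}, e.g. from ChIP-chip
    expression -- {gene id: list of expression values over the same conditions}
    """
    groups = defaultdict(list)
    for gene, factors in binding.items():
        if factors and gene in expression:
            groups[frozenset(factors)].append(gene)

    modules = {}
    for factors, genes in groups.items():
        if len(genes) < min_genes:
            continue
        # Keep the group only if every pair of genes is sufficiently correlated
        if all(pearson(expression[g], expression[h]) >= min_corr
               for g, h in combinations(genes, 2)):
            modules[factors] = sorted(genes)
    return modules
```

Requiring identical factor sets and all-pairs correlation is far stricter than what noisy binding and expression data usually allow; published methods relax these criteria, which is precisely where the integration of several data sources helps.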

Conclusion and perspectives

Several measurement techniques exist to characterize the transcriptome. In this study we chose two techniques that are frequently used in transcriptomic studies: ESTs, as a representative of sequencing-based approaches, and two-color DNA microarrays, as a hybridization-based technique, both applied to the transcriptome of Xenopus tropicalis during its metamorphosis.

These two approaches, ESTs and DNA microarrays, provide complementary information. On the one hand, ESTs allowed a rather exploratory study of the transcriptome and of gene structure. We exploited the potential offered by ESTs, which is sometimes overlooked despite the number of studies currently being carried out. For example, the availability of ESTs from diverse libraries made it possible to observe the expression of transcripts. In addition, the identification of non-coding RNAs is also of scientific interest. DNA microarrays, on the other hand, make it possible to generate data on several conditions rapidly, but since they are costly, we optimized the experimental strategy and adopted the corresponding analysis method in order to reduce the number of experiments without losing results.

This thesis presents a complete approach that illustrates the analysis starting from raw data. The analysis steps include assessing data quality and detecting and correcting biases in order to obtain the signals of biological interest. Each technique requires a different treatment because the raw data differ: chromatograms for ESTs and images for DNA microarrays. For each technique and each analysis step, a range of tools is available, and merely choosing one tool per step requires a significant effort. Approaching the analysis in a multidisciplinary way makes it possible to meet the needs of biologists and to select the appropriate tools.
Xenopus tropicalis made it possible to demonstrate the value and the feasibility of the approach. This organism, a model for vertebrate development, presents all the difficulties of working with a complex eukaryotic organism. On the biological side, the information obtained in this thesis contributes to the knowledge of this organism during a stage as important as metamorphosis. On the methodological side, this thesis is representative of a general approach and can serve as a guide for other work aiming to study the transcriptome.

Beyond the analyses presented here, the study of the transcriptome is a very fruitful field for understanding biological systems, such as the regulation of transcription and the interactions between biological macromolecules. Transcriptomic studies are not limited to the datasets produced individually by research groups. Data integration means knowing how to take advantage of existing data, which may have been produced by other laboratories or platforms, on diverse organisms or with other large-scale techniques. Besides the potential offered by combining data from different sources, cross-organism studies can take advantage of model organisms, for which knowledge is more complete and precise. Moreover, they open the way to solid evolutionary studies as well as to the study of the global properties of gene regulatory networks.

Given the growing number of sequencing and DNA microarray projects and the importance of data integration, transcriptome analysis requires substantial engineering efforts. Large-scale projects, such as the creation of tools that give easy access to the data so that everyone can benefit from the results, require a multidisciplinary vision that allows the data to be exploited while also answering the biological questions. The experience gained during this thesis and presented in this document will be useful for subsequent analyses, such as the inference of regulatory networks. Knowing the origin, the potential and the limitations of the data can help to improve the analysis tools that exist today.

References

Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 1991, 252(5013):1651-6.
Allemeersch J, Durinck S, Vanderhaeghen R, Alard P, Maes R, Seeuws K, Bogaert T, Coddens K, Deschouwer K, Van Hummelen P et al. Benchmarking the CATMA microarray. A novel tool for Arabidopsis transcriptome analysis. Plant Physiol 2005, 137(2):588-601.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990, 215:403-410.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
Aouacheria A, Navratil V, Barthelaix A, Mouchiroud D, Gautier C. Bioinformatic screening of human ESTs for differentially expressed genes in normal and tumor tissues. BMC Genomics 2006, 7:94.
Audic S, Claverie JM. The significance of digital gene expression profiles. Genome Res 1997, 7(10):986-95.
Bachem CW, van der Hoeven RS, de Bruijn SM, Vreugdenhil D, Zabeau M, Visser RG. Visualization of differential gene expression using a novel method of RNA fingerprinting based on AFLP: analysis of gene expression during potato tuber development. Plant J 1996, 9(5):745-53.
Bailey TL, Elkan C. The value of prior knowledge in discovering motifs with MEME. Proc Third Intl Conf Intell Sys Mol Biol 1995, 21-29.
Bammler T, Beyer RP, Bhattacharya S, Boorman GA, Boyles A, Bradford BU, Bumgarner RE, Bushel PR, Chaturvedi K, Choi D, Cunningham ML, Deng S, Dressman HK, Fannin RD, et al. Standardizing global gene expression analysis between laboratories and across platforms. Nature Methods 2005, 2(5):351-6.
Bar-Joseph Z, Gerber GK, Lee TI, Rinaldi NJ, Yoo JY, Robert F, Gordon DB, Fraenkel E, Jaakkola TS, Young RA, Gifford DK. Computational discovery of gene modules and regulatory networks. Nature Biotechnol 2003, 21(11):1337-42.
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES. ARACHNE: a whole-genome shotgun assembler. Genome Res 2002, 12(1):177-89.
Bergmann S, Ihmels J, Barkai N. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol 2004, 2(1):E9.

Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31(1):365-70.
Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185-193.
Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, Luo S, McCurdy S, Foy M, Ewan M, Roth R, George D, Eletr S, Albrecht G, Vermaas E, Williams SR et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnol 2000, 18(6):630-4.
Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res 2005, 33(Database issue):D212-5.
Chen Z, Wang W, Ling XB, Liu JJ, Chen L. GO-Diff: mining functional differentiation between EST-based transcriptomes. BMC Bioinformatics 2006, 7:72.
Chou HH, Holmes MH. DNA sequence quality trimming and vector removal. Bioinformatics 2001, 17(12):1093-104.
Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003, 4(4):210.
De Keersmaecker SC, Thijs IM, Vanderleyden J, Marchal K. Integration of omics data: how well does it work for bacteria? Mol Microbiol 2006, 62(5):1239-50.
DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet 1996, 14(4):457-60.
Draghici S, Khatri P, Eklund AC, Szallasi Z. Reliability and reproducibility issues in DNA microarray measurements. Trends Genet 2006, 22(2):101-9.
Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998, 8(3):186-94.
Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998, 8(3):175-85.
Ewing RM, Ben Kahla A, Poirot O, Lopez F, Audic S, Claverie JM. Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res 1999, 9(10):950-9.
Fei Z, Tang X, Alba RM, White JA, Ronning CM, Martin GB, Tanksley SD, Giovannoni JJ. Comprehensive EST analysis of tomato and comparative genomics of fruit ripening. Plant J 2004, 40(1):47-59.

Fernandes J, Brendel V, Gai X, Lal S, Chandler VL, Elumalai RP, Galbraith DW, Pierson EA, Walbot V. Comparison of RNA expression profiles based on maize expressed sequence tag frequency analysis and micro-array hybridization. Plant Physiol 2002, 128(3):896-910.
Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A. Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue):D247-51.
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
Gilchrist MJ, Zorn AM, Voigt J, Smith JC, Papalopulu N, Amaya E. Defining a large set of full-length clones from a Xenopus tropicalis EST project. Dev Biol 2004, 271(2):498-516.
Glonek GF, Solomon PJ. Factorial and time course designs for cDNA microarray experiments. Biostatistics 2004, 5(1):89-111.
Grigoryev DN, Ma SF, Irizarry RA, Ye SQ, Quackenbush J, Garcia JG. Orthologous gene-expression profiling in multi-species models: search for candidate genes. Genome Biol 2004, 5(5):R34.
Hartemink AJ, Gifford DK, Jaakkola TS, Young RA. Combining location and expression data for principled discovery of genetic regulatory network models. Pac Symp Biocomput 2002, 437-49.
Hilson P, Allemeersch J, Altmann T, Aubourg S, Avon A, Beynon J, Bhalerao RP, Bitton F, Caboche M, Cannoot B et al. Versatile gene-specific sequence tags for Arabidopsis functional genomics: transcript profiling and reverse genetics applications. Genome Res 2004, 14(10B):2176-2189.
Huang X, Madan A. CAP3: A DNA sequence assembly program. Genome Res 1999, 9(9):868-77.
Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ. The PROSITE database. Nucleic Acids Res 2006, 34(Database issue):D227-30.
Huminiecki L, Lloyd AT, Wolfe KH. Congruence of tissue expression profiles from Gene Expression Atlas, SAGEmap and TissueInfo databases. BMC Genomics 2003, 4(1):31.
Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, et al. Multiple-laboratory comparison of microarray platforms. Nature Methods 2005, 2(5):345-50.
Joyce AR, Palsson BO. The model organism as a system: integrating 'omics' data sets. Nature Reviews Mol Cell Biol 2006, 7(3):198-210.
Kawaura K, Mochida K, Ogihara Y. Expression profile of two storage-protein gene families in hexaploid wheat revealed by large-scale analysis of expressed sequence tags. Plant Physiol 2005, 139(4):1870-80.

Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol 2000, 7(6):819-37.
Kerr MK, Churchill GA. Experimental design for gene expression microarrays. Biostatistics 2001, 2(2):183-201.
Kerr MK. Linear models for microarray data analysis: hidden similarities and differences. J Comput Biol 2003, 10(6):891-901.
Kohany O, Gentles AJ, Hankus L, Jurka J. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics 2006, 7:474.
Larkin JE, Frank BC, Gavras H, Sultana R, Quackenbush J. Independence and reproducibility across microarray platforms. Nature Methods 2005, 2(5):337-44.
Lemmens K, Dhollander T, De Bie T, Monsieurs P, Engelen K, Smets B, Winderickx J, De Moor B, Marchal K. Inferring transcriptional modules from ChIP-chip, motif and microarray data. Genome Biol 2006, 7(5):R37.
Marchal K, Engelen K, De Brabanter J, Aert S, De Moor B, Ayoubi T, Van Hummelen P. Comparison of different methodologies to identify differentially expressed genes in two-sample cDNA microarrays. J Biological Systems 2002, 10(4):409-430.
Min XJ, Butler G, Storms R, Tsang A. TargetIdentifier: a webserver for identifying full-length cDNAs from EST sequences. Nucleic Acids Res 2005, 33(Web Server issue):W669-72.
Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, Copley R, et al. InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33(Database issue):D201-5.
Ogihara Y, Mochida K, Nemoto Y, Murai K, Yamazaki Y, Shin-I T, Kohara Y. Correlated clustering and virtual display of gene expression patterns in the wheat life cycle by large-scale statistical analyses of expressed sequence tags. Plant J 2003, 33(6):1001-11.
Okubo K, Matsubara K. Complementary DNA sequence (EST) collections and the expression information of the human genome. FEBS Lett 1997, 403(3):225-9.
Paananen J, Storvik M, Wong G. CROPPER: a metagene creator resource for cross-platform and cross-species compendium studies. BMC Bioinformatics 2006, 7:418.
Pao SY, Lin WL, Hwang MJ. In silico identification and comparative analysis of differentially expressed genes in human and mouse tissues. BMC Genomics 2006, 7:86.
Praz V, Jagannathan V, Bucher P. CleanEx: a database of heterogeneous gene expression data based on a consistent gene nomenclature. Nucleic Acids Res 2004, 32(Database issue):D542-7.

Quackenbush J, Liang F, Holt I, Pertea G, Upton J. The TIGR gene indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Res 2000, 28(1):141-5.
Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R. InterProScan: protein domains identifier. Nucleic Acids Res 2005, 33(Web Server issue):W116-20.
Rifkin SA, Kim J, White KP. Evolution of gene expression in the Drosophila melanogaster subgroup. Nature Genet 2003, 33(2):138-44.
Ritchie ME, Diyagama D, Neilson J, van Laar R, Dobrovic A, Holloway A, Smyth GK. Empirical array quality weights in the analysis of microarray data. BMC Bioinformatics 2006, 7:261.
Ronning CM, Stegalkina SS, Ascenzi RA, Bougri O, Hart AL, Utterbach TR, Vanaken SE, Riedmuller SB, White JA, Cho J, Pertea GM, Lee Y, Karamycheva S, Sultana R, et al. Comparative analyses of potato expressed sequence tag libraries. Plant Physiol 2003, 131(2):419-29.
Roth FP, Hughes JD, Estep PW, Church GM. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnol 1998, 16:939-945.
Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu VE. Using the transcriptome to annotate the genome. Nature Biotechnol 2002, 20(5):508-12.
Shalon D, Smith SJ, Brown PO. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Res 1996, 6(7):639-45.
Skrabanek L, Campagne F. TissueInfo: high-throughput identification of tissue expression profiles and specificity. Nucleic Acids Res 2001, 29(21):E102-2.
Slater G. Algorithms for analysis of ESTs. Ph.D. thesis, University of Cambridge, UK 2000.
Slater GS, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005, 6:31.
Smyth GK, Speed T. Normalization of cDNA microarray data. Methods 2003, 31(4):265-273.
Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3.
Smyth GK, Michaud J, Scott HS. Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics 2005, 21(9):2067-2075.
Stekel DJ, Git Y, Falciani F. The comparison of gene expression from multiple cDNA libraries. Genome Res 2000, 10(12):2055-61.

Svensson BA, Kreeft AJ, van Ommen GJ, den Dunnen JT, Boer JM. GeneHopper: a web-based search engine to link gene-expression platforms through GenBank accession numbers. Genome Biol 2003, 4(5):R35.
Tan PK, Downey TJ, Spitznagel EL Jr, Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka BM, Cam MC. Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res 2003, 31(19):5676-84.
Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci U S A 2004, 101(9):2981-6.
Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P, Moreau Y. A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 2001, 17(12):1113-22.
Tsai J, Sultana R, Lee Y, Pertea G, Karamycheva S, Antonescu V, Cho J, Parvizi B, Cheung F, Quackenbush J. RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biol 2001, 2(11):SOFTWARE0002.
Tsai CA, Chen YJ, Chen JJ. Testing for differentially expressed genes with microarray data. Nucleic Acids Res 2003, 31(9):e52.
Troyanskaya OG. Putting microarrays in a context: integrated analysis of diverse biological data. Brief Bioinform 2005, 6(1):34-43.
van Bakel H, Holstege FC. In control: systematic assessment of microarray performance. EMBO Rep 2004, 5(10):964-9.
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science 1995, 270(5235):484-7.
Vinciotti V, Khanin R, D'Alimonte D, Liu X, Cattini N, Hotchkiss G, Bucca G, de Jesus O, Rasaiyaah J, Smith CP et al. An experimental evaluation of a loop versus a reference design for two-channel microarrays. Bioinformatics 2005, 21(4):492-501.
Vos P, Hogers R, Bleeker M, Reijans M, van de Lee T, Hornes M, Frijters A, Pot J, Peleman J, Kuiper M, et al. AFLP: a new technique for DNA fingerprinting. Nucleic Acids Res 1995, 23(21):4407-14.
Wang J, Liang P. DigiNorthern, digital expression analysis of query genes based on ESTs. Bioinformatics 2003, 19(5):653-4.
Wang W, Cherry JM, Nochomovitz Y, Jolly E, Botstein D, Li H. Inference of combinatorial regulation in yeast transcriptional networks: a case study of sporulation. Proc Natl Acad Sci U S A 2005, 102(6):1998-2003.
Wang D, Huang J, Xie H, Manzella L, Soares MB. A robust two-way semi-linear model for normalization of cDNA microarray data. BMC Bioinformatics 2005, 6:14.

Wit E, Nobile A, Khanin R. Near-optimal designs for dual-channel microarray studies. Applied Statistics 2005, 54(5):817-830.
Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, Afshari C, Paules RS. Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 2001, 8(6):625-637.
Wu X, Walker MG, Luo J, Wei L. GBA server: EST-based digital gene expression profiling. Nucleic Acids Res 2005, 33(Web Server issue):W673-6.
Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R, et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, 34(Database issue):D187-91.
Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 2002, 30(4):e15.
Yang YH, Speed T. Design issues for cDNA microarray experiments. Nat Rev Genet 2002, 3(8):579-588.
Yang YH, Thorne NP. Normalization for two-color cDNA microarray data. In: Science and Statistics: A Festschrift for Terry Speed, IMS Lecture Notes-Monograph Series Edited by DR G, vol. 40; 2003: 403-418.
Zhou XJ, Gibson G. Cross-species comparison of genome-wide expression patterns. Genome Biol 2004, 5(7):232.
