
A random-periods model for expression of cell-cycle genes

Delong Liu*, David M. Umbach*, Shyamal D. Peddada*, Leping Li*, Patrick W. Crockett†, and Clarice R. Weinberg*‡

*Biostatistics Branch, National Institute of Environmental Health Sciences, National Institutes of Health, P.O. Box 12233, Research Triangle Park, NC 27709; and †Constella Health Sciences, 2605 Meridian Parkway, Durham, NC 27713

Communicated by Calyampudi R. Rao, Pennsylvania State University, University Park, PA, March 31, 2004 (received for review October 2, 2003)

We propose a nonlinear regression model for quantitatively analyzing periodic gene expression in studies of experimentally synchronized cells. Our model accounts for the observed attenuation in cycle amplitude by a simple and biologically plausible mechanism. We represent the expression level for each gene as an average across a large number of cells. For a given cell-cycle gene, we model its expression in each cell in the culture as following the same sinusoidal function except that the period, which in any individual cell must be the same for all cell-cycle genes, varies randomly across cells. We model these random periods by using a lognormal distribution. The variability in period causes the measured amplitude of the cyclic expression trajectory to attenuate over time as cells fall increasingly out of synchrony. Gene-specific parameters include initial amplitude and phase angle. Applying the model to data from Whitfield et al. [Whitfield, M. L., Sherlock, G., Saldanha, A. J., Murray, J. I., Ball, C. A., et al. (2002) Mol. Biol. Cell 13, 1977–2000], we fit the trajectories of 18 well characterized phase-marker genes and find that the fit does not suffer when a common lognormal distribution is assumed for all 18 genes compared with a separate distribution for each. We then use the model to identify 337 periodically expressed transcripts, including the 18 phase-marker genes. The model permits estimation of and hypothesis testing about biologically meaningful parameters that characterize cycling genes.

bootstrap test | gene expression | microarray | nonlinear regression

Experimental protocols that arrest cells in vitro at a particular phase of the cell cycle and then release them in a synchronized way allow detailed study of the cycling process. In conjunction with such experiments, cDNA microarray technology allows investigators to assess the temporal expression patterns of thousands of genes simultaneously. Gene-expression studies in yeast (1–3) and in cultured human cells (4, 5) have revealed that expression levels for cell-cycle genes vary periodically and have amplitudes that attenuate through time (Fig. 1). The observed attenuation is generally attributed to cells in the cultures falling increasingly out of synchrony through time.

In cultures of homogeneous cells released from a block on cycling, asynchrony can arise from at least two mechanisms. Cells throughout the culture can differ slightly in the exact timing of their arrest or release, or cells can differ slightly in the duration of their individual cycles. The former mechanism leaves asynchrony constant through time, so only the latter mechanism, where asynchrony increases, produces the characteristic attenuation. Such attenuation may hold biologic interest in itself. The rate of attenuation (i.e., the variation in the duration of the cell cycle across cells) may vary by strain of organism or by cell type (e.g., across tumors with differing metastatic potential). Some influences of cell characteristics on the scheduling of the cell cycle are already known. For instance, cell-size distribution in yeast mutants is related to time spent in the late G1 phase (6).
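The difference between the two mechanisms can be illustrated with a small simulation. This is only a sketch under assumed values (a 15-h cycle, 5,000 cells, a 1-h spread in release timing, and a log-scale spread of 0.07 in cycle duration); none of these numbers or function names come from the study itself.

```python
import math
import random

def mean_expression(t, offsets, periods):
    """Average a unit cosine over cells, each with its own start offset and period."""
    total = 0.0
    for d, p in zip(offsets, periods):
        total += math.cos(2.0 * math.pi * (t - d) / p)
    return total / len(offsets)

rng = random.Random(0)
n_cells, period = 5000, 15.0          # assumed values, for illustration only

# Mechanism 1: cells differ in release timing but share one cycle duration.
offsets = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]
same_period = [period] * n_cells

# Mechanism 2: cells start together but cycle durations vary (lognormal, median 15 h).
no_offset = [0.0] * n_cells
varied = [period * math.exp(0.07 * rng.gauss(0.0, 1.0)) for _ in range(n_cells)]

for cycle in range(4):
    t = cycle * period                # times at which a fully synchronous culture peaks
    print(cycle,
          round(mean_expression(t, offsets, same_period), 3),
          round(mean_expression(t, no_offset, varied), 3))
```

In this sketch the first column of averages is reduced by the same amount at every cycle, whereas the second shrinks from one cycle to the next, which is the attenuation pattern described above.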

Few methods proposed for the analysis of cell-cycle expression data address attenuation explicitly. For example, several investigators used a sinusoidal template to identify periodically expressed genes but did not explicitly account for attenuation (2, 5). The basic single-pulse model (SPM) proposed by Zhao et al. (7) addresses asynchrony by assuming that the cell-specific timings with which individual cells in a culture reach a given observation point have a normal distribution whose mean is the observation time itself and whose unknown variance differs across observation times. The magnitudes of the variance parameters measure asynchrony. Stochastic cell-to-cell variation, as mentioned earlier, does not necessarily produce decay in amplitude. Attenuation was built into the SPM separately by an assumed submodel where the log of the variance parameter increases linearly in time; but that submodel’s parameters lack direct biologic interpretation.

In this article, we propose an alternative regression model for studying periodically expressed genes. Our model has biologic import: it directly links attenuation in the amplitude of periodic gene expression to stochastic variation across cells in the duration of the cell cycle, while permitting estimation of the phase of the cycle in which the gene is most transcribed. Parameters estimated under our model can be used not only to help identify and characterize periodically expressed genes but also to cluster the identified genes into subgroups based on the estimated phase angle, amplitude, or drift. Our model can facilitate studies of effects of experimental conditions on the variance and median duration of the cell cycle. We describe our model and illustrate its use with a publicly available data set (5). We use the observed expression trajectories of 18 cell-cycle “phase-marker” genes to test a key assumption of our model. We then use estimates based on those genes to help identify additional periodically expressed genes among the remaining transcripts.

Abbreviations: RPM, random-periods model; SPM, single-pulse model.

‡To whom correspondence should be addressed at: Mail Drop A3–03, National Institute of Environmental Health Sciences, P.O. Box 12233, Research Triangle Park, NC 27709. E-mail: weinberg@niehs.nih.gov.

© 2004 by The National Academy of Sciences of the USA

Fig. 1. Trajectory of log2-transformed expression ratio for a known cell-cycle gene, PCNA, from synchronized HeLa cultures (data from ref. 5) showing typical attenuation in amplitude.

7240–7245 | PNAS | May 11, 2004 | vol. 101 | no. 19 www.pnas.org/cgi/doi/10.1073/pnas.0402285101

Random-Periods Model for Periodically Expressed Genes

To model the expression trajectories of cell-cycle-related genes through time based on cell cultures where synchrony is experimentally induced, we make the following assumptions: (i) measured gene expression is, in effect, the average of mRNA levels across a large number of individual cells in the culture; (ii) in any individual cell, the expression levels (perhaps after logarithmic transformation) of all cell-cycle-related genes have temporal profiles that are well approximated by sinusoidal waves; (iii) the duration of the cell cycle varies stochastically across cells in the culture, following a lognormal distribution with a characteristic median (T) and geometric standard deviation (σ); and (iv) any nonstationary background expression levels are approximately linear in time with gene-specific intercepts and slopes (a and b, respectively). We model the sinusoid using a cosine function with distinct amplitude (K) and phase angle (Φ) for each gene.

We model the observed expression of gene g at time t as the sum of an expected response and a random error, that is, Y_g(t) = f(t, θ_g) + ε_g(t). Here, θ_g denotes a vector of parameters. We take the ε_g(t) to have mean zero but make no additional assumptions about their distribution; in particular, we regard them as having possibly different variances across genes or through time and as possibly having serial correlation. In view of the preceding assumptions, we propose the “random-periods” model (RPM) for characterizing the expected periodic expression of a cell-cycle gene:

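\[
f(t, \theta_g) = a_g + b_g t + \frac{K_g}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \cos\!\left( \frac{2\pi t}{T \exp(\sigma z)} + \Phi_g \right) \exp\!\left( -\frac{z^2}{2} \right) dz, \tag{1}
\]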

where θ_g is explicitly (K_g, T, σ, Φ_g, a_g, b_g). The integration in the model computes the expected cosine across the lognormal distribution of periods and thereby accounts for the aggregation of expression levels across a large number of cells. The subscripted parameters in Eq. 1 are gene-specific. The parameter Φ_g corresponds to the phase of the cell cycle where the gene has its peak transcription with Φ_g = 0 corresponding to the point when cells are first released to resume cycling. The parameter K_g is the initial amplitude of the periodic expression pattern. The parameters a_g and b_g account for any drift in a gene’s background expression level.
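As a concrete illustration of Eq. 1, the expected response can be evaluated numerically by averaging the cosine over a standard normal variable z, with each z implying a period T exp(σz). The sketch below uses a plain trapezoid rule on a truncated range; the function name, the grid size, and the truncation at |z| = 8 are assumptions made for this example, not choices reported in the study.

```python
import math

def rpm_mean(t, K, T, sigma, phi, a=0.0, b=0.0, n=2001, zmax=8.0):
    """Expected expression under Eq. 1, using the trapezoid rule on [-zmax, zmax]."""
    h = 2.0 * zmax / (n - 1)                      # grid spacing in z
    total = 0.0
    for i in range(n):
        z = -zmax + i * h
        w = 0.5 if i in (0, n - 1) else 1.0       # trapezoid end-point weights
        period = T * math.exp(sigma * z)          # cell-cycle duration implied by this z
        total += w * math.cos(2.0 * math.pi * t / period + phi) * math.exp(-0.5 * z * z)
    return a + b * t + K * h * total / math.sqrt(2.0 * math.pi)

# With sigma = 0 every cell has period T and the model is an undamped cosine.
print(rpm_mean(7.0, 1.0, 15.0, 0.0, math.pi / 2))
print(math.cos(2.0 * math.pi * 7.0 / 15.0 + math.pi / 2))
```

Any quadrature rule suited to a Gaussian weight could be substituted; the trapezoid rule is used here only because it needs nothing beyond the standard library.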

Under our assumptions, the parameters T and σ are specific to the population of cells and the same for all genes, although they can be estimated from data on a single gene or on a set of genes. The parameter σ governs the rate of attenuation in amplitude. If σ is zero, the duration of the cell cycle does not vary across cells, cells remain synchronous through time, and the aggregate expression shows no attenuation in amplitude. If σ is large, cells fall rapidly out of synchrony, and amplitude decays sharply. Increasing σ has two distinct effects on the shape of the aggregate expression trajectory (Fig. 2): the expression level attenuates faster, and the times between successive crossings of 0 increase over time more markedly. The latter feature is more easily seen in the last three cycles of the curve with the largest σ and is less noticeable for curves whose σ values are smaller (≤0.075), in particular, within the first three cycles.
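A first-order approximation makes the role of σ concrete. It is offered here only as an illustration and is not part of the model's stated derivation. For small σ, exp(−σz) ≈ 1 − σz, and the average of a cosine over a standard normal z then has a closed form:

```latex
\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty}
  \cos\!\left( \frac{2\pi t}{T}\,(1 - \sigma z) + \Phi_g \right) e^{-z^2/2}\, dz
  \;=\;
  \exp\!\left[ -\frac{1}{2} \left( \frac{2\pi \sigma t}{T} \right)^{2} \right]
  \cos\!\left( \frac{2\pi t}{T} + \Phi_g \right)
```

Under this approximation the amplitude after three cycles (t = 3T) is reduced by the factor exp[−(6πσ)²/2], which is about 0.39 when σ = 0.073. The approximation reproduces the attenuation but not the gradual lengthening of the intervals between zero crossings, which arises from the higher-order terms that it drops.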

Statistical Inference: Estimation and Testing

Using numerical quadrature to approximate the needed integral, we estimate the unknown parameters in Eq. 1 by using nonlinear least-squares regression, i.e., we minimize the sum of squared residuals. (For details about nonlinear regression, see refs. 8 and 9.) To help ensure that we reach a global minimum, we repeat the iterative fitting process from multiple distinct starting values (usually 50) and choose the fit with the minimum sum of squared residuals as the best. Our approach to fitting can be applied to one transcript at a time or to several simultaneously and estimates all parameters simultaneously. It ignores, however, possible variance heterogeneity and lack of independence among the ε_g(t). Calculations were performed with MATLAB software (MathWorks, Natick, MA).
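The multistart fitting strategy can be sketched for a single transcript as follows. The original calculations were done in MATLAB; this Python version, which assumes NumPy and SciPy are available, is a hypothetical stand-in rather than the code used in the study. The Gauss-Hermite rule, the parameter bounds, and the ranges for the random starting values are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

# Gauss-Hermite nodes and weights for the weight function exp(-z^2 / 2).
Z, W = np.polynomial.hermite_e.hermegauss(60)

def rpm(t, K, T, sigma, phi, a, b):
    """Expected expression under Eq. 1, with the integral replaced by quadrature."""
    t = np.asarray(t, dtype=float)
    angles = 2.0 * np.pi * t[:, None] / (T * np.exp(sigma * Z)[None, :]) + phi
    return a + b * t + K * (np.cos(angles) @ W) / np.sqrt(2.0 * np.pi)

def fit_rpm(t, y, n_starts=50, seed=0):
    """Multistart least squares: keep the fit with the smallest sum of squared residuals."""
    rng = np.random.default_rng(seed)
    lower = [0.0, 5.0, 0.0, 0.0, -np.inf, -np.inf]            # assumed bounds
    upper = [np.inf, 40.0, 1.0, 2.0 * np.pi, np.inf, np.inf]
    best = None
    for _ in range(n_starts):
        # Random start for (K, T, sigma, phi); intercept at the data mean, zero slope.
        x0 = [rng.uniform(0.2, 2.0), rng.uniform(10.0, 20.0), rng.uniform(0.01, 0.2),
              rng.uniform(0.1, 6.0), float(np.mean(y)), 0.0]
        res = least_squares(lambda p: rpm(t, *p) - y, x0, bounds=(lower, upper))
        if best is None or res.cost < best.cost:               # cost = 0.5 * SSE
            best = res
    return best
```

A call such as `fit_rpm(t, y)` returns the SciPy result object for the best start, with the estimates in `best.x` ordered as (K, T, σ, Φ, a, b). Fitting several genes with a shared T and σ would require stacking the residuals across genes, which this sketch does not do.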

To carry out inference on model parameters, we need to estimate the covariance matrix of their estimates. Suppose that we are fitting Eq. 1 simultaneously to G genes. Each gene g ∈ {1, 2, 3, . . . , G} is observed at n_g time points, so the total number of observations is N = Σ_{g=1}^{G} n_g. In general, the overall parameter vector θ has p components indexed by subscript j. For example, p = 2 + 4G if T and σ are estimated jointly for all G genes, but p = 6G if T and σ are estimated separately for each gene. A vector version of the model uses the N observations stacked into a column vector, Y, ordered by genes and by time points within genes, and the corresponding stacked versions of the expected response and the errors: Y = f(t, θ) + ε. Let V be the N × p matrix of partial derivatives of f(t, θ). The (i, j)th element of V is ∂f(t, θ)_i/∂θ_j, and V̂ is V evaluated at θ̂ (“hats” denote estimated values). Let V̂_g be the n_g × p submatrix of V̂ whose rows correspond to the n_g time points for gene g. We can express our estimator of Σ, the covariance matrix of θ̂, as:

\[
\hat{\Sigma} = (\hat{V}'\hat{V})^{-1} \left[ \sum_{g=1}^{G} \frac{\hat{\varepsilon}_g'\,\hat{\varepsilon}_g}{n_g - g_g}\, \hat{V}_g'\hat{V}_g \right] (\hat{V}'\hat{V})^{-1}, \tag{2}
\]

where ε̂_g is the vector of residuals for gene g and g_g is the trace of the matrix [V̂_g(V̂′V̂)^{−1}V̂_g′]. This variance estimator Σ̂ has favorable properties in both heteroscedastic linear models (10, 11) and nonlinear models (12, 13), and it is designed for ε_g(t) having different variances among genes (but not across time points within genes).
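The covariance estimator described above can be sketched in a few lines, assuming NumPy is available. The sketch takes the form of the estimator to be a sandwich in which each gene contributes its residual sum of squares divided by n_g minus the trace term g_g; the function and argument names are hypothetical.

```python
import numpy as np

def sandwich_covariance(V, resid, gene):
    """Covariance of the estimates, allowing a separate error variance for each gene.

    V     : (N, p) matrix of partial derivatives evaluated at the estimates
    resid : (N,) residuals, stacked in the same order as the rows of V
    gene  : (N,) label identifying the gene for each row
    """
    V = np.asarray(V, dtype=float)
    resid = np.asarray(resid, dtype=float)
    gene = np.asarray(gene)
    bread = np.linalg.inv(V.T @ V)
    meat = np.zeros_like(bread)
    for g in np.unique(gene):
        Vg, eg = V[gene == g], resid[gene == g]
        g_g = np.trace(Vg @ bread @ Vg.T)        # trace term for gene g
        s2 = (eg @ eg) / (len(eg) - g_g)         # gene-specific variance estimate
        meat += s2 * (Vg.T @ Vg)
    return bread @ meat @ bread
```

With a single gene the trace term equals p, so the result reduces to the familiar residual variance times the inverse of V′V, which gives a convenient sanity check.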

To test hypotheses about the parameters, we use Wald test statistics. Suppose we want to test the null hypothesis h(θ) = 0, where h(·) is vector-valued with q components and is differentiable. The corresponding Wald statistic is W = h(θ̂)′(ĤΣ̂Ĥ′)^{−1}h(θ̂), where Ĥ is the q × p Jacobian of h(·) evaluated at θ̂ and Σ̂ is from Eq. 2. Under technical regularity conditions and for large sample sizes, the Wald statistic would have a χ² distribution with q degrees of freedom under the null hypothesis. Because the number of time points is not large, and because we expect temporal and gene-to-gene correlations in the ε_g(t), we opt to evaluate the null distribution of the Wald statistic with a moving-blocks bootstrap procedure (14). Resampling individual values destroys temporal correlation; to retain it, the moving-blocks bootstrap resamples fixed-length blocks of consecutive values. Because all genes appear on a single chip at each observation time, we sampled observation times and carried along residuals for all G genes at those sampled times when a hypothesis simultaneously involved G genes.

Fig. 2. Trajectories of the RPM for different values of σ. For these curves, (K, T, σ, Φ, a, b) = (1, 15, σ, π/2, 0, 0) with σ ∈ {0, 0.05, 0.075, 0.13}. Larger values of σ correspond to faster attenuation of peak amplitude.

The procedures for generating the bootstrap distribution were, in brief: (i) fit the null model to the original data and compute the residuals; (ii) draw a random sample with replacement from all possible blocks of consecutive residuals with a given length (we sampled six blocks of length 9 and truncated to 47 residuals, the number of time points in the data); (iii) add these sampled residuals to the curve fitted under the null model to obtain a bootstrap data set; (iv) fit the alternative model to the bootstrap data set and compute the Wald statistic; and (v) repeat steps ii through iv a large number of times (our application used 2,000). The resulting collection of bootstrap Wald statistics is used to approximate the null distribution of the test. The bootstrap P value is the proportion of bootstrap Wald statistics that fall above the Wald statistic calculated from the original data.
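Steps ii and v of this procedure, the block resampling and the final P value, can be sketched as below. The model fitting in steps i, iii, and iv is omitted, and the function names are hypothetical. When a hypothesis involves several genes, `resid` can be a list with one entry per observation time, each entry holding the residuals of all genes at that time, so that they are carried along together.

```python
import random

def moving_blocks_sample(resid, block_len=9, rng=random):
    """Resample residuals in blocks of consecutive values, then truncate to the original length."""
    n = len(resid)
    n_blocks = -(-n // block_len)                  # ceiling division: 6 blocks when n = 47
    starts = [rng.randrange(n - block_len + 1) for _ in range(n_blocks)]
    sample = []
    for s in starts:
        sample.extend(resid[s:s + block_len])      # one block of consecutive residuals
    return sample[:n]

def bootstrap_p_value(observed_wald, bootstrap_walds):
    """Proportion of bootstrap Wald statistics that fall above the observed statistic."""
    exceed = sum(1 for w in bootstrap_walds if w > observed_wald)
    return exceed / len(bootstrap_walds)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.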

Identifying Additional Cell-Cycle-Related Transcripts

One approach to identifying additional cell-cycle-related transcripts would fit the RPM to each gene individually and use a testing strategy to see whether the cosine term in the model were necessary for a good fit. Fitting Eq. 1 to the thousands of transcripts typical of cell-cycle expression data would, however, be extremely impractical. Only a small portion of the available transcripts are likely involved in the cell-cycle process (2). Attempts to fit a model designed for cycling trajectories to trajectories without periodicities are time-consuming; convergence of the iterative estimation process is slow, and multiple local minima often present problems. A simple and practical alternative is to adopt template-based correlation methods for selecting genes with periodic expression patterns (3, 5, 15). We propose to use estimates from fitting the RPM to known cell-cycle genes to inform a correlation approach for selecting other cell-cycle-related genes.

We first created a set of model-based templates. A template is a list of fitted values, one at each observation time, generated from Eq. 1, using a prespecified parameter vector. The parameter vectors for the set of templates are chosen so that the set spans trajectories typical of cell-cycle-related genes. We based templates on data from phase-marker genes through parameters estimated by fitting the RPM. We set both a and b to zero, since a has no effect on correlation and since the data for phase-marker genes indicated that b was near zero (Table 1). If a and b are both zero, then K has no effect on the correlation and can be set to 1. Also, under the RPM, T and σ should be constant for all cycle-related genes, but Φ can vary from 0 to 2π depending on the cell-cycle phase in which the transcript is expressed. In our application, we chose a set of 24 vectors (K, T, σ, Φ, a, b) = (1, T̂, σ̂, Φ, 0, 0), where T̂ and σ̂ were estimates based on 18 phase-marker genes and Φ was one of 24 angles, equally spaced around the circle and starting at 0. For each transcript, we calculate its Pearson correlation with each template in the set and take the maximum correlation over those templates as its score. The higher this score, the more the pattern displayed by the transcript resembles one of the templates. This scoring allows an ordering of any number of transcripts by their similarity to typical cell-cycle genes. Clearly, some transcripts will falsely appear as cycling given that we are examining so many (>44,000 in our application). Accordingly, we based our choice of cut-point for the ordered correlation scores on a permutation procedure (see supporting information, which is published on the PNAS web site) to restrict the expected number of false-positive transcripts, those mistakenly declared as cell-cycle related, to <1%.
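The template scoring can be sketched as follows. The values T = 14.42 h and σ = 0.073 are the joint estimates reported for the 18 phase-marker genes, and the 47 hourly time points match the data set. The function names and the trapezoid-rule evaluation of Eq. 1 are assumptions made for this example, and the permutation procedure for choosing the cut-point is not shown.

```python
import math

def rpm_template(times, T, sigma, phi, n=401, zmax=8.0):
    """Template from Eq. 1 with K = 1 and a = b = 0, integrating over z on a uniform grid."""
    h = 2.0 * zmax / (n - 1)
    zs = [-zmax + i * h for i in range(n)]
    # Grid weights; the end-point correction is negligible because exp(-32) is about 1e-14.
    ws = [h * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in zs]
    return [sum(w * math.cos(2.0 * math.pi * t / (T * math.exp(sigma * z)) + phi)
                for z, w in zip(zs, ws)) for t in times]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_score(profile, templates):
    """Maximum Pearson correlation between one transcript and any template."""
    return max(pearson(profile, tpl) for tpl in templates)

times = list(range(47))                                   # hourly observation times
phases = [2.0 * math.pi * k / 24 for k in range(24)]      # 24 equally spaced angles from 0
templates = [rpm_template(times, 14.42, 0.073, phi) for phi in phases]
```

A transcript whose profile is `profile` (a list of 47 log2 ratios) would then be scored with `correlation_score(profile, templates)` and compared with the chosen cut-point.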

Application of the RPM

To illustrate application of the RPM, we used an experiment where HeLa cells were arrested in S phase by using a double-thymidine block and subsequently released in synchrony (5). Gene expression was assessed with cDNA microarrays by using RNA from asynchronously growing HeLa cells as the reference. We downloaded the “raw” data (nonnormalized mean intensity value on each channel for each spot) for the 46-h experiment (experiment 3) from http://genome-www.stanford.edu/Human-CellCycle/HeLa/data.shtml. These data describe 44,158 transcripts at 47 hourly observation times (approximately three cell cycles). We analyzed base-2 logarithms of the nonnormalized expression ratios.

Table 1. Estimated parameters of the RPM for 18 well characterized phase-marker genes

Nominal phase*   Gene symbol   K      T, hr   σ       Φ, rad   a      b        SSE†
G1/S             CCNE1         0.67   15.1    0.054   0.56     0.46    0.002   1.705
                 CDC6          0.69   14.7    0.056   5.96     0.46    0.000   1.774
                 PCNA          0.62   15.1    0.074   5.87     0.54    0.012   1.529
                 E2F1          0.46   14.3    0.055   5.83     0.42    0.005   1.346
S                RFC4          0.36   14.3    0.058   5.47     0.38    0.007   1.224
                 RRM2          0.69   15.3    0.075   5.36     0.76   −0.008   3.281
G2               CDC2          1.33   14.8    0.081   4.24     0.12    0.005   8.157
                 TOP2A         0.81   14.6    0.080   3.74     0.14    0.008   3.345
                 CCNA2         0.58   14.5    0.068   3.55     0.55   −0.003   2.785
                 CCNF          1.00   13.9    0.083   3.25     0.44    0.000   2.946
G2/M             STK15         1.23   14.2    0.076   3.06     0.32    0.004   3.257
                 CCNB1         0.37   13.9    0.115   2.67     0.36    0.003   1.420
                 PLK           1.16   14.0    0.070   2.61     0.43    0.005   1.741
                 BUB1          0.69   13.8    0.073   2.51     0.56   −0.002   1.608
M/G1             VEGFC         0.49   14.4    0.068   2.66     0.67    0.003   1.781
                 PTTG1         0.52   14.6    0.071   2.40     0.54    0.008   1.068
                 CDKN3         0.51   14.0    0.096   2.25     0.30    0.007   1.842
                 RAD21         0.36   13.2    0.084   1.81     0.29    0.009   1.745

Data were log2-transformed expression ratios from the third experiment in ref. 5. Parameters are defined in the text.
*Nominal phases reported in ref. 5.
†Denotes sum of squared residuals.

We illustrate the fit of the RPM using 18 of the phase-marker genes identified in ref. 5 (see supporting information and Table 1). First, we fit Eq. 1 to each gene, estimating T and σ separately for each gene. T̂_g ranged from 13.13 to 15.24 h; σ̂_g ranged from 0.054 to 0.115; and Φ̂_g, phase angles estimated in radians, ranged around the circle (Table 1). Reassuringly, the estimates Φ̂_g agreed well with the known phases for the 18 genes; the only exception was VEGFC, whose estimated phase of peak transcription came a little early compared with its expected relative position in the cell cycle (5). Our model fit the observed expression trajectories of these genes reasonably well (Fig. 3); the oscillations attenuate with time, with some transcripts showing background drift either upward or downward. Our fitted curves tended to lie below the observations at the first few time points.

We tested the null hypothesis H0: σ = 0 to examine whether explicitly modeling attenuation improved fit. In general, it did; the bootstrap P value was <0.05 for 16 of the 18 genes (see supporting information). We also fit the 18 phase-marker genes in a single model with common values for both T and σ. The estimates T̂ and σ̂ for the 18 marker genes taken together were 14.42 h and 0.073, respectively. The bootstrap P value for comparing the model with the same T and σ for all genes to one with separate values for each gene was 0.12, suggesting that these genes share common values for T and for σ as the biology requires (see supporting information).

From the 44,158 transcripts, we identified 337 with a correlation score of 0.6 or greater as cell-cycle-related (see supporting information). Of those 337, we estimated that two would be false positives. One could adjust the cut-point to identify more genes at the cost of more expected false positives. When we lowered the cut-point to 0.5, we identified 675 transcripts overall and expected some 62 of them to be false positives. We considered the latter false-positive rate to be unacceptably high. Comparing our list of 337 transcripts with the list of 1,134 in ref. 5, we found that 219 transcripts were on both lists and 118 were newly identified by our approach.

Fig. 3. Plots of log2 expression ratio versus time (hr) for 18 well characterized phase-marker genes through about three cell cycles (data from ref. 5). Data (symbols) and fitted trajectory of the RPM (—). Genes in the first row are considered G1/S-phase genes; second row, S phase; third row, G2 phase; fourth row, G2/M phase; fifth row, M/G1 phase.

After selecting the 337 putatively cycling transcripts, we fit the RPM to each one. For most of these, T̂_g fell between 13 and 16 h and σ̂_g ranged from near 0 to about 1.1. The scatter plot of σ̂_g versus Φ̂_g revealed unusually large estimates of σ_g for some transcripts (see supporting information). The transcripts with σ̂_g > 0.2 preferentially had phase angles that correspond to late G1 and early S phases, near the point when the cells had been arrested. Further exploration revealed transcripts with aberrant profiles (Fig. 4): an extreme initial mRNA level, with values high enough to distort the model’s fit, produced a high estimated σ and a correspondingly rapid decay in oscillation and effectively masked the evident cycling. When we removed the first two data points and repeated the fitting, the characteristic cycling pattern was revealed (Fig. 4).

Because the early time points may be subject to a recovery phenomenon unrelated to steady-state cycling (5), we refit the model for all 337 transcripts after omitting the first two time points to identify transcripts that might be sensitive to such early transient behavior. We regarded the original fit for a transcript as suspect if the estimated value of the parameter vector changed by a sufficiently large amount. For each transcript, we determined the Euclidean distance between the estimated parameter vector based on the original data and the one based on the reduced data, and we calculated the median and interquartile range of this sample of distances. We flagged as exhibiting suspiciously large changes in estimates those transcripts whose distance was more than three interquartile ranges from the median distance. By this criterion, 11 of 337 transcripts were judged subject to distortion. We reported estimates based on the reduced data for these 11 transcripts while retaining the original estimates for the remaining transcripts (see supporting information). This strategy tamed the more extreme estimates of σ_g.
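The flagging rule can be sketched as below. Two details are assumptions not specified in the text: the quartile convention (the default of Python's `statistics.quantiles`) and the reading of "from the median" as lying above it, since only large changes are of interest. The function name is hypothetical.

```python
import math
import statistics

def flag_sensitive(full_fits, reduced_fits, k=3.0):
    """Flag transcripts whose estimates move unusually far when early time points are dropped.

    full_fits, reduced_fits : sequences of parameter vectors, one pair per transcript
    k                       : number of interquartile ranges above the median used as the cut-off
    """
    dists = [math.dist(p, q) for p, q in zip(full_fits, reduced_fits)]
    q1, med, q3 = statistics.quantiles(dists, n=4)   # quartiles of the distances
    cutoff = med + k * (q3 - q1)
    return [d > cutoff for d in dists]
```

Because the Euclidean distance is computed on the raw parameter vector, parameters on larger numerical scales (such as T in hours) weigh more heavily than those on smaller scales (such as σ).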

Of the 337 transcripts, five had σ̂ < 0.001, corresponding to intriguing patterns that exhibited little evident attenuation. When one remembers that we searched for periodicity in >44,000 transcripts, these estimated values were likely due to the random variation that is inevitable with real data. Overall, except for a few extreme values, the estimates T̂_g and σ̂_g from the set of newly identified transcripts appeared compatible with those from the 18 phase-marker genes.

Discussion

Regression modeling of gene-expression trajectories can be an important alternative to clustering methods for analyzing the expression patterns of cell-cycle-related genes. By providing estimates of transcript-specific parameters, a regression approach reduces the raw data to a smaller number of biologically interpretable summary parameters that can be used to describe the transcripts or to characterize the effects of experimental interventions.

Recently, several groups have proposed regression models for the analysis of periodically expressed genes. An autoregressive model was able to provide an adequate description of trajectories by using relatively few parameters (16), but its parameters lack natural biologic interpretations. The SPM (7) is based on the simple notion that in a given cell each gene begins full expression abruptly at some point in the cell cycle and is expressed at a constant rate until it reaches the point where it abruptly stops being expressed and the mRNA instantaneously disappears. Under the SPM, stochastic variation among cells in the timing of activation and deactivation smoothes abrupt expression changes and allows a somewhat flexible shape for the observed trajectory of a large number of cells. The SPM accommodates attenuation but without clear biologic mechanism. In addition, the SPM describes the portion of the cycle where the gene is transcribed by two parameters, the activation time and deactivation time, although many investigators use a single “phase angle” to measure the location of peak expression (e.g., refs. 2 and 5). Other methods for describing periodic expression trajectories include singular-value decomposition (17–19), B splines (20), and partial least squares (21). In general, these approaches do not provide the parsimony and biologically interpretable parameters that regression models offer.

An important feature of the RPM is that attenuation arises as a natural consequence of variation in the duration of the cell cycle across cells. The model provides a single parameter, σ, to assess this variation and, hence, to measure attenuation. It allows estimation of a transcript’s phase angle, a useful parameter for elucidating its role in the cell cycle. On the other hand, all models involve simplifying assumptions. The cosine functional form provides a rigidly defined shape and does not flexibly adapt to expression trajectories that may vary widely in shape from transcript to transcript while maintaining common periodicity. Nevertheless, as seen in Fig. 3, the cosine curve adequately accommodated differently shaped trajectories while keeping the model parameters as few and as intuitive as possible.

Inference under regression models for cell-cycle expression, whether SPM (7) or RPM, is difficult, however, because the distribution and the correlation structure of the error terms are unknown. In such circumstances, we prefer bootstrap methods for inference to methods that rely on asymptotic distributions, but bootstrapping can be computationally expensive. Better statistical techniques for inference in such models are needed.

One goal in current studies of cell-cycle gene expression is to identify transcripts that are expressed periodically and entrained with the cell cycle. Our approach was to modify widely used correlation-based methods for clustering genes (2, 5, 15). Usually these methods construct templates, the ideal trajectories that represent typical cycling genes, by using averages of a few observed trajectories of known cycling genes with similar phase angles. Our modification replaced the simple averages with predicted trajectories based on the RPM, an approach that smoothes random fluctuations. We used parameter values derived from fitting known cell-cycle genes so that our templates reflected data and incorporated attenuation (see supporting information). We used a range of values only for Φ, but larger sets of templates including ranges of values for other parameters might sometimes be warranted (see supporting information). In addition, templates can be formed by the model for possible trajectories for which established cell cycle genes are either unknown or unavailable in the data, features that rule out averaging. We must caution, however, that correlation-based methods can be misleading (22).

Fig. 4. Observed log2 expression ratios (symbols) and fitted trajectories based on the RPM with (– – –) and without (—) the first two data points for the transcript with accession no. N95578 (identified as a clone of DKFZp434D0818). The reduced data fit captured the cycling because the influence of extreme values in the first two time points was eliminated.

In conclusion, we have proposed the RPM for studying periodically expressed transcripts and have demonstrated its use with published data from synchronized HeLa cell cultures. Attenuation in the expression level of cell-cycle genes over time is well characterized by allowing variability across cells in the duration of the cell cycle. Such variability causes the cells to fall increasingly out of synchrony over time, which, in turn, damps the periodic expression. The RPM is parsimonious and can be applied to characterize aggregated levels arising from any studies where initial synchrony among cycling units is experimentally induced. The RPM allows simultaneous estimation of biologically relevant parameters and formal hypothesis testing. Genes can be clustered based on these biologically interpretable parameters, and relationships among cycling genes may be revealed. Several additional applications are envisioned. The approach allows identification of transcripts of periodic expression in the cell cycle. The estimated model parameters could also allow cell cultures, e.g., from normal and tumor tissues, to be characterized and contrasted based on biologically interpretable features of their growth regulation. In studies of genotoxic agents, effects of the agents on the shifts of phase angles of specific checkpoint genes may be studied.

We thank Barbara Wetmore, Fred Parham, and the two reviewers for their careful reading of and constructive comments on an earlier version of this manuscript.

1. Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian, A. E., Landsman, D., Lockhart, D. J., et al. (1998) Mol. Cell 2, 65–73.
2. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998) Mol. Biol. Cell 9, 3273–3297.
3. Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O. & Herskowitz, I. (1998) Science 282, 699–705.
4. Cho, R. J., Huang, M., Campbell, M. J., Dong, H., Steinmetz, L., Sapinoso, L., Hampton, G., Elledge, S. J., Davis, R. W. & Lockhart, D. J. (2001) Nat. Genet. 27, 48–54.
5. Whitfield, M. L., Sherlock, G., Saldanha, A. J., Murray, J. I., Ball, C. A., Alexander, K. E., Matese, J. C., Perou, C. M., Hurt, M. M., Brown, P. O., et al. (2002) Mol. Biol. Cell 13, 1977–2000.
6. Jorgensen, P., Nishikawa, J. L., Breitkreutz, B.-J. & Tyers, M. (2002) Science 297, 395–400.
7. Zhao, L. P., Prentice, R. & Breeden, L. (2001) Proc. Natl. Acad. Sci. USA 98, 5631–5636.
8. Gallant, A. R. (1987) Nonlinear Statistical Models (Wiley, New York).
9. Seber, G. A. F. & Wild, C. J. (1989) Nonlinear Regression (Wiley, New York).
10. Peddada, S. D. & Patwardhan, G. (1992) Biometrika 79, 654–657.
11. Peddada, S. D. (1993) in Handbook of Statistics, ed. Rao, C. R. (Elsevier North-Holland, New York), Vol. 9, pp. 723–744.
12. Shao, J. (1990) Stat. Probabil. Lett. 10, 77–85.
13. Zhang, J., Peddada, S. D. & Rogol, A. (2000) Statistics for 21st Century, eds. Rao, C. R. & Szekely, G. (Dekker, New York), pp. 459–483.
14. Kunsch, H. (1989) Ann. Stat. 17, 1217–1241.
15. Heyer, L. J., Kruglyak, S. & Yooseph, S. (1999) Genome Res. 9, 1106–1115.
16. Ramoni, M. F., Sebastiani, P. & Kohane, I. S. (2002) Proc. Natl. Acad. Sci. USA 99, 9121–9126.
17. Alter, O., Brown, P. O. & Botstein, D. (2000) Proc. Natl. Acad. Sci. USA 97, 10101–10106.
18. Holter, N. S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J. R. & Fedoroff, N. V. (2000) Proc. Natl. Acad. Sci. USA 97, 8409–8414.
19. Holter, N. S., Maritan, A., Cieplak, M., Fedoroff, N. V. & Banavar, J. R. (2001) Proc. Natl. Acad. Sci. USA 98, 1693–1698.
20. Luan, Y. & Li, H. (2003) Bioinformatics 19, 474–482.
21. Johansson, D., Lindgren, P. & Berglund, A. (2003) Bioinformatics 19, 467–473.
22. Peddada, S. D., Lobenhofer, E. K., Li, L., Afshari, C. A., Weinberg, C. R. & Umbach, D. M. (2003) Bioinformatics 19, 834–841.
