balli: bartlett-adjusted likelihood-based linear model approach...

14
METHODOLOGY ARTICLE Open Access BALLI: Bartlett-adjusted likelihood-based linear model approach for identifying differentially expressed genes with RNA-seq data Kyungtaek Park 1 , Jaehoon An 2 , Jungsoo Gim 3 , Minseok Seo 4,5 , Woojoo Lee 6 , Taesung Park 1,7 and Sungho Won 1,2,8* Abstract Background: Transcriptomic profiles can improve our understanding of the phenotypic molecular basis of biological research, and many statistical methods have been proposed to identify differentially expressed genes (DEGs) under two or more conditions with RNA-seq data. However, statistical analyses with RNA-seq data are often limited by small sample sizes, and global variance estimates of RNA expression levels have been utilized as prior distributions for gene-specific variance estimates, making it difficult to generalize the methods to more complicated settings. We herein proposed a Bartlett-Adjusted Likelihood-based LInear mixed model approach (BALLI) to analyze more complicated RNA-seq data. The proposed method estimates the technical and biological variances with a linear mixed-effects model, with and without adjusting small sample bias using Bartlketts corrections. Results: We conducted extensive simulations to compare the performance of BALLI with those of existing approaches (edgeR, DESeq2, and voom). Results from the simulation studies showed that BALLI correctly controlled the type-1 error rates at various nominal significance levels and produced better statistical power and precision estimates than those of other competing methods in various scenarios. Furthermore, BALLI was robust to variation of library size. It was also successfully applied to Holstein milk yield data, illustrating its practical value. Conclusions;: BALLI is statistically more efficient and valid than existing methods, and we conclude that it is useful for identifying DEGs in RNA-seq analysis. Keywords: Differentially expressed genes, RNA sequencing, Linear mixed model, Bartletts correction Background Transcriptomic profiles can improve our understanding of the phenotypic molecular basis of biological research, and many attempts have been made to identify differen- tially expressed genes (DEGs) by microarray analysis. However, microarray analysis often suffers from many systematic errors, such as hybridization and dye-based detection bias, hampering the detection of true DEGs [1, 2]. Recently, high-throughput sequencing technology has markedly improved. RNA sequencing (RNA-seq), also called whole-transcriptome shotgun sequencing, uses next-generation sequencing to quantify the abundance of transcripts with several desirable features, such as in- creased dynamic range and freedom from a priori chosen probes [3]. Furthermore, RNA-seq is robust against systematic errors and has therefore emerged as a successful alternative to microarray analysis [4]. RNA-seq quantifies the numbers of reads aligned to particular transcripts or genes, and various approaches have been proposed to manage RNA-seq data [5]. There are two different types of statistical methods: read- count-based approaches and transformation-based ap- proaches. Read-count-based approaches assume that © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. * Correspondence: [email protected] 1 Interdisciplinary Program of Bioinformatics, Seoul National University, Seoul 08826, South Korea 2 Department of Public Health Science, Seoul National University, Seoul 08826, South Korea Full list of author information is available at the end of the article Park et al. BMC Genomics (2019) 20:540 https://doi.org/10.1186/s12864-019-5851-6

Upload: others

Post on 31-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

METHODOLOGY ARTICLE Open Access

BALLI: Bartlett-adjusted likelihood-basedlinear model approach for identifyingdifferentially expressed genes with RNA-seqdataKyungtaek Park1, Jaehoon An2, Jungsoo Gim3, Minseok Seo4,5, Woojoo Lee6, Taesung Park1,7 andSungho Won1,2,8*

Abstract

Background: Transcriptomic profiles can improve our understanding of the phenotypic molecular basis ofbiological research, and many statistical methods have been proposed to identify differentially expressed genes(DEGs) under two or more conditions with RNA-seq data. However, statistical analyses with RNA-seq data are oftenlimited by small sample sizes, and global variance estimates of RNA expression levels have been utilized as priordistributions for gene-specific variance estimates, making it difficult to generalize the methods to more complicatedsettings. We herein proposed a Bartlett-Adjusted Likelihood-based LInear mixed model approach (BALLI) to analyzemore complicated RNA-seq data. The proposed method estimates the technical and biological variances with alinear mixed-effects model, with and without adjusting small sample bias using Bartlkett’s corrections.

Results: We conducted extensive simulations to compare the performance of BALLI with those of existingapproaches (edgeR, DESeq2, and voom). Results from the simulation studies showed that BALLI correctly controlledthe type-1 error rates at various nominal significance levels and produced better statistical power and precisionestimates than those of other competing methods in various scenarios. Furthermore, BALLI was robust to variationof library size. It was also successfully applied to Holstein milk yield data, illustrating its practical value.

Conclusions;: BALLI is statistically more efficient and valid than existing methods, and we conclude that it is usefulfor identifying DEGs in RNA-seq analysis.

Keywords: Differentially expressed genes, RNA sequencing, Linear mixed model, Bartlett’s correction

BackgroundTranscriptomic profiles can improve our understandingof the phenotypic molecular basis of biological research,and many attempts have been made to identify differen-tially expressed genes (DEGs) by microarray analysis.However, microarray analysis often suffers from manysystematic errors, such as hybridization and dye-baseddetection bias, hampering the detection of true DEGs [1,2]. Recently, high-throughput sequencing technology has

markedly improved. RNA sequencing (RNA-seq), alsocalled whole-transcriptome shotgun sequencing, usesnext-generation sequencing to quantify the abundanceof transcripts with several desirable features, such as in-creased dynamic range and freedom from a priorichosen probes [3]. Furthermore, RNA-seq is robustagainst systematic errors and has therefore emerged as asuccessful alternative to microarray analysis [4].RNA-seq quantifies the numbers of reads aligned to

particular transcripts or genes, and various approacheshave been proposed to manage RNA-seq data [5]. Thereare two different types of statistical methods: read-count-based approaches and transformation-based ap-proaches. Read-count-based approaches assume that

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, andreproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link tothe Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

* Correspondence: [email protected] Program of Bioinformatics, Seoul National University, Seoul08826, South Korea2Department of Public Health Science, Seoul National University, Seoul08826, South KoreaFull list of author information is available at the end of the article

Park et al. BMC Genomics (2019) 20:540 https://doi.org/10.1186/s12864-019-5851-6

Page 2: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

observed read counts follow negative binomial distribu-tion, and generalized linear regression with a logarithmas a link function is utilized. These approaches typicallyassume that variances include biological and technicalvariances; the latter indicates variance observed amongmeasurements of the same biological unit, and theformer indicates variance between different biologicalunits, such as different subjects or different tissues ofthe same subject. If technical replicates are analyzed, ob-served read counts from technical replicates have thesame means under the same conditions. Marioni et al.(2008) demonstrated that the data follow a Poisson dis-tribution, and variances in technical replicates are ex-pected to be the same as their means for each gene [6].However, if biological replicates are available, means andvariances of read counts differ among different biologicalunits. Bullard et al. (2010) carefully examined such vari-ability and concluded that the biological variances wereusually larger than technical variances, supporting thepresence of overdispersion [7]. Thus, negative binomialdistribution has often been utilized; edgeR [8] andDESeq2 [9] are such methods. Transformation-based ap-proaches assume that the transformed read counts foreach gene follow normal distribution. For example,voom calculated proportions of read counts for eachgene per subject, and the log-transformed values werethen assumed to follow normal distribution, assumingthat the relative proportion of technical variances be-comes smaller if the read count grows larger [10].Negative-binomial distributions for read counts and

normal distributions for log-transformation of counts permillion (CPM) successfully describe distributions of RNA-seq data. However, RNA-seq is relatively expensive com-pared with microarray, and thus, further adjustment hasbeen made to handle the problem of small sample size. Ifthe sample size is small, the estimated variance can havelarge standard errors, and thus, multiple methods that in-corporate prior knowledge have been proposed. For ex-ample, variances of read counts are assumed to bepositively related to their means, and their relationshipscan be estimated by comparing the means and variancesof read counts for all genes. This can often be utilized toshrink variance parameters [11, 12]. edgeR and DESeq2estimate the overall dispersion parameter for all genes andare then combined with gene-wise dispersion parametersfor each gene using empirical Bayesian rules. Voom as-sumes that the variances of log-transformed CPM (log-cpm) are functionally related to their means. Locallyweighted scatterplot smoothing (LOWESS) curves be-tween the mean and residual variances of genes are thenutilized to weight variance estimates of each gene.Existing methods shrink the variance estimate of each

gene toward global variance estimates or use varianceestimates based on the relationships between means and

variances. Such assumptions are often very useful if thesample sizes are small. However, there are multiple fac-tors that can distort these relationships, and if they areviolated, the performance of existing approaches can beaffected. For example, the quality of data is highlydependent on the preparation steps, and unexpectednoise, such as noise from different storage periods or se-quencing organization of samples, can occur duringpreparation steps. Moreover, read counts of cancer tis-sues are more heterogeneous than those of normal tis-sues, and biological variances can be affected by diseasestatus [13]. Thus, their effects can distort the relation-ship between technical and biological variances. Multiplestudies have shown that misspecified relationships canlead to substantial biases [14, 15]. For example, varianceestimators for random effects, which are assumed to fol-low normal distribution, can be seriously biased unlessthey are normally distributed [14]. General approachesthat are not sensitive to these problems are necessary.In this article, we present new methods for identifying

DEGs with RNA-seq data, namely, BALLI and LLI. Statis-tical analyses with log-transformed read counts are oftenmore powerful than other existing methods and are rela-tively insensitive to various errors [16, 17]. Thus, we con-sidered log-cpm as response variables and used linearmixed-effects models to estimate technical and biologicalvariance. Furthermore, Bartlett-adjusted likelihood ratiotests were applied to correct the small sample bias [18].By allowing model comparisons among different models,our models enabled robust analyses for various scenarios.For our simulation studies, artificial RNA-seq data weregenerated based on real data and negative binomial distri-butions. Our studies showed that the proposed methodperformed better than existing methods. The proposedmethods were applied to Holstein milk yield data at thefalse discovery rate (FDR)-adjusted 0.1 significance leveland uniquely produced significant results. The proposedmethods were implemented as an R package and are freelydownloadable at CRAN (http://cran.r-project.org) orhttp://healthstat.snu.ac.kr /software/balli/.

MethodsNotationsWe assumed that there were M different groups, andthe averages of the expressed read counts for each genewere compared among these groups. For case-controlstudies, M = 2. We assumed that there were nm subjectsin group m and denoted the total sample size by N; thus,we got N = n1 + … + nM. We defined dummy variablesfor subject i in group m by

xmi ¼ 1 if subject i is in group m0 o:w:

�;

Park et al. BMC Genomics (2019) 20:540 Page 2 of 14

Page 3: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

m ¼ 1;…;M−1; i ¼ 1; 2;…;N :

A design matrix for group variables is defined by

X¼x11 ⋯ x M−1ð Þ1⋮ ⋱ ⋮

x1N ⋯ x M−1ð ÞN

0@

1A:

We assumed that indexes of all subjects were sorted inascending order of groups. Thus, the first n1 subjectswere in group 1; the second n2 subjects were in group 2;and so on. The effect of continuous variables can also betested, and then some or all of the columns of the designmatrix, X, become their realization. We assumed thatexpressed read counts were observed for G genes andwere denoted by rgi for gene g of subject i (g = 1, …., G).Then, the library size for subject i, Ri, was equivalent to

Ri ¼Xg

rgi . We transformed count data into the log-

transformed read counts per million (log-cpm) using thecpm function in the edgeR R package. If we denoted thenormalized Ri with the trimmed mean of the M-value[19] by R�

i , the log-cpm of gene g for subject i were de-fined by:

ygi ¼ log2rgi þ di

R�i þ 2 � di

� 106� �

;where di

¼ R�i

1N

XN

i¼1R�i

� 0:25… ð1Þ

and their vector Yg was defined as

Yg¼ yg1; yg2;…; ygN� �t

:

Technical and biological variances of ygiWe assumed that rgi followed a negative binomial distri-bution, and its mean and variance were μgi and μgi þ μ2giϕg , respectively. If we let the mean and variance of log2-rgi be λgi and s2gi , respectively, then var(ygi) can be ob-

tained by the first order approximation as follows [10]:

var ygi� �

≈ var log2rgi� �

¼ s2gi ≈ var μgi þrgi−μgiμgi

!¼ 1

μgi2var rgi� �

¼ 1μgi

þ ϕg… ð2Þ

Here, μ−1gi and ϕg indicate the variances attributable to

the technical and biological replicates, respectively. Bysecond order approximation, the technical variance, 1/μgi, can be expressed in terms of λgi and s2gi as follows:

μgi ¼ E rgi� � ¼ E 2 log2rgi

� �≈ E 2λgi þ loge2 � 2λgi � log2rgi−λgi

� �þ 12

loge2� �2 � 2λgi � log2rgi−λgi

� �2

¼ 2λgi � 1þ loge2 � E log2rgi−λgi� �þ 1

2loge2

� �2 � E log2rgi−λgi� �2� �

¼ 2λgi � 1þ 12

loge2� �2

s2gi

� �:

λgi and s2gi are functionally related, and both were esti-

mated with the method used for voom-trend [10] asfollows:

i. For all genes, g = 1,… , G, fit linear regressions, ygi¼ αg þ xtiβg þ ϵgi, and calculate ygi. Residual

variances are used as s2g . If environmental effects

affect ygi, then they should be included ascovariates.

ii. Calculate λg ¼ yg þ log2~R− log2106; where yg is

the average of ygi; ~R is the geometric mean of ðR�i

þ1Þ; and g = 1,… , G.iii. For ðλg ; s2gÞ; g ¼ 1;…;G, obtained from (i) and (ii),

fit LOWESS curve s1=2g on λg .

iv. Calculate λgi ¼ log2rgi ¼ ygi þ log2ðR�i þ 1Þ− log2

106 and apply LOWESS curve from (iii) to obtain

s1=2gi .

v. Calculate μgi by incorporating λgi and sgi to thefollowing equation:

μgi ≈ 2λgi � 1þ 12

loge2� �2

s2gi

� �… ð3Þ

Linear mixed-effects modelWe denoted a design matrix for nuisance effects, includ-ing an intercept by Z. We let bg and eg be vectors forrandom effects and measurement errors, respectively.Denoting a w ×w dimensional identity matrix by Iw, weconsidered the following linear mixed-effects model:

Yg ¼ Zαg þ Xβg þ bgþ eg ;bg∼MVNð0;ψgΣg;bÞ; eg∼MVNð0;Σg;eÞ;

Σg;b ¼μ−1g1 0 ⋯ 0

0 μ−1g2 ⋯ 0⋱

0 0 ⋯ μ−1gN

0BB@

1CCA;Σg;e ¼

σ2g1In1 0 ⋯ 0

0 σ2g2In2 ⋯ 0

⋱0 0 ⋯ σ2

gMInM

0BB@

1CCA:

Here, ψgΣg, b and σ2g indicate technical and biological

variances, respectively, according to Eq. (2), and the ran-dom effect, bg, and measurement error, eg, were used tomodel the technical and biological variances, respect-ively. The proposed linear mixed effects model may beconceptually useful for understanding the variance struc-ture of the RNA-seq data. Notably, elements of Σg, b are

Park et al. BMC Genomics (2019) 20:540 Page 3 of 14

Page 4: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

obtained from Eq. (3), and ψg and σ2g ¼ ðσ2g1;…; σ2gM ) are

estimated. Equation (2) shows that ψg becomes 1, andwe assumed that σ2g1 ¼ … ¼ σ2gM . Then, our final model

becomes

Yg ¼ Zαg þ Xβg þ bgþ eg ;bg∼MVNð0;Σg;bÞ; eg∼MVNð0; σ2gIÞ;

Σg;b ¼μ−1g1 0 ⋯ 0

0 μ−1g2 ⋯ 0⋱

0 0 ⋯ μ−1gN

0BBB@

1CCCA;

which is equivalent to

Yg ¼ Zαg þ Xβg þ e′g ; e′g∼MVNð0;Σg;b

þ σ2g IÞ; σ2g ≥0:

Bartlett-adjusted profile likelihood ratio testsThe likelihood ratio test (LRT) is very flexible and canbe utilized for various purposes such as testing two ormore variables at once. However, statistical analyses withRNA-seq data often involve few samples, and the LRTstatistic has a bias with order Op(N

−1) to its null distri-bution. In such cases, an adjustment can be applied toreduce the bias, such as the Bartlett adjustment or Cox-Reid adjustment [18, 20]. As Zucker et al. (2000) showedthat the latter estimated p values more conservative thanthe former did in linear mixed models [21], we selectedthe Bartlett-adjusted LRT for identifying DEGs, whichreduces the order of bias to Op(N

−2) and controls type-1error rates well when the sample size is small [18]. If welet βg = (βg, 1, … , βg, M − 1)

t, Vg ¼ Σg;b þ σ2gI and θg ¼ ðαg ; σ2gÞ , the likelihood for the proposed linear mixed

model is

Lðθg ; βgÞ∝jVgj−12expð− 1

2ðYg−Zαg−XβgÞtVg

−1ðYg−Zαg−XβgÞÞ:

If we let θg0 be a maximum likelihood estimate (mle)under the parameter space for the null hypothesis H0:

βg = 0, and ðθg ; βgÞ be mles of (θg,βg) under the param-

eter space for null or alternative hypothesis, the LRT forthe null hypothesis H0: βg = 0 can be obtained by

LRg ¼ −2flogLðθg0; 0Þ−logLðθg ; βgÞg∼χ2ðd f¼ M−1ÞunderH0:

The Bartlett-adjusted LRT ( LR�g ) for gene g can be

expressed by

LR�g ¼

LRg

1þ Cg= M−1ð Þ � χ2 df ¼ M−1ð Þ under H0:

Cg can be obtained based on the results of Melo et al.(2009) [22], as follows:

Cg ¼ Dg−1 −

12Mg þ 1

4Pg−

12νgτg

� �:

Here, Dg, Mg, Pg, νg, and τg are scalars, and if we let

X′ = [I − Z(ZTVg−1Z)−1ZtVg

−1]X and _X0¼ZðZtVg

−1ZÞ−1ZtVg

−2X0, they are

Dg ¼ −12tr Vg

−2� �;

Mg ¼ 2tr X0tVg−1X0

� �−1X0tVg

−3X0− _X0tVg

−2X0� �� �

;

Pg ¼ tr X0tVg−2X0 X0tVg

−1X0� �−1� �2

!;

νg ¼ −tr ZtVg−1Z

� �−1ZtVg

−2Z� �

and

τg ¼ −tr X0tVg−1X0

� �−1X0tVg

−2X0� �

:

The forms of Dg, Mg, Pg, νg, and τg depend on thestructure of Vg, and counterparts to general Vg s areshown in Additional file 1.

Parameter estimationThe log-likelihood function for our final model is givenby

logLðθg ; βgÞ ¼ C−12logjVgj− 1

2ðYg−Zαg−XβgÞtVg

−1ðYg−Zαg−XβgÞ;C: someconstants:

θg and βg can be estimated by maximizing the log-

likelihood function. Then, if we let P = Vg−1 −Vg

−1(Z,X)((Z,X)tVg

−1(Z,X))(Z,X)tVg−1, the profile log-likelihood

of σ2g becomes

lPðσ2gÞ ¼ C−12logjVgj− 1

2Yg

tPYg :

Here, Vg is a function of σ2g , and σ2g can be obtained by

maximizing lPðσ2gÞ with Fisher’s scoring method. σ2ðmÞg at

the m step was updated by

σ2 mð Þg ¼ σ2 m−1ð Þ

g þ Io m−1ð Þ� −1

U σ2 m−1ð Þg

� �

Park et al. BMC Genomics (2019) 20:540 Page 4 of 14

Page 5: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

, where Ioðm−1Þ ¼ Eð− ∂2lP

∂σ2ðm−1Þg ∂σ 2ðm−1Þ

g

TÞ and Uðσ2ðm−1Þg Þ

¼ ∂lP∂σ2ðm−1Þ

g

. We found that Fisher’s scoring method was

sometimes unsuccessful for estimating σ2ðmÞg , and in such

cases, we used Brent’s derivative-free method with theoptimize function in R [23]. It always converged and suc-cessfully estimated parameters, at least in our simula-tion. We assumed that σ2g was non-negative.

Vg¼Σg;b þ σ2g I , and if σ2g is equal to zero, Vg becomes

Σg, b. Then, βg and αg can be obtained by

ðαg

βgÞ ¼ ððZ;XÞtVg

−1

Z;X

!ÞðZ;XÞtVg

−1Y:

The Bartlett-adjusted LRT requires θg0 , which maxi-mizes the likelihood under the null hypothesis. Underthe null hypothesis, if we let P0 = Vg

−1 −Vg−1Z(ZtVg

−1Z)ZtVg

−1, the profile log-likelihood of σ2g becomes

lP0ðσ2gÞ ¼ C−12logjVgj− 1

2Yg

tP0Yg :

σ2g0 was estimated with Fisher’s scoring method, and if

we incorporated σ2g0 to Vg0 and denoted it as Vg0 , αg

could be obtained by

αg ¼ ðZtVg0−1ZÞZtVg0

−1Y:

DatasetsWe considered two real datasets consisting of samples fromunrelated Nigerian people and Holstein cattle, respectively.Nigerian subjects participated in the International HapMapProject and comprised 29 male and 40 female participants[24]. The read counts were downloaded from the ReCountwebsite [25]. Holstein data were obtained to identify genesassociated with milk yield and consisted of high- and low-milk-yielding groups, with nine and 12 samples per group,respectively [26]. Furthermore, parity and lactation periodswere available and were considered as covariates. We ob-tained the read counts from Gene Expression Omnibus(GSE60575). Based on count data, we generated simulationdata; the steps are described in Additional file 2.

ResultsSimulation studies with RNA-seq data from NigerianindividualsWe applied the proposed linear mixed models to the simu-lated data based on Nigerian individuals’ RNA-seq data andcalculated empirical type-1 error rates and statistical powerswith these models. The data were then compared withDESeq2 (v1.20.0), edgeR (v3.22.3), and voom (v3.36.5).

Table 1, Additional file 3, and Fig. 1 show results fromsimulation studies based on Nigerian individuals’ RNA-seqdata. Nigerian individuals’ RNA-seq data consisted of 52,580 genes, and after filtering genes whose total read countsacross samples were smaller than one-tenth of the samplesize, each replicate had around 10,000–10,500 genes. Em-pirical type-1 error rates and powers were estimated with50 replicates. Table 1 and Additional file 3 assumed δ = 0,and thus, their estimates indicated the empirical type-1error rates. For the proposed methods, we assumed thatψg = 1 and σ2

g1 ¼ σ2g2 , and the proposed methods with and

without Bartlett’s corrections are denoted as BALLI andLLI, respectively, for the remainder of this article. Accord-ing to Table 1 and Additional file 3, BALLI and voom al-ways controlled the nominal type-1 error rates correctly ifN was greater than or equal to 12. LLI also successfullycontrolled the nominal type-1 error rates if N was greaterthan or equal to 20. However, if N was less than 20, p valuesby LLI were inflated. edgeR showed the least performance,and the estimated type-1 error rates were always inflated at0.05, 0.01, and 0.005 nominal significance levels. Interest-ingly, DESeq2 tended to be conservative at 0.1 and 0.05 butliberal at 0.01 and 0.005 nominal significance levels. Thus,we could conclude that the proposed linear mixed modelwith Bartlett’s correction reasonably controlled the type-1error, and Bartlett’s correction was required if the samplesize was less than 20.Figure 1 shows estimated powers and precisions at the

FDR-adjusted 0.1 significance level when δ = 0.5σ or 1σ,and N = 12, 16, 20, 24, 28, 40, 64, or 68. Figure 1a and cshow the statistical power estimates, and Fig. 1b and dshow the precision. Precision indicates the proportions ofDEGs among genes for which FDR-adjusted p values areless than 0.1. According to Fig. 1a and c, LLI outperformedother methods in terms of power (Fig. 1a and c). However,it should be noted that this method showed worse precisionthan BALLI and voom if N was less than 40 (Fig. 1b and d),suggesting that LLI had larger false-positive rates than thoseof BALLI and voom when N was less than 40. The preci-sion of LLI was increased if N was sufficiently large. Interms of both power and precision, the best performancewas always obtained by BALLI. For example, when N = 20and δ = σ, the estimated power of BALLI was 0.597,followed by voom (0.548) and DESeq2 (0.399). The esti-mated precision of BALLI was 0.940, and those of voomand DESeq2 were 0.930 and 0.907, respectively. If N = 40and δ = 0.5σ, the estimated power and precision of BALLIwere 0.342 and 0.943, which were higher than those ofDESeq2 (0.205, 0.897) and voom (0.298, 0.932).Our method was also applied to simulation data based

on RNA-seq data from Holstein cows. The results weresimilar to those of the simulation data based on Nigerianpeople’s data. Additional file 4 shows that BALLI and

Park et al. BMC Genomics (2019) 20:540 Page 5 of 14

Page 6: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

Table

1Estim

ated

type

-1errorrateswith

simulationdata

basedon

Nigerianpe

ople’sRN

A-seq

data.Estim

ated

type

-1errorratesby

BALLI,DESeq

2,ed

geR,LLI,andvoom

and

their95%

confiden

celevelswereestim

ated

forN=12

,16,20

and24

.aThetype

-1errorratesaremarkedby

bold

font

iftheir95%

confiden

celevelsinclud

eor

lower

than

the

nominalsign

ificant

levelα

BALLI

DESeq

2ed

geR

LLI

voom

BALLI

DESeq

2ed

geR

LLI

voom

αN=12

N=16

0.1

0.09

217a

(0.081

73,0.102

62)

0.08

221

(0.071

44,0.092

98)

0.10

536

(0.095

95,0.114

77)

0.12695

(0.11443,0.13947)

0.09

381

(0.083

06,0.104

56)

0.08

976

(0.080

33,0.099

19)

0.08

794

(0.078

27,0.097

61)

0.11070

(0.10223,0.11916)

0.11613

(0.10522,0.12704)

0.09

618

(0.086

83,0.105

52)

0.05

0.04

485

(0.038

14,0.051

55)

0.04

374

(0.036

33,0.051

15)

0.05

495

(0.048

77,0.061

14)

0.06636

(0.05767,0.07506)

0.04

473

(0.037

98,0.051

48)

0.04

329

(0.037

64,0.048

95)

0.04

672

(0.040

41,0.053

02)

0.05762

(0.05218,0.06305)

0.06072

(0.05355,0.06790)

0.04

645

(0.040

78,0.052

12)

0.01

0.00

899

(0.006

86,0.011

12)

0.01

208

(0.009

13,0.015

03)

0.01441

(0.01186,0.01697)

0.01662

(0.01328,0.01996)

0.00

860

(0.006

53,0.010

66)

0.00

796

(0.006

43,0.009

48)

0.01

202

(0.009

80,0.014

24)

0.01393

(0.01212,0.01575)

0.01349

(0.01118,0.01580)

0.00

850

(0.007

06,0.009

93)

0.005

0.00

456

(0.003

28,0.005

84)

0.00741

(0.00531,0.00952)

0.00891

(0.00709,0.01072)

0.00942

(0.00720,0.01163)

0.00

424

(0.003

01,0.005

48)

0.00

381

(0.003

05,0.004

58)

0.00695

(0.00555,0.00835)

0.00809

(0.00692,0.00925)

0.00714

(0.00575,0.00853)

0.00

397

(0.003

26,0.004

68)

αN=20

N=24

0.1

0.08

698

(0.076

89,0.097

07)

0.08

700

(0.076

53,0.097

47)

0.10962

(0.10028,0.11896)

0.10

564

(0.094

42,0.116

86)

0.09

289

(0.082

46,0.103

32)

0.08

480

(0.076

51,0.093

09)

0.08

744

(0.078

02,0.096

86)

0.10874

(0.10024,0.11724)

0.10

067

(0.091

33,0.110

00)

0.09

124

(0.081

96,0.100

51)

0.05

0.04

113

(0.034

55,0.047

71)

0.04

607

(0.038

70,0.053

44)

0.05774

(0.05147,0.06401)

0.05

408

(0.046

33,0.061

83)

0.04

517

(0.038

44,0.051

91)

0.03

919

(0.034

50,0.043

87)

0.04

518

(0.039

22,0.051

14)

0.05881

(0.05343,0.06419)

0.04

960

(0.044

02,0.055

19)

0.04

390

(0.038

49,0.049

31)

0.01

0.00

752

(0.005

48,0.009

55)

0.01

175

(0.008

90,0.014

60)

0.01381

(0.01152,0.01609)

0.01

157

(0.008

79,0.014

36)

0.00

848

(0.006

43,0.010

53)

0.00

647

(0.005

50,0.007

44)

0.01

050

(0.008

73,0.012

28)

0.01317

(0.01163,0.01472)

0.00

955

(0.008

17,0.010

93)

0.00

778

(0.006

48,0.009

07)

0.005

0.00

368

(0.002

48,0.004

89)

0.00

669

(0.004

75,0.008

63)

0.00789

(0.00643,0.00934)

0.00

605

(0.004

28,0.007

83)

0.00

420

(0.003

05,0.005

35)

0.00

293

(0.002

48,0.003

38)

0.00

575

(0.004

77,0.006

72)

0.00728

(0.00636,0.00819)

0.00

474

(0.004

00,0.005

47)

0.00

363

(0.002

95,0.004

31)

Park et al. BMC Genomics (2019) 20:540 Page 6 of 14

Page 7: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

LLI controlled the nominal type-1 error rates if N ≥ 8and if N ≥ 20, respectively. Power estimates were highestfor LLI, but it always had a smaller precision value thanthose of BALLI and voom (Additional file 5).

Simulation studies with randomly generated RNA-seqdataRNA-seq data are generally known to follow negativebinomial distribution, and we conducted simulationstudies with RNA-seq data generated from negative

binomial distributions. First, we assumed that librarysizes were the same among subjects. The overalltrend of the estimated type-1 error rate was similarto that of simulation studies based on real RNA-seqdata. Estimated type-1 error rates by voom andBALLI usually maintained the nominal significancelevels (Table 2 and Additional file 6). P values ob-tained from LLI and edgeR tended to be inflated.DESeq2 generally showed deflation of type-1 errorrates at a 0.1 nominal significance level. Second, we

Fig. 1 Estimated powers and precisions with simulation data based on Nigerian people’s RNA-seq data. Statistical powers of BALLI, DESeq2,edgeR, LLI, and voom were estimated at FDR-adjusted 0.1 significance level when δ = 0.5σ or 1σ and N = 12, 16, 20, 24, 28, 40, 64, and 68.a Estimated power when δ=0.5σ. b Estimated precision when δ=0.5σ. c Estimated power when δ=1σ. d Estimated precision when δ=1σ

Park et al. BMC Genomics (2019) 20:540 Page 7 of 14

Page 8: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

Table

2Estim

ated

type

-1errorrateswith

simulationdata

basedon

simulated

RNA-seq

data

from

negativebino

miald

istribution.Estim

ated

type

-1errorratesby

BALLI,DESeq

2,ed

geR,LLI,andvoom

andtheir95%

confiden

celevelswereestim

ated

forN=12

,16,20

,and

24.aThetype

-1errorratesaremarkedby

bold

font

iftheir95%

confiden

celevels

includ

eor

lower

than

theno

minalsign

ificant

levelα

BALLI

DESeq

2ed

geR

LLI

voom

BALLI

DESeq

2ed

geR

LLI

voom

αN=12

N=16

0.1

0.09

550a

(0.093

94,

0.09

706)

0.08

417(0.082

78,

0.08

556)

0.12172(0.11984,

0.12360)

0.13198(0.13030,

0.13367)

0.09

912(0.097

67,

0.10

057)

0.09

211(0.090

97,

0.09

325)

0.08

428(0.083

07,

0.08

549)

0.12028(0.11911,

0.12145)

0.11853(0.11729,

0.11977)

0.09

977(0.098

68,

0.10

086)

0.05

0.04

863(0.047

57,

0.04

969)

0.04

334(0.042

43,

0.04

425)

0.05604(0.05492,

0.05716)

0.07063(0.06926,

0.07200)

0.05

013(0.048

99,

0.05

126)

0.04

508(0.044

29,

0.04

587)

0.04

333(0.042

61,

0.04

404)

0.05616(0.05528,

0.05703)

0.06345(0.06239,

0.06452)

0.05

013(0.049

37,

0.05

089)

0.01

0.00

989(0.009

39,

0.01

039)

0.00

997(0.009

54,

0.01

040)

0.01267(0.01209,

0.01326)

0.01858(0.01798,

0.01919)

0.00

997(0.009

57,

0.01

037)

0.00

893(0.008

47,

0.00

939)

0.01

008(0.009

61,

0.01

056)

0.01276(0.01225,

0.01328)

0.01469(0.01419,

0.01519)

0.00

996(0.009

69,

0.01

024)

0.005

0.00

524(0.004

92,

0.00

557)

0.00549(0.00517,

0.00581)

0.00689(0.00650,

0.00729)

0.01043(0.00992,

0.01094)

0.00

488(0.004

56,

0.00

519)

0.00

445(0.004

14,

0.00

475)

0.00543(0.00515,

0.00572)

0.00682(0.00653,

0.00711)

0.00803(0.00759,

0.00846)

0.00

506(0.004

83,

0.00

528)

αN=20

N=24

0.1

0.09

198(0.090

51,

0.09

346)

0.08

419(0.082

82,

0.08

557)

0.11844(0.11718,

0.11971)

0.11073(0.10925,

0.11221)

0.10

023(0.099

08,

0.10

139)

0.09

276(0.091

56,

0.09

395)

0.08

521(0.083

86,

0.08

657)

0.11846(0.11686,

0.12005)

0.10867(0.10747,

0.10987)

0.09

963(0.098

33,

0.10

092)

0.05

0.04

463(0.043

76,

0.04

549)

0.04

296(0.042

05,

0.04

387)

0.05748(0.05625,

0.05871)

0.05780(0.05662,

0.05898)

0.05

048(0.049

36,

0.05

159)

0.04

468(0.043

94,

0.04

541)

0.04

268(0.041

88,

0.04

348)

0.06050(0.05953,

0.06147)

0.05518(0.05427,

0.05608)

0.04

907(0.048

15,

0.05

000)

0.01

0.00

861(0.008

12,

0.00

911)

0.00

986(0.009

36,

0.01

036)

0.01271(0.01204,

0.01338)

0.01328(0.01267,

0.01388)

0.01

016(0.009

54,

0.01

077)

0.00

844(0.008

10,

0.00

878)

0.00

940(0.008

99,

0.00

982)

0.01242(0.01203,

0.01281)

0.01212(0.01167,

0.01257)

0.00

957(0.009

07,

0.01

007)

0.005

0.00

457(0.004

25,

0.00

489)

0.00536(0.00501,

0.00572)

0.00688(0.00647,

0.00730)

0.00707(0.00661,

0.00753)

0.00

521(0.004

84,

0.00

559)

0.00

405(0.003

86,

0.00

425)

0.00

497(0.004

69,

0.00

525)

0.00662(0.00639,

0.00685)

0.00625(0.00595,

0.00655)

0.00

474(0.004

40,

0.00

507)

Park et al. BMC Genomics (2019) 20:540 Page 8 of 14

Page 9: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

considered the effects of library size variance on stat-istical analyses. Data with unequal library sizes amongsubjects were generated by negative binomial distribu-tion whose mean parameters (agi) were the product ofmean estimates, under the equal library size assump-tion, and random numbers from U(u, 2 − u), whereu = 0.2, 0.4, 0.6, 0.8, or 1, and dispersion parameters

were estimated from ð0:2þ a−1=2gi Þ2 � δg , where 40=δg� χ240 . If u became smaller, the library size had largervariances. Figure 2 shows the estimated type-1 errorrates at the 0.05 significance level according to differ-ent choices of u. Figure 2 shows that voom was sensi-tive to the amount of library size variation andbecame conservative in the context of large librarysize variation. Compared with voom, BALLI and LLIwere robust with regard to u. The estimated type-1error rates of LLI were affected by sample size. If Nwas larger than or equal to 40, LLI controlled thetype-1 error rates the most correctly and was not af-fected by the library size variation. BALLI was slightlyconservative, but the amount remained constant. Re-sults at the 0.005 significance level are provided inAdditional file 7, and the general pattern was similarto that in Fig. 2, except that DESeq2 was conservativeat a significance level of 0.05 and liberal at 0.005

(Additional file 7). Therefore, we could conclude thatthe performances of BALLI and LLI were robust, andwe recommend using BALLI if 10 ≤N ≤ 40, and LLI ifN > 40.Figure 3 and Additional file 8 show the estimated statis-

tical powers and precision according to different choices ofu. BALLI usually had the best estimated power and preci-sion, as was observed in simulation studies based on realRNA-seq data. For example, when u = 0.2, N = 28, and δ =0.5σ, the estimated power by BALLI was 0.438, whereasthose for DESeq2 and voom were 0.317 and 0.352, respect-ively (Fig. 3a). Results when u = 0.2 and δ = 1σ in Fig. 3c arevery similar as those for Fig. 3a. Figure 3b and d also showthat BALLI achieves the best estimated precisions. Similarpatterns were observed when u = 0.4, 0.6, 0.8, and 1 (Add-itional file 8). In summary, we can conclude that BALLIshows better performance than that of other methods.

DEGs of Holstein milk dataHolstein milk data of 21 Holstein cows were generatedto detect genes related to the daily productivity of milk.High and low milk yields were considered the primaryexposure variables, and parity and lactation period wereincluded as covariates, as in the original study [26]. Inthis study, 12 tentative DEGs were chosen and

Fig. 2 Effect of varying library sizes on the type-1 error rates. Type-1 error rates were estimated by BALLI, DESeq2, edgeR, LLI, and voom whenu = 0.2, 0.4, 0.6, 0.8, and 1 and sample size (N) is 12 (a), 16 (b), 20 (c), 24 (d), 28 (e), 40 (f), 64 (g), and 68 (h) at the 0.05 nominal significance level

Park et al. BMC Genomics (2019) 20:540 Page 9 of 14

Page 10: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

technically validated using quantitative real-time poly-merase chain reaction (qRT-PCR). qRT-PCR was con-ducted with QuantiTect SYBR Green RT-PCR MasterMix (Qiagen, Valencia, CA, USA), and a 7500 Fast Se-quence Detection System (Applied Biosystems, FosterCity, CA, USA) was used to confirm whether the 12 ten-tative genes were true DEGs. Among the 12 genes, nine(TOX4, HNRNPL, SPTSSB, NOS3, C25H16orf88,KALRN, SLC4A1, NLN, and PMCH) were significantlyvalidated. According to Seo et al. (2016), however, noDEGs, including the nine genes, were found at an FDR

0.1 significance level by DESeq2, voom, and theirmethods owing to the lack of statistical power [26]. Ourproposed methods and existing methods (DESeq2,edgeR, and voom) were applied to the data analysis. LLIonly detected significant differences for the TOX4 genebetween the high and low milk yield groups at the FDR-adjusted 0.1 significance level, but other methods didnot detect any significant genes. The FDR-adjusted pvalue of TOX4 by BALLI was 0.1267, which was muchsmaller than those of DESeq2, edgeR, and voom. Table 3shows p values for the nine genes, including TOX4. P

Fig. 3 Effect of varying library sizes on the statistical power and precision. Statistical powers and precisions for BALLI, DESeq2, edgeR, LLI, andvoom were empirically estimated at FDR-adjusted 0.1 significance level when u = 0.2, δ = 0.5σ or 1σ and sample size (N) is 12, 16, 20, 24, 28, 40,64, and 68. a Estimated power when u=0.2 and δ=0.5σ. b Estimated precision when u=0.2 and δ=0.5σ. c Estimated power when u=0.2 and δ=1σ.d Estimated precision when u=0.2 and δ=1σ

Park et al. BMC Genomics (2019) 20:540 Page 10 of 14

Page 11: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

values for most of the nine genes obtained by LLI andBALLI were smaller than those obtained by othermethods. The nine genes were not significantly affectedby parity or lactation period (Additional file 9). We alsoanalyzed all genes by the proposed methods; Fig. 4shows the number of genes that were significant at the0.001 nominal significance level. There were no DEGsthat were commonly significant only for all existingmethods (DESeq2, edgeR and voom). Three genes, in-cluding HNRNPL, were detected as DEGs by only BALLIand LLI (Fig. 4). Table 4 shows 12 genes that were com-monly significant by BALLI, DESeq2, edgeR, LLI, andvoom at the 0.005 nominal significance level. Of the 12

genes, all genes except C4BPA had the lowest p valuesin LLI, and three genes had lower p values in BALLIthan in DESeq2, edgeR, and voom. This analysis withBALLI took 153.35 s using an Intel Xeon E7–4820 2.00GHz processor, which was quite a bit longer than othermethods, voom (2.81 s), DESeq2 (27.21 s) and edgeR(33.25 s) (Additional file 10). However, BALLI can con-duct the multi-threaded analyses with a simple optionunlike other methods and analyses of whole genes canbe completed within a reasonable time. For example, theanalyses with 20 cores took only 11.99 s under the sameconditions (Additional file 10). Simulation studies re-vealed that LLI tended to be liberal, with the results

Table 3 True DEG analysis results of Holstein milk data. Holstein milk data was analyzed by BALLI, DESeq2, edgeR, LLI, and voomand their p values (FDRs) are provided

BALLI DESeq2 edgeR LLI voom

TOX4 1.058 × 10−5 (1.267 × 10− 1) 2.949 × 10−4 (6.298 × 10− 1) 2.797 × 10−2 (1) 7.476 × 10−7 (8.947 × 10− 3) 4.980 × 10− 3 (9.999 × 10− 1)

HNRNPL 3.913 × 10− 4 (9.222 × 10− 1) 1.551 × 10− 3 (9.997 × 10− 1) 1.684 × 10− 1 (1) 6.796 × 10− 5 (1.787 × 10− 1) 4.677 × 10− 2 (9.999 × 10− 1)

SPTSSB 4.457 × 10− 4 (9.222 × 10− 1) 2.660 × 10− 3 (9.997 × 10− 1) 2.160 × 10− 4 (1) 8.686 × 10− 5 (1.787 × 10− 1) 1.037 × 10− 3 (9.999 × 10− 1)

NOS3 4.676 × 10−4 (9.222 × 10− 1) 2.396 × 10− 4 (6.928 × 10− 1) 2.549 × 10− 4 (1) 8.957 × 10− 5 (1.787 × 10− 1) 8.693 × 10− 4 (9.999 × 10− 1)

SLC4A1 2.145 × 10−2 (9.999 × 10− 1) 8.753 × 10− 2 (9.997 × 10− 1) 1.025 × 10− 1 (1) 9.856 × 10− 3 (7.179 × 10− 1) 7.579 × 10− 2 (9.999 × 10− 1)

NLN 9.513 × 10− 2 (9.999 × 10− 1) 3.028 × 10− 1 (9.997 × 10− 1) 3.789 × 10− 1 (1) 6.084 × 10− 2 (9.774 × 10− 1) 1.214 × 10− 1 (9.999 × 10− 1)

KALRN 9.792 × 10− 2 (9.999 × 10− 1) 8.943 × 10− 2 (9.997 × 10− 1) 8.815 × 10− 2 (1) 6.307 × 10− 2 (9.790 × 10− 1) 1.054 × 10− 1 (9.999 × 10− 1)

PMCH 1.635 × 10− 1 (9.999 × 10− 1) 2.225 × 10− 1 (9.997 × 10− 1) 2.353 × 10− 1 (1) 1.176 × 10− 1 (9.999 × 10− 1) 1.758 × 10− 1 (9.999 × 10− 1)

C25H16orf88 1.765 × 10− 1 (9.999 × 10− 1) 1.494 × 10− 1 (9.997 × 10− 1) 2.627 × 10− 1 (1) 1.289 × 10− 1 (9.999 × 10− 1) 1.516 × 10− 1 (9.999 × 10− 1)

Fig. 4 Significant genes of Holstein milk data. Venn diagram was provided with significant genes at the 0.001 nominal significance level by BALLI,DESeq2, edgeR, LLI, and voom

Park et al. BMC Genomics (2019) 20:540 Page 11 of 14

Page 12: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

likely to be inflated. However, BALLI controlled thenominal significance level, and p values by BALLI wereexpected to be statistically valid. Therefore, we con-cluded that the proposed method, BALLI, worked wellfor real data analysis.

DiscussionIn this article, we suggested new methods, designatedBALLI and LLI, for identifying DEGs with RNA-seqdata. We assumed that log-cpm values of read countsasymptotically followed normal distributions, and thelinear mixed-effects model with Bartlett’s correction wasproposed. The proposed methods were compared withexisting methods, such as DESeq2, edgeR, and voom,with extensive simulation studies. According to our re-sults, negative-binomial-based approaches often failed topreserve the nominal type-1 error rates. For example, pvalues from edgeR were inflated. DESeq2 tended to beconservative and suffered from large false-negative rates.However, the proposed method with Bartlett’s correc-tion, BALLI, preserved the nominal type-1 error ratesand was the most powerful method other than LLI. Un-less sample sizes were small, LLI controlled the type-1error rates as well and was the most powerful method.Therefore, we recommend using LLI if the sample size issufficiently large (e.g., larger than 40); otherwise, it isbetter to use BALLI.Furthermore, we evaluated the effects of library size

variations on statistical analyses. We found that librarysize variance could affect the estimated type-1 errorrates, and the effect was the largest for voom. Librarysizes are affected by multiple factors, such as the amountof mRNA and the sequencing instrument, which cangenerate substantial variation among library sizes for

subjects. Our simulation studies showed that BALLI wasrobust with regard to library size variation in samples ofvarious sizes and was a reasonable choice if large librarysize variance was observed.The proposed methods assumed that log-cpm values

of read counts asymptotically followed a normal distri-bution and that their variances were approximatelyequal to 1/μ + ϕ with first order approximation. Inaddition, voom considered log-cpm value as a responseand assumed that they were normally distributed. How-ever, our simulation studies revealed the superiority ofthe proposed methods compared with voom, which wasfound to be attributable to their different variance struc-tures. For the proposed methods, 1/μ + ϕ was derivedfrom the first order approximation of the negative bino-mial distribution and thus may be a natural assumptionfor RNA-seq data. Furthermore, for 1/μ + ϕ, ϕ obviouslyindicates the overdispersion parameter, and biologicaland technical variances can be estimated with BALLI.However, voom assumes ϕ/μ, and the amount attribut-able to biological or technical variances cannot be clearlydefined.We also suggested the most flexible and general linear

mixed model for log-cpm. The proposed model assumedthat the variance of log-cpm was φ/μ + ϕ and had themost generalized variance parameter space. Incorpor-ation of φ = 1 yielded BALLI and LLI, and ϕ = 0 yieldedvoom. We found that BALLI was the most efficient inthe considered scenarios; however, in real data analyses,various factors affected variance structure. For example,subjects with different ethnicities can cause φ to belarger than 1, and thus, a better model may differ ac-cording to RNA-seq data. φ and ϕ can be estimatedwith the proposed linear mixed model by implement-ing only a simple modification, and thus, we can

Table 4 Significant genes in all methods of Holstein milk data. Gene lists of Holstein milk data siginificant in nominal 0.005significant level for all methods (BALLI, DESeq2, edgeR, LLI, and voom) and their p values (FDRs) are provided

BALLI DESeq2 edgeR LLI voom

SPTSSB 4.457 × 10−4 (9.222 × 10− 1) 2.660 × 10− 3 (9.997 × 10− 1) 2.160 × 10− 4 (1) 8.686 × 10− 5 (1.787 × 10− 1) 1.037 × 10− 3 (9.999 × 10− 1)

NOS3 4.676 × 10− 4 (9.222 × 10− 1) 2.396 × 10− 4 (6.298 × 10− 1) 2.549 × 10− 4 (1) 8.957 × 10− 5 (1.787 × 10− 1) 8.693 × 10− 4 (9.999 × 10− 1)

FXYD3 7.198 × 10−4 (9.507 × 10− 1) 2.813 × 10− 4 (6.298 × 10− 1) 5.182 × 10− 4 (1) 1.513 × 10− 4 (2.012 × 10− 1) 2.097 × 10− 3 (9.999 × 10− 1)

SPESP1 1.103 × 10− 3 (9.507 × 10− 1) 4.834 × 10− 3 (9.997 × 10− 1) 1.669 × 10− 3 (1) 2.579 × 10−4 (2.385 × 10− 1) 4.001 × 10− 3 (9.999 × 10− 1)

CHST1 1.339 × 10− 3 (9.507 × 10− 1) 8.624 × 10− 4 (9.383 × 10− 1) 1.872 × 10− 3 (1) 3.166 × 10− 4 (2.385 × 10− 1) 6.108 × 10− 4 (9.999 × 10− 1)

LEPREL1 1.387 × 10− 3 (9.507 × 10− 1) 2.554 × 10− 3 (9.997 × 10− 1) 3.297 × 10− 3 (1) 3.334 × 10− 4 (2.385 × 10− 1) 1.487 × 10− 3 (9.999 × 10− 1)

JUB 1.486 × 10− 3 (9.507 × 10− 1) 1.755 × 10− 3 (9.997 × 10− 1) 2.445 × 10− 3 (1) 3.674 × 10− 4 (2.385 × 10− 1) 2.804 × 10− 3 (9.999 × 10− 1)

MIA 1.509 × 10− 3 (9.507 × 10− 1) 4.210 × 10− 4 (6.298 × 10− 1) 2.247 × 10− 3 (1) 3.666 × 10− 4 (2.385 × 10− 1) 2.432 × 10− 3 (9.999 × 10− 1)

C4BPA 2.241 × 10− 3 (9.999 × 10− 1) 3.190 × 10− 4 (6.298 × 10− 1) 1.307 × 10− 3 (1) 6.000 × 10− 4 (2.951 × 10− 1) 2.866 × 10− 3 (9.999 × 10− 1)

CLDN6 2.653 × 10− 3 (9.999 × 10− 1) 1.282 × 10−3 (9.997 × 10− 1) 2.414 × 10− 3 (1) 7.377 × 10− 4 (3.270 × 10− 1) 1.542 × 10− 3 (9.999 × 10− 1)

PALMD 3.737 × 10− 3 (9.999 × 10− 1) 1.151 × 10− 3 (9.997 × 10− 1) 2.648 × 10−3 (1) 1.127 × 10− 3 (3.776 × 10− 1) 3.032 × 10− 3 (9.999 × 10− 1)

KLK12 3.840 × 10− 3 (9.999 × 10− 1) 2.766 × 10− 3 (9.997 × 10− 1) 9.826 × 10− 4 (1) 1.225 × 10− 3 (3.776 × 10− 1) 3.162 × 10− 3 (9.999 × 10− 1)

Park et al. BMC Genomics (2019) 20:540 Page 12 of 14

Page 13: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

choose the best model using AIC or LRTs. The se-lected models can then be utilized to identify DEGs.This model was implemented as an R package andcan be downloaded from CRAN (http://cran.r-project.org) or http://healthstat.snu.ac.kr/software/balli/. Fur-thermore, in contrast to methods based on a general-ized linear mixed model such as MACAU [27], theproposed methods can be easily extended to variousscenarios with a simple modification. For example, re-peatedly observed data or multivariate phenotypes canbe analyzed by adding some random effects. Maximiz-ing the likelihood for negative binomial or Poissondistributions with random effects is computationallyintensive, but the proposed methods can easily obtainvariance parameter estimates using existing R pack-ages, such as lme4 and nlme.With simulation studies for various scenarios, we

showed that the proposed methods were usually themost efficient. However, Bartlett’s correction seemsinappropriate when N ≤ 6 (Additional files 3, 4, and 6)and the corrected LR statistic was smaller thanexpected in some cases, especially in simulation stud-ies based on a negative binomial distribution (Table 2).Further studies are necessary to adjust this. Addition-ally, results from simulation studies obviouslydepended on various factors. Our results were ob-tained from simulation data based on RNA-seq datafrom Nigerian individuals and Holstein cows RNA-seqdata and random samples from negative binomial dis-tributions, but any systematic differences in RNA-seqdata could generate different results, depending on se-quencing errors or differences in preparation steps.Multiple studies have revealed some possible differ-ences in these relationships, and our conclusionsbased on simulation studies could be limited to theconsidered scenarios. However, despite such limita-tions, we believe that our results illustrate the prac-tical value of the proposed methods. Further studiesare needed to confirm our findings and expand on thework presented herein.

ConclusionsIn this article, we proposed likelihood-based linearmixed model approaches with and without Bartlett’s cor-rection to analyze more complicated RNA-seq data. Theproposed methods consider log-cpm values of genes asresponse variables, and technical and biological vari-ances are estimated with a linear mixed model. Accord-ing to our simulation studies and real data analysis, ourmethods are statistically more efficient than existingmethods and correctly control the type-1 error rates.We found that the statistical performance of ourmethod, BALLI and LLI, depends on the sample size; we

recommend using LLI if the sample size is larger than40 and otherwise using BALLI.

Additional files

Additional file 1: The forms of Cg's components depending on thestructure of Vg. (DOCX 17 kb)

Additional file 2: Steps of generating simulation data. (DOCX 18 kb)

Additional file 3: Estimated type-1 error rates with simulation data for N= 4, 6, 8, 28, 40, 64, and 68 based on Nigerian people’s data. (DOCX 21kb)

Additional file 4: Estimated type-1 error rates with simulation data for N= 4, 6, 8, 12, 16 and 20 based on Holstein cow’s data. (DOCX 20 kb)

Additional file 5: Estimated powers and precisions with simulationdata when δ = 0.5σ or 1σ and N = 12, 16 and 20 based on Holsteincow’s data. (DOCX 197 kb)

Additional file 6: Estimated type-1 error rates with simulation data for N= 4, 6, 8, 28, 40, 64, and 68 based on simulated data from negative bino-mial distribution. (DOCX 21 kb)

Additional file 7: Effect of varying library sizes on the type-1 error rateswhen u = 0.2, 0.4, 0.6, 0.8, and 1 and N = 12, 16, 20, 24, 28, 40, 64, or 68at the 0.005 nominal significance level. (DOCX 371 kb)

Additional file 8: Effect of varying library sizes on the statistical powerand precision when u = 1, 0.8, 0.6, or 0.4, δ = 1σ and N = 12, 16, 20, 24,28, 40, 64, or 68. (DOCX 593 kb)

Additional file 9: Analysis of covariables, parity and lactation period, fortrue nine DEGs of Holstein cow's data. (DOCX 18 kb)

Additional file 10: Computation time for each method when analyzingHolstein cow’s data. (DOCX 14 kb)

AbbreviationsBALLI: Bartlett-Adjusted Likelihood based LInear mixed model approach;CPM: Counts Per Million; DEGs: Differentially Expressed Genes; FDR: FalseDiscovery Rate; Log-CPM: Log-transformed CPM; LOWESS: Locally WEightedScatterplot Smoothing; LRT: Likelihood Ratio Test; MLE: Maximum LikelihoodEstimate; qRT-PCR: quantitative Real Time Polymerase Chain Reaction; RNA-seq: RNA sequencing

AcknowledgementsSungho Won and Taesung Park contributed equally to this work ascorresponding authors.

Authors’ contributionsKP, TP, and SW designed and led the study, and wrote the manuscript. KPdeveloped BALLI R package. JA, JG, MS, and WL advised on statisticalmethods and data analysis. All authors have read and approved the finalversion of the manuscript.

FundingThis research was supported by a grant of the Korea Health Technology R&DProject through the Korea Health Industry Development Institute (KHIDI),funded by the Ministry of Health & Welfare, Republic of Korea (HI16C2037)and the National Research Foundation of Korea (2017M3A9F3046543). Thefunding body was not involved in the study design, data collection, analysisand interpretation, and writing the manuscript.

Availability of data and materialsWe downloaded Nigerian samples from ReCount website (http://bowtie-bio.sourceforge.net/recount/countTables/montpick_count_table.txt) and Holsteinsamples from Gene Expression Omnibus (GSE60575).

Ethics approval and consent to participateData used in this article were publicly available.

Consent for publicationNot applicable.

Park et al. BMC Genomics (2019) 20:540 Page 13 of 14

Page 14: BALLI: Bartlett-adjusted likelihood-based linear model approach …s-space.snu.ac.kr/bitstream/10371/156815/1/12864_2019... · 2019-07-09 · METHODOLOGY ARTICLE Open Access BALLI:

Competing interestsThe authors declare that they have no competing interests.

Author details1Interdisciplinary Program of Bioinformatics, Seoul National University, Seoul08826, South Korea. 2Department of Public Health Science, Seoul NationalUniversity, Seoul 08826, South Korea. 3Department of Biomedical Science,Chosun University, Gwangju 61452, South Korea. 4Channing Division ofNetwork Medicine, Brigham and Women’s Hospital, Boston, MA, USA.5Department of Medicine, Harvard Medical School, Boston, MA, USA.6Department of Statistics, Inha University, Incheon 22212, South Korea.7Department of Statistics, Seoul National University, Seoul 08826, SouthKorea. 8Institute of Health and Environment, Seoul National University, Seoul08826, South Korea.

Received: 23 October 2018 Accepted: 28 May 2019

References1. Okoniewski MJ, Miller CJ. Hybridization interactions between probesets in

short oligo microarrays lead to spurious correlations. BMC Bioinf. 2006;7:276.2. Dobbin KK, Kawasaki ES, Petersen DW, Simon RM. Characterizing dye bias in

microarray experiments. Bioinformatics. 2005;21(10):2430–7.3. Zhao S, Fung-Leung W-P, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq

and microarray in transcriptome profiling of activated T cells. PLoS One.2014;9(1):e78644.

4. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping andquantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–8.

5. Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N,Keime C, Marot G, Castel D, Estelle J, et al. A comprehensive evaluation ofnormalization methods for Illumina high-throughput RNA sequencing dataanalysis. Brief Bioinform. 2013;14(6):671–83.

6. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: anassessment of technical reproducibility and comparison with geneexpression arrays. Genome Res. 2008;18(9):1509–17.

7. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methodsfor normalization and differential expression in mRNA-Seq experiments.BMC Bioinf. 2010;11(1):94.

8. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package fordifferential expression analysis of digital gene expression data.Bioinformatics. 2010;26(1):139–40.

9. Love MI, Huber W, Anders S. Moderated estimation of fold change anddispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550.

10. Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linearmodel analysis tools for RNA-seq read counts. Genome Biol. 2014;15(2):R29.

11. Robinson MD, Smyth GK. Moderated statistical tests for assessing differencesin tag abundance. Bioinformatics. 2007;23(21):2881–7.

12. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied tothe ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98(9):5116–21.

13. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis ofmultifactor RNA-Seq experiments with respect to biological variation.Nucleic Acids Res. 2012;40(10):4288–97.

14. Litière S, Alonso A, Molenberghs G. The impact of a misspecified random-effects distribution on the estimation and the performance of inferentialprocedures in generalized linear mixed models. Stat Med. 2008;27(16):3125–44.

15. Chavance M, Escolano S. Misspecification of the covariance structure ingeneralized linear mixed models. Stat Methods Med Res. 2016;25(2):630–43.

16. Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages fordetecting differential expression in RNA-seq studies. Brief Bioinform. 2015;16(1):59–70.

17. Soneson C, Delorenzi M. A comparison of methods for differentialexpression analysis of RNA-seq data. BMC Bioinf. 2013;14:91.

18. Bartlett MS. Properties of sufficiency and statistical tests. Proc R Soc Lond AMath Phys Sci. 1937;160(901):268–82.

19. Robinson MD, Oshlack A. A scaling normalization method for differentialexpression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.

20. Cox DR, Reid N. Parameter orthogonality and approximate conditionalinference. J R Stat Soc Ser B Methodol. 1987;49(1):1–18.

21. Zucker DM, Lieberman O, Manor O. Improved small sample inference in themixed linear model: Bartlett correction and adjusted likelihood. J R Stat SocSeries B Stat Methodol. 2000;62(4):827–38.

22. Melo TF, Ferrari SL, Cribari-Neto F. Improved testing inference in mixedlinear models. Comput Stat Data Anal. 2009;53(7):2573–82.

23. Brent RP. Algorithms for minimization without derivatives; 1973.24. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras

JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanismsunderlying human gene expression variation with RNA sequencing. Nature.2010;464(7289):768–72.

25. Frazee AC, Langmead B, Leek JT. ReCount: a multi-experiment resource ofanalysis-ready RNA-seq gene count datasets. BMC Bioinf. 2011;12:449.

26. Seo M, Kim K, Yoon J, Jeong JY, Lee H-J, Cho S, Kim H. RNA-seq analysis fordetecting quantitative trait-associated genes. Sci Rep. 2016;6:24375.

27. Sun S, Hood M, Scott L, Peng Q, Mukherjee S, Tung J, Zhou X. Differentialexpression analysis for RNAseq using Poisson mixed models. Nucleic AcidsRes. 2017;45(11):e106.

Publisher’s NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations.

Park et al. BMC Genomics (2019) 20:540 Page 14 of 14