breedr - an open statistical package to analyse genetic ... conference... ·...
TRANSCRIPT
breedRAn open statistical package to analysegenetic data (WP6)http://famuvie.github.io/breedR/
Facundo Muñoz4th–6th april 2016
Outline
1. Introduction2. Additive-genetic effects3. Environmental effects4. Competition effects5. Longitudinal data6. Genomic data7. Multi-environment trials8. Multi-trait models9. Simulation framework10. Remote computing
1 Introduction
What is breedR
R-package implementing statistical models specificallytailored to the analysis of forest genetic resourcesA inference tool for Linear Mixed Models, with facilities fortypical needsbreedR acts as an interface providing the means to:
1 Combine any number of prefabricated model componentsinto a larger model
2 Compute automatically incidence and covariance matricesfrom a few input parameters
3 Fit the model4 Plot data and results, and perform model diagnostics
Installation
2 alternatives:
Project web page http://famuvie.github.io/breedR/Set up this URL as a package repository in .Rprofile(detailed instructions on the web)install.packages('breedR')Not possible to use CRAN (or yes?) due to closed-sourceBLUPF90 programsStable version, with automatic updates
GitHub dev-site https://github.com/famuvie/breedRif( !require(devtools) )install.packages('devtools')devtools::install_github('famuvie/breedR')Development version, latest features, more inestable, manualupdates
Getting started
These slides show WHAT can be done with breedRFor HOW to perform these analyses, refer to the website:
http://famuvie.github.io/breedR/
Where to find help
Package’s help: help(package = breedR)Help pages ?remlf90Code demos demo(topic, package = 'breedR') (omittopic for a list)Vignettes vignette(package = 'breedR') (pkg and wiki)
Wiki pagesGuides, tutorials, FAQ
Mailing list http://groups.google.com/group/breedrQuestions and debates about usage and interface
Issues pageBug reportsFeature requests
License
breedR is FOSS. Licensed GPL-3RShowDoc('LICENSE', package = 'breedR')
You can use and distribute breedR for any purposeYou can modify it to suit your needs
we encourage to!please consider contributing your improvementsyou can distribute your modified version under the GPL
However, breedR makes (intensive) use of the BLUPF90 suiteof Fortran programs, which are for free but not free (rememberCRAN?)
Fitting Linear Mixed Models
y =Xβ + Zu+ ε
u ∼N (0,G)ε ∼N (0,R)
A quantitative variable y is modelled as a linear function offixed effects β and random effects u, with unaccountedresiduals εThe function remlf90() yields a REML fit of a model to adatasetAdditional functions (e.g. summary(), fixef(), ranef(),plot(), etc.) extract and present specific results
I/O example
ped <- globulus[,1:3]
res <- remlf90(fixed = phe_X ~ gg,genetic = list(
model = 'add_animal',pedig = ped,id = 'self'),
data = globulus)
summary(res)
## Linear Mixed Model with pedigree and spatial## effects fit by AI-REMLF90 ver. 1.122## Data: globulus## AIC BIC logLik## 5799 5809 -2898#### Variance components:## Estimated variances S.E.## genetic 3.397 1.595## Residual 14.453 1.529#### Estimate S.E.## Heritability 0.1887 0.08705#### Fixed effects:## value s.e.## gg.1 13.591 0.5014## gg.2 14.085 0.7984## ...
2 Additive-genetic effects
What is an additive genetic effect
Random effect at individual levelBased on a pedigree (determining the relationship matrix A)
Zu, u ∼ N (0, σ2aA)
BLUP of Breeding Values from own and relatives’ phenotypesRepresents the additive component of the genetic valueMore general:
family effect is a particular caseaccounts for more than one generationmixed relationships
More flexible: allows to select individuals within familiesMore accurate: direct inference over the additive-geneticvariance of the base population
Pedigrees
A 3-column data.frame ormatrix with the codes for eachindividual and its parentsA family effect is easily translatedinto a pedigree:
use the family code as theidentification of a fictitiousmotheruse 0 or NA as codes for theunknown fathers
self dad mum
69 0 6470 0 4171 0 5672 0 5573 0 2274 0 50
Predicted Breeding Values
0
25
50
75
−2 0 2PBV
Predicted Breeding Values by Family
−3
−2
−1
0
1
2
295544668750216119311214366318482453523351571045402722374703946265935423868176511416495449435620303262286058341561325166723Family
PB
V
3 Environmental effects
What is spatial autocorrelation
The residuals of any LMM must be noiseHowever, most times there are environmental factors thataffect the responseThis causes that observations that are close to each other tendto be more similar that observations that are far awayThis is called spatial autocorrelationIt may affect both the estimations and their accuracyThis is why experiments are randomized into spatial blocks
Diagnosing spatial autocorrelationresiduals spatial plot
You can plot() the spatialarrangement of variousmodel components(e.g. residuals)Look like independentgaussian observations(i.e. noise)?Do you see any signal in thebackground?
0
30
60
90
0 25 50 75x
y
−10
−5
0
5
z
Diagnosing spatial autocorrelationvariograms of residuals
Plot the variogram of residuals with variogram()
9
10
11
12
10 20 30 40distance
vario
gram
0
10
20
30
40
0 10 20 30 40x
y
9
10
11
12
13
row disp.
010
203040
col disp.
010
2030
400
5
10
010203040
−50 −25 0 25 50x
y10
12
14z
Interpreting the variogramsIsotropic variogram
γ(h) = 12V [Z(u)− Z(v)], dist(u,v) = h
The empirical isotropic variogram is built by aggregating all thepairs of points separated by h, no matter the direction.
9
10
11
12
10 20 30 40distance
vario
gram
Interpreting the variogramsRow/Column variogram
γ(x, y) = 12V [Z(u)− Z(v)], dist(u,v) = (x, y)
The empirical row/col variogram is built by aggregating all thepairs of points separated by exactly x rows and y columns.
0
10
20
30
40
0 10 20 30 40x
y
9
10
11
12
13
z
Interpreting the variogramsAnisotropic variogram
γ(x) = 12V [Z(u)− Z(v)], u = v± x
The empirical anisotropic variogram is built by aggregating all thepairs of points in the same direction separated by |x|.
010203040
−50 −25 0 25 50x
y
10
12
14z
Accounting for spatial autocorrelation
Include an explicit spatial effect in the modelI.e., a random effect with a specific covariance structure thatreflects the spatial relationship between individuals
Spatial modellingBlocks
Zu, u ∼ N (0, σ2sI)
u is the vector of random effects for the blocksZ is an indicator matrix such that Z[i, j] = 1 if the observationi belongs to block jσ2s is the spatial variance parameter
The block effect, is a very particular case of spatial effect:It is designed from the begining, possibly using prior knowledgeCan account for non-spatial effects (e.g. operator)Introduces independent effects between blocksMost neighbours are within the same block (i.e. share the sameeffect)
Spatial modellingBidimensional Splines
A cubic B-spline B(x):
Piecewise curve defined in the intervals determined by 5 knotsEach piece is a polynomial of 3rd degree
Spatial modellingBidimensional Splines
A cubic B-spline B(x) with regularly spaced knots:
The curve is constrained for C2 continuity at each knotOnly 1 degree of freedom controls the scale
Spatial modellingBidimensional Splines
A number of overlapping curves form a base of B-splines {Bj(x)}
Spatial modellingBidimensional Splines
Each, can be scaled using a coefficient {ujBj(x)}
Spatial modellingBidimensional Splines
And summed to a linear combination f(x) =∑j ujBj(x)
Spatial modellingBidimensional Splines in a Mixed Model
f(x) =∑j ujBj(x) provides a spline representation of a
wide family of curves, in terms of a vector of coefficients uFor any set of points x = {xi}, the vector of values f(xi) canbe written as a matrix operation f =
[Bj(xi)
]u
breedR extends this to two dimensions and defines a randomeffect
Bu, u ∼ N (0, σ2sRs)
u is the vector of spline effectsB is the matrix of spline bases evaluated at the observationsσ2
s is the spatial variance parameterRs imposes a fixed positive correlation between coefficients ofneighbouring spline bases
Spatial modellingNumber of knots of a splines model
The smoothness of the spatial surface can be controlledmodifying the number of base functionsThis is directly determined by the number of knots (nok) ineach dimensionIf not explicitly set, it is determined heuristically by breedR as afunction of the number of observations
4
6
8
10
0 250 500 750 1000x
nok
Spatial modellingBidimensional First-Order Autoregressive Process
An AR1(ρ) on the line is a collection of random variables {xi}where
xt = ρxt−1 + εt, εt ∼ N (0, 1), |ρ| < 1A few random simulations with ρ = 0.5:
Spatial modellingBidimensional First-Order Autoregressive Process
breedR extends this model to the plane using and defines acomponent
Zu, u ∼ N (0, σ2sRAR)
u is the vector of random effects for each individual locationon a regular gridZ is an indicator matrix such that Z[i, j] = 1 if theobservation i is at site jσ2s is the spatial variance parameterRAR defines a separable correlation structure based on thekronecker product of two AR1 processes
Spatial modellingAutoregressive parameters of a AR model
The smoothness of the AR effects can be controlled by theautoregressive parameters (ρx, ρy) in each dimensionThey can be given explicitlyOtherwise, breedR fits a model for each combination ofparameters in a default grid and returns the most likely
−1.0
−0.5
0.0
0.5
1.0
−1.0 −0.5 0.0 0.5 1.0rho_r
rho_
c
−2925
−2900
−2875
−2850
loglik
Predicted spatial effectsby spatial model
All capture a similar underlying environmental patternwith somewhat increasing ranges of variability
Model residualsby spatial model
The spatial variability is taken mostly from the model residualswhich increasingly look like pure noise
4 Competition effects
Competition model assumptions
Each individual have two(unknown) Breeding Values(BV):
direct BV affects itsown phenotype,competition BV affectsits neghbours’
The total effect of theneighbouring competitionBVs is given by theirdistance-weighted sum
Weighted neighbour competition effect
i kdk
Let ∂i be the set of neighbouringlocations of tree i, anduc = (uc,k)′ the vector ofcompetition BVs
ωi(α) =∑k∈∂i
zik(α)uc,k
where zik(α) ∝ 1/dαik, such that∑k∈∂i
zik(α)2 = 1.
This condition is variance-estabilizer ensuring ∀i:
Var(ωi) = Var(uc) = σ2c
The decay parameter
The decay parameter α controls the relative intensity ofcompetition of the neighbours
α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1α = 1
α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 2α = 20
1
2
3
4
0 1 2 3 4Relative distance
Rel
ativ
e w
eigh
t
The weights zik arescale-invariante.g. a tree twice as far isweighted 1/2α as muchhigher values of αconcentrate the weights onthe closest trees
Random-effect representation
Zdud + Zc(α)uc,(uduc
)∼ N
(0,Σa ⊗A
), Σa =
(σ2d σdc
σdc σ2c
)
Each set of BVs is modelled as a zero-mean random effectwith structure matrix given by the pedigree and independentvariances σ2
d and σ2c
Both random effects are modelled jointly with covariance σdcZd is an indicator matrix linking observations and individualsZc(α) weights the competition effect of the neighbours with(fixed) decay parameter α
Permanent Environmental Competition Effect
Zpup, Zp = Zc, u ∼ N (0, σ2pI)
Optional companion effect with environmental (rather thangenetic) basisModelled as an individual independent random effect thataffects neighbouring trees in the same (weighted) way
Diagnosis of competition
additive−genetic only competition
12000
13000
14000
10 20 30 40 50 10 20 30 40 50distance
vario
gram
Competition is often observable in the first lag of thevariogram of residuals
increased antagonism between neighbouring phenotypescorrelation between neighbours pulled towards neg. values
In addition, use model comparison criteria (AIC, BIC, loglik)
Breeding for growth and collaboration
−100
−50
0
50
100
−100 −50 0 50 100direct
com
petit
ion
direct and competitionBV are usually negativelycorrelatedselection based only ondirect genetic merit tends tofavour competitiveindividuals, hampering theglobal performancethe competition model allowsfor selection based on a jointassessment
5 Longitudinal data
What is longitudinal data
Measurements repeated in time or along some climatic orgeographical variable (e.g. temperature, precipitation,latitude, altitude, . . . )All model parameters (e.g. variances, random effects) can befunctions of the longitudinal variableIncreased complexity: from estimating numerical values(dimension 0) to estimating (infinte-dimensional) functions(with finite data)Strategies:
assume parametric shape (e.g. linear regression)nonparametric components (e.g. splines, Legendre polynomials)
Repeated measurements in timeMean response evolution by replicate - Larix dataset
4
6
8
10
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16Year
Mea
n rin
g le
ngth
There is a strong temporal trend that needs to be accountedfor
Climatic gradientMean ring-length by Mean Annual Precipitation
34.72 37.24 38.64
39.15 40.91 41.19
43.93 44.66 44.92
20
30
20
30
20
30
20 30 40 20 30 40 20 30 40
3
6
9
12LAS
Legendre Polynomials
Family of orthogonal polynomials dense in L2Any regular curve can be approximated as much as needed bytaking a linear combination of polynomials up to a sufficientlyhigh order
−2
−1
0
1
2
−1.0 −0.5 0.0 0.5 1.0x
y
order
0
1
2
3
Random-Regression model
For each observation of an individual i at j
yij =Xiβ + Σordk=0aikLk(j) + εij
(a′0, . . . , a′ord)′ ∼N (0,Σ⊗A)ε ∼N (0, σ2
e)
The Breeding Value of an individual is a function of anenvironmental variableThis function is parameterised as a linear combination ofLegendre orthogonal polynomials of order up to a fixed ordEach individual is described by ord + 1 correlated coefficients
Example of results
Functional Breeding Values for each individual
6 Genomic data
Using genomic markers
breedR allows random effects with an arbitrary covariancestructure (generic)This can be used to leverage genomic information (GBLUP)
The Generic model
This additional component allows to introduce a random effect ψwith arbitrary incidence and covariance matrices Z and Σ:
Zψ, ψ ∼ N(0, σ2
ψΣψ
)Applications:
include additional not-predefined components e.g. Dominance,Hybrid populations, Genomic evaluation, etc.
GBLUP
Zu, u ∼ N(0, σ2
GG)
Use markers to compute a relationship matrix G forindividuals
Several methods availablee.g. VanRaden et al. 2009
G = XX ′/∑
2p(1− p)
Replace the additive-genetic model, which uses thepedigree-based relationship matrix A with a generic modelwith a genomic relationship matrix GZ is an indicator matrix linking observations with individualsPredicts genetic value of individuals, not markersImproved accuracy wrt pedigree-pased evaluation
Relationship matricespedigree-based vs. genomic
Note the increased level of detail in the relationship structuure
7 Multi-environment trials
Douglas dataset
s1 s2 s3
Environments may refer to sites, but also to years or climates
Particularities
Fixed effects
some can be transversal accross environmentse.g. origin, provenance
other take different values at each environmente.g. the intercept, replicatemodelled as a fixed interaction
Random effects
some can be transversal accross environmentse.g. main effect of the genotype
other can be environment-specifice.g. blocks, GxE, residualsmodelled as a group of (correlated or not) random effects
Statistical model
y =X0β0 + Z0u0 + X∑e
1{e}βe + Z∑e
1{e}ue + ε
u0 ∼N (0,G0)(u1, . . .)′ ∼N (0,ΣG ⊗G)
ε ∼N (0,DR ⊗ I)
Particular case of the general LMMFor each environment e, there is a group of random effectsue, each with covariance structure G, possibly cross-correlatedthrough ΣGIndependent site-specific residual variances
Practical case 1GxE study
C13 =orig + site + fam +3∑e=1
fe1e + ε
fam ∼N(0, σ2
f I)
(f1, f2, f3)′ ∼N(0,ΣG×E ⊗ I
)ε ∼N
(0, D3 ⊗ I
)One global family effect (fam)One group of three site-specific family effects (fi, i = 1, 2, 3)Jointly, they represent the G× E interaction with geneticcross-covariation ΣG×E
Predicted genetic effectsby family
Sum of main and interaction effectsNote:
different variances per sitehigh genetic correlationsome families are more interactive
Genetic correlations
EcovalenceMeasure of interactivity of families
For each family x
ϕ(x) ∝∑i∈x
∑e
f2e
Practical case 2Site-specific spatial effects
8 Multi-trait models
Multivariate Linear Mixed Models2-trait case
Y1 = Xβ1 + Zu1 + ε1
Y2 = Xβ2 + Zu2 + ε2,
(u1, u2)′ ∼ N(0,Σu ⊗G)(ε1, ε2)′ ∼ N(0,Σ⊗ In).
Σu and Σ either diagonal or fully-parameterized 2× 2matricesSome of the fixed or random effects can affect only a subsetof the traits
e.g. fixed effect of operator
Limitationof breedR’s implementation
All fixed and random effects are assumed to be trait-specifictransversal effects not directly supported (ultimately byPROGSF90)
Simpler covariance structures not supportede.g. independent effects with shared variance, exchangeablestructure
A workaround is to reshape the dataset to long-layout
Multi-trait with reshapingwide to long-layout
Reshaping operation:Stack traits into a single variable valueAdditional variable traitDuplicate individual information and other variables
Use single-trait models with MET syntaxtrait instead of site
This overcomes the limitations breedR’s multi-traitimplementation
more complex models like multi-trait and multisite becomecumbersome
9 Simulation framework
breedR’s simulation functionSimulating data under competition
Simulate datasets of any size, from any most supported modelsSee ?simulation for details on the syntax
Source: local data frame [500 x 14]
self sire dam beta x y spatial BV1(dbl) (dbl) (dbl) (dbl) (int) (int) (dbl) (dbl)
1 41 14 40 1 1 19 -1.0738194 -1.0236542 42 18 33 1 18 22 1.6654315 2.4774883 43 11 31 1 19 19 0.7047348 1.9613054 44 5 38 1 1 23 0.4698724 1.2071125 45 16 24 1 7 12 -0.7233131 -1.9197426 46 9 38 1 3 5 -0.1197804 1.174054.. ... ... ... ... ... ... ... ...Variables not shown: BV2 (dbl), wnc (dbl), pec (dbl), wnp (dbl), resid
(dbl), phenotype (dbl)
Applicationsa.k.a. what the heck I want a simulator for?
check models under ideal scenariosBootstrapping:
compute heritability (and it s.e.) for complex modelscompute more accurate s.e. for fixed and random effectsinference on arbitrary hypotheses (involving any combination ofmodel parameters)
Heritability of complex models
e.g. competition or splines models fitted by EM-REML (rather thanAI)
Thus, heritability not availableOther methods (e.g. Delta) not feasibleEven when available, is approximate (relies on asymptoticnormality of parameters)
Example
1 Fit the model to your data2 Write a function to simulate data from your fitted model
parameters3 Write a function to fit a simulated dataset and return
realised heritability
Repeated calls to thisfunction yields the samplingdistribution of heritabilityCompute SE and CI fromnumerical summaries
0
20
40
60
0.00 0.25 0.50 0.75 1.00h2
coun
t
More accurate Standard Errors
SE in breedR’s output are approximateRely on asymptotic normality (same as heritability)Same Bootstrapping procedure applies
bl fam Residual
0
50
100
150
0 5 10 15 0 5 10 15 0 5 10 15value
coun
t
10 Remote computing
Remote computation
If you have access to a Linux server through SSH, you can performcomputations remotely
Take advantage of more memory or faster processorsParallelize jobsFree local resources while fitting modelsSee ?remote for details
Configuring a server
1 Windows users: install cygwin with ssh beforehand(http://cygwin.org/)
2 configure the client and server machines so that passwordlessSSH authentication works
3 Set breedR options remote.host, remote.user,remote.port and remote.bin (see ?breedR.setOption)
Optionally, set these options permanently in $HOME/.breedRrc
writeLines(c("remote.host = '123.45.678.999'",
"remote.user = 'uname'","remote.bin = 'remote/path/to/breedR/bin/linux'"),
con = file.path(Sys.getenv('HOME'), '.breedRrc'))
Fitting models remotely
res <- remlf90(..., breedR.bin = "remote")
Fit model remotelyR-console stays in stand-by until job is finishedWhen job finishes (provided that connection keeps alive),results are automatically retrieved
Identical in use to local computing, but without theprocessor/memory burden
Submitting jobs
res <- remlf90(..., breedR.bin = "submit")
Fit model remotelyConnection is closed in the meanwhileR-console is activeTyping res queries the server for the job status(Running/Finished/Aborted)
Parallel computing
After you submit a job, you are free to submit more (speciallywith multiple-processor servers)Query the status of all jobs with breedR.qstat()Kill some job with breedR.qdel(res) or all jobs withbreedR.qnuke()
breedRhttp://famuvie.github.io/breedR/