assessing the predictive accuracy of various statistical methods …. gauffreteau... · 2014. 7....

30
Assessing the predictive accuracy of various statistical methods that use environmental covariates to model Genotype x Environment Interactions in multi-environment trial A. Gauffreteau, G. Grignon, P. Pachot, J. Lorgeou, F. Piraux, F. Maupas, H. Escriou, C. Pontet, F. Salvi

Post on 27-Jan-2021


  • Assessing the predictive accuracy of various statistical methods that use environmental covariates to model

    Genotype x Environment Interactions in multi-environment trial

    A. Gauffreteau, G. Grignon, P. Pachot, J. Lorgeou, F. Piraux, F. Maupas, H.

    Escriou, C. Pontet, F. Salvi

  • Why predict GEI?

    2/30

  • Context

    • Choice of cultivars adapted to their cropping environments

    • Need for cultivars:
      – Diversified and adapted to a large range of environmental conditions
      – Described: average performance and response to the various environmental conditions they can meet in fields

    → Modeling the GEI in multi-environment trials (MET) carried out during the last steps of selection, the registration process and the post-registration cultivar assessment

    Diagram: climate change, input reduction and diversification of cropping environments lead to higher and more variable biotic and abiotic stresses

    3/30

  • GEI in MET

    Yijk = μ + Gi + Ej + GxEij + εijk

    Example: 9 genotypes, 2 years (2001, 2002), 3 cropping practices

    Pie chart of the sources of yield variation (labels and values in slide order): Env 71%, Var 12%, Var x Env 13%, ε 4%

    4/30

  • Importance of GEI in MET

    Species      Environment   Genotype   GEI    Source
    Soft wheat   50 to 80%     10%        25%    Lecomte, 2005
    Soft wheat   80%           7%         13%    Arvalis (on average)
    Sunflower    55%           22%        12%    Casadebaig, 2008
    Sunflower    70%           3%         9%     Cetiom (on average)

    → GEI effect = from 0.5 to 2 times as big as genotype effect

    5/30

  • Modeling GEI to

    • Assess the ability of cultivars to maintain their performance in various environments (more or less stressful) in comparison to other cultivars (dynamic stability)
      → Minimizing GEI (Wricke’s ecovalence…)
    • Define the trials where each variety performs the best
      → Analysing the GEI matrix (AMMI…)
    • Characterize the cultivars (resistance to environmental stress and ability to make good use of environmental resources)
      → Explaining GEI by using environmental covariates in models (factorial regression…)
    • Recommend cultivars for unexperimented conditions
    • Assess the stability (dynamic or static) of one cultivar in unexperimented conditions
    • Fill in missing data in cultivar datasets: identify groups of cultivars with complementary GEI that could be recommended together for a better dynamic stability
      → Predicting GEI with models
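As an illustration of the stability index named above, here is a minimal sketch of Wricke's ecovalence for a balanced table of genotype-by-environment means: W_i = Σ_j (Y_ij - mean_i - mean_j + grand mean)². The function name and the yield values are hypothetical and only serve the example; they are not taken from the slides.

```python
def wricke_ecovalence(table):
    """Wricke's ecovalence per genotype for a balanced table.

    table[i][j] is the mean yield of genotype i in environment j.
    """
    n_gen, n_env = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_gen * n_env)
    gen_mean = [sum(row) / n_env for row in table]
    env_mean = [sum(table[i][j] for i in range(n_gen)) / n_gen for j in range(n_env)]
    # Sum of squared interaction terms (double-centered residuals) per genotype.
    return [
        sum((table[i][j] - gen_mean[i] - env_mean[j] + grand) ** 2 for j in range(n_env))
        for i in range(n_gen)
    ]


# Hypothetical yields (t/ha): 3 genotypes x 3 environments.
yields = [
    [6.0, 7.0, 8.0],  # follows the environment means exactly
    [5.0, 7.0, 9.0],  # over-reacts to the environment
    [7.0, 7.0, 7.0],  # does not react at all
]
ecovalence = wricke_ecovalence(yields)
```

A genotype whose response runs parallel to the environment means gets an ecovalence of zero, which is what "dynamic stability" refers to here.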

    6/30

  • The methods implemented to model GEI

    7/30

  • Modeling GEI using environmental covariates

    Yijk = μ + Gi + Ej + GxEij + εijk

    Ej = α1.X1j + … + αn.Xnj + E’j
    GxEij = β1i.X1j + … + βni.Xnj + GxE’ij

    Figure: Y against covariate X for Gen A, Gen B and the mean genotype (Mean Gen); annotations: +βA, -βB, G
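To make the role of the βi concrete, the following sketch handles the simplest case only: one covariate and a balanced table of means. The interaction term is obtained by double centering and each genotype's sensitivity βi is the least-squares slope of its interaction on the covariate. The function names, the yields and the covariate values are hypothetical.

```python
def interaction_table(table):
    """Double-centered residuals: Y_ij - genotype mean - environment mean + grand mean."""
    n_gen, n_env = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_gen * n_env)
    gen_mean = [sum(row) / n_env for row in table]
    env_mean = [sum(table[i][j] for i in range(n_gen)) / n_gen for j in range(n_env)]
    return [
        [table[i][j] - gen_mean[i] - env_mean[j] + grand for j in range(n_env)]
        for i in range(n_gen)
    ]


def genotype_sensitivities(table, covariate):
    """Least-squares slope of each genotype's interaction on one environmental covariate."""
    ge = interaction_table(table)
    x_mean = sum(covariate) / len(covariate)
    xc = [x - x_mean for x in covariate]
    sxx = sum(v * v for v in xc)
    return [sum(g * v for g, v in zip(row, xc)) / sxx for row in ge]


# Hypothetical example: 2 genotypes x 4 environments, one covariate X_j.
yields = [
    [5.0, 6.0, 7.0, 8.0],   # Gen A gains 1 unit of yield per unit of X
    [5.0, 7.0, 9.0, 11.0],  # Gen B gains 2 units of yield per unit of X
]
covariate = [0.0, 1.0, 2.0, 3.0]
betas = genotype_sensitivities(yields, covariate)
```

Here the mean response to X is 1.5, so the two sensitivities are deviations of -0.5 and +0.5 around it and sum to zero, in line with the +βA / -βB annotations of the figure.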

    8/30

  • Environmental covariates (soft winter wheat case)

    (Meynard and Sebillotte, 1980)

    1 covariate = 1 environmental stress x 1 development period

    Figure: environmental covariates along the crop cycle up to maturity; examples shown: sum of radiations, T° max > 25°C, sum of P-ETP > 0, sum of ETR-ETM

  • Environmental covariates

    • Development stages depending on cultivar earliness

    Figure: covariate X over time, with development periods placed differently for cultivars G1, G2 and G3

    Xij: covariates are calculated by environment j and cultivar i
    → βi: intrinsic cultivar tolerance

    Xj: covariates are calculated by environment j
    → βi: intrinsic cultivar tolerance + escape

    Yijk = μ + Gi + α1.X1j + … + αn.Xnj + E’j + β1i.X1j + … + βni.Xnj + GxE’ij + εijk

    10/30

  • Statistical methods

    Yijk = μ + Gi + Σn(αn.Xnj) + E’j + Σn(βni.Xnj) + GxE’ij + εijk

    • One-step factorial regression with forward selection of environmental covariates according to AIC:
      – X strongly correlated
        → Problem for variable selection (unstable)
        → Problem for parameter estimation
        → Low predictive value

    • Two-step procedure:
      – Mixed model: Yijk = μ + Gi + Ej + GxEij + εijk
      – Analysis of GxEij by
        • Centered and scaled PLS regression
        • Random Forest

    11/30

  • Statistical methods: PLS regression

    • The principle:
      – Look for orthogonal components th = X.ah maximizing cov(X.ah, Y): explaining X and correlated to Y
      – Regress Y on components th
      – Express the regression according to initial covariates X
      – Number of axes chosen by cross-validation

    • The advantages:
      – Deals with correlated X
      – Deals with a large number of X (can be larger than the number of individuals)
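The principle can be sketched for the first component of a single-response PLS (PLS1). This is a simplified illustration, not the implementation used in the study: X and y are only centered (not scaled), one component is extracted, and later components, which require deflating X, are left out. Function names and data are hypothetical.

```python
import math


def pls1_first_component(X, y):
    """First PLS1 component. Returns (weights a1, coefficients on centered X, y mean, column means)."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[k] for row in X) / n for k in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[k] - col_mean[k] for k in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # The unit-norm direction maximizing cov(X.a, y) is proportional to X'y.
    w = [sum(Xc[i][k] * yc[i] for i in range(n)) for k in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    a1 = [v / norm for v in w]
    # Component scores t1 = X.a1, then simple regression of y on t1.
    t1 = [sum(Xc[i][k] * a1[k] for k in range(p)) for i in range(n)]
    slope = sum(t * v for t, v in zip(t1, yc)) / sum(t * t for t in t1)
    # Express the fitted regression in terms of the initial covariates.
    coef = [slope * a for a in a1]
    return a1, coef, y_mean, col_mean


def predict(row, coef, y_mean, col_mean):
    return y_mean + sum(c * (v - m) for c, v, m in zip(coef, row, col_mean))


# Hypothetical data with two perfectly correlated covariates (x2 = 2 * x1).
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [1.1, 1.9, 3.2, 3.8]
a1, coef, y_mean, col_mean = pls1_first_component(X, y)
first_prediction = predict(X[0], coef, y_mean, col_mean)
```

With perfectly collinear covariates an ordinary regression on both columns has no unique solution, whereas the PLS component still gives a single well-defined fit, which is the "deals with correlated X" advantage listed above.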

    12/30

  • Statistical methods: PLS regression

    • Structure of data

    VAR    ENV    VARxENV   X1xVAR1   X1xVAR2   X2xVAR1   X2xVAR2
    var1   env1   int11     X11       0         X21       0
    var1   env2   int12     X12       0         X22       0
    var1   env3   int13     X13       0         X23       0
    var2   env1   int21     0         X11       0         X21
    var2   env2   int22     0         X12       0         X22
    var2   env3   int23     0         X13       0         X23

    • Variable selection:
      – From the model with all the variables
      – Successive removal of variables according to their absolute coefficient (from the smallest to the largest)
      – At each step calculation of RMSEP of the model by cross-validation
      – Choice of the model (and the variables) that minimizes RMSEP
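The data structure above can be built mechanically: each covariate is duplicated once per cultivar and set to zero on the rows of the other cultivars. A minimal sketch, with a hypothetical function name and made-up covariate values:

```python
def interaction_design(rows, cultivars):
    """Build the covariate-by-cultivar columns (X1xVAR1, X1xVAR2, X2xVAR1, ...).

    rows is a list of (cultivar, [X1, X2, ...]) pairs, one per cultivar x environment.
    """
    design = []
    for cultivar, covariates in rows:
        line = []
        for x in covariates:
            for c in cultivars:
                # The covariate value appears only in the column of the row's own cultivar.
                line.append(x if c == cultivar else 0.0)
        design.append(line)
    return design


# Hypothetical covariate values [X1, X2] for three environments.
env_cov = {"env1": [1.0, 10.0], "env2": [2.0, 20.0], "env3": [3.0, 30.0]}
cultivars = ["var1", "var2"]
rows = [(c, env_cov[e]) for c in cultivars for e in env_cov]
design = interaction_design(rows, cultivars)
```

Each column then carries one cultivar-specific coefficient βni, so the number of columns grows as (number of covariates) x (number of cultivars), which is why a method tolerating many correlated predictors is needed.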

    13/30

  • Statistical methods: Random Forest

    • The principle:
      – Building independent decision trees: a succession of dichotomies (nodes), each splitting one group into 2
      – Random sample with replacement of n data: training dataset
      – For each node of the tree: a subset of variables is randomly sampled and tested; the variable which minimizes the intra-group inertia is selected
      – Prediction of each out-of-bag data point
      – Prediction of the random forest = mean of the predictions of the individual trees
      → MSEP calculated on the random forest predictions

    • Variable importance:
      – Random permutation of a variable
      – Calculation of the increase of MSEP due to the permutation (IncMSEP)
      – The bigger the increase of MSEP, the more important the variable
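The permutation idea behind IncMSEP can be sketched for any fitted predictor. Note the simplification: a random forest computes this per tree on its out-of-bag data, whereas the sketch below permutes a column of a single evaluation set and averages over repeats. The function names, the toy predictor and the data are hypothetical.

```python
import random


def msep(predict, X, y):
    """Mean squared error of prediction."""
    return sum((predict(row) - obs) ** 2 for row, obs in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, column, n_repeats=20, seed=0):
    """Mean increase in MSEP when one column of X is randomly permuted."""
    rng = random.Random(seed)
    base = msep(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
        increases.append(msep(predict, X_perm, y) - base)
    return sum(increases) / n_repeats


# Toy predictor that only uses the first covariate; the second one is irrelevant.
def toy_predict(row):
    return 2.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(10)]
y = [2.0 * i for i in range(10)]
importance_used = permutation_importance(toy_predict, X, y, column=0)
importance_unused = permutation_importance(toy_predict, X, y, column=1)
```

Permuting a covariate the predictor relies on degrades the predictions, while permuting an unused covariate leaves the MSEP unchanged.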

    14/30

  • Statistical methods: Random Forest

    • The advantages:
      – Deals with interactions between covariates
      – Deals with non-linear relationships
      – Deals with a large number of X (can be larger than the number of data)

    • Variable selection:
      – Forward:
        • From a model with only cultivar as variable
        • Successive introduction of environmental covariates with maximum IncMSEP
        • Stop when the introduction of a new variable doesn’t decrease the MSEP of the model
      – Backward:
        • From the model with all the variables
        • Successive removal of environmental covariates with minimum IncMSEP
        • Stop when the removal of a new variable doesn’t decrease the MSEP of the model
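The forward procedure can be sketched as a simple loop. Two assumptions are made here that the slides do not state: the covariates are ranked once by decreasing IncMSEP before the loop, and `msep_of` is a hypothetical callable that refits the model on a given variable subset and returns its MSEP. In the example it is replaced by a made-up lookup so that the sketch runs on its own.

```python
def forward_select(ranked_covariates, msep_of, start=("cultivar",)):
    """Add covariates in ranked order; stop as soon as the MSEP no longer decreases."""
    selected = list(start)
    best = msep_of(selected)
    for covariate in ranked_covariates:
        trial = msep_of(selected + [covariate])
        if trial >= best:
            break
        selected.append(covariate)
        best = trial
    return selected, best


# Made-up MSEP gains per covariate, standing in for a real refit of the model.
gains = {"water_stress": 2.0, "heat": 0.5, "radiation": -0.1}


def toy_msep(subset):
    return 10.0 - sum(gains.get(name, 0.0) for name in subset)


ranked = ["water_stress", "heat", "radiation"]  # assumed order of decreasing IncMSEP
selected, best_msep = forward_select(ranked, toy_msep)
```

The backward variant follows the same pattern, starting from the full set and removing the covariate with the smallest IncMSEP at each step.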

    15/30

  • Cross-validation

    Scheme                             Test dataset   Calibration dataset
    Cross-validation over the trial    1 trial        N - 1 trials
    Cross-validation over the year     1 year         N - 1 years
    1/3 – 2/3 cross-validation         1/3            2/3
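The three schemes differ only in how the test set is carved out. The sketch below uses hypothetical records with `trial`, `year` and `cultivar` keys; for the 1/3 – 2/3 scheme it assumes that individual records are drawn at random, which the slides do not specify.

```python
import random


def leave_one_group_out(records, key):
    """Yield (group, calibration, test), holding out one trial or one year at a time."""
    for group in sorted({rec[key] for rec in records}):
        test = [rec for rec in records if rec[key] == group]
        calibration = [rec for rec in records if rec[key] != group]
        yield group, calibration, test


def random_third_split(records, seed=0):
    """Return (calibration, test) with one third of the records drawn at random as test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 3
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical MET layout: 6 trials spread over 3 years, 2 cultivars per trial.
records = [
    {"trial": f"T{t}", "year": 2003 + t % 3, "cultivar": c}
    for t in range(6)
    for c in ("A", "B")
]
year_splits = list(leave_one_group_out(records, "year"))
trial_splits = list(leave_one_group_out(records, "trial"))
calibration, test = random_third_split(records)
```

Holding out a whole year is the most demanding scheme, since no trial from the test year's climate contributes to calibration.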

    16/30

  • Cross-validation

    Objective: assess the ability of GxE models to rank cultivars and compare their performances in a given environment

    For each environment:
    • Spearman correlation between predicted and observed yield (Spear)
    • For all pairs of cultivars:
      – RMSEP of yield differences between cultivars (RMSEP)
      – Type III error: percentage of cases where the model does not rank the cultivars correctly (ETIII-0)
      – Type III error with a threshold of N t/ha: percentage of cases where the model does not correctly rank cultivars showing a yield difference over N t/ha (ETIII-N)
    • Reference: additive model
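The criteria can be sketched for one environment as follows. Several points are assumptions, since the slides leave them open: the threshold is applied to the observed yield difference, the error rate is computed over the pairs exceeding the threshold, and a predicted tie counts as a ranking error. Function names and yield values are hypothetical.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(pred, obs):
    """Spearman correlation = Pearson correlation of the ranks."""
    rp, ro = average_ranks(pred), average_ranks(obs)
    mp, mo = sum(rp) / len(rp), sum(ro) / len(ro)
    cov = sum((a - mp) * (b - mo) for a, b in zip(rp, ro))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_o = sum((b - mo) ** 2 for b in ro)
    return cov / math.sqrt(var_p * var_o)


def pairwise_metrics(pred, obs, threshold=0.0):
    """RMSEP of pairwise yield differences and Type III error rate above a threshold."""
    sq_err, n_pairs, n_considered, n_wrong = 0.0, 0, 0, 0
    for i, j in combinations(range(len(obs)), 2):
        d_pred, d_obs = pred[i] - pred[j], obs[i] - obs[j]
        sq_err += (d_pred - d_obs) ** 2
        n_pairs += 1
        if abs(d_obs) > threshold:  # assumption: threshold on the observed difference
            n_considered += 1
            if d_pred * d_obs <= 0:  # wrong order, or predicted tie
                n_wrong += 1
    rmsep = math.sqrt(sq_err / n_pairs)
    type3 = n_wrong / n_considered if n_considered else float("nan")
    return rmsep, type3


# Hypothetical observed and predicted yields of 4 cultivars in one environment.
observed = [8.0, 7.0, 6.0, 5.0]
predicted = [7.5, 7.6, 6.0, 5.5]
rho = spearman(predicted, observed)
rmsep, etiii_0 = pairwise_metrics(predicted, observed, threshold=0.0)
_, etiii_1 = pairwise_metrics(predicted, observed, threshold=1.0)
```

In this example the only inverted pair differs by exactly 1 unit of observed yield, so it counts in ETIII-0 but no longer in the error rate with a threshold of 1.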

    17/30

  • Results

    18/30

  • Wheat dataset

    • 8 years (2003-2010)

    • Trials treated against diseases, insects and weeds

    • 46 agro-climatic covariates calculated by environment and cultivar

    • 30 cultivars not experimented in all the trials

    → 50% of missing data

    19/30

  • Results according to cross-validation

    Trial CV

    Method           RMSEP   Spear   ETIII-0   ETIII-1   ETIII-3
    RF (Forward)     7.05    0.51    30.5%     24.9%     16.6%
    PLS              6.71    0.58    27.7%     22.4%     14.3%
    Additive model   7.34    0.42    33.5%     28.2%     19.4%

    1/3 – 2/3 CV

    Method           RMSEP   Spear   ETIII-0   ETIII-1   ETIII-3
    RF (Forward)     7.20    0.52    30.4%     25.0%     16.6%
    PLS              7.63    0.45    33.5%     27.9%     18.5%
    Additive model   7.40    0.43    33.4%     28.1%     19.3%

    20/30

  • Cross validation over the Years

    CV 2010          RMSEP   Spear   ETIII-0   ETIII-1   ETIII-3
    RF (Forward)     6.7     0.41    33.9%     29.4%     20.2%
    PLS              6.79    0.38    35.6%     30.5%     21.0%
    Additive model   6.69    0.42    33.7%     28.9%     19.8%

    CV 2009          RMSEP   Spear   ETIII-0   ETIII-1   ETIII-3
    RF (Forward)     7.14    0.29    40.0%     34.0%     23.4%
    PLS              6.95    0.38    37.0%     31.3%     21.1%
    Additive model   6.96    0.35    37.8%     31.9%     21.7%

    CV 2008          RMSEP   Spear   ETIII-0   ETIII-1   ETIII-3
    RF (Forward)     9.2     0.33    37.2%     33.3%     25.3%
    PLS              9.13    0.29    39.7%     35.6%     27.5%
    Additive model   9       0.33    37.6%     33.9%     26.1%

    21/30

  • Conclusion

    • Same results observed on:
      – Sunflower
      – Beet
      – Potato

    • Variability in the results according to the situation considered

    • GxE models are not significantly more predictive than additive models

    • Same results observed by using “sparse PLS” and “PLS spline”

    • GxE models under study using environmental covariates should not be used to recommend cultivars

    Figure: Spearman correlation of the GEI model (Random Forest) plotted against the Spearman correlation of the additive model; model efficiency: 0.1

    22/30

  • Trying to explain the low predictive accuracy of GEI models

    Cause: Covariates
      – Hypothesis: Presence of diseases in treated trials not taken into account in models
        Method: Introducing cultivar tolerance to disease as covariates in the models
        Result: V
      – Hypothesis: Imprecise covariates selected
        Method: Check correlation between predictive value and distance to meteorological station
        Result: X

    Cause: Cultivars
      – Hypothesis: Better predictive values for cultivars assessed in more experiments and under a larger range of environmental conditions
        Method: Check correlation between predictive value of cultivars and their number of occurrences in MET or D-optimality criterion
        Result: X

    Cause: MET
      – Hypothesis: Lower predictive value when the calibration dataset is very different from the test one
        Method: Check distance in terms of environmental conditions between calibration and test datasets (Mahalanobis distance)
        Result: X

    Cause: Statistical model
      – Hypothesis: Statistical models are not able to simulate complex and dynamic mechanisms
        Method: Simulate data with a crop model integrating a cultivar component
        Result: …

    23/30

  • Dataset and modelling plan

    Diagram (labels recovered from the slide):
      – Inputs: Cultivar (32), Cropping practices (1), Soil (30 sites), Climate (5 years: 2008-2012)
      – Crop model: SUNFLO
      – Simulated variables: Development stages, Biomass, N content, Water status, Yield (Yij), Environmental covariates (Xnj)
      – Annotations: Known without error; Balanced dataset; No weed, disease and insect

    24/30

  • Cultivars under study

    • Virtual cultivars defined as the combinations of 5 phenotypic inputs of the SUNFLO model.

    • Two levels possible for each phenotypic variable corresponding to minimum and maximum values defined by experts

    • 2^5 = 32 cultivars contrasted for their tolerance to water stress (WS)

    Phenotypic input                                                min   max
    Area of biggest leaf at flowering                               a     A
    Sum of T° from emergence to maturity                            l     L
    Threshold over which foliar growth is impacted by WS            e     E
    Threshold over which stomatal functioning is impacted by WS     c     C
    Rank of the widest leaf at flowering                            h     H
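The min/max coding above amounts to a full two-level factorial design over five inputs. A minimal sketch that enumerates the 32 cultivar codes, assuming the lower-case/upper-case letter convention of the slide (the function name is hypothetical):

```python
from itertools import product

TRAITS = "alech"  # one letter per phenotypic input, lower case = min, upper case = max


def virtual_cultivars(traits=TRAITS):
    """Enumerate every min/max combination of the phenotypic inputs."""
    names = []
    for levels in product((False, True), repeat=len(traits)):
        names.append("".join(t.upper() if high else t for t, high in zip(traits, levels)))
    return names


cultivars = virtual_cultivars()
```

The list runs from `alech` (all inputs at their minimum) to `ALECH` (all at their maximum).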

    25/30

  • Environmental covariates

    • 42 covariates measuring, for each environment:
      – Water stress
      – Nitrogen stress
      – Radiation offer
      – Temperature offer

    • during:
      – Vegetative period
      – Flowering period
      – Grain filling period

    26/30

  • Simulated dataset

    Figure: boxplots of the yields simulated by cultivar (q/ha) for the 32 virtual cultivars, from alech to ALECH, with the mean marked; y-axis from 20 to 50

    Pie chart of the sources of yield variation: E 88%, G 6%, GxE 6%

    • A slight over-estimation of yields in comparison with yields observed in real MET but a range of yields in accordance with expertise

    • GEI smaller than those observed in real MET

    27/30

  • Results for a CV over the years

    Method                 Spear [part of GEI predicted]   RMSEP   ETIII-1 (%)   ETIII-3 (%)
    Additive model         0.73 [0%]                       1.87    16.60         4.81
    RF (Backward)          0.80 [26%]                      1.52    12.22         2.72
    Factorial regression   0.82 [33%]                      1.48    11.42         2.48
    PLS                    0.83 [37%]                      1.41    10.43         2.27

    • All the GEI models have better predictive values than additive model

    • The part of GEI predicted by statistical models remains limited

    28/30

  • Results for a CV over the years

    Figure: Spearman correlation of the GEI model plotted against the Spearman correlation of the additive model, for RF (Backward) and PLS

    • Compared to PLS, Random Forest shows fewer situations where the predictive quality is below that of the additive model

    29/30

  • Conclusion

    • The statistical GEI models under study using environmental covariates are not significantly better than additive ones

    → They should not be used to recommend cultivars or to fill in missing data in cultivar datasets

    • How to improve GEI models?

    ‒ Better characterization of individual trials

    ‒ Test of various covariates and development period to quantify the environmental stresses and resources over the cropping period

    ‒ Test of interaction between covariates in the case of PLS regression

    ‒ Increase the number of trials and their variability in terms of environmental conditions

    ‒ Model Gene x Environment interactions (Genes assessed in more environments than cultivars)

    ‒ …

    • To be continued

    ‒ European project in preparation: ANOVA (A NOvel approach for Variety Assessment in Europe)

    ‒ … 30/30