Studying the Presence of Genetically Modified Variants in Organic Oilseed Rape by Using Relational Data Mining
Aneta Ivanovska1, Celine Vens2, Sašo Džeroski1, Nathalie Colbach3
1 Department of Knowledge Technologies, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
Emails: [email protected], [email protected]
Phone: +386 1 477 3144 (Aneta Ivanovska)
2 Department of Computer Science, K.U. Leuven, Celestijnenlaan 200A, B-3001 Leuven, Belgium
Email: [email protected]
3 UMR1210, Biologie et Gestion des Adventices, INRA, 21000 Dijon, France
Email : [email protected]
13/09/2007 EnviroInfo 2007 - Warsaw 2
The GM problem
Genetically Modified (GM) crops
• First introduced for commercial production in 1996
• Herbicide tolerant
• Pest-resistant
Concern: GM crops mixing with conventional or organic crops of the same species
The GM problem (2)
Computer simulation model GENESYS
• Estimates the rate of adventitious presence of GM varieties in non-GM crops
Ranks the cropping systems according to their probability of gene flow between GM and non-GM oilseed rape (OSR)
Motivation
Predict the contamination of a field with GM material
The dataset produced by GENESYS was previously analyzed using propositional data mining techniques (Ivanovska et al., 2006)
Assumption: contamination of a field with GM seeds mostly depends on the cropping techniques and crops grown in the surrounding fields
• Exploit neighborhood relations in the predictive model
• Create a relational representation of the problem
In this study: investigate the use of relational data mining to analyze the dataset produced by GENESYS, using the relational data mining system TILDE
[Figure: overview of the GENESYS model (Colbach et al., 2001). Labels: field plan; crop succession; crop management; rape varieties. For each field and year: plants, seedbank, seeds produced (number per m², genotypic proportions).]
Materials and methods: the dataset
Output from GENESYS
Large-risk field plan maximizes the pollen and seed input into the central field
Focus of the analysis: predict the rate of adventitious presence in the central field of a large-risk field plan
Materials and methods: the dataset (2)
100 000 simulations for 25 years
Attributes:
• Geometry of the region (field plan)
• Genetic variables
• For each field and year: crops and management techniques
• Full details kept only for the last 4 years
Materials and methods: relational data mining, relational classification trees
Propositional data mining techniques
• Single table
• Popular DM techniques: classification and regression decision trees
Relational data mining techniques
• Multiple tables
• Relations between them
• Relational classification or regression trees
Materials and methods: relational data mining, relational classification trees (2)
Data scattered over multiple relations (or tables):
• can be analyzed by conventional data mining techniques after being transformed into a single propositional table (attribute-value representation); this step is called propositionalization
• or by a multi-relational approach, which takes into account the structure of the original data
Data represented in terms of relations: target(Field1, contaminated)
Background knowledge is also given
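To make propositionalization concrete, here is a minimal Python sketch. It assumes a toy neighbour relation and a per-field crop lookup; the field names, values, and the chosen aggregate attributes are hypothetical and are not taken from the GENESYS dataset.

```python
from collections import Counter

# Hypothetical toy relations; names and values are illustrative only.
neighbour = [("f1", "f2", "noborder"), ("f1", "f3", "border"), ("f2", "f3", "border")]
crop_last_year = {"f1": "wheat", "f2": "gm-OSR", "f3": "gm-OSR"}

def propositionalize(field):
    """Flatten the neighbourhood of `field` into one attribute-value row."""
    # Follow the neighbour relation in the stored direction only (an assumption).
    neighbours = [b for (a, b, _) in neighbour if a == field]
    crops = Counter(crop_last_year[n] for n in neighbours)
    return {
        "field": field,
        "own_crop": crop_last_year[field],
        "n_neighbours": len(neighbours),
        "n_gm_osr_neighbours": crops["gm-OSR"],
    }

row = propositionalize("f1")
print(row)
```

The aggregates (counts) replace the individual neighbouring fields, which is the structural information that a multi-relational learner can keep.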
Materials and methods: relational data mining, relational classification trees (3)
Relational vs. propositional classification trees - similarities:
predict the value of a dependent variable (class) from the values of a set of independent variables (attributes)
each inner node contains a test that compares the value of a certain attribute with a constant
leaf nodes give a classification that applies to all instances that reach the leaf
Relational vs. propositional classification trees - differences:
Prop. trees: tests in the inner nodes compare the value of a variable (property of the object) to a value
Rel. trees: tests can also refer to background knowledge relations or tables
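The difference between the two kinds of test can be sketched in Python. The facts below are hypothetical and only loosely follow the predicates used in this talk; the functions illustrate the idea and are not how TILDE evaluates tests internally.

```python
# Hypothetical facts: (field, year) -> (crop, sowing_date)
field_data_year = {
    ("a", 0): ("wheat", 240),
    ("b", 1): ("gm-OSR", 230),
    ("c", 1): ("barley", 260),
}
# Hypothetical neighbour facts: (field1, field2, neighbourhood type)
neighbour = {("a", "b", "noborder"), ("a", "c", "border")}

def propositional_test(field):
    """Compare a property of the object itself with a constant."""
    _, sowing_date = field_data_year[(field, 0)]
    return sowing_date < 252

def relational_test(field):
    """Ask whether some 'noborder' neighbour grew gm-OSR in year 1."""
    return any(
        field_data_year.get((other, 1), (None, None))[0] == "gm-OSR"
        for (f, other, kind) in neighbour
        if f == field and kind == "noborder"
    )

print(propositional_test("a"), relational_test("a"))
```

The relational test is existential: it succeeds as soon as one related field satisfies the condition, which a single-table test on the target field cannot express directly.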
An example of a relational classification tree
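[Figure: relational classification tree. Each inner node holds one of the tests below and has a "yes" and a "no" branch; the leaves are labelled NEG or POS.]
Tests in the inner nodes:
• targetField(FieldA) and fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<252
• fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<233
• neighbour(FieldA,FieldB,noborder) and fieldDataYear(FieldB,1,gm-OSR,SowingDate)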
Experiments and results
Representation of the data:
Target relation (data label):
• rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres)
Background relations:
• fieldDataYear(SimulationID, FieldID, Year, CultivationTechniques)
• lastOSR(SimulationID, FieldID, LastGM, LastNonGM)
• neighbour(Field1ID, Field2ID, NeighType)
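As an illustration of how these relations fit together, the sketch below stores two of them as Python named tuples and joins them. The rows are made up, and the `crop` and `sowing_date` columns stand in for the CultivationTechniques arguments (an assumption based on the Crop and SowingDate arguments used elsewhere in this talk).

```python
from typing import NamedTuple

class FieldDataYear(NamedTuple):
    simulation_id: int
    field_id: str
    year: int
    crop: str          # stands in for the CultivationTechniques columns
    sowing_date: int

class Neighbour(NamedTuple):
    field1_id: str
    field2_id: str
    neigh_type: str

# Hypothetical rows, not taken from the GENESYS output.
field_data = [
    FieldDataYear(1, "centre", 25, "non-gm-OSR", 245),
    FieldDataYear(1, "north", 24, "gm-OSR", 240),
    FieldDataYear(2, "north", 24, "wheat", 280),
]
neighbours = [Neighbour("centre", "north", "noborder")]

def neighbour_crops(simulation_id, field_id, year):
    """Crops grown in the neighbours of a field, for one simulation and year."""
    adjacent = {n.field2_id for n in neighbours if n.field1_id == field_id}
    return sorted(
        row.crop
        for row in field_data
        if row.simulation_id == simulation_id
        and row.field_id in adjacent
        and row.year == year
    )

print(neighbour_crops(1, "centre", 24))
```

The neighbour relation carries no SimulationID, so the same field plan is shared by all simulations, while the field data are looked up per simulation.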
Experiments and results (2)
Discretized target attribute (threshold: 0.9%)
Experimental settings:
• Propositional: besides the target relation rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres), only (propositional) data for the target field are included, without using any relations among the fields, i.e., the following predicates are used:
  fieldDataYear(FieldID,Year,Crop,SowingDate), for the target field
  lastOSR(FieldID,LastGM,LastNonGM), for the target field
• Neighbor: the same relations are used as in the Propositional setting, but other fields are now introduced via the neighbour relation, starting at the target field:
  neighbour(Field1ID, Field2ID, NeighType)
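The discretization step can be sketched as follows. This assumes that the rate is expressed in percent, that 0.9% acts as a cut-off, and that values strictly above it are labelled positive; the slides do not state how the boundary case is handled, so that choice is hypothetical.

```python
THRESHOLD_PERCENT = 0.9  # cut-off named on the slide

def discretize(rate_adv_pres_percent: float) -> str:
    """Map a rate of adventitious presence (in percent) to a class label.

    Assumption: values strictly above the threshold count as contaminated.
    """
    return "pos" if rate_adv_pres_percent > THRESHOLD_PERCENT else "neg"

print([discretize(r) for r in (0.0, 0.5, 0.9, 1.2)])
```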
Experiments and results (3)
TILDE's experimental results (3-fold cross-validation):

              PROPOSITIONAL   NEIGHBOR
TREE SIZE     15              13
ACCURACY      78.35%          79.66%

Example of a rule from the Propositional experiments:
contamination([neg]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, lastOSR(T, Gm, NonGm), Gm<20
Example of a rule from the Neighbor experiments:
contamination([pos]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, neighbour(T, FieldA, noborder), fieldDataYear(FieldA, 24, gm-OSR, SowingDate)
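For readers unfamiliar with the evaluation protocol, here is a minimal sketch of 3-fold cross-validation in plain Python. A majority-class predictor stands in for the tree learner; it is a placeholder, not TILDE, and the shuffling seed and fold assignment are assumptions.

```python
import random
from collections import Counter

def k_fold_splits(n_examples, k=3, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

def cross_validated_accuracy(labels, k=3):
    """Accuracy of a majority-class predictor, averaged over all test examples."""
    correct = 0
    for train, test in k_fold_splits(len(labels), k):
        majority = Counter(labels[j] for j in train).most_common(1)[0][0]
        correct += sum(labels[j] == majority for j in test)
    return correct / len(labels)

print(cross_validated_accuracy(["neg"] * 9))
```

Each example is used for testing exactly once, so the reported accuracy is computed only on data the model did not see during training.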
Conclusions
Use of relational data mining for analyzing an output of the complex simulation model GENESYS
Predict the contamination of the central field of a large-risk field plan
Built relational classification trees with the first-order decision tree learner TILDE
Conclusions (2)
Propositional and relational trees
• Relational experiments: slightly better
• Due to using a fixed field plan and a fixed target field
Further work:
• Performing more experiments with GENESYS
  • Different field plans
  • Different target fields
• Analyze the results of other simulation models
Acknowledgement
SIGMEA (Sustainable Introduction of Genetically Modified organisms into European Agriculture)