Studying the Presence of Genetically Modified Variants in Organic Oilseed Rape by using
Relational Data Mining
Aneta Ivanovska¹, Celine Vens², Sašo Džeroski¹, Nathalie Colbach³
¹ Department of Knowledge Technologies, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
Emails: aneta.ivanovska@ijs.si, saso.dzeroski@ijs.si
Tel: +386 1 477 3144 (Aneta Ivanovska)
² Department of Computer Science, K.U. Leuven, Celestijnenlaan 200A, B-3001 Leuven, Belgium
Email: celine.vens@cs.kuleuven.be
³ UMR1210, Biologie et Gestion des Adventices, INRA, 21000 Dijon, France
Email: colbach@dijon.inra.fr
13/09/2007 EnviroInfo 2007 - Warsaw 2
The GM problem
Genetically Modified (GM) crops
• First introduced for commercial production in 1996
• Herbicide tolerant
• Pest-resistant
Concern: GM crops mixing with conventional or organic crops of the same species
The GM problem (2)
Computer simulation model GENESYS
• Estimates the rate of adventitious presence of GM varieties in non-GM crops
• Ranks the cropping systems according to their probability of gene flow between GM and non-GM oilseed rape (OSR)
Motivation
Predict the contamination of a field with GM material
The dataset produced by GENESYS was previously analyzed using propositional data mining techniques (Ivanovska et al., 2006)
Assumption: contamination of a field with GM seeds mostly depends on the cropping techniques and crops grown in the surrounding fields
• Exploit neighborhood relations in the predictive model
• Create a relational representation of the problem
In this study: investigate the use of relational data mining to analyze the dataset produced by GENESYS, using the relational data mining system TILDE
[Figure: overview of the GENESYS model (Colbach et al., 2001). Labels: field plan, crop succession, crop management, rape varieties. For each field and year: plants, seedbank, seeds produced (number per m², genotypic proportions).]
Materials and methods: the dataset
Output from GENESYS
Large-risk field plan maximizes the pollen and seed input into the central field
Focus of the analysis: predict the rate of adventitious presence in the central field of a large-risk field plan
Materials and methods: the dataset (2)
100 000 simulations for 25 years
Attributes:
• Geometry of the region (field plan)
• Genetic variables
• For each field and year: crops and management techniques
• Full details kept only for the last 4 years
Materials and methods: relational data mining, relational classification trees
Propositional data mining techniques
• Single table
• Popular DM techniques: classification and regression decision trees
Relational data mining techniques
• Multiple tables
• Relations between them
• Relational classification or regression trees
Materials and methods: relational data mining, relational classification trees (2)
Data scattered over multiple relations (or tables):
• can be analyzed by conventional data mining techniques after transforming it into a single propositional table (attribute-value representation); this step is called propositionalization
• alternatively, a multi-relational approach takes into account the structure of the original data
Data represented in terms of relations: target(Field1, contaminated)
Background knowledge is also given
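The propositionalization option above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the field names, crops, the `crop` relation and the aggregate features are all invented, and the neighbour relation is assumed to be symmetric.

```python
# Toy relational data; every name and value here is invented for illustration.
target = {"field1": "contaminated", "field2": "clean"}   # target(Field, Class)
neighbour = [("field1", "field2"), ("field1", "field3"), ("field2", "field3")]
crop = {"field1": "wheat", "field2": "nonGM-OSR", "field3": "GM-OSR"}

def propositionalize(field):
    """Flatten the relations around one field into a single attribute-value row."""
    # Assumption: adjacency is symmetric, so look at both argument positions.
    adjacent = [b for a, b in neighbour if a == field]
    adjacent += [a for a, b in neighbour if b == field]
    return {
        "field": field,
        "crop": crop[field],
        "n_neighbours": len(adjacent),
        "n_gm_neighbours": sum(crop[n] == "GM-OSR" for n in adjacent),
        "label": target[field],
    }

# One row per labelled field: the single table a propositional learner expects.
table = [propositionalize(f) for f in target]
```

The aggregates (`n_neighbours`, `n_gm_neighbours`) are one possible summary; which neighbour grew GM OSR is lost in the flat table, and that structure is what a multi-relational learner keeps.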
Materials and methods: relational data mining, relational classification trees (3)
Relational vs. propositional classification trees - similarities:
• both predict the value of a dependent variable (class) from the values of a set of independent variables (attributes)
• each inner node contains a test that compares the value of a certain attribute with a constant
• leaf nodes give a classification that applies to all instances that reach the leaf
Relational vs. propositional classification trees - differences:
• Propositional trees: tests in the inner nodes compare the value of a variable (a property of the object) to a value
• Relational trees: tests can also refer to background knowledge relations or tables
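The difference between the two kinds of node test can be illustrated with a short sketch. The predicate names follow the slides, but the facts, the dictionary layout and both helper functions are hypothetical.

```python
# Hypothetical toy facts; only the predicate names come from the slides.
# (field, year) -> (crop, sowing date)
field_data_year = {
    ("a", 0): ("wheat", 240),
    ("b", 1): ("gm-OSR", 245),
    ("c", 1): ("barley", 250),
}
# neighbour(Field1, Field2, NeighType)
neighbour = {("a", "b", "noborder"), ("a", "c", "border")}

def propositional_test(field, year=0, threshold=252):
    """Propositional node: compare one attribute of the field itself with a constant."""
    _, sowing_date = field_data_year[(field, year)]
    return sowing_date < threshold

def relational_test(field, neigh_type="noborder", year=1, crop="gm-OSR"):
    """Relational node: succeed if SOME neighbour of the given type grew the crop."""
    return any(
        a == field and kind == neigh_type
        and field_data_year.get((b, year), (None, None))[0] == crop
        for a, b, kind in neighbour
    )
```

The relational test introduces a new object (the neighbouring field) through a background relation, which a single-table test cannot express directly.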
An example of a relational classification tree
[Figure: relational classification tree. Each inner node has a yes and a no branch; leaves are labelled POS or NEG. Inner-node tests:]
• targetField(FieldA) and fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<252
• fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<233
• neighbour(FieldA,FieldB,noborder) and fieldDataYear(FieldB,1,gm-OSR,SowingDate)
Experiments and results
Representation of the data:
Target relation (data label):
    rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres)
Background relations:
    fieldDataYear(SimulationID, FieldID, Year, CultivationTechniques)
    lastOSR(SimulationID, FieldID, LastGM, LastNonGM)
    neighbour(Field1ID, Field2ID, NeighType)
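As a rough illustration, the four relations could be held in memory as typed records. Everything below is a sketch under assumptions: the field types, the split of CultivationTechniques into `crop` and `sowing_date`, the example facts and the helper `neighbour_records` are not part of the original material.

```python
from typing import NamedTuple

class RateOfAdvPresenceInField(NamedTuple):
    simulation_id: int
    field_id: int
    rate_adv_pres: float

class FieldDataYear(NamedTuple):
    simulation_id: int
    field_id: int
    year: int
    crop: str          # assumed component of CultivationTechniques
    sowing_date: int   # assumed component of CultivationTechniques

class LastOSR(NamedTuple):
    simulation_id: int
    field_id: int
    last_gm: int       # exact meaning not stated on the slide
    last_non_gm: int

class Neighbour(NamedTuple):   # no SimulationID: the field plan is shared
    field1_id: int
    field2_id: int
    neigh_type: str

# Invented example facts.
targets = [RateOfAdvPresenceInField(1, 0, 1.4)]
field_data = [
    FieldDataYear(1, 0, 25, "nonGM-OSR", 245),
    FieldDataYear(1, 7, 24, "gm-OSR", 240),
    FieldDataYear(1, 8, 24, "wheat", 290),
]
last_osr = [LastOSR(1, 0, 30, 3)]
neighbours = [Neighbour(0, 7, "noborder"), Neighbour(0, 8, "border")]

def neighbour_records(simulation_id, field_id, year):
    """Join neighbour with fieldDataYear, following Field1ID -> Field2ID only."""
    adjacent = {n.field2_id: n.neigh_type for n in neighbours if n.field1_id == field_id}
    return [
        (adjacent[r.field_id], r)
        for r in field_data
        if r.simulation_id == simulation_id and r.year == year and r.field_id in adjacent
    ]
```

For example, `neighbour_records(1, 0, 24)` returns the year-24 records of the two fields adjacent to field 0, each paired with its neighbourhood type.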
Experiments and results (2)
Target attribute discretized at a threshold of 0.9%
Experimental settings:
• Propositional: besides the target relation rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres), only (propositional) data for the target field is included (not using any relations among the fields), i.e., the following predicates are used:
    fieldDataYear(FieldID,Year,Crop,SowingDate), for the target field
    lastOSR(FieldID,LastGM,LastNonGM), for the target field
• Neighbor: the same relations are used as in the Propositional setting, but now other fields are introduced via the neighbour relation, starting at the target field:
    neighbour(Field1ID, Field2ID, NeighType)
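A minimal sketch of the discretization and of the two settings follows. It assumes that the rate is expressed in percent, that values strictly above the threshold count as positive, and that the class labels are `pos` and `neg` as in the example rules; none of these details is confirmed by the slides, and the `SETTINGS` dictionary is purely illustrative.

```python
THRESHOLD_PERCENT = 0.9  # threshold named on the slide

def discretize(rate_percent, threshold=THRESHOLD_PERCENT):
    """Map a numeric rate of adventitious presence to a class label.

    Assumption: strictly above the threshold is 'pos'; the slide does not
    say how a rate exactly at the threshold is handled.
    """
    return "pos" if rate_percent > threshold else "neg"

# Background predicates available to the learner in each setting.
SETTINGS = {
    "propositional": {"fieldDataYear", "lastOSR"},
    "neighbor": {"fieldDataYear", "lastOSR", "neighbour"},
}

labels = [discretize(r) for r in (0.2, 0.9, 1.5)]
```

The only difference between the settings is the `neighbour` predicate, which lets tests reach fields other than the target field.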
Experiments and results (3)
TILDE’s experimental results (3-fold cross-validation):

             PROPOSITIONAL   NEIGHBOR
TREE SIZE    15              13
ACCURACY     78.35%          79.66%

Example of a rule from the Propositional experiments:
    contamination([neg]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, lastOSR(T, Gm, NonGm), Gm<20
Example of a rule from the Neighbor experiments:
    contamination([pos]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, neighbour(T, FieldA, noborder), fieldDataYear(FieldA, 24, gm-OSR, SowingDate)
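To make the two example rules concrete, the sketch below checks them against a handful of invented facts. It is not TILDE output: the data layout and function names are hypothetical, and the second SowingDate in the Neighbor rule is treated as a fresh variable (the neighbour's sowing date is not required to equal the target's), which is an assumption about how the rule was meant.

```python
# Invented toy facts for one simulation; only the predicate names come from the slides.
field_data_year = {              # (field, year) -> (crop, sowing date)
    ("t", 25): ("nonGM-OSR", 245),
    ("n1", 24): ("gm-OSR", 238),
    ("n2", 24): ("wheat", 290),
}
last_osr = {"t": (30, 3)}        # field -> (Gm, NonGm); meaning not given on the slide
neighbour = [("t", "n1", "noborder"), ("t", "n2", "border")]

def propositional_rule_fires(t):
    """Body of the Propositional rule: SowingDate < 252 and Gm < 20 (predicts neg)."""
    _, sowing_date = field_data_year[(t, 25)]
    gm, _non_gm = last_osr[t]
    return sowing_date < 252 and gm < 20

def neighbor_rule_fires(t):
    """Body of the Neighbor rule: SowingDate < 252 and some 'noborder' neighbour
    grew gm-OSR in year 24 (predicts pos)."""
    _, sowing_date = field_data_year[(t, 25)]
    if not sowing_date < 252:
        return False
    return any(
        a == t and kind == "noborder"
        and field_data_year.get((b, 24), (None, None))[0] == "gm-OSR"
        for a, b, kind in neighbour
    )
```

With these facts the Neighbor rule fires for field `t`, while the Propositional rule does not because `Gm` is not below 20.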
Conclusions
Use of relational data mining for analyzing the output of the complex simulation model GENESYS
Predict the contamination of the central field of a large-risk field plan
Built relational classification trees with the first-order decision tree learner TILDE
Conclusions (2)
Propositional and relational trees
Relational experiments: slightly better
• Due to using a fixed field plan and a fixed target field
Further work:
• Performing more experiments with GENESYS
• Different field plans
• Different target fields
• Analyze the results of other simulation models
Acknowledgement
SIGMEA (Sustainable Introduction of Genetically Modified organisms into European Agriculture)