Studying the Presence of Genetically Modified Variants in Organic Oilseed Rape by using Relational Data Mining

Aneta Ivanovska1, Celine Vens2, Sašo Džeroski1, Nathalie Colbach3

1 Department of Knowledge Technologies, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia

Emails: aneta.ivanovska@ijs.si, saso.dzeroski@ijs.si

Tel: +386 1 477 3144 (Aneta Ivanovska)

2 Department of Computer Science, K.U. Leuven, Celestijnenlaan 200A, B-3001 Leuven, Belgium

Email: celine.vens@cs.kuleuven.be

3 UMR1210, Biologie et Gestion des Adventices, INRA, 21000 Dijon, France

Email: colbach@dijon.inra.fr

13/09/2007 EnviroInfo 2007 - Warsaw 2

The GM problem

Genetically Modified (GM) crops
• First introduced for commercial production in 1996
• Herbicide tolerant
• Pest-resistant

Concern: GM crops mixing with conventional or organic crops of the same species

The GM problem (2)

Computer simulation model GENESYS
• Estimates the rate of adventitious presence of GM varieties in non-GM crops
• Ranks the cropping systems according to their probability of gene flow between GM and non-GM oilseed rape (OSR)

Motivation

Predict the contamination of a field with GM material

The dataset produced by GENESYS was previously analyzed using propositional data mining techniques (Ivanovska et al., 2006)

Assumption: contamination of a field with GM seeds mostly depends on the cropping techniques and crops grown in the surrounding fields
• Exploit neighborhood relations in the predictive model
• Create a relational representation of the problem

In this study: investigate the use of relational data mining, specifically the relational data mining system TILDE, to analyze the dataset produced by GENESYS

[Figure: GENESYS model overview (Colbach et al., 2001). Labels: field plan; crop succession; crop management; rape varieties; for each field and year: plants, seedbank, seeds produced (number per m², genotypic proportions).]

Materials and methods: the dataset

Output from GENESYS

Large-risk field plan maximizes the pollen and seed input into the central field

Focus of the analysis: predict the rate of adventitious presence in the central field of a large-risk field plan

Materials and methods: the dataset (2)

100 000 simulations for 25 years

Attributes:
• Geometry of the region (field plan)
• Genetic variables
• For each field and year: crops and management techniques

Full details kept only for the last 4 years

Materials and methods: relational data mining, relational classification trees

Propositional data mining techniques
• Single table
• Popular DM techniques: classification and regression decision trees

Relational data mining techniques
• Multiple tables
• Relations between them
• Relational classification or regression trees

Materials and methods: relational data mining, relational classification trees (2)

Data scattered over multiple relations (or tables):
• can be analyzed by conventional data mining techniques, by transforming it into a propositional table (attribute-value representation); this is called propositionalization
• a multi-relational approach takes into account the structure of the original data

Data represented in terms of relations: target(Field1, contaminated)

Background knowledge is also given
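
As a minimal sketch of propositionalization, the Python below flattens a toy relational dataset into one attribute-value row per field by aggregating over its neighbours. The field names, crops, and the aggregate `n_gm_neighbours` are hypothetical illustrations and are not taken from the GENESYS data.

```python
# Toy relational data (hypothetical values, for illustration only).
# crop[field] -> crop grown; neighbour -> set of (field1, field2) pairs.
crop = {"f1": "non-gm-osr", "f2": "gm-osr", "f3": "wheat", "f4": "gm-osr"}
neighbour = {("f1", "f2"), ("f1", "f3"), ("f1", "f4"), ("f2", "f3")}


def neighbours_of(field):
    """Return the fields adjacent to `field`, treating the relation as symmetric."""
    result = set()
    for a, b in neighbour:
        if a == field:
            result.add(b)
        elif b == field:
            result.add(a)
    return result


def propositionalize(field):
    """Flatten the relational facts about `field` into one attribute-value row."""
    adjacent = neighbours_of(field)
    return {
        "field": field,
        "crop": crop[field],
        "n_neighbours": len(adjacent),
        "n_gm_neighbours": sum(1 for f in adjacent if crop[f] == "gm-osr"),
    }


# One row per field: the single table a propositional learner would see.
table = [propositionalize(f) for f in sorted(crop)]
```

The aggregate counts discard which neighbour grows the GM crop; a multi-relational learner can keep that structure instead.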

Materials and methods: relational data mining, relational classification trees (3)

Relational vs. propositional classification trees - similarities:

predict the value of a dependent variable (class) from the values of a set of independent variables (attributes)

each inner node contains a test that compares the value of a certain attribute with a constant

leaf nodes give a classification that applies to all instances that reach the leaf

Relational vs. propositional classification trees - differences:

Prop. trees: tests in the inner nodes compare the value of a variable (property of the object) to a value

Rel. trees: tests can also refer to background knowledge relations or tables
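
The difference between the two kinds of test can be sketched in Python as follows. The facts are hypothetical toy values; the function names are illustrative and do not correspond to TILDE's internals.

```python
# Hypothetical toy facts, named after the predicates used on the slides.
# field_data_year: (field, year) -> (crop, sowing_date)
field_data_year = {
    ("a", 0): ("osr", 240),
    ("b", 1): ("gm-osr", 245),
    ("c", 1): ("wheat", 280),
}
# neighbour: set of (field1, field2, neighbour_type)
neighbour = {("a", "b", "noborder"), ("a", "c", "border")}


def propositional_test(field, threshold=252):
    """Compare one attribute of the object itself with a constant."""
    _crop, sowing_date = field_data_year[(field, 0)]
    return sowing_date < threshold


def relational_test(field):
    """Succeed if SOME neighbour without a border grew GM OSR in year 1.

    The test is existential: it follows the neighbour relation to other
    objects instead of reading an attribute of `field` itself.
    """
    for f1, f2, kind in neighbour:
        if f1 == field and kind == "noborder":
            crop, _date = field_data_year.get((f2, 1), (None, None))
            if crop == "gm-osr":
                return True
    return False
```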

An example of a relational classification tree

[Figure: relational classification tree with yes/no branches and POS/NEG leaves. Tests in the inner nodes:]
• targetField(FieldA) and fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<252
• fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<233
• neighbour(FieldA,FieldB,noborder) and fieldDataYear(FieldB,1,gm-OSR,SowingDate)

Experiments and results

Representation of the data:

Target relation (data label):
• rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres)

Background relations:
• fieldDataYear(SimulationID, FieldID, Year, CultivationTechniques)
• lastOSR(SimulationID, FieldID, LastGM, LastNonGM)
• neighbour(Field1ID, Field2ID, NeighType)
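
One way to picture this representation is as typed records, one class per relation. The sketch below mirrors the argument names from the slide; the Python types, the contents of `cultivation_techniques`, and all example values are assumptions made for illustration.

```python
from typing import NamedTuple


class RateOfAdvPresenceInField(NamedTuple):
    simulation_id: int
    field_id: str
    rate_adv_pres: float  # assumed here to be a percentage


class FieldDataYear(NamedTuple):
    simulation_id: int
    field_id: str
    year: int
    cultivation_techniques: dict  # assumed container for crop, sowing date, ...


class LastOSR(NamedTuple):
    simulation_id: int
    field_id: str
    last_gm: int
    last_non_gm: int


class Neighbour(NamedTuple):
    field1_id: str
    field2_id: str
    neigh_type: str


# Hypothetical example facts for a single simulation.
facts = [
    RateOfAdvPresenceInField(1, "central", 1.2),
    FieldDataYear(1, "central", 25, {"crop": "osr", "sowing_date": 240}),
    FieldDataYear(1, "north", 24, {"crop": "gm-osr", "sowing_date": 245}),
    LastOSR(1, "central", 3, 1),
    Neighbour("central", "north", "noborder"),
]


def facts_of(kind):
    """Return all facts of one relation, i.e. the rows of one table."""
    return [f for f in facts if isinstance(f, kind)]
```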

Experiments and results (2)

Discretized target attribute (threshold: 0.9%)

Experimental settings:

Propositional: besides the target relation rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres), only (propositional) data for the target field is included (not using any relations among the fields), i.e., the following predicates are used:

fieldDataYear(FieldID,Year,Crop,SowingDate), for the target field

lastOSR(FieldID,LastGM,LastNonGM), for the target field

Neighbor: the same relations are used as in the Propositional setting, but other fields are now introduced via the neighbour relation, starting at the target field:

neighbour(Field1ID, Field2ID, NeighType)
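
The discretization step and the two settings can be sketched in Python as below. This is an illustration under stated assumptions: the rate is taken to be a percentage, and labelling rates strictly above 0.9 as positive is a guess at the convention, which the slide does not spell out.

```python
THRESHOLD = 0.9  # percent; the discretization value given on the slide


def discretize(rate_adv_pres, threshold=THRESHOLD):
    """Map a numeric rate (assumed to be in percent) to a class label.

    Treating rates strictly above the threshold as 'pos' is an assumption.
    """
    return "pos" if rate_adv_pres > threshold else "neg"


# Predicates available to the learner in each setting (names from the slide).
SETTINGS = {
    "propositional": {"fieldDataYear", "lastOSR"},
    "neighbor": {"fieldDataYear", "lastOSR", "neighbour"},
}


def allowed(setting, predicate):
    """Return True if `predicate` may appear in tests under `setting`."""
    return predicate in SETTINGS[setting]
```

The only difference between the settings is whether the neighbour relation is available, so any accuracy gap between them reflects information carried by surrounding fields.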

Experiments and results (3)

TILDE’s experimental results – 3-fold cross-validation

Example of a rule from the Propositional experiments:

contamination([neg]) :- targetfield(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, lastOSR(T, Gm, NonGm), Gm<20

Example of a rule from the Neighbor experiments:

contamination([pos]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, neighbour(T, FieldA, noborder), fieldDataYear(FieldA, 24, gm-OSR, SowingDate)
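
To make the two example rules concrete, the following Python sketch checks their bodies against a handful of hypothetical facts. All values are invented for illustration, and the neighbour's sowing date is left unconstrained in the second rule, which is an assumption about how the shared variable name on the slide is meant.

```python
# Hypothetical facts for one simulation; values are illustrative only.
# (field, year) -> (crop, sowing_date)
field_data_year = {
    ("t", 25): ("osr", 240),
    ("n1", 24): ("gm-osr", 250),
    ("n2", 24): ("wheat", 290),
}
last_osr = {"t": (12, 3)}  # field -> (LastGM, LastNonGM)
# (field1, field2, neighbour_type)
neighbour = [("t", "n1", "noborder"), ("t", "n2", "border")]


def propositional_rule(t):
    """Body of the Propositional rule: only facts about the target field."""
    _crop, sowing_date = field_data_year[(t, 25)]
    gm, _non_gm = last_osr[t]
    return sowing_date < 252 and gm < 20


def neighbor_rule(t):
    """Body of the Neighbor rule: also follows the neighbour relation."""
    _crop, sowing_date = field_data_year[(t, 25)]
    if not sowing_date < 252:
        return False
    for f1, f2, kind in neighbour:
        if f1 == t and kind == "noborder":
            other = field_data_year.get((f2, 24))
            # Assumption: the neighbour's sowing date is not constrained.
            if other is not None and other[0] == "gm-osr":
                return True
    return False
```

The two rules come from separately learned trees, so they are evaluated independently here and are not meant to be combined into one classifier.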

            PROPOSITIONAL   NEIGHBOR
TREE SIZE   15              13
ACCURACY    78.35%          79.66%
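
For readers unfamiliar with the evaluation protocol, here is a minimal Python sketch of 3-fold cross-validation. The majority-class learner is a stand-in and is not TILDE; the labels are hypothetical; and pooling correct predictions over all folds is an assumption, since the slides do not say how the reported accuracy was aggregated.

```python
from collections import Counter


def k_fold_indices(n, k=3):
    """Split indices 0..n-1 into k folds by striding (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


def majority_learner(train_labels):
    """Stand-in learner (NOT TILDE): always predicts the most common label."""
    label = Counter(train_labels).most_common(1)[0][0]
    return lambda _example: label


def cross_validated_accuracy(examples, labels, k=3):
    """Train on k-1 folds, test on the held-out fold, pool the results."""
    correct = 0
    for test_idx in k_fold_indices(len(examples), k):
        held_out = set(test_idx)
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        model = majority_learner(train_labels)
        correct += sum(model(examples[i]) == labels[i] for i in test_idx)
    return correct / len(examples)


# Hypothetical labels; the examples are placeholders because the stand-in
# learner ignores its input.
labels = ["neg", "neg", "pos", "neg", "neg", "pos", "neg", "neg", "neg"]
examples = list(range(len(labels)))
accuracy = cross_validated_accuracy(examples, labels)
```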

Conclusions

Use of relational data mining for analyzing the output of the complex simulation model GENESYS

Predict the contamination of the central field of a large-risk field plan

Built relational classification trees with the first-order decision tree learner TILDE

Conclusions (2)

Propositional and relational trees

Relational experiments – slightly better
• Due to using a fixed field plan and a fixed target field

Further work:
• Performing more experiments with GENESYS
• Different field plans
• Different target fields
• Analyze the results of other simulation models

Acknowledgement

SIGMEA (Sustainable Introduction of Genetically Modified organisms into European Agriculture)
