
Studying the Presence of Genetically Modified Variants in Organic Oilseed Rape by Using Relational Data Mining

Aneta Ivanovska (1), Celine Vens (2), Sašo Džeroski (1), Nathalie Colbach (3)

(1) Department of Knowledge Technologies, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia

Emails: aneta.ivanovska@ijs.si, saso.dzeroski@ijs.si

Tel: +386 1 477 3144 (Aneta Ivanovska)

(2) Department of Computer Science, K.U. Leuven, Celestijnenlaan 200A, B-3001 Leuven, Belgium

Email: celine.vens@cs.kuleuven.be

(3) UMR1210, Biologie et Gestion des Adventices, INRA, 21000 Dijon, France

Email: colbach@dijon.inra.fr

13/09/2007 EnviroInfo 2007 - Warsaw

The GM problem

Genetically Modified (GM) crops:

First introduced for commercial production in 1996

Herbicide tolerant

Pest-resistant

Concern: GM crops mixing with conventional or organic crops of the same species

The GM problem (2)

Computer simulation model GENESYS:

Estimates the rate of adventitious presence of GM varieties in non-GM crops

Ranks the cropping systems according to their probability of gene flow between GM and non-GM oilseed rape (OSR)

Motivation

Predict the contamination of a field with GM material

The dataset produced by GENESYS was previously analyzed using propositional data mining techniques (Ivanovska et al., 2006)

Assumption: contamination of a field with GM seeds mostly depends on the cropping techniques and crops grown in the surrounding fields

Exploit neighborhood relations in the predictive model

Create a relational representation of the problem

In this study: investigate the use of relational data mining, with the relational data mining system TILDE, to analyze the dataset produced by GENESYS

[Figure: GENESYS model overview (Colbach et al., 2001). Labels: field plan; crop succession; crop management; rape varieties; for each field and year: plants, seedbank, seeds produced (number per m², genotypic proportions)]

Materials and methods: the dataset

Output from GENESYS

Large-risk field plan maximizes the pollen and seed input into the central field

Focus of the analysis: predict the rate of adventitious presence in the central field of a large-risk field plan

Materials and methods: the dataset (2)

100 000 simulations for 25 years

Attributes:

Geometry of the region (field-plan)

Genetic variables

For each field and year: crops and management techniques

Full details kept only for the last 4

Materials and methods: relational data mining, relational classification trees

Propositional data mining techniques:

Single table

Popular DM techniques: classification and regression decision trees

Relational data mining techniques:

Multiple tables

Relations between them

Relational classification or regression trees

Materials and methods: relational data mining, relational classification trees (2)

Data scattered over multiple relations (or tables):

can be analyzed by conventional data mining techniques, by transforming it into a propositional table (attribute-value representation) – propositionalization

a multi-relational approach takes into account the structure of the original data

Data represented in terms of relations: target(Field1, contaminated)

Background knowledge is also given
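The propositionalization step mentioned above can be sketched as follows. This is only a minimal illustration, not the procedure used in the study: the toy facts, the `field_crop` relation, and the aggregate attributes `n_neighbours` and `n_gm_neighbours` are all hypothetical names chosen for the example.

```python
from collections import defaultdict

# Hypothetical toy facts; names and values are illustrative, not GENESYS output.
# neighbour(Field1, Field2, NeighType)
neighbour = [("f1", "f2", "noborder"), ("f1", "f3", "border")]
# field_crop(Field, Year, Crop)
field_crop = [("f1", 0, "wheat"), ("f2", 1, "gm-OSR"), ("f3", 1, "barley")]


def propositionalize(target, neighbour, field_crop):
    """Flatten the relations around `target` into one attribute-value row."""
    crops = defaultdict(set)
    for field, _year, crop in field_crop:
        crops[field].add(crop)
    # Follow the neighbour relation once, starting at the target field.
    neighbours = [f2 for f1, f2, _kind in neighbour if f1 == target]
    return {
        "field": target,
        "n_neighbours": len(neighbours),
        # Aggregate: how many neighbours ever grew GM oilseed rape.
        "n_gm_neighbours": sum("gm-OSR" in crops[n] for n in neighbours),
    }


row = propositionalize("f1", neighbour, field_crop)
print(row)
```

The aggregation discards which neighbour grew the GM crop and when, which is the kind of structure a multi-relational learner can keep.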

Materials and methods: relational data mining, relational classification trees (3)

Relational vs. propositional classification trees - similarities:

predict the value of a dependent variable (class) from the values of a set of independent variables (attributes)

each inner node contains a test that compares the value of a certain attribute with a constant

leaf nodes give a classification that applies to all instances that reach the leaf

Relational vs. propositional classification trees - differences:

Prop. trees: tests in the inner nodes compare the value of a variable (property of the object) to a value

Rel. trees: tests can also refer to background knowledge relations or tables
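The difference between the two kinds of test can be illustrated with a small sketch. The facts below are made up, and the Python dictionaries and sets are only a stand-in for the logical representation that a relational learner works with; the thresholds and predicate names echo the slides but the code itself is an assumption.

```python
# Hypothetical toy facts (illustrative values only).
sowing_date = {"fieldA": 240, "fieldB": 260}               # attribute of a field
neighbour = {("fieldA", "fieldB", "noborder")}             # background relation
crop = {("fieldB", 1, "gm-OSR"), ("fieldA", 0, "wheat")}   # (Field, Year, Crop)


def propositional_test(field):
    """Compare one attribute of the object itself with a constant."""
    return sowing_date[field] < 252


def relational_test(field):
    """Succeed if SOME 'noborder' neighbour grew gm-OSR in year 1."""
    return any(
        f1 == field and kind == "noborder" and (f2, 1, "gm-OSR") in crop
        for f1, f2, kind in neighbour
    )


print(propositional_test("fieldA"), relational_test("fieldA"))
```

The relational test introduces a new object (the neighbouring field) and is satisfied existentially, that is, as soon as one matching neighbour is found.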

An example of a relational classification tree

[Figure: relational classification tree with yes/no branches and leaves labelled NEG and POS. Node tests:]

targetField(FieldA) and fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<252

fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<233

neighbour(FieldA,FieldB,noborder) and fieldDataYear(FieldB,1,gm-OSR,SowingDate)

Experiments and results

Representation of the data:

Target relation (data label):

rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres)

Background relations:

fieldDataYear(SimulationID, FieldID, Year, CultivationTechniques)

lastOSR(SimulationID, FieldID, LastGM, LastNonGM)

neighbour(Field1ID, Field2ID, NeighType)
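To make the schema concrete, the four relations can be mirrored as typed records. This is a sketch for illustration only: the argument names follow the slides, but the Python types, the dictionary keys inside `cultivation_techniques`, the field identifiers, and all values are assumptions.

```python
from typing import NamedTuple


class RateOfAdvPresenceInField(NamedTuple):  # target relation
    simulation_id: int
    field_id: str
    rate_adv_pres: float


class FieldDataYear(NamedTuple):  # background relation
    simulation_id: int
    field_id: str
    year: int
    cultivation_techniques: dict  # assumed container, e.g. crop and sowing date


class LastOSR(NamedTuple):  # background relation
    simulation_id: int
    field_id: str
    last_gm: int
    last_non_gm: int


class Neighbour(NamedTuple):  # background relation
    field1_id: str
    field2_id: str
    neigh_type: str


# Made-up facts for one simulation.
target = RateOfAdvPresenceInField(1, "centre", 1.4)
history = LastOSR(1, "centre", 3, 1)
neighbours = [
    Neighbour("centre", "north", "noborder"),
    Neighbour("centre", "east", "border"),
]
field_data = [
    FieldDataYear(1, "north", 24, {"crop": "gm-OSR", "sowing_date": 245}),
]


def neighbours_of(field_id, facts):
    """Return the fields reachable from `field_id` via the neighbour relation."""
    return [n.field2_id for n in facts if n.field1_id == field_id]


print(neighbours_of(target.field_id, neighbours))
```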

Experiments and results (2)

Discretized target attribute – 0.9%

Experimental settings:

Propositional: besides the target relation rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres), only (propositional) data for the target field is included (not using any relations among the fields), i.e., the following predicates are used:

fieldDataYear(FieldID,Year,Crop,SowingDate), for the target field

lastOSR(FieldID,LastGM,LastNonGM), for the target field

Neighbor: the same relations were used as in the Propositional setting, but now other fields are introduced via the neighbour relation, starting at the target field:

neighbour(Field1ID, Field2ID, NeighType)
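The discretization of the target attribute can be sketched as a simple threshold. The slides give only the 0.9% value; treating it as a cut-off, the direction of the comparison, the handling of a rate exactly at the threshold, and the label names `pos` and `neg` (borrowed from the example rules) are assumptions.

```python
THRESHOLD = 0.9  # percent; the value named on the slide


def discretize(rate_percent: float) -> str:
    """Map a rate of adventitious presence (in percent) to a class label.

    Assumption: rates strictly above the threshold are labelled "pos"
    (contaminated); everything else is labelled "neg".
    """
    return "pos" if rate_percent > THRESHOLD else "neg"


# Made-up rates, for illustration only.
labels = [discretize(r) for r in (0.3, 0.9, 1.2)]
print(labels)
```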

Experiments and results (3)

TILDE’s experimental results – 3-fold cross-validation

Example of a rule from the Propositional experiments:

contamination([neg]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, lastOSR(T, Gm, NonGm), Gm<20

Example of a rule from the Neighbor experiments:

contamination([pos]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, neighbour(T, FieldA, noborder), fieldDataYear(FieldA, 24, gm-OSR, SowingDate)

            PROPOSITIONAL   NEIGHBOR

TREE SIZE   15              13

ACCURACY    78.35%          79.66%
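For readers unfamiliar with the evaluation protocol, 3-fold cross-validation accuracy can be computed roughly as below. This is a generic sketch, not how TILDE performs its evaluation: the fold assignment (contiguous, unshuffled), the `majority_learner` stand-in, and the toy labels are all assumptions, and the code expects at least as many examples as folds.

```python
def k_fold_indices(n, k=3):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(examples, labels, train_fn, k=3):
    """Average held-out accuracy of `train_fn` over k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(examples), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(examples)) if i not in held_out]
        # Train on k-1 folds, then score on the remaining fold.
        model = train_fn([examples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(model(examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def majority_learner(train_x, train_y):
    """Stand-in learner: always predicts the most frequent training label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda _example: majority


# Made-up data: six examples, one of them positive.
toy_examples = list(range(6))
toy_labels = ["neg", "neg", "pos", "neg", "neg", "neg"]
accuracy = cross_val_accuracy(toy_examples, toy_labels, majority_learner)
print(round(accuracy, 4))
```

Any learner that takes training examples and labels and returns a prediction function can be passed in place of `majority_learner`.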

Conclusions

Use of relational data mining for analyzing an output of the complex simulation model GENESYS

Predict the contamination of the central field of a large-risk field plan

Built relational classification trees – first-order decision tree learner TILDE

Conclusions (2)

Propositional and relational trees

Relational experiments – slightly better

Due to using a fixed field plan and a fixed target field

Further work:

Performing more experiments with GENESYS

Different field plans

Different target fields

Analyze the results of other simulation models

Acknowledgement

SIGMEA (Sustainable Introduction of Genetically Modified organisms into European Agriculture)
