Johanna Gold
Rough Sets Theory / Logical Analysis of Data
Monday, November 26, 2007
Introduction
Comparison of two theories for rule induction.
Different methodologies, same results?
A set of objects described by attributes; each object belongs to a class. We want decision rules.
Generalities
There are two approaches:
- Rough Sets Theory (RST)
- Logical Analysis of Data (LAD)
Goal: compare them.
Approaches
Contents
1. Rough Sets Theory
2. Logical Analysis of Data
3. Comparison
4. Inconsistencies
Two examples having exactly the same values on all attributes, but belonging to two different classes.
Example: two sick people have the same symptoms but different diseases.
Inconsistencies
RST doesn’t correct or aggregate inconsistencies.
For each class : determination of lower and upper approximations.
Covered by RST
Lower: objects we are sure belong to the class.
Upper: objects that may belong to the class.
Approximations
Lower approximation → certain rules
Upper approximation → possible rules
Impact on rules
Rule induction directly on numerical data → poor rules → too many rules.
A pretreatment is needed.
Pretreatment
Goal : convert numerical data into discrete data.
Principle : determination of cut points in order to divide domains into successive intervals.
Discretization
First algorithm: LEM2.
Improved algorithms include the pretreatment: MLEM2, MODLEM, …
Algorithms
Induction of certain rules from the lower approximation.
Induction of possible rules from the upper approximation.
Same procedure
LEM2
For an attribute x and its value v, the block [(x,v)] of the attribute-value pair (x,v) is the set of all cases where attribute x has the value v.
Ex : [(Age,21)]=[Martha]
[(Age,22)]=[David ; Audrey]
Definitions (1)
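The block computation above can be sketched in Python; the dict-of-dicts data layout and the function name are illustrative choices, not something fixed by the slides:

```python
from collections import defaultdict

def blocks(table):
    """Map each attribute-value pair (a, v) to the set of cases
    where attribute a takes value v."""
    b = defaultdict(set)
    for case, attrs in table.items():
        for a, v in attrs.items():
            b[(a, v)].add(case)
    return dict(b)

# the slide's example
table = {"Martha": {"Age": 21}, "David": {"Age": 22}, "Audrey": {"Age": 22}}
print(blocks(table)[("Age", 22)] == {"David", "Audrey"})  # True
```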
Let B be a non-empty lower or upper approximation of a concept represented by a decision-value pair (d,w).
Ex : (level,middle)→B=[obj1 ; obj5 ; obj7]
Definitions (2)
Let T be a set of attribute-value pairs (a,v). Set B depends on set T if and only if:

∅ ≠ [T] = ∩_{(a,v)∈T} [(a,v)] ⊆ B

Definitions (3)
A set T is a minimal complex of B if and only if B depends on T and there is no proper subset T′ of T such that B depends on T′.
Definitions (4)
Let 𝒯 be a non-empty collection of non-empty sets of attribute-value pairs: 𝒯 is a set of sets T, and each T is a set of pairs (a,v).
Definitions (5)
𝒯 is a local cover of B if and only if:
- each member T of 𝒯 is a minimal complex of B;
- ∪_{T∈𝒯} [T] = B;
- 𝒯 is minimal.

Definitions (6)
LEM2’s output is a local cover for each approximation of the decision table concept.
It then converts them into decision rules.
Algorithm principle
Algorithm
Among the possible blocks, we choose the one:
- with the highest priority;
- with the highest intersection;
- with the smallest cardinality.
Heuristics detailsHeuristics details
As long as the current set of pairs (a,v) is not a minimal complex of the concept, pairs are added.
As long as the complexes found do not form a local cover, minimal complexes are added.
Heuristics details
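The loop just described can be sketched in Python, assuming the blocks are given as a dict from (attribute, value) pairs to sets of cases; the function name and data layout are mine, not the slides':

```python
def lem2(blocks, B):
    """Find a local cover of B: grow a candidate complex with the
    heuristics above (highest intersection with the remaining goal,
    ties broken by smallest block), then prune redundant pairs."""
    uncovered, cover = set(B), []
    while uncovered:
        T, g = [], set(uncovered)
        while not T or not set.intersection(*(blocks[t] for t in T)) <= B:
            # eligible pairs are those intersecting the current goal g
            best = max((t for t in blocks if t not in T and blocks[t] & g),
                       key=lambda t: (len(blocks[t] & g), -len(blocks[t])))
            T.append(best)
            g &= blocks[best]
        for t in list(T):  # drop redundant pairs -> minimal complex
            rest = [u for u in T if u != t]
            if rest and set.intersection(*(blocks[u] for u in rest)) <= B:
                T.remove(t)
        cover.append(T)
        uncovered -= set.intersection(*(blocks[t] for t in T))
    return cover
```

On the Height/Hair data of the following slides, this returns the two minimal complexes {(Hair, Black)} and {(Hair, Blond), (Height, 160..165)} derived there.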
Illustration through an example. We consider that the pretreatment has already been done.
Illustration
Data set
Case  Height (cm)  Hair   Attraction (decision)
1     160          Blond  -
2     170          Blond  +
3     160          Red    +
4     180          Black  -
5     160          Black  -
6     170          Black  -
For the attribute Height, we have the values 160, 170 and 180.
The pretreatment gives us two cut points: 165 and 175.
Cut points
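One common way to obtain such cut points is to take the midpoints between consecutive distinct values; the slides do not specify the method, so this is only an illustrative sketch:

```python
def cut_points(values):
    """Midpoints between consecutive distinct sorted values."""
    vs = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(vs, vs[1:])]

# the Height column of the data set above
print(cut_points([160, 170, 160, 180, 160, 170]))  # [165.0, 175.0]
```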
[(Height, 160..165)] = {1,3,5}
[(Height, 165..180)] = {2,4}
[(Height, 160..175)] = {1,2,3,5}
[(Height, 175..180)] = {4}
[(Hair, Blond)] = {1,2}
[(Hair, Red)] = {3}
[(Hair, Black)] = {4,5,6}
Blocks [(a,v)]
G = B = [(Attraction,-)] = {1,4,5,6}. Here there are no inconsistencies. If there were some, this is the point where we would have to choose between the lower and the upper approximation.
First concept
Pairs (a,v) such that [(a,v)] ∩ [(Attraction,-)] ≠ ∅:
(Height,160..165), (Height,165..180), (Height,160..175), (Height,175..180), (Hair,Blond), (Hair,Black)
Eligible pairs
We choose the most appropriate pair, that is, the (a,v) for which |[(a,v)] ∩ [(Attraction,-)]| is the highest. Here: (Hair, Black).
Choice of a pair
The pair (Hair, Black) is a minimal complex because:

[(Hair, Black)] ⊆ [(Attraction,-)]

Minimal complex
B = [(Attraction,-)] – [(Hair,Black)]
= {1,4,5,6} - {4,5,6}
= {1}
New concept
The new goal {1} can be reached through the pairs (Height,160..165), (Height,160..175) and (Hair, Blond).
The intersections having the same cardinality, we choose the pair whose block has the smallest cardinality: (Hair, Blond).
Choice of a pair (1)
Problem: (Hair, Blond) is not a minimal complex, since [(Hair, Blond)] ⊄ [(Attraction,-)].
We therefore choose the following pair: (Height,160..165).

Choice of a pair (2)
{(Hair, Blond),(Height,160..165)} is a second minimal complex.
Minimal complex

[(Hair, Blond)] ∩ [(Height, 160..165)] ⊆ [(Attraction,-)]
{{(Hair, Black)}, {(Hair, Blond), (Height, 160..165)}}
is a local cover of [(Attraction,-)].
End of the concept
(Hair, Red) → (Attraction, +)
(Hair, Blond) & (Height, 165..180) → (Attraction, +)
(Hair, Black) → (Attraction, -)
(Hair, Blond) & (Height, 160..165) → (Attraction, -)
Rules
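The induced rules can be applied directly to new cases. A small sketch; the first-match firing strategy and the data layout are my assumptions, not stated on the slides:

```python
# The four rules above; Height conditions are (lo, hi) intervals.
RULES = [
    ([("Hair", "Red")], "+"),
    ([("Hair", "Blond"), ("Height", (165, 180))], "+"),
    ([("Hair", "Black")], "-"),
    ([("Hair", "Blond"), ("Height", (160, 165))], "-"),
]

def holds(case, attr, cond):
    """Interval conditions are (lo, hi) pairs; others are equalities."""
    v = case[attr]
    if isinstance(cond, tuple):
        lo, hi = cond
        return lo <= v <= hi
    return v == cond

def classify(case):
    """Fire the first rule whose conditions all hold."""
    for conds, decision in RULES:
        if all(holds(case, a, c) for a, c in conds):
            return decision
    return None

print(classify({"Hair": "Blond", "Height": 170}))  # +
```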
Contents
1. Rough Sets Theory
2. Logical Analysis of Data
3. Comparison
4. Inconsistencies
LAD works on binary data, with an extension of the Boolean approach to the non-binary case.
Principle
Let S be the set of all observations. Each observation is described by n attributes and belongs to a class.
Definitions (1)
The classification can be considered as a partition of S into two sets S⁺ and S⁻.
An archive is represented by a Boolean function Φ : {0,1}ⁿ → {0,1}.

Definitions (2)
A literal is a Boolean variable or its negation: xi or ¬xi.
A term is a conjunction of literals, e.g. x1 ¬x2 x3.
The degree of a term is its number of literals.

Definitions (3)
A term T covers a point p ∈ {0,1}ⁿ if T(p) = 1.
The characteristic term of a point p is the unique term of degree n covering p.
Ex: the characteristic term of (0,1,1,0) is ¬x1 x2 x3 ¬x4.

Definitions (4)
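In Python, representing a term as a mapping from variable index to required bit (a layout chosen here purely for illustration):

```python
def covers(term, p):
    """T(p) = 1 iff every literal of the term agrees with p."""
    return all(p[i] == bit for i, bit in term.items())

def characteristic_term(p):
    """The unique degree-n term covering p: one literal per coordinate."""
    return {i: bit for i, bit in enumerate(p)}

t = characteristic_term((0, 1, 1, 0))        # i.e. the term ¬x1 x2 x3 ¬x4
print(covers(t, (0, 1, 1, 0)), covers(t, (1, 1, 1, 0)))  # True False
```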
A term T is an implicant of a Boolean function f if T(p) ≤ f(p) for all p ∈ {0,1}ⁿ.
An implicant is called prime if it is minimal with respect to its degree: no literal can be removed from it.

Definitions (5)
A positive pattern is a term covering at least one positive example and no negative example.
A negative pattern is a term covering at least one negative example and no positive example.
Definitions (6)
Example

      a1  a2  a3
S+    1   1   0
      0   1   0
      1   0   1
S-    1   0   0
      0   0   1
      0   0   0
a1 a3 is a positive pattern: no negative example has a1 = 1 and a3 = 1, and one positive example (the 3rd line) is covered.
It is a positive prime pattern: a1 alone covers one negative example (the 4th line), and a3 alone covers one negative example (the 5th line).

Example
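The pattern and primality checks of this example can be sketched as follows, with terms as index-to-bit mappings (an illustrative layout):

```python
def covers(term, p):
    return all(p[i] == bit for i, bit in term.items())

def is_positive_pattern(term, pos, neg):
    """Covers at least one positive example and no negative example."""
    return (any(covers(term, p) for p in pos)
            and not any(covers(term, q) for q in neg))

def is_prime(term, pos, neg):
    """No literal can be removed while remaining a pattern."""
    return is_positive_pattern(term, pos, neg) and all(
        not is_positive_pattern({j: b for j, b in term.items() if j != i},
                                pos, neg)
        for i in term)

pos = [(1, 1, 0), (0, 1, 0), (1, 0, 1)]   # S+ : first three lines
neg = [(1, 0, 0), (0, 0, 1), (0, 0, 0)]   # S- : last three lines
print(is_prime({0: 1, 2: 1}, pos, neg))   # a1 a3 is prime -> True
```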
There is a symmetry between positive and negative patterns.
Two approaches: top-down and bottom-up.
Pattern generation
We associate each positive example with its characteristic term → it is a pattern.
We then take out literals one by one until a prime pattern remains.

Top-down
We begin with terms of degree one: if such a term covers no negative example, it is a pattern; if not, we add literals until a pattern is obtained.

Bottom-up
We prefer short patterns → simplicity principle.
We also want to cover as many examples as possible with a single model → globality principle.
Hence a hybrid bottom-up / top-down approach.

Objectives
Hybrid approach
We fix a degree D. We start with a bottom-up approach to generate the patterns of degree lower than or equal to D.
For all the points not covered by this first phase, we proceed with the top-down approach.
Extension from the binary case: binarization. Two types of data:
- quantitative: age, height, …
- qualitative: color, shape, …

Extension to the non-binary case
For each value v that a qualitative attribute x can take, we associate a Boolean variable b(x,v):
b(x,v) = 1 if x = v
b(x,v) = 0 otherwise

Qualitative data
There are two types of associated variables:
- level variables
- interval variables

Quantitative data
For each attribute x and each cut point t, we introduce a Boolean variable b(x,t):
b(x,t) = 1 if x ≥ t
b(x,t) = 0 if x < t

Level variables
For each attribute x and each pair of cut points t′ < t″, we introduce a Boolean variable b(x,t′,t″):
b(x,t′,t″) = 1 if t′ ≤ x < t″
b(x,t′,t″) = 0 otherwise

Interval variables
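A sketch of both constructions (the function names are mine):

```python
def level_vars(x, cuts):
    """One variable per cut point t: b(x,t) = 1 iff x >= t."""
    return [int(x >= t) for t in cuts]

def interval_vars(x, cuts):
    """One variable per pair t' < t'': b(x,t',t'') = 1 iff t' <= x < t''."""
    return [int(lo <= x < hi)
            for i, lo in enumerate(cuts) for hi in cuts[i + 1:]]

# x1 = 3 with the cut points 1.5, 2.5, 3.5 used in the example below
print(level_vars(3, [1.5, 2.5, 3.5]))     # [1, 1, 0]
print(interval_vars(3, [1.5, 2.5, 3.5]))  # [0, 1, 1]
```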
Example

      x1  x2     x3   x4
a     1   green  yes  31
b     4   blue   no   29
c     2   blue   yes  20
d     4   red    no   22
e     3   red    yes  20
f     2   green  no   14
g     4   green  no   7

The seven cases are partitioned into the two classes S+ and S-.
Example

Level variables for x1, with cut points 1.5, 2.5 and 3.5:
b1 = b(x1, 1.5), b2 = b(x1, 2.5), b3 = b(x1, 3.5)

      x1  b1  b2  b3
a     1   0   0   0
b     4   1   1   1
c     2   1   0   0
d     4   1   1   1
e     3   1   1   0
f     2   1   0   0
g     4   1   1   1
Example

Variables for the qualitative attribute x2:
b4 = b(x2, green), b5 = b(x2, blue), b6 = b(x2, red)

      x2     b4  b5  b6
a     green  1   0   0
b     blue   0   1   0
c     blue   0   1   0
d     red    0   0   1
e     red    0   0   1
f     green  1   0   0
g     green  1   0   0
Example

Variable for the qualitative attribute x3:
b7 = b(x3, yes)

      x3   b7
a     yes  1
b     no   0
c     yes  1
d     no   0
e     yes  1
f     no   0
g     no   0
ExampleExample
31
29
20
22
20
14
17
S
S
4x 9b8ba 1 1
b 1 1
c 1 0
d 1 1
e 1 0
f 0 0
g 0 0
2117
49
48
xbxb
Example

Interval variables for x1:
b10 = b(x1, 1.5, 2.5), b11 = b(x1, 1.5, 3.5), b12 = b(x1, 2.5, 3.5)

      x1  b10  b11  b12
a     1   0    0    0
b     4   0    0    0
c     2   1    1    0
d     4   0    0    0
e     3   0    1    1
f     2   1    1    0
g     4   0    0    0
ExampleExample
31
29
20
22
20
14
17
S
S
4x 13ba 0
b 0
c 1
d 0
e 1
f 0
g 0
2117 413 xb
Example

The complete binary archive:

      b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13
a     0   0   0   1   0   0   1   1   1   0    0    0    0
b     1   1   1   0   1   0   0   1   1   0    0    0    0
c     1   0   0   0   1   0   1   1   0   1    1    0    1
d     1   1   1   0   0   1   0   1   1   0    0    0    0
e     1   1   0   0   0   1   1   1   0   0    1    1    1
f     1   0   0   1   0   0   0   0   0   1    1    0    0
g     1   1   1   1   0   0   0   0   0   0    0    0    0
A set of binary attributes is called a supporting set if the archive obtained by eliminating all the other attributes remains contradiction-free.
A supporting set is irredundant if no proper subset of it is a supporting set.

Supporting set
To each binary attribute b_i we associate a variable y_i such that y_i = 1 if the attribute belongs to the supporting set.
Application: elements a and e differ on attributes 1, 2, 4, 6, 9, 11, 12 and 13:

y1 + y2 + y4 + y6 + y9 + y11 + y12 + y13 ≥ 1

Variables
We do the same for all pairs of positive and negative observations:

Σ_{i ∈ I(p,p′)} y_i ≥ 1   for all p ∈ S+, p′ ∈ S-

Among the many feasible solutions, we choose a smallest set:

min Σ_{i=1..q} y_i

Linear program
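Solving this 0-1 program exactly requires an integer-programming solver; a greedy set-cover heuristic gives the flavour of the computation. This is a sketch of a stand-in method, not the slides' exact procedure:

```python
from itertools import product

def constraints(pos, neg):
    """One constraint per (p, p') in S+ x S-: the indices where the
    two binary vectors differ; a supporting set must hit each set."""
    return [{i for i, (u, v) in enumerate(zip(p, q)) if u != v}
            for p, q in product(pos, neg)]

def greedy_supporting_set(pos, neg):
    """Pick attributes one by one, each time the one hitting the most
    still-unhit constraints (greedy stand-in for min sum y_i)."""
    todo, chosen = constraints(pos, neg), set()
    while any(not (c & chosen) for c in todo):
        open_cs = [c for c in todo if not (c & chosen)]
        chosen.add(max(range(len(pos[0])),
                       key=lambda i: sum(i in c for c in open_cs)))
    return chosen

pos = [(1, 1, 0), (0, 1, 0), (1, 0, 1)]
neg = [(1, 0, 0), (0, 0, 1), (0, 0, 0)]
print(greedy_supporting_set(pos, neg))  # {0, 1, 2}
```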
Solution of our example

Positive patterns:
x4 ≥ 21
x3 = yes and 1.5 ≤ x1 < 2.5

Negative patterns:
x3 = no and x4 < 21
x3 = no and 1.5 ≤ x1 < 2.5
x4 ≥ 21 and (x1 < 1.5 or 1.5 ≤ x1 < 2.5)
Contents
1. Rough Sets Theory
2. Logical Analysis of Data
3. Comparison
4. Inconsistencies
LAD is more flexible than RST: the linear program allows modification of its parameters.

Basic idea
RST: couples (attribute, value). LAD: binary variables. Is there a correspondence?

Comparison: blocks / variables
For an attribute a taking the values v1, v2, v3, …:

RST: blocks (a, v1), (a, v2), (a, v3)
LAD: variables b1 = b(a, v1), b2 = b(a, v2), b3 = b(a, v3)

Qualitative data
Discretization: convert numerical data into discrete data.
Principle: determination of cut points in order to divide the domain into successive intervals:

v_min < p1 < p2 < … < v_max

Quantitative data
RST: for each cut point, we have two blocks:

(a, v_min..p1) and (a, p1..v_max)
(a, v_min..p2) and (a, p2..v_max)

Quantitative data
LAD: for each cut point, we have a level variable:

b1 = b(a, p1), b2 = b(a, p2), b3 = b(a, p3), …

Quantitative data
LAD: for each pair of cut points, we have an interval variable:

b1;2 = b(a, p1, p2), b1;3 = b(a, p1, p3), b2;3 = b(a, p2, p3), …

Quantitative data
Correspondence for a level variable b_i = b(a, p_i):

b_i = 1 ↔ (a, p_i..v_max)
b_i = 0 ↔ (a, v_min..p_i)

Quantitative data
Correspondence for an interval variable b_{i;j} = b(a, p_i, p_j):

b_{i;j} = 1 ↔ (a, p_i..v_max) AND (a, v_min..p_j)
b_{i;j} = 0 ↔ (a, v_min..p_i) OR (a, p_j..v_max)

Quantitative data
Three parameters can change:
- the right-hand side of the constraints;
- the coefficients u_i of the objective function;
- the coefficients c_{ij} of the left-hand side of the constraints.

Variation of LP parameters
We try to adapt the three heuristics:
- the highest priority;
- the highest intersection with the concept;
- the smallest cardinality.

Heuristics adaptation
Priority on blocks → priority on attributes.
Introduced as weights in the objective function.
Minimization: the pairs with the highest priorities are chosen first.

The highest priority
Problem: in LAD there is no notion of concept; everything is done symmetrically, at the same time.

The highest intersection
Modification of the heuristic: take the difference between the intersection with one concept and the intersection with the other. The higher, the better.

The highest intersection
Goal of RST: find minimal complexes:
- find blocks covering the most examples of the concept → highest possible intersection with the concept;
- find blocks covering the fewest examples of the other concept → difference of intersections.

The highest intersection
For LAD: the difference between the number of times a variable takes the value 1 in S+ and in S-.
Introduced as weights in the constraints: we first choose the variable with the highest difference.

The highest intersection
Simple: the number of times a variable takes the value 1.
Introduced as a weight in the constraints.

The smallest cardinality
Two quantities are introduced: the highest difference and the smallest cardinality → the difference of the two quantities.

Weight of the constraints
Before: every coefficient is 1. Problem: modifying the weights of the left-hand side alone has no meaning.

Right-hand side of the constraints
Ideas: the average of the c_{ij} compared to the number of attributes, or the average of the c_{ij} in each constraint.
Drawback: no real meaning.

Ideas of modification
Alternative: do not touch the weights in the constraints and introduce everything in the coefficients of the objective function:

u_i = priority × (number of 1s in S+ − number of 1s in S-) / cardinality

Ideas of modification
Contents
1. Rough Sets Theory
2. Logical Analysis of Data
3. Comparison
4. Inconsistencies
Use of the two approximations, lower and upper.
Rule generation: certain and possible rules.

For RST
Classification mistakes: a positive point classified as negative, or the other way around.
Two different cases.

For LAD
All the other points are well classified: our point will not be covered.
If the number of non-covered points is high: generation of longer patterns.
If this number is small: erroneous classification, and we leave these points out in what follows.

Pos. point classified as neg.
Terms covering a lot of positive points may also cover some negative points.
Those negative points are probably wrongly classified: they are not taken into account in the evaluation of candidate terms.

Neg. point classified as pos.
We introduce a ratio: a term remains a candidate if the ratio between the negative and positive points it covers is smaller than |S-| / |S+|.

Ratio
An inconsistency can be considered as a classification mistake.
Inconsistency: two "identical" objects classified differently.
One of them is wrongly classified (approximations).

Inconsistencies and mistakes
Consider an inconsistency in LAD: two points p1 and p2, two classes C1 and C2. There are two possibilities:
- p1 is not covered by small-degree patterns;
- p2 is covered by the patterns of C1.

Equivalence?
We have only one inconsistency. The covered point is isolated; it is not taken into account.
The patterns of C1 will be generated without the inconsistency point → lower approximation.

1st case
A point covered by the other concept's patterns is wrongly classified.
It is taken into account neither for the candidate terms nor for the pattern generation of C2 → lower approximation.

2nd case
Not taken into account for C2, but not a problem for C1.
For C1: upper approximation.

2nd case
According to a ratio, LAD decides whether a point is well classified or not.
For an inconsistency, this amounts to considering:
- the upper approximation of one class;
- the lower approximation of the other.
With more than one inconsistency: we re-classify the points.

Equivalence?
Conclusion
Complete data: we can try to match LAD and RST.
Inconsistencies: the classification mistakes of LAD can correspond to the approximations.
Missing data: managed differently by the two methods.
Jerzy W. Grzymala-Busse, MLEM2 - Discretization During Rule Induction, Proceedings of IIPWM'2003, International Conference on Intelligent Information Processing and Web Mining Systems, Zakopane, Poland, June 2-5, 2003, 499-508. Springer-Verlag.
Jerzy W. Grzymala-Busse, Jerzy Stefanowski, Three Discretization Methods for Rule Induction, International Journal of Intelligent Systems, 2001.
Endre Boros, Peter L. Hammer, Toshihide Ibaraki, Alexander Kogan, Eddy Mayoraz, Ilya Muchnik, An Implementation of Logical Analysis of Data, RUTCOR Research Report 22-96, 1996.

Sources (1)
Endre Boros, Peter L. Hammer, Toshihide Ibaraki, Alexander Kogan, Logical Analysis of Numerical Data, RUTCOR Research Report 04-97, 1997.
Jerzy W. Grzymala-Busse, Rough Set Strategies to Data with Missing Attribute Values, Proceedings of the Workshop on Foundations and New Directions in Data Mining, Melbourne, FL, USA, 2003.
Jerzy W. Grzymala-Busse, Sachin Siddhaye, Rough Set Approaches to Rule Induction from Incomplete Data, Proceedings of IPMU'2004, the 10th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Perugia, Italy, July 4, 2004, vol. 2, 923-930.

Sources (2)
Jerzy Stefanowski, Daniel Vanderpooten, Induction of Decision Rules in Classification and Discovery-Oriented Perspectives, International Journal of Intelligent Systems, 16 (1), 2001, 13-28.
Jerzy Stefanowski, The Rough Set based Rule Induction Technique for Classification Problems, Proceedings of the 6th European Conference on Intelligent Techniques and Soft Computing EUFIT'98, Aachen, 7-10 Sept. 1998, 109-113.
Roman Slowinski, Jerzy Stefanowski, Salvatore Greco, Benedetto Matarazzo, Rough Sets Processing of Inconsistent Information in Decision Analysis, Control and Cybernetics, 29, 2000, 379-404.

Sources (3)