
Rule induction

Dr Beatriz de la Iglesia, email: [email protected]


Outline

• What are rules?
• Rule Evaluation
• Classification rules
• Association rules


Rule induction (RI)

• As their name suggests, RI algorithms generate rules of some type.

• Rules take the form: IF antecedent THEN consequent.

• For example, in a medical insurance DB: IF (status = unemployed) THEN (diabetes = YES)

• Rules do not imply causality! Otherwise being unemployed would be the cause of diabetes! They simply show an association of values in the real world.


RI - Classification

• In the context of classification, rules are of the form:

IF (set of conditions) THEN (class),

e.g. IF (age < 25 AND car_group > 15) THEN (Risk = High)

• In the general form IF X THEN Y, X is referred to as the antecedent of the rule, and Y is the consequent, describing in this case a classification outcome.

• Association rules have a conjunctive consequent, i.e. one made up of many clauses joined by AND operators:
• IF (set of items) THEN (set of items)
• IF (bread AND cheese) THEN (wine AND crackers)


Definition – classification rule

• A general definition of the antecedent and consequent is:
• The left-hand side of the rule is a description of a subset of the population.
• The right-hand side of the rule is a description of interesting behaviour particular to the population on the left-hand side.

• An example of a (complex) rule is:

IF (weight/height² > 30 AND Smoker) THEN (heart disease = true)

Here "weight/height² > 30 AND Smoker" is the antecedent and "heart disease = true" is the consequent.


Rule Evaluation - general

• Support, confidence and coverage.
• The support for the antecedent, sup(ant), is the number of records in the database for which the antecedent holds.

• The support for the consequent, sup(con), is the number of records in the database for which the consequent holds.

• The support for the rule, sup(rule), is the number of records in the database for which the rule holds (antecedent and consequent both hold).

• From those, other measures can be derived:

confidence = sup(rule) / sup(ant)

coverage = sup(rule) / sup(con)


Rule Evaluation

IF BMI > 30 AND smoker = true THEN heart disease = true

BMI  smoker  age  heart disease
35   true    55   true
40   true    48   true
25   true    57   false
35   false   72   true
31   false   45   false
42   true    65   false
37   true    43   false

sup(ant) = 4
sup(con) = 3
sup(rule) = 2

Confidence = 2/4 = 50%
Coverage = 2/3 = 67%

Default confidence = 3/7 = 43%
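The same numbers can be checked mechanically. Below is a minimal Python sketch; the record encoding and helper names are my own, not part of the slides:

```python
# A sketch of rule evaluation on the table above:
# IF BMI > 30 AND smoker = true THEN heart disease = true
records = [
    {"BMI": 35, "smoker": True,  "age": 55, "heart_disease": True},
    {"BMI": 40, "smoker": True,  "age": 48, "heart_disease": True},
    {"BMI": 25, "smoker": True,  "age": 57, "heart_disease": False},
    {"BMI": 35, "smoker": False, "age": 72, "heart_disease": True},
    {"BMI": 31, "smoker": False, "age": 45, "heart_disease": False},
    {"BMI": 42, "smoker": True,  "age": 65, "heart_disease": False},
    {"BMI": 37, "smoker": True,  "age": 43, "heart_disease": False},
]

antecedent = lambda r: r["BMI"] > 30 and r["smoker"]
consequent = lambda r: r["heart_disease"]

sup_ant  = sum(1 for r in records if antecedent(r))                    # 4
sup_con  = sum(1 for r in records if consequent(r))                    # 3
sup_rule = sum(1 for r in records if antecedent(r) and consequent(r))  # 2

print(f"confidence = {sup_rule / sup_ant:.0%}")              # 50%
print(f"coverage   = {sup_rule / sup_con:.0%}")              # 67%
print(f"default confidence = {sup_con / len(records):.0%}")  # 43%
```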


Rule Evaluation

• Confidence can be taken as an indication of a rule's predictive ability on similar data, i.e. accuracy.

• If a rule has very low coverage it may be too specialised to have any useful predictive ability on similar data.

• The rule

IF age > 70 THEN heart disease = true

has a confidence of 100% (in the previous data set) but includes only a single case.

• The coverage is only 14.3%.
• Not a good rule; it will not generalise.


RI algorithms

• RI algorithms work by forming an initial set of rules (possibly just one rule) based on some starting criterion (possibly random).

• These rules are then applied to the train set of cases and their performance measured.

• Next, these rules are refined by generalisation, specialisation and adaptation to form new rules that better classify the train set cases.

• RI systems tend to differ in the way they adapt the rules and in their stopping criteria; a sketch of the basic loop follows.
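A minimal sketch of such a generate-and-refine loop, under assumptions of my own: a rule is a predicate over records, `refine` proposes candidate variants, and `quality` scores a rule on the train set. It is an illustration, not any specific RI system:

```python
# Hypothetical outline of a generic rule-induction loop.
def induce_rule(train, initial_rule, refine, quality, max_iters=100):
    """Greedy hill-climbing over rule refinements, scored on the train set."""
    best = initial_rule
    best_q = quality(best, train)
    for _ in range(max_iters):
        improved = False
        # Candidate rules produced by generalisation, specialisation, adaptation.
        for cand in refine(best):
            q = quality(cand, train)
            if q > best_q:          # keep a refinement only if it classifies better
                best, best_q, improved = cand, q, True
        if not improved:            # stopping criterion: no refinement helps
            break
    return best
```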


Generalisation

• As an example of generalisation, consider the condition part of the rule:

IF (country = France)

• This could be generalised to:

IF (country = France) OR (gender = M)

• It is generalised in the sense that it is likely to pick up more cases than the simple condition.

• This is also referred to as adding a disjunct (an OR operator).


Specialisation

• Similarly, a rule can be specialised by adding a conjunct, an AND operator:

IF (country = France)

• may be adapted to

IF (country = France) AND (age < 25)

• This is specialising in the sense that it looks for a subset of the cases that already satisfy the first condition.


Adaptation

• Finally, a rule may be adapted by changing one of the components of the rule:

IF (country = France) AND (age < 25)

• France replaced by England
• = replaced by ≠
• AND replaced by OR
• < replaced by >
• etc.
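A sketch of these three operators on a simple, hypothetical rule representation (the antecedent as a list of OR-ed clauses, each clause a list of AND-ed conditions); the representation is an assumption for illustration:

```python
# Hypothetical representation: a rule antecedent is a list of clauses (OR-ed);
# each clause is a list of (attribute, operator, value) conditions (AND-ed).
rule = [[("country", "==", "France")]]             # IF (country = France)

def generalise(rule, clause):
    """Add a disjunct: the rule will match more cases."""
    return rule + [clause]

def specialise(rule, clause_idx, condition):
    """Add a conjunct to one clause: it will match fewer cases."""
    new = [list(c) for c in rule]
    new[clause_idx].append(condition)
    return new

def adapt(rule, clause_idx, cond_idx, condition):
    """Replace one condition, e.g. swap a value or flip an operator."""
    new = [list(c) for c in rule]
    new[clause_idx][cond_idx] = condition
    return new

r1 = generalise(rule, [("gender", "==", "M")])      # ... OR (gender = M)
r2 = specialise(rule, 0, ("age", "<", 25))          # ... AND (age < 25)
r3 = adapt(r2, 0, 0, ("country", "==", "England"))  # France replaced by England
```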


Rule induction

• The most commonly used rule induction algorithms are:
• CN2
• Heuristics

• but others are used, such as 1R, which extracts rules based on a single attribute in the antecedent.


CN2

• The original CN2 algorithm (Clark and Niblett, 1989) induces rules from examples using entropy as its search measure.

• The problem with this measure was that it aimed for rules with very high accuracy, irrespective of applicability.

• In a customer database with 1000 records, a rule such as "if (customer = id 326547) then (customer will churn)" will have 100% accuracy but very low applicability and is not at all interesting. You know this already!!


Heuristics

• Heuristic techniques make an initial guess at a rule:

IF (age > 36) THEN (Buy = Yes)

• then change it in an iterative manner through generalisation, specialisation and adaptation,

• until the accuracy and applicability measures cannot be improved further.

• We have developed rule induction algorithms using simulated annealing and multi-objective optimisation to search for effective rules.


Discussion time

• Using Quinlan's Golf dataset again, propose 3 different rules that could be examined by a rule induction algorithm.

• For each rule calculate applicability, accuracy and coverage.
• Can you rank those rules in terms of interest? (One candidate rule is worked through in the sketch below.)

outlook   temp  humidity  windy  class
sunny     75    70        TRUE   play
sunny     80    90        TRUE   dontplay
sunny     85    85        FALSE  dontplay
sunny     72    95        FALSE  dontplay
sunny     69    70        FALSE  play
overcast  72    90        TRUE   play
overcast  83    78        FALSE  play
overcast  64    65        TRUE   play
overcast  81    75        FALSE  play
rain      71    80        TRUE   dontplay
rain      65    70        TRUE   dontplay
rain      75    80        FALSE  play
rain      68    80        FALSE  play
rain      70    96        FALSE  play
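As a starting point, here is a sketch evaluating one hypothetical candidate, IF (outlook = overcast) THEN (class = play); proposing the other rules and ranking them is left to the discussion:

```python
# Hypothetical worked candidate for the discussion (not a full answer):
# the rule IF outlook = overcast THEN class = play on Quinlan's Golf data.
golf = [
    ("sunny", 75, 70, True, "play"), ("sunny", 80, 90, True, "dontplay"),
    ("sunny", 85, 85, False, "dontplay"), ("sunny", 72, 95, False, "dontplay"),
    ("sunny", 69, 70, False, "play"), ("overcast", 72, 90, True, "play"),
    ("overcast", 83, 78, False, "play"), ("overcast", 64, 65, True, "play"),
    ("overcast", 81, 75, False, "play"), ("rain", 71, 80, True, "dontplay"),
    ("rain", 65, 70, True, "dontplay"), ("rain", 75, 80, False, "play"),
    ("rain", 68, 80, False, "play"), ("rain", 70, 96, False, "play"),
]

sup_ant  = sum(1 for outlook, *_ in golf if outlook == "overcast")   # 4
sup_con  = sum(1 for *_, cls in golf if cls == "play")               # 9
sup_rule = sum(1 for outlook, *rest in golf
               if outlook == "overcast" and rest[-1] == "play")      # 4

print("confidence =", sup_rule / sup_ant)  # 4/4 = 100%
print("coverage   =", sup_rule / sup_con)  # 4/9 ≈ 44%
```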


Association rules

• The association rules problem was introduced by Agrawal et al. in 1993.

• These types of rules express associations that exist in transaction data (sometimes called market basket data).

• Transaction data comprises a set of transactions; each transaction comprises a set of items.


Transaction data

• Each transaction can be thought of as a list of items purchased by a customer; transactions can vary in length.

• Transactions contain categorical data.
• We want to determine which items are bought together, e.g. for product positioning.

• This has wider applicability than purchased items,
• e.g. comorbidity, component failure, etc.


Example of Transaction Data

Transaction  Itemsets
1  {bread, cheese, eggs, jam}
2  {bread, butter, eggs}
3  {bread, cheese, tomatoes, milk}
4  {bread, cheese, eggs}
5  {cheese, eggs, milk}
6  {bread, butter, milk}
7  {eggs, milk, salt}

• The antecedent is a set of items from the database; the consequent is a (single) item that is not in the antecedent.

An example of a rule is:

{bread, cheese} → {eggs}

(antecedent: {bread, cheese}; consequent: {eggs})


Example of Transaction Data

Transaction  Itemsets
1  {bread, cheese, eggs, jam}
2  {bread, butter, eggs}
3  {bread, cheese, tomatoes, milk}
4  {bread, cheese, eggs}
5  {cheese, eggs, milk}
6  {bread, butter, milk}
7  {eggs, milk, salt}

{bread, cheese} → {eggs}

Evaluation:

sup(ant) = 3
sup(con) = 5
sup(rule) = 2

Confidence = 2/3 (67%)
Coverage = 2/5 (40%)
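The same evaluation can be done directly with set containment; a minimal sketch (encoding the transactions as Python sets is my own choice):

```python
# Hypothetical sketch: evaluating {bread, cheese} -> {eggs} on the transactions above.
transactions = [
    {"bread", "cheese", "eggs", "jam"},
    {"bread", "butter", "eggs"},
    {"bread", "cheese", "tomatoes", "milk"},
    {"bread", "cheese", "eggs"},
    {"cheese", "eggs", "milk"},
    {"bread", "butter", "milk"},
    {"eggs", "milk", "salt"},
]

ant, con = {"bread", "cheese"}, {"eggs"}

sup_ant  = sum(1 for t in transactions if ant <= t)          # 3
sup_con  = sum(1 for t in transactions if con <= t)          # 5
sup_rule = sum(1 for t in transactions if (ant | con) <= t)  # 2

print("confidence =", sup_rule / sup_ant)  # 2/3 ≈ 0.67
print("coverage   =", sup_rule / sup_con)  # 2/5 = 0.40
```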


Example of Transaction Data


• In this small example with 8 items there are 2^8 = 256 possible rules.

• With 30 items, 2^30 ≈ 1 billion rules.

• It is a challenge to develop methods that find interesting rules in a reasonable time.

• We may need:
• feature subset selection
• discretisation for continuous features


Apriori


• Apriori is an algorithm for discovering association rules in transaction data.

• It is widely documented.
• Apriori uses a minimum support, minSup, as a constraint.
• An itemset is a set of items, e.g. {bread, butter, eggs}.
• A frequent itemset is any itemset that has support ≥ minSup,
• i.e. the items appear together in at least minSup records.


Apriori ~ cont.


• The problem is decomposed into two parts:

1 - Find all frequent itemsets and determine their support.

2 - Use these frequent itemsets to find all association rules.

The first part is the time-consuming bit.


Apriori ~ cont.


• Multiple passes are made over the data.
• At pass k, itemsets of degree k are evaluated.
• In the first pass the database is scanned to determine the support of all single itemsets.

• In subsequent passes, candidate itemsets are generated from the frequent itemsets found in the previous pass, and the database is scanned to determine their support.

• This process continues until a pass produces no frequent itemsets.
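A sketch of this pass structure (stage 1 of the decomposition). It assumes a `generate_candidates` helper for the join-and-prune step, sketched after the next two slides; together the pieces run end to end:

```python
# Hypothetical top-level Apriori loop; itemsets are kept as sorted tuples.
def count_support(candidates, transactions):
    """One database scan: count the transactions containing each candidate."""
    return {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}

def apriori(transactions, min_sup):
    items = sorted({i for t in transactions for i in t})
    # Pass 1: scan the database for the support of all single itemsets.
    counts = count_support([(i,) for i in items], transactions)
    frequent = {c: s for c, s in counts.items() if s >= min_sup}
    all_frequent = dict(frequent)
    while frequent:
        # Pass k: degree-k candidates from the frequent (k-1)-itemsets.
        cands = generate_candidates(list(frequent))  # join + prune, sketched below
        counts = count_support(cands, transactions)  # scan the database again
        frequent = {c: s for c, s in counts.items() if s >= min_sup}
        all_frequent.update(frequent)                # halts when a pass finds nothing
    return all_frequent
```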


Apriori ~ cont.


• Let Lk-1 be the set of frequent itemsets obtained from the previous pass.

• Let Ck be the set of candidate itemsets to be used in this pass.

• Candidate generation is a two-step procedure:
• Join Lk-1 with itself to produce Ck.
• Prune Ck.
• Evaluate Ck to remove non-frequent itemsets and create Lk.


Apriori ~ cont.


• Find all frequent itemsets of length 1 - a single pass through the data.
• JOIN:
• Every pair of itemsets in Lk-1 is considered. If the itemsets are identical except for the last item, a new itemset is created by taking one of the itemsets and adding the last item from the other itemset to it.

Frequent itemsets at each pass:

Pass 1  Pass 2  Pass 3     Pass 4
{a}     {a, b}  {a, b, c}  {a, b, c, d}
{b}     {a, c}  {a, b, d}
{c}     {a, d}  {a, c, d}
{d}     {b, c}  {b, c, d}
        {b, d}
        {c, d}

This gives all combinations: 2^4 = 16. With, say, 50 items there are ~10^15 possible combinations. Clearly not feasible.
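Instead of enumerating all combinations, the join step only merges compatible frequent itemsets; a sketch (itemsets as sorted tuples, continuing the Apriori sketch above):

```python
# Hypothetical join step: merge pairs of frequent (k-1)-itemsets that are
# identical except for the last item, e.g. (a, b) + (a, c) -> (a, b, c).
def join(prev_frequent):
    prev = sorted(prev_frequent)
    return [a + (b[-1],)
            for a in prev for b in prev
            if a[:-1] == b[:-1] and a[-1] < b[-1]]
```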


Apriori ~ Prune.


• The pruning stage removes from Ck any itemsets that cannot be frequent.

• If an itemset is frequent then all of its subsets must also be frequent. In pass k, itemsets of length k are considered, so all of their subsets have been found in previous passes.

• For each new itemset that is generated, Apriori checks that all of its subsets of size k-1 exist in Lk-1. If they do not exist then they are not frequent, and therefore the new itemset cannot be frequent, so it is pruned from Ck.

• The name of the algorithm is taken from the fact that it is known a priori that such an itemset cannot be frequent.
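A sketch of the prune step, completing the `generate_candidates` helper used by the Apriori loop above (it relies on the `join` sketch from the previous slide):

```python
# Hypothetical prune step: a k-itemset can only be frequent if every one of
# its (k-1)-subsets is frequent, so drop candidates with any missing subset.
from itertools import combinations

def prune(candidates, prev_frequent):
    prev = set(prev_frequent)
    return [c for c in candidates
            if all(sub in prev for sub in combinations(c, len(c) - 1))]

def generate_candidates(prev_frequent):
    """Join then prune, as in the two-step procedure described above."""
    return prune(join(prev_frequent), prev_frequent)
```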


Apriori ~ cont.


Pass 1  Pass 2  Pass 3
{a}     {a, b}  {a, b, c}
{b}     {a, c}  {a, b, d}
{c}     {a, d}  {a, c, d}
{d}     {b, c}  {b, c, d}
        {b, d}
        {c, d}

1. {b, c} is not frequent and is pruned.
2. Therefore, {a, b, c} is not frequent.

Evaluate:
• Scan the remaining itemsets.
• If they are not frequent, remove them.
• Support can only go down by adding more items.


Apriori ~ Example


minSup = 2

Database:

Transaction  Item Sets
1  {bread, cheese, eggs, jam}
2  {bread, eggs, butter}
3  {bread, cheese, tomatoes, milk}
4  {bread, cheese, eggs}
5  {cheese, eggs, milk}
6  {bread, butter, milk}
7  {eggs, milk, salt}

The initial candidate set, C1, is the set of single itemsets.
Evaluate - the database is scanned to count the support of the itemsets in C1:

C1:
Single Itemsets  Support
{bread}     5
{cheese}    4
{eggs}      5
{jam}       1
{butter}    2
{tomatoes}  1
{milk}      4
{salt}      1

Jam, tomatoes and salt are not frequent and so are discarded.


Apriori ~ Example


The remaining itemsets are included in L1 (the full set of frequent itemsets of degree 1):

L1:
Single Itemsets  Support
{bread}   5
{cheese}  4
{eggs}    5
{butter}  2
{milk}    4

Join - L1 is joined with itself to generate the candidates for the next pass, C2.

Prune - no need to prune C2, because all of the subsets (single items) must exist in L1.

Evaluate - the database is scanned to count the support of the itemsets in C2:

C2:
Itemsets  Support
{bread, cheese}   3
{bread, eggs}     3
{bread, butter}   2
{bread, milk}     2
{cheese, eggs}    3
{cheese, butter}  0
{cheese, milk}    2
{eggs, butter}    1
{eggs, milk}      2
{butter, milk}    1


Apriori ~ Example

L2:
Itemsets  Support
{bread, cheese}  3
{bread, eggs}    3
{bread, butter}  2
{bread, milk}    2
{cheese, eggs}   3
{cheese, milk}   2
{eggs, milk}     2

Join L2 with itself to give C3:

{bread, cheese, eggs}
{bread, cheese, butter}
{bread, cheese, milk}
{bread, eggs, butter}
{bread, eggs, milk}
{bread, butter, milk}
{cheese, eggs, milk}

Prune C3 - check that every 2-subset appears in L2:

Itemset                  Subsets
{bread, cheese, eggs}    {bread, cheese} {bread, eggs} {cheese, eggs}
{bread, cheese, butter}  {bread, cheese} {bread, butter} {cheese, butter}
{bread, cheese, milk}    {bread, cheese} {bread, milk} {cheese, milk}
{bread, eggs, butter}    {bread, eggs} {bread, butter} {eggs, butter}
{bread, eggs, milk}      {bread, eggs} {bread, milk} {eggs, milk}
{bread, butter, milk}    {bread, butter} {bread, milk} {butter, milk}
{cheese, eggs, milk}     {cheese, eggs} {cheese, milk} {eggs, milk}

Evaluate - the database is scanned to count the support of the surviving candidates:

Itemsets  Support
{bread, cheese, eggs}  2
{bread, cheese, milk}  1
{bread, eggs, milk}    0
{cheese, eggs, milk}   1


Apriori ~ Example


Only one itemset remains, so the algorithm halts, having found all frequent itemsets.

L3:
Itemsets  Support
{bread, cheese, eggs}  2

• In this example there were 8 items, giving 2^8 = 256 possible itemsets.
• Apriori found all frequent itemsets whilst only evaluating 22 itemsets,
• 8 of which were single items.
• The database was scanned three times.
• Of these 22, 13 were found to be frequent.

Full Set of Frequent Itemsets:

Itemsets  Support
{bread}   5
{cheese}  4
{eggs}    5
{butter}  2
{milk}    4
{bread, cheese}  3
{bread, eggs}    3
{bread, butter}  2
{bread, milk}    2
{cheese, eggs}   3
{cheese, milk}   2
{eggs, milk}     2
{bread, cheese, eggs}  2


Apriori ~ Generate Rules


• Given a frequent itemset, A, rules can be constructed of the form

{A - i} → i

• where i is any item in A.
• The support of the rule is sup(A), i.e. the support for the itemset A.

• The antecedent of the rule is (A - i). This must be a frequent itemset because it is a subset of A; therefore its support is available from stage 1.

• Similarly, the consequent of the rule is i. This must also be frequent because it is a subset of A; therefore its support is available from stage 1.

• Therefore, we have the support for the antecedent, the consequent and the rule, and so confidence and coverage can be calculated.
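A sketch of this construction (single-item consequents, as above), where `frequent` is the stage-1 dictionary mapping itemsets (tuples) to supports, so no further database scans are needed:

```python
# Hypothetical stage 2: from each frequent itemset A, build rules {A - i} -> i
# and score them using only the supports recorded in stage 1.
def generate_rules(frequent):
    rules = []
    for A, sup_rule in frequent.items():
        if len(A) < 2:                      # need a non-empty antecedent
            continue
        for i in A:
            ant = tuple(x for x in A if x != i)
            sup_ant = frequent[ant]         # subsets of A are frequent: looked up
            sup_con = frequent[(i,)]
            rules.append((ant, i,
                          sup_rule / sup_ant,   # confidence
                          sup_rule / sup_con))  # coverage
    return rules
```

For the itemset ('bread', 'cheese', 'eggs') with the supports listed earlier, this yields, among others, the rule bread, cheese → eggs with confidence 2/3 and coverage 2/5, matching the worked example below.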


Apriori ~ Example


• The result of stage 1 of the example was:

Frequent Itemsets  Support
{bread}   5
{cheese}  4
{eggs}    5
{butter}  2
{milk}    4
{bread, cheese}  3
{bread, eggs}    3
{bread, butter}  2
{bread, milk}    2
{cheese, eggs}   3
{cheese, milk}   2
{eggs, milk}     2
{bread, cheese, eggs}  2

For example:

itemset - {bread, cheese, eggs}

rule - bread, cheese → eggs

sup(ant) = sup(bread, cheese) = 3 (available from the list)
sup(con) = sup(eggs) = 5 (available from the list)
sup(rule) = sup(bread, cheese, eggs) = 2 (available from the list)

confidence = 2/3 (67%)
coverage = 2/5 (40%)


Complete Set of Rules


antecedent     sup(ant)  consequent  sup(con)  sup(rule)  conf  cov
bread          5         cheese      4         3          3/5   3/4
bread          5         eggs        5         3          3/5   3/5
bread          5         butter      2         2          2/5   2/2
bread          5         milk        4         2          2/5   2/4
cheese         4         eggs        5         3          3/4   3/5
cheese         4         milk        4         2          2/4   2/4
eggs           5         milk        4         2          2/5   2/4
bread, cheese  3         eggs        5         2          2/3   2/5
bread, eggs    3         cheese      4         2          2/3   2/4
cheese, eggs   3         bread       5         2          2/3   2/5


Apriori Summary


• Apriori implements efficient algorithms for sub-itemset searching and database scanning.

• Apriori works well on transaction data when there are a large number of transactions and relatively few items in each transaction.

• Many databases do not have these properties, and on them the use of Apriori can rapidly become intractable.

• SPSS Modeler 14 includes a General Rule Induction (GRI) node that extends the principles of Apriori to numeric data.

• If a large number of items with high support are created, the problem often becomes intractable.


Learning Outcomes

• What are rules?
• What is the format of classification rules?
• How are rules evaluated?
• How do rule induction algorithms operate?
• What are association rules?
• How does Apriori work?
