internal presentation by : lei wang pervasive and artificial intelligenge research group on: an...

22
Internal Presentation by: Lei Wang Pervasive and Artificial Intelligenge research group http://diuf.unifr.ch/pai On: An Artificial Immune System for E-mail Classification Andy Secker, Alex Freitas, Jon Timmis Computing Laboratory, University of Kent Canterbury, Kent, UK http://www.cs.kent.ac.uk/~ads3 19/02/2004

Post on 22-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Internal Presentation by: Lei Wang

Pervasive and Artificial Intelligenge research grouphttp://diuf.unifr.ch/pai

On: An Artificial Immune System for E-mail Classification

Andy Secker, Alex Freitas, Jon Timmis

Computing Laboratory, University of KentCanterbury, Kent, UK

http://www.cs.kent.ac.uk/~ads3

19/02/2004

An Artificial Immune System for E-mail Classification

Andy Secker, Alex Freitas, Jon Timmis

Computing Laboratory, University of KentCanterbury, Kent, UK

http://www.cs.kent.ac.uk/~ads3

19/02/2004

19/02/2004

Significance

With the increase in information on the Internet, the strive to find more effective tools for distinguishing between interesting and non-interesting material is increasing.

This paper provides an immune-inspired algorithm called AISEC that is capable of continuously classifying electronic mail as interesting and non-interesting without the need for re-training.

Comparing with a naïve Bayesian classifier, the system proposed in this paper performs as well as the naïve Bayesian system and has a great potential for augmentation.

19/02/2004

AISEC, immunity-inspired system

Immune system Human body constantly under attack. Immune system must adapt and

respond The (natural) immune system is:

1. Dynamic

2. Adaptive

3. Robust

4. Etc.

Artificial Immune Systems (AIS) use principles and process from observed and theoretical immunology to solve problems

19/02/2004

Artificial Immune Systems

Engineering framework Representation of individual immune cells Affinity measures

• Evaluate interaction of individuals with environment and/or each other

Algorithms

• Procedures of adaptation manipulate populations of immune cells

AIS as a classifier AIRS

• A successful supervised AIS algorithm for classification

19/02/2004

AIS for Web Mining

Web mining, an umbrella term used to describe three quite different types of data mining: Content mining

• A process of extracting useful information from the text, images and other forms of content that make up the pages

• The mining of textual data is a common task, often for the purposes of information retrieval

Usage mining Structure mining

AISEC research goal To develop a highly adaptive system capable of retrieving interesting

information from the internet based on user’s current interests The authors believe AIS may offer a number of advantages

19/02/2004

What is AISEC ?

AISEC isn’t a spam filter It has no methods to penalize false positives (loss of important

e-mail) Without a very low false positive rate, a spam filter would not be

trusted

19/02/2004

What is AISEC ?

AISEC is A first step towards an AIS for web mining.

• A study of performance and characteristics of an AIS applied to text mining in a dynamic domain

A text classification algorithm capable of continuous adaptation, which may yield a classification accuracy comparable to a Bayesian approach.

• User behaviour and interaction with e-mail can be similar to web pages• Supervised classification algorithm

E-mail classified as interesting and uninteresting Uses constant(ish) feedback from user Capable of continuous adaptation

• This tracks concept drift and can also handle concept shift• A specialised AIS algorithm based in part on the immune principle of

clonal selection No previously documented algorithm was suited for use in this

situation without extensive changes

19/02/2004

Representation

Each cell contains 3 sets of words (+ state) Punctuation is removed from fields Research literature has suggested header information is

enough to accurately classify e-mail*

A = [<free,DVD> , <sales,com> , < canterbury,UK>]

Subject field

Title of the E-mail

Sender field

Sender’s name

Return field

(Sender’s address)

* Diao, Lu & Wu (2000). A Comparative Study of Classification Based Personal E-mail Filtering, PAKDD 2000

19/02/2004

Affinity

Affinity value is proportion of words in one cell found in another More features would require a less naïve distance measure Cosine distance is an obvious choice Resultant value always between 0 and 1

A = [<free,DVDs> , <offers,DVD,com> , <offers,DVD,com>]

B = [<half,price,sale>,<sales,DVD,com>,<sales,DVD,com>]

affinity(A,B) = 4/9

PROCEDURE affinity (bc1, bc2)IF(bc1 has a shorter feature vector than bc2)

bshort ← bc1, blong ← bc2ELSE

bshort ← bc2, blong ← bc1count ← the number of words in bshort present in

blongbs_len ← the length of bshort’s feature vector

RETURN count/bs_len

19/02/2004

Clone-Mutation

One mutation takes a word previously used in subject or address and replaces single location Subject, sender and return

address libraries are kept separately

Usually >1 mutation per cell takes place

Subjectlib = free,DVDSenderLib = sales,DVD,com ReturnLib = sales,DVD,com

A = [<free,DVD> , <sales,DVD,com> , <sales,DVD,com>]A = [<free,free> , <sales,DVD,com> , <sales,DVD,com>]

PROCEDURE clone_mutate(bc1,bc2) aff ← affinity(bc1,bc2) clones ← ∅ num_clones ← | aff * Kl | num_mutate ← | (1-aff) * bc’s feature vector length * Km | DO(num_clones)TIMES bcx ← a copy of bc1 DO(num_mutate)TIMES p ← a random point in bcx’s feature vector w ← a random word from the appropriate gene library replace word in bcx’s feature vector at location p with w bcx’s stimulation level ← Ksb clones ← clones ∪ {bcx}RETURN clones

19/02/2004

The algorithm - classification

1. System is initialised with known uninteresting e-mail

2. E-mail presented for classification. Classified as uninteresting as it stimulates close cells

Memory cells

Naive cells

19/02/2004

The algorithm – correct classification

3. Highly stimulated cell reproduces 7 times. Less stimulated cell produces only 2 clones but with higher mutation rate

4. Cell with highest affinity is known to be useful therefore rewarded by becoming memory cell.

Classification Region

Stimulation Region

19/02/2004

The algorithm cont…

Cell removal

• Aged naïve cells deleted. Memory cells placed in already covered areas also deleted.

Incorrect classification

5. Any cell responsible for incorrect classification is removed (memory or otherwise)

19/02/2004

Results – Classification accuracy

2268 e-mails (742 uninteresting) received over 6 months E-mails presented in the order of date received Feedback given after EVERY classification AISEC run 10 times, results show mean C5.0, neural network and C&R tree all run in “Clementine” data

mining package Bayesian algorithm used feedback to update like AISEC

C5.0 83.9%

Naïve Bayesian 85.0%

Neural Network 85.6%

AISEC 86.0% 1.29

C&R tree 87.7%

Naïve Bayesian 88.05%

AISEC 89.09% 0.97

Traditional Learning Continuous Learning

19/02/2004

Results – variation of population size

0

50

100

150

200

250

300

350

400

450

100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200

Generation Number

Nu

mb

er

of

B-c

ell

s

Naïve B-cellsMemory B-Cells

19/02/2004

User point of view

AISEC runs as a proxy on local machine Advantages

No need to switch e-mail client Can collect mail from multiple locations

AISEC’s user interface would require minimal interaction

19/02/2004

User point of view

Local machine

Server(s)Collect mailAISEC

Classifier

Collect mail

Store

Interesting

Uninteresting

User interaction

Positive user response

Negative user response

Return mail

Client

19/02/2004

Results cont…

Standard measures of quality Precision is the proportion of positive documents retrieved

compared with the total number of positive documents Recall is the proportion of positive documents actually classified as

positive

Precision Recall

Naïve Bayesian 93.93% 67.76%

AISEC 82.20% 2.63 81.13% 4.71

19/02/2004

Results – variation of time between user feedback

86.0%

86.5%

87.0%

87.5%

88.0%

88.5%

89.0%

89.5%

90.0%

90.5%

0 between 13 between 38 between 88 between 176 between 300 between 600 between

E-mails classified between feedback

Cla

ssif

icat

ion

Acc

ura

cy

AISEC

Bayeseian

19/02/2004

Conclusion

AISEC has produced promising results and appears robust Interesting note: Typical accuracy similar to published results from

other AIS for text classification (both traditional and continuous learning)

Use a larger training set and optimise (the many) parameters

• Detect when there are the optimum number of cells

AISEC has been useful providing some evidence AIS applied to this domain would be possible

Research on adaptive systems for retrieval of interesting information, not necessarily purely accurate information

19/02/2004

Questions & Discussions

An Artificial Immune System for E-mail Classification

Andy Secker, Alex Freitas, Jon Timmis

Computing Laboratory, University of Kent

http://www.cs.kent.ac.uk/~ads3c

More information: