Issues in Data Mining Infrastructure

Authors: Nemanja Jovanovic, nemko@acm.org; Valentina Milenkovic, tina@eunet.yu; Voislav Galic, vgalic@bitsyu.net; Dusan Zecevic, [email protected]; Sonja Tica, ticaz@eunet.yu; Prof. Dr. Dusan Tosic, [email protected]; Prof. Dr. Veljko Milutinovic, vm@etf.bg.ac.yu


Page 1: Issues in Data Mining Infrastructure

Issues in Data Mining Infrastructure

Page 2: Issues in Data Mining Infrastructure

Data Mining in a Nutshell

Uncovering the hidden knowledge

Huge NP-complete search space

Multidimensional interface

NOTICE:

All trademarks and service marks mentioned in this document are marks of their respective owners. Furthermore, the CRISP-DM consortium (NCR Systems Engineering Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA) and OHRA Verzekeringen en Bank Groep B.V (The Netherlands)) permitted the presentation of their process model.

Page 3: Issues in Data Mining Infrastructure

A Problem …

You are a marketing manager for a cellular phone company

Problem: Churn is too high

Bringing back a customer after quitting is both difficult and expensive

Giving a new telephone to everyone whose contract is expiring is expensive

You pay a sales commission of $250 per contract

Customers receive a free phone (cost $125)

Turnover (after contract expires) is 40%
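A quick back-of-the-envelope calculation shows why a 40% turnover hurts. The sketch below assumes (this is not stated on the slide) that every departing customer is replaced by a newly signed contract, which incurs both the commission and the free phone. The function names are illustrative only.

```python
COMMISSION = 250     # sales commission per contract, in dollars (from the slide)
PHONE_COST = 125     # cost of the free phone, in dollars (from the slide)
TURNOVER_PCT = 40    # percent of customers who leave when the contract expires


def replacement_cost(expiring_contracts: int) -> int:
    """Cost of signing one new contract for each customer who leaves.

    Assumption: every churned customer is replaced one-for-one.
    """
    churners = expiring_contracts * TURNOVER_PCT // 100
    return churners * (COMMISSION + PHONE_COST)


def blanket_phone_cost(expiring_contracts: int) -> int:
    """Cost of giving a new phone to everyone whose contract is expiring."""
    return expiring_contracts * PHONE_COST


print(replacement_cost(1000))    # 150000
print(blanket_phone_cost(1000))  # 125000
```

Under these assumptions both options cost six figures per 1,000 expiring contracts, which is what makes spending only on the customers predicted to leave attractive.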

Page 4: Issues in Data Mining Infrastructure

… A Solution

Three months before a contract expires, predict which customers will leave

If you want to keep a customer that is predicted to churn, offer them a new phone

The ones that are not predicted to churn need no attention

If you don’t want to keep the customer, do nothing

How can you predict future behavior?

Tarot Cards?

Magic Ball?

Data Mining?

Page 5: Issues in Data Mining Infrastructure

Still Skeptical?

Page 6: Issues in Data Mining Infrastructure

The Definition

Automated

The automated extraction of predictive information from (large) databases

Extraction

Predictive

Databases

Page 7: Issues in Data Mining Infrastructure

History of Data Mining

Page 8: Issues in Data Mining Infrastructure

Repetition in Solar Activity

1613 – Galileo Galilei

1859 – Heinrich Schwabe

Page 9: Issues in Data Mining Infrastructure

The Return of the Halley Comet

Returns: 239 BC, 1531, 1607, 1682, 1910, 1986, 2061 ???

Edmund Halley (1656 - 1742)

Page 10: Issues in Data Mining Infrastructure

Data Mining is Not

Data warehousing

Ad-hoc query/reporting

Online Analytical Processing (OLAP)

Data visualization

Page 11: Issues in Data Mining Infrastructure

Data Mining is

Automated extraction of predictive information from various data sources

Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines

Page 12: Issues in Data Mining Infrastructure

Data Mining can

Answer questions that were too time-consuming to resolve in the past

Predict future trends and behaviors, allowing us to make proactive, knowledge-driven decisions

Page 13: Issues in Data Mining Infrastructure

Data Mining Models

Page 14: Issues in Data Mining Infrastructure

Neural Networks

Characterizes processed data with a single numeric value

Efficient modeling of large and complex problems

Based on biological structures - Neurons

Network consists of neurons grouped into layers

Page 15: Issues in Data Mining Infrastructure

Neuron Functionality

Diagram: inputs I1, I2, I3, …, In, each with a weight W1, W2, W3, …, Wn, feed a function f that produces the output

Output = f(W1*I1, W2*I2, …, Wn*In)
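The neuron formula above can be illustrated with a minimal sketch. The slide leaves f unspecified; the code assumes the common choice of applying an activation function to the sum of the weighted inputs, with `tanh` as a default. Both the summation and the default activation are assumptions, and `neuron_output` is an illustrative name.

```python
import math


def neuron_output(inputs, weights, activation=math.tanh):
    """Return f(W1*I1 + W2*I2 + ... + Wn*In) for a single neuron.

    Assumption: f combines the weighted inputs by summing them and then
    applying an activation function.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * i for w, i in zip(weights, inputs))
    return activation(weighted_sum)


# Three inputs, three weights, default tanh activation.
print(neuron_output([1.0, 0.5, -1.0], [0.2, 0.4, 0.1]))
```

Training a network then amounts to adjusting the weights so that outputs like this one match known examples.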

Page 16: Issues in Data Mining Infrastructure

Training Neural Networks

Page 17: Issues in Data Mining Infrastructure

Neural Networks

Once trained, Neural Networks can efficiently estimate the value of an output variable for a given input

Neurons and network topology are essential

Usually used for prediction or regression problem types

Difficult to understand

Data pre-processing often required

Page 18: Issues in Data Mining Infrastructure

Decision Trees

A way of representing a series of rules that lead to a class or value

Iterative splitting of data into discrete groups maximizing distance between them at each split

CHAID, CART, Quest, C5.0

Classification trees and regression trees

Unlimited growth and stopping rules

Univariate splits and multivariate splits
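To make the idea of a single univariate split concrete, here is a minimal sketch. The slides do not say which distance or purity measure is used; the code uses Gini impurity, one common choice (it is the criterion used by CART). The attribute values and class labels are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Find the threshold for 'value <= threshold' with the lowest
    size-weighted Gini impurity. Returns (threshold, impurity) or None."""
    best = None
    n = len(values)
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical account balances and whether each customer left.
balances = [5, 8, 12, 20]
left_company = ["no", "no", "yes", "yes"]
print(best_split(balances, left_company))  # (8, 0.0)
```

A tree builder would apply `best_split` to every attribute, pick the best one, and recurse on each resulting group until a stopping rule fires.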

Page 19: Issues in Data Mining Infrastructure

Decision Trees

Balance>10 Balance<=10

Age<=32 Age>32

Married=NO Married=YES

Page 20: Issues in Data Mining Infrastructure

Decision Trees

Page 21: Issues in Data Mining Infrastructure

Rule Induction

Method of deriving a set of rules to classify cases

Creates independent rules that are unlikely to form a tree

Rules may not cover all possible situations

Rules may sometimes conflict in a prediction

Page 22: Issues in Data Mining Infrastructure

Rule Induction

If balance > 100,000 then confidence = HIGH & weight = 1.7

If balance > 25,000 and status = married then confidence = HIGH & weight = 2.3

If balance < 40,000 then confidence = LOW & weight = 1.9
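The three rules above can be encoded directly, which also shows how rules can conflict or fail to cover a case. The slide does not say how conflicts are resolved; the sketch assumes that the weights of all firing rules are summed per class and the heaviest class wins. That resolution scheme and the names used are assumptions.

```python
# Each rule: (condition, predicted confidence class, weight), as on the slide.
RULES = [
    (lambda c: c["balance"] > 100_000, "HIGH", 1.7),
    (lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
    (lambda c: c["balance"] < 40_000, "LOW", 1.9),
]


def classify(case):
    """Return the class with the largest total weight among firing rules,
    or None when no rule covers the case."""
    votes = {}
    for condition, label, weight in RULES:
        if condition(case):
            votes[label] = votes.get(label, 0.0) + weight
    if not votes:
        return None
    return max(votes, key=votes.get)


# Rules 2 and 3 both fire and disagree; the heavier rule (2.3) wins.
print(classify({"balance": 30_000, "status": "married"}))  # HIGH
# No rule fires: the rule set does not cover this case.
print(classify({"balance": 50_000, "status": "single"}))   # None
```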

Page 23: Issues in Data Mining Infrastructure

K-nearest Neighbor and Memory-Based Reasoning (MBR)

Uses knowledge of previously solved, similar problems to solve a new problem

Assigns a new case to the class to which most of its k "neighbors" belong

First step – finding a suitable measure of distance between attributes in the data

+ Easy handling of non-standard data types

- Huge models
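A minimal k-nearest-neighbor sketch follows. It assumes numeric attributes and Euclidean distance; the slides leave the distance measure open, so that choice, the sample points and the function name are assumptions. Note that the "model" is simply the stored examples, which is why such models grow large.

```python
import math
from collections import Counter


def knn_classify(query, examples, k=3):
    """Assign `query` the majority class among its k nearest examples.

    `examples` is a list of (feature_vector, label) pairs.
    Assumption: Euclidean distance is a suitable measure for the data.
    """
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, previously solved cases.
examples = [
    ((0, 0), "stays"), ((0, 1), "stays"),
    ((5, 5), "leaves"), ((6, 5), "leaves"), ((5, 6), "leaves"),
]
print(knn_classify((0.2, 0.1), examples))  # stays
```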

Page 24: Issues in Data Mining Infrastructure

K-nearest Neighbor and Memory-Based Reasoning (MBR)

Page 25: Issues in Data Mining Infrastructure

Data Mining Algorithms

Logistic regression

Discriminant analysis

Generalized Additive Models (GAM)

Genetic algorithms

The Apriori algorithm

Etc…

Many other available models and algorithms

Many application specific variations of known models

Final implementation usually involves several techniques

Page 26: Issues in Data Mining Infrastructure

The Apriori Algorithm

The task – mining association rules by finding large itemsets and translating them to the corresponding association rules;

$A \Rightarrow B$, or $A_1 \wedge A_2 \wedge \dots \wedge A_m \Rightarrow B_1 \wedge B_2 \wedge \dots \wedge B_n$, where $A \cap B = \emptyset$

The terminology

– Confidence

– Support

– k-itemset – a set of k items;

– Large itemsets – the large itemset {A, B} corresponds to the following rules (implications): $A \Rightarrow B$ and $B \Rightarrow A$;
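The slide names support and confidence without defining them. Under the standard definitions, the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule A ⇒ B is support(A ∪ B) / support(A). The sketch below computes both on the four-transaction database used in the worked example of this presentation; the function names are illustrative.

```python
from fractions import Fraction

# The example database (TID 10, 20, 30, 40).
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """support(A ∪ B) / support(A); undefined if A never occurs."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"B", "E"}))        # 3/4
print(confidence({"C"}, {"E"}))   # 2/3
```

Exact fractions are used so that thresholds such as 50% can be compared without rounding surprises.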

Page 27: Issues in Data Mining Infrastructure

The Apriori Algorithm

The * operator definition

– n = 1: $S_2 = S_1 * S_1 = \{\{A\}, \{B\}, \{C\}\} * \{\{A\}, \{B\}, \{C\}\} = \{\{AB\}, \{AC\}, \{BC\}\}$

– n = k: $S_{k+1} = S_k * S_k = \{X \cup Y \mid X, Y \in S_k, |X \cap Y| = k-1\}$

– X and Y must have the same number of elements, and must have exactly k-1 identical elements;

– Every k-element subset of any resulting set element (an element is actually a k+1 element set) has to belong to the original set of itemsets;
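The operator defined above can be sketched as a small function: unite every pair of k-itemsets that share exactly k-1 items, then keep a result only if all of its k-element subsets belong to the input. Itemsets are modelled as `frozenset` objects; the name `apriori_gen` follows the pseudocode later in the presentation, but the implementation itself is only an illustration.

```python
from itertools import combinations


def apriori_gen(itemsets):
    """Build (k+1)-item candidates from a collection of k-itemsets."""
    itemsets = set(itemsets)
    candidates = set()
    for x, y in combinations(itemsets, 2):
        union = x | y
        # Same size and exactly k-1 shared items <=> the union has k+1 items.
        if len(union) != len(x) + 1:
            continue
        # Every k-element subset of the candidate must be in the input.
        if all(frozenset(s) in itemsets for s in combinations(union, len(x))):
            candidates.add(union)
    return candidates


S1 = {frozenset("A"), frozenset("B"), frozenset("C")}
print(apriori_gen(S1))  # the three pairs AB, AC, BC
```

Applied to {{AC}, {BC}, {BE}, {CE}}, the same function yields only {BCE}, because {ABC} and {ACE} each have a 2-element subset missing from the input.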

Page 28: Issues in Data Mining Infrastructure

The Apriori Algorithm

Example:

TID   elements
10    A C D
20    B C E
30    A B C E
40    B E

Page 29: Issues in Data Mining Infrastructure

The Apriori Algorithm

Step 1 – generate a candidate set of 1-itemsets $C_1$

– Every possible 1-element set from the database is potentially a large itemset, because we don't know the number of its appearances in the database in advance (a priori);

– The task amounts to identifying (counting) all the different elements in the database; every such element forms a 1-element candidate set;

– $C_1 = \{\{A\}, \{B\}, \{C\}, \{D\}, \{E\}\}$

– Now, we are going to scan the entire database, to count the number of appearances for each one of these elements (i.e. one-element sets);

Page 30: Issues in Data Mining Infrastructure

The Apriori Algorithm

Now, we are going to scan the entire database, to count the number of appearances for each one of these elements (i.e. one-element sets);

Itemset   Count
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

Page 31: Issues in Data Mining Infrastructure

The Apriori Algorithm

Step 2 – generate a set of large 1-itemsets $L_1$

– Each element in $C_1$ whose support reaches the adopted minimum support (for example 50%) becomes a member of $L_1$; {A} appears in 2 of the 4 transactions, exactly 50%, and is therefore kept;

– $L_1 = \{\{A\}, \{B\}, \{C\}, \{E\}\}$, and we can omit D in further steps (if D doesn't have enough support alone, there is no way it could satisfy the requested support in a combination with some other element(s));

Itemset   Count
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

Page 32: Issues in Data Mining Infrastructure

The Apriori Algorithm

Step 3 – generate a candidate set of large 2-itemsets, $C_2$

– $C_2 = L_1 * L_1 = \{\{AB\}, \{AC\}, \{AE\}, \{BC\}, \{BE\}, \{CE\}\}$

– Count the corresponding appearances

Step 4 – generate a set of large 2-itemsets, $L_2$;

– Eliminate the candidates without minimum support;

– $L_2 = \{\{AC\}, \{BC\}, \{BE\}, \{CE\}\}$

Itemset   Count
{AB}      1
{AC}      2
{AE}      1
{BC}      2
{BE}      3
{CE}      2

Page 33: Issues in Data Mining Infrastructure

The Apriori Algorithm

Step 5 ($C_3$)

– $C_3 = L_2 * L_2 = \{\{BCE\}\}$

– Why not {ABC} and {ACE}? Because their 2-element subsets {AB} and {AE} are not elements of the large 2-itemset set $L_2$ (the calculation is made according to the * operator definition);

Step 6 ($L_3$)

– $L_3 = \{\{BCE\}\}$, since {BCE} satisfies the required support of 50% (two appearances);

There can be no further steps in this particular case, because $L_3 * L_3 = \emptyset$;

Answer = $L_1 \cup L_2 \cup L_3$;

Page 34: Issues in Data Mining Infrastructure

The Apriori Algorithm

L_1 = {large 1-itemsets};
for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k = apriori-gen(L_{k-1});
    forall transactions t ∈ D do begin
        C_t = subset(C_k, t);
        forall candidates c ∈ C_t do
            c.count++;
    end;
    L_k = {c ∈ C_k | c.count ≥ minsup}
end;
Answer = ∪_k L_k
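The pseudocode above translates almost line for line into Python. The sketch below is one possible rendering, not the authors' implementation: itemsets are `frozenset` objects, `min_support` is given as a fraction of the database size, and an itemset is kept when its count is at least that fraction (so 2 of 4 transactions passes a 50% threshold, as in the worked example).

```python
from itertools import combinations


def apriori_gen(itemsets):
    """Join k-itemsets sharing k-1 items; prune by the subset condition."""
    itemsets = set(itemsets)
    out = set()
    for x, y in combinations(itemsets, 2):
        union = x | y
        if len(union) == len(x) + 1 and all(
            frozenset(s) in itemsets for s in combinations(union, len(x))
        ):
            out.add(union)
    return out


def apriori(transactions, min_support):
    """Return the union of all large itemsets L_1, L_2, ..."""
    min_count = min_support * len(transactions)
    items = {item for t in transactions for item in t}
    counts = {frozenset([i]): sum(1 for t in transactions if i in t)
              for i in items}
    large = {c for c, n in counts.items() if n >= min_count}   # L_1
    answer = set(large)
    while large:                                               # L_{k-1} != {}
        candidates = apriori_gen(large)                        # C_k
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:                                 # one scan per k
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c for c, n in counts.items() if n >= min_count}  # L_k
        answer |= large
    return answer


DB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset in sorted(apriori(DB, 0.5), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
```

On the example database this reproduces the nine large itemsets of the walkthrough: {A}, {B}, {C}, {E}, {AC}, {BC}, {BE}, {CE} and {BCE}.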

Page 35: Issues in Data Mining Infrastructure

The Apriori Algorithm

Enhancements to the basic algorithm

Scan-reduction

– The most time-consuming operation in the Apriori algorithm is the database scan; it is originally performed after each candidate set generation, to determine the frequency of each candidate in the database;

– Scan number reduction – counting candidates of multiple sizes in one pass;

– Rather than counting only candidates of size k in the k-th pass, we can also calculate the candidates $C'_{k+1}$, where $C'_{k+1}$ is generated from $C_k$ (instead of $L_k$), using the * operator;

Page 36: Issues in Data Mining Infrastructure

The Apriori Algorithm

– Compare: $C'_{k+1} = C_k * C_k$ and $C_{k+1} = L_k * L_k$

– Note that $C'_{k+1} \supseteq C_{k+1}$

– This variation can pay off in later passes, when the cost of counting and keeping in memory the additional $C'_{k+1} - C_{k+1}$ candidates becomes less than the cost of scanning the database;

– There has to be enough space in main memory for both $C_k$ and $C'_{k+1}$;

– Following this idea, we can make further scan reduction:

• $C'_{k+1}$ is calculated from $C_k$ for k > 1;

• There must be enough memory space for all $C_k$'s (k > 1);

– Consequently, only two database scans need to be performed (the first to determine $L_1$, and the second to determine all the other $L_k$'s);
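The two-scan variant described above can be sketched as follows. After the first scan yields the large 1-itemsets, all larger candidates are generated in memory, each level from the previous level's candidates rather than from large itemsets, and a single second scan counts them all. This is an illustrative sketch that assumes all candidates fit in memory; the names `join` and `apriori_two_scans` are hypothetical.

```python
from itertools import combinations


def join(itemsets):
    """Unite same-size sets sharing all but one item; keep a result only
    if all of its k-element subsets are in the input."""
    itemsets = set(itemsets)
    result = set()
    for x, y in combinations(itemsets, 2):
        union = x | y
        if len(union) == len(x) + 1 and all(
            frozenset(s) in itemsets for s in combinations(union, len(x))
        ):
            result.add(union)
    return result


def apriori_two_scans(transactions, min_support):
    min_count = min_support * len(transactions)

    # Scan 1: count single items to obtain the large 1-itemsets.
    item_counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            item_counts[key] = item_counts.get(key, 0) + 1
    large_1 = {c for c, n in item_counts.items() if n >= min_count}

    # In memory: 2-item candidates from the large 1-itemsets, then each
    # further level from the previous level's candidates.
    candidates = set()
    level = join(large_1)
    while level:
        candidates |= level
        level = join(level)

    # Scan 2: count every candidate of every size in a single pass.
    counts = dict.fromkeys(candidates, 0)
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1

    return large_1 | {c for c, n in counts.items() if n >= min_count}


DB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori_two_scans(DB, 0.5)
print(len(result))  # 9
```

On the example database the second scan counts eleven candidates instead of the seven that the level-by-level version counts, which is the memory-for-scans trade-off the slide describes.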

Page 37: Issues in Data Mining Infrastructure

The Apriori Algorithm

Abstraction levels

– Higher level associations are stronger (more powerful), but also less certain;

– A good practice would be adopting different thresholds for different abstraction levels (higher thresholds for higher levels of abstraction)


Page 39: Issues in Data Mining Infrastructure

DM Process Model

CRISP – tends to become a standard

5A – used by SPSS Clementine (Assess, Access, Analyze, Act and Automate)

SEMMA – used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess)

Page 40: Issues in Data Mining Infrastructure

CRISP-DM

CRoss-Industry Standard Process for Data Mining

Conceived in 1996 by three companies:

Page 41: Issues in Data Mining Infrastructure

CRISP-DM methodology

Four-level breakdown of the CRISP-DM methodology:

Phases

Generic Tasks

Specialized Tasks

Process Instances

Page 42: Issues in Data Mining Infrastructure

Mapping generic models to specialized models

Analyze the specific context

Remove any details not applicable to the context

Add any details specific to the context

Specialize generic contents according to concrete characteristics of the context

Possibly rename generic contents to provide more explicit meanings

Page 43: Issues in Data Mining Infrastructure

CRISP-DM model

Business understanding

Data understanding

Data preparation

Modeling

Evaluation

Deployment

Page 44: Issues in Data Mining Infrastructure

Business Understanding

Determine business objectives

Assess situation

Determine data mining goals

Produce project plan

Page 45: Issues in Data Mining Infrastructure

Data Understanding

Collect initial data

Describe data

Explore data

Verify data quality

Page 46: Issues in Data Mining Infrastructure

Data Preparation

Select data

Clean data

Construct data

Integrate data

Format data

Page 47: Issues in Data Mining Infrastructure

Modeling

Select modeling technique

Generate test design

Build model

Assess model

Page 48: Issues in Data Mining Infrastructure

Evaluation

Evaluate results

Review process

Determine next steps

results = models + findings

Page 49: Issues in Data Mining Infrastructure

Deployment

Plan deployment

Plan monitoring and maintenance

Produce final report

Review project

Page 50: Issues in Data Mining Infrastructure

At Last…

Page 51: Issues in Data Mining Infrastructure

Evolution of Data Mining

Evolutionary Step: Data Collection (1960s)
Business Question: What was my average total revenue over the last 5 years?
Enabling Technologies: Computers, tapes, disks
Product Providers: IBM, CDC
Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
Business Question: What were unit sales in New England last March?
Enabling Technologies: RDBMS, SQL, ODBC
Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Navigation (1990s)
Business Question: What were unit sales in New England last March? Drill down to Boston.
Enabling Technologies: OLAP, multidimensional databases, data warehouses
Product Providers: Pilot, IRI, Arbor, Redbrick, Evolutionary Technologies
Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (2000)
Business Question: What's likely to happen to Boston unit sales next month? Why?
Enabling Technologies: Advanced algorithms, multiprocessors, massive databases
Product Providers: Lockheed, IBM, SGI, numerous startups
Characteristics: Prospective, proactive information delivery

Page 52: Issues in Data Mining Infrastructure

Examples of DM projects to stimulate your imagination

Here are six examples of how data mining is helping corporations to operate more efficiently and profitably in today's business environment

– Targeting a set of consumers who are most likely to respond to a direct mail campaign

– Predicting the probability of default for consumer loan applications

– Reducing fabrication flaws in VLSI chips

– Predicting audience share for television programs

– Predicting the probability that a cancer patient will respond to radiation therapy

– Predicting the probability that an offshore oil well is actually going to produce oil

Page 53: Issues in Data Mining Infrastructure

Comparison of fourteen DM tools

Evaluated by four undergraduates inexperienced at data mining, a relatively experienced graduate student, and a professional data mining consultant

Run under MS Windows 95, MS Windows NT, or Macintosh System 7.5

Use one of four technologies: Decision Trees, Rule Induction, Neural Networks, or Polynomial Networks

Solve four problems: two binary classification problems, a multi-class classification problem, and a noiseless estimation problem

Priced from $75 to $25,000

Page 54: Issues in Data Mining Infrastructure

Comparison of fourteen DM tools

The Decision Tree products were
- CART
- Scenario
- See5
- S-Plus

The Rule Induction tools were
- WizWhy
- DataMind
- DMSK

Neural Networks were built from three programs
- NeuroShell2
- PcOLPARS
- PRW

The Polynomial Network tools were
- ModelQuest Expert
- Gnosis
- a module of NeuroShell2
- KnowledgeMiner

Page 55: Issues in Data Mining Infrastructure

Criteria for evaluating DM tools

A list of 20 criteria for evaluating DM tools, put into 4 categories:

Capability measures what a desktop tool can do, and how well it does it

- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Includes quality of testing options
- Has a programming language
- Provides useful output reports
- Provides visualisation

Page 56: Issues in Data Mining Infrastructure

Visualisation

+ excellent capability   good capability   - some capability   “blank” no capability

Page 57: Issues in Data Mining Infrastructure

Criteria for evaluating DM tools

Learnability/Usability shows how easy a tool is to learn and use

- Tutorials
- Wizards
- Easy to learn
- User's manual
- Online help
- Interface

Page 58: Issues in Data Mining Infrastructure

Criteria for evaluating DM tools

Interoperability shows a tool's ability to interface with other computer applications

- Importing data
- Exporting data
- Links to other applications

Flexibility

- Model adjustment flexibility
- Customizable work environment
- Ability to write or change code

Page 59: Issues in Data Mining Infrastructure

Data Input & Output Model

+ excellent capability   good capability   - some capability   “blank” no capability

Page 60: Issues in Data Mining Infrastructure

A classification of data sets

Pima Indians Diabetes data set
– 768 cases of Native American women from the Pima tribe, some of whom are diabetic, most of whom are not
– 8 attributes plus the binary class variable for diabetes per instance

Wisconsin Breast Cancer data set
– 699 instances of breast tumors, some of which are malignant, most of which are benign
– 10 attributes plus the binary malignancy variable per case

The Forensic Glass Identification data set
– 214 instances of glass collected during crime investigations
– 10 attributes plus the multi-class output variable per instance

Moon Cannon data set
– 300 solutions to the equation $x = 2v^2 \sin(\theta)\cos(\theta)/g$
– the data were generated without adding noise
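A noise-free data set of the Moon Cannon kind can be generated in a few lines. The sketch reads the argument of sin and cos as the launch angle and the divisor g as gravitational acceleration, which makes the equation the range of a projectile. The sampling ranges for speed and angle, the lunar value g = 1.62 m/s², and the function names are assumptions; the slides do not say how the original 300 cases were sampled.

```python
import math
import random


def cannon_range(v, theta, g=1.62):
    """Distance x = 2 v^2 sin(theta) cos(theta) / g.

    v: muzzle speed, theta: launch angle in radians,
    g: gravitational acceleration (default: approximate lunar value).
    """
    return 2 * v**2 * math.sin(theta) * math.cos(theta) / g


def make_dataset(n=300, seed=0):
    """Generate n noise-free (v, theta, x) cases.

    Assumption: v and theta are drawn uniformly from illustrative ranges.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)
        theta = rng.uniform(0.0, math.pi / 2)
        rows.append((v, theta, cannon_range(v, theta)))
    return rows


data = make_dataset()
print(len(data))  # 300
```

Because no noise is added, an estimation tool can in principle fit the target exactly, which is what makes such a data set a clean benchmark.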

Page 61: Issues in Data Mining Infrastructure

Evaluation of fourteen DM tools

Page 62: Issues in Data Mining Infrastructure

Potentials of R&D in Cooperation with U. of Belgrade

An Overview of Advanced Datamining Projects for High-Tech Computer Industry in the USA and EU

Page 63: Issues in Data Mining Infrastructure

VLSI Detection for Internet/Telephony Interfaces

Goran Davidović, Miljan Vuletić, Veljko Milutinović,

Tom Chen, and Tom Brunett

* eT

Page 64: Issues in Data Mining Infrastructure

HOME/OFFICE/FACTORY AUTOMATION ON THE INTERNET

Diagram labels: USERS, SPECIALIZED INTERNET SERVICE PROVIDER, Superposition/DETECTION, REMOTE SITE

Page 65: Issues in Data Mining Infrastructure


Reconfigurable FPGA for EBI

Božidar Radunović, Predrag Knežević, Veljko Milutinović,

Steve Casselman, and John Schewel*

* Virtual

Page 66: Issues in Data Mining Infrastructure

CUSTOMER SATISFACTION vs CUSTOMER PROFILE

Diagram labels: USERS, SPECIALIZED INTERNET SERVICE PROVIDER, VCC

Page 67: Issues in Data Mining Infrastructure

BioPoP

Veljko Milutinovic, Vladimir Jovicic, Milan Simic, Bratislav Milic, Milan Savic, Veljko Jovanovic, Stevo Ilic, Djordje Veljkovic, Stojan Omorac, Nebojsa Uskokovic, and Fred Darnell

•isItWorking.com

Page 68: Issues in Data Mining Infrastructure

Testing the Infrastructure for EBI

Phones, Faxes, Email, Web links, Servers, Routers, Software

• Statistics

• Correlation

• Innovation

Page 69: Issues in Data Mining Infrastructure

CNUCE

Integration and Datamining on Ad-Hoc Networks and the Internet

Veljko Milutinović,

Luca Simoncini, and Enrico Gregory

*University of Pisa, Santanna, CNUCE

Page 70: Issues in Data Mining Infrastructure

Diagram labels: GSM, DM, Ad-Hoc, Internet

Page 71: Issues in Data Mining Infrastructure

Genetic Search with Spatial/Temporal Mutations

Jelena Mirković, Dragana Cvetković, and Veljko Milutinović

*Comshare

Page 72: Issues in Data Mining Infrastructure

Drawbacks of INDEX-BASED: Time to index + ranking

Advantages of LINKS-BASED: Mission critical applications + customer tuned ranking

Provider

Well organized markets: Best first search

If elements of disorder: G w DB mutations

Chaotic markets: G w S/T mutations

Page 73: Issues in Data Mining Infrastructure

e-Banking on the Internet

Miloš Kovačević, Bratislav Milic, Veljko Milutinović, Marco Gori, and Roberto Giorgi

*University of Siena

Page 74: Issues in Data Mining Infrastructure

Bottleneck#1: Searching for Clients and Investments

1472++

*University of Siena + Banco di Monte dei Paschi

Page 75: Issues in Data Mining Infrastructure

WaterMarking for e-Banking on the Internet

Darko Jovic, Ivana Vujovic, Veljko Milutinovic

Fraunhofer, IPSI, Darmstadt, Germany

Page 76: Issues in Data Mining Infrastructure

Bottleneck#1: SpeedUp

Page 77: Issues in Data Mining Infrastructure

SSGRR

Organizing Conferences via the Internet

Zoran Horvat, Nataša Kukulj, Vlada Stojanović,

Dušan Dingarac, Marjan Mihanović, Miodrag Stefanović,

Veljko Milutinović, and Frederic Patricelli

*SSGRR, L’Aquila

Page 78: Issues in Data Mining Infrastructure

2000: Arno Penzias

2001: Bob Richardson

2002: Jerry Friedman

2003: Harry Kroto

http://www.ssgrr.it

Page 79: Issues in Data Mining Infrastructure

Summary

Books with Nobel Laureates:

Kenneth Wilson, Ohio (North-Holland)
Leon Cooper, Brown (Prentice-Hall)
Robert Richardson, Cornell (Kluwer-Academics)
Herb Simon (Kluwer-Academics)
Jerome Friedman, MIT (IOS Press)
Harold Kroto (IOS Press)
Arno Penzias (IOS Press)

Page 80: Issues in Data Mining Infrastructure


http://galeb.etf.bg.ac.yu/~vm/

e-mail: [email protected]