Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Upload: ashlyn-henry

Post on 31-Dec-2015

TRANSCRIPT

Page 1: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Copyright © Curt Hill 2003-2013

Data Mining

A Brief Overview

Page 2: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

The Problem
• Huge volumes of data overwhelm traditional methods of data analysis such as:
  – Spreadsheets
  – Ad hoc queries
  – Multidimensional analysis tools
  – Statistical analysis packages

Page 3: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

What is Data Mining?
• Exploratory data analysis based on a data warehouse
  – Knowledge Discovery in Databases (KDD)
• Data mining extracts previously unknown and potentially useful information
  – Rules, constraints, correlations, patterns, signatures and irregularities
• The goal is to automate the methods for finding these in the data

Page 4: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Data Warehouse
• A database usually separated from the operational database(s)
• Used as a base for decision support systems
  – Upper and middle management
  – Not used for day-to-day management but for spotting trends and making path decisions
• Typically very large and composed of recent copies from the operational database(s)
• Data mining is one of the applications that could use it

Page 5: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Goals of Data Mining
• Prediction of future behaviors
  – Seasonal or non-seasonal trends
  – How will consumers respond to discounts?
  – Allows the enterprise to be ready
• Identification of an item, event or activity
  – Intruders may be identified by the files they access or programs they use

Page 6: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Goals Again
• Classification of categories of users or products
  – Shoppers may be categorized as:
    • Discount seeking
    • Rush
    • Regular
    • Attached to certain brand names
  – The store may be made more friendly to such shoppers
• Optimize the use of time, space, materials and money

Page 7: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Knowledge Discovery
• There are several types of discoverable knowledge
  – Association rules
  – Classification hierarchies
  – Sequential patterns
  – Time series patterns
  – Clustering
• Each of these needs more information

Page 8: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Association Rules
• What we are looking for is knowledge of associations that are not obvious
• This has gained traction in market basket research
  – Very profitable information
• If an MRI has characteristics a and b, then it often has c
  – This is an association rule

Page 9: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Market Basket Model
• Premise: the items in a checkout transaction are not random
• Thus we analyze customer transactions for patterns or association rules
• These patterns may guide decisions on
  – Sale items
  – Shelf arrangement or product placement

Page 10: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Retail Example
• A young father goes to the store to buy disposable diapers
• On his way through the store he sees a Sports Illustrated and buys it
• In general, people do not impulse buy disposable diapers, but while buying these, they may buy something else on impulse
• Can we examine retail transaction records and perceive the connection?

Page 11: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Association Rule
• Is of the form: X => Y
  – Where both X and Y could be sets of items
• The support of this rule is the percentage of all transactions that contain both X and Y
• The confidence of this rule is the number of transactions that contain both X and Y divided by the number of transactions that contain X
• High support and high confidence indicate a rule that business decisions may be based upon
  – Put the magazine rack on the route to the diapers
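
The two measures can be sketched in a few lines of Python. This is only an illustration: the function names and the toy transactions are hypothetical and do not come from the slides. Confidence is computed as the count of transactions with both X and Y divided by the count of transactions with X.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of all transactions that contain every item in itemset."""
    wanted = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               x: Iterable[str], y: Iterable[str]) -> float:
    """Of the transactions that contain X, the fraction that also contain Y."""
    x_set, y_set = frozenset(x), frozenset(y)
    with_x = sum(1 for t in transactions if x_set <= t)
    if with_x == 0:
        return 0.0
    with_both = sum(1 for t in transactions if (x_set | y_set) <= t)
    return with_both / with_x


# Hypothetical checkout tickets, made up for illustration only.
TRANSACTIONS = [
    frozenset({"diapers", "magazine"}),
    frozenset({"diapers", "magazine", "milk"}),
    frozenset({"diapers", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"magazine"}),
]

rule_support = support(TRANSACTIONS, {"diapers", "magazine"})
rule_confidence = confidence(TRANSACTIONS, {"diapers"}, {"magazine"})
```

On this toy data two of the five tickets hold both items, and two of the three diaper tickets also hold a magazine.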

Page 12: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Agriculture Example
• Landsat satellites are in polar orbits
• They record data on all land every 18 days
• A pixel is approximately 31 yards on a side
• Seven bands, from visible light through infrared, are recorded for each pixel
• Each band produces a 1 byte value
• Can you get this data in a spreadsheet?

Page 13: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Agricultural Rule
• In midsummer a near infrared value in the range 48 to 255 and a red value in the range 0 to 31 suggest that the yield will be 128 to 255 bushels per acre
• If the support and confidence are high, this suggests that the farmer should apply nitrogen to the areas where near infrared was less than 48 and red was greater than 31
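
As a rough sketch, the rule can be applied pixel by pixel. The function names and sample pixels below are hypothetical. The nitrogen test assumes the cutoffs are the exact complements of the high-yield ranges (near infrared below 48, red above 31), which is an interpretation rather than something the slide spells out.

```python
from typing import List, Tuple


def predicts_high_yield(near_infrared: int, red: int) -> bool:
    """Rule antecedent: near infrared in 48..255 and red in 0..31."""
    return 48 <= near_infrared <= 255 and 0 <= red <= 31


def needs_nitrogen(near_infrared: int, red: int) -> bool:
    """Assumed complement thresholds: low near infrared and high red."""
    return near_infrared < 48 and red > 31


# Hypothetical (near_infrared, red) byte values for three pixels.
PIXELS: List[Tuple[int, int]] = [(200, 10), (30, 90), (120, 60)]

flagged = [p for p in PIXELS if needs_nitrogen(*p)]
```

The third pixel matches neither test, which shows that the rule as stated leaves some areas undecided.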

Page 14: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Computational Difficulties
• Consider how many tickets a supermarket or department store might generate
• In general, most of these tickets have more than two or three items
• The store carries thousands of items
• Discovering these association rules becomes computationally taxing
• One good reason to keep this off of the operational database
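
A quick count shows why the search is taxing. The sketch below assumes a hypothetical store with 1,000 distinct items (the slides only say "thousands") and counts how many candidate itemsets of each size a brute-force search would have to consider.

```python
import math
from typing import Dict


def candidate_counts(n_items: int, max_size: int) -> Dict[int, int]:
    """Number of distinct itemsets of each size from 1 to max_size."""
    return {k: math.comb(n_items, k) for k in range(1, max_size + 1)}


# Hypothetical catalog size, chosen only for illustration.
counts = candidate_counts(1000, 3)
pairs = counts[2]
triples = counts[3]
```

Even this modest catalog yields 499,500 possible pairs and 166,167,000 possible triples, each of which would have to be checked against every ticket.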

Page 15: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Algorithm Properties
• There are a number of algorithms for finding these rules
• These typically exploit two properties:
• Downward closure
  – Every subset of a large itemset must also have large support
  – Removing a few items does not hurt
• Antimonotonicity
  – Every superset of a small itemset must have small support
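
One well-known way to exploit these properties is a level-wise search in the style of the Apriori algorithm. The slides do not name a specific algorithm, so the sketch below is an assumption about what such a search could look like, with hypothetical names and toy data. A candidate of size k is counted only if all of its subsets of size k-1 were already found to be large.

```python
from itertools import combinations
from typing import FrozenSet, List


def frequent_itemsets(transactions: List[FrozenSet[str]],
                      min_support: float) -> List[FrozenSet[str]]:
    """Level-wise search for itemsets whose support is at least min_support."""
    n = len(transactions)
    if n == 0:
        return []

    def is_frequent(candidate: FrozenSet[str]) -> bool:
        count = sum(1 for t in transactions if candidate <= t)
        return count / n >= min_support

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    result = list(current)
    size = 1
    while current:
        size += 1
        previous = set(current)
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune step: drop any candidate that has an infrequent subset,
        # since a superset of a small itemset must itself be small.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous
                             for s in combinations(c, size - 1))}
        current = [c for c in candidates if is_frequent(c)]
        result.extend(current)
    return result


# Hypothetical checkout tickets, made up for illustration only.
TICKETS = [
    frozenset({"diapers", "magazine"}),
    frozenset({"diapers", "magazine", "milk"}),
    frozenset({"diapers", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"magazine"}),
]

large = frequent_itemsets(TICKETS, min_support=0.4)
```

The prune step is what keeps the number of support counts far below the brute-force total: most candidates are rejected without ever scanning the tickets.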

Page 16: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Classification
• Classifying data into predetermined groups
• Then we can deal with the groups in different ways
• AKA supervised learning
  – Developed in the field of artificial intelligence
• The process of clustering is attempting to classify data into groups that are not predetermined

Page 17: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Models
• The two typical models are decision trees and sets of rules
• We look at the data to build the model and then use the model for new data
• Consider in the next slide a decision tree for granting a credit card to an applicant

Page 18: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Example: Decision Tree

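[Figure: decision tree. The root tests Married (Yes / No). Yes leads to a Salary test with branches labeled <25K and >75K and leaves Good, Fair and Poor. No leads to a Balance test: <5K gives Poor, and >5K leads to an Age test where <25 gives Fair and >25 gives Good.]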
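
A decision tree like this one translates directly into nested conditions. The sketch below is hedged: the transcript does not preserve which Salary branch leads to which leaf, so it assumes the lowest salaries rate Poor, the highest rate Good and the middle range rates Fair. The function name and the handling of boundary values (exactly 5K, exactly age 25) are also assumptions.

```python
def credit_rating(married: bool, salary: float, balance: float, age: int) -> str:
    """Walk the tree from the root (Married) down to a leaf rating."""
    if married:
        # Assumed leaf order for the Salary branches.
        if salary < 25_000:
            return "Poor"
        if salary > 75_000:
            return "Good"
        return "Fair"
    if balance < 5_000:
        return "Poor"
    # Unmarried with a balance of 5K or more: decide on age.
    return "Fair" if age < 25 else "Good"


rating = credit_rating(married=False, salary=40_000, balance=8_000, age=30)
```

Once the tree has been built from historical data, rating a new applicant costs only a handful of comparisons.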

Page 19: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Clustering
• AKA unsupervised learning
• Classify the data into groups that you are not aware of to begin with
• A distance function must be supplied that describes the distance between two points
  – The points are often not purely numeric
  – They are often not in 2 dimensions or even 3, which makes things interesting
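
One possible distance function for records that mix numeric and categorical fields is a Gower-style average: numeric fields contribute their absolute difference scaled by the field's range, and categorical fields contribute 0 when equal and 1 otherwise. The slides do not prescribe any particular function, so the field names, ranges and records below are hypothetical.

```python
from typing import Mapping, Union

Value = Union[int, float, str]


def mixed_distance(a: Mapping[str, Value], b: Mapping[str, Value],
                   numeric_ranges: Mapping[str, float]) -> float:
    """Average per-field dissimilarity in [0, 1] over the fields of a."""
    total = 0.0
    for field in a:
        if field in numeric_ranges:
            span = numeric_ranges[field]
            # Scale by the field's range so no single field dominates.
            total += abs(a[field] - b[field]) / span if span else 0.0
        else:
            # Categorical field: simple mismatch indicator.
            total += 0.0 if a[field] == b[field] else 1.0
    return total / len(a)


# Hypothetical shopper records and value ranges.
RANGES = {"age": 50.0, "income": 100_000.0}
shopper_a = {"age": 30, "income": 40_000, "brand": "A"}
shopper_b = {"age": 40, "income": 60_000, "brand": "B"}

d = mixed_distance(shopper_a, shopper_b, RANGES)
```

The same function works for any number of fields, which matters because the points are rarely limited to two or three dimensions.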

Page 20: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Applications
• Marketing
  – Determine advertising, store placement, segmentation of customers
• Finance
  – Analysis of performance of securities
• Manufacturing
  – Optimizing resources, designing the manufacturing process
• Health Care
  – Discovery of items in X-ray and MRI images

Page 21: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Example
• Certain diseases switch on genes characteristic of that disease
• Drugs often switch off a gene
• In 2011 a database of genes and what affected them was mined
• The result was that mice with small cell lung cancer were treated with an antidepressant, imipramine
  – The tumors were reduced

Page 22: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Telco Example
• A local telephone company mines its connection data for possible marketing opportunities
• A phone that is very busy in the 3PM to 6PM range suggests a teenager
  – Pitch a teen phone
• A phone that is busy in the 9AM to 5PM range suggests a home business
  – Pitch a business line
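
These two heuristics can be sketched as a simple rule over hourly call counts. Everything specific here is an assumption for illustration: the function name, the 24-slot input format and the 50% threshold for calling a window "busy". The afternoon window is checked first because it overlaps the workday window.

```python
from typing import Optional, Sequence


def suggest_pitch(calls_by_hour: Sequence[int],
                  busy_share: float = 0.5) -> Optional[str]:
    """Suggest a marketing pitch from 24 hourly call counts, or None."""
    if len(calls_by_hour) != 24:
        raise ValueError("expected 24 hourly counts")
    total = sum(calls_by_hour)
    if total == 0:
        return None
    afternoon = sum(calls_by_hour[15:18]) / total  # 3PM up to 6PM
    workday = sum(calls_by_hour[9:17]) / total     # 9AM up to 5PM
    if afternoon >= busy_share:
        return "teen phone"
    if workday >= busy_share:
        return "business line"
    return None


# Hypothetical usage profile: all calls fall between 3PM and 6PM.
teen_profile = [0] * 24
for hour in (15, 16, 17):
    teen_profile[hour] = 10

pitch = suggest_pitch(teen_profile)
```

A real deployment would learn the windows and thresholds from the connection data instead of hard-coding them.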

Page 23: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Social Media
• Publicly viewable social media presents a very large quantity of data
• However it is:
  – Noisy
  – Unstructured
  – Dynamic
• It is of great interest in political campaigns, marketing and health care
  – This is where people express things first

Page 24: Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview

Finally
• Much of the analysis done in data mining has been done for centuries
  – What is different now is the amount and types of captured data
• There are a number of commercial tools for mining
• Many large companies have substantial investment in, and return on, their mining activities
