data mining 2017 introduction - utrecht university · data mining 2017 introduction ... d...
TRANSCRIPT
Data Mining 2017Introduction
Ad Feelders
Universiteit Utrecht
Ad Feelders ( Universiteit Utrecht ) Data Mining 1 / 53
The Course
Literature: Lecture Notes, Articles, Book Chapters, Slides(appear on the course web site).
Course Form: Lectures (Wednesday, Friday) and computer labsessions with R (Wednesday).
Grading: two practical assignments (50%) and a written exam (50%).
Web Site: http://www.cs.uu.nl/docs/vakken/mdm/
Ad Feelders ( Universiteit Utrecht ) Data Mining 2 / 53
Practical Assignments
There are two practical assignments: one assignment with emphasis onprogramming and one with emphasis on data analysis.
1 Write your own classification tree and random forest algorithm in R(30%).
2 Text Mining: analyze hotel reviews to distinguish genuine from fakereviews (20%).
Ad Feelders ( Universiteit Utrecht ) Data Mining 3 / 53
What is Data Mining?
(Knowledge discovery in databases) is the non-trivial process ofidentifying valid, novel, potentially useful, and ultimatelyunderstandable patterns in data (Fayyad et al.)
Analysis of secondary data (Hand)
The induction of understandable models and patterns from databases(Siebes)
The data-dependent process of selecting a statistical model(Leamer, 1978 (!))
Ad Feelders ( Universiteit Utrecht ) Data Mining 4 / 53
What is Data Mining?
Data Mining as a subdiscipline of computer science:
is concerned with the development and analysis of algorithms for the(efficient) extraction of patterns and models from(large, heterogeneous, ...) data bases.
Ad Feelders ( Universiteit Utrecht ) Data Mining 5 / 53
Models
A model is an abstraction of (part of) reality (the application domain).
In our case, models describe relationships among:
attributes (variables, features),
tuples (records, cases),
or both.
Ad Feelders ( Universiteit Utrecht ) Data Mining 6 / 53
Example Model: Classification Tree
income
marital
status
agegood risk
good risk bad risk
bad risk
> 36,000 £ 36,000
> 37 £ 37
married not married
Ad Feelders ( Universiteit Utrecht ) Data Mining 7 / 53
Patterns
Patterns are local models, i.e., models that describe only part of thedatabase.For example, association rules:
Diapers → Beer , support = 20%, confidence = 85%
Although patterns are clearly different from models, we will use model asthe generic term.
Ad Feelders ( Universiteit Utrecht ) Data Mining 8 / 53
Reasons to Model
A model
helps to gain insight into the application domain
can be used to make predictions
can be used for manipulating/controlling a system (causality!)
A model that predicts well does not always provide understanding.
Correlation 6= CausationCan causal relations be found from data alone?
Ad Feelders ( Universiteit Utrecht ) Data Mining 10 / 53
Causality and Correlation
Heavy Smoking
Nicotine Stain Lung Cancer
Washing your hands or cleaning your teeth won’t help!
Ad Feelders ( Universiteit Utrecht ) Data Mining 11 / 53
Induction vs Deduction
Deductive reasoning is truth-preserving:
1 All horses are mammals
2 All mammals have lungs
3 Therefore, all horses have lungs
Inductive reasoning adds information:
1 All horses observed so far have lungs
2 Therefore, all horses have lungs
Ad Feelders ( Universiteit Utrecht ) Data Mining 12 / 53
Induction (Statistical)
1 4% of the products we tested are defective
2 Therefore, 4% of all products (tested or otherwise)are defective
Ad Feelders ( Universiteit Utrecht ) Data Mining 13 / 53
Inductive vs Deductive: Acceptance Testing Example
100,000 products, sample 1000Suppose 10 are defective (1% of the sample)
Deductive: d ∈ [0.0001, 0.9901]
Inductive: d ∈ [0.004, 0.016] with 95% confidence.
0.01±√
0.01× 0.99
1000× 1.96︸ ︷︷ ︸
≈0.006
Ad Feelders ( Universiteit Utrecht ) Data Mining 14 / 53
Experimental data
The experimental method:
Formulate a hypothesis of interest.
Design an experiment that will yield data to test this hypothesis.
Accept or reject hypothesis depending on the outcome.
Ad Feelders ( Universiteit Utrecht ) Data Mining 15 / 53
Experimental vs Observational Data
Experimental Scientist:
Assign level of fertilizer randomly to plot of land.
Control for other factors that might influence yield: quality of soil,amount of sunlight,...
Compare mean yield of fertilized and unfertilized plots.
Data Miner:
Notices that the yield is somewhat higher under trees where birdsroost.
Conclusion: droppings increase yield.
Alternative conclusion: moderate amount of shade increasesyield.(“Identification Problem”)
Ad Feelders ( Universiteit Utrecht ) Data Mining 16 / 53
Observational Data
In observational data, many variables may move together insystematic ways.
In this case, there is no guarantee that the data will be “rich ininformation”, nor that it will be possible to isolate the relationship orparameter of interest.
Prediction quality may still be good!
Ad Feelders ( Universiteit Utrecht ) Data Mining 17 / 53
Example: linear regression
mpg = a + b × cyl + c × eng + d × hp + e × wgt
Estimate a, b, c , d , e from data. Choose values so that sum of squarederrors
N∑i=1
(mpgi − mpgi )2
is minimized.∂mpg
∂eng= c
Expected change in mpg when (all else equal) engine displacementincreases by one unit.
Engine displacement is defined as the total volume of air/fuel mixture anengine can draw in during one complete engine cycle.
Ad Feelders ( Universiteit Utrecht ) Data Mining 18 / 53
The Data
> cars.dat[1:10,]
mpg cyl eng hp wgt
1 18 8 307 130 3504 "chevrolet chevelle malibu"
2 15 8 350 165 3693 "buick skylark 320"
3 18 8 318 150 3436 "plymouth satellite"
4 16 8 304 150 3433 "amc rebel sst"
5 17 8 302 140 3449 "ford torino"
6 15 8 429 198 4341 "ford galaxie 500"
7 14 8 454 220 4354 "chevrolet impala"
8 14 8 440 215 4312 "plymouth fury iii"
9 14 8 455 225 4425 "pontiac catalina"
10 15 8 390 190 3850 "amc ambassador dpl"
Ad Feelders ( Universiteit Utrecht ) Data Mining 19 / 53
Fitted Model
Coefficients:
Estimate Pr(>|t|)
(Intercept) 45.7567705 < 2e-16 ***
cyl -0.3932854 0.337513
eng 0.0001389 0.987709
hp -0.0428125 0.000963 ***
wgt -0.0052772 1.08e-12 ***
---
Multiple R-Squared: 0.7077
> cor(cars.dat)
mpg cyl eng hp wgt
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
cyl -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
eng -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
hp -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
wgt -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
Ad Feelders ( Universiteit Utrecht ) Data Mining 20 / 53
KDD Process: CRISP-DM
This course is mainly concerned with the modeling phase.
Ad Feelders ( Universiteit Utrecht ) Data Mining 21 / 53
Data Preparation
Once we know the data available to solve the data mining problem, wehave to prepare the mining database (often one table):
Select data
Clean data
Construct data
Integrate data
Ad Feelders ( Universiteit Utrecht ) Data Mining 22 / 53
Select Data: Why Reject Data?
attributes with too many errors
attributes with too many missing values
attributes that have no relevance from a domain experts point of view
a sample may be sufficient
Ad Feelders ( Universiteit Utrecht ) Data Mining 23 / 53
Data Cleaning
Cleaning data is a complete topic in itself, we mention two problems:
1 data editing: what to do when records contain impossiblecombinations of values?
2 incomplete data: what to do with missing values?
Ad Feelders ( Universiteit Utrecht ) Data Mining 24 / 53
Data Editing: Example
We have the following edits (impossible combinations):
E1 = {Driver’s Licence=yes, Age < 18}E2 = {Married=yes, Age < 18}
Make record
Driver’s Licence=yes, Married=yes, Age=15
consistent with minimal number of changes.
Of course it’s better to prevent such inconsistencies in the data!
Seminal Paper: I.P. Fellegi, D. Holt: A systematic approach to automaticedit and imputation, Journal of the American Statistical Association71(353), 1976, pp. 17-35.
Ad Feelders ( Universiteit Utrecht ) Data Mining 25 / 53
What to do with missing values?
One can remove a tuple if one or more attribute values are missing.Danger: how representative and random is the remaining sample?Also, you may have to throw away a large part of the data!
One can remove attributes for which values are missing. Danger: thisattribute may play an important role in the model you want to induce.
You do imputation, i.e., you fill in a value. Danger: the values youguess may have a large influence on the resulting model.
Ad Feelders ( Universiteit Utrecht ) Data Mining 26 / 53
Missing Data Mechanisms: MCAR
Suppose we have data on gender and income.Gender is fully observed, income is sometimes missing.
MCAR: Income is missing completely at random.
G I
OI
For example: Pr(income = ?) = 0.1
There will be no bias if we remove tuples with missing income.Imputation:
If person is male, pick a random male with income observed and fill inhis value.If person is female, pick a random female with income observed and fillin her value.
Ad Feelders ( Universiteit Utrecht ) Data Mining 27 / 53
Missing Data Mechanisms: MAR
Probability that income is missing depends on gender. For example:
Pr(income = ?|gender=male) = 0.2
Pr(income = ?|gender=female) = 0.05
G I
OI
This time there will be bias if we remove tuples with missing income.Imputation (same as before, still works):
If person is male, pick a random male with income observed and fill inhis value.If person is female, pick a random female with income observed and fillin her value.
Ad Feelders ( Universiteit Utrecht ) Data Mining 28 / 53
Missing Data Mechanisms: MNAR
Probability that income is missing depends on value of income itself.
Pr(income = ?|income > 8000) = 0.4
Pr(income = ?|2000 < income < 8000) = 0.01
Pr(income = ?|income < 2000) = 0.2
G I
OI
We can’t “repair” this unless we have knowledge about the missingdata mechanism.
Ad Feelders ( Universiteit Utrecht ) Data Mining 29 / 53
Missing Data
Unfortunately, we cannot infer from the observed data alone whetheror not the missing data mechanism is MAR or MNAR.
We might have knowledge about the nature of the missing datamechanism however ...
Practice: if you don’t know, assume MAR and hope for the best.
Ad Feelders ( Universiteit Utrecht ) Data Mining 30 / 53
Construct Data
Quite often, the raw data is not in the proper format for analysis, forexample:
You have dates of birth and you guess age plays a role.
You have data on income and fixed expenses and you think disposableincome is important.
You have to analyze text data, for example the body of e-mailmessages to develop a spam filter. You could represent the text as abag-of-words.
Ad Feelders ( Universiteit Utrecht ) Data Mining 31 / 53
Integrate Data: The Mining Table
Integration of data from different databases (data fusion),for example:
the ‘same’ attribute has two different names.two attributes with the same name have a different meaning.the ‘same’ attribute has two different domains.
One-one relationships between tables are easy, but what to do with1-n relationships?
Ad Feelders ( Universiteit Utrecht ) Data Mining 32 / 53
One to Many Relationships: TV Viewing
Predict household composition from TV viewing behavior.
the program. The Telecast Id has a (m:1) relationship with each entry of the tableTelecast. The Telecast table contains information regarding the television broadcast,such as program duration (Duration Seconds), the broadcast date (Date), Channel Nameand Program Name.
Respondent Demographics Tv Program Viewing
TelecastProgram
Respondent IDPK
Household ID
Number of Children
IDPK
Respondent ID
Telecast ID
Offset Seconds
IDPK
Program ID
IDPK
Program Name
Channel NameDate
Duration Seconds
Date
Duration Seconds
m
1
1
m
1
m
1
Fig. 4.1.: An extremely simplified schema of the database containing the NNPM data usedin this thesis.
The Respondent Demographics table consists of 150 076 respondents, which are constantlymoving in and out the panel. In Section 4.1.1, we will discuss how we deal with this problem.These respondents are living in 46 205 households. The television viewing behaviour dataset (Tv Program Viewing) consists of almost 260 million viewing statements, where eachviewing statement consist of one respondent watching one program at a certain date timeand duration. As a consequence, when multiple respondents in the same household watcha program at the same time, it will generate the same number of viewing statements as thenumber of respondents watching that program.
4.1.1 Filtering and Unification
The data set presented in the previous section is too large to be handled and does not fitinto memory. Thus, we use only the first month of 2014 as our data set. This narrows thenumber of respondents down to 84 199 corresponding to 21 614 households. Time-shiftedviewing statements12 and viewing statements of visitors are not included to narrow thescope of this research. Syndicated programs are also removed, since at the time of writingthis data set was incomplete and the programs do not have a starting time or ending time.However, they do have a duration. Additionally, programs that are not watched or onlywatched by one person are removed. This reduces the number of programs from 7 109 to6 168.
The respondents are constantly subjected to change. Hence, Nielsen uses the unificationprocess to define a consistent sample universe for a specified reporting interval. This ensuresthat all respondents have the same opportunity to view a program [73]. In this thesis, weare also using the default unification process as defined by Nielsen. This requires that eachrespondent watches at least 75% of the days in January 2014. In addition, 16 householdshave inconsistent household level socio-demographics, i.e. one respondent reports he/she isliving with one child and the other with two children, even though they live in the same
12Time-shifted viewing are recorded programs that are watched on a different time than the actualbroadcast.
4.1 Data Processing 29
Ad Feelders ( Universiteit Utrecht ) Data Mining 33 / 53
Aggregating the data
Viewing behaviour has to be aggregated, for example:
Weekly viewing frequency of different programs.
Weekly viewing duration of different programs.
Weekly viewing frequency of different program categories.
etc.
Potentially results in a huge number of attributes.
Ad Feelders ( Universiteit Utrecht ) Data Mining 34 / 53
Some descriptive statistics
0 50
100
150
200
LAW & ORDER: SVU
NCIS
SPONGEBOB
AMC MOVIE
CASTLE
MOVIE
PAWN STARS
FOX NFC CHAMPIONSHIP
LAW & ORDER
FX MOVIE PRIME
Program
Average DurationHouseh
old
Structure
FFC
HHC
SF
SM
SP
F
SP
M
(a)Top
10program
sw
iththe
highestaverage
duration.
0 50
100
150
200
Program
Average Duration
Househ
old
Structure
FFC
HHC
SF
SM
SP
F
SP
M
(b)D
istributionofthe
averageduration
perprogram
overdifferent
householdcom
positions.
Fig.4.4.:D
istributionofthe
averageduration
ofprograms.
Togeta
betterunderstanding
ofthedata,we
constructascatter
plotofthetwo
programs
with
thehighestduration/frequency.T
hetop
2program
swith
thehighestaverage
durationare
chosenasthe
xand
y-axes(seeFigure
4.6).Itcan
beobserved
thatmostpointscluster
atthe
lowerleft
cornerof
thegraph
forall
householdstructures.
Creating
additionalgraphs
basedon
thetop
25program
sw
iththe
highestaverage
durationyield
300sim
ilargraphs,w
ithless
variancein
thelower
quadrant.W
ealso
observethat
thegraphs
among
differentclasses
aresim
ilar,albeitw
ithless
datapoints.
The
same
scatterplots
canalso
bem
adew
iththe
top2
programs
with
thehighest
averagefrequency.
Again,the
graphcontains
householdsthat
areclustered
towardsthe
lowerbottom
corner(see
Figure4.7).
Forhousehold
structuresw
ithno
childpresent,
weobserve
verylow
frequenciesfor
theprogram
SPONGEBOB.This
suggeststhatwe
canuse
thefrequency
ofSPONGEBOB
toidentify
whether
ahousehold
hasa
child.T
hepoints
inthe
durationscatter
plotsalso
seemto
bem
oredistributed,w
hilethe
pointsin
thefrequency
scatterplots
arem
oreconcentrated
intheaxes.Furthercreating
thesescatterplotsbasedon
thetop25
programsw
iththehighest
averagefrequency
yield300
similar
graphs,butno
obviouspatterns
arebe
observed.
Bycom
putingthe
averageduration
andfrequency
forallprogram
s,wecan
examine
thedistributions
oftheaverage
durationand
frequency.From
thedistribution
oftheaverage
duration,weobserve
thatthetop
2520program
swith
thehighestaverage
durationaccount
for95%ofthe
totalaverageduration
(seeFigure
4.8a).Forfrequency,2791
programsw
iththe
highestfrequency
accountfor
95%of
thetotal
averagefrequency
(seeFigure
4.8b).T
hus,about
40%of
theprogram
saccount
for95%
ofthe
durationand
frequency.T
his
4.2D
ataC
haracteristics35
Ad Feelders ( Universiteit Utrecht ) Data Mining 35 / 53
Modeling: Data Mining Tasks
Classification / Regression
Dependency Modeling (Graphical Models; Bayesian Networks)
Frequent Pattern Mining (Association Rules)
Subgroup Discovery (Rule Induction; Bump-hunting)
Clustering
Ranking
Ad Feelders ( Universiteit Utrecht ) Data Mining 36 / 53
Subgroup Discovery
Find groups of objects (persons, households, transactions, ...) that scorerelatively high (low) on a particular target attribute.
No condition
[49.7%,50.3%] 100,000
age in [19,24]
[54.6%,56.2%] 14,249
gender = m age in [19,24]
[52.8%,53.7%] 53,179 [60.2%,62.3%] 8,130
age in [19,24] carprice in [59000,79995]
[55.9%,59.6%] 2,831 [61.2, 67.4] 1,134
category = lease gender = m age in [19,24]
[50.7%,52.0%] 20,315 [53.5%,55.4%] 10,778 [59.4%,64.1%] 1,651
Ad Feelders ( Universiteit Utrecht ) Data Mining 37 / 53
Dependency Modeling: Coronary Heart Disease
Smoke (Y/ N)
Me nta l work
Phys. work
Syst. BP
Lipo ra tio
Ana mne sis
a
b
c
d
e
f
Ad Feelders ( Universiteit Utrecht ) Data Mining 38 / 53
Clustering
Put objects (persons, households, transactions, ...) into a number ofgroups in such a way that the objects within the same group are similar,but the groups are dissimilar.
Variable 1
Vari
able
2
Ad Feelders ( Universiteit Utrecht ) Data Mining 39 / 53
Ranking
For example:
Rank web pages with respect to their relevance to a query.
Rank job applicants with respect to their suitability for the job.
Rank loan applicants with respect to default risk.
...
Has similarities with regression and classification, but in ranking we areoften only interested in the order of objects.
Ad Feelders ( Universiteit Utrecht ) Data Mining 40 / 53
Components of DM algorithms
Data Mining Algorithms can often be regarded as consisting of thefollowing components:
1 A representation language: what models are we looking for?
2 A quality function: when do we consider a model to be good?
3 A search algorithm: how de we find some or all good or best models?
Ad Feelders ( Universiteit Utrecht ) Data Mining 41 / 53
Representation Languages
Representation languages define the set of all possible models, e.g.,
linear models: y = b0 + b1x1 + · · ·+ bnxn
association rules: X → Y
subgroups: X1 ∈ V1 ∧ · · · ∧ Xn ∈ Vn
classification trees
Bayesian networks (DAGs)
Ad Feelders ( Universiteit Utrecht ) Data Mining 42 / 53
Quality Functions
The quality score of a model often contains two elements:
How well does the model fit the data?
How complex is the model?
For example (regression)
score =N∑i=1
(yi − yi )2 + 2×# parameters
If independent test data is used, the quality score usually only considersthe fit on the test data.
Ad Feelders ( Universiteit Utrecht ) Data Mining 43 / 53
The data
●
●
●
●
●●
●
●
● ●
2 4 6 8 10
1020
3040
x
y
Ad Feelders ( Universiteit Utrecht ) Data Mining 44 / 53
Fitting a linear model
●
●
●
●
●●
●
●
● ●
2 4 6 8 10
1020
3040
x
y
Ad Feelders ( Universiteit Utrecht ) Data Mining 45 / 53
A fifth degree polynomial fits better
●
●
●
●
●● ●
●
● ●
2 4 6 8 10
1020
3040
50
x
y
Ad Feelders ( Universiteit Utrecht ) Data Mining 46 / 53
Over-fitting: linear model generalizes better
● ●
●
●
●
●
●
●
●
●
2 4 6 8 10
1020
3040
50
x
y2
Ad Feelders ( Universiteit Utrecht ) Data Mining 47 / 53
Search for the good models
Sometimes we can check all possible models, because there are ruleswith which to prune large parts of the search space; for example, the“a priori principle” in frequent pattern mining.
Usually we have to employ heuristics
A general search strategy, such as a hill-climber or a genetic(evolutionary) algorithm.Search operators that implement the search strategy on therepresentation language. Such as, a neighbour operator for hillclimbing and cross-over and mutation operators for genetic search.
Ad Feelders ( Universiteit Utrecht ) Data Mining 48 / 53
Search: example
In linear regression, we want to predict a numeric variable y from a set ofpredictors x1, . . . , xn. We might include any subset of predictors, so thesearch space contains 2n models. E.g., if n = 30, we have230 = 1, 073, 741, 824, i.e. about one billion models in the search space.
It is common to use a hill-climbing approach called stepwise search:
1 Start with some initial model, e.g. y = b0, and compute its quality.
2 Neighbours: add or remove a predictor.
3 If all neighbours have lower quality, then stop and return the currentmodel; otherwise move to the neighbour with highest quality andreturn to 2.
Ad Feelders ( Universiteit Utrecht ) Data Mining 49 / 53
Classical Text Book Approach (Theory Driven)
Specify hypothesis (model) of interest.The model is determined up to a fixed number of unknownparameters.
Collect relevant data.
Estimate the unknown parameters from the data.
Perform test, typically whether a certain parameter is zero,using the same data!
It is allowed to use the same data for fitting the model and testing themodel, because we did not use the data to determine the modelspecification.
Ad Feelders ( Universiteit Utrecht ) Data Mining 50 / 53
Data Mining: Sherlock Holmes Inference
“No data yet…It is a capitalmistake to theorize beforeyou have all the evidence. It biases the judgements.”
A study in scarlet
Ad Feelders ( Universiteit Utrecht ) Data Mining 51 / 53
Data Mining (Data Driven)
A simple analysis scenario could look like this:
Formulate question of interest.
Select potentially relevant data.
Divide the data into a training and test set.
Use the training set to fit (many) different models.
Use the test set to compare how well these models generalize.
Select the model with the best generalization performance.
In this scenario, we cannot use the training data both to fit models and totest models!
Ad Feelders ( Universiteit Utrecht ) Data Mining 52 / 53
K-fold cross-validation
Divide the database in k parts
For each of the k parts do
Use the remaining k − 1 parts to train the model.Predict on the part that was not used for training.Compute accuracy of the predictions.
The average of the k accuracies is an estimate of the accuracy of themodel induced from the complete database.
Ad Feelders ( Universiteit Utrecht ) Data Mining 53 / 53