Introduction to Data Mining (in Astronomy)
ADASS 2007 Tutorial
Sabine McConnell, Department of Computer Science/Studies
Trent University

Outline
• Introduction
• The Data
• Classification
• Clustering
• Evaluation of Results
• Increasing the Accuracy
• Some Issues and Concerns
• Weka
• References
What is data mining?
“The non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.”
(Piatetsky-Shapiro)
“The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”
(Hand)
Data mining is a combination of:
• machine learning
• statistics
• databases
• visualization
• application domain
(Some) Applications of data-mining techniques
• Science: bioinformatics, discovery of drugs, astronomy
• Government: law enforcement, income tax, anti-terror
• Business: Market basket analysis, targeted marketing
• Engineering: Satellite navigation
Data mining in astronomy
• classification of stars, galaxies and planetary nebulae, based both on images and on spectral parameters
• star/galaxy separation
• forecasting of sunspots and of geomagnetic storms from solar wind
• forecasting of seeing
• gravitational wave signal detection
• antimatter search in cosmic rays
• selection of quasar candidates
• detection of expanding HI shells
• ...many more
View of the Dataset (= Matrix)

object ID  _RAJ2000    _DEJ2000   distance  flags  x size  y size  U-B   error   Bar?  class
134633     00 03 09.1  +21 57 34  398       A      1629    1654    14.4  low     no    Irr
3555432    00 03 48.8  +07 28 45  113       D      939     1332    14    medium  yes   Spiral
3432223    00 03 58.6  +20 45 07  835       A      1713    2219    12.7  low     no    Ell
124123     00 05 53.0  +22 32 14  398       A      1092    1400    0     low     no    Irr
333456     00 06 21.4  +17 26 03  398       A      1121    1419    15.1  low     no    Irr
3355478    00 07 16.7  +27 42 31  398       A      1343    1810    13.4  high    no    Spiral
875        00 07 16.1  +08 18 03  879       A      1095    1281    14.6  medium  yes   Spiral
33378      00 08 10.7  +27 00 15  578       A      1154    1493    14.4  high    no    Irr
569433     00 08 20.5  +40 37 54  398       A      1661    1683    0     low     no    Irr
3321347    00 09 54.3  +25 55 28  778       A      1961    2180    12.5  low     no    Spiral
5464648    00 10 47.7  +33 21 18  79        B      929     1359    13.5  high    no    Ell
454345476  00 12 49.9  +77 47 44  398       A      1393    1671    0     low     no    Irr
4646788    00 13 27.5  +17 29 16  398       A      1141    1573    14.2  medium  yes   Spiral

• levels of measurement (nominal, ordinal, interval, ratio)
• numeric vs. categorical
The data-mining process (Knowledge discovery in databases)
Data Preparation: Preprocessing and Algorithms
• Neural networks like data to be scaled
• Decision trees do not care about scaling, but work better with discrete attributes that have small numbers of possible values
• Neural networks can handle irrelevant or redundant attributes, while these may lead to large decision trees
• Neural networks do not like noisy data, especially for small datasets, while decision trees do not care much about noise
• Nearest-neighbour approaches can handle noise if a certain parameter is adjusted
• Distance-based approaches do not work well if the attributes are not equally weighted, and typically work with numerical data only
• Expectation-Maximization approaches can deal with missing data, but k-means techniques require substitution of missing data
• ...

A Comparison of Neural Network Algorithms and Preprocessing Methods for Star-Galaxy Discrimination, D. Bazell and Y. Peng, Astrophysical Journal Supplement Series:47-55, May 1998
Data Preparation Issues
• transformation of attribute types
• selection of attributes
• transformation of attributes
• normalization of attribute values
• sampling
• missing values
Data preparation: transformation of attribute types
• categorical to numeric
• numeric to categorical
Transformation: categorical to numeric
• map to circle, sphere or hypersphere
  – may work if categories are ordinal (e.g. days of the week)
  – usually produces poor results otherwise
• map to generalized tetrahedron
  – to uniquely represent k possible attribute values, we need k new attributes
  – example: an attribute with three possible values (e.g. circle, square, triangle) maps to three new attributes with the values (1,0,0) for circle, (0,1,0) for square, and (0,0,1) for triangle
  – works for both ordinal and nominal data
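The tetrahedron mapping above is what is now usually called one-hot encoding. A minimal sketch in plain Python (the `one_hot` helper is an illustrative name, not a library function):

```python
def one_hot(values):
    """Map a categorical attribute to k new binary attributes, one per value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

# circle -> (1,0,0), square -> (0,1,0), triangle -> (0,0,1)
encoded = one_hot(["circle", "square", "triangle"])
```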
Transformation: numeric to categorical
• some data-mining algorithms require data to be categorical
• we may have to transform continuous attributes into categorical attributes: discretization
• or transform continuous and discrete data into binary data: binarization
• we also have to distinguish between unsupervised (no use of class information) and supervised (use of class information) discretization methods
  – unsupervised: equal-width or equal-frequency binning, k-means, visual inspection
  – supervised: use some measure of impurity of bins
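The two unsupervised binning schemes can be sketched in plain Python (`equal_width_bins` and `equal_frequency_bins` are illustrative names; the sketch assumes the attribute has more than one distinct value):

```python
def equal_width_bins(values, k):
    """Unsupervised discretization: split the attribute range into k equal-width bins."""
    lo, hi = min(values), max(values)   # assumes lo < hi
    width = (hi - lo) / k
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Each bin receives (roughly) the same number of samples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins
```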
Data preparation: attribute selection
• remove irrelevant or redundant attributes to reduce the dimensionality of the dataset
• preserve the probability distribution of the classes present in the data as much as possible
  – filter approach: start with the empty set, add attributes one at a time
  – wrapper approach: start with the full set, remove attributes one at a time
  – reduce search time by combining the two methods
  – alternative: use the upper levels of a decision tree, provided there are class labels in the data
Data preparation: transformation of attributes
• two popular methods:
  – wavelet transforms
  – principal component analysis
• express data in terms of new attributes
• reduce the number of attributes by truncating
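A bare-bones PCA reduction along these lines might look as follows (NumPy sketch, assuming numeric data: eigendecomposition of the covariance matrix, keeping only the top components):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project data onto its top principal components (dimensionality reduction)."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    W = eigvecs[:, order[:n_components]]
    return Xc @ W                           # data expressed in the new attributes
```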
Data preparation: normalization
• min-max normalization
• z-score normalization (standardization)
• normalization by decimal scaling
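The three normalization schemes can be sketched in plain Python (illustrative helper names; the sketches assume a numeric attribute with more than one distinct value):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)   # assumes lo < hi
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """z-score standardization: zero mean, unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of ten that brings all |values| below 1."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]
```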
Data preparation: sampling
• reduce the number of objects (rows) in the dataset
  – simple random sample without replacement
  – simple random sample with replacement
  – cluster sample
  – stratified sample
Data preparation: sampling
• stratified sample: preserves the original distribution of classes
• undersampling/oversampling: equalizes the distribution of classes
Missing values
• data may be missing completely at random, missing at random, or not missing at random (censored)
• depending on why the data is missing, we can use:
  – casewise data deletion
  – mean substitution
  – regression
  – hot-deck methods
  – maximum-likelihood methods
  – multiple imputation
  – ...
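Mean substitution, the simplest of these, might be sketched as (plain Python; `None` marks a missing entry, and the attribute is assumed numeric):

```python
def mean_substitute(values, missing=None):
    """Replace missing entries of a numeric attribute with the attribute mean."""
    present = [v for v in values if v is not missing]
    mean = sum(present) / len(present)
    return [mean if v is missing else v for v in values]

# missing entries become the mean of 14.4 and 12.7
filled = mean_substitute([14.4, None, 12.7, None])
```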
Building the model: data-mining categories
• Classification
• Clustering
• Visualization
• Association rule mining
• Summarization
• Outlier detection
• Deviation detection
• ...
Models vs. Patterns
• models:
  – large-scale description of the data
  – describe/predict/summarize the most common cases
• patterns:
  – small scale
  – local models
  – association rules, outliers
  – often the most interesting objects
Predictive vs. Descriptive Techniques
• data-mining techniques can be either
  – predictive (supervised)
  – descriptive (unsupervised)
• predictive: predict a (discrete) class attribute based on other attribute values; this is like learning from a teacher → classification
• descriptive: discover the structure of the data without prior knowledge of class labels → clustering
• evolving area: semi-supervised (combines predictive and descriptive methods)
Example: Automated morphological classification of APM galaxies by supervised artificial neural networks, Naim et al., MNRAS 275, 567-590 (1995)
• 830 galaxy images (diameter limited) from the APM Equatorial Catalogue of Galaxies
• 24 parameters (inputs), including ellipticity, surface brightness, bulge size, arm number, length and intensity
• output: Revised Hubble Type of galaxy
• galaxies classified by 6 human experts according to the Revised Hubble System, and by a supervised neural network
• result: the rms error for classification by the networks, as compared with the mean classifications (1.8 Revised Hubble Types), is comparable to the rms dispersion between the experts
Predictive data mining: classification
(Learn a model to predict future target variables)
Given a set of points from known classes, what is the class of a new point? Is the new point a star or a galaxy?

[Figure: scatter plot of stars and galaxies in attribute space]
Predictive data mining: classification (Decision Trees)
if y > 2 then
    if x > 5 then blue
    else
        if x > 4 then red
        else blue
else
    if x > 2 then red
    else blue

[Figure: the corresponding decision tree (y > 2? → x > 5? / x > 2?, then x > 4?) and the axis-parallel partition of the x-y plane it induces]
Decision Trees: choosing a splitting criterion

entropy(t) = − Σ_{i=0..c−1} p(i|t) log₂ p(i|t)

gini(t) = 1 − Σ_{i=0..c−1} [p(i|t)]²

classification_error(t) = 1 − max_i [p(i|t)]

where p(i|t) is the fraction of samples of class i at node t and c is the number of classes.
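The three impurity measures translate directly into code (plain Python sketch; `p` is the list of class probabilities p(i|t) at a node):

```python
import math

def entropy(p):
    """Entropy impurity of a node given its class probabilities p(i|t)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(pi * pi for pi in p)

def classification_error(p):
    """Classification error: 1 minus the largest class probability."""
    return 1.0 - max(p)
```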
Decision tree: measuring the impurity of a node

goal: a large change in impurity I after the split

gain Δ = I(parent) − Σ_{j=1..k} [N(v_j)/N] · I(v_j)

where N is the number of samples at the parent node and N(v_j) the number at child v_j; this is called the information gain when entropy is the impurity measure.

Example: a parent node with 3 samples of class A and 4 of class B is split into a left child with 3 samples and a right child with 4.

             class A   class B
parent        3/7       4/7
left child    2/3       1/3
right child   1/4       3/4

gini(parent)      = 1 − (3/7)² − (4/7)² ≈ 0.5
gini(left child)  = 1 − (2/3)² − (1/3)² ≈ 0.45
gini(right child) = 1 − (1/4)² − (3/4)² = 0.375

gain = 0.5 − (3/7)(0.45) − (4/7)(0.375) ≈ 0.1

Repeat for all possible splits and choose the best split.
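The worked example can be checked numerically (plain Python; with exact fractions the gain comes out near 0.085, which the slide's rounded inputs of 0.5 and 0.45 turn into roughly 0.1):

```python
def gini(p):
    return 1.0 - sum(pi * pi for pi in p)

parent = gini([3/7, 4/7])   # ~0.49 (rounded to 0.5 on the slide)
left   = gini([2/3, 1/3])   # ~0.44 (rounded to 0.45 on the slide)
right  = gini([1/4, 3/4])   # exactly 0.375
# weight each child by its fraction of the parent's samples (3/7 and 4/7)
gain = parent - (3/7) * left - (4/7) * right
```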
Decision Trees: extensions
• oblique decision trees: allow test conditions that involve multiple attributes
• regression trees: the value assigned to a datum is the average of the values in its node
• random forests: build multiple decision trees that include a random factor when choosing the attributes to split on

Characteristics of Decision Trees
• decision boundaries are typically axis-parallel
• can handle both numeric and nominal attributes
• for nominal attributes, decision trees tend to favor attributes with larger numbers of possible values as splitting criteria
• the runtime is dominated by the sorting of numeric attributes; classification is therefore fairly fast in typical settings
• can easily be converted to (possibly suboptimal) rule sets
• pruning is recommended to reduce tree complexity; the pruning strategy is more important than the choice of splitting criterion
• robust to noise
Example: Decision Trees for Automated Identification of Cosmic-Ray Hits in Hubble Space Telescope Images, Salzberg et al., Publications of the Astronomical Society of the Pacific 107:279-288, March 1995
• oblique decision tree, starting at random locations for the hyperplanes
• overcomes local maxima by perturbing the hyperplanes and restarting the search at a new location
• compares results from 5 different decision trees
• reduction of the feature set; use of decision trees to confirm labeling
• over 95% accuracy for single, unpaired images
Predictive data mining: classification (Neural Networks)
• more complex decision borders
• more accurate
• may overfit the data

[Figure: feed-forward network with input layer, hidden layers, and output layer]
Neural Networks: Backpropagation
randomly initialize weights
repeat until stopping criterion satisfied
    for each sample do
        1. present sample to input nodes
        2. propagate data through layers, using weights and activation functions
        3. calculate results at output nodes
        4. determine error at output nodes
        5. propagate error backwards to adjust the weights
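The loop above can be sketched for a single hidden layer (NumPy; XOR as a toy dataset, sigmoid activations, and a fixed learning rate are illustrative choices, not the tutorial's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)  # toy inputs
y = np.array([[0], [1], [1], [0]], float)              # targets (XOR)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# randomly initialize weights (2 inputs -> 4 hidden -> 1 output)
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)

for epoch in range(5000):            # repeat until stopping criterion satisfied
    h = sigmoid(X @ W1 + b1)         # steps 1-2: propagate data through layers
    out = sigmoid(h @ W2 + b2)       # step 3: results at the output nodes
    err = out - y                    # step 4: error at the output nodes
    # step 5: propagate the error backwards and adjust the weights
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0)
```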
Neural Networks: Extensions
• Madalines
• Adaptive Multilayer Networks
• Prediction Networks
• Winner-Take-All Networks
• Counterpropagation Networks
• Learning Vector Quantizers
• Principal Component Analysis Networks
• Hopfield Networks
• ...
Applications of Neural Networks in Astronomy:
• star/galaxy separation
• spectral and morphological classification of galaxies
• spectral classification of stars
• determining the number of binary stars in a cluster
• reducing input dimensionality
• classification of planetary nebulae
• prediction of solar flux and sunspots
• classification of asteroid spectra
• adaptive optics
• spacecraft control
• interpolation of the HI distribution in Perseus
• classification of white dwarfs
• detection and classification of CCD defects
• search for antimatter
• ...
Example: The use of Neural Networks to probe the structure of the nearby universe, d'Abrusco et al., to appear in the proceedings of the Astronomical Data Analysis IV workshop held in Marseille in 2006
• supervised neural network applied to SDSS data
• training data: spectroscopic, containing 449 370 galaxies
• training data divided into training, validation, and test sets
• output: distance estimates for roughly 30 million galaxies distributed over 8 000 sq. deg.
• provides a list of candidate AGN and QSOs
Characteristics of Artificial Neural Networks
• slow
• poor interpretability of results
• able to approximate any target function
• can learn to ignore irrelevant or redundant attributes
• easy to parallelize
• may converge to a local minimum because of greedy optimization, but convergence to the global minimum can be achieved through simulated annealing
• choice of network structure is non-trivial and time-consuming
• sensitive to noise (a validation set may help here)
Lazy learners: nearest-neighbour techniques
• lazy learners do not build models
• when a new datum is to be classified, it is assigned to the majority class of its neighbours
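A minimal nearest-neighbour classifier along these lines (plain Python; the star/galaxy points and labels are illustrative):

```python
from collections import Counter

def knn_classify(train, new_point, k=3):
    """Assign the new datum to the majority class among its k nearest neighbours."""
    def dist2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))
    neighbours = sorted(train, key=lambda item: dist2(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0, 0), "star"), ((1, 0), "star"),
         ((5, 5), "galaxy"), ((6, 5), "galaxy"), ((5, 6), "galaxy")]
label = knn_classify(train, (4, 4))  # nearest neighbours are galaxies
```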
Characteristics of Nearest-Neighbour Algorithms
• slow
• does not work well with noisy data
• does not provide the user with a model
• new data can easily be incorporated because of the lack of a model
• easy to parallelize
• may not work well if attributes are not equally relevant
• decision boundaries are piece-wise linear
Difference between predictive and descriptive approaches
• lack of class labels in the descriptive case: we need to establish a correspondence between clusters and real-life types of objects
• for predictive approaches, it is easier to see if there is agreement with human experts
• evaluation of descriptive approaches is much harder
• descriptive approaches avoid the bias that may be introduced by existing class labels, but introduce bias of their own (choice of distance measure, algorithm, and number of clusters)
Descriptive data mining: clustering
Goal: find clusters of similar objects (find groups of similar galaxies)
• which algorithm should I use?
• when are objects similar?
Overview of Clustering Techniques
• major distinction: partitioning-based vs. hierarchical methods (fixed number vs. variable number of clusters)
• hierarchical methods are further divided into agglomerative and divisive clustering
  – agglomerative methods initially assign each sample to a separate cluster, then merge the clusters that are closest to each other in successive steps
  – divisive methods start with one cluster containing all data, then repeatedly split the cluster(s) until each sample belongs to a separate cluster

[Figure: hierarchical clustering (produces a dendrogram) vs. partition-based clustering]
Distance measures for objects
• Manhattan distance
• Euclidean distance
• Squared Euclidean distance
• Chebychev distance
• Hamming distance
• Percent disagreement
• ...
Distance measures for clusters
• minimum distance (single linkage, nearest-neighbour):
  d_min = min |p − p'|, where p and p' are from different clusters
• maximum distance (complete linkage, farthest-neighbour):
  d_max = max |p − p'|, where p and p' are from different clusters
• mean distance:
  d_mean = |m_i − m_j|, where m indicates a cluster center
• average distance:
  d_average = (1/(n_i n_j)) Σ_p Σ_p' |p − p'|, where p and p' are from different clusters

The choice of distance measure for clusters will determine the cluster shape!
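The cluster-distance measures translate into a few lines each (plain Python sketch; `d` is any object-level distance, here the Euclidean distance):

```python
def d_min(A, B, d):
    """Single linkage: smallest pairwise distance between the two clusters."""
    return min(d(p, q) for p in A for q in B)

def d_max(A, B, d):
    """Complete linkage: largest pairwise distance between the two clusters."""
    return max(d(p, q) for p in A for q in B)

def d_average(A, B, d):
    """Average of all pairwise distances: (1/(n_i n_j)) * sum of d(p, p')."""
    return sum(d(p, q) for p in A for q in B) / (len(A) * len(B))

euclid = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
A = [(0, 0), (1, 0)]
B = [(4, 0), (5, 0)]
distances = (d_min(A, B, euclid), d_max(A, B, euclid), d_average(A, B, euclid))
```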
Descriptive data mining: clustering (K-means)
(numerical data only)
1) Randomly pick k cluster centers
2) Assign every object to its nearest cluster center
3) Move each cluster center to the mean of the objects assigned to it
4) Repeat steps 2 and 3 until a stopping criterion is satisfied

[Figure sequence: step 1 randomly chooses k cluster centers; step 2 assigns each point to the closest center; step 3 moves the centers to the means of their clusters; steps 4-5 reassign the points and move the centers again; step 6 reassigns and moves once more, or terminates]
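The k-means steps can be sketched as follows (plain Python on 2-D numeric points; random initialization and a fixed iteration count stand in for step 1 and the stopping criterion):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D numeric data."""
    random.seed(seed)
    centers = random.sample(points, k)            # step 1: pick k centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):                   # step 4: repeat
        clusters = [[] for _ in range(k)]
        for p in points:                          # step 2: nearest center
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        for i, c in enumerate(clusters):          # step 3: move to the mean
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

# two well-separated blobs split cleanly into two clusters of three points
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
```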
Characteristics of k-means
• requires a user-specified number of clusters
• often converges to a local optimum
• does not perform well in the presence of outliers and noise
• is only useful when the mean of a cluster is defined, and is therefore most often used with numerical data only
• biased towards spherical clusters
• cannot handle missing data
Other clustering approaches
• EM
• k-medoids
• model-based
• grid-based
• density-based
• ...
Evaluating the model
How can we evaluate (predictive and descriptive) models?
Evaluation methods
• holdout method: use training and test sets
• stratified holdout: preserve the class distribution
• repeated holdout
• k-fold cross-validation
• leave-one-out cross-validation
• bootstrap sample: the 0.632 bootstrap
Training and test sets
• split the available data into two sets
• one set is used to build the model
• the other set is used to evaluate the model
• typical split: 2/3 of the data as training set, the rest as test set
• does not work well for noisy data and small datasets
• if a validation set is needed as well, the data available for training is reduced even further
• if the test set is not a representative sample of the training set, the accuracy of the model may be underestimated
Cross-Validation
• split the data into k folds
• use k−1 folds for training, 1 fold for testing
• repeat k times so that each fold is used for testing once
• repeat the whole process x times and average the results
• a typical value for both x and k is 10
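A k-fold split can be sketched as an index generator (plain Python; assigning folds by stride is one simple choice, and shuffling or stratification would be added on top of this):

```python
def k_fold_indices(n, k):
    """Split n sample indices into k folds; yield (train, test) index lists so
    that each fold serves as the test set exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# every sample index ends up in exactly one test fold
seen = [t for _, test in k_fold_indices(10, 5) for t in test]
```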
Increasing the accuracy
• boosting
• bagging
• randomization
• ensembles
Bias-Variance Decomposition
• the classification error is the sum of the bias, the variance, and the Bayes error rate:
  e_c = bias + variance + e_B
• bias: measures how close the classifier will be, on average, to the function to be learned
• variance: measures how much the estimates of the classifier vary with changes in the dataset
• Bayes error rate: the minimum error rate, associated with the Bayes optimal classifier
Increasing accuracy: bagging
• reduces variance
• sample with replacement to create multiple datasets
• build a classifier on each dataset to produce multiple models
• combine the individual models into an overall model
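The sampling and voting steps of bagging might be sketched as (plain Python; the models are assumed to be callables that return a class label):

```python
import random
from collections import Counter

def bootstrap_samples(data, n_models, seed=0):
    """Sample with replacement to create one training set per model."""
    random.seed(seed)
    return [[random.choice(data) for _ in range(len(data))]
            for _ in range(n_models)]

def bagged_predict(models, x):
    """Combine the individual models into an overall model by majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```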
Increasing accuracy: boosting
• builds multiple models from the dataset
• each datum is associated with a weight
• weights are adjusted over time:
  – decrease the weight for data that are easy to classify
  – increase the weight for data that are hard to classify
  – build another model
• the final model is constructed from all models, weighted by a score
Increasing accuracy: ensembles
• generalizes the idea of bagging
• build multiple models that can vary in:
  – the input data
  – initial parameters: starting points, number of clusters, ...
  – learning algorithms
• can be very powerful if the learning algorithms are weak (unstable) learners, i.e. the result changes substantially with a change in the dataset
Many more data-mining techniques…
• association rules
• sequence mining
• random forests
• Support Vector Machines
• Naïve Bayes
• genetic algorithms
• ...
Data Mining with Genetic Algorithms: Fitting a Galactic Model to An All-Sky Survey, Larsen and Humphreys, AJ, 125:1958-1979, April 2003
• genetic algorithms: survival of the fittest
  – a fitness function evaluates the population
  – the population changes over time: random mutations, crossover
  – the population is evaluated at each timestep; only the fittest survive
• derives global parameters for a Galactic model
• magnitude-limited star counts from the APS catalog
• produces model counts for multi-directional data
Step-by-step guide: data preparation
• consider the size of the dataset:
  – number of attributes
  – number of samples per class
• transform attributes if necessary
• normalize/standardize the data
• select attributes
• reduce dimensionality if possible (PCA for sparse data, DWT for data with large numbers of attributes)
Step-by-step guide: evaluate the model
• 10-fold cross-validation
• never evaluate the model on the training data
• be careful when comparing models derived with different techniques
Step-by-step guide: build the model
• descriptive techniques:
  – visualization
  – k-means algorithm
  – EM algorithm
• predictive approaches:
  – decision trees
  – neural networks
• combine both through semi-supervised learning
(Some) Data-mining concerns:
• curse of dimensionality
• local minima
• existing classifications
• distributed nature of the data
• how can we describe the models in general terms?
• can we standardize the process somehow?
• privacy issues
• missing values
• normalization issues
• multiple measurements
• noisy data
• error bars
• cost of the models?
• ...
Crisp-DM
• Cross Industry Standard Process for Data Mining
• http://www.crisp-dm.org/
• describes commonly used approaches, mainly from a business perspective
• non-proprietary, documented, industry- and tool-independent model
• describes best practices and structures of the data-mining process, similar to our model
Predictive Model Markup Language (PMML)
• XML-based language
• defines and shares statistical and data-mining models among applications (e.g. DB2, SAS, SPSS, ...)

<?xml version="1.0" ?>
<PMML version="2.0">
  <Header copyright="Copyright (c) 2001, Oracle Corporation. All rights reserved.">
    <Application name="Oracle 9i Data Mining" version="9.2.0" />
  </Header>
  <DataDictionary numberOfFields="1">
    <DataField name="item" optype="categorical" />
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="PETAL_LENGTH">
      <Discretize field="PETAL_LENGTH">
        <DiscretizeBin binValue="1-1.59">
          <Interval closure="closedOpen" leftMargin="1.0" rightMargin="1.59" />
        ... (excerpt truncated in the original)
Curse of Dimensionality
• the number of samples needed increases with the dimensionality of the data
• data-mining algorithms often scale more than linearly in the number of attributes
Distributed Data Mining
• Meta-learning
• Collective Data Mining Framework
• Data partitions/Ensembles
Weka Machine Learning Workbench
• available (no cost) at http://www.cs.waikato.ac.nz/ml/weka
Weka interfaces

arff format:

@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
3,4.5,4.5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
2,2,2.5,?,?,35,?,?,6,'yes',12,'average',?,?,?,?,'good'
3,6.9,4.8,2.3,?,40,?,?,3,?,12,'below_average',?,?,?,?,'good'
2,3,7,?,?,38,?,12,25,'yes',11,'below_average','yes','half','yes',?,'good'
Commercial Data-Mining Software
• Clementine
• Enterprise Miner
• Insightful Miner
• Intelligent Miner
• Microsoft SQL Server 2005
• MineSet
• Oracle Data Mining
• CART
• ...
References
• Introduction to Data Mining, P. Tan, M. Steinbach, and V. Kumar, Addison Wesley, 2006
• Data Mining: Practical Machine Learning Tools and Techniques, I. Witten and E. Frank, Morgan Kaufmann, 2005
• Data Mining: Concepts and Techniques, J. Han and M. Kamber, Morgan Kaufmann, 2006
References
• http://people.trentu.ca/sabinemcconnell/
• www.kdnuggets.com
• http://www.twocrows.com/glossary.htm