Data Mining in Astronomy
Zhang Yanxia
China-VO Group
2006.11.30 in Guilin
Chinese Virtual Observatory
11/29-12/03China-VO 2006, Guilin 2
Outline
• Why
• What
• How
• Example
• Challenge
• Summary
Astronomy facing “data avalanche”
[Figure: the sky at different wavelengths: ROSAT ~keV, DSS optical, 2MASS 2 μm, IRAS 25 μm, IRAS 100 μm, GB 6 cm, NVSS 20 cm, WENSS 92 cm]
Necessity Is the Mother of Invention
DM&KDD
Issues in Astronomy
• Compression (e.g. galaxy images and spectra)
• Classification (e.g. stars, galaxies, or gamma-ray bursts)
• Reconstruction (e.g. of blurred galaxy images, or of the mass distribution from weak gravitational lensing)
• Feature extraction (e.g. signature features of stars, galaxies and quasars)
• Parameter estimation (e.g. stellar parameter measurement, photometric redshift prediction, orbital parameters of extra-solar planets, or cosmological parameters)
• Model selection (e.g. are there 0, 1, 2, ... planets around a star, or is a cosmological model with non-zero neutrino mass more favorable?)
Ofer Lahav, 2006, astro-ph/0610703: summary of the 4th meeting on "Statistical Challenges in Modern Astronomy", held at Penn State University in June 2006
Science Requirements for DM
(Borne K D, 2001, Proc. of the MPA/ESO/MPE Workshop, 671)
Cross-Identification - refers to the classical problem of associating the source list in one database to the source list in another.
Cross-Correlation - refers to the search for correlations, tendencies, and trends between physical parameters in multi-dimensional data, usually across databases.
Nearest-Neighbor Identification - refers to the general application of clustering algorithms in multi-dimensional parameter space, usually within a database.
Systematic Data Exploration - refers to the application of the broad range of event-based and relationship-based queries to a database in the hope of making a serendipitous discovery of new objects or a new class of objects.
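Cross-identification, as defined above, can be illustrated with a brute-force positional match. This is only a minimal sketch: the function names, the 3 arcsec matching radius and the toy coordinates are assumptions made for illustration, and real catalogues would need a spatial index rather than a double loop.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))

def cross_identify(cat_a, cat_b, radius_arcsec=3.0):
    """For each (ra, dec) in cat_a, return the index of the nearest
    cat_b source within the matching radius, or None if there is none."""
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for ra_a, dec_a in cat_a:
        best_index, best_sep = None, radius_deg
        for j, (ra_b, dec_b) in enumerate(cat_b):
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_index, best_sep = j, sep
        matches.append(best_index)
    return matches

# Toy source lists (made-up positions in degrees), e.g. optical vs infrared.
optical = [(10.0000, 20.0000), (150.5000, -5.2500)]
infrared = [(150.5002, -5.2501), (10.0003, 20.0001), (200.0, 45.0)]
matches = cross_identify(optical, infrared)
print(matches)
```

Each optical source is paired with at most one infrared source; a source with no counterpart inside the radius is reported as None.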
KDD: Opportunity and Challenges
Data Rich, Knowledge Poor (the resource)
Enabling Technology (Interactive MIS, OLAP, parallel computing, Web, etc.)
Competitive Pressure
Data Mining Technology Mature
KDD
KDD: A Definition
10^6-10^12 bytes: never see the whole data set or put it in the memory of computers
What knowledge? How to represent and use it?
Data mining algorithms?
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
Benefits of Knowledge Discovery
[Figure: diagram with axes labeled Volume and Value, showing EDP, MIS and DSS; other labels: Generate, Rapid Response, Disseminate]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems
DM: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Work at each process of DM
[Figure: bar chart (vertical axis from 0 to 60) of the work at each stage: DM object, data preparation, data processing, analysis and evaluation]
Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Regression: mapping a data item to a real-valued prediction variable.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Summarization: finding a compact description for a subset of data.
• Dependency modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data.
Feature selection
• Filter method
• Wrapper method
• Embedded method
• Feature weighted method
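As an illustration of the filter method listed above, the sketch below ranks features by the absolute Pearson correlation between each feature and the class label, without consulting any classifier. The scoring function, the feature names and the toy values are assumptions chosen for illustration; other filter scores (for example information gain) would slot into the same structure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def filter_select(rows, labels, names, k):
    """Keep the k features whose |correlation| with the label is largest."""
    scored = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scored.append((abs(pearson(column, labels)), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Made-up data: two informative features and one uninformative one.
names = ["color_1", "color_2", "noise"]
rows = [
    (0.2, 0.4, 0.5),
    (0.3, 0.5, 0.1),
    (0.9, 0.6, 0.4),
    (1.0, 0.9, 0.2),
]
labels = [0, 0, 1, 1]
selected = filter_select(rows, labels, names, k=2)
print(selected)
```

A wrapper method would instead score feature subsets by the accuracy of a chosen classifier, which is more expensive but tailored to that classifier.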
Feature extraction
• PCA
• Factor analysis (Principal FA/Maximum Likelihood FA)
• Projection pursuit
• ICA
• Non-linear PCA/ICA
• Random projection
• Principal curves
• MDS
• LLE
• ISOMAP
• Topological continuous map
• Neural network
• Vector quantization
• Kernel PCA/ICA
• LDA (linear discriminant analysis)
• QDA (quadratic discriminant analysis)
• FDA (Fisher discriminant analysis)
• GDA (generalized discriminant analysis)
• KDDA (kernel direct discriminant analysis)
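To make the first entry of the list above concrete, here is a minimal sketch of PCA that finds only the leading principal component by power iteration on the sample covariance matrix. The function name, the iteration count and the toy data are assumptions; in practice a linear-algebra library would compute all components at once.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, variance along it, feature means)."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (divides by n - 1).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration; assumes the start vector is not orthogonal
    # to the leading eigenvector.
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    cov_vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
    variance = sum(vec[a] * cov_vec[a] for a in range(d))
    return vec, variance, means

# Toy data lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, variance, means = first_principal_component(rows)
# One-dimensional feature extracted from each two-dimensional point.
scores = [sum((row[j] - means[j]) * direction[j] for j in range(2)) for row in rows]
print(direction, variance)
```

Here the two correlated inputs are replaced by a single extracted feature (the projection onto the leading direction) with no loss of information, because the toy points are perfectly collinear.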
Classification Methods
• Based on statistical theory: SVMs, ML, LDA, FDA, QDA, KNN
• Based on NN: LVQ, RBF, PNN, KSOM, BBN, SLP, MLP
• Based on Decision Tree: REPTree, RandomTree, CART, C5.0, J48, DecisionStump, RandomForest, NBTree, AC2, Cal5, ADTree, KDTree
• Based on Decision Rule: Decision Table, CN2, ITrule, AQ
• Based on Bayesian theory: Naive Bayes classifier, NBTree
• Based on meta learning: AdaBoost, boosting, bagging
• Based on evolution theory: genetic algorithm
• Based on fuzzy theory: fuzzy set, rough set
• Ensembles of classifiers
Data Mining algorithm patterns
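Of the classifiers listed above, KNN is the simplest to write down. The following is a minimal sketch assuming Euclidean distance and majority voting; the function name, the class labels and the two-feature toy sample are hypothetical and not taken from the talk.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up two-feature training sample (for instance, two colour indices).
points = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
          (0.90, 0.80), (0.80, 0.90), (0.85, 0.85)]
labels = ["star", "star", "star", "quasar", "quasar", "quasar"]

print(knn_predict(points, labels, (0.2, 0.2)))
print(knn_predict(points, labels, (0.7, 0.9)))
```

There is no training step: all the work happens at query time, which is why tree structures such as the KD-tree are often used to speed up the neighbor search on large catalogues.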
Regression Methods
• (Penalized) logistic regression
• Bayesian regression analysis
• Additive regression
• Locally weighted regression
• Voted perceptron network
• Projection pursuit regression
• Recursive partitioning regression
• Alternating conditional expectation
• Stepwise regression
• Recursive least squares
• Fourier transform regression
• Rule-based regression
• Principal component regression
• Instance-based regression
• Multivariate adaptive regression splines
• Regression trees (CART, RETIS, M5, random forest, KDTree)
• Simple windowed regression
• SVM
• NN
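As one example from the list above, locally weighted regression fits a separate weighted straight line around every query point. The sketch below assumes one input variable and a Gaussian kernel; the function name, the bandwidth value and the toy data are illustrative assumptions.

```python
import math

def locally_weighted_predict(xs, ys, x0, bandwidth=1.0):
    """Predict y at x0 from a weighted least-squares line centred on x0.

    Assumes x0 lies close enough to the data that the kernel weights
    do not all underflow to zero.
    """
    weights = [math.exp(-((x - x0) ** 2) / (2.0 * bandwidth ** 2)) for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return mean_y + slope * (x0 - mean_x)

# Toy data on an exact straight line y = 3x + 2.
xs = [0.5 * i for i in range(11)]
ys = [3.0 * x + 2.0 for x in xs]
prediction = locally_weighted_predict(xs, ys, 2.25, bandwidth=0.5)
print(prediction)
```

On exactly linear data every local fit recovers the same line; on curved data the bandwidth controls the trade-off between smoothness and fidelity to local structure.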
Methods to estimate errors
• Train-test
• Cross-validation
• Bootstrap
• Leave-one-out
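The cross-validation and leave-one-out entries above can be sketched with one generic routine, since leave-one-out is k-fold cross-validation with k equal to the sample size. The helper names, the threshold learner and the toy data are assumptions for illustration; the learner also assumes that class 1 has the larger feature values and that every training fold contains both classes.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_validated_accuracy(xs, ys, fit, predict, k=5, seed=0):
    """Fraction of samples predicted correctly when each is held out once."""
    correct = 0
    for train, test in k_fold_splits(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(predict(model, xs[i]) == ys[i] for i in test)
    return correct / len(xs)

# Hypothetical learner: a threshold halfway between the two class means.
def fit_threshold(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0

def predict_threshold(threshold, x):
    return int(x > threshold)

# Made-up, well-separated one-dimensional sample.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3, 1.4]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

accuracy_5fold = cross_validated_accuracy(xs, ys, fit_threshold, predict_threshold, k=5)
accuracy_loo = cross_validated_accuracy(xs, ys, fit_threshold, predict_threshold, k=len(xs))
print(accuracy_5fold, accuracy_loo)
```

A plain train-test split corresponds to using a single one of these folds, and the bootstrap replaces the disjoint folds with samples drawn with replacement.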
Evaluation of methods
• Accuracy
• Speed
• Comprehensibility
• Time to learn
• Generalization
Model Selection for Classification
• Accuracy
• G-mean
• F-measure
• ROC (Receiver Operating Characteristic) curve
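The first three criteria above can all be computed from a binary confusion matrix. This is a minimal sketch; it assumes the common definition of G-mean as the geometric mean of sensitivity and specificity and the balanced F-measure (F1), and the function name and toy labels are illustrative.

```python
import math

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, G-mean and F-measure for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0

    g_mean = math.sqrt(recall * specificity)
    if precision + recall:
        f_measure = 2.0 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "g_mean": g_mean, "f_measure": f_measure}

# Made-up predictions: 3 true positives, 1 miss, 4 true negatives, 2 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

G-mean and F-measure are more informative than accuracy when the classes are imbalanced, since a classifier that ignores the rare class can still score a high accuracy.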
Model Selection for Regression
• AIC (Akaike information criterion)
• BIC (Bayesian information criterion)
• SRM (Structural Risk Minimization)
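For AIC and BIC the standard definitions are AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of fitted parameters, n the sample size and L the maximized likelihood. The sketch below compares a constant model with a straight line, assuming Gaussian errors; the helper names, the toy data and the choice to count the noise variance as a parameter are assumptions.

```python
import math

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, with noise variance set to RSS / n."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)

def aic(log_likelihood, n_params):
    return 2.0 * n_params - 2.0 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data scattered around y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9, 13.2, 14.8, 17.1, 18.9]
n = len(xs)

mean_y = sum(ys) / n
slope, intercept = fit_line(xs, ys)
ll_const = gaussian_log_likelihood([y - mean_y for y in ys])
ll_line = gaussian_log_likelihood([y - (slope * x + intercept) for x, y in zip(xs, ys)])

# Parameter counts include the noise variance: mean + variance = 2,
# slope + intercept + variance = 3.
scores = {
    "constant": (aic(ll_const, 2), bic(ll_const, 2, n)),
    "line": (aic(ll_line, 3), bic(ll_line, 3, n)),
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(best_by_aic, best_by_bic)
```

Lower values are better for both criteria; BIC penalizes extra parameters more heavily than AIC once n exceeds about 7, because ln n is then larger than 2.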
Example 1
Lim Tjen-Sien et al., Machine Learning, 40, 203-229 (2000)
33 algorithms on 16 different samples
22 decision trees: CART, S-Plus tree, C4.5, FACT, QUEST, IND, OC1, LMDT, CAL5, T1
9 statistical methods: LDA, QDA, NN, LOG, FDA, PDA, MDA, POL
2 neural networks: LVQ, RBF
Example 1
Lim Tjen-Sien et al., Machine Learning, 40, 203-229 (2000)
Example 2
Example 3
Zhao, Y., Zhang, Y., 2006, submitted to COSPAR
Example 3
Zhang, Y., Zhao, Y., 2006, submitted to ChJAA
For NB, ADTree and MLP, the overall accuracy amounts to 97.5%, 98.5% and 98.1%, respectively.
Example 4
Zhang, Y., Luo, A., Zhao, Y., 2006, submitted to COSPAR
By best-forward search, j-h, b-v and j + 2.5 lg Fpeak are the optimal features selected from the 10 features.
Decision Table is applied, with 10-fold cross-validation for training and testing.
98.03%
Example 5
Li, Y., Zhang, Y., Zhao, Y., 2006, submitted to Chinese Science
k-Nearest neighbor classifier
Example 6
Zhang, Y., Zhao, Y., 2006, ADASS XV, 351, 173
Challenges and Influential Aspects
Knowledge Discovery:
• Handling of different types of data with different degrees of supervision
• Changing data and knowledge
• Understandability of patterns, various kinds of requests and results (decision lists, inference networks, concept hierarchies, etc.)
• Interactive, visualization
• Different sources of data (distributed, heterogeneous databases, noise and missing, irrelevant data, etc.)
• Massive data sets, high dimensionality (efficiency, scalability)
Summary
• Linear or non-linear
• Gaussian or non-Gaussian
• Continuous or discrete
• Missing or not
• Comparison of the number of attributes with that of records
• Choose the appropriate method or ensemble algorithms according to the task and data characteristics
Prospect
On the wings of DM, find more, better or the best knowledge!
Thank you for your attention!