Zhang Yanxia, China-VO Group, 2006.11.30 in Guilin

Data Mining in Astronomy

Page 1: Zhang Yanxia China-VO Group 2006.11.30 in Guilin

Zhang Yanxia

China-VO Group

2006.11.30 in Guilin

Chinese Virtual Observatory

Page 2

11/29-12/03 China-VO 2006, Guilin

Outline

• Why
• What
• How
• Example
• Challenge
• Summary

Page 3

Astronomy facing “data avalanche”

[Figure panels: IRAS 25 µm, 2MASS 2 µm, DSS Optical, IRAS 100 µm, WENSS 92 cm, NVSS 20 cm, GB 6 cm, ROSAT ~keV]

Necessity Is the Mother of Invention

DM & KDD

Page 4

Issues in Astronomy

• Compression (e.g. galaxy images and spectra)
• Classification (e.g. stars, galaxies, or gamma-ray bursts)
• Reconstruction (e.g. of blurred galaxy images, mass distribution from weak gravitational lensing)
• Feature extraction (e.g. signature features of stars, galaxies and quasars)
• Parameter estimation (e.g. stellar parameter measurement, photometric redshift prediction, orbital parameters of extra-solar planets, or cosmological parameters)
• Model selection (e.g. are there 0, 1, 2, ... planets around a star, or is a cosmological model with non-zero neutrino mass more favorable)

Ofer Lahav, 2006, astro-ph/0610703. Summary of the 4th meeting on “Statistical Challenges in Modern Astronomy”, held at Penn State University in June 2006

Page 5

Science Requirements for DM (Borne K D, 2001, Proc. of the MPA/ESO/MPE Workshop, 671)

Cross-Identification - refers to the classical problem of associating the source list in one database to the source list in another.

Cross-Correlation - refers to the search for correlations, tendencies, and trends between physical parameters in multi-dimensional data, usually across databases.

Nearest-Neighbor Identification - refers to the general application of clustering algorithms in multi-dimensional parameter space, usually within a database.

Systematic Data Exploration - refers to the application of the broad range of event-based and relationship-based queries to a database in the hope of making a serendipitous discovery of new objects or a new class of objects.
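Cross-identification, the first requirement on this slide, can be illustrated with a nearest-neighbour positional match. The following is a minimal sketch and not the procedure from Borne (2001): the catalogue layout (lists of (ra, dec) pairs in degrees), the function names and the 2 arcsec search radius are all assumptions made for illustration, and the brute-force double loop would need a spatial index for real survey sizes.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cross_identify(cat_a, cat_b, radius_arcsec=2.0):
    """For each source in cat_a, return the index of the nearest cat_b
    source within the search radius, or None if there is no counterpart."""
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for ra_a, dec_a in cat_a:
        best_idx, best_sep = None, radius_deg
        for j, (ra_b, dec_b) in enumerate(cat_b):
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_idx, best_sep = j, sep
        matches.append(best_idx)
    return matches

# Hypothetical positions (degrees); not taken from any real catalogue.
optical = [(10.0000, 20.0000), (150.5000, -30.2000)]
infrared = [(150.5002, -30.2001), (10.0001, 20.0001), (200.0, 5.0)]
print(cross_identify(optical, infrared))
```

Each optical source is paired with the closest infrared source inside the radius; sources without a counterpart map to None, which is often the scientifically interesting case.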

Page 6

KDD: Opportunity and Challenges

• Data rich, knowledge poor (the resource)
• Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)
• Competitive pressure
• Data mining technology mature

KDD

Page 7

KDD: A Definition

KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.

• 10^6 to 10^12 bytes: never see the whole data set or put it in the memory of computers
• What knowledge? How to represent and use it?
• Data mining algorithms?

Page 8

Benefits of Knowledge Discovery

[Figure: Value versus Volume for EDP, MIS and DSS; labels: Generate, Rapid Response, Disseminate]

EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems

Page 9

DM: A KDD Process

– Data mining: the core of the knowledge discovery process.

[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Page 10

Work at each process of DM

[Figure: bar chart, vertical axis 0 to 60, over the stages: DM object, Data preparation, Data processing, Analysis and Evaluation]

Page 11

Primary Tasks of Data Mining

• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Regression: maps a data item to a real-valued prediction variable.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Summarization: finding a compact description for a subset of data.
• Dependency modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data.

Page 12

Feature selection

• Filter method
• Wrapper method
• Embedded method
• Feature weighted method
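As an illustration of the first item, a filter method scores each feature on its own, without running any classifier, and keeps the best-scoring ones. The sketch below uses the absolute Pearson correlation with the class label as the score; that choice, the function names and the toy data are assumptions for illustration, not something the slides specify.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant column carries no information; score it as zero.
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def filter_select(rows, labels, k):
    """Rank features by |correlation with the label| and keep the top k.
    Returns the selected column indices, best first."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]

# Hypothetical data: column 0 tracks the label, column 1 is weakly
# related, column 2 is constant.
rows = [
    [0.1, 5.0, 1.0],
    [0.2, 3.0, 1.0],
    [0.8, 4.0, 1.0],
    [0.9, 4.1, 1.0],
]
labels = [0, 0, 1, 1]
print(filter_select(rows, labels, k=2))
```

A wrapper method would instead score candidate subsets by training and testing the actual classifier, which is slower but accounts for interactions between features.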

Page 13

Feature extraction

• PCA
• Factor analysis (Principal FA/Maximum Likelihood FA)
• Projection pursuit
• ICA
• Non-linear PCA/ICA
• Random projection
• Principal curves
• MDS
• LLE
• ISOMAP
• Topological continuous map
• Neural network
• Vector quantization
• Kernel PCA/ICA
• LDA (linear discriminant analysis)
• QDA (quadratic discriminant analysis)
• FDA (Fisher discriminant analysis)
• GDA (generalized discriminant analysis)
• KDDA (kernel direct discriminant analysis)
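To make the first entry concrete, here is a minimal sketch of PCA restricted to the leading component, using power iteration on the sample covariance matrix. The function name, the start vector and the toy data are assumptions for illustration; a real analysis would normally rely on a linear-algebra library and extract several components.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Leading eigenvector of the sample covariance matrix, found by
    power iteration. Returns (unit_vector, projected_values)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary non-zero start; it must not be orthogonal to the answer.
    v = [1.0] + [0.5] * (d - 1)
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # zero covariance: no preferred direction
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical, perfectly correlated two-feature data.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
direction, scores = first_principal_component(points)
print([round(x, 4) for x in direction])
```

For this data the two features carry the same information, so the single projected value per point retains all of the variance.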

Page 14

Classification Methods

• Based on statistical theory: SVMs, ML, LDA, FDA, QDA, KNN
• Based on NN: LVQ, RBF, PNN, KSOM, BBN, SLP, MLP
• Based on Decision Tree: REPTree, RandomTree, CART, C5.0, J48, DecisionStump, RandomForest, NBTree, AC2, Cal5, ADTree, KDTree
• Based on Decision Rule: Decision Table, CN2, ITrule, AQ
• Based on Bayesian theory: Naive Bayes classifier, NBTree
• Based on meta learning: AdaBoost, boosting, bagging
• Based on evolution theory: genetic algorithm
• Based on fuzzy theory: fuzzy set, rough set
• Ensembles of classifiers

Data Mining algorithm patterns

Page 15

Regression Methods

• (Penalized) logistic regression
• Bayesian regression analysis
• Additive regression
• Locally weighted regression
• Voted perceptron network
• Projection pursuit regression
• Recursive partitioning regression
• Alternating conditional expectation
• Stepwise regression
• Recursive least squares
• Fourier transform regression
• Rule-based regression
• Principal component regression
• Instance-based regression
• Multivariate adaptive regression splines
• Regression trees (CART, RETIS, M5, random forest, KDtree)
• Simple windowed regression
• SVM
• NN

Page 16

Method to estimate errors

• Train-test

• Cross-validation

• Bootstrap

• Leave-one-out
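The resampling schemes listed on this slide differ only in how records are divided between training and testing. The sketch below shows k-fold cross-validation (leave-one-out is the special case k = number of records) and one bootstrap draw. All names, the toy threshold classifier and the data are assumptions for illustration; real data should be shuffled before contiguous folds are cut.

```python
import random

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds; yields (train, test) lists."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validated_error(xs, ys, fit, predict, k):
    """Mean misclassification rate over k folds (k == len(xs): leave-one-out)."""
    errors = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        wrong = sum(predict(model, xs[i]) != ys[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; the never-drawn ones form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

# Toy classifier (hypothetical): threshold halfway between the class means.
def fit_threshold(xs, ys):
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (m0 + m1) / 2

def predict_threshold(threshold, x):
    return int(x > threshold)

# Hypothetical one-feature data with the two classes interleaved.
xs = [1.0, 3.0, 1.2, 3.2, 0.8, 2.9, 1.1, 3.1]
ys = [0, 1, 0, 1, 0, 1, 0, 1]
print(cross_validated_error(xs, ys, fit_threshold, predict_threshold, k=4))
print(cross_validated_error(xs, ys, fit_threshold, predict_threshold, k=len(xs)))
```

A plain train-test split corresponds to using a single (train, test) pair from `k_fold_indices`.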

Page 17

Evaluation of methods

• Accuracy

• Speed

• Comprehensibility

• Time to learn

• Generalization

Page 18

Model Selection for Classification

• Accuracy

• G-mean

• F-measure

• ROC (Receiver Operating Characteristic curve)
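The four criteria on this slide can all be derived from the binary confusion matrix. The sketch below is an illustration under stated assumptions: G-mean is taken as the geometric mean of sensitivity and specificity (one common definition), F-measure is the balanced F1, and a single classifier contributes one point (false positive rate, true positive rate) to the ROC plane. Function names and the labels are hypothetical.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def classification_scores(y_true, y_pred, positive=1):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall, TPR
    specificity = tn / (tn + fp) if tn + fp else 0.0  # TNR
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "g_mean": math.sqrt(sensitivity * specificity),
        "f_measure": 2 * precision * sensitivity / denom if denom else 0.0,
        "roc_point": (1 - specificity, sensitivity),  # (FPR, TPR)
    }

# Hypothetical labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
result = classification_scores(y_true, y_pred)
print(result)
```

G-mean and F-measure are useful when classes are unbalanced, because accuracy alone can stay high while the rare class is mostly misclassified.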

Page 19

Model Selection for Regression

• AIC (Akaike information criterion)

• BIC (Bayesian information criterion)

• SRM (Structural Risk Minimization)
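For the first two criteria, the standard definitions are AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where L is the maximized likelihood, k the number of fitted parameters and n the number of records; the model with the lower value is preferred. The sketch below applies them to two sets of regression residuals under an assumed Gaussian error model. The residual values and parameter counts are invented for illustration, and SRM is not covered.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

def gaussian_log_likelihood(residuals):
    """Maximized log-likelihood for i.i.d. Gaussian errors, with the
    variance set to its maximum-likelihood value RSS / n."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)

# Hypothetical residuals from a 2-parameter and a 5-parameter fit.
simple_residuals = [0.5, -0.4, 0.6, -0.5, 0.4, -0.6]
complex_residuals = [0.45, -0.4, 0.55, -0.5, 0.4, -0.55]
simple_ll = gaussian_log_likelihood(simple_residuals)
complex_ll = gaussian_log_likelihood(complex_residuals)

n = len(simple_residuals)
print("AIC:", aic(simple_ll, 2), aic(complex_ll, 5))
print("BIC:", bic(simple_ll, 2, n), bic(complex_ll, 5, n))
```

Here the larger model fits slightly better, but not by enough to pay for its three extra parameters, so both criteria favour the simpler model.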

Page 20

Example 1

Lim Tjen-Sien et al., Machine Learning, 40, 203-229 (2000)

33 algorithms on 16 different samples

• 22 decision trees: CART, S-Plus tree, C4.5, FACT, QUEST, IND, OC1, LMDT, CAL5, T1
• 9 statistical methods: LDA, QDA, NN, LOG, FDA, PDA, MDA, POL
• 2 neural networks: LVQ, RBF

Page 21

Example 1

Lim Tjen-Sien et al., Machine Learning, 40, 203-229 (2000)

Page 22

Example 2

Page 23

Example 3

Zhao, Y., Zhang, Y., 2006, submitted to COSPAR

Page 24

Example 3

For NB, ADTree and MLP, the overall accuracy amounts to 97.5%, 98.5% and 98.1%, respectively.

Zhang, Y., Zhao, Y., 2006, submitted to ChJAA

Page 25

Example 4

By best-forward search, j-h, b-v and j+2.5lgFpeak are the optimal features selected from the 10 features.

Decision Table is applied, with 10-fold cross-validation for training and test.

98.03%

Zhang, Y., Luo, A., Zhao, Y., 2006, submitted to COSPAR
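The feature search in this example can be pictured as a forward search over feature subsets. The sketch below is a simplified greedy variant, not the search actually used in the cited work: it adds one feature at a time while a subset score keeps improving. The function names and the additive toy score are assumptions; in a wrapper setting the score would be something like the cross-validated accuracy of the chosen classifier on that subset.

```python
def forward_select(n_features, score, max_features=None):
    """Greedy forward search: repeatedly add the feature that most
    improves score(subset); stop when no addition improves it."""
    limit = min(max_features or n_features, n_features)
    selected, best = [], float("-inf")
    while len(selected) < limit:
        candidates = [(score(selected + [j]), j)
                      for j in range(n_features) if j not in selected]
        # Highest score wins; ties go to the lower feature index.
        top_score, top_j = max(candidates, key=lambda c: (c[0], -c[1]))
        if top_score <= best:
            break
        selected.append(top_j)
        best = top_score
    return selected, best

# Hypothetical per-feature contributions standing in for a real
# cross-validated accuracy estimate.
usefulness = [0.5, 0.3, 0.0, -0.1]

def toy_score(subset):
    return sum(usefulness[j] for j in subset)

chosen, chosen_score = forward_select(len(usefulness), toy_score)
print(chosen, chosen_score)
```

With these toy contributions the search keeps the two helpful features and stops when the third candidate brings no gain.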

Page 26

Example 5

k-Nearest neighbor classifier

Li, Y., Zhang, Y., Zhao, Y., 2006, submitted to Chinese Science
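For readers unfamiliar with the method named on this slide, a k-nearest neighbor classifier assigns a query object the majority class among its k closest training objects. The sketch below is a generic illustration with Euclidean distance; the feature values, class names and k = 3 are assumptions and do not reproduce the data or settings of the cited work.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to the query
    (Euclidean distance, brute-force search)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set with two classes.
train_x = [(0.2, 0.3), (0.3, 0.2), (0.25, 0.35),
           (1.0, 1.1), (1.1, 0.9), (0.9, 1.0)]
train_y = ["star", "star", "star", "quasar", "quasar", "quasar"]

print(knn_classify(train_x, train_y, (0.28, 0.30)))
print(knn_classify(train_x, train_y, (1.00, 1.00)))
```

The method has no training phase, so its cost is paid at query time; for large catalogues the brute-force sort is usually replaced by a tree-based neighbor search.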

Page 27

Example 6

Zhang, Y., Zhao, Y., 2006, ADASS XV, 351, 173

Page 28

Challenges and Influential Aspects

Knowledge Discovery

• Handling of different types of data with different degrees of supervision
• Changing data and knowledge
• Understandability of patterns, various kinds of requests and results (decision lists, inference networks, concept hierarchies, etc.)
• Interactive, visualization
• Different sources of data (distributed, heterogeneous databases, noisy, missing and irrelevant data, etc.)
• Massive data sets, high dimensionality (efficiency, scalability)

Page 29

Summary

• Linear or non-linear
• Gaussian or non-Gaussian
• Continuous or discrete
• Missing or not
• Comparison of the number of attributes with that of records
• Choose the appropriate method or ensemble algorithms according to the task and data characteristics

Page 30

Prospect

With the wing of DM, find more, better or best knowledge!

Thank you for your attention!

Page 31

Thank you!!!