data clustering and mining

27
Dexin Zhou – Bard College (Presenter) Ralph Abbey – North Carolina State U. Jeremy Diepenbrock – Washington U. at St. Louis Project Advisor: Dr. Carl Meyer Additional Advising: Dr. Amy Langville Graduate Assistant: Shaina Race

Upload: egan

Post on 04-Feb-2016

56 views

Category:

Documents


0 download

DESCRIPTION

Data Clustering and Mining. Dexin Zhou – Bard College (Presenter) Ralph Abbey – North Carolina State U. Jeremy Diepenbrock – Washington U. at St. Louis Project Advisor: Dr. Carl Meyer Additional Advising: Dr. Amy Langville Graduate Assistant: Shaina Race. What is Data Clustering?. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Data Clustering and Mining

Dexin Zhou – Bard College (Presenter)Ralph Abbey – North Carolina State U.

Jeremy Diepenbrock – Washington U. at St. Louis

Project Advisor: Dr. Carl MeyerAdditional Advising: Dr. Amy Langville

Graduate Assistant: Shaina Race

Page 2: Data Clustering and Mining

What is Data Clustering?Clustering is the partitioning of a data set

into subsets (clusters).We are interested in creating good clusters

that allow us to reorganize disordered data into a block structure so that useful information can be extracted.

Page 3: Data Clustering and Mining

A Visible ExampleBefore Clustering After Clustering

Page 4: Data Clustering and Mining

What are we clustering?An 86 mini-document set that we created

with 13 topicsA 185 document set used in Daniel Boley’s

paper with 10 topicsSAS grocery store dataset

Page 5: Data Clustering and Mining

Preparing the data

• Term Aij is in the following form• g term is a function of term i, it downplays

the terms that appear frequently globally• l term is a function of the raw frequency of a

certain term in document j(eg: log)• d term is a normalization factor

Page 6: Data Clustering and Mining

How?Principal Direction Divisive PartitioningPrincipal Direction Gap PartitioningNon-Negative Matrix FactorizationClustering Aggregation

Page 7: Data Clustering and Mining

Singular Value Decomposition

Page 8: Data Clustering and Mining

Principle Direction Divisive Partitioning

Page 9: Data Clustering and Mining

PDDP

Page 10: Data Clustering and Mining

PDDP

Page 11: Data Clustering and Mining

Principle Direction Gap Partitioning

Sorted Indices Sorted Indices

Sor

ted

Val

ue

Sor

ted

Val

ue

Plot of the First Right Singular Vector Plot of the Second Right Singular Vector

Page 12: Data Clustering and Mining

A Comparison of PDGP w/ PDDP

Page 13: Data Clustering and Mining

Centering Vs. Non-Centering

Page 14: Data Clustering and Mining

Non-Negative Matrix Factorization

Page 15: Data Clustering and Mining

NMF Clustering

Page 16: Data Clustering and Mining

Cluster Aggregation

Page 17: Data Clustering and Mining

Cluster Aggregation

Page 18: Data Clustering and Mining

MetricsEntropy Method

A standard measurement based on our prior knowledge to the data file.

Density MethodDoes not require prior knowledge to the data

file.Less accurate.

Page 19: Data Clustering and Mining

Mini-document dataset

Page 20: Data Clustering and Mining

Mini-document dataset Result

Page 21: Data Clustering and Mining

Boley’s J1 Dataset

Page 22: Data Clustering and Mining

Boley’s Dataset Result

Page 23: Data Clustering and Mining

SAS Grocery Dataset

Page 24: Data Clustering and Mining

SAS Grocery Dataset Results

Page 25: Data Clustering and Mining

SAS Grocery Dataset Result

Page 26: Data Clustering and Mining

Conclusion

Page 27: Data Clustering and Mining

For Additional InformationPlease Visit

http://meyer.math.ncsu.edu/Meyer/REU/REU.html