Data Mining: Current Status and Research Directions


Page 1: Data Mining-Current Status and Research Directions

April 10, 2023 Data Mining: Status and Directions 1

Data Mining: Current Status and Research Directions

Jiawei Han

Intelligent Database Systems Research Lab

School of Computing Science

Simon Fraser University, Canada

http://www.cs.sfu.ca/~han

Page 2: Data Mining-Current Status and Research Directions

Why Is Data Mining Hot?

Data mining (knowledge discovery in databases)

  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories

Necessity is the mother of invention

  Data is everywhere, so data mining should be everywhere, too!

  Understand and use data: an imminent task!

Page 3: Data Mining-Current Status and Research Directions

Data, Data, Everywhere!!

Relational databases: a commodity of every enterprise

Huge data warehouses are under construction

POS (point of sale): transactional DBs in terabytes

Object-relational databases; distributed, heterogeneous, and legacy databases

Spatial databases (GIS), remote sensing databases (EOS), and scientific/engineering databases

Time-series data (e.g., stock trading) and temporal data

Text (documents, emails) and multimedia databases

WWW: a huge, hyper-linked, dynamic, global information system

Page 4: Data Mining-Current Status and Research Directions

Data Mining Is Everywhere, too!—A Multi-Dimensional View of Data Mining

Databases to be mined

  Relational, transactional, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.

Knowledge to be mined

  Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.

Techniques utilized

  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural networks, etc.

Applications adapted

  Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Page 5: Data Mining-Current Status and Research Directions

Data Mining: Confluence of Multiple Disciplines

[Diagram: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning (AI), Visualization, Information Science, and Other Disciplines]

Page 6: Data Mining-Current Status and Research Directions

Data Mining—One Can Trace Back to Early Civilization

Most scientific discoveries involve "data mining"

  Kepler's laws, Newton's laws, the periodic table of chemical elements, …, from the "big bang" to DNA

Statistics: a discipline dedicated to data analysis

Then why data mining? What are the differences?

  Huge amounts of data: gigabytes to terabytes

  Fast computers: quick response, interactive analysis

  Multi-dimensional, powerful, thorough analysis

  High-level, "declarative": user's ease and control

  Automated or semi-automated: mining functions hidden or built into many systems

Page 7: Data Mining-Current Status and Research Directions

A Brief History of Data Mining Activities

1989 IJCAI Workshop on Knowledge Discovery in Databases

  Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)

1991-1994 Workshops on Knowledge Discovery in Databases

  Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)

  Journal of Data Mining and Knowledge Discovery (1997)

1998 ACM SIGKDD, SIGKDD'1999-2001 conferences, and SIGKDD Explorations

More conferences on data mining

  PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.

Page 8: Data Mining-Current Status and Research Directions

Research Progress in the Last Decade

Multi-dimensional data analysis: data warehouse and OLAP (on-line analytical processing)

Association, correlation, and causality analysis

Classification: scalability and new approaches

Clustering and outlier analysis

Sequential patterns and time-series analysis

Similarity analysis: curves, trends, images, texts, etc.

Text mining, Web mining, and Weblog analysis

Spatial, multimedia, and scientific data analysis

Data preprocessing and database compression

Data visualization and visual data mining

Many others, e.g., collaborative filtering

Page 9: Data Mining-Current Status and Research Directions

Multi-Dimensional Data Analysis

Data warehousing: integration from heterogeneous or semi-structured databases

Multi-dimensional modeling of data: star & snowflake schemas

Efficient and scalable computation of data cubes or iceberg cubes

OLAP (on-line analytical processing): drilling, dicing, slicing, etc.

Discovery-driven exploration of data cubes

From OLAP to OLAM: a multi-dimensional view for on-line analytical mining
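
To make the data cube and iceberg cube idea concrete, here is a minimal Python sketch. It is a naive illustration, not one of the efficient cube algorithms the slide alludes to: it recomputes every group-by independently and keeps only cells whose count reaches a threshold. The function name `iceberg_cube`, the `min_count` parameter, and the sales records are all invented for this example.

```python
from collections import Counter
from itertools import combinations

def iceberg_cube(rows, dimensions, min_count):
    """Count rows for every group-by over subsets of `dimensions`,
    keeping only cells whose count reaches `min_count` (the iceberg condition)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            # One group-by: project each row onto the chosen dimensions.
            counts = Counter(tuple(row[d] for d in dims) for row in rows)
            for cell, count in counts.items():
                if count >= min_count:
                    cube[(dims, cell)] = count
    return cube

# Hypothetical sales facts, invented for illustration.
sales = [
    {"city": "Vancouver", "item": "milk", "year": 2000},
    {"city": "Vancouver", "item": "milk", "year": 2001},
    {"city": "Vancouver", "item": "bread", "year": 2001},
    {"city": "Toronto", "item": "milk", "year": 2001},
]

cube = iceberg_cube(sales, ("city", "item", "year"), min_count=2)
for (dims, cell), count in sorted(cube.items()):
    print(dims, cell, count)
```

Rolling up corresponds to reading a cell with fewer dimensions, such as `(("city",), ("Vancouver",))`, and drilling down to reading one with more, such as `(("city", "item"), ("Vancouver", "milk"))`.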

Page 10: Data Mining-Current Status and Research Directions

Association and Frequent Pattern Analysis

Efficient mining of frequent patterns and association rules: Apriori and FP-growth algorithms

Multi-level, multi-dimensional, quantitative association mining

From association to correlation, sequential patterns, partial periodicity, cyclic rules, ratio rules, etc.

Query and constraint-based association analysis
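
As an illustration of frequent pattern mining, the following is a compact, unoptimized Python sketch of the level-wise Apriori idea: generate candidate k-itemsets from frequent (k-1)-itemsets, prune any candidate with an infrequent subset, then count support. The basket data and the absolute `min_support` threshold are assumptions made up for the example; FP-growth is not shown.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets found in at least
    `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the data per level: count each candidate's support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    frequent = count({frozenset([item]) for item in items})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result

# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are, so most candidates are discarded before any data scan.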

Page 11: Data Mining-Current Status and Research Directions

Classification: Scalable Methods and Handling of Complex Types of Data

Classification has been an essential theme in machine learning and statistics research

  Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.

  Tree-pruning, boosting, bagging techniques

Efficient and scalable classification methods

  Exploration of attribute-class pairs

  SLIQ, SPRINT, RainForest, BOAT, etc.

Classification of semi-structured and non-structured data

  Classification by clustering association rules (ARCS)

  Association-based classification

  Web document classification
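
Of the classifiers listed above, k-nearest neighbors is the simplest to sketch. The Python below is a brute-force illustration only; it makes no attempt at the scalability that methods such as SLIQ or SPRINT target. The function name `knn_classify`, the choice of Euclidean distance, and the training points are assumptions for this example.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training examples.
    `training` is a list of (feature_tuple, label) pairs."""
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points, invented for illustration.
training = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 2), "low"),
    ((8, 8), "high"), ((9, 8), "high"), ((8, 9), "high"),
]

label_near_origin = knn_classify(training, (2, 1))
label_far_corner = knn_classify(training, (7, 9))
```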

Page 12: Data Mining-Current Status and Research Directions

Clustering and Outlier Analysis

Partitioning methods

  k-means, k-medoids, CLARANS

Hierarchical methods: micro-clusters

  BIRCH, CURE, Chameleon

Density-based methods

  DBSCAN, OPTICS, DENCLUE

Grid-based methods

  STING, CLIQUE, WaveCluster

Outlier analysis: statistics-based, distance-based, deviation-based

Constraint-based clustering

  COD (Clustering with Obstructed Distance)

  User-specified constraints
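
As one example of a partitioning method, here is a minimal sketch of k-means in the usual Lloyd's-iteration form: assign each point to its nearest centroid, recompute centroids as cluster means, and repeat until nothing changes. The initial centroids are passed in explicitly to keep the run deterministic; the points and starting centroids are invented for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
```

In practice the result depends on the starting centroids, which is one reason alternatives such as k-medoids and CLARANS appear alongside k-means.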

Page 13: Data Mining-Current Status and Research Directions

Sequential Patterns and Time-Series Analysis

Trend analysis

  Trend movement vs. cyclic variations, seasonal variations, and random fluctuations

Similarity search in time-series databases

  Handling gaps, scaling, etc.

  Indexing methods and query languages for time series

Sequential pattern mining

  Various kinds of sequences, various methods

  From GSP to PrefixSpan

Periodicity analysis

  Full periodicity, partial periodicity, cyclic association rules
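
One common way to separate trend movement from seasonal variation is a moving average whose window equals the season length. The sketch below assumes that approach (the slides do not name a specific method), and the quarterly sales figures are invented: a period-4 seasonal swing on top of a steadily rising trend.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values (a simple trend estimate)."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical quarterly sales: seasonal pattern repeating every 4 quarters.
sales = [10, 14, 8, 12, 14, 18, 12, 16, 18, 22, 16, 20]
trend = moving_average(sales, window=4)
print(trend)
```

Because each window covers exactly one full season, the seasonal swing averages out and only the underlying upward movement remains.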

Page 14: Data Mining-Current Status and Research Directions

Similarity Search: Similar Curves, Trends, Images, and Texts

Various kinds of data, various similarity mining methods

Discovery of similar trends in time-series data

  Data transformation & high-dimensional structures

Finding similar images based on color, texture, etc.

  Content-based vs. keyword-based retrieval

  Color histogram-based signature

  Multi-feature composed signature

Finding documents with similar texts

  Similar keywords (synonymy & polysemy)

  Term frequency matrix

  Latent semantic indexing
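
To illustrate the term frequency matrix, the sketch below builds one from raw text and compares documents by cosine similarity. Cosine similarity is a common choice but an assumption here, since the slides do not name a measure; tokenization is plain whitespace splitting, and the three documents are invented. Latent semantic indexing, which would further factor this matrix, is not shown.

```python
import math
from collections import Counter

def term_frequency_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in document i."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical documents, invented for illustration.
docs = [
    "data mining finds patterns in data",
    "mining patterns in large data",
    "stock market trend analysis",
]
vocabulary, matrix = term_frequency_matrix(docs)
similar = cosine_similarity(matrix[0], matrix[1])
unrelated = cosine_similarity(matrix[0], matrix[2])
```

Documents that share no terms score zero, which is exactly the synonymy problem noted above: two texts about the same topic in different words look unrelated in a raw term frequency matrix.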

Page 15: Data Mining-Current Status and Research Directions

Spatial, Multimedia, Scientific Data Analysis

Multi-dimensional analysis of spatial, multimedia and scientific data

  Geo-spatial data cube and spatial OLAP

  The curse of dimensionality problem

Association analysis

  A progressive refinement methodology

Micro-clustering can be used for preprocessing in the analysis of complex types of data

Classification

  Association-based for handling high-dimensionality and sparse data

Page 16: Data Mining-Current Status and Research Directions

Data Mining Industry and Applications

From research prototypes to data mining products, languages, and standards

  IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc.

  A few data mining languages and standards (esp. MS OLEDB for Data Mining)

Application achievements in many domains

  Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.

Page 17: Data Mining-Current Status and Research Directions

Web Mining: A Fast Expanding Frontier in Data Mining

Mine what Web search engine finds

Automatic classification of Web documents

Discovery of authoritative Web pages, Web structures and Web communities

Meta-Web Warehousing: Web yellow page service

Web usage mining

Page 18: Data Mining-Current Status and Research Directions

Mine What Web Search Engine Finds

Current Web search engines: a convenient source for mining

  Keyword-based, return too many, often low-quality answers, still missing a lot, not customized, etc.

Data mining will help:

  Coverage: "enlarge and then shrink," using synonyms and conceptual hierarchies

  Better search primitives: user preferences/hints

  Linkage analysis: authoritative pages and clusters

  Web-based languages: XML + WebSQL + WebML

  Customization: home page + Weblog + user profiles

Page 19: Data Mining-Current Status and Research Directions

Discovery of Authoritative Pages in WWW

Page-rank method (Brin and Page, 1998):

  Rank the "importance" of Web pages, based on a model of a "random browser."

Hub/authority method (Kleinberg, 1998):

  Prominent authorities often do not endorse one another directly on the Web.

  Hub pages have a large number of links to many relevant authorities.

  Thus hubs and authorities exhibit a mutually reinforcing relationship.

Both the page-rank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
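
Both methods can be sketched as simple iterations over a link graph. The Python below is an illustrative simplification, not the published formulations in full: the damping factor of 0.85, the fixed iteration count, the treatment of pages without out-links (their rank is spread over all pages), and the tiny graph of pages `h1` to `h3` and `a1`, `a2` are all assumptions made for this example.

```python
def all_pages(links):
    """Every page that appears as a source or a target in `links`."""
    return sorted(set(links) | {q for targets in links.values() for q in targets})

def pagerank(links, damping=0.85, iterations=50):
    """Random-browser model: follow a link with probability `damping`,
    otherwise jump to a page chosen uniformly at random."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p) or pages  # a page with no links jumps anywhere
            share = damping * rank[p] / len(targets)
            for q in targets:
                new_rank[q] += share
        rank = new_rank
    return rank

def hits(links, iterations=50):
    """Hub/authority iteration: good hubs point to good authorities and vice versa."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph: three hub-like pages pointing at two authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"], "a1": [], "a2": []}
rank = pagerank(links)
hub, auth = hits(links)
```

In this graph the two authorities never link to each other, yet both score highly because the same hubs point at them, which is the mutually reinforcing relationship described above.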

Page 20: Data Mining-Current Status and Research Directions

Automatic Classification of Web Documents

Web document classification:

  Good human classification: Yahoo!, CS term hierarchies

  These classifications can be used as training sets to build up learning models

  Keyword-based classification is different from multi-dimensional classification

  Association or clustering-based classification is often more effective

  Multi-level classification is important

Page 21: Data Mining-Current Status and Research Directions

Web Usage (Click-Stream) Mining

Weblog provides rich information about Web dynamics

Multidimensional Weblog analysis:

  Disclose potential customers, users, markets, etc.

Plan mining (mining general Web accessing regularities):

  Web linkage adjustment, performance improvements

Web accessing association/sequential pattern analysis:

  Web caching, prefetching, swapping

Trend analysis:

  Dynamics of the Web: what has been changing?

Customized to individual users
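
As a rough illustration of how access patterns in a Weblog could drive prefetching, the sketch below counts page-to-page transitions within sessions and suggests the most frequent follower of a page. This first-order counting is a deliberate simplification of sequential pattern analysis, and the function names, URL paths, and sessions are all hypothetical.

```python
from collections import Counter, defaultdict

def transition_counts(sessions):
    """Count how often each page is immediately followed by each other page."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return counts

def prefetch_candidate(counts, page):
    """Most frequent next page after `page`, or None if none was observed."""
    followers = counts.get(page)
    return followers.most_common(1)[0][0] if followers else None

# Hypothetical click-stream sessions extracted from a Weblog.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/specs"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
counts = transition_counts(sessions)
after_home = prefetch_candidate(counts, "/home")
after_products = prefetch_candidate(counts, "/products")
```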

Page 22: Data Mining-Current Status and Research Directions

Querying and Mining: An Integrated Information Analysis Environment

Data mining as a component of DBMS, data warehouse, or Web information system

  Integrated information processing environment

    MS/SQLServer-2000 (Analysis service)

    IBM IntelligentMiner on DB2

    SAS EnterpriseMiner: data warehousing + mining

Query-based mining

  Querying database/DW/Web knowledge

  Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc.

Page 23: Data Mining-Current Status and Research Directions

Basic Mining Operations and Mining Query Optimization

Relational databases: there is a set of basic relational operations and a standard query language, SQL

  E.g., selection, projection, join, set difference, intersection, Cartesian product, etc.

Is there a set of standard data mining operations on which optimizations can be done?

  Difficulty: different definitions of operations

  Importance: optimization can be performed on them systematically; standardization facilitates information exchange and system interoperability

Page 24: Data Mining-Current Status and Research Directions

“Vertical” Data Mining

Generic data mining tools? Too simple to match domain-specific, sophisticated applications

  Expert knowledge and business logic represent many years of work in their own fields!

  Data mining + business logic + domain experts

A multi-dimensional view of data miners

  Complexity of data: Web, sequence, spatial, multimedia, …

  Complexity of domains: DNA, astronomy, market, telecom, …

Domain-specific data mining tools

  Provide concrete, killer solutions to specific problems

  Feedback to build more powerful tools

Page 25: Data Mining-Current Status and Research Directions

One Picture May Be Worth 1000 Words!

Visual data mining

  Visualization of data

  Visualization of data mining results

  Visualization of data mining processes

  Interactive data mining: visual classification

One melody may be worth 1000 words too!

  Audio data mining: turn data into music and melody!

  Uses audio signals to indicate the patterns of data or the features of data mining results

Page 26: Data Mining-Current Status and Research Directions

Visualization of data mining results in SAS Enterprise Miner: scatter plots

Page 27: Data Mining-Current Status and Research Directions

Visualization of association rules in MineSet 3.0

Page 28: Data Mining-Current Status and Research Directions

Visualization of a decision tree in MineSet 3.0

Page 29: Data Mining-Current Status and Research Directions

Visualization of Data Mining Processes by Clementine

Page 30: Data Mining-Current Status and Research Directions

Interactive Visual Mining by Perception-Based Classification (PBC)

Page 31: Data Mining-Current Status and Research Directions

Constraint-Based Mining

What kinds of constraints can be used in mining?

  Knowledge type constraint: classification, association, etc.

  Data constraint: SQL-like queries

    Find products sold together in Vancouver in Feb.'01.

  Dimension/level constraints:

    In relevance to region, price, brand, customer category.

  Rule constraints:

    Small sales (price < $10) trigger big sales (sum > $200).

  Interestingness constraints:

    E.g., strong rules (min_support 3%, min_confidence 60%, min_lift > 3.0).
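
The interestingness constraints can be made concrete by computing support, confidence, and lift for a candidate rule and testing them against the thresholds quoted above. This is a minimal sketch: the function names and transactions are invented, and reading the 3% and 60% thresholds as "at least" (with lift strictly greater than 3.0) is an assumption.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n                          # P(A and C)
    confidence = n_both / n_a if n_a else 0.0     # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0  # P(C | A) / P(C)
    return support, confidence, lift

def is_strong(measures, min_support=0.03, min_confidence=0.60, min_lift=3.0):
    """Apply the interestingness thresholds (>= is an assumed reading for the first two)."""
    support, confidence, lift = measures
    return support >= min_support and confidence >= min_confidence and lift > min_lift

# Hypothetical transactions, invented for illustration.
transactions = (
    [{"pen", "printer"}] * 2
    + [{"pen"}]
    + [{"milk"}] * 7
)
support, confidence, lift = rule_measures(transactions, {"pen"}, {"printer"})
strong = is_strong((support, confidence, lift))
```

Here the rule pen -> printer holds in 2 of 10 transactions (support 0.2), in 2 of the 3 pen purchases (confidence about 0.67), and printers are bought about 3.3 times more often with pens than overall, so it passes all three thresholds.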

Page 32: Data Mining-Current Status and Research Directions

Conclusions

Data mining: a promising research frontier

Data mining research has been striding forward greatly in the last decade

However, data mining, as an industry, has not been flying as high as expected

Much research and application exploration are needed

  Web mining

  Towards integrated data mining environments and tools

  Towards intelligent, efficient, and scalable data mining methods

Page 33: Data Mining-Current Status and Research Directions

http://www.cs.sfu.ca/~han

http://db.cs.sfu.ca

Thank you!!!

Page 34: Data Mining-Current Status and Research Directions

References

J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining", COMPUTER (special issues on Data Mining), 32(8): 46-50, 1999.