Data Mining

Edward Hong Zhang, CS Dept, SUNY Albany
CSI 668, March 20, 2001
Presentation Outline
Motivation
Background (KDD Process)
What’s Data Mining?
Why Data Mining?
The Data Mining Process
Data Mining Algorithms
Data Mining Research Trend
Existing Systems for Data Mining
Conclusions
Motivation: “Necessity is the mother of invention”

Data explosion problem: automated data collection tools, the availability of increasingly cheap storage devices, and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
We are drowning in data, but starving for knowledge! Data is everywhere.
Understanding and using the data is an imminent task!
Solution: knowledge discovery (data warehousing and data mining).
Evolution of Database Technology

1960s-1970s: data collection, database creation, IMS, and network DBMS.
1970s-1980s: the relational data model and relational DBMS implementations.
1980s-1990s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), and application-oriented DBMS (spatial, scientific, engineering, etc.).
1990s-present: data mining and data warehousing, multimedia databases, and Web-based database technology.
Background

Knowledge Discovery (KD): the process of finding general patterns/principles that summarize/explain a set of "observations".

Knowledge Discovery in Databases (KDD): Very Large DataBases (VLDB) have become the industry standard, making it impossible for human beings to mine the data "by hand" to look for interesting patterns. Automated tools are therefore needed to help extract these patterns.
Background Cont.

The knowledge discovery in databases (KDD) process consists of three steps:

Data Integration (Data Warehousing): collecting the target data observations from the different data sources, removing noise from the observations, and integrating them into an appropriate format.

Data Mining (covered in detail below): applying a concrete algorithm to find useful and novel patterns in the integrated data.
Background Cont.
Pattern Evaluation:
Interpreting mined patterns, evaluating them according to usefulness/interestingness criteria, and possibly using visualization tools to aid in understanding the patterns graphically.
See KDD process graph below:
Data Mining: KDD process

[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data mining: the core of knowledge discovery process.
What Is Data Mining?

Data mining (knowledge discovery in databases): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information (knowledge) or patterns from data in large databases, data warehouses, or other information repositories.

What is not data mining? (Deductive) query processing; expert systems or machine learning/statistical programs; online analytical processing (OLAP); software agents.
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of database technology and OLAP, high-performance computing, machine learning (AI), visualization, information science, pattern recognition, and statistics/modeling]
Why Data Mining? – Potential Applications

Database analysis and decision support systems (DSS):
Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation.
Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
Text mining (text databases, documents): keyword search and analysis.
DNA sequence analysis and gene expression.
Data Mining and Business Intelligence

[Figure: a pyramid of increasing potential to support business decisions. From bottom to top: data sources (paper, files, information providers, database systems, OLTP), handled by the DBA; data warehouses/data marts with OLAP and MDA, for the data analyst; data exploration (statistical analysis, querying and reporting); data mining and visualization techniques, producing useful patterns for the business analyst; and finally decision making by the end user.]
Why Data Mining? – Potential Applications (Cont.)

Internet: Web Surf-Aid (Web mining). IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat.
The Data Mining Process

[Figure: a data mining system trains a model on historical training data using a data mining algorithm, evaluates and scores the model, and then applies it to new data to produce predictions and result patterns]
Examples of “Discovered” Patterns

Association rules: find rules relating different attributes, e.g. 98% of AOL users also have eBay accounts.
Classification: classify data based on the values of a classifying attribute, e.g. people aged under 40 with salary > $40,000 trade on-line.
Clustering: group data to form new classes, e.g. users A and B access similar URLs, so they belong to the same group, which has similar user profiles.
Are All the “Discovered” Patterns Interesting?

A data mining system/query may generate thousands of patterns, and not all of them are interesting. Suggested approach: query-based, focused mining.

Interestingness measures: a pattern is interesting if it is easily understood by humans; valid on new or test data with some degree of certainty; potentially useful; and novel, or it validates some hypothesis that a user seeks to confirm.
How Can We Find All and Only the Interesting Patterns?

Find all the interesting patterns (completeness): can a data mining system find all the interesting patterns?
Search only for interesting patterns (optimization): can a data mining system find only the interesting patterns?

Approaches:
First generate all the patterns and then filter out the uninteresting ones.
Generate only the interesting patterns (mining query optimization).
Data Mining Algorithms

Four common DM algorithm types:
The k-Nearest Neighbor Algorithm (KNN)
Artificial Neural Networks (ANN)
Rule Induction
Decision Trees
The k-Nearest Neighbor Algorithm (KNN)

A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset.
The entire training database is used as the model.
Find the nearest data point(s) and do the same thing that was done for those records.
[Figure: a query point xq surrounded by records labeled + and -; its class is decided by the labels of its nearest neighbors]
The k-Nearest Neighbor Algorithm (KNN) (Cont.)

Distance-weighted nearest neighbor: weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors.
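The weighting formula referred to here did not survive extraction; the usual textbook formulation (an assumption on my part, not recovered from the slide) predicts the query point's value as a distance-weighted combination of its k neighbors:

```latex
\hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i\, f(x_i)}{\sum_{i=1}^{k} w_i},
\qquad
w_i = \frac{1}{d(x_q, x_i)^2}
```

where f(x_i) is the class/value of neighbor x_i and d is the distance function; squaring the distance makes closer neighbors dominate.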
Advantages: takes the (weighted) mean of the k nearest neighbors; robust to noisy data because it averages over the k nearest neighbors; very easy to implement.

Disadvantage: huge models (the entire training database), which makes it more difficult to use in production.
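As a concrete illustration of the KNN idea on this slide, here is a minimal sketch; the dataset, function name, and choice of k are illustrative, not from the original:

```python
from collections import Counter
import math

def knn_classify(query, training, k=3):
    """Classify `query` by majority vote of its k nearest labeled neighbors.

    `training` is a list of (point, label) pairs; points are numeric tuples.
    The whole training set is the "model": nothing is learned up front.
    """
    neighbors = sorted(training, key=lambda rec: math.dist(query, rec[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy historical dataset: two clusters labeled '+' and '-'.
history = [((1, 1), '+'), ((1, 2), '+'), ((2, 1), '+'),
           ((8, 8), '-'), ((8, 9), '-'), ((9, 8), '-')]

print(knn_classify((2, 2), history, k=3))   # nearest three records are all '+'
```

Note that the "model" really is the entire training list, which is exactly the huge-model disadvantage the slide mentions.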
Artificial Neural Networks Algorithm (ANN)

Non-linear predictive models that learn through training and loosely resemble biological neural networks in structure.
Inputs are transformed through a network of simple processors.
Each processor combines (weighted) inputs and produces an output value.
Artificial neural networks (Cont.)

[Figure: a single processing unit: inputs x0, x1, ..., xn with weights w0, w1, ..., wn feed a weighted sum (minus a bias θk), which passes through an activation function f to produce the output y]
The n-dimensional input vector x is mapped to the output y via the scalar product with the weight vector and a nonlinear activation function; training adjusts the weights using a learning rate.
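The unit described above (weighted sum, bias, activation function, learning-rate updates) can be sketched as follows. The sigmoid activation, learning rate, and the logical-AND training example are illustrative choices on my part, not taken from the slides:

```python
import math

def perceptron_output(x, w, bias):
    """Weighted sum of inputs minus a bias, passed through a sigmoid activation."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - bias
    return 1.0 / (1.0 + math.exp(-s))

def train_step(x, target, w, bias, lr=0.5):
    """One gradient-descent update for a single sigmoid unit (squared error)."""
    y = perceptron_output(x, w, bias)
    delta = (target - y) * y * (1.0 - y)          # error times sigmoid slope
    w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
    bias = bias - lr * delta
    return w, bias

# Train the unit to approximate logical AND (purely illustrative).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, bias = [0.0, 0.0], 0.0
for _ in range(5000):
    for x, t in data:
        w, bias = train_step(x, t, w, bias)

print(round(perceptron_output((1, 1), w, bias)))   # → 1
print(round(perceptron_output((0, 1), w, bias)))   # → 0
```

The slow loop over 5000 epochs for a four-row problem also illustrates the slide's point about long training times relative to prediction.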
Multi-layer perceptron artificial neural networks

[Figure: the input vector xi feeds a layer of input nodes, which connect to hidden nodes and then to output nodes producing the output vector]
Artificial Neural Network Evaluation

Advantages: prediction accuracy is generally high; robust (still works when training examples contain errors).

Disadvantages:
Key problem: difficult to understand. The neural network model is hard to interpret, and there is no intuitive understanding of the results.
Long training time: although the trained network is very quick to apply, the training process itself is time-consuming.
Significant pre-processing of the data is often required.
Rule Induction

Rule induction (rule-based prediction): first generate a set of rules from a data warehouse, then use them to predict values for new data items. It works much better on larger (and real) data sets, not just on samples of data.

Two phases:
Rule discovery: analyze a historical database and generate a set of rules by automatic discovery.
Prediction: apply the rules to a new data set, matching rules to make predictions.
Rule Induction Example

Training Set

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
Rule Induction Example (Cont.)

4 attributes:
Outlook: sunny, overcast, rainy (3 cases)
Temperature: hot, mild, cool (3 cases)
Humidity: high, normal (2 cases)
Windy: true, false (2 cases)
1 outcome: Class (N or P)

In total there are 3×3×2×2 = 36 possible combinations, of which 14 are present in the set of input examples.
Rule Induction Example (Cont.)

Some rules induced from the above dataset:

Classification rules:
If outlook = sunny and humidity = high then class = N
If outlook = rainy and windy = true then class = N
If outlook = overcast then class = P

Association rules:
If temperature = cool then humidity = normal
If windy = false and class = N then outlook = sunny and humidity = high
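To make the example concrete, the induced rules can be checked mechanically against the training set. This sketch (the tuple encoding is my own) asserts that each rule above holds for every matching row:

```python
# The 14-row training set: (outlook, temperature, humidity, windy, class).
rows = [
    ("sunny",    "hot",  "high",   False, "N"),
    ("sunny",    "hot",  "high",   True,  "N"),
    ("overcast", "hot",  "high",   False, "P"),
    ("rain",     "mild", "high",   False, "P"),
    ("rain",     "cool", "normal", False, "P"),
    ("rain",     "cool", "normal", True,  "N"),
    ("overcast", "cool", "normal", True,  "P"),
    ("sunny",    "mild", "high",   False, "N"),
    ("sunny",    "cool", "normal", False, "P"),
    ("rain",     "mild", "normal", False, "P"),
    ("sunny",    "mild", "normal", True,  "P"),
    ("overcast", "mild", "high",   True,  "P"),
    ("overcast", "hot",  "normal", False, "P"),
    ("rain",     "mild", "high",   True,  "N"),
]

# Classification rules: each holds for every row it matches.
assert all(c == "N" for o, t, h, w, c in rows if o == "sunny" and h == "high")
assert all(c == "N" for o, t, h, w, c in rows if o == "rain" and w)
assert all(c == "P" for o, t, h, w, c in rows if o == "overcast")

# Association rules hold as well.
assert all(h == "normal" for o, t, h, w, c in rows if t == "cool")
assert all(o == "sunny" and h == "high"
           for o, t, h, w, c in rows if not w and c == "N")
print("all induced rules hold on the training set")
```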
What is a decision tree?

A decision tree is a flow-chart-like tree structure:
An internal node denotes a test on an attribute.
A branch represents an outcome of the test; all tuples in a branch have the same value for the tested attribute.
A leaf node represents a class label or a class label distribution.
In effect, a series of nested if/then rules. Understandable!
A Sample Decision Tree
(built from the same training set as the rule induction example)

[Figure: the root tests Outlook. sunny → test Humidity (high → N, normal → P); overcast → P; rain → test Windy (true → N, false → P)]
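The sample tree really is a series of nested if/then rules; a minimal sketch (the function name is illustrative):

```python
def classify(outlook, humidity, windy):
    """The sample decision tree written as nested if/then rules."""
    if outlook == "sunny":
        return "N" if humidity == "high" else "P"
    if outlook == "overcast":
        return "P"
    # remaining case: outlook == "rain"
    return "N" if windy else "P"

print(classify("sunny", "high", False))     # → N
print(classify("overcast", "high", True))   # → P
print(classify("rain", "normal", True))     # → N
```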
Another Example for DT

If x = 1 and y = 0 then class = a
If x = 0 and y = 1 then class = a
If x = 0 and y = 0 then class = b
If x = 1 and y = 1 then class = b
Another Example for DT

Credit Analysis

Salary   Education       Label
10000    high school     reject
40000    undergraduate   accept
15000    undergraduate   reject
75000    graduate        accept
18000    graduate        accept

[Figure: the root tests salary < 20000; if no, accept; if yes, test education in graduate (yes → accept, no → reject)]
Decision-Tree Classification Methods

The basic top-down decision tree generation approach usually consists of two phases:
Tree construction: at the start, all the training examples are at the root; the examples are then partitioned recursively based on selected attributes.
Tree pruning: aims at removing tree branches that may lead to errors when classifying test data (training data may contain noise, statistical fluctuations, etc.).
How to construct a tree?

Greedy algorithm: make the locally optimal choice at each step by selecting the best attribute for each tree node.
Top-down recursive divide-and-conquer: starting from the root, split the node into several branches, then run the algorithm recursively on each branch, down to the leaves.
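The greedy, top-down divide-and-conquer procedure can be sketched with information gain as the "best attribute" criterion. That criterion (as in ID3) and the data encoding are assumptions of mine; the slide does not name a specific splitting measure:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attrs):
    """Greedy step: pick the attribute with the highest information gain."""
    base = entropy([r["class"] for r in rows])
    def gain(a):
        rem = sum((cnt / len(rows)) *
                  entropy([r["class"] for r in rows if r[a] == v])
                  for v, cnt in Counter(r[a] for r in rows).items())
        return base - rem
    return max(attrs, key=gain)

def build_tree(rows, attrs):
    """Top-down recursive divide-and-conquer construction."""
    labels = [r["class"] for r in rows]
    if len(set(labels)) == 1 or not attrs:            # pure node, or no attrs left
        return Counter(labels).most_common(1)[0][0]   # leaf = majority class
    a = best_attribute(rows, attrs)
    rest = [x for x in attrs if x != a]
    return (a, {v: build_tree([r for r in rows if r[a] == v], rest)
                for v in {r[a] for r in rows}})

# A tiny four-row subset of the weather data, just to show the mechanics.
rows = [
    {"outlook": "sunny",    "windy": "false", "class": "N"},
    {"outlook": "sunny",    "windy": "true",  "class": "N"},
    {"outlook": "overcast", "windy": "false", "class": "P"},
    {"outlook": "overcast", "windy": "true",  "class": "P"},
]
tree = build_tree(rows, ["outlook", "windy"])
print(tree)   # root splits on 'outlook'; both branches are pure leaves
```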
How to prune a tree

A decision tree constructed from the training data may have too many branches/leaf nodes, caused by noise or overfitting, and may give poor accuracy on unseen samples.

Prune the tree by merging a subtree into a leaf node, using a set of data different from the training data: at a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node labeled with the majority class.
How to use a tree?

Directly: test the attribute values of the unknown sample against the tree; a path is traced from the root to a leaf, which holds the class label.

Indirectly: convert the decision tree to classification rules, creating one rule for each path from the root to a leaf; IF-THEN rules are easier for humans to understand.
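The indirect route (one rule per root-to-leaf path) can be sketched as follows, using the sample weather tree from the earlier slide encoded as nested tuples (the encoding is illustrative):

```python
def tree_to_rules(node, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path.

    A tree is either a leaf (a class label string) or a tuple
    (attribute, {value: subtree}).
    """
    if not isinstance(node, tuple):                    # leaf: finish the rule
        ifpart = " and ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {ifpart} THEN class = {node}"]
    attr, branches = node
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

# The sample weather tree, encoded as nested tuples.
tree = ("outlook", {
    "sunny":    ("humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain":     ("windy", {"true": "N", "false": "P"}),
})

for rule in tree_to_rules(tree):
    print(rule)
# e.g. IF outlook = sunny and humidity = high THEN class = N
```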
Decision tree for a covering algorithm

[Figure: three views of the same scatter of class-a and class-b points in the x-y plane; successive splits (on x at 1.2, then on y at 2.6) carve out progressively purer regions]
Data Mining Algorithm Summary

KNN: quick and easy, but models tend to be very large.
ANN: difficult to interpret; can require significant amounts of time to train.
Rule Induction: understandable, but need to limit calculations.
Decision Trees: understandable and relatively fast.

Other DM technologies: genetic algorithms, rough sets, Bayesian networks, mixture models, and many more.
Data Mining Research Trend
Text mining: Text database and information retrieval
Multimedia data mining
OLAM (OLAP Mining)
Web mining (data mining and the WWW): e-commerce, information retrieval (search), network management.
Why Mine the Web?

The Web is a huge, widely distributed, highly heterogeneous, semi-structured, hypertext/hypermedia, interconnected, evolving information repository.
The Web is a huge collection of documents, plus hyperlink information and access/usage information.
Enormous wealth of information on the Web: financial information (e.g. stock quotes), book/CD/video stores (e.g. Amazon), restaurant information (e.g. Zagats), car prices (e.g. Carpoint).
Lots of data on user access patterns: Web logs contain the sequences of URLs accessed by users.
Why is Web Mining Different?

Huge: the Web is a huge collection of documents, plus hyperlink information and access/usage information.
Dynamic: the Web is very dynamic; new pages are constantly being generated.
Unstructured: the complexity of Web pages is far greater than that of a text document collection.

Challenge: develop new Web mining algorithms, and adapt traditional data mining algorithms, to exploit hyperlinks and access patterns and to be incremental.
Types of Web Mining

Web Mining
├─ Web Content Mining
│   ├─ Web Page Content Mining
│   └─ Search Result Mining
├─ Web Structure Mining
└─ Web Usage Mining
    ├─ General Access Pattern Tracking
    └─ Customized Usage Tracking
Web Mining Applications

E-commerce (infrastructure): generating user profiles, targeted advertising, fraud detection, similar-image retrieval.
Information retrieval (search) on the Web: automated generation of topic hierarchies, Web knowledge bases, extraction of schemas for XML documents.
Network management: performance management, fault management.
Existing Systems for Data Mining

IBM: Intelligent Miner
SAS Institute: Enterprise Miner
Silicon Graphics: MineSet
Integral Solutions Ltd.: Clementine
Information Discovery Inc.: Data Mining Suite
DBMiner Technology Inc.: DBMiner
Rutgers: DataMine; GMD: Explora; Univ. of Munich: VisDB
Microsoft OLE DB for Data Mining

Microsoft OLE, OLE DB, OLE DB for OLAP, and OLE DB for Data Mining.
OLE DB for DM: standardized from July 1999 to March 2000.
Microsoft SQL Server 2000: Analysis Manager, which consists of OLAP and data mining; the data mining part has two modules (classification/prediction and clustering).
OLE DB for DM: data mining providers (such as association modules and other classification or clustering modules).
Research Progress for Data Mining in the Last Decade

Multi-dimensional data analysis: data warehouses and OLAP (on-line analytical processing)
Association, correlation, and causality analysis
Classification: scalability and new approaches
Clustering and outlier analysis
Sequential patterns and time-series analysis
Text mining, Web mining, and Web-log analysis
Spatial, multimedia, and scientific data analysis
Data preprocessing and database compression
Data visualization and visual data mining
Conclusions

Knowledge Discovery in Databases (KDD)

Data warehousing: an industry trend. A data warehouse stores a huge amount of subject-oriented, cleansed, integrated, consolidated, time-related data.

Data Mining: a rich, promising, young field with broad applications and many challenging research issues. Good science, with a leading position in the research community.
Conclusions (Cont.)
Data mining tasks: characterization, association, classification, clustering, prediction, sequence and pattern analysis, etc.
Data mining algorithms: the k-Nearest Neighbor algorithm (KNN), artificial neural networks (ANN), rule induction, and decision trees.
Research progress and trend in Data Mining
Future Work

Theoretical foundations of data mining.
Implementation and new data mining methodologies: a set of well-tuned, standard mining operators; data and knowledge visualization tools; integration of multiple data mining strategies.
Data mining in advanced information systems: spatial, multimedia, and Web mining.
Data mining applications: content browsing, query optimization, multi-resolution models, etc.
Social issues: a threat to security and privacy.