1.3 applications, issues


Upload: krishver2

Post on 13-Feb-2017


TRANSCRIPT

1

Applications of Data Mining
Issues in Data Mining

2

Applications
  Financial Data Analysis
  Retail Industry
  Telecommunication Industry
  Biological Data Analysis
  Other Scientific Applications
  Intrusion Detection

3

Financial Data Analysis
Financial data collected from banks and financial institutions
  Usually complete and reliable
Design and construction of data warehouses for multi-dimensional data analysis and mining
  Analysis – changes by month, by region, by sector… and max, min, total, average, trend, etc.
  Characteristic and comparative analysis, outlier analysis
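The "changes by month, by region" analysis above can be sketched as a plain group-by aggregation. This is a minimal illustration, not a data warehouse: the `transactions` records and the `summarize` helper are hypothetical names introduced here, and a real system would more likely use OLAP queries over a cube.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction records: (month, region, amount).
transactions = [
    ("2017-01", "North", 120.0),
    ("2017-01", "South", 80.0),
    ("2017-01", "North", 200.0),
    ("2017-02", "North", 150.0),
    ("2017-02", "South", 95.0),
]

def summarize(records, key_index):
    """Group amounts by one dimension and compute max, min, total, average."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return {
        key: {
            "max": max(values),
            "min": min(values),
            "total": sum(values),
            "average": mean(values),
        }
        for key, values in groups.items()
    }

by_month = summarize(transactions, 0)   # dimension: month
by_region = summarize(transactions, 1)  # dimension: region
```

Passing a different `key_index` switches the dimension, which is the same idea as slicing a cube along another axis.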

4

Financial Data Analysis
Loan payment and customer credit policy analysis
  Feature selection and attribute relevance ranking (debt ratio, credit history, income, education level …)
  Loan granting policy can be adjusted
  Low risk customers are granted loans
Classification and clustering of customers for targeted marketing
  Customer group identification
  Multidimensional clustering techniques
  Can associate new customer with existing groups
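One simple way to "associate a new customer with existing groups" is to assign the customer to the nearest group centroid. The sketch below assumes the groups were already found by some clustering step; the centroid values, the two features and the `assign_group` helper are all hypothetical.

```python
import math

# Hypothetical group centroids over (debt_ratio, income_in_thousands),
# assumed to come from an earlier clustering run.
centroids = {
    "low_risk": (0.1, 90.0),
    "medium_risk": (0.4, 55.0),
    "high_risk": (0.8, 25.0),
}

def assign_group(customer, centroids):
    """Return the name of the centroid closest to the customer (Euclidean)."""
    return min(centroids, key=lambda name: math.dist(customer, centroids[name]))

new_customer = (0.35, 60.0)
group = assign_group(new_customer, centroids)
```

With unscaled features, income dominates the distance here; in practice the attributes would probably be normalized first.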

5

Financial Data Analysis
Detection of money laundering and financial crimes
  Data from several sources – integrated
  Data analysis tools can be used to detect unusual patterns
  Data visualization tools, linkage analysis tools
  Classification tools, clustering tools
  Outlier analysis tools
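As a rough illustration of an outlier analysis tool, the sketch below flags unusual transaction amounts with a modified z-score based on the median absolute deviation (MAD). The technique is chosen here for illustration and is not named in the slides; the `amounts` data and the 3.5 cut-off (a common rule of thumb) are assumptions.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure deviations against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical transaction amounts for one account.
amounts = [120, 95, 130, 110, 105, 9800, 125]
suspicious = mad_outliers(amounts)
```

The median-based score is used instead of mean and standard deviation because a single very large transfer would otherwise inflate the baseline it is compared against.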

6

Retail Industry
  Sales data, customer shopping history, goods transportation, e-commerce
Mining can help to
  Identify buying behaviour, discover shopping trends
  Improve the quality of customer service, retain customers
Design and construction of data warehouses
  Several ways to design a warehouse
  Entities involved: sales, customers, employees, goods transportation…
  Preliminary data mining exercises can help to guide the design process
  Dimensions and levels to involve and pre-processing to be done

7

Retail Industry
Multi-dimensional analysis of sales, customers, products, time and region
  Multi-feature data cube
  Visualization tools
Analysis of effectiveness of sales campaigns
  Compare sales and transaction volume
  Multidimensional analysis
  Compare sales amount, number of transactions containing same items before and after the campaign
Association analysis
  Identify items likely to be purchased together
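To make "items likely to be purchased together" concrete, the sketch below counts how often each pair of items shares a basket and keeps the pairs above a minimum support. It covers only item pairs, not a full association-rule miner such as Apriori; the `baskets` data and the `frequent_pairs` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "tea"},
    {"bread", "butter", "jam"},
]

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs bought together often enough."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair one canonical order, e.g. ("bread", "milk")
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }

pairs = frequent_pairs(baskets, min_support=0.4)
```

Support here is the fraction of all baskets that contain both items.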

8

Retail Industry
Customer retention
  Customer loyalty and trends
  Sequential pattern mining
  Adjust pricing strategy and goods range
Purchase recommendation and cross-reference of items
  Recommender systems
  Sales promotion by displaying deal information in association with items of interest
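Sequential pattern mining looks at the order of purchases, not only at co-occurrence. The sketch below is a deliberately small version: it counts, per customer, which item was bought before which other item. The `histories` data and the `sequential_pairs` helper are hypothetical, and real miners handle longer sequences and time gaps.

```python
from collections import Counter

# Hypothetical purchase histories, in time order, per customer.
histories = {
    "c1": ["phone", "case", "charger"],
    "c2": ["phone", "charger"],
    "c3": ["laptop", "mouse"],
    "c4": ["phone", "case"],
}

def sequential_pairs(histories):
    """Count customers who bought item a and, later, item b."""
    counts = Counter()
    for items in histories.values():
        seen = set()  # count each ordered pair once per customer
        for i, first in enumerate(items):
            for later in items[i + 1:]:
                if first != later:
                    seen.add((first, later))
        counts.update(seen)
    return counts

counts = sequential_pairs(histories)
```

A pair such as ("phone", "case") with a high count suggests recommending a case shortly after a phone purchase.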

9

Telecommunication Industry
  Computer and Web data transmission, fax, mobile phone, telephone services
Multidimensional analysis of telecommunication data
  Helps to identify and compare the data traffic, system work load, resource usage, user group behavior, profit…
  Time-of-day usage patterns
Fraudulent pattern analysis
  Identify fraudulent users and atypical usage patterns
  Illegal customer account access
  Automatic dial-out equipment
  Switch and route congestion patterns
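A time-of-day usage pattern can be approximated by counting call records per hour. The timestamps and the `calls_per_hour` helper below are hypothetical; the timestamp format is an assumption made for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical call start times.
calls = [
    "2017-02-13 09:15",
    "2017-02-13 09:47",
    "2017-02-13 14:05",
    "2017-02-14 09:30",
    "2017-02-14 22:10",
]

def calls_per_hour(timestamps):
    """Count calls by hour of day (0-23), across all dates."""
    hours = (datetime.strptime(t, "%Y-%m-%d %H:%M").hour for t in timestamps)
    return Counter(hours)

usage = calls_per_hour(calls)
peak_hour, peak_count = usage.most_common(1)[0]
```

The same counter built per user group or per region would give one more dimension of the multidimensional analysis.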

10

Telecommunication Industry
Multidimensional association and sequential pattern analysis
  Usage patterns for a set of communication services by customer group, time of day
  Sales promotion
Mobile telecommunication services
  Spatio-temporal data mining
Use of visualization tools

11

Biomedical and DNA Data Analysis
Research in DNA analysis has led to
  Development of new drugs
  Cancer therapies
  Human genome study
  Discovery of genetic causes for many diseases
Genome research
  Study of DNA sequences
  Adenine, Cytosine, Guanine, Thymine
  100,000 genes (an older estimate; current estimates are about 20,000 protein-coding genes) – each has hundreds of nucleotides – can be combined in a number of ways
  Identifying gene sequence patterns is challenging
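A basic building block for finding patterns in DNA sequences is counting k-mers, the substrings of length k over the alphabet A, C, G, T. The sequence below is made up for illustration and `kmer_counts` is a hypothetical helper.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence."""
    sequence = sequence.upper()
    invalid = set(sequence) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected symbols: {sorted(invalid)}")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Made-up sequence, used only to show the counting.
counts = kmer_counts("ATGCGATGCA", 3)
```

Since there are 4**k possible k-mers, the pattern space grows quickly with k, which is one reason the slide calls the task challenging.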

12

Biomedical and DNA Data Analysis
Semantic integration of heterogeneous, distributed genome databases
  Highly distributed generation and use of DNA data
  Integrated data warehouses and distributed federated databases
  Efficient data cleaning and integration methods
Similarity search and comparison among DNA sequences
  Gene sequences – isolated from healthy and diseased tissues
  Compare frequently occurring patterns in each class
  Help to identify the genetic factors of the disease and immune factors
  Non-numeric nature of data poses difficulties
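Because sequence data is non-numeric, similarity is usually defined on the strings themselves. The sketch below uses edit (Levenshtein) distance as one possible measure; the slides do not name a specific measure, and the sequences in `database` are made up.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]

# Made-up sequences standing in for database entries.
database = {"seq_a": "ACGAAGCA", "seq_b": "TTGACCGA", "seq_c": "ACGTTGGA"}
query = "ACGTTGCA"
closest = min(database, key=lambda name: edit_distance(query, database[name]))
```

Production sequence search tends to rely on alignment scoring and indexing rather than plain edit distance, so this only shows the basic idea of a string-based similarity.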

13

Biomedical and DNA Data Analysis
Association analysis: identification of co-occurring gene sequences
  Diseases – triggered by a combination of genes acting together
  Association analysis helps to detect the kinds of genes that may co-occur
  Study interactions and relationships between them
Path analysis: linking genes to different stages of disease development
  Different genes become active at different stages of the disease
  Develop drug interventions that target specific stages

14

Biomedical and DNA Data Analysis
Visualization tools and genetic data analysis
  Complex gene structures – graphs, trees, cuboids and visualization tools
  Better understanding and support for interactive data exploration

15

Intrusion Detection

Intrusions
  Any set of actions that threaten the integrity, availability, or confidentiality of a network resource
Misuse detection: use patterns of well-known attacks to identify intrusions
  Signatures – must be updated
  Classification based on known intrusions
  E.g., three consecutive login failures: password guessing
Anomaly detection: use deviation from normal usage patterns to identify intrusions
  Any significant deviations from the expected behavior are reported as possible attacks
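The "three consecutive login failures" signature can be written directly as a rule over a time-ordered event log. The event format `(user, succeeded)` and the `flag_password_guessing` helper are assumptions made for this sketch.

```python
def flag_password_guessing(events, limit=3):
    """Flag users with `limit` or more consecutive failed logins.

    events: iterable of (user, succeeded) tuples in time order.
    """
    streak = {}
    flagged = set()
    for user, succeeded in events:
        if succeeded:
            streak[user] = 0  # a success resets the failure streak
        else:
            streak[user] = streak.get(user, 0) + 1
            if streak[user] >= limit:
                flagged.add(user)
    return flagged

# Hypothetical login log.
events = [
    ("alice", False), ("bob", False), ("alice", False),
    ("bob", True), ("alice", False), ("bob", False),
]
flagged = flag_password_guessing(events)
```

A fixed rule like this only catches the attack it was written for, which is why the slide notes that signatures must be kept up to date.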

16

Intrusion Detection

Data mining algorithms
Misuse detection
  Training data labeled – normal / intrusion
  Classifier can be used to detect known intrusions
  Classification algorithms, association rule mining
Anomaly detection
  Builds models of normal behavior and detects significant deviations
  Supervised – ‘normal’ training data
  Unsupervised – no information about training data
  Classification, clustering
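For anomaly detection with 'normal' training data, a very small model of normal behavior is a mean and standard deviation, with anything far outside that range reported. The sketch below assumes a single numeric feature (logins per hour) and a 3-sigma cut-off; both are illustrative choices, not taken from the slides.

```python
from statistics import mean, stdev

def fit_normal_profile(samples):
    """Model normal behavior as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)

def is_anomalous(value, profile, k=3.0):
    """Report values more than k standard deviations from the normal mean."""
    mu, sigma = profile
    return abs(value - mu) > k * sigma

# Hypothetical training data known to be normal: logins per hour.
normal_logins_per_hour = [4, 5, 6, 5, 4, 6, 5, 5]
profile = fit_normal_profile(normal_logins_per_hour)
```

For example, `is_anomalous(40, profile)` reports a burst of logins, while `is_anomalous(6, profile)` stays within the learned range.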

17

Intrusion Detection

Association and correlation analysis
  Finds relationships between system attributes describing the network data
  Helps in selection of useful attributes
Analysis of stream data
  Transient and dynamic nature of intrusions
  An event may be normal on its own but malicious when viewed as a part of a sequence
Distributed data mining
  Analysis of data from several locations
Visualization and querying tools
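The point that an event may be harmless alone but malicious as part of a sequence can be illustrated with a sliding window over a connection stream. In this sketch, one connection to one port is normal, but many distinct ports from the same source within a short window raises an alert. The event format, window size, port limit and the `detect_port_scan` helper are all assumptions.

```python
from collections import deque

def detect_port_scan(events, window=10, max_ports=3):
    """Alert when a source touches more than max_ports distinct ports
    within `window` seconds.

    events: iterable of (timestamp_seconds, source, port) in time order.
    """
    recent = {}  # source -> deque of (timestamp, port) inside the window
    alerts = []
    for timestamp, source, port in events:
        queue = recent.setdefault(source, deque())
        queue.append((timestamp, port))
        while queue and timestamp - queue[0][0] > window:
            queue.popleft()  # drop events that fell out of the window
        if len({p for _, p in queue}) > max_ports:
            alerts.append((timestamp, source))
    return alerts

# Hypothetical connection stream.
stream = [
    (0, "10.0.0.5", 22), (1, "10.0.0.9", 80), (2, "10.0.0.5", 23),
    (3, "10.0.0.5", 80), (4, "10.0.0.5", 443),
    (30, "10.0.0.9", 80), (31, "10.0.0.9", 443),
]
alerts = detect_port_scan(stream)
```

Only the most recent events per source are kept, so memory stays bounded even though the stream itself is unbounded.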

18

Data Mining in Other Scientific Applications
Old scenario: small, homogeneous data sets
  Formulate hypothesis, build model, evaluate results
Current scenario: high-dimensional data, stream data, heterogeneous data (spatial, temporal)
  Collect and store data, mine for new hypotheses, confirm with data or experimentation
Vast amounts of data have been collected from scientific domains
  Climate and ecosystem modeling, chemical engineering, fluid dynamics, structural mechanics…

19

Other Scientific Applications

Data warehouses and data preprocessing
  Scientific applications – methods are needed for integrating data from heterogeneous sources (geospatial data warehouse) and identifying events (climate and ecosystem data)
Mining complex data types
  Scientific data – semi-structured and unstructured
  Multimedia and spatial data

20

Other Scientific Applications

Graph-based mining
  Labeled graphs – capture spatial, topological, geometric and other relational characteristics present in scientific data
  Nodes – objects to be mined; edges – relationships
  Scalable and efficient mining methods are needed
Visualization tools and domain specific knowledge
  High level GUIs and visualization tools are needed
  Integrated with existing domain-specific systems and database systems
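A labeled graph can be stored as a node-label map plus a list of labeled edges. The sketch below counts the simplest possible pattern, a single labeled edge between two labeled nodes; frequent-subgraph miners extend this idea to larger patterns. The molecule-like example data and the `edge_patterns` helper are hypothetical.

```python
from collections import Counter

# Hypothetical labeled graph: nodes are objects, edges are relationships.
node_labels = {1: "C", 2: "O", 3: "C", 4: "H"}
edges = [(1, 2, "double"), (1, 3, "single"), (3, 4, "single"), (1, 4, "single")]

def edge_patterns(node_labels, edges):
    """Count (node label, edge label, node label) patterns, ignoring direction."""
    counts = Counter()
    for u, v, edge_label in edges:
        # Sort the endpoint labels so C-H and H-C count as the same pattern.
        first, second = sorted((node_labels[u], node_labels[v]))
        counts[(first, edge_label, second)] += 1
    return counts

patterns = edge_patterns(node_labels, edges)
```

Counting larger subgraphs requires isomorphism checks, which is where the scalability concern on the slide comes from.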

21

Issues in Data Mining

Mining methodology and user interaction
  Mining different kinds of knowledge in databases
  Interactive mining of knowledge at multiple levels of abstraction
  Incorporation of background knowledge
  Data mining query languages and ad-hoc data mining
  Expression and visualization of data mining results
  Handling noise and incomplete data
  Pattern evaluation

22

Issues in Data Mining

Issues relating to the diversity of data types
  Handling relational and complex types of data
  Mining information from heterogeneous databases and global information systems (WWW)
Performance and scalability
  Efficiency and scalability of data mining algorithms
  Parallel, distributed and incremental mining methods
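An incremental method updates its result as new data arrives instead of re-reading everything. As a small stand-in for incremental mining, the sketch below keeps a running mean and variance using Welford's update; the `RunningStats` class is a hypothetical name and the technique is chosen for illustration only.

```python
class RunningStats:
    """Running mean and sample variance, updated one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared differences from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)  # each new record costs O(1), no rescan of old data
```

Each update touches only three numbers, so the cost per new record does not grow with the amount of data already seen.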