1.3 Applications, Issues
Applications
  Financial Data Analysis
  Retail Industry
  Telecommunication Industry
  Biological Data Analysis
  Other Scientific Applications
  Intrusion Detection
Financial Data Analysis
Financial data collected from banks and financial institutions
  Usually complete and reliable
Design and construction of data warehouses for multi-dimensional data analysis and mining
  Analysis: changes by month, by region, by sector, and max, min, total, average, trend, etc.
  Characteristic and comparative analysis, outlier analysis
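The "by month, by region, by sector" analysis above can be sketched as a grouped aggregation. This is a minimal illustration only: the record layout, the sample values, and the `summarize` helper are assumptions, not part of the original material.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction records: (month, region, sector, amount).
records = [
    ("2024-01", "North", "Retail", 120.0),
    ("2024-01", "North", "Energy", 80.0),
    ("2024-01", "South", "Retail", 200.0),
    ("2024-02", "North", "Retail", 150.0),
    ("2024-02", "South", "Energy", 50.0),
]

# Position of each dimension inside a record.
DIMENSIONS = {"month": 0, "region": 1, "sector": 2}


def summarize(rows, dimension):
    """Group amounts by one dimension and report max, min, total and average."""
    idx = DIMENSIONS[dimension]
    groups = defaultdict(list)
    for row in rows:
        groups[row[idx]].append(row[3])
    return {
        key: {
            "max": max(values),
            "min": min(values),
            "total": sum(values),
            "average": mean(values),
        }
        for key, values in groups.items()
    }


by_region = summarize(records, "region")
by_month = summarize(records, "month")
```

A data warehouse would precompute such summaries across several dimensions at once; the sketch only shows one dimension at a time.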
Loan payment and customer credit policy analysis
  Feature selection and attribute relevance ranking (debt ratio, credit history, income, education level …)
  Loan granting policy can be adjusted
  Low-risk customers are granted loans
Classification and clustering of customers for targeted marketing
  Customer group identification
  Multidimensional clustering techniques
  Can associate new customer with existing groups
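One common way to rank attribute relevance is information gain with respect to the outcome. The slides do not name a specific measure, so the choice of information gain, the attribute names, and the toy applicant data below are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical loan applicants; "repaid" is the outcome to explain.
applicants = [
    {"credit_history": "good", "education": "graduate", "repaid": "yes"},
    {"credit_history": "good", "education": "school", "repaid": "yes"},
    {"credit_history": "bad", "education": "graduate", "repaid": "no"},
    {"credit_history": "bad", "education": "school", "repaid": "no"},
]

# Attributes sorted from most to least relevant.
ranking = sorted(
    ["credit_history", "education"],
    key=lambda a: information_gain(applicants, a, "repaid"),
    reverse=True,
)
```

In this toy data credit history fully separates the two outcomes while education carries no information, so credit history ranks first.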
Detection of money laundering and financial crimes
  Data from several sources is integrated
  Data analysis tools can be used to detect unusual patterns
    Data visualization tools, linkage analysis tools
    Classification tools, clustering tools
    Outlier analysis tools
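As a rough illustration of an outlier analysis tool, the sketch below flags transaction amounts that lie far from the median, using the median absolute deviation (MAD). The slides do not prescribe a method; the MAD rule, the 3.5 threshold, and the sample amounts are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values are (nearly) identical, so nothing stands out.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical transaction amounts for one account.
amounts = [120, 95, 110, 130, 105, 9800, 115, 100]
suspicious = mad_outliers(amounts)
```

A median-based rule is used here because a single very large transfer inflates the mean and standard deviation, which can hide the very value being looked for.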
Retail Industry
  Sales data, customer shopping history, goods transportation, e-commerce
Mining can help to
  Identify buying behaviour, discover shopping trends
  Improve the quality of customer service, retain customers
Design and construction of data warehouses
  Several ways to design a warehouse
  Entities involved: sales, customers, employees, goods transportation…
  Preliminary data mining exercises can help to guide the design process
    Dimensions and levels to involve and pre-processing to be done
Multi-dimensional analysis of sales, customers, products, time and region
  Multi-feature data cube
  Visualization tools
Analysis of effectiveness of sales campaigns
  Compare sales and transaction volume
  Multidimensional analysis
    Compare sales amount, number of transactions containing same items before and after the campaign
Association analysis
  Identify items likely to be purchased together
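Association analysis is usually expressed through support (how often items appear together) and confidence (how often the consequent appears given the antecedent). The sketch below computes both for item pairs; the baskets and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def pair_support(transactions):
    """Fraction of transactions containing each unordered item pair."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
```

Here every basket with butter also contains bread, so the rule "butter implies bread" has confidence 1.0, while the reverse rule is weaker.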
Customer retention
  Customer loyalty and trends
  Sequential pattern mining
  Adjust pricing strategy and goods range
Purchase recommendation and cross-reference of items
  Recommender systems
  Sales promotion by displaying deal information in association with items of interest
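Sequential pattern mining looks at the order of purchases, not just co-occurrence. A minimal sketch, assuming each customer history is an ordered list of items, counts how many customers bought one item and later another. The histories and the function name are hypothetical, and full sequential-pattern algorithms handle longer patterns than the pairs shown here.

```python
from collections import Counter


def sequential_pair_support(sequences):
    """Count customers whose history contains item a followed later by item b."""
    counts = Counter()
    for seq in sequences:
        seen_pairs = set()
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if a != b:
                    seen_pairs.add((a, b))
        # Each customer contributes at most once per ordered pair.
        counts.update(seen_pairs)
    return counts


# Hypothetical purchase histories in time order, one list per customer.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
]
patterns = sequential_pair_support(histories)
```

A frequent pair such as phone followed by charger is the kind of pattern that could drive a recommendation or a follow-up promotion.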
Telecommunication Industry
  Computer and Web data transmission, fax, mobile phone, telephone services
Multidimensional analysis of telecommunication data
  Helps to identify and compare data traffic, system workload, resource usage, user group behavior, profit…
  Time-of-day usage patterns
Fraudulent pattern analysis
  Identify fraudulent users and atypical usage patterns
  Illegal customer account access
  Automatic dial-out equipment
  Switch and route congestion patterns
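A time-of-day usage pattern can be approximated by counting calls per hour. The sketch below assumes call start times are available as ISO 8601 strings; the timestamps and the `hourly_profile` name are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def hourly_profile(timestamps):
    """Count calls per hour of day (0 to 23) from ISO 8601 timestamps."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


# Hypothetical call start times.
calls = [
    "2024-03-01T08:15:00",
    "2024-03-01T08:40:00",
    "2024-03-01T13:05:00",
    "2024-03-02T08:55:00",
    "2024-03-02T22:30:00",
]
profile = hourly_profile(calls)
peak_hour = profile.most_common(1)[0][0]
```

Comparing one user's profile against the profile of their customer group is one possible way to spot atypical usage.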
Multidimensional association and sequential pattern analysis
  Usage patterns for a set of communication services by customer group, time of day
  Sales promotion
Mobile telecommunication services
  Spatio-temporal data mining
Use of visualization tools
Biomedical and DNA Data Analysis
Research in DNA analysis has led to
  Development of new drugs
  Cancer therapies
  Human genome study
  Discovery of genetic causes for many diseases
Genome research
  Study of DNA sequences
  Nucleotides: adenine, cytosine, guanine, thymine
  100,000 genes; each has hundreds of nucleotides, which can be combined in a number of ways
  Identifying gene sequence patterns is challenging
Semantic integration of heterogeneous, distributed genome databases
  Highly distributed generation and use of DNA data
  Integrated data warehouses and distributed federated databases
  Efficient data cleaning and integration methods
Similarity search and comparison among DNA sequences
  Gene sequences isolated from healthy and diseased tissues
  Compare frequently occurring patterns in each class
  Help to identify the genetic factors of the disease and immune factors
  Non-numeric nature of data poses difficulties
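Because the data is non-numeric, one simple way to compare frequently occurring patterns in each class is to count short substrings (k-mers) per sequence. The sketch below is a toy illustration under that assumption: the sequences are invented and far shorter than real genes, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def class_kmer_frequency(sequences, k):
    """Fraction of sequences in a class that contain each k-mer at least once."""
    present = Counter()
    for seq in sequences:
        present.update(set(kmer_counts(seq, k)))
    return {kmer: n / len(sequences) for kmer, n in present.items()}


# Invented toy sequences standing in for the two tissue classes.
healthy = ["ACGTAC", "ACGTTT"]
diseased = ["GGATCC", "TTGGAT"]

healthy_freq = class_kmer_frequency(healthy, 3)
diseased_freq = class_kmer_frequency(diseased, 3)

# Patterns found in every diseased sample and in no healthy sample.
candidates = {
    kmer for kmer, f in diseased_freq.items()
    if f == 1.0 and kmer not in healthy_freq
}
```

Patterns that are frequent in one class and absent in the other are candidates for further study, not evidence of a genetic cause on their own.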
Association analysis: identification of co-occurring gene sequences
  Diseases: triggered by a combination of genes acting together
  Association analysis helps to detect the kinds of genes that may co-occur
  Study interactions and relationships between them
Path analysis: linking genes to different stages of disease development
  Different genes become active at different stages of the disease
  Develop drug interventions that target specific stages
Visualization tools and genetic data analysis
  Complex gene structures: graphs, trees, cuboids and visualization tools
  Better understanding and support for interactive data exploration
Intrusion Detection
Intrusions
  Any set of actions that threaten the integrity, availability, or confidentiality of a network resource
Misuse detection: use patterns of well-known attacks to identify intrusions
  Signatures must be updated
  Classification based on known intrusions
  E.g., three consecutive login failures: password guessing
Anomaly detection: use deviation from normal usage patterns to identify intrusions
  Any significant deviations from the expected behavior are reported as possible attacks
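The "three consecutive login failures" signature can be written as a small rule over an ordered event log. The event format (user, outcome) and the function name are assumptions made for this sketch.

```python
def detect_password_guessing(events, limit=3):
    """Flag users with `limit` consecutive failed logins.

    `events` is an iterable of (user, outcome) pairs in time order,
    where outcome is "fail" or "ok" (an assumed log format).
    """
    streak = {}
    flagged = []
    for user, outcome in events:
        if outcome == "fail":
            streak[user] = streak.get(user, 0) + 1
            if streak[user] == limit and user not in flagged:
                flagged.append(user)
        else:
            # A successful login resets the failure streak.
            streak[user] = 0
    return flagged


# Hypothetical login log.
log = [
    ("alice", "fail"), ("alice", "fail"), ("bob", "fail"),
    ("alice", "fail"), ("bob", "ok"), ("bob", "fail"), ("bob", "fail"),
]
alerts = detect_password_guessing(log)
```

A rule like this only catches the attack it encodes, which is why signatures need to be kept up to date.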
Data mining algorithms
  Misuse detection
    Training data labeled: normal / intrusion
    Classifier can be used to detect known intrusions
    Classification algorithms, association rule mining
  Anomaly detection
    Builds models of normal behavior and detects significant deviations
    Supervised: 'normal' training data
    Unsupervised: no information about training data
    Classification, clustering
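In the supervised setting, where only 'normal' training data is available, a very simple model of normal behavior is the mean and standard deviation of one measurement. The sketch below assumes a single numeric feature (for example, requests per minute) and a 3-sigma rule; both are illustrative choices, not something the slides specify.

```python
from statistics import mean, stdev


def fit_normal_profile(normal_values):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(normal_values), stdev(normal_values)


def is_anomalous(value, profile, k=3.0):
    """Report a value as a possible attack if it deviates more than k sigma."""
    mu, sigma = profile
    return abs(value - mu) > k * sigma


# Hypothetical requests-per-minute observed during normal operation.
normal_traffic = [100, 102, 98, 101, 99]
profile = fit_normal_profile(normal_traffic)
```

Real systems model many attributes jointly, but the principle is the same: learn what normal looks like, then report significant deviations.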
Association and correlation analysis
  Finds relationships between system attributes describing the network data
  Helps in selection of useful attributes
Analysis of stream data
  Transient and dynamic nature of intrusions
  An event may be normal on its own but malicious when viewed as part of a sequence
Distributed data mining
  Analysis of data from several locations
Visualization and querying tools
Data Mining in Other Scientific Applications
Old scenario: small, homogeneous data sets
  Formulate hypothesis, build model, evaluate results
Current scenario: high-dimensional data, stream data, heterogeneous data (spatial, temporal)
  Collect and store data, mine for new hypotheses, confirm with data or experimentation
Vast amounts of data have been collected from scientific domains
  Climate and ecosystem modeling, chemical engineering, fluid dynamics, structural mechanics…
Data warehouses and data preprocessing
  Scientific applications: methods are needed for integrating data from heterogeneous sources (geospatial data warehouse) and identifying events (climate and ecosystem data)
Mining complex data types
  Scientific data: semi-structured and unstructured
  Multimedia and spatial data
Graph-based mining
  Labeled graphs capture spatial, topological, geometric and other relational characteristics present in scientific data
  Nodes: objects to be mined; edges: relationships
  Scalable and efficient mining methods are needed
Visualization tools and domain-specific knowledge
  High-level GUIs and visualization tools are needed
  Integrated with existing domain-specific systems and database systems
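To make the labeled-graph idea concrete, the sketch below represents nodes as labeled objects and edges as labeled relationships, then counts the smallest possible pattern: a single labeled edge between two labeled nodes. The molecule-like example and the function name are assumptions, and real graph mining methods go on to search for larger frequent subgraphs.

```python
from collections import Counter


def edge_label_patterns(node_labels, edges):
    """Count (label, relationship, label) patterns over undirected edges."""
    patterns = Counter()
    for u, v, relationship in edges:
        # Sort the endpoint labels so that direction does not matter.
        a, b = sorted((node_labels[u], node_labels[v]))
        patterns[(a, relationship, b)] += 1
    return patterns


# Hypothetical labeled graph: node id -> label, edges as (u, v, relationship).
node_labels = {1: "C", 2: "O", 3: "C", 4: "H"}
edges = [
    (1, 2, "double"),
    (1, 3, "single"),
    (3, 4, "single"),
    (3, 2, "double"),
]
patterns = edge_label_patterns(node_labels, edges)
```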
Issues in Data Mining
Mining methodology and user interaction
  Mining different kinds of knowledge in databases
  Interactive mining of knowledge at multiple levels of abstraction
  Incorporation of background knowledge
  Data mining query languages and ad-hoc data mining
  Expression and visualization of data mining results
  Handling noise and incomplete data
  Pattern evaluation
Issues relating to the diversity of data types
  Handling relational and complex types of data
  Mining information from heterogeneous databases and global information systems (WWW)
Performance and scalability
  Efficiency and scalability of data mining algorithms
  Parallel, distributed and incremental mining methods
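The idea behind incremental mining is to update existing results when new data arrives instead of rescanning everything. A minimal sketch, assuming the mined result is item support over transactions; the class name and sample batches are hypothetical.

```python
from collections import Counter


class IncrementalItemCounter:
    """Maintain item support across batches without rescanning old data."""

    def __init__(self):
        self.counts = Counter()
        self.n_transactions = 0

    def update(self, batch):
        """Fold a new batch of transactions into the running counts."""
        for basket in batch:
            # Use a set so an item counts once per transaction.
            self.counts.update(set(basket))
            self.n_transactions += 1

    def support(self, item):
        """Fraction of all transactions seen so far that contain the item."""
        if self.n_transactions == 0:
            return 0.0
        return self.counts[item] / self.n_transactions


counter = IncrementalItemCounter()
counter.update([["a", "b"], ["a"]])
support_a_first = counter.support("a")
counter.update([["b"]])
support_a_second = counter.support("a")
```

Only the new batch is scanned on each update, so the cost of staying current depends on the size of the increment and not on the full history.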