Competitive advantage from Data Mining: some lessons learnt
in the Information Systems field
Mykola Pechenizkiy, Seppo Puuronen
Department of Computer Science, University of Jyväskylä, Finland
Alexey Tsymbal
Department of Computer Science, Trinity College Dublin, Ireland
PMKD’05 Copenhagen, Denmark August 22-26, 2005
PMKD’05, Copenhagen, Denmark, August 22-26, 2005. Competitive advantage from DM: lessons learnt in the IS field, by M. Pechenizkiy, S. Puuronen, A. Tsymbal
Outline
• Introduction and what is our message?
• Part I: Existing frameworks for DM
– Theory-oriented: Databases; Statistics; Machine learning; etc.
– Process-oriented: Fayyad’s, CRISP-DM, Reinartz’s
• Part II: Where are we? – rigor vs. relevance in DM
• Part III: Towards the new framework for DM research
– DM system as adaptive Information System (IS)
– DM research as IS development: DM system as artefact
– DM success model: success factors
– KM challenges in KDD
– One possible reference for new DM research framework
• Further plans and discussion
What is Data Mining
Data mining or knowledge discovery is the process of finding previously unknown and potentially interesting patterns and relations in large databases (Fayyad, KDD’96)
Data mining is the emerging science and industry of applying modern statistical and computational technologies to the problem of finding useful patterns hidden within large databases (John 1997)
Intersection of many fields: statistics, AI, machine learning, databases, neural networks, pattern recognition, econometrics, etc.
H. Information Systems
H.0 GENERAL
H.1 MODELS AND PRINCIPLES
H.2 DATABASE MANAGEMENT
• H.2.0 General
– Security, integrity, and protection
• H.2.8 Database Applications
– Data mining
– Image databases
– Scientific databases
– Spatial databases and GIS
– Statistical databases
• H.2.m Miscellaneous
http://www.acm.org/class/1998/ valid in 2003
I. Computing Methodologies
I.5 PATTERN RECOGNITION
• I.5.0 General
• I.5.1 Models
– Deterministic
– Fuzzy set
– Geometric
– Neural nets
– Statistical
– Structural
• I.5.2 Design Methodology
– Classifier design & evaluation
– Feature evaluation & selection
– Pattern analysis
• I.5.3 Clustering
– Algorithms
– Similarity measures
• I.5.4 Applications
– Computer vision
– Signal processing
– Text processing
– Waveform analysis
I.2 ARTIFICIAL INTELLIGENCE
• I.2.0 General
– Cognitive simulation
– Philosophical foundations
• I.2.1 Applications and Expert Systems
• I.2.2 Automatic Programming
• I.2.3 Deduction and Theorem Proving
• I.2.4 Knowledge Representation Formalisms and Methods
• I.2.5 Programming Languages and Software
• I.2.6 Learning
– Analogies
– Concept learning
– Connectionism and neural nets
– Induction
– Knowledge acquisition
– Language acquisition
– Parameter learning
• I.2.7 Natural Language Processing
• I.2.m Miscellaneous
G. Mathematics of Computing
G.3 PROBABILITY AND STATISTICS
• Correlation and regression analysis
• Distribution functions
• Experimental design
• Markov processes
• Multivariate statistics
• Nonparametric statistics
• Probabilistic algorithms (including Monte Carlo)
• Statistical computing
Our Message
• DM is still a technology of which much is expected: it should enable organizations to benefit more from their huge databases.
• There are some success stories in which organizations have gained a competitive advantage from DM.
• Still, the strong focus of most DM researchers on technology-oriented topics does not support expanding the scope to less rigorous but practically very relevant sub-areas.
• Research in the IS discipline has a strong tradition of taking into account the human and organizational aspects of systems besides the technical ones.
Our Message
• DM-supporting processes that take human and organizational aspects into account are still in their infancy.
• The DM community might benefit, at least from a practical point of view, from looking at older sub-areas of IT that have a tradition of considering solution-driven concepts with a focus also on human and organizational aspects.
• By becoming more receptive to the research results of the IS community, the DM community might be able to increase its collective understanding of
– how DM artifacts are developed – conceived, constructed, and implemented,
– how DM artifacts are used, supported and evolved,
– how DM artifacts impact and are impacted by the contexts in which they are embedded.
Part I
• Existing Frameworks for DM
– Theory-oriented
• Databases;
• Statistics;
• Machine learning;
• Data compression
– Process-oriented
• Fayyad’s
• CRISP-DM
• Reinartz’s
Theory-Oriented Frameworks
Database Perspective
• DM as application to DBs
– “In the same way business applications are currently supported using SQL-based API, the KDD applications need to be provided with application development support.”
– query KDD objects, support for finding NNs (nearest neighbours), clustering, or discretization and aggregate operations.
• Inductive databases approach
– query concept should be applied also to data mining and knowledge discovery tasks
• “there is no such thing as discovery, it is all in the power of the query language”
– contain not only the data but the theory of the data as well
Imielinski, T., and Mannila, H. 1996, A database perspective on knowledge discovery. Communications of the ACM, 39(11), 58-64.
Boulicaut, J., Klemettinen, M., and Mannila, H. 1999, Modeling KDD processes within the inductive database framework. In Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery, Springer-Verlag, London, 293-302.
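The database perspective treats mining as a matter of querying. As a rough illustration only, the sketch below uses Python's built-in sqlite3 module to compute the support of an item pair with plain SQL. The table name `baskets`, the toy rows and the helper `pair_support` are assumptions made for this example; they do not come from the cited papers, and this is not an inductive database.

```python
import sqlite3

def pair_support(conn, item_a, item_b):
    """Fraction of transactions containing both items (hypothetical helper)."""
    total = conn.execute("SELECT COUNT(DISTINCT tid) FROM baskets").fetchone()[0]
    both = conn.execute(
        """
        SELECT COUNT(*) FROM (
            SELECT tid FROM baskets
            WHERE item IN (?, ?)
            GROUP BY tid
            HAVING COUNT(DISTINCT item) = 2
        )
        """,
        (item_a, item_b),
    ).fetchone()[0]
    return both / total if total else 0.0

# Toy transaction table: one row per (transaction id, item). Assumed data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [(1, "bread"), (1, "milk"), (2, "bread"),
     (3, "bread"), (3, "milk"), (4, "beer")],
)

support = pair_support(conn, "bread", "milk")
print(support)
```

The point of the sketch is that the "pattern" (how often two items co-occur) is obtained by a query alone, which is the spirit of the quoted remark about the power of the query language.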
Reductionism Approach
• Two basic statistical paradigms
– “Statistical Experiment”
• Fisher’s version, inductive principle of maximum likelihood
• Neyman and Pearson-Wald’s version, inductive behaviour
• Bayesian version, maximum posterior probability
• “Statistical learning from empirical process”
– “Structural Data Analysis”
• SVD
• Data mining and statistics: the issue of computational feasibility has a much clearer role in data mining than in statistics
– data mining approaches emphasize database integration, simplicity of use, and the understandability of results
– the theoretical framework of statistics is not much concerned with data analysis as a process that includes several steps
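To make the difference between the maximum-likelihood and the Bayesian (maximum posterior) versions concrete, here is a minimal sketch for a coin-flip (Bernoulli) parameter. The function names and the Beta(2, 2) prior are assumptions chosen for illustration, not something taken from the slides.

```python
def bernoulli_mle(successes, trials):
    """Maximum-likelihood estimate of p: the observed success rate."""
    return successes / trials

def bernoulli_map(successes, trials, alpha=2.0, beta=2.0):
    """Maximum a posteriori estimate of p under an assumed Beta(alpha, beta) prior.

    The posterior is Beta(alpha + s, beta + n - s); its mode is
    (s + alpha - 1) / (n + alpha + beta - 2) when both parameters exceed 1.
    """
    return (successes + alpha - 1) / (trials + alpha + beta - 2)

# Three successes in three trials: the MLE is extreme, the MAP is pulled
# towards the prior.
mle_estimate = bernoulli_mle(3, 3)
map_estimate = bernoulli_map(3, 3)
print(mle_estimate, map_estimate)
```

With very little data the two estimates differ noticeably; as the number of trials grows, the influence of the prior shrinks and they converge.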
Machine Learning Approach
• “let the data suggest a model” can be seen as a practical alternative to the statistical paradigm “fit a model to the data”
• Constructive Induction – a learning process with two intertwined phases: construction of the “best” representation space and generation of hypotheses in the found space (Michalski & Wnek, 1993).
– Feature transformation (PCA, SVD, Random Projection)
– Feature selection
– LSI
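As an illustration of the feature transformation step listed above, the following sketch finds the first principal component of a small data set by power iteration on the covariance matrix. It is a minimal, standard-library-only example; the function name, the starting vector and the toy data are assumptions, and a real system would normally rely on a numerical library.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return a unit vector along the direction of maximum variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration; the start vector is assumed not to be orthogonal
    # to the leading eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
component = first_principal_component(points)
print(component)
```

Projecting each instance onto this direction replaces the two original features with a single constructed one, which is the kind of representation-space construction that constructive induction describes.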
Data Compression Approach
– Compress the data set by finding some structure or knowledge in it, where knowledge is interpreted as a representation that allows the data to be coded using fewer bits.
– Theories should not be ad hoc, that is, they should not overfit the examples used to build them.
– Occam’s razor principle, 14th century.
• "when you have two competing models which make exactly the same predictions, the one that is simpler is the better".
Mehta, M., Rissanen, J., and Agrawal, R. 1995, MDL-based decision tree pruning. In U.M. Fayyad, R. Uthurusamy (Eds.) Proceedings of the KDD 1995, AAAI Press, Montreal, Canada, 216-221.
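The compression view can be sketched with a two-part description length: bits needed to state the model plus bits needed to encode the data given the model. The example below compares a parameter-free "fair coin" model with a fitted Bernoulli model on binary sequences. Charging 0.5·log2(n) bits for the single fitted parameter is a common approximation assumed here for illustration; it is not the pruning scheme of the cited paper, and all names are hypothetical.

```python
import math

def data_bits(bits, p):
    """Code length (in bits) of a 0/1 sequence under a Bernoulli(p) model."""
    ones = sum(bits)
    zeros = len(bits) - ones
    total = 0.0
    if ones:
        total -= ones * math.log2(p)
    if zeros:
        total -= zeros * math.log2(1 - p)
    return total

def length_fair_model(bits):
    """Fair-coin model: no parameter to transmit, one bit per symbol."""
    return data_bits(bits, 0.5)

def length_fitted_model(bits):
    """Fitted model: assumed parameter cost plus data cost at the fitted p."""
    n = len(bits)
    p = sum(bits) / n
    model_bits = 0.5 * math.log2(n)  # assumed cost of one real parameter
    return model_bits + data_bits(bits, p)

skewed = [1] * 14 + [0] * 2   # strong regularity: mostly ones
balanced = [1, 0] * 8         # no frequency bias to exploit

for name, seq in (("skewed", skewed), ("balanced", balanced)):
    print(name, length_fair_model(seq), length_fitted_model(seq))
```

On the skewed sequence the extra parameter pays for itself through a shorter data code; on the balanced one it does not, so the simpler model wins, in line with the Occam's razor bullet.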
Other Theoretical Frameworks for DM
• Microeconomic view
– the key point is that data mining is about finding actionable patterns: the only interest is in patterns that can somehow be used to increase utility;
– a decision-theoretic formulation of this principle: the goal is to find a decision x that maximises a utility function f(x).
Kleinberg, J., Papadimitriou, C., and Raghavan, P. 1998, A microeconomic view of data mining, Data Mining and Knowledge Discovery 2(4), 311-324
• Philosophy of Science
– logical empiricism, critical rationalism, systems theory
– formism, mechanism, contextualism
– dispersive vs. integrative, analytical vs. synthetic theories
– subjectivist vs. objectivist, nomothetic vs. ideographic, nominalism vs. realism, voluntarism vs. determinism, epistemological assumptions
– explanation, prediction, understanding
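The decision-theoretic formulation in the microeconomic view above can be sketched as a search over candidate decisions x for the one that maximises f(x), here taken as a sum of per-customer utilities. The pricing scenario, the willingness-to-pay figures and the function names are assumptions invented for this example.

```python
def total_utility(decision, customers, utility):
    """f(x): sum of the utility obtained from each customer under decision x."""
    return sum(utility(decision, customer) for customer in customers)

def best_decision(decisions, customers, utility):
    """Return the decision x that maximises f(x) over the candidate set."""
    return max(decisions, key=lambda x: total_utility(x, customers, utility))

def revenue(price, willingness_to_pay):
    """A customer buys (and yields the price) only if the price is affordable."""
    return price if willingness_to_pay >= price else 0

# Hypothetical data: each number is one customer's willingness to pay.
customers = [9, 10, 12, 12, 7]
candidate_prices = [8, 10, 12]

best_price = best_decision(candidate_prices, customers, revenue)
print(best_price, total_utility(best_price, customers, revenue))
```

In this reading a mined pattern is interesting only to the extent that it changes which decision comes out on top.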
Process-Oriented Frameworks
Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.
CRISP-DM
http://www.crisp-dm.org/
KDD: “Vertical Solutions”
Business Understanding
Data Understanding
Data Preparation
Data Exploration
Data Mining
Evaluation & Interpretation
Deployment
Experience accumulation
Reinartz, T. 1999, Focusing Solutions for Data Mining. LNAI 1623, Berlin Heidelberg.
Conclusion on different frameworks
– The reductionist approach of viewing data mining as statistics has the advantages of a strong theoretical background and easily formulated problems.
– Data mining tasks such as clustering, regression and classification fit easily into these approaches.
– More recent (process-oriented) frameworks address the issues related to a view of data mining as a process, and its iterative and interactive nature.
Part II
Where are we?
Rigor and Relevance in DM Research
So, where are we?
• Lin (in Wu et al.) notes that a successful new industry (such as DM) can follow consecutive phases:
1. discovering a new idea,
2. ensuring its applicability,
3. producing small-scale systems to test the market,
4. better understanding of the new technology, and
5. producing a fully scaled system.
• At present there are several dozen DM systems, none of which can be compared in scale to a DBMS.
– This fact indicates that we are still in the 3rd phase in the DM area!
Rigor vs Relevance in DM Research
[Figure: rigor vs. relevance in DM research, with the labels Relevance and Rigor]
Where is the focus?
• Still! … speeding up, scaling up, and increasing the accuracy of DM techniques.
• Piatetsky-Shapiro: “we see many papers proposing incremental refinements in association rules algorithms, but very few papers describing how the discovered association rules are used”
• Lin claims that the goals of research and of development in DM are quite different:
– research is knowledge-oriented while development is profit-oriented.
– Thus, DM research concentrates on developing new algorithms or enhancing existing ones,
– but DM developers in domain areas are aware of cost considerations: investment in research, product development, marketing, and product support.
• However, we believe that the study of the DM development and DM use processes is as important as the technological aspects, and therefore such research activities are likely to emerge within the DM field.
Part III
Towards the new framework for DM research
DMS in the Kernel of an Organization
[Figure: diagram with the labels DM Task(s), DMS (Artifact), Organization, Environment]
• DM is fundamentally an application-oriented area motivated by business and scientific needs to make sense of mountains of data.
• A DMS is generally used by human beings to support or carry out some task(s) in an organizational environment; both the users and the organization have their own expectations of the DMS.
• Further, the organization has its own environment, which has its own interests related to the DMS, e.g. that the privacy of people is not violated.
The IS-based paradigm for DM
Ives B., Hamilton S., Davis G. (1980). “A Framework for Research in Computer-based MIS” Management Science, 26(9), 910-934.
“Information systems are powerful instruments for organizational problem solving through formal information processing”
Lyytinen, K., 1987, “Different perspectives on ISs: problems and solutions.” ACM Computing Surveys, 19(1), 5-46.
[Figure: framework diagram with the labels User Environment, IS Development Environment, IS Operations Environment, The Use Process, The Development Process, The Operation Process, The Organizational Environment, The External Environment, The Information Subsystem (ISS)]
DM Artifact Development
[Figure: diagram with the labels DM Artifact Development, Experimentation, Theory Building, Observation]
Adapted from: Nunamaker, W., Chen, M., and Purdin, T. 1990-91, Systems development in information systems research, Journal of Management Information Systems, 7(3), 89-106.
A multimethodological approach to the construction of an artefact for DM
Research methods in a paper on DM
– Theoretical approach: theory creating
• Hypothesis, new algorithm, etc.
– Constructive approach
• Prototype of a DM tool
– Theoretical approach: theory testing and evaluation
• Artificial, benchmark, real-world data
• Evaluation techniques
– Conclusion on theory
The Action Research and Design Science Approach to Artifact Creation
[Figure: diagram with the labels Awareness of business problem, Action planning, Action taking, Conclusion, Artifact Development, Artifact Evaluation, Design Knowledge, Business Knowledge, Contextual Knowledge]
DM Artifact Use: Success Model 1 of 3
[Figure: success model with the constructs System Quality, Information Quality, Service Quality, Use, User Satisfaction, Individual Impact, Organizational Impact]
Adapted from D&M IS Success Models
DM Artifact Use: Success Model 2 of 3
• What are the key factors of successful use and impact of a DMS at both the individual and organizational levels?
1. how the system is used, and also supported and evolved, and
2. how the system impacts and is impacted by the contexts in which it is embedded.
Coppock: the failure factors of DM-related projects
• have nothing to do with the skill of the modeler or the quality of data.
• But they do include:
1. persons in charge of the project did not formulate actionable insights,
2. the sponsors of the work did not communicate the insights derived to key constituents,
3. the results don't agree with institutional truths
Leadership, communication skills and an understanding of the culture of the organization are no less important than the traditionally emphasized technological job of turning data into insights.
DM Artifact Use: Success Model 3 of 3
• Hermiz argues that there are four critical success factors for DM projects:
• (1) having a clearly articulated business problem that needs to be solved and for which DM is a proper tool;
• (2) ensuring that the problem being pursued is supported by the right type of data of sufficient quality and in sufficient quantity for DM;
• (3) recognizing that DM is a process with many components and dependencies – the entire project cannot be "managed" in the traditional sense of the business word;
• (4) planning to learn from the DM process regardless of the outcome, and clearly understanding that there is no guarantee that any given DM project will be successful.
KM Perspective
• A knowledge-driven approach to enhance the dynamic integration of DM strategies in knowledge discovery systems.
• The focus here is on knowledge management aimed at organising a systematic process of (meta-)knowledge capture and refinement over time.
– knowledge extracted from data
– the higher-level knowledge required for managing DM techniques’ selection, combination and application
• Basic knowledge management processes of
– knowledge creation and identification, representation, collection and organization, sharing, adaptation, and application
• DEXA’05: the TAKMA workshop paper and presentation are available
[Figure: diagram with the stages Knowledge Creation & Acquisition; Knowledge Organization & Storage; Knowledge Distribution & Integration; Knowledge Adaptation & Application; Knowledge Evaluation, Validation and Refinement]
New Research Framework for DM Research
[Figure: research framework with three blocks and the flows between them]
• Environment
– People: Roles, Capabilities, Characteristics
– Organizations: Strategy, Structure & Culture, Processes
– Technology: Infrastructure, Applications, Communications Architecture, Development Capabilities
• DM Research
– Develop/Build: Theories, Artifacts
– Justify/Evaluate: Analytical, Case Study, Experimental, Field Study, Simulation
– Assess, Refine
• Knowledge Base
– Foundations: Base-level theories, Frameworks, Models, Instantiation, Validation Criteria
– Design knowledge: Methodologies, Validation Criteria (not instantiations of models but KDD processes, services, systems)
• Flows
– Business Needs (Relevance)
– Applicable Knowledge (Rigor)
– (Un-)Successful Applications in the appropriate environment
– Contribution to Knowledge Base
Further Work
• Definition of the relevance concept in DM research
• The revision of the book chapter
• Further work on the new framework for DM research
• Organization of a workshop, special track or working conference on
– more social directions in DM research, likely with one focus on IS as a sister discipline.
A few options:
– IRIS Scandinavian Conference on IS
– Next PMKD
– Workshop in Jyväskylä
Thank You!
Book chapter draft is available on request from
Mykola Pechenizkiy
Department of Computer Science and Information Systems,
University of Jyväskylä, FINLAND
E-mail: mpechen@cs.jyu.fi
Tel.: +358 14 2602472 Fax: +358 14 260 3011
http://www.cs.jyu.fi/~mpechen
Feedback is very welcome:
• Questions
• Suggestions
• Collaboration