Issues in Data Mining Infrastructure


Upload: dayton

Post on 06-Jan-2016


DESCRIPTION

Issues in Data Mining Infrastructure. Authors: Nemanja Jovanovic, Valentina Milenkovic, Prof. Dr. Veljko Milutinovic. http://galeb.etf.bg.ac.yu/~vm. Data Mining in a Nutshell: uncovering the hidden knowledge.

TRANSCRIPT

  • Issues in Data Mining Infrastructure
    Authors: Nemanja Jovanovic, Valentina Milenkovic, Prof. Dr. Veljko Milutinovic

    http://galeb.etf.bg.ac.yu/~vm

  • Data Mining in a Nutshell
    • Uncovering the hidden knowledge
    • Huge NP-complete search space
    • Multidimensional interface

  • A Problem
    • You are a marketing manager for a cellular phone company
    • Problem: churn is too high
    • Bringing back a customer after quitting is both difficult and expensive
    • Giving a new telephone to everyone whose contract is expiring is very expensive (as well as wasteful)
    • You pay a sales commission of $250 per contract
    • Customers receive a free phone (cost $125) with the contract
    • Turnover (after the contract expires) is 40%

  • A Solution
    • Three months before a contract expires, predict which customers will leave
    • If you want to keep a customer who is predicted to churn, offer them a new phone
    • The ones who are not predicted to churn need no attention
    • If you don't want to keep the customer, do nothing
    • How can you predict future behavior?
        • Tarot cards?
        • Magic ball?
        • Data mining?
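
The figures on the problem slide allow a rough cost comparison between doing nothing, giving everyone a phone, and targeting only predicted churners. The sketch below is an illustration under loud assumptions that are not in the slides: a free phone always retains the customer, every lost customer is replaced by a new contract (commission plus phone), and the model's `recall` and `precision` values are hypothetical.

```python
# Figures taken from the slides.
COMMISSION = 250    # sales commission per new contract, in dollars
PHONE_COST = 125    # cost of the free phone, in dollars
CHURN_RATE = 0.40   # share of customers who leave when the contract expires


def cost_do_nothing(n_customers):
    """Replace every churner with a new customer (commission + phone)."""
    churners = n_customers * CHURN_RATE
    return churners * (COMMISSION + PHONE_COST)


def cost_phone_for_everyone(n_customers):
    """Give every expiring customer a new phone.

    Assumption (not in the slides): the phone always retains the customer.
    """
    return n_customers * PHONE_COST


def cost_targeted(n_customers, recall, precision):
    """Offer phones only to customers a model flags as likely churners.

    recall:    share of true churners that the model flags (hypothetical)
    precision: share of flagged customers who are true churners (hypothetical)
    """
    churners = n_customers * CHURN_RATE
    caught = churners * recall            # churners who get a phone and stay
    flagged = caught / precision          # everyone who gets a phone
    missed = churners - caught            # churners who leave and are replaced
    return flagged * PHONE_COST + missed * (COMMISSION + PHONE_COST)


if __name__ == "__main__":
    n = 1000  # hypothetical number of expiring contracts
    print(round(cost_do_nothing(n)))
    print(round(cost_phone_for_everyone(n)))
    print(round(cost_targeted(n, recall=0.8, precision=0.8)))
```

With 1,000 expiring contracts and a model that has 80% recall and 80% precision, the three strategies cost about $150,000, $125,000, and $80,000 respectively. The ranking depends entirely on the assumed model quality.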

  • Still Skeptical?

  • The Definition
    • The automated extraction of predictive information from (large) databases
    • Key terms: automated, extraction, predictive, databases

  • History of Data Mining

  • Repetition in Solar Activity
    • 1613: Galileo Galilei
    • 1859: Heinrich Schwabe

  • The Return of Halley's Comet
    • Edmund Halley (1656 - 1742)
    • Appearances: 239 BC, 1531, 1607, 1682, 1910, 1986, 2061 (???)

  • Data Mining is Not
    • Data warehousing
    • Ad-hoc query/reporting
    • Online Analytical Processing (OLAP)
    • Data visualization

  • Data Mining is
    • Automated extraction of predictive information from various data sources
    • A powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines

  • Data Mining can
    • Answer questions that were too time-consuming to resolve in the past
    • Predict future trends and behaviors, allowing us to make proactive, knowledge-driven decisions

  • Focus of this Presentation
    • Data mining problem types
    • Data mining models and algorithms
    • Efficient data mining
    • Available software

  • Data Mining Problem Types

  • Data Mining Problem Types
    • 6 types
    • Often a combination solves the problem

  • Data Description and Summarization
    • Aims at a concise description of data characteristics
    • Lower end of the scale of problem types
    • Provides the user with an overview of the data structure
    • Typically a subgoal

  • Segmentation
    • Separates the data into interesting and meaningful subgroups or classes
    • Manual or (semi)automatic
    • A problem in itself or just a step in solving a problem

  • Classification
    • Assumption: there are objects whose characteristics place them in different classes
    • Builds classification models that assign the correct labels in advance
    • Arises in a wide range of applications
    • Segmentation can provide labels or restrict data sets

  • Concept Description
    • Understandable description of concepts or classes
    • Close connection to both segmentation and classification
    • Similarities to and differences from classification

  • Prediction (Regression)
    • Similar to classification; the difference is that the discrete target becomes continuous
    • Finds the numerical value of the target attribute for unseen objects
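
To make "finds the numerical value of the target attribute for unseen objects" concrete, here is a minimal sketch that fits a straight line by ordinary least squares and predicts a value for a new input. The slides do not name a regression method, so least squares is an assumption, and the function names and toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Return the continuous target value for an unseen input x."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical toy data: the target happens to follow y = 2x + 1.
    model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(model)              # (2.0, 1.0)
    print(predict(model, 5))  # 11.0
```

A classifier would return a class label at this point; the regression model returns a number instead, which is the distinction the slide draws.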

  • Dependency Analysis
    • Finds a model that describes significant dependencies between data items or events
    • Predicts the value of a data item
    • Special case: associations
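
For the special case of associations, the strength of a dependency between items is commonly measured with support and confidence. The slides do not define these measures, so the sketch below is an illustration using the standard definitions; the basket data and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


if __name__ == "__main__":
    # Hypothetical purchase baskets.
    baskets = [
        {"phone", "case"},
        {"phone", "case", "charger"},
        {"phone"},
        {"charger"},
    ]
    print(support(baskets, {"phone", "case"}))       # 0.5
    print(confidence(baskets, {"phone"}, {"case"}))  # 2 of 3 phone baskets
```

Here the rule "phone implies case" holds in two of the three baskets that contain a phone, so its confidence is about 0.67.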

  • Data Mining Models

  • Neural Networks
    • Characterize the processed data with a single numeric value
    • Efficient modeling of large and complex problems
    • Based on biological structures: neurons
    • A network consists of neurons grouped into layers

  • Neuron Functionality
    • Inputs: I1, I2, I3, ..., In
    • Weights: W1, W2, W3, ..., Wn
    • Output = f(W1*I1, W2*I2, ..., Wn*In)
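
The neuron on this slide can be sketched in a few lines. The slide only says that the output is some function f of the weighted inputs; the common choice shown here, a sigmoid applied to the sum of the weighted inputs, is an assumption, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, activation=sigmoid):
    """Compute f(W1*I1 + W2*I2 + ... + Wn*In) for a single neuron.

    Assumption: f combines the weighted inputs by summing them first.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * i for w, i in zip(weights, inputs))
    return activation(weighted_sum)


if __name__ == "__main__":
    # The weighted sum is 0.5*1.0 + (-0.25)*2.0 = 0.0, and sigmoid(0) = 0.5.
    print(neuron_output([1.0, 2.0], [0.5, -0.25]))  # 0.5
```

Training a network amounts to adjusting the weights so that outputs like this one match known target values.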

  • Training Neural Networks

  • Neural Networks - Conclusion
    • Once trained, neural networks can efficiently estimate the value of the output variable for a given input
    • Neurons and network topology are essential
    • Usually used for prediction or regression problem types
    • Difficult to understand
    • Data pre-processing is often required

  • Decision Trees
    • A way of representing a series of rules that lead to a class or value
    • Iterative splitting of the data into discrete groups, maximizing the distance between them at each split
    • Algorithms: CHAID, CART, QUEST, C5.0
    • Classification trees and regression trees
    • Unlimited growth and stopping rules
    • Univariate splits and multivariate splits
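
One step of the iterative splitting described above can be sketched as a search for the best univariate threshold. The slides do not say how the "distance" between groups is measured; Gini impurity, used here, is one common choice and is an assumption, as are the function names and the toy data.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Find the threshold on one attribute that gives the purest two groups.

    Returns (threshold, weighted_impurity); threshold is None if no split
    improves on the unsplit data.
    """
    n = len(values)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds lie halfway between neighboring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


if __name__ == "__main__":
    # Hypothetical account balances and outcomes.
    balances = [5, 8, 12, 20]
    outcomes = ["leave", "leave", "stay", "stay"]
    print(best_split(balances, outcomes))  # (10.0, 0.0)
```

A full tree builder would apply `best_split` recursively to each resulting group until a stopping rule fires, which is where the "unlimited growth and stopping rules" item comes in.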

  • Decision Trees
    • Balance>10
    • Balance…

  • Decision Trees

  • Rule Induction
    • Method of deriving a set of rules to classify cases
    • Creates independent rules that are unlikely to form a tree
    • Rules may not cover all possible situations
    • Rules may sometimes conflict in a prediction

  • Rule Induction
    • If balance>100.000 then confidence=HIGH & weight=1.7
    • If balance>25.000 and status=married then confidence=HIGH & weight=2.3
    • If balance…

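
The two complete rules on this slide can be expressed as data and evaluated against a customer record. The sketch below assumes that "100.000" and "25.000" use a period as the thousands separator. The conflict policy (take the matching rule with the highest weight) and the `None` result for uncovered cases are assumptions made for illustration; the slides only note that rules may conflict and may not cover every situation.

```python
# Each rule: (condition, confidence, weight). Only the two rules that are
# complete in the slide text are included.
RULES = [
    (lambda c: c["balance"] > 100_000, "HIGH", 1.7),
    (lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
]


def apply_rules(customer, rules=RULES):
    """Return (confidence, weight) of the strongest matching rule, or None.

    Assumption: when several independent rules fire, the one with the
    highest weight wins. A case no rule covers yields None.
    """
    fired = [(conf, weight) for cond, conf, weight in rules if cond(customer)]
    if not fired:
        return None
    return max(fired, key=lambda result: result[1])


if __name__ == "__main__":
    print(apply_rules({"balance": 150_000, "status": "married"}))  # ('HIGH', 2.3)
    print(apply_rules({"balance": 150_000, "status": "single"}))   # ('HIGH', 1.7)
    print(apply_rules({"balance": 1_000, "status": "single"}))     # None
```

Because the rules are independent, both fire for the first customer, which is the overlap a decision tree would never produce.
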
  • K-nearest Neighbor and Memory-Based Reasoning (MBR)
    • Uses knowledge of previously solved, similar problems to solve the new problem
    • Assigns the new case to the class to which most of its k neighbors belong
    • First step: finding a suitable measure of distance between attributes in the data
    • How far is black from green?
    • Advantage: easy handling of non-standard data types
    • Drawback: huge models
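
The majority vote among the k nearest neighbors can be sketched as follows. Euclidean distance is only the default here and is an assumption; as the "How far is black from green?" question suggests, non-numeric attributes need a purpose-built distance, which can be passed in through the hypothetical `distance` parameter. The training data is invented for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two numeric attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training, query, k=3, distance=euclidean):
    """Assign `query` to the class held by most of its k nearest neighbors.

    training: list of (attributes, label) pairs, kept in full. This is why
    the slide lists "huge models" as a drawback.
    """
    neighbors = sorted(training, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical customers described by two numeric attributes.
    training = [
        ((1, 1), "stay"),
        ((1, 2), "stay"),
        ((2, 1), "stay"),
        ((8, 8), "leave"),
        ((9, 8), "leave"),
    ]
    print(knn_classify(training, (2, 2)))  # stay
    print(knn_classify(training, (8, 9)))  # leave
```

There is no training phase: the stored cases are the model, and all the work happens at query time.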

  • K-nearest Neighbor and Memory-Based Reasoning (MBR)

  • Data Mining Models and Algorithms
    • Logistic regression
    • Discriminant analysis
    • Generalized Additive Models (GAM)
    • Genetic algorithms
    • Etc.
    • Many other models and algorithms are available
    • Many application-specific variations of known models
    • The final implementation usually involves several techniques
    • Select the solution that gives the best results

  • Efficient Data Mining

  • Flowchart with YES/NO branches. Boxes: "Is It Working?", "Don't Mess With It!", "Did You Mess With It?", "You Shouldn't Have!", "Anyone Else Knows?", "You're in TROUBLE!", "Can You Blame Someone Else?", "Hide It", "Will It Explode in Your Hands?", "Look the Other Way", "NO PROBLEM!"

  • DM Process Model
    • CRISP-DM: on its way to becoming a standard
    • 5A: used by SPSS Clementine (Assess, Access, Analyze, Act and Automate)
    • SEMMA: used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess)

  • CRISP-DM
    • CRoss-Industry Standard Process for DM
    • Conceived in 1996 by three companies

  • CRISP-DM methodology
    • Four-level breakdown of the CRISP-DM methodology:
        • Phases
        • Generic Tasks
        • Specialized Tasks
        • Process Instances

  • Mapping generic models to specialized models
    • Analyze the specific context
    • Remove any details not applicable to the context
    • Add any details specific to the context
    • Specialize generic contents according to the concrete characteristics of the context
    • Possibly rename generic contents to provide more explicit meanings

  • Generalized and Specialized Cooking
    • Generic: preparing food on your own
        • Find out what you want to eat
        • Find the recipe for that meal
        • Gather the ingredients
        • Prepare the meal
        • Enjoy your food
        • Clean up everything (or leave it for later)
    • Specialized
        • Raw steak with vegetables?
        • Check the cookbook or call mom
        • Defrost the meat (if you had it in the fridge)
        • Buy missing ingredients or borrow them from the neighbors
        • Cook the vegetables and fry the meat
        • Enjoy your food or even more
        • You were cooking, so convince someone else to do the dishes

  • CRISP-DM model
    • Business understanding
    • Data understanding
    • Data preparation
    • Modeling
    • Evaluation
    • Deployment

  • Business Understanding
    • Determine business objectives
    • Assess situation
    • Determine data mining goals
    • Produce project plan

  • Data Understanding
    • Collect initial data
    • Describe data
    • Explore data
    • Verify data quality

  • Data Preparation
    • Select data
    • Clean data
    • Construct data
    • Integrate data
    • Format data

  • Modeling
    • Select modeling technique
    • Generate test design
    • Build model
    • Assess model

  • Evaluation
    • Evaluate results
    • Review process
    • Determine next steps
    • results = models + findings

  • Deployment
    • Plan deployment
    • Plan monitoring and maintenance
    • Produce final report
    • Review project

  • At Last

  • Available Software14

  • Conclusions

  • WWW.NBA.COM

  • Se7en

  • CD ROM

  • Credits
    • Anne Stern, SPSS, Inc.
    • Djuro Gluvajic, ITE, Denmark
    • Obrad Milivojevic, PC PRO, Yugoslavia


  • The END