
  • Introduction to Data Mining

    CPSC/AMTH 445a/545a

    Guy Wolf

    [email protected]

    Yale University, Fall 2016

    CPSC 445 (Guy Wolf) Introduction Yale - Fall 2016 1 / 19

  • Outline

    1 What is Data Mining?
        From data to information
        Predictive vs. descriptive information
        Supervised vs. unsupervised learning

    2 Data Mining Tasks
        Classification & regression
        Clustering & anomaly detection
        Association rules & sequential patterns
        Visualization & dimensionality reduction

    3 Data mining process

    4 Software for data mining

  • Recommended textbooks

  • What is data mining?

    Data Mining: Non-trivial extraction of useful, new, hidden, and/or implicit information from data.

    Deep Learning: A set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, composed of multiple linear and non-linear transformations.

    Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

    Big Data: Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

    Related terms: knowledge discovery in databases (KDD), pattern recognition, data warehousing, OLAP, ETL, IT, etc.

  • What is data mining? From data to information

    [Figure: from collected data to information]

  • What is data mining? From data to information

    Examples of data mining tasks:
      • Recommend movies on Netflix or books on Amazon.
      • Object recognition in images and automatic image tagging
      • Community detection in social networks (e.g., Facebook)
      • Automatic medical diagnosis and treatment recommendation

    Examples of data processing tasks that are not considered data mining:
      • Signature-based anti-virus
      • Retrieving details from a contact list
      • Text-based search in a document or on the web
      • Quicksort, balanced trees, heaps, etc.

  • What is data mining? Predictive vs. descriptive methods

    Predictive methods: Predict unknown information from known data.
      • How much would my house sell for, based on sales stats?
      • Will Bob like Ghostbusters, based on his Netflix history?

    Descriptive methods: Infer or extract interpretable patterns to describe data.
      • What consumer profiles should my ads target?
      • If Jim's card is being charged $300 in a Disney store today, is it reasonable or fraud?

  • What is data mining? Supervised vs. unsupervised learning

    Machine learning data analysis tasks are roughly divided into:

    Supervised learning: Inferring information from labeled training data.

    Unsupervised learning: Finding hidden patterns in unlabeled data.

    Semi-supervised learning: Combine information from labeled and unlabeled data to model and deduce information.

  • Data mining tasks: Classification

    Classification: Classify items into a finite set of classes, or categories.

    Training phase

    Labeled data: $\{(x_1, \ell_1), \ldots, (x_n, \ell_n)\} \subseteq X \times L$

    Classification model: $F : X \to L$, $F(x_i) = \ell_i$, where $L$ is the finite set of classes.
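
    The training phase above can be sketched with a nearest-neighbor rule, which is only one of many possible classification models and is not named in the slides. The function name `train_nearest_neighbor` and the toy data are made up for illustration. The resulting `F` reproduces the training labels, i.e., `F(x_i) == l_i`, as long as the training points are distinct.

```python
import math


def train_nearest_neighbor(labeled_data):
    """Build a classifier F : X -> L from labeled pairs (x_i, l_i)."""
    points = [(tuple(x), label) for x, label in labeled_data]

    def F(x):
        # Predict the label of the closest training point (Euclidean distance).
        nearest = min(points, key=lambda pair: math.dist(pair[0], x))
        return nearest[1]

    return F


# Toy labeled data (hypothetical): two classes in the plane.
training = [
    ((0.0, 0.0), "a"),
    ((0.0, 1.0), "a"),
    ((5.0, 5.0), "b"),
    ((6.0, 5.0), "b"),
]
F = train_nearest_neighbor(training)

# New, unlabeled items are classified by the trained model.
label_near_origin = F((0.2, 0.4))
label_far_away = F((5.5, 4.0))
```

    Here `label_near_origin` and `label_far_away` play the role of a testing phase: the model is applied to items whose labels were not given.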

  • Data mining tasks: Classification - examples

    Example (MNIST digit classification)

    Example (CalTech 101 image classification): Anchor, Joshua-Tree, Beaver, Lotus, Water-Lily

  • Data mining tasks: Regression

    Regression: Compute (or infer) the value of a (piecewise) continuous function from a finite number of sampled items & values.

    This task is similar to classification, but here the model $F$ can have an infinite range (e.g., $\mathbb{R}$ or $[0, 1]$).

    Examples:
      • Market pricing of a house/apartment/car based on its features.
      • Trend line & model fitting from collected experimental data.
      • Weather predictions, such as temperature and probability of rain/snow.
      • Confidence rating in diagnostics (or binary classifier).
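
    As a minimal sketch of regression, the following fits a trend line by ordinary least squares, one common choice among many. The helper `fit_line` and the apartment sizes and prices are invented for illustration; they are not data from the course.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to sampled pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the ratio of the x-y covariance to the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical samples: apartment size (square meters) vs. price (thousands).
sizes = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]

slope, intercept = fit_line(sizes, prices)

# The fitted model F takes values in R rather than in a finite label set.
predicted_price = slope * 90.0 + intercept
```

    Unlike the classification model, the value `predicted_price` is not restricted to the values seen in the samples.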

  • Data mining tasks: Clustering

    Clustering: Group together similar items while separating ones that are different from each other.

    The quality of the obtained clusters stems from their interpretability. Variations include a known or unknown number of clusters, as well as multiscale hierarchical clustering structures.

    Examples:
      • Clustering stocks to diversify stock market investment
      • Community detection in social networks by clustering profiles
      • Clustering genes and cells to uncover activities, reactions, and interactions.
      • Network activity profiling by clustering packets/sessions.
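
    One way to illustrate clustering with a known number of clusters is k-means (Lloyd's algorithm), sketched below. The slides do not prescribe this algorithm; the function `kmeans`, the explicit initial centers, and the toy points are assumptions made for the example.

```python
import math


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for each point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j, old_center in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centers.append(mean)
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(old_center)
        centers = new_centers
    return labels, centers


# Toy data (hypothetical): two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

    The number of clusters is fixed here by the two initial centers; the case of an unknown number of clusters would need a different approach.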

  • Data mining tasks: Anomaly detection

    Anomaly/outlier detection: Detect significant deviations from normal behavior expressed by inferred data patterns.

    The notion of normal behavior can be defined in several ways, such as clustering or model fitting.

    Examples:
      • Fraud detection in credit cards
      • Intrusion detection in cybersecurity
      • Detecting bot traffic in online advertising
      • Malfunction detection in process monitoring
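
    A minimal sketch of the model-fitting route: normal behavior is summarized by the mean and standard deviation of past values, and a new value is flagged when it lies too many standard deviations away. The three-standard-deviation threshold, the helper names, and the charge amounts are assumptions chosen for illustration.

```python
import statistics


def fit_normal_model(history):
    """Summarize normal behavior by the sample mean and standard deviation."""
    return statistics.mean(history), statistics.stdev(history)


def is_anomaly(value, model, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean, std = model
    return abs(value - mean) > threshold * std


# Hypothetical past card charges (in dollars) taken as normal behavior.
past_charges = [20.0, 25.0, 22.0, 30.0, 27.0, 24.0, 26.0, 23.0]
model = fit_normal_model(past_charges)

large_charge_flagged = is_anomaly(300.0, model)
typical_charge_flagged = is_anomaly(28.0, model)
```

    A clustering-based definition of normal behavior would replace the mean and standard deviation with distances to cluster centers.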

  • Data mining tasks: Association rules

    Association rule discovery: Produce dependency rules that model co-occurrences of items in the input, in order to predict, given a partial transaction, the remaining items in it.

    Training phase

    Observed transactions: $T_1, \ldots, T_n \subseteq X$

    Association rules: $F : 2^X \to 2^X$, mapping a partial transaction $T \subseteq T_i$ to a prediction $F(T)$ of the remaining items $T_i \setminus T$.

    Testing phase

    Partial transactions: $S_1, S_2, \ldots \subseteq X$, to which the association rules are applied.

    Predicted information: for every $i$, $S_i \mapsto F(S_i) \subseteq X \setminus S_i$

    Examples:
      • Active advertisements & recommendations (e.g., "Users who liked/bought this product also liked/bought that product")
      • Decision support for shelf organization in stores & supermarkets
      • Name completions in emails, social networks, etc.

    Unlike classification, the actual testing phase is often less important than the discovered rules in this case.
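
    The training and testing phases of association rule discovery can be sketched with a simple confidence rule: an item is predicted when it appears in a large enough fraction of the observed transactions that contain the partial transaction. The function name, the 0.6 confidence threshold, and the shopping baskets are hypothetical; practical systems use more elaborate rule mining.

```python
def fit_association_rules(transactions, min_confidence=0.6):
    """Return F : 2^X -> 2^X predicting the remaining items of a partial transaction."""
    observed = [frozenset(t) for t in transactions]
    items = frozenset().union(*observed)

    def F(partial):
        partial = frozenset(partial)
        # Observed transactions that contain every item of the partial one.
        supporting = [t for t in observed if partial <= t]
        if not supporting:
            return frozenset()
        predicted = set()
        for item in items - partial:
            # Confidence: fraction of supporting transactions containing the item.
            confidence = sum(item in t for t in supporting) / len(supporting)
            if confidence >= min_confidence:
                predicted.add(item)
        return frozenset(predicted)

    return F


# Hypothetical observed transactions T_1, ..., T_n.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]
F = fit_association_rules(baskets)

after_bread = F({"bread"})
after_butter = F({"butter"})
after_milk = F({"milk"})
```

    By construction, `F(S)` never contains items already in `S`, so the prediction is a subset of $X \setminus S$.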

  • Data mining tasks: Sequential patterns

    Sequential pattern discovery: Given a set of ordered event sequences, produce rules to predict unknown/missing/future events from prior and/or subsequent events.

    Similar in some sense to association rule discovery, but with an order or timeline aspect to each transaction.

    Examples:
      • String mining:
          • Natural language processing
          • Gene sequencing in DNA and RNA
      • Frequent item purchase sequences
      • Predicting outcomes of medical treatment
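
    A very small instance of sequential pattern discovery is to count which event most often follows each event and use that as a prediction rule. This first-order rule is a simplifying assumption, and the function name and the event logs below are invented for the example.

```python
from collections import Counter, defaultdict


def fit_next_event_rules(sequences):
    """Learn, for each event, the event that most frequently follows it."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        # Count each (current event, following event) pair in order.
        for current, following in zip(seq, seq[1:]):
            transitions[current][following] += 1

    def predict(event):
        if event not in transitions:
            return None  # the event was never followed by anything
        return transitions[event].most_common(1)[0][0]

    return predict


# Hypothetical ordered event sequences.
sessions = [
    ["login", "browse", "cart", "pay"],
    ["login", "browse", "browse", "cart"],
    ["login", "search", "browse", "cart", "pay"],
]
predict = fit_next_event_rules(sessions)

next_after_login = predict("login")
next_after_cart = predict("cart")
next_after_pay = predict("pay")
```

    Unlike the association rule setting, swapping the order of events within a session changes the learned rules.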

  • Data mining tasks: Dimensionality reduction & visualization

    Dimensionality reduction: Find low dimensional coordinates (e.g., in $\mathbb{R}^d$ with $d < 10$) to represent items.

    Used as a helpful, sometimes critical, preprocessing step to alleviate data analysis challenges arising from the curse of dimensionality.

    Visualization: Find human interpretable 2D or 3D representations of the data via elements, patterns, trends, structures, etc., in it.

    Used to enable manual data processing, so that a human user can draw conclusions from the data, support decision making, or guide further data exploration.

    A combination of these techniques can help create interactive data processing algorithms that utilize unsupervised descriptive elements to request and enable human input, and then use semi-supervised predictive approaches to produce stronger results.
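
    As one possible illustration of dimensionality reduction, the sketch below finds a single coordinate for 3D points using the first principal direction, computed by power iteration on the covariance matrix. Principal component analysis is a choice made for this example rather than a method named above; the function names and the data are hypothetical, and power iteration assumes the starting vector is not orthogonal to the dominant direction.

```python
import math


def first_principal_direction(points, iterations=100):
    """Return the data mean and the dominant eigenvector of the covariance."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - mean[k] for k in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration, started from a normalized all-ones vector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate data with no variance
        v = [x / norm for x in w]
    return mean, v


def project(points, mean, direction):
    """One low-dimensional coordinate per item along the given direction."""
    return [
        sum((p[k] - mean[k]) * direction[k] for k in range(len(mean)))
        for p in points
    ]


# Hypothetical items in R^3 that actually lie on a line.
ts = [-2.0, -1.0, 0.0, 1.0, 2.0]
items = [(t, 2.0 * t, 2.0 * t) for t in ts]

mean, direction = first_principal_direction(items)
coords = project(items, mean, direction)
```

    Each item is now represented by one number in `coords` instead of three, and plotting such coordinates is one simple route to a visualization.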

  • Data mining tasks: Dimensionality reduction & visualization - example

    Modeling lip motions in speech:

    Dominating parameters: lips opening and teeth showing

  • Data mining process

    Typical steps in a data mining process:

    1 Recognizing the specific task

    2 Knowing your data

    3 Preprocessing

    4 Apply algorithms

    5 Postprocessing & getting interpretable results

    6 Evaluation & cross validation
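
    The cross validation mentioned in the last step can be sketched as k-fold splitting: the data indices are divided into k folds, and each fold serves once as the test set while the rest are used for training. The helper `k_fold_indices` and the choice of ten items and five folds are assumptions for illustration.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    # Fold i holds the indices i, i + k, i + 2k, ...
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, test


# Hypothetical setup: 10 items split into 5 folds.
splits = list(k_fold_indices(10, 5))
first_train, first_test = splits[0]
```

    A model would be fitted on each `train` list and evaluated on the matching `test` list, with the k scores averaged into one estimate.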

  • Software for data mining

    Software used in this course:
      • Matlab
      • Python (with Numpy & Scipy)

    Other software:
      • Weka
      • R (especially for statisticians)
      • Scilab & Octave (can be used in lieu of Matlab)
      • C/C++, Java, & C# (.Net)
      • Fortran
      • Many other scripting and programming platforms
