Understanding Data Analytics and Data Mining Introduction

Upload: emily-george

Post on 26-Dec-2015


Page 1: Understanding Data Analytics and Data Mining Introduction

Understanding Data Analytics and Data Mining

Introduction

Page 2: Understanding Data Analytics and Data Mining Introduction

Introduction

An important aspect of the decision-making process is the ability to transform seemingly unrelated data into useful information that can influence a person’s decision. Understanding what data is needed to make effective decisions and where that data comes from is just one step in the process: the next step is mining or analyzing that data to draw useful conclusions that aid decision making.

The Understanding Data Analytics and Data Mining presentation is designed to explore the general principles behind this second step and to support the organization in understanding its options for using data effectively in its business.

Page 3: Understanding Data Analytics and Data Mining Introduction

Distinguishing Analysis and Mining

The terms “data analysis” and “data mining” are sometimes used interchangeably, but they are distinct in practice.

In data analysis, a hypothesis is formed and the data is analyzed to support or disprove the hypothesis.

In data mining, no hypothesis is formed initially; instead, the data is analyzed to identify interesting patterns from which a hypothesis can be drawn.

Despite their differences, the techniques and methods for both data analysis and data mining are similar.
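The contrast can be illustrated with a minimal Python sketch. The sales records, the field layout, and the helper names (group_means, widest_gap) are all hypothetical and chosen only for this example. The first part starts from a hypothesis ("weekend sales are higher than weekday sales") and checks it; the second part starts with no hypothesis and scans every attribute for the largest difference between groups.

```python
from statistics import mean

# Hypothetical sales records: (day_type, region, units_sold).
SALES = [
    ("weekday", "north", 10),
    ("weekday", "south", 30),
    ("weekend", "north", 12),
    ("weekend", "south", 32),
]


def group_means(records, field):
    """Average units sold for each distinct value found at position `field`."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[field], []).append(rec[2])
    return {key: mean(vals) for key, vals in groups.items()}


# Data analysis: form a hypothesis first, then test it against the data.
by_day = group_means(SALES, 0)
hypothesis_supported = by_day["weekend"] > by_day["weekday"]


def widest_gap(records, fields):
    """Data mining: scan every attribute and report the one whose groups differ most."""
    gaps = {}
    for name, index in fields.items():
        means = group_means(records, index)
        gaps[name] = max(means.values()) - min(means.values())
    return max(gaps, key=gaps.get), gaps


best_attribute, gaps = widest_gap(SALES, {"day_type": 0, "region": 1})
print(hypothesis_supported, best_attribute, gaps)
```

Both parts reuse the same grouping helper, which reflects the point that the techniques are similar even though the starting point differs.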

Page 4: Understanding Data Analytics and Data Mining Introduction

Knowledge Discovery in Databases

The Knowledge Discovery in Databases process includes the following steps:

– Selection
– Preprocessing
– Transformation
– Data Mining
– Interpretation/Evaluation
– Knowledge Presentation
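The steps above can be sketched as a chain of small functions. This is only an illustration in Python: the sample records, the function names, and the "above-average spender" pattern are assumptions made for the example, not part of the Knowledge Discovery in Databases process itself.

```python
# Hypothetical raw records: (customer, product, amount); None marks a missing amount.
RAW = [
    ("ann", "tea", 4.0),
    ("bob", "tea", None),
    ("ann", "milk", 2.5),
    ("cy", "tea", 3.0),
    ("cy", "news", 1.0),
]


def select(records, products):
    """Selection: keep only the records relevant to the question."""
    return [r for r in records if r[1] in products]


def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records if r[2] is not None]


def transform(records):
    """Transformation: aggregate to one total per customer."""
    totals = {}
    for customer, _product, amount in records:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def mine(totals):
    """Data mining: find customers who spend more than the average."""
    average = sum(totals.values()) / len(totals)
    return sorted(c for c, total in totals.items() if total > average)


def evaluate(pattern, totals):
    """Interpretation/evaluation: the pattern is only useful if it separates customers."""
    return 0 < len(pattern) < len(totals)


def present(pattern):
    """Knowledge presentation: report the result in readable form."""
    return "Above-average spenders: " + ", ".join(pattern)


totals = transform(preprocess(select(RAW, {"tea", "milk"})))
pattern = mine(totals)
report = present(pattern) if evaluate(pattern, totals) else "No useful pattern"
print(report)
```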

Page 5: Understanding Data Analytics and Data Mining Introduction

Defining Data

Data are a set of facts.

Facts are true or proven.

Data can come in a variety of types:
– Relational data
– Operational data
– Transactional data

Page 6: Understanding Data Analytics and Data Mining Introduction

Define Data Entry

A data entry is a single instance or record in a database. Data entries are also called data objects.

A data entry establishes a relationship between data elements, for example:

– person and address
– customers and purchases
– events and outcomes
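As a small illustration of a data entry relating data elements, the Python sketch below models the "customers and purchases" case. The Purchase record type, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    """One data entry (data object): it ties a customer to a product and a quantity."""
    customer: str
    product: str
    quantity: int


# Hypothetical entries, as they might appear as rows in a database table.
entries = [
    Purchase("ann", "tea", 2),
    Purchase("ann", "milk", 1),
    Purchase("bob", "tea", 1),
]


def products_bought_by(customer, records):
    """Follow the customer-to-product relationship recorded in the entries."""
    return sorted({r.product for r in records if r.customer == customer})


print(products_bought_by("ann", entries))
```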

Page 7: Understanding Data Analytics and Data Mining Introduction

Define Dimensions

A dimension is a collection of facts about a measurable situation.

Dimensions define the who, what, where, when, and how of a particular focus on the data.

Dimensions are used to construct how data patterns are identified and analyzed.

Page 8: Understanding Data Analytics and Data Mining Introduction

Dimensions – Cube Schema

The cube rendering is a product of online analytical processing (OLAP) and is used to show how the different dimensions of data can be viewed.

Retail Example:
– 4 retail locations
– 10 products
– 12 months
– 2 age groups

[Figure: cube with axes labeled Product, Time, and Location]
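To make the retail example concrete, the following Python sketch builds a cube with the four dimensions listed above and shows two common ways of viewing it. The member names (store1, item1, under_40, and so on) and the placeholder value of one unit per cell are assumptions for illustration; roll_up and slice_cube are hypothetical helper names, not the API of any OLAP product.

```python
from itertools import product
from math import prod

# Dimension members for the retail example; the names are placeholders.
DIMENSIONS = {
    "location": [f"store{i}" for i in range(1, 5)],   # 4 retail locations
    "product": [f"item{i}" for i in range(1, 11)],    # 10 products
    "month": list(range(1, 13)),                      # 12 months
    "age_group": ["under_40", "40_plus"],             # 2 age groups
}

# Every combination of members is one cell: 4 * 10 * 12 * 2 = 960 cells.
cell_count = prod(len(members) for members in DIMENSIONS.values())

# Placeholder measure: one unit sold in every cell.
cube = {cell: 1 for cell in product(*DIMENSIONS.values())}


def roll_up(cells, axis):
    """Sum out the dimension at position `axis`, leaving a coarser view."""
    rolled = {}
    for cell, value in cells.items():
        key = cell[:axis] + cell[axis + 1:]
        rolled[key] = rolled.get(key, 0) + value
    return rolled


def slice_cube(cells, axis, member):
    """Keep only the cells where the dimension at `axis` equals `member`."""
    return {cell: v for cell, v in cells.items() if cell[axis] == member}


without_age = roll_up(cube, 3)     # view by location, product, and month
january = slice_cube(cube, 2, 1)   # view of month 1 only
print(cell_count, len(without_age), len(january))
```

Rolling up the age-group dimension halves the number of cells, and slicing on one month keeps one twelfth of them.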

Page 9: Understanding Data Analytics and Data Mining Introduction

Dimensions – Star Schema

Star schemas are used to design how data is organized in data warehouses.

[Figure: star schema diagram with the labels Product, Time, Customer, Location, and Orders]
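A star schema can be sketched with Python's built-in sqlite3 module. This example assumes that Orders is the central fact table and that the other labels are dimension tables; the original diagram is not reproduced here, so that layout is an assumption. Only two of the four dimensions are included for brevity, and all table columns and sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes.
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE location (location_id INTEGER PRIMARY KEY, city TEXT);

-- Fact table: one row per order, pointing at each dimension.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES product(product_id),
    location_id INTEGER REFERENCES location(location_id),
    quantity    INTEGER
);

INSERT INTO product  VALUES (1, 'tea'), (2, 'milk');
INSERT INTO location VALUES (1, 'Leeds'), (2, 'York');
INSERT INTO orders   VALUES (1, 1, 1, 3), (2, 1, 2, 2), (3, 2, 1, 5);
""")

# Analyze the facts along one dimension by joining outward from the fact table.
quantity_by_product = conn.execute("""
    SELECT p.name, SUM(o.quantity)
    FROM orders AS o
    JOIN product AS p ON p.product_id = o.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()

print(quantity_by_product)
```

Swapping the joined table from product to location gives the same measure along a different dimension, without changing the fact table.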

Page 10: Understanding Data Analytics and Data Mining Introduction

Online Analytical Processing

Online Analytical Processing is an approach for analyzing multidimensional data from multiple perspectives interactively.

The acronym for online analytical processing is OLAP.

Page 11: Understanding Data Analytics and Data Mining Introduction

Defining Patterns

A pattern is an expression of data which can be modeled.

Data analysis and data mining focus on identifying, understanding, and drawing conclusions about interesting patterns.

An interesting pattern has the following characteristics:
– It can be understood easily by humans
– It can be recreated, meaning there is some level of certainty as to its validity
– It can potentially be used by the organization
– It is novel, innovative, and requires investigation
– For data analysis, it validates and confirms the hypothesis

Page 12: Understanding Data Analytics and Data Mining Introduction

Queries

Queries are a mechanism for retrieving information from a database: each query is a question posed to the data.

Standard queries are predefined questions to ask a database.
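A standard query can be pictured as query text that is written once and reused with different values. The sketch below uses Python's built-in sqlite3 module; the purchases table, its columns, the sample rows, and the helper name run_standard_query are all hypothetical.

```python
import sqlite3

# A predefined question: "What has this customer bought?"
# The ? placeholder is filled in each time the question is asked.
PURCHASES_BY_CUSTOMER = (
    "SELECT product, quantity FROM purchases "
    "WHERE customer = ? ORDER BY product"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("ann", "tea", 2), ("ann", "milk", 1), ("bob", "tea", 1)],
)


def run_standard_query(connection, customer):
    """Ask the predefined question for one customer."""
    return connection.execute(PURCHASES_BY_CUSTOMER, (customer,)).fetchall()


ann_rows = run_standard_query(conn, "ann")
print(ann_rows)
```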

Page 13: Understanding Data Analytics and Data Mining Introduction

Data Mining Techniques

There are several techniques of note in data mining:

– Characterization and Discrimination
– Associations and Correlations
– Classification and Regression
– Cluster Analysis
– Outlier Analysis

Page 14: Understanding Data Analytics and Data Mining Introduction

Characterization and Discrimination

Characterization describes the data in summary or general terms.

Discrimination describes the data by comparison, usually by contrasting one group of data with another.
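A minimal Python sketch of the two ideas follows. The customer records, the "segment" label, and the function names are hypothetical; averages are used as the summary only because they are simple, and other summaries would serve equally well.

```python
from statistics import mean

# Hypothetical customer records.
CUSTOMERS = [
    {"segment": "frequent", "age": 34, "spend": 120.0},
    {"segment": "frequent", "age": 40, "spend": 100.0},
    {"segment": "occasional", "age": 25, "spend": 30.0},
    {"segment": "occasional", "age": 31, "spend": 50.0},
]


def characterize(records, fields):
    """Characterization: summarize the data in general terms (here, averages)."""
    return {f: mean(r[f] for r in records) for f in fields}


def discriminate(records, label, target, contrast, fields):
    """Discrimination: compare a target group against a contrasting group."""
    target_summary = characterize([r for r in records if r[label] == target], fields)
    contrast_summary = characterize([r for r in records if r[label] == contrast], fields)
    return {f: target_summary[f] - contrast_summary[f] for f in fields}


summary = characterize(CUSTOMERS, ["age", "spend"])
difference = discriminate(CUSTOMERS, "segment", "frequent", "occasional", ["age", "spend"])
print(summary, difference)
```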

Page 15: Understanding Data Analytics and Data Mining Introduction

Association and Correlation

Associations and correlations are pattern relationships made against data objects.

They are often used in frequent pattern mining.
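Two measures commonly used to describe an association are support (how often the items occur together) and confidence (how often the second item appears when the first does). These terms do not appear in the slides and are introduced here only for illustration. The Python sketch below computes both for a set of hypothetical shopping baskets.

```python
# Hypothetical shopping baskets; each basket is one transaction.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


bread_and_milk = support({"bread", "milk"}, BASKETS)
bread_implies_milk = confidence({"bread"}, {"milk"}, BASKETS)
print(bread_and_milk, round(bread_implies_milk, 3))
```

Here bread and milk appear together in two of four baskets, and two of the three baskets containing bread also contain milk.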

Page 16: Understanding Data Analytics and Data Mining Introduction

Classification and Regression

Classification attempts to find a model that describes and distinguishes predefined classes in the data set, so that the class of a new data object can be predicted.

Regression attempts to find a model that predicts missing or unavailable numerical values.

These are predictive approaches and utilize methods such as decision trees and neural networks.
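Both ideas can be sketched in a few lines of Python. The regression part fits a straight line by least squares and uses it to estimate a missing number. The classification part fits a one-level decision tree (a single threshold on one feature), which is a deliberately reduced stand-in for the decision trees mentioned above. The data and the function names are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def fit_stump(values, labels):
    """One-level decision tree: choose the threshold with the fewest training errors.

    The rule learned is "predict True when value > threshold".
    """
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]

    def errors(threshold):
        return sum((v > threshold) != label for v, label in zip(values, labels))

    return min(candidates, key=errors)


# Regression: estimate the unavailable value at x = 5 from known points.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
estimate_at_5 = slope * 5 + intercept

# Classification: learn a rule separating two predefined classes.
incomes = [20, 25, 40, 45]
is_premium = [False, False, True, True]
threshold = fit_stump(incomes, is_premium)
predicted_for_30 = 30 > threshold

print(estimate_at_5, threshold, predicted_for_30)
```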

Page 17: Understanding Data Analytics and Data Mining Introduction

Cluster Analysis

Data objects are analyzed without consulting class labels; clustering can instead be used to generate class labels.

Image from visibleearth.nasa.gov
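As an illustration of grouping data without class labels, the Python sketch below runs a simple k-means procedure on one-dimensional values. k-means is one clustering method among many and is used here as an assumption; the slides do not name a specific algorithm. The data, starting centers, and function name are hypothetical.

```python
from statistics import mean


def k_means_1d(values, centers, rounds=10):
    """Repeatedly assign each value to its nearest center, then move each center
    to the average of the values assigned to it."""
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [mean(c) if c else old for c, old in zip(clusters, centers)]
    return centers, clusters


values = [1, 2, 3, 10, 11, 12]
centers, clusters = k_means_1d(values, centers=[0, 5])

# The cluster index can serve as a generated class label for each value.
labels = {v: i for i, cluster in enumerate(clusters) for v in cluster}
print(centers, clusters)
```

No labels are supplied as input; the grouping emerges from the values alone, and the resulting cluster numbers can then be used as labels.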

Page 18: Understanding Data Analytics and Data Mining Introduction

Outlier Analysis

Outlier analysis looks at abnormalities in data: data that does not behave as expected.
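One simple way to flag data that does not behave as expected is to measure how far each value lies from the average, in units of standard deviation. The Python sketch below does this; the threshold of 2.0, the sample values, and the function name are assumptions for illustration, and other outlier tests exist.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical daily order counts with one unusual day.
daily_orders = [10, 11, 9, 10, 12, 10, 50]
outliers = find_outliers(daily_orders)
print(outliers)
```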

Page 19: Understanding Data Analytics and Data Mining Introduction

Standards

Cross Industry Standard Process for Data Mining (CRISP-DM) was developed under the European Strategic Program on Research in Information Technology (ESPRIT).

Sample, Explore, Modify, Model, and Assess (SEMMA) was developed by SAS Institute Inc.

Page 20: Understanding Data Analytics and Data Mining Introduction

The Toolkit

The Toolkit is designed to enable an organization to improve its capabilities in data warehousing and data analysis, while maintaining a level of neutrality between specific technical solutions. The Toolkit comprises two parts: an introduction to the concepts and terms used in these areas, and usable templates to pursue and implement specific technical solutions.

The goal of the Data Warehouse and Data Analysis Toolkit is to define the contributing factors, major components, and their relationships, while providing the basic tools to take action based on the organization’s needs.

Page 21: Understanding Data Analytics and Data Mining Introduction

Moving Forward

The presentations found within the Toolkit provide education about the different facets of Data Warehousing and Data Analysis: they can be used for self-edification or as the foundation for presenting a case to different levels of the organization.

The process document, Developing Data Analysis Capabilities, is intended to be a step-by-step guide to creating a Data Analysis foundation in your organization. Multiple templates have been created to support the process and aid organizations in their efforts to improve their Data Analysis capabilities.