cs437 lecture 1-6


Page 1: Cs437 lecture 1-6

Lecture 1

Dr. Fawad Hussain

GIK Institute

Fall 2015

Data Warehousing and Mining (CS437)

Some lectures in this course have been partially adapted from lecture series by Stephen A. Brobst, Chief Technology Officer at Teradata and professor at MIT.

Page 2: Cs437 lecture 1-6

General Course Description

- Data Warehousing
  - What is the motivation behind Data Warehousing and Mining?
  - Advanced Indexing, Query Processing and Optimization.
  - Building Data Warehouses.
  - Data Cubes, OLAP, De-Normalization, etc.

- Data Mining Techniques
  - Regression
  - Clustering
  - Decision Trees

- Other Information
  - Office Hours (posted on the office door)
  - Office: G03 (FCSE)
  - Course TA: Mr. Bilal

Page 3: Cs437 lecture 1-6

Text Books (Optional)

- Introduction to Data Mining by Tan, Steinbach & Kumar.

- Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2nd Edition, March 2006, ISBN 1-55860-901-6.

- Building a Data Warehouse for Decision Support by Vidette Poe.

- Fundamentals of Database Systems by Elmasri and Navathe, Addison-Wesley, 5th Edition, 2007.

Page 4: Cs437 lecture 1-6

Grading Plan

Component     | % (Tentative) | Number(s)
Midterm Exam  | 25            | 01
Quizzes       | 10            | 06
Project       | 20            | 02
Final Exam    | 45            | 01

Page 5: Cs437 lecture 1-6

Tentative Schedule


Page 8: Cs437 lecture 1-6

Lecture 1

Introduction and Overview

Page 9: Cs437 lecture 1-6

Why this Course?

- The world is changing (actually, has changed): either change or be left behind.

- Missing opportunities, or going in the wrong direction, has kept us from growing.

- What is the right direction?

- Harnessing data, in a knowledge-driven economy.

Page 10: Cs437 lecture 1-6

The Need

Knowledge is power; Intelligence is absolute power!

"Drowning in data and starving for information"

Page 11: Cs437 lecture 1-6

Data Processing Steps

DATA → INFORMATION → INTELLIGENCE → POWER ($)

End goal?

Page 12: Cs437 lecture 1-6

Historical Overview

1960: Master Files & Reports

1965: Lots of Master files!

1970: Direct Access Memory & DBMS

1975: Online high-performance transaction processing

1980: PCs and 4GL Technology (MIS/DSS)

Post 1990: Data Warehousing and Data Mining

Page 13: Cs437 lecture 1-6

Crisis of Credibility

What is the financial health of our company?

-10%? +10%? (Two different answers from the same organization's data!)

Page 14: Cs437 lecture 1-6

Why a Data Warehouse?

- Data recording and storage is growing.

- History is an excellent predictor of the future.

- Gives a total view of the organization.

- Intelligent decision-support is required for decision-making.

Page 15: Cs437 lecture 1-6

Why a Data Warehouse?

- Sizes of data sets are going up ↑.

- Cost of data storage is coming down ↓.

- The amount of data an average business collects and stores is doubling every year.

- Total hardware and software cost to store and manage 1 MByte of data:
  - 1990: ~ $15
  - 2002: ~ 15¢ (down 100 times)
  - By 2007: < 1¢ (down 150 times)

Page 16: Cs437 lecture 1-6

Why a Data Warehouse? A Few Examples

- WalMart: 24 TB

- France Telecom: ~ 100 TB

- CERN: up to 20 PB by 2006

- Stanford Linear Accelerator Center (SLAC): 500 TB

- Businesses demand Intelligence (BI).

- Complex questions from integrated data.

- The "Intelligent Enterprise"

Page 17: Cs437 lecture 1-6

DBMS Approach

- List of all items that were sold last month?

- List of all items purchased by X?

- The total sales of the last month, grouped by branch?

- How many sales transactions occurred during the month of January?

Page 18: Cs437 lecture 1-6

Intelligent Enterprise

- Which items sell together? Which items to stock?

- Where and how to place the items? What discounts to offer?

- How best to target customers to increase sales at a branch?

- Which customers are most likely to respond to my next promotional campaign, and why?

Page 19: Cs437 lecture 1-6

What is a Data Warehouse?

A complete repository of historical corporate data extracted from transaction systems that is available for ad-hoc access by knowledge workers.

Page 20: Cs437 lecture 1-6

What is Data Mining?

“There are things that we know that we know…

there are things that we know that we don’t know…

there are things that we don’t know we don’t know.”

Donald Rumsfeld
Former US Secretary of Defense

Page 21: Cs437 lecture 1-6

What is Data Mining?

Tell me something that I should know.

When you don’t know what you should be knowing, how do you write SQL?

You can’t!

Page 22: Cs437 lecture 1-6

What is Data Mining?

- Knowledge Discovery in Databases (KDD).

- Data mining digs out valuable, non-trivial information from large, multidimensional, apparently unrelated databases (data sets).

- It is the integration of business knowledge, people, information, algorithms, statistics and computing technology.

- Discovering useful hidden patterns and relationships in data.

Page 23: Cs437 lecture 1-6

HUGE VOLUME: THERE IS WAY TOO MUCH DATA, AND IT'S GROWING!

- Data is collected much faster than it can be processed or managed. NASA's Earth Observation System (EOS) alone will collect 15 petabytes by 2007 (15,000,000,000,000,000 bytes).

- Much of it won't be used - ever!

- Much of it won't be seen - ever!

- Why not? There's so much volume that the usefulness of some of it will never be discovered.

- SOLUTION: Reduce the volume and/or raise the information content by structuring, querying, filtering, summarizing, aggregating, mining...

Page 24: Cs437 lecture 1-6

Requires solution of fundamentally new problems:

1. developing algorithms and systems to mine large, massive and high-dimensional data sets;

2. developing algorithms and systems to mine new types of data (images, music, videos);

3. developing algorithms, protocols, and other infrastructure to mine distributed data;

4. improving the ease of use of data mining systems; and

5. developing appropriate privacy and security techniques for data mining.

Page 25: Cs437 lecture 1-6

Future of Data Mining

- One of the "10 Hottest Jobs" of year 2025 - TIME Magazine, 22 May 2000

- One of the "10 emerging areas of technology" - MIT's Magazine of Technology Review, Jan/Feb 2001

Page 26: Cs437 lecture 1-6

Data Mining

Data mining sits at the intersection of many fields: Machine Learning, Database Technology, Statistics, Visualization, Information Science, and other disciplines.

Page 27: Cs437 lecture 1-6

Logical and Physical Database Design

Page 28: Cs437 lecture 1-6

Data Mining is one step of Knowledge Discovery in Databases (KDD)

Raw Data
  ↓ Preprocessing (Extraction, Transformation, Cleansing, Validation)
Clean Data
  ↓ Data Mining (Identify Patterns, Create Models)
  ↓ Interpretation/Evaluation (Visualization, Feature Extraction, Analysis)
Knowledge ($ $ $)

Page 29: Cs437 lecture 1-6
Page 30: Cs437 lecture 1-6

Information Evolution in a Data Warehouse Environment

Primarily Batch → Increase in Ad Hoc Queries → Analytical Modeling Grows → Continuous Update & Time-Sensitive Queries Become Important → Event-Based Triggering Takes Hold

(Batch → Ad Hoc → Analytics → Continuous Update/Short Queries → Event-Based Triggering)

STAGE 1: REPORT - WHAT happened?

STAGE 2: ANALYZE - WHY did it happen?

STAGE 3: PREDICT - What WILL happen?

STAGE 4: OPERATIONALIZE - What IS happening?

STAGE 5: ACTIVATE - What do you WANT to happen?

Page 31: Cs437 lecture 1-6

Normalization and Denormalization

- Normalization
  - A relational database relates subsets of a dataset to each other.
  - A dataset is a set of tables (or a schema, in Oracle).
  - A table defines the structure and contains the row and column data for each subset.
  - Tables are related to each other by linking them based on common items and values between two tables.
  - Normalization is the optimization of record keeping for insertion, deletion and updating (in addition to selection, of course).

- De-normalization
  - Why denormalize?
  - When to denormalize?
  - How to denormalize?
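The normalized layout described above can be sketched as a pair of tables linked by a common key. The table and column names below are hypothetical, made up for illustration; the point is only that selecting across the two subsets requires a join on the shared key.

```python
import sqlite3

# A minimal sketch of a normalized structure: two tables
# (hypothetical customer/sale) linked by a common key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    amount      REAL
);
INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO sale VALUES (10, 1, 99.0), (11, 1, 25.0), (12, 2, 40.0);
""")

# Selection across the two subsets requires a join on the shared key.
rows = conn.execute("""
    SELECT c.name, SUM(s.amount)
    FROM customer c
    JOIN sale s ON c.customer_id = s.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Alice', 124.0), ('Bob', 40.0)]
```

Each customer's name is stored exactly once; the price of that non-redundancy is the join, which is what de-normalization (below) trades away.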

Page 32: Cs437 lecture 1-6
Page 33: Cs437 lecture 1-6
Page 34: Cs437 lecture 1-6

Why De-normalization?

- Do you have performance problems?
  - If not, then you shouldn't be studying this course!

- The root cause of 99% of database performance problems is poorly written SQL code.
  - Usually as a result of a poorly optimized underlying structure.

- Do you have disk storage problems?
  - Consider separating large, less-used datasets from frequently used datasets.
Page 35: Cs437 lecture 1-6

When to Denormalize?

- Denormalization sometimes implies the undoing of some of the steps of Normalization.
  - Denormalization is not necessarily the reverse of the steps of Normalization.
  - Denormalization does not imply complete removal of specific Normal Form levels.

- Denormalization results in duplication.
  - It is quite possible that the table structure is much too granular, or possibly even incompatible with the structure imposed by applications.

- Denormalization usually involves merging multiple transactional tables, or multiple static tables, into a single table.
Page 36: Cs437 lecture 1-6

When to Denormalize?

- Look for one-to-one relationships.
  - These may be unnecessary if the required removal of null values causes costly joins. Disk space is cheap; complex SQL join statements can destroy performance.

- Do you have many-to-many join resolution entities? Are they all necessary? Are they all used by applications?

- When constructing SQL statement joins, are you finding many tables in joins where those tables are scattered throughout the entity relationship diagram?

- When searching for static data items, such as customer details, are you querying a single table or multiple tables?
  - A single table is much more efficient than multiple tables.

Page 37: Cs437 lecture 1-6

How to Denormalize?

Common Forms of Denormalization:

- Pre-join denormalization.

- Column replication or movement.

- Pre-aggregation.

Page 38: Cs437 lecture 1-6

Considerations in Assessing De-normalization

- Performance implications

- Storage implications

- Ease-of-use implications

- Maintenance implications
  - Most commonly missed/disregarded.

Page 39: Cs437 lecture 1-6

Pre-join Denormalization

- Take tables which are frequently joined and "glue" them together into a single table.

- Avoids the performance impact of the frequent joins.

- Typically increases storage requirements.

Page 40: Cs437 lecture 1-6

Pre-join Denormalization

A simplified retail example. Before denormalization:

sales:        sale_id | store_id | sale_dt | ...                    (1)
sales_detail: tx_id | sale_id | item_id | ... | item_qty | sale$    (m)

(one sales header row relates to many sales_detail rows)

Page 41: Cs437 lecture 1-6

Pre-join Denormalization

A simplified retail example. After denormalization:

d_sales_detail: tx_id | sale_id | store_id | sale_dt | item_id | ... | item_qty | $

Points to Ponder:

- Which Normal Form is being violated?

- Will there be maintenance issues?

Page 42: Cs437 lecture 1-6

Pre-join Denormalization

Storage implications:

- Assume a 1:3 record count ratio between sales header and detail.

- Assume 1 billion sales (3 billion sales detail records).

- Assume an 8-byte sales_id.

- Assume 30-byte header and 40-byte detail records.

Which businesses will be most hurt, in terms of storage capacity, by this form of denormalization?

Page 43: Cs437 lecture 1-6

Pre-join Denormalization

Storage implications:

Before denormalization: 150 GB raw data.
After denormalization: 186 GB raw data.

Net result is a 24% increase in raw data size for the database.

Pre-join may actually result in a space saving, if many concurrent queries are demanding frequent joins on the joined tables! HOW?
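The storage arithmetic above can be checked directly. One assumption is made explicit here: the merged record is taken to be header bytes plus detail bytes minus the shared 8-byte sales_id, which is what reproduces the slide's 186 GB figure.

```python
# Check of the storage arithmetic: 1 billion 30-byte headers plus
# 3 billion 40-byte details before the pre-join; afterwards every
# detail row also carries the header columns (minus the shared key).
header_rows, detail_rows = 1_000_000_000, 3_000_000_000
header_bytes, detail_bytes, key_bytes = 30, 40, 8

before = header_rows * header_bytes + detail_rows * detail_bytes
after = detail_rows * (header_bytes + detail_bytes - key_bytes)

print(before // 10**9, "GB ->", after // 10**9, "GB")   # 150 GB -> 186 GB
print(round(100 * (after - before) / before), "% increase")  # 24 % increase
```

Businesses with many detail rows per header (a high 1:m ratio) replicate the header columns more often, so they pay the largest storage penalty.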

Page 44: Cs437 lecture 1-6

Pre-join Denormalization

Sample Query: What was my total $ volume between Thanksgiving and Christmas in 1999?

Page 45: Cs437 lecture 1-6

Pre-join Denormalization

Before de-normalization:

select sum(sales_detail.sale_amt)
from sales, sales_detail
where sales.sales_id = sales_detail.sales_id
  and sales.sales_dt between '1999-11-26' and '1999-12-25';

Page 46: Cs437 lecture 1-6

Pre-join Denormalization

After de-normalization:

select sum(d_sales_detail.sale_amt)
from d_sales_detail
where d_sales_detail.sales_dt between '1999-11-26' and '1999-12-25';

No join operation performed.

How to compare performance?
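On a toy dataset the two queries can be shown to return the same total. This is a sqlite sketch with a handful of made-up rows (not the slide's billion-row scenario); d_sales_detail is populated by running the pre-join once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sales_id INTEGER PRIMARY KEY, sales_dt TEXT);
CREATE TABLE sales_detail (sales_id INTEGER, sale_amt REAL);
CREATE TABLE d_sales_detail (sales_id INTEGER, sales_dt TEXT, sale_amt REAL);
INSERT INTO sales VALUES (1,'1999-11-30'), (2,'1999-12-20'), (3,'2000-01-05');
INSERT INTO sales_detail VALUES (1,10.0), (1,5.0), (2,20.0), (3,99.0);
-- build the denormalized table by performing the join once
INSERT INTO d_sales_detail
  SELECT d.sales_id, s.sales_dt, d.sale_amt
  FROM sales s JOIN sales_detail d ON s.sales_id = d.sales_id;
""")

# Before: join needed on every query.
joined = conn.execute("""
  SELECT SUM(sale_amt) FROM sales JOIN sales_detail USING (sales_id)
  WHERE sales_dt BETWEEN '1999-11-26' AND '1999-12-25'""").fetchone()[0]

# After: single-table scan, no join.
flat = conn.execute("""
  SELECT SUM(sale_amt) FROM d_sales_detail
  WHERE sales_dt BETWEEN '1999-11-26' AND '1999-12-25'""").fetchone()[0]

print(joined, flat)  # 35.0 35.0
```

The answers are identical; the difference at scale is that the second form reads one table sequentially instead of joining a billion-row header to a three-billion-row detail.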

Page 47: Cs437 lecture 1-6

Pre-join Denormalization

But consider the question: How many sales (transactions) did I make between Thanksgiving and Christmas in 1999?

Page 48: Cs437 lecture 1-6

Pre-join Denormalization

Before denormalization:

select count(*)
from sales
where sales.sales_dt between '1999-11-26' and '1999-12-25';

After denormalization:

select count(distinct d_sales_detail.sales_id)
from d_sales_detail
where d_sales_detail.sales_dt between '1999-11-26' and '1999-12-25';

Which query will perform better?
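The two count queries above can also be checked on toy data. Because each sale's sales_id repeats once per line item in the denormalized table, a plain count(*) would over-count there; the distinct is what restores the right answer, at the cost of a sort at scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sales_id INTEGER PRIMARY KEY, sales_dt TEXT);
CREATE TABLE d_sales_detail (sales_id INTEGER, sales_dt TEXT, sale_amt REAL);
INSERT INTO sales VALUES (1,'1999-11-30'), (2,'1999-12-20'), (3,'2000-01-05');
-- sale 1 has two line items, so its sales_id repeats in the detail
INSERT INTO d_sales_detail VALUES
  (1,'1999-11-30',10.0), (1,'1999-11-30',5.0),
  (2,'1999-12-20',20.0), (3,'2000-01-05',99.0);
""")

# Before: one row per sale, a simple count suffices.
n_header = conn.execute("""SELECT COUNT(*) FROM sales
  WHERE sales_dt BETWEEN '1999-11-26' AND '1999-12-25'""").fetchone()[0]

# After: must de-duplicate the repeated sales_id values.
n_detail = conn.execute("""SELECT COUNT(DISTINCT sales_id) FROM d_sales_detail
  WHERE sales_dt BETWEEN '1999-11-26' AND '1999-12-25'""").fetchone()[0]

print(n_header, n_detail)  # 2 2
```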

Page 49: Cs437 lecture 1-6

Pre-join Denormalization

Performance implications:

- The performance penalty for count distinct (which forces a sort) can be quite large.

- It may be worth the 30 GB overhead to keep the sales header records if this is a common query structure, because both ease-of-use and performance will be enhanced (at some cost in storage).

Page 50: Cs437 lecture 1-6

Considerations in Assessing De-normalization

- Performance implications

- Storage implications

- Ease-of-use implications

- Maintenance implications
  - Most commonly missed/disregarded.

Page 51: Cs437 lecture 1-6

Column Replication or Movement

- Take columns that are frequently accessed via large-scale joins and replicate (or move) them into the detail table(s) to avoid the join operation.

- Avoids the performance impact of the frequent joins.

- Increases storage requirements for the database.

- It is possible to "move" a frequently accessed column to the detail table instead of replicating it.

Note: This technique is no different than a limited form of the pre-join denormalization described previously.

Page 52: Cs437 lecture 1-6

Before:
Table_1: ColA | ColB
Table_2: ColA | ColC | ColD | ... | ColZ

After:
Table_1': ColA | ColB | ColC   (ColC replicated from Table_2)
Table_2:  ColA | ColC | ColD | ... | ColZ

Page 53: Cs437 lecture 1-6

Column Replication or Movement

- Health Care DW Example: Take member_id from the claim header and move it to the claim detail.

- Result: An extra ten bytes per row on the claim line table allows avoiding the join to the claim header table on some (many?) queries.

Which normal form does this technique violate?
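The health-care example can be sketched as follows. The claim_header/claim_line schemas and row values are made up for illustration; the replication is done with an ALTER TABLE plus a correlated UPDATE, after which per-member queries need only the detail table.

```python
import sqlite3

# Hypothetical claim header / claim line tables; member_id starts
# out only in the header, as in the normalized design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claim_header (claim_id INTEGER PRIMARY KEY, member_id INTEGER);
CREATE TABLE claim_line (claim_id INTEGER, line_no INTEGER, amount REAL);
INSERT INTO claim_header VALUES (100, 7), (101, 8);
INSERT INTO claim_line VALUES (100,1,50.0), (100,2,30.0), (101,1,10.0);
-- replicate member_id into the detail table
ALTER TABLE claim_line ADD COLUMN member_id INTEGER;
UPDATE claim_line SET member_id =
  (SELECT member_id FROM claim_header h
   WHERE h.claim_id = claim_line.claim_id);
""")

# Per-member totals now come from claim_line alone - no join to the header.
rows = conn.execute("""SELECT member_id, SUM(amount) FROM claim_line
  GROUP BY member_id ORDER BY member_id""").fetchall()
print(rows)  # [(7, 80.0), (8, 10.0)]
```

The replicated column is the "extra ten bytes per row" the slide refers to: every scan of claim_line now carries it, whether or not the query uses it.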

Page 54: Cs437 lecture 1-6

Column Replication or Movement

Beware of the results of de-normalization:

- Assuming a 100-byte record before the denormalization, all scans through the claim line detail will now take 10% longer than previously.

- A significant percentage of queries must benefit from access to the denormalized column in order to justify movement into the claim line table.

- Need to quantify both the cost and the benefit of each denormalization decision.

Page 55: Cs437 lecture 1-6

Column Replication or Movement

- May want to replicate columns in order to facilitate co-location of commonly joined tables.

Before denormalization:

Cust Table: Customer_Id | Customer_Nm | Address | Ph | ...
   (1:m)
Acct Table: Account_Id | Customer_Id | Balance$ | Open_Dt | ...
   (1:m)
Trx Table:  Tx_Id | Account_Id | Tx$ | Tx_Dt | Location_Id | ...

A three-table join requires re-distribution of significant amounts of data to answer many important questions related to customer transaction behavior.