Starring Sakila A Data Warehousing and Business Intelligence Tutorial


Page 1: PLMCE2012 Starring Sakila

Starring Sakila

A Data Warehousing and Business Intelligence

Tutorial

Page 2: PLMCE2012 Starring Sakila

Starring Sakila

Welcome!

Matt Casters, Chief Data Integration, Pentaho
http://www.ibridge.be/ @mattcasters

Roland Bouman, Software Engineer, Pentaho
http://rpbouman.blogspot.com/ @rolandbouman

Page 3: PLMCE2012 Starring Sakila

Pentaho

● Commercial Open Source Business Intelligence
– Full BI suite since 2005
● Projects: Kettle (DI & ETL), JFree (Reporting), Mondrian (OLAP), Weka (Data Mining)
● Community: CDF (Dashboarding), Saiku (OLAP)
● Recent: Focus on “Big Data”, esp. Hadoop
● http://www.pentaho.com
● http://sourceforge.net/projects/pentaho/

Page 4: PLMCE2012 Starring Sakila

Agenda

● Business Intelligence
● Data Warehousing
● Anatomy of a Data Warehouse
● Physical Implementation
● Sakila – a Star is Born
● Filling the Data Warehouse
● Presenting the Data - BI Applications

Page 5: PLMCE2012 Starring Sakila

Starring Sakila

Part I:

Business Intelligence

Page 6: PLMCE2012 Starring Sakila

Business Intelligence

● Skills, technologies, applications and practices to acquire a better understanding of the commercial context of your business

● Turning data into information useful for business users

– Management Information
– Decision Support

Page 7: PLMCE2012 Starring Sakila

Business Intelligence Scope

● Operational (days, weeks)
– Users: Customers, Partners, Employees
– Example: “Who's available for tomorrow's shift?”
● Tactical (weeks, months)
– Users: Analysts, Managers
– Example: “In what region should we open a new store?”
● Strategic (months, years)
– Users: Executives
– Example: “Should we become an appliance vendor instead of delivering software solutions?”
● Applications: Reporting, OLAP/Analysis, Data mining

Page 8: PLMCE2012 Starring Sakila

Functional Parts of a Business Intelligence Solution

● Front end Applications:
– Reports
– Charts and Graphs
– OLAP Pivot tables
– Data Mining
– Dashboards
● Back end Infrastructure:
– Data Integration
– Data Warehouse
– Data Mart
– Metadata
– (ROLAP) Cube

Page 9: PLMCE2012 Starring Sakila

High Level BI Architecture

● Back-end (Extract, Transform, Load):
– Sources: ERP, CRM, External Data
– Staging Area
– Operational Datastore
– Enterprise Data Warehouse
– Datamarts
● Front-end (Present):
– OLAP/Analysis
– Reporting
– Charts / Graphs
– Dashboards
– Data Mining
● Meta Data

Page 10: PLMCE2012 Starring Sakila

Starring Sakila

Part II:

Data Warehousing

Page 11: PLMCE2012 Starring Sakila

What is a Data Warehouse?

● A database designed to support Business Intelligence Applications

● Different requirements as compared to Operational Applications

● Analytic Database Systems (ADBMS)
– MySQL: Infobright, InfiniDB
– LucidDB, MonetDB

Page 12: PLMCE2012 Starring Sakila

What is a Data Warehouse?

● Ultimately, it's just a Relational Database
– Tables, Columns, ...
● Designed for Business Intelligence applications
– Ease of use
– Performance
● Data from various source systems
– Integration, Standardization, Data cleaning
● Add and maintain history
– Corporate memory

Page 13: PLMCE2012 Starring Sakila

What is a Data Warehouse?

● A database designed to support BI applications
● BI applications (OLAP) differ from Operational applications (OLTP)
– OLTP: Online Transaction Processing
– OLAP: Online Analytical Processing
● Differences:
– Applications, Data Processing, Data Model

Page 14: PLMCE2012 Starring Sakila

OLTP vs OLAP: Application Characterization

● OLTP
– Operational
– 'Always' on
– All kinds of users
– Many users
– Directly supports business process
– Keep a Record of Current status
● OLAP
– Tactical, Strategic
– Periodically Available
– Managers, Directors
– Few(er) users
– Redesign Business Process
– Decision support, long-term planning
– Maintains a history

Page 15: PLMCE2012 Starring Sakila

OLTP vs OLAP: Data Processing

● OLTP
– Transactions
– Subject Oriented
– Add, Modify, Remove single rows
– Human data entry
– Queries for small sets of rows with all their details
– Standard queries
● OLAP
– Groups
– Aspect Oriented
– Bulk load, rarely modify, never remove
– Automated ETL jobs
– Scan large sets to return aggregates over arbitrary groups
– Ad-hoc queries
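The difference in query shape can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the `rental` table, its columns and its rows are made up for the example and are not the real Sakila schema.

```python
import sqlite3

# In-memory database with a tiny, hypothetical rental table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rental ("
    "rental_id INTEGER PRIMARY KEY, store TEXT, month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO rental VALUES (?, ?, ?, ?)",
    [
        (1, "North", "2008-10", 4.0),
        (2, "North", "2008-11", 6.0),
        (3, "South", "2008-10", 5.0),
        (4, "South", "2008-11", 3.0),
    ],
)

# OLTP-style: fetch one row, with all its details, by primary key.
oltp_row = conn.execute(
    "SELECT * FROM rental WHERE rental_id = ?", (3,)
).fetchone()

# OLAP-style: scan the whole table and return an aggregate per group.
olap_rows = conn.execute(
    "SELECT store, SUM(amount) FROM rental GROUP BY store ORDER BY store"
).fetchall()
```

The first query touches a single row through the key; the second has to read every row, which is why the two workloads favor different physical designs.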

Page 16: PLMCE2012 Starring Sakila

OLTP vs OLAP: Data Model

● OLTP
– Entity-Relationship model
– Entities, Attributes, Relationships
– Foreign key constraints
– Indexes to increase performance
– Normalized to 3NF or BCNF
● OLAP
– Dimensional model
– Facts, Dimensions, Hierarchies
– Referential integrity ensured in loading process
– Scans on the fact table make indexes largely ineffective
– Denormalized Dimensions (<= 1NF)

Page 17: PLMCE2012 Starring Sakila

High Level BI Architecture

● Back-end (Extract, Transform, Load):
– Sources: ERP, CRM, External Data
– Staging Area
– Operational Datastore
– Enterprise Data Warehouse
– Datamarts
● Front-end (Present):
– OLAP/Analysis
– Reporting
– Charts / Graphs
– Dashboards
– Data Mining
● Meta Data

Page 18: PLMCE2012 Starring Sakila

Starring Sakila

Part III:

Dimensional Model

Page 19: PLMCE2012 Starring Sakila

What is the Dimensional Model?

● An aspect-oriented logical data model optimized for querying and data presentation

● Divides data in two kinds:
– Facts
– Dimensions

Page 20: PLMCE2012 Starring Sakila

The Dimensional Model

● Facts
– Measures/Metrics of a Business Process
– Examples: Cost, Units Sold, Profit
● Dimensions
– Context of Business Process
– Who? What? Where? When? Why?
– Navigate Facts: Selection, Rollup, Drilldown
– Provide and maintain history

Page 21: PLMCE2012 Starring Sakila

Dimensional Data Presentation

Date Dimension: 2008 Q4

Location Dimension       All Months   October   November   December
All locations            $ 3850       $ 1000    $ 1350     $ 1500
America   All America    $ 2050       $ 500     $ 750      $ 800
          North          $ 1275       $ 300     $ 500      $ 475
          South          $ 775        $ 200     $ 250      $ 325
Europe    All Europe     $ 1800       $ 500     $ 600      $ 700
          East           $ 800        $ 250     $ 250      $ 300
          West           $ 1000       $ 250     $ 350      $ 400
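The subtotals in this table are roll-ups of the most detailed cells. A minimal sketch of that idea, using the leaf-level amounts from the table; the `rollup` helper and the tuple layout (continent, region, month) are assumptions made for this example.

```python
from collections import defaultdict

# Leaf-level cells from the table: (continent, region, month) -> amount.
cells = {
    ("America", "North", "October"): 300,
    ("America", "North", "November"): 500,
    ("America", "North", "December"): 475,
    ("America", "South", "October"): 200,
    ("America", "South", "November"): 250,
    ("America", "South", "December"): 325,
    ("Europe", "East", "October"): 250,
    ("Europe", "East", "November"): 250,
    ("Europe", "East", "December"): 300,
    ("Europe", "West", "October"): 250,
    ("Europe", "West", "November"): 350,
    ("Europe", "West", "December"): 400,
}


def rollup(cells, key):
    """Sum the amounts of all cells that share the same key(coordinates)."""
    totals = defaultdict(int)
    for coords, amount in cells.items():
        totals[key(coords)] += amount
    return dict(totals)


by_continent = rollup(cells, lambda c: c[0])   # roll up regions and months
by_month = rollup(cells, lambda c: c[2])       # roll up all locations
grand_total = sum(cells.values())              # 'All locations', 'All Months'
```

Drilling down is the reverse: grouping by a more detailed key, for example `lambda c: (c[0], c[1])`.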

Page 22: PLMCE2012 Starring Sakila

The Dimensional Model: Facts

● Fact table structure:
– Several measures
– Keys to dimension tables
● Measures:
– Usually numeric, Additive, Semi-additive
– Sometimes pre-calculated
● Rapidly growing!
– Millions, Billions of rows (Terabytes)

Page 23: PLMCE2012 Starring Sakila

The Dimensional Model: Dimensions

● Dimension table structure:
– Surrogate key and descriptive text attributes
● Relatively few rows
– Exception: Customer 'Monster' dimension
● Relatively static
– Exception: Slowly changing dimensions
● Used to navigate through fact data
– Hierarchies

Page 24: PLMCE2012 Starring Sakila

The Dimensional Model: Navigating data with Dimensions

● Selection (Filter)
● Navigation: Attributes organized in Hierarchies
– Date dimension examples:
  ● Year, Quarter, Month, Day
  ● Year, Week, Day
● Groupings for Aggregation
– 'Roll up', 'Drill Down'
– 'Slice and Dice'
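As an illustration of the two date hierarchies, the sketch below generates date dimension rows carrying both the Year, Quarter, Month, Day attributes and the Year, Week, Day attributes. The column names and the YYYYMMDD-style `date_key` are assumptions for this example, not taken from the slides.

```python
from datetime import date, timedelta


def date_dimension_rows(start, end):
    """Return one dimension row per calendar day from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        iso_year, iso_week, _ = day.isocalendar()
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20081001
            # Hierarchy 1: Year, Quarter, Month, Day
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day": day.day,
            # Hierarchy 2: Year, Week, Day (ISO week numbering)
            "iso_year": iso_year,
            "iso_week": iso_week,
        })
        day += timedelta(days=1)
    return rows


q4_2008 = date_dimension_rows(date(2008, 10, 1), date(2008, 12, 31))
```

Weeks do not nest inside months or quarters, and an ISO week can belong to a different year than the calendar date, which is one reason the week hierarchy is kept separate.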

Page 25: PLMCE2012 Starring Sakila

The Dimensional Model: Maintaining History

● Fact table usually links to a date dimension
● Dimensions maintain their own history
– Slowly changing dimensions
  ● Type I: Overwrite (no history)
  ● Type II: History kept in rows (versioning)
  ● Type III: History kept in columns
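A Type II change can be sketched as "close the current row, add a new version". The code below is a simplified illustration, not how any particular ETL tool implements it; the function name, the `valid_from` / `valid_to` / `version` columns and the sample customer are all hypothetical.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # marks the currently valid version


def apply_scd2(dimension, customer_id, attributes, change_date):
    """Apply a Type II change and return the surrogate key of the current row."""
    current = next(
        (row for row in dimension
         if row["customer_id"] == customer_id and row["valid_to"] == HIGH_DATE),
        None,
    )
    if current is not None:
        if all(current[name] == value for name, value in attributes.items()):
            return current["customer_key"]  # nothing changed, keep the version
        current["valid_to"] = change_date   # close the old version
    new_key = max((row["customer_key"] for row in dimension), default=0) + 1
    dimension.append({
        "customer_key": new_key,            # surrogate key, one per version
        "customer_id": customer_id,         # business key from the source
        "version": current["version"] + 1 if current else 1,
        "valid_from": change_date,
        "valid_to": HIGH_DATE,
        **attributes,
    })
    return new_key


dim_customer = []
key_v1 = apply_scd2(dim_customer, 42, {"city": "City X"}, date(2011, 1, 1))
key_v2 = apply_scd2(dim_customer, 42, {"city": "City Y"}, date(2012, 4, 10))
key_same = apply_scd2(dim_customer, 42, {"city": "City Y"}, date(2012, 5, 1))
```

Facts loaded before the change keep pointing at the old surrogate key, so history is preserved in the rows. A Type I change would instead overwrite `city` in place.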

Page 26: PLMCE2012 Starring Sakila

Starring Sakila

Part IV:

Physical Implementation

Page 27: PLMCE2012 Starring Sakila

Dimensional Model Implementation: Star Schema

● Related metrics stored in a Fact table
● Fact table references relevant dimensions
● Each Dimension stored in a Dimension Table
● Dimension tables shared by multiple fact tables

Page 28: PLMCE2012 Starring Sakila

Star Schema example: Sakila Rentals

[Diagram: the Rentals fact table in the center, surrounded by the dimensions Store, Date, Time, Film, Customer and Staff]

Page 29: PLMCE2012 Starring Sakila

Star Schema example: Sakila Rentals

● Fact tables: fact_rental, fact_inventory, fact_payment
● Shared dimension tables: dim_date, dim_customer, dim_store, dim_staff, dim_film
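A typical query against such a star joins the fact table to the dimensions it needs and groups by dimension attributes. The sketch below uses Python's sqlite3 module; the table names follow the slide, but every column name and every row is a made-up assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, heavily simplified columns.
CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_film  (film_key INTEGER PRIMARY KEY, title TEXT, category TEXT);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_rental (
    date_key      INTEGER REFERENCES dim_date(date_key),
    film_key      INTEGER REFERENCES dim_film(film_key),
    store_key     INTEGER REFERENCES dim_store(store_key),
    count_rentals INTEGER
);
INSERT INTO dim_date  VALUES (20081001, 2008, 10), (20081101, 2008, 11);
INSERT INTO dim_film  VALUES (1, 'Film A', 'Comedy'), (2, 'Film B', 'Drama');
INSERT INTO dim_store VALUES (1, 'City X'), (2, 'City Y');
INSERT INTO fact_rental VALUES
    (20081001, 1, 1, 1), (20081001, 2, 1, 1),
    (20081101, 1, 2, 1), (20081101, 1, 1, 1);
""")

# Star join: one join per dimension used, then aggregate the measure.
rentals_per_month_and_category = conn.execute("""
    SELECT d.month, f.category, SUM(r.count_rentals)
    FROM fact_rental AS r
    JOIN dim_date AS d ON d.date_key = r.date_key
    JOIN dim_film AS f ON f.film_key = r.film_key
    GROUP BY d.month, f.category
    ORDER BY d.month, f.category
""").fetchall()
```

Because the dimensions are denormalized, each one is a single join away from the fact table, which keeps queries like this simple.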

Page 30: PLMCE2012 Starring Sakila

Stars versus Snowflakes

● Star schema is 'just' an implementation
– Optimized for simplicity
– Optimized for performance (?)
– Heavily denormalized dimensions
● There is an alternative: Snowflake
– Normalized dimensions

Page 31: PLMCE2012 Starring Sakila

Snowflake example: Sakila Rentals

[Diagram: the Rentals fact table surrounded by normalized dimensions. Tables shown: Store, Date, Minute, Film, Customer, Staff, Month, Hour, Quarter, Year, Week, City (twice), Country (twice), Language, Rating]

Page 32: PLMCE2012 Starring Sakila

Starring Sakila

Part V:

A Star is Born

Page 33: PLMCE2012 Starring Sakila

Dimensional Model example

● MySQL Sample Database
– http://dev.mysql.com/doc/sakila/en/sakila.html
● DVD rental business
– Overly simplified database schema
● Typical OLTP database

Page 34: PLMCE2012 Starring Sakila

3NF Source schema: Sakila Rentals

[Diagram: source tables Rental, Customer, Film, Store, Address, Category, Actor, Staff, Inventory, City, Country, Language]

Page 35: PLMCE2012 Starring Sakila

Target Star Schema

● Fact: Rentals
● Dimensions:
– When? Date, Time
– Where? Store
– What? Film
– Who? Customer, Staff

Page 36: PLMCE2012 Starring Sakila

Dimensional Design

● Select Business Process
– Sales, Purchase, Storage, ...
● Define Facts and Key Metrics
– Facts: Key Event in Business Process
– Metrics (Fact Attributes): Count or Amount
● Choose Dimensions and Hierarchies
– What? When? Where?
– Who? Why?

Page 37: PLMCE2012 Starring Sakila

Example Business Process: Rentals

● Select Business Process
– Rentals
● Identify Facts
– Count (number of rentals)
– Rental Duration
● Choose Dimensions
– What: Films
– When: Rental, Return
– Who: Customer, Staff
– Where: Store
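The two measures could be derived from each source rental row roughly as follows. This is a sketch under assumptions: the function name and the choice of hours as the unit are illustrative, and a rental that has not been returned yet is given no duration.

```python
from datetime import datetime


def rental_measures(rental_date, return_date):
    """Derive the fact measures for a single rental event."""
    measures = {
        "count_rentals": 1,             # additive: every fact row counts once
        "rental_duration_hours": None,  # unknown until the film is returned
    }
    if return_date is not None:
        elapsed = return_date - rental_date
        measures["rental_duration_hours"] = elapsed.total_seconds() / 3600
    return measures


returned = rental_measures(
    datetime(2005, 5, 24, 22, 53), datetime(2005, 5, 26, 22, 53)
)
still_out = rental_measures(datetime(2005, 5, 24, 22, 53), None)
```

Storing a constant count of 1 looks redundant, but it lets a plain SUM over any grouping return the number of rentals.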

Page 38: PLMCE2012 Starring Sakila

A star is born: Rentals 3NF

[Diagram: tables shown: Rental, Customer, Staff, Inventory]

Page 39: PLMCE2012 Starring Sakila

A star is born: Rentals 3NF

[Diagram: tables shown: Rental, Customer, Staff, Inventory, Store, Film, Category, Film Category]

Page 40: PLMCE2012 Starring Sakila

A star is born: Denormalize

[Diagram: tables shown: Rental, Customer, Staff, Inventory, Store, Film, Category, Film Category]

Page 41: PLMCE2012 Starring Sakila

A star is born: Denormalize

[Diagram: tables shown: Rental, Customer, Staff, Store (twice), Film, Category]

Page 42: PLMCE2012 Starring Sakila

A star is born

[Diagram: tables shown: Rental, Customer, Staff, Store (twice), Film, Address, Category]

Page 43: PLMCE2012 Starring Sakila

A star is born: Denormalize

[Diagram: tables shown: Rental, Customer, Staff, Store (twice), Film, Address, Category]

Page 44: PLMCE2012 Starring Sakila

A star is born: Denormalize

[Diagram: tables shown: Rental, Customer, Staff, Store (twice), Film, Address (twice), Language, Category]

Page 45: PLMCE2012 Starring Sakila

A star is born: Denormalize

[Diagram: tables shown: Rental, Customer, Staff, Store (twice), Film, Address (twice), Language, City (twice), Category]

Page 46: PLMCE2012 Starring Sakila

A star is born: Rental Snowflake

[Diagram: tables shown: Rental, Customer, Staff, Store (twice), Film, Address (twice), Language, City (twice), Country (twice), Category]

Page 47: PLMCE2012 Starring Sakila

A star is born: Rental Star Schema

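[Diagram: the Rental fact table with four dimensions, each absorbing its source tables:
– What: Film
– Who: Customer
– Where: Store
– Who: Staff
Source tables shown: Film, Language, Category, Store (twice), Staff, Customer, Address (twice), City (twice), Country (twice)]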

Page 48: PLMCE2012 Starring Sakila

Dimensional Design

● Something is missing....
– Who ? (Customer, Staff)
– What ? (Film)
– Where ? (Store)
– .... ?

Page 49: PLMCE2012 Starring Sakila

A star is born: Rental Date and Time

[Diagram: the Rental fact table with its dimensions:
– What: Film
– Who: Customer
– Where: Store
– Who: Staff
– When: Date
– When: Time]

Page 50: PLMCE2012 Starring Sakila

Role Playing: Date/Time for both Rentals and Returns

[Diagram: the Rental fact table with its dimensions:
– What: Film
– Who: Customer
– Where: Store
– Who: Staff
– When: Rental Date
– When: Rental Time
– When: Return Date
– When: Return Time]
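Role playing means one physical date dimension is referenced twice from the fact table, once per role. A minimal sqlite3 sketch; the column names (`rental_date_key`, `return_date_key`, `day_name`) and the single sample row are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day_name TEXT);
CREATE TABLE fact_rental (
    rental_id       INTEGER PRIMARY KEY,
    rental_date_key INTEGER REFERENCES dim_date(date_key),
    return_date_key INTEGER REFERENCES dim_date(date_key)
);
INSERT INTO dim_date VALUES (20050524, 'Tuesday'), (20050526, 'Thursday');
INSERT INTO fact_rental VALUES (1, 20050524, 20050526);
""")

# The same dimension table is joined twice under two aliases, one per role.
rented_and_returned = conn.execute("""
    SELECT rented.day_name, returned.day_name
    FROM fact_rental AS f
    JOIN dim_date AS rented   ON rented.date_key   = f.rental_date_key
    JOIN dim_date AS returned ON returned.date_key = f.return_date_key
""").fetchone()
```

Only one date table has to be loaded and maintained; the roles exist purely as separate key columns in the fact table and aliases in the query.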

Page 51: PLMCE2012 Starring Sakila
Page 52: PLMCE2012 Starring Sakila

Rental Star Schema

Page 53: PLMCE2012 Starring Sakila

Starring Sakila

Part VI:

Filling the Data Warehouse

Page 54: PLMCE2012 Starring Sakila

High Level Data Warehouse Architecture

● Back-end (Extract, Transform, Load):
– Sources: ERP, CRM, External Data
– Staging Area
– Data Warehouse
– Datamarts
● Front-end (Present):
– BI Applications: Reporting, Analysis, Visualizations, Dashboards, Data Mining
● Meta Data

Page 55: PLMCE2012 Starring Sakila

Planning the ETL Process

● Physical Design
● Source to Target Mapping
– Define how data in the data warehouse is derived from data in the source system(s)
– Specification for designing the ETL process
● Column-level mapping
– Source system, schema, table, column, data type
– Target dimension/fact, column, defaults
– Transformation rules, cleansing, lookup, calculation

Page 56: PLMCE2012 Starring Sakila

Designing the ETL Process

● Staging?
● Changed Data Capture / Extraction
● Denormalization
● Derived data / Enrichment
● Cleansing / Conforming
● History policy (dimensions)
● Granularity
● Dimension Lookup (facts)

Page 57: PLMCE2012 Starring Sakila

Designing ETL with Kettle

● Flow ETL Engine
● Transformations
– Data flow and processing
● Jobs
– Workflow of ETL tasks
● Tools
– Spoon
– Kitchen
– Pan

Page 58: PLMCE2012 Starring Sakila

Loading a Fact Table

● Load Dimension Tables
● Load Fact table

Page 59: PLMCE2012 Starring Sakila

Loading a Dimension Table

● Get Customers source data
● Lookup Address (Denormalize)
● Update Dimension
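The three steps can be sketched in plain Python. This is not the Kettle transformation itself, only an illustration of the logic; the function name, the record layout, the 'Unknown' default and the Type I overwrite policy are assumptions for this example.

```python
def load_dim_customer(customers, addresses, dim_customer):
    """Denormalize customer source rows into dimension rows.

    dim_customer maps the business key (customer_id) to a dimension row.
    Existing rows are overwritten in place (a Type I policy).
    """
    next_key = max(
        (row["customer_key"] for row in dim_customer.values()), default=0
    ) + 1
    for customer in customers:                       # 1. source data
        address = addresses.get(customer["address_id"], {})  # 2. lookup
        row = {
            "customer_id": customer["customer_id"],
            "name": customer["name"],
            "city": address.get("city", "Unknown"),
            "country": address.get("country", "Unknown"),
        }
        existing = dim_customer.get(customer["customer_id"])  # 3. update
        if existing is None:
            row["customer_key"] = next_key           # new surrogate key
            next_key += 1
            dim_customer[customer["customer_id"]] = row
        else:
            existing.update(row)                     # surrogate key is kept
    return dim_customer


# Hypothetical source rows.
customers = [
    {"customer_id": 1, "name": "Customer One", "address_id": 10},
    {"customer_id": 2, "name": "Customer Two", "address_id": 99},
]
addresses = {10: {"city": "City X", "country": "Country A"}}
dim_customer = load_dim_customer(customers, addresses, {})
```

After the load, the city and country live directly on the customer row, so queries no longer need the address, city and country joins of the source schema.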

Page 60: PLMCE2012 Starring Sakila

Loading a Fact Table
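Loading the fact table mostly comes down to replacing source identifiers with dimension surrogate keys (the 'Dimension Lookup' step). A hedged sketch of that logic, not the Kettle transformation shown on the slide; the function name, the key maps and the reserved key 0 for unknown members are assumptions.

```python
from datetime import datetime


def load_fact_rental(rentals, customer_keys, film_keys, unknown_key=0):
    """Turn source rental rows into fact rows that hold only keys and measures."""
    facts = []
    for rental in rentals:
        facts.append({
            # Dimension lookups: business key -> surrogate key.
            "customer_key": customer_keys.get(rental["customer_id"], unknown_key),
            "film_key": film_keys.get(rental["film_id"], unknown_key),
            # The date key is derived directly from the timestamp.
            "rental_date_key": int(rental["rental_date"].strftime("%Y%m%d")),
            # Measure.
            "count_rentals": 1,
        })
    return facts


# Hypothetical lookup maps, built while loading the dimensions.
customer_keys = {101: 1, 102: 2}
film_keys = {7: 1}
rentals = [
    {"customer_id": 101, "film_id": 7, "rental_date": datetime(2005, 5, 24, 22, 53)},
    {"customer_id": 999, "film_id": 7, "rental_date": datetime(2005, 5, 25, 9, 0)},
]
facts = load_fact_rental(rentals, customer_keys, film_keys)
```

This is why dimensions are loaded first: the lookups can only succeed once the dimension rows and their surrogate keys exist.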

Page 61: PLMCE2012 Starring Sakila

Starring Sakila

Part VII:

Presenting the Data: BI Applications

Page 62: PLMCE2012 Starring Sakila

Business Intelligence Scope

● Operational (days, weeks)
– Users: Customers, Partners, Employees
– Example: “Who's available for tomorrow's shift?”
● Tactical (weeks, months)
– Users: Analysts, Managers
– Example: “In what region should we open a new store?”
● Strategic (months, years)
– Users: Executives
– Example: “Should we become an appliance vendor instead of delivering software solutions?”
● Applications: Reporting, OLAP/Analysis, Data mining

Page 63: PLMCE2012 Starring Sakila

Reporting

Page 64: PLMCE2012 Starring Sakila

Reporting

● Mostly Operational
● Lists and Grouping
● Typically standardized
● Typically no or limited interactivity
– Subreporting

Page 65: PLMCE2012 Starring Sakila

Scope of Reporting

● Operational (days, weeks)
– Users: Customers, Partners, Employees
– Example: “Who's available for tomorrow's shift?”
● Tactical (weeks, months)
– Users: Analysts, Managers
– Example: “In what region should we open a new store?”
● Strategic (months, years)
– Users: Executives
– Example: “Should we become an appliance vendor instead of delivering software solutions?”
● Applications: Reporting, OLAP/Analysis, Data mining

Page 66: PLMCE2012 Starring Sakila

Reporting

Page 67: PLMCE2012 Starring Sakila

Analysis

Page 68: PLMCE2012 Starring Sakila

Analysis

● Tactical, Strategic
● OLAP
– Online Analytical Processing
● Pivot tables
● Typically Interactive
– Slice and Dice
– Drilldown
● Typically Ad-hoc

Page 69: PLMCE2012 Starring Sakila

Scope of OLAP & Analysis

● Operational (days, weeks)
– Users: Customers, Partners, Employees
– Example: “Who's available for tomorrow's shift?”
● Tactical (weeks, months)
– Users: Analysts, Managers
– Example: “In what region should we open a new store?”
● Strategic (months, years)
– Users: Executives
– Example: “Should we become an appliance vendor instead of delivering software solutions?”
● Applications: Reporting, OLAP/Analysis, Data mining

Page 70: PLMCE2012 Starring Sakila

Analysis Interactive Pivot table

Page 71: PLMCE2012 Starring Sakila

Data Mining

Page 72: PLMCE2012 Starring Sakila

Data Mining

● Strategic, Tactical
● Discover hidden patterns in data
● Machine learning
● Statistical analysis
● Typically not interactive, long running
● A matter for experts
● Not readily consumable by end-users
– Characteristics of back-end processing

Page 73: PLMCE2012 Starring Sakila

Scope of Data Mining

● Operational (days, weeks)
– Users: Customers, Partners, Employees
– Example: “Who's available for tomorrow's shift?”
● Tactical (weeks, months)
– Users: Analysts, Managers
– Example: “In what region should we open a new store?”
● Strategic (months, years)
– Users: Executives
– Example: “Should we become an appliance vendor instead of delivering software solutions?”
● Applications: Reporting, OLAP/Analysis, Data mining

Page 74: PLMCE2012 Starring Sakila

Data Mining

Page 75: PLMCE2012 Starring Sakila

Charts and Graphs

Page 76: PLMCE2012 Starring Sakila

Charts and Graphs

● Operational, Tactical, Strategic
● Summarize large dataset
● Not a separate class but a presentation
– Data Visualization
● Standardized or ad-hoc
● Can be interactive
– Drive a subreport
– Drive drilldown

Page 77: PLMCE2012 Starring Sakila

Dashboarding

Page 78: PLMCE2012 Starring Sakila

Dashboarding

● Operational, Tactical, Strategic
● Not a separate class but a presentation
● Bundle:
– key metrics for a particular role or perspective
– different views on the same metrics
● Can contain reports, pivot tables, charts, graphs
● Typically interactive

Page 79: PLMCE2012 Starring Sakila

Dashboard