
Page 1: Dr. M. Sulaiman Khan (mskhan@liv.ac.uk)  Dept. of Computer Science University of Liverpool 2010

Dr. M. Sulaiman Khan

(mskhan@liv.ac.uk)

Dept. of Computer Science

University of Liverpool

2010

COMP207: Data Mining

Data Warehousing


Page 2: Today's Topics

Data Warehouses
Data Cubes
Warehouse Schemas
OLAP
Materialisation

Page 3: What is a Data Warehouse?

Most common definition: “A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.” - W. H. Inmon

Corporate focused, assumes a lot of data, and is typically sales related.

Supplies the data for a “Decision Support System” or “Management Support System”.

1996 survey: Return on Investment of 400+%

Data Warehousing: The process of constructing (and using) a data warehouse.

Page 4: Data Warehouse

Subject-oriented: Focused on important subjects, not transactions. Gives a concise view with only the data useful for decision making.

Integrated: Constructed from multiple, heterogeneous data sources. Usually these are distributed relational databases, not necessarily with the same schema. Cleaning and pre-processing techniques are applied for missing data, noisy data and inconsistent data (sounds familiar, I hope).

Page 5: Data Warehouse

Time-variant: Has different values for the same fields over time. An operational database only has the current value; a data warehouse offers historical values.

Nonvolatile: Physically separate store. Updates are not online, but in offline batch mode only. Only read access is required, so there are no concurrency issues.

Page 6: Data Warehouse

Data Warehouses are distinct from:

Distributed DB: Integrated via wrappers/mediators. Far too slow, and semantic integration is much more complicated. In a warehouse, integration is done before loading, not at run time.

Operational DB: Only records the current value, along with lots of extra non-useful information. Different schemas/models, access patterns, users and functions, even though the warehouse data is derived from an operational DB.

Page 7: OLAP vs OLTP

OLAP: Online Analytical Processing (Data Warehouse)
OLTP: Online Transaction Processing (Traditional DBMS)

OLAP data is typically historical, consolidated, and multi-dimensional (e.g. product, time, location).

Involves lots of full database scans, across terabytes or more of data.

Typically uses aggregation and summarisation functions.

Distinctly different uses from OLTP on the operational database.
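
The contrast between the two access patterns can be sketched in a few lines of Python. This is only an illustration: the `sales` rows, their field names, and both helper functions are made up for the example and are not part of the lecture material.

```python
# Hypothetical sales rows; field names and values are illustrative only.
sales = [
    {"order_id": 1, "product": "laptop", "quarter": "Q1", "city": "Liverpool", "units": 3},
    {"order_id": 2, "product": "laptop", "quarter": "Q1", "city": "Leeds", "units": 2},
    {"order_id": 3, "product": "phone", "quarter": "Q2", "city": "Liverpool", "units": 5},
    {"order_id": 4, "product": "laptop", "quarter": "Q2", "city": "Liverpool", "units": 1},
]


def oltp_update_units(rows, order_id, new_units):
    """OLTP style: locate one record by key and overwrite its current value."""
    for row in rows:
        if row["order_id"] == order_id:
            row["units"] = new_units
            return True
    return False


def olap_units_by(rows, dimension):
    """OLAP style: scan every row and aggregate a measure along one dimension."""
    totals = {}
    for row in rows:
        totals[row[dimension]] = totals.get(row[dimension], 0) + row["units"]
    return totals


by_product = olap_units_by(sales, "product")
by_quarter = olap_units_by(sales, "quarter")
```

The OLTP helper touches a single row, while the OLAP helper has to read all of them, which is why the two workloads are usually kept on separate stores.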

Page 8: Data Cubes

Data is normally multi-dimensional, and can be thought of as a cube.

Often: 3 dimensions of time, location and product.

No need to have just 3 dimensions: a cube for cars could have make, colour, price, location and time, for example.

Image courtesy of IBM OLAP Miner documentation

Page 9: Data Cubes

Can construct many 'cuboids' from the full cube by excluding dimensions.

In an N dimensional data cube, the cuboid with N dimensions is the 'base cuboid'. The 0 dimensional cuboid (which is not empty: it holds the summary over all dimensions) is called the 'apex cuboid'.

Can think of this as a lattice of cuboids...

(Following lattice courtesy of Han & Kamber)

Page 10: Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
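
The lattice above is simply every subset of the dimension list, so it can be enumerated mechanically. The following is a minimal sketch; the function name `cuboid_lattice` is chosen for this example only.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return {k: [cuboids with k dimensions]} for every subset of dimensions."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
total_cuboids = sum(len(level) for level in lattice.values())

# Print one line per level; the empty tuple is the apex cuboid ("all").
for k, cuboids in lattice.items():
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {names}")
```

With N dimensions there are 2^N cuboids in total, which is one reason why materialising all of them quickly becomes expensive.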

Page 11: Multi-dimensional Units

Each dimension can also be thought of in terms of different units.

Time: decade, year, quarter, month, day, hour (and week, which isn't strictly hierarchical with the others!)
Location: continent, country, state, city, store
Product: electronics, computer, laptop, dell, inspiron

This is called a “Star-Net” model in data warehousing, and allows for various operations on the dimensions and the resulting cuboids.

Page 12: Star-Net Model

Shipping Method: AIR-EXPRESS, TRUCK
Customer Orders: ORDER, CONTRACTS
Customer
Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM
Organization: SALES PERSON, DISTRICT, DIVISION
Promotion
Geography: DISTRICT, REGION, COUNTRY
Time: DAILY, QTRLY, ANNUALLY

Page 13: Data Cube Operations

Roll Up: Summarise data by climbing up a hierarchy. E.g. from monthly to quarterly, from Liverpool to England.

Drill Down: Opposite of Roll Up. E.g. from computer to laptop, from £100-999 to £100-199.

Slice: Remove a dimension by setting a value for it. E.g. location/product where time is Q1, 2007.

Dice: Restrict the cube by setting values for multiple dimensions. E.g. the sub cube for Q1 and Q2 / North American cities / 3 products.

Pivot: Rotate the cube (mostly for visualisation).
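
These operations can be sketched on a tiny cube held as a Python dictionary. Everything here is an assumption made for illustration: the cell values, the `MONTH_TO_QUARTER` mapping, and the helper names are not from the lecture, and a real OLAP engine would work very differently internally.

```python
from collections import defaultdict

# Hypothetical base cuboid: (month, city, product) -> units sold.
cube = {
    ("Jan", "Liverpool", "laptop"): 4,
    ("Feb", "Liverpool", "laptop"): 6,
    ("Apr", "Liverpool", "phone"): 3,
    ("Jan", "Chicago", "laptop"): 5,
    ("May", "Chicago", "phone"): 2,
}
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Apr": "Q2", "May": "Q2"}


def roll_up(cells, axis, mapping):
    """Climb a hierarchy on one axis, summing cells that share a parent."""
    out = defaultdict(int)
    for key, value in cells.items():
        parent_key = key[:axis] + (mapping[key[axis]],) + key[axis + 1:]
        out[parent_key] += value
    return dict(out)


def slice_cube(cells, axis, value):
    """Fix one axis to a single value and drop that axis from the keys."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cells.items() if k[axis] == value}


def dice(cells, allowed):
    """Keep cells whose coordinate on each listed axis is in the allowed set."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[axis] in values for axis, values in allowed.items())
    }


def pivot(cells, order):
    """Reorder the axes; the stored values are unchanged."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}


quarterly = roll_up(cube, 0, MONTH_TO_QUARTER)
q1_slice = slice_cube(quarterly, 0, "Q1")
sub_cube = dice(quarterly, {0: {"Q1", "Q2"}, 1: {"Chicago"}})
rotated = pivot(quarterly, (2, 1, 0))
```

Drill down has no function of its own in this sketch: going from `quarterly` back to monthly detail just means returning to the finer-grained `cube`, which is why the detailed data has to be kept somewhere.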

Page 14: Data Cube Schemas

Star Schema: Single fact table in the middle, with a connected set of dimension tables. (Hence a star.)

Snowflake Schema: Some of the dimension tables are further refined into smaller dimension tables. (Hence it looks like a snowflake.)

Fact Constellation: Multiple fact tables can share dimension tables. (Hence it looks like a collection of star schemas. Also called a Galaxy Schema.)

Page 15: Star Schema

Sales Fact Table: time_key, item_key, location_key, units_sold (measure)
Time Dimension: time_key, day, day_of_week, month, quarter, year
Item Dimension: item_key, name, brand, type, supplier_type
Location Dimension: location_key, street, city, state, country, continent
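
As a rough sketch, the star schema on this slide could be declared and queried with Python's built-in `sqlite3` module. The table names (`time_dim`, `item_dim`, `location_dim`, `sales_fact`) and all of the sample rows are assumptions for this example; only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                       month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           state TEXT, country TEXT, continent TEXT);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE sales_fact (time_key INTEGER REFERENCES time_dim(time_key),
                         item_key INTEGER REFERENCES item_dim(item_key),
                         location_key INTEGER REFERENCES location_dim(location_key),
                         units_sold INTEGER);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 15, "Mon", "Jan", "Q1", 2007),
    (2, 10, "Tue", "Apr", "Q2", 2007),
])
conn.execute("INSERT INTO item_dim VALUES (1, 'Inspiron', 'Dell', 'laptop', 'wholesale')")
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "1 High St", "Liverpool", "Merseyside", "UK", "Europe"),
    (2, "2 Main St", "Chicago", "Illinois", "USA", "North America"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 4), (1, 1, 2, 5), (2, 1, 1, 3), (1, 1, 1, 6),
])

# A typical star query: join the fact table to the dimensions it needs,
# then aggregate the measure.
rows = conn.execute("""
    SELECT t.quarter, l.city, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY t.quarter, l.city
    ORDER BY t.quarter, l.city
""").fetchall()
```

Every dimension is one join away from the fact table, which keeps queries like this simple at the cost of some redundancy inside the dimension tables.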

Page 16: Snowflake Schema

Sales Fact Table: time_key, item_key, location_key, units_sold (measure)
Time Dimension: time_key, day, day_of_week, month, quarter, year
Item Dimension: item_key, name, brand, type, supplier_key
Location Dimension: location_key, street, city_key
City Dimension: city_key, city, state, country
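
The practical effect of snowflaking is an extra hop when resolving a dimension attribute. A minimal sketch using plain dictionaries follows; the rows and the helper names are invented for illustration, and only the key and column names follow the slide.

```python
# Hypothetical rows; keys follow the snowflake tables on this slide.
city_dim = {
    10: {"city": "Liverpool", "state": "Merseyside", "country": "UK"},
    20: {"city": "Chicago", "state": "Illinois", "country": "USA"},
}
location_dim = {
    1: {"street": "1 High St", "city_key": 10},
    2: {"street": "2 Main St", "city_key": 20},
    3: {"street": "9 Dock Rd", "city_key": 10},
}
sales_fact = [
    {"time_key": 1, "item_key": 1, "location_key": 1, "units_sold": 4},
    {"time_key": 1, "item_key": 1, "location_key": 2, "units_sold": 5},
    {"time_key": 2, "item_key": 1, "location_key": 3, "units_sold": 3},
]


def country_of(location_key):
    """Follow fact -> location -> city, the extra hop a snowflake schema adds."""
    city_key = location_dim[location_key]["city_key"]
    return city_dim[city_key]["country"]


def units_by_country(facts):
    """Aggregate the measure by an attribute stored in the sub-dimension."""
    totals = {}
    for fact in facts:
        country = country_of(fact["location_key"])
        totals[country] = totals.get(country, 0) + fact["units_sold"]
    return totals


def flatten_location(location_key):
    """Denormalise one location row back into its star-schema form."""
    row = dict(location_dim[location_key])
    row.update(city_dim[row.pop("city_key")])
    return row


totals = units_by_country(sales_fact)
```

City details are stored once per city rather than once per street address, which saves space but means more lookups (joins) at query time.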

Page 17: Fact Constellation

Sales Fact Table: time_key, item_key, location_key, units_sold (measure)
Shipping Table: time_key, item_key, from_key, units_shipped
Time Dimension: time_key, day, day_of_week, month, quarter, year
Item Dimension: item_key, name, brand, type, supplier_key
Location Dimension: location_key, street, city_key
City Dimension: city_key, city, state, country

Page 18: OLAP Server Types

ROLAP: Relational OLAP. Uses a relational DBMS to store and manage the warehouse data. Optimised for non-traditional access patterns. Lots of existing RDBMS research to make use of!

MOLAP: Multidimensional OLAP. Sparse array based storage engine. Fast access to precomputed data.

HOLAP: Hybrid OLAP. Mixture of both MOLAP and ROLAP.
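
To give a feel for the "sparse array" idea behind MOLAP, here is a toy sketch in Python. The `SparseCube` class is hypothetical and far simpler than a real storage engine; it only shows that empty cells need not be stored.

```python
class SparseCube:
    """Store only non-empty cells, keyed by their coordinates."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.cells = {}

    def add(self, coords, value):
        """Accumulate a value into the cell at the given coordinates."""
        coords = tuple(coords)
        if len(coords) != len(self.dimensions):
            raise ValueError("one coordinate per dimension is required")
        self.cells[coords] = self.cells.get(coords, 0) + value

    def get(self, coords):
        """Empty cells are never stored; they read back as 0."""
        return self.cells.get(tuple(coords), 0)

    def density(self, sizes):
        """Fraction of the equivalent dense array that is actually occupied."""
        total = 1
        for size in sizes:
            total *= size
        return len(self.cells) / total


cube = SparseCube(["quarter", "city", "product"])
cube.add(("Q1", "Liverpool", "laptop"), 4)
cube.add(("Q1", "Liverpool", "laptop"), 6)
cube.add(("Q2", "Chicago", "phone"), 2)
```

Lookups go straight to a precomputed cell, with no joins or scans, which is where the fast access comes from.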

Page 19: Data Warehouse Architecture

Data Sources: Operational DBs, Other sources
Extract, Transform, Load, Refresh (Monitor & Integrator)
Data Storage: Data Warehouse, Data Marts, Metadata
Serve
OLAP Server: OLAP Engine
Front-End Tools: Analysis, Query, Reports, Data mining

(also courtesy of Han & Kamber)

Page 20: Materialisation

In order to compute OLAP queries efficiently, we need to materialise some of the cuboids from the data.

None: Very slow, as the entire cube needs to be computed at run time.

Full: Very fast, but requires a LOT of storage space and time to compute all possible cuboids.

Partial: But which ones to materialise? Called an 'iceberg cube', as it is only partially materialised and the rest is "below water". Many cells in a cuboid will be empty, so only materialise the cells that contain more values than a minimum threshold.
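
The iceberg idea can be sketched naively as "count every cell, keep only the well-supported ones". The code below assumes that the threshold is a row count and that a cell is kept when it has at least `min_count` rows; the function name and sample rows are invented for this example. It counts every cuboid in full before filtering, so it shows what ends up stored rather than how an efficient algorithm would avoid the work.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, dimensions, min_count):
    """Materialise, for every cuboid, only cells backed by at least min_count rows."""
    result = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            # Count how many rows fall into each cell of this cuboid.
            counts = Counter(tuple(row[d] for d in dims) for row in rows)
            kept = {cell: n for cell, n in counts.items() if n >= min_count}
            if kept:
                result[dims] = kept
    return result


# Hypothetical fact rows.
rows = [
    {"quarter": "Q1", "city": "Liverpool", "product": "laptop"},
    {"quarter": "Q1", "city": "Liverpool", "product": "laptop"},
    {"quarter": "Q1", "city": "Chicago", "product": "phone"},
    {"quarter": "Q2", "city": "Liverpool", "product": "phone"},
]
iceberg = iceberg_cube(rows, ["quarter", "city", "product"], min_count=2)
```

With `min_count=2`, the base cuboid keeps a single cell, (Q1, Liverpool, laptop), instead of three; the sparsely populated cells stay "below water" and would be computed on demand.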

Page 21: Further Reading

http://en.wikipedia.org/wiki/Data_warehouse and subsequent links