
Data Warehouses (DW)

Vera Goebel Department of Informatics, University of Oslo

Fall 2016

A Data Warehouse (DW) is a collection of integrated databases designed to support a Decision Support System (DSS). DW is a collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is non-volatile and relevant to some moment in time [Inmon 1992].


Data Warehousing: History & Economics

•  1990 – IBM, “Business Intelligence”: process of collecting and analyzing
•  1993 – Bill Inmon, “Data Warehouse”
•  Growing industry: $8 billion in 1998
•  Data Warehouse solutions offered today by nearly all commercial DBS vendors!
•  Return on Investment (RoI):
   – 1996: 3-year RoI 400%
   – 2002: 1-year RoI 430%
•  Range from desktop to huge:
   – Walmart example: 900-CPU, 2,700 disk, 23TB, Teradata system
•  Expensive to build: 10-1000 million $
•  Used to evaluate future strategy


Database Systems versus Decision Support Systems (DSS)

•  Operational (normal DBS)
   – Stored in Normalized Relational Database
   – Support transactions that represent daily operations (Not Query Friendly)
•  3 Main Differences
   – Time Span
   – Granularity
   – Dimensionality


Time Span

•  Operational DBS / Data Store
   – Real Time
   – Current Transactions
   – Short Time Frame
   – Specific Data Facts
•  DSS
   – Historic
   – Long Time Frame (Months/Quarters/Years)
   – Patterns


Granularity

•  Operational DBS / Data Store
   – Specific Transactions that occur at a given time
•  DSS
   – Shown at different levels of aggregation
   – Different Summary Levels
   – Decompose (drill down)
   – Summarize (roll up)
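Roll-up and drill-down can be illustrated with a small sketch. The sales records and the day/month/year hierarchy below are invented for illustration and do not come from the slides; the point is only that the same facts can be summarized at several levels.

```python
from collections import defaultdict

# Hypothetical transaction-level facts: (year, month, day, amount).
SALES = [
    (2016, 1, 5, 100),
    (2016, 1, 20, 50),
    (2016, 2, 3, 70),
    (2015, 12, 31, 30),
]

def roll_up(rows, level):
    """Aggregate amounts at the given level: 'day', 'month', or 'year'."""
    depth = {"year": 1, "month": 2, "day": 3}[level]
    totals = defaultdict(int)
    for *date_parts, amount in rows:
        totals[tuple(date_parts[:depth])] += amount
    return dict(totals)

by_year = roll_up(SALES, "year")    # coarsest summary level (roll up)
by_month = roll_up(SALES, "month")  # one level finer (drill down)
```

Going from `by_year` to `by_month` is a drill-down; the reverse direction is a roll-up.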


Dimensionality

•  Most distinguishing characteristic of DSS data
•  Operational
   – Represents atomic transactions
•  DSS
   – Data is related in many ways
   – Develop the larger picture
   – Multi-dimensional view of data


DSS Database Requirements

•  DSS Database Scheme
   – Support Complex and Non-Normalized data
      •  Summarized and Aggregate data
      •  Multiple Relationships
      •  Queries must extract multi-dimensional time slices
      •  Redundant Data


DSS Database Requirements - 2

•  Data Extraction and Filtering
   – DSS databases are created mainly by extracting data from operational databases, combined with data imported from external sources
      •  Need for advanced data extraction & filtering tools
      •  Allow batch / scheduled data extraction
      •  Support different types of data sources
      •  Check for inconsistent data / data validation rules
      •  Support advanced data integration / data formatting conflicts


DSS Database Requirements - 3

•  End User Analytical Interface
   – Must support advanced data modeling and data presentation tools
   – Data analysis tools
   – Query generation
   – Must allow the user to navigate through the DSS
•  Size Requirements
   – VERY large: terabytes
   – Advanced hardware (multiple processors, multiple disk arrays, etc.)


What is a Data Warehouse?

•  Collection of diverse data
   – subject oriented
   – aimed at executive, decision maker
   – often a copy of operational data
   – with value-added data (e.g., summaries, history)
   – data integrated
   – time variant
   – non-volatile


DW Characteristics

•  Subject oriented: data are organized based on how the users refer to them.
•  Data integrated: all inconsistencies regarding naming conventions and value representations are removed.
•  Non-volatile: data are stored in read-only format and do not change over time (except periodic updates/refresh from operational DBS).
•  Time variant: data are not current but normally time series.


DW Characteristics

•  Summarized: operational data are mapped into a decision-usable format.
•  Large volume: time series data sets are normally quite large.
•  Not normalized: DW data can be (often are) redundant.
•  Metadata: data about data are stored.
•  Data sources: data come from internal and external un-integrated operational systems.


What is a Data Warehouse?

•  Collection of tools
   – gathering data
   – cleansing, integrating, ...
   – querying, reporting, analysis
   – data mining
   – monitoring, administering DW


Data Warehouse Architecture

[Figure: several sources feed an Integration component that loads the Warehouse (with its Metadata); clients reach the Warehouse through a Query & Analysis component]


Why a Data Warehouse?

•  Two Approaches:
   – Query-Driven (Lazy)
   – Data Warehouse (Eager)

[Figure: two sources with an open question ("?") about how to query them]


Query-Driven Approach

[Figure: clients query a Mediator, which reaches each source through its own Wrapper]


Advantages of Data Warehousing

•  High query performance
•  Queries not visible outside DW
•  Local processing at sources unaffected
•  Can operate when sources unavailable
•  Can query data not stored in a DBMS
•  Extra information at warehouse
   – Modify, summarize (store aggregates)
   – Add historical information


Advantages of Query-Driven Approach

•  No need to copy data
   – less storage
   – no need to purchase data
•  More up-to-date data
•  Query needs can be unknown
•  Only query interface needed at sources
•  May be less draining on sources


OLTP vs. OLAP

•  OLTP: On-Line Transaction Processing
   – Processing at operational sites, short time horizon
   – Optimize throughput (max. # TAs / time unit)
   – Large number of short online TAs
   – ACID transactions
•  OLAP: On-Line Analytical Processing
   – Processing at DW, long time horizon
   – Optimize response time (1-n queries / minimal time)
   – Read-only data/queries, periodic refresh/update
   – Complex queries, involve aggregations
   – Store aggregated, historical data in multi-dimensional schemes
   – DW data latency: few hours, Data Marts: 1 day


OLTP vs. OLAP

OLTP
•  Mostly updates
•  Many small transactions
•  MB-TB of data
•  Raw data
•  Clerical users
•  Up-to-date data
•  Consistency, recoverability critical

OLAP
•  Mostly reads
•  Queries long, complex
•  GB-PB of data
•  Summarized, consolidated data
•  Decision-makers, analysts as users


Data Marts

•  Smaller data warehouses
•  Spans part of organization
   – e.g., marketing (customers, products, sales)
•  Do not require enterprise-wide consensus
   – but long term integration problems?


DW Models & Operators

•  Data Models
   – relations
   – Star schema & snowflake schema
   – Data Cubes
•  Operators
   – slice & dice
   – roll-up, drill down
   – pivoting
   – other


What and Why of Data Warehousing

•  What: A very large database containing materialized views of multiple, independent source databases. The views generally contain aggregation data (aka datacubes).

[Figure: several source data stores and database systems feed the Data Warehouse System (holding datacubes) through Data Extraction and Load; DSS app workstations send queries to the warehouse]

•  Why: The data warehouse (DW) supports read-only queries for new applications, e.g., DSS, OLAP & data mining.


DW Life Cycle

•  The Life Cycle: Global Schema Definition, Data Extraction and Load, Query Processing, Data Update
•  General Problems:
   – Heavy user demand
   – Problems with source data
      •  ownership, format, heterogeneity
   – Underestimating complexity & resources for all phases
•  Boeing Computing Services – DW for DSS in airplane repair
   •  DW size: 2-3 terabytes
   •  Online query services: 24×7 service
   •  Data life cycle: retain data for 70+ years (until the airplane is retired)
   •  Data update: No “nighttime”; concurrent refresh is required
   •  Access paths: Support new and old methods for 70+ years


Global Schema Design – Base Tables

•  Fact Table
   – Stores basic facts from the source databases (often denormalized)
   – Data about past events (e.g., sales, deliveries, factory outputs, ...)
   – Have a time (or time period) associated with them
   – Data is very unlikely to change at a data source; no updates
   – Very large tables (up to 1 TB)
•  Dimension Table
   – Attributes of one dimension of a fact table (typically denormalized)
   – A chain of dimension tables to describe attributes on other dimension tables (normalized or denormalized)
   – Data can change at a data source; updates executed occasionally

[Figure: a Fact Table linked to Dimension Tables, with a further table chained to one of the dimensions. Attribute names recovered from the figure: ProductID, SupplierID, PurchaseDate, DeliveryDate, CustYrs; ProductID, ProdName, ProdDesc, ProdStyle, ManufSite; SupplierID, SuppName, SuppAddr, SuppPhone, Date1stOrder; TimeID, Quarter, Year, AuditID; AuditComp, Addr, AcctName, Phone, ContractYr]
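A fact table joined to its dimension tables can be sketched with Python's built-in `sqlite3` module. The table `Purchase`, the measure `Amount`, the table name `TimeDim`, and all rows are assumptions made for this illustration; only the column names ProductID, ProdName, SupplierID, SuppName, TimeID, Quarter, and Year are taken from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product  (ProductID TEXT PRIMARY KEY, ProdName TEXT);
CREATE TABLE Supplier (SupplierID TEXT PRIMARY KEY, SuppName TEXT);
CREATE TABLE TimeDim  (TimeID INTEGER PRIMARY KEY, Quarter INTEGER, Year INTEGER);
-- Fact table: one row per purchase event; Amount is an assumed measure.
CREATE TABLE Purchase (
    ProductID  TEXT REFERENCES Product(ProductID),
    SupplierID TEXT REFERENCES Supplier(SupplierID),
    TimeID     INTEGER REFERENCES TimeDim(TimeID),
    Amount     REAL
);
INSERT INTO Product  VALUES ('P11', 'ATMCard'), ('P14', 'MPEGCard');
INSERT INTO Supplier VALUES ('S1', 'Supplier One'), ('S2', 'Supplier Two');
INSERT INTO TimeDim  VALUES (1, 1, 2016), (2, 2, 2016);
INSERT INTO Purchase VALUES
    ('P11', 'S1', 1, 10.0), ('P11', 'S2', 1, 5.0),
    ('P14', 'S1', 2, 7.5),  ('P11', 'S1', 2, 2.5);
""")

# Star join: the fact table joins directly to each dimension it needs.
rows = conn.execute("""
    SELECT p.ProdName, t.Quarter, SUM(f.Amount)
    FROM Purchase f
    JOIN Product p ON p.ProductID = f.ProductID
    JOIN TimeDim t ON t.TimeID    = f.TimeID
    GROUP BY p.ProdName, t.Quarter
    ORDER BY p.ProdName, t.Quarter
""").fetchall()
```

The fact table holds only keys and measures, so every descriptive attribute used in a query comes from a dimension table through a single join.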


Schema Design Patterns

•  Star Schema: fact table F with dimension tables D1, D2, D3
   – D1, D2, D3 are denormalized
•  Snowflake Schema: fact table F with dimension chains D1.1, D1.2; D2.1, D2.2; D3.1, D3.2
   – D1, D2, D3 are normalized
•  Starflake Schema: fact table F with D1, D2 and D3.1, D3.2
   – D3 may be normalized or denormalized
•  Constellation Schema: fact tables F1 and F2 with dimension table D1
   – D1 stores attributes about a relationship between F1 and F2


Design Methodologies

•  Bottom-up:
   – Data Marts created first
   – Integrate data marts to create DW
   – Bus architecture: collection of conformed dimensions & facts
•  Top-down:
   – Use normalized enterprise data model
   – Atomic data (data at lowest level of detail) stored in DW
   – Dimensional data marts containing data needed for a specific business process, created from DW


Summary Tables

•  aka datacubes or multidimensional tables
•  Store precomputed query results for likely queries
   – Reduce on-the-fly join operations
   – Reduce on-the-fly aggregation functions, e.g., sum, avg
•  Stores denormalized data
•  Aggregate data from one or more fact tables and/or one or more dimension tables
•  Discard and compute new summaries as the set of likely queries changes

[Figure: a Summary Table derived from Fact Tables and dimension tables (Dim Table#1, Dim Table#2)]


Summary Tables = Datacubes

[Figure: datacube "Total Expenses for Parts by Product, Supplier and Quarter", with axes Product (P11, P14, P19, P27, P33, All-P), Fiscal Quarter (Q1, Q2, Q3, Q4, All-Q), and Supplier (S1, S2, All-S); cell values are given in M £]

•  Example cells and the grouping that produces them:
   – Total Expenses paid to all suppliers of parts for Product P19 in the 1st quarter: GROUP BY product, quarter
   – Total Expenses paid to Supplier S1 for parts for Product P33 in the 2nd quarter: GROUP BY supplier, product, quarter
   – Total Expenses paid to Supplier S2 for parts for all products in all quarters: GROUP BY supplier
   – Average Price paid to all suppliers of parts for Product P11 in the 1st quarter: GROUP BY product, quarter
•  Typical, pre-computed Measures are:
   – Sum, percentage, average, std deviation, count, min-value, max-value, percentile


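What to Materialize?

•  Store in warehouse results useful for common queries
•  Example: total sales per product (p1, p2), per c1 to c3, per day

   day 1     c1    c2    c3
   p1        12          50
   p2        11     8

   day 2     c1    c2    c3
   p1        44     4
   p2

   Summed over days:
             c1    c2    c3
   p1        56     4    50
   p2        11     8

   Summed over days and products:
             c1    c2    c3
             67    12    50

   Summed over days and c1 to c3:
   p1       110
   p2        19

   Total sales: 129

•  Each of the derived tables above is a candidate to materialize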
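The aggregates in this example can be recomputed with a short sketch. The base facts are the day 1 and day 2 cells from the slide; the dimension labels `product`, `city`, and `day` and the helper `aggregate` are names chosen here for illustration.

```python
from collections import defaultdict

# Base facts from the example: (product, city, day) -> sales amount.
SALES = {
    ("p1", "c1", 1): 12, ("p1", "c3", 1): 50,
    ("p2", "c1", 1): 11, ("p2", "c2", 1): 8,
    ("p1", "c1", 2): 44, ("p1", "c2", 2): 4,
}
DIMS = ("product", "city", "day")

def aggregate(facts, keep):
    """Sum the measure, grouping by the dimensions named in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, amount in facts.items():
        out[tuple(key[i] for i in idx)] += amount
    return dict(out)

by_product_city = aggregate(SALES, ("product", "city"))
by_city = aggregate(SALES, ("city",))
by_product = aggregate(SALES, ("product",))
total = aggregate(SALES, ())
```

Each result corresponds to one table a warehouse could store in advance instead of recomputing it per query.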


Materialization Factors

•  Type/frequency of queries
•  Query response time
•  Storage cost
•  Update cost


Cube Aggregates Lattice

   city, product, date
   city, product    city, date    product, date
   city    product    date
   all

[Figure: the lattice levels annotated with the example sales tables, from the per-day tables at the top down to the total of 129 at the bottom]

•  Use greedy algorithm to decide what to materialize
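The lattice itself is just the set of all subsets of the dimensions. A minimal sketch, assuming the three dimensions named on the slide; the helper names `lattice` and `can_answer` are hypothetical:

```python
from itertools import combinations

DIMS = ("city", "product", "date")

def lattice(dims):
    """Return every group-by set, from the finest (all dims) down to 'all' (none)."""
    return [frozenset(c)
            for k in range(len(dims), -1, -1)
            for c in combinations(dims, k)]

def can_answer(view, query):
    """A view can answer a query if it keeps every dimension the query groups by."""
    return query <= view

views = lattice(DIMS)
```

With n dimensions there are 2^n candidate views, which is why a selection heuristic is needed rather than materializing everything.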


Too Many Summary Tables

•  Factors to be considered:
   – What queries must DW support?
   – What source data are available?
   – What is the time-space trade-off to store versus re-compute joins and measures?
   – Cost to acquire and update the data?
•  The Schema Design Problem:
   – Given a finite amount of disk storage, what views (summaries) will you pre-compute in the data warehouse?
•  An NP-complete optimization problem
   – Use heuristics and approximation algorithms
      •  Benefit Per Unit Space (BPUS)
      •  Pick By Size (PBS)
      •  Pick By Size–Use (PBS-U)
•  Use a derivation lattice to analyze possible materialized views

[Figure: derivation lattice of materialized views with nodes A through H and ALL/None]


A Lattice of Summary Tables

•  Derivation Lattice
   – Nodes: the set of attributes that would appear in the “group by” clause to construct this view
   – Edges: connect view V2 to view V1 if V1 can be used to answer queries over V2
   – MetaData: estimated # of records in each view

[Figure: derivation lattice for parts (P), suppliers (S), and customers (C): PSC at the top; PC, PS, SC below it; then P, S, C; ALL/None at the bottom. Record-count annotations recovered from the figure: 6M, 0.1M]

•  Determine cost and benefit of each view
•  Select a subset of the possible views
•  Typical simplifying assumptions:
   – Query cost ≈ # of records scanned to answer the query
   – I/O costs are much larger than CPU cost to compute measures
   – Ignore cost reductions due to using indexes to access records
   – All queries are equally likely to occur
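Under these simplifying assumptions, a benefit-per-unit-space greedy selection can be sketched as follows. This is one plausible reading of the BPUS heuristic, not necessarily the formulation used in the lecture. The row counts in `SIZE` are illustrative assumptions (only 6M and 0.1M are legible on the slide), and all function names are hypothetical.

```python
# Estimated rows per view, keyed by group-by attributes ("" is ALL/None).
# These sizes are assumed for illustration only.
SIZE = {
    "PSC": 6_000_000, "PC": 6_000_000, "PS": 800_000, "SC": 6_000_000,
    "P": 200_000, "S": 10_000, "C": 100_000, "": 1,
}

def answers(view, query):
    """True if `query` can be computed from `view` (its attributes are a subset)."""
    return set(query) <= set(view)

def query_cost(query, materialized):
    """Rows scanned: size of the smallest materialized view that answers the query."""
    return min(SIZE[v] for v in materialized if answers(v, query))

def benefit(view, materialized):
    """Total reduction in rows scanned, over all queries, if `view` is added."""
    return sum(max(0, query_cost(q, materialized) - SIZE[view])
               for q in SIZE if answers(view, q))

def greedy_bpus(space_budget):
    """Greedily add the view with the highest benefit per unit of space."""
    chosen = ["PSC"]  # the top view is always kept; it answers every query
    used = 0          # extra space consumed beyond the top view
    while True:
        candidates = [v for v in SIZE
                      if v not in chosen and used + SIZE[v] <= space_budget]
        scored = [(benefit(v, chosen) / SIZE[v], v) for v in candidates]
        scored = [s for s in scored if s[0] > 0]
        if not scored:
            return chosen
        _, best = max(scored)
        chosen.append(best)
        used += SIZE[best]

selected = greedy_bpus(space_budget=1_200_000)
```

With these assumed sizes the heuristic picks the tiny views first, because their benefit per row stored is very high, and it never selects PC or SC, which are as large as the top view and therefore save nothing.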


Data Extraction and Load

Step 1: Extract and clean data from all sources
   – Select source, remove data inconsistencies, add default values
Step 2: Materialize the views and measures
   – Reformat data, recalculate data, merge data from multiple sources, add time elements to the data, compute measures
Step 3: Store data in the DW
   – Create metadata and access path data, such as indexes

•  Major Issue: Failure during extraction and load
•  Approaches:
   – UNDO/REDO logging
      •  Too expensive in time and space
   – Incremental Checkpointing
      •  When to checkpoint? Modularize and divide the long-running tasks
      •  Must use UNDO/REDO logs also; need high-performance logging


Materializing Summary Tables

•  Scenario: CompsiQ has factories in 7 cities. Each factory manufactures several of CompsiQ’s 30 hardware products. Each factory has 3 types of manufacturing lines: robotic, hand-assembly, and mixed-line.
•  Target summary query: What is last year’s yield from Factory-A by product type?
•  Schema for source data from Factory-A:

   YieldInfo(ProductCode, RoboticYield, Hand-AssemYield, MixedLineYield, Week, Year)
   ProductInfo(ProductCode, ProductName, ProductType, FCS-Date, EstProductLife)


Materialization using SchemaSQL

What is last year’s yield from Factory-A by product type?

   select   p.ProductType, sum(y.lt)
   from     Factory-A::YieldInfo→ lt,
            Factory-A::YieldInfo y,
            Factory-A::ProductInfo p
   where    lt <> "ProductCode" and lt <> "Week" and lt <> "Year"
     and    y.ProductCode = p.ProductCode
     and    y.Year = 01
   group by p.ProductType

At execution time, lt ranges over the attribute names in relation YieldInfo.

   YieldInfo(ProductCode, RoboticYield, Hand-AssemYield, MixedLineYield, Week, Year)
   ProductInfo(ProductCode, ProductName, ProductType, FCS-Date, EstProductLife)


Aggregation Over Irregular Blocks

ProductInfo
   ProductCode  ProductName  ProductType  FCS-Date  EstProductLife
   P11          ATMCard      Net          3-8-99    36
   P12          SMILCard     Video        1-02-98   18
   P13          ATMHub       Net          1-11-99   36
   P14          MPEGCard     Video        24-3-00   24
   P15          MP3          Audio        17-1-01   36

YieldInfo
   ProductCode  RoboticYield  Hand-AssemYield  MixedLineYield  Week  Year
   P11          17            12               5               45    01
   P12          9             11               12              45    01
   P13          5             10               3               45    01
   P14          22            8                7               45    01
   ...
   P11          20            15               0               46    01
   P12          8             9                10              46    01
   P13          31            0                0               46    01
   P14          15            15               20              46    01
   ...
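The yield-by-product-type aggregation can be approximated in plain SQL, shown here through Python's `sqlite3` module. This is an emulation, not SchemaSQL: standard SQL cannot range over attribute names, so the three yield columns are listed explicitly. The column `Hand-AssemYield` is renamed `HandAssemYield` to form a valid identifier, and only the sample rows visible in the tables (weeks 45 and 46) are loaded.

```python
import sqlite3

YIELD_COLUMNS = ["RoboticYield", "HandAssemYield", "MixedLineYield"]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductInfo (ProductCode TEXT, ProductName TEXT, ProductType TEXT);
CREATE TABLE YieldInfo (ProductCode TEXT, RoboticYield INT, HandAssemYield INT,
                        MixedLineYield INT, Week INT, Year INT);
INSERT INTO ProductInfo VALUES
    ('P11','ATMCard','Net'), ('P12','SMILCard','Video'), ('P13','ATMHub','Net'),
    ('P14','MPEGCard','Video'), ('P15','MP3','Audio');
INSERT INTO YieldInfo VALUES
    ('P11',17,12,5,45,1), ('P12',9,11,12,45,1),
    ('P13',5,10,3,45,1),  ('P14',22,8,7,45,1),
    ('P11',20,15,0,46,1), ('P12',8,9,10,46,1),
    ('P13',31,0,0,46,1),  ('P14',15,15,20,46,1);
""")

# Build "y.RoboticYield + y.HandAssemYield + y.MixedLineYield" from the list,
# standing in for the attribute variable that SchemaSQL provides.
total = " + ".join(f"y.{col}" for col in YIELD_COLUMNS)
rows = conn.execute(f"""
    SELECT p.ProductType, SUM({total})
    FROM YieldInfo y JOIN ProductInfo p ON y.ProductCode = p.ProductCode
    WHERE y.Year = 1
    GROUP BY p.ProductType
    ORDER BY p.ProductType
""").fetchall()
```

If a factory added a fourth line type, this version would need its column list edited by hand, whereas the attribute variable would pick up the new column automatically.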


User Queries

•  Retrieve pre-computed data or formulate new measures not materialized in the DW.

[Figure: datacube with axes Fiscal Quarter (Q1, Q2, Q3, Q4, All-Q), Supplier (S1, S2, All-S), and Product (P11, P14, P19, P27, P33)]

•  New user operations on logical datacubes:
   – Roll-up, Drill-down, Pivot/Rotate
   – Slicing and Dicing with a “data blade”
   – Sorting
   – Selection
   – Derived Attributes


Query Processing

•  Traditional query transformations
•  Index intersection and union
•  Advanced join algorithms
•  Piggy-backed scans
   – Multiple queries with different selection criteria
•  SQL extensions => new operators
   – Red Brick Systems has proposed 8 extensions, including:
      •  MovingSum and MovingAvg
      •  Rank … When
      •  RatioToReport
      •  Tertiles
      •  Create Macro
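The general idea behind a few of these operators can be sketched in plain Python. The definitions below are generic interpretations written for this illustration; the exact semantics of the Red Brick extensions may differ, and the sales figures are hypothetical.

```python
def moving_sum(values, window):
    """Sum of each value and up to `window - 1` values before it."""
    return [sum(values[max(0, i - window + 1): i + 1])
            for i in range(len(values))]

def moving_avg(values, window):
    """Average over the same trailing window used by moving_sum."""
    chunks = (values[max(0, i - window + 1): i + 1] for i in range(len(values)))
    return [sum(chunk) / len(chunk) for chunk in chunks]

def ratio_to_report(values):
    """Each value as a fraction of the total of all reported values."""
    total = sum(values)
    return [v / total for v in values]

quarterly_sales = [10, 20, 30, 40]  # hypothetical figures
two_quarter_sum = moving_sum(quarterly_sales, 2)
two_quarter_avg = moving_avg(quarterly_sales, 2)
shares = ratio_to_report(quarterly_sales)
```

Operators like these are awkward to express with joins and subqueries alone, which is one motivation for adding them to the query language.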


Data Update

•  Data sources change over time
•  Must “refresh” the DW
   – Adds new historical data to the fact tables
   – Updates descriptive attributes in the dimension tables
   – Forces recalculation of measures in summary tables
•  Issues:
   1. Monitoring/tracking changes at the data sources
   2. Recalculation of aggregated measures
   3. Refresh typically forces a shutdown for DW query processing


Monitoring Data Sources

Approaches:
1. Value-deltas – Capture before and after values of all tuples changed by normal DB operations and store them in differential relations.
   •  Issues: must take the DW offline to install the modified values
2. Operation-deltas – Capture SQL updates from the transaction log of each data source and build a new log of all transactions that affect data in the DW.
   •  Advantages: DW can remain online for query processing while executing data updates (using traditional concurrency control)
3. Hybrid – Use value-deltas and operation-deltas for different data sources or a subset of the relations from a data source.


Creating a Differential Relation

Approaches at the Data Source:
1. Execute the update query 3 times
   •  (1) Select and record the before values; (2) Execute the update; (3) Select and record the after values
   •  Issues: High cost in time & space; reduces autonomy of the data sources
2. Define and insert DB triggers
   •  Triggers fire on “insert”, “delete”, and “update” operations; log the before and after values
   •  Issues: Not all data sources support triggers; reduces autonomy of the data sources
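The trigger-based approach can be sketched with SQLite triggers through Python's `sqlite3` module. The `Supplier` source table, the `SupplierDelta` differential relation, the trigger names, and the sample phone numbers are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Supplier (SupplierID TEXT PRIMARY KEY, SuppPhone TEXT);
-- Differential relation: before and after values of each changed tuple.
CREATE TABLE SupplierDelta (op TEXT, SupplierID TEXT, old_phone TEXT, new_phone TEXT);

CREATE TRIGGER supp_ins AFTER INSERT ON Supplier BEGIN
    INSERT INTO SupplierDelta VALUES ('insert', NEW.SupplierID, NULL, NEW.SuppPhone);
END;
CREATE TRIGGER supp_upd AFTER UPDATE ON Supplier BEGIN
    INSERT INTO SupplierDelta VALUES ('update', NEW.SupplierID, OLD.SuppPhone, NEW.SuppPhone);
END;
CREATE TRIGGER supp_del AFTER DELETE ON Supplier BEGIN
    INSERT INTO SupplierDelta VALUES ('delete', OLD.SupplierID, OLD.SuppPhone, NULL);
END;
""")

# Normal operations at the data source; the triggers log each change.
conn.execute("INSERT INTO Supplier VALUES ('S1', '555-0100')")
conn.execute("UPDATE Supplier SET SuppPhone = '555-0199' WHERE SupplierID = 'S1'")
conn.execute("DELETE FROM Supplier WHERE SupplierID = 'S1'")

deltas = conn.execute("SELECT * FROM SupplierDelta ORDER BY rowid").fetchall()
```

The triggers must be installed in the source database itself, which is the loss of autonomy noted above.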


Creating Operation-Deltas

•  The process:
   – Scan the transaction log at each data source
   – Select pertinent transactions and delta-log them
•  Advantage:
   – Op-delta is much smaller than the value-delta
•  Issues:
   – Must transform the update operation on the data source schema into an update operation on the DW schema. This is not always possible, so operation-deltas cannot be used in all cases.


Recalculating Aggregated Measures

•  Delta Tables
   – Assume we have differential relations for the base facts in the data sources (i.e., value deltas)
   – Two processing phases (Propagation & Refresh):

1) Propagation – pre-compute all new tuples and all replacement tuples and store them in a delta table

[Figure: the Propagation Process takes the Differential Relations and the Global DW Schema as input and produces the Delta Tables]


Recalculating Aggregated Measures

2) Refresh – Scan the DW tuples, replace existing tuples with the pre-computed tuple values, insert new tuples from the delta tables

[Figure: the Refresh Process applies the Delta Tables to the DW Tables and produces the Updated DW Tables]

Issue: Cannot pre-compute the Delta Table for non-commutative measures
   Ex: average (without #records), percentiles
   Must compute these during the refresh phase.
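The two phases can be sketched for a summary table that stores a sum and a record count per key, which is what makes the average recomputable. The table contents, the inserted facts, and the function names `propagate` and `refresh` are assumptions for this illustration, and only insertions are handled.

```python
# Summary table keyed by product, storing (sum, count) so that the
# average can be derived without rescanning the base facts.
summary = {"p1": {"sum": 110, "count": 4}, "p2": {"sum": 19, "count": 2}}

# Differential relation (value deltas): newly inserted facts (product, amount).
value_deltas = [("p1", 10), ("p3", 5), ("p1", 20)]

def propagate(table, deltas):
    """Phase 1: pre-compute new and replacement tuples into a delta table."""
    delta_table = {}
    for key, amount in deltas:
        if key not in delta_table:
            old = table.get(key, {"sum": 0, "count": 0})
            delta_table[key] = dict(old)  # copy, so the DW table is untouched
        delta_table[key]["sum"] += amount
        delta_table[key]["count"] += 1
    return delta_table

def refresh(table, delta_table):
    """Phase 2: replace existing tuples and insert new ones."""
    table.update(delta_table)

delta_table = propagate(summary, value_deltas)  # DW table not yet modified
refresh(summary, delta_table)
avg_p1 = summary["p1"]["sum"] / summary["p1"]["count"]
```

Had the table stored only the average for `p1`, the replacement tuple could not have been derived from the deltas alone, which is the issue raised above for averages kept without a record count.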


Design

•  What data is needed?
•  Where does it come from?
•  How to clean data?
•  How to represent in warehouse (schema)?
•  What to summarize?
•  What to materialize?
•  What to index?


Tools

•  Development
   – design & edit: schemas, views, scripts, rules, queries, reports
•  Planning & Analysis
   – what-if scenarios (schema changes, refresh rates), capacity planning
•  Warehouse Management
   – performance monitoring, usage patterns, exception reporting
•  System & Network Management
   – measure traffic (sources, warehouse, clients)
•  Workflow Management
   – “reliable scripts” for cleaning & analyzing data


Data Marting

•  What: Stores a second copy of a subset of a DW
•  Why build a data mart?
   – A user group with special needs (dept.)
   – Better performance accessing fewer records
   – To support a “different” user access tool
   – To enforce access control over different subsets
   – To segment data over different hardware platforms

[Figure: the Data Warehouse System (datacubes) feeds two Data Mart Systems (datacubes) through Data Extraction and Load; DSS app workstations send queries]


Costs and Benefits of Data Marting

•  System costs:
   – More hardware (servers and networks)
   – Define a subset of the global data model
   – More software to:
      •  Extract data from the warehouse
      •  Load data into the mart
      •  Update the mart (after the warehouse is updated)
•  User benefits:
   – Define new measures not stored in the DW
   – Better performance (mart users and DW users)
   – Support a more appropriate user interface
      •  Ex: a browser with forms versus SQL queries
   – Company achieves more reliable access control


Current State of Industry

•  Extraction and integration done off-line
   – Usually in large, time-consuming batches
•  Everything copied at DW
   – Not selective about what is stored
   – Query benefit vs storage & update cost
•  Query optimization aimed at OLTP
   – High throughput instead of fast response
   – Process whole query before displaying anything


Commercial DW Products

•  Short list of companies with DW products:
   – Informix/Red Brick Systems
   – Oracle
   – Prism Solutions
   – Software AG
•  Typical Products and Tools
   – Specially tuned DB Server
   – DW Developer Tools: data extraction, incremental update, index builder
   – User Tools: ad hoc query and spreadsheet tools for DSS and post-processing (creating graphs, pie-charts, etc.)
   – Application Developer Tools (toolkits for OLAP and DSS): spreadsheet components, statistics packages, trend analysis and forecasting components


Ongoing Research Problems

•  How to incorporate domain and business rules into DW creation and maintenance
•  Replacing manual tasks with intelligent agents
   – Data acquisition, data cleaning, schema design, DW access paths analysis and index construction
•  Separate (but related) research areas:
   – Tools for data mining and OLAP
   – Providing active database services in the DW