motivation: why data warehouse? what is a data warehouse?...


Upload: nguyenliem

Post on 07-Nov-2018


Page 1: Motivation: Why data warehouse? What is a data warehouse? [JH]ocw.ui.ac.id/materials/12.01_FASILKOM/IKI83403T_-_Data_Mining_and... · `In ODB, concurrency control and recovery mechanisms

Data Warehousing and OLAP
Lecture 2/DMBI/IKI83403T/MTI/UI

Yudho Giri Sucahyo, Ph.D, CISA ([email protected])
Faculty of Computer Science, University of Indonesia

Objectives
Motivation: Why data warehouse?
What is a data warehouse?
Why separate DW?
Conceptual modeling of DW
Data Mart
Data Warehousing Architectures
Data Warehouse Development
Data Warehouse Vendors
Real-time DW

Motivation: Why data warehouse?
Construction of data warehouses (DW) involves data cleaning and data integration, an important preprocessing step for data mining (DM).
DW provide OLAP for the interactive analysis of multidimensional data, which facilitates effective DM.
Data mining functions can be integrated with OLAP operations to enhance interactive mining of knowledge.
DW provide an effective platform for DM.
While DWs are not a requirement for DM, they store massive amounts of data that can be used for DM. [DO]

What is a data warehouse? [JH]
Defined in many different ways, but not rigorously.
  A decision support database that is maintained separately from the organization’s operational database (ODB).
  Supports information processing by providing a solid platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
Case Study 2: Continental Airlines flies high with its real-time data warehouse


What is a data warehouse? [ET]
Data warehouse
  A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
  Characteristics
    Subject oriented, Integrated, Time variant, Non-volatile
    Web-based, Relational/multidimensional, Client/server, Real-time
    Include metadata
Data warehousing
  Process of constructing and using data warehouses.
  Requires data integration, data cleaning, and data consolidation.

Subject Oriented
Organized around major subjects, such as customer, product, sales.
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

Integrated
Integrate multiple, heterogeneous data sources
  Relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  E.g., hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
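The hotel price example can be sketched in Python. The two source layouts, the field names, the exchange rate, and the tax rate below are all hypothetical; they only illustrate how records from heterogeneous sources might be converted to one convention when they are moved to the warehouse.

```python
from dataclasses import dataclass

# Hypothetical conversion constants; real values would come from reference data.
IDR_PER_USD = 15000.0
TAX_RATE = 0.10


@dataclass(frozen=True)
class HotelPrice:
    """Assumed warehouse convention: USD, tax included, explicit breakfast flag."""
    hotel: str
    price_usd_incl_tax: float
    breakfast_included: bool


def from_source_a(row: dict) -> HotelPrice:
    # Source A (hypothetical): price in USD before tax, breakfast coded as "Y"/"N".
    return HotelPrice(
        hotel=row["name"].strip().title(),
        price_usd_incl_tax=round(row["price_usd"] * (1 + TAX_RATE), 2),
        breakfast_included=(row["bf"] == "Y"),
    )


def from_source_b(row: dict) -> HotelPrice:
    # Source B (hypothetical): price in IDR including tax, breakfast as a boolean.
    return HotelPrice(
        hotel=row["hotel_name"].strip().title(),
        price_usd_incl_tax=round(row["harga_idr"] / IDR_PER_USD, 2),
        breakfast_included=bool(row["breakfast"]),
    )


warehouse_rows = [
    from_source_a({"name": "grand bandung ", "price_usd": 100.0, "bf": "Y"}),
    from_source_b({"hotel_name": "Bogor Inn", "harga_idr": 1_650_000, "breakfast": False}),
]
```

After conversion, both records use the same currency, tax treatment, and naming convention, so their prices can be compared directly.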

Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
  Operational database: current value data.
  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
  Contains an element of time, explicitly or implicitly
  But the key of operational data may or may not contain a “time element”.


Non-volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
  Does not require transaction processing, recovery, and concurrency control mechanisms
  Requires only two operations in data accessing: initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
  Build wrappers/mediators on top of multiple, heterogeneous databases. Ex: IBM Data Joiner, Informix DataBlade
  Query-driven approach:
    When a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to local query processors. The results returned from the different sites are integrated into a global answer set.
    Requires complex information filtering and integration processes, which compete for resources.
    Inefficient and potentially expensive for frequent queries, especially for queries requiring aggregations.

Data Warehouse vs. Heterogeneous DBMS (2)
Using DW: update-driven approach
  Information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis.
  Unlike OLTP, DW do not contain the most current information.
  DW brings high performance to the integrated heterogeneous DB system since data are copied, preprocessed, integrated, annotated, summarized, and restructured into one data store.
  Query processing in DW does not interfere with the processing at local sources.
  DW can store and integrate historical information and support complex multidimensional queries.
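The contrast between the query-driven and update-driven approaches can be sketched as follows. The two in-memory "sources", their field names, and the numbers are hypothetical stand-ins for heterogeneous local databases; this is a toy illustration, not how any particular mediator or warehouse product works.

```python
from collections import defaultdict

# Two hypothetical local sources that name the same concepts differently.
SOURCE_A = [{"region": "West", "amount": 120}, {"region": "East", "amount": 80}]
SOURCE_B = [{"area": "West", "total": 50}, {"area": "East", "total": 70}]


def query_driven_total(region: str) -> int:
    """Mediator style: translate the query and hit every source at query time."""
    from_a = sum(r["amount"] for r in SOURCE_A if r["region"] == region)
    from_b = sum(r["total"] for r in SOURCE_B if r["area"] == region)
    return from_a + from_b  # integrate the partial results into one answer


def build_warehouse() -> dict:
    """Update-driven style: integrate and summarize in advance, once per refresh."""
    totals = defaultdict(int)
    for r in SOURCE_A:
        totals[r["region"]] += r["amount"]
    for r in SOURCE_B:
        totals[r["area"]] += r["total"]
    return dict(totals)


WAREHOUSE = build_warehouse()


def update_driven_total(region: str) -> int:
    """Answer from the warehouse only; the local sources are not touched."""
    return WAREHOUSE.get(region, 0)
```

If a new row is appended to `SOURCE_A` after `build_warehouse()` has run, `query_driven_total` sees it immediately while `update_driven_total` does not until the next refresh, which mirrors the point that a DW does not contain the most current information.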

DW vs. ODB
Major task of ODB: OLTP
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
DW serve for data analysis and decision making: OLAP
Distinct features (OLTP vs. OLAP)
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries


OLTP vs. OLAP

                    OLTP                                  OLAP
users               clerk, IT professional                knowledge worker
function            day-to-day operations                 decision support
DB design           application-oriented                  subject-oriented
data                current, up-to-date, detailed,        historical, summarized, multidimensional,
                    flat relational, isolated             integrated, consolidated
usage               repetitive                            ad hoc
access              read/write,                           lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction             complex query
# records accessed  tens                                  millions
# users             thousands                             hundreds
DB size             100MB-GB                              100GB-TB
metric              transaction throughput                query throughput, response

Why Separate DW?
High performance for both systems:
  DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  Warehouse: tuned for OLAP (complex OLAP queries, computation of large groups of data at summarized levels, multidimensional view, consolidation)
Processing OLAP queries in operational databases would degrade the performance of operational tasks.
In ODB, concurrency control and recovery mechanisms (locking, logging) are required to ensure the consistency and robustness of transactions.
OLAP: read-only access. No need for concurrency control and recovery.

Why Separate DW? (2)
Different functions and different data:
  Missing data: Decision support requires historical data which operational DBs do not typically maintain. So, data in ODB is usually far from complete for decision making.
  Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources. ODB contain detailed raw data (transactions) which need to be consolidated before analysis.
  Data quality: Different sources typically use inconsistent data representations, codes and formats which have to be reconciled.

Conceptual Modeling of DW
Data cube:
  see TSBD Lecture Notes on Visualization of Data Cubes
Modeling data warehouses: dimensions & measurements
  Star schema: A single object (fact table) in the middle connected to a number of objects (dimension tables, one for each dimension).
  Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
  Fact constellations: Multiple fact tables share dimension tables.
    Also known as galaxy schema


Example of Star Schema
[Figure: star schema with a central fact table and four dimension tables]
Sales Fact Table: Date, Product, Store, Customer (keys); unit_sales, dollar_sales, Yen_sales (measurements)
Date dimension: Date, Day, Month, Year
Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
Store dimension: StoreID, City, State, Country, Region
Customer dimension: CustId, CustName, CustCity, CustCountry
Potential redundancy: Bandung and Bogor are both in West Java (Jawa Barat).
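A star schema along the lines of this slide can be sketched with Python's built-in `sqlite3` module. The column names are adapted from the figure, but the sample rows, the "Asia" region value, and the omission of the customer dimension and Yen_sales are simplifications made here for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim    (date_key TEXT PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE product_dim (product_no INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
CREATE TABLE store_dim   (store_id INTEGER PRIMARY KEY, city TEXT, state TEXT,
                          country TEXT, region TEXT);
CREATE TABLE sales_fact  (
    date_key     TEXT    REFERENCES date_dim(date_key),
    product_no   INTEGER REFERENCES product_dim(product_no),
    store_id     INTEGER REFERENCES store_dim(store_id),
    unit_sales   INTEGER,
    dollar_sales REAL
);
""")
conn.executemany("INSERT INTO date_dim VALUES (?,?,?,?)",
                 [("2018-01-05", 5, 1, 2018), ("2018-02-10", 10, 2, 2018)])
conn.executemany("INSERT INTO product_dim VALUES (?,?,?)",
                 [(1, "TV", "Electronics"), (2, "PC", "Electronics")])
# Both cities repeat the same state/country/region: the redundancy noted on the slide.
conn.executemany("INSERT INTO store_dim VALUES (?,?,?,?,?)",
                 [(10, "Bandung", "Jawa Barat", "Indonesia", "Asia"),
                  (11, "Bogor", "Jawa Barat", "Indonesia", "Asia")])
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)",
                 [("2018-01-05", 1, 10, 3, 900.0),
                  ("2018-01-05", 2, 11, 1, 700.0),
                  ("2018-02-10", 1, 11, 2, 600.0)])

# A typical star join: one hop from the fact table to each dimension it needs.
sales_by_state_product = conn.execute("""
    SELECT s.state, p.prod_name, SUM(f.unit_sales), SUM(f.dollar_sales)
    FROM sales_fact f
    JOIN store_dim s   ON s.store_id = f.store_id
    JOIN product_dim p ON p.product_no = f.product_no
    GROUP BY s.state, p.prod_name
    ORDER BY p.prod_name
""").fetchall()
```

Every dimension is reachable from the fact table with a single join, which is the main practical appeal of the star layout.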

Snowflake Schema
[Figure: snowflake schema; the Date and Store dimensions are normalized into chains of tables]
Sales Fact Table: Date, Product, Store, Customer (keys); unit_sales, dollar_sales, Yen_sales (measurements)
Date: Date, Day, Month -> Month: Month, Year -> Year: Year
Product: ProductNo, ProdName, ProdDesc, Category, QOH
Store: StoreID, City -> City: City, State -> State: State, Country -> Country: Country, Region
Customer: CustId, CustName, CustCity, CustCountry
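The normalization step that turns a star dimension into a snowflake can be sketched in plain Python. The sample rows are hypothetical, and the sketch assumes that city and state names are unique, which a real design would enforce with surrogate keys.

```python
# Denormalized store dimension, as in a star schema (sample values are hypothetical).
store_dim = [
    {"store_id": 10, "city": "Bandung", "state": "Jawa Barat", "country": "Indonesia"},
    {"store_id": 11, "city": "Bogor",   "state": "Jawa Barat", "country": "Indonesia"},
    {"store_id": 12, "city": "Seattle", "state": "Washington", "country": "U.S.A."},
]


def snowflake(rows):
    """Split the flat dimension into one lookup table per hierarchy level."""
    store = {r["store_id"]: r["city"] for r in rows}   # store -> city
    city = {r["city"]: r["state"] for r in rows}       # city  -> state
    state = {r["state"]: r["country"] for r in rows}   # state -> country
    return store, city, state


store_tbl, city_tbl, state_tbl = snowflake(store_dim)


def country_of(store_id):
    """Resolving a higher level now takes a chain of lookups (joins)."""
    return state_tbl[city_tbl[store_tbl[store_id]]]
```

"Jawa Barat" is now stored once in `state_tbl` instead of once per store, at the cost of extra lookups when a query needs the country of a store.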

View of Warehouses and Hierarchies
Importing data
Table browsing
Dimension creation
Dimension browsing
Cube building
Cube browsing

Data Cube
[Figure: 3-D sales data cube. Dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), Country (U.S.A., Canada, Mexico, sum). One highlighted cell is labeled "Total annual sales of TV in U.S.A."]
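The "sum" cells of the cube in this figure can be illustrated with a small Python sketch. The sales figures are made up, and the brute-force construction below (adding every fact to each of the 2^3 cells that cover it) is only one simple way to materialize a tiny cube, not the method used by any particular OLAP engine.

```python
from collections import defaultdict
from itertools import product

ALL = "sum"  # label used in the figure for the aggregated member of a dimension

# (quarter, product, country, sales); the figures are made up for illustration.
facts = [
    ("1Qtr", "TV", "U.S.A.", 100), ("2Qtr", "TV", "U.S.A.", 120),
    ("3Qtr", "TV", "U.S.A.", 90),  ("4Qtr", "TV", "U.S.A.", 140),
    ("1Qtr", "PC", "Canada", 60),  ("2Qtr", "VCR", "Mexico", 30),
]


def build_cube(rows):
    """Add each fact to every cell that covers it (each dimension kept or rolled to ALL)."""
    cube = defaultdict(int)
    for quarter, item, country, sales in rows:
        for cell in product((quarter, ALL), (item, ALL), (country, ALL)):
            cube[cell] += sales
    return cube


cube = build_cube(facts)
tv_usa_annual = cube[(ALL, "TV", "U.S.A.")]  # total annual sales of TV in U.S.A.
grand_total = cube[(ALL, ALL, ALL)]
```

Here `tv_usa_annual` corresponds to the cell highlighted in the figure: the Date dimension is aggregated away while Product and Country stay fixed.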


Data Cube
Visualization
OLAP capabilities
Interactive manipulation

Typical OLAP Operations
Roll up (drill-up): summarize data
  by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
  from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.
Other operations
  drill across: involving (across) more than one fact table.
  drill through: through the bottom level to its back-end relational tables.
More info: www.knowledgecenters.org, www.olapreport.com, www.olapcouncil.org
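The operations listed above can be sketched on a tiny cube held in a Python dictionary. The data, the city-to-country hierarchy, and the function names are all illustrative assumptions.

```python
from collections import defaultdict

# Base cuboid: (city, quarter, item) -> units sold. All values are illustrative.
sales = {
    ("Bandung", "Q1", "TV"): 5, ("Bandung", "Q2", "TV"): 7,
    ("Bogor",   "Q1", "TV"): 3, ("Bogor",   "Q1", "PC"): 4,
    ("Toronto", "Q1", "PC"): 6, ("Toronto", "Q2", "TV"): 2,
}
CITY_TO_COUNTRY = {"Bandung": "Indonesia", "Bogor": "Indonesia", "Toronto": "Canada"}


def roll_up_city_to_country(cube):
    """Roll up: climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), units in cube.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(out)


def roll_up_drop_item(cube):
    """Roll up by dimension reduction: aggregate the item dimension away."""
    out = defaultdict(int)
    for (city, quarter, _item), units in cube.items():
        out[(city, quarter)] += units
    return dict(out)


def slice_quarter(cube, quarter):
    """Slice: fix one dimension to a single value, leaving a 2-D sub-cube."""
    return {(c, i): u for (c, q, i), u in cube.items() if q == quarter}


def dice(cube, cities, items):
    """Dice: select on two or more dimensions at once."""
    return {k: u for k, u in cube.items() if k[0] in cities and k[2] in items}


def pivot(table_2d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): u for (a, b), u in table_2d.items()}


by_country = roll_up_city_to_country(sales)
q1 = slice_quarter(sales, "Q1")
```

Drill down is the reverse direction: going from `by_country` back to the city-level `sales` cube, which is only possible because the detailed data is still available.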

Data Mart
DW collects information about subjects that span the entire organization, such as customers, products, sales, assets, and personnel. Its scope is enterprise-wide.
  For DW, fact constellation schema is commonly used since it can model multiple, interrelated subjects.
A data mart is a subset of a DW that focuses on a particular subject. Its scope is department-wide. Typically, a data mart consists of a single subject area (e.g. marketing, operations).
  For a data mart, star or snowflake schema are commonly used since both are geared towards modeling single subjects, although the star schema is more popular.

Data Mart
A data mart can be either dependent or independent.
A dependent data mart is a subset that is created directly from the DW.
  Consistent data model
  Providing quality data
  DW must be constructed first
  Ensures that the user is viewing the same version of the data that is accessed by all other data warehouse users
An independent data mart is a small warehouse designed for a department, and its source is not an enterprise data warehouse (EDW).


Data Warehousing Process Overview
The major components of a data warehousing process
  Data sources: legacy systems, external data providers (e.g. BPS), OLTP, ERP systems
  Data extraction
  Data loading
  Comprehensive database
  Metadata
  Middleware tools

Data Warehousing Architectures

Data Integration and the ETL Process
Various integration technologies:
  Enterprise Application Integration (EAI)
    A technology that provides a vehicle for pushing data from source systems into a data warehouse
    Integrating application functionality and is focused on sharing functionality across systems
    Traditionally, API. Nowadays, SOA (web services).
  Enterprise Information Integration (EII)
    An evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases
    A mechanism for pulling data from source systems to satisfy a request for information.

Data Integration and the ETL Process
ETL
  60-70% of the time in a data-centric project.
  Extraction: Reading data from one or more databases
  Transformation: Converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a DW
  Load: Putting the data into the DW
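The three ETL steps can be sketched end to end with Python's standard library. The CSV text stands in for an operational source, and the table name, the dd/mm/yyyy date format, and the rule that rows without an amount are rejected are assumptions made for this example.

```python
import csv
import io
import sqlite3

# Stand-in for an operational source; a real extract would read from a database or file.
SOURCE_CSV = """order_id,order_date,amount,currency
1,05/01/2018,100.50,usd
2,10/02/2018,,usd
3,11/02/2018,75.00,USD
"""


def extract(text):
    """Extraction: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: reject incomplete rows, convert dates to ISO format, normalize codes."""
    cleaned = []
    for r in rows:
        if not r["amount"]:
            continue  # assumed rule: rows without an amount are rejected
        day, month, year = r["order_date"].split("/")  # assumed dd/mm/yyyy input
        cleaned.append((int(r["order_id"]), f"{year}-{month}-{day}",
                        float(r["amount"]), r["currency"].upper()))
    return cleaned


def load(conn, rows):
    """Load: put the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(order_id INTEGER, order_date TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?)", rows)
    conn.commit()


dw = sqlite3.connect(":memory:")
load(dw, transform(extract(SOURCE_CSV)))
loaded = dw.execute("SELECT * FROM sales_fact ORDER BY order_id").fetchall()
```

Keeping the three steps as separate functions makes it easy to test the transformation rules without touching either the source or the warehouse.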


Data Warehouse Development
Direct benefits
  Allowing end users to perform extensive analysis in numerous ways
  A consolidated view of corporate data (i.e., a single version of the truth)
  Better and more timely information
  Enhanced system performance. DW frees production processing because some operational system reporting requirements are moved to DSS
  Simplification of data access

Data Warehouse Development
Some best practices for implementing a DW (Weir, 2002):
  Project must fit with corporate strategy and business objectives
  There must be complete buy-in to the project by executives, managers, and users
  It is important to manage user expectations about the completed project
  The data warehouse must be built incrementally
  Build in adaptability
  Managed by both IT and business professionals
  Develop a business/supplier relationship
  Only load data that have been cleansed and are of a quality understood by the organization
  Do not overlook training requirements
  Be politically aware

Data Warehouse Vendors
Computer Associates
DataMirror
Data Advantage Group
Dell Computer
Embarcadero Technologies
Business Objects
HP
Hummingbird
Hyperion
IBM
Informatica
Microsoft
Oracle
SAS
Siemens
Sybase
Teradata
Please visit:
  Data Warehousing Institute (tdwi.com)
  DM Review (dmreview.com)

Data Warehouse Vendors
Six guidelines to consider when developing a vendor list:
  1. Financial strength
  2. ERP linkages
  3. Qualified consultants
  4. Market share
  5. Industry experience
  6. Established partnerships


Real-time DW
Traditionally, updated on a weekly basis. Unsuitable for some businesses.
Real-time (active) data warehousing
  The process of loading and providing data via a data warehouse as they become available
Levels of data warehouses:
  1. Reports what happened
  2. Some analysis occurs
  3. Provides prediction capabilities
  4. Operationalization
  5. Becomes capable of making events happen


From DW to DM [JH]
Three kinds of data warehouse applications
  Information processing
    supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  Analytical processing
    multidimensional analysis of data warehouse data
    supports basic OLAP operations, slice-dice, drilling, pivoting
  Data mining
    knowledge discovery from hidden patterns
    supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.


References
[JH] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[ET] Efraim Turban et al., Decision Support and Business Intelligence Systems, Pearson, 2007.
[DO] David Olson and Yong Shi, Introduction to Business Data Mining, McGraw-Hill, 2007.