datawarehouse concepts in a nutshell

Upload: meonline7

Post on 02-Apr-2018


  • 7/27/2019 Datawarehouse Concepts in a Nutshell


    Data Warehouse Overview

    Abstract

    A data warehouse is a large database system designed for data analysis. Its design differs from that of an operational database. A data warehouse reads data from multiple operational databases instead of taking data from end-user transaction input. Because a data warehouse does not have to perform transaction processing, it can run computation-intensive queries for data analysis. Its data model also differs from that of an operational database in order to make data browsing easier. This document presents an overview of the data warehouse in terms of its architecture, data model, and user interface.

    1.0 Introduction

    A data warehouse is a large database system designed for data analysis. Its data comes from many operational database systems. The source data can be, for example, accounting information, operation information, inventory information, or customer information. The data warehouse builds cross-reference information between these different data sources to enable data analysis. It groups data into subject areas so that users can find data more easily. It maintains historical data for trend analysis. Since a data warehouse is not used for end-user transaction processing, it can afford the resources to run computation-intensive queries for data analysis.

    From a business perspective, a data warehouse provides a single, consistent data source. It makes the data collection process much easier and faster for users. Users can answer different business questions by issuing queries to the data warehouse. Potentially, better business decisions can be made in a shorter period of time.

    2.0 Business Driving Force for Data Warehouse

    With the increase in business competition, there is a need to obtain and analyze business data faster. A lot of important business data is held in operational database systems. However, these systems are not designed for business data analysis, for the following reasons:

    • The data model is normalized for transaction speed, not for data analysis.
    • The data model is not grouped into subject areas for analysis.
    • The data model is not dimensional.
    • An operational database can't afford the resources to run computation-intensive queries during analysis.
    • There is no cross-reference information between data from different operational databases, e.g. between the financial and the operational database.
    • Historical data may not be available in the operational database for trend analysis.
    • A data warehouse has data description and data browsing facilities; an operational database typically does not.
    • Operational data changes over time.

    With a data warehouse, users could, for example, ask the following questions:

    • What is the ROI for a new type of distribution mechanism?
    • When are beer buyers most likely to also buy snacks?
    • How likely are we to meet our fourth quarter projections?
    • What is our growth rate in the southwest versus competitor X?
    • What is the financial and operational information for a geographical area?

    A data warehouse can give business people easy access to information that helps them increase revenue, profit, customer satisfaction, savings, and market share. The system can be used by different departments in the organization.

    3.0 Development Steps

    The development steps for a data warehouse project are similar to those for other information systems. The following outlines some key steps during development. The outline is divided into three sections: planning & design, building & testing, and roll out & maintenance.

    Planning and Design

    • Business drivers
    • Objectives
    • User needs
    • User and sponsor expectations
    • Application orientation
    • Data sources
    • Data quality
    • Whether to build a data warehouse or a data mart
    • Project risk
    • Budget plan
    • Time frame
    • Cost benefit analysis
    • Project team composition (DA, DBA, OLAP development, GUI development, query development, report development, user training, network management, system integration)
    • Logical and physical data model (depends on access & usage)

    The logical and physical data model design depends on the data access and usage. At the planning & design phase, the data model is only a preliminary design.

    Building and Testing

    • Hardware, software, transformation software, middleware, OLAP software, system management software
    • Network infrastructure and management
    • Connect to source databases (flat file, ongoing connection, direct access)
    • Summarize or aggregate data
    • Prototyping
    • Data mining to find data patterns

    Some data extraction & transformation software is very useful for developing data transformation routines. These tools are useful in both the construction and the maintenance phases. In addition, system management software can control the processes that extract data from other database systems into the data warehouse.

    Roll out & Maintenance

    • System growth
    • Performance management
    • System maintenance
    • Security
    • Backup, recovery
    • Update data

    Risk management is important to the success of a data warehouse project.

    Some of the project risks are:

    • Technology risk: e.g. technology that is new to the marketplace, technology that is new to the organization, coexisting technologies, etc.
    • Complexity risk: e.g. complex data model and database processes, business process change, mission critical requirements, large number of installations, distributed system, data re-modeling required for legacy systems, etc.
    • Integration risk: e.g. integration with other information systems, real time requirements for the interfaces, etc.
    • Project team risk: e.g. team member experience, business user involvement, etc.

    4.0 Architecture

    A data warehouse reads source data from different database systems in the organization. The source databases are usually operational databases. The following is one possible logical architecture for a data warehouse:

    Figure 1: Data warehouse architecture

    A data warehouse reads data from multiple operational databases. The data is cleaned, transformed, or aggregated. The data is either updated or inserted into the data warehouse, depending on the trend analysis requirement. In addition, cross-reference data is generated based on the new data; for example, accounting data from the accounting database needs to be cross-referenced with the facility data from the facility database.

    Depending on the data model, the amount of data, and the particular query, performance can be a problem for a data warehouse system. Some tables in the data warehouse can contain millions of entries, and queries against these tables can take a long time, for example when a query performs aggregate operations to summarize the data. If the query needs many join operations or sub-queries, performance will be even slower. These long queries can be run overnight to minimize the performance impact on end users. Some of them can be sped up by modifying the data model or tuning the database.

    Scalability is an important consideration in choosing software, hardware, and system architecture for the data warehouse. Both the database size and the number of users can increase substantially over time, and the software and hardware must be scalable to support the new requirements.

    There are different types of database management system, such as relational, object-oriented, and hierarchical database systems. A relational database is usually the choice for implementing a data warehouse, for the following reasons:

    • Relational databases are the most commonly used database systems in commercial environments. Many developers already have experience with relational database products, which reduces the learning curve.
    • Most operational database systems are built on relational databases. If the data warehouse is also relational, the data conversion process between the operational database and the data warehouse can be simpler. Also, there are many database products that enable direct data transfer between relational database systems.
    • Relational databases are more mature than other types of database system in terms of scalability, stability, and efficiency.
    • Relational databases have fewer proprietary functions than other database systems, which increases platform independence.

    Object-oriented databases are sometimes used in database applications because they have a richer set of constructs to represent the data model. For example, a hierarchical data structure can be represented better than in a relational database. Object-oriented databases also provide better integration between data and functions, so they suit applications that have both complex data structures and complex functions (e.g. CAD and simulation applications).

    4.1 Differences Between Data Warehouse and Data Mart

    A data mart has functions similar to those of a data warehouse, except that it is much smaller and has a smaller group of users. For example, a department can design a data mart that is tailored to its specific needs. The data mart can contain additional domain-specific information for the department. A data mart takes less time and money to build, and its design can be more flexible.

    Some software products can merge multiple data marts into a data warehouse so that the data can be shared by the entire organization. The software provides data management capabilities that extract a subset of the data from the data marts to form the data warehouse. Some suggest that this approach to data warehouse development is more realistic because it builds the data warehouse step by step.

    Figure 2: Data mart architecture

    5.0 Data Source and Data Extraction

    A data warehouse reads data from multiple data sources. These data sources are usually operational databases such as an accounting information database, financial information database, facility information database, ERP (Enterprise Resource Planning, e.g. SAP), operational information database, research & engineering database, GIS (Geographical Information System), etc.

    Other, external data sources can be industry data, economic data, credit data, commodity (raw material) data, meteorological data, competitor-related data, demographic data, etc.

    Depending on the business requirements and the types of data, the data loading frequency can be just once, once a day, once a week, or once a month. Once a day is the most common. Ongoing data and system administration work is required to maintain the data warehouse.

    There are different ways to implement data extraction processes, depending on the requirements and the technical environment. Some implementations require more maintenance effort than others. The following lists some implementation methods for the data extraction process:

    • The data can be extracted into an ASCII report file. The file can be in fixed-width or CSV format. The ASCII report file is generated through a standard report function on the operational database system. In some situations, a custom report function is developed.
    • The data can be extracted directly from the source database system. The source database can create a single database view that contains all the necessary information. With this database view, the data transformation process can request the information directly and load the data into the data warehouse.

    The data loading process can encounter errors. The problems can be referential integrity errors, data format errors, data range errors, or other data quality errors. In these situations, the source data has to be corrected before it can be loaded into the data warehouse.

    Depending on the data source, the source database system may need to be re-modeled in order to produce the data required by the data warehouse. This data re-modeling work can take a lot of time.
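    The error classes named above can be illustrated with a minimal sketch that screens a CSV extract file before loading. The column names (facility_id, reading_date, volume), the sample rows, and the set of known facility keys are assumptions made for illustration; they do not come from any particular source system.

```python
import csv
import io

# Hypothetical extract file in CSV format; column names are assumptions.
EXTRACT = """facility_id,reading_date,volume
F001,1998-03-01,120.5
F002,03/02/1998,98.0
F003,1998-03-03,-4.0
F999,1998-03-04,55.0
"""

KNOWN_FACILITIES = {"F001", "F002", "F003"}  # keys already in the warehouse


def check_row(row):
    """Return a list of problems found in one extracted row."""
    problems = []
    if row["facility_id"] not in KNOWN_FACILITIES:
        problems.append("referential integrity: unknown facility")
    parts = row["reading_date"].split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        problems.append("format: date is not YYYY-MM-DD")
    try:
        if float(row["volume"]) < 0:
            problems.append("range: negative volume")
    except ValueError:
        problems.append("format: volume is not a number")
    return problems


def split_rows(text):
    """Separate loadable rows from rows that need correction first."""
    good, rejected = [], []
    for row in csv.DictReader(io.StringIO(text)):
        problems = check_row(row)
        if problems:
            rejected.append((row["facility_id"], problems))
        else:
            good.append(row)
    return good, rejected


good_rows, rejected_rows = split_rows(EXTRACT)
```

    Only the first sample row passes all three checks; the rejected rows carry the reason for rejection so that the source data can be corrected and reloaded.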

    6.0 Data Modeling

    Data modeling is one of the most important steps in building a data warehouse. A data warehouse uses dimensional modeling in a relational database environment. There are two types of table: the dimension table and the fact table. A dimension table contains information that is relatively static over time. A fact table contains transaction-type information that changes over time. A fact table contains multiple foreign keys to dimension tables and has some attributes of its own.

    In comparison, entity relationship modeling has data tables, primary tables, lookup tables, characteristic tables, virtual tables, and summarized tables.

    Data modeling is a creative process, and there can be different modeling solutions for the same set of data. The purpose of data modeling is to organize data to meet business objectives and to provide good performance for database operations.

    Meta data is important information in data modeling. It is the information about the data model. For example, take the value "$5.64 sales amount": without meta data, it appears only as "5.64" and we don't know what it means. Meta data captures business rules for data such as data name, description, value range, data version, data source, and referential integrity information. The organization of meta data can be separated into a technical level and a business level. The following table describes the information to be stored in the meta data repository.

    Technical Level    Business Level

    • Data physical location
    • Data access method
    • Program and script name
    • Dependencies
    • Data transformation logic
    • Data refresh rules
    • Rules to resolve data inconsistencies
    • Rules for data derivation (i.e. aggregation)
    • Mapping data source & target
    • Valid user entries
    • Frequency of update and usage
    • Data update responsibilities
    • Data security
    • Other business rules
    • Data ownership
    • Table size estimates
    • Data access, drill down, and roll-up
    • Predefined queries and reports

    Meta data can be used as a semantic layer for users to navigate through the data warehouse without having to understand the complex physical data structure. Some meta data can be extracted from the database management system or the data modeling CASE tool.
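    As a rough illustration of how meta data gives meaning to a raw value such as "5.64", the sketch below keeps a small in-memory repository. The ColumnMeta fields, the "sales.amount" key, and the value range are hypothetical choices for this example, not a description of any specific meta data product.

```python
from dataclasses import dataclass


@dataclass
class ColumnMeta:
    """Meta data for one warehouse column (fields are illustrative)."""
    name: str
    description: str
    unit: str
    source: str        # data source the value is loaded from
    min_value: float   # lower bound of the valid value range
    max_value: float   # upper bound of the valid value range


# Hypothetical repository keyed by "table.column".
META = {
    "sales.amount": ColumnMeta(
        name="sales amount",
        description="Sale value of one transaction",
        unit="$",
        source="accounting database",
        min_value=0.0,
        max_value=1_000_000.0,
    ),
}


def describe(column, value):
    """Render a raw value with the meaning its meta data supplies."""
    meta = META[column]
    if not meta.min_value <= value <= meta.max_value:
        raise ValueError(f"{value} is outside the valid range for {column}")
    return f"{meta.unit}{value:.2f} {meta.name}"


label = describe("sales.amount", 5.64)
```

    Here the bare number 5.64 is rendered as "$5.64 sales amount", and a value outside the recorded range is rejected.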

    6.1 Star Schema

    A star schema is a relational data model. Each schema has one fact table associated with multiple dimension tables, and a data warehouse typically contains many star schemas. A star schema organizes data for end-user analysis and is easy for end users to understand. There are also many OLAP tools that support star schema analysis. Figure 3 is an example of a star schema.

    Figure 3: Star Schema

    In figure 3, the dimension tables are the Student, Instructor, Course, and Semester tables. Information in these dimension tables is relatively static over time. Each star schema can have multiple dimension tables but only one fact table. The fact table in figure 3 is the Attendance table. It contains foreign keys to the four dimension tables, and its primary key is the composite of these four foreign keys. Since the fact table contains transaction-type information and the dimension tables contain relatively static information, the fact table holds far more data than the dimension tables.

    The above data model can, for example, provide the following query results:

    • List of students in a course, a major, or a minor
    • Instructors for a course
    • Courses taught by an instructor
    • List of instructors in a faculty
    • List of students taught by an instructor
    • List of instructors that teach a student
    • Summary of a student's grade
    • Total credit obtained by a student
    • List of courses taken by a student in a semester, or a year
    • Number of students in a course
    • Number of openings in a course
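    The star schema described above can be sketched with SQLite from the Python standard library. Since figure 3 is not reproduced here, the column names, the grade attribute, and the sample rows are assumptions; only the table names and the composite primary key follow the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student    (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE instructor (instructor_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course     (course_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE semester   (semester_id INTEGER PRIMARY KEY, label TEXT);

-- Fact table: one row per student, instructor, course and semester.
CREATE TABLE attendance (
    student_id    INTEGER REFERENCES student(student_id),
    instructor_id INTEGER REFERENCES instructor(instructor_id),
    course_id     INTEGER REFERENCES course(course_id),
    semester_id   INTEGER REFERENCES semester(semester_id),
    grade         REAL,
    PRIMARY KEY (student_id, instructor_id, course_id, semester_id)
);

INSERT INTO student    VALUES (1, 'John Smith'), (2, 'Ann Lee');
INSERT INTO instructor VALUES (10, 'Dr. Wong');
INSERT INTO course     VALUES (100, 'Databases'), (101, 'Networks');
INSERT INTO semester   VALUES (1000, 'Fall 1998');
INSERT INTO attendance VALUES
    (1, 10, 100, 1000, 3.7),
    (2, 10, 100, 1000, 3.3),
    (1, 10, 101, 1000, 4.0);
""")

# Number of students in each course.
students_per_course = conn.execute("""
    SELECT c.title, COUNT(*)
    FROM attendance a JOIN course c ON c.course_id = a.course_id
    GROUP BY c.title
    ORDER BY c.title
""").fetchall()

# Courses taught by an instructor.
courses_by_instructor = conn.execute("""
    SELECT DISTINCT c.title
    FROM attendance a
    JOIN course c     ON c.course_id = a.course_id
    JOIN instructor i ON i.instructor_id = a.instructor_id
    WHERE i.name = ?
    ORDER BY c.title
""", ("Dr. Wong",)).fetchall()
```

    Each query joins the fact table to only the dimension tables it needs, which is what makes the star layout easy to browse.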

    6.2 Historical Data and Trend Analysis

    Historical data is stored in the data warehouse for trend analysis, which is a very important feature of the data warehouse. The fact table contains transaction-type information. Data is inserted into the fact table without overwriting existing data, so the fact table already captures historical information.

    For a dimension table, some data modeling changes are required to capture historical information, because the data in a dimension table is not transaction-type information. For example, based on the data model in figure 3, suppose a user wants to change a student's address.

    Method 1: Modify the address field in the Student table.
    Problem: The new address overwrites the old address.

    Method 2: Create a new record in the Student table. The new record has the same information as the old record except for the address.
    Problem: The two records share the same student ID, which violates the integrity (primary key uniqueness) of the Student table and causes a database error.

    Method 3: Create a new record in the Student table. The new record has the same information as the old record except for the address and the student ID.
    Problem: Existing references to the old student record won't have the new address information.

    To capture historical data in the dimension table, the data model has to be modified. For example, based on the data model in figure 3, the relationship between the Attendance table and the Student table becomes:

    Figure 4: New relationship between the Attendance table and the Student table to capture historical data

    With the data model in figure 4, the student's address can be changed by inserting a new record into the Student table. The old and new records share the same Student ID but have different Student Entity IDs and effective dates:

    Field                  Old record          New record
    Student Entity ID      123                 478
    Student ID             981215              981215
    Student Name           John Smith          John Smith
    Major                  Computer science    Computer science
    Minor                  (NULL)              (NULL)
    Address                1210 10 Ave. SW     1215 15 Ave. SW
    Phone                  456-7815            456-7815
    Effective From Date    Sept 1, 96          Sept 1, 98
    Effective To Date      August 31, 98       (NULL)
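    The old and new records above can be produced with a sketch like the following, which closes the current record and inserts a new version. The column names, the ISO date format, and the change_address helper are assumptions for illustration; the key values and addresses are taken from the table above.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (
    student_entity_id INTEGER PRIMARY KEY,  -- surrogate key used by facts
    student_id        INTEGER NOT NULL,     -- business key, may repeat
    name              TEXT,
    address           TEXT,
    effective_from    TEXT NOT NULL,
    effective_to      TEXT                  -- NULL marks the current record
);
INSERT INTO student VALUES
    (123, 981215, 'John Smith', '1210 10 Ave. SW', '1996-09-01', NULL);
""")


def change_address(conn, student_id, new_entity_id, new_address, start):
    """Close the current record and insert a new version of the student."""
    old_entity_id, name = conn.execute(
        "SELECT student_entity_id, name FROM student "
        "WHERE student_id = ? AND effective_to IS NULL",
        (student_id,),
    ).fetchone()
    end_of_old = (start - timedelta(days=1)).isoformat()
    with conn:  # commit both statements together
        conn.execute(
            "UPDATE student SET effective_to = ? WHERE student_entity_id = ?",
            (end_of_old, old_entity_id),
        )
        conn.execute(
            "INSERT INTO student VALUES (?, ?, ?, ?, ?, NULL)",
            (new_entity_id, student_id, name, new_address, start.isoformat()),
        )


change_address(conn, 981215, 478, "1215 15 Ave. SW", date(1998, 9, 1))
history = conn.execute(
    "SELECT student_entity_id, address, effective_from, effective_to "
    "FROM student WHERE student_id = ? ORDER BY effective_from",
    (981215,),
).fetchall()
```

    Fact rows recorded before the change keep pointing at entity 123, so the old address remains available for historical analysis, while new fact rows reference entity 478.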

    6.3 Snowflake Schema

    A snowflake schema is similar to a star schema, but it normalizes the dimension tables to save data storage space. It can be used to represent hierarchies of information.

    Figure 5: Snowflake Schema

    The Student table is normalized to contain foreign keys to the Major and Minor tables. The relationship from the Student table to the Major table is many-to-one. In other situations, if the relationship is many-to-many, this creates a chain of tables for the dimension. This makes the data model more difficult for end users to use and understand, and the additional joins can decrease browsing performance.

    In addition, the storage space saved in the dimension tables is not significant in comparison to the size of the fact table, which is usually many times larger than the dimension tables.
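    The extra join that a snowflake schema introduces can be seen in a small SQLite sketch. The column names, the simplified Attendance table, and the sample rows are assumptions, since figure 5 is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE major (major_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE minor (minor_id INTEGER PRIMARY KEY, name TEXT);

-- Normalized dimension: the major name is stored once, in the major table.
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT,
    major_id   INTEGER REFERENCES major(major_id),  -- many students, one major
    minor_id   INTEGER REFERENCES minor(minor_id)   -- NULL when no minor
);

-- Simplified fact table for the example.
CREATE TABLE attendance (
    student_id INTEGER REFERENCES student(student_id),
    course     TEXT
);

INSERT INTO major VALUES (1, 'Computer science'), (2, 'Physics');
INSERT INTO student VALUES
    (1, 'John Smith', 1, NULL),
    (2, 'Ann Lee',    2, NULL),
    (3, 'Raj Patel',  1, NULL);
INSERT INTO attendance VALUES
    (1, 'Databases'), (2, 'Databases'), (3, 'Databases'), (1, 'Networks');
""")

# Enrolments per major: the fact table reaches the major name only through
# the student table, so the query needs two joins instead of one.
enrolments_per_major = conn.execute("""
    SELECT m.name, COUNT(*)
    FROM attendance a
    JOIN student s ON s.student_id = a.student_id
    JOIN major m   ON m.major_id = s.major_id
    GROUP BY m.name
    ORDER BY m.name
""").fetchall()
```

    In a star schema the major name would be a plain column of the Student table and the second join would disappear, at the cost of repeating the name in every student row.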

    6.4 Information Grouping for Analysis

    Information can be grouped for data analysis. The grouping information can come from the original data source or from the end user. For example, the original data source contains geographical information for each facility, which allows facilities to be grouped by geographical area. The index to book value information for each facility then allows the total book value to be calculated for the geographical area.

    End users can provide other custom grouping information. Storage space and a user interface are needed for end users to maintain this type of grouping information.
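    The two sources of grouping information can be sketched as follows. The facility records, area names, and the "Pilot sites" user group are hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical facility records from the source system.
FACILITIES = [
    {"facility": "F001", "area": "North", "book_value": 120_000},
    {"facility": "F002", "area": "North", "book_value": 80_000},
    {"facility": "F003", "area": "South", "book_value": 50_000},
]

# Hypothetical custom grouping maintained by an end user.
USER_GROUPS = {"F001": "Pilot sites", "F003": "Pilot sites"}


def total_book_value(facilities, group_of):
    """Sum book value per group; group_of maps a record to its group name."""
    totals = defaultdict(int)
    for record in facilities:
        group = group_of(record)
        if group is not None:  # facilities outside any group are skipped
            totals[group] += record["book_value"]
    return dict(totals)


# Grouping that comes from the original data source.
by_area = total_book_value(FACILITIES, lambda r: r["area"])

# Grouping that comes from the end user.
by_user_group = total_book_value(
    FACILITIES, lambda r: USER_GROUPS.get(r["facility"])
)
```

    The same aggregation serves both cases; only the mapping from a facility to its group changes.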

    6.5 Summary Information

    Some information is summarized before it is loaded into the data warehouse. This depends on the level of detail required by the user. For example, a supermarket may have a few thousand transactions each day. These transactions can be summarized by product before they are loaded into the data warehouse.
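    A minimal sketch of this kind of pre-load summarization is shown below. The transaction layout (date, product, quantity, amount) and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical point-of-sale transactions: (date, product, quantity, amount).
TRANSACTIONS = [
    ("1998-03-01", "beer", 2, 7.00),
    ("1998-03-01", "snacks", 1, 2.50),
    ("1998-03-01", "beer", 1, 3.50),
    ("1998-03-02", "beer", 4, 14.00),
]


def summarize(transactions):
    """Collapse transactions into one row per (date, product)."""
    totals = defaultdict(lambda: [0, 0.0])
    for day, product, quantity, amount in transactions:
        totals[(day, product)][0] += quantity
        totals[(day, product)][1] += amount
    return [
        (day, product, quantity, amount)
        for (day, product), (quantity, amount) in sorted(totals.items())
    ]


summary_rows = summarize(TRANSACTIONS)
```

    Four transactions become three summary rows here; with thousands of transactions per day the reduction in loaded rows is much larger, at the cost of losing per-transaction detail.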

    6.6 Cross Reference Information

    Cross-reference information between data from different databases is very important for data analysis. For example, financial information and operational information can come from two different database systems. The financial database contains cost and revenue information for each facility. The operational database contains operation information for each facility. Cross-reference information between these two database systems enables cost analysis of operational activity.
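    As a sketch of how a cross-reference enables this kind of cost analysis, assume each system identifies facilities with its own key. The identifiers, cost figures, and unit counts below are hypothetical.

```python
# Hypothetical extracts; each system uses its own facility identifier.
FINANCIAL = {"ACC-17": {"cost": 4000.0}, "ACC-22": {"cost": 9000.0}}
OPERATIONAL = {"PLANT-A": {"units": 200}, "PLANT-B": {"units": 300}}

# Cross-reference built in the warehouse: operational id -> financial id.
CROSS_REFERENCE = {"PLANT-A": "ACC-17", "PLANT-B": "ACC-22"}


def cost_per_unit(cross_reference, financial, operational):
    """Combine cost and production figures for each cross-referenced facility."""
    result = {}
    for operational_id, financial_id in cross_reference.items():
        cost = financial[financial_id]["cost"]
        units = operational[operational_id]["units"]
        result[operational_id] = cost / units
    return result


unit_costs = cost_per_unit(CROSS_REFERENCE, FINANCIAL, OPERATIONAL)
```

    Neither source system alone can answer the cost-per-unit question; the cross-reference is what links the two records for the same facility.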

    6.7 Data Model Prototype

    Prototyping is a good way to analyze the data model at an early development stage. It can demonstrate the benefit of the data warehouse strategy. To help present the data model, it can be divided into two views: the business view and the developer view. End users can use the business view to understand the system functionality.

    There are some limitations to prototyping. A prototype may not show:

    • Data migration processes
    • The system with all the data
    • Performance issues
    • Security features

    6.8 Stage Older Data

    A terabyte data warehouse requires 500 to 1000 physical disk drives, plus 100 more disk controllers. Some old data should therefore be archived as new data is loaded into the warehouse. Old data in the fact table can be archived. For example, sales data from 7 years ago may be required by the end user for analysis.
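    One way to archive old fact data is to move rows older than a cutoff date into a separate archive table, sketched below with SQLite. The sales and sales_archive tables, their columns, and the cutoff date are assumptions for illustration; a real warehouse might archive to offline storage instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales         (sale_date TEXT, product TEXT, amount REAL);
CREATE TABLE sales_archive (sale_date TEXT, product TEXT, amount REAL);
INSERT INTO sales VALUES
    ('1991-05-01', 'beer', 10.0),
    ('1997-05-01', 'beer', 12.0),
    ('1998-05-01', 'snacks', 3.0);
""")


def archive_before(conn, cutoff):
    """Move fact rows dated before cutoff (ISO date) into the archive table."""
    with conn:  # copy and delete in one transaction
        conn.execute(
            "INSERT INTO sales_archive SELECT * FROM sales WHERE sale_date < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM sales WHERE sale_date < ?", (cutoff,)
        ).rowcount
    return moved


moved = archive_before(conn, "1992-01-01")
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM sales_archive").fetchone()[0]
```

    Keeping the archived rows in a table with the same layout means they can still be queried if an analysis needs them.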

    7.0 Data Transformation and Loading

    The data transformation and loading process puts source data into the data warehouse. The programming logic is usually simple. Depending on the quality of the source data, however, the transformation and loading process can be time consuming. For example, if the source data is manually maintained, a lot of effort may be needed to clean it. Some records may contain bad data that has to be corrected before loading into the data warehouse.

    If data comes from two different database systems and the data warehouse is required to build cross-reference information between them, there can be referential integrity problems. For example, one database contains operational data for each facility and the other database contains financial data for each facility. The operational data may refer to a facility that does not exist in the financial data.

    Every time the source data definition or the target data model changes, the associated data transformation and loading process has to be modified accordingly. If there are many changes, a lot of time is required to modify the processes. There are some visual development tools that specialize in developing these data transformation and loading processes. These tools have a GUI that allows the developer to specify the data transformation logic, which makes the data transformation and loading processes easier to develop and maintain.

    There are other ways to implement the data transformation and loading processes. They can be implemented in conventional languages such as C and Cobol. Using C can achieve fast execution speed, which is necessary for some computation-intensive data warehouse processes.
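    The referential integrity problem between two source systems described in this section can be detected before loading with a check like the sketch below. The row layout, the facility_id key, and the sample data are hypothetical.

```python
# Hypothetical rows extracted from the two source systems.
FINANCIAL_ROWS = [
    {"facility_id": "F001", "cost": 4000.0},
    {"facility_id": "F002", "cost": 9000.0},
]
OPERATIONAL_ROWS = [
    {"facility_id": "F001", "units": 200},
    {"facility_id": "F003", "units": 50},
]


def find_orphans(operational_rows, financial_rows, key="facility_id"):
    """Return operational rows whose facility is missing from the financial data."""
    known = {row[key] for row in financial_rows}
    return [row for row in operational_rows if row[key] not in known]


orphans = find_orphans(OPERATIONAL_ROWS, FINANCIAL_ROWS)
```

    Rows reported as orphans can be held back for correction while the remaining rows are loaded and cross-referenced.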

    8.0 Process Control and Scheduling

    There are many data transformation and loading processes that populate the data warehouse. These processes must execute in a particular sequence and have dependencies on one another. A control process is required to coordinate them. It may need to access multiple computer systems. For example, it can start a data extraction process on another database system and transfer the data to the data warehouse server for data transformation and loading.
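    A control process that respects such dependencies can be sketched with a topological sort from the Python standard library (graphlib, available from Python 3.9). The process names and their dependencies are hypothetical, and the run callback stands in for whatever actually starts a process.

```python
from graphlib import TopologicalSorter

# Hypothetical load processes; each maps to the processes it depends on.
DEPENDENCIES = {
    "extract_accounting": set(),
    "extract_facility": set(),
    "load_dimensions": {"extract_facility"},
    "load_facts": {"extract_accounting", "load_dimensions"},
    "build_cross_reference": {"load_facts"},
}


def run_all(dependencies, run):
    """Run every process after the processes it depends on; return the order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run(name)
    return order


executed = []
order = run_all(DEPENDENCIES, executed.append)
```

    A production scheduler would also record the status of each run and stop dependent processes when one fails; this sketch only shows the ordering.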

    9.0 User Interfaces

    There are a few types of user interface for the data warehouse system. These user interfaces can be used by the system administrator or the end user. With these user interfaces, the system administrator can:

    • Maintain user accounts
    • Monitor and control data loading and transformation processes

    The end user can:

    • Analyze data (OLAP tool)
    • Create and generate reports
    • Create and execute queries
    • Maintain user input data (e.g. grouping information for data analysis)

    9.1 OLAP Tools

    An Online Analytical Processing (OLAP) tool is used for data analysis, especially with a dimensional data model. The tool provides a front-end user interface for the user to access the data warehouse. Through the tool, the user can perform data analysis and design custom reports or queries. Users can perform joins, aggregations, sorts, roll-up, and drill-down (roll-down) on the data.

    Drill-down is done by adding row headers from the dimension tables. Roll-up is done by subtracting row headers.

    Security features can be implemented with database views. A view is a logical table derived from the physical tables in the database. It provides a logical layer through which the user accesses the physical tables. For example, Employee is a physical table with the following attributes:

    Employee        View A    View B
    First Name      X         X
    Last Name       X         X
    Department      X         X
    Position        X         X
    Phone Number    X         X
    Age             X
    Salary          X

    The Employee table has both View A and View B. View A can access all attributes in the Employee table. View B can access all attributes except Age and Salary. View A is used by managers in the company. View B is used by all other users.
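    The two views can be defined as in the following SQLite sketch. The column names, view names, and sample row are assumptions. SQLite has no user accounts or permissions, so the sketch shows only the view definitions; in a multi-user database system, each group of users would additionally be given access to its view rather than to the Employee table itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    first_name TEXT, last_name TEXT, department TEXT,
    position TEXT, phone_number TEXT, age INTEGER, salary REAL
);
INSERT INTO employee VALUES
    ('John', 'Smith', 'Sales', 'Analyst', '456-7815', 41, 52000.0);

-- View A: all attributes (for managers).
CREATE VIEW view_a AS SELECT * FROM employee;

-- View B: everything except age and salary (for all other users).
CREATE VIEW view_b AS
    SELECT first_name, last_name, department, position, phone_number
    FROM employee;
""")


def columns(conn, relation):
    """Return the column names a query against the relation exposes."""
    cursor = conn.execute(f"SELECT * FROM {relation}")
    return [description[0] for description in cursor.description]


view_a_columns = columns(conn, "view_a")
view_b_columns = columns(conn, "view_b")
```

    A query against view_b cannot return age or salary, because those columns are not part of the view.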

    9.2 Query, Report, and Application

    The data warehouse can have predefined queries and reports. With some reporting tools, users can access the reports through the company intranet. Users can subscribe to a predefined report, and the report can be pre-generated to save both user and system time.

    Applications can be developed for the data warehouse. These applications are for data analysis, not for transaction processing that updates the data in the data warehouse.

    9.3 System Administration and Maintenance

    Some data is manually maintained in the data warehouse. This can be system-related data that the data warehouse needs in order to operate, and it is usually maintained by the system administrator. For example, the data warehouse holds information about all the data loading processes. A scheduling program can use this information to execute the data loading processes, and the execution status can be stored in the data warehouse for process tracking.

    The system administrator can also maintain information about user accounts and access privileges. A user interface can be developed for the administrator to maintain this information.

    Some lookup data and grouping data are also manually maintained. This data is used for data analysis.

    10.0 Conclusion

    A data warehouse is a good solution for storing and analyzing large amounts of data. It reads data from multiple operational databases on an ongoing basis. Cross-reference information is generated between the data from the different databases. The data model is designed to provide good browsing performance to the end user. A data warehouse can be seen as a centralized data repository that provides both current and historical data to the end user.

