Recent Developments in Data Warehousing: A Tutorial
Hugh J. Watson
Terry College of Business
University of Georgia
http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt
Tutorial Objectives
Provide an overview of data warehousing
Provide materials to support the teaching of data warehousing
Discuss recent developments in data warehousing
Topics Covered
Definitions and concepts
The data mart and enterprise-wide data warehouse strategies
Data extraction, cleansing, transformation and loading
Meta data
Data stores
Online analytical processing (OLAP)
Warehouse users, tools, and applications
Case study: Harrah’s Entertainment
The Importance of Data Warehousing
Provide a “single version of the truth”
Improve decision making
Support key corporate initiatives such as performance management, B2C and B2B e-commerce, and customer relationship management
Estimated to be a $113.5 billion market in 2002 for systems, software, services, and in-house expenditures (Palo Alto Management Group)
A Simple Definition
A data warehouse is a collection of data created to support decision-making applications.
Data Warehouse Characteristics
Subject oriented -- data are organized around sales, products, etc.
Integrated -- data are integrated to provide a comprehensive view
Time variant -- historical data are maintained
Nonvolatile -- data are not updated by users
Another Definition
Data warehousing is the entire process of data extraction, transformation, and loading of data to the warehouse and the access of the data by end users and applications.
Data Mart
A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications.
An independent data mart is created directly from source systems.
A dependent data mart is populated from a data warehouse.
Operational Data Store
An operational data store consolidates data from multiple source systems and provides a near real-time, integrated view of volatile, current data.
Its purpose is to provide integrated data for operational purposes. It has add, change, and delete functionality.
It may be created to avoid a full-blown ERP implementation.
[Figure: data warehousing architecture, shown as a left-to-right flow]
Data Sources: transaction data from functional areas (Prod, Mkt, HR, Fin, Acctg) held in IBM IMS, VSAM, Oracle, and Sybase; other internal data (ERP: SAP; clickstream: Informix); Web data; external data (demographic: Harte-Hanks)
ETL Software: extract, clean/scrub, transform, and load, with a staging area and an operational data store (vendors shown: Ascential, Sagent, SAS, Firstlogic, Informatica)
Data Stores: data warehouse, meta data, and data marts for finance, marketing, and sales (vendors shown: Teradata, IBM, Essbase, Microsoft)
Data Analysis Tools and Applications: SQL; queries, reporting, DSS/EIS, and data mining; Web browser (vendors shown: Cognos, SAS, MicroStrategy, Siebel, BusinessObjects)
Users: analysts, managers, executives, operational personnel, customers/suppliers
Two Data Warehousing Strategies
Enterprise-wide warehouse, top down, the Inmon methodology
Data mart, bottom up, the Kimball methodology
When properly executed, both result in an enterprise-wide data warehouse
The Data Mart Strategy
The most common approach
Begins with a single mart; architected marts are added over time for more subject areas
Relatively inexpensive and easy to implement
Can be used as a proof of concept for data warehousing
Can perpetuate the “silos of information” problem
Can postpone difficult decisions and activities
Requires an overall integration plan
The Enterprise-wide Strategy
A comprehensive warehouse is built initially
An initial dependent data mart is built using a subset of the data in the warehouse
Additional data marts are built using subsets of the data in the warehouse
Like all complex projects, it is expensive, time-consuming, and prone to failure
When successful, it results in an integrated, scalable warehouse
Data Sources and Types
Primarily from legacy, operational systems
Almost exclusively numerical data at the present time
External data may be included, often purchased from third-party sources
Technology exists for storing unstructured data, and this is expected to become more important over time
Extraction, Transformation, and Loading (ETL) Processes
The “plumbing” work of data warehousing
Data are moved from source to target databases
A very costly, time-consuming part of data warehousing
Recent Development: More Frequent Updates
Updates can be done in bulk and trickle modes
Business requirements, such as trading partner access to a Web site, require current data
For international firms, there is no good time to load the warehouse
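The difference between bulk and trickle updates can be sketched in a few lines. This is a minimal illustration rather than anything from the tutorial: the `Warehouse` class and its `bulk_load` and `trickle_load` methods are hypothetical names, and an in-memory list stands in for a warehouse table.

```python
from dataclasses import dataclass, field


@dataclass
class Warehouse:
    """Hypothetical stand-in for a warehouse table (an in-memory list)."""
    rows: list = field(default_factory=list)
    load_operations: int = 0  # counts how many separate loads were run

    def bulk_load(self, batch):
        """Bulk mode: apply a whole batch of accumulated changes in one load."""
        self.rows.extend(batch)
        self.load_operations += 1

    def trickle_load(self, record):
        """Trickle mode: apply a single change as soon as it arrives."""
        self.rows.append(record)
        self.load_operations += 1


# Made-up sample changes captured from a source system
changes = [{"order_id": i, "amount": 10.0 * i} for i in range(1, 4)]

nightly = Warehouse()
nightly.bulk_load(changes)       # one load; data is current as of the batch cutoff

continuous = Warehouse()
for change in changes:           # one load per change; data stays current
    continuous.trickle_load(change)
```

Both modes end with the same rows. They differ in how many load operations run and in how stale the warehouse can be between loads.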
Recent Development: Clickstream Data
Results from clicks at web sites
A dialog manager handles user interactions. An ODS helps to custom tailor the dialog
The clickstream data is filtered and parsed and sent to a data warehouse where it is analyzed
Software is available to analyze the clickstream data
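The filter-and-parse step for clickstream data might look like the sketch below. It assumes the web server writes lines in Common Log Format, which the tutorial does not specify. The function name `parse_click`, the list of ignored file suffixes, and the sample log lines are all illustrative.

```python
import re

# Assumed input: web server lines in Common Log Format (not specified in the tutorial).
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+$'
)
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def parse_click(line):
    """Parse one log line into a dict, or return None if it should be filtered out."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None                      # malformed line
    click = match.groupdict()
    if click["status"] != "200":
        return None                      # keep only successful requests
    if click["path"].lower().endswith(IGNORED_SUFFIXES):
        return None                      # drop embedded images, styles, scripts
    return click


# Made-up sample log lines
raw_log = [
    '10.0.0.1 - - [15/Jan/2001:10:00:00 -0500] "GET /products/tv.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [15/Jan/2001:10:00:01 -0500] "GET /images/logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [15/Jan/2001:10:00:05 -0500] "GET /missing.html HTTP/1.0" 404 0',
]
clicks = [c for c in map(parse_click, raw_log) if c is not None]
```

Only the page view survives the filter. The image request and the failed request are dropped before anything is sent to the warehouse.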
Data Extraction
Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated meta data)
Sometimes source data is copied to the target database using the replication capabilities of a standard RDBMS (not recommended because of “dirty data” in the source systems)
Increasingly performed by specialized ETL software
Sample ETL Tools
Teradata Warehouse Builder from Teradata
DataStage from Ascential Software
SAS System from SAS Institute
PowerMart/PowerCenter from Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite from Hummingbird Communications
Reasons for “Dirty” Data
Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys, Non-Unique Identifiers
Data Integration Problems
Data Cleansing
Source systems contain “dirty data” that must be cleansed
ETL software contains rudimentary data cleansing capabilities
Specialized data cleansing software is often used. Important for performing name and address correction and householding functions
Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)
Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
Examples include parsing the first, middle, and last name; street number and street name; and city and state.
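A deliberately naive sketch of the parsing step follows. The function names (`parse_name`, `parse_street`, `parse_city_state`) and the splitting rules are assumptions made for illustration; commercial cleansing tools handle far more variation than whitespace and comma splitting can.

```python
def parse_name(full_name):
    """Split a free-form name into first, middle, and last components."""
    parts = full_name.split()
    if not parts:
        return {"first": "", "middle": "", "last": ""}
    if len(parts) == 1:
        return {"first": parts[0], "middle": "", "last": ""}
    # Everything between the first and last word is treated as the middle name
    return {"first": parts[0], "middle": " ".join(parts[1:-1]), "last": parts[-1]}


def parse_street(address_line):
    """Split an address line into a leading street number and the street name."""
    number, _, name = address_line.strip().partition(" ")
    if not number.isdigit():
        # No leading number: keep the whole line as the street name
        return {"street_number": "", "street_name": address_line.strip()}
    return {"street_number": number, "street_name": name}


def parse_city_state(text):
    """Split 'City, ST' into city and state."""
    city, _, state = text.partition(",")
    return {"city": city.strip(), "state": state.strip()}
```

For example, `parse_name("John Q. Public")` isolates `John`, `Q.`, and `Public` into separate target fields.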
Correcting
Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
Examples include replacing a vanity address and adding a zip code.
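Adding a missing zip code from a secondary data source can be sketched as a simple lookup. The `ZIP_LOOKUP` table, its placeholder values, and the function name `add_zip_code` are hypothetical; a real system would consult a postal reference file.

```python
# Hypothetical secondary data source: (city, state) -> ZIP code.
# The entry below is a placeholder, not real postal data.
ZIP_LOOKUP = {("SPRINGFIELD", "XX"): "00000"}


def add_zip_code(record, zip_lookup):
    """Return a copy of the record with a missing ZIP code filled from the lookup."""
    corrected = dict(record)             # leave the source record untouched
    if not corrected.get("zip"):
        key = (corrected["city"].upper(), corrected["state"].upper())
        corrected["zip"] = zip_lookup.get(key, "")   # stays empty if unknown
    return corrected


fixed = add_zip_code({"city": "Springfield", "state": "xx", "zip": ""}, ZIP_LOOKUP)
```

Records that already carry a zip code pass through unchanged, so the correction only fills gaps.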
Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
Examples include adding a pre name, replacing a nickname, and using a preferred street name.
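Two of the standardizing examples, replacing a nickname and using a preferred street name, can be sketched with lookup tables. The tables `NICKNAMES` and `STREET_WORDS` are small illustrative samples standing in for business rules, and the function names are assumptions.

```python
# Illustrative conversion tables; real rule sets are much larger.
NICKNAMES = {"bill": "William", "bob": "Robert", "beth": "Elizabeth"}
STREET_WORDS = {"st": "Street", "st.": "Street", "ave": "Avenue", "ave.": "Avenue"}


def standardize_first_name(name):
    """Replace a known nickname with its formal name; otherwise just trim spaces."""
    cleaned = name.strip()
    return NICKNAMES.get(cleaned.lower(), cleaned)


def standardize_street(street):
    """Expand street abbreviations word by word into the preferred form."""
    # Caveat: this simple rule would also expand "St" when it means "Saint".
    return " ".join(STREET_WORDS.get(word.lower(), word) for word in street.split())
```

After this step, `Bill` and `William` compare as the same first name, which is what makes the later matching step reliable.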
Matching
Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.
Examples include identifying similar names and addresses.
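One simple way to identify similar names and addresses is a string-similarity score with a cutoff. The sketch below uses Python's standard `difflib.SequenceMatcher`. The 0.85 threshold, the function names, and the sample records are assumptions chosen for illustration, not a rule from the tutorial or from any cleansing product.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(records, threshold=0.85):
    """Return index pairs whose name and address are both similar enough."""
    pairs = []
    for (i, left), (j, right) in combinations(enumerate(records), 2):
        if (similarity(left["name"], right["name"]) >= threshold
                and similarity(left["address"], right["address"]) >= threshold):
            pairs.append((i, j))
    return pairs


# Made-up sample records
records = [
    {"name": "William Smith", "address": "123 Main Street"},
    {"name": "William Smyth", "address": "123 Main Street"},
    {"name": "Mary Jones", "address": "9 Oak Avenue"},
]
matches = find_matches(records)
```

Comparing every pair is quadratic in the number of records, so this only suits small sets; it is meant to show the idea of a predefined matching rule, not a scalable design.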
Consolidating
Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
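Merging matched records into one representation needs a survivorship rule. The sketch below assumes the simplest one, keeping the first non-empty value seen for each field. That rule, the function name `consolidate`, and the sample data are illustrative assumptions.

```python
def consolidate(matched_records):
    """Merge matched records into one, keeping the first non-empty value per field."""
    merged = {}
    for record in matched_records:
        for field_name, value in record.items():
            # Fill a field only if we have a value and nothing was kept yet;
            # fields that are empty in every record are left out of the result.
            if value and not merged.get(field_name):
                merged[field_name] = value
    return merged


# Made-up duplicates for one customer
duplicates = [
    {"name": "William Smith", "phone": "", "zip": "00000"},
    {"name": "William Smith", "phone": "555-0100", "zip": ""},
]
customer = consolidate(duplicates)
```

The merged record carries the phone number from one duplicate and the zip code from the other.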
Data Staging
Often used as an interim step between data extraction and later steps
Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes
At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse
There is usually no end user access to the staging file
An operational data store may be used for data staging
Data Transformation
Transforms the data in accordance with the business rules and standards that have been established
Examples include format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
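The transformation types listed above can be shown on a tiny set of order rows. Everything in this sketch is assumed for illustration: the column names, the `GENDER_CODES` table, the choice of `order_id` as the deduplication key, and the sample data.

```python
from collections import defaultdict
from datetime import datetime

GENDER_CODES = {"M": "Male", "F": "Female"}   # replacement of codes (illustrative)


def transform(rows):
    """Apply format changes, deduplication, field splitting, code replacement,
    and a derived value to source rows."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["order_id"] in seen:                        # deduplication
            continue
        seen.add(row["order_id"])
        first, _, last = row["customer"].partition(" ")    # splitting up a field
        order_date = datetime.strptime(row["order_date"], "%m/%d/%Y").date()
        cleaned.append({
            "order_id": row["order_id"],
            "order_date": order_date.isoformat(),          # format change
            "first_name": first,
            "last_name": last,
            "gender": GENDER_CODES.get(row["gender"], "Unknown"),
            "revenue": row["units"] * row["unit_price"],   # derived value
        })
    return cleaned


def total_revenue_by_date(cleaned):
    """Aggregate: sum revenue per order date."""
    totals = defaultdict(float)
    for row in cleaned:
        totals[row["order_date"]] += row["revenue"]
    return dict(totals)


source_rows = [
    {"order_id": 1, "order_date": "01/15/2001", "customer": "Pat Lee",
     "gender": "F", "units": 2, "unit_price": 300.0},
    {"order_id": 1, "order_date": "01/15/2001", "customer": "Pat Lee",
     "gender": "F", "units": 2, "unit_price": 300.0},   # duplicate row
    {"order_id": 2, "order_date": "01/15/2001", "customer": "Sam Roe",
     "gender": "M", "units": 1, "unit_price": 300.0},
]
cleaned_rows = transform(source_rows)
daily_totals = total_revenue_by_date(cleaned_rows)
```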
Data Loading
Data are physically moved to the data warehouse
The loading takes place within a “load window”
The trend is to near real time updates of the data warehouse as the warehouse is increasingly used for operational applications
Meta Data
Data about data
Needed by both information technology personnel and users
IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.
Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.
Recent Development: Meta Data Integration
A growing realization that meta data is critical to data warehousing success
Progress is being made on getting vendors to agree on standards and to incorporate the sharing of meta data among their tools
Vendors like Microsoft, Computer Associates, and Oracle have entered the meta data marketplace with significant product offerings
Database Vendors
High end (i.e., terabyte plus) vendors include NCR-Teradata (Teradata) and IBM (DB2)
Oracle (8i) and Microsoft (SQL Server 7) are major players for smaller databases
On-line Analytical Processing (OLAP)
A set of functionality that facilitates multidimensional analysis
Allows users to analyze data in ways that are natural to them
Comes in many varieties -- ROLAP, MOLAP, DOLAP, etc.
ROLAP
Relational OLAP
Uses an RDBMS to implement an OLAP environment
Typically involves a star schema to provide the multidimensional capabilities
OLAP tool manipulates RDBMS star schema data
Called “slowlap” by MOLAP vendors
MOLAP
Multidimensional OLAP
Uses a MDDBS (e.g., Essbase) to store and access data
Usually requires proprietary (non SQL) data access tools
Provides exceptionally fast response times
Star Schema
Creates non-normalized data structures
Easier for users to understand
Optimized for OLAP
Uses fact tables (facts or measures in the business) and dimension tables (which establish the context of the facts)
OLAP Tools
Products come from vendors such as Brio, Cognos, Hyperion, and BusinessObjects
Typically available as a fat or thin (i.e., browser) client
In a web environment, the browser communicates with a web server, which talks to an application server, which connects to backend databases
The application server provides query, reporting, and OLAP analysis functionality over the web
Java applets or downloaded components augment the thin client
A broadcast server may be used to schedule, run, publish, and broadcast reports, alerts, and responses over the LAN, email, or personal digital assistant.
Star Schema
Fact table
Claim: # Physician ID, # Patient ID, # Service Code, # Payer ID, # Claim Number, # Line Item Number, # Claim Date, Date of Services, Amount of Charge, Unit of Services
Dimension tables
Service: # Service Code, Service Description, # Category Code
Time Periods: # Claim Date, Year, Month, Quarter, Week
Payer: # Payer ID, Name, Address, Phone Number, EDI Number
Patient: # Patient ID, Patient Name, Address, Age, Sex, Insurance ID
Physician: # Physician ID, Physician Name, Specialty ID, Credential ID
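A cut-down version of the claims star schema can be built and queried with Python's built-in `sqlite3` module. This is a sketch under assumptions: only two of the five dimensions are included, column types are guessed, snake_case column names are substituted for the slide's labels, and all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (subset of the columns on the slide; types are assumed)
    CREATE TABLE payer   (payer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE service (service_code TEXT PRIMARY KEY, service_description TEXT);

    -- Fact table: one row per claim line item, with foreign keys to the dimensions
    CREATE TABLE claim (
        claim_number     INTEGER,
        line_item_number INTEGER,
        payer_id         INTEGER REFERENCES payer(payer_id),
        service_code     TEXT REFERENCES service(service_code),
        claim_date       TEXT,
        amount_of_charge REAL,
        PRIMARY KEY (claim_number, line_item_number)
    );
""")
conn.executemany("INSERT INTO payer VALUES (?, ?)", [(1, "Payer A"), (2, "Payer B")])
conn.executemany("INSERT INTO service VALUES (?, ?)",
                 [("S1", "Office visit"), ("S2", "X-ray")])
conn.executemany(
    "INSERT INTO claim VALUES (?, ?, ?, ?, ?, ?)",
    [
        (100, 1, 1, "S1", "2001-01-15", 80.0),
        (100, 2, 1, "S2", "2001-01-15", 120.0),
        (101, 1, 2, "S1", "2001-01-16", 80.0),
    ],
)

# Roll the fact measure up along one dimension: total charges per payer
charges_by_payer = conn.execute("""
    SELECT payer.name, SUM(claim.amount_of_charge)
    FROM claim JOIN payer ON payer.payer_id = claim.payer_id
    GROUP BY payer.name
    ORDER BY payer.name
""").fetchall()
```

Swapping `payer` for `service` in the join and `GROUP BY` gives the same measure sliced by a different dimension, which is the kind of multidimensional analysis a ROLAP tool generates against a star schema.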
Dimension Table Examples
Retail -- store name, zip code, product name, product category, day of week
Telecommunications -- call origin, call destination
Banking -- customer name, account number, branch, account officer
Insurance -- policy type, insured party
Fact Table Examples
Retail -- number of units sold, sales amount
Telecommunications -- length of call in minutes, average number of calls
Banking -- average monthly balance
Insurance -- claims amount
The Fact Table Key Concatenates the Dimension Keys
Assume that you want to know the number of television sets sold to Best Buys on January 15, 2001. The query might be:
SELECT CLIENT.CUSNAME, SALES.NOSOLD
FROM CLIENT, PRODUCT, TIME, SALES
WHERE CLIENT.CUSNAME = SALES.CUSNAME
AND PRODUCT.PRODNAME = SALES.PRODNAME
AND TIME.DATE = SALES.DATE
AND CLIENT.CUSNAME = "BEST BUYS"
AND PRODUCT.PRODNAME = "TELEVISION"
AND TIME.DATE = #01/15/2001#
(The #...# date delimiters are Microsoft Access syntax; other SQL dialects typically write the date as a quoted string or a DATE literal.)
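The query can be tried end to end with Python's built-in `sqlite3` module. This sketch makes several assumptions: the time dimension is renamed `time_dim` with a `sale_date` column to avoid names that are reserved words in some SQL dialects, dates are stored as ISO text, explicit `JOIN` syntax replaces the comma-separated `FROM` list, and all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client   (cusname TEXT PRIMARY KEY);
    CREATE TABLE product  (prodname TEXT PRIMARY KEY);
    CREATE TABLE time_dim (sale_date TEXT PRIMARY KEY);   -- ISO dates stored as text

    -- The fact table key concatenates the three dimension keys
    CREATE TABLE sales (
        cusname   TEXT REFERENCES client(cusname),
        prodname  TEXT REFERENCES product(prodname),
        sale_date TEXT REFERENCES time_dim(sale_date),
        nosold    INTEGER,
        PRIMARY KEY (cusname, prodname, sale_date)
    );

    INSERT INTO client VALUES ('BEST BUYS'), ('OTHER STORE');
    INSERT INTO product VALUES ('TELEVISION'), ('RADIO');
    INSERT INTO time_dim VALUES ('2001-01-15'), ('2001-01-16');
    INSERT INTO sales VALUES
        ('BEST BUYS', 'TELEVISION', '2001-01-15', 40),
        ('BEST BUYS', 'RADIO', '2001-01-15', 25),
        ('BEST BUYS', 'TELEVISION', '2001-01-16', 10),
        ('OTHER STORE', 'TELEVISION', '2001-01-15', 5);
""")

# Join the fact table to each dimension, then constrain all three dimensions
result = conn.execute(
    """
    SELECT client.cusname, sales.nosold
    FROM sales
    JOIN client   ON client.cusname = sales.cusname
    JOIN product  ON product.prodname = sales.prodname
    JOIN time_dim ON time_dim.sale_date = sales.sale_date
    WHERE client.cusname = ? AND product.prodname = ? AND time_dim.sale_date = ?
    """,
    ("BEST BUYS", "TELEVISION", "2001-01-15"),
).fetchall()
```

Because the three dimension keys together form the fact table's primary key, fixing all three in the `WHERE` clause selects at most one fact row.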
Warehouse Users
Analysts
Managers
Executives
Operational personnel
Customers and suppliers
Warehouse Tools and Applications
SQL queries
Managed query environments
Structured and ad hoc reports
DSS/EIS
Portals
Data mining
Packaged applications
Custom-built applications
Recent Development: Enterprise Intelligence Portals
Offers users an effective way to access information scattered across networked enterprise systems through a simple and personalized Web interface
Provides access to structured and unstructured data
Potentially integrates data warehousing and knowledge management
Harrah’s Entertainment
Harrah’s Entertainment -- data warehousing supported a successful shift to a CRM-oriented corporate strategy. Winner of the 2000 TDWI Leadership Award
Operates 21 casinos across the country
In 1993, the gaming laws changed, which allowed Harrah’s to expand
Harrah’s decided to compete using a brand strategy supported by information technology
Needed to know their customers exceptionally well
Harrah’s Data Warehousing Architecture
WINet sources data from the casino, hotel, and event systems
The patron database (PDB) serves as an operational data store
The marketing workbench (MWB) serves as the data warehouse
Sample Applications
Operational personnel use PDB to check the preferences, history, and value of customers
Analysts use PDB and MWB to create offers to visit a Harrah’s casino
Analysts use MWB to support predictive modeling efforts
[Figure: marketing cycle with the stages Define, Execute, Track, Measure, and Learn]
Define: objectives, tests, control cells
Execute: right offer, right message, right time
Track: customer treatment, customer action/non-action
Measure: profit and loss, behavior change, new test report
Learn
Other labels in the figure: predict the value of a customer; market based on that expected value; track transactions that are linked to marketing initiatives; evaluate the effectiveness; track profitability; refine marketing approaches
Customer Relationship Lifecycle
[Figure: annual revenue plotted against length of relationship, with the stages Establish, Strengthen, and Reinvigorate]
Articles
Cooper, B.L., H.J. Watson, B.H. Wixom, and D.L. Goodhue, "Data Warehousing Supports Corporate Strategy at First American Corporation," MIS Quarterly, (December 2000), pp. 547-567. Provides a case study of how the First American Corporation turned their strategy and fortunes around through the use of data warehousing.
Stoller, Wixom, and Watson, “WISDOM Provides Competitive Advantage at Owens & Minor,” (http://terry.uga.edu/~watson/owens&minor.doc) Provides a case study of how data warehousing can support supply chain integration.
Watson, Wixom, Buonamica, and Revak, “Sherwin-Williams' Data Mart Strategy: Creating Intelligence Across the Supply Chain,” Communications of ACIS, April 2001. Provides a textbook example of how to implement a data mart strategy.
Watson, H.J., D.A. Annino, B.H. Wixom, K.L. Avery, and M. Rutherford, “Current Practices in Data Warehousing,” Information Systems Management, (Winter, 2001), pp. 47-55. Provides data on companies’ data warehousing experiences, with an emphasis on the benefits being realized.
Watson, H.J. and L. Volonino, “Harrah’s High Payoff from Customer Information,” (http://www.terry.uga.edu/~hwatson/harrahs.doc) Provides a case study of how Harrah’s Entertainment has implemented a CRM strategy facilitated by data warehousing.
Books
Devlin, Data Warehouse -- Architecture to Implementation, Addison-Wesley, 1997.
Gray and Watson, Decision Support in the Data Warehouse, Prentice-Hall, 1998.
Kimball, The Data Warehouse Toolkit, Wiley, 1996.
Kimball and Merz, The Data Webhouse Toolkit, Wiley, 2000.
Inmon, Building the Operational Data Store, second edition, Wiley, 1999.
Inmon, Imhoff, and Sousa, Corporate Information Factory, Wiley, 1999.
Websites
http://www.olapreport.com (provides detailed information about the OLAP market, products, and applications)
http://www.firstlogic.com (includes an interactive demo of their data cleansing tool)
http://www.billinmon.com (a wealth of current information from “the father of data warehousing”)
http://www.metagenix.com (illustrates recent advances in ETL tools)
http://www.microstrategy.com (excellent materials from one of the leading DSS vendors)