ISQS 3358, Business Intelligence
Dimensional Modeling
Zhangxi Lin
Texas Tech University
Outline
Data Warehousing Approaches
Dimensional Modeling
Data Warehousing with Microsoft SQL Server 2005
Case: Adventure Works Cycles (AWC)
Data Warehouse Design Phases
Data Warehousing Approaches
Data Warehouse Development Approaches
Data warehouse development approaches
◦ Inmon Model: EDW approach
◦ Kimball Model: Data mart approach
Which model is better?
◦ There is no one-size-fits-all strategy to data warehousing
◦ One alternative is the hosted warehouse
General Data Warehouse Development Approaches
“Big bang” approach
Incremental approach:
◦ Top-down incremental approach
◦ Bottom-up incremental approach
ISQS 6339, Data Mgmt & BI, Zhangxi Lin 5
“Big Bang” Approach
Analyze enterprise requirements
Build enterprise data warehouse
Report in subsets or store in data marts
Incremental Approach to Warehouse Development
Multiple iterations
Shorter implementations
Validation of each phase
[Figure: Increment 1 moves iteratively through Strategy, Definition, Analysis, Design, Build, and Production]
Top-Down Approach
Analyze requirements at the enterprise level
Develop conceptual information model
Identify and prioritize subject areas
Complete a model of selected subject area
Map to available data
Perform a source system analysis
Implement base technical architecture
Establish metadata, extraction, and load processes for the initial subject area
Create and populate the initial subject area data mart within the overall warehouse framework
Bottom-Up Approach
Define the scope and coverage of the data warehouse and analyze the source systems within this scope
Define the initial increment based on the political pressure, assumed business benefit and data volume
Implement base technical architecture and establish metadata, extraction, and load processes as required by increment
Create and populate the initial subject areas within the overall warehouse framework
Dimensional Modeling
Dimensional Model
Also called star schema
◦ The fact table is in the middle, with the dimensions serving as the points on the star.
◦ A normalized fact table plus denormalized dimension tables
Facts
◦ Measurements associated with a specific business process.
◦ Most facts are additive (calculative); others are semi-additive, non-additive, or descriptive (e.g. factless fact table).
◦ Many facts can be derived from other facts, so non-additive facts can be avoided by calculating them from additive facts.
Grain
◦ The level of detail contained in the fact table
◦ A fact table at the lowest level of detail is called an atomic fact table
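A minimal Python sketch of the point about deriving non-additive facts from additive ones. The rows and the function name `average_unit_price` are hypothetical. Unit price cannot be summed or simply averaged across rows, but it can be recomputed from the two additive facts, sales amount and sales units.

```python
# Hypothetical atomic fact rows that store only additive facts.
sales_facts = [
    {"product": "bike", "sales_amount": 600.0, "sales_units": 2},
    {"product": "bike", "sales_amount": 900.0, "sales_units": 4},
    {"product": "helmet", "sales_amount": 100.0, "sales_units": 5},
]


def average_unit_price(rows):
    """Derive the non-additive fact (unit price) from two additive facts."""
    total_amount = sum(r["sales_amount"] for r in rows)
    total_units = sum(r["sales_units"] for r in rows)
    return total_amount / total_units


bike_rows = [r for r in sales_facts if r["product"] == "bike"]

# Correct: sum the additive facts first, then divide (1500 / 6).
bike_avg_price = average_unit_price(bike_rows)

# Misleading: averaging the per-row prices (300 and 225) ignores unit volume.
naive_avg_price = sum(
    r["sales_amount"] / r["sales_units"] for r in bike_rows
) / len(bike_rows)

print(bike_avg_price, naive_avg_price)
```

Storing only the additive facts keeps the fact table safe to aggregate at any grain; the ratio is computed at query time.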
Dimensions
The foundation of the dimensional model to describe the objects of the business
The nouns of the DW/BI system
◦ Business processes (facts) are the verbs of the business
Dimension tables link to all the business processes.
A dimension shared across all processes is called a conformed dimension.
Analysis involving data from more than one business process is called drill-across.
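Drill-across can be sketched in a few lines of Python, assuming two hypothetical fact tables (orders and returns) that share one conformed product dimension. Each fact table is aggregated separately, and the results are then lined up on the shared dimension key. All names and rows here are illustrative.

```python
from collections import defaultdict

# Conformed dimension shared by both business processes (hypothetical rows).
product_dim = {1: "Road bike", 2: "Helmet"}

# Two fact tables from different business processes.
orders_fact = [(1, 10), (1, 5), (2, 20)]   # (product_id, units_ordered)
returns_fact = [(1, 2), (2, 1), (2, 3)]    # (product_id, units_returned)


def total_by_product(fact_rows):
    """Aggregate one fact table down to the grain of the conformed dimension."""
    totals = defaultdict(int)
    for product_id, units in fact_rows:
        totals[product_id] += units
    return totals


ordered = total_by_product(orders_fact)
returned = total_by_product(returns_fact)

# Drill-across: combine the separately aggregated results on the shared key.
drill_across = {
    name: {"ordered": ordered[pid], "returned": returned[pid]}
    for pid, name in product_dim.items()
}

print(drill_across)
```

Aggregating before combining avoids joining the two fact tables directly, which would multiply rows whenever a product has several orders and several returns.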
Data Cube
Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another.
Data cubes aren't restricted to just three dimensions. Most OLAP systems can build data cubes with many more dimensions; some allow up to 64.
In practice, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions.
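As a rough illustration of "stacked 2-D tables" and of indexing a cube on its dimensions, the sketch below models a 3-D cube as a Python dictionary keyed by (product, store, month). The data and the helper names `slice_cube` and `roll_up` are assumptions made for this example, not part of any OLAP product.

```python
# A tiny 3-D cube: cube[(product, store, month)] = units sold (hypothetical).
DIMENSIONS = ("product", "store", "month")
cube = {
    ("bike", "Lubbock", "Jan"): 3,
    ("bike", "Lubbock", "Feb"): 4,
    ("bike", "Austin", "Jan"): 5,
    ("helmet", "Lubbock", "Jan"): 7,
    ("helmet", "Austin", "Feb"): 2,
}


def slice_cube(cells, **fixed):
    """Keep the cells whose coordinates match the fixed dimension values."""
    positions = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {
        key: value
        for key, value in cells.items()
        if all(key[i] == wanted for i, wanted in positions.items())
    }


def roll_up(cells, keep):
    """Sum the measure over every dimension that is not listed in keep."""
    kept_positions = [DIMENSIONS.index(name) for name in keep]
    totals = {}
    for key, value in cells.items():
        reduced = tuple(key[i] for i in kept_positions)
        totals[reduced] = totals.get(reduced, 0) + value
    return totals


jan_table = slice_cube(cube, month="Jan")          # one 2-D "layer" of the cube
units_by_product = roll_up(cube, keep=("product",))

print(jan_table)
print(units_by_product)
```

Fixing the month yields one of the stacked 2-D tables (product by store); rolling up collapses the dimensions that are not of interest.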
Determining Granularity
YEAR?
QUARTER?
MONTH?
WEEK?
DAY?
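One way to see what is at stake in the grain choice: facts stored at day grain can always be rolled up to month, quarter, or year, but facts stored at a coarser grain cannot be broken back down to days. A small Python sketch with hypothetical daily rows and an illustrative `roll_up` helper:

```python
from collections import Counter
from datetime import date

# Hypothetical fact rows stored at day grain: (sale date, units sold).
daily_sales = [
    (date(2008, 1, 14), 3),
    (date(2008, 1, 15), 2),
    (date(2008, 2, 1), 4),
    (date(2008, 4, 20), 1),
]


def roll_up(rows, grain):
    """Aggregate day-grain rows to a coarser time grain."""
    totals = Counter()
    for day, units in rows:
        if grain == "month":
            key = (day.year, day.month)
        elif grain == "quarter":
            key = (day.year, (day.month - 1) // 3 + 1)
        elif grain == "year":
            key = (day.year,)
        else:
            raise ValueError(f"unsupported grain: {grain}")
        totals[key] += units
    return dict(totals)


by_month = roll_up(daily_sales, "month")
by_quarter = roll_up(daily_sales, "quarter")
by_year = roll_up(daily_sales, "year")

print(by_month, by_quarter, by_year)
```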
Star Schema Model
Central fact table
◦ Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...
Denormalized dimensions
◦ Product Table: Product_id, Product_desc, ...
◦ Time Table: Day_id, Month_id, Year_id, ...
◦ Item Table: Item_id, Item_desc, ...
◦ Store Table: Store_id, District_id, ...
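The star schema above can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the table names `product_dim`, `store_dim`, and `time_dim`, the sample rows, and the omission of the Item table are choices made for this example; the column names follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension tables (names are assumptions for this sketch).
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE store_dim   (store_id INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE time_dim    (day_id INTEGER PRIMARY KEY, month_id INTEGER, year_id INTEGER);

-- Central fact table: foreign keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    product_id   INTEGER REFERENCES product_dim(product_id),
    store_id     INTEGER REFERENCES store_dim(store_id),
    day_id       INTEGER REFERENCES time_dim(day_id),
    sales_amount REAL,
    sales_units  INTEGER
);

INSERT INTO product_dim VALUES (1, 'Road bike'), (2, 'Helmet');
INSERT INTO store_dim   VALUES (10, 100), (11, 101);
INSERT INTO time_dim    VALUES (20080114, 200801, 2008), (20080201, 200802, 2008);
INSERT INTO sales_fact  VALUES
    (1, 10, 20080114, 600.0, 2),
    (2, 10, 20080114, 100.0, 5),
    (1, 11, 20080201, 900.0, 3);
""")

# A typical star query: one join per dimension, then group by its attributes.
rows = conn.execute("""
    SELECT p.product_desc, t.month_id, SUM(f.sales_amount), SUM(f.sales_units)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_id = f.product_id
    JOIN time_dim    AS t ON t.day_id = f.day_id
    GROUP BY p.product_desc, t.month_id
    ORDER BY p.product_desc, t.month_id
""").fetchall()

for row in rows:
    print(row)
```

Every dimension is exactly one join away from the fact table, which is what keeps star queries simple.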
Snowflake Schema Model
◦ Sales Fact Table: Item_id, Store_id, Product_id, Week_id, Sales_amount, Sales_units
◦ Time Table: Week_id, Period_id, Year_id
◦ Product Table: Product_id, Product_desc
◦ Item Table: Item_id, Item_desc, Dept_id
◦ Dept Table: Dept_id, Dept_desc, Mgr_id
◦ Mgr Table: Dept_id, Mgr_id, Mgr_name
◦ Store Table: Store_id, Store_desc, District_id
◦ District Table: District_id, District_desc
Snowflake Schema Model
◦ Direct use by some tools
◦ More flexible to change
◦ Provides for speedier data loading
◦ Can become large and unmanageable
◦ Degrades query performance
◦ More complex metadata
Example of a normalized hierarchy: Country > State > County > City
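To make the trade-off concrete, the sketch below stores the Country > State > County > City hierarchy in snowflaked (normalized) form, where each level holds only its parent key, and then flattens it into star-style dimension rows. The dictionaries and the `denormalize_city` helper are hypothetical stand-ins for tables and joins.

```python
# Snowflaked hierarchy: one "table" per level, each row pointing to its parent.
countries = {1: {"name": "USA"}}
states = {10: {"name": "Texas", "country_id": 1}}
counties = {
    100: {"name": "Lubbock County", "state_id": 10},
    101: {"name": "Travis County", "state_id": 10},
}
cities = {
    1000: {"name": "Lubbock", "county_id": 100},
    1001: {"name": "Austin", "county_id": 101},
}


def denormalize_city(city_id):
    """Follow three parent links (three joins) to build one flat dimension row."""
    city = cities[city_id]
    county = counties[city["county_id"]]
    state = states[county["state_id"]]
    country = countries[state["country_id"]]
    return {
        "city": city["name"],
        "county": county["name"],
        "state": state["name"],
        "country": country["name"],
    }


# Star-style dimension: the hierarchy is repeated on every row, no joins needed.
flat_city_dim = [denormalize_city(city_id) for city_id in sorted(cities)]

for row in flat_city_dim:
    print(row)
```

The normalized form stores "Texas" and "USA" once, which eases change and loading; the flat form repeats them on every row but answers a query without the extra lookups, which is the query-performance cost listed above.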
Dimensional Modeling Process
High level dimensional model design
◦ Choosing business model
◦ Declaring the grain
◦ Choosing dimensions
◦ Identifying the facts
Detailed dimensional model development
Dimensional model review and validation
◦ IS
◦ Core users
◦ Business community
Final design iteration
Example: Commrex Real Estate Data Warehousing
Analytic themes
◦ How to encourage realtors to use the online ASP services
Value Chain
◦ Listors create their account
◦ Listors post their real estate properties to the web-based database services and pay listing fees
◦ Property buyers search the website-based database and buy properties from listors. This is the incentive for listors to use the ASP services
Business Processes
◦ Listor sign up
◦ Listor account management
◦ Property data posting
◦ Property search
◦ Property database maintenance
IMW’s Database ERD Model
[Figure: ERD linking a Property Listing Database and a Membership Database. Labels in the figure: Property ID, Listor ID, Address, Property Type, City, Feature; Listor ID, Listor Name, Company ID, Chapter, Functions, Specializations, Telephone #; Company ID, Comp Name, Address; Property Type, Type Name, Subtype 1, Subtype 2, Subtype n. Relationships: M:1, M:M, M:M. Legend: Primary Key, Secondary Key, Link to a table.]
IMW’s Data Warehouse Dimensional Model
[Figure: dimensional model with a Property Listing Fact and four dimensions: Membership Dimension, Company Dimension, Property SubType Dimension, and Property Type Dimension. Labels in the figure: Property ID, Listor ID, Address, Prop SubType, City, Feature; Listor ID, Listor Name, Company ID, Chapter, Functions, Specializations, Telephone #; Company ID, Comp Name, Address; Prop SubType, Property Type, SubType Name; Property Type, Type Name. Legend: Primary Key, Secondary Key, Link to a table.]
Data Warehousing with Microsoft SQL Server 2005
Unified Dimensional Model (UDM)
A SQL Server 2005 technology
A UDM is a structure that sits over the top of a data mart and looks exactly like an OLAP system to the end user.
Advantages
◦ No need for a data mart.
◦ Can be built over one or more OLTP systems.
◦ Mixed data mart and OLTP system data
◦ Can include data from databases from other vendors and XML-formatted data
◦ Allows OLAP cubes to be built directly on top of transactional data
◦ Low latency
◦ Ease of creation and maintenance
Microsoft BI Toolset
Relational engine (RDBMS)
◦ T-SQL
◦ .NET Framework Common Language Runtime (CLR)
SQL Server Integration Services (SSIS) – ETL
◦ Data Transformation Pipeline (DTP)
◦ Data Transformation Runtime (DTR)
SQL Server Analysis Services (SSAS) – queries, ad hoc use, OLAP, data mining
◦ Multi-Dimensional eXpressions (MDX) – a scripting language for data retrieval from dimensional databases
◦ Dimension design
◦ Cube design
◦ Data mining
SQL Server Reporting Services (SSRS) – ad hoc query, report building
Microsoft Visual Studio .NET is the fundamental tool for application development
Structure and Components of Business Intelligence
[Figure: MS SQL Server 2005 components (SSMS, SSIS, SSAS, SSRS, BIDS) shown together with SAS EM and SAS EG]
Understanding the Cube Designer Tabs
Cube Structure: Use this tab to modify the architecture of a cube.
Dimension Usage: Use this tab to define the relationships between dimensions and measure groups, and the granularity of each dimension within each measure group.
Calculations: Use this tab to examine calculations that are defined for the cube, to define new calculations for the whole cube or for a subcube, to reorder existing calculations, and to debug calculations step by step by using breakpoints.
KPIs: Use this tab to create, edit, and modify the Key Performance Indicators (KPIs) in a cube.
Actions: Use this tab to create or modify drillthrough, reporting, and other actions for the selected cube.
Partitions: Use this tab to create and manage the partitions for a cube. Partitions let you store sections of a cube in different locations with different properties, such as aggregation definitions.
Perspectives: Use this tab to create and manage the perspectives in a cube. A perspective is a defined subset of a cube, and is used to reduce the perceived complexity of a cube to the business user.
Translations: Use this tab to create and manage translated names for cube objects, such as month or product names.
Browser: Use this tab to view data in the cube.
Case: Adventure Works Cycles (AWC)
Case: Adventure Works Cycles (AWC)
A fictitious multinational manufacturer and seller of bicycles and accessories
Based in Bothell, Washington, USA, with regional sales offices in several countries
http://www.msftdwtoolkit.com/
Basic Business Information
Product Orders by Category
Product Orders by Country/Region
Product Orders by Sales Channel
Customers by Sales Channel
Snapshot
AWC Business Requirements - Interview summary
Interviewee: Brian Welker, VP of Sales
Sales to resellers: $37 million last year
17 people report to him including 3 regional sales managers
Previous problem: Hard to get information out of the company’s system
Major analytic areas:
◦ Sales planning
◦ Growth analysis
◦ Customer analysis
◦ Territory analysis
◦ Sales performance
◦ Basic sales reporting
◦ Price lists
◦ Special offers
◦ Customer satisfaction
◦ International support
Success criteria
◦ Easy data access, flexible reporting and analyzing, all data in one place
What’s missing? – A lot – No indication of business value
Business Processes
Purchase Orders
Distribution Center Deliveries
Distribution Center Inventory
Store Deliveries
Store Inventory
Store Sales
Analytic Themes
See the Excel file AW_Analytic_Themes_List.xls
AWC’s Bus Matrix
Dimensions (columns): Date, Product, Employee, Customer (Reseller), Customer (Internet), Sales Territory, Currency, Channel, Promotion, Call Reason, Facility
Business Process (rows), with an X for each dimension it uses:
Sales Forecasting X X X X X X X
Orders X X X X X X X X X
Call Tracking X X X X X X X
Returns X X X X X X X X
Prioritization Grid
[Figure: business processes placed on a grid of Feasibility (High, Low) against Business Value / Impact (High, Low): Orders, Orders Forecast, Call Tracking, Exchange Rates, Returns, Manufacturing Costs, Customer Profitability, Product Profitability]
Exercise 2 – A quick walk through an SSAS application
Learning Objectives
◦ How to design a data source view with SSAS based on an existing data warehouse
◦ How to design and deploy a cube.
Tasks
◦ Analysis Services Tutorial Lesson 1: Defining a Data Source View within an Analysis Services Project
◦ Analysis Services Tutorial Lesson 2: Defining and Deploying a Cube
Deliverable:
◦ A Word file with the screenshot of the star schema emailed to [email protected]
◦ The subject of the email is: “ISQS 3358 Exercise 2”
Supplemental Slides: Data Warehouse Design Phases
Data Warehouse Database Design Phases
Phase 1: Defining the business model
Phase 2: Defining the dimensional model
Phase 3: Defining the physical model
Phase 1: Defining the Business Model
◦ Performing strategic analysis
◦ Creating the business model
◦ Documenting metadata
Performing Strategic Analysis
Identify crucial business processes
Understand business processes
Prioritize and select the business processes to implement
[Figure: prioritization grid with axes Business Benefit (Low, High) and Feasibility (Low, High)]
Creating the Business Model
Defining business requirements:
◦ Identifying the business measures
◦ Identifying the dimensions
◦ Identifying the grain
◦ Identifying the business definitions and rules
Verifying data sources
Business Requirements Drive the Design Process
◦ Primary input
◦ Secondary input
[Figure: inputs to the design process: Business Requirements, Existing Metadata, Production ERD Model, Research]
Identifying Measures and Dimensions
Measures: the attribute varies continuously:
◦ Balance
◦ Units Sold
◦ Cost
◦ Sales
Dimensions: the attribute is perceived as constant or discrete:
◦ Product
◦ Location
◦ Time
◦ Size
Using a Business Process Matrix
Sample of business process matrix
Business Processes (columns): Sales, Returns, Inventory
Business Dimensions (rows): Customer, Date, Product, Channel, Promotion
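A business process matrix is easy to hold as a small data structure and query. In the Python sketch below the process and dimension names come from the sample matrix, but which process uses which dimension is a hypothetical assignment made only for illustration.

```python
# Process -> set of dimensions it uses. The marks are hypothetical.
bus_matrix = {
    "Sales": {"Customer", "Date", "Product", "Channel", "Promotion"},
    "Returns": {"Customer", "Date", "Product"},
    "Inventory": {"Date", "Product"},
}

# Dimensions used by every process: the strongest candidates for conforming.
shared_by_all = set.intersection(*bus_matrix.values())


def processes_using(dimension):
    """List the business processes that can be analyzed along a dimension."""
    return sorted(name for name, dims in bus_matrix.items() if dimension in dims)


print(sorted(shared_by_all))
print(processes_using("Customer"))
```

Reading down a column of the matrix (here, `processes_using`) shows which processes a dimension must serve.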
Determining Granularity
YEAR?
QUARTER?
MONTH?
WEEK?
DAY?
Identifying Business Rules
Store
◦ Store > District > Region
Location
◦ Geographic proximity: 0 - 1 miles, 1 - 5 miles, > 5 miles
Product
◦ Type: PC, Server
◦ Monitor: 15 inch, 17 inch, 19 inch, None
◦ Status: New, Rebuilt, Custom
Time
◦ Month > Quarter > Year
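Business rules like these usually end up as small, testable functions in the load process. The Python sketch below encodes the geographic proximity bands and the Month > Quarter > Year roll-up. The function names are hypothetical, and the treatment of the band boundaries (exactly 1 mile and exactly 5 miles fall in the lower band) is an assumption, since the slide does not say which band they belong to.

```python
def proximity_band(miles):
    """Map a distance to the bands 0 - 1 miles, 1 - 5 miles, > 5 miles.

    Assumption: a boundary value belongs to the lower band.
    """
    if miles < 0:
        raise ValueError("distance cannot be negative")
    if miles <= 1:
        return "0 - 1 miles"
    if miles <= 5:
        return "1 - 5 miles"
    return "> 5 miles"


def time_rollup(year, month):
    """Apply the Month > Quarter > Year hierarchy to one calendar month."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    quarter = (month - 1) // 3 + 1
    return {"month": month, "quarter": quarter, "year": year}


print(proximity_band(0.5), proximity_band(3), proximity_band(7.2))
print(time_rollup(2008, 5))
```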
Documenting Metadata
Documenting metadata should include:
◦ Documenting the design process
◦ Documenting the development process
◦ Providing a record of changes
◦ Recording enhancements over time
Metadata Documentation Approaches
◦ Automated: data modeling tools, ETL tools, end-user tools
◦ Manual
Phase 2: Defining the Dimensional Model
◦ Identify fact tables: translate business measures into fact tables, and analyze source system information for additional measures
◦ Identify dimension tables
◦ Link fact tables to the dimension tables
◦ Model the time dimension
Star Dimensional Modeling
◦ Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...
◦ Store Table: Store_id, District_id, ...
◦ Item Table: Item_id, Item_desc, ...
◦ Product Table: Product_id, Product_desc, ...
◦ Time Table: Day_id, Month_id, Period_id, Year_id
Fact Table Characteristics
◦ Contain numerical metrics of the business
◦ Can hold large volumes of data
◦ Can grow quickly
◦ Can contain base, derived, and summarized data
◦ Are typically additive
◦ Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...
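The last characteristic, foreign keys in the fact table referencing primary keys in the dimension tables, can be demonstrated with Python's built-in `sqlite3` module. This is a sketch with a single hypothetical dimension (`product_dim`); note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute(
    "CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_desc TEXT)"
)
conn.execute(
    """
    CREATE TABLE sales_fact (
        product_id   INTEGER NOT NULL REFERENCES product_dim(product_id),
        sales_amount REAL,
        sales_units  INTEGER
    )
    """
)

conn.execute("INSERT INTO product_dim VALUES (1, 'Road bike')")
conn.execute("INSERT INTO sales_fact VALUES (1, 600.0, 2)")  # key exists: accepted

# A fact row whose key has no matching dimension row is rejected.
try:
    conn.execute("INSERT INTO sales_fact VALUES (99, 50.0, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

fact_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(orphan_rejected, fact_count)
```

In practice this means dimension rows must be loaded before the fact rows that reference them.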
Dimension Table Characteristics
Dimension tables have the following characteristics:
◦ Contain textual information that represents the attributes of the business
◦ Contain relatively static data
◦ Are joined to a fact table through a foreign key reference
Star Dimensional Model Characteristics
◦ The model is easy for users to understand.
◦ Primary keys represent a dimension.
◦ Nonforeign key columns are values.
◦ Facts are usually highly normalized.
◦ Dimensions are completely denormalized.
◦ Fast response to queries is provided.
◦ Performance is improved by reducing table joins.
◦ End users can express complex queries.
◦ Support is provided by many front-end tools.
Using Time in the Data Warehouse
◦ Defining standards for time is critical.
◦ Aggregation based on time is complex.
The Time Dimension
Time is critical to the data warehouse. A consistent representation of time is required for extensibility.
Where should the element of time be stored?
[Figure: Time dimension linked to the Sales fact]
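One common way to get a consistent representation of time is to generate the time dimension once, with one row per day, and have fact rows carry only the day key. The Python sketch below does this for 2008. The column names, the YYYYMMDD integer key, and the use of ISO week numbers are assumptions for this example, not the deck's design.

```python
from datetime import date, timedelta


def build_time_dimension(start, end):
    """Generate one dimension row per calendar day from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append(
            {
                "day_id": int(day.strftime("%Y%m%d")),  # e.g. 20080101
                "date": day.isoformat(),
                "week": day.isocalendar()[1],           # ISO week number
                "month": day.month,
                "quarter": (day.month - 1) // 3 + 1,
                "year": day.year,
            }
        )
        day += timedelta(days=1)
    return rows


time_dim = build_time_dimension(date(2008, 1, 1), date(2008, 12, 31))

print(len(time_dim))
print(time_dim[0])
print(time_dim[-1])
```

Because every fact table joins to the same generated rows, month, quarter, and year are defined in exactly one place.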
Using Data Modeling Tools
◦ Tools with a GUI enable definition, modeling, and reporting.
◦ Avoid a mix of modeling techniques caused by: development pressures, developers with lack of knowledge, no strategy
◦ Determine a strategy.
◦ Write and publish formally.
◦ Make available electronically.
Phase 3: Defining the Physical Model
Why
◦ Huge amounts of data must be effectively processed and retrieved in real time.
How
◦ Translate the dimensional design to a physical model for implementation.
◦ Define storage strategy for tables and indexes.
◦ Perform database sizing.
◦ Define initial indexing strategy.
◦ Define partitioning strategy.
◦ Update metadata document with physical information.
Storage and Performance Considerations
Database sizing
Data partitioning
Indexing
Star query optimization
Database Sizing - Test Load Sampling
Analyze a representative sample of the data chosen using proven statistical methods.
Ensure that the sample reflects:
◦ Test loads for different periods
◦ Day-to-day operations
◦ Seasonal data and worst-case scenarios
◦ Indexes and summaries
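The arithmetic behind a sample-based size estimate is simple: average row size from the sample, times the expected row count, plus an allowance for indexes and summaries. A Python sketch follows; the sample sizes, the row count, and the 50% overhead factor are made-up numbers, and `estimate_table_bytes` is a hypothetical helper.

```python
def estimate_table_bytes(sample_row_bytes, expected_rows, overhead=0.0):
    """Estimate storage from a sample of row sizes.

    overhead is the extra fraction reserved for indexes and summaries,
    e.g. 0.5 adds 50% on top of the raw data size.
    """
    if not sample_row_bytes:
        raise ValueError("the sample must contain at least one row")
    average_row = sum(sample_row_bytes) / len(sample_row_bytes)
    return average_row * expected_rows * (1 + overhead)


# Row sizes (in bytes) measured on a hypothetical test load.
sample = [96, 104, 100, 100]

estimate = estimate_table_bytes(sample, expected_rows=1_000_000, overhead=0.5)
print(f"{estimate / 1_000_000:.0f} MB")
```

The estimate is only as good as the sample, which is why the sample should cover different periods, seasonal peaks, and worst-case loads.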
Data Partitioning
Breaking up of data into separate physical units that can be handled independently
Types of data partitioning
◦ Horizontal partitioning
◦ Vertical partitioning
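The two partitioning types can be illustrated on a handful of rows in Python. Horizontal partitioning splits the rows (here by year); vertical partitioning splits the columns while repeating the key so the pieces can be rejoined. The rows, column names, and helper functions are hypothetical.

```python
rows = [
    {"sale_id": 1, "year": 2007, "amount": 600.0, "comment": "web order"},
    {"sale_id": 2, "year": 2008, "amount": 100.0, "comment": "in store"},
    {"sale_id": 3, "year": 2008, "amount": 900.0, "comment": "reseller"},
]


def horizontal_partition(table, key):
    """Split rows into separate units by the value of one column."""
    parts = {}
    for row in table:
        parts.setdefault(row[key], []).append(row)
    return parts


def vertical_partition(table, key, column_groups):
    """Split columns into separate units; each unit keeps the key column."""
    return [
        [{column: row[column] for column in (key, *group)} for row in table]
        for group in column_groups
    ]


by_year = horizontal_partition(rows, "year")
measures, notes = vertical_partition(
    rows, "sale_id", [("year", "amount"), ("comment",)]
)

print(by_year[2008])
print(measures[0], notes[0])
```

A query for 2008 sales only needs to touch `by_year[2008]`, and a query on amounts never reads the wide comment column.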
Indexing
Indexing is used for the following reasons:
◦ It is a huge cost saving, greatly improving performance and scalability.
◦ It can replace a full table scan by a quick read of the index followed by a read of only those disk blocks that contain the rows needed.
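The switch from a full table scan to an index read can be observed directly in SQLite, used here only as a convenient stand-in for the warehouse database. The sketch asks for the query plan before and after creating an index; the table, data, and index name `idx_sales_day` are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (day_id INTEGER, store_id INTEGER, sales_amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(20080101 + n % 28, n % 5, float(n)) for n in range(1000)],
)


def query_plan(sql):
    """Return SQLite's plan description (the fourth column of each plan row)."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT store_id, sales_amount FROM sales_fact WHERE day_id = 20080115"

plan_before = query_plan(query)   # expected to describe a scan of the table
conn.execute("CREATE INDEX idx_sales_day ON sales_fact (day_id)")
plan_after = query_plan(query)    # expected to mention idx_sales_day

print(plan_before)
print(plan_after)
```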
Parallelism
[Figure: parallel execution servers P1, P2, and P3 working on the Sales table and the Customers table]
Using Summary Data
Designing summary tables offers the following benefits:
◦ Provides fast access to precomputed data
◦ Reduces use of I/O, CPU, and memory
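A summary table is precomputed once at load time so that queries read a few aggregated rows instead of scanning the detail. A minimal Python sketch, with hypothetical detail rows and helper names:

```python
from collections import defaultdict

# Detail-level fact rows: (month, product, sales_amount). Hypothetical data.
detail = [
    ("2008-01", "bike", 600.0),
    ("2008-01", "bike", 900.0),
    ("2008-01", "helmet", 100.0),
    ("2008-02", "bike", 300.0),
]


def build_summary(rows):
    """Precompute monthly sales per product (run once, at load time)."""
    summary = defaultdict(float)
    for month, product, amount in rows:
        summary[(month, product)] += amount
    return dict(summary)


monthly_summary = build_summary(detail)


def monthly_sales(month, product):
    """Answer a query from the summary without touching the detail rows."""
    return monthly_summary.get((month, product), 0.0)


print(monthly_sales("2008-01", "bike"))
```

The cost is that the summary must be refreshed whenever new detail rows are loaded.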