
An Introduction to Data Warehouse and Dimensional Modelling

Abhishek Bhattacharya, Data Architect, Imaginea.com


Abstract
In this paper I describe the multidimensional data model for data warehousing and analytics, with the aim of growing the community's knowledge in this technical area. In the simplest terms: a dimensional model surrounds facts with as many relevant context dimensions as possible, and a data warehouse consists of many such dimensional models. Multidimensionality is just a design technique that separates information into facts and dimensions. It gives rise to a star-shaped schema consisting of a fact table surrounded by dimension tables. The snowflake schema is a variation of the star schema in which the dimension tables are organized into a hierarchy by normalizing them. The multidimensional model presents information to end users in a way that corresponds to their normal understanding of the business: key figures or facts, seen from the different perspectives that influence them.

Introduction
DW technology is in full swing in industry as the major business intelligence (BI) technology. Many business intelligence applications currently running at companies demand not just more capacity, but also new methods, models, techniques and architectures to satisfy new needs. The concept of the DW emerged during the nineties as an integrated data collection system for companies, oriented to decision making. This kind of database has the following particular features: it contains data that is the result of transformation, quality improvement, and integration of data that comes from operational databases, and it also includes indicators that give the data additional value. A DW has to support complex queries, yet its maintenance does not impose a transactional load. These features cause the design techniques and strategies used to differ from traditional ones.

Why do we need a DW?
Traditional (OLTP) databases are not optimized for data access alone; they have to balance the requirements of data access with the need to ensure the integrity of the data. Most of the time, DW users need only read access, but they need that access to be fast over a large volume of data.

Most of the data required for DW analysis comes from multiple databases. Understanding multiple database sources, integrating them according to the requirements, and running the analysis to get the desired output is a real pain for business users and analysts. In a DW the data is integrated and stored in one place, which enables cross-functional analysis.

A data warehouse is a set of data and technologies aimed at enabling executives, managers and analysts to make better and faster decisions. DWs manage information efficiently as a key organizational asset.

Because the principal role of a DW is to support strategic decisions, quality is fundamental. DWs are databases consisting of cleansed, reconciled, and enhanced data, integrated into logical business subject areas, that store information in order to satisfy decision-making requests and improve decision making.

Characteristics of a DW
Separate: the DW is separate from the operational systems in the company; it gets its data out of these legacy systems. The task of a DW is to make data accessible to the user.

History: questions have to be answered; trends and correlations have to be discovered. Data are time-stamped and associated with defined periods of time.

Subject oriented: most of the time oriented to subjects such as 'customer', 'product', 'promotion', 'inventory'.

Non-dynamic: data is updated only periodically, not on an individual basis.

Aggregation performance: the data requested by the user has to perform well at all scales of aggregation.

Consistency: consistency of the structure and contents of the data is very important, and can only be guaranteed through the use of metadata.

Metadata: this is independent of the source and collection date of the data.


Understanding OLAP and OLTP
Having given a brief understanding of the data warehouse, this is a good time to explain "OLTP" and "OLAP", terms that come up very often when you work with data and analytics. OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, and DELETE). The main emphasis for OLTP systems is very fast DML query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database holds detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF).

OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations; for OLAP systems, response time is the effectiveness measure. OLAP databases are widely used with data mining techniques. An OLAP database holds aggregated, historical data, stored in multidimensional schemas (usually star, snowflake, or star-flake schemas). OLTP and OLAP are complementary technologies. You can't live without OLTP: it runs your business day by day. An OLAP implementation gives you the ability to dig deep into your data: analytics, mining, and BI reports over huge volumes of historical data (collected from the OLTP system), which become key factors in deciding business strategy. A change in strategy will in turn require changes in the OLTP system, and the data captured there will arrive in the OLAP system to be analysed again; thus the loop goes on.

The following table summarizes the major differences between OLTP and OLAP system design.

OLTP System (Online Transaction Processing; the operational system) vs. OLAP System (Online Analytical Processing; the data warehouse):

Source of data. OLTP: operational data; OLTP systems are the original source of the data. OLAP: consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data. OLTP: to control and run fundamental business tasks. OLAP: to help with planning, problem solving, and decision support.

What the data reveals. OLTP: a snapshot of ongoing business processes. OLAP: multi-dimensional views of various kinds of business activities.

Inserts and updates. OLTP: short and fast inserts and updates initiated by end users. OLAP: periodic long-running batch jobs refresh the data.

Queries. OLTP: relatively standardized, simple queries returning relatively few records. OLAP: often complex queries involving aggregations.

Processing speed. OLTP: typically very fast. OLAP: depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements. OLTP: can be relatively small if historical data is archived. OLAP: larger, due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database design. OLTP: highly normalized, with many tables. OLAP: typically de-normalized, with fewer tables; uses star and/or snowflake schemas.

Backup and recovery. OLTP: back up religiously; operational data is critical to running the business, and data loss is likely to entail significant monetary loss and legal liability. OLAP: instead of regular backups, some environments may simply reload the OLTP data as a recovery method.

Keywords of data warehouse
I don't like to jump directly into the design process of a data warehouse / data mart. Rather, let me first give you a brief idea of the most important keywords that are used very often in DW. Once you are clear on these keywords and have an idea of the architecture of a data warehouse, it will be easy for you to understand the design process.


The key words are listed below: denormalization, facts & fact tables, dimensions, attributes, granularity, dimensional modelling, star schema, snowflake schema, star-flake schema, conformed dimension, hierarchy, types of "fact", types of "fact table", dense and sparse fact tables, skinny fact tables, keys, slowly changing dimensions, rapidly changing dimensions and large dimensions, minidimension, degenerate dimension, junk dimension, heterogeneous products, many-to-many relationships & bridge tables, role-playing, the data warehouse bus, data cubes, OLAP servers, and roll-up & drill-down.

Denormalization
In a relational database, denormalization is an approach to speeding up read performance (data retrieval). It introduces redundancy but reduces the joins between entities that are related through ER (entity relationship) links.

Fact
As the name suggests, a fact is a measure of an entity. For example, "Sales" is an entity and "quantity" is a measure of this entity. A table which represents the entity "Sales" is a fact table. I have seen that fact and fact table are very often confused by users.

Dimension
A business perspective from which data is looked at; a collection of attributes that are highly correlated. Examples: product, customer, address, store. Data in dimension tables is highly denormalized. It is also very common to have a hierarchy (explained later) in dimension data.

Attribute
An attribute describes a characteristic of a tangible thing. We do not measure attributes; we usually know them. They are typically text fields with discrete values, e.g., the flavour of a product or the size of a product.

Granularity
The level of detail of the data contained in a table. For example, in a store table there is one record per store id, so the granularity of that table is store id. An employee table can have the same employee id against different department ids, so (empid, deptid) is the granularity of the employee table. In other words, granularity is the attribute (or set of attributes) at whose level a record becomes unique.

Dimensional modelling
This is a concept. DM is generally used in the context of data warehousing. Modelled dimensionally, a structure is easier to navigate and understand: each subject area's data is separated and segregated into a logical entity and loaded into a table called a "dimension", and these dimensions are then related to each other by "fact tables". This is a plus, especially when the user of the model is not familiar with database technologies and tools.

Star schema
One of the dimensional modelling/design techniques (people often confuse the dimensional model with the star schema). A star is a model that has a central table (or tables) called the "fact table" surrounded by dimensions. See the picture below.


Snowflake schema
This is another technique for representing a dimensional model. There are times when a star has some of its portions normalized; the design is then said to be snowflaked. This reduces redundancy but introduces joins. For example, the "Product Dimension" in the picture above carries brand and product group information along with the product key. If the brand and product group information were removed and a separate table created for each of them, the result would be a snowflaked design.

Star-flake schema
Again, another technique. If there is a correlated common set of attributes between dimensions, those common attributes are stored together in one table and the dimensions are linked to it. This approach reduces redundancy but increases joins. For example, if the store and customer dimensions share 5 common address attributes, store them in a separate address dimension and relate that dimension to both store and customer.
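To make the star and snowflake shapes concrete, here is a minimal sketch using Python's built-in sqlite3 module. All table and column names (sales_fact, product_dim, brand_dim and so on) are invented for illustration and are not from any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: one fact table surrounded by denormalized dimension tables.
conn.executescript("""
CREATE TABLE product_dim (          -- denormalized: brand and group live here
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    brand_name    TEXT,
    product_group TEXT
);
CREATE TABLE store_dim (
    store_key  INTEGER PRIMARY KEY,
    store_name TEXT,
    city       TEXT,
    state      TEXT
);
CREATE TABLE sales_fact (           -- grain: one row per product, store and day
    product_key INTEGER,            -- surrogate keys pointing at the dimensions
    store_key   INTEGER,
    date_key    INTEGER,
    quantity    INTEGER,
    amount      REAL
);
""")

# Snowflaked variant: brand and group are normalized out of the product
# dimension, removing redundancy but adding a join to reach brand_name.
conn.executescript("""
CREATE TABLE brand_dim (brand_key INTEGER PRIMARY KEY, brand_name TEXT);
CREATE TABLE product_group_dim (group_key INTEGER PRIMARY KEY, group_name TEXT);
CREATE TABLE product_dim_snowflaked (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_key    INTEGER,
    group_key    INTEGER
);
""")
```

The star keeps brand and product group denormalized inside product_dim; the snowflaked variant normalizes them out at the cost of extra joins in every query that needs them.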

Other Dimensional Design Techniques will be discussed later.


Conformed dimension
When modelling a business process, one should re-use what has been modelled previously; other modelling sessions might have defined dimensions and facts that can be reused. This notion is known as the conformed dimension. Example: all the pictures above relate to the "sales" business process. While designing the "inventory" business process, the "product dimension" and "time dimension" should be reused. These two are therefore conformed dimensions across business processes, and they enable cross-functional analysis.

Hierarchy
This is a relation between the different levels of an entity.

Examples: product - product group - product category; day - week - month - quarter - semester - year.

Types of "fact"

(Perfectly) additive: a fact is additive if it makes sense to add it across all the dimensions, e.g., discrete numerical measures of activity such as quantity sold or dollars sold.

Semi-additive: a fact is semi-additive if it makes sense to add it along some of the dimensions only, e.g., numerical measures of intensity such as account balance or inventory level.

Non-additive: facts that cannot be added at all, e.g., a measurement of room temperature.

Types of "fact table"
In addition to carrying additive or non-additive facts, fact tables come in four flavours: transactional, periodic snapshot, accumulating snapshot, and factless.

A transactional fact table typically contains one record for every transaction. The typical data load method for a transactional fact table is INSERT. An example would be a table capturing all order details of a retail website.

A periodic snapshot represents the state of a business event entity at a pre-defined interval or period. A bank's daily balance fact table, or a daily inventory table, is an example of a periodic snapshot. The typical load method is again INSERT, but in most cases the latest record (against time) is the one used.

An accumulating snapshot represents business activities over a time period. An accumulating fact table could represent the fulfilment process of a mutual fund company: a row in that table captures a customer's first contact date, then the literature sent date, and finally the account open date. As a consequence, this is one of the only times a fact table is updated in a data warehouse; otherwise a fact table row is not updated after it is loaded. This kind of fact table supports business process performance improvement analytics and lag/lead analytics.

A factless fact table is simply a bridge between dimensions: it consists only of the key columns of the dimensions, with no measure column. If a factless fact table is the only fact table in the design, any measure will be present in a dimension table and viewed as an attribute of that dimension.

Dense and sparse fact tables

Sparse and dense are properties of the fact data with respect to its dimension values.

Sparse
Data is normally stored in sparse form: if no value exists for a given combination of dimension values, no row exists in the fact table. For example, not every product is sold in every store; in that case, "Store" and "Product" are sparse dimensions with respect to the sales fact table. This is why, in a reporting tool (OBIEE, for instance), data is considered sparse by default.

Dense

Most multidimensional databases contain dense dimensions. A fact table is considered to have dense data if it has (or has a high probability of having) one row for every combination of its associated dimension values.

Skinny fact tables
As the fact table contains the vast volume of records, it is important that it is space efficient. Foreign keys are usually represented in integer form and do not require much space. Facts too are often numeric properties and can usually be represented as integers (in contrast to dimensional attributes, which are usually long text strings). This space efficiency is critical to the storage consumption of the data warehouse.

Keys

A business key is similar to a primary key, but in a dimension table it is not called the primary key. It is not a derived column; it is an original attribute (or combination of attributes) of the entity. For example, if "product id" is repeated because of "version no", then for the product dimension "product id" + "version no" is the business key.

A surrogate key is also known as an integer key or synthetic key. This is the de facto primary key in a dimension table, and it must always be an integer. The surrogate key is carried in the fact tables, and the join between facts and dimensions happens on this key. Joins on integer columns execute much faster than joins on non-integer columns. Also, if no surrogate key were created, the business key would have to be kept in the fact tables; if the business key consists of multiple columns (like "product id" + "version no"), the fact table's data volume grows, which increases query I/O and hurts performance.
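As a sketch of how a dimension load might assign surrogate keys, here is a minimal in-memory version. The composite business key (product id, version no) follows the example above; the function and variable names are hypothetical.

```python
# In-memory surrogate key assignment during a product dimension load.
surrogate_of = {}   # (product_id, version_no) -> integer surrogate key
next_key = 1

def lookup_or_assign(product_id, version_no):
    """Return the surrogate key for a business key, assigning a new one if unseen."""
    global next_key
    business_key = (product_id, version_no)
    if business_key not in surrogate_of:
        surrogate_of[business_key] = next_key
        next_key += 1
    return surrogate_of[business_key]

# Fact rows then carry one small integer instead of the multi-column business key.
fact_row = {"product_key": lookup_or_assign("P-1001", 2), "quantity": 5}
print(fact_row)   # {'product_key': 1, 'quantity': 5}
```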

Foreign keys in a data warehouse (relationships)? As I said earlier when contrasting OLAP and OLTP, OLAP designs (the data warehouse, and data marts, a data mart being a subset of a big data warehouse specific to a subject/business area, used for data analytics) are made for analysis and reports. There is no need to enforce data integrity and validation in the database. There is also the concept of a late arriving dimension, where a record in a dimension table arrives late with respect to the associated record in the fact table (for example, a product appears in service line details but is not yet present in the product dimension, because the product dimension load is weekly while the fact table load is daily). That does not make the fact record invalid; it is still valid with respect to the other dimensions, but keeping a foreign key constraint would prevent it from being inserted into the fact table. Also, in a data warehouse the data load volume is huge; keeping foreign keys enabled slows down the load, since every record must be validated against every dimension. So during loading you cannot keep foreign keys enabled: either you drop or disable them, and once the load has completed you recreate or re-enable them. If an integrity violation surfaces at that point, it becomes a huge pain to track down.

Slowly changing dimensions (also known as SCDs)
Type 1: Overwrite the dimension record with the new values, thereby losing history. If an incoming feed contains a business key that already exists in the dimension table, the incoming record updates the existing one. This is easy to implement, but it avoids the real goal, which is to accurately track history. Example: 12334 is a customer id (business key); the record that was earlier "single" is updated to "married".

Type 2: Create an additional dimension record using a new value of the surrogate key. If an incoming feed contains a business key that already exists in the dimension table, the existing record is updated with "active flag" = 'N' and "end date" = ETL load date - 1. The incoming record for the same business key is then inserted into the dimension table with a new surrogate key value, "active flag" = 'Y', "start date" = ETL load date, and "end date" = null. ("active flag", "start date" and "end date" are the columns needed to implement Type 2.)


Example: 12334 is a customer id (business key) with two surrogate keys (12334001, 12334002) for the same business key. History is kept fully: old records in the fact table point to 12334001, new ones point to 12334002. It therefore becomes possible to answer questions such as how sales compare before and after a change in a customer's marital status.

Type 3: Create a new field in the dimension record to store the new value of the attribute. If an incoming feed contains a business key that already exists in the dimension table, the existing record is updated: the incoming value goes into the "new value" column and the existing value moves to the "old value" column. The next time the same business key arrives, the incoming value again goes into "new value", "new value" moves to "old value", and the previous "old value" is lost. So this maintains only partial history. Example: 12334 is a customer id (business key).
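Below is a sketch of the Type 2 logic described above, applied to an in-memory list standing in for the customer dimension. The column names (active flag, start date, end date) follow the text; the helper function itself is hypothetical.

```python
from datetime import date, timedelta

customer_dim = [
    {"surrogate_key": 12334001, "customer_id": 12334, "marital_status": "single",
     "active_flag": "Y", "start_date": date(2015, 1, 1), "end_date": None},
]

def apply_scd_type2(rows, customer_id, new_attrs, load_date):
    """Expire the current row for the business key and insert a new versioned row."""
    current = next(r for r in rows
                   if r["customer_id"] == customer_id and r["active_flag"] == "Y")
    current["active_flag"] = "N"
    current["end_date"] = load_date - timedelta(days=1)    # ETL load date - 1
    rows.append({
        # A real load would draw the next key from a sequence; +1 here just
        # mirrors the 12334001 -> 12334002 example above.
        "surrogate_key": current["surrogate_key"] + 1,
        "customer_id": customer_id, **new_attrs,
        "active_flag": "Y", "start_date": load_date, "end_date": None,
    })

apply_scd_type2(customer_dim, 12334, {"marital_status": "married"}, date(2016, 6, 1))
# Old fact rows keep pointing at 12334001; new fact rows point at 12334002.
```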

Rapidly changing dimensions and large dimensions
A large dimension, in simple words, is a dimension table with a huge number of columns (very highly denormalized); the customer dimension is often one. In a rapidly changing dimension, as the name suggests, some of the attributes change on a regular basis. For example, in a customer dimension, "age", "income", "number of children", "marital status" and so on change regularly; by "regular basis" I mean that these attributes/columns receive the most updates. For small dimensions, the same techniques as for slowly changing dimensions may be applied and will work fine. For a large dimension there is a technique to learn:

The choice of indexing techniques and data design approaches is important.

Find and suppress duplicate entries in the dimension.

Do not create additional records to handle the slowly changing dimension problem.

Break some of the attributes off into their own separate dimension: a demographic dimension (or dimensions).

Force the attributes selected for the demographic dimension(s) to have a relatively small number of discrete values. You can also use ranges (like age < 20, 20 <= age < 30, etc.) rather than storing discrete values (like 18, 19, 20, 21); it all depends on the requirement.

Build up the demographic dimension with all possible discrete attribute combinations.

Construct a surrogate demographic key for this dimension.

Demographic attributes are among the most heavily used attributes; their values are often compared in order to identify interesting subsets.

Look at the picture below.
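In code form, here is a minimal sketch of the banding-plus-combinations idea, assuming just two demographic attributes (an age band and an income band); the bands and names are invented for illustration.

```python
from itertools import product

# Banded values, per the advice to store ranges rather than discrete values.
age_bands = ["<20", "20-29", "30-49", "50+"]
income_bands = ["low", "medium", "high"]

# Pre-build every possible combination and give each row a surrogate key.
demographic_dim = [
    {"demographic_key": key, "age_band": age, "income_band": income}
    for key, (age, income) in enumerate(product(age_bands, income_bands), start=1)
]
print(len(demographic_dim))   # 12 rows, however many customers there are

def age_band_of(age):
    """Band a raw age: a change in age now only moves the fact row's
    demographic_key instead of updating the large customer dimension."""
    if age < 20:
        return "<20"
    if age < 30:
        return "20-29"
    if age < 50:
        return "30-49"
    return "50+"
```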


Minidimension
A minidimension is derived from a dimension table by taking the most heavily used attributes, extracting all possible combinations of their values, and loading them into the minidimension. (These attributes are stored both in the minidimension and in the base dimension.) This speeds up query processing because the minidimension contains far less data. Also, if a reporting tool exposes a parameter or dropdown on one of these columns, fetching the values from the minidimension is much cheaper. Example: a customer dimension may have millions of records, but the combination of state and city has far fewer distinct values. If 40% of queries output data at the state/city level, create a separate table from the state and city columns, give it its own surrogate key, and store that key in the fact table(s) so those queries can bypass the join between the fact table and the full customer dimension.

Degenerate dimension
A degenerate dimension is represented by a dimension key attribute (or attributes) with no corresponding dimension table. It usually occurs in line-item oriented fact table designs. In the picture below, "PO_NUMBER" is the degenerate dimension.


Junk dimension
When a number of miscellaneous flags and text attributes exist, the following design alternatives should be avoided:

i) Leaving the flags and attributes unchanged in the fact table record.
ii) Making each flag and attribute into its own separate dimension.
iii) Stripping all of these flags and attributes out of the design.

A better alternative is to create a junk dimension. A junk dimension is a convenient grouping of flags and attributes that gets them out of the fact table and into a useful dimensional framework.

Heterogeneous products
Some products have many, many distinguishing attributes and many possible permutations (usually on the basis of some customised offer). This results in immense product dimensions and poor browsing performance. In order to deal with this:

Fact tables with accompanying product dimensions can be created for each product type - these are known as custom fact tables.

Primary core facts on the product types are kept in a core fact table (but can also be copied to the conformed fact tables).

Example: look at the first picture below. An account can be of different types (savings, current, credit, etc.), and there are different sets of attributes for each account type, which makes this dimension very large. Different account types are also associated with different "facts" in the fact tables.


The picture below shows how to handle this.

Many-to-many relationships & bridge tables
Every object in this universe is an entity (mainly of 3 types: business event entities, component entities, and classification entities, discussed later in the data modelling section), and to keep things running there must be relations between them. A many-to-many relation is one where one unit of an entity can be related to multiple units of another entity and vice versa. Example: one customer can purchase multiple products and can purchase from multiple stores; similarly, one product can be purchased in multiple stores, by multiple customers, and so on. To satisfy such a relationship in a DW you need to create a separate table, which in data modelling terms is called a bridge table. A bridge table contains only the keys between the two tables in a many-to-many relationship: mostly key fields, the primary keys of the two tables needing the many-to-many relationship. Sometimes there are more data elements; if some attributes are a function of, say, store, customer and product together, they are stored here as well. In a DW this is what a fact table is.

Role-playing
A dimension can play different roles in a subject area of the business. For example, for a product company, distributors purchase goods from the company's distribution centres and then sell goods to retailers. From the company's sales perspective, both distributors and retailers are its customers, so in the sales data mart design there is a single customer dimension into which distributor and retailer information is loaded. Depending on the sales transaction (purchase by a distributor, sale by a distributor), this customer dimension acts as the "buying customer dimension" or the "selling customer dimension".

Data warehouse bus architecture & matrix
While designing an enterprise data warehouse, a big data mart consisting of multiple subject areas, it becomes very difficult to remember what the dimensions and facts are and how they are related. Yes, you can create a data model for each subject-area data mart and integrate them, but the result becomes too large to take in at a glance. While separate fact tables in separate data marts represent the data from each process, the models share several common business dimensions, namely date, product, and store. Using shared, common dimensions is absolutely critical to designing data marts that can be integrated; they allow us to combine performance measurements from different processes in a single report. We use multipass SQL to query each data mart separately, and then we outer join the query results based on a common dimension attribute. This linkage, often referred to as drill across, is straightforward if the dimension table attributes are identical. By defining a standard bus interface for the data warehouse environment, separate data marts can be implemented by different groups at different times, and the separate data marts can be plugged together and usefully coexist if they adhere to the standard. The team designs a master suite of standardized dimensions and facts that have uniform interpretation across the enterprise; this establishes the data architecture framework. We then tackle the implementation of separate data marts, with each iteration closely adhering to the architecture. As the separate data marts come on line, they fit together like the pieces of a puzzle, and at some point enough data marts exist to make good on the promise of an integrated enterprise data warehouse. Creating the data warehouse bus matrix is one of the most important up-front deliverables of a data warehouse implementation: it is a hybrid resource that is part technical design tool, part project management tool, and part communication tool. The picture below represents a typical data warehouse bus architecture.
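Here is a small sketch of that drill-across pattern: two hypothetical query results, one from a sales mart and one from an inventory mart, are combined with a full outer join on the conformed product attribute.

```python
# Each "query result" is keyed by the conformed dimension attribute (product name).
sales_by_product = {"widget": 1200.0, "gadget": 800.0}      # pass 1: sales mart
inventory_by_product = {"widget": 55, "doohickey": 20}      # pass 2: inventory mart

# Full outer join on the common attribute, as a multipass-SQL client tool would do.
all_products = sales_by_product.keys() | inventory_by_product.keys()
drill_across = {
    p: {"sales_amount": sales_by_product.get(p),
        "units_on_hand": inventory_by_product.get(p)}
    for p in sorted(all_products)
}
for product_name, row in drill_across.items():
    print(product_name, row)   # None marks products absent from one mart
```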

Data cube
When we try to extract information from a stack of data, we need tools to help us find what's relevant and what's important and to explore different scenarios. A report, whether printed on paper or viewed on-screen, is at best a two-dimensional representation of data: a table of columns and rows. That's sufficient when we have only two factors to consider, but in the real world we need more powerful tools. Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as a set of similarly structured 2-D tables stacked on top of one another. But data cubes aren't restricted to three dimensions. Most online analytical processing (OLAP) systems can build data cubes with many more dimensions (Microsoft SQL Server 2000 Analysis Services, for example, allows up to 64). In practice, therefore, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions. Below is a pictorial view of a multidimensional data cube.
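To make the idea concrete, here is a minimal sketch: cube cells keyed by a tuple of dimension values, with a slice operation pulling out one 2-D table at a time. The dimension values are invented.

```python
# A tiny 3-D cube: (product, store, month) -> units sold.
cube = {
    ("widget", "north", "Jan"): 10, ("widget", "north", "Feb"): 12,
    ("widget", "south", "Jan"): 7,  ("gadget", "north", "Jan"): 5,
}

def slice_by_month(cube, month):
    """Fix one dimension (month) to view the remaining 2-D product x store table."""
    return {(p, s): v for (p, s, m), v in cube.items() if m == month}

print(slice_by_month(cube, "Jan"))
# {('widget', 'north'): 10, ('widget', 'south'): 7, ('gadget', 'north'): 5}
```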


MOLAP (multidimensional OLAP) vs. ROLAP (relational OLAP)
Since data cubes are such a useful interpretation tool, most OLAP products are built around a structure in which the cube is modelled as a multidimensional array. These multidimensional OLAP, or MOLAP, products typically run faster than other approaches, primarily because it's possible to index directly into the data cube's structure to collect subsets of data. However, for very large data sets with many dimensions, MOLAP solutions aren't always so effective. As the number of dimensions increases, the cube becomes sparser; that is, many cells representing specific attribute combinations are empty, containing no aggregated data. As with other types of sparse databases, this tends to increase storage requirements, sometimes to unacceptable levels. Compression techniques can help, but using them tends to destroy MOLAP's natural indexing.

Data cubes can be built in other ways. Relational OLAP uses the relational database model: the ROLAP data cube is implemented as a collection of relational tables (up to twice as many as the number of dimensions) instead of as a multidimensional array. Each of these tables, called a cuboid, represents a particular view. Because the cuboids are conventional database tables, we can process and query them using traditional RDBMS techniques, such as indexes and joins. This format is likely to be efficient for large data collections, since the tables need only include data cube cells that actually contain data. However, ROLAP cubes lack the built-in indexing of a MOLAP implementation. Instead, each record in a given table must contain all attribute values in addition to any aggregated or summary values. This extra overhead may offset some of the space savings, and the absence of an implicit index means that we must provide one explicitly.

From a structural perspective, data cubes are made up of two elements: dimensions and measures/facts. I've already explained dimensions; measures/facts are simply the actual data values. It's important to keep in mind that the data in a data cube has already been processed and aggregated into cube form, so we normally don't perform calculations within a data cube. This also means that we're not looking at real-time, dynamic data in a data cube. The data contained within a cube has already been summarized to show figures such as unit sales, store sales, regional sales, net sales profit and average time for order fulfilment. With this data, an analyst can efficiently analyse any or all of those figures for any or all products, customers, sales agents and more. Thus data cubes can be extremely helpful in establishing trends and analysing performance. In contrast, tables are best suited to reporting standardized operational scenarios.
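To illustrate the cuboid idea, the sketch below materializes every group-by view of a tiny invented fact set. With n dimensions there are 2^n cuboids; the empty grouping is the apex cuboid holding the grand total.

```python
from itertools import combinations
from collections import defaultdict

dims = ("product", "store", "month")
facts = [
    {"product": "widget", "store": "north", "month": "Jan", "amount": 10.0},
    {"product": "widget", "store": "south", "month": "Jan", "amount": 7.0},
    {"product": "gadget", "store": "north", "month": "Feb", "amount": 5.0},
]

# One cuboid per subset of dimensions: 2^3 = 8 group-by views in total.
cuboids = {}
for r in range(len(dims) + 1):
    for keep in combinations(dims, r):
        view = defaultdict(float)
        for row in facts:
            view[tuple(row[d] for d in keep)] += row["amount"]
        cuboids[keep] = dict(view)

print(cuboids[("product",)])   # {('widget',): 17.0, ('gadget',): 5.0}
print(cuboids[()])             # {(): 22.0} -- the apex cuboid, the grand total
```

A ROLAP engine would store only the cuboids (and cells) that actually contain data, which is why the approach scales to sparse data sets.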


OLAP servers
Relational OLAP (ROLAP) servers:

These are the intermediate servers that stand in between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.

ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology.

Advantages: Can handle large amounts of data

Disadvantages: Performance is slow.

Multidimensional OLAP (MOLAP) servers:

These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures.

With multidimensional data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored.

Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: denser sub-cubes are identified and stored as array structures, whereas sparse sub-cubes employ compression technology for efficient storage utilization.

Advantages: Faster indexing to pre-computed summarized data.

Disadvantages: Can handle only a limited amount of data.

Hybrid OLAP (HOLAP) servers:

The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP.

For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. Microsoft SQL Server 2000 Analysis Services supports such a hybrid OLAP configuration.

Roll-up & drill-down
Roll-up:

The roll-up operation (also called the drill-up operation) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.

The figure below shows the result of a roll-up operation performed on the central cube by climbing up the concept hierarchy for location. This hierarchy was defined as the total order "street < city < province or state < country." The roll-up operation shown aggregates the data by ascending the location hierarchy from the level of city to the level of country. In other words, rather than grouping the data by city, the resulting cube groups the data by country.

When roll-up is performed by dimension reduction, one or more dimensions are removed from the given cube. For example, consider a sales data cube containing only the two dimensions location and time. Roll-up may be performed by removing, say, the time dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.

Drill-down:

Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.

Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions.

The figure shows the result of a drill-down operation performed on the central cube by stepping down a concept hierarchy for time defined as "day < month < quarter < year." Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month. The resulting data cube details the total sales per month rather than summarizing them by quarter.


Because a drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a cube. For example, a drill-down on the central cube of the figure can occur by introducing an additional dimension, such as customer group.
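A minimal sketch of roll-up along a concept hierarchy, using an invented city-to-country mapping; drill-down is simply the reverse, returning to the more detailed city-level table.

```python
from collections import defaultdict

city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}
sales_by_city = {"Vancouver": 100.0, "Toronto": 150.0, "Chicago": 200.0}

# Roll-up: ascend the location hierarchy from city to country.
sales_by_country = defaultdict(float)
for city, amount in sales_by_city.items():
    sales_by_country[city_to_country[city]] += amount
print(dict(sales_by_country))   # {'Canada': 250.0, 'USA': 200.0}

# Roll-up by dimension reduction: remove the location dimension entirely.
print(sum(sales_by_city.values()))   # 450.0
```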


Architecture of a Data Warehouse

Data warehouses often adopt a three-tier architecture, as presented in the figure.

Bottom Tier:

The bottom tier is a warehouse database server that is almost always a relational database system.

Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants).

These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse.

The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include Microsoft's ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database), and JDBC (Java Database Connectivity).

This tier also contains a metadata repository, which stores information about the data warehouse and its contents.

Middle Tier:

The middle tier is an OLAP server that is typically implemented using either
i) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or
ii) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.

Top Tier:

The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).


The design process of a data warehouse / data mart
Kimball's design method is a "first principles" approach based on analysis of user query requirements. It begins by identifying the relevant "facts" that need to be aggregated and the dimensional attributes to aggregate by, and then forms star schemas based on these. It results in a data warehouse design that is a set of discrete star schemas. However, there are a number of practical problems with this approach:

User analysis requirements are highly unpredictable and subject to change over time, which provides an unstable basis for design.

It can lead to incorrect designs if the designer does not understand the underlying relationships in the data.

It results in loss of information through premature aggregation, which limits the ways in which data can be analysed.

The approach described in this paper overcomes these problems by using an enterprise data model as the basis for data warehouse design. This makes use of the relationships in the data which have already been documented, and provides a much more structured approach to developing a data warehouse design.


Central Data Warehouse

This represents the "wholesale" level of the data warehouse, which is used to supply data marts with data. The most important requirement of the central data warehouse is that it provides a consistent, integrated and flexible source of data.

Data Mart Design

Data marts represent the "retail" level of the data warehouse (a subset), where data is accessed directly by end users. Data is extracted from the central data warehouse into data marts to support particular analysis requirements. The most important requirement at this level is that data is structured in a way that is easy for users to understand and use. For this reason, dimensional modelling techniques are most appropriate at this level. This ensures that data structures are as simple as possible in order to simplify user queries.

From Entity Relationship Model to Dimensional Model

This section describes a method for developing dimensional models from Entity Relationship models. Before getting deep into the ocean, let me give a brief high-level overview of how it can be done.

1) Gain a thorough understanding of the business: requirement analysis, data analysis.
2) Identify the entities: business event entities (transaction entities), classification entities (e.g., product type, product group), and component entities (e.g., product code).
3) For each business event entity, build a logical model (ER diagram) satisfying the requirements. It will show the relations between that business event entity and the other (classification, component) entities.
4) Once step 3 is done for all business event entities, join all the logical models (ER diagrams). This full picture is your LOGICAL DATA MODEL.
5) While performing step 3, you should identify the attributes of each entity, satisfying the requirements. Once step 4 is done, place the attributes on the entities according to the LOGICAL DATA MODEL, build the relationships accordingly, and denote PKs and FKs. Your PHYSICAL DATA MODEL is now ready. At this point you have built the OLTP model; the next step is to derive the OLAP model from it.
6) It's pretty simple: in the PHYSICAL DATA MODEL, every business event entity will act as a fact table, or sometimes as a dimension. All the component entities will act as dimension tables. If a business event entity acts as a dimension, the component entities associated with it become snowflake dimensions.
7) All the classification entities are merged into the component entities; sometimes this forms a hierarchy in a dimension. Merging classification entities into component entities increases the attributes of the component entities. This is denormalization.
8) In some cases you will see that a PHYSICAL DATA MODEL has two business event entities related in a parent-child relationship, like Booking Master & Booking Detail, or Order & Order Line: there must be an entry in Booking Detail if there is an associated entry in Booking Master, and similarly for Order & Order Line.
9) As a rule of thumb, the parent table should be merged into the child table, but that is not always the case. If both tables have a huge volume of data and some reports need only the parent while others need the child, we might keep parent and child separate and relate all the dimensions that are related to the parent directly to the child. Two physical stars will then be present, and the parent-child relationship stays intact.
10) Remember: the more carefully the entities are identified, the simpler the design will be to understand, and the more robust it will be to new changes. Maintenance will be easier and DML anomalies will be reduced.

Now let's get deep into it. Assume we have the data model of an OLTP system (sales) shown in the picture below. Your OLTP model is ready and functioning well, and you now need to derive an OLAP model from it. The highlighted attributes indicate the business keys / primary key of each entity.


Such a model is typical of the data models used by operational (OLTP) systems, and it is well suited to a transaction processing environment. It contains no redundancy, thus maximising the efficiency of DML, and it explicitly shows all the data and the relationships between them. Unfortunately, most decision makers would find this schema incomprehensible: even quite simple queries require multi-table joins and complex subqueries. As a result, end users would depend on technical specialists to write queries for them. Let's move on to building the dimensional model.

Step 1. Gain a thorough understanding of the business requirements: what kinds of analytics will run day to day, the daily fixed reports, parameter values, data growth, source systems, and so on.

Step 2. Understand the data; this requires data analysis. Identify all the source tables required to satisfy the requirements, and validate the business rules against them.

Step 3. Classify Entities
The first step in producing a dimensional model from an Entity Relationship model is to classify the entities into three categories:

Transaction Entities / Business Event Entities
Transaction entities record details about particular events that occur in the business, for example orders, insurance claims, salary payments and hotel bookings. Invariably, it is these events that decision makers want to understand and analyse. The key characteristics of a transaction entity are:

It describes an event that happens at a point in time.

It contains measurements or quantities that may be summarised, e.g. dollar amounts, weights, volumes. For example, an insurance claim records a particular business event and (among other things) the amount claimed.
Transaction entities are the most important entities in a data warehouse / data mart, and form the basis for constructing fact tables in star schemas. Not all transaction entities will be of interest for decision support, so user input will be required to identify which transactions are important.


Component Entities
A component entity is one which is directly related to a transaction entity via a one-to-many relationship. Component entities define the details or "components" of each business transaction; they answer the "who", "what", "when", "where", "how" and "why" of a business event. For example, a sales transaction may be defined by a number of components:

Customer: who made the purchase?

Product: what was sold?

Location: where was it sold?

Period: when was it sold?
An important component of any transaction is time, since historical analysis is an important part of any data warehouse. Component entities form the basis for constructing dimension tables in star schemas.

Classification Entities
Classification entities are entities related to component entities by a chain of one-to-many relationships; that is, they are functionally dependent on a component entity (directly or transitively). Classification entities represent hierarchies embedded in the data model, which may be collapsed into component entities to form dimension tables in a star schema.

The picture below shows the classification of the entities in the example data model. In the diagram:

Black entities represent Transaction entities

Grey entities indicate Component entities

White entities indicate Classification entities

Resolving Ambiguities
In some cases, entities may fit into multiple categories, so a precedence hierarchy is defined to resolve such ambiguities:
1. Transaction entity (highest precedence)
2. Classification entity
3. Component entity (lowest precedence)
For example, if an entity can be classified as either a classification entity or a component entity, it should be classified as a classification entity. In practice, some entities will not fit into any of these categories; such entities do not fit the hierarchical structure of a dimensional model and cannot be included in star schemas. This is where real-world data sometimes does not fit the star schema "mould".


Step 4. Identify Hierarchies
Hierarchies are an extremely important concept in dimensional modelling, and form the primary basis for deriving dimensional models from Entity Relationship models. As discussed previously, most dimension tables in star schemas contain embedded hierarchies. A hierarchy in an Entity Relationship model is any sequence of entities joined together by one-to-many relationships, all aligned in the same direction. The picture below shows a hierarchy extracted from the example data model, with State at the top and Sale Item at the bottom.

In hierarchical terminology:

State is the “parent” of Region.

Region is the “child” of State.

Sale Item, Sale, Location and Region are all “descendants” of State.

Sale, Location, Region and State are all “ancestors” of Sale Item.

Maximal Hierarchies
A hierarchy is called maximal if it cannot be extended upwards or downwards by including another entity. In all, there are 14 maximal hierarchies in the example data model:

Customer Type-Customer-Sale-Sale Fee

Customer Type-Customer-Sale-Sale Item

Fee Type-Sale Fee

Location Type-Location-Sale-Sale Fee

Location Type-Location-Sale-Sale Item

Period (posted)-Sale-Sale Fee

Period (posted)-Sale-Sale Item

Period (sale)-Sale-Sale Fee

Period (sale)-Sale-Sale Item

Product Type-Product-Sale Item

State-Region-Customer-Sale-Sale Fee

State-Region-Customer-Sale-Sale Item

State-Region-Location-Sale-Sale Fee

State-Region-Location-Sale-Sale Item

An entity is called minimal if it is at the bottom of a maximal hierarchy and maximal if it is at the top of one. Minimal entities are easily identified as entities with no outgoing one-to-many relationships ("leaf" entities in hierarchical terminology), while maximal entities are entities with no many-to-one relationships ("root" entities). In the example data model there are:

Two minimal entities: Sale Item and Sale Fee

Six maximal entities: Period, Customer Type, State, Location Type, Product Type and Fee Type.
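The maximal hierarchies can also be enumerated mechanically. The sketch below treats the model as a directed graph of one-to-many relationships (parent to child) and walks every root-to-leaf path; only a fragment of the example model is encoded, so it prints a subset of the 14 hierarchies listed above.

```python
# One-to-many relationships, parent -> children (a fragment of the example model).
children = {
    "Customer Type": ["Customer"],
    "Customer": ["Sale"],
    "Product Type": ["Product"],
    "Product": ["Sale Item"],
    "Sale": ["Sale Item", "Sale Fee"],
}
all_children = {c for kids in children.values() for c in kids}
roots = [e for e in children if e not in all_children]   # maximal entities

def paths(entity):
    """Yield every root-to-leaf path (i.e., every maximal hierarchy) below entity."""
    kids = children.get(entity, [])
    if not kids:                       # minimal entity: bottom of a hierarchy
        yield [entity]
        return
    for kid in kids:
        for tail in paths(kid):
            yield [entity] + tail

for root in roots:
    for p in paths(root):
        print("-".join(p))
# Customer Type-Customer-Sale-Sale Item
# Customer Type-Customer-Sale-Sale Fee
# Product Type-Product-Sale Item
```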


Hierarchy Collapse
Higher-level entities can be "collapsed" into lower-level entities within hierarchies. The pictures below show how one entity collapses into another, and so on: here the State entity is collapsed into the Region entity, after which the Region entity contains its original attributes plus the attributes of the collapsed table. This introduces redundancy in the form of a transitive dependency, which is a violation of third normal form (Codd, 1970); collapsing a hierarchy is therefore a form of denormalisation. Note: this collapse varies between the different dimensional design techniques, which will be discussed later.

Continuing the collapse, we reach the bottom of the hierarchy and end up with a single table (Sale Item). But, as noted above, the extent of this collapse varies between the different dimensional design techniques.
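As a rough illustration of the collapse step, assuming toy State and Region tables with hypothetical column names, the following Python sketch shows Region absorbing the attributes of State:

    # Minimal sketch of collapsing State into Region (denormalisation).
    # Column names are hypothetical; state_id links each region to its state.
    states = {1: {"state_name": "Karnataka"}, 2: {"state_name": "Kerala"}}
    regions = [
        {"region_id": 10, "state_id": 1, "region_name": "Bangalore Urban"},
        {"region_id": 11, "state_id": 1, "region_name": "Mysore"},
        {"region_id": 12, "state_id": 2, "region_name": "Kochi"},
    ]

    # Collapse: each Region row now carries its State attributes as well.
    collapsed = [{**r, **states[r["state_id"]]} for r in regions]

    # state_name now depends transitively on region_id via state_id, which
    # violates third normal form -- the price paid for needing fewer joins.
    for row in collapsed:
        print(row)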

Step 5. Identify Granularity of Aggregation
One of the most critical decisions in star schema design is choosing the appropriate level of granularity, that is, the level of detail at which data is stored. In technical terms, granularity is defined by what each row in the fact table represents. At the top level, there are two main options in choosing the level of granularity:

Unsummarized (transaction-level granularity): this is the highest level of granularity, where each fact table row corresponds to a single transaction or line item.


Summarized: transactions may be summarized by a subset of dimensions or dimensional attributes. In this case, each row in the fact table corresponds to multiple transactions.

The lower the level of granularity (or, equivalently, the higher the level of summarization), the less storage space is required and the faster queries will execute. The downside is that summarization always loses information and therefore limits the types of analyses that can be carried out. Transaction-level granularity provides maximum flexibility for analysis, as no information is lost from the original normalized model.

The aggregation operator can be applied to a transaction entity to create a new entity containing summarised data. A subset of attributes is chosen from the source entity to aggregate (the aggregation attributes), and another subset of attributes is chosen to aggregate by (the grouping attributes). Aggregation attributes must be numerical quantities.

The level of summarization is entirely requirement-driven.

For example, most analytics run at the level of sales per product; only a few reports need to run at the level of individual sale line transactions. So we could apply the aggregation operator to the Sale Item entity to create a new entity called Product Summary (the Product Summary table in the picture below). This aggregated entity shows, for each product on a daily basis, the total sales amount (quantity * price), the average quantity per order and the average price per item. The aggregation attributes are quantity and price, while the grouping attributes are Product ID and Date. The key of this entity is the combination of the grouping attributes.
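A minimal sketch of this aggregation, using Python's built-in sqlite3 module and hypothetical table and column names:

    import sqlite3

    # Minimal sketch of the aggregation operator applied to Sale Item.
    # Table and column names are hypothetical.
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE sale_item (
        sale_id INTEGER, product_id INTEGER, sale_date TEXT,
        quantity INTEGER, price REAL)""")
    con.executemany(
        "INSERT INTO sale_item VALUES (?, ?, ?, ?, ?)",
        [(1, 100, "2024-01-01", 2, 9.99),
         (2, 100, "2024-01-01", 1, 9.99),
         (3, 200, "2024-01-01", 5, 3.50)])

    # Grouping attributes: product_id, sale_date.
    # Aggregation attributes: quantity, price.
    for row in con.execute("""
        SELECT product_id, sale_date,
               SUM(quantity * price) AS total_sales_amount,
               AVG(quantity)         AS avg_quantity_per_order,
               AVG(price)            AS avg_price_per_item
        FROM sale_item
        GROUP BY product_id, sale_date"""):
        print(row)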

A common misconception is that star schemas always contain summarized data. This is not necessarily the case, and is not always desirable. However, for performance reasons when dealing with large data volumes, or to simplify the view of data for a particular set of users, some level of summarization may be necessary. There is an almost infinite range of possibilities for creating summary-level star schemas for a given transaction entity: in general, any combination of dimensions or dimensional attributes may be used to summarize transactions.

Step 6. Dimensional Design Techniques
There is a wide range of options for producing dimensional models from an Entity Relationship model. These include:

Flat schema

Terraced schema

Star schema

Snowflake schema

Starflake schema

Each of these options represents a different trade-off between complexity and redundancy.


Flat Schema
A flat schema is the simplest schema possible without losing information. It is formed by collapsing all entities in the data model down into the minimal entities. This minimises the number of tables in the database, and therefore the possibility that joins will be needed in user queries. In a flat schema we end up with one table for each minimal entity in the original data model. The figure below shows the flat schema which results from the example data model.

Such a schema is similar to the "flat files" used by analysts working with statistical packages such as SAS and SPSS. Note that this structure does not lose any information from the original data model. It contains redundancy, in the form of transitive and partial dependencies, but does not involve any aggregation. One problem with a flat schema is that it may lead to aggregation errors when there are hierarchical relationships between transaction entities. When we collapse numerical amounts from higher-level transaction entities into lower-level ones, those amounts are repeated. In the example data model, if a Sale consists of three Sale Items, the discount amount will be stored in three different rows of the Sale Item table. Adding the discount amounts together then results in double counting (or, in this case, triple counting), as demonstrated below. Another problem with flat schemas is that they tend to result in tables with large numbers of attributes, which may be unwieldy. While the number of tables (system complexity) is minimised, the complexity of each table (element complexity) is greatly increased.
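A minimal Python sketch of the pitfall, with hypothetical rows (the sale-level discount of 5.0 is repeated on every item row of the same sale):

    # Minimal sketch of the double-counting problem in a flat schema.
    # The sale-level discount (5.0 for sale 1) repeats on each item row.
    flat_sale_item = [
        {"sale_id": 1, "item": "A", "amount": 10.0, "sale_discount": 5.0},
        {"sale_id": 1, "item": "B", "amount": 20.0, "sale_discount": 5.0},
        {"sale_id": 1, "item": "C", "amount": 30.0, "sale_discount": 5.0},
    ]

    naive = sum(r["sale_discount"] for r in flat_sale_item)
    print(naive)  # 15.0 -- triple counted; the true discount is 5.0

    # Correct approach: take the discount once per sale before summing.
    per_sale = {r["sale_id"]: r["sale_discount"] for r in flat_sale_item}
    print(sum(per_sale.values()))  # 5.0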


Terraced Schema
A terraced schema is formed by collapsing entities down maximal hierarchies, stopping when a transaction entity is reached. This results in a single table for each transaction entity in the data model. The figure below shows the terraced schema that results from the example data model. This schema is less likely to cause problems for an inexperienced user, because the separation between levels of transaction entities is shown explicitly.

Star Schema
I have already discussed what a star schema is in a previous section; here we will go deeper into it. A star schema can be derived easily from an Entity Relationship model. Each star schema is formed in the following way:

A fact table is formed for each transaction entity.

The key of each fact table is a composite key consisting of the keys of all dimension tables plus any degenerate dimensions. This is different from how the key is defined for a normal relational (OLTP) table.

The non-key attributes of the fact table are measures (facts) that can be analysed using numerical functions. Which facts are defined depends on what event information is collected by the operational systems, that is, what attributes are stored in transaction entities. However, a key concept in defining facts is additivity.

For transaction entities connected in a "master-detail" structure, all attributes of the master record should be allocated down to the item level if possible (Kimball, 1996). For example, if a discount is defined at the master (Order) level, the total discount amount should be allocated to the item level (e.g. in proportion to the price of each item); otherwise, queries will result in multiple counting of discounts. The same should be done for delivery charges, order-level taxes, fees, etc. If attributes of the master entity cannot be allocated down to the item level, separate star schemas may be required for each.


But again, as mentioned earlier, all design is requirement-driven: if you club master and detail together into a single table and a good number of reports or analytics run at the master level, query processing time will increase because of the huge data volume in that single table.
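A minimal Python sketch of the allocation idea (names and numbers are hypothetical): the master-level discount is spread across item rows in proportion to each item's extended price, so item-level sums reproduce the master total:

    # Minimal sketch: allocate a master-level discount down to item level,
    # in proportion to each item's extended price. Numbers are hypothetical.
    sale_discount = 6.0
    items = [
        {"item": "A", "price": 10.0, "quantity": 1},
        {"item": "B", "price": 20.0, "quantity": 1},
    ]

    total = sum(i["price"] * i["quantity"] for i in items)
    for i in items:
        share = (i["price"] * i["quantity"]) / total
        i["allocated_discount"] = round(sale_discount * share, 2)

    # A gets 2.0 and B gets 4.0; summing allocated_discount over any group
    # of items now reproduces the sale-level total without double counting.
    print(items)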

In some cases, fact tables may contain no numerical facts at all. For example, in a star schema that records the incidence of crimes, the fact table may record when and where a crime took place, the type of offense committed, and who committed it. However, there are no measures involved; the important information is simply that the crime took place. This is called a factless fact table. Other examples of factless fact tables include traffic accidents and disease statistics. The relevant aggregation operation for such types of events is COUNT, which can be used to analyze the frequency of events over time. It is therefore better to create a count column storing the constant value 1, so that additive SUMs can speed up the aggregation.
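A minimal sqlite3 sketch of a factless fact table with the constant count column (the schema is hypothetical):

    import sqlite3

    # Minimal sketch of a factless fact table (crime incidents) with a
    # constant count column, so SUM(event_count) can replace COUNT(*).
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE fact_crime (
        period_key INTEGER, location_key INTEGER,
        offense_key INTEGER, offender_key INTEGER,
        event_count INTEGER DEFAULT 1)""")
    con.executemany(
        "INSERT INTO fact_crime VALUES (?, ?, ?, ?, 1)",
        [(1, 10, 5, 99), (1, 10, 5, 98), (2, 11, 7, 97)])

    # Frequency of events per period: a simple additive SUM.
    for row in con.execute("""SELECT period_key, SUM(event_count)
                              FROM fact_crime GROUP BY period_key"""):
        print(row)  # (1, 2) then (2, 1)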

A dimension table is formed for each component entity, by collapsing hierarchically related classification entities into it. That is, classification entities are collapsed down the hierarchy until they reach the component entity at the bottom.

The key of each dimension table should be a simple (single-attribute) numeric key, variously called a surrogate key, integer key, synthetic key or dimensional key. In most cases, this will just be the key of the underlying component entity.

However, the operational key (business key) needs to be generalized to ensure that it remains unique over time. Operational systems often require key uniqueness only at a point in time (i.e. keys may be reused over time), which can cause problems when performing historical analysis. Another situation where the dimensional key needs to be generalized is the case of slowly changing dimensions.
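One common way to generalize the key, sketched below in Python with hypothetical names, is to let the warehouse assign its own surrogate key and keep the business key only as an attribute; a reused or changed business key then simply produces a new surrogate row, which is also the basis of slowly changing dimension handling:

    # Minimal sketch: a warehouse-assigned surrogate key decouples the
    # dimension from the operational (business) key, which may be reused
    # or change over time. Names are hypothetical.
    dim_customer = []
    next_key = 1

    def add_customer_version(business_key, name, valid_from):
        """Insert a new dimension row under a fresh surrogate key."""
        global next_key
        dim_customer.append({
            "customer_key": next_key,      # surrogate key owned by the DW
            "business_key": business_key,  # operational key, kept as attribute
            "name": name, "valid_from": valid_from})
        next_key += 1

    add_customer_version("C-42", "Acme Ltd", "2020-01-01")
    # The operational system later changes/reuses C-42: a new surrogate row
    # is added, so historical facts keep pointing at the old version.
    add_customer_version("C-42", "Acme Holdings", "2023-06-01")
    print(dim_customer)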

The figure below shows the star schema that results from the Sale transaction entity (business event entity). This star schema has four dimensions (physically three, but logically four, as the Period dimension plays two roles), each of which contains embedded hierarchies. Look at the sale_id column in the fact table: it is a degenerate dimension.
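As a minimal sqlite3 sketch of this star schema (measure and attribute columns are hypothetical; the source names only the four dimension roles and the sale_id degenerate dimension), the tables might be declared like this:

    import sqlite3

    # Minimal sketch of the Sale star schema described above. Measure and
    # attribute columns are hypothetical; dim_period is referenced twice
    # because the Period dimension plays two roles (sale date, posted date).
    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY,
                               customer_name TEXT, region_name TEXT, state_name TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY,
                               location_name TEXT, location_type TEXT);
    CREATE TABLE dim_period   (period_key INTEGER PRIMARY KEY,
                               calendar_date TEXT, month TEXT, year INTEGER);

    -- Composite key = dimension keys plus the degenerate dimension sale_id;
    -- the non-key attributes are additive measures.
    CREATE TABLE fact_sale (
        customer_key      INTEGER REFERENCES dim_customer,
        location_key      INTEGER REFERENCES dim_location,
        sale_period_key   INTEGER REFERENCES dim_period,
        posted_period_key INTEGER REFERENCES dim_period,
        sale_id           INTEGER,              -- degenerate dimension
        total_amount REAL, discount REAL,       -- measures (facts)
        PRIMARY KEY (customer_key, location_key,
                     sale_period_key, posted_period_key, sale_id));
    """)
    print("Sale star schema created")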

The Sale Item transaction entity (business event entity) can be depicted in the dimensional model as follows. The Sale Item entity is directly related to the Sale transaction entity and the Product component entity, as shown below.


Now, as the Sale transaction entity is related to the Period, Customer and Location dimensions, and Sale and Sale Item are related in a "master-detail" (one-to-many) relation, all the dimensions related to Sale can be brought down to Sale Item. This helps bypass the join between Sale and Sale Item when the requirement concerns the Sale Item entity alone. A similar design approach should be taken for the Sale Fee transaction entity. Now, instead of a number of discrete star schemas, the example data model can be transformed into a constellation schema, also known as a galaxy. A constellation schema consists of a set of star schemas with hierarchically linked fact tables. The links between the various fact tables provide the ability to "drill down" between levels of detail (e.g. from Sale to Sale Item). The constellation schema which results from the example data model is shown in the picture below.
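A minimal sqlite3 sketch of the drill-down link (tables and values are hypothetical): the item-level fact carries the sale_id of its parent sale-level fact, so queries can move between the two levels of detail:

    import sqlite3

    # Minimal sketch of a constellation: two hierarchically linked fact
    # tables. fact_sale_item carries sale_id, so queries can drill down
    # from Sale to Sale Item. Tables and values are hypothetical.
    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE fact_sale      (sale_id INTEGER, customer_key INTEGER, total REAL);
    CREATE TABLE fact_sale_item (sale_id INTEGER, product_key INTEGER,
                                 quantity INTEGER, price REAL);
    INSERT INTO fact_sale VALUES (1, 7, 50.0);
    INSERT INTO fact_sale_item VALUES (1, 100, 2, 10.0), (1, 200, 3, 10.0);
    """)

    # Drill down from a sale-level row to its item-level detail.
    for row in con.execute("""
        SELECT s.sale_id, s.total, i.product_key,
               i.quantity * i.price AS line_amount
        FROM fact_sale s JOIN fact_sale_item i ON i.sale_id = s.sale_id"""):
        print(row)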

The snowflake and starflake techniques have already been discussed.


Conclusion

We have discussed a method for developing data warehouse and data mart designs from an enterprise data model. The method has now been applied in a wide range of industries, including manufacturing, health, insurance and banking, and has evolved considerably as a result of experience in practice. At a high level, the steps of the method are:
1. Develop Enterprise Data Model (if one doesn't exist already).
2. Design Central Data Warehouse: this will be closely based on the enterprise data model, but will be the subset of the model which is relevant for decision-support purposes. A staged approach is recommended for implementing the central data warehouse, starting with the most important subject areas.
3. Classify Entities: classify entities in the central data warehouse model as transaction, component or classification entities.
4. Identify Hierarchies: identify the hierarchies which exist in the data model.
5. Design Data Marts: develop star cluster schemas for each transaction entity in the central data warehouse model. Each star cluster will consist of a fact table and a number of dimension and sub-dimension tables. This minimises the number of tables while avoiding overlap between dimensions. The separate star clusters may be combined to form constellations or galaxies.

Design Options

We have identified a range of options for developing data marts to support end-user queries from an enterprise data model. These options represent different trade-offs between the number of tables (complexity) and the amount of redundancy.

Implications for Data Warehouse Design Practice

The advantages of this approach are:

It provides a more structured approach to developing dimensional models than other working principles.

It ensures that the data marts and the central data warehouse reflect the underlying relationships in the data.

Developing data warehouse and data mart designs based on a common enterprise data model simplifies extract and load processes.

An existing enterprise data model provides a useful basis for identifying information requirements in a bottom-up manner, based on what data exists in the enterprise. This can be usefully combined with Kimball's (1996) top-down analysis approach.

An enterprise data model provides a more stable basis for design than user query requirements, which are unpredictable and subject to frequent change.

It ensures that the central data warehouse is flexible enough to support the widest possible range of analysis requirements, by storing data at the level of individual transactions. Aggregating data above this level would reduce the granularity of data in the data warehouse, limiting the types of analyses which are possible.

It provides much more guidance to designers of data warehouses and data marts than other approaches. Careful analysis is still required to identify the entities in the enterprise data model which are relevant for decision making and to classify them.


However, once this has been done, the development of a dimensional model can take place in a relatively straightforward manner.

Using an Entity Relationship model of the data provides a much better starting point for developing dimensional models than starting from scratch, and can help avoid many of the pitfalls faced by inexperienced designers.