Unit 2 Dimensional Modeling & Data Warehouse Design


Upload: primrose-maxwell

Post on 03-Jan-2016



Factless fact table
A fact table is said to be factless (empty) if it has no measures to be displayed.
- The fact table represents events (e.g. a transaction).
- It contains no measure data, only keys.

Why Build a Dimensional Model

OLTP System | Dimensional Model
Process oriented | Subject oriented
Transactional | Aggregate
Current | Historic

We already have the data in a data model, so why create another data model? Well:

What is currently called Data Warehousing or Business Intelligence was originally often called Decision Support Systems

We already have all the data in the OLTP system, why replicate it in a dimensional model?

- Atomic vs. summary
- Supports transaction throughput vs. supports aggregate queries
- Current vs. historic

What is a Dimensional Model?
- A de-normalized database.
- Designed for ease of querying, not for transactional updates.
- Built to support aggregate queries.
- Modelled around business subject areas.
On the next slide we will show a simplistic example of a transactional schema and then one possible design for a corresponding dimensional model.

Facts & Dimensions
There are two main types of objects in a dimensional model:
- Facts are quantitative measures that we wish to analyse and report on.
- Dimensions contain textual descriptors of the business. They provide context for the facts.
Facts work best if they are additive.

Dimensions allow us to slice and dice the facts into meaningful groups. They provide context.

A Transactional Database
OrderDetails: OrderHeaderID, ProductID, Amount
OrderHeader: OrderHeaderID, CustomerID, OrderDate, FreightAmount
Products: ProductID, Description, Size
Customers: CustomerID, AddressID, Name
Addresses: AddressID, StateID, Street
States: StateID, CountryID, Desc
Countries: CountryID, Description
A simplistic transactional schema showing 7 tables relating to sales orders. Each ID column refers to the table it is named after.

A Dimensional Model
FactSales: CustomerID, ProductID, TimeID, SalesAmount
Products: ProductID, Description, Size, Subcategory, Category
Customers: CustomerID, Name, Street, State, Country
Time: TimeID, Date, Month, Quarter, Year
This is a star schema (later on we will discuss snowflake schemas) showing 4 tables that relate to the previous transactional schema. All IDs are placed at the centre, in the fact table, together with the factual data, i.e. SalesAmount. Fact tables are much larger than dimension tables.

State and Country have been denormalized under Customer.

Dimensions are shown in blue. These are the things that we analyse by (e.g. by Time, by Customer, by Region).

Facts are shown in yellow. These are usually the quantitative things that we are interested in.

Star Schema
factSales: ProductID, TimeID, CustomerID, SalesAmount
dimProduct: ProductID, ProductName, SubCategoryName, CategoryName
dimCustomer
dimTime
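The star schema above can be tried out with a small sketch. The following uses Python's built-in sqlite3 module; the sample rows, the reduced column lists and the helper name sales_by_category_and_year are assumptions made for illustration (dimCustomer is left out to keep it short), not part of the original slides.

```python
import sqlite3

# Hypothetical, cut-down version of the factSales / dimProduct / dimTime star.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimProduct (ProductID INTEGER PRIMARY KEY, ProductName TEXT, CategoryName TEXT);
CREATE TABLE dimTime (TimeID INTEGER PRIMARY KEY, Month INTEGER, Year INTEGER);
CREATE TABLE factSales (
    ProductID INTEGER REFERENCES dimProduct(ProductID),
    TimeID INTEGER REFERENCES dimTime(TimeID),
    SalesAmount REAL
);
INSERT INTO dimProduct VALUES (1, 'Road bike', 'Bikes'), (2, 'Helmet', 'Accessories');
INSERT INTO dimTime VALUES (10, 3, 2015), (11, 1, 2016);
INSERT INTO factSales VALUES (1, 10, 500.0), (2, 10, 40.0), (1, 11, 550.0);
""")


def sales_by_category_and_year(connection):
    """Sum the additive SalesAmount fact, grouped by two dimension attributes."""
    query = """
        SELECT p.CategoryName, t.Year, SUM(f.SalesAmount)
        FROM factSales AS f
        JOIN dimProduct AS p ON p.ProductID = f.ProductID
        JOIN dimTime AS t ON t.TimeID = f.TimeID
        GROUP BY p.CategoryName, t.Year
        ORDER BY p.CategoryName, t.Year
    """
    return connection.execute(query).fetchall()


rows = sales_by_category_and_year(conn)
```

Each dimension is reached with a single join from the fact table, which is the property that makes star schema queries easy to write.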

Snowflake Schema
factSales: ProductID, TimeID, CustomerID, SalesAmount
dimProduct: ProductID, SubcategoryID, Description
dimSubCategory: SubcategoryID, CategoryID, Description
dimCategory: CategoryID, Description
dimCustomer
CustAddress
dimTime

Designing Dimensional Model
Requirements to Design

Design decisions to be taken
- Choosing the process: deciding the subjects
- Choosing the grain
- Identifying and conforming dimensions
- Choosing the facts
- Choosing the duration of the database

Fact table
A fact table consists of the measurements, metrics or facts of a business process. It is located at the center of a star schema or a snowflake schema, surrounded by dimension tables. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys. Fact tables hold the content of the data warehouse and store different types of measures: additive, non-additive and semi-additive.

Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined.

E.g. the grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, product and store. Other dimensions might be members of this fact table (such as location/region) but these add nothing to the uniqueness of the fact records.

Building a Model - Facts
- You have to talk to the business.
- Identify facts by looking for quantitative values that are reported.
- Make sure the granularity is right.
Last point: Current Balance could be seen as a property of a bank account, but it may change rapidly over time (for a transaction account).
- Should OrderHeader be a dimension?
- Should it be a fact?
- Do we just need OrderNumber in the OrderDetail fact table?

Dimensional modeling basics

Formation of the automaker sales fact table

Formation of the automaker dimension tables

How much sales proceeds did the jeep Tata Mahindra, 2005 model with VXi options, generate in January 2000 at Spectra Auto dealership for buyers who owned their homes, financed by ICICI Prudential financing?

Tips for combining data into dimensional model
- Provide best data access
- Model should be query-centric
- Model should be optimized for queries and analyses
- Model should reveal the interactions between the dimension and fact tables
- There should be drilling down or rolling up along dimension hierarchies

STAR SCHEMA for automaker sales

ER Model v/s Dimension Model
- An ER diagram is a complex diagram used to represent multiple processes. A single ER diagram can be broken down into several DM diagrams.
- In DM we prefer keeping the tables de-normalized, whereas in an ER diagram our main aim is to remove redundancy.
- The ER model is designed to express microscopic relationships between elements. DM captures the business measures.
- DM is designed to answer queries on business processes, whereas the ER model is designed to record the business processes via their transactions.

Entity-Relationship vs. Dimensional Models
E-R DIAGRAM
- One table per entity
- Minimize data redundancy
- Optimized for update
- The transaction processing model
DIMENSIONAL MODEL
- One fact table for data organization
- Maximize understandability
- Optimized for retrieval
- The data warehousing model

Star Schema - example of order analysis

Query result

Understanding drill down analysis from the star schema

Dimension table
Contains information about a particular dimension.
- Dimension table key
- Table is wide
- Textual attributes
- Attributes not directly related
- Not normalized
- Drilling down, rolling up
- Multiple hierarchies
- Fewer records

Facts
- Numeric measurements (values) that represent a specific business aspect or activity
- Stored in a fact table at the center of the star schema
- Linked through their dimensions
- Can be computed or derived at run time
- Updated periodically with data from operational databases

Fact table
- Contains the primary information of the warehouse
- Concatenated key
- Data grain
- Fully additive measures
- Semi-additive measures (derived attributes)
- Table is deep, not wide
- Sparse data
- Degenerate dimensions (attributes which are neither a fact nor a dimension)

Star schema for a retail chain
Time Dimension Table: Time key, Year, Quarter, Month, Week, Date
Product Dimension Table: Product key, Name, Brand, Category, Colour, Price
Customer Dimension Table: Customer key, Name, Age, Income, Gender, Marital status
Store Dimension Table: Store key, Name, City, State, Op from year
Payment Mode Dimension Table: Mode key, Payment mode, Interest rate
Sales Fact Table: Time key, Product key, Customer key, Store key, Mode key, Actual sales, Forecast sales, Price, Discount

Star Schema characteristics
- A star schema is a relational model with a one-to-many relationship between each dimension table and the fact table.
- De-normalized relational model.
- Easy to understand. It reflects how users think, which makes it easy for them to query and analyse the data.
- Optimizes navigation.
- Enhances query extraction.
- Ability to drill down or roll up.

Data Granularity
- When the fact table is at the lowest grain, users can drill down to the lowest level of detail.
- But when data is kept at the lowest level, we have to compromise on the storage and maintenance of the DW.
Advantages
- Easier to extract from operational data and load into the DW
- Can be fed directly to the DM application

Snowflake schema
A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape. It is represented by centralized fact tables which are connected to multiple dimensions.
"Snowflaking" is a method of normalising the dimension tables in a star schema. When it is completely normalised along all the dimension tables, the resultant structure resembles a snowflake with the fact table in the middle.
The principle behind snowflaking is normalisation of the dimension tables by removing low cardinality attributes and forming separate tables. (The lower the cardinality, the more duplicated elements in a column, e.g. gender or Boolean values.)
A complex snowflake shape emerges when the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and the child tables have multiple parent tables.

Star Vs Snowflake schema
- Star schemas should be favored with query tools that largely expose users to the underlying table structures, and in environments where most queries are simpler in nature.
- Snowflake schemas are often better with more sophisticated query tools that create a layer of abstraction between the users and the raw table structures, for environments having numerous queries with complex criteria.
- From a storage point of view, the dimension tables are typically small compared to the fact tables. This often removes the storage space benefit of snowflaking the dimension tables, as compared with a star schema.
- A compromise is a snowflake schema with views built on top of it that perform many of the necessary joins to simulate a star schema. This requires the server to perform the underlying joins automatically, which results in a performance hit while querying because extra joins are needed.
- The star schema is a special case of the snowflake schema.

Snowflake schema advantages over the star schema
- Some OLAP multidimensional database modeling tools are optimized for snowflake schemas.
- Normalizing attributes results in storage savings, the tradeoff being additional complexity in source query joins.

Snowflake schema disadvantages
- Additional levels of attribute normalization add complexity to source query joins, compared to the star schema.
- Storage of normalised data is efficient and compact, but at the significant cost of poor performance.
- Data loads into the snowflake schema must be highly controlled and managed to avoid update and insert anomalies.

Fact Constellation schema
- Splitting the original star schema into more star schemas.
- For each star schema it is possible to construct a fact constellation schema.
- The fact constellation architecture contains multiple fact tables that share many dimension tables.
- More complicated design.
- Dimension tables are still large.

Snapshots
There are three types of modes that a data warehouse is loaded in:
- loads from archival data
- loads of data from existing systems
- loads of data into the warehouse on an ongoing basis
The loading of archival data or of data residing in existing systems is a "one time only" load.
The ongoing load of changes as they have occurred in the operational environment can consume an enormous amount of resources and can be very, very complex. These ongoing loads of data are done in terms of "snapshots" that pass from the operational environment to the data warehouse environment.

Data in the data warehouse is stored in units of "snapshots". The records in the data warehouse are created as of some moment in time and are in effect a snapshot taken as of that moment in time. So the data in the data warehouse is fundamentally different from the data in an operational data base environment. Data in an operational data base environment can be updated. Since data in the data warehouse environment is snapshot data, it cannot be updated.

Snapshots: EVENTS
The most basic consideration of a snapshot is that the snapshot has been taken as a result of an event. Figure 2 shows a snapshot being taken as a result of an event occurring.

The event may be triggered by a wide variety of occurrences: an occurrence of a transaction, the periodic passage of time, a threshold having been reached, an audit, a special request, etc.

Examples of these triggering events might be:
- a transaction occurring: a customer makes a purchase
- periodic passage of time: the end of the month occurs
- a threshold being reached: total orders exceed $1,000,000 for an account for a month
- an audit: the inventory level is taken and recorded
- a special request: management wants to know how many customers have made more than ten orders this year
Almost any imaginable condition is capable of triggering a snapshot to be entered into the data warehouse. Once the event occurs, the snapshot (or snapshots) is taken and loaded into the data warehouse.

On some occasions the date the snapshot is taken is entered as part of the record. On other occasions the date of the triggering event is entered. And on other occasions both the date of the snapshot and the date of the event are entered into the data warehouse. Examples:
- date of the snapshot: at the end of the month all accounts have their month ending balance captured. The event is the end of the month, and the month is stored as part of the data warehouse.
- date of the activity: a loan request is processed by the bank and approved. The date of approval is stored in the data warehouse.
- both date of the activity and date of the snapshot: an insurance company receives payment for premiums. The date of premium receipt is stored in the data warehouse, and the day the data is moved into the data warehouse is stored as part of the snapshot.

The first step in designing the data warehouse is to identify the events that will trigger an entry of data into the data warehouse. The next step is to fully specify how the data warehouse snapshots will be managed. There are many types of snapshots that can go into the data warehouse, but they can all generally be classified into one of four types.

Types of snapshots:
- Wholesale data base snapshots
- Selected record snapshots
- Exceptional/special record snapshots
- Cumulative snapshot records

WHOLESALE DATA BASE SNAPSHOT
The simplest form of snapshot record in the data warehouse.
E.g. at the end of every month the customer file is read in the operational environment and passed into the data warehouse.
It may not be a perfect image of the operational data: if the operational customer file contains fields or records that are only useful for the operational environment, that data is filtered out as it passes into the data warehouse environment.
Advantages
- Simple to execute.
- Very little design and very little complex programming are required.
Disadvantages
- Applies only to small files.
- Ages very quickly. Changes made to the data after the snapshot is taken are not reflected in the data base.

SELECTED RECORD SNAPSHOTS
Taken as the result of an event occurring. The records are selected based on some criteria contained within the record. Any data not being used for DSS processing is purged as data passes from the operational environment to the data warehouse environment.
E.g. the data architect selects all transactions which have occurred in the month of June for all active accounts with a month ending balance of greater than $5,000. The selection program reads through the operational file and, upon encountering a record that meets the qualifications, moves the record to the data warehouse.
Advantages
- Only a subset of operational records has to be considered for input into the data warehouse environment.
Disadvantages
- The searching of the operational file can become surprisingly complex. In addition, if care is not taken, huge amounts of data can appear in the data warehouse.
- Maintenance of the interface can become a burden.

EXCEPTIONAL/SPECIAL RECORD SNAPSHOT
There are so many records in the operational environment that only selected records can be trapped and sent to the data warehouse environment. This technique traps only selected records, e.g. accounts with no activity or too many activities.
Advantages
- The data does not require much space.
Disadvantages
- Very complex programming.
- The records do not form a continuous record of data.

CUMULATIVE SNAPSHOT RECORDS
Created as a result of gathering related operational records together and summarizing or otherwise calculating the data.
E.g. monthly phone call records are accumulated by phone number and stored in the data warehouse.
Advantages
- Great compaction of data.
Disadvantages
- Loss of functionality when gross levels of detail are required.
- Complexity of processing.
- Complexity of design.
- The need to sequence input data so that related input records physically reside next to each other.

Types of Fact Tables

Transaction: the most common type of fact table, used to model a specific business process, typically at the most granular (atomic) level.

Periodic Snapshot: used to model the status of a business process at a specific point in time on a regularly recurring interval. For example, a periodic snapshot fact table might be used to track account balances on a monthly basis. In this case, a snapshot of the account balance would be taken at the end of each month, which represents the net of all withdrawal and deposit transactions occurring during the month. Inventory is another common scenario that makes use of periodic snapshots, for tracking quantity on hand (by item) at the end of each month.
In both examples, the primary facts (account balance and quantity on hand) are semi-additive, which simply means they cannot be summed over time.

Accumulating Snapshot: models events in progress for business processes (e.g. claims processing for an insurance company) that involve a predefined series of steps (e.g. claim submitted, claim reviewed, claim approved/rejected). These tables prove useful in measuring and analyzing the duration between steps in a complete process and in discovering bottlenecks.

Transaction snapshot
- Records every transaction that affects inventory
- More granularity

Accumulating snapshot
- For processes that have a definite beginning, a definite end, and identifiable milestones in between
- E.g. shipping of a product

Dimensions
A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time.
The dimension is a data set composed of individual, non-overlapping data elements. The primary functions of dimensions are threefold: to provide filtering, grouping and labeling.
These functions are often described as "slice and dice". Slicing refers to filtering data. Dicing refers to grouping data.
E.g. take sales as the measure, with customer and product as dimensions. In each sale a customer buys a product. The data can be sliced by removing all customers except for a group under study, and then diced by grouping by product.
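The slice-and-dice example above can be sketched in a few lines of Python. The sample sales rows and the function name slice_and_dice are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical sales facts: (customer, product, sales amount).
sales = [
    ("Alice", "Pen", 3.0),
    ("Bob", "Pen", 2.0),
    ("Alice", "Book", 12.0),
    ("Carol", "Book", 10.0),
]


def slice_and_dice(facts, customers_under_study):
    """Slice by keeping only the chosen customers, then dice by product."""
    totals = defaultdict(float)
    for customer, product, amount in facts:
        if customer in customers_under_study:  # slicing = filtering
            totals[product] += amount          # dicing = grouping
    return dict(totals)


result = slice_and_dice(sales, {"Alice", "Bob"})
```

Carol's sale is removed by the slice, so only the sales of the group under study are grouped by product.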

Dimensions
A dimensional data element is similar to a categorical variable in statistics.
Typically dimensions in a data warehouse are organized internally into one or more hierarchies. "Date" is a common dimension, with several possible hierarchies:
- "Days (are grouped into) Months (which are grouped into) Years"
- "Days (are grouped into) Weeks (which are grouped into) Years"
- "Days (are grouped into) Months (which are grouped into) Quarters (which are grouped into) Years"
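As a rough sketch of how one day can belong to more than one hierarchy, the following Python derives the roll-up attributes of a date. The function names are hypothetical, and the use of ISO week numbering for the week hierarchy is an assumption; a real warehouse may define weeks differently.

```python
from datetime import date


def calendar_hierarchy(day):
    """Days -> Months -> Quarters -> Years."""
    quarter = (day.month - 1) // 3 + 1
    return {"day": day.isoformat(), "month": day.month,
            "quarter": quarter, "year": day.year}


def week_hierarchy(day):
    """Days -> Weeks -> Years, using ISO week numbering (an assumption)."""
    iso_year, iso_week, _ = day.isocalendar()
    return {"day": day.isoformat(), "week": iso_week, "year": iso_year}


by_calendar = calendar_hierarchy(date(2016, 1, 3))
by_week = week_hierarchy(date(2016, 1, 3))
```

Under these assumptions the same day rolls up to the year 2016 in the month hierarchy but to the last ISO week of 2015 in the week hierarchy, which is why the two hierarchies are usually kept as separate sets of attributes in the date dimension.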

Types of dimensions 1. Conformed dimension

A set of data attributes that have been physically referenced in multiple database tables using the same key value to refer to the same structure, attributes, domain values, definitions and concepts. A conformed dimension cuts across many facts.
Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other. Most important, the row headers produced in two different answer sets from the same conformed dimension(s) must be able to match perfectly.


Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension. Dimension tables are not conformed if the attributes are labeled differently or contain different values. Conformed dimensions come in several different flavors. At the most basic level, conformed dimensions mean exactly the same thing with every possible fact table to which they are joined. E.g. The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.

Types of dimensions 2. Slowly Changing Dimensions (SCDs)

Dimensions in data management and data warehousing contain relatively static data about such entities as geographical locations, customers, or products. Data captured by Slowly Changing Dimensions (SCDs) changes slowly but unpredictably, rather than according to a regular schedule. Some scenarios can cause referential integrity problems.

For example, a database may contain a fact table that stores sales records. This fact table would be linked to dimensions by means of foreign keys. One of these dimensions may contain data about the company's salespeople, e.g. the regional offices in which they work. However, the salespeople are sometimes transferred from one regional office to another. For historical sales reporting purposes it may be necessary to keep a record of the fact that a particular salesperson had been assigned to a particular regional office at an earlier date, whereas that salesperson is presently assigned to a different regional office.
Dealing with these issues involves SCD management methodologies referred to as Type 0 through 6. Type 6 SCDs are also sometimes called Hybrid SCDs.

Slowly Changing Dimensions (SCDs): Type 0
The Type 0 method is passive: when a dimensional change occurs, no action is performed. Values remain as they were at the time the dimension record was first inserted. In certain circumstances history is preserved with Type 0. Higher order types are employed to guarantee the preservation of history, whereas Type 0 provides the least or no control. Rarely used.

SCD Type 1
This methodology overwrites old data with new data, and therefore does not track historical data.
Example of a supplier table:

Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State
123 | ABC | Acme Supply Co | CA

Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the row will be unique by the natural key (Supplier_Code). However, to optimize performance on joins, use integer rather than character keys (unless the number of bytes in the character key is less than the number of bytes in the integer key).
If the supplier relocates the headquarters to Illinois, the record would be overwritten:

Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State
123 | ABC | Acme Supply Co | IL

Disadvantage: there is no history in the data warehouse.
Advantage: easy to maintain.
If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.
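A minimal Python sketch of the Type 1 overwrite, using the supplier row from the example. The in-memory dictionary and the function name apply_type1_change are assumptions made for illustration; a warehouse would normally do this with an UPDATE statement.

```python
# Dimension rows keyed by the surrogate key Supplier_Key.
supplier_dim = {
    123: {"Supplier_Code": "ABC", "Supplier_Name": "Acme Supply Co",
          "Supplier_State": "CA"},
}


def apply_type1_change(dim, supplier_code, new_state):
    """Overwrite Supplier_State in place for the rows with this natural key."""
    for row in dim.values():
        if row["Supplier_Code"] == supplier_code:
            row["Supplier_State"] = new_state  # the old value is lost


apply_type1_change(supplier_dim, "ABC", "IL")
```

After the call there is still exactly one row for the supplier and no trace of the old state, which is the lack of history described above.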

SCD Type 2
This method tracks historical data by creating multiple records for a given natural key in the dimensional tables, with separate surrogate keys and/or different version numbers. Unlimited history is preserved, since a new record is inserted for each change.
For example, if the supplier relocates to Illinois, the version numbers will be incremented sequentially:

Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Version
123 | ABC | Acme Supply Co | CA | 0
124 | ABC | Acme Supply Co | IL | 1

Another method is to add 'effective date' columns:

Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Start_Date | End_Date
123 | ABC | Acme Supply Co | CA | 01-Jan-2000 | 21-Dec-2004
124 | ABC | Acme Supply Co | IL | 22-Dec-2004 | NULL

The null End_Date in row two indicates the current tuple version. A surrogate high date (e.g. 9999-12-31) may be used as an end date instead.
Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed.

SCD Type 2 disadvantage
If there are retrospective changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.
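The effective-date variant of Type 2 can be sketched as follows. The list-of-dictionaries storage and the helper names apply_type2_change and state_as_of are hypothetical; the dates and keys follow the supplier example, with None standing in for the null End_Date.

```python
from datetime import date, timedelta

supplier_dim = [
    {"Supplier_Key": 123, "Supplier_Code": "ABC", "Supplier_Name": "Acme Supply Co",
     "Supplier_State": "CA", "Start_Date": date(2000, 1, 1), "End_Date": None},
]


def apply_type2_change(dim, supplier_code, new_state, effective):
    """Close the current row and append a new version with a new surrogate key."""
    current = next(r for r in dim
                   if r["Supplier_Code"] == supplier_code and r["End_Date"] is None)
    current["End_Date"] = effective - timedelta(days=1)
    new_row = dict(current,
                   Supplier_Key=max(r["Supplier_Key"] for r in dim) + 1,
                   Supplier_State=new_state, Start_Date=effective, End_Date=None)
    dim.append(new_row)
    return new_row


def state_as_of(dim, supplier_code, day):
    """Return the state that was in effect for the supplier on a given day."""
    for r in dim:
        if (r["Supplier_Code"] == supplier_code and r["Start_Date"] <= day
                and (r["End_Date"] is None or day <= r["End_Date"])):
            return r["Supplier_State"]
    return None


apply_type2_change(supplier_dim, "ABC", "IL", date(2004, 12, 22))
```

Facts loaded before the move keep pointing at Supplier_Key 123, so a lookup such as state_as_of(supplier_dim, "ABC", date(2003, 6, 1)) still answers with the historical state.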

SCD Type 3
Tracks changes using separate columns and preserves limited history, since history is limited to the number of columns designated for storing historical data. The original table structure in Type 1 and Type 2 is the same, but Type 3 adds additional columns. In the following example, an additional column has been added to the table to record the supplier's original state; only the previous history is stored.

Supplier_Key | Supplier_Code | Supplier_Name | Original_Supplier_State | Effective_Date | Current_Supplier_State
123 | ABC | Acme Supply Co | CA | 22-Dec-2004 | IL

This record contains a column for the original state and a column for the current state. It cannot track the changes if the supplier relocates a second time.
One variation of this is to create the field Previous_Supplier_State instead of Original_Supplier_State, which would track only the most recent historical change.

SCD Type 4
Uses "history tables", where one table keeps the current data and an additional table is used to keep a record of some or all changes. Both the surrogate keys are referenced in the fact table to enhance query performance.
For the above example the original table name is Supplier and the history table is Supplier_History.

Supplier
Supplier_key | Supplier_Code | Supplier_Name | Supplier_State
123 | ABC | Acme & Johnson Supply Co | IL

Supplier_History
Supplier_key | Supplier_Code | Supplier_Name | Supplier_State | Create_Date
123 | ABC | Acme Supply Co | CA | 14-June-2003
123 | ABC | Acme & Johnson Supply Co | IL | 22-Dec-2004

Type 6 / hybrid

The Type 6 method combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). The Supplier table starts out with one record for our example supplier:

Supplier_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag
123 | ABC | Acme Supply Co | CA | CA | 01-Jan-2000 | 31-Dec-9999 | Y

The Current_State and the Historical_State are the same. The Current_Flag attribute indicates that this is the current or most recent record for this supplier.
When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing:

Supplier_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag
123 | ABC | Acme Supply Co | IL | CA | 01-Jan-2000 | 21-Dec-2004 | N
124 | ABC | Acme Supply Co | IL | IL | 22-Dec-2004 | 31-Dec-9999 | Y

We overwrite the Current_State information in the first record (Supplier_Key = 123) with the new information, as in Type 1 processing. We create a new record to track the changes, as in Type 2 processing. And we store the history in a second State column (Historical_State), which incorporates Type 3 processing.
For example, if the supplier were to relocate again, we would add another record to the Supplier dimension, and we would overwrite the contents of the Current_State column:

Supplier_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag
123 | ABC | Acme Supply Co | NY | CA | 01-Jan-2000 | 21-Dec-2004 | N
124 | ABC | Acme Supply Co | NY | IL | 22-Dec-2004 | 03-Feb-2008 | N
125 | ABC | Acme Supply Co | NY | NY | 04-Feb-2008 | 31-Dec-9999 | Y

Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State are always the same.
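The Type 6 steps described above can be sketched in Python as one routine that applies the Type 1, Type 2 and Type 3 parts together. The in-memory list, the HIGH_DATE constant and the function name apply_type6_change are assumptions for illustration; Supplier_Name is omitted for brevity.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # surrogate high date marking the open-ended row

supplier_dim = [
    {"Supplier_Key": 123, "Supplier_Code": "ABC", "Current_State": "CA",
     "Historical_State": "CA", "Start_Date": date(2000, 1, 1),
     "End_Date": HIGH_DATE, "Current_Flag": "Y"},
]


def apply_type6_change(dim, supplier_code, new_state, effective):
    """Apply one relocation using the combined Type 1 + 2 + 3 approach."""
    for row in dim:
        if row["Supplier_Code"] != supplier_code:
            continue
        row["Current_State"] = new_state        # Type 1: overwrite on every row
        if row["Current_Flag"] == "Y":          # Type 2: close the current row
            row["End_Date"] = effective - timedelta(days=1)
            row["Current_Flag"] = "N"
    dim.append({
        "Supplier_Key": max(r["Supplier_Key"] for r in dim) + 1,
        "Supplier_Code": supplier_code,
        "Current_State": new_state,
        "Historical_State": new_state,          # Type 3: state valid in this slice
        "Start_Date": effective,
        "End_Date": HIGH_DATE,
        "Current_Flag": "Y",
    })


apply_type6_change(supplier_dim, "ABC", "IL", date(2004, 12, 22))
apply_type6_change(supplier_dim, "ABC", "NY", date(2008, 2, 4))
```

After the two calls the list holds the same three rows as the final table above: every row reports NY as Current_State, while Historical_State keeps CA, IL and NY for the respective time slices.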

Clickstream Source Data
A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing or using another software application. As the user clicks anywhere in the webpage or application, the action is logged on a client or inside the web server, as well as possibly the web browser, router, proxy server or ad server. Clickstream analysis is useful for web activity analysis, software testing, market research, and for analyzing employee productivity.

Clickstream is not just weblogs. It can be essentially every interaction that you transact with any electronic device:
- TV PVRs (personal video recorders)
- Smart phones
- Game consoles
- Sensors: security systems, highways
- E-payment cards
- Loyalty cards
- Geolocation
- Alarm clocks
- Printers, etc.

There are essentially two types of clickstream data:
- Individual sites clickstream
- Internet clickstream data
Server weblog accounts for 75% of daily data generation. Facebook alone captures 1.5PB of weblog data daily. Amazon captures 200TB of weblog data daily.

Sample of Clickstream Data

Web logs

204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"

204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"

204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5 [en] (Win98; I)"

Clickstream Click-path Analytics

A click path is the sequence of links a site visitor follows.
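As a rough sketch of click-path analysis, the following Python parses web log lines like the samples shown earlier and orders the page requests of one visitor by time. The regular expression, the function names and the rule that only "/" and ".html" requests count as pages are assumptions for illustration; the sample lines are trimmed to the fields the sketch uses.

```python
import re
from datetime import datetime

# Matches the leading fields of a web server access log line (assumed layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)


def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Month abbreviations are parsed with the default (English) locale.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def click_path(lines, host):
    """Return the pages requested by one host, in the order they were visited."""
    hits = [r for r in map(parse_line, lines) if r and r["host"] == host]
    pages = [r for r in hits if r["path"].endswith((".html", "/"))]
    return [r["path"] for r in sorted(pages, key=lambda r: r["time"])]


sample = [
    '204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363',
    '204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900',
    '204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437',
]
path = click_path(sample, "204.243.130.5")
```

The image request is dropped because it is part of rendering a page rather than a click, so the path for this visitor is the home page followed by the articles page.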


How Clickstream Data is collected?


Clickstream - Challenges


Clickstream data- solutions

Clickstream data- Data warehouse

Additive, Semi-Additive, and Non-Additive Facts

The numeric measures in a fact table fall into three categories.
1. Fully additive: the most flexible and useful facts. Additive facts are facts that can be summed up through all of the dimensions in the fact table, e.g. sales_amt.
2. Semi-additive: measures that can be summed across some dimensions, but not all. Balance amounts are common semi-additive facts because they are additive across all dimensions except time.
3. Completely non-additive: facts that cannot be summed up for any of the dimensions present in the fact table, such as ratios, e.g. profit margin.

Hierarchy in dimensions
Hierarchies are a natural and convenient way to organize data, particularly in space and time. E.g. group cities into countries, and countries into regions. It is useful to be able to query for the child city codes of a given country.
- Parent-child relationships
- Use a tree structure: balanced or unbalanced
- Helpful to drill down
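Returning to the additive and semi-additive measures defined in this section, the difference can be seen in a small Python sketch. The monthly snapshot rows and the function names are hypothetical.

```python
# Hypothetical periodic snapshot rows:
# (month, account, deposits during the month, month-end balance).
snapshots = [
    ("2016-01", "A", 100.0, 100.0),
    ("2016-01", "B", 50.0, 50.0),
    ("2016-02", "A", 20.0, 120.0),
    ("2016-02", "B", 30.0, 80.0),
]


def total_deposits(rows):
    """Deposits are fully additive: summing over accounts and months is valid."""
    return sum(deposit for _, _, deposit, _ in rows)


def total_balance(rows, month):
    """Balances are semi-additive: sum across accounts within a single month."""
    return sum(balance for m, _, _, balance in rows if m == month)


def closing_balance(rows):
    """Across time, report the latest month's balances instead of a sum."""
    latest = max(m for m, _, _, _ in rows)
    return total_balance(rows, latest)


naive_sum_of_balances = sum(balance for _, _, _, balance in snapshots)
```

Summing the balance column over both months counts January's money twice, whereas the closing balance of the latest month agrees with the total of the additive deposits.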

Many to many dimension relationship
A patient has more than one diagnosis.

Problems
- Querying for records to find a particular combination of diagnoses requires multiple correlated subqueries.
- Queries for finding patients with N different diagnoses will need N-level subqueries. Report generation is therefore very complex and slow, increasing both the processing time and the number of joins.

Solutions
1. The Bridge Table
- This table is similar to an intersection table that is created for a many-to-many relationship between two entities.
- It holds a weighting factor and a diagnosis group key.
- A diagnosis group key is assigned to clusters of diagnosis codes, and the combinations are inserted into the bridge table.
- A group contains a combination of diseases.
- The weighting factor is a percentage that identifies the contribution of the diagnosis to the specific encounter. Within a diagnosis group, the sum of all the weighting factors must equal one.
- The weighting factor is multiplied by the fact values; by joining the two tables on the diagnosis group key, the involvement of each diagnosis in the diagnosis group is correctly calculated.
- New query: one to one among 3 tables
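The weighting-factor calculation described above can be sketched in Python. The bridge rows, the fact rows and the billed-amount measure are made-up sample data, and the function names are hypothetical.

```python
# Hypothetical bridge table: diagnosis group key -> [(diagnosis, weighting factor)].
bridge = {
    1: [("diabetes", 0.5), ("hypertension", 0.5)],
    2: [("asthma", 1.0)],
}

# Hypothetical fact rows: (patient, diagnosis group key, billed amount).
facts = [("p1", 1, 200.0), ("p2", 2, 80.0), ("p3", 1, 100.0)]


def weights_are_valid(bridge_rows, tolerance=1e-9):
    """Within each diagnosis group the weighting factors must sum to one."""
    return all(abs(sum(w for _, w in members) - 1.0) <= tolerance
               for members in bridge_rows.values())


def billed_by_diagnosis(fact_rows, bridge_rows):
    """Allocate each fact to its diagnoses by multiplying by the weighting factor."""
    totals = {}
    for _, group_key, amount in fact_rows:
        for diagnosis, weight in bridge_rows[group_key]:
            totals[diagnosis] = totals.get(diagnosis, 0.0) + weight * amount
    return totals


allocated = billed_by_diagnosis(facts, bridge)
```

Because the weights in each group sum to one, the allocated amounts add up to the same total as the fact table, so nothing is double counted when reporting by diagnosis.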

Disadvantages
- Assigning weighting factors could prove to be difficult or cumbersome in a real-world environment; adding a new diagnosis requires recalculating the weighting factors.
- The logical structure would lose the simplicity and understandability of the star schema.
- More joins increase the overhead and query time.
- The size of the bridge table could increase considerably, based on the number of diagnoses assigned to each diagnosis group.

2. Denormalizing the Dimension Table by Positional-Flag Attributes
Positional means the location of each attribute is fixed. For example, the first attribute is cancer, the second attribute is heart, etc. Thus, the same disease is always indicated in the same column. In this method, each diagnosis becomes a Boolean attribute set to either TRUE or FALSE.
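A minimal sketch of a positional-flag dimension in Python, assuming the dimension holds one row per combination of flags. The three diagnosis names, the diagnosis_key column and the function names are hypothetical.

```python
from itertools import product

# Fixed column positions: the same disease is always in the same column.
DIAGNOSES = ["cancer", "heart", "diabetes"]


def build_flag_dimension(diagnoses):
    """Create one dimension row for every TRUE/FALSE combination of the flags."""
    rows = []
    combos = product([False, True], repeat=len(diagnoses))
    for key, flags in enumerate(combos, start=1):
        rows.append({"diagnosis_key": key, **dict(zip(diagnoses, flags))})
    return rows


def find_key(rows, diagnoses, present):
    """Look up the dimension key whose flags match a patient's set of diagnoses."""
    for row in rows:
        if all(row[d] == (d in present) for d in diagnoses):
            return row["diagnosis_key"]
    return None


dimension = build_flag_dimension(DIAGNOSES)
key = find_key(dimension, DIAGNOSES, {"cancer", "heart"})
```

With three flags the dimension already has eight rows, and each additional flag doubles that, which is why the approach only suits a small, fixed set of attributes.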

Disadvantages
- This technique requires a very large diagnosis dimension table: N diagnoses require 2^N records.
- Adding a new diagnosis value would require rebuilding the dimension table and the fact table. We need to use Data Definition Language (DDL) to add a column and reload the diagnosis dimension.
- This method is only applicable when the number of positional attributes is limited and fixed.

3. Denormalizing the Dimension Table by Non-Positional Attributes & a Concatenated Field
- Each attribute can have a different value in different records.
- Other than the primary diagnosis, there is no difference between secondary 1 and secondary 20.
- A concatenated field is used to store the primary and all the secondary values of the diagnoses, using the variable character data type.

Multi Valued Dimensions and Dimension Attributes
A multi valued attribute is an attribute which has more than 1 value per dimension row. A multi valued attribute is different from a multi valued dimension: a multi valued attribute occurs in a dimension, whereas a multi valued dimension occurs in a fact table. A multi valued dimension is a dimension with more than 1 value per fact row.
E.g. DimCustomer: CustomerName | City | PhoneNumber

Multi valued (dimension) attribute
There are several approaches to deal with a dimension with a multi valued attribute:
- Lower the grain of the dimension
- Put the attribute in another dimension, linked directly to the fact table
- Use a fact table (bridge table) to link the 2 dimensions
- Have several columns in the dimension for that attribute
- Put the attribute in a snowflaked sub-dimension
- Keep it in one column using commas or pipes

Multivalued dimensions

References
- Clickstream.pdf by Albert Hui
- Paper on "An Analysis of Many-to-Many Relationships Between Fact and Dimension Tables in Dimensional Modeling" by I-Y. Song, W. Rowen, C. Medsker, E. Ewen