ITRI 611/621 Project
Data Warehouses
Final Documentation

Martin Gouws - 21776032
JP Taljaard - 21735549
Heinrich du Toit - 21077533
Table of Contents
1. Table of Contents
2. Introduction
3. Project Planning
4. Project Management
5. Business Requirements Definition
6. Technology Track
7. Data Track
   a. Step 1 - High Level Plan
   b. Step 2 - Choose an ETL Tool
   c. Step 3 & 4 - Develop Default Strategies and Drill Down by Target Table
      i. General Points of Discussion
      ii. Data Hierarchies
      iii. Detailed Table Breakdowns
      iv. Dependency Tree
   d. Step 5
   e. Step 6
   f. Step 7
   g. Step 8
   h. Step 9
   i. Step 10
      i. ETL Automation
8. Business Intelligence Application Track
   a. OLAP Cube with Analysis Services
   b. MDX Queries
   c. PowerPivot
   d. SQL Server Reporting Services (SSRS)
   e. SharePoint
9. Deployment
10. Management
11. Growth
12. Group Participation
13. Annexure A
14. Annexure B
15. Annexure C
16. Annexure D
Introduction
The documentation of this project needed a coherent structure. The Kimball Lifecycle was used as the basis of that structure, and the documentation includes the following sections:

- Project planning
- Project management
- Business requirements gathering
- Technology track
- Data track
- Business intelligence track
- Deployment
- Maintenance
- Growth

Not all of the phases of the Kimball Lifecycle are within the scope of this project; such phases are still included, but are explicitly noted as out of scope, with reasons given in the applicable sections.
Project Planning
We decided to break the project down in terms of Kimball's lifecycle for data warehouses, as described in The Data Warehouse Lifecycle Toolkit. As stated by the lecturer for the module in the study guide, there are seven main phases that need to be completed by the end of this project. These seven phases will be used as project milestones. Both the lifecycle and the milestones were captured in a Microsoft Project 2010 plan, with all related times and durations included, to define clearly when each task needs to be done - also known as a WBS (Work Breakdown Structure). The starting date of this project is taken as 01/03/2012, and the end date as 30/09/2012.
The WBS and Gantt chart can be seen in Annexure A.
Project Management
Project planning and project management very often differ significantly. In the case of this project, the
project planning as stated in the above section was an overall guide, but the practical implementation was
very different.
The project progress was not monitored on Project 2010 as planned; rather the project was managed week
by week and in a dynamic fashion, and the group member allocations changed as the project progressed.
Business Requirements Definition
The business requirements were given by the lecturer in association with industry partners. The given
requirements were:
- Sales per hour per day.
- A comparison of how a restaurant's sales look in terms of the average of the region that restaurant is situated in. For example, a restaurant in Johannesburg compared to all the restaurants in all of Gauteng.
- A breakdown of restaurants per region per product. For example, the Sloane Square restaurant's sales of burgers, soft drinks, etc. versus the average of the whole region of Gauteng.
- Include a cost price with each menu item in order to calculate profits, etc., again broken down to per-day and per-restaurant levels.

The following requirements were additionally added:

- Average amount per purchase.
- Determine what time of the month is most popular.
- Top 10 Highest Grossing Restaurants.
- Top 10 Product Sales.
- Preferred Flavours per province.
- Number of orders (# transactions) per restaurant.

Seeing as this section is essentially a repetition of the project problem statement, it has not been expanded upon further.
Technology Track
Technical Architecture
For our project, we decided to use a Virtual Machine for the operating system, software and data to reside in. Oracle VirtualBox was used to create and run the Virtual Machine.

- The computer hosting the Virtual Machine runs an Intel i7 3.07 GHz processor.
- The Virtual Machine is assigned 10 GB of system memory.
- A virtual hard drive of 150 GB was created and assigned to the Virtual Machine.
- The Virtual Machine uses a virtual network adapter that is bridged with the host network adapter.

Software

On the Virtual Machine, we installed the following:

- Microsoft Windows Server 2008 R2 (operating system)
- Oracle 11g (database)
- Microsoft SQL Server 2008 R2 (databases)
- Microsoft SharePoint 2010 (for dashboards)
- Oracle Client Tools (.NET plugin)
- MDX Studio (for MDX queries)
- SQL Server Business Intelligence Development Studio (SSIS, SSRS, SSAS)
- Microsoft PowerPivot (Excel plugin for dashboarding)
- Oracle SQL Developer (Oracle database management environment)
- Microsoft IIS (requirement for Microsoft SharePoint)
- Remote Desktop Services (for multiple concurrent remote user connections)
- Oracle VM VirtualBox Guest Additions
- Microsoft Office 2010
- Microsoft Visual Studio 2010
Data Track
This section, the data track phase, was approached using the Kimball Lifecycle specifications for the ETL
process. Firstly, there are 34 subsystems that are identified by the authors that make up the basis of any ETL
system. These 34 subsystems are then later spread over 10 main steps that comprise the entire process of
ETL development.
In this section, these 10 steps were used as the overall structure and to determine the workload, while keeping in mind that the process is made up of the 34 subsystems.

This documentation may change as the project progresses, but it is a solid starting point and is not expected to change drastically.
Certain steps were grouped together and handled as a unit, but each individual step is still covered,
regardless of its combination with other steps.
Firstly though, a list of assumptions regarding the data:

- There were transactions made on 29 February 2011, a date that did not exist because 2011 was not a leap year.
- No prices were given, so prices for menu items were extracted from the Nando's website.
- In the MenuItems table, there were duplicate rows with the same MenuItemDescription but different primary keys. We decided to remove these duplicates. This was done by running a sequence of queries to leave only a single unique MenuItemDescription with a corresponding MenuItemCode (primary key). In light of this, the Transaction table had to be updated, replacing the previously duplicated MenuItemCode with the new single MenuItemCode (see the sketch after this list).
- Restaurant short names ... Florida and Adderley.
- A duplicate was found in MenuItems ... all transactions pointing to the second, incorrectly spelled item were pointed at the correct one, and the incorrectly spelled menu item was deleted.
- No cost prices ... markup of 200%.
- No selling prices.
- An extra column was added in MenuItems for cost price, and selling prices of 0 were updated with data.
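A minimal T-SQL sketch of the deduplication and repointing described above. The working table names (MenuItems, Transactions) and the choice of the smallest MenuItemCode as the surviving key are assumptions for illustration; the project's actual scripts may differ.

    -- For every duplicated description, pick one surviving MenuItemCode (the smallest).
    -- Assumption: working tables MenuItems(MenuItemCode, MenuItemDescription)
    -- and Transactions(..., MenuItemCode) in the development database.
    ;WITH Survivors AS (
        SELECT MenuItemDescription,
               MIN(MenuItemCode) AS KeepCode
        FROM   MenuItems
        GROUP BY MenuItemDescription
    )
    -- Repoint transactions that reference a duplicate onto the surviving code.
    UPDATE t
    SET    t.MenuItemCode = s.KeepCode
    FROM   Transactions t
    JOIN   MenuItems m ON m.MenuItemCode = t.MenuItemCode
    JOIN   Survivors s ON s.MenuItemDescription = m.MenuItemDescription
    WHERE  t.MenuItemCode <> s.KeepCode;

    -- Remove the now-unreferenced duplicate menu items.
    DELETE m
    FROM   MenuItems m
    WHERE  m.MenuItemCode NOT IN (
        SELECT MIN(MenuItemCode)
        FROM   MenuItems
        GROUP BY MenuItemDescription
    );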
The 10 ETL steps were implemented as follows.

Step 1 Draw the High Level Plan
Please refer to Annexure B.
Step 2 Choose an ETL Tool
SSIS (SQL Server Integration Services) was chosen as the ETL tool.

Wikipedia defines SSIS as follows:
SQL Server Integration Services (SSIS) is a component of the Microsoft SQL Server database software that
can be used to perform a broad range of data migration tasks. SSIS is a platform for data integration and
workflow applications. It features a fast and flexible data warehousing tool used for data extraction,
transformation, and loading (ETL). The tool may also be used to automate maintenance of SQL Server
databases and updates to multidimensional cube data.
Another tool was written in-house to extract the date and time stamp from the transactions table and, from that, deduce the attributes of the date and time dimensions.
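A minimal T-SQL sketch of the kind of derivation that in-house tool performs, assuming a hypothetical DimDate target table and the TransDate column of the transactions table; a time dimension would be derived in the same manner from the time-of-day portion.

    -- Derive one DimDate row per distinct calendar date seen in the transactions.
    -- Assumption: hypothetical target DimDate(DateKey, FullDate, Year, Month, DayOfMonth, DayOfWeek, IsWeekend).
    INSERT INTO DimDate (DateKey, FullDate, [Year], [Month], DayOfMonth, [DayOfWeek], IsWeekend)
    SELECT DISTINCT
           CONVERT(int, CONVERT(char(8), t.TransDate, 112)) AS DateKey,   -- e.g. 20120301
           CONVERT(date, t.TransDate)                       AS FullDate,
           YEAR(t.TransDate)                                 AS [Year],
           MONTH(t.TransDate)                                AS [Month],
           DAY(t.TransDate)                                  AS DayOfMonth,
           DATENAME(weekday, t.TransDate)                    AS [DayOfWeek],
           CASE WHEN DATENAME(weekday, t.TransDate) IN ('Saturday', 'Sunday')
                THEN 1 ELSE 0 END                            AS IsWeekend
    FROM   Transactions t
    WHERE  NOT EXISTS (SELECT 1 FROM DimDate d
                       WHERE d.FullDate = CONVERT(date, t.TransDate));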
Step 3 & 4 Develop Default Strategies & Drill Down by Target Table
This document includes historic and incremental load strategies for each table (dimension and fact) that is handled by the ETL system; but first, a few general points of discussion, followed by the table details.
General Points of Discussion:
Default strategy for extracting from each major source system
Only one source system will be used to extract data. This source will be the operational database. The
operational data from the source system was received in flat file format.
Extraction will be done using the Import/Export tool included in SSIS (SQL Server Integration Services), loading into the relevant tables in the development database.
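As an illustration of this extraction step, a minimal T-SQL sketch that loads one flat file into a staging table. The file path, delimiters and staging table name are assumptions; in practice the SSIS Import/Export tool performs the equivalent work.

    -- Hypothetical staging table matching the MenuCategory flat file layout.
    CREATE TABLE stg_MenuCategory (
        MenuCategoryCode  varchar(50),
        MenuCategoryDesc  varchar(255)
    );

    -- Assumed file location and delimiters; the real load is done through the SSIS Import/Export tool.
    BULK INSERT stg_MenuCategory
    FROM 'C:\ETL\source\MenuCategory.txt'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2          -- skip the header row
    );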
Archival strategy
As the data received is already in flat-file format, archival of this data is easy to implement. The data will be
archived for possible recovery when needed.
As the data is not very large, and compression makes it significantly smaller, the data can be stored for as long as needed.
Data quality tracking and metadata
The current load and quality checking is a manual process, but during the ETL the following steps will be implemented.

Data quality will be checked in the following manner:

- Errors in the flat files that may cause problems, such as quotation marks around all data items.
- Each row in all tables should be checked for incomplete data - missing fields, null values, etc.
- All rows should be compared to see whether records exist that can be identified as duplicates.
- Check whether identified keys match with the corresponding tables.
- Check for spelling mistakes that could create duplicates.
- Valid transactional date(s) and time.

The following actions are to be taken when errors or problems occur during data checking:
- If incomplete data is found, the data can still be transferred to the dimension and fact tables, but a note should be made about the incomplete data that was found. This will be documented in the audit dimension and tagged in the fact. A log that can be reviewed should also be kept.
- When duplicate records are found, e.g. item descriptions that are the same, this will have to be managed in one of two ways: (1) if the problem has not been encountered previously, a decision should be taken by the data warehouse builder on how to handle it; (2) if it has been identified before, a procedure should be written to handle the duplicate, e.g. a script that automatically executes the actions that the data warehouse builder has specified.
- Although it should not happen, in theory the data may contain errors relating to the date and time, resulting in an invalid time, date, or combination of the two. These errors would arise during the transition from records in the database to the flat file format, where the data is saved as text - text files do not retain data types.
If data under evaluation do not pass the quality checks, a classification of severity should be assigned to each
record that did not meet the standards set by the data warehouse builder. Each dimension or fact record
loaded will be tagged with a classification in the audit dimension.
The classification scale used:

0 - No problem
1 - Data type conversion error
2 - Data type validation failure
3 - Missing data, NULL extracted from flat files, etc.
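A minimal T-SQL sketch of how this classification might be applied during checking, reusing the hypothetical stg_MenuCategory staging table from the extraction sketch and an assumed DataQualityClass column; the project's actual checks are enforced by its own ETL scripts.

    -- Assumed helper column that carries the severity classification per staged row.
    ALTER TABLE stg_MenuCategory ADD DataQualityClass tinyint NOT NULL DEFAULT 0;
    GO

    -- Classification 3: missing data (NULL or empty values extracted from the flat files).
    UPDATE stg_MenuCategory
    SET    DataQualityClass = 3
    WHERE  MenuCategoryCode IS NULL
       OR  LTRIM(RTRIM(ISNULL(MenuCategoryDesc, ''))) = '';

    -- Classification 1: data type conversion error (code cannot be converted to numeric).
    UPDATE stg_MenuCategory
    SET    DataQualityClass = 1
    WHERE  DataQualityClass = 0
      AND  ISNUMERIC(MenuCategoryCode) = 0;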
An audit dimension will be used because merely disregarding dirty records is bad practice: it compromises the integrity of the data and creates gaps, which distort the big picture.

The severity of errors during a load should indicate whether the load process should be aborted, especially if errors are frequent and severe.

The load process will be automated with scripts that enforce the quality checks and the loading of data into the fact and dimension tables. If errors occur during the load process, the data warehouse builder should be notified to take appropriate action, especially if classification 3 is assigned.
Errors should be logged and the data warehouse builder should be notified using a reliable communication
medium such as email. The data warehouse builder should then take action according to the severity of the
problem encountered.
A deduplication system was not formally implemented in the ETL system, because data is not retrieved from multiple sources, so survivorship and matching are not needed. Deduplication was, however, done manually where such problems (duplicates, matching and survivorship) were encountered in the data.

Two problems in this regard were encountered. The first was the duplicate menu item descriptions, which was handled by running a SQL script that isolated a single description and its identifier and replaced all relevant record fields with the isolated key and description; the duplicates were then dropped from the table. The second was the renaming of misspelled data as it was encountered during data loading.
A conforming system in the ETL system is of no use in this situation, because all the data is retrieved from a
single source, namely the database flat files that were generated from the sales processing database system.
The single source system, combined with the fact that no dimensions are shared, means that this subsystem is not applicable.
Default strategy for managing changes to dimension attributes
The Type 1 technique is a simple overwrite of one or more attributes in an existing dimension row. The revised data from the change data capture system is used to overwrite existing data in the dimension table. Type 1 is used when data needs to be corrected or when there is no business need to keep a history of the previous values.

The Type 2 technique is used to track changes to dimensions and to associate them correctly with existing and new fact records. Supporting Type 2 changes requires a strong change data capture system that detects changes as soon as they occur. For Type 2 updates, copy the previous version of the dimension row and create a new dimension row with a new surrogate key. If there is no previous version of the dimension row, create a new one from scratch. Then update this row with the columns that have changed. This technique is used for handling dimension attributes that change and that need to be tracked over time.

Type 3 is not implemented in this system and is therefore not discussed.
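A minimal T-SQL sketch of the two techniques, using the Menu Category dimension as an example. The staging table stg_MenuCategory, the housekeeping columns (RowEffectiveDate, RowEndDate, IsCurrent) and the @AuditKey variable are assumptions for illustration and do not form part of the table designs given later.

    DECLARE @AuditKey int = 1;  -- key of the current audit batch (assumed)

    -- Type 1: overwrite the attribute in place; no history is kept.
    UPDATE d
    SET    d.MenuCategoryDesc = s.MenuCategoryDesc
    FROM   MenuCategoryDim d
    JOIN   stg_MenuCategory s ON s.MenuCategoryCode = d.MenuCategoryCode
    WHERE  d.MenuCategoryDesc <> s.MenuCategoryDesc;

    -- Type 2 (the alternative): expire the current row, then insert a new version
    -- whose new surrogate key is generated by the IDENTITY column.
    UPDATE d
    SET    d.RowEndDate = GETDATE(), d.IsCurrent = 0
    FROM   MenuCategoryDim d
    JOIN   stg_MenuCategory s ON s.MenuCategoryCode = d.MenuCategoryCode
    WHERE  d.IsCurrent = 1
      AND  d.MenuCategoryDesc <> s.MenuCategoryDesc;

    INSERT INTO MenuCategoryDim (MenuCategoryCode, MenuCategoryDesc, AuditKey,
                                 RowEffectiveDate, RowEndDate, IsCurrent)
    SELECT s.MenuCategoryCode, s.MenuCategoryDesc, @AuditKey, GETDATE(), '9999-12-31', 1
    FROM   stg_MenuCategory s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   MenuCategoryDim d
                       WHERE  d.MenuCategoryCode = s.MenuCategoryCode
                         AND  d.IsCurrent = 1);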
Refer to tables below for specific dimension and fact change management.
System availability requirements and strategy
The operational data source was made available on 12 March 2012 in the format of flat files.

High-level block sequencing is set out below, from flat files through to the development database and finally to the data warehouse database, where the facts and dimensions will be loaded.
See Annexure B for high-level sequencing of each dimension and fact.
Due to the nature of incremental batch loads, large amounts of system resources will be consumed to process the data (ETL): load, clean, save. For this reason, the ETL process will be implemented on a separate server that is responsible for ETL.

Care will be taken when updating fact and dimension tables with new data not to overwhelm the system with a high incoming load; rather, the load will be split into smaller pieces and uploaded when the data warehouse system's load is low. In this way, the system remains available even during uploads.
Design of the data auditing subsystem
The auditing subsystem will be used to capture data load information and keep track of it. A key will be
created for each type of event that happens while loading the facts or dimensions. This key is then assigned
to the fact or dimension in question. The keys will be stored in the audit dimension with additional
information such as the type of error, the time and date of occurrence, the batch job name or number, and possibly more.
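A minimal sketch of what such an audit dimension might look like in the development database. The column names are assumptions chosen to be consistent with the classification scale above; the actual audit table used in the project may differ.

    -- Hypothetical audit dimension; one row is created per load event.
    CREATE TABLE AuditDim (
        AuditKey         int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- key assigned to loaded facts/dimensions
        BatchName        varchar(100) NOT NULL,                  -- batch job name or number
        TableName        varchar(100) NOT NULL,                  -- fact or dimension table being loaded
        EventType        varchar(50)  NOT NULL,                  -- type of event, e.g. 'Historic load'
        QualityClass     tinyint      NOT NULL DEFAULT 0,        -- 0-3 classification scale defined above
        ErrorDescription varchar(500) NULL,                      -- description of the error, if any
        EventDateTime    datetime     NOT NULL DEFAULT GETDATE() -- time and date of occurrence
    );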
Locations of staging areas
In the ETL stage, multiple staging areas will exist for processing the source data:

1. Import stage
   a. In this stage, the data will be imported from the source system (flat files).
   b. Verification of data types needs to happen here to limit possible errors and updates later on.
2. Cleaning stage
   a. The data will be checked for missing data, such as incomplete fields or NULLs.
   b. Duplicate checking and removal of duplicates. All relevant records in other tables should be updated accordingly to reflect the single record chosen to replace the duplicates.
   c. Audit logs implemented.
3. Population of dimension tables.
4. Key-mapping stage - key maps are to be built for use in linking and in the creation of the fact table data (see the sketch after this list).
5. Population of fact tables.
6. Transfer from the development database to the Oracle production database.

Bulk-load the development data, which includes the dimension and fact tables, to the Oracle production database.
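A minimal T-SQL sketch of the key-mapping idea in stage 4. The key-map table, the dimension name MenuItemDim and the fact table SalesFact with its column list are assumptions for illustration; the point is that natural keys from the transactions are swapped for dimension surrogate keys when the fact rows are built.

    -- Hypothetical key map from the natural MenuItemCode to the dimension surrogate key.
    CREATE TABLE KeyMap_MenuItem (
        MenuItemCode   varchar(50) NOT NULL PRIMARY KEY,
        MenuItemSurKey int         NOT NULL
    );

    INSERT INTO KeyMap_MenuItem (MenuItemCode, MenuItemSurKey)
    SELECT MenuItemCode, MenuItemSurKey
    FROM   MenuItemDim;          -- the populated Menu Items dimension (name assumed)

    -- Swap natural keys for surrogate keys while building the fact rows.
    INSERT INTO SalesFact (MenuItemSurKey, TransDate, OrderNr, ItemNr)
    SELECT km.MenuItemSurKey, t.TransDate, t.OrderNr, t.ItemNr
    FROM   Transactions t
    JOIN   KeyMap_MenuItem km ON km.MenuItemCode = t.MenuItemCode;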
Data Hierarchies
Detailed Table Breakdown
The data sources that are referred to below are the working database tables, not the original flat files, and
these tables are already cleaned and the data validated as per ETL procedures.
Menu Category Dimension
Table Design
The Menu Category Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
MenuCategorySurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
MenuCategoryCode | ID number of the category | Numeric | Primary key | Unique
MenuCategoryDesc | Description of the category | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
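A minimal T-SQL sketch of this design as it might be created in the development database. The column sizes are assumptions, the key and constraint assignments follow the table above, and AuditDim refers to the audit dimension sketched earlier.

    -- Menu Category dimension, per the table design above (development database version).
    CREATE TABLE MenuCategoryDim (
        MenuCategorySurKey int IDENTITY(1,1) NOT NULL,  -- surrogate key; carries no metadata
        MenuCategoryCode   numeric(10,0)     NOT NULL,  -- ID number of the category
        MenuCategoryDesc   varchar(255)      NOT NULL,  -- description of the category
        AuditKey           int               NOT NULL,  -- audit dimension foreign key
        CONSTRAINT PK_MenuCategoryDim PRIMARY KEY (MenuCategoryCode),
        CONSTRAINT FK_MenuCategoryDim_Audit FOREIGN KEY (AuditKey) REFERENCES AuditDim (AuditKey)
    );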
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 19 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are relatively low, seeing as the data contained
therein is of a very static nature.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
MenuCategorySurKey n/a
MenuCategoryCode 2
MenuCategoryDesc 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
MenuCategoryID | ID number of the Food Group | Numeric | 0 | Max value possible in DBMS | 1
MenuCategoryCode | Category Code | Numeric | 0 | 10000 | 1
MenuCategoryDesc | Description of the category | Varchar | | | 1
Auditkey | Audit Dimension Foreign Key | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
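As an illustration of the final bulk-load step above, a minimal T-SQL sketch that pushes one temporary dimension to the Oracle data warehouse over a linked server. The linked server name (ORACLE_DW), the Oracle schema, and the assumption that the surrogate key is generated on the Oracle side are all illustrative; this step could equally be performed with an SSIS data flow.

    -- Assumption: a linked server named ORACLE_DW points at the Oracle data warehouse,
    -- and dw_tmp.MenuCategoryDim is the temporary dimension in the working database.
    INSERT OPENQUERY(ORACLE_DW,
        'SELECT MenuCategoryCode, MenuCategoryDesc, AuditKey FROM DW.MENUCATEGORYDIM')
    SELECT MenuCategoryCode, MenuCategoryDesc, AuditKey
    FROM   dw_tmp.MenuCategoryDim;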
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to check
for necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Menu Flavour Dimension
Table Design
The Menu Flavour Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
MenuFlavourSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
MenuFlavourCode | ID number of the flavour | Numeric | Primary key | Unique
MenuFlavourDesc | Description of the flavour | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 5 records.

Incremental data volumes
The chances of incremental data volumes in this dimension are medium, seeing as the data contained
therein is of a relatively static nature. The only time this dimension would grow is with a change in business rules or strategies and the addition of further complexity to the products.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
MenuFlavourSurKey n/a
MenuFlavourCode 2
MenuFlavourDesc 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
MenuFlavourID | ID number of the Food Group | Numeric | 0 | Max value possible in DBMS | 1
MenuFlavourCode | Flavour Code | Numeric | 0 | 10000 | 1
MenuFlavourDesc | Description of the flavour | Varchar | | | 1
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to check
for necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Menu Item Food Group Dimension
Table Design
The Menu Item Food Group Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
MenuFoodGroupSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
MenuFoodGroupCode | ID number of the FoodGroup | Numeric | Primary key | Unique
MenuFoodGroupDesc | Description of the FoodGroup | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 50 records.

Incremental data volumes
The chances of incremental data volumes in this dimension are relatively low, seeing as the data contained
therein is of a very static nature.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
MenuFoodGroupSurKey n/a
MenuFoodGroupCode 2
MenuFoodGroupDesc 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
MenuFoodGroupID | ID number of the Food Group | Numeric | 0 | Max value possible in DBMS | 1
MenuFoodGroupCode | Food Group Code | Numeric | 0 | 10000 | 1
MenuFoodGroupDesc | Description of the FoodGroup | Varchar | | | 1
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to checkfor necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Menu Items Dimension
Table Design
The Menu Items Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
MenuItemSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
MenuItemCode | ID number of the Items | Varchar | n/a | Unique
MenuItemDesciption | Description of the item | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 789 records.

Incremental data volumes
The chances of incremental data volume are medium. As with all organizations that deliver a product, those
products are continuously expanded. This expansion is not so rapid that the data volumes will increase
dramatically, but they will increase slightly over time.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
MenuItemSurKey n/a
MenuItemCode 2
MenuItemDesciption 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
MenuItemID | ID number of the MenuItem | Numeric | 0 | Max value possible in DBMS | 1
MenuItemCode | Item Product Codes | Varchar | | | 1
MenuItemDesciption | Description of the item | Varchar | | | 1
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on the following dimensions:
- Audit
- Menu Item Flavour
- Menu Category
- Menu Sub Category
- Menu Food Group

All the above-mentioned dimensions must be loaded before this dimension may be populated, otherwise referential integrity will be compromised.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to check
for necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Medium.
Menu Sub Category Dimension
Table Design
The Menu Sub Category Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
MenuSubCategorySurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
MenuSubCategoryCode | ID number of Sub Category | Numeric | Primary key | Unique
MenuSubCategoryDesc | Description of the Sub Category | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 320 records.

Incremental data volumes
The chances of incremental data volumes in this dimension are relatively low, seeing as the data contained
therein is of a very static nature.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
MenuSubCategorySurKey n/a
MenuSubCategoryCode 2
MenuSubCategoryDesc 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
MenuSubCatID | ID number of the Sub Category | Numeric | 0 | Max value possible in DBMS | 1
MenuSubCategoryCode | Sub Category Code | Numeric | 0 | 10000 | 1
MenuSubCategoryDesc | Description of the Sub Category | Varchar | | | 1
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to check
for necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Province Dimension
Table Design
The Province Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
ProvinceSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
ProvinceCode | ID number of the Province | Numeric | Primary key | Unique
ProvinceDesc | Province Name | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 11 records.

Incremental data volumes
The chances of incremental data volumes in this dimension are effectively zero, seeing as the data contained therein is set according to geographical boundaries that are unchanging. The only exceptions would be if the organization in question expanded beyond its current borders or if the country were re-divided into new provinces.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
ProvinceSurKey n/a
ProvinceCode 1
ProvinceDesc 1
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
Province ID | ID number of the Province | Numeric | 0 | Max value possible in DBMS | 1
ProvinceCode | Associate Province Code | Numeric | 1 | 11 | 1
ProvinceDesc | Province Name | Varchar | | | 1
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to check
for necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Restaurant Dimension

Table Design
The Restaurant Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
RestSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
RestCode | ID number of Restaurant | Numeric | Primary key | Unique
RestShortName | Shortened version of Restaurant name | Varchar | n/a | not null
RestName | Restaurant Name | Varchar | n/a | not null
IsCoastal | Is the Restaurant located near the coast - 0=no; 1=yes | Bit | n/a | not null
ProvinceCode | ID of province which restaurant is in | Numeric | Foreign key | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 101 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are medium-high, seeing as the data contained
therein is likely to expand as the organization in question grows and more franchises are established.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
RestSurKey n/a
RestCode n/a
RestShortName 1
RestName 1
IsCoastal n/a
ProvinceCode n/a
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
RestID | Restaurant Identification code | Numeric | 0 | Max value possible in DBMS | 1
RestCode | ID number of Restaurant | Numeric | 0 | 2000 | 1
RestShortName | Shortened version of Restaurant name | Varchar | | | 1
RestName | Restaurant Name | Varchar | | | 1
IsCoastal | Is the Restaurant located near the coast - 0=no; 1=yes | Numeric | | | 1+
ProvinceCode | ID of province which restaurant is in | Numeric | | | 1+
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on the following dimension tables:

- Audit
- Regional Manager
- Country
- Province
- Hub

All the above-mentioned dimensions need to be loaded before this dimension can be populated, otherwise referential integrity might be compromised.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables, the ETL process executes a script that checks whether the necessary space is available in the target tables; the process only continues if sufficient space exists.
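A hedged sketch of such a check against the Oracle data dictionary; the tablespace name DW_DATA and the 100 MB threshold are illustrative assumptions:

-- Returns a row only when free space in the target tablespace falls below the threshold,
-- in which case the ETL job aborts before loading.
SELECT tablespace_name,
       ROUND(SUM(bytes) / 1024 / 1024) AS free_mb
  FROM dba_free_space
 WHERE tablespace_name = 'DW_DATA'
 GROUP BY tablespace_name
HAVING SUM(bytes) / 1024 / 1024 < 100;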
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL pipeline, a checkpoint is created before each stage operation. If an error occurs anywhere during the operation, the process is stopped and the data is rolled back to the created checkpoint.
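Within a single database session this behaviour can be sketched with a savepoint; the PL/SQL block below is illustrative only (the project's ETL tool provides the equivalent at package level), and surrogate key and AuditKey handling are omitted for brevity:

BEGIN
  SAVEPOINT before_dim_load;                 -- checkpoint before the stage operation
  INSERT INTO DIM_RESTAURANT (RestCode, RestShortName, RestName, IsCoastal, ProvinceCode)
    SELECT RestCode, RestShortName, RestName, IsCoastal, ProvinceCode
      FROM STG_RESTAURANT;
  COMMIT;
EXCEPTION
  WHEN OTHERS THEN
    ROLLBACK TO before_dim_load;             -- undo the failed stage and stop the run
    RAISE;
END;
/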
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located, in the correct format, in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
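A minimal sketch of this cleanup step, assuming illustrative staging table names in the development database:

-- Empty the working tables once the Oracle load has been verified.
TRUNCATE TABLE STG_RESTAURANT;
TRUNCATE TABLE STG_TRANSACTIONS;
TRUNCATE TABLE STG_HUB;
-- The original flat files are then moved to an archive location outside the database.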
Estimated difficulty of implementation
Easy.
Transactions Dimension
Table Design
The Transaction Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:
Column Name | Description | DataType | Key | Constraint
TransSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
TransDate | Date and time transaction occurred | Datetime | n/a | not null
OrderNr | Order number of transaction | Numeric | n/a | not null
ItemNr | Item number of order | Numeric | n/a | not null
MenuItemCode | Menu item on order | Varchar | Foreign key | not null
AuditKey | Audit Dimension foreign key | Numeric | Foreign key | not null
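A hedged DDL sketch of this design for the Oracle target, including the monthly partitioning described under Table partitioning below. The table and constraint names, column sizes and the referenced dimension names (DIM_MENUITEM, DIM_AUDIT) are assumptions rather than project artefacts, and interval partitioning assumes Oracle 11g or later:

CREATE TABLE DIM_TRANSACTIONS (
  TransSurKey  NUMBER       NOT NULL,   -- surrogate key, populated from a sequence by the ETL process
  TransDate    DATE         NOT NULL,   -- date and time the transaction occurred
  OrderNr      NUMBER       NOT NULL,
  ItemNr       NUMBER       NOT NULL,
  MenuItemCode VARCHAR2(20) NOT NULL,
  AuditKey     NUMBER       NOT NULL,
  CONSTRAINT pk_dim_transactions PRIMARY KEY (TransSurKey),
  CONSTRAINT fk_trans_menuitem   FOREIGN KEY (MenuItemCode) REFERENCES DIM_MENUITEM (MenuItemCode),
  CONSTRAINT fk_trans_audit      FOREIGN KEY (AuditKey)     REFERENCES DIM_AUDIT (AuditKey)
)
PARTITION BY RANGE (TransDate)
  INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
  (PARTITION p_first VALUES LESS THAN (DATE '2012-02-01'));  -- one partition per calendar month thereafter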
Historic data load parameters and volumes
Parameters - this is the only data from the flat file sources that provides information about the business process itself, and there are certain parameters that need to be taken into account. Historic loads will be done month for month, with a total of 4 months contained in the data. The transactions are not equally spread over the 4 months.
Volumes - the data source of this dimension contains 1,000,000 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are extremely high, seeing as the data contained therein grows with every transaction that is processed. The data volume is large and needs to be managed carefully.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. As can be seen from the load frequency below, the time between loads is relatively short, so any late arriving data is simply held over until the next load.
Load frequency
The initial load is performed once, followed by weekly loads thereafter, with the possibility of daily loads if the client so requests.
It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon afterwards.
This dimension is compared to the source data to identify whether changes or additions have occurred, and it is then updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
TransSurKey n/a
TransDate n/a
OrderNr n/a
ItemNr n/a
MenuItemCode n/a
AuditKey n/a
Table partitioning
This table is partitioned according to month.
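Because the table is range partitioned on TransDate by calendar month (see the DDL sketch under Table Design above), a query restricted to a single month only has to read that month's partition. The example below is a hedged illustration; the month boundaries are assumptions:

-- Only the partition for the selected month is scanned (partition pruning).
SELECT MenuItemCode, COUNT(*) AS items_sold
  FROM DIM_TRANSACTIONS
 WHERE TransDate >= DATE '2012-03-01'
   AND TransDate <  DATE '2012-04-01'
 GROUP BY MenuItemCode;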
Overview of data source(s)
The data source for this table is a normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:
Column Name | Description | DataType | Min | Max | Count of Distinct Values
TransID | ID number of the transaction | Numeric | 0 | Max value possible in DBMS | 1
TransDate | Date and time transaction occurred | Datetime | | |
OrderNr | Order number of transaction | Numeric | 0 | 2000 |
ItemNr | Item number of order | Numeric | 0 | 50 |
MenuItemCode | Menu item on order | Varchar | | |
AuditKey | | Numeric | | |
Extract strategy for the source data
Refer to the default strategy discussed in the general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:
- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded into the Oracle data warehouse.
This sequence that the data follows through the ETL pipeline ensures agreement between the two systems, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on the following dimensions:
- Audit
- Menu Items
The above mentioned dimensions need to be populated before this dimension can be populated, to ensure referential integrity.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables, the ETL process executes a script that checks whether the necessary space is available in the target tables; the process only continues if sufficient space exists.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL pipeline, a checkpoint is created before each stage operation. If an error occurs anywhere during the operation, the process is stopped and the data is rolled back to the created checkpoint.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located, in the correct format, in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Medium.
Hub Dimension
Table Design
The Hub Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:
Column Name | Description | DataType | Key | Constraint
HubSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
HubCode | ID number of hub | Numeric | Primary key | Unique
AuditKey | Audit Dimension foreign key | Numeric | Foreign key | not null
Historic data load parameters and volumes
Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
Volumes - the data source of this dimension contains 19 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are extremely high, seeing as the data contained therein grows with every transaction that is processed. The data volume is large and needs to be managed carefully.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. As can be seen from the load frequency below, the time between loads is relatively short, so any late arriving data is simply held over until the next load.
Load frequency
The initial load is performed once, followed by weekly loads thereafter, with the possibility of daily loads if the client so requests.
It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon afterwards.
This dimension is compared to the source data to identify whether changes or additions have occurred, and it is then updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
HubSurKey n/a
HubCode 2
AuditKey n/a
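HubCode is handled as a Type 2 change, meaning a change produces a new dimension row rather than an overwrite. The sketch below shows the general Type 2 pattern only; it assumes the dimension carries the source HubID as a natural key plus housekeeping columns (EffectiveFrom, EffectiveTo, IsCurrent) that do not appear in the table design above, and the staging table STG_HUB is likewise illustrative:

-- 1. Expire the current version of any hub whose code has changed in the staging data.
UPDATE DIM_HUB d
   SET EffectiveTo = TRUNC(SYSDATE),
       IsCurrent   = 0
 WHERE IsCurrent = 1
   AND EXISTS (SELECT 1
                 FROM STG_HUB s
                WHERE s.HubID   = d.HubID
                  AND s.HubCode <> d.HubCode);

-- 2. Insert a new current row for every changed or brand new hub
--    (surrogate key and AuditKey assignment omitted for brevity).
INSERT INTO DIM_HUB (HubID, HubCode, EffectiveFrom, EffectiveTo, IsCurrent)
SELECT s.HubID, s.HubCode, TRUNC(SYSDATE), DATE '9999-12-31', 1
  FROM STG_HUB s
 WHERE NOT EXISTS (SELECT 1
                     FROM DIM_HUB d
                    WHERE d.HubID = s.HubID
                      AND d.IsCurrent = 1);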
Table partitioning
Table is not partitioned.
Overview of data source(s)
The data source for this table is a normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:
Column Name | Description | DataType | Min | Max | Count of Distinct Values
HubID | Hub identification code | Numeric | 0 | Max value possible in DBMS | 1
HubCode | ID number of hub | Numeric | 0 | 1000 |
AuditKey | | Numeric | | |
Extract strategy for the source data
Refer to the default strategy discussed in the general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:
- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded into the Oracle data warehouse.
This sequence that the data follows through the ETL pipeline ensures agreement between the two systems, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on the following dimensions:
- Audit
- Restaurant
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables, the ETL process executes a script that checks whether the necessary space is available in the target tables; the process only continues if sufficient space exists.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL pipeline, a checkpoint is created before each stage operation. If an error occurs anywhere during the operation, the process is stopped and the data is rolled back to the created checkpoint.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located, in the correct format, in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Regional Manager Dimension
Table Design
The Regional Manager Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:
Column Name | Description | DataType | Key | Constraint
RegionalManagerSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
RegionalManagerCode | ID number of Regional Manager | Numeric | Primary key | Unique
AuditKey | Audit Dimension foreign key | Numeric | Foreign key | not null
Historic data load parameters and volumes
Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
Volumes - the data source of this dimension contains 101 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are effectively none, seeing as the data contained therein is set according to geographical bounds that are unchanging and limited to a single country. The only exception would be if the organization in question expanded into other countries.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. As can be seen from the load frequency below, the time between loads is relatively short, so any late arriving data is simply held over until the next load.
Load frequency
The initial load is performed once, followed by weekly loads thereafter, with the possibility of daily loads if the client so requests.
It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon afterwards.
This dimension is compared to the source data during these loads to identify whether changes or additions have occurred, and it is then updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
RegionalManagerSurKey n/a
RegionalManagerCode 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
The data source for this table is a normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:
Column Name | Description | DataType | Min | Max | Count of Distinct Values
RegionalManagerID | Regional manager identification key | Numeric | 0 | Max value possible in DBMS | 1
RegionalManagerCode | Regional Manager assigned code | Numeric | 0 | 2000 | 1
AuditKey | | Numeric | | |
Extract strategy for the source data
Refer to the default strategy discussed in the general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:
- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded into the Oracle data warehouse.
This sequence that the data follows through the ETL pipeline ensures agreement between the two systems, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on the following dimensions:
- Audit
- Restaurants
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables, the ETL process executes a script that checks whether the necessary space is available in the target tables; the process only continues if sufficient space exists.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL pipeline, a checkpoint is created before each stage operation. If an error occurs anywhere during the operation, the process is stopped and the data is rolled back to the created checkpoint.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located, in the correct format, in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Country Dimension
Table Design
The Country Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:
Column Name | Description | DataType | Key | Constraint
CountrySurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
CountryCode | Country identification abbreviation | Varchar | Primary key | Unique
AuditKey | Audit Dimension foreign key | Numeric | Foreign key | not null
Historic data load parameters and volumes
Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
Volumes - the data source of this dimension contains 101 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are effectively none, seeing as the data contained therein is set according to geographical bounds that are unchanging and limited to a single country. The only exception would be if the organization in question expanded into other countries.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. As can be seen from the load frequency below, the time between loads is relatively short, so any late arriving data is simply held over until the next load.
Load frequency
The initial load is performed once, followed by weekly loads thereafter, with the possibility of daily loads if the client so requests.
It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon afterwards.
This dimension is compared to the source data during these loads to identify whether changes or additions have occurred, and it is then updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
CountrySurKey n/a
CountryCode 1
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
The data source for this table is a normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:
Column Name | Description | DataType | Min | Max | Count of Distinct Values
CountryID | Country identifier key | Numeric | 0 | Max value possible in DBMS | 1
CountryCode | Country identification abbreviation | Varchar | | |
AuditKey | | Numeric | | |
Extract strategy for the source data
Refer to the default strategy discussed in the general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:
- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded into the Oracle data warehouse.
This sequence that the data follows through the ETL pipeline ensures agreement between the two systems, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables, the ETL process executes a script that checks whether the necessary space is available in the target tables; the process only continues if sufficient space exists.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL pipeline, a checkpoint is created before each stage operation. If an error occurs anywhere during the operation, the process is stopped and the data is rolled back to the created checkpoint.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located, in the correct format, in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Date and Time Dimension
Table Design
The Date and Time Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:
Column Name | Description | DataType | Key | Constraint
DateSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
FullDateDescription | Original data from flat file | DateTime | Primary key | Unique
CalendarMonthName | | Varchar | n/a | not null
CalendarMonthNumberInYear | | Numeric | n/a | not null
CalendarQuaterNumberInYear | | Numeric | n/a | not null
CalendarSemesterNumberInYear | | Numeric | n/a | not null
CalendarWeekEndingDate | | DateTime | n/a | not null
CalendarWeekNumberInYear | | Numeric | n/a | not null
CalendarWeekStartingDate | | DateTime | n/a | not null
CalendarYear | | Numeric | n/a | not null
CalendarYYYYMM | | Numeric | n/a | not null
DayNumberInCalendarMonth | | Numeric | n/a | not null
DayNumberInCalendarWeek | | Numeric | n/a | not null
DayNumberInCalendarYear | | Numeric | n/a | not null
HourNumberInDay | | Numeric | n/a | not null
isBreakfast | | bit | n/a | not null
isCoastalSchoolHoliday | | bit | n/a | not null
isDinner | | bit | n/a | not null
isDuringDay | | bit | n/a | not null
isDuringNight | | bit | n/a | not null
isFirstDayInMonth | | bit | n/a | not null
isFirstDayInQuater | | bit | n/a | not null
isFirstDayInSemester | | bit | n/a | not null
isFirstDayInWeek | | bit | n/a | not null
isFirstDayInYear | | bit | n/a | not null
isInlandSchoolHoliday | | bit | n/a | not null
isLastDayInMonth | | bit | n/a | not null
isLastDayInQuater | | bit | n/a | not null
isLastDayInSemester | | bit | n/a | not null
isLastDayInWeek | | bit | n/a | not null
isLastDayInYear | | bit | n/a | not null
isLeapYear | | bit | n/a | not null
isLunch | | bit | n/a | not null
isPublicHoliday | | bit | n/a | not null
isReligiousDay | | bit | n/a | not null
isSpecialDay | | bit | n/a | not null
isWeekday | | bit | n/a | not null
AuditKey | Audit Dimension foreign key | Numeric | Foreign key | not null
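A hedged sketch of how a calendar dimension like this can be populated on Oracle by generating one row per day and deriving a few of the attributes. The table name DIM_DATE, the date range and the attribute subset shown are assumptions; the remaining flags (and the finer time-of-day grain implied by HourNumberInDay) would be derived in the same way, and the surrogate key and AuditKey are omitted for brevity:

INSERT INTO DIM_DATE (FullDateDescription, CalendarYear, CalendarMonthNumberInYear,
                      CalendarMonthName, DayNumberInCalendarMonth, DayNumberInCalendarWeek,
                      isFirstDayInMonth, isWeekday)
SELECT d                                                    AS FullDateDescription,
       EXTRACT(YEAR FROM d)                                 AS CalendarYear,
       EXTRACT(MONTH FROM d)                                AS CalendarMonthNumberInYear,
       TO_CHAR(d, 'Month')                                  AS CalendarMonthName,
       EXTRACT(DAY FROM d)                                  AS DayNumberInCalendarMonth,
       TO_NUMBER(TO_CHAR(d, 'D'))                           AS DayNumberInCalendarWeek,
       CASE WHEN EXTRACT(DAY FROM d) = 1 THEN 1 ELSE 0 END  AS isFirstDayInMonth,
       CASE WHEN TO_CHAR(d, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH') IN ('SAT', 'SUN')
            THEN 0 ELSE 1 END                               AS isWeekday
  FROM (SELECT DATE '2012-01-01' + LEVEL - 1 AS d           -- one row per day; range is illustrative
          FROM dual
        CONNECT BY LEVEL <= 366);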
Historic data load parameters and volumes
Parameters - the data is descriptive by nature and is not tied to any form of date or time, so no parameters need to be considered; it is, however, dependent on a data source that is tied to certain parameters.
Volumes - the data source of this dimension contains 1,000,000 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are extremely high, seeing as the data contained
therein grows with every transaction that is processed.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. As can be seen from the load frequency below, the time between loads is relatively short, so any late arriving data is simply held over until the next load.
Load frequency
The initial load is performed once, followed by weekly loads thereafter, with the possibility of daily loads if the client so requests.
It is suggested that data dumps from the operational data source into flat files are done at a time when the
operation source sys