ITRI 611/621 Project
Data Warehouses
Final Documentation

Martin Gouws - 21776032
JP Taljaard - 21735549
Heinrich du Toit - 21077533
Table of Contents
1. Table of Contents
2. Introduction
3. Project Planning
4. Project Management
5. Business Requirements Definition
6. Technology Track
7. Data Track
   a. Step 1 - High Level Plan
   b. Step 2 - Choose an ETL Tool
   c. Step 3 & 4 - Develop Default Strategies and Drill Down by Target Table
      i. General Points of Discussion
      ii. Data Hierarchies
      iii. Detailed Table Breakdowns
      iv. Dependency Tree
   d. Step 5
   e. Step 6
   f. Step 7
   g. Step 8
   h. Step 9
   i. Step 10
      i. ETL Automation
8. Business Intelligence Application Track
   a. OLAP Cube with Analysis Services
   b. MDX Queries
   c. PowerPivot
   d. SQL Server Reporting Services (SSRS)
   e. SharePoint
9. Deployment
10. Management
11. Growth
12. Group Participation
13. Annexure A
14. Annexure B
15. Annexure C
16. Annexure D
Introduction
The documentation of this project needed a coherent structure. The Kimball Lifecycle was used as the basis of that structure, and the documentation includes the following sections:

- Project planning
- Project management
- Business requirements gathering
- Technology track
- Data track
- Business intelligence track
- Deployment
- Maintenance
- Growth

Not all of the phases of the Kimball Lifecycle are within the scope of this project; such phases are still included, but are explicitly noted as out of scope, with reasons given in the applicable sections.
Project Planning
We decided to break the project down in terms of Kimball's lifecycle for data warehouses, as described in The Data Warehouse Lifecycle Toolkit. As stated by the lecturer for the module in the study guide, there are seven main phases that need to be completed by the end of this project. These seven phases will be used as project milestones. Both the lifecycle and the milestones were captured in a Microsoft Project 2010 plan, with all related times and durations included, to define clearly when each task needs to be done - also known as a WBS (Work Breakdown Structure). The starting date of this project is taken as 01/03/2012, and the end date as 30/09/2012.
The WBS and Gantt chart can be seen in Annexure A.
Project Management
Project planning and project management very often differ significantly. In the case of this project, the
project planning as stated in the above section was an overall guide, but the practical implementation was
very different.
The project progress was not monitored on Project 2010 as planned; rather the project was managed week
by week and in a dynamic fashion, and the group member allocations changed as the project progressed.
Business Requirements Definition
The business requirements were given by the lecturer in association with industry partners. The given
requirements were:
- Sales per hour per day.
- A comparison of how a restaurant's sales look in terms of the average of the region that restaurant is situated in. For example, a restaurant in Johannesburg compared to all the restaurants in all of Gauteng.
- A breakdown of restaurants per region per product. For example, the Sloane Square restaurant's sales of burgers, soft drinks, etc. versus the average of the whole region of Gauteng.
- Include a cost price with each menu item in order to calculate profits, etc., again broken down to per-day and per-restaurant levels.

The following requirements were additionally added:

- Average amount per purchase.
- Determine what time of the month is most popular.
- Top 10 Highest Grossing Restaurants.
- Top 10 Product Sales.
- Preferred Flavours per province.
- Number of orders (# transactions) per restaurant.

Seeing as this section is essentially a repetition of the project problem statement, it has not been expanded upon further.
Technology Track
Technical Architecture
For our project, we decided to use a Virtual Machine for the operating system, software and data to reside in. Oracle VirtualBox was used to create and run the Virtual Machine.

- The computer hosting the Virtual Machine runs an Intel i7 3.07 GHz processor.
- The Virtual Machine is assigned 10 GB of system memory.
- A virtual hard drive of 150 GB was created and assigned to the Virtual Machine.
- The Virtual Machine uses a virtual network adapter that is bridged with the host network adapter.

Software

On the Virtual Machine, we installed the following:

- Microsoft Windows Server 2008 R2 (operating system)
- Oracle 11g (database)
- Microsoft SQL Server 2008 R2 (databases)
- Microsoft SharePoint 2010 (for dashboards)
- Oracle Client Tools (.NET plugin)
- MDX Studio (for MDX queries)
- SQL Server Business Intelligence Development Studio (SSIS, SSRS, SSAS)
- Microsoft PowerPivot (Excel plugin for dashboarding)
- Oracle SQL Developer (Oracle database management environment)
- Microsoft IIS (requirement for Microsoft SharePoint)
- Remote Desktop Services (for multiple concurrent remote user connections)
- Oracle VM VirtualBox Guest Additions
- Microsoft Office 2010
- Microsoft Visual Studio 2010
Data Track
This section, the data track phase, was approached using the Kimball Lifecycle specifications for the ETL
process. Firstly, there are 34 subsystems that are identified by the authors that make up the basis of any ETL
system. These 34 subsystems are then later spread over 10 main steps that comprise the entire process of
ETL development.
In this section, these 10 steps were used as the overall structure and to determine the workload, while keeping in mind that the process is made up of the 34 subsystems.

This documentation may change as the project progresses, but it is a solid starting point and is not expected to change drastically.
Certain steps were grouped together and handled as a unit, but each individual step is still covered,
regardless of its combination with other steps.
Firstly though, a list of assumptions regarding the data:

- There were transactions made on 29 February 2011, a date that did not exist because 2011 was not a leap year.
- No prices were given, so prices for menu items were extracted from the Nando's website.
- In the MenuItems table, there were duplicate rows with the same MenuItemDescription but different primary keys. We decided to remove these duplicates. This was done by running a sequence of queries to leave only a single unique MenuItemDescription with a corresponding MenuItemCode (primary key). In light of this, the Transaction table had to be updated, replacing the previously duplicated MenuItemCode with the new single MenuItemCode (see the sketch after this list).
- Restaurant short names ... Florida and Adderley.
- A duplicate was found in MenuItems ... all transactions pointing to the second, incorrectly spelled item were pointed at the correct one, and the incorrectly spelled menu item was deleted.
- No cost prices ... markup of 200%.
- No selling prices.
- An extra column was added in MenuItems for cost price, and selling prices of 0 were updated with data.
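A minimal T-SQL sketch of the deduplication and repointing described above. The working table names (MenuItems, Transactions) and the choice of the smallest MenuItemCode as the surviving key are assumptions for illustration; the project's actual scripts may differ.

    -- For every duplicated description, pick one surviving MenuItemCode (the smallest).
    -- Assumption: working tables MenuItems(MenuItemCode, MenuItemDescription)
    -- and Transactions(..., MenuItemCode) in the development database.
    ;WITH Survivors AS (
        SELECT MenuItemDescription,
               MIN(MenuItemCode) AS KeepCode
        FROM   MenuItems
        GROUP BY MenuItemDescription
    )
    -- Repoint transactions that reference a duplicate onto the surviving code.
    UPDATE t
    SET    t.MenuItemCode = s.KeepCode
    FROM   Transactions t
    JOIN   MenuItems m ON m.MenuItemCode = t.MenuItemCode
    JOIN   Survivors s ON s.MenuItemDescription = m.MenuItemDescription
    WHERE  t.MenuItemCode <> s.KeepCode;

    -- Remove the now-unreferenced duplicate menu items.
    DELETE m
    FROM   MenuItems m
    WHERE  m.MenuItemCode NOT IN (
        SELECT MIN(MenuItemCode)
        FROM   MenuItems
        GROUP BY MenuItemDescription
    );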
The 10 ETL steps were implemented as follows.

Step 1 Draw the High Level Plan
Please refer to Annexure B.
Step 2 Choose an ETL Tool
SSIS (SQL Server Integration Services) was chosen as the ETL tool.

Wikipedia defines SSIS as follows:
SQL Server Integration Services (SSIS) is a component of the Microsoft SQL Server database software that
can be used to perform a broad range of data migration tasks. SSIS is a platform for data integration and
workflow applications. It features a fast and flexible data warehousing tool used for data extraction,
transformation, and loading (ETL). The tool may also be used to automate maintenance of SQL Server
databases and updates to multidimensional cube data.
Another tool was written in-house to extract the date and time stamp from the transactions table and, from that, deduce the attributes of the date and time dimensions.
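A minimal T-SQL sketch of the kind of derivation that in-house tool performs, assuming a hypothetical DimDate target table and the TransDate column of the transactions table; a time dimension would be derived in the same manner from the time-of-day portion.

    -- Derive one DimDate row per distinct calendar date seen in the transactions.
    -- Assumption: hypothetical target DimDate(DateKey, FullDate, Year, Month, DayOfMonth, DayOfWeek, IsWeekend).
    INSERT INTO DimDate (DateKey, FullDate, [Year], [Month], DayOfMonth, [DayOfWeek], IsWeekend)
    SELECT DISTINCT
           CONVERT(int, CONVERT(char(8), t.TransDate, 112)) AS DateKey,   -- e.g. 20120301
           CONVERT(date, t.TransDate)                       AS FullDate,
           YEAR(t.TransDate)                                 AS [Year],
           MONTH(t.TransDate)                                AS [Month],
           DAY(t.TransDate)                                  AS DayOfMonth,
           DATENAME(weekday, t.TransDate)                    AS [DayOfWeek],
           CASE WHEN DATENAME(weekday, t.TransDate) IN ('Saturday', 'Sunday')
                THEN 1 ELSE 0 END                            AS IsWeekend
    FROM   Transactions t
    WHERE  NOT EXISTS (SELECT 1 FROM DimDate d
                       WHERE d.FullDate = CONVERT(date, t.TransDate));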
Step 3 & 4 Develop Default Strategies & Drill Down by Target Table
This document includes historic and incremental load strategies for each table (dimension and fact) that is handled by the ETL system; but first, a few general points of discussion, followed by the table details.
General Points of Discussion:
Default strategy for extracting from each major source system
Only one source system will be used to extract data. This source will be the operational database. The
operational data from the source system was received in flat file format.
Extraction will be done using the Import/Export tool included in SSIS (SQL Server Integration Services), loading into the relevant tables in the development database.
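As an illustration of this extraction step, a minimal T-SQL sketch that loads one flat file into a staging table. The file path, delimiters and staging table name are assumptions; in practice the SSIS Import/Export tool performs the equivalent work.

    -- Hypothetical staging table matching the MenuCategory flat file layout.
    CREATE TABLE stg_MenuCategory (
        MenuCategoryCode  varchar(50),
        MenuCategoryDesc  varchar(255)
    );

    -- Assumed file location and delimiters; the real load is done through the SSIS Import/Export tool.
    BULK INSERT stg_MenuCategory
    FROM 'C:\ETL\source\MenuCategory.txt'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2          -- skip the header row
    );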
Archival strategy
As the data received is already in flat-file format, archival of this data is easy to implement. The data will be
archived for possible recovery when needed.
As the data is not very large, and compression makes it significantly smaller, the data can be stored for as long as needed.
Data quality tracking and metadata
The current load and quality checking is a manual process, but during the ETL the following steps will be implemented.

Data quality will be checked in the following manner:

- Errors in the flat files that may cause problems, such as quotation marks around all data items.
- Each row in all tables should be checked for incomplete data - missing fields, null values, etc.
- All rows should be compared to see whether records exist that can be identified as duplicates.
- Check whether identified keys match with the corresponding tables.
- Check for spelling mistakes that could create duplicates.
- Valid transactional date(s) and time.

The following actions are to be taken when errors or problems occur during data checking:
- If incomplete data is found, the data can still be transferred to the dimension and fact tables, but a note should be made about the incomplete data that was found. This will be documented in the audit dimension and tagged in the fact. A log that can be reviewed should also be kept.
- When duplicate records are found, e.g. item descriptions that are the same, this will have to be managed in one of two ways: (1) if the problem has not been encountered previously, a decision should be taken by the data warehouse builder on how to handle it; (2) if it has been identified before, a procedure should be written to handle the duplicate, e.g. a script that automatically executes the actions that the data warehouse builder has specified.
- Although it should not happen, in theory the data may contain errors relating to the date and time, resulting in an invalid time, date, or combination of the two. These errors would arise during the transition from records in the database to the flat file format, where the data is saved as text - text files do not retain data types.
If data under evaluation do not pass the quality checks, a classification of severity should be assigned to each
record that did not meet the standards set by the data warehouse builder. Each dimension or fact record
loaded will be tagged with a classification in the audit dimension.
The classification scale used:

0 - No problem
1 - Data type conversion error
2 - Data type validation failure
3 - Missing data, NULL extracted from flat files, etc.
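A minimal T-SQL sketch of how this classification might be applied during checking, reusing the hypothetical stg_MenuCategory staging table from the extraction sketch and an assumed DataQualityClass column; the project's actual checks are enforced by its own ETL scripts.

    -- Assumed helper column that carries the severity classification per staged row.
    ALTER TABLE stg_MenuCategory ADD DataQualityClass tinyint NOT NULL DEFAULT 0;
    GO

    -- Classification 3: missing data (NULL or empty values extracted from the flat files).
    UPDATE stg_MenuCategory
    SET    DataQualityClass = 3
    WHERE  MenuCategoryCode IS NULL
       OR  LTRIM(RTRIM(ISNULL(MenuCategoryDesc, ''))) = '';

    -- Classification 1: data type conversion error (code cannot be converted to numeric).
    UPDATE stg_MenuCategory
    SET    DataQualityClass = 1
    WHERE  DataQualityClass = 0
      AND  ISNUMERIC(MenuCategoryCode) = 0;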
An audit dimension will be used because merely disregarding dirty records is bad practice: it compromises the integrity of the data and creates gaps, which distort the big picture.

The severity of errors during a load should indicate whether the load process should be aborted, especially if errors are frequent and severe.

The load process will be automated with scripts that enforce the quality checks and the loading of data into the fact and dimension tables. If errors occur during the load process, the data warehouse builder should be notified to take appropriate action, especially if classification 3 is assigned.
Errors should be logged and the data warehouse builder should be notified using a reliable communication
medium such as email. The data warehouse builder should then take action according to the severity of the
problem encountered.
A deduplication system was not formally implemented in the ETL system, because data is not retrieved from multiple sources, so survivorship and matching are not needed. Deduplication was, however, done manually where such problems (duplicates, matching and survivorship) were encountered in the data.

Two problems in this regard were encountered. The first was the duplicate menu item descriptions, which was handled by running a SQL script that isolated a single description and its identifier and replaced all relevant record fields with the isolated key and description; the duplicates were then dropped from the table. The second was the renaming of misspelled data as it was encountered during data loading.
A conforming system in the ETL system is of no use in this situation, because all the data is retrieved from a
single source, namely the database flat files that were generated from the sales processing database system.
The single source system, combined with the fact that no dimensions are shared, means that this subsystem is not applicable.
Default strategy for managing changes to dimension attributes
The Type 1 technique is a simple overwrite of one or more attributes in an existing dimension row. The revised data from the change data capture system is used to overwrite existing data in the dimension table. Type 1 is used when data needs to be corrected or when there is no business need to keep a history of the previous values.

The Type 2 technique is used to track changes to dimensions and to associate them correctly with existing and new fact records. Supporting Type 2 changes requires a strong change data capture system that detects changes as soon as they occur. For Type 2 updates, copy the previous version of the dimension row and create a new dimension row with a new surrogate key. If there is no previous version of the dimension row, create a new one from scratch. Then update this row with the columns that have changed. This technique is used for handling dimension attributes that change and that need to be tracked over time.

Type 3 is not implemented in this system and is therefore not discussed.
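A minimal T-SQL sketch of the two techniques, using the Menu Category dimension as an example. The staging table stg_MenuCategory, the housekeeping columns (RowEffectiveDate, RowEndDate, IsCurrent) and the @AuditKey variable are assumptions for illustration and do not form part of the table designs given later.

    DECLARE @AuditKey int = 1;  -- key of the current audit batch (assumed)

    -- Type 1: overwrite the attribute in place; no history is kept.
    UPDATE d
    SET    d.MenuCategoryDesc = s.MenuCategoryDesc
    FROM   MenuCategoryDim d
    JOIN   stg_MenuCategory s ON s.MenuCategoryCode = d.MenuCategoryCode
    WHERE  d.MenuCategoryDesc <> s.MenuCategoryDesc;

    -- Type 2 (the alternative): expire the current row, then insert a new version
    -- whose new surrogate key is generated by the IDENTITY column.
    UPDATE d
    SET    d.RowEndDate = GETDATE(), d.IsCurrent = 0
    FROM   MenuCategoryDim d
    JOIN   stg_MenuCategory s ON s.MenuCategoryCode = d.MenuCategoryCode
    WHERE  d.IsCurrent = 1
      AND  d.MenuCategoryDesc <> s.MenuCategoryDesc;

    INSERT INTO MenuCategoryDim (MenuCategoryCode, MenuCategoryDesc, AuditKey,
                                 RowEffectiveDate, RowEndDate, IsCurrent)
    SELECT s.MenuCategoryCode, s.MenuCategoryDesc, @AuditKey, GETDATE(), '9999-12-31', 1
    FROM   stg_MenuCategory s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   MenuCategoryDim d
                       WHERE  d.MenuCategoryCode = s.MenuCategoryCode
                         AND  d.IsCurrent = 1);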
Refer to tables below for specific dimension and fact change management.
System availability requirements and strategy
The operational data source was made available on 12 March 2012 in the format of flat files.

High-level block sequencing is set out below, from flat files through to the development database and finally to the data warehouse database, where the facts and dimensions will be loaded.
See Annexure B for high-level sequencing of each dimension and fact.
Due to the nature of incremental batch loads, large amounts of system resources will be consumed to process the data (ETL): load, clean, save. For this reason, the ETL process will be implemented on a separate server that is responsible for ETL.

Care will be taken when updating fact and dimension tables with new data not to overwhelm the system with a high incoming load; rather, the load will be split into smaller pieces and uploaded when the data warehouse system's load is low. In this way, the system remains available even during uploads.
Design of the data auditing subsystem
The auditing subsystem will be used to capture data load information and keep track of it. A key will be
created for each type of event that happens while loading the facts or dimensions. This key is then assigned
to the fact or dimension in question. The keys will be stored in the audit dimension with additional
information such as the type of error, the time and date of occurrence, the batch job name or number, and possibly more.
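A minimal sketch of what such an audit dimension might look like in the development database. The column names are assumptions chosen to be consistent with the classification scale above; the actual audit table used in the project may differ.

    -- Hypothetical audit dimension; one row is created per load event.
    CREATE TABLE AuditDim (
        AuditKey         int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- key assigned to loaded facts/dimensions
        BatchName        varchar(100) NOT NULL,                  -- batch job name or number
        TableName        varchar(100) NOT NULL,                  -- fact or dimension table being loaded
        EventType        varchar(50)  NOT NULL,                  -- type of event, e.g. 'Historic load'
        QualityClass     tinyint      NOT NULL DEFAULT 0,        -- 0-3 classification scale defined above
        ErrorDescription varchar(500) NULL,                      -- description of the error, if any
        EventDateTime    datetime     NOT NULL DEFAULT GETDATE() -- time and date of occurrence
    );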
Locations of staging areas
In the ETL stage, multiple staging areas will exist for processing the source data:

1. Import stage
   a. In this stage, the data will be imported from the source system (flat files).
   b. Verification of data types needs to happen here to limit possible errors and updates later on.
2. Cleaning stage
   a. The data will be checked for missing data, such as incomplete fields or NULLs.
   b. Duplicate checking and removal of duplicates. All relevant records in other tables should be updated accordingly to reflect the single record chosen to replace the duplicates.
   c. Audit logs implemented.
3. Population of dimension tables.
4. Key-mapping stage - key maps are to be built for use in linking and in the creation of the fact table data (see the sketch after this list).
5. Population of fact tables.
6. Transfer from the development database to the Oracle production database.

Bulk-load the development data, which includes the dimension and fact tables, to the Oracle production database.
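A minimal T-SQL sketch of the key-mapping idea in stage 4. The key-map table, the dimension name MenuItemDim and the fact table SalesFact with its column list are assumptions for illustration; the point is that natural keys from the transactions are swapped for dimension surrogate keys when the fact rows are built.

    -- Hypothetical key map from the natural MenuItemCode to the dimension surrogate key.
    CREATE TABLE KeyMap_MenuItem (
        MenuItemCode   varchar(50) NOT NULL PRIMARY KEY,
        MenuItemSurKey int         NOT NULL
    );

    INSERT INTO KeyMap_MenuItem (MenuItemCode, MenuItemSurKey)
    SELECT MenuItemCode, MenuItemSurKey
    FROM   MenuItemDim;          -- the populated Menu Items dimension (name assumed)

    -- Swap natural keys for surrogate keys while building the fact rows.
    INSERT INTO SalesFact (MenuItemSurKey, TransDate, OrderNr, ItemNr)
    SELECT km.MenuItemSurKey, t.TransDate, t.OrderNr, t.ItemNr
    FROM   Transactions t
    JOIN   KeyMap_MenuItem km ON km.MenuItemCode = t.MenuItemCode;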
Data Hierarchies
Detailed Table Breakdown
The data sources that are referred to below are the working database tables, not the original flat files, and
these tables are already cleaned and the data validated as per ETL procedures.
Menu Category Dimension
Table Design
The Menu Category Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
MenuCategorySurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
MenuCategoryCode | ID number of the category | Numeric | Primary key | Unique
MenuCategoryDesc | Description of the category | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
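A minimal T-SQL sketch of this design as it might be created in the development database. The column sizes are assumptions, the key and constraint assignments follow the table above, and AuditDim refers to the audit dimension sketched earlier.

    -- Menu Category dimension, per the table design above (development database version).
    CREATE TABLE MenuCategoryDim (
        MenuCategorySurKey int IDENTITY(1,1) NOT NULL,  -- surrogate key; carries no metadata
        MenuCategoryCode   numeric(10,0)     NOT NULL,  -- ID number of the category
        MenuCategoryDesc   varchar(255)      NOT NULL,  -- description of the category
        AuditKey           int               NOT NULL,  -- audit dimension foreign key
        CONSTRAINT PK_MenuCategoryDim PRIMARY KEY (MenuCategoryCode),
        CONSTRAINT FK_MenuCategoryDim_Audit FOREIGN KEY (AuditKey) REFERENCES AuditDim (AuditKey)
    );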
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 19 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are relatively low, seeing as the data contained
therein is of a very static nature.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
MenuCategorySurKey n/a
MenuCategoryCode 2
MenuCategoryDesc 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
MenuCategoryID | ID number of the Food Group | Numeric | 0 | Max value possible in DBMS | 1
MenuCategoryCode | Category Code | Numeric | 0 | 10000 | 1
MenuCategoryDesc | Description of the category | Varchar | | | 1
Auditkey | Audit Dimension Foreign Key | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
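As an illustration of the final bulk-load step above, a minimal T-SQL sketch that pushes one temporary dimension to the Oracle data warehouse over a linked server. The linked server name (ORACLE_DW), the Oracle schema, and the assumption that the surrogate key is generated on the Oracle side are all illustrative; this step could equally be performed with an SSIS data flow.

    -- Assumption: a linked server named ORACLE_DW points at the Oracle data warehouse,
    -- and dw_tmp.MenuCategoryDim is the temporary dimension in the working database.
    INSERT OPENQUERY(ORACLE_DW,
        'SELECT MenuCategoryCode, MenuCategoryDesc, AuditKey FROM DW.MENUCATEGORYDIM')
    SELECT MenuCategoryCode, MenuCategoryDesc, AuditKey
    FROM   dw_tmp.MenuCategoryDim;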
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to check
for necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Menu Flavour Dimension
Table Design
The Menu Flavour Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
MenuFlavourSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
MenuFlavourCode | ID number of the flavour | Numeric | Primary key | Unique
MenuFlavourDesc | Description of the flavour | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 5 records.

Incremental data volumes
The chances of incremental data volumes in this dimension are medium, seeing as the data contained
therein is of a relatively static nature. The only time this dimension would grow is with a change in business rules or strategies and the addition of further complexity to the products.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
MenuFlavourSurKey n/a
MenuFlavourCode 2
MenuFlavourDesc 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
MenuFlavourID | ID number of the Food Group | Numeric | 0 | Max value possible in DBMS | 1
MenuFlavourCode | Flavour Code | Numeric | 0 | 10000 | 1
MenuFlavourDesc | Description of the flavour | Varchar | | | 1
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to check
for necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Menu Item Food Group Dimension
Table Design
The Menu Item Food Group Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
MenuFoodGroupSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
MenuFoodGroupCode | ID number of the FoodGroup | Numeric | Primary key | Unique
MenuFoodGroupDesc | Description of the FoodGroup | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 50 records.

Incremental data volumes
The chances of incremental data volumes in this dimension are relatively low, seeing as the data contained
therein is of a very static nature.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
MenuFoodGroupSurKey n/a
MenuFoodGroupCode 2
MenuFoodGroupDesc 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
MenuFoodGroupID | ID number of the Food Group | Numeric | 0 | Max value possible in DBMS | 1
MenuFoodGroupCode | Food Group Code | Numeric | 0 | 10000 | 1
MenuFoodGroupDesc | Description of the FoodGroup | Varchar | | | 1
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to checkfor necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Menu Items Dimension
Table Design
The Menu Items Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
MenuItemSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
MenuItemCode | ID number of the Items | Varchar | n/a | Unique
MenuItemDesciption | Description of the item | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 789 records.

Incremental data volumes
The chances of incremental data volume are medium. As with all organizations that deliver a product, those
products are continuously expanded. This expansion is not so rapid that the data volumes will increase
dramatically, but they will increase slightly over time.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
MenuItemSurKey n/a
MenuItemCode 2
MenuItemDesciption 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
MenuItemID | ID number of the MenuItem | Numeric | 0 | Max value possible in DBMS | 1
MenuItemCode | Item Product Codes | Varchar | | | 1
MenuItemDesciption | Description of the item | Varchar | | | 1
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on the following dimensions:
- Audit
- Menu Item Flavour
- Menu Category
- Menu Sub Category
- Menu Food Group

All the above-mentioned dimensions must be loaded before this dimension may be populated, otherwise referential integrity will be compromised.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to check
for necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Medium.
Menu Sub Category Dimension
Table Design
The Menu Sub Category Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
MenuSubCategorySurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
MenuSubCategoryCode | ID number of Sub Category | Numeric | Primary key | Unique
MenuSubCategoryDesc | Description of the Sub Category | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 320 records.

Incremental data volumes
The chances of incremental data volumes in this dimension are relatively low, seeing as the data contained
therein is of a very static nature.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
MenuSubCategorySurKey n/a
MenuSubCategoryCode 2
MenuSubCategoryDesc 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
MenuSubCatID | ID number of the Sub Category | Numeric | 0 | Max value possible in DBMS | 1
MenuSubCategoryCode | Sub Category Code | Numeric | 0 | 10000 | 1
MenuSubCategoryDesc | Description of the Sub Category | Varchar | | | 1
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to check
for necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Province Dimension
Table Design
The Province Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
ProvinceSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
ProvinceCode | ID number of the Province | Numeric | Primary key | Unique
ProvinceDesc | Province Name | Varchar | n/a | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 11 records.

Incremental data volumes
The chances of incremental data volumes in this dimension are effectively zero, seeing as the data contained therein is set according to geographical boundaries that are unchanging. The only exceptions would be if the organization in question expanded beyond its current borders or if the country were re-divided into new provinces.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
ProvinceSurKey n/a
ProvinceCode 1
ProvinceDesc 1
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
Province ID | ID number of the Province | Numeric | 0 | Max value possible in DBMS | 1
ProvinceCode | Associate Province Code | Numeric | 1 | 11 | 1
ProvinceDesc | Province Name | Varchar | | | 1
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables a script will be executed by the ETL process to check
for necessary space requirements in the tables and the process will only continue if such space does exist.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL
pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the
operation, the process is stopped and the data is rolled back to the created check point.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Restaurant Dimension

Table Design
The Restaurant Dimension contains the following attributes (column names), each with its description, data type, key (if applicable) and constraint (if applicable):

Column Name | Description | Data Type | Key | Constraint
RestSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
RestCode | ID number of Restaurant | Numeric | Primary key | Unique
RestShortName | Shortened version of Restaurant name | Varchar | n/a | not null
RestName | Restaurant Name | Varchar | n/a | not null
IsCoastal | Is the Restaurant located near the coast - 0=no; 1=yes | Bit | n/a | not null
ProvinceCode | ID of province which restaurant is in | Numeric | Foreign key | not null
AuditKey | Audit Dimension Foreign Key | Numeric | Foreign key | not null
Historic data load parameters and volumes
- Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes - the data source of this dimension contains 101 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are medium-high, seeing as the data contained
therein is likely to expand as the organization in question grows and more franchises are established.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. The way this is done is as follows:

As can be seen from the load frequency below, the times between loads are relatively close to one another, so late arriving data will be postponed until the next load.

Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.
This dimension is compared to the source data to identify if changes or additions have occurred. It is then
updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
RestSurKey n/a
RestCode n/a
RestShortName 1
RestName 1
IsCoastal n/a
ProvinceCode n/a
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
Data source for this table is normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:

Column Name | Description | Data Type | Min | Max | Count of Distinct Values
RestID | Restaurant Identification code | Numeric | 0 | Max value possible in DBMS | 1
RestCode | ID number of Restaurant | Numeric | 0 | 2000 | 1
RestShortName | Shortened version of Restaurant name | Varchar | | | 1
RestName | Restaurant Name | Varchar | | | 1
IsCoastal | Is the Restaurant located near the coast - 0=no; 1=yes | Numeric | | | 1+
ProvinceCode | ID of province which restaurant is in | Numeric | | | 1+
Auditkey | | Numeric | | |
Extract strategy for the source data
Refer to default strategy as discussed in general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle data warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on the following dimension tables:

- Audit
- Regional Manager
- Country
- Province
- Hub

All the above-mentioned dimensions need to be loaded before this dimension can be populated, otherwise referential integrity might be compromised.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables, the ETL process executes a script that checks whether the necessary space is available in the target tables; the process only continues if sufficient space exists.
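A hedged sketch of such a check against the Oracle data dictionary; the tablespace name DW_DATA and the 100 MB threshold are illustrative assumptions:

-- Returns a row only when free space in the target tablespace falls below the threshold,
-- in which case the ETL job aborts before loading.
SELECT tablespace_name,
       ROUND(SUM(bytes) / 1024 / 1024) AS free_mb
  FROM dba_free_space
 WHERE tablespace_name = 'DW_DATA'
 GROUP BY tablespace_name
HAVING SUM(bytes) / 1024 / 1024 < 100;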
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL pipeline, a checkpoint is created before each stage operation. If an error occurs anywhere during the operation, the process is stopped and the data is rolled back to the created checkpoint.
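Within a single database session this behaviour can be sketched with a savepoint; the PL/SQL block below is illustrative only (the project's ETL tool provides the equivalent at package level), and surrogate key and AuditKey handling are omitted for brevity:

BEGIN
  SAVEPOINT before_dim_load;                 -- checkpoint before the stage operation
  INSERT INTO DIM_RESTAURANT (RestCode, RestShortName, RestName, IsCoastal, ProvinceCode)
    SELECT RestCode, RestShortName, RestName, IsCoastal, ProvinceCode
      FROM STG_RESTAURANT;
  COMMIT;
EXCEPTION
  WHEN OTHERS THEN
    ROLLBACK TO before_dim_load;             -- undo the failed stage and stop the run
    RAISE;
END;
/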
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located, in the correct format, in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
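A minimal sketch of this cleanup step, assuming illustrative staging table names in the development database:

-- Empty the working tables once the Oracle load has been verified.
TRUNCATE TABLE STG_RESTAURANT;
TRUNCATE TABLE STG_TRANSACTIONS;
TRUNCATE TABLE STG_HUB;
-- The original flat files are then moved to an archive location outside the database.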
Estimated difficulty of implementation
Easy.
Transactions Dimension
Table Design
The Transaction Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:
Column Name | Description | DataType | Key | Constraint
TransSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
TransDate | Date and time transaction occurred | Datetime | n/a | not null
OrderNr | Order number of transaction | Numeric | n/a | not null
ItemNr | Item number of order | Numeric | n/a | not null
MenuItemCode | Menu item on order | Varchar | Foreign key | not null
AuditKey | Audit Dimension foreign key | Numeric | Foreign key | not null
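A hedged DDL sketch of this design for the Oracle target, including the monthly partitioning described under Table partitioning below. The table and constraint names, column sizes and the referenced dimension names (DIM_MENUITEM, DIM_AUDIT) are assumptions rather than project artefacts, and interval partitioning assumes Oracle 11g or later:

CREATE TABLE DIM_TRANSACTIONS (
  TransSurKey  NUMBER       NOT NULL,   -- surrogate key, populated from a sequence by the ETL process
  TransDate    DATE         NOT NULL,   -- date and time the transaction occurred
  OrderNr      NUMBER       NOT NULL,
  ItemNr       NUMBER       NOT NULL,
  MenuItemCode VARCHAR2(20) NOT NULL,
  AuditKey     NUMBER       NOT NULL,
  CONSTRAINT pk_dim_transactions PRIMARY KEY (TransSurKey),
  CONSTRAINT fk_trans_menuitem   FOREIGN KEY (MenuItemCode) REFERENCES DIM_MENUITEM (MenuItemCode),
  CONSTRAINT fk_trans_audit      FOREIGN KEY (AuditKey)     REFERENCES DIM_AUDIT (AuditKey)
)
PARTITION BY RANGE (TransDate)
  INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
  (PARTITION p_first VALUES LESS THAN (DATE '2012-02-01'));  -- one partition per calendar month thereafter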
Historic data load parameters and volumes
Parameters - this is the only data from the flat file sources that provides information about the business process itself, and there are certain parameters that need to be taken into account. Historic loads will be done month for month, with a total of 4 months contained in the data. The transactions are not equally spread over the 4 months.
Volumes - the data source of this dimension contains 1,000,000 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are extremely high, seeing as the data contained therein grows with every transaction that is processed. The data volume is large and needs to be managed carefully.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. As can be seen from the load frequency below, the time between loads is relatively short, so any late arriving data is simply held over until the next load.
Load frequency
The initial load is performed once, followed by weekly loads thereafter, with the possibility of daily loads if the client so requests.
It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon afterwards.
This dimension is compared to the source data to identify whether changes or additions have occurred, and it is then updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
TransSurKey n/a
TransDate n/a
OrderNr n/a
ItemNr n/a
MenuItemCode n/a
AuditKey n/a
Table partitioning
This table is partitioned according to month.
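Because the table is range partitioned on TransDate by calendar month (see the DDL sketch under Table Design above), a query restricted to a single month only has to read that month's partition. The example below is a hedged illustration; the month boundaries are assumptions:

-- Only the partition for the selected month is scanned (partition pruning).
SELECT MenuItemCode, COUNT(*) AS items_sold
  FROM DIM_TRANSACTIONS
 WHERE TransDate >= DATE '2012-03-01'
   AND TransDate <  DATE '2012-04-01'
 GROUP BY MenuItemCode;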
Overview of data source(s)
The data source for this table is a normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:
Column Name | Description | DataType | Min | Max | Count of Distinct Values
TransID | ID number of the transaction | Numeric | 0 | Max value possible in DBMS | 1
TransDate | Date and time transaction occurred | Datetime | | |
OrderNr | Order number of transaction | Numeric | 0 | 2000 |
ItemNr | Item number of order | Numeric | 0 | 50 |
MenuItemCode | Menu item on order | Varchar | | |
AuditKey | | Numeric | | |
Extract strategy for the source data
Refer to the default strategy discussed in the general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:
- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded into the Oracle data warehouse.
This sequence that the data follows through the ETL pipeline ensures agreement between the two systems, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on the following dimensions:
- Audit
- Menu Items
The above mentioned dimensions need to be populated before this dimension can be populated, to ensure referential integrity.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables, the ETL process executes a script that checks whether the necessary space is available in the target tables; the process only continues if sufficient space exists.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL pipeline, a checkpoint is created before each stage operation. If an error occurs anywhere during the operation, the process is stopped and the data is rolled back to the created checkpoint.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located, in the correct format, in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Medium.
Hub Dimension
Table Design
The Hub Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:
Column Name | Description | DataType | Key | Constraint
HubSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
HubCode | ID number of hub | Numeric | Primary key | Unique
AuditKey | Audit Dimension foreign key | Numeric | Foreign key | not null
Historic data load parameters and volumes
Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
Volumes - the data source of this dimension contains 19 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are extremely high, seeing as the data contained therein grows with every transaction that is processed. The data volume is large and needs to be managed carefully.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. As can be seen from the load frequency below, the time between loads is relatively short, so any late arriving data is simply held over until the next load.
Load frequency
The initial load is performed once, followed by weekly loads thereafter, with the possibility of daily loads if the client so requests.
It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon afterwards.
This dimension is compared to the source data to identify whether changes or additions have occurred, and it is then updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
HubSurKey n/a
HubCode 2
AuditKey n/a
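HubCode is handled as a Type 2 change, meaning a change produces a new dimension row rather than an overwrite. The sketch below shows the general Type 2 pattern only; it assumes the dimension carries the source HubID as a natural key plus housekeeping columns (EffectiveFrom, EffectiveTo, IsCurrent) that do not appear in the table design above, and the staging table STG_HUB is likewise illustrative:

-- 1. Expire the current version of any hub whose code has changed in the staging data.
UPDATE DIM_HUB d
   SET EffectiveTo = TRUNC(SYSDATE),
       IsCurrent   = 0
 WHERE IsCurrent = 1
   AND EXISTS (SELECT 1
                 FROM STG_HUB s
                WHERE s.HubID   = d.HubID
                  AND s.HubCode <> d.HubCode);

-- 2. Insert a new current row for every changed or brand new hub
--    (surrogate key and AuditKey assignment omitted for brevity).
INSERT INTO DIM_HUB (HubID, HubCode, EffectiveFrom, EffectiveTo, IsCurrent)
SELECT s.HubID, s.HubCode, TRUNC(SYSDATE), DATE '9999-12-31', 1
  FROM STG_HUB s
 WHERE NOT EXISTS (SELECT 1
                     FROM DIM_HUB d
                    WHERE d.HubID = s.HubID
                      AND d.IsCurrent = 1);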
Table partitioning
Table is not partitioned.
Overview of data source(s)
The data source for this table is a normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:
Column Name | Description | DataType | Min | Max | Count of Distinct Values
HubID | Hub identification code | Numeric | 0 | Max value possible in DBMS | 1
HubCode | ID number of hub | Numeric | 0 | 1000 |
AuditKey | | Numeric | | |
Extract strategy for the source data
Refer to the default strategy discussed in the general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:
- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded into the Oracle data warehouse.
This sequence that the data follows through the ETL pipeline ensures agreement between the two systems, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on the following dimensions:
- Audit
- Restaurant
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables, the ETL process executes a script that checks whether the necessary space is available in the target tables; the process only continues if sufficient space exists.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL pipeline, a checkpoint is created before each stage operation. If an error occurs anywhere during the operation, the process is stopped and the data is rolled back to the created checkpoint.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located, in the correct format, in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Regional Manager Dimension
Table Design
The Regional Manager Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:
Column Name | Description | DataType | Key | Constraint
RegionalManagerSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
RegionalManagerCode | ID number of Regional Manager | Numeric | Primary key | Unique
AuditKey | Audit Dimension foreign key | Numeric | Foreign key | not null
Historic data load parameters and volumes
Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
Volumes - the data source of this dimension contains 101 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are effectively none, seeing as the data contained therein is set according to geographical bounds that are unchanging and limited to a single country. The only exception would be if the organization in question expanded into other countries.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. As can be seen from the load frequency below, the time between loads is relatively short, so any late arriving data is simply held over until the next load.
Load frequency
The initial load is performed once, followed by weekly loads thereafter, with the possibility of daily loads if the client so requests.
It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon afterwards.
This dimension is compared to the source data during these loads to identify whether changes or additions have occurred, and it is then updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
RegionalManagerSurKey n/a
RegionalManagerCode 2
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
The data source for this table is a normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:
Column Name | Description | DataType | Min | Max | Count of Distinct Values
RegionalManagerID | Regional manager identification key | Numeric | 0 | Max value possible in DBMS | 1
RegionalManagerCode | Regional Manager assigned code | Numeric | 0 | 2000 | 1
AuditKey | | Numeric | | |
Extract strategy for the source data
Refer to the default strategy discussed in the general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:
- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded into the Oracle data warehouse.
This sequence that the data follows through the ETL pipeline ensures agreement between the two systems, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on the following dimensions:
- Audit
- Restaurants
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables, the ETL process executes a script that checks whether the necessary space is available in the target tables; the process only continues if sufficient space exists.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL pipeline, a checkpoint is created before each stage operation. If an error occurs anywhere during the operation, the process is stopped and the data is rolled back to the created checkpoint.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located, in the correct format, in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Country Dimension
Table Design
The Country Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:
Column Name | Description | DataType | Key | Constraint
CountrySurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
CountryCode | Country identification abbreviation | Varchar | Primary key | Unique
AuditKey | Audit Dimension foreign key | Numeric | Foreign key | not null
Historic data load parameters and volumes
Parameters - the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
Volumes - the data source of this dimension contains 101 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are effectively none, seeing as the data contained therein is set according to geographical bounds that are unchanging and limited to a single country. The only exception would be if the organization in question expanded into other countries.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. As can be seen from the load frequency below, the time between loads is relatively short, so any late arriving data is simply held over until the next load.
Load frequency
The initial load is performed once, followed by weekly loads thereafter, with the possibility of daily loads if the client so requests.
It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon afterwards.
This dimension is compared to the source data during these loads to identify whether changes or additions have occurred, and it is then updated as necessary.
Handling of changes in each attribute
Changes to each attribute in this table are handled as per the type indicated:
Column Name Handling of change - Type 1, 2 or 3
CountrySurKey n/a
CountryCode 1
AuditKey n/a
Table partitioning
Table is not partitioned.
Overview of data source(s)
The data source for this table is a normal transactional database table with no special or unusual characteristics.
Detailed source to target mapping
See Annexure C for this documentation.
Source data profiling
The source data can be described as follows:
Column Name | Description | DataType | Min | Max | Count of Distinct Values
CountryID | Country identifier key | Numeric | 0 | Max value possible in DBMS | 1
CountryCode | Country identification abbreviation | Varchar | | |
AuditKey | | Numeric | | |
Extract strategy for the source data
Refer to the default strategy discussed in the general points of discussion above.
Change data capture logic
The agreement between the source system and the data warehouse can be described as follows:
- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded into the Oracle data warehouse.
This sequence that the data follows through the ETL pipeline ensures agreement between the two systems, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
Dependencies
This dimension is dependent on a single other dimension, namely the audit dimension.
Transformation logic
See Annexure B for the diagram.
Preconditions to avoid error conditions
Before loading the data into the corresponding tables, the ETL process executes a script that checks whether the necessary space is available in the target tables; the process only continues if sufficient space exists.
Recover and restart assumptions for each major step of the ETL pipeline
Every time a load operation is performed from any source to its related destination anywhere in the ETL pipeline, a checkpoint is created before each stage operation. If an error occurs anywhere during the operation, the process is stopped and the data is rolled back to the created checkpoint.
Archiving assumptions
The default strategy, as mentioned in the general points of discussion above, is applied here.
Cleanup steps
When all the data has passed through the ETL system and is located, in the correct format, in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.
Estimated difficulty of implementation
Easy.
Date and Time Dimension
Table Design
The Date and Time Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:
Column Name | Description | DataType | Key | Constraint
DateSurKey | Identification field that contains no metadata | Numeric | Surrogate key | Identity
FullDateDescription | Original data from flat file | DateTime | Primary key | Unique
CalendarMonthName | | Varchar | n/a | not null
CalendarMonthNumberInYear | | Numeric | n/a | not null
CalendarQuaterNumberInYear | | Numeric | n/a | not null
CalendarSemesterNumberInYear | | Numeric | n/a | not null
CalendarWeekEndingDate | | DateTime | n/a | not null
CalendarWeekNumberInYear | | Numeric | n/a | not null
CalendarWeekStartingDate | | DateTime | n/a | not null
CalendarYear | | Numeric | n/a | not null
CalendarYYYYMM | | Numeric | n/a | not null
DayNumberInCalendarMonth | | Numeric | n/a | not null
DayNumberInCalendarWeek | | Numeric | n/a | not null
DayNumberInCalendarYear | | Numeric | n/a | not null
HourNumberInDay | | Numeric | n/a | not null
isBreakfast | | bit | n/a | not null
isCoastalSchoolHoliday | | bit | n/a | not null
isDinner | | bit | n/a | not null
isDuringDay | | bit | n/a | not null
isDuringNight | | bit | n/a | not null
isFirstDayInMonth | | bit | n/a | not null
isFirstDayInQuater | | bit | n/a | not null
isFirstDayInSemester | | bit | n/a | not null
isFirstDayInWeek | | bit | n/a | not null
isFirstDayInYear | | bit | n/a | not null
isInlandSchoolHoliday | | bit | n/a | not null
isLastDayInMonth | | bit | n/a | not null
isLastDayInQuater | | bit | n/a | not null
isLastDayInSemester | | bit | n/a | not null
isLastDayInWeek | | bit | n/a | not null
isLastDayInYear | | bit | n/a | not null
isLeapYear | | bit | n/a | not null
isLunch | | bit | n/a | not null
isPublicHoliday | | bit | n/a | not null
isReligiousDay | | bit | n/a | not null
isSpecialDay | | bit | n/a | not null
isWeekday | | bit | n/a | not null
AuditKey | Audit Dimension foreign key | Numeric | Foreign key | not null
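A hedged sketch of how a calendar dimension like this can be populated on Oracle by generating one row per day and deriving a few of the attributes. The table name DIM_DATE, the date range and the attribute subset shown are assumptions; the remaining flags (and the finer time-of-day grain implied by HourNumberInDay) would be derived in the same way, and the surrogate key and AuditKey are omitted for brevity:

INSERT INTO DIM_DATE (FullDateDescription, CalendarYear, CalendarMonthNumberInYear,
                      CalendarMonthName, DayNumberInCalendarMonth, DayNumberInCalendarWeek,
                      isFirstDayInMonth, isWeekday)
SELECT d                                                    AS FullDateDescription,
       EXTRACT(YEAR FROM d)                                 AS CalendarYear,
       EXTRACT(MONTH FROM d)                                AS CalendarMonthNumberInYear,
       TO_CHAR(d, 'Month')                                  AS CalendarMonthName,
       EXTRACT(DAY FROM d)                                  AS DayNumberInCalendarMonth,
       TO_NUMBER(TO_CHAR(d, 'D'))                           AS DayNumberInCalendarWeek,
       CASE WHEN EXTRACT(DAY FROM d) = 1 THEN 1 ELSE 0 END  AS isFirstDayInMonth,
       CASE WHEN TO_CHAR(d, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH') IN ('SAT', 'SUN')
            THEN 0 ELSE 1 END                               AS isWeekday
  FROM (SELECT DATE '2012-01-01' + LEVEL - 1 AS d           -- one row per day; range is illustrative
          FROM dual
        CONNECT BY LEVEL <= 366);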
Historic data load parameters and volumes
Parameters - the data is descriptive by nature and is not tied to any form of date or time, so no parameters need to be considered; it is, however, dependent on a data source that is tied to certain parameters.
Volumes - the data source of this dimension contains 1,000,000 records.
Incremental data volumes
The chances of incremental data volumes in this dimension are extremely high, seeing as the data contained
therein grows with every transaction that is processed.
Handling of late arriving data
All late arriving data is managed in the same way across all tables. As can be seen from the load frequency below, the time between loads is relatively short, so any late arriving data is simply held over until the next load.
Load frequency
The initial load is performed once, followed by weekly loads thereafter, with the possibility of daily loads if the client so requests.
It is suggested that data dumps from the operational data source into flat files are done at a time when the
operation source sys