    ITRI 611/621 Project

    Data Warehouses

Final Documentation

Martin Gouws - 21776032

JP Taljaard - 21735549

Heinrich du Toit - 21077533


Table of Contents

1. Table of Contents
2. Introduction
3. Project Planning
4. Project Management
5. Business Requirements Definition
6. Technology Track
7. Data Track
   a. Step 1 - High Level Plan
   b. Step 2 - Choose an ETL Tool
   c. Step 3 & 4 - Develop Default Strategies and Drill Down by Target Table
      i. General Points of Discussion
      ii. Data Hierarchies
      iii. Detailed Table Breakdowns
      iv. Dependency Tree
   d. Step 5
   e. Step 6
   f. Step 7
   g. Step 8
   h. Step 9
   i. Step 10
      i. ETL Automation
8. Business Intelligence Application Track
   a. OLAP Cube with Analysis Services
   b. MDX Queries
   c. Power Pivot
   d. SQL Reporting Services (SSRS)
   e. SharePoint
9. Deployment
10. Management
11. Growth
12. Group Participation
13. Annexure A
14. Annexure B
15. Annexure C
16. Annexure D


    Introduction

The documentation of this project needed to be structured in a way that makes sense. The Kimball Lifecycle was used as the basis of this structure, and the documentation includes the following sections:

- Project planning
- Project management
- Business requirements gathering
- Technology track
- Data track
- Business intelligence track
- Deployment
- Maintenance
- Growth

Not all of the phases of the Kimball Lifecycle are within the scope of this project; such phases are still included, but are specifically noted as out of scope, with the reasons stated within the relevant section where applicable.


    Project Planning

We decided to break the project down in terms of Kimball's lifecycle for data warehouses, as set out in The Data Warehouse Lifecycle Toolkit. As stated by the lecturer for the module in the study guide, there are 7 main phases that need to be completed by the end of this project. These 7 phases will be used as project milestones. Both the lifecycle and the milestones were captured in a Microsoft Project 2010 plan, with all related dates and durations included, to define clearly when each piece of work needs to be done; this is also known as a WBS (Work Breakdown Structure). The starting date of this project is taken as 01/03/2012, and the end date as 30/09/2012.

    The WBS and Gantt chart can be seen in Annexure A.


    Project Management

    Project planning and project management very often differ significantly. In the case of this project, the

    project planning as stated in the above section was an overall guide, but the practical implementation was

    very different.

    The project progress was not monitored on Project 2010 as planned; rather the project was managed week

    by week and in a dynamic fashion, and the group member allocations changed as the project progressed.


    Business Requirements Definition

    The business requirements were given by the lecturer in association with industry partners. The given

    requirements were:

- Sales per hour per day.
- A comparison of how a restaurant's sales look in terms of the average of the region that the specific restaurant is situated in. For example, a restaurant in Johannesburg compared to all the restaurants in all of Gauteng.
- A breakdown of restaurants per region per product. For example, the Sloane Square restaurant's sales of burgers, soft drinks, etc. versus the average of the whole region of Gauteng.
- Include a cost price with each menu item in order to calculate profits etc., again broken down to per-day and per-restaurant levels.

The following requirements were added in addition:

- Average amount per purchase.
- Determine what time of the month is most popular.
- Top 10 highest grossing restaurants.
- Top 10 product sales.
- Preferred flavours per province.
- Number of orders (# of transactions) per restaurant.

Seeing as this section is essentially a repetition of the project problem statement, it has not been expanded upon further.
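To make the first requirement concrete, the query below is a minimal T-SQL sketch of sales per hour per day against the operational data; the table and column names (Transactions, TransactionDateTime, SaleAmount) are assumptions made only for this sketch, since the source schema is not listed in this section.

-- Hedged sketch: sales per hour per day from the operational data.
-- Transactions, TransactionDateTime and SaleAmount are assumed names.
SELECT CONVERT(date, TransactionDateTime)  AS SaleDate,
       DATEPART(hour, TransactionDateTime) AS SaleHour,
       SUM(SaleAmount)                     AS TotalSales
FROM   Transactions
GROUP BY CONVERT(date, TransactionDateTime),
         DATEPART(hour, TransactionDateTime)
ORDER BY SaleDate, SaleHour;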


    Technology Track

    Technical Architecture

For our project, we decided to use a Virtual Machine for the Operating System, Software and Data to reside in. Oracle VirtualBox was used to create and run the Virtual Machine.

- The computer hosting the Virtual Machine runs an Intel i7 3.07GHz processor.
- The Virtual Machine is assigned 10GB of system memory.
- A Virtual Hard Drive of 150GB was created and assigned to the Virtual Machine.
- The Virtual Machine uses a Virtual Network Adapter that is bridged with the host Network Adapter.

Software

On the Virtual Machine, we installed the following:

- Microsoft Windows Server 2008 R2 (Operating System)
- ORACLE 11g (Database)
- Microsoft SQL Server 2008 R2 (Databases)
- Microsoft SharePoint 2010 (for Dashboards)
- Oracle Client Tools (.NET Plugin)
- MDX Studio (for MDX Queries)
- SQL Server Business Intelligence Development Studio (SSIS, SSRS, SSAS)
- Microsoft PowerPivot (Excel Plugin for Dashboarding)
- ORACLE SQL Developer (ORACLE Database Management Environment)
- Microsoft IIS (requirement for Microsoft SharePoint)
- Remote Desktop Services (for multiple concurrent remote user connections)
- ORACLE VM VirtualBox Guest Additions
- Microsoft Office 2010
- Microsoft Visual Studio 2010


    Data Track

    This section, the data track phase, was approached using the Kimball Lifecycle specifications for the ETL

    process. Firstly, there are 34 subsystems that are identified by the authors that make up the basis of any ETL

    system. These 34 subsystems are then later spread over 10 main steps that comprise the entire process of

    ETL development.

In this section, these 10 steps were used as the overall structure and to determine the workload, while keeping in mind throughout that the process is made up of 34 subsystems.

We acknowledge that this documentation may change as the project progresses, but we are confident that it is a solid start and that it will not change drastically.

    Certain steps were grouped together and handled as a unit, but each individual step is still covered,

    regardless of its combination with other steps.

Firstly though, a list of assumptions regarding the data:

- There were transactions made on 29 February 2011, a date that did not exist because 2011 was not a leap year.
- No prices were given, so prices for menu items were extracted from the Nando's website.
- In the MenuItems table, there were duplicate rows with the same MenuItemDescription but different primary keys. We decided to remove these duplicates. This was done by running a sequence of queries to leave only a single unique MenuItemDescription with a corresponding MenuItemCode (primary key). In light of this, the Transactions table had to be updated, replacing the previously duplicated MenuItemCode values with the new single MenuItemCode.
- Restaurant short names: Florida and Adderley.
- A duplicate was found in MenuItems: all transactions pointing to the second, incorrectly spelled item were pointed at the correct one, and the incorrectly spelled MenuItem was deleted.
- No cost prices were given; a markup of 200% was assumed.
- No selling prices were given.
- An extra column was added to MenuItems for the cost price, and selling prices of 0 were updated with data.

The 10 ETL steps were implemented as follows:

    Step 1 Draw the High Level Plan

    Please refer to Annexure B.

    Step 2 Choose an ETL Tool

SSIS (SQL Server Integration Services) was chosen as the ETL tool.

    Wikipedia defines SSIS as the following:

    SQL Server Integration Services (SSIS) is a component of the Microsoft SQL Server database software that

    can be used to perform a broad range of data migration tasks. SSIS is a platform for data integration and

    workflow applications. It features a fast and flexible data warehousing tool used for data extraction,

    transformation, and loading (ETL). The tool may also be used to automate maintenance of SQL Server

    databases and updates to multidimensional cube data.


    Another tool was written in-house to extract the date and time stamp from the transactions table and from

that deduce the attributes of the date and time dimension.
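As a minimal sketch of what this in-house tool derives, the T-SQL below extracts typical date and time attributes from the transaction timestamps; the Transactions table and TransactionDateTime column are assumed names, and the real tool may compute additional attributes.

-- Hedged sketch: derive date and time dimension attributes from transaction timestamps.
SELECT DISTINCT
       CONVERT(date, TransactionDateTime)     AS CalendarDate,
       DATEPART(year, TransactionDateTime)    AS CalendarYear,
       DATEPART(quarter, TransactionDateTime) AS CalendarQuarter,
       DATEPART(month, TransactionDateTime)   AS MonthNumber,
       DATENAME(month, TransactionDateTime)   AS MonthName,
       DATEPART(day, TransactionDateTime)     AS DayOfMonth,
       DATENAME(weekday, TransactionDateTime) AS DayName,
       DATEPART(hour, TransactionDateTime)    AS HourOfDay
FROM   Transactions;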

    Step 3 & 4 Develop Default Strategies & Drill Down by Target Table

This document includes historic and incremental load strategies for each table (dimension and fact) that is handled by the ETL system; but first, a few general points of discussion, followed by the table details.

    General Points of Discussion:

    Default strategy for extracting from each major source system

    Only one source system will be used to extract data. This source will be the operational database. The

    operational data from the source system was received in flat file format.

Extraction will be done using the Import/Export Tool included in SSIS (SQL Server Integration Services) to the

    relevant tables in the development database.

    Archival strategy

    As the data received is already in flat-file format, archival of this data is easy to implement. The data will be

    archived for possible recovery when needed.

As the data is not very large, and with compression it becomes significantly smaller still, the data can be stored for as long as needed.

    Data quality tracking and metadata

The current load and quality checking is a manual process, but during the ETL the following steps will be implemented. Data quality will be checked in the following manner:

- Errors in flat files that may cause problems, such as quotation marks around all data items.
- Each row in all tables should be checked for incomplete data: missing fields, null values, etc.
- All rows should be compared to see if records exist that can be identified as duplicates.
- Check that identified keys match with the corresponding tables.
- Check for spelling mistakes that could create duplicates.
- Valid transactional date(s) and time.

The following actions are to be taken when errors or problems occur during data checking:

- If incomplete data is found, the data can still be transferred to the dimension and fact tables, but a note should be made about the incomplete data found. This will be documented in the audit dimension and tagged in the fact. A log should also be made that can be reviewed.
- When duplicate records are found, e.g. item descriptions that are the same, this will have to be managed in one of two ways: (1) if the problem has not been encountered previously, a decision should be taken by the data warehouse builder on how to handle it; (2) if the problem has been identified before, a procedure should be written to handle the duplicate, e.g. a script that automatically executes the actions that the data warehouse builder has specified.
- Although it should not happen, in theory it is possible that the data may contain errors relating to the date and time, resulting in an invalid time, an invalid date, or an invalid combination of the two. These errors would arise during the transition from records in the database to the flat-file format, where the data is saved as text, and text files do not retain data types.


    If data under evaluation do not pass the quality checks, a classification of severity should be assigned to each

    record that did not meet the standards set by the data warehouse builder. Each dimension or fact record

    loaded will be tagged with a classification in the audit dimension.

    The classification scale used:

0 - No problem
1 - Data type conversion error
2 - Data type validation failure
3 - Missing data, NULL extracted from flat files, etc.

An audit dimension will be used because merely disregarding dirty records is bad practice: it compromises the integrity of the data and creates gaps, which distorts the big picture.

The severity of errors during a load should indicate whether the load process should be aborted, especially if the errors are frequent and severe.

The load process will be automated with scripts that enforce the quality checks and the loading of data into the fact and dimension tables. If errors occur during the load process, the data warehouse builder should be notified to take action as appropriate, especially if classification 3 is assigned.

    Errors should be logged and the data warehouse builder should be notified using a reliable communication

    medium such as email. The data warehouse builder should then take action according to the severity of the

    problem encountered.
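To make the auditing idea concrete, the T-SQL below is a hedged sketch of an audit dimension row being recorded with a classification; only AuditKey and the classification scale come from this document, while the table name AuditDim and the remaining column names are assumptions.

-- Hedged sketch of an audit dimension and classification tagging (names assumed).
CREATE TABLE AuditDim (
    AuditKey         INT IDENTITY(1,1) PRIMARY KEY,
    BatchName        VARCHAR(100) NOT NULL,             -- batch job name or number
    LoadDateTime     DATETIME     NOT NULL DEFAULT GETDATE(),
    Classification   TINYINT      NOT NULL,             -- 0 = no problem ... 3 = missing data
    ErrorDescription VARCHAR(255) NULL                  -- type of error encountered
);

-- Record an audit row for a batch in which missing data (classification 3) was found.
INSERT INTO AuditDim (BatchName, Classification, ErrorDescription)
VALUES ('Weekly dimension load', 3, 'NULL description extracted from flat file');

-- The dimension and fact rows loaded in this batch are then tagged with the new
-- AuditKey (for example captured via SCOPE_IDENTITY() directly after the insert).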

A deduplication system was not formally implemented in the ETL system: because data is not retrieved from multiple sources, survivorship and matching are not needed. Deduplication was, however, done manually, as such problems (duplicates, matching and survivorship) were encountered in the data.

Two problems in this regard were encountered. The first was the duplicate menu item descriptions, which was handled by running a SQL script that isolated a single description and its identifier, replaced all relevant record fields with the isolated key and description, and then dropped the duplicates from the table. The second was the renaming of misspelled data as it was encountered during data loading.
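The deduplication of menu item descriptions can be sketched as follows; this is a hedged reconstruction, not the actual project script, and it assumes the Transactions table references MenuItems through MenuItemCode. It keeps the lowest MenuItemCode per description, repoints the transactions, and then drops the orphaned duplicates.

-- Hedged sketch of the manual deduplication of MenuItems (names assumed where noted).
;WITH Survivors AS (
    SELECT MenuItemDescription,
           MIN(MenuItemCode) AS KeepCode       -- the single code kept per description
    FROM   MenuItems
    GROUP BY MenuItemDescription
)
UPDATE t
SET    t.MenuItemCode = s.KeepCode             -- repoint transactions at the surviving code
FROM   Transactions AS t
JOIN   MenuItems    AS m ON m.MenuItemCode = t.MenuItemCode
JOIN   Survivors    AS s ON s.MenuItemDescription = m.MenuItemDescription
WHERE  t.MenuItemCode <> s.KeepCode;

DELETE m                                       -- drop the duplicates that are no longer referenced
FROM   MenuItems AS m
JOIN   (SELECT MenuItemDescription, MIN(MenuItemCode) AS KeepCode
        FROM   MenuItems
        GROUP BY MenuItemDescription) AS s
       ON s.MenuItemDescription = m.MenuItemDescription
WHERE  m.MenuItemCode <> s.KeepCode;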

    A conforming system in the ETL system is of no use in this situation, because all the data is retrieved from a

    single source, namely the database flat files that were generated from the sales processing database system.

The single source system, in combination with the fact that no dimensions are shared, means that this subsystem is not applicable.

    Default strategy for managing changes to dimension attributes

The Type 1 technique is a simple overwrite of one or more attributes in an existing dimension row. The revised data from the change data capture system is used to overwrite existing data in the dimension table. Type 1 is used when data needs to be corrected or when there is no business need for keeping history of the previous values.

The Type 2 technique is used to track changes of dimensions and to associate them correctly with existing and new fact records. Supporting Type 2 changes requires a strong change data capture system that detects changes as soon as they occur. For Type 2 updates, copy the previous version of the dimension row and create a new dimension row with a new surrogate key. If there is no previous version of the dimension row, create a new one from scratch. Then update this row with the columns that have changed. This technique is used for handling dimension attributes that have changed and that need to be tracked over time.

Type 3 is not implemented in this system, and is therefore not discussed.
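As a hedged illustration of the Type 1 and Type 2 techniques described above, the T-SQL below uses the Menu Category dimension; the staging table stg_MenuCategory, the dimension table name MenuCategoryDim and the @AuditKey value are assumptions made only for this sketch, and a given attribute would use one technique or the other, as listed per table further on.

-- Hedged sketch of Type 1 and Type 2 handling for the Menu Category dimension.
DECLARE @AuditKey INT = 1;   -- audit row for this batch (assumed)

-- Type 1: overwrite the attribute in place; no history is kept.
UPDATE d
SET    d.MenuCategoryDesc = s.MenuCategoryDesc
FROM   MenuCategoryDim  AS d
JOIN   stg_MenuCategory AS s ON s.MenuCategoryCode = d.MenuCategoryCode
WHERE  d.MenuCategoryDesc <> s.MenuCategoryDesc;

-- Type 2: insert a new dimension row with a new surrogate key (IDENTITY), leaving
-- the previous row in place so that history is preserved. Note that the unique
-- constraint on MenuCategoryCode in the table design further on would first have
-- to be relaxed (or the old row retired) for this to succeed.
INSERT INTO MenuCategoryDim (MenuCategoryCode, MenuCategoryDesc, AuditKey)
SELECT s.MenuCategoryCode, s.MenuCategoryDesc, @AuditKey
FROM   stg_MenuCategory AS s
JOIN   MenuCategoryDim  AS d ON d.MenuCategoryCode = s.MenuCategoryCode
WHERE  d.MenuCategoryDesc <> s.MenuCategoryDesc;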


    Refer to tables below for specific dimension and fact change management.

    System availability requirements and strategy

The operational data source was made available on the 12th of March 2012 in the format of flat files.

High-level block sequencing is set out below, from flat files through to the development database and finally to the data warehouse database where facts and dimensions will be loaded.

    See Annexure B for high-level sequencing of each dimension and fact.

Due to the nature of incremental batch loads, large amounts of system resources will be consumed to process the data (load, clean and save). For this reason we will implement the ETL process on a separate server that will be responsible for the ETL work.

Care will be taken when updating fact and dimension tables with new data not to overwhelm the system with a high incoming load; rather, the load will be split into smaller pieces and uploaded when the data warehouse system's load is low. In this way, the system will remain available even during uploads.

    Design of the data auditing subsystem

    The auditing subsystem will be used to capture data load information and keep track of it. A key will be

    created for each type of event that happens while loading the facts or dimensions. This key is then assigned

    to the fact or dimension in question. The keys will be stored in the audit dimension with additional

    information such as the type of error, time and date of occurrence, batch job name or number, and possibly

more.

    Locations of staging areas

In the ETL stage, multiple staging areas will exist for processing the source data:

1. Import stage
   a. In this stage, the data will be imported from the source system (flat files).
   b. Verification of data types needs to happen here to limit possible errors and updates later on.
2. Cleaning stage
   a. The data will be checked for missing data, such as incomplete fields or NULLs.
   b. Duplicate checking and removal of duplicates. All relevant records in other tables should be updated accordingly to reflect the choice of a single record to represent the duplicates.
   c. Audit logs implemented.
3. Population of dimension tables
4. Keymapping stage - keymaps are to be built for use in linking and creation of the fact table data (see the sketch after this list)
5. Population of fact tables
6. Transfer from the development database to the Oracle production database: bulk-load the development data, which includes the dimension and fact tables, to the Oracle production database.
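The keymapping stage (step 4) can be sketched as follows for the Menu Items dimension; KeyMap_MenuItem and MenuItemDim are assumed table names used only for illustration.

-- Hedged sketch of a keymap built in step 4: it maps the operational (natural)
-- MenuItemCode to the surrogate key generated in the dimension table.
CREATE TABLE KeyMap_MenuItem (
    MenuItemCode   VARCHAR(50) NOT NULL PRIMARY KEY,   -- operational key from the source
    MenuItemSurKey INT         NOT NULL                -- surrogate key in the dimension
);

INSERT INTO KeyMap_MenuItem (MenuItemCode, MenuItemSurKey)
SELECT d.MenuItemCode, d.MenuItemSurKey
FROM   MenuItemDim AS d;

-- During fact population (step 5), transaction rows are joined to this keymap so
-- that the fact table stores surrogate keys rather than operational codes.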


    Data Hierarchies


    Detailed Table Breakdown

    The data sources that are referred to below are the working database tables, not the original flat files, and

    these tables are already cleaned and the data validated as per ETL procedures.

    Menu Category Dimension

    Table Design

The Menu Category Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:

Column Name        | Description                                    | DataType | Key           | Constraint
MenuCategorySurKey | Identification field that contains no metadata | Numeric  | Surrogate key | Identity
MenuCategoryCode   | ID number of the category                      | Numeric  | Primary key   | Unique
MenuCategoryDesc   | Description of the category                    | Varchar  | n/a           | not null
AuditKey           | Audit Dimension Foreign Key                    | Numeric  | Foreign key   | not null
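A hedged T-SQL sketch of this design is shown below; the physical table name MenuCategoryDim and the Varchar length are assumptions, since the document only specifies generic data types.

-- Hedged sketch of the Menu Category dimension table as designed above.
CREATE TABLE MenuCategoryDim (
    MenuCategorySurKey INT IDENTITY(1,1) NOT NULL UNIQUE,  -- surrogate key (identity), carries no metadata
    MenuCategoryCode   INT          NOT NULL PRIMARY KEY,  -- ID number of the category (unique)
    MenuCategoryDesc   VARCHAR(100) NOT NULL,              -- description of the category
    AuditKey           INT          NOT NULL               -- foreign key to the audit dimension
);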

    Historic data load parameters and volumes

- Parameters: the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes: the data source of this dimension contains 19 records.

Incremental data volumes

    Incremental data volumes

    The chances of incremental data volumes in this dimension are relatively low, seeing as the data contained

    therein is of a very static nature.

    Handling of late arriving data

All late arriving data is managed in the same way across all tables, as follows: as can be seen from the load frequency below, the time between loads is relatively short, so late arriving data is postponed until the next load.

    Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    This dimension is compared to the source data to identify if changes or additions have occurred. It is then

    updated as necessary.

    Handling of changes in each attribute

    Changes to each attribute in this table are handled as per the type indicated:


Column Name        | Handling of change - Type 1, 2 or 3
MenuCategorySurKey | n/a
MenuCategoryCode   | 2
MenuCategoryDesc   | 2
AuditKey           | n/a

    Table partitioning

    Table is not partitioned.

    Overview of data source(s)

The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

The source data can be described as follows:

Column Name      | Description                 | DataType | Min | Max                        | Count of Distinct Values
MenuCategoryID   | ID number of the category   | Numeric  | 0   | Max value possible in DBMS | 1
MenuCategoryCode | Category Code               | Numeric  | 0   | 10000                      | 1
MenuCategoryDesc | Description of the category | Varchar  |     |                            | 1
Auditkey         | Audit Dimension Foreign Key | Numeric  |     |                            |

    Extract strategy for the source data

    Refer to default strategy as discussed in general points of discussion above.

    Change data capture logic

The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.


This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.
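A hedged sketch of how changes or additions could be detected for this dimension is given below; stg_MenuCategory again stands for the working (staging) table and MenuCategoryDim for the dimension, both assumed names.

-- Hedged sketch: find Menu Category rows that are new or whose description changed.
SELECT s.MenuCategoryCode, s.MenuCategoryDesc
FROM   stg_MenuCategory AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   MenuCategoryDim AS d
                   WHERE  d.MenuCategoryCode = s.MenuCategoryCode
                     AND  d.MenuCategoryDesc = s.MenuCategoryDesc);
-- Rows returned are either brand-new categories or changed descriptions; they are
-- then handled as Type 2 changes, as indicated in the table above.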

    Dependencies

    This dimension is dependent on a single other dimension, namely the audit dimension.

    Transformation logic

    See Annexure B for the diagram.

    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables a script will be executed by the ETL process to check

    for necessary space requirements in the tables and the process will only continue if such space does exist.

    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.
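The checkpoint-and-rollback behaviour can be sketched in T-SQL with a transaction savepoint, as below; the table names are the same assumed staging and dimension names used earlier, and the real ETL packages may implement checkpoints differently (for example with SSIS checkpoints).

-- Hedged sketch of a checkpoint before a stage operation, with rollback on error.
BEGIN TRANSACTION;
SAVE TRANSACTION BeforeDimLoad;               -- the "checkpoint" for this stage
BEGIN TRY
    INSERT INTO MenuCategoryDim (MenuCategoryCode, MenuCategoryDesc, AuditKey)
    SELECT MenuCategoryCode, MenuCategoryDesc, 1   -- AuditKey 1 assumed for the batch
    FROM   stg_MenuCategory;
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION BeforeDimLoad;       -- roll back to the checkpoint
    COMMIT TRANSACTION;                       -- close the outer transaction
END CATCH;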

    Archiving assumptions

The default strategy, as mentioned in the general points of discussion above, is applied here.

    Cleanup steps

When all the data has passed through the ETL system and is in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Easy.

    Menu Flavour Dimension

    Table Design

The Menu Flavour Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:

Column Name       | Description                                    | DataType | Key           | Constraint
MenuFlavourSurKey | Identification field that contains no metadata | Numeric  | Surrogate key | Identity
MenuFlavourCode   | ID number of the flavour                       | Numeric  | Primary key   | Unique
MenuFlavourDesc   | Description of the flavour                     | Varchar  | n/a           | not null
AuditKey          | Audit Dimension Foreign Key                    | Numeric  | Foreign key   | not null


    Historic data load parameters and volumes

- Parameters: the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes: the data source of this dimension contains 5 records.

Incremental data volumes

    The chances of incremental data volumes in this dimension are medium, seeing as the data contained

    therein is of a relatively static nature. The only time this dimension would grow is with the change of

business rules or strategies, or the addition of extra complexity to their products.

    Handling of late arriving data

All late arriving data is managed in the same way across all tables, as follows: as can be seen from the load frequency below, the time between loads is relatively short, so late arriving data is postponed until the next load.

    Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    This dimension is compared to the source data to identify if changes or additions have occurred. It is then

    updated as necessary.

    Handling of changes in each attribute

    Changes to each attribute in this table are handled as per the type indicated:

Column Name       | Handling of change - Type 1, 2 or 3
MenuFlavourSurKey | n/a
MenuFlavourCode   | 2
MenuFlavourDesc   | 2
AuditKey          | n/a

    Table partitioning

    Table is not partitioned.

    Overview of data source(s)

The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

The source data can be described as follows:


Column Name     | Description                | DataType | Min | Max                        | Count of Distinct Values
MenuFlavourID   | ID number of the flavour   | Numeric  | 0   | Max value possible in DBMS | 1
MenuFlavourCode | Flavour Code               | Numeric  | 0   | 10000                      | 1
MenuFlavourDesc | Description of the flavour | Varchar  |     |                            | 1
Auditkey        |                            | Numeric  |     |                            |

    Extract strategy for the source data

    Refer to default strategy as discussed in general points of discussion above.

    Change data capture logic

The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.

    Dependencies

    This dimension is dependent on a single other dimension, namely the audit dimension.

    Transformation logic

    See Annexure B for the diagram.

    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables a script will be executed by the ETL process to check

    for necessary space requirements in the tables and the process will only continue if such space does exist.

    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.

    Archiving assumptions

The default strategy, as mentioned in the general points of discussion above, is applied here.


    Cleanup steps

When all the data has passed through the ETL system and is in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Easy.

    Menu Item Food Group Dimension

    Table Design

The Menu Item Food Group Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:

Column Name         | Description                                    | DataType | Key           | Constraint
MenuFoodGroupSurKey | Identification field that contains no metadata | Numeric  | Surrogate key | Identity
MenuFoodGroupCode   | ID number of the FoodGroup                     | Numeric  | Primary key   | Unique
MenuFoodGroupDesc   | Description of the FoodGroup                   | Varchar  | n/a           | not null
AuditKey            | Audit Dimension Foreign Key                    | Numeric  | Foreign key   | not null

    Historic data load parameters and volumes

- Parameters: the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes: the data source of this dimension contains 50 records.

Incremental data volumes

    The chances of incremental data volumes in this dimension are relatively low, seeing as the data contained

    therein is of a very static nature.

    Handling of late arriving data

All late arriving data is managed in the same way across all tables, as follows: as can be seen from the load frequency below, the time between loads is relatively short, so late arriving data is postponed until the next load.

    Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    This dimension is compared to the source data to identify if changes or additions have occurred. It is then

    updated as necessary.


    Handling of changes in each attribute

    Changes to each attribute in this table are handled as per the type indicated:

Column Name         | Handling of change - Type 1, 2 or 3
MenuFoodGroupSurKey | n/a
MenuFoodGroupCode   | 2
MenuFoodGroupDesc   | 2
AuditKey            | n/a

    Table partitioning

    Table is not partitioned.

    Overview of data source(s)

The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

The source data can be described as follows:

Column Name       | Description                  | DataType | Min | Max                        | Count of Distinct Values
MenuFoodGroupID   | ID number of the Food Group  | Numeric  | 0   | Max value possible in DBMS | 1
MenuFoodGroupCode | Food Group Code              | Numeric  | 0   | 10000                      | 1
MenuFoodGroupDesc | Description of the FoodGroup | Varchar  |     |                            | 1
Auditkey          |                              | Numeric  |     |                            |

    Extract strategy for the source data

    Refer to default strategy as discussed in general points of discussion above.

    Change data capture logic

The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.


- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.

    Dependencies

    This dimension is dependent on a single other dimension, namely the audit dimension.

    Transformation logic

    See Annexure B for the diagram.

    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables a script will be executed by the ETL process to checkfor necessary space requirements in the tables and the process will only continue if such space does exist.

    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.

    Archiving assumptions

The default strategy, as mentioned in the general points of discussion above, is applied here.

    Cleanup steps

When all the data has passed through the ETL system and is in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Easy.

    Menu Items Dimension

    Table Design

The Menu Items Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:

Column Name        | Description                                    | DataType | Key           | Constraint
MenuItemSurKey     | Identification field that contains no metadata | Numeric  | Surrogate key | Identity
MenuItemCode       | ID number of the Items                         | Varchar  | n/a           | Unique
MenuItemDesciption | Description of the item                        | Varchar  | n/a           | not null
AuditKey           | Audit Dimension Foreign Key                    | Numeric  | Foreign key   | not null


    Historic data load parameters and volumes

- Parameters: the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes: the data source of this dimension contains 789 records.

Incremental data volumes

    The chances of incremental data volume are medium. As with all organizations that deliver a product, those

    products are continuously expanded. This expansion is not so rapid that the data volumes will increase

    dramatically, but they will increase slightly over time.

    Handling of late arriving data

All late arriving data is managed in the same way across all tables, as follows: as can be seen from the load frequency below, the time between loads is relatively short, so late arriving data is postponed until the next load.

    Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    This dimension is compared to the source data to identify if changes or additions have occurred. It is then

    updated as necessary.

    Handling of changes in each attribute

    Changes to each attribute in this table are handled as per the type indicated:

Column Name        | Handling of change - Type 1, 2 or 3
MenuItemSurKey     | n/a
MenuItemCode       | 2
MenuItemDesciption | 2
AuditKey           | n/a

    Table partitioning

    Table is not partitioned.

    Overview of data source(s)

The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

The source data can be described as follows:


Column Name        | Description               | DataType | Min | Max                        | Count of Distinct Values
MenuItemID         | ID number of the MenuItem | Numeric  | 0   | Max value possible in DBMS | 1
MenuItemCode       | Item Product Codes        | Varchar  |     |                            | 1
MenuItemDesciption | Description of the item   | Varchar  |     |                            | 1
Auditkey           |                           | Numeric  |     |                            |

    Extract strategy for the source data

    Refer to default strategy as discussed in general points of discussion above.

    Change data capture logic

The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.

    Dependencies

    This dimension is dependent on the following dimensions:

- Audit
- Menu Item Flavour
- Menu Category
- Menu Sub Category
- Menu Food Group

All the above-mentioned dimensions must be loaded first, before this dimension may be populated, otherwise referential integrity will be compromised.

    Transformation logic

    See Annexure B for the diagram.

    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables a script will be executed by the ETL process to check

    for necessary space requirements in the tables and the process will only continue if such space does exist.


    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.

    Archiving assumptions

The default strategy, as mentioned in the general points of discussion above, is applied here.

    Cleanup steps

When all the data has passed through the ETL system and is in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Medium.

    Menu Sub Category Dimension

    Table Design

The Menu Sub Category Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:

Column Name           | Description                                    | DataType | Key           | Constraint
MenuSubCategorySurKey | Identification field that contains no metadata | Numeric  | Surrogate key | Identity
MenuSubCategoryCode   | ID number of Sub Category                      | Numeric  | Primary key   | Unique
MenuSubCategoryDesc   | Description of the Sub Category                | Varchar  | n/a           | not null
AuditKey              | Audit Dimension Foreign Key                    | Numeric  | Foreign key   | not null

    Historic data load parameters and volumes

- Parameters: the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes: the data source of this dimension contains 320 records.

Incremental data volumes

    The chances of incremental data volumes in this dimension are relatively low, seeing as the data contained

    therein is of a very static nature.

    Handling of late arriving data

All late arriving data is managed in the same way across all tables, as follows: as can be seen from the load frequency below, the time between loads is relatively short, so late arriving data is postponed until the next load.


    Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    This dimension is compared to the source data to identify if changes or additions have occurred. It is then

    updated as necessary.

    Handling of changes in each attribute

    Changes to each attribute in this table are handled as per the type indicated:

Column Name           | Handling of change - Type 1, 2 or 3
MenuSubCategorySurKey | n/a
MenuSubCategoryCode   | 2
MenuSubCategoryDesc   | 2
AuditKey              | n/a

    Table partitioning

    Table is not partitioned.

    Overview of data source(s)

The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

The source data can be described as follows:

Column Name         | Description                     | DataType | Min | Max                        | Count of Distinct Values
MenuSubCatID        | ID number of the Sub Category   | Numeric  | 0   | Max value possible in DBMS | 1
MenuSubCategoryCode | Sub Category Code               | Numeric  | 0   | 10000                      | 1
MenuSubCategoryDesc | Description of the Sub Category | Varchar  |     |                            | 1
Auditkey            |                                 | Numeric  |     |                            |

    Extract strategy for the source data

    Refer to default strategy as discussed in general points of discussion above.


    Change data capture logic

The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.

    Dependencies

    This dimension is dependent on a single other dimension, namely the audit dimension.

    Transformation logic

    See Annexure B for the diagram.

    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables a script will be executed by the ETL process to check

    for necessary space requirements in the tables and the process will only continue if such space does exist.

    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.

    Archiving assumptions

The default strategy, as mentioned in the general points of discussion above, is applied here.

    Cleanup steps

When all the data has passed through the ETL system and is in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Easy.


    Province Dimension

    Table Design

The Province Dimension contains the following attributes (column names), each of which has the named description, data type, key (if applicable) and constraint (if applicable) respectively:

Column Name    | Description                                    | DataType | Key           | Constraint
ProvinceSurKey | Identification field that contains no metadata | Numeric  | Surrogate key | Identity
ProvinceCode   | ID number of the Province                      | Numeric  | Primary key   | Unique
ProvinceDesc   | Province Name                                  | Varchar  | n/a           | not null
AuditKey       | Audit Dimension Foreign Key                    | Numeric  | Foreign key   | not null

    Historic data load parameters and volumes

- Parameters: the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.
- Volumes: the data source of this dimension contains 11 records.

Incremental data volumes

The chances of incremental data volumes in this dimension are effectively zero, seeing as the data contained therein is set according to geographical bounds that are unchanging. The only exceptions would be if the organization in question expanded beyond the country's borders or if the country were re-divided into new provinces.

    Handling of late arriving data

All late arriving data is managed in the same way across all tables, as follows: as can be seen from the load frequency below, the time between loads is relatively short, so late arriving data is postponed until the next load.

    Load frequency

The initial load has been completed, and loads run weekly thereafter, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    This dimension is compared to the source data to identify if changes or additions have occurred. It is then

    updated as necessary.

    Handling of changes in each attribute

    Changes to each attribute in this table are handled as per the type indicated:

Column Name    | Handling of change - Type 1, 2 or 3
ProvinceSurKey | n/a
ProvinceCode   | 1
ProvinceDesc   | 1
AuditKey       | n/a


    Table partitioning

    Table is not partitioned.

    Overview of data source(s)

The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

The source data can be described as follows:

Column Name  | Description               | DataType | Min | Max                        | Count of Distinct Values
ProvinceID   | ID number of the Province | Numeric  | 0   | Max value possible in DBMS | 1
ProvinceCode | Associated Province Code  | Numeric  | 1   | 11                         | 1
ProvinceDesc | Province Name             | Varchar  |     |                            | 1
Auditkey     |                           | Numeric  |     |                            |

    Extract strategy for the source data

    Refer to default strategy as discussed in general points of discussion above.

    Change data capture logic

The agreement between the source system and the data warehouse can be described as follows:

- The source database dumps its content to flat files.
- The flat files are stripped of all illegal characters.
- The flat files are then imported into the working database using the ETL tool.
- Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
- The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
- The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.

This progression that the data follows through the ETL pipeline ensures agreement, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.

    Dependencies

    This dimension is dependent on a single other dimension, namely the audit dimension.

    Transformation logic

    See Annexure B for the diagram.


    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables a script will be executed by the ETL process to check

    for necessary space requirements in the tables and the process will only continue if such space does exist.

    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.

    Archiving assumptions

The default strategy, as mentioned in the general points of discussion above, is applied here.

    Cleanup steps

When all the data has passed through the ETL system and is in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Easy.

    Restaurant DimensionTable Design

    The Restaurant Dimension contains the following attributes (column names), each of which have the named

    description, data type, key (if applicable) and constraint (if applicable) respectively:

    Column Name   | Description                                            | DataType | Key           | Constraint
    RestSurKey    | Identification field that contains no metadata         | Numeric  | Surrogate key | Identity
    RestCode      | ID number of Restaurant                                | Numeric  | Primary key   | Unique
    RestShortName | Shortened version of Restaurant name                   | Varchar  | n/a           | not null
    RestName      | Restaurant Name                                        | Varchar  | n/a           | not null
    IsCoastal     | Is the Restaurant located near the coast - 0=no; 1=yes | bit      | n/a           | not null
    ProvinceCode  | ID of province which restaurant is in                  | numeric  | Foreign key   | not null
    AuditKey      | Audit Dimension Foreign Key                            | Numeric  | Foreign key   | not null
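
    A DDL sketch of this design is given below. The object names (RESTAURANT_DIM, PROVINCE_DIM, AUDIT_DIM), the VARCHAR2 lengths and the identity syntax (Oracle 12c or later) are assumptions; the actual physical design may differ.

        CREATE TABLE restaurant_dim (
          restsurkey     NUMBER GENERATED ALWAYS AS IDENTITY,   -- surrogate key (identity)
          restcode       NUMBER         NOT NULL,
          restshortname  VARCHAR2(50)   NOT NULL,
          restname       VARCHAR2(100)  NOT NULL,
          iscoastal      NUMBER(1)      NOT NULL,               -- 0 = no, 1 = yes
          provincecode   NUMBER         NOT NULL,
          auditkey       NUMBER         NOT NULL,
          CONSTRAINT restaurant_dim_pk      PRIMARY KEY (restcode),
          CONSTRAINT restaurant_dim_sur_uk  UNIQUE (restsurkey),
          CONSTRAINT restaurant_dim_prov_fk FOREIGN KEY (provincecode)
            REFERENCES province_dim (provincecode),
          CONSTRAINT restaurant_dim_aud_fk  FOREIGN KEY (auditkey)
            REFERENCES audit_dim (auditkey)
        );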

    Historic data load parameters and volumes

    Parameters: the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.

    Volumes: the data source of this dimension contains 101 records.


    Incremental data volumes

    The chances of incremental data volumes in this dimension are medium-high, seeing as the data contained

    therein is likely to expand as the organization in question grows and more franchises are established.

    Handling of late arriving data

    All late arriving data is managed in the same way across all tables. The way that this is done is as follows: as can be seen from the load frequency below, the loads are relatively close together, so late arriving data is postponed until the next load.

    Load frequency

    The initial load has been completed, and a load runs every week after that, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    This dimension is compared to the source data to identify if changes or additions have occurred. It is then

    updated as necessary.

    Handling of changes in each attribute

    Changes to each attribute in this table are handled as per the type indicated:

    Column Name Handling of change - Type 1, 2 or 3

    RestSurKey n/a

    RestCode n/a

    RestShortName 1

    RestName 1

    IsCoastal n/a

    ProvinceCode n/a

    AuditKey n/a
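
    For the two Type 1 attributes (RestShortName and RestName) a simple overwrite is sufficient. A hedged sketch of such a Type 1 load is shown below; the working table WRK_RESTAURANT and the dimension name RESTAURANT_DIM are the same illustrative names used earlier.

        -- Type 1 handling: overwrite changed descriptive values, insert new restaurants.
        MERGE INTO restaurant_dim d
        USING wrk_restaurant s
           ON (d.restcode = s.restcode)
        WHEN MATCHED THEN
          UPDATE SET d.restshortname = s.restshortname,
                     d.restname      = s.restname
        WHEN NOT MATCHED THEN
          INSERT (restcode, restshortname, restname, iscoastal, provincecode, auditkey)
          VALUES (s.restcode, s.restshortname, s.restname, s.iscoastal, s.provincecode, s.auditkey);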

    Table partitioning

    Table is not partitioned.

    Overview of data source(s)

    The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

    The source data can be described as follows:


    Column Name   | Description                                            | DataType | Min | Max                        | Count of Distinct Values
    RestID        | Restaurant Identification code                         | Numeric  | 0   | Max Value Possible in DBMS | 1
    RestCode      | ID number of Restaurant                                | Numeric  | 0   | 2000                       | 1
    RestShortName | Shortened version of Restaurant name                   | Varchar  |     |                            | 1
    RestName      | Restaurant Name                                        | Varchar  |     |                            | 1
    IsCoastal     | Is the Restaurant located near the coast - 0=no; 1=yes | numeric  |     |                            | 1+
    ProvinceCode  | ID of province which restaurant is in                  | numeric  |     |                            | 1+
    Auditkey      |                                                        | numeric  |     |                            |

    Extract strategy for the source data

    Refer to the default strategy as discussed in the general points of discussion above.

    Change data capture logic

    The agreement between the source system and the data warehouse can be described as follows:

    1. The source database dumps its content to flat files.
    2. The flat files are stripped of all illegal characters.
    3. The flat files are then imported into the working database using the ETL tool.
    4. Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
    5. The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
    6. The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.

    This progression that the data follows through the ETL pipeline ensures agreement between the source and the warehouse, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.

    Dependencies

    This dimension is dependent on the following dimension tables: Audit, Regional Manager, Country, Province and Hub. All of the above-mentioned dimensions need to be imported before this dimension can be populated, otherwise referential integrity might be compromised.
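
    Before the Restaurant dimension is loaded, a referential-integrity pre-check along the following lines can be run against the working data. The table names are the same illustrative names used earlier, and only the Province reference is shown.

        -- Report restaurants whose province is not yet present in the Province dimension.
        SELECT s.restcode, s.provincecode
          FROM wrk_restaurant s
         WHERE NOT EXISTS (SELECT 1
                             FROM province_dim p
                            WHERE p.provincecode = s.provincecode);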

    Transformation logic

    See Annexure B for the diagram.

    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables, a script is executed by the ETL process to check that the necessary space is available in the tables; the process only continues if sufficient space exists.

    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.

    Archiving assumptions

    The default strategy, as mentioned in the general points of discussion above, is applied here.

    Cleanup steps

    When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Easy.

    Transactions Dimension

    Table Design

    The Transaction Dimension contains the following attributes (column names), each of which have the named

    description, data type, key (if applicable) and constraint (if applicable) respectively:

    Column Name  | Description                                     | DataType | Key           | Constraint
    TransSurKey  | Identification field that contains no metadata  | numeric  | Surrogate key | Identity
    TransDate    | Date and time transaction occurred              | datetime | n/a           | not null
    OrderNr      | Order number of transaction                     | Numeric  | n/a           | not null
    ItemNr       | Item number of order                            | Numeric  | n/a           | not null
    MenuItemCode | MenuItem on order                               | varchar  | Foreign key   | not null
    AuditKey     | Audit Dimension Foreign Key                     | Numeric  | Foreign key   | not null

    Historic data load parameters and volumes

    Parameters: this is the only data from the flat file sources that provides information about the business process itself, and there are certain parameters that need to be taken into account. Historic loads will be done month for month, with a total of 4 months contained in the data. The transactions are not equally spread over the 4 months.

    Volumes: the data source of this dimension contains 1,000,000 records.

    Incremental data volumes

    The chances of incremental data volumes in this dimension are extremely high, seeing as the data contained

    therein grows with every transaction that is processed. Data volume is huge and needs to be managed at

    length.

    Handling of late arriving data

    All late arriving data is managed in the same way across all tables. The way that this is done is as follows: as can be seen from the load frequency below, the loads are relatively close together, so late arriving data is postponed until the next load.

    Load frequency

    The initial load has been completed, and a load runs every week after that, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    This dimension is compared to the source data to identify if changes or additions have occurred. It is then

    updated as necessary.

    Handling of changes in each attribute

    Changes to each attribute in this table are handled as per the type indicated:

    Column Name Handling of change - Type 1, 2 or 3

    TransSurKey n/a

    TransDate n/a

    OrderNr n/a

    ItemNr n/a

    MenuItemCode n/a

    AuditKey n/a

    Table partitioning

    This table is partitioned according to month.
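
    The monthly partitioning can be sketched with Oracle interval partitioning on TransDate, as below. The table name, the starting boundary date and the VARCHAR2 length are assumptions (identity syntax requires Oracle 12c or later), and the foreign keys to the Menu Item and Audit dimensions are omitted for brevity.

        CREATE TABLE transactions_dim (
          transsurkey   NUMBER GENERATED ALWAYS AS IDENTITY,    -- surrogate key
          transdate     DATE          NOT NULL,
          ordernr       NUMBER        NOT NULL,
          itemnr        NUMBER        NOT NULL,
          menuitemcode  VARCHAR2(20)  NOT NULL,
          auditkey      NUMBER        NOT NULL
        )
        PARTITION BY RANGE (transdate)
        INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))                  -- one partition per calendar month
        ( PARTITION p_initial VALUES LESS THAN (DATE '2012-01-01') );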

    Overview of data source(s)

    The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

    The source data can be described as follows:


    Column Name  | Description                        | DataType | Min | Max                        | Count of Distinct Values
    TransID      | ID number of the transaction       | numeric  | 0   | Max Value Possible in DBMS | 1
    TransDate    | Date and time transaction occurred | datetime |     |                            |
    OrderNr      | Order number of transaction        | Numeric  | 0   | 2000                       |
    ItemNr       | Item number of order               | Numeric  | 0   | 50                         |
    MenuItemCode | MenuItem on order                  | varchar  |     |                            |
    Auditkey     |                                    | numeric  |     |                            |

    Extract strategy for the source data

    Refer to the default strategy as discussed in the general points of discussion above.

    Change data capture logic

    The agreement between the source system and the data warehouse can be described as follows:

    1. The source database dumps its content to flat files.
    2. The flat files are stripped of all illegal characters.
    3. The flat files are then imported into the working database using the ETL tool.
    4. Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
    5. The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
    6. The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.

    This progression that the data follows through the ETL pipeline ensures agreement between the source and the warehouse, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.

    Dependencies

    This dimension is dependent on the following dimensions: Audit and Menu Items. The above-mentioned dimensions need to be populated first, before this dimension can be populated, to ensure referential integrity.

    Transformation logic

    See Annexure B for the diagram.


    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables, a script is executed by the ETL process to check that the necessary space is available in the tables; the process only continues if sufficient space exists.

    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.

    Archiving assumptions

    The default strategy, as mentioned in the general points of discussion above, is applied here.

    Cleanup steps

    When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Medium.

    Hub Dimension

    Table Design

    The Hub Dimension contains the following attributes (column names), each of which have the named

    description, data type, key (if applicable) and constraint (if applicable) respectively:

    Column Name | Description                                    | DataType | Key           | Constraint
    HubSurKey   | Identification field that contains no metadata | Numeric  | Surrogate key | Identity
    HubCode     | ID Number of hub                               | Numeric  | Primary key   | Unique
    AuditKey    | Audit Dimension Foreign Key                    | Numeric  | Foreign key   | not null

    Historic data load parameters and volumes

    Parameters: the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.

    Volumes: the data source of this dimension contains 19 records.

    Incremental data volumes

    The chances of incremental data volumes in this dimension are extremely high, seeing as the data contained

    therein grows with every transaction that is processed. Data volume is huge and needs to be managed at

    length.

    Handling of late arriving data

    All late arriving data is managed in the same way across all tables. The way that this is done is as follows: as can be seen from the load frequency below, the loads are relatively close together, so late arriving data is postponed until the next load.

    Load frequency

    The initial load has been completed, and a load runs every week after that, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    This dimension is compared to the source data to identify if changes or additions have occurred. It is then

    updated as necessary.
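
    A minimal sketch of this compare-and-update step for the Hub dimension is shown below; the working table WRK_HUB and the dimension name HUB_DIM are illustrative assumptions.

        -- Insert hub codes that appear in the working table but not yet in the dimension.
        INSERT INTO hub_dim (hubcode, auditkey)
        SELECT s.hubcode, s.auditkey
          FROM wrk_hub s
         WHERE NOT EXISTS (SELECT 1
                             FROM hub_dim d
                            WHERE d.hubcode = s.hubcode);
        COMMIT;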

    Handling of changes in each attribute

    Column Name Handling of change - Type 1, 2 or 3

    HubSurKey n/a

    HubCode 2

    AuditKey n/a

    Table partitioning

    Table is not partitioned.

    Overview of data source(s)

    The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

    The source data can be described as follows:

    Column Name | Description             | DataType | Min | Max                        | Count of Distinct Values
    HubID       | Hub Identification code | numeric  | 0   | Max Value Possible in DBMS | 1
    HubCode     | ID Number of hub        | Numeric  | 0   | 1000                       |
    Auditkey    |                         | numeric  |     |                            |

    Extract strategy for the source data

    Refer to the default strategy as discussed in the general points of discussion above.

    Change data capture logic

    The agreement between the source system and the data warehouse can be described as follows:

    1. The source database dumps its content to flat files.
    2. The flat files are stripped of all illegal characters.
    3. The flat files are then imported into the working database using the ETL tool.
    4. Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
    5. The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
    6. The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.

    This progression that the data follows through the ETL pipeline ensures agreement between the source and the warehouse, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.

    Dependencies

    This dimension is dependent on the following dimensions: Audit and Restaurant.

    Transformation logic

    See Annexure B for the diagram.

    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables, a script is executed by the ETL process to check that the necessary space is available in the tables; the process only continues if sufficient space exists.

    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.

    Archiving assumptions

    The default strategy, as mentioned in the general points of discussion above, is applied here.

    Cleanup steps

    When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Easy.


    Regional Manager Dimension

    Table Design

    The Regional Manager Dimension contains the following attributes (column names), each of which have the

    named description, data type, key (if applicable) and constraint (if applicable) respectively:

    Column Name           | Description                                    | DataType | Key           | Constraint
    RegionalManagerSurKey | Identification field that contains no metadata | Numeric  | Surrogate key | Identity
    RegionalManagerCode   | ID number of Regional Manager                  | Numeric  | Primary key   | Unique
    AuditKey              | Audit Dimension Foreign Key                    | Numeric  | Foreign key   | not null

    Historic data load parameters and volumes

    Parameters: the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.

    Volumes: the data source of this dimension contains 101 records.

    Incremental data volumes

    The chances of incremental data volumes in this dimension are none, seeing as the data contained therein is

    set according to geographical bounds that are unchanging and limited to a single country. The only exception

    to this would be if the organization in question expanded their borders to other countries.

    Handling of late arriving data

    All late arriving data is managed in the same way across all tables. The way that this is done is as follows: as can be seen from the load frequency below, the loads are relatively close together, so late arriving data is postponed until the next load.

    Load frequency

    The initial load has been completed, and a load runs every week after that, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    During these loads this dimension is compared to the source data to identify whether changes or additions have occurred and whether it is still up to date and correct. It is then updated as necessary.

    Handling of changes in each attribute

    Changes to each attribute in this table are handled as per the type indicated:

    Column Name Handling of change - Type 1, 2 or 3

    RegionalManagerSurKey n/a

    RegionalManagerCode 2

    AuditKey n/a


    Table partitioning

    Table is not partitioned.

    Overview of data source(s)

    The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

    The source data can be described as follows:

    Column Name         | Description                         | DataType | Min | Max                        | Count of Distinct Values
    RegionalManagerID   | Regional manager identification key | numeric  | 0   | Max Value Possible in DBMS | 1
    RegionalManagerCode | Regional Manager assigned code      | Numeric  | 0   | 2000                       | 1
    Auditkey            |                                     | numeric  |     |                            |

    Extract strategy for the source data

    Refer to the default strategy as discussed in the general points of discussion above.

    Change data capture logic

    The agreement between the source system and the data warehouse can be described as follows:

    1. The source database dumps its content to flat files.
    2. The flat files are stripped of all illegal characters.
    3. The flat files are then imported into the working database using the ETL tool.
    4. Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
    5. The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
    6. The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.

    This progression that the data follows through the ETL pipeline ensures agreement between the source and the warehouse, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.

    Dependencies

    This dimension is dependent on the following dimensions: Audit and Restaurants.

    Transformation logic

    See Annexure B for the diagram.

    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables, a script is executed by the ETL process to check that the necessary space is available in the tables; the process only continues if sufficient space exists.

    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.

    Archiving assumptions

    The default strategy, as mentioned in the general points of discussion above, is applied here.

    Cleanup steps

    When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Easy.

    Country Dimension

    Table Design

    The Country Dimension contains the following attributes (column names), each of which have the named

    description, data type, key (if applicable) and constraint (if applicable) respectively:

    Column Name   | Description                                    | DataType | Key           | Constraint
    CountrySurKey | Identification field that contains no metadata | Numeric  | Surrogate key | Identity
    CountryCode   | Country Identification Abbreviation            | Varchar  | Primary key   | Unique
    AuditKey      | Audit Dimension Foreign Key                    | Numeric  | Foreign key   | not null

    Historic data load parameters and volumes

    Parameters: the data is descriptive by nature and is not tied to any form of date or time, so there is no need to consider any parameters in the loading of this data.

    Volumes: the data source of this dimension contains 101 records.

    Incremental data volumes

    The chances of incremental data volumes in this dimension are none, seeing as the data contained therein is

    set according to geographical bounds that are unchanging and limited to a single country. The only exception to this would be if the organization in question expanded their borders to other countries.


    Handling of late arriving data

    All late arriving data is managed in the same way across all tables. The way that this is done is as follows: as can be seen from the load frequency below, the loads are relatively close together, so late arriving data is postponed until the next load.

    Load frequency

    The initial load has been completed, and a load runs every week after that, with the possibility of daily loads if the client so requests. It is suggested that data dumps from the operational data source into flat files are done at a time when the operational source system is not in use, for example after operating hours. The rest of the ETL operations are then performed soon after.

    During these loads this dimension is compared to the source data to identify whether changes or additions have occurred and whether it is still up to date and correct. It is then updated as necessary.

    Handling of changes in each attribute

    Changes to each attribute in this table are handled as per the type indicated:

    Column Name Handling of change - Type 1, 2 or 3

    CountrySurKey n/a

    CountryCode 1

    AuditKey n/a

    Table partitioning

    Table is not partitioned.

    Overview of data source(s)

    The data source for this table is a normal transactional database table with no special or unusual characteristics.

    Detailed source to target mapping

    See Annexure C for this documentation.

    Source data profiling

    The source data can be described as follows:

    Column Name | Description                         | DataType | Min | Max                        | Count of Distinct Values
    CountryID   | Country Identifier Key              | numeric  | 0   | Max Value Possible in DBMS | 1
    CountryCode | Country Identification Abbreviation | Varchar  |     |                            |
    Auditkey    |                                     | numeric  |     |                            |


    Extract strategy for the source data

    Refer to the default strategy as discussed in the general points of discussion above.

    Change data capture logic

    The agreement between the source system and the data warehouse can be described as follows:

    1. The source database dumps its content to flat files.
    2. The flat files are stripped of all illegal characters.
    3. The flat files are then imported into the working database using the ETL tool.
    4. Transformations and data cleaning are done on these working tables using SQL scripts and other applications that form part of the ETL tool.
    5. The working database is then used to populate temporary data warehouse dimensions in a different schema in the working database environment.
    6. The temporary dimensions are then bulk loaded to the Oracle Data Warehouse.

    This progression that the data follows through the ETL pipeline ensures agreement between the source and the warehouse, because the data is checked at every step for incompatibilities before it eventually reaches the data warehouse structures in Oracle.

    Dependencies

    This dimension is dependent on a single other dimension, namely the audit dimension.

    Transformation logic

    See Annexure B for the diagram.

    Preconditions to avoid error conditions

    Before loading the data into the corresponding tables, a script is executed by the ETL process to check that the necessary space is available in the tables; the process only continues if sufficient space exists.

    Recover and restart assumptions for each major step of the ETL pipeline

    Every time a load operation is performed from any source to its related destination anywhere in the ETL

    pipeline a checkpoint is created before each specific stage operation. If an error occurs anywhere during the

    operation, the process is stopped and the data is rolled back to the created check point.

    Archiving assumptions

    The default strategy, as mentioned in the general points of discussion above, is applied here.

    Cleanup steps

    When all the data has passed through the ETL system and is located in the correct format in the Oracle data warehouse, all the tables in the development database are truncated and the original source flat files are archived.

    Estimated difficulty of implementation

    Easy.


    Date and Time Dimension

    Table Design

    The Date and Time Dimension contains the following attributes (column names), each of which have the

    named description, data type, key (if applicable) and constraint (if applicable) respectively:

    Column Name                   | Description                                    | DataType | Key           | Constraint
    DateSurKey                    | Identification field that contains no metadata | numeric  | Surrogate key | Identity
    FullDateDescription           | Original data from flat file                   | DateTime | Primary key   | Unique
    CalendarMonthName             |                                                | Varchar  | n/a           | not null
    CalendarMonthNumberInYear     |                                                | Numeric  | n/a           | not null
    CalendarQuaterNumberInYear    |                                                | Numeric  | n/a           | not null
    CalendarSemesterNumberInYear  |                                                | Numeric  | n/a           | not null
    CalendarWeekEndingDate        |                                                | DateTime | n/a           | not null
    CalendarWeekNumberInYear      |                                                | Numeric  | n/a           | not null
    CalendarWeekStartingDate      |                                                | DateTime | n/a           | not null
    CalendarYear                  |                                                | Numeric  | n/a           | not null
    CalendarYYYYMM                |                                                | Numeric  | n/a           | not null
    DayNumberInCalendarMonth      |                                                | Numeric  | n/a           | not null
    DayNumberInCalendarWeek       |                                                | Numeric  | n/a           | not null
    DayNumberInCalendarYear       |                                                | Numeric  | n/a           | not null
    HourNumberInDay               |                                                | Numeric  | n/a           | not null
    isBreakfast                   |                                                | bit      | n/a           | not null
    isCoastalSchoolHoliday        |                                                | bit      | n/a           | not null
    isDinner                      |                                                | bit      | n/a           | not null
    isDuringDay                   |                                                | bit      | n/a           | not null
    isDuringNight                 |                                                | bit      | n/a           | not null
    isFirstDayInMonth             |                                                | bit      | n/a           | not null
    isFirstDayInQuater            |                                                | bit      | n/a           | not null
    isFirstDayInSemester          |                                                | bit      | n/a           | not null
    isFirstDayInWeek              |                                                | bit      | n/a           | not null
    isFirstDayInYear              |                                                | bit      | n/a           | not null
    isInlandSchoolHoliday         |                                                | bit      | n/a           | not null
    isLastDayInMonth              |                                                | bit      | n/a           | not null
    isLastDayInQuater             |                                                | bit      | n/a           | not null
    isLastDayInSemester           |                                                | bit      | n/a           | not null
    isLastDayInWeek               |                                                | bit      | n/a           | not null
    isLastDayInYear               |                                                | bit      | n/a           | not null
    isLeapYear                    |                                                | bit      | n/a           | not null
    isLunch                       |                                                | bit      | n/a           | not null
    isPublicHoliday               |                                                | bit      | n/a           | not null
    isReligiousDay                |                                                | bit      | n/a           | not null
    isSpecialDay                  |                                                | bit      | n/a           | not null
    isWeekday                     |                                                | bit      | n/a           | not null
    AuditKey                      | Audit Dimension Foreign Key                    | Numeric  | Foreign key   | not null
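
    Most of the attributes above can be derived directly from the transaction timestamp with standard Oracle date functions. The query below is a partial sketch only: the working table WRK_TRANSACTIONS is an assumed name, only a subset of the columns is derived, and the holiday and special-day flags would need an external calendar.

        SELECT DISTINCT
               transdate                                         AS fulldatedescription,
               TO_CHAR(transdate, 'FMMonth')                     AS calendarmonthname,
               TO_NUMBER(TO_CHAR(transdate, 'MM'))               AS calendarmonthnumberinyear,
               TO_NUMBER(TO_CHAR(transdate, 'Q'))                AS calendarquaternumberinyear,
               TO_NUMBER(TO_CHAR(transdate, 'YYYY'))             AS calendaryear,
               TO_NUMBER(TO_CHAR(transdate, 'YYYYMM'))           AS calendaryyyymm,
               TO_NUMBER(TO_CHAR(transdate, 'DD'))               AS daynumberincalendarmonth,
               TO_NUMBER(TO_CHAR(transdate, 'DDD'))              AS daynumberincalendaryear,
               TO_NUMBER(TO_CHAR(transdate, 'HH24'))             AS hournumberinday,
               CASE WHEN TO_CHAR(transdate, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH')
                         IN ('SAT', 'SUN') THEN 0 ELSE 1 END     AS isweekday
          FROM wrk_transactions;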


    Historic data load parameters and volumes

    Parameters: the data is descriptive by nature and is not tied to any form of date or time, so no parameters need to be considered; it is, however, dependent on a data source that is tied to certain parameters.

    Volumes: the data source of this dimension contains 1,000,000 records.

    Incremental data volumes

    The chances of incremental data volumes in this dimension are extremely high, seeing as the data contained

    therein grows with every transaction that is processed.

    Handling of late arriving data

    All late arriving data is managed in the same way across all tables. The way that this is done is as follows: as can be seen from the load frequency below, the loads are relatively close together, so late arriving data is postponed until the next load.

    Load frequency

    The initial load has been completed, and a load runs every week after that, with the possibility of daily loads if the client so requests.

    It is suggested that data dumps from the operational data source into flat files are done at a time when the

    operation source sys