Integrated Assignment Solution

Upload: neelesh-kamath

Post on 29-Jan-2016


Description

Solutions to assignments for business analytics


Performing ETL

Performing MDDM

Performing Reporting


Performing ETL

Description

As discussed in the chapter titled "Basics of Data Integration (Extraction Transformation and Loading)",

ETL is the process of transforming the source data (source schema) into a desired target database (target

schema). In our case study, the source data is present partly in MS Access and partly in flat text files.

Our case study involves five tables: Time, Assessment, Trainees, Modules and Score. The ETL

process, in accordance with the best practices, will be performed in three phases:

1. Source to Backup

2. Backup to Staging

3. Staging to Data Warehouse

Steps

Let us now look into the steps one by one.

(A) Source to Backup

Data from the source is directly fed into Excel spreadsheets without any transformations. Source

files could be in any form, ranging from simple flat text files to complex relational databases. Our case study deals with two such sources: a flat text file for the Time data and a relational Access source for

the Assessment, Score, Modules and Trainees tables. Let us now go through this process for the

Trainees Table in the access source database.

Steps:

1. Click on the "Data" tab in the menu bar.


2. Choose the "From Access" option from the ribbon interface.

3. Select the source Access database file (having extension .accdb).


4. Choose the required (Trainees) table from the list.

5. As the last step, select any cell (preferably A1) as the top left point of the table.


At the end of this, we have a spreadsheet that has the Trainees table data as follows:

Similarly, we can load the Assessment, Modules and Score tables into separate Excel sheets. The outputs for these tables are as follows:

Assessment Table:


Modules Table:

Score Table:

Now to load the Time data from the text file source, we can use the text import option available in the form

of the "From Text" button under the "Data" tab. Text source files can be in delimited format or in fixed-width format.

1. Delimited Format: Fields are separated using a delimiter character. This character cannot be a part

of the data within fields. "," and ":" are commonly used to separate fields, and the "End Of Line" marker generally separates rows.

2. Fixed Width Format: Fields are aligned and separated using spaces between fields. Rows are

separated using "End Of Line" markers.
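As an aside, the delimited case is easy to check outside Excel. The sketch below parses a delimited Time-style file in Python with the standard csv module; the sample rows and the column names are hypothetical stand-ins for the real source file.

```python
import csv
import io

# A hypothetical sample of the Time source file: comma-delimited,
# one row per line (the real file's layout is not shown in the text).
sample = """TimeKey,FullDateAlternateKey,Year,Quarter,MonthNumber,MonthName
1,2011-01-03,2011,1,1,January
2,2011-01-04,2011,1,1,January
"""

# csv.reader plays the role of Excel's "Delimited" text-import option:
# it splits each line on the delimiter and yields one list per row.
rows = list(csv.reader(io.StringIO(sample), delimiter=","))
header, data = rows[0], rows[1:]
print(header)   # column names from the first line
print(data[0])  # first data row, fields split on ","
```

With a fixed-width file, the same result would instead be obtained by slicing each line at known column offsets.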


The Time data provided to us is in delimited format, the delimiter being ",". With this information, let us try to

load the Time data into the backup Excel sheet.

Steps:

1. Choose the "From Text" option in the "Data" tab.

2. Browse and select the source file.


3. As mentioned before, the Time source file provided is in the delimited format with

delimiter ",". Therefore choose the "Delimited" option and click on Next.

4. Next, choose the correct delimiter character, in our case ",", and click on Next.


5. The next step is to choose the data type for each field. Excel provides a "General" data type

which automatically detects and assigns the required data type. Set the field data types to

General and click on Next.

6. As before, the last step is to select any cell (preferably A1) as the top left cell of the table.


At the end of this, we have a spreadsheet that has the Time table data as follows:

(B) Backup to Staging

In this stage, data from the backup database is transformed and loaded into a staging database.

As before, let us go through the Trainees tables transformation from backup to staging. The target

schema requires three additional columns: EmpName, EmpKey and BU. EmpName is the concatenation of EmpFirstName, EmpMiddleName and EmpLastName, whereas EmpKey is the

surrogate key that has incremental integer values (1, 2, … NoOfRows). The value of the BU field is

determined by EmpNumber: if EmpNumber is less than 100150, then the BU (Business Unit) is SI (Systems Information); otherwise it is TRPU (Training Practice Unit).
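Before building these columns in Excel, the three rules can be sketched in Python; the sample trainee rows below are hypothetical, but the EmpName, EmpKey and BU logic follows the description above.

```python
# Illustrative trainee rows; the real column layout comes from the
# backup sheet, so the sample values here are assumptions.
trainees = [
    {"EmpNumber": 100101, "EmpFirstName": "Asha", "EmpMiddleName": "K", "EmpLastName": "Rao"},
    {"EmpNumber": 100200, "EmpFirstName": "Ravi", "EmpMiddleName": "S", "EmpLastName": "Nair"},
]

staged = []
for i, t in enumerate(trainees, start=1):
    staged.append({
        **t,
        # EmpName: concatenation of the three name parts (Excel's & operator)
        "EmpName": " ".join([t["EmpFirstName"], t["EmpMiddleName"], t["EmpLastName"]]),
        # EmpKey: surrogate key, incremental integers 1..NoOfRows (ROW()-1 in Excel)
        "EmpKey": i,
        # BU: SI if EmpNumber < 100150, else TRPU (Excel's IF())
        "BU": "SI" if t["EmpNumber"] < 100150 else "TRPU",
    })

print(staged[0]["EmpName"], staged[0]["EmpKey"], staged[0]["BU"])  # Asha K Rao 1 SI
print(staged[1]["BU"])  # TRPU
```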

Steps:

1. Click on the "Data" tab in the menu bar, and then click on the "From Other Sources"

button in the ribbon interface and choose the "From Microsoft Query" option.


2. Now that we have transferred the entire dataset into a backup database in Excel, we shall

use this Excel file as the source. Choose "Excel Files*" and click OK.

3. Browse for the back-up Excel file.

4. Choose the desired table (Trainees Table) from the list. In the list box on the right, we can

view the fields in the table and unselect (using the button labeled "<") any if required.

However, as we require all fields of the Trainees table, proceed by clicking on Next.


5. The ―Filter Data‖ window enables selection of those entries in the database that satisfy a

desired condition on a column. As we need the entire dataset from the table, do not add any conditions. Simply click on Next to proceed.

6. The ―Sort Order‖ window enables us to sort a dataset based on any column in the dataset. The dataset is sorted based on the first mentioned column and the remaining columns are

used only in case of ties. It is good practice to sort data based on the primary key

column, hence let us select an ascending sort on EmpNumber.


7. To complete loading the data, choose "Return Data to Microsoft Excel" and click on Finish.


8. As before, select any cell (preferably A1) as the top left point of the table.

9. Now that we have the source dataset, we will add the required new columns. To insert a

new empty column, right-click on the column header and choose "Insert". This will insert a

column to the left of the current column.

10. As mentioned before, "EmpName" is a concatenation of three fields. To perform this

concatenation we use the "&" operator. Go to the first row where the formula is to be applied (in our case it is B2) and use the formula


=C2 & " " & D2 & " " & E2

Then rename the column as "EmpName".

11. EmpKey is an incremental count and can be derived from the row number. For this we can use the in-built ROW() function.

To add the column EmpKey, insert an empty column and, in the second row, insert the

formula "=ROW(B2)-1" (the -1 discounts the column header row).

Function: ROW()

Syntax: ROW(reference)

Returns: The row number of the referenced cell

12. The values of the BU column are dependent on EmpNumber. We will use the IF() function

to get these values.

Function: IF()

Syntax: IF(<condition>, expr1, expr2)


Returns: expr1 if <condition> is TRUE, expr2 if <condition> is FALSE.

To add the column BU, insert an empty column, and in the second row insert the

formula =IF(C2<100150,"SI","TRPU").

At the end of these steps we have the desired Trainees table.

Let us now look into the Assessment, Modules and Time tables. The source and target schemas

for these tables are the same and hence they require no transformations. Therefore we shall

simply load these tables into Excel sheets (steps 1–7 of the Trainees table). The outputs of these tables are as follows:

Assessment Table:


Modules Table:

Time Table:


Score Table:

According to the data warehouse schema, we have EmpKey, ModuleKey and AssessmentTypeKey present in the Score table.

Now, in the Score table of the source, we have EmpId, ModuleName and AssessmentType. The values of EmpKey, ModuleKey and AssessmentTypeKey need to be extracted from the corresponding tables. This

can be done using the VLOOKUP() function in Excel.

Function: VLOOKUP()

Syntax: VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])

Returns: The value in the column (col_index_num) of table_array corresponding to lookup_value.

The VLOOKUP function searches the first column of a range of cells (a range is defined as two or more cells on a sheet) and then returns a value from the specified column in the same row of the range. Hence the value

to be looked up must be present in the first column of the table array.

Example:

Consider two tables, Employee and Department:

Employee Table

A B C D

1 EmpKey EmpName Dno Dname

2 E101 Rahul 1

3 E102 Shyam 2

Department Table

A B

1 Dno Dname

2 1 Mech

3 2 I.T.

The field DName can be fetched into the Employee table using a VLOOKUP function as:

D2=VLOOKUP(C2,Department,2,False)

Here "C2" refers to the cell in the current table whose value must be looked up (searched for) in the table array named Department. "Department" is a name assigned to the table array. "2" refers to the index

number of the column that contains the value to be fetched. Column A has index 1, B has 2 and so

on. Thus 2 refers to the column "Dname".

The FALSE option ensures that VLOOKUP returns a value only on an exact match. The TRUE option allows an

approximate match (the closest value not exceeding the lookup value, which requires the first column to be sorted).


Output:

A B C D

1 EmpKey EmpName Dno Dname

2 E101 Rahul 1 Mech

3 E102 Shyam 2 I.T.
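For readers who prefer to verify the logic in code, an exact-match VLOOKUP behaves like a keyed dictionary look-up. Here is a minimal Python sketch of the same Employee/Department example.

```python
# Department table from the example: Dno -> Dname
department = {1: "Mech", 2: "I.T."}

# Employee rows with the Dno column that drives the look-up
employees = [
    {"EmpKey": "E101", "EmpName": "Rahul", "Dno": 1},
    {"EmpKey": "E102", "EmpName": "Shyam", "Dno": 2},
]

# VLOOKUP(C2, Department, 2, FALSE) with exact matching corresponds to a
# dictionary look-up keyed on the first column of the look-up table; a
# missing key raises KeyError, much as exact-match VLOOKUP returns #N/A.
for e in employees:
    e["Dname"] = department[e["Dno"]]

print([(e["EmpKey"], e["Dname"]) for e in employees])
# → [('E101', 'Mech'), ('E102', 'I.T.')]
```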

Steps:

1. Load the data from the Score table of the back-up Excel sheet onto a new sheet in the staging

Excel sheet as shown above in the steps 1–7. At the end of these steps we have the Score table

as:

2. Now we shall create a table array for the look-up. Go to the staging Trainees Excel sheet and select the dataset. Remember that for the look-up, the first column of the table array must be the

look-up value, which in our case is EmpNumber. Therefore bring EmpNumber to the first

column, which can be done simply by cut and paste. Select the entire dataset and name it EmployeeDetails in the Name Box (the text box to the left of the formula bar, highlighted in the

snapshot below). Similarly, also name the table arrays AssessmentDetails and ModuleDetails in

the respective sheets.


3. Now, to add the column EmpKey, insert an empty column and in the second row apply the

formula:

=VLOOKUP(E2,EmployeeDetails,2,FALSE)

Then rename the column as EmpKey.

4. Next, to add the column AssessmentTypeKey, insert an empty column and in the second row apply the formula:

=VLOOKUP(C2,AssessmentDetails,2,FALSE)

Rename this column as AssessmentTypeKey.

5. Next, to add the column ModuleKey, insert an empty column and in the second row apply the

formula:


=VLOOKUP(G2,ModuleDetails,2,FALSE)

Rename this column as ModuleKey.

At the end of these steps, we obtain the Score table as:

This brings us to the end of the second stage of ETL, i.e., from backup to staging.
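The three look-ups of this stage can be sketched in Python as well; the miniature look-up tables and the score row below are hypothetical, but the key-replacement logic mirrors steps 3–5 above.

```python
# Assumed miniature look-up tables, keyed on the first column of each
# named table array from the text (values here are illustrative only).
employee_details = {100101: 1, 100200: 2}      # EmpNumber -> EmpKey
assessment_details = {"Test": 1, "Retest": 2}  # AssessmentType -> AssessmentTypeKey
module_details = {"ETL": 1, "MDDM": 2}         # ModuleName -> ModuleKey

scores = [
    {"EmpId": 100101, "AssessmentType": "Retest", "ModuleName": "ETL", "Percentage": 78},
]

# Each VLOOKUP in steps 3-5 becomes one exact-match dictionary look-up.
for s in scores:
    s["EmpKey"] = employee_details[s["EmpId"]]
    s["AssessmentTypeKey"] = assessment_details[s["AssessmentType"]]
    s["ModuleKey"] = module_details[s["ModuleName"]]

print(scores[0]["EmpKey"], scores[0]["AssessmentTypeKey"], scores[0]["ModuleKey"])  # 1 2 1
```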

(C) Staging to Warehouse

In this stage, data from the staging database is cleaned and loaded into the warehouse. Cleaning

refers to the process of excluding incorrect or unwanted data present in the staging table.

For instance, if the warehouse is to be created for data only of the past decade, then any data older

than that should be excluded from the warehouse. Our case study however specifies no such

cleaning activity, and the data can be directly loaded into the warehouse. However, care must be taken to exclude unnecessary columns present in the staging database, such as EmpFirstName,

EmpMiddleName and EmpLastName. This can be done by removing the unnecessary columns in

the Query Wizard's "Choose Columns" window. Follow the same steps as in the backup-to-staging transformation of the Trainees table for each of the five tables in the database to create the

warehouse.


Similarly unselect any unnecessary fields in the Score table to obtain the final data warehouse.

DimAssessment Table:


DimModule Table:

DimTrainee Table:


DimTime Table:

FactScores Table:


Multi Dimensional Data Modeling (MDDM)

Performing MDDM

Description

Dimensions are the points of view from which the facts can be seen. From the data warehouse that we

obtained after performing ETL on the source database, we can identify the following dimensions:

1. DimTime

2. DimModule

3. DimAssessment

4. DimTrainee

Also, a fact table is the table from which the largest number of other tables in the database can be reached. In our data warehouse, FactScores is one such table.

Now, in MS Excel, forming a true multidimensional cube is not possible. Thus, we will have to bring the relevant data needed for multidimensional analysis onto a single sheet.

Steps for performing MDDM

1. For creating the cube in Excel, first load the data from the FactScores sheet of the warehouse Excel

file onto a new sheet.

2. We need all the relevant data from all the dimensions to make the cube. In our case we need

EmpName, BatchName, Stream and IBU from DimTrainee; ModuleName and ModuleCreditPoints

from DimModule; AssessmentType and Duration from DimAssessment; and FullDateAlternateKey, Year, Quarter, MonthNumber and MonthName from DimTime. These fields can be loaded into the

IPACube sheet using VLOOKUP.

After performing lookup operations on the required columns, the final output of these steps is:

Thus we have got the MS Excel version of a cube. Reports can be made on this cube using a pivot table or a

pivot chart.
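The flattening that the look-ups perform can be sketched in Python as a join of the fact table with its dimensions; the miniature tables, their values and the TimeKey field name are assumptions for illustration.

```python
# Assumed miniature dimension tables, keyed by their surrogate keys.
dim_trainee = {1: {"EmpName": "Asha K Rao", "BatchName": "B1", "Stream": "CS", "IBU": "SI"}}
dim_module = {1: {"ModuleName": "ETL", "ModuleCreditPoints": 4}}
dim_assessment = {1: {"AssessmentType": "Test", "Duration": 60}}
dim_time = {1: {"Year": 2011, "Quarter": 1, "MonthNumber": 1, "MonthName": "January"}}

fact_scores = [
    {"EmpKey": 1, "ModuleKey": 1, "AssessmentTypeKey": 1, "TimeKey": 1, "Percentage": 78},
]

# Denormalize: for each fact row, pull the descriptive attributes from
# every dimension onto the same row (what the VLOOKUPs do on the IPACube sheet).
cube = [
    {**f,
     **dim_trainee[f["EmpKey"]],
     **dim_module[f["ModuleKey"]],
     **dim_assessment[f["AssessmentTypeKey"]],
     **dim_time[f["TimeKey"]]}
    for f in fact_scores
]

print(cube[0]["EmpName"], cube[0]["ModuleName"], cube[0]["Percentage"])  # Asha K Rao ETL 78
```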


Reporting

Performing Reporting

In Excel, reports can be made on pivot tables and pivot charts, which have the ability to summarize large volumes of complicated data. A pivot table (or chart) is composed of four elements:

1. Report Filters: Report filters are used to enable in-depth analysis of large amounts of data in the pivot element by providing a subset of data. For instance, we can restrict our analysis to a specific

number of products or a specific region or span of time.

2. Row Labels (Axis Fields): Row labels are used to view facts through a dimension, that is, a row label contains an attribute of any dimension. An attribute is preferred as a row label when its

domain is large (i.e., there is a large number of possible values). For instance, the CustomerName

attribute is generally used as a row label.

3. Column Labels (Legend Fields): Column labels, like row labels, are also used to view facts with respect to a dimension. A column label is an attribute of a dimension, and generally attributes with

smaller domains are used as column labels. For example, Quarter, Month of year, etc. are generally

used as column labels.

4. Values: Values are facts (measures) which can generally be aggregated across one or more

dimensions, for example, Quantity or SalesAmount.

Report 1:

The report requires us to create a chart that displays the percentage scores of employees for all modules with assessment type either Test or Retest. To achieve this, we create a pivot table on the Excel cube

sheet IPACube with AssessmentType and EmpName as report filters, ModuleName as the row label and

Percentage as the value.
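The pivot logic of Report 1 can be sketched in Python before we build it: filter on AssessmentType, group by ModuleName, and average Percentage. The cube rows below are hypothetical.

```python
from collections import defaultdict

# Assumed flattened cube rows (only the fields Report 1 needs).
cube = [
    {"ModuleName": "ETL",  "AssessmentType": "Test",   "Percentage": 70},
    {"ModuleName": "ETL",  "AssessmentType": "Retest", "Percentage": 80},
    {"ModuleName": "MDDM", "AssessmentType": "Test",   "Percentage": 90},
    {"ModuleName": "ETL",  "AssessmentType": "Exam",   "Percentage": 10},  # filtered out
]

# Report filter: keep only Test and Retest rows.
filtered = [r for r in cube if r["AssessmentType"] in {"Test", "Retest"}]

# Row label = ModuleName, value = average of Percentage
# (the pivot's "Average" value field setting).
totals = defaultdict(list)
for r in filtered:
    totals[r["ModuleName"]].append(r["Percentage"])
report = {m: sum(v) / len(v) for m, v in totals.items()}

print(report)  # {'ETL': 75.0, 'MDDM': 90.0}
```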

Steps:

1. Open the Excel cube sheet IPACube and click on the "Insert" tab to find the "PivotTable" button in the ribbon interface.


2. Click on PivotTable and select PivotChart.

3. Next, select the table array on which the pivot chart has to be made. In our case, we select the entire

cube sheet as the source (selected by default by Excel). Also we may choose to create the pivot

chart in a new worksheet or in the same sheet. For readability, we shall go with a new worksheet.

4. This creates an empty pivot sheet in the Excel file. Fields may be dragged into areas such as report

filter, axis fields (row label), legend fields (column label) and values. Note that a field cannot appear in any other area if it is present as a part of the report filter.


5. Now, as discussed before, drag ModuleName into the Axis Fields (row label) area, Percentage into

the Values area and AssessmentType and EmpName into the Report Filter area.

6. As a percentage is meaningful only when summarized as an average, we now change the value field setting to Average. For this, click on the field in the Values area and click on "Value Field

Settings…".


7. Choose the calculation type Average in the "Value Field Settings" dialog box and click on OK.

8. Our report is required to provide information for the assessment types Test and Retest. Hence, in the

AssessmentType report filter at the top left corner of the sheet, check the boxes for Select Multiple Items, Retest and Test.


At the end of these steps, we obtain the desired chart report:

Report 2:

The report requires us to create a table that displays the module names and their assessment types conducted in a

month of a quarter of a year, with drill-downs active on the calendar hierarchy. We shall create a pivot table with Year, Quarter, Month and ModuleName as row labels, AssessmentType as the column label, and

Percentage summarized as Count of Percentage as the value. The pivot table is created in the same way as the

pivot chart, so follow steps 1–7 to create the table report. It is important to keep in mind that Percentage

must this time be summarized as a count, not an average or a sum. The drill-down functionality for the table is created by default in Excel, and hence we need not handle it explicitly. The final report shows

the following details:
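The counting pivot of Report 2 can be sketched in Python in the same spirit; the cube rows below are hypothetical.

```python
from collections import Counter

# Assumed flattened cube rows (only the fields Report 2 needs).
cube = [
    {"Year": 2011, "Quarter": 1, "MonthName": "January", "ModuleName": "ETL", "AssessmentType": "Test"},
    {"Year": 2011, "Quarter": 1, "MonthName": "January", "ModuleName": "ETL", "AssessmentType": "Test"},
    {"Year": 2011, "Quarter": 1, "MonthName": "January", "ModuleName": "ETL", "AssessmentType": "Retest"},
]

# Row label = the calendar hierarchy plus ModuleName, column label =
# AssessmentType, value = Count of Percentage (each row counts as one).
counts = Counter(
    ((r["Year"], r["Quarter"], r["MonthName"], r["ModuleName"]), r["AssessmentType"])
    for r in cube
)

print(counts[((2011, 1, "January", "ETL"), "Test")])    # 2
print(counts[((2011, 1, "January", "ETL"), "Retest")])  # 1
```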
