
Question Paper
Data Warehousing and Data Mining (MB3G1IT) : October 2008

Section A : Basic Concepts (30 Marks)
• This section consists of questions with serial number 1 - 30.
• Answer all questions.
• Each question carries one mark.
• Maximum time for answering Section A is 30 Minutes.

1. Which of the following is not determined in the business requirements stage of the data warehouse delivery method?
(a) The logical model for information within the data warehouse (b) The source systems that provide this data (i.e. mapping rules) (c) The query profiles for the immediate requirement (d) The capacity plan for hardware and infrastructure (e) The business rules to be applied to data.

2. Which of the following is a system manager who performs backup and archiving of the data warehouse?
(a) Load manager (b) Warehouse manager (c) Query manager (d) Database manager (e) Event manager.

3. What is the load manager task that is implemented by stored procedure tools?
(a) Fast load (b) Simple transformation (c) Complex checking (d) Job control (e) Backup and archive.

4. Which of the following statements is/are true about the types of partitioning?
I. Vertical partitioning can take two forms: normalization and row splitting.
II. Before using vertical partitioning there should not be any requirement to perform major join operations between the two partitions.
III. In order to maximize the hardware partitioning, minimize the processing power available.
(a) Only (I) above (b) Only (II) above (c) Only (III) above (d) Both (I) and (II) above (e) Both (II) and (III) above.

5. Which of the following machines is a set of tightly coupled CPUs that share memory and disk?
(a) Symmetric multi-processing (b) Massively multi-processing (c) Segregated multi-processing (d) Asymmetric multi-processing (e) Multidimensional multi-processing.

6. Which Redundant Array of Inexpensive Disks (RAID) level has full mirroring with each disk duplexed?
(a) Level 1 (b) Level 2 (c) Level 3 (d) Level 4 (e) Level 5.


7. Which of the following statements is/are true?
I. Snowflake schema is a variant of star schema where each dimension can have its own dimensions.
II. Starflake schema is a logical structure that has a fact table in the center with dimension tables radiating off of this central table.
III. Star schema is a hybrid structure that contains a mix of starflake and snowflake schemas.
(a) Only (I) above (b) Only (II) above (c) Only (III) above (d) Both (I) and (II) above (e) All (I), (II) and (III) above.

8. In database sizing, what is the mathematical representation of 'T' (temporary space), given 'n' (the number of concurrent queries allowed) and 'P' (the size of the partition)?
(a) T = (2n + 1)P (b) T = P/(2n + 1) (c) T = (2n - 1)P (d) T = (2n - 1)/P (e) T = P/(2n - 1)

9. In data mining, which of the following states a statistical correlation between the occurrence of certain attributes in a database table?
(a) Association rules (b) Query tools (c) Visualization (d) Case-based learning (e) Genetic algorithms.

10. In data mining, learning tasks can be divided into:
I. Classification tasks
II. Knowledge engineering tasks
III. Problem-solving tasks
(a) Only (I) above (b) Only (II) above (c) Only (III) above (d) Both (I) and (II) above (e) All (I), (II) and (III) above.

11. Which of the following statements are true about the types of knowledge?
I. Hidden knowledge is the information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL).
II. Shallow knowledge is the data that can be found relatively easily by using pattern recognition or machine-learning algorithms.
III. Multi-dimensional knowledge is the information that can be analyzed using online analytical processing tools.
IV. Deep knowledge is the information that is stored in the database but can only be located if we have a clue that tells us where to look.
(a) Both (I) and (II) above (b) Both (II) and (III) above (c) Both (III) and (IV) above (d) (I), (II) and (III) above (e) (II), (III) and (IV) above.


12. There are some specific rules that govern the basic structure of a data warehouse, namely that such a structure should be:
I. Time-independent
II. Non-volatile
III. Subject oriented
IV. Integrated
(a) Both (I) and (II) above (b) Both (II) and (III) above (c) Both (III) and (IV) above (d) (I), (II) and (III) above (e) (II), (III) and (IV) above.

13. Which of the following statements is/are true about Online Analytical Processing (OLAP)?
I. OLAP tools do not learn; they create new knowledge.
II. OLAP tools are more powerful than data mining.
III. OLAP tools cannot search for new solutions.
(a) Only (I) above (b) Only (II) above (c) Both (I) and (II) above (d) Both (I) and (III) above (e) Both (II) and (III) above.

14. Auditing is a specific subset of security that is often mandated by organizations. As far as the data warehouse is concerned, the audit requirements can basically be categorized as:
I. Connections
II. Disconnections
III. Data access
IV. Data change
(a) Both (I) and (III) above (b) Both (II) and (III) above (c) (I), (III) and (IV) above (d) (II), (III) and (IV) above (e) All (I), (II), (III) and (IV) above.

15. Which of the following produced the 'Alexandria' backup software package?
(a) HP (b) Sequent (c) IBM (d) Epoch Systems (e) Legato.

16. Which of the following statements is/are true about aggregations in data warehousing?
I. Aggregations are performed in order to speed up common queries.
II. Too few aggregations will lead to unacceptable operational costs.
III. Too many aggregations will lead to an overall lack of system performance.
(a) Only (I) above (b) Only (II) above (c) Both (I) and (II) above (d) Both (II) and (III) above (e) All (I), (II) and (III) above.


17. Which of the following statements is/are true about the basic levels of testing the data warehouse?
I. All unit testing should be complete before any test plan is enacted.
II. In integration testing, the separate development units that make up a component of the data warehouse application are tested to ensure that they work together.
III. In system testing, the whole data warehouse application is tested together.
(a) Only (I) above (b) Only (II) above (c) Both (I) and (II) above (d) Both (II) and (III) above (e) All (I), (II) and (III) above.

18. To execute each SQL statement, the RDBMS uses an optimizer to calculate the best strategy for performing that statement. There are a number of different ways of calculating such a strategy, but we can categorize optimizers generally as either rule based or cost based. Which of the following statements is/are false?
I. A rule-based optimizer uses known rules to perform the function.
II. A cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement.
III. "Number of rows in the table" is generally collected by a rule-based optimizer.
(a) Only (I) above (b) Only (II) above (c) Only (III) above (d) Both (I) and (II) above (e) All (I), (II) and (III) above.

19. In parallel technology, which of the following statements is/are true?
I. Data shipping is where a process requests for the data to be shipped to the location where the process is running.
II. Function shipping is where the function to be performed is moved to the locale of the data.
III. Architectures which are designed for shared-nothing or distributed environments use data shipping exclusively. They can achieve parallelism as long as the data is partitioned or distributed correctly.
(a) Only (I) above (b) Only (II) above (c) Only (III) above (d) Both (I) and (II) above (e) All (I), (II) and (III) above.

20. Which of the following is/are the common restriction(s) that may apply to the handling of views?
I. Restricted Data Manipulation Language (DML) operations
II. Lost query optimization paths
III. Restrictions on parallel processing of view projections
(a) Only (I) above (b) Only (II) above (c) Only (III) above (d) Both (I) and (II) above (e) All (I), (II) and (III) above.

21. One petabyte is equal to
(a) 1024 terabytes (b) 1024 gigabytes (c) 1024 megabytes (d) 1024 kilobytes (e) 1024 bytes.


22. The formula for the construction of a genetic algorithm for the solution of a problem has the following steps. List the steps in order:
I. Invent an artificial environment in the computer where the solutions can join in battle with each other. Provide an objective rating to judge success or failure, in professional terms called a fitness function.
II. Develop ways in which possible solutions can be combined. Here the so-called cross-over operation, in which the father's and mother's strings are simply cut and, after changing, stuck together again, is very popular. In reproduction all kinds of mutation operators can be applied.
III. Devise a good, elegant coding of the problem in terms of strings of a limited alphabet.
IV. Provide a well-varied initial population and make the computer play 'evolution' by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions. Stop when a family of successful solutions has been produced.
(a) I, II, III and IV (b) I, III, II and IV (c) III, I, II and IV (d) II, III, I and IV (e) III, II, I and IV.

23. Which of the following produced the ADSTAR Distributed Storage Manager (ADSM) backup software package?
(a) HP (b) Sequent (c) IBM (d) Epoch Systems (e) Legato.

24. Which of the following does not belong to the stages in the Knowledge Discovery Process?
(a) Data selection (b) Data encapsulation (c) Cleaning (d) Coding (e) Reporting.

25. Which of the following is/are the applications of data mining?
I. Customer profiling
II. CAPTAINS
III. Reverse engineering
(a) Only (I) above (b) Only (II) above (c) Only (III) above (d) Both (I) and (II) above (e) All (I), (II) and (III) above.

26. Which of the following managers is not a part of the system managers in a data warehouse?
(a) Configuration manager (b) Schedule manager (c) Event manager (d) Database manager (e) Load manager.

27. In data mining, a group of similar objects that differ significantly from other objects is known as
(a) Filtering (b) Clustering (c) Coding (d) Scattering (e) Binding.


28. A perceptron with a simple three-layered network has ____________ as input units.
(a) Photo-receptors (b) Associators (c) Responders (d) Acceptors (e) Rejectors.

29. In which theory was the human brain described as a neural network?
(a) Shannon's communication theory (b) Kolmogorov complexity theory (c) Rissanen theory (d) Freud's theory of psychodynamics (e) Kohonen theory.

30. Which of the following is/are the task(s) maintained by the query manager?
I. Query syntax
II. Query execution plan
III. Query elapsed time
(a) Only (I) above (b) Only (II) above (c) Only (III) above (d) Both (I) and (II) above (e) All (I), (II) and (III) above.

END OF SECTION A

Section B : Caselets (50 Marks)
• This section consists of questions with serial number 1 – 6.
• Answer all questions.
• Marks are indicated against each question.
• Detailed explanations should form part of your answer.
• Do not spend more than 110 - 120 minutes on Section B.

Caselet 1
Read the caselet carefully and answer the following questions:

1. "Braite selected Symphysis as the provider of choice to create a roadmap for the solution, develop a scalable, robust and user-friendly framework and deliver the product set." In this context, explain the data warehousing architecture. (10 marks)

2. If you are a project manager at Braite, what metrics will you consider that will help Braite in meeting its business goals for improved customer satisfaction and process improvement? Explain. (8 marks)

3. What might be the important characteristics of the proposed data warehouse? Also list the features of a data warehouse. (7 marks)

4. Do you think the Software Quality Assurance (SQA) process will play an important role in any data warehousing project? Explain. (10 marks)

Braite, a leading provider of software services to financial institutions, launched an initiative to enhance its application platform in order to provide better data analytics for its customers. Braite partnered with Symphysis to architect and build new Data Warehousing (DW) and Business Intelligence (BI) services for its Business Intelligence Center (BIC). Over time, Symphysis has become Braite's most strategic product development partner, using their global delivery model.

Braite faced real challenges in providing effective data analytics to its customers - it supported several complex data sources residing in multiple application layers within its products, and faced challenges in implementing business rules and integrating data from disparate systems. These source systems included mainframe systems, Oracle databases and even Excel spreadsheets. Braite also faced several other challenges: it required manual data validation and data comparison processes; it required manual controls over the credit card creation process; its system design process suffered from unclear business requirements; and it supported multiple disparate data marts.

To address these challenges, Braite turned to an offshore IT services provider with DW/BI experience and deep expertise in healthcare benefits software. Braite selected Symphysis as the provider of choice to assess the feasibility of a DW/BI solution, as well as to create a roadmap for the solution, develop a scalable, robust and user-friendly DW/BI framework and deliver the product set.

In this project, Symphysis designed and executed the DW/BI architecture that has become the cornerstone of Braite's data analytics service offerings, enhancing its status as a global leader. Business Intelligence (BI) services focus on helping clients in collecting and analyzing external and internal data to generate value for their organizations.

Symphysis successfully architected, built and tested the DW/BI solution with the following key deliverables: it created scripts to automate existing manual processes involving file comparison, validation, quality checks and control card generation; it introduced change and configuration management best practices, leading to standardization of Braite's delivery process, along with robust Software Quality Assurance (SQA) processes to ensure high software quality (the SQA process relies on unit testing and functional testing, resulting in reduced effort for business analysts); and it defined and collected metrics at various steps in the software development process (from development to integration testing) in order to improve process methodology and provide the ability to enter benchmarks or goals for performance tracking. There are several data warehouse project management metrics worth considering. These metrics have helped Braite to meet its business goals for improved customer satisfaction and process improvement.

In addition to full lifecycle product development, Symphysis also provides ongoing product support for Braite's existing applications.

Symphysis provided DW/BI solutions across a wide range of business functions including sales and service, relationship value management, customer information systems, billing and online collections, operational data store, loans, deposits, voice recognition, custom history and ATM. Symphysis's data mart optimization framework enabled integration and consolidation of disparate data marts. Symphysis's SQA strategy improved delivery deadlines in terms of acceptance and integration. Symphysis provided DW process improvements, ETL checklists and standardization. Braite achieved cost savings of 40% by using Symphysis's onsite/offshore delivery model and a scalable architecture enabling new data warehouse and business intelligence applications.

END OF CASELET 1

Caselet 2
Read the caselet carefully and answer the following questions:

5. Critically analyze the functions of the tools that the chairman of Trilog Brokerage Services (TBS) decided to implement in order to increase the efficiency of the organization. (7 marks)

6. Discuss the classification of the usage of tools against a data warehouse and also discuss the types of Online Analytical Processing (OLAP) tools. (8 marks)

Trilog Brokerage Services (TBS) is one of the oldest firms in India, with a very strong customer base. Many of its customers have more than one security holding, and some even have more than 50 securities in their portfolios. It has become very difficult on the part of TBS to track and maintain which customer is selling/buying which security and the amounts they have to receive or the amounts they have to pay to TBS.

TBS has found that the information silos created are running contrary to the goal of the business intelligence organization architecture, which is to ensure enterprise-wide informational content to the broadest audience. By utilizing the information properly it can enhance customer and supplier relationships, improve the profitability of products and services, create worthwhile new offerings, better manage risk and pare expenses dramatically, among many other gains. TBS was feeling that it required a category of software tools that help analyze data stored in its database and help users analyze different dimensions of the data, such as time series and trend analysis views.

The chairman of TBS felt that Online Analytical Processing (OLAP) was the need of the hour and decided to implement it immediately, so that the processing part would be reduced significantly, thereby increasing the efficiency of the organization.

END OF CASELET 2

END OF SECTION B

Section C : Applied Theory (20 Marks)
• This section consists of questions with serial number 7 - 8.
• Answer all questions.
• Marks are indicated against each question.
• Do not spend more than 25 - 30 minutes on Section C.

7. What is a Neural Network? Discuss the various forms of Neural Networks. (10 marks)

8. Explain the various responsibilities of a query manager. (10 marks)

END OF SECTION C

END OF QUESTION PAPER

Suggested Answers
Data Warehousing and Data Mining (MB3G1IT) : October 2008

Section A : Basic Concepts

Answer : Reason

1. D  The capacity plan for hardware and infrastructure is not determined in the business requirements stage; it is identified in the technical blueprint stage.

2. B  The warehouse manager is a system manager who performs backup and archiving of the data warehouse.

3. C  Stored procedure tools implement complex checking.

4. D  Vertical partitioning can take two forms: normalization and row splitting. Before using vertical partitioning there should not be any requirement to perform major join operations between the two partitions. In order to maximize the hardware partitioning, maximize the processing power available.

5. A  A symmetric multi-processing machine is a set of tightly coupled CPUs that share memory and disk.

6. A  Redundant Array of Inexpensive Disks (RAID) Level 1 has full mirroring, with each disk duplexed.

7. A  Snowflake schema is a variant of star schema where each dimension can have its own dimensions. Star schema is a logical structure that has a fact table in the center with dimension tables radiating off of this central table. Starflake schema is a hybrid structure that contains a mix of star and snowflake schemas.

8. A  In database sizing, if n is the number of concurrent queries allowed and P is the size of the partition, then the temporary space (T) is set to T = (2n + 1)P.
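
As a quick illustration of the sizing rule in answer 8, a short Python sketch follows; the figures used are purely illustrative and not taken from the paper.

    def temporary_space(n_concurrent_queries: int, partition_size_gb: float) -> float:
        """Temporary space T = (2n + 1) * P, as in answer 8."""
        return (2 * n_concurrent_queries + 1) * partition_size_gb

    # Example: 4 concurrent queries against 50 GB partitions
    # T = (2*4 + 1) * 50 = 450 GB of temporary space
    print(temporary_space(4, 50.0))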

9. A  Association rules state a statistical correlation between the occurrence of certain attributes in a database table.

10. E  Learning tasks can be divided into:
I. Classification tasks
II. Knowledge engineering tasks
III. Problem-solving tasks

11. C  Shallow knowledge is the information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL). Hidden knowledge is the data that can be found relatively easily by using pattern recognition or machine-learning algorithms. Multi-dimensional knowledge is the information that can be analyzed using online analytical processing tools. Deep knowledge is the information that is stored in the database but can only be located if we have a clue that tells us where to look.

12. E  There are some specific rules that govern the basic structure of a data warehouse, namely that such a structure should be: time dependent, non-volatile, subject oriented and integrated.

13. D  OLAP tools do not learn; they create new knowledge, and OLAP tools cannot search for new solutions. Data mining is more powerful than OLAP.

14. E  Auditing is a specific subset of security that is often mandated by organizations. As far as the data warehouse is concerned, the audit requirements can basically be categorized as:
I. Connections
II. Disconnections
III. Data access
IV. Data change

15. B  The Alexandria backup software package was produced by Sequent.

16. A  Aggregations are performed in order to speed up common queries; too many aggregations will lead to unacceptable operational costs; too few aggregations will lead to an overall lack of system performance.

17. E  All unit testing should be complete before any test plan is enacted. In integration testing, the separate development units that make up a component of the data warehouse application are tested to ensure that they work together. In system testing, the whole data warehouse application is tested together.

18. C  A rule-based optimizer uses known rules to perform the function; a cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement. "Number of rows in the table" is generally collected by a cost-based optimizer.

19. D  Data shipping is where a process requests for the data to be shipped to the location where the process is running; function shipping is where the function to be performed is moved to the locale of the data. Architectures which are designed for shared-nothing or distributed environments use function shipping exclusively. They can achieve parallelism as long as the data is partitioned or distributed correctly.

20. E  The common restrictions that may apply to the handling of views are:
I. Restricted Data Manipulation Language (DML) operations
II. Lost query optimization paths
III. Restrictions on parallel processing of view projections

21. A  One petabyte is equal to 1024 terabytes.
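
For reference, the conversion chain behind answer 21 can be checked directly in Python (binary units, a factor of 1024 per step):

    BYTES_PER_KB = 1024
    bytes_per_petabyte = 1024 ** 5                              # KB -> MB -> GB -> TB -> PB
    terabytes_per_petabyte = bytes_per_petabyte // (1024 ** 4)  # bytes in a terabyte
    print(terabytes_per_petabyte)                               # 1024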

22. C  The formula for the construction of a genetic algorithm for the solution of a problem has the following steps:
I. Devise a good, elegant coding of the problem in terms of strings of a limited alphabet.
II. Invent an artificial environment in the computer where the solutions can join in battle with each other. Provide an objective rating to judge success or failure, in professional terms called a fitness function.
III. Develop ways in which possible solutions can be combined. Here the so-called cross-over operation, in which the father's and mother's strings are simply cut and, after changing, stuck together again, is very popular. In reproduction all kinds of mutation operators can be applied.
IV. Provide a well-varied initial population and make the computer play 'evolution' by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions. Stop when a family of successful solutions has been produced.
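
The four steps above map directly onto a minimal genetic algorithm. The sketch below is illustrative only - it maximizes the number of 1-bits in a string, a toy fitness function not taken from the paper - but each comment marks the corresponding step of the recipe.

    import random

    random.seed(42)

    STRING_LENGTH = 20       # Step I: code solutions as strings over the alphabet {0, 1}
    POPULATION_SIZE = 30
    GENERATIONS = 40
    MUTATION_RATE = 0.01

    def fitness(individual):
        # Step II: an objective rating (fitness function); here simply the number of 1s
        return sum(individual)

    def crossover(father, mother):
        # Step III: cut the parents' strings and stick the pieces together again
        cut = random.randint(1, STRING_LENGTH - 1)
        return father[:cut] + mother[cut:]

    def mutate(individual):
        # Step III (continued): apply mutation operators during reproduction
        return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in individual]

    def evolve():
        # Step IV: start from a well-varied population and make the computer play 'evolution'
        population = [[random.randint(0, 1) for _ in range(STRING_LENGTH)]
                      for _ in range(POPULATION_SIZE)]
        for _ in range(GENERATIONS):
            population.sort(key=fitness, reverse=True)
            survivors = population[:POPULATION_SIZE // 2]   # remove the bad solutions
            children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                        for _ in range(POPULATION_SIZE - len(survivors))]
            population = survivors + children               # replace them with progeny
        return max(population, key=fitness)

    best = evolve()
    print(fitness(best), best)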

23. C  The ADSM backup software package was produced by IBM.

24. B  Data encapsulation is not a stage in the knowledge discovery process.

25. E  Customer profiling, CAPTAINS and reverse engineering are applications of data mining.

26. E  Except the load manager, all the other managers are part of the system managers in a data warehouse.

27. B  A group of similar objects that differ significantly from other objects is known as clustering.

28. A  A perceptron consists of a simple three-layered network with input units called photo-receptors.

29. D  In Freud's theory of psychodynamics the human brain was described as a neural network.

30. E  The tasks maintained by the query manager are:
I. Query syntax
II. Query execution plan
III. Query elapsed time

Section B : Caselets

1. Architecture of a data warehouse:

Load Manager Architecture
The architecture of a load manager is such that it performs the following operations (a small illustrative sketch of these three operations follows this answer):
1. Extract the data from the source system.
2. Fast-load the extracted data into a temporary data store.
3. Perform simple transformations into a structure similar to the one in the data warehouse.
Figure: Load manager architecture

Warehouse Manager Architecture
The architecture of a warehouse manager is such that it performs the following operations:
1. Analyze the data to perform consistency and referential integrity checks.
2. Transform and merge the source data in the temporary data store into the published data warehouse.
3. Create indexes, business views, partition views and business synonyms against the base data.
4. Generate denormalizations if appropriate.
5. Generate any new aggregations that may be required.
6. Update all existing aggregations.
7. Back up, incrementally or totally, the data within the data warehouse.
8. Archive data that has reached the end of its capture life.
In some cases the warehouse manager also analyzes query profiles to determine which indexes and aggregations are appropriate.
Figure: Architecture of a warehouse manager

Query Manager Architecture
The architecture of a query manager is such that it performs the following operations:
1. Direct queries to the appropriate table(s).
2. Schedule the execution of user queries.

The actual problem specified was a tight project schedule within which the solution had to be delivered. The field errors had to be reduced to a great extent, as the solution was for the company. The requirements needed to be defined very clearly, and there was a need for a scalable and reliable architecture and solution. A study was conducted on the company's current business information requirements and its current process of getting that information, and a business case was prepared for a data warehousing and business intelligence solution.
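
The three load manager operations listed above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the file source_orders.csv, the SQLite file temp_store.db and the column names are all hypothetical, and the "fast load" is simulated with executemany rather than a vendor bulk-load utility.

    import csv
    import sqlite3

    def load_manager(source_file: str, temp_db: str) -> None:
        # 1. Extract the data from the source system (a flat file in this sketch).
        with open(source_file, newline="") as f:
            rows = list(csv.DictReader(f))

        # 2. Fast-load the extracted data into a temporary data store.
        conn = sqlite3.connect(temp_db)
        conn.execute("CREATE TABLE IF NOT EXISTS temp_orders "
                     "(order_id TEXT, amount TEXT, order_date TEXT)")
        conn.executemany(
            "INSERT INTO temp_orders VALUES (:order_id, :amount, :order_date)", rows)

        # 3. Perform simple transformations into a structure similar to the warehouse's.
        conn.execute("""
            CREATE TABLE IF NOT EXISTS staged_orders AS
            SELECT order_id,
                   CAST(amount AS REAL) AS amount,
                   DATE(order_date)     AS order_date
            FROM temp_orders
        """)
        conn.commit()
        conn.close()

    # Example (assumes source_orders.csv has order_id, amount and order_date columns):
    # load_manager("source_orders.csv", "temp_store.db")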


2. Metrics are essential in the assessment of software development quality. They may provide information about the development process itself and the yielded products. Metrics may be grouped into quality areas, which define a perspective for metrics interpretation. The adoption of a measurement program includes the definition of metrics that generate useful information. To do so, the organization's goals have to be defined and analyzed, along with what the metrics are expected to deliver. Metrics may be classified as direct and indirect. A direct metric is independent of the measurement of any other. Indirect metrics, also referred to as derived metrics, represent functions upon other metrics, direct or derived. Productivity (code size / programming time) is an example of a derived metric (see the short illustration after this answer). The existence of a timely and accurate capturing mechanism for direct metrics is critical in order to produce reliable results. Indicators establish the quality factors defined in a measurement program.
Metrics also have a number of components, and for data warehousing they can be broken down in the following manner:
• Objects - the "themes" in the data warehouse environment which need to be assessed. Objects can include business drivers, warehouse contents, refresh processes, accesses and tools.
• Subjects - things in the data warehouse to which we assign numbers or a quantity. For example, subjects include the cost or value of a specific warehouse activity, access frequency, duration and utilization.
• Strata - a criterion for manipulating metric information. This might include day of the week, specific tables accessed, location, time or accesses by department.
These metric components may be combined to define an "application" which states how the information will be applied. For example, "When actual monthly refresh cost exceeds targeted monthly refresh cost, the value of each data collection in the warehouse must be re-established."
There are several data warehouse project management metrics worth considering. The first three are:
• Business Return On Investment (ROI): The best metric to use is business return on investment. Is the business achieving bottom-line success (increased sales or decreased expenses) through the use of the data warehouse? This focus will encourage the development team to work backwards and to do the right things day in and day out for the ultimate arbiter of success -- the bottom line.
• Data usage: The second best metric is data usage. You want to see the data warehouse used for its intended purposes by the target users. The objective here is increasing numbers of users and complexity of usage. With this focus, user statistics such as logins and query bands are tracked.
• Data gathering and availability: The third best data warehouse metric category is data gathering and availability. Under this focus the data warehouse team becomes an internal data brokerage, serving up data for the organization's consumption. Success is measured in the availability of the data, more or less according to a service level agreement. These business metrics should be used to gauge success.
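
As a small illustration of the direct/derived distinction above, the productivity example (code size / programming time) works out as follows; the numbers are invented for the example.

    # Direct metrics: measured directly from the project (illustrative values)
    code_size_kloc = 48.0           # thousands of lines of code delivered
    programming_time_months = 6.0   # effort spent

    # Derived (indirect) metric: a function of other metrics
    productivity = code_size_kloc / programming_time_months
    print(f"Productivity: {productivity:.1f} KLOC per person-month")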

3. The important characteristics of a data warehouse are:
Time dependent: That is, containing information collected over time, which implies there must always be a connection between the information in the warehouse and the time when it was entered. This is one of the most important aspects of the warehouse as it relates to data mining, because information can then be sourced according to period.
Non-volatile: That is, data in a data warehouse is never updated but used only for queries. Thus such data can only be loaded from other databases, such as the operational database. End-users who want to update must use the operational database, as only the latter can be updated, changed or deleted. This means that a data warehouse will always be filled with historical data (see the sketch after the feature list below).
Subject oriented: That is, built around all the existing applications of the operational data. Not all the information in the operational database is useful for a data warehouse, since the data warehouse is designed specifically for decision support while the operational database contains information for day-to-day use.
Integrated: That is, it reflects the business information of the organization. In an operational data environment we will find many types of information being used in a variety of applications, and some applications will be using different names for the same entities. However, in a data warehouse it is essential to integrate this information and make it consistent: only one name must exist to describe each individual entity.
The following are the features of a data warehouse:
• A scalable information architecture that will allow the information base to be extended and enhanced over time.
• Detailed analysis of member patterns including trading, delivery and funds payment.
• Fraud detection and sequence of event analysis.
• Ease of reporting on voluminous historical data.
• Provision for ad hoc queries and reporting facilities to enhance the efficiency of knowledge workers.
• Data mining to identify the co-relation between apparently independent entities.

4. Due to the principal role of data warehouses in making strategy decisions, data warehouse quality is crucial for organizations. The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing, resulting in:
• Reduced development and maintenance costs
• Improved software product quality
• Reduced project cycle time
• Increased customer satisfaction
• Improved staff morale, thanks to predictable results in stable conditions with less overtime, crisis and turnover.
Quality assurance means different things to different individuals. To some, QA means testing, but quality cannot be tested at the end of a project; it must be built in as the solution is conceived, evolves and is developed. To some, QA resources are the "process police" - nitpickers insisting on 100% compliance with a defined development process methodology. Rather, it is important to implement processes and controls that will really benefit the project. Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements. Testing is just one of those activities. In the typical software QA methodology the key tasks are:
• Articulate the development methodology for all to know
• Rigorously define and inspect the requirements
• Ensure that the requirements are testable
• Prioritize based on risk
• Create test plans
• Set up the test environment and data
• Execute test cases
• Document and manage defects and test results
• Gather metrics for management decisions
• Assess readiness to implement.
Quality assurance (QA) in a data warehouse/business intelligence environment is a challenging undertaking. For one thing, very little is written about business intelligence QA. Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions. However, data quality does not make for BI quality assurance, and practitioners within the software QA discipline focus almost exclusively on application development efforts. They do not seem to appreciate the unique aspects of quality assurance in a data warehouse/business intelligence environment. An effective software QA function should be ingrained within each DW/BI project. It should have the following characteristics:
• QA goals and objectives should be defined from the outset of the project
• The role of QA should be clearly defined within the project organization
• The QA role needs to be staffed with talented resources, well trained in the techniques needed to evaluate the data in the types of sources that will be used
• QA processes should be embedded to provide a self-monitoring update cycle
• QA activities are needed in the requirements, design, mapping and development project phases.
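
The caselet notes that the SQA process "relies on unit testing and functional testing". As a hedged illustration of what a unit test for a single warehouse transformation rule might look like, consider the sketch below; the transformation and its mapping are invented for the example.

    def standardize_gender(value: str) -> str:
        """Toy transformation rule: map source codes to a single warehouse convention."""
        mapping = {"m": "M", "male": "M", "f": "F", "female": "F"}
        return mapping.get(value.strip().lower(), "U")   # 'U' = unknown

    def test_standardize_gender():
        # Unit test: each source variant must map to the integrated warehouse value
        assert standardize_gender(" Male ") == "M"
        assert standardize_gender("f") == "F"
        assert standardize_gender("n/a") == "U"

    test_standardize_gender()
    print("transformation unit test passed")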

5. Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is often used in data mining. The chief component of OLAP is the OLAP server, which sits between a client and a Database Management System (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. There are OLAP servers available for nearly all the major database systems.
OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. Designed for managers looking to make sense of their information, OLAP tools structure data hierarchically - the way managers think of their enterprises - but also allow business analysts to rotate that data, changing the relationships to get more detailed insight into corporate information. OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer as well as front-end flexibility. Those are typically difficult features for any home-built system to achieve. The term 'on-line analytic processing' is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations. It is decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies.
The most common way to access a data mart or data warehouse is to run reports. Another very popular approach is to use OLAP tools. To compare different types of reporting and analysis interface, it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of use. Ad hoc queries, as the name suggests, are queries written by (or for) the end user as a one-off exercise. The only limitations are the capabilities of the reporting tool and the data available. Ad hoc reporting requires greater expertise, but need not involve programming, as most modern reporting tools are able to generate SQL. OLAP tools can be thought of as interactive reporting environments: they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports. They are excellent for exploring summarised data, and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details.
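
As a toy illustration of the "time series and trend analysis views" mentioned above, the following snippet pivots a handful of invented trade records into a month-by-security view - the kind of summary an OLAP tool would let a user rotate and drill into.

    from collections import defaultdict

    # Invented fact records: (month, security, traded_value)
    facts = [
        ("2008-08", "INFY", 120.0), ("2008-08", "TCS", 80.0),
        ("2008-09", "INFY", 150.0), ("2008-09", "TCS", 95.0),
        ("2008-10", "INFY", 90.0),  ("2008-10", "TCS", 110.0),
    ]

    # Pivot: rows = months (time dimension), columns = securities
    view = defaultdict(dict)
    for month, security, value in facts:
        view[month][security] = view[month].get(security, 0.0) + value

    for month in sorted(view):
        print(month, view[month])   # a simple time-series / trend view of the cube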

6. The usage of tools against a data warehouse can be classified into three broad categories:
i. Data dipping
ii. Data mining
iii. Data analysis

Data dipping tools
These are the basic business tools. They allow the generation of standard business reports. They can perform basic analysis, answering standard business questions. As these tools are relational, they can also be used as data browsers and generally have reasonable drill-down capabilities. Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business-friendly schema.

Data mining tools
These are specialist tools designed for finding trends and patterns in the underlying data. These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious. A data mining tool could be used to find common behavioral trends in a business's customers, or to root out market segments by grouping customers with common attributes.

Data analysis tools
These are used to perform complex analysis of data. They will normally have a rich set of analytic functions which allow sophisticated analysis of the data. These tools are designed for business analysis and will generally understand the common business metrics. Data analysis tools can again be subdivided into two categories: Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP). Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.

MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
• Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible, but in that case only summary-level information will be included in the cube itself.
• Requires additional investment: Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.

ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement (a short sketch of this follows the list of advantages and disadvantages below).
Advantages:
• Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: Often the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
• Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions, as well as the ability to allow users to define their own functions.
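
To make the point above concrete - that in ROLAP "each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement" - here is a small Python sketch that turns slice conditions into such a query. The table and column names are invented, and a real tool would use bind parameters rather than string interpolation.

    def rolap_query(measures, dimensions, slices):
        """Build a ROLAP-style aggregate query; each slice becomes a WHERE predicate."""
        select = ", ".join(dimensions + [f"SUM({m}) AS {m}" for m in measures])
        where = " AND ".join(f"{col} = '{val}'" for col, val in slices.items())
        group_by = ", ".join(dimensions)
        sql = f"SELECT {select} FROM trade_fact"
        if where:
            sql += f" WHERE {where}"        # slicing and dicing happens here
        return sql + f" GROUP BY {group_by}"

    # Slice the cube down to October 2008 and one branch (illustrative values)
    print(rolap_query(["traded_value"], ["security_code"],
                      {"trade_month": "2008-10", "branch": "Mumbai"}))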

Section C : Applied Theory

7. Neural networks: Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. In Freud's theory of psychodynamics the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10^11, connected to each other via a huge number of so-called synapses. A single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines, and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware, but most are just software programs that can operate on normal computers. Typically a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes. When using neural networks we have to distinguish between two stages - the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions or execute whatever learning task is involved.
There are several different forms of neural network, but we shall discuss only three of them here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing maps
In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks. Later, in 1969, Minsky and Papert showed that the class of problems that could be solved by a machine with a perceptron architecture was very limited.
It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcome these difficulties. A major improvement was the introduction of hidden layers in the so-called back propagation networks. A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes. In its initial stage a back propagation network has random weightings on its synapses. When we train the network, we expose it to a training set of input data. For each training instance, the actual output of the network is compared with the desired output that would give a correct answer; if there is a difference between the correct answer and the actual answer, the weightings of the individual nodes and synapses of the network are adjusted. This process is repeated until the responses are more or less accurate. Once the structure of the network stabilizes, the learning stage is over and the network is now trained and ready to categorize unknown input.
Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node. There are four additional nodes identifying the four areas, so that in this way each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is clear that this coding corresponds well with the information stored in the database. The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are wholly interconnected to the output nodes. In an untrained network the branches between the nodes have equal weights. During the training stage the network receives examples of input and output pairs corresponding to records in the database, and adapts the weights of the different branches until all the inputs match the appropriate outputs. In Figure 2 the network learns to recognize readers of the car magazine and comics. Figure 3 shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers. However, the networks do not provide a rule to identify this association.
Back propagation networks are a great improvement on the perceptron architecture. However, they also have disadvantages, one being that they need an extremely large training set. Another problem of neural networks is that, although they learn, they do not provide us with a theory about what they have learned - they are simply black boxes that give answers but provide no clear idea as to how they arrived at these answers.
In 1981 Teuvo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonen's self-organizing maps. These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain, such as visual maps, maps of the spatial possibilities of limbs and so on. A Kohonen self-organizing map is a collection of neurons or units, each of which is connected to a small number of other units called its neighbors. Most of the time the Kohonen map is two-dimensional; each node or unit contains a vector that is related to the space whose structure we are investigating. In its initial setting the self-organizing map has a random assignment of vectors to each unit. During the training stage these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the so-called Kohonen movie, which is a series of frames showing the positions of the vectors and their connections with neighboring cells. The network resembles an elastic surface that is pulled out over the sample space. Neural networks perform well on classification tasks and can be very useful in data mining.
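
The learning procedure described above (compare the actual output with the desired output and adjust the weightings) can be sketched for a single unit in a few lines of Python. This toy example learns the logical AND function; it illustrates the weight-update rule only, not the photo-receptor/associator/responder network of the original perceptron.

    # Toy perceptron unit: learns the logical AND of two binary inputs
    training_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

    weights = [0.0, 0.0]
    bias = 0.0
    learning_rate = 0.1

    def predict(x):
        activation = bias + sum(w * xi for w, xi in zip(weights, x))
        return 1 if activation > 0 else 0

    for _ in range(20):                      # training (encoding) stage
        for x, target in training_data:
            error = target - predict(x)      # compare actual output with desired output
            bias += learning_rate * error    # adjust the weightings when they differ
            for i in range(2):
                weights[i] += learning_rate * error * x[i]

    print([predict(x) for x, _ in training_data])   # decoding stage: [0, 0, 0, 1]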

Figure 1

Figure 2

Figure 3

8. The query manager has several distinct responsibilities. It is used to control the following:

• User access to the data
• Query scheduling
• Query monitoring
These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one of the most bespoke pieces of software in the data warehouse.
User access to the data: The query manager is the software interface between the users and the data. It presents the data to the users in a form they understand. It also controls user access to the data. In a data warehouse the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, raw data is often abstracted. Data in this raw format can often be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult. The query manager's task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will at one end take in the user's requirements and, in the background, using the metadata, transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is possible that not all tools will work directly via the query manager. If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly no large ad hoc queries should be allowed to be run by anyone other than the query manager. It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up. If queries do bypass the query manager, query statistics gathering will be less accurate.
Query scheduling: Scheduling of ad hoc queries is a responsibility of the query manager. Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system, in particular if the queries are run using parallelism, where a single query can potentially use all the CPU resource made available to it. One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.
Query monitoring: One of the main functions of the query manager is to monitor the queries as they run. This is one of the reasons why all queries should be run via, or at least notified to, the query manager. One of the keys to the successful usage of a data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, the query profiles of different groups of users need to be known. This can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used and the query syntax used. The query manager has to be capable of gathering these statistics, which should then be stored in the database for later analysis. It should also maintain a query history. Every query created or executed via the query manager should be logged. This allows query profiles to be built up over time and enables identification of frequently run queries or types of queries. These queries can then be tuned, possibly by adding new indexes or by creating new aggregations.
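
A minimal sketch of the statistics gathering described above: every query routed through the query manager is logged with its syntax, execution plan and elapsed time (the same three items listed in the answer to question 30). The schema and function names are invented for illustration, and SQLite stands in for the warehouse database.

    import time
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dual (x INTEGER)")
    conn.execute("""
        CREATE TABLE query_history (
            user_name  TEXT,
            query_text TEXT,     -- query syntax
            query_plan TEXT,     -- query execution plan
            elapsed_s  REAL      -- query elapsed time
        )
    """)

    def run_via_query_manager(user_name: str, sql: str):
        plan = "\n".join(str(row) for row in conn.execute("EXPLAIN QUERY PLAN " + sql))
        start = time.perf_counter()
        result = conn.execute(sql).fetchall()
        elapsed = time.perf_counter() - start
        conn.execute("INSERT INTO query_history VALUES (?, ?, ?, ?)",
                     (user_name, sql, plan, elapsed))
        return result

    run_via_query_manager("analyst1", "SELECT COUNT(*) FROM dual")
    print(conn.execute("SELECT user_name, elapsed_s FROM query_history").fetchall())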

lt TOP OF THE DOCUMENT gt

18

Page 2: 0810 Its-dwdm Mb3g1it

7 Which of the following statements isare true

I Snowflake schema is a variant of star schema where each dimension can have its own dimensions II Starflake schema is a logical structure that has a fact table in the center with dimension tables radiating

off of this central tableIII Star schema is a hybrid structure that contains a mix of Starflake and snowflake schemas

(a) Only (I) above(b) Only (II) above(c) Only (III) above(d) Both (I) and (II) above(e) All (I) (II) and (III) above

ltAnswergt

8 In database sizing what is the mathematical representation of lsquoTrsquo (temporary space) given lsquonrsquo (the number of concurrent queries allowed) and lsquoPrsquo (the size of the partition)

(a) T = (2n + 1)P

(b) T = P (2n + 1)

(c) T = (2n 1)P-

(d) T = (2n 1) P-

(e) T = P (2n 1)-

ltAnswergt

9 In data mining which of the following states a statistical correlation between the occurrence of certain attributes in a database table

(a) Association rules(b) Query tools(c) Visualization(d) Case-based learning(e) Genetic algorithms

ltAnswergt

10

In data mining learning tasks can be divided into

I Classification tasksII Knowledge engineering tasksIII Problem-solving tasks

(a) Only (I) above(b) Only (II) above(c) Only (III) above(d) Both (I) and (II) above(e) All (I) (II) and (III) above

ltAnswergt

11Which of the following statements are true about the types of knowledge

I Hidden knowledge is the information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL)

II Shallow knowledge is the data that can be found relatively easily by using pattern recognition or machine-learning algorithms

III Multi-dimensional knowledge is the information that can be analyzed using online analytical processing tools

IV Deep knowledge is the information that is stored in the database but can only be located if we have a clue that tells us where to look

(a) Both (I) and (II) above(b) Both (II) and (III) above(c) Both (III) and (IV) above(d) (I) (II) and (III) above(e) (II) (III) and (IV) above

ltAnswergt

2

12

There are some specific rules that govern the basic structure of a data warehouse namely that such a structure should be

I Time-independentII Non-volatileIII Subject orientedIV Integrated

(a) Both (I) and (II) above(b) Both (II) and (III) above(c) Both (III) and (IV) above(d) (I) (II) and (III) above(e) (II) (III) and (IV) above

ltAnswergt

13. Which of the following statements is/are true about Online Analytical Processing (OLAP)?
I. OLAP tools do not learn; they create new knowledge.
II. OLAP tools are more powerful than data mining.
III. OLAP tools cannot search for new solutions.
(a) Only (I) above
(b) Only (II) above
(c) Both (I) and (II) above
(d) Both (I) and (III) above
(e) Both (II) and (III) above

14. Auditing is a specific subset of security that is often mandated by organizations. As far as the data warehouse is concerned, the audit requirements can basically be categorized as:
I. Connections
II. Disconnections
III. Data access
IV. Data change
(a) Both (I) and (III) above
(b) Both (II) and (III) above
(c) (I), (III) and (IV) above
(d) (II), (III) and (IV) above
(e) All (I), (II), (III) and (IV) above

15. Which of the following produced the 'Alexandria' backup software package?
(a) HP
(b) Sequent
(c) IBM
(d) Epoch Systems
(e) Legato

16. Which of the following statements is/are true about aggregations in data warehousing?
I. Aggregations are performed in order to speed up common queries.
II. Too few aggregations will lead to unacceptable operational costs.
III. Too many aggregations will lead to an overall lack of system performance.
(a) Only (I) above
(b) Only (II) above
(c) Both (I) and (II) above
(d) Both (II) and (III) above
(e) All (I), (II) and (III) above

17. Which of the following statements is/are true about the basic levels of testing the data warehouse?
I. All unit testing should be complete before any test plan is enacted.
II. In integration testing, the separate development units that make up a component of the data warehouse application are tested to ensure that they work together.
III. In system testing, the whole data warehouse application is tested together.
(a) Only (I) above
(b) Only (II) above
(c) Both (I) and (II) above
(d) Both (II) and (III) above
(e) All (I), (II) and (III) above

18. To execute each SQL statement, the RDBMS uses an optimizer to calculate the best strategy for performing that statement. There are a number of different ways of calculating such a strategy, but we can categorize optimizers generally as either rule based or cost based. Which of the following statements is/are false?
I. A rule-based optimizer uses known rules to perform the function.
II. A cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement.
III. "Number of rows in the table" is generally collected by a rule-based optimizer.
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above

19. In parallel technology, which of the following statements is/are true?
I. Data shipping is where a process requests for the data to be shipped to the location where the process is running.
II. Function shipping is where the function to be performed is moved to the locale of the data.
III. Architectures which are designed for shared-nothing or distributed environments use data shipping exclusively. They can achieve parallelism as long as the data is partitioned or distributed correctly.
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above

20. Which of the following is/are the common restriction(s) that may apply to the handling of views?
I. Restricted Data Manipulation Language (DML) operations
II. Lost query optimization paths
III. Restrictions on parallel processing of view projections
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above

21. One petabyte is equal to:
(a) 1024 terabytes
(b) 1024 gigabytes
(c) 1024 megabytes
(d) 1024 kilobytes
(e) 1024 bytes

22. The formula for the construction of a genetic algorithm for the solution of a problem has the following steps. List the steps in order:
I. Invent an artificial environment in the computer where the solutions can join in battle with each other. Provide an objective rating to judge success or failure in professional terms, called a fitness function.
II. Develop ways in which possible solutions can be combined. Here the so-called cross-over operation, in which the father's and mother's strings are simply cut and, after changing, stuck together again, is very popular. In reproduction, all kinds of mutation operators can be applied.
III. Devise a good, elegant coding of the problem in terms of strings of a limited alphabet.
IV. Provide a well-varied initial population and make the computer play 'evolution' by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions. Stop when a family of successful solutions has been produced.
(a) I, II, III and IV
(b) I, III, II and IV
(c) III, I, II and IV
(d) II, III, I and IV
(e) III, II, I and IV

23. Which of the following produced the ADSTAR Distributed Storage Manager (ADSM) backup software package?
(a) HP
(b) Sequent
(c) IBM
(d) Epoch Systems
(e) Legato

24. Which of the following does not belong to the stages in the Knowledge Discovery Process?
(a) Data selection
(b) Data encapsulation
(c) Cleaning
(d) Coding
(e) Reporting

25. Which of the following is/are the applications of data mining?
I. Customer profiling
II. CAPTAINS
III. Reverse engineering
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above

26. Which of the following managers is not a part of the system managers in a data warehouse?
(a) Configuration manager
(b) Schedule manager
(c) Event manager
(d) Database manager
(e) Load manager

27. In data mining, a group of similar objects that differ significantly from other objects is known as:
(a) Filtering
(b) Clustering
(c) Coding
(d) Scattering
(e) Binding

28. A perceptron with a simple three-layered network has ____________ as input units.
(a) Photo-receptors
(b) Associators
(c) Responders
(d) Acceptors
(e) Rejectors

29. In which theory was the human brain described as a neural network?
(a) Shannon's communication theory
(b) Kolmogorov complexity theory
(c) Rissanen theory
(d) Freud's theory of psychodynamics
(e) Kohonen theory

30. Which of the following is/are the task(s) maintained by the query manager?
I. Query syntax
II. Query execution plan
III. Query elapsed time
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above

END OF SECTION A

Section B : Caselets (50 Marks)
• This section consists of questions with serial number 1 - 6.
• Answer all questions.
• Marks are indicated against each question.
• Detailed explanations should form part of your answer.
• Do not spend more than 110 - 120 minutes on Section B.

Caselet 1
Read the caselet carefully and answer the following questions:

1. "Braite selected Symphysis as the provider of choice to create a roadmap for the solution, develop a scalable, robust and user-friendly framework and deliver the product set." In this context, explain the data warehousing architecture. (10 marks)

2. If you are a project manager at Braite, what metrics will you consider that will help Braite in meeting its business goals for improved customer satisfaction and process improvement? Explain. (8 marks)

3. What might be the important characteristics of the proposed data warehouse? Also list the features of a data warehouse. (7 marks)

4. Do you think the Software Quality Assurance (SQA) process will play an important role in any data warehousing project? Explain. (10 marks)

Braite, a leading provider of software services to financial institutions, launched an initiative to enhance its application platform in order to provide better data analytics for its customers. Braite partnered with Symphysis to architect and build new Data Warehousing (DW) and Business Intelligence (BI) services for its Business Intelligence Center (BIC). Over time, Symphysis has become Braite's most strategic product development partner, using their global delivery model.

Braite faced real challenges in providing effective data analytics to its customers - it supported several complex data sources residing in multiple application layers within its products and faced challenges in implementing business rules and integrating data from disparate systems. These source systems included mainframe systems, Oracle databases and even Excel spreadsheets. Braite also faced several other challenges: it required manual data validation and data comparison processes; it required manual controls over the credit card creation process; its system design process suffered from unclear business requirements; and it supported multiple disparate data marts.

To address these challenges, Braite turned to an offshore IT services provider with DW/BI experience and deep expertise in healthcare benefits software. Braite selected Symphysis as the provider of choice to assess the feasibility of a DW/BI solution as well as to create a roadmap for the solution, develop a scalable, robust and user-friendly DW/BI framework and deliver the product set.

In this project, Symphysis designed and executed the DW/BI architecture that has become the cornerstone of Braite's data analytics service offerings, enhancing its status as a global leader. Business Intelligence (BI) services focus on helping clients in collecting and analyzing external and internal data to generate value for their organizations.

Symphysis successfully architected, built and tested the DW/BI solution with the key deliverables: created scripts to automate existing manual processes involving file comparison, validation, quality check and control card generation; introduced change and configuration management best practices, leading to standardization of Braite's delivery process; established robust Software Quality Assurance (SQA) processes to ensure high software quality (the SQA process relies on unit testing and functional testing, resulting in reduced effort for business analysts); and defined and collected metrics at various steps in the software development process (from development to integration testing) in order to improve process methodology and provide the ability to enter benchmarks or goals for performance tracking. There are several data warehouse project management metrics worth considering; these metrics have helped Braite to meet its business goals for improved customer satisfaction and process improvement.

In addition to full lifecycle product development, Symphysis also provides ongoing product support for Braite's existing applications.

Symphysis provided DW/BI solutions across a wide range of business functions, including sales and service, relationship value management, customer information systems, billing and online collections, operational data store, loans, deposits, voice recognition, custom history and ATM. Symphysis's data mart optimization framework enabled integration and consolidation of disparate data marts. The Symphysis SQA strategy improved delivery deadlines in terms of acceptance and integration. Symphysis provided DW process improvements, ETL checklists and standardization. Braite achieved cost savings of 40% by using Symphysis's onsite/offshore delivery model and a scalable architecture enabling new data warehouse and business intelligence applications.

END OF CASELET 1

Caselet 2
Read the caselet carefully and answer the following questions:

5. Critically analyze the functions of the tools that the chairman of Trilog Brokerage Services (TBS) decided to implement in order to increase the efficiency of the organization. (7 marks)

6. Discuss the classification of usage of tools against a data warehouse and also discuss the types of Online Analytical Processing (OLAP) tools. (8 marks)

Trilog Brokerage Services (TBS) is one of the oldest firms in India with a very strong customer base. Many of its customers have more than one security holding, and some even have more than 50 securities in their portfolios. It has become very difficult on the part of TBS to track and maintain which customer is selling/buying which security and the amounts they have to receive or the amounts they have to pay to TBS.

It has found that the information silos created are running contrary to the goal of the business intelligence organization architecture, which is to ensure enterprise-wide informational content to the broadest audience. By utilizing the information properly, it can enhance customer and supplier relationships, improve the profitability of products and services, create worthwhile new offerings, better manage risk and pare expenses dramatically, among many other gains. TBS was feeling that it required a category of software tools that help analyze data stored in its database and help users analyze different dimensions of the data, such as time series and trend analysis views.

The chairman of TBS felt that Online Analytical Processing (OLAP) was the need of the hour and decided to implement it immediately so that the processing part would be reduced significantly, thereby increasing the efficiency of the organization.

END OF CASELET 2

END OF SECTION B

Section C : Applied Theory (20 Marks)
• This section consists of questions with serial number 7 - 8.
• Answer all questions.
• Marks are indicated against each question.
• Do not spend more than 25 - 30 minutes on Section C.

7. What is a Neural Network? Discuss the various forms of Neural Networks. (10 marks)

8. Explain the various responsibilities of a Query manager. (10 marks)

END OF SECTION C

END OF QUESTION PAPER

Suggested Answers
Data Warehousing and Data Mining (MB3G1IT) : October 2008

Section A Basic Concepts

Answer Reason


1. D  The capacity plan for hardware and infrastructure is not determined in the business requirements stage; it is identified in the technical blueprint stage.

2. B  The warehouse manager is the system manager who performs backup and archiving of the data warehouse.

3. C  Stored procedure tools implement complex checking.

4. D  Vertical partitioning can take two forms: normalization and row splitting. Before using vertical partitioning, there should not be any requirement to perform major join operations between the two partitions. In order to maximize the hardware partitioning, maximize the processing power available.

5. A  A symmetric multi-processing machine is a set of tightly coupled CPUs that share memory and disk.

6. A  Redundant Array of Inexpensive Disks (RAID) Level 1 has full mirroring, with each disk duplexed.

7. A  Snowflake schema is a variant of star schema where each dimension can have its own dimensions; star schema is a logical structure that has a fact table in the center with dimension tables radiating off of this central table; starflake schema is a hybrid structure that contains a mix of star and snowflake schemas.

8. A  In database sizing, if n is the number of concurrent queries allowed and P is the size of the partition, then temporary space (T) is set to T = (2n + 1)P.
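As a purely illustrative Python sketch of this sizing rule (the function name and the gigabyte units are assumptions added for the example, not part of the original answer):

def temporary_space(n, partition_size_gb):
    """Temporary space T = (2n + 1) * P, where n is the number of
    concurrent queries allowed and P is the size of one partition."""
    return (2 * n + 1) * partition_size_gb

# Example: 4 concurrent queries against 10 GB partitions -> 90 GB of temporary space
print(temporary_space(4, 10))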

9. A  Association rules state a statistical correlation between the occurrence of certain attributes in a database table.
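A hedged illustration of how such a correlation is usually quantified (the transaction data below is invented for the example): the support and confidence of a rule such as "customers who buy bread also buy butter" can be computed directly from the table.

transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"butter", "jam"},
]

# Rule: bread -> butter
antecedent, consequent = {"bread"}, {"butter"}
both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)   # 2/4 = 0.50: how often both appear together
confidence = both / ante             # 2/3: how often butter appears when bread does
print(support, confidence)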

10. E  Learning tasks can be divided into: I. Classification tasks, II. Knowledge engineering tasks, III. Problem-solving tasks.

11. C  Shallow knowledge is the information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL). Hidden knowledge is the data that can be found relatively easily by using pattern recognition or machine-learning algorithms. Multi-dimensional knowledge is the information that can be analyzed using online analytical processing tools. Deep knowledge is the information that is stored in the database but can only be located if we have a clue that tells us where to look.

12. E  There are some specific rules that govern the basic structure of a data warehouse, namely that such a structure should be time dependent, non-volatile, subject oriented and integrated.

13. D  OLAP tools do not learn; they create new knowledge, and OLAP tools cannot search for new solutions. Data mining is more powerful than OLAP.

14. E  Auditing is a specific subset of security that is often mandated by organizations. As far as the data warehouse is concerned, the audit requirements can basically be categorized as: I. Connections, II. Disconnections, III. Data access, IV. Data change.

15. B  The Alexandria backup software package was produced by Sequent.

16. A  Aggregations are performed in order to speed up common queries; too many aggregations will lead to unacceptable operational costs, while too few aggregations will lead to an overall lack of system performance.

17. E  All unit testing should be complete before any test plan is enacted; in integration testing, the separate development units that make up a component of the data warehouse application are tested to ensure that they work together; in system testing, the whole data warehouse application is tested together.

18. C  A rule-based optimizer uses known rules to perform the function; a cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement. "Number of rows in the table" is generally collected by a cost-based optimizer.
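A toy Python sketch of the cost-based idea only (the cost figures and the two-path model are invented for illustration; real optimizers use far richer statistics): the optimizer compares the estimated cost of a full table scan with that of an index access, using the stored row count.

def choose_access_path(table_rows, matching_rows, rows_per_block=100):
    """Pick the cheaper access path from simple stored statistics."""
    full_scan_cost = table_rows / rows_per_block   # read every data block
    index_cost = matching_rows * 2                 # toy model: index probe + row fetch per match
    return "index scan" if index_cost < full_scan_cost else "full table scan"

print(choose_access_path(table_rows=1_000_000, matching_rows=50))       # -> index scan
print(choose_access_path(table_rows=1_000_000, matching_rows=400_000))  # -> full table scan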

19. D  Data shipping is where a process requests for the data to be shipped to the location where the process is running; function shipping is where the function to be performed is moved to the locale of the data. Architectures which are designed for shared-nothing or distributed environments use function shipping exclusively; they can achieve parallelism as long as the data is partitioned or distributed correctly.

20. E  The common restrictions that may apply to the handling of views are: I. Restricted Data Manipulation Language (DML) operations, II. Lost query optimization paths, III. Restrictions on parallel processing of view projections.

21. A  One petabyte is equal to 1024 terabytes.

22. C  The formula for the construction of a genetic algorithm for the solution of a problem has the following steps:
I. Devise a good, elegant coding of the problem in terms of strings of a limited alphabet.
II. Invent an artificial environment in the computer where the solutions can join in battle with each other. Provide an objective rating to judge success or failure in professional terms, called a fitness function.
III. Develop ways in which possible solutions can be combined. Here the so-called cross-over operation, in which the father's and mother's strings are simply cut and, after changing, stuck together again, is very popular. In reproduction, all kinds of mutation operators can be applied.
IV. Provide a well-varied initial population and make the computer play 'evolution' by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions. Stop when a family of successful solutions has been produced.
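A compact, hedged Python sketch of these four steps for a toy problem (maximize the number of 1s in a bit string); the population size, mutation rate and alphabet are assumptions made for the example, not part of the original answer.

import random

ALPHABET = "01"                           # step I: solutions coded as strings over a limited alphabet
LENGTH, POP_SIZE, GENERATIONS = 20, 30, 40

def fitness(s):                           # step II: fitness function that rates each solution
    return s.count("1")

def crossover(a, b):                      # step III: cut the parents' strings and stick them together
    cut = random.randint(1, LENGTH - 1)
    return a[:cut] + b[cut:]

def mutate(s, rate=0.02):                 # step III: mutation operators applied during reproduction
    return "".join(random.choice(ALPHABET) if random.random() < rate else c for c in s)

# Step IV: a well-varied initial population, then repeatedly remove the bad
# solutions and replace them with progeny or mutations of good solutions.
population = ["".join(random.choice(ALPHABET) for _ in range(LENGTH)) for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP_SIZE // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

print(max(population, key=fitness))       # best solution found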

23. C  The ADSM backup software package was produced by IBM.

24. B  Data encapsulation is not a stage in the knowledge discovery process.

25. E  Customer profiling, CAPTAINS and reverse engineering are applications of data mining.

26. E  Except for the load manager, all the other managers are part of the system managers in a data warehouse.

27. B  A group of similar objects that differ significantly from other objects is known as clustering.

28. A  A perceptron consists of a simple three-layered network with input units called photo-receptors.

29. D  In Freud's theory of psychodynamics, the human brain was described as a neural network.

30. E  The tasks maintained by the query manager are: I. Query syntax, II. Query execution plan, III. Query elapsed time.

Section B Caselets

1. Architecture of a data warehouse:

Load Manager Architecture
The architecture of a load manager is such that it performs the following operations:
1. Extract the data from the source system.
2. Fast-load the extracted data into a temporary data store.
3. Perform simple transformations into a structure similar to the one in the data warehouse.
(Figure: Load manager architecture)

Warehouse Manager Architecture
The architecture of a warehouse manager is such that it performs the following operations:
1. Analyze the data to perform consistency and referential integrity checks.
2. Transform and merge the source data in the temporary data store into the published data warehouse.
3. Create indexes, business views, partition views and business synonyms against the base data.
4. Generate denormalizations if appropriate.
5. Generate any new aggregations that may be required.
6. Update all existing aggregations.
7. Back up, incrementally or totally, the data within the data warehouse.
8. Archive data that has reached the end of its capture life.
In some cases the warehouse manager also analyzes query profiles to determine which indexes and aggregations are appropriate.
(Figure: Architecture of a warehouse manager)

Query Manager Architecture
The architecture of a query manager is such that it performs the following operations:
1. Direct queries to the appropriate table(s).
2. Schedule the execution of user queries.

The actual problem specified was the tight project schedule within which the solution had to be delivered. The field errors had to be reduced to a great extent, as the solution was for the company; the requirements needed to be defined very clearly, and there was a need for a scalable and reliable architecture and solution. A study was conducted on the company's current business information requirements and its current process of getting that information, and a business case was prepared for a data warehousing and business intelligence solution.
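As a purely illustrative sketch of the three load-manager operations listed above (the file name daily_sales.csv, the table names and the column names are assumptions, not details from the caselet):

import sqlite3, csv

def load_manager(csv_path, conn):
    """Sketch of the load manager: extract from a source file, fast-load into
    a temporary store, then apply a simple transformation toward the
    structure used in the warehouse."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS temp_sales (sale_date TEXT, amount TEXT)")
    with open(csv_path, newline="") as f:                              # 1. extract
        rows = [(r["sale_date"], r["amount"]) for r in csv.DictReader(f)]
    cur.executemany("INSERT INTO temp_sales VALUES (?, ?)", rows)      # 2. fast-load
    cur.execute("""CREATE TABLE stage_sales AS
                   SELECT sale_date, CAST(amount AS REAL) AS amount
                   FROM temp_sales""")                                 # 3. simple transformation
    conn.commit()

conn = sqlite3.connect(":memory:")
# load_manager("daily_sales.csv", conn)   # 'daily_sales.csv' is a hypothetical source file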

2. Metrics are essential in the assessment of software development quality. They may provide information about the development process itself and the yielded products. Metrics may be grouped into quality areas, which define a perspective for metrics interpretation. The adoption of a measurement program includes the definition of metrics that generate useful information. To do so, the organization's goals have to be defined and analyzed, along with what the metrics are expected to deliver. Metrics may be classified as direct and indirect. A direct metric is independent of the measurement of any other; indirect metrics, also referred to as derived metrics, represent functions upon other metrics, direct or derived. Productivity (code size / programming time) is an example of a derived metric. The existence of a timely and accurate capturing mechanism for direct metrics is critical in order to produce reliable results. Indicators establish the quality factors defined in a measurement program. Metrics also have a number of components and, for data warehousing, can be broken down in the following manner:
• Objects - the "themes" in the data warehouse environment which need to be assessed. Objects can include business drivers, warehouse contents, refresh processes, accesses and tools.
• Subjects - things in the data warehouse to which we assign numbers or a quantity. For example, subjects include the cost or value of a specific warehouse activity, access frequency, duration and utilization.
• Strata - a criterion for manipulating metric information. This might include day of the week, specific tables accessed, location, time or accesses by department.
These metric components may be combined to define an "application" which states how the information will be applied. For example, "When actual monthly refresh cost exceeds targeted monthly refresh cost, the value of each data collection in the warehouse must be re-established." There are several data warehouse project management metrics worth considering. The first three are:
• Business Return On Investment (ROI): The best metric to use is business return on investment. Is the business achieving bottom-line success (increased sales or decreased expenses) through the use of the data warehouse? This focus will encourage the development team to work backwards, to do the right things day in and day out for the ultimate arbiter of success - the bottom line.
• Data usage: The second best metric is data usage. You want to see the data warehouse used for its intended purposes by the target users. The objective here is increasing numbers of users and complexity of usage. With this focus, user statistics such as logins and query bands are tracked.
• Data gathering and availability: The third best data warehouse metric category is data gathering and availability. Under this focus the data warehouse team becomes an internal data brokerage, serving up data for the organization's consumption. Success is measured in the availability of the data, more or less according to a service level agreement. I would say to use these business metrics to gauge the success.
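For illustration only, a minimal Python sketch of two of the derived metrics mentioned above, productivity and business ROI; the figures passed in are invented for the example.

def derived_metrics(code_size_kloc, programming_hours, warehouse_benefit, warehouse_cost):
    """Productivity = code size / programming time; ROI = (benefit - cost) / cost."""
    productivity = code_size_kloc / programming_hours            # KLOC delivered per hour
    roi = (warehouse_benefit - warehouse_cost) / warehouse_cost  # business return on investment
    return productivity, roi

print(derived_metrics(12.5, 400, warehouse_benefit=900_000, warehouse_cost=600_000))
# -> (0.03125, 0.5), i.e. about 0.031 KLOC per hour and a 50% return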


3. The important characteristics of a data warehouse are:
Time dependent: That is, containing information collected over time, which implies there must always be a connection between information in the warehouse and the time when it was entered. This is one of the most important aspects of the warehouse as it relates to data mining, because information can then be sourced according to period.
Non-volatile: That is, data in a data warehouse is never updated but used only for queries. Thus such data can only be loaded from other databases, such as the operational database. End-users who want to update must use the operational database, as only the latter can be updated, changed or deleted. This means that a data warehouse will always be filled with historical data.
Subject oriented: That is, built around all the existing applications of the operational data. Not all the information in the operational database is useful for a data warehouse, since the data warehouse is designed specifically for decision support while the operational database contains information for day-to-day use.
Integrated: That is, it reflects the business information of the organization. In an operational data environment we will find many types of information being used in a variety of applications, and some applications will be using different names for the same entities. However, in a data warehouse it is essential to integrate this information and make it consistent: only one name must exist to describe each individual entity.

The following are the features of a data warehouse:
• A scalable information architecture that will allow the information base to be extended and enhanced over time.
• Detailed analysis of member patterns, including trading, delivery and funds payment.
• Fraud detection and sequence-of-event analysis.
• Ease of reporting on voluminous historical data.
• Provision for ad hoc queries and reporting facilities to enhance the efficiency of knowledge workers.
• Data mining to identify the correlation between apparently independent entities.

4. Due to the principal role of data warehouses in making strategy decisions, data warehouse quality is crucial for organizations. The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing, resulting in:
• Reduced development and maintenance costs.
• Improved software product quality.
• Reduced project cycle time.
• Increased customer satisfaction.
• Improved staff morale, thanks to predictable results in stable conditions with less overtime, crisis and turnover.
Quality assurance means different things to different individuals. To some, QA means testing, but quality cannot be tested in at the end of a project; it must be built in as the solution is conceived, evolves and is developed. To some, QA resources are the "process police" - nitpickers insisting on 100% compliance with a defined development process methodology. Rather, it is important to implement processes and controls that will really benefit the project. Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements; testing is just one of those activities. In the typical software QA methodology, the key tasks are:
• Articulate the development methodology for all to know.
• Rigorously define and inspect the requirements.
• Ensure that the requirements are testable.
• Prioritize based on risk.
• Create test plans.
• Set up the test environment and data.
• Execute test cases.
• Document and manage defects and test results.
• Gather metrics for management decisions.
• Assess readiness to implement.
Quality assurance (QA) in a data warehouse/business intelligence environment is a challenging undertaking. For one thing, very little is written about business intelligence QA. Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions. However, data quality does not make for BI quality assurance, and practitioners within the software QA discipline focus almost exclusively on application development efforts; they do not seem to appreciate the unique aspects of quality assurance in a data warehouse/business intelligence environment. An effective software QA effort should be ingrained within each DW/BI project. It should have the following characteristics:
• QA goals and objectives should be defined from the outset of the project.
• The role of QA should be clearly defined within the project organization.
• The QA role needs to be staffed with talented resources, well trained in the techniques needed to evaluate the data in the types of sources that will be used.
• QA processes should be embedded to provide a self-monitoring update cycle.
• QA activities are needed in the requirements, design, mapping and development project phases.
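Since the answer stresses testable requirements and executable test cases, here is a hedged sketch of one automated data-quality check of the kind a DW/BI SQA process might run; the rule, the sample rows and the field names are invented for the example.

def check_completeness(rows, required_fields):
    """Return the rows that break the completeness rule: every required
    field must be present and non-empty."""
    return [r for r in rows if any(not r.get(f) for f in required_fields)]

def test_no_missing_customer_keys():
    # 'staged_customers' stands in for data read from the staging area.
    staged_customers = [
        {"customer_id": "C001", "region": "North"},
        {"customer_id": "C002", "region": ""},
    ]
    bad = check_completeness(staged_customers, ["customer_id", "region"])
    assert bad == [], f"{len(bad)} row(s) failed the completeness rule"

if __name__ == "__main__":
    try:
        test_no_missing_customer_keys()
        print("data-quality check passed")
    except AssertionError as e:
        print("data-quality check failed:", e)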

5. Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is often used in data mining. The chief component of OLAP is the OLAP server, which sits between a client and a Database Management System (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. There are OLAP servers available for nearly all the major database systems. OLAP is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. Designed for managers looking to make sense of their information, OLAP tools structure data hierarchically - the way managers think of their enterprises - but also allow business analysts to rotate that data, changing the relationships to get more detailed insight into corporate information. OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer as well as front-end flexibility; those are typically difficult features for any home-built system to achieve. The term 'online analytic processing' is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations. OLAP is decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies. The most common way to access a data mart or data warehouse is to run reports; another very popular approach is to use OLAP tools. To compare different types of reporting and analysis interface, it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of use. Ad hoc queries, as the name suggests, are queries written by (or for) the end user as a one-off exercise; the only limitations are the capabilities of the reporting tool and the data available. Ad hoc reporting requires greater expertise, but need not involve programming, as most modern reporting tools are able to generate SQL. OLAP tools can be thought of as interactive reporting environments: they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports. They are excellent for exploring summarized data, and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details.
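A minimal sketch of the kind of two-dimensional view such tools produce, using pandas (assumed available); the quarters, securities and amounts are invented for a TBS-style analysis and are not taken from the caselet.

import pandas as pd

# Illustrative trade data: one row per trade.
trades = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q2"],
    "security": ["INFY", "TCS", "INFY", "TCS", "INFY"],
    "amount":   [1200, 800, 1500, 950, 700],
})

# A simple "cube" view: quarters down the side, securities across the top.
cube = pd.pivot_table(trades, values="amount", index="quarter",
                      columns="security", aggfunc="sum", fill_value=0)
print(cube)
print(cube.loc["Q2"])   # slice: one quarter across all securities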

6. The usage of tools against a data warehouse can be classified into three broad categories:
i. Data dipping
ii. Data mining
iii. Data analysis

Data dipping tools
These are the basic business tools. They allow the generation of standard business reports and can perform basic analysis, answering standard business questions. As these tools are relational, they can also be used as data browsers, and they generally have reasonable drill-down capabilities. Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business-friendly schema.

Data mining tools
These are specialist tools designed for finding trends and patterns in the underlying data. These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious. A data mining tool could be used to find common behavioral trends in a business's customers, or to root out market segments by grouping customers with common attributes.

Data analysis tools
These are used to perform complex analysis of data. They will normally have a rich set of analytic functions which allow sophisticated analysis of the data. These tools are designed for business analysis and will generally understand the common business metrics. Data analysis tools can again be subdivided into two categories: Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP). Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries, and it is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.

MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
• Can perform complex calculations: all calculations have been pre-generated when the cube is created, hence complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data - indeed this is possible - but in this case only summary-level information will be included in the cube itself.
• Requires additional investment: cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement (see the sketch after this list).
Advantages:
• Can handle large amounts of data: the data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: often the relational database already comes with a host of functionalities, and ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
• Limited by SQL functionalities: because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions, as well as the ability to allow users to define their own functions.
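As a hedged sketch of the point that each ROLAP slice amounts to an extra WHERE clause, the fragment below builds such a query; the table name sales_fact and the column names are assumptions for the example, and a real tool would use bind variables rather than plain string interpolation.

def rolap_slice(measure, dimensions, filters):
    """Build the SQL a ROLAP tool might issue for one slice of the data."""
    where = " AND ".join(f"{col} = '{val}'" for col, val in filters.items())
    group_by = ", ".join(dimensions)
    return (f"SELECT {group_by}, SUM({measure}) AS total "
            f"FROM sales_fact "
            f"WHERE {where} "
            f"GROUP BY {group_by}")

print(rolap_slice("amount", ["region", "quarter"], {"product": "Loans", "year": "2008"}))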

Section C Applied Theory

7. Neural networks: Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. In Freud's theory of psychodynamics the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10^11, connected to each other via a huge number of so-called synapses; a single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware, but most are just software programs that can operate on normal computers. Typically, a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes. When using neural networks we have to distinguish between two stages - the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions or execute whatever learning task is involved. There are several different forms of neural network, but we shall discuss only three of them here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing maps

In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks. Later, in 1969, Minsky and Papert showed that the class of problem that could be solved by a machine with a perceptron architecture was very limited. It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcome these difficulties. A major improvement was the introduction of hidden layers in the so-called back propagation networks. A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes. In its initial stage a back propagation network has random weightings on its synapses. When we train the network, we expose it to a training set of input data. For each training instance the actual output of the network is compared with the desired output that would give a correct answer; if there is a difference between the correct answer and the actual answer, the weightings of the individual nodes and synapses of the network are adjusted. This process is repeated until the responses are more or less accurate. Once the structure of the network stabilizes, the learning stage is over and the network is trained and ready to categorize unknown input. Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node. There are four additional nodes identifying the four areas, so that in this way each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is clear that this coding corresponds well with the information stored in the database. The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are wholly interconnected to the output nodes. In an untrained network the branches between the nodes have equal weights. During the training stage the network receives examples of input and output pairs corresponding to records in the database and adapts the weights of the different branches until all the inputs match the appropriate outputs. In Figure 2 the network learns to recognize readers of the car magazine and comics. Figure 3 shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers; however, the networks do not provide a rule to identify this association. Back propagation networks are a great improvement on the perceptron architecture. However, they also have disadvantages, one being that they need an extremely large training set. Another problem of neural networks is that, although they learn, they do not provide us with a theory about what they have learned - they are simply black boxes that give answers but provide no clear idea as to how they arrived at these answers. In 1981 Teuvo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonen's self-organizing maps. These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain, such as visual maps, maps of the spatial possibilities of limbs, and so on. A Kohonen self-organizing map is a collection of neurons or units, each of which is connected to a small number of other units called its neighbors. Most of the time the Kohonen map is two-dimensional; each node or unit contains a vector that is related to the space whose structure we are investigating. In its initial setting the self-organizing map has a random assignment of vectors to each unit. During the training stage these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the so-called Kohonen movie, which is a series of frames showing the positions of the vectors and their connections with neighboring cells. The network resembles an elastic surface that is pulled out over the sample space. Neural networks perform well on classification tasks and can be very useful in data mining.
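A minimal, hedged Python sketch of the perceptron learning rule described above, trained on a toy, linearly separable task (the logical AND of two inputs); the learning rate, epoch count and threshold form are assumptions made for the example.

def train_perceptron(samples, epochs=10, lr=0.1):
    """Classic perceptron rule: adjust the weights whenever the predicted
    output differs from the desired output."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            predicted = 1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0
            error = target - predicted
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

# Toy classification task: output 1 only when both inputs are 1.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(data)
print(weights, bias)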

Figure 1
Figure 2
Figure 3

8. The query manager has several distinct responsibilities. It is used to control the following:

• User access to the data
• Query scheduling
• Query monitoring
These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one of the most bespoke pieces of software in the data warehouse.
User access to the data: The query manager is the software interface between the users and the data. It presents the data to the users in a form they understand. It also controls the user access to the data. In a data warehouse the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, the raw data is often abstracted. Data in this raw format can be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult. The query manager's task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will, at one end, take in the user's requirements and, in the background, using the metadata, transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is possible that not all tools will work directly via the query manager. If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly, no large ad hoc queries should be allowed to be run by anyone other than the query manager. It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up. If queries do bypass the query manager, query statistics gathering will be less accurate.
Query scheduling: Scheduling of ad hoc queries is a responsibility of the query manager. Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system, in particular if the queries are run using parallelism, where a single query can potentially use all the CPU resource made available to it. One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.
Query monitoring: One of the main functions of the query manager is to monitor the queries as they run. This is one of the reasons why all queries should be run via, or at least notified to, the query manager. One of the keys to successful use of a data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, the query profiles of different groups of users need to be known. This can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used and the query syntax used. The query manager has to be capable of gathering these statistics, which should then be stored in the database for later analysis. It should also maintain a query history: every query created or executed via the query manager should be logged. This allows query profiles to be built up over time and enables identification of frequently run queries or types of queries. These queries can then be tuned, possibly by adding new indexes or by creating new aggregations.
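For illustration only, a minimal Python/SQLite sketch of the statistics gathering and query history described above; the table name query_history, its columns and the placeholder execution plan are assumptions, not a prescribed design.

import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE query_history (
                    user_name   TEXT,
                    query_text  TEXT,
                    exec_plan   TEXT,
                    elapsed_sec REAL,
                    run_at      TEXT)""")

def run_and_log(user, sql, plan="<captured execution plan>"):
    """Execute a query, then record its syntax, plan and elapsed time so
    that query profiles can be built up over time."""
    start = time.time()
    rows = conn.execute(sql).fetchall()
    elapsed = time.time() - start
    conn.execute("INSERT INTO query_history VALUES (?, ?, ?, ?, datetime('now'))",
                 (user, sql, plan, elapsed))
    return rows

run_and_log("analyst1", "SELECT 1")
print(conn.execute("SELECT user_name, elapsed_sec FROM query_history").fetchall())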

lt TOP OF THE DOCUMENT gt

18

Page 3: 0810 Its-dwdm Mb3g1it

12

There are some specific rules that govern the basic structure of a data warehouse namely that such a structure should be

I Time-independentII Non-volatileIII Subject orientedIV Integrated

(a) Both (I) and (II) above(b) Both (II) and (III) above(c) Both (III) and (IV) above(d) (I) (II) and (III) above(e) (II) (III) and (IV) above

ltAnswergt

13

Which of the following statements isare true about Online Analytical Processing (OLAP)

I OLAP tools do not learn they create new knowledgeII OLAP tools are more powerful than data miningIII OLAP tools cannot search for new solutions

(a) Only (I) above(b) Only (II) above(c) Both (I) and (II) above(d) Both (I) and (III) above(e) Both (II) and (III) above

ltAnswergt

14

Auditing is a specific subset of security that is often mandated by organizations As data warehouse is concerned the audit requirements can basically categorized as I ConnectionsII DisconnectionsIII Data accessIV Data change

(a) Both (I) and (III) above(b) Both (II) and (III) above(c) (I) (III) and (IV) above(d) (II) (III) and (IV) above(e) All (I) (II) (III) and (IV) above

ltAnswergt

15

Which of the following produced the lsquoAlexandriarsquo backup software package

(a) HP(b) Sequent(c) IBM(d) Epoch Systems(e) Legato

ltAnswergt

16

Which of the following statements isare true about aggregations in data warehousing

I Aggregations are performed in order to speed up common queriesII Too few aggregations will lead to unacceptable operational costsIII Too many aggregations will lead to an overall lack of system performance

(a) Only (I) above(b) Only (II) above (c) Both (I) and (II) above(d) Both (II) and (III) above(e) All (I) (II) and (III) above

ltAnswergt

3

17

Which of the following statements isare true about the basic levels of testing the data warehouse

I All unit testing should be complete before any test plan is enactedII In integration testing the separate development units that make up a component of the data warehouse

application are tested to ensure that they work togetherIII In system testing the whole data warehouse application is tested together

(a) Only (I) above(b) Only (II) above (c) Both (I) and (II) above(d) Both (II) and (III) above(e) All (I) (II) and (III) above

ltAnswergt

18

To execute each SQL statement the RDBMS uses an optimizer to calculate the best strategy for performing that statement There are a number of different ways of calculating such a strategy but we can categorize optimizers generally as either rule based or cost based Which of the following statements isare false

I A rule-based optimizer uses known rules to perform the functionII A cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best

strategy for executing the SQL statementIII ldquoNumber of rows in the tablerdquo is generally collected by rule-based optimizer

(a) Only (I) above(b) Only (II) above (c) Only (III) above(d) Both (I) and (II) above(e) All (I) (II) and (III) above

ltAnswergt

19

In parallel technology which of the following statements isare true

I Data shipping is where a process requests for the data to be shipped to the location where the process is running

II Function shipping is where the function to be performed is moved to the locale of the dataIII Architectures which are designed for shared-nothing or distributed environments use data shipping

exclusively They can achieve parallelism as long as the data is partitioned or distributed correctly

(a) Only (I) above(b) Only (II) above (c) Only (III) above(d) Both (I) and (II) above(e) All (I) (II) and (III) above

ltAnswergt

20

Which of the following isare the common restriction(s) that may apply to the handling of views

I Restricted Data Manipulation Language (DML) operationsII Lost query optimization pathsIII Restrictions on parallel processing of view projections

(a) Only (I) above(b) Only (II) above (c) Only (III) above(d) Both (I) and (II) above(e) All (I) (II) and (III) above

ltAnswergt

21

One petabyte is equal to

(a) 1024 terabytes(b) 1024 gigabytes(c) 1024 megabytes(d) 1024 kilobytes(e) 1024 bytes

ltAnswergt

4

22

The formula for the construction of a genetic algorithm for the solution of a problem has the following steps List the steps in the orderI Invent an artificial environment in the computer where the solutions can join in battle with each other

Provide an objective rating to judge success or failure in professional terms called a fitness functionII Develop ways in which possible solutions can be combined Here the so-called cross-over operation in

which the fatherrsquos and motherrsquos strings are simply cut and after changing stuck together again is very popular In reproduction all kinds of mutation operators can be applied

III Devise a good elegant coding of the problem in terms of strings of a limited alphabetIV Provide a well-varied initial population and make the computer play lsquoevolutionrsquo by removing the bad

solutions from each generation and replacing them with progeny or mutations of good solutions Stop when a family of successful solutions has been produced

(a) I II III and IV(b) I III II and IV(c) III I II and IV(d) II III I and IV(e) III II I and IV

ltAnswergt

23

Which of the following produced the ADSTAR Distributed Storage Manager (ADSM) backup software package

(a) HP(b) Sequent(c) IBM(d) Epoch Systems(e) Legato

ltAnswergt

24

Which of the following does not belongs to the stages in the Knowledge Discovery Process

(a) Data selection(b) Data encapsulation(c) Cleaning(d) Coding(e) Reporting

ltAnswergt

25

Which of the following isare the applications of data mining

I Customer profilingII CAPTAINSIII Reverse engineering

(a) Only (I) above(b) Only (II) above(c) Only (III) above(d) Both (I) and (II) above(e) All (I) (II) and (III) above

ltAnswergt

26

Which of the following managers are not a part of system managers in a data warehouse

(a) Configuration manager(b) Schedule manager(c) Event manager(d) Database manager(e) Load manager

ltAnswergt

27

In data mining group of similar objects that differ significantly from other objects is known as

(a) Filtering(b) Clustering(c) Coding(d) Scattering(e) Binding

ltAnswergt

5

28

A perceptron with simple three-layered network has ____________ as input units

(a) Photo-receptors(b) Associators(c) Responders(d) Acceptors(e) Rejectors

ltAnswergt

29

In which theory the human brain was described as a neural network

(a) Shannonrsquos communication theory(b) Kolmogorov complexity theory(c) Rissanen theory(d) Freudrsquos theory of psychodynamics(e) Kohonen theory

ltAnswergt

30

Which of the following isare the task(s) maintained by the query manager

I Query syntaxII Query execution planIII Query elapsed time

(a) Only (I) above(b) Only (II) above(c) Only (III) above(d) Both (I) and (II) above(e) All (I) (II) and (III) above

ltAnswergt

END OF SECTION A

Section B Caselets (50 Marks)bull This section consists of questions with serial number 1 ndash 6bull Answer all questions bull Marks are indicated against each questionbull Detailed explanations should form part of your answer bull Do not spend more than 110 - 120 minutes on Section B

Caselet 1Read the caselet carefully and answer the following questions

1 ldquoBraite selected Symphysis as the provider of choice to create a roadmap for the solution develop a scalable robust and user-friendly framework and deliver the product setrdquo In this context explain the data warehousing architecture (10 marks)

ltAnswergt

2 If you are a project manager at Braite what metrics you will consider which help Braite in meeting its business goals for improved customer satisfaction and process improvement Explain ( 8 marks)

ltAnswergt

3 What might be the important characteristics of the proposed data warehouse and also list the features of a data warehouse ( 7 marks)

ltAnswergt

4 Do you think Software Quality Assurance (SQA) process will play an important role in any data warehousing project Explain (10 marks)

ltAnswergt

Braite a leading provider of software services to financial institutions launched an initiative to enhance its application platform in order to provide better data analytics for its customers Braite partnered with Symphysis to architect and build new Data Warehousing (DW) and Business Intelligence (BI) services for its Business Intelligence Center (BIC) Over time Symphysis has become Braites most strategic product development partner using their global delivery model

Braite faced real challenges in providing effective data analytics to its customers - it supported several complex data sources residing in multiple application layers

6

within its products and faced challenges in implementing business rules and integrating data from disparate systems These source systems included mainframe systems Oracle databases and even Excel spreadsheets Braite also faced several other challenges it required manual data validation and data comparison processes it required manual controls over the credit card creation process its system design process suffered from unclear business requirements it supported multiple disparate data marts

To address these challenges Braite turned to an offshore IT services provider with DWBI experience and deep expertise in healthcare benefits software Braite selected Symphysis as the provider of choice to assess the feasibility of a DWBI solution as well as to create a roadmap for the solution develop a scalable robust and user-friendly DWBI framework and deliver the product set

In this project Symphysis designed and executed the DWBI architecture that has become the cornerstone of Braites data analytics service offerings enhancing its status as a global leader Business Intelligence (BI) services focus on helping clients in collecting and analyzing external and internal data to generate value for their organizations

Symphysis successfully architected built and tested the DWBI solution with the key deliverables created scripts to automate existing manual processes involving file comparison validation quality check and control card generation introduced change and configuration management best practices leading to standardization of Braites delivery process robust Software Quality Assurance (SQA) processes to ensure high software quality The SQA process relies on unit testing and functional testing resulting in reduced effort for business analysts defined and collected metrics at various steps in the software development process (from development to integration testing) in order to improve process methodology and provide the ability to enter benchmarks or goals for performance tracking There are several data warehouse project management metrics worth considering These metrics have helped Braite to meet its business goals for improved customer satisfaction and process improvement

In addition to full lifecycle product development, Symphysis also provides ongoing product support for Braite's existing applications.

Symphysis provided DW/BI solutions across a wide range of business functions, including sales and service, relationship value management, customer information systems, billing and online collections, operational data store, loans, deposits, voice recognition, custom history and ATM. Symphysis's data mart optimization framework enabled integration and consolidation of disparate data marts. The Symphysis SQA strategy improved delivery deadlines in terms of acceptance and integration. Symphysis provided DW process improvements, ETL checklists and standardization. Braite achieved cost savings of 40% by using Symphysis's onsite/offshore delivery model and a scalable architecture enabling new data warehouse and business intelligence applications.

END OF CASELET 1

Caselet 2
Read the caselet carefully and answer the following questions:

5. Critically analyze the functions of the tools that the chairman of Trilog Brokerage Services (TBS) decided to implement in order to increase the efficiency of the organization. (7 marks)


6. Discuss the classification of the usage of tools against a data warehouse, and also discuss the types of Online Analytical Processing (OLAP) tools. (8 marks)


Trilog Brokerage Services (TBS) is one of the oldest firms in India, with a very strong customer base. Many of its customers have more than one security holding, and some even have more than 50 securities in their portfolios. It has become very difficult on the part of TBS to track and maintain which customer is selling/buying which security and the amounts they have to receive or the amounts they have to pay to TBS.

It has found that the information silos created are running contrary to the goal of the business intelligence organization architecture, which is to ensure enterprise-wide informational content to the broadest audience. By utilizing the information properly, it can enhance customer and supplier relationships, improve the profitability of products and services, create worthwhile new offerings, better manage risk and pare expenses dramatically, among many other gains. TBS was feeling that it required a category of software tools that help analyze data stored in its database and help users analyze different dimensions of the data, such as time series and trend analysis views.

The chairman of TBS felt that Online Analytical Processing (OLAP) was the need of the hour and decided to implement it immediately, so that the processing part would be reduced significantly, thereby increasing the efficiency of the organization.

END OF CASELET 2

END OF SECTION B

Section C : Applied Theory (20 Marks)
• This section consists of questions with serial number 7 - 8.
• Answer all questions.
• Marks are indicated against each question.
• Do not spend more than 25 - 30 minutes on Section C.

7. What is a Neural Network? Discuss the various forms of Neural Networks. (10 marks)


8. Explain the various responsibilities of a Query manager. (10 marks)

END OF SECTION C

END OF QUESTION PAPER

Suggested Answers
Data Warehousing and Data Mining (MB3G1IT) : October 2008

Section A Basic Concepts

Answer Reason


1 D The capacity plan for hardware and infrastructure is not determined in the business requirements stage; it is identified in the technical blueprint stage.

2 B The warehouse manager is a system manager who performs backup and archiving of the data warehouse.

3 C Stored procedure tools implement complex checking.

4 D Vertical partitioning can take two forms: normalization and row splitting. Before using vertical partitioning, there should not be any requirement to perform major join operations between the two partitions. In order to maximize the hardware partitioning, maximize the processing power available.

5 A A symmetric multi-processing machine is a set of tightly coupled CPUs that share memory and disk.

6 A Redundant Array of Inexpensive Disks (RAID) Level 1 has full mirroring, with each disk duplexed.

7 A A snowflake schema is a variant of the star schema where each dimension can have its own dimensions. A star schema is a logical structure that has a fact table in the center with dimension tables radiating off this central table. A starflake schema is a hybrid structure that contains a mix of star and snowflake schemas.

8 A In database sizing, if n is the number of concurrent queries allowed and P is the size of the partition, then the temporary space (T) is set to T = (2n + 1)P.
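As a worked illustration (the figures here are assumed, not taken from the paper): with n = 4 concurrent queries and a partition size of P = 50 GB,
T = (2n + 1)P = (2 x 4 + 1) x 50 GB = 450 GB
of temporary space would be set aside.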


9 A Association rules state a statistical correlation between the occurrence of certain attributes in a database table.
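As a brief sketch (the transactions and items below are invented for illustration), the strength of such a rule is usually quantified by its support and confidence, which can be computed directly from the table:

# Hypothetical illustration: support and confidence of the rule {bread} -> {butter}
# computed over a small invented set of market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

antecedent, consequent = {"bread"}, {"butter"}

n_total = len(transactions)
n_antecedent = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / n_total            # fraction of rows containing both attributes
confidence = n_both / n_antecedent    # fraction of antecedent rows that also contain the consequent

print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # 0.50 and 0.67 here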


10 E Learning tasks can be divided into:
I. Classification tasks
II. Knowledge engineering tasks
III. Problem-solving tasks

11 C Shallow knowledge is the information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL). Hidden knowledge is the data that can be found relatively easily by using pattern recognition or machine-learning algorithms. Multi-dimensional knowledge is the information that can be analyzed using online analytical processing tools. Deep knowledge is the information that is stored in the database but can only be located if we have a clue that tells us where to look.

12 E There are some specific rules that govern the basic structure of a data warehouse, namely that such a structure should be: time dependent, non-volatile, subject oriented and integrated.

13 D OLAP tools do not learn and they create no new knowledge; OLAP tools cannot search for new solutions. Data mining is more powerful than OLAP.

14 E Auditing is a specific subset of security that is often mandated by organizations. As far as the data warehouse is concerned, the audit requirements can basically be categorized as:
I. Connections
II. Disconnections
III. Data access
IV. Data change

15 B The Alexandria backup software package was produced by Sequent.

16 A Aggregations are performed in order to speed up common queries; too many aggregations will lead to unacceptable operational costs, while too few aggregations will lead to an overall lack of system performance.

17 E All unit testing should be complete before any test plan is enacted. In integration testing, the separate development units that make up a component of the data warehouse application are tested to ensure that they work together. In system testing, the whole data warehouse application is tested together.

18 C A rule-based optimizer uses known rules to perform the function; a cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement. "Number of rows in the table" is generally collected by a cost-based optimizer.

19 D Data shipping is where a process requests for the data to be shipped to the location where the process is running; function shipping is where the function to be performed is moved to the locale of the data. Architectures which are designed for shared-nothing or distributed environments use function shipping exclusively. They can achieve parallelism as long as the data is partitioned or distributed correctly.

20 E The common restrictions that may apply to the handling of views are:
I. Restricted Data Manipulation Language (DML) operations
II. Lost query optimization paths
III. Restrictions on parallel processing of view projections

21 A One petabyte is equal to 1024 terabytes.

22 C The formula for the construction of a genetic algorithm for the solution of a problem has the following steps:
I. Devise a good, elegant coding of the problem in terms of strings of a limited alphabet.
II. Invent an artificial environment in the computer where the solutions can join in battle with each other. Provide an objective rating to judge success or failure, in professional terms called a fitness function.
III. Develop ways in which possible solutions can be combined. Here the so-called cross-over operation, in which the father's and mother's strings are simply cut and, after changing, stuck together again, is very popular. In reproduction, all kinds of mutation operators can be applied.
IV. Provide a well-varied initial population and make the computer play 'evolution' by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions. Stop when a family of successful solutions has been produced.
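A minimal Python sketch of these four steps follows; the toy problem (maximize the number of 1s in a 20-bit string), the population size and the mutation rate are invented purely for illustration and are not part of the suggested answer.

import random

# Minimal genetic-algorithm sketch following the four steps above.
LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.02

def fitness(individual):                      # step II: objective rating (fitness function)
    return sum(individual)

def crossover(father, mother):                # step III: cut the parents' strings and recombine
    cut = random.randint(1, LENGTH - 1)
    return father[:cut] + mother[cut:]

def mutate(individual):                       # step III: mutation operator
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in individual]

# Step I: coding of the problem as strings over the alphabet {0, 1};
# step IV: a well-varied initial population.
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):                  # step IV: play 'evolution'
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP_SIZE // 2]    # remove the bad solutions
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(POP_SIZE - len(survivors))]
    population = survivors + children         # replace them with progeny/mutations

print(max(fitness(ind) for ind in population))  # close to 20 once a good family of solutions emerges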


23 C The ADSM backup software package was produced by IBM.

24 B Data encapsulation is not a stage in the knowledge discovery process.

25 E Customer profiling, CAPTAINS and reverse engineering are applications of data mining.

26 E Except for the load manager, all the other managers listed are part of the system managers in a data warehouse.

27 B In data mining, grouping similar objects that differ significantly from other objects is known as clustering.

28 A A perceptron consists of a simple three-layered network with input units called photo-receptors.

29 D In Freud's theory of psychodynamics, the human brain was described as a neural network.

30 E The tasks maintained by the query manager are:
I. Query syntax
II. Query execution plan
III. Query elapsed time

Section B Caselets

1. Architecture of a data warehouse:

Load Manager Architecture
The architecture of a load manager is such that it performs the following operations:
1. Extract the data from the source system.
2. Fast-load the extracted data into a temporary data store.
3. Perform simple transformations into a structure similar to the one in the data warehouse.

(Figure: Load manager architecture)
Warehouse Manager Architecture
The architecture of a warehouse manager is such that it performs the following operations:
1. Analyze the data to perform consistency and referential integrity checks.
2. Transform and merge the source data in the temporary data store into the published data warehouse.
3. Create indexes, business views, partition views and business synonyms against the base data.
4. Generate denormalizations, if appropriate.
5. Generate any new aggregations that may be required.
6. Update all existing aggregations.
7. Back up, incrementally or totally, the data within the data warehouse.
8. Archive data that has reached the end of its capture life.
In some cases the warehouse manager also analyzes query profiles to determine which indexes and aggregations are appropriate.

(Figure: Architecture of a warehouse manager)
Query Manager Architecture
The architecture of a query manager is such that it performs the following operations:
1. Direct queries to the appropriate table(s).
2. Schedule the execution of user queries.
The actual problem specified was the tight project schedule within which the solution had to be delivered. The field errors had to be reduced to a great extent, as the solution was for the company. The requirements needed to be defined very clearly, and there was a need for a scalable and reliable architecture and solution. The study was conducted on the company's current business information requirements and the current process of getting that information, and a business case was prepared for a data warehousing and business intelligence solution.
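To make the division of labour concrete, a highly simplified Python sketch of the three managers is given below; the table names, fields and data are hypothetical stand-ins for a real source system and DBMS.

# Hypothetical sketch of the load / warehouse / query manager split described above.
def load_manager(source_rows):
    """Extract, fast-load into a temporary store and apply simple transformations."""
    staging = [dict(row) for row in source_rows]                          # extract + fast-load
    for row in staging:
        row["customer_name"] = row["customer_name"].strip().title()      # simple transformation
    return staging

def warehouse_manager(staging, warehouse):
    """Check consistency, merge into the published warehouse and maintain an aggregation."""
    clean = [r for r in staging if r.get("customer_id") is not None]     # consistency check
    warehouse["sales_fact"].extend(clean)                                # transform and merge
    warehouse["sales_by_customer"] = {}                                  # regenerate the aggregation
    for r in warehouse["sales_fact"]:
        warehouse["sales_by_customer"][r["customer_id"]] = (
            warehouse["sales_by_customer"].get(r["customer_id"], 0) + r["amount"])
    return warehouse

def query_manager(warehouse, customer_id):
    """Direct the query to the appropriate (aggregated) table."""
    return warehouse["sales_by_customer"].get(customer_id, 0)

warehouse = {"sales_fact": []}
staging = load_manager([{"customer_id": 1, "customer_name": " alice ", "amount": 120.0},
                        {"customer_id": 2, "customer_name": "BOB", "amount": 75.5}])
warehouse_manager(staging, warehouse)
print(query_manager(warehouse, 1))   # 120.0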


2. Metrics are essential in the assessment of software development quality. They may provide information about the development process itself and the yielded products. Metrics may be grouped into Quality Areas, which define a perspective for metrics interpretation. The adoption of a measurement program includes the definition of metrics that generate useful information. To do so, the organization's goals have to be defined and analyzed, along with what the metrics are expected to deliver. Metrics may be classified as direct and indirect. A direct metric is independent of the measurement of any other. Indirect metrics, also referred to as derived metrics, represent functions upon other metrics, direct or derived. Productivity (code size / programming time) is an example of a derived metric. The existence of a timely and accurate capturing mechanism for direct metrics is critical in order to produce reliable results. Indicators establish the quality factors defined in a measurement program. Metrics also have a number of components and, for data warehousing, can be broken down in the following manner:
Objects - the "themes" in the data warehouse environment which need to be assessed. Objects can include business drivers, warehouse contents, refresh processes, accesses and tools.
Subjects - things in the data warehouse to which we assign numbers or a quantity. For example, subjects include the cost or value of a specific warehouse activity, access frequency, duration and utilization.
Strata - a criterion for manipulating metric information. This might include day of the week, specific tables accessed, location, time or accesses by department.
These metric components may be combined to define an "application", which states how the information will be applied. For example: "When actual monthly refresh cost exceeds targeted monthly refresh cost, the value of each data collection in the warehouse must be re-established." There are several data warehouse project management metrics worth considering. The first three are given below (a small illustrative calculation follows the list):
• Business Return On Investment (ROI)

The best metric to use is business return on investment. Is the business achieving bottom-line success (increased sales or decreased expenses) through the use of the data warehouse? This focus will encourage the development team to work backwards, to do the right things day in and day out for the ultimate arbiter of success -- the bottom line.

• Data usage: The second best metric is data usage. You want to see the data warehouse used for its intended purposes by the target users. The objective here is increasing numbers of users and complexity of usage. With this focus, user statistics such as logins and query bands are tracked.

• Data gathering and availability: The third best data warehouse metric category is data gathering and availability. Under this focus, the data warehouse team becomes an internal data brokerage, serving up data for the organization's consumption. Success is measured in the availability of the data, more or less according to a service level agreement. I would say to use these business metrics to gauge the success.
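As a small illustration of the direct versus derived (indirect) metrics mentioned above, with all figures invented:

# Invented figures: direct metrics captured during the project ...
direct = {
    "code_size_kloc": 48.0,        # thousand lines of code delivered
    "programming_months": 16.0,    # effort spent
    "benefit": 900_000.0,          # increased sales / decreased expenses attributed to the DW
    "cost": 600_000.0,             # cost of building and running the warehouse
}

# ... and derived (indirect) metrics computed from them.
productivity = direct["code_size_kloc"] / direct["programming_months"]   # KLOC per person-month
roi = (direct["benefit"] - direct["cost"]) / direct["cost"]              # business ROI

print(f"productivity = {productivity:.1f} KLOC/month, ROI = {roi:.0%}")  # 3.0 KLOC/month, 50%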


3. The important characteristics of the data warehouse are:
Time dependent: That is, containing information collected over time, which implies there must always be a connection between information in the warehouse and the time when it was entered. This is one of the most important aspects of the warehouse as it relates to data mining, because information can then be sourced according to period.
Non-volatile: That is, data in a data warehouse is never updated but used only for queries. Thus such data can only be loaded from other databases, such as the operational database. End-users who want to update must use the operational database, as only the latter can be updated, changed or deleted. This means that a data warehouse will always be filled with historical data.
Subject oriented: That is, built around all the existing applications of the operational data. Not all the information in the operational database is useful for a data warehouse, since the data warehouse is designed specifically for decision support while the operational database contains information for day-to-day use.
Integrated: That is, it reflects the business information of the organization. In an operational data environment we will find many types of information being used in a variety of applications, and some applications will be using different names for the same entities. However, in a data warehouse it is essential to integrate this information and make it consistent: only one name must exist to describe each individual entity.
The following are the features of a data warehouse:
• A scalable information architecture that will allow the information base to be extended and enhanced over time.
• Detailed analysis of member patterns, including trading, delivery and funds payment.
• Fraud detection and sequence-of-event analysis.
• Ease of reporting on voluminous historical data.
• Provision for ad hoc queries and reporting facilities to enhance the efficiency of knowledge workers.
• Data mining to identify the correlation between apparently independent entities.

4. Due to the principal role of data warehouses in making strategy decisions, data warehouse quality is crucial for organizations. The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing, resulting in:
• Reduced development and maintenance costs
• Improved software product quality
• Reduced project cycle time
• Increased customer satisfaction
• Improved staff morale, thanks to predictable results in stable conditions with less overtime/crisis/turnover
Quality assurance means different things to different individuals. To some, QA means testing, but quality cannot be tested at the end of a project. It must be built in as the solution is conceived, evolves and is developed. To some, QA resources are the "process police" – nitpickers insisting on 100% compliance with a defined development process methodology. Rather, it is important to implement processes and controls that will really benefit the project. Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements. Testing is just one of those activities. In the typical software QA methodology, the key tasks are:
• Articulate the development methodology for all to know
• Rigorously define and inspect the requirements
• Ensure that the requirements are testable
• Prioritize based on risk
• Create test plans
• Set up the test environment and data
• Execute test cases
• Document and manage defects and test results
• Gather metrics for management decisions
• Assess readiness to implement
Quality assurance (QA) in a data warehouse/business intelligence environment is a challenging undertaking. For one thing, very little is written about business intelligence QA. Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions. However, data quality does not make for BI quality assurance, and practitioners within the software QA discipline focus almost exclusively on application development efforts. They do not seem to appreciate the unique aspects of quality assurance in a data warehouse/business intelligence environment. An effective software QA should be ingrained within each DW/BI project. It should have the following characteristics:
• QA goals and objectives should be defined from the outset of the project
• The role of QA should be clearly defined within the project organization
• The QA role needs to be staffed with talented resources, well trained in the techniques needed to evaluate the data in the types of sources that will be used
• QA processes should be embedded to provide a self-monitoring update cycle
• QA activities are needed in the requirements, design, mapping and development project phases

5. Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is often used in data mining. The chief component of OLAP is the OLAP server, which sits between a client and a Database Management System (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. There are OLAP servers available for nearly all the major database systems. OLAP is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. Designed for managers looking to make sense of their information, OLAP tools structure data hierarchically – the way managers think of their enterprises – but also allow business analysts to rotate that data, changing the relationships to get more detailed insight into corporate information. OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer as well as front-end flexibility. Those are typically difficult features for any home-built system to achieve. The term 'on-line analytic processing' is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations. OLAP is decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies. The most common way to access a data mart or data warehouse is to run reports; another very popular approach is to use OLAP tools. To compare different types of reporting and analysis interface, it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of use. Ad hoc queries, as the name suggests, are queries written by (or for) the end user as a one-off exercise. The only limitations are the capabilities of the reporting tool and the data available. Ad hoc reporting requires greater expertise, but need not involve programming, as most modern reporting tools are able to generate SQL. OLAP tools can be thought of as interactive reporting environments: they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports. They are excellent for exploring summarised data, and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details.
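As a rough illustration of the multidimensional, time-series views such tools provide (the data and column names are invented, and the pandas library merely stands in for an OLAP front end):

import pandas as pd

# Invented brokerage-style data; pandas stands in for an OLAP front end here.
df = pd.DataFrame({
    "year":    [2007, 2007, 2007, 2008, 2008, 2008],
    "quarter": ["Q1", "Q2", "Q3", "Q1", "Q2", "Q3"],
    "region":  ["North", "North", "South", "North", "South", "South"],
    "sales":   [120, 135, 90, 150, 110, 95],
})

# "Rotate" the data: years/quarters on one axis, regions on the other (a slice of a cube).
cube_view = df.pivot_table(values="sales", index=["year", "quarter"],
                           columns="region", aggfunc="sum", fill_value=0)
print(cube_view)

# Dicing: restrict to one region and look at the trend over time.
print(df[df["region"] == "North"].groupby(["year", "quarter"])["sales"].sum())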


6. The usage of tools against a data warehouse can be classified into three broad categories:
i. Data dipping
ii. Data mining
iii. Data analysis
Data dipping tools: These are the basic business tools. They allow the generation of standard business reports. They can perform basic analysis, answering standard business questions. As these tools are relational, they can also be used as data browsers and generally have reasonable drill-down capabilities. Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business-friendly schema.
Data mining tools: These are specialist tools designed for finding trends and patterns in the underlying data. These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious. A data mining tool could be used to find common behavioral trends in a business's customers, or to root out market segments by grouping customers with common attributes.
Data analysis tools: These are used to perform complex analysis of data. They will normally have a rich set of analytic functions, which allow sophisticated analysis of the data. These tools are designed for business analysis and will generally understand the common business metrics. Data analysis tools can again be subdivided into two categories: Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP). Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries, and it is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.
MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
• Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible. But in this case only summary-level information will be included in the cube itself.
• Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.
ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement.
Advantages:
• Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: Often the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
• Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions, as well as the ability to allow users to define their own functions.
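A tiny sketch of the point made above for ROLAP, namely that each slice or dice simply adds another predicate to the WHERE clause of a generated SQL statement; the table and column names are invented, and a real tool would parameterize the values rather than paste them into the string:

# Hypothetical ROLAP-style SQL generation: every slice/dice filter the user picks
# becomes another predicate in the WHERE clause of the generated statement.
def rolap_query(measure, dimensions, filters):
    select = ", ".join(dimensions + [f"SUM({measure}) AS total_{measure}"])
    where = " AND ".join(f"{col} = '{val}'" for col, val in filters.items())
    sql = f"SELECT {select} FROM sales_fact"
    if where:
        sql += f" WHERE {where}"
    return sql + " GROUP BY " + ", ".join(dimensions)

# Slicing on year, then dicing further on region, just appends predicates.
print(rolap_query("amount", ["product"], {"year": "2008"}))
print(rolap_query("amount", ["product"], {"year": "2008", "region": "North"}))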

Section C Applied Theory

7. Neural networks: Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. In Freud's theory of psychodynamics the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10^11, connected to each other via a huge number of so-called synapses. A single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware, but most are just software programs that can operate on normal computers. Typically, a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes. When using neural networks we have to distinguish between two stages - the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions or execute whatever learning task is involved. There are several different forms of neural network, but we shall discuss only three of them here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing maps
In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks. Later, in 1969, Minsky and Papert showed that the class of problems that could be solved by a machine with a perceptron architecture was very limited. It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcome these difficulties. A major improvement was the introduction of hidden layers in the so-called back propagation networks. A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes. In its initial stage, a back propagation network has random weightings on its synapses. When we train the network, we expose it to a training set of input data. For each training instance, the actual output of the network is compared with the desired output that would give a correct answer; if there is a difference between the correct answer and the actual answer, the weightings of the individual nodes and synapses of the network are adjusted. This process is repeated until the responses are more or less accurate. Once the structure of the network stabilizes, the learning stage is over and the network is now trained and ready to categorize unknown input.
Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node. There are four additional nodes identifying the four areas, so that in this way each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is clear that this coding corresponds well with the information stored in the database. The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are wholly interconnected to the output nodes. In an untrained network the branches between the nodes have equal weights. During the training stage the network receives examples of input and output pairs corresponding to records in the database, and adapts the weights of the different branches until all the inputs match the appropriate outputs. In Figure 2 the network learns to recognize readers of the car magazine and comics. Figure 3 shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers; however, the networks do not provide a rule to identify this association. Back propagation networks are a great improvement on the perceptron architecture. However, they also have disadvantages, one being that they need an extremely large training set. Another problem of neural networks is that, although they learn, they do not provide us with a theory about what they have learned - they are simply black boxes that give answers but provide no clear idea as to how they arrived at these answers.
In 1981 Teuvo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonen's self-organizing maps. These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain, such as visual maps, maps of the spatial possibilities of limbs, and so on. A Kohonen self-organizing map is a collection of neurons or units, each of which is connected to a small number of other units called its neighbors. Most of the time the Kohonen map is two-dimensional; each node or unit contains a factor that is related to the space whose structure we are investigating. In its initial setting the self-organizing map has a random assignment of vectors to each unit. During the training stage these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the so-called Kohonen movie, which is a series of frames showing the positions of the vectors and their connections with neighboring cells. The network resembles an elastic surface that is pulled out over the sample space. Neural networks perform well on classification tasks and can be very useful in data mining.
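A minimal Python sketch of the perceptron training procedure described above; the two-input AND task, the learning rate and the epoch count are invented for illustration and are unrelated to the marketing-database network of the figures.

import random

# Minimal perceptron sketch: learn the logical AND of two binary inputs.
# Weights are adjusted whenever the actual output differs from the desired output,
# mirroring the training procedure described above.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias, rate = [random.uniform(-1, 1) for _ in range(2)], 0.0, 0.1

def predict(x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

for _ in range(50):                         # encoding (training) stage
    for x, target in data:
        error = target - predict(x)
        weights = [w + rate * error * xi for w, xi in zip(weights, x)]
        bias += rate * error

print([predict(x) for x, _ in data])        # decoding stage: [0, 0, 0, 1] once trained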

Figure 1

Figure 2

Figure 3

8. The query manager has several distinct responsibilities. It is used to control the following:

• User access to the data
• Query scheduling
• Query monitoring
These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one of the most bespoke pieces of software in the data warehouse.
User access to the data: The query manager is the software interface between the users and the data. It presents the data to the users in a form they understand. It also controls the user access to the data. In a data warehouse the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, raw data is often abstracted. Data in this raw format can often be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult. The query manager's task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will at one end take in the user's requirements and, in the background, using the metadata, it will transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is possible that not all tools will work directly via the query manager. If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly no large ad hoc queries should be allowed to be run other than via the query manager. It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up. If queries do bypass the query manager, query statistics gathering will be less accurate.
Query scheduling: Scheduling of ad hoc queries is a responsibility of the query manager. Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system, in particular if the queries are run using parallelism, where a single query can potentially use all the CPU resource made available to it. One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.
Query monitoring: One of the main functions of the query manager is to monitor the queries as they run. This is one of the reasons why all queries should be run via, or at least notified to, the query manager. One of the keys to successful usage of a data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, the query profiles of different groups of users need to be known. This can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used and the query syntax used. The query manager has to be capable of gathering these statistics, which should then be stored in the database for later analysis. It should also maintain a query history: every query created or executed via the query manager should be logged. This allows query profiles to be built up over time and enables identification of frequently run queries or types of queries. These queries can then be tuned, possibly by adding new indexes or by creating new aggregations.
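A small sketch of the statistics-gathering side of this; the query text, execution plan and stand-in execute/explain functions are invented, and a real query manager would persist these records in the warehouse itself rather than in a list:

import time

# Hypothetical query-history logger covering the three items a query manager tracks:
# the query syntax, the execution plan and the elapsed time.
query_history = []

def run_via_query_manager(sql, execute, explain):
    start = time.time()
    result = execute(sql)                        # run the query
    query_history.append({
        "syntax": sql,                           # query syntax
        "plan": explain(sql),                    # query execution plan
        "elapsed_seconds": time.time() - start,  # query elapsed time
    })
    return result

# Stand-in execute/explain functions so the sketch is self-contained.
rows = run_via_query_manager(
    "SELECT region, SUM(amount) FROM sales_fact GROUP BY region",
    execute=lambda sql: [("North", 270), ("South", 205)],
    explain=lambda sql: "FULL SCAN sales_fact -> HASH GROUP BY region",
)
print(rows, query_history[0]["elapsed_seconds"] >= 0)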



1 ldquoBraite selected Symphysis as the provider of choice to create a roadmap for the solution develop a scalable robust and user-friendly framework and deliver the product setrdquo In this context explain the data warehousing architecture (10 marks)

ltAnswergt

2 If you are a project manager at Braite what metrics you will consider which help Braite in meeting its business goals for improved customer satisfaction and process improvement Explain ( 8 marks)

ltAnswergt

3 What might be the important characteristics of the proposed data warehouse and also list the features of a data warehouse ( 7 marks)

ltAnswergt

4 Do you think Software Quality Assurance (SQA) process will play an important role in any data warehousing project Explain (10 marks)

ltAnswergt

Braite a leading provider of software services to financial institutions launched an initiative to enhance its application platform in order to provide better data analytics for its customers Braite partnered with Symphysis to architect and build new Data Warehousing (DW) and Business Intelligence (BI) services for its Business Intelligence Center (BIC) Over time Symphysis has become Braites most strategic product development partner using their global delivery model

Braite faced real challenges in providing effective data analytics to its customers - it supported several complex data sources residing in multiple application layers

6

within its products and faced challenges in implementing business rules and integrating data from disparate systems These source systems included mainframe systems Oracle databases and even Excel spreadsheets Braite also faced several other challenges it required manual data validation and data comparison processes it required manual controls over the credit card creation process its system design process suffered from unclear business requirements it supported multiple disparate data marts

To address these challenges Braite turned to an offshore IT services provider with DWBI experience and deep expertise in healthcare benefits software Braite selected Symphysis as the provider of choice to assess the feasibility of a DWBI solution as well as to create a roadmap for the solution develop a scalable robust and user-friendly DWBI framework and deliver the product set

In this project Symphysis designed and executed the DWBI architecture that has become the cornerstone of Braites data analytics service offerings enhancing its status as a global leader Business Intelligence (BI) services focus on helping clients in collecting and analyzing external and internal data to generate value for their organizations

Symphysis successfully architected built and tested the DWBI solution with the key deliverables created scripts to automate existing manual processes involving file comparison validation quality check and control card generation introduced change and configuration management best practices leading to standardization of Braites delivery process robust Software Quality Assurance (SQA) processes to ensure high software quality The SQA process relies on unit testing and functional testing resulting in reduced effort for business analysts defined and collected metrics at various steps in the software development process (from development to integration testing) in order to improve process methodology and provide the ability to enter benchmarks or goals for performance tracking There are several data warehouse project management metrics worth considering These metrics have helped Braite to meet its business goals for improved customer satisfaction and process improvement

In addition to full lifecycle product development we also provide ongoing product support for Braites existing applications

Symphysis provided DWBI solutions across a wide range of business functions including sales and service relationship value management customer information systems billing and online collections operational data store loans deposits voice recognition custom history and ATM Symphysisrsquos data mart optimization framework enabled integration and consolidation of disparate data marts Symphysis SQA strategy improved delivery deadlines in terms of acceptance and integration Symphysis provided DW process improvements ETL checklists and standardization Braite achieved cost savings of 40 by using Symphysis onsiteoffshore delivery model and a scalable architecture enabling new data warehouse and business intelligence applications

END OF CASELET 1

Caselet 2Read the caselet carefully and answer the following questions

5 Critically analyze the functions of the tools that chairman of Trilog Brokerage Services (TBS) decided to implement in order to increase the efficiency of the organization ( 7 marks)

ltAnswergt

6 Discuss the classification of usage of tools against a data warehouse and also discuss about the types of Online Analytical Processing (OLAP) tools ( 8 marks)

ltAnswergt

Trilog Brokerage Services (TBS) is one of the oldest firms in India with a very strong customer base Many of its customers have more than one security holdings and some even have more than 50 securities in their portfolios And it has become very difficult on the part of TBS to track and maintain which customer is

7

sellingbuying which security and the amounts they have to receive or the amount they have to pay to TBS

It has found that information silos created are running contrary to the goal of the business intelligence organization architecture to ensure enterprise wide informational content to the broadest audience By utilizing the information properly it can enhance customer and supplier relationships improve the profitability of products and services create worthwhile new offerings better manage risk and pare expenses dramatically among many other gains TBS was feeling that it required a category of software tools that help analyze data stored in its database help users analyze different dimensions of the data such as time series and trend analysis views

The chairman of TBS felt that Online Analytical Processing (OLAP) was the need of the hour and decided to implement it immediately so that the processing part would be reduced significantly thereby increasing the efficiency of the organization

END OF CASELET 2

END OF SECTION B

Section C Applied Theory (20 Marks)bull This section consists of questions with serial number 7 - 8bull Answer all questions bull Marks are indicated against each questionbull Do not spend more than 25 - 30 minutes on Section C

7 What is Neural Network and discuss about various forms of Neural Networks ( 10 marks)

ltAnswergt

8 Explain the various responsibilities of a Query manager ( 10 marks)ltAnswergt

END OF SECTION C

END OF QUESTION PAPER

Suggested AnswersData Warehousing and Data Mining (MB3G1IT) October 2008

Section A Basic Concepts

Answer Reason

8

1 D The capacity plan for hardware and infrastructure is not determined in the business requirements stage it is identified in technical blueprint stage

lt TOP gt

2 B Warehouse manager is a system manager who performs backup and archiving the data warehouse

lt TOP gt

3 C Stored procedure tools implement Complex checking lt TOP gt

4 D Vertical partitioning can take two forms normalization and row splitting before using a vertical partitioning there should not be any requirements to perform major join operations between the two partitions in order to maximize the hardware partitioning maximize the processing power available

lt TOP gt

5 A Symmetric multi-processing machine is a set of tightly coupled CPUs that share memory and disk

lt TOP gt

6 A Redundant Array of Inexpensive Disks (RAID) Level 1 has full mirroring with each disk duplexed

lt TOP gt

7 A Snowflake schema is a variant of star schema where each dimension can have its own dimensions star schema is a logical structure that has a fact table in the center with dimension tables radiating off of this central table Starflake schema is a hybrid structure that contains a mix of star and snowflake schemas

lt TOP gt

8 A In database sizing if n is the number of concurrent queries allowed and P is the size

of the partition then temporary space (T) is set to T = (2n + 1)P

lt TOP gt

9 A Association rules that state a statistical correlation between the occurrence of certain attributes in a database table

lt TOP gt

10E Learning tasks can be divided intoI Classification tasksII Knowledge engineering tasksIII Problem-solving tasks

lt TOP gt

11C Shallow knowledge is the information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL) Hidden knowledge is the data that can be found relatively easily by using pattern recognition or machine-learning algorithms Multi-dimensional knowledge is the information that can be analyzed using online analytical processing tools Deep knowledge is the information that is stored in the database but can only be located if we have a clue that tells us where to look

lt TOP gt

12E There are some specific rules that govern the basic structure of a data warehouse namely that such a structure should be Time dependent Non-volatile Subject oriented Integrated

lt TOP gt

13D OLAP tools do not learn they create new knowledge and OLAP tools cannot search for new solutions data mining is more powerful than OLAP

lt TOP gt

14E Auditing is a specific subset of security that is often mandated by organizations As data warehouse is concerned the audit requirements can basically categorized asI ConnectionsII DisconnectionsIII Data accessIV Data change

lt TOP gt

15B Alexandria backup software package was produced by Sequent lt TOP gt

16A Aggregations are performed in order to speed up common queries too many aggregations will lead to unacceptable operational costs too few aggregations will lead to an overall lack of system performance

9

17E All unit testing should be complete before any test plan is enacted in integration testing the separate development units that make up a component of the data warehouse application are tested to ensure that they work together in system testing the whole data warehouse application is tested together

lt TOP gt

18C A rule-based optimizer uses known rules to perform the function a cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement ldquoNumber of rows in the tablerdquo is generally collected by cost-based optimizer

lt TOP gt

19D Data shipping is where a process requests for the data to be shipped to the location where the process is running function shipping is where the function to be performed is moved to the locale of the data architectures which are designed for shared-nothing or distributed environments use function shipping exclusively They can achieve parallelism as long as the data is partitioned or distributed correctly

lt TOP gt

20E The common restrictions that may apply to the handling of views areI Restricted Data Manipulation Language (DML) operationsII Lost query optimization pathsIII Restrictions on parallel processing of view projections

lt TOP gt

21A One petabyte is equal to 1024 terabytes lt TOP gt

22C The formula for the construction of a genetic algorithm for the solution of a problem has the following stepsI Devise a good elegant coding of the problem in terms of strings of a limited

alphabetII Invent an artificial environment in the computer where the solutions can join in

battle with each other Provide an objective rating to judge success or failure in professional terms called a fitness function

III Develop ways in which possible solutions can be combined Here the so-called cross-over operation in which the fatherrsquos and motherrsquos strings are simply cut and after changing stuck together again is very popular In reproduction all kinds of mutation operators can be applied

IV Provide a well-varied initial population and make the computer play lsquoevolutionrsquo by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions Stop when a family of successful solutions has been produced

lt TOP gt

23C ADSM backup software package was produced by IBM lt TOP gt

24B Data encapsulation is not a stage in the knowledge discovery process lt TOP gt

25E Customer profiling CAPTAINS and reverse engineering are applications of data mining

lt TOP gt

26E Except load manager all the other managers are part of system managers in a data warehouse

lt TOP gt

27B A group of similar objects that differ significantly from other objects is known as Clustering

lt TOP gt

28A A perceptron consists of a simple three-layered network with input units called Photo-receptors

lt TOP gt

29D In Freudrsquos theory of psychodynamics the human brain was described as a neural network

lt TOP gt

30E The tasks maintained by the query managerI Query syntaxII Query execution planIII Query elapsed time

lt TOP gt

Section B Caselets

1 Architecture of a data warehouse lt TOP

10

Load Manager ArchitectureThe architecture of a load manager is such that it performs the following operations1 Extract the data from the source system2 Fast-load the extracted data into a temporary data store3 Perform simple transformations into a structure similar to the one in the data warehouse

Load manager architectureWarehouse Manager ArchitectureThe architecture of a warehouse manager is such that it performs the following operations1 Analyze the data to perform consistency and referential integrity checks2 Transform and merge the source data in the temporary data store into the published data warehouse3 Create indexes business views partition views business synonyms against the base data4 Generate denormalizations if appropriate5 Generate any new aggregations that may be required6 Update all existing aggregation7 Back up incrementally or totally the data within the data warehouse8 Archive data that has reached the end of its capture lifeIn some cases the warehouse manager also analyzes query profiles to determine which indexes and aggregations are appropriate

Architecture of a warehouse managerQuery Manager ArchitectureThe architecture of a query manager is such that it performs the following operations1 Direct queries to the appropriate table(s)2 Schedule the execution of user queriesThe actual problem specified is tight project schedule within which it had to be delivered The field errors had to be reduced to a great extent as the solution was for the company The requirements needed to be defined very clearly and there was a need for a scalable and reliable architecture and solutionThe study had conducted on the companyrsquos current business information requirements current process of getting that information and prepared a business case for a data warehousing and business intelligence solution


2. Metrics are essential in the assessment of software development quality. They may provide information about the development process itself and the yielded products. Metrics may be grouped into Quality Areas, which define a perspective for metrics interpretation. The adoption of a measurement program includes the definition of metrics that generate useful information. To do so, the organization's goals have to be defined and analyzed, along with what the metrics are expected to deliver. Metrics may be classified as direct and indirect. A direct metric is independent of the measurement of any other. Indirect metrics, also referred to as derived metrics, represent functions upon other metrics, direct or derived. Productivity (code size / programming time) is an example of a derived metric. The existence of a timely and accurate capturing mechanism for direct metrics is critical in order to produce reliable results. Indicators establish the quality factors defined in a measurement program.
Metrics also have a number of components, and for data warehousing they can be broken down in the following manner:
• Objects - the "themes" in the data warehouse environment which need to be assessed. Objects can include business drivers, warehouse contents, refresh processes, accesses and tools.
• Subjects - things in the data warehouse to which we assign numbers or a quantity. For example, subjects include the cost or value of a specific warehouse activity, access frequency, duration and utilization.
• Strata - a criterion for manipulating metric information. This might include day of the week, specific tables accessed, location, time or accesses by department.
These metric components may be combined to define an "application" which states how the information will be applied. For example: "When actual monthly refresh cost exceeds targeted monthly refresh cost, the value of each data collection in the warehouse must be re-established." There are several data warehouse project management metrics worth considering. The first three are:
• Business Return On Investment (ROI): The best metric to use is business return on investment. Is the business achieving bottom-line success (increased sales or decreased expenses) through the use of the data warehouse? This focus will encourage the development team to work backwards, to do the right things day in and day out for the ultimate arbiter of success -- the bottom line.
• Data usage: The second best metric is data usage. You want to see the data warehouse used for its intended purposes by the target users. The objective here is increasing numbers of users and complexity of usage. With this focus, user statistics such as logins and query bands are tracked.
• Data gathering and availability: The third best data warehouse metric category is data gathering and availability. Under this focus, the data warehouse team becomes an internal data brokerage, serving up data for the organization's consumption. Success is measured in the availability of the data, more or less according to a service level agreement. These business metrics can be used to gauge the success.
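A small, hypothetical sketch of the ideas above: one derived (indirect) metric computed from two direct metrics, plus the example "application" rule about monthly refresh cost; all figures are made up.

    # Direct metrics, captured by some measurement mechanism (values are illustrative)
    code_size_kloc = 12.0
    programming_time_weeks = 8.0
    actual_refresh_cost = 5200.0      # this month's warehouse refresh cost
    target_refresh_cost = 4500.0

    # Indirect (derived) metric: productivity = code size / programming time
    productivity = code_size_kloc / programming_time_weeks
    print(f"productivity: {productivity:.2f} KLOC/week")

    # 'Application' combining objects, subjects and strata:
    # when actual monthly refresh cost exceeds the target, re-establish collection values
    if actual_refresh_cost > target_refresh_cost:
        print("Re-establish the value of each data collection in the warehouse")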


3. The important characteristics of a data warehouse are:
Time dependent: That is, containing information collected over time, which implies there must always be a connection between information in the warehouse and the time when it was entered. This is one of the most important aspects of the warehouse as it relates to data mining, because information can then be sourced according to period.
Non-volatile: That is, data in a data warehouse is never updated but used only for queries. Thus such data can only be loaded from other databases, such as the operational database. End-users who want to update must use the operational database, as only the latter can be updated, changed or deleted. This means that a data warehouse will always be filled with historical data.
Subject oriented: That is, built around all the existing applications of the operational data. Not all the information in the operational database is useful for a data warehouse, since the data warehouse is designed specifically for decision support, while the operational database contains information for day-to-day use.
Integrated: That is, it reflects the business information of the organization. In an operational data environment we will find many types of information being used in a variety of applications, and some applications will be using different names for the same entities.


However, in a data warehouse it is essential to integrate this information and make it consistent: only one name must exist to describe each individual entity.
The following are the features of a data warehouse (a schema sketch follows this list):
• A scalable information architecture that will allow the information base to be extended and enhanced over time.
• Detailed analysis of member patterns, including trading, delivery and funds payment.
• Fraud detection and sequence-of-event analysis.
• Ease of reporting on voluminous historical data.
• Provision for ad hoc queries and reporting facilities to enhance the efficiency of knowledge workers.
• Data mining to identify the correlation between apparently independent entities.
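As a loose illustration of the time-dependent, non-volatile and integrated characteristics described above, the following sketch defines a hypothetical, append-only fact table carrying both business time and load time; the schema is invented, not taken from the caselet.

    import sqlite3

    dw = sqlite3.connect(":memory:")
    # Time dependent: every row carries the period it describes and when it was loaded.
    # Integrated: one agreed name ('customer_id') even if source systems disagree.
    dw.execute("""CREATE TABLE account_fact (
                      customer_id   INTEGER,
                      balance       REAL,
                      as_of_date    TEXT,   -- business time
                      loaded_at     TEXT    -- warehouse load time
                  )""")
    # Non-volatile: the warehouse only ever appends history; updates happen in the
    # operational system and arrive here as new, timestamped rows.
    dw.execute("INSERT INTO account_fact VALUES (42, 100.0, '2008-09-30', '2008-10-01T02:00')")
    dw.execute("INSERT INTO account_fact VALUES (42, 125.0, '2008-10-31', '2008-11-01T02:00')")
    print(dw.execute("SELECT * FROM account_fact ORDER BY as_of_date").fetchall())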

4. Due to the principal role of data warehouses in making strategic decisions, data warehouse quality is crucial for organizations. The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing, resulting in:
• Reduced development and maintenance costs.
• Improved software product quality.
• Reduced project cycle time.
• Increased customer satisfaction.
• Improved staff morale, thanks to predictable results in stable conditions with less overtime, crisis and turnover.
Quality assurance means different things to different individuals. To some, QA means testing, but quality cannot be tested at the end of a project; it must be built in as the solution is conceived, evolves and is developed. To some, QA resources are the "process police" - nitpickers insisting on 100% compliance with a defined development process methodology. Rather, it is important to implement processes and controls that will really benefit the project. Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements. Testing is just one of those activities. In the typical software QA methodology the key tasks are:
• Articulate the development methodology for all to know.
• Rigorously define and inspect the requirements.
• Ensure that the requirements are testable.
• Prioritize based on risk.
• Create test plans.
• Set up the test environment and data.
• Execute test cases (a test sketch follows this answer).
• Document and manage defects and test results.
• Gather metrics for management decisions.
• Assess readiness to implement.
Quality assurance (QA) in a data warehouse/business intelligence environment is a challenging undertaking. For one thing, very little is written about business intelligence QA. Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions; however, data quality does not make for BI quality assurance. Practitioners within the software QA discipline focus almost exclusively on application development efforts, and they do not seem to appreciate the unique aspects of quality assurance in a data warehouse/business intelligence environment. An effective software QA function should be ingrained within each DW/BI project. It should have the following characteristics:
• QA goals and objectives should be defined from the outset of the project.
• The role of QA should be clearly defined within the project organization.
• The QA role needs to be staffed with talented resources, well trained in the techniques needed to evaluate the data in the types of sources that will be used.
• QA processes should be embedded to provide a self-monitoring update cycle.
• QA activities are needed in the requirements, design, mapping and development project phases.
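To make the "execute test cases" task concrete for a DW/BI project, here is a hedged sketch of two automated data checks (row-count reconciliation and referential integrity) against a hypothetical staging schema; a real project would run such checks from its test framework.

    import sqlite3

    dw = sqlite3.connect(":memory:")
    dw.executescript("""
        CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY);
        CREATE TABLE sales_fact   (sale_id INTEGER, customer_id INTEGER, amount REAL);
        INSERT INTO customer_dim VALUES (1), (2);
        INSERT INTO sales_fact   VALUES (10, 1, 9.99), (11, 2, 5.00);
    """)

    def test_row_count_reconciles(expected_source_rows=2):
        # functional test: everything extracted from the source arrived in the warehouse
        loaded = dw.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
        assert loaded == expected_source_rows, f"expected {expected_source_rows}, loaded {loaded}"

    def test_referential_integrity():
        # unit test: no fact row points at a missing dimension row
        orphans = dw.execute("""SELECT COUNT(*) FROM sales_fact f
                                LEFT JOIN customer_dim d USING (customer_id)
                                WHERE d.customer_id IS NULL""").fetchone()[0]
        assert orphans == 0, f"{orphans} orphaned fact rows"

    test_row_count_reconciles()
    test_referential_integrity()
    print("all data-quality checks passed")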

5. Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is often used in data mining. The chief component of OLAP is the OLAP server, which sits between a client and a Database Management System (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. There are OLAP servers available for nearly all the major database systems.
OLAP is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. Designed for managers looking to make sense of their information, OLAP tools structure data hierarchically - the way managers think of their enterprises - but also allow business analysts to rotate that data, changing the relationships to get more detailed insight into corporate information. OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer as well as front-end flexibility; those are typically difficult features for any home-built system to achieve. The term 'on-line analytic processing' is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations. OLAP is decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies.
The most common way to access a data mart or data warehouse is to run reports; another very popular approach is to use OLAP tools. To compare different types of reporting and analysis interface, it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of use. Ad hoc queries, as the name suggests, are queries written by (or for) the end user as a one-off exercise. The only limitations are the capabilities of the reporting tool and the data available. Ad hoc reporting requires greater expertise but need not involve programming, as most modern reporting tools are able to generate SQL. OLAP tools can be thought of as interactive reporting environments: they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports. They are excellent for exploring summarised data, and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details.
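A toy illustration of the slicing and dicing described above, holding a tiny three-dimensional cube as a plain dictionary keyed by (month, region, product); the dimensions and figures are invented.

    # A tiny three-dimensional "cube": (month, region, product) -> sales
    cube = {
        ("2008-09", "North", "Loans"):    120,
        ("2008-09", "South", "Loans"):     80,
        ("2008-10", "North", "Deposits"):  95,
        ("2008-10", "South", "Loans"):     60,
    }

    def slice_(cube, month):
        """Fix one dimension (a slice): everything for a given month."""
        return {k: v for k, v in cube.items() if k[0] == month}

    def dice(cube, months, regions):
        """Restrict several dimensions at once (a dice)."""
        return {k: v for k, v in cube.items() if k[0] in months and k[1] in regions}

    def rollup_by(cube, axis):
        """Aggregate away the other dimensions, e.g. totals per region."""
        totals = {}
        for key, value in cube.items():
            totals[key[axis]] = totals.get(key[axis], 0) + value
        return totals

    print(slice_(cube, "2008-10"))
    print(dice(cube, {"2008-09", "2008-10"}, {"North"}))
    print(rollup_by(cube, axis=1))   # sales rolled up by region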


6. The usage of tools against a data warehouse can be classified into three broad categories:
i.   Data dipping
ii.  Data mining
iii. Data analysis
Data dipping tools
These are the basic business tools. They allow the generation of standard business reports and can perform basic analysis, answering standard business questions. As these tools are relational, they can also be used as data browsers and generally have reasonable drill-down capabilities. Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business-friendly schema.
Data mining tools
These are specialist tools designed for finding trends and patterns in the underlying data. These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious. A data mining tool could be used to find common behavioral trends in a business's customers, or to root out market segments by grouping customers with common attributes.
Data analysis tools
These are used to perform complex analysis of data. They will normally have a rich set of analytic functions which allow sophisticated analysis of the data. These tools are designed for business analysis and will generally understand the common business metrics. Data analysis tools can again be subdivided into two categories: Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP).

Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries, and it is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.
MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
• Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible, but in that case only summary-level information will be included in the cube itself.
• Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement (a sketch of this follows the list below).
Advantages:
• Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: Often the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
• Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability to allow users to define their own functions.
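The ROLAP point that each slice or dice simply adds a WHERE clause can be shown with a small SQL-generation sketch; the star-schema table and column names are hypothetical.

    def build_rolap_query(measures, dimensions, filters):
        """Turn an OLAP-style request into a single SQL statement over a star schema."""
        select_list = ", ".join(dimensions + [f"SUM({m}) AS {m}" for m in measures])
        sql = f"SELECT {select_list} FROM sales_fact"
        params = []
        if filters:                                  # each slice/dice adds a WHERE predicate
            clauses = []
            for column, value in filters.items():
                clauses.append(f"{column} = ?")
                params.append(value)
            sql += " WHERE " + " AND ".join(clauses)
        sql += " GROUP BY " + ", ".join(dimensions)
        return sql, params

    sql, params = build_rolap_query(
        measures=["amount"],
        dimensions=["region", "product"],
        filters={"sale_month": "2008-10"},          # the 'dice' on the time dimension
    )
    print(sql)     # SELECT region, product, SUM(amount) AS amount FROM sales_fact WHERE ...
    print(params)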

Section C Applied Theory

7. Neural networks: Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. In Freud's theory of psychodynamics the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10^11, connected to each other via a huge number of so-called synapses. A single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines, and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware, but most are just software programs that can operate on normal computers. Typically a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes.


When using neural networks we have to distinguish between two stages: the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions or execute whatever learning task is involved. There are several different forms of neural network, but we shall discuss only three of them here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing maps
In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks. Later, in 1969, Minsky and Papert showed that the class of problems that could be solved by a machine with a perceptron architecture was very limited. It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcome these difficulties.
A major improvement was the introduction of hidden layers in the so-called back propagation networks. A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes. In its initial stage a back propagation network has random weightings on its synapses. When we train the network we expose it to a training set of input data. For each training instance the actual output of the network is compared with the desired output that would give a correct answer; if there is a difference between the correct answer and the actual answer, the weightings of the individual nodes and synapses of the network are adjusted. This process is repeated until the responses are more or less accurate. Once the structure of the network stabilizes, the learning stage is over and the network is trained and ready to categorize unknown input.
Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node. There are four additional nodes identifying the four areas, so that in this way each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is clear that this coding corresponds well with the information stored in the database. The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are wholly interconnected to the output nodes. In an untrained network the branches between the nodes have equal weights. During the training stage the network receives examples of input and output pairs corresponding to records in the database, and adapts the weights of the different branches until all the inputs match the appropriate outputs. In Figure 2 the network learns to recognize readers of the car magazine and comics. Figure 3 shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers; however, the networks do not provide a rule to identify this association.
Back propagation networks are a great improvement on the perceptron architecture. However, they also have disadvantages, one being that they need an extremely large training set. Another problem of neural networks is that, although they learn, they do not provide us with a theory about what they have learned - they are simply black boxes that give answers but provide no clear idea as to how they arrived at these answers. In 1981 Teuvo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonen's self-organizing maps. These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain, such as visual maps, maps of the spatial possibilities of limbs, and so on. A Kohonen self-organizing map is a collection of neurons or units, each of which is connected to a small number of other units called its neighbors. Most of the time the Kohonen map is two-dimensional; each node or unit contains a vector that is related to the space whose structure we are investigating. In its initial setting the self-organizing map has a random assignment of vectors to each unit. During the training stage these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the so-called Kohonen movie, which is a series of frames showing the positions of the vectors and their connections with neighboring cells. The network resembles an elastic surface that is pulled out over the sample space. Neural networks perform well on classification tasks and can be very useful in data mining.
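A compact sketch of the perceptron idea discussed above (back-propagation and Kohonen maps need considerably more machinery): it learns a yes/no classification from 0/1 input units using the classic perceptron learning rule, with invented training data loosely modelled on the magazine example.

    import random

    def train_perceptron(samples, epochs=20, lr=0.1):
        """samples: list of (inputs, target) with 0/1 inputs and a 0/1 target."""
        n = len(samples[0][0])
        weights = [random.uniform(-0.5, 0.5) for _ in range(n)]
        bias = 0.0
        for _ in range(epochs):
            for inputs, target in samples:
                activation = sum(w * x for w, x in zip(weights, inputs)) + bias
                output = 1 if activation > 0 else 0
                error = target - output                    # compare actual with desired output
                weights = [w + lr * error * x for w, x in zip(weights, inputs)]
                bias += lr * error                         # adjust weightings when they differ
        return weights, bias

    # inputs: [age under 30, owns house, owns car]; target: reads the car magazine (invented data)
    data = [([1, 0, 1], 1), ([0, 1, 1], 1), ([1, 1, 0], 0), ([0, 0, 0], 0)]
    weights, bias = train_perceptron(data)
    print(weights, bias)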

Figure 1

Figure 2

Figure 3

8. The query manager has several distinct responsibilities. It is used to control the following:

• User access to the data
• Query scheduling
• Query monitoring
These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one of the most bespoke pieces of software in the data warehouse.
User access to the data: The query manager is the software interface between the users and the data. It presents the data to the users in a form they understand. It also controls the user access to the data. In a data warehouse the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, the raw data is often abstracted. Data in this raw format can often be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult. The query manager's task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will at one end take in the user's requirements and, in the background, using the metadata, transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is possible that not all tools will work directly via the query manager. If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly no large ad hoc queries should be allowed to be run by anyone other than via the query manager. It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up. If queries do bypass the query manager, query statistics gathering will be less accurate.
Query scheduling: Scheduling of ad hoc queries is a responsibility of the query manager. Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system, in particular if the queries are run using parallelism, where a single query can potentially use all the CPU resource made available to it. One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.
Query monitoring: One of the main functions of the query manager is to monitor the queries as they run. This is one of the reasons why all queries should be run via, or at least notified to, the query manager. One of the keys to the success of a data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, the query profiles of different groups of users need to be known. This can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used and the query syntax used. The query manager has to be capable of gathering these statistics, which should then be stored in the database for later analysis. It should also maintain a query history: every query created or executed via the query manager should be logged. This allows query profiles to be built up over time and enables identification of frequently run queries or types of queries. These queries can then be tuned, possibly by adding new indexes or by creating new aggregations.
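A hedged sketch of the statistics-gathering duty described above: every query routed through the manager is logged with its syntax, execution plan and elapsed time so that query profiles can be built up later. SQLite's EXPLAIN QUERY PLAN stands in for a real optimizer's plan output, and the tables are hypothetical.

    import sqlite3, time

    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE sales_fact (region TEXT, amount REAL)")
    dw.execute("""CREATE TABLE query_history
                  (user TEXT, query_syntax TEXT, execution_plan TEXT, elapsed_seconds REAL)""")

    def run_via_query_manager(user, sql, params=()):
        plan = "; ".join(row[3] for row in dw.execute("EXPLAIN QUERY PLAN " + sql, params))
        start = time.perf_counter()
        rows = dw.execute(sql, params).fetchall()          # direct the query and execute it
        elapsed = time.perf_counter() - start
        dw.execute("INSERT INTO query_history VALUES (?, ?, ?, ?)",  # keep long-term statistics
                   (user, sql, plan, elapsed))
        return rows

    run_via_query_manager("analyst1", "SELECT region, SUM(amount) FROM sales_fact GROUP BY region")
    for row in dw.execute("SELECT user, elapsed_seconds FROM query_history"):
        print(row)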


The formula for the construction of a genetic algorithm for the solution of a problem has the following steps List the steps in the orderI Invent an artificial environment in the computer where the solutions can join in battle with each other

Provide an objective rating to judge success or failure in professional terms called a fitness functionII Develop ways in which possible solutions can be combined Here the so-called cross-over operation in

which the fatherrsquos and motherrsquos strings are simply cut and after changing stuck together again is very popular In reproduction all kinds of mutation operators can be applied

III Devise a good elegant coding of the problem in terms of strings of a limited alphabetIV Provide a well-varied initial population and make the computer play lsquoevolutionrsquo by removing the bad

solutions from each generation and replacing them with progeny or mutations of good solutions Stop when a family of successful solutions has been produced

(a) I II III and IV(b) I III II and IV(c) III I II and IV(d) II III I and IV(e) III II I and IV

ltAnswergt

23

Which of the following produced the ADSTAR Distributed Storage Manager (ADSM) backup software package

(a) HP(b) Sequent(c) IBM(d) Epoch Systems(e) Legato

ltAnswergt

24

Which of the following does not belongs to the stages in the Knowledge Discovery Process

(a) Data selection(b) Data encapsulation(c) Cleaning(d) Coding(e) Reporting

ltAnswergt

25

Which of the following isare the applications of data mining

I Customer profilingII CAPTAINSIII Reverse engineering

(a) Only (I) above(b) Only (II) above(c) Only (III) above(d) Both (I) and (II) above(e) All (I) (II) and (III) above

ltAnswergt

26

Which of the following managers are not a part of system managers in a data warehouse

(a) Configuration manager(b) Schedule manager(c) Event manager(d) Database manager(e) Load manager

ltAnswergt

27

In data mining group of similar objects that differ significantly from other objects is known as

(a) Filtering(b) Clustering(c) Coding(d) Scattering(e) Binding

ltAnswergt

5

28

A perceptron with simple three-layered network has ____________ as input units

(a) Photo-receptors(b) Associators(c) Responders(d) Acceptors(e) Rejectors

ltAnswergt

29

In which theory the human brain was described as a neural network

(a) Shannonrsquos communication theory(b) Kolmogorov complexity theory(c) Rissanen theory(d) Freudrsquos theory of psychodynamics(e) Kohonen theory

ltAnswergt

30

Which of the following isare the task(s) maintained by the query manager

I Query syntaxII Query execution planIII Query elapsed time

(a) Only (I) above(b) Only (II) above(c) Only (III) above(d) Both (I) and (II) above(e) All (I) (II) and (III) above

ltAnswergt

END OF SECTION A

Section B Caselets (50 Marks)bull This section consists of questions with serial number 1 ndash 6bull Answer all questions bull Marks are indicated against each questionbull Detailed explanations should form part of your answer bull Do not spend more than 110 - 120 minutes on Section B

Caselet 1Read the caselet carefully and answer the following questions

1 ldquoBraite selected Symphysis as the provider of choice to create a roadmap for the solution develop a scalable robust and user-friendly framework and deliver the product setrdquo In this context explain the data warehousing architecture (10 marks)

ltAnswergt

2 If you are a project manager at Braite what metrics you will consider which help Braite in meeting its business goals for improved customer satisfaction and process improvement Explain ( 8 marks)

ltAnswergt

3 What might be the important characteristics of the proposed data warehouse and also list the features of a data warehouse ( 7 marks)

ltAnswergt

4 Do you think Software Quality Assurance (SQA) process will play an important role in any data warehousing project Explain (10 marks)

ltAnswergt

Braite a leading provider of software services to financial institutions launched an initiative to enhance its application platform in order to provide better data analytics for its customers Braite partnered with Symphysis to architect and build new Data Warehousing (DW) and Business Intelligence (BI) services for its Business Intelligence Center (BIC) Over time Symphysis has become Braites most strategic product development partner using their global delivery model

Braite faced real challenges in providing effective data analytics to its customers - it supported several complex data sources residing in multiple application layers

6

within its products and faced challenges in implementing business rules and integrating data from disparate systems These source systems included mainframe systems Oracle databases and even Excel spreadsheets Braite also faced several other challenges it required manual data validation and data comparison processes it required manual controls over the credit card creation process its system design process suffered from unclear business requirements it supported multiple disparate data marts

To address these challenges Braite turned to an offshore IT services provider with DWBI experience and deep expertise in healthcare benefits software Braite selected Symphysis as the provider of choice to assess the feasibility of a DWBI solution as well as to create a roadmap for the solution develop a scalable robust and user-friendly DWBI framework and deliver the product set

In this project Symphysis designed and executed the DWBI architecture that has become the cornerstone of Braites data analytics service offerings enhancing its status as a global leader Business Intelligence (BI) services focus on helping clients in collecting and analyzing external and internal data to generate value for their organizations

Symphysis successfully architected built and tested the DWBI solution with the key deliverables created scripts to automate existing manual processes involving file comparison validation quality check and control card generation introduced change and configuration management best practices leading to standardization of Braites delivery process robust Software Quality Assurance (SQA) processes to ensure high software quality The SQA process relies on unit testing and functional testing resulting in reduced effort for business analysts defined and collected metrics at various steps in the software development process (from development to integration testing) in order to improve process methodology and provide the ability to enter benchmarks or goals for performance tracking There are several data warehouse project management metrics worth considering These metrics have helped Braite to meet its business goals for improved customer satisfaction and process improvement

In addition to full lifecycle product development we also provide ongoing product support for Braites existing applications

Symphysis provided DWBI solutions across a wide range of business functions including sales and service relationship value management customer information systems billing and online collections operational data store loans deposits voice recognition custom history and ATM Symphysisrsquos data mart optimization framework enabled integration and consolidation of disparate data marts Symphysis SQA strategy improved delivery deadlines in terms of acceptance and integration Symphysis provided DW process improvements ETL checklists and standardization Braite achieved cost savings of 40 by using Symphysis onsiteoffshore delivery model and a scalable architecture enabling new data warehouse and business intelligence applications

END OF CASELET 1

Caselet 2Read the caselet carefully and answer the following questions

5 Critically analyze the functions of the tools that chairman of Trilog Brokerage Services (TBS) decided to implement in order to increase the efficiency of the organization ( 7 marks)

ltAnswergt

6 Discuss the classification of usage of tools against a data warehouse and also discuss about the types of Online Analytical Processing (OLAP) tools ( 8 marks)

ltAnswergt

Trilog Brokerage Services (TBS) is one of the oldest firms in India with a very strong customer base Many of its customers have more than one security holdings and some even have more than 50 securities in their portfolios And it has become very difficult on the part of TBS to track and maintain which customer is

7

sellingbuying which security and the amounts they have to receive or the amount they have to pay to TBS

It has found that information silos created are running contrary to the goal of the business intelligence organization architecture to ensure enterprise wide informational content to the broadest audience By utilizing the information properly it can enhance customer and supplier relationships improve the profitability of products and services create worthwhile new offerings better manage risk and pare expenses dramatically among many other gains TBS was feeling that it required a category of software tools that help analyze data stored in its database help users analyze different dimensions of the data such as time series and trend analysis views

The chairman of TBS felt that Online Analytical Processing (OLAP) was the need of the hour and decided to implement it immediately so that the processing part would be reduced significantly thereby increasing the efficiency of the organization

END OF CASELET 2

END OF SECTION B

Section C Applied Theory (20 Marks)bull This section consists of questions with serial number 7 - 8bull Answer all questions bull Marks are indicated against each questionbull Do not spend more than 25 - 30 minutes on Section C

7 What is Neural Network and discuss about various forms of Neural Networks ( 10 marks)

ltAnswergt

8 Explain the various responsibilities of a Query manager ( 10 marks)ltAnswergt

END OF SECTION C

END OF QUESTION PAPER

Suggested AnswersData Warehousing and Data Mining (MB3G1IT) October 2008

Section A Basic Concepts

Answer Reason

8

1 D The capacity plan for hardware and infrastructure is not determined in the business requirements stage it is identified in technical blueprint stage

lt TOP gt

2 B Warehouse manager is a system manager who performs backup and archiving the data warehouse

lt TOP gt

3 C Stored procedure tools implement Complex checking lt TOP gt

4 D Vertical partitioning can take two forms normalization and row splitting before using a vertical partitioning there should not be any requirements to perform major join operations between the two partitions in order to maximize the hardware partitioning maximize the processing power available

lt TOP gt

5 A Symmetric multi-processing machine is a set of tightly coupled CPUs that share memory and disk

lt TOP gt

6 A Redundant Array of Inexpensive Disks (RAID) Level 1 has full mirroring with each disk duplexed

lt TOP gt

7 A Snowflake schema is a variant of star schema where each dimension can have its own dimensions star schema is a logical structure that has a fact table in the center with dimension tables radiating off of this central table Starflake schema is a hybrid structure that contains a mix of star and snowflake schemas

lt TOP gt

8 A In database sizing if n is the number of concurrent queries allowed and P is the size

of the partition then temporary space (T) is set to T = (2n + 1)P

lt TOP gt

9 A Association rules that state a statistical correlation between the occurrence of certain attributes in a database table

lt TOP gt

10E Learning tasks can be divided intoI Classification tasksII Knowledge engineering tasksIII Problem-solving tasks

lt TOP gt

11C Shallow knowledge is the information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL) Hidden knowledge is the data that can be found relatively easily by using pattern recognition or machine-learning algorithms Multi-dimensional knowledge is the information that can be analyzed using online analytical processing tools Deep knowledge is the information that is stored in the database but can only be located if we have a clue that tells us where to look

lt TOP gt

12E There are some specific rules that govern the basic structure of a data warehouse namely that such a structure should be Time dependent Non-volatile Subject oriented Integrated

lt TOP gt

13D OLAP tools do not learn they create new knowledge and OLAP tools cannot search for new solutions data mining is more powerful than OLAP

lt TOP gt

14E Auditing is a specific subset of security that is often mandated by organizations As data warehouse is concerned the audit requirements can basically categorized asI ConnectionsII DisconnectionsIII Data accessIV Data change

lt TOP gt

15B Alexandria backup software package was produced by Sequent lt TOP gt

16A Aggregations are performed in order to speed up common queries too many aggregations will lead to unacceptable operational costs too few aggregations will lead to an overall lack of system performance

9

17E All unit testing should be complete before any test plan is enacted in integration testing the separate development units that make up a component of the data warehouse application are tested to ensure that they work together in system testing the whole data warehouse application is tested together

lt TOP gt

18C A rule-based optimizer uses known rules to perform the function a cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement ldquoNumber of rows in the tablerdquo is generally collected by cost-based optimizer

lt TOP gt

19D Data shipping is where a process requests for the data to be shipped to the location where the process is running function shipping is where the function to be performed is moved to the locale of the data architectures which are designed for shared-nothing or distributed environments use function shipping exclusively They can achieve parallelism as long as the data is partitioned or distributed correctly

lt TOP gt

20E The common restrictions that may apply to the handling of views areI Restricted Data Manipulation Language (DML) operationsII Lost query optimization pathsIII Restrictions on parallel processing of view projections

lt TOP gt

21A One petabyte is equal to 1024 terabytes lt TOP gt

22C The formula for the construction of a genetic algorithm for the solution of a problem has the following stepsI Devise a good elegant coding of the problem in terms of strings of a limited

alphabetII Invent an artificial environment in the computer where the solutions can join in

battle with each other Provide an objective rating to judge success or failure in professional terms called a fitness function

III Develop ways in which possible solutions can be combined Here the so-called cross-over operation in which the fatherrsquos and motherrsquos strings are simply cut and after changing stuck together again is very popular In reproduction all kinds of mutation operators can be applied

IV Provide a well-varied initial population and make the computer play lsquoevolutionrsquo by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions Stop when a family of successful solutions has been produced

lt TOP gt

23C ADSM backup software package was produced by IBM lt TOP gt

24B Data encapsulation is not a stage in the knowledge discovery process lt TOP gt

25E Customer profiling CAPTAINS and reverse engineering are applications of data mining

lt TOP gt

26E Except load manager all the other managers are part of system managers in a data warehouse

lt TOP gt

27B A group of similar objects that differ significantly from other objects is known as Clustering

lt TOP gt

28A A perceptron consists of a simple three-layered network with input units called Photo-receptors

lt TOP gt

29D In Freudrsquos theory of psychodynamics the human brain was described as a neural network

lt TOP gt

30E The tasks maintained by the query managerI Query syntaxII Query execution planIII Query elapsed time

lt TOP gt

Section B Caselets

1 Architecture of a data warehouse lt TOP

10

Load Manager ArchitectureThe architecture of a load manager is such that it performs the following operations1 Extract the data from the source system2 Fast-load the extracted data into a temporary data store3 Perform simple transformations into a structure similar to the one in the data warehouse

Load manager architectureWarehouse Manager ArchitectureThe architecture of a warehouse manager is such that it performs the following operations1 Analyze the data to perform consistency and referential integrity checks2 Transform and merge the source data in the temporary data store into the published data warehouse3 Create indexes business views partition views business synonyms against the base data4 Generate denormalizations if appropriate5 Generate any new aggregations that may be required6 Update all existing aggregation7 Back up incrementally or totally the data within the data warehouse8 Archive data that has reached the end of its capture lifeIn some cases the warehouse manager also analyzes query profiles to determine which indexes and aggregations are appropriate

Architecture of a warehouse managerQuery Manager ArchitectureThe architecture of a query manager is such that it performs the following operations1 Direct queries to the appropriate table(s)2 Schedule the execution of user queriesThe actual problem specified is tight project schedule within which it had to be delivered The field errors had to be reduced to a great extent as the solution was for the company The requirements needed to be defined very clearly and there was a need for a scalable and reliable architecture and solutionThe study had conducted on the companyrsquos current business information requirements current process of getting that information and prepared a business case for a data warehousing and business intelligence solution

gt

11

2 Metrics are essential in the assessment of software development quality They may provide information about the development process itself and the yielded products Metrics may be grouped into Quality Areas which define a perspective for metrics interpretation The adoption of a measurement program includes the definition of metrics that generate useful information To do so organizationrsquos goals have to be defined and analyzed along with what the metrics are expected to deliver Metrics may be classified as direct and indirect A direct metric is independent of the measurement of any other Indirect metrics also referred to as derived metrics represent functions upon other metrics direct or derived Productivity (code size programming time) is an example of derived metric The existence of a timely and accurate capturing mechanism for direct metrics is critical in order to produce reliable results Indicators establish the quality factors defined in a measurement program Metrics also have a number of components and for data warehousing can be broken down in the following manner Objects - the ldquothemesrdquo in the data warehouse environment which need to be assessed Objects can include business drivers warehouse contents refresh processes accesses and tools Subjects - things in the data warehouse to which we assign numbers or a quantity For example subjects include the cost or value of a specific warehouse activity access frequency duration and utilization Strata - a criterion for manipulating metric information This might include day of the week specific tables accessed location time or accesses by departmentThese metric components may be combined to define an ldquoapplicationrdquo which states how the information will be applied For example ldquoWhen actual monthly refresh cost exceeds targeted monthly refresh cost the value of each data collection in the warehouse must be re-establishedrdquo There are several data warehouse project management metrics worth considering The first three arebull Business Return On Investment (ROI)

The best metric to use is business return on investment Is the business achieving bottom line success (increased sales or decreased expenses) through the use of the data warehouse This focus will encourage the development team to work backwards to do the right things day in and day out for the ultimate arbiter of success -- the bottom line

bull Data usage The second best metric is data usage You want to see the data warehouse used for its intended purposes by the target users The objective here is increasing numbers of users and complexity of usage With this focus user statistics such as logins and query bands are tracked

bull Data gathering and availability The third best data warehouse metric category is data gathering and availability Under this focus the data warehouse team becomes an internal data brokerage serving up data for the organizationrsquos consumption Success is measured in the availability of the data more or less according to a service level agreement I would say to use these business metrics to gauge the success

lt TOP gt

3 The important characteristics of data warehouse areTime dependent That is containing information collected over time which implies there must always be a connection between information in the warehouse and the time when it was entered This is one of the most important aspects of the warehouse as it related to data mining because information can then be sourced according to periodNon-volatile That is data in a data warehouse is never updated but used only for queries Thus such data can only be loaded from other databases such as the operational database End-users who want to update must use operational database as only the latter can be updated changed or deleted This means that a data warehouse will always be filled with historical dataSubject oriented That is built around all the existing applications of the operational data Not all the information in the operational database is useful for a data warehouse since the data warehouse is designed specifically for decision support while the operational database contains information for day-to-day useIntegrated That is it reflects the business information of the organization In an operational data environment we will find many types of information being used in a variety of applications and some applications will be using different names for the same entities

lt TOP gt

12

However in a data warehouse it is essential to integrate this information and make it consistent only one name must exist to describe each individual entityThe following are the features of a data warehousebull A scalable information architecture that will allow the information base to be extended and

enhanced over time bull Detailed analysis of member patterns including trading delivery and funds payment bull Fraud detection and sequence of event analysis bull Ease of reporting on voluminous historical data bull Provision for ad hoc queries and reporting facilities to enhance the efficiency of

knowledge workers bull Data mining to identify the co-relation between apparently independent entities

4 Due to the principal role of Data warehouses in making strategy decisions data warehouse quality is crucial for organizations The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing resulting in bull Reduced development and maintenance costsbull Improved software products qualitybull Reduced project cycle timebull Increased customer satisfactionbull Improved staff morale thanks to predictable results in stable conditions with less overtime

crisisturnoverQuality assurance means different things to different individuals To some QA means testing but quality cannot be tested at the end of a project It must be built in as the solution is conceived evolves and is developed To some QA resources are the ldquoprocess policerdquo ndash nitpickers insisting on 100 compliance with a defined development process methodology Rather it is important to implement processes and controls that will really benefit the project Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements Testing is just one of those activities In the typical software QA methodology the key tasks are bull Articulate the development methodology for all to knowbull Rigorously define and inspect the requirementsbull Ensure that the requirements are testablebull Prioritize based on riskbull Create test plansbull Set up the test environment and databull Execute test casesbull Document and manage defects and test resultsbull Gather metrics for management decisionsbull Assess readiness to implement Quality assurance (QA) in a data warehousebusiness intelligence environment is a challenging undertaking For one thing very little is written about business intelligence QA Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions However data quality does not make for BI quality assurance and practitioners within the software QA discipline focus almost exclusively on application development efforts They do not seem to appreciate the unique aspects of quality assurance in a data warehousebusiness intelligence environment An effective software QA should be ingrained within each DWBI project It should have the following characteristics bull QA goals and objectives should be defined from the outset of the projectbull The role of QA should be clearly defined within the project organizationbull The QA role needs to be staffed with talented resources well trained in the techniques

needed to evaluate the data in the types of sources that will be used

lt TOP gt

13

bull QA processes should be embedded to provide a self-monitoring update cyclebull QA activities are needed in the requirements design mapping and development project

phases

5 Online Analytical Processing (OLAP) a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of multidimensional data For example it provides time series and trend analysis views OLAP often is used in data mining The chief component of OLAP is the OLAP server which sits between a client and a Database Management Systems (DBMS) The OLAP server understands how data is organized in the database and has special functions for analyzing the data There are OLAP servers available for nearly all the major database systems OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view Designed for managers looking to make sense of their information OLAP tools structure data hierarchically ndash the way managers think of their enterprises but also allows business analysts to rotate that data changing the relationships to get more detailed insight into corporate information OLAP tools are geared towards slicing and dicing of the data As such they require a strong metadata layer as well as front-end flexibility Those are typically difficult features for any home-built systems to achieve The term lsquoon-line analytic processingrsquo is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations Decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies The most common way to access a data mart or data warehouse is to run reports Another very popular approach is to use OLAP tools To compare different types of reporting and analysis interface it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of useAd hoc queries as the name suggests are queries written by (or for) the end user as a one-off exercise The only limitations are the capabilities of the reporting tool and the data available Ad hoc reporting requires greater expertise but need not involve programming as most modern reporting tools are able to generate SQL OLAP tools can be thought of as interactive reporting environments they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports They are excellent for exploring summarised data and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details

lt TOP gt

6. The usage of tools against a data warehouse can be classified into three broad categories:
i. Data dipping
ii. Data mining
iii. Data analysis

Data dipping tools
These are the basic business tools. They allow the generation of standard business reports and can perform basic analysis, answering standard business questions. As these tools are relational, they can also be used as data browsers and generally have reasonable drill-down capabilities. Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business-friendly schema.

Data mining tools
These are specialist tools designed for finding trends and patterns in the underlying data. These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious. A data mining tool could be used to find common behavioral trends in a business's customers, or to root out market segments by grouping customers with common attributes.

Data analysis tools
These are used to perform complex analysis of data. They will normally have a rich set of analytic functions which allow sophisticated analysis of the data. These tools are designed for business analysis and will generally understand the common business metrics. Data analysis tools can again be subdivided into two categories: Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP). Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries, and it is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.

MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
• Can perform complex calculations: all calculations have been pre-generated when the cube is created. Hence complex calculations are not only doable, they also return quickly.
Disadvantages:
• Limited in the amount of data it can handle: because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data - indeed, this is possible - but in that case only summary-level information will be included in the cube itself.
• Requires additional investment: cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.

ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a WHERE clause to the SQL statement.
Advantages:
• Can handle large amounts of data: the data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: the relational database often already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: because each ROLAP report is essentially a SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large.
• Limited by SQL functionalities: because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability to allow users to define their own functions.
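As a rough illustration of the point that each ROLAP slice-and-dice action amounts to adding a WHERE clause, here is a minimal sketch of how a ROLAP layer might translate a slice request into SQL. The table, column and function names are assumptions made for the example, and a real tool would use bind variables rather than string interpolation.

```python
def build_rolap_query(fact_table, measure, group_by, slices):
    """Translate a slice/dice request into a SQL statement.

    Each entry in `slices` becomes one WHERE predicate, which is how a
    ROLAP engine maps cube operations onto the relational store.
    """
    where = " AND ".join(f"{col} = '{val}'" for col, val in slices.items())
    group = ", ".join(group_by)
    return (f"SELECT {group}, SUM({measure}) AS total "
            f"FROM {fact_table} "
            f"WHERE {where} "
            f"GROUP BY {group}")

# Hypothetical request: dice by product and quarter, sliced to the North region.
print(build_rolap_query("sales_fact", "amount",
                        ["product", "quarter"],
                        {"region": "North"}))
```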

Section C Applied Theory

7. Neural networks: Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. In Freud's theory of psychodynamics the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10^11, connected to each other via a huge number of so-called synapses; a single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines, and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware, but most are just software programs that can operate on normal computers. Typically, a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes. When using neural networks we have to distinguish between two stages: the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions or execute whatever learning task is involved.

There are several different forms of neural network, but we shall discuss only three of them here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing maps

In 1958, Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks. Later, in 1969, Minsky and Papert showed that the class of problems that could be solved by a machine with a perceptron architecture was very limited.

It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcome these difficulties. A major improvement was the introduction of hidden layers in the so-called back propagation networks. A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes. In its initial stage, a back propagation network has random weightings on its synapses. When we train the network, we expose it to a training set of input data. For each training instance, the actual output of the network is compared with the desired output that would give a correct answer; if there is a difference between the correct answer and the actual answer, the weightings of the individual nodes and synapses of the network are adjusted. This process is repeated until the responses are more or less accurate. Once the structure of the network stabilizes, the learning stage is over and the network is trained and ready to categorize unknown input.

Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node each. There are four additional nodes identifying the four areas, so that in this way each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is clear that this coding corresponds well with the information stored in the database. The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are wholly interconnected to the output nodes. In an untrained network, the branches between the nodes have equal weights. During the training stage, the network receives examples of input and output pairs corresponding to records in the database and adapts the weights of the different branches until all the inputs match the appropriate outputs. In Figure 2, the network learns to recognize readers of the car magazine and comics. Figure 3 shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers; however, the networks do not provide a rule to identify this association.

Back propagation networks are a great improvement on the perceptron architecture. However, they also have disadvantages, one being that they need an extremely large training set. Another problem of neural networks is that, although they learn, they do not provide us with a theory about what they have learned - they are simply black boxes that give answers but provide no clear idea as to how they arrived at these answers.

In 1981, Teuvo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonen's self-organizing maps. These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain, such as visual maps, maps of the spatial possibilities of limbs, and so on. A Kohonen self-organizing map is a collection of neurons or units, each of which is connected to a small number of other units called its neighbors. Most of the time the Kohonen map is two-dimensional; each node or unit contains a vector that is related to the space whose structure we are investigating. In its initial setting, the self-organizing map has a random assignment of vectors to each unit. During the training stage these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the so-called Kohonen movie, which is a series of frames showing the positions of the vectors and their connections with neighboring cells; the network resembles an elastic surface that is pulled out over the sample space. Neural networks perform well on classification tasks and can be very useful in data mining.
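A minimal sketch of the error-correction idea described above (compare the actual output with the desired output and adjust the weights when they differ) is shown below for a single-layer perceptron on a toy AND problem; the learning rate, epoch count and data are illustrative only.

```python
# Toy training set: the logical AND function.
inputs  = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]

weights = [0.0, 0.0]
bias = 0.0
rate = 0.1

for epoch in range(20):
    for (x1, x2), target in zip(inputs, targets):
        output = 1 if (weights[0] * x1 + weights[1] * x2 + bias) > 0 else 0
        error = target - output          # compare actual with desired output
        weights[0] += rate * error * x1  # adjust weights only when they differ
        weights[1] += rate * error * x2
        bias += rate * error

print(weights, bias)
print([1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
       for x1, x2 in inputs])            # reproduces the AND outputs
```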

Figure 1

Figure 2

Figure 3

8. The query manager has several distinct responsibilities. It is used to control the following:

• User access to the data
• Query scheduling
• Query monitoring
These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one of the most bespoke pieces of software in the data warehouse.

User access to the data: The query manager is the software interface between the users and the data. It presents the data to the users in a form they understand, and it also controls the user access to the data. In a data warehouse the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, raw data is often abstracted. Data in this raw format can often be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult. The query manager's task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will at one end take in the user's requirements and, in the background, using the metadata, it will transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is possible that not all tools will work directly via the query manager. If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly, no large ad hoc queries should be allowed to be run by anyone other than the query manager. It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up. If queries do bypass the query manager, query statistics gathering will be less accurate.

Query scheduling: Scheduling of ad hoc queries is a responsibility of the query manager. Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system, in particular if the queries are run using parallelism, where a single query can potentially use all the CPU resource made available to it. One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.

Query monitoring: One of the main functions of the query manager is to monitor the queries as they run. This is one of the reasons why all queries should be run via, or at least notified to, the query manager. One of the keys to the successful usage of a data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, the query profiles of different groups of users need to be known. This can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used and the query syntax used. The query manager has to be capable of gathering these statistics, which should then be stored in the database for later analysis. It should also maintain a query history: every query created or executed via the query manager should be logged. This allows query profiles to be built up over time and enables identification of frequently run queries or types of queries. These queries can then be tuned, possibly by adding new indexes or by creating new aggregations.
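As an illustration of the logging and statistics gathering described above, the sketch below routes queries through a simple "query manager" wrapper that records the user, the query syntax and the elapsed time in a history table. The schema and function names are assumptions for the example; a real query manager would also capture the execution plan and resource usage.

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
# Hypothetical query-history table for building query profiles later.
db.execute("""CREATE TABLE query_history (
                user_name   TEXT,
                query_text  TEXT,
                started_at  REAL,
                elapsed_sec REAL)""")

def run_via_query_manager(user, sql):
    start = time.time()
    rows = db.execute(sql).fetchall()      # route the query through the manager
    db.execute("INSERT INTO query_history VALUES (?, ?, ?, ?)",
               (user, sql, start, time.time() - start))
    return rows

db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
db.execute("INSERT INTO sales VALUES ('North', 120), ('South', 150)")
run_via_query_manager("analyst1",
                      "SELECT region, SUM(amount) FROM sales GROUP BY region")

# Frequently run queries can later be identified from the history:
for row in db.execute("SELECT user_name, query_text, elapsed_sec FROM query_history"):
    print(row)
```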


28. A perceptron with a simple three-layered network has ____________ as input units.

(a) Photo-receptors
(b) Associators
(c) Responders
(d) Acceptors
(e) Rejectors

29. In which theory was the human brain described as a neural network?

(a) Shannon's communication theory
(b) Kolmogorov complexity theory
(c) Rissanen theory
(d) Freud's theory of psychodynamics
(e) Kohonen theory

30. Which of the following is/are the task(s) maintained by the query manager?

I. Query syntax
II. Query execution plan
III. Query elapsed time

(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above

END OF SECTION A

Section B : Caselets (50 Marks)
• This section consists of questions with serial number 1 - 6.
• Answer all questions.
• Marks are indicated against each question.
• Detailed explanations should form part of your answer.
• Do not spend more than 110 - 120 minutes on Section B.

Caselet 1
Read the caselet carefully and answer the following questions:

1. "Braite selected Symphysis as the provider of choice to create a roadmap for the solution, develop a scalable, robust and user-friendly framework and deliver the product set." In this context, explain the data warehousing architecture. (10 marks)

2. If you are a project manager at Braite, what metrics will you consider to help Braite in meeting its business goals for improved customer satisfaction and process improvement? Explain. (8 marks)

3. What might be the important characteristics of the proposed data warehouse? Also list the features of a data warehouse. (7 marks)

4. Do you think the Software Quality Assurance (SQA) process will play an important role in any data warehousing project? Explain. (10 marks)

Braite, a leading provider of software services to financial institutions, launched an initiative to enhance its application platform in order to provide better data analytics for its customers. Braite partnered with Symphysis to architect and build new Data Warehousing (DW) and Business Intelligence (BI) services for its Business Intelligence Center (BIC). Over time, Symphysis has become Braite's most strategic product development partner, using their global delivery model.

Braite faced real challenges in providing effective data analytics to its customers - it supported several complex data sources residing in multiple application layers within its products and faced challenges in implementing business rules and integrating data from disparate systems. These source systems included mainframe systems, Oracle databases and even Excel spreadsheets. Braite also faced several other challenges: it required manual data validation and data comparison processes; it required manual controls over the credit card creation process; its system design process suffered from unclear business requirements; and it supported multiple disparate data marts.

To address these challenges, Braite turned to an offshore IT services provider with DW/BI experience and deep expertise in healthcare benefits software. Braite selected Symphysis as the provider of choice to assess the feasibility of a DW/BI solution as well as to create a roadmap for the solution, develop a scalable, robust and user-friendly DW/BI framework, and deliver the product set.

In this project, Symphysis designed and executed the DW/BI architecture that has become the cornerstone of Braite's data analytics service offerings, enhancing its status as a global leader. Business Intelligence (BI) services focus on helping clients in collecting and analyzing external and internal data to generate value for their organizations.

Symphysis successfully architected, built and tested the DW/BI solution with the key deliverables: created scripts to automate existing manual processes involving file comparison, validation, quality check and control card generation; introduced change and configuration management best practices, leading to standardization of Braite's delivery process; established robust Software Quality Assurance (SQA) processes to ensure high software quality - the SQA process relies on unit testing and functional testing, resulting in reduced effort for business analysts; and defined and collected metrics at various steps in the software development process (from development to integration testing) in order to improve process methodology and provide the ability to enter benchmarks or goals for performance tracking. There are several data warehouse project management metrics worth considering, and these metrics have helped Braite to meet its business goals for improved customer satisfaction and process improvement.

In addition to full lifecycle product development, Symphysis also provides ongoing product support for Braite's existing applications.

Symphysis provided DW/BI solutions across a wide range of business functions, including sales and service, relationship value management, customer information systems, billing and online collections, operational data store, loans, deposits, voice recognition, custom history and ATM. Symphysis's data mart optimization framework enabled integration and consolidation of disparate data marts, and the Symphysis SQA strategy improved delivery deadlines in terms of acceptance and integration. Symphysis provided DW process improvements, ETL checklists and standardization. Braite achieved cost savings of 40% by using Symphysis's onsite/offshore delivery model and a scalable architecture enabling new data warehouse and business intelligence applications.

END OF CASELET 1

Caselet 2
Read the caselet carefully and answer the following questions:

5. Critically analyze the functions of the tools that the chairman of Trilog Brokerage Services (TBS) decided to implement in order to increase the efficiency of the organization. (7 marks)

6. Discuss the classification of the usage of tools against a data warehouse, and also discuss the types of Online Analytical Processing (OLAP) tools. (8 marks)

Trilog Brokerage Services (TBS) is one of the oldest firms in India, with a very strong customer base. Many of its customers have more than one security holding, and some even have more than 50 securities in their portfolios. It has become very difficult on the part of TBS to track and maintain which customer is selling/buying which security, and the amounts they have to receive or the amounts they have to pay to TBS.

TBS has found that the information silos created are running contrary to the goal of the business intelligence organization architecture, which is to ensure enterprise-wide informational content to the broadest audience. By utilizing the information properly, it can enhance customer and supplier relationships, improve the profitability of products and services, create worthwhile new offerings, better manage risk and pare expenses dramatically, among many other gains. TBS was feeling that it required a category of software tools that help analyze data stored in its database and help users analyze different dimensions of the data, such as time series and trend analysis views.

The chairman of TBS felt that Online Analytical Processing (OLAP) was the need of the hour and decided to implement it immediately, so that the processing part would be reduced significantly, thereby increasing the efficiency of the organization.

END OF CASELET 2

END OF SECTION B

Section C : Applied Theory (20 Marks)
• This section consists of questions with serial number 7 - 8.
• Answer all questions.
• Marks are indicated against each question.
• Do not spend more than 25 - 30 minutes on Section C.

7. What is a Neural Network? Discuss the various forms of Neural Networks. (10 marks)

8. Explain the various responsibilities of a Query manager. (10 marks)

END OF SECTION C

END OF QUESTION PAPER

Suggested Answers
Data Warehousing and Data Mining (MB3G1IT) : October 2008

Section A Basic Concepts

Answer Reason


1. D : The capacity plan for hardware and infrastructure is not determined in the business requirements stage; it is identified in the technical blueprint stage.

2. B : The warehouse manager is a system manager who performs backup and archiving of the data warehouse.

3. C : Stored procedure tools implement complex checking.

4. D : Vertical partitioning can take two forms: normalization and row splitting. Before using vertical partitioning there should not be any requirement to perform major join operations between the two partitions. In order to maximize the hardware partitioning, maximize the processing power available.

5. A : A symmetric multi-processing machine is a set of tightly coupled CPUs that share memory and disk.

6. A : Redundant Array of Inexpensive Disks (RAID) Level 1 has full mirroring, with each disk duplexed.

7. A : A snowflake schema is a variant of the star schema where each dimension can have its own dimensions. A star schema is a logical structure that has a fact table in the center with dimension tables radiating off this central table. A starflake schema is a hybrid structure that contains a mix of star and snowflake schemas.
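A minimal sketch of the star schema contrasted above, assuming an invented sales subject area: a central fact table with dimension tables radiating off it (a snowflake schema would further normalize the dimensions). The table and column names are illustrative only.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Dimension tables radiate off the central fact table (star schema).
db.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales_fact  (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
""")
# A snowflake variant would split dim_product further, e.g. into product and
# product_category tables, normalizing the dimension.
```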

8. A : In database sizing, if n is the number of concurrent queries allowed and P is the size of the partition, then the temporary space T is set to T = (2n + 1)P.
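A quick worked example of the sizing formula, with hypothetical numbers:

```python
# T = (2n + 1) * P, e.g. 10 concurrent queries against 4 GB partitions.
n, partition_gb = 10, 4
temporary_space_gb = (2 * n + 1) * partition_gb
print(temporary_space_gb)   # 84 GB of temporary space
```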

9. A : Association rules are rules that state a statistical correlation between the occurrence of certain attributes in a database table.
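One common way to quantify such a correlation is support and confidence; a minimal sketch over hypothetical market baskets:

```python
# Support and confidence for the rule {bread} -> {butter} over toy baskets.
baskets = [{"bread", "butter"}, {"bread"}, {"bread", "butter", "jam"}, {"milk"}]

both  = sum(1 for b in baskets if {"bread", "butter"} <= b)
bread = sum(1 for b in baskets if "bread" in b)

support = both / len(baskets)      # 2/4 = 0.5
confidence = both / bread          # 2/3, about 0.67
print(support, round(confidence, 2))
```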

10. E : Learning tasks can be divided into:
I. Classification tasks
II. Knowledge engineering tasks
III. Problem-solving tasks

11. C : Shallow knowledge is information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL). Hidden knowledge is data that can be found relatively easily by using pattern recognition or machine-learning algorithms. Multi-dimensional knowledge is information that can be analyzed using online analytical processing tools. Deep knowledge is information that is stored in the database but can only be located if we have a clue that tells us where to look.

12. E : There are some specific rules that govern the basic structure of a data warehouse, namely that such a structure should be: time dependent, non-volatile, subject oriented and integrated.

13. D : OLAP tools do not learn and they do not create new knowledge; OLAP tools cannot search for new solutions. Data mining is more powerful than OLAP.

14. E : Auditing is a specific subset of security that is often mandated by organizations. As far as the data warehouse is concerned, the audit requirements can basically be categorized as:
I. Connections
II. Disconnections
III. Data access
IV. Data change

15. B : The Alexandria backup software package was produced by Sequent.

16. A : Aggregations are performed in order to speed up common queries. Too many aggregations will lead to unacceptable operational costs; too few aggregations will lead to an overall lack of system performance.

17. E : All unit testing should be complete before any test plan is enacted. In integration testing, the separate development units that make up a component of the data warehouse application are tested to ensure that they work together. In system testing, the whole data warehouse application is tested together.

18. C : A rule-based optimizer uses known rules to perform the function; a cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement. "Number of rows in the table" is a statistic generally collected by a cost-based optimizer.

19. D : Data shipping is where a process requests the data to be shipped to the location where the process is running; function shipping is where the function to be performed is moved to the locale of the data. Architectures which are designed for shared-nothing or distributed environments use function shipping exclusively. They can achieve parallelism as long as the data is partitioned or distributed correctly.

20. E : The common restrictions that may apply to the handling of views are:
I. Restricted Data Manipulation Language (DML) operations
II. Lost query optimization paths
III. Restrictions on parallel processing of view projections

21. A : One petabyte is equal to 1024 terabytes.

22. C : The formula for the construction of a genetic algorithm for the solution of a problem has the following steps:
I. Devise a good, elegant coding of the problem in terms of strings of a limited alphabet.
II. Invent an artificial environment in the computer where the solutions can join in battle with each other. Provide an objective rating to judge success or failure, in professional terms called a fitness function.
III. Develop ways in which possible solutions can be combined. Here the so-called cross-over operation, in which the father's and mother's strings are simply cut and, after changing, stuck together again, is very popular. In reproduction, all kinds of mutation operators can be applied.
IV. Provide a well-varied initial population and make the computer play 'evolution' by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions. Stop when a family of successful solutions has been produced.
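A compact sketch of these four steps on a toy problem (maximize the number of 1s in a bit string); the coding, fitness function, population size and operator rates are all illustrative assumptions.

```python
import random

random.seed(0)
TARGET_LEN = 20

def fitness(candidate):                     # step II: objective rating
    return sum(candidate)

def crossover(mother, father):              # step III: cut and recombine strings
    cut = random.randrange(1, TARGET_LEN)
    return mother[:cut] + father[cut:]

def mutate(candidate, rate=0.02):           # step III: mutation operator
    return [bit ^ 1 if random.random() < rate else bit for bit in candidate]

# Step I: code solutions as bit strings; step IV: evolve an initial population.
population = [[random.randint(0, 1) for _ in range(TARGET_LEN)] for _ in range(30)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]               # drop the bad solutions
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    population = parents + children

print(fitness(max(population, key=fitness)))   # best solution found
```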

23. C : The ADSM backup software package was produced by IBM.

24. B : Data encapsulation is not a stage in the knowledge discovery process.

25. E : Customer profiling, CAPTAINS and reverse engineering are applications of data mining.

26. E : Except for the load manager, all the other managers are part of the system managers in a data warehouse.

27. B : A group of similar objects that differ significantly from other objects is known as clustering.

28. A : A perceptron consists of a simple three-layered network with input units called photo-receptors.

29. D : In Freud's theory of psychodynamics, the human brain was described as a neural network.

30. E : The tasks maintained by the query manager are:
I. Query syntax
II. Query execution plan
III. Query elapsed time

Section B Caselets

1. Architecture of a data warehouse:

Load Manager Architecture
The architecture of a load manager is such that it performs the following operations:
1. Extract the data from the source system.
2. Fast-load the extracted data into a temporary data store.
3. Perform simple transformations into a structure similar to the one in the data warehouse.

Warehouse Manager Architecture
The architecture of a warehouse manager is such that it performs the following operations:
1. Analyze the data to perform consistency and referential integrity checks.
2. Transform and merge the source data in the temporary data store into the published data warehouse.
3. Create indexes, business views, partition views and business synonyms against the base data.
4. Generate denormalizations if appropriate.
5. Generate any new aggregations that may be required.
6. Update all existing aggregations.
7. Back up, incrementally or totally, the data within the data warehouse.
8. Archive data that has reached the end of its capture life.
In some cases, the warehouse manager also analyzes query profiles to determine which indexes and aggregations are appropriate.

Query Manager Architecture
The architecture of a query manager is such that it performs the following operations:
1. Direct queries to the appropriate table(s).
2. Schedule the execution of user queries.

The actual problem specified is the tight project schedule within which the solution had to be delivered. The field errors had to be reduced to a great extent, as the solution was for the company. The requirements needed to be defined very clearly, and there was a need for a scalable and reliable architecture and solution. The study was conducted on the company's current business information requirements and current process of getting that information, and a business case was prepared for a data warehousing and business intelligence solution.
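A minimal sketch of the three load manager operations listed above, using an invented CSV extract and an SQLite staging area; the table and column names are assumptions for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source extract (in practice this comes from the source system).
source_extract = io.StringIO("cust_id,amount\n101,250\n102,  75\n")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staging_sales (cust_id TEXT, amount TEXT)")

# 1. Extract and 2. fast-load the raw rows into the temporary data store as-is.
rows = list(csv.DictReader(source_extract))
db.executemany("INSERT INTO staging_sales VALUES (:cust_id, :amount)", rows)

# 3. Simple transformation into a structure similar to the warehouse table.
db.execute("CREATE TABLE dw_sales (cust_id INTEGER, amount REAL)")
db.execute("""INSERT INTO dw_sales
              SELECT CAST(cust_id AS INTEGER), CAST(TRIM(amount) AS REAL)
              FROM staging_sales""")
print(db.execute("SELECT * FROM dw_sales").fetchall())
```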

2. Metrics are essential in the assessment of software development quality. They may provide information about the development process itself and the yielded products. Metrics may be grouped into quality areas, which define a perspective for metrics interpretation. The adoption of a measurement program includes the definition of metrics that generate useful information. To do so, the organization's goals have to be defined and analyzed, along with what the metrics are expected to deliver. Metrics may be classified as direct and indirect. A direct metric is independent of the measurement of any other. Indirect metrics, also referred to as derived metrics, represent functions upon other metrics, direct or derived. Productivity (code size / programming time) is an example of a derived metric. The existence of a timely and accurate capturing mechanism for direct metrics is critical in order to produce reliable results. Indicators establish the quality factors defined in a measurement program. Metrics also have a number of components, and for data warehousing they can be broken down in the following manner:
• Objects - the "themes" in the data warehouse environment which need to be assessed. Objects can include business drivers, warehouse contents, refresh processes, accesses and tools.
• Subjects - things in the data warehouse to which we assign numbers or a quantity. For example, subjects include the cost or value of a specific warehouse activity, access frequency, duration and utilization.
• Strata - a criterion for manipulating metric information. This might include day of the week, specific tables accessed, location, time or accesses by department.
These metric components may be combined to define an "application", which states how the information will be applied. For example, "When actual monthly refresh cost exceeds targeted monthly refresh cost, the value of each data collection in the warehouse must be re-established." There are several data warehouse project management metrics worth considering. The first three are:
• Business Return On Investment (ROI): The best metric to use is business return on investment. Is the business achieving bottom-line success (increased sales or decreased expenses) through the use of the data warehouse? This focus will encourage the development team to work backwards to do the right things day in and day out for the ultimate arbiter of success - the bottom line.
• Data usage: The second best metric is data usage. You want to see the data warehouse used for its intended purposes by the target users. The objective here is increasing numbers of users and complexity of usage. With this focus, user statistics such as logins and query bands are tracked.
• Data gathering and availability: The third best data warehouse metric category is data gathering and availability. Under this focus, the data warehouse team becomes an internal data brokerage, serving up data for the organization's consumption. Success is measured in the availability of the data, more or less according to a service level agreement. These business metrics should be used to gauge the success of the project.
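A small sketch of the distinction between direct and derived metrics, and of the kind of usage statistics mentioned above; all numbers and the access log are hypothetical.

```python
from collections import Counter

# Direct metrics (measured) and a derived metric computed from them.
code_size_loc = 12_000          # direct: lines of code delivered
programming_hours = 800         # direct: effort spent
productivity = code_size_loc / programming_hours   # derived: LOC per hour
print(round(productivity, 1))   # 15.0

# Data-usage metric: queries per user from a (hypothetical) access log.
access_log = ["alice", "bob", "alice", "alice", "carol"]
print(Counter(access_log))      # Counter({'alice': 3, 'bob': 1, 'carol': 1})
```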

3. The important characteristics of a data warehouse are:
Time dependent: That is, containing information collected over time, which implies there must always be a connection between the information in the warehouse and the time when it was entered. This is one of the most important aspects of the warehouse as it relates to data mining, because information can then be sourced according to period.
Non-volatile: That is, data in a data warehouse is never updated but used only for queries. Thus such data can only be loaded from other databases, such as the operational database. End-users who want to update must use the operational database, as only the latter can be updated, changed or deleted. This means that a data warehouse will always be filled with historical data.
Subject oriented: That is, built around all the existing applications of the operational data. Not all the information in the operational database is useful for a data warehouse, since the data warehouse is designed specifically for decision support while the operational database contains information for day-to-day use.
Integrated: That is, it reflects the business information of the organization. In an operational data environment we will find many types of information being used in a variety of applications, and some applications will be using different names for the same entities. However, in a data warehouse it is essential to integrate this information and make it consistent: only one name must exist to describe each individual entity.

The following are the features of a data warehouse:
• A scalable information architecture that will allow the information base to be extended and enhanced over time.
• Detailed analysis of member patterns, including trading, delivery and funds payment.
• Fraud detection and sequence-of-event analysis.
• Ease of reporting on voluminous historical data.
• Provision for ad hoc queries and reporting facilities to enhance the efficiency of knowledge workers.
• Data mining to identify the co-relation between apparently independent entities.

4. Due to the principal role of data warehouses in making strategy decisions, data warehouse quality is crucial for organizations. The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing, resulting in:
• Reduced development and maintenance costs
• Improved software product quality
• Reduced project cycle time
• Increased customer satisfaction
• Improved staff morale, thanks to predictable results in stable conditions with less overtime, crisis and turnover.
Quality assurance means different things to different individuals. To some, QA means testing, but quality cannot be tested at the end of a project; it must be built in as the solution is conceived, evolves and is developed. To some, QA resources are the "process police" - nitpickers insisting on 100% compliance with a defined development process methodology. Rather, it is important to implement processes and controls that will really benefit the project. Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements; testing is just one of those activities. In the typical software QA methodology, the key tasks are:
• Articulate the development methodology for all to know
• Rigorously define and inspect the requirements
• Ensure that the requirements are testable
• Prioritize based on risk
• Create test plans
• Set up the test environment and data
• Execute test cases
• Document and manage defects and test results
• Gather metrics for management decisions
• Assess readiness to implement
Quality assurance in a data warehouse/business intelligence environment is a challenging undertaking. For one thing, very little is written about business intelligence QA. Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions; however, data quality does not make for BI quality assurance. Practitioners within the software QA discipline focus almost exclusively on application development efforts and do not seem to appreciate the unique aspects of quality assurance in a data warehouse/business intelligence environment. An effective software QA process should be ingrained within each DW/BI project. It should have the following characteristics:
• QA goals and objectives should be defined from the outset of the project.
• The role of QA should be clearly defined within the project organization.
• The QA role needs to be staffed with talented resources, well trained in the techniques needed to evaluate the data in the types of sources that will be used.
• QA processes should be embedded to provide a self-monitoring update cycle.
• QA activities are needed in the requirements, design, mapping and development project phases.
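As a sketch of the unit-testing side of such an SQA process, the example below tests a hypothetical cleansing rule of the kind that might sit in the load process; the rule and its name are assumptions, not part of the caselet.

```python
# A minimal unit test for a (hypothetical) cleansing rule used during loading.
def standardize_card_number(raw):
    """Strip separators so card numbers compare consistently across sources."""
    return raw.replace(" ", "").replace("-", "")

def test_standardize_card_number():
    assert standardize_card_number("4111-1111 1111-1111") == "4111111111111111"
    assert standardize_card_number("4111111111111111") == "4111111111111111"

test_standardize_card_number()   # with pytest this would be collected automatically
print("all checks passed")
```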


Page 7: 0810 Its-dwdm Mb3g1it

within its products and faced challenges in implementing business rules and integrating data from disparate systems These source systems included mainframe systems Oracle databases and even Excel spreadsheets Braite also faced several other challenges it required manual data validation and data comparison processes it required manual controls over the credit card creation process its system design process suffered from unclear business requirements it supported multiple disparate data marts

To address these challenges Braite turned to an offshore IT services provider with DWBI experience and deep expertise in healthcare benefits software Braite selected Symphysis as the provider of choice to assess the feasibility of a DWBI solution as well as to create a roadmap for the solution develop a scalable robust and user-friendly DWBI framework and deliver the product set

In this project Symphysis designed and executed the DWBI architecture that has become the cornerstone of Braites data analytics service offerings enhancing its status as a global leader Business Intelligence (BI) services focus on helping clients in collecting and analyzing external and internal data to generate value for their organizations

Symphysis successfully architected built and tested the DWBI solution with the key deliverables created scripts to automate existing manual processes involving file comparison validation quality check and control card generation introduced change and configuration management best practices leading to standardization of Braites delivery process robust Software Quality Assurance (SQA) processes to ensure high software quality The SQA process relies on unit testing and functional testing resulting in reduced effort for business analysts defined and collected metrics at various steps in the software development process (from development to integration testing) in order to improve process methodology and provide the ability to enter benchmarks or goals for performance tracking There are several data warehouse project management metrics worth considering These metrics have helped Braite to meet its business goals for improved customer satisfaction and process improvement

In addition to full lifecycle product development we also provide ongoing product support for Braites existing applications

Symphysis provided DWBI solutions across a wide range of business functions including sales and service relationship value management customer information systems billing and online collections operational data store loans deposits voice recognition custom history and ATM Symphysisrsquos data mart optimization framework enabled integration and consolidation of disparate data marts Symphysis SQA strategy improved delivery deadlines in terms of acceptance and integration Symphysis provided DW process improvements ETL checklists and standardization Braite achieved cost savings of 40 by using Symphysis onsiteoffshore delivery model and a scalable architecture enabling new data warehouse and business intelligence applications

END OF CASELET 1

Caselet 2Read the caselet carefully and answer the following questions

5 Critically analyze the functions of the tools that chairman of Trilog Brokerage Services (TBS) decided to implement in order to increase the efficiency of the organization ( 7 marks)

ltAnswergt

6 Discuss the classification of usage of tools against a data warehouse and also discuss about the types of Online Analytical Processing (OLAP) tools ( 8 marks)

ltAnswergt

Trilog Brokerage Services (TBS) is one of the oldest firms in India with a very strong customer base Many of its customers have more than one security holdings and some even have more than 50 securities in their portfolios And it has become very difficult on the part of TBS to track and maintain which customer is

7

sellingbuying which security and the amounts they have to receive or the amount they have to pay to TBS

It has found that the information silos created are running contrary to the goal of the business intelligence organization architecture, which is to ensure enterprise-wide informational content to the broadest audience. By utilizing the information properly, it can enhance customer and supplier relationships, improve the profitability of products and services, create worthwhile new offerings, better manage risk and pare expenses dramatically, among many other gains. TBS felt that it required a category of software tools that help analyze data stored in its database and help users analyze different dimensions of the data, such as time series and trend analysis views.

The chairman of TBS felt that Online Analytical Processing (OLAP) was the need of the hour and decided to implement it immediately, so that the processing part would be reduced significantly, thereby increasing the efficiency of the organization.

END OF CASELET 2

END OF SECTION B

Section C : Applied Theory (20 Marks)
• This section consists of questions with serial number 7 - 8.
• Answer all questions.
• Marks are indicated against each question.
• Do not spend more than 25 - 30 minutes on Section C.

7. What is a Neural Network? Discuss the various forms of Neural Networks. (10 marks)


8. Explain the various responsibilities of a Query manager. (10 marks)

END OF SECTION C

END OF QUESTION PAPER

Suggested Answers
Data Warehousing and Data Mining (MB3G1IT) : October 2008

Section A : Basic Concepts

Answer : Reason

1 D The capacity plan for hardware and infrastructure is not determined in the business requirements stage; it is identified in the technical blueprint stage.

2 B Warehouse manager is a system manager who performs backup and archiving of the data warehouse.

3 C Stored procedure tools implement complex checking.

4 D Vertical partitioning can take two forms: normalization and row splitting. Before using vertical partitioning there should not be any requirement to perform major join operations between the two partitions. In order to maximize the hardware partitioning, maximize the processing power available.

5 A Symmetric multi-processing machine is a set of tightly coupled CPUs that share memory and disk.

6 A Redundant Array of Inexpensive Disks (RAID) Level 1 has full mirroring, with each disk duplexed.

7 A Snowflake schema is a variant of star schema where each dimension can have its own dimensions; star schema is a logical structure that has a fact table in the center with dimension tables radiating off of this central table; starflake schema is a hybrid structure that contains a mix of star and snowflake schemas.

8 A In database sizing, if n is the number of concurrent queries allowed and P is the size of the partition, then the temporary space (T) is set to T = (2n + 1)P.
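As a quick illustration of this sizing rule (the numbers below are invented for the example, not from the answer), a minimal sketch in Python:

```python
# Illustrative only: temporary-space rule of thumb T = (2n + 1) * P
def temp_space(concurrent_queries: int, partition_size_gb: float) -> float:
    """Return the temporary space in the same unit as partition_size_gb."""
    return (2 * concurrent_queries + 1) * partition_size_gb

# Example: 10 concurrent queries against 4 GB partitions -> 84 GB of temp space.
print(temp_space(10, 4))  # 84
```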

9 A Association rules state a statistical correlation between the occurrence of certain attributes in a database table.
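For instance (a made-up illustration, not part of the original answer), the strength of such a rule is usually expressed through its support and confidence:

```python
# Hypothetical transactions: does "bread" correlate with "butter"?
transactions = [
    {"bread", "butter"}, {"bread", "butter", "milk"},
    {"bread"}, {"milk"}, {"bread", "butter"},
]
n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)   # 3
bread = sum(1 for t in transactions if "bread" in t)              # 4
support = both / n          # 0.6: both items occur together in 60% of rows
confidence = both / bread   # 0.75: butter appears in 75% of bread transactions
print(support, confidence)
```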

10 E Learning tasks can be divided into:
I. Classification tasks
II. Knowledge engineering tasks
III. Problem-solving tasks

11 C Shallow knowledge is the information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL). Hidden knowledge is the data that can be found relatively easily by using pattern recognition or machine-learning algorithms. Multi-dimensional knowledge is the information that can be analyzed using online analytical processing tools. Deep knowledge is the information that is stored in the database but can only be located if we have a clue that tells us where to look.

12 E There are some specific rules that govern the basic structure of a data warehouse, namely that such a structure should be: time dependent, non-volatile, subject oriented and integrated.

13 D OLAP tools do not learn; they create no new knowledge, and OLAP tools cannot search for new solutions. Data mining is more powerful than OLAP.

14 E Auditing is a specific subset of security that is often mandated by organizations. As far as the data warehouse is concerned, the audit requirements can basically be categorized as:
I. Connections
II. Disconnections
III. Data access
IV. Data change

15 B The Alexandria backup software package was produced by Sequent.

16 A Aggregations are performed in order to speed up common queries; too many aggregations will lead to unacceptable operational costs, while too few aggregations will lead to an overall lack of system performance.

17 E All unit testing should be complete before any test plan is enacted. In integration testing, the separate development units that make up a component of the data warehouse application are tested to ensure that they work together. In system testing, the whole data warehouse application is tested together.

18 C A rule-based optimizer uses known rules to perform the function; a cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement. "Number of rows in the table" is generally collected by the cost-based optimizer.

19 D Data shipping is where a process requests the data to be shipped to the location where the process is running; function shipping is where the function to be performed is moved to the locale of the data. Architectures which are designed for shared-nothing or distributed environments use function shipping exclusively. They can achieve parallelism as long as the data is partitioned or distributed correctly.

20 E The common restrictions that may apply to the handling of views are:
I. Restricted Data Manipulation Language (DML) operations
II. Lost query optimization paths
III. Restrictions on parallel processing of view projections

21 A One petabyte is equal to 1024 terabytes.

22 C The formula for the construction of a genetic algorithm for the solution of a problem has the following steps:
I. Devise a good, elegant coding of the problem in terms of strings of a limited alphabet.
II. Invent an artificial environment in the computer where the solutions can join in battle with each other. Provide an objective rating to judge success or failure, in professional terms called a fitness function.
III. Develop ways in which possible solutions can be combined. Here the so-called cross-over operation, in which the father's and mother's strings are simply cut and, after changing, stuck together again, is very popular. In reproduction, all kinds of mutation operators can be applied.
IV. Provide a well-varied initial population and make the computer play 'evolution' by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions. Stop when a family of successful solutions has been produced.
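These steps can be made concrete with a small sketch. The problem, encoding and fitness function below are invented purely for illustration (maximizing the number of 1s in a bit string); they are not from the original answer.

```python
import random

random.seed(42)

GENES, POP, GENERATIONS = 20, 30, 40

def fitness(s):                      # step II: objective rating (fitness function)
    return sum(s)                    # count of 1-bits in the string

def crossover(a, b):                 # step III: cut the parents and recombine
    cut = random.randint(1, GENES - 1)
    return a[:cut] + b[cut:]

def mutate(s, rate=0.02):            # step III: mutation operator
    return [1 - g if random.random() < rate else g for g in s]

# step I: encode candidate solutions as bit strings; step IV: initial population
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]

for _ in range(GENERATIONS):         # step IV: play 'evolution'
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP // 2]            # drop the bad solutions
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(POP - len(survivors))]
    population = survivors + children

print(max(fitness(s) for s in population))        # close to GENES once converged
```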

23 C The ADSM backup software package was produced by IBM.

24 B Data encapsulation is not a stage in the knowledge discovery process.

25 E Customer profiling, CAPTAINS and reverse engineering are applications of data mining.

26 E Except for the load manager, all the other managers are part of the system managers in a data warehouse.

27 B A group of similar objects that differ significantly from other objects is known as Clustering.

28 A A perceptron consists of a simple three-layered network with input units called photo-receptors.

29 D In Freud's theory of psychodynamics, the human brain was described as a neural network.

30 E The tasks maintained by the query manager are:
I. Query syntax
II. Query execution plan
III. Query elapsed time

Section B : Caselets

1. Architecture of a data warehouse

Load Manager Architecture
The architecture of a load manager is such that it performs the following operations:
1. Extract the data from the source system.
2. Fast-load the extracted data into a temporary data store.
3. Perform simple transformations into a structure similar to the one in the data warehouse.
(A minimal sketch of these steps appears after this answer.)

[Figure: Load manager architecture]

Warehouse Manager Architecture
The architecture of a warehouse manager is such that it performs the following operations:
1. Analyze the data to perform consistency and referential integrity checks.
2. Transform and merge the source data in the temporary data store into the published data warehouse.
3. Create indexes, business views, partition views and business synonyms against the base data.
4. Generate denormalizations if appropriate.
5. Generate any new aggregations that may be required.
6. Update all existing aggregations.
7. Back up, incrementally or totally, the data within the data warehouse.
8. Archive data that has reached the end of its capture life.
In some cases the warehouse manager also analyzes query profiles to determine which indexes and aggregations are appropriate.

[Figure: Architecture of a warehouse manager]

Query Manager Architecture
The architecture of a query manager is such that it performs the following operations:
1. Direct queries to the appropriate table(s).
2. Schedule the execution of user queries.

The actual problem specified is the tight project schedule within which the solution had to be delivered. The field errors had to be reduced to a great extent, as the solution was for the company. The requirements needed to be defined very clearly, and there was a need for a scalable and reliable architecture and solution. The study was conducted on the company's current business information requirements and its current process of getting that information, and a business case was prepared for a data warehousing and business intelligence solution.
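As referenced above, a minimal, hypothetical sketch of the load-manager flow; the source data, table names and the simple transformation are invented for illustration only:

```python
import csv
import io
import sqlite3

# Illustration only: extract from a source system, fast-load into a temporary
# store, then apply a simple transformation into a warehouse-like structure.
source_extract = io.StringIO(
    "date,store,amount\n2008-10-01,S1 ,120.5\n2008-10-01,S2,-3.0\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_sales (sale_date TEXT, store TEXT, amount REAL)")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, store TEXT, amount REAL)")

# 1. Extract the data from the source system (here, a hypothetical CSV export).
rows = [(r["date"], r["store"], float(r["amount"])) for r in csv.DictReader(source_extract)]

# 2. Fast-load the extracted rows into the temporary data store.
conn.executemany("INSERT INTO temp_sales VALUES (?, ?, ?)", rows)

# 3. Simple transformation: trim store codes and drop non-positive amounts
#    while moving the data into the warehouse structure.
conn.execute("""
    INSERT INTO fact_sales
    SELECT sale_date, TRIM(store), amount FROM temp_sales WHERE amount > 0
""")
conn.commit()
print(conn.execute("SELECT * FROM fact_sales").fetchall())
```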


2. Metrics are essential in the assessment of software development quality. They may provide information about the development process itself and the yielded products. Metrics may be grouped into Quality Areas, which define a perspective for metrics interpretation. The adoption of a measurement program includes the definition of metrics that generate useful information. To do so, the organization's goals have to be defined and analyzed, along with what the metrics are expected to deliver. Metrics may be classified as direct and indirect. A direct metric is independent of the measurement of any other. Indirect metrics, also referred to as derived metrics, represent functions upon other metrics, direct or derived. Productivity (code size / programming time) is an example of a derived metric. The existence of a timely and accurate capturing mechanism for direct metrics is critical in order to produce reliable results. Indicators establish the quality factors defined in a measurement program.
Metrics also have a number of components, and for data warehousing can be broken down in the following manner:
• Objects - the "themes" in the data warehouse environment which need to be assessed. Objects can include business drivers, warehouse contents, refresh processes, accesses and tools.
• Subjects - things in the data warehouse to which we assign numbers or a quantity. For example, subjects include the cost or value of a specific warehouse activity, access frequency, duration and utilization.
• Strata - a criterion for manipulating metric information. This might include day of the week, specific tables accessed, location, time or accesses by department.
These metric components may be combined to define an "application", which states how the information will be applied. For example, "When actual monthly refresh cost exceeds targeted monthly refresh cost, the value of each data collection in the warehouse must be re-established." There are several data warehouse project management metrics worth considering. The first three are:
• Business Return On Investment (ROI): The best metric to use is business return on investment. Is the business achieving bottom-line success (increased sales or decreased expenses) through the use of the data warehouse? This focus will encourage the development team to work backwards to do the right things day in and day out for the ultimate arbiter of success -- the bottom line.
• Data usage: The second best metric is data usage. You want to see the data warehouse used for its intended purposes by the target users. The objective here is increasing numbers of users and complexity of usage. With this focus, user statistics such as logins and query bands are tracked.
• Data gathering and availability: The third best data warehouse metric category is data gathering and availability. Under this focus, the data warehouse team becomes an internal data brokerage, serving up data for the organization's consumption. Success is measured in the availability of the data, more or less according to a service level agreement. Use these business metrics to gauge the success.


3. The important characteristics of a data warehouse are:
Time dependent: That is, containing information collected over time, which implies there must always be a connection between information in the warehouse and the time when it was entered. This is one of the most important aspects of the warehouse as it relates to data mining, because information can then be sourced according to period.
Non-volatile: That is, data in a data warehouse is never updated but used only for queries. Thus such data can only be loaded from other databases, such as the operational database. End-users who want to update must use the operational database, as only the latter can be updated, changed or deleted. This means that a data warehouse will always be filled with historical data.
Subject oriented: That is, built around all the existing applications of the operational data. Not all the information in the operational database is useful for a data warehouse, since the data warehouse is designed specifically for decision support, while the operational database contains information for day-to-day use.
Integrated: That is, it reflects the business information of the organization. In an operational data environment we will find many types of information being used in a variety of applications, and some applications will be using different names for the same entities. However, in a data warehouse it is essential to integrate this information and make it consistent: only one name must exist to describe each individual entity.
The following are the features of a data warehouse:
• A scalable information architecture that will allow the information base to be extended and enhanced over time.
• Detailed analysis of member patterns, including trading, delivery and funds payment.
• Fraud detection and sequence of event analysis.
• Ease of reporting on voluminous historical data.
• Provision for ad hoc queries and reporting facilities to enhance the efficiency of knowledge workers.
• Data mining to identify the co-relation between apparently independent entities.

4. Due to the principal role of data warehouses in making strategy decisions, data warehouse quality is crucial for organizations. The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing, resulting in:
• Reduced development and maintenance costs
• Improved software product quality
• Reduced project cycle time
• Increased customer satisfaction
• Improved staff morale, thanks to predictable results in stable conditions with less overtime, crisis and turnover
Quality assurance means different things to different individuals. To some, QA means testing, but quality cannot be tested at the end of a project; it must be built in as the solution is conceived, evolves and is developed. To some, QA resources are the "process police" - nitpickers insisting on 100% compliance with a defined development process methodology. Rather, it is important to implement processes and controls that will really benefit the project. Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements. Testing is just one of those activities. In the typical software QA methodology, the key tasks are:
• Articulate the development methodology for all to know
• Rigorously define and inspect the requirements
• Ensure that the requirements are testable
• Prioritize based on risk
• Create test plans
• Set up the test environment and data
• Execute test cases
• Document and manage defects and test results
• Gather metrics for management decisions
• Assess readiness to implement
Quality assurance (QA) in a data warehouse/business intelligence environment is a challenging undertaking. For one thing, very little is written about business intelligence QA. Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions. However, data quality does not make for BI quality assurance, and practitioners within the software QA discipline focus almost exclusively on application development efforts. They do not seem to appreciate the unique aspects of quality assurance in a data warehouse/business intelligence environment. An effective software QA should be ingrained within each DW/BI project. It should have the following characteristics:
• QA goals and objectives should be defined from the outset of the project
• The role of QA should be clearly defined within the project organization
• The QA role needs to be staffed with talented resources, well trained in the techniques needed to evaluate the data in the types of sources that will be used
• QA processes should be embedded to provide a self-monitoring update cycle
• QA activities are needed in the requirements, design, mapping and development project phases

5. Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is often used in data mining. The chief component of OLAP is the OLAP server, which sits between a client and a Database Management System (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. There are OLAP servers available for nearly all the major database systems.
OLAP is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. Designed for managers looking to make sense of their information, OLAP tools structure data hierarchically (the way managers think of their enterprises), but also allow business analysts to rotate that data, changing the relationships to get more detailed insight into corporate information. OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer as well as front-end flexibility. Those are typically difficult features for any home-built system to achieve. The term 'on-line analytic processing' is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations. OLAP is decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies.
The most common way to access a data mart or data warehouse is to run reports. Another very popular approach is to use OLAP tools. To compare different types of reporting and analysis interface, it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of use. Ad hoc queries, as the name suggests, are queries written by (or for) the end user as a one-off exercise. The only limitations are the capabilities of the reporting tool and the data available. Ad hoc reporting requires greater expertise but need not involve programming, as most modern reporting tools are able to generate SQL. OLAP tools can be thought of as interactive reporting environments: they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports. They are excellent for exploring summarised data, and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details.
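A hedged illustration of the "slice and dice" and "rotate" ideas using pandas (the sales data and column names are invented for this example; pandas is not mentioned in the original answer):

```python
import pandas as pd

# A tiny fact table: one row per sale, with three dimensions and one measure.
sales = pd.DataFrame({
    "year":    [2007, 2007, 2008, 2008, 2008],
    "region":  ["North", "South", "North", "South", "South"],
    "product": ["A", "A", "B", "A", "B"],
    "amount":  [100, 150, 120, 90, 200],
})

# "Rotate" the data: regions as rows, years as columns, summed amounts in cells.
cube_view = pd.pivot_table(sales, values="amount", index="region",
                           columns="year", aggfunc="sum", fill_value=0)
print(cube_view)

# A "slice": restrict one dimension (product = "A") and re-aggregate.
print(sales[sales["product"] == "A"].groupby("region")["amount"].sum())
```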


6. The usage of tools against a data warehouse can be classified into three broad categories:
i. Data dipping
ii. Data mining
iii. Data analysis

Data dipping tools: These are the basic business tools. They allow the generation of standard business reports. They can perform basic analysis, answering standard business questions. As these tools are relational, they can also be used as data browsers, and they generally have reasonable drill-down capabilities. Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business-friendly schema.

Data mining tools: These are specialist tools designed for finding trends and patterns in the underlying data. These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious. A data mining tool could be used to find common behavioral trends in a business's customers, or to root out market segments by grouping customers with common attributes.

Data analysis tools: These are used to perform complex analysis of data. They will normally have a rich set of analytic functions which allow sophisticated analysis of the data. These tools are designed for business analysis and will generally understand the common business metrics. Data analysis tools can again be subdivided into two categories: Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP). Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.

MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
• Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
• Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement (see the sketch after this answer).
Advantages:
• Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
• Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions, as well as the ability to allow users to define their own functions.
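As referenced above, a minimal, hypothetical sketch of how a ROLAP layer might turn a slice-and-dice request into SQL; the table and column names are invented, and real ROLAP engines are far more elaborate:

```python
# Illustration only: each "slice" selection becomes a WHERE predicate, and the
# requested dimensions/measures become GROUP BY and aggregate clauses.
def build_rolap_query(fact_table, measures, dimensions, slices):
    select = ", ".join(dimensions + [f"SUM({m}) AS {m}" for m in measures])
    where = " AND ".join(f"{col} = '{val}'" for col, val in slices.items())
    sql = f"SELECT {select} FROM {fact_table}"
    if where:
        sql += f" WHERE {where}"
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql

# Dicing sales by region for one product and one year produces a single
# SELECT ... WHERE product = 'A' AND year = '2008' ... GROUP BY region statement.
print(build_rolap_query("fact_sales", ["amount"], ["region"],
                        {"product": "A", "year": "2008"}))
```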

Section C : Applied Theory

7. Neural networks: Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. In Freud's theory of psychodynamics the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10^11, connected to each other via a huge number of so-called synapses. A single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines, and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware, but most are just software programs that can operate on normal computers. Typically, a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes. When using neural networks we have to distinguish between two stages: the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions or execute whatever learning task is involved.
There are several different forms of neural network, but we shall discuss only three of them here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing maps
In 1958, Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks. Later, in 1969, Minsky and Papert showed that the class of problems that could be solved by a machine with a perceptron architecture was very limited. It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcome these difficulties. A major improvement was the introduction of hidden layers in the so-called back propagation networks. A back propagation network not only has input and output nodes, but also a set of intermediate layers with hidden nodes. In its initial stage, a back propagation network has random weightings on its synapses. When we train the network, we expose it to a training set of input data. For each training instance, the actual output of the network is compared with the desired output that would give a correct answer; if there is a difference between the correct answer and the actual answer, the weightings of the individual nodes and synapses of the network are adjusted. This process is repeated until the responses are more or less accurate. Once the structure of the network stabilizes, the learning stage is over and the network is trained and ready to categorize unknown input.
Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node each. There are four additional nodes identifying the four areas, so that in this way each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is clear that this coding corresponds well with the information stored in the database. The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are wholly interconnected to the output nodes. In an untrained network, the branches between the nodes have equal weights. During the training stage, the network receives examples of input and output pairs corresponding to records in the database, and adapts the weights of the different branches until all the inputs match the appropriate outputs. In Figure 2 the network learns to recognize readers of the car magazine and comics. Figure 3 shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers; however, the network does not provide a rule to identify this association.
Back propagation networks are a great improvement on the perceptron architecture. However, they also have disadvantages, one being that they need an extremely large training set. Another problem of neural networks is that, although they learn, they do not provide us with a theory about what they have learned: they are simply black boxes that give answers but provide no clear idea as to how they arrived at these answers. In 1981, Teuvo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonen's self-organizing maps. These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain, such as visual maps, maps of the spatial possibilities of limbs, and so on. A Kohonen self-organizing map is a collection of neurons or units, each of which is connected to a small number of other units called its neighbors. Most of the time the Kohonen map is two-dimensional; each node or unit contains a vector that is related to the space whose structure we are investigating. In its initial setting, the self-organizing map has a random assignment of vectors to each unit. During the training stage, these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the so-called Kohonen movie, which is a series of frames showing the positions of the vectors and their connections with neighboring cells. The network resembles an elastic surface that is pulled out over the sample space. Neural networks perform well on classification tasks and can be very useful in data mining.
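To give a feel for the encoding (training) stage described above, here is a minimal sketch of Rosenblatt-style perceptron learning on an invented, linearly separable toy problem (two inputs, logical OR); it is illustrative only and is not the marketing-database network of Figures 1-3.

```python
# Minimal perceptron: weights are nudged whenever the predicted output
# differs from the desired output, as in the training loop described above.
inputs  = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]            # logical OR, a linearly separable task

w = [0.0, 0.0]                    # synapse weights
b = 0.0                           # bias
rate = 0.1

for epoch in range(20):           # encoding stage: repeat until responses are accurate
    for (x1, x2), desired in zip(inputs, targets):
        actual = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        error = desired - actual
        w[0] += rate * error * x1 # adjust weights only when the answer is wrong
        w[1] += rate * error * x2
        b    += rate * error

# decoding stage: the trained network classifies the inputs correctly
print([1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for x1, x2 in inputs])  # [0, 1, 1, 1]
```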

Figure 1

Figure 2

Figure 3

8. The query manager has several distinct responsibilities. It is used to control the following:

• User access to the data
• Query scheduling
• Query monitoring
These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one of the most bespoke pieces of software in the data warehouse.
User access to the data: The query manager is the software interface between the users and the data. It presents the data to the users in a form they understand. It also controls the user access to the data. In a data warehouse, the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, the raw data is often abstracted. Data in this raw format can often be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult. The query manager's task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will at one end take in the user's requirements and, in the background, using the metadata, it will transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is possible that not all tools will work directly via the query manager. If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly, no large ad hoc queries should be allowed to be run by anyone other than the query manager. It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up. If queries do bypass the query manager, query statistics gathering will be less accurate.
Query scheduling: Scheduling of ad hoc queries is a responsibility of the query manager. Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system, in particular if the queries are run using parallelism, where a single query can potentially use all the CPU resource made available to it. One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.
Query monitoring: One of the main functions of the query manager is to monitor the queries as they run. This is one of the reasons why all queries should be run via, or at least notified to, the query manager. One of the keys to the successful usage of a data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, the query profiles of different groups of users need to be known. This can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used and the query syntax used. The query manager has to be capable of gathering these statistics, which should then be stored in the database for later analysis. It should also maintain a query history: every query created or executed via the query manager should be logged. This allows query profiles to be built up over time, and enables identification of frequently run queries or types of queries. These queries can then be tuned, possibly by adding new indexes or by creating new aggregations. (A small sketch of such query logging appears below.)
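As referenced above, a minimal, hypothetical sketch of the query-history logging a query manager might perform; the table layout, field names and sample data are invented for illustration:

```python
import sqlite3
import time

# Illustration only: record each query's syntax, user and elapsed time so that
# query profiles can be built up and frequently run queries identified later.
log_db = sqlite3.connect(":memory:")
log_db.execute("""CREATE TABLE query_history
                  (run_at REAL, username TEXT, sql_text TEXT, elapsed_s REAL)""")

def run_and_log(conn, username, sql_text):
    start = time.time()
    rows = conn.execute(sql_text).fetchall()
    elapsed = time.time() - start
    log_db.execute("INSERT INTO query_history VALUES (?, ?, ?, ?)",
                   (start, username, sql_text, elapsed))
    return rows

# Example usage against a toy warehouse table.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE fact_sales (region TEXT, amount REAL)")
wh.execute("INSERT INTO fact_sales VALUES ('North', 100), ('South', 200)")
run_and_log(wh, "analyst1", "SELECT region, SUM(amount) FROM fact_sales GROUP BY region")
print(log_db.execute("SELECT username, elapsed_s FROM query_history").fetchall())
```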


18

Page 8: 0810 Its-dwdm Mb3g1it

sellingbuying which security and the amounts they have to receive or the amount they have to pay to TBS

It has found that information silos created are running contrary to the goal of the business intelligence organization architecture to ensure enterprise wide informational content to the broadest audience By utilizing the information properly it can enhance customer and supplier relationships improve the profitability of products and services create worthwhile new offerings better manage risk and pare expenses dramatically among many other gains TBS was feeling that it required a category of software tools that help analyze data stored in its database help users analyze different dimensions of the data such as time series and trend analysis views

The chairman of TBS felt that Online Analytical Processing (OLAP) was the need of the hour and decided to implement it immediately so that the processing part would be reduced significantly thereby increasing the efficiency of the organization

END OF CASELET 2

END OF SECTION B

Section C Applied Theory (20 Marks)bull This section consists of questions with serial number 7 - 8bull Answer all questions bull Marks are indicated against each questionbull Do not spend more than 25 - 30 minutes on Section C

7 What is Neural Network and discuss about various forms of Neural Networks ( 10 marks)

ltAnswergt

8 Explain the various responsibilities of a Query manager ( 10 marks)ltAnswergt

END OF SECTION C

END OF QUESTION PAPER

Suggested AnswersData Warehousing and Data Mining (MB3G1IT) October 2008

Section A Basic Concepts

Answer Reason

8

1 D The capacity plan for hardware and infrastructure is not determined in the business requirements stage it is identified in technical blueprint stage

lt TOP gt

2 B Warehouse manager is a system manager who performs backup and archiving the data warehouse

lt TOP gt

3 C Stored procedure tools implement Complex checking lt TOP gt

4 D Vertical partitioning can take two forms normalization and row splitting before using a vertical partitioning there should not be any requirements to perform major join operations between the two partitions in order to maximize the hardware partitioning maximize the processing power available

lt TOP gt

5 A Symmetric multi-processing machine is a set of tightly coupled CPUs that share memory and disk

lt TOP gt

6 A Redundant Array of Inexpensive Disks (RAID) Level 1 has full mirroring with each disk duplexed

lt TOP gt

7 A Snowflake schema is a variant of star schema where each dimension can have its own dimensions star schema is a logical structure that has a fact table in the center with dimension tables radiating off of this central table Starflake schema is a hybrid structure that contains a mix of star and snowflake schemas

lt TOP gt

8 A In database sizing if n is the number of concurrent queries allowed and P is the size

of the partition then temporary space (T) is set to T = (2n + 1)P

lt TOP gt

9 A Association rules that state a statistical correlation between the occurrence of certain attributes in a database table

lt TOP gt

10E Learning tasks can be divided intoI Classification tasksII Knowledge engineering tasksIII Problem-solving tasks

lt TOP gt

11C Shallow knowledge is the information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL) Hidden knowledge is the data that can be found relatively easily by using pattern recognition or machine-learning algorithms Multi-dimensional knowledge is the information that can be analyzed using online analytical processing tools Deep knowledge is the information that is stored in the database but can only be located if we have a clue that tells us where to look

lt TOP gt

12E There are some specific rules that govern the basic structure of a data warehouse namely that such a structure should be Time dependent Non-volatile Subject oriented Integrated

lt TOP gt

13D OLAP tools do not learn they create new knowledge and OLAP tools cannot search for new solutions data mining is more powerful than OLAP

lt TOP gt

14E Auditing is a specific subset of security that is often mandated by organizations As data warehouse is concerned the audit requirements can basically categorized asI ConnectionsII DisconnectionsIII Data accessIV Data change

lt TOP gt

15B Alexandria backup software package was produced by Sequent lt TOP gt

16A Aggregations are performed in order to speed up common queries too many aggregations will lead to unacceptable operational costs too few aggregations will lead to an overall lack of system performance

9

17E All unit testing should be complete before any test plan is enacted in integration testing the separate development units that make up a component of the data warehouse application are tested to ensure that they work together in system testing the whole data warehouse application is tested together

lt TOP gt

18C A rule-based optimizer uses known rules to perform the function a cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement ldquoNumber of rows in the tablerdquo is generally collected by cost-based optimizer

lt TOP gt

19D Data shipping is where a process requests for the data to be shipped to the location where the process is running function shipping is where the function to be performed is moved to the locale of the data architectures which are designed for shared-nothing or distributed environments use function shipping exclusively They can achieve parallelism as long as the data is partitioned or distributed correctly

lt TOP gt

20E The common restrictions that may apply to the handling of views areI Restricted Data Manipulation Language (DML) operationsII Lost query optimization pathsIII Restrictions on parallel processing of view projections

lt TOP gt

21A One petabyte is equal to 1024 terabytes lt TOP gt

22C The formula for the construction of a genetic algorithm for the solution of a problem has the following stepsI Devise a good elegant coding of the problem in terms of strings of a limited

alphabetII Invent an artificial environment in the computer where the solutions can join in

battle with each other Provide an objective rating to judge success or failure in professional terms called a fitness function

III Develop ways in which possible solutions can be combined Here the so-called cross-over operation in which the fatherrsquos and motherrsquos strings are simply cut and after changing stuck together again is very popular In reproduction all kinds of mutation operators can be applied

IV Provide a well-varied initial population and make the computer play lsquoevolutionrsquo by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions Stop when a family of successful solutions has been produced

lt TOP gt

23C ADSM backup software package was produced by IBM lt TOP gt

24B Data encapsulation is not a stage in the knowledge discovery process lt TOP gt

25E Customer profiling CAPTAINS and reverse engineering are applications of data mining

lt TOP gt

26E Except load manager all the other managers are part of system managers in a data warehouse

lt TOP gt

27B A group of similar objects that differ significantly from other objects is known as Clustering

lt TOP gt

28A A perceptron consists of a simple three-layered network with input units called Photo-receptors

lt TOP gt

29D In Freudrsquos theory of psychodynamics the human brain was described as a neural network

lt TOP gt

30E The tasks maintained by the query managerI Query syntaxII Query execution planIII Query elapsed time

lt TOP gt

Section B Caselets

1 Architecture of a data warehouse lt TOP

10

Load Manager ArchitectureThe architecture of a load manager is such that it performs the following operations1 Extract the data from the source system2 Fast-load the extracted data into a temporary data store3 Perform simple transformations into a structure similar to the one in the data warehouse

Load manager architectureWarehouse Manager ArchitectureThe architecture of a warehouse manager is such that it performs the following operations1 Analyze the data to perform consistency and referential integrity checks2 Transform and merge the source data in the temporary data store into the published data warehouse3 Create indexes business views partition views business synonyms against the base data4 Generate denormalizations if appropriate5 Generate any new aggregations that may be required6 Update all existing aggregation7 Back up incrementally or totally the data within the data warehouse8 Archive data that has reached the end of its capture lifeIn some cases the warehouse manager also analyzes query profiles to determine which indexes and aggregations are appropriate

Architecture of a warehouse managerQuery Manager ArchitectureThe architecture of a query manager is such that it performs the following operations1 Direct queries to the appropriate table(s)2 Schedule the execution of user queriesThe actual problem specified is tight project schedule within which it had to be delivered The field errors had to be reduced to a great extent as the solution was for the company The requirements needed to be defined very clearly and there was a need for a scalable and reliable architecture and solutionThe study had conducted on the companyrsquos current business information requirements current process of getting that information and prepared a business case for a data warehousing and business intelligence solution

gt

11

2 Metrics are essential in the assessment of software development quality They may provide information about the development process itself and the yielded products Metrics may be grouped into Quality Areas which define a perspective for metrics interpretation The adoption of a measurement program includes the definition of metrics that generate useful information To do so organizationrsquos goals have to be defined and analyzed along with what the metrics are expected to deliver Metrics may be classified as direct and indirect A direct metric is independent of the measurement of any other Indirect metrics also referred to as derived metrics represent functions upon other metrics direct or derived Productivity (code size programming time) is an example of derived metric The existence of a timely and accurate capturing mechanism for direct metrics is critical in order to produce reliable results Indicators establish the quality factors defined in a measurement program Metrics also have a number of components and for data warehousing can be broken down in the following manner Objects - the ldquothemesrdquo in the data warehouse environment which need to be assessed Objects can include business drivers warehouse contents refresh processes accesses and tools Subjects - things in the data warehouse to which we assign numbers or a quantity For example subjects include the cost or value of a specific warehouse activity access frequency duration and utilization Strata - a criterion for manipulating metric information This might include day of the week specific tables accessed location time or accesses by departmentThese metric components may be combined to define an ldquoapplicationrdquo which states how the information will be applied For example ldquoWhen actual monthly refresh cost exceeds targeted monthly refresh cost the value of each data collection in the warehouse must be re-establishedrdquo There are several data warehouse project management metrics worth considering The first three arebull Business Return On Investment (ROI)

The best metric to use is business return on investment Is the business achieving bottom line success (increased sales or decreased expenses) through the use of the data warehouse This focus will encourage the development team to work backwards to do the right things day in and day out for the ultimate arbiter of success -- the bottom line

bull Data usage The second best metric is data usage You want to see the data warehouse used for its intended purposes by the target users The objective here is increasing numbers of users and complexity of usage With this focus user statistics such as logins and query bands are tracked

bull Data gathering and availability The third best data warehouse metric category is data gathering and availability Under this focus the data warehouse team becomes an internal data brokerage serving up data for the organizationrsquos consumption Success is measured in the availability of the data more or less according to a service level agreement I would say to use these business metrics to gauge the success

lt TOP gt

3 The important characteristics of data warehouse areTime dependent That is containing information collected over time which implies there must always be a connection between information in the warehouse and the time when it was entered This is one of the most important aspects of the warehouse as it related to data mining because information can then be sourced according to periodNon-volatile That is data in a data warehouse is never updated but used only for queries Thus such data can only be loaded from other databases such as the operational database End-users who want to update must use operational database as only the latter can be updated changed or deleted This means that a data warehouse will always be filled with historical dataSubject oriented That is built around all the existing applications of the operational data Not all the information in the operational database is useful for a data warehouse since the data warehouse is designed specifically for decision support while the operational database contains information for day-to-day useIntegrated That is it reflects the business information of the organization In an operational data environment we will find many types of information being used in a variety of applications and some applications will be using different names for the same entities

lt TOP gt

12

However in a data warehouse it is essential to integrate this information and make it consistent only one name must exist to describe each individual entityThe following are the features of a data warehousebull A scalable information architecture that will allow the information base to be extended and

enhanced over time bull Detailed analysis of member patterns including trading delivery and funds payment bull Fraud detection and sequence of event analysis bull Ease of reporting on voluminous historical data bull Provision for ad hoc queries and reporting facilities to enhance the efficiency of

knowledge workers bull Data mining to identify the co-relation between apparently independent entities

4 Due to the principal role of Data warehouses in making strategy decisions data warehouse quality is crucial for organizations The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing resulting in bull Reduced development and maintenance costsbull Improved software products qualitybull Reduced project cycle timebull Increased customer satisfactionbull Improved staff morale thanks to predictable results in stable conditions with less overtime

crisisturnoverQuality assurance means different things to different individuals To some QA means testing but quality cannot be tested at the end of a project It must be built in as the solution is conceived evolves and is developed To some QA resources are the ldquoprocess policerdquo ndash nitpickers insisting on 100 compliance with a defined development process methodology Rather it is important to implement processes and controls that will really benefit the project Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements Testing is just one of those activities In the typical software QA methodology the key tasks are bull Articulate the development methodology for all to knowbull Rigorously define and inspect the requirementsbull Ensure that the requirements are testablebull Prioritize based on riskbull Create test plansbull Set up the test environment and databull Execute test casesbull Document and manage defects and test resultsbull Gather metrics for management decisionsbull Assess readiness to implement Quality assurance (QA) in a data warehousebusiness intelligence environment is a challenging undertaking For one thing very little is written about business intelligence QA Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions However data quality does not make for BI quality assurance and practitioners within the software QA discipline focus almost exclusively on application development efforts They do not seem to appreciate the unique aspects of quality assurance in a data warehousebusiness intelligence environment An effective software QA should be ingrained within each DWBI project It should have the following characteristics bull QA goals and objectives should be defined from the outset of the projectbull The role of QA should be clearly defined within the project organizationbull The QA role needs to be staffed with talented resources well trained in the techniques

needed to evaluate the data in the types of sources that will be used

lt TOP gt

13

bull QA processes should be embedded to provide a self-monitoring update cyclebull QA activities are needed in the requirements design mapping and development project

phases

5 Online Analytical Processing (OLAP) a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of multidimensional data For example it provides time series and trend analysis views OLAP often is used in data mining The chief component of OLAP is the OLAP server which sits between a client and a Database Management Systems (DBMS) The OLAP server understands how data is organized in the database and has special functions for analyzing the data There are OLAP servers available for nearly all the major database systems OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view Designed for managers looking to make sense of their information OLAP tools structure data hierarchically ndash the way managers think of their enterprises but also allows business analysts to rotate that data changing the relationships to get more detailed insight into corporate information OLAP tools are geared towards slicing and dicing of the data As such they require a strong metadata layer as well as front-end flexibility Those are typically difficult features for any home-built systems to achieve The term lsquoon-line analytic processingrsquo is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations Decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies The most common way to access a data mart or data warehouse is to run reports Another very popular approach is to use OLAP tools To compare different types of reporting and analysis interface it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of useAd hoc queries as the name suggests are queries written by (or for) the end user as a one-off exercise The only limitations are the capabilities of the reporting tool and the data available Ad hoc reporting requires greater expertise but need not involve programming as most modern reporting tools are able to generate SQL OLAP tools can be thought of as interactive reporting environments they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports They are excellent for exploring summarised data and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details

lt TOP gt

6. The usage of tools against a data warehouse can be classified into three broad categories:
i. Data dipping
ii. Data mining
iii. Data analysis

Data dipping tools: These are the basic business tools. They allow the generation of standard business reports and can perform basic analysis, answering standard business questions. As these tools are relational, they can also be used as data browsers and generally have reasonable drill-down capabilities. Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business-friendly schema.

Data mining tools: These are specialist tools designed for finding trends and patterns in the underlying data. They use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious. A data mining tool could be used to find common behavioral trends in a business's customers, or to root out market segments by grouping customers with common attributes.

Data analysis tools: These are used to perform complex analysis of data. They will normally have a rich set of analytic functions which allow sophisticated analysis of the data. These tools are designed for business analysis and will generally understand the common business metrics. Data analysis tools can again be subdivided into two categories: Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP). Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries, and it is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.

MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
• Can perform complex calculations: all calculations have been pre-generated when the cube is created. Hence complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible, but in that case only summary-level information will be included in the cube itself.
• Requires additional investment: cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.

ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a WHERE clause to the SQL statement (see the sketch after this answer).
Advantages:
• Can handle large amounts of data: the data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: often the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: because each ROLAP report is essentially a SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large.
• Limited by SQL functionalities: because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions, as well as the ability to allow users to define their own functions.
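As a rough illustration of the point that each ROLAP slice-and-dice action amounts to another WHERE clause, the sketch below shows how a tool might translate a set of dimension filters into a SQL string. The fact table name sales_fact and the column names are hypothetical, and bind-parameter placeholders are used rather than literal values.

```python
def build_rolap_query(measures, filters, group_by):
    """Translate a slice/dice request into a SQL string.

    measures : dict of alias -> aggregate expression
    filters  : dict of column -> value (each one becomes a WHERE predicate)
    group_by : list of dimension columns to keep in the result
    """
    select_list = group_by + [f"{expr} AS {alias}" for alias, expr in measures.items()]
    where = " AND ".join(f"{col} = :{col}" for col in filters) or "1 = 1"
    return (
        f"SELECT {', '.join(select_list)} "
        f"FROM sales_fact "          # hypothetical fact table
        f"WHERE {where} "
        f"GROUP BY {', '.join(group_by)}"
    )

# Dicing on region and year, keeping the month dimension:
sql = build_rolap_query(
    measures={"total_sales": "SUM(sales_amount)"},
    filters={"region": "South", "fiscal_year": 2008},
    group_by=["month"],
)
print(sql)
# -> SELECT month, SUM(sales_amount) AS total_sales FROM sales_fact
#    WHERE region = :region AND fiscal_year = :fiscal_year GROUP BY month
```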

Section C : Applied Theory

7. Neural networks: Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. In Freud's theory of psychodynamics the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10¹¹, connected to each other via a huge number of so-called synapses. A single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware, but most are just software programs that can operate on normal computers. Typically, a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes. When using neural networks we have to distinguish between two stages: the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions or execute whatever learning task is involved.

There are several different forms of neural network, but we shall discuss only three of them here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing maps

In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators, and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks. Later, in 1969, Minsky and Papert showed that the class of problems that could be solved by a machine with a perceptron architecture was very limited. It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcome these difficulties. A major improvement was the introduction of hidden layers in the so-called back propagation networks. A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes. In its initial stage, a back propagation network has random weightings on its synapses. When we train the network, we expose it to a training set of input data. For each training instance, the actual output of the network is compared with the desired output that would give a correct answer; if there is a difference between the correct answer and the actual answer, the weightings of the individual nodes and synapses of the network are adjusted. This process is repeated until the responses are more or less accurate (a minimal sketch of this error-correction loop is given after the figures below). Once the structure of the network stabilizes, the learning stage is over and the network is trained and ready to categorize unknown input.

Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node each. There are four additional nodes identifying the four areas, so that each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is clear that this coding corresponds well with the information stored in the database. The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are wholly interconnected to the output nodes. In an untrained network the branches between the nodes have equal weights. During the training stage, the network receives examples of input and output pairs corresponding to records in the database and adapts the weights of the different branches until all the inputs match the appropriate outputs. In Figure 2 the network learns to recognize readers of the car magazine and comics, and Figure 3 shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers; however, the networks do not provide a rule to identify this association.

Back propagation networks are a great improvement on the perceptron architecture. However, they also have disadvantages, one being that they need an extremely large training set. Another problem of neural networks is that, although they learn, they do not provide us with a theory about what they have learned; they are simply black boxes that give answers but provide no clear idea as to how they arrived at these answers. In 1981 Teuvo Kohonen demonstrated a completely different version of neural networks, currently known as Kohonen's self-organizing maps. These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain, such as visual maps, maps of the spatial possibilities of limbs, and so on. A Kohonen self-organizing map is a collection of neurons or units, each of which is connected to a small number of other units called its neighbors. Most of the time the Kohonen map is two-dimensional; each node or unit contains a vector that is related to the space whose structure we are investigating. In its initial setting, the self-organizing map has a random assignment of vectors to each unit. During the training stage these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the so-called Kohonen movie, which is a series of frames showing the positions of the vectors and their connections with neighboring cells. The network resembles an elastic surface that is pulled out over the sample space. Neural networks perform well on classification tasks and can be very useful in data mining.

Figure 1

Figure 2

Figure 3
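A minimal sketch of the error-correction training loop described above, using a single perceptron on a toy yes/no encoding similar to the marketing example; the training data, learning rate and epoch count are invented for illustration, and a real back propagation network would add hidden layers and gradient-based weight updates.

```python
# Minimal perceptron: inputs are yes/no (1/0) attributes such as the
# age-class and ownership flags described in the text; the output is a
# yes/no prediction (e.g. "reads the car magazine"). Toy data only.
training_set = [
    # (car_owner, house_owner, age_20_35) -> reads_car_magazine
    ((1, 0, 1), 1),
    ((1, 1, 0), 1),
    ((0, 1, 0), 0),
    ((0, 0, 1), 0),
]

weights = [0.0, 0.0, 0.0]
bias = 0.0
learning_rate = 0.1

def predict(x):
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0

# Error-correction training: adjust the weights whenever the actual output
# differs from the desired output, and repeat until the network stabilizes.
for _ in range(20):
    for x, desired in training_set:
        error = desired - predict(x)
        if error != 0:
            bias += learning_rate * error
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]

print(weights, bias)
print([predict(x) for x, _ in training_set])   # expected: [1, 1, 0, 0]
```

8. The query manager has several distinct responsibilities. It is used to control the following: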

• User access to the data
• Query scheduling
• Query monitoring

These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one of the most bespoke pieces of software in the data warehouse.

User access to the data: The query manager is the software interface between the users and the data. It presents the data to the users in a form they understand, and it also controls the user access to the data. In a data warehouse the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, raw data is often abstracted. Data in this raw format can often be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult. The query manager's task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will at one end take in the user's requirements and, in the background, using the metadata, it will transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is possible that not all tools will work directly via the query manager. If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly, no large ad hoc queries should be allowed to be run by anyone other than the query manager. It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up. If queries do bypass the query manager, query statistics gathering will be less accurate.

Query scheduling: Scheduling of ad hoc queries is a responsibility of the query manager. Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system, in particular if the queries are run using parallelism, where a single query can potentially use all the CPU resource made available to it. One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.

Query monitoring: One of the main functions of the query manager is to monitor the queries as they run. This is one of the reasons why all queries should be run via, or at least notified to, the query manager. One of the keys to the successful use of a data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, the query profiles of different groups of users need to be known, which can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used and the query syntax used. The query manager has to be capable of gathering these statistics, which should then be stored in the database for later analysis (a minimal sketch of such logging is given below). It should also maintain a query history: every query created or executed via the query manager should be logged. This allows query profiles to be built up over time and enables identification of frequently run queries or types of queries. These queries can then be tuned, possibly by adding new indexes or by creating new aggregations.
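As a rough sketch of this statistics-gathering role, the example below logs each query's text, start time and elapsed time into a query-history table. Python's built-in sqlite3 module stands in for the warehouse database, and all table and column names are invented.

```python
import sqlite3
import time

# Stand-in query-history store; a real query manager would keep this
# inside the warehouse database itself.
history = sqlite3.connect(":memory:")
history.execute(
    """CREATE TABLE query_history (
           user_name   TEXT,
           query_text  TEXT,
           started_at  REAL,
           elapsed_sec REAL
       )"""
)

def run_via_query_manager(conn, user_name, sql, params=()):
    """Execute a query and record the statistics needed for later profiling."""
    started = time.time()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.time() - started
    history.execute(
        "INSERT INTO query_history VALUES (?, ?, ?, ?)",
        (user_name, sql, started, elapsed),
    )
    history.commit()
    return rows

# Toy usage: the warehouse itself is faked with another in-memory database.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (region TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                      [("North", 120.0), ("South", 90.0)])
print(run_via_query_manager(warehouse, "analyst1",
                            "SELECT region, SUM(amount) FROM sales_fact GROUP BY region"))

# Frequently run statements can then be identified for tuning:
print(history.execute(
    "SELECT query_text, COUNT(*) FROM query_history GROUP BY query_text").fetchall())
```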



5 Online Analytical Processing (OLAP) a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of multidimensional data For example it provides time series and trend analysis views OLAP often is used in data mining The chief component of OLAP is the OLAP server which sits between a client and a Database Management Systems (DBMS) The OLAP server understands how data is organized in the database and has special functions for analyzing the data There are OLAP servers available for nearly all the major database systems OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view Designed for managers looking to make sense of their information OLAP tools structure data hierarchically ndash the way managers think of their enterprises but also allows business analysts to rotate that data changing the relationships to get more detailed insight into corporate information OLAP tools are geared towards slicing and dicing of the data As such they require a strong metadata layer as well as front-end flexibility Those are typically difficult features for any home-built systems to achieve The term lsquoon-line analytic processingrsquo is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations Decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies The most common way to access a data mart or data warehouse is to run reports Another very popular approach is to use OLAP tools To compare different types of reporting and analysis interface it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of useAd hoc queries as the name suggests are queries written by (or for) the end user as a one-off exercise The only limitations are the capabilities of the reporting tool and the data available Ad hoc reporting requires greater expertise but need not involve programming as most modern reporting tools are able to generate SQL OLAP tools can be thought of as interactive reporting environments they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports They are excellent for exploring summarised data and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details

lt TOP gt

6 Classifying the usage of tools against a data warehouse into three broad categoriesi Data dippingii Data miningiii Data analysisData dipping toolsThese are the basic business tools They allow the generation of standard business reports They can perform basic analysis answering standard business questions As these tools are relational they can also be used as data browsers and generally have reasonable drill-down capabilities Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business friendly schemaData mining toolsThese are specialist tools designed for finding trends and patterns in the underlying data These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious A data mining tool could be used to find common behavioral trends in a businessrsquos customers or to root out market segments by grouping customers with common attributesData analysis toolsThese are used to perform complex analysis of data They will normally have a rich set of analytic functions which allow sophisticated analysis of the data These tools are designed for business analysis and will generally understand the common business metrics Data analysis tools can again be subdivided in to two categories Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP) Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of

lt TOP gt

14

multidimensional data For example it provides time series and trend analysis views OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses MOLAP This is the more traditional way of OLAP analysis In MOLAP data is stored in a multidimensional cube The storage is not in the relational database but in proprietary formats Advantages bull Excellent performance MOLAP cubes are built for fast data retrieval and is optimal for

slicing and dicing operations bull Can perform complex calculations All calculations have been pre-generated when the

cube is created Hence complex calculations are not only doable but they return quickly Disadvantages bull Limited in the amount of data it can handle Because all calculations are performed when

the cube is built it is not possible to include a large amount of data in the cube itself This is not to say that the data in the cube cannot be derived from a large amount of data Indeed this is possible But in this case only summary-level information will be included in the cube itself

bull Requires additional investment Cube technology are often proprietary and do not already exist in the organization Therefore to adopt MOLAP technology chances are additional investments in human and capital resources are needed

ROLAP This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAPrsquos slicing and dicing functionality In essence each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement Advantages bull Can handle large amounts of data The data size limitation of ROLAP technology is the

limitation on data size of the underlying relational database In other words ROLAP itself places no limitation on data amount

bull Can leverage functionalities inherent in the relational database Often relational database already comes with a host of functionalities ROLAP technologies since they sit on top of the relational database can therefore leverage these functionalities

Disadvantages bull Performance can be slow Because each ROLAP report is essentially a SQL query (or

multiple SQL queries) in the relational database the query time can be long if the underlying data size is large

bull Limited by SQL functionalities Because ROLAP technology mainly relies on generating SQL statements to query the relational database and SQL statements do not fit all needs (for example it is difficult to perform complex calculations using SQL) ROLAP technologies are therefore traditionally limited by what SQL can do ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions

Section C Applied Theory

7 Neural networks Genetic algorithms derive their inspiration from biology while neural networks are modeled on the human brain In Freudrsquos theory of psychodynamics the human brain was described as a neural network and recent investigations have corroborated this view The human brain consists of a very large number of neurons about1011 connected to each other via a huge number of so-called synapses A single neuron is connected to other neurons by a couple of thousand of these synapses Although neurons could be described as the simple building blocks of the brain the human brain can handle very complex tasks despite this relative sim-plicity This analogy therefore offers an interesting model for the creation of more complex learning machines and has led to the creation of so-called artificial neural networks Such networks can be built using special hardware but most are just software programs that can operate on normal computers Typically a neural network consists of a set of nodes input nodes receive the input signals output nodes give the output signals and a potentially unlimited number of intermediate layers contain the intermediate nodes When using neural

lt TOP gt

15

networks we have to distinguish between two stages - the encoding stage in which the neural network is trained to perform a certain task and the decoding stage in which the network is used to classify examples make predictions or execute whatever learning task is involved There are several different forms of neural network but we shall discuss only three of them here bull Perceptrons bull Back propagation networks bull Kohonen self-organizing map In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron one of the first implementations of what would later be known as a neural network A perceptron consists of a simple three-layered network with input units called photo-receptors intermediate units called associators and output units called responders The perceptron could learn simple categories and thus could be used to perform simple classification tasks Later in 1969 Minsky and Papert showed that the class of problem that could be solved by a machine with a perceptron architecture was very limited It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcomemiddot these difficulties A major improvement was the intro-duction of hidden layers in the so-called back propagation networks A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes In its initial stage a back propagation network has random weightings on its synapses When we train the network we expose it to a training set of input data For each training instance the actual output of the network is compared with the desired output that would give a correct answer if there is a difference between the correct answer and the actual answer the weightings of the individual nodes and synapses of the network are adjusted This process is repeated until the responses are more or less accurate Once the structure of the network stabilizes the learning stage is over and the network is now trained and ready to categorize unknown input Figure1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database The age attribute has been split into three age classes each represented by a separate input node house and car ownership also have an input node There are four addi-tional nodes identifying the four areas so that in this way each input node corresponds to a simple yes-no Decision The same holds for the output nodes each magazine has a node It is clear that this coding corresponds well with the information stored in the database The input nodes are wholly interconnected to the hidden nodes and the hidden nodes are wholly interconnected to the output nodes In an untrained network the branches between the nodes have equal weights During the training stage the network receives examples of input and output pairs corresponding to records in the database and adapts the weights of the different branches until all the inputs match the appropriate outputsIn Figure 2 the network learns to recognize readers of the car magazine and comics Figure 3 shows the internal state of the network after training The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers However the networks do not provide a rule to identify this association Back propagation networks are a great improvement on the perceptron architecture However they also have disadvantages one being that they need an 
extremely large training set Another problem of neural networks is that although they learn they do not provide us with a theory about what they have learned - they are simply black boxes-that give answers but provide no clear idea as to how they arrived at these answers In 1981 Tuevo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonenrsquos self-organizing maps These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain such as visual maps maps of the spatial possibilities of limbs and so on A Kohonen self-organizing map is a collection of neurons or units each of which is connected to a small number of other units called its neighbors Most of the time the Kohonen map is two- dimensional each node or unit contains a factor that is related to the space whose structure we are investigating In its initial setting the self-organizing map has a random assignment of vectors to each unit During the training stage these vectors are incre mentally adjusted to give a better coverage of the space A natural way to visualize the process of training a self- organizing map is the so-called Kohonen movie which is a series of frames showing the positions of the vectors and their connections with neighboring cells The network resembles an elastic surface that is pulled out over the sample space Neural networks perform well on classification tasks and

16

can be very useful in data mining

Figure 1

Figure2

Figure 38 The query manager has several distinct responsibilities It is used to control the following

bull User access to the databull Query schedulingbull Query monitoringThese areas are all very different in nature and each area requires its own tools bespoke software and procedures The query manager is one of the most bespoke pieces of software in the data warehouseUser access to the data The query manager is the software interface between the users and the data It presents the data to the users in a form they understand It also controls the user

lt TOP gt

17

access to the data In a data warehouse the raw data will often be an amalgamation of data needs to be tied together somehow to achieve this raw data is often abstracted Data in this raw format can often be difficult to interpret This coupled with the fact that data from a single logical table is often partitioned into multiple real tables can make ad hoc querying of raw data difficultThe query managerrsquos task is to address this problem by presenting a meaningful schema to the users via a friendly front end The query manager will at one end take in the userrsquos requirements and in the background using the metadata it will transform these requirements into queries against the appropriate dataIdeally all user access tools should work via the query manager However as a number of different tools are likely to be used and the tools used are likely to change over time it is possible that not all tools will work directly via the query managerIf users have access via tools that do not interface directly through the query manager you should try setting up some form of indirect control by the query manager you should try setting up some form of indirect control by the query manager Certainly no large ad hoc queries should be allowed to be run by anyone other than the query manager It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up If queries do bypass the query manager query statistics gathering will be less accurateQuery Scheduling Scheduling of ad hoc queries is a responsibility of the query manager Simultaneous large ad hoc queries if not controlled can severely affect the performance of any system in particular if the queries are run using parallelism where a single query can potentially use all the CPU resource made available to it One aspect of query control that it is glaringly visible by its absence is the ability to predict how long a query will take to completeQuery monitoring One of the main functions of the query manager is to monitor the queries as they run This is one of the reasons why all queries should be run via or at least notified to the query manager One of the keys to success of usage of data ware house is to that success is the tuning of ad hoc environment to meet the userrsquos needs To achieve this query profiles of different groups of users need to be known This can be achieved only if there is long-term statistics on the queries run by each user and the resources used by each query The query execution plan needs to be stored along the statistics of the resources used and the query syntax usedThe query manager has to be capable of gathering these statistics which should then be stored in the database for later analysis It should also maintain a query history Every query created or executed via query manager should be logged This allows query profiles to be built up over time This enables identification of frequently run queries or types of queries These queries can then be tuned possibly by adding new indexes or by creating new aggregations

lt TOP OF THE DOCUMENT gt

18

Page 10: 0810 Its-dwdm Mb3g1it

17E All unit testing should be complete before any test plan is enacted in integration testing the separate development units that make up a component of the data warehouse application are tested to ensure that they work together in system testing the whole data warehouse application is tested together

lt TOP gt

18C A rule-based optimizer uses known rules to perform the function a cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement ldquoNumber of rows in the tablerdquo is generally collected by cost-based optimizer

lt TOP gt

19D Data shipping is where a process requests for the data to be shipped to the location where the process is running function shipping is where the function to be performed is moved to the locale of the data architectures which are designed for shared-nothing or distributed environments use function shipping exclusively They can achieve parallelism as long as the data is partitioned or distributed correctly

lt TOP gt

20E The common restrictions that may apply to the handling of views areI Restricted Data Manipulation Language (DML) operationsII Lost query optimization pathsIII Restrictions on parallel processing of view projections

lt TOP gt

21A One petabyte is equal to 1024 terabytes lt TOP gt

22C The formula for the construction of a genetic algorithm for the solution of a problem has the following stepsI Devise a good elegant coding of the problem in terms of strings of a limited

alphabetII Invent an artificial environment in the computer where the solutions can join in

battle with each other Provide an objective rating to judge success or failure in professional terms called a fitness function

III Develop ways in which possible solutions can be combined Here the so-called cross-over operation in which the fatherrsquos and motherrsquos strings are simply cut and after changing stuck together again is very popular In reproduction all kinds of mutation operators can be applied

IV Provide a well-varied initial population and make the computer play lsquoevolutionrsquo by removing the bad solutions from each generation and replacing them with progeny or mutations of good solutions Stop when a family of successful solutions has been produced

lt TOP gt

23C ADSM backup software package was produced by IBM lt TOP gt

24B Data encapsulation is not a stage in the knowledge discovery process lt TOP gt

25E Customer profiling CAPTAINS and reverse engineering are applications of data mining

lt TOP gt

26E Except load manager all the other managers are part of system managers in a data warehouse

lt TOP gt

27B A group of similar objects that differ significantly from other objects is known as Clustering

lt TOP gt

28A A perceptron consists of a simple three-layered network with input units called Photo-receptors

lt TOP gt

29D In Freudrsquos theory of psychodynamics the human brain was described as a neural network

lt TOP gt

30E The tasks maintained by the query managerI Query syntaxII Query execution planIII Query elapsed time

lt TOP gt

Section B Caselets

1 Architecture of a data warehouse lt TOP

10

Load Manager ArchitectureThe architecture of a load manager is such that it performs the following operations1 Extract the data from the source system2 Fast-load the extracted data into a temporary data store3 Perform simple transformations into a structure similar to the one in the data warehouse

Load manager architectureWarehouse Manager ArchitectureThe architecture of a warehouse manager is such that it performs the following operations1 Analyze the data to perform consistency and referential integrity checks2 Transform and merge the source data in the temporary data store into the published data warehouse3 Create indexes business views partition views business synonyms against the base data4 Generate denormalizations if appropriate5 Generate any new aggregations that may be required6 Update all existing aggregation7 Back up incrementally or totally the data within the data warehouse8 Archive data that has reached the end of its capture lifeIn some cases the warehouse manager also analyzes query profiles to determine which indexes and aggregations are appropriate

Architecture of a warehouse managerQuery Manager ArchitectureThe architecture of a query manager is such that it performs the following operations1 Direct queries to the appropriate table(s)2 Schedule the execution of user queriesThe actual problem specified is tight project schedule within which it had to be delivered The field errors had to be reduced to a great extent as the solution was for the company The requirements needed to be defined very clearly and there was a need for a scalable and reliable architecture and solutionThe study had conducted on the companyrsquos current business information requirements current process of getting that information and prepared a business case for a data warehousing and business intelligence solution

gt

11

2 Metrics are essential in the assessment of software development quality They may provide information about the development process itself and the yielded products Metrics may be grouped into Quality Areas which define a perspective for metrics interpretation The adoption of a measurement program includes the definition of metrics that generate useful information To do so organizationrsquos goals have to be defined and analyzed along with what the metrics are expected to deliver Metrics may be classified as direct and indirect A direct metric is independent of the measurement of any other Indirect metrics also referred to as derived metrics represent functions upon other metrics direct or derived Productivity (code size programming time) is an example of derived metric The existence of a timely and accurate capturing mechanism for direct metrics is critical in order to produce reliable results Indicators establish the quality factors defined in a measurement program Metrics also have a number of components and for data warehousing can be broken down in the following manner Objects - the ldquothemesrdquo in the data warehouse environment which need to be assessed Objects can include business drivers warehouse contents refresh processes accesses and tools Subjects - things in the data warehouse to which we assign numbers or a quantity For example subjects include the cost or value of a specific warehouse activity access frequency duration and utilization Strata - a criterion for manipulating metric information This might include day of the week specific tables accessed location time or accesses by departmentThese metric components may be combined to define an ldquoapplicationrdquo which states how the information will be applied For example ldquoWhen actual monthly refresh cost exceeds targeted monthly refresh cost the value of each data collection in the warehouse must be re-establishedrdquo There are several data warehouse project management metrics worth considering The first three arebull Business Return On Investment (ROI)

The best metric to use is business return on investment Is the business achieving bottom line success (increased sales or decreased expenses) through the use of the data warehouse This focus will encourage the development team to work backwards to do the right things day in and day out for the ultimate arbiter of success -- the bottom line

bull Data usage The second best metric is data usage You want to see the data warehouse used for its intended purposes by the target users The objective here is increasing numbers of users and complexity of usage With this focus user statistics such as logins and query bands are tracked

bull Data gathering and availability The third best data warehouse metric category is data gathering and availability Under this focus the data warehouse team becomes an internal data brokerage serving up data for the organizationrsquos consumption Success is measured in the availability of the data more or less according to a service level agreement I would say to use these business metrics to gauge the success

lt TOP gt

3 The important characteristics of data warehouse areTime dependent That is containing information collected over time which implies there must always be a connection between information in the warehouse and the time when it was entered This is one of the most important aspects of the warehouse as it related to data mining because information can then be sourced according to periodNon-volatile That is data in a data warehouse is never updated but used only for queries Thus such data can only be loaded from other databases such as the operational database End-users who want to update must use operational database as only the latter can be updated changed or deleted This means that a data warehouse will always be filled with historical dataSubject oriented That is built around all the existing applications of the operational data Not all the information in the operational database is useful for a data warehouse since the data warehouse is designed specifically for decision support while the operational database contains information for day-to-day useIntegrated That is it reflects the business information of the organization In an operational data environment we will find many types of information being used in a variety of applications and some applications will be using different names for the same entities

lt TOP gt

12

However in a data warehouse it is essential to integrate this information and make it consistent only one name must exist to describe each individual entityThe following are the features of a data warehousebull A scalable information architecture that will allow the information base to be extended and

enhanced over time bull Detailed analysis of member patterns including trading delivery and funds payment bull Fraud detection and sequence of event analysis bull Ease of reporting on voluminous historical data bull Provision for ad hoc queries and reporting facilities to enhance the efficiency of

knowledge workers bull Data mining to identify the co-relation between apparently independent entities

4. Due to the principal role of data warehouses in making strategy decisions, data warehouse quality is crucial for organizations. The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing, resulting in:
• Reduced development and maintenance costs
• Improved software product quality
• Reduced project cycle time
• Increased customer satisfaction
• Improved staff morale, thanks to predictable results in stable conditions, with less overtime, crisis and turnover
Quality assurance means different things to different individuals. To some, QA means testing, but quality cannot be tested at the end of a project; it must be built in as the solution is conceived, evolves and is developed. To some, QA resources are the "process police", nitpickers insisting on 100% compliance with a defined development process methodology. Rather, it is important to implement processes and controls that will really benefit the project. Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements. Testing is just one of those activities. In the typical software QA methodology, the key tasks are:
• Articulate the development methodology for all to know
• Rigorously define and inspect the requirements
• Ensure that the requirements are testable
• Prioritize based on risk
• Create test plans
• Set up the test environment and data
• Execute test cases
• Document and manage defects and test results
• Gather metrics for management decisions
• Assess readiness to implement
Quality assurance (QA) in a data warehouse/business intelligence environment is a challenging undertaking. For one thing, very little is written about business intelligence QA. Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions. However, data quality does not make for BI quality assurance, and practitioners within the software QA discipline focus almost exclusively on application development efforts. They do not seem to appreciate the unique aspects of quality assurance in a data warehouse/business intelligence environment. An effective software QA should be ingrained within each DW/BI project. It should have the following characteristics:
• QA goals and objectives should be defined from the outset of the project
• The role of QA should be clearly defined within the project organization
• The QA role needs to be staffed with talented resources, well trained in the techniques needed to evaluate the data in the types of sources that will be used

• QA processes should be embedded to provide a self-monitoring update cycle
• QA activities are needed in the requirements, design, mapping and development project phases
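
As one concrete example of such an embedded QA activity, a warehouse load can be checked by reconciling row counts and totals between source and target. The sketch below is a generic illustration using Python's built-in sqlite3 module; the table and column names are assumptions, not part of the answer.

# Illustrative load-reconciliation check (hypothetical table/column names).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_sales (sale_id INTEGER, amount REAL);
    CREATE TABLE dw_sales  (sale_id INTEGER, amount REAL);
    INSERT INTO src_sales VALUES (1, 100.0), (2, 250.5), (3, 75.25);
    INSERT INTO dw_sales  VALUES (1, 100.0), (2, 250.5), (3, 75.25);
""")

def check(name, sql_src, sql_dw):
    # Compare one figure computed on the source with the same figure in the warehouse.
    src = conn.execute(sql_src).fetchone()[0]
    dw = conn.execute(sql_dw).fetchone()[0]
    status = "OK" if src == dw else "MISMATCH"
    print(f"{name}: source={src} warehouse={dw} -> {status}")

check("row count", "SELECT COUNT(*) FROM src_sales", "SELECT COUNT(*) FROM dw_sales")
check("total amount", "SELECT SUM(amount) FROM src_sales", "SELECT SUM(amount) FROM dw_sales")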

5. Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views. OLAP is often used in data mining. The chief component of OLAP is the OLAP server, which sits between a client and a Database Management System (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. There are OLAP servers available for nearly all the major database systems.
OLAP is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. Designed for managers looking to make sense of their information, OLAP tools structure data hierarchically, the way managers think of their enterprises, but also allow business analysts to rotate that data, changing the relationships to get more detailed insight into corporate information. OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer as well as front-end flexibility; those are typically difficult features for any home-built system to achieve. The term 'on-line analytic processing' is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations. OLAP is decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies.
The most common way to access a data mart or data warehouse is to run reports. Another very popular approach is to use OLAP tools. To compare different types of reporting and analysis interface, it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of use. Ad hoc queries, as the name suggests, are queries written by (or for) the end user as a one-off exercise. The only limitations are the capabilities of the reporting tool and the data available. Ad hoc reporting requires greater expertise but need not involve programming, as most modern reporting tools are able to generate SQL. OLAP tools can be thought of as interactive reporting environments: they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports. They are excellent for exploring summarised data, and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details.
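
As a minimal illustration of slicing and dicing, the sketch below aggregates a toy fact table with SQL through Python's built-in sqlite3 module; the table, columns and figures are invented for illustration, not taken from the answer.

# Illustrative OLAP-style aggregation: summarize sales by region and quarter,
# then "slice" to a single region. Table and data are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', 'Q1', 120), ('North', 'Q2', 150),
        ('South', 'Q1', 90),  ('South', 'Q2', 130);
""")

# Dice: aggregate across two dimensions.
for row in conn.execute(
        "SELECT region, quarter, SUM(amount) FROM sales GROUP BY region, quarter"):
    print(row)

# Slice: fix one dimension (region = 'North') and look at the rest.
for row in conn.execute(
        "SELECT quarter, SUM(amount) FROM sales WHERE region = 'North' GROUP BY quarter"):
    print(row)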

6. The usage of tools against a data warehouse can be classified into three broad categories:
i. Data dipping
ii. Data mining
iii. Data analysis
Data dipping tools: These are the basic business tools. They allow the generation of standard business reports. They can perform basic analysis, answering standard business questions. As these tools are relational, they can also be used as data browsers and generally have reasonable drill-down capabilities. Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business-friendly schema.
Data mining tools: These are specialist tools designed for finding trends and patterns in the underlying data. They use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious. A data mining tool could be used to find common behavioral trends in a business's customers, or to root out market segments by grouping customers with common attributes.
Data analysis tools: These are used to perform complex analysis of data. They will normally have a rich set of analytic functions which allow sophisticated analysis of the data. These tools are designed for business analysis and will generally understand the common business metrics. Data analysis tools can again be subdivided into two categories: Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP). Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views.

OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries. It is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.
MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
• Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
• Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement (see the sketch after this list).
Advantages:
• Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
• Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions, as well as the ability to allow users to define their own functions.
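
A minimal sketch of that WHERE-clause idea: a ROLAP-style tool turns each slice the user picks into a predicate on the underlying relational query. The fact table, dimension names and SQL fragments below are invented for illustration.

# Illustrative ROLAP-style query generation: each selected slice becomes a
# WHERE predicate on the fact table. Table/column names are hypothetical.
def build_rolap_query(measure: str, group_by: list, slices: dict) -> tuple:
    """Return a parameterized SQL string and its parameter values."""
    sql = f"SELECT {', '.join(group_by)}, SUM({measure}) FROM sales_fact"
    params = []
    if slices:
        predicates = []
        for column, value in slices.items():
            predicates.append(f"{column} = ?")
            params.append(value)
        sql += " WHERE " + " AND ".join(predicates)
    sql += " GROUP BY " + ", ".join(group_by)
    return sql, params

sql, params = build_rolap_query("amount", ["quarter"], {"region": "North", "product": "Books"})
print(sql)     # SELECT quarter, SUM(amount) FROM sales_fact WHERE region = ? AND product = ? GROUP BY quarter
print(params)  # ['North', 'Books']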

Section C : Applied Theory

7. Neural networks: Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. In Freud's theory of psychodynamics the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10^11, connected to each other via a huge number of so-called synapses; a single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines, and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware, but most are just software programs that can operate on normal computers. Typically, a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes. When using neural networks we have to distinguish between two stages: the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions or execute whatever learning task is involved.

There are several different forms of neural network, but we shall discuss only three of them here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing maps
In 1958, Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks. Later, in 1969, Minsky and Papert showed that the class of problem that could be solved by a machine with a perceptron architecture was very limited. It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcome these difficulties. A major improvement was the introduction of hidden layers in the so-called back propagation networks. A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes. In its initial stage, a back propagation network has random weightings on its synapses. When we train the network, we expose it to a training set of input data. For each training instance, the actual output of the network is compared with the desired output that would give a correct answer; if there is a difference between the correct answer and the actual answer, the weightings of the individual nodes and synapses of the network are adjusted. This process is repeated until the responses are more or less accurate. Once the structure of the network stabilizes, the learning stage is over and the network is now trained and ready to categorize unknown input.
Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node each. There are four additional nodes identifying the four areas, so that in this way each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is clear that this coding corresponds well with the information stored in the database. The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are wholly interconnected to the output nodes. In an untrained network, the branches between the nodes have equal weights. During the training stage, the network receives examples of input and output pairs corresponding to records in the database and adapts the weights of the different branches until all the inputs match the appropriate outputs. In Figure 2, the network learns to recognize readers of the car magazine and comics. Figure 3 shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers; however, the networks do not provide a rule to identify this association.
Back propagation networks are a great improvement on the perceptron architecture. However, they also have disadvantages, one being that they need an extremely large training set. Another problem of neural networks is that, although they learn, they do not provide us with a theory about what they have learned: they are simply black boxes that give answers but provide no clear idea as to how they arrived at these answers. In 1981, Tuevo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonen's self-organizing maps. These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain, such as visual maps, maps of the spatial possibilities of limbs, and so on. A Kohonen self-organizing map is a collection of neurons or units, each of which is connected to a small number of other units called its neighbors. Most of the time the Kohonen map is two-dimensional; each node or unit contains a factor that is related to the space whose structure we are investigating. In its initial setting, the self-organizing map has a random assignment of vectors to each unit. During the training stage, these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the so-called Kohonen movie, which is a series of frames showing the positions of the vectors and their connections with neighboring cells. The network resembles an elastic surface that is pulled out over the sample space.

Neural networks perform well on classification tasks and can be very useful in data mining.
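
A minimal sketch of the train-by-adjusting-weights idea described above, using Rosenblatt's perceptron learning rule on a toy, linearly separable problem (logical AND); the dataset, learning rate and epoch count are illustrative assumptions, not part of the answer.

# Illustrative perceptron learning a simple category (logical AND).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # toy inputs
y = np.array([0, 0, 0, 1])                                    # desired outputs (AND)

w = np.zeros(2)   # weights, initially zero
b = 0.0           # bias (threshold)
lr = 0.1          # assumed learning rate

for epoch in range(20):                       # AND is linearly separable, so this converges
    for xi, target in zip(X, y):
        predicted = 1 if xi @ w + b > 0 else 0
        error = target - predicted            # compare desired and actual output
        w += lr * error * xi                  # adjust the weightings
        b += lr * error

print([1 if xi @ w + b > 0 else 0 for xi in X])   # [0, 0, 0, 1]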

Figure 1

Figure 2

Figure 3
8. The query manager has several distinct responsibilities. It is used to control the following:

• User access to the data
• Query scheduling
• Query monitoring
These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one of the most bespoke pieces of software in the data warehouse.
User access to the data: The query manager is the software interface between the users and the data. It presents the data to the users in a form they understand. It also controls the user access to the data.

In a data warehouse, the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, the raw data is often abstracted. Data in this raw format can often be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult. The query manager's task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will at one end take in the user's requirements and, in the background, using the metadata, it will transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is possible that not all tools will work directly via the query manager. If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly no large ad hoc queries should be allowed to be run by anyone other than the query manager. It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up. If queries do bypass the query manager, query statistics gathering will be less accurate.
Query scheduling: Scheduling of ad hoc queries is a responsibility of the query manager. Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system, in particular if the queries are run using parallelism, where a single query can potentially use all the CPU resource made available to it. One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.
Query monitoring: One of the main functions of the query manager is to monitor the queries as they run. This is one of the reasons why all queries should be run via, or at least notified to, the query manager. One of the keys to successful usage of the data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, the query profiles of different groups of users need to be known. This can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used and the query syntax used. The query manager has to be capable of gathering these statistics, which should then be stored in the database for later analysis. It should also maintain a query history: every query created or executed via the query manager should be logged. This allows query profiles to be built up over time and enables identification of frequently run queries or types of queries. These queries can then be tuned, possibly by adding new indexes or by creating new aggregations.
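
A minimal sketch of the query history idea: each query routed through the query manager is logged with its user, text and resource statistics so that profiles can be built later. The table layout and field names below are illustrative assumptions, not taken from the answer.

# Illustrative query-history log kept by a query manager (hypothetical schema).
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE query_history (
        user_name     TEXT,
        submitted_at  TEXT,
        query_text    TEXT,
        elapsed_secs  REAL,
        rows_returned INTEGER
    )
""")

def log_query(user_name, query_text, elapsed_secs, rows_returned):
    # Record one executed query along with simple resource statistics.
    conn.execute(
        "INSERT INTO query_history VALUES (?, ?, ?, ?, ?)",
        (user_name, datetime.now(timezone.utc).isoformat(), query_text, elapsed_secs, rows_returned),
    )

log_query("analyst1", "SELECT region, SUM(amount) FROM sales_fact GROUP BY region", 4.2, 12)

# Build a simple per-user profile: how many queries and how much time each user consumed.
for row in conn.execute(
        "SELECT user_name, COUNT(*), SUM(elapsed_secs) FROM query_history GROUP BY user_name"):
    print(row)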

The best metric to use is business return on investment Is the business achieving bottom line success (increased sales or decreased expenses) through the use of the data warehouse This focus will encourage the development team to work backwards to do the right things day in and day out for the ultimate arbiter of success -- the bottom line

bull Data usage The second best metric is data usage You want to see the data warehouse used for its intended purposes by the target users The objective here is increasing numbers of users and complexity of usage With this focus user statistics such as logins and query bands are tracked

bull Data gathering and availability The third best data warehouse metric category is data gathering and availability Under this focus the data warehouse team becomes an internal data brokerage serving up data for the organizationrsquos consumption Success is measured in the availability of the data more or less according to a service level agreement I would say to use these business metrics to gauge the success

lt TOP gt

3 The important characteristics of data warehouse areTime dependent That is containing information collected over time which implies there must always be a connection between information in the warehouse and the time when it was entered This is one of the most important aspects of the warehouse as it related to data mining because information can then be sourced according to periodNon-volatile That is data in a data warehouse is never updated but used only for queries Thus such data can only be loaded from other databases such as the operational database End-users who want to update must use operational database as only the latter can be updated changed or deleted This means that a data warehouse will always be filled with historical dataSubject oriented That is built around all the existing applications of the operational data Not all the information in the operational database is useful for a data warehouse since the data warehouse is designed specifically for decision support while the operational database contains information for day-to-day useIntegrated That is it reflects the business information of the organization In an operational data environment we will find many types of information being used in a variety of applications and some applications will be using different names for the same entities

lt TOP gt

12

However in a data warehouse it is essential to integrate this information and make it consistent only one name must exist to describe each individual entityThe following are the features of a data warehousebull A scalable information architecture that will allow the information base to be extended and

enhanced over time bull Detailed analysis of member patterns including trading delivery and funds payment bull Fraud detection and sequence of event analysis bull Ease of reporting on voluminous historical data bull Provision for ad hoc queries and reporting facilities to enhance the efficiency of

knowledge workers bull Data mining to identify the co-relation between apparently independent entities

4 Due to the principal role of Data warehouses in making strategy decisions data warehouse quality is crucial for organizations The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing resulting in bull Reduced development and maintenance costsbull Improved software products qualitybull Reduced project cycle timebull Increased customer satisfactionbull Improved staff morale thanks to predictable results in stable conditions with less overtime

crisisturnoverQuality assurance means different things to different individuals To some QA means testing but quality cannot be tested at the end of a project It must be built in as the solution is conceived evolves and is developed To some QA resources are the ldquoprocess policerdquo ndash nitpickers insisting on 100 compliance with a defined development process methodology Rather it is important to implement processes and controls that will really benefit the project Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements Testing is just one of those activities In the typical software QA methodology the key tasks are bull Articulate the development methodology for all to knowbull Rigorously define and inspect the requirementsbull Ensure that the requirements are testablebull Prioritize based on riskbull Create test plansbull Set up the test environment and databull Execute test casesbull Document and manage defects and test resultsbull Gather metrics for management decisionsbull Assess readiness to implement Quality assurance (QA) in a data warehousebusiness intelligence environment is a challenging undertaking For one thing very little is written about business intelligence QA Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions However data quality does not make for BI quality assurance and practitioners within the software QA discipline focus almost exclusively on application development efforts They do not seem to appreciate the unique aspects of quality assurance in a data warehousebusiness intelligence environment An effective software QA should be ingrained within each DWBI project It should have the following characteristics bull QA goals and objectives should be defined from the outset of the projectbull The role of QA should be clearly defined within the project organizationbull The QA role needs to be staffed with talented resources well trained in the techniques

needed to evaluate the data in the types of sources that will be used

lt TOP gt

13

bull QA processes should be embedded to provide a self-monitoring update cyclebull QA activities are needed in the requirements design mapping and development project

phases

5 Online Analytical Processing (OLAP) a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of multidimensional data For example it provides time series and trend analysis views OLAP often is used in data mining The chief component of OLAP is the OLAP server which sits between a client and a Database Management Systems (DBMS) The OLAP server understands how data is organized in the database and has special functions for analyzing the data There are OLAP servers available for nearly all the major database systems OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view Designed for managers looking to make sense of their information OLAP tools structure data hierarchically ndash the way managers think of their enterprises but also allows business analysts to rotate that data changing the relationships to get more detailed insight into corporate information OLAP tools are geared towards slicing and dicing of the data As such they require a strong metadata layer as well as front-end flexibility Those are typically difficult features for any home-built systems to achieve The term lsquoon-line analytic processingrsquo is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations Decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies The most common way to access a data mart or data warehouse is to run reports Another very popular approach is to use OLAP tools To compare different types of reporting and analysis interface it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of useAd hoc queries as the name suggests are queries written by (or for) the end user as a one-off exercise The only limitations are the capabilities of the reporting tool and the data available Ad hoc reporting requires greater expertise but need not involve programming as most modern reporting tools are able to generate SQL OLAP tools can be thought of as interactive reporting environments they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports They are excellent for exploring summarised data and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details

lt TOP gt

6 Classifying the usage of tools against a data warehouse into three broad categoriesi Data dippingii Data miningiii Data analysisData dipping toolsThese are the basic business tools They allow the generation of standard business reports They can perform basic analysis answering standard business questions As these tools are relational they can also be used as data browsers and generally have reasonable drill-down capabilities Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business friendly schemaData mining toolsThese are specialist tools designed for finding trends and patterns in the underlying data These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious A data mining tool could be used to find common behavioral trends in a businessrsquos customers or to root out market segments by grouping customers with common attributesData analysis toolsThese are used to perform complex analysis of data They will normally have a rich set of analytic functions which allow sophisticated analysis of the data These tools are designed for business analysis and will generally understand the common business metrics Data analysis tools can again be subdivided in to two categories Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP) Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of

lt TOP gt

14

multidimensional data For example it provides time series and trend analysis views OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses MOLAP This is the more traditional way of OLAP analysis In MOLAP data is stored in a multidimensional cube The storage is not in the relational database but in proprietary formats Advantages bull Excellent performance MOLAP cubes are built for fast data retrieval and is optimal for

slicing and dicing operations bull Can perform complex calculations All calculations have been pre-generated when the

cube is created Hence complex calculations are not only doable but they return quickly Disadvantages bull Limited in the amount of data it can handle Because all calculations are performed when

the cube is built it is not possible to include a large amount of data in the cube itself This is not to say that the data in the cube cannot be derived from a large amount of data Indeed this is possible But in this case only summary-level information will be included in the cube itself

bull Requires additional investment Cube technology are often proprietary and do not already exist in the organization Therefore to adopt MOLAP technology chances are additional investments in human and capital resources are needed

ROLAP This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAPrsquos slicing and dicing functionality In essence each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement Advantages bull Can handle large amounts of data The data size limitation of ROLAP technology is the

limitation on data size of the underlying relational database In other words ROLAP itself places no limitation on data amount

bull Can leverage functionalities inherent in the relational database Often relational database already comes with a host of functionalities ROLAP technologies since they sit on top of the relational database can therefore leverage these functionalities

Disadvantages bull Performance can be slow Because each ROLAP report is essentially a SQL query (or

multiple SQL queries) in the relational database the query time can be long if the underlying data size is large

bull Limited by SQL functionalities Because ROLAP technology mainly relies on generating SQL statements to query the relational database and SQL statements do not fit all needs (for example it is difficult to perform complex calculations using SQL) ROLAP technologies are therefore traditionally limited by what SQL can do ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions

Section C Applied Theory

7 Neural networks Genetic algorithms derive their inspiration from biology while neural networks are modeled on the human brain In Freudrsquos theory of psychodynamics the human brain was described as a neural network and recent investigations have corroborated this view The human brain consists of a very large number of neurons about1011 connected to each other via a huge number of so-called synapses A single neuron is connected to other neurons by a couple of thousand of these synapses Although neurons could be described as the simple building blocks of the brain the human brain can handle very complex tasks despite this relative sim-plicity This analogy therefore offers an interesting model for the creation of more complex learning machines and has led to the creation of so-called artificial neural networks Such networks can be built using special hardware but most are just software programs that can operate on normal computers Typically a neural network consists of a set of nodes input nodes receive the input signals output nodes give the output signals and a potentially unlimited number of intermediate layers contain the intermediate nodes When using neural

lt TOP gt

15

networks we have to distinguish between two stages - the encoding stage in which the neural network is trained to perform a certain task and the decoding stage in which the network is used to classify examples make predictions or execute whatever learning task is involved There are several different forms of neural network but we shall discuss only three of them here bull Perceptrons bull Back propagation networks bull Kohonen self-organizing map In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron one of the first implementations of what would later be known as a neural network A perceptron consists of a simple three-layered network with input units called photo-receptors intermediate units called associators and output units called responders The perceptron could learn simple categories and thus could be used to perform simple classification tasks Later in 1969 Minsky and Papert showed that the class of problem that could be solved by a machine with a perceptron architecture was very limited It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcomemiddot these difficulties A major improvement was the intro-duction of hidden layers in the so-called back propagation networks A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes In its initial stage a back propagation network has random weightings on its synapses When we train the network we expose it to a training set of input data For each training instance the actual output of the network is compared with the desired output that would give a correct answer if there is a difference between the correct answer and the actual answer the weightings of the individual nodes and synapses of the network are adjusted This process is repeated until the responses are more or less accurate Once the structure of the network stabilizes the learning stage is over and the network is now trained and ready to categorize unknown input Figure1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database The age attribute has been split into three age classes each represented by a separate input node house and car ownership also have an input node There are four addi-tional nodes identifying the four areas so that in this way each input node corresponds to a simple yes-no Decision The same holds for the output nodes each magazine has a node It is clear that this coding corresponds well with the information stored in the database The input nodes are wholly interconnected to the hidden nodes and the hidden nodes are wholly interconnected to the output nodes In an untrained network the branches between the nodes have equal weights During the training stage the network receives examples of input and output pairs corresponding to records in the database and adapts the weights of the different branches until all the inputs match the appropriate outputsIn Figure 2 the network learns to recognize readers of the car magazine and comics Figure 3 shows the internal state of the network after training The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers However the networks do not provide a rule to identify this association Back propagation networks are a great improvement on the perceptron architecture However they also have disadvantages one being that they need an 
extremely large training set Another problem of neural networks is that although they learn they do not provide us with a theory about what they have learned - they are simply black boxes-that give answers but provide no clear idea as to how they arrived at these answers In 1981 Tuevo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonenrsquos self-organizing maps These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain such as visual maps maps of the spatial possibilities of limbs and so on A Kohonen self-organizing map is a collection of neurons or units each of which is connected to a small number of other units called its neighbors Most of the time the Kohonen map is two- dimensional each node or unit contains a factor that is related to the space whose structure we are investigating In its initial setting the self-organizing map has a random assignment of vectors to each unit During the training stage these vectors are incre mentally adjusted to give a better coverage of the space A natural way to visualize the process of training a self- organizing map is the so-called Kohonen movie which is a series of frames showing the positions of the vectors and their connections with neighboring cells The network resembles an elastic surface that is pulled out over the sample space Neural networks perform well on classification tasks and

16

can be very useful in data mining

Figure 1

Figure2

Figure 38 The query manager has several distinct responsibilities It is used to control the following

bull User access to the databull Query schedulingbull Query monitoringThese areas are all very different in nature and each area requires its own tools bespoke software and procedures The query manager is one of the most bespoke pieces of software in the data warehouseUser access to the data The query manager is the software interface between the users and the data It presents the data to the users in a form they understand It also controls the user

lt TOP gt

17

access to the data In a data warehouse the raw data will often be an amalgamation of data needs to be tied together somehow to achieve this raw data is often abstracted Data in this raw format can often be difficult to interpret This coupled with the fact that data from a single logical table is often partitioned into multiple real tables can make ad hoc querying of raw data difficultThe query managerrsquos task is to address this problem by presenting a meaningful schema to the users via a friendly front end The query manager will at one end take in the userrsquos requirements and in the background using the metadata it will transform these requirements into queries against the appropriate dataIdeally all user access tools should work via the query manager However as a number of different tools are likely to be used and the tools used are likely to change over time it is possible that not all tools will work directly via the query managerIf users have access via tools that do not interface directly through the query manager you should try setting up some form of indirect control by the query manager you should try setting up some form of indirect control by the query manager Certainly no large ad hoc queries should be allowed to be run by anyone other than the query manager It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up If queries do bypass the query manager query statistics gathering will be less accurateQuery Scheduling Scheduling of ad hoc queries is a responsibility of the query manager Simultaneous large ad hoc queries if not controlled can severely affect the performance of any system in particular if the queries are run using parallelism where a single query can potentially use all the CPU resource made available to it One aspect of query control that it is glaringly visible by its absence is the ability to predict how long a query will take to completeQuery monitoring One of the main functions of the query manager is to monitor the queries as they run This is one of the reasons why all queries should be run via or at least notified to the query manager One of the keys to success of usage of data ware house is to that success is the tuning of ad hoc environment to meet the userrsquos needs To achieve this query profiles of different groups of users need to be known This can be achieved only if there is long-term statistics on the queries run by each user and the resources used by each query The query execution plan needs to be stored along the statistics of the resources used and the query syntax usedThe query manager has to be capable of gathering these statistics which should then be stored in the database for later analysis It should also maintain a query history Every query created or executed via query manager should be logged This allows query profiles to be built up over time This enables identification of frequently run queries or types of queries These queries can then be tuned possibly by adding new indexes or by creating new aggregations

lt TOP OF THE DOCUMENT gt

18

Page 12: 0810 Its-dwdm Mb3g1it

2 Metrics are essential in the assessment of software development quality They may provide information about the development process itself and the yielded products Metrics may be grouped into Quality Areas which define a perspective for metrics interpretation The adoption of a measurement program includes the definition of metrics that generate useful information To do so organizationrsquos goals have to be defined and analyzed along with what the metrics are expected to deliver Metrics may be classified as direct and indirect A direct metric is independent of the measurement of any other Indirect metrics also referred to as derived metrics represent functions upon other metrics direct or derived Productivity (code size programming time) is an example of derived metric The existence of a timely and accurate capturing mechanism for direct metrics is critical in order to produce reliable results Indicators establish the quality factors defined in a measurement program Metrics also have a number of components and for data warehousing can be broken down in the following manner Objects - the ldquothemesrdquo in the data warehouse environment which need to be assessed Objects can include business drivers warehouse contents refresh processes accesses and tools Subjects - things in the data warehouse to which we assign numbers or a quantity For example subjects include the cost or value of a specific warehouse activity access frequency duration and utilization Strata - a criterion for manipulating metric information This might include day of the week specific tables accessed location time or accesses by departmentThese metric components may be combined to define an ldquoapplicationrdquo which states how the information will be applied For example ldquoWhen actual monthly refresh cost exceeds targeted monthly refresh cost the value of each data collection in the warehouse must be re-establishedrdquo There are several data warehouse project management metrics worth considering The first three arebull Business Return On Investment (ROI)

The best metric to use is business return on investment Is the business achieving bottom line success (increased sales or decreased expenses) through the use of the data warehouse This focus will encourage the development team to work backwards to do the right things day in and day out for the ultimate arbiter of success -- the bottom line

bull Data usage The second best metric is data usage You want to see the data warehouse used for its intended purposes by the target users The objective here is increasing numbers of users and complexity of usage With this focus user statistics such as logins and query bands are tracked

bull Data gathering and availability The third best data warehouse metric category is data gathering and availability Under this focus the data warehouse team becomes an internal data brokerage serving up data for the organizationrsquos consumption Success is measured in the availability of the data more or less according to a service level agreement I would say to use these business metrics to gauge the success

lt TOP gt

3 The important characteristics of data warehouse areTime dependent That is containing information collected over time which implies there must always be a connection between information in the warehouse and the time when it was entered This is one of the most important aspects of the warehouse as it related to data mining because information can then be sourced according to periodNon-volatile That is data in a data warehouse is never updated but used only for queries Thus such data can only be loaded from other databases such as the operational database End-users who want to update must use operational database as only the latter can be updated changed or deleted This means that a data warehouse will always be filled with historical dataSubject oriented That is built around all the existing applications of the operational data Not all the information in the operational database is useful for a data warehouse since the data warehouse is designed specifically for decision support while the operational database contains information for day-to-day useIntegrated That is it reflects the business information of the organization In an operational data environment we will find many types of information being used in a variety of applications and some applications will be using different names for the same entities

lt TOP gt

12

However in a data warehouse it is essential to integrate this information and make it consistent only one name must exist to describe each individual entityThe following are the features of a data warehousebull A scalable information architecture that will allow the information base to be extended and

enhanced over time bull Detailed analysis of member patterns including trading delivery and funds payment bull Fraud detection and sequence of event analysis bull Ease of reporting on voluminous historical data bull Provision for ad hoc queries and reporting facilities to enhance the efficiency of

knowledge workers bull Data mining to identify the co-relation between apparently independent entities

4 Due to the principal role of Data warehouses in making strategy decisions data warehouse quality is crucial for organizations The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing resulting in bull Reduced development and maintenance costsbull Improved software products qualitybull Reduced project cycle timebull Increased customer satisfactionbull Improved staff morale thanks to predictable results in stable conditions with less overtime

crisisturnoverQuality assurance means different things to different individuals To some QA means testing but quality cannot be tested at the end of a project It must be built in as the solution is conceived evolves and is developed To some QA resources are the ldquoprocess policerdquo ndash nitpickers insisting on 100 compliance with a defined development process methodology Rather it is important to implement processes and controls that will really benefit the project Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements Testing is just one of those activities In the typical software QA methodology the key tasks are bull Articulate the development methodology for all to knowbull Rigorously define and inspect the requirementsbull Ensure that the requirements are testablebull Prioritize based on riskbull Create test plansbull Set up the test environment and databull Execute test casesbull Document and manage defects and test resultsbull Gather metrics for management decisionsbull Assess readiness to implement Quality assurance (QA) in a data warehousebusiness intelligence environment is a challenging undertaking For one thing very little is written about business intelligence QA Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions However data quality does not make for BI quality assurance and practitioners within the software QA discipline focus almost exclusively on application development efforts They do not seem to appreciate the unique aspects of quality assurance in a data warehousebusiness intelligence environment An effective software QA should be ingrained within each DWBI project It should have the following characteristics bull QA goals and objectives should be defined from the outset of the projectbull The role of QA should be clearly defined within the project organizationbull The QA role needs to be staffed with talented resources well trained in the techniques

needed to evaluate the data in the types of sources that will be used

lt TOP gt

13

bull QA processes should be embedded to provide a self-monitoring update cyclebull QA activities are needed in the requirements design mapping and development project

phases

5 Online Analytical Processing (OLAP) a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of multidimensional data For example it provides time series and trend analysis views OLAP often is used in data mining The chief component of OLAP is the OLAP server which sits between a client and a Database Management Systems (DBMS) The OLAP server understands how data is organized in the database and has special functions for analyzing the data There are OLAP servers available for nearly all the major database systems OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view Designed for managers looking to make sense of their information OLAP tools structure data hierarchically ndash the way managers think of their enterprises but also allows business analysts to rotate that data changing the relationships to get more detailed insight into corporate information OLAP tools are geared towards slicing and dicing of the data As such they require a strong metadata layer as well as front-end flexibility Those are typically difficult features for any home-built systems to achieve The term lsquoon-line analytic processingrsquo is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations Decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies The most common way to access a data mart or data warehouse is to run reports Another very popular approach is to use OLAP tools To compare different types of reporting and analysis interface it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of useAd hoc queries as the name suggests are queries written by (or for) the end user as a one-off exercise The only limitations are the capabilities of the reporting tool and the data available Ad hoc reporting requires greater expertise but need not involve programming as most modern reporting tools are able to generate SQL OLAP tools can be thought of as interactive reporting environments they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports They are excellent for exploring summarised data and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details

6. The usage of tools against a data warehouse can be classified into three broad categories:
i. Data dipping
ii. Data mining
iii. Data analysis
Data dipping tools: These are the basic business tools. They allow the generation of standard business reports. They can perform basic analysis, answering standard business questions. As these tools are relational, they can also be used as data browsers and generally have reasonable drill-down capabilities. Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business-friendly schema.
Data mining tools: These are specialist tools designed for finding trends and patterns in the underlying data. These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious. A data mining tool could be used to find common behavioral trends in a business's customers, or to root out market segments by grouping customers with common attributes.
Data analysis tools: These are used to perform complex analysis of data. They will normally have a rich set of analytic functions which allow sophisticated analysis of the data. These tools are designed for business analysis and will generally understand the common business metrics. Data analysis tools can again be subdivided into two categories: Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP). Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of

multidimensional data; for example, they provide time series and trend analysis views. OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.
MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
• Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
• Requires additional investment: Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a WHERE clause to the SQL statement.
Advantages:
• Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
• Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability to allow users to define their own functions.
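As a concrete illustration of the point that each slice-and-dice action in ROLAP amounts to adding a WHERE clause, the sketch below builds such a query dynamically. It is only a sketch under assumed names: the sales_fact table, its columns and the sqlite3 database file are hypothetical, and a real ROLAP engine would generate far more sophisticated SQL.

```python
# Illustrative ROLAP-style query builder: every slice the user picks becomes
# another predicate in the WHERE clause. Table and column names are hypothetical.
import sqlite3

def rolap_query(measures, dimensions, slices):
    """Build a grouped aggregate query; `slices` maps column -> required value."""
    select = ", ".join(dimensions + [f"SUM({m}) AS total_{m}" for m in measures])
    sql = f"SELECT {select} FROM sales_fact"
    params = []
    if slices:
        # Each slice/dice action simply appends one more WHERE predicate.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in slices)
        params = list(slices.values())
    sql += " GROUP BY " + ", ".join(dimensions)
    return sql, params

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")   # hypothetical database
    sql, params = rolap_query(["amount"], ["region"],
                              {"year": 2008, "product": "Books"})
    print(sql)                               # the generated SQL statement
    for row in conn.execute(sql, params):    # runs against the relational store
        print(row)
```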

Section C : Applied Theory

7. Neural networks: Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. In Freud's theory of psychodynamics, the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10¹¹, connected to each other via a huge number of so-called synapses. A single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines, and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware, but most are just software programs that can operate on normal computers. Typically, a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes. When using neural

networks, we have to distinguish between two stages: the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions or execute whatever learning task is involved. There are several different forms of neural network, but we shall discuss only three of them here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing maps
In 1958, Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks. Later, in 1969, Minsky and Papert showed that the class of problems that could be solved by a machine with a perceptron architecture was very limited. It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcome these difficulties. A major improvement was the introduction of hidden layers in the so-called back propagation networks. A back propagation network not only has input and output nodes, but also a set of intermediate layers with hidden nodes. In its initial stage, a back propagation network has random weightings on its synapses. When we train the network, we expose it to a training set of input data. For each training instance, the actual output of the network is compared with the desired output that would give a correct answer; if there is a difference between the correct answer and the actual answer, the weightings of the individual nodes and synapses of the network are adjusted. This process is repeated until the responses are more or less accurate. Once the structure of the network stabilizes, the learning stage is over and the network is now trained and ready to categorize unknown input.
Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node each. There are four additional nodes identifying the four areas, so that in this way each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is clear that this coding corresponds well with the information stored in the database. The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are wholly interconnected to the output nodes. In an untrained network, the branches between the nodes have equal weights. During the training stage, the network receives examples of input and output pairs corresponding to records in the database and adapts the weights of the different branches until all the inputs match the appropriate outputs. In Figure 2, the network learns to recognize readers of the car magazine and comics. Figure 3 shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers; however, the networks do not provide a rule to identify this association.
Back propagation networks are a great improvement on the perceptron architecture. However, they also have disadvantages, one being that they need an extremely large training set. Another problem of neural networks is that although they learn, they do not provide us with a theory about what they have learned – they are simply black boxes that give answers but provide no clear idea as to how they arrived at these answers.
In 1981, Teuvo Kohonen demonstrated a completely different version of neural networks, currently known as Kohonen's self-organizing maps. These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain, such as visual maps, maps of the spatial possibilities of limbs, and so on. A Kohonen self-organizing map is a collection of neurons or units, each of which is connected to a small number of other units called its neighbors. Most of the time the Kohonen map is two-dimensional; each node or unit contains a vector that is related to the space whose structure we are investigating. In its initial setting, the self-organizing map has a random assignment of vectors to each unit. During the training stage, these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the so-called Kohonen movie, which is a series of frames showing the positions of the vectors and their connections with neighboring cells. The network resembles an elastic surface that is pulled out over the sample space. Neural networks perform well on classification tasks and can be very useful in data mining.
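The training loop described above (compare the actual output with the desired output, adjust the weightings, and repeat) can be sketched in a few lines. This is not the network of Figures 1 to 3; it is a deliberately tiny, assumed example of a back propagation network with one hidden layer, trained with plain gradient descent using NumPy on an invented data set.

```python
# Minimal back propagation sketch: one hidden layer, sigmoid units, squared error.
# The training data is invented (the classic XOR task, just to exercise the loop);
# real use would encode database records (age class, ownership, area) as inputs.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # input patterns
y = np.array([[0], [1], [1], [0]], dtype=float)              # desired outputs

W1 = rng.normal(size=(2, 4))      # random initial weightings: input -> hidden
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))      # random initial weightings: hidden -> output
b2 = np.zeros((1, 1))
lr = 1.0

for epoch in range(10000):                    # encoding (training) stage
    hidden = sigmoid(X @ W1 + b1)             # hidden-layer activations
    out = sigmoid(hidden @ W2 + b2)           # actual output of the network
    error = y - out                           # desired output minus actual output
    # Propagate the error backwards and adjust the weightings a little each pass.
    d_out = error * out * (1 - out)
    d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
    W2 += lr * hidden.T @ d_out
    b2 += lr * d_out.sum(axis=0, keepdims=True)
    W1 += lr * X.T @ d_hidden
    b1 += lr * d_hidden.sum(axis=0, keepdims=True)

# Decoding stage: the trained network classifies its inputs; the printed outputs
# should now lie close to the desired 0/1 targets.
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```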

Figure 1

Figure 2

Figure 3

8. The query manager has several distinct responsibilities. It is used to control the following:

• User access to the data
• Query scheduling
• Query monitoring
These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one of the most bespoke pieces of software in the data warehouse.
User access to the data: The query manager is the software interface between the users and the data. It presents the data to the users in a form they understand. It also controls user access to the data.

In a data warehouse, the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, the raw data is often abstracted. Data in this raw format can often be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult. The query manager's task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will at one end take in the user's requirements and, in the background, using the metadata, it will transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is possible that not all tools will work directly via the query manager. If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly, no large ad hoc queries should be allowed to be run by anyone other than the query manager. It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up. If queries do bypass the query manager, query statistics gathering will be less accurate.
Query scheduling: Scheduling of ad hoc queries is a responsibility of the query manager. Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system, in particular if the queries are run using parallelism, where a single query can potentially use all the CPU resource made available to it. One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.
Query monitoring: One of the main functions of the query manager is to monitor the queries as they run. This is one of the reasons why all queries should be run via, or at least notified to, the query manager. One of the keys to successful usage of a data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, the query profiles of different groups of users need to be known. This can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used and the query syntax used. The query manager has to be capable of gathering these statistics, which should then be stored in the database for later analysis. It should also maintain a query history: every query created or executed via the query manager should be logged. This allows query profiles to be built up over time and enables identification of frequently run queries or types of queries. These queries can then be tuned, possibly by adding new indexes or by creating new aggregations.
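A minimal sketch of this statistics-gathering idea is given below: every query routed through the query manager is logged with its text, elapsed time and row count so that query profiles can be built up later. The logging table, the database file and the choice of sqlite3 are assumptions made only for the illustration; a real query manager would also capture the execution plan and schedule or defer large ad hoc queries rather than running them immediately.

```python
# Illustrative query-manager wrapper: log each query's text, elapsed time and
# row count into a history table so query profiles can be analysed later.
import sqlite3
import time

class QueryManager:
    def __init__(self, db_path="warehouse.db"):             # hypothetical database
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS query_history ("
            " run_at REAL, user TEXT, sql TEXT, seconds REAL, rows INTEGER)")

    def run(self, user, sql, params=()):
        start = time.perf_counter()
        rows = self.conn.execute(sql, params).fetchall()     # run on the user's behalf
        elapsed = time.perf_counter() - start
        self.conn.execute(
            "INSERT INTO query_history VALUES (?, ?, ?, ?, ?)",
            (time.time(), user, sql, elapsed, len(rows)))
        self.conn.commit()
        return rows

    def frequent_queries(self, limit=10):
        """Identify frequently run queries so they can be tuned or aggregated."""
        return self.conn.execute(
            "SELECT sql, COUNT(*) AS runs, AVG(seconds) AS avg_seconds"
            " FROM query_history GROUP BY sql ORDER BY runs DESC LIMIT ?",
            (limit,)).fetchall()
```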

crisisturnoverQuality assurance means different things to different individuals To some QA means testing but quality cannot be tested at the end of a project It must be built in as the solution is conceived evolves and is developed To some QA resources are the ldquoprocess policerdquo ndash nitpickers insisting on 100 compliance with a defined development process methodology Rather it is important to implement processes and controls that will really benefit the project Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements Testing is just one of those activities In the typical software QA methodology the key tasks are bull Articulate the development methodology for all to knowbull Rigorously define and inspect the requirementsbull Ensure that the requirements are testablebull Prioritize based on riskbull Create test plansbull Set up the test environment and databull Execute test casesbull Document and manage defects and test resultsbull Gather metrics for management decisionsbull Assess readiness to implement Quality assurance (QA) in a data warehousebusiness intelligence environment is a challenging undertaking For one thing very little is written about business intelligence QA Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions However data quality does not make for BI quality assurance and practitioners within the software QA discipline focus almost exclusively on application development efforts They do not seem to appreciate the unique aspects of quality assurance in a data warehousebusiness intelligence environment An effective software QA should be ingrained within each DWBI project It should have the following characteristics bull QA goals and objectives should be defined from the outset of the projectbull The role of QA should be clearly defined within the project organizationbull The QA role needs to be staffed with talented resources well trained in the techniques

needed to evaluate the data in the types of sources that will be used

lt TOP gt

13

bull QA processes should be embedded to provide a self-monitoring update cyclebull QA activities are needed in the requirements design mapping and development project

phases

5 Online Analytical Processing (OLAP) a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of multidimensional data For example it provides time series and trend analysis views OLAP often is used in data mining The chief component of OLAP is the OLAP server which sits between a client and a Database Management Systems (DBMS) The OLAP server understands how data is organized in the database and has special functions for analyzing the data There are OLAP servers available for nearly all the major database systems OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view Designed for managers looking to make sense of their information OLAP tools structure data hierarchically ndash the way managers think of their enterprises but also allows business analysts to rotate that data changing the relationships to get more detailed insight into corporate information OLAP tools are geared towards slicing and dicing of the data As such they require a strong metadata layer as well as front-end flexibility Those are typically difficult features for any home-built systems to achieve The term lsquoon-line analytic processingrsquo is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations Decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies The most common way to access a data mart or data warehouse is to run reports Another very popular approach is to use OLAP tools To compare different types of reporting and analysis interface it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of useAd hoc queries as the name suggests are queries written by (or for) the end user as a one-off exercise The only limitations are the capabilities of the reporting tool and the data available Ad hoc reporting requires greater expertise but need not involve programming as most modern reporting tools are able to generate SQL OLAP tools can be thought of as interactive reporting environments they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports They are excellent for exploring summarised data and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details

lt TOP gt

6 Classifying the usage of tools against a data warehouse into three broad categoriesi Data dippingii Data miningiii Data analysisData dipping toolsThese are the basic business tools They allow the generation of standard business reports They can perform basic analysis answering standard business questions As these tools are relational they can also be used as data browsers and generally have reasonable drill-down capabilities Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business friendly schemaData mining toolsThese are specialist tools designed for finding trends and patterns in the underlying data These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious A data mining tool could be used to find common behavioral trends in a businessrsquos customers or to root out market segments by grouping customers with common attributesData analysis toolsThese are used to perform complex analysis of data They will normally have a rich set of analytic functions which allow sophisticated analysis of the data These tools are designed for business analysis and will generally understand the common business metrics Data analysis tools can again be subdivided in to two categories Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP) Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of

lt TOP gt

14

multidimensional data For example it provides time series and trend analysis views OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses MOLAP This is the more traditional way of OLAP analysis In MOLAP data is stored in a multidimensional cube The storage is not in the relational database but in proprietary formats Advantages bull Excellent performance MOLAP cubes are built for fast data retrieval and is optimal for

slicing and dicing operations bull Can perform complex calculations All calculations have been pre-generated when the

cube is created Hence complex calculations are not only doable but they return quickly Disadvantages bull Limited in the amount of data it can handle Because all calculations are performed when

the cube is built it is not possible to include a large amount of data in the cube itself This is not to say that the data in the cube cannot be derived from a large amount of data Indeed this is possible But in this case only summary-level information will be included in the cube itself

bull Requires additional investment Cube technology are often proprietary and do not already exist in the organization Therefore to adopt MOLAP technology chances are additional investments in human and capital resources are needed

ROLAP This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAPrsquos slicing and dicing functionality In essence each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement Advantages bull Can handle large amounts of data The data size limitation of ROLAP technology is the

limitation on data size of the underlying relational database In other words ROLAP itself places no limitation on data amount

bull Can leverage functionalities inherent in the relational database Often relational database already comes with a host of functionalities ROLAP technologies since they sit on top of the relational database can therefore leverage these functionalities

Disadvantages bull Performance can be slow Because each ROLAP report is essentially a SQL query (or

multiple SQL queries) in the relational database the query time can be long if the underlying data size is large

bull Limited by SQL functionalities Because ROLAP technology mainly relies on generating SQL statements to query the relational database and SQL statements do not fit all needs (for example it is difficult to perform complex calculations using SQL) ROLAP technologies are therefore traditionally limited by what SQL can do ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions

Section C Applied Theory

7 Neural networks Genetic algorithms derive their inspiration from biology while neural networks are modeled on the human brain In Freudrsquos theory of psychodynamics the human brain was described as a neural network and recent investigations have corroborated this view The human brain consists of a very large number of neurons about1011 connected to each other via a huge number of so-called synapses A single neuron is connected to other neurons by a couple of thousand of these synapses Although neurons could be described as the simple building blocks of the brain the human brain can handle very complex tasks despite this relative sim-plicity This analogy therefore offers an interesting model for the creation of more complex learning machines and has led to the creation of so-called artificial neural networks Such networks can be built using special hardware but most are just software programs that can operate on normal computers Typically a neural network consists of a set of nodes input nodes receive the input signals output nodes give the output signals and a potentially unlimited number of intermediate layers contain the intermediate nodes When using neural

lt TOP gt

15

networks we have to distinguish between two stages - the encoding stage in which the neural network is trained to perform a certain task and the decoding stage in which the network is used to classify examples make predictions or execute whatever learning task is involved There are several different forms of neural network but we shall discuss only three of them here bull Perceptrons bull Back propagation networks bull Kohonen self-organizing map In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron one of the first implementations of what would later be known as a neural network A perceptron consists of a simple three-layered network with input units called photo-receptors intermediate units called associators and output units called responders The perceptron could learn simple categories and thus could be used to perform simple classification tasks Later in 1969 Minsky and Papert showed that the class of problem that could be solved by a machine with a perceptron architecture was very limited It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcomemiddot these difficulties A major improvement was the intro-duction of hidden layers in the so-called back propagation networks A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes In its initial stage a back propagation network has random weightings on its synapses When we train the network we expose it to a training set of input data For each training instance the actual output of the network is compared with the desired output that would give a correct answer if there is a difference between the correct answer and the actual answer the weightings of the individual nodes and synapses of the network are adjusted This process is repeated until the responses are more or less accurate Once the structure of the network stabilizes the learning stage is over and the network is now trained and ready to categorize unknown input Figure1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database The age attribute has been split into three age classes each represented by a separate input node house and car ownership also have an input node There are four addi-tional nodes identifying the four areas so that in this way each input node corresponds to a simple yes-no Decision The same holds for the output nodes each magazine has a node It is clear that this coding corresponds well with the information stored in the database The input nodes are wholly interconnected to the hidden nodes and the hidden nodes are wholly interconnected to the output nodes In an untrained network the branches between the nodes have equal weights During the training stage the network receives examples of input and output pairs corresponding to records in the database and adapts the weights of the different branches until all the inputs match the appropriate outputsIn Figure 2 the network learns to recognize readers of the car magazine and comics Figure 3 shows the internal state of the network after training The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers However the networks do not provide a rule to identify this association Back propagation networks are a great improvement on the perceptron architecture However they also have disadvantages one being that they need an 
extremely large training set Another problem of neural networks is that although they learn they do not provide us with a theory about what they have learned - they are simply black boxes-that give answers but provide no clear idea as to how they arrived at these answers In 1981 Tuevo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonenrsquos self-organizing maps These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain such as visual maps maps of the spatial possibilities of limbs and so on A Kohonen self-organizing map is a collection of neurons or units each of which is connected to a small number of other units called its neighbors Most of the time the Kohonen map is two- dimensional each node or unit contains a factor that is related to the space whose structure we are investigating In its initial setting the self-organizing map has a random assignment of vectors to each unit During the training stage these vectors are incre mentally adjusted to give a better coverage of the space A natural way to visualize the process of training a self- organizing map is the so-called Kohonen movie which is a series of frames showing the positions of the vectors and their connections with neighboring cells The network resembles an elastic surface that is pulled out over the sample space Neural networks perform well on classification tasks and

16

can be very useful in data mining

Figure 1

Figure2

Figure 38 The query manager has several distinct responsibilities It is used to control the following

bull User access to the databull Query schedulingbull Query monitoringThese areas are all very different in nature and each area requires its own tools bespoke software and procedures The query manager is one of the most bespoke pieces of software in the data warehouseUser access to the data The query manager is the software interface between the users and the data It presents the data to the users in a form they understand It also controls the user

lt TOP gt

17

access to the data In a data warehouse the raw data will often be an amalgamation of data needs to be tied together somehow to achieve this raw data is often abstracted Data in this raw format can often be difficult to interpret This coupled with the fact that data from a single logical table is often partitioned into multiple real tables can make ad hoc querying of raw data difficultThe query managerrsquos task is to address this problem by presenting a meaningful schema to the users via a friendly front end The query manager will at one end take in the userrsquos requirements and in the background using the metadata it will transform these requirements into queries against the appropriate dataIdeally all user access tools should work via the query manager However as a number of different tools are likely to be used and the tools used are likely to change over time it is possible that not all tools will work directly via the query managerIf users have access via tools that do not interface directly through the query manager you should try setting up some form of indirect control by the query manager you should try setting up some form of indirect control by the query manager Certainly no large ad hoc queries should be allowed to be run by anyone other than the query manager It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up If queries do bypass the query manager query statistics gathering will be less accurateQuery Scheduling Scheduling of ad hoc queries is a responsibility of the query manager Simultaneous large ad hoc queries if not controlled can severely affect the performance of any system in particular if the queries are run using parallelism where a single query can potentially use all the CPU resource made available to it One aspect of query control that it is glaringly visible by its absence is the ability to predict how long a query will take to completeQuery monitoring One of the main functions of the query manager is to monitor the queries as they run This is one of the reasons why all queries should be run via or at least notified to the query manager One of the keys to success of usage of data ware house is to that success is the tuning of ad hoc environment to meet the userrsquos needs To achieve this query profiles of different groups of users need to be known This can be achieved only if there is long-term statistics on the queries run by each user and the resources used by each query The query execution plan needs to be stored along the statistics of the resources used and the query syntax usedThe query manager has to be capable of gathering these statistics which should then be stored in the database for later analysis It should also maintain a query history Every query created or executed via query manager should be logged This allows query profiles to be built up over time This enables identification of frequently run queries or types of queries These queries can then be tuned possibly by adding new indexes or by creating new aggregations

lt TOP OF THE DOCUMENT gt

18

Page 14: 0810 Its-dwdm Mb3g1it

bull QA processes should be embedded to provide a self-monitoring update cyclebull QA activities are needed in the requirements design mapping and development project

phases

5 Online Analytical Processing (OLAP) a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of multidimensional data For example it provides time series and trend analysis views OLAP often is used in data mining The chief component of OLAP is the OLAP server which sits between a client and a Database Management Systems (DBMS) The OLAP server understands how data is organized in the database and has special functions for analyzing the data There are OLAP servers available for nearly all the major database systems OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view Designed for managers looking to make sense of their information OLAP tools structure data hierarchically ndash the way managers think of their enterprises but also allows business analysts to rotate that data changing the relationships to get more detailed insight into corporate information OLAP tools are geared towards slicing and dicing of the data As such they require a strong metadata layer as well as front-end flexibility Those are typically difficult features for any home-built systems to achieve The term lsquoon-line analytic processingrsquo is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations Decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies The most common way to access a data mart or data warehouse is to run reports Another very popular approach is to use OLAP tools To compare different types of reporting and analysis interface it is useful to classify reports along a spectrum of increasing flexibility and decreasing ease of useAd hoc queries as the name suggests are queries written by (or for) the end user as a one-off exercise The only limitations are the capabilities of the reporting tool and the data available Ad hoc reporting requires greater expertise but need not involve programming as most modern reporting tools are able to generate SQL OLAP tools can be thought of as interactive reporting environments they allow the user to interact with a cube of data and create views that can be saved and reused as generic interactive reports They are excellent for exploring summarised data and some will allow the user to drill through from the cube into the underlying database to view the individual transaction details

lt TOP gt

6 Classifying the usage of tools against a data warehouse into three broad categoriesi Data dippingii Data miningiii Data analysisData dipping toolsThese are the basic business tools They allow the generation of standard business reports They can perform basic analysis answering standard business questions As these tools are relational they can also be used as data browsers and generally have reasonable drill-down capabilities Most of the tools will use metadata to isolate the user from the complexities of the data warehouse and present a business friendly schemaData mining toolsThese are specialist tools designed for finding trends and patterns in the underlying data These tools use techniques such as artificial intelligence and neural networks to mine the data and find connections that may not be immediately obvious A data mining tool could be used to find common behavioral trends in a businessrsquos customers or to root out market segments by grouping customers with common attributesData analysis toolsThese are used to perform complex analysis of data They will normally have a rich set of analytic functions which allow sophisticated analysis of the data These tools are designed for business analysis and will generally understand the common business metrics Data analysis tools can again be subdivided in to two categories Multidimensional Online Analytical Processing (MOLAP) and Relational Online Analytical Processing (ROLAP) Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database OLAP tools enable users to analyze different dimensions of

lt TOP gt

14

multidimensional data For example it provides time series and trend analysis views OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses MOLAP This is the more traditional way of OLAP analysis In MOLAP data is stored in a multidimensional cube The storage is not in the relational database but in proprietary formats Advantages bull Excellent performance MOLAP cubes are built for fast data retrieval and is optimal for

slicing and dicing operations bull Can perform complex calculations All calculations have been pre-generated when the

cube is created Hence complex calculations are not only doable but they return quickly Disadvantages bull Limited in the amount of data it can handle Because all calculations are performed when

the cube is built it is not possible to include a large amount of data in the cube itself This is not to say that the data in the cube cannot be derived from a large amount of data Indeed this is possible But in this case only summary-level information will be included in the cube itself

bull Requires additional investment Cube technology are often proprietary and do not already exist in the organization Therefore to adopt MOLAP technology chances are additional investments in human and capital resources are needed

ROLAP This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAPrsquos slicing and dicing functionality In essence each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement Advantages bull Can handle large amounts of data The data size limitation of ROLAP technology is the

limitation on data size of the underlying relational database In other words ROLAP itself places no limitation on data amount

bull Can leverage functionalities inherent in the relational database Often relational database already comes with a host of functionalities ROLAP technologies since they sit on top of the relational database can therefore leverage these functionalities

Disadvantages bull Performance can be slow Because each ROLAP report is essentially a SQL query (or

multiple SQL queries) in the relational database the query time can be long if the underlying data size is large

bull Limited by SQL functionalities Because ROLAP technology mainly relies on generating SQL statements to query the relational database and SQL statements do not fit all needs (for example it is difficult to perform complex calculations using SQL) ROLAP technologies are therefore traditionally limited by what SQL can do ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions

Section C Applied Theory

7 Neural networks Genetic algorithms derive their inspiration from biology while neural networks are modeled on the human brain In Freudrsquos theory of psychodynamics the human brain was described as a neural network and recent investigations have corroborated this view The human brain consists of a very large number of neurons about1011 connected to each other via a huge number of so-called synapses A single neuron is connected to other neurons by a couple of thousand of these synapses Although neurons could be described as the simple building blocks of the brain the human brain can handle very complex tasks despite this relative sim-plicity This analogy therefore offers an interesting model for the creation of more complex learning machines and has led to the creation of so-called artificial neural networks Such networks can be built using special hardware but most are just software programs that can operate on normal computers Typically a neural network consists of a set of nodes input nodes receive the input signals output nodes give the output signals and a potentially unlimited number of intermediate layers contain the intermediate nodes When using neural

lt TOP gt

15

networks we have to distinguish between two stages - the encoding stage in which the neural network is trained to perform a certain task and the decoding stage in which the network is used to classify examples make predictions or execute whatever learning task is involved There are several different forms of neural network but we shall discuss only three of them here bull Perceptrons bull Back propagation networks bull Kohonen self-organizing map In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron one of the first implementations of what would later be known as a neural network A perceptron consists of a simple three-layered network with input units called photo-receptors intermediate units called associators and output units called responders The perceptron could learn simple categories and thus could be used to perform simple classification tasks Later in 1969 Minsky and Papert showed that the class of problem that could be solved by a machine with a perceptron architecture was very limited It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcomemiddot these difficulties A major improvement was the intro-duction of hidden layers in the so-called back propagation networks A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes In its initial stage a back propagation network has random weightings on its synapses When we train the network we expose it to a training set of input data For each training instance the actual output of the network is compared with the desired output that would give a correct answer if there is a difference between the correct answer and the actual answer the weightings of the individual nodes and synapses of the network are adjusted This process is repeated until the responses are more or less accurate Once the structure of the network stabilizes the learning stage is over and the network is now trained and ready to categorize unknown input Figure1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database The age attribute has been split into three age classes each represented by a separate input node house and car ownership also have an input node There are four addi-tional nodes identifying the four areas so that in this way each input node corresponds to a simple yes-no Decision The same holds for the output nodes each magazine has a node It is clear that this coding corresponds well with the information stored in the database The input nodes are wholly interconnected to the hidden nodes and the hidden nodes are wholly interconnected to the output nodes In an untrained network the branches between the nodes have equal weights During the training stage the network receives examples of input and output pairs corresponding to records in the database and adapts the weights of the different branches until all the inputs match the appropriate outputsIn Figure 2 the network learns to recognize readers of the car magazine and comics Figure 3 shows the internal state of the network after training The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers However the networks do not provide a rule to identify this association Back propagation networks are a great improvement on the perceptron architecture However they also have disadvantages one being that they need an 
extremely large training set Another problem of neural networks is that although they learn they do not provide us with a theory about what they have learned - they are simply black boxes-that give answers but provide no clear idea as to how they arrived at these answers In 1981 Tuevo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonenrsquos self-organizing maps These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain such as visual maps maps of the spatial possibilities of limbs and so on A Kohonen self-organizing map is a collection of neurons or units each of which is connected to a small number of other units called its neighbors Most of the time the Kohonen map is two- dimensional each node or unit contains a factor that is related to the space whose structure we are investigating In its initial setting the self-organizing map has a random assignment of vectors to each unit During the training stage these vectors are incre mentally adjusted to give a better coverage of the space A natural way to visualize the process of training a self- organizing map is the so-called Kohonen movie which is a series of frames showing the positions of the vectors and their connections with neighboring cells The network resembles an elastic surface that is pulled out over the sample space Neural networks perform well on classification tasks and

16

can be very useful in data mining

Figure 1

Figure2

Figure 38 The query manager has several distinct responsibilities It is used to control the following

bull User access to the databull Query schedulingbull Query monitoringThese areas are all very different in nature and each area requires its own tools bespoke software and procedures The query manager is one of the most bespoke pieces of software in the data warehouseUser access to the data The query manager is the software interface between the users and the data It presents the data to the users in a form they understand It also controls the user

lt TOP gt

17

access to the data In a data warehouse the raw data will often be an amalgamation of data needs to be tied together somehow to achieve this raw data is often abstracted Data in this raw format can often be difficult to interpret This coupled with the fact that data from a single logical table is often partitioned into multiple real tables can make ad hoc querying of raw data difficultThe query managerrsquos task is to address this problem by presenting a meaningful schema to the users via a friendly front end The query manager will at one end take in the userrsquos requirements and in the background using the metadata it will transform these requirements into queries against the appropriate dataIdeally all user access tools should work via the query manager However as a number of different tools are likely to be used and the tools used are likely to change over time it is possible that not all tools will work directly via the query managerIf users have access via tools that do not interface directly through the query manager you should try setting up some form of indirect control by the query manager you should try setting up some form of indirect control by the query manager Certainly no large ad hoc queries should be allowed to be run by anyone other than the query manager It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up If queries do bypass the query manager query statistics gathering will be less accurateQuery Scheduling Scheduling of ad hoc queries is a responsibility of the query manager Simultaneous large ad hoc queries if not controlled can severely affect the performance of any system in particular if the queries are run using parallelism where a single query can potentially use all the CPU resource made available to it One aspect of query control that it is glaringly visible by its absence is the ability to predict how long a query will take to completeQuery monitoring One of the main functions of the query manager is to monitor the queries as they run This is one of the reasons why all queries should be run via or at least notified to the query manager One of the keys to success of usage of data ware house is to that success is the tuning of ad hoc environment to meet the userrsquos needs To achieve this query profiles of different groups of users need to be known This can be achieved only if there is long-term statistics on the queries run by each user and the resources used by each query The query execution plan needs to be stored along the statistics of the resources used and the query syntax usedThe query manager has to be capable of gathering these statistics which should then be stored in the database for later analysis It should also maintain a query history Every query created or executed via query manager should be logged This allows query profiles to be built up over time This enables identification of frequently run queries or types of queries These queries can then be tuned possibly by adding new indexes or by creating new aggregations

lt TOP OF THE DOCUMENT gt

18

Page 15: 0810 Its-dwdm Mb3g1it

multidimensional data For example it provides time series and trend analysis views OLAP is a technology designed to provide superior performance for ad hoc business intelligence queries OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses MOLAP This is the more traditional way of OLAP analysis In MOLAP data is stored in a multidimensional cube The storage is not in the relational database but in proprietary formats Advantages bull Excellent performance MOLAP cubes are built for fast data retrieval and is optimal for

slicing and dicing operations bull Can perform complex calculations All calculations have been pre-generated when the

cube is created Hence complex calculations are not only doable but they return quickly Disadvantages bull Limited in the amount of data it can handle Because all calculations are performed when

the cube is built it is not possible to include a large amount of data in the cube itself This is not to say that the data in the cube cannot be derived from a large amount of data Indeed this is possible But in this case only summary-level information will be included in the cube itself

bull Requires additional investment Cube technology are often proprietary and do not already exist in the organization Therefore to adopt MOLAP technology chances are additional investments in human and capital resources are needed

ROLAP This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAPrsquos slicing and dicing functionality In essence each action of slicing and dicing is equivalent to adding a WHERE clause in the SQL statement Advantages bull Can handle large amounts of data The data size limitation of ROLAP technology is the

limitation on data size of the underlying relational database In other words ROLAP itself places no limitation on data amount

bull Can leverage functionalities inherent in the relational database Often relational database already comes with a host of functionalities ROLAP technologies since they sit on top of the relational database can therefore leverage these functionalities

Disadvantages bull Performance can be slow Because each ROLAP report is essentially a SQL query (or

multiple SQL queries) in the relational database the query time can be long if the underlying data size is large

bull Limited by SQL functionalities Because ROLAP technology mainly relies on generating SQL statements to query the relational database and SQL statements do not fit all needs (for example it is difficult to perform complex calculations using SQL) ROLAP technologies are therefore traditionally limited by what SQL can do ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions

Section C Applied Theory

7. Neural networks: Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. In Freud's theory of psychodynamics the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10^11, connected to each other via a huge number of so-called synapses; a single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware, but most are just software programs that can operate on normal computers. Typically, a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes. When using neural networks we have to distinguish between two stages: the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions, or execute whatever learning task is involved.

There are several different forms of neural network, but we shall discuss only three of them here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing maps

In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators, and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks. Later, in 1969, Minsky and Papert showed that the class of problems that could be solved by a machine with a perceptron architecture was very limited. It was only in the 1980s that researchers began to develop neural networks with a more sophisticated architecture that could overcome these difficulties. A major improvement was the introduction of hidden layers in the so-called back propagation networks. A back propagation network not only has input and output nodes but also a set of intermediate layers with hidden nodes. In its initial stage a back propagation network has random weightings on its synapses. When we train the network, we expose it to a training set of input data. For each training instance the actual output of the network is compared with the desired output that would give a correct answer; if there is a difference between the correct answer and the actual answer, the weightings of the individual nodes and synapses of the network are adjusted. This process is repeated until the responses are more or less accurate. Once the structure of the network stabilizes, the learning stage is over and the network is trained and ready to categorize unknown input.

Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node each. There are four additional nodes identifying the four areas, so that in this way each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is clear that this coding corresponds well with the information stored in the database. The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are wholly interconnected to the output nodes. In an untrained network the branches between the nodes have equal weights. During the training stage the network receives examples of input and output pairs corresponding to records in the database and adapts the weights of the different branches until all the inputs match the appropriate outputs. In Figure 2 the network learns to recognize readers of the car magazine and comics. Figure 3 shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine and comics readers; however, the network does not provide a rule to identify this association.

Back propagation networks are a great improvement on the perceptron architecture. However, they also have disadvantages, one being that they need an extremely large training set. Another problem of neural networks is that although they learn, they do not provide us with a theory about what they have learned: they are simply black boxes that give answers but provide no clear idea as to how they arrived at these answers.

In 1981 Teuvo Kohonen demonstrated a completely different version of neural networks that is currently known as Kohonen's self-organizing maps. These neural networks can be seen as the artificial counterparts of maps that exist in several places in the brain, such as visual maps, maps of the spatial possibilities of limbs, and so on. A Kohonen self-organizing map is a collection of neurons or units, each of which is connected to a small number of other units called its neighbors. Most of the time the Kohonen map is two-dimensional; each node or unit contains a vector that is related to the space whose structure we are investigating. In its initial setting the self-organizing map has a random assignment of vectors to each unit. During the training stage these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the so-called Kohonen movie, which is a series of frames showing the positions of the vectors and their connections with neighboring cells. The network resembles an elastic surface that is pulled out over the sample space. Neural networks perform well on classification tasks and can be very useful in data mining.
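As a concrete illustration of the training loop described above (compare the actual output with the desired output and adjust the weights until the responses stabilize), here is a minimal back-propagation sketch using Python and NumPy. The yes/no input encoding, the network sizes and the learning rate are illustrative assumptions only; this is not the network of Figures 1 to 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical yes/no input attributes (e.g. age class, house owner, car owner)
# and one output per "magazine" the customer reads.
X = np.array([[1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([[1], [0], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random initial weightings on the "synapses", as in an untrained network.
W1 = rng.normal(scale=0.5, size=(4, 3))   # input layer  -> hidden layer
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden layer -> output layer

for epoch in range(5000):                  # training (encoding) stage
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)

    # Difference between the desired answer and the actual answer ...
    error = y - output

    # ... is propagated back to adjust the weights (gradient of squared error).
    grad_out = error * output * (1 - output)
    grad_hid = (grad_out @ W2.T) * hidden * (1 - hidden)
    W2 += 0.5 * hidden.T @ grad_out
    W1 += 0.5 * X.T @ grad_hid

# Decoding stage: the trained network classifies input it is given.
print(np.round(sigmoid(sigmoid(X @ W1) @ W2), 2))
```

After enough passes the outputs settle close to the desired 0/1 targets, at which point the network is ready to categorize unseen records; as noted above, the learned weights themselves do not explain why a record is classified one way or the other.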

Figure 1

Figure 2

Figure 3

8. The query manager has several distinct responsibilities. It is used to control the following:
• User access to the data
• Query scheduling
• Query monitoring

These areas are all very different in nature, and each area requires its own tools, bespoke software and procedures. The query manager is one of the most bespoke pieces of software in the data warehouse.

User access to the data: The query manager is the software interface between the users and the data. It presents the data to the users in a form they understand. It also controls the user access to the data. In a data warehouse the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, the raw data is often abstracted. Data in this raw format can often be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult. The query manager's task is to address this problem by presenting a meaningful schema to the users via a friendly front end. The query manager will at one end take in the user's requirements and, in the background, using the metadata, it will transform these requirements into queries against the appropriate data. Ideally, all user access tools should work via the query manager. However, as a number of different tools are likely to be used, and the tools used are likely to change over time, it is possible that not all tools will work directly via the query manager. If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly no large ad hoc queries should be allowed to be run by anyone other than the query manager. It may be possible to get the tool to dump the query request to a flat file where the query manager can pick it up. If queries do bypass the query manager, query statistics gathering will be less accurate.

Query scheduling: Scheduling of ad hoc queries is a responsibility of the query manager. Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of any system, in particular if the queries are run using parallelism, where a single query can potentially use all the CPU resource made available to it. One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.

Query monitoring: One of the main functions of the query manager is to monitor the queries as they run. This is one of the reasons why all queries should be run via, or at least notified to, the query manager. One of the keys to the successful usage of a data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, the query profiles of different groups of users need to be known. This can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used and the query syntax used. The query manager has to be capable of gathering these statistics, which should then be stored in the database for later analysis. It should also maintain a query history: every query created or executed via the query manager should be logged. This allows query profiles to be built up over time and enables identification of frequently run queries or types of queries. These queries can then be tuned, possibly by adding new indexes or by creating new aggregations.
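To make the statistics-gathering role concrete, the following minimal sketch shows how a query manager might log every query it runs into a history table for later profiling. The table layout and the run_managed_query helper are hypothetical, and Python's built-in sqlite3 module merely stands in for the warehouse database.

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")   # stand-in for the warehouse database
con.execute("""CREATE TABLE query_history (
                   user_name     TEXT,
                   query_text    TEXT,
                   started_at    REAL,
                   elapsed_secs  REAL,
                   rows_returned INTEGER)""")
con.execute("CREATE TABLE sales_fact (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                [("North", 120.0), ("South", 200.0)])

def run_managed_query(user, sql, params=()):
    """Run a query on behalf of a user and record the statistics needed
    to build up query profiles (frequency, elapsed time) over time."""
    started = time.time()
    rows = con.execute(sql, params).fetchall()
    elapsed = time.time() - started
    con.execute("INSERT INTO query_history VALUES (?, ?, ?, ?, ?)",
                (user, sql, started, elapsed, len(rows)))
    return rows

run_managed_query("analyst1",
                  "SELECT region, SUM(amount) FROM sales_fact GROUP BY region")

# Later analysis: which queries are run most often, and how long do they take?
for row in con.execute("""SELECT query_text, COUNT(*), AVG(elapsed_secs)
                          FROM query_history GROUP BY query_text"""):
    print(row)
```

Profiles built from such a history are what allow frequently run queries to be identified and tuned, for example by adding new indexes or creating new aggregations.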
