Coordinator Guide – Data Warehousing and Data Mining


PREREQUISITES

This course is a part of the Information Technology program (B.Sc. (IT)) of Kuvempu University.

A student registering for the fifth semester of B.Sc. (IT) of Kuvempu University must have completed the fourth semester of B.Sc. (IT). The student should have attained knowledge of the following modules:

- Algorithms
- Java Programming
- Unix & Shell Programming
- Software Engineering


CHAPTER-SPECIFIC INPUTS

    Chapter One

Objectives

In this chapter, the students have learned to:

- Describe basic database management concepts
- Define data warehouse and data mining systems

Focus Areas

Introduce the need for a database management system with the help of an example, such as the need for a company to organize its data and manage it through a front-end application. Ensure that the students understand the drawbacks of a file management system as compared to a database management system (DBMS). Explain elementary database concepts using ample examples.

    Next, give the following analogy to the students to introduce data warehouses.

    "A large insurance company has stored and organized its relevant official papers on racks in each room. It stores a large volume of papers in a separate warehouse. Some of the papers have now become historical and are not required in day-to-day working. However, these papers cannot be destroyed, as they are an essential asset for analysis of market trends and the companys progress. Moreover, it is also mandatory to maintain such papers till a fixed period of time. Therefore, it becomes essential that the papers required for day-to-day processing are stored in areas where they can be accessed quickly and the historical papers are stored for future reference in a bigger storage area.

    If you compare a database to the racks containing papers for day-to-day processing, then the database's counterpart for the warehouse is called a data warehouse."

    Now, explain the term data warehouse and the usage of a data warehouse. Similarly, explain data mining.

    After introducing these concepts, tell the students that before proceeding further, they must understand two important database management concepts called normalization and entity relationships.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    Database Normalization

    Normalization is the process in which database objects such as attributes, keys, and relationships are restructured to remove redundancy and dependencies, and stabilize and simplify the database.

    Normalization is an important concept because it has a direct impact on the storage efficiency of the database.

    Normalization is done on the basis of certain rules. Each set of rules makes a database object, such as a table, more "normal". Each set of rules is called a "normal form." If the first set of rules is observed, the database is said to be in the "first normal form." If the second set of rules is observed, the database is considered to be in the "second normal form." Finally, if the third set of rules is observed, the database is considered to be in the "third normal form." The third normal form is the highest level of normalization necessary for most applications.

    2 Coordinator Guide Data Warehousing and Data Mining NIIT

Consider the following definitions of normal forms:

First Normal Form (1NF)

1NF aims at removing repeating, multi-valued attributes from a table. The steps to be followed for converting a table to a 1NF table are:

1. Eliminate repeating groups in individual tables.
2. Create a separate table for each set of related data.
3. Identify each set of related data with a primary key.

Second Normal Form (2NF)

2NF aims at removing attributes that are not fully dependent on the primary key. The steps to be followed for converting a table to a 2NF table are:

1. Ensure that the table meets all the requirements of the first normal form.
2. Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
3. Create relationships between these new tables and their predecessors using foreign keys.

Third Normal Form (3NF)

3NF aims at removing attributes that are dependent on other attributes that are not a part of the primary key. The steps to be followed for converting a table to a 3NF table are:

1. Ensure that the table meets all the requirements of the second normal form.
2. Remove columns that are not dependent upon the primary key.

    Example:

    Consider the following unnormalized table to understand normalization:

    Students

Student   Subject 1    Teacher 1   Subject 2        Teacher 2   Subject 3    Teacher 3
John      English-01   George      Mathematics-01   Mary        Science-01   Greg
Tony      English-01   James       Mathematics-02   Mary        Science-01   Peter
John      English-02   Pat         Mathematics-02   Mary        Science-01   Greg

The sample table Students, given above, is unnormalized because it has repeating (subject-teacher) groups for each student and does not have any primary key. To convert the table into 1NF, it must be given a primary key and the repeating groups must be eliminated.

The 1NF form of the above table is:

RNo    Student   Subject       Teacher   Code
1102   John      English       George    01
1102   John      Mathematics   Mary      01
1102   John      Science       Greg      01
1105   Tony      English       James     01
1105   Tony      Mathematics   Mary      02
1105   Tony      Science       Peter     02
1109   John      English       Pat       02
1109   John      Mathematics   Mary      02
1109   John      Science       Greg      01

After implementing 1NF on the above table:

- Each row is uniquely identified by a primary key (RNo now identifies that there are two different students with the name John).
- The sequence of rows and columns is insignificant (the sequence of columns was significant in the unnormalized form).
- Each column is unique (unlike the unnormalized form, where there were repeating groups such as Subject 1, Subject 2, and Subject 3).
- Each column has single values (subject name and code are now two separate columns, unlike the unnormalized form, where the code was stored as part of the subject entry).

    Similarly, other normal forms can be implemented after the requirements for the previous normal forms are met.
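As an illustration, the 1NF table above could be carried toward 2NF and 3NF by storing student and teacher details separately from the per-subject facts. The following SQL is only a minimal sketch; the table and column names (STUDENTS, TEACHERS, ENROLMENTS, and so on) are illustrative assumptions, not structures defined in the SG:

CREATE TABLE STUDENTS (
    RNO          NUMBER PRIMARY KEY,    -- uniquely identifies each student
    STUDENT_NAME VARCHAR2(50)
);

CREATE TABLE TEACHERS (
    TEACHER_ID   NUMBER PRIMARY KEY,
    TEACHER_NAME VARCHAR2(50)
);

CREATE TABLE ENROLMENTS (
    RNO          NUMBER REFERENCES STUDENTS(RNO),
    SUBJECT      VARCHAR2(30),
    SUBJECT_CODE CHAR(2),
    TEACHER_ID   NUMBER REFERENCES TEACHERS(TEACHER_ID),
    PRIMARY KEY (RNO, SUBJECT)          -- each student takes a subject once
);

Student and teacher names are now stored once each instead of being repeated on every subject row, which removes the kind of partial and transitive dependencies that 2NF and 3NF target.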

    Entity Relationship Model

An entity is a concrete or abstract object that exists and is distinguishable from other objects. It is represented by a set of attributes, that is, properties. For example, "Employee" is an entity and has attributes such as Name and Address, which define its characteristics. In terms of a database, an entity can be mapped to a table or a view. For instance, the Employee entity can be represented as a box listing its attributes: First Name, Last Name, Address, and DOB.

Entities in a database can share a relationship. A relationship is an association between several entities. For example, the two entities Employee and Department can be related as shown in the following figure:

[Figure: The Employee entity (First Name, Last Name, Address, DOB) connected to the Department entity (Name, DNo., Head) by a "Works in" relationship.]

    Three different types of relationships can exist between two entities:

    1. One-to-One Relationships: This type of relationship exists when a single occurrence of an entity is related to just one occurrence of another entity. For example, a person has one PAN number and a PAN number is allotted to only one person. Therefore, the entity Person has a one-to-one relationship with the entity PAN number.

    2. One-to-Many Relationships: This type of relationship exists when a single occurrence of an entity is related to many occurrences of another entity. For example, a student studies in one school but a school has various students. Therefore, the relationship between School and Students is one-to-many.


  • 3. Many-to-Many Relationships: This type of relationship exists when many occurrences of an entity are related to many occurrences of another entity. For example, resources are allocated to many projects; and a project is allocated many resources. Therefore, the relationship between Resources and Projects is many-to-many.

The relationships between entities are depicted using specialized diagrams known as entity-relationship (ER) diagrams.
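In a relational database, these relationships are implemented with primary and foreign keys. The sketch below is illustrative only (the DEPARTMENT, EMPLOYEE, and PROJECT_ALLOCATION tables are assumptions, not taken from the SG); it shows a one-to-many relationship through a foreign key and a many-to-many relationship through a junction table:

CREATE TABLE DEPARTMENT (
    DNO   NUMBER PRIMARY KEY,
    NAME  VARCHAR2(50),
    HEAD  VARCHAR2(50)
);

CREATE TABLE EMPLOYEE (
    EMP_ID     NUMBER PRIMARY KEY,
    FIRST_NAME VARCHAR2(30),
    LAST_NAME  VARCHAR2(30),
    DNO        NUMBER REFERENCES DEPARTMENT(DNO)   -- many employees "work in" one department
);

-- A many-to-many relationship (resources allocated to projects) needs a junction table.
CREATE TABLE PROJECT_ALLOCATION (
    RESOURCE_ID NUMBER,
    PROJECT_ID  NUMBER,
    PRIMARY KEY (RESOURCE_ID, PROJECT_ID)
);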

    NIIT Coordinator Guide Data Warehousing and Data Mining 5

FAQ

1. What are the disadvantages of a file management system compared to a DBMS?

    Ans:

Some of the disadvantages of a file management system compared to a database management system are:

- Data redundancy and inconsistency
- Difficulty in accessing data
- Difficulty in integrating data into new enterprise-level applications because of varying formats
- Lack of support for concurrent updates by multiple users
- Lack of inherent security

    2. Are relational databases the only possible type of database models?

    Ans:

    No. Apart from relational, other models include network and hierarchical models. However, these two models are obsolete. Nowadays, relational and object-oriented models are preferred.

3. What is referential integrity and how is it achieved in a relational database?

    Ans:

Referential integrity is a feature of a DBMS that prevents the user from entering inconsistent data. It is mainly achieved by placing a foreign key constraint on a table.
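As a quick illustration (the CUSTOMERS and ORDERS tables below are hypothetical, not from the SG), a foreign key constraint rejects any order row that refers to a non-existent customer:

CREATE TABLE CUSTOMERS (
    CUST_ID NUMBER PRIMARY KEY,
    NAME    VARCHAR2(50)
);

CREATE TABLE ORDERS (
    ORDER_ID NUMBER PRIMARY KEY,
    CUST_ID  NUMBER NOT NULL,
    -- the foreign key constraint enforces referential integrity
    CONSTRAINT FK_ORDER_CUST FOREIGN KEY (CUST_ID) REFERENCES CUSTOMERS(CUST_ID)
);

-- This insert fails unless customer 101 already exists in CUSTOMERS.
INSERT INTO ORDERS (ORDER_ID, CUST_ID) VALUES (5001, 101);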

    4. What are the higher normal forms?

    Ans:

A normal form considered higher than 3NF is the Boyce-Codd Normal Form (BCNF). BCNF differs from 3NF when there is more than one composite candidate key and the candidate keys overlap.


Chapter Two

Objectives

In this chapter, the students have learned to:

- Define a data warehouse
- Identify the data warehouse process
- Identify the process flow in a data warehouse
- Identify the architecture for a data warehouse
- Apply data warehouse schemas
- Partition a fact table into separate partitions
- Identify the need for metadata and data marts

Focus Areas

Initiate the discussion by asking the following questions:

- What is data?
- What is the need to store this data?
- How should the data be stored?
- What techniques can be adopted to retrieve enormous volumes of data efficiently?

You can also discuss the architecture for storing voluminous data with the help of a data warehouse. In addition, the differences between a data warehouse, a data mart, and metadata should be discussed clearly.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    Evolution of Data Warehousing

Large organizations all over the world are increasingly using data for analytical purposes to assist them in taking business decisions in real time. The applications used for such analyses can be termed Business Intelligence applications. These analyses assist in determining trends, based on which future business decisions can be formulated.

    One of the fundamental requirements of these types of complex analyses is that they require a large volume of data to reach a consistent level of sampling.

For example, soft drink manufacturers often use historical data to forecast the quantity of bottles to be manufactured currently. These forecasts are based on parameters such as the temperature, the purchasing capacity of customers, age groups, and so on. To obtain a better forecast, the amount of sample data must be very large. In fact, the larger the volume of data, the better the chances that the forecast will be accurate.

The data requirement for this kind of analysis is of two types:

- A large volume of data
- Historical data, capturing the data at various frames of time

On Line Transaction Processing (OLTP) systems are not suitable for this kind of complex analysis. This is because OLTP systems:

- Contain only transactional/recent data
- Are specifically designed to manage transaction processing

    NIIT Coordinator Guide Data Warehousing and Data Mining 7

  • There are therefore two distinct uses of data in organizations. The first is the analysis of historical data for business analytical purposes. The second is the usage of data that takes care of the daily transactional activities.

The use of data for business analytical purposes has led to the development of data warehouses. A data warehouse is one of the key components of Business Intelligence systems. The data stored in a data warehouse is used by querying and reporting tools, data mining applications, and On-Line Analytical Processing (OLAP) applications for business analysis.

    Data Mart

    A data mart is a specific subset of the contents of a data warehouse, stored within its own database. It contains data focused at a department level or on a specific business area of the organization. The volume of data in a data mart is less than that of a typical data warehouse, making query processing faster. Experts use data marts for analysis, rather than using the main data warehouse. For example, a large organization can treat its individual departments or divisions as independent business units. Each of these units can have its own data mart, used regularly by the analysts of that specific unit. Data marts contribute to the main data warehouse on a regular basis.

    Data Warehouse vs. Data Mart

A data mart is simply a mini data warehouse. More specifically, it is a data warehouse designed for a specific purpose or to be analyzed by a particular group within the organization. Its data can be extracted from the data warehouse. Basically, a data warehouse collects a wide range of data types, while a data mart contains only the data that a particular group of users will want.

    Data Warehouse and CRM

Customer Relationship Management (CRM) uses data warehousing for value enhancement purposes. In CRM, data warehousing fulfills two main needs. It helps in analyzing customer habits, which in turn helps businesses, such as hotels and airlines, to offer better services to customers. It also helps in analyzing the behavior trends of customer segments. This analysis helps in determining segment commonalities and thereby in developing business packages based on the segments' common habits.

    Data Warehouse and DSS

Decision support systems (DSS) are used by organizations to improve strategic, tactical, and operational decisions. Such decisions are based on complex analysis of historical data. A data warehouse provides the repository of historical data that supports DSS functioning.

    Need for Security of a Data Warehouse

    Let us consider an example:

Consider a company that keeps all its information in a centralized data warehouse. The data warehouse will certainly contain confidential information as well. Analysts in every department need the data warehouse to analyze data specific to their department, but they also want to hide this data from others. In such a case, roles are critical.

Consider an enterprise data warehouse, such as an airline system in which the data of all airlines is stored in one data warehouse. If security is not implemented, the analyst of one airline can easily obtain the secrets of another airline.

Hence, security is a crucial factor that must be considered while implementing a data warehouse.

    Metadata

Due to the large volume of data that exists and is queried in a data warehouse, it is often useful to classify the type of data. This helps improve query response times. In a data warehouse, a specific type of data, known as metadata, is used. Metadata contains information about the types of data.


For example, consider the class of an object and the object itself, that is, an instance of the class. If data is the instance, then metadata is the class. Metadata is data about data.

    Characteristics of a Data Warehouse

    There are four characteristics intrinsic to a data warehouse:

- Consolidated and consistent data
- Subject-oriented data
- Historical data
- Read-only data

    Consolidated and Consistent Data

    A data warehouse combines operational data from a variety of sources with consistent naming conventions, measurements, physical attributes, and semantics.

    For example, the dates are often in different formats. Different regions often refer to the same piece of data differently. "Total" can also be known as "Balance" or "Total Amount" to represent the amount of cash transactions.

    This is unacceptable because data in a data warehouse is stored in a central location and cannot be stored in different formats or referred to differently. Data has to be stored in the data warehouse in a single, agreed-upon format, despite variations in the operational sources. This enables data from across the enterprise to be combined in the data warehouse and cross referenced by the analysts.

    Subject-Oriented Data

Often, a large amount of data is useful only to the person who owns or created it. Analysts might not need all of this data, and it serves no purpose to store such data in a data warehouse. During data cleaning and reformatting, data that cannot be used for analysis is discarded.

For example, it is customary to release all goods from the warehouse against an invoice. In the local database, the invoice number is an integral part of the data, and so are the names of the persons entering and approving the goods disbursement. However, information such as who entered or approved the invoice has no relevance to business or executive reporting. The presence of this data makes business analysis cumbersome; therefore, the data must be organized in a way that makes querying manageable.

    In a data warehouse, only the key business information from operational sources must be stored and organized for analysis.

    Historical Data

Data in OLTP systems always represents the current value of data at any moment in time. However, data stored in a data warehouse represents data at specific points or frames in time. Data stored in a data warehouse represents historical rather than current information. This is a typical and mandatory requirement of a data warehouse.

For example, an order-entry form, which is an OLTP application, will always display the current level of a specific item in the inventory. It will not show data at some point of time in the past. On the other hand, when all inventory-specific data is stored in the data warehouse, the warehouse must contain snapshots of the inventory on a daily, weekly, or monthly basis. Moreover, the analysis applications that access this data from the warehouse should be able to retrieve these snapshots and make them accessible to the end users.


Read-only Data

After data has been moved to the data warehouse, it cannot be changed. Data stored in a data warehouse pertains to a point in time or to a specific timeframe; therefore, it must never be updated. The only operations that occur in a data warehouse, once it has been set up, are loading and querying.

Differences in the Way Data Behaves in an RDBMS and a Data Warehouse

[Figure: An RDBMS receives updates and queries, whereas a data warehouse receives only data loads and queries.]

FAQ

1. On which layer of the application architecture does a data warehouse operate?

    Ans:

    A data warehouse is a server-side repository to store data.

    2. What is a data warehouse?

    Ans:

    A data warehouse is a huge repository of data used for very complex business analysis.

    3. What are the benefits of data warehousing?

    Ans:

The main benefit of data warehousing is that it keeps data in a form that allows complex business analysis to be done in a minimum amount of time.

    4. What are the application areas of a data warehouse?

    Ans:

    There are various application areas of a data warehouse. Some of these are:

- Airlines
- Meteorology
- Logistics
- Insurance

    5. Where can you use data warehouse successfully?

    Ans:

    It is ideal to implement data warehousing when there is a large amount of historical data that needs to be processed for extensive analytical purposes.


6. When and where is a data mart useful?

    Ans:

A data mart helps to provide data for conducting analysis at a specialized level. It is typically used at the strategic business unit level, such as the department level of a business unit.

    7. What does historical data support in data warehouse?

    Ans:

    Historical data is used for supplying pre-processed and non-pre-processed data for conducting business analysis.


Chapter Three

Objectives

In this chapter, the students have learned to:

- Identify the characteristics of the Star Flake schema
- Design fact tables
- Design dimension tables
- Design the Star Flake schema
- Identify the concept of query redirection
- Analyze data using a multi-dimensional schema

Focus Areas

Conduct a quick recap to ensure that the students have clearly understood the concept of a data warehouse. Initiate a discussion on the concept of schemas. Explain the Star Flake schema. Discuss the use of fact and dimension tables. Also, clarify the distinction between fact and dimension tables so that there is no confusion in this respect.

Discuss the guidelines to be followed while creating fact and dimension tables. Also, discuss the concept of query redirection and its purpose in helping users get the most appropriate results in time.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    Fact Tables

Fact tables contain data that describes a specific event within a business, such as a financial transaction or a product sale. Fact tables can also contain data aggregations, such as sales per month per region. Under normal conditions, existing data within a fact table is not updated; however, new data is loaded. Fact tables contain the majority of the data stored in a data warehouse. This makes the structuring of fact tables very crucial. Fact tables:

- Contain data that is static in nature
- Contain many records, possibly billions
- Store numeric data
- Have multiple foreign keys

For example, a fact table can contain data such as product ID numbers and geographical IDs of areas.

    Dimension Tables

Dimension tables contain data used to reference the data stored in the fact table, such as product descriptions, customer names, and addresses. Data in this case is mainly stored as characters. It is possible to optimize queries by separating the fact table data from the dimension table data. Dimension tables do not contain as many rows as fact tables. Dimension tables can change and must be structured to permit change. Dimension tables:

- Can be updated
- Have fewer rows than fact tables
- Store data as characters
- Have many columns to manage dimension hierarchies
- Have one primary key, also known as the dimensional key

    12 Coordinator Guide Data Warehousing and Data Mining NIIT

  • For example, a Retailer dimension table can contain retailer ID and retailer name columns.

    Keys: Unique identifiers used to query data stored in the central fact table. The dimensional key, such as a primary key, links a row in the fact table with one dimension table. This structure makes it easy to construct complex queries and support a drill-down analysis in decision-support applications.
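To make the fact/dimension split concrete, here is a minimal DDL sketch. The PRODUCT_DIM and SALES_FACT names and columns are illustrative assumptions, not the SG's tables; the dimension table carries the dimensional key as its primary key, and the fact table stores numeric measures plus foreign keys to its dimensions:

CREATE TABLE PRODUCT_DIM (
    PRODUCT_ID   NUMBER PRIMARY KEY,       -- dimensional key
    PRODUCT_DESC VARCHAR2(100),
    BRAND_NAME   VARCHAR2(50)
);

CREATE TABLE SALES_FACT (
    PRODUCT_ID NUMBER REFERENCES PRODUCT_DIM(PRODUCT_ID),
    REGION_ID  NUMBER,                     -- key to a region dimension (not shown)
    UNITS_SOLD NUMBER,                     -- numeric measure
    SALE_VALUE NUMBER(12,2)                -- numeric measure
);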

    Star Schema

    The star schema is a relational database structure and is a popular design technique used to implement a data warehouse. In this schema, data is maintained in a single fact table at the centre of the schema. Each dimension table is directly related to the fact table by a key column.

    The star schema design increases query performance by reducing the volume of data read from disk to satisfy a query. Queries first analyze data in the dimension tables to obtain the dimension keys that index into the central fact table. In this way, the number of rows that have to be scanned to satisfy a query is reduced greatly.

[Figure: Star Schema. A central SALES fact table (SeasonID, RegionalID, RetailerID, ProductID, UnitsSold, SeasonalDiscount) is surrounded by four dimension tables: SEASON (SeasonID, SeasonDescription, StartDate, StartMonth, EndDate, EndMonth, StartYear, EndYear), REGIONAL (RegionID, RegionDescription, RegionLocationDistrict, RegionLocationState, RegionLocationCountry, RegionalInchargeID), RETAILER (RetailerID, Retailership start date, Retailership validity period), and PRODUCT (ProductID, ProductDescription, LaunchDate, ContinuityStatus, Unit, Price, BrandName, ProductProperty1Value, ProductProperty2Value).]
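A typical query against this schema filters on the dimension tables first and then joins into the fact table, which is what reduces the rows scanned. The following is a hedged sketch using the figure's tables (the exact column names are assumed from the figure):

SELECT p.ProductDescription,
       r.RegionLocationState,
       SUM(s.UnitsSold) AS TotalUnits
FROM   Sales s
       JOIN Product  p ON s.ProductID  = p.ProductID
       JOIN Regional r ON s.RegionalID = r.RegionID
WHERE  p.BrandName = 'BrandX'                 -- dimension filter
  AND  r.RegionLocationCountry = 'India'      -- dimension filter
GROUP BY p.ProductDescription, r.RegionLocationState;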

    OLAP

The arrangement and analysis of data in a data warehouse is done using On-Line Analytical Processing (OLAP) systems. The historical data supports business decisions at various levels and departments, from strategic business planning to the financial performance appraisal of a distinct organizational unit or business functional area.

    After being collected from various heterogeneous sources, data is extracted, cleaned or scrubbed, and stored in the data warehouse in a homogeneous form. This is the first part of the activity.

The analysts now have to choose the right data. Applications have to be built to assist analysts in completing their activity in a limited amount of time. Another application must arrange the data in a format that makes it easily accessible to the end user on querying. OLAP arranges data in an easily accessible format typical of a data warehouse.

    OLAP technology thus enables data warehouses to be used effectively for:

- Online analysis
- Providing quick responses to complex iterative queries posed by analysts

    NIIT Coordinator Guide Data Warehousing and Data Mining 13

OLAP achieves this through its multi-dimensional data models. Multi-dimensional data models are to a data warehouse what tables are to an RDBMS. These are the units where data is stored. These models are used to organize and summarize large amounts of data, making the data easy to evaluate using online analysis and graphical tools. They also provide the speed and flexibility to support analysts, helping them complete complex analyses within an acceptable time limit.

[Figure: Overview of how OLAP Services are Correlated to the Data Warehouse. Heterogeneous operational/transactional data passes through the load manager (data extraction, data transformation, data loading) into the data warehouse's data storage, on top of which OLAP serves business analysts and other business users.]

OLTP databases have basic constituents such as tables and relations. In the case of data warehouses, these constituents are fact tables and dimension tables. A fact table contains measurable columns such as sales, costs, and expenses. A dimension table contains columns on which the fact table columns can be categorized. For example, a Product dimension table can have columns such as product family, product category, and product ID, on which sales, costs, and expenses can be categorized. Similarly, the fact table columns can also be categorized on Store dimension columns such as country and region.

    The following table displays the sales categorized according to the product family (from Product dimension) and country (from Store dimension):

            USA        Germany    Mexico
Drink       2500000    450000     670000
Food        3400000    675000     774800
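A result like this is produced by grouping the fact table's sales measure on one Product column and one Store column. The following is a minimal sketch; the table and column names (Sales_Fact, Product, Store, and so on) are assumptions for illustration only:

SELECT p.Product_Family,
       st.Country,
       SUM(f.Sales) AS Total_Sales
FROM   Sales_Fact f
       JOIN Product p  ON f.Product_ID = p.Product_ID
       JOIN Store   st ON f.Store_ID   = st.Store_ID
GROUP BY p.Product_Family, st.Country;

An OLAP tool then pivots this result so that the countries appear as columns, as in the table above.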

Solutions to Chapter Three Questions

8. What is query redirection?

    Ans.

When the available data grows beyond a manageable size, partitioning becomes essential. Query redirection means that queries are directed to the appropriate partitions that store the data required by the query.

FAQ

1. What are the benefits of OLTP?

    Ans:

On Line Transaction Processing (OLTP) assists in storing current business transactional data. It also allows a large number of concurrent users to access data at the same time.

    14 Coordinator Guide Data Warehousing and Data Mining NIIT

2. Why can OLTP not provide historical data for analysis?

    Ans:

Data in a data warehouse comes from OLTP systems. However, OLTP data cannot be used directly for analysis because it is not organized to return results quickly from billions of records. In a data warehouse, data is classified into various categories, so it is possible to return results quickly.

    3. Why is the data in the data warehouse not stored in a normalized form as in OLTP?

    Ans:

The objective of storing data in a normalized form in OLTP is to reduce redundancy and minimize disk storage. The key objective in a data warehouse is to enhance the query response time. The easier the access to data, the better the query response time. Hence, the normalization rules do not matter in a data warehouse.

    4. An integral part of OLTP is its support for hundreds of concurrent users. The number of concurrent users supported by a data warehouse is comparable to OLTP. Is this statement true or false? Justify your answer.

    Ans:

The statement is false. This is because the number of people involved in data analysis is very low as compared to the front-end users who work with transactional data. Moreover, the CPU usage per user is very high in the case of data warehousing as compared to OLTP.

    5. Explain why a data warehouse does not use current or OLTP data for analysis.

    Ans:

The main purpose of a data warehouse is to provide historical data to analyze business trends. Therefore, historical data needs to capture snapshots of events over time, not just the current data.

    6. What is the advantage of MOLAP as storage model?

    Ans:

    MOLAP dimensions provide better query performance. Here the contents of the dimension are processed and stored on the Analysis Server and not on a Relational Server.

    7. What kind of data does a fact table contain?

    Ans:

    A fact table contains numeric data.

    8. What are the different OLAP storage models?

    Ans:

    Following are the different OLAP storage models: MOLAP, ROLAP and HOLAP.

    9. A data analysis has to be done in the fastest possible means on data stored in Multi-dimensional format. Which storage model is best suited in this case?

    Ans:

    MOLAP.


Chapter Four

Objectives

In this chapter, the students have learned to:

- Identify a partitioning strategy
- Use horizontal partitioning
- Use vertical partitioning
- Use hardware partitioning

Focus Areas

Introduce partitioning by asking the students to identify ways in which the performance and manageability of a data warehouse can be improved. Inform students that if data is broken into smaller, manageable chunks, it can be scanned faster and managed easily. In this context, introduce partitioning. Ask students to identify the advantages of partitioning, such as faster access, better manageability due to smaller size, improved recovery time, and reduced effect of failure or breakdown.

Explain the various types of partitioning with the help of examples for each. You can also explain striping, discussed in the Additional Inputs section, which also gives examples of how striping can help in optimization. To demonstrate partitioning, you may explain it with reference to the 'Partitioning in Oracle' section discussed in Additional Inputs.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

Optimizing through Striping

    Data warehouses handle a large amount of historical, present, and future data. As a result, the I/O performance is a crucial consideration in data warehouse operations.

A technique called striping is frequently used for improving I/O performance in data warehouses.

Striping divides the data of a large table into small portions and stores them in separate data files on separate disks. You can stripe tablespaces for tables, indexes, rollback segments, and temporary tablespaces, so that the data is spread over hardware components such as controllers, I/O channels, and internal buses.

You can optimize a data warehouse object across multiple hardware components, such as hard disks, through striping. However, it is essential that the requirements be studied carefully before a partitioning strategy is adopted for the data warehouse. The following examples demonstrate how:

    Case A

    Eric Pitt, the data warehousing architect at a stock purchase company, will require frequent full table scans on the data warehouse customer related table(s) to retrieve data for business intelligence reporting. Eric can improve the I/O performance of the data warehouse by placing a table on multiple hard disks for a faster scan.

    Case B

Eric's organization also has a BI team, which constantly monitors market trends and fluctuations and compares them with historical data to make predictions several times during the day. The availability of the data warehouse is crucial in this case. For this, it is important to restrict each tablespace to a few hard disks. In this case, if Eric wants to improve full table scans as well as availability, he can maximize the partitions of the tables but minimize the partitions for each tablespace.


Types of Striping

Striping is of the following types:

Global: Global striping spreads data across disks and partitions. It is often used when you need to access data in only one partition; in such cases, you can spread the data in that partition across many disks to improve performance for parallel execution operations. However, global striping has a single point of failure: if one disk fails and the disks are not mirrored, all the partitions are affected.

Local: Local striping deals with partitioned tables and indexes. It is a simple form of striping in which each partition has its own set of disks and files, and access to the disks or files of one partition does not overlap with that of another. Unlike global striping, an advantage of local striping is that if one disk fails, it does not affect other partitions. However, its main disadvantages are cost and maintenance: since each partition requires multiple disks of its own, local striping adds a cost and maintenance overhead due to these additional hardware components. If you want to limit the number of disks used, you will have to reduce the number of partitions, which consequently makes local striping inappropriate for parallel operations. Local striping is a good choice if availability is a critical concern in your data warehouse.

Automatic: Automatic striping is striping done by the operating system itself, based on certain settings. It is a simple and flexible way of striping and is useful for parallel processing requirements, whether for the same operation or for multiple operations. However, the advantages of this kind of striping are limited by the hardware, such as I/O buses. That is, unlike local striping, the degree of parallelism (DOP) is not a function of the number of disks. As per Oracle's recommendation, the stripe size must be at least 64 KB for good performance.

Manual: Manual striping is the process of adding multiple files to each tablespace, such that each file is on a separate disk. When using manual striping, the degree of parallelism depends on the number of disks rather than on the number of processors. If manual striping is used correctly, the system's performance improves significantly.
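As an illustration of manual striping, a tablespace can be created with several data files placed on different disks. The tablespace name and file paths below are hypothetical:

CREATE TABLESPACE SALES_TS
    DATAFILE '/disk1/oradata/sales_ts_01.dbf' SIZE 500M,
             '/disk2/oradata/sales_ts_02.dbf' SIZE 500M,
             '/disk3/oradata/sales_ts_03.dbf' SIZE 500M;

-- A large table created in this tablespace has its extents spread across the three disks.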

    Partitioning in Oracle

Oracle is one of the most preferred choices in the field of data warehousing. Oracle 9i supports various types of partitioning techniques, such as:

Hash partitioning: In this type of partitioning, a table's records are partitioned based on the value of a particular field in the table. The value that is mapped to decide each record's partition is called the hash value.

Range partitioning: In this type of partitioning, a table's records are partitioned based on the range of values in a particular field of the table.

List partitioning: In this type of partitioning, a table's records are partitioned based on a value from a discrete list of values.

Composite range-hash partitioning: In this type of partitioning, a table is first partitioned on the basis of range partitioning, and then the partitions are further partitioned based on hash partitioning.

Composite range-list partitioning: In this type of partitioning, a table is first partitioned on the basis of range partitioning, and then the partitions are further partitioned based on list partitioning.

As an example, consider how hash partitioning can be performed. Suppose you are designing a data warehouse for an insurance company. The company stores the details of the investors in the policies it has released till date (suppose there are three policy types) in a table INSURANCE_DATA. This table has a field POLICY_TYPE. You can partition the table through hash partitioning, with POLICY_TYPE as the hash value, as follows:

CREATE TABLE INSURANCE_DATA
    (POLICY_NUMBER NUMBER,
     ...,
     POLICY_TYPE   VARCHAR2(20))
PARTITION BY HASH (POLICY_TYPE)
    (PARTITION P1_TYPE TABLESPACE TBLSPC01,
     PARTITION P2_TYPE TABLESPACE TBLSPC02,
     PARTITION P3_TYPE TABLESPACE TBLSPC03);

    This way you can map one policy type to one partition.
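Range partitioning, described above, could similarly be applied to a date column so that each year's data lands in its own partition. The SALES_HISTORY table and partition names here are illustrative assumptions:

CREATE TABLE SALES_HISTORY
    (SALE_ID   NUMBER,
     SALE_DATE DATE,
     AMOUNT    NUMBER(12,2))
PARTITION BY RANGE (SALE_DATE)
    (PARTITION P_2001 VALUES LESS THAN (TO_DATE('01-01-2002', 'DD-MM-YYYY')),
     PARTITION P_2002 VALUES LESS THAN (TO_DATE('01-01-2003', 'DD-MM-YYYY')),
     PARTITION P_MAX  VALUES LESS THAN (MAXVALUE));

A query that filters on SALE_DATE can then be redirected to, and scan, only the relevant partition.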


FAQ

1. Name the two important parameters that decide the granularity of partitions.

    Ans:

Two important factors that decide the granularity of partitions are the overall size and the manageability of the system. Both parameters have to be balanced against each other while deciding on a partitioning strategy. Suppose data containing information about the population is partitioned on the basis of state; the two maintenance-related issues that the administrator could face are:

- If a query needs information about all the states, such as the particular languages spoken in each state, all the state partitions have to be scanned.
- If the definition of a state changes (a state is redefined), the entire fact table needs to be rebuilt.

    2. Are there any disadvantages of data partitioning?

    Ans:

    Data partitioning is by and large an advantageous technique for improving performance. However, it increases the implementation complexity and imposes constraints in query design.

    3. Can partitions be indexed?

    Ans:

Yes, partitions can be indexed if supported by the platform. For example, in Oracle 9i, you can create various types of partitioned indexes.

    4. If you have a huge amount of historical data, which is too old to be useful often but cannot be discarded, then can partitioning help?

    Ans:

Essentially, the answer to this question depends on various factors such as the availability of resources and design strategies. However, you can partition data on the basis of the date on which it was last accessed and keep the historical data in a separate partition. In fact, you can use striping to keep it on a separate disk to improve access speed to the more useful data.


Chapter Five

Objectives

In this chapter, the students have learned to:

- Define aggregation
- Design and create summary tables

Focus Areas

Introduce aggregation as an operation that has a very significant impact on the performance of a data warehouse. Explain the need for aggregates. Also, identify the goals associated with aggregation from the Additional Inputs section. Explain the considerations for designing aggregations, referring to the Additional Inputs section. Also, explain the concept of an aggregate navigator.

Explain the design and creation of summary tables (also called aggregates or aggregate fact tables).

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    Goals Associated with Aggregation

    Aggregation aims at improving the performance of a data warehouse. An effective aggregation must meet certain goals. Kimball, a pioneer in the field of data warehousing, suggests some goals based on which you can design a good aggregate strategy. These goals are:

- Provide dramatic performance gains for as many categories of user queries as possible.
- Add only a reasonable amount of extra data storage to the warehouse. What is reasonable is up to the DBA; however, many data warehouse DBAs strive to increase the overall disk storage for the data warehouse by a factor of two or less.
- Be completely transparent to end users and to application designers, except for the obvious performance benefits; in other words, no end-user application SQL should reference the aggregates.
- Directly benefit all users of the data warehouse, regardless of which query tool they use.
- Keep the impact on the cost of the data extract system to a minimum. Inevitably, a lot of aggregates must be built every time data is loaded, but their specification should be as automated as possible.
- Keep the impact on the DBA's administrative responsibilities to a minimum. The metadata that supports aggregates should be limited and easy to maintain.

    Selecting Candidates for Aggregation

    Before creating aggregates you must select appropriate candidates for creating aggregations. The selection of candidates for aggregation is primarily based on the common requests of business users. The selection may also be based on a statistical view that ensures that even if a dimension does not qualify as a commonly queried dimension, it may still be aggregated if it has a large range of values for its attribute(s). The statistical view also incorporates the relation between various attributes within a dimension and aims at selecting aggregates based on these relations.

Some generic factors based on which you can select candidates are:

- Dimensions whose attributes could be candidates for aggregation because they are used often
- Attributes commonly used together
- The number or range of values for a particular attribute
- Possible aggregates that may be used to create other aggregates on the fly, even if they are not directly required by the business users
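Once candidates are chosen, a summary table is typically just a pre-computed GROUP BY over the base fact table. The following is a minimal sketch, assuming hypothetical SALES_FACT, PRODUCT_DIM, and TIME_DIM tables rather than anything defined in the SG:

CREATE TABLE SALES_BY_BRAND_MONTH AS
SELECT p.BRAND_NAME,
       t.SALE_YEAR,
       t.SALE_MONTH,
       SUM(f.UNITS_SOLD) AS TOTAL_UNITS,
       SUM(f.SALE_VALUE) AS TOTAL_VALUE,
       MIN(f.SALE_VALUE) AS MIN_SALE_VALUE,   -- store MIN/MAX now; they cannot be derived later
       MAX(f.SALE_VALUE) AS MAX_SALE_VALUE
FROM   SALES_FACT f
       JOIN PRODUCT_DIM p ON f.PRODUCT_ID = p.PRODUCT_ID
       JOIN TIME_DIM    t ON f.TIME_ID    = t.TIME_ID
GROUP BY p.BRAND_NAME, t.SALE_YEAR, t.SALE_MONTH;

Queries by brand and month can now read this much smaller table instead of the base fact table.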


Aggregate Navigator

    An aggregate navigator is a middleware component between a client and the database server. It intercepts the client's SQL queries and transforms them into SQL queries that can be applied on aggregates. It contains up-to-date meta data about the aggregates of the data warehouse. Based on this meta data, it finds the appropriate aggregate, which can handle the basic SQL query sent by a client and transforms the query for the aggregate. The aggregate navigation algorithm suggested by Kimball is as follows:

1. Create the aggregations as per the following design goals:

   a. Aggregates must be stored in their own fact tables, separate from the base atomic data. In addition, each distinct aggregation level must occupy its own unique fact table.

   b. The dimension tables attached to the aggregate fact tables must, wherever possible, be shrunken versions of the dimension tables associated with the base fact table.

   c. The base atomic fact table and all of its related aggregate fact tables must be associated together as a 'family of schemas' so that the aggregate navigator knows which tables are related to each other.

   d. Force all SQL statements created by any end-user data access tool or application to refer explicitly to the base fact table and its associated full-size dimension tables.

2. Sort the schemas from the smallest to the largest based on the row count. For any given SQL statement presented to the database server, find the smallest fact table; that is, choose the smallest schema.

3. Compare the table fields in the SQL statement to the table fields in the fact and dimension tables being examined, using a series of lookups in the DBMS system catalogue. If all of the fields in the SQL statement can be found in those tables, alter the original SQL by simply substituting the destination table names for the original table names. No field names need to be changed. If any field in the SQL statement cannot be found in the current fact and dimension tables, go back to step 2 and find the next larger fact table. This process is guaranteed to terminate successfully because eventually you arrive at the base schema, which is always guaranteed to satisfy the query.

4. Run the altered SQL. It is guaranteed to return the correct answer because all of the fields in the SQL statement are present in the chosen schema.
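To make step 3 concrete, here is a hedged before/after sketch of the substitution the navigator performs. The names SALES_FACT, SALES_MONTH_FACT (a monthly aggregate fact table that keeps the base table's column names), and PRODUCT_DIM are hypothetical:

-- SQL issued by the end-user tool, always against the base fact table:
SELECT p.BRAND_NAME, SUM(f.SALE_VALUE) AS TOTAL_VALUE
FROM   SALES_FACT f
       JOIN PRODUCT_DIM p ON f.PRODUCT_ID = p.PRODUCT_ID
GROUP BY p.BRAND_NAME;

-- SQL after the navigator finds that every referenced field (PRODUCT_ID, SALE_VALUE,
-- BRAND_NAME) also exists in the smaller monthly aggregate schema, so only the fact
-- table name is substituted:
SELECT p.BRAND_NAME, SUM(f.SALE_VALUE) AS TOTAL_VALUE
FROM   SALES_MONTH_FACT f
       JOIN PRODUCT_DIM p ON f.PRODUCT_ID = p.PRODUCT_ID
GROUP BY p.BRAND_NAME;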

FAQ

1. Are there any risks associated with aggregation?

    Ans:

The main risk associated with aggregates is the increase in disk storage space.

    2. Once created, is an aggregate permanent?

    Ans:

    No, aggregates keep changing as per the need of the business. In fact, they can be taken offline or put online anytime by the administrator. Aggregates, which have become obsolete, can also be deleted to free up disk space.

3. Can operations such as MIN and MAX be determined once a summary table has been created?

    Ans:

Operations such as MIN and MAX cannot be determined correctly once the summary table has been created. To determine their values, they must be calculated and stored at the time the summary table is derived from the base table.

    4. How much storage increase might be required in the data warehouse system when using aggregates?

    Ans:

    The storage needs typically increase by a factor of 1 or sometimes even 2 for aggregates.


Chapter Six

Objectives

In this chapter, the students have learned to:

- Identify the need for a data mart
- Differentiate between a data warehouse and a data mart
- Describe access control issues in data mart design

Focus Areas

Introduce the concept of data marts using the store-house retail outlet analogy given in the book. Compare data marts with data warehouses to bring out the difference between the two. Explain the need for a data mart. In this context, explain that a data warehouse can be built from scratch or be built up from data marts. Introduce EDMA and DS/DMA as two data warehouse architectures based on data marts.

    Explain the access control issues in data mart design briefly.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    Data Mart vs. Data Warehouse

Criteria              Data Mart                                    Data Warehouse
Purpose               Business-driven                              Technology-driven
Scope                 Localized                                    Centralized
Cost                  In hundreds of thousands of dollars          In millions of dollars
Development Time      6-8 months                                   1.5-2 years
Data                  Data about specific departments/subjects     Data about the entire enterprise
Number                An enterprise usually has multiple           An enterprise usually has one
                      data marts                                   data warehouse
Granularity of Data   Detailed                                     Summarized

    Data Mart Types

You can classify data marts into the following two types:

- Independent data mart
- Dependent data mart

An independent data mart is one whose data is sourced from a legacy application or a source other than the data warehouse. It is created with the aim of being integrated into a data warehouse.

A dependent data mart is one whose data source is the data warehouse itself.
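A dependent data mart can be thought of as a subset extracted from the warehouse on a regular basis. The following is a minimal sketch, assuming hypothetical warehouse tables (SALES_FACT and PRODUCT_DIM) and a marketing-department mart:

-- Populate a department-level mart table from the enterprise data warehouse.
CREATE TABLE MKTG_SALES_MART AS
SELECT f.PRODUCT_ID,
       p.PRODUCT_DESC,
       f.REGION_ID,
       f.UNITS_SOLD,
       f.SALE_VALUE
FROM   SALES_FACT f
       JOIN PRODUCT_DIM p ON f.PRODUCT_ID = p.PRODUCT_ID
WHERE  p.BRAND_NAME IN ('BrandX', 'BrandY');   -- only the brands the marketing unit analyzes

Refreshing the mart is then a matter of re-running (or incrementally appending) this extract on a schedule.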


Data Mart Based Data Warehouse Architectures

    There are three important Data warehouse architectures based on data marts:

- Enterprise Data Mart Architecture (EDMA)
- Data Stage/Data Mart Architecture (DS/DMA)
- Distributed Data Warehouse/Data Mart Architecture (DDW/DMA)

The Enterprise Data Mart Architecture (EDMA) implements an incremental approach to designing a data warehouse using data marts and a shared global metadata repository (refer to Chapter 6). This architecture also supports a common data staging area, called a Dynamic Data Store (DDS). The DDS plays a crucial role in integrating data marts with the data warehouse. In this architecture, star schema modeling should be used if relational technology is used for the data warehouse.

    In Data Stage/Data Mart Architecture, no single data warehouse is physically implemented. Instead, the warehouse is considered a logical group of all the data marts.

    DDW/DMA is also similar to EDMA as it has a dynamic staging area and a common global metadata repository.

FAQ

1. What are conformed dimensions?

    Ans:

A conformed dimension is one whose meaning is the same regardless of the fact table from which it is referenced.

    2. What are virtual data marts?

    Ans:

    Virtual data marts are logical views of multiple physical data marts based on user requirement.

    3. Which tool supports data mart based data warehouse architectures?

    Ans:

Informatica is commonly used for implementing data mart based data warehouse architectures.

    4. Is the data in data marts also historical like in data warehouses?

    Ans:

The data in data marts is historical only to some extent. In fact, it is not the same as the data in a data warehouse because of the difference in the purpose and approach of the two.


Chapter Seven

Objectives

In this chapter, the students have learned to:

- Define metadata
- Identify the uses of metadata

Focus Areas

Explain metadata as "data about data". Explain its importance and its need in a data warehouse. Discuss the various types of metadata, referring to the Additional Inputs section. Explain its usage in transformation and loading, data management, and query generation. Explain the concept of metadata management, referring to the Additional Inputs section.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    Metadata Types

    With the growth and advances in the data warehousing field, metadata has become an invaluable resource. According to the source of the data being described, metadata can be classified as:

- Source metadata: This includes information about the source data. It could include schemas, formats, graphics, relational tables, and ownership, administrative, and business descriptions. It could also include process-related information such as extraction settings, schedules, and the results of specific jobs that were performed on the source systems.

- Data staging metadata: This includes all the metadata required to load the data into the staging area. It could include data acquisition information, definitions of conformed dimensions and facts, slowly changing dimension policies, data cleaning specifications, data enhancement and mapping transformations, target schema designs, data flows, load scripts, aggregate definitions, various process logs, and other business documentation.

- DBMS metadata: This includes the metadata describing various definitions, settings, and specifications after the data has been loaded into the data warehouse. It could include partition settings, indexes, striping specifications, security privileges, administrative scripts, view definitions, backup status, and procedures.

- Front room metadata: This could include names and descriptions for attributes and tables, and canned query and report definitions. In addition, it includes end-user documentation, user profiles, network security user privileges and profiles, network usage statistics, and usage instructions for data elements, tables, views, and reports.

    Metadata Management

Metadata management is as crucial as metadata itself. Metadata management ensures that metadata can be represented and shared in a standard format. Metadata management involves two essential components:

- Metadata modeling
- Metadata repository


In order to standardize the representation of metadata, it must be modeled at separate layers. This way, varying source systems that have metadata at different layers of abstraction can be mapped to one of the layers and standardized. A metadata model typically has four layers, as shown in the following figure:

    Layers of Metadata Model

    The Meta-Metamodel describes the structure of various entities of a database. A metamodel defines the structure and semantics of the Metadata. The Metadata describes the format and semantics of the data. By taking this model as a basis while designing your data warehouse and managing the metadata, you can standardize the representation of the metadata to be used for the data warehouse.

    The second important component of metadata management is metadata repository. A metadata repository is a common storage for all the metadata required for running a data warehouse. A metadata repository can be of two types depending on the architecture used:

- Centralized metadata repository
- Decentralized metadata repository

In a centralized metadata repository, the metadata is defined and controlled through a single schema stored in a centralized repository, called the global metadata repository. This single schema represents the composite schema of all the subsystems.

A decentralized repository is used in a distributed environment. It consists of a central global metadata repository as well as local metadata repositories. The global repository contains the metadata that is to be shared and reused among the local repositories. This global metadata is a single schema that is used by all the local metadata repositories. The local metadata repositories, on the other hand, contain the metadata specific to their individual uses.
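In practice, a metadata repository is often just a set of catalogue tables in a database. Below is a minimal sketch of one such table; the METADATA_CATALOG name and its columns are purely illustrative, not a standard or SG-defined structure:

CREATE TABLE METADATA_CATALOG (
    OBJECT_NAME    VARCHAR2(30),    -- warehouse table or column being described
    SOURCE_SYSTEM  VARCHAR2(30),    -- operational system the data comes from
    SOURCE_FIELD   VARCHAR2(30),
    TRANSFORMATION VARCHAR2(200),   -- cleaning/mapping rule applied during loading
    LOAD_FREQUENCY VARCHAR2(20),    -- for example, daily or weekly
    OWNER          VARCHAR2(30),
    LAST_LOADED    DATE
);

Query tools and the warehouse manager can read such tables to discover where a warehouse column came from and how it was derived.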

FAQ

1. How can you classify metadata?

    Ans:

You can classify metadata according to its use as:

- Administrative metadata: Metadata used for managing the data, in terms of statistics such as time of creation, access rights, and last access time.
- Structural metadata: Metadata that describes the structure of the data.
- Descriptive metadata: Metadata that describes the purpose or functionality of the data.


2. What is backroom metadata?

    Ans:

    Backroom metadata is the metadata related to the process of extracting, cleaning, and loading. It is of use to the DBA and business users but not to the end-user.

    3. What is a metadata catalogue?

    Ans:

    A metadata catalogue is the same as a metadata repository. It is also called metadatabase.

    4. Are there any tools for metadata management?

    Ans:

Yes, there are various tools that facilitate metadata management. One such Windows-based tool is Saphir. SQL Server 2000 also enables metadata management to some extent.


Chapter Eight

Objectives

In this chapter, the students have learned to:

Identify the various types of managers in a data warehouse environment
Identify the responsibilities of each type of system manager
Identify the responsibilities of each type of process manager

Focus Areas

Tell students that the various components and techniques applied to implement a data warehouse must be managed together. This management is the responsibility of two types of management tools: system management tools and process management tools. Now, explain what system managers and process managers are. Emphasize that these are not persons but tools.

Explain each type of manager and its responsibilities in detail. Explain the various components of SQL Server 2000 that act as one or more of these managers; the main points are given in the Additional Inputs section. You can refer to the SQL Server Help if you need to give students details on any specific point.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    SQL Server 2000 for Data Warehousing Process Management

SQL Server 2000, although often not preferred over Oracle, is a capable data warehousing tool in its own right. It includes various components that enable the extract-transform-load (ETL) process. The three process managers (the query manager, the load manager, and the warehouse manager) primarily manage the ETL process. Various services in SQL Server provide the functionality of these process managers. These services include:

Data Transformation Services (DTS), which primarily acts as the load manager
Meta Data Services, which primarily acts as the warehouse manager along with DTS
Job Scheduler and Query Analyzer, which primarily act as the query manager along with DTS

Note that load manager, query manager, and warehouse manager are theoretical terms and may not map exactly to the components of a data warehousing product. Their functionality may be spread across various components or services of a tool, or a tool may not have separate components for each manager type at all. These terms simply indicate, collectively, a set of tasks that need to be performed; a data warehousing product may carry out these tasks in any way.

    Data Transformation Services (DTS)

    DTS is a set of tools that enables you to build a collection of various elements that execute to perform the ETL process. This collection of elements that perform ETL is called a DTS package. DTS includes three primary components:

Connections: To enable the elements of a package to perform data transformation, you must establish a connection between a data source and a target.

Tasks: A DTS task is a single step that is performed in the data transformation process, for example, exporting data from a source.

Transformations: DTS supports various field-level mappings and transformations between data sources. Examples of supported transformations (illustrated in the sketch after this list) include:

Copy column transformation: The source column is copied directly to the destination column without making any changes.


Date time string transformation: Transformations are performed on date/time fields that are stored either as a string data type or as a date/time data type.

ActiveX script transformation: ActiveX scripts are written to programmatically transform fields for every row being copied from the source to the target.
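The following minimal Python sketch is not DTS code; it only illustrates, under assumed column names and formats, what a copy column transformation and a date/time string transformation do to each row as it moves from source to target.

from datetime import datetime

# Hypothetical source rows, as they might be read from an Excel export
source_rows = [
    {"CustName": "Asha Rao", "OrderDate": "2003-07-15"},
    {"CustName": "R. Gupta", "OrderDate": "2003-08-02"},
]

def transform(row):
    return {
        # Copy column transformation: the value is moved across unchanged
        "CustomerName": row["CustName"],
        # Date/time string transformation: the string field is reformatted
        "OrderDate": datetime.strptime(row["OrderDate"], "%Y-%m-%d").strftime("%d %b %Y"),
    }

target_rows = [transform(row) for row in source_rows]
print(target_rows)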

DTS can be used simply through the Import/Export Wizard, which is an easy but less flexible way of loading and transforming data. A more advanced and flexible way to use DTS is through the DTS Designer, a full-fledged design environment in which you can create transformation packages.

To understand how DTS actually works as a load manager, observe the following snapshot of the DTS Import/Export Wizard, in which data is loaded from an Excel sheet into a SQL Server database.

The functionality offered by the Import/Export Wizard is limited to loading and transformation. The DTS Designer provides a host of other functionality. A snapshot of the DTS Designer is given here:

    Under the Task pane, the various tasks supported by DTS are:

File Transfer Protocol Task
ActiveX Script Task
Transform Data Task
Execute Process Task
Execute SQL Task
Data Driven Query Task
Copy SQL Server Objects Task
Send Mail Task
Bulk Insert Task
Execute Package Task
Message Queue Task
Transfer Error Messages Task
Transfer Databases Task
Transfer Master Stored Procedures Task
Transfer Jobs Task
Transfer Logins Task


Dynamic Properties Task


Meta Data Services

    SQL Server includes repository tables to store and manage SQL Server metadata. The SQL Server Meta Data Services provides storage for metadata including:

SQL Server metadata
Metadata associated with specific DTS packages
Online analytical processing (OLAP) metadata

    SQL Server also supports a Meta Data browser that enables you to view metadata.

    Job Scheduler and Query Analyzer

    A Job Scheduler in SQL Server 2000 is used to schedule various jobs or processes that need to be executed. The Wizard that facilitates job scheduling is shown below:

    As shown in the figure, it can be used to schedule SQL queries, ActiveX scripts and shell commands.

    Queries can be created and tested using the SQL Query Analyzer.

Using Query Analyzer, you can:

Create queries, SQL scripts, and commonly used database objects from predefined scripts
Execute queries on SQL Server databases
Execute stored procedures without knowing the parameters
Debug stored procedures
Copy existing database objects
Debug query performance problems
Locate objects within databases or view and work with objects
Insert, update, or delete rows in a table
Add frequently used commands to the Tools menu


FAQ

1. Are system and process management devoid of any manual intervention, considering that the process managers are tools and not persons?

    Ans:

No. Although system and process managers are themselves tools that automate system and process management in data warehouses, they must be configured and, at times, handled through manual intervention. These tasks may be done by the database administrator.

    2. Does SQL Server also provide system managers?

    Ans:

Yes. SQL Server includes various components that enable system management through its management and security services.

3. What is Oracle Warehouse Builder (OWB)?

    Ans:

It is one of the commonly used data warehouse development tools, with various advanced features such as support for large databases, automated summary management, and an embedded multidimensional OLAP engine. Unlike SQL Server, which runs only on the Windows platform, OWB can be used on multiple platforms. It is also generally faster, more reliable, and more scalable than SQL Server.

    4. What is replication?

    Ans:

    Replication is the process of creating multiple copies of data on the same or different platform and keeping the copies in sync.


Chapter Nine

Objectives

In this chapter, the students have learned to:

Define data mining
Identify the data that can be mined
Identify the functionalities of data mining
Categorize data mining systems
Identify the application fields for data mining

    Focus Areas

    Initiate a classroom discussion by asking the students the following questions:

    What is data mining? Explain data mining by giving various definitions.

    What is Knowledge Discovery in Databases (KDD)? Explain the various steps in KDD using a figure.

What type of data can be mined? List and explain the following types of data that can be mined: flat files, relational databases, data warehouses, multimedia databases, spatial databases, time-series databases, and the World Wide Web.

What benefits does data mining provide? Explain the following benefits of data mining: characterization, discrimination, association analysis, classification, prediction, clustering, outlier analysis, and evolution and deviation analysis.

On what criteria can data mining systems be categorized? Explain that data mining systems can be categorized on the following criteria: the type of data source mined, the data model used, the kind of knowledge discovered, and the mining techniques used.

What are the issues in data mining? Explain the following issues in the context of data mining: security and social issues, user interface issues, mining methodology issues, performance issues, and data source issues.

Why is data mining becoming so popular? Explain the following reasons for the growing popularity of data mining: growing data volumes, the limitations of human analysis, and the low cost of machine learning.

What are the various application areas of data mining? Explain that data mining has found application in various areas such as retail, marketing, banking, insurance, health care, transportation, and medicine.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    Data Mining

Data mining is the process of finding new and potentially useful knowledge from data. Data mining software allows users to analyze large databases to solve business decision problems. Data mining is, in some ways, an extension of statistics, with a few artificial intelligence and machine learning twists thrown in. Like statistics, data mining is not a business solution; it is just a technology.


For example, consider a catalog retailer who needs to decide who should receive information about a new product. The information operated on by the data mining process is contained in a historical database of previous interactions with customers and the features associated with the customers, such as age, zip code, and their responses. The data mining software would use this historical information to build a model of customer behavior that could be used to predict which customers would be likely to respond to the new product. By using this information, a marketing manager can select only the customers who are most likely to respond. The operational business software can then feed the results of the decision to the appropriate touchpoint systems (call centers, web servers, email systems, and so on) so that the right customers receive the right offers.

    Clustering

    Clustering is often one of the first steps in data mining analysis. It identifies groups of related records that can be used as a starting point for exploring further relationships. This technique supports the development of population segmentation models, such as demographic-based customer segmentation. Additional analyses using standard analytical and other data mining techniques can determine the characteristics of these segments with respect to some desired outcome. For example, the buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
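As a hedged illustration of this idea, the short sketch below uses the K-means algorithm from scikit-learn to segment a handful of made-up customer records; the feature values and the choice of three segments are assumptions for demonstration only, not a method prescribed by the guide.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer records: [age, annual spend]; values are illustrative only
customers = np.array([[23, 400], [25, 420], [47, 2600],
                      [52, 2800], [31, 900], [29, 950]])

# Group the customers into three segments that can then be profiled further
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

print(model.labels_)           # segment assignment for each customer
print(model.cluster_centers_)  # average profile of each segment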

    Classification

    Classification, perhaps the most commonly applied data mining technique, employs a set of pre-classified examples to develop a model that can classify the population of records at large. Fraud detection and credit-risk applications are particularly well suited to this type of analysis. This approach frequently employs decision tree or neural network-based classification algorithms. The use of classification algorithms begins with a training set of pre-classified example transactions. For a fraud detection application, this would include complete records of both fraudulent and valid activities, determined on a record-by-record basis. The classifier training algorithm uses these pre-classified examples to determine the set of parameters required for proper discrimination. The algorithm then encodes these parameters into a model called a classifier.

The choice of algorithm affects the explanation capability of the system. Once an effective classifier is developed, it is used in a predictive mode to classify new records into these same predefined classes. For example, a classifier capable of identifying risky loans could be used to aid in the decision of whether to grant a loan to an individual.
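A minimal sketch of this train-then-predict cycle is given below, using scikit-learn's decision tree classifier (one of the algorithm families named above) on a few invented, pre-classified loan records. The features, labels, and values are assumptions, not real credit data.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical pre-classified training examples: [loan_amount, income, past_defaults]
X_train = [[5000, 30000, 0], [20000, 28000, 2],
           [12000, 60000, 0], [25000, 32000, 3]]
y_train = ["safe", "risky", "safe", "risky"]     # labels supplied with the training set

# The training algorithm encodes the discriminating parameters into a classifier
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Predictive mode: classify a new, unseen loan application
print(clf.predict([[15000, 40000, 1]]))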

    KDD

    Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies for extracting useful knowledge from data. The ongoing rapid growth of online data due to the Internet and the widespread use of databases have created an immense need for KDD methodologies. The challenge of extracting knowledge from data draws upon research in statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing to deliver advanced business intelligence and web discovery solutions.

    KDD refers to a multi-step process that can be highly interactive and iterative. It includes data selection/sampling, preprocessing and transformation for subsequent steps. Data mining algorithms are then used to discover patterns, clusters and models from data. These patterns and hypotheses are then rendered in operational forms that are easy for people to visualize and understand. Data mining is a step in the overall KDD process.

Data mining (also known as Knowledge Discovery in Databases, or KDD) has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data". It uses machine learning, statistical, and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. The main idea in KDD is to discover high-level (abstract) knowledge from lower levels of relatively raw data, or to discover a higher level of interpretation and abstraction than those previously known.


FAQ

1. What is the KDD process?

    Ans:

    The unifying goal of the KDD process is to extract knowledge from data in the context of large databases. It does this by using data mining methods (algorithms) to extract (identify) what is deemed knowledge, according to the specifications of measures and thresholds, using a database along with any required preprocessing, subsampling, and transformations of that database.

    2. What is Data Visualization?

    Ans:

Data visualization presents data in three dimensions and colors to help users view complex patterns. Visualization tools also provide advanced manipulation capabilities to slice, rotate, or zoom the objects to identify patterns.
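A hedged sketch of such a display, using the matplotlib library rather than any particular commercial visualization tool, is given below; the random data, colour measure, and axis names are assumptions chosen only to show a three-dimensional, colour-coded view.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 200))   # three made-up dimensions
measure = x + y                       # attribute used to colour the points

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
points = ax.scatter(x, y, z, c=measure, cmap="viridis")
fig.colorbar(points, label="measure")
ax.set_xlabel("dimension 1")
ax.set_ylabel("dimension 2")
ax.set_zlabel("dimension 3")
plt.show()   # rotating or zooming the axes interactively helps reveal patterns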

    3. What are the constituents of Multidimensional objects?

    Ans:

    Dimensions and Measures.

4. What do levels specify within a dimension?

    Ans:

    Levels specify the contents and structure of the dimension's hierarchy.

    5. What is data mining?

    Ans:

Data mining is the process of finding new and potentially useful knowledge from data.

    6. What does Data Mining Software do?

    Ans:

Data mining software searches large volumes of data, looking for patterns that accurately predict behavior, such as which customers are most likely to maintain a relationship with the company. Common techniques employed by data mining software include neural networks, decision trees, and standard statistical modeling.

    7. What is Oracle Data Mining?

    Ans:

    Oracle Data Mining is enterprise data mining software that combines the ease of a Windows-based client with the power of a fully scalable, multi-algorithmic, UNIX server-based solution. Oracle Data Mining provides comprehensive predictive modeling capabilities that take advantage of parallel computing techniques to rapidly extract valuable customer intelligence information. Oracle Data Mining can optionally generate deployable models in C, C++, or Java code, delivering the "power of prediction" to call center, campaign management, and Web-based applications enterprise-wide.


8. How does data mining differ from OLAP?

    Ans:

    Simply put, OLAP compares and data mining predicts. OLAP performs roll-ups, aggregations, and calculations, and it compares multiple results in a clearly organized graphical or tabular display. Data mining analyzes data on historical cases to discover patterns and uses the patterns to make predictions or estimates of outcomes for unknown cases. An analyst may use OLAP to discover a business problem, and then apply data mining to make the predictions necessary for a solution. An OLAP user can apply data mining to discover meaningful dimensions that should be compared. OLAP may be used to perform roll-ups and aggregations needed by the data mining tool. Finally, OLAP can compare data mining predictions or values derived from predictions.

    9. What are some typical data mining applications?

    Ans:

    Following are some of the data mining applications:

Customer retention
Cross selling
Response modeling / target marketing
Profitability analysis
Product affinity analysis
Fraud detection


Chapter Ten

Objectives

In this chapter, the students have learned to:

Identify the steps in the data preparation process
Identify data mining primitives
Use a data mining query language
Identify strategies to define a graphical user interface based on a data mining query language
Identify architectures of data mining systems

Focus Areas

Introduce the first step of data mining, which is data preprocessing. Enumerate the steps to be followed to preprocess data and make it ready for the application of data mining techniques. Initiate a discussion on data preparation. However, before that, you must explain how data is acquired for a data mining system, using the Additional Inputs section. After data acquisition, explain data preparation with the help of an appropriate example of a dataset. Next, briefly discuss data mining primitives and then shift focus to the data mining query language. Explain the data mining query language in detail with the help of examples. Finally, conclude the session with a brief discussion on data mining system architectures. Hold a discussion to identify scenarios for selecting a particular architecture.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    Preparing Data for Preprocessing

    Before we can preprocess the data, it is very important to collect the right data from the right source.

    The process of data collection involves the following activities:

Acquiring data from various sources
Describing the acquired data
Exploring the data
Verifying the quality of the data for the mining process

The process of acquiring data involves various activities such as identifying sources from which data can be acquired, deciding the ways of acquiring data, and actually gathering the data. Data can be acquired from multiple sources in various ways, such as interviews, surveys, observations, and transactional databases. At the end of this process, the following output must be produced:

List of methods used for acquiring data
List of sources from which data was acquired
Problems faced during the process and their subsequent resolution
List of the data acquired

    After data has been acquired, it must be described. Data description involves various activities such as:

Defining the volume of data
Identifying missing attributes and values. For example, if a table is supposed to have a field as per the metadata or requirements specification but the field does not exist, or if all values for a given attribute are empty, then you can conclude that the attribute or its values are missing.

Identifying the type and meaning of data attributes. For example, if a collected data field is named "Distance", you must be able to describe which distance it refers to and in which units it is measured (such as kilometers, meters, or miles). This activity often requires revisiting the data acquisition step, making it an iterative process that continues until all the data has been clearly defined.


Identifying the initial, basic format of the data by ensuring that metadata is available. Metadata is crucial for describing data.

After data has been adequately defined, it must be explored. Data exploration refers to understanding the basic structure and schema of the data and evaluating its usefulness for data mining. Basic exploration can start by analyzing the metadata, which gives an idea of the structure as well as the detailed meaning of the data and thus helps in analyzing its usefulness. Beyond studying the metadata, basic exploration uses simple as well as sophisticated statistical techniques to reveal the properties of the data.

After the data has been gathered, described, and explored, its quality needs to be verified. Quality data is data that is complete, minimally complex, has enough metadata, has unambiguous attributes, and is context independent. It is essential that data quality be verified before the data is preprocessed and mined, because mining low-quality data leads to inaccurate results and may cost an enterprise a great deal of wasted money and effort. Data quality can be harmed at any step, from gathering the data to delivering and storing it. For example, typing mistakes during data entry can lead to a loss of quality due to incorrect data values.
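As a hedged illustration of these description, exploration, and quality checks, the following sketch uses the pandas library on a small invented dataset; the column names and values are assumptions chosen only to show the kinds of problems (missing values, duplicates, inconsistent naming) that such checks reveal.

import pandas as pd

# Hypothetical acquired dataset; column names and values are illustrative only
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 3],
    "Distance":   [12.5, None, 7.0, 7.0],               # missing value to be flagged
    "Region":     ["North", "north", "East", "East"],   # inconsistent naming
})

print(df.shape)                     # volume of data
print(df.dtypes)                    # type of each attribute
print(df.isnull().sum())            # missing values per attribute
print(df.duplicated().sum())        # possible duplicate records
print(df.describe(include="all"))   # basic statistical exploration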

    After complete verification of data quality, data can now be preprocessed for mining.

    Choosing a Data Mining System - Commercial Aspect

Commercial data mining systems have very little in common. The functionality and methodology required differ from application to application, and so does the data to be mined. For a data mining effort to succeed, it is essential to identify the right data mining system to use. A good data mining system has the following characteristics:

It is easy to use.
It provides at least 80% prediction accuracy.
It can perform all common data mining tasks, such as cleaning, import, export, and formatting.

FAQ

1. What is noisy data?

    Ans:

    Noise is a random error or variance in data. It can happen because of:

Faulty data collection and data entry mistakes, such as typing errors
Data transmission and storage problems
Inconsistencies in naming conventions

Noise makes data inaccurate for predictions and renders it of little use to mining systems.

    2. Which are the major data mining tasks?

    Ans:

    The main data mining tasks include:

Classification
Clustering
Associations
Prediction
Characterization and discrimination
Evolution analysis


3. What are some other data mining languages and standardizations of primitives apart from DMQL?

    Ans:

    Some other Data Mining Languages and standardizations of primitives apart from DMQL include:

MSQL
MINE RULE
Query flocks based on Datalog syntax
OLE DB for DM
CRISP-DM

    4. Which Data Mining tools are used commercially?

    Ans:

    Some Data Mining tools used commercially are:

Clementine
Darwin
Enterprise Miner
Intelligent Miner
MineSet

5. How can noisy data be smoothed?

    Ans:

Noisy data can be smoothed using the following techniques (binning is illustrated in the sketch after this list):

Binning
Clustering
Computer/human inspection
Regression
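Of these, binning is the easiest to demonstrate. The sketch below is an illustration, not code from the guide: it smooths an invented noisy attribute by partitioning the values into equal-frequency bins with pandas and replacing each value with the mean of its bin.

import pandas as pd

# Hypothetical noisy attribute values (for example, recorded prices)
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the values into four equal-frequency bins, then smooth by bin means
bins = pd.qcut(prices, q=4)
smoothed = prices.groupby(bins).transform("mean")

print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))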


Chapter Eleven

Objectives

In this chapter, the students have learned to:

Identify the techniques of data mining
Apply the Apriori algorithm
Build decision trees

Focus Areas

Initiate a discussion by asking the students about the importance of and need for data mining. Explain that various techniques have to be adopted for data mining because it is a specialized field that must be handled through appropriate techniques. Discuss the various techniques that can be used for mining data. Also focus on issues, if any, related to each technique and on the key features of the various techniques.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    Data mining using FP-Tree Algorithm

The basic idea of the Apriori algorithm is to generate the set of candidate patterns of length (k+1) from the set of frequent patterns of length k, and then use a database scan and pattern matching to collect counts for the candidate itemsets.

However, it has a few bottlenecks:

1. Huge candidate set generation: Discovering a frequent pattern of size 100, such as {a1, a2, ..., a100}, requires generating approximately 2^100 (about 10^30) candidates.

2. Multiple scans of the database: Needs (n + 1) scans, where n is the length of the longest pattern.

The solution comes in the form of the FP-Tree algorithm, which shows performance improvements over Apriori and its variations because it uses a compressed data representation (nodes in a tree structure) and does not need to generate candidate sets. However, FP-Tree based mining uses a more complex data structure, and its performance gains are very sensitive to the support threshold setting. An update of the database requires a complete repetition of the scan process and construction of a new tree.
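To make the candidate-generation bottleneck concrete, here is a minimal, unoptimized Python sketch of Apriori-style level-wise mining (not the FP-Tree algorithm discussed below); the tiny transaction set and the support threshold are assumptions for illustration only.

def apriori(transactions, min_support):
    """Minimal Apriori sketch: level-wise candidate generation plus counting."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items to find the frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}

    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets (this set can explode)
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # One full scan of the database per level to count the candidates
        level_counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in level_counts.items() if n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

print(apriori([{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}], min_support=2))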

    Definition of FP-Tree:

    FP-Tree is an extended prefix-tree structure storing crucial, quantitative information about frequent patterns.

    It is highly condensed, but complete for frequent pattern mining. The advantage over Apriori is that it avoids costly database scans and does not require candidate generation.


Process of FP-Tree Mining

To implement FP-Tree mining on a transaction database D, perform the following steps, assuming that the minimum support is set to 4:

Step 1: Scan the transaction database D (shown in the following figure) once. Collect the set F of frequent items (items with a count of at least the minimum support) and their supports. Sort F in descending order of support as L, the list of frequent items.

    Step 2: Create the root of an FP-Tree, and label it as "null". For each transaction Trans in D do the following.

Select and sort the items in Trans according to the order of L. Let the sorted frequent item list in Trans be [p|P], where p is the first element and P is the remaining list. Call insert_tree([p|P], T), which is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1. Otherwise, create a new node N, let its count be 1, link its parent link to T, and link its node-link to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.
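The insertion logic of Step 2 can be sketched in Python as follows. This is an illustrative fragment only: the node-link structure is simplified to a per-item list, and the sample transactions (already sorted by the frequent item list L) are assumptions.

from collections import defaultdict

class FPNode:
    """One FP-Tree node: an item name, a count, a parent link, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}        # item name -> FPNode

def insert_tree(items, node, header):
    """Insert a support-ordered transaction [p|P] below 'node', as in Step 2."""
    if not items:
        return
    p, rest = items[0], items[1:]
    if p in node.children:
        node.children[p].count += 1        # a child N with the same item name exists
    else:
        child = FPNode(p, parent=node)     # otherwise create a new node N with count 1
        node.children[p] = child
        header[p].append(child)            # simplified node-link structure per item
    insert_tree(rest, node.children[p], header)

# Root labelled "null"; transactions already sorted by the frequent item list L
root = FPNode(None)
header = defaultdict(list)
for trans in [["f", "c", "a", "m", "p"], ["f", "c", "a", "b", "m"]]:
    insert_tree(trans, root, header)
print(root.children["f"].count)            # the shared prefix "f" now has a count of 2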

Step 3: Mine the FP-Tree using the FP-Growth algorithm to generate frequent itemsets.

Starting at the frequent item header table in the FP-Tree:
1. For each item in the header table, create its conditional pattern base by accumulating all prefix paths of that item.
2. Expand the item and create its conditional FP-Tree.
3. Repeat the process recursively on each constructed FP-Tree until the FP-Tree is either empty or contains only one path.


4. Enumerate all the combinations of the items from a single path; these combinations form the frequent patterns.

FAQ

1. What are the variations of the Apriori algorithm?

    Ans:

Following are some variations of the Apriori algorithm that improve the efficiency of the original algorithm:

Transaction reduction: Reducing the number of transactions scanned in future iterations
Partitioning: Partitioning the data to find candidate itemsets
Sampling: Mining on a subset of the given data
Dynamic itemset counting: Adding candidate itemsets at different points during the scan

    2. Which is the best approach when we are interested in finding all possible interactions among a set of attributes?

    Ans:

    The best approach to find all possible interactions among a set of attributes is association rule mining.

3. What is overfitting in a neural network?

    Ans:

Overfitting is a common problem in neural network design. Overfitting occurs when a network has memorized the training set but has not learned to generalize to new inputs. Overfitting produces a relatively small error on the training set but a much larger error when new data is presented to the network.


4. What is a back propagation neural network?

    Ans:

Back propagation is a neural network algorithm for classification that employs a method of gradient descent. It searches for a set of weights that models the data so as to minimize the mean squared distance between the network's class prediction and the actual class label of the data samples. Rules may be extracted from trained neural networks to help improve the interpretability of the learned network.
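As a hedged illustration, the sketch below uses scikit-learn's multilayer perceptron rather than a hand-written network: its weights are fitted by stochastic gradient descent with back propagation of errors. The toy data, network size, and learning rate are assumptions for demonstration only.

from sklearn.neural_network import MLPClassifier

# Tiny invented training set: two numeric features and a binary class label
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y = [0, 0, 1, 1]

# Stochastic gradient descent adjusts the weights to reduce the prediction error
net = MLPClassifier(hidden_layer_sizes=(4,), solver="sgd",
                    learning_rate_init=0.1, max_iter=2000,
                    random_state=0)
net.fit(X, y)

print(net.predict([[0.15, 0.85], [0.85, 0.15]]))   # classify new samples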


Chapter Twelve

Objectives

In this chapter, the students have learned to:

Identify and apply the guidelines for KDD

Focus Areas

Recall the chain of data --> organized data (information) --> knowledge, and introduce Knowledge Discovery in Databases (KDD) as the complete process of mining knowledge from information stored in databases. Explain the KDD environment setup, referring to the Additional Inputs section. Tell the students that a KDD environment is a part of an enterprise whose core competency is data mining. Consequently, it is important that the enterprise understands what makes a KDD environment successful. List the factors that make a KDD environment successful from the Additional Inputs section. Next, enumerate and explain the various guidelines from the textbook.

    Finally, initiate a discussion on how to act on mining results.

Additional Inputs

The following section provides some extra inputs on the important topics covered in the SG:

    Setting up a KDD Environment

A KDD environment needs to be set up in an organization that wants to shift from guesswork and the personal opinions of experts to analysis and facts derived from actual information. KDD is often believed to shift an enterprise's focus from "product or service" to "customer". That is, with the knowledge gained from data mining, various improvements can be made to the product or service that benefit the enterprise and, most importantly, its customer base.

    A KDD environment consists of the following elements:

A group that develops data mining skills
Communication links into the required business units
A set of tools, hardware, and software for data mining
Access to the data of the entire enterprise
The ability to publish data mining results so that the enterprise can act on them

    Success Factors of a KDD Environment

    The success of a KDD environment depends on five crucial factors. These factors are:

The team involved in KDD should ideally consist of 8-10 multi-skilled people with an excellent aptitude for both technical and business aspects. They need to be experts in various disciplines such as statistical analysis, understanding business users, working with data owners, and management.

The team should be led by a single person