data sourcing - cmrcet

Data Sourcing Business analytics UNIT_3


Post on 03-Oct-2021


Page 1: Data Sourcing - CMRCET

Data Sourcing

Business analytics

UNIT_3

Data Sourcing

Data sourcing is needed to get input data for analytics, where the customer for the data is the data stewardship team. Data is sourced from a variety of systems, each presenting its own unique challenges for a Business Intelligence system. This activity deals with the challenges of getting data from various sources as needed by analysts, and we will address four types of sources.

Four types of sources

1. Transaction Processing Systems
2. Benchmarks and External Data Sources
3. Survey Tools
4. Analytical Output

1. Transaction Processing Systems

Online Transaction Processing (OLTP) systems are a class of systems that facilitate and manage transaction-oriented applications.

These range from e-commerce sale transactions and credit card transactions to employee time-card applications.

These systems are geared to handle large volumes of transactions efficiently and capture data for each transaction.

This captured data is a primary source of data for BI systems and usually forms the cornerstone around which effective data warehouses are built. Given the need for accuracy and reliability in such systems, data from these sources is also of high fidelity. In most cases, transactions cannot be completed and logged with incomplete or inaccurate data, and hence data sourced from such systems is among the most reliable. Data errors, if identified down the line, can often be remedied by introducing a suitable mechanism in the OLTP system.

2. Benchmarks and External Data Sources

Where surveys and benchmarking activities are outsourced to specialist providers, we count them as external data sources.

BI systems can source data from outside the organization. Such data is typically provided by vendors who collect and collate data for the industry across organizations and provide it for a fee. Examples include retail Point-of-Sale data from generic brand retail outlets, industry data for salary and benefits comparison, and social-media feeds from Twitter, Google Analytics, etc. Government and financial markets data sources are also available for download and use.

While the richness and the capabilities enabled by such data are beyond question, the quality and reliability of this data prove to be a significant deterrent to its widespread adoption.

-- Aligning this data against internal hierarchies and dimensions is fraught with danger. For example, the organization may view the US and Canada as two distinct sales regions, but the retail sales data may tag them collectively under "North America". Such disconnects could make the data unusable for analytics or limit the scope of analytical models.
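The region mismatch can be illustrated with a minimal Python sketch. The region names, the figures, and the `roll_up_internal` helper are all hypothetical and chosen only for illustration. When the vendor reports at a coarser grain than the internal hierarchy, internal figures can be rolled up to the vendor's grain for comparison, but the vendor's figure cannot be split back into the internal regions.

```python
# Hypothetical internal sales, keyed by the organization's own regions.
internal_sales = {"US": 120.0, "Canada": 30.0, "Germany": 45.0}

# Hypothetical vendor market data, tagged at a coarser grain.
vendor_market = {"North America": 600.0, "Germany": 150.0}

# Assumed mapping from internal regions to the vendor's regions.
internal_to_vendor = {
    "US": "North America",
    "Canada": "North America",
    "Germany": "Germany",
}


def roll_up_internal(sales, mapping):
    """Aggregate internal sales to the vendor's coarser regions."""
    rolled = {}
    for region, amount in sales.items():
        vendor_region = mapping[region]
        rolled[vendor_region] = rolled.get(vendor_region, 0.0) + amount
    return rolled


rolled_up = roll_up_internal(internal_sales, internal_to_vendor)

# Market share can only be computed at the vendor's grain.
market_share = {
    region: rolled_up[region] / vendor_market[region] for region in vendor_market
}
print(market_share)
```

In this sketch a separate US or Canada market share cannot be derived, which is one way the scope of an analytical model ends up limited.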

-- Such data sources are never exhaustive, in that they never truly capture all the market activity.
-- Channels that are too new or too small to be captured under this umbrella will be missed.
-- For instance, Point-of-Sale retail data will miss sales from small retailers.
-- Since various competitors' products have strengths in different channels, this could introduce a significant bias in the data obtained.


Example: Airline reservation systems (Sabre, Galileo, etc.) provide participating carriers a consolidated data set that is a true record of every single ticket sold through the reservation system. However, this data does not include tickets sold directly through an airline's web site.

3. Survey Tools

In those cases where surveying is conducted in-house, BI tools provide data from a variety of survey questionnaires, the most common being Customer Satisfaction (CSAT) surveys, Employee Feedback surveys, etc. Survey data is often perceived to be "one-off" and is usually not provisioned for in a BI environment. However, for surveys like CSAT and Employee Feedback that are gathered on a regular basis, it is quite useful to understand changing patterns over time, and hence necessary to include them in a BI environment.

Other types of survey data include more generic surveys, like Salary Surveys and Lifestyle Surveys, that are generally used at aggregated levels to establish broad patterns. The biggest challenge in including survey data in a data warehouse is the inability to attach it to the common hierarchies. In specific cases (CSAT, Employees, etc.) such a linkage will exist and is easily used.

In the generic case, where responder identities are obfuscated or detail is not captured, such a linkage will have to be created artificially and provisioned for at the time of initiating the survey.

4. Analytical Output

A source of data not generally thought of as a "source" is the output and results of analytics models. Every analytical model generates output (forecasts, predicted probabilities, allocations, etc.) that shares similar characteristics with the data that was used as input to the model. Indeed, model output is often used as input to other models. For instance, the output of a forecasting model is used as input to another model that identifies optimal order sizes, which in turn could serve as an input to a third model that identifies optimal freight assignment, and so on.
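The chaining of models can be sketched in a few lines of Python. The three functions below are deliberately naive stand-ins (a mean forecast, a cover-the-gap order rule, and a truck count), not real optimization models, and all names and numbers are hypothetical. The point is only that each model's output is stored and then consumed as input data by the next.

```python
def forecast_demand(history):
    """Naive forecast: the mean of past demand (a stand-in for a real model)."""
    return sum(history) / len(history)


def order_size(forecast, on_hand):
    """Order enough to cover the forecast, never a negative quantity."""
    return max(0.0, forecast - on_hand)


def trucks_needed(order_quantity, truck_capacity):
    """Whole trucks required to ship the order (ceiling division)."""
    return int(-(-order_quantity // truck_capacity))


demand_history = [90, 110, 100]                    # hypothetical past demand
forecast = forecast_demand(demand_history)         # output of model 1
order = order_size(forecast, on_hand=40)           # model 2 consumes model 1's output
trucks = trucks_needed(order, truck_capacity=25)   # model 3 consumes model 2's output
print(forecast, order, trucks)
```

Because `forecast` and `order` are themselves data, they need the same sourcing and stewardship as any other input.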

Data Loading

Data loading is the process of copying and loading data or data sets from a source file, folder, or application to a database or similar application.

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analysts. Data repository is a somewhat general term used to refer to a destination designated for data storage.

Most key business processes in the organization are supported by a tool that has a data repository.

A data mart is a subset of a data warehouse oriented to a specific business line. Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example the sales department.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

A picture archiving and communication system (PACS) is a medical imaging technology that provides economical storage and convenient access to images from multiple modalities.

Enterprise applications leverage a shared repository that is available to all users, while smaller applications use self-contained repositories (Excel, MS Project, etc.). This data has to be captured into a centralized repository or a data warehouse and made available for consumption by business and analytics.

This data is then presented to business users (Presentation) in the form of spreadsheets, reports, dashboards, database views, etc.

In many cases, IT teams work to bring these sets of data into a centralized "data warehouse" that embeds an IT-supported business model to structure the data. This is achieved through two sets of data manipulations:

1. Extract-Transform-Load (ETL) processes that extract data from the source systems, then transform and load it into the data warehouse.
2. Database designs in the form of tables, views, triggers, and stored procedures.

The database can be called a data warehouse or a data mart, depending on its scope and the ambition of its builders.
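The two kinds of manipulation can be sketched with Python's standard `sqlite3` module. This is a minimal illustration under stated assumptions: the source rows, the `sales` table, the `sales_by_region` view, and the `transform` helper are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (hypothetical data).
source_rows = [
    {"order_id": "1001", "region": " us ", "amount": "19.99"},
    {"order_id": "1002", "region": "Canada", "amount": "5.00"},
]


def transform(row):
    """Cast types and normalize the region label."""
    return (int(row["order_id"]), row["region"].strip().upper(), float(row["amount"]))


# Load: write the transformed rows into a warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)", [transform(r) for r in source_rows]
)

# Database design: a view is one place where a business model gets embedded.
warehouse.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)
loaded = warehouse.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(loaded)
```

Note that business logic now lives in two places, the `transform` step and the view definition.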

Five types of database models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations; poor data quality is associated with a variety of hard and soft costs. Most organizations struggle to define and implement a formal strategy for addressing DQ problems. Solve data quality woes by adopting a systematic approach for identifying and correcting current issues, then put processes in place to stop data quality problems from resurfacing.

• Data quality is the overall ability of data to fulfill its intended purpose.
• Organizations of all sizes and in many different industries need clean data to operate.
• Poor data quality negatively impacts a wide array of functions and business processes.
• Both the business and IT are responsible for solving data quality issues.

5 issues with Data Quality

• Data quality expert Joseph Juran defines data quality as whether or not data is "fit for intended uses in operations, decision making and planning".

• Data quality does not refer to a single problem; it's an umbrella term referring to a family of different issues.

• Data quality is not a matter of "excellent" vs. "poor". An organization may excel in some areas of data quality but not in others. For example, an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh.

• Different data quality problems are often interrelated. For example, duplicate data can give rise to data conflicts. Taking steps to fix one problem can have a positive "halo effect" on other problems.

There are five different issues under the DQ umbrella:

• Duplicate Data
• Stale Data
• Incomplete Data
• Invalid Data
• Conflicting Data

For each issue: what is it, what causes it, what does it impact, and what can be done about it?

Data Duplication
What is it: Multiple copies of the same piece of data.
What causes it:
• Incorrect data entry
• Poor integration
• Faulty database design
What does it impact:
• Wasted storage space
• Ongoing problems with direct sales and/or marketing communications
What can be done about it:
• Data quality tools
• Better integration
• Unique indices for data

Stale Data
What is it: Data being incorrectly used on the assumption that it is current.
What causes it:
• Contacts changing position
• One-time integration with no ongoing delta import
• Data not being available fast enough from source systems
What does it impact:
• Problems with marketing correspondence, leading to lost sales and damaged customer relationships
What can be done about it:
• Establish clear data refresh cycles
• Pull customer information from user-supplied sources such as social networking sites

Incomplete Data
What is it: Key fields are missing or not filled out.
What causes it:
• End user apathy
• Required fields not being enforced
• Poor user interface
What does it impact:
• Missing data can lead to productivity losses and flawed decision-making
What can be done about it:
• End-user training
• Strong data validation
• Easy-to-use interfaces

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns.
What causes it:
• Ineffective or non-existent validation rules
• Data type mismatches between integrated systems
What does it impact:
• Creates integration exception reports which must be investigated
• Interferes with operational reporting
What can be done about it:
• Strong data validation
• Elimination of extraneous use of note fields
• User training

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system.
What causes it:
• No designated system of record
• Poor integration
• Lack of data interchange between systems
What does it impact:
• Data conflicts confuse users
• Wasted time and effort
• Threat of using incorrect data
What can be done about it:
• Tighter system integration
• Data auditing

The five data quality problems are distinct issues, but they may have similar underlying causes.
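As a rough illustration, each of the five issues can be detected with a simple rule. The Python sketch below is an assumption-laden example, not a data quality tool: the record layout, the `billing` system, the 365-day refresh policy, and the crude e-mail rule are all hypothetical.

```python
from datetime import date

# Hypothetical customer records from a CRM; field names are assumptions.
crm = [
    {"id": 1, "email": "ann@example.com", "phone": "555-0100", "updated": date(2021, 9, 1)},
    {"id": 2, "email": "ann@example.com", "phone": "555-0100", "updated": date(2021, 9, 1)},
    {"id": 3, "email": "bob@example", "phone": "", "updated": date(2019, 1, 15)},
]
# Phone numbers for the same customer ids as held by a second (billing) system.
billing = {1: "555-0100", 2: "555-0142"}

today = date(2021, 10, 3)   # fixed date so the result is reproducible
max_age_days = 365          # hypothetical refresh policy

# Duplicate data: the same (email, phone) pair appears more than once.
seen, duplicates = set(), []
for rec in crm:
    key = (rec["email"], rec["phone"])
    if key in seen:
        duplicates.append(rec["id"])
    seen.add(key)

# Stale data: not refreshed within the policy window.
stale = [r["id"] for r in crm if (today - r["updated"]).days > max_age_days]
# Incomplete data: a key field is empty.
incomplete = [r["id"] for r in crm if not r["phone"]]
# Invalid data: deliberately crude rule, the domain must contain a dot.
invalid = [r["id"] for r in crm if "." not in r["email"].split("@")[-1]]
# Conflicting data: the two systems disagree on a value both of them hold.
conflicting = [
    r["id"] for r in crm
    if r["id"] in billing and r["phone"] and r["phone"] != billing[r["id"]]
]

report = {
    "duplicate": duplicates,
    "stale": stale,
    "incomplete": incomplete,
    "invalid": invalid,
    "conflicting": conflicting,
}
print(report)
```

In this made-up data the duplicated record is also the one in conflict with billing, and a single neglected record is stale, incomplete, and invalid at once, which matches the observation that the problems are interrelated.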

Solution

IT and the business often try to "pass the buck" for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

• Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data. It's not uncommon for IT to downplay its role in addressing data quality issues.

• However, poor data quality is an endemic problem that often permeates the organization. Individual business units rarely have the resources or authority to unilaterally solve their data quality problems.

• While the business needs to recognize that it is ultimately accountable for data ownership, IT must take a proactive stance on providing solutions and assistance with data quality.

• It's important to delineate the relationship between IT and the business and specify who is responsible for what. IT should not be taking charge of the data; rather, it should provide tools and assistance with data cleansing.

The business needs to:

• Set policies for matters such as refresh cycles for stale data
• Determine which systems will be "systems of record" to reduce conflicts
• Determine access privileges and data validation rights

IT needs to:

• Advise the business on software tools for improving data quality
• Provide assistance with major cleansing efforts
• Provide assistance with database and interface design (e.g., locking down certain fields from end users and setting up data validation)

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the framework. The data repository represents the single biggest asset of the BI framework and is usually seen as the primary driver behind the BI framework.

The data repository comes in several flavors:

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the warehouse. Data store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
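The periodic flush from a staging ODS into a warehouse can be sketched as follows. This is a minimal example using Python's standard `sqlite3` module. The table names `ods_orders` and `dw_orders`, the `flush_to_warehouse` helper, and the cutoff date are hypothetical, and both tables live in one in-memory database purely to keep the sketch self-contained.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ods_orders (order_id INTEGER, order_date TEXT, amount REAL)")
db.execute("CREATE TABLE dw_orders (order_id INTEGER, order_date TEXT, amount REAL)")
db.executemany(
    "INSERT INTO ods_orders VALUES (?, ?, ?)",
    [(1, "2021-06-30", 10.0), (2, "2021-09-15", 20.0), (3, "2021-10-01", 30.0)],
)


def flush_to_warehouse(conn, cutoff_date):
    """Move ODS rows older than cutoff_date into the warehouse in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO dw_orders SELECT * FROM ods_orders WHERE order_date < ?",
            (cutoff_date,),
        )
        conn.execute("DELETE FROM ods_orders WHERE order_date < ?", (cutoff_date,))


flush_to_warehouse(db, "2021-09-01")  # hypothetical retention cutoff
ods_ids = [r[0] for r in db.execute("SELECT order_id FROM ods_orders ORDER BY order_id")]
dw_ids = [r[0] for r in db.execute("SELECT order_id FROM dw_orders ORDER BY order_id")]
print(ods_ids, dw_ids)
```

Dates are stored as ISO 8601 strings so that plain string comparison orders them correctly. Doing the copy and the delete in one transaction avoids losing or duplicating rows if the flush is interrupted.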

2) Data Warehouse

A data warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it; once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries with larger data sets, where speed and responsiveness are often not the driving factor. A data warehouse is not an essential ingredient for a BI framework and, depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists: ETL specialists embed data transformations into data flows, Data Warehouse designers put in complex database designs as implicit models, and Business Analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis, etc., without upsetting the structure of the warehoused data.

Data Marts are also not essential ingredients for BI infrastructure, but they are recommended as a "best practice" in most BI implementations because they distribute the management of the Business Analytics Databases. Since it remains the responsibility of the decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, and so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:

• Hierarchical Dimensions and Facts: a Star Schema
• A Normalized Structure, also called Third Normal Form

Hierarchical Dimensions and Facts: a Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
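A star schema can be sketched with Python's standard `sqlite3` module: one fact table holding the measures, with a foreign key to each dimension table. The table names (`fact_sales`, `dim_product`, `dim_region`) and the sample rows are hypothetical, chosen only for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, region TEXT);
-- The fact table holds measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO dim_region  VALUES (1, 'US'), (2, 'Canada');
INSERT INTO fact_sales  VALUES (1, 1, 10.0), (1, 2, 5.0), (2, 1, 7.5);
""")

# A typical analytical query joins the fact table to its dimensions and aggregates.
totals = db.execute("""
    SELECT p.category, r.region, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_region  AS r ON r.region_id  = f.region_id
    GROUP BY p.category, r.region
    ORDER BY p.category, r.region
""").fetchall()
print(totals)
```

Every query follows the same shape: start at the fact table, join outward to the dimensions needed, then group by their attributes.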

A Normalized Structure, also called Third Normal Form

Third normal form (3NF) is a normal form used in normalizing a database design to reduce the duplication of data and ensure referential integrity, by ensuring that the entity is in second normal form and that no non-key attribute is transitively dependent on the primary key.
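A small Python sketch can make the 3NF condition concrete. In the hypothetical flat table below, `customer_city` depends on `customer_id` rather than directly on the key `order_id`, which is a transitive dependency. Moving the city to its own table keyed by `customer_id` removes the duplication without losing information. All field names and values are made up for illustration.

```python
# Denormalized rows: customer_city depends on customer_id, not on the key order_id.
orders_flat = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Hyderabad", "amount": 10.0},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Hyderabad", "amount": 20.0},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Pune", "amount": 5.0},
]

# 3NF decomposition: the city moves to a table keyed by customer_id.
customers = {row["customer_id"]: {"city": row["customer_city"]} for row in orders_flat}
orders = [
    {"order_id": row["order_id"], "customer_id": row["customer_id"], "amount": row["amount"]}
    for row in orders_flat
]

# Joining the two tables on customer_id reproduces the original rows.
rejoined = [
    {**order, "customer_city": customers[order["customer_id"]]["city"]}
    for order in orders
]
print(len(customers), len(orders))
```

Each city is now stored once per customer, so a change of address is a single update rather than one per order.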

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source, and to tune the delay (or cycle time) and data integrity.

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. That is because it contains data that has been scrubbed and massaged to meet the analytics needs, and it also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in reporting and visualization analysis.
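The roll-up idea behind an OLAP cube can be sketched in plain Python. This is only an illustration of aggregating cleaned facts along chosen dimensions, not an OLAP engine. The dimension names, the sample facts, and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical cleaned facts: (year, region, product, amount).
facts = [
    (2020, "US", "Books", 10.0),
    (2020, "Canada", "Books", 5.0),
    (2021, "US", "Toys", 7.5),
    (2021, "US", "Books", 2.5),
]
DIMENSIONS = ("year", "region", "product")


def roll_up(rows, keep):
    """Sum amounts over every dimension not named in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in positions)] += row[-1]
    return dict(totals)


by_year = roll_up(facts, keep=("year",))
by_region_product = roll_up(facts, keep=("region", "product"))
grand_total = roll_up(facts, keep=())
print(by_year, grand_total)
```

A real cube typically precomputes many such aggregates so that reports and dashboards can read them without rescanning the facts.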

Page 2: Data Sourcing - CMRCET

Data SourcingData Sourcing is needed to get input data for analytics where the customer for the data is the data stewardship team Data is sourced from a variety of systemseach presenting its own unique challenges for a Business Intelligence system This activity deals with the challenges of trying to get data from various sources as needed by analysts and we will address four types of sources

Four types of sources

Transaction Processing Systems

Benchmarks and External Data

Sources

Survey Tools Analytical Output

4 types of sources

1Transaction Processing SystemsOnline Transaction Processing systems (OLTP) are a class of systems that facilitate and manage transaction oriented applications

These range from e-commerce sale transactions and credit card transactions to employee time-card applications

These systems are geared to handle large volumes of transactions efficiently and capture data for each transaction

1Transaction Processing SystemsThis captured data is a primary source of data for BI systems and usually form the cornerstone around which effective data warehouses are builtGiven the need for accuracy and reliability of such systems data from suchsources also is of high fidelityTransactions cannot be completed and logged with incomplete or inaccurate data (in most cases) and hence data sourced from such systems are among the most reliable Data errors if identified down the line are often amenable to remedy by a suitable mechanism introduced in the OLTP

2Benchmarks and External Data Sources

Where surveys and benchmarking activities areoutsourced to specialist providers we count them as external data

sourcesBI systems can source data sources external to the organization Such data are typically provided by vendors who collect and collate data for the industry across organizations and provide such data for a fee Examples includeretail Point-of-Sale data from generic brand retail outletsIndustry data for Salary and Benefits comparison

social-media feeds from Twitter Google Analytics etc Government and financial markets data sources are also available for download and use

2Benchmarks and External Data SourcesWhile the richness and the capabilities enabled by such data are beyond question the quality and reliability of this data proves to be a significant deterrent to its widespread adoption--aligning this data against internal hierarchies and dimensionsis fraught with dangerExthe organizations may choose to view the US and Canada as two distinct sales regions but the retail sales data may choose to tag them collectively under ldquoNorth Americardquo Such disconnects could make such data unusable for analytics or limit the scope of analytical models

2Benchmarks and External Data Sources--such data sources are never exhaustive in that they never trulycapture all the market activity--Channels that are too new or too small to becaptured under this umbrella will be missed--Forinstance Point-of-Sale retail data will miss out sales from small retailers--Since various competitorsproducts have strengths in different channels this could introduce a significant bias in the data obtained

2Benchmarks and External Data Sources--such data sources are never exhaustive in that they never trulycapture all the market activity--Channels that are too new or too small to becaptured under this umbrella will be missed--Forinstance Point-of-Sale retail data will miss out sales from small retailers--Since various competitorsproducts have strengths in different channels this could introduce a significant bias in the data obtained

2Benchmarks and External Data SourcesExample Airline reacuteservation systems (Sabre Galileo etc) provideparticipating carriers a consolidated data set that is a true record of every single ticket sold through the reservation system However this data does not include tickets solddirectly through an airlines web site

3Survey ToolsIn those cases where surveying is conducted in-house BI tools provide data from a variety of survey questionnaires the most common being Customer Satisfaction (CSAT) Surveys Employee Feedback Surveys etc Survey data is often perceived to be ldquoone-offrdquo and is usually not provisioned for in a BI environmentHowever in cases like CSAT Employee Feedback etc that are gathered

on a regular basis it is quite useful to understand changing patterns over time and hence necessary to include them in a BI environment

3Survey ToolsOther types of Survey data include more generic surveys like Salary Surveys Lifestyle Surveys etc that are generally used at aggregated levels to establish broad patterns The biggest challenge in including Survey data into a data warehouse is the inability to attach them to the common hierarchies In specific cases (CSAT Employees etc) such a linkage will exist and is easily used

In the generic case where responder identities are obfuscated or detail is not captured such a linkage will have to be created artificially and provisioned for at the time of initiating thesurvey

4Analytical OutputA source of data not generally thought of as a ldquosourcerdquo is the output and results of analytics models Every analytical model generates outputmdashforecasts predicted probabilities allocations etc that share similar characteristics as thedata that was used as input to the model Yes indeed model output is often used as input to other models For instance output of a forecasting model is used as input to another model that identifies optimal order sizes which in turn could serve as an input to a third model that identifies optimal freight assignment and so on

Data Loading

Data loading is the process of copying and loading data or data sets from a source file folder or application to a database or similar application

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analystsData repository is a somewhat general term used to refer to a destination designated for data storage

Most key business processes in the organization are supported by a tool that has a data repository

A data mart is a subset of a data warehouse oriented to a specific business line Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization for example the sales department

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities

Enterprise applications leverage a shared repository that is available to all users while smaller applications use self-contained repositories (Excel MS Project etc) Example-vid-1This data has to be captured into a centralized repository or a data warehouse and made available for consumption by Business and analytics

This data is then presented to business users (Presentation) in the form of spreadsheets reports dashboards database views etc

In many cases IT teams work to bring these sets of data into a centralized ldquodata warehouserdquo that embeds an IT-supported business model to structure the data This is achieved through two sets of data manipulations1 Extract-Transform-Load (ETL) processes that extract data from the source systems then transform and load them into the data warehouse2 Database designs in the forms of tables views triggers and stored procedures The database can be called a data warehouse or a data mart depending on its scope and the ambition of its builders

5 types of data base models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations poor data quality is associated with a variety of hard and soft costs Most organizations struggle to define and implement a formal strategy for addressing DQ problems Solve data quality woes by adopting a systematic approach for identifying and correcting current issues then put processes in place to stop data quality problems from resurfacingData quality is the overall ability of data to fulfill its intendedpurposeOrganizations of all sizes and in many different industries needclean data to operatePoor data quality negatively impacts a wide array of functions andbusiness processesBoth the business and IT are responsible for solving data qualityissues

5 issues with Data Quality

bull Data quality expert Joseph Juran defines data quality as whether or not data is ldquofit for intended uses in operations decision making and planningrdquo

bull Data quality does not refer to a single problem itrsquos an umbrella term referring to a family of different issues

bull Data quality is not a matter of ldquoexcellentrdquo vs ldquopoorrdquo An organization may excel in some areas of data quality but not in others For example an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh

bull Different data quality problems are often interrelated For example duplicate data can give rise to data conflicts Taking steps to fix one problem can have a positive ldquohalo effectrdquo on other problems

Data Quality

Duplicate Data

Stale Data

Incomplete Data

Invalid Data

Conflicting Data

There are five different issues under the DQ umbrella

What is it What causes it What does it impact What can be done about it

Data Duplication Multiple copies of the same piece of data

bull Incorrect data entrybull Poor integrationbull Faulty database design

bull Wasted storage spacebull Ongoing problems with direct

sales andor marketing communications

bull Data quality toolsbull Better integrationbull Unique indices for data

Stale Data Data being incorrectly used on the assumption that it is current

bull Contacts changing positionbull One-time integration with no

ongoing delta importbull Data not being available fast

enough from source systems

bull Problems with marketing correspondence leading to lost sales and damaged customer relationships

bull Establish clear data refresh cycles

bull Pull customer information from user-supplied sources such as social networking sites

Incomplete Data Key fields are missing or not filled out

bull End user apathybull Required fields not being

enforcedbull Poor user interface

bull Missing data can lead to productivity losses and flawed decision-making

bull End-user trainingbull Strong data validationbull Easy-to-use interfaces

Invalid Data The wrong data or poorly formatted data is stored in columns

bull Ineffective or non-existent validation rules

bull Data type mismatches between integrated systems

bull Creates integration exception reports which must be investigated

bull Interferes with operational reporting

bull Strong data validationbull Elimination of extraneous

use of note fieldsbull User training

Data Conflicts Data contained in one system is at odds with data contained in another system

bull No designated system ofrecord

bull Poor integrationbull Lack of data interchange

between systems

bull Data conflicts confuse usersbull Wasted time and effortbull Threat of using incorrect data

bull Tighter system integrationbull Data auditing

The five data quality problems are distinct issues but they may have similar underlying causes

Solution

IT and the business often try to ldquopass the buckrdquo for data quality issues to one another The business must own the data but IT needs to have an active role in offering solutions to help the business address data quality problems

bull Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data Itrsquos not uncommon for IT to downplay its role in addressing data quality issues

bull However poor data quality is an endemic problem that often permeates the organization Individual business units rarely have the resources or authority to unilaterally solve their data quality problems

bull While the business needs to recognize that it is ultimately accountable for data ownership IT must take a proactive stance on providing solutions and assistance with data quality

bull Itrsquos important to delineate the relationship between IT and the business and specify who is responsible for what IT should not be taking charge of the data rather it should provide tools and assistance with data cleansing

Set policies for matters such as refresh cycles for stale data

Determine which systems will be ldquosystems of recordrdquo to reduce conflicts

Determine access privileges and data validation rights

The business needs tohellip

Advise the business on software tools for improving data quality

Provide assistance with major cleansing efforts

Provide assistance with database and interface design (eg locking down certain fields from end users and setting up data validation)

IT needs tohellip

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the frameworkThe data repository represents the

single biggest asset of the BI framework and is usually seen as the primary driverbehind the BI framework

The data repository comes in several flavors

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. The ODS integrates data from disparate sources (through Data Loading), and the data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the Warehouse. Data Store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
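The periodic flush from an ODS into a warehouse can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table names (ods_orders, warehouse_orders), the columns, and the cutoff date are all hypothetical and do not come from the slides.

```python
import sqlite3

def flush_old_rows(conn, cutoff_date):
    """Copy ODS rows older than cutoff_date into the warehouse, then remove them from the ODS."""
    with conn:  # single transaction: both statements are committed together or rolled back
        conn.execute(
            "INSERT INTO warehouse_orders SELECT * FROM ods_orders WHERE order_date < ?",
            (cutoff_date,),
        )
        cur = conn.execute("DELETE FROM ods_orders WHERE order_date < ?", (cutoff_date,))
    return cur.rowcount  # number of rows moved out of the ODS

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
    CREATE TABLE warehouse_orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
""")
conn.executemany(
    "INSERT INTO ods_orders VALUES (?, ?, ?)",
    [(1, "2021-01-15", 120.0), (2, "2021-06-30", 75.5), (3, "2021-09-01", 310.0)],
)

# ISO-formatted dates compare correctly as plain strings.
moved = flush_old_rows(conn, "2021-07-01")
ods_count = conn.execute("SELECT COUNT(*) FROM ods_orders").fetchone()[0]
wh_count = conn.execute("SELECT COUNT(*) FROM warehouse_orders").fetchone()[0]
print(moved, ods_count, wh_count)
```

After the flush the ODS keeps only recent rows, which is what keeps its simple queries fast, while the warehouse accumulates the history.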

2) Data Warehouse

A data warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it: once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries over larger data sets, where speed and responsiveness are often not the driving factor.

A data warehouse is not an essential ingredient of a BI framework. Depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists:
• ETL specialists embed data transformations into data flows.
• Data Warehouse designers put in complex database designs as implicit models.
• Business analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis, etc., without upsetting the structure of the warehoused data.

Data Marts are also not essential ingredients of a BI infrastructure, but they are recommended as a “best practice” in most BI implementations because they distribute the management of the Business Analytics Databases. Since it remains the responsibility of the decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, and so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:
• Hierarchical Dimensions and Facts (a Star Schema)
• A Normalized Structure (also called Third Normal Form)

Hierarchical Dimensions and Facts (Star Schema)

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
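As a rough illustration of one fact table referencing dimension tables, the sketch below builds a tiny star schema with Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product, dim_region) and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold the descriptive attributes.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, region_name TEXT);

    -- The fact table holds the measures plus a key into each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        region_id  INTEGER REFERENCES dim_region(region_id),
        amount     REAL
    );

    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Toys');
    INSERT INTO dim_region  VALUES (1, 'US'), (2, 'Canada');
    INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 40.0), (2, 1, 60.0);
""")

# A typical star-schema query: join the fact table to a dimension and aggregate.
sales_by_category = dict(conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
""").fetchall())
print(sales_by_category)
```

Swapping dim_product for dim_region in the join gives sales by region without touching the fact table, which is the main convenience of this layout.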

Normalized Structure (Third Normal Form)

Third normal form (3NF) is a normal form used in normalizing a database design to reduce the duplication of data and ensure referential integrity. It does so by ensuring that the entity is in second normal form and that no non-key attribute depends transitively on the primary key.

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of the data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source, and to tune the delay (or cycle time) and data integrity.

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. That is because it contains data that has been scrubbed and massaged to meet the analytics needs, and it also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in providing reporting and visualization analysis.


Four types of sources

1. Transaction Processing Systems
2. Benchmarks and External Data Sources
3. Survey Tools
4. Analytical Output

1. Transaction Processing Systems

Online Transaction Processing (OLTP) systems are a class of systems that facilitate and manage transaction-oriented applications. These range from e-commerce sale transactions and credit card transactions to employee time-card applications. These systems are geared to handle large volumes of transactions efficiently and to capture data for each transaction.

This captured data is a primary source of data for BI systems and usually forms the cornerstone around which effective data warehouses are built. Given the need for accuracy and reliability in such systems, data from these sources is also of high fidelity. Transactions cannot (in most cases) be completed and logged with incomplete or inaccurate data, and hence data sourced from such systems is among the most reliable. Data errors, if identified down the line, are often amenable to remedy by a suitable mechanism introduced in the OLTP system.

2. Benchmarks and External Data Sources

Where surveys and benchmarking activities are outsourced to specialist providers, we count them as external data sources. BI systems can source data from outside the organization. Such data is typically provided by vendors who collect and collate data for the industry across organizations and provide it for a fee. Examples include:
• Retail Point-of-Sale data from generic brand retail outlets
• Industry data for Salary and Benefits comparison
• Social-media feeds from Twitter, Google Analytics, etc.
Government and financial markets data sources are also available for download and use.

While the richness and the capabilities enabled by such data are beyond question, the quality and reliability of this data prove to be a significant deterrent to its widespread adoption:
• Aligning this data against internal hierarchies and dimensions is fraught with danger. For example, the organization may choose to view the US and Canada as two distinct sales regions, but the retail sales data may tag them collectively under “North America”. Such disconnects could make the data unusable for analytics or limit the scope of analytical models.
• Such data sources are never exhaustive, in that they never truly capture all the market activity. Channels that are too new or too small to be captured under this umbrella will be missed. For instance, Point-of-Sale retail data will miss out sales from small retailers. Since various competitors’ products have strengths in different channels, this could introduce a significant bias in the data obtained.

Example: Airline reservation systems (Sabre, Galileo, etc.) provide participating carriers with a consolidated data set that is a true record of every single ticket sold through the reservation system. However, this data does not include tickets sold directly through an airline’s web site.
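The US/Canada versus “North America” mismatch can only be bridged by rolling the internal figures up to the vendor's coarser level; the reverse split is not recoverable from the vendor data. A minimal sketch of that roll-up is shown below. The mapping table, the region names beyond those in the text, and all figures are made up for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from internal sales regions to the vendor's coarser regions.
INTERNAL_TO_VENDOR = {"US": "North America", "Canada": "North America", "UK": "Europe"}

def roll_up(internal_sales):
    """Aggregate internal region totals to the vendor's region hierarchy."""
    totals = defaultdict(float)
    for region, amount in internal_sales.items():
        totals[INTERNAL_TO_VENDOR[region]] += amount
    return dict(totals)

internal_sales = {"US": 800.0, "Canada": 200.0, "UK": 300.0}   # our own figures
vendor_market = {"North America": 5000.0, "Europe": 1500.0}    # purchased market totals

rolled = roll_up(internal_sales)
# Market share can only be computed at the level both data sets agree on.
share = {region: rolled[region] / vendor_market[region] for region in vendor_market}
print(rolled)
print(share)
```

Any analysis that needs a separate US or Canada market share is out of reach with this vendor data, which is the "limited scope" the text warns about.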

3. Survey Tools

In those cases where surveying is conducted in-house, BI tools provide data from a variety of survey questionnaires, the most common being Customer Satisfaction (CSAT) Surveys, Employee Feedback Surveys, etc. Survey data is often perceived to be “one-off” and is usually not provisioned for in a BI environment. However, for surveys such as CSAT and Employee Feedback that are gathered on a regular basis, it is quite useful to understand changing patterns over time, and hence necessary to include them in a BI environment.

Other types of survey data include more generic surveys such as Salary Surveys, Lifestyle Surveys, etc., which are generally used at aggregated levels to establish broad patterns. The biggest challenge in including survey data in a data warehouse is the inability to attach it to the common hierarchies. In specific cases (CSAT, Employees, etc.) such a linkage will exist and is easily used. In the generic case, where responder identities are obfuscated or detail is not captured, such a linkage will have to be created artificially and provisioned for at the time of initiating the survey.

4. Analytical Output

A source of data not generally thought of as a “source” is the output and results of analytics models. Every analytical model generates output (forecasts, predicted probabilities, allocations, etc.) that shares similar characteristics with the data that was used as input to the model. Indeed, model output is often used as input to other models. For instance, the output of a forecasting model is used as input to another model that identifies optimal order sizes, which in turn could serve as an input to a third model that identifies optimal freight assignment, and so on.
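The forecasting, order-size, and freight chain can be pictured as three functions feeding one another. The sketch below uses deliberately naive stand-ins (a three-period average, a safety-stock rule, and a ceiling division) rather than real optimization models; every function name and number in it is hypothetical.

```python
import math

def forecast_demand(history):
    """Stand-in forecasting model: mean of the last three periods."""
    recent = history[-3:]
    return sum(recent) / len(recent)

def order_size(forecast, on_hand, safety_stock):
    """Stand-in ordering model: cover the forecast plus safety stock, net of stock on hand."""
    return max(0, math.ceil(forecast + safety_stock - on_hand))

def trucks_needed(units, truck_capacity):
    """Stand-in freight model: number of full or partial truckloads."""
    return math.ceil(units / truck_capacity)

# Each model's output is stored and then consumed as input by the next model.
forecast = forecast_demand([90, 110, 100, 120])
units = order_size(forecast, on_hand=40, safety_stock=20)
trucks = trucks_needed(units, truck_capacity=40)
print(forecast, units, trucks)
```

Because each intermediate value is itself an input downstream, it deserves the same storage and quality treatment as any other sourced data.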

Data Loading

Data loading is the process of copying and loading data or data sets from a source file, folder, or application to a database or similar application.

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analysts. Data repository is a somewhat general term used to refer to a destination designated for data storage.

Most key business processes in the organization are supported by a tool that has a data repository

A data mart is a subset of a data warehouse oriented to a specific business line Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization for example the sales department

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities

Enterprise applications leverage a shared repository that is available to all users, while smaller applications use self-contained repositories (Excel, MS Project, etc.). This data has to be captured into a centralized repository or a data warehouse and made available for consumption by the business and analytics.

This data is then presented to business users (Presentation) in the form of spreadsheets, reports, dashboards, database views, etc.

In many cases, IT teams work to bring these sets of data into a centralized “data warehouse” that embeds an IT-supported business model to structure the data. This is achieved through two sets of data manipulations:
1. Extract-Transform-Load (ETL) processes that extract data from the source systems, then transform and load them into the data warehouse.
2. Database designs in the form of tables, views, triggers, and stored procedures.
The database can be called a data warehouse or a data mart, depending on its scope and the ambition of its builders.
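A very small ETL pass might look like the sketch below, which uses only Python's standard csv and sqlite3 modules. The CSV layout, the orders table, and the cleaning rules are assumptions made for the example; a real pipeline would read from actual source systems and apply the organization's own rules.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source system (kept inline so the example is self-contained).
SOURCE_CSV = """order_id,customer,amount
1, alice ,100.50
2,BOB,20
3,carol,not-a-number
"""

def extract(text):
    """Extract: read the raw rows as dictionaries keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names and types; set aside rows that cannot be converted."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append(
                (int(row["order_id"]), row["customer"].strip().title(), float(row["amount"]))
            )
        except ValueError:
            rejected.append(row)
    return clean, rejected

def load(conn, rows):
    """Load: write the cleaned rows into the target table in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
clean, rejected = transform(extract(SOURCE_CSV))
load(conn, clean)
loaded = conn.execute("SELECT customer, amount FROM orders ORDER BY order_id").fetchall()
print(loaded, len(rejected))
```

Keeping the rejected rows instead of silently dropping them gives the data stewardship team something to investigate.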

5 types of data base models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations; poor data quality is associated with a variety of hard and soft costs. Most organizations struggle to define and implement a formal strategy for addressing DQ problems. Solve data quality woes by adopting a systematic approach for identifying and correcting current issues, then put processes in place to stop data quality problems from resurfacing.
• Data quality is the overall ability of data to fulfill its intended purpose.
• Organizations of all sizes and in many different industries need clean data to operate.
• Poor data quality negatively impacts a wide array of functions and business processes.
• Both the business and IT are responsible for solving data quality issues.

5 issues with Data Quality

• Data quality expert Joseph Juran defines data quality as whether or not data is “fit for intended uses in operations, decision making and planning”.

• Data quality does not refer to a single problem; it’s an umbrella term referring to a family of different issues.

• Data quality is not a matter of “excellent” vs. “poor”. An organization may excel in some areas of data quality but not in others. For example, an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh.

• Different data quality problems are often interrelated. For example, duplicate data can give rise to data conflicts. Taking steps to fix one problem can have a positive “halo effect” on other problems.

There are five different issues under the DQ umbrella:
• Duplicate Data
• Stale Data
• Incomplete Data
• Invalid Data
• Conflicting Data

Data Duplication
What is it: Multiple copies of the same piece of data.
What causes it:
• Incorrect data entry
• Poor integration
• Faulty database design
What does it impact:
• Wasted storage space
• Ongoing problems with direct sales and/or marketing communications
What can be done about it:
• Data quality tools
• Better integration
• Unique indices for data

Stale Data
What is it: Data being incorrectly used on the assumption that it is current.
What causes it:
• Contacts changing position
• One-time integration with no ongoing delta import
• Data not being available fast enough from source systems
What does it impact:
• Problems with marketing correspondence, leading to lost sales and damaged customer relationships
What can be done about it:
• Establish clear data refresh cycles
• Pull customer information from user-supplied sources such as social networking sites

Incomplete Data
What is it: Key fields are missing or not filled out.
What causes it:
• End user apathy
• Required fields not being enforced
• Poor user interface
What does it impact:
• Missing data can lead to productivity losses and flawed decision-making
What can be done about it:
• End-user training
• Strong data validation
• Easy-to-use interfaces

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns.
What causes it:
• Ineffective or non-existent validation rules
• Data type mismatches between integrated systems
What does it impact:
• Creates integration exception reports which must be investigated
• Interferes with operational reporting
What can be done about it:
• Strong data validation
• Elimination of extraneous use of note fields
• User training

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system.
What causes it:
• No designated system of record
• Poor integration
• Lack of data interchange between systems
What does it impact:
• Data conflicts confuse users
• Wasted time and effort
• Threat of using incorrect data
What can be done about it:
• Tighter system integration
• Data auditing

The five data quality problems are distinct issues but they may have similar underlying causes
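Each of the five issues lends itself to a simple automated check. The sketch below runs one check per issue over a handful of invented customer records; the field names, the one-year staleness threshold, the crude email test, and the second "billing" system are all assumptions for illustration, not rules taken from the slides.

```python
from collections import Counter
from datetime import date

REQUIRED = ("customer_id", "email", "updated")

# Hypothetical CRM extract containing one example of each problem.
crm = [
    {"customer_id": 1, "email": "a@example.com", "updated": date(2021, 9, 1)},
    {"customer_id": 1, "email": "a@example.com", "updated": date(2021, 9, 1)},   # duplicate
    {"customer_id": 2, "email": "b@example.com", "updated": date(2019, 1, 5)},   # stale
    {"customer_id": 3, "email": None,            "updated": date(2021, 8, 20)},  # incomplete
    {"customer_id": 4, "email": "not-an-email",  "updated": date(2021, 8, 25)},  # invalid
]
# Hypothetical second system holding its own copy of customer emails.
billing = {1: "a@example.com", 2: "old-b@example.com"}

def duplicates(rows):
    counts = Counter(r["customer_id"] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def stale(rows, today, max_age_days):
    return sorted({r["customer_id"] for r in rows if (today - r["updated"]).days > max_age_days})

def incomplete(rows):
    return sorted({r["customer_id"] for r in rows if any(r.get(f) is None for f in REQUIRED)})

def invalid(rows):
    # Deliberately crude format rule: a present email must contain "@".
    return sorted({r["customer_id"] for r in rows if r["email"] is not None and "@" not in r["email"]})

def conflicts(rows, other):
    return sorted({r["customer_id"] for r in rows
                   if r["customer_id"] in other and r["email"] != other[r["customer_id"]]})

report = {
    "duplicate": duplicates(crm),
    "stale": stale(crm, today=date(2021, 10, 3), max_age_days=365),
    "incomplete": incomplete(crm),
    "invalid": invalid(crm),
    "conflicting": conflicts(crm, billing),
}
print(report)
```

Note that customer 2 shows up under both "stale" and "conflicting", a small instance of how the issues share underlying causes.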

Solution

IT and the business often try to “pass the buck” for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

bull Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data Itrsquos not uncommon for IT to downplay its role in addressing data quality issues

bull However poor data quality is an endemic problem that often permeates the organization Individual business units rarely have the resources or authority to unilaterally solve their data quality problems

bull While the business needs to recognize that it is ultimately accountable for data ownership IT must take a proactive stance on providing solutions and assistance with data quality

bull Itrsquos important to delineate the relationship between IT and the business and specify who is responsible for what IT should not be taking charge of the data rather it should provide tools and assistance with data cleansing

Set policies for matters such as refresh cycles for stale data

Determine which systems will be ldquosystems of recordrdquo to reduce conflicts

Determine access privileges and data validation rights

The business needs tohellip

Advise the business on software tools for improving data quality

Provide assistance with major cleansing efforts

Provide assistance with database and interface design (eg locking down certain fields from end users and setting up data validation)

IT needs tohellip

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the frameworkThe data repository represents the

single biggest asset of the BI framework and is usually seen as the primary driverbehind the BI framework

The data repository comes in several flavors

1)Operational Data Store

1)Operational Data StoreAn operational data store (ODS) is a type of database that collects data from multiple sources for processing after which it sends the data to operational systems and data warehousesAn Operational Data Store (ODS) integrates data from disparate sources (through Data Loading) The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency

1)Operational Data StoreAn operational data store may be designed to store only a limited history of data with older data flushed periodically intoa Data Warehouse Such operational data stores are sometimes referred to as Staging Databases since they hold data temporarily before committing it to the Warehouse Data Store structures are optimized for simple queries with the emphasis on speedy retrieval of limited information

2)Data Warehouse

2)Data WarehouseA Data warehouse collects data from operational data stores and stores them for longer term use A key aspect of a data warehouse is that data is never deleted from a warehouse and once committed to the warehouse the data becomes a permanent record Data warehouses are structured to handle complex queries with larger data sets where speed and responsiveness are often not the driving factorA data warehouse is not an essential ingredient for a BI framework and depending on the volume and usage of the data a Data Store can effectively serve as a data warehouse for all intents and purposes

2)Data WarehouseIn fact a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructureThis is because it creates multiple locations where models must be managed by different specialistsETL specialists embed data transformations into data flows

Data Warehouse designers put in complex database designs as implicit models and Business analytics modelers create their own models on top of these and react to business needs by creating overlaps

3)Data Mart

3)Data MartA Data Mart is a specialized cut from the data warehouse extracted for very specificbusiness needsOwnership of these data marts is typically vested with

the business units The business units can use these marts to create ad-hoc dimensions for specific analysis etc without upsetting the structure of the warehoused data

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4)Data Structuring and Transformationbull Hierarchical Dimensions and Factsmdasha Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts The star schema consists of one or more fact tables referencing any number of dimension tables

4)Data Structuring and Transformationbull A Normalized Structuremdashalso called a Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that The entity is in second normal form

5)Business Analytics Input Databases

5)Business Analytics Input DatabasesThe Business Analytics Input Database is an Operational Data Store into which data from various sources flows The data delivered to this stage must be a faithful representation of data in the source systemsNo additional checks are put in place to validate the data

available at this stage The design of this database needs to address how the data is synchronized with its source and to tune the delay (orcycle time) and data integrity

6)Business Analytics Ready DatabasesA Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built That is because it contains data that has beenscrubbed and massaged to meet the analytics needs and also has access to informationabout data problems that can be used to forestall incorrect execution ofanalytics Examples of such databases include the ubiquitous OLAP9 cube that isvery prevalent in providing reporting and visualization analysis

Page 4: Data Sourcing - CMRCET

1Transaction Processing SystemsOnline Transaction Processing systems (OLTP) are a class of systems that facilitate and manage transaction oriented applications

These range from e-commerce sale transactions and credit card transactions to employee time-card applications

These systems are geared to handle large volumes of transactions efficiently and capture data for each transaction

1Transaction Processing SystemsThis captured data is a primary source of data for BI systems and usually form the cornerstone around which effective data warehouses are builtGiven the need for accuracy and reliability of such systems data from suchsources also is of high fidelityTransactions cannot be completed and logged with incomplete or inaccurate data (in most cases) and hence data sourced from such systems are among the most reliable Data errors if identified down the line are often amenable to remedy by a suitable mechanism introduced in the OLTP

2Benchmarks and External Data Sources

Where surveys and benchmarking activities areoutsourced to specialist providers we count them as external data

sourcesBI systems can source data sources external to the organization Such data are typically provided by vendors who collect and collate data for the industry across organizations and provide such data for a fee Examples includeretail Point-of-Sale data from generic brand retail outletsIndustry data for Salary and Benefits comparison

social-media feeds from Twitter Google Analytics etc Government and financial markets data sources are also available for download and use

2Benchmarks and External Data SourcesWhile the richness and the capabilities enabled by such data are beyond question the quality and reliability of this data proves to be a significant deterrent to its widespread adoption--aligning this data against internal hierarchies and dimensionsis fraught with dangerExthe organizations may choose to view the US and Canada as two distinct sales regions but the retail sales data may choose to tag them collectively under ldquoNorth Americardquo Such disconnects could make such data unusable for analytics or limit the scope of analytical models

2Benchmarks and External Data Sources--such data sources are never exhaustive in that they never trulycapture all the market activity--Channels that are too new or too small to becaptured under this umbrella will be missed--Forinstance Point-of-Sale retail data will miss out sales from small retailers--Since various competitorsproducts have strengths in different channels this could introduce a significant bias in the data obtained

2Benchmarks and External Data Sources--such data sources are never exhaustive in that they never trulycapture all the market activity--Channels that are too new or too small to becaptured under this umbrella will be missed--Forinstance Point-of-Sale retail data will miss out sales from small retailers--Since various competitorsproducts have strengths in different channels this could introduce a significant bias in the data obtained

2Benchmarks and External Data SourcesExample Airline reacuteservation systems (Sabre Galileo etc) provideparticipating carriers a consolidated data set that is a true record of every single ticket sold through the reservation system However this data does not include tickets solddirectly through an airlines web site

3Survey ToolsIn those cases where surveying is conducted in-house BI tools provide data from a variety of survey questionnaires the most common being Customer Satisfaction (CSAT) Surveys Employee Feedback Surveys etc Survey data is often perceived to be ldquoone-offrdquo and is usually not provisioned for in a BI environmentHowever in cases like CSAT Employee Feedback etc that are gathered

on a regular basis it is quite useful to understand changing patterns over time and hence necessary to include them in a BI environment

3Survey ToolsOther types of Survey data include more generic surveys like Salary Surveys Lifestyle Surveys etc that are generally used at aggregated levels to establish broad patterns The biggest challenge in including Survey data into a data warehouse is the inability to attach them to the common hierarchies In specific cases (CSAT Employees etc) such a linkage will exist and is easily used

In the generic case where responder identities are obfuscated or detail is not captured such a linkage will have to be created artificially and provisioned for at the time of initiating thesurvey

4Analytical OutputA source of data not generally thought of as a ldquosourcerdquo is the output and results of analytics models Every analytical model generates outputmdashforecasts predicted probabilities allocations etc that share similar characteristics as thedata that was used as input to the model Yes indeed model output is often used as input to other models For instance output of a forecasting model is used as input to another model that identifies optimal order sizes which in turn could serve as an input to a third model that identifies optimal freight assignment and so on

Data Loading

Data loading is the process of copying and loading data or data sets from a source file folder or application to a database or similar application

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analystsData repository is a somewhat general term used to refer to a destination designated for data storage

Most key business processes in the organization are supported by a tool that has a data repository

A data mart is a subset of a data warehouse oriented to a specific business line Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization for example the sales department

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities

Enterprise applications leverage a shared repository that is available to all users while smaller applications use self-contained repositories (Excel MS Project etc) Example-vid-1This data has to be captured into a centralized repository or a data warehouse and made available for consumption by Business and analytics

This data is then presented to business users (Presentation) in the form of spreadsheets reports dashboards database views etc

In many cases IT teams work to bring these sets of data into a centralized ldquodata warehouserdquo that embeds an IT-supported business model to structure the data This is achieved through two sets of data manipulations1 Extract-Transform-Load (ETL) processes that extract data from the source systems then transform and load them into the data warehouse2 Database designs in the forms of tables views triggers and stored procedures The database can be called a data warehouse or a data mart depending on its scope and the ambition of its builders

5 types of data base models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations poor data quality is associated with a variety of hard and soft costs Most organizations struggle to define and implement a formal strategy for addressing DQ problems Solve data quality woes by adopting a systematic approach for identifying and correcting current issues then put processes in place to stop data quality problems from resurfacingData quality is the overall ability of data to fulfill its intendedpurposeOrganizations of all sizes and in many different industries needclean data to operatePoor data quality negatively impacts a wide array of functions andbusiness processesBoth the business and IT are responsible for solving data qualityissues

5 issues with Data Quality

bull Data quality expert Joseph Juran defines data quality as whether or not data is ldquofit for intended uses in operations decision making and planningrdquo

bull Data quality does not refer to a single problem itrsquos an umbrella term referring to a family of different issues

bull Data quality is not a matter of ldquoexcellentrdquo vs ldquopoorrdquo An organization may excel in some areas of data quality but not in others For example an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh

bull Different data quality problems are often interrelated For example duplicate data can give rise to data conflicts Taking steps to fix one problem can have a positive ldquohalo effectrdquo on other problems

Data Quality

Duplicate Data

Stale Data

Incomplete Data

Invalid Data

Conflicting Data

There are five different issues under the DQ umbrella

What is it What causes it What does it impact What can be done about it

Data Duplication Multiple copies of the same piece of data

bull Incorrect data entrybull Poor integrationbull Faulty database design

bull Wasted storage spacebull Ongoing problems with direct

sales andor marketing communications

bull Data quality toolsbull Better integrationbull Unique indices for data

Stale Data Data being incorrectly used on the assumption that it is current

bull Contacts changing positionbull One-time integration with no

ongoing delta importbull Data not being available fast

enough from source systems

bull Problems with marketing correspondence leading to lost sales and damaged customer relationships

bull Establish clear data refresh cycles

bull Pull customer information from user-supplied sources such as social networking sites

Incomplete Data Key fields are missing or not filled out

bull End user apathybull Required fields not being

enforcedbull Poor user interface

bull Missing data can lead to productivity losses and flawed decision-making

bull End-user trainingbull Strong data validationbull Easy-to-use interfaces

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns.
What causes it:
• Ineffective or non-existent validation rules
• Data type mismatches between integrated systems
What does it impact:
• Creates integration exception reports, which must be investigated
• Interferes with operational reporting
What can be done about it:
• Strong data validation
• Elimination of extraneous use of note fields
• User training
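"Strong data validation" usually means one explicit rule per field. The sketch below assumes three hypothetical fields (quantity, order_date, postcode) and made-up formats; real rules would come from the business.

```python
import re
from datetime import datetime


def is_valid_date(value):
    """True if value is a string in YYYY-MM-DD form naming a real calendar date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


# One rule per field; each rule returns True when the value is acceptable.
RULES = {
    "quantity": lambda v: isinstance(v, int) and v > 0,
    "order_date": is_valid_date,
    "postcode": lambda v: isinstance(v, str) and re.fullmatch(r"\d{6}", v) is not None,
}


def validate(record):
    """Return the names of the fields that break a validation rule."""
    return [field for field, rule in RULES.items() if not rule(record.get(field))]


good = {"quantity": 3, "order_date": "2024-02-29", "postcode": "500001"}
bad = {"quantity": "three", "order_date": "31/12/2024", "postcode": "5000"}
print(validate(good), validate(bad))
```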

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system.
What causes it:
• No designated system of record
• Poor integration
• Lack of data interchange between systems
What does it impact:
• Data conflicts confuse users
• Wasted time and effort
• Threat of using incorrect data
What can be done about it:
• Tighter system integration
• Data auditing
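A data audit against a designated system of record can be sketched as a field-by-field comparison. The two systems (a CRM treated as the system of record, and a billing system), the keys, and the values below are all hypothetical.

```python
def audit_conflicts(system_of_record, other_system, fields):
    """List (key, field, record_value, other_value) wherever the two systems disagree."""
    conflicts = []
    for key, master in system_of_record.items():
        other = other_system.get(key)
        if other is None:
            continue  # a missing record is a completeness problem, not a conflict
        for field in fields:
            if master.get(field) != other.get(field):
                conflicts.append((key, field, master.get(field), other.get(field)))
    return conflicts


crm = {
    "C001": {"phone": "555-0100", "city": "Hyderabad"},
    "C002": {"phone": "555-0111", "city": "Pune"},
}
billing = {
    "C001": {"phone": "555-0100", "city": "Secunderabad"},
    "C002": {"phone": "555-0111", "city": "Pune"},
}

conflicts = audit_conflicts(crm, billing, fields=("phone", "city"))
print(conflicts)
```

Once a system of record is designated, each reported conflict has an unambiguous resolution: the other system is corrected to match it.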

The five data quality problems are distinct issues, but they may have similar underlying causes.

Solution

IT and the business often try to “pass the buck” for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

• Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data. It’s not uncommon for IT to downplay its role in addressing data quality issues.

• However, poor data quality is an endemic problem that often permeates the organization. Individual business units rarely have the resources or authority to unilaterally solve their data quality problems.

• While the business needs to recognize that it is ultimately accountable for data ownership, IT must take a proactive stance on providing solutions and assistance with data quality.

• It’s important to delineate the relationship between IT and the business and specify who is responsible for what. IT should not be taking charge of the data; rather, it should provide tools and assistance with data cleansing.

The business needs to…
• Set policies for matters such as refresh cycles for stale data
• Determine which systems will be “systems of record” to reduce conflicts
• Determine access privileges and data validation rights

IT needs to…
• Advise the business on software tools for improving data quality
• Provide assistance with major cleansing efforts
• Provide assistance with database and interface design (e.g., locking down certain fields from end users and setting up data validation)

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the framework. The data repository represents the single biggest asset of the BI framework and is usually seen as the primary driver behind the BI framework.

The data repository comes in several flavors.

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the Warehouse. Data Store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
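The periodic flush of older data from an operational data store into the warehouse can be sketched as follows. Both stores are plain Python lists here, and the 30-day retention window and the txn_date field are assumptions made for the example.

```python
from datetime import date, timedelta


def flush_old_rows(ods, warehouse, today, keep_days=30):
    """Move rows older than keep_days from the ODS into the warehouse.

    Returns the number of rows moved. Both stores are lists of dicts.
    """
    cutoff = today - timedelta(days=keep_days)
    to_move = [row for row in ods if row["txn_date"] < cutoff]
    warehouse.extend(to_move)  # commit to the long-term store first
    ods[:] = [row for row in ods if row["txn_date"] >= cutoff]  # then trim the ODS
    return len(to_move)


ods = [
    {"txn_id": 1, "txn_date": date(2024, 4, 1)},
    {"txn_id": 2, "txn_date": date(2024, 5, 25)},
]
warehouse = []

moved = flush_old_rows(ods, warehouse, today=date(2024, 6, 1))
print(moved, [r["txn_id"] for r in ods], [r["txn_id"] for r in warehouse])
```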

2) Data Warehouse

A Data Warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it: once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries with larger data sets, where speed and responsiveness are often not the driving factor.

A data warehouse is not an essential ingredient for a BI framework. Depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists:
• ETL specialists embed data transformations into data flows.
• Data Warehouse designers put in complex database designs as implicit models.
• Business analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis, etc., without upsetting the structure of the warehoused data.

Data Marts are also not essential ingredients for BI infrastructure, but get recommended as a “best practice” in most BI implementations, as they distribute the management of the Business Analytics Databases. Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:
• Hierarchical Dimensions and Facts: a Star Schema
• A Normalized Structure: also called Third Normal Form

Hierarchical Dimensions and Facts: a Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
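A star schema can be illustrated with one fact table and two dimension tables in an in-memory SQLite database. The table names, columns, and figures are invented for the example; only the shape (facts referencing dimensions, then a join to aggregate by a dimension attribute) is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, country  TEXT);
-- The fact table holds the measures plus one key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Grocery'), (2, 'Apparel');
INSERT INTO dim_region  VALUES (1, 'US'), (2, 'Canada');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 40.0), (2, 1, 60.0);
""")

# A typical analytical query: join facts to a dimension and aggregate.
totals = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(totals)
```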

A Normalized Structure: also called Third Normal Form

Third normal form (3NF) is a normal form used in normalizing a database design to reduce the duplication of data and ensure referential integrity. It does so by ensuring that the entity is in second normal form and that no non-key attribute depends transitively on the key (in practice, no non-key attribute is determined by another non-key attribute).
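As a small illustration of the idea, the sketch below starts from a flat order list in which customer_city is determined by customer_id rather than by the key order_id, and splits it into two structures so that each city is stored once. All names and values are hypothetical.

```python
# Flat rows: customer_city depends on customer_id, not directly on order_id,
# so the same city is repeated on every order placed by that customer.
flat_orders = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Hyderabad", "amount": 250},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Hyderabad", "amount": 90},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Pune", "amount": 120},
]

# Normalized form: customer attributes live in their own structure, keyed by customer_id.
customers = {r["customer_id"]: {"customer_city": r["customer_city"]} for r in flat_orders}
orders = [
    {"order_id": r["order_id"], "customer_id": r["customer_id"], "amount": r["amount"]}
    for r in flat_orders
]

# Joining the two structures back together reproduces the original rows.
rejoined = [{**o, **customers[o["customer_id"]]} for o in orders]
print(len(customers), rejoined == flat_orders)
```

After the split, a customer's change of city is a single update in customers instead of one update per order.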

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source and to tune the delay (or cycle time) and data integrity.

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. That is because it contains data that has been scrubbed and massaged to meet the analytics needs, and it also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in providing reporting and visualization analysis.

1) Transaction Processing Systems

This captured data is a primary source of data for BI systems and usually forms the cornerstone around which effective data warehouses are built. Given the need for accuracy and reliability of such systems, data from such sources is also of high fidelity. Transactions cannot be completed and logged with incomplete or inaccurate data (in most cases), and hence data sourced from such systems is among the most reliable. Data errors, if identified down the line, are often amenable to remedy by a suitable mechanism introduced in the OLTP system.

2) Benchmarks and External Data Sources

Where surveys and benchmarking activities are outsourced to specialist providers, we count them as external data sources. BI systems can source data from sources external to the organization. Such data are typically provided by vendors who collect and collate data for the industry across organizations and provide such data for a fee. Examples include:
• Retail Point-of-Sale data from generic brand retail outlets
• Industry data for Salary and Benefits comparison
• Social-media feeds from Twitter, Google Analytics, etc.

Government and financial markets data sources are also available for download and use.

While the richness and the capabilities enabled by such data are beyond question, the quality and reliability of this data prove to be a significant deterrent to its widespread adoption:
• Aligning this data against internal hierarchies and dimensions is fraught with danger. For example, an organization may choose to view the US and Canada as two distinct sales regions, but the retail sales data may tag them collectively under “North America.” Such disconnects could make such data unusable for analytics or limit the scope of analytical models.
• Such data sources are never exhaustive, in that they never truly capture all the market activity.
• Channels that are too new or too small to be captured under this umbrella will be missed. For instance, Point-of-Sale retail data will miss out on sales from small retailers.
• Since various competitors’ products have strengths in different channels, this could introduce a significant bias in the data obtained.

Example: Airline reservation systems (Sabre, Galileo, etc.) provide participating carriers a consolidated data set that is a true record of every single ticket sold through the reservation system. However, this data does not include tickets sold directly through an airline’s web site.
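The hierarchy mismatch described above (internal data split by US and Canada, vendor data tagged only as "North America") can be handled in one direction only: internal figures can be rolled up to the vendor's coarser level, but the vendor's total cannot be split back out. A minimal sketch, with a hypothetical mapping table and invented figures:

```python
from collections import defaultdict

# Hypothetical mapping from the internal hierarchy (country) to the vendor's regions.
COUNTRY_TO_VENDOR_REGION = {
    "US": "North America",
    "Canada": "North America",
    "India": "Asia",
}


def roll_up(internal_sales):
    """Sum internal country-level sales up to the vendor's coarser regions."""
    totals = defaultdict(float)
    for country, amount in internal_sales.items():
        totals[COUNTRY_TO_VENDOR_REGION[country]] += amount
    return dict(totals)


internal = {"US": 120.0, "Canada": 30.0, "India": 50.0}   # our own sales
vendor_market = {"North America": 600.0, "Asia": 200.0}   # external market totals

rolled = roll_up(internal)
share = {region: rolled[region] / vendor_market[region] for region in vendor_market}
print(rolled, share)
```

Market share is available for North America as a whole, but the external data cannot say how the US and Canada differ, which is the limit on analytical scope noted above.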

3) Survey Tools

In those cases where surveying is conducted in-house, BI tools provide data from a variety of survey questionnaires, the most common being Customer Satisfaction (CSAT) Surveys, Employee Feedback Surveys, etc. Survey data is often perceived to be “one-off” and is usually not provisioned for in a BI environment. However, in cases like CSAT, Employee Feedback, etc. that are gathered on a regular basis, it is quite useful to understand changing patterns over time, and hence necessary to include them in a BI environment.

Other types of Survey data include more generic surveys like Salary Surveys, Lifestyle Surveys, etc. that are generally used at aggregated levels to establish broad patterns. The biggest challenge in including Survey data into a data warehouse is the inability to attach them to the common hierarchies. In specific cases (CSAT, Employees, etc.) such a linkage will exist and is easily used.

In the generic case, where responder identities are obfuscated or detail is not captured, such a linkage will have to be created artificially and provisioned for at the time of initiating the survey.

4) Analytical Output

A source of data not generally thought of as a “source” is the output and results of analytics models. Every analytical model generates output (forecasts, predicted probabilities, allocations, etc.) that shares similar characteristics with the data that was used as input to the model. Indeed, model output is often used as input to other models. For instance, the output of a forecasting model is used as input to another model that identifies optimal order sizes, which in turn could serve as an input to a third model that identifies optimal freight assignment, and so on.
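The chain described above (forecast, then order size, then freight assignment) can be sketched with three toy functions, each consuming the previous one's output. The rules inside each function are deliberately simplistic placeholders, not real forecasting or optimization methods, and every parameter is an assumption.

```python
def forecast_demand(history):
    """Toy forecast: the mean of the last three periods."""
    recent = history[-3:]
    return sum(recent) / len(recent)


def order_size(forecast, on_hand, pack_size=10):
    """Toy ordering rule: cover the shortfall, rounded up to whole packs."""
    shortfall = max(0, forecast - on_hand)
    packs = -(-shortfall // pack_size)  # ceiling division
    return int(packs * pack_size)


def freight_assignment(units, truck_capacity=40):
    """Toy freight rule: the number of trucks needed to carry the order."""
    return -(-units // truck_capacity)  # ceiling division


demand = forecast_demand([80, 100, 90, 110])  # output of model 1
units = order_size(demand, on_hand=35)        # model 1 output feeds model 2
trucks = freight_assignment(units)            # model 2 output feeds model 3
print(demand, units, trucks)
```

Because each stage reads the previous stage's output, those intermediate results need to be stored and governed like any other source of data.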

Data Loading

Data loading is the process of copying and loading data or data sets from a source file, folder, or application to a database or similar application.

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analysts. Data repository is a somewhat general term used to refer to a destination designated for data storage.

Most key business processes in the organization are supported by a tool that has a data repository.

A data mart is a subset of a data warehouse oriented to a specific business line. Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example, the sales department.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities.

Enterprise applications leverage a shared repository that is available to all users, while smaller applications use self-contained repositories (Excel, MS Project, etc.). This data has to be captured into a centralized repository or a data warehouse and made available for consumption by business and analytics.

This data is then presented to business users (Presentation) in the form of spreadsheets, reports, dashboards, database views, etc.

In many cases, IT teams work to bring these sets of data into a centralized “data warehouse” that embeds an IT-supported business model to structure the data. This is achieved through two sets of data manipulations:
1. Extract-Transform-Load (ETL) processes that extract data from the source systems, then transform and load them into the data warehouse.
2. Database designs in the forms of tables, views, triggers, and stored procedures.

The database can be called a data warehouse or a data mart, depending on its scope and the ambition of its builders.
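The Extract-Transform-Load steps can be sketched end to end with the Python standard library. The CSV content stands in for a source system, and the sales table, its columns, and the cleaning rules (skip rows with no amount, trim whitespace, upper-case the country code) are assumptions chosen for the example.

```python
import csv
import io
import sqlite3

# Stand-in for a source system export (hypothetical data).
SOURCE_CSV = """order_id,amount,country
1, 100.50 ,us
2,40,CA
3,,us
"""


def extract(text):
    """Extract: read the raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, fix types, standardize country codes."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete row; a real pipeline would log it for follow-up
        clean.append((int(row["order_id"]), float(amount), row["country"].strip().upper()))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount, country FROM sales ORDER BY order_id"
).fetchall()
print(loaded)
```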

5 types of data base models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations; poor data quality is associated with a variety of hard and soft costs. Most organizations struggle to define and implement a formal strategy for addressing DQ problems. Solve data quality woes by adopting a systematic approach for identifying and correcting current issues, then put processes in place to stop data quality problems from resurfacing.

• Data quality is the overall ability of data to fulfill its intended purpose.
• Organizations of all sizes and in many different industries need clean data to operate.
• Poor data quality negatively impacts a wide array of functions and business processes.
• Both the business and IT are responsible for solving data quality issues.

5 issues with Data Quality

• Data quality expert Joseph Juran defines data quality as whether or not data is “fit for intended uses in operations, decision making, and planning.”

• Data quality does not refer to a single problem; it’s an umbrella term referring to a family of different issues.

• Data quality is not a matter of “excellent” vs. “poor”. An organization may excel in some areas of data quality but not in others. For example, an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh.

• Different data quality problems are often interrelated. For example, duplicate data can give rise to data conflicts. Taking steps to fix one problem can have a positive “halo effect” on other problems.

Data Quality

There are five different issues under the DQ umbrella:
• Duplicate Data
• Stale Data
• Incomplete Data
• Invalid Data
• Conflicting Data

Data Duplication
What is it: Multiple copies of the same piece of data.
What causes it:
• Incorrect data entry
• Poor integration
• Faulty database design
What does it impact:
• Wasted storage space
• Ongoing problems with direct sales and/or marketing communications
What can be done about it:
• Data quality tools
• Better integration
• Unique indices for data

Stale Data
What is it: Data being incorrectly used on the assumption that it is current.
What causes it:
• Contacts changing position
• One-time integration with no ongoing delta import
• Data not being available fast enough from source systems
What does it impact:
• Problems with marketing correspondence, leading to lost sales and damaged customer relationships
What can be done about it:
• Establish clear data refresh cycles
• Pull customer information from user-supplied sources such as social networking sites

Incomplete Data
What is it: Key fields are missing or not filled out.
What causes it:
• End user apathy
• Required fields not being enforced
• Poor user interface
What does it impact:
• Missing data can lead to productivity losses and flawed decision-making
What can be done about it:
• End-user training
• Strong data validation
• Easy-to-use interfaces

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns.
What causes it:
• Ineffective or non-existent validation rules
• Data type mismatches between integrated systems
What does it impact:
• Creates integration exception reports which must be investigated
• Interferes with operational reporting
What can be done about it:
• Strong data validation
• Elimination of extraneous use of note fields
• User training

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system.
What causes it:
• No designated system of record
• Poor integration
• Lack of data interchange between systems
What does it impact:
• Data conflicts confuse users
• Wasted time and effort
• Threat of using incorrect data
What can be done about it:
• Tighter system integration
• Data auditing

The five data quality problems are distinct issues, but they may have similar underlying causes.
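To make the five issues concrete, here is a minimal sketch of one simple check per issue. The customer records, the second `BILLING` system, the one-year refresh cycle, and the crude email test are all assumptions made up for illustration; real data quality tools apply far richer rules.

```python
from datetime import date, timedelta

# Hypothetical customer records from a CRM; all values are invented.
CRM = [
    {"id": 1, "email": "ann@example.com", "phone": "555-0100", "updated": date(2024, 1, 10)},
    {"id": 2, "email": "bob@example.com", "phone": None, "updated": date(2021, 3, 5)},
    {"id": 3, "email": "ann@example.com", "phone": "555-0100", "updated": date(2024, 1, 12)},
    {"id": 4, "email": "not-an-email", "phone": "555-0199", "updated": date(2024, 2, 1)},
]
# Hypothetical second system: customer id -> phone number on file.
BILLING = {1: "555-0100", 4: "555-0142"}

def find_duplicates(rows, key="email"):
    """Duplicate data: ids of rows whose key value was already seen."""
    seen, dupes = set(), []
    for r in rows:
        if r[key] in seen:
            dupes.append(r["id"])
        seen.add(r[key])
    return dupes

def find_stale(rows, today, max_age_days=365):
    """Stale data: rows not refreshed within the agreed refresh cycle."""
    cutoff = today - timedelta(days=max_age_days)
    return [r["id"] for r in rows if r["updated"] < cutoff]

def find_incomplete(rows, required=("email", "phone")):
    """Incomplete data: rows with a required field missing or empty."""
    return [r["id"] for r in rows if any(not r.get(f) for f in required)]

def find_invalid(rows):
    """Invalid data: a deliberately crude format check on the email field."""
    return [r["id"] for r in rows if "@" not in (r["email"] or "")]

def find_conflicts(rows, other):
    """Conflicting data: phone differs between this system and another one."""
    return [r["id"] for r in rows
            if r["id"] in other and r["phone"] != other[r["id"]]]

today = date(2024, 3, 1)
report = {
    "duplicate": find_duplicates(CRM),
    "stale": find_stale(CRM, today),
    "incomplete": find_incomplete(CRM),
    "invalid": find_invalid(CRM),
    "conflicting": find_conflicts(CRM, BILLING),
}
```

Note that the conflict check can only report a disagreement; deciding which value wins requires a designated system of record.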

Solution

IT and the business often try to “pass the buck” for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

• Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data. It’s not uncommon for IT to downplay its role in addressing data quality issues.

• However, poor data quality is an endemic problem that often permeates the organization. Individual business units rarely have the resources or authority to unilaterally solve their data quality problems.

• While the business needs to recognize that it is ultimately accountable for data ownership, IT must take a proactive stance on providing solutions and assistance with data quality.

• It’s important to delineate the relationship between IT and the business and specify who is responsible for what. IT should not be taking charge of the data; rather, it should provide tools and assistance with data cleansing.

The business needs to…
• Set policies for matters such as refresh cycles for stale data.
• Determine which systems will be “systems of record” to reduce conflicts.
• Determine access privileges and data validation rights.

IT needs to…
• Advise the business on software tools for improving data quality.
• Provide assistance with major cleansing efforts.
• Provide assistance with database and interface design (e.g., locking down certain fields from end users and setting up data validation).

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the framework. The data repository represents the single biggest asset of the BI framework and is usually seen as the primary driver behind the BI framework.

The data repository comes in several flavors:

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the Warehouse. Data Store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
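The periodic flush of older data from an ODS into a warehouse can be sketched as follows. This is only an illustration under stated assumptions: the table names `ods_orders` and `warehouse_orders`, the 30-day retention window, and the use of in-memory SQLite are all hypothetical.

```python
import sqlite3
from datetime import date, timedelta

def flush_old_rows(conn, today, keep_days=30):
    """Move ODS rows older than keep_days into the warehouse; return rows moved."""
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    with conn:  # one transaction: the copy and the delete succeed or fail together
        conn.execute(
            "INSERT INTO warehouse_orders "
            "SELECT * FROM ods_orders WHERE order_date < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM ods_orders WHERE order_date < ?", (cutoff,)
        ).rowcount
    return moved

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ods_orders (order_id INTEGER, order_date TEXT)")
conn.execute("CREATE TABLE warehouse_orders (order_id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO ods_orders VALUES (?, ?)",
    [(1, "2024-01-05"), (2, "2024-02-25"), (3, "2024-03-01")],
)
conn.commit()

moved = flush_old_rows(conn, today=date(2024, 3, 1))
ods_ids = [r[0] for r in conn.execute("SELECT order_id FROM ods_orders ORDER BY order_id")]
warehouse_ids = [r[0] for r in conn.execute("SELECT order_id FROM warehouse_orders")]
```

Dates are stored as ISO 8601 text so that string comparison matches chronological order, which keeps the cutoff test simple.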

2) Data Warehouse

A data warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it: once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries over larger data sets, where speed and responsiveness are often not the driving factor. A data warehouse is not an essential ingredient of a BI framework; depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists:
• ETL specialists embed data transformations into data flows.
• Data Warehouse designers put in complex database designs as implicit models.
• Business analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis, etc., without upsetting the structure of the warehoused data.

Data Marts are also not essential ingredients of BI infrastructure, but they are recommended as a “best practice” in most BI implementations, as they distribute the management of the Business Analytics Databases. Since it remains the responsibility of the decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, and so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:
• Hierarchical Dimensions and Facts: a Star Schema
• A Normalized Structure, also called Third Normal Form

Hierarchical Dimensions and Facts: a Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
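A small star schema can be sketched with one fact table and two dimension tables. The table names (`fact_sales`, `dim_product`, `dim_store`), the columns, and the sample rows are hypothetical and chosen only to show the shape; SQLite is used as a convenient stand-in for a warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes and hierarchies.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- The fact table holds measures plus a key into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units      INTEGER,
    revenue    REAL
);

INSERT INTO dim_product VALUES (1, 'Tea', 'Beverage'), (2, 'Soap', 'Household');
INSERT INTO dim_store   VALUES (10, 'Toronto', 'Canada'), (20, 'Austin', 'US');
INSERT INTO fact_sales  VALUES (1, 10, 3, 12.0), (2, 10, 1, 4.0), (1, 20, 5, 20.0);
""")

# A typical star-schema query: join the fact table to a dimension and aggregate.
revenue_by_region = conn.execute("""
    SELECT s.region, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
```

Every analytical question follows the same pattern of one join per dimension, which is a large part of why this layout is popular for reporting.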

A Normalized Structure, also called Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring, among other conditions, that the entity is in second normal form.
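As an illustration of how normalization reduces duplication, the sketch below splits a flat order list into two tables. It assumes the standard textbook reading of 3NF, in which an attribute that depends on another non-key attribute (here `customer_city`, which depends on `customer_id` rather than on `order_id`) is moved to its own table. The data and field names are hypothetical.

```python
# Hypothetical denormalized rows: customer_city is repeated on every order
# because it depends on customer_id, not on the key order_id.
FLAT_ORDERS = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Pune", "amount": 50},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Pune", "amount": 75},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Delhi", "amount": 20},
]

def normalize(flat_rows):
    """Split flat rows into a customers table and an orders table."""
    customers = {}
    orders = []
    for row in flat_rows:
        # Each customer's city is now stored exactly once.
        customers[row["customer_id"]] = {"customer_city": row["customer_city"]}
        orders.append({
            "order_id": row["order_id"],
            "customer_id": row["customer_id"],  # foreign key into customers
            "amount": row["amount"],
        })
    return customers, orders

customers, orders = normalize(FLAT_ORDERS)
```

After the split, changing a customer's city means updating one row instead of every order that customer ever placed.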

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source, and to tune the delay (or cycle time) and data integrity.

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. That is because it contains data that has been scrubbed and massaged to meet the analytics needs, and also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in providing reporting and visualization analysis.


2. Benchmarks and External Data Sources

While the richness and the capabilities enabled by such data are beyond question, the quality and reliability of this data prove to be a significant deterrent to its widespread adoption.

• Aligning this data against internal hierarchies and dimensions is fraught with danger. For example, the organization may choose to view the US and Canada as two distinct sales regions, but the retail sales data may tag them collectively under “North America”. Such disconnects could make such data unusable for analytics or limit the scope of analytical models.

• Such data sources are never exhaustive, in that they never truly capture all the market activity. Channels that are too new or too small to be captured under this umbrella will be missed. For instance, Point-of-Sale retail data will miss sales from small retailers. Since various competitors' products have strengths in different channels, this could introduce a significant bias in the data obtained.

Example: Airline reservation systems (Sabre, Galileo, etc.) provide participating carriers a consolidated data set that is a true record of every single ticket sold through the reservation system. However, this data does not include tickets sold directly through an airline's web site.
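The hierarchy mismatch between internal regions and an external vendor's regions (US and Canada versus “North America”) can be sketched as a roll-up to the coarser level. The mapping, the sales figures, and the market sizes below are invented for illustration only.

```python
# Hypothetical mapping from internal sales regions to the vendor's coarser regions.
INTERNAL_TO_VENDOR = {
    "US": "North America",
    "Canada": "North America",
    "Germany": "Europe",
}

internal_sales = {"US": 120.0, "Canada": 30.0, "Germany": 50.0}  # own sales
vendor_market = {"North America": 1000.0, "Europe": 400.0}       # vendor market size

def roll_up(sales_by_region, mapping):
    """Aggregate internal figures up to the vendor's region level."""
    rolled = {}
    for region, value in sales_by_region.items():
        vendor_region = mapping[region]  # a KeyError flags an unmapped region
        rolled[vendor_region] = rolled.get(vendor_region, 0.0) + value
    return rolled

def market_share(sales_by_region, market_by_region, mapping):
    """Share of market, computable only at the vendor's (coarser) grain."""
    rolled = roll_up(sales_by_region, mapping)
    return {region: rolled[region] / market_by_region[region] for region in rolled}

share = market_share(internal_sales, vendor_market, INTERNAL_TO_VENDOR)
```

The comparison works only after internal data is aggregated to the vendor's level, so any analysis that needs the US and Canada separately cannot use this external source.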

3. Survey Tools

In those cases where surveying is conducted in-house, BI tools provide data from a variety of survey questionnaires, the most common being Customer Satisfaction (CSAT) Surveys, Employee Feedback Surveys, etc. Survey data is often perceived to be “one-off” and is usually not provisioned for in a BI environment. However, for surveys like CSAT and Employee Feedback that are gathered on a regular basis, it is quite useful to understand changing patterns over time, and hence necessary to include them in a BI environment.

Other types of survey data include more generic surveys like Salary Surveys, Lifestyle Surveys, etc., that are generally used at aggregated levels to establish broad patterns. The biggest challenge in including survey data in a data warehouse is the inability to attach it to the common hierarchies. In specific cases (CSAT, Employees, etc.) such a linkage will exist and is easily used. In the generic case, where responder identities are obfuscated or detail is not captured, such a linkage will have to be created artificially and provisioned for at the time of initiating the survey.

4. Analytical Output

A source of data not generally thought of as a “source” is the output and results of analytics models. Every analytical model generates output (forecasts, predicted probabilities, allocations, etc.) that shares similar characteristics with the data that was used as input to the model. Indeed, model output is often used as input to other models. For instance, the output of a forecasting model is used as input to another model that identifies optimal order sizes, which in turn could serve as an input to a third model that identifies optimal freight assignment, and so on.
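The chaining of model outputs can be sketched with three toy functions. These are deliberately simplistic stand-ins (a mean forecast, rounding up to whole packs, and a capacity division), not real forecasting or optimization models; the pack size, truck capacity, and demand history are assumptions for illustration.

```python
import math

def forecast_demand(history):
    """Model 1: naive forecast, the mean of past demand."""
    return sum(history) / len(history)

def order_size(forecast, pack_size=10):
    """Model 2: round the forecast up to whole packs."""
    return math.ceil(forecast / pack_size) * pack_size

def trucks_needed(units, truck_capacity=25):
    """Model 3: number of trucks required to ship the order."""
    return math.ceil(units / truck_capacity)

history = [40, 52, 46]                # hypothetical past demand
forecast = forecast_demand(history)   # output of model 1 ...
order = order_size(forecast)          # ... is the input of model 2 ...
trucks = trucks_needed(order)         # ... whose output feeds model 3
```

Because each stage consumes the previous stage's output, an error in the forecast propagates through every later model, which is one reason to store model output in the BI environment with the same care as other sources.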

Data Loading

Data loading is the process of copying and loading data or data sets from a source file, folder, or application to a database or similar application.

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analysts. Data repository is a somewhat general term used to refer to a destination designated for data storage.

Most key business processes in the organization are supported by a tool that has a data repository.

A data mart is a subset of a data warehouse oriented to a specific business line. Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example, the sales department.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities.

Enterprise applications leverage a shared repository that is available to all users, while smaller applications use self-contained repositories (Excel, MS Project, etc.). This data has to be captured into a centralized repository or a data warehouse and made available for consumption by business and analytics.


Page 8: Data Sourcing - CMRCET

2Benchmarks and External Data Sources--such data sources are never exhaustive in that they never trulycapture all the market activity--Channels that are too new or too small to becaptured under this umbrella will be missed--Forinstance Point-of-Sale retail data will miss out sales from small retailers--Since various competitorsproducts have strengths in different channels this could introduce a significant bias in the data obtained

2Benchmarks and External Data Sources--such data sources are never exhaustive in that they never trulycapture all the market activity--Channels that are too new or too small to becaptured under this umbrella will be missed--Forinstance Point-of-Sale retail data will miss out sales from small retailers--Since various competitorsproducts have strengths in different channels this could introduce a significant bias in the data obtained

2Benchmarks and External Data SourcesExample Airline reacuteservation systems (Sabre Galileo etc) provideparticipating carriers a consolidated data set that is a true record of every single ticket sold through the reservation system However this data does not include tickets solddirectly through an airlines web site

3Survey ToolsIn those cases where surveying is conducted in-house BI tools provide data from a variety of survey questionnaires the most common being Customer Satisfaction (CSAT) Surveys Employee Feedback Surveys etc Survey data is often perceived to be ldquoone-offrdquo and is usually not provisioned for in a BI environmentHowever in cases like CSAT Employee Feedback etc that are gathered

on a regular basis it is quite useful to understand changing patterns over time and hence necessary to include them in a BI environment

3Survey ToolsOther types of Survey data include more generic surveys like Salary Surveys Lifestyle Surveys etc that are generally used at aggregated levels to establish broad patterns The biggest challenge in including Survey data into a data warehouse is the inability to attach them to the common hierarchies In specific cases (CSAT Employees etc) such a linkage will exist and is easily used

In the generic case where responder identities are obfuscated or detail is not captured such a linkage will have to be created artificially and provisioned for at the time of initiating thesurvey

4Analytical OutputA source of data not generally thought of as a ldquosourcerdquo is the output and results of analytics models Every analytical model generates outputmdashforecasts predicted probabilities allocations etc that share similar characteristics as thedata that was used as input to the model Yes indeed model output is often used as input to other models For instance output of a forecasting model is used as input to another model that identifies optimal order sizes which in turn could serve as an input to a third model that identifies optimal freight assignment and so on

Data Loading

Data loading is the process of copying and loading data or data sets from a source file folder or application to a database or similar application

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analystsData repository is a somewhat general term used to refer to a destination designated for data storage

Most key business processes in the organization are supported by a tool that has a data repository

A data mart is a subset of a data warehouse oriented to a specific business line Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization for example the sales department

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities

Enterprise applications leverage a shared repository that is available to all users while smaller applications use self-contained repositories (Excel MS Project etc) Example-vid-1This data has to be captured into a centralized repository or a data warehouse and made available for consumption by Business and analytics

This data is then presented to business users (Presentation) in the form of spreadsheets reports dashboards database views etc

In many cases IT teams work to bring these sets of data into a centralized ldquodata warehouserdquo that embeds an IT-supported business model to structure the data This is achieved through two sets of data manipulations1 Extract-Transform-Load (ETL) processes that extract data from the source systems then transform and load them into the data warehouse2 Database designs in the forms of tables views triggers and stored procedures The database can be called a data warehouse or a data mart depending on its scope and the ambition of its builders

5 types of data base models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations poor data quality is associated with a variety of hard and soft costs Most organizations struggle to define and implement a formal strategy for addressing DQ problems Solve data quality woes by adopting a systematic approach for identifying and correcting current issues then put processes in place to stop data quality problems from resurfacingData quality is the overall ability of data to fulfill its intendedpurposeOrganizations of all sizes and in many different industries needclean data to operatePoor data quality negatively impacts a wide array of functions andbusiness processesBoth the business and IT are responsible for solving data qualityissues

5 issues with Data Quality

bull Data quality expert Joseph Juran defines data quality as whether or not data is ldquofit for intended uses in operations decision making and planningrdquo

bull Data quality does not refer to a single problem itrsquos an umbrella term referring to a family of different issues

bull Data quality is not a matter of ldquoexcellentrdquo vs ldquopoorrdquo An organization may excel in some areas of data quality but not in others For example an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh

bull Different data quality problems are often interrelated For example duplicate data can give rise to data conflicts Taking steps to fix one problem can have a positive ldquohalo effectrdquo on other problems

Data Quality

Duplicate Data

Stale Data

Incomplete Data

Invalid Data

Conflicting Data

There are five different issues under the DQ umbrella

Data Duplication
What is it: Multiple copies of the same piece of data
What causes it:
• Incorrect data entry
• Poor integration
• Faulty database design
What does it impact:
• Wasted storage space
• Ongoing problems with direct sales and/or marketing communications
What can be done about it:
• Data quality tools
• Better integration
• Unique indices for data

Stale Data
What is it: Data being incorrectly used on the assumption that it is current
What causes it:
• Contacts changing position
• One-time integration with no ongoing delta import
• Data not being available fast enough from source systems
What does it impact:
• Problems with marketing correspondence, leading to lost sales and damaged customer relationships
What can be done about it:
• Establish clear data refresh cycles
• Pull customer information from user-supplied sources such as social networking sites

Incomplete Data
What is it: Key fields are missing or not filled out
What causes it:
• End user apathy
• Required fields not being enforced
• Poor user interface
What does it impact:
• Missing data can lead to productivity losses and flawed decision-making
What can be done about it:
• End-user training
• Strong data validation
• Easy-to-use interfaces

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns
What causes it:
• Ineffective or non-existent validation rules
• Data type mismatches between integrated systems
What does it impact:
• Creates integration exception reports which must be investigated
• Interferes with operational reporting
What can be done about it:
• Strong data validation
• Elimination of extraneous use of note fields
• User training

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system
What causes it:
• No designated system of record
• Poor integration
• Lack of data interchange between systems
What does it impact:
• Data conflicts confuse users
• Wasted time and effort
• Threat of using incorrect data
What can be done about it:
• Tighter system integration
• Data auditing

The five data quality problems are distinct issues but they may have similar underlying causes
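As a rough illustration of the five issues, the following minimal sketch runs one simple check per issue over a handful of made-up customer records. The record layout, the field names, the second "billing" system, and the thresholds are all assumptions for illustration; real data quality tools apply far richer matching and validation rules.

```python
from datetime import date

# Hypothetical customer records; every field name and value here is illustrative.
crm = [
    {"id": 1, "email": "ana@example.com", "phone": "555-0100", "updated": date(2024, 5, 1)},
    {"id": 2, "email": "ana@example.com", "phone": "555-0100", "updated": date(2024, 5, 1)},
    {"id": 3, "email": "bob@example", "phone": "555-0101", "updated": date(2021, 1, 15)},
    {"id": 4, "email": "", "phone": "555-0102", "updated": date(2024, 4, 20)},
]
# Hypothetical second system holding a phone number per customer id.
billing = {1: "555-0100", 3: "555-0999"}


def find_duplicates(rows, key):
    """Ids of rows whose (non-empty) key value already appeared on an earlier row."""
    seen, dupes = set(), []
    for row in rows:
        if row[key] and row[key] in seen:
            dupes.append(row["id"])
        seen.add(row[key])
    return dupes


def find_stale(rows, today, max_age_days):
    """Ids of rows not refreshed within the assumed refresh cycle."""
    return [r["id"] for r in rows if (today - r["updated"]).days > max_age_days]


def find_incomplete(rows, required):
    """Ids of rows where any required field is missing or empty."""
    return [r["id"] for r in rows if any(not r.get(f) for f in required)]


def find_invalid(rows):
    """Ids of rows whose email fails a deliberately crude format rule."""
    bad = []
    for r in rows:
        domain = r["email"].partition("@")[2]
        if r["email"] and "." not in domain:
            bad.append(r["id"])
    return bad


def find_conflicts(rows, other):
    """Ids of rows whose phone disagrees with the value held by the other system."""
    return [r["id"] for r in rows if r["id"] in other and other[r["id"]] != r["phone"]]


report = {
    "duplicate": find_duplicates(crm, "email"),
    "stale": find_stale(crm, date(2024, 6, 1), 365),
    "incomplete": find_incomplete(crm, ["email", "phone"]),
    "invalid": find_invalid(crm),
    "conflicting": find_conflicts(crm, billing),
}
print(report)
```

Note that record 3 shows up under three headings at once, which is in line with the point that the problems are distinct but often interrelated.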

Solution

IT and the business often try to “pass the buck” for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

• Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data. It’s not uncommon for IT to downplay its role in addressing data quality issues.

• However, poor data quality is an endemic problem that often permeates the organization. Individual business units rarely have the resources or authority to unilaterally solve their data quality problems.

• While the business needs to recognize that it is ultimately accountable for data ownership, IT must take a proactive stance on providing solutions and assistance with data quality.

• It’s important to delineate the relationship between IT and the business and specify who is responsible for what. IT should not be taking charge of the data; rather, it should provide tools and assistance with data cleansing.

The business needs to…

• Set policies for matters such as refresh cycles for stale data
• Determine which systems will be “systems of record” to reduce conflicts
• Determine access privileges and data validation rights

IT needs to…

• Advise the business on software tools for improving data quality
• Provide assistance with major cleansing efforts
• Provide assistance with database and interface design (e.g. locking down certain fields from end users and setting up data validation)
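One way IT can help with database design and data validation is to push basic rules into the database itself. The sketch below is only an illustration using Python's built-in sqlite3 module; the `customer` table, its columns, and the two-letter country rule are assumptions, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE,                     -- required, no duplicates
        country     TEXT NOT NULL CHECK (length(country) = 2) -- crude format rule
    )
""")


def try_insert(email, country):
    """Attempt an insert; return True if accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (email, country) VALUES (?, ?)",
                (email, country),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert("ana@example.com", "IN"),     # accepted
    try_insert("ana@example.com", "IN"),     # rejected: duplicate email
    try_insert(None, "IN"),                  # rejected: required field missing
    try_insert("bob@example.com", "India"),  # rejected: fails the format rule
]
print(results)
```

The unique constraint addresses duplicate data, NOT NULL addresses incomplete data, and the CHECK rule addresses invalid data, so bad rows are stopped at entry rather than cleaned up later.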

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the framework. The data repository represents the single biggest asset of the BI framework and is usually seen as the primary driver behind the BI framework.

The data repository comes in several flavors

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the Warehouse. Data Store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
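The periodic flush from an operational data store into a warehouse can be sketched as follows. This is a minimal illustration with sqlite3; the table names `ods_orders` and `dw_orders`, the sample rows, and the cutoff date are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
    CREATE TABLE dw_orders  (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
    INSERT INTO ods_orders VALUES
        (1, '2024-01-10', 120.0),
        (2, '2024-03-05',  80.0),
        (3, '2024-05-28',  45.5);
""")


def flush_to_warehouse(conn, cutoff_date):
    """Copy ODS rows older than cutoff_date into the warehouse, then remove them from the ODS."""
    with conn:  # single transaction, so rows are never lost between copy and delete
        conn.execute(
            "INSERT INTO dw_orders SELECT * FROM ods_orders WHERE order_date < ?",
            (cutoff_date,),
        )
        cur = conn.execute(
            "DELETE FROM ods_orders WHERE order_date < ?", (cutoff_date,)
        )
    return cur.rowcount


moved = flush_to_warehouse(conn, "2024-04-01")
ods_ids = [r[0] for r in conn.execute("SELECT order_id FROM ods_orders ORDER BY order_id")]
dw_ids = [r[0] for r in conn.execute("SELECT order_id FROM dw_orders ORDER BY order_id")]
print(moved, ods_ids, dw_ids)
```

Dates are stored as ISO 8601 text so that plain string comparison orders them correctly; the ODS keeps only recent rows while the warehouse accumulates history.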

2) Data Warehouse

A data warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it: once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries with larger data sets, where speed and responsiveness are often not the driving factor. A data warehouse is not an essential ingredient for a BI framework; depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists:

• ETL specialists embed data transformations into data flows.
• Data Warehouse designers put in complex database designs as implicit models.
• Business analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis etc. without upsetting the structure of the warehoused data.

Data Marts are also not essential ingredients for BI infrastructure, but get recommended as a “best practice” in most BI implementations as they distribute the management of the Business Analytics Databases. Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:

• Hierarchical Dimensions and Facts (a Star Schema)
• A Normalized Structure (also called Third Normal Form)

Hierarchical Dimensions and Facts (a Star Schema)

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
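A small star schema can be sketched with sqlite3 as one fact table referencing two dimension tables. The table names, columns, and sample figures below are invented for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "who/what/when" of each fact
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

    -- The fact table holds the measures and points at the dimensions
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        units      INTEGER,
        revenue    REAL
    );

    INSERT INTO dim_product VALUES
        (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Lamp', 'Home');
    INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
    INSERT INTO fact_sales VALUES
        (1, 10, 100, 50.0), (2, 10, 20, 60.0), (3, 10, 5, 75.0), (1, 11, 40, 20.0);
""")

# Typical star-schema query: join facts to dimensions, then aggregate by dimension attributes
rows = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()
print(rows)
```

Every analytical question follows the same shape (fact table joined outward to its dimensions), which is a large part of why the layout is popular for reporting.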

A Normalized Structure (also called Third Normal Form)

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that the entity is in second normal form and that no non-key attribute is transitively dependent on the primary key.
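To make the idea of a transitive dependency concrete, the sketch below starts from a flat order list in which the customer's city depends on the customer, not on the order, and splits it into two structures. The field names and values are hypothetical, and plain Python dictionaries stand in for database tables.

```python
# Flat rows with a transitive dependency: order_id -> customer_id -> customer_city.
# The city is repeated on every order placed by the same customer.
flat_orders = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Hyderabad", "amount": 250},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Hyderabad", "amount": 100},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Pune", "amount": 75},
]


def split_out_customers(rows):
    """Move customer attributes into their own table keyed by customer_id."""
    customers = {}
    orders = []
    for row in rows:
        # Each customer's city is stored once, however many orders they place
        customers[row["customer_id"]] = {"customer_city": row["customer_city"]}
        # Orders keep only the key that references the customer
        orders.append({
            "order_id": row["order_id"],
            "customer_id": row["customer_id"],
            "amount": row["amount"],
        })
    return customers, orders


customers, orders = split_out_customers(flat_orders)
print(customers)
print(orders)
```

After the split, a change of city for customer C1 is made in one place, which is the duplication-reducing effect the normal form is aiming for.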

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source and to tune the delay (or cycle time) and data integrity.

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. This is because it contains data that has been scrubbed and massaged to meet the analytics needs, and it also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in providing reporting and visualization analysis.


2. Benchmarks and External Data Sources

• Such data sources are never exhaustive, in that they never truly capture all the market activity.
• Channels that are too new or too small to be captured under this umbrella will be missed.
• For instance, Point-of-Sale retail data will miss out sales from small retailers.
• Since various competitors' products have strengths in different channels, this could introduce a significant bias in the data obtained.

Example: Airline reservation systems (Sabre, Galileo, etc.) provide participating carriers a consolidated data set that is a true record of every single ticket sold through the reservation system. However, this data does not include tickets sold directly through an airline's web site.

3. Survey Tools

In those cases where surveying is conducted in-house, BI tools provide data from a variety of survey questionnaires, the most common being Customer Satisfaction (CSAT) surveys, Employee Feedback surveys, etc. Survey data is often perceived to be “one-off” and is usually not provisioned for in a BI environment. However, survey data such as CSAT and Employee Feedback that is gathered on a regular basis is quite useful for understanding changing patterns over time, and hence it is necessary to include it in a BI environment.

Other types of survey data include more generic surveys like Salary Surveys, Lifestyle Surveys, etc. that are generally used at aggregated levels to establish broad patterns. The biggest challenge in including survey data in a data warehouse is the inability to attach it to the common hierarchies. In specific cases (CSAT, Employees, etc.) such a linkage will exist and is easily used. In the generic case, where responder identities are obfuscated or detail is not captured, such a linkage will have to be created artificially and provisioned for at the time of initiating the survey.

4. Analytical Output

A source of data not generally thought of as a “source” is the output and results of analytics models. Every analytical model generates output (forecasts, predicted probabilities, allocations, etc.) that shares similar characteristics with the data that was used as input to the model. Indeed, model output is often used as input to other models. For instance, the output of a forecasting model is used as input to another model that identifies optimal order sizes, which in turn could serve as an input to a third model that identifies optimal freight assignment, and so on.
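The chaining of model outputs can be sketched with three toy functions, each consuming the previous one's result. The formulas here (a moving average, a safety-stock rule, and a truck count) are simple stand-ins chosen for illustration, not the optimization models the text refers to, and all names and numbers are assumptions.

```python
import math


def forecast_demand(history, window=3):
    """Stand-in forecasting model: mean of the most recent `window` periods."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def order_size(forecast, on_hand, safety_stock):
    """Stand-in ordering model: cover forecast plus safety stock, net of stock on hand."""
    return max(0, math.ceil(forecast + safety_stock - on_hand))


def trucks_needed(units, truck_capacity):
    """Stand-in freight model: number of full or partial truckloads required."""
    return math.ceil(units / truck_capacity)


history = [90, 110, 100, 120, 140]         # past demand per period (made up)
forecast = forecast_demand(history)         # output of model 1
units = order_size(forecast, on_hand=30, safety_stock=20)   # model 2 consumes model 1
trucks = trucks_needed(units, truck_capacity=40)             # model 3 consumes model 2
print(forecast, units, trucks)
```

Because each stage depends on the one before it, storing the intermediate outputs in the BI environment lets later models, and later audits, reuse them instead of recomputing them.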

Data Loading

Data loading is the process of copying and loading data or data sets from a source file, folder, or application to a database or similar application.

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analysts. Data repository is a somewhat general term used to refer to a destination designated for data storage.

Most key business processes in the organization are supported by a tool that has a data repository

A data mart is a subset of a data warehouse oriented to a specific business line. Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example the sales department.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities

Enterprise applications leverage a shared repository that is available to all users, while smaller applications use self-contained repositories (Excel, MS Project, etc.). This data has to be captured into a centralized repository or a data warehouse and made available for consumption by business and analytics.

This data is then presented to business users (Presentation) in the form of spreadsheets, reports, dashboards, database views, etc.

In many cases IT teams work to bring these sets of data into a centralized “data warehouse” that embeds an IT-supported business model to structure the data. This is achieved through two sets of data manipulations:

1. Extract-Transform-Load (ETL) processes that extract data from the source systems, then transform and load them into the data warehouse.
2. Database designs in the form of tables, views, triggers, and stored procedures. The database can be called a data warehouse or a data mart, depending on its scope and the ambition of its builders.
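An Extract-Transform-Load flow can be sketched in a few lines of standard-library Python. The CSV text below stands in for a source-system export, and the column names, the `dw_orders` table, and the rule of rejecting rows with non-numeric amounts are all illustrative assumptions.

```python
import csv
import io
import sqlite3

# Stand-in for a source-system export; a real extract would read a file or query a system.
SOURCE_CSV = """order_id,customer,amount,order_date
1, Ana ,120.50,2024-01-10
2,Bob,not-a-number,2024-01-11
3,Chen,80,2024-01-12
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim names, convert types, and set aside rows that fail conversion."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append((
                int(row["order_id"]),
                row["customer"].strip(),
                float(row["amount"]),
                row["order_date"],
            ))
        except ValueError:
            rejected.append(row)
    return clean, rejected


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dw_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO dw_orders VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
clean, rejected = transform(extract(SOURCE_CSV))
load(conn, clean)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM dw_orders").fetchone()
print(total, len(rejected))
```

Keeping the rejected rows, rather than silently dropping them, gives the data stewardship team something to investigate.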

5 types of data base models

Tables



2)Data WarehouseIn fact a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructureThis is because it creates multiple locations where models must be managed by different specialistsETL specialists embed data transformations into data flows

Data Warehouse designers put in complex database designs as implicit models and Business analytics modelers create their own models on top of these and react to business needs by creating overlaps

3)Data Mart

3)Data MartA Data Mart is a specialized cut from the data warehouse extracted for very specificbusiness needsOwnership of these data marts is typically vested with

the business units The business units can use these marts to create ad-hoc dimensions for specific analysis etc without upsetting the structure of the warehoused data

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4)Data Structuring and Transformationbull Hierarchical Dimensions and Factsmdasha Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts The star schema consists of one or more fact tables referencing any number of dimension tables

4)Data Structuring and Transformationbull A Normalized Structuremdashalso called a Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that The entity is in second normal form

5)Business Analytics Input Databases

5)Business Analytics Input DatabasesThe Business Analytics Input Database is an Operational Data Store into which data from various sources flows The data delivered to this stage must be a faithful representation of data in the source systemsNo additional checks are put in place to validate the data

available at this stage The design of this database needs to address how the data is synchronized with its source and to tune the delay (orcycle time) and data integrity

6)Business Analytics Ready DatabasesA Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built That is because it contains data that has beenscrubbed and massaged to meet the analytics needs and also has access to informationabout data problems that can be used to forestall incorrect execution ofanalytics Examples of such databases include the ubiquitous OLAP9 cube that isvery prevalent in providing reporting and visualization analysis

Page 11: Data Sourcing - CMRCET

3. Survey Tools

Where surveying is conducted in-house, BI tools provide data from a variety of survey questionnaires, the most common being Customer Satisfaction (CSAT) surveys and Employee Feedback surveys. Survey data is often perceived to be "one-off" and is usually not provisioned for in a BI environment. However, surveys such as CSAT and Employee Feedback that are gathered on a regular basis are useful for understanding changing patterns over time, and so should be included in a BI environment.

Other types of survey data include more generic surveys, such as Salary Surveys and Lifestyle Surveys, that are generally used at aggregated levels to establish broad patterns. The biggest challenge in including survey data in a data warehouse is the inability to attach it to the common hierarchies. In specific cases (CSAT, Employees, etc.) such a linkage exists and is easily used.

In the generic case, where responder identities are obfuscated or detail is not captured, such a linkage has to be created artificially and provisioned for at the time of initiating the survey.

4. Analytical Output

A source of data not generally thought of as a "source" is the output of analytics models. Every analytical model generates output (forecasts, predicted probabilities, allocations, etc.) that shares the characteristics of the data used as input to the model. Indeed, model output is often used as input to other models. For instance, the output of a forecasting model is used as input to another model that identifies optimal order sizes, which in turn could serve as input to a third model that identifies optimal freight assignment, and so on.
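The chaining of model outputs described above can be sketched in a few lines. This is only an illustration under stated assumptions: the moving-average forecast, the economic order quantity (EOQ) formula, the truck-count rule, and all function names and numbers are hypothetical stand-ins chosen for the example, not models named in the source.

```python
from math import ceil, sqrt


def forecast_demand(history, window=3):
    """Model 1 (assumed): moving-average forecast over the last `window` periods."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def order_size(forecast, ordering_cost, holding_cost):
    """Model 2 (assumed): economic order quantity, using the forecast as demand."""
    return sqrt(2 * forecast * ordering_cost / holding_cost)


def trucks_needed(units, truck_capacity):
    """Model 3 (assumed): freight assignment reduced to a truck count."""
    return ceil(units / truck_capacity)


# Hypothetical demand history; each model's output feeds the next model.
history = [90, 110, 100, 120, 80]
demand = forecast_demand(history)
quantity = order_size(demand, ordering_cost=50.0, holding_cost=1.0)
trucks = trucks_needed(quantity, truck_capacity=40)
```

Because every intermediate result (`demand`, `quantity`) is itself data, it can be stored and sourced like any other input, which is the point the slide makes.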

Data Loading

Data loading is the process of copying and loading data or data sets from a source file, folder, or application to a database or similar application.

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analysts. A data repository is a somewhat general term for a destination designated for data storage.

Most key business processes in the organization are supported by a tool that has a data repository.

A data mart is a subset of a data warehouse oriented to a specific business line. Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example the sales department.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

A picture archiving and communication system (PACS) is a medical imaging technology that provides economical storage and convenient access to images from multiple modalities.

Enterprise applications leverage a shared repository that is available to all users, while smaller applications use self-contained repositories (Excel, MS Project, etc.). This data has to be captured into a centralized repository or a data warehouse and made available for consumption by the business and by analytics.

This data is then presented to business users (Presentation) in the form of spreadsheets, reports, dashboards, database views, etc.

In many cases, IT teams work to bring these sets of data into a centralized "data warehouse" that embeds an IT-supported business model to structure the data. This is achieved through two sets of data manipulations:
1. Extract-Transform-Load (ETL) processes that extract data from the source systems, then transform and load them into the data warehouse.
2. Database designs in the form of tables, views, triggers, and stored procedures. The database can be called a data warehouse or a data mart, depending on its scope and the ambition of its builders.
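The two manipulations listed above (an ETL process plus a database design of tables and views) can be sketched with Python's standard csv and sqlite3 modules. The source data, the `orders` table, the `daily_sales` view, and the cleaning rules are all hypothetical and exist only to make the steps concrete.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real source would be a file or an OLTP system.
SOURCE_CSV = """order_id,customer,amount,order_date
1, alice ,120.50,2021-10-01
2,BOB,80,2021-10-02
3,carol,,2021-10-03
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, convert types, skip rows with no amount."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # assumed rule: incomplete rows are left out of the load
        clean.append((int(row["order_id"]), row["customer"].strip().title(),
                      float(row["amount"]), row["order_date"]))
    return clean


def load(conn, rows):
    """Load: write rows into a table and expose a summary view."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders ("
                 "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, "
                 "amount REAL NOT NULL, order_date TEXT NOT NULL)")
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE VIEW IF NOT EXISTS daily_sales AS "
                 "SELECT order_date, SUM(amount) AS total "
                 "FROM orders GROUP BY order_date")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT customer, amount FROM orders ORDER BY order_id").fetchall()
```

Keeping extract, transform, and load as separate functions mirrors the ETL naming and makes each step testable on its own.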

5 types of database models [figure: Tables]

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations; poor data quality is associated with a variety of hard and soft costs. Most organizations struggle to define and implement a formal strategy for addressing DQ problems. Solve data quality woes by adopting a systematic approach for identifying and correcting current issues, then put processes in place to stop data quality problems from resurfacing.

• Data quality is the overall ability of data to fulfill its intended purpose.
• Organizations of all sizes and in many different industries need clean data to operate.
• Poor data quality negatively impacts a wide array of functions and business processes.
• Both the business and IT are responsible for solving data quality issues.

5 issues with Data Quality

• Data quality expert Joseph Juran defines data quality as whether or not data is "fit for intended uses in operations, decision making, and planning."
• Data quality does not refer to a single problem; it's an umbrella term referring to a family of different issues.
• Data quality is not a matter of "excellent" vs. "poor." An organization may excel in some areas of data quality but not in others. For example, an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh.
• Different data quality problems are often interrelated. For example, duplicate data can give rise to data conflicts. Taking steps to fix one problem can have a positive "halo effect" on other problems.

There are five different issues under the DQ umbrella: Duplicate Data, Stale Data, Incomplete Data, Invalid Data, and Conflicting Data.

Data Duplication
What is it: Multiple copies of the same piece of data.
What causes it: Incorrect data entry; poor integration; faulty database design.
What does it impact: Wasted storage space; ongoing problems with direct sales and/or marketing communications.
What can be done about it: Data quality tools; better integration; unique indices for data.

Stale Data
What is it: Data being incorrectly used on the assumption that it is current.
What causes it: Contacts changing position; one-time integration with no ongoing delta import; data not being available fast enough from source systems.
What does it impact: Problems with marketing correspondence, leading to lost sales and damaged customer relationships.
What can be done about it: Establish clear data refresh cycles; pull customer information from user-supplied sources such as social networking sites.

Incomplete Data
What is it: Key fields are missing or not filled out.
What causes it: End-user apathy; required fields not being enforced; poor user interface.
What does it impact: Missing data can lead to productivity losses and flawed decision-making.
What can be done about it: End-user training; strong data validation; easy-to-use interfaces.

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns.
What causes it: Ineffective or non-existent validation rules; data type mismatches between integrated systems.
What does it impact: Creates integration exception reports which must be investigated; interferes with operational reporting.
What can be done about it: Strong data validation; elimination of extraneous use of note fields; user training.

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system.
What causes it: No designated system of record; poor integration; lack of data interchange between systems.
What does it impact: Data conflicts confuse users; wasted time and effort; threat of using incorrect data.
What can be done about it: Tighter system integration; data auditing.

The five data quality problems are distinct issues, but they may have similar underlying causes.
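To make the five issues concrete, the sketch below runs one simple check per issue over a small, made-up customer list. The record layout, the `billing` lookup standing in for a second system, the one-year staleness threshold, and the crude email test are all assumptions for illustration; real data quality tools apply far richer rules.

```python
from collections import Counter
from datetime import date

# Hypothetical CRM records and a second (billing) system keyed by customer id.
crm = [
    {"id": 1, "email": "a@example.com", "phone": "555-0100", "updated": date(2021, 9, 1)},
    {"id": 2, "email": "a@example.com", "phone": "555-0100", "updated": date(2021, 9, 1)},
    {"id": 3, "email": "b@example.com", "phone": "", "updated": date(2019, 1, 15)},
    {"id": 4, "email": "not-an-email", "phone": "555-0199", "updated": date(2021, 8, 20)},
]
billing = {1: "a@example.com", 3: "b.old@example.com"}


def duplicates(rows, key):
    """Duplicate data: more than one record shares the same key value."""
    counts = Counter(r[key] for r in rows)
    return [r["id"] for r in rows if counts[r[key]] > 1]


def stale(rows, today, max_age_days):
    """Stale data: not refreshed within the agreed refresh cycle."""
    return [r["id"] for r in rows if (today - r["updated"]).days > max_age_days]


def incomplete(rows, required):
    """Incomplete data: a required field is empty."""
    return [r["id"] for r in rows if any(not r[f] for f in required)]


def invalid(rows):
    """Invalid data: value fails a (deliberately crude) format rule."""
    return [r["id"] for r in rows if "@" not in r["email"]]


def conflicts(rows, other):
    """Conflicting data: two systems disagree about the same customer."""
    return [r["id"] for r in rows
            if r["id"] in other and other[r["id"]] != r["email"]]


report = {
    "duplicate": duplicates(crm, "email"),
    "stale": stale(crm, date(2021, 10, 3), max_age_days=365),
    "incomplete": incomplete(crm, ["email", "phone"]),
    "invalid": invalid(crm),
    "conflicting": conflicts(crm, billing),
}
```

Note that record 3 appears under three headings at once, which illustrates how the issues tend to share underlying causes.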

Solution

IT and the business often try to "pass the buck" for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

• Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data. It's not uncommon for IT to downplay its role in addressing data quality issues.
• However, poor data quality is an endemic problem that often permeates the organization. Individual business units rarely have the resources or authority to unilaterally solve their data quality problems.
• While the business needs to recognize that it is ultimately accountable for data ownership, IT must take a proactive stance on providing solutions and assistance with data quality.
• It's important to delineate the relationship between IT and the business and specify who is responsible for what. IT should not be taking charge of the data; rather, it should provide tools and assistance with data cleansing.

The business needs to…
• Set policies for matters such as refresh cycles for stale data.
• Determine which systems will be "systems of record" to reduce conflicts.
• Determine access privileges and data validation rights.

IT needs to…
• Advise the business on software tools for improving data quality.
• Provide assistance with major cleansing efforts.
• Provide assistance with database and interface design (e.g., locking down certain fields from end users and setting up data validation).
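As one possible form of the database-level validation mentioned above, the sketch below uses SQLite constraints (NOT NULL, UNIQUE, CHECK) through Python's sqlite3 module. The `customer` table, its columns, and the specific rules are hypothetical; they simply show how required fields, unique indices, and format checks can be enforced by the database rather than left to end users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        -- required, unique, and must look roughly like an address
        email   TEXT NOT NULL UNIQUE CHECK (email LIKE '%_@_%'),
        -- required two-letter country code (assumed rule)
        country TEXT NOT NULL CHECK (length(country) = 2)
    )
""")


def try_insert(conn, email, country):
    """Return True if the row passes every constraint, False otherwise."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (email, country) VALUES (?, ?)",
                (email, country))
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(conn, "a@example.com", "IN"),     # valid row
    try_insert(conn, "a@example.com", "IN"),     # duplicate email
    try_insert(conn, None, "IN"),                # missing required field
    try_insert(conn, "bad", "IN"),               # fails the format check
    try_insert(conn, "b@example.com", "India"),  # country is not a 2-letter code
]
```

Only the first insert is accepted; the other four are rejected before bad data reaches reports.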

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the framework. The data repository represents the single biggest asset of the BI framework and is usually seen as the primary driver behind the BI framework.

The data repository comes in several flavors.

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the warehouse. Data store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
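The periodic flush of older data from the ODS into the warehouse might look like the following sketch. The table names `ods_sales` and `dw_sales`, the sample rows, and the cutoff date are assumptions; both tables live in one in-memory SQLite database purely for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_sales (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL);
    CREATE TABLE dw_sales  (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL);
    INSERT INTO ods_sales VALUES
        (1, '2021-07-01', 10.0),
        (2, '2021-09-15', 20.0),
        (3, '2021-10-01', 30.0);
""")


def flush_old_rows(conn, cutoff_date):
    """Move rows older than cutoff_date from the ODS into the warehouse.

    Both statements run in one transaction so a row is never lost
    between the copy and the delete.
    """
    with conn:
        conn.execute(
            "INSERT INTO dw_sales SELECT * FROM ods_sales WHERE sale_date < ?",
            (cutoff_date,))
        cursor = conn.execute(
            "DELETE FROM ods_sales WHERE sale_date < ?", (cutoff_date,))
    return cursor.rowcount


moved = flush_old_rows(conn, "2021-09-01")
ods_ids = [r[0] for r in conn.execute("SELECT sale_id FROM ods_sales ORDER BY sale_id")]
dw_ids = [r[0] for r in conn.execute("SELECT sale_id FROM dw_sales ORDER BY sale_id")]
```

Dates are stored as ISO-8601 text so that plain string comparison orders them correctly.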

2) Data Warehouse

A data warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it; once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries with larger data sets, where speed and responsiveness are often not the driving factor. A data warehouse is not an essential ingredient for a BI framework; depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists: ETL specialists embed data transformations into data flows, data warehouse designers put in complex database designs as implicit models, and business analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis without upsetting the structure of the warehoused data.

Data Marts are also not essential ingredients for BI infrastructure, but they are recommended as a "best practice" in most BI implementations because they distribute the management of the Business Analytics Databases. Since it remains the responsibility of the decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, and so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:
• Hierarchical Dimensions and Facts: a Star Schema
• A Normalized Structure: also called Third Normal Form

Hierarchical Dimensions and Facts: a Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
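A minimal star schema, with one fact table referencing two dimension tables, can be sketched in SQLite as follows. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and sample rows are hypothetical and chosen only to show the shape of the design and a typical query against it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes used to slice the facts.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);

    -- Fact table: numeric measures plus a key into each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        units      INTEGER,
        revenue    REAL
    );

    INSERT INTO dim_product VALUES
        (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Mug', 'Kitchen');
    INSERT INTO dim_date VALUES
        (1, '2021-09-30', '2021-09'), (2, '2021-10-01', '2021-10');
    INSERT INTO fact_sales VALUES
        (1, 1, 10, 15.0), (2, 1, 2, 8.0), (3, 2, 1, 6.0), (1, 2, 4, 6.0);
""")

# Typical star-schema query: join facts to dimensions, then aggregate.
revenue_by_category_month = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
```

Each analysis question becomes a join from the fact table outward to whichever dimensions it needs, which is what keeps star-schema queries simple.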

A Normalized Structure: also called Third Normal Form

Third normal form (3NF) is a normal form used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that:
• The entity is in second normal form.
• No non-key attribute is transitively dependent on the primary key.
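A small worked example may help show what removing a transitive dependency means in practice. In the made-up flat records below, `dept_name` depends on `dept_id`, which in turn depends on the key `emp_id`; splitting the department attributes into their own table removes that dependency. All names and values are illustrative assumptions.

```python
# Flat records: dept_name is repeated for every employee in the department.
flat = [
    {"emp_id": 1, "emp_name": "Asha", "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 2, "emp_name": "Ravi", "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 3, "emp_name": "Mei", "dept_id": 20, "dept_name": "Finance"},
]


def normalize(rows):
    """Split into an employee table and a department table."""
    departments = {}
    employees = []
    for row in rows:
        departments[row["dept_id"]] = row["dept_name"]  # stored once per department
        employees.append({"emp_id": row["emp_id"],
                          "emp_name": row["emp_name"],
                          "dept_id": row["dept_id"]})
    return employees, departments


def denormalize(employees, departments):
    """Join the two tables back together to recover the flat view."""
    return [dict(emp, dept_name=departments[emp["dept_id"]]) for emp in employees]


employees, departments = normalize(flat)
```

After the split, renaming a department is a single update in `departments` instead of one update per employee, and the flat view can still be rebuilt with a join.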

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source, and to tune the delay (or cycle time) and data integrity.
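One common way to keep such a database synchronized is an incremental copy driven by a change marker. The sketch below assumes every source row carries a monotonically increasing `version` number; that field, the function name, and the sample rows are hypothetical. In keeping with the description above, rows are copied as-is with no validation.

```python
def sync_changes(source_rows, target, last_version):
    """Copy rows changed since last_version into target; return the new marker."""
    newest = last_version
    for row in source_rows:
        if row["version"] > last_version:
            target[row["id"]] = dict(row)  # faithful copy, no cleansing
            newest = max(newest, row["version"])
    return newest


# Hypothetical source system and an empty input database.
source = [
    {"id": "A", "value": 10, "version": 1},
    {"id": "B", "value": 20, "version": 2},
]
input_db = {}

# First cycle: everything is new.
marker = sync_changes(source, input_db, last_version=0)

# The source changes between cycles; only the changes are copied next time.
source[0] = {"id": "A", "value": 11, "version": 3}
source.append({"id": "C", "value": None, "version": 4})
marker = sync_changes(source, input_db, marker)
```

How often `sync_changes` runs is the cycle time: a shorter cycle reduces the delay between source and input database at the cost of more frequent loads.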

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. It is suited to this because it contains data that has been scrubbed and massaged to meet the analytics needs, and it also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in reporting and visualization analysis.
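The idea behind an OLAP cube, pre-computing totals for every combination of dimension values, can be sketched in plain Python. The two dimensions (region and item), the `"*"` marker meaning "all values", and the sample facts are assumptions for illustration; production OLAP engines are considerably more elaborate.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # assumed marker meaning "all values of this dimension"

# Hypothetical facts: (region, item, revenue).
facts = [
    ("North", "Pen", 10.0),
    ("North", "Mug", 5.0),
    ("South", "Pen", 7.0),
]


def build_cube(rows):
    """Aggregate revenue for every (region, item) cell, including roll-ups."""
    cube = defaultdict(float)
    for region, item, revenue in rows:
        # Each fact contributes to its own cell and to every roll-up cell.
        for r, i in product((region, ALL), (item, ALL)):
            cube[(r, i)] += revenue
    return dict(cube)


cube = build_cube(facts)
north_total = cube[("North", ALL)]
pen_total = cube[(ALL, "Pen")]
grand_total = cube[(ALL, ALL)]
```

Because the roll-ups are computed once, a report or dashboard can look up any total directly instead of re-scanning the facts.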

Page 12: Data Sourcing - CMRCET

3Survey ToolsOther types of Survey data include more generic surveys like Salary Surveys Lifestyle Surveys etc that are generally used at aggregated levels to establish broad patterns The biggest challenge in including Survey data into a data warehouse is the inability to attach them to the common hierarchies In specific cases (CSAT Employees etc) such a linkage will exist and is easily used

In the generic case where responder identities are obfuscated or detail is not captured such a linkage will have to be created artificially and provisioned for at the time of initiating thesurvey

4Analytical OutputA source of data not generally thought of as a ldquosourcerdquo is the output and results of analytics models Every analytical model generates outputmdashforecasts predicted probabilities allocations etc that share similar characteristics as thedata that was used as input to the model Yes indeed model output is often used as input to other models For instance output of a forecasting model is used as input to another model that identifies optimal order sizes which in turn could serve as an input to a third model that identifies optimal freight assignment and so on

Data Loading

Data loading is the process of copying and loading data or data sets from a source file folder or application to a database or similar application

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analystsData repository is a somewhat general term used to refer to a destination designated for data storage

Most key business processes in the organization are supported by a tool that has a data repository

A data mart is a subset of a data warehouse oriented to a specific business line Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization for example the sales department

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities

Enterprise applications leverage a shared repository that is available to all users while smaller applications use self-contained repositories (Excel MS Project etc) Example-vid-1This data has to be captured into a centralized repository or a data warehouse and made available for consumption by Business and analytics

This data is then presented to business users (Presentation) in the form of spreadsheets reports dashboards database views etc

In many cases IT teams work to bring these sets of data into a centralized ldquodata warehouserdquo that embeds an IT-supported business model to structure the data This is achieved through two sets of data manipulations1 Extract-Transform-Load (ETL) processes that extract data from the source systems then transform and load them into the data warehouse2 Database designs in the forms of tables views triggers and stored procedures The database can be called a data warehouse or a data mart depending on its scope and the ambition of its builders

5 types of data base models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations poor data quality is associated with a variety of hard and soft costs Most organizations struggle to define and implement a formal strategy for addressing DQ problems Solve data quality woes by adopting a systematic approach for identifying and correcting current issues then put processes in place to stop data quality problems from resurfacingData quality is the overall ability of data to fulfill its intendedpurposeOrganizations of all sizes and in many different industries needclean data to operatePoor data quality negatively impacts a wide array of functions andbusiness processesBoth the business and IT are responsible for solving data qualityissues

5 issues with Data Quality

bull Data quality expert Joseph Juran defines data quality as whether or not data is ldquofit for intended uses in operations decision making and planningrdquo

bull Data quality does not refer to a single problem itrsquos an umbrella term referring to a family of different issues

bull Data quality is not a matter of ldquoexcellentrdquo vs ldquopoorrdquo An organization may excel in some areas of data quality but not in others For example an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh

bull Different data quality problems are often interrelated For example duplicate data can give rise to data conflicts Taking steps to fix one problem can have a positive ldquohalo effectrdquo on other problems

Data Quality

Duplicate Data

Stale Data

Incomplete Data

Invalid Data

Conflicting Data

There are five different issues under the DQ umbrella

What is it What causes it What does it impact What can be done about it

Data Duplication Multiple copies of the same piece of data

bull Incorrect data entrybull Poor integrationbull Faulty database design

bull Wasted storage spacebull Ongoing problems with direct

sales andor marketing communications

bull Data quality toolsbull Better integrationbull Unique indices for data

Stale Data Data being incorrectly used on the assumption that it is current

bull Contacts changing positionbull One-time integration with no

ongoing delta importbull Data not being available fast

enough from source systems

bull Problems with marketing correspondence leading to lost sales and damaged customer relationships

bull Establish clear data refresh cycles

bull Pull customer information from user-supplied sources such as social networking sites

Incomplete Data Key fields are missing or not filled out

bull End user apathybull Required fields not being

enforcedbull Poor user interface

bull Missing data can lead to productivity losses and flawed decision-making

bull End-user trainingbull Strong data validationbull Easy-to-use interfaces

Invalid Data The wrong data or poorly formatted data is stored in columns

bull Ineffective or non-existent validation rules

bull Data type mismatches between integrated systems

bull Creates integration exception reports which must be investigated

bull Interferes with operational reporting

bull Strong data validationbull Elimination of extraneous

use of note fieldsbull User training

Data Conflicts Data contained in one system is at odds with data contained in another system

bull No designated system ofrecord

bull Poor integrationbull Lack of data interchange

between systems

bull Data conflicts confuse usersbull Wasted time and effortbull Threat of using incorrect data

bull Tighter system integrationbull Data auditing

The five data quality problems are distinct issues but they may have similar underlying causes

Solution

IT and the business often try to ldquopass the buckrdquo for data quality issues to one another The business must own the data but IT needs to have an active role in offering solutions to help the business address data quality problems

bull Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data Itrsquos not uncommon for IT to downplay its role in addressing data quality issues

bull However poor data quality is an endemic problem that often permeates the organization Individual business units rarely have the resources or authority to unilaterally solve their data quality problems

bull While the business needs to recognize that it is ultimately accountable for data ownership IT must take a proactive stance on providing solutions and assistance with data quality

bull Itrsquos important to delineate the relationship between IT and the business and specify who is responsible for what IT should not be taking charge of the data rather it should provide tools and assistance with data cleansing

Set policies for matters such as refresh cycles for stale data

Determine which systems will be ldquosystems of recordrdquo to reduce conflicts

Determine access privileges and data validation rights

The business needs tohellip

Advise the business on software tools for improving data quality

Provide assistance with major cleansing efforts

Provide assistance with database and interface design (eg locking down certain fields from end users and setting up data validation)

IT needs tohellip

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the frameworkThe data repository represents the

single biggest asset of the BI framework and is usually seen as the primary driverbehind the BI framework

The data repository comes in several flavors

1)Operational Data Store

1)Operational Data StoreAn operational data store (ODS) is a type of database that collects data from multiple sources for processing after which it sends the data to operational systems and data warehousesAn Operational Data Store (ODS) integrates data from disparate sources (through Data Loading) The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency

1)Operational Data StoreAn operational data store may be designed to store only a limited history of data with older data flushed periodically intoa Data Warehouse Such operational data stores are sometimes referred to as Staging Databases since they hold data temporarily before committing it to the Warehouse Data Store structures are optimized for simple queries with the emphasis on speedy retrieval of limited information

2)Data Warehouse

Page 13: Data Sourcing - CMRCET

4. Analytical Output

A source of data not generally thought of as a “source” is the output of analytics models. Every analytical model generates output (forecasts, predicted probabilities, allocations, etc.) that shares the characteristics of the data used as input to the model. Model output is often used as input to other models. For instance, the output of a forecasting model is used as input to another model that identifies optimal order sizes, which in turn could serve as input to a third model that identifies optimal freight assignment, and so on.
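The chaining of models described above can be sketched in a few lines of Python. This is only an illustration: the three functions, the moving-average forecast, the safety stock, and the truck capacity are all assumptions made for the example, not models taken from the slides.

```python
import math

def forecast_demand(history, window=3):
    """Model 1 (assumed): forecast next-period demand as the mean of the last `window` periods."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def order_size(forecast, on_hand, safety_stock=20):
    """Model 2 (assumed): order enough to cover the forecast plus a safety buffer."""
    return max(0, math.ceil(forecast + safety_stock - on_hand))

def trucks_needed(units, truck_capacity=50):
    """Model 3 (assumed): number of trucks required to ship the order."""
    return math.ceil(units / truck_capacity)

history = [100, 120, 110, 130]             # hypothetical weekly unit sales
forecast = forecast_demand(history)        # output of model 1 ...
units = order_size(forecast, on_hand=40)   # ... is the input to model 2 ...
trucks = trucks_needed(units)              # ... whose output feeds model 3
print(forecast, units, trucks)
```

Each intermediate value (`forecast`, `units`) is itself data that would need to be stored and governed like any other source if later models depend on it.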

Data Loading

Data loading is the process of copying and loading data or data sets from a source file, folder, or application to a database or similar application.

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analysts. Data repository is a somewhat general term used to refer to a destination designated for data storage.

Most key business processes in the organization are supported by a tool that has a data repository.

A data mart is a subset of a data warehouse oriented to a specific business line. Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example the sales department.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities.

Enterprise applications leverage a shared repository that is available to all users, while smaller applications use self-contained repositories (Excel, MS Project, etc.). This data has to be captured into a centralized repository or a data warehouse and made available for consumption by the business and by analytics.

This data is then presented to business users (Presentation) in the form of spreadsheets, reports, dashboards, database views, etc.

In many cases, IT teams work to bring these sets of data into a centralized “data warehouse” that embeds an IT-supported business model to structure the data. This is achieved through two sets of data manipulations:
1. Extract-Transform-Load (ETL) processes that extract data from the source systems, then transform and load them into the data warehouse.
2. Database designs in the form of tables, views, triggers, and stored procedures. The database can be called a data warehouse or a data mart depending on its scope and the ambition of its builders.
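As a minimal sketch of an ETL process, the example below uses only the Python standard library. The CSV export, the `sales` table, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source system (e.g. an OLTP sales application).
SOURCE_CSV = """order_id,customer,amount,order_date
1, alice ,19.99,2021-10-01
2,BOB,5.00,2021-10-02
3,carol,,2021-10-02
"""

def extract(text):
    """Extract: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy names, convert types, and skip rows with no amount."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # incomplete record; a real pipeline would log or quarantine it
        clean.append((int(row["order_id"]),
                      row["customer"].strip().title(),
                      float(row["amount"]),
                      row["order_date"]))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute("""CREATE TABLE IF NOT EXISTS sales (
                        order_id   INTEGER PRIMARY KEY,
                        customer   TEXT NOT NULL,
                        amount     REAL NOT NULL,
                        order_date TEXT NOT NULL)""")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY order_id").fetchall()
print(loaded)
```

Note that the transformation rules live inside the data flow itself, which is one reason the logic embedded in ETL has to be managed alongside the database design.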

5 types of database models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations; poor data quality is associated with a variety of hard and soft costs. Most organizations struggle to define and implement a formal strategy for addressing DQ problems. Solve data quality woes by adopting a systematic approach for identifying and correcting current issues, then put processes in place to stop data quality problems from resurfacing.
• Data quality is the overall ability of data to fulfill its intended purpose.
• Organizations of all sizes and in many different industries need clean data to operate.
• Poor data quality negatively impacts a wide array of functions and business processes.
• Both the business and IT are responsible for solving data quality issues.

5 issues with Data Quality

• Data quality expert Joseph Juran defines data quality as whether or not data is “fit for intended uses in operations, decision making and planning.”

• Data quality does not refer to a single problem; it’s an umbrella term referring to a family of different issues.

• Data quality is not a matter of “excellent” vs. “poor.” An organization may excel in some areas of data quality but not in others. For example, an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh.

• Different data quality problems are often interrelated. For example, duplicate data can give rise to data conflicts. Taking steps to fix one problem can have a positive “halo effect” on other problems.

There are five different issues under the DQ umbrella:
• Duplicate Data
• Stale Data
• Incomplete Data
• Invalid Data
• Conflicting Data

For each issue: what is it, what causes it, what does it impact, and what can be done about it?

Data Duplication
What is it: Multiple copies of the same piece of data.
What causes it:
• Incorrect data entry
• Poor integration
• Faulty database design
What does it impact:
• Wasted storage space
• Ongoing problems with direct sales and/or marketing communications
What can be done about it:
• Data quality tools
• Better integration
• Unique indices for data

Stale Data
What is it: Data being incorrectly used on the assumption that it is current.
What causes it:
• Contacts changing position
• One-time integration with no ongoing delta import
• Data not being available fast enough from source systems
What does it impact:
• Problems with marketing correspondence, leading to lost sales and damaged customer relationships
What can be done about it:
• Establish clear data refresh cycles
• Pull customer information from user-supplied sources such as social networking sites

Incomplete Data
What is it: Key fields are missing or not filled out.
What causes it:
• End user apathy
• Required fields not being enforced
• Poor user interface
What does it impact:
• Missing data can lead to productivity losses and flawed decision-making
What can be done about it:
• End-user training
• Strong data validation
• Easy-to-use interfaces

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns.
What causes it:
• Ineffective or non-existent validation rules
• Data type mismatches between integrated systems
What does it impact:
• Creates integration exception reports which must be investigated
• Interferes with operational reporting
What can be done about it:
• Strong data validation
• Elimination of extraneous use of note fields
• User training

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system.
What causes it:
• No designated system of record
• Poor integration
• Lack of data interchange between systems
What does it impact:
• Data conflicts confuse users
• Wasted time and effort
• Threat of using incorrect data
What can be done about it:
• Tighter system integration
• Data auditing

The five data quality problems are distinct issues, but they may have similar underlying causes.
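The five issues described above can each be turned into a simple automated check. The sketch below is only one possible way to do this: the record layout, the field names, the second "billing" system, and the one-year staleness threshold are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical CRM records; field names are illustrative only.
crm = [
    {"id": 1, "email": "ann@example.com", "phone": "555-0100", "updated": date(2021, 9, 1)},
    {"id": 2, "email": "ann@example.com", "phone": "555-0100", "updated": date(2021, 9, 2)},
    {"id": 3, "email": "bob@example.com", "phone": "",         "updated": date(2019, 1, 5)},
    {"id": 4, "email": "not-an-email",    "phone": "555-0101", "updated": date(2021, 8, 20)},
]
# Phone numbers for some of the same customers in a second (hypothetical) billing system.
billing = {1: "555-0100", 4: "555-0199"}

def audit(records, other_phones, today, max_age_days=365):
    """Return the record ids that trip each of the five data quality checks."""
    email_counts = Counter(r["email"] for r in records)
    return {
        # Duplicate: the same email appears on more than one record.
        "duplicate": [r["id"] for r in records if email_counts[r["email"]] > 1],
        # Stale: not refreshed within the agreed refresh cycle.
        "stale": [r["id"] for r in records
                  if (today - r["updated"]).days > max_age_days],
        # Incomplete: a key field is empty.
        "incomplete": [r["id"] for r in records if not r["phone"]],
        # Invalid: the value fails a (deliberately crude) format rule.
        "invalid": [r["id"] for r in records if "@" not in r["email"]],
        # Conflicting: the other system disagrees about the same field.
        "conflicting": [r["id"] for r in records
                        if r["id"] in other_phones
                        and other_phones[r["id"]] != r["phone"]],
    }

report = audit(crm, billing, today=date(2021, 10, 3))
print(report)
```

Checks like these only detect problems; deciding which system is the system of record, and what the refresh cycle should be, remains a policy decision.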

Solution

IT and the business often try to “pass the buck” for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

• Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data. It’s not uncommon for IT to downplay its role in addressing data quality issues.

• However, poor data quality is an endemic problem that often permeates the organization. Individual business units rarely have the resources or authority to unilaterally solve their data quality problems.

• While the business needs to recognize that it is ultimately accountable for data ownership, IT must take a proactive stance on providing solutions and assistance with data quality.

• It’s important to delineate the relationship between IT and the business and specify who is responsible for what. IT should not be taking charge of the data; rather, it should provide tools and assistance with data cleansing.

The business needs to…
• Set policies for matters such as refresh cycles for stale data
• Determine which systems will be “systems of record” to reduce conflicts
• Determine access privileges and data validation rights

IT needs to…
• Advise the business on software tools for improving data quality
• Provide assistance with major cleansing efforts
• Provide assistance with database and interface design (e.g., locking down certain fields from end users and setting up data validation)

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the framework. The data repository represents the single biggest asset of the BI framework and is usually seen as the primary driver behind the BI framework.

The data repository comes in several flavors:

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the Warehouse. Data Store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
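The periodic flush from an operational data store into a warehouse might look like the following sketch. The table names (`ods_orders`, `dw_orders`), the cutoff date, and the use of a single SQLite database for both stores are assumptions made to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
    CREATE TABLE dw_orders  (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
    INSERT INTO ods_orders VALUES
        (1, '2021-06-15', 40.0),
        (2, '2021-09-20', 25.0),
        (3, '2021-10-01', 60.0);
""")

def flush_to_warehouse(conn, cutoff):
    """Copy ODS rows older than `cutoff` into the warehouse, then remove them from the ODS."""
    with conn:  # one transaction: both statements commit together or roll back together
        conn.execute(
            "INSERT INTO dw_orders SELECT * FROM ods_orders WHERE order_date < ?",
            (cutoff,))
        conn.execute("DELETE FROM ods_orders WHERE order_date < ?", (cutoff,))

flush_to_warehouse(conn, cutoff="2021-09-01")  # ISO dates compare correctly as text
ods_ids = [row[0] for row in conn.execute("SELECT order_id FROM ods_orders ORDER BY order_id")]
dw_ids = [row[0] for row in conn.execute("SELECT order_id FROM dw_orders ORDER BY order_id")]
print(ods_ids, dw_ids)
```

Running the copy and the delete in one transaction avoids losing rows if the flush fails partway through.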

2) Data Warehouse

A data warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it: once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries over larger data sets, where speed and responsiveness are often not the driving factor. A data warehouse is not an essential ingredient of a BI framework; depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists:
• ETL specialists embed data transformations into data flows.
• Data Warehouse designers put in complex database designs as implicit models.
• Business analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis, etc., without upsetting the structure of the warehoused data.

Data Marts are also not essential ingredients of a BI infrastructure, but they are recommended as a “best practice” in most BI implementations because they distribute the management of the Business Analytics Databases. Since it remains the responsibility of the decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, and so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:
• Hierarchical Dimensions and Facts (a Star Schema)
• A Normalized Structure (also called Third Normal Form)

• Hierarchical Dimensions and Facts (a Star Schema)

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
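A small star schema can be sketched with SQLite as below. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and sample rows are hypothetical; the point is only that the fact table holds the measures and references each dimension table by key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes used to slice the facts
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
    -- Fact table: numeric measures plus a foreign key to each dimension
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        units      INTEGER,
        revenue    REAL
    );
    INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
    INSERT INTO dim_date VALUES (1, '2021-09-30', '2021-09'), (2, '2021-10-01', '2021-10');
    INSERT INTO fact_sales VALUES (1, 1, 10, 15.0), (1, 2, 4, 6.0), (2, 2, 1, 30.0);
""")

# A typical analytical query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(rows)
```

Every query follows the same shape (one fact table, one join per dimension), which is what makes the star layout convenient for reporting.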

• A Normalized Structure (also called Third Normal Form)

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that the entity is in second normal form and that no non-key attribute depends on another non-key attribute (that is, there are no transitive dependencies on the key).
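To make the idea concrete, the sketch below normalizes a small flat table in plain Python. The data and the `to_3nf` helper are hypothetical; in the flat table the customer's city depends on the customer rather than on the order key, so it is moved into its own table.

```python
# A denormalized table: customer_city is repeated on every order and depends on
# customer_id, not on order_id (a transitive dependency on the key).
orders_flat = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Hyderabad", "amount": 50},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Hyderabad", "amount": 20},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Pune", "amount": 75},
]

def to_3nf(rows):
    """Split the flat rows into a customers table and an orders table."""
    customers = {}   # keyed by customer_id, so each city is stored once
    orders = []
    for row in rows:
        customers[row["customer_id"]] = {"city": row["customer_city"]}
        orders.append({
            "order_id": row["order_id"],
            "customer_id": row["customer_id"],  # foreign key into customers
            "amount": row["amount"],
        })
    return customers, orders

customers, orders = to_3nf(orders_flat)
print(customers)
print(orders)
```

After the split, changing a customer's city means updating one row instead of every order for that customer.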

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of the data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source, and to tune the delay (or cycle time) and data integrity.

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. It contains data that has been scrubbed and massaged to meet the analytics needs, and it also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in reporting and visualization analysis.
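As a toy illustration of the idea behind an OLAP cube, the sketch below pre-aggregates a measure for every combination of dimensions so that any roll-up can be read directly. The facts, the three dimensions, and the `"ALL"` marker are assumptions for the example; real OLAP engines store and index cubes far more efficiently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cleaned sales facts: (region, product, quarter, revenue)
facts = [
    ("North", "Pen", "Q1", 100.0),
    ("North", "Lamp", "Q1", 250.0),
    ("South", "Pen", "Q1", 80.0),
    ("South", "Pen", "Q2", 120.0),
]
DIMENSIONS = ("region", "product", "quarter")

def build_cube(rows):
    """Pre-aggregate revenue for every subset of dimensions."""
    cube = defaultdict(float)
    n = len(DIMENSIONS)
    for *coords, revenue in rows:
        for size in range(n + 1):
            for kept in combinations(range(n), size):
                # "ALL" marks a dimension that has been rolled up.
                key = tuple(coords[i] if i in kept else "ALL" for i in range(n))
                cube[key] += revenue
    return dict(cube)

cube = build_cube(facts)
print(cube[("ALL", "ALL", "ALL")])    # grand total
print(cube[("South", "Pen", "ALL")])  # South region, pens, all quarters
print(cube[("ALL", "Pen", "Q1")])     # all regions, pens, Q1
```

Because the aggregates are computed once up front, reporting and visualization tools can answer slice-and-dice questions with simple lookups.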

Page 15: Data Sourcing - CMRCET

Data loading is the process of copying and loading data or data sets from a source file folder or application to a database or similar application

Data loading from internal and external databases is used to bring data from multiple disparate repositories into a location that can be shared by many analystsData repository is a somewhat general term used to refer to a destination designated for data storage

Most key business processes in the organization are supported by a tool that has a data repository

A data mart is a subset of a data warehouse oriented to a specific business line Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization for example the sales department

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities

Enterprise applications leverage a shared repository that is available to all users while smaller applications use self-contained repositories (Excel MS Project etc) Example-vid-1This data has to be captured into a centralized repository or a data warehouse and made available for consumption by Business and analytics

This data is then presented to business users (Presentation) in the form of spreadsheets reports dashboards database views etc

In many cases IT teams work to bring these sets of data into a centralized ldquodata warehouserdquo that embeds an IT-supported business model to structure the data This is achieved through two sets of data manipulations1 Extract-Transform-Load (ETL) processes that extract data from the source systems then transform and load them into the data warehouse2 Database designs in the forms of tables views triggers and stored procedures The database can be called a data warehouse or a data mart depending on its scope and the ambition of its builders

5 types of data base models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations poor data quality is associated with a variety of hard and soft costs Most organizations struggle to define and implement a formal strategy for addressing DQ problems Solve data quality woes by adopting a systematic approach for identifying and correcting current issues then put processes in place to stop data quality problems from resurfacingData quality is the overall ability of data to fulfill its intendedpurposeOrganizations of all sizes and in many different industries needclean data to operatePoor data quality negatively impacts a wide array of functions andbusiness processesBoth the business and IT are responsible for solving data qualityissues

5 issues with Data Quality

bull Data quality expert Joseph Juran defines data quality as whether or not data is ldquofit for intended uses in operations decision making and planningrdquo

bull Data quality does not refer to a single problem itrsquos an umbrella term referring to a family of different issues

bull Data quality is not a matter of ldquoexcellentrdquo vs ldquopoorrdquo An organization may excel in some areas of data quality but not in others For example an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh

bull Different data quality problems are often interrelated For example duplicate data can give rise to data conflicts Taking steps to fix one problem can have a positive ldquohalo effectrdquo on other problems

Data Quality

Duplicate Data

Stale Data

Incomplete Data

Invalid Data

Conflicting Data

There are five different issues under the DQ umbrella

For each issue: what is it, what causes it, what does it impact, and what can be done about it?

Data Duplication
What is it: Multiple copies of the same piece of data.
What causes it:
• Incorrect data entry
• Poor integration
• Faulty database design
What does it impact:
• Wasted storage space
• Ongoing problems with direct sales and/or marketing communications
What can be done about it:
• Data quality tools
• Better integration
• Unique indices for data

Stale Data
What is it: Data being incorrectly used on the assumption that it is current.
What causes it:
• Contacts changing position
• One-time integration with no ongoing delta import
• Data not being available fast enough from source systems
What does it impact:
• Problems with marketing correspondence, leading to lost sales and damaged customer relationships
What can be done about it:
• Establish clear data refresh cycles
• Pull customer information from user-supplied sources such as social networking sites

Incomplete Data
What is it: Key fields are missing or not filled out.
What causes it:
• End user apathy
• Required fields not being enforced
• Poor user interface
What does it impact:
• Missing data can lead to productivity losses and flawed decision-making
What can be done about it:
• End-user training
• Strong data validation
• Easy-to-use interfaces

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns.
What causes it:
• Ineffective or non-existent validation rules
• Data type mismatches between integrated systems
What does it impact:
• Creates integration exception reports which must be investigated
• Interferes with operational reporting
What can be done about it:
• Strong data validation
• Elimination of extraneous use of note fields
• User training

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system.
What causes it:
• No designated system of record
• Poor integration
• Lack of data interchange between systems
What does it impact:
• Data conflicts confuse users
• Wasted time and effort
• Threat of using incorrect data
What can be done about it:
• Tighter system integration
• Data auditing

The five data quality problems are distinct issues, but they may have similar underlying causes.
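As a rough illustration of how four of these issues could be detected in practice, the sketch below profiles a handful of customer records. Everything in it is an assumption made for the example: the field names, the sample values, the choice of email as the duplicate key, the one-year refresh cycle, and the very crude email format check.

```python
from datetime import date

# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"id": 1, "email": "ann@example.com", "phone": "555-0100", "updated": date(2024, 5, 1)},
    {"id": 2, "email": "ann@example.com", "phone": "555-0100", "updated": date(2024, 5, 2)},
    {"id": 3, "email": "", "phone": "555-0101", "updated": date(2024, 4, 20)},
    {"id": 4, "email": "bob-at-example", "phone": "555-0102", "updated": date(2021, 1, 15)},
]

REQUIRED_FIELDS = ("email", "phone")  # assumed "key fields"
MAX_AGE_DAYS = 365                    # assumed refresh cycle


def find_duplicates(rows):
    """Return ids of rows whose email already appeared on an earlier row."""
    seen, dupes = set(), []
    for row in rows:
        key = row["email"].strip().lower()
        if key and key in seen:
            dupes.append(row["id"])
        seen.add(key)
    return dupes


def find_incomplete(rows):
    """Return ids of rows with at least one empty required field."""
    return [r["id"] for r in rows if any(not r[f] for f in REQUIRED_FIELDS)]


def find_invalid(rows):
    """Very rough format check: a filled-in email must contain '@'."""
    return [r["id"] for r in rows if r["email"] and "@" not in r["email"]]


def find_stale(rows, today):
    """Return ids of rows not updated within the assumed refresh cycle."""
    return [r["id"] for r in rows if (today - r["updated"]).days > MAX_AGE_DAYS]


today = date(2024, 6, 1)
report = {
    "duplicate": find_duplicates(records),
    "incomplete": find_incomplete(records),
    "invalid": find_invalid(records),
    "stale": find_stale(records, today),
}
print(report)
```

Record 4 appears under two headings (invalid and stale), which is one small way of seeing that the problems are distinct but can show up together in the same data.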

Solution

IT and the business often try to “pass the buck” for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

• Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data. It’s not uncommon for IT to downplay its role in addressing data quality issues.

• However, poor data quality is an endemic problem that often permeates the organization. Individual business units rarely have the resources or authority to unilaterally solve their data quality problems.

• While the business needs to recognize that it is ultimately accountable for data ownership, IT must take a proactive stance on providing solutions and assistance with data quality.

• It’s important to delineate the relationship between IT and the business and specify who is responsible for what. IT should not be taking charge of the data; rather, it should provide tools and assistance with data cleansing.

The business needs to…
• Set policies for matters such as refresh cycles for stale data
• Determine which systems will be “systems of record” to reduce conflicts
• Determine access privileges and data validation rights

IT needs to…
• Advise the business on software tools for improving data quality
• Provide assistance with major cleansing efforts
• Provide assistance with database and interface design (e.g., locking down certain fields from end users and setting up data validation)
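To make the “system of record” idea concrete, here is a minimal sketch of how a business-set policy could drive conflict resolution between two systems. The system names (crm, erp), the fields, and the ownership mapping are hypothetical; the point is only that the policy is data the business decides, while the tooling that applies it can come from IT.

```python
# Hypothetical policy decided by the business: which system owns which field.
SYSTEM_OF_RECORD = {
    "email": "crm",
    "billing_address": "erp",
}

# Hypothetical copies of the same customer held in two systems.
crm = {"C001": {"email": "ann@example.com", "billing_address": "1 Old Road"}}
erp = {"C001": {"email": "ann@old-mail.example", "billing_address": "9 New Street"}}
systems = {"crm": crm, "erp": erp}


def reconcile(customer_id):
    """Build one record, taking each field from its system of record,
    and list the fields on which the systems disagree (for auditing)."""
    merged, conflicts = {}, []
    for field, owner in SYSTEM_OF_RECORD.items():
        values = {name: data[customer_id][field] for name, data in systems.items()}
        if len(set(values.values())) > 1:
            conflicts.append(field)
        merged[field] = values[owner]
    return merged, conflicts


merged, conflicts = reconcile("C001")
print(merged)
print(conflicts)
```

The list of conflicting fields is kept rather than discarded so that it can feed a data audit.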

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the framework. The data repository represents the single biggest asset of the BI framework and is usually seen as the primary driver behind the BI framework.

The data repository comes in several flavors.

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the Warehouse. Data Store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
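The periodic flush from an ODS into a warehouse can be sketched as follows. This is only an in-memory illustration: the 30-day retention window, the order fields, and the function name are assumptions, and a real ODS would do this with database jobs rather than Python lists.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # assumed retention window for the ODS

# Hypothetical ODS contents and an initially empty warehouse.
ods = [
    {"order_id": 1, "order_date": date(2024, 3, 1), "amount": 120.0},
    {"order_id": 2, "order_date": date(2024, 5, 20), "amount": 75.5},
    {"order_id": 3, "order_date": date(2024, 5, 30), "amount": 42.0},
]
warehouse = []


def flush_old_rows(ods_rows, warehouse_rows, today):
    """Move rows older than the retention window from the ODS to the warehouse.

    Returns the rows that stay in the ODS; warehouse_rows is appended to in place.
    """
    cutoff = today - timedelta(days=RETENTION_DAYS)
    keep = []
    for row in ods_rows:
        if row["order_date"] < cutoff:
            warehouse_rows.append(row)  # committed for long-term storage
        else:
            keep.append(row)            # recent enough to stay in the ODS
    return keep


ods = flush_old_rows(ods, warehouse, today=date(2024, 6, 1))
print([r["order_id"] for r in ods])
print([r["order_id"] for r in warehouse])
```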

2) Data Warehouse

A Data Warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it: once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries over larger data sets, where speed and responsiveness are often not the driving factor. A data warehouse is not an essential ingredient for a BI framework; depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists: ETL specialists embed data transformations into data flows, Data Warehouse designers put in complex database designs as implicit models, and Business Analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis, etc., without upsetting the structure of the warehoused data.

Data Marts are also not essential ingredients for BI infrastructure, but they get recommended as a “best practice” in most BI implementations because they distribute the management of the Business Analytics Databases. Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, and so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:
• Hierarchical Dimensions and Facts (a Star Schema)
• A Normalized Structure (also called Third Normal Form)

Hierarchical Dimensions and Facts (a Star Schema)

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
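A star schema is easiest to picture with a tiny example. The sketch below uses Python's built-in sqlite3 module with one fact table and two dimension tables; the table names, columns, and sample rows are invented for illustration and do not come from the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who/what/when" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);

-- The fact table holds the measures and references each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Furniture');
INSERT INTO dim_date VALUES (10, '2024-01-15', '2024-01'), (11, '2024-02-03', '2024-02');
INSERT INTO fact_sales VALUES (1, 10, 5, 10.0), (1, 11, 2, 4.0), (2, 11, 1, 30.0);
""")

# A typical analytical query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(rows)
```

The fact table sits at the centre and every query reaches a dimension in a single join, which is where the “star” shape comes from.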

A Normalized Structure (also called Third Normal Form)

Third normal form (3NF) is a normal form used in normalizing a database design to reduce the duplication of data and ensure referential integrity. It does so by ensuring that the entity is in second normal form and that no non-key attribute depends transitively on the primary key.
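A small worked example may help show what normalization removes. In the hypothetical flat table below, the customer's city depends on the customer rather than directly on the order, so it is repeated on every order; splitting it into its own table stores each city once. The column names and values are assumptions made for the sketch.

```python
# Denormalized rows: customer_city depends on customer_id, not directly on the
# key order_id (a transitive dependency), so the city repeats for every order.
orders_flat = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Pune", "amount": 100},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Pune", "amount": 250},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Delhi", "amount": 80},
]

# Normalized form: one table per entity, linked by customer_id.
customers = {}
orders = []
for row in orders_flat:
    customers[row["customer_id"]] = {"city": row["customer_city"]}
    orders.append(
        {"order_id": row["order_id"], "customer_id": row["customer_id"], "amount": row["amount"]}
    )


def city_for_order(order_id):
    """Look the city up through the customer instead of storing it per order."""
    order = next(o for o in orders if o["order_id"] == order_id)
    return customers[order["customer_id"]]["city"]


print(customers)
print(city_for_order(2))
```

After the split, a change of city is made in one place only, which is the reduction in duplication the definition refers to.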

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source, and to tune the delay (or cycle time) and data integrity.
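One common way to keep such a database synchronized is an incremental (delta) import driven by a last-modified timestamp. The sketch below is an assumption about how that might look, not a description of any particular tool: the row layout, the `modified` column, and the watermark approach are all chosen for illustration. Rows are copied unchanged, in keeping with the requirement that this stage mirror the source faithfully.

```python
from datetime import datetime

# Hypothetical source rows carrying a last-modified timestamp.
source = [
    {"id": 1, "name": "Ann", "modified": datetime(2024, 6, 1, 9, 0)},
    {"id": 2, "name": "Bob", "modified": datetime(2024, 6, 1, 12, 30)},
    {"id": 3, "name": "Cy", "modified": datetime(2024, 6, 2, 8, 15)},
]
input_db = {}  # stand-in for the input database, keyed by id


def delta_import(source_rows, target, last_sync):
    """Copy rows changed after last_sync into target; return the new watermark."""
    watermark = last_sync
    for row in source_rows:
        if row["modified"] > last_sync:
            target[row["id"]] = dict(row)  # faithful copy, no transformation
            watermark = max(watermark, row["modified"])
    return watermark


watermark = delta_import(source, input_db, last_sync=datetime(2024, 6, 1, 10, 0))
print(sorted(input_db))
print(watermark)
```

How often `delta_import` runs is the cycle time mentioned above: a shorter cycle means fresher data at the cost of more load on the source systems.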

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. It is suited to this because it contains data that has been scrubbed and massaged to meet the analytics needs, and it also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in providing reporting and visualization analysis.
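As a loose illustration of the cube idea, the sketch below pre-aggregates a measure for every combination of two dimensions and skips rows that carry a recorded data problem. It is a toy, not how an OLAP engine is implemented; the dimension names, the sample rows, and the `quality_ok` flag are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cleaned sales rows; "quality_ok" stands in for recorded data-problem flags.
sales = [
    {"region": "North", "product": "Pen", "revenue": 10.0, "quality_ok": True},
    {"region": "North", "product": "Lamp", "revenue": 30.0, "quality_ok": True},
    {"region": "South", "product": "Pen", "revenue": 5.0, "quality_ok": True},
    {"region": "South", "product": "Lamp", "revenue": 99.0, "quality_ok": False},
]
DIMENSIONS = ("region", "product")


def build_cube(rows):
    """Pre-aggregate revenue for every subset of the dimensions (a tiny roll-up cube)."""
    cube = defaultdict(float)
    for row in rows:
        if not row["quality_ok"]:
            continue  # skip rows with known data problems
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = tuple((d, row[d]) for d in dims)
                cube[key] += row["revenue"]
    return dict(cube)


cube = build_cube(sales)
print(cube[()])                      # grand total over all usable rows
print(cube[(("region", "North"),)])  # total for one region
```

Because the totals are computed ahead of time, a report or dashboard can read any slice with a single lookup.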


A data mart is a subset of a data warehouse oriented to a specific business line. Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example the sales department.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

A picture archiving and communication system (PACS) is a medical imaging technology which provides economical storage and convenient access to images from multiple modalities.

Enterprise applications leverage a shared repository that is available to all users, while smaller applications use self-contained repositories (Excel, MS Project, etc.). This data has to be captured into a centralized repository or a data warehouse and made available for consumption by business and analytics.

This data is then presented to business users (Presentation) in the form of spreadsheets, reports, dashboards, database views, etc.

In many cases, IT teams work to bring these sets of data into a centralized “data warehouse” that embeds an IT-supported business model to structure the data. This is achieved through two sets of data manipulations:
1. Extract-Transform-Load (ETL) processes that extract data from the source systems, then transform and load them into the data warehouse
2. Database designs in the forms of tables, views, triggers, and stored procedures

The database can be called a data warehouse or a data mart, depending on its scope and the ambition of its builders.
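The two manipulations can be seen side by side in a minimal sketch: a transform step that cleans incoming rows, and a table design that receives them. It uses Python's built-in sqlite3 module as a stand-in for a warehouse; the source rows, the `orders` table, and the cleaning rules are assumptions made for the example.

```python
import sqlite3

# Extract: rows as they might arrive from a hypothetical source system (all text).
source_rows = [
    ("1001", " Ann ", "2024-01-15", "19.99"),
    ("1002", "BOB", "2024-01-16", "5.00"),
]


def transform(row):
    """Cast types and tidy text so the row fits the warehouse table."""
    order_id, customer, order_date, amount = row
    return int(order_id), customer.strip().title(), order_date, float(amount)


# Database design: the target table that structures the loaded data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "order_date TEXT, amount REAL)"
)

# Load: insert the transformed rows into the warehouse table.
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)", [transform(r) for r in source_rows]
)
warehouse.commit()

loaded = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```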

5 types of database models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations; poor data quality is associated with a variety of hard and soft costs. Most organizations struggle to define and implement a formal strategy for addressing DQ problems. Solve data quality woes by adopting a systematic approach for identifying and correcting current issues, then put processes in place to stop data quality problems from resurfacing.

• Data quality is the overall ability of data to fulfill its intended purpose.
• Organizations of all sizes and in many different industries need clean data to operate.
• Poor data quality negatively impacts a wide array of functions and business processes.
• Both the business and IT are responsible for solving data quality issues.

5 issues with Data Quality

• Data quality expert Joseph Juran defines data quality as whether or not data is “fit for intended uses in operations, decision making and planning.”

• Data quality does not refer to a single problem; it’s an umbrella term referring to a family of different issues.



Page 18: Data Sourcing - CMRCET

Enterprise applications leverage a shared repository that is available to all users while smaller applications use self-contained repositories (Excel MS Project etc) Example-vid-1This data has to be captured into a centralized repository or a data warehouse and made available for consumption by Business and analytics

This data is then presented to business users (Presentation) in the form of spreadsheets reports dashboards database views etc

In many cases IT teams work to bring these sets of data into a centralized ldquodata warehouserdquo that embeds an IT-supported business model to structure the data This is achieved through two sets of data manipulations1 Extract-Transform-Load (ETL) processes that extract data from the source systems then transform and load them into the data warehouse2 Database designs in the forms of tables views triggers and stored procedures The database can be called a data warehouse or a data mart depending on its scope and the ambition of its builders

5 types of data base models

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations poor data quality is associated with a variety of hard and soft costs Most organizations struggle to define and implement a formal strategy for addressing DQ problems Solve data quality woes by adopting a systematic approach for identifying and correcting current issues then put processes in place to stop data quality problems from resurfacingData quality is the overall ability of data to fulfill its intendedpurposeOrganizations of all sizes and in many different industries needclean data to operatePoor data quality negatively impacts a wide array of functions andbusiness processesBoth the business and IT are responsible for solving data qualityissues

5 issues with Data Quality

• Data quality expert Joseph Juran defines data quality as whether or not data is “fit for intended uses in operations, decision making and planning.”

• Data quality does not refer to a single problem; it’s an umbrella term referring to a family of different issues.

• Data quality is not a matter of “excellent” vs. “poor.” An organization may excel in some areas of data quality but not in others. For example, an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh.

• Different data quality problems are often interrelated. For example, duplicate data can give rise to data conflicts. Taking steps to fix one problem can have a positive “halo effect” on other problems.

There are five different issues under the DQ umbrella:

• Duplicate Data
• Stale Data
• Incomplete Data
• Invalid Data
• Conflicting Data

For each issue: what is it, what causes it, what does it impact, and what can be done about it?

Data Duplication
What is it: Multiple copies of the same piece of data.
What causes it:
• Incorrect data entry
• Poor integration
• Faulty database design
What does it impact:
• Wasted storage space
• Ongoing problems with direct sales and/or marketing communications
What can be done about it:
• Data quality tools
• Better integration
• Unique indices for data

Stale Data
What is it: Data being incorrectly used on the assumption that it is current.
What causes it:
• Contacts changing position
• One-time integration with no ongoing delta import
• Data not being available fast enough from source systems
What does it impact:
• Problems with marketing correspondence, leading to lost sales and damaged customer relationships
What can be done about it:
• Establish clear data refresh cycles
• Pull customer information from user-supplied sources such as social networking sites

Incomplete Data
What is it: Key fields are missing or not filled out.
What causes it:
• End user apathy
• Required fields not being enforced
• Poor user interface
What does it impact:
• Missing data can lead to productivity losses and flawed decision-making
What can be done about it:
• End-user training
• Strong data validation
• Easy-to-use interfaces

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns.
What causes it:
• Ineffective or non-existent validation rules
• Data type mismatches between integrated systems
What does it impact:
• Creates integration exception reports which must be investigated
• Interferes with operational reporting
What can be done about it:
• Strong data validation
• Elimination of extraneous use of note fields
• User training

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system.
What causes it:
• No designated system of record
• Poor integration
• Lack of data interchange between systems
What does it impact:
• Data conflicts confuse users
• Wasted time and effort
• Threat of using incorrect data
What can be done about it:
• Tighter system integration
• Data auditing

The five data quality problems are distinct issues, but they may have similar underlying causes.
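The five issues can be made concrete with a small checking routine. This is only a sketch: the record layout, the second system (`BILLING`), the one-year staleness threshold and the very crude e-mail test are all assumptions chosen for illustration, not rules taken from the slides.

```python
from datetime import date

# Hypothetical customer records from one system (all values are made up).
CRM = [
    {"id": 1, "email": "a@example.com", "phone": "555-0100", "updated": date(2024, 1, 10)},
    {"id": 1, "email": "a@example.com", "phone": "555-0100", "updated": date(2024, 1, 10)},
    {"id": 2, "email": "", "phone": "555-0101", "updated": date(2020, 3, 1)},
    {"id": 3, "email": "not-an-email", "phone": "555-0102", "updated": date(2024, 2, 1)},
]
BILLING = {1: "555-0100", 3: "555-9999"}  # id -> phone held by a second system


def check_quality(records, other_system, today, max_age_days=365):
    """Return the ids flagged under each of the five data quality issues."""
    issues = {"duplicate": set(), "stale": set(), "incomplete": set(),
              "invalid": set(), "conflict": set()}
    seen = set()
    for r in records:
        key = (r["id"], r["email"], r["phone"])
        if key in seen:  # an identical record was already seen
            issues["duplicate"].add(r["id"])
        seen.add(key)
        if (today - r["updated"]).days > max_age_days:  # not refreshed recently
            issues["stale"].add(r["id"])
        if not r["email"] or not r["phone"]:  # a key field is empty
            issues["incomplete"].add(r["id"])
        elif "@" not in r["email"]:  # present but badly formatted
            issues["invalid"].add(r["id"])
        if r["id"] in other_system and other_system[r["id"]] != r["phone"]:
            issues["conflict"].add(r["id"])  # the two systems disagree
    return issues


report = check_quality(CRM, BILLING, today=date(2024, 6, 1))
```

Here record 1 is flagged as a duplicate, record 2 as both stale and incomplete, and record 3 as both invalid and conflicting. One record can fall under several issues at once, which is how these problems tend to overlap in practice.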

Solution

IT and the business often try to “pass the buck” for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

• Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data. It’s not uncommon for IT to downplay its role in addressing data quality issues.

• However, poor data quality is an endemic problem that often permeates the organization. Individual business units rarely have the resources or authority to unilaterally solve their data quality problems.

• While the business needs to recognize that it is ultimately accountable for data ownership, IT must take a proactive stance on providing solutions and assistance with data quality.

• It’s important to delineate the relationship between IT and the business and specify who is responsible for what. IT should not be taking charge of the data; rather, it should provide tools and assistance with data cleansing.

The business needs to:

• Set policies for matters such as refresh cycles for stale data
• Determine which systems will be “systems of record” to reduce conflicts
• Determine access privileges and data validation rights

IT needs to:

• Advise the business on software tools for improving data quality
• Provide assistance with major cleansing efforts
• Provide assistance with database and interface design (e.g. locking down certain fields from end users and setting up data validation)

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the framework. The data repository represents the single biggest asset of the BI framework and is usually seen as the primary driver behind the BI framework.

The data repository comes in several flavors:

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the Warehouse. Data Store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
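The periodic flush of older data from the operational data store into the warehouse might look like the following sketch. The table names `ods_sales` and `warehouse_sales`, the cutoff date and the use of SQLite are assumptions made for illustration.

```python
import sqlite3


def flush_ods_to_warehouse(conn, cutoff):
    """Copy ODS rows older than `cutoff` into the warehouse, then purge them."""
    with conn:  # one transaction: the copy and the delete succeed or fail together
        conn.execute(
            "INSERT INTO warehouse_sales SELECT * FROM ods_sales WHERE sale_date < ?",
            (cutoff,),
        )
        conn.execute("DELETE FROM ods_sales WHERE sale_date < ?", (cutoff,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ods_sales (sale_id INTEGER, sale_date TEXT, amount REAL)")
conn.execute("CREATE TABLE warehouse_sales (sale_id INTEGER, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO ods_sales VALUES (?, ?, ?)",
    [(1, "2024-01-05", 10.0), (2, "2024-02-20", 25.0), (3, "2024-03-15", 40.0)],
)
conn.commit()

# Dates are ISO-8601 strings, so plain string comparison orders them correctly.
flush_ods_to_warehouse(conn, cutoff="2024-03-01")

ods_ids = [r[0] for r in conn.execute("SELECT sale_id FROM ods_sales ORDER BY sale_id")]
warehouse_ids = [r[0] for r in conn.execute("SELECT sale_id FROM warehouse_sales ORDER BY sale_id")]
```

After the flush, the ODS holds only the recent sale (id 3), while the two older sales live in the warehouse. Running both statements in one transaction avoids losing or duplicating rows if the flush is interrupted.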

2) Data Warehouse

A data warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from a warehouse, and once committed to the warehouse the data becomes a permanent record. Data warehouses are structured to handle complex queries with larger data sets, where speed and responsiveness are often not the driving factor. A data warehouse is not an essential ingredient for a BI framework and, depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists:

• ETL specialists embed data transformations into data flows.
• Data Warehouse designers put in complex database designs as implicit models.
• Business analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis, etc., without upsetting the structure of the warehoused data.
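As a rough illustration of a mart being cut from the warehouse and then extended by its owning business unit, consider the sketch below. The tables `warehouse_sales` and `mart_north_sales`, the `price_band` dimension and its threshold are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("North", "Widget", 100.0), ("South", "Widget", 80.0), ("North", "Gadget", 50.0)],
)

# The mart is a separate table holding only the rows one business unit needs.
conn.execute(
    "CREATE TABLE mart_north_sales AS "
    "SELECT product, amount FROM warehouse_sales WHERE region = 'North'"
)

# The business unit adds its own ad-hoc dimension, in the mart only.
conn.execute("ALTER TABLE mart_north_sales ADD COLUMN price_band TEXT")
conn.execute(
    "UPDATE mart_north_sales SET price_band = "
    "CASE WHEN amount >= 75 THEN 'high' ELSE 'low' END"
)

mart_rows = conn.execute(
    "SELECT product, price_band FROM mart_north_sales ORDER BY product"
).fetchall()
# Column names of the warehouse table, to show its structure is unchanged.
warehouse_columns = [row[1] for row in conn.execute("PRAGMA table_info(warehouse_sales)")]
```

The new `price_band` column exists only in the mart; the warehouse table keeps its original three columns.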

Data Marts are also not essential ingredients for BI infrastructure but get recommended as a “best practice” in most BI implementations, as they distribute the management of the Business Analytics Databases. Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:

• Hierarchical Dimensions and Facts (a Star Schema)
• A Normalized Structure (also called Third Normal Form)

Hierarchical Dimensions and Facts (a Star Schema)

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
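A minimal star schema can be sketched as one fact table with foreign keys into two dimension tables. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions for illustration; SQLite is used only because it needs no setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table holds measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
INSERT INTO fact_sales VALUES (1, 10, 3, 30.0), (1, 11, 2, 20.0), (2, 10, 1, 15.0);
""")

# A typical star-schema query: join facts to a dimension, then aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

The fact table sits at the center and each dimension hangs off it by a single join, which is what gives the schema its star shape and keeps analytical queries simple.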

A Normalized Structure (also called Third Normal Form)

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that:

• The entity is in second normal form
• No non-key attribute is transitively dependent on the primary key
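A small worked example may help. Besides being in second normal form, a 3NF design must not have non-key attributes that depend on other non-key attributes. In the hypothetical rows below, the city depends on the customer rather than on the order, so it is moved to its own table. All names and values are invented for illustration.

```python
# Hypothetical order rows that violate 3NF: customer_city depends on
# customer_id, which is not the key (order_id), i.e. a transitive dependency.
orders_unnormalized = [
    {"order_id": 1, "customer_id": 7, "customer_city": "Hyderabad", "total": 120.0},
    {"order_id": 2, "customer_id": 7, "customer_city": "Hyderabad", "total": 45.0},
    {"order_id": 3, "customer_id": 9, "customer_city": "Pune", "total": 60.0},
]


def to_third_normal_form(rows):
    """Split the rows so every non-key attribute depends only on its table's key."""
    customers = {}  # customer_id -> city, stored once per customer
    orders = []     # order rows keep only a reference to the customer
    for r in rows:
        customers[r["customer_id"]] = r["customer_city"]
        orders.append(
            {"order_id": r["order_id"], "customer_id": r["customer_id"], "total": r["total"]}
        )
    return customers, orders


customers, orders = to_third_normal_form(orders_unnormalized)
```

After the split, each city is stored once per customer instead of once per order, so a change of address needs to be made in only one place.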

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source and to tune the delay (or cycle time) and data integrity.
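One common way to keep such an input database synchronized with its source is an incremental ("delta") copy driven by a watermark. The sketch below assumes each source row carries a `modified` counter. That field, the function name and the sample rows are hypothetical, and this is one possible design rather than the one the slides prescribe.

```python
def sync_from_source(source_rows, target, last_synced):
    """Copy rows changed since `last_synced` into `target` without altering them.

    Returns the new watermark to use on the next cycle.
    """
    watermark = last_synced
    for row in source_rows:
        if row["modified"] > last_synced:
            target[row["id"]] = dict(row)  # faithful copy: no cleansing or validation
            watermark = max(watermark, row["modified"])
    return watermark


# Hypothetical source rows; `modified` is an integer change counter.
source = [
    {"id": 1, "value": "a", "modified": 5},
    {"id": 2, "value": "", "modified": 8},   # blank value is copied as-is
    {"id": 3, "value": "c", "modified": 12},
]
input_db = {}
watermark = sync_from_source(source, input_db, last_synced=6)
```

Only the rows changed after the previous watermark are copied, and they are copied unchanged, blank values included. How often the function runs sets the cycle time: shorter cycles reduce the delay but load the source systems more often.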

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. That is because it contains data that has been scrubbed and massaged to meet the analytics needs, and it also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in providing reporting and visualization analysis.
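The idea behind an OLAP cube can be sketched in a few lines: measures are pre-aggregated for every combination of dimension values, including rolled-up totals. The sample facts, the two dimensions and the "ALL" marker are assumptions for illustration. Real OLAP engines are far more elaborate.

```python
from collections import defaultdict
from itertools import product

# Hypothetical cleansed sales facts: (region, product, revenue).
FACTS = [
    ("North", "Widget", 100.0),
    ("North", "Gadget", 50.0),
    ("South", "Widget", 80.0),
]


def build_cube(facts):
    """Pre-aggregate revenue for every combination of dimension values.

    "ALL" marks a dimension that has been rolled up.
    """
    cube = defaultdict(float)
    for region, prod, revenue in facts:
        # Each fact contributes to its own cell and to every roll-up above it.
        for r, p in product((region, "ALL"), (prod, "ALL")):
            cube[(r, p)] += revenue
    return dict(cube)


cube = build_cube(FACTS)
```

A report can then read `cube[("North", "ALL")]` for the North total or `cube[("ALL", "Widget")]` for the Widget total without rescanning the facts, which is why cubes suit reporting and visualization.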

Page 21: Data Sourcing - CMRCET

Tables

Solve Data Quality IT Issues

Data Quality (DQ) is a significant issue facing many organizations poor data quality is associated with a variety of hard and soft costs Most organizations struggle to define and implement a formal strategy for addressing DQ problems Solve data quality woes by adopting a systematic approach for identifying and correcting current issues then put processes in place to stop data quality problems from resurfacingData quality is the overall ability of data to fulfill its intendedpurposeOrganizations of all sizes and in many different industries needclean data to operatePoor data quality negatively impacts a wide array of functions andbusiness processesBoth the business and IT are responsible for solving data qualityissues

5 issues with Data Quality

• Data quality expert Joseph Juran defines data quality as whether or not data is “fit for intended uses in operations, decision making and planning.”

• Data quality does not refer to a single problem; it’s an umbrella term referring to a family of different issues.

• Data quality is not a matter of “excellent” vs. “poor.” An organization may excel in some areas of data quality but not in others. For example, an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh.

• Different data quality problems are often interrelated. For example, duplicate data can give rise to data conflicts. Taking steps to fix one problem can have a positive “halo effect” on other problems.

There are five different issues under the DQ umbrella: duplicate data, stale data, incomplete data, invalid data, and conflicting data.

Data Duplication
What is it: Multiple copies of the same piece of data.
What causes it: Incorrect data entry; poor integration; faulty database design.
What does it impact: Wasted storage space; ongoing problems with direct sales and/or marketing communications.
What can be done about it: Data quality tools; better integration; unique indices for data.

Stale Data
What is it: Data being incorrectly used on the assumption that it is current.
What causes it: Contacts changing position; one-time integration with no ongoing delta import; data not being available fast enough from source systems.
What does it impact: Problems with marketing correspondence, leading to lost sales and damaged customer relationships.
What can be done about it: Establish clear data refresh cycles; pull customer information from user-supplied sources such as social networking sites.

Incomplete Data
What is it: Key fields are missing or not filled out.
What causes it: End user apathy; required fields not being enforced; poor user interface.
What does it impact: Missing data can lead to productivity losses and flawed decision-making.
What can be done about it: End-user training; strong data validation; easy-to-use interfaces.

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns.
What causes it: Ineffective or non-existent validation rules; data type mismatches between integrated systems.
What does it impact: Creates integration exception reports which must be investigated; interferes with operational reporting.
What can be done about it: Strong data validation; elimination of extraneous use of note fields; user training.

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system.
What causes it: No designated system of record; poor integration; lack of data interchange between systems.
What does it impact: Data conflicts confuse users; wasted time and effort; threat of using incorrect data.
What can be done about it: Tighter system integration; data auditing.

The five data quality problems are distinct issues, but they may have similar underlying causes.
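Each of the five issues in the table above can be expressed as a simple automated check. The following is a minimal sketch only: the record layout, the field names (`id`, `email`, `phone`, `updated`), the two sample systems (`crm` and `billing`) and the helper names are all hypothetical, and the email test is deliberately crude.

```python
from datetime import date, timedelta

# Hypothetical customer records from a CRM, plus emails held by a billing system.
crm = [
    {"id": 1, "email": "ann@example.com", "phone": "555-0100", "updated": date(2024, 1, 10)},
    {"id": 2, "email": "bob@example.com", "phone": "", "updated": date(2021, 3, 5)},
    {"id": 3, "email": "ann@example.com", "phone": "555-0100", "updated": date(2024, 1, 12)},
    {"id": 4, "email": "not-an-email", "phone": "555-0199", "updated": date(2024, 2, 1)},
]
billing = {1: "ann@example.com", 2: "robert@example.com"}


def find_duplicates(rows, key):
    """Duplicate data: pairs of ids that share the same value for `key`."""
    seen, dupes = {}, []
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.append((seen[value], row["id"]))
        else:
            seen[value] = row["id"]
    return dupes


def find_stale(rows, today, max_age_days):
    """Stale data: ids not refreshed within the agreed refresh cycle."""
    cutoff = today - timedelta(days=max_age_days)
    return [r["id"] for r in rows if r["updated"] < cutoff]


def find_incomplete(rows, required):
    """Incomplete data: ids with a missing or empty required field."""
    return [r["id"] for r in rows if any(not r.get(f) for f in required)]


def is_valid_email(value):
    """Crude format check: text before '@' and a dot in the part after it."""
    local, sep, domain = value.partition("@")
    return bool(local) and sep == "@" and "." in domain


def find_invalid(rows):
    """Invalid data: ids whose email fails the format check."""
    return [r["id"] for r in rows if not is_valid_email(r["email"])]


def find_conflicts(rows, other):
    """Data conflicts: ids whose email differs from the other system's copy."""
    return [r["id"] for r in rows if r["id"] in other and other[r["id"]] != r["email"]]
```

In practice these checks would usually run inside a data quality tool or as database constraints (for example a unique index for duplicates); the functions here only illustrate what each check looks for.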

Solution

IT and the business often try to “pass the buck” for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

• Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data. It’s not uncommon for IT to downplay its role in addressing data quality issues.

• However, poor data quality is an endemic problem that often permeates the organization. Individual business units rarely have the resources or authority to unilaterally solve their data quality problems.

• While the business needs to recognize that it is ultimately accountable for data ownership, IT must take a proactive stance on providing solutions and assistance with data quality.

• It’s important to delineate the relationship between IT and the business and specify who is responsible for what. IT should not be taking charge of the data; rather, it should provide tools and assistance with data cleansing.

The business needs to:

• Set policies for matters such as refresh cycles for stale data.

• Determine which systems will be “systems of record” to reduce conflicts.

• Determine access privileges and data validation rights.

IT needs to:

• Advise the business on software tools for improving data quality.

• Provide assistance with major cleansing efforts.

• Provide assistance with database and interface design (e.g. locking down certain fields from end users and setting up data validation).

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the framework. The data repository represents the single biggest asset of the BI framework and is usually seen as the primary driver behind the BI framework.

The data repository comes in several flavors.

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the Warehouse. Data Store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
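The periodic flush of older data from an ODS into a warehouse can be sketched as follows. This is an illustration under simplifying assumptions: both stores are plain Python lists, each row carries a hypothetical `day` field, and the function name `flush_to_warehouse` and the retention window `keep_days` are invented for the example.

```python
from datetime import date, timedelta


def flush_to_warehouse(ods, warehouse, today, keep_days):
    """Move rows older than `keep_days` from the ODS into the warehouse.

    Returns the number of rows moved. The warehouse list only ever grows,
    while the ODS keeps just the recent history.
    """
    cutoff = today - timedelta(days=keep_days)
    old_rows = [row for row in ods if row["day"] < cutoff]
    warehouse.extend(old_rows)
    # Replace the ODS contents in place with the rows still inside the window.
    ods[:] = [row for row in ods if row["day"] >= cutoff]
    return len(old_rows)


# Hypothetical sample data: three daily sales rows in the ODS.
ods = [
    {"day": date(2024, 1, 1), "sales": 120},
    {"day": date(2024, 2, 20), "sales": 95},
    {"day": date(2024, 2, 28), "sales": 130},
]
warehouse = []
moved = flush_to_warehouse(ods, warehouse, today=date(2024, 3, 1), keep_days=30)
```

After the call, the ODS holds only the rows from the last 30 days and the warehouse holds the older row, which matches the "limited history" design described above.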

2) Data Warehouse

A data warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it; once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries with larger data sets, where speed and responsiveness are often not the driving factor.

A data warehouse is not an essential ingredient for a BI framework and, depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists: ETL specialists embed data transformations into data flows, Data Warehouse designers put in complex database designs as implicit models, and business analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis, etc., without upsetting the structure of the warehoused data.

Data Marts are also not essential ingredients for BI infrastructure but get recommended as a “best practice” in most BI implementations, as they distribute the management of the Business Analytics Databases. Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:
• Hierarchical Dimensions and Facts: a Star Schema
• A Normalized Structure, also called a Third Normal Form

Hierarchical Dimensions and Facts: a Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
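A small star schema can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions made up for illustration; the point is only that one fact table references the dimension tables, and that a typical query joins the fact table to its dimensions and aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes used to slice the facts.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);

-- Fact table: numeric measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
INSERT INTO dim_date VALUES (1, '2024-01-15', '2024-01'), (2, '2024-02-03', '2024-02');
INSERT INTO fact_sales VALUES (1, 1, 10, 15.0), (1, 2, 4, 6.0), (2, 2, 1, 30.0);
""")

# Revenue by product category and month: join the fact table to both dimensions.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
```

Adding a new way to slice the data (for example by store) would mean adding one more dimension table and one more key column on the fact table, which is why this layout is popular for reporting.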

A Normalized Structure, also called a Third Normal Form

Third normal form (3NF) is a normal form used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that the entity is in second normal form and that no non-key attribute is transitively dependent on the primary key.
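As a rough illustration of what normalizing toward 3NF removes, consider a flat orders table in which the customer's city is stored on every order. The city depends on the customer, not on the order, so it is a transitive dependency on the order key. The sketch below uses hypothetical field names and sample values and splits the flat rows into two tables so that each city is stored once.

```python
# Hypothetical flat table: customer_city depends on customer_id, not on order_id.
flat = [
    {"order_id": 100, "customer_id": 1, "customer_city": "Pune"},
    {"order_id": 101, "customer_id": 1, "customer_city": "Pune"},
    {"order_id": 102, "customer_id": 2, "customer_city": "Delhi"},
]

# Decompose into two tables: customers (keyed by customer_id) and orders.
customers = {}
orders = []
for row in flat:
    customers[row["customer_id"]] = row["customer_city"]
    orders.append({"order_id": row["order_id"], "customer_id": row["customer_id"]})

# Joining the two tables back together reproduces the original rows,
# so no information was lost by removing the repeated city values.
rejoined = [
    {**order, "customer_city": customers[order["customer_id"]]}
    for order in orders
]
```

With this layout, a customer's change of city is recorded in one place instead of on every order row, which is the duplication that normalization is meant to avoid.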

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source and to tune the delay (or cycle time) and data integrity.

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. That is because it contains data that has been scrubbed and massaged to meet the analytics needs and also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube that is very prevalent in providing reporting and visualization analysis.
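The difference between the two stages can be sketched as follows: the input database keeps source values exactly as delivered, while the ready database holds scrubbed values together with a record of any data problems, so that an analytics step can skip rows it should not use. The row layout, the `issues` field and the function name `build_ready` are assumptions for this example only.

```python
# Input database: a faithful, unvalidated copy of what the source delivered.
input_db = [
    {"id": 1, "amount": "120.50"},
    {"id": 2, "amount": "n/a"},
    {"id": 3, "amount": "80"},
]


def build_ready(rows):
    """Scrub rows for analytics and record any data problems per row."""
    ready = []
    for row in rows:
        issues = []
        try:
            amount = float(row["amount"])
        except ValueError:
            amount = None
            issues.append("amount is not numeric")
        ready.append({"id": row["id"], "amount": amount, "issues": issues})
    return ready


ready_db = build_ready(input_db)

# An analytics step uses the recorded issues to avoid running on bad rows.
usable = [row for row in ready_db if not row["issues"]]
total = sum(row["amount"] for row in usable)
```

Keeping the issue list next to the cleaned value, rather than silently dropping bad rows, lets a model decide for itself whether a flagged row is acceptable.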



Page 24: Data Sourcing - CMRCET

5 issues with Data Quality

bull Data quality expert Joseph Juran defines data quality as whether or not data is ldquofit for intended uses in operations decision making and planningrdquo

bull Data quality does not refer to a single problem itrsquos an umbrella term referring to a family of different issues

bull Data quality is not a matter of ldquoexcellentrdquo vs ldquopoorrdquo An organization may excel in some areas of data quality but not in others For example an organization may struggle with duplicate data but have processes in place for ensuring data remains fresh

bull Different data quality problems are often interrelated For example duplicate data can give rise to data conflicts Taking steps to fix one problem can have a positive ldquohalo effectrdquo on other problems

Data Quality

Duplicate Data

Stale Data

Incomplete Data

Invalid Data

Conflicting Data

There are five different issues under the DQ umbrella

For each issue: what is it, what causes it, what does it impact, and what can be done about it?

Data Duplication
What is it: Multiple copies of the same piece of data
What causes it:
• Incorrect data entry
• Poor integration
• Faulty database design
What does it impact:
• Wasted storage space
• Ongoing problems with direct sales and/or marketing communications
What can be done about it:
• Data quality tools
• Better integration
• Unique indices for data

Stale Data
What is it: Data being incorrectly used on the assumption that it is current
What causes it:
• Contacts changing position
• One-time integration with no ongoing delta import
• Data not being available fast enough from source systems
What does it impact:
• Problems with marketing correspondence, leading to lost sales and damaged customer relationships
What can be done about it:
• Establish clear data refresh cycles
• Pull customer information from user-supplied sources such as social networking sites

Incomplete Data
What is it: Key fields are missing or not filled out
What causes it:
• End user apathy
• Required fields not being enforced
• Poor user interface
What does it impact:
• Missing data can lead to productivity losses and flawed decision-making
What can be done about it:
• End-user training
• Strong data validation
• Easy-to-use interfaces

Invalid Data
What is it: The wrong data or poorly formatted data is stored in columns
What causes it:
• Ineffective or non-existent validation rules
• Data type mismatches between integrated systems
What does it impact:
• Creates integration exception reports which must be investigated
• Interferes with operational reporting
What can be done about it:
• Strong data validation
• Elimination of extraneous use of note fields
• User training

Data Conflicts
What is it: Data contained in one system is at odds with data contained in another system
What causes it:
• No designated system of record
• Poor integration
• Lack of data interchange between systems
What does it impact:
• Data conflicts confuse users
• Wasted time and effort
• Threat of using incorrect data
What can be done about it:
• Tighter system integration
• Data auditing

The five data quality problems are distinct issues, but they may have similar underlying causes.
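To make the five issues concrete, here is a minimal Python sketch of how each one might be detected on a small set of customer records. The record layout, the `crm` and `billing` names, the one-year staleness threshold, and the crude email check are all assumptions made for illustration; they are not taken from the slides.

```python
from datetime import date

# Hypothetical customer records from one system (illustrative data only).
crm = [
    {"id": 1, "email": "ann@example.com", "phone": "555-0101", "updated": date(2021, 9, 1)},
    {"id": 2, "email": "bob@example.com", "phone": "", "updated": date(2019, 1, 15)},
    {"id": 3, "email": "ann@example.com", "phone": "555-0101", "updated": date(2021, 8, 20)},
    {"id": 4, "email": "not-an-email", "phone": "555-0199", "updated": date(2021, 9, 10)},
]
# Hypothetical second system: customer id -> phone number.
billing = {1: "555-0101", 2: "555-0102", 4: "555-0100"}


def find_issues(records, other_system, today, max_age_days=365):
    """Return the ids of records affected by each of the five issue types."""
    issues = {"duplicate": [], "stale": [], "incomplete": [], "invalid": [], "conflict": []}
    seen_emails = set()
    for r in records:
        # Duplicate: the same email already appeared on an earlier record.
        if r["email"] in seen_emails:
            issues["duplicate"].append(r["id"])
        seen_emails.add(r["email"])
        # Stale: not refreshed within the assumed refresh cycle.
        if (today - r["updated"]).days > max_age_days:
            issues["stale"].append(r["id"])
        # Incomplete: a key field is empty.
        if not r["phone"]:
            issues["incomplete"].append(r["id"])
        # Invalid: value does not match the expected format (very crude check).
        if "@" not in r["email"]:
            issues["invalid"].append(r["id"])
        # Conflict: the other system holds a different value for the same field.
        other_phone = other_system.get(r["id"])
        if r["phone"] and other_phone is not None and other_phone != r["phone"]:
            issues["conflict"].append(r["id"])
    return issues


report = find_issues(crm, billing, today=date(2021, 10, 3))
print(report)
```

In practice these checks would normally live in data quality tooling or database constraints (for example, unique indices and required fields) rather than in ad hoc scripts; the sketch only shows that each issue has its own test.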

Solution

IT and the business often try to “pass the buck” for data quality issues to one another. The business must own the data, but IT needs to have an active role in offering solutions to help the business address data quality problems.

• Conventional wisdom holds that the business is responsible for ensuring the integrity and accuracy of data. It’s not uncommon for IT to downplay its role in addressing data quality issues.

• However, poor data quality is an endemic problem that often permeates the organization. Individual business units rarely have the resources or authority to unilaterally solve their data quality problems.

• While the business needs to recognize that it is ultimately accountable for data ownership, IT must take a proactive stance on providing solutions and assistance with data quality.

• It’s important to delineate the relationship between IT and the business and specify who is responsible for what. IT should not be taking charge of the data; rather, it should provide tools and assistance with data cleansing.

The business needs to…

• Set policies for matters such as refresh cycles for stale data
• Determine which systems will be “systems of record” to reduce conflicts
• Determine access privileges and data validation rights

IT needs to…

• Advise the business on software tools for improving data quality
• Provide assistance with major cleansing efforts
• Provide assistance with database and interface design (e.g., locking down certain fields from end users and setting up data validation)

Analytical Datasets and BI Assets

A critical component of a BI framework is a data repository that drives all processes and tools that make up the framework. The data repository represents the single biggest asset of the BI framework and is usually seen as the primary driver behind the BI framework.

The data repository comes in several flavors:

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the Warehouse. Data Store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
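The periodic flush from an operational data store into a warehouse can be sketched with Python's built-in `sqlite3` module. The table names `ods_sales` and `dw_sales`, the columns, and the cutoff date are hypothetical, and a single in-memory database stands in for what would normally be two separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_sales (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL);
    CREATE TABLE dw_sales  (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL);
""")
conn.executemany(
    "INSERT INTO ods_sales VALUES (?, ?, ?)",
    [(1, "2021-06-30", 120.0), (2, "2021-08-15", 80.0), (3, "2021-09-28", 45.5)],
)


def flush_older_than(conn, cutoff_date):
    """Copy rows older than cutoff_date into the warehouse, then remove them from the ODS."""
    with conn:  # copy and delete commit together, or roll back together on error
        conn.execute(
            "INSERT INTO dw_sales SELECT * FROM ods_sales WHERE sale_date < ?",
            (cutoff_date,),
        )
        cur = conn.execute("DELETE FROM ods_sales WHERE sale_date < ?", (cutoff_date,))
    return cur.rowcount


# ISO-formatted date strings compare correctly as plain text.
moved = flush_older_than(conn, "2021-09-01")
ods_ids = [row[0] for row in conn.execute("SELECT sale_id FROM ods_sales ORDER BY sale_id")]
dw_ids = [row[0] for row in conn.execute("SELECT sale_id FROM dw_sales ORDER BY sale_id")]
print(moved, ods_ids, dw_ids)
```

Doing the copy and the delete in one transaction is a design choice in this sketch: it avoids a state where a row has left the ODS but has not yet reached the warehouse.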

2) Data Warehouse

A data warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it; once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries with larger data sets, where speed and responsiveness are often not the driving factor. A data warehouse is not an essential ingredient for a BI framework; depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists: ETL specialists embed data transformations into data flows, Data Warehouse designers put in complex database designs as implicit models, and business analytics modelers create their own models on top of these and react to business needs by creating overlaps.

3) Data Mart

A Data Mart is a specialized cut from the data warehouse, extracted for very specific business needs. Ownership of these data marts is typically vested with the business units. The business units can use these marts to create ad-hoc dimensions for specific analysis, etc., without upsetting the structure of the warehoused data.

Data Marts are also not essential ingredients for BI infrastructure but get recommended as a “best practice” in most BI implementations, as they distribute the management of the Business Analytics Databases. Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics, the proliferation of data marts simply means that there are more places where models are stored, and so more places to manage.

4) Data Structuring and Transformation

One key process in a BI framework is the transformation and structuring of data into a convenient structure. The choice of structure can be:
• Hierarchical Dimensions and Facts: a Star Schema
• A Normalized Structure, also called Third Normal Form

Hierarchical Dimensions and Facts: a Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
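As an illustration, the following sketch builds a tiny star schema with Python's built-in `sqlite3` module: one fact table (`fact_sales`) referencing two dimension tables (`dim_product`, `dim_date`). All table names, columns, and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold the descriptive attributes.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- The fact table holds measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        units      INTEGER,
        revenue    REAL
    );

    INSERT INTO dim_product VALUES
        (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Mug', 'Kitchen');
    INSERT INTO dim_date VALUES (1, 2021, 9), (2, 2021, 10);
    INSERT INTO fact_sales VALUES
        (1, 1, 10, 15.0), (2, 1, 4, 20.0), (3, 1, 2, 18.0), (1, 2, 5, 7.5);
""")

# A typical analytical query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

for category, month, revenue in rows:
    print(category, month, revenue)
```

Every analytical question in this layout follows the same pattern: start from the fact table, join outward to the dimensions of interest, and aggregate the measures.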

A Normalized Structure, also called Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that the entity is in second normal form and that no non-key attribute is transitively dependent on the primary key.
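A small Python sketch can show what normalizing toward 3NF removes. In the hypothetical flat table below, `customer_city` depends on `customer_id`, which in turn depends on the key `order_id`; that is a transitive dependency, so the city is repeated on every order. The field names and data are assumptions made for the example.

```python
# Flat (unnormalized) order rows: customer_city is repeated for each order.
flat_orders = [
    {"order_id": 100, "customer_id": "C1", "customer_city": "Hyderabad", "amount": 250},
    {"order_id": 101, "customer_id": "C2", "customer_city": "Pune", "amount": 90},
    {"order_id": 102, "customer_id": "C1", "customer_city": "Hyderabad", "amount": 40},
]


def split_out_customers(rows):
    """Move the customer attributes into their own table keyed by customer_id."""
    customers = {}
    orders = []
    for row in rows:
        # Each customer's city is now stored once, however many orders they place.
        customers[row["customer_id"]] = {"customer_city": row["customer_city"]}
        # Orders keep only the key, the foreign key, and attributes that depend on the key.
        orders.append(
            {"order_id": row["order_id"], "customer_id": row["customer_id"], "amount": row["amount"]}
        )
    return orders, customers


orders, customers = split_out_customers(flat_orders)
print(orders)
print(customers)
```

After the split, changing a customer's city means updating one entry in `customers` instead of every matching order row, which is the duplication that normalization is meant to reduce.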

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source and to tune the delay (or cycle time) and data integrity.

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. It contains data that has been scrubbed and massaged to meet the analytics needs, and it also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in providing reporting and visualization analysis.

Page 28: Data Sourcing - CMRCET

A critical component of a BI framework is a data repository that drives all processes and tools that make up the frameworkThe data repository represents the

single biggest asset of the BI framework and is usually seen as the primary driverbehind the BI framework

The data repository comes in several flavors

1)Operational Data Store

1)Operational Data StoreAn operational data store (ODS) is a type of database that collects data from multiple sources for processing after which it sends the data to operational systems and data warehousesAn Operational Data Store (ODS) integrates data from disparate sources (through Data Loading) The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency

1)Operational Data StoreAn operational data store may be designed to store only a limited history of data with older data flushed periodically intoa Data Warehouse Such operational data stores are sometimes referred to as Staging Databases since they hold data temporarily before committing it to the Warehouse Data Store structures are optimized for simple queries with the emphasis on speedy retrieval of limited information

2)Data Warehouse

2)Data WarehouseA Data warehouse collects data from operational data stores and stores them for longer term use A key aspect of a data warehouse is that data is never deleted from a warehouse and once committed to the warehouse the data becomes a permanent record Data warehouses are structured to handle complex queries with larger data sets where speed and responsiveness are often not the driving factorA data warehouse is not an essential ingredient for a BI framework and depending on the volume and usage of the data a Data Store can effectively serve as a data warehouse for all intents and purposes

2)Data WarehouseIn fact a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructureThis is because it creates multiple locations where models must be managed by different specialistsETL specialists embed data transformations into data flows

Data Warehouse designers put in complex database designs as implicit models and Business analytics modelers create their own models on top of these and react to business needs by creating overlaps

3)Data Mart

3)Data MartA Data Mart is a specialized cut from the data warehouse extracted for very specificbusiness needsOwnership of these data marts is typically vested with

the business units The business units can use these marts to create ad-hoc dimensions for specific analysis etc without upsetting the structure of the warehoused data

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4)Data Structuring and Transformationbull Hierarchical Dimensions and Factsmdasha Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts The star schema consists of one or more fact tables referencing any number of dimension tables

4)Data Structuring and Transformationbull A Normalized Structuremdashalso called a Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that The entity is in second normal form

5) Business Analytics Input Databases

The Business Analytics Input Database is an Operational Data Store into which data from various sources flows. The data delivered to this stage must be a faithful representation of data in the source systems. No additional checks are put in place to validate the data available at this stage. The design of this database needs to address how the data is synchronized with its source and how to tune the delay (or cycle time) and data integrity.
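One common way to synchronize such a database with its source is an incremental load driven by a "last synced" watermark. The sketch below is an assumption about how that could look, not something the slides prescribe; the field names id and updated_at and the function name are hypothetical.

```python
def sync_increment(source_rows, target, last_synced):
    """Copy rows changed after last_synced into target; return the new watermark."""
    watermark = last_synced
    for row in source_rows:
        if row["updated_at"] > last_synced:
            # Faithful copy: the row is stored as-is, with no validation step.
            target[row["id"]] = dict(row)
            watermark = max(watermark, row["updated_at"])
    return watermark

source = [
    {"id": 1, "updated_at": 5, "amount": 10.0},
    {"id": 2, "updated_at": 8, "amount": 20.0},
]
target = {}
first_mark = sync_increment(source, target, last_synced=0)

# One cycle later the source has changed a single row.
source[0] = {"id": 1, "updated_at": 12, "amount": 11.0}
second_mark = sync_increment(source, target, last_synced=first_mark)
```

How often this function runs is the cycle time: a shorter cycle reduces the delay between source and input database at the cost of more frequent loads.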

6) Business Analytics Ready Databases

A Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built. That is because it contains data that has been scrubbed and massaged to meet the analytics needs, and it also has access to information about data problems that can be used to forestall incorrect execution of analytics. Examples of such databases include the ubiquitous OLAP cube, which is very prevalent in providing reporting and visualization analysis.
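To give a feel for what an OLAP cube holds, the following sketch pre-aggregates a measure for every combination of dimension values, using "*" as a roll-up marker meaning "all values". It is a toy model in plain Python; the function name build_cube and the sample facts are assumptions, and real OLAP engines are considerably more elaborate.

```python
from collections import defaultdict
from itertools import product

def build_cube(facts, dims, measure):
    """Sum `measure` for every combination of dimension values and roll-ups."""
    cube = defaultdict(float)
    for fact in facts:
        # Each dimension contributes either its own value or the "*" roll-up.
        for key in product(*[(fact[d], "*") for d in dims]):
            cube[key] += fact[measure]
    return dict(cube)

facts = [
    {"region": "North", "quarter": "Q1", "sales": 100.0},
    {"region": "North", "quarter": "Q2", "sales": 50.0},
    {"region": "South", "quarter": "Q1", "sales": 70.0},
]
cube = build_cube(facts, ["region", "quarter"], "sales")
north_total = cube[("North", "*")]  # all quarters for North
q1_total = cube[("*", "Q1")]        # all regions for Q1
grand_total = cube[("*", "*")]
```

Because the totals are computed ahead of time, a report only has to look up a key instead of scanning the detailed facts.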

Page 29: Data Sourcing - CMRCET

1) Operational Data Store

An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses. An ODS integrates data from disparate sources (through Data Loading). The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency.

An operational data store may be designed to store only a limited history of data, with older data flushed periodically into a Data Warehouse. Such operational data stores are sometimes referred to as Staging Databases, since they hold data temporarily before committing it to the Warehouse. Data Store structures are optimized for simple queries, with the emphasis on speedy retrieval of limited information.
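The periodic flush of older data from the ODS into the warehouse can be sketched with Python's built-in sqlite3 module. The table names (ods_orders, warehouse_orders), the integer day column, and the cutoff value are hypothetical choices for this illustration.

```python
import sqlite3

def flush_old_rows(conn, cutoff_day):
    """Copy ODS rows older than cutoff_day into the warehouse, then purge them."""
    with conn:  # one transaction: the copy and the delete succeed or fail together
        conn.execute(
            "INSERT INTO warehouse_orders SELECT * FROM ods_orders WHERE day < ?",
            (cutoff_day,),
        )
        moved = conn.execute(
            "DELETE FROM ods_orders WHERE day < ?", (cutoff_day,)
        ).rowcount
    return moved

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_orders (order_id INTEGER PRIMARY KEY, day INTEGER, amount REAL);
    CREATE TABLE warehouse_orders (order_id INTEGER PRIMARY KEY, day INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO ods_orders VALUES (?, ?, ?)",
                 [(1, 1, 5.0), (2, 2, 7.5), (3, 9, 3.0)])
conn.commit()

moved = flush_old_rows(conn, cutoff_day=5)
ods_left = conn.execute("SELECT COUNT(*) FROM ods_orders").fetchone()[0]
warehoused = conn.execute("SELECT COUNT(*) FROM warehouse_orders").fetchone()[0]
```

Only the recent row stays in the ODS, which keeps it small and quick to query, while the warehouse accumulates the history.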

2) Data Warehouse

A data warehouse collects data from operational data stores and stores it for longer-term use. A key aspect of a data warehouse is that data is never deleted from it; once committed to the warehouse, the data becomes a permanent record. Data warehouses are structured to handle complex queries with larger data sets, where speed and responsiveness are often not the driving factor. A data warehouse is not an essential ingredient for a BI framework and, depending on the volume and usage of the data, a Data Store can effectively serve as a data warehouse for all intents and purposes.

In fact, a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructure. This is because it creates multiple locations where models must be managed by different specialists:
• ETL specialists embed data transformations into data flows.
• Data Warehouse designers put in complex database designs as implicit models.
• Business Analytics modelers create their own models on top of these and react to business needs by creating overlaps.

Page 30: Data Sourcing - CMRCET

1)Operational Data StoreAn operational data store (ODS) is a type of database that collects data from multiple sources for processing after which it sends the data to operational systems and data warehousesAn Operational Data Store (ODS) integrates data from disparate sources (through Data Loading) The data is cleaned and rationalized through Data Stewardship to ensure integrity and consistency

1)Operational Data StoreAn operational data store may be designed to store only a limited history of data with older data flushed periodically intoa Data Warehouse Such operational data stores are sometimes referred to as Staging Databases since they hold data temporarily before committing it to the Warehouse Data Store structures are optimized for simple queries with the emphasis on speedy retrieval of limited information

2)Data Warehouse

2)Data WarehouseA Data warehouse collects data from operational data stores and stores them for longer term use A key aspect of a data warehouse is that data is never deleted from a warehouse and once committed to the warehouse the data becomes a permanent record Data warehouses are structured to handle complex queries with larger data sets where speed and responsiveness are often not the driving factorA data warehouse is not an essential ingredient for a BI framework and depending on the volume and usage of the data a Data Store can effectively serve as a data warehouse for all intents and purposes

2)Data WarehouseIn fact a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructureThis is because it creates multiple locations where models must be managed by different specialistsETL specialists embed data transformations into data flows

Data Warehouse designers put in complex database designs as implicit models and Business analytics modelers create their own models on top of these and react to business needs by creating overlaps

3)Data Mart

3)Data MartA Data Mart is a specialized cut from the data warehouse extracted for very specificbusiness needsOwnership of these data marts is typically vested with

the business units The business units can use these marts to create ad-hoc dimensions for specific analysis etc without upsetting the structure of the warehoused data

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4)Data Structuring and Transformationbull Hierarchical Dimensions and Factsmdasha Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts The star schema consists of one or more fact tables referencing any number of dimension tables

4)Data Structuring and Transformationbull A Normalized Structuremdashalso called a Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that The entity is in second normal form

5)Business Analytics Input Databases

5)Business Analytics Input DatabasesThe Business Analytics Input Database is an Operational Data Store into which data from various sources flows The data delivered to this stage must be a faithful representation of data in the source systemsNo additional checks are put in place to validate the data

available at this stage The design of this database needs to address how the data is synchronized with its source and to tune the delay (orcycle time) and data integrity

6)Business Analytics Ready DatabasesA Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built That is because it contains data that has beenscrubbed and massaged to meet the analytics needs and also has access to informationabout data problems that can be used to forestall incorrect execution ofanalytics Examples of such databases include the ubiquitous OLAP9 cube that isvery prevalent in providing reporting and visualization analysis

Page 31: Data Sourcing - CMRCET

1)Operational Data StoreAn operational data store may be designed to store only a limited history of data with older data flushed periodically intoa Data Warehouse Such operational data stores are sometimes referred to as Staging Databases since they hold data temporarily before committing it to the Warehouse Data Store structures are optimized for simple queries with the emphasis on speedy retrieval of limited information

2)Data Warehouse

2)Data WarehouseA Data warehouse collects data from operational data stores and stores them for longer term use A key aspect of a data warehouse is that data is never deleted from a warehouse and once committed to the warehouse the data becomes a permanent record Data warehouses are structured to handle complex queries with larger data sets where speed and responsiveness are often not the driving factorA data warehouse is not an essential ingredient for a BI framework and depending on the volume and usage of the data a Data Store can effectively serve as a data warehouse for all intents and purposes

2)Data WarehouseIn fact a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructureThis is because it creates multiple locations where models must be managed by different specialistsETL specialists embed data transformations into data flows

Data Warehouse designers put in complex database designs as implicit models and Business analytics modelers create their own models on top of these and react to business needs by creating overlaps

3)Data Mart

3)Data MartA Data Mart is a specialized cut from the data warehouse extracted for very specificbusiness needsOwnership of these data marts is typically vested with

the business units The business units can use these marts to create ad-hoc dimensions for specific analysis etc without upsetting the structure of the warehoused data

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4)Data Structuring and Transformationbull Hierarchical Dimensions and Factsmdasha Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts The star schema consists of one or more fact tables referencing any number of dimension tables

4)Data Structuring and Transformationbull A Normalized Structuremdashalso called a Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that The entity is in second normal form

5)Business Analytics Input Databases

5)Business Analytics Input DatabasesThe Business Analytics Input Database is an Operational Data Store into which data from various sources flows The data delivered to this stage must be a faithful representation of data in the source systemsNo additional checks are put in place to validate the data

available at this stage The design of this database needs to address how the data is synchronized with its source and to tune the delay (orcycle time) and data integrity

6)Business Analytics Ready DatabasesA Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built That is because it contains data that has beenscrubbed and massaged to meet the analytics needs and also has access to informationabout data problems that can be used to forestall incorrect execution ofanalytics Examples of such databases include the ubiquitous OLAP9 cube that isvery prevalent in providing reporting and visualization analysis

Page 32: Data Sourcing - CMRCET

2)Data Warehouse

2)Data WarehouseA Data warehouse collects data from operational data stores and stores them for longer term use A key aspect of a data warehouse is that data is never deleted from a warehouse and once committed to the warehouse the data becomes a permanent record Data warehouses are structured to handle complex queries with larger data sets where speed and responsiveness are often not the driving factorA data warehouse is not an essential ingredient for a BI framework and depending on the volume and usage of the data a Data Store can effectively serve as a data warehouse for all intents and purposes

2)Data WarehouseIn fact a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructureThis is because it creates multiple locations where models must be managed by different specialistsETL specialists embed data transformations into data flows

Data Warehouse designers put in complex database designs as implicit models and Business analytics modelers create their own models on top of these and react to business needs by creating overlaps

3)Data Mart

3)Data MartA Data Mart is a specialized cut from the data warehouse extracted for very specificbusiness needsOwnership of these data marts is typically vested with

the business units The business units can use these marts to create ad-hoc dimensions for specific analysis etc without upsetting the structure of the warehoused data

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4)Data Structuring and Transformationbull Hierarchical Dimensions and Factsmdasha Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts The star schema consists of one or more fact tables referencing any number of dimension tables

4)Data Structuring and Transformationbull A Normalized Structuremdashalso called a Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that The entity is in second normal form

5)Business Analytics Input Databases

5)Business Analytics Input DatabasesThe Business Analytics Input Database is an Operational Data Store into which data from various sources flows The data delivered to this stage must be a faithful representation of data in the source systemsNo additional checks are put in place to validate the data

available at this stage The design of this database needs to address how the data is synchronized with its source and to tune the delay (orcycle time) and data integrity

6)Business Analytics Ready DatabasesA Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built That is because it contains data that has beenscrubbed and massaged to meet the analytics needs and also has access to informationabout data problems that can be used to forestall incorrect execution ofanalytics Examples of such databases include the ubiquitous OLAP9 cube that isvery prevalent in providing reporting and visualization analysis

Page 33: Data Sourcing - CMRCET

2)Data WarehouseA Data warehouse collects data from operational data stores and stores them for longer term use A key aspect of a data warehouse is that data is never deleted from a warehouse and once committed to the warehouse the data becomes a permanent record Data warehouses are structured to handle complex queries with larger data sets where speed and responsiveness are often not the driving factorA data warehouse is not an essential ingredient for a BI framework and depending on the volume and usage of the data a Data Store can effectively serve as a data warehouse for all intents and purposes

2)Data WarehouseIn fact a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructureThis is because it creates multiple locations where models must be managed by different specialistsETL specialists embed data transformations into data flows

Data Warehouse designers put in complex database designs as implicit models and Business analytics modelers create their own models on top of these and react to business needs by creating overlaps

3)Data Mart

3)Data MartA Data Mart is a specialized cut from the data warehouse extracted for very specificbusiness needsOwnership of these data marts is typically vested with

the business units The business units can use these marts to create ad-hoc dimensions for specific analysis etc without upsetting the structure of the warehoused data

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4)Data Structuring and Transformationbull Hierarchical Dimensions and Factsmdasha Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts The star schema consists of one or more fact tables referencing any number of dimension tables

4)Data Structuring and Transformationbull A Normalized Structuremdashalso called a Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that The entity is in second normal form

5)Business Analytics Input Databases

5)Business Analytics Input DatabasesThe Business Analytics Input Database is an Operational Data Store into which data from various sources flows The data delivered to this stage must be a faithful representation of data in the source systemsNo additional checks are put in place to validate the data

available at this stage The design of this database needs to address how the data is synchronized with its source and to tune the delay (orcycle time) and data integrity

6)Business Analytics Ready DatabasesA Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built That is because it contains data that has beenscrubbed and massaged to meet the analytics needs and also has access to informationabout data problems that can be used to forestall incorrect execution ofanalytics Examples of such databases include the ubiquitous OLAP9 cube that isvery prevalent in providing reporting and visualization analysis

Page 34: Data Sourcing - CMRCET

2)Data WarehouseIn fact a Data Warehouse is so structured that it proves to be a very expensive way to provide BI infrastructureThis is because it creates multiple locations where models must be managed by different specialistsETL specialists embed data transformations into data flows

Data Warehouse designers put in complex database designs as implicit models and Business analytics modelers create their own models on top of these and react to business needs by creating overlaps

3)Data Mart

3)Data MartA Data Mart is a specialized cut from the data warehouse extracted for very specificbusiness needsOwnership of these data marts is typically vested with

the business units The business units can use these marts to create ad-hoc dimensions for specific analysis etc without upsetting the structure of the warehoused data

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4)Data Structuring and Transformationbull Hierarchical Dimensions and Factsmdasha Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts The star schema consists of one or more fact tables referencing any number of dimension tables

4)Data Structuring and Transformationbull A Normalized Structuremdashalso called a Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that The entity is in second normal form

5)Business Analytics Input Databases

5)Business Analytics Input DatabasesThe Business Analytics Input Database is an Operational Data Store into which data from various sources flows The data delivered to this stage must be a faithful representation of data in the source systemsNo additional checks are put in place to validate the data

available at this stage The design of this database needs to address how the data is synchronized with its source and to tune the delay (orcycle time) and data integrity

6)Business Analytics Ready DatabasesA Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built That is because it contains data that has beenscrubbed and massaged to meet the analytics needs and also has access to informationabout data problems that can be used to forestall incorrect execution ofanalytics Examples of such databases include the ubiquitous OLAP9 cube that isvery prevalent in providing reporting and visualization analysis

Page 35: Data Sourcing - CMRCET

3)Data Mart

3)Data MartA Data Mart is a specialized cut from the data warehouse extracted for very specificbusiness needsOwnership of these data marts is typically vested with

the business units The business units can use these marts to create ad-hoc dimensions for specific analysis etc without upsetting the structure of the warehoused data

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4)Data Structuring and Transformationbull Hierarchical Dimensions and Factsmdasha Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts The star schema consists of one or more fact tables referencing any number of dimension tables

4)Data Structuring and Transformationbull A Normalized Structuremdashalso called a Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that The entity is in second normal form

5)Business Analytics Input Databases

5)Business Analytics Input DatabasesThe Business Analytics Input Database is an Operational Data Store into which data from various sources flows The data delivered to this stage must be a faithful representation of data in the source systemsNo additional checks are put in place to validate the data

available at this stage The design of this database needs to address how the data is synchronized with its source and to tune the delay (orcycle time) and data integrity

6)Business Analytics Ready DatabasesA Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built That is because it contains data that has beenscrubbed and massaged to meet the analytics needs and also has access to informationabout data problems that can be used to forestall incorrect execution ofanalytics Examples of such databases include the ubiquitous OLAP9 cube that isvery prevalent in providing reporting and visualization analysis

Page 36: Data Sourcing - CMRCET

3)Data MartA Data Mart is a specialized cut from the data warehouse extracted for very specificbusiness needsOwnership of these data marts is typically vested with

the business units The business units can use these marts to create ad-hoc dimensions for specific analysis etc without upsetting the structure of the warehoused data

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4)Data Structuring and Transformationbull Hierarchical Dimensions and Factsmdasha Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts The star schema consists of one or more fact tables referencing any number of dimension tables

4)Data Structuring and Transformationbull A Normalized Structuremdashalso called a Third Normal Form

Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that The entity is in second normal form

5)Business Analytics Input Databases

5)Business Analytics Input DatabasesThe Business Analytics Input Database is an Operational Data Store into which data from various sources flows The data delivered to this stage must be a faithful representation of data in the source systemsNo additional checks are put in place to validate the data

available at this stage The design of this database needs to address how the data is synchronized with its source and to tune the delay (orcycle time) and data integrity

6)Business Analytics Ready DatabasesA Business Analytics Ready Database is simply a specialized Data Mart upon which analytics models are built That is because it contains data that has beenscrubbed and massaged to meet the analytics needs and also has access to informationabout data problems that can be used to forestall incorrect execution ofanalytics Examples of such databases include the ubiquitous OLAP9 cube that isvery prevalent in providing reporting and visualization analysis

Page 37: Data Sourcing - CMRCET

3)Data MartData Marts are also not essential ingredients for BI infrastructure but get recommendedas a ldquobest practicerdquo in most BI implementations as they distribute the management of the Business Analytics Databases Since it remains the responsibility of decision modeling and decision making functions to take a holistic approach to analytics the proliferation of data marts simply means that there are more places where models are stored so more places to manage

4)Data Structuring and TransformationOne key process in a BI framework is the transformation and structuring of data into a convenient structure The choice of structure can bebull Hierarchical Dimensions and Factsmdasha Star Schemabull A Normalized Structuremdashalso called a Third Normal Form

4) Data Structuring and Transformation
• Hierarchical Dimensions and Facts, a Star Schema

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. It consists of one or more fact tables referencing any number of dimension tables.
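A minimal sketch of a star schema, assuming a small retail example: one fact table (`fact_sales`) references two dimension tables (`dim_product`, `dim_date`). The table names, columns, and sample rows are hypothetical and are not taken from the slides. It uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two dimension tables and one fact table that references both of them.
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    day     TEXT,
    month   TEXT
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    amount     REAL
);
""")

# Hypothetical sample data, for illustration only.
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Pen", "Stationery"), (2, "Notebook", "Stationery"), (3, "Mug", "Kitchen")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(1, "2021-10-01", "2021-10"), (2, "2021-11-01", "2021-11")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 10, 15.0), (2, 2, 1, 2, 8.0), (3, 3, 2, 1, 6.5), (4, 1, 2, 4, 6.0)],
)


def sales_by_category_and_month(connection):
    """Total sales amount per product category and month."""
    return connection.execute("""
        SELECT p.category, d.month, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date    AS d ON d.date_id    = f.date_id
        GROUP BY p.category, d.month
        ORDER BY p.category, d.month
    """).fetchall()


for row in sales_by_category_and_month(conn):
    print(row)
```

The measures (`quantity`, `amount`) live only in the fact table, while descriptive attributes such as `category` and `month` live in the dimensions, so a typical report is one join per dimension followed by a grouping.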






