computer fundamentals(final)

31
ISAS Presentation On Data WareHousing Submitted By: Submitted To:

Upload: siddharth

Post on 21-Dec-2015

8 views

Category:

Documents


0 download

DESCRIPTION

presentaion NIIT

TRANSCRIPT

ISAS

Presentation

On

Data WareHousing

Submitted By: Submitted To:

Siddharth Gaur Sajal Mathur

Anuj Dutta

Kanchan Bhatnagar

Computer Fundamentals

Input Devices (keyboard), Output Devices (monitor) and the CPU

A model expression for computer operation is the following:

INPUT - PROCESS - OUTPUTThis means that information is delivered (input) to the computer,

the computer processes the information and then sends (outputs)

processed information to the user.

Thus, a computer is composed of hardware that performs these

three functions, input, process, output.

The most common imput devices are the keyboard and

the mouse although many other input devices exist. For example, a

computer may monitor a hospital patient's heartbeart via contacts

placed on the patients chest. The contacts are an input device.

The most common output devices are the monitor and printer.

These enable the computer to present information in a format that

humans can understand.

The CPU (Central Processing Unit) has primary responsibility

for processing data. All the complex

processing that a computer performs, no

matter how sophisticated, can be

done with three simple functions

(instructions). These are 1) add two

numbers, 2) move a number from one location to another and 3)

compare two numbers and act upon the result of the comparison. 

Data WareHousing

A Definition of Data Warehousing

By Michael Reed

The data warehousing market consists of tools, technologies, and methodologies that allow for the construction, usage, management, and maintenance of the hardware and software used for a data warehouse, as well as the actual data itself.

In order to clear up some of the confusion that is rampant in the market, here are some definitions:

Data Warehouse:

The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows:

Subject Oriented:

Data that gives information about a particular subject instead of about a company's ongoing operations.

Integrated:

Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

Time-variant:

All data in the data warehouse is identified with a particular time period.

Non-volatile

Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.

Source System Identification

Source System Identification:

In order to build the data warehouse, the appropriate data must be located. Typically, this will involve both the current OLTP (On-Line Transaction Processing) system where the "day-to-day" information about the business resides, and historical data for prior periods, which may be contained in some form of "legacy" system. Often these legacy systems are not relational databases, so much effort is required to extract the appropriate data.

Data Warehouse Design and Creation:

This describes the process of designing the warehouse, with care taken to ensure that the design supports the types of queries the warehouse will be used for. This is an involved effort that requires both an understanding of the database schema to be created, and a great deal of interaction with the user community. The design is often an iterative process and it must be modified a

number of times before the model can be stabilized. Great care must be taken at this stage, because once the model is populated with large amounts of data, some of which may be very difficult to recreate, the model can not easily be changed.

Data Acquisition:

This is the process of moving company data from the source systems into the warehouse. It is often the most time-consuming and costly effort in the data warehousing project, and is performed with software products known as ETL (Extract/Transform/Load) tools. There are currently over 50 ETL tools on the market. The data acquisition phase can cost millions of dollars and take months or even years to complete. Data acquisition is then an ongoing, scheduled process, which is executed to keep the warehouse current to a pre-determined period in time, (i.e. the warehouse is refreshed monthly).

Changed Data Capture:

The periodic update of the warehouse from the transactional system(s) is complicated by the difficulty of identifying which records in the source have changed since the last update. This effort is referred to as "changed data capture". Changed data capture is a field of endeavor in itself, and many products are on the market to address it. Some of the technologies that are used in this area are Replication servers, Publish/Subscribe, Triggers and Stored Procedures, and Database Log Analysis.

Data Cleansing:

A data warehouse that contains incorrect data is not only useless, but also very dangerous. The whole idea behind a data warehouse is to enable decision-making. If a high level decision is made based on incorrect data in the warehouse, the company could suffer severe consequences, or even complete failure. Data cleansing is a complicated process that validates and, if necessary, corrects the data before it is inserted into the warehouse. For example, the company could have three "Customer Name" entries in its various source systems, one entered as "IBM", one as "I.B.M.", and one as "International Business Machines". Obviously, these are all the same customer. Someone in the organization must make a decision as to which is correct, and then the data cleansing tool will change the others to match the rule. This process is also referred to as "data scrubbing" or "data quality assurance". It can be an extremely complex process, especially if some of the warehouse inputs are from older mainframe file systems (commonly referred to as "flat files" or "sequential files").

Business Intelligence (BI)

A very broad field indeed, it contains technologies such as Decision Support Systems (DSS), Executive Information Systems (EIS), On-Line Analytical Processing (OLAP), Relational OLAP (ROLAP), Multi-Dimensional OLAP (MOLAP), Hybrid OLAP (HOLAP, a combination of MOLAP and ROLAP), and more. BI can be broken down into four broad fields: Multi-dimensional Analysis Tools:

Tools that allow the user to look at the data from a number of different "angles". These tools often use a multi-dimensional database referred to as a "cube".

Query tools:

Tools that allow the user to issue SQL (Structured Query Language) queries against the warehouse and get a result set back.

Data Mining Tools:

Tools that automatically search for patterns in data. These tools are usually driven by complex statistical formulas. The easiest way to distinguish data mining from the various forms of OLAP is that OLAP can only answer questions you know to ask, data mining answers questions you didn't necessarily know to ask.

Data Visualization Tools:

Tools that show graphical representations of data, including complex three-dimensional data pictures. The theory is that the user can "see" trends more effectively in this manner than when looking at complex statistical graphs. Some vendors are making progress in this area using the Virtual Reality Modeling Language (VRML).

Objectives of a Data warehouse and Data flow

The primary objective of data warehousing is to provide a consolidated, flexible meaningful data repository to the end user for reporting and analysis. All other objectives of Data warehousing are derived from this primary objective. The data flowin the warehouse also is determined by the objectives of data warehousing.

The data in a data warehouse is extracted from a variety of sources. OLTP databases, historical repositories and external data sources offload their data into the data warehouse. Achieving a constant and efficient connection to the data source is one of the objectives of data warehousing. This process is known as Data Source Interaction.

The data extracted from diverse sources will have to be checked for integrity and will have to be cleaned and then loaded into the warehouse for meaningful

analysis. Therefore, harnessing efficient data cleaning and loading technologies (ETL—Extraction, Transformation and Loading) to the warehousing system will be another objective of the data warehouse. This process is known as Data

Transformation service or Data preparation and staging.

The cleaned and stored data will have to be partitioned, summarized and stored for efficient query and analysis. Creating of subject oriented data marts, dimensional models of data and use of data mining technologies would follow, as the next objective of data warehousing. This process is called Data Storage. 

Finally tools necessary for query, analysis and reporting on data would have to be built into the system to the process to deliver a rich end user experience. This process is known as Data Presentation. 

Users need to understand what rules applied while cleaning and transforming data before storage. This information needs to be stored separately in a relational database called Metadata.

Metadata is “data about data”. Mapping rules and the maps between the data sources and the warehouse; Translation, transformation and cleaning rules; date and time stamps, system of origin, type of filtering, matching; Pre-calculated or derived fields and rules thereof are all stored in this database. In addition the metadata database contains a description of the data in the data warehouse; the navigation paths and rules for browsing the data in the data warehouse; the data directory; the list of pre-designed and built in queries available to the users.

Metadata Management

Throughout the entire process of identifying, acquiring, and querying the data, metadata management takes place. Metadata is defined as "data about data". An example is a column in a table. The datatype (for instance a string or integer) of the column is one piece of metadata. The name of the column is another. The actual value in the column for a particular row is not metadata - it is data. Metadata is stored in a Metadata Repository and provides extremely useful information to all of the tools mentioned previously. Metadata management has developed into an exacting science that can provide huge returns to an organization. It can assist companies in analyzing the impact of changes to database tables, tracking owners of individual data elements ("data stewards"), and much more. It is also required to build the warehouse, since the ETL tool needs to know the metadata attributes of the sources and targets in order to "map" the data properly. The BI tools need the metadata for similar reasons.

History of data warehousing

The concept of data warehousing dates back to the late 1980s [2] when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow - mainly, the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Each environment served different users but often required much of the same data. The process of gathering, cleaning and integrating data from various sources, usually long existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from the operational systems that were logically related to prior gathered data.

Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for corporate data. Data could be retrieved from one central point or data could be distributed to "retail stores" or "data marts" that were tailored for ready access by users.

Key developments in early years of data warehousing were:

1960s - General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[3]

1970s - ACNielsen and IRI provide dimensional data marts for retail sales.[3] 1983 - Teradata introduces a database management system specifically

designed for decision support. 1988 - Barry Devlin and Paul Murphy publish the article An architecture for a

business and information systems in IBM Systems Journal where they introduce the term "business data warehouse".

1990 - Red Brick Systems introduces Red Brick Warehouse, a database management system specifically for data warehousing.

1991 - Prism Solutions introduces Prism Warehouse Manager, software for developing a data warehouse.

1991 - Bill Inmon publishes the book Building the Data Warehouse. 1995 - The Data Warehousing Institute, a for-profit organization that promotes

data warehousing, is founded. 1996 - Ralph Kimball publishes the book The Data Warehouse Toolkit. 1997 - Oracle 8, with support for star queries, is released.

Data warehouse architecture

Architecture, in the context of an organization's data warehousing efforts, is a conceptualization of how the data warehouse is built. There is no right or wrong architecture, rather multiple architectures exist to support various environments and situations. The worthiness of the architecture can be judged in how the conceptualization aids in the building, maintenance, and usage of the data warehouse.

One possible simple conceptualization of a data warehouse architecture consists of the following interconnected layers:

Operational database layer The source data for the data warehouse - An organization's Enterprise Resource Planning systems fall into this layer.

Data access layer The interface between the operational and informational access layer - Tools to extract, transform, load data into the warehouse fall into this layer.

Metadata layer The data directory - This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

Informational access layer The data accessed for reporting and analyzing and the tools for reporting and analyzing data - Business intelligence tools fall into this layer. And the Inmon-Kimball differences about design methodology, discussed later in this article, have to do with this layer.

Normalized versus dimensional approach for storage of data

There are two leading approaches to storing data in a data warehouse - the dimensional approach and the normalized approach.

In the dimensional approach, transaction data are partitioned into either "facts", which are generally numeric transaction data, or "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order. A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantages of the dimensional approach are: 1) In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and 2) It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.

In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.) The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to 1) join data from different sources into meaningful information and then 2) access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

These approaches are not mutually exclusive. Dimensional approaches can involve normalizing data to a degree.

Conforming information

Another important fact in designing a data warehouse is which data to conform and how to conform the data. For example, one operational system feeding data into the data warehouse may use "M" and "F" to denote sex of an employee while another operational system may use "Male" and "Female". Though this is a simple example, much of the work in implementing a data warehouse is devoted to making similar meaning data consistent when they are stored in the data warehouse. Typically, extract, transform, load tools are used in this work.

Master Data Management has the aim of conforming data that could be considered "dimensions".

APPLICATIONS

A Decision Support System (DSS) implemented by building a

data warehouse is an asset to an organization whose success

depends to a large extent on customer satisfaction. It would

therefore find application in the following sectors:

Government organizations

Banks

Insurance Companies

Utilities Providers

Health Care Providers

Financial Services Companies

Telecommunications Service Providers

Travel, Transport and Tourism Companies

Security Agencies

Evolution in organization use of data warehouses

Organizations generally start off with relatively simple use of data warehousing. Over time, more sophisticated use of data warehousing evolves. The following general stages of use of the data warehouse can be distinguished:

Off line Operational Database  Data warehouses in this initial stage are developed by simply copying the data off an operational system to another server where the processing load of reporting against the copied data does not impact the operational system's performance.

Off line Data Warehouse  Data warehouses at this stage are updated from data in the operational systems on a regular basis and the data warehouse data is stored in a data structure designed to facilitate reporting.

Real Time Data Warehouse  Data warehouses at this stage are updated every time an operational system performs a transaction (e.g. an order or a delivery or a booking.)

Integrated Data Warehouse  Data warehouses at this stage are updated every time an operational system performs a transaction. The data warehouses then generate transactions that are passed back into the operational systems.

Data Mining

Data Mining is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both.

Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

They support the OLAP concept and include query-and-reporting tools, intelligent agents, multi-dimensional analysis tools, and statistical tools.

Use Of Data Mining & Data Warehousing

In order to obtain a competitive advantage and make the best use of our data we implemented the use of data mining.Throughout the year we gather huge amounts of information (DATA) such as: book & magazine sales, best sellers, customer information on the customer loyalty program etc. That information is very valuable as it can help with future pricing, stock allocation and customer satisfaction. As this information is so large, every 3 months we transfer our current Data in our operational data bases to our data warehouse where information in their dates back for years. It enables Our company to determine relationships among"internal" factors such as price, product positioning, or staff skills, and "external" factors such as suppliers, independent surveys, publishers, economic indicators and competition.

How We Benefit

Seeing the time periods in which products sell(which month/season) e.g. more educational books are sold in the months of August, September and October.

Gather Information about the activities and trends of our On-Line Users. Comparing sales of a particular book/Magazine to that of a competitor. Identify which books might sell together(even possibly what book and

magazine). Customers use of coupons & special offers. Customers preferences and behaviours(on the customer loyalty Program).

Data MartA data mart is a subset of an organizational data store,

usually oriented to a specific purpose or major data subject,

that may be distributed to support business needs. Data marts

are analytical data stores designed to focus on specific

business functions for a specific community within an

organization. Data marts are often derived from subsets of

data in a data warehouse, though in the bottom-up data

warehouse design methodology the data warehouse is created

from the union of organizational data marts.

Reasons for creating a data mart

Easy access to frequently needed data

Creates collective view by a group of users Improves end-user response time Ease of creation Lower cost than implementing a full Data

warehouse Potential users are more clearly defined than in a

full Data warehouse

Dependent data mart

According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view) or a physical subset (extract) of a largerdata warehouse, isolated for one of the following reasons:

A need for a special data model or schema: e.g., to restructure for OLAP

Performance: to offload the data mart to a separate computer for greater efficiency or to obviate the need to manage that workload on the centralized data warehouse.

Security: to separate an authorized data subset selectively

Expediency: to bypass the data governance and authorizations required to incorporate a new application on the Enterprise Data Warehouse

Proving Ground: to demonstrate the viability and ROI (return on investment) potential of an application prior to migrating it to the Enterprise Data Warehouse

Politics: a coping strategy for IT (Information Technology) in situations where a user group has more influence than funding or is not a good citizen on the centralized data warehouse.

Politics: a coping strategy for consumers of data in situations where a data warehouse team is unable to create a usable data warehouse.

According to the Inmon school of data warehousing, tradeoffs inherent with data marts include limited

scalability, duplication of data, data inconsistency with other silos of information, and inability to leverage enterprise sources of data.

Benefits of data warehousing

Some of the benefits that a data warehouse provides are as follows:

A data warehouse provides a common data model for all data of interest regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.

Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.

Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.

Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.

Data warehouses can work in conjunction with and, hence, enhance the value of operational business applications, notably customer relationship management (CRM) systems.

Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.

DATA WAREHOUSING IN eGOVERNANCE

Need for Data Warehouse for eGovernance

Information is one of the valuable assets to any Government. When used properly, it can help planners and decision makers in making informed decisions leading to positive impact on targeted group of citizens. However to use information to it's fullest potential, the planners and decision makers need instant access to relevant data in a properly summarized form. In spite of taking lots of initiative for computerization, the Government decision makers are currently having difficulty in obtaining meaningful information in a timely

manner because they have to request and depend on IT staff for making special reports which often takes long time to generate. An Information Warehouse can deliver strategic intelligence to the decision makers and provide an insight into the overall situation. This greatly facilitates decision-makers in taking micro level decisions in a timely manner without the need to depend on their IT staff. By organizing person and land-related data into a meaningful Information Warehouse, the Government decision makers can be empowered with a flexible tool that enables them to make informed policy decisions for citizen facilitation and accessing their impact over the intended section of the population.

Benefits of a Data Warehouse for eGovernance

Citizen facilitation is the core objective of any Government body. For facilitating the citizens of a state or a country, it is important to have the right information about the people and the places of the concerned territory. Hence a data warehouse built for eGovernance can typically have data related to person and land. Such a data warehouse can be beneficial to both the Government decision makers and citizens as well in the following manner:

How can decision makers benefit

They do not have to deal with the heterogeneous and sporadic information generated by various state-level computerization projects as they can access current data with a high granularity from the information warehouse.

They can take micro-level decisions in a timely manner without the need to depend on their IT staff.

They can obtain easily decipherable and comprehensive information without the need to use sophisticated tools.

They can perform extensive analysis of stored data to provide answers to the exhaustive queries to the administrative cadre. This helps them to formulate more effective strategies and policies for citizen facilitation

How can citizens benefit

They are the ultimate beneficiaries of the new policies formulated by the

decision makers and policy planner's extensive analysis on person and land-

related data.

They can view frequently asked queries whose results will already be there in

the database and will be immediately shown to the user saving the time

required for processing.

They can have easy access to the Government policies of the state.

The web access to Information Warehouse enables them to access the public

domain data from anywhere.

A Data Warehouse for eGovernance

The Centre for Development of Advanced Computing (C-DAC) in

collaboration with the Andhra Pradesh Technology Services (APTS) has

developed a data warehouse for aiding the state level decision makers of

Andhra Pradesh (AP) Government in their decision making process. The main

objective of this effort is to organize the Multipurpose Household Survey

(MPHS) data and the land records data of the AP Government into a

meaningful information warehouse for enabling the decision makers in

making informed decisions and accessing their impact over the intended

section of the population.

Disadvantages of data warehouses

There are also disadvantages to using a data warehouse. Some of them are:

Data warehouses are not the optimal environment for unstructured data. Because data must be extracted, transformed and loaded into the warehouse,

there is an element of latency in data warehouse data. Over their life, data warehouses can have high costs. The data warehouse is

usually not static. Maintenance costs are high. Data warehouses can get outdated relatively quickly. There is a cost of

delivering suboptimal information to the organization. There is often a fine line between data warehouses and operational systems.

Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems and vice versa.

The future of data warehousing

Data warehousing, like any technology niche, has a history of innovations that did not receive market acceptance.

A 2009 Gartner Group paper predicted these developments in business intelligence/data warehousing market .

Because of lack of information, processes, and tools, through 2012, more than 35 per cent of the top 5,000 global companies will regularly fail to make insightful decisions about significant changes in their business and markets.

By 2012, business units will control at least 40 per cent of the total budget for business intelligence.

By 2010, 20 per cent of organizations will have an industry-specific analytic application delivered via software as a service as a standard component of their business intelligence portfolio.

In 2009, collaborative decision making will emerge as a new product category that combines social software with business intelligence platform capabilities.

By 2012, one-third of analytic applications applied to business processes will be delivered through coarse-grained application mashups.

Bibliography

B.Sc.(IT) Book

Kuvempu University

(Karnataka)