
Companion Guidebook

Learning course
Data warehouse architectures and development strategy

Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A; Distinguished IT Architect, Open Group


Table of Contents

Data Warehouse Architectures - I
    Abstract
    OLAP and OLTP
    Six levels of data warehouse architectures
    Virtual data warehouse
    Independent data marts
    Conclusion
    Literature

Data Warehouse Architectures - II
    Abstract
    Centralized data warehouse with ETL
    Centralized data warehouse with ELT
    Centralized DW with operational data store
    Extended model with data marts
    Conclusion
    Literature

Data Warehouse Architectures - III
    Abstract
    Centralized ETL with parallel DW and data marts
    DW with intermediate application data marts
    Data warehouse with integration bus
    Recommended EDW architecture
    Conclusion
    Literature

Data, Metadata and Master Data: The Triple Strategy for Data Warehouse Projects
    Abstract
    Introduction
    Master data management
    Metadata management
    Data, metadata and master data interrelations
        Data and metadata
        Data and master data
        Metadata and master data
    Components of enterprise data warehouse
    Example of existing approach
    The practical realization of the triple strategy
    Conclusion
    Literature

Metadata Management Using IBM Information Server
    Abstract
    Glossary
    Metadata types
    Success criteria of metadata project
    Metadata management lifecycle
    IBM Information Server metadata management tools
    Roles in metadata management project
    The roles support by IBM Information Server tools
    Conclusion

Incremental Implementation of IBM Information Server's Metadata Management Tools
    Abstract
    Scenario, current situation and business goals
    Logical topology – as is
    Architecture of metadata management system
    Architecture of metadata management environment
    Architecture of metadata repository
    Logical topology – to be
    Two phases of extended metadata management lifecycle
    "Metadata elaboration" phase
    "Metadata production" phase
    Roles and interactions on metadata elaboration phase
    Roles and interactions on metadata production phase
    Adoption route 1: metadata elaboration
    Adoption route 2: metadata production
    Conclusion
    Literature

Master Data Management with Practical Examples
    Abstract
    Basic concepts and terminology
    Reference data (RD) and master data (MD)
    Enterprise RD & MD management
    Technological shortcomings of RD & MD management
        No unified data model for RD & MD
        There is no single regulation of history and archive management
        The complexity of identifying RD & MD objects
        The emergence of duplicate RD & MD objects
        Metadata inconsistency of RD & MD
        Referential integrity and synchronization of RD & MD model
        Discrepancy of RD & MD object life cycle
        Clearance rules development
        Wrong core system selection for RD & MD management
        IT systems are not ready for RD & MD integration
    Examples of traditional RD & MD management issues
        Passport data as a unique identifier
        Address as a unique identifier
        The need for mass contracts' renewal
        The discrepancy between the consistent data
    Benefits of corporate RD & MD
        Law compliance and risk reduction
        Profits increase and customer retention
        Cost reduction
        Increased flexibility to support new business strategies
    Architectural principles of RD & MD management
    Conclusion
    Literature

Data Quality Management Using IBM Information Server
    Abstract
    Introduction
    Metadata and project success
    Metadata and master data paradox
    Metadata impact on data quality
    Data quality and project stages
    Quality management in metadata life cycle
    Data flows and quality assurance
    Roles, interactions and quality management tools
    Necessary and sufficient tools
    Conclusion
    Literature

Primary Data Gathering and Analysis System - I
    Abstract
    Introduction
    System requirements
    Project objectives
        Development of e-forms for approved paper forms
        Development of e-forms for new paper forms
        Development of storage for detailed data
        Development of analytical tools
        Development of reporting and visualization tools
        Information security
        Data back-up
        Data archiving
        Logging system events
        Success criteria
    Architecture of system for data collection, storage and analysis
    Data collection
    Data storage
    Conclusion
    Literature

Primary Data Gathering and Analysis System - II
    Abstract
    Data analysis using IBM InfoSphere Warehouse
        Cubing Services & Alphablox based OLAP
        Text and data mining
        Data mining using MiningBlox & Alphablox
        Data mining using Intelligent Miner
        Text analysis
        Data mining application development
    Data analysis using IBM Cognos Business Intelligence
    Enterprise planning using Cognos TM1
    Conclusion
    Literature

Data Warehousing: Triple Strategy in Practice
    Abstract
    Introduction
    Architecture of primary data gathering and analysis system
    Role of metadata and master data management projects
    Recommended DW architecture
    Relation between the recommended architecture and the solution
    Comparison of proposed and existing approaches
    The final architecture of implementation of existing approaches
    Conclusion
    Literature


Data Warehouse Architectures - I

Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
19.10.2009
http://www.ibm.com/developerworks/ru/library/sabir/axd_1/index.html

Abstract

This paper opens a series of three articles on data warehouse (DW) architectures and their predecessors. The abundance of approaches, methods and recommendations blurs the concepts, advantages, drawbacks, limitations and applicability of specific architectural solutions. The first article is concerned with the evolution of the understanding of the OLAP role, with DW architecture components, and with the virtual DW and independent data marts. The second article considers the centralized DW (CDW) with ETL (Extract, Transform, Load), the CDW with ELT (Extract, Load, Transform), the CDW with an operational data store, and the extended model with data marts. The third article discusses centralized ETL with parallel DW and data marts, the DW with intermediate application data marts, the DW with an integration bus, and the recommended DW architecture.

OLAP and OLTP

Any transactional system usually contains two types of tables. One type is responsible for quick transactions. For example, a ticket sales system has to ensure reliable exchange of short messages between the system and a large number of ticket agents. Indeed, the information entered and printed for each passenger (name, flight date, flight number, seat, destination) amounts to roughly 1000 bytes. Thus, passenger service requires fast processing of short records.

The other type of tables contains sales summaries for a specified period, by destination and by category of passenger. These tables are used by analysts and financial specialists at the end of a month, quarter or year, when the company's financial results are needed. And while the number of analysts is ten times smaller than the number of ticket agents, the volume of data required for analysis exceeds the average transaction size by several orders of magnitude.
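To make the contrast concrete, here is a minimal, purely illustrative sketch using Python's built-in sqlite3 module; the hypothetical ticket_sales table and its columns are invented for this example. The OLTP-style statements touch one short record, while the OLAP-style query scans and aggregates the whole table.

    import sqlite3

    # Hypothetical ticket sales table; the schema is invented for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE ticket_sales (
            passenger_name TEXT,
            flight_no      TEXT,
            flight_date    TEXT,
            seat           TEXT,
            destination    TEXT,
            price          REAL
        )
    """)

    # OLTP-style work: one short record (well under 1000 bytes) is written and read back.
    conn.execute(
        "INSERT INTO ticket_sales VALUES (?, ?, ?, ?, ?, ?)",
        ("I. Ivanov", "SU-123", "2009-05-06", "12A", "LED", 150.0),
    )
    booking = conn.execute(
        "SELECT seat, price FROM ticket_sales WHERE flight_no = ? AND passenger_name = ?",
        ("SU-123", "I. Ivanov"),
    ).fetchone()

    # OLAP-style work: an aggregate over the whole table (sales by destination),
    # the kind of query analysts run at the end of a month, quarter or year.
    summary = conn.execute("""
        SELECT destination, COUNT(*) AS tickets, SUM(price) AS revenue
        FROM ticket_sales
        GROUP BY destination
    """).fetchall()

    print(booking, summary)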

Not surprisingly, the execution of analytical queries increases the system's response time to ticket availability requests. Building a system with a reserve of computational power can mitigate the negative impact of the analytical load on transactional activity, but it significantly increases the cost of the required software and hardware, because the excess processing capacity remains unused most of the time. The second factor that led to the separation of analytical and transactional systems is the different requirements that analytical and transactional workloads place on computing systems.

The OLAP story begins in 1993, when the article "Providing OLAP (On-line Analytical Processing) to User-Analysts" was published [1]. Initially it seemed that separating transactional and analytical systems (OLTP and OLAP) would be sufficient.

However, it soon became clear that OLAP systems cope poorly with the role of mediator between various data sources (including transactional systems) and client applications.

It became clear that a dedicated environment for storing analytical data was required. Initially, shared databases were proposed for this role, which implied copying and storing the original data from the data sources. The idea proved unviable, as transactional systems were commonly developed without a unified plan and therefore contained conflicting and inconsistent information.


Pic. 1. Evolution of the understanding of OLTP and OLAP

These implications led to the idea of a data warehouse designed for secure data storage, together with systems for data extraction, transformation and loading (ETL). OLAP systems were now to use information from the data warehouse.

It was soon revealed that a data warehouse accumulates very important enterprise information, and any unauthorized access to the DW is fraught with serious financial losses. Moreover, data formats oriented towards reliable storage are hard to combine with the requirements for fast information service. The geographical distribution and organizational structure of an enterprise also require a specific approach to information services for each business unit. The solution is a data mart that contains the required subset of information from the DW. Data can be loaded from the DW into a data mart during periods of low user activity, and in case of a data mart failure the data can easily be restored from the DW with minimal losses.

Data marts can support reporting, statistical analysis, planning, scenario calculations (what-if analysis) and, in particular, multidimensional analysis (OLAP).

Thus, OLAP systems, which initially claimed almost half of the computing world (leaving the other half to OLTP systems), now rank among analytical tools at the workgroup level.

Six levels of data warehouse architectures

Data warehouse architecture at times resembles a set of child's toy blocks: almost any arbitrary combination of blocks represents something you can meet in real life. Sometimes a company turns out to have several enterprise data warehouses, each positioned as the sole, unified source of consistent information.

Multilevel data marts in the presence of a unified data warehouse bring even more fun. Why shouldn't we build a new DM on top of the DW? After all, users want to combine some data from two DMs into a third one. Perhaps it would make sense if the DMs contained information that is not in the DW, for example if users have enriched a DM with their own calculations and data. But even so, what is the value of these enriched data compared with data that have passed through a cleaning sieve in accordance with enterprise policies? Who is responsible for the quality of the data? How did they appear in the system? Nobody knows, but everyone wants access to information that is not in the DW.

Data warehouses are somewhat similar to a water purification system. Water with different chemical compositions is collected from various sources, so specific cleaning and disinfection methods are applied to each source. The water delivered to consumers meets strict quality standards. And however much we complain about the quality of the water, this approach prevents the spread of epidemics in the city. It occurs to no one (I hope) to enrich purified water with water from a nearby pond. IT, however, has its own laws.

Various data warehousing architectures will be considered later, though extremely exotic approaches will not be examined.

We will discuss the architecture of the enterprise data warehouse in terms of six layers, because even when individual components are absent, the layers themselves exist in some form.

Pic. 2. Six layers of DW architecture

The first layer consists of data sources, such as transactional and legacy systems, archives, separate files of known formats, MS Office documents, and any other structured data sources.

The second layer hosts the ETL (Extract, Transform and Load) system. The main objective of ETL is to extract data from multiple sources, bring them to a consistent form and load them into the DW. The hardware and software system on which ETL runs must have a large throughput, but high computing performance is even more important. Therefore, the best ETL systems are able to provide a high degree of task parallelism and can run even on clusters and computational grids.
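As a rough, single-machine illustration of the task parallelism expected from this layer, the sketch below extracts and transforms data for several sources in parallel worker processes before a single load step. The extract, transform and load functions and the source names are hypothetical placeholders, not part of any particular ETL product.

    from multiprocessing import Pool

    # Hypothetical placeholders for the three ETL stages.
    def extract(source_id):
        """Pull raw rows from one data source (stubbed with generated data)."""
        return [{"source": source_id, "value": n} for n in range(3)]

    def transform(rows):
        """Bring rows from one source to the common warehouse format."""
        return [{"src": r["source"], "value_cents": r["value"] * 100} for r in rows]

    def load(batches):
        """Load all transformed batches into the warehouse (stubbed as a print)."""
        total = sum(len(batch) for batch in batches)
        print(f"loaded {total} rows from {len(batches)} sources")

    if __name__ == "__main__":
        sources = ["billing", "crm", "legacy_archive"]   # illustrative source names
        with Pool(processes=3) as pool:                  # one worker per source
            raw = pool.map(extract, sources)             # parallel extraction
            transformed = pool.map(transform, raw)       # parallel transformation
        load(transformed)                                # single, ordered load step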


The role of the next layer is reliable data storage, protected from unauthorized access. Under the proposed triple strategy [2], we believe that the metadata and master data management systems should also be placed at this layer. An operational data store (ODS) is needed when quick access is required even to incomplete, not fully consistent data, with the least possible time lag. A staging area is needed to implement a specific business process, for example when a data steward must review data and approve loading the reviewed data into the DW.

Storage areas are sometimes understood as a database buffer needed for internal processing operations: for example, ETL retrieves data from a source, writes them into an internal database, cleans them and loads them into the DW. In this paper the term "staging zone" is used for storage areas designed for operations performed by external users or systems in accordance with business requirements for data processing. Separating the staging zone into a specific DW component is necessary because these zones require additional administration, monitoring, security and audit processes.

Information systems at the data distribution layer still do not have a common name. They may simply be called ETL, like the extraction, transformation and loading system at the second layer, or, to emphasize the difference, ETL-2. Data distribution systems at the fourth layer perform tasks that differ significantly from those of ETL, namely sampling, restructuring and data delivery (SRD: Sample, Restructure, Deliver).

ETL extracts data from a variety of external systems; SRD selects data from a single DW. ETL receives inconsistent data that have to be converted to a common format; SRD deals with cleansed data whose structure must be brought into compliance with the requirements of different applications. ETL loads data into a central DW; SRD delivers data to the different data marts in accordance with access rights, the delivery schedule and the required information set.

The information access layer is intended to separate data storage functions from the information support functions for various applications. Data marts must have a data structure that best suits the needs of their information support tasks. Since no universal data structure is optimal for all applications, data marts should be grouped by geographical, thematic, organizational, functional and other characteristics.

The business applications layer comprises scenario and statistical analysis, multidimensional analysis, planning and reporting tools, and other business applications.

Virtual Data Warehouse

The virtual data warehouse belongs to the romantic era, when it seemed that everything the human mind can imagine could be implemented. No one remembers virtual DWs, and so they are invented again and again, each time at a new level. So we have to start with something that is long gone but keeps trying to be reborn in a new guise.

The concept of the virtual data warehouse was based on a few sublimely beautiful ideas.

The first great idea is cost reduction. There is no need to spend money on expensive equipment for a central data warehouse, no need for qualified personnel to maintain this repository, and no need for server rooms with expensive cooling, fire control and monitoring equipment.

The second idea: we should work with the most recent data. The analytical system must work directly with the data sources, bypassing all middlemen. The intermediary is evil, everyone knows that. Our experts have no confidence in mediator applications; they have always worked directly with the source systems.


The third idea: we will write everything we need ourselves. All that is needed is a workstation, access to the data sources, and a compiler. Our programmers are sitting idle anyway. They will develop an application that, at the user's request, queries all the sources itself, delivers the data to the user's computer, converts the divergent formats, performs the analysis, and shows everything on the screen.

Pic. 3. Virtual Data Warehouse

Does the company have many users with different needs? Do not worry, we will modify our universal application for as many variants as you want.

Is there a new data source? That’s wonderful. We will rewrite all of our applications, taking into account the peculiarities of this source.

Did the data format change? Fine. We will rewrite all of our applications to reflect the new format.

Everything seems to be going well: everybody is busy, more staff is needed, and the software development department should be expanded.

Oh, and the users of the data source systems are complaining that their systems have become very slow: every time, even if the same request has already been made before, our universal client application queries the data sources again and again. So new, more powerful servers have to be purchased.

What about the cost reduction? There is none. On the contrary, costs only grow: more developers, more servers, more power, and more space for server rooms.

Are there still any benefits from this architecture?

What we got instead is tight coupling between data sources and analytical applications. Any change in the source data must be agreed with the developers of the universal client, to avoid passing distorted or misinterpreted data to the analysis applications. And a set of interfaces to the different data source systems has to be maintained on every workstation.


There is an opinion that all this is obvious and that it is not worth wasting time explaining things everyone understands. But then why, when a user asks for data from data marts A and B, do the same developers write a client application that accesses multiple data marts, reproducing again and again the dead architecture of the virtual data warehouse?

Independent data marts

Independent data marts emerged as a physical realization of the understanding that transactional and analytical data processing do not get along well on a single computer.

The reasons for incompatibility are as follows:

• Transactional processing is characterized by a large number of reads and writes to the database. Analytical processing may issue only a few queries to the database.

• A record length in OLTP is typically less than 1000 characters. A single analytical query may require megabytes of data for analysis.

• The number of transactional system users can reach several thousand employees. The number of analysts is usually within a few tens.

• A typical requirement for transactional systems is round-the-clock non-stop operation 365 days a year (24 x 365). Analytical processing has no such well-defined availability requirements, but a report that is not prepared in time can lead to serious trouble for the analysts as well as for the company.

• The transactional system’s load is distributed more or less evenly over the year. The analytical system’s load is usually maximal at the end of accounting periods (month, quarter, year).

• Transactional processing is mainly carried out using current data. Analytical calculations address historical data.

• Data in transactional systems can be updated, whereas in analytical systems data should only be appended. Any attempt to change data retroactively should at the very least raise suspicion.

Thus, transactional and analytical systems place different requirements on both software and hardware in terms of performance, capacity, availability, data models, data storage organization, data access methods, peak loads, data volumes and processing methods.

The creation of independent data marts was the first response to the need to separate analytical and transactional systems. In those days it was a big step forward, simplifying the design and operation of software and hardware, which no longer had to satisfy the mutually exclusive requirements of analytical and transactional systems.

The advantage of independent data marts is the ease and simplicity of their organization: each of them operates on the data of one specific application, so there is no problem with metadata and master data, and no need for complex extraction, transformation and loading (ETL) systems. Data are simply copied from a transactional system to a data mart on a regular basis. One application, one data mart. Therefore, independent data marts are often called application data marts.

But what if users need information from multiple data marts simultaneously? Developing complex client applications that query many data marts at a time and convert the data on the fly has already been discredited by the virtual data warehouse approach.


Pic. 4. Independent data marts

So a single repository, a data warehouse, is needed. But the information in the data marts is not consistent. Each data mart has inherited from its transactional system its terminology, data model and master data, including the data encoding. For example, in one system the date of an operation may be encoded in the Russian format dd.mm.yyyy (day, month, year), and in another in the American format mm.dd.yyyy (month, day, year). When merging the data it is necessary to understand what 06.05.2009 means: June 5 or May 6. That is why an ETL (extract, transform and load) system is needed.
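A two-line check in Python makes the ambiguity visible: the same string yields two different calendar dates depending on which format the source system used.

    from datetime import datetime

    raw = "06.05.2009"
    russian = datetime.strptime(raw, "%d.%m.%Y")    # day.month.year  -> 6 May 2009
    american = datetime.strptime(raw, "%m.%d.%Y")   # month.day.year  -> 5 June 2009

    print(russian.date(), american.date())          # 2009-05-06 2009-06-05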

Thus, the benefits of independent data marts disappear as soon as users need to work with data from several data marts.

Conclusion

This article dealt with the evolution of the understanding of the OLAP role, with the DW component architecture, and with the virtual DW and independent data marts. The next papers will discuss the advantages and limitations of the following architectures: a centralized DW with an ETL system, a DW with an ELT system, a central data warehouse with an operational data store (ODS), the extended model with data marts, centralized ETL with parallel DW and data marts, a DW with intermediate application data marts, a data warehouse with an integration bus, and the recommended DW architecture.

Literature

1. Codd E.F., Codd S.B., Salley C.T. "Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate". Codd & Date, Inc., 1993.

2. Asadullaev S. "Data, metadata and master data: the triple strategy for data warehouse projects", http://www.ibm.com/developerworks/ru/library/r-nci/index.html, 2009.


Data Warehouse Architectures - II

Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
23.10.2009
http://www.ibm.com/developerworks/ru/library/sabir/axd_2/index.html

Abstract

This second paper continues a series of three articles on data warehouse (DW) architectures and their predecessors. The abundance of approaches, methods and recommendations blurs the concepts, advantages, drawbacks, limitations and applicability of specific architectural solutions. The first article [1] was concerned with the evolution of the understanding of the OLAP role, with DW architecture components, and with the virtual DW and independent data marts. This publication considers the centralized DW (CDW) with ETL (Extract, Transform, Load), the CDW with ELT (Extract, Load, Transform), the CDW with an operational data store, and the extended model with data marts. The third article discusses centralized ETL with parallel DW and data marts, the DW with intermediate application data marts, the DW with an integration bus, and the recommended DW architecture.

Centralized Data Warehouse with ETL

The virtual data warehouse and independent data marts showed that a unified data repository is required for the effective operation of analytical systems. To fill this repository, disparate data must be extracted from the various data sources, reconciled, and loaded into the repository.

ETL tools should be aware of all the information about the data sources: the structure and formats of the stored data, differences in data processing methods, the meaning of the stored data, and the data processing schedules of the transactional systems. Ignoring this information about data (metadata) inevitably degrades the quality of the information loaded into the repository. As a result, users lose confidence in the data warehouse and try to get information directly from the sources, which wastes the time of the specialists who maintain the data source systems.

Thus, ETL tools must use the information about data sources. Therefore, ETL tools should work in close conjunction with metadata management tools.

Extracted data should be converted into a unified form. Since the data are stored mainly in relational databases, the differences in coded values must be taken into account: dates can be encoded in different formats, addresses may use different abbreviations, and product encoding may follow different nomenclatures. Initially, information about master data was embedded in the data conversion algorithms of the ETL tools. As the number of data sources and the volume of processed data grew (the former can reach thousands of systems, the latter can exceed ten terabytes per day), it became clear that master data management (MDM) must be separated from ETL, and that effective interaction between MDM and ETL must be ensured.

Thus, ETL tools, in conjunction with metadata and master data management tools, extract data from the sources, transform them to the required format and load them into a data repository. Usually the data warehouse is used to store the data, but the target can also be an operational data store (ODS), a staging area, or even a data mart. Therefore, one of the key requirements for ETL tools is their ability to interact with various systems.
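As a hedged sketch of this division of labour, the transform step below delegates code reconciliation to a separate master data lookup instead of hard-coding the mappings inside the ETL logic. The mapping table, field names and codes are invented for illustration and do not come from any particular MDM product.

    # Illustrative master data: source-specific product codes mapped to a single
    # enterprise-wide code. In a real system this mapping would live in the MDM
    # repository, not inside the ETL code itself.
    MASTER_PRODUCT_CODES = {
        ("billing", "P-001"): "PROD-0001",
        ("crm", "prod_1"):    "PROD-0001",
    }

    def to_master_code(source, local_code):
        """Ask the (stubbed) master data service for the enterprise product code."""
        return MASTER_PRODUCT_CODES.get((source, local_code), "UNKNOWN")

    def transform(record):
        """ETL transform step: keep the logic generic and pull all code
        reconciliation from master data instead of embedding it here."""
        return {
            "product": to_master_code(record["source"], record["product_code"]),
            "amount":  record["amount"],
        }

    print(transform({"source": "crm", "product_code": "prod_1", "amount": 42}))
    # {'product': 'PROD-0001', 'amount': 42}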

The growing volume of processed data and the need to deliver analytical information more promptly impose increased requirements on the performance and scalability of ETL tools. Therefore, ETL tools should use various parallel computing schemes and be able to run on high-performance systems of different architectures.

Pic. 1. Centralized Data Warehouse with ETL

As can be seen, ETL tools must meet diverse requirements:

• Data from the various data source systems should be collected even if one or more systems fail to complete their data processing in time, and at least part of the required data should still be provided.

• The collected information must be recognized and converted in accordance with the transformation rules, with the help of the metadata and master data management systems.

• The transformed information must be loaded into a staging zone, a data warehouse, an ODS or a data mart, as required by business and production processes.

• ETL tools must have a high throughput to collect and load the ever-increasing data volumes into various repositories.

• ETL tools must possess high performance and scalability to reduce data processing time and to shorten lags in providing data for analytical tasks.

• ETL tools should provide various data extraction mechanisms for different operating environments: from batch data collection, which is not sensitive to time delays, to practically real-time incremental data processing.

Given these often mutually exclusive requirements, the design and development of ETL tools becomes a difficult task, especially when ready-made solutions are not used.


Centralized Data Warehouse with ELT

The traditional ETL system is often blamed for poor efficiency and high cost due to the dedicated hardware and software it requires. As an alternative, ELT (Extract, Load, Transform) tools were proposed, which are credited with high performance and efficient use of equipment.

To understand the comparative advantages and disadvantages of ETL and ELT systems, let us turn to the three main functions of an enterprise data warehouse (EDW):

1. Full and timely collection and processing of information from data sources;

2. Safe and secure data storage;

3. Provision of data for analytical tasks.

The input to ETL/ELT systems is disparate data which have to be compared, cleaned, transformed to a common format and processed according to calculation algorithms. On the one hand, data hardly stay in ETL/ELT systems at all; on the other hand, the main information stream flows through these systems into the data repositories. Therefore, the information security requirements for them can be moderate.

Pic. 2. Centralized Data Warehouse with ELT

As a rule, the central data warehouse (CDW) contains a wealth of information whose full disclosure could lead to serious losses for the company. A reliable information security perimeter is therefore required around the CDW. The data structures in the CDW should best fit the requirements of long-term, reliable and secure storage. Using the ELT approach means that the CDW must also perform the data transformation.

Data delivery for analytical tasks requires a specific reorganization of data structures for each analytical application. Multidimensional analysis requires data cubes; statistical analysis, as a rule, uses data series; scenario and model analysis may use MS Excel files. In this architecture business applications use data from the CDW directly, so the CDW has to store data structures that are optimized both for current and for future business applications. Moreover, such direct access increases the risk of unauthorized access to all the data in the CDW.

Thus, this architecture entrusts the CDW with the data transformation function and with information services for analytical tasks. Both functions are alien to a CDW, which in this form becomes an "all in one" unit whose functional components are generally of lower quality than if they were implemented separately (like a camera in a mobile phone).

We will discuss later how the data storage functions can be separated from the functions of data delivery to analytical applications.

The ETL scheme makes it possible to separate data processing and data storage functions. The ELT scheme burdens the CDW with data conversion functions that do not belong there. Migrating the ETL functionality inside the CDW forces us not only to provide the same processing power, but also to design a universal platform that can both process and store data efficiently. Such an approach may be acceptable in the SOHO segment, but an enterprise-wide system such as an EDW requires an adequate solution.

Despite the claimed performance advantages of the ELT scheme, in practice it turns out that:

1. Data quality affects data load time. For example, ETL may discard up to 90% of duplicate data during cleaning and transformation. In the same situation, ELT will load all the data into the CDW, where the cleaning will then take place.

2. The data transformation rate in the CDW depends strongly on the processing algorithms and data structures. In some cases SQL processing within the CDW's database is more efficient; in other cases external programs that extract the data, process them and load the results back into the CDW will run much faster.

3. Some algorithms are very difficult to implement using SQL statements and stored procedures. This imposes restrictions on the use of the ELT scheme, whereas ETL can use more appropriate and effective data processing tools.

4. ETL is a single place where the data extraction, processing and loading rules reside, which simplifies the testing, modification and operation of the algorithms. ELT, by contrast, separates the data collection and loading algorithms from the transformation algorithms. That is, to test a new transformation rule we either have to risk the integrity of the production data warehouse or create a test copy of the repository, which is very costly.

Thus, comparing ETL and ELT, we see that the claimed advantages in data loading and transformation are not clear-cut, that ELT faces SQL constraints in data conversion, and that the savings on ELT software and hardware turn into the cost of creating a software and hardware test copy of the CDW.
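The difference between the two orderings can be sketched with SQLite standing in for the warehouse; the tables and the crude duplicate filter are invented for illustration. In the ETL variant duplicates are dropped before anything reaches the warehouse table; in the ELT variant everything is loaded raw first and then cleaned by SQL inside the database.

    import sqlite3

    raw_rows = [("A", 10), ("A", 10), ("B", 20)]   # illustrative source data with a duplicate

    def etl(conn):
        """ETL: transform (here, deduplicate) outside the warehouse, then load."""
        cleaned = list(dict.fromkeys(raw_rows))    # drop duplicates before loading
        conn.executemany("INSERT INTO dw_sales VALUES (?, ?)", cleaned)

    def elt(conn):
        """ELT: load everything raw, then transform with SQL inside the warehouse."""
        conn.executemany("INSERT INTO staging_sales VALUES (?, ?)", raw_rows)
        conn.execute("INSERT INTO dw_sales SELECT DISTINCT item, amount FROM staging_sales")

    for approach in (etl, elt):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE staging_sales (item TEXT, amount REAL)")
        conn.execute("CREATE TABLE dw_sales (item TEXT, amount REAL)")
        approach(conn)
        print(approach.__name__, conn.execute("SELECT * FROM dw_sales").fetchall())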

The use of ELT may be justified if:

1. There are no stringent requirements for DW reliability, performance, and security.

2. Budget constraints force one to accept the risk of data loss.

3. Data warehouse and data sources interact via a service bus (SOA).

The last case is the most exotic, but it has a right to exist under certain conditions. Here the service bus is responsible for integrating the data sources and the DW at the messaging level, and for minimal (by DW standards) data conversion and loading into the DW.


Centralized DW with Operational Data Store

Data extraction, transformation and loading processes naturally take some time to complete. Additional delay is caused by the need to check the data loaded into the DW for consistency with the data already in the DW, to consolidate the data, and to recalculate totals based on the new data.

The operational data store (ODS) was proposed in 1998 [2] to reduce the time lag between the receipt of information from ETL and the analytical systems. An ODS holds less accurate information, because internal checks are lacking, and more detailed data, because the consolidation phase is skipped. Therefore, data from the ODS are intended for tactical decisions, while information from the central data warehouse (CDW) is better suited for strategic missions [3].

Imagine a company that sells drinks and snacks from vending machines throughout the country. Fifteen minutes of downtime for an empty machine means potential lost profit, so it is crucial to monitor the status of the machines and restock the missing goods. Collecting and processing all the information across the country may take several hours, whereas product delivery is done locally: every city has a warehouse from which drinks and snacks are delivered to the nearest outlets. The warehouses themselves are replenished through centralized procurement. Thus, there are two different types of tasks: tactical (filling vending machines) and strategic planning (filling warehouses).

Pic. 3. Centralized DW with Operational Data Store

Indeed, if an extra bottle of water is delivered as a result of incomplete or inaccurate data in the ODS, it will not lead to serious losses. However, a planning error caused by low data quality in the ODS may adversely affect decisions on the types and volumes of bulk purchases.

The information security requirements for the CDW and the ODS also differ. In our example, the ODS stores recent data for no more than a couple of hours, while the CDW stores historical information, which may cover several years, for better prediction of the required purchase volumes. This historical information can be of considerable commercial interest to competitors. So tactical analysts can work directly with the ODS, while strategic analysts must work with the CDW through a data mart, for the delineation of responsibility. Tactical analysts can access data nearly on-line because no data mart stands in their way. The data mart does not hinder strategic analysis, since such analysis is carried out on a monthly or even quarterly basis.

The architecture shown in Pic. 3 involves direct interaction between the CDW and business applications. The strengths and limitations of this approach are considered in the section "Extended model with data marts". For now, note that the ODS actually plays the additional role of a staging zone when data move sequentially from the ODS to the CDW. Tactical analysts, working with data from the ODS, wittingly or unwittingly reveal errors and contradictions in the data, thereby improving their quality.

In this scheme, corrected data from the ODS are transferred to the CDW. There are, however, other schemes, for example when data from ETL flow into the ODS and the CDW in parallel, and unneeded data are simply erased from the ODS after use. This scheme is applicable when human intervention in the data can only distort them, voluntarily or involuntarily.

Extended Model with Data Marts

Direct access of business applications to the CDW is admissible if the users' requests do not interfere with the normal functioning of the CDW, if users communicate with the CDW over high-speed lines, or if accidental access to all data in the CDW does not lead to serious losses.

Administering direct user access to the CDW is an extremely difficult task. For example, a user from one department may be authorized to access data from another unit only 10 days after the data become available. Another user may see only aggregates, but no detailed data. There are other, more complicated access rules. Managing, accounting for and changing them leads to inevitable errors caused by the combination of complex access conditions.
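
To illustrate why administering direct CDW access quickly becomes error-prone, here is a minimal Python sketch of the kind of rule evaluation an administrator would have to maintain for every user; the rule set, departments and field names are hypothetical, not taken from the text.

```python
from datetime import date, timedelta

# Hypothetical access rules of the kind described above.
RULES = [
    # Sales users may see finance data only 10 days after it becomes available.
    {"department": "sales", "subject_area": "finance",
     "min_age_days": 10, "granularity": "detailed"},
    # Marketing users may see finance data immediately, but only as aggregates.
    {"department": "marketing", "subject_area": "finance",
     "min_age_days": 0, "granularity": "aggregates"},
]

def allowed(user_department, subject_area, record_available_on, wants_detail):
    """Return True if the request satisfies at least one rule."""
    age = (date.today() - record_available_on).days
    for rule in RULES:
        if rule["department"] != user_department:
            continue
        if rule["subject_area"] != subject_area:
            continue
        if age < rule["min_age_days"]:
            continue
        if wants_detail and rule["granularity"] != "detailed":
            continue
        return True
    return False

print(allowed("sales", "finance", date.today() - timedelta(days=3), True))   # False
print(allowed("marketing", "finance", date.today(), False))                  # True
```

Every new rule multiplies the combinations that must be kept correct, which is exactly the argument for moving access control into per-group data marts.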

Data marts that contain information intended for a specific group of users significantly reduce the risk of information security breaches.

The quality of communication lines remains a serious problem for geographically distributed organizations. In the event of a line failure or insufficient bandwidth, remote users are denied access to the information contained in the CDW. The solution is remote data marts, which are filled either after working hours or incrementally, as information becomes available, using assured data transfer.

Various business applications require different data formats: multidimensional cubes, data series, two-dimensional arrays, relational tables, MS Excel files, comma-separated values, XML files, and so on. No single data structure in the CDW can meet all these requirements. The solution is to create data marts whose data structures are optimized for the specific requirements of individual applications.

Another reason for creating data marts is the reliability requirement for the CDW, which is often specified as four or five nines. This means that CDW downtime may not exceed roughly 5 minutes per year (99.999%) or roughly one hour per year (99.99%). Building a hardware and software system with such characteristics is a complex and expensive engineering task. Requirements for protection against terrorist attacks, sabotage and natural disasters further complicate the construction of the software and hardware system and the implementation of appropriate organizational arrangements. The more complex such a system is and the more data it stores, the higher the cost and complexity of its support.
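
The downtime figures can be checked directly; the short calculation below simply converts an availability percentage into the allowed downtime per year.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in (0.9999, 0.99999):
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} availability -> about {downtime_min:.1f} minutes of downtime per year")

# 99.990% availability -> about 52.6 minutes of downtime per year (roughly an hour)
# 99.999% availability -> about 5.3 minutes of downtime per year
```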

Data marts dramatically reduce the CDW load, both in the number of users and in the volume of data in the repository, since the data in the CDW can then be optimized for storage rather than for query facilities.

Page 18: DWarchitecturesanddevelopmentstrategy.guidebook

18

Pic. 4. Extended Model with Data Marts

If the data marts are filled directly from the CDW, the actual number of users is reduced from hundreds or thousands of people to the tens of data marts that become the CDW's users. Implementation of SRD (Sample, Restructure, Delivery) tools reduces the number of users to exactly one. In this case the logic of supplying the data marts with information is concentrated in the SRD, so the data marts can be optimized for serving user requests, while the CDW hardware and software can be optimized exclusively for reliable, secure data storage.

SRD tools also soften the CDW workload because different data marts may need the same data: the SRD retrieves the data once, converts them to the various formats and delivers them to the different data marts.
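
A minimal Python sketch of this "retrieve once, deliver many" idea follows; the query, data and target names are hypothetical, and the CDW call is a placeholder.

```python
import csv, io, json

def fetch_from_cdw(query):
    # Placeholder for a single CDW query; in reality this would go to the warehouse.
    return [{"region": "North", "sales": 120}, {"region": "South", "sales": 95}]

def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    return json.dumps(rows, indent=2)

# The SRD retrieves the data once ...
rows = fetch_from_cdw("SELECT region, sales FROM monthly_sales")

# ... and converts and delivers them to several data marts in their own formats.
deliveries = {
    "regional_mart.csv": to_csv(rows),
    "reporting_mart.json": to_json(rows),
}
for target, payload in deliveries.items():
    print(target, len(payload), "bytes")
```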

Conclusion

The paper considered the following architectures: a centralized DW with an ETL system, a DW with an ELT system, a central data warehouse with an operational data store (ODS), and the extended model with data marts. The next article will discuss the advantages and limitations of centralized ETL with parallel DW and data marts, DW with intermediate application data marts, a data warehouse with an integration bus, and the recommended DW architecture.

Literature

1. Asadullaev S. "Data Warehouse Architectures - I", 19.10.2009, http://www.ibm.com/developerworks/ru/library/sabir/axd_1/index.html

2. Inmon, W. “The Operational Data Store. Designing the Operational Data Store”. Information Management Magazine, July 1998.

3. Building the Operational Data Store on DB2 UDB Using IBM Data Replication, WebSphere MQ Family, and DB2 Warehouse Manager, SG24-6513-00, IBM Redbooks, 19 December 2001, http://www.redbooks.ibm.com/abstracts/sg246513.html?Open


Data Warehouse Architectures - III

Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
03.11.2009
http://www.ibm.com/developerworks/ru/library/sabir/axd_3/index.html

Abstract

This series of three articles is devoted to data warehouse (DW) architectures and their predecessors. The abundance of various approaches, methods and recommendations creates confusion about the concepts, advantages and drawbacks, limitations and applicability of specific architectural solutions. The first article [1] is concerned with the evolution in the understanding of OLAP's role, of DW architecture components, of virtual DWs and of independent data marts. The second article [2] considers the centralized DW (CDW) with ETL (Extract, Transform, Load), the CDW with ELT (Extract, Load, Transform), the CDW with an operational data store, and the extended model with data marts. This article discusses centralized ETL with parallel DW and data marts, DW with intermediate application data marts, DW with an integration bus, and the recommended DW architecture.

Centralized ETL with parallel DW and data marts

In this case the whole EDW architecture is built around the ETL (Extract, Transform and Load) system. Information from disparate sources goes to the ETL, which purifies and harmonizes the data and loads them into a central data warehouse (CDW), into an operational data store (ODS), if any, and, if necessary, into a temporary staging area. This is common practice in EDW development. What is unusual is loading data from the ETL into the data marts directly.

In practice, this architecture results from users' requirement to access analytical data as soon as possible, without delay. An operational data store does not solve the problem, because users may be located in distant regions and require territorial data marts. Security restrictions on placing heterogeneous information in the ODS may be another rationale for this architecture.

This architecture has a trouble spot: one of its operational problems is the difficulty of recovering data after a crash of data marts supplied directly from the ETL. The point is that ETL tools are not designed for long-term storage of extracted and cleaned data, and transactional systems tend to focus on ongoing operations. Therefore, in case of data loss in data marts directly associated with the ETL, one has to either extract information from the transactional systems' backups or organize historical archives of the source systems. These archives require funds for development and operational support, and they are redundant from a corporate standpoint, since they duplicate functions of the EDW yet are designed to support only a limited number of data marts.

As another approach, such data marts are sometimes connected both to the ETL directly and to the data warehouse, which leads to confusion and misalignment of analytical results. The reason is that data coming into the EDW, as a rule, pass additional checks for consistency with already loaded data. For example, a financial document can arrive with attributes almost coinciding with those of a document received by the EDW earlier. The ETL system, not having information about all previously loaded data, cannot determine whether the new document is a mistake or the result of a legitimate correction.

Data verification procedures running inside the data warehouse can resolve such uncertainty: the new data are discarded if they are erroneous, while a legitimate correction changes both the figures in question and the corresponding aggregates.
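
The following Python sketch shows, under simplified and hypothetical assumptions (document numbers, amounts and a version attribute invented for illustration), the kind of check that is only possible where the full loading history is visible, i.e. inside the warehouse rather than in the ETL.

```python
# Hypothetical history of documents already loaded into the EDW, keyed by document number.
loaded = {"INV-1042": {"doc_no": "INV-1042", "amount": 1000.00, "version": 1}}

def verify_incoming(doc):
    """Classify an incoming document against what the EDW already holds.

    The ETL alone cannot make this decision, because it does not see
    the full set of previously loaded data.
    """
    existing = loaded.get(doc["doc_no"])
    if existing is None:
        return "new"                       # load as a new document
    if existing["amount"] == doc["amount"]:
        return "duplicate"                 # discard, already loaded
    if doc.get("version", 1) > existing["version"]:
        return "correction"                # apply and recalculate the aggregates
    return "suspect"                       # almost identical: flag for review

print(verify_incoming({"doc_no": "INV-1042", "amount": 1000.00}))                # duplicate
print(verify_incoming({"doc_no": "INV-1042", "amount": 1010.00, "version": 2}))  # correction
print(verify_incoming({"doc_no": "INV-1042", "amount": 1010.00}))                # suspect
```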


Pic. 1. Centralized ETL with parallel DW and data marts

Thus, information loaded into a data mart directly from the ETL may contradict the data received from the EDW. Sometimes, to resolve this contradiction, identical data verification algorithms are implemented both in the data marts and in the EDW. The disadvantage is the need to support and synchronize the same algorithms in the EDW and in the data marts fed directly from the ETL.

To sum up, parallel data marts lead to additional data processing, to the organization and maintenance of redundant operational archives, to the support of duplicate applications and to decentralized data processing, which causes information mismatches.

Nevertheless, parallel data marts can be implemented in cases where rapid access to analytical information is more important than the disadvantages of this architecture.

DW with intermediate application data marts

The following assumptions were the rationale for inventing this architecture.

1. Some companies still deploy and operate independent, disparate application data marts. The data quality in these data marts can meet the requirements of the analysts who work with them.

2. Project stakeholders are convinced that implementing an enterprise data warehouse is a perilous technical feat with unpredictable consequences. In fact, the difficulties of EDW development and implementation are not technical; they stem from poor project organization and from insufficient involvement of the experts who will be the future EDW users. Yet the project team tries to sidestep the relatively insignificant technology issues and to simplify the immediate tasks instead of improving project organization.

3. The requirement for quick results. The necessity to report on a quarterly basis creates a need for quick, tangible results. That is why the project team is not immune to the temptation to develop and implement a restricted solution with no relation to other tasks.


Following these principles either accidentally or deliberately, companies start data integration by introducing separate, independent data marts, in the hope that the data they contain will be integrated easily, simply and quickly when required. The reality is much more complicated. Although the quality of data in each data mart may satisfy its users, this information is not consistent with data from the other DMs, so the reports prepared for top management and decision makers cannot be reconciled into a single uncontroversial view.

The same indicators may be calculated by different algorithms, based on different data sets, for different periods of time. Figures with the same name may conceal different entities and, vice versa, the same entity may have different names in different DMs and reports.

Pic. 2. DW with intermediate application data marts

The diagnosis is the lack of a common understanding of data meaning. Users of independent data marts speak different business languages, and each DM contains its own metadata.

Another problem lies in the differences between the master data used in the independent data marts. Differences in data encoding and in the codifiers, dictionaries, classifiers, identifiers, indices and glossaries used make it impossible to combine these data without serious analysis, design and development of master data management tools.

However, the organization already has approved plans, a budget and a timeline for an EDW based on independent data marts. Management expects to get results quickly and inexpensively. Developers, provided with a scarce budget, are forced to implement the cheapest solutions. This is a proven recipe for creating a repository of inconsistent reports. Such a repository contradicts the idea of a data warehouse as the single and sole source of purified, coherent and consistent historical data.

Obviously, neither the company management nor the repository users are inclined to trust the information contained therein. Therefore a total rebuilding of the DW is required, which usually implies creating a new EDW that stores individual report indicators rather than full reports. This allows the indicators to be aggregated into consistent reports.

Successful EDW rebuilding is impossible without metadata and master data management systems. Both systems will affect only the central data warehouse (CDW), since the independent data marts contain their own metadata and master data.

As a result, management and experts can get coherent and consistent records, but they cannot trace the origin of the data, due to the discontinuity in metadata management between the independent data marts and the CDW.

Thus, the desire to achieve immediate results and to demonstrate rapid progress leads to the rejection of unified, end-to-end management of metadata and master data. The result of this approach is semantic islands, whose users speak a variety of business languages.

Nevertheless, this architecture can be implemented where a single data model is unnecessary or impossible, and where a relatively small amount of data must be transferred to the CDW without knowledge of its origin and initial components. For example, an international company operating in different countries may have already implemented several national data warehouses that follow local legal requirements, business constraints and financial accounting rules. The CDW may require only a portion of the information from the national DWs for corporate reporting. There is no need to develop a unified data model, because it would not be demanded at the national level.

Certainly, such a scheme requires a high degree of confidence in the national data, and can be used only if intentional or unintentional distortion of the data does not lead to serious financial consequences for the entire organization.

Data Warehouse with Integration Bus

The widespread acceptance of service-oriented architecture (SOA) [3] has led to the idea of using SOA in enterprise data warehousing solutions instead of ETL tools to extract, transform and load data into a central data warehouse, and instead of SRD tools to sample, restructure and deliver data to the data marts.

The integration bus, which underpins SOA, is designed for the integration of web services and applications, and provides intelligent message routing, protocol mediation and message transformation between service consumer and service provider applications.

At first glance, the functionality of the service bus allows us to replace ETL and SRD with the integration bus. Indeed, ETL mediates between the central data warehouse (CDW) and the data sources, and SRD mediates between the CDW and the data marts. It would seem that replacing ETL and SRD with the integration bus could benefit from the flexibility the bus provides for application integration.

Imagine that the CDW, the operational data store (ODS), the temporary staging area, and the metadata and master data management systems call the bus as independent applications, each with its own queries to update data from the sources.

First of all, the load on the data sources increases many times over, since the same information is repeatedly transmitted at the request of the CDW, the ODS, the temporary staging area, and the metadata and master data management systems. The obvious solution is to give the integration bus its own data store to cache query results.
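
A minimal sketch of such a cache is shown below, assuming a hypothetical extraction function and an arbitrary time-to-live; it only illustrates the idea that several consumers can share one extraction.

```python
import time

class SourceQueryCache:
    """Cache source extracts so that several consumers (CDW, ODS, staging area,
    metadata and master data systems) do not trigger repeated extraction."""

    def __init__(self, extract_fn, ttl_seconds=300):
        self.extract_fn = extract_fn
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, query):
        entry = self._store.get(query)
        if entry and time.time() - entry["at"] < self.ttl:
            return entry["data"]                     # served from the cache
        data = self.extract_fn(query)                # one extraction from the source
        self._store[query] = {"data": data, "at": time.time()}
        return data

# Hypothetical extraction function standing in for a real source system call.
calls = []
def extract(query):
    calls.append(query)
    return [("row", 1), ("row", 2)]

cache = SourceQueryCache(extract)
for consumer in ("CDW", "ODS", "staging", "metadata", "master data"):
    cache.get("SELECT * FROM orders WHERE day = '2009-10-01'")

print("source was queried", len(calls), "time(s)")   # 1
```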


Pic. 3. Data Warehouse with Integration Bus

Secondly, the data gathering procedures, previously centralized in the ETL, are now scattered across the applications requesting the data. Sooner or later, discrepancies between the various data gathering procedures for the CDW, the ODS, and the metadata and master data management systems will arise. Data collected by different methods, at different time intervals, and processed by different algorithms contradict each other, and the main goal of creating the CDW as the single source of consistent, non-contradictory data is defeated.

The consequences of replacing SRD with the integration bus are not so dramatic. The CDW must be turned into a service in order to respond to data mart requests directed through the integration bus. This means that the data warehouse must conform to the most common style of web services and support the HTTP/HTTPS protocols and the SOAP/XML message format. This approach works well for short messages, but data marts usually require large volumes of data to pass through the integration bus.

That task can be solved by transmitting binary objects. The necessary data restructuring cannot be performed by the integration bus and must be carried out either in the CDW or in the data marts. Restructuring inside the CDW is unusual functionality for it: the CDW would have to be aware of all the data marts and carry an additional workload irrelevant to its main goal of reliable data storage. Restructuring inside the data marts requires direct access from the DM to the CDW, which in many cases is unacceptable for security reasons. This function can be realized by some proxy service that receives the data and, after restructuring, transmits them to the data marts. So we return to the idea of an SRD tool, simply supplied with a bus interface.

Thus, the integration bus can be used in the EDW architecture as a transport medium between the data sources and the ETL, and between the SRD and the data marts, in cases where the components of the EDW are geographically separated and sit behind firewalls in accordance with strict data protection requirements. For interoperability it is then sufficient that the exchange be enabled over the HTTP/HTTPS protocols. All data collection, transformation and dissemination logic should still be concentrated in the ETL and the SRD.

Recommended EDW Architecture

The architecture of an enterprise data warehouse (EDW) should satisfy many functional and non-functional requirements that depend on the specific tasks solved by the EDW. Just as there is no generic bank, airline or oil company, there is no single EDW solution to fit all occasions. But the basic principles that an EDW must follow can still be formulated.

First and foremost is data quality, which can be understood as complete, accurate and reproducible data delivered on time to where they are needed. Data quality is difficult to measure directly, but it can be judged by the decisions made. In other words, data quality requires investment, and in turn it can generate profit.

Secondly, there is the security and reliability of data storage. The value of the information stored in the EDW is comparable to the market value of the company. Unauthorized access to the EDW is a threat with serious consequences, and therefore adequate protection measures must be taken.

Thirdly, the data must be available to the employees to the extent necessary and sufficient to carry out their duties.

Fourthly, employees should have a unified understanding of the data, so a single semantic space is required.

Fifthly, it is necessary, if possible, to resolve conflicts in data encoding in the source systems.

Pic. 4. Recommended EDW Architecture

The proposed architecture follows the examined principle of modular design: "unsinkable compartments". The "divide and rule" strategy is applicable not only in politics. By separating the architecture into modules, we also concentrate specific functionality in each of them, gaining power over the unruly IT elements.

ETL tools provide complete, reliable and accurate information gathering from data sources; the algorithms for data collection, processing and conversion, and for interaction with the metadata and master data management systems, are concentrated in the ETL.

The metadata management system is the principal "keeper of wisdom" that can be asked for advice. It maintains the relevance of business, technical, operational and project metadata.

The master data system acts as an arbitrator for resolving data encoding conflicts.

The Central Data Warehouse (CDW) carries only the workload of reliable and secure data storage. Depending on the tasks, the reliability of the CDW can reach 99.999%, ensuring smooth functioning with no more than about 5 minutes of downtime per year. The CDW's software and hardware tools can protect data from unauthorized access, sabotage and natural disasters. The data structure in the CDW is optimized solely for effective data storage.

The data sample, restructuring and delivery (SRD) tools are, in this architecture, the only users of the CDW; they take on the whole job of filling the data marts and thereby reduce the user query workload on the CDW.

Data marts contain data in formats and structures optimized for the tasks of their specific users. Nowadays, when even a laptop can be equipped with a terabyte disk drive, the problems associated with multiple duplication of data across data marts no longer matter. The main advantages of this architecture are:

• comfortable user’s operation with the necessary amount of data,

• the possibility to quickly restore a data mart's contents from the CDW in case of failure,

• off-line data access when connection with the CDW is lost.

This architecture allows separate design, development, operation and refinement of individual EDW components without a radical overhaul of the whole system. This means that starting work on an EDW does not require extraordinary effort or investment. To start, it is enough to implement a data warehouse with limited capabilities and, following the proposed principles, to develop a prototype that works and is truly useful for users. Then one needs to identify the bottlenecks and evolve the required components.

Implementing this architecture, together with the triple strategy for data, metadata and master data integration [4], reduces the time and budget needed for EDW implementation and allows the EDW to develop in accordance with changing business requirements.

Conclusion

The article discussed the advantages and limitations of the following architectures: centralized ETL with parallel DW and data marts, DW with intermediate application data marts, data warehouse with an integration bus, and the recommended EDW architecture.

The recommended corporate data warehouse architecture allows a workable prototype that is useful to business users to be created in a short time and with minimal investment. The key to this architecture, which provides for the evolutionary development of the EDW, is the introduction of metadata and master data management systems at the early stages of development.


Literature

1. Asadullaev S. "Data Warehouse Architectures - I", 19.10.2009, http://www.ibm.com/developerworks/ru/library/sabir/axd_1/index.html

2. Asadullaev S. «Data Warehouse Architectures – II», 23.10.2009. http://www.ibm.com/developerworks/ru/library/sabir/axd_2/index.html

3. Bieberstein N., Bose S., Fiammante M, Jones K., Shah R. “Service-Oriented Architecture Compass: Business Value, Planning, and Enterprise Roadmap”, IBM Press, 2005.

4. Asadullaev S. «Data, metadata, master data: the triple strategy for data warehouse project», 09.07.2009, http://www.ibm.com/developerworks/ru/library/r-nci/index.html


Data, metadata and master data: the triple strategy for data warehouse projects

Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
09.07.2009
http://www.ibm.com/developerworks/ru/library/r-nci/index.html

Abstract

The concept of data warehousing emerged in the early 90s. Since then many data warehouse implementation projects have been carried out, but not all of them were completed successfully. One of the most important reasons for failure is the problem of a common interpretation of data meaning, data cleaning, alignment and reconciliation. This article shows that three interrelated projects, for data, metadata and master data integration, should be performed simultaneously in order to implement an enterprise data warehouse (EDW).

Introduction

The largest companies have been implementing DWs since the mid 90s. Earlier projects cannot be considered unsuccessful, as they solved the tasks requested of them, in particular providing the company's management with consistent, reliable information, at least in some areas of the company's business. However, the growth of companies, changes in legislation and increased needs for strategic analysis and planning require further development of the data warehouse implementation strategy.

By now companies have understood that a successful data warehouse requires creating a centralized system for master data and metadata management. Unfortunately, these projects are still performed separately. It is generally assumed that the development of an enterprise data warehouse is a project of integrating data from disparate sources. At the same time, the sources contain not only data but also master data, as well as metadata elements. Typically, large companies start a data warehouse project without allocating funds and resources for metadata and master data management.

Project sponsors, the steering committee and other decision makers usually try to allocate funds for the project phase by phase and tend to implement these three projects sequentially. As a result, the projects' budgets, completion periods and quality do not meet the initial requirements because of the need for changes and improvements to IT systems built under the previous project.

In most cases the need for a data warehouse project comes from business users who are no longer able to bring together data from different information systems. That is, it is precisely the requirements of business users that define, first of all, the information content of the future data warehouse.

The data sources for the future DW are transactional databases (OLTP), legacy systems, file storage, intranet sites, archives, and isolated local analytic applications. First, you need to determine where the required data are located. Since these data are, as a rule, stored in different formats, you must bring them to a single format. This task is performed by a fairly complex system of data extraction, transformation and loading (ETL) into the data warehouse.

The ETL procedures cannot be accomplished without an accompanying analysis of metadata and master data. Moreover, the practice of data warehouse implementation has shown [1] that the metadata created and imported from various sources in fact drive the entire data collection process.


Master data management

Reference data and master data include glossaries, dictionaries, classifiers, indices, identifiers, and codifiers. In addition to standard dictionaries, directories and classifiers, each IT system has its own master data required for its operation. As long as several information systems operate in isolation from each other, the problems caused by differences in master data usually do not arise. However, if you have to combine the reported data of two or more different systems, the discrepancy in master data makes it impossible to merge the tables directly. In such cases a "translator" is required to bring the codes stored in multiple tables to a unified form. In addition, master data, although infrequently, do change, and the consistent update of master data in all information systems is a challenge.
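
A minimal Python sketch of such a code "translator" follows; the systems, local codes and master codes are hypothetical and serve only to show how a cross-reference table makes two systems' records joinable.

```python
# Hypothetical cross-reference table: local product codes in two source
# systems mapped to a single master code used by the warehouse.
XREF = {
    ("billing",   "P-001"): "MDM-PROD-0001",
    ("logistics", "77345"): "MDM-PROD-0001",
    ("billing",   "P-002"): "MDM-PROD-0002",
}

def to_master_code(system, local_code):
    try:
        return XREF[(system, local_code)]
    except KeyError:
        raise KeyError(f"no master mapping for {local_code!r} in {system!r}")

# Two reports that could not be merged directly can now be joined on the master code.
print(to_master_code("billing", "P-001") == to_master_code("logistics", "77345"))  # True
```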

Thus there is a need to establish a master data management system, which helps coordinate master data changes in the various information systems and simplifies the integration of data from these systems.

Metadata management

The prototype for metadata management was the Data Dictionary / Directory System, which was designed for the logical centralization of information about data resources and was meant to serve as a tool for enterprise data resource management [2].

Data sources, including transactional systems, contain metadata in an implicit form. For example, the table names and the column names in the tables are technical metadata, while the definitions of the entities stored in the tables represent business metadata. Application statistics collected by monitoring systems should be classified as operational metadata. Relationships between project roles and database access, including administration rights, and data for audit and change management usually belong to the project metadata. Finally, business metadata are the most important piece of metadata, and include business rules, definitions, terminology, the glossary, data origin and processing algorithms.

Many sources contain metadata elements, but almost never a full set of metadata. The result is a mismatch in the reports provided by different source systems. For example, in one report the production volume may be calculated in dollars, in another in pieces, and in a third on a total weight basis. That is, the same field "production volume" may contain quite different data in different reports.
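
The sketch below illustrates, with invented report names and units, how even a small piece of business metadata attached to a field prevents "production volume" figures from being silently mixed across incompatible units.

```python
# Hypothetical business metadata for the same field name in several source reports.
FIELD_METADATA = {
    ("report_finance",  "production_volume"): {"unit": "USD"},
    ("report_sales",    "production_volume"): {"unit": "USD"},
    ("report_factory",  "production_volume"): {"unit": "pieces"},
    ("report_shipping", "production_volume"): {"unit": "kg"},
}

def combinable(field, reports):
    """The field may be summed across reports only if its unit is the same everywhere."""
    units = {FIELD_METADATA[(r, field)]["unit"] for r in reports}
    return len(units) == 1

print(combinable("production_volume", ["report_finance", "report_factory"]))  # False
print(combinable("production_volume", ["report_finance", "report_sales"]))    # True
```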

Such a mismatch in the meaning of data in reports forces companies to develop and implement an integrated system of unified indicators, reports, and terminology.

Data, metadata and master data interrelations

The DW structure consists of three main information levels: detailed, summary and historical data, together with their accompanying metadata [3]. It is now clear that this list should be complemented by master data. The relationships between data, metadata and master data can be visualized as a triangle (Pic. 1).

As can be seen from the figure, all the relationships fall into three pairs:

• Data – metadata

• Data – master data

• Metadata – master data

Consider each pair in more detail.


Pic. 1. Data, metadata and master data interrelations

Data and metadata

The interdependence of data and metadata can be shown with the following example. We can assume that any book contains data. A library catalog card is metadata that describes the book. The collection of cards is the library catalog, which can itself be treated as a set of data (a database). A card should be filled in according to certain rules, which are specified in a book on librarianship (meta-metadata). That book should also be placed in the library, and its own catalog card must be prepared and placed in the appropriate drawer of the catalog cabinet, where you can also find the rules for using the catalog. The question of whether those rules are meta-meta-metadata we leave to the reader as homework.

If you already hold the book you need in your hands, you do not need its catalog card. Most home libraries have no catalog, because the owners know their library: they are its creators and its users, and they act as librarians if someone asks for a book from it. But a large public library cannot operate without a catalog.

The situation in enterprise information systems is not so simple or obvious. Despite the fact that the first publications on the need for data dictionary systems appeared in the mid 80s, corporate resources are still designed, developed and operated in isolation, without a unified semantic space. In library terms this would mean that a reader in one library could not even tell whether the required book exists in another library.

In 1995 an article was published [4] which stated that for successful data integration it is necessary to establish and maintain a metadata flow. In the language of library users this discovery sounds something like: "Libraries need to share information about books in a single format." It is now clear that this requirement needs to be refined, since metadata are generated at all stages of the development and operation of information systems.

Business metadata are generated at the initial stage of system design. They include business rules, definitions, terminology, the glossary, data origin and processing algorithms, all described in the language of business.

At the next stage, logical design, technical metadata appear, such as the names of entities and the relationships between them. Table names and column names also belong to the technical metadata and are determined at the stage of physical development of the system.


The metadata of the production stage are operational metadata, which include statistics on computing resource usage, user activity and application statistics (e.g., frequency of execution, number of records, component-wise analysis).

The metadata that document the development effort, provide data for project audit, assign metadata stewards and support change management belong to the project metadata.

Operational metadata are the most undervalued. Their importance can be demonstrated by the example of a large company that provides customers with a variety of web services. At the heart of the company's IT infrastructure resides a multi-terabyte data warehouse, around which custom local client databases and applications are built. The marketing department receives clients' orders, the legal department manages contracts, and the architectural team develops and provides documentation for project managers to hand over to outsourced development.

When a customer's contract ends, everybody is looking for new clients, and nobody takes the time to inform the administrators that the customer's application and database can be deleted. As a result, the overhead of archiving data and maintaining applications keeps growing. In addition, the development of new versions is significantly hampered, since the unused protocols and interfaces still have to be supported.

Operational metadata management provides administrators and developers with information about how frequently applications are used. Based on this information, unused applications, data and interfaces can be identified; their removal from the system will significantly reduce the costs of its maintenance and future upgrades.

Data and master data

In relational databases designed in accordance with the requirements of normalization, one can identify two different types of tables. Some contain, for example, the list of goods and their prices (master data). Other tables contain information on purchases (the data). Without going into the jungle of definitions, in this example you can see the difference between data and master data. In a large store multiple purchases may be made every second, but the prices and names of goods, at least for now, do not change every second.

Master data in relational databases perform several functions. They help reduce the number of data entry errors and support more compact data storage through the use of short codes instead of long names. In addition, master data are the basis for the standardization and normalization of data. On the one hand, the presence of a corresponding nationwide classifier inevitably affects the structure of the database. On the other hand, the process of bringing data to third normal form, as a rule, leads to the creation of internal codifiers.
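
A minimal sketch of this contrast, with invented goods and purchase records: the master data table changes rarely and is referenced by short codes, while the transaction data arrive constantly.

```python
# Master data: the list of goods and their prices changes rarely.
goods = {
    1: {"name": "Mineral water 0.5l", "price": 1.20},
    2: {"name": "Chocolate bar",      "price": 0.90},
}

# Data: purchases reference the goods by short codes and arrive every second.
purchases = [
    {"ts": "2009-07-09 10:00:01", "good_id": 1, "qty": 2},
    {"ts": "2009-07-09 10:00:02", "good_id": 2, "qty": 1},
]

# The short code keeps storage compact and reduces data entry errors;
# the name and price are resolved through the master data table.
for p in purchases:
    g = goods[p["good_id"]]
    print(p["ts"], g["name"], p["qty"] * g["price"])
```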

Despite the fact that ISBN or SIN codes are unique and could serve as primary keys in relational databases, in practice additional local codifiers are often created.

Metadata and master data

There are various definitions and classifications of master data: by source, by management method, by the data being classified. For the purposes of this work, we may assume that master data include codifiers, dictionaries, classifiers, identifiers, indices and glossaries (Table 1).

A classifier, for example the bank identification code (BIC), is managed centrally by an external organization (the Bank of Russia), which provides the rules for forming the code. In addition, the classifier may determine the rules for using the code. For instance, the reuse of bank identification codes by payment participants is allowed one calendar year after the date of their exclusion from the BIC classifier of Russia, but not before the Bank of Russia draws up the consolidated balance sheet of payments made using avisos for that calendar year. The BIC code does not contain a control number.

A three-level hierarchical classification is adopted in the All-Russia Classifier of Management Documentation: the class of forms (the first two digits), the subclass of forms (the next two digits), the registration number (the next three digits), and the control number (the last digit). A classifier may contain the rules for control number calculation or code validation algorithms.
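
A minimal sketch of the code structure just described; the sample code is hypothetical, and the control digit algorithm itself, which is defined by the classifier, is deliberately not reproduced here.

```python
def parse_okud(code):
    """Split an 8-digit code structured as described above: class (2 digits),
    subclass (2 digits), registration number (3 digits), control digit (1 digit)."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError("expected 8 digits")
    return {
        "class": code[0:2],
        "subclass": code[2:4],
        "registration": code[4:7],
        "control": code[7],
    }

print(parse_okud("09010046"))  # hypothetical code, used only to show the structure
```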

Table 1. Master data types

The metadata of a classifier are the rules for calculating the control number, the description of the hierarchical classification, and the usage regulations for the identification codes.

An identifier (e.g., ISBN) is managed by authorized organizations in a non-centralized manner. Unlike the case of a classifier, identifier codes must follow the rules of control number calculation. The rules for compiling an identifier are developed centrally and are maintained through standards or other regulatory documents. The main difference from a classifier is that a complete list of identifier values either is not available or is not needed at the system design phase; the working list is updated with individual codes during system operation.
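
Taking the ISBN mentioned above as the example, the widely known ISBN-10 check digit rule can be sketched as follows; this is a standard published algorithm, shown here only to illustrate what "rules of control number calculation" means for an identifier.

```python
def isbn10_check_digit(first_nine_digits):
    """Compute the ISBN-10 check digit: the weighted sum of all ten digits
    (weights 10 down to 1) must be divisible by 11; a value of 10 is written as 'X'."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first_nine_digits))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

print(isbn10_check_digit("030640615"))  # 2 -> ISBN 0-306-40615-2
```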

The difference between identifier metadata and classifier metadata lies in their different behavior at various stages of the system life cycle. Identifier metadata must be defined at the design stage, when the identifiers are not yet filled with individual values. New identifiers may appear during system operation; in some cases they do not match the existing metadata, and the metadata should be revised to eliminate possible misinterpretation of the new identifier values.

Dictionary (e.g., phone book) is managed by an outside agency. The code numbering (telephone number) is not subject to any rules.

A dictionary's metadata are less structured, as is the phone book itself. However, they are also necessary. For example, if the organization provides several different communication methods (work phone, home phone, mobile phone, e-mail, instant messaging tools, etc.), a system administrator can describe the rules for sending a message in case of a system failure.

A codifier is designed by developers for the internal purposes of a specific database. As a rule, neither checksum calculation algorithms nor coding rules are designed for a codifier. The encoding of the months of the year is a simple example of a codifier.

Despite the absence of external rules, the encoding is carried out in accordance with the designer's concept and often contains rules (metadata) in an implicit form. For example, for payments over the New Year, January may be entered into the codifier a second time as the 13th month.
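
A tiny sketch of such an internal codifier follows; the "13th month" entry carries the implicit rule described above, and everything else is an invented illustration.

```python
# Internal month codifier; code 13 carries an implicit rule (metadata):
# it marks adjustment payments posted over the New Year period.
MONTHS = {i: name for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}
MONTHS[13] = "January (New Year adjustment payments)"

def month_name(code):
    return MONTHS[code]

print(month_name(1))    # January
print(month_name(13))   # January (New Year adjustment payments)
```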

An index may be just a numeric value (for example, a tax rate) that is derived from an unstructured document (an order, law, or act). It would be unreasonable to embed the numeric value directly in an algorithm, since changing it would require finding all occurrences in the program text and replacing the old value with the new one. Therefore indices, isolated in separate tables, are an important part of the master data.

The metadata of indices define their scope of application, time limits and restrictions.
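
A minimal sketch of keeping an index in a table together with its applicability period instead of hard-coding it; the rates and dates are hypothetical.

```python
from datetime import date

# Hypothetical tax-rate index with applicability periods (index metadata).
TAX_RATES = [
    {"valid_from": date(2005, 1, 1), "valid_to": date(2009, 12, 31), "rate": 0.18},
    {"valid_from": date(2010, 1, 1), "valid_to": None,               "rate": 0.20},
]

def tax_rate_on(day):
    for row in TAX_RATES:
        if row["valid_from"] <= day and (row["valid_to"] is None or day <= row["valid_to"]):
            return row["rate"]
    raise LookupError(f"no tax rate defined for {day}")

print(tax_rate_on(date(2009, 7, 9)))   # 0.18
```

Changing the rate then means adding a row with a new validity period rather than editing program text.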

Glossaries contain abbreviations, terms and other string values that are needed when generating forms and reports. The presence of these glossaries in the system provides a common terminology for all input and output documents. Glossaries are so close in nature to metadata that it is sometimes difficult to distinguish them.

Thus, master data always contain business metadata and technical metadata.

Most technical and business metadata are created during the understanding phase of the metadata management life cycle [5]. Project metadata arise during the development phase and, to a lesser extent, during the operation phase (e.g., assigning metadata stewards). Operational metadata are created and accumulated during the operation of the system.

Components of Enterprise Data Warehouse

An enterprise data warehouse (EDW) transforms the data, master data and metadata from disparate sources and makes them available to users of analytical systems as a single version of the truth. Data sources are usually described as transactional databases, legacy systems, various file formats, as well as other sources whose information must be provided to end users (Pic. 2).

The components of an enterprise data warehouse are:

1. ETL tools used to extract, transform and load data into a central data warehouse (CDW);

2. Central data warehouse, designed and optimized for reliable and secure data storage;

3. Data marts which provide efficient user access to data stored in structures that are optimal for specific users’ tasks.

A central repository includes, above all, three repositories:

1. Master data repository;

2. Data repository;

3. Metadata repository.

The scheme above does not include the operational data store, the staging area, the data delivery and access tools, business applications and other EDW components that are not relevant at this level of detail.


Pic. 2. Components of Enterprise Data Warehouse

After several unsuccessful attempts to create virtual data warehouses, the need for a data repository became unquestionable. In a virtual DW architecture, a client program receives data directly from the sources, transforming them on the fly. The simplicity of the architecture is paid for by the time spent waiting for query execution and data transformation. The query result is not saved, and the next identical or similar request requires the data to be converted again, which led to the abandonment of virtual DWs and to the creation of data repositories.

The present situation with metadata and master data resembles that of virtual DWs. Metadata and master data are used intensively during data extraction, transformation and loading; the cleaned data are saved in the data warehouse, but the metadata and master data are discarded as waste material. Creating a repository of metadata and master data significantly reduces EDW implementation costs and improves the quality of information support for business users through the reuse of consistent metadata and master data from a single source.

Example of existing approach

An illustration of the existing approaches to data integration is presented in the paper [6] on a master data implementation in a bank. The bank spent more than six months reengineering its planning and forecasting process for performance management. The bank's vice-president explained that the success of the master data management initiative was due to the fact that the team focused on solving a local problem, avoiding the "big bang", by which he meant the creation of an enterprise data warehouse. In his view, the creation of an enterprise master data management system is a long, difficult and risky job.

As the next step, it is planned to create a bank reporting system based on the integration of core banking systems, in order to use more detailed data that are compatible with the general ledger. This will create a financial data repository, which should become the main source for all financial reporting systems and will support drill-down analysis.


A careful reading of the article leads to the following conclusions. First of all, this project did not provide for the integration of enterprise data and covered only the reengineering of the planning and forecasting process. The data repository created appears to be a narrowly thematic data mart, incapable of supporting common analytical techniques such as drill-down analysis.

In contrast, an enterprise data warehouse provides consistent enterprise data to a wide range of analytical applications. In practice, a single version of data can be provided only by an enterprise data warehouse working in conjunction with enterprise master data and metadata management systems. The article [6] describes how to create a single version of truth for metadata only, and only for the financial reporting area.

Thus, the project team implemented a metadata and master data management system for one specific area of activity. The team deliberately avoided enterprise-wide solutions: neither an enterprise data warehouse, nor a metadata or master data management system was implemented.

Statements that an enterprise data warehouse cannot be implemented in practice are refuted by projects performed by IBM employees on a regular basis.

This project is a typical "fast win"; the main objective is to demonstrate a quick small success. At this stage no one thinks about the price of redesigning and redeveloping the applications and integrating them into the enterprise infrastructure. Unfortunately, we increasingly have to deal with the consequences of the activity of "quick winners" who avoid complicated, lengthy and therefore risky decisions.

It should be clarified that small successful projects are quite useful as pilots at the beginning of a global project, when they are geared to demonstrating a workable solution in a production IT environment. In this situation all the pros and cons should be weighed: in the worst case, all the results of the pilot project may be rejected due to incompatibility with the enterprise architecture of information systems. In EDW development the compatibility problem is particularly acute, because not only the interfaces and data formats but also the accompanying metadata and master data must be coordinated.

The practical realization of the triple strategy

The data warehouse, as corporate memory, should deliver unified, consistent information, but usually does not, due to conflicting master data and the lack of a common understanding of data meaning.

The known solutions analyze metadata and master data as part of the data integration project without establishing metadata and master data management systems. Implementations of metadata and master data management systems are usually regarded as separate projects, performed after the data warehouse implementation (Pic. 3).

The drawbacks of such solutions are the insufficient quality of the information delivered to data warehouse end users due to the lack of consistent metadata and master data management, and the extra expenditure on data warehouse redesign to align the existing data integration processes with the requirements of new metadata and/or master data management systems. The result is the inefficiency of these three systems, the coexistence of modules with similar functionality, waste on duplicated functionality, a rising development budget, a high total cost of ownership, and user frustration caused by discrepancies in data, metadata and master data.


Pic.3. One of the existing workflows

Master data, metadata and data integration projects performed sequentially, in any order, cannot provide the business with the required quality of information. The only way to solve this problem is the parallel execution of three projects: metadata integration, master data integration and data integration (Pic. 4).

1. Enterprise metadata integration establishes a common understanding of the meaning of data and master data.

2. Master data integration eliminates conflicts in data and metadata coding across information systems.

3. Data integration provides end users with data as a single version of the truth based on consistent metadata and master data.

Coordinated execution of these three projects delivers a corporate data warehouse of improved quality at lower cost and in less time. The proposed approach increases the quality of the information delivered from the data warehouse to business users and consequently provides better support for decisions based on that information.

The three integration projects (for data, metadata and master data), performed in parallel, allow a coordinated architecture design, a consistent environment, coherent life cycles and interrelated core capabilities to be implemented for the data warehouse, the metadata management system and the master data management system.


Pic.4. Workflow according to the triple strategy

In practice there are many ways, methods and approaches that assure the success of the parallel, coordinated execution of the three projects of data, metadata and master data integration.

1. Arrange the data warehouse, metadata integration and master data integration projects as a program.

2. Adopt the Guide to the Project Management Body of Knowledge as the world-wide recognized project management standard.

3. Select the spiral development life cycle.

4. Gather functional and non-functional requirements to select suitable core capabilities for the data warehouse, for metadata, and for master data.

5. Select an environment:
   a. for the data warehouse: data sources, ETL, data repository, staging area, operational data store, application data marts, departmental and regional data marts, analytical, reporting and other applications;
   b. for metadata: a Managed Metadata Environment with six layers: sourcing, integration, metadata repository, management, metadata marts, and delivery;
   c. for master data: Upstream, MDM core, Downstream.

6. Select an architecture design:
   a. centralized data warehouse architecture;
   b. centralized metadata architecture;
   c. centralized master data repository.

7. Select life cycles:
   a. a life cycle for data, for example: understand, extract, transform, load, consolidate, archive, deliver;
   b. a life cycle for metadata: development, publishing, ownership, consuming, metadata management;
   c. a life cycle for master data: identify, create, review, publish, update, and retire.

8. Define project roles and responsibilities, and assign team members to specific roles.

9. Select the tools for each team member.

The technical feature absolutely required for implementing the strategy is, first of all, the coordination of these three projects. In general, this is the subject matter of program management. The specific details (who, what, when, where, how, why) of inter-project communication depend on the project environment described above.

Conclusion

At the moment IBM is the only company that offers an almost complete product set for implementing the triple strategy: ETL tools for data extraction from heterogeneous data sources, metadata glossary tools, data architecture instruments, master data management tools, sophisticated tools for designing the BI environment, industrial data models, and middleware that allows the components to be integrated into a unified environment for information delivery to business users.

The idea of the triple strategy could have arisen 10 or 15 years ago. Practical implementation of the strategy was impossible at that time due to the huge cost of developing the required tools, which are available now.

Ready-made software tools for data, metadata and master data integration support the triple strategy and together can mitigate project risks, reduce data warehouse development time and provide companies with new capabilities to improve corporate performance.

The author thanks M. Barinstein, R.Ivanov, D.Makoed, A.Karpov, A.Spirin, and O.Tretyak for helpful discussions.

Literature

1. Asadullaev S. "Vendors' data warehouse architectures", PC Week / RE, 1998, № 32-33, p. 156-157

2. Leong-Hong B.W., Plagman B.K. Data Dictionary / Directory Systems. John Wiley & Sons. 1982.

3. Inmon, W. H., Zachman, J. A., Geiger, J. G. “Data Stores, Data Warehousing and the Zachman Framework”, McGraw-Hill, 1997

4. Hackathorn R. “Data Warehousing Energizes Your Enterprise,” Datamation, Feb.1, 1995, p. 39.

5. Asadullaev S. “Metadata management using IBM Information Server”, 2008, http://www.ibm.com/developerworks/ru/library/sabir/meta/index.html

6. Financial Service Technology. «Mastering financial systems success», 2009, http://www.usfst.com/article/Issue-2/Business-Process/Mastering-financial-systems-success/


Metadata Management Using IBM Information Server

Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
06.10.2008

Abstract

The selection of a strategy for implementing a BI metadata management system requires answers to several critical questions. Which metadata need to be managed? What does the metadata lifecycle look like? Which specialists are needed to complete the project successfully? Which instruments can support the specialists during the whole lifecycle of the required metadata set?

This paper examines the metadata management system for data integration projects from these four points of view.

Glossary

A glossary is a simple dictionary that includes a list of terms and definitions on a specific subject. It contains terms and their textual definitions in natural language, like this glossary.

A thesaurus ("treasury") is a kind of dictionary in which lexical relations are established between lexical units (e.g., synonyms, antonyms, homonyms, paronyms, hyponyms, hypernyms, and so on).

Controlled vocabulary requires the use of predefined, authorized terms that have been preferred by the authors of the vocabulary

A taxonomy models subtype-supertype relationships, also called parent-child relationships, on the basis of controlled vocabularies with hierarchical relationships between the terms.

Ontology expands on taxonomy by modeling other relationships, constraints, and functions and comprises the modeled specification of the concepts embodied by a controlled vocabulary.

Metadata types

IBM Information Server handles four metadata types. These metadata serve the data integration tasks solved by Information Server.

Business Metadata are intended for business users and include business rules, definitions, terminology, glossaries, algorithms and lineage using business language.

Technical metadata are required by users of specific BI, ETL, profiling and modeling tools and define source and target systems, their table and field structures and attributes, derivations and dependencies.

Operational Metadata are intended for operations, management and business users, who need information about application runs: their frequency, record counts, component by component analysis and other statistics.

Project Metadata are used by operations, stewards, tool users and management in order to document and audit the development process, to assign stewards, and to handle change management.

Success criteria of a metadata project

Not all companies have recognized the necessity of metadata management in data integration projects (for instance, in data warehouse development). Those who started implementing a metadata management system faced a number of challenges. The requirements for a metadata management system in a BI environment can be defined precisely and in good time, for example:

• The metadata management system must provide relevant and accessible centralized information about all information systems and their relations.

• The metadata management system must establish consistent usage of business terminology across the organization.

• The impact of changes must be discovered and planned for.

• Problems must be traceable from the point of detection down to their origin.

• New development must be supplied with information about existing systems.

The reality is often different: an unwieldy repository stores a pile of useless records; each system uses its own isolated metadata; uncoordinated policies turn out to be mismatched; obsolete and unqualified metadata do not meet quickly changing business requirements.

The failure of metadata management projects, even when the goals seem defined, the budget allocated and a competent team assembled, is mainly caused by the following reasons:

• Insufficient participation of business users in the creation of a consolidated glossary, which may be the result of inconvenient and complex tools for glossary and metadata repository management.

• Support for only a couple of metadata types due to a shortage of time and/or financial resources.

• Missing or incomplete documentation for production systems, which could be mitigated by tools for analyzing the data structures of existing systems at the initial investigation step.

• Lack of support for the full metadata management life cycle due to fragmented metadata management tools.

Sidebar

Strictly speaking, these statements relate to the success or failure of the product that results from project execution. As a rule, project success criteria are timely execution within budget and with the required quality and scope. Project success does not guarantee product success: the history of technology offers many examples of technically excellent products, delivered on time and on budget, that were never in demand or never found a market.

The success criterion for a metadata implementation project is therefore the demand for the developed metadata management system from subject matter experts, business users, IT personnel and other information systems, both in production and in development.

Metadata management lifecycle
The simplified lifecycle implies five stages and five roles (Pic.1). Development is the creation of new metadata by an author (subject matter expert). Publishing, performed by a publisher, notifies participants and users of the existing metadata and their locations. Ownership defines and assigns metadata usage rights. Consuming of metadata is performed by the development team, by users or by information systems. Metadata management, executed by a manager or stewards, includes modification, enrichment, extension, and access control.
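
A minimal sketch of this five-stage cycle, pairing each stage with the role named above; this is purely illustrative, not a product workflow.

# Simplified metadata lifecycle: each stage paired with the role that performs
# it, in the order described above (illustrative sketch only).
LIFECYCLE = [
    ("Development", "author (subject matter expert)"),
    ("Publishing", "publisher"),
    ("Ownership", "owner"),
    ("Consuming", "consumer (team, user or information system)"),
    ("Metadata management", "manager or steward"),
]

def next_stage(current: str) -> str:
    """Return the stage that follows `current`, wrapping around the loop."""
    names = [stage for stage, _ in LIFECYCLE]
    return names[(names.index(current) + 1) % len(names)]

if __name__ == "__main__":
    for stage, role in LIFECYCLE:
        print(f"{stage:20s} -> {role}")
    print(next_stage("Publishing"))  # Ownership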


Pic.1. Simplified metadata management lifecycle

The extended metadata management lifecycle consists of the following stages (Pic.2).

Analysis and understanding includes data profiling and analysis, assessment of the quality of data sets and structures, understanding the meaning and content of the input data, revealing connections between the columns of database tables, analysis of dependencies and information relations, and investigation of data for their integration.

Pic.2. Extended metadata management lifecycle

Modeling means revealing data aggregation schemas, detection and mapping of metadata interrelations, impact analysis and synchronization of models.

Development covers collaborative glossary building and management, business context support for IT assets, and elaboration of data extraction, transformation and delivery flows.

Transformation consists of automated generation of complex data transformation tasks and of linking source and target systems by means of data transformation rules.



Publishing provides a unified mechanism for metadata deployment and for upgrade notification.

Consuming covers visual navigation and mapping of metadata and their relations; metadata access, integration, import and export; change impact analysis; and search and queries.

Metadata quality management covers lineage of heterogeneous data in data integration processes, quality improvement of information assets and input data quality monitoring, and allows data structure and processability issues to be eliminated before they affect the project.

Reporting and audit imply setting formatting options for report results, generating reports on the lineage between business terms and IT assets, scheduling report execution, and saving and reviewing report versions. Audit results can be used for analysis and understanding on the next loop of the lifecycle.

Metadata management covers managing access to templates, reports and results, controlling metadata, navigating and querying the metamodel, and defining access rights, responsibilities and manageability.

Ownership determines metadata usage rights.

Sidebar

Support of the full metadata management lifecycle is critically important, especially for large enterprise information systems. A discontinuity in the lifecycle violates the consistency of corporate metadata, and isolated islands of contradictory metadata arise.

Implementation of a consistent set of metadata management tools considerably increases the probability of success of a metadata management system implementation project.

IBM Information Server metadata management tools
The IBM Information Server platform includes the following metadata management tools.

Business Glossary is a Web-based application that supports the collaborative authoring and collective management of a business dictionary (glossary). It makes it possible to maintain metadata categories, to build their relations, and to link them to physical sources. Business Glossary supports metadata management, alignment and browsing, and the assignment of responsible stewards.

Business Glossary Anywhere is a small program which provides read-only access to the content of the business glossary through the operating system's clipboard. A user can highlight a term on the screen of any application, and the business definition of the term appears in a pop-up window.

Business Glossary Browser provides read-only access to the business glossary's content in a separate web browser window.

Information Analyzer automatically scans data sets to determine their structure and quality. This analysis helps in understanding the data inputs to an integration process, ranging from individual fields to high-level data entities. Information analysis also makes it possible to correct problems with structure or validity before they affect the metadata project. Information Analyzer maintains profiling and analysis as an ongoing process of data reliability improvement.
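
As a rough illustration of what column profiling produces, the hand-written sketch below computes a few of the typical measures (row count, nulls, distinct values, inferred type); it is not Information Analyzer's actual output or API.

from collections import Counter

def profile_column(values):
    """Return simple profiling statistics for one column of raw string values:
    row count, null count, distinct count, inferred type and frequent values."""
    non_null = [v for v in values if v not in (None, "", "NULL")]

    def looks_numeric(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    inferred = "numeric" if non_null and all(looks_numeric(v) for v in non_null) else "text"
    freq = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "inferred_type": inferred,
        "most_common": freq.most_common(3),
    }

# Example: profiling a column extracted from a source table
print(profile_column(["10.5", "11", None, "10.5", "abc", ""]))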

QualityStage provides the instruments for investigation, consolidation, standardization and validation of heterogeneous data in integration processes and improves the quality of the information assets.


DataStage supports the development of data flows that extract information from multiple sources, transform it according to specified rules and deliver it to target databases or applications.

Information Analyzer analyzes source systems and passes its results to QualityStage, which in turn feeds DataStage, responsible for data transformation. Used together, Information Analyzer, QualityStage and DataStage automate data quality assurance processes and eliminate painstaking, or even impossible, manual data integration work.
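
The following toy sketch mimics that cleanse-then-transform sequence on two invented source rows; the standardization and VAT rules are made up for illustration and have nothing to do with the real products' interfaces.

# Toy end-to-end flow: "cleanse" standardizes names and number formats,
# "transform" applies a simple business rule, and the result is the target table.
source_rows = [
    {"customer": " ACME corp ", "amount": "1 200,50"},
    {"customer": "Acme Corp",   "amount": "300"},
]

def cleanse(row):
    """Standardize the customer name and the decimal format."""
    return {
        "customer": " ".join(row["customer"].split()).title(),
        "amount": float(row["amount"].replace(" ", "").replace(",", ".")),
    }

def transform(row):
    """Apply a simple transformation rule: add VAT to the amount (rate invented)."""
    return {**row, "amount_with_vat": round(row["amount"] * 1.18, 2)}

target_table = [transform(cleanse(r)) for r in source_rows]
for r in target_table:
    print(r)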

FastTrack reveals the relations between columns of database tables, links columns and business terms, automatically creates complex data transformation tasks in DataStage and QualityStage Designer, binds data sources and target systems by data transformation rules, reducing the application development time.

Metadata Workbench provides metadata visualization and navigation tools, maintains a visual representation of metadata interdependencies, enables analysis of information dependencies and relations between various tools, allows metadata binding, generates reports on the relations between business terms and IT assets, supports metadata management, navigation and metamodel queries, and allows investigation of key integration data: jobs, reports, databases, models, terms, stewards and systems.

Web Console provides administrators with role-based access management tools; it supports scheduling of report execution, storing query results in a common repository, viewing multiple versions of a report, and creating and selecting the directories in which reports are stored. Web Console also allows formatting options for query results to be defined.

Information Services Director resides in the domain layer of IBM Information Server and provides a unified mechanism for publishing and managing data quality services, allowing IT specialists to deploy and control services for any data integration task. Common services include metadata services, which supply standard, service-oriented, end-to-end access to metadata and their analysis.

Rational Data Architect is an enterprise data modeling and integration design tool that combines data structure modeling capabilities with metadata discovery, relationship mapping, and analysis. Rational Data Architect helps in understanding data assets and their relationships to each other, and allows data integration schemas to be revealed, metadata relations to be visualized, and the impact of changes and the synchronization of models to be analyzed.

Metadata Server maintains the metadata repository and the interaction of the other components, and supports metadata services: metadata access and integration, impact analysis, and metadata import, export, search and queries. The repository is a J2EE application. For persistent storage it uses a standard relational database such as IBM DB2, Oracle, or SQL Server. Backup, administration, scalability, transactions, and concurrent access are provided by the underlying database.

As we can see, IBM Information Server metadata management tools cover the extended metadata management lifecycle.

Roles in a metadata management project
The set of team roles in a metadata management project depends on many factors and can include, for example, infrastructure engineers, information security specialists, and middleware developers. Limited to the roles of direct relevance to metadata system development, the role list can look as follows.


The project manager requires both project documentation and information on product deliverables, namely on the metadata management system being developed. The project manager should therefore be granted access to tools producing reports on jobs, queries, databases, models, terms, stewards, systems and servers.

The subject matter expert has to participate in the collaborative creation and management of the business glossary. The expert must define terms and their categories and establish their relations.

The business analyst should know the subject matter, understand the terminology and the meaning of entities, and have previous experience in formulating rules for data processing and transformation from sources to target systems and consumers. The participation of the business analyst in business glossary creation is also very important.

The data analyst reveals inconsistencies and contradictions in data and terms before application development starts.

The IT developer should be able to familiarize himself with business terminology, to develop data processing jobs, to implement transformation rules, and to code data quality rules.

The application administrator is responsible for maintaining and versioning the configuration of applications in production; for installing updates and patch sets; for maintaining and monitoring the current state of program components; for enforcing the general policies of the protection profiles; and for performance analysis and application execution optimization.

The database administrator should tune the database and control its growth; reveal and fix performance problems; generate the required database configurations; change the database structure; and add and remove users and change their access rights.

Business users, within the scope of a metadata project, need simple and effective access to the metadata dictionary. As with an ordinary paper dictionary, users need to be able to read a lexical entry, along with its explicit description and a brief definition, preferably without any loss of context or focus.

Role support by IBM Information Server tools
Business Glossary allows the steward (responsible for metadata) role to be assigned to a user or a group of users, and makes the steward accountable for one or more metadata objects. The steward's responsibilities include efficient management, integration with related data, and making the data available to authorized users. The steward should ensure that data is properly defined and that all users of the data clearly understand its meaning.

The subject matter expert (metadata author) uses Business Glossary to create the business classification (taxonomy), which maintains a hierarchical structure of terms. A term is a word or phrase that can be used for object classification and grouping in the metadata repository.

Business Glossary supplies subject matter experts with a collaborative tool to annotate existing data definitions, to edit descriptions, and to assign data objects to categories.

If a business analyst or data analyst discovers contradictions between the glossary and database columns, he can notify the metadata authors by means of Business Glossary features.

Other project participants need read-only access to metadata. Their needs can be covered by two tools: Business Glossary Browser and Business Glossary Anywhere.

Information Analyzer plays an important role at the integration analysis stage, which is required to assess what data exist and their current state. The result of this stage is


an understanding of the source systems and, consequently, an adequate target system design. Instead of time-consuming manual analysis of outdated or missing documentation, Information Analyzer provides the business analyst and the data analyst with automated analysis of production systems.

The business analyst uses Information Analyzer to make integration design decisions based on investigation of database tables, columns, keys and their relations. Data analysis helps to understand the content and structure of data before the project starts, and allows conclusions useful for the integration process to be drawn at later project stages.

The data analyst uses Information Analyzer as a tool for a complete analysis of source and target information systems; for evaluation of the structure, content and quality of data at the single- and multiple-column level, at the table level, at the file level, at the cross-table level, and at the level of multiple sources.

Stewards can use Information Analyzer to maintain a common understanding of the meaning of the data among all users and project participants.

The business analyst or data analyst can use Information Analyzer to create additional rules for evaluating and measuring data and their quality over time. These rules are either simple column evaluation criteria based on data profiling results, or complex conditions that evaluate several fields. Evaluation rules make it possible to create indices whose deviation can be monitored over time.
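
A sketch of what such evaluation rules might look like: two simple single-column criteria and one condition over several fields, combined into a quality index that can be recorded after every load and watched for deviation over time. The rule names, fields and values are invented; this is not Information Analyzer rule syntax.

# Illustrative data quality rules: each rule is a predicate over one record.
# The quality index is the share of records that pass all rules; it can be
# stored after every load and its deviation monitored over time.
rules = {
    "customer_id_present": lambda r: bool(r.get("customer_id")),
    "amount_positive":     lambda r: r.get("amount", 0) > 0,
    # a complex condition evaluating several fields at once
    "shipped_after_order": lambda r: r["shipped_date"] >= r["order_date"],
}

def quality_index(records):
    passed = sum(all(rule(r) for rule in rules.values()) for r in records)
    return passed / len(records) if records else 1.0

batch = [
    {"customer_id": "C-17", "amount": 100, "order_date": "2009-09-01", "shipped_date": "2009-09-03"},
    {"customer_id": "",     "amount": -5,  "order_date": "2009-09-05", "shipped_date": "2009-09-02"},
]
print(f"quality index: {quality_index(batch):.2f}")   # 0.50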

QualityStage can be invoked at the preparation stage of enterprise data integration (often referred to as data cleansing). The IT developer runs QualityStage to automate data standardization, to transform data into verified standard formats, to design and test match passes, and to set up data-cleansing operations. Information is extracted from the source system, measured, cleansed, enriched, consolidated, and loaded into the target system. Data cleansing jobs consist of the following sequence of stages.

The investigation stage is performed by the business analyst to reach complete visibility of the actual condition of the data, and can be carried out using both Information Analyzer and QualityStage's embedded analysis tools.

The standardization stage reformats data from multiple systems to ensure that each data type has the correct content and format.

The match stages ensure data integrity by linking records from one or more data sources that correspond to the same entity. The goal of the match stages is to create semantic keys that identify information relationships.

The survive stage ensures that the best available data survives and is correctly prepared for the target; it is executed to build the best available view of related information.

Based on the data understanding achieved at the investigation stage, the IT developer can apply QualityStage's ready-to-run rules to reformat data from several sources at the standardization stage.
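
A much simplified sketch of the match and survive ideas described above: records from two sources are grouped by a normalized semantic key, and for each attribute the most complete value survives into the consolidated record. The key construction and survivorship rule are invented for illustration.

# Simplified match & survive: group records that refer to the same entity by a
# normalized key, then keep the most complete value for every attribute.
records = [
    {"src": "CRM",     "name": "ACME Corp.", "phone": "",             "city": "Moscow"},
    {"src": "Billing", "name": "acme corp",  "phone": "+7 495 00000", "city": ""},
]

def match_key(rec):
    """Semantic key: lower-cased name without punctuation or extra spaces."""
    return "".join(ch for ch in rec["name"].lower() if ch.isalnum() or ch == " ").strip()

def survive(group):
    """For each attribute keep the longest non-empty value found in the group."""
    merged = {}
    for field in ("name", "phone", "city"):
        merged[field] = max((r[field] for r in group), key=len)
    return merged

groups = {}
for rec in records:
    groups.setdefault(match_key(rec), []).append(rec)

for key, group in groups.items():
    print(key, "->", survive(group))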

The IT developer leverages DataStage for data transformation and movement from source systems to target systems in accordance with business rules, subject matter and integrity requirements, and/or in compliance with other data in the target environment.

Using metadata for analysis and maintenance, together with embedded data validation rules, the IT developer can design and implement integration processes for data received from a broad set of corporate and external sources, as well as mass data manipulation and transformation processes leveraging scalable


parallel technologies. The IT developer can implement these processes as DataStage batch jobs, as real-time tasks, or as Web services.

FastTrack is predominantly a tool for the business analyst and the IT developer.

Using the mapping editor, a component of FastTrack, the business analyst creates mapping specifications for data flows from sources to target systems. Each mapping can contain several sources and targets. Mapping specifications are used to document business requirements.

A mapping can be adjusted by applying business rules. End-to-end mapping can involve data transformation rules, which are part of the functional requirements and define how the application should be developed.

The IT developer uses FastTrack while developing the program logic of end-to-end information processing. FastTrack converts artifacts received from various sources into understandable descriptions. This information is internally related and allows the developer to take the descriptions from the metadata repository and to concentrate on developing complex logic, avoiding time lost searching through multiple documents and files.

FastTrack is integrated into IBM Information Server, so specifications, metadata and jobs become available to all project participants who use Information Server, Information Analyzer, DataStage Server and Business Glossary.

Table 1. Roles in a metadata management project and IBM Information Server tools

Tool                            | PM  | SME | Stw | BA  | DA  | ITD | AA  | DBA | BU  |
Business Glossary               |     |  x  |  x  |  x  |  x  |     |     |     |     |
Business Glossary Browser       |  x  |  x  |  x  |  x  |  x  |  x  |  x  |  x  |  x  |
Business Glossary Anywhere      |  x  |  x  |  x  |  x  |  x  |  x  |  x  |  x  |  x  |
Information Analyzer            |     |     |  x  |  x  |  x  |  x  |     |     |     |
QualityStage                    |     |     |     |  x  |     |  x  |     |     |     |
DataStage                       |     |     |     |     |     |  x  |     |     |     |
FastTrack                       |     |     |     |  x  |     |  x  |     |     |     |
Metadata Workbench              |  x  |  x  |  x  |  x  |  x  |  x  |  x  |  x  |  x  |
Web Console                     |     |     |     |     |     |     |  x  |  x  |     |
Information Services Director   |     |     |     |     |     |  x  |     |     |     |
Rational Data Architect         |     |     |     |     |  x  |     |     |     |     |

PM = project manager, SME = subject matter expert, Stw = steward, BA = business analyst, DA = data analyst, ITD = IT developer, AA = application administrator, DBA = database administrator, BU = business users.


Metadata Workbench provides IT developers with metadata viewing, analysis and enrichment tools. IT developers can thus use Metadata Workbench's embedded design tools to manage and understand the information assets created and shared in IBM Information Server.

Business analysts and subject matter experts can leverage Metadata Workbench to manage metadata stored in IBM Information Server.

Specialists responsible for compliance with regulations such as Sarbanes-Oxley and Basel II can trace the data lineage of business intelligence reports using the appropriate Metadata Workbench tools.

IT specialists responsible for change management, such as the project manager, can use Metadata Workbench to analyze the impact of changes on the information environment.

Administrators can use the capabilities of the Web Console for global administration based on the common framework of Information Server. For example, a user needs only one credential to access all the components of Information Server: a set of credentials is stored for each user to provide single sign-on to the products registered with the domain.

The IT developer uses Information Services Director as a foundation for deploying integration tasks as consistent and reusable information services. The IT developer can thus use service-oriented metadata management tasks together with enterprise application integration, business process management, an enterprise service bus and application servers.

Data analysts and architects can use Rational Data Architect for database design, including federated databases, and it can interact with DataStage and other components of Information Server.

Rational Data Architect provides data analysts with metadata research and analysis capabilities; data analysts can discover, model, visualize and relate heterogeneous data assets, and can create physical data models from scratch, from logical models by transformation, or from a database by reverse engineering.

Conclusion
The analysis performed above, covering the types of metadata, the metadata lifecycle, the roles in a metadata project and the metadata management tools, leads to the following conclusions.

IBM Information Server metadata management tools cover an extended metadata management lifecycle in data integration projects.

The participants of a metadata management project are provided with a consistent set of IBM Information Server metadata management tools, which considerably increases the probability of a successful corporate metadata management system implementation.

The process flows of IBM Information Server components and their interaction will be considered in further papers.

The author thanks S. Likharev for useful discussions.


Incremental implementation of IBM Information Server's metadata management tools
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
21.09.2009
http://www.ibm.com/developerworks/ru/library/sabir/Information_Server/index.html

Abstract
Just 15 years ago a data warehouse (DW) implementation team had to develop custom DW tools from scratch. Today integrated DW development tools are numerous, and their implementation is a challenging task. This article proposes incremental implementation of IBM Information Server's metadata management tools in DW projects, using a typical oil & gas company as an example.

Scenario, current situation and business goals
After having spent significant amounts of money on hundreds of SAP applications, our client realized that a seemingly homogeneous IT environment does not automatically provide a unified understanding of business terms.

The customer, one of the world's leading companies, incorporates four groups of subsidiary units, which operate in oil & gas exploration and production, in refining of petroleum products, and in marketing. The subsidiary units are spread around the world; they operate in various countries with different legislation, languages and terminologies. Each unit has its own information accounting system. Branch data warehouses integrate information from the units' accounting systems. The reports produced by the branch data warehouses are not aligned with each other due to disparate treatment of report fields (attributes).

The company decided to build an attribute-based reporting system, and realized that the lack of a common business language made enterprise data integration impossible. The company therefore decided to establish a unified understanding of business terms, which eliminates contradictions in the understanding of report fields.

Business goals were formulated in accordance with the identified issues:

• Improve the quality of information, enhance security of information, and provide the transparency of its origin;

• Increase the efficiency of business process integration and minimize time and effort of its implementation;

• Remove the hindrances for corporate data warehouse development.

Logical Topology – As Is
The existing IT environment incorporates the information accounting systems of the units, branch information systems, branch data warehouses, headquarters' information systems, data marts and the planned enterprise data warehouse.

The information accounting systems of the subsidiaries are implemented on various platforms and are out of the scope of this discussion. Branch information systems are mainly based on SAP R/3. Branch data warehouses were developed on SAP BW. Headquarters' information systems use Oracle technologies. Data marts currently operate on top of the headquarters' information systems and the branch data warehouses running SAP BW. The platform for the enterprise data warehouse is DB2.


Pic. 1. Logical Topology – As Is


On the left side of Pic.1 we can see the information systems of four branches: Exploration, Production, Refinery and Marketing. These hundreds of systems include HR, financial, material management and other modules, and are currently out of our scope because they will not be connected to the metadata management system at this stage.

The center of Pic.1 presents the centralized part of the client's IT infrastructure. It includes several branch data warehouses on the SAP BW platform and the headquarters' information system on an Oracle database. Historically these two groups have used different, independent data gathering tools and methods, so the stored data are not consistent across the information systems. The information is grouped in several regional, thematic and departmental data marts, built independently over the years. That is why the reports generated by OLAP systems do not provide a unified understanding of report fields.

Since metadata management eliminates data mismatches, improves the integration of business processes and removes obstacles to developing an enterprise data warehouse, it was decided to implement an enterprise metadata management system.

Architecture of the metadata management system
There are three main approaches to metadata integration: point-to-point architecture, point-to-point architecture with a model-based approach, and central repository-based hub-and-spoke metadata architecture [1].

The first is the point-to-point architecture, a traditional metadata bridging approach in which pair-wise metadata bridges are built for every pair of product types to be integrated. The relative ease and simplicity of integrating a single pair leads to uncontrolled growth in the number of connections between systems, which in turn results in considerable expense for maintaining a unified semantic space when changes are made in even one system.
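
A back-of-the-envelope illustration of that growth: with pair-wise bridges the number of connections grows roughly quadratically with the number of systems, whereas a hub-and-spoke repository needs only one connector per system.

# Number of metadata bridges needed for n systems:
# point-to-point: n * (n - 1) / 2 pair-wise bridges; hub-and-spoke: n connectors.
for n in (5, 10, 20, 40):
    point_to_point = n * (n - 1) // 2
    hub_and_spoke = n
    print(f"{n:3d} systems: {point_to_point:4d} bridges vs {hub_and_spoke:3d} connectors")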

The second is the point-to-point architecture with a model-based approach, which significantly reduces the cost and complexity associated with traditional bridge-based point-to-point metadata integration. A common meta-model eliminates the need to construct pair-wise metadata bridges and establishes complete semantic equivalence at the metadata level between the different systems and tools in the information supply chain to users.

The third is the central repository-based hub-and-spoke metadata architecture. In this case the repository takes on a new meaning as the central store for both the common meta-model definition and all of its various instances (models) used within the overall environment.

The centralized structure of the oil & gas company's information systems dictates the choice of the central repository-based hub-and-spoke metadata architecture as the one that most adequately implements the necessary connections between the systems to be integrated.

Architecture of the metadata management environment
The Metadata Management Environment (MME) [2] includes the sources of metadata, metadata integration tools, a metadata repository, and tools for metadata management, delivery, access and publication. In some cases the metadata management environment also includes metadata data marts, but they are not needed in this task since their functionality is not required.

Metadata sources are all information systems and other data sources that are included in the enterprise metadata management system.

Metadata integration tools are designed to extract metadata from sources, integrate them and deploy them to the metadata repository.

The metadata repository stores business rules, definitions, terminology, the glossary, data lineage and data processing algorithms, described in business language; descriptions of tables and


columns (attributes); statistics of application runs; and data for project audit.

Metadata management tools provide a definition of access rights, responsibilities and manageability.

Tools for metadata delivery, access and publication allow users and information systems to work with metadata in the most convenient way.

Architecture of the metadata repository
A metadata repository can be implemented using a centralized, a decentralized, or a distributed architecture.

The centralized architecture implies a global repository, designed around a single metadata model, which serves all enterprise systems. There are no local repositories. The system has a single, unified and coherent metadata model. The need to access a single central metadata repository can lead to performance degradation of remote metadata-consuming systems due to possible communication problems.

In the distributed architecture the global repository contains enterprise metadata for the core information systems. Local repositories containing a subset of the metadata serve the peripheral systems. The metadata model is uniform and consistent. All metadata are processed and agreed in the central repository, but are accessed through the local repositories. The advantages of local repositories are balanced by the requirement to keep them synchronized with the central metadata repository. The distributed architecture is preferable for geographically distributed enterprises.

Table 1. Comparison of metadata repository architectures

The decentralized architecture assumes that the central repository contains only references to metadata that are maintained independently in local repositories. The lack of effort to coordinate terms and concepts significantly reduces development costs, but leads to multiple, varied and mutually incompatible models. The applicability of this architecture is limited to the case when the integrated systems belong to non-overlapping areas of the company's operations.


As one of the company's most important objectives is to establish a single business language, a decentralized architecture is not applicable. The choice between the centralized and the distributed architecture is based on the fact that all the systems to be integrated are located in the headquarters, and stable communication lines are not a problem.

Thus, the centralized metadata repository architecture is the most applicable to this scenario.

In various publications one can find statements that a metadata repository is a transactional system and should be managed differently from a data warehouse. From our point of view, the recommendation to organize the metadata repository as a data warehouse is better justified. Metadata should accompany the data throughout their lifecycle: if the data warehouse contains historical data, the metadata repository should also contain the relevant historical metadata.
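
A minimal sketch of what historical metadata can mean in practice, with invented term names and dates: every definition carries a validity interval, and a lookup is made as of the date of the data being interpreted.

from datetime import date

# Each metadata record is versioned with a validity interval, so historical data
# in the warehouse can be interpreted with the definition valid at that time.
term_history = [
    {"term": "Net revenue", "definition": "Gross revenue minus discounts",
     "valid_from": date(2005, 1, 1), "valid_to": date(2008, 12, 31)},
    {"term": "Net revenue", "definition": "Gross revenue minus discounts and returns",
     "valid_from": date(2009, 1, 1), "valid_to": date(9999, 12, 31)},
]

def definition_as_of(term, as_of):
    """Return the definition of `term` that was valid on the date `as_of`."""
    for rec in term_history:
        if rec["term"] == term and rec["valid_from"] <= as_of <= rec["valid_to"]:
            return rec["definition"]
    return None

print(definition_as_of("Net revenue", date(2007, 6, 1)))   # old definition
print(definition_as_of("Net revenue", date(2010, 6, 1)))   # current definition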

Logical Topology – To Be
The selected architectures of the metadata management environment, the metadata management system and the metadata repository lead to the target logical topology shown in Pic. 2. One can see two major changes compared to the current logical topology.

1. We plan to create an enterprise data warehouse and to use IBM Information Server as an ETL (Extract, Transform and Load) tool. This task is beyond the scope of the current work.

2. The second and most important change is centralized metadata management, which allows the company to establish a common business language for all systems operating in the headquarters. On the client side only the metadata client is required.

Two phases of the extended metadata management lifecycle
The extended metadata management lifecycle (Pic.3), as proposed in [3], consists of the following stages: analysis and understanding, modeling, development, transformation, publication, consuming, ownership, quality management, metadata management, and reporting and audit.

In terms of incremental implementation the extended metadata management lifecycle can be divided into two phases:

1. “Metadata elaboration” phase: analysis and understanding, modeling, development, transformation, publication.

2. “Metadata Production” phase: consuming, ownership, quality management, metadata management, reporting and audit.

As the phase names imply, the first phase mainly covers analysis, modeling and development of metadata, while the second phase is more closely related to the operation of the metadata management system. For clarity, the stages of the “Metadata elaboration” phase are grouped on the left-hand side of Pic. 3, whereas the stages of the “Metadata production” phase are placed on the right-hand side.


Pic. 2. Logical Topology – To Be


“Metadata elaboration” phase
Analysis and understanding includes data profiling and analysis, quality assessment of data sets and structures, understanding the meaning and content of the input data, identification of connections between columns of database tables, analysis of dependencies and information relations, and investigation of data for their integration.

• The business analyst performs data flow mapping and prepares the initial classification.

• The subject matter expert develops the business classification.

• The data analyst carries out the analysis of systems.

Modeling means revealing data aggregation schemes, detection and mapping of metadata interrelation, impact analysis and synchronization of models.

• Data Analyst develops the logical and physical models and provides synchronization of models.

Development provides team glossary elaboration and maintenance, business context support for IT assets, elaboration of flows of data extraction, transformation and delivery.

• IT developer creates the logic of data processing, transformation and delivery.

Transformation consists of automated generation of complex data transformation tasks and of linking source and target systems by means of data transformation rules.

• IT developer prepares the tasks to transform and move data, which are performed by the system.

Publishing provides a unified mechanism for metadata deployment and for upgrade notification.

• The IT developer deploys integration services, ...

• ... which help the metadata steward publish metadata.

“Metadata production” phase
Consuming covers visual navigation and mapping of metadata and their relations; metadata access, integration, import and export; change impact analysis; and search and queries (Pic.3).

• Business users are able to use metadata

Ownership determines metadata access rights.

• Metadata steward maintains the metadata access rights

Metadata quality management covers lineage of heterogeneous data in data integration processes, quality improvement of information assets and input data quality monitoring, and allows data structure and processability issues to be eliminated before they affect the project.

• Project manager analyzes the impact of changes

• Business analyst identifies inconsistencies in the metadata

• Subject matter expert updates business classification

• Data analyst removes the contradiction between metadata and classification

• IT developer manages information assets

• Metadata steward supports a unified understanding of metadata meaning

• Business users use metadata and inevitably reveal metadata contradictions


Pic. 3. Extended metadata management lifecycle


During the metadata management stage, access to templates, reports and results is managed; metadata, navigation and queries in the meta-model are controlled; and access rights, responsibilities and manageability are defined.

• The project manager should appoint stewards and allocate responsibilities among team members.

Reporting and audit imply setting formatting options for report results, generating reports on the connections between business terms and IT assets, scheduling report execution, and saving and reviewing report versions.

• Metadata steward provides auditing and reporting

Audit results can be used to analyze and understand metadata on the next stage of the life cycle.

Roles and interactions in the metadata elaboration phase
Using the mapping editor, a component of FastTrack, the business analyst creates mapping specifications for data flows from sources to target systems (Pic.4). Each mapping can contain several sources and targets. Mapping specifications are used to document business requirements. A mapping can be adjusted by applying business rules. End-to-end mapping can involve data transformation rules, which are part of the functional requirements and define how an application should be developed.

The business analyst uses Information Analyzer to make integration design decisions based on investigation of database tables, columns, keys and their relations. Data analysis helps to understand the content and structure of data before a project starts, and at later project stages allows conclusions useful for the integration process to be drawn.

The subject matter expert (metadata author) uses Business Glossary to create the business classification (taxonomy), which maintains a hierarchical structure of terms. A term is a word or phrase that can be used for object classification and grouping in the metadata repository. Business Glossary supplies subject matter experts with a collaborative tool to annotate existing data definitions, to edit descriptions, and to assign data objects to categories.

The data analyst uses Information Analyzer as a tool for a complete analysis of data source systems and target systems; for evaluation of the structure, content and quality of data at the single- and multiple-column level, at the table level, at the file level, at the cross-table level, and at the level of multiple sources.

Data analysts and architects can use Rational Data Architect for database design, including federated databases, and it can interact with DataStage and other components of Information Server. Rational Data Architect provides data analysts with metadata research and analysis capabilities; data analysts can discover, model, visualize and link heterogeneous data assets, and can create physical data models from scratch, derive them from logical models by transformation, or obtain them by reverse engineering production databases.

The IT developer uses FastTrack during program logic development for end-to-end information processing. FastTrack converts artifacts received from various sources into understandable descriptions. This information is internally related and allows the developer to take the descriptions from the metadata repository and to concentrate on developing complex logic, avoiding time lost searching through multiple documents and files.

FastTrack is integrated into IBM Information Server, so specifications, metadata and jobs become available to all project participants who use Information Server, Information Analyzer, DataStage Server and Business Glossary.

The IT developer runs QualityStage to automate data standardization, to transform data into verified standard formats, to design and test match passes, and to set up data cleansing


operations. Information is extracted from the source system, measured, cleansed, enriched, consolidated, and loaded into the target system.

The IT developer leverages DataStage for data transformation and movement from source systems to target systems in accordance with business rules, subject matter and integrity requirements, and/or in compliance with other data in the target environment. Using metadata for analysis and maintenance, together with embedded data validation rules, the IT developer can design and implement integration tasks for data received from a broad set of internal and external sources, and can arrange very large data manipulation and transformation jobs using scalable parallel processing technologies. The IT developer can choose to implement these processes as DataStage batch jobs, as real-time tasks, or as Web services.

The IT developer uses Information Services Director as a foundation for deploying integration tasks as consistent and reusable information services. The IT developer can thus use service-oriented metadata management tasks together with enterprise application integration, business process management, an enterprise service bus and application servers.

Business users need read-only access to metadata. Their needs can be met by two tools: Business Glossary Browser and Business Glossary Anywhere.


Pic. 4. Roles and interactions in the elaboration phase of the metadata management lifecycle


Roles and interactions in the metadata production phase
IT specialists responsible for change management, such as the project manager, can analyze the impact of changes on the information environment with the help of Metadata Workbench (Pic.5).

Business Glossary allows the steward role, responsible for the metadata, to be assigned to a user or a group and to be linked with one or more metadata objects. Stewards' responsibilities include effective metadata management, integration with related data, and providing authorized users with access to the relevant data. Stewards must ensure that all data are correctly described and that all data users understand their meaning.

If the business analyst discovers contradictions between the glossary and database columns, he can notify the metadata authors by means of Business Glossary features.

The business analyst investigates the data to reach complete visibility of their actual condition using QualityStage's embedded analysis tools.

The data analyst eliminates contradictions between the glossary and database tables and columns by means of Business Glossary and Rational Data Architect.

Metadata Workbench provides IT developers with metadata viewing, analysis and enrichment tools. IT developers can thus use Metadata Workbench's embedded design tools to manage and understand the information assets created and shared by IBM Information Server.

Business users responsible for compliance with regulations such as Sarbanes-Oxley and Basel II can trace the data lineage in reports using the appropriate Metadata Workbench tools.

Stewards can use Information Analyzer to maintain a common understanding of the meaning of the data among all users and project participants.

Stewards can invoke Metadata Workbench to maintain metadata stored in IBM Information Server.

Administrators can use the capabilities of the Web Console for global administration based on the common framework of Information Server. For example, a user needs only one credential to access all the components of Information Server: a set of credentials is stored for each user to provide single sign-on to all registered assets.


Pic. 5. Roles & Interactions on Production phases of metadata management lifecycle


Adoption route 1: metadata elaboration
We have two metadata adoption routes:

Route 1: Metadata elaboration, and

Route 2: Metadata production.

Both routes begin at a single starting point.

Pic. 6 represents Route 1, which deals mainly with the first part of the metadata management lifecycle, namely with analysis and understanding, modeling, development, transformation, publishing, and consuming.

As the first step we have to install Metadata Server, which maintains the metadata repository and supports metadata services.

At the second step one should add Information Analyzer to perform automated analysis of production systems and to define the initial classification.

Step three is adding FastTrack, which makes it possible to create mapping specifications for data flows from sources to targets.

We can add Business Glossary as the fourth step in order to create the business classification.

To create logical and physical data models, Rational Data Architect can be added at the fifth step.

The sixth step is extended usage of Information Analyzer to create rules for data evaluation.

At the seventh step we plan extended usage of FastTrack to program the logic of end-to-end information processing.

As step eight, one can install QualityStage and DataStage to design and execute data transformation procedures.

To deploy integration tasks as services, we add Information Services Director at the ninth step.

At the last step users are granted read-only access to metadata, and we can add Business Glossary Browser and Business Glossary Anywhere.


Pic. 6. Metadata adoption route on Elaboration phases of metadata management lifecycle


Adoption route 2: metadata production
This adoption route covers the production part of the metadata management lifecycle and includes reporting and audit, ownership, quality management, and metadata management. The second route begins at the same starting point as Route 1.

Almost all the products were installed during the first route, so this route mostly deals with extended usage of the software added previously.

Web Console is one of the two products that should be added during this route. It enables management of users' credentials and hence is required at the very beginning.

The next step, extended use of Business Glossary, should be performed as soon as possible in order to assign a steward.

To perform change impact analysis, one should add Metadata Workbench.

Extended usage of FastTrack and QualityStage makes it possible to discover contradictions between the glossary and database columns.

Extended usage of Rational Data Architect can eliminate the revealed contradictions between the glossary and database tables and columns.

Metadata Workbench can help in understanding and managing the information assets.

By means of Business Glossary, users can update the business classification according to new requirements.

Metadata Workbench again helps in reporting revealed metadata issues.

Information Analyzer can be used to maintain a common understanding of the meaning of the data.

Both Metadata Workbench and Web Console can be used to maintain metadata and to report the metadata state.


Pic. 7. Metadata adoption route on Production phases of metadata management lifecycle


Conclusion
The proposed routes cover the extended metadata management lifecycle in data integration projects. The participants of the metadata management project are provided incrementally with a consistent set of IBM Information Server metadata management tools. Software implemented following the proposed routes realizes the pre-selected architectures of the metadata management environment, the metadata management system and the metadata repository, in accordance with the target logical topology.

Incremental implementation of IBM Information Server's metadata management tools reduces the time and complexity of the project, enables business users to get the benefits of metadata management at earlier stages, and increases the probability of a successful metadata management system implementation.

This work was performed as part of the plusOne initiative. The author would like to express his gratitude to Anshu Kak for the invitation to the plusOne project.

Literature
1. Poole J., Chang D., Tolbert D., Mellor D. Common Warehouse Metamodel: An Introduction to the Standard for Data Warehouse Integration. Wiley, 2003.
2. Marco D., Jennings M. Universal Meta Data Models. Wiley, 2004.
3. Asadullaev S. Metadata management using IBM Information Server, 2008, http://www.ibm.com/developerworks/ru/library/sabir/meta/index.html


Master data management with practical examples
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
Alexander Karpov, Solution Architect, SWG IBM EE/A
09.11.2010
http://www.ibm.com/developerworks/ru/library/sabir/nsi/index.html

Abstract
The article provides examples of how insufficient attention to master data management (MDM) leads to inefficient use of information systems, because the results of queries and reports do not fit the task and do not reflect the real situation. The article also describes difficulties faced by a company that decides to implement a home-grown MDM system, and provides practical examples and common errors. The benefits of enterprise MDM are stressed, and the basic requirements for an MDM system are formulated.

Basic concepts and terminology
Master data (MD) includes information about customers, employees, products and suppliers of goods, which is typically not transactional in nature.

Reference data refers to codifiers, dictionaries, classifiers, identifiers, indices and glossaries [1]. This is the base level of transactional systems, which in many cases is supplied by designated external organizations.

A classifier is managed centrally by an external entity, contains the rules of code generation and has a three- or four-level hierarchical structure. A classifier may determine the coding rules, but it does not always contain rules for calculating a check digit or code validation algorithms. An example of a classifier is the bank identification code (BIC), which is managed by the Bank of Russia, contains no check digit, and has a four-level hierarchical structure: the code of the Russian Federation, the code of the Russian Federation region, the identification number of the division of the settlement network of the Bank of Russia, and the identification number of the credit institution. The Russian Classifier of Enterprises and Organizations is managed centrally by the Russian Statistics Committee. In contrast to BIC, it contains the method for calculating the check digit of an enterprise or organization code.

An identifier (e.g., ISBN) is managed by authorized organizations in a non-centralized manner. Unlike classifier codes, identifier codes must follow rules of check digit calculation. The rules for constructing identifiers are developed centrally and are maintained by standards or other regulatory documents. The main difference from a classifier is that a complete list of identifiers is either not available or not needed at the system design phase; the working list is updated with individual codes during system operation.
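
For example, the widely published ISBN-10 rule requires the weighted sum of the ten digits (weights 10 down to 1, with 'X' standing for 10 in the check position) to be divisible by 11; a small validator can be sketched as follows.

def is_valid_isbn10(isbn: str) -> bool:
    """Validate an ISBN-10: the weighted sum (weights 10..1, 'X' = 10 in the
    last position) must be divisible by 11. Hyphens and spaces are ignored."""
    chars = [c for c in isbn if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), chars):
        if ch == "X" and weight == 1:          # 'X' allowed only as the check digit
            value = 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += weight * value
    return total % 11 == 0

print(is_valid_isbn10("0-471-20052-2"))   # True: check digit 2 satisfies the mod-11 rule
print(is_valid_isbn10("0-471-20052-3"))   # False: wrong check digit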

A dictionary (e.g., the Yellow Pages) is managed by a third party. The numbering code (the telephone number) is not subject to any rules.

A codifier is designed by developers for the internal purposes of a specific database. As a rule, neither checksum calculation algorithms nor coding rules are designed for a codifier. The encoding of the months of the year is a simple example of a codifier.

An index may simply be a numeric value (for example, a tax rate) derived from an unstructured document (an order, law or act). The flat tax rate of 13% is an example of an index.

Glossaries contain abbreviations, terms and other string values that are needed during the generation of forms and reports. The presence of these glossaries in the system provides a common terminology


for all input and output documents. Glossaries are so close in nature to metadata that it is sometimes difficult to distinguish them.

Reference data (RD) and master data (MD)
In Russian literature there is a long-established concept of "normative-reference information" (reference data), which appeared in the disciplines related to management of the economy back in pre-computer days [2]. The term "master data" comes from English-language documentation and, unfortunately, has been used as a synonym for reference data. In fact, there is a significant difference between reference data and master data.

Pic. 1. Data, reference data and master data

Pic. 1 illustrates the difference between reference data, master data and transactional data in a simplified form. In a notional e-ticketing system the codifier of airports plays the role of reference data. This codifier could be created by the developers of the system, taking into account some specific requirements. But the airport code should be understandable to other international information systems for flawless interaction between them. This purpose is achieved by the unique three-letter airport code assigned to airports by the International Air Transport Association (IATA).

Passenger data are not as stable as airport codes. At the same time, once introduced into the system, a passenger's data can be reused for various marketing activities, such as discounts when a certain total flight distance is reached. Such information usually belongs to master data. Master data may also include information about crews, the company's fleet, freight and passenger terminals, and many other entities involved in air transportation that are not considered in our simplified example.

The top row in Pic. 1 schematically depicts a transaction related to a ticket sale. There are relatively few airports in the world, there are many more passengers, and passengers can repeatedly use the services of the company, but a ticket cannot and must not be reused. Thus, ticket sales data are the most frequently changing transactional data for an airline company.
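
The three layers of this example can be sketched as three kinds of records; the passenger, ticket and price values are invented, and only the IATA-style airport codes play the role of externally governed reference data.

# Reference data: stable, externally governed codes (IATA airport codes).
airports = {"SVO": "Moscow Sheremetyevo", "JFK": "New York JFK"}

# Master data: relatively stable business entities reused across transactions.
passengers = {
    "P001": {"name": "Ivan Petrov", "frequent_flyer_km": 48_500},
}

# Transactional data: frequently created, never reused.
ticket_sales = [
    {"ticket": "T-93001", "passenger": "P001", "from": "SVO", "to": "JFK",
     "date": "2010-11-09", "price": 820.00},
]

# A transaction only references master and reference data by their keys.
sale = ticket_sales[0]
print(passengers[sale["passenger"]]["name"], "flies",
      airports[sale["from"]], "->", airports[sale["to"]])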

To sum up, reference data constitutes the base level of automated information systems, while master data stores information about customers, employees, suppliers of products, equipment, materials and other business entities. As reference data and master data have much in common, in those cases where the considerations apply to both, we will refer to them as "RD & MD", for example, an "RD & MD management system".


Enterprise RD & MD management
The most common and obvious issue of traditional RD & MD management is the lack of support for data that change over time. An address, as a rule, is one of the most important components of RD & MD. Unfortunately, addresses change. A client can move to another street, but a whole building and even a street can also “move”. For instance, in 2009 the address of the “Tower on the waterfront” group of buildings changed from “18, Krasnopresnenskaya embankment” to “10, Presnenskaya embankment”. Thus, the query “How much mail was delivered in 2009 to the office of the company renting premises in the Tower on the waterfront?” should correctly handle delivery records for two different addresses.
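
A sketch of the time-aware lookup this query requires, assuming an invented renaming date and invented delivery records: the address is stored with a validity period, and deliveries are matched against whichever address was valid on the delivery date.

from datetime import date

# The same office has two address records with validity periods (the renaming
# date within 2009 is invented for this sketch).
office_addresses = [
    {"address": "18, Krasnopresnenskaya embankment",
     "valid_from": date(2000, 1, 1), "valid_to": date(2009, 6, 30)},
    {"address": "10, Presnenskaya embankment",
     "valid_from": date(2009, 7, 1), "valid_to": date(9999, 12, 31)},
]

deliveries = [
    {"address": "18, Krasnopresnenskaya embankment", "date": date(2009, 3, 2),  "items": 4},
    {"address": "10, Presnenskaya embankment",       "date": date(2009, 9, 15), "items": 7},
]

def delivered_to_office(year):
    """Count items delivered to the office in `year`, whichever address was valid."""
    total = 0
    for d in deliveries:
        for a in office_addresses:
            if (d["address"] == a["address"]
                    and a["valid_from"] <= d["date"] <= a["valid_to"]
                    and d["date"].year == year):
                total += d["items"]
    return total

print(delivered_to_office(2009))  # 11: deliveries to both addresses are counted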

However, RD & MD management tools (hardware and software) are not by themselves enough to reflect real-world changes in the IT system. Someone or something is needed to track the changes. That is, organizational measures are required, for example, qualified staff with responsibilities aligned with the adopted RD & MD management methodology.

Thus, the enterprise RD & MD management includes three categories of activities:

1. Methodological activities that set guidelines, regulations, standards, processes and roles which support the entire life cycle of the RD & MD.

2. Organizational arrangements that determine the organizational structure, functional units and their tasks, roles and duties of employees in accordance with the methodological requirements.

3. Technological measures, which lie at the IT level and ensure the execution of methodological activity and organizational arrangements.

In this article we will primarily discuss the technological measures, which include the creation of a unified data model for RD & MD, management and archiving of historical RD & MD, identification of RD & MD objects, elimination of duplicates, resolution of conflicts between RD & MD objects, enforcement of referential integrity, support for the RD & MD object life cycle, formulation of cleansing rules, creation of an RD & MD management system, and its integration with enterprise production information systems.

Technological shortcomings of RD & MD management Let us consider in more detail the technological side of RD & MD infrastructure development and the associated shortcomings of traditional RD & MD management.

No unified data model for RD & MD A unified data model for RD & MD is often missing or not formalized, which prevents the efficient use of RD & MD objects and obstructs any automation of data processing. The data model is the basic and most important part of RD & MD management, answering, for example, the following questions:

• What should be included in the identifying attribute set of an RD & MD object?

• Which attributes of an RD & MD object should be treated as RD & MD and stored in the data model, and which should be treated as operational data and left in the production system?

• How to integrate the model with external identifiers and classifiers?

• Does a combination of two attributes from different IT systems provide a third unique attribute, important from a business perspective?

There is no single regulation of history and archive management Historical information in existing enterprise IT systems is often maintained under each system's own regulations and life cycles governing the processing, aggregation and archiving of RD & MD objects. Synchronizing and archiving historical data and bringing them to a common view is a nontrivial task even with a common RD & MD data model. An example of the problems caused by the lack of historical reference data is given in the section "Law compliance and risk reduction".

The complexity of identifying RD & MD objects RD & MD objects in various IT systems have their own identifiers, i.e. sets of attributes. Together, these attributes can uniquely identify an RD & MD object within an information system, and such a set can be treated as an analog of a composite primary key in a database. The situation becomes more complicated when no common set of attributes can be found for the same objects in different systems. In this case the problem of identifying and matching objects across IT systems changes from deterministic to probabilistic, and high-quality identification of RD & MD objects becomes difficult without specialized data analysis and processing tools.
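
As a rough illustration of probabilistic matching (this is not the method of any particular MDM product; the records, attributes and weights below are invented), a short Python sketch scores two records coming from different systems:

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Crude string similarity in [0, 1]; real MDM tools use much richer probabilistic models."""
        return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()

    def match_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
        """Weighted score over a set of comparable attributes from two different systems."""
        total = sum(weights.values())
        return sum(w * similarity(rec_a.get(attr, ""), rec_b.get(attr, ""))
                   for attr, w in weights.items()) / total

    crm_record = {"name": "IVANOV IVAN", "city": "Moscow", "phone": "+7 495 123-45-67"}
    erp_record = {"name": "Ivanov I.",   "city": "Moscow", "phone": "74951234567"}

    score = match_score(crm_record, erp_record, weights={"name": 0.5, "city": 0.2, "phone": 0.3})
    print(f"match score: {score:.2f}")  # treated as a probable duplicate if above a tuned threshold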

The emergence of duplicate RD & MD objects The complexity of object identification leads to the potential emergence of duplicates (or possible duplicates) of the same RD & MD object in different systems, which is the main and most significant problem for business. Duplication of information leads to duplicated processing costs, duplicated "entry points", and increased costs of maintaining object life cycles. The cost of manual reconciliation of duplicates is also high, as it often goes beyond the boundaries of IT systems and requires human intervention. It should be stressed that the occurrence of a duplicate is a systemic error that appears at the earliest steps of the business processes involving RD & MD objects. At later stages of business process execution the duplicate acquires bindings and attributes, and the situation becomes even more complicated.

Metadata inconsistency of RD & MD Each information system which supports a line of business generates RD & MD objects specific to that business. Such an IT system defines its own set of business rules and constraints applied both to the attribute composition (metadata) and to the attribute values. As a result, the rules and constraints imposed by different information systems conflict with each other, thus nullifying even theoretical attempts to bring all RD & MD objects to a single view. The situation is exacerbated when, although the data models outwardly match, data with the same semantic meaning have different presentations: spelling variants, permutations in addresses, shortened names, different character sets, contractions and abbreviations.
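
A minimal sketch of value normalization, assuming an invented abbreviation map, shows how differing presentations of the same value can be reduced to one canonical form before comparison:

    import re
    import unicodedata

    # Illustrative abbreviation map; a real system would maintain this as reference data.
    ABBREVIATIONS = {"str.": "street", "emb.": "embankment", "bldg.": "building"}

    def normalize(value: str) -> str:
        """Bring differently presented but semantically equal values to one canonical form."""
        text = unicodedata.normalize("NFKC", value).casefold()
        for short, full in ABBREVIATIONS.items():
            text = text.replace(short, full)
        text = re.sub(r"[^\w\s]", " ", text)     # drop punctuation
        return " ".join(sorted(text.split()))    # ignore word order (e.g., address permutations)

    print(normalize("Presnenskaya emb., 10") == normalize("10 Presnenskaya embankment"))  # True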

Referential integrity and synchronization of the RD & MD model In real life RD & MD objects, residing in their respective IT systems, contain not only values but also references to other RD & MD objects, which may be stored and managed in separate external systems. Here the problem of synchronization and integrity maintenance of the enterprise-wide RD & MD model arises in full force. One common way of dealing with such problems is to switch to RD & MD that are maintained by, and imported from, sources outside the organization.

Discrepancy of the RD & MD object life cycle Because the same RD & MD object is present in a variety of enterprise systems, its entry and modification in these systems are inconsistent and often stretched over time. The same object may even be in mutually exclusive statuses in different systems (active in one, archived in another, deleted in a third), making it difficult to maintain the integrity of RD & MD objects. Objects that are unlinked and "spread" over time are difficult to use both in transactional and in analytical processing.
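
One possible reconciliation policy can be sketched as follows (purely illustrative; the precedence order and status names are assumptions, not a standard):

    # Hypothetical precedence of life cycle statuses: the "most alive" status wins when
    # the same object is reported differently by different systems.
    STATUS_PRECEDENCE = ["deleted", "archived", "frozen", "agreed", "created", "active"]

    def consolidated_status(statuses_by_system: dict) -> str:
        """Pick a single status for the golden record from conflicting per-system statuses.
        Assumes every reported status appears in STATUS_PRECEDENCE."""
        return max(statuses_by_system.values(), key=STATUS_PRECEDENCE.index)

    reported = {"CRM": "active", "ERP": "archived", "billing": "deleted"}
    print(consolidated_status(reported))  # 'active': the object is still in use somewhere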

Cleansing rules development RD & MD cleansing rules are often, quite fairly, attributed to the methodological domain. Of course, IT professionals need a problem statement from business users: for example, when airport codes should be updated, or which of two payment orders has the correct data encoding. But business specialists are not familiar with the implementation details of the IT systems they use, and the documentation for these systems is often incomplete or missing. Therefore an analysis of the information systems is required to clarify the existing cleansing rules and to identify new ones where needed.

Wrong core system selection for RD & MD management Most often, the most significant sources and consumers of RD & MD are large legacy enterprise information systems that form the core of the company's business. In real life, such a system is often chosen as the "master system" for RD & MD management instead of creating a specialized RD & MD repository. The fact that this role was never part of the system's initial design is usually ignored. As a result, any modification of these systems related to RD & MD turns into large and unnecessary spending. The situation is exacerbated when qualitatively new features must be added along with the RD & MD management subsystems: batch data processing, data formatting and cleansing, and the assignment of data stewards.

IT systems are not ready for RD & MD integration To fully implement RD & MD management across existing enterprise IT systems, these systems must be integrated. Most often this integration is needed not as a one-time local event but as a change to the processes living within the IT systems. Integration intended only to support the operational mode is not enough: it also has to cover the initial batch data loading (ETL) as well as the procedures for manual data verification (reconciliation).

Not all automated information systems are ready for such changes, and not all provide the necessary interfaces; for most of them this is completely new functionality. During implementation, architectural questions arise concerning the choice among different approaches to developing the RD & MD management system and integrating it with the enterprise technology landscape. To underline the importance of this point, we note that there are well-designed and proven architectural patterns and approaches aimed at the proper deployment and integration of RD & MD.

Examples of traditional RD & MD management issues Thus, the main issues of RD & MD management arise from the decentralization and fragmentation of RD & MD across the company's IT systems; the following concrete examples show how they manifest in practice.

Passport data as a unique identifier In a major bank, while creating a customer data model, it was decided to include passport data among the identifying attributes, assuming maximum selectivity. During execution of client data merge procedures it turned out that a customer's passport is not unique. For example, customers who had dealt with the bank first with an old passport and then with a new one were registered as different clients. Analysis of client records also revealed instances where a single passport number had been reported by thousands of customers. On top of that, one data source was a banking information system in which the passport fields were optional and were routinely filled with "garbage".

It should be noted that these customer data quality problems were not expected and were found only at the data cleansing stage, which required additional time and resources to finalize the cleansing rules and improve the customer data model.
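
A simple profiling check of this kind can be sketched in a few lines of Python (the data are invented); it measures how selective a candidate identifying attribute really is before it is trusted:

    from collections import Counter

    # Illustrative customer extract; 'passport' is the candidate identifying attribute.
    customers = [
        {"id": 1, "passport": "4509 123456"},
        {"id": 2, "passport": "4509 123456"},   # same passport reported for another client
        {"id": 3, "passport": "0000 000000"},   # "garbage" value from an optional field
        {"id": 4, "passport": "0000 000000"},
        {"id": 5, "passport": "4601 654321"},
    ]

    counts = Counter(c["passport"] for c in customers)
    suspect = {value: n for value, n in counts.items() if n > 1}
    print("non-unique identifier values:", suspect)
    print("selectivity:", len(counts) / len(customers))  # 1.0 would mean a perfect identifier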

Address as a unique identifier In another case, an insurance company merged customers' personal data using the address as an identifying attribute. It turned out that most clients were registered at addresses such as "same" or "ibid." The poor-quality data were supplied by the application system supporting the activities of insurance agents: it allowed agents to interpret the client questionnaire fields freely and lacked any logical or format validation of data input.

The need for mass contract renewal In a third case, an existing enterprise CRM system was being connected to an RD & MD management system, and only during the testing phase did it become clear that the CRM system could not automatically accept updates from the RD & MD management system. Accepting them required procedural actions: inviting the customer in and re-signing the paper contract documents that carry critical information related to RD & MD. Because of the large amount of work involved, both the technological and the organizational aspects of RD & MD integration and usage had to be reconsidered.

Divergence of once-consistent data The fourth example describes a typical situation in many organizations. As a result of rapid business growth, a company decided to open a new line of business serving clients in the B2C / B2B style over the Internet. To do this, a new IT system supporting the new business was acquired. During deployment, integration with the existing enterprise RD & MD was required, and the existing master data had to be extended with attributes specific to the new IT system. The lack of a dedicated RD & MD management system made this task difficult, so RD & MD were loaded into the new system once, without any feedback to the existing enterprise IT landscape. Some time later this produced two independent versions of the client directories. Initially the problem was handled by manual processing of customer data in spreadsheets, but as the number of customers grew considerably the directories "diverged" further, and manual processing proved ineffective and expensive. The situation eventually escalated to the business users, who no longer had an overall picture of their customers for marketing campaigns.

Benefits of corporate RD & MD Enterprise RD & MD management has the following advantages:

• Law compliance and risk reduction

• Profits increase and customer retention

• Cost reduction

• Increased flexibility to support new business strategies.

It sounds too good to be true, so let us consider each of these benefits through practical examples.

Law compliance and risk reduction Prosecuting authorities demanded that a large company provide data for the previous 10 years. The task seemed simple and doable: the company had long before introduced procedures for regular archiving and backup of data and applications, the storage media were kept in a secure room, and the equipment to read them had not yet become obsolete. However, after the historical data were restored from the archives it turned out that they made no practical sense. The RD & MD had changed repeatedly over that period, and it was impossible to determine what the data referred to. Nobody had foreseen RD & MD archiving, because that part of the information had seemed stable at the time. Major penalties were imposed on the company, the managers responsible for those decisions were replaced, and a unit responsible for RD & MD management was established to avoid a repetition of such an unpleasant situation.

Profits increase and customer retention A large flower shop was one of the first to realize the effectiveness of e-mail marketing. A web site was created to run marketing campaigns, where customers could subscribe to mailings for Valentine's Day, the birth of a first child, a loved one's birthday, and so on. Clients then received greetings along with suggested flower arrangements. However, the advertising campaigns were built with the assistance of various developers who created disparate, unrelated applications. As a result, customers could receive up to ten letters for the same occasion, which annoyed them and drove them away. Each successive advertising campaign was therefore not only unprofitable but also reduced the number of existing customers. The flower shop had to spend considerable resources to rework and integrate the applications. The high cost was driven by the heterogeneity of customer information (multiple formats, addresses and telephone numbers), which made it very hard to identify customers and eliminate duplicate entries.

Cost reduction One of the main requirements on a company's products is the ability to respond quickly to changes in demand, to launch a new product in a short time, and to stay in contact with consumers. We see yesterday's undisputed leaders falling behind, while newcomers bringing their product to market for the first time greatly increase their profits and capitalization. Under these conditions, the various corporate information systems responsible for product development, supply and sales, service and evolution should be based on a unified information foundation covering all lines of the company's business. Launching a new product then requires less time and money thanks to seamless interaction between the supporting information systems.

Increased flexibility to support new business strategies Eliminating the fragmentation and decentralization of RD & MD makes it possible to provide the information as a service. This means that any IT system, following established communication protocols and access rights, can query the enterprise RD & MD management system and obtain the necessary data. A service-oriented approach allows flexible data services to be built in line with changing business processes, thus providing a timely response of IT systems and services to changing requirements.
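
A toy sketch, with invented customer data and access rights, illustrates the idea of querying master data as a service (in practice a web service) rather than keeping local copies:

    # A toy in-memory "MDM service": any consuming system calls one function instead of
    # maintaining its own copy of the customer directory. All names and data are invented.
    MASTER_CUSTOMERS = {
        "C-001": {"name": "Ivanov Ivan", "segment": "retail", "status": "active"},
    }

    ACCESS_RIGHTS = {"crm": {"name", "segment", "status"}, "mailing": {"name"}}

    def get_customer(consumer: str, customer_id: str) -> dict:
        """Return only the attributes the consuming system is entitled to see."""
        record = MASTER_CUSTOMERS.get(customer_id, {})
        allowed = ACCESS_RIGHTS.get(consumer, set())
        return {attr: value for attr, value in record.items() if attr in allowed}

    print(get_customer("mailing", "C-001"))  # {'name': 'Ivanov Ivan'}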

Architectural principles of RD & MD management The basic architectural principles of master data management are published in paper [3]. Let us list them briefly:

• The MDM solution should provide the ability to decouple information from enterprise applications and processes to make it available as a strategic asset of the enterprise.

• The MDM solution should provide the enterprise with an authoritative source for master data that manages information integrity and controls distribution of master data across the enterprise in a standardized way that enables reuse.

• The MDM solution should provide the flexibility to accommodate changes to master data schema, business requirements and regulations, and support the addition of new master data.

• The MDM solution should be designed with the highest regard to preserve the ownership of data, integrity and security of the data from the time it is entered into the system until retention of the data is no longer required.

• The MDM solution should be based upon industry-accepted open computing standards to support the use of multiple technologies and techniques for interoperability with external systems and systems within the enterprise

• The MDM solution should be based upon an architectural framework and reusable services that can leverage existing technologies within the enterprise.

• The MDM solution should allow incremental implementation, so that it can demonstrate immediate value.

Based on the practical examples considered above, we can expand the list of architectural principles with additional requirements for an RD & MD management system:

• Master data system must be based on a unified RD & MD model. Without a unified data model it is not possible to create and operate a RD & MD system as a single enterprise source of master data.

• Unified rules and regulations of master data history and archiving management are needed. The purpose is to provide opportunities to work with historical data to improve the accuracy of analytical processing, law compliance and risk reduction.

• An MDM solution must be capable of identifying RD & MD objects and eliminating duplicates. Without identification it is impossible to build a unified RD & MD model or to detect duplicates, which cause multiple "entry points" and increase the cost of object processing and life cycle maintenance.

• RD & MD metadata must be consistent. With mismatched metadata, even if a unified RD & MD model can be created, in practice it will be of low quality, because objects end up duplicated under different definitions and presentations.

• An MDM solution must support referential integrity and synchronization of RD & MD models. Depending on the solution architecture, the RD & MD model may contain both objects and links, so synchronization and integrity are necessary to maintain a unified RD & MD model.

• A consistent life cycle of RD & MD objects must be supported. An RD & MD object stored in different IT systems at different stages of its life cycle (e.g., created, agreed, active, frozen, archived, destroyed) essentially destroys the unified RD & MD model. The life cycle of RD & MD objects must be expressed as a set of procedures and of methodological and regulatory documents approved by the organization.

• The development of cleansing rules for RD & MD objects and their correction must be supported. This keeps the unified RD & MD model relevant when business requirements and legislation change.

• A specialized RD & MD repository should be created instead of using an existing information system as the RD & MD "master system". The result is flexibility and performance of the RD & MD management system, data security and protection, and improved availability.

• The RD & MD management system must take into account that IT systems may not be ready to integrate RD & MD. Systems integration requires a counter-action: the existing system should be further developed to meet the requirements of the centralized RD & MD.

Conclusion The practice of creating RD & MD systems discussed in this paper shows that a company attempting to develop and implement such an enterprise-level system on its own faces a number of problems that lead to significant material, labor and time costs.

As follows from the case studies, the main RD & MD technological challenges are caused by the decentralization and fragmentation of RD & MD in the enterprise. To address these challenges, requirements for an RD & MD management system have been proposed and formulated.

The following articles will discuss tools that can facilitate the creation of enterprise RD & MD management system, the main implementation stages of RD & MD management system, and the roles on various phases of the RD & MD life cycle.

Literature 1. Asadullaev S., “Data, metadata and master data: the triple strategy for data warehouse projects”, 09.07.2009, http://www.ibm.com/developerworks/ru/library/r-nci/index.html

2. Kolesov A., «Technology of enterprise master data management», PC Week/RE, № 18(480), 24.05.2005, http://www.pcweek.ru/themes/detail.php?ID=70392

3. Oberhofer M., Dreibelbis A., «An introduction to the Master Data Management Reference Architecture», 24.04.2008, http://www.ibm.com/developerworks/data/library/techarticle/dm-0804oberhofer/

Data quality management using IBM Information Server Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A 08.12.2010 http://www.ibm.com/developerworks/ru/library/sabir/inf_s/index.html

Abstract Data integration projects often fail to provide users with data of the required quality. The reasons are the lack of established rules and processes for improving data quality, poor choice of software, and insufficient attention to how the work is organized. The belief that data quality can be improved after project completion is also widespread.

The aim of this study is to determine the required process of data quality assurance, to identify the roles and qualifications, as well as to analyze the tools for interaction between participants in a data quality improvement project.

Introduction Data quality has a crucial impact on the correctness of decision making. Inaccurate geological data can lead to the collapse of high-rise buildings; low-quality oil and gas exploration data cause significant losses due to incorrectly assessed drilling outcomes; incomplete data on a bank's customers are a source of errors and losses. Other examples of the serious consequences of inadequate data quality are published in [1].

Despite apparent agreement on the need to improve data quality, the intangibility of data quality as an end product raises doubts about the advisability of spending on this work. Typically a customer, especially one from financial management, asks what profit the organization will see on completion of the work and how the result can be measured.

Some researchers identify up to 200 data quality characteristics [2], so the absence of a precise quality criterion also hinders the start of data quality improvement work.

An analogy to water pipes may clarify the situation. Each end user knows that he needs water suitable for drinking. He does not necessarily understand the chemical, organoleptic, epidemiological and other requirements for water.

Similarly, an end user does not have to understand what technology, engineering structures and equipment are needed for water purification. Their purpose is to take water from designated sources, treat it in accordance with the requirements, and deliver it to consumers.

To sum up, we can say that to achieve the required data quality it is necessary to create an adequate infrastructure and arrangement of required procedures. That is, the customer does not receive "high-quality data in a box" (equivalent - a bottle of drinking water), but the processes, tools and methods for their preparation (equivalent - town water supply).

The aim of this study is to determine the process of data quality improvement to identify the needed roles and qualifications, as well as to analyze the tools for data quality improvement.

Metadata and project success I first faced the metadata problem in an explicit form in 1988, when I was the software development manager for one of the largest petrochemical companies. In simplified form, the task was to enter a large amount of raw data manually, to apply complicated and convoluted algorithms to process the input data, and to present the results on screen and on paper. The complexity of the task and the large amount of work required that different parts of it be performed by several parallel workgroups of customers and developers. The project moved fast, and customer representatives regularly received working prototypes of the future system. Discussion of the prototypes took the form of lengthy debates on the correctness of the calculations, because no one could substantiate their doubts or identify why the experts rejected the results. That is, the results did not correspond to the customers' intuitive understanding of the expected values.

In this connection we performed a data and code review to check the consistency of the input and output forms and the data processing algorithms. Imagine our surprise when we discovered that the same data had different names in the input and output forms. This "discovery" forced us to change the architecture of the system (moving all the names first into a separate file and later into a dedicated database table) and to reexamine the data processing algorithms. Indeed, different names for the same data had led to different understandings of their meaning and to different algorithms for processing them.

The corrections made it possible to substantiate the correctness of the calculations and to simplify support of the developed system. To change an indicator's name, the customer only had to change one text field in one table, and the change was reflected in all forms.

This was the first but not the last time that metadata had a critical influence on project success. My later data warehouse development practice reaffirmed the importance of metadata more than once.

Metadata and master data paradox The need to maintain metadata was stressed in the earliest publications on data warehouse architecture [3]. At the same time, master data management as part of the DW development process was not considered until recently.

Paradoxically, master data management was handled quite satisfactorily, while metadata management was simply ignored. Perhaps the paradox can be explained by the fact that a DW is usually implemented on relational databases, where the third normal form automatically leads to the need for master data management. The lack of off-the-shelf tools on the software market also meant that companies had difficulty implementing enterprise metadata management.

Metadata are still out of the focus of developers and customers, and ignoring them is often the cause of DW project delays, cost overruns, and even project failure.

Metadata impact on data quality Many years ago, reading Dijkstra, I found his statement: "I know one -very successful- software firm in which it is a rule of the house that for one year project coding is not allowed to start before the ninth month! In this organization they know that eventual code is no more than the deposit of your understanding."[4]. At that moment I could not understand what one can do for eight months, without programming, without demonstrating working prototypes to the customer, without improving the existing code basing on the discussions with the customer.

Now, hopefully, I can guess what the developers were busy with for those eight months. I believe that understanding of a solution is best formalized through metadata: a data model, a glossary of technical terms, source descriptions, data processing algorithms, the application launch schedule, identification of responsible personnel, access requirements... All this and much more is metadata.

In my opinion one of the best definitions of a specification of a system under development is given in [5]: "A specification is a statement of how a system - a set of planned responses - will react to events in the world immediately outside its borders". This definition shows how closely metadata and a system specification are related. In turn, there are close links between metadata, data, and master data [6]. This gives reason to believe that the better the metadata are worked out, the higher the quality of the system specification and, under certain circumstances, the higher the data quality.

Data quality and project stages Data quality must be ensured at all stages of problem statement, design, implementation and operation of information system.

The problem statement is eventually expressed in formulated business rules, adopted definitions, industry terminology, a glossary, identification of data origins, and data processing algorithms described in business language. This is business metadata. Thus, the problem statement is a definition of business metadata. The better the problem statement and the definition of business metadata, the better the data quality that the designed IT system can provide.

IT system development involves naming entities (such as tables and columns in a database), identifying the links between them, and programming data processing algorithms in accordance with the business rules. Thus the following statements are equally true:

1. Technical metadata appear during the development phase;

2. Development of the system is the definition of technical metadata.

Documenting the design process establishes each team member's personal responsibility for the results of their work, which improves data quality thanks to the existence of project metadata.

Deviations from established regulations may happen during system operation. Operational metadata, such as user activity logs, computing resource usage, and application statistics (e.g., execution frequency, record counts, component analysis), make it possible not only to identify and prevent incidents that degrade data quality, but also to improve the quality of user service through optimal utilization of resources.

Quality management in metadata life cycle Extended metadata life cycle [7] consists of the following phases: analysis and understanding, modeling, development, transformation, publishing, consuming, reporting and auditing, management, quality management, ownership (Pic. 1).

The quality management stage addresses the lineage of heterogeneous data in data integration processes, quality improvement of information assets, and input data quality monitoring, and makes it possible to eliminate data structure and processability issues before they affect the project.

Pic.1. Extended metadata management life cycle

Data flows and quality assurance At first glance, the role of the quality management stage is unremarkable. However, if we use the role descriptions [7, 8] and draw Table 1, which shows the tasks of each role at each stage of the metadata management life cycle, it becomes evident that all project tasks can be divided into two streams.

The first stream, running along the table's diagonal, contains the activities aimed at building the functionality of the metadata management system.

The second stream consists of tasks to improve data quality. It should be noted that all project participants contribute to data quality improvement if the project team is selected properly.

Let us consider the data quality improvement task stream in more detail. In practice, four indicators of data quality are usually discussed: comprehensiveness, accuracy, consistency and relevance [9].

Comprehensiveness implies that all required data are collected and presented. For example, a client address may omit a supplementary building number, or a patient's medical history may be missing a record of a disease.

Accuracy indicates that the presented values (e.g., a passport number, a loan period, or a departure date) contain no errors.

Consistency is closely related to metadata and influences the understanding of data. Examples are dates in different formats, or a term such as "profit" that is calculated differently in different countries.

Relevance is associated with timely data updates. A client can change their name or get a new passport; a well's flow rate may change over time. In the absence of timely updates the data may be complete, accurate and consistent, but out of date.
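
These four indicators can be illustrated with naive checks on an invented record (real projects would use far richer rules; all field names and thresholds below are assumptions):

    import re
    from datetime import date

    record = {"passport": "4509 123456", "address": "10 Presnenskaya embankment",
              "currency": "RUB", "updated_on": date(2009, 1, 15)}

    REQUIRED = ("passport", "address", "currency", "updated_on")
    CURRENCY_CODES = {"RUB", "USD", "EUR"}            # agreed code list (consistency)

    def comprehensive(rec):   # all required fields present and non-empty
        return all(rec.get(f) for f in REQUIRED)

    def accurate(rec):        # value matches the expected pattern
        return re.fullmatch(r"\d{4} \d{6}", rec["passport"]) is not None

    def consistent(rec):      # value drawn from the agreed code list
        return rec["currency"] in CURRENCY_CODES

    def relevant(rec, as_of=date(2010, 12, 8)):       # updated within the last year
        return (as_of - rec["updated_on"]).days <= 365

    print(comprehensive(record), accurate(record), consistent(record), relevant(record))
    # True True True False: the record is complete, accurate and consistent, but out of date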

Requirement changes, which inevitably accompany IT system development, can, like any changes, lead to a result opposite to the desired one.

• Completeness of data may suffer from inaccurate problem statement.

• Data accuracy can be reduced as a result of increased load on the employee responsible for manual data entry.

• Consistency can be impaired due to the integration of a new system with a different understanding of data (metadata).

• Relevance of data can be compromised by the inability to update data timely due to insufficient throughput of the IT system.

So IT professionals responsible for change management (for example, project manager) should analyze the impact of the changes on the IT environment.

Discrepancies between the glossary and database columns lead to data consistency violations, which are essentially metadata contradictions. Since identifying these conflicts requires understanding both the subject area and IT technologies, a business analyst must be involved at this step to gain complete visibility of the actual state of the data.

Revealed discrepancies may require updates to the business classification, which must be performed by a subject matter expert.

Consistency as a data quality indicator requires the elimination of discrepancies in metadata. This work should be performed by a data analyst.

Enterprise data used in the company's business are its most important information assets, or data resources. The quality of these resources has a direct impact on business performance and is a concern of, among others, IT developers, who can use the design tools for managing and understanding the information assets created and made available through IBM Information Server. In this way, an IT developer helps ensure data comprehensiveness, accuracy, consistency and relevance.

Business users have tools for tracking data lineage, which allows them to identify missing data and to ensure comprehensiveness.

Stewards maintain data consistency by managing metadata to support a common understanding of data meaning by all users and project participants, and by monitoring the comprehensiveness, accuracy and relevance of the data.

Table 1. Data flows and quality assurance

Roles, interactions and quality management tools Pic. 2 shows the interaction pattern between the roles and the tools used [8]. Tasks related to data quality improvement, discussed in the previous section, are highlighted.

Groups of tasks related to one role are enclosed in a dotted rectangle. Interactions between the roles form the workflow, whose direction is marked by arcs with arrows. Let us consider in more detail the tools and the tasks performed by each role.

A project manager, who is responsible for change management process, analyzes the impact of changes on the IT environment with the help of Metadata Workbench.

A business analyst reveals contradictions between the glossary and database columns and notifies metadata authors using the functionality of Business Glossary and FastTrack. Data analysis tools built into QualityStage help the business analyst gain full visibility of the actual state of the data.

A subject matter expert (the metadata author) uses Business Glossary to update the business classification (taxonomy), which supports a hierarchical structure of terms. A term is a word or phrase that can be used to classify and group objects in the metadata repository. If joint work by several experts is necessary, Business Glossary provides them with collaboration tools for annotating data definitions, editing descriptions, and categorizing them.

Using Business Glossary and Rational Data Architect, the data analyst eliminates the conflicts between the glossary and the database tables and columns that were identified by the business analyst.

Metadata Workbench provides an IT developer with tools for metadata review, analysis, design and enrichment, and allows them to manage and understand the information assets created and made available through IBM Information Server.

Business users, who are responsible for legislative requirements compliance, are able to trace data lineage using appropriate Metadata Workbench tools.

Stewards support a common understanding of data meaning by all users and project participants with the help of Information Analyzer.

Necessary and sufficient tools As follows from the analysis, IBM Information Server product family provides all participants with the necessary tools to ensure data quality.

Information is extracted from a data source system and then evaluated, cleaned, enriched, consolidated and loaded into the target system. Data quality improvement is carried out in four stages.

1. The research stage is performed in order to fully understand the information.

2. The standardization stage reformats data from different systems and converts them to the required content and format.

3. The matching stage ensures data consistency by linking records from one or more data sources that relate to the same entity; it creates semantic keys for identifying relationships between pieces of information.

4. The survival stage ensures that the best available data survive and are prepared correctly for transfer to the target system; it is required to obtain the best representation of interrelated information.
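
These four stages can be illustrated with a deliberately simplified Python sketch (this is not the QualityStage API, only the idea of the pipeline; the records and rules are invented):

    raw = [
        {"source": "CRM", "name": " Ivanov  Ivan ", "phone": "+7 (495) 123-45-67"},
        {"source": "ERP", "name": "IVANOV IVAN",    "phone": "74951234567"},
    ]

    def investigate(records):                # 1. understand the data (here: trivial profiling)
        return {"records": len(records), "sources": {r["source"] for r in records}}

    def standardize(records):                # 2. bring values to a common content and format
        return [{**r, "name": " ".join(r["name"].split()).title(),
                      "phone": "".join(ch for ch in r["phone"] if ch.isdigit())}
                for r in records]

    def match(records):                      # 3. link records describing the same entity
        groups = {}
        for r in records:
            groups.setdefault((r["name"], r["phone"][-10:]), []).append(r)
        return list(groups.values())

    def survive(group):                      # 4. keep the best representation of each entity
        return max(group, key=lambda r: len(r["name"]))

    print(investigate(raw))
    golden = [survive(g) for g in match(standardize(raw))]
    print(golden)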

Pic. 2. Roles, interactions and quality management tools

Thus, the IBM Information Server family is a necessary tool for ensuring data quality, but not always a sufficient one, since in some cases additional instruments are needed for master data quality assurance. The issues of master data quality assurance will be discussed in future articles.

Conclusion Data quality assurance is a complex process which requires the involvement of all project participants. The impact of metadata quality is extremely high, so it is important to ensure quality management within the metadata life cycle. The analysis showed that, when used properly, the IBM Information Server family creates a workflow that ensures data quality. Its tools provide every employee involved in a data integration project with quality management instruments and ensure effective interaction within the project team.

Literature 1. Redman T.C. “Data: An Unfolding Quality Disaster”. Information Management Magazine, August 2004. http://www.information-management.com/issues/20040801/1007211-1.html

2. Wang, R., Kon, H. & Madnick, S. “Data Quality Requirements Analysis and Modeling”, Ninth International Conference of Data Engineering, 1993, Vienna, Austria.

3. Hackathorn R. “Data Warehousing Energizes Your Enterprise,” Datamation, Feb.1, 1995, p. 39.

4. Dijkstra E.W. “Why is software so expensive?'”, in "Selected Writings on Computing: A Personal Perspective", Springer-Verlag, 1982, pp. 338-348

5. DeMarco T. “The Deadline: A Novel About Project Management”, Dorset House Publishing Company, Incorporated, 1997

6. Asadullaev S. “Data, metadata and master data: the triple strategy for data warehouse projects”, 09.07.2009. http://www.ibm.com/developerworks/ru/library/r-nci/index.html

7. Asadullaev S. “Metadata Management Using IBM Information Server”, 30.09.2008. http://www.ibm.com/developerworks/ru/library/sabir/meta/index.html

8. Asadullaev S. “Incremental implementation of IBM Information Server’s metadata management tools”, 21.09.2009, http://www.ibm.com/developerworks/ru/library/sabir/Information_Server/index.html

9. Giovinazzo W. “BI: Only as Good as its Data Quality”, Information Management Special Reports, August 18, 2009. http://www.information-management.com/specialreports/2009_157/business_intelligence_bi_data_quality_governance_decision_making-10015888-1.html

Primary data gathering and analysis system - I Problem formulation, data collection and storage Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A 17.02.2011 http://www.ibm.com/developerworks/ru/library/sabir/warehouse-1/index.html

Abstract A standard solution for primary data collection, storage and analysis is proposed. The solution is based on manual input using IBM Forms and IBM InfoSphere Warehouse for storage. The analysis of collected data by means of IBM InfoSphere Warehouse analytical tools and IBM Cognos is discussed in the second part of the article [1].

The proposed approach can be implemented as a basis for a variety of solutions for different industries and enterprises.

Introduction A distribution company purchases goods and distributes them to regional customers. Planning new purchases requires information from the regions on goods balances. The customers' representatives enter data manually, and despite training, instructions and reminders, the data input does not meet expectations. As a result, a whole department in the central office verifies the collected data by phone together with the regional representatives.

A transportation company has an extensive fleet of cargo vehicles. Despite the presence of automatic diagnostic tools, the technical inspection data of the rolling stock are recorded manually on paper forms and entered into the computer later. In the interval between problem detection and data entry, a defective vehicle can be sent for loading, accidentally or intentionally. The fines specified for this situation cause losses to the transportation company and generate profits for the loading firm that can hardly be called fair.

A federal agency collects reports from the regions on a regular basis. Experts in the regions are trained, and they use these skills to provide the agency with data that make their region look decent. The real picture, of course, is blurred, and the agency operates with inaccurate data.

The list of similar examples could be continued. An offender can steal a car in one region, commit a robbery in a second, and illegally purchase a weapon in a third, and his crimes cannot be combined into one case because of small errors in the recording of the incidents. Or an employer can reside in one region, open a business in another, and pay taxes in a third.

These examples are united by the same functional requirements:

• Collection of primary, statistical, or reported data from remote locations is required;

• The data must be checked at the workplace before being sent to the center;

• The collected data must be cleansed and stored for a period defined by regulations;

• Management reports must be produced to assess the state of affairs in the regions;

• Analysis based on the collected statistics must be performed to identify regular patterns and support management decisions.

In this paper we consider a standard solution for collection, storage and analysis of primary data that was entered manually. Despite obvious limitations, this approach can be used as a basis of various solutions for different industries and enterprises.

System requirements Let us consider the most general requirements typical of such systems. Certainly, this task formulation is somewhat artificial. At the same time it strips away details that are specific to real systems but are not central to the typical task of primary data collection, storage and analysis.

Information systems, operating in the regions, are external to the "Collection and analysis of primary data" system that is being developed and do not interact with it.

The required information must be entered manually using on-screen forms. The forms should reproduce the existing, approved paper data collection forms as accurately as possible. Input errors should be checked before the completed e-form is sent.

The completed e-form is always sent from the region to the central office. If data input errors are revealed, the form is resubmitted as a whole.

The completed e-form does not have to be kept in its entirety for audit or legal requirements; only the data contained in the e-form fields are retained. Therefore, there is no need to send the entire e-form to the center; the data extracted from it are sufficient.

The consumers of internal management reports are employees and executives. Management reports are provided as on-screen reports and as hard copies.

External users of the reports are the staff and management of superior bodies and of cooperating organizations and agencies. Reports for external organizations are provided in hard copy.

The number of users in each region is limited to one user (a data input clerk) at any one time; that is, one workplace per region is enough to operate the system. The number of regions is 1000.

The number of simultaneous users of the system in the head office can be estimated at 100 analysts.

The number of approved input forms is 10 daily forms with 100 fields each.

To evaluate the data streams, the following approximate values are sufficient. Experience shows that, on average, one field corresponds to 3 - 5 KB of form data.

Thus, the size of one form can be estimated at 300 - 500 KB, and the daily flow from a single location at about 3 - 5 MB per day. Given that the e-forms are filled in by hand during the workday, the minimum required connection throughput must allow the transfer of about one form per hour, that is, about 1 kbit/s. The total daily flow from all regions is 3 - 5 GB per day.
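
The estimates above can be reproduced with a small worked calculation using the figures from the text (the variable names are, of course, illustrative):

    regions         = 1000
    forms_per_day   = 10           # approved daily forms per region
    fields_per_form = 100
    kb_per_field    = (3, 5)       # KB of form data per field (empirical estimate from the text)

    form_kb  = tuple(f * fields_per_form for f in kb_per_field)      # 300-500 KB per form
    daily_mb = tuple(k * forms_per_day / 1024 for k in form_kb)      # ~3-5 MB/day per region
    total_gb = tuple(m * regions / 1024 for m in daily_mb)           # ~3-5 GB/day in total
    kbit_s   = tuple(k * 8 / 3600 for k in form_kb)                  # ~1 form per hour

    print(f"form size:   {form_kb[0]}-{form_kb[1]} KB")
    print(f"per region:  {daily_mb[0]:.1f}-{daily_mb[1]:.1f} MB/day")
    print(f"all regions: {total_gb[0]:.1f}-{total_gb[1]:.1f} GB/day")
    print(f"throughput:  {kbit_s[0]:.1f}-{kbit_s[1]:.1f} kbit/s per workplace")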

If throughput is insufficient, the peak data flow can be reduced by exploiting the time zone differences between regions and by an approved data transmission schedule.

Storage period for on-line access is 5 years, after which the data are transferred to the archives. Storage period in the archives is 50 years.

Backup and restore tools and procedures should be provided.

Telecommunication infrastructure (active and passive network equipment, communication lines and channels) is beyond the scope of the project.

The proposed solution must be expandable and scalable. For example, the integration of the "Collection and analysis of primary data" system with the document workflow system should be anticipated.

Project objectives The following tasks should be performed within the project:

• Development of e-forms for approved paper forms

• Development of e-forms for new paper forms

• Development of storage for detailed data

• Development of analytical tools

• Development of reporting and visualization tools

• Information Security

• Data back-up

• Data archiving

• Logging of system events

Development of e-forms for approved paper forms On-screen forms should be designed to correspond to the approved paper forms of statistical indicators. Data entry e-forms should be offered for both offline and online modes, with online as the primary mode for data entry.

Development of e-forms for new paper forms Developers should be provided with simple and intuitive design tools for creating new data collection e-forms to extend the production system. There is no need for specially simplified form development tools; standard tools can be used for this purpose.

New forms can be developed both by internal staff and by third-party organizations. The customer must remain completely independent of any external development company, so as to be able to change external developers or to give them direct access to the tools for developing and testing e-forms and applications.

Development of storage for detailed data As the collected detailed data are subject-oriented, integrated, time-variant and non-volatile, a data warehouse rather than an ordinary database is proposed for data storage. A traditional database is oriented toward executing a large number of short transactions, while analytical tasks require a relatively small number of queries over large volumes of data. A data warehouse meets these conditions.

The data warehouse should store not individual e-forms but the data from which an e-form can be prepared. The forms' fields must therefore be mutually agreed upon; in essence, the e-form should be derived from these agreed data.

We highly recommend using a ready-made data model adapted to the needs of the task. Where there are specific requirements, the data warehouse model will be modified jointly with the customer. It is not necessary to store the history of e-forms or the algorithms for their calculation and assembly.

Data can be aggregated into indicators covering longer periods for long-term storage. For example, data stored for more than 5 years can be combined into 5-year indicators; data stored for more than 1 year can be combined into yearly indicators; and data stored for less than 1 year can be left as monthly indicators.
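
A minimal sketch of such aggregation, assuming an invented monthly indicator layout, rolls monthly values up into yearly totals before the data are moved to the archive:

    from collections import defaultdict

    # Hypothetical monthly indicator rows: (region, indicator, year, month, value).
    monthly = [
        ("North", "deliveries", 2006, 1, 120),
        ("North", "deliveries", 2006, 2, 95),
        ("North", "deliveries", 2007, 1, 130),
    ]

    def to_yearly(rows):
        """Combine monthly values into yearly totals for long-term storage."""
        yearly = defaultdict(int)
        for region, indicator, year, _month, value in rows:
            yearly[(region, indicator, year)] += value
        return dict(yearly)

    print(to_yearly(monthly))
    # {('North', 'deliveries', 2006): 215, ('North', 'deliveries', 2007): 130}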

Development of analytical tools Tools for analytical calculations based on gathered statistics should provide the following capabilities:

• Quite simple statistical analysis. For example, the calculation of the efficiency of usage of various resources;

• Scenario and forecast calculations.

Development of reporting and visualization tools Reporting and visualization should provide

• On-screen reports generation

• Paper reports generation

• Reports visualization in graphical form through the web interface (browser)

• Grouping of graphs into a desired set (control panel or dashboard).

Information security Since the data collection, storage and analysis system should not contain sensitive data, information security will be provided by the built-in tools of the operating system, databases and data warehouse, application servers and applications.

Data back-up Backups should be performed by means of built-in tools of databases and data warehouse.

Data archiving Data archiving is necessary for long-term storage. The retention period is currently defined as 50 years. In the future it may be necessary to reduce the reports to a coarser granularity, that is, to combine monthly statements into yearly statistics, and yearly data into multi-year periods.

Logging of system events Source data input must be logged to resolve possible disputes over the non-receipt of data sent by a regional user.

Success criteria Data collection e-forms must comply with the approved list of data input forms.

The procedures performed should follow the business process of collecting, processing, storing and disseminating information agreed with the customer.

Electronic and paper output forms must conform to the approved list of management report forms.

Logging of data input processes should be ensured to track the timeliness of reporting.

Reliability of the delivered information should not be worse than the quality of collected data.

Architecture of system for data collection, storage and analysis In this paper we consider a typical task without taking into account the specific requirements of various projects. Therefore, the proposed architecture is based on the simplest software configuration. Data are collected with the help of IBM Lotus Forms. Storage, analysis and reporting are implemented using IBM InfoSphere Warehouse. The architecture also includes IBM Cognos software for corporate performance management and data interpretation.

Separating the subsystems for data entry, collection, storage and analysis allows different architectures to be constructed, depending on the needs of the task and the requirements of the enterprise infrastructure. A centralized architecture for data collection, storage and analysis is shown in Pic. 1. This architecture assumes that data input can be carried out remotely, while all servers for data acquisition (Lotus Forms), data storage (InfoSphere Warehouse), and data analysis and interpretation (Cognos) are installed in a single data center. Analysts can work both locally and remotely, using the Web interface provided by Cognos to prepare and execute analytical calculations.

A distributed Lotus Forms server architecture can be created if various regional forms must be filled in. In this case, initial forms processing should be implemented on a regional level, and gathered data are sent to the central office where the data storage servers reside.

A combination of large volumes of regional data and poor telecommunication lines may require a solution in which both form processing and data storage are decentralized.

Analytical work with a large number of ad hoc queries may require a distributed infrastructure of Cognos servers. In this case, data from the centralized repository can be transmitted in advance to the regional centers where Cognos servers are deployed. This architecture provides acceptable response times and high-performance execution of analytical tasks in the regions, even in the absence of high-speed communication channels.

Various options of the system architecture for data collection, storage and analysis will be discussed in more detail in a separate article.

Another advantage of the proposed modular system is the possibility of its functionality expansion. Since all modules interact by standard protocols, it is possible to integrate the system with document management, metadata and master data management, and enterprise resource planning systems, as well as with a variety of analytical and statistical packages.

Pic.1. Centralized architecture of system for collecting, storing and analyzing data

Data collection Lotus Forms is a set of products that enables organizations to use e-forms for manual data entry and to transfer the collected data to other systems [2]. Lotus Forms Server can be further integrated with data repositories (e.g., IBM DB2, Oracle, and MS SQL Server) and with a variety of document management systems and document repositories (for example, IBM FileNet).

Architecture of primary data collection based on Lotus Forms is shown in Pic. 2.

Pic. 2. Primary data collection using Lotus Forms

A forms designer prepares e-forms for data entry using Lotus Forms Designer. The e-forms are stored in the forms repository in XFDL format [3, 4], an XML-based format submitted to the W3C.

An application developer develops the Forms application logic, the Webform Server servlets, and the mapping for Transformation Extender (TX), which associates form fields with values in the database.

A translator converts the e-form from XFDL format to HTML and JavaScript for users who are using a thin client (browser).

Users who have installed Lotus Forms Viewer (thick client) may work with e-forms in XFDL format, bypassing the translation to HTML.

Users in the regions enter data by means of Lotus Forms Viewer or a browser. The data can pass through several stages of verification:

• On user's computer during the form filling, using form’s built-in logic

• On Lotus Form Server invoking the application logic

• When data is being loaded into the database.

The data can be transmitted to the InfoSphere Warehouse database through CLI, ODBC, and JDBC protocols.
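
As an illustration of this step, the following sketch shows one possible way to load validated form field values into DB2 over JDBC. The JDBC URL, the credentials, and the table FORMS.REPORT_DATA with its columns are illustrative assumptions, not names defined in this article.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Minimal sketch: insert one validated e-form field value into DB2 via JDBC.
// Host, port, database, credentials and the table FORMS.REPORT_DATA are assumptions.
public class FormDataLoader {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:db2://dwhost:50000/DWDB";
        try (Connection con = DriverManager.getConnection(url, "dwuser", "secret");
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO FORMS.REPORT_DATA (FORM_ID, FIELD_NAME, FIELD_VALUE) VALUES (?, ?, ?)")) {
            ps.setString(1, "REG-2011-0001");   // form instance identifier
            ps.setString(2, "REGION_CODE");     // e-form field name, agreed at the metadata stage
            ps.setString(3, "77");              // field value entered by the user
            ps.executeUpdate();
        }
    }
}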

Data storage

IBM InfoSphere Warehouse Enterprise Edition [4, 5] consists of the following products:

• InfoSphere Warehouse Design Studio, which includes IBM Data Server Developer Workbench, a subset of IBM Rational Data Architect components.

• InfoSphere Warehouse SQL Warehousing Tool

• InfoSphere Warehouse Administration Console, which is part of the Integrated Solutions Console.

• DB2 Enterprise Server Edition for Linux, UNIX and Windows

• InfoSphere Warehouse Cubing Services

• DB2 Query Patroller

• InfoSphere Warehouse Intelligent Miner

• IBM Alphablox and companion documentation

• WebSphere Application Server

The architecture of data storage in IBM InfoSphere Warehouse is shown in Pic. 3.

Design Studio provides a common design environment for creating physical models, OLAP cubes and data mining models, for designing data flows and SQL control flows, and for building Alphablox Blox Builder analytical applications. Design Studio is based on the open source Eclipse platform.

The application developer develops applications using InfoSphere Warehouse Design Studio and deploys them on the server, providing data processing in accordance with the required business logic.

SQL Warehousing Tool (SQW) is a graphical tool that, replacing manual SQL coding, generates SQL code to support and administer the data warehouse. Based on the visual flow of statements, modeled in Design Studio, SQW automatically generates SQL code that is specific to DB2. The integration of SQW with IBM WebSphere DataStage extends the development capabilities of analytical systems based on DB2.

In this project, e-forms filled out according to strict rules are the only data source, so at this stage there is no need for Extract, Transform and Load (ETL) tools such as DataStage. However, as the project evolves, it is expected that other sources will be connected. The ability to use ETL tools provides functional extensibility of the system without the need for radical changes.

The administrator uses the Administration Console, which is a WebSphere application, for deploying and managing applications created in Design Studio. Administration Console allows you to:

• Create and manage database resources, view logs and manage SQW processes.

• Run and monitor database applications, and review their deployment history and execution statistics.

• Manage cube services, import and export cubes and models, and execute the OLAP Metadata Optimization Advisor.

• Maintain database jobs for data mining; load, import and export data mining models.

Pic. 3. Data storage in IBM InfoSphere Warehouse

DB2, IBM Alphablox, and WebSphere Application Server have their own administration tools, but these tools can also be executed from Integrated Solutions Console.

The administrator uses DB2 Query Patroller to dynamically manage the flow of queries to the DB2 database. Query Patroller allows you to adjust database resource usage so that short queries or queries with the highest priority are executed first, ensuring efficient use of resources. In addition, administrators can collect and analyze information about executed queries to determine temporal patterns, frequently used tables and indexes, and resource-intensive applications.

Conclusion

The proposed solution is scalable and has expandable functionality. In the future, it can be connected to document workflow, enterprise planning, and metadata and master data management systems. The system for collecting and analyzing primary data can be easily integrated into an existing enterprise IT infrastructure. In other circumstances it may be treated as a first step in the implementation of an enterprise system for data collection, storage and analysis.

Various solutions for data analysis by means of IBM InfoSphere Warehouse and IBM Cognos BI will be described in the second part of the article.

The author thanks M.Barinstein, V.Ivanov, M.Ozerova, D.Savustjan, A.Son, and E.Fischukova for useful discussions.

Literature

1. Asadullaev S., “Primary data gathering and analysis system – II”, 2011, http://www.ibm.com/developerworks/ru/library/sabir/warehouse-2/index.html

2. IBM Forms documentation, https://www.ibm.com/developerworks/lotus/documentation/forms/

3. Boyer J., Bray T., Gordon M. “Extensible Forms Description Language (XFDL) 4.0”. 1998, http://www.w3.org/TR/1998/NOTE-XFDL-19980902

4. IBM, “XFDL 8 Specification”, 2010, http://www-10.lotus.com/ldd/lfwiki.nsf/xpViewCategories.xsp?lookupName=XFDL%208%20Specification

5. IBM, “InfoSphere Warehouse overview 9.7”, 2010, http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp?topic=/com.ibm.isw.release.doc/helpindex_isw.html

6. IBM, “IBM DB2 Database for Linux, UNIX, and Windows Information Center”, 2011, http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/com.ibm.db2.luw.doc/welcome.html

Primary data gathering and analysis system - II
Analysis of primary data

Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
17.02.2011
http://www.ibm.com/developerworks/ru/library/sabir/warehouse-2/index.html

Abstract

A standard solution for primary data collection, storage and analysis was proposed in [1]. The solution is based on IBM Forms for manual input and IBM InfoSphere Warehouse for storage. This article considers the analysis of collected data by means of IBM InfoSphere Warehouse analytical tools and IBM Cognos.

The proposed approach can be implemented as a basis for a variety of solutions for different industries and enterprises.

Data analysis using IBM InfoSphere Warehouse

Cubing Services & Alphablox based OLAP

IBM Alphablox and the components of Cubing Services are used to provide direct access to data in InfoSphere Warehouse. Cubing Services components include tools for metadata modeling of OLAP cubes, an optimizer of materialized query tables (MQT), and a cube server for multidimensional data access (Pic. 1).

Pic. 1. Cubing Services & Alphablox based OLAP

Due to the full integration of Cubing Services components with the user interface of InfoSphere Warehouse, design is performed using Design Studio, and administration and support are provided through the Administration Console.

The cube server of Cubing Services processes multidimensional queries expressed in the MDX query language and returns the results. In response to MDX queries, the cube server retrieves data from DB2 via SQL queries. Materialized query tables (MQTs) are used by the DB2 optimizer, which rewrites incoming SQL queries and forwards them to the appropriate MQT for high query execution performance.
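
To make the MQT mechanism concrete, the sketch below creates and refreshes a deferred-refresh MQT over JDBC. The schema and column names (DW.SALES, MART.SALES_BY_REGION) are assumptions; an actual cube design would derive them from the warehouse model, and query rewrite to a deferred-refresh MQT may additionally require the CURRENT REFRESH AGE register to be set to ANY.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of a DB2 materialized query table that the optimizer can use to answer
// the aggregate SQL generated by the cube server. All object names are assumptions.
public class MqtSetup {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:db2://dwhost:50000/DWDB", "dwadmin", "secret");
             Statement st = con.createStatement()) {
            st.executeUpdate(
                "CREATE TABLE MART.SALES_BY_REGION AS (" +
                " SELECT REGION_ID, SALES_MONTH, SUM(AMOUNT) AS TOTAL_AMOUNT" +
                " FROM DW.SALES GROUP BY REGION_ID, SALES_MONTH)" +
                " DATA INITIALLY DEFERRED REFRESH DEFERRED");
            // Populate the MQT; matching aggregate queries may then be rewritten to read it.
            st.executeUpdate("REFRESH TABLE MART.SALES_BY_REGION");
        }
    }
}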

IBM Alphablox enables rapid development of analytical Web applications that meet enterprise infrastructure requirements and are available both on the intranet and beyond the enterprise firewall. Alphablox applications allow users to perform multidimensional data analysis in real time, using a standard browser as a client.

Alphablox applications can work with data from multiple sources, including DB2 and Cubing Services, create structured reports and provide data in the required form with the help of filters and drill-down tools.

Alphablox analytic applications are aimed at improving the quality of decision making and at optimizing financial reporting and analysis, planning, operational analysis, task analysis and reporting, performance analysis and the analysis of key performance indicators (KPI).

Text and data mining

Data mining algorithms embedded in InfoSphere Warehouse are used to understand a customer’s or a business unit’s behavior. Data discovery tools allow users to identify hidden data relationships, to profile data, to browse the tables’ contents and to visualize correlated statistics in order to identify data suitable for analysis. InfoSphere Warehouse provides the following data mining tools:

• Miningblox

• Intelligent Miner Easy Mining

• Intelligent Miner Modeling

• Intelligent Miner Scoring

• Intelligent Miner Visualization

• Unstructured text analysis

Data mining using MiningBlox & Alphablox

A typical data mining application can include the following steps:

• Data selection for analysis

• Analysis beginning and tracking its progress

• View the results of the analysis

• Selection, management or control of the data mining tasks

The Miningblox tag library provides tags for each step and is designed to perform predictive analysis using Alphablox functions. In this configuration the J2EE application server hosts the Miningblox applications, Alphablox and the Miningblox tag library, the data warehouse applications, and the Administration Console (Pic. 2).

Miningblox web applications include Java Server Pages (JSP) that use the Alphablox and Miningblox JSP tag libraries. A JSP page that invokes Alphablox is compiled at execution time on the application server. Alphablox manages the queries, and the Web server returns the dynamic content.

The data warehouse application contains control flows, which are invoked by the Miningblox web application. Control flows contain data flows and data mining flows. The DB2 database stores both the data analyzed by the data mining flows and the results in the form of data models and result tables.

Pic. 2. Data mining using MiningBlox & Alphablox

Administration Console can be used to deploy and administer data warehouse applications related to Miningblox applications.

Design Studio is used for the visual design of data mining or text analysis flows, including preprocessing and text operators. The generated SQL query can be embedded in an Alphablox application or another application to invoke the data mining flow.

Data mining using Intelligent Miner

To solve data mining and text analysis tasks it is necessary to develop applications using a specialized SQL API, which consists of two levels with different degrees of detail and abstraction:

• The Easy Mining application interface is problem-oriented and is used to perform basic data mining tasks;

• The IM Scoring / Modeling SQL/MM API conforms to the ISO/IEC 13249-6 Data Mining standard and allows data mining applications to be created for specific individual user requirements. This interface can be used from SQL scripts or from any JDBC, CLI, ODBC, or SQLJ application.

Easy Mining Procedures provide the core functionality of typical data mining tasks. Users only need knowledge of their subject area and are not required to understand the intricacies of data mining in depth.

IM Scoring and IM Modeling together form a software development kit (SDK). These tools are DB2 extensions and include a SQL API that allows applications to invoke data mining functions.
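
The sketch below illustrates only the general pattern of invoking such a database-resident mining routine from a JDBC application. The procedure name MINING.BUILD_CLUSTER_MODEL and its parameters are placeholders for illustration; they are not the actual Intelligent Miner or SQL/MM routine names, which should be taken from the product documentation.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

// Illustrative pattern only: the procedure name and parameters below are
// hypothetical placeholders, not the real Intelligent Miner SQL API.
public class MiningCall {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:db2://dwhost:50000/DWDB", "analyst", "secret");
             CallableStatement cs = con.prepareCall(
                 "CALL MINING.BUILD_CLUSTER_MODEL(?, ?, ?)")) {
            cs.setString(1, "CUSTOMER_SEGMENTS");    // name of the model to build (assumed)
            cs.setString(2, "DW.CUSTOMER_PROFILE");  // input table (assumed)
            cs.setInt(3, 5);                         // requested number of clusters
            cs.execute();                            // the resulting model is stored in the database
        }
    }
}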

The modeling tools of IM Modeling support the following functions for developing analytical PMML (Predictive Model Markup Language) models: association rules, sequence rules, cluster models, regression models, and classification. Generated PMML models can be used in the IM Visualization modules or in IM Scoring.

The IM Scoring tools allow application programs to apply a PMML model to a large database, a subset of it, or individual rows. IM Scoring can work with the following PMML models: association rules, sequence rules, cluster models, regression models, classification, naive Bayes, neural networks, and decision trees. Forecasting models created using Intelligent Miner for Data are not part of PMML; these models can be exported from IM for Data to XML format and used in IM Scoring.

The results of data modeling (associations, sequences, classification, clustering and regression) can be viewed with the ready-made Java visualization tools of IM Visualization.

To present the modeling results these visualization tools can be invoked by applications or as a browser applet.

Design Studio contains editors and visual tools for data mining application development integrated into the Eclipse environment. An application developer can visually model the data mining tasks and generate SQL code to include the Intelligent Miner SQL functionality in analytical applications.

For prototyping, you can use an Excel extension, which allows you to prove a concept while avoiding the complexities of the SQL API.

Administrators can configure a database for data mining, manage data mining models, and optimize the performance of analytic queries through the Web interface of the Administration Console.

Text analysis

Text analysis allows users to extract business information from patient records, repair reports, database text fields and call center records. This information can be used in multidimensional analysis, in reports, or as an input for data mining. Text analysis covers a broad area of computer science, including:

• Automatic classification of documents into groups of similar documents (clustering, or unsupervised categorization)

• Automatic classification of documents into predefined categories (supervised categorization)

• Structured data extraction from unstructured text.

The text analysis functions of InfoSphere Warehouse focus on information extraction, which generates structured data that can be processed for business intelligence, together with other structured information, by data mining, multidimensional analysis and reporting tools.

In InfoSphere Warehouse, UIMA (Unstructured Information Management Architecture [2]) based software tools are used for text analysis. UIMA is an open, industry-oriented, scalable and extensible platform for the integration and deployment of text analysis solutions.

The text analysis functions of InfoSphere Warehouse can perform the following tasks:

• Explore tables that contain text columns;

• Extract data using regular expressions, for example phone numbers, email addresses, social insurance numbers, or uniform resource locators (URLs), as in the sketch after this list;

• Extract data using dictionaries and classifiers, for example product names or client names;

• Extract data using UIMA-compliant components.
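
The following minimal sketch illustrates the idea behind the regular-expression extractors: it pulls e-mail addresses out of free text using plain Java regular expressions. It does not use the InfoSphere Warehouse text analysis API itself; the class and pattern are assumptions for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rule-based extraction sketch: turn free text into structured values (e-mail addresses).
public class EmailExtractor {
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    public static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());  // each match becomes a structured value for BI tools
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("Contact the call center at help@example.com."));
    }
}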

Data mining application development

Pic. 3. Intelligent Miner using scenario

You can select one of the approaches to application development, depending on the complexity of the problem and the experience, preferences and skills of specialists:

• To use examples and tutorials for quick code adjustment to fit your goals.

• To use the graphical user interface of Design Studio to define the analysis process, generate code and integrate it into the application, including data mining steps in automated data transformation processes.

• To use the Easy Mining procedures for basic functionality of typical mining tasks.

• To use the command-line script generator idmmkSQL as a starting point for scoring statements.

• To invoke the powerful low-level SQL/MM API from SQL scripts or from any JDBC, CLI, ODBC, or SQLJ application.

Picture 3 shows a typical scenario for Intelligent Miner used for data mining tasks.

The application developer integrates SQL functionality of Intelligent Miner into applications using development tools of Design Studio.

The analyst uses the tools of data mining from applications.

Data analysis using IBM Cognos Business Intelligence

The proposed solution (Pic. 4) is based on the previous architecture, with the expansion of analytic functionality by means of IBM Cognos 10 Business Intelligence [3].

IBM Cognos 10 Business Intelligence (BI) is an integrated software suite to manage enterprise performance and is designed to aid in interpretation of data arising during the operation of the organization.

Cognos BI 10 allows users to draw charts comparing plan and actual figures, to create different types of reports, to embed reports into a convenient portal, and to create custom dashboards.

Any employee of the organization can use IBM Cognos 10 BI to create business reports, analyze data, and monitor events and metrics in order to make effective business decisions.

Cognos BI 10 includes the following components:

• Cognos Connection - content publishing, managing and viewing.

• Cognos Administration console - viewing, organizing and scheduling of content, administration and data protection

• Cognos Business Insight - interactive dashboards

• Cognos Business Insight Advanced - easy reporting and data research

• Cognos Query Studio - arbitrary queries

• Cognos Report Studio - professionally authored, managed reports

• Cognos Event Studio - event management and notification

• Cognos Metric Studio - metrics and scorecarding

• Cognos Analysis Studio - business analysis

• Cognos for Microsoft Office - working with Cognos BI data in Microsoft Office

• Framework Manager - business metadata management for cube connection.

• Metric Designer - data extraction.

• Transformer - modeling of PowerCubes multidimensional data cubes

• Map Manager - import maps and update labels

• Cognos Software Development Kit - Cognos BI application development

Cognos Connection is a portal that provides a single access point to all enterprise data available in Cognos 10 BI. The portal allows users to publish, find, organize and view data. With appropriate access rights, users can work through the portal with a variety of applications and manage the portal’s content, including schedule management and report preparation and distribution.

The Cognos Administration console, together with Cognos Connection, provides system administrators with the ability to administer Cognos servers, tune performance and manage user access rights.

Cognos Business Insight allows users to create complex interactive dashboards using data from Cognos and from external sources such as TM1 Websheets and CubeViews. A user can open a personal dashboard, manage reports, send the dashboard via e-mail, and participate in collective decision making.

Cognos Business Insight Advanced allows users to create simple reports and explore data from internal and external data sources, both relational and multidimensional. When analysts working with their personal dashboards want to perform deeper data analysis, they can switch to Business Insight Advanced, where it is possible to add new dimensions, conditional formatting, and complex calculations. The user can launch Business Insight Advanced directly from the Cognos Connection portal.

Query Studio provides an interface for creating simple queries and reports in Cognos 10 BI. Users without special training can use Query Studio to create reports that answer simple business questions. With minimal effort, users can change report layout, filter and sort data, add formatting, and create charts.

Report Studio is a tool that professional report authors and developers use to create sophisticated and managed reports, multi-page reports with composite queries to multiple databases (relational or multidimensional). Using Report Studio, you can create any reports required by organizations such as a sales invoice, budgets, or weekly activity reports of any complexity.

Event Studio is a tool for event management in IBM Cognos 10. It notifies users of events as they occur so that timely and effective decisions can be made. Event Studio can be used to create agents that monitor changes in the financial and operational performance of the company and of key customers in order to identify important events. When an event occurs, the agent can send an e-mail, publish information on the portal, or prepare a report.

Metric Studio allows users to create and use a balanced scorecard to track and analyze key performance indicators (KPI) of the organization. You can use a standard or a custom scorecard, if it is already implemented in the company.

Metric Studio translates the organization's strategy into measurable goals, which allow each employee to correlate their actions with the company's strategic plan. The scorecarding environment reveals both the successful activities of the company and those that need improvement. It monitors progress in achieving these goals and shows the current state of the business. Therefore, all employees and managers of the organization can make the necessary decisions and plan their work.

Analysis Studio is intended for the research, analysis and comparison of multidimensional data, and provides online analytical processing (OLAP) of various multidimensional data sources. The results of the analysis are available for creating professional-quality reports in Report Studio.

Pic. 4. Solution architecture using Lotus Forms, InfoSphere Warehouse and Cognos BI

Managers and analysts use Analysis Studio to quickly analyze the reasons for past events and to understand the actions required to improve performance. The analysis allows users to identify non-obvious but business-relevant patterns and anomalies in large data volumes, which other types of reports cannot reveal.

Cognos for Microsoft Office allows you to work with Cognos reports directly from MS Office, and offers two types of client software:

1. The “smart client” requires no installation or administration and is updated automatically.

2. The client software is a COM add-in and requires installation. Updates are performed by reinstalling the software.

Cognos for Microsoft Office allows to work with reports created in Query Studio, Analysis Studio, or Report Studio, and users get full access to the contents of the report, including data, metadata, headers, footers, and pictures.

Framework Manager is a modeling tool that is designed to create and manage business metadata for use in the analysis and reporting tools of Cognos BI. Metadata provides a common understanding of data from different sources. OLAP cubes contain metadata for business analysis and reporting. Since the cube metadata can change, Framework Manager models the minimum amount of information required to connect to a cube.

Metric Designer is a modeling tool for data extraction. Extracts are used for mapping and transferring data to the scorecarding environment from existing sources of metadata, such as Framework Manager and Impromptu Query Definition files. Typically, a data model is optimized for storage rather than reporting. Therefore, a data model developer uses Framework Manager to create data models that are optimized for the needs of business users. For example, a model can define business rules that describe the data and their relationships, dimensions and hierarchies from a business perspective.

Transformer is used to model PowerCubes multidimensional data cubes for business reporting in Cognos BI. After collecting all necessary metadata from various data sources, modeling dimensions, customizing measures and applying dimensional filtering, you can create PowerCubes based on this model. These cubes can be deployed to support OLAP analysis and reporting.

Map Manager allows administrators and modeling specialists to import maps and update map labels in Report Studio, and to add alternative names of countries and cities for the creation of multilingual texts that appear on maps.

IBM Cognos Software Development Kit is designed to create custom reports, to manage the deployment of Cognos BI components, and to secure the portal and tailor its functionality to user requirements, local legislation and the existing IT infrastructure. Cognos SDK includes cross-platform web services, libraries and programming interfaces.

Enterprise planning using Cognos TM1

Analytical tools can be extended by means of IBM Cognos TM1 enterprise planning software [4], which provides a complete, robust and dynamic planning environment for the timely preparation of personalized budgets and forecasts. Its 64-bit OLAP kernel provides high-performance analysis of complex models, large data sets, and even streamed data.

A full set of enterprise planning requirements is supported: from profitability calculation, financial analysis and flexible modeling to revealing the contribution of each unit.

The ability to create an unlimited number of custom scenarios allows employees, groups, departments and companies to respond more quickly to changing conditions.

Best practices, based on the driver-based planning and rolling forecasts, can become a part of the enterprise planning process.

Model and data access configuration tools can provide data in familiar formats.

Managed team work provides quick and automated collection of results from different systems and entities, their assembly into a single enterprise planning process, and the presentation of results.

The consistent scorecarding, reporting and analysis environment of Cognos BI gives a complete picture, from goal planning and setting to progress measurement and reporting.

Financial and production units have full control over the processes of planning, budgeting and forecasting.

Users have the ability to work with familiar interfaces (Microsoft Excel and the client software Cognos TM1 Web or Contributor).

Conclusion

The proposed solution is scalable and functionally expandable. The solution can be integrated with various document management, enterprise planning, and metadata and master data management systems.

Primary data gathering and analysis system can be easily integrated into existing enterprise IT infrastructure. In other circumstances the solution may be realized as a first step in implementation of an enterprise system for collecting, storing and analyzing data.

The author thanks M.Barinstein, V.Ivanov, M.Ozerova, D.Savustjan, A.Son, and E.Fischukova for useful discussions.

Literature

1. Asadullaev S., “Primary data gathering and analysis system - I. Problem formulation, data collecting and storing”, 2011, http://www.ibm.com/developerworks/ru/library/sabir/warehouse-1/index.html

2. Apache Software Foundations, “Apache UIMA”, 2010, http://uima.apache.org/

3. IBM, “Cognos Business Intelligence”, 2010, http://publib.boulder.ibm.com/infocenter/cfpm/v10r1m0/index.jsp?topic=/com.ibm.swg.im.cognos.wig_cr.10.1.0.doc/wig_cr_id111gtstd_c8_bi.html

4. IBM, “Cognos TM1”, 2010, http://www-01.ibm.com/software/data/cognos/products/tm1/

Data Warehousing: Triple Strategy in Practice

Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
“Program Engineering”, 2011, v4, pp 26-33
www.ibm.com/developerworks/ru/library/sabir/strategy/index.html

Abstract

This paper uses a practical example of a system for collecting and analyzing primary data to show how the triple strategy and the recommended architecture of an enterprise data warehouse (EDW) can provide a higher quality of information analysis service while reducing the costs and time of EDW development.

Introduction

Many successful companies have found that managing lines of business separately does not give a complete picture of the company's market situation and business. To make accurate and timely decisions, experts, analysts and company management need unified, consistent information, which should be provided by an enterprise data warehouse (EDW).

In practice, enterprise data warehouse projects as a rule fail to meet time, cost and quality targets. In many cases, the analytical reports produced by the data warehouse still contain conflicting information. This article shows that adherence to the recommended architectural solutions, the use of proven strategies for creating an EDW and the right choice of software tools can reduce EDW development costs and improve the quality of EDW services. Based on the triple strategy, the recommended architecture, the proposed principles and best practices of EDW construction, a project management plan is proposed for the development of an enterprise data warehouse.

IBM offers a complete toolset for data, metadata and master data integration at all stages of the life cycle of an EDW development project. The purpose of this paper is to analyze a simplified solution based on IBM Forms, IBM InfoSphere Warehouse and IBM Cognos BI software. The solution must be scalable and functionally expandable. It should be easily integrated into the enterprise IT infrastructure and able to become a foundation for an enterprise data warehouse.

Architecture of primary data gathering and analysis system

A typical solution for the collection, storage and analysis of manually entered primary data was proposed in articles [1, 2]. We recall the essential system requirements:

• Collection of primary, statistical, or reported data from remote locations is required;
• It is necessary to check the data at the workplace before sending them to the center;
• The collected data should be cleansed and stored for a period defined by regulations;
• It is necessary to produce management reports to assess the state of affairs in the regions;
• It is necessary to perform analysis based on the collected statistics to identify regular patterns and to support management decisions.

The solution must be extensible and scalable. For example, it is necessary to anticipate the subsequent integration of the primary data gathering and analysis system with a document management system.

Pic. 1 shows the centralized architecture of a primary data gathering and analysis system, which assumes that the data input can be performed remotely, and all IBM Forms [3] data collection servers, InfoSphere Warehouse [4] data storage servers, and IBM Cognos [5, 6] data analysis and interpretation servers are installed in a single data center.

Pic. 1. Solution architecture using Lotus Forms, InfoSphere Warehouse and Cognos BI

Analysts can work both locally and remotely, thanks to the Web interface provided by Cognos for the preparation and execution of analytical calculations.

The proposed architecture is based on the simplest software configuration for a typical task, without taking into account the specific requirements of various projects. Despite the obvious limitations, the proposed approach can be used as a basis for a variety of solutions for different industries and enterprises.

Separation of subsystems for data entry, collection, storage and analysis allows us to construct different architectures depending on the needs of the task and the requirements of the enterprise infrastructure.

Another advantage of this modular system is the possibility of its functionality expansion. Since all modules communicate over standard generally accepted protocols, it can be integrated with various IT systems such as document management, metadata and master data management, enterprise resource planning, analytical and statistical packages and many others.

The system for collecting and analyzing raw data can be easily integrated into existing corporate IT infrastructure. In other circumstances it may be a first step to implementation of a corporate system for collecting, storing and analyzing data.

As you can see, the architecture of the primary data gathering and analysis system (Pic. 1) contains no metadata or master data management tools. On the face of it, this contradicts the proposed triple strategy of data warehouse development [7], which involves the integration of data, metadata and master data. On the other hand, it is not clear how this solution relates to the recommended EDW architecture [8], and how the proposed approach differs from the numerous projects whose primary purpose is to demonstrate a quick but modest success.

Role of metadata and master data management projects

The task of primary data input, collection, storage and analysis has several features. First of all, primary data is entered manually into the fields of approved on-screen forms (e-forms). E-form fields are therefore aligned with each other, both within an individual e-form and across e-forms. This means that different entities have distinct names and distinct fields. Therefore, at early project stages the customer planned to store not forms or reports, but the individual field data from which forms and reports can be constructed later. A consistent set of data is a great first step towards metadata management, even if this requirement was not formulated explicitly.

Under these conditions, the first phase of metadata management does not require the use of specific software and can be done with a pencil, an eraser and paper. The main difficulty of this stage is to reach agreement among all the experts on the terms, the entities, their names and the methods of calculation. Sometimes users have to abandon familiar but ambiguous names, and agreement may require considerable effort and time. Fortunately, this work had been performed by the customer before engaging the external project team.

The names of e-form fields and the methods of input data checking and calculation are essential business metadata. The solution architecture, including the data warehouse model, is developed on the basis of the collected metadata, and EDW table columns are created and named accordingly. Thus, implicit technical metadata management begins.

Changes are inevitable during the maintenance of the developed system. During this stage, the developers need to manage a glossary of terms. If it was not created earlier, it is time to think about glossary implementation, since the system maintenance process forces centralized metadata management to start in an explicit form.

This scenario implies minimal overhead for the implementation of a centralized metadata management system, as the kernel of the future system has already been created. This core, though small and limited in features, has the major asset of consistent metadata.

Centralized master data management should be started simultaneously with metadata management. The reason is simple: master data and metadata are closely connected [7], and master data implementation without a metadata project, as a rule, does not lead to success.

The basis for a master data management system can be the set of codifiers, dictionaries, classifiers, identifiers, indices and glossaries maintained for the data warehouse. In this case, well-conceived metadata should play a master data quality assurance role, which eliminates data encoding conflicts provided the DW is skillfully designed.

Thus, the systematization of business metadata, based on e-form fields and performed at the pre-project stage, provided the opportunity to create trouble-free metadata and master data management systems. It also reduced the budget of the project implementing the primary data gathering and analysis system. At the same time, the project team was aware that the metadata and master data projects were being performed implicitly. At this stage only the designers' strategic vision and the developers' accuracy are required.

Recommended DW Architecture

The recommended enterprise data warehouse architecture, proposed in [8], is constructed in accordance with the following basic principles.

EDW is the only source of noncontradictory data and should provide users with consistent data of high quality gathered from different information systems.

• Data should be available to employees to the extent necessary and sufficient to carry out their duties.

• Users should have a common understanding of the data, i.e., there should be a common semantic space.

• It is necessary to eliminate data encoding conflicts in the source systems.

• Analytical calculations must be separated from operational data processing.

• Multilevel data organization should be ensured and maintained.

• It is necessary to follow an evolutionary approach, ensuring business continuity and preserving IT investments.

The information content of the future data warehouse, the stages of EDW development, and the order in which functional modules are put into operation are determined, first of all, by the requirements of the business users.

Data protection and secure storage must be ensured. Data protection activities should be adequate to the value of the information.

An architecture designed in accordance with these principles follows the previously examined modular design principle of “unsinkable compartments”. By separating the architecture into modules, we also concentrate specific functionality in them (Pic. 2).

ETL tools provide complete, reliable and accurate information gathering from data sources by means of algorithms, concentrated in the ETL layer, for data collection, processing and conversion and for interaction with the metadata and master data management systems.

The metadata management system is the primary source of information about the data in the EDW. It keeps business, technical, operational and project metadata up to date.

The master data management system eliminates conflicts in the data encoding in source systems.

Central Data Warehouse (CDW) has the only workload of reliable and secure data storage. Data structure in the CDW is optimized solely for the purpose of ensuring effective data storage.

Data sampling, restructuring, and delivery tools (SRD) in this architecture are the only users of the CDW, taking on the whole job of filling the data marts and thereby reducing the user query workload on the CDW.

Data marts contain data in formats and structures that are optimized for tasks of specific data mart users.

Pic. 2. Recommended DW Architecture

Thus, users can work comfortably with the data they need even when the connection to the CDW is lost. The ability to quickly restore a data mart's content from the CDW in case of a data mart failure is also provided.

The advantage of this architecture is the ability to separate the design, development, operation and refinement of individual EDW components without an overhaul of the whole system. This means that starting work on an EDW does not require extraordinary effort or investment. To start, it is enough to implement a data warehouse with limited capabilities and, following the proposed principles, to develop a prototype that works and is truly useful for users. Then you need to identify the bottlenecks and evolve the required components.

Relation between the recommended architecture and the solution

The architecture of the primary data collection, storage and analysis system (Pic. 3) is translated into the EDW terms of the recommended architecture and is aligned with it.

Data are collected with the help of IBM Forms, which uses e-forms for manual data entry and allows you to transfer the collected data to other systems. IBM Forms application server can be further integrated with the repositories of structured and unstructured data.

The only data source in this project is e-forms filled in according to strict rules, so at this stage there is no need for extraction, transformation and loading (ETL) tools (e.g., DataStage). However, the project will evolve, and one can expect the need to connect other sources. The possibility of using ETL tools provides functional extensibility of the system without the need for a radical redesign.

Data storage is implemented using IBM InfoSphere Warehouse. Data analysis can be performed by means of IBM InfoSphere Warehouse and IBM Cognos Business Intelligence (BI).

Pic. 3. Recommended and solution architectures’ relations

IBM InfoSphere Warehouse provides the following data analysis tools: analytical processing using Cubing Services based OLAP tools and Alphablox, data mining using Miningblox and Alphablox, and data mining with the assistance of Intelligent Miner.

IBM Cognos 10 Business Intelligence (BI) is an integrated software suite for managing enterprise performance and is designed to aid in the interpretation of data arising during the operation of the organization. Any employee can use IBM Cognos 10 BI to create business reports, analyze data, and monitor events and metrics in order to make effective business decisions. Cognos BI 10 allows users to draw charts comparing plan and actual figures, to create different types of reports, to embed reports into a convenient portal, and to create custom dashboards.

Analytical tools can be extended by means of IBM Cognos TM1 enterprise planning software, which provides a complete, robust and dynamic planning environment for the timely preparation of personalized budgets and forecasts.

Metadata, which are obtained as a byproduct of reconciling e-forms, and master data, which result from reducing data to normal form in the relational database of the EDW, are the prototypes of future enterprise metadata and master data management systems (Pic. 4).

The need to establish Data Dictionary / Directory Systems was first published in the mid-80s [9]. An article [10] published in 1995 stated that successful data integration requires establishing and maintaining a metadata flow. Current practice shows that this requirement needs to be clarified, since metadata are generated at all stages of the development and operation of information systems. The relationship between data, metadata and master data was discussed in detail in [8], where it was shown that master data contain business metadata and technical metadata.

Data loading into the EDW cannot be performed properly without metadata and master data, which are heavily used at this stage, explicitly or implicitly. Cleansed and consistent data are stored, but the metadata and master data are usually ignored.

Pic. 4. Step to enterprise metadata & master data management

The creation of metadata and master data repositories significantly reduces EDW implementation costs, allows us to move from the storage of inconsistent forms to the storage of consistent data, and improves the quality of information services for business users [11].

Comparison of proposed and existing approaches

To answer the question of how the proposed approach differs from existing practice, consider a typical example of a financial analysis system development project in a bank [12].

The project team assumed that the creation of enterprise master data is a long, difficult and risky job. Therefore, the project was limited to the local task of reengineering planning and forecasting processes, which should pave the way for a bank reporting system, based on the integration of core banking systems, that uses more detailed data consistent with the general ledger.

In the project team's eyes, the development of an EDW was tantamount to the “Big Bang” that created the universe. The project team, avoiding enterprise-wide solutions, introduced metadata and master data for a particular area of activity. Therefore, the financial data repository is a highly specific data mart for financial statements (Pic. 5).

In contrast, the EDW provides consistent corporate data for a wide range of analytical applications. Practice shows that only EDW, which is integrated with metadata and master data management systems, can provide a single version of data.

As you can see, the main objective of this project was a small demonstration of a quick win. Many of us have been put in the same situation, when there was an urgent need to demonstrate even a small but working system. An experienced developer knows that he will have to follow the advice of Brooks [13] and throw this first version away. The reason is that the cost of redesigning the applications and integrating them into the enterprise infrastructure would be prohibitive because of the lack of agreed metadata and master data.

Pic. 5. Example of existing approach

The final architecture of implementation of existing approaches

Let us briefly summarize the results of the analysis.

1. Existing approaches effectively implement disparate application data marts. The data in the data marts may be valuable within the units, but not for the company as a whole, because data reconciliation is impossible due to differences in data meaning and encoding.

2. The belief that creating an EDW is like a deadly trick with unpredictable consequences is widespread, so it is often decided to create local data marts without EDW development.

3. The demand for instant results leads to development and implementation of limited solutions with no relation to enterprise level tasks.

Following these principles, the company initially introduces separate, independent data marts. The information in each data mart is not consistent with data from other data marts, so management receives contradictory reports. Indicators with the same name in these reports may hide different entities and, vice versa, the same entity may have different names and may be calculated by different algorithms, based on different data sets, for different periods of time.

As a result, users of independent application data marts speak different business languages, and each data mart has its own metadata.

Another problem is the difference between master data used in the independent data marts. The difference in data encoding used in the codifiers, dictionaries, classifiers, identifiers, indices and glossaries makes it impossible to combine these data without serious analysis, design and development of master data management tools.

So the company creates several inconsistent data warehouses, which fundamentally contradicts the very idea of establishing an EDW as the one and only source of cleansed, consistent and noncontradictory historical data. The lack of enterprise metadata and master data management (shaded in Pic. 6) makes it even less likely that the data can be reconciled between them.

Obviously, neither the management nor the users of such a repository are inclined to trust the information contained in it. So the next step brings the need for a radical redesign and, in fact, for the creation of a new data warehouse that stores not reports but agreed indicators, from which reports can be assembled.

Thus, the pursuit of short-term results and the need to demonstrate quick wins lead to the rejection of unified, end-to-end metadata and master data management. The result of this approach is the presence of semantic islands, where the staff speak a variety of business languages. The enterprise data integration architecture must be redesigned completely, which leads to repeated expenditures of time and money to create a full-scale EDW (Pic. 6).

Pic. 6. Result of existing approach: DW with intermediate Application DMs

Triple strategy and EDW development planning

The proposed approach is based on the triple strategy, on the recommended architecture, on the formulated principles and on best practices of EDW development.

As a rule, developers need to quickly demonstrate at least a modest success in data integration. In some companies, by contrast, one must develop and implement a corporate strategy for the EDW. No matter how the task is formulated, in both cases you must keep the global goal in sight and reach it by means of short steps.

The role of the compass that keeps work aligned with the strategic goals is played by the coordinated integration of data, metadata, and master data (Pic. 7):

1. master data integration to eliminate data redundancies and inconsistencies;

2. metadata integration to ensure a common understanding of data and metadata;

3. data integration to provide end users with a single version of truth on the basis of agreed metadata and master data

As you know, a great journey starts and ends with a small step. The creation of a centralized data, metadata, and master data management environment is a priority task. But business users do not see immediate benefits for themselves in such an environment, and management prefers to avoid long-term projects with no tangible results for the company's core business.

Pic. 7. DW development plan

Therefore, two or three pilot projects should be chosen in the first phase. The main selection criteria for these projects are management support and the willingness of users and experts to participate in the task formulation. The projects should provide the minimum acceptable functionality of the future EDW.

As a tentative example, the following pilot projects are selected to implement the first phase (Pic. 7):

1. Data mining on the basis of Intelligent Miner (IM);

2. Multidimensional analysis (OLAP) using Cubing Services and Alphablox;

3. Unstructured text analysis using Unstructured Text Analysis Tools (UTAT).

All these tools deployed in the first phase of pilot projects are part of IBM InfoSphere Warehouse.

It is important that users feel the real benefits of the EDW as a result of these short projects. The project team, together with the users, needs to analyze the results of the pilot project implementations and, if necessary, determine the actions needed to change the EDW environment and adjust the tasks of data, metadata, and master data integration.

The next step is to choose three or four new pilot projects that could move the company towards the creation of the basic functionality of the future EDW. It is desirable that the selection process involves all concerned parties: company management, users, business experts, the project team and the EDW maintenance and support team. The centralized data, metadata and master data management environment must be developed enough to meet the requirements of the EDW basic functionality.

Assume that the following projects and tools are chosen to be implemented in the second phase:

1. Report generation and data analysis with Cognos Business Insight Advanced and Report Studio;

2. Creation of complex interactive dashboard, based on Cognos Business Insight;

3. Scenario analysis using Cognos TM1;

4. Corporate Planning with Cognos TM1.

The project results should be reexamined after completion of the second-phase pilot projects. The next step should be the development of a fully functional EDW, which is impossible without comprehensive support from the environment of centralized data, metadata and master data management.

Thus, a rough plan for EDW development may look as follows:

• Strategic objectives:

o coordinated integration of data, metadata, and master data

• Tactical objectives:

o Selection of two or three projects to demonstrate the benefits

o Creation of a centralized data, metadata and master data management environment,

o Project results analysis and alteration of EDW environment, if necessary

o Implementation of three or four pilot projects, relying on the experience gained

o In case of success - EDW development with company-wide functionality

o EDW operation and modernization to fit new tasks, formulation and solution of which became possible due to the accumulated experience of EDW operation

Thus, the EDW development project is not complete when the EDW is accepted as commissioned and fully operational. The EDW must evolve together with the company. Life goes on, new problems arise, and new information systems are required. If these systems can provide information that is important for data analysis across the enterprise, they must be connected to the EDW. To avoid integration issues, it is desirable to create each new system based on the capabilities of the centralized data, metadata and master data management environment.

In turn, the centralized data, metadata and master data management environment should be changed and improved taking into consideration the needs of new systems. Therefore, the centralized data, metadata and master data management environment must evolve for as long as the company and its IT systems exist, which is conventionally indicated in Pic. 7 by the arrows that extend beyond the schedule.

Conclusion

An enterprise data warehouse, built as a result of coordinated data, metadata, and master data integration, provides a higher quality of information and analytical services at lower cost, reduces development time and enables decision making based on more accurate information.

The proposed approach provides effective operation of the data, metadata, and master data management systems, eliminates the coexistence of modules with similar functionality, lowers the total cost of ownership and increases user confidence in the EDW data. The integration of data, metadata, and master data, performed simultaneously with the development of EDW functionality, allows implementation of an agreed architecture, environment, life cycles, and key capabilities for the data warehouse and the metadata and master data management systems.

Literature

1. Asadullaev S., “Primary data gathering and analysis system – I. Problem formulation, data collecting and storing”, 2011, http://www.ibm.com/developerworks/ru/library/sabir/warehouse-1/index.html

2. Asadullaev S., “Primary data gathering and analysis system – II. Primary data analysis”, 2011, http://www.ibm.com/developerworks/ru/library/sabir/warehouse-2/index.html

3. IBM, “IBM Forms documentation”, 2010, https://www.ibm.com/developerworks/lotus/documentation/forms/

4. IBM, “InfoSphere Warehouse overview 9.7”, 2010, http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp?topic=/com.ibm.isw.release.doc/helpindex_isw.html

5. IBM, “Cognos Business Intelligence”, 2010, http://publib.boulder.ibm.com/infocenter/cfpm/v10r1m0/index.jsp?topic=/com.ibm.swg.im.cognos.wig_cr.10.1.0.doc/wig_cr_id111gtstd_c8_bi.html

6. IBM, “Cognos TM1”, 2010, http://www-01.ibm.com/software/data/cognos/products/tm1/

7. Asadullaev S., “Data, metadata and master data: triple strategy for data warehouse project”. http://www.ibm.com/developerworks/ru/library/r-nci/index.html, 2009.

8. Asadullaev S., “Data warehouse architecture – III”, http://www.ibm.com/developerworks/ru/library/sabir/axd_3/index.html, 2009.

9. Leong-Hong B.W., Plagman B.K. “Data Dictionary / Directory Systems”. Wiley & Sons. 1982.

10. Hackathorn R. “Data Warehousing Energizes Your Enterprise”, Datamation, Feb.1, 1995, p. 39.

11. Asadullaev S., “Data quality management using IBM Information Server”, 2010, http://www.ibm.com/developerworks/ru/library/sabir/inf_s/index.html

12. Financial Service Technology, “Mastering financial systems success”, 2009, http://www.usfst.com/article/Issue-2/Business-Process/Mastering-financial-systems-success/

13. Brooks F. P. “The Mythical Man-Month: Essays on Software Engineering”, Addison-Wesley Professional; Anniversary edition, 1995.