Information Integration, 15th Meeting. Course Name: Business Intelligence. Year: 2009


Page 1: Information Integration, 15th Meeting

Course Name: Business Intelligence
Year: 2009

Page 3

Bina Nusantara University

Source of this Material

Loshin, David (2003). Business Intelligence: The Savvy Manager’s Guide. Chapter 10.

Page 4

The Business Case

The business intelligence process revolves around the ability to collect, aggregate, and, most importantly, leverage the integration of different data sets; the ability to collect that data and place it in a data warehouse provides the means by which that leverage can be obtained.

The only way to get data into a data warehouse is through an information integration process, and the only way to consolidate information for data consumption is likewise through an information integration process.


Page 5

A basic premise of constructing a data warehouse is that data sets from multiple sources are collected and then added to a data repository from which analytical applications can source their input data.

This extract/transform/load process is the sequence of applications that extract data sets from the various sources, bring them to a data staging area, apply a sequence of processes to prepare the data for migration into the data warehouse, and actually load them. Here is the general theme of an ETL process.

• Get the data from its source location.
• Map the data from its original form into a data model that is suitable for manipulation at the staging area.
• Validate and clean the data.
• Apply any transformations to the data that are required before the data sets are loaded into the repository.
• Map the data from its staging-area model to its loading model.
• Move the data set to the repository.
• Load the data into the warehouse.
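As an illustrative sketch only (not taken from the slides), the general theme above might look like the following in Python. The file name, field names, and the in-memory "warehouse" are all assumptions made for the example.

```python
# Minimal ETL sketch following the steps above: extract, map to staging,
# validate/clean, transform, and load. All names are hypothetical.
import csv

def extract(path):
    """Get the data from its source location as a list of row dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def map_to_staging(rows):
    """Map source fields onto an assumed staging-area model."""
    return [{"customer_id": r["id"], "name": r["name"].strip()} for r in rows]

def validate_and_clean(rows):
    """Drop rows that fail a basic validation (here: a missing key)."""
    return [r for r in rows if r["customer_id"]]

def transform(rows):
    """Apply a required transformation before loading (here: standardize case)."""
    for r in rows:
        r["name"] = r["name"].upper()
    return rows

def load(rows, warehouse):
    """Move the prepared data set into the warehouse (here: a dict keyed by id)."""
    for r in rows:
        warehouse[r["customer_id"]] = r
    return warehouse

# Usage, assuming a source file exists:
# warehouse = load(transform(validate_and_clean(
#     map_to_staging(extract("customers.csv")))), {})
```

In practice each step would be a separate job with its own logging and restart logic; the point here is only the sequence of stages.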


ETL: Extract, Transform, Load

Page 6

• Staging Architecture
The first part of the ETL process is to assemble the infrastructure needed for aggregating the raw data sets, for applying the transformations, and for the subsequent preparation of the data to be forwarded to the data warehouse. This is typically a combination of a hardware platform and appropriate management software that we refer to as the staging area. The architecture of the staging process can be seen in Figure 15-1.


ETL: Extract, Transform, Load (cont…)

Figure 15-1

Page 7

• Extraction
A lot of extracted data is formed into flat load files that can either be easily manipulated in place at the staging area or forwarded directly to the warehouse. How data should be extracted may depend on the scale of the project, the number (and disparity) of data sources, and how far into the implementation the developers are. Extraction can be as simple as a collection of simple SQL queries or complex enough to require ad hoc, specially designed programs written in a proprietary programming language.
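For illustration, the "simple SQL query" style of extraction into a flat load file might look like this sketch; the database, table, and column names are invented for the example.

```python
# Sketch: extract rows from a source database into a flat CSV load file,
# one line per extracted record. Table/column names are hypothetical.
import csv
import sqlite3

def extract_to_flat_file(db_path, out_path):
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute("SELECT id, name, city FROM customers")
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            # Header row taken from the cursor's column descriptions
            writer.writerow([d[0] for d in cur.description])
            writer.writerows(cur)
    finally:
        conn.close()
```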

• Transformation
What is discovered during data profiling is put to use as part of the ETL process to help in the mapping of source data to a form suitable for the target repository, including the following tasks.

Data type conversion
Data cleansing
Integration
Referential integrity checking
Derivations


ETL: Extract, Transform, Load (cont…)

Page 8

Denormalization and renormalization
Aggregation
Audit information
Null conversion
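A minimal sketch of a few of the transformation tasks listed above (data type conversion, data cleansing, null conversion, and a derivation); all field names and the empty-to-zero default are assumptions, not part of the source material.

```python
# Sketch: apply several of the listed transformation tasks to one raw record.
def transform_record(raw):
    rec = {}
    rec["order_id"] = int(raw["order_id"])           # data type conversion
    rec["country"] = raw["country"].strip().upper()  # data cleansing / standardization
    rec["quantity"] = int(raw["quantity"] or 0)      # null conversion: empty -> 0
    rec["unit_price"] = float(raw["unit_price"])
    rec["total"] = rec["quantity"] * rec["unit_price"]  # derivation
    return rec
```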

• Loading
The loading component of ETL is centered on moving the transformed data into the data warehouse. The critical issues include the following.

Target dependencies
Refresh volume and frequency

• Scalability
There are two flavors of operations that are addressed during the ETL process. One involves processing that is limited to all data instances within a single data set, and the other involves the resolution of issues involving more than one data set. The more data sets that are being integrated, the greater the amount of work that needs to be done for the integration to complete.


ETL: Extract, Transform, Load (cont…)

Page 9

Similar to the way that ETL processes extract and transform information from multiple data sources into a target data warehouse, there are processes for integrating and transforming information between active processes and applications to essentially make them all work together. This process, called enterprise application integration (EAI), provides for interacting applications a function similar to the one that ETL provides for data sets.

• Enterprise Application Integration
Enterprise application integration (EAI) is meant to convey the perception of multiple applications working together as if they were all a single application. The basic goal is for a business process to be able to be cast as the interaction of a set of available applications, and for all applications to be able to properly communicate with each other. Enterprise application integration is not truly a product or a tool, but rather a framework of ideas comprising different levels of integration, including:

Business Process Management
Communications Middleware
Data Standardization and Transformation


Enterprise Application Integration and Web Services

Page 10

Application of Business Rules

• Web Services
Web services are business functions available over the internet that are constructed according to strict specifications. Conformance to a strict standard enables different, disparate clients to interact. By transforming data into an Extensible Markup Language (XML) format based on a predefined schema, and by providing object access directives that describe how objects are communicated with, web services provide a higher level of abstraction than what is assumed by general EAI.
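As a sketch of the data-transformation side described above, a record can be rendered into XML before being communicated; the element names here assume a hypothetical schema rather than any real web-service specification.

```python
# Sketch: serialize a record into XML according to an assumed schema,
# the kind of transformation a web-service layer performs before sending data.
import xml.etree.ElementTree as ET

def record_to_xml(record):
    root = ET.Element("customer")        # root element name is an assumption
    for field, value in record.items():
        ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="unicode")
```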


Enterprise Application Integration and Web Services(cont…)

Page 11

Consolidation is a catchall term for those processes that make use of collected metadata and knowledge to eliminate duplicate entities and merge data from multiple sources, among other data enhancement operations. That process is powered by the ability to identify some kind of relationship between any arbitrary pair of data instances. The key to record linkage is the concept of similarity. This is a measure of how close two data instances are to each other, and can be a hard measure or a more approximate measure, in which case the similarity is judged based on scores above or below a threshold.
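A minimal sketch of the threshold-based similarity judgment described above, using a simple edit-based ratio as the score; the 0.85 threshold is an arbitrary example value, not one prescribed by the material.

```python
# Sketch: similarity between two data instances, judged against a threshold.
from difflib import SequenceMatcher

def similarity(a, b):
    """Score in [0, 1]; 1.0 is an exact (hard) match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(a, b, threshold=0.85):
    """Approximate match: similarity judged by the score being above a threshold."""
    return similarity(a, b) >= threshold
```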

• Scoring Precision and Application Context
One of the most significant insights into similarity and difference measurement is the impact of application context on both measurement precision and the similarity criteria. Depending on the kind of application that makes use of approximate searching and matching, the thresholds will most likely change.

• Elimination of Duplicates
The elimination of duplicates is a process of finding multiple representations of the same entity within the data set and eliminating all but one of those representations from the set.

Record Linkage and Consolidation

Page 12

The elimination of duplicates is essentially a process of clustering similar records together and then reviewing the corresponding similarity scores with respect to a pair of thresholds.
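The pair-of-thresholds review might be sketched as follows: scores above the upper threshold are treated as automatic duplicates, scores below the lower one as distinct records, and anything in between is routed to manual review. The threshold values are illustrative only.

```python
# Sketch: classify a pairwise similarity score against a pair of thresholds.
def classify_pair(score, low=0.6, high=0.9):
    if score >= high:
        return "duplicate"   # safe to eliminate one of the pair
    if score >= low:
        return "review"      # a human decides
    return "distinct"        # keep both records
```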

• Merge/Purge
Merge/purge is similar to the elimination of duplicates, except that whereas duplicate elimination is associated with removing duplicates from a single data set, merge/purge involves the aggregation of multiple data sets followed by the elimination of duplicates.

• Householding
Householding is a process of reducing a number of records into a single set associated with a single household. A household could be defined as a single residence, and the householding process is used to determine which individuals live within the same residence.
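A minimal householding sketch, assuming the household is keyed by a crudely normalized residence address (case-folding and whitespace collapsing only); real householding would use far richer address standardization.

```python
# Sketch: cluster individuals under a household key derived from their address.
from collections import defaultdict

def household_key(address):
    # Deliberately minimal normalization, for illustration only
    return " ".join(address.lower().split())

def build_households(records):
    households = defaultdict(list)
    for person in records:
        households[household_key(person["address"])].append(person["name"])
    return dict(households)
```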

• Improving Information Currency
There are other applications that make use of a consolidation phase during data cleansing. One application is the analysis of currency and correctness. In the consolidation phase, when multiple records associated with a single entity are combined, the information in all the records can be used to infer the best overall set of data attributes.


Record Linkage and Consolidation (cont…)

Page 13

• Data Ownership
How are you to direct your team to maintain a high level of data quality within the warehouse? There are three ways to address this: correct the data in the warehouse, try to effect some changes to the source data, and leave the errors in the data.

• Activity Scheduling
How should the activities associated with the integration process be scheduled? The answer depends on the available resources, the relative quality of the supplied data, and the kinds of data sets that are to be propagated to the repository.

• Reliability of Automated Linkage
Although our desire is for automated processes to properly link data instances as part of the integration process, there is always some doubt that the software is actually doing what we want it to do.


Management Issues

Page 14

End of Slide
