WHITE PAPER

How to Achieve Flexible, Cost-effective Scalability and Performance through Pushdown Processing

Under the Hood of Informatica PowerCenter Pushdown Optimization Option



This document contains Confidential, Proprietary and Trade Secret Information ("Confidential Information") of Informatica Corporation and may not be copied, distributed, duplicated, or otherwise reproduced in any manner without the prior written consent of Informatica.

While every attempt has been made to ensure that the information in this document is accurate and complete, some typographical errors or technical inaccuracies may exist. Informatica does not accept responsibility for any kind of loss resulting from the use of information contained in this document. The information contained in this document is subject to change without notice.

The incorporation of the product attributes discussed in these materials into any release or upgrade of any Informatica software product—as well as the timing of any such release or upgrade—is at the sole discretion of Informatica.

Protected by one or more of the following U.S. Patents: 6,032,158; 5,794,246; 6,014,670; 6,339,775; 6,044,374; 6,208,990; 6,850,947; 6,895,471; or by the following pending U.S. Patents: 09/644,280; 10/966,046; 10/727,700.

This edition published November 2007


Table of Contents

Executive Summary
Historical Approaches to Data Integration
The Combined Engine- and RDBMS-based Approach to Data Integration
How Pushdown Optimization Works
    Overview of Pushdown Processing
    Two-Pass Pushdown Processing
    Full Pushdown Processing
    Partial Pushdown Processing
    Platform-Specific Pushdown Optimization
Benefits of Pushdown Optimization
    Increased Performance
    Greater IT Team Productivity
    Reduced Risk and Enhanced Flexibility
Conclusion
Learn More
About Informatica


Executive Summary

Today's IT organizations are struggling to integrate more and more data in less and less time. What's the right data integration strategy to effectively manage dozens or even hundreds of terabytes of data with enough flexibility and adaptability to cope with future growth?

Historically, data integration involved developing hand-coded programs that extracted data from source systems, applied business and transformation logic, and then populated the appropriate downstream system, whether a staging area, data warehouse, or other application interface.

Hand coding has since been replaced by data integration software that accesses, discovers, integrates, and delivers data using an "engine" or "data integration server" and visual tools to map and execute the desired process. Driven by accelerated productivity gains and ever-increasing performance, state-of-the-art data integration platforms, such as Informatica® PowerCenter®, effectively handle the vast majority of today's data integration scenarios.

Based on PowerCenter's wide acceptance by a variety of high-volume customers, Informatica has identified scenarios where processing data in a source or target database, instead of within the data integration server, can lead to significant performance gains. These scenarios occur when data is "co-located" within a common database instance, such as when staging and production reside in a single relational database management system (RDBMS) or when a company invests in database hardware and software that can provide additional processing power.

With these scenarios in mind, Informatica set out to deliver a solution that best leverages the performance capabilities of its data integration server and the processing power of a relational database interchangeably, to optimize the use of available resources without incurring undue configuration and management burdens. This solution is the Pushdown Optimization Option for Informatica PowerCenter.

The Pushdown Optimization Option increases system performance by providing the flexibility to push data transformation processing to the most appropriate processing resource, whether within a source or target database or through the PowerCenter server. With this option, PowerCenter is the only enterprise data integration software on the market that allows you to choose when and how to push down processing, offering a significant performance advantage.

This white paper examines the historical approaches to data integration and describes how a combined engine- and RDBMS-based approach to data integration can help IT organizations:

• Cost-effectively scale by using a flexible, adaptable data integration architecture

• Increase developer and team productivity

• Save costs through greater leverage of RDBMS and hardware investments

• Eliminate the need to write custom-coded solutions

• Easily adapt to changes in underlying RDBMS architecture

• Maintain visibility and control of data integration processes

"PowerCenter is the hub of our integration infrastructure. With release 8.5, we will be able to upgrade our data integration capabilities and take on even more projects, without sacrificing responsiveness or scalability. For example, pushdown optimization is a feature unique to Informatica, and PowerCenter 8.5 supports additional transformations that can be processed within the database, which will help us maximize existing database resources and improve performance."

Mark Cothron
Software Engineering Manager
Ace Hardware


Historical Approaches to Data Integration

Historically, there have been four approaches to data integration:

1. Hand Coding. Since the early days of data processing, IT has attempted to solve integration problems through the development of hand-coded programs. These efforts still proliferate in many mainframe environments, data migration projects, and other scenarios where manual labor is applied to extract, transform, and move data for the purposes of integration. The high risks, escalating costs, and compliance gaps associated with hand-coded efforts are well documented, especially in today's environment of heightened regulatory oversight and the need for data transparency. Early on, automated solutions emerged as a cost-effective alternative to hand coding.

2. Code Generators. The first attempts at increasing IT efficiency led to the development of code generation frameworks that used visual tools to map out processes and data flow but then generated and compiled code as the resulting run-time solution. Code generators were a step up from hand coding for developers, but the approach did not gain widespread adoption: as solution requirements and IT architecture complexity grew, issues around code maintenance, lack of visibility through metadata, and inaccuracies in the generation process led to higher rather than lower costs.

3. RDBMS-Centric SQL Code Generators. An offspring of the early code generators emerged from the database vendors themselves. Using the database as an "engine" and SQL as the language, RDBMS vendors delivered offerings centered on their own "flavor" of database programming. Unfortunately, these products exposed the inability of the SQL language and database-specific extensions (e.g., PL/SQL, stored procedures) to handle cross-platform data issues, XML data, functions such as data quality and conditional aggregation, and the full range of business logic needed for enterprise data integration. What these products did prove was that, for certain scenarios, the horsepower of the relational database can be used effectively for data integration.

4. Metadata-Driven Engines. Informatica pioneered a data integration approach that leverages a data server, or "engine," powered by open, interpreted metadata as the workhorse for transformation processing. This approach addresses complexity and meets performance needs, and its metadata-centricity provides the added benefits of reuse and openness. Others have since copied this approach through other types of engines and languages. But it wasn't until this metadata-driven, engine-based approach was widely adopted as the preferred method for saving costs and rapidly delivering on data integration requirements that extraction, transformation, and loading (ETL) was established as a proven technology. Figure 1 shows this engine-based data integration approach.


PUSHDOWN OPTIMIZATION OPTION KEY FEATURES

Automatic Generation and "Pushdown" of Mapping Logic
• Generates database-specific logic that represents the overall data flow
• Pushes the execution of the logic into the database to perform data transformation processing

Database Neutrality
• Automatically leverages database resources
• Ends reliance on database-specific programming to exploit processing power

Single Design Environment with Easy-to-Use GUI
• Decouples data transformation logic from the physical execution plan
• Controls where processing takes place
• Dynamically creates and executes database-specific transformation language

Full Integration Across Entire Data Integration Platform
• Automatically applies pushdown optimization to all data integration processing available on the PowerCenter platform

Figure 1: Informatica Pioneered the Metadata-Driven Engine Approach to Data Integration (data sources feed a data integration server, driven by a metadata repository, which loads the data target)


The Combined Engine- and RDBMS-based Approach to Data Integration

Using an engine-based approach, Informatica PowerCenter has become the industry performance leader for enterprise data integration. This leadership has been demonstrated in industry benchmarks, in continued success in complex, high-volume customer environments, and in head-to-head evaluations with competitive offerings. Performance capabilities such as source-specific partitioning, 64-bit support, a threaded architecture, and continued testing and refinement of the data server have led organizations to choose PowerCenter to meet their most strenuous data integration requirements.

In 2006, Informatica evaluated scenarios in which it made sense to limit the movement of data out of, and subsequently back into, the database by processing transformations in the relational database while the data is co-resident there. It is with these scenarios in mind that Informatica developed the pushdown optimization capabilities to round out the optimal performance architecture of its enterprise data integration platform.

The PowerCenter Pushdown Optimization Option improves performance by enabling processing to be "pushed down" to a relational database, maximizing flexibility, minimizing unnecessary data movement, and providing optimal performance for both data-intensive and process-intensive transformations. Pushdown optimization is enabled through PowerCenter's metadata-driven architecture, which decouples the data transformation logic from the physical execution plan. This unique architecture allows processing to be "pushed down" inside an RDBMS when possible.

PowerCenter is the only software on the market that offers engine-based and RDBMS-based integration technology in a single, unified platform. This approach ensures optimal performance across a broad spectrum of data integration initiatives and enables IT to save costs through intelligent use of existing computing resources. Both approaches are required for organizations looking to develop an Integration Competency Center (ICC), where all integration efforts are developed and/or managed by an expert team facing varying solution requirements.

Although processing is spread between the data integration engine and the database engine, with PowerCenter, developers use a single design environment and the same standard set of tools. For example, a developer can design the data flow using the PowerCenter Designer and design the job workflow using the PowerCenter Workflow Manager. Metadata continues to be generated and managed within PowerCenter. By simply selecting pushdown optimization in the PowerCenter graphical user interface (GUI), developers can control where processing takes place, and database-specific transformation language will be dynamically created and executed as appropriate. Pushdown optimization ensures that existing IT assets are fully utilized, helping organizations maximize their investment in RDBMS horsepower.

How Pushdown Optimization Works

The Pushdown Optimization Option increases system performance by providing the flexibility to push data transformation processing to the most appropriate processing resource, whether within a source or target database or through the PowerCenter server. This section explains how pushdown processing works, including two-pass, partial, and full pushdown processing. It describes platform-specific pushdown processing and outlines the limitations on the types of transformations that can be pushed to the database.

A SQL code generator-only approach to data integration hampers IT’s ability to deliver on the various needs of the enterprise due to the limitations of SQL as a comprehensive language for data integration efforts.


Overview of Pushdown Processing

Separating business logic from physical run-time execution, the Pushdown Optimization Option is coupled with the creation and management of workflows. Workflows tie the execution of a metadata-based mapping to an actual physical environment. This environment spans not only the PowerCenter Data Integration Services, which may reside on multiple hardware systems, but also the relational databases where pushdown processing will occur. As shown in Figure 2, data integration solution architects can configure the pushdown strategy through a simple drop-down menu in the PowerCenter Workflow Manager.

Pushdown optimization can be used to push data transformation logic to the source or target database. The amount of work data integration solution architects can push to the database depends on the pushdown optimization configuration, the data transformation logic, and the mapping configuration.

When pushdown optimization is used, PowerCenter writes one or more SQL statements to the source or target database based on the data transformation logic. PowerCenter analyzes the data transformation logic and mapping configuration to determine which data transformation logic it can push to the database. At run time, PowerCenter executes any generated SQL statements against the source or target tables, and it processes within PowerCenter any data transformation logic that it cannot push to the database.

Using pushdown processing can improve performance and optimize available resources. For example, PowerCenter can push the data transformation logic for the mapping seen in Figure 3 to the source database.

Figure 2: Data Integration Solution Architects Can Configure the Pushdown Strategy Through a Simple Drop-Down Menu in the PowerCenter Workflow Manager


The mapping contains a filter transformation that filters out all items except those with an ID greater than 1005. PowerCenter can push the data transformation logic to the database, and it generates the following SQL statement to process it:

INSERT INTO T_ITEMS (ITEM_ID, ITEM_NAME, ITEM_DESC, n_PRICE)
SELECT ITEMS.ITEM_ID, ITEMS.ITEM_NAME, ITEMS.ITEM_DESC,
       CAST(ITEMS.PRICE AS INTEGER)
FROM ITEMS
WHERE (ITEMS.ITEM_ID > 1005)

PowerCenter generates an INSERT ... SELECT statement to read the item ID, name, description, and price columns from the source table and insert them into the target, and it filters the data using a WHERE clause. PowerCenter does not extract any data from the database during this process. Because PowerCenter does not need to extract and load the data itself, performance improves and resources are maximized.

Two-Pass Pushdown Processing

Pushdown processing is based on a two-pass scan of the mapping metadata. In the first pass, PowerCenter scans the mapping objects starting with the source definition object and moving toward the target definition object. When the scan encounters an object containing data transformation logic that cannot be represented in SQL, the scanning process stops, and all transformations upstream of this object are grouped together with equivalent SQL for execution inside the source system.

In the second pass, PowerCenter scans in the opposite direction (i.e., from the target definitions toward the source definitions). When the scan encounters an object containing data transformation logic that cannot be represented in SQL, the scanning process stops, and all transformation objects downstream of this object are grouped together with equivalent SQL for execution inside the target system. PowerCenter itself executes any remaining data transformation logic.
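The two-pass scan described above can be sketched as a simple algorithm. The Python sketch below is illustrative only: the mapping model, the "sql_able" flag, and the function names are assumptions for this example, not PowerCenter internals.

```python
# Illustrative sketch of the two-pass scan. The mapping is modeled
# as an ordered source-to-target pipeline of transformation objects,
# each flagged with whether it can be expressed in SQL.

def two_pass_split(pipeline):
    """Split a pipeline into three groups: transformations pushed to
    the source database, those run in the engine, and those pushed
    to the target database."""
    n = len(pipeline)

    # Pass 1: scan source -> target; stop at the first transformation
    # that cannot be represented in SQL.
    src_end = 0
    while src_end < n and pipeline[src_end]["sql_able"]:
        src_end += 1

    # Pass 2: scan target -> source; stop the same way.
    tgt_start = n
    while tgt_start > src_end and pipeline[tgt_start - 1]["sql_able"]:
        tgt_start -= 1

    return {
        "source_pushdown": [t["name"] for t in pipeline[:src_end]],
        "engine":          [t["name"] for t in pipeline[src_end:tgt_start]],
        "target_pushdown": [t["name"] for t in pipeline[tgt_start:]],
    }

pipeline = [
    {"name": "Filter",         "sql_able": True},
    {"name": "Expression",     "sql_able": True},
    {"name": "AddressCleanse", "sql_able": False},  # engine-only step
    {"name": "Sorter",         "sql_able": True},
    {"name": "TargetLoad",     "sql_able": True},
]
groups = two_pass_split(pipeline)
```

Here the engine-only address-cleansing step splits the pipeline: everything upstream of it runs as SQL in the source database, everything downstream as SQL in the target database, and the step itself runs in the engine.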

When you configure PowerCenter to use pushdown optimization, it can process the data transformation logic using full or partial pushdown optimization.

Figure 3: A Sample Mapping Pushed to the Source Database


Full Pushdown Processing

Figure 4 shows a mapping example that is fully processed inside the database system.

This mapping is used to update a slowly changing dimension table in which the full history of the dimension data must be maintained, with the most current data flagged. The mapping uses a lookup against the target table to determine which rows in the target match the source logical column. It then uses an expression transformation to compare source and target columns and flag new and changed rows. Source data is separated into two flows by a router: one for new rows and one for changed rows. The mapping generates a primary key for each new row using a sequence generator and inserts the new rows into the target. For changed rows, it inserts new versions with the current flag set. An update strategy transformation then updates the existing versions of the changed rows in the target to indicate that those rows are no longer current.

The pushdown capability is activated by a simple menu-driven option (shown earlier in Figure 2). This enables the transformation logic to be represented visually in PowerCenter, so it remains platform-independent and easy to modify, while the actual transformation of the data, including lookups, expressions, sequence generation, and the update strategy, is completed entirely in the database engine, efficiently leveraging existing database capacity.
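The lookup-compare-route portion of this slowly changing dimension mapping can be illustrated with a short sketch. This is a hypothetical model of those steps, not Informatica code; the row structure and names are assumptions for the example.

```python
# Hypothetical sketch of the SCD mapping's routing logic: each
# source row is looked up in the target, compared, and routed to
# either the "new rows" flow or the "changed rows" flow.

def route_scd_rows(source_rows, target_by_key):
    """Classify source rows for a Type 2 slowly changing dimension load."""
    new_rows, changed_rows = [], []
    for row in source_rows:
        existing = target_by_key.get(row["key"])   # lookup against target
        if existing is None:
            new_rows.append(row)                   # no match: brand-new row
        elif existing["attrs"] != row["attrs"]:
            changed_rows.append(row)               # match but differs: new version
        # rows identical to the current target version flow to neither branch
    return new_rows, changed_rows

target_by_key = {
    "C1": {"attrs": {"city": "Austin"}},
    "C3": {"attrs": {"city": "Boise"}},
}
source_rows = [
    {"key": "C1", "attrs": {"city": "Dallas"}},  # changed
    {"key": "C2", "attrs": {"city": "Miami"}},   # new
    {"key": "C3", "attrs": {"city": "Boise"}},   # unchanged
]
new_rows, changed_rows = route_scd_rows(source_rows, target_by_key)
```

In the actual mapping, the two flows then receive sequence-generated keys and current-flag values before being written to the target, and the update strategy retires the superseded versions.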

Partial Pushdown Processing

Figure 5 shows an example of a mapping that filters and sorts a customer source and calls out to Informatica Data Quality address validation routines before loading into a target.


Figure 4: A Mapping Example of Full Pushdown Processing

Figure 5: A Mapping Example of Partial Pushdown Processing


Relying on database horsepower alone may not always make sense for business logic that needs to be performed on the source data. For example, if special types of transformation routines (e.g., address cleansing) need to be executed on large amounts of source data, transformations prior to the address cleansing can be pushed down to the source database to better leverage the database engine. At the same time, the remaining address validation routines can be performed within Informatica’s data integration engine. All this is done while maintaining the entire mapping logic and metadata within PowerCenter’s visual design environment.

Platform-Specific Pushdown Optimization

When pushdown optimization is applied to a specific database type, the PowerCenter Data Integration Services generate SQL statements using native database SQL. Standards-based generation over ODBC is also supported, in which case PowerCenter generates SQL statements using ANSI SQL. When a specific database type is used, PowerCenter can generate a greater variety of transformation functions and ensure generation of the fastest execution plan.
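The difference between native and ANSI SQL generation can be illustrated with a small sketch that renders the same transformation expression in two dialects. The dialect table below is a simplified assumption for illustration, not PowerCenter's actual generation logic.

```python
# Illustrative sketch of dialect-aware SQL generation: the same
# substring expression is rendered as native Oracle SQL when the
# database type is known, or as portable ANSI SQL for the ODBC path.

DIALECTS = {
    "oracle": lambda col, start, length: f"SUBSTR({col}, {start}, {length})",
    "ansi":   lambda col, start, length: f"SUBSTRING({col} FROM {start} FOR {length})",
}

def render_substring(db_type, col, start, length):
    # Unknown database types fall back to standards-based ANSI SQL.
    render = DIALECTS.get(db_type, DIALECTS["ansi"])
    return render(col, start, length)

oracle_sql = render_substring("oracle", "ITEM_NAME", 1, 3)
odbc_sql   = render_substring("odbc",   "ITEM_NAME", 1, 3)
```

Generating native syntax where the database type is known lets the optimizer use vendor-specific functions; the ANSI fallback trades some coverage for portability.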

Benefits of Pushdown Optimization

The PowerCenter Pushdown Optimization Option offers many benefits, including:

• Increased performance by optimizing resource utilization

• Enhanced IT team productivity with simplified debugging and performance tuning and greater ease of use with a metadata-driven architecture that provides metadata lineage

• Reduced risk and enhanced flexibility through database neutrality

Increased Performance

The PowerCenter Pushdown Optimization Option increases system performance by providing the flexibility to push data transformation processing to the most appropriate processing resource, whether within a source or target database or through the PowerCenter server. With this option, PowerCenter is the only enterprise data integration software on the market that allows data integration solution architects to choose when pushing down processing offers a performance advantage.

With the Pushdown Optimization Option, data integration solution architects can choose to push all or part of the data transformation logic to the source or target database. Data integration solution architects can select the database they want to push transformation logic to, and they can choose to push some sessions to the database, while allowing PowerCenter to process other sessions.

For example, let’s say an IT organization has an Oracle source database with very low user activity. This organization may choose to push transformation logic for all sessions that run on this database. In contrast, let’s say an IT organization has a Teradata® source database with heavy user activity. This organization may choose to allow PowerCenter to process the transformation logic for sessions that run on this database. In this way, the sessions can be tuned to work with the load on each database, optimizing performance.

With the Pushdown Optimization Option, data integration solution architects can also use variables to choose to push different volumes of transformation logic to the source or target database at different times during the day. For example, partial pushdown optimization may be used during the peak hours of the day, but full pushdown optimization is used from midnight until 2 a.m. when activity is low.
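A time-based strategy like the one above amounts to a simple schedule-driven choice. The sketch below models it with an illustrative function; the level names are hypothetical, not actual PowerCenter settings.

```python
# Illustrative sketch of the time-based strategy described above:
# full pushdown in the midnight-to-2 a.m. low-activity window,
# partial pushdown during the rest of the day.

def pushdown_level(hour):
    """Return the pushdown setting for a session starting at `hour` (0-23)."""
    return "FULL" if 0 <= hour < 2 else "PARTIAL"

# Sample schedule: midnight and 1 a.m. runs get full pushdown;
# daytime runs leave more work to the PowerCenter engine.
levels = {h: pushdown_level(h) for h in (0, 1, 2, 13)}
```

In practice, such a choice would be carried by a session variable evaluated at run time, so the same workflow can shift work toward or away from the database as load patterns change.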


Greater IT Team Productivity

With its unique metadata-driven architecture, the PowerCenter Pushdown Optimization Option increases IT team productivity in several ways:

• Ease of use on different platforms. PowerCenter's metadata-driven architecture allows transformation logic to be easily ported to different platforms. The same transformation logic can be performed on different databases: the same session can be assigned different database connections, and the same data transformation can be performed without rewriting code or using different SQL syntax.

• Ease of maintenance. PowerCenter's metadata-driven architecture makes it easy to track data for error logging and for maintaining an audit trail. In addition, metadata for repository objects is maintained in the PowerCenter repository. Modifications to repository objects and metadata imports and exports can be tracked, and a history of changes to repository objects can be maintained.

• Simplified debugging and performance tuning. When data transformation logic is configured in PowerCenter, it is represented in a mapping, which provides a visual representation of the data flow, making the transformation logic simple to debug and edit. Because PowerCenter is a single, unified platform, different functions can be applied to the same metadata without exiting the PowerCenter GUI. For example, a developer might create a mapping to represent a data transformation and then launch the PowerCenter Data Profiling Option to assess the status of the data. Later, the developer can open the Workflow Manager to perform the transformation and launch the Workflow Monitor to track the data as it moves from the source to the target.

A tool called the Pushdown Optimization Viewer lets data integration solution architects preview the flow of data to the source or target database. The tool shows how much of the transformation logic can be pushed to the source or target database and the SQL statements that will be generated at run time, as well as any messages related to pushdown optimization. Figure 6 shows the full pushdown mapping from Figure 4 displayed in the Pushdown Optimization Viewer.


Figure 6: Preview of SQL Statements in Pushdown Optimization Viewer


Reduced Risk and Enhanced Flexibility

IT organizations typically support several different relational databases. Even when they are able to standardize on a single RDBMS, changing business conditions, resulting from mergers and acquisitions or cost cutting, for example, dictate that they be prepared to support multiple relational database architectures. IT organizations need to fully leverage the capabilities of each type of database, yet stay agile enough to rapidly integrate other types of databases as the need arises. New regulatory and governance requirements also dictate increased visibility into, and control over, the business rules applied to data as it moves throughout the enterprise.

PowerCenter reduces the risk of changing database architectures and enhances flexibility by being database-neutral. PowerCenter's metadata-driven architecture extends to mappings that leverage the Pushdown Optimization Option. After a database change, the appropriate database-specific logic can be easily regenerated, providing flexibility of choice and ease of change. Relying on metadata analysis and reporting, rather than tying business logic to vendor-specific hand-coded logic, enables effective data governance and transparency.

Conclusion

Today's pressures to cut costs while driving revenue are leading IT organizations to examine their current data integration infrastructure needs and choose solutions that provide flexibility and maximum leverage of current assets.

Informatica PowerCenter provides IT organizations with the flexibility to optimize performance in response to changing runtime demands, peak processing needs, or other dynamic aspects of the production environment, helping IT organizations achieve cost-effective scalability and performance. By delivering a combined engine-centric and RDBMS-centric approach to data integration in a single, unified platform, PowerCenter, with its Pushdown Optimization Option, ensures optimal performance for a broad spectrum of data integration projects and helps IT save costs through the intelligent use of existing computing resources.

With the Pushdown Optimization Option, PowerCenter can help IT organizations:

• Cost-effectively scale through a flexible, adaptable data integration architecture

• Increase developer and team productivity

• Save costs through greater leverage of RDBMS and hardware investments

• Eliminate the need to write custom-coded solutions

• Easily adapt to changes in underlying RDBMS architecture

• Maintain visibility and control of data integration processes


Learn More

To find out how using a combined engine- and RDBMS-centric approach can benefit your data integration initiatives, or to find out more about PowerCenter and the Pushdown Optimization Option, please visit us at www.informatica.com or call us at (800) 653-3871.

About Informatica

Informatica is a leading provider of enterprise data integration software and services. With Informatica, organizations can gain greater business value by integrating all their information assets from across the enterprise. Thousands of companies worldwide rely on Informatica to reduce the cost and expedite the time to address data integration needs of any complexity and scale.


Worldwide Headquarters, 100 Cardinal Way, Redwood City, CA 94063, USA
phone: 650.385.5000 fax: 650.385.5500 toll-free in the US: 1.800.653.3871 www.informatica.com

Informatica Offices Around The Globe: Australia • Belgium • Canada • China • France • Germany • Japan • Korea • the Netherlands • Singapore • Switzerland • United Kingdom • USA

© 2008 Informatica Corporation. All rights reserved. Printed in the U.S.A. Informatica, the Informatica logo, and The Data Integration Company are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

6650 (09/17/2008)