
Best Practices in Data Loading

for an Oracle Data Warehouse

Jean-Pierre Dijcks

Oracle Corporation

United States

Keywords:

Data Integration, ETL, Data Warehouse, Oracle Database Machine, Best Practices

Introduction

Perhaps the most significant trend in data warehousing over the past few years has been the

growth in data volumes. Whether you are a school district, a financial institution or a

manufacturing organization, you are storing more and more data.

If we look at the Winter Survey¹ and project the trend over the next couple of years, we see almost unimaginable growth rates in data volumes.

Figure 1. Growth in data volumes

There are many reasons for this data growth: new business processes continue to be automated, and more detailed information is being collected at every level. Regulatory compliance increases data storage, and the desire to analyze more historical data adds to the growth in data volumes. In this paper we do not focus on why data volumes are growing, but on what that growth means for data loading and data integration.

¹ Source: Winter TopTen Survey, Winter Corporation, Waltham MA, 2008


Increasing Data Volumes and ETL

As data warehouses grow, so does the requirement for well-performing ETL jobs.

Another strain on the ETL subsystem is the need to load data in ever shorter intervals.

Five years ago, loading a data warehouse on a nightly basis was state of the art and the goal to achieve. Today that modus operandi is outdated, and most data warehouse systems are increasing their load frequency to multi-batch or even micro-batch operation. In these condensed batch operations, multiple loads run in a single day; micro-batching runs a batch load every few minutes.

Both of these forces strain the ETL subsystem. This paper intends to show how to deal with that strain and how to optimize the use of Oracle's ETL capabilities to satisfy your ETL Service Level Agreements.

Optimized ETL Requires Balanced Hardware

Optimizing data loading or ETL starts with a platform that – at least in theory – can handle

your required throughputs. When looking at the scalability and performance of a system it is

crucial to look at both software and hardware characteristics and understand potential

bottlenecks.

Balanced Systems

A system is balanced when the storage array is capable of reading and moving – through the

storage area network and the Host Bus Adapters (HBA) – enough data to the database servers

to have the CPUs adequately loaded. In other words, neither the IO capacity, nor the

bandwidth within the system, nor the CPU should be a constraint on the system.

Figure 2. Balance between database and storage bandwidth?


Consider the simplified example shown in Figure 2. The storage subsystem can deliver a maximum throughput of 2 GB/sec, whereas the path up to the compute platform can deliver 4 GB/sec. If we now assume that the database servers have sufficient CPU capacity and other resources to handle 4 GB/sec of input, they will run at no more than half capacity, because the bottleneck limits the storage to delivering 2 GB/sec.

Figure 3. A balanced system

When balancing a system it is crucial to balance I/O capacity, CPU capacity, available memory and interconnect capacity against each other. This balance is shown in Figure 3. Balancing the system in Figure 2 allows the storage arrays to deliver the full 4 GB/sec to the compute side and utilize the available CPU capacity. When all components are balanced, the system should double its performance.
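As a simple rule of thumb, the end-to-end throughput of such a system is bounded by its slowest component:

   throughput = min(storage I/O, interconnect bandwidth, CPU consumption rate)

For the unbalanced system in Figure 2 this yields min(2, 4, 4) = 2 GB/sec; after upgrading the storage side it becomes min(4, 4, 4) = 4 GB/sec, the doubling mentioned above.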

Old Hardware and Incorrect System Sizing

If you were trying to break the track record of this year's Formula 1 world champion, would you go to the track with a Formula 1 car from the late eighties?

In reality, however, many of the systems running today sit on yesterday's hardware and yesterday's networking and storage systems. Sure, you added a bunch of disks to accommodate more storage capacity, but you did not, or could not, update the compute servers or the SAN.

Now your CIO has challenged you to go and break the track record with the old F1 car. You

can hire a better racecar driver, you can adjust the carburetor, and you can buy better tires, but

the result of your attempts is simple: failure. There is no way you can break that track record.

If we go back to Figure 2 and imagine the pipe from the storage array carries 4 GB/sec, we have a balanced system. But what if the storage array, due to the number or type of disks used, can only deliver a 1 GB/sec data stream to the system?

You face the same problem in all of these cases: you need more speed than you can get from the system. No matter how smart your software is, it cannot go faster than the infrastructure that carries it.


How to Utilize Oracle for Fast Bulk Loading

This paper focuses mainly on bulk data movement. While we see a trend towards more real-time deployments, the large majority of data movements are still bulk movements.

Access Methods

When talking about speed of movement, the crucial component is the way Oracle can access the actual data. With newer releases of Oracle Database, new and interesting access methods become available. The following is a ranking and brief discussion of some of these methods, from slowest to fastest:

• Web Services – In their original incarnation (using SOAP-style communication), web services allow a system to connect to an externally hosted system via a simple API call. In general, a web service is a slow method of communication for bulk data loads.

• Database Links – This method incorporates various access protocols depending on the source system. ODBC and Oracle Database Gateways utilize a database link to connect to a non-Oracle system; database links are also used to connect disparate Oracle systems. Database links should never be used to connect to a schema within the same Oracle database. They are a convenient way of connecting to a remote database, but this convenience comes at the cost of performance: a database link does not allow parallel loading and therefore often becomes the bottleneck in the data movement process.

• Data Pump – Since its introduction, Data Pump has made some interesting things possible in ETL. Many of us categorize Data Pump as an export/import utility, but the fact that you can choose specific columns and tables allows for a very fast way of moving data between two or more Oracle instances. Since 11g, External Tables can also read Data Pump files directly, allowing ETL-style SQL access on the actual export file without first staging or importing the data (see the sketch after this list).

• Flat Files – Flat files are still one of the best-performing means of moving large data volumes between databases. Especially in a heterogeneous environment where data is loaded from a non-Oracle system into Oracle, flat files outperform almost every other means. While Data Pump works for Oracle-to-Oracle situations, a SQL Server unload to file, FTP transfer and External Table load into Oracle is orders of magnitude faster than a database link scenario.

• Transportable Tablespaces – These are arguably the fastest way of moving data within an Oracle environment. While Data Pump allows for much more granular data movement, the movement of the entire datafile without any additional steps makes this method fast, direct and simple.
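To illustrate the Data Pump access method above, the following is a minimal sketch of an External Table that reads a Data Pump export file in place; the directory, table and file names are hypothetical:

CREATE DIRECTORY dp_dir AS '/u01/etl/incoming';

-- External table over the export file: no import or staging step required
CREATE TABLE sales_ext (
  sale_id   NUMBER,
  cust_id   NUMBER,
  amount    NUMBER,
  sale_date DATE
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY dp_dir
  LOCATION ('sales_export.dmp')
);

-- The dump file is now directly queryable with ETL-style SQL
SELECT COUNT(*) FROM sales_ext;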


Using the Right Access Method

With all the access methods available (and this is probably not the full list), the task for the

ETL developer is to choose the right method.

Most ETL tools – as a default mechanism – leverage database links to move data around when Oracle comes into play. However, that default is most likely not the way to move data around in large volumes. For Oracle-to-Oracle movements the preferred choice is to use Data Pump when supported, and Flat Files when Data Pump is not available (due to release restrictions, for example).

In a heterogeneous environment, the fastest movement of data is achieved by unloading into Flat Files, compressing the data and using some FTP mechanism to move the files. On the Oracle side, the ETL strategy should then leverage External Tables (and NOT SQL*Loader) with pre-processing capabilities, as in the sketch below.
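A minimal sketch of such an External Table, assuming gzip-compressed files and a release that supports the PREPROCESSOR clause (Oracle Database 11g Release 2); directory, file and column names are hypothetical:

-- exec_dir is a directory object pointing at the location of the zcat binary
CREATE TABLE stage_orders_ext (
  order_id NUMBER,
  status   VARCHAR2(20),
  amount   NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY etl_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'zcat'  -- decompress on the fly while reading
    FIELDS TERMINATED BY ','
  )
  -- several compressed files allow parallelism across files
  LOCATION ('orders1.csv.gz', 'orders2.csv.gz')
)
PARALLEL 2;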

The graph in Figure 4 shows the performance as mentioned above and the ability to handle

heterogeneous source types.

Figure 4. Determine access method based on speed and heterogeneity

Reference data sets – for example small dimensions in a star schema, or recoding tables – can leverage database link mechanisms. Even when going to a non-Oracle system, that may be a good enough method: small data sets typically move fast enough over a database link not to worry about them, as the sketch below shows. Optimizing this via unload and reload mechanisms is not going to gain enough to warrant spending the extra time on the more complex processes that go with unloading.
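For such a small reference set, a plain insert over a database link is typically all that is needed; link, table and column names are hypothetical:

-- Refresh a small dimension from the remote source; link overhead is negligible here
INSERT INTO dim_country (country_id, country_name)
SELECT country_id, country_name
FROM   dim_country@src_link;
COMMIT;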

Changed Data

For large data volumes, the detection of changed data should be decoupled from the access method. The process that detects changes should do just that: deliver a set of changed data to a transportation mechanism.

Detecting changes can in itself be a complex process, and the goal is to do the change detection as quickly as possible. Various methods are available and should be considered, but care should also be taken that the change detection does not force a particular transportation mechanism.

In other words, once the changes are identified, you then ideally choose how to extract and move them. Timestamp-based change detection is the simplest case: once the window is known, the data – if in Oracle – can be moved using Data Pump rather than a simple SELECT with a timestamp WHERE clause, as the sketch below shows.
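A minimal sketch of handing such a window to Data Pump: the QUERY parameter restricts the export to the changed rows. A parameter file avoids shell quoting issues; all names and the window value are hypothetical:

# delta.par - Data Pump parameter file
DIRECTORY=dp_dir
DUMPFILE=orders_delta.dmp
TABLES=orders
QUERY=orders:"WHERE last_updated >= TO_DATE('2009-11-01','YYYY-MM-DD')"

$ expdp etl_user@dwh PARFILE=delta.par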

Parallel Loading

Parallel loading and join techniques based on parallel processing are an absolute must when loading large volumes of data into Oracle. Parallel loading is also the big driver for using External Tables rather than SQL*Loader: with External Tables you can specify right on the table creation statement that data is read in parallel, and Oracle will then spawn and manage the parallel processes that actually load the data. SQL*Loader requires the ETL developer to manage his or her own parallelism. A sketch of a parallel load follows.
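A minimal sketch of a parallel direct-path load from an External Table into a warehouse table; the table names are hypothetical, and sales_ext is assumed to be declared with a PARALLEL clause:

ALTER SESSION ENABLE PARALLEL DML;

-- Direct-path, parallel insert; Oracle spawns and manages the parallel processes
INSERT /*+ APPEND PARALLEL(sales, 8) */ INTO sales
SELECT /*+ PARALLEL(e, 8) */ *
FROM   sales_ext e;

COMMIT;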

Partitioning and ETL

When joining data elements in queries, but also in ETL, a good method of getting great performance is to design the schema to leverage partition-wise joins. By partitioning tables on their join column, Oracle can deliver a matching pair of partitions to a single parallel process. That strategy divides one large join into many small joins spread over the parallel processes, making the entire process run in parallel. This join method is the most efficient from a processing perspective, and the goal should be to run partition-wise joins as much as possible for large data sets; a sketch follows.
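A minimal sketch of a schema designed for full partition-wise joins: both tables are hash partitioned on the join column with the same number of partitions; all names and column lists are hypothetical:

CREATE TABLE customers (
  cust_id NUMBER PRIMARY KEY,
  name    VARCHAR2(100)
)
PARTITION BY HASH (cust_id) PARTITIONS 16;

CREATE TABLE sales (
  sale_id NUMBER,
  cust_id NUMBER,
  amount  NUMBER
)
PARTITION BY HASH (cust_id) PARTITIONS 16;

-- The equi-join on cust_id breaks into 16 independent partition pairs,
-- each handled by one parallel process
SELECT /*+ PARALLEL(s, 8) PARALLEL(c, 8) */ c.name, SUM(s.amount)
FROM   sales s JOIN customers c ON s.cust_id = c.cust_id
GROUP BY c.name;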


Figure 5. Five Steps for Partition Exchange Loading

Another ETL method based on partitioning large tables is Partition Exchange Loading. The theory is that, rather than inserting into a large table with indexes, it is faster to insert into a smaller table and then build the indexes on it. Once the smaller table has its indexes (and statistics) created, Oracle allows you to swap this table into a large partitioned table via an exchange: the table becomes a specific partition in the large table, and the partition becomes the – now empty – table. This exchange is a single data dictionary operation and costs almost no time at all. The process is shown in Figure 5 and sketched below.
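A minimal sketch of the load-and-exchange steps; table, partition and index names are hypothetical, and the staging table (including its index) must match the structure of the partitioned table:

-- Load the staging table, then build its index and statistics
INSERT /*+ APPEND */ INTO sales_stage SELECT * FROM sales_ext;
COMMIT;
CREATE INDEX sales_stage_ix ON sales_stage (cust_id);
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'SALES_STAGE')

-- Swap the staging table in as the new partition: a dictionary-only operation
ALTER TABLE sales EXCHANGE PARTITION p_200911
  WITH TABLE sales_stage
  INCLUDING INDEXES WITHOUT VALIDATION;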

The process of publishing partitions can also be used to create a single "publish all data" moment for a data warehouse. Figure 6 and the surrounding text show another method to achieve this singular moment in time for data publication.

Publish and Subscribe Model for ETL

Using newer technologies such as flashback queries allows you to create an entire publication subsystem alongside the ETL utilities. To avoid end users querying data that is being updated, you can use "AS OF" queries to regulate which data is visible.

The scenario goes a little like this: you have a reporting environment on top of the data warehouse and want to make sure that incoming data only gets published after it has been checked within the context of the entire system.


That means you need to do your loads, update the entire system, but shield the end users from

that data until it is verified and certified. Once verified, you want to publish the data and

update all data sources for the reports.

Figure 6. Using AS OF views for publishing the latest data

The diagram above shows these steps. To make this work, the schema (on the left, receiving the ETL loads) is covered with a layer of views that control the exact timestamp of the data visible to the end users; a sketch follows the steps below.

1. Update the view layer to set the timestamp to a moment before the ETL starts

2. Run the regular jobs

3. Ensure the data is correct and all present

4. Publish the data by updating the view layer to a point in time after the ETL load

5. The end users now query the updated data in the warehouse
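A minimal sketch of such a view layer, where the publish step (step 4) recreates the view with the new timestamp; view, table and timestamp values are hypothetical, and undo retention must cover the publish window:

-- Step 1: before the ETL run, end users see data as of the last publish point
CREATE OR REPLACE VIEW sales_rpt AS
  SELECT * FROM sales
  AS OF TIMESTAMP TO_TIMESTAMP('2009-11-01 06:00', 'YYYY-MM-DD HH24:MI');

-- Step 4: after the load is verified, publish by moving the timestamp forward
CREATE OR REPLACE VIEW sales_rpt AS
  SELECT * FROM sales
  AS OF TIMESTAMP TO_TIMESTAMP('2009-11-02 06:00', 'YYYY-MM-DD HH24:MI');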

Now, none of the above is required to achieve read consistency for queries! Oracle does that all by itself, without any effort from the ETL developers (unlike other databases in the data warehouse space), so do not confuse the two.

Summary

More data means a lot more strain on the ETL infrastructure. It is important to understand that a well-performing ETL infrastructure in many cases depends on the existing hardware and software platform in use, and it is crucial that the system is sized to perform at the required level.


Once the hardware and software pieces are in place to satisfy the throughput required from the ETL subsystem, it is the task of the ETL team to utilize the tools of the trade to create a well-performing ETL architecture.

To run ETL fast it is important to leverage Oracle software with the latest features. Instead of running SQL*Loader jobs, Oracle recommends using External Tables. For large data sets, database links should be avoided, and alternative means such as Data Pump, Transportable Tablespaces and flat files should be utilized.

As data warehouses grow, traditional solutions run out of steam, and it pays to look at the features in Oracle that make ETL faster, simpler or just easier to handle. It is important for data warehouse and ETL users to move along and sometimes think outside the box!

Contact address:

Jean-Pierre Dijcks

500 Oracle Parkway

Redwood Shores, CA 94065

USA

Phone: +1 650 607 5394

E-mail: [email protected]

Internet: www.oracle.com