etl implementation for extreme performance etl implementation for extreme performance presented by:...
TRANSCRIPT
3
Topics
• About Us
• Tips and tricks for high performance mapping
design
• Pipeline techniques to improve throughput
• Stacked pipelines to achieve extreme throughput
• Ensuring data integrity in an extreme
environment
• Q&A
Copyright 2012 Lockheed Martin
4
About Us
• Who we are
• Catherine Boeving, Software Developer
• Greg Wade, Information Systems Architect
• What we do
• Build large scale active data warehouses with near real time data loads and high availability for Department of Defense (DoD) customers
• Where do we work
• Lockheed Martin Global Systems and Solutions; A leading federal services and information technology contractor
Copyright 2012 Lockheed Martin
5
Our Environment
Input
Data Acquisition
EDW Teradata
Staging Oracle
ETL • SPARC Enterprise M5000
• 144 GB RAM
• Oracle Solaris 10 OS
• Informatica PowerCenter and
DataTransformation V9.1.0 HotFix 3
Excess ETL server capacity needed to achieve extreme throughput.
Copyright 2012 Lockheed Martin
6
Performance vs. Throughput
• Performance
• The execution time for one run of a workflows/mappings
• Throughput
• The volume of data that can be ETL’ed in a specified period of time
• Performance is needed to achieve throughput
• High performance workflows/mappings are not always enough to meet demanding service level agreements
Demanding SLAs require both high performance and throughput
Copyright 2012 Lockheed Martin
7
Carefully implemented lookups can improve mapping performance
Tips and Tricks
• Reference Data Lookups
• Validate the source data
• Expand the source data
• Stage Data Lookups
• Previous source data
• Processing of partial transactions
• Improve integration
Lookups – Requirements
Copyright 2012 Lockheed Martin
8
• Cached
• Only one DB request
• Best used when referenced data does not change
• Better for small tables
• May be able to compensate for slow/overloaded DB
• Un-Cached
• Many DB requests
• Required when referenced data changes
• Better for large tables
• Most performance improvement with fast DB
Selecting the correct lookup type is key to performance
Tips and Tricks
Lookups – Approaches
Copyright 2012 Lockheed Martin
9
• Part Number Example
• 1000 part numbers and descriptions
• 20 character part numbers with 50 character descriptions
Lookups – Calculations
Tips and Tricks
Simple calculations provide some insight but testing is needed in your environment
Copyright 2012 Lockheed Martin
10
Lookups – Calculations
• DB Transfer Entire Table
• = # rows * (characters per row)
• = 1000 * (20 + 50) = 70K Bytes
• One Un-Cached Lookup • = SQL request + SQL response • = 100 + (20 + 50) = 170 Bytes
• Break even point = 70K / 170 = about 411
lookups per mapping execution
Tips and Tricks
A good estimate but ignores DB speed, network,
etc.
Copyright 2012 Lockheed Martin
11
Tips and Tricks
Lookups – Implementation
Peer review or inspection checklists should include validating lookup type selection
• Historical Data Load
• Cached large table and process large amounts of data
one time
• 25 files/10K rows of data against 25M cached lookup
• Average 50 minute workflow execution times and 6 hours total load time
Copyright 2012 Lockheed Martin
12
Tips and Tricks
• Reduces Round Trips to the Database
• Combine Several Lookups with Sequence Logic
• Simplifies Complex Database Insert Logic
• Potential Loss of Data Lineage
Controlling network chatter between ETL and the DB is essential for high performance
Stored Procedures
Copyright 2012 Lockheed Martin
13
Tips and Tricks
Stored Procedures – Implementation
Eliminate redundant
stored procedure calls
Copyright 2012 Lockheed Martin
14
Tips and Tricks
Stored Procedures – Implementation
• Ensure Matching Port and Parameter Sizes
• Mismatched parameter sizes will send extra bytes to database
• Occurs when database and mapping development done in parallel
Verify the Import of Stored Procedures
Copyright 2012 Lockheed Martin
15
• Rapid Development
• Reusable components and patterns
• Understandability
• Onboard new staff with unique ETL approach
• Maintenance
• Source system updates
• Future performance tuning
Tips and Tricks
Mapping Design – Goals
Copyright 2012 Lockheed Martin
16
Tips and Tricks
Mapping Design – Mapplet Execution
• Used to handle similar code from different sources
• Smaller risk in one-time changes
Copyright 2012 Lockheed Martin
17
Tips and Tricks
Mapping Design – Worklet Execution
• Isolates performance tuning and minimizes regression testing
• Implements standards for new developers
Copyright 2012 Lockheed Martin
18
Tips and Tricks
Mapping Design – Stage Table Implementation
• Improves critical
path completion
• Simplifies complex data
Copyright 2012 Lockheed Martin
19
2 Files Processed in 4 Minutes
Standard Approach – No Pipeline
Pipeline Processing
Copyright 2012 Lockheed Martin
20
2 Files Processed in 3 Minutes
Pipeline Processing
Pipeline Approach
Copyright 2012 Lockheed Martin
21
• Pre-Process
• Sort transaction types
• Split large files
• Transform in Multiple Steps
• DataTransformation (DT)
• Break sources in multiple logical parts
• Smart Loading
• Decouple DB loading from transformations
• Use of external loaders
Extr
act
Tra
nsfo
rm
Load
Added Complexity requires standards and review
Pipeline Processing
Copyright 2012 Lockheed Martin
22
• Use Flat Files Between Pipeline Steps
• High demands on ETL server’s file system
• Requires highly tuned cluster file systems
• Pipeline Steps Have Similar Run Times
• Simple Three Step Pipeline
• DT File, Workflow Output File, External DB Loader
Pipeline Processing
Implementing the Pipeline
Flat File Movement Adds Complexity And Must Be Monitored
Copyright 2012 Lockheed Martin
23
Pipeline Processing
Implementing the Pipeline
Batch File Processing
Copyright 2012 Lockheed Martin
24
6 Files Processed in 3 Minutes
Files applied to the DB in non-deterministic order
Pipeline Processing
Stacked Pipelines – Threaded 3x
Copyright 2012 Lockheed Martin
25
Pipeline Threading – Implementation
Pipeline Processing
• Manipulate XML
• Replicate Parameter Files
Copyright 2012 Lockheed Martin
26
Pipeline Processing
• Example Calculations
• F = files to process = 100 files
• T = time to ETL and load a file = 5 minutes
• P = number of pipeline steps = 3 steps
• S = number of stacked pipelines = 4 pipelines
• Assume
• All pipeline steps take the same amount of time
• Ignore any overhead for intermediate files
Estimate with your workload to see the possibilities
Copyright 2012 Lockheed Martin
27
Pipeline Processing
• Example Calculations Summary
• Standard Processing (No Pipeline)
• F * T = 100 files * 5 minutes = 500 minutes
• Pipeline
• (P + (F – 1)) * (T/P) = (3 + (100 -1)) * (5/3) = 170 minutes
• 294% speedup over standard processing
• Stacked Pipelines
• ((P + (F – 1)) * (T/P)) / S = 170 / 4 = 43
• 395% speedup over single pipeline
• 1162% speedup over standard processing
Theoretical speedup shown. Actual speedup depends on your environment Copyright 2012 Lockheed Martin
28
Copyright 2012 Lockheed Martin
Pipeline Processing
• Pipeline Metrics – 2 Steps
• Standard Processing
• 387 sec for one file batch
• Pipeline Processing
• Step #1 – 222 sec for one file batch
• Step #2 – 240 sec for one file batch
• 100 File Batches Calculation
• Standard = 100 * 387 = 38700 sec
• Pipeline = (2 + (100 – 1))* (240/2) = 12120
• Speedup = 313 %
29
Threaded Workflows Metrics -- 4 Threads
Pipeline Processing
TOTAL
RUNTIME (SECONDS)
TOTAL INPUT ROWS
TOTAL WEIGHTED AVERAGE RUNTIME
TOTAL ROWS/ TOTAL
WEIGHTED AVERAGE RUNTIME
PERCENT DIFFERENCE
Non-Threaded Workflows
1,836,519.00 81,148,820.00 303,229.40 267.62 -17.93%
Threaded Workflows
5,503,758.00 88,555,273.00 271,577.78 326.08 21.85%
Full speedup not realized. Consider data volume when threading.
We ran out of
files to process!
Copyright 2012 Lockheed Martin
30
Data Integrity
Customer Satisfaction, Trust, and Growth
• Is the data accurate?
• How complete is the picture?
• Finding the Bottleneck?
Building the System of Record
Copyright 2012 Lockheed Martin
31
• File Monitoring
• System exchange
• Reaching pre-processing phase
• Data Monitoring
• Check for data validity
• Track session execution times
• Output File Monitoring
• Output files load time
• Follow invalid output files
Extr
act
Tra
nsfo
rm
Load
Data Integrity
Copyright 2012 Lockheed Martin
32
Data Integrity
Event Tracking
E Exchange
Event
E Pre-Process
Event E Bulk Process
Event E Output File
Event
E Database File
Event
E T L
Copyright 2012 Lockheed Martin
33
Data Integrity
Monitor the Data - Alert
• Maintain Timelines for Files and Data
• Define expectations for data
• Empower the System
• System operators resolution
• Track to data quality issues
Copyright 2012 Lockheed Martin
34
Data Integrity
Proven Metrics
• Checking the Box on the SLA
• Quantifiable numbers
• Build and Track Future Growth
• Handle errors and invalid data
Review of metrics may require redesign.
Copyright 2012 Lockheed Martin
35
Key Points
• Tips and Tricks
• Stored Procedures, Lookups, Mapping Design
• Pipelines
• Pipeline ETL Processing
• Stacked Pipelines and Threaded Workflows
• Data Integrity
• Events for ETL, Alerting, Proven Metrics
Copyright 2012 Lockheed Martin