SAS® Data Integration Studio: The 30-Day Plan
Paper 078-2013
John Heaton
Senior DW Analyst Heritage Bank Australia
Why DI Studio
Volume of data.
Automation of data extracts.
80/20 – remove the need for the business to spend 80% of its time on data preparation.
Data Governance – build a standard data model with standard definitions across different functional areas within the business.
Optimisation – reduce the redundancy of multiple people extracting the same information.
Security – secure, audit and share information from a controlled asset.
No need for an experienced SAS programmer.
Outcome of Planning
Size of Project
5 Source Systems
436 Tables
135 Views
22130 Columns
175 Jobs
Total Estimated Size > 15 TB
Daily Volumes
» Storage ±50 GB (4x improvement)
» IOPS ±300 GB (3x improvement)
Resources
1.5 Full Time Equivalents
12 months from design to production
Subject Areas
> 5 Subject Areas
> 20 Foundation Star Schemas
> 6 Data Access Star Schemas
> 20 Complex Cross functional reports
Design the Data Layer
Staging – location for data before any transformation.
Foundation Layer – data and relationships represent business information.
Access and Performance – data and relationships optimized for information delivery, reporting and analysis.
Libraries and Locations
Actual - Migrate Staging Area
• Reduce IOPS
• Reduce network traffic
• Perform in-machine transformations rather than across the network
Proposed Architecture
Everything in relational database
• Staging
• Foundation
• Access and Performance
Libraries in Memory
/*
* sasv9_usermods.cfg
*
* This config file extends options set in sasv9.cfg. Place your site-specific
* options in this file. Any options included in this file are common across
* all server components in this application server.
*
* Do NOT modify the sasv9.cfg file.
*
*/
-MEMMAXSZ 16G
• Memory is faster than disk
• Bypasses SAS Work which is on disk.
• Can be configured for each transform
• Automatic clean up once job is finished
• Ensure sufficient memory available for all concurrent jobs
Data Staging – Logical Dates
• A Logical Date represents the period covered by the data.
• Enables you to capture and store multiple data sets from the source system.
• Reload data without re-extracting information.
• Decouple your source extraction
[Diagram: the Product table extracted on Logical Date 1, Logical Date 2 and Logical Date 3]
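The idea of keeping one staged copy per logical date can be sketched in a few lines of Python. This is only an illustration of the concept, not how DI Studio implements it: the `staging` dictionary and the `stage_extract` and `reload` helpers are hypothetical names, and a real staging area would be a database table keyed by logical date.

```python
from datetime import date

def stage_extract(staging, logical_date, rows):
    """Store a copy of the extracted rows under the logical date they represent."""
    staging[logical_date] = [dict(r) for r in rows]

def reload(staging, logical_date):
    """Return the staged rows for a logical date without touching the source system."""
    return staging[logical_date]

# Hypothetical staging area: logical date -> rows extracted for that period.
staging = {}
stage_extract(staging, date(2013, 1, 1), [{"product_id": 1, "name": "Saver"}])
stage_extract(staging, date(2013, 1, 2), [{"product_id": 1, "name": "Saver Plus"}])

# Each logical date keeps its own copy, so any day can be reprocessed later.
day_one = reload(staging, date(2013, 1, 1))
day_two = reload(staging, date(2013, 1, 2))
```

Because each extract is held under its own logical date, a failed downstream load can be rerun from staging alone.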
Incremental or Full Loads
Full load: Product extracted on Logical Date 2 EXCEPT Product extracted on Logical Date 1 = Product change data for Logical Date 2.
Incremental load: extract where column > last high water mark.
Reduce # of records = reduce I/O = increase speed.
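Both ways of finding changes can be sketched in Python. The first function mimics a set EXCEPT between two full extracts, and the second filters on a high water mark. The function names, the `updated` column and the sample rows are assumptions made for illustration; in practice this logic would normally run as SQL in the database.

```python
def change_data(previous, current):
    """Full load: rows in the current extract that are not in the previous one (EXCEPT)."""
    seen = {tuple(sorted(row.items())) for row in previous}
    return [row for row in current if tuple(sorted(row.items())) not in seen]

def incremental_extract(source_rows, column, last_high_water_mark):
    """Incremental load: rows where column > last high water mark, plus the new mark."""
    rows = [row for row in source_rows if row[column] > last_high_water_mark]
    new_mark = max((row[column] for row in rows), default=last_high_water_mark)
    return rows, new_mark

# Hypothetical Product extracts for two logical dates.
date_1 = [{"id": 1, "name": "A", "updated": 5}, {"id": 2, "name": "B", "updated": 10}]
date_2 = [{"id": 1, "name": "A", "updated": 5}, {"id": 2, "name": "C", "updated": 12}]

changes = change_data(date_1, date_2)                  # only the changed row for id 2
rows, mark = incremental_extract(date_2, "updated", 10)  # same row, found without a full compare
```

The incremental path reads only the rows past the stored mark, which is where the I/O saving comes from; the new mark is saved for the next run.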
Slowly Changing Dimension I or II Transformation
Method for tracking changes in data
SCD I – no tracking
SCD II
» Identify which columns to use to track changes
» Changes become new rows with a new version number and valid-from and valid-to dates
» Old row is end dated
SCD II can be configured to behave as SCD I, but not vice versa.
Changing from SCD I to SCD II means rewriting the job; starting with SCD II means you just reconfigure it.
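The SCD II behaviour described above (new version row, valid-from and valid-to dates, old row end dated) can be sketched as follows. This is a minimal illustration, not the DI Studio SCD Type 2 Loader: the column names `version`, `valid_from` and `valid_to`, the open-ended date 9999-12-31, and the function `apply_scd2` are all assumptions.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed marker for "still current"

def apply_scd2(dimension, incoming, key, tracked, load_date):
    """Apply SCD II logic in place: end date changed rows and append new versions."""
    current = {row[key]: row for row in dimension if row["valid_to"] == HIGH_DATE}
    for rec in incoming:
        old = current.get(rec[key])
        if old is not None and all(old[c] == rec[c] for c in tracked):
            continue  # no change in the tracked columns
        version = 1
        if old is not None:
            old["valid_to"] = load_date  # end date the old row
            version = old["version"] + 1
        new_row = {key: rec[key], **{c: rec[c] for c in tracked},
                   "version": version, "valid_from": load_date, "valid_to": HIGH_DATE}
        dimension.append(new_row)
        current[rec[key]] = new_row
    return dimension

# Hypothetical Product dimension loaded on two days.
dim = []
apply_scd2(dim, [{"product_id": 1, "name": "Saver"}], "product_id", ["name"], date(2013, 1, 1))
apply_scd2(dim, [{"product_id": 1, "name": "Saver Plus"}], "product_id", ["name"], date(2013, 1, 2))
```

Passing an empty `tracked` list makes every existing key look unchanged, so no history is built. That hints at why an SCD II job can be reconfigured toward SCD I-style behaviour more easily than the reverse.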
Extraction, Transformation and Loading Tips
Naming conventions: use prefixes, because objects are sorted alphabetically.
ETL
Remote and local table joins
For dimensions, add audit columns.
Use surrogate keys, and add the columns that were used to derive the surrogate keys to your fact table.
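The surrogate key tip can be illustrated with a short Python sketch: each fact row receives a surrogate key, and the natural key columns used to derive it stay on the row. The helper `add_surrogate_key`, the `key_map` lookup and the column names are hypothetical; DI Studio provides its own key generation transformations.

```python
def add_surrogate_key(fact_rows, key_map, natural_cols, sk_name):
    """Attach a surrogate key to each fact row, keeping the natural key columns."""
    out = []
    for row in fact_rows:
        natural = tuple(row[c] for c in natural_cols)
        if natural not in key_map:
            # Assumption: surrogate keys are simple ascending integers.
            key_map[natural] = max(key_map.values(), default=0) + 1
        out.append({**row, sk_name: key_map[natural]})
    return out

# Hypothetical lookup from natural key (source system, product code) to surrogate key.
key_map = {("CORE", "P100"): 1}
facts = [
    {"source": "CORE", "product_code": "P100", "balance": 250.0},
    {"source": "CARDS", "product_code": "C7", "balance": 90.0},
]
keyed = add_surrogate_key(facts, key_map, ["source", "product_code"], "product_sk")
```

Keeping `source` and `product_code` on the fact row means a wrong key assignment can be traced and repaired without going back to the source system.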
Views
Relational views are great tools to abstract tables and logic from the ETL tool.
Organize Metadata
Switch off the automation settings within your jobs
Create the template
- Develop templates.
- Copy and paste jobs.
- Standardise where in the jobs expressions will be used.
- Build one and test before replicating the template.
Work Smart Not Hard
Summary
1. Develop the blueprint.
2. Set up your libraries to reduce I/O and network traffic.
3. Develop your naming standards.
4. Identify how to track changes and load changes.
5. Set up your metadata structure.
6. Experiment with Pre and Post mapping code to get your stubs correct.
7. Decide on main transformations like SCD I or SCD II.
8. Develop and test templates.