
SAS® Data Integration Studio: The 30-Day Plan

Paper 078-2013

John Heaton

Senior DW Analyst Heritage Bank Australia

[email protected]

Why DI Studio

Volume of data.

Automation of data extracts.

80/20 – remove the need for the business to spend 80% of their time on data preparation.

Data Governance – build a standard data model with standard definitions across different functional areas within the business.

Optimisation – reduce the redundancy of multiple people extracting the same information.

Security – secure, audit and share information from a controlled asset.

No need for an experienced SAS programmer.

Outcome of Planning

Size of Project

5 Source Systems

436 Tables

135 Views

22130 Columns

175 Jobs

Total Estimated Size > 15 TB

Daily Volumes

» Storage: ±50 GB (4x improvement)

» IOPS: ±300 GB (3x improvement)

Resources

1.5 Full Time Equivalents

12 months

Design to Production

Subject Areas

> 5 Subject Areas

> 20 Foundation Star Schemas

> 6 Data Access Star Schemas

> 20 Complex Cross-functional Reports

Design the Data Layer

Staging – location for data before any transformation.

Foundation Layer – data and relationships that represent business information.

Access and Performance – data and relationships optimized for information delivery, reporting, and analysis.

Libraries and Locations

Actual – Migrate Staging Area

• Reduce IOPS

• Reduce network traffic

• Perform transformations in-machine rather than across the network

Proposed Architecture

Everything in the relational database (sketched below):

• Staging

• Foundation

• Access and Performance
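As a sketch of what this looks like in practice, each layer becomes a SAS library pointing at a schema in the same database. The engine, connection details, and schema names below are assumptions for illustration, not taken from the paper:

/* Illustrative only: one relational database, three schemas, three librefs. */
/* Engine (Oracle), credentials, and schema names are assumed for the sketch. */
libname stg oracle path=dwprod schema=STAGING     user=etl_user password="xxxxxxxx";
libname fnd oracle path=dwprod schema=FOUNDATION  user=etl_user password="xxxxxxxx";
libname acc oracle path=dwprod schema=ACCESS_PERF user=etl_user password="xxxxxxxx";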

Libraries in Memory

/*
 * sasv9_usermods.cfg
 *
 * This config file extends options set in sasv9.cfg. Place your site-specific
 * options in this file. Any options included in this file are common across
 * all server components in this application server.
 *
 * Do NOT modify the sasv9.cfg file.
 *
 */
-MEMMAXSZ 16G

• Memory is faster than disk

• Bypasses SAS WORK, which is on disk.

• Can be configured for each transform

• Automatic clean up once job is finished

• Ensure sufficient memory is available for all concurrent jobs (see the sketch below).
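A minimal sketch of a memory-based library, assuming a Windows SAS server where the MEMLIB LIBNAME option is available; the libref, path, and tables are illustrative:

/* Sketch only: assign a memory-based library for intermediate tables so     */
/* transformations read and write memory instead of SAS WORK on disk.        */
/* MEMMAXSZ (set above) must cover all concurrent jobs.                      */
libname stg_mem "C:\sasdata\stage" memlib;

data stg_mem.customer_stage;      /* intermediate table held in memory       */
   set stg.customer;              /* STG is a hypothetical staging library   */
run;

/* Memory is released when the library is cleared or the session ends.       */
libname stg_mem clear;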

Data Staging – Logical Dates

• The logical date represents the period covered by the data.

• Enables you to capture and store multiple data sets from the source system.

• Reload data without re-extracting information.

• Decouple your source extraction (see the figure and sketch below).

[Figure: the Product table extracted on Logical Date 1, Logical Date 2, and Logical Date 3 – one stored snapshot per logical date]
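A minimal sketch of stamping an extract with its logical date, assuming a hypothetical PRODUCT source table, SRC and STG libraries, and a LOGICAL_DATE value supplied by the scheduler:

%let logical_date = '01FEB2013'd;        /* normally supplied by the scheduler */

proc sql;
   create table stg.product_snapshot as  /* one snapshot per logical date      */
   select p.*,
          &logical_date as logical_date format=date9.
   from src.product as p;
quit;

/* Downstream jobs select WHERE logical_date = &logical_date, so a period can */
/* be reloaded without going back to the source system.                       */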

Incremental or Full Loads

[Figure: a full load re-extracts the whole Product table on each logical date; an incremental load extracts only the Product changes for Logical Date 2, using a WHERE clause on a column > the last high-water mark]

Reduce the number of records = reduce I/O = increase speed exponentially.
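A sketch of the high-water-mark pattern, assuming a hypothetical control table CTRL.HWM and a LAST_MODIFIED column on the source table:

proc sql noprint;
   /* Pick up the high-water mark recorded by the previous run.               */
   select put(max(last_modified), datetime20.)
      into :hwm trimmed
      from ctrl.hwm
      where table_name = 'PRODUCT';

   /* Extract only the rows changed since then: fewer records = less I/O.     */
   create table stg.product_delta as
   select *
   from src.product
   where last_modified > "&hwm"dt;
quit;

/* After a successful load, update CTRL.HWM with the new maximum.             */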

Slowly Changing Dimension I or II Transformation

A method for tracking changes in data.

SCD I – no history tracking (changes overwrite the current row).

SCD II

» Identify which columns to use to track changes.

» Changes become new rows with a new version number and valid-from/valid-to dates.

» The old row is end-dated.

SCD II can be configured to behave as SCD I, but not vice versa.

Changing from SCD I to SCD II means rewriting the job; with SCD II you only reconfigure it (see the sketch below).
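A rough sketch of the SCD II pattern on a hypothetical CUSTOMER dimension; the SCD Type 2 Loader transformation generates its own, more complete code, including surrogate keys and version numbers, which are omitted here:

proc sql;
   /* End-date the current row for any customer present in the change set.    */
   update dim.customer as d
      set valid_to = today() - 1
      where d.valid_to = '31DEC9999'd
        and exists (select 1
                    from stg.customer_delta as s
                    where s.customer_id = d.customer_id);

   /* Insert the changed rows as new versions, valid from today.              */
   insert into dim.customer (customer_id, customer_name, segment,
                             valid_from, valid_to)
   select s.customer_id, s.customer_name, s.segment,
          today(), '31DEC9999'd
   from stg.customer_delta as s;
quit;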

Data Mapping

Build the blueprint – understand the journey

Extraction, Transformation and Loading Tips

Naming conventions – use prefixes, because objects are sorted alphabetically.

ETL

Remote and local table joins

For dimensions, add audit columns.

Use surrogate keys, and add the columns that were used to derive the surrogate keys onto your fact table.

Views

Relational views are great tools to abstract tables and logic from the ETL tool.
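For example, a simple view (sketched here with PROC SQL; the same idea applies equally to a native database view) can hide join logic and renaming from the DI Studio jobs. The tables and columns are hypothetical:

proc sql;
   /* The ETL job registers and reads only V_PRODUCT_CURRENT; the join and    */
   /* renaming stay outside the DI Studio job.                                 */
   create view src.v_product_current as
   select p.product_id,
          p.product_name,
          c.category_name as product_category
   from src.product as p
        left join src.product_category as c
          on p.category_id = c.category_id;
quit;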

Organize Metadata

Switch off the automation settings within your jobs

Create the template

- Develop templates.

- Copy and paste jobs from the template.

- Standardise where expressions will be used within the jobs.

- Build and test one job before replicating the template.

Work Smart Not Hard

Summary

1. Develop the blueprint.

2. Setup your libraries to reduce I/O and network traffic.

3. Develop your naming standards.

4. Identify how to track and load changes.

5. Setup your metadata structure.

6. Experiment with pre- and post-mapping code to get your stubs correct.

7. Decide on main transformations such as SCD I or SCD II.

8. Develop and test templates.

John Heaton

Senior DW Analyst Heritage Bank Australia

[email protected]