Corporate Data Vault
Data Warehousing Workshop, Sept. 23 2015
Background to CDV Project
• Feb 2012 – Review of Corporate Data Model published
• Apr 2012 – Technical group set up
• Dec 2012 – Proposal for CDV sent to SMC
The Proposal
Option 1: File Store
• Data stored in the same format as lodged by the data custodian;
• Data retrieved only through the front-end application and copied to local work space.
Option 2: File Store with Direct Access
• Data stored in the same format as lodged by the data custodian;
• Data can be accessed directly by third party products (e.g. SAS).
Option 3: Database tables
• Data converted and stored in database tables with similar structure to source;
• Database tables can be accessed directly by third party products (e.g. SAS).
Option 4: Data warehouse
• Data converted and stored in standardised relational database tables;
• Database tables can be accessed directly by third party products (e.g. SAS).
Pros and Cons
Option 1: File Store
Pros:
• Simplest concept
• Lowest development effort
Cons:
• No direct access with 3rd party products
• Possible proliferation of copies of files in local work areas
• Long term usability of data more difficult to manage
Option 2: File Store with Direct Access
Pros:
• Simple concept
• Provides direct access to data
Cons:
• Security more difficult to manage than for database options
• Long term usability of data more difficult to manage
Option 3: Database tables
Pros:
• Provides direct access to data
• Data stored in single platform
• Easier to manage long term usability issues
Cons:
• Data transformed from original format – transformed data may need validation
Option 4: Data warehouse
Pros:
• Provides direct access to data
• Standardized data in relational databases
• Enables easier linkages between data
• Opportunities to build other applications on the warehouse
Cons:
• Data transformed from original format – transformed data may need validation
• Difficult to design and build
• Business effort high as data standardization required
Project Stage 1: Two Prototypes
• In early 2013, the SMC requested that working prototypes of both Options 2 and 3 be developed
• Prototypes were designed, built & tested between June and Oct 2013
• A recommendation on the optimal solution was submitted to the SMC in Nov 2013.
Design, Build and Assessment
Focus of system development
• In scope: Produce a working system
• Out of scope: Final screen designs
Functions of the system
• In scope: (1) Lodging data & metadata (2) Storing data & metadata (3) Viewing of catalogue
• Out of scope: (1) Security (2) Reports
Testing of system
• In scope: Testing to focus primarily on the “happy path”. Only major bugs and issues to be addressed.
• Out of scope: Robust testing of the system
File Types
• In scope: SAS files only, as (1) High risk (2) Benefit of variable metadata available within the file (3) Structured nature provided suitable test for both prototypes
• Out of scope: All other file types
Issues with Database Prototype
• Issue: Unable to distinguish between a date and a date/time variable in a SAS dataset
  Impact: SAS dataset is rejected because the date/time column is created as a date column, and a date/time variable cannot be loaded into a date column.
• Issue: Maximum length of a character variable is 16384
  Impact: Character variables longer than 16384 will be truncated.
• Issue: Maximum number of columns currently allowed is 254
  Impact: SAS dataset is rejected if the number of variables exceeds 254.
• Issue: There are 995 different formats available in SAS
  Impact: Data integrity may be compromised, or the dataset may be rejected, if an unknown format is encountered. Each format would need to be coded for individually in the conversion program.
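The two numeric limits above lend themselves to a simple pre-load check. The following is a minimal sketch only, not the prototype's actual code: the `Column` class, the `check_dataset` function and the "char"/"num" labels are hypothetical names introduced for illustration, and only the values 254 and 16384 come from the slide.

```python
from dataclasses import dataclass

# Limits quoted on the slide; everything else in this sketch is hypothetical.
MAX_COLUMNS = 254
MAX_CHAR_LENGTH = 16384


@dataclass
class Column:
    name: str
    kind: str        # illustrative labels: "char" or "num"
    length: int = 8


def check_dataset(columns):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Too many variables: the slide says the whole dataset is rejected.
    if len(columns) > MAX_COLUMNS:
        problems.append(
            f"rejected: {len(columns)} columns exceeds limit of {MAX_COLUMNS}"
        )
    # Over-long character variables: the slide says these are truncated.
    for col in columns:
        if col.kind == "char" and col.length > MAX_CHAR_LENGTH:
            problems.append(
                f"truncated: {col.name} length {col.length} "
                f"exceeds limit of {MAX_CHAR_LENGTH}"
            )
    return problems


if __name__ == "__main__":
    sample = [Column("id", "num"), Column("notes", "char", 20000)]
    for problem in check_dataset(sample):
        print(problem)
```

A check of this kind would only flag the column-count and length issues; the date versus date/time issue and the unknown-format issue are not covered by this sketch.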
Project Stage 2: CDV v1 Build & Design
• The second stage of this project involved the further design, build and testing of the file store solution.
• It also included information sessions to users and the initial “Go Live” of the CDV.
• This second project ran from Jan 2014 until Dec 2014.
Project Stage 3: CDV v1 Implementation
• The third stage of this project has been ongoing since Jan 2015
• Roll-out of the system across the office
• Requirements gathering and specifications for CDV v2.
About the CDV
• Independent of production processes
• Data stored in the same format as lodged
• Access data through a third party product
• CDV v1 accepts SAS datasets only
Technical Specs
• Three tier application
• Client tier: Java
• Business Logic tier: WebLogic
• Data Tier: Sybase database.
Functionality
• Lodge Data and Metadata
• Browse/Search the Catalogue
• Reports
• Security
Lodge Data and Metadata: Step 1
Lodge Data and Metadata: Step 2
Variable Details Screen
Link Classification from CARS
Metadata Stored
File Level
• Survey Name
• Periodicity
• Time Period
• Version No.
• Linked Themes
• Micro/Macro Data
• Reference Documentation
• Description
• Reason for Version
• Date Lodged
• Lodged By
Variable Level
• Name
• Description
• Primary Key
• Unit Type
• Length
• Data Type
• Linked Classification Details
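The file-level and variable-level metadata listed above could be modelled as two linked record types. This is an illustrative sketch only: the class names, field types and the `primary_keys` helper are assumptions, not the CDV's actual schema; only the field names are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VariableMetadata:
    # Field names follow the "Variable Level" list; types are assumed.
    name: str
    description: str
    primary_key: bool
    unit_type: str
    length: int
    data_type: str
    linked_classification: Optional[str] = None


@dataclass
class FileMetadata:
    # Field names follow the "File Level" list; types are assumed.
    survey_name: str
    periodicity: str
    time_period: str
    version_no: int
    micro_or_macro: str
    description: str
    reason_for_version: str
    date_lodged: str
    lodged_by: str
    linked_themes: List[str] = field(default_factory=list)
    reference_documentation: List[str] = field(default_factory=list)
    variables: List[VariableMetadata] = field(default_factory=list)

    def primary_keys(self) -> List[str]:
        """Names of the variables flagged as primary key (hypothetical helper)."""
        return [v.name for v in self.variables if v.primary_key]
```

Keeping the variable records as a list on the file record mirrors the two-level layout of the slide: one lodged file, many described variables.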
Lodgement Summary
Access To Data
The End
Any Questions?