![Page 1: Data Staging Data Loading and Cleaning Marakas pg. 25 BCIS 4660 Spring 2012](https://reader035.vdocuments.site/reader035/viewer/2022081603/56649f255503460f94c3c17a/html5/thumbnails/1.jpg)
Data Staging Data Loading and Cleaning
Marakas pg. 25
BCIS 4660 Spring 2012
Basic Processes
• Building the data warehouse involves extracting, transforming, and loading (ETL) data from source systems to the target databases.
• ETL requires the identification, selection, and transformation mapping of source data to target data.
Data Loading
• The source-to-target mapping includes the specification of a process model that covers the many tough issues of data acquisition.
• Detection of source data changes, data extraction techniques, timing of data extracts, data transformation techniques, frequency of database loads, and levels of data summary are among the difficult data acquisition challenges.
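One of the challenges listed above, detection of source data changes, can be sketched by comparing two snapshots of a source table. This is a minimal illustration, not a method prescribed by the text; the function names, the snapshot layout (dictionaries keyed by primary key), and the sample rows are all assumptions.

```python
import hashlib


def row_fingerprint(row: dict) -> str:
    """Hash a row's values in a stable column order."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(previous: dict, current: dict) -> dict:
    """Compare two snapshots keyed by primary key.

    Returns the keys that were inserted, updated, or deleted
    between the previous extract and the current one.
    """
    prev_fp = {key: row_fingerprint(row) for key, row in previous.items()}
    curr_fp = {key: row_fingerprint(row) for key, row in current.items()}
    return {
        "inserted": sorted(k for k in curr_fp if k not in prev_fp),
        "updated": sorted(
            k for k in curr_fp if k in prev_fp and curr_fp[k] != prev_fp[k]
        ),
        "deleted": sorted(k for k in prev_fp if k not in curr_fp),
    }


# Hypothetical snapshots of a product table from two extract runs.
previous_snapshot = {
    101: {"product_id": 101, "desc": "Widget", "price": 9.99},
    102: {"product_id": 102, "desc": "Gadget", "price": 19.99},
}
current_snapshot = {
    101: {"product_id": 101, "desc": "Widget", "price": 10.49},
    103: {"product_id": 103, "desc": "Gizmo", "price": 4.50},
}

changes = detect_changes(previous_snapshot, current_snapshot)
print(changes)
```

Snapshot comparison is only one option; source systems that keep update timestamps or change logs may allow cheaper detection.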
Processing Steps
• Extract, Transform, Load (ETL)
– Extracting
– Data transformation
– Loading the data
• Data cleanup
• Index creation
– Performance requirement
• Aggregation creation and maintenance
• Backup
• Data archiving
• Data mart refresh
Sample Dimensional Schema
• Sales Date Dim: Sales date key, Sales date, Sales date month, Sales date year
• Sales Summary Fact: Sales date key, Sales dept key, Cat mgr key, Product key, Qty, Dollars, Cost, Net
• Category Manager Dim: Cat mgr key, Category mgr name, Distribution center name
• Store Dept Dim: Store dept key, Store, Store size, Store mgr, Dept, Dept size, Dept mgr, District, Region
• Product Dim: Product key, Product id, Product desc, Product sub-category, Product category
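The sample schema can be written out as table definitions. The sketch below uses SQLite from Python purely for illustration. The snake_case column names are renderings of the attributes on the slide, the data types are assumptions, and linking the fact table's "Sales dept key" to the Store Dept dimension is also an assumption.

```python
import sqlite3

# Star schema: one fact table surrounded by four dimension tables.
SCHEMA = """
CREATE TABLE sales_date_dim (
    sales_date_key    INTEGER PRIMARY KEY,
    sales_date        TEXT,
    sales_date_month  INTEGER,
    sales_date_year   INTEGER
);
CREATE TABLE category_manager_dim (
    cat_mgr_key               INTEGER PRIMARY KEY,
    category_mgr_name         TEXT,
    distribution_center_name  TEXT
);
CREATE TABLE store_dept_dim (
    store_dept_key INTEGER PRIMARY KEY,
    store TEXT, store_size TEXT, store_mgr TEXT,
    dept TEXT, dept_size TEXT, dept_mgr TEXT,
    district TEXT, region TEXT
);
CREATE TABLE product_dim (
    product_key           INTEGER PRIMARY KEY,
    product_id            TEXT,
    product_desc          TEXT,
    product_sub_category  TEXT,
    product_category      TEXT
);
CREATE TABLE sales_summary_fact (
    sales_date_key INTEGER REFERENCES sales_date_dim(sales_date_key),
    sales_dept_key INTEGER REFERENCES store_dept_dim(store_dept_key),  -- assumed link
    cat_mgr_key    INTEGER REFERENCES category_manager_dim(cat_mgr_key),
    product_key    INTEGER REFERENCES product_dim(product_key),
    qty INTEGER, dollars REAL, cost REAL, net REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```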
Extracting
• Reading and understanding the source data and copying the parts that are needed to the data staging layer for further work.
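As a rough sketch of extraction, the code below reads a source file and copies only the needed columns into a staging structure. The CSV layout, the column names, and the `extract_to_staging` helper are hypothetical; a real extract would read from the operational system rather than an in-memory string.

```python
import csv
import io

# Hypothetical source extract; internal_note is not needed downstream.
SOURCE_CSV = """order_id,product_id,qty,dollars,internal_note
1,101,2,19.98,rush
2,102,1,19.99,
"""

NEEDED_COLUMNS = ["order_id", "product_id", "qty", "dollars"]


def extract_to_staging(source_text: str, needed: list) -> list:
    """Read the source and keep only the columns the warehouse needs."""
    reader = csv.DictReader(io.StringIO(source_text))
    return [{column: row[column] for column in needed} for row in reader]


staging_rows = extract_to_staging(SOURCE_CSV, NEEDED_COLUMNS)
for row in staging_rows:
    print(row)
```

Values stay as strings at this stage; type conversion is left to the transformation step.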
Transforming
• Cleansing the data by correcting misspellings and resolving domain conflicts (city vs. zip)
• Purging fields that are not useful
• Combining data sources by matching exactly on key values or attributes
• Creating surrogate keys for dimensions
• Building aggregates (totals) for boosting performance of common queries
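Three of the transformations above (cleansing, surrogate key creation, and aggregate building) can be illustrated in a few lines. This is a simplified sketch under stated assumptions: the correction table, the row layout, and all function names are hypothetical, and production cleansing is usually far more involved.

```python
from collections import defaultdict

# Hypothetical lookup of known misspellings and their corrections.
CITY_CORRECTIONS = {"Dalas": "Dallas", "Huston": "Houston"}


def cleanse(rows: list) -> list:
    """Correct known misspellings in the city field."""
    return [
        {**row, "city": CITY_CORRECTIONS.get(row["city"], row["city"])}
        for row in rows
    ]


def assign_surrogate_keys(rows: list, natural_key: str) -> dict:
    """Map each distinct natural key to a sequential integer surrogate key."""
    keys = {}
    for row in rows:
        keys.setdefault(row[natural_key], len(keys) + 1)
    return keys


def aggregate(rows: list, group_by: str, measure: str) -> dict:
    """Pre-compute totals of a measure for each group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_by]] += row[measure]
    return dict(totals)


raw_rows = [
    {"store_id": "S-7", "city": "Dalas", "dollars": 100.0},
    {"store_id": "S-9", "city": "Houston", "dollars": 50.0},
    {"store_id": "S-7", "city": "Dallas", "dollars": 25.0},
]

clean_rows = cleanse(raw_rows)
store_keys = assign_surrogate_keys(clean_rows, "store_id")
city_totals = aggregate(clean_rows, "city", "dollars")
print(store_keys)
print(city_totals)
```

Note that the aggregate is only correct because cleansing ran first; without it, "Dalas" and "Dallas" would be totaled separately.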
Loading and Indexing
• Replicating the dimension tables and fact tables
• Bulk loading of each recipient data mart
• Bulk loading is an important capability, in contrast to record-at-a-time loading
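The contrast between the two loading styles can be sketched with SQLite. Here `executemany` inside a single transaction stands in for a database's bulk loader, which is an assumption made for illustration; real warehouses typically provide a dedicated bulk load utility. The table and function names are hypothetical.

```python
import sqlite3

INSERT_SQL = "INSERT INTO sales_fact VALUES (?, ?, ?)"


def make_connection() -> sqlite3.Connection:
    """Create an in-memory database with an empty fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_fact (product_key INTEGER, qty INTEGER, dollars REAL)"
    )
    return conn


def load_one_at_a_time(conn: sqlite3.Connection, rows: list) -> None:
    """Record-at-a-time loading: one statement and one commit per row."""
    for row in rows:
        conn.execute(INSERT_SQL, row)
        conn.commit()


def load_in_bulk(conn: sqlite3.Connection, rows: list) -> None:
    """Batch loading: all rows in one call and one transaction."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(INSERT_SQL, rows)


rows = [(i % 10, 1, 9.99) for i in range(1000)]

slow_conn = make_connection()
load_one_at_a_time(slow_conn, rows)

bulk_conn = make_connection()
load_in_bulk(bulk_conn, rows)

slow_count = slow_conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
bulk_count = bulk_conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(slow_count, bulk_count)
```

Both paths load the same rows; the difference is the per-row overhead, which tends to matter as volumes grow.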
Quality Assurance Checking
• Run comprehensive exception reports over newly loaded data
• All counts and totals must be satisfactory [data audit]
• Reported values must be consistent with similar values that preceded them before loading new data
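A data audit of counts and totals might look like the sketch below, which compares what was extracted with what was loaded and produces an exception report. The `audit_load` function, its tolerance parameter, and the sample rows are assumptions for illustration only.

```python
def audit_load(source_rows: list, loaded_rows: list, measure: str,
               tolerance: float = 0.01) -> list:
    """Compare row counts and measure totals; return a list of exceptions."""
    exceptions = []

    if len(source_rows) != len(loaded_rows):
        exceptions.append(
            f"row count mismatch: source={len(source_rows)} "
            f"loaded={len(loaded_rows)}"
        )

    source_total = sum(row[measure] for row in source_rows)
    loaded_total = sum(row[measure] for row in loaded_rows)
    if abs(source_total - loaded_total) > tolerance:
        exceptions.append(
            f"{measure} total mismatch: source={source_total} "
            f"loaded={loaded_total}"
        )

    return exceptions


source = [{"dollars": 100.0}, {"dollars": 50.0}]

# A complete load produces an empty exception report.
clean_report = audit_load(source, list(source), "dollars")

# A load that dropped a row fails both the count and the total check.
bad_report = audit_load(source, source[:1], "dollars")

print(clean_report)
print(bad_report)
```

The same pattern could be extended to compare new totals against prior periods, which is the consistency check the last bullet describes.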
Release (e.g., Version 3.1)
Publishing
• User community notification
• Communicates the nature of any changes in dimensions or facts
• Updates to meta data
Updating
• Incorrect data must be corrected.
• Changes to the meta data, etc., must be made.
Querying
• The end goal is to allow access by all authorized users
• Takes place on the data warehouse presentation server
Important Concepts
• The requirements for placing extract, transform, and load (ETL) processes into a stable production environment.
• The technical requirements for these processes including support considerations with purchased ETL software.
• The challenges of supporting the data warehouse with custom code.
The Analyst Must
• Identify, assess, select, and map source data to target data stores
• Identify and specify kinds of data transformations (keys, totals, omits, etc.)
• Manage ETL schedules, including frequency of extract and latency of load
• Understand the role of meta data (data about data)
• Identify the classes of technology useful in warehouse data acquisition
Who Else Needs to Know this Information?
• IT designers, developers, and data administrators new to DW
• Business and technical data warehouse team members
• Technical business users interested in building sound decision support systems
SUMMARY: The Processes
• Plan the process
• Identify the tools to be used
• Clean the data
• Backup data and processes
• … the data
• Populate Dimension tables
Source data
• Enterprise data
• B2B data
• Web harvesting – the ultimate data store
See The Data Webhouse Toolkit by Kimball
Identifying data sources
• Source data assessment and qualification
• Understanding and modeling source data
• Triage of source data
Source-to-target movement
• Source-to-target mapping
• Data transformations
• Timing considerations
• Levels of detail
• Processes and flows
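A source-to-target mapping can be kept as data rather than buried in code, so that each target column names its source column and the transformation applied. The sketch below assumes hypothetical source column names (`PROD_NO`, `PROD_DESCR`, `SALE_AMT`) and a hypothetical `apply_mapping` helper.

```python
# Each target column maps to (source column, transformation function).
# Source columns absent from the mapping are omitted from the target.
MAPPING = {
    "product_id": ("PROD_NO", str.strip),
    "product_desc": ("PROD_DESCR", lambda value: value.strip().title()),
    "dollars": ("SALE_AMT", float),
}


def apply_mapping(source_row: dict, mapping: dict) -> dict:
    """Build one target row from one source row using the mapping."""
    return {
        target: transform(source_row[source])
        for target, (source, transform) in mapping.items()
    }


source_row = {
    "PROD_NO": " 101 ",
    "PROD_DESCR": "blue widget ",
    "SALE_AMT": "19.98",
    "LEGACY_FLAG": "Y",  # not mapped, so it is dropped
}

target_row = apply_mapping(source_row, MAPPING)
print(target_row)
```

Keeping the mapping in one table also gives a natural place to document it as meta data.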
Meta data considerations
• Data structure layouts and data element documentation
• Required meta data
• Support of meta data propagation
Requirements for stable production processing
• Scheduling
• Logging
• Recovery
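Logging and recovery can be sketched as a wrapper that records each attempt at an ETL step and retries on failure. This is a minimal illustration: the `run_with_recovery` helper, the retry count, and the simulated failure are assumptions, and scheduling itself would normally be handled by an external job scheduler.

```python
import logging

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s"
)
log = logging.getLogger("etl")


def run_with_recovery(step_name: str, step, max_attempts: int = 3):
    """Run one ETL step, logging every attempt and retrying on failure."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = step()
            log.info("%s succeeded on attempt %d", step_name, attempt)
            return result
        except Exception as exc:
            last_error = exc
            log.warning("%s failed on attempt %d: %s", step_name, attempt, exc)
    raise RuntimeError(
        f"{step_name} failed after {max_attempts} attempts"
    ) from last_error


# Hypothetical extract step that fails once before succeeding.
attempts = {"count": 0}


def flaky_extract() -> list:
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise ConnectionError("source system unavailable")
    return ["row-1", "row-2"]


rows = run_with_recovery("extract", flaky_extract)
print(rows)
```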
Extract, Transform, and Load technology
• Extraction
• Buy versus build
• Matching needs to technology
Software
• XML (eXtensible Markup Language)
– Used to move data among applications
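To illustrate XML as an interchange format, the sketch below parses a small sales feed into rows ready for staging, using Python's standard `xml.etree.ElementTree` module. The feed layout, element names, and the `parse_sales` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed sent by another application.
XML_FEED = """<sales>
  <sale product_id="101"><qty>2</qty><dollars>19.98</dollars></sale>
  <sale product_id="102"><qty>1</qty><dollars>19.99</dollars></sale>
</sales>"""


def parse_sales(xml_text: str) -> list:
    """Convert each <sale> element into a typed dictionary."""
    root = ET.fromstring(xml_text)
    return [
        {
            "product_id": sale.get("product_id"),
            "qty": int(sale.findtext("qty")),
            "dollars": float(sale.findtext("dollars")),
        }
        for sale in root.findall("sale")
    ]


records = parse_sales(XML_FEED)
for record in records:
    print(record)
```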
ETL activities