Data Warehousing for Gnocode
![Page 2: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/2.jpg)
How did I get here?
![Page 3: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/3.jpg)
Typical Business "Design"
![Page 4: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/4.jpg)
Typical Goal Scenario
![Page 5: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/5.jpg)
What happened?
![Page 6: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/6.jpg)
What success still looks like - version 1
![Page 7: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/7.jpg)
What success still looks like - version 2
![Page 8: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/8.jpg)
What success should look like
![Page 9: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/9.jpg)
Dimensional Modeling
![Page 10: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/10.jpg)
Normal Form
![Page 11: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/11.jpg)
The intuitive resolution of contemporary design problems simply lies beyond the reach of a single individual’s integrative grasp…
…there are bounds to man’s cognitive and creative capacity…
…the very frequent failure of individual designers to produce well organized forms suggests strongly that there are limits to the individual designer’s capacity.
Christopher Alexander – Notes on the Synthesis of Form, Introduction: The Need for Rationality
![Page 12: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/12.jpg)
Facts and Dimensions
![Page 13: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/13.jpg)
| Table | Column Name | Condensed Type |
| --- | --- | --- |
| Attendee (NormalForm) | ID | int |
| | AttendeeName | varchar(50) |
| Calendar (NormalForm) | ID | int |
| | ScheduleDate | datetime |
| Meal (NormalForm) | ID | int |
| | MealDesc | varchar(50) |
| Meeting (NormalForm) | ID | int |
| | MeetingName | varchar(50) |
| | VenueID | int |
| | MealID | int |
| MeetingAttendee (NormalForm) | MeetingID | int |
| | AttendeeID | int |
| Venue (NormalForm) | ID | int |
| | Venue | varchar(50) |
| CalendarMeeting (NormalForm) | CalendarID | int |
| | MeetingID | int |
| Invoice (NormalForm) | MeetingID | int |
| | AttendeeID | int |
| | InvoiceNumber | int |
| | ProductID | int |
| | UnitPrice | money |
| Product (NormalForm) | ID | int |
| | ProductDescription | varchar(50) |
| | UnitPrice | money |
| Class (NormalForm) | ID | int |
| | ClassDescription | varchar(50) |
| ClassAttendee (NormalForm) | SectionID | int |
| | AttendeeID | int |
| | SignInDate... | datetime |
| Section (NormalForm) | ID | int |
| | ClassID | int |
| | MeetingID | int |
| | Slot | datetime |
| | RoomID | int |
| | InstructorID | int |
| Room (NormalForm) | ID | int |
| | RoomDesc | varchar(50) |
| | VenueID | int |
| | ClassroomCapacity | int |
| Instructor (NormalForm) | ID | int |
| | InstructorName | varchar(50) |
![Page 14: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/14.jpg)
Best Practices
There are tons of top ten lists of tips and keys to success in articles and books. I will give you my top two.
Incremental Delivery – Show successes early, win people over, prove concepts and approach
Proactively Manage Quality - Test thoroughly and automate – Testing is usually considered important, but people don’t approach it systematically. Round-trip the data, know the dimensional behavior with benchmarking, automate exception reporting and make sure false positives don’t make your warning system too noisy. Get confidence by showing the tests are working. Add tests as defects are found, documenting expectations.
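One way to read "round-trip the data" and "automate exception reporting" is a reconciliation check that compares control totals between a source extract and what was loaded. The sketch below is only an illustration in Python: the `reconcile` function, the `account` and `amount` field names, and the sample rows are all hypothetical and do not come from the presentation.

```python
from decimal import Decimal

def reconcile(source_rows, warehouse_rows, key="account", measure="amount"):
    """Compare per-key control totals and return a list of exception messages."""
    def totals(rows):
        out = {}
        for row in rows:
            out[row[key]] = out.get(row[key], Decimal("0")) + Decimal(str(row[measure]))
        return out

    src, dw = totals(source_rows), totals(warehouse_rows)
    exceptions = []
    for k in sorted(set(src) | set(dw)):
        if k not in dw:
            exceptions.append(f"{k}: missing from warehouse")
        elif k not in src:
            exceptions.append(f"{k}: not present in source")
        elif src[k] != dw[k]:
            exceptions.append(f"{k}: source total {src[k]} != warehouse total {dw[k]}")
    return exceptions

# Hypothetical sample data: account B2 lost 0.25 somewhere in the load.
source = [{"account": "A1", "amount": "10.00"}, {"account": "A1", "amount": "5.50"},
          {"account": "B2", "amount": "7.25"}]
warehouse = [{"account": "A1", "amount": "15.50"}, {"account": "B2", "amount": "7.00"}]

exceptions = reconcile(source, warehouse)
for line in exceptions:
    print(line)
```

An empty list means the totals agree, so a scheduled job can stay silent unless something is actually wrong, which keeps the warning system from getting noisy. Each defect found later can become another check of this kind.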
![Page 15: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/15.jpg)
Worst Practices
Again, there are plenty of online tips – every one of the best practices has a corresponding anti-practice, but these are my top two.
Avoid understanding the data, the business motivations, or the details because there are far too many feeds of data coming into the warehouse. Avoid looking ahead to how the data will be used because you shouldn’t change the ETL process to accommodate expectations or provide services.
Handle every model the same way, so the data warehouse is consistent, even if some models are awkward and difficult for users to use and difficult to change over time as the business evolves.
![Page 16: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/16.jpg)
Links
Email: [email protected]: @caderoux
Bookmarks: http://delicious.com/caderoux1/gnocode-dw
Rate this presentation: http://www.speakerrate.com/caderoux
My Resume: http://careers.stackoverflow.com/caderoux
![Page 17: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/17.jpg)
Q&A
![Page 18: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/18.jpg)
Glossary
Data warehouse
Bill Inmon - "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process" - typically associated with top-down design
Ralph Kimball - "A copy of transaction data specifically structured for query and analysis." - Typically associated with bottom-up design
![Page 19: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/19.jpg)
Glossary (2)
ETL
Extract...Transform...Load
Shorthand for any number of ways of getting the data into the warehouse.
Sometimes it's really transform...extract...load, sometimes it's extract...load...transform...load.
Key things are to have a strategy and principles for when data is changed/cleaned/conformed/exceptions reported.
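As a rough illustration of having a fixed point where data is cleaned and exceptions are reported, here is a minimal extract/transform/load sketch in Python. Everything in it is an assumption made for the example: the CSV layout, the `stage_fees` table, and the rule that rows with a non-numeric amount are rejected rather than loaded. SQLite stands in for whatever warehouse platform is actually in use.

```python
import csv
import io
import sqlite3

# Hypothetical source feed; the second row has a bad amount.
RAW = """account,amount,active
00123,10.50,Y
456,abc,N
789,3.25,n
"""

def extract(text):
    """Read the feed into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Convert types and standardize the Y/N flag; collect rows that fail."""
    clean, rejects = [], []
    for row in rows:
        try:
            clean.append((row["account"].strip(),
                          float(row["amount"]),
                          1 if row["active"].strip().upper() == "Y" else 0))
        except ValueError:
            rejects.append(row)  # reported as exceptions, not silently dropped
    return clean, rejects

def load(conn, rows):
    """Insert the cleaned rows into a staging table."""
    conn.execute("CREATE TABLE IF NOT EXISTS stage_fees "
                 "(account TEXT, amount REAL, active INTEGER)")
    conn.executemany("INSERT INTO stage_fees VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
clean, rejects = transform(extract(RAW))
load(conn, clean)
loaded = conn.execute(
    "SELECT account, amount, active FROM stage_fees ORDER BY account").fetchall()
print(loaded)
print("rejected:", rejects)
```

The order of the steps is not the point; the same principle applies to transform-extract-load or extract-load-transform-load, as long as it is clear which step owns cleaning and which step reports rejects.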
![Page 20: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/20.jpg)
Glossary (3)
Facts
Dimensions
Conformed Dimensions
Slowly Changing Dimensions
Granularity
Dimensionally modelled data is mostly associated with Kimball.
Huge advantages in analyzing large amounts of data.
Modelling is problematic, but not nearly as hard as normalizing a non-normalized database.
![Page 21: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/21.jpg)
Glossary (4)
Business Intelligence
Single version of the truth
These terms are relatively meaningless, but they point to the problems being solved:
Get good decision support information to the business - every business is different, and there isn't a silver bullet
Eliminate, as much as possible, the ability for users to generate inconsistent information from the same data
![Page 22: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/22.jpg)
Glossary (5)
Data Mart
Silos
Silos are mini-data warehouses that are specialized to a subject area - typically from a bottom-up approach.
Data Marts are the components of a data warehouse in the top-down design, the building blocks of a data warehouse in a bottom-up design.
Typically, you cannot really do JUST top-down or JUST bottom-up. The reality is always hybrid, because you have to look forward to enterprise-level integration.
![Page 23: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/23.jpg)
Glossary (6)
Operational Data Store
Enterprise data warehouse
ODS is a place where data is combined before load. Sometimes there are services performed off this. Typically, the data model has not changed dramatically from the original operational source systems, but it is (another) copy of the data.
EDW is an Inmon term which means that the data warehouse covers the enterprise in an integrated fashion. It is mainly used to distinguish from a data warehouse which does not cover the entire enterprise.
![Page 24: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/24.jpg)
Glossary (7)
OLTP
OLAP
OnLine Transaction Processing: Typical online systems, may maintain coherent temporal history, may overwrite themselves when data is changed, usually modelled in third normal form or better, Entity-Relationship modeling.
OnLine Analytical Processing: Fast analysis of multi-dimensional data - generally refers to tools running against dimensional data warehouses because the dimensions are explicit - often precalculated "cubes" are created
![Page 25: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/25.jpg)
Dimensional Modelling
Facts:
Usually scalar quantities
Typically can be aggregated: SUM, AVG, etc.
Modelling:
View all data as either facts or dimensions
Determine the nature of the changes in the dimensions
Then divide up dimensions for convenience - based on usage/data patterns
Combination of art and science
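To make the fact/dimension split concrete, here is a small sketch using Python's built-in `sqlite3` module. The `dim_venue`, `dim_product`, and `fact_invoice_line` tables are hypothetical; they are loosely inspired by the Venue, Product, and Invoice tables in the normal-form diagram, not a design shown in the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_venue (venue_key INTEGER PRIMARY KEY, venue TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_description TEXT);
-- Grain: one row per invoice line.
CREATE TABLE fact_invoice_line (
    venue_key INTEGER REFERENCES dim_venue(venue_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    unit_price REAL  -- the fact: a scalar that can be summed or averaged
);
INSERT INTO dim_venue VALUES (1, 'Downtown'), (2, 'Campus');
INSERT INTO dim_product VALUES (1, 'Registration'), (2, 'Lunch');
INSERT INTO fact_invoice_line VALUES
    (1, 1, 100.0), (1, 2, 20.0), (2, 1, 80.0), (2, 2, 20.0);
""")

# Facts are aggregated; dimensions supply the grouping and filtering.
rows = conn.execute("""
    SELECT v.venue, SUM(f.unit_price) AS total, AVG(f.unit_price) AS average
    FROM fact_invoice_line AS f
    JOIN dim_venue AS v ON v.venue_key = f.venue_key
    GROUP BY v.venue
    ORDER BY v.venue
""").fetchall()
print(rows)
```

The same fact table can be grouped by product instead of venue just by joining the other dimension, which is the convenience that the "divide up dimensions" step is aiming for.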
![Page 26: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/26.jpg)
Topics
Conformed dimensions
NULLs
Junk Dimensions
Too Few Dimensions
Too Many Dimensions
Parallel ETL
![Page 27: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/27.jpg)
Conformed Dimensions
Reduces the learning curve
Allows models to be combined
Example: account number padding
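The account number padding example might look something like the following in an ETL step. This is a sketch under assumptions: the function name `conform_account_number`, the width of 10, and the two sample feeds are hypothetical and chosen only to show two sources agreeing once they share one convention.

```python
def conform_account_number(raw, width=10):
    """Return a digits-only account number left-padded with zeros to `width`."""
    digits = str(raw).strip()
    if not digits.isdigit():
        raise ValueError(f"not a numeric account number: {raw!r}")
    if len(digits) > width:
        raise ValueError(f"account number longer than {width} digits: {raw!r}")
    return digits.zfill(width)

# Two hypothetical feeds that store the same accounts differently.
billing = ["123", "0000000456"]
crm = [123, " 456 "]

conformed_billing = {conform_account_number(a) for a in billing}
conformed_crm = {conform_account_number(a) for a in crm}
print(conformed_billing == conformed_crm)
```

Once both feeds load the same conformed key, models built on either one can be joined without per-query cleanup.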
![Page 28: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/28.jpg)
Some things to keep in mind
Terminology is confusing and inconsistent – only your architecture matters. Keep your eyes open to other approaches, but terminology matters less than choosing conventions that match the environment you want.
The overriding concern is practicality – get the information into users' hands; this will drive the need for more information and guide you in managing the data.
Decoupling produces a lot of redundancy: Source->Flat File->EDI gateway->Stage->DW – understand where the redundancy can be removed, and where decoupling is the goal.
![Page 29: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/29.jpg)
NULLs
Usually represent unknowns
Big problem for users in face of model evolution
If you have a derived stat/measure like customer.allfees = customer.latefees + customer.nsffees
Model starts out like:
- latefees money NOT NULL
- nsffees money NOT NULL
Now we branch out into mailbox rental:
- customer.rentalfees NULL (or NOT NULL?)
customer.allfees = customer.latefees + customer.nsffees + customer.rentalfees
If rentalfees is NULL for older rows, the whole sum comes out NULL, so handle it with a view or populate old data with 0
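The effect is easy to reproduce. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in for the real platform, so `REAL` replaces the `money` type; the view name `customer_fees` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    latefees REAL NOT NULL,
    nsffees REAL NOT NULL,
    rentalfees REAL  -- added later; NULL for rows loaded before mailbox rental
);
INSERT INTO customer VALUES (1, 5.0, 25.0, NULL), (2, 0.0, 0.0, 12.0);

-- The view option: treat an unknown rental fee as 0 when deriving allfees.
CREATE VIEW customer_fees AS
SELECT id, latefees + nsffees + COALESCE(rentalfees, 0) AS allfees
FROM customer;
""")

naive = conn.execute(
    "SELECT id, latefees + nsffees + rentalfees FROM customer ORDER BY id").fetchall()
via_view = conn.execute(
    "SELECT id, allfees FROM customer_fees ORDER BY id").fetchall()
print(naive)     # customer 1 comes back as None because NULL propagates
print(via_view)  # customer 1 now has a usable total
```

Backfilling old rows with 0 gives the same totals without the view, at the cost of no longer distinguishing "no rental fees" from "rental fees were not tracked yet".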
![Page 30: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/30.jpg)
Performance Issues
Cleansing/Manipulation:
Y/N, M/F, Codes – standardize in the ETL, use data types efficiently
Indexes – keep end goals in mind, index according to expected usage
SARG'able, leading zeros/spaces – look for consistency where possible to avoid data manipulation prior to joins; this helps index usage
Partitioning tables – for sliding windows of data retention, partitioning the tables allows old data to be dropped off the end of the fact tables fairly easily
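The SARGability point can be seen in a query plan. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module; the `fact_fees` table and index name are hypothetical, and other database engines report plans differently, so treat this as an illustration of the principle rather than of any particular platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_fees (account_no TEXT, amount REAL)")
conn.execute("CREATE INDEX ix_fact_fees_account ON fact_fees (account_no)")
conn.executemany("INSERT INTO fact_fees VALUES (?, ?)",
                 [(str(n).zfill(6), 1.0) for n in range(1000)])

def plan(sql):
    """Return the human-readable steps of SQLite's query plan for `sql`."""
    # The last column of each EXPLAIN QUERY PLAN row is the step description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Comparing the stored value directly lets the index be used.
sargable = plan("SELECT amount FROM fact_fees WHERE account_no = '000123'")
# Wrapping the column in a function to strip leading zeros defeats the index.
wrapped = plan("SELECT amount FROM fact_fees WHERE ltrim(account_no, '0') = '123'")
print(sargable)
print(wrapped)
```

In this setup the first plan is an index search and the second is a scan of the table, which is why standardizing leading zeros and spaces once in the ETL is cheaper than manipulating the column in every join or filter.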
![Page 31: Data Warehousing for Gnocode](https://reader033.vdocuments.site/reader033/viewer/2022052623/559cbb1a1a28abe9558b4862/html5/thumbnails/31.jpg)
Application Logic
Shared work should be pushed into ETL when:
Not likely to change
Expensive
Everybody needs it
Examples:
Trivial - Scaling to convention (rates)
Intermediate - Simple calculations (Patient Age)
Marginal - Interest rates, risk ratings
Alternative to marginal cases - generate additional facts, either in their own fact tables or as late arriving facts
When a DW is shared, there are many more applications to worry about
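The "Patient Age" case is a good candidate because every consumer would otherwise reimplement the same date arithmetic. A minimal sketch in Python follows; the function name `age_in_years` and the choice to compute age as of the event date (rather than today's date) are assumptions made for the example.

```python
from datetime import date

def age_in_years(birth_date, as_of):
    """Whole years between birth_date and as_of (for example, the visit date)."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

# The day before the 30th birthday, and the birthday itself.
print(age_in_years(date(1980, 6, 15), date(2010, 6, 14)))
print(age_in_years(date(1980, 6, 15), date(2010, 6, 15)))
```

Storing the result with the fact row fixes the age at the time of the event, which is stable and cheap to query. If applications need age as of today, that belongs in a view or in the application instead.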