a framework for building resilient data warehouses a
TRANSCRIPT
A Framework for Building Resilient Data Warehouses A Framework for Building Resilient Data Warehouses A Framework for Building Resilient Data Warehouses using a Mandala Topology Architectureusing a Mandala Topology ArchitectureIA using a Mandala Topology ArchitectureIA using a Mandala Topology Architecture
Michel JACQUES
IA
Michel JACQUESInformation Assembly SPRL, Brussels, BelgiumInformation Assembly SPRL, Brussels, Belgium
Introduction 1.Users Perspectives & Interfaces 2.The Four Major Streams of DataIntroduction 1.Users Perspectives & Interfaces 2.The Four Major Streams of DataIntroduction 2.The Four Major Streams of Data
Data Application: DW Platform is Usage-Driven Data Application: DW Platform is Usage-Driven Constructing ETL work for a DW project is a complex affair, one that Each user type has a different Source data is in an unrefined state (to Data Application: DW Platform is Usage-Driven Data Application: DW Platform is Usage-Driven Constructing ETL work for a DW project is a complex affair, one that
requires planning. This framework uses a DW architecture based on aperspective and role towards the DW.
Existing users include: data stewards,
Source data is in an unrefined state (to various degrees) that must have its data
elements differentiated into a DW core IntegrationIntegration
requires planning. This framework uses a DW architecture based on a
Mandala Topology. At its core is an agile, step-by-step approach toExisting users include: data stewards,
analytical end-users, quality control elements differentiated into a DW core
data model (IN) in order to able to
META
Integration
META
IntegrationMandala Topology. At its core is an agile, step-by-step approach toidentify ETL work units:
analytical end-users, quality control
masters, DW administrators & integration data model (IN) in order to able to
recombine them effectively afterwards for METAMETA
identify ETL work units:1.Identifying the external users
masters, DW administrators & integration
managers. Each user has his own set of
information requirements and accesses
recombine them effectively afterwards for
analytics (OUT). During this transition,
STARSTAR
1.Identifying the external users
2.Positioning main data flows,information requirements and accesses
the DW via a specific interface (STAG, two other types of data elements are
produced: metadata (UP) and erroneous STAR
EDMSTAG MART
Data
STAR
EDMSTAG MART
Data
2.Positioning main data flows,3.Decomposing a flow internal layers
the DW via a specific interface (STAG,
META, MART, or SINK). The STAR can produced: metadata (UP) and erroneous
data (DOWN) streams. No data should EDM
Analytics
Data Stewards
EDM
Analytics
Data Stewards
3.Decomposing a flow internal layers4.Incorporating them within the DW application platform (Mandala)
META, MART, or SINK). The STAR can
only be accessed indirectly via the other data (DOWN) streams. No data should
be lost in this closed system. Thus the
SINKAnalytics
SINKAnalytics4.Incorporating them within the DW application platform (Mandala)
5.Extending conceptual EDM with a functional level of detail
only be accessed indirectly via the other
interfaces to ensure data integrity & security and prevents dependencies
be lost in this closed system. Thus the
position & direction of a data stream determines its purpose, the means, and 5.Extending conceptual EDM with a functional level of detail
6.Classifying data model entities and their dependenciessecurity and prevents dependencies
caused by different user demands. The determines its purpose, the means, and
its destination. This naturally leads to
Quality Quality
6.Classifying data model entities and their dependencies7.Combining the Functional ER model with Topic Areas & Tiers
caused by different user demands. The
users are part of an iterative feedback its destination. This naturally leads to
asymmetries between streams, which
Management:Quality
ControlQuality
Control7.Combining the Functional ER model with Topic Areas & Tiers
8.Enumerate the ETL work units in matrix format
users are part of an iterative feedback
loop improving future data content and asymmetries between streams, which
must be accounted for in the ETL design..Management:
•Intravenous Dextrose infusion with rates up to 250mls/hr of 20% Dextrose
8.Enumerate the ETL work units in matrix format
The method focuses on the topological relationship between all the DW
loop improving future data content and
quality.
Hence a multi-perspectives/faceted data warehouse architecture acting as a crossway between
must be accounted for in the ETL design..“Just about every [DW] process has side effects; but they can be deliberate and sustaining instead of unintentional and pernicious…and we can also be inspired by it to design some •Intravenous Dextrose infusion with rates up to 250mls/hr of 20% DextroseThe method focuses on the topological relationship between all the DW
artifacts, in order to comprehensive improve planning & design of DW.
Hence a multi-perspectives/faceted data warehouse architecture acting as a crossway between these end-users is comprehensive and non-discriminatory at an organizational level.
instead of unintentional and pernicious…and we can also be inspired by it to design some
positive side effects to our own enterprises instead of focusing exclusively on a single end.” (p.80)
•Dietary intervention with frequent meals and corn starchartifacts, in order to comprehensive improve planning & design of DW. these end-users is comprehensive and non-discriminatory at an organizational level. positive side effects to our own enterprises instead of focusing exclusively on a single end.” (p.80)
Ref: W. McDonough and M. Braungart, Craddle to Craddle, North Point Press, NY, 2002•Dietary intervention with frequent meals and corn starch Ref: W. McDonough and M. Braungart, Craddle to Craddle, North Point Press, NY, 2002
•Diazoxide – intolerant leading to hyponatraemia, oedema and nausea 5.Functional ER Data Model 3.The Five ETL Layers & Chirality 4.DW Mandala Topology Architecture•Diazoxide – intolerant leading to hyponatraemia, oedema and nausea
•Octreotide/glucagon intravenously – in order to replace counter-regulatory hormones
5.Functional ER Data Model 3.The Five ETL Layers & Chirality 4.DW Mandala Topology ArchitectureData Application: Platform for 5 Architectural LayersData Application: Platform for 5 Architectural Layers Enterprise Data Model: Conceptual ModelEnterprise Data Model: Conceptual ModelThe Buddhist Mandala metaphor
The conceptual model is The purpose of an ETL is to increase the •Octreotide/glucagon intravenously – in order to replace counter-regulatory hormonesData Application: Platform for 5 Architectural LayersData Application: Platform for 5 Architectural Layers Enterprise Data Model: Conceptual ModelEnterprise Data Model: Conceptual ModelThe Buddhist Mandala metaphor
helps visualize an architectureThe conceptual model is business-oriented, while the
The purpose of an ETL is to increase the integration & analytical capability of
•Subcutaneous Octreotide - hypoglycaemia worsened1METAMETA
helps visualize an architecture
where topological relationships
between DW components
business-oriented, while the
logical model is focused on
integration & analytical capability of
sourced data. An ETL data path consists of 5 •Subcutaneous Octreotide - hypoglycaemia worsened1
AuthoringAuthoringbetween DW components
including: data modeling, ETL
logical model is focused on
content and application. The
sourced data. An ETL data path consists of 5
layers, each conducting a different set of data
transformations. The first two “IN” stages
•Prednisolone – developed fluid retention
including: data modeling, ETL
data flows, and surrounding
content and application. The
functional data model, proposed
herein, places itself in the gap
transformations. The first two “IN” stages
integrate the data to enable multiple •Prednisolone – developed fluid retention
STARSTAR
Trn MapsTrn Maps
data flows, and surrounding actors & applications; functionally
herein, places itself in the gap between the two. It extends the
integrate the data to enable multiple interpretations, while the last two “OUT”
•Hepatic Arterial Embolisation (HAE) performed twice with initial improvement (post-procedure insulin 29 pmol/l), but relapsed after 4 weeks.STAR
EDMSTAG MARTIN OUT
STAR
EDMSTAG MARTIN OUT
interact. The architecture is
similar to a road crossway that
between the two. It extends the
number of entity types from the
interpretations, while the last two “OUT”
stages specialise the data such that it •Hepatic Arterial Embolisation (HAE) performed twice with initial improvement (post-procedure insulin 29 pmol/l), but relapsed after 4 weeks.EDMIN OUT
Extraction:::Staging:::::Integration::::Publishing:User Access
EDMIN OUT
Extraction:::Staging:::::Integration::::Publishing:User Access ProfilesAttendance
ProfilesAttendancesimilar to a road crossway that
gives essential context and
number of entity types from the
initial fact and dimension with becomes fit-for-purpose. This mirror-effect is
referred to as ETL chirality. Extraction:::Staging:::::Integration::::Publishing:User AccessExtraction:::Staging:::::Integration::::Publishing:User Access Profiles
Roster
Profiles
Roster
gives essential context and
movement to data and new types for holding hierarchies,
dim. ids, details & associations.
referred to as ETL chirality.The following are important elements when selecting the most appropriate data modelling
SINKSINK
Roster
Trainings
Roster
Trainings
movement to data and operations. The context is
composed of 5 distinct locations
dim. ids, details & associations. This improves history-keeping
The following are important elements when selecting the most appropriate data modelling technique to use: a) the degree of convergence built into the data during ETL; b) the number of
SINKSINKTrainingsTrainings
composed of 5 distinct locations
(STAG, META, SINK, MART, &
This improves history-keeping
and makes the core data model
technique to use: a) the degree of convergence built into the data during ETL; b) the number of
unique pathways in the dimensional model; c) increased data flow resilience by decreased data
reliance; and d) ability for decoupling of model components. Erroneous
Data Repo
Erroneous
Data Repo
(STAG, META, SINK, MART, &
STAR) that provide a clear
and makes the core data model
more resilient while decoupling
the associated data flows.
reliance; and d) ability for decoupling of model components.
Data disturbances will occur either from external sources in an unexpected, subtle or extreme Data RepoData RepoSTAR) that provide a clear
logical structure determiningThere is a need to extend our “vocabulary of forms” when modelling. This involves adding
the associated data flows.Data disturbances will occur either from external sources in an unexpected, subtle or extreme
manner, which requires a faster data recovery by minimising data reload to only what is relevant. logical structure determining
data, flows, modeling patterns, access and security, user groups, and integration methods. Moreover, locations enable data persistence facilitating data recovery and transformations.
There is a need to extend our “vocabulary of forms” when modelling. This involves adding functional features so as to harmonise “form with function” and thus achieve a greater
manner, which requires a faster data recovery by minimising data reload to only what is relevant. Resilience also involves cyclic transfer of information across data applications reinforcing each
Moreover, locations enable data persistence facilitating data recovery and transformations. functional features so as to harmonise “form with function” and thus achieve a greater
decoupling of DW artefacts, whilst maintaining data cohesion. system data quality and monitoring usage.
decoupling of DW artefacts, whilst maintaining data cohesion.
8.Work Matrix from Method & Conclusion7. Agile DW Meets Functional ER Model6.Volatility of Functional Entities 8.Work Matrix from Method & Conclusion7. Agile DW Meets Functional ER Model6.Volatility of Functional Entities
ETL Decomposition Process: Classes as Tiers RevisitedETL Decomposition Process: Classes as Tiers Revisited ETL Decomposition Process: Work Units and ModulesETL Decomposition Process: Work Units and ModulesETL Decomposition Process: Topic Areas & Tiers for EDMETL Decomposition Process: Topic Areas & Tiers for EDM Development progress follows an The ETL decomposition process This functional data model ETL Decomposition Process: Classes as Tiers Revisited DEPENDENCY LAYERS as TIERS
ETL Decomposition Process: Classes as Tiers Revisited DEPENDENCY LAYERS as TIERS DEPENDENCY LAYERS as TIERS
ETL Decomposition Process: Work Units and ModulesETL Decomposition Process: Work Units and ModulesETL Decomposition Process: Topic Areas & Tiers for EDM
EVALUATION_QUESTIONS3TYPES_OF_NEEDS3 OTHER_CATALOG3
EVALUATION_ANSWERS3
ETL Decomposition Process: Topic Areas & Tiers for EDM
EVALUATION_QUESTIONS3TYPES_OF_NEEDS3 OTHER_CATALOG3
EVALUATION_ANSWERS3
Development progress follows an iterative and mostly top-down
The ETL decomposition process requires an entity having both a
This functional data model applies additional data entity DEPENDENCY LAYERS as TIERS
TIERSDATA CLASSES
DEPENDENCY LAYERS as TIERS
TIERSDATA CLASSES
DEPENDENCY LAYERS as TIERS
TIERSDATA CLASSES
1,n
1,n 0,n
1,n TRAINING_NEEDS_ACTUAL
EVALUATION_QUESTIONS3TYPES_OF_NEEDS3
PRIORITY
SOLUTION
<Undefined>
<Undefined>
OTHER_CATALOG3EVALUATION_ANSWERS3
1,n
1,n 0,n
1,n TRAINING_NEEDS_ACTUAL
EVALUATION_QUESTIONS3TYPES_OF_NEEDS3
PRIORITY
SOLUTION
<Undefined>
<Undefined>
OTHER_CATALOG3EVALUATION_ANSWERS3 iterative and mostly top-down
approach: one topic area, one
requires an entity having both a
class and a theme. For each
entity the process allocates: a) a
applies additional data entity
classes giving it increased ETL
flexibility. The classes follow a T00
Metadata(static) T00
Metadata(static) T00
Metadata(static) EVA-4 TNA-5
1,n1,n TRAINING_NEEDS_ACTUAL
TNA_NO_NEEDS_VAALUE
TNA_SUGGESTIONS_DESC
TNA_OBJECTIVE_DESC
...
<Undefined>
<Undefined>
<Undefined>
TRAININGS_EVALUATIONS
EVA-4 TNA-5
1,n1,n TRAINING_NEEDS_ACTUAL
TNA_NO_NEEDS_VAALUE
TNA_SUGGESTIONS_DESC
TNA_OBJECTIVE_DESC
...
<Undefined>
<Undefined>
<Undefined>
TRAININGS_EVALUATIONS
approach: one topic area, one
module, one tier at a time, although
units within same tier can be
entity the process allocates: a) a
tier corresponding to a class; b) flexibility. The classes follow a
step-wise approach whereby they T01
DimensionsT01
DimensionsT01
Dimensions
TMA-2
EVA-4 TNA-5
ORGANISATION_HIERARCHY3
WORKFLOW_STATUS6TMA-2
EVA-4 TNA-5
ORGANISATION_HIERARCHY3
WORKFLOW_STATUS6 units within same tier can be
developed in parallel. Once there is
tier corresponding to a class; b)
a topic area corresponding to a step-wise approach whereby they
become increasingly volatile T01base, details & struct)
Facts
T01base, details & struct)
Facts
T01base, details & struct)
Facts
TRA-3
T1SUB_TIME SUP_TIME
0,n1,nSUB SUP
sup sub
TRAINING_MAP_ACTUAL
TMA_DOSSIER_ID
TMA_STATUS_DATE
<Undefined>
<Undefined>TIMES3
TIMES_HIERARCHY3
WORKFLOW_STATUS5
WFS_SID
WFS_CODE
<Undefined>
<Undefined>
ORGANISATION_HIERARCHY3
ROOMS3 TRA-3
T1SUB_TIME SUP_TIME
0,n1,nSUB SUP
sup sub
TRAINING_MAP_ACTUAL
TMA_DOSSIER_ID
TMA_STATUS_DATE
<Undefined>
<Undefined>TIMES3
TIMES_HIERARCHY3
WORKFLOW_STATUS5
WFS_SID
WFS_CODE
<Undefined>
<Undefined>
ORGANISATION_HIERARCHY3
ROOMS3developed in parallel. Once there is a Reference Model (built prototype)
a topic area corresponding to a theme; and c) a priority
become increasingly volatile(data susceptibility to change).
Facts(events & profiles) T02
Facts(events & profiles) T02
Facts(events & profiles) T02
T0
T1
T2 1,n0,n
0,n
0,n
TMA_STATUS_DATE
TMA_VERSION_NO
TMA_VERSION_START_DATE
TMA_VERSION_END_DATE...
<Undefined>
<Undefined>
<Undefined>
<Undefined>
TIME_SID
TIME_DATE
<Undefined>
<Undefined>TRAINING_MAP_EXERCISE3
TME_SID
TME_NAME
<Undefined>
<Undefined>
ORGANISATIONS3
ORG_SID
ORG_CODE
...
<Undefined>
<Undefined>COURSE_DETAILS3T0
T1
T2 1,n0,n
0,n
0,n
TMA_STATUS_DATE
TMA_VERSION_NO
TMA_VERSION_START_DATE
TMA_VERSION_END_DATE...
<Undefined>
<Undefined>
<Undefined>
<Undefined>
TIME_SID
TIME_DATE
<Undefined>
<Undefined>TRAINING_MAP_EXERCISE3
TME_SID
TME_NAME
<Undefined>
<Undefined>
ORGANISATIONS3
ORG_SID
ORG_CODE
...
<Undefined>
<Undefined>COURSE_DETAILS3
a Reference Model (built prototype)
for a Tier work unit, estimates for all
theme; and c) a priority
corresponding to prevailing
business needs. An entity defines
(data susceptibility to change).
Since data in lower classes is
used to derive new data in upper T03
Grids(associations) T03
Grids(associations) T03
Grids(associations)
T0
T40,n
0,n0,n
Owner
1,n
0,nDetails0,n1,n1,n
TME_NAME
...
<Undefined>
STATUTORIES3
COURSES3
COURSE_SID <Undefined>TRAININGS_ATTENDANCE_PAG
T0
T40,n
0,n0,n
Owner
1,n
0,nDetails0,n1,n1,n
TME_NAME
...
<Undefined>
STATUTORIES3
COURSES3
COURSE_SID <Undefined>TRAININGS_ATTENDANCE_PAG
for a Tier work unit, estimates for all
units can be based on the reference
model and adjusted according to
business needs. An entity defines
the work unit in which it is used to derive new data in upper
classes, the data volatility in T04
Densification &Conformations T04Densification &Conformations T04Densification &Conformations Details
0,n Participant
1,n 1,n 1,n
1,n
1,n
PARTICIPANTS3
PTCD_SID
PTC_LAST_NAME
...
<Undefined>
<Undefined>
PARTICIPANTS_PROFILES_ACTUAL
PPA_ONE_VALUE
PPA_VALID_START_DATE
PPA_VALID_END_DATE
PPA_ACTUAL_FLAG
<Undefined>
<Undefined>
<Undefined>
<Undefined>
ADMIN_POSITIONS3
COURSE_SID
COURSE_CODE
<Undefined>
<Undefined>
PARTICIPANT_STATUS3
SESSION_DURATION
SESSION_EQUIVALENT_VALUE
SESSION_EVENT_DATE
SESSION_VIRTUAL_FLAG...
<Undefined>
<Undefined>
<Undefined>
<Undefined>
Details
0,n Participant
1,n 1,n 1,n
1,n
1,n
PARTICIPANTS3
PTCD_SID
PTC_LAST_NAME
...
<Undefined>
<Undefined>
PARTICIPANTS_PROFILES_ACTUAL
PPA_ONE_VALUE
PPA_VALID_START_DATE
PPA_VALID_END_DATE
PPA_ACTUAL_FLAG
<Undefined>
<Undefined>
<Undefined>
<Undefined>
ADMIN_POSITIONS3
COURSE_SID
COURSE_CODE
<Undefined>
<Undefined>
PARTICIPANT_STATUS3
SESSION_DURATION
SESSION_EQUIVALENT_VALUE
SESSION_EVENT_DATE
SESSION_VIRTUAL_FLAG...
<Undefined>
<Undefined>
<Undefined>
<Undefined>
model and adjusted according to
variable difficulty factors. Once
the work unit in which it is
contained. A work unit is the classes, the data volatility in
lower class will be less than that
Data MartsT05
T04Conformations(ballasting & mappings)
Data MartsT05
T04Conformations(ballasting & mappings)
Data MartsT05
T04Conformations(ballasting & mappings)
T2T1T3
0,n 0,n
Details
0,n
0,n
1,n
1,n
PARTICIPANTS_DETAIL3
PPA_ACTUAL_FLAG <Undefined>JOBS3
CATEGORIES_GRADES3
TRAININGS_ROSTER_PTG
FINAL_EXAM_MARK
INSCRIPTION_DATE...
<Undefined>
<Undefined>T2
T1T30,n 0,n
Details
0,n
0,n
1,n
1,n
PARTICIPANTS_DETAIL3
PPA_ACTUAL_FLAG <Undefined>JOBS3
CATEGORIES_GRADES3
TRAININGS_ROSTER_PTG
FINAL_EXAM_MARK
INSCRIPTION_DATE...
<Undefined>
<Undefined>
variable difficulty factors. Once completed & tested, work units
contained. A work unit is the smallest functional artefact
lower class will be less than that in upper classes. Volatility
Data Marts(lov, hier & msr)
T05
T06T07
Data Marts(lov, hier & msr)
T05
T06T07
Data Marts(lov, hier & msr)
T05
T06T07
T1
T30,n 0,n
1,n
1,n
1,n
RESOURCE_CENTERS3 TRAININGS_ACTUAL
COST
NO_SESSION_VALUE
<Undefined>
<Undefined>
CATEGORIES_GRADES3
CGR_SID
CGR_CODE
...
<Undefined>
<Undefined>
DOMAINS3
T1
T30,n 0,n
1,n
1,n
1,n
RESOURCE_CENTERS3 TRAININGS_ACTUAL
COST
NO_SESSION_VALUE
<Undefined>
<Undefined>
CATEGORIES_GRADES3
CGR_SID
CGR_CODE
...
<Undefined>
<Undefined>
DOMAINS3
completed & tested, work units
modules are then promoted to next determining granularity of
resource allocation. Regrouping
in upper classes. Volatility
determines which data flows are
performed in parallel or in T08T09
Segments & KPI
(compositions & formula)
T08T09
Segments & KPI
(compositions & formula)
T08T09
Segments & KPI
(compositions & formula)
PPA-1
T1
T1T2
0,n 1,n
1,n
1,n
1,n0,n
NO_SESSION_VALUE
DURATION_CALCUL_VALUE
DURATION_MANUAL_VALUE
NO_MAX_PARTICIPANTS
...
<Undefined>
<Undefined>
<Undefined>
<Undefined>
...
CITIES3
COURSE_TYPES3
PPA-1
T1
T1T2
0,n 1,n
1,n
1,n
1,n0,n
NO_SESSION_VALUE
DURATION_CALCUL_VALUE
DURATION_MANUAL_VALUE
NO_MAX_PARTICIPANTS
...
<Undefined>
<Undefined>
<Undefined>
<Undefined>
...
CITIES3
COURSE_TYPES3dev. environment.resource allocation. Regrouping
work units is as follows: Work performed in parallel or in
sequence. This principle T99
Topic Area
PromotionT99
Topic Area
PromotionT99
Topic Area
PromotionTOPIC
PPA-1 T1
T1TIERS TOPICTIERS
1,n
COURSE_STATUS3TRAINERS3
TOPIC
PPA-1 T1
T1TIERS TOPICTIERS
1,n
COURSE_STATUS3TRAINERS3
work units is as follows: Work
unit >> Module >> Topic Area >> sequence. This principle
drastically reduces ETL execution Conclusion: The advantages of implementing such a topological architecture include: greater T99Promotion
T99Promotion
T99Promotion
T1T1unit >> Module >> Topic Area >> Enterprise Data Model (EDM).
drastically reduces ETL execution and development time.
Conclusion: The advantages of implementing such a topological architecture include: greater scalability of additional data themes, enhanced performance of data flows, increased resilience of
decoupled artifacts, sturdier quality control, and lower operating and development costs. The Concept of Tiers and Topic Areas is borrowed from R. Hughes’ book: The Agile Data Warehousing, iUniverse Inc, Bloomington, 2008
decoupled artifacts, sturdier quality control, and lower operating and development costs. The
framework provides a comprehensive, reproducible, and proven DW architecture solution.The 6 classes are mapped to 9 tiers that maintain volatility.
Warehousing, iUniverse Inc, Bloomington, 2008 framework provides a comprehensive, reproducible, and proven DW architecture solution.
Ecole Centrale Paris, Châtenay-MalabryEcole Centrale Paris, Châtenay-Malabry