A company of Daimler AG LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTURE ANDREAS BUCKENHOFER, DAIMLER TSS


Post on 08-Jun-2020


Page 1: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

A company of Daimler AG

LECTURE @DHBW: DATA WAREHOUSE

02 DWH AND BIG DATA ARCHITECTURE

ANDREAS BUCKENHOFER, DAIMLER TSS


ABOUT ME

Andreas Buckenhofer, Senior DB Professional

Since 2009 at Daimler TSS
Department: Machine Learning Solutions
Business Unit: Analytics

Contact/Connect: DHBW, DOAG, xing, vcard

• Oracle ACE Associate
• DOAG responsible for InMemory DB
• Lecturer at DHBW
• Certified Data Vault Practitioner 2.0
• Certified Oracle Professional
• Certified IBM Big Data Architect

• Over 20 years experience with database technologies

• Over 20 years experience with Data Warehousing

• International project experience


DWH LECTURE - LEARNING TARGETS

• Describe different DWH architectures
• Explain Big Data architectures
• Understand Data Lakes and other buzzwords

Data Warehouse / DHBWDaimler TSS 3


PURPOSE: WHY ARE DWH ARCHITECTURES USEFUL?

• A specific implementation can follow an architecture
• An architecture describes an ideal type. An implementation therefore need not use all components and may combine components
• Decomposing a DWH into its components gives a better understanding and overview and reduces complexity
• Can be used in many projects: repeatable, standardizable
• DWH tools can be mapped to the different components in order to compare their functionality
• Function-oriented, as it describes data and control flow

LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE

[Figure: internal and external data sources (OLTP) feed the Data Warehouse, which is divided into Backend and Frontend. Layers: Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer, Mart Layer (Output Layer / Reporting Layer). Further components: Metadata Management, Security, DWH Manager incl. Monitor.]

DATA SOURCES

• Provide internal and external data from the source systems
• Data is made available through Push (the source generates extracts) or Pull (the BI data backend requests or directly accesses the data)
• Push example: delivering CSV or text data through a file interface; Change Data Capture (CDC)
• Pull example: direct access to the source system via ODBC, JDBC, an API and so on
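The difference between Push and Pull on the data sources slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the `customer` table, its columns and both function names are hypothetical, and an in-memory SQLite database stands in for a real source system reached via ODBC/JDBC.

```python
import csv
import io
import sqlite3


def pull_customers(source_conn):
    """Pull: the DWH backend queries the source system directly."""
    cur = source_conn.execute("SELECT id, name FROM customer ORDER BY id")
    return [(str(cust_id), name) for cust_id, name in cur.fetchall()]


def read_pushed_extract(csv_text):
    """Push: the source system generated this extract; the backend only parses it."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["id"], row["name"]) for row in reader]
```

With Pull the backend decides when to read and puts load on the source; with Push the source decides when and what to deliver.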

STAGING LAYER

• “Landing Zone” for data coming into a DWH
• Purpose: speed up loading into the DWH and decouple source and target system (e.g. to repeat an extraction run or to handle an additional delivery)
• Granular data (no pre-aggregation or filtering in the Data Source Layer, i.e. the source system)
• Usually not persistent, so regular housekeeping is necessary (for instance, delete data in this layer that is a few days/weeks old or, more commonly, once a correct upload to the Core Warehouse Layer is ensured)
• Tables have no referential integrity constraints; columns are often varchar
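A minimal sketch of a staging table as described on the staging layer slide, assuming a hypothetical `stg_orders` table in SQLite: all columns are text, there are no keys or constraints, and housekeeping removes rows once their upload to the Core Warehouse Layer is confirmed. Table, column and function names are illustrative only.

```python
import sqlite3


def create_staging(conn):
    # Text-only columns and no constraints: the staging layer accepts data as delivered.
    conn.execute(
        "CREATE TABLE stg_orders ("
        "order_id TEXT, customer_id TEXT, amount TEXT, load_ts TEXT)"
    )


def load_staging(conn, rows, load_ts):
    # Tag every row with its load timestamp so housekeeping can find it later.
    conn.executemany(
        "INSERT INTO stg_orders VALUES (?, ?, ?, ?)",
        [(*row, load_ts) for row in rows],
    )


def housekeeping(conn, loaded_up_to):
    # Delete rows whose upload to the core layer has been confirmed.
    cur = conn.execute("DELETE FROM stg_orders WHERE load_ts <= ?", (loaded_up_to,))
    return cur.rowcount
```

Note that a value such as "n/a" in `amount` is accepted here on purpose; fixing it is the job of a later layer.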

INTEGRATION LAYER

• Business rules, harmonization and standardization of data
• Classical layer for transformations: ETL = Extract – TRANSFORM – Load
• Fixing data quality issues
• Usually not persistent, so regular housekeeping is necessary (for instance after a few days or weeks, or at the latest once a correct upload to the Core Warehouse Layer is ensured)
• This component is often not required, or is not a physical part of a DB
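Harmonization and standardization in the integration layer could look roughly like the following sketch. The field names, the country mapping and the decimal-comma handling are assumptions made up for this example, not rules from the lecture.

```python
# Hypothetical mapping of source-specific spellings to one standard code.
COUNTRY_CODES = {
    "germany": "DE", "deutschland": "DE", "de": "DE",
    "usa": "US", "united states": "US", "us": "US",
}


def harmonize(record):
    """Standardize one staged record (all values arrive as text)."""
    country = COUNTRY_CODES.get(record["country"].strip().lower())
    try:
        # Accept a decimal comma as well as a decimal point.
        amount = float(record["amount"].replace(",", "."))
    except ValueError:
        amount = None  # data quality issue: flagged as missing instead of guessed
    return {
        "customer_id": record["customer_id"].strip(),
        "country": country,
        "amount": amount,
    }
```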

CORE WAREHOUSE LAYER

• Data storage in an integrated, consolidated, consistent and non-redundant (normalized) data model
• Contains enterprise-wide data organized around multiple subject areas
• Application- and reporting-neutral data storage at the most detailed level of granularity (incl. historic data)
• The database can be several TB in size and can grow rapidly due to data historization
• Write-optimised layer

AGGREGATION LAYER

• Prepares data for the Data Mart Layer at the required granularity
  • E.g. aggregating daily data to monthly summaries
  • E.g. filtering data (just the last 2 years or just data for a specific region)
• Harmonizes the computation of key performance indicators (measures) and additional business rules
• This component is often not required, or is not a physical part of a DB
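The two examples from the aggregation layer slide (daily to monthly, and filtering by age) can be sketched as one small function. The input shape, a list of `(date, amount)` pairs, and the `min_year` parameter are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(daily_rows, min_year=None):
    """Aggregate (date, amount) rows to (year, month) totals, optionally filtering old years."""
    totals = defaultdict(float)
    for day, amount in daily_rows:
        if min_year is not None and day.year < min_year:
            continue  # filter: keep only recent data
        totals[(day.year, day.month)] += amount
    return dict(totals)
```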

DATA MART LAYER

• Read-optimised layer: data is stored in a denormalized data model for performance reasons and for better end-user usability/understanding
• The Data Mart Layer typically provides aggregated data or data with less history (e.g. latest years only) in a denormalized data model
• Created by filtering or aggregating the Core Warehouse Layer
• One Mart ideally represents one subject area
• Technically, the Data Mart Layer can also be part of an analytical frontend product (such as Qlik, Tableau, or IBM Cognos TM1) and need not be stored in a relational database
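As a rough illustration of a mart that is created by aggregating the Core Warehouse Layer, the sketch below derives a denormalized table from two normalized core tables in SQLite. The table names `core_customer`, `core_order` and `mart_sales` and their columns are hypothetical.

```python
import sqlite3


def build_sales_mart(conn):
    """Derive a denormalized, aggregated mart table from normalized core tables."""
    conn.execute(
        """
        CREATE TABLE mart_sales AS
        SELECT c.region                        AS region,
               strftime('%Y-%m', o.order_date) AS month,
               SUM(o.amount)                   AS revenue
        FROM core_order o
        JOIN core_customer c ON c.customer_id = o.customer_id
        GROUP BY c.region, strftime('%Y-%m', o.order_date)
        """
    )
```

Reports can then read `mart_sales` without joins, which is the read-optimisation the slide refers to.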

METADATA MANAGEMENT, SECURITY, MONITOR

• Metadata Management
  • Not just “Data about Data”; covered in a separate lecture
• Security
  • Not all users are allowed to see all data
  • Data security classification (e.g. restricted, confidential, secret)
• DWH Manager incl. Monitor
  • DWH Manager initiates, controls, and checks job execution
  • Monitor identifies changes/new data from source systems; covered in a separate lecture

EXERCISE: CLASSICAL DWH ARCHITECTURES

The article http://www.kimballgroup.com/2004/03/differences-of-opinion/ compares the two classic DWH architectures.

Read the paper and complete the table / questions on the next slide.

(Caution: the paper is biased and favors one approach; you may want to read other papers for a neutral view.)

EXERCISE: CLASSICAL DWH ARCHITECTURES

• What are the approaches called?
• Who “invented” the approach?
• How many layers are used and what are the layers called?
• Which data modeling approaches are used in which layer?
• In which layer are atomic detail data stored?
• In which layer are aggregated / summary data stored?
• List at least 2 advantages
• List at least 2 disadvantages

EXERCISE: CLASSICAL DWH ARCHITECTURES

What are the approaches called?
• Kimball Bus Architecture
• Corporate Information Factory

Who “invented” the approach?
• Kimball Bus Architecture: Ralph Kimball
• Corporate Information Factory: Bill Inmon

How many layers are used and what are the layers called?
• Kimball Bus Architecture: Data Staging, Dimensional Data Warehouse
• Corporate Information Factory: Data Acquisition, Normalized Data Warehouse, Data Delivery / Dimensional Mart

Which data modeling approaches are used in which layer?
• Kimball Bus Architecture
  • Data Staging: variable, corresponds to source system
  • Dimensional Data Warehouse: Dimensional Model
• Corporate Information Factory
  • Data Acquisition: variable, corresponds to source system
  • Normalized Data Warehouse: 3NF
  • Data Delivery: Dimensional Model

In which layer are atomic detail data stored?
• Kimball Bus Architecture: Dimensional Data Warehouse
• Corporate Information Factory: Normalized Data Warehouse

In which layer are aggregated / summary data stored?
• Kimball Bus Architecture: Dimensional Data Warehouse
• Corporate Information Factory: Data Delivery / Dimensional Mart

EXERCISE: CLASSICAL DWH ARCHITECTURES

Advantages

• Kimball Bus Architecture
  • Only two layers, which means faster development and less work
  • Rather simple approach to make data fast and easily accessible
  • Lower startup costs (but higher subsequent development costs)
• Corporate Information Factory
  • Separation of concerns: long-term enterprise data storage is separated from data presentation
  • Changes in requirements and scope are easier to manage
  • Lower subsequent development costs (but higher startup costs)

Disadvantages

• Kimball Bus Architecture
  • If table structures change (unstable source systems), high effort is needed to implement the changes and reload data, especially for conformed dimensions (“Dimensionitis” disease)
  • Non-metric data is not optimal for the dimensional model
  • The dimensional model (esp. Star Schema) contains data redundancy
• Corporate Information Factory
  • Data model transformations from 3NF to the dimensional model are required
  • More complex, as two different data models are required
  • Larger team(s) of specialists required

OTHER DWH ARCHITECTURES

• Kimball Bus Architecture (Central data warehouse based on data marts)
• Inmon Corporate Information Factory
• Data Vault 2.0 Architecture (Dan Linstedt)
• DW 2.0: The Architecture for the Next Generation of Data Warehousing
• Operational Data Store (ODS)

KIMBALL BUS ARCHITECTURE (CENTRAL DATA WAREHOUSE BASED ON DATA MARTS)

Source: http://www.kimballgroup.com/2004/03/differences-of-opinion/

[Figure: internal and external data sources (OLTP) feed the Data Warehouse, which is divided into Backend and Frontend. Layers: Staging Layer (Input Layer); Core Warehouse Layer = Mart Layer, consisting of Data Mart 1, Data Mart 2 and Data Mart 3. Further components: Metadata Management, Security, DWH Manager incl. Monitor. Annotation: more business-process oriented than subject-oriented, integrated, time-variant, non-volatile.]

KIMBALL BUS ARCHITECTURE (CENTRAL DATA WAREHOUSE BASED ON DATA MARTS)

• Bottom-up approach
• Dimensional model with denormalized data
• The sum of the data marts constitutes the Enterprise DWH
• Enterprise Service Bus / conformed dimensions for integration purposes
  • (not to be confused with an ESB as middleware/communication system between applications)
• Kimball notes that agreeing on conformed dimensions is hard work and that the team is expected to get stuck from time to time while trying to align the incompatible original vocabularies of different groups
• Data marts need to be redesigned if incompatibilities exist
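Why conformed dimensions matter for integration can be shown with a toy example: two data marts can only be combined in one report if they use the same customer keys. The data structures and the function name `drill_across` are assumptions for illustration and not part of Kimball's material.

```python
def drill_across(sales_by_customer, returns_by_customer, customer_dim):
    """Combine facts from two marts through a shared (conformed) customer dimension."""
    report = {}
    for key, attributes in customer_dim.items():
        report[attributes["name"]] = (
            sales_by_customer.get(key, 0),    # fact from the sales mart
            returns_by_customer.get(key, 0),  # fact from the returns mart
        )
    return report
```

If each mart had its own customer keys or naming, this lookup would silently miss rows, which is the kind of incompatibility that forces a redesign.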

DATA INTEGRATION WITH AND WITHOUT CORE WAREHOUSE LAYER

[Figure: data integration with and without a Core Warehouse Layer.]

INMON CORPORATE INFORMATION FACTORY

Source: http://www.kimballgroup.com/2004/03/differences-of-opinion/

[Figure: internal and external data sources (OLTP) feed the Data Warehouse, which is divided into Backend and Frontend. Layers: Staging Layer (Input Layer), Core Warehouse Layer (Storage Layer), Mart Layer (Output Layer / Reporting Layer). Further components: Metadata Management, Security, DWH Manager incl. Monitor. Annotation: subject-oriented, integrated, time-variant, non-volatile.]

Not relevant for exam

INMON CORPORATE INFORMATION FACTORY

• Top-down approach
• A (normalized) Core Warehouse is essential for subject-oriented, integrated, time-variant and non-volatile data storage
• (Departmental) Data Marts are created as subsets of the Core Enterprise DWH as needed
• Data Marts can be designed with a dimensional model
• The logical standard architecture is more general than CIF, but was mainly influenced by CIF

Not relevant for exam

DATA VAULT 2.0 ARCHITECTURE – TODAY’S WORLD (DAN LINSTEDT)

DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)

Michael Olschimke, Dan Linstedt: Building a Scalable Data Warehouse with Data Vault 2.0, Morgan Kaufmann, 2015, Chapter 2.2

[Figure: internal and external data sources (OLTP) feed the Data Warehouse, which is divided into Backend and Frontend. Layers: Staging Layer (Input Layer), Raw Data Vault, Business Data Vault, Mart Layer (Output Layer / Reporting Layer). Further components: Metadata Management, Security, DWH Manager incl. Monitor. Annotations: Hard Rules only; Soft Rules.]

DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)

• The Core Warehouse Layer is modeled with Data Vault and integrates data by BK (business key) “only” (Data Vault modeling is a separate lecture)
• Business rules (Soft Rules) are applied from the Raw Data Vault Layer to the Mart Layer and not earlier
  • Alternatively, from the Raw Data Vault to an additional layer called Business Data Vault
• Hard Rules don’t change data
• Data is fully auditable
• Real-time capable architecture
• The architecture has recently become very popular; it is also applicable to Big Data and NoSQL
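The split between Hard Rules and Soft Rules can be illustrated with a small sketch. The record layout, the `premium_threshold` parameter and the segment names are invented for this example; only the principle (hard rules leave the content untouched, soft rules add business interpretation downstream) comes from the slide.

```python
from datetime import date


def apply_hard_rules(raw):
    """Hard rules: align data types only; nothing is cleansed, filtered or reinterpreted."""
    return {
        "customer_bk": raw["customer_bk"],  # business key is kept as delivered
        "revenue": float(raw["revenue"]),
        "signup": date.fromisoformat(raw["signup"]),
    }


def apply_soft_rules(record, premium_threshold=1000.0):
    """Soft rules: business interpretation, applied after the Raw Data Vault."""
    segment = "premium" if record["revenue"] >= premium_threshold else "standard"
    return {**record, "segment": segment}
```

Because the soft rule runs downstream, a changed threshold only requires recomputing the mart, not reloading the raw data.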

DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)

• In classical DWHs, the Core Warehouse Layer is regarded as the “single version of the truth”
  • Integrates and cleanses data from different sources and eliminates contradictions
  • Produces consistent results/reports across Data Marts
  • But: cleansing is (still) subjective, enterprises change regularly, and the paradigm does not scale as more and more systems exist
• Data in the Raw Data Vault Layer is regarded as the “single version of the facts”
  • 100% of the data is loaded 100% of the time
  • Data is not cleansed and bad data is not removed in the Core Layer (Raw Vault)

DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)

• Data Vault is optimized for the following requirements:
  • Flexibility
  • Agility
  • Data historization
  • Data integration
  • Auditability
• Bill Inmon wrote in 2008: “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0)

DW 2.0: THE ARCHITECTURE FOR THE NEXT GENERATION OF DATA WAREHOUSING

Source: W.H. Inmon, Dan Linstedt: Data Architecture: A Primer for the Data Scientist, Morgan Kaufmann, 2014, chapter 3.1

[Figure: data lifecycle in DW 2.0. Labels: operational application data model, integrated corporate data model (twice), archival data model.]

Not relevant for exam

DW 2.0: THE ARCHITECTURE FOR THE NEXT GENERATION OF DATA WAREHOUSING

Main characteristics:

• Structured and “unstructured” data, not just metrics
• Life cycle of data with different storage areas
  • Hot data: high-speed, expensive storage (RAM, SSDs) for the most recent data
  • …
  • Cold data: low-speed, inexpensive storage (e.g. hard disks) for old data; archival data model with high compression
• Metadata is an integral part of the DWH and not an afterthought

Not relevant for exam
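The data life cycle with hot and cold storage areas can be sketched as a simple age-based rule. The thresholds of 90 and 730 days and the intermediate tier name "warm" are assumptions chosen for illustration; the slide itself only names hot and cold data.

```python
from datetime import date


def storage_tier(record_date, today, hot_days=90, cold_days=730):
    """Assign a storage tier by data age (thresholds are illustrative assumptions)."""
    age_in_days = (today - record_date).days
    if age_in_days <= hot_days:
        return "hot"   # most recent data: fast, expensive storage
    if age_in_days >= cold_days:
        return "cold"  # old data: slow, inexpensive storage, high compression
    return "warm"      # hypothetical intermediate tier
```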

OPERATIONAL DATA STORE (ODS)

[Figure: internal and external data sources (OLTP) feed the Data Warehouse, which is divided into Backend and Frontend. Layers: Operational Data Store, Staging Layer (Input Layer), Core Warehouse Layer (Storage Layer), Mart Layer (Output Layer / Reporting Layer). Further components: Metadata Management, Security, DWH Manager incl. Monitor. Annotation: subject-oriented, integrated, time-variant, non-volatile.]

Not relevant for exam

OPERATIONAL DATA STORE (ODS)

• ODS: real-time/right-time layer
• Replication techniques are used to transport data from the source database to the ODS layer with minimal impact on the source system
• Data in the ODS has no history and is stored without any cleansing or integration (1:1 copy from a single source)
• DWH performance is not optimal, as the data model is suited for OLTP and not for reporting requirements
• An ODS normally exists in addition to the Staging / Core DWH / Mart Layers but can also exist alone, without other layers

Not relevant for exam
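The "1:1 copy without history" property of an ODS can be sketched by replaying replicated change events into a mirror. The event format `(operation, key, row)` and the function name are assumptions for illustration; real replication products use their own formats.

```python
def apply_change(ods, change):
    """Apply one replicated change event to the ODS mirror (a dict keyed by primary key)."""
    operation, key, row = change
    if operation == "delete":
        ods.pop(key, None)
    else:
        # Insert and update both overwrite the current row: no history is kept.
        ods[key] = row
```

After replaying all events the mirror holds only the current state of the source, which is why historization has to happen in the Core Warehouse Layer.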

EXAMPLE DWH FOR STATE OF CONSTRUCTION DOCU

Not relevant for exam

ARCHITECTURE FROM AN ACTUAL PROJECT IN THE AUTOMOTIVE INDUSTRY

[Figure: project architecture. Labels: OLTP DB, IIDR Repl Engine Source, Datastore Source, Mirror DB (Operational Data Store), IIDR Repl Engine Mirror, Datastore Mirror, IIDR Repl Engine DWH, Datastore DWH, Backend DWH DB (Staging Layer, Raw + Business Data Vault, Mart Layer), ETL Engine, Frontend (Standard Reports, AdHoc Reports), Logs, TSM.]

Not relevant for exam

END USER SAMPLE QUESTIONS


Which vehicles or aggregates are documented incompletely? (Data quality)

Which vehicles / which control units require SW updates?

Which interiors are most common by region?

Supply data for external simulations, customs clearance, spare part planning, etc.

Not relevant for exam


EXERCISE: RECOMMEND AN ARCHITECTURE

Review the presented data warehouse architectures.

Which architecture would you recommend for

• An online store with real-time/right-time data integration needs
• The marketing department of a bank

List advantages and drawbacks of your proposal.

EXERCISE: RECOMMEND AN ARCHITECTURE

An online store with real-time/right-time data integration needs

• Architecture: Data Vault 2.0
• + Integration of many internal and external source systems (e.g. integrate social media data about the online store)
• + Fast data delivery in the Raw Vault Layer (real-time/right-time data integration). Complex data cleansing / transformation / soft rules are deferred downstream, towards the Mart Layer
• - Transformation overhead (source system data model to Data Vault data model to dimensional data model)

EXERCISE: RECOMMEND AN ARCHITECTURE

Marketing department of a bank

• Architecture: Kimball Bus architecture
• + Start small for one department. If other departments are interested, new data and new Marts can be added on demand
• - High risk of losing the enterprise view, with several DWHs being built

That is still quite a common scenario nowadays. A single Enterprise DWH is often not achieved (e.g. because of mergers and acquisitions, the inflexibility of a single centralized DWH, rapidly changing conditions, etc.), and therefore several DWHs with different architectures very often exist in parallel within a company.

EXERCISE - INTRODUCTION AND DWH ARCHITECTURE: GROUP TASK

• Now imagine that you are preparing an exam.
• Identify 1-3 questions about DWH architecture (and/or DWH introduction) that you would ask in an exam.
• Write down the questions on sticky notes.

SUMMARY DWH ARCHITECTURES

Which layers does the logical standard architecture have?

• Staging (Input), Integration (Cleansing), Core Warehouse (Storage), Aggregation, Mart (Reporting, Output) and additionally Metadata, Security, DWH Manager, Monitor

Which other architectures exist?

• Kimball Bus Architecture (Central data warehouse based on data marts)
• Inmon Corporate Information Factory
• Data Vault 2.0 Architecture (Dan Linstedt)
• DW 2.0: The Architecture for the Next Generation of Data Warehousing
• Operational Data Store (ODS)

BIG DATA ARCHITECTURES

LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE

[Figure: internal and external data sources (OLTP) feed the Data Warehouse, which is divided into Backend and Frontend. Layers: Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer, Mart Layer (Output Layer / Reporting Layer). Further components: Metadata Management, Security, DWH Manager incl. Monitor.]

WHAT IS A DATA LAKE?

• Hadoop
• Dump anything in and wait?
• Hoard hundreds of petabytes in HDFS?


Data Lake on Hadoop

Data Swamp

Data Reservoir

Landing Zone

Data Library

Data Repository

Data Archive

Data Lake on Spark

Data Lake 3.0


IT’S VERY HARD TO GET SPEED AND QUALITY [MARK MADSEN]

Schema-on-write
• RDBMS: create data model first

Schema-on-read
• Hadoop HDFS / NoSQL: create data model later (when reading data)

RDBMS can also work with schema-on-read. Hadoop can also work with schema-on-write.
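The contrast between schema-on-write and schema-on-read can be sketched as follows. SQLite stands in for an RDBMS and a list of JSON lines stands in for raw files in HDFS; the `sensor` table and the `device`/`temp` fields are hypothetical.

```python
import json
import sqlite3


def schema_on_write(conn, rows):
    """The data model is defined first; rows that violate it are rejected at load time."""
    conn.execute("CREATE TABLE sensor (device TEXT NOT NULL, temp REAL NOT NULL)")
    conn.executemany("INSERT INTO sensor VALUES (?, ?)", rows)


def schema_on_read(raw_lines):
    """Raw lines are stored as delivered; structure is imposed only when reading."""
    for line in raw_lines:
        document = json.loads(line)
        # Fields the reader does not need are simply ignored.
        yield document["device"], float(document["temp"])
```

Schema-on-write pays the modeling cost before loading; schema-on-read defers it to every consumer that reads the data.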

WHAT ARE THE DIFFERENCES BETWEEN HADOOP AND DWH?

A dumb question, but there are actually many comparisons like this at the moment

• Hadoop is a tool / technology (or even many tools), like an RDBMS
• A DWH is an architecture and a concept
  • An architecture is an abstraction and defines a goal
• Architecture vs tools / technology

ASSUMPTIONS + REQUIREMENTS HAVE CHANGED

• The data is not always known in advance, so it can’t be modeled in advance. [data can be anywhere → collect everything approach]
• The data architecture must be read-write from both the back and front, not a one-way data flow. The data written back may be repeatedly used, persistent data, or it may be temporary.
• The data may arrive with any frequency, and the rate may not be under your control.

HOW DOES A DATA LAKE DIFFER FROM A DATA SWAMP?

• Naive idea: dump everything in (“landing zone”)
• Data hoarding is not a data management strategy
• A Data Lake brings in structure
  • e.g. create directories in HDFS if Hadoop is used
    • /transient
    • /raw
    • /standardized
    • /use-case specific
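The directory structure from the data lake slide could be enforced with a small path helper such as the one below. The sub-structure below each zone (source system, dataset, load date) and the hyphenated directory name `use-case-specific` are assumptions, not part of the lecture.

```python
from pathlib import PurePosixPath

# Zone names follow the slide; "use-case-specific" is written with hyphens here.
ZONES = ("transient", "raw", "standardized", "use-case-specific")


def zone_path(zone, source, dataset, load_date):
    """Build a lake path so that every file lands in a known zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return PurePosixPath("/") / zone / source / dataset / f"load_date={load_date}"
```

Rejecting unknown zones is one simple way to keep a lake from drifting into a swamp of ad hoc directories.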

WHAT IS A DATA LAKE ACCORDING TO MARK MADSEN? SEPARATE COLLECT / MANAGE / DELIVERY

ZONES INSTEAD OF LAYERS ACCORDING TO MARK MADSEN

• New data of unknown value; simple requests for new data can land here first, with little work by IT. Typically schema-on-read.
• More effort applied to data management: cleansed, curated. Typically schema-on-write.
• Optimized for specific uses / workloads. Personal folders. Schema-on-write.

SCHEMA-ON-WRITE VS SCHEMA-ON-READ REVISITED

• Old approach: Model → Collect → Analyze
• New approach: Collect → Model → Analyze → Promote
  • Promote: move data into the next zone if required (the data has been analyzed and is needed more often)
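The Promote step of the new approach can be sketched as a rule that moves a dataset to the next zone once it is read often enough. The zone order, the read counter and the threshold of 10 reads are assumptions for illustration; the slide only says that data is promoted when it has been analyzed and is needed more often.

```python
# Hypothetical zone order for promotion.
ZONE_ORDER = ["raw", "standardized", "use-case-specific"]


def promote(zone, read_count, threshold=10):
    """Return the zone a dataset should live in, given how often it has been read."""
    position = ZONE_ORDER.index(zone)
    is_last_zone = position == len(ZONE_ORDER) - 1
    if read_count >= threshold and not is_last_zone:
        return ZONE_ORDER[position + 1]
    return zone
```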

DATA VAULT 2.0 ARCHITECTURE – TODAY’S WORLD (DAN LINSTEDT)

DEFINING A DATA LAKE … BY DAN LINSTEDT

https://www.youtube.com/watch?v=tDNjI1Yvqxw

Not relevant for exam

https://www.youtube.com/watch?time_continue=48&v=b9YfsjEjVS8

INTERVIEW WITH “FATHER OF DATA WAREHOUSING” BILL INMON

Data Warehouse and Big Data / DHBWDaimler TSS 56

Not relevant for exam


DATA LAKE TURNED INTO DATA SWAMP

Data Warehouse / DHBWDaimler TSS 57

Source: Ungerer: Cleaning Up the Data Lake with an Operational Data Hub, O’Reilly Media 2018, p.12


DATA LAKE

Data Warehouse / DHBWDaimler TSS 58

Source: Ungerer: Cleaning Up the Data Lake with an Operational Data Hub, O’Reilly Media 2018, p.11

• No agreed definition

• Characteristics [Madsen]:

• Deals with data and schema change easily

• Does not always require up front modeling

• Does not limit the format or structure of data

• Assumes a full range of data latencies, from streaming to one-time bulk loads, both in and out including write-back


Two different styles of work: different zones with different degrees of governance

• Stability & Quality

• Predictable, improving and renovating

• Flexibility & Speed

• Exploratory

BIMODAL DATA GOVERNANCE

Data Warehouse / DHBWDaimler TSS 59

Source: https://www.gartner.com/it-glossary/bimodal/

Not relevant for exam


• No agreed, standardized definition

• Additionally, there are many more buzzwords like Landing Zone, Data Repository, Data Swamp

• Characteristics of a Data Lake architecture according to Madsen:

• Deals with data and schema change easily

• Does not always require up front modeling

• Does not limit the format or structure of data

• Assumes a full range of data latencies, from streaming to one-time bulk loads, both in and out including write-back

• Supports different uses of the same data

WHAT IS A DATA LAKE?

Data Warehouse and Big Data / DHBWDaimler TSS 60


• A physical implementation of data solutions in the Hadoop ecosystem.

• A reservoir of curated, and often unconnected, datasets for data science, data exploration and reference data management use.

• The new Operational Data Store, a pub flap to hide a proper data management architecture.

• A name to indicate the staging area or the persistent staging area of a data warehouse solution.

• A technical solution to process huge volumes of data, for which an RDBMS is often too expensive to deploy.

THE (AB)USE OF THE WORDS „DATA LAKE“: TECHNICAL

Data Warehouse / DHBWDaimler TSS 61

Source: https://www.linkedin.com/pulse/abuse-words-data-lake-martijn-ten-napel/

Not relevant for exam


• A concept that will spray fairy dust upon data management issues that arise from organisational issues, politics or a lack of understanding what working with data requires of an organisation.

• A term to use when you are really out of your comfort zone as an architect and to shelter behind.

• A clever marketing and sales trick, used to sell you a “must have” solution next to what you already have. You are told that without a data lake you will be unable to do data science, digital transformation or business analytics.

THE (AB)USE OF THE WORDS „DATA LAKE“: CYNICAL REALITY

Data Warehouse / DHBWDaimler TSS 62

Source: https://www.linkedin.com/pulse/abuse-words-data-lake-martijn-ten-napel/

Not relevant for exam


SPECIFIC BIG DATA ARCHITECTURES

• There exist well-known reference architectures for Data Warehouses

• Many tools and schema-on-read came with the Hadoop ecosystem

• Was a “black box” at the beginning

• Gets more and more structure with different layers instead of a “black box”

• Structure, modeling, organization, governance instead of tool-only focus

Data Warehouse / DHBWDaimler TSS 63


• Architecture by Nathan Marz

• Realtime and batch processing

• Batch layer stores and historizes raw data

• Serving layer contains batch views

• Query unions the results of the serving layer and the realtime layer

• Rather complex

• Author recommends graph data model and advises against schema-on-read

LAMBDA ARCHITECTURE

Data Warehouse / DHBWDaimler TSS 64


• Lack of human fault tolerance

• Bugs will be deployed

• Operational errors will happen, e.g. accidentally deleting data

• Data loss / corruption is worst case scenario

• Without data loss / corruption, mistakes can be fixed if original data is still available

• Must design for human error like you’d design for any other fault

WHAT IS THE MAIN INEVITABLE PROBLEM IN DATA SYSTEMS?

Data Warehouse and Big Data / DHBWDaimler TSS 65


WHAT IS THE LAMBDA ARCHITECTURE?

Data Warehouse and Big Data / DHBWDaimler TSS 66

Batch Layer

All Data (Master data set)

Speed Layer

RealTime Views

Serving Layer

Batch Views

Query(merge)

Data Stream


• Atomic, immutable data

• Ensure that data cannot be deleted (accidentally)

• Fundamentally simpler

• Easy to implement on top of a distributed filesystem, e.g. Hadoop

• CR instead of CRUD

• No updates

• No deletes

• Create (insert) and read (select) only

Immutability restricts the range of errors that can cause data loss/corruption
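The "CR instead of CRUD" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `Fact` and `AppendOnlyStore` are hypothetical and not part of the lecture, and a real master data set would live on a distributed filesystem rather than in a Python list. Note that `frozen=True` only prevents reassigning fields; a mutable payload such as a dict could still be changed in place.

```python
from dataclasses import dataclass
from typing import Any, Iterator, List


@dataclass(frozen=True)  # frozen: fields of a stored fact cannot be reassigned
class Fact:
    timestamp: str
    payload: Any


class AppendOnlyStore:
    """Hypothetical master data set that offers create and read only."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def create(self, timestamp: str, payload: Any) -> Fact:
        # Create (insert): the only way to change the store is to add a fact.
        fact = Fact(timestamp, payload)
        self._facts.append(fact)
        return fact

    def read(self) -> Iterator[Fact]:
        # Read (select): iterate over a copy so callers cannot touch the list.
        return iter(tuple(self._facts))


store = AppendOnlyStore()
store.create("2019-01-01", {"user": "Jim", "action": "follow"})
# A correction is recorded as a new fact, not as an update or delete.
store.create("2019-01-02", {"user": "Jim", "action": "unfollow"})
facts = list(store.read())
```

Because the class exposes no update or delete method, an accidental delete would have to bypass the interface entirely, which is the restriction of error range the slide refers to.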

IMMUTABLE DATA: MASTER DATA SET

Data Warehouse and Big Data / DHBWDaimler TSS 67


NATHAN MARZ VIEW ON SCHEMA

Data Warehouse / DHBWDaimler TSS 68

• Rawness: Store the data as it is. No transformations.

• Immutability: Don’t update or delete data, just add more.

• Graph-like schema recommended

„Many developers go down the path of writing their raw data in a schemaless format like JSON. This is appealing because of how easy it is to get started, but this approach quickly leads to problems. Whether due to bugs or misunderstandings between different developers, data corruption inevitably occurs“ (see page 103, Nathan Marz, „Big Data: Principles and best practices of scalable realtime data systems", Manning Publications)

Source image: Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications 2015


“My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.”–Jules J. Berman.

ANALYTICS VS RE-ANALYTICS

Data Warehouse and Big Data / DHBWDaimler TSS 69

Source: http://www.odbms.org/blog/2014/07/big-data-science-interview-jules-j-berman/


NEW APPROACH TO BUILD DATA SYSTEMS

Data Warehouse and Big Data / DHBWDaimler TSS 70

[Figure: Immutable data (Source of Truth / Single version of facts), stored in HDFS / NoSQL, RDBMS or NewSQL, feeds several (Materialized) Views on data; Query = Application]

Not relevant for exam


BATCH VIEWS AND REALTIME VIEWS

Data Warehouse and Big Data / DHBWDaimler TSS 71


DATA ABSORPTION

Data Warehouse and Big Data / DHBWDaimler TSS 72

https://dzone.com/articles/lambda-architecture-with-apache-spark


LAMBDA ARCHITECTURE – DATA EXAMPLE: COMPUTE FOLLOWER LIST

Data Warehouse and Big Data / DHBWDaimler TSS 73

Data Stream → Batch Layer and Speed Layer

Batch Layer: 1.1. insert Jim; 1.1. insert Anne; 2.1. remove Jim; 3.1. insert George; 5.1. insert John (now)

Speed Layer: insert John (now)

Serving Layer: Anne, George

Query: Anne, George, John

Followers: 3
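The follower example can be replayed in a short Python sketch. It assumes that the last batch run has processed the first four events and that only the newest event ("insert John") is held by the speed layer; the function and variable names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Event = Tuple[str, str, str]  # (date, operation, follower)

# Master data set: every event is appended, nothing is updated in place.
master_data: List[Event] = [
    ("1.1.", "insert", "Jim"),
    ("1.1.", "insert", "Anne"),
    ("2.1.", "remove", "Jim"),
    ("3.1.", "insert", "George"),
    ("5.1.", "insert", "John"),  # arrived just now, not yet seen by a batch run
]


def apply_events(events: Iterable[Event], start: Iterable[str] = ()) -> Set[str]:
    """Fold insert/remove events into a set of current followers."""
    followers = set(start)
    for _date, operation, name in events:
        if operation == "insert":
            followers.add(name)
        elif operation == "remove":
            followers.discard(name)
    return followers


# Batch layer: recomputes the batch view from the raw events it has seen so far.
batch_view = apply_events(master_data[:4])

# Speed layer: holds only the events that arrived after the last batch run.
recent_events = master_data[4:]


def query() -> Set[str]:
    # The query merges the batch view (serving layer) with the recent events.
    return apply_events(recent_events, start=batch_view)


followers = query()
print(sorted(followers), len(followers))  # ['Anne', 'George', 'John'] 3
```

Once the next batch run has processed the John event, the batch view alone would return all three followers and the speed layer entry could be discarded.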


Computing best answer in real time may not always be possible

• Can compute exact answer in batch layer and approximate answer in realtime layer

• Best of both worlds of performance and accuracy

For example, a machine learning application where generation of the batch model requires so much time and resources that the best result achievable in real time is computing approximate updates of that model. In such cases, the batch and real-time layers cannot be merged, and the Lambda architecture must be used.

EVENTUAL ACCURACY

Data Warehouse and Big Data / DHBWDaimler TSS 74


WHICH TOOLS COULD BE USED IN THE LAMBDA ARCHITECTURE?

Data Warehouse and Big Data / DHBWDaimler TSS 75

Batch Layer

All Data

Speed Layer

RealTime Views

Serving Layer

Batch Views

Query(merge)

Data Stream

Not relevant for exam


CLOUD VENDORS OFFERINGS, E.G. AMAZON AWS

Data Warehouse and Big Data / DHBWDaimler TSS 76

https://aws.amazon.com/de/blogs/big-data/unite-real-time-and-batch-analytics-using-the-big-data-lambda-architecture-without-servers/

Not relevant for exam


CLOUD VENDORS OFFERINGS, E.G. MICROSOFT AZURE

Data Warehouse and Big Data / DHBWDaimler TSS 77

https://social.technet.microsoft.com/wiki/contents/articles/33626.lambda-architecture-implementation-using-microsoft-azure.aspx

Not relevant for exam


Data Warehouse and Big Data / DHBWDaimler TSS 78

Source: Markus Schmidberger - Big Data ist tot – Es lebe Business Intelligenz?, TDWI 2016

Not relevant for exam


Data Warehouse and Big Data / DHBWDaimler TSS 79

Source: Markus Schmidberger - Big Data ist tot – Es lebe Business Intelligenz?, TDWI 2016

Lambda @glomex

• Enrich batch-driven data processing with real-time requirements

• Adapt Lambda architecture to own requirements

Remark: AWS Lambda ≠ Lambda architecture

Not relevant for exam


LAMBDA ARCHITECTURE – PROS AND CONS

Data Warehouse and Big Data / DHBWDaimler TSS 80

Pro:

• Architecture emphasizes keeping data immutable. Mistakes can be corrected via recomputation

• Reprocessing is one of the key challenges of stream processing but is very often ignored

Con:

• Maintaining code that needs to produce the same result in two complex distributed systems

• Operational burden of running and debugging two systems


• Incoming data is sent to batch and speed layer

• Batch layer constantly (re-) computes batch views

• Master data is stored in the batch layer in raw format: immutable & append-only

• Contains data except most up-to-date data due to high latency

• Replaces data in speed layer as soon as batch layer data is newer than the data in the speed layer

• Speed layer uses incremental algorithms to refresh real-time views

• Receives data stream for real-time processing

• Contains most up-to-date data only

• Serving layer contains views on batch layer data

• Merge is done in the query or in the serving layer

SUMMARY LAMBDA ARCHITECTURE

Data Warehouse and Big Data / DHBWDaimler TSS 81


Jay Kreps wrote an article about „Questioning the Lambda architecture“

• He wrote about his experience with the Lambda architecture

• It works, but not very pleasant or productive

• Keeping code in sync is really hard

• Need to build complex, low-latency processing systems

• Scalable high-latency batch system

• Low-latency stream-processing system

• Instead of duct taping batch & speed:

→ Kappa architecture / Log-centric architecture / Stream data platform

QUESTIONING THE LAMBDA ARCHITECTURE

Data Warehouse and Big Data / DHBWDaimler TSS 82

Source: https://www.oreilly.com/ideas/questioning-the-lambda-architecture


• Architecture by Jay Kreps

• Log-centric, write-ahead logging

• Each event is an immutable log entry and is added to the end of the log

• Read and write operations are separated

• Materialized views can be recomputed consistently from data in the log
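The log-centric idea can be sketched as follows. This is a minimal in-memory stand-in, not an implementation of any real log system: the list `log`, the functions `append` and `materialize`, and the pageview events are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

# Append-only event log (in-memory stand-in for a distributed log).
log: List[Dict[str, str]] = []


def append(event: Dict[str, str]) -> int:
    """Write path: add an immutable entry at the end and return its offset."""
    log.append(dict(event))
    return len(log) - 1


def materialize(key_of: Callable[[Dict[str, str]], str], from_offset: int = 0) -> Counter:
    """Read path: build a view by replaying the log from a given offset."""
    view: Counter = Counter()
    for event in log[from_offset:]:
        view[key_of(event)] += 1
    return view


for page in ["/home", "/cart", "/home"]:
    append({"type": "pageview", "page": page})

# Version 1 of the processing logic: count views per page.
view_v1 = materialize(lambda event: event["page"])

# The processing logic changes: count per event type instead.
# No second code path is needed, the same log is simply replayed.
view_v2 = materialize(lambda event: event["type"])
```

Both views are derived from the same entries, which is why they can be recomputed consistently whenever the processing logic changes.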

KAPPA ARCHITECTURE

Data Warehouse / DHBWDaimler TSS 83


Lambda architecture

• Complex

• It works, but not very pleasant or productive: keeping code in sync is really hard

• Addresses the challenge of reprocessing

• Batch views to contain updated data (a Lambda architecture recommendation)

• Bugs (true requirement for reprocessing)

• New user requirements (true „requirement“ for reprocessing)

Logs unify

• Batch processing + Stream processing

• In Kappa, reprocessing required only when processing logic has been modified

REPROCESSING

Data Warehouse and Big Data / DHBWDaimler TSS 84


KAPPA ARCHITECTURE

Data Warehouse / DHBWDaimler TSS 85

Source: https://www.oreilly.com/ideas/questioning-the-lambda-architecture


SUMMARY: ENTERPRISE-WIDE ARCHITECTURE

Data Warehouse and Big Data / DHBWDaimler TSS 86

Source: Jay Kreps: I heart logs, O’Reilly 2014


• The LDW is a multi-server / multi-engine architecture

• The LDW can have multiple engines, the main ones being: the data warehouse (DW), data marts and the data lake

• We should not be trying to choose a single engine for all our requirements. Instead, we should be sensibly distributing all our requirements across the various components we have to choose from. Reasons:

• Regulatory constraints

• Organizational constraints

LOGICAL ARCHITECTURE (GARTNER)

Data Warehouse / DHBWDaimler TSS 87

https://blogs.gartner.com/henry-cook/2018/01/18/logical-data-warehouse-project-plans/

Not relevant for exam


VIRTUAL ARCHITECTURE

Data Warehouse / DHBWDaimler TSS 88

[Figure: Virtual Data Warehouse with frontend and backend; labels: internal data sources, external data sources, OLTP, Data Lake, DWH, Query Management (views, alias names, metadata). Properties: weakly and partly subject-oriented, weakly and partly integrated, not time-variant, not non-volatile]

Not relevant for exam


• Data not extracted from operational systems and stored separately

• Standardized interface for all operational data sources

• One "GUI" for all existing data

• Generates combined queries

• Query Processor joins query result data from different sources

• Can also access data in Hadoop (Polybase, Big SQL, BigData SQL, etc)
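How a query processor might combine results from separate sources can be sketched with two in-memory SQLite databases that stand in for independent operational systems. The table names, the data and the function `revenue_per_customer` are assumptions made for illustration; a real virtual DWH would use a federation engine rather than hand-written Python.

```python
import sqlite3

# Two independent "operational systems", simulated as in-memory SQLite databases.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Anne"), (2, "George")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoice (customer_id INTEGER, amount REAL)")
billing.executemany(
    "INSERT INTO invoice VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)]
)


def revenue_per_customer() -> dict:
    """Send one sub-query to each source and join the results in the processor."""
    names = dict(crm.execute("SELECT id, name FROM customer"))
    totals = billing.execute(
        "SELECT customer_id, SUM(amount) FROM invoice GROUP BY customer_id"
    )
    # The join happens here, outside both sources; no data is stored separately.
    return {names[customer_id]: total for customer_id, total in totals}


result = revenue_per_customer()
```

Every call hits both source systems again, which illustrates why repeated analytical queries can load the operational systems.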

VIRTUAL ARCHITECTURE

Data Warehouse / DHBWDaimler TSS 89

Not relevant for exam


• Query Management manages metadata about all operational systems

• (physical) location of data and algorithms for extracting data from OLTP system

• Implementation easier

• Low cost: can use existing hardware infrastructure

• Queries cause significant performance problems in operational systems

• Known problems when analyzing operational data directly

• Same query is processed multiple times (if queried multiple times)

• Same query delivers different results when processed at different times

VIRTUAL ARCHITECTURE

Data Warehouse / DHBWDaimler TSS 90

Not relevant for exam


DATA MANAGEMENT ACCORDING TO BARC: FLEXIBILITY CAUSES HETEROGENEITY

Data Warehouse / DHBWDaimler TSS 91

Source: https://www.datafestival.de/events/proof-of-concepts-als-strategietool-erfahrungen-aus-der-medienbranche-de/; Jacqueline Bloemen, Datenarchitektur für Business Analytics – was Sie berücksichtigen sollten, DataFestival 2019

Not relevant for exam


Daimler TSS GmbH, Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS

Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle

Data Warehouse / DHBWDaimler TSS 92

THANK YOU


USE CASE: ANALYSIS BATTERY AGING

Data Warehouse and Big Data / DHBWDaimler TSS 93

[Figure labels: Max capacity, Current capacity]

• JSON data ingested into HDFS, Hive tables on JSON files

• Identify breaks (“> 8h”) and compute current drain
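One possible reading of the second bullet, sketched in Python: a "break" is a gap of more than 8 hours between two consecutive readings, and the drain is the capacity lost per hour during that gap. The record layout (timestamp, capacity in percent) and the function name are assumptions; the actual use case ran on Hive tables over JSON files.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Reading = Tuple[datetime, float]  # (timestamp, capacity in percent), assumed layout


def drains_during_breaks(
    readings: List[Reading], min_break: timedelta = timedelta(hours=8)
) -> List[Tuple[datetime, datetime, float]]:
    """Return (break start, break end, capacity loss per hour) for long gaps."""
    ordered = sorted(readings)
    result = []
    for (start, capacity_before), (end, capacity_after) in zip(ordered, ordered[1:]):
        gap = end - start
        if gap > min_break:
            hours = gap.total_seconds() / 3600
            result.append((start, end, (capacity_before - capacity_after) / hours))
    return result


readings = [
    (datetime(2019, 1, 1, 18, 0), 80.0),
    (datetime(2019, 1, 2, 6, 0), 77.0),   # 12 hours after the previous reading
    (datetime(2019, 1, 2, 6, 30), 76.5),  # 30 minutes later, not a break
]
breaks = drains_during_breaks(readings)
```

With the sample readings, one break is found and the drain is 3 percentage points over 12 hours, that is 0.25 per hour.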


• Sensor data formats change without notice

• Sensors get regularly updated with new versions

• Names of metrics may change

• Sensors with various versions in the field

• Sensors from different suppliers

• Often many fields >>100 and increasing with new sensor versions

• Easy storing of data in HDFS and applying schema later

• Data from Robots, vehicles, …
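Storing raw JSON first and applying a schema later can be sketched as follows. The alias table, the metric names and the sample records are hypothetical; the sketch only shows how renamed metrics from different sensor versions might be mapped to one logical schema at read time.

```python
import json
from typing import Dict, Iterable, Iterator

# Hypothetical aliases: one logical metric under the names of different sensor versions.
ALIASES: Dict[str, str] = {
    "temp": "temperature_c",
    "temperature": "temperature_c",
    "volt": "voltage_v",
    "voltage": "voltage_v",
}

# Raw data is stored unchanged, one JSON document per line.
raw_lines = [
    '{"sensor": "a1", "version": 1, "temp": 21.5, "volt": 3.9}',
    '{"sensor": "a1", "version": 2, "temperature": 22.0, "voltage": 3.8, "humidity": 40}',
]


def read_with_schema(lines: Iterable[str]) -> Iterator[dict]:
    """Apply the schema when reading: rename known metrics, keep unknown fields."""
    for line in lines:
        record = json.loads(line)
        yield {ALIASES.get(key, key): value for key, value in record.items()}


rows = list(read_with_schema(raw_lines))
```

New fields such as `humidity` pass through untouched, so a new sensor version does not break ingestion; only the alias table needs maintenance when names change.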

STRUCTURING THE DATA LAKE: NEW DATA SOURCES – SENSOR DATA

Data Warehouse and Big Data / DHBWDaimler TSS 94


• Sensor data formats change without notice

• Time-consuming and error-prone data integration into the Data Lake

• Therefore preparation of data for usage in the Data Reservoir required: “Data Engineer”

STRUCTURING THE DATA LAKE: “SCHEMA-ON-READ”

Data Warehouse and Big Data / DHBWDaimler TSS 95

[Figure: Data Lake zones Raw data, Enhanced data and Consumption; labels: json, Sampling / filter, Structure, Hive tables, R, Python; cross-cutting: Data Governance, Metadata Management, Data Archival, Data Security]


A holding of 3 telecommunication companies

• Architecture: Virtual Data Warehouse

• + Companies may not want to provide their data to a new storage

• + Can easily be extended if new companies join the holding or reduced if a company leaves the holding

• - Bad performance

• - Not really data integration achieved, low Data Quality

• - Firewalls have to be opened

EXERCISE: RECOMMEND AN ARCHITECTURE

Data Warehouse / DHBWDaimler TSS 96


WHERE TO USE HADOOP AND SPARK?

Data Warehouse / DHBWDaimler TSS 97

Source: Rick F. van der Lans: New Data Storage Technologies, TDWI Munich 2018


WHERE TO USE NOSQL?

Data Warehouse / DHBWDaimler TSS 98

Source: Rick F. van der Lans: New Data Storage Technologies, TDWI Munich 2018


Apple: multiple Petabytes

• Customer insights: who’s who and what are the customers up to

Walmart: 300TB (2003), several PB today

• It tells suppliers, “You have three feet of shelf space. Optimize it.”

eBay: >10PB, 100s of production DBs fed in

• Get better understanding of customers

Most DWHs are much smaller though. For huge and small DWHs: High challenges to architect + develop + maintain + run such complex systems

Source: https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/ and http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/

EXAMPLES OF DATA WAREHOUSES IN THE INDUSTRY

Data Warehouse / DHBWDaimler TSS 99


• Data Lake: Architecture, concept

• Hadoop, Spark, Elastic Stack: Tools (that can be used to implement a Lake)

DATA LAKE VS HADOOP

Data Warehouse and Big Data / DHBWDaimler TSS 100


DWH AND DATA LAKE

Data Warehouse and Big Data / DHBWDaimler TSS 101

DWH on RDBMS: Methods, Concepts, Techniques

Slowly Changing Dimension, ELT vs ETL, 3-Layer vs 2-Layer, Kimball Approach, Inmon Definition, Star Schema, Data Vault, Anchor Modeling, etc.

Data Lake on Hadoop: Tools, Tools, Tools

Schema-on-Read, Agility, Parquet, Hive, HBase, SQL-on-Hadoop, Impala, Oozie, ZooKeeper


DATA LAKE (MARTIN FOWLER)

Data Warehouse / DHBWDaimler TSS 102

Source: https://martinfowler.com/bliki/DataLake.html


DATA LAKE (MARTIN FOWLER)

Data Warehouse / DHBWDaimler TSS 103

Source: https://martinfowler.com/bliki/DataLake.html


A Data Lake acquires data from multiple sources in an enterprise in its native form and may also have internal, modeled forms of this same data for various purposes. The information thus handled could be any type of information, ranging from structured or semi-structured data to completely unstructured data. A Data Lake is expected to be able to derive enterprise-relevant meanings and insights from this information using various analysis and machine learning algorithms.

WHAT IS A DATA LAKE?

Data Warehouse / DHBWDaimler TSS 104

Source: Pankaj Misra, Tomcy John: Data Lake for Enterprises Packt 2017


DATA LAKE LIFE CYCLE

Data Warehouse / DHBWDaimler TSS 105

Source: Pankaj Misra, Tomcy John: Data Lake for Enterprises Packt 2017


DATA LAKE (ECKERSON GROUP)

Data Warehouse / DHBWDaimler TSS 106

Source: https://www.eckerson.com/articles/ten-characteristics-of-a-modern-data-architecture


LAMBDA ARCHITECTURE

Data Warehouse / DHBWDaimler TSS 107

Source:

[Figure: Batch Layer (Batch Engine, raw history data), Real-Time Layer (Real-Time Engine) and Serving Layer (Serving Backend, result data) answering Queries]


OVERVIEW LAMBDA ARCHITECTURE

Data Warehouse and Big Data / DHBWDaimler TSS 108

https://de.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action


WHICH TOOLS COULD BE USED IN THE LAMBDA ARCHITECTURE? DB REQUIREMENTS

Data Warehouse and Big Data / DHBWDaimler TSS 109

Easier to implement for database vendors compared to random access


Batch layer: write sequential once; bulk sequential read many times

Speed layer: random write; random read

Serving layer: batch write; random read

WHICH TOOLS COULD BE USED IN THE LAMBDA ARCHITECTURE? DB REQUIREMENTS

Data Warehouse and Big Data / DHBWDaimler TSS 110

More challenging: random write and random read


WHAT ARE THE THREE PARADIGMS OF PROGRAMMING?

• Request/Response

• Batch

• Stream processing

Data Warehouse and Big Data / DHBWDaimler TSS 111

Source: https://de.slideshare.net/JayKreps1/distributed-stream-processing-with-apache-kafka-71737619


WHAT IS A LOG?

Data Warehouse and Big Data / DHBWDaimler TSS 112

Source: Jay Kreps: I heart logs, O’Reilly 2014


• Database transactions/data

• Users, products, etc.

• Events

• Tweets, clicks, impressions, pageviews, etc.

• Application metrics

• CPU usage, requests, etc.

• Application logs

• Service calls, errors, etc.

TYPES OF DATA

Data Warehouse and Big Data / DHBWDaimler TSS 113


• A bank account’s current balance can be built from a complete list of its debits and credits, but the inverse is not true.

• In this way, the log of transactions is the more “fundamental” data structure than the database records storing the results of those transactions.

A software application’s database is better thought of as a series of time-ordered immutable facts collected since that system was born, instead of as a current snapshot of all data records as of right now.
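The bank account example can be written down directly. The entries and the function name `balance` are made up for illustration; the point is only that the balance is a derived value, while the log is the source.

```python
from decimal import Decimal
from typing import Iterable, Tuple

Entry = Tuple[str, Decimal]  # ("credit" or "debit", amount)

# Time-ordered, immutable log of transactions (sample values are made up).
log = [
    ("credit", Decimal("100.00")),
    ("debit", Decimal("30.00")),
    ("credit", Decimal("5.50")),
]


def balance(entries: Iterable[Entry]) -> Decimal:
    """Derive the balance by replaying credits and debits in order."""
    total = Decimal("0")
    for kind, amount in entries:
        total += amount if kind == "credit" else -amount
    return total


current = balance(log)            # balance as of now
after_two = balance(log[:2])      # balance as of any earlier point in the log
```

Replaying a prefix of the log yields the balance at an earlier point in time, which a current snapshot alone cannot provide.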

WHAT IS A LOG?

Data Warehouse and Big Data / DHBWDaimler TSS 114

Source: https://blog.parse.ly/post/1550/kreps-logs/


ENTERPRISE-WIDE ARCHITECTURE: USING LOG-CENTRIC APPROACH


Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/

Page 116: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

ENTERPRISE-WIDE ARCHITECTURE
USING LOG-CENTRIC APPROACH AND CDC


Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/

Page 117: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

ENTERPRISE-WIDE ARCHITECTURE TODAY
DATA INTEGRATION CHALLENGE


Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/

Page 118: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

ENTERPRISE-WIDE ARCHITECTURE TODAY
DATA INTEGRATION CHALLENGE


Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/

Page 119: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

ENTERPRISE-WIDE ARCHITECTURE TODAY
DATA INTEGRATION CHALLENGE


Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/

Page 120: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

LOG CENTRIC ARCHITECTURE / KAPPA ARCHITECTURE / DATA STREAMING PLATFORM


Source: Jay Kreps: I heart logs, O’Reilly 2014

Page 121: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

WHAT IS STREAM PROCESSING?

Most common understanding (doing the "T" in ETL):

• Stream processing is the parallel processing of data in motion, i.e. computing on data directly as it is produced or received.

• It is not necessarily transient, approximate, or lossy (assumptions carried over from the Lambda architecture and other event processing systems).


Page 122: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

MESSAGING SYSTEMS @LINKEDIN

1st attempt

• ActiveMQ, RabbitMQ

• Problems:
  • Not distributed
  • Throughput
  • Persistence
  • Ordering

2nd attempt

• Kafka

• Key abstraction: logs

• Built from scratch
  • Distributed system by design
  • Partitioning with local ordering
  • Elastic scaling
  • Fault tolerance


Page 123: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

KAFKA: A MODERN DISTRIBUTED SYSTEM FOR STREAMS

• Scalability
  • Hundreds of MB/sec/server throughput
  • Many TB per node

• Guarantees of a database
  • All messages strictly ordered (within a partition)
  • All data persistent

• Distributed by default
  • Replication
  • Partitioning
  • Producers + consumers all fault tolerant and horizontally scalable
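A minimal in-memory sketch of what "strictly ordered within a partition" means. This is not Kafka's API: the `PartitionedLog` class is hypothetical, and CRC32 is used only as a deterministic stand-in for a key hash.

```python
import zlib
from typing import List, Tuple

Record = Tuple[str, str]  # (key, value)


class PartitionedLog:
    """Toy append-only log split into partitions by record key."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self.partitions: List[List[Record]] = [[] for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # The same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Append a record and return (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int = 0) -> List[Record]:
        """Read one partition in append order, starting at the given offset."""
        return self.partitions[partition][offset:]


if __name__ == "__main__":
    log = PartitionedLog(num_partitions=3)
    for key, value in [("acct-1", "debit 10"), ("acct-2", "credit 5"), ("acct-1", "credit 7")]:
        print(key, log.append(key, value))
```

Records with the same key keep their relative order because they land in the same partition. No order is defined across partitions, which is the trade-off that allows partitions to be spread over many servers.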


Page 124: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

ETL REVISITED WITH KAFKA CONNECT AND KAFKA STREAMS


Source: https://de.slideshare.net/JayKreps1/distributed-stream-processing-with-apache-kafka-71737619

Page 125: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

WHAT ARE STREAM PROCESSING FRAMEWORKS?

• Apache Spark Streaming

• Apache Storm (Twitter, Nathan Marz)

• Apache Samza (LinkedIn)

• Apache Flink

• Apache Kafka Streams
  • Simple library
  • Reprocessing
  • No microbatch = everything is a stream
  • Local state
  • Key operations: filter, aggregate, join
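The three key operations can be sketched with plain Python generators. This is not the Kafka Streams API (a Java library); the event shape, function names, and sample data are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple

Click = Tuple[str, str]          # (user_id, page)
Enriched = Tuple[str, str, str]  # (user_id, page, country)


def only_product_pages(clicks: Iterable[Click]) -> Iterator[Click]:
    """filter: drop events that are not of interest."""
    for user_id, page in clicks:
        if page.startswith("/product/"):
            yield user_id, page


def enrich_with_country(clicks: Iterable[Click], countries: Dict[str, str]) -> Iterator[Enriched]:
    """join: look up each event's key in a table held next to the stream."""
    for user_id, page in clicks:
        yield user_id, page, countries.get(user_id, "unknown")


def count_by_country(events: Iterable[Enriched]) -> Iterator[Tuple[str, int]]:
    """aggregate: keep a running count in local state, emit an update per event."""
    counts: Dict[str, int] = {}
    for _, _, country in events:
        counts[country] = counts.get(country, 0) + 1
        yield country, counts[country]


if __name__ == "__main__":
    clicks = [("u1", "/product/1"), ("u2", "/home"), ("u2", "/product/2"),
              ("u1", "/product/3"), ("u3", "/product/1")]
    countries = {"u1": "DE", "u2": "FR"}
    pipeline = count_by_country(enrich_with_country(only_product_pages(clicks), countries))
    for update in pipeline:
        print(update)
```

Each event flows through the whole pipeline as soon as it is read, so there is no micro-batch; the `counts` dictionary plays the role of local state.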


Page 126: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

KAFKA @LINKEDIN


Source: https://de.slideshare.net/JayKreps1/i-32858698

Page 127: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

SUMMARY: STREAM DATA PLATFORM USING KAFKA
UNIFYING BATCH AND STREAM PROCESSING


Source: https://de.slideshare.net/JayKreps1/i-32858698

Page 128: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

KAFKA AND MICROSERVICES

• A message-oriented implementation requires an efficient messaging backbone that facilitates the exchange of data in a reliable and secure way with the lowest latency possible.

• Creating small, self-contained, data-driven applications that meld streaming data and microservices together is a good practice to break down large problems and projects into approachable chunks, reduce risk, and deliver value faster.

• Think of combinations of data-processing applications with microservices to deliver specific features and insights from a data stream.


Page 129: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

KAPPA ARCHITECTURE


[Figure: Kappa architecture. Real-Time Layer (Real-Time Engine, raw history data), Serving Layer (Serving Backend, result data), Data Queries]
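A minimal sketch of reprocessing in a Kappa-style setup: the raw history is retained as a log, and changing the logic means replaying that log through a new job version and then switching the serving layer to the new result. The event fields, job functions, and the `serving` dictionary are hypothetical.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Table = Dict[str, int]

# Retained raw history (the log).
RAW_LOG: List[Event] = [
    {"user": "u1", "amount": 10},
    {"user": "u2", "amount": 5},
    {"user": "u1", "amount": 7},
]


def job_v1(event: Event, table: Table) -> None:
    """First version of the logic: count events per user."""
    user = str(event["user"])
    table[user] = table.get(user, 0) + 1


def job_v2(event: Event, table: Table) -> None:
    """Changed logic: sum amounts per user."""
    user = str(event["user"])
    table[user] = table.get(user, 0) + int(event["amount"])


def replay(log: List[Event], job: Callable[[Event, Table], None], from_offset: int = 0) -> Table:
    """Run a job over the log from the given offset and return its result table."""
    table: Table = {}
    for event in log[from_offset:]:
        job(event, table)
    return table


if __name__ == "__main__":
    serving = {"current": replay(RAW_LOG, job_v1)}
    print(serving["current"])
    # Reprocess from offset 0 with the new logic, then switch what is served.
    serving["current"] = replay(RAW_LOG, job_v2)
    print(serving["current"])
```

The same code path handles both live processing and reprocessing, which is why no separate batch layer is needed in this style.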

Page 130: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

PS-3C ARCHITECTURE


Source: TDWI 2016

[Figure: PS-3C architecture. Data Library (Storage Engine, raw history data), 3C Layer (Preparation Engine, integrated subjects), Serving Layer (Delivery Engine, result data), Data Queries]

Page 131: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

PS-3C ARCHITECTURE

• Architecture by Rogier Werschkull

• Store incoming data in the Data Library layer (persistent staging = PS)

• Prepare data in a 3C layer following the “Concept – Context – Connector” model

• Concept + Connector can be virtualized on the data in the Data Library layer


Page 132: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

POLYGLOT WAREHOUSE

• Architecture by Joe Caserta

• Big Data Warehouse may live in one or more platforms on premises or in the cloud
  • Hadoop only
  • Hadoop + MPP or RDBMS
  • Additionally NoSQL or Search


Page 133: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

POLYGLOT WAREHOUSE


Source: https://www.slideshare.net/CasertaConcepts/hadoop-and-your-data-warehouse

Page 134: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

THE EXTENDED DATA WAREHOUSE ARCHITECTURE (XDW)
THE ENTERPRISE ANALYTICS ARCHITECTURE

• Architecture by Claudia Imhoff

• Combines the stability and reliability of BI architectures while embracing new and innovative technologies and techniques

• 3 components that extend the EDW environment
  • Investigative computing platform
  • Data refinery
  • Real-time (RT) analysis platform


Page 135: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

THE EXTENDED DATA WAREHOUSE ARCHITECTURE (XDW)
THE ENTERPRISE ANALYTICS ARCHITECTURE


Source: https://upside.tdwi.org/articles/2016/03/15/extending-traditional-data-warehouse.aspx

Page 136: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

GARTNER DATA LAKE ARCHITECTURE STYLES

• Inflow Lake: accommodates a collection of data ingested from many different sources that are disconnected outside the lake but can be used together by being colocated within a single place

• Outflow Lake: a landing area for freshly arrived data available for immediate access or via streaming. It employs schema-on-read for the downstream data interpretation and refinement.

• Data Science Lab: most suitable for data discovery and for developing new advanced analytics models
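Schema-on-read, as mentioned for the Outflow Lake, can be sketched as follows: raw lines are stored untouched, and a schema is applied only when a consumer reads them. The field names (`vin`, `speed_kmh`, `speed`) and the sample lines are invented for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, Union

# Raw data is landed as-is, without validation.
RAW_LINES = [
    '{"vin": "A1", "speed_kmh": "87", "ts": "2019-03-01T10:00:00"}',
    '{"vin": "B2", "speed": 92.5}',
    'not json at all',
]


def read_speeds(lines: Iterable[str]) -> Iterator[Dict[str, Union[str, float]]]:
    """Apply one consumer's schema (vin: str, speed_kmh: float) at read time."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable for this consumer; the raw line stays in the lake
        if not isinstance(record, dict):
            continue
        raw_speed = record.get("speed_kmh", record.get("speed"))
        if raw_speed is None:
            continue
        yield {"vin": str(record.get("vin", "")), "speed_kmh": float(raw_speed)}


if __name__ == "__main__":
    for row in read_speeds(RAW_LINES):
        print(row)
```

A different consumer could read the same raw lines with a different schema, which is the flexibility schema-on-read buys at the cost of doing interpretation work on every read.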


Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/

Page 137: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

GARTNER DATA LAKE ARCHITECTURE STYLES


Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/

Page 138: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

GARTNER – THE LOGICAL DWH

Source: https://blogs.gartner.com/henry-cook/2018/01/28/the-logical-data-warehouse-and-its-jobs-to-be-done/

Page 139: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

SUMMARY


[Figure: summary architecture. Landing Area (Storage Engine, raw history data), Data Lake (Integration Engine, lightly integrated data), Data Presentation (Delivery Engine, result data), Data Queries]

Page 140: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

DATA CORE – DAVE WELLS

• Architecture by Eckerson Group

• DWHs exist together with MDM, ODS, and portions of the data lake as a collection of data that is curated, profiled, and trusted for enterprise reporting and analysis

Source: https://www.eckerson.com/articles/the-future-of-the-data-warehouse

Page 141: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

DATA CORE – DAVE WELLS

Source: https://www.eckerson.com/articles/the-future-of-the-data-warehouse

Page 142: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

DATA LIFECYCLE – DAVE WELLS

Source: https://www.eckerson.com/articles/the-future-of-the-data-warehouse

Page 143: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

DWH AND DATA LAKE – DAVE WELLS
IN PARALLEL VS INSIDE

Source: https://www.eckerson.com/articles/the-future-of-the-data-warehouse

Page 144: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

DIMENSIONAL MODELING IN THE AGE OF BIG DATA

• Dimensional modeling is not dead.

• The benefits are still valid in the age of Big Data, Hadoop, Spark, etc.:
  • Eliminate joins
  • Data model is understandable for end users
  • Well-suited for columnar storage + processing (e.g. SIMD)

• Nesting technique
  • E.g. tables with lower granularity can be nested into a large fact table
  • Usage in SQL: Flatten(kvgen(<json>))
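The nesting technique can be sketched in plain Python: child rows are stored inside the fact row, and a flatten step expands them again for querying. The slide does not name the SQL engine; `Flatten(kvgen(...))` resembles the FLATTEN and KVGEN functions of Apache Drill, and the sketch below only mimics the flatten idea. The order data and the `flatten` helper are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, List

# Fact rows with line items nested inside them instead of a separate table.
ORDERS: List[Dict[str, Any]] = [
    {"order_id": 1, "customer": "C1",
     "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "customer": "C2",
     "items": [{"sku": "A", "qty": 5}]},
]


def flatten(rows: Iterable[Dict[str, Any]], nested_field: str) -> Iterator[Dict[str, Any]]:
    """Emit one output row per nested element, repeating the parent columns."""
    for row in rows:
        parent = {k: v for k, v in row.items() if k != nested_field}
        for child in row.get(nested_field, []):
            yield {**parent, **child}


if __name__ == "__main__":
    for flat_row in flatten(ORDERS, "items"):
        print(flat_row)
```

Since parent and children are stored together, reading an order with its items needs no join; flattening is only applied when item-level rows are required.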


Page 145: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

SELF-SERVICE DATA

Source: https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Functional area: Data acceleration
Why important: With shadow analytics, users create redundant data copies.
Self-service approach: The system must be capable of autonomously identifying the best optimizations and adapting to emerging query patterns over time.

Functional area: Data catalog
Why important: Data consumers struggle to find data that is important to their work. Users keep private notes about data sources and data quality, meaning there is no governance.
Self-service approach: The catalog is maintained automatically as new data sources are brought online.

Functional area: Data virtualization
Why important: It is virtually impossible for an organization to centralize all data in a single system.
Self-service approach: Data consumers need to be able to access all data sets equally well, regardless of the underlying technology or location of the system.

Page 146: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

SELF-SERVICE DATA

Source: https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Functional area: Data curation
Why important: There is no single “shape” of data that works for everyone.
Self-service approach: Data consumers need the ability to interact with data sets from the context of the data itself, not exclusively from simple metadata that fails to tell the whole story. Data consumers should be capable of reshaping data to their own needs without writing any code or learning new languages.

Functional area: Data lineage
Why important: As data is accessed by data consumers and in different processes, it is important to track the provenance of the data, who accessed the data, how the data was accessed, what tools were used, and what results were obtained.
Self-service approach: As users reshape and share data sets with one another through a virtual context, a self-service data platform can seamlessly track these actions and all states of data along the way, providing full audit capabilities as well.

Page 147: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

SELF-SERVICE DATA

Source: https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Functional area: Open source
Why important: Because data is essential to every area of every business, the underlying data formats and technologies used to access and process the data should be open source.
Self-service approach: Self-service data platforms build on open source standards like Apache Parquet, Apache Arrow, and Apache Calcite to store, query, and analyze data from any source.

Functional area: Security controls
Why important: Organizations safeguard their data assets with security controls that govern authentication (you are who you say you are), authorization (you can perform specific actions), auditing (a record of the actions you take), and encryption (you can only read the data if you have the right key).
Self-service approach: Self-service data platforms integrate with existing security controls of the organization, such as LDAP and Kerberos.

Page 148: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

IS THE DWH DEAD?

Hadoop (and, for similar reasons, Spark) has its strengths, but not as a DWH replacement, e.g.:

• Fast query reads are only possible in HBase with an inflexible (use-case-specific) data model

• No sophisticated query optimizer

• Hadoop is very complex, with many tools/versions/vendors and no standard

• Security is still in its early stages


Page 149: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

DIGITIZATION – THE DIGITAL DATA EXPLOSION

Digitization is the process of converting information into a digital (i.e. computer-readable) format.


Source: https://slideplayer.com/slide/10254426/

Page 150: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

FUTURE MOBILITY

Source: https://blog.daimler.com/en/2018/07/19/future-mobility-metropolitan-cities-congress/

Page 151: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

IS CHINA THE NEXT SILICON VALLEY?

Display video


Page 152: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

SOME STATISTICS
BAT – BAIDU + ALIBABA + TENCENT


Sources:
https://venitism.wordpress.com/2017/12/15/beware-of-the-bats-baidu-alibaba-and-tencent/
https://www.afr.com/brand/business-summit/baidu-alibaba-tencent-to-disrupt-facebook-amazon-netflix-google-in-asia-20180228-h0wrdl

Page 153: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

DIGITIZATION – CHALLENGES & OPPORTUNITIES

[Figure keywords: Volume, Velocity, Variety, Veracity, Value; Ethics; AI; Impact; IoT, Industry 4.0, …; Data integration; Real-time; Data quality; Data-driven; Digital Services]

Page 154: LECTURE @DHBW: DATA WAREHOUSE 02 DWH AND BIG DATA ARCHITECTUREbuckenhofer/20191DWH/Buckenhofer-… · •Providing internal and external data out of the source systems •Enabling

DIGITIZATION PUBLIC AUTHORITIES
