hadoop meets agile! - an agile big data model

Hadoop meets Agile! An (attempt on an) Agile Big Data Model uweprintz

Upload: uwe-printz

Post on 13-Apr-2017


TRANSCRIPT

Hadoop meets Agile!

An (attempt on an) Agile Big Data Model

uweprintz

/whoami &

/disclaimer

Agenda

1. Background & Motivation
2. My experience
3. Agile Big Data Model

3 Paradigms for dealing with data

#1 Request/Response
#2 Batch
#3 Stream

Once Hadoop was the cool kid on the block…

…but nowadays Hadoop just feels bloated

Your typical Hadoop distribution

Your typical Hadoop data flows

Perceived Productivity Curve

Agenda: 2. My experience

#1 The Vision of the Big Data Lake

[Diagram: Big Data Lake on an Enterprise Hadoop Platform]
Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …); New Sources (Logs, Mails, Sensor, Social Media, …)
Data Systems: Traditional Systems (RDBMS, EDW, MPP, …); Enterprise Hadoop Platform (Big Data Lake)
Applications: Business Intelligence; Business Applications; Custom Applications
Operation: Manage & Monitor
Dev Tools: Build & Test

A Hadoop project feels just like yet another data warehouse project - except for the knowledge

#1 Vision & Reality

#1 Real world architecture - Insurance

[Diagram: real-world architecture at an insurance company]
Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …); New Sources (Logs, Sensor, Social Media, …)
Data Systems: Traditional Systems (DWH); Enterprise Hadoop Platform
Applications: Business Intelligence
Also shown: SAS LASR Server, Apache Zeppelin

#2 Hadoop & Streaming

#2 Lambda in Action - (e)Commerce

[Diagram: Data Channels feed Data Ingestion, which feeds a Batch Layer (min - h) and a Speed Layer (ms - s). Stages shown: Data Processing, Data Storage, Data Analysis, Visualization.]
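The batch/speed split on this slide can be illustrated with a minimal sketch. It is not the architecture from the talk: the event shape, the `product` field and all function names are assumptions made for illustration. The idea it shows is only the general Lambda pattern, where a slow but complete batch view is merged at query time with a fast incremental view.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute totals from the full event history (min - h)."""
    return Counter(e["product"] for e in events)


class SpeedView:
    """Speed layer: count events that arrived since the last batch run (ms - s)."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        self.counts[event["product"]] += 1


def query(product, batch, speed):
    """Serving step: merge the complete-but-stale batch view with the fresh speed view."""
    return batch[product] + speed.counts[product]


# Hypothetical (e)commerce events: three already processed by the batch layer,
# two that arrived afterwards and are only visible in the speed layer.
history = [{"product": "shoes"}, {"product": "shoes"}, {"product": "hat"}]
batch = batch_view(history)

speed = SpeedView()
for event in [{"product": "shoes"}, {"product": "scarf"}]:
    speed.update(event)

print(query("shoes", batch, speed))
print(query("scarf", batch, speed))
```

In a real platform the batch view would be rebuilt periodically and the speed view reset after each rebuild; that bookkeeping is left out here.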

#2 Cassandra & Hadoop - AdServing

[Diagram. Elements shown: Data Ingestion, Data Processing, Raw Data, User Journey, Aggregated Data, Aggregated Data (< 120 days), Web Frontend, Data Science.]

#3 It’s all about Data Science!

#3 Fraud detection - Financial services

[Diagram: Data Ingestion feeds a Batch Layer (min - h) and a Speed Layer (ms - s) with Stream Processing.]

Model workflow:
• Data Import
• Data Preparation
• Model Generation
• Model Validation
• Feature & Parameter Selection
• Use Model

Loops: manual or automatic iterations to tune parameters; refresh model from latest input data
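The model workflow (import, preparation, generation, validation, parameter tuning, use, refresh) can be sketched roughly as follows. Everything here is an assumption for illustration: the data is synthetic, the "model" is a single amount threshold, and the function names are hypothetical. A real fraud model would be far richer; the sketch only shows how the steps and the two loops fit together.

```python
import random


def import_data(n=200, seed=7):
    """Data import: synthetic (amount, is_fraud) pairs, purely illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < 0.1
        amount = rng.uniform(500, 2000) if is_fraud else rng.uniform(1, 800)
        rows.append((amount, is_fraud))
    return rows


def prepare(rows):
    """Data preparation: split into a training half and a validation half."""
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


def generate_model(threshold):
    """Model generation: a one-parameter rule that flags amounts above the threshold."""
    return lambda amount: amount > threshold


def validate(model, rows):
    """Model validation: fraction of rows classified correctly."""
    return sum(model(amount) == label for amount, label in rows) / len(rows)


def tune(train, candidates):
    """Parameter selection: iterate over candidate thresholds and keep the best one."""
    return max(candidates, key=lambda t: validate(generate_model(t), train))


def refresh_model(candidates=range(100, 2000, 100)):
    """Refresh: rebuild the model from the latest input data."""
    train, holdout = prepare(import_data())
    best = tune(train, candidates)
    model = generate_model(best)
    return model, best, validate(model, holdout)


model, threshold, accuracy = refresh_model()
print(threshold, round(accuracy, 2))
print(model(1500.0))  # use the model on a new transaction amount
```

Calling `refresh_model()` again on newer data corresponds to the "refresh" loop; the candidate sweep inside `tune` corresponds to the tuning loop.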

Agenda: 3. Agile Big Data Model

Trade-offs for a Hadoop Platform

• Cost Efficiency
• Flexibility
• Speed of Provisioning

The companies that succeed will be those that build maximum flexibility and speed of provisioning into their platform without creating yet another silo, all while keeping costs under control.

Support for different speeds

A modern Hadoop platform needs to cope with different speed levels to enable different use cases.

[Chart: size of data (KB to TB) versus speed of data processing (h to ms), with the regions Batch, Interactive, Streaming and Realtime. Data Channels feed Data Ingestion, which feeds a Batch Layer and a Speed Layer.]

The Microservices of Hadoop

Data-centric, in Pipelines you have to think!

[Diagram: Producers → Data Ingestion → Data Storage & Analysis → Visualization & Consumers]
Producers: Batch Data and Streaming Data (MS SQL, MySQL, Oracle, JMS, Events, csv)
Hadoop Platform:
• HDFS (redundant, reliable storage)
• YARN (Data Operating System)
• Interactive Parallel Processing
• SQL: Hive
• In-Memory: Spark
• Search: Solr
• Others: Spark R
• Ambari Views & Zeppelin (Visualization)
Consumers: MS SQL 2016 + R
Data Pipeline A, Data Pipeline B and Data Pipeline C each run across these stages.
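Thinking in pipelines can be made concrete with a small sketch in which one pipeline owns every stage from producer to consumer. The stage names follow the slide; the csv layout, the field names `customer` and `amount`, and the function `data_pipeline_a` are assumptions for illustration and say nothing about how the platform in the talk was built.

```python
import csv
import io


def produce_batch(csv_text):
    """Producer: rows from a csv export (stand-in for a database extract)."""
    yield from csv.DictReader(io.StringIO(csv_text))


def ingest(records, source):
    """Data ingestion: tag each record with where it came from."""
    for record in records:
        yield {**record, "source": source}


def analyze(records):
    """Data storage & analysis: aggregate the amount per customer."""
    totals = {}
    for record in records:
        customer = record["customer"]
        totals[customer] = totals.get(customer, 0.0) + float(record["amount"])
    return totals


def visualize(totals):
    """Visualization & consumers: render a plain-text report."""
    return [f"{customer}: {total:.2f}" for customer, total in sorted(totals.items())]


def data_pipeline_a(csv_text):
    """One pipeline, owned end to end, composed of independent stages."""
    return visualize(analyze(ingest(produce_batch(csv_text), source="csv")))


report = data_pipeline_a("customer,amount\nacme,10.5\nzenith,3\nacme,4.5\n")
print(report)
```

Because each stage only depends on the shape of the records it receives, a second pipeline could reuse `ingest` with a different producer without touching the first one.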

Core principles and values

• The core beliefs are the agile principles

• The foundation is a data-centric role model aligned with the domains of the Big Data Platform

• Independent project teams deliver data pipelines - from beginning to end

• The project teams collaborate with specialized Big Data roles

• The data model is built on the principles of domain driven design (DDD)

• Data Governance is built on self-organization

Role model

• Data Engineers load and transform data
• Data Analysts process data
• Data Scientists analyze and correlate data
• Admins maintain, enhance, scale
• Data Stewards are responsible for the data quality in one domain

[Diagram labels: Operational Data, Analytical Data, Answers to Questions, “Hidden treasures”, Big Data Platform]

Data model

[Diagram: Raw Data, project data and user data on the platform. Domain A contains Data X and Data Steward A is responsible for it; Domain B contains Data Y and Data Steward B is responsible for it. Project A is responsible for its own project data and uses data from the domains.]

• The data model is based on the principles of Domain Driven Design (DDD)

• The data is divided into domains; the smallest domain is user data
   • User and project teams are directly responsible for their own data
   • They can use other existing data

• Data is bundled into comprehensive domains, e.g.
   • Business domains
   • National subsidiaries

• Domains can be hierarchical

• Exactly one data steward is responsible for each domain

• Always attach metadata to the user data
   • Use an automated tool where possible; if that is not possible, do it informally

• Don’t strive for a unified data model!
   • Redundancy is not enforced, but it is accepted as a real-world necessity
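The rules of the data model (hierarchical domains, exactly one steward per domain, metadata on every data set) could be captured in a few lines. This is a hypothetical sketch: the class names `Dataset` and `Domain`, their fields, and the example domains are assumptions for illustration, not a schema from the talk.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    name: str
    owner: str       # user or project team directly responsible for the data
    metadata: dict   # rule: always attach metadata, even if informal

    def __post_init__(self):
        if not self.metadata:
            raise ValueError(f"dataset {self.name!r} needs metadata")


@dataclass
class Domain:
    name: str
    steward: str                       # rule: exactly one data steward per domain
    parent: Optional["Domain"] = None  # rule: domains can be hierarchical
    datasets: list = field(default_factory=list)

    def path(self):
        """Hierarchical domain name built from the chain of parents."""
        if self.parent is None:
            return self.name
        return f"{self.parent.path()}/{self.name}"

    def add(self, dataset):
        self.datasets.append(dataset)
        return dataset


# Hypothetical example: a business domain with a national subsidiary below it.
sales = Domain("Sales", steward="Data Steward A")
germany = Domain("Germany", steward="Data Steward B", parent=sales)
germany.add(Dataset("orders", owner="Project A",
                    metadata={"description": "raw order events"}))

print(germany.path())
```

Note that nothing here forces a unified model across domains: two domains may each hold their own copy of similar data, which matches the point about accepting redundancy.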

Collaboration model

[Diagram: collaboration around the Big Data Platform]
Parties: Architecture Board, Project Teams (Project A, Project B, Project X), IT Operations, Data Stewards, Business Departments, Data Scientists
• The Architecture Board provides authoritative guidelines
• Project teams use the Big Data Platform
• Data Stewards are responsible for data domains
Other relationship labels shown: consult; work with / are part of; own projects; are responsible

Role description: IT Operations

• Operates and monitors the Big Data Platform - based on an agreed service level agreement (SLA)

• Keeps the platform up-to-date in short cycles

• Adds additional components and technologies

• Scales the platform

• Has a DevOps mindset

Role description: Project Team

• A project team works on a data pipeline - from beginning to end
   • Data pipelines can have different depths

• A project team is independent of other project teams
   • Project teams can collaborate

• A project team needs all the roles required to fulfill its project goal

• A project team has full responsibility for its own data

[Diagram: Project A consists of Data Scientist, Data Engineers, Data Analyst and Product Owner]

Role description: Architecture Board

• Designs technological guidelines

• Consults on deviations from those guidelines

• Meets on a regular basis with full transparency

• Can consult project teams on their daily business

• Consists of architects, data stewards and key members of the project teams

Role description: Data Steward

• Supervises the creation and usage of data and its quality
   • Steering person of a self-organized data governance

• Is responsible for the user data and metadata of (at least) one domain

• Operative role, works closely with all other roles

• Independent, self-organized team
   • Is part of the architecture board

Role description: Data Scientist

• Independent team of data specialists

• Work as part of project teams but also have their own tasks, e.g.
   • Scientific assessment of data quality

• Generate project and product ideas

• Consult and work closely with data stewards and business departments

• Still unicorns on the job market

Get in contact

Twitter: @uweprintz

Mail: [email protected]

Phone +49 176 1076531

XING: https://www.xing.com/profile/Uwe_Printz

Slide 1: https://unsplash.com/photos/7NtiJBowheE

Slide 2: Copyright by Uwe Printz

Slide 9: https://www.splitshire.com/little-dark-rider/

Slide 10: https://pixabay.com/de/kugelfisch-mexiko-handwerk-seziert-882440/

Slide 15: https://commons.wikimedia.org/wiki/File:Welcome_to_Fabulous_Las_Vegas_Sign.svg

Slide 19: http://unsplash.com/photos/7x4dOkulU9E/

Slide 22: https://unsplash.com/search/unicorn?photo=iWYrCr8eGwU

Slide 38: Copyright by Uwe Printz

All pictures CC0 or shot by the author