Evolving Beyond the Data Lake: A Story of Wind and Rain


Upload: mapr-technologies

Post on 16-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Evolving Beyond the Data Lake: A Story of Wind and Rain

© 2016 MapR Technologies

Evolving Beyond the Data Lake: A Story of Wind and Rain

Jim Scott, @kingmesal
#strataconf

Page 2: Evolving Beyond the Data Lake: A Story of Wind and Rain

Industry Leaders Are Investing in Disruptive Technology Now
Innovating and reducing costs at the same time

[Chart: Investment in Next-Gen vs. Legacy Technologies for Data, 2013 to 2020, in billions of dollars. Series: Total $ Growth of IT Market; Next-Gen Growth; Legacy Market Growth/Shrink in $.]

90% of data is on next-gen technology in just four years

Source: IDC, Gartner; Analysis & Estimates: MapR. Next-gen consists of cloud, big data, software and hardware related expenses.

Page 3: Evolving Beyond the Data Lake: A Story of Wind and Rain

Application Development and Deployment

[Diagram: application architecture in 2005 compared with today]

2005:
• Desktop Browser (JavaScript, jQuery)
• HTML, CSS, JS
• IIS, ASP.NET
• SQL
• SQL Server
• Microsoft Reporting Service

Today:
• Desktop Browser (JavaScript, 20+ Frameworks), Tablet, Native Android, Native iOS
• JSON; JSON, CSS, HTML, JS
• Backend for Frontend (Java)
• Microservice (.NET), Microservice (NodeJS), Microservice (Java)
• Events (Kafka)
• Oracle, SQL Server, NoSQL, Graph DB, Insights DB
• Bulk Load, Data Lake
• Machine Learning, Predictive Modeling, BI / Reporting, Customer Insights

Page 4: Evolving Beyond the Data Lake: A Story of Wind and Rain

Application Development and Deployment

[Diagram: the same 2005 vs. today architecture comparison as on Page 3, with identical component labels]

Page 5: Evolving Beyond the Data Lake: A Story of Wind and Rain


Messaging platforms

Page 6: Evolving Beyond the Data Lake: A Story of Wind and Rain

What’s a Stream?

A stream is an unbounded sequence of events carried from a set of producers to a set of consumers.

Producers and consumers don’t have to be aware of each other; instead, they participate in shared topics.

This is called publish/subscribe.

[Diagram: Producers → /Events:Topic → Consumers]
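The publish/subscribe idea above can be illustrated with a toy, in-memory broker. This is only a sketch: the `Broker` class, its method names, and the sample event are made up for illustration and are not the Kafka or MapR Streams API. The point it shows is that the producer only knows the topic name, never the consumers.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class Broker:
    """Toy in-memory broker (hypothetical): routes each event to every subscriber of its topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Consumers register interest in a topic, not in a particular producer.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: str) -> int:
        # Producers hand the event to the topic; return how many subscribers received it.
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


broker = Broker()
dashboard: List[str] = []   # stands in for a dashboard consumer
alerts: List[str] = []      # stands in for an alerting consumer

broker.subscribe("/Events:Topic", dashboard.append)
broker.subscribe("/Events:Topic", alerts.append)

# The producer never references the dashboard or the alerting system.
delivered = broker.publish("/Events:Topic", "sensor-7 temperature=81")
```

Adding a third consumer here is one more `subscribe` call; the producer code does not change.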

Page 7: Evolving Beyond the Data Lake: A Story of Wind and Rain

Publishers and Subscribers (pub-sub)

[Diagram: producers publish to /Events:Topic and consumers subscribe to it. Other labels in the diagram: Analytics, Consumers, Stream Processors]

Producers:
• Social Platforms
• Servers (Logs, Metrics)
• Sensors
• Mobile Apps
• Other Apps & Microservices

Consumers:
• Alerting Systems
• Stream Processing Frameworks
• Databases & Search Engines
• Dashboards
• Other Apps & Microservices

Page 8: Evolving Beyond the Data Lake: A Story of Wind and Rain

Considering a Messaging Platform

• 50-100k messages per second used to be good
  – Not really good enough to handle decoupled communication between services
• Kafka model is BLAZING fast
  – Kafka 0.9 API with message sizes at 200 bytes
  – MapR Streams on a 5 node cluster sustained 18 million events / sec
  – Throughput of 3.5GB/s and over 1.5 trillion events / day
• Manual sharding is not a “great” solution
  – Adding more servers should be easy and foolproof, not painful
  – Yes, I have lived through this
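As a quick sanity check, the quoted figures can be reproduced from the two inputs on the slide (18 million events per second, 200-byte messages). The sketch below counts message payload only, so it is an approximation; it gives 3.6 GB/s in decimal units, close to the quoted 3.5GB/s, and about 1.56 trillion events per day.

```python
EVENTS_PER_SECOND = 18_000_000   # sustained rate quoted on the slide
MESSAGE_SIZE_BYTES = 200         # message size quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Payload bytes only; protocol and storage overhead are not modeled here.
bytes_per_second = EVENTS_PER_SECOND * MESSAGE_SIZE_BYTES
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY

print(f"{bytes_per_second / 1e9:.1f} GB/s")                 # 3.6 GB/s
print(f"{events_per_day / 1e12:.2f} trillion events/day")   # 1.56 trillion events/day
```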

Page 9: Evolving Beyond the Data Lake: A Story of Wind and Rain

Goals

• Real-time or near-time
  – Includes situations with deadlines
  – Also includes situations where delay is simply undesirable
  – Even includes situations where delay is just fine
• Microservices
  – Streaming is a convenient idiom for design
  – Microservices … you know we wanted it
  – Service isolation is a key requirement

Page 10: Evolving Beyond the Data Lake: A Story of Wind and Rain

Advantages of Messaging and Real-time Enablement

• Fewer moving parts
  – Fewer things to go wrong
• Better resource utilization
  – Scale any application up or down on demand
• Common deployment model (new isolation model)
  – Repeatability between environments (dev, QA, production)
• Improved integration testing
  – Listen to production streams in dev and QA (** this is a BIG DEAL! **)
• Shared file system
  – Get at the data anywhere in the cluster
  – Simplifies business continuity

Page 11: Evolving Beyond the Data Lake: A Story of Wind and Rain

A microservice is loosely coupled with bounded context

Page 12: Evolving Beyond the Data Lake: A Story of Wind and Rain

How to Couple Services and Break micro-ness

• Shared schemas, relational stores
• Ad hoc communication between services
• Enterprise service busses
• Brittle protocols
• Poor protocol versioning

Don’t do this!

Page 13: Evolving Beyond the Data Lake: A Story of Wind and Rain

How to Decouple Services

• Use self-describing data
• Private databases
• Infrastructural communication between services
• Use modern protocols
• Adopt future-proof protocol practices
• Use shared storage where necessary due to scale
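One way to read "self-describing data" and "future-proof protocol practices" is a JSON message that carries its own field names and a version, paired with a consumer that ignores fields it does not know and supplies defaults for fields that are missing. The sketch below assumes that reading; the field names (`schema_version`, `user_id`, `action`, `channel`) and both helper functions are hypothetical.

```python
import json
from typing import Any, Dict


def encode_event(user_id: str, action: str) -> str:
    """Producer side: field names and a schema version travel with the data."""
    return json.dumps(
        {"schema_version": 2, "user_id": user_id, "action": action, "channel": "web"}
    )


def decode_event(raw: str) -> Dict[str, Any]:
    """Consumer side: read only the needed fields, default missing ones, ignore extras."""
    message = json.loads(raw)
    return {
        "user_id": message["user_id"],
        "action": message.get("action", "unknown"),
        "schema_version": message.get("schema_version", 1),
    }


new_style = encode_event("u-42", "login")
old_style = json.dumps({"user_id": "u-7"})   # as an older producer might send it

print(decode_event(new_style))
print(decode_event(old_style))
```

Because the consumer never depends on the extra `channel` field, the producer can add or drop such fields without a coordinated release.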

Page 14: Evolving Beyond the Data Lake: A Story of Wind and Rain

Decoupled Architecture

[Diagram labels: Producer (three instances), Activity Handler, Historical, Interesting Data, Real-time Analysis, Results Dashboard, Anomaly Detection]

Page 15: Evolving Beyond the Data Lake: A Story of Wind and Rain

Mechanisms for Decoupling

• Traditional message queues?
  – Message queues are the classic answer
  – Key feature/flaw is out-of-order acknowledgement
  – Many implementations
  – You pay a huge performance hit for persistence
• Kafka-esque Logs?
  – Logs are like queues, but with ordering
  – Out-of-order consumption is possible, acknowledgement not so much
  – Canonical base implementation is Kafka
  – Performance plus persistence
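The log model can be sketched as an append-only list plus one read position per consumer. This is a toy illustration, not Kafka's implementation; `EventLog`, `poll`, and `seek` are hypothetical names. It shows why reading does not remove events, why each consumer progresses independently, and how a consumer can jump back and replay.

```python
from typing import Dict, List


class EventLog:
    """Toy append-only log (hypothetical): events stay in order and are not removed on read."""

    def __init__(self) -> None:
        self._events: List[str] = []
        self._offsets: Dict[str, int] = {}   # consumer name -> next offset to read

    def append(self, event: str) -> int:
        # Return the offset assigned to the new event.
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer: str, max_events: int = 10) -> List[str]:
        # Each consumer advances only its own offset.
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer: str, offset: int) -> None:
        # Move a consumer's position, for example to replay from the beginning.
        self._offsets[consumer] = offset


log = EventLog()
for event in ["login", "click", "purchase"]:
    log.append(event)

production = log.poll("production")          # reads all three events
dev_first = log.poll("dev", max_events=2)    # a second consumer reads at its own pace
log.seek("dev", 0)
dev_replay = log.poll("dev")                 # replays from the start of the log
```

The same property is what lets a dev or QA consumer listen to a production stream without disturbing the production consumer's position.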

Page 16: Evolving Beyond the Data Lake: A Story of Wind and Rain


Shared Resources

Page 17: Evolving Beyond the Data Lake: A Story of Wind and Rain


Fraud Detection

Page 18: Evolving Beyond the Data Lake: A Story of Wind and Rain


Traditional Solution

Page 19: Evolving Beyond the Data Lake: A Story of Wind and Rain


What Happens Next?

Page 20: Evolving Beyond the Data Lake: A Story of Wind and Rain


What Happens Next?

Page 21: Evolving Beyond the Data Lake: A Story of Wind and Rain


How to Get Service Isolation

Page 22: Evolving Beyond the Data Lake: A Story of Wind and Rain


New Uses of Data

Page 23: Evolving Beyond the Data Lake: A Story of Wind and Rain


Scaling Through Isolation

Page 24: Evolving Beyond the Data Lake: A Story of Wind and Rain


Use Cases

Page 25: Evolving Beyond the Data Lake: A Story of Wind and Rain

Event-based Data Drives Applications

• Failure Alerts
• Real-time application & network monitoring
• Trending now
• Web Personalized Offers
• Real-time Fraud Detection
• Ad optimization
• Supply Chain Optimization

Page 26: Evolving Beyond the Data Lake: A Story of Wind and Rain

Fighting Fraudulent Web Traffic

[Diagram: classifier pipeline for web traffic]

• Streams in: Click Stream, Activity Stream
• Reference data: User Activity Profile, Blacklist Activities, Whitelist Activities
• Classifiers: Deviation from Normal, Known Bad Classifier, All OK Classifier
• Streams out: Session Alteration Stream, Notify Security
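The slide only names the components, so the following is one possible reading of how they might fit together, not the pipeline actually shown: a blacklist check, a deviation-from-normal check against the user's profile, and a whitelist check, with the result routed to a downstream stream. The activity names, the threshold, and the stream names are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical reference data; a real system would load these from a store.
BLACKLIST: Set[str] = {"credential_stuffing", "card_testing"}
WHITELIST: Set[str] = {"view_page", "add_to_cart"}
DEVIATION_THRESHOLD = 3.0   # assumed: flag rates above 3x the user's normal rate


def classify(activity: str, clicks_per_minute: float, profile: Dict[str, float]) -> str:
    """Label one event using the blacklist, the user's profile, and the whitelist."""
    if activity in BLACKLIST:
        return "known_bad"
    normal_rate = profile.get("clicks_per_minute", 1.0)
    if clicks_per_minute > DEVIATION_THRESHOLD * normal_rate:
        return "deviation"
    if activity in WHITELIST:
        return "all_ok"
    return "unclassified"


def route(label: str) -> str:
    """Map a label to a downstream stream name (names are illustrative)."""
    return {
        "known_bad": "notify-security",
        "deviation": "session-alteration",
    }.get(label, "no-action")


profile = {"clicks_per_minute": 4.0}
normal = classify("view_page", 5.0, profile)
burst = classify("view_page", 40.0, profile)
bad = classify("card_testing", 1.0, profile)
```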

Page 27: Evolving Beyond the Data Lake: A Story of Wind and Rain

Similarities between Marketing and Fraud?

Customer 360:
• Build a user profile
  – What type of content do they like
• Build “segmented” profiles
  – Company affiliation
• Dynamically alter website
  – Show alternate content
• Kick off external workflows
  – Nurture emails

Website Fraud:
• Build a user profile
  – What are their normal usage patterns
• Build “segmented” profiles
  – What do real users normally do
• Dynamically alter website
  – Prevent user functionality
• Kick off external workflows
  – Notify security team

Page 28: Evolving Beyond the Data Lake: A Story of Wind and Rain

Legacy Business Platforms

[Diagram: separate operational and analytical stacks]

Operational Applications: J2EE AppServer, Relational Database, Message Bus, Specialized Storage
Analytical Applications: Analytic Database, ETL Tool, BI Tool, Specialized Storage

• IT must integrate all the products
• Inability to operationalize the insight rapidly
• Can’t deal with high speed data ingestion and processing
• Scale-up architecture leads to high cost

Page 29: Evolving Beyond the Data Lake: A Story of Wind and Rain

Converged Data Platform

[Diagram: analytical and operational applications converging on one platform. Initiative labels in the diagram: (Bottom Line Initiatives), (Top Line Initiatives)]

• Analytical Applications: Analysts Creating BI Reports and KPIs on Data Warehouse (Historical Data)
• Operational Applications: Developers Creating Database and Event Based Applications (Current Data)
• Converged Applications: Complete Access to Real-time and Historical Data in One Platform

Page 30: Evolving Beyond the Data Lake: A Story of Wind and Rain

MapR Platform Services: Open API Architecture
Assures Interoperability, Avoids Lock-in

[Diagram: platform services and the open APIs they expose]

• Services: MapR-FS (Web-Scale Storage), MapR-DB (Database), MapR Streams (Event Streaming)
• Open APIs: HDFS API, POSIX, NFS, SQL, HBase API, JSON API, Kafka API
• Platform capabilities: Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace, High Availability

Page 31: Evolving Beyond the Data Lake: A Story of Wind and Rain

Converged Application Benefits

• Consumers scale horizontally with partitions
• 1:1 mapping between consumer and partition
• Enables predictable scaling as production needs grow

• Data can be seamlessly replicated to another cluster
• Enables HA with zero code changes

• Data is indexed dynamically according to receivers, senders
• Scales beyond the capabilities of Kafka

• Snapshots can be taken to capture state
• Enables faster testing and deployment of applications
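The first group of bullets, consumers scaling with partitions, can be sketched in two steps: a key is hashed to a partition, and partitions are spread across the consumers in a group. The functions below are illustrative only. The CRC32 hash and the round-robin assignment are assumptions made for this sketch, not the partitioner or assignment strategy used by Kafka or MapR Streams.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition; the same key always lands on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over consumers round-robin (an assumed strategy)."""
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


two_consumers = assign(4, ["c0", "c1"])
four_consumers = assign(4, ["c0", "c1", "c2", "c3"])   # reaches the 1:1 mapping

print(two_consumers)    # {'c0': [0, 2], 'c1': [1, 3]}
print(four_consumers)   # {'c0': [0], 'c1': [1], 'c2': [2], 'c3': [3]}
```

Going from two consumers to four halves the partitions each one owns. Once there is one consumer per partition, adding consumers gives no further parallelism in this model, so the partition count is what sets the scaling ceiling.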

Page 32: Evolving Beyond the Data Lake: A Story of Wind and Rain


Not All Data Platforms are the Same

Page 33: Evolving Beyond the Data Lake: A Story of Wind and Rain


@kingmesal

[email protected]

Engage with us!
