big data: from mammoth to elephant

Roman Nikitchenko, 10.05.2015 BIG DATA: FROM MAMMOTH TO ELEPHANT

Upload: roman-nikitchenko

Post on 21-Jul-2015

TRANSCRIPT

Page 1: BIG DATA: From mammoth to elephant

Roman Nikitchenko, 10.05.2015

BIG DATA: FROM MAMMOTH TO ELEPHANT

Page 2: BIG DATA: From mammoth to elephant

MAMMOTH

The only real truth we know about them is their remains. Do you feel your enterprise data infrastructure is going the same way?

Come and see in the nearest data center...

2

Page 3: BIG DATA: From mammoth to elephant

TWO YEARS AGO

● Our exciting, highly scalable realtime BIG DATA solution with a broad technology stack in production.

3

Page 4: BIG DATA: From mammoth to elephant

This is our PRESENT DAY

.. yet is powered by

4

Page 5: BIG DATA: From mammoth to elephant

OUR INITIAL STATE: TOP VIEW

[Diagram labels: Inbound: healthcare providers data (labs, cares ...); storage; SQL DB: processed inbound data; CLIENT APPLICATIONS, with SQL DB: application data; SQL DB: outbound information; storage; Outbound: mostly insurance companies.]

5

Page 6: BIG DATA: From mammoth to elephant

OUR INITIAL STATE: TOP VIEW

[Same diagram as on the previous slide, with annotations:]

● Inbound data archives (pretty short cycle).

● One SQL DB per application.

● Huge amount of data. Serious amount of duplicates.

● How about retention and data issues investigation?

6

Page 7: BIG DATA: From mammoth to elephant

OUR INITIAL STATE: TOP VIEW

YELLOW ALARMS

[Same diagram as on the previous slides, with annotations:]

● Outbound flow is slow because of RDBMS processing.

● Inbound data retention cycle is short, so investigating data over a prolonged period is hard.

● Huge overall number of SQL databases, high operational complexity.

● One application DB per service client makes inter-application analytics and monitoring extremely hard.

7

Page 8: BIG DATA: From mammoth to elephant

8

Page 9: BIG DATA: From mammoth to elephant

BIG DATA

Better ways to store huge data volumes: cheaper, safer and easier.

WHAT TO RUN FOR?

MORE STORAGE

9

Page 10: BIG DATA: From mammoth to elephant

BIG DATA: WHAT TO RUN FOR?

Scalable effective distributed processing models to open new opportunities like machine learning.

MORE POWER

10

Page 11: BIG DATA: From mammoth to elephant

BIG DATA: WHAT TO RUN FOR?

More flexible data structures closer to subject area and real world.

11

Page 12: BIG DATA: From mammoth to elephant

RDBMS LIMITS

● Good for anything

● Not so good for anything in particular

OUR MAIN ENEMY WAS ...

12

Page 13: BIG DATA: From mammoth to elephant

WHY SQL IS EVIL

MASSIVE ANALYSIS is about massive access to your data objects.

[Diagram: your database; transformation from database structure into object structure; subject area objects data (x4); distributed parallel processing (x4); effective results collection: distributed processing results to be joined.]
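The processing pattern on this slide (objects split into partitions, each partition processed in parallel, partial results joined) can be sketched as below. This is only an illustration on one machine with a thread pool, not the cluster setup from the talk; the function names and the `payer` field are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_by_payer(partition):
    """Process one partition of subject-area objects independently."""
    counts = {}
    for person in partition:
        counts[person["payer"]] = counts.get(person["payer"], 0) + 1
    return counts


def merge_counts(partials):
    """Join the partial results produced by the parallel workers."""
    total = {}
    for partial in partials:
        for payer, n in partial.items():
            total[payer] = total.get(payer, 0) + n
    return total


def parallel_count(partitions, workers=4):
    """Run the per-partition step in parallel, then collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return merge_counts(pool.map(count_by_payer, partitions))


# Hypothetical data: two partitions of already-assembled objects.
partitions = [
    [{"payer": "SaferLife"}, {"payer": "YourGuard"}],
    [{"payer": "YourGuard"}],
]
result = parallel_count(partitions)
```

The expensive part in a relational setup is the step before this sketch: turning table rows into the objects that each worker consumes.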

13

Page 14: BIG DATA: From mammoth to elephant

RDBMS LIMITS: SUBJECT AREA OBJECT COLLECTION

When you go for massive processing, collecting objects gets too complex. Think about scanning the data of 100,000,000 people.

Address ID | City     | Street
1          | New York | 1020, Blue lake
2          | Atlanta  | 203, Bricks av.
3          | Seattle  | 120, Green drv.

FirstName | LastName | Address | Payer
John      | Smith    | 1       | 2
Kate      | Davis    | 2       | 1
Samuel    | Brown    | 3       | 2

Payer ID | Name      | State
1        | SaferLife | GA
2        | YourGuard | CA

Collected object: Kate Davis; Atlanta, 203, Bricks av.; SaferLife, GA
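To make the collection step concrete, here is a minimal sketch that assembles one subject-area object from the three tables on this slide. It uses an in-memory SQLite database purely for illustration; the table and column identifiers (`Patient`, `AddressID`, `PayerID`) are assumptions, since the slide only shows column captions.

```python
import sqlite3


def build_demo_db():
    """Create the three normalized tables from the slide in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Address (AddressID INTEGER PRIMARY KEY, City TEXT, Street TEXT);
        CREATE TABLE Payer (PayerID INTEGER PRIMARY KEY, Name TEXT, State TEXT);
        CREATE TABLE Patient (FirstName TEXT, LastName TEXT, Address INTEGER, Payer INTEGER);
        INSERT INTO Address VALUES (1, 'New York', '1020, Blue lake');
        INSERT INTO Address VALUES (2, 'Atlanta', '203, Bricks av.');
        INSERT INTO Address VALUES (3, 'Seattle', '120, Green drv.');
        INSERT INTO Payer VALUES (1, 'SaferLife', 'GA');
        INSERT INTO Payer VALUES (2, 'YourGuard', 'CA');
        INSERT INTO Patient VALUES ('John', 'Smith', 1, 2);
        INSERT INTO Patient VALUES ('Kate', 'Davis', 2, 1);
        INSERT INTO Patient VALUES ('Samuel', 'Brown', 3, 2);
        """
    )
    return conn


def collect_patient(conn, first_name, last_name):
    """Join three tables to rebuild a single patient object."""
    return conn.execute(
        """
        SELECT p.FirstName, p.LastName, a.City, a.Street, y.Name, y.State
        FROM Patient p
        JOIN Address a ON a.AddressID = p.Address
        JOIN Payer y ON y.PayerID = p.Payer
        WHERE p.FirstName = ? AND p.LastName = ?
        """,
        (first_name, last_name),
    ).fetchone()


kate = collect_patient(build_demo_db(), "Kate", "Davis")
```

Every object needs two joins here; a real subject area usually has many more tables, and a massive scan pays that cost for every single object.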

14

Page 15: BIG DATA: From mammoth to elephant

RDBMS LIMITS: TABLE STRUCTURE MODIFICATION

Let it be the Patients table ...

FirstName | LastName | Address | Payer
John      | Smith    | 1       | 2
Kate      | Davis    | 2       | 1
Samuel    | Brown    | 3       | 2

And now let us add a new «Birthday» column. Easy as pie!

ALTER TABLE Patient ADD Birthday ...

New column set: FirstName, LastName, Address, Payer, Birthday

Let's do this with a 2,000,000,000-row MySQL table in production. What do you do if your table grows further?
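For contrast, the sketch below shows the kind of flexible structure a column-oriented NoSQL store offers: a new field is written only to the rows that have a value for it, with no table-wide schema change. This is a plain Python dictionary used as a stand-in, not HBase API; the row keys and the birthday value are hypothetical.

```python
def add_field(rows, row_key, name, value):
    """Attach a new field to one row; other rows are not touched."""
    rows[row_key][name] = value


# Hypothetical sparse rows keyed by a row key.
patients = {
    "smith-john": {"FirstName": "John", "LastName": "Smith", "Address": 1, "Payer": 2},
    "davis-kate": {"FirstName": "Kate", "LastName": "Davis", "Address": 2, "Payer": 1},
    "brown-samuel": {"FirstName": "Samuel", "LastName": "Brown", "Address": 3, "Payer": 2},
}

# Only Kate's row gets the new field (the date is made up for the example).
add_field(patients, "davis-kate", "Birthday", "1980-01-01")
```

Readers of such data have to cope with the field being absent, so the flexibility moves some of the work from the schema into the application.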

15

Page 16: BIG DATA: From mammoth to elephant

ANY RELATIONAL DATA MODEL SOONER OR LATER

16

Page 17: BIG DATA: From mammoth to elephant

RDBMS LIMITS: HOW TO SCALE?

[Diagram: your SQL database split into four shards, each with its own processing; distributed processing results to be joined.]

● How to partition data? What to do when a new shard is added?

● Need another cluster for processing?
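The question "what to do when a new shard is added?" can be quantified with a small sketch. It assumes the simplest partitioning scheme, hash of the key modulo the shard count, which is one common do-it-yourself choice and not necessarily what the team used; the key names are hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard with hash-modulo partitioning."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


keys = [f"patient-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, 4, 5)
```

With this scheme, going from 4 to 5 shards relocates roughly four keys out of five, because a key stays put only when its hash gives the same remainder for both 4 and 5. That data movement is exactly the operational cost the slide is hinting at.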

17

Page 18: BIG DATA: From mammoth to elephant

If you need to store a plain text log, a long-lived collection of objects, or the attributes of the current user session, do you really need SQL?

18

Page 19: BIG DATA: From mammoth to elephant

OUR INITIAL BIG PLAN WAS

[Diagram: three SQL DBs with application data, each feeding a cross-application data storage through ETL; the storage serves small realtime requests and the batch analytic and reporting load.]

● One-time ETL as an initial step and backup strategy.

● Full migration to Apache HBase.

● As a transition period solution: realtime synchronization.

19

Page 20: BIG DATA: From mammoth to elephant

Why Hadoop (initially 1.x)?

OPEN SOURCE framework for big data, covering both distributed storage and processing.

Provides RELIABILITY and fault tolerance by SOFTWARE design (for example, a file system with replication factor 3 by default).

Horizontal scalability from a single computer up to thousands of nodes.

20

Page 21: BIG DATA: From mammoth to elephant

World's first ever DATA OS

A 10,000-node computer... It can start in production from just 4 servers, 1 of them for management and coordination. A single server is enough for a development environment.

21

Page 22: BIG DATA: From mammoth to elephant

HBase motivation

WHY

LATENCY, SPEED AND ALL HADOOP PROPERTIES

22

Page 23: BIG DATA: From mammoth to elephant

WHY YET ?

[Diagram: four nodes, each running the same stack: region server (database), TaskTracker (distributed processing), DataNode (file system), node (hardware).]

● Good both for OLTP and batch load.

● Natural scaling and reliability with Hadoop.

● Data processing locality, natural sharding with regions.

● Coordination with ZooKeeper.
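"Natural sharding with regions" means that each region holds a contiguous range of sorted row keys, so finding the region for a key is a range lookup rather than a hash. The sketch below illustrates that idea only; the start keys and server names are hypothetical, and this is not HBase client code.

```python
from bisect import bisect_right

# Hypothetical layout: each region starts at its start key (inclusive)
# and ends just before the next start key. "" starts the first region.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["server-1", "server-2", "server-3", "server-4"]


def region_index(row_key: str) -> int:
    """Find the region whose key range contains the row key."""
    return bisect_right(REGION_START_KEYS, row_key) - 1


def server_for(row_key: str) -> str:
    """Return the server that hosts the region for the row key."""
    return REGION_SERVERS[region_index(row_key)]
```

For example, `server_for("smith")` falls into the range starting at "n". Because ranges are contiguous, a region that grows too large can be split at a middle key without touching keys in other regions.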

23

Page 24: BIG DATA: From mammoth to elephant

ZooKeeper

Because coordinating distributed systems is a Zoo.

● Quorum based service for fast distributed system coordination.

● Came into our stack with Apache HBase, where it was needed for coordination. It is now part of the core Hadoop infrastructure.

● We also use it for our own applications.
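"Quorum based" has a simple arithmetic behind it: the service stays available while a strict majority of its servers is alive. A minimal sketch of that arithmetic (the function names are made up for the example):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still alive."""
    return ensemble_size - quorum_size(ensemble_size)
```

An ensemble of 3 tolerates 1 failure and an ensemble of 5 tolerates 2, while 4 servers still tolerate only 1, which is why odd ensemble sizes are the usual choice.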

24

Page 25: BIG DATA: From mammoth to elephant

Finally we went into initial production with HADOOP 2.0

RESOURCE MANAGEMENT

DISTRIBUTED PROCESSING

FILE SYSTEM

COORDINATION

HADOOP 2.x CORE

25

Page 26: BIG DATA: From mammoth to elephant

Real initial approach

[Diagram: four nodes, each running a region server (database), a DataNode (file system) and a NodeManager (resource management) on the node hardware, plus a distributed processing & coordination layer.]

● ZooKeeper instances are distributed across the cluster.

● MapReduce is not a service in Hadoop 2.x, just a YARN application.

26

Page 27: BIG DATA: From mammoth to elephant

FIRST REAL RESULT

CLOSE BUT NOT EXACT PLAN

[Diagram: three SQL DBs with application data, each feeding the cross-application data storage through ETL; the storage serves small realtime requests and the batch analytic and reporting load.]

Daily ETL. It satisfied our daily reporting needs and offloaded a major part of the SQL infrastructure. Direct profit: massive processing is much faster and can handle inter-application data.

DO NOT WEAR ROSE-COLORED GLASSES

27

Page 28: BIG DATA: From mammoth to elephant

APPROACH WE HAVE FIXED MUCH LATER

[Diagram: on each of two SQL servers, Table1 to Table4 are combined by a JOIN into an ETL stream; the ETL streams are bulk-loaded into BIG DATA shards.]

28

Page 29: BIG DATA: From mammoth to elephant

Hadoop: don't do it yourself

DON'T DO IT YOURSELF

Because of a number of factors, starting with our distributed team's support needs, we have selected

29

Page 30: BIG DATA: From mammoth to elephant

HADOOP as INFRASTRUCTURE

30

Page 31: BIG DATA: From mammoth to elephant

WHERE TO GO FROM HERE?

31

Page 32: BIG DATA: From mammoth to elephant

The admission of temporary residents into Canada is a privilege, not a right.

http://www.cic.gc.ca/

SEARCH / SECONDARY INDICES

32

Page 33: BIG DATA: From mammoth to elephant

SEARCH / SECONDARY INDICES

NO SEARCH OUT OF THE BOX OTHER THAN LINEAR SCAN OVER THE TABLE AND FILTERS.

The same turned out to apply to secondary indices in HBase.
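The difference between a linear scan with a filter and a secondary index can be sketched on a toy in-memory table. The table contents and function names below are hypothetical and stand in for the real store; nothing here is HBase API.

```python
from collections import defaultdict

# Hypothetical table: row key -> column values.
TABLE = {
    "row-1": {"LastName": "Smith", "City": "New York"},
    "row-2": {"LastName": "Davis", "City": "Atlanta"},
    "row-3": {"LastName": "Brown", "City": "Seattle"},
}


def scan_with_filter(table, column, value):
    """Linear scan: every row is visited, so cost grows with table size."""
    return [key for key, row in table.items() if row.get(column) == value]


def build_secondary_index(table, column):
    """Secondary index: column value -> row keys, maintained separately."""
    index = defaultdict(list)
    for key, row in table.items():
        if column in row:
            index[row[column]].append(key)
    return index


city_index = build_secondary_index(TABLE, "City")
```

Both `scan_with_filter(TABLE, "City", "Atlanta")` and `city_index["Atlanta"]` find the same row, but the index answers without touching the other rows. The price is that something has to keep the index in sync with every data change.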

33

Page 34: BIG DATA: From mammoth to elephant

SEARCH / SECONDARY INDICES

HOW WE MADE IT

HBase handles user data changes

Indexes are built on SOLR

NGData Lily indexer transforms data changes into SOLR index updates

34

Page 35: BIG DATA: From mammoth to elephant

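HBase: Data and search integration

[Diagram: the client sends data updates to HBase; changes reach the Lily HBase NRT indexer through REPLICATION; the indexer updates SOLR cloud; the client sends search requests (HTTP) to SOLR cloud and receives search responses.]

● Client: the user just puts (or deletes) data.

● Lily HBase NRT indexer: translates data changes into SOLR index updates.

● SOLR cloud: search and indexing together.

● Apache ZooKeeper does all coordination.

● Provides real indexing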
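The flow on this slide, where the user only writes data and a separate indexer turns each change into an index update, can be sketched with two toy classes. These are illustrative stand-ins written for this example (`ToyStore`, `ToyIndexer`), not the Lily indexer or SOLR APIs.

```python
from collections import defaultdict


class ToyIndexer:
    """Keeps a word -> row keys index up to date from change events."""

    def __init__(self):
        self.index = defaultdict(set)
        self.indexed_words = {}  # row key -> words currently indexed

    def on_change(self, op, row_key, text=None):
        # Drop whatever was indexed for this row before the change.
        for word in self.indexed_words.pop(row_key, set()):
            self.index[word].discard(row_key)
        if op == "put":
            words = set(text.lower().split())
            self.indexed_words[row_key] = words
            for word in words:
                self.index[word].add(row_key)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))


class ToyStore:
    """Data store that forwards every change to a listener."""

    def __init__(self, listener):
        self.rows = {}
        self.listener = listener

    def put(self, row_key, text):
        self.rows[row_key] = text
        self.listener("put", row_key, text)

    def delete(self, row_key):
        self.rows.pop(row_key, None)
        self.listener("delete", row_key)


indexer = ToyIndexer()
store = ToyStore(indexer.on_change)
store.put("r1", "Kate Davis Atlanta")
store.put("r2", "John Smith Atlanta")
```

The writer never talks to the index directly, which is the point of the design: search can be added, rebuilt or replaced without changing the code that stores data.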

35

Page 36: BIG DATA: From mammoth to elephant

● Kafka is a high throughput distributed messaging system.

● Allows true realtime system reaction through publish-subscribe approach.

● New services can subscribe to data events stream.

GOING REALTIME

Batch load

Realtime load

New data
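The publish-subscribe idea from the bullets above can be sketched as an append-only log with a read position per subscriber. This toy class is an assumption-laden illustration of the concept, not the Kafka client API; all names in it are hypothetical.

```python
class ToyTopic:
    """Append-only message log with one read offset per subscriber."""

    def __init__(self):
        self.log = []
        self.offsets = {}

    def publish(self, message):
        """Publishers only append; they do not know who reads."""
        self.log.append(message)

    def subscribe(self, name, from_beginning=True):
        """A new service can join at any time, from the start or from now."""
        self.offsets[name] = 0 if from_beginning else len(self.log)

    def poll(self, name):
        """Return messages this subscriber has not seen yet."""
        start = self.offsets[name]
        self.offsets[name] = len(self.log)
        return self.log[start:]


topic = ToyTopic()
topic.publish("patient-created")
topic.subscribe("batch-loader")
topic.publish("patient-updated")
first_batch = topic.poll("batch-loader")
```

Because each subscriber tracks its own position, a batch loader and a realtime consumer can read the same stream at different speeds without affecting each other or the publishers.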

36

Page 37: BIG DATA: From mammoth to elephant

● Kafka can be separated from the Hadoop infrastructure or have a backup cluster.

● Data publishers can switch to another cluster.

● Subscribers (including Spark on Hadoop) keep two places of subscription.

● So now you are free to put a Kafka cluster into maintenance or back up subscribers.

GOING REALTIME

GENTLY

MAINTENANCE

37

Page 38: BIG DATA: From mammoth to elephant

This is our PRESENT DAY

.. yet is powered by

38

Page 39: BIG DATA: From mammoth to elephant

SO WHERE ARE WE GOING?

39

Page 40: BIG DATA: From mammoth to elephant

OVER BIG DATA: REACTIVE MANIFESTO

MOTIVATION

… users expect millisecond response times and 100% uptime. Data is measured in Petabytes. Today's demands are simply not met by yesterday’s software architectures.

40

Page 41: BIG DATA: From mammoth to elephant

OVER BIG DATA: REACTIVE MANIFESTO

… we want systems that are Responsive, Resilient, Elastic and Message Driven. We call these Reactive Systems. http://www.reactivemanifesto.org/

41

Page 42: BIG DATA: From mammoth to elephant

OVER BIG DATA: REACTIVE MANIFESTO

Responsiveness is the cornerstone of usability and utility, but more than that, responsiveness means that problems may be detected quickly and dealt with effectively.

RESPONSIVE

42

Page 43: BIG DATA: From mammoth to elephant

OVER BIG DATA: REACTIVE MANIFESTO

RESILIENT

The system stays responsive in the face of failure.

… The client of a component is not burdened with handling its failures.

All services here are located through ZooKeeper, which is quorum based, so resilience is achieved.

43

Page 44: BIG DATA: From mammoth to elephant

OVER BIG DATA: REACTIVE MANIFESTO

ELASTIC

Reactive Systems can react to changes in the input rate by increasing or decreasing the resources allocated to service these inputs.

Both HDFS and HBase allow dynamic node addition / removal.

YARN already handles most resource allocation work and keeps making progress.

44

Page 45: BIG DATA: From mammoth to elephant

OVER BIG DATA: REACTIVE MANIFESTO

Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling.

MESSAGE DRIVEN

Asynchronous messages from applications

Any application can subscribe, not only Hadoop services

45

Page 46: BIG DATA: From mammoth to elephant

LESSONS LEARNED

● No transition in one step. You enter Big Data world step by step.

● Change your mind first. You should stop thinking in the old style. Do not simply try to map your existing approaches.

● No silver bullet. Don't ruin your existing infrastructure. Extend it. NoSQL is not always good, and some cases really should stay on SQL. Use the right tool.

● As you progress you pay more attention to operations and reactive system properties.

46

Page 47: BIG DATA: From mammoth to elephant

QUESTIONS?

47