dunning time-series-2015

© 2014 MapR Technologies 1 © 2014 MapR Technologies

Upload: ted-dunning

Post on 17-Jul-2015


TRANSCRIPT

Page 1: Dunning time-series-2015


Page 2: Dunning time-series-2015


Agenda

• The Internet is turning upside down

• A story

• The last (mile) shall be first

• Time series on NoSQL

• Faster time series on NoSQL

• Summary

Page 3: Dunning time-series-2015


How the Internet Works

• Big content servers feed data across the backbone to

• Regional caches and servers feed data across neighborhood transport to

• The “last mile”

• Bits are nearly conserved, $ are concentrated centrally

– But total $ mass at the edge is much higher

Page 4: Dunning time-series-2015


How The Internet Works

[Diagram: Server, Caches, Gateways, Switch/Firewall pairs, clients c1 and c2]

Page 5: Dunning time-series-2015


Conservation of Bits Decreases Bandwidth

[Diagram: Server, Caches, Gateways, Switch/Firewall pairs, clients c1 and c2]

Page 6: Dunning time-series-2015


Total Investment Dominated by Last Mile

[Diagram: Server, Caches, Gateways, Switch/Firewall pairs, clients c1 and c2]

Page 7: Dunning time-series-2015


The Rub

• What's the problem?

– Speed (end-to-end latency, backbone bandwidth)

– Feasibility (cost for consumer links)

– Caching

• What do we need?

– Cheap last-mile hardware

– Good caches

Page 8: Dunning time-series-2015


First: An apology for going off-script

Page 9: Dunning time-series-2015


Now, the story

Page 10: Dunning time-series-2015


Page 11: Dunning time-series-2015


By the 1840s, the NY-SF sailing time was down to 130-180 days

Page 12: Dunning time-series-2015


Page 13: Dunning time-series-2015


In 1851, the record was set at 89 days by the Flying Cloud

Page 14: Dunning time-series-2015


The difference was due (in part) to big data and a primitive kind of time-series database

Page 15: Dunning time-series-2015

Page 16: Dunning time-series-2015

Page 17: Dunning time-series-2015

Page 18: Dunning time-series-2015

These charts were free … if you donated your data

Page 19: Dunning time-series-2015


But how does this apply today?

Page 20: Dunning time-series-2015


What has changed?

Where will it lead?

Page 21: Dunning time-series-2015

Page 22: Dunning time-series-2015

Page 23: Dunning time-series-2015

Page 24: Dunning time-series-2015

Page 25: Dunning time-series-2015

Page 26: Dunning time-series-2015

Page 27: Dunning time-series-2015

Page 28: Dunning time-series-2015

Page 29: Dunning time-series-2015

Page 30: Dunning time-series-2015

Page 31: Dunning time-series-2015

Things

Page 32: Dunning time-series-2015


Emitting data

Page 33: Dunning time-series-2015


How The Internet Works

[Diagram: Server, Caches, Gateways, Switch/Firewall pairs, clients c1 and c2]

Page 34: Dunning time-series-2015


How the Internet is Going to Work

[Diagram: Server, Caches, Gateways, Switch/Controller pairs, machines m1 to m6]

Page 35: Dunning time-series-2015


Where Will The $ Go?

[Diagram: Server, Caches, Gateways, Switch/Controller pairs, machines m1 to m6]

Page 36: Dunning time-series-2015


Sensors

Page 37: Dunning time-series-2015


Controllers

Page 38: Dunning time-series-2015


The Problems

• Sensors and controllers have little processing or space

– SIM cards = 20 MHz processor, 128 kb space = 16 kB

– Arduino mini = 15 kB RAM (more EPROM)

– BeagleBone/Raspberry Pi = ~500 MB RAM

• Sensors and controllers have little power

– Very common to power down 99% of the time

• Sensors and controllers often have very low bandwidth

– Mesh networks with base rates << 1 Mb/s

– Power line networking

– Intermittent 3G/4G/LTE connectivity

Page 39: Dunning time-series-2015


What Do We Need to Do With a Time Series

• Acquire

– Measurement, transmission, reception

– Mostly not our problem

• Store

– We own this

• Retrieve

– We have to allow this

• Analyze and visualize

– We facilitate this via retrieval

Page 40: Dunning time-series-2015


Retrieval Requirements

• Retrieve by time-series, time range, tags

– Possibly pull millions of data points at a time

– Possibly do on-the-fly windowed aggregations

• Search by unstructured data

– Typically requires time-windowed faceting after search

– Also need to dive in with first kind of retrieval
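To make "on-the-fly windowed aggregations" concrete, here is a minimal sketch in Python. It assumes points arrive as (timestamp, value) pairs for one series and computes a mean per fixed-size window inside a requested time range. The function name `windowed_mean` and its parameters are illustrative, not part of any system named in the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def windowed_mean(points: Iterable[Tuple[int, float]],
                  start: int, end: int, window_s: int) -> Dict[int, float]:
    """Mean value per window for points with start <= timestamp < end."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in points:
        if start <= timestamp < end:
            # Align each point to the start of the window it falls in.
            bucket = start + ((timestamp - start) // window_s) * window_s
            sums[bucket] += value
            counts[bucket] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sorted(sums)}


# Hypothetical samples: the point at t=25 lies outside the requested range.
samples = [(0, 1.0), (5, 3.0), (10, 5.0), (25, 7.0)]
result = windowed_mean(samples, start=0, end=20, window_s=10)
print(result)
```

A real store would push this aggregation down next to the data rather than pulling millions of points to the client first; the sketch only shows the bucketing logic.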

Page 41: Dunning time-series-2015


Storage choices and trade-offs

• Flat files

– Great for rapid ingest with massive data

– Handles essentially any data type

– Less good for data requiring frequent updates

– Harder to find specific ranges

• Traditional relational db

– Ingests up to tens of thousands of rows/sec; prefers well-structured (numerical) data; expensive

• Non-relational db: Tables (such as MapR tables in M7 or HBase)

– Ingests up to 100,000 rows/sec

– Handles wide variety of data

– Good for frequent updates

– Easily scanned in a range

Page 42: Dunning time-series-2015


Specific Example

• Consider a server farm

• Lots of system metrics

• Typically 100-300 stats / 30 s

• Loads, RPCs, packets, requests/s

• Common to have 100 – 10,000 machines

Page 43: Dunning time-series-2015


The General Outline

• 10 samples / second / machine x 1,000 machines = 10,000 samples / second

• This is what OpenTSDB was designed to handle

• Install and go, but don’t test at scale

Page 44: Dunning time-series-2015


Specific Example

• Consider oil drilling rigs

• When drilling wells, there are *lots* of moving parts

• Typically a drilling rig makes about 10K samples/s

• Temperatures, pressures, magnetics, machine vibration levels, salinity, voltage, currents, many others

• Typical project has 100 rigs

Page 45: Dunning time-series-2015


The General Outline

• 10K samples / second / rig x 100 rigs = 1M samples / second

Page 46: Dunning time-series-2015


The General Outline

• 10K samples / second / rig x 100 rigs = 1M samples / second

• But wait, there’s more

– Suppose you want to test your system

– Perhaps with a year of data

– And you want to load that data in << 1 year

• 100x real-time = 100M samples / second
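The arithmetic on this slide can be checked with a few lines of Python. The figures come from the slide; the function names are illustrative only. The last value shows why "<< 1 year" matters: at 100x real time, a year of history still takes a few days to load.

```python
def aggregate_rate(samples_per_s_per_source: int, sources: int) -> int:
    """Total samples per second across all sources."""
    return samples_per_s_per_source * sources


def replay_days(speedup: float, history_days: float = 365.0) -> float:
    """Wall-clock days needed to load `history_days` of data at `speedup` x real time."""
    return history_days / speedup


rig_rate = aggregate_rate(10_000, 100)  # 10K samples/s/rig x 100 rigs
backfill_rate = rig_rate * 100          # loading at 100x real time
print(rig_rate, backfill_rate, replay_days(100))
```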

Page 47: Dunning time-series-2015


How Should That Work?

[Diagram components: Samples, Message queue, Collector, MapR table, Web service, Users]

Page 48: Dunning time-series-2015


A First Attempt

OpenTSDB is a distributed time-series database built on top of HBase, enabling you …

– to store & index, as well as

– to query & plot

… metrics at scale.

Page 49: Dunning time-series-2015


Design Goals

• Distributed storage of metrics

• Fast and easy metric queries

• Scale out to thousands of machines and billions of data points

• No single point of failure (SPOF)

Page 50: Dunning time-series-2015


Key concepts

Page 51: Dunning time-series-2015


Key concepts

(00:38, 56) mysql.com_delete schema=userdb

Page 52: Dunning time-series-2015


Key concepts

data point: (timestamp, value)

+ metric

+ tag: key=value

time series
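One way to read this slide: a data point is a (timestamp, value) pair, and the metric name plus the set of tags identifies which time series it belongs to. A minimal Python sketch of that model follows. The class and function names are illustrative and are not OpenTSDB's API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataPoint:
    timestamp: int                      # seconds since the UNIX epoch
    value: float
    metric: str                         # e.g. "mysql.bytes_sent"
    tags: Tuple[Tuple[str, str], ...]   # sorted (key, value) pairs

    def series_key(self) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
        """Metric plus tags identifies the time series; time and value do not."""
        return (self.metric, self.tags)


def make_point(timestamp: int, value: float, metric: str, **tags: str) -> DataPoint:
    # Sort tags so that tag order does not change the series identity.
    return DataPoint(timestamp, value, metric, tuple(sorted(tags.items())))


a = make_point(1409497082, 1.0, "mysql.bytes_sent", schema="foo", host="db1")
b = make_point(1409497099, 2.0, "mysql.bytes_sent", host="db1", schema="foo")
c = make_point(1409497099, 2.0, "mysql.bytes_sent", host="db2", schema="foo")
print(a.series_key() == b.series_key(), a.series_key() == c.series_key())
```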

Page 53: Dunning time-series-2015


Example TS

...

1409497082 327810227706 mysql.bytes_received schema=foo host=db1

1409497099 6604859181710 mysql.bytes_sent schema=foo host=db1

1409497106 327812421706 mysql.bytes_received schema=foo host=db1

1409497113 6604901075387 mysql.bytes_sent schema=foo host=db

...

UNIX epoch timestamp: $(date +%s)

a metric (often hierarchical)

two tags
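The lines above can be split into their parts with a short parser. This sketch follows the field order shown on this slide (timestamp, value, metric, then tags). That order is taken from the slide only; other tools may use a different order, so treat `parse_line` as a hypothetical helper.

```python
from typing import Dict, Tuple, Union

Number = Union[int, float]


def parse_line(line: str) -> Tuple[int, Number, str, Dict[str, str]]:
    """Parse 'timestamp value metric key=value ...' into its parts."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    timestamp = int(fields[0])
    try:
        value: Number = int(fields[1])
    except ValueError:
        value = float(fields[1])
    metric = fields[2]
    # Everything after the metric is a key=value tag.
    tags = dict(field.split("=", 1) for field in fields[3:])
    return timestamp, value, metric, tags


parsed = parse_line(
    "1409497082 327810227706 mysql.bytes_received schema=foo host=db1")
print(parsed)
```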

Page 54: Dunning time-series-2015


Declare metric

$ tsdb mkmetric mysql.bytes_sent mysql.bytes_received

metrics mysql.bytes_sent: [0, 0, 1]

metrics mysql.bytes_received: [0, 0, 2]

… or use --auto-metric

Page 55: Dunning time-series-2015


Collect metric

• tcollector: gathers data from local collectors, pushes it to TSDs, and provides deduplication

• Lots of collectors bundled

– General: iostat, netstat, etc.

– Others: MySQL, HBase, etc.

• … or roll your own
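As a rough illustration of "roll your own": a collector can be a small script that periodically prints one line per sample to standard output. The sketch below assumes a line layout of `metric timestamp value tag=value`; that layout, the metric name `proc.loadavg.1min`, and the `source` tag are assumptions for illustration, so check the tcollector documentation before relying on them. `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import sys
import time


def format_sample(metric: str, timestamp: int, value: float, **tags: str) -> str:
    """Render one sample as 'metric timestamp value key=value ...' (assumed layout)."""
    tag_text = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"{metric} {timestamp} {value} {tag_text}".rstrip()


def emit_once(out=sys.stdout) -> None:
    """Sample the 1-minute load average and print it as a single line."""
    load_1min, _, _ = os.getloadavg()
    out.write(format_sample("proc.loadavg.1min", int(time.time()),
                            load_1min, source="sketch") + "\n")
    out.flush()


def run(iterations: int, interval_s: float) -> None:
    """Emit a fixed number of samples, sleeping between them."""
    for i in range(iterations):
        emit_once()
        if i + 1 < iterations:
            time.sleep(interval_s)
```

Calling `run(iterations=4, interval_s=15)` would print four samples over 45 seconds; a long-running collector would loop indefinitely instead.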

Page 56: Dunning time-series-2015


The Whole Picture

HBase or MapR-DB

Page 57: Dunning time-series-2015


Wide Table Design: Point-by-Point

Page 58: Dunning time-series-2015


Wide Table Design: Hybrid Point-by-Point + Blob

Insertion of data as blob makes original columns redundant

Non-relational, but you can query these tables with Drill
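The two wide-table designs can be sketched with plain Python dictionaries standing in for the table. This is a deliberately simplified model: it assumes one row per series per hour, with each column named by the sample's offset in seconds within that hour, and a compaction step that packs the columns into a single blob. The real OpenTSDB schema uses binary row keys and encoded qualifiers, so every name and encoding here is illustrative.

```python
import struct
from typing import Dict, List, Tuple

ROW_SPAN_S = 3600  # one row covers one hour of one time series


def row_key(metric: str, tags: Dict[str, str], timestamp: int) -> Tuple:
    """Row key = series identity plus the hour the sample falls in."""
    base = timestamp - timestamp % ROW_SPAN_S
    return (metric, tuple(sorted(tags.items())), base)


def put_point(table: dict, metric: str, tags: Dict[str, str],
              timestamp: int, value: float) -> None:
    """Point-by-point design: one column per sample, keyed by offset."""
    row = table.setdefault(row_key(metric, tags, timestamp), {})
    row[timestamp % ROW_SPAN_S] = value


def compact_row(row: Dict[int, float]) -> Dict[str, bytes]:
    """Hybrid design: pack all columns into one blob; old columns become redundant."""
    blob = b"".join(struct.pack(">Hd", offset, row[offset])
                    for offset in sorted(row))
    return {"blob": blob}


def read_blob(blob: bytes) -> List[Tuple[int, float]]:
    """Decode (offset, value) pairs; each record is 2 + 8 = 10 bytes."""
    return [struct.unpack_from(">Hd", blob, i) for i in range(0, len(blob), 10)]


table: dict = {}
tags = {"schema": "foo", "host": "db1"}
put_point(table, "mysql.bytes_sent", tags, 1409497082, 1.0)
put_point(table, "mysql.bytes_sent", tags, 1409497106, 2.0)
put_point(table, "mysql.bytes_sent", tags, 1409497200, 3.0)  # next hour, new row
first_row = table[row_key("mysql.bytes_sent", tags, 1409497082)]
decoded = read_blob(compact_row(first_row)["blob"])
print(len(table), decoded)
```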

Page 59: Dunning time-series-2015


Status to This Point

• Each sample requires one insertion, compaction requires another

• Typical performance on SE cluster

– 1 edge node + 4 cluster nodes

– 20,000 samples per second observed

– Would be faster on performance cluster, possibly not a lot

• Suitable for server monitoring

• Not suitable for large scale history ingestion

• Bulk load helps a little, but not much

• Still 1000x too slow for industrial work

Page 60: Dunning time-series-2015


Speeding up OpenTSDB

20,000 data points per second per node in the test cluster

Why can’t it be faster?

Page 61: Dunning time-series-2015


Speeding up OpenTSDB: open source MapR extensions

Available on GitHub: https://github.com/mapr-demos/opentsdb

Page 62: Dunning time-series-2015


Status to This Point

• 3600 samples require one insertion

• Typical results on SE cluster

– 1 edge node + 4 cluster nodes

– 14 million samples per second observed

– ~700x faster ingestion

• Typical results on performance cluster

– 2-4 edge nodes + 4-9 cluster nodes

– 110 million samples/s (4 nodes) to >200 million samples/s (8 nodes)

• Suitable for large scale history ingestion

• 30 million data points retrieved in 20s

• Ready for industrial work
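A quick back-of-the-envelope check of these figures, as a Python sketch. It uses only numbers quoted in the talk (20,000 and 14 million samples/s on the SE cluster, 3600 samples per insertion, and the 100M samples/s backfill target); the function name is illustrative. It shows how much batching reduces the number of table insertions the store has to absorb.

```python
def insertions_per_second(samples_per_s: float, samples_per_insertion: int) -> float:
    """Table insertions needed per second at a given sample rate."""
    return samples_per_s / samples_per_insertion


target = 100_000_000  # backfill target quoted earlier in the talk

point_by_point = insertions_per_second(target, 1)  # one insertion per sample
blobbed = insertions_per_second(target, 3600)      # one insertion per 3600 samples

speedup = 14_000_000 / 20_000  # observed rates on the SE cluster
print(point_by_point, round(blobbed), speedup)
```

The observed speedup of about 700x is smaller than the 3600x reduction in insertions, which is consistent with the later slide's point that ingestion becomes network limited.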

Page 63: Dunning time-series-2015


Key Results

• Ingestion is network limited

– Edge nodes are the critical resource

– Number of edge nodes defines a limit to scaling

• With enough edge nodes scaling is near perfect

• Performance of raw OpenTSDB is limited by the stateless daemon

• Modified OpenTSDB can run 1000x faster

Page 64: Dunning time-series-2015


Overall Ingestion Rate

[Chart: total ingestion rate (millions of points / second) versus number of nodes (4, 5, 8, 9); series: one ingestor, two ingestors]

Page 65: Dunning time-series-2015


Normalized Ingestion Rate

[Chart: ingestion per node (millions of points / second) versus number of nodes (4, 5, 8, 9); series: one ingestor, two ingestors]

Page 66: Dunning time-series-2015


Why MapR?

• MapR tables are inherently faster, safer

– Sustained > 1GB/s ingest rate in tests

• Mirror to M5 or M7 cluster to isolate analytics load

• Transaction logs involve frequent appends and many files

Page 67: Dunning time-series-2015


When is this All Wrong?

• In some cases, retrieval by series-id + time range is not sufficient

• May need very flexible retrieval of events based on text-like criteria

• Search may be better than a classic time-series database

• Can scale Lucene-based search to > 1 million events / second

Page 68: Dunning time-series-2015


When is it Even More Right?

• In many industrial settings, data rates from individual sensors are relatively high

– Latency to view is still measured in seconds, not sample points

• This allows batching at source

• Common requirement for highly variable sample rates

– 1 sample/s baseline, switching to 10 k samples/s

– Small batches during slow times are just fine since the number of sensors is constant

– Requires variable window sizes
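Batching at the source with variable window sizes could look roughly like the sketch below. It assumes a batch is flushed either when it reaches a maximum number of samples (the fast case) or when its oldest sample reaches a maximum age (the slow case), so view latency stays bounded in seconds. The class `SourceBatcher` and its parameters are hypothetical and are not taken from the MapR extensions.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


class SourceBatcher:
    """Collect samples and flush on a size limit or an age limit, whichever comes first."""

    def __init__(self, max_samples: int, max_age_s: float) -> None:
        self.max_samples = max_samples
        self.max_age_s = max_age_s
        self._batch: List[Sample] = []

    def add(self, timestamp: float, value: float) -> Optional[List[Sample]]:
        """Add one sample; return the finished batch if a limit was reached."""
        self._batch.append((timestamp, value))
        full = len(self._batch) >= self.max_samples
        old = timestamp - self._batch[0][0] >= self.max_age_s
        if full or old:
            batch, self._batch = self._batch, []
            return batch
        return None


batcher = SourceBatcher(max_samples=5, max_age_s=2.0)

# Slow period, 1 sample/s: the age limit triggers a small batch.
slow = [batcher.add(float(t), 0.0) for t in (0, 1, 2)]

# Fast period, 10 k samples/s: the size limit triggers a full batch.
fast = [batcher.add(10.0 + i * 0.0001, 0.0) for i in range(5)]
print(len(slow[-1]), len(fast[-1]))
```

The sketch uses sample timestamps as its clock, so a batch only flushes when a new sample arrives; a real source would also need a timer for the case where samples stop entirely.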

Page 69: Dunning time-series-2015


Summary

• The internet is turning upside down

• This will make time series ubiquitous

• Current open source systems are much too slow

• We can fix that with modern NoSQL systems

– (I wear a red hat for a reason)

Page 70: Dunning time-series-2015


Questions

Page 71: Dunning time-series-2015


Thank You

Ted Dunning, Chief Application Architect, MapR Technologies

@mapr, maprtech, mapr-technologies