In-Memory Data Management Trends & Techniques
Lightning Talk
Greg Luck, CTO, Hazelcast
www.hazelcast.com
In-Memory Hardware Trends
How to Use It
Von Neumann Architecture
Hardware Trends
Commodity Multi-core Servers
[Chart: cores per CPU, scale 0–20]
UMA -> NUMA (Uniform Memory Access -> Non-Uniform Memory Access)
Commodity 64-bit Servers
32-bit: 4 GB addressable. 64-bit: 18 EB addressable.
50 Years of RAM Prices: Historical and Projected
50 Years of Disk Prices
SSD Prices
Average Price $1/GB
Cost Comparison: USD/GB 2012
Medium   USD/GB   Relative to Disk   Cost of 100 TB
Disk     $0.04    1x                 $4k
SSD      $1       25x                $100k
DRAM     $21      525x               $2.1m
Max RAM Per Commodity Server
[Chart: max RAM per commodity server, 0–9 TB, 2010–2013]
Latency across the network
[Chart: latency across the network, 0–70 µs]
Access Times & Sizes
Level                      RR Latency   Typical Size    Technology                 Managed By
Registers                  < 1 ns       1 KB            Custom CMOS                Compiler
L1 Cache                   1 ns         8–128 KB        SRAM                       Hardware
L2 Cache                   3 ns         0.5–8 MB        SRAM                       Hardware
L3 Cache (on chip)         10–15 ns     4–30 MB         SRAM                       Hardware
Main Memory                60 ns        16 GB–TBs       DRAM                       OS/App
SSD                        50–100 µs    400 GB–6 TB     Flash Memory               OS/App
Main Memory over Network   2–100 µs     Unbounded       DRAM/Ethernet/InfiniBand   OS/App
Disk                       4–7 ms       Multiple TBs    Magnetic Rotational Disk   OS/App
Disk over Network          6–10 ms      Unbounded       Disk/Ethernet/InfiniBand   OS/App
Cache is up to 30 times faster than memory. Memory is 10⁶ times faster than disk.
Network memory is 10³ times faster than disk. SSD is 10² times faster than disk.
Techniques
Exploit Data Locality
Data is more likely to be read if:
• It was recently read (temporal locality)
• It is adjacent to other data, e.g. arrays, fields in an object
• It is part of a pattern, e.g. looping, relations
• Some data is naturally accessed more frequently, e.g. a Pareto distribution
Working with the CPU’s Cache Hierarchy
• Memory is up to 30x slower than cache
• Alleviated somewhat by NUMA, wide channel, multi-channel/large caches
• Vector instructions
• Work with cache lines
• Work with memory pages (TLBs)
• Work with prefetching (see the traversal sketch below)
• Exploit NUMA with CPU affinity:
  numactl --physcpubind=0 --localalloc java …
• Exploit natural data locality
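To make the cache-line and prefetching points concrete, here is a minimal, self-contained sketch (class name and sizes are illustrative, not from the talk): a linear pass touches consecutive cache lines and benefits from hardware prefetching, while random access defeats both.

```java
import java.util.Random;

// Illustrative micro-benchmark: linear vs. random traversal of a large array.
// The array is far bigger than any CPU cache, so the random pass misses in
// cache on almost every access while the linear pass is prefetch-friendly.
// (Run with a heap of at least ~512 MB, e.g. -Xmx512m.)
public class LocalityDemo {
    public static void main(String[] args) {
        int n = 64 * 1024 * 1024;          // 64M ints = 256 MB
        int[] data = new int[n];

        long t0 = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < n; i++) sum += data[i];               // sequential cache lines
        long linearNs = System.nanoTime() - t0;

        Random rnd = new Random(42);
        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[rnd.nextInt(n)];  // cache misses dominate
        long randomNs = System.nanoTime() - t0;

        System.out.printf("linear %d ms, random %d ms (sum=%d)%n",
                linearNs / 1_000_000, randomNs / 1_000_000, sum);
    }
}
```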
Data Locality Effects – intra machine
[Chart: traversal time for linear, random-page and random-heap access on Intel U4100, i7-860 and i7-2760QM]
Tiered Storage
Tier                                            Speed (TPS)   Size (GB)
Heap Store (local)                              5,000,000+    10
Off-Heap Store (local)                          1,000,000     1,000+
Local Disk, SSD and Rotational (restartable)    100,000       2,000+
Network Accessible Memory (network storage)     100,000+      10,000s
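A minimal sketch of the read path a tiered store like this implies (illustrative class, not any product's API): try the fastest tier first and promote values found in a slower tier.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative tiered read-through: real products add eviction, write paths,
// and off-heap/disk encodings, all omitted here for brevity.
public class TieredLookup {
    private final Map<String, byte[]> heapTier = new ConcurrentHashMap<>(); // fastest, smallest
    private final Map<String, byte[]> offHeapTier;  // e.g. backed by direct buffers
    private final Map<String, byte[]> diskTier;     // slowest, largest

    public TieredLookup(Map<String, byte[]> offHeapTier, Map<String, byte[]> diskTier) {
        this.offHeapTier = offHeapTier;
        this.diskTier = diskTier;
    }

    public byte[] get(String key) {
        byte[] value = heapTier.get(key);
        if (value == null && (value = offHeapTier.get(key)) != null) {
            heapTier.put(key, value);  // promote hit to the faster tier
        }
        if (value == null && (value = diskTier.get(key)) != null) {
            heapTier.put(key, value);
        }
        return value;
    }
}
```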
Data Locality Effects – inter machine
Compared with a hybrid in-process and distributed cache:
Latency = L1 speed * proportion + L2 speed * proportion
L1 = 0 ms (< 5 µs) for on-heap and 50–100 µs off-heap; L2 = 1 ms
80% L1 Pareto model: latency = 0 * 0.8 + 1 * 0.2 = 0.2 ms
90% L1 Pareto model: latency = 0 * 0.9 + 1 * 0.1 = 0.1 ms
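The same blended-latency model as executable code (class and method names are illustrative):

```java
// Blended latency for a two-tier cache under a given L1 hit ratio,
// matching the Pareto examples above (L1 ~ 0 ms, L2 ~ 1 ms).
public class BlendedLatency {
    static double blendedMs(double l1Ms, double l2Ms, double l1HitRatio) {
        return l1Ms * l1HitRatio + l2Ms * (1.0 - l1HitRatio);
    }

    public static void main(String[] args) {
        System.out.printf("%.1f ms at 80%% L1 hits%n", blendedMs(0.0, 1.0, 0.8));
        System.out.printf("%.1f ms at 90%% L1 hits%n", blendedMs(0.0, 1.0, 0.9));
    }
}
```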
Columnar Storage
• Manipulate data locality
• Sorted dictionary compression for finite values (see the sketch below)
• Allows values to be held in cache for SSE instructions
• Better cache line effectiveness
• Fewer CPU cache misses for aggregate calculations
• Cross-over point is around a few dozen columns
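A minimal sketch of dictionary compression on a single column (illustrative class; sorting of the dictionary is omitted): distinct values are stored once, the column becomes a dense int array, and aggregates scan contiguous memory with few cache misses.

```java
import java.util.*;

// Illustrative dictionary-compressed column: each distinct string gets an
// int code; the column itself is a primitive array, so scans walk
// consecutive cache lines rather than chasing object pointers.
public class DictionaryColumn {
    private final List<String> dictionary = new ArrayList<>();
    private final Map<String, Integer> codes = new HashMap<>();
    private int[] column = new int[0];
    private int size;

    public void append(String value) {
        Integer code = codes.computeIfAbsent(value, v -> {
            dictionary.add(v);
            return dictionary.size() - 1;
        });
        if (size == column.length) column = Arrays.copyOf(column, Math.max(16, size * 2));
        column[size++] = code;
    }

    // Aggregate over codes: one linear pass over a dense int array.
    public long countOf(String value) {
        Integer code = codes.get(value);
        if (code == null) return 0;
        long count = 0;
        for (int i = 0; i < size; i++) if (column[i] == code) count++;
        return count;
    }
}
```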
Parallelism
• Multi-threading
• Avoid synchronized: use CAS (see the sketch below)
• Query using a scatter-gather pattern
• Map/Reduce, e.g. Hazelcast Map/Reduce
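A minimal sketch of the CAS alternative to synchronized (illustrative; the java.util.concurrent.atomic classes wrap the same hardware instruction):

```java
import java.util.concurrent.atomic.AtomicLong;

// Lock-free increment via a compare-and-swap retry loop, instead of a
// synchronized block guarding a plain long. No thread ever blocks; a
// losing thread simply retries with the freshly observed value.
public class CasCounter {
    private final AtomicLong count = new AtomicLong();

    public void increment() {
        long current;
        do {
            current = count.get();
        } while (!count.compareAndSet(current, current + 1)); // CAS: succeed or retry
    }

    public long get() {
        return count.get();
    }
}
```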
Java: Will it make the cut?
Garbage collection limits heap usage: G1 and Balanced aim for <100 ms pauses at a 10 GB heap.
[Diagram: a 64 GB server where the Java app is memory bound; GC pause time (4 s at a 4 GB heap) leaves most available memory unused]
Off-heap storage works around the collector (see the sketch below).
No low-level CPU access.
Java is challenged as an infrastructure language despite its newly popular usage for this.
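A minimal sketch of the off-heap idea using direct ByteBuffers (one common approach; BigMemory and similar products use their own allocators, and the class here is illustrative):

```java
import java.nio.ByteBuffer;

// Illustrative off-heap slab: memory allocated with allocateDirect lives
// outside the Java heap, so the garbage collector never scans or moves it
// and it does not contribute to GC pause times.
public class OffHeapSlab {
    private final ByteBuffer slab;

    public OffHeapSlab(int capacityBytes) {
        slab = ByteBuffer.allocateDirect(capacityBytes); // allocated off-heap
    }

    public void write(int offset, byte[] value) {
        ByteBuffer view = slab.duplicate(); // independent position, same memory
        view.position(offset);
        view.put(value);
    }

    public byte[] read(int offset, int length) {
        byte[] out = new byte[length];
        ByteBuffer view = slab.duplicate();
        view.position(offset);
        view.get(out);
        return out;
    }
}
```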
CEP/Stream Processing
• Don't let data pool up and then process it with "pull queries"
• Invert that: process data as it streams in, with "push queries"
• Queries execute against "tables" that break the stream up into a current time window
• Hold the window and intermediate results in memory (see the sketch below)
• Results are in real time
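A minimal sketch of a push-style time window (illustrative, not any CEP product's API): each event is processed as it arrives, and only the current window plus a running aggregate are held in memory, so results are available in real time.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sliding one-minute window maintaining a running average.
public class WindowedAverage {
    private static final long WINDOW_MS = 60_000;
    private final Deque<long[]> window = new ArrayDeque<>(); // [timestampMs, value]
    private long sum;

    // Called for every incoming event ("push query" style).
    public double onEvent(long timestampMs, long value) {
        window.addLast(new long[] {timestampMs, value});
        sum += value;
        // Evict events that have slid out of the window.
        while (!window.isEmpty() && window.peekFirst()[0] < timestampMs - WINDOW_MS) {
            sum -= window.removeFirst()[1];
        }
        return (double) sum / window.size(); // current-window average
    }
}
```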
In-Situ Processing
Rather than moving the data to the processing, you process it in situ.
Examples:
• HANA Calculation Engine
• Google BigQuery
• Exadata Storage Servers
• Hazelcast EntryProcessor and Distributed Executor Service (see the sketch below)
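A minimal sketch of in-situ processing with a Hazelcast EntryProcessor, assuming the Hazelcast 3.x API (the map name and value type are illustrative): the mutation executes on the cluster member that owns the key, so the value never crosses the network.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.map.AbstractEntryProcessor;
import java.util.Map;

// Illustrative in-situ update: the processor is sent to the data,
// not the data to the processor.
public class InSituIncrement {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Integer> counters = hz.getMap("counters");
        counters.put("page-views", 0);

        counters.executeOnKey("page-views", new AbstractEntryProcessor<String, Integer>() {
            @Override
            public Object process(Map.Entry<String, Integer> entry) {
                entry.setValue(entry.getValue() + 1); // runs where the data lives
                return null;
            }
        });

        System.out.println(counters.get("page-views")); // 1
        hz.shutdown();
    }
}
```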
Souped-Up Von Neumann Architecture
[Diagram: the Von Neumann architecture souped up with multi-processor/multi-core CPUs (vector/AES instructions; more cache; NUMA; wide/multi-channel memory; locality), 64-bit DRAM with compression, PCI Flash, SSD (flash and RAM), and memory over the network]
The Data Management Landscape
The new data management world
Data Grid: Terracotta, Coherence, GemFire, …
SAP HANA
Relational | Analytical
• "Appliance"
• Aggressive IA64 optimisations
• ACID, SQL and MDX
• In-memory, SSD and disk
• Row- and column-based storage
• Fast aggregation on the column store
• Single-instance 1 TB limit
• Uses compression (est. 5x size reduction)
• Parallel DB: round-robin, hash, or range partitioning of a table with shared storage
• Updates as delta inserts
• Data is fed from source systems near-real-time, real-time, or batch
VoltDB
Relational | NewSQL | Operational | Analytical
• An all-in-memory design
• Full SQL and full ACID
• Partitioned per core so that one thread owns its partition, avoiding locking and latching
• Redundancy provided by multiple instances with writes being replicated
• Claims to be 45x faster
Oracle Exadata
Relational | Operational | Analytical | Appliance
• Combines Oracle RAC with "Storage Servers"
• Connected within the box with InfiniBand QDR
• Storage Servers use PCI Flash (not SSD) for a 22 TB hardware cache
• In-situ computation on the Storage Servers with "Smart Scan"
• Uses "Hybrid Columnar Compression", a compromise of row and column storage
Terracotta BigMemory
Key-Value | Operational | Data Grid
• In-memory
• Key-value with the Ehcache and, soon, javax.cache APIs
• In-process (L1) and server (L2) storage
• Persistence via the log-forward Fast Restart Store: SSD or disk
• Tiered storage: local on-heap, local off-heap, server on-heap, server off-heap
• Partitions with consistent hashing
• Search with parallel in-situ execution
• Off-heap allows 2 TB uncompressed in each app server Java process and on each server partition
• Compression
• Speed ranging from < 1 µs to a few ms
Hazelcast
Key-Value | Operational | Data Grid
• In-memory
• Key-value Map API and javax.cache API (see the usage sketch below)
• Near cache and server data storage
• Tiered storage: local on-heap, local off-heap, server on-heap, server off-heap
• Partitions with consistent hashing
• Search with parallel in-situ execution
• In-situ processing with Entry Processors and Distributed Executors
• Speed ranging from < 1 µs to a few ms
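A minimal usage sketch of the Hazelcast key-value Map API, assuming the Hazelcast 3.x API (map name and values are illustrative): the map is partitioned across the cluster with consistent hashing, and a near cache, if configured, serves repeated local reads from in-process memory.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Illustrative distributed map usage.
public class HazelcastExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(); // joins or forms a cluster
        IMap<String, String> sessions = hz.getMap("sessions");

        sessions.put("user-42", "logged-in");        // stored on the owning partition
        System.out.println(sessions.get("user-42")); // served by owner or near cache

        hz.shutdown();
    }
}
```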
Disk is the new tape
SSD is the new disk
Memory is the new operational store