In-Memory Data Management Trends & Techniques
Lightning Talk
Greg Luck, CTO, Hazelcast
www.hazelcast.com
In-Memory Hardware Trends
How to Use It
Von Neumann Architecture
Hardware Trends
Commodity Multi-core Servers
[Chart: cores per CPU, scale 0–20]
UMA -> NUMA (Uniform Memory Access -> Non-Uniform Memory Access)
Commodity 64-bit Servers
32-bit: 4 GB addressable. 64-bit: 18 EB addressable.
50 Years of RAM Prices: Historical and Projected
50 Years of Disk Prices
SSD Prices
Average Price $1/GB
Cost Comparison: USD/GB 2012
Medium   USD/GB   Relative to Disk   Cost of 100 TB
Disk     $0.04    1x                 $4k
SSD      $1       25x                $100k
DRAM     $21      525x               $2.1m
Max RAM Per Commodity Server
[Chart: max RAM per commodity server, 0–9 TB, 2010–2013]
Latency across the network
[Chart: latency across the network, 0–70 µs]
Access Times & Sizes
Level                      RR Latency   Typical Size    Technology                 Managed By
Registers                  < 1 ns       1 KB            Custom CMOS                Compiler
L1 Cache                   1 ns         8–128 KB        SRAM                       Hardware
L2 Cache                   3 ns         0.5–8 MB        SRAM                       Hardware
L3 Cache (on chip)         10–15 ns     4–30 MB         SRAM                       Hardware
Main Memory                60 ns        16 GB–TBs       DRAM                       OS/App
SSD                        50–100 µs    400 GB–6 TB     Flash Memory               OS/App
Main Memory over Network   2–100 µs     Unbounded       DRAM/Ethernet/InfiniBand   OS/App
Disk                       4–7 ms       Multiple TBs    Magnetic Rotational Disk   OS/App
Disk over Network          6–10 ms      Unbounded       Disk/Ethernet/InfiniBand   OS/App
Cache is up to 30 times faster than memory. Memory is 10⁶ times faster than disk.
Network memory is 10³ times faster than disk. SSD is 10² times faster than disk.
Techniques
Exploit Data Locality
Data is more likely to be read if:
• It was recently read (temporal locality)
• It is adjacent to other data, e.g. arrays, fields in an object
• It is part of a pattern, e.g. looping, relations
• Some data is naturally accessed more frequently, e.g. a Pareto distribution
Working with the CPU’s Cache Hierarchy
• Memory is up to 30x slower than cache
• Alleviated somewhat by NUMA, wide channel, multi-channel/large caches
• Vector instructions
• Work with cache lines
• Work with memory pages (TLBs)
• Work with prefetching (see the traversal sketch below)
• Exploit NUMA with CPU affinity:
  numactl --physcpubind=0 --localalloc java …
• Exploit natural data locality
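To make the cache-line and prefetching points concrete, here is a minimal, self-contained sketch (class name and sizes are illustrative, not from the talk): a linear pass touches consecutive cache lines and benefits from hardware prefetching, while random access defeats both.

```java
import java.util.Random;

// Illustrative micro-benchmark: linear vs. random traversal of a large array.
// The array is far bigger than any CPU cache, so the random pass misses in
// cache on almost every access while the linear pass is prefetch-friendly.
// (Run with a heap of at least ~512 MB, e.g. -Xmx512m.)
public class LocalityDemo {
    public static void main(String[] args) {
        int n = 64 * 1024 * 1024;          // 64M ints = 256 MB
        int[] data = new int[n];

        long t0 = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < n; i++) sum += data[i];               // sequential cache lines
        long linearNs = System.nanoTime() - t0;

        Random rnd = new Random(42);
        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[rnd.nextInt(n)];  // cache misses dominate
        long randomNs = System.nanoTime() - t0;

        System.out.printf("linear %d ms, random %d ms (sum=%d)%n",
                linearNs / 1_000_000, randomNs / 1_000_000, sum);
    }
}
```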
Data Locality Effects – intra machine
[Chart: traversal time for linear, random-page and random-heap access on Intel U4100, i7-860 and i7-2760QM]
Tiered Storage
Tier                                            Speed (TPS)   Size (GB)
Heap Store (local)                              5,000,000+    10
Off-Heap Store (local)                          1,000,000     1,000+
Local Disk, SSD and Rotational (restartable)    100,000       2,000+
Network Accessible Memory (network storage)     100,000+      10,000s
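A minimal sketch of the read path a tiered store like this implies (illustrative class, not any product's API): try the fastest tier first and promote values found in a slower tier.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative tiered read-through: real products add eviction, write paths,
// and off-heap/disk encodings, all omitted here for brevity.
public class TieredLookup {
    private final Map<String, byte[]> heapTier = new ConcurrentHashMap<>(); // fastest, smallest
    private final Map<String, byte[]> offHeapTier;  // e.g. backed by direct buffers
    private final Map<String, byte[]> diskTier;     // slowest, largest

    public TieredLookup(Map<String, byte[]> offHeapTier, Map<String, byte[]> diskTier) {
        this.offHeapTier = offHeapTier;
        this.diskTier = diskTier;
    }

    public byte[] get(String key) {
        byte[] value = heapTier.get(key);
        if (value == null && (value = offHeapTier.get(key)) != null) {
            heapTier.put(key, value);  // promote hit to the faster tier
        }
        if (value == null && (value = diskTier.get(key)) != null) {
            heapTier.put(key, value);
        }
        return value;
    }
}
```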
Data Locality Effects – inter machine
Compared with a hybrid in-process and distributed cache:
Latency = L1 speed * proportion + L2 speed * proportion
L1 = 0 ms (< 5 µs) for on-heap and 50–100 µs off-heap; L2 = 1 ms
80% L1 Pareto model: latency = 0 * 0.8 + 1 * 0.2 = 0.2 ms
90% L1 Pareto model: latency = 0 * 0.9 + 1 * 0.1 = 0.1 ms
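The same blended-latency model as executable code (class and method names are illustrative):

```java
// Blended latency for a two-tier cache under a given L1 hit ratio,
// matching the Pareto examples above (L1 ~ 0 ms, L2 ~ 1 ms).
public class BlendedLatency {
    static double blendedMs(double l1Ms, double l2Ms, double l1HitRatio) {
        return l1Ms * l1HitRatio + l2Ms * (1.0 - l1HitRatio);
    }

    public static void main(String[] args) {
        System.out.printf("%.1f ms at 80%% L1 hits%n", blendedMs(0.0, 1.0, 0.8));
        System.out.printf("%.1f ms at 90%% L1 hits%n", blendedMs(0.0, 1.0, 0.9));
    }
}
```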
Columnar Storage
• Manipulate data locality
• Sorted dictionary compression for finite values (see the sketch below)
• Allows values to be held in cache for SSE instructions
• Better cache line effectiveness
• Fewer CPU cache misses for aggregate calculations
• Cross-over point is around a few dozen columns
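A minimal sketch of dictionary compression on a single column (illustrative class; sorting of the dictionary is omitted): distinct values are stored once, the column becomes a dense int array, and aggregates scan contiguous memory with few cache misses.

```java
import java.util.*;

// Illustrative dictionary-compressed column: each distinct string gets an
// int code; the column itself is a primitive array, so scans walk
// consecutive cache lines rather than chasing object pointers.
public class DictionaryColumn {
    private final List<String> dictionary = new ArrayList<>();
    private final Map<String, Integer> codes = new HashMap<>();
    private int[] column = new int[0];
    private int size;

    public void append(String value) {
        Integer code = codes.computeIfAbsent(value, v -> {
            dictionary.add(v);
            return dictionary.size() - 1;
        });
        if (size == column.length) column = Arrays.copyOf(column, Math.max(16, size * 2));
        column[size++] = code;
    }

    // Aggregate over codes: one linear pass over a dense int array.
    public long countOf(String value) {
        Integer code = codes.get(value);
        if (code == null) return 0;
        long count = 0;
        for (int i = 0; i < size; i++) if (column[i] == code) count++;
        return count;
    }
}
```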
Parallelism
• Multi-threading
• Avoid synchronized: use CAS (see the sketch below)
• Query using a scatter-gather pattern
• Map/Reduce, e.g. Hazelcast Map/Reduce
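A minimal sketch of the CAS alternative to synchronized (illustrative; the java.util.concurrent.atomic classes wrap the same hardware instruction):

```java
import java.util.concurrent.atomic.AtomicLong;

// Lock-free increment via a compare-and-swap retry loop, instead of a
// synchronized block guarding a plain long. No thread ever blocks; a
// losing thread simply retries with the freshly observed value.
public class CasCounter {
    private final AtomicLong count = new AtomicLong();

    public void increment() {
        long current;
        do {
            current = count.get();
        } while (!count.compareAndSet(current, current + 1)); // CAS: succeed or retry
    }

    public long get() {
        return count.get();
    }
}
```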
Java: Will it make the cut?
Garbage collection limits heap usage: G1 and Balanced aim for <100 ms pauses at a 10 GB heap.
[Diagram: a 64 GB server where the Java app is memory bound; GC pause time (4 s at a 4 GB heap) leaves most available memory unused]
Off-heap storage works around the collector (see the sketch below).
No low-level CPU access.
Java is challenged as an infrastructure language despite its newly popular usage for this.
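A minimal sketch of the off-heap idea using direct ByteBuffers (one common approach; BigMemory and similar products use their own allocators, and the class here is illustrative):

```java
import java.nio.ByteBuffer;

// Illustrative off-heap slab: memory allocated with allocateDirect lives
// outside the Java heap, so the garbage collector never scans or moves it
// and it does not contribute to GC pause times.
public class OffHeapSlab {
    private final ByteBuffer slab;

    public OffHeapSlab(int capacityBytes) {
        slab = ByteBuffer.allocateDirect(capacityBytes); // allocated off-heap
    }

    public void write(int offset, byte[] value) {
        ByteBuffer view = slab.duplicate(); // independent position, same memory
        view.position(offset);
        view.put(value);
    }

    public byte[] read(int offset, int length) {
        byte[] out = new byte[length];
        ByteBuffer view = slab.duplicate();
        view.position(offset);
        view.get(out);
        return out;
    }
}
```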
CEP/Stream Processing
• Don't let data pool up and then process it with "pull queries"
• Invert that: process data as it streams in, with "push queries"
• Queries execute against "tables" that break the stream up into a current time window
• Hold the window and intermediate results in memory (see the sketch below)
• Results are in real time
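A minimal sketch of a push-style time window (illustrative, not any CEP product's API): each event is processed as it arrives, and only the current window plus a running aggregate are held in memory, so results are available in real time.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sliding one-minute window maintaining a running average.
public class WindowedAverage {
    private static final long WINDOW_MS = 60_000;
    private final Deque<long[]> window = new ArrayDeque<>(); // [timestampMs, value]
    private long sum;

    // Called for every incoming event ("push query" style).
    public double onEvent(long timestampMs, long value) {
        window.addLast(new long[] {timestampMs, value});
        sum += value;
        // Evict events that have slid out of the window.
        while (!window.isEmpty() && window.peekFirst()[0] < timestampMs - WINDOW_MS) {
            sum -= window.removeFirst()[1];
        }
        return (double) sum / window.size(); // current-window average
    }
}
```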
In-Situ Processing
Rather than moving the data to the processing, you process it in situ.
Examples:
• HANA Calculation Engine
• Google BigQuery
• Exadata Storage Servers
• Hazelcast EntryProcessor and Distributed Executor Service (see the sketch below)
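A minimal sketch of in-situ processing with a Hazelcast EntryProcessor, assuming the Hazelcast 3.x API (the map name and value type are illustrative): the mutation executes on the cluster member that owns the key, so the value never crosses the network.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.map.AbstractEntryProcessor;
import java.util.Map;

// Illustrative in-situ update: the processor is sent to the data,
// not the data to the processor.
public class InSituIncrement {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Integer> counters = hz.getMap("counters");
        counters.put("page-views", 0);

        counters.executeOnKey("page-views", new AbstractEntryProcessor<String, Integer>() {
            @Override
            public Object process(Map.Entry<String, Integer> entry) {
                entry.setValue(entry.getValue() + 1); // runs where the data lives
                return null;
            }
        });

        System.out.println(counters.get("page-views")); // 1
        hz.shutdown();
    }
}
```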
Souped-Up Von Neumann Architecture
[Diagram: the Von Neumann architecture souped up with multi-processor/multi-core CPUs (vector/AES instructions; more cache; NUMA; wide/multi-channel memory; locality), 64-bit DRAM with compression, PCI Flash, SSD (flash and RAM), and memory over the network]
The Data Management Landscape
The new data management world
Data Grid: Terracotta, Coherence, GemFire, …
SAP HANA
Relational | Analytical
• "Appliance"
• Aggressive IA64 optimisations
• ACID, SQL and MDX
• In-memory, SSD and disk
• Row- and column-based storage
• Fast aggregation on the column store
• Single-instance 1 TB limit
• Uses compression (est. 5x size reduction)
• Parallel DB: round-robin, hash, or range partitioning of a table with shared storage
• Updates as delta inserts
• Data is fed from source systems near-real-time, real-time, or batch
VoltDB
Relational | NewSQL | Operational | Analytical
• An all-in-memory design
• Full SQL and full ACID
• Partitioned per core so that one thread owns its partition, avoiding locking and latching
• Redundancy provided by multiple instances with writes being replicated
• Claims to be 45x faster
Oracle Exadata
Relational | Operational | Analytical | Appliance
• Combines Oracle RAC with "Storage Servers"
• Connected within the box with InfiniBand QDR
• Storage Servers use PCI Flash (not SSD) for a 22 TB hardware cache
• In-situ computation on the Storage Servers with "Smart Scan"
• Uses "Hybrid Columnar Compression", a compromise of row and column storage
Terracotta BigMemory
Key-Value | Operational | Data Grid
• In-memory
• Key-value with the Ehcache and, soon, javax.cache APIs
• In-process (L1) and server (L2) storage
• Persistence via the log-forward Fast Restart Store: SSD or disk
• Tiered storage: local on-heap, local off-heap, server on-heap, server off-heap
• Partitions with consistent hashing
• Search with parallel in-situ execution
• Off-heap allows 2 TB uncompressed in each app server Java process and on each server partition
• Compression
• Speed ranging from < 1 µs to a few ms
Hazelcast
Key-Value | Operational | Data Grid
• In-memory
• Key-value Map API and javax.cache API (see the usage sketch below)
• Near cache and server data storage
• Tiered storage: local on-heap, local off-heap, server on-heap, server off-heap
• Partitions with consistent hashing
• Search with parallel in-situ execution
• In-situ processing with Entry Processors and Distributed Executors
• Speed ranging from < 1 µs to a few ms
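A minimal usage sketch of the Hazelcast key-value Map API, assuming the Hazelcast 3.x API (map name and values are illustrative): the map is partitioned across the cluster with consistent hashing, and a near cache, if configured, serves repeated local reads from in-process memory.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Illustrative distributed map usage.
public class HazelcastExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(); // joins or forms a cluster
        IMap<String, String> sessions = hz.getMap("sessions");

        sessions.put("user-42", "logged-in");        // stored on the owning partition
        System.out.println(sessions.get("user-42")); // served by owner or near cache

        hz.shutdown();
    }
}
```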
Disk is the new tape
SSD is the new disk
Memory is the new operational store