HBase Low Latency
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
Hadoop Summit June 4, 2014
Agenda
• Latency, what is it, how to measure it
• Write path
• Read path
• Next steps
What’s low latency?
• Latency is about percentiles
• Average != 50th percentile
• There are often orders of magnitude between “average” and “95th percentile”
• Post 99% = the “magical 1%”. Work in progress here.
• Meaning varies from microseconds (high-frequency trading) to seconds (interactive queries)
• In this talk: milliseconds
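The gap between average and percentiles is easy to see with a toy computation; a minimal sketch, where the sample values are invented for illustration:

```java
import java.util.Arrays;

// Toy illustration: a few slow requests barely move the median
// but inflate the mean and dominate the 99th percentile.
public class LatencyPercentiles {

    // Nearest-rank percentile over a sorted copy of the samples.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    public static void main(String[] args) {
        // 95 requests at 1 ms, 5 outliers at 100 ms.
        double[] latencies = new double[100];
        Arrays.fill(latencies, 1.0);
        Arrays.fill(latencies, 95, 100, 100.0);
        // Mean is ~6 ms, the median stays at 1 ms, p99 lands on the outliers.
        System.out.printf("mean=%.2f p50=%.2f p99=%.2f%n",
                mean(latencies), percentile(latencies, 50), percentile(latencies, 99));
    }
}
```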
Measure latency
bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
• More options related to HBase: autoflush, replicas, …
• Latency measured in microseconds
• Easier for internal analysis
YCSB - Yahoo! Cloud Serving Benchmark
• Useful for comparisons between databases
• Set of workloads already defined
Write path
• Two parts
• Single put (WAL): the client just sends the put
• Multiple puts from the client (new behavior since 0.96): the client is much smarter
• Four stages to look at for latency
• Start (establish TCP connections, etc.)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system
Single put: communication & scheduling
• Client: TCP connection to the server
• Shared: multiple threads on the same client use the same TCP connection
• Pooling is possible and does improve performance in some circumstances
• hbase.client.ipc.pool.size
• Server: multiple calls from multiple threads on multiple machines
• Can become thousands of simultaneous queries
• Scheduling is required
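As a sketch, the client-side pooling knob above goes in the client's hbase-site.xml (the value here is illustrative, not a recommendation):

```xml
<!-- Client-side hbase-site.xml: use a small pool of connections per server
     instead of one shared TCP connection (illustrative value). -->
<property>
  <name>hbase.client.ipc.pool.size</name>
  <value>5</value>
</property>
```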
Single put: real work
• The server must
• Write into the WAL queue
• Sync the WAL queue (HDFS flush)
• Write into the memstore
• The WAL queue is shared between all the regions/handlers
• Sync is avoided if another handler did the work
• Your handler may flush more data than expected
Simple put: A small run
| Percentile | Time in ms |
| --- | --- |
| Mean | 1.21 |
| 50% | 0.95 |
| 95% | 1.50 |
| 99% | 2.12 |
Latency sources
• Candidate one: network
• 0.5 ms within a datacenter
• Much less between nodes in the same rack

| Percentile | Time in ms |
| --- | --- |
| Mean | 0.13 |
| 50% | 0.12 |
| 95% | 0.15 |
| 99% | 0.47 |
Latency sources
• Candidate two: HDFS Flush
• We can still do better: HADOOP-7714 & sons.
| Percentile | Time in ms |
| --- | --- |
| Mean | 0.33 |
| 50% | 0.26 |
| 95% | 0.59 |
| 99% | 1.24 |
Latency sources
• Millisecond world: everything can go wrong
• JVM
• Network
• OS scheduler
• File system
• All this goes into the post-99% percentile
• Requires monitoring
• Usually using the latest version helps
Latency sources
• Split (and presplits)
• Autosharding is great!
• Puts have to wait
• Impact: seconds
• Balance
• Regions move
• Triggers a retry for the client
• hbase.client.pause = 100ms since HBase 0.96
• Garbage collection
• Impact: 10’s of ms, even with a good config
• Covered in the read-path part of this talk
From steady to loaded and overloaded
• Number of concurrent tasks is a function of
• Number of cores
• Number of disks
• Number of remote machines used
• Difficult to estimate
• Queues are doomed to happen
• hbase.regionserver.handler.count
• So for low latency
• Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
• RPC priorities: work in progress (HBASE-11048)
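For reference, the handler count is a RegionServer-side setting; a hedged hbase-site.xml sketch with an illustrative value (the right number depends on cores, disks, and workload, as the slide says):

```xml
<!-- Server-side hbase-site.xml: number of RPC handler threads
     per RegionServer (illustrative value). -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>60</value>
</property>
```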
From loaded to overloaded
• MemStore takes too much room: flush, then blocks quite quickly
• hbase.regionserver.global.memstore.size.lower.limit
• hbase.regionserver.global.memstore.size
• hbase.hregion.memstore.block.multiplier
• Too many HFiles: block until compactions keep up
• hbase.hstore.blockingStoreFiles
• Too many WAL files: flush and block
• hbase.regionserver.maxlogs
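These back-pressure knobs live in the server's hbase-site.xml; a sketch with illustrative values (check your version's defaults before copying):

```xml
<!-- hbase-site.xml sketch of the back-pressure knobs above (illustrative values) -->
<property>
  <name>hbase.regionserver.global.memstore.size</name>
  <value>0.4</value> <!-- fraction of heap all memstores together may use -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>10</value> <!-- block writes when a store has more HFiles than this -->
</property>
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>32</value> <!-- force flushes when the WAL count exceeds this -->
</property>
```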
Machine failure
• Failure
• Detect
• Reallocate
• Replay WAL
• Replaying the WAL is NOT required for puts
• hbase.master.distributed.log.replay
• (default true in 1.0)
• Failure = Detect + Reallocate + Retry
• That’s in the range of ~1s for simple failures
• Silent failures put you in the 10s range if the hardware does not help
• zookeeper.session.timeout
Single puts
• Millisecond range
• Spikes do happen in steady mode
• ~100 ms
• Causes: GC, load, splits
Streaming puts
HTable#setAutoFlushTo(false)
HTable#put
HTable#flushCommits
• Like single puts, but
• Puts are grouped and sent in the background
• Load is taken into account
• Does not block
Multiple puts
hbase.client.max.total.tasks (default 100)
hbase.client.max.perserver.tasks (default 5)
hbase.client.max.perregion.tasks (default 1)
• Decouples the client from a latency spike of a region server
• Increases throughput by 50% compared to the old multiput
• Makes splits and GC more transparent
Conclusion on write path
• Single puts can be very fast
• It’s not a “hard real time” system: there are spikes
• Most latency spikes can be hidden when streaming puts
• Failures are NOT that difficult for the write path
• No WAL to replay
And now for the read path
Read path
• Get/short scan are assumed for low-latency operations
• Again, two APIs
• Single get: HTable#get(Get)
• Multi-get: HTable#get(List<Get>)
• Four stages, same as the write path
• Start (TCP connection, …)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system: you may need to add machines or tune your workload
Multi get / Client
Group Gets by RegionServer
Execute them one by one
Multi get / Server
Multi get / Server
http://hadoop-hbase.blogspot.com/2012/05/hbasecon.html
Access latency magnitudes
Dean/2009
Memory is 100,000x faster than disk!
Disk seek = 10 ms
Known unknowns
• For each candidate HFile
• Exclude by file metadata
• Timestamp
• Rowkey range
• Exclude by bloom filter
StoreFileScanner#shouldUseScanner()
Unknown knowns
• Merge-sort results polled from Stores
• Seek each scanner to a reference KeyValue
• Retrieve candidate data from disk
• Multiple HFiles => multiple seeks
• hbase.storescanner.parallel.seek.enable=true
• Short Circuit Reads• dfs.client.read.shortcircuit=true
• Block locality• Happy clusters compact!
HFileBlock#readBlockData()
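Short-circuit reads are enabled in the HDFS configuration seen by the RegionServer; a sketch, with an illustrative socket path:

```xml
<!-- hdfs-site.xml: let the DFS client read local blocks directly,
     bypassing the DataNode (the socket path is illustrative). -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```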
BlockCache
• Reuse previously read data
• Maximize cache hit rate
• Larger cache
• Temporal access locality
• Physical access locality
BlockCache#getBlock()
BlockCache Showdown
• LruBlockCache
• Default, on-heap
• Quite good most of the time
• Evictions impact GC
• BucketCache
• Off-heap alternative
• Serialization overhead
• Large memory configurations

http://www.n10k.com/blog/blockcache-showdown/
L2 off-heap BucketCache makes a strong showing
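Enabling the off-heap BucketCache is a configuration change; a hedged sketch with an illustrative size (units and defaults vary by version, and the JVM's direct-memory limit must be raised to match):

```xml
<!-- hbase-site.xml sketch: enable an off-heap L2 BucketCache
     (illustrative size; also raise -XX:MaxDirectMemorySize accordingly). -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>4096</value>
</property>
```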
Latency enemies: Garbage Collection
• Use heap. Not too much. With CMS.
• Max heap
• 30GB (compressed pointers)
• 8-16GB if you care about 9’s
• Healthy cluster load
• Regular, reliable collections
• 25-100 ms pause on regular interval
• Overloaded RegionServer suffers GC overmuch
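A hedged sketch of such a CMS setup in hbase-env.sh (sizes and thresholds are illustrative, not recommendations):

```sh
# hbase-env.sh sketch: modest fixed heap with CMS, per the slide above
# (illustrative sizes and occupancy threshold).
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms16g -Xmx16g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```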
Off-heap to the rescue?
• BucketCache (0.96, HBASE-7404)• Network interfaces (HBASE-9535)• MemStore et al (HBASE-10191)
Latency enemies: Compactions
• Fewer HFiles => fewer seeks
• Evict data blocks!
• Evict index blocks!!
• hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
• hfile.block.bloom.cacheonwrite
• OS buffer cache to the rescue
• Compacted data is still fresh
• Better than going all the way back to disk
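Both cache-on-write flags are hbase-site.xml settings; a minimal sketch:

```xml
<!-- hbase-site.xml sketch: cache index and bloom blocks as new HFiles
     are written, so compactions do not cold-start those blocks. -->
<property>
  <name>hfile.block.index.cacheonwrite</name>
  <value>true</value>
</property>
<property>
  <name>hfile.block.bloom.cacheonwrite</name>
  <value>true</value>
</property>
```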
Failure
• Detect + Reassign + Replay
• Strong consistency requires replay
• Locality drops to 0
• Cache starts from scratch
Hedging our bets
• HDFS hedged reads (2.4, HDFS-5776)
• Reads on secondary DataNodes
• Strongly consistent
• Works at the HDFS level
• Timeline consistency (HBASE-10070)
• Reads on “Replica Regions”
• Not strongly consistent
Read latency in summary
• Steady mode
• Cache hit: < 1 ms
• Cache miss: +10 ms per seek
• Writing while reading => cache churn
• GC: 25-100 ms pause on regular interval

Network request + (1 - P(cache hit)) * (10 ms * seeks)

• Same long-tail issues as the write path
• Overloaded: same scheduling issues as the write path
• Partial failures hurt a lot
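The formula above can be plugged with the numbers from earlier slides; a minimal sketch, where the hit rates and seek counts are made-up inputs:

```java
// Sketch of the slide's rule of thumb for expected read latency:
// network cost plus the miss probability times the disk-seek cost.
public class ReadLatencyModel {

    static double expectedMs(double networkMs, double cacheHitRate,
                             double seekMs, int seeks) {
        return networkMs + (1.0 - cacheHitRate) * (seekMs * seeks);
    }

    public static void main(String[] args) {
        // 0.5 ms network and 10 ms per seek, per the earlier slides.
        System.out.println(expectedMs(0.5, 0.95, 10.0, 1)); // warm cache: ~1 ms
        System.out.println(expectedMs(0.5, 0.50, 10.0, 2)); // churned cache, 2 HFiles: ~10.5 ms
    }
}
```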
HBase ranges for 99% latency
| | Put | Streamed Multiput | Get | Timeline get |
| --- | --- | --- | --- | --- |
| Steady | milliseconds | milliseconds | milliseconds | milliseconds |
| Failure | seconds | seconds | seconds | milliseconds |
| GC | 10’s of milliseconds | milliseconds | 10’s of milliseconds | milliseconds |
What’s next
• Less GC
• Use fewer objects
• Off-heap
• Compressed BlockCache (HBASE-8894)
• Preferred location (HBASE-4755)
• The “magical 1%”
• Most tools stop at the 99% latency
• What happens after is much more complex
Thanks!
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
Hadoop Summit June 4, 2014