
Page 1:

HBase MTTR, Stripe Compaction and Hoya

Ted Yu (tyu@hortonworks.com)

Page 2:

About myself

• Been working on HBase for 3 years
• Became a Committer & PMC member in June 2011

Page 3:

Outline

• Overview of HBase recovery
• HDFS issues
• Stripe compaction
• HBase-on-YARN
• Q & A

Page 4:

We’re in a distributed system

• Hard to distinguish a slow server from a dead server

• Everything, or nearly everything, is based on timeouts
• Smaller timeouts mean more false positives
• HBase handles false positives well, but they always have a cost
• The shorter the timeouts, the better

Page 5:

HBase components for recovery

Page 6:

Recovery in action

Page 7:

Recovery process

• Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply
• Region assignment: the master reallocates the regions to the other servers
• Failure recovery: read the WAL and rewrite the data again
• The client drops the connection to the dead server and goes to the new one

[Diagram: the Client talks to Region Servers/DataNodes; ZK heartbeats detect the failure; Master, RS and ZK handle region assignment; data recovery replays the WAL]

Page 8:

Failure detection

• Failure detection
  – Set the ZooKeeper timeout to 30s instead of the old 180s default (see the sketch below)
  – Beware of GC pauses, but lower values are possible
  – ZooKeeper detects the errors sooner than the configured timeout
• 0.96: HBase scripts clean the ZK node when the server is kill -9ed
  – Detection time becomes 0
  – Can be used by any monitoring tool
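Not from the slides: a minimal sketch of setting the shorter ZooKeeper session timeout mentioned above. The value is illustrative; in practice this property usually lives in hbase-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FailureDetectionConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // 30s session timeout instead of the old 180s default (illustrative value)
    conf.setInt("zookeeper.session.timeout", 30000);
    return conf;
  }
}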

Page 9:

With faster region assignment

• Detection: from 180s to 30s
• Data recovery: around 10s
• Reassignment: from tens of seconds to seconds

Page 10:

DataNode crash is expensive!

• One replica of the WAL edits is on the crashed DataNode
  – 33% of the reads during the RegionServer recovery will go to it
• Many writes will go to it as well (the smaller the cluster, the higher that probability)
• The NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count
  – The NameNode does this work only after a long timeout (10 minutes by default)

Page 11:

HDFS – Stale mode

• Live: as today, used for reads & writes, using locality
• Stale (after 30 seconds, can be less; settings sketched below): not used for writes, used as a last resort for reads
• Dead (after 10 minutes, don't change this): as today, not used. And actually, it's better to do the HBase recovery before HDFS replicates the TBs of data of this node
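Not from the slides: a sketch of the HDFS stale-DataNode settings referred to above. They normally go into hdfs-site.xml on the NameNode; the values are illustrative.

import org.apache.hadoop.conf.Configuration;

public class StaleModeConfig {
  public static Configuration create() {
    Configuration conf = new Configuration();
    // Mark a DataNode stale after 30s without a heartbeat
    conf.setLong("dfs.namenode.stale.datanode.interval", 30000L);
    // Prefer non-stale DataNodes for reads (stale ones are a last resort) and avoid them for writes
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
    return conf;
  }
}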

Page 12:

Results

• Do more reads/writes to HDFS during the recovery
• Multiple failures are still possible
  – Stale mode will still play its role
  – And set dfs.timeout to 30s
  – This limits the effect of two failures in a row: the cost of the second failure is 30s if you were unlucky

Page 13:

Here is the client

Page 14:

The client

• You want the client to be patient
• Retrying when the system is already loaded is not good
• You want the client to learn about region servers dying, and to be able to react immediately
• You want the solution to be scalable

Page 15:

Scalable solution

• The master notifies the client
  – A cheap multicast message with the “dead servers” list, sent 5 times for safety
  – Off by default
  – On reception, the client immediately stops waiting on the TCP connection. You can now enjoy a large hbase.rpc.timeout (client sketch below)
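Not from the slides: a minimal client-side sketch of what "a large hbase.rpc.timeout" could look like once dead-server notifications let the client react immediately. The timeout value and the retry count are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PatientClientConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // A long RPC timeout no longer delays the client's reaction to dead servers
    conf.setInt("hbase.rpc.timeout", 300000);       // 5 minutes (illustrative)
    conf.setInt("hbase.client.retries.number", 10); // illustrative
    return conf;
  }
}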

Page 16:

Faster recovery (HBASE-7006)

• Previous algorithm
  – Read the WAL files
  – Write new HFiles
  – Tell the region server it got new HFiles
• Puts pressure on the NameNode
  – Remember: avoid putting pressure on the NameNode
• New algorithm (sketch below)
  – Read the WAL
  – Write to the RegionServer
  – We’re done (have seen great improvements in our tests)
  – TBD: assign the WAL to a RegionServer local to a replica
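Not from the slides: a sketch, assuming the 0.96-era property name, of enabling distributed log replay (HBASE-7006) so WAL edits are replayed directly to the recovering RegionServers instead of going through intermediate split-log files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LogReplayConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Replay edits straight to the RegionServers (off by default in 0.96)
    conf.setBoolean("hbase.master.distributed.log.replay", true);
    return conf;
  }
}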

Page 17:

[Diagram: Distributed log splitting – RegionServer0’s WAL files (WAL-file1/2/3, each mixing edits for region1/2/3) sit on HDFS; RegionServer_x and RegionServer_y read them and write per-region split-log files (Splitlog-file-for-region1/2/3) back to HDFS, which RegionServer1/2/3 then read]

Page 18:

[Diagram: Distributed log replay – the same WAL files on HDFS are read by RegionServer_x and RegionServer_y, but the edits are replayed directly to RegionServer1/2/3, with recovered files (Recovered-file-for-region1/2/3) written to HDFS]

Page 19:

Write during recovery

• Concurrent writes are allowed during the WAL replay – the same MemStore serves both

• Events stream: your new recovery time is the failure detection time: max 30s, likely less!

• Caveat: HBASE-8701 WAL Edits need to be applied in receiving order

Page 20:

Page 21:

MemStore flush

• Real life: some tables are updated at a given moment, then left alone
  – With a non-empty MemStore
  – More data to recover
• It’s now possible to guarantee that we don’t have a MemStore with old data (sketch below)
• Improves real-life MTTR
• Helps online snapshots
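Not from the slides: one way to bound how old the data sitting in a MemStore can get is the periodic-flush interval, sketched here with an illustrative value.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PeriodicFlushConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Flush any MemStore that has not been flushed for an hour, so tables that
    // stopped receiving writes don't keep old edits only in MemStore + WAL
    conf.setLong("hbase.regionserver.optionalcacheflushinterval", 3600000L);
    return conf;
  }
}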

Page 22:

.META.

• .META.
  – There is no -ROOT- table in 0.95/0.96
  – But .META. failures are critical
• A lot of small improvements
  – The server now tells the client when a region has moved (the client can avoid going to .META.)
• And a big one
  – The .META. WAL is managed separately to allow an immediate recovery of .META.
  – With the new MemStore flush, this ensures a quick recovery

Page 23:

Data locality post recovery

• HBase performance depends on data locality
• After a recovery, you’ve lost it
  – Bad for performance
• Here come region groups (balancer sketch below)
• Assign 3 favored RegionServers for every region
• On failure, assign the region to one of the secondaries
• The data-locality issue is minimized on failures
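Not from the slides: a sketch of pointing the master at a favored-node balancer so each region keeps a small set of preferred servers. The balancer class name is an assumption from this era of HBase and should be checked against your version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FavoredNodesConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Balancer that assigns each region a set of favored RegionServers (assumed class name)
    conf.set("hbase.master.loadbalancer.class",
        "org.apache.hadoop.hbase.master.balancer.FavoredNodeLoadBalancer");
    return conf;
  }
}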

Page 24:

Discoveries from cluster testing

• HDFS-5016 Heartbeating thread blocks under some failure conditions leading to loss of datanodes

• HBASE-9039 Parallel assignment and distributed log replay during recovery

• Region splitting during distributed log replay may hinder recovery

Page 25:

Compactions example


• MemStore fills up, files are flushed
• When enough files accumulate, they are compacted (threshold sketch below)

[Diagram: writes go to the MemStore, which flushes HFiles to HDFS; several HFiles accumulate per store]
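Not from the slides: a sketch of the knobs that decide when accumulated HFiles get compacted; the values are illustrative, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionThresholdConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Consider a minor compaction once a store has this many HFiles
    conf.setInt("hbase.hstore.compactionThreshold", 3);
    // Block further MemStore flushes once a store reaches this many HFiles
    conf.setInt("hbase.hstore.blockingStoreFiles", 10);
    return conf;
  }
}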

Page 26:

But, compactions cause slowdowns

• Looks like lots of I/O for no apparent benefit
• Example effect on reads (note the better average)

[Chart: read latency, ms vs. load test time, sec]

Page 27:


Key ways to improve compactions

• Read from fewer files
  – Separate files by row key, version, time, etc.
  – Allows a large number of files to be present, uncompacted
• Don't compact the data you don't need to compact
  – For example, old data in OpenTSDB-like systems
  – Obviously, results in less I/O
• Make compactions smaller
  – Without too much I/O amplification or too many files
  – Results in fewer compaction-related outages
• HBase works better with a few large regions; however, large compactions cause unavailability

Page 28:


Stripe compactions (HBASE-7667)


• Somewhat like LevelDB: partition the keys inside each region/store
• But, only 1 level (plus optional L0)
• Compared to regions, partitioning is more flexible
  – The default is a number of ~equal-sized stripes
• To read, just read the relevant stripes + L0, if present

[Diagram: a region covering row keys ccc (start) to iii (end) is divided into stripes at eee and ggg, each stripe holding its own HFiles, plus an optional L0; a get only touches the files of the relevant stripe plus L0]

Page 29:


Stripe compactions – writes


• Data is flushed from the MemStore into several files
• Each stripe compacts separately most of the time

[Diagram: a MemStore flush writes files to HDFS split by stripe; each stripe's HFiles are then compacted independently]

Page 30:


Stripe compactions – other


• Why Level 0?
  – Bulk-loaded files go to L0
  – Flushes can also go into single L0 files (to avoid tiny files)
  – Several L0 files are then compacted into striped files
• Can drop deletes if compacting one entire stripe + L0
  – No need for major compactions, ever
• Compact 2 stripes together – rebalance if unbalanced
  – Very rare, however; unbalanced stripes are not a huge deal
• Boundaries could be used to improve region splits in the future
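Not from the slides: a sketch of switching a store over to stripe compactions (HBASE-7667). The engine class is the documented one; the stripe-count property name and all values are assumptions to verify against your HBase version, and the same keys can also be set per table or column family.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class StripeCompactionConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Use the stripe store engine instead of the default one
    conf.set("hbase.hstore.engine.class",
        "org.apache.hadoop.hbase.regionserver.StripeStoreEngine");
    // Start each store with roughly equal-sized stripes (assumed property name, illustrative value)
    conf.setInt("hbase.store.stripe.initialStripeCount", 10);
    // Stripes tolerate many more files per store, so raise the blocking limit
    conf.setInt("hbase.hstore.blockingStoreFiles", 100);
    return conf;
  }
}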

Page 31:


Stripe compactions - performance


• EC2, c1.xlarge, preload; then measure random read performance
  – LoadTestTool + deletes + overwrites; measure random reads

[Chart: random gets per second vs. test time, sec]

Page 32:

HBase on YARN

• Hoya is a YARN application
• All components are YARN services
• Input is a cluster specification, persisted as a JSON document on HDFS
• HDFS and ZooKeeper are shared by multiple cluster instances
• The cluster can also be stopped and later resumed

Page 33:

Hoya Architecture

• Hoya Client: parses the command line, executes local operations, talks to HoyaMasterService

• HoyaMasterService: AM service, deploys the HBase master locally

• HoyaRegionService: installs and executes the region server

Page 34:

HBase Master Service Deployment

• HoyaMasterService is requested to create the cluster
• A local HBase dir is chosen for the expanded image
• The user-supplied config dir overwrites conf files in the conf directory
• The HBase conf is patched with the hostname of the master
• HoyaMasterService monitors reporting from the RM

Page 35:

Failure Handling

• RegionService failures trigger new RS instances
• MasterService failures do not trigger a restart
• RegionService monitors the ZK node for the master
• MasterService monitors the state of the HBase master

Page 36:

Runtime classpath dependencies

Page 37:

Q & A

Thanks!