Hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
Daniel Dai, Vaibhav Gumashta
Hortonworks
Hadoop Summit San Jose, June 2016
© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda
Motivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work

What is Hive MetaStore
Stores metadata about the data
  – Database
  – Table
  – Partition
  – Privilege
  – Role
  – Permanent UDF
  – Statistics
  – Locks
  – Transaction
  – etc.
Two modes
  – Thrift Server
  – Embedded
Backend
  – RDBMS: Derby, MSSQL, MySQL, Oracle, Postgres

Low latency in Hive
Is Hadoop only for large jobs?
  – Most jobs are small jobs
  – Users want to run both small and large jobs in one system
What's trending in Hive – Low latency
  – Stinger (Tez + ORC + Vectorization)
    • Brings queries down to 5-10s
  – LLAP
    • Sub-second queries: TPC-DS query 27

New Bottleneck - Metastore
Planning time is non-negligible
Within planning, a significant amount of time is spent on metadata fetching

Besides Latency
Significantly more scale
  – More metadata – millions of partitions
  – New large-scale metadata – split information, ORC row group statistics
  – More calls – handle orders of magnitude more calls – from tasks
Reduce complexity
  – Object-relational modeling is an impedance mismatch
  – DataNucleus
  – DBCP, BoneCP, or HikariCP

[Figure: ER Diagram for ObjectStore Database]

How About Improving ObjectStore
Already happening
  – Using direct SQL instead of O-R mapping
But
  – Maintenance nightmare
  – Must handle syntax differences between databases
Re-engineering effort may not pay off
Ultimate barrier: scalability
String queryText =
    "select PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID,"
  + " SERDES.SERDE_ID, PARTITIONS.CREATE_TIME,"
  + " PARTITIONS.LAST_ACCESS_TIME, SDS.INPUT_FORMAT, SDS.IS_COMPRESSED,"
  + " SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS,"
  + " SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB"
  + " from PARTITIONS"
  + " left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID"
  + " left outer join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID"
  + " where PART_ID in (" + partIds + ") order by PART_NAME asc";

System Architecture
[Diagram: HiveMetaStore Thrift Client -> HiveMetaStore Thrift Server -> ObjectStore -> RDBMS, or HBaseStore -> Omid -> HBase]
• Two implementations of the RawStore interface
  • HBaseStore
  • ObjectStore
• Both backends will live together for a while
• HBaseStore
  • Most traffic will go through the transaction layer (Omid)
  • Some traffic will bypass the transaction layer
    • Volatile data
    • High possibility of conflict
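
The idea of two interchangeable RawStore implementations can be sketched as follows. This is a toy model, not the Hive code: the interface is cut down to two hypothetical calls, a dict stands in for the backing database, and make_store is an invented helper that picks the implementation by name.

```python
from abc import ABC, abstractmethod


class RawStore(ABC):
    """Reduced to two calls for illustration; the real interface is far larger."""

    @abstractmethod
    def add_partition(self, db, table, values):
        """Store a partition identified by its database, table and values."""

    @abstractmethod
    def get_partition(self, db, table, values):
        """Return the stored partition, or None if it does not exist."""


class DictBackedStore(RawStore):
    """Shared toy behaviour; a dict stands in for the backing database."""

    def __init__(self):
        self.partitions = {}

    def add_partition(self, db, table, values):
        self.partitions[(db, table, tuple(values))] = {"values": list(values)}

    def get_partition(self, db, table, values):
        return self.partitions.get((db, table, tuple(values)))


class ObjectStore(DictBackedStore):
    backend = "RDBMS"


class HBaseStore(DictBackedStore):
    backend = "HBase"


def make_store(impl_name):
    """Hypothetical helper: pick the implementation by name, as a config setting would."""
    return {"ObjectStore": ObjectStore, "HBaseStore": HBaseStore}[impl_name]()


store = make_store("HBaseStore")
store.add_partition("default", "sales", ["201601", "CA"])
```

Callers only see the RawStore methods, which is what allows both backends to coexist behind the same Thrift server.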

RDBMS schema
Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server extracts values from Thrift objects and creates corresponding ORM model objects
• ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using appropriate foreign key references
• RDBMS fastpath is enabled by not using the ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics between RDBMS databases
Example: adding a new partition "add_partition(Partition new_part)"
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
[Diagram: RDBMS tables shown alongside the Partition struct: TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS]

HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the Table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the Partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats

HBase schema (continued)
Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences

De-normalization
• Goal
  • Optimized for querying
  • May be slower in DDL
• Example: drop_role(String roleName)
HBMS_USER_TO_ROLE
Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)
• Need to scan & de-serialize everything in order to drop a role
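
The cost of drop_role on the de-normalized table can be sketched as below. This is only an illustration: a Python dict stands in for HBMS_USER_TO_ROLE, plain lists stand in for the serialized RoleList proto, and the function is hypothetical rather than the HBaseStore implementation.

```python
# Hypothetical in-memory stand-in for the HBMS_USER_TO_ROLE table:
# row key -> list of role names (the real table stores a serialized RoleList proto).
user_to_role = {
    b"User 1": ["Role 1", "Role 2", "Role 3", "Role 5"],
    b"User 2": ["Role 1", "Role 2"],
    b"User 3": ["Role 4", "Role 5"],
    b"User 4": ["Role 2", "Role 3"],
}


def drop_role(table, role_name):
    """Remove role_name from every user row; return the number of rows rewritten."""
    rewritten = 0
    for user in list(table):           # full scan: every row is visited
        roles = table[user]            # stands in for de-serializing the proto
        if role_name in roles:
            table[user] = [r for r in roles if r != role_name]
            rewritten += 1
    return rewritten


rows_touched = drop_role(user_to_role, "Role 2")
```

Looking up the roles of one user stays a single-row read, which is the trade the slide describes: fast queries, slower DDL.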

Partition Keys
Range scan for most queries
  – Where date = '201601' and state = 'CA'
  – Where date >= '201602' and date < '201604'
Server-side filter for the rest
  – Where state = 'CA' (not prefix key)
  – Where date like '2016' (regex)
  – Where date > '201601' and state > 'OR' (cannot be range scan)
  – Scan all keys but do not deserialize values
date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
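
The difference between the two access paths can be sketched with a sorted list of keys. This is a simplified model under stated assumptions: tuples stand in for HBase row keys, and range_scan and filtered_scan are hypothetical helpers, not HBase API calls.

```python
from bisect import bisect_left

# Row keys sorted as (date, state), mirroring the example table.
keys = sorted([
    ("201601", "CA"), ("201601", "WA"), ("201602", "CA"),
    ("201603", "CA"), ("201605", "CA"),
])


def range_scan(sorted_keys, start, stop):
    """Return keys k with start <= k < stop, touching only that slice."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def filtered_scan(sorted_keys, predicate):
    """Visit every key and keep those that satisfy the predicate."""
    return [k for k in sorted_keys if predicate(k)]


# date >= '201602' and date < '201604': the date is a key prefix, so a range scan works.
in_range = range_scan(keys, ("201602",), ("201604",))

# state = 'CA': state is not a key prefix, so every key has to be examined.
in_ca = filtered_scan(keys, lambda k: k[1] == "CA")
```

The range scan only looks at the slice between the two bounds, while the filter walks all keys, which matches the "scan all keys but do not deserialize values" case.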

Typed Partition Keys
Binary sorted
  – HBase range scan: Scan(byte[] startRow, byte[] stopRow)
  – Where key1 >= 'A5' and key2 >= 8
    • startRow: 41 35 00 00 00 00 08
Using BinarySortableSerDe
  – Supports all Hive data types
  – Handles null
(String, Integer) | Bytes
('A10', 3) | 41 31 30 00 00 00 00 03
('A10', 10) | 41 31 30 00 00 00 00 0A
('A5', 4) | 41 35 00 00 00 00 04
('A5', 15) | 41 35 00 00 00 00 0F
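
The byte layout in the table can be reproduced with a few lines of Python. This is a simplified sketch inferred from the example rows only (string bytes, a 0x00 terminator, then a 4-byte big-endian integer); encode_key is a hypothetical helper, and the real BinarySortableSerDe covers more cases (all Hive types, nulls), so its exact layout may differ.

```python
import struct


def encode_key(text, number):
    """Encode (string, non-negative int) so that byte order follows key order.

    Simplified layout taken from the example table: UTF-8 bytes of the string,
    a 0x00 terminator, then the integer as 4 big-endian bytes.
    """
    return text.encode("utf-8") + b"\x00" + struct.pack(">I", number)


rows = [("A5", 15), ("A10", 10), ("A5", 4), ("A10", 3)]
encoded = sorted(encode_key(s, n) for s, n in rows)

start_row = encode_key("A5", 8)
# Keys at or after start_row, as a range scan starting there would return them.
from_start = [k for k in encoded if k >= start_row]
```

Note that 'A10' sorts before 'A5' because the comparison is byte-wise, not numeric.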

Storage Descriptor de-duplication
Table Name | Key | Column Families and Columns | Description
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the Partition
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
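
The de-duplication scheme (one HBMS_SDS row per distinct descriptor, keyed by its md5 and carrying a reference count) might look roughly like this. It is a sketch under assumptions: StorageDescriptorStore is a hypothetical class, a dict stands in for the HBase table, and a byte string stands in for the serialized StorageDescriptor proto.

```python
import hashlib


class StorageDescriptorStore:
    """Hypothetical stand-in for HBMS_SDS: md5(serialized SD) -> [payload, ref count]."""

    def __init__(self):
        self.rows = {}

    def put(self, serialized_sd):
        """Store the descriptor once and return its hash for use as sd_hash."""
        key = hashlib.md5(serialized_sd).digest()
        if key in self.rows:
            self.rows[key][1] += 1         # same descriptor: bump the reference count
        else:
            self.rows[key] = [serialized_sd, 1]
        return key

    def release(self, key):
        """Drop one reference; delete the row when nothing points at it."""
        self.rows[key][1] -= 1
        if self.rows[key][1] == 0:
            del self.rows[key]


store = StorageDescriptorStore()
# Bytes stand in for a serialized StorageDescriptor proto.
sd = b"cols=[id:int];input_format=orc;output_format=orc"
hash_a = store.put(sd)   # partition A
hash_b = store.put(sd)   # partition B shares the same descriptor
```

In the protobuf definitions above, location and sd_parameters sit on the Partition message rather than on StorageDescriptor, which is presumably what lets partitions of the same table hash to the same descriptor.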

HBase schema
Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from the Thrift objects and converts them to corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition "add_partition(Partition new_part)"
[Diagram: Thrift struct Partition -> protobuf message Partition -> HBMS_PARTITIONS and HBMS_SDS]
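
A rough sketch of this write path: one Thrift-style partition becomes one row in HBMS_PARTITIONS plus a shared row in HBMS_SDS. Everything here is illustrative: dicts stand in for the Thrift object and the HBase tables, JSON stands in for protobuf serialization, and this add_partition is a hypothetical function rather than the HBaseStore method.

```python
import hashlib
import json


def add_partition(tables, part):
    """Write one partition as a row in HBMS_PARTITIONS plus a shared row in HBMS_SDS.

    JSON stands in for protobuf serialization; field names follow the slides.
    """
    sd = dict(part["sd"])
    location = sd.pop("location")                  # kept on the partition row
    sd_bytes = json.dumps(sd, sort_keys=True).encode("utf-8")
    sd_hash = hashlib.md5(sd_bytes).hexdigest()
    tables["HBMS_SDS"].setdefault(sd_hash, sd_bytes)

    key = (part["dbName"], part["tableName"], *part["values"])
    tables["HBMS_PARTITIONS"][key] = {
        "create_time": part["createTime"],
        "location": location,
        "sd_hash": sd_hash,
        "parameters": part["parameters"],
    }
    return key


def make_part(date, state):
    """Build a toy partition object for the hypothetical table default.sales."""
    return {
        "values": [date, state], "dbName": "default", "tableName": "sales",
        "createTime": 1, "parameters": {},
        "sd": {"location": f"/warehouse/sales/{date}/{state}",
               "inputFormat": "orc", "outputFormat": "orc"},
    }


tables = {"HBMS_SDS": {}, "HBMS_PARTITIONS": {}}
key_1 = add_partition(tables, make_part("201601", "CA"))
key_2 = add_partition(tables, make_part("201601", "WA"))
```

Two writes per partition replace the fan-out across many foreign-keyed RDBMS tables described for the ObjectStore path.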

Caching
Aggregate Stats
  • Location - on HBase
  • Compile time
File Footers
  • Location - on HBase
  • Runtime - accessed from tasks
Tables, Partitions, Storage Descriptors
  • Location - on Metastore server(s)
  • Compile time

Caching Aggregate Stats
"get_aggr_stats_for(dbName, tblName, partNames, colNames)"
• Gets aggregated stats for columns in each partition – expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache. AggrStats is calculated on the client side & the cached entry is saved as a serialized AggrStats proto
  • AggrStatsBloomFilter is created on the partitions contained in the AggrStats
• Invalidation
  • TTL expiry: nodes evicted from cache
  • Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  • Invalidator thread picks up an invalidation request & executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition & removes them from the cache
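
The lookup and invalidation flow can be sketched as a small cache. This is a toy model, not the Hive implementation: AggrStatsCache, cache_key and the compute callback are hypothetical names, a dict stands in for HBMS_AGGR_STATS, and a plain set stands in for the AggrStatsBloomFilter (a real bloom filter can report false positives, so it may evict more entries than strictly needed).

```python
import hashlib


def cache_key(db, table, part_names, column):
    """md5 over the identifying fields, as in the HBMS_AGGR_STATS row key."""
    joined = "\x00".join([db, table, *part_names, column])
    return hashlib.md5(joined.encode("utf-8")).digest()


class AggrStatsCache:
    def __init__(self, compute):
        self.compute = compute    # hypothetical callback that aggregates the stats
        self.entries = {}         # key -> (stats, partitions covered by the entry)

    def get(self, db, table, part_names, column):
        key = cache_key(db, table, part_names, column)
        if key not in self.entries:                     # miss: compute and cache
            stats = self.compute(db, table, part_names, column)
            # A plain set stands in for the AggrStatsBloomFilter.
            self.entries[key] = (stats, set(part_names))
        return self.entries[key][0]

    def invalidate(self, part_name):
        """Remove every entry whose partition set contains part_name."""
        stale = [k for k, (_, parts) in self.entries.items() if part_name in parts]
        for key in stale:
            del self.entries[key]
        return len(stale)


calls = []


def fake_compute(db, table, part_names, column):
    calls.append(column)
    return {"num_partitions": len(part_names)}


cache = AggrStatsCache(fake_compute)
cache.get("db", "t", ["p1", "p2"], "c1")
cache.get("db", "t", ["p1", "p2"], "c1")   # served from the cache
cache.get("db", "t", ["p3"], "c1")
removed = cache.invalidate("p1")           # e.g. after an alter partition on p1
```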

Caching File Footers
• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwrite is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic
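
Why the footer cache can skip the transaction layer is easier to see in a small model. This is an assumption-laden sketch: FooterCache is a hypothetical class, a dict stands in for HBMS_FILE_METADATA, and each put touches exactly one row, so writers never need cross-row atomicity.

```python
class FooterCache:
    """Hypothetical stand-in for HBMS_FILE_METADATA: fileId -> footer bytes."""

    def __init__(self):
        self.rows = {}

    def put(self, file_id, footer):
        # Single-row write; a later write for the same fileId simply overwrites.
        self.rows[file_id] = footer

    def get_many(self, file_ids):
        """Return cached footers; missing entries come back as None."""
        return [self.rows.get(fid) for fid in file_ids]

    def clean(self, live_file_ids):
        """Drop entries for files that no longer exist; return how many were removed."""
        stale = [fid for fid in self.rows if fid not in live_file_ids]
        for fid in stale:
            del self.rows[fid]
        return len(stale)


cache = FooterCache()
cache.put(7, b"footer-7")     # written by one task
cache.put(7, b"footer-7")     # a second task writes the same footer: harmless
cache.put(9, b"footer-9")
footers = cache.get_many([7, 8])
removed = cache.clean({7})    # file 9 no longer exists
```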

HBaseMetaStore Needs Transaction
Atomicity is required
  – Create table/partition also creates storage descriptor
  – Alter table also alters partitions
  – Drop table also drops table column privileges
HBase doesn't support transactions
  – Doesn't support cross-row transactions
HBaseConnection
  – Supports different transaction managers in theory
  – VanillaHBaseConnection: no transaction

Omid
Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
  – First release this Monday
Snapshot isolation
  – Natural, as HBase is a versioned database
  – No locking, no deadlock, no blocking for both read and write
  – When two concurrent transactions write to the same data, the later one aborts
Low overhead

Omid Components
TSO Server (Timestamp Oracle)
  – Generates transid
  – Tracks status of transactions
TSO Client
  – Talks to TSO
  – Caches transaction metadata
  – Most reads don't need to talk to TSO
Compactor
  – Runs as HBase Coprocessor
  – Removes stale cell versions
[Diagram: Client, TSO, and HBase with Compactor]

Omid Operations
Open transaction
  – Get transid from TSO
Read a cell
  – Read all versions of the cell from HBase
  – Read latest committed version before transaction start
Write a cell
  – Write value versioned with transid to HBase
Commit
  – Generate commitid from TSO
  – TSO figures out if there is a conflict using transaction metadata

Omid Data Structure
Memory management in TSO
  – Never runs OOM; aborts old transactions
TSO state
lastCommit: row1 -> T20, row2 -> T25, row5 -> T22
  • Detects transaction conflict at commit time
  • Largest chunk of memory
committed: T10 -> T20, T4 -> T25, T11 -> T30
  • Constructs snapshot at read time
  • Partially replicated to client
aborted: T2, …, …
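
How these structures support snapshot isolation can be sketched with a toy timestamp oracle. The names lastCommit, committed and aborted come from the slide; the class ToyTSO, the read helper and the exact conflict rule are assumptions made for illustration and are not Omid's API or implementation.

```python
class ToyTSO:
    """Toy timestamp oracle; names mirror the slide, the logic is an assumption."""

    def __init__(self):
        self.clock = 0
        self.last_commit = {}   # row -> commit timestamp of its latest committed write
        self.committed = {}     # start timestamp -> commit timestamp
        self.aborted = set()

    def next_timestamp(self):
        self.clock += 1
        return self.clock

    def begin(self):
        return self.next_timestamp()

    def commit(self, start_ts, written_rows):
        """Commit unless a written row was committed after this transaction started."""
        for row in written_rows:
            if self.last_commit.get(row, 0) > start_ts:
                self.aborted.add(start_ts)
                return None
        commit_ts = self.next_timestamp()
        for row in written_rows:
            self.last_commit[row] = commit_ts
        self.committed[start_ts] = commit_ts
        return commit_ts


def read(cell_versions, tso, start_ts):
    """Return the value of the newest version committed before start_ts, if any."""
    visible = [
        (tso.committed[writer_ts], value)
        for writer_ts, value in cell_versions.items()
        if writer_ts in tso.committed and tso.committed[writer_ts] < start_ts
    ]
    return max(visible)[1] if visible else None


tso = ToyTSO()
cell = {}                       # versions of one cell: writer start timestamp -> value

t1 = tso.begin()
t2 = tso.begin()
cell[t1] = "from t1"            # each write is versioned with the writer's transid
cell[t2] = "from t2"
c1 = tso.commit(t1, ["row1"])   # commits first
c2 = tso.commit(t2, ["row1"])   # same row committed after t2 started: aborts
t3 = tso.begin()
seen = read(cell, tso, t3)
```

Neither writer blocks the other; the conflict is only detected at commit time, and the reader ignores the version written by the aborted transaction.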

Transaction Conflict
Two concurrent DDL statements write to the same data
  – Proper retry logic
Task node writes - ORC footer cache
  – High chance of write conflict
  – Row mutation is atomic in HBase
  – Cross-row atomicity is not required
  – Bypass the transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
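
The "proper retry logic" for conflicting DDL could be as simple as the loop below. This is a generic sketch, not code from Hive: ConflictError, run_with_retry and the attempt callback are hypothetical names, and a real implementation would likely add back-off between attempts.

```python
class ConflictError(Exception):
    """Raised by a transaction attempt that lost a write-write conflict."""


def run_with_retry(attempt, max_attempts=3):
    """Call attempt() until it succeeds or max_attempts conflicts have occurred."""
    for tries in range(1, max_attempts + 1):
        try:
            return attempt(), tries
        except ConflictError:
            if tries == max_attempts:
                raise


# Simulated DDL that loses two conflicts and then succeeds.
outcomes = iter([ConflictError(), ConflictError(), "table altered"])


def alter_table():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result, tries = run_with_retry(alter_table)
```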

Deployment
Server-side components in HBase
  – Server-side filter
  – Omid compactor
  – Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
  – hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore, TSO, and HBase with Server Side Filter and Omid Compactor]

Deploy Omid
Create Omid tables in HBase
  – omid.sh create-hbase-commit-table
  – omid.sh create-hbase-timestamp-table
Start Omid TSO
  – omid.sh tso
Related config in hive-site.xml
  – hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
  – tso.host=localhost
  – tso.port=54758
  – omid.client.connectionType=DIRECT

Instantiate HBase Metastore
Instantiate HBase tables from scratch
  – hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
  – One-way import from ObjectStore to HBaseStore
  – hive --service hbaseimport

TPCDS queries
[Chart: Query Plan Time for TPCDS queries; series: HBaseStore, HBaseStore+Omid, ObjectStore; queries 7, 15, 27, 29, 39, 46, 56, 68, 70, 76; vertical axis 0 to 6000]
1824 partitions
Sweet spot for ObjectStore
Average speed-up for all TPCDS queries
  – 219 (without Omid)
  – 212 (with Omid)

Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
  – Minor holes: event notification/version/constraints
  – Deprecate listTableNamesByFilter/listPartitionNamesByFilter
  – Tools enhancement
  – ACID is not supported
Runs most e2e queries
Fixing unit tests
  – TestMiniTezCliDriver: all pass
  – TestCliDriver: HIVE-14097, pending review
  – Not production quality yet

Future Work - ACID
Transaction metadata is stored in Metastore
  – Locks
  – Txns
  – Compactions
Data structure is harder to de-normalize
New work: transaction server
  – Keep lock and transaction tree in memory

Future Work – HA via HBase Coprocessor
Two new server components
  – Omid TSO Server
  – Transaction Server
All servers need HA
  – Management headache
Automatic HA through HBase Coprocessor
[Diagram: two Region Servers, each running the TSO Server via Coprocessor]

Future Work – Other
Stats aggregation
  – Coprocessor
Improving ObjectCache
  – Rudimentary implementation currently
  – LRU
Omid consuming high CPU
  – 300% CPU always, by design
  – High throughput, avoids context switches
  – Might be an issue for small systems
Thank You
2 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
3 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
What is Hive MetaStore
Store Metadata about the datandash Databasendash Tablendash Partitionndash Privilegendash Rolendash Permanent UDFndash Statisticsndash Locksndash Transactionndash etc
Two modesndash Thrift Serverndash Embedded
Backendndash RDBMS Derby MSSQL MySQL Oracle PostGres
4 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Low latency in Hive
Hadoop is only for large jobndash Most jobs are small jobsndash User want to run both small and large
jobs in one system
Whatrsquos trending in Hive ndash Low latencyndash Stinger (Tez + ORC + Vectorization)
bull Bring query to 5-10sndash LLAP
bull Sub-second query TPC-DS query 27
5 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
New BottleNet - Metastore
Planning time is non-negligible Among planning significant amount of time spent on metadata fetching
6 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Besides Latency
Significantly more scalendash More metadata ndash millions of partitionsndash New large scale metadata ndash Split information ORC row group statisticsndash More calls ndash Handle orders of magnitude higher no of calls ndash From tasks
Reduce Complexityndash Object Relational Modeling is an impedance mismatchndash DataNucleusndash DBCP BoneCP or Hikaricp
7 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
ER Diagram for ObjectStore Database
8 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
How About Improving ObjectStore
Already happeningndash Using direct SQL instead of O-R
But
ndash Maintenance nightmarendash Handle syntax difference for databases
Re-engineering effort may not pay off Ultimate barrier Scalability
String queryText = select PARTITIONSPART_ID SDSSD_ID SDSCD_ID + SERDESSERDE_ID PARTITIONSCREATE_TIME + PARTITIONSLAST_ACCESS_TIME SDSINPUT_FORMAT SDSIS_COMPRESSED + SDSIS_STOREDASSUBDIRECTORIES SDSLOCATION SDSNUM_BUCKETS + SDSOUTPUT_FORMAT SERDESNAME SERDESSLIB + from PARTITIONS + left outer join SDS on PARTITIONSSD_ID = SDSSD_ID + left outer join SERDES on SDSSERDE_ID = SERDESSERDE_ID + where PART_ID in ( + partIds + ) order by PART_NAME asc
9 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
10 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
System Architecture
HiveMetaStore Thrift Server
ObjectStoreHBaseStore
RDBMSHBase
Omid
bull Two implementation of the RawStore interfacebull HBaseStorebull ObjectStore
bull Both backend will live together for a while
bull HBaseStorebull Most traffic will go through transaction
layer (Omid)bull Some traffic will bypass transaction layer
bull Volatile databull High possibility of conflict
HiveMetaStore Thrift Client
11 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writesreads values tofrom various tables in
RDBMS using appropriate foreign key references bull RDBMS fastpath enabled by not using ORM and writing direct SQL However
complicates testing matrix as there may be slight variations in SQL semantics for different RDBMS databases
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
12 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writes reads values to from various tables in
RDBMS using appropriate foreign key references
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
TBLS
TBL_PRIVS
TBL_COL_PRIVS
PART_PRIVS
SDS
CDS
SORT_ORDER
SERDES
TYPE_FIELDS
PARTITIONS
PARTITION_KEY_VALS
PARTITION_PARAMS
BUCKETING_COLS
SORT_COLS
SD_PARAMS
SKEWED_COL_NAMES
SKEWED_VALUES
TABLE_PARAMS
13 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog ldquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
HBase schema
14 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Table Name Key Column Families and Columns
Description
HBMS_GLOBAL_PRIVS bytes(ldquogprdquo) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES bytes(roleName) cf_catalog ldquorolesrdquo ldquorolesrdquo storeretrieve serialized Role proto
HBMS_USER_TO_ROLE bytes(userName) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized RoleList proto
HBMS_SECURITY bytes(delTokenId) cf_catalog ldquodtrdquo ldquomkrdquo ldquodtrdquo storeretrieve delegation token ldquomkrdquo master keys
HBMS_SEQUENCES bytes(sequence) cf_catalog ldquocrdquo ldquocrdquo storeretrieve sequences
HBase schema
15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
De-normalization
bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)
Key Value
bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)
bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)
bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)
bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)
HBMS_USER_TO_ROLE
bull Need to scan amp de-serialize everything in order to drop a role
16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Partition Keys
Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo
Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value
date state
201601 CA
201601 WA
201602 CA
201603 CA
201605 CA
17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Typed Partition Keys
Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)
ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08
Using BinarySortableSerDendash Support all Hive data typesndash Handles null
(String Integer) Bytes
lsquoA10rsquo 3 41 31 30 00 00 00 00 03
lsquoA10rsquo 10 41 31 30 00 00 00 00 0A
lsquoA5rsquo 4 41 35 00 00 00 00 04
lsquoA5rsquo 15 41 35 00 00 00 00 0D
18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1 ... partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1 ... partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
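The protobuf Partition keeps its own location and sd_parameters and points at the rest of the descriptor through sd_hash, so partitions that differ only in location can share a single HBMS_SDS row. The bookkeeping this implies can be sketched as follows. It is a sketch under assumptions, not the HBaseStore code: an in-memory map replaces the HBase table, and SdDedupSketch, addReference, and removeReference are hypothetical names.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the HBMS_SDS table.
final class SdDedupSketch {
    // One row: the serialized descriptor ("c") and its reference count ("ref").
    static final class SdRow {
        final byte[] serialized;
        int refCount;

        SdRow(byte[] serialized) {
            this.serialized = serialized;
        }
    }

    final Map<String, SdRow> sdsTable = new HashMap<>();

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Stores the descriptor once per distinct content and returns the hash
    // that a partition would keep in its sd_hash field.
    String addReference(byte[] serializedSd) {
        String hash = md5Hex(serializedSd);
        sdsTable.computeIfAbsent(hash, h -> new SdRow(serializedSd)).refCount++;
        return hash;
    }

    // Drops one reference and removes the row once nothing points at it.
    void removeReference(String hash) {
        SdRow row = sdsTable.get(hash);
        if (row != null && --row.refCount == 0) {
            sdsTable.remove(hash);
        }
    }

    public static void main(String[] args) {
        SdDedupSketch store = new SdDedupSketch();
        byte[] sd = "cols=a,b;input_format=orc".getBytes();
        store.addReference(sd);
        store.addReference(sd.clone());
        System.out.println(store.sdsTable.size()); // prints 1
    }
}
```

With millions of partitions sharing a handful of descriptors, the descriptor is stored a handful of times instead of once per partition.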
HBase schema
Read/write path
• The Thrift client creates Thrift objects for the RPC (based on the specs in metastore/if/hive_metastore.thrift)
• The Thrift server passes the Thrift objects to the HBase client opened in the Thrift server
• The HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition, "add_partition(Partition new_part)"
[Diagram: the Thrift struct Partition is converted to the protobuf message Partition; the partition is written to HBMS_PARTITIONS and its storage descriptor, referenced by sd_hash, to HBMS_SDS]
Caching
Aggregate Stats
• Location: on HBase
• Compile time
File Footers
• Location: on HBase
• Runtime: accessed from tasks
Tables, Partitions, Storage Descriptors
• Location: on Metastore server(s)
• Compile time
Caching Aggregate Stats
"get_aggr_stats_for(dbName, tblName, partNames, colNames)"
• Gets aggregated stats for columns in each partition; an expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1 ... partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache: AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  • An AggrStatsBloomFilter is created on the partitions contained in the AggrStats
• Invalidation
  • TTL expiry: nodes are evicted from the cache
  • Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  • An invalidator thread picks up each invalidation request and executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
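The invalidation step can be illustrated with a small in-memory cache in which every entry carries a Bloom filter over the partitions it aggregates. This is a sketch under assumptions, not the metastore code: AggrStatsCacheSketch, its nested classes, the filter size, and the two hash functions are all hypothetical choices made for the example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the aggregate stats cache.
final class AggrStatsCacheSketch {
    // Tiny Bloom filter: may report false positives, never false negatives.
    static final class Bloom {
        private static final int BITS = 1024;
        private final BitSet bits = new BitSet(BITS);

        private static int[] positions(String s) {
            int h1 = s.hashCode();
            int h2 = Integer.reverse(h1) ^ 0x5bd1e995; // second hash, cheaply derived
            return new int[] {Math.floorMod(h1, BITS), Math.floorMod(h2, BITS)};
        }

        void add(String s) {
            for (int p : positions(s)) {
                bits.set(p);
            }
        }

        boolean mightContain(String s) {
            for (int p : positions(s)) {
                if (!bits.get(p)) {
                    return false;
                }
            }
            return true;
        }
    }

    static final class Entry {
        final String aggrStats; // stands in for the serialized AggrStats proto
        final Bloom partitions = new Bloom();

        Entry(String aggrStats, List<String> partNames) {
            this.aggrStats = aggrStats;
            for (String p : partNames) {
                partitions.add(p);
            }
        }
    }

    final Map<String, Entry> cache = new HashMap<>();

    void put(String key, String aggrStats, List<String> partNames) {
        cache.put(key, new Entry(aggrStats, partNames));
    }

    // Evicts every entry whose Bloom filter says it may cover the changed partition.
    int invalidate(String changedPartition) {
        int removed = 0;
        for (Iterator<Entry> it = cache.values().iterator(); it.hasNext();) {
            if (it.next().partitions.mightContain(changedPartition)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

A false positive only costs a recomputation of one cached aggregate, while a false negative would serve stale stats, so a Bloom filter errs on the safe side for this use.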
Caching File Footers
• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic
HBaseMetaStore Needs Transaction
Atomicity is required
- Create table/partition also creates a storage descriptor
- Alter table also alters partitions
- Drop table also drops table and column privileges
HBase doesn't support transactions
- No support for cross-row transactions
HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transactions
Omid
Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday
Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for either reads or writes
- When two concurrent transactions write to the same data, the later one aborts
Low overhead
Omid Components
TSO Server (Timestamp Oracle)
- Generates the transid
- Tracks the status of transactions
TSO Client
- Talks to the TSO
- Caches transaction metadata
- Most reads don't need to talk to the TSO
Compactor
- Runs as an HBase coprocessor
- Removes stale cell versions
[Diagram: HBase with the Compactor, the Client, and the TSO]
Omid Operations
Open transaction
- Get a transid from the TSO
Read a cell
- Read all versions of the cell from HBase
- Pick the latest version committed before the transaction started
Write a cell
- Write the value, versioned with the transid, to HBase
Commit
- Get a commitid from the TSO
- The TSO figures out whether there is a conflict using transaction metadata
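The four operations can be illustrated with a small in-memory model of snapshot isolation. It is a sketch under simplifying assumptions and not Omid's code: SnapshotIsolationSketch and its fields are hypothetical, everything runs in one thread, a single counter plays the role of the TSO, and a transaction does not see its own uncommitted writes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical single-threaded model of snapshot isolation over versioned cells.
final class SnapshotIsolationSketch {
    static final class Txn {
        final long startTs; // the "transid"
        final Set<String> writeSet = new HashSet<>();

        Txn(long startTs) {
            this.startTs = startTs;
        }
    }

    private long clock = 0; // timestamp oracle counter
    // row -> (writer's start timestamp -> value): the versioned cells
    private final Map<String, TreeMap<Long, String>> cells = new HashMap<>();
    // start timestamp -> commit timestamp of committed transactions
    private final Map<Long, Long> commitTable = new HashMap<>();
    // row -> commit timestamp of the last committed write to that row
    private final Map<String, Long> lastCommit = new HashMap<>();

    Txn begin() {
        return new Txn(++clock);
    }

    void write(Txn t, String row, String value) {
        cells.computeIfAbsent(row, r -> new TreeMap<>()).put(t.startTs, value);
        t.writeSet.add(row);
    }

    // Returns the newest version committed before the reader started, or null.
    String read(Txn t, String row) {
        String best = null;
        long bestCommit = -1;
        for (Map.Entry<Long, String> v : cells.getOrDefault(row, new TreeMap<>()).entrySet()) {
            Long commitTs = commitTable.get(v.getKey());
            if (commitTs != null && commitTs < t.startTs && commitTs > bestCommit) {
                bestCommit = commitTs;
                best = v.getValue();
            }
        }
        return best;
    }

    // Commits unless some row in the write set was committed by another
    // transaction after this one started; in that case the transaction aborts.
    boolean commit(Txn t) {
        for (String row : t.writeSet) {
            if (lastCommit.getOrDefault(row, 0L) > t.startTs) {
                for (String r : t.writeSet) {
                    cells.get(r).remove(t.startTs); // discard our uncommitted versions
                }
                return false;
            }
        }
        long commitTs = ++clock;
        commitTable.put(t.startTs, commitTs);
        for (String row : t.writeSet) {
            lastCommit.put(row, commitTs);
        }
        return true;
    }
}
```

If two transactions that started concurrently both write row1, the first commit succeeds and the second returns false, which matches the "later one aborts" rule; neither reads nor writes ever wait on a lock.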
Omid Data Structure
Memory management in the TSO
- Never runs out of memory: old transactions are aborted
[Diagram: data structures kept in the TSO]
lastCommit
  row1  T20
  row2  T25
  row5  T22
committed
  T10  T20
  T4   T25
  T11  T30
  T2   …
aborted
  …
• lastCommit: detects transaction conflicts at commit time; the largest chunk of memory
• committed and aborted: used to construct the snapshot at read time; partially replicated to the client
Transaction Conflict
Two concurrent DDL statements write to the same data
- Needs proper retry logic
Task node writes: ORC footer cache
- High chance of write conflicts
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypasses the transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
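For the DDL case, "proper retry logic" could be as simple as re-running the whole transaction when its commit is rejected. The helper below is a minimal sketch of that idea; ConflictRetrySketch and runWithRetry are hypothetical names, and the attempt callback is assumed to open a fresh transaction, redo the work, and return whether the commit succeeded.

```java
import java.util.function.BooleanSupplier;

// Hypothetical retry helper for transactions that may abort on a write conflict.
final class ConflictRetrySketch {
    // Runs the attempt until it reports success or maxAttempts is used up.
    // Returns the number of attempts that were needed, or -1 if all failed.
    static int runWithRetry(BooleanSupplier attempt, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated transaction whose commit is rejected twice before it succeeds.
        System.out.println(runWithRetry(() -> ++calls[0] >= 3, 5)); // prints 3
    }
}
```

Footer-cache writes from tasks do not need this: each write is a single-row mutation, so they can go straight to HBase without the transaction layer.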
Deployment
Server-side components in HBase
- Server-side filter
- Omid compactor
- Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
- hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore, TSO, and HBase hosting the Server Side Filter and the Omid Compactor]
Deploy Omid
Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table
Start the Omid TSO
- omid.sh tso
Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT
Instantiate HBase Metastore
Instantiate HBase tables from scratch
- hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport
TPCDS queries
[Chart: Query Plan Time for TPCDS queries. X axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y axis: 0 to 6000. Series: HBaseStore, HBaseStore+Omid, ObjectStore]
1824 partitions: sweet spot for ObjectStore
Average speedup for all TPCDS queries
- 2.19 (without Omid)
- 2.12 (with Omid)
Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
- Minor holes: event notification, version, constraints
- Deprecate listTableNamesByFilter, listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported
Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet
Future Work - ACID
Transaction metadata is stored in the Metastore
- Locks
- Txns
- Compactions
Data structure is harder to de-normalize
New work: transaction server
- Keep the lock and transaction tree in memory
Future Work - HA via HBase Coprocessor
Two new server components
- Omid TSO Server
- Transaction Server
All servers need HA
- Management headache
Automatic HA through HBase Coprocessor
[Diagram: two Region Servers, each running the TSO Server via a coprocessor]
Future Work - Other
Stats aggregation
- Coprocessor
Improving ObjectCache
- Rudimentary implementation currently
- LRU
Omid consuming high CPU
- Always 300% CPU, by design
- High throughput, avoids context switches
- Might be an issue for a small system
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work - HA via HBase Coprocessor
- Future Work - Other
- Thank You
16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Partition Keys
Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo
Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value
date state
201601 CA
201601 WA
201602 CA
201603 CA
201605 CA
17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Typed Partition Keys
Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)
ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08
Using BinarySortableSerDendash Support all Hive data typesndash Handles null
(String Integer) Bytes
lsquoA10rsquo 3 41 31 30 00 00 00 00 03
lsquoA10rsquo 10 41 31 30 00 00 00 00 0A
lsquoA5rsquo 4 41 35 00 00 00 00 04
lsquoA5rsquo 15 41 35 00 00 00 00 0D
18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
struct StorageDescriptor
1 listltFieldSchemagt cols
2 string location
3 string inputFormat
4 string outputFormat
5 bool compressed
6 i32 numBuckets
7 SerDeInfo serdeInfo
8 listltstringgt bucketCols
9 listltOrdergt sortCols
10 mapltstring stringgt parameters
11 optional SkewedInfo skewedInfo
12 optional bool storedAsSubDirectories
20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
message StorageDescriptor
message Order hellip
message SerDeInfo hellip
message SkewedInfo hellip
repeated FieldSchema cols = 1
optional string input_format = 2
optional string output_format = 3
optional bool is_compressed = 4
optional sint32 num_buckets = 5
optional SerDeInfo serde_info = 6
repeated string bucket_cols = 7
repeated Order sort_cols = 8
optional SkewedInfo skewed_info = 9
optional bool stored_as_sub_directories = 10
21 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
22 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
HBMS_
PARTITIONS
HBMS_
SDS
23 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching
Aggregate Statsbull Location - on HBasebull Compile time
File Footers bull Location - on HBasebull Runtime - accessed from tasks
Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time
25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching Aggregate Stats
ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo
bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS
bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto
bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client
side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats
bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to
removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp
removes them from the cache
26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching File Footers
bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner
thread
bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic
27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBaseMetaStore Needs Transaction
Atomic is requiredndash Create table partition also create storage descriptorndash Alter table also alter partitionsndash Drop table also drop table column privilege
HBase donrsquot support transactionndash Donrsquot support cross-row transactions
HBaseConnectionndash Support different transaction manager in theoryndash VanillaHBaseConnection no transaction
29 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid
Transaction layer on top of Hbase Initially developed by Yahoo Apache incubator project
ndash First release this Monday
Snapshot isolationndash Natural as HBase is a versioned databasendash No locking no dead lock no blocking for both read and writendash Two concurrent transaction write to the same data the later one aborts
Low overhead
30 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid Components
TSO Server (Timestamp Oracle)ndash Generate transidndash Status of transaction
TSO Clientndash Talk to TSOndash Cache transaction metadatandash Most read donrsquot need to talk to TSO
Compactorndash Run as HBase Coprocessorndash Remove stale cell versions
HBaseCompactor
Client
TSO
31 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid Operations
Open transactionndash Get transid from TSO
Read a cellndash Read all versions of the cell from HBasendash Read latest committed version before transaction start
Write a cellndash Write value versioned with transid to HBase
Commitndash Generate commitid from TSOndash TSO figure out if there is conflict using transaction metadata
Omid Data Structure
Memory management in TSO
  - Never runs out of memory: old transactions are aborted
[Diagram: TSO state]
lastCommit
  row1 | T20
  row2 | T25
  row5 | T22
  - Detects transaction conflicts at commit time
  - Largest chunk of memory
committed
  T10 | T20
  T4 | T25
  T11 | T30
  … | …
aborted
  T2 …
  - Used to construct the snapshot at read time
  - Partially replicated to the client
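One way a conflict table can stay within a fixed memory budget is to evict its oldest entries and abort any transaction old enough that the evicted information might have mattered. The sketch below illustrates that idea only. BoundedConflictTable, its capacity parameter, and the low-watermark rule are assumptions made for this example, not a description of Omid's internals.

```python
from collections import OrderedDict


class BoundedConflictTable:
    """Hypothetical lastCommit table with a hard size limit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.last_commit = OrderedDict()  # row -> commit timestamp, oldest first
        self.low_watermark = 0            # largest commit timestamp evicted so far

    def can_commit(self, start_ts, write_set):
        # A transaction older than the watermark cannot be checked reliably,
        # so it is aborted rather than risking a missed conflict.
        if start_ts < self.low_watermark:
            return False
        return all(self.last_commit.get(row, 0) <= start_ts for row in write_set)

    def record_commit(self, commit_ts, write_set):
        for row in write_set:
            self.last_commit.pop(row, None)   # re-insert so the newest is last
            self.last_commit[row] = commit_ts
        while len(self.last_commit) > self.capacity:
            _, evicted_ts = self.last_commit.popitem(last=False)
            self.low_watermark = max(self.low_watermark, evicted_ts)
```

The trade-off is that memory use is bounded by capacity, at the cost of aborting long-running transactions that started before the watermark.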
Transaction Conflict
Two concurrent DDL operations write to the same data
  - Proper retry logic
Task node writes (ORC footer cache)
  - High chance of write conflicts
  - Row mutation is atomic in HBase
  - Cross-row atomicity is not required
  - Bypass the transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
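The "proper retry logic" for conflicting DDL can be as simple as re-running the whole operation when its commit is aborted. The following is a minimal sketch under that assumption. ConflictError and run_with_retry are hypothetical names for this example and do not come from Hive or Omid.

```python
import time


class ConflictError(Exception):
    """Raised by a hypothetical commit step when a write-write conflict aborts it."""


def run_with_retry(operation, max_attempts=3, backoff_seconds=0.0):
    """Run operation(), retrying from scratch when it aborts on a conflict."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConflictError:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller
            time.sleep(backoff_seconds * attempt)
```

The operation passed in should open a fresh transaction on each attempt, so that the retry reads a new snapshot instead of repeating the stale one.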
Agenda
Motivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
Deployment
Server-side components in HBase
  - Server-side filter
  - Omid compactor
  - Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
  - hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore and TSO alongside HBase, which hosts the Server Side Filter and the Omid Compactor]
Deploy Omid
Create Omid tables in HBase
  - omid.sh create-hbase-commit-table
  - omid.sh create-hbase-timestamp-table
Start Omid TSO
  - omid.sh tso
Related config in hive-site.xml
  - hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
  - tso.host=localhost
  - tso.port=54758
  - omid.client.connectionType=DIRECT
Instantiate HBase Metastore
Instantiate HBase tables from scratch
  - hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
  - One-way import from ObjectStore to HBaseStore
  - hive --service hbaseimport
Agenda
Motivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
TPCDS queries
[Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. X-axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y-axis: 0 to 6000.]
1824 partitions
Sweet spot for ObjectStore
Average speed-up for all TPCDS queries
  - 2.19 (without Omid)
  - 2.12 (with Omid)
Agenda
Motivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
  - Minor holes: event notification/version/constraints
  - Deprecate listTableNamesByFilter/listPartitionNamesByFilter
  - Tools enhancement
  - ACID is not supported
Runs most e2e queries
Fixing unit tests
  - TestMiniTezCliDriver: all pass
  - TestCliDriver: HIVE-14097 pending review
  - Not production quality yet
Future Work - ACID
Transaction metadata is stored in the Metastore
  - Locks
  - Txns
  - Compactions
Data structure is harder to de-normalize
New work: transaction server
  - Keep the lock and transaction tree in memory
Future Work - HA via HBase Coprocessor
Two new server components
  - Omid TSO Server
  - Transaction Server
All servers need HA
  - Management headache
Automatic HA through HBase Coprocessor
[Diagram: two Region Servers, each hosting a TSO Server via coprocessor]
Future Work - Other
Stats aggregation
  - Coprocessor
Improving ObjectCache
  - Rudimentary implementation currently
  - LRU
Omid consumes high CPU
  - 300% CPU at all times, by design
  - High throughput, avoids context switches
  - Might be an issue for small systems
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New Bottleneck - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work - HA via HBase Coprocessor
- Future Work - Other
- Thank You
How About Improving ObjectStore
Already happening
  - Using direct SQL instead of O-R
But
  - Maintenance nightmare
  - Must handle syntax differences between databases
Re-engineering effort may not pay off
Ultimate barrier: scalability
String queryText =
    "select PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID,"
  + " SERDES.SERDE_ID, PARTITIONS.CREATE_TIME,"
  + " PARTITIONS.LAST_ACCESS_TIME, SDS.INPUT_FORMAT, SDS.IS_COMPRESSED,"
  + " SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS,"
  + " SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB"
  + " from PARTITIONS"
  + " left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID"
  + " left outer join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID"
  + " where PART_ID in (" + partIds + ") order by PART_NAME asc";
Agenda
Motivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
System Architecture
[Diagram: HiveMetaStore Thrift Client; HiveMetaStore Thrift Server; ObjectStore over RDBMS; HBaseStore over Omid and HBase]
- Two implementations of the RawStore interface
  - HBaseStore
  - ObjectStore
- Both backends will live together for a while
- HBaseStore
  - Most traffic will go through the transaction layer (Omid)
  - Some traffic will bypass the transaction layer
    - Volatile data
    - High possibility of conflict
RDBMS schema
Read/Write path
- Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
- Thrift Server extracts values from the Thrift objects and creates the corresponding ORM model objects
- ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using the appropriate foreign key references
- RDBMS fastpath is enabled by bypassing the ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics across RDBMS databases
Example: adding a new partition, "add_partition(Partition new_part)"
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
[Diagram: RDBMS tables involved: TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS]
HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the Table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the Partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": Metadata footer proto; "s": PPD stats
HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences
De-normalization
- Goal
  - Optimized for querying
  - May be slower in DDL
- Example: drop_role(String roleName)
HBMS_USER_TO_ROLE
Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)
- Need to scan and de-serialize everything in order to drop a role
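The cost of drop_role on a de-normalized table can be seen in a small model of HBMS_USER_TO_ROLE. This is a sketch under stated assumptions: a Python dict stands in for the HBase table, plain lists stand in for the serialized RoleList protos, and the drop_role function below is written for this example rather than taken from Hive.

```python
def drop_role(user_to_role, role_name):
    """Remove role_name from every user's role list.

    Every row has to be visited, because the role may appear in any of them.
    Returns the users whose rows had to be rewritten.
    """
    touched = []
    for user, roles in user_to_role.items():  # full scan of the table
        if role_name in roles:                # requires reading the whole value
            user_to_role[user] = [r for r in roles if r != role_name]
            touched.append(user)
    return touched
```

With the four rows from the slide, dropping "Role 2" scans all four users and rewrites three of them, whereas looking up one user's roles remains a single-row read.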
Partition Keys
Range scan for most queries
  - Where date = '201601' and state = 'CA'
  - Where date >= '201602' and date < '201604'
Server-side filter for the rest
  - Where state = 'CA' (not a prefix of the key)
  - Where date like '2016' (regex)
  - Where date > '201601' and state > 'OR' (cannot be a range scan)
  - Scans all keys but does not deserialize values
date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
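The difference between the two access paths can be illustrated on the five keys above. In this sketch a sorted Python list stands in for HBase's sorted row keys. The helper names range_scan and filter_scan are made up for the example.

```python
from bisect import bisect_left

# Row keys sorted by (date, state), as they would be in the partition table.
KEYS = sorted([
    ("201601", "CA"),
    ("201601", "WA"),
    ("201602", "CA"),
    ("201603", "CA"),
    ("201605", "CA"),
])


def range_scan(sorted_keys, start, stop):
    """Return keys in [start, stop) by seeking to both ends of the range."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def filter_scan(sorted_keys, predicate):
    """Visit every key and keep those matching the predicate."""
    return [key for key in sorted_keys if predicate(key)]


# date >= '201602' and date < '201604': a prefix condition, so a range scan works.
by_date = range_scan(KEYS, ("201602",), ("201604",))

# state = 'CA': not a key prefix, so every key has to be examined.
by_state = filter_scan(KEYS, lambda key: key[1] == "CA")
```

The range scan touches only the two matching keys, while the filter examines all five keys to return four.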
Typed Partition Keys
Binary sorted
  - HBase range scan: Scan(byte[] startRow, byte[] stopRow)
  - Where key1 >= 'A5' and key2 >= 8
    - startRow: 41 35 00 00 00 00 08
Using BinarySortableSerDe
  - Supports all Hive data types
  - Handles null
(String, Integer) | Bytes
'A10', 3 | 41 31 30 00 00 00 00 03
'A10', 10 | 41 31 30 00 00 00 00 0A
'A5', 4 | 41 35 00 00 00 00 04
'A5', 15 | 41 35 00 00 00 00 0F
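The byte layout in the table can be reproduced with a few lines of code. This is a simplified sketch that follows the layout shown on the slide (string bytes, a 0x00 terminator, then a 4-byte big-endian integer). It is not the BinarySortableSerDe implementation, it handles only ASCII strings and non-negative integers, and encode_key is a name chosen for this example.

```python
import struct


def encode_key(text, number):
    """Encode (string, int) so that byte order matches (string, int) order."""
    if number < 0:
        raise ValueError("this simplified layout only orders non-negative integers")
    return text.encode("ascii") + b"\x00" + struct.pack(">I", number)


rows = [("A5", 15), ("A10", 10), ("A5", 4), ("A10", 3)]
# Sorting by the encoded bytes gives the order HBase would store the keys in.
stored_order = sorted(rows, key=lambda row: encode_key(*row))

# Start row for: key1 >= 'A5' and key2 >= 8
start_row = encode_key("A5", 8)
```

Note that 'A10' sorts before 'A5' because the comparison is byte-wise on the string, which is the same order the table on the slide uses.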
Storage Descriptor de-duplication
Thrift definitions:
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}
Protobuf definitions:
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
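The effect of keying HBMS_SDS by an MD5 of the serialized descriptor, with a reference count alongside it, can be sketched as follows. The assumptions are loud ones: JSON stands in for the protobuf serialization, a dict stands in for the HBase table, and StorageDescriptorStore with its put and release methods is a hypothetical class for this example.

```python
import hashlib
import json


class StorageDescriptorStore:
    """Toy model of HBMS_SDS: sd_hash -> {"c": payload, "ref": count}."""

    def __init__(self):
        self.rows = {}

    def put(self, sd):
        # Identical descriptors serialize to identical bytes, so they share a key.
        payload = json.dumps(sd, sort_keys=True).encode("utf-8")
        sd_hash = hashlib.md5(payload).digest()
        row = self.rows.setdefault(sd_hash, {"c": payload, "ref": 0})
        row["ref"] += 1
        return sd_hash  # the partition stores only this hash

    def release(self, sd_hash):
        # Called when a partition that referenced the descriptor is dropped.
        row = self.rows[sd_hash]
        row["ref"] -= 1
        if row["ref"] == 0:
            del self.rows[sd_hash]
```

Because per-partition fields such as location and sd_parameters live in the Partition message rather than in the StorageDescriptor, partitions of the same table usually produce the same hash and therefore share a single HBMS_SDS row.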
HBase schema
Read/Write path
- Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
- Thrift Server passes the Thrift objects to the HBase client opened in the Thrift server
- HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
- Writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition, "add_partition(Partition new_part)"
[Diagram: the Thrift struct Partition is converted to the protobuf message Partition and written to HBMS_PARTITIONS and HBMS_SDS]
Agenda
Motivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
Caching
Aggregate Stats
- Location: on HBase
- Compile time
File Footers
- Location: on HBase
- Runtime: accessed from tasks
Tables, Partitions, Storage Descriptors
- Location: on Metastore server(s)
- Compile time
Caching Aggregate Stats
"get_aggr_stats_for(dbName, tblName, partNames, colNames)"
- Gets aggregated stats for columns in each partition: an expensive call
- Used in CBO, Stats Annotation, Stats Optimizer
- HBMS_AGGR_STATS
  - RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  - Columns: AggrStats proto and AggrStatsBloomFilter proto
- Lookup
  - A new entry is added for each key not found in the cache: AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  - An AggrStatsBloomFilter is created on the partitions contained in the AggrStats
- Invalidation
  - TTL expiry: nodes are evicted from the cache
  - Alter partition, drop partition, analyze, etc. add an invalidation request to a queue
  - The invalidator thread picks up an invalidation request and executes a filter on HBase to remove expired entries
  - It uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
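The bloom-filter step of the invalidation path can be sketched as below. This is an illustration only: TinyBloom is a deliberately small filter written for the example (its size and hash count are arbitrary), the cache is a plain dict of key -> (stats, bloom), and invalidate is a hypothetical helper, not the Hive invalidator thread.

```python
import hashlib


class TinyBloom:
    """Minimal bloom filter: no false negatives, occasional false positives."""

    def __init__(self, bits=256, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.field = 0  # bit field stored in a Python int

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, item):
        for position in self._positions(item):
            self.field |= 1 << position

    def might_contain(self, item):
        return all((self.field >> position) & 1 for position in self._positions(item))


def invalidate(cache, partition_name):
    """Drop every cached aggregate whose bloom filter may include the partition."""
    stale = [key for key, (_, bloom) in cache.items()
             if bloom.might_contain(partition_name)]
    for key in stale:
        del cache[key]
    return stale
```

A false positive only causes an unaffected aggregate to be recomputed later, so the filter can stay small. A false negative would leave stale stats in the cache, which a bloom filter never produces.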
Caching File Footers
- ORC footer cache
  - Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  - Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  - Since fileId is unique, overwrites are not a problem. Stale entries are removed by a cleaner thread
- Skip transaction
  - High overhead
  - Transaction conflict
  - Row mutation is already atomic
5 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
New BottleNet - Metastore
Planning time is non-negligible Among planning significant amount of time spent on metadata fetching
6 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Besides Latency
Significantly more scalendash More metadata ndash millions of partitionsndash New large scale metadata ndash Split information ORC row group statisticsndash More calls ndash Handle orders of magnitude higher no of calls ndash From tasks
Reduce Complexityndash Object Relational Modeling is an impedance mismatchndash DataNucleusndash DBCP BoneCP or Hikaricp
7 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
ER Diagram for ObjectStore Database
8 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
How About Improving ObjectStore
Already happeningndash Using direct SQL instead of O-R
But
ndash Maintenance nightmarendash Handle syntax difference for databases
Re-engineering effort may not pay off Ultimate barrier Scalability
String queryText = select PARTITIONSPART_ID SDSSD_ID SDSCD_ID + SERDESSERDE_ID PARTITIONSCREATE_TIME + PARTITIONSLAST_ACCESS_TIME SDSINPUT_FORMAT SDSIS_COMPRESSED + SDSIS_STOREDASSUBDIRECTORIES SDSLOCATION SDSNUM_BUCKETS + SDSOUTPUT_FORMAT SERDESNAME SERDESSLIB + from PARTITIONS + left outer join SDS on PARTITIONSSD_ID = SDSSD_ID + left outer join SERDES on SDSSERDE_ID = SERDESSERDE_ID + where PART_ID in ( + partIds + ) order by PART_NAME asc
9 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
10 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
System Architecture
HiveMetaStore Thrift Server
ObjectStoreHBaseStore
RDBMSHBase
Omid
bull Two implementation of the RawStore interfacebull HBaseStorebull ObjectStore
bull Both backend will live together for a while
bull HBaseStorebull Most traffic will go through transaction
layer (Omid)bull Some traffic will bypass transaction layer
bull Volatile databull High possibility of conflict
HiveMetaStore Thrift Client
11 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writesreads values tofrom various tables in
RDBMS using appropriate foreign key references bull RDBMS fastpath enabled by not using ORM and writing direct SQL However
complicates testing matrix as there may be slight variations in SQL semantics for different RDBMS databases
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
12 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writes reads values to from various tables in
RDBMS using appropriate foreign key references
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
TBLS
TBL_PRIVS
TBL_COL_PRIVS
PART_PRIVS
SDS
CDS
SORT_ORDER
SERDES
TYPE_FIELDS
PARTITIONS
PARTITION_KEY_VALS
PARTITION_PARAMS
BUCKETING_COLS
SORT_COLS
SD_PARAMS
SKEWED_COL_NAMES
SKEWED_VALUES
TABLE_PARAMS

HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Table proto; "s": stats per column in the Table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Partition proto; "s": stats per column in the Partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": Metadata footer proto; "s": PPD Stats

HBase schema (2)
Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences

De-normalization
- Goal
  - Optimized for querying
  - May be slower in DDL
- Example: drop_role(String roleName)
HBMS_USER_TO_ROLE
Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)
- Need to scan and de-serialize everything in order to drop a role
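
To make the cost of drop_role concrete, here is a minimal sketch that uses an in-memory sorted map as a stand-in for HBMS_USER_TO_ROLE. The class name UserToRoleSketch and its methods are hypothetical and are not part of Hive; the real table stores a serialized RoleList proto per user, which this sketch replaces with a plain list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the HBMS_USER_TO_ROLE table:
// key = user name, value = that user's role list.
final class UserToRoleSketch {
    private final Map<String, List<String>> userToRoles = new TreeMap<>();

    void grant(String user, String... roles) {
        userToRoles.put(user, new ArrayList<>(Arrays.asList(roles)));
    }

    List<String> rolesOf(String user) {
        return userToRoles.get(user);
    }

    // There is no role -> users index, so every row is visited and its value
    // inspected. Returns the number of rows that had to be rewritten.
    int dropRole(String role) {
        int rewritten = 0;
        for (Map.Entry<String, List<String>> row : userToRoles.entrySet()) {
            if (row.getValue().remove(role)) {
                rewritten++;
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        UserToRoleSketch table = new UserToRoleSketch();
        table.grant("User 1", "Role 1", "Role 2", "Role 3", "Role 5");
        table.grant("User 2", "Role 1", "Role 2");
        table.grant("User 3", "Role 4", "Role 5");
        table.grant("User 4", "Role 2", "Role 3");
        System.out.println("rows rewritten: " + table.dropRole("Role 2"));
    }
}
```

Reading a user's roles is a single key lookup, while dropping a role touches every row. That is the trade-off the slide describes: fast queries, slower DDL.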

Partition Keys
Range scan for most queries
- Where date = '201601' and state = 'CA'
- Where date >= '201602' and date < '201604'
Server side filter for the rest
- Where state = 'CA' (not prefix key)
- Where date like '2016' (regex)
- Where date > '201601' and state > 'OR' (cannot be range scan)
- Scan all keys but do not deserialize values
date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
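
The difference between the two access paths can be sketched with a sorted map standing in for the sorted row keys of HBMS_PARTITIONS. This is an illustration only: the class PartitionScanSketch, the "date/state" key layout, and the '/' separator are assumptions made for the example, not the actual key encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class PartitionScanSketch {
    // Sorted keys stand in for HBase row keys; values are never inspected here.
    private final TreeMap<String, String> rows = new TreeMap<>();

    void add(String date, String state) {
        rows.put(date + "/" + state, "partition proto bytes");
    }

    // date >= from and date < to: the leading key column bounds a key interval,
    // so only the matching slice of the sorted keys is visited.
    List<String> rangeScanByDate(String fromInclusive, String toExclusive) {
        return new ArrayList<>(rows.subMap(fromInclusive, true, toExclusive, false).keySet());
    }

    // state = X is not a key prefix: every key is visited and filtered,
    // but values stay untouched (no deserialization).
    List<String> filterByState(String state) {
        List<String> out = new ArrayList<>();
        for (String key : rows.keySet()) {
            if (key.endsWith("/" + state)) {
                out.add(key);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PartitionScanSketch t = new PartitionScanSketch();
        t.add("201601", "CA");
        t.add("201601", "WA");
        t.add("201602", "CA");
        t.add("201603", "CA");
        t.add("201605", "CA");
        System.out.println(t.rangeScanByDate("201602", "201604"));
        System.out.println(t.filterByState("CA"));
    }
}
```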

Typed Partition Keys
Binary sorted
- HBase range scan: Scan(byte[] startRow, byte[] stopRow)
- Where key1 >= 'A5' and key2 >= 8
  - startRow: 41 35 00 00 00 00 08
Using BinarySortableSerDe
- Supports all Hive data types
- Handles null
(String, Integer) | Bytes
'A10', 3 | 41 31 30 00 00 00 00 03
'A10', 10 | 41 31 30 00 00 00 00 0A
'A5', 4 | 41 35 00 00 00 00 04
'A5', 15 | 41 35 00 00 00 00 0F
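
The byte layout in the table can be reproduced with a small sketch: the string's bytes, a 0x00 terminator, then the integer as four big-endian bytes. This mirrors only the simplified layout shown on the slide, for ASCII strings and non-negative integers. It is not the real BinarySortableSerDe format, which also handles nulls and negative values. TypedKeySketch and its method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class TypedKeySketch {
    // Simplified layout from the slide: string bytes, 0x00 terminator, 4-byte big-endian int.
    static byte[] encode(String s, int n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] str = s.getBytes(StandardCharsets.US_ASCII);
        out.write(str, 0, str.length);
        out.write(0x00);
        out.write((n >>> 24) & 0xFF);
        out.write((n >>> 16) & 0xFF);
        out.write((n >>> 8) & 0xFF);
        out.write(n & 0xFF);
        return out.toByteArray();
    }

    // Unsigned lexicographic comparison, the order in which HBase sorts row keys.
    static int compareUnsigned(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("A10", 10)));
        System.out.println(hex(encode("A5", 4)));
        // Negative result: ('A10', 10) sorts before ('A5', 4) as bytes.
        System.out.println(compareUnsigned(encode("A10", 10), encode("A5", 4)));
    }
}
```

Because the keys compare byte by byte, 'A10' sorts before 'A5' (0x31 is less than 0x35), and within one string value the integers sort numerically.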

Storage Descriptor de-duplication
Thrift definitions:
struct Partition {
  1: list<string> values
  2: string dbName
  3: string tableName
  4: i32 createTime
  5: i32 lastAccessTime
  6: StorageDescriptor sd
  7: map<string, string> parameters
  8: optional PrincipalPrivilegeSet privileges
}
struct StorageDescriptor {
  1: list<FieldSchema> cols
  2: string location
  3: string inputFormat
  4: string outputFormat
  5: bool compressed
  6: i32 numBuckets
  7: SerDeInfo serdeInfo
  8: list<string> bucketCols
  9: list<Order> sortCols
  10: map<string, string> parameters
  11: optional SkewedInfo skewedInfo
  12: optional bool storedAsSubDirectories
}
Protobuf definitions:
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
message StorageDescriptor {
  message Order { ... }
  message SerDeInfo { ... }
  message SkewedInfo { ... }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
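
The schema keys HBMS_SDS by md5(SD proto) and keeps a "ref" reference count, while the Partition proto carries only sd_hash. A minimal sketch of that idea follows, using the JDK's MessageDigest and two in-memory maps in place of the "c" and "ref" columns. SdDedupSketch and its method names are hypothetical, and the handling of the last reference is an assumption about how a reference count would typically be used.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

final class SdDedupSketch {
    // hash -> serialized descriptor ("c") and hash -> reference count ("ref").
    private final Map<String, byte[]> sdByHash = new HashMap<>();
    private final Map<String, Integer> refCount = new HashMap<>();

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Stores the descriptor once and returns the hash a partition would keep as sd_hash.
    String putSd(byte[] serializedSd) {
        String hash = md5Hex(serializedSd);
        sdByHash.putIfAbsent(hash, serializedSd);
        refCount.merge(hash, 1, Integer::sum);
        return hash;
    }

    // Drops one reference; the descriptor is removed with its last reference (assumed policy).
    void releaseSd(String hash) {
        Integer left = refCount.computeIfPresent(hash, (k, v) -> v == 1 ? null : v - 1);
        if (left == null) {
            sdByHash.remove(hash);
        }
    }

    int storedDescriptors() {
        return sdByHash.size();
    }

    int references(String hash) {
        return refCount.getOrDefault(hash, 0);
    }

    public static void main(String[] args) {
        SdDedupSketch store = new SdDedupSketch();
        byte[] sd = "orc,compressed=false".getBytes();
        String h1 = store.putSd(sd);
        String h2 = store.putSd(sd.clone());
        System.out.println(h1.equals(h2) + " stored=" + store.storedDescriptors());
    }
}
```

Partitions of one table usually share the same descriptor apart from location and parameters, which the Partition proto keeps for itself, so many partitions can point at a single HBMS_SDS row.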

HBase schema
Read/Write path
- Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
- Thrift Server passes the Thrift objects to the HBase client opened in the Thrift server
- HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
- Writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition, "add_partition(Partition new_part)"
[Diagram: the Thrift struct Partition maps to the protobuf message Partition, with HBMS_PARTITIONS and HBMS_SDS as the target tables]

Caching
Aggregate Stats
- Location: on HBase
- Compile time
File Footers
- Location: on HBase
- Runtime: accessed from tasks
Tables, Partitions, Storage Descriptors
- Location: on Metastore server(s)
- Compile time

Caching Aggregate Stats
"get_aggr_stats_for(dbName, tblName, partNames, colNames)"
- Gets aggregated stats for columns in each partition: an expensive call
- Used in CBO, Stats Annotation, Stats Optimizer
- HBMS_AGGR_STATS
  - RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  - Columns: AggrStats proto and AggrStatsBloomFilter proto
- Lookup
  - A new entry is added for each key not found in the cache; AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  - AggrStatsBloomFilter is created on the partitions contained in AggrStats
- Invalidation
  - TTL expiry: nodes evicted from the cache
  - Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  - Invalidator thread picks an invalidation request and executes a filter on HBase to remove expired entries
  - Uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
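
The invalidation step can be sketched as follows: each cached entry carries a Bloom filter over the partitions that went into it, and invalidating a partition evicts every entry whose filter may contain that partition. This is an illustrative in-memory model only. AggrStatsCacheSketch, the 1024-bit filter, and the two hash functions are assumptions, and the real cache lives in HBMS_AGGR_STATS rather than in a HashMap.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class AggrStatsCacheSketch {
    // Tiny Bloom filter: two hash positions per element in a 1024-bit set.
    static final class Bloom {
        private static final int BITS = 1024;
        private final BitSet bits = new BitSet(BITS);

        private static int h1(String s) {
            return Math.floorMod(s.hashCode(), BITS);
        }

        private static int h2(String s) {
            return Math.floorMod(31 * s.hashCode() + s.length(), BITS);
        }

        void add(String s) {
            bits.set(h1(s));
            bits.set(h2(s));
        }

        // May report true for an element never added, never false for one that was.
        boolean mightContain(String s) {
            return bits.get(h1(s)) && bits.get(h2(s));
        }
    }

    private static final class Entry {
        final String aggrStats;
        final Bloom partitions = new Bloom();

        Entry(String aggrStats, List<String> partNames) {
            this.aggrStats = aggrStats;
            for (String p : partNames) {
                partitions.add(p);
            }
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    void put(String key, String aggrStats, List<String> partNames) {
        cache.put(key, new Entry(aggrStats, partNames));
    }

    String get(String key) {
        Entry e = cache.get(key);
        return e == null ? null : e.aggrStats;
    }

    // Evicts every entry whose filter may contain the partition. A false positive
    // costs an extra eviction; an affected entry is never left behind.
    int invalidate(String partName) {
        int removed = 0;
        Iterator<Entry> it = cache.values().iterator();
        while (it.hasNext()) {
            if (it.next().partitions.mightContain(partName)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

The filter lets the invalidator test membership without storing or deserializing the full partition list of every cached aggregate.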

Caching File Footers
- ORC footer cache
  - Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  - Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  - Since fileId is unique, overwrites are not a problem. Stale entries are removed by a cleaner thread
- Skip transaction
  - High overhead
  - Transaction conflict
  - Row mutation is already atomic

HBaseMetaStore Needs Transaction
Atomicity is required
- Create table/partition also creates the storage descriptor
- Alter table also alters partitions
- Drop table also drops table column privileges
HBase doesn't support transactions
- Doesn't support cross-row transactions
HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transaction

Omid
Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday
Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both reads and writes
- When two concurrent transactions write to the same data, the later one aborts
Low overhead

Omid Components
TSO Server (Timestamp Oracle)
- Generates transid
- Status of transaction
TSO Client
- Talks to TSO
- Caches transaction metadata
- Most reads don't need to talk to TSO
Compactor
- Runs as HBase Coprocessor
- Removes stale cell versions
[Diagram: Client, TSO, and HBase with the Compactor]

Omid Operations
Open transaction
- Get transid from TSO
Read a cell
- Read all versions of the cell from HBase
- Read the latest committed version before transaction start
Write a cell
- Write value versioned with transid to HBase
Commit
- Generate commitid from TSO
- TSO figures out if there is a conflict using transaction metadata
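
The read rule above (take the latest version committed before the reader started) can be sketched with two in-memory maps: the versions of one cell keyed by the writer's transid, and a commit table mapping transid to commitid. SnapshotReadSketch is a hypothetical model for illustration. It does not use the Omid client API and ignores how the commit information is actually stored or cached.

```java
import java.util.HashMap;
import java.util.Map;

final class SnapshotReadSketch {
    // Versions of one cell: writer's start timestamp (transid) -> value.
    private final Map<Long, String> versions = new HashMap<>();
    // Commit table: writer's transid -> commit timestamp. Absent means not committed.
    private final Map<Long, Long> commitTable = new HashMap<>();

    void write(long writerStartTs, String value) {
        versions.put(writerStartTs, value);
    }

    void commit(long writerStartTs, long commitTs) {
        commitTable.put(writerStartTs, commitTs);
    }

    // Returns the value with the largest commit timestamp below the reader's
    // start timestamp, or null if no such version exists.
    String read(long readerStartTs) {
        String best = null;
        long bestCommit = Long.MIN_VALUE;
        for (Map.Entry<Long, String> v : versions.entrySet()) {
            Long commitTs = commitTable.get(v.getKey());
            if (commitTs != null && commitTs < readerStartTs && commitTs > bestCommit) {
                bestCommit = commitTs;
                best = v.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SnapshotReadSketch cell = new SnapshotReadSketch();
        cell.write(10, "a");
        cell.commit(10, 20);
        cell.write(11, "b");
        cell.commit(11, 30);
        cell.write(26, "c"); // still uncommitted
        System.out.println(cell.read(25));
    }
}
```

A reader never blocks on a writer here: uncommitted versions and versions committed after the reader's start are simply skipped.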

Omid Data Structure
Memory management in TSO
- Never runs OOM; aborts old transactions
TSO in-memory tables (from the slide diagram):
lastCommit: row1 -> T20, row2 -> T25, row5 -> T22
committed: T10 -> T20, T4 -> T25, T11 -> T30
aborted: T2, ...
Annotations on the diagram:
- Detect transaction conflict at commit time
- Largest chunk of memory
- Construct snapshot at read time
- Partially replicated to client
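
Commit-time conflict detection with a lastCommit table can be sketched as below: a transaction may commit only if none of the rows it wrote has been committed by another transaction after it started. TsoSketch is a hypothetical, single-process model. The single counter used for both start and commit timestamps and the -1 return value for an abort are simplifications for the example, not Omid's actual interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class TsoSketch {
    private long clock = 0;
    // row -> commit timestamp of the last transaction that wrote it.
    private final Map<String, Long> lastCommit = new HashMap<>();

    // Hands out a start timestamp (transid).
    synchronized long begin() {
        return ++clock;
    }

    // Returns the commit timestamp, or -1 if the transaction must abort because
    // one of its rows was committed by someone else after it started.
    synchronized long tryCommit(long startTs, Set<String> writtenRows) {
        for (String row : writtenRows) {
            Long last = lastCommit.get(row);
            if (last != null && last > startTs) {
                return -1L;
            }
        }
        long commitTs = ++clock;
        for (String row : writtenRows) {
            lastCommit.put(row, commitTs);
        }
        return commitTs;
    }

    public static void main(String[] args) {
        TsoSketch tso = new TsoSketch();
        long t1 = tso.begin();
        long t2 = tso.begin();
        System.out.println(tso.tryCommit(t1, java.util.Collections.singleton("row1")));
        System.out.println(tso.tryCommit(t2, java.util.Collections.singleton("row1")));
    }
}
```

No locks are taken while the transactions run. The check happens once, at commit, and of two concurrent writers to the same row the later committer is the one that aborts.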

Transaction Conflict
Two concurrent DDLs write to the same data
- Proper retry logic
Task node writes: ORC footer cache
- High chance of write conflict
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)

Deployment
Server side components in HBase
- Server side filter
- Omid compactor
- Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
- hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore, TSO, and HBase hosting the Server Side Filter and the Omid Compactor]

Deploy Omid
Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table
Start Omid TSO
- omid.sh tso
Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT

Instantiate HBase Metastore
Instantiate HBase tables from scratch
- hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport

TPCDS queries
[Figure: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. Queries 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Vertical axis from 0 to 6000.]
- 1824 partitions
- Sweet spot for ObjectStore
- Average speed-up for all TPCDS queries
  - 219 (without Omid)
  - 212 (with Omid)

Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity: almost
- Minor holes: event notification/version/constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported
Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet

Future Work - ACID
Transaction metadata is stored in Metastore
- Locks
- Txns
- Compactions
Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory

Future Work - HA via HBase Coprocessor
Two new server components
- Omid TSO Server
- Transaction Server
All servers need HA
- Management headache
Automatic HA through HBase Coprocessor
[Diagram: two Region Servers, each hosting a TSO Server via CoProcessor]

Future Work - Other
Stats Aggregation
- Coprocessor
Improving ObjectCache
- Rudimentary implementation currently
- LRU
Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoids context switches
- Might be an issue for small systems
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work - HA via HBase Coprocessor
- Future Work - Other
- Thank You
6 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Besides Latency
Significantly more scalendash More metadata ndash millions of partitionsndash New large scale metadata ndash Split information ORC row group statisticsndash More calls ndash Handle orders of magnitude higher no of calls ndash From tasks
Reduce Complexityndash Object Relational Modeling is an impedance mismatchndash DataNucleusndash DBCP BoneCP or Hikaricp
7 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
ER Diagram for ObjectStore Database
8 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
How About Improving ObjectStore
Already happeningndash Using direct SQL instead of O-R
But
ndash Maintenance nightmarendash Handle syntax difference for databases
Re-engineering effort may not pay off Ultimate barrier Scalability
String queryText = select PARTITIONSPART_ID SDSSD_ID SDSCD_ID + SERDESSERDE_ID PARTITIONSCREATE_TIME + PARTITIONSLAST_ACCESS_TIME SDSINPUT_FORMAT SDSIS_COMPRESSED + SDSIS_STOREDASSUBDIRECTORIES SDSLOCATION SDSNUM_BUCKETS + SDSOUTPUT_FORMAT SERDESNAME SERDESSLIB + from PARTITIONS + left outer join SDS on PARTITIONSSD_ID = SDSSD_ID + left outer join SERDES on SDSSERDE_ID = SERDESSERDE_ID + where PART_ID in ( + partIds + ) order by PART_NAME asc
9 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
10 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
System Architecture
HiveMetaStore Thrift Server
ObjectStoreHBaseStore
RDBMSHBase
Omid
bull Two implementation of the RawStore interfacebull HBaseStorebull ObjectStore
bull Both backend will live together for a while
bull HBaseStorebull Most traffic will go through transaction
layer (Omid)bull Some traffic will bypass transaction layer
bull Volatile databull High possibility of conflict
HiveMetaStore Thrift Client
11 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writesreads values tofrom various tables in
RDBMS using appropriate foreign key references bull RDBMS fastpath enabled by not using ORM and writing direct SQL However
complicates testing matrix as there may be slight variations in SQL semantics for different RDBMS databases
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
12 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writes reads values to from various tables in
RDBMS using appropriate foreign key references
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
TBLS
TBL_PRIVS
TBL_COL_PRIVS
PART_PRIVS
SDS
CDS
SORT_ORDER
SERDES
TYPE_FIELDS
PARTITIONS
PARTITION_KEY_VALS
PARTITION_PARAMS
BUCKETING_COLS
SORT_COLS
SD_PARAMS
SKEWED_COL_NAMES
SKEWED_VALUES
TABLE_PARAMS
13 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog ldquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
HBase schema
14 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Table Name Key Column Families and Columns
Description
HBMS_GLOBAL_PRIVS bytes(ldquogprdquo) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES bytes(roleName) cf_catalog ldquorolesrdquo ldquorolesrdquo storeretrieve serialized Role proto
HBMS_USER_TO_ROLE bytes(userName) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized RoleList proto
HBMS_SECURITY bytes(delTokenId) cf_catalog ldquodtrdquo ldquomkrdquo ldquodtrdquo storeretrieve delegation token ldquomkrdquo master keys
HBMS_SEQUENCES bytes(sequence) cf_catalog ldquocrdquo ldquocrdquo storeretrieve sequences
HBase schema
15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
De-normalization
bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)
Key Value
bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)
bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)
bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)
bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)
HBMS_USER_TO_ROLE
bull Need to scan amp de-serialize everything in order to drop a role
16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Partition Keys
Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo
Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value
date state
201601 CA
201601 WA
201602 CA
201603 CA
201605 CA
17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Typed Partition Keys
Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)
ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08
Using BinarySortableSerDendash Support all Hive data typesndash Handles null
(String Integer) Bytes
lsquoA10rsquo 3 41 31 30 00 00 00 00 03
lsquoA10rsquo 10 41 31 30 00 00 00 00 0A
lsquoA5rsquo 4 41 35 00 00 00 00 04
lsquoA5rsquo 15 41 35 00 00 00 00 0D
18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
struct StorageDescriptor
1 listltFieldSchemagt cols
2 string location
3 string inputFormat
4 string outputFormat
5 bool compressed
6 i32 numBuckets
7 SerDeInfo serdeInfo
8 listltstringgt bucketCols
9 listltOrdergt sortCols
10 mapltstring stringgt parameters
11 optional SkewedInfo skewedInfo
12 optional bool storedAsSubDirectories
20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
message StorageDescriptor
message Order hellip
message SerDeInfo hellip
message SkewedInfo hellip
repeated FieldSchema cols = 1
optional string input_format = 2
optional string output_format = 3
optional bool is_compressed = 4
optional sint32 num_buckets = 5
optional SerDeInfo serde_info = 6
repeated string bucket_cols = 7
repeated Order sort_cols = 8
optional SkewedInfo skewed_info = 9
optional bool stored_as_sub_directories = 10
21 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
22 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
HBMS_
PARTITIONS
HBMS_
SDS
23 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching
Aggregate Statsbull Location - on HBasebull Compile time
File Footers bull Location - on HBasebull Runtime - accessed from tasks
Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time
25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching Aggregate Stats
ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo
bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS
bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto
bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client
side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats
bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to
removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp
removes them from the cache
26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching File Footers

• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic
HBaseMetaStore Needs Transaction

Atomicity is required
- Create table/partition also creates the storage descriptor
- Alter table also alters partitions
- Drop table also drops table column privileges

HBase doesn't support transactions
- No cross-row transactions

HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transaction
Omid

Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday

Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both reads and writes
- When two concurrent transactions write to the same data, the later one aborts

Low overhead
Omid Components

TSO Server (Timestamp Oracle)
- Generates transid
- Status of transactions

TSO Client
- Talks to TSO
- Caches transaction metadata
- Most reads don't need to talk to TSO

Compactor
- Runs as an HBase Coprocessor
- Removes stale cell versions

(Diagram: Client, TSO, and HBase with the Compactor)
Omid Operations

Open transaction
- Get transid from TSO

Read a cell
- Read all versions of the cell from HBase
- Read the latest committed version before transaction start

Write a cell
- Write value versioned with transid to HBase

Commit
- Generate commitid from TSO
- TSO figures out if there is a conflict using transaction metadata
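The read rule above ("latest committed version before transaction start") can be illustrated with a toy in-memory model. This is a sketch under stated assumptions and not Omid's client code: the class, its maps and its method names are hypothetical, and a single cell stands in for an HBase table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class SnapshotReadSketch {
    // transid (start timestamp) -> commitid, for committed transactions only.
    private final Map<Long, Long> committed = new HashMap<>();
    // Versions of one cell: transid of the writer -> value written.
    private final TreeMap<Long, String> cellVersions = new TreeMap<>();

    /** Writes a value versioned with the writer's transid. */
    void write(long transId, String value) {
        cellVersions.put(transId, value);
    }

    /** Records that a transaction committed at the given commitid. */
    void commit(long transId, long commitId) {
        committed.put(transId, commitId);
    }

    /** Returns the value of the latest version committed before startTs, or null. */
    String read(long startTs) {
        String result = null;
        long bestCommit = Long.MIN_VALUE;
        for (Map.Entry<Long, String> version : cellVersions.entrySet()) {
            Long commitId = committed.get(version.getKey());
            // Skip uncommitted writers and writers that committed after we started.
            if (commitId != null && commitId < startTs && commitId > bestCommit) {
                bestCommit = commitId;
                result = version.getValue();
            }
        }
        return result;
    }
}
```

A reader never blocks in this model: it simply ignores versions whose writer has not committed, or committed after the reader's start timestamp.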
Omid Data Structure

Memory management in TSO
- Never runs OOM; aborts old transactions

(Diagram: in-memory tables held by the TSO)

lastCommit
  row1  T20
  row2  T25
  row5  T22
• Detect transaction conflict at commit time
• Largest chunk of memory

committed
  T10  T20
  T4   T25
  T11  T30
• Construct snapshot at read time
• Partially replicated to client

aborted
  T2  ...  ...
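A minimal sketch of how a lastCommit table can detect write-write conflicts at commit time is shown below. It assumes a single logical clock for both start and commit timestamps and uses hypothetical names throughout; it is not the Omid TSO implementation and omits the eviction of old entries mentioned above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class TsoSketch {
    private long clock = 0;
    // row -> commit timestamp of the last transaction that wrote it.
    private final Map<String, Long> lastCommit = new HashMap<>();

    /** Hands out a start timestamp (the transid). */
    synchronized long begin() {
        return ++clock;
    }

    /** Returns the commit timestamp, or -1 if the transaction must abort. */
    synchronized long tryCommit(long startTs, Set<String> writtenRows) {
        for (String row : writtenRows) {
            Long last = lastCommit.get(row);
            // Someone committed a write to this row after we started: conflict.
            if (last != null && last > startTs) {
                return -1;
            }
        }
        long commitTs = ++clock;
        for (String row : writtenRows) {
            lastCommit.put(row, commitTs);
        }
        return commitTs;
    }
}
```

With two transactions that start at T1 and T2 and both write row1, the first to call `tryCommit` succeeds and the second gets -1, which matches the rule that the later of two concurrent writers aborts.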
Transaction Conflict

Two concurrent DDLs write to the same data
- Proper retry logic

Task node writes: ORC footer cache
- High chance of write conflict
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
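The "proper retry logic" for conflicting DDL could look roughly like the following. This is a generic sketch, not code from Hive or Omid: `ConflictException`, `withRetry` and the attempt limit are all hypothetical names introduced for illustration.

```java
import java.util.function.Supplier;

final class RetrySketch {
    /** Hypothetical signal that the transaction layer aborted a commit. */
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    /** Re-runs the whole transaction body until it commits or attempts run out. */
    static <T> T withRetry(int maxAttempts, Supplier<T> txnBody) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return txnBody.get();
            } catch (ConflictException e) {
                // The aborted attempt left no committed state; start over.
                last = e;
            }
        }
        throw last;
    }
}
```

The body must be re-executed from the start (open transaction, read, write, commit) on each attempt, since a snapshot taken before the conflict is stale.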
Deployment

Server side components in HBase
- Server side filter
- Omid compactor
- Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar

New config in hive-site.xml
- hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore

(Diagram: Hive MetaStore and TSO alongside HBase, which hosts the Server Side Filter and the Omid Compactor)
Deploy Omid

Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table

Start Omid TSO
- omid.sh tso

Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT
Instantiate HBase Metastore

Instantiate HBase tables from scratch
- hive --service hbaseschematool --install

Hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport
TPCDS queries

(Chart: Query Plan Time for TPCDS queries. X-axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y-axis: 0 to 6000. Series: HBaseStore, HBaseStore+Omid, ObjectStore.)

1824 partitions: sweet spot for ObjectStore
Average speedup for all TPCDS queries
- 219 (without Omid)
- 212 (with Omid)
Current Status

hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
- Minor holes: event notification/version/constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported

Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet
Future Work - ACID

Transaction metadata is stored in Metastore
- Locks
- Txns
- Compactions

Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory
Future Work - HA via HBase Coprocessor

Two new server components
- Omid TSO Server
- Transaction Server

All servers need HA
- Management headache

Automatic HA through HBase Coprocessor

(Diagram: two Region Servers, each hosting a TSO Server via Coprocessor)
Future Work - Other

Stats Aggregation
- Coprocessor

Improving ObjectCache
- Rudimentary implementation currently
- LRU

Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoids context switches
- Might be an issue for small systems
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New Bottleneck - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work - HA via HBase Coprocessor
- Future Work - Other
- Thank You
How About Improving ObjectStore

Already happening
- Using direct SQL instead of O-R

But
- Maintenance nightmare
- Handle syntax differences between databases

Re-engineering effort may not pay off
Ultimate barrier: scalability

String queryText =
    "select PARTITIONS.PART_ID, SDS.SD_ID, SDS.CD_ID, "
  + "SERDES.SERDE_ID, PARTITIONS.CREATE_TIME, "
  + "PARTITIONS.LAST_ACCESS_TIME, SDS.INPUT_FORMAT, SDS.IS_COMPRESSED, "
  + "SDS.IS_STOREDASSUBDIRECTORIES, SDS.LOCATION, SDS.NUM_BUCKETS, "
  + "SDS.OUTPUT_FORMAT, SERDES.NAME, SERDES.SLIB "
  + "from PARTITIONS "
  + "left outer join SDS on PARTITIONS.SD_ID = SDS.SD_ID "
  + "left outer join SERDES on SDS.SERDE_ID = SERDES.SERDE_ID "
  + "where PART_ID in (" + partIds + ") order by PART_NAME asc";
System Architecture

(Diagram: HiveMetaStore Thrift Client; HiveMetaStore Thrift Server; ObjectStore over RDBMS; HBaseStore over Omid and HBase)

• Two implementations of the RawStore interface
  • HBaseStore
  • ObjectStore
• Both backends will live together for a while
• HBaseStore
  • Most traffic will go through the transaction layer (Omid)
  • Some traffic will bypass the transaction layer
    • Volatile data
    • High possibility of conflict
RDBMS schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server extracts values from Thrift objects and creates corresponding ORM model objects
• ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using appropriate foreign key references
• RDBMS fastpath is enabled by not using ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics between RDBMS databases

Example: adding a new partition, "add_partition(Partition new_part)"
struct Partition {
  1: list<string> values
  2: string dbName
  3: string tableName
  4: i32 createTime
  5: i32 lastAccessTime
  6: StorageDescriptor sd
  7: map<string, string> parameters
  8: optional PrincipalPrivilegeSet privileges
}

RDBMS tables (diagram): TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS
HBase schema

| Table Name | Key | Column Families and Columns | Description |
|---|---|---|---|
| HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto |
| HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count |
| HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Table proto; "s": Stats per column in the Table |
| HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Partition proto; "s": Stats per column in the Partition |
| HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto |
| HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto |
| HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": Metadata footer proto; "s": PPD Stats |
HBase schema (2)

| Table Name | Key | Column Families and Columns | Description |
|---|---|---|---|
| HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto |
| HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto |
| HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto |
| HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys |
| HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences |
De-normalization

• Goal
  • Optimized for querying
  • May be slower in DDL
• Example: drop_role(String roleName)

HBMS_USER_TO_ROLE

| Key | Value |
|---|---|
| bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5) |
| bytes("User 2") | Proto(Role 1, Role 2) |
| bytes("User 3") | Proto(Role 4, Role 5) |
| bytes("User 4") | Proto(Role 2, Role 3) |

• Need to scan and de-serialize everything in order to drop a role
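The cost of dropping a role in the de-normalized layout can be seen in a toy model of HBMS_USER_TO_ROLE. The class and method names here are hypothetical, and an in-memory map of lists stands in for the HBase table and the serialized RoleList protos.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class UserToRoleSketch {
    // userName -> roles granted to that user (one "row" per user).
    private final Map<String, List<String>> userToRoles = new TreeMap<>();

    void grant(String user, String role) {
        userToRoles.computeIfAbsent(user, u -> new ArrayList<>()).add(role);
    }

    /** Cheap lookup: a single row read answers "which roles does this user have?". */
    List<String> rolesOf(String user) {
        return userToRoles.getOrDefault(user, new ArrayList<>());
    }

    /** Expensive DDL: every row is visited; returns how many rows had to be rewritten. */
    int dropRole(String role) {
        int rewritten = 0;
        for (List<String> roles : userToRoles.values()) {
            if (roles.remove(role)) {
                rewritten++;
            }
        }
        return rewritten;
    }
}
```

With the four users from the table above, dropping "Role 2" touches all four rows and rewrites three of them, whereas reading one user's roles is a single get.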
Partition Keys

Range scan for most queries
- Where date = '201601' and state = 'CA'
- Where date >= '201602' and date < '201604'

Server side filter for the rest
- Where state = 'CA' (not prefix key)
- Where date like '2016' (regex)
- Where date > '201601' and state > 'OR' (cannot be range scan)
- Scan all keys but do not deserialize values

| date | state |
|---|---|
| 201601 | CA |
| 201601 | WA |
| 201602 | CA |
| 201603 | CA |
| 201605 | CA |
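The difference between a prefix range scan and a filter over all keys can be sketched with a sorted set standing in for the HBase row keys. The class, the "/" separator and the method names are assumptions made for the example, not the HBaseStore key format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class PartitionKeyScanSketch {
    // Hypothetical separator between key parts; real keys are binary encoded.
    private static final String SEP = "/";
    // Sorted composite keys of the form date/state.
    private final TreeSet<String> keys = new TreeSet<>();

    void add(String date, String state) {
        keys.add(date + SEP + state);
    }

    /** Predicate on the leading key: only the matching slice of keys is visited. */
    List<String> rangeByDate(String fromInclusive, String toExclusive) {
        return new ArrayList<>(keys.subSet(fromInclusive, true, toExclusive, false));
    }

    /** Predicate on a non-prefix key: every key must be inspected. */
    List<String> filterByState(String state) {
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            if (key.endsWith(SEP + state)) {
                out.add(key);
            }
        }
        return out;
    }
}
```

For the five partitions listed above, `rangeByDate("201602", "201604")` visits only the two matching keys, while `filterByState("CA")` walks all five to return four.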
Typed Partition Keys

Binary sorted
- HBase range scan: Scan(byte[] startRow, byte[] stopRow)
- Where key1 >= 'A5' and key2 >= 8
  • startRow: 41 35 00 00 00 00 08

Using BinarySortableSerDe
- Supports all Hive data types
- Handles null

| (String, Integer) | Bytes |
|---|---|
| 'A10', 3 | 41 31 30 00 00 00 00 03 |
| 'A10', 10 | 41 31 30 00 00 00 00 0A |
| 'A5', 4 | 41 35 00 00 00 00 04 |
| 'A5', 15 | 41 35 00 00 00 00 0F |
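The byte layout in the table can be reproduced with a small encoder. This is a simplified sketch that follows only what the slide shows (string bytes, a 00 terminator, then a 4-byte big-endian integer); it is not the BinarySortableSerDe implementation, handles neither nulls nor negative integers, and all names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class TypedKeySketch {
    /** Encodes (String, int) as string bytes + 0x00 terminator + big-endian int. */
    static byte[] encode(String s, int n) {
        byte[] str = s.getBytes(StandardCharsets.US_ASCII);
        return ByteBuffer.allocate(str.length + 1 + 4)
                .put(str)
                .put((byte) 0)
                .putInt(n) // ByteBuffer is big-endian by default
                .array();
    }

    /** Unsigned lexicographic comparison, the order in which HBase sorts row keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }
}
```

For example, `hex(encode("A10", 3))` yields `41 31 30 00 00 00 00 03`, and every 'A10' key compares lower than every 'A5' key because the second byte 31 is less than 35. This is why the table lists the 'A10' rows first.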
Storage Descriptor de-duplication
struct Partition {
  1: list<string> values
  2: string dbName
  3: string tableName
  4: i32 createTime
  5: i32 lastAccessTime
  6: StorageDescriptor sd
  7: map<string, string> parameters
  8: optional PrincipalPrivilegeSet privileges
}

struct StorageDescriptor {
  1: list<FieldSchema> cols
  2: string location
  3: string inputFormat
  4: string outputFormat
  5: bool compressed
  6: i32 numBuckets
  7: SerDeInfo serdeInfo
  8: list<string> bucketCols
  9: list<Order> sortCols
  10: map<string, string> parameters
  11: optional SkewedInfo skewedInfo
  12: optional bool storedAsSubDirectories
}
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { ... }
  message SerDeInfo { ... }
  message SkewedInfo { ... }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
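The de-duplication idea (partitions keep only an sd_hash, while HBMS_SDS stores each distinct storage descriptor once with a reference count) can be sketched as below. This is an in-memory illustration with hypothetical names; how HBaseStore actually maintains the "ref" column is not shown on the slides.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

final class SdDedupSketch {
    private static final class Entry {
        final byte[] serializedSd;
        int refCount = 1;

        Entry(byte[] serializedSd) {
            this.serializedSd = serializedSd;
        }
    }

    // md5 of the serialized SD (as hex) -> stored SD plus reference count.
    private final Map<String, Entry> sds = new HashMap<>();

    /** Stores the SD once, or bumps its reference count; returns the hash key. */
    String put(byte[] serializedSd) {
        String key = md5Hex(serializedSd);
        Entry existing = sds.get(key);
        if (existing == null) {
            sds.put(key, new Entry(serializedSd));
        } else {
            existing.refCount++;
        }
        return key;
    }

    /** Drops one reference; the SD is removed when nothing points to it. */
    void release(String key) {
        Entry entry = sds.get(key);
        if (entry != null && --entry.refCount == 0) {
            sds.remove(key);
        }
    }

    int refCount(String key) {
        Entry entry = sds.get(key);
        return entry == null ? 0 : entry.refCount;
    }

    int size() {
        return sds.size();
    }

    private static String md5Hex(byte[] data) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : MessageDigest.getInstance("MD5").digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Since partitions of one table usually differ only in location and parameters, which the Partition message keeps outside the hashed descriptor, many partitions can share a single HBMS_SDS row in this scheme.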
HBase schema

Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from the Thrift objects and converts them to corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables

Example: adding a new partition, "add_partition(Partition new_part)"
(Diagram: the Thrift struct Partition is converted to the protobuf message Partition and stored in HBMS_PARTITIONS and HBMS_SDS)
8 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
How About Improving ObjectStore
Already happeningndash Using direct SQL instead of O-R
But
ndash Maintenance nightmarendash Handle syntax difference for databases
Re-engineering effort may not pay off Ultimate barrier Scalability
String queryText = select PARTITIONSPART_ID SDSSD_ID SDSCD_ID + SERDESSERDE_ID PARTITIONSCREATE_TIME + PARTITIONSLAST_ACCESS_TIME SDSINPUT_FORMAT SDSIS_COMPRESSED + SDSIS_STOREDASSUBDIRECTORIES SDSLOCATION SDSNUM_BUCKETS + SDSOUTPUT_FORMAT SERDESNAME SERDESSLIB + from PARTITIONS + left outer join SDS on PARTITIONSSD_ID = SDSSD_ID + left outer join SERDES on SDSSERDE_ID = SERDESSERDE_ID + where PART_ID in ( + partIds + ) order by PART_NAME asc
9 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
10 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
System Architecture
HiveMetaStore Thrift Server
ObjectStoreHBaseStore
RDBMSHBase
Omid
bull Two implementation of the RawStore interfacebull HBaseStorebull ObjectStore
bull Both backend will live together for a while
bull HBaseStorebull Most traffic will go through transaction
layer (Omid)bull Some traffic will bypass transaction layer
bull Volatile databull High possibility of conflict
HiveMetaStore Thrift Client
11 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writesreads values tofrom various tables in
RDBMS using appropriate foreign key references bull RDBMS fastpath enabled by not using ORM and writing direct SQL However
complicates testing matrix as there may be slight variations in SQL semantics for different RDBMS databases
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
12 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writes reads values to from various tables in
RDBMS using appropriate foreign key references
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
TBLS
TBL_PRIVS
TBL_COL_PRIVS
PART_PRIVS
SDS
CDS
SORT_ORDER
SERDES
TYPE_FIELDS
PARTITIONS
PARTITION_KEY_VALS
PARTITION_PARAMS
BUCKETING_COLS
SORT_COLS
SD_PARAMS
SKEWED_COL_NAMES
SKEWED_VALUES
TABLE_PARAMS
13 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1 … partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1 … partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats
HBase schema (continued)
Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences
De-normalization
• Goal
  • Optimized for querying
  • May be slower in DDL
• Example: drop_role(String roleName)
HBMS_USER_TO_ROLE
Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)
• Need to scan & de-serialize everything in order to drop a role
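The trade-off on this slide can be sketched in a few lines of Python. This is only an illustration, assuming a plain dictionary as a stand-in for HBMS_USER_TO_ROLE (the real table stores a serialized RoleList proto per user); the helper names `roles_for_user` and `drop_role` are hypothetical.

```python
# Hypothetical in-memory stand-in for HBMS_USER_TO_ROLE: key = user name,
# value = list of role names (the real table stores a serialized RoleList proto).
user_to_role = {
    "User 1": ["Role 1", "Role 2", "Role 3", "Role 5"],
    "User 2": ["Role 1", "Role 2"],
    "User 3": ["Role 4", "Role 5"],
    "User 4": ["Role 2", "Role 3"],
}


def roles_for_user(table, user):
    """Query path: a single key lookup, no join."""
    return table.get(user, [])


def drop_role(table, role_name):
    """DDL path: every row is visited and rewritten if it holds the role."""
    rows_scanned = 0
    rows_rewritten = 0
    for user, roles in table.items():
        rows_scanned += 1
        if role_name in roles:
            table[user] = [r for r in roles if r != role_name]
            rows_rewritten += 1
    return rows_scanned, rows_rewritten


scanned, rewritten = drop_role(user_to_role, "Role 2")
```

Reading a user's roles touches one key, while dropping a role touches every row, which is the cost the slide accepts in exchange for fast queries.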
Partition Keys
Range scan for most queries
- Where date = '201601' and state = 'CA'
- Where date >= '201602' and date < '201604'
Server-side filter for the rest
- Where state = 'CA' (not a prefix of the key)
- Where date like '2016' (regex)
- Where date > '201601' and state > 'OR' (cannot be a range scan)
- Scans all keys but does not deserialize the values
date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
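The difference between the two access paths can be illustrated with a small Python sketch. It assumes a sorted list of tuples as a stand-in for the sorted HBase row keys; `range_scan` and `filtered_scan` are hypothetical helpers, not HBase APIs.

```python
from bisect import bisect_left

# Hypothetical partition keys, kept sorted the way HBase keeps row keys sorted.
keys = sorted([
    ("201601", "CA"), ("201601", "WA"), ("201602", "CA"),
    ("201603", "CA"), ("201605", "CA"),
])


def range_scan(start, stop):
    """Return the keys in [start, stop) and how many keys were touched."""
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, stop)
    return keys[lo:hi], hi - lo


def filtered_scan(predicate):
    """Visit every key and keep those matching the predicate."""
    return [k for k in keys if predicate(k)], len(keys)


# date >= '201602' and date < '201604': the predicate is a key prefix range.
hits, touched = range_scan(("201602",), ("201604",))
# state = 'CA': not a prefix of the key, so every key is examined.
ca_hits, ca_touched = filtered_scan(lambda k: k[1] == "CA")
```

The range scan touches only the two matching keys, whereas the filter on `state` has to look at all five.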
Typed Partition Keys
Binary sorted
- HBase range scan: Scan(byte[] startRow, byte[] stopRow)
- Where key1 >= 'A5' and key2 >= 8
  • startRow: 41 35 00 00 00 00 08
Using BinarySortableSerDe
- Supports all Hive data types
- Handles null
(String, Integer) | Bytes
('A10', 3) | 41 31 30 00 00 00 00 03
('A10', 10) | 41 31 30 00 00 00 00 0A
('A5', 4) | 41 35 00 00 00 00 04
('A5', 15) | 41 35 00 00 00 00 0F
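The byte layout in the table above can be reproduced with a short Python sketch. It assumes the simplified layout shown on the slide (string bytes, a 0x00 terminator, then a 4-byte big-endian integer); this is not the full BinarySortableSerDe format, and `encode_key` is a hypothetical helper.

```python
import struct


def encode_key(s, n):
    """Encode (string, non-negative int) so byte order matches value order.

    Simplified layout taken from the slide's table: UTF-8 bytes of the string,
    a 0x00 terminator, then the integer as 4 big-endian bytes. Negative
    numbers and nulls are deliberately left out of this sketch.
    """
    if n < 0 or "\x00" in s:
        raise ValueError("sketch handles only non-negative ints and NUL-free strings")
    return s.encode("utf-8") + b"\x00" + struct.pack(">I", n)


rows = [("A5", 15), ("A10", 10), ("A5", 4), ("A10", 3)]
encoded = sorted(encode_key(s, n) for s, n in rows)

start_row = encode_key("A5", 8)
# Keys at or after start_row are the candidates for key1 >= 'A5' and key2 >= 8
# (rows with a larger key1 would still need key2 checked by a filter).
in_range = [k for k in encoded if k >= start_row]
```

Sorting the encoded bytes gives the same order as sorting the typed values, which is what lets a typed predicate be turned into a `startRow` for a range scan.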
Storage Descriptor de-duplication
HBMS_SDS: key bytes(md5(SD proto)); cf_catalog columns "c" (StorageDescriptor proto) and "ref" (reference count)
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
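In the protobuf messages, `location` and `sd_parameters` live on the Partition while the StorageDescriptor carries only the shareable fields, so many partitions can point at one HBMS_SDS row through `sd_hash`. A minimal Python sketch of that idea follows. The class `MetastoreSketch`, its dictionaries, and the JSON-based hashing are assumptions for illustration only; the real store hashes the serialized StorageDescriptor proto.

```python
import hashlib
import json


class MetastoreSketch:
    """Hypothetical in-memory stand-in for HBMS_SDS and HBMS_PARTITIONS."""

    def __init__(self):
        self.sds = {}         # sd_hash -> {"sd": ..., "ref": reference count}
        self.partitions = {}  # (db, table, partition values) -> partition record

    @staticmethod
    def sd_hash(sd):
        # Canonical JSON is used here only to get deterministic bytes to hash.
        payload = json.dumps(sd, sort_keys=True).encode("utf-8")
        return hashlib.md5(payload).digest()

    def add_partition(self, db, table, values, location, sd):
        key = self.sd_hash(sd)
        entry = self.sds.setdefault(key, {"sd": sd, "ref": 0})
        entry["ref"] += 1
        self.partitions[(db, table, tuple(values))] = {
            "location": location, "sd_hash": key}

    def drop_partition(self, db, table, values):
        part = self.partitions.pop((db, table, tuple(values)))
        entry = self.sds[part["sd_hash"]]
        entry["ref"] -= 1
        if entry["ref"] == 0:
            # Last reference gone: the shared descriptor can be removed.
            del self.sds[part["sd_hash"]]


store = MetastoreSketch()
shared_sd = {"input_format": "orc", "cols": ["id", "name"], "num_buckets": 4}
for day in ("201601", "201602", "201603"):
    store.add_partition("default", "sales", [day],
                        "/warehouse/sales/" + day, dict(shared_sd))
refs_after_add = next(iter(store.sds.values()))["ref"]
```

Three partitions with identical descriptors produce a single stored descriptor with a reference count of three; the descriptor disappears only when the last partition referencing it is dropped.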
HBase schema
Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes the Thrift objects to the HBase client opened in the Thrift Server
• HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition, "add_partition(Partition new_part)"
Mapping: Thrift struct Partition -> protobuf message Partition, stored in HBMS_PARTITIONS; its StorageDescriptor is stored in HBMS_SDS and referenced through sd_hash
Caching
Aggregate Stats
• Location: on HBase
• Compile time
File Footers
• Location: on HBase
• Runtime: accessed from tasks
Tables, Partitions, Storage Descriptors
• Location: on Metastore server(s)
• Compile time
Caching Aggregate Stats
"get_aggr_stats_for(dbName, tblName, partNames, colNames)"
• Gets aggregated stats for columns in each partition: an expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1 … partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache; AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  • An AggrStatsBloomFilter is created on the partitions contained in AggrStats
• Invalidation
  • TTL expiry: nodes are evicted from the cache
  • Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  • Invalidator thread picks up an invalidation request and executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
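The bloom-filter-based invalidation can be sketched as follows. This is a toy in-memory model, not the Hive implementation: `BloomFilter`, `AggrStatsCache`, the filter size, and the use of a tuple as cache key (instead of an md5 row key) are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter; sizes are arbitrary illustration values."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class AggrStatsCache:
    def __init__(self):
        # cache key -> (aggregated stats, bloom filter over the partition names)
        self.entries = {}

    def put(self, db, table, part_names, col, stats):
        bloom = BloomFilter()
        for name in part_names:
            bloom.add(name)
        self.entries[(db, table, tuple(part_names), col)] = (stats, bloom)

    def get(self, db, table, part_names, col):
        hit = self.entries.get((db, table, tuple(part_names), col))
        return hit[0] if hit else None

    def invalidate_partition(self, db, table, part_name):
        """Drop every entry whose bloom filter may contain the changed partition."""
        stale = [k for k, (_, bloom) in self.entries.items()
                 if k[0] == db and k[1] == table and bloom.might_contain(part_name)]
        for k in stale:
            del self.entries[k]
        return len(stale)


cache = AggrStatsCache()
cache.put("default", "sales", ["date=201601", "date=201602"], "price", {"num_nulls": 3})
cache.put("default", "sales", ["date=201605"], "price", {"num_nulls": 0})
removed = cache.invalidate_partition("default", "sales", "date=201601")
```

A bloom filter never misses an entry that really contains the partition; an occasional false positive only evicts an entry that could have stayed, which costs a recomputation but not correctness.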
Caching File Footers
• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic
HBaseMetaStore Needs Transaction
Atomicity is required
- Create table/partition also creates a storage descriptor
- Alter table also alters partitions
- Drop table also drops table column privileges
HBase doesn't support transactions
- Doesn't support cross-row transactions
HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transaction
Omid
Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday
Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both reads and writes
- When two concurrent transactions write to the same data, the later one aborts
Low overhead
Omid Components
TSO Server (Timestamp Oracle)
- Generates transid
- Tracks the status of transactions
TSO Client
- Talks to TSO
- Caches transaction metadata
- Most reads don't need to talk to TSO
Compactor
- Runs as an HBase Coprocessor
- Removes stale cell versions
Diagram: Client, TSO, HBase (with Compactor)
Omid Operations
Open transaction
- Get transid from TSO
Read a cell
- Read all versions of the cell from HBase
- Use the latest version committed before the transaction started
Write a cell
- Write the value, versioned with transid, to HBase
Commit
- Generate commitid from TSO
- TSO figures out whether there is a conflict using transaction metadata
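The four operations above can be modelled in a few dozen lines of Python. This is a toy, single-process sketch of snapshot isolation with a timestamp oracle, not Omid's actual code: the classes `TimestampOracle` and `Transaction`, the `Conflict` exception, and the dictionary used as a versioned store are all hypothetical.

```python
class Conflict(Exception):
    pass


class TimestampOracle:
    """Toy stand-in for the TSO: hands out timestamps and validates commits."""

    def __init__(self):
        self.clock = 0
        self.last_commit = {}  # row -> commit timestamp of the last writer
        self.committed = {}    # start timestamp -> commit timestamp

    def next_timestamp(self):
        self.clock += 1
        return self.clock

    def commit(self, start_ts, written_rows):
        # Abort if any written row was committed after this transaction started.
        for row in written_rows:
            if self.last_commit.get(row, 0) > start_ts:
                raise Conflict(row)
        commit_ts = self.next_timestamp()
        for row in written_rows:
            self.last_commit[row] = commit_ts
        self.committed[start_ts] = commit_ts
        return commit_ts


class Transaction:
    def __init__(self, tso, store):
        self.tso, self.store = tso, store
        self.start_ts = tso.next_timestamp()  # the "transid"
        self.writes = set()

    def write(self, row, value):
        # The cell version is the writer's start timestamp.
        self.store.setdefault(row, {})[self.start_ts] = value
        self.writes.add(row)

    def read(self, row):
        # Pick the newest version whose writer committed before we started.
        best = None
        for writer_ts, value in self.store.get(row, {}).items():
            if writer_ts == self.start_ts:
                return value  # our own uncommitted write
            commit_ts = self.tso.committed.get(writer_ts)
            if commit_ts is not None and commit_ts < self.start_ts:
                if best is None or commit_ts > best[0]:
                    best = (commit_ts, value)
        return best[1] if best else None

    def commit(self):
        return self.tso.commit(self.start_ts, self.writes)


tso, store = TimestampOracle(), {}
t1 = Transaction(tso, store)
t1.write("row1", "a")
t1.commit()
t2, t3 = Transaction(tso, store), Transaction(tso, store)
t2.write("row1", "b")
t3.write("row1", "c")
t2.commit()
try:
    t3.commit()
    aborted = False
except Conflict:
    aborted = True
reader = Transaction(tso, store)
```

Here `t2` and `t3` overlap and write the same row; `t2` commits first, so `t3` is the later one and aborts. The uncommitted version written by `t3` stays in the store but is invisible to readers, which is the kind of stale cell a compactor would clean up.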
Omid Data Structure
Memory management in TSO
- Never runs out of memory (OOM): old transactions are aborted
TSO state
lastCommit (row | last commit timestamp)
  row1 | T20
  row2 | T25
  row5 | T22
  • Detects transaction conflicts at commit time
  • Largest chunk of memory
committed (start timestamp | commit timestamp)
  T10 | T20
  T4 | T25
  T11 | T30
  T2 | …
  • Used to construct the snapshot at read time
  • Partially replicated to the client
aborted
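One way a bounded lastCommit table can avoid running out of memory while staying safe is to remember a low watermark when entries are evicted and abort any transaction older than it. The Python sketch below illustrates that idea under stated assumptions: the class `BoundedCommitTable`, the oldest-first eviction policy, and the watermark rule are illustrative choices, not a description of Omid's actual code.

```python
from collections import OrderedDict


class BoundedCommitTable:
    """Toy lastCommit table with a fixed capacity (assumed policy: evict oldest first)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.last_commit = OrderedDict()  # row -> commit timestamp, oldest entry first
        self.low_watermark = 0            # newest commit timestamp that was evicted

    def record_commit(self, row, commit_ts):
        self.last_commit.pop(row, None)
        self.last_commit[row] = commit_ts
        while len(self.last_commit) > self.capacity:
            _, evicted_ts = self.last_commit.popitem(last=False)
            self.low_watermark = max(self.low_watermark, evicted_ts)

    def can_commit(self, start_ts, rows):
        # History older than the watermark is gone, so a conflict cannot be
        # ruled out for transactions that started before it: abort them.
        if start_ts < self.low_watermark:
            return False
        return all(self.last_commit.get(row, 0) <= start_ts for row in rows)


table = BoundedCommitTable(capacity=2)
table.record_commit("row1", 20)
table.record_commit("row5", 22)
table.record_commit("row2", 25)  # evicts row1 and raises the watermark to 20
```

Memory use is capped by `capacity`; the price is that long-running transactions which started before the watermark are aborted even if they did not actually conflict.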
Transaction Conflict
Two concurrent DDLs write to the same data
- Proper retry logic
Task node writes: ORC footer cache
- High chance of write conflict
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
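The slide only says "proper retry logic"; one common shape for it is retry with exponential backoff and jitter. The sketch below is an assumption about what such logic might look like, not the metastore's implementation: `TransactionConflict` and `run_with_retry` are hypothetical names, and the attempt count and delays are arbitrary.

```python
import random
import time


class TransactionConflict(Exception):
    """Hypothetical exception raised when the transaction layer aborts a commit."""


def run_with_retry(operation, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Run operation(); on conflict, back off and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransactionConflict:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter so competing DDLs do not retry in lockstep.
            sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))


calls = {"count": 0}


def flaky_ddl():
    # Simulated DDL that conflicts twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransactionConflict()
    return "committed"


result = run_with_retry(flaky_ddl, sleep=lambda seconds: None)
```

The whole operation is re-run on conflict, so it must re-read its inputs inside `operation()` rather than reuse values from the aborted attempt.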
Deployment
Server-side components in HBase
- Server-side filter
- Omid compactor
- Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
- hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.hbase.HBaseStore
Diagram: Hive MetaStore, TSO, and HBase (hosting the Server Side Filter and the Omid Compactor)
Deploy Omid
Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table
Start Omid TSO
- omid.sh tso
Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT
Instantiate HBase Metastore
Instantiate HBase tables from scratch
- hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport
TPCDS queries
[Figure: Query Plan Time for TPCDS queries (Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76; vertical axis 0 to 6000), comparing HBaseStore, HBaseStore+Omid, and ObjectStore]
1824 partitions: sweet spot for ObjectStore
Average speed-up for all TPCDS queries
- 2.19 (without Omid)
- 2.12 (with Omid)
Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity: almost
- Minor holes: event notification/version/constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported
Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet
Future Work - ACID
Transaction metadata is stored in the Metastore
- Locks
- Txns
- Compactions
Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory
Future Work - HA via HBase Coprocessor
Two new server components
- Omid TSO Server
- Transaction Server
All servers need HA
- Management headache
Automatic HA through HBase Coprocessor
Diagram: each Region Server hosts a TSO Server via CoProcessor
Future Work - Other
Stats Aggregation
- Coprocessor
Improving ObjectCache
- Rudimentary implementation currently
- LRU
Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoids context switches
- Might be an issue for small systems
Thank You
11 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writesreads values tofrom various tables in
RDBMS using appropriate foreign key references bull RDBMS fastpath enabled by not using ORM and writing direct SQL However
complicates testing matrix as there may be slight variations in SQL semantics for different RDBMS databases
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
12 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writes reads values to from various tables in
RDBMS using appropriate foreign key references
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
TBLS
TBL_PRIVS
TBL_COL_PRIVS
PART_PRIVS
SDS
CDS
SORT_ORDER
SERDES
TYPE_FIELDS
PARTITIONS
PARTITION_KEY_VALS
PARTITION_PARAMS
BUCKETING_COLS
SORT_COLS
SD_PARAMS
SKEWED_COL_NAMES
SKEWED_VALUES
TABLE_PARAMS
13 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog ldquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
HBase schema
14 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Table Name Key Column Families and Columns
Description
HBMS_GLOBAL_PRIVS bytes(ldquogprdquo) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES bytes(roleName) cf_catalog ldquorolesrdquo ldquorolesrdquo storeretrieve serialized Role proto
HBMS_USER_TO_ROLE bytes(userName) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized RoleList proto
HBMS_SECURITY bytes(delTokenId) cf_catalog ldquodtrdquo ldquomkrdquo ldquodtrdquo storeretrieve delegation token ldquomkrdquo master keys
HBMS_SEQUENCES bytes(sequence) cf_catalog ldquocrdquo ldquocrdquo storeretrieve sequences
HBase schema
15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
De-normalization
bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)
Key Value
bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)
bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)
bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)
bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)
HBMS_USER_TO_ROLE
bull Need to scan amp de-serialize everything in order to drop a role
16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Partition Keys
Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo
Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value
date state
201601 CA
201601 WA
201602 CA
201603 CA
201605 CA
17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Typed Partition Keys
Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)
ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08
Using BinarySortableSerDendash Support all Hive data typesndash Handles null
(String Integer) Bytes
lsquoA10rsquo 3 41 31 30 00 00 00 00 03
lsquoA10rsquo 10 41 31 30 00 00 00 00 0A
lsquoA5rsquo 4 41 35 00 00 00 00 04
lsquoA5rsquo 15 41 35 00 00 00 00 0D
18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
struct StorageDescriptor
1 listltFieldSchemagt cols
2 string location
3 string inputFormat
4 string outputFormat
5 bool compressed
6 i32 numBuckets
7 SerDeInfo serdeInfo
8 listltstringgt bucketCols
9 listltOrdergt sortCols
10 mapltstring stringgt parameters
11 optional SkewedInfo skewedInfo
12 optional bool storedAsSubDirectories
20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
message StorageDescriptor
message Order hellip
message SerDeInfo hellip
message SkewedInfo hellip
repeated FieldSchema cols = 1
optional string input_format = 2
optional string output_format = 3
optional bool is_compressed = 4
optional sint32 num_buckets = 5
optional SerDeInfo serde_info = 6
repeated string bucket_cols = 7
repeated Order sort_cols = 8
optional SkewedInfo skewed_info = 9
optional bool stored_as_sub_directories = 10
21 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
22 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
HBMS_
PARTITIONS
HBMS_
SDS
23 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching
Aggregate Statsbull Location - on HBasebull Compile time
File Footers bull Location - on HBasebull Runtime - accessed from tasks
Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time
25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching Aggregate Stats
ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo
bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS
bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto
bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client
side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats
bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to
removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp
removes them from the cache
26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching File Footers
bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner
thread
bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic
27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
HBaseMetaStore Needs Transaction
Atomicity is required
– Create table/partition also creates the storage descriptor
– Alter table also alters partitions
– Drop table also drops table/column privileges
HBase doesn't support transactions
– Doesn't support cross-row transactions
HBaseConnection
– Supports different transaction managers in theory
– VanillaHBaseConnection: no transaction
Omid
Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
– First release this Monday
Snapshot isolation
– Natural, as HBase is a versioned database
– No locking, no deadlock, no blocking for both reads and writes
– If two concurrent transactions write to the same data, the later one aborts
Low overhead
Omid Components
TSO Server (Timestamp Oracle)
– Generates transid
– Tracks the status of transactions
TSO Client
– Talks to TSO
– Caches transaction metadata
– Most reads don't need to talk to TSO
Compactor
– Runs as HBase Coprocessor
– Removes stale cell versions
[Diagram: Client, TSO, and HBase with the Compactor]
Omid Operations
Open transaction
– Get transid from TSO
Read a cell
– Read all versions of the cell from HBase
– Return the latest version committed before the transaction started
Write a cell
– Write the value, versioned with transid, to HBase
Commit
– Get commitid from TSO
– TSO determines whether there is a conflict using transaction metadata
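The four operations above can be mimicked with a small in-memory model. This is a toy sketch of snapshot isolation with write-write conflict detection, not Omid's API: `ToyTSO`, `ToyStore`, and their methods are names invented for the example, and a single counter stands in for the timestamp oracle.

```python
class ToyTSO:
    """In-memory stand-in for a timestamp oracle (hypothetical, not Omid's API)."""

    def __init__(self):
        self.clock = 0
        self.last_commit = {}  # row -> commit timestamp of its last committed writer
        self.committed = {}    # start timestamp -> commit timestamp

    def begin(self):
        self.clock += 1
        return self.clock

    def commit(self, start_ts, write_set):
        # Conflict: a row we wrote was committed by someone else after we started.
        if any(self.last_commit.get(row, 0) > start_ts for row in write_set):
            return None  # abort
        self.clock += 1
        commit_ts = self.clock
        for row in write_set:
            self.last_commit[row] = commit_ts
        self.committed[start_ts] = commit_ts
        return commit_ts


class ToyStore:
    """Versioned cells: every write is stored under the writer's start timestamp."""

    def __init__(self, tso):
        self.tso = tso
        self.cells = {}  # row -> {writer start timestamp: value}

    def write(self, start_ts, row, value):
        self.cells.setdefault(row, {})[start_ts] = value

    def read(self, start_ts, row):
        versions = self.cells.get(row, {})
        if start_ts in versions:
            return versions[start_ts]  # a transaction sees its own write
        best = None
        for writer_ts, value in versions.items():
            commit_ts = self.tso.committed.get(writer_ts)
            # Visible only if committed before this transaction started.
            if commit_ts is not None and commit_ts < start_ts:
                if best is None or commit_ts > best[0]:
                    best = (commit_ts, value)
        return best[1] if best else None
```

With two transactions that both write `row1`, the first to commit wins and the second gets `None` back from `commit`, matching the "later one aborts" rule of snapshot isolation.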
Omid Data Structure
Memory management in TSO
– Never runs OOM: aborts old transactions
[Diagram: TSO in-memory state]
lastCommit: row1 T20; row2 T25; row5 T22
committed: T10 T20; T4 T25; T11 T30; T2 … …
aborted
• Detect transaction conflict at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client
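One way to keep a conflict table bounded while staying safe is to remember a low-water mark and abort any transaction that is too old to be checked. The sketch below assumes that approach; `BoundedConflictMap`, its capacity, and the eviction order are illustrative assumptions, not Omid's actual data structure.

```python
from collections import OrderedDict


class BoundedConflictMap:
    """Fixed-size row -> last-commit-timestamp table (hypothetical sketch).

    When an entry is evicted, its timestamp raises the low-water mark.
    A transaction that started before the mark and touches an unknown row
    is aborted, because a conflict can no longer be ruled out.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.last_commit = OrderedDict()  # oldest insertion first
        self.low_watermark = 0

    def record(self, row, commit_ts):
        self.last_commit.pop(row, None)
        self.last_commit[row] = commit_ts
        while len(self.last_commit) > self.capacity:
            _, evicted_ts = self.last_commit.popitem(last=False)
            self.low_watermark = max(self.low_watermark, evicted_ts)

    def can_commit(self, start_ts, write_set):
        for row in write_set:
            commit_ts = self.last_commit.get(row)
            if commit_ts is None:
                if start_ts < self.low_watermark:
                    return False  # too old to verify: abort conservatively
            elif commit_ts > start_ts:
                return False      # genuine write-write conflict
        return True
```

The trade-off is that long-running transactions may be aborted without a real conflict, in exchange for a memory footprint that never grows past `capacity` rows.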
Transaction Conflict
Two concurrent DDLs write to the same data
– Proper retry logic
Task node writes - ORC footer cache
– High chance of write conflict
– Row mutation is atomic in HBase
– Cross-row atomicity is not required
– Bypass the transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
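The "proper retry logic" for conflicting DDLs could look something like the following. This is a generic sketch: `TransactionAborted`, `run_with_retry`, the attempt limit, and the backoff schedule are all assumptions of the example, not taken from Hive or Omid.

```python
import random
import time


class TransactionAborted(Exception):
    """Raised by the (hypothetical) transaction layer on a write-write conflict."""


def run_with_retry(txn_body, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Run txn_body(), retrying with jittered exponential backoff on abort.

    txn_body must be safe to re-run from scratch, since an aborted
    transaction leaves no committed effects behind.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_body()
        except TransactionAborted:
            if attempt == max_attempts:
                raise
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```

Footer-cache writes skip this path entirely: they are single-row puts, so there is nothing to retry.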
Deployment
Server side components in HBase
– Server side filter
– Omid compactor
– Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
– hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore, TSO, and HBase with the Server Side Filter and Omid Compactor]
Deploy Omid
Create Omid tables in HBase
– omid.sh create-hbase-commit-table
– omid.sh create-hbase-timestamp-table
Start Omid TSO
– omid.sh tso
Related config in hive-site.xml
– hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
– tso.host=localhost
– tso.port=54758
– omid.client.connectionType=DIRECT
Instantiate HBase Metastore
Instantiate HBase tables from scratch
– hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
– One-way import from ObjectStore to HBaseStore
– hive --service hbaseimport
TPCDS queries
[Chart: Query Plan Time for TPCDS queries; series: HBaseStore, HBaseStore+Omid, ObjectStore; x-axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76; y-axis: 0 to 6000]
1824 partitions
Sweet spot for ObjectStore
Average speedup for all TPCDS queries
– 219 (without Omid)
– 212 (with Omid)
Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
– Minor holes: event notification/version/constraints
– Deprecate listTableNamesByFilter/listPartitionNamesByFilter
– Tools enhancement
– ACID is not supported
Runs most e2e queries
Fixing unit tests
– TestMiniTezCliDriver: all pass
– TestCliDriver: HIVE-14097 pending review
– Not production quality yet
Future Work - ACID
Transaction metadata is stored in Metastore
– Locks
– Txns
– Compactions
Data structure is harder to de-normalize
New work: transaction server
– Keep lock and transaction tree in memory
Future Work – HA via HBase Coprocessor
Two new server components
– Omid TSO Server
– Transaction Server
All servers need HA
– Management headache
Automatic HA through HBase Coprocessor
[Diagram: two Region Servers, each with a TSO Server via CoProcessor]
Future Work – Other
Stats Aggregation
– Coprocessor
Improving ObjectCache
– Rudimentary implementation currently
– LRU
Omid consuming high CPU
– 300% CPU always, by design
– High throughput, avoids context switches
– Might be an issue for small systems
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New Bottleneck - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work – HA via HBase Coprocessor
- Future Work – Other
- Thank You
System Architecture
[Diagram: HiveMetaStore Thrift Client; HiveMetaStore Thrift Server; ObjectStore and HBaseStore; RDBMS and HBase; Omid]
• Two implementations of the RawStore interface
  – HBaseStore
  – ObjectStore
• Both backends will live together for a while
• HBaseStore
  – Most traffic will go through the transaction layer (Omid)
  – Some traffic will bypass the transaction layer
    · Volatile data
    · High possibility of conflict
RDBMS schema
Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server extracts values from the Thrift objects and creates corresponding ORM model objects
• ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using appropriate foreign key references
• RDBMS fastpath is enabled by bypassing the ORM and writing direct SQL. However, this complicates the testing matrix, as there may be slight variations in SQL semantics across RDBMS databases
Example: adding a new partition, "add_partition(Partition new_part)"
struct Partition {
  1: list<string> values
  2: string dbName
  3: string tableName
  4: i32 createTime
  5: i32 lastAccessTime
  6: StorageDescriptor sd
  7: map<string, string> parameters
  8: optional PrincipalPrivilegeSet privileges
}
RDBMS tables (diagram): TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS
HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the Table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, …, partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the Partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, …, partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": Metadata footer proto; "s": PPD stats
HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences
De-normalization
• Goal
  – Optimized for querying
  – May be slower in DDL
• Example: drop_role(String roleName)
HBMS_USER_TO_ROLE
Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)
• Need to scan and de-serialize everything in order to drop a role
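The cost of dropping a role in the de-normalized layout can be seen in a small sketch. A dict stands in for HBMS_USER_TO_ROLE and JSON stands in for the RoleList proto; both substitutions, and the `drop_role` helper itself, are assumptions made for illustration.

```python
import json


def drop_role(user_to_role, role_name):
    """Remove role_name from every user's serialized role list.

    Returns the number of rows rewritten. Every row is visited and
    de-serialized, because the role is not part of any row key.
    """
    rewritten = 0
    for user_key, blob in list(user_to_role.items()):  # full table scan
        roles = json.loads(blob)                       # de-serialize each value
        if role_name in roles:
            roles.remove(role_name)
            user_to_role[user_key] = json.dumps(roles)
            rewritten += 1
    return rewritten
```

Looking up one user's roles stays a single-row read, which is the query path this layout is optimized for.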
Partition Keys
Range scan for most queries
– Where date = '201601' and state = 'CA'
– Where date >= '201602' and date < '201604'
Server side filter for the rest
– Where state = 'CA' (not prefix key)
– Where date like '2016' (regex)
– Where date > '201601' and state > 'OR' (cannot be range scan)
– Scan all keys, but do not deserialize values
date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
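The difference between the two access paths can be sketched with a sorted list of composite keys. This is an illustration only: the NUL separator between key parts and the helper names `make_key`, `range_scan`, and `filtered_scan` are assumptions, and a Python list stands in for the HBase table.

```python
import bisect


def make_key(date, state):
    """Composite row key; the NUL separator is an assumption of this sketch."""
    return f"{date}\x00{state}".encode("ascii")


def range_scan(sorted_keys, start_row, stop_row):
    """Keys in [start_row, stop_row): only the matching slice is touched."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


def filtered_scan(sorted_keys, predicate):
    """Fallback: visit every key and test it, without reading any values."""
    return [key for key in sorted_keys if predicate(key)]


KEYS = sorted(
    make_key(d, s)
    for d, s in [("201601", "CA"), ("201601", "WA"), ("201602", "CA"),
                 ("201603", "CA"), ("201605", "CA")]
)

# date >= '201602' and date < '201604': a prefix condition, so a range scan works.
q1 = range_scan(KEYS, b"201602", b"201604")

# state = 'CA': not a key prefix, so every key has to be inspected.
q2 = filtered_scan(KEYS, lambda key: key.split(b"\x00")[1] == b"CA")
```

A predicate on the leading key column narrows the byte range directly; a predicate on a trailing column alone cannot, which is why it falls back to the filter.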
Typed Partition Keys
Binary sorted
– HBase range scan: Scan(byte[] startRow, byte[] stopRow)
– Where key1 >= 'A5' and key2 >= 8
  • startRow: 41 35 00 00 00 00 08
Using BinarySortableSerDe
– Supports all Hive data types
– Handles null
(String, Integer) | Bytes
'A10', 3 | 41 31 30 00 00 00 00 03
'A10', 10 | 41 31 30 00 00 00 00 0A
'A5', 4 | 41 35 00 00 00 00 04
'A5', 15 | 41 35 00 00 00 00 0F
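The byte layout in this table can be reproduced with a simplified encoder: the string bytes, a 00 terminator, then the integer as 4 big-endian bytes. This only mirrors the layout shown on the slide; it is not the actual BinarySortableSerDe format, and `encode_key` and `hex_bytes` are names made up for the example. It also handles non-negative integers only.

```python
import struct


def encode_key(text, number):
    """Encode (string, non-negative int) so that byte order matches key order.

    Simplified layout from the slide: ASCII bytes, 0x00 terminator,
    then the integer as 4 big-endian bytes.
    """
    return text.encode("ascii") + b"\x00" + struct.pack(">I", number)


def hex_bytes(raw):
    return " ".join(f"{byte:02X}" for byte in raw)


start_row = encode_key("A5", 8)

ordered = sorted(
    [("A5", 4), ("A10", 10), ("A10", 3)],
    key=lambda pair: encode_key(*pair),
)
```

Fixed-width big-endian integers are what make 3 sort before 10 at the byte level; encoding the numbers as text would put "10" before "3".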
Storage Descriptor de-duplication
[Table: the HBase schema table from the earlier "HBase schema" slide, repeated here; see the HBMS_SDS row: key bytes(md5(SD proto)), columns "c" (StorageDescriptor proto) and "ref" (reference count)]
struct Partition {
  1: list<string> values
  2: string dbName
  3: string tableName
  4: i32 createTime
  5: i32 lastAccessTime
  6: StorageDescriptor sd
  7: map<string, string> parameters
  8: optional PrincipalPrivilegeSet privileges
}
struct StorageDescriptor {
  1: list<FieldSchema> cols
  2: string location
  3: string inputFormat
  4: string outputFormat
  5: bool compressed
  6: i32 numBuckets
  7: SerDeInfo serdeInfo
  8: list<string> bucketCols
  9: list<Order> sortCols
  10: map<string, string> parameters
  11: optional SkewedInfo skewedInfo
  12: optional bool storedAsSubDirectories
}
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
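The effect of moving location and sd_parameters into the Partition message and keying the rest by hash can be sketched as below. This is a toy model: `SdStore`, `add_partition`, and `drop_partition` are hypothetical names, dicts stand in for HBMS_SDS and HBMS_PARTITIONS, and JSON stands in for the protobuf serialization.

```python
import hashlib
import json


class SdStore:
    """Toy stand-in for HBMS_SDS: md5(serialized SD) -> [serialized SD, ref count]."""

    def __init__(self):
        self.rows = {}

    def put(self, shared_sd):
        blob = json.dumps(shared_sd, sort_keys=True).encode("utf-8")
        sd_hash = hashlib.md5(blob).digest()
        row = self.rows.setdefault(sd_hash, [blob, 0])
        row[1] += 1  # one more partition or table references this descriptor
        return sd_hash

    def release(self, sd_hash):
        row = self.rows[sd_hash]
        row[1] -= 1
        if row[1] == 0:
            del self.rows[sd_hash]  # last reference gone


def add_partition(sd_store, partitions, part_key, sd):
    """Split per-partition fields from the shareable descriptor, then store both."""
    shared = {k: v for k, v in sd.items() if k not in ("location", "parameters")}
    partitions[part_key] = {
        "location": sd.get("location"),
        "sd_parameters": sd.get("parameters", {}),
        "sd_hash": sd_store.put(shared),
    }


def drop_partition(sd_store, partitions, part_key):
    sd_store.release(partitions.pop(part_key)["sd_hash"])
```

Partitions of one table usually differ only in location, so in this model thousands of partitions can share a single descriptor row.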
HBase schema
Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes the Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from the Thrift objects and converts them to corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition, "add_partition(Partition new_part)"
[Diagram: Thrift struct Partition and protobuf message Partition, with HBMS_PARTITIONS and HBMS_SDS]
Caching
Aggregate Stats
• Location - on HBase
• Compile time
File Footers
• Location - on HBase
• Runtime - accessed from tasks
Tables, Partitions, Storage Descriptors
• Location - on Metastore server(s)
• Compile time
11 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writesreads values tofrom various tables in
RDBMS using appropriate foreign key references bull RDBMS fastpath enabled by not using ORM and writing direct SQL However
complicates testing matrix as there may be slight variations in SQL semantics for different RDBMS databases
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
12 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
RDBMS schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server extracts values from Thrift objects and creates corresponding ORM model
objects bull ORM opens transaction on RDBMS and writes reads values to from various tables in
RDBMS using appropriate foreign key references
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
TBLS
TBL_PRIVS
TBL_COL_PRIVS
PART_PRIVS
SDS
CDS
SORT_ORDER
SERDES
TYPE_FIELDS
PARTITIONS
PARTITION_KEY_VALS
PARTITION_PARAMS
BUCKETING_COLS
SORT_COLS
SD_PARAMS
SKEWED_COL_NAMES
SKEWED_VALUES
TABLE_PARAMS
13 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog ldquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
HBase schema
14 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Table Name Key Column Families and Columns
Description
HBMS_GLOBAL_PRIVS bytes(ldquogprdquo) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES bytes(roleName) cf_catalog ldquorolesrdquo ldquorolesrdquo storeretrieve serialized Role proto
HBMS_USER_TO_ROLE bytes(userName) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized RoleList proto
HBMS_SECURITY bytes(delTokenId) cf_catalog ldquodtrdquo ldquomkrdquo ldquodtrdquo storeretrieve delegation token ldquomkrdquo master keys
HBMS_SEQUENCES bytes(sequence) cf_catalog ldquocrdquo ldquocrdquo storeretrieve sequences
HBase schema
15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
De-normalization
bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)
Key Value
bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)
bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)
bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)
bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)
HBMS_USER_TO_ROLE
bull Need to scan amp de-serialize everything in order to drop a role
16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Partition Keys
Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo
Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value
date state
201601 CA
201601 WA
201602 CA
201603 CA
201605 CA
17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Typed Partition Keys
Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)
ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08
Using BinarySortableSerDendash Support all Hive data typesndash Handles null
(String Integer) Bytes
lsquoA10rsquo 3 41 31 30 00 00 00 00 03
lsquoA10rsquo 10 41 31 30 00 00 00 00 0A
lsquoA5rsquo 4 41 35 00 00 00 00 04
lsquoA5rsquo 15 41 35 00 00 00 00 0D

Storage Descriptor de-duplication
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats

Storage Descriptor de-duplication
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}

Storage Descriptor de-duplication
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
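The de-duplication scheme (row key is the MD5 of the serialized storage descriptor, plus a reference count) can be sketched as below. This is an illustrative model rather than the HBaseStore implementation: a dict stands in for HBMS_SDS, JSON stands in for the StorageDescriptor proto, and the class and method names are hypothetical.

```python
import hashlib
import json


class SdStore:
    """In-memory stand-in for HBMS_SDS: md5(SD) -> [serialized SD, ref count]."""

    def __init__(self):
        self.rows = {}

    def put(self, sd):
        # sort_keys makes equal descriptors serialize to identical bytes.
        blob = json.dumps(sd, sort_keys=True).encode("utf-8")
        key = hashlib.md5(blob).digest()
        if key in self.rows:
            self.rows[key][1] += 1      # shared descriptor: bump "ref"
        else:
            self.rows[key] = [blob, 1]  # first user of this descriptor
        return key                      # stored by the caller as sd_hash

    def release(self, key):
        row = self.rows[key]
        row[1] -= 1
        if row[1] == 0:
            del self.rows[key]          # last reference dropped

    def ref_count(self, key):
        return self.rows[key][1] if key in self.rows else 0
```

Two partitions with the same columns and formats then share one stored descriptor, while per-partition fields such as location stay in the Partition message and so do not break the sharing.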

HBase schema
Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition, "add_partition(Partition new_part)"

HBase schema
Example: adding a new partition, "add_partition(Partition new_part)"
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
[Diagram: tables HBMS_PARTITIONS and HBMS_SDS]

Agenda
Motivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work

Caching
Aggregate Stats
• Location: on HBase
• Compile time
File Footers
• Location: on HBase
• Runtime: accessed from tasks
Tables, Partitions, Storage Descriptors
• Location: on Metastore server(s)
• Compile time

Caching Aggregate Stats
"get_aggr_stats_for(dbName, tblName, partNames, colNames)"
• Gets aggregated stats for columns in each partition; an expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache; AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  • AggrStatsBloomFilter is created on the partitions contained in AggrStats
• Invalidation
  • TTL expiry: nodes evicted from cache
  • Alter partition, Drop partition, Analyze etc. add an invalidation request to a queue
  • Invalidator thread picks an invalidation request and executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
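The bloom-filter-based invalidation can be sketched as follows. This is a toy model under stated assumptions, not the metastore code: the cache is a Python dict rather than an HBase table, the bloom filter is a minimal hand-rolled one, and names such as `AggrStatsCache` and `invalidate_partition` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def cache_key(db, tbl, part_names, col):
    joined = "\x01".join([db, tbl, *sorted(part_names), col])
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


class AggrStatsCache:
    def __init__(self):
        self.entries = {}  # key -> (aggregated stats, bloom filter of partitions)

    def put(self, db, tbl, part_names, col, stats):
        bloom = BloomFilter()
        for name in part_names:
            bloom.add(name)
        key = cache_key(db, tbl, part_names, col)
        self.entries[key] = (stats, bloom)
        return key

    def invalidate_partition(self, part_name):
        """Drop every entry whose bloom filter may contain the partition."""
        stale = [k for k, (_, bloom) in self.entries.items()
                 if bloom.might_contain(part_name)]
        for key in stale:
            del self.entries[key]
        return len(stale)
```

A false positive only evicts an entry that could have stayed, so the cache remains correct; the filter avoids storing or scanning the full partition list per entry.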

Caching File Footers
• ORC footer cache
  • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  • Read from AM for split generation (avoids reading lots of HDFS files for split generation)
  • Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skip transaction
  • High overhead
  • Transaction conflict
  • Row mutation is already atomic

HBaseMetaStore Needs Transaction
Atomicity is required
- Create table/partition: also create storage descriptor
- Alter table: also alter partitions
- Drop table: also drop table column privileges
HBase doesn't support transactions
- Doesn't support cross-row transactions
HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transaction

Omid
Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday
Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both read and write
- If two concurrent transactions write to the same data, the later one aborts
Low overhead

Omid Components
TSO Server (Timestamp Oracle)
- Generates transid
- Tracks status of transactions
TSO Client
- Talks to TSO
- Caches transaction metadata
- Most reads don't need to talk to TSO
Compactor
- Runs as HBase Coprocessor
- Removes stale cell versions
[Diagram: Client, TSO, HBase, Compactor]

Omid Operations
Open transaction
- Get transid from TSO
Read a cell
- Read all versions of the cell from HBase
- Read latest committed version before transaction start
Write a cell
- Write value versioned with transid to HBase
Commit
- Generate commitid from TSO
- TSO figures out if there is a conflict using transaction metadata
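The four operations can be modelled in a few lines. This is a toy, single-process sketch of snapshot isolation with commit-time conflict detection, written only to illustrate the steps above; it is not Omid's API, and `MiniTso`, `VersionedStore` and their methods are hypothetical names.

```python
class MiniTso:
    """Toy timestamp oracle: hands out timestamps and decides commits."""

    def __init__(self):
        self.clock = 0
        self.last_commit = {}  # row -> commit timestamp of its last writer
        self.committed = {}    # start timestamp -> commit timestamp

    def _next(self):
        self.clock += 1
        return self.clock

    def begin(self):
        return self._next()    # "get transid from TSO"

    def commit(self, start_ts, write_set):
        # Conflict: some row was committed by another writer after we started.
        for row in write_set:
            if self.last_commit.get(row, 0) > start_ts:
                return None    # abort
        commit_ts = self._next()
        for row in write_set:
            self.last_commit[row] = commit_ts
        self.committed[start_ts] = commit_ts
        return commit_ts


class VersionedStore:
    """Toy versioned cell store: every write is kept, tagged with its transid."""

    def __init__(self, tso):
        self.tso = tso
        self.cells = {}        # row -> list of (writer start ts, value)

    def write(self, start_ts, row, value):
        self.cells.setdefault(row, []).append((start_ts, value))

    def read(self, start_ts, row):
        # Latest version whose writer committed before this transaction started.
        visible = []
        for writer_ts, value in self.cells.get(row, []):
            commit_ts = self.tso.committed.get(writer_ts)
            if commit_ts is not None and commit_ts < start_ts:
                visible.append((commit_ts, value))
        return max(visible, key=lambda v: v[0])[1] if visible else None
```

In this model, two transactions that start concurrently and write the same row behave as the Omid slide describes: the first commit succeeds and the later one gets `None` back and must abort.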

Omid Data Structure
Memory management in TSO
- Never runs OOM; aborts old transactions
[Diagram: data structures in the TSO]
lastCommit: row1 | T20; row2 | T25; row5 | T22
committed: T10 | T20; T4 | T25; T11 | T30
aborted: T2 … …
• Detect transaction conflict at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client

Transaction Conflict
Two concurrent DDLs write to the same data
- Proper retry logic
Task node writes: ORC footer cache
- High chance for write conflict
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
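The "proper retry logic" for conflicting DDL can be sketched generically. The slides do not show how Hive implements it, so this is only an assumed shape: `ConflictError` and `run_with_retry` are hypothetical names, and the operation is expected to re-run the whole transaction on each attempt.

```python
class ConflictError(Exception):
    """Raised by an operation whose transaction was aborted on conflict."""


def run_with_retry(operation, max_attempts=3):
    """Call operation(attempt) until it succeeds or attempts run out.

    Each attempt should open a fresh transaction, so it sees the data
    committed by the transaction that won the previous conflict.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(attempt)
        except ConflictError:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller


def flaky_ddl(attempt):
    # Stand-in for a DDL call that loses the first two conflicts.
    if attempt < 3:
        raise ConflictError("write-write conflict")
    return f"committed on attempt {attempt}"
```

Footer-cache writes skip this path entirely: each put touches a single row, so HBase's per-row atomicity is enough and no transaction is opened.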

Deployment
Server side components in HBase
- Server side filter
- Omid compactor
- Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
- hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore, TSO, HBase with Server Side Filter and Omid Compactor]

Deploy Omid
Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table
Start Omid TSO
- omid.sh tso
Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT

Instantiate HBase Metastore
Instantiate HBase tables from scratch
- hive --service hbaseschematool --install
Hbaseimport: import existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport

TPCDS queries
[Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. X axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y axis: 0 to 6000]
1824 partitions
Sweet spot for ObjectStore
Average speed-up for all TPCDS queries
- 219 (without Omid)
- 212 (with Omid)

Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
- Minor holes: event notification, version, constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported
Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet

Future Work - ACID
Transaction metadata is stored in Metastore
- Locks
- Txns
- Compactions
Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory

Future Work - HA via HBase Coprocessor
Two new server components
- Omid TSO Server
- Transaction Server
All servers need HA
- Management headache
Automatic HA through HBase Coprocessor
[Diagram: two Region Servers, each labeled "TSO Server via CoProcessor"]

Future Work - Other
Stats Aggregation
- Coprocessor
Improving ObjectCache
- Rudimentary implementation currently
- LRU
Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoids context switches
- Might be an issue for small systems
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New Bottleneck - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work - HA via HBase Coprocessor
- Future Work - Other
- Thank You

RDBMS schema
Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server extracts values from Thrift objects and creates corresponding ORM model objects
• ORM opens a transaction on the RDBMS and writes/reads values to/from various tables in the RDBMS using appropriate foreign key references
Example: adding a new partition, "add_partition(Partition new_part)"
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
[Diagram: RDBMS tables TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS]

HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats

HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences
15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
De-normalization
bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)
Key Value
bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)
bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)
bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)
bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)
HBMS_USER_TO_ROLE
bull Need to scan amp de-serialize everything in order to drop a role
16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Partition Keys
Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo
Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value
date state
201601 CA
201601 WA
201602 CA
201603 CA
201605 CA
17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Typed Partition Keys
Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)
ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08
Using BinarySortableSerDendash Support all Hive data typesndash Handles null
(String Integer) Bytes
lsquoA10rsquo 3 41 31 30 00 00 00 00 03
lsquoA10rsquo 10 41 31 30 00 00 00 00 0A
lsquoA5rsquo 4 41 35 00 00 00 00 04
lsquoA5rsquo 15 41 35 00 00 00 00 0D
18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
struct StorageDescriptor
1 listltFieldSchemagt cols
2 string location
3 string inputFormat
4 string outputFormat
5 bool compressed
6 i32 numBuckets
7 SerDeInfo serdeInfo
8 listltstringgt bucketCols
9 listltOrdergt sortCols
10 mapltstring stringgt parameters
11 optional SkewedInfo skewedInfo
12 optional bool storedAsSubDirectories
20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
message StorageDescriptor
message Order hellip
message SerDeInfo hellip
message SkewedInfo hellip
repeated FieldSchema cols = 1
optional string input_format = 2
optional string output_format = 3
optional bool is_compressed = 4
optional sint32 num_buckets = 5
optional SerDeInfo serde_info = 6
repeated string bucket_cols = 7
repeated Order sort_cols = 8
optional SkewedInfo skewed_info = 9
optional bool stored_as_sub_directories = 10
21 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
22 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
HBMS_
PARTITIONS
HBMS_
SDS
23 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching
Aggregate Statsbull Location - on HBasebull Compile time
File Footers bull Location - on HBasebull Runtime - accessed from tasks
Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time
25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching Aggregate Stats
ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo
bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS
bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto
bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client
side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats
bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to
removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp
removes them from the cache
26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching File Footers
bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner
thread
bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic
27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBaseMetaStore Needs Transaction
Atomic is requiredndash Create table partition also create storage descriptorndash Alter table also alter partitionsndash Drop table also drop table column privilege
HBase donrsquot support transactionndash Donrsquot support cross-row transactions
HBaseConnectionndash Support different transaction manager in theoryndash VanillaHBaseConnection no transaction
29 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid
Transaction layer on top of Hbase Initially developed by Yahoo Apache incubator project
ndash First release this Monday
Snapshot isolationndash Natural as HBase is a versioned databasendash No locking no dead lock no blocking for both read and writendash Two concurrent transaction write to the same data the later one aborts
Low overhead
30 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid Components
TSO Server (Timestamp Oracle)ndash Generate transidndash Status of transaction
TSO Clientndash Talk to TSOndash Cache transaction metadatandash Most read donrsquot need to talk to TSO
Compactorndash Run as HBase Coprocessorndash Remove stale cell versions
HBaseCompactor
Client
TSO
31 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid Operations
Open transactionndash Get transid from TSO
Read a cellndash Read all versions of the cell from HBasendash Read latest committed version before transaction start
Write a cellndash Write value versioned with transid to HBase
Commitndash Generate commitid from TSOndash TSO figure out if there is conflict using transaction metadata
32 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid Data Structure
Memory management in TSOndash Never run OOM abort old transactions
TSO
row1 T20
row2 T25
row5 T22
lastCommit committedT10 T20
T4 T25
T11 T30
T2 hellip hellip
aborted
bull Detect transaction conflict at commit time
bull Largest trunk of memory
bull Construct snapshot at read time
bull Partially replicated to client
33 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Transaction Conflict
Two concurrent DDL write to the same datandash Proper retry logic
Task node writes - ORC footer cache
ndash High chance for write conflictndash Row mutation is atomic in Hbasendash Cross row atomic is not requiredndash Bypass transaction layer
public void putFileMetadata(ListltLonggt fileIds ListltByteBuffergt metadata FileMetadataExprType type)
34 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
35 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Deployment
Server side components in HBasendash Server side filterndash Omid compactorndash Copy related hive jars into hbase hive-commonjar hive-metastorejar hive-serde-jar
New config in hive-sitexmlndash hivemetastorerawstoreimpl orgapachehadoophivemetastorehbaseHBaseStore
Server Side Filter
Omid Compactor
HBase
TSO
Hive MetaStore
36 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Deploy Omid
Create Omid Tables in HBasendash omidsh create-hbase-commit-tablendash omidsh create-hbase-timestamp-table
Start Omid TSOndash omidsh tso
Related config in hive-sitexmlndash hivemetastorehbaseconnectionclass=orgapachehadoophivemetastorehbaseOmid
HBaseConnectionndash tsohost=localhostndash tsoport=54758ndash omidclientconnectionType=DIRECT
37 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Instantiate HBase Metastore
Instantiate Hbase Tables from scratchndash hive --service hbaseschematool --install
Hbaseimport import existing Hive Metastorendash One way import from ObjectStore to HBaseStorendash hive --service hbaseimport
38 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
TPCDS queries
Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760
1000
2000
3000
4000
5000
6000
Query Plan Time for TPCDS queries
HBaseStore HBaseStore+Omid ObjectStore
1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries
ndash 219 (without Omid)ndash 212 (With Omid)
40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
- Minor holes: event notification / version / constraints
- Deprecate listTableNamesByFilter / listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported
Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet
Future Work - ACID
Transaction metadata is stored in the Metastore
- Locks
- Txns
- Compactions
Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory
Future Work - HA via HBase Coprocessor
Two new server components
- Omid TSO Server
- Transaction Server
All servers need HA
- Management headache
Automatic HA through HBase Coprocessor
[Diagram: two Region Servers, each labeled "TSO Server via CoProcessor"]
Future Work - Other
Stats aggregation
- Coprocessor
Improving ObjectCache
- Rudimentary implementation currently
- LRU
Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoids context switches
- Might be an issue for small systems
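The ObjectCache bullet only names LRU, so the following is a minimal sketch of least-recently-used eviction in general, not the cache Hive actually ships. The class name LruObjectCache and the string keys are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache for illustration; not Hive's ObjectCache. */
class LruObjectCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruObjectCache(int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after every insertion; evict once the capacity is exceeded.
        return size() > capacity;
    }
}
```

A get() counts as a use, so a frequently read table or partition object stays cached while cold entries fall out first.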
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New Bottleneck - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work - HA via HBase Coprocessor
- Future Work - Other
- Thank You
HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Table proto; "s": Stats per column in the Table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Partition proto; "s": Stats per column in the Partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": Metadata footer proto; "s": PPD Stats
HBase schema
Table Name | Key | Column Families and Columns | Description
HBMS_GLOBAL_PRIVS | bytes("gp") | cf_catalog: "c" | "c": store/retrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES | bytes(roleName) | cf_catalog: "roles" | "roles": store/retrieve serialized Role proto
HBMS_USER_TO_ROLE | bytes(userName) | cf_catalog: "c" | "c": store/retrieve serialized RoleList proto
HBMS_SECURITY | bytes(delTokenId) | cf_catalog: "dt", "mk" | "dt": store/retrieve delegation token; "mk": master keys
HBMS_SEQUENCES | bytes(sequence) | cf_catalog: "c" | "c": store/retrieve sequences
De-normalization
Goal
- Optimized for querying
- May be slower in DDL
Example: drop_role(String roleName)
HBMS_USER_TO_ROLE
Key | Value
bytes("User 1") | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") | Proto(Role 1, Role 2)
bytes("User 3") | Proto(Role 4, Role 5)
bytes("User 4") | Proto(Role 2, Role 3)
- Need to scan and de-serialize everything in order to drop a role
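The trade-off in the drop_role example can be sketched with an in-memory map standing in for HBMS_USER_TO_ROLE. The class UserToRoleTable and its methods are hypothetical and use plain Java collections, not the HBase client or the real RoleList proto.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for HBMS_USER_TO_ROLE: userName -> list of role names. */
class UserToRoleTable {
    private final Map<String, List<String>> rows = new TreeMap<>();

    void grant(String user, String role) {
        rows.computeIfAbsent(user, u -> new ArrayList<>()).add(role);
    }

    /** Query path: a single-row lookup by key. */
    List<String> rolesOf(String user) {
        return rows.getOrDefault(user, List.of());
    }

    /** DDL path: every row is visited, and rewritten if it mentions the role. */
    int dropRole(String role) {
        int rowsTouched = 0;
        for (List<String> roles : rows.values()) {
            if (roles.remove(role)) {
                rowsTouched++;
            }
        }
        return rowsTouched;
    }
}
```

With the four users from the table above, dropping "Role 2" has to inspect all four rows and rewrite three of them, while looking up one user's roles stays a single get.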
Partition Keys
Range scan for most queries
- Where date = '201601' and state = 'CA'
- Where date >= '201602' and date < '201604'
Server-side filter for the rest
- Where state = 'CA' (not a prefix key)
- Where date like '2016' (regex)
- Where date > '201601' and state > 'OR' (cannot be a range scan)
- Scan all keys, but do not deserialize values
date | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
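The difference between a range scan and a filter can be sketched over a sorted set of keys. This is only an illustration: the class PartitionKeyScan and the "date/state" key layout are assumptions, and a TreeSet stands in for HBase's sorted row keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Sorted "date/state" strings standing in for HBMS_PARTITIONS row keys. */
class PartitionKeyScan {
    private final TreeSet<String> keys = new TreeSet<>();

    void add(String date, String state) {
        keys.add(date + "/" + state);
    }

    /** Range scan: only keys in [startDate, stopDate) are touched. */
    List<String> rangeScanByDate(String startDate, String stopDate) {
        return new ArrayList<>(keys.subSet(startDate, true, stopDate, false));
    }

    /** Filter: every key is inspected, because state is not a key prefix. */
    List<String> filterByState(String state) {
        List<String> matches = new ArrayList<>();
        for (String key : keys) {
            if (key.endsWith("/" + state)) {
                matches.add(key);
            }
        }
        return matches;
    }
}
```

The filter still only looks at keys, which mirrors the last bullet: all keys are scanned, but no values need to be deserialized.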
Typed Partition Keys
Binary sorted
- HBase range scan: Scan(byte[] startRow, byte[] stopRow)
- Where key1 >= 'A5' and key2 >= 8
  - startRow: 41 35 00 00 00 00 08
Using BinarySortableSerDe
- Supports all Hive data types
- Handles null
(String, Integer) | Bytes
'A10', 3 | 41 31 30 00 00 00 00 03
'A10', 10 | 41 31 30 00 00 00 00 0A
'A5', 4 | 41 35 00 00 00 00 04
'A5', 15 | 41 35 00 00 00 00 0F
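A minimal sketch of an order-preserving encoding that reproduces the byte layout in the table above: string bytes, a 0x00 terminator, then a 4-byte big-endian integer. It is a simplification and not Hive's BinarySortableSerDe; the class SortableKey is hypothetical, and it assumes non-negative integers and strings without embedded zero bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified sortable key encoding for illustration only. */
class SortableKey {
    /** Encodes (String, int) as UTF-8 bytes, a 0x00 terminator, and a big-endian int. */
    static byte[] encode(String s, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("sketch handles non-negative ints only");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] str = s.getBytes(StandardCharsets.UTF_8);
        out.write(str, 0, str.length);
        out.write(0x00);        // terminator, so a shorter string sorts before its extensions
        out.write(n >>> 24);    // write(int) keeps only the low 8 bits
        out.write(n >>> 16);
        out.write(n >>> 8);
        out.write(n);
        return out.toByteArray();
    }

    /** Unsigned byte-wise comparison, the order a sorted key-value store applies to row keys. */
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    /** Space-separated upper-case hex, matching the notation on the slide. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }
}
```

Because 'A10' sorts before 'A5' byte-wise (31 < 35), the rows in the table are ordered by their encoded bytes rather than by numeric intuition, and a startRow built with the same encoding selects the matching range.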
Storage Descriptor de-duplication
struct Partition {
  1: list<string> values
  2: string dbName
  3: string tableName
  4: i32 createTime
  5: i32 lastAccessTime
  6: StorageDescriptor sd
  7: map<string, string> parameters
  8: optional PrincipalPrivilegeSet privileges
}
struct StorageDescriptor {
  1: list<FieldSchema> cols
  2: string location
  3: string inputFormat
  4: string outputFormat
  5: bool compressed
  6: i32 numBuckets
  7: SerDeInfo serdeInfo
  8: list<string> bucketCols
  9: list<Order> sortCols
  10: map<string, string> parameters
  11: optional SkewedInfo skewedInfo
  12: optional bool storedAsSubDirectories
}
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
message StorageDescriptor {
  message Order { ... }
  message SerDeInfo { ... }
  message SkewedInfo { ... }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
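The sd_hash field and the "ref" column suggest a content-addressed store: identical storage descriptors share one row keyed by their MD5 hash. A minimal sketch under that assumption follows; the class StorageDescriptorStore is hypothetical, an in-memory map stands in for HBMS_SDS, and raw bytes stand in for the serialized StorageDescriptor proto.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for HBMS_SDS: md5(serialized SD) -> (bytes, reference count). */
class StorageDescriptorStore {
    private static final class Entry {
        final byte[] serialized;
        int refCount;

        Entry(byte[] serialized) {
            this.serialized = serialized;
        }
    }

    private final Map<String, Entry> rows = new HashMap<>();

    /** Stores the descriptor once per distinct content and returns its hash key. */
    String put(byte[] serializedSd) {
        String key = md5Hex(serializedSd);
        rows.computeIfAbsent(key, k -> new Entry(serializedSd)).refCount++;
        return key;
    }

    /** Drops one reference; the row is removed when nothing points at it any more. */
    void release(String key) {
        Entry e = rows.get(key);
        if (e != null && --e.refCount == 0) {
            rows.remove(key);
        }
    }

    int refCount(String key) {
        Entry e = rows.get(key);
        return e == null ? 0 : e.refCount;
    }

    int size() {
        return rows.size();
    }

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

In this sketch a table with thousands of partitions that all use the same columns and formats would hold one descriptor row with a high reference count, and each partition would carry only the hash.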
HBase schema
Read/Write path
- Thrift client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
- Thrift server passes Thrift objects to the HBase client open in the Thrift server
- HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
- Writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition, "add_partition(Partition new_part)"
[Diagram: the Thrift struct Partition and the protobuf message Partition listed earlier, with table labels HBMS_PARTITIONS and HBMS_SDS]
Caching
Aggregate Stats
- Location: on HBase
- Compile time
File Footers
- Location: on HBase
- Runtime: accessed from tasks
Tables, Partitions, Storage Descriptors
- Location: on Metastore server(s)
- Compile time
Caching Aggregate Stats
"get_aggr_stats_for(dbName, tblName, partNames, colNames)"
- Gets aggregated stats for columns in each partition: an expensive call
- Used in CBO, Stats Annotation, Stats Optimizer
- HBMS_AGGR_STATS
  - RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  - Columns: AggrStats proto and AggrStatsBloomFilter proto
- Lookup
  - A new entry is added for each key not found in the cache; AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  - An AggrStatsBloomFilter is created on the partitions contained in the AggrStats
- Invalidation
  - TTL expiry: nodes are evicted from the cache
  - Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  - An invalidator thread picks up the invalidation request and executes a filter on HBase to remove expired entries
  - Uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
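The bloom-filter-based invalidation can be sketched as follows. This is an illustration of the idea only: the class AggrStatsCache, the two-probe filter, its 1024-bit size, and the single long standing in for an AggrStats proto are all assumptions, and an in-memory map replaces HBMS_AGGR_STATS.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Toy aggregate-stats cache with per-entry Bloom filters over partition names. */
class AggrStatsCache {
    /** Tiny Bloom filter: two probes into 1024 bits (sizes chosen arbitrarily). */
    static final class Bloom {
        private static final int BITS = 1024;
        private final BitSet bits = new BitSet(BITS);

        void add(String s) {
            bits.set(index(s, 0));
            bits.set(index(s, 1));
        }

        boolean mightContain(String s) {
            return bits.get(index(s, 0)) && bits.get(index(s, 1));
        }

        private static int index(String s, int seed) {
            return Math.floorMod(s.hashCode() * 31 + seed * 0x9E3779B9, BITS);
        }
    }

    private static final class Entry {
        final long aggregate;
        final Bloom partitions = new Bloom();

        Entry(long aggregate) {
            this.aggregate = aggregate;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Caches an aggregate computed over the given partitions. */
    void put(String key, long aggregate, List<String> partNames) {
        Entry e = new Entry(aggregate);
        for (String p : partNames) {
            e.partitions.add(p);
        }
        entries.put(key, e);
    }

    /** Returns the cached aggregate, or null on a miss. */
    Long get(String key) {
        Entry e = entries.get(key);
        return e == null ? null : Long.valueOf(e.aggregate);
    }

    /** Removes every entry whose filter may cover the changed partition. */
    int invalidate(String partName) {
        int removed = 0;
        Iterator<Entry> it = entries.values().iterator();
        while (it.hasNext()) {
            if (it.next().partitions.mightContain(partName)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

A Bloom filter never misses a partition that was added, so a stale aggregate is always dropped; a false positive only costs one extra recomputation.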
Caching File Footers
- ORC footer cache
  - Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  - Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  - Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
- Skip transaction
  - High overhead
  - Transaction conflict
  - Row mutation is already atomic
HBaseMetaStore Needs Transaction
Atomicity is required
- Create table / partition also creates a storage descriptor
- Alter table also alters partitions
- Drop table also drops table / column privileges
HBase doesn't support transactions
- Doesn't support cross-row transactions
HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transaction
Omid
Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday
Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both read and write
- When two concurrent transactions write to the same data, the later one aborts
Low overhead
Omid Components
TSO Server (Timestamp Oracle)
- Generates transid
- Status of transactions
TSO Client
- Talks to the TSO
- Caches transaction metadata
- Most reads don't need to talk to the TSO
Compactor
- Runs as an HBase Coprocessor
- Removes stale cell versions
[Diagram labels: Client, TSO, HBase, Compactor]
Omid Operations
Open transaction
- Get transid from the TSO
Read a cell
- Read all versions of the cell from HBase
- Read the latest version committed before the transaction started
Write a cell
- Write the value, versioned with transid, to HBase
Commit
- Generate commitid from the TSO
- The TSO figures out whether there is a conflict using transaction metadata
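The read rule above (latest version committed before the transaction started) can be sketched for a single cell. This is a toy model, not the Omid client API: the class VersionedCell and the commitTable map (writer start timestamp to commit timestamp) are assumptions, and it relies on committed writers of one cell never overlapping, which the commit-time conflict check is meant to guarantee.

```java
import java.util.Map;
import java.util.TreeMap;

/** One cell holding several versions, each tagged with its writer's start timestamp. */
class VersionedCell {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    /** Write a cell: the value is versioned with the writer's transaction id. */
    void write(long writerStartTs, String value) {
        versions.put(writerStartTs, value);
    }

    /**
     * Read a cell: walk versions from newest to oldest and return the first one whose
     * writer committed before the reader started. Writers missing from commitTable are
     * treated as uncommitted (or aborted) and skipped.
     */
    String read(long readerStartTs, Map<Long, Long> commitTable) {
        for (Map.Entry<Long, String> v : versions.descendingMap().entrySet()) {
            Long commitTs = commitTable.get(v.getKey());
            if (commitTs != null && commitTs < readerStartTs) {
                return v.getValue();
            }
        }
        return null; // nothing visible in this snapshot
    }
}
```

A reader never blocks on a writer here: it simply ignores versions that were not committed when its snapshot was taken.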
Omid Data Structure
Memory management in TSO
- Never runs OOM: aborts old transactions
[Diagram: TSO state]
lastCommit
  row1 | T20
  row2 | T25
  row5 | T22
- Detects transaction conflicts at commit time
- Largest chunk of memory
committed
  T10 | T20
  T4 | T25
  T11 | T30
- Constructs the snapshot at read time
- Partially replicated to the client
aborted
  T2 ... ...
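Commit-time conflict detection with a lastCommit table can be sketched as below. ToyTso is a hypothetical, single-threaded stand-in for the TSO; it omits the committed and aborted tables, persistence, and the memory management mentioned on the slide.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Toy timestamp oracle with write-write conflict detection at commit time. */
class ToyTso {
    private long clock = 0;
    // row -> commit timestamp of the last transaction that wrote it
    private final Map<String, Long> lastCommit = new HashMap<>();

    /** Open transaction: hand out a start timestamp (the transid). */
    long begin() {
        return ++clock;
    }

    /** Returns the commit timestamp, or -1 if the transaction has to abort. */
    long commit(long startTs, Set<String> writtenRows) {
        for (String row : writtenRows) {
            Long last = lastCommit.get(row);
            if (last != null && last > startTs) {
                return -1; // someone committed this row after our snapshot was taken
            }
        }
        long commitTs = ++clock;
        for (String row : writtenRows) {
            lastCommit.put(row, commitTs);
        }
        return commitTs;
    }
}
```

With two overlapping transactions that both write row1, the first commit succeeds and the second is rejected, which matches the "later one aborts" behavior described for snapshot isolation.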
Transaction Conflict
Two concurrent DDLs write to the same data
- Proper retry logic
Task node writes: ORC footer cache
- High chance of write conflict
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass the transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
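The slide does not say what "proper retry logic" looks like, so the following is one possible shape: re-run the whole transaction body when the commit reports a conflict, up to a fixed number of attempts. DdlRetry and ConflictException are hypothetical names, not Hive or Omid classes.

```java
import java.util.function.Supplier;

/** Hypothetical retry wrapper for DDL transactions that may hit a write conflict. */
class DdlRetry {
    /** Stand-in for whatever the transaction layer throws on a commit conflict. */
    static class ConflictException extends RuntimeException {
    }

    /** Runs txn until it succeeds or maxAttempts conflicts have been seen. */
    static <T> T withRetry(int maxAttempts, Supplier<T> txn) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Each attempt must start a fresh transaction and re-read its inputs.
                return txn.get();
            } catch (ConflictException e) {
                last = e;
            }
        }
        throw last;
    }
}
```

The footer-cache writes skip this path entirely: since each put touches a single row keyed by fileId, HBase's row-level atomicity is enough and there is nothing to retry.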
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work ndash HA via HBase Coprocessor
- Future Work ndash Other
- Thank You
-
14 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Table Name Key Column Families and Columns
Description
HBMS_GLOBAL_PRIVS bytes(ldquogprdquo) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized PrincipalPrivilegeSet proto
HBMS_ROLES bytes(roleName) cf_catalog ldquorolesrdquo ldquorolesrdquo storeretrieve serialized Role proto
HBMS_USER_TO_ROLE bytes(userName) cf_catalog ldquocrdquo ldquocrdquo storeretrieve serialized RoleList proto
HBMS_SECURITY bytes(delTokenId) cf_catalog ldquodtrdquo ldquomkrdquo ldquodtrdquo storeretrieve delegation token ldquomkrdquo master keys
HBMS_SEQUENCES bytes(sequence) cf_catalog ldquocrdquo ldquocrdquo storeretrieve sequences
HBase schema
15 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
De-normalization
bull Goalbull Optimized for queryingbull May slower in DDL bull Example drop_role(String roleName)
Key Value
bytes(ldquoUser 1rdquo) Proto(Role 1 Role 2 Role 3 Role 5)
bytes(ldquoUser 2rdquo) Proto(Role 1 Role 2)
bytes(ldquoUser 3rdquo) Proto(Role 4 Role 5)
bytes(ldquoUser 4rdquo) Proto (Role 2 Role 3)
HBMS_USER_TO_ROLE
bull Need to scan amp de-serialize everything in order to drop a role
16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Partition Keys
Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo
Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value
date state
201601 CA
201601 WA
201602 CA
201603 CA
201605 CA
17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Typed Partition Keys
Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)
ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08
Using BinarySortableSerDendash Support all Hive data typesndash Handles null
(String Integer) Bytes
lsquoA10rsquo 3 41 31 30 00 00 00 00 03
lsquoA10rsquo 10 41 31 30 00 00 00 00 0A
lsquoA5rsquo 4 41 35 00 00 00 00 04
lsquoA5rsquo 15 41 35 00 00 00 00 0D
18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
struct StorageDescriptor
1 listltFieldSchemagt cols
2 string location
3 string inputFormat
4 string outputFormat
5 bool compressed
6 i32 numBuckets
7 SerDeInfo serdeInfo
8 listltstringgt bucketCols
9 listltOrdergt sortCols
10 mapltstring stringgt parameters
11 optional SkewedInfo skewedInfo
12 optional bool storedAsSubDirectories
20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
message StorageDescriptor
message Order hellip
message SerDeInfo hellip
message SkewedInfo hellip
repeated FieldSchema cols = 1
optional string input_format = 2
optional string output_format = 3
optional bool is_compressed = 4
optional sint32 num_buckets = 5
optional SerDeInfo serde_info = 6
repeated string bucket_cols = 7
repeated Order sort_cols = 8
optional SkewedInfo skewed_info = 9
optional bool stored_as_sub_directories = 10
21 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
22 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
HBMS_
PARTITIONS
HBMS_
SDS
23 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching
Aggregate Statsbull Location - on HBasebull Compile time
File Footers bull Location - on HBasebull Runtime - accessed from tasks
Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time
25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching Aggregate Stats
ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo
bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS
bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto
bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client
side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats
bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to
removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp
removes them from the cache
26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching File Footers
bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner
thread
bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic
27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBaseMetaStore Needs Transaction
Atomic is requiredndash Create table partition also create storage descriptorndash Alter table also alter partitionsndash Drop table also drop table column privilege
HBase donrsquot support transactionndash Donrsquot support cross-row transactions
HBaseConnectionndash Support different transaction manager in theoryndash VanillaHBaseConnection no transaction
Omid
- Transaction layer on top of HBase
- Initially developed by Yahoo
- Apache incubator project
  - First release this Monday
- Snapshot isolation
  - Natural, as HBase is a versioned database
  - No locking, no deadlock, no blocking for either reads or writes
  - When two concurrent transactions write to the same data, the later one aborts
- Low overhead
Omid Components
- TSO Server (Timestamp Oracle)
  - Generates transid
  - Tracks the status of transactions
- TSO Client
  - Talks to TSO
  - Caches transaction metadata
  - Most reads don't need to talk to TSO
- Compactor
  - Runs as an HBase coprocessor
  - Removes stale cell versions
[Diagram: Client, TSO, and HBase with the Compactor]
Omid Operations
- Open transaction
  - Get transid from TSO
- Read a cell
  - Read all versions of the cell from HBase
  - Return the latest version committed before the transaction started
- Write a cell
  - Write the value to HBase, versioned with transid
- Commit
  - Get commitid from TSO
  - TSO figures out whether there is a conflict using transaction metadata
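The four operations above can be simulated in memory to see how snapshot isolation falls out of versioned cells. This is a toy model under stated assumptions, not Omid's code: `ToyTso`, `ToyTransaction`, and the dict-based store are hypothetical, and a single counter plays the role of the timestamp oracle.

```python
class ToyTso:
    """Toy timestamp oracle: hands out timestamps and detects write-write conflicts."""

    def __init__(self):
        self.clock = 0
        self.last_commit = {}  # row -> commit timestamp of the last committed writer
        self.committed = {}    # transid -> commitid

    def next_ts(self):
        self.clock += 1
        return self.clock

    def try_commit(self, transid, write_set):
        # Conflict: a row in the write set was committed after this transaction started.
        if any(self.last_commit.get(row, 0) > transid for row in write_set):
            return None
        commitid = self.next_ts()
        for row in write_set:
            self.last_commit[row] = commitid
        self.committed[transid] = commitid
        return commitid


class ToyTransaction:
    def __init__(self, tso, store):
        self.tso = tso
        self.store = store            # row -> {transid: value}, i.e. versioned cells
        self.transid = tso.next_ts()  # "open transaction"
        self.write_set = set()

    def read(self, row):
        best = None
        for writer, value in self.store.get(row, {}).items():
            if writer == self.transid:
                return value  # a transaction always sees its own writes
            commitid = self.tso.committed.get(writer)
            # Only versions committed before this transaction started are visible.
            if commitid is not None and commitid < self.transid:
                if best is None or commitid > best[0]:
                    best = (commitid, value)
        return None if best is None else best[1]

    def write(self, row, value):
        self.store.setdefault(row, {})[self.transid] = value
        self.write_set.add(row)

    def commit(self):
        return self.tso.try_commit(self.transid, self.write_set) is not None
```

Versions written by an aborted transaction never get a commitid, so readers skip them; in the real system the compactor is what eventually removes such cells.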
Omid Data Structure
- Memory management in TSO
  - Never runs out of memory: aborts old transactions instead
[Diagram: in-memory state of the TSO]
- lastCommit (row -> timestamp): row1 -> T20, row2 -> T25, row5 -> T22
  - Detects transaction conflicts at commit time
  - Largest chunk of memory
- committed (transid -> commitid): T10 -> T20, T4 -> T25, T11 -> T30, ...
  - Used to construct the snapshot at read time
  - Partially replicated to the client
- aborted: T2, ...
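One way to keep the lastCommit table bounded while still "aborting old transactions" is a low watermark. The sketch below is an assumption about how such a bound can work, not a description of Omid's internals; `BoundedLastCommit` and its parameters are hypothetical.

```python
from collections import OrderedDict


class BoundedLastCommit:
    """Keeps at most `capacity` rows; evicted knowledge collapses into a low watermark."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()  # row -> last commit timestamp, oldest update first
        self.low_watermark = 0     # highest commit timestamp that has been evicted

    def record_commit(self, row, commit_ts):
        self.rows.pop(row, None)
        self.rows[row] = commit_ts
        while len(self.rows) > self.capacity:
            _, evicted_ts = self.rows.popitem(last=False)
            self.low_watermark = max(self.low_watermark, evicted_ts)

    def can_commit(self, start_ts, write_set):
        # A transaction older than the watermark cannot be checked reliably: abort it.
        if start_ts < self.low_watermark:
            return False
        # Otherwise every evicted commit is older than start_ts, so only tracked rows matter.
        return all(self.rows.get(row, 0) <= start_ts for row in write_set)
```

The trade-off is that a long-running transaction may be aborted even though it had no real conflict, which is the price of never growing the table without limit.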
Transaction Conflict
- Two concurrent DDLs write to the same data
  - Proper retry logic
- Task node writes: ORC footer cache
  - High chance of write conflict
  - Row mutation is atomic in HBase
  - Cross-row atomicity is not required
  - Bypass the transaction layer
    public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
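The "proper retry logic" for conflicting DDL could look roughly like the following. It is a generic sketch, not the Hive code path: `WriteConflict`, `run_with_retry`, and the backoff parameters are hypothetical.

```python
import random
import time


class WriteConflict(Exception):
    """Raised by a (hypothetical) transactional operation whose commit was rejected."""


def run_with_retry(operation, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Re-run `operation` from scratch each time its commit hits a write conflict."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except WriteConflict:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter so the competing writers drift apart.
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```

The whole operation is retried, not just the commit, because the retried transaction has to read a fresh snapshot before writing again.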
Deployment
- Server-side components in HBase
  - Server-side filter
  - Omid compactor
  - Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
- New config in hive-site.xml
  - hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore, TSO, and HBase hosting the Server Side Filter and the Omid Compactor]
Deploy Omid
- Create Omid tables in HBase
  - omid.sh create-hbase-commit-table
  - omid.sh create-hbase-timestamp-table
- Start Omid TSO
  - omid.sh tso
- Related config in hive-site.xml
  - hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
  - tso.host=localhost
  - tso.port=54758
  - omid.client.connectionType=DIRECT
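To turn the settings above into hive-site.xml entries, a small helper like the following could be used. The property names are reconstructed from the slide text, whose punctuation was lost in extraction, so check them against your Hive and Omid versions; `render_hive_site` and `OMID_PROPERTIES` are hypothetical names.

```python
from xml.etree import ElementTree as ET


def render_hive_site(properties):
    """Render a dict of property names and values as a hive-site.xml style document."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing helper, available from Python 3.9
        ET.indent(configuration)
    return ET.tostring(configuration, encoding="unicode")


# Values taken from the slides; names are assumed to use the usual dotted form.
OMID_PROPERTIES = {
    "hive.metastore.rawstore.impl":
        "org.apache.hadoop.hive.metastore.hbase.HBaseStore",
    "hive.metastore.hbase.connection.class":
        "org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection",
    "tso.host": "localhost",
    "tso.port": "54758",
    "omid.client.connectionType": "DIRECT",
}

if __name__ == "__main__":
    print(render_hive_site(OMID_PROPERTIES))
```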
Instantiate HBase Metastore
- Instantiate HBase tables from scratch
  - hive --service hbaseschematool --install
- hbaseimport: import an existing Hive Metastore
  - One-way import from ObjectStore to HBaseStore
  - hive --service hbaseimport
TPCDS queries
[Chart: Query Plan Time for TPCDS queries, comparing HBaseStore, HBaseStore+Omid, and ObjectStore on Query 7, 15, 27, 29, 39, 46, 56, 68, 70, and 76; vertical axis from 0 to 6000]
- 1824 partitions
- Sweet spot for ObjectStore
- Average speedup for all TPCDS queries
  - 219 (without Omid)
  - 212 (with Omid)
Current Status
- hbase-metastore branch merged to master last September
- Turned off by default
- Feature parity: almost
  - Minor holes: event notification, version, constraints
  - Deprecate listTableNamesByFilter/listPartitionNamesByFilter
  - Tools enhancement
  - ACID is not supported
- Runs most e2e queries
- Fixing unit tests
  - TestMiniTezCliDriver: all pass
  - TestCliDriver: HIVE-14097 pending review
  - Not production quality yet
Future Work - ACID
- Transaction metadata is stored in the Metastore
  - Locks
  - Txns
  - Compactions
- Data structure is harder to de-normalize
- New work: transaction server
  - Keep the lock and transaction tree in memory
Future Work - HA via HBase Coprocessor
- Two new server components
  - Omid TSO Server
  - Transaction Server
- All servers need HA
  - Management headache
- Automatic HA through HBase Coprocessor
[Diagram: two Region Servers, each running the TSO Server via CoProcessor]
Future Work - Other
- Stats aggregation
  - Coprocessor
- Improving ObjectCache
  - Rudimentary implementation currently
  - LRU
- Omid consuming high CPU
  - 300% CPU always, by design
  - High throughput, avoids context switches
  - Might be an issue for small systems
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work - HA via HBase Coprocessor
- Future Work - Other
- Thank You
De-normalization
- Goal
  - Optimized for querying
  - May be slower in DDL
- Example: drop_role(String roleName)

HBMS_USER_TO_ROLE
Key              | Value
bytes("User 1")  | Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2")  | Proto(Role 1, Role 2)
bytes("User 3")  | Proto(Role 4, Role 5)
bytes("User 4")  | Proto(Role 2, Role 3)

- Need to scan and de-serialize everything in order to drop a role
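The cost of drop_role on the de-normalized table above is easy to see in a sketch. Here a dict stands in for HBMS_USER_TO_ROLE and JSON stands in for the protobuf payload; both are assumptions made to keep the example self-contained, and `drop_role` is not the Hive method itself.

```python
import json

# The table from the slide: user key -> serialized list of roles.
USER_TO_ROLE = {
    b"User 1": json.dumps(["Role 1", "Role 2", "Role 3", "Role 5"]).encode("utf-8"),
    b"User 2": json.dumps(["Role 1", "Role 2"]).encode("utf-8"),
    b"User 3": json.dumps(["Role 4", "Role 5"]).encode("utf-8"),
    b"User 4": json.dumps(["Role 2", "Role 3"]).encode("utf-8"),
}


def drop_role(user_to_role, role_name):
    """Remove a role from every user; returns how many rows had to be rewritten."""
    rows_rewritten = 0
    for user_key in list(user_to_role):
        roles = json.loads(user_to_role[user_key])  # every row is de-serialized
        if role_name in roles:
            roles.remove(role_name)
            user_to_role[user_key] = json.dumps(roles).encode("utf-8")
            rows_rewritten += 1
    return rows_rewritten
```

Looking up the roles of one user is a single keyed read, which is the query the layout is optimized for; the full scan is only paid on the rarer DDL path.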
Partition Keys
- Range scan for most queries
  - Where date = '201601' and state = 'CA'
  - Where date >= '201602' and date < '201604'
- Server-side filter for the rest
  - Where state = 'CA' (not prefix key)
  - Where date like '2016' (regex)
  - Where date > '201601' and state > 'OR' (cannot be range scan)
  - Scan all keys, but do not deserialize values

date   | state
201601 | CA
201601 | WA
201602 | CA
201603 | CA
201605 | CA
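The difference between the two access paths can be shown with a sorted list of keys. This is a minimal sketch: tuples stand in for the serialized row keys, and `range_scan` and `filtered_scan` are hypothetical helpers, not HBase calls.

```python
from bisect import bisect_left

# Row keys from the table above, kept sorted the way HBase keeps row keys.
KEYS = sorted([
    ("201601", "CA"),
    ("201601", "WA"),
    ("201602", "CA"),
    ("201603", "CA"),
    ("201605", "CA"),
])


def range_scan(keys, start, stop):
    """Return keys with start <= key < stop, touching only that contiguous slice."""
    return keys[bisect_left(keys, start):bisect_left(keys, stop)]


def filtered_scan(keys, predicate):
    """Visit every key and keep the matching ones (the server-side filter case)."""
    return [key for key in keys if predicate(key)]


# date >= '201602' and date < '201604': a prefix condition, so a range scan works.
in_range = range_scan(KEYS, ("201602",), ("201604",))

# state = 'CA': not a key prefix, so every key has to be examined.
in_ca = filtered_scan(KEYS, lambda key: key[1] == "CA")
```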
Typed Partition Keys
- Binary sorted
  - HBase range scan: Scan(byte[] startRow, byte[] stopRow)
  - Where key1 >= 'A5' and key2 >= 8
    - startRow: 41 35 00 00 00 00 08
- Using BinarySortableSerDe
  - Supports all Hive data types
  - Handles null

(String, Integer) | Bytes
'A10', 3          | 41 31 30 00 00 00 00 03
'A10', 10         | 41 31 30 00 00 00 00 0A
'A5', 4           | 41 35 00 00 00 00 04
'A5', 15          | 41 35 00 00 00 00 0F
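The byte layout in the table can be reproduced with a simplified order-preserving encoding. This mirrors the slide's illustration only and is not the actual BinarySortableSerDe format; `encode_key` is a hypothetical helper that assumes the text contains no zero byte and the integer fits in 32 unsigned bits.

```python
import struct


def encode_key(text, number):
    """UTF-8 text, a 0x00 terminator, then a 4-byte big-endian unsigned integer."""
    return text.encode("utf-8") + b"\x00" + struct.pack(">I", number)


def hex_bytes(data):
    """Format bytes the way the slide prints them, e.g. '41 35 00'."""
    return " ".join(f"{byte:02X}" for byte in data)


PAIRS = [("A5", 15), ("A10", 10), ("A5", 4), ("A10", 3)]

# Sorting the encoded bytes gives the order a range scan would see.
SORTED_PAIRS = sorted(PAIRS, key=lambda pair: encode_key(*pair))
```

Big-endian integers matter here: with a little-endian layout, byte-wise comparison would no longer agree with numeric order.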
Storage Descriptor de-duplication
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Table proto; "s": Stats per column in the Table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1 ... partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 ... cn | "c": Partition proto; "s": Stats per column in the Partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1 ... partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": Metadata footer proto; "s": PPD Stats
Storage Descriptor de-duplication
Thrift definitions:
    struct Partition {
      1: list<string> values
      2: string dbName
      3: string tableName
      4: i32 createTime
      5: i32 lastAccessTime
      6: StorageDescriptor sd
      7: map<string, string> parameters
      8: optional PrincipalPrivilegeSet privileges
    }
    struct StorageDescriptor {
      1: list<FieldSchema> cols
      2: string location
      3: string inputFormat
      4: string outputFormat
      5: bool compressed
      6: i32 numBuckets
      7: SerDeInfo serdeInfo
      8: list<string> bucketCols
      9: list<Order> sortCols
      10: map<string, string> parameters
      11: optional SkewedInfo skewedInfo
      12: optional bool storedAsSubDirectories
    }
Storage Descriptor de-duplication
Protobuf definitions:
    message Partition {
      optional int64 create_time = 1;
      optional int64 last_access_time = 2;
      optional string location = 3;
      optional Parameters sd_parameters = 4;
      required bytes sd_hash = 5;
      optional Parameters parameters = 6;
    }
    message StorageDescriptor {
      message Order { ... }
      message SerDeInfo { ... }
      message SkewedInfo { ... }
      repeated FieldSchema cols = 1;
      optional string input_format = 2;
      optional string output_format = 3;
      optional bool is_compressed = 4;
      optional sint32 num_buckets = 5;
      optional SerDeInfo serde_info = 6;
      repeated string bucket_cols = 7;
      repeated Order sort_cols = 8;
      optional SkewedInfo skewed_info = 9;
      optional bool stored_as_sub_directories = 10;
    }
HBase schema
- Read/Write path
  - Thrift client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
  - Thrift server passes the Thrift objects to the HBase client opened in the Thrift server
  - HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
  - Writes/reads the protobuf payloads to/from HBase tables
- Example: adding a new partition, "add_partition(Partition new_part)"
[Diagram: for add_partition, the Thrift struct Partition is converted to the protobuf message Partition, which is stored in HBMS_PARTITIONS; its StorageDescriptor is stored separately in HBMS_SDS]
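The add_partition path and the storage descriptor de-duplication can be sketched together. Everything here is a simplified assumption: dataclasses stand in for the Thrift structs, JSON stands in for the protobuf payload, two dicts stand in for HBMS_PARTITIONS and HBMS_SDS, and `ToyHBaseStore` is a hypothetical name. The one detail taken from the definitions above is that `location` lives on the partition record, so descriptors that differ only by location hash to the same row.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


@dataclass
class StorageDescriptor:
    """Subset of the storage descriptor fields, without the location."""
    input_format: str
    output_format: str
    cols: list = field(default_factory=list)


@dataclass
class Partition:
    db_name: str
    table_name: str
    values: list
    location: str
    sd: StorageDescriptor


class ToyHBaseStore:
    def __init__(self):
        self.partitions = {}  # stands in for HBMS_PARTITIONS
        self.sds = {}         # stands in for HBMS_SDS: hash -> {"c": payload, "ref": count}

    def add_partition(self, part):
        # Serialize the descriptor deterministically and key it by its MD5 hash.
        sd_payload = json.dumps(asdict(part.sd), sort_keys=True).encode("utf-8")
        sd_hash = hashlib.md5(sd_payload).digest()
        entry = self.sds.setdefault(sd_hash, {"c": sd_payload, "ref": 0})
        entry["ref"] += 1  # one more partition shares this descriptor
        key = (part.db_name, part.table_name, *part.values)
        self.partitions[key] = {"location": part.location, "sd_hash": sd_hash}
        return sd_hash
```

Because the two writes land in different rows of different tables, this is exactly the kind of operation that needs the cross-row atomicity discussed in the transaction slides.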
Caching
- Aggregate Stats
  - Location: on HBase
  - Compile time
- File Footers
  - Location: on HBase
  - Runtime: accessed from tasks
- Tables, Partitions, Storage Descriptors
  - Location: on Metastore server(s)
  - Compile time
Caching Aggregate Stats
"get_aggr_stats_for(dbName, tblName, partNames, colNames)"
- Gets aggregated stats for columns in each partition - expensive call
- Used in CBO, Stats Annotation, Stats Optimizer
- HBMS_AGGR_STATS
  - RowKey: md5(dbName, tblName, partVal1 ... partValn, colName)
  - Columns: AggrStats proto and AggrStatsBloomFilter proto
- Lookup
  - New entry added for each key not found in the cache: AggrStats are calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  - An AggrStatsBloomFilter is created on the partitions contained in the AggrStats
- Invalidation
  - TTL expiry: expired nodes are evicted from the cache
  - Alter partition, Drop partition, Analyze, etc. add an invalidation request to a queue
  - The invalidator thread picks up an invalidation request and executes a filter on HBase to remove expired entries
  - It uses the bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache
16 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Partition Keys
Range scan for most queriesndash Where date = lsquo201601rsquo and state = lsquoCArsquondash Where date gt= lsquo201602rsquo and date lt lsquo201604rsquo
Server side filter for the restndash Where state = lsquoCArsquo (not prefix key)ndash Where date like lsquo2016rsquo (regex)ndash Where date gt lsquo201601rsquo and state gt lsquoORrsquo (cannot be range scan)ndash Scan all keys but not deserialize value
date state
201601 CA
201601 WA
201602 CA
201603 CA
201605 CA
17 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Typed Partition Keys
Binary sortedndash HBase range scan Scan(byte[] startRow byte[] stopRow)
ndash Where key1 gt= lsquoA5rsquo and key2 gt= 8bull startRow 41 35 00 00 00 00 08
Using BinarySortableSerDendash Support all Hive data typesndash Handles null
(String Integer) Bytes
lsquoA10rsquo 3 41 31 30 00 00 00 00 03
lsquoA10rsquo 10 41 31 30 00 00 00 00 0A
lsquoA5rsquo 4 41 35 00 00 00 00 04
lsquoA5rsquo 15 41 35 00 00 00 00 0D
18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
struct StorageDescriptor
1 listltFieldSchemagt cols
2 string location
3 string inputFormat
4 string outputFormat
5 bool compressed
6 i32 numBuckets
7 SerDeInfo serdeInfo
8 listltstringgt bucketCols
9 listltOrdergt sortCols
10 mapltstring stringgt parameters
11 optional SkewedInfo skewedInfo
12 optional bool storedAsSubDirectories
20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
message StorageDescriptor
message Order hellip
message SerDeInfo hellip
message SkewedInfo hellip
repeated FieldSchema cols = 1
optional string input_format = 2
optional string output_format = 3
optional bool is_compressed = 4
optional sint32 num_buckets = 5
optional SerDeInfo serde_info = 6
repeated string bucket_cols = 7
repeated Order sort_cols = 8
optional SkewedInfo skewed_info = 9
optional bool stored_as_sub_directories = 10
21 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
22 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
HBMS_
PARTITIONS
HBMS_
SDS
23 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching
Aggregate Statsbull Location - on HBasebull Compile time
File Footers bull Location - on HBasebull Runtime - accessed from tasks
Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time
25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching Aggregate Stats
ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo
bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS
bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto
bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client
side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats
bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to
removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp
removes them from the cache
26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching File Footers
bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner
thread
bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic
27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBaseMetaStore Needs Transaction
Atomicity is required
- Create table/partition also creates a storage descriptor
- Alter table also alters partitions
- Drop table also drops table column privileges
HBase doesn't support transactions
- Doesn't support cross-row transactions
HBaseConnection
- Supports different transaction managers in theory
- VanillaHBaseConnection: no transaction
Omid
Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
- First release this Monday
Snapshot isolation
- Natural, as HBase is a versioned database
- No locking, no deadlock, no blocking for both read and write
- When two concurrent transactions write to the same data, the later one aborts
Low overhead
Omid Components
TSO Server (Timestamp Oracle)
- Generates transid
- Status of transaction
TSO Client
- Talks to TSO
- Caches transaction metadata
- Most reads don't need to talk to TSO
Compactor
- Runs as an HBase Coprocessor
- Removes stale cell versions
[Diagram: Client, TSO, HBase, Compactor]
Omid Operations
Open transaction
- Get transid from TSO
Read a cell
- Read all versions of the cell from HBase
- Return the latest version committed before the transaction started
Write a cell
- Write the value, versioned with transid, to HBase
Commit
- Get commitid from TSO
- TSO figures out whether there is a conflict using transaction metadata
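The read rule ("latest committed version before transaction start") is easier to see with a small sketch. The code below is a simplified model rather than Omid's client code: `VersionedCell`, `snapshot_read` and the `commit_table` dict (writer transid mapped to its commit timestamp) are hypothetical names, and an in-memory dict stands in for the versioned HBase cell.

```python
class VersionedCell:
    """One HBase-like cell holding several versions, each tagged with the writer's transid."""

    def __init__(self):
        self.versions = {}  # writer transid (start timestamp) -> value

    def write(self, transid, value):
        # A write is simply a new cell version stamped with the writer's transid.
        self.versions[transid] = value


def snapshot_read(cell, reader_start_ts, commit_table):
    """Return the value whose writer committed most recently before the reader started.

    Writers missing from commit_table are treated as uncommitted or aborted
    and are skipped.
    """
    best_commit_ts, best_value = None, None
    for transid, value in cell.versions.items():
        commit_ts = commit_table.get(transid)
        if commit_ts is None or commit_ts >= reader_start_ts:
            continue  # not visible in this reader's snapshot
        if best_commit_ts is None or commit_ts > best_commit_ts:
            best_commit_ts, best_value = commit_ts, value
    return best_value


cell = VersionedCell()
cell.write(4, "v0")
cell.write(10, "v1")
cell.write(11, "v2")
commit_table = {10: 20, 4: 25, 11: 30}

print(snapshot_read(cell, 22, commit_table))  # only the writer committed at 20 is visible
```

Note that visibility is decided by commit timestamp, not by write order: a reader starting at 26 sees the value written by transaction 4, because that transaction committed at 25.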
Omid Data Structure
Memory management in TSO
- Never runs OOM: aborts old transactions
[Diagram: in-memory tables held by the TSO]
lastCommit: row1 T20, row2 T25, row5 T22
committed: T10 T20, T4 T25, T11 T30
T2 … …
aborted
• Detect transaction conflict at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client
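As a rough illustration of how a per-row "last commit" table lets the TSO detect write-write conflicts at commit time, here is a minimal single-process sketch. It assumes one shared counter for start and commit timestamps; the class and method names are hypothetical and it leaves out everything a real timestamp oracle needs (persistence, memory bounds, client replication).

```python
import itertools


class TimestampOracle:
    def __init__(self):
        self._clock = itertools.count(1)
        self.last_commit = {}  # row -> commit timestamp of the last committed writer
        self.committed = {}    # transid (start timestamp) -> commit timestamp
        self.aborted = set()

    def begin(self):
        """Hand out a transid, which doubles as the start timestamp."""
        return next(self._clock)

    def commit(self, transid, written_rows):
        """Return a commit timestamp, or None if the transaction must abort."""
        for row in written_rows:
            # Someone committed a write to this row after we started: conflict.
            if self.last_commit.get(row, 0) > transid:
                self.aborted.add(transid)
                return None
        commit_ts = next(self._clock)
        for row in written_rows:
            self.last_commit[row] = commit_ts
        self.committed[transid] = commit_ts
        return commit_ts


tso = TimestampOracle()
t1 = tso.begin()
t2 = tso.begin()
c1 = tso.commit(t1, {"row1"})          # first writer wins
c2 = tso.commit(t2, {"row1", "row2"})  # overlaps with t1 on row1, so it aborts
```

All rows are checked before any of them is updated, so an aborted transaction leaves no trace in `last_commit`.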
Transaction Conflict
Two concurrent DDLs write to the same data
- Proper retry logic
Task node writes: ORC footer cache
- High chance of write conflict
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass the transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
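The slide does not say what the "proper retry logic" for conflicting DDL looks like. One plausible shape, shown here purely as an assumption, is to rerun the whole operation in a fresh transaction when the commit is rejected. `TransactionConflict` and `run_with_retry` are hypothetical names, not Hive or Omid APIs.

```python
import time


class TransactionConflict(Exception):
    """Raised by the (hypothetical) transaction layer when a commit is rejected."""


def run_with_retry(operation, max_attempts=3, backoff_seconds=0.0):
    """Run operation(), retrying from scratch when it loses a write-write conflict."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransactionConflict:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller
            time.sleep(backoff_seconds * attempt)


attempts = []


def flaky_ddl():
    # Stand-in for a DDL call that conflicts twice before succeeding.
    attempts.append(1)
    if len(attempts) < 3:
        raise TransactionConflict()
    return "done"


result = run_with_retry(flaky_ddl)
```

Footer-cache writes from tasks would not go through this path at all, since they bypass the transaction layer and rely on single-row atomicity.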
Deployment
Server side components in HBase
- Server side filter
- Omid compactor
- Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
- hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore, TSO, and HBase hosting the Server Side Filter and Omid Compactor]
Deploy Omid
Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table
Start Omid TSO
- omid.sh tso
Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT
Instantiate HBase Metastore
Instantiate HBase tables from scratch
- hive --service hbaseschematool --install
Hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport
TPCDS queries
[Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. Queries: 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Vertical axis: 0 to 6000]
1824 partitions
Sweet spot for ObjectStore
Average speedup for all TPCDS queries
- 219 (without Omid)
- 212 (with Omid)
Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
- Minor holes: event notification, version, constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported
Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet
Future Work - ACID
Transaction metadata is stored in Metastore
- Locks
- Txns
- Compactions
Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory
Future Work - HA via HBase Coprocessor
Two new server components
- Omid TSO Server
- Transaction Server
All servers need HA
- Management headache
Automatic HA through HBase Coprocessor
[Diagram: two Region Servers, each hosting a TSO Server via CoProcessor]
Future Work - Other
Stats Aggregation
- Coprocessor
Improving ObjectCache
- Rudimentary implementation currently
- LRU
Omid consuming high CPU
- 300% CPU always, by design
- High throughput, avoids context switches
- Might be an issue for small systems
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work - HA via HBase Coprocessor
- Future Work - Other
- Thank You
Typed Partition Keys
Binary sorted
- HBase range scan: Scan(byte[] startRow, byte[] stopRow)
- Where key1 >= 'A5' and key2 >= 8
  • startRow: 41 35 00 00 00 00 08
Using BinarySortableSerDe
- Supports all Hive data types
- Handles null
(String, Integer) | Bytes
'A10', 3 | 41 31 30 00 00 00 00 03
'A10', 10 | 41 31 30 00 00 00 00 0A
'A5', 4 | 41 35 00 00 00 00 04
'A5', 15 | 41 35 00 00 00 00 0F
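To see why a byte-wise sortable encoding makes partition predicates map onto an HBase range scan, the byte layout shown on the slide can be mimicked in Python. This is a deliberately simplified stand-in (string bytes, a 0x00 terminator, then a 4-byte big-endian integer) and not the real BinarySortableSerDe wire format; `encode_key` is a hypothetical helper and it only handles non-negative integers.

```python
import struct


def encode_key(name, number):
    """Encode a (string, int) partition key so that byte order follows the slide's layout."""
    return name.encode("ascii") + b"\x00" + struct.pack(">I", number)


rows = [("A5", 15), ("A10", 10), ("A5", 4), ("A10", 3)]

# Sorting by encoded bytes gives the order in which HBase would store the rows.
stored_order = sorted(rows, key=lambda r: encode_key(*r))

# Start row for a scan covering key1 >= 'A5' and key2 >= 8.
start_row = encode_key("A5", 8)

for name, number in stored_order:
    print(name, number, encode_key(name, number).hex())
```

Because 'A10' sorts before 'A5' byte-wise, both 'A10' rows fall before the start row and are skipped by the scan without being read.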
Storage Descriptor de-duplication
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": Stats per column in the Table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1, ..., partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": Stats per column in the Partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": Metadata footer proto; "s": PPD Stats
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
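In the protobuf Partition message, location and sd_parameters stay with the partition while the rest of the storage descriptor is reached through sd_hash, so partitions with identical descriptors can share one HBMS_SDS row. A minimal sketch of that idea, assuming a JSON dump as a stand-in for the serialized StorageDescriptor proto and a plain dict as a stand-in for the HBMS_SDS table (the class and method names are hypothetical):

```python
import hashlib
import json


class StorageDescriptorStore:
    def __init__(self):
        # sd_hash -> {"c": serialized descriptor, "ref": reference count}
        self.rows = {}

    def put(self, sd):
        """Store a descriptor once and return its hash; repeated puts only bump the count."""
        payload = json.dumps(sd, sort_keys=True).encode("utf-8")
        sd_hash = hashlib.md5(payload).digest()
        row = self.rows.setdefault(sd_hash, {"c": payload, "ref": 0})
        row["ref"] += 1
        return sd_hash

    def release(self, sd_hash):
        """Drop one reference; the row disappears when nobody points at it any more."""
        row = self.rows[sd_hash]
        row["ref"] -= 1
        if row["ref"] == 0:
            del self.rows[sd_hash]


store = StorageDescriptorStore()
sd = {"cols": [["id", "int"]], "input_format": "orc", "num_buckets": 0}

# Two partitions with the same descriptor end up pointing at one stored row.
hash_for_part1 = store.put(sd)
hash_for_part2 = store.put(dict(sd))
```

With thousands of partitions that differ only in location, this keeps a single descriptor row instead of one copy per partition.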
HBase schema
Read/Write path
• Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift Server passes Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition, "add_partition(Partition new_part)"
[Diagram: Thrift struct Partition -> protobuf message Partition -> tables HBMS_PARTITIONS and HBMS_SDS]
18 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
19 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
struct StorageDescriptor
1 listltFieldSchemagt cols
2 string location
3 string inputFormat
4 string outputFormat
5 bool compressed
6 i32 numBuckets
7 SerDeInfo serdeInfo
8 listltstringgt bucketCols
9 listltOrdergt sortCols
10 mapltstring stringgt parameters
11 optional SkewedInfo skewedInfo
12 optional bool storedAsSubDirectories
20 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Storage Descriptor de-duplication
Table Name Key Column Families and Columns
Description
HBMS_DBS bytes(dbName) cf_catalog ldquocrdquo ldquocrdquo Database proto
HBMS_SDS bytes(md5(SD proto)) cf_catalog ldquocrdquo ldquorefrdquo ldquocrdquo StorageDescriptor protoldquorefrdquo reference count
HBMS_TBLS bytes(dbName tblName)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Table protoldquosrdquo Stats per column in the Table
HBMS_PARTITIONS bytes(dbName tblName partVal1 partValn)
cf_catalog ldquocrdquocf_stats ldquosrdquo -gt c1 hellip cn
ldquocrdquo Partition protoldquosrdquo Stats per column in the Partition
HBMS_AGGR_STATS bytes(md5(dbName tblName partVal1 partValn colName) )
cf_catalog ldquosrdquo ldquobrdquo ldquobrdquo AggrStatsBloomFilter protoldquosrdquo AggrStats proto
HBMS_FUNCS bytes(dbName funcName)
cf_catalog lsquocrdquo ldquocrdquo Function proto
HBMS_FILE_METADATA bytes(fileId) cf_catalog ldquocrdquocf_stats ldquosrdquo
ldquocrdquo Metadata footer protoldquosrdquo PPD Stats
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
message StorageDescriptor
message Order hellip
message SerDeInfo hellip
message SkewedInfo hellip
repeated FieldSchema cols = 1
optional string input_format = 2
optional string output_format = 3
optional bool is_compressed = 4
optional sint32 num_buckets = 5
optional SerDeInfo serde_info = 6
repeated string bucket_cols = 7
repeated Order sort_cols = 8
optional SkewedInfo skewed_info = 9
optional bool stored_as_sub_directories = 10
21 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
22 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBase schema
ReadWrite path bull Thrift Client creates Thrift objects for RPC (based on specs in
metastoreifhive_metastorethrift) bull Thrift Server passes thrift objects to HBase client open in the thrift server bull HBase client extracts fields from thrift objects converts them to corresponding
protobuf objects (metastoresrcprotobuforgapachehadoophivemetastorehbasehbase_metastore_protoproto)
bull Writesreads the protobuf payloads tofrom HBase tables
Example adding a new partition ldquoadd_partition(Partition new_part)rdquo
struct Partition
1 listltstringgt values
2 string dbName
3 string tableName
4 i32 createTime
5 i32 lastAccessTime
6 StorageDescriptor sd
7 mapltstring stringgt parameters
8 optional PrincipalPrivilegeSet privileges
message Partition
optional int64 create_time = 1
optional int64 last_access_time = 2
optional string location = 3
optional Parameters sd_parameters = 4
required bytes sd_hash = 5
optional Parameters parameters = 6
HBMS_
PARTITIONS
HBMS_
SDS
23 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching
Aggregate Statsbull Location - on HBasebull Compile time
File Footers bull Location - on HBasebull Runtime - accessed from tasks
Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time
25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching Aggregate Stats
ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo
bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS
bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto
bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client
side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats
bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to
removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp
removes them from the cache
26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching File Footers
bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner
thread
bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic
27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBaseMetaStore Needs Transaction
Atomic is requiredndash Create table partition also create storage descriptorndash Alter table also alter partitionsndash Drop table also drop table column privilege
HBase donrsquot support transactionndash Donrsquot support cross-row transactions
HBaseConnectionndash Support different transaction manager in theoryndash VanillaHBaseConnection no transaction
29 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid
Transaction layer on top of Hbase Initially developed by Yahoo Apache incubator project
ndash First release this Monday
Snapshot isolationndash Natural as HBase is a versioned databasendash No locking no dead lock no blocking for both read and writendash Two concurrent transaction write to the same data the later one aborts
Low overhead
30 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid Components
TSO Server (Timestamp Oracle)ndash Generate transidndash Status of transaction
TSO Clientndash Talk to TSOndash Cache transaction metadatandash Most read donrsquot need to talk to TSO
Compactorndash Run as HBase Coprocessorndash Remove stale cell versions
HBaseCompactor
Client
TSO
Omid Operations
Open transaction
- Get transid from the TSO
Read a cell
- Read all versions of the cell from HBase
- Return the latest version committed before the transaction started
Write a cell
- Write the value to HBase, versioned with transid
Commit
- Get a commitid from the TSO
- The TSO determines whether there is a conflict using transaction metadata
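The four operations above can be traced in a toy Python model of snapshot isolation. This is a sketch under stated assumptions, not Omid's implementation: `ToyTSO`, `ToyTable`, and `Txn` are hypothetical names, timestamps come from a simple counter, and a transaction does not see its own uncommitted writes.

```python
from typing import Dict, List, Optional, Set, Tuple


class ToyTSO:
    """Toy timestamp oracle: hands out timestamps and decides commits."""

    def __init__(self) -> None:
        self._clock = 0
        self.last_commit: Dict[str, int] = {}  # row -> last commit timestamp
        self.committed: Dict[int, int] = {}    # start timestamp -> commit timestamp

    def next_timestamp(self) -> int:
        self._clock += 1
        return self._clock

    def try_commit(self, start_ts: int, write_set: Set[str]) -> Optional[int]:
        # Conflict: a row we wrote was committed by someone else after our
        # snapshot was taken, so the later writer must abort.
        if any(self.last_commit.get(row, 0) > start_ts for row in write_set):
            return None
        commit_ts = self.next_timestamp()
        for row in write_set:
            self.last_commit[row] = commit_ts
        self.committed[start_ts] = commit_ts
        return commit_ts


class ToyTable:
    """Versioned cells: row -> list of (writer's start timestamp, value)."""

    def __init__(self) -> None:
        self.cells: Dict[str, List[Tuple[int, str]]] = {}


class Txn:
    def __init__(self, tso: ToyTSO, table: ToyTable) -> None:
        self.tso, self.table = tso, table
        self.start_ts = tso.next_timestamp()  # open transaction
        self.write_set: Set[str] = set()

    def read(self, row: str) -> Optional[str]:
        # Scan every version and keep the one with the latest commit
        # timestamp that is still older than our start timestamp.
        best: Optional[Tuple[int, str]] = None
        for writer_ts, value in self.table.cells.get(row, []):
            commit_ts = self.tso.committed.get(writer_ts)
            if commit_ts is not None and commit_ts < self.start_ts:
                if best is None or commit_ts > best[0]:
                    best = (commit_ts, value)
        return None if best is None else best[1]

    def write(self, row: str, value: str) -> None:
        self.table.cells.setdefault(row, []).append((self.start_ts, value))
        self.write_set.add(row)

    def commit(self) -> bool:
        return self.tso.try_commit(self.start_ts, self.write_set) is not None


tso, table = ToyTSO(), ToyTable()
t1 = Txn(tso, table)
t1.write("row1", "a")
ok1 = t1.commit()

t2, t3 = Txn(tso, table), Txn(tso, table)  # two concurrent transactions
t2.write("row1", "b")
t3.write("row1", "c")
ok2 = t2.commit()                  # first committer wins
snapshot_read = t3.read("row1")    # t3 still sees its snapshot
ok3 = t3.commit()                  # later writer aborts
final_read = Txn(tso, table).read("row1")
```

The aborted write of `t3` stays in the table as an uncommitted version; in the design described on the slides, removing such stale versions is the Compactor's job.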
Omid Data Structure
Memory management in the TSO
- Never runs out of memory: old transactions are aborted
[Diagram: in-memory state of the TSO]
lastCommit (row, timestamp): (row1, T20), (row2, T25), (row5, T22)
• Detects transaction conflicts at commit time
• Largest chunk of memory
committed: (T10, T20), (T4, T25), (T11, T30)
aborted: T2, …
• Used to construct the snapshot at read time
• Partially replicated to the client
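One way to keep the conflict table bounded while still never missing a conflict is to evict old entries and abort any transaction too old to be checked against them. The Python sketch below illustrates that idea only; the class `BoundedCommitTable`, the eviction order, and the watermark rule are assumptions made for illustration, and the slide does not describe how the TSO actually chooses which transactions to abort.

```python
from collections import OrderedDict
from typing import Iterable, Optional


class BoundedCommitTable:
    """Toy lastCommit table that holds at most `capacity` rows."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.last_commit: "OrderedDict[str, int]" = OrderedDict()
        self.low_watermark = 0  # highest commit timestamp evicted so far
        self._clock = 0

    def begin(self) -> int:
        self._clock += 1
        return self._clock

    def try_commit(self, start_ts: int, rows: Iterable[str]) -> Optional[int]:
        rows = list(rows)
        # Too old: an evicted entry might have conflicted, so abort to be safe.
        if start_ts < self.low_watermark:
            return None
        if any(self.last_commit.get(row, 0) > start_ts for row in rows):
            return None
        self._clock += 1
        commit_ts = self._clock
        for row in rows:
            self.last_commit[row] = commit_ts
            self.last_commit.move_to_end(row)
        # Evict the oldest entries and remember how far eviction has reached.
        while len(self.last_commit) > self.capacity:
            _, evicted_ts = self.last_commit.popitem(last=False)
            self.low_watermark = max(self.low_watermark, evicted_ts)
        return commit_ts


tso = BoundedCommitTable(capacity=2)
old_start = tso.begin()                      # a long-running transaction
a = tso.try_commit(tso.begin(), ["row1"])
b = tso.try_commit(tso.begin(), ["row2"])
c = tso.try_commit(tso.begin(), ["row3"])    # pushes row1 out of the table
late = tso.try_commit(old_start, ["row9"])   # aborted: older than the watermark
```

The trade-off is that memory stays fixed at the cost of occasionally aborting a long-running transaction that had no real conflict.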
Transaction Conflict
Two concurrent DDL operations write to the same data
- Needs proper retry logic
Task node writes - ORC footer cache
- High chance of write conflicts
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass the transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
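The slide asks for "proper retry logic" without spelling it out. A minimal Python sketch of one common approach, retrying with exponential backoff and jitter, is shown below. The exception `TxnConflict`, the helper `run_with_retry`, and the default attempt count and delays are all hypothetical choices, not anything taken from Hive or Omid.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TxnConflict(Exception):
    """Raised by a (hypothetical) transaction layer when a commit aborts."""


def run_with_retry(op: Callable[[], T], attempts: int = 5, base_delay: float = 0.01) -> T:
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except TxnConflict:
            if attempt == attempts - 1:
                raise  # give up and surface the conflict to the caller
            # Exponential backoff with jitter so competing DDLs spread out.
            time.sleep(base_delay * (2 ** attempt) * random.random())
    raise AssertionError("unreachable")


calls = {"n": 0}


def alter_table() -> str:
    # Simulated DDL that loses the race twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TxnConflict("another DDL committed first")
    return "altered"


result = run_with_retry(alter_table, base_delay=0.001)
```

The whole operation, including its reads, has to be re-run inside `op`, because a retried transaction starts from a fresh snapshot.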
Deployment
Server-side components in HBase
- Server-side filter
- Omid compactor
- Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
- hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore, TSO, and HBase hosting the Server Side Filter and Omid Compactor]
Deploy Omid
Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table
Start the Omid TSO
- omid.sh tso
Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT
Instantiate HBase Metastore
Instantiate HBase tables from scratch
- hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport
TPCDS queries
[Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. Queries: 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Vertical axis: 0 to 6000]
1824 partitions
Sweet spot for ObjectStore
Average speed-up for all TPCDS queries
- 219 (without Omid)
- 212 (with Omid)
Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity: almost
- Minor holes: event notification/version/constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported
Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097 pending review
- Not production quality yet
Future Work - ACID
Transaction metadata is stored in the Metastore
- Locks
- Txns
- Compactions
Data structure is harder to de-normalize
New work: transaction server
- Keep the lock and transaction tree in memory
Future Work - HA via HBase Coprocessor
Two new server components
- Omid TSO Server
- Transaction Server
All servers need HA
- Management headache
Automatic HA through HBase coprocessors
[Diagram: two Region Servers, each hosting a TSO Server via coprocessor]
Future Work - Other
Stats aggregation
- Coprocessor
Improving ObjectCache
- Rudimentary implementation currently
- LRU
Omid consumes high CPU
- 300% CPU at all times, by design
- High throughput; avoids context switches
- Might be an issue for small systems
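For reference, an LRU policy for an object cache can be sketched in a few lines of Python. The slide does not make clear whether LRU describes the current ObjectCache or the planned improvement, so this is only a generic illustration; `LruCache` and the keys used below are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LruCache:
    """Least-recently-used cache with a fixed number of entries."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


tables = LruCache(capacity=2)
tables.put("db1.t1", "Table proto for t1")
tables.put("db1.t2", "Table proto for t2")
tables.get("db1.t1")                        # t1 becomes the most recent entry
tables.put("db1.t3", "Table proto for t3")  # evicts t2
```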
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New Bottleneck - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work - HA via HBase Coprocessor
- Future Work - Other
- Thank You
Storage Descriptor de-duplication
Table Name | Key | Column Families and Columns | Description
HBMS_DBS | bytes(dbName) | cf_catalog: "c" | "c": Database proto
HBMS_SDS | bytes(md5(SD proto)) | cf_catalog: "c", "ref" | "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS | bytes(dbName, tblName) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS | bytes(dbName, tblName, partVal1 … partValn) | cf_catalog: "c"; cf_stats: "s" -> c1 … cn | "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS | bytes(md5(dbName, tblName, partVal1 … partValn, colName)) | cf_catalog: "s", "b" | "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS | bytes(dbName, funcName) | cf_catalog: "c" | "c": Function proto
HBMS_FILE_METADATA | bytes(fileId) | cf_catalog: "c"; cf_stats: "s" | "c": metadata footer proto; "s": PPD stats
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
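The effect of keying HBMS_SDS by the md5 of the descriptor and keeping a "ref" column can be sketched in Python. This is a toy model under assumptions: `SdTable` is a hypothetical class, the descriptor bytes are made-up strings rather than real protobuf payloads, and deleting a row when its count reaches zero is an assumed policy.

```python
import hashlib
from typing import Dict, Tuple


class SdTable:
    """Toy HBMS_SDS: key = md5 of the serialized descriptor,
    value = (serialized descriptor, reference count)."""

    def __init__(self) -> None:
        self.rows: Dict[bytes, Tuple[bytes, int]] = {}

    def put(self, serialized_sd: bytes) -> bytes:
        key = hashlib.md5(serialized_sd).digest()
        _, refs = self.rows.get(key, (serialized_sd, 0))
        self.rows[key] = (serialized_sd, refs + 1)
        return key  # stored by the caller as sd_hash

    def release(self, key: bytes) -> None:
        sd, refs = self.rows[key]
        if refs <= 1:
            del self.rows[key]  # last reference gone
        else:
            self.rows[key] = (sd, refs - 1)


sds = SdTable()
orc_sd = b"cols=[id:int,name:string];input_format=orc"    # made-up payload
text_sd = b"cols=[id:int,name:string];input_format=text"  # made-up payload

h1 = sds.put(orc_sd)   # partition 1
h2 = sds.put(orc_sd)   # partition 2 shares the same descriptor
h3 = sds.put(text_sd)  # partition 3 uses a different one
rows_after_put = len(sds.rows)
shared_refs = sds.rows[h1][1]

sds.release(h1)
refs_after_one_release = sds.rows[h2][1]
sds.release(h2)
rows_after_release = len(sds.rows)
```

Note that location and sd_parameters sit in the Partition message rather than in the StorageDescriptor message, so partitions that differ only in their location can still share one HBMS_SDS row.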
HBase schema
Read/Write path
• Thrift client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift)
• Thrift server passes the Thrift objects to the HBase client opened in the Thrift server
• HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• Writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition, "add_partition(Partition new_part)"
struct Partition {
  1: list<string> values,
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}
message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
[Diagram: the Partition is written to HBMS_PARTITIONS and HBMS_SDS]
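The conversion step for add_partition can be sketched in Python as a function that splits one Thrift-style Partition into an HBMS_PARTITIONS row and an HBMS_SDS row. Several details are assumptions made for illustration: the function name `to_hbase_rows`, the use of plain dicts for Thrift objects, JSON in place of protobuf serialization, and a NUL byte as the row-key separator (the slides do not state the real key encoding).

```python
import hashlib
import json
from typing import Any, Dict, Tuple


def to_hbase_rows(part: Dict[str, Any]) -> Tuple[bytes, Dict[str, Any], bytes, bytes]:
    """Return (partition row key, partition payload, sd row key, sd payload)."""
    sd = dict(part["sd"])
    # location and parameters move into the Partition message, so the
    # remaining descriptor is identical across many partitions.
    location = sd.pop("location", None)
    sd_parameters = sd.pop("parameters", {})
    sd_payload = json.dumps(sd, sort_keys=True).encode("utf-8")
    sd_hash = hashlib.md5(sd_payload).digest()

    # dbName, tableName and the partition values live in the row key,
    # not in the payload.
    key_parts = [part["dbName"], part["tableName"], *part["values"]]
    part_key = "\x00".join(key_parts).encode("utf-8")

    part_payload = {
        "create_time": part["createTime"],
        "last_access_time": part["lastAccessTime"],
        "location": location,
        "sd_parameters": sd_parameters,
        "sd_hash": sd_hash,
        "parameters": part["parameters"],
    }
    return part_key, part_payload, sd_hash, sd_payload


def make_partition(day: str) -> Dict[str, Any]:
    # Hypothetical partition of table db1.sales, partitioned by day.
    return {
        "values": [day],
        "dbName": "db1",
        "tableName": "sales",
        "createTime": 1467072000,
        "lastAccessTime": 0,
        "sd": {
            "cols": [["id", "int"], ["amount", "double"]],
            "location": f"/warehouse/db1.db/sales/day={day}",
            "inputFormat": "orc",
            "parameters": {},
        },
        "parameters": {},
    }


key_a, row_a, sd_key_a, _ = to_hbase_rows(make_partition("2016-06-28"))
key_b, row_b, sd_key_b, _ = to_hbase_rows(make_partition("2016-06-29"))
```

In this toy run the two partitions get different row keys and locations but the same sd_hash, so only one storage descriptor row would need to be stored.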
Caching
Aggregate Stats
• Location: on HBase
• Compile time
File Footers
• Location: on HBase
• Runtime: accessed from tasks
Tables, Partitions, Storage Descriptors
• Location: on Metastore server(s)
• Compile time
Caching Aggregate Stats
"get_aggr_stats_for(dbName, tblName, partNames, colNames)"
• Gets aggregated stats for columns in each partition - an expensive call
• Used in CBO, Stats Annotation, Stats Optimizer
• HBMS_AGGR_STATS
  • RowKey: md5(dbName, tblName, partVal1 … partValn, colName)
  • Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  • A new entry is added for each key not found in the cache; AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto
  • AggrStatsBloomFilter is created on the partitions contained in AggrStats
• Invalidation
  • TTL expiry: nodes are evicted from the cache
  • Alter partition, drop partition, analyze, etc. add an invalidation request to a queue
  • Invalidator thread picks up an invalidation request and executes a filter on HBase to remove expired entries
  • Uses the bloom filter to find every AggrStats proto that contains the candidate partition and removes them from the cache
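The lookup and invalidation flow for aggregate stats can be sketched in Python. This is a simplified model, not Hive's code: `AggrStatsCache` is a hypothetical class, an exact set of partitions stands in for the AggrStatsBloomFilter (a real bloom filter may report false positives, which would only cause extra evictions), the key separator is assumed, and TTL expiry is left out.

```python
import hashlib
from typing import Callable, Dict, FrozenSet, List, Tuple

PartId = Tuple[str, str, str]  # (dbName, tblName, partName)


class AggrStatsCache:
    def __init__(self, compute: Callable[[str, str, List[str], str], dict]) -> None:
        self.compute = compute
        # row key -> (aggregated stats, partitions covered by the entry)
        self.rows: Dict[bytes, Tuple[dict, FrozenSet[PartId]]] = {}
        self.invalidation_queue: List[PartId] = []

    @staticmethod
    def key(db: str, tbl: str, part_names: List[str], col: str) -> bytes:
        text = "\x00".join([db, tbl, *part_names, col])
        return hashlib.md5(text.encode("utf-8")).digest()

    def get(self, db: str, tbl: str, part_names: List[str], col: str) -> dict:
        k = self.key(db, tbl, part_names, col)
        if k not in self.rows:
            # Miss: aggregate on the client side and store the result.
            stats = self.compute(db, tbl, part_names, col)
            members = frozenset((db, tbl, p) for p in part_names)
            self.rows[k] = (stats, members)
        return self.rows[k][0]

    def request_invalidation(self, db: str, tbl: str, part_name: str) -> None:
        # Called by alter partition, drop partition, analyze, and so on.
        self.invalidation_queue.append((db, tbl, part_name))

    def run_invalidator(self) -> int:
        removed = 0
        while self.invalidation_queue:
            target = self.invalidation_queue.pop(0)
            stale = [k for k, (_, members) in self.rows.items() if target in members]
            for k in stale:
                del self.rows[k]
            removed += len(stale)
        return removed


computed: List[List[str]] = []


def compute_stats(db: str, tbl: str, parts: List[str], col: str) -> dict:
    computed.append(parts)
    return {"column": col, "partitions": len(parts)}


cache = AggrStatsCache(compute_stats)
first = cache.get("db1", "t1", ["p=1", "p=2"], "c1")
second = cache.get("db1", "t1", ["p=1", "p=2"], "c1")  # served from the cache
cache.get("db1", "t1", ["p=3"], "c1")
cache.request_invalidation("db1", "t1", "p=2")
removed = cache.run_invalidator()
```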
HBase schema
Read/write path
• The Thrift client creates Thrift objects for the RPC (based on the specs in metastore/if/hive_metastore.thrift)
• The Thrift server passes the Thrift objects to the HBase client, which is opened inside the Thrift server
• The HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto)
• It writes/reads the protobuf payloads to/from HBase tables
Example: adding a new partition with "add_partition(Partition new_part)"
Thrift definition:
    struct Partition {
      1: list<string> values
      2: string dbName
      3: string tableName
      4: i32 createTime
      5: i32 lastAccessTime
      6: StorageDescriptor sd
      7: map<string, string> parameters
      8: optional PrincipalPrivilegeSet privileges
    }
Protobuf definition:
    message Partition {
      optional int64 create_time = 1;
      optional int64 last_access_time = 2;
      optional string location = 3;
      optional Parameters sd_parameters = 4;
      required bytes sd_hash = 5;
      optional Parameters parameters = 6;
    }
Target HBase tables (diagram labels): HBMS_PARTITIONS, HBMS_SDS

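The write path above can be sketched in plain Java. This is only an illustration under stated assumptions, not Hive's code: the class names, the two in-memory maps standing in for HBMS_PARTITIONS and HBMS_SDS, the comma-joined row key, and the use of MD5 over a serialized descriptor for sd_hash are all hypothetical. It shows why the protobuf Partition carries a location plus an sd_hash: partitions that share a descriptor can point at a single stored copy.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: none of these classes exist in Hive under these names.
final class PartitionWritePathSketch {

    /** Hypothetical mirror of a few fields of the protobuf Partition message. */
    static final class StoredPartition {
        final long createTime;
        final String location;
        final String sdHash; // key of the shared descriptor row

        StoredPartition(long createTime, String location, String sdHash) {
            this.createTime = createTime;
            this.location = location;
            this.sdHash = sdHash;
        }
    }

    // In-memory maps standing in for the HBMS_SDS and HBMS_PARTITIONS tables.
    final Map<String, String> sdsTable = new HashMap<>();
    final Map<String, StoredPartition> partitionsTable = new HashMap<>();

    /** Hex MD5 digest; choosing MD5 for sd_hash is an assumption of this sketch. */
    static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Sketch of add_partition: store the descriptor once, then the partition row. */
    void addPartition(String dbName, String tableName, List<String> values,
                      int createTime, String location, String descriptor) {
        String sdHash = md5Hex(descriptor);
        sdsTable.putIfAbsent(sdHash, descriptor); // one row per distinct descriptor
        String rowKey = dbName + "," + tableName + "," + String.join(",", values);
        partitionsTable.put(rowKey, new StoredPartition(createTime, location, sdHash));
    }

    public static void main(String[] args) {
        PartitionWritePathSketch store = new PartitionWritePathSketch();
        store.addPartition("default", "sales", Arrays.asList("2016", "06"),
                1467000000, "/warehouse/sales/2016/06", "format=ORC;cols=id,amount");
        store.addPartition("default", "sales", Arrays.asList("2016", "07"),
                1469000000, "/warehouse/sales/2016/07", "format=ORC;cols=id,amount");
        // Two partition rows end up sharing one descriptor row.
        System.out.println(store.partitionsTable.size() + " partition rows, "
                + store.sdsTable.size() + " descriptor row");
    }
}
```
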
Caching
Aggregate stats
• Location: on HBase
• Used at compile time
File footers
• Location: on HBase
• Used at runtime (accessed from tasks)
Tables, partitions, storage descriptors
• Location: on the Metastore server(s)
• Used at compile time

Caching Aggregate Stats
"get_aggr_stats_for(dbName, tblName, partNames, colNames)"
• Gets aggregated stats for columns in each partition; an expensive call
• Used in CBO, Stats Annotation, and Stats Optimizer
• HBMS_AGGR_STATS
  – RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName)
  – Columns: AggrStats proto and AggrStatsBloomFilter proto
• Lookup
  – A new entry is added for each key not found in the cache. AggrStats is calculated on the client side, and the cached entry is saved as a serialized AggrStats proto
  – An AggrStatsBloomFilter is created on the partitions contained in AggrStats
• Invalidation
  – TTL expiry: nodes are evicted from the cache
  – Alter partition, drop partition, analyze, etc. add an invalidation request to a queue
  – An invalidator thread picks up the invalidation request and executes a filter on HBase to remove expired entries
  – It uses the Bloom filter to find all AggrStats protos that contain the candidate partition and removes them from the cache

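The lookup and invalidation scheme above can be sketched as follows. This is a minimal in-memory illustration, not the HBaseStore implementation: the class names, the 1024-bit filter with two cheap hash functions, and the "|" separator in the row key are assumptions made for the example, and a HashMap stands in for the HBMS_AGGR_STATS table. The point it illustrates is that a Bloom filter may produce false positives (an extra eviction, which is safe) but never false negatives (a stale entry surviving).

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a HashMap stands in for the HBMS_AGGR_STATS table.
final class AggrStatsCacheSketch {

    /** Tiny Bloom filter over partition names; size and hashes are arbitrary. */
    static final class PartitionBloom {
        private static final int BITS = 1024;
        private final BitSet bits = new BitSet(BITS);

        private static int[] positions(String partName) {
            int h = partName.hashCode();
            return new int[] {
                Math.floorMod(h, BITS),
                Math.floorMod(h * 31 + partName.length(), BITS)
            };
        }

        void add(String partName) {
            for (int p : positions(partName)) bits.set(p);
        }

        /** May report false positives, never false negatives. */
        boolean mightContain(String partName) {
            for (int p : positions(partName)) {
                if (!bits.get(p)) return false;
            }
            return true;
        }
    }

    static final class Entry {
        final String serializedStats;
        final PartitionBloom bloom = new PartitionBloom();
        Entry(String serializedStats) { this.serializedStats = serializedStats; }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    /** Row key in the spirit of md5(dbName, tblName, partVal1..n, colName). */
    static String rowKey(String db, String tbl, List<String> partNames, String col) {
        String joined = db + "|" + tbl + "|" + String.join("|", partNames) + "|" + col;
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(joined.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Caches stats for a partition set and records the partitions in the filter. */
    void put(String db, String tbl, List<String> partNames, String col, String stats) {
        Entry entry = new Entry(stats);
        for (String part : partNames) entry.bloom.add(part);
        cache.put(rowKey(db, tbl, partNames, col), entry);
    }

    boolean contains(String db, String tbl, List<String> partNames, String col) {
        return cache.containsKey(rowKey(db, tbl, partNames, col));
    }

    int size() { return cache.size(); }

    /** Drops every entry whose filter may contain the changed partition. */
    int invalidate(String partName) {
        int before = cache.size();
        cache.values().removeIf(e -> e.bloom.mightContain(partName));
        return before - cache.size();
    }

    public static void main(String[] args) {
        AggrStatsCacheSketch cache = new AggrStatsCacheSketch();
        cache.put("db", "t", Arrays.asList("p=1", "p=2"), "col", "stats-a");
        cache.put("db", "t", Arrays.asList("p=3"), "col", "stats-b");
        System.out.println("evicted " + cache.invalidate("p=1") + " entry");
    }
}
```
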
Caching File Footers
• ORC footer cache
  – Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA, RowKey: fileId)
  – Read from the AM for split generation (avoids reading lots of HDFS files for split generation)
  – Since fileId is unique, overwriting is not a problem. Stale entries are removed by a cleaner thread
• Skips the transaction layer
  – High overhead
  – Transaction conflict
  – Row mutation is already atomic

HBaseMetaStore Needs Transaction
Atomicity is required
– Create table/partition also creates a storage descriptor
– Alter table also alters partitions
– Drop table also drops table column privileges
HBase doesn't support transactions
– It doesn't support cross-row transactions
HBaseConnection
– Supports different transaction managers in theory
– VanillaHBaseConnection: no transactions

Omid
Transaction layer on top of HBase
Initially developed by Yahoo
Apache incubator project
– First release this Monday
Snapshot isolation
– Natural, as HBase is a versioned database
– No locking, no deadlocks, no blocking for both reads and writes
– When two concurrent transactions write to the same data, the later one aborts
Low overhead

Omid Components
TSO Server (Timestamp Oracle)
– Generates transaction IDs (transid)
– Tracks the status of transactions
TSO Client
– Talks to the TSO
– Caches transaction metadata
– Most reads don't need to talk to the TSO
Compactor
– Runs as an HBase coprocessor
– Removes stale cell versions
Diagram labels: HBase, Compactor, Client, TSO

Omid Operations
Open transaction
– Get transid from the TSO
Read a cell
– Read all versions of the cell from HBase
– Read the latest version committed before the transaction started
Write a cell
– Write the value, versioned with transid, to HBase
Commit
– Generate commitid from the TSO
– The TSO figures out whether there is a conflict using transaction metadata

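The four operations above can be sketched for a single cell. This is a toy model, not Omid's API: the class and method names are hypothetical, a local counter stands in for the TSO, conflict detection is left out, and a transaction's own uncommitted writes are not made visible to itself. It only shows how a snapshot read picks the latest version whose commit timestamp precedes the reader's start timestamp.

```java
import java.util.HashMap;
import java.util.Map;

// Toy single-cell model of snapshot reads; names are hypothetical.
final class SnapshotReadSketch {
    private long clock = 0; // stand-in for the TSO's timestamp counter
    private final Map<Long, Long> commitTable = new HashMap<>();    // transid -> commitid
    private final Map<Long, String> cellVersions = new HashMap<>(); // writer's transid -> value

    /** Open transaction: obtain a start timestamp (the transid). */
    long begin() {
        return ++clock;
    }

    /** Write a cell: the value is versioned with the writer's transid. */
    void write(long transId, String value) {
        cellVersions.put(transId, value);
    }

    /** Commit: obtain a commit timestamp; conflict checks are omitted here. */
    long commit(long transId) {
        long commitId = ++clock;
        commitTable.put(transId, commitId);
        return commitId;
    }

    /** Read a cell: scan all versions, keep the latest one committed before transId. */
    String read(long transId) {
        String visible = null;
        long newestCommit = -1;
        for (Map.Entry<Long, String> version : cellVersions.entrySet()) {
            Long commitId = commitTable.get(version.getKey());
            if (commitId != null && commitId < transId && commitId > newestCommit) {
                newestCommit = commitId;
                visible = version.getValue();
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        SnapshotReadSketch db = new SnapshotReadSketch();
        long writer1 = db.begin();
        db.write(writer1, "v1");
        db.commit(writer1);
        long reader = db.begin();   // snapshot taken after v1 committed
        long writer2 = db.begin();
        db.write(writer2, "v2");
        db.commit(writer2);         // commits after the reader started
        System.out.println("reader sees " + db.read(reader));
    }
}
```

The reader in `main` still sees `v1`, because `v2` was committed after the reader's snapshot was taken.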
Omid Data Structure
Memory management in TSO
– Never runs out of memory (OOM): aborts old transactions
TSO in-memory tables (from the diagram)
– lastCommit: row1 → T20, row2 → T25, row5 → T22
– committed: T10 → T20, T4 → T25, T11 → T30, …
– aborted: T2, …
• Detect transaction conflict at commit time
• Largest chunk of memory
• Construct snapshot at read time
• Partially replicated to client

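Commit-time conflict detection with a lastCommit table can be sketched as below. This is a simplified model rather than Omid's implementation: the class and method names are hypothetical, the table is an unbounded HashMap (so the memory-bounding behaviour mentioned above is not modelled), and a return value of -1 stands in for an abort. A transaction aborts when any row in its write set was committed by another transaction after it started.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Simplified model of write-write conflict detection; names are hypothetical.
final class CommitConflictSketch {
    private long clock = 0; // stand-in for the TSO's timestamp counter
    private final Map<String, Long> lastCommit = new HashMap<>(); // row -> last commit timestamp

    /** Open transaction: obtain a start timestamp. */
    long begin() {
        return ++clock;
    }

    /** Returns the commit timestamp, or -1 if the transaction must abort. */
    long tryCommit(long startTs, Collection<String> writeSet) {
        for (String row : writeSet) {
            Long committedAt = lastCommit.get(row);
            if (committedAt != null && committedAt > startTs) {
                return -1; // someone committed this row after we started
            }
        }
        long commitTs = ++clock;
        for (String row : writeSet) {
            lastCommit.put(row, commitTs);
        }
        return commitTs;
    }

    public static void main(String[] args) {
        CommitConflictSketch tso = new CommitConflictSketch();
        long first = tso.begin();
        long second = tso.begin(); // concurrent with the first transaction
        long firstCommit = tso.tryCommit(first, Arrays.asList("row1"));
        long secondCommit = tso.tryCommit(second, Arrays.asList("row1"));
        System.out.println("first committed at " + firstCommit
                + ", second aborted: " + (secondCommit == -1));
    }
}
```
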
Transaction Conflict
Two concurrent DDL operations write to the same data
– Proper retry logic
Task node writes: ORC footer cache
– High chance of write conflicts
– Row mutation is atomic in HBase
– Cross-row atomicity is not required
– Bypass the transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)

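The "proper retry logic" for conflicting DDL operations is not spelled out on the slide. One plausible shape, offered only as a sketch, is a bounded retry loop around the transactional operation. The names `RetrySketch`, `withRetry`, and `ConflictException` are hypothetical and do not come from Hive or Omid.

```java
import java.util.function.Supplier;

// Hypothetical retry helper; not part of Hive or Omid.
final class RetrySketch {

    /** Hypothetical signal that a commit lost a write-write conflict. */
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    /** Re-runs the whole transaction body until it commits or attempts run out. */
    static <T> T withRetry(Supplier<T> transaction, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return transaction.get();
            } catch (ConflictException e) {
                if (attempt == maxAttempts) {
                    throw e; // give up and surface the conflict to the caller
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = withRetry(() -> {
            // Simulate two lost conflicts before a successful commit.
            if (++calls[0] < 3) {
                throw new ConflictException("write-write conflict");
            }
            return "table altered";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Each retry must restart the transaction from scratch so that it reads a fresh snapshot; retrying only the commit would hit the same conflict again.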
Deployment
Server-side components in HBase
– Server-side filter
– Omid compactor
– Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
– hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore
Diagram labels: Server Side Filter, Omid Compactor, HBase, TSO, Hive MetaStore

Deploy Omid
Create Omid tables in HBase
– omid.sh create-hbase-commit-table
– omid.sh create-hbase-timestamp-table
Start the Omid TSO
– omid.sh tso
Related config in hive-site.xml
– hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
– tso.host=localhost
– tso.port=54758
– omid.client.connectionType=DIRECT

Instantiate HBase Metastore
Instantiate HBase tables from scratch
– hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
– One-way import from ObjectStore to HBaseStore
– hive --service hbaseimport

TPCDS queries
[Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. Queries: 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Vertical axis: 0 to 6000]
1824 partitions: sweet spot for ObjectStore
Average speedup for all TPCDS queries
– 219 (without Omid)
– 212 (with Omid)

Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
– Minor holes: event notification/version/constraints
– Deprecated: listTableNamesByFilter/listPartitionNamesByFilter
– Tools enhancement
– ACID is not supported
Runs most e2e queries
Fixing unit tests
– TestMiniTezCliDriver: all pass
– TestCliDriver: HIVE-14097 pending review
– Not production quality yet

Future Work - ACID
Transaction metadata is stored in the Metastore
– Locks
– Txns
– Compactions
Data structure is harder to de-normalize
New work: transaction server
– Keeps the lock and transaction tree in memory

Future Work – HA via HBase Coprocessor
Two new server components
– Omid TSO Server
– Transaction Server
All servers need HA
– Management headache
Automatic HA through HBase coprocessor
Diagram labels: two Region Servers, each with a "TSO Server via CoProcessor" box

Future Work – Other
Stats aggregation
– Coprocessor
Improving ObjectCache
– Rudimentary implementation currently
– LRU
Omid consumes high CPU
– 300% CPU always, by design
– High throughput, avoids context switches
– Might be an issue for a small system

Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New Bottleneck - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work – HA via HBase Coprocessor
- Future Work – Other
- Thank You
24 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching
Aggregate Statsbull Location - on HBasebull Compile time
File Footers bull Location - on HBasebull Runtime - accessed from tasks
Tables Partitions Storage Descriptors bull Location - on Metastore server(s)bull Compile time
25 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching Aggregate Stats
ldquoget_aggr_stats_for(dbName tblName partNames colNames)rdquo
bull Gets aggregated stats for columns in each partition ndash expensive callbull Used in CBO Stats Annotation Stats Optimizerbull HBMS_AGGR_STATS
bull RowKey md5(dbName tblName partVal1 partValn colName) bull Columns AggrStats proto and AggrStatsBloomFilter proto
bull Lookup bull New entry added for each key not found in cache AggrStats calculated on client
side amp cached entry saved as serialized AggrStats proto bull AggrStatsBloomFilter created on partitions contained in AggrStats
bull Invalidation bull TTL expiry nodes evicted from cachebull Alter partition Drop partition Analyze etc add invalidation request to a queuebull Invalidator thread picks invalidation request amp executes a filter on HBase to
removes expired entriesbull Uses the bloom filter to find all AggrStats proto contains the candidate partition amp
removes them from the cache
26 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Caching File Footers
bull ORC footer cachebull Task write file footers to a cache table on HBase (HBMS_FILE_METADATA RowKey fileId)bull Read from AM for split generation (avoids reading lots of HDFS files for split generation)bull Since fileId is unique overwrite not a problem Stale entries removed by a cleaner
thread
bull Skip transactionbull High overheadbull Transaction conflictbull Row mutation is already atomic
27 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBaseMetaStore Needs Transaction
- Atomicity is required
  - Create table/partition: also create the storage descriptor
  - Alter table: also alter partitions
  - Drop table: also drop table column privileges
- HBase doesn't support transactions
  - Doesn't support cross-row transactions
- HBaseConnection
  - Supports different transaction managers in theory
  - VanillaHBaseConnection: no transaction
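To make the atomicity requirement concrete, here is a small sketch of a pluggable connection layer. The interface and both implementations are hypothetical simplifications, not Hive's actual HBaseConnection API, and the buffering variant is only a toy stand-in for a real transaction manager.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class ConnectionSketch {

    // Hypothetical, simplified connection interface.
    interface MetaConnection {
        void begin();
        void put(String rowKey, String value);
        void commit();
        void rollback();
    }

    // No transaction manager: each put lands immediately and rollback cannot undo it.
    static final class VanillaConnection implements MetaConnection {
        private final Map<String, String> table;
        VanillaConnection(Map<String, String> table) { this.table = table; }
        public void begin() { }
        public void put(String rowKey, String value) { table.put(rowKey, value); }
        public void commit() { }
        public void rollback() { }
    }

    // Toy transactional variant: buffers writes and applies them together on commit.
    static final class BufferingConnection implements MetaConnection {
        private final Map<String, String> table;
        private final Map<String, String> pending = new LinkedHashMap<>();
        BufferingConnection(Map<String, String> table) { this.table = table; }
        public void begin() { pending.clear(); }
        public void put(String rowKey, String value) { pending.put(rowKey, value); }
        public void commit() { table.putAll(pending); pending.clear(); }
        public void rollback() { pending.clear(); }
    }

    // "Create table" touches two rows: the table itself and its storage descriptor.
    static void createTable(MetaConnection conn, String name, boolean failHalfway) {
        conn.begin();
        try {
            conn.put("TBL:" + name, "table");
            if (failHalfway) {
                throw new IllegalStateException("simulated failure between the two writes");
            }
            conn.put("SD:" + name, "storage descriptor");
            conn.commit();
        } catch (IllegalStateException e) {
            conn.rollback();
        }
    }

    public static void main(String[] args) {
        Map<String, String> plain = new HashMap<>();
        createTable(new VanillaConnection(plain), "t1", true);
        System.out.println("vanilla rows after failure: " + plain.keySet());

        Map<String, String> buffered = new HashMap<>();
        createTable(new BufferingConnection(buffered), "t1", true);
        System.out.println("buffered rows after failure: " + buffered.keySet());
    }
}
```

With the vanilla connection a failure between the two writes leaves a table row without its storage descriptor, which is the kind of partial state the slide is warning about.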
Omid
- Transaction layer on top of HBase
- Initially developed by Yahoo
- Apache incubator project
  - First release this Monday
- Snapshot isolation
  - Natural, as HBase is a versioned database
  - No locking, no deadlock, no blocking for both reads and writes
  - When two concurrent transactions write to the same data, the later one aborts
- Low overhead
Omid Components
- TSO Server (Timestamp Oracle)
  - Generate transid
  - Status of transactions
- TSO Client
  - Talks to TSO
  - Caches transaction metadata
  - Most reads don't need to talk to TSO
- Compactor
  - Runs as an HBase Coprocessor
  - Removes stale cell versions
[Diagram labels: HBase, Compactor, Client, TSO]
Omid Operations
- Open transaction
  - Get transid from TSO
- Read a cell
  - Read all versions of the cell from HBase
  - Read the latest committed version before transaction start
- Write a cell
  - Write the value, versioned with transid, to HBase
- Commit
  - Generate commitid from TSO
  - TSO figures out whether there is a conflict using transaction metadata
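The read and write rules above can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: a single counter plays the timestamp oracle, plain maps stand in for HBase cell versions and for the commit metadata, and conflict checking at commit is left out. None of the names come from Omid itself.

```java
import java.util.HashMap;
import java.util.Map;

class SnapshotReadSketch {

    private long clock = 0; // stand-in for the timestamp oracle
    // row -> (writer transaction id -> value); each write is versioned with the writer's id
    private final Map<String, Map<Long, String>> cells = new HashMap<>();
    // writer transaction id -> commit timestamp
    private final Map<Long, Long> committed = new HashMap<>();

    long begin() {
        return ++clock;
    }

    void write(long txId, String row, String value) {
        cells.computeIfAbsent(row, r -> new HashMap<>()).put(txId, value);
    }

    // Conflict detection is deliberately omitted in this sketch.
    void commit(long txId) {
        committed.put(txId, ++clock);
    }

    // Looks at all versions of the cell and returns the one with the latest
    // commit timestamp that is earlier than the reader's start timestamp.
    String read(long txId, String row) {
        Map<Long, String> versions = cells.get(row);
        if (versions == null) {
            return null;
        }
        String best = null;
        long bestCommit = -1;
        for (Map.Entry<Long, String> e : versions.entrySet()) {
            long writer = e.getKey();
            if (writer == txId) {
                return e.getValue(); // a transaction sees its own write
            }
            Long commitTs = committed.get(writer);
            if (commitTs != null && commitTs < txId && commitTs > bestCommit) {
                bestCommit = commitTs;
                best = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SnapshotReadSketch store = new SnapshotReadSketch();
        long t1 = store.begin();
        store.write(t1, "tbl", "v1");
        store.commit(t1);

        long reader = store.begin();
        long t2 = store.begin();
        store.write(t2, "tbl", "v2");
        store.commit(t2);

        // The reader started before t2 committed, so it keeps seeing the older version.
        System.out.println("reader sees: " + store.read(reader, "tbl"));
        System.out.println("new transaction sees: " + store.read(store.begin(), "tbl"));
    }
}
```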
Omid Data Structure
- Memory management in TSO
  - Never runs OOM; aborts old transactions
[Diagram: in-memory tables held by the TSO]
lastCommit: row1 -> T20, row2 -> T25, row5 -> T22
committed: T10 -> T20, T4 -> T25, T11 -> T30
aborted: T2 ...
- Detect transaction conflict at commit time
- Largest chunk of memory
- Construct snapshot at read time
- Partially replicated to client
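A minimal sketch of commit-time conflict detection with a lastCommit table like the one in the diagram. It assumes one counter for both start and commit timestamps and a per-row map of the last commit timestamp. The names and the -1 abort marker are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class ConflictCheckSketch {

    private long clock = 0; // stand-in for the timestamp oracle
    // row -> commit timestamp of the last transaction that wrote it
    private final Map<String, Long> lastCommit = new HashMap<>();

    long begin() {
        return ++clock;
    }

    // Returns the commit timestamp, or -1 when the transaction has to abort because
    // some row in its write set was committed by another transaction after it started.
    long tryCommit(long startTs, Set<String> writeSet) {
        for (String row : writeSet) {
            Long last = lastCommit.get(row);
            if (last != null && last > startTs) {
                return -1;
            }
        }
        long commitTs = ++clock;
        for (String row : writeSet) {
            lastCommit.put(row, commitTs);
        }
        return commitTs;
    }

    public static void main(String[] args) {
        ConflictCheckSketch tso = new ConflictCheckSketch();
        long a = tso.begin();
        long b = tso.begin();
        System.out.println("a commits row1: " + (tso.tryCommit(a, Set.of("row1")) > 0));
        System.out.println("b aborts on row1: " + (tso.tryCommit(b, Set.of("row1")) == -1));
    }
}
```

The check needs only the write set and the start timestamp, so no locks are held while a transaction runs.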
Transaction Conflict
- Two concurrent DDL operations write to the same data
  - Proper retry logic
- Task node writes: ORC footer cache
  - High chance of write conflict
  - Row mutation is atomic in HBase
  - Cross-row atomicity is not required
  - Bypass the transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
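The "proper retry logic" for conflicting DDL could look roughly like the helper below. It is a sketch only: the exception type, the helper name, and the fixed attempt limit are assumptions, and a real implementation would probably add a backoff between attempts.

```java
import java.util.function.Supplier;

class RetrySketch {

    // Hypothetical signal that the transaction layer aborted the operation.
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    // Re-runs the whole operation when it was aborted by a write conflict.
    static <T> T withRetry(int maxAttempts, Supplier<T> operation) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (ConflictException e) {
                last = e; // another transaction won; try again with a fresh snapshot
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = withRetry(3, () -> {
            if (++calls[0] < 3) {
                throw new ConflictException("simulated write conflict");
            }
            return "altered";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```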
Deployment
- Server side components in HBase
  - Server side filter
  - Omid compactor
  - Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
- New config in hive-site.xml
  - hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram labels: Server Side Filter, Omid Compactor, HBase, TSO, Hive MetaStore]
Deploy Omid
- Create Omid tables in HBase
  - omid.sh create-hbase-commit-table
  - omid.sh create-hbase-timestamp-table
- Start Omid TSO
  - omid.sh tso
- Related config in hive-site.xml
  - hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
  - tso.host=localhost
  - tso.port=54758
  - omid.client.connectionType=DIRECT
Instantiate HBase Metastore
- Instantiate HBase tables from scratch
  - hive --service hbaseschematool --install
- hbaseimport: import an existing Hive Metastore
  - One-way import from ObjectStore to HBaseStore
  - hive --service hbaseimport
TPCDS queries
[Figure: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. X-axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y-axis ticks: 0 to 6000.]
- 1824 partitions
- Sweet spot for ObjectStore
- Average speed-up for all TPCDS queries
  - 219 (without Omid)
  - 212 (with Omid)
Current Status
- hbase-metastore branch merged to master last September
- Turned off by default
- Feature parity: almost
  - Minor holes: event notification, version, constraints
  - Deprecate listTableNamesByFilter/listPartitionNamesByFilter
  - Tools enhancement
  - ACID is not supported
- Runs most e2e queries
- Fixing unit tests
  - TestMiniTezCliDriver: all pass
  - TestCliDriver: HIVE-14097 pending review
  - Not production quality yet
Future Work - ACID
- Transaction metadata is stored in Metastore
  - Locks
  - Txns
  - Compactions
- Data structure is harder to de-normalize
- New work: transaction server
  - Keep lock and transaction tree in memory
Future Work - HA via HBase Coprocessor
- Two new server components
  - Omid TSO Server
  - Transaction Server
- All servers need HA
  - Management headache
- Automatic HA through HBase Coprocessor
[Diagram labels: TSO Server via CoProcessor (x2), Region Server (x2)]
Future Work - Other
- Stats Aggregation
  - Coprocessor
- Improving ObjectCache
  - Rudimentary implementation currently
  - LRU
- Omid consuming high CPU
  - 300% CPU always, by design
  - High throughput, avoids context switches
  - Might be an issue for small systems
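For the LRU item under ObjectCache, a bounded least-recently-used cache can be sketched in a few lines on top of java.util.LinkedHashMap. This only illustrates the eviction policy. How the metastore's ObjectCache is or will be implemented is not stated on the slide, and the class name here is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: access-ordered LinkedHashMap that evicts the eldest entry when full.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {

    private final int capacity;

    LruCacheSketch(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("db1.t1", "table object 1");
        cache.put("db1.t2", "table object 2");
        cache.get("db1.t1");                   // touch t1 so that t2 becomes the eldest
        cache.put("db1.t3", "table object 3"); // exceeds capacity and evicts t2
        System.out.println(cache.keySet());
    }
}
```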
Thank You
28 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
HBaseMetaStore Needs Transaction
Atomic is requiredndash Create table partition also create storage descriptorndash Alter table also alter partitionsndash Drop table also drop table column privilege
HBase donrsquot support transactionndash Donrsquot support cross-row transactions
HBaseConnectionndash Support different transaction manager in theoryndash VanillaHBaseConnection no transaction
29 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid
Transaction layer on top of Hbase Initially developed by Yahoo Apache incubator project
ndash First release this Monday
Snapshot isolationndash Natural as HBase is a versioned databasendash No locking no dead lock no blocking for both read and writendash Two concurrent transaction write to the same data the later one aborts
Low overhead
30 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid Components
TSO Server (Timestamp Oracle)ndash Generate transidndash Status of transaction
TSO Clientndash Talk to TSOndash Cache transaction metadatandash Most read donrsquot need to talk to TSO
Compactorndash Run as HBase Coprocessorndash Remove stale cell versions
HBaseCompactor
Client
TSO
31 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid Operations
Open transactionndash Get transid from TSO
Read a cellndash Read all versions of the cell from HBasendash Read latest committed version before transaction start
Write a cellndash Write value versioned with transid to HBase
Commitndash Generate commitid from TSOndash TSO figure out if there is conflict using transaction metadata
32 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Omid Data Structure
Memory management in TSOndash Never run OOM abort old transactions
TSO
row1 T20
row2 T25
row5 T22
lastCommit committedT10 T20
T4 T25
T11 T30
T2 hellip hellip
aborted
bull Detect transaction conflict at commit time
bull Largest trunk of memory
bull Construct snapshot at read time
bull Partially replicated to client
33 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Transaction Conflict
Two concurrent DDL operations write to the same data
- Proper retry logic
Task node writes: ORC footer cache
- High chance of write conflicts
- Row mutation is atomic in HBase
- Cross-row atomicity is not required
- Bypass the transaction layer
public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
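The slide does not say what "proper retry logic" looks like. A common pattern is to rerun the whole operation with exponential backoff and jitter when the commit is rejected, sketched here in Python. The names `WriteConflict` and `run_with_retry`, and the default attempt count and delay, are assumptions for illustration only.

```python
import random
import time


class WriteConflict(Exception):
    """Raised by an operation whose transaction was aborted due to a conflict."""


def run_with_retry(operation, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Runs operation(), retrying on WriteConflict with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except WriteConflict:
            if attempt == max_attempts:
                raise  # give up and surface the conflict to the caller
            # Delay doubles on each attempt; jitter keeps competing clients apart.
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```

The operation passed in should open a fresh transaction on every call, so that each retry reads a new snapshot rather than replaying stale reads.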
Deployment
Server-side components in HBase
- Server-side filter
- Omid compactor
- Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
- hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore, TSO, and HBase hosting the Server Side Filter and Omid Compactor]
Deploy Omid
Create Omid tables in HBase
- omid.sh create-hbase-commit-table
- omid.sh create-hbase-timestamp-table
Start the Omid TSO
- omid.sh tso
Related config in hive-site.xml
- hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
- tso.host=localhost
- tso.port=54758
- omid.client.connectionType=DIRECT
Instantiate HBase Metastore
Instantiate HBase tables from scratch
- hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
- One-way import from ObjectStore to HBaseStore
- hive --service hbaseimport
TPCDS queries
[Chart: Query Plan Time for TPCDS queries. Series: HBaseStore, HBaseStore+Omid, ObjectStore. X-axis: Query 7, 15, 27, 29, 39, 46, 56, 68, 70, 76. Y-axis: 0 to 6000.]
1824 partitions
Sweet spot for ObjectStore
Average speedup for all TPCDS queries
- 2.19 (without Omid)
- 2.12 (with Omid)
Current Status
hbase-metastore branch merged to master last September
Turned off by default
Feature parity? Almost
- Minor holes: event notification/version/constraints
- Deprecate listTableNamesByFilter/listPartitionNamesByFilter
- Tools enhancement
- ACID is not supported
Runs most e2e queries
Fixing unit tests
- TestMiniTezCliDriver: all pass
- TestCliDriver: HIVE-14097, pending review
- Not production quality yet
Future Work - ACID
Transaction metadata is stored in Metastore
- Locks
- Txns
- Compactions
Data structure is harder to de-normalize
New work: transaction server
- Keep lock and transaction tree in memory
Future Work - HA via HBase Coprocessor
Two new server components
- Omid TSO Server
- Transaction Server
All servers need HA
- Management headache
Automatic HA through HBase Coprocessor
[Diagram: two Region Servers, each running the TSO Server via Coprocessor]
Future Work - Other
Stats aggregation
- Coprocessor
Improving ObjectCache
- Rudimentary implementation currently
- LRU
Omid consumes high CPU
- 300% CPU always, by design
- High throughput; avoids context switches
- Might be an issue for small systems
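For readers unfamiliar with the LRU policy mentioned for ObjectCache, here is a minimal Python sketch. It is a generic illustration, not Hive's ObjectCache; the class name `LRUCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache: evicts the entry that was touched longest ago."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # least recently used entry first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry
```

A metadata cache built this way keeps hot tables and partitions resident while cold entries age out.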
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New Bottleneck - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work - HA via HBase Coprocessor
- Future Work - Other
- Thank You
33 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Transaction Conflict
Two concurrent DDL write to the same datandash Proper retry logic
Task node writes - ORC footer cache
ndash High chance for write conflictndash Row mutation is atomic in Hbasendash Cross row atomic is not requiredndash Bypass transaction layer
public void putFileMetadata(ListltLonggt fileIds ListltByteBuffergt metadata FileMetadataExprType type)
34 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
35 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Deployment
Server side components in HBasendash Server side filterndash Omid compactorndash Copy related hive jars into hbase hive-commonjar hive-metastorejar hive-serde-jar
New config in hive-sitexmlndash hivemetastorerawstoreimpl orgapachehadoophivemetastorehbaseHBaseStore
Server Side Filter
Omid Compactor
HBase
TSO
Hive MetaStore
36 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Deploy Omid
Create Omid Tables in HBasendash omidsh create-hbase-commit-tablendash omidsh create-hbase-timestamp-table
Start Omid TSOndash omidsh tso
Related config in hive-sitexmlndash hivemetastorehbaseconnectionclass=orgapachehadoophivemetastorehbaseOmid
HBaseConnectionndash tsohost=localhostndash tsoport=54758ndash omidclientconnectionType=DIRECT
37 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Instantiate HBase Metastore
Instantiate Hbase Tables from scratchndash hive --service hbaseschematool --install
Hbaseimport import existing Hive Metastorendash One way import from ObjectStore to HBaseStorendash hive --service hbaseimport
38 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
TPCDS queries
Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760
1000
2000
3000
4000
5000
6000
Query Plan Time for TPCDS queries
HBaseStore HBaseStore+Omid ObjectStore
1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries
ndash 219 (without Omid)ndash 212 (With Omid)
40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Current Status
hbase-metastore branch merged to master last September Turn off by default Feature parity Almost
ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported
Run most e2e queries Fixing unit tests
ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet
42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work - ACID
Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions
Data structure is harder to de-normalize New work transaction server
ndash Keep lock and transaction tree in memory
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work – Other
Stats aggregation
– Coprocessor
Improving ObjectCache
– Rudimentary implementation currently
– LRU
Omid consuming high CPU
– 300% CPU always, by design
– High throughput, avoids context switches
– Might be an issue for small systems
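The ObjectCache bullet mentions LRU without further detail. As a reminder of what least-recently-used eviction means, here is a minimal Python sketch. It is not Hive's ObjectCache code; the class name, the capacity, and the "db.table" style keys are all hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded cache that evicts the least recently used entry first."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._entries = OrderedDict()

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        # A hit makes the entry the most recently used one.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self._capacity:
            # The oldest entry sits at the front of the ordered dict.
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)


cache = LRUCache(capacity=2)
cache.put("default.t1", "metadata for t1")
cache.put("default.t2", "metadata for t2")
cache.get("default.t1")                     # t1 becomes most recent
cache.put("default.t3", "metadata for t3")  # evicts t2
print(cache.get("default.t2"))
```

A real metadata cache would also need invalidation when a table or partition changes, which this sketch leaves out.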
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New Bottleneck - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work – HA via HBase Coprocessor
- Future Work – Other
- Thank You
Deployment
Server-side components in HBase
– Server-side filter
– Omid compactor
– Copy related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde.jar
New config in hive-site.xml
– hive.metastore.rawstore.impl = org.apache.hadoop.hive.metastore.hbase.HBaseStore
[Diagram: Hive MetaStore and TSO alongside HBase, which hosts the Server Side Filter and the Omid Compactor]
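To make the hive-site.xml change concrete, the following Python sketch renders the setting from the slide as a standard Hadoop-style `<property>` entry. The property name and class name are the slide's values with their dots restored; the helper name `render_hive_site` is hypothetical, and a real deployment would merge the entry into an existing hive-site.xml rather than generate a new file.

```python
import xml.etree.ElementTree as ET


def render_hive_site(settings):
    """Render name/value pairs as <property> elements under <configuration>."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Setting taken from the slide (dots restored from the flattened transcript).
HBASE_METASTORE_SETTINGS = {
    "hive.metastore.rawstore.impl":
        "org.apache.hadoop.hive.metastore.hbase.HBaseStore",
}

print(render_hive_site(HBASE_METASTORE_SETTINGS))
```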
Deploy Omid
Create Omid tables in HBase
– omid.sh create-hbase-commit-table
– omid.sh create-hbase-timestamp-table
Start Omid TSO
– omid.sh tso
Related config in hive-site.xml
– hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection
– tso.host=localhost
– tso.port=54758
– omid.client.connectionType=DIRECT
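The Omid steps can be summarized as a short dry-run sketch in Python. It only prints the commands in the order the slide lists them and collects the client settings; nothing is executed. The function names are hypothetical, and the key names and values are the slide's, with dots restored on the assumption that the transcript dropped them.

```python
def omid_deploy_steps(omid_sh="omid.sh"):
    """Commands from the slide, in slide order: create tables, then start the TSO."""
    return [
        [omid_sh, "create-hbase-commit-table"],
        [omid_sh, "create-hbase-timestamp-table"],
        [omid_sh, "tso"],
    ]


def omid_client_settings(host="localhost", port=54758):
    """hive-site.xml entries from the slide, as a plain dict."""
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return {
        "hive.metastore.hbase.connection.class":
            "org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection",
        "tso.host": host,
        "tso.port": str(port),
        "omid.client.connectionType": "DIRECT",
    }


for step in omid_deploy_steps():
    print(" ".join(step))
for name, value in omid_client_settings().items():
    print(f"{name}={value}")
```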
Instantiate HBase Metastore
Instantiate HBase tables from scratch
– hive --service hbaseschematool --install
hbaseimport: import an existing Hive Metastore
– One-way import from ObjectStore to HBaseStore
– hive --service hbaseimport
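The two paths on this slide, a fresh install or a one-way import, can be wrapped in a small Python helper. This is a sketch under assumptions: the function name and the `dry_run` flag are hypothetical, and the only commands used are the two shown on the slide. By default it returns the command without running it.

```python
import subprocess


def init_hbase_metastore(import_existing=False, dry_run=True):
    """Pick the slide's command for a fresh install or a one-way import."""
    if import_existing:
        cmd = ["hive", "--service", "hbaseimport"]
    else:
        cmd = ["hive", "--service", "hbaseschematool", "--install"]
    if not dry_run:
        # Requires a configured Hive installation on the PATH.
        subprocess.run(cmd, check=True)
    return cmd


print(" ".join(init_hbase_metastore()))
print(" ".join(init_hbase_metastore(import_existing=True)))
```

Because the import is described as one-way, changes made in the HBase metastore afterwards would not flow back to the original ObjectStore database.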
38 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
TPCDS queries
Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760
1000
2000
3000
4000
5000
6000
Query Plan Time for TPCDS queries
HBaseStore HBaseStore+Omid ObjectStore
1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries
ndash 219 (without Omid)ndash 212 (With Omid)
40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Current Status
hbase-metastore branch merged to master last September Turn off by default Feature parity Almost
ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported
Run most e2e queries Fixing unit tests
ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet
42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work - ACID
Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions
Data structure is harder to de-normalize New work transaction server
ndash Keep lock and transaction tree in memory
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash Other
Stats Aggregationndash Coprocessor
Improving ObjectCachendash Rudimentary implementation currentlyndash LRU
Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system
45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work ndash HA via HBase Coprocessor
- Future Work ndash Other
- Thank You
-
35 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Deployment
Server side components in HBasendash Server side filterndash Omid compactorndash Copy related hive jars into hbase hive-commonjar hive-metastorejar hive-serde-jar
New config in hive-sitexmlndash hivemetastorerawstoreimpl orgapachehadoophivemetastorehbaseHBaseStore
Server Side Filter
Omid Compactor
HBase
TSO
Hive MetaStore
36 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Deploy Omid
Create Omid Tables in HBasendash omidsh create-hbase-commit-tablendash omidsh create-hbase-timestamp-table
Start Omid TSOndash omidsh tso
Related config in hive-sitexmlndash hivemetastorehbaseconnectionclass=orgapachehadoophivemetastorehbaseOmid
HBaseConnectionndash tsohost=localhostndash tsoport=54758ndash omidclientconnectionType=DIRECT
37 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Instantiate HBase Metastore
Instantiate Hbase Tables from scratchndash hive --service hbaseschematool --install
Hbaseimport import existing Hive Metastorendash One way import from ObjectStore to HBaseStorendash hive --service hbaseimport
38 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
TPCDS queries
Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760
1000
2000
3000
4000
5000
6000
Query Plan Time for TPCDS queries
HBaseStore HBaseStore+Omid ObjectStore
1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries
ndash 219 (without Omid)ndash 212 (With Omid)
40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Current Status
hbase-metastore branch merged to master last September Turn off by default Feature parity Almost
ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported
Run most e2e queries Fixing unit tests
ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet
42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work - ACID
Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions
Data structure is harder to de-normalize New work transaction server
ndash Keep lock and transaction tree in memory
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash Other
Stats Aggregationndash Coprocessor
Improving ObjectCachendash Rudimentary implementation currentlyndash LRU
Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system
45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work ndash HA via HBase Coprocessor
- Future Work ndash Other
- Thank You
-
36 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Deploy Omid
Create Omid Tables in HBasendash omidsh create-hbase-commit-tablendash omidsh create-hbase-timestamp-table
Start Omid TSOndash omidsh tso
Related config in hive-sitexmlndash hivemetastorehbaseconnectionclass=orgapachehadoophivemetastorehbaseOmid
HBaseConnectionndash tsohost=localhostndash tsoport=54758ndash omidclientconnectionType=DIRECT
37 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Instantiate HBase Metastore
Instantiate Hbase Tables from scratchndash hive --service hbaseschematool --install
Hbaseimport import existing Hive Metastorendash One way import from ObjectStore to HBaseStorendash hive --service hbaseimport
38 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
TPCDS queries
Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760
1000
2000
3000
4000
5000
6000
Query Plan Time for TPCDS queries
HBaseStore HBaseStore+Omid ObjectStore
1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries
ndash 219 (without Omid)ndash 212 (With Omid)
40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Current Status
hbase-metastore branch merged to master last September Turn off by default Feature parity Almost
ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported
Run most e2e queries Fixing unit tests
ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet
42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work - ACID
Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions
Data structure is harder to de-normalize New work transaction server
ndash Keep lock and transaction tree in memory
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash Other
Stats Aggregationndash Coprocessor
Improving ObjectCachendash Rudimentary implementation currentlyndash LRU
Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system
45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work ndash HA via HBase Coprocessor
- Future Work ndash Other
- Thank You
-
37 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Instantiate HBase Metastore
Instantiate Hbase Tables from scratchndash hive --service hbaseschematool --install
Hbaseimport import existing Hive Metastorendash One way import from ObjectStore to HBaseStorendash hive --service hbaseimport
38 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
TPCDS queries
Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760
1000
2000
3000
4000
5000
6000
Query Plan Time for TPCDS queries
HBaseStore HBaseStore+Omid ObjectStore
1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries
ndash 219 (without Omid)ndash 212 (With Omid)
40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Current Status
hbase-metastore branch merged to master last September Turn off by default Feature parity Almost
ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported
Run most e2e queries Fixing unit tests
ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet
42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work - ACID
Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions
Data structure is harder to de-normalize New work transaction server
ndash Keep lock and transaction tree in memory
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash Other
Stats Aggregationndash Coprocessor
Improving ObjectCachendash Rudimentary implementation currentlyndash LRU
Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system
45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work ndash HA via HBase Coprocessor
- Future Work ndash Other
- Thank You
-
38 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
TPCDS queries
Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760
1000
2000
3000
4000
5000
6000
Query Plan Time for TPCDS queries
HBaseStore HBaseStore+Omid ObjectStore
1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries
ndash 219 (without Omid)ndash 212 (With Omid)
40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Current Status
hbase-metastore branch merged to master last September Turn off by default Feature parity Almost
ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported
Run most e2e queries Fixing unit tests
ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet
42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work - ACID
Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions
Data structure is harder to de-normalize New work transaction server
ndash Keep lock and transaction tree in memory
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash Other
Stats Aggregationndash Coprocessor
Improving ObjectCachendash Rudimentary implementation currentlyndash LRU
Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system
45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work ndash HA via HBase Coprocessor
- Future Work ndash Other
- Thank You
-
39 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
TPCDS queries
Query 7 Query 15 Query 27 Query 29 Query 39 Query 46 Query 56 Query 68 Query 70 Query 760
1000
2000
3000
4000
5000
6000
Query Plan Time for TPCDS queries
HBaseStore HBaseStore+Omid ObjectStore
1824 partitions Sweetspot for ObjectStore Average Speed up for all TPCDS queries
ndash 219 (without Omid)ndash 212 (With Omid)
40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Current Status
hbase-metastore branch merged to master last September Turn off by default Feature parity Almost
ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported
Run most e2e queries Fixing unit tests
ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet
42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work - ACID
Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions
Data structure is harder to de-normalize New work transaction server
ndash Keep lock and transaction tree in memory
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash Other
Stats Aggregationndash Coprocessor
Improving ObjectCachendash Rudimentary implementation currentlyndash LRU
Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system
45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work ndash HA via HBase Coprocessor
- Future Work ndash Other
- Thank You
-
40 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
AgendaMotivation
System Design
Caching Strategy
Transaction Management
Deployment
Experimental Results
Future Work
41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Current Status
hbase-metastore branch merged to master last September Turn off by default Feature parity Almost
ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported
Run most e2e queries Fixing unit tests
ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet
42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work - ACID
Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions
Data structure is harder to de-normalize New work transaction server
ndash Keep lock and transaction tree in memory
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash Other
Stats Aggregationndash Coprocessor
Improving ObjectCachendash Rudimentary implementation currentlyndash LRU
Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system
45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work ndash HA via HBase Coprocessor
- Future Work ndash Other
- Thank You
-
41 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Current Status
hbase-metastore branch merged to master last September Turn off by default Feature parity Almost
ndash Minor holes event notificationversionconstraintsndash Deprecate listTableNamesByFilterlistPartitionNamesByFilterndash Tools enhancementndash ACID is not supported
Run most e2e queries Fixing unit tests
ndash TestMiniTezCliDriver all passndash TestCliDriver HIVE-14097 pending reviewndash Not production quality yet
42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work - ACID
Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions
Data structure is harder to de-normalize New work transaction server
ndash Keep lock and transaction tree in memory
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash Other
Stats Aggregationndash Coprocessor
Improving ObjectCachendash Rudimentary implementation currentlyndash LRU
Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system
45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work ndash HA via HBase Coprocessor
- Future Work ndash Other
- Thank You
-
42 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work - ACID
Transaction metadata is stored in Metastorendash Locksndash Txnsndash Compactions
Data structure is harder to de-normalize New work transaction server
ndash Keep lock and transaction tree in memory
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash Other
Stats Aggregationndash Coprocessor
Improving ObjectCachendash Rudimentary implementation currentlyndash LRU
Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system
45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Thank You
- Hive Hbase Metastore - Improving Hive with a Big Data Metadata
- Agenda
- What is Hive MetaStore
- Low latency in Hive
- New BottleNet - Metastore
- Besides Latency
- ER Diagram for ObjectStore Database
- How About Improving ObjectStore
- Agenda (2)
- System Architecture
- RDBMS schema
- RDBMS schema (2)
- HBase schema
- HBase schema (2)
- De-normalization
- Partition Keys
- Typed Partition Keys
- HBase schema (3)
- HBase schema (4)
- HBase schema (5)
- HBase schema (6)
- HBase schema (7)
- Agenda (3)
- Caching
- Caching Aggregate Stats
- Caching File Footers
- Agenda (4)
- HBaseMetaStore Needs Transaction
- Omid
- Omid Components
- Omid Operations
- Omid Data Structure
- Transaction Conflict
- Agenda (5)
- Deployment
- Deploy Omid
- Instantiate HBase Metastore
- Agenda (6)
- TPCDS queries
- Agenda (7)
- Current Status
- Future Work - ACID
- Future Work ndash HA via HBase Coprocessor
- Future Work ndash Other
- Thank You
-
43 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash HA via HBase Coprocessor
Two new server componentsndash Omid TSO Serverndash Transaction Server
All servers need HAndash Management headache
Automatic HA through HBase Coprocessor
TSO Server via CoProcessor
TSO Server via CoProcessor
Region Server Region Server
44 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Future Work ndash Other
Stats Aggregationndash Coprocessor
Improving ObjectCachendash Rudimentary implementation currentlyndash LRU
Omid consuming high CPUndash 300 CPU always by designndash High throughput avoid context switchndash Might be an issue for small system
45 copy Hortonworks Inc 2011 ndash 2016 All Rights Reserved
Thank You