HBase @ Hadoop Day Seattle
![Page 1: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/1.jpg)
HBase
Amandeep Khurana
University of California, Santa Cruz
Twitter: @amansk
Tuesday, August 17, 2010
![Page 2: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/2.jpg)
How did it start?
• At Google
• Lots of semi-structured data
• Commodity hardware
• Horizontal scalability
• Tight integration with MapReduce
![Page 3: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/3.jpg)
Why NoSQL?
• RDBMS don’t scale
• Typically large monolithic systems
• Hard to shard
• Specialized hardware... expensive!
• Buzzword!
![Page 4: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/4.jpg)
Google BigTable
• Distributed multi-level map
• Fault tolerant, persistent
• Scalable
• Runs on commodity hardware
• Self-managing
• Large number of read/write ops
• Fast scans
![Page 5: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/5.jpg)
HBase
• Open source BigTable
• HDFS as underlying DFS
• ZooKeeper as lock service
• Tight integration with Hadoop MapReduce
![Page 6: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/6.jpg)
HBase
• Data model
• Architecture, implementation
• Regions, Region Servers etc
• API
• Current status and future direction
• Use cases
• How to think in HBase (or NoSQL)?
![Page 8: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/8.jpg)
Data Model
• Sparse, multi-dimensional map: (row, column, timestamp) → cell
• Column = Column Family:Column Qualifier
[Figure: a cell labeled AK under column Fam1:Qual1, holding value v1 at timestamp t1 and value v2 at timestamp t2, where t2 > t1. Axes: Rows, Columns, Timestamps]
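The data model on this slide can be sketched with nested maps. This is a minimal Python illustration, not HBase code: the class `SparseTable` and its methods are hypothetical, and using `AK` as the row key is an assumption based on the figure label.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of (row, column, timestamp) -> cell; not the HBase API."""

    def __init__(self):
        # row -> "family:qualifier" -> {timestamp: value}
        # Absent cells take no space, which is what makes the map sparse.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None


table = SparseTable()
table.put("AK", "Fam1:Qual1", 1, "v1")  # written at t1
table.put("AK", "Fam1:Qual1", 2, "v2")  # written at t2 > t1
latest = table.get("AK", "Fam1:Qual1")
as_of_t1 = table.get("AK", "Fam1:Qual1", timestamp=1)
missing = table.get("AK", "Fam1:Qual2")
```

Writing v2 does not overwrite v1; both versions stay addressable by timestamp, and a read without a timestamp sees the newest one.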
![Page 9: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/9.jpg)
Regions
• Region: Contiguous set of lexicographically sorted rows
• hbase.hregion.max.filesize (default 256MB)
• Regions hosted by Region Servers
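Because each region covers a contiguous, sorted key range, finding the region for a row reduces to a binary search over region start keys. A rough Python sketch follows; the start keys and server names are made up for illustration.

```python
import bisect

# Hypothetical start keys of three regions, kept in lexicographic order.
# The first region starts at the empty key, so it covers every key below "row257".
region_start_keys = ["", "row257", "row401"]
region_servers = ["rs1", "rs2", "rs3"]  # hypothetical hosts, one per region


def find_region(row_key):
    """Index of the region whose key range contains row_key."""
    return bisect.bisect_right(region_start_keys, row_key) - 1


idx = find_region("row30")
host = region_servers[idx]
```

Note that `row30` falls into the second region: keys are compared lexicographically, so `row30` sorts after `row257` and before `row401`.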
![Page 10: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/10.jpg)
Regions and Splitting
[Figure: a table held in two regions, row1 to row256 and row257 to row600. After writes, the second region is split into row257 to row400 and row401 to row600.]
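A toy version of the split can be written in a few lines of Python. This is only a sketch under simplifying assumptions: the `Region` class is hypothetical, and a row-count cap stands in for the byte-size limit that `hbase.hregion.max.filesize` sets in a real cluster.

```python
class Region:
    """Toy region: a sorted run of row keys with a size cap (not HBase code)."""

    def __init__(self, rows, max_rows):
        self.rows = sorted(rows)
        self.max_rows = max_rows  # stands in for hbase.hregion.max.filesize

    def put(self, row):
        """Insert a row; return two daughter regions if the cap is exceeded."""
        if row not in self.rows:
            self.rows.append(row)
            self.rows.sort()
        if len(self.rows) <= self.max_rows:
            return [self]
        mid = len(self.rows) // 2  # split at the middle key
        return [Region(self.rows[:mid], self.max_rows),
                Region(self.rows[mid:], self.max_rows)]


region = Region(["row257", "row300", "row500", "row600"], max_rows=4)
daughters = region.put("row400")
ranges = [(d.rows[0], d.rows[-1]) for d in daughters]
```

Each daughter is again a contiguous sorted range, so the two halves can be served by different Region Servers.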
![Page 13: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/13.jpg)
System Structure
[Figure: system components: Master, Region Servers, ZooKeeper, HDFS, MapReduce]
![Page 14: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/14.jpg)
Master
• Region splitting
• Load balancing
• Metadata operations
• Multiple masters for failover
![Page 15: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/15.jpg)
ZooKeeper
• Master election
• Locate -ROOT- region
• Region Server membership
![Page 20: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/20.jpg)
Where is my row?
• 3 level hierarchical lookup scheme
[Figure: lookup path ZooKeeper → -ROOT- → .META. → MyTable → MyRow. -ROOT- holds one row per .META. region; .META. holds one row per table region.]
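The lookup chain can be sketched as follows. This is an illustrative Python model, not client code: the ZooKeeper key, the catalog contents, and the server names are all hypothetical.

```python
import bisect


def locate(catalog, key):
    """Pick the catalog entry whose start key is the greatest one <= key.

    `catalog` is a sorted list of (start_key, target) pairs.
    """
    starts = [start for start, _ in catalog]
    return catalog[bisect.bisect_right(starts, key) - 1][1]


# Hypothetical cluster state. Catalog keys are (table, row) tuples, which
# compare element by element, much like concatenated byte keys would.
zookeeper = {"root-region-server": "rs1"}  # where -ROOT- is hosted
root = [(("", ""), "meta-region-A"), (("MyTable", "row401"), "meta-region-B")]
meta = {
    "meta-region-A": [(("MyTable", ""), "rs2"), (("MyTable", "row257"), "rs3")],
    "meta-region-B": [(("MyTable", "row401"), "rs4")],
}


def find_region_server(table, row):
    _root_host = zookeeper["root-region-server"]    # 1. ZooKeeper locates -ROOT-
    meta_region = locate(root, (table, row))        # 2. -ROOT- names the .META. region
    return locate(meta[meta_region], (table, row))  # 3. .META. names the table region's host


server = find_region_server("MyTable", "MyRow")
```

In this toy setup `MyRow` sorts before `row257` (uppercase letters compare lower than lowercase ones), so the lookup ends at the first region of `MyTable`.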
![Page 21: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/21.jpg)
Region
[Figure: write path inside a region. A write goes to the HLog and the Memstore; a flush turns the Memstore into a small HFile; compaction merges the HFiles into a single HFile.]
• HLog: append-only WAL on HDFS (Sequence File), one per Region Server
• HFile (on HDFS): immutable sorted map (byte[] → byte[]); (row, column, timestamp) → cell value
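The write, flush, and compaction steps can be sketched in Python. This is a simplified model rather than HBase's implementation: `ToyRegion` is a hypothetical class, a plain key stands in for (row, column, timestamp), the flush threshold counts entries rather than bytes, and log cleanup after a flush is not modeled.

```python
class ToyRegion:
    """Toy write path: WAL append, memstore insert, flush, compaction."""

    def __init__(self, flush_threshold=2):
        self.hlog = []        # append-only write-ahead log
        self.memstore = {}    # in-memory buffer of recent writes
        self.hfiles = []      # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.hlog.append((key, value))  # log first, so the write survives a crash
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memstore out as one small sorted, immutable file."""
        self.hfiles.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}

    def compact(self):
        """Merge all files into one; newer files win on duplicate keys."""
        merged = {}
        for hfile in self.hfiles:  # oldest to newest
            merged.update(hfile)
        self.hfiles = [tuple(sorted(merged.items()))]


region = ToyRegion(flush_threshold=2)
for key, value in [("b", 1), ("a", 2), ("c", 3), ("a", 4)]:
    region.write(key, value)
files_before = len(region.hfiles)
region.compact()
```

Files are never edited in place: a flush only adds a new file, and compaction replaces several files with one merged file.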
![Page 29: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/29.jpg)
Region
[Figure: read path inside a region. A read is served from the Memstore and the region's HFiles (on HDFS); the HLog is also shown.]
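A read has to consider both the Memstore and every HFile of the region, preferring the newest data. A minimal Python sketch follows; it uses file order to decide which value is newest, which is a simplification of merging by timestamp, and the function and variable names are hypothetical.

```python
def read(key, memstore, hfiles):
    """Return the newest value for key, checking the memstore first.

    `hfiles` is ordered oldest first, so it is searched in reverse.
    """
    if key in memstore:
        return memstore[key]
    for hfile in reversed(hfiles):
        if key in hfile:
            return hfile[key]
    return None


memstore = {"c": 30}
hfiles = [{"a": 1, "b": 2}, {"a": 10}]  # the second file is newer
results = [read(k, memstore, hfiles) for k in ("a", "b", "c", "d")]
```

The more files a region accumulates, the more places a read may have to check, which is one reason compaction matters for read latency.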
![Page 31: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/31.jpg)
Ways to access
• Java
• REST
• Thrift
• Scala
• Jython
• Groovy DSL
• Ruby shell
• Java MR, Cascading, Pig, Hive
![Page 32: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/32.jpg)
Java API
• Get
• Put
• Delete
• Scan
• IncrementColumnValue
• TableInputFormat - MapReduce Source
• TableOutputFormat - MapReduce Sink
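To make the semantics of these operations concrete, here is a small in-memory model in Python. It is not the HBase Java client: `ToyTable` and its method names are hypothetical stand-ins that mirror Get, Put, Delete, Scan, and IncrementColumnValue.

```python
class ToyTable:
    """Models the semantics of the listed operations; not the HBase client."""

    def __init__(self):
        self._cells = {}  # (row, column) -> value

    def put(self, row, column, value):
        self._cells[(row, column)] = value

    def get(self, row):
        """All columns of one row, as {column: value}."""
        return {c: v for (r, c), v in self._cells.items() if r == row}

    def delete(self, row):
        for cell in [k for k in self._cells if k[0] == row]:
            del self._cells[cell]

    def scan(self, start_row, stop_row):
        """Yield (row, column, value) for start_row <= row < stop_row, in key order."""
        for (row, column) in sorted(self._cells):
            if start_row <= row < stop_row:
                yield row, column, self._cells[(row, column)]

    def increment_column_value(self, row, column, amount):
        """Add `amount` to a counter cell and return the new value."""
        new_value = self._cells.get((row, column), 0) + amount
        self._cells[(row, column)] = new_value
        return new_value


t = ToyTable()
t.put("row2", "f:q", "b")
t.put("row1", "f:q", "a")
t.put("row3", "f:q", "c")
scanned = list(t.scan("row1", "row3"))  # the stop row is exclusive
t.delete("row2")
hits = t.increment_column_value("row1", "f:hits", 1)
hits = t.increment_column_value("row1", "f:hits", 5)
```

The scan returns rows in sorted key order regardless of insertion order, which is what makes range scans cheap on a sorted store.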
![Page 33: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/33.jpg)
Other Features
• Compression
• In-memory column families
• Multiple masters
• Rolling restart
• Bloom filters
• Efficient bulk loads
• Source and sink for Hive, Pig, Cascading
![Page 34: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/34.jpg)
Things being worked on
• Master rewrite
• Move more stuff into ZooKeeper
• Column family based access control
• Inter cluster replication (managed by ZK)
• Store Lucene indexes (HBasene)
![Page 35: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/35.jpg)
Use Cases
![Page 36: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/36.jpg)
HBase @ SU*
• Backend for su.pr
• Real time serving + MR analytics (separate clusters)
• 50% Cascading, 50% Java MR
• Prod cluster (~20 nodes) serves 20k requests/sec
• All new features are backed by HBase
• Hardware: 2xi7, 24GB RAM, 4x1TB
*Source: Personal communication with J-D Cryans, StumbleUpon
![Page 37: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/37.jpg)
HBase @ Mozilla*
• Socorro - crash reporting system
• Catch, process and present crash info for Firefox, Thunderbird, Fennec, Camino, Seamonkey
• 1.5 million crash reports/day
• Earlier: NFS, PostgreSQL
• 17 node production cluster
• Dual Quad Core + 24GB RAM + 4x1TB
• Some user facing reports still served by PostgreSQL. Being ported to HBase in next Socorro version
*Source: http://blog.mozilla.com/webdev/2010/07/26/moving-socorro-to-hbase/
![Page 38: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/38.jpg)
Data Integration*
• Multiple heterogeneous data sources
• Notion of connected data
• Think RDF
• Graph connecting data elements across systems
• Store in HBase, build transitive closures
• Pattern mining
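As a rough illustration of building a transitive closure over connected data, the Python sketch below treats each data element as a row whose columns are its outgoing links. The element names and layout are invented for the example, and this is not necessarily how ClouDFuse computes closures.

```python
from collections import deque

# Hypothetical layout: one row per data element, one column per outgoing link.
links = {
    "crm:alice": {"billing:acct42"},
    "billing:acct42": {"support:ticket7"},
    "support:ticket7": set(),
    "crm:bob": set(),
}


def reachable(start):
    """Every element connected to `start` by following links (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in links.get(queue.popleft(), ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


closure = {row: reachable(row) for row in links}
```

Storing the closure back as extra columns on each row would turn a multi-hop graph question into a single row read, at the cost of wider rows.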
*Source: ClouDFuse - Scalable data integration in the cloud, MS Project, Amandeep Khurana, UC Santa Cruz
![Page 39: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/39.jpg)
HBase @ Trend Micro*
• Store threat information - Smart Protection Network
• Open source cloud computing initiative - TCloud
• Primarily run off EC2
*Source: https://hbase.s3.amazonaws.com/hbase/HBase-Trend-HUG10.pdf
![Page 40: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/40.jpg)
HBase @ Yahoo*
• Content optimization
• Metadata about content stored in HBase
• Used for extracting item features
• Used in conjunction with PNUTS, Hadoop
• Process 100s of GB in each run
*Source: http://www.slideshare.net/ydn/7-online-contentoptimizationhadoopsummit2010
![Page 41: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/41.jpg)
HBase @ Twitter*
• 7TB/day incoming data, increasing
• Analytics
• People search
• Building new solutions on HBase
• Part of a much larger scheme of things
• Scribe, Crane, Pig, MySQL, Cassandra, Oink, Elephant Bird, Birdbrain, Hadoop
*Sources: http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010, http://www.slideshare.net/ydn/3-hadoop-pigattwitterhadoopsummit2010
![Page 42: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/42.jpg)
Others
• Facebook
• Flurry
• Adobe
• Runa
• GumGum
• Openplaces
• Meetup.com
• Powerset
• WorldLingo
• Lily
• Drawn To Scale
• RapLeaf
• ...
![Page 43: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/43.jpg)
How to think in HBase?
![Page 44: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/44.jpg)
HBase v/s RDBMS
• Neither solves all problems
• It’s really a wrong comparison
• But puts things in context
![Page 45: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/45.jpg)
HBase v/s RDBMS

| HBase | RDBMS |
| --- | --- |
| Column oriented | Row oriented (mostly) |
| Flexible schema, add columns on the fly | Fixed schema |
| Good with sparse tables | Not optimized for sparse tables |
| No query language | SQL |
| Wide tables | Narrow tables |
| Joins using MR - not optimized | Optimized for joins (small, fast ones too!) |
| Tight integration with MR | Not really... |
![Page 46: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/46.jpg)
HBase v/s RDBMS

| HBase | RDBMS |
| --- | --- |
| De-normalize your data | Normalize as you can |
| Horizontal scalability. Just add hardware | Hard to shard and scale |
| Consistent | Consistent |
| No transactions | Transactional |
| Good for semi-structured data as well as structured data | Good for structured data |
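To show what de-normalizing into wide tables can look like, the Python sketch below turns two normalized tables into one wide row per user. The tables, row keys, and column names are all invented for illustration.

```python
# Normalized, RDBMS-style: two tables joined on user_id at read time.
users = {1: {"name": "alice"}}
orders = [
    {"order_id": 100, "user_id": 1, "item": "book"},
    {"order_id": 101, "user_id": 1, "item": "lamp"},
]

# De-normalized, HBase-style: one wide row per user. Each order becomes its
# own column qualifier, so a single row read replaces the join.
wide_rows = {}
for order in orders:
    row_key = f"user{order['user_id']}"
    row = wide_rows.setdefault(
        row_key, {"info:name": users[order["user_id"]]["name"]}
    )
    row[f"orders:{order['order_id']}"] = order["item"]

alice = wide_rows["user1"]
```

The row grows a new column for every order, which is cheap when the schema is flexible and sparse rows cost nothing for the columns they lack.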
![Page 49: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/49.jpg)
HBase v/s RDBMS
Rule: You probably don’t need HBase if your data can easily fit and be processed on a single RDBMS box.
But then, you are at Hadoop Day, so it probably can’t!
![Page 50: HBase @ Hadoop Day Seattle](https://reader033.vdocuments.site/reader033/viewer/2022052622/5591071b1a28abb26f8b471c/html5/thumbnails/50.jpg)
Q&A