intro to hbase
DESCRIPTION
Introduction to HBase briefly covers the following topics: what is HBase, how and when to used it.TRANSCRIPT
![Page 1: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/1.jpg)
Intro to HBaseAlex Baranau, Sematext International, 2012
Monday, July 9, 12
![Page 2: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/2.jpg)
About Me
Software Engineer at Sematext International
http://blog.sematext.com/author/abaranau
@abaranau
http://github.com/sematext (abaranau)
Monday, July 9, 12
![Page 3: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/3.jpg)
Agenda
What is HBase?
How to use HBase?
When to use HBase?
Monday, July 9, 12
![Page 4: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/4.jpg)
What is HBase?
Monday, July 9, 12
![Page 5: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/5.jpg)
What: HBase is...
Think of it as a sparse, consistent, distributed, multidimensional, sorted map:
labeled tables of rows
row consist of key-value cells:
Open-source non-relational distributed column-oriented database modeled after Google’s BigTable.
(row key, column family, column, timestamp) -> value
Monday, July 9, 12
![Page 6: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/6.jpg)
What HBase is NOTNot an SQL database
Not relational
No joins
No fancy query language and no sophisticated query engine
No transactions out-of-the box
No secondary indices out-of-the box
Not a drop-in replacement for your RDBMS
Monday, July 9, 12
![Page 7: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/7.jpg)
What: Features-1
Linear scalability, capable of storing hundreds of terabytes of data
Automatic and configurable sharding of tables
Automatic failover support
Strictly consistent reads and writes
Monday, July 9, 12
![Page 8: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/8.jpg)
What: Part of Hadoop ecosystem
Provides realtime random read/write access to data stored in HDFS
HDFS
HBase
Data Producer
Data Consumer
write
write
write
read
read
Monday, July 9, 12
![Page 9: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/9.jpg)
What: Features-2Integrates nicely with Hadoop MapReduce (both as source and destination)
Easy Java API for client access
Thrift gateway and REST APIs
Bulk import of large amount of data
Replication across clusters & backup options
Block cache and Bloom filters for real-time queries
and many more...
Monday, July 9, 12
![Page 10: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/10.jpg)
How to use HBase?
Monday, July 9, 12
![Page 11: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/11.jpg)
How: the DataRow keys uninterpreted byte arrays
Columns grouped in columnfamilies (CFs)
CFs defined statically upon table creation
Cell is uninterpreted byte array and a timestamp
Row Key Data
Minsk geo:{‘country’:‘Belarus’,‘region’:‘Minsk’}demography:{‘population’:‘1,937,000’@ts=2011}
New_York_Citygeo:{‘country’:‘USA’,‘state’:’NY’}
demography:{‘population’:‘8,175,133’@ts=2010,‘population’:‘8,244,910’@ts=2011}
Suva geo:{‘country’:‘Fiji’}
Different data separated into CFs
All values stores as byte arrays
Rows are ordered and accessed by
row key
Rows can have different columns
Cell can have multiple versions
Data can be very “sparse”
Monday, July 9, 12
![Page 12: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/12.jpg)
How: Writing the DataRow updates are atomic
Updates across multiple rows are NOT atomic, no transaction support out of the box
HBase stores N versions of a cell (default 3)
Tables are usually “sparse”, not all columns populated in a row
Monday, July 9, 12
![Page 13: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/13.jpg)
How: Reading the DataReader will always read the last written (and committed) values
Reading single row: Get
Reading multiple rows: Scan (very fast)
Scan usually defines start key and stop key
Rows are ordered, easy to do partial key scan
Query predicate pushed down via server-side Filters
Row Key Data‘login_2012-03-01.00:09:17’ d:{‘user’:‘alex’}
... ...‘login_2012-03-01.23:59:35’ d:{‘user’:‘otis’}‘login_2012-03-02.00:00:21’ d:{‘user’:‘david’}
Monday, July 9, 12
![Page 14: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/14.jpg)
How: MapReduce Integration
Out of the box integration with Hadoop MapReduce
Data from HBase table can be source for MR job
MR job can write data into HBase
MR job can write data into HDFS directly and then output files can be very quickly loaded into HBase via “Bulk Loading” functionality
Monday, July 9, 12
![Page 15: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/15.jpg)
How: Sharding the DataAutomatic and configurable sharding of tables:
Tables partitioned into Regions
Region defined by start & end row keys
Regions are the “atoms” of distribution
Regions are assigned to RegionServers (HBase cluster slaves)
Monday, July 9, 12
![Page 16: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/16.jpg)
HMaster
How: Setup: ComponentsHBase components
RegionServer
RegionServer
ZooKeeper
RegionServer
ZooKeeperZooKeeper
client
RegionServerRegionServer
HMaster
Monday, July 9, 12
![Page 17: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/17.jpg)
How: Setup: Hadoop ClusterTypical Hadoop+HBase setup
HMaster
NameNode
DataNode DataNode
RegionServer RegionServer
Task
Trac
ker
Task
Trac
ker
JobTrackerMaster Node
Slave Node Slave Node
HDFS
MapReduce
HBase
Slave Nodes
Monday, July 9, 12
![Page 18: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/18.jpg)
How: Setup: Automatic Failover
DataNode failures handled by HDFS (replication)
RSs failures (incl. caused by whole server failure) handled automatically
Master re-assignes Regions to available RSs
HMaster failover: automatic with multiple HMasters
Monday, July 9, 12
![Page 19: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/19.jpg)
When to Use HBase?
Monday, July 9, 12
![Page 20: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/20.jpg)
When: What HBase is good at
Serving large amount of data: built to scale from the get-go
fast random access to the data
Write-heavy applications*
Append-style writing (inserting/overwriting new data) rather than heavy read-modify-write operations**
* clients should handle the loss of HTable client-side buffer** see https://github.com/sematext/HBaseHUT
Monday, July 9, 12
![Page 21: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/21.jpg)
When: HBase vs ...
Favors consistency over availability
Part of a Hadoop ecosystem
Great community; adopted by tech giants like Facebook, Twitter, Yahoo!, Adobe, etc.
Monday, July 9, 12
![Page 22: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/22.jpg)
When: Use-casesAudit logging systems
track user actions
answer questions/queries like:
what are the last 10 actions made by user?row key: userId_timestamp
which users logged into system yesterday?row key: action_timestamp_userId
Monday, July 9, 12
![Page 23: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/23.jpg)
When: Use-cases
Real-time analytics, OLAP
real-time counters
interactive reports showing trends, breakdowns, etc
time-series databases
Monday, July 9, 12
![Page 24: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/24.jpg)
When: Use-casesMonitoring system example
Monday, July 9, 12
![Page 25: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/25.jpg)
When: Use-casesMessages-centered systems
twitter-like messages/statuses
Content management systems
serving content out of HBase
Canonical use-case: webtable (pages stored during crawling the web)
And others
Monday, July 9, 12
![Page 26: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/26.jpg)
Future
Making stable enough to substitute RDBMS in mission critical cases
Easier system management
Performance improvements
Monday, July 9, 12
![Page 27: Intro to HBase](https://reader033.vdocuments.site/reader033/viewer/2022061212/54955890ac795925288b4fb7/html5/thumbnails/27.jpg)
Qs?(next: Intro into HBase Internals)
Sematext is hiring!Monday, July 9, 12