HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Apache HBase at OCLC, Ron Buckley, May 22, 2012

DESCRIPTION

WorldCat is the world’s largest network of library content and services. Over 25,000 libraries in 170 countries have cooperated for 40 years to build WorldCat. OCLC is currently transitioning WorldCat from Oracle to Apache HBase. This session will discuss our data design for representing the constantly changing ownership information for thousands of libraries (billions of data points, millions of daily updates) and our plans for managing HBase in an environment that is equal parts end-user facing and batch.

TRANSCRIPT

Page 1: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Apache HBase at OCLC

Ron Buckley, May 22, 2012

Page 2: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

About OCLC

OCLC delivers single-search-box access to more than 943 million items from your library and the world's library collections. You'll find:

1.8 billion ownership information indications

214+ million books in libraries worldwide

663+ million articles with one-click access to full text

28+ million digital items from trusted sources like Google Books, OAIster and HathiTrust

13+ million eBooks from leading aggregators and publishers

44+ million pieces of evaluative content (Tables of Contents, cover art, summaries, etc.) included at no additional charge

And a LOT more (interlibrary loan facilitation, API access, library-centric research)

Page 3: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Main Case for OCLC

• Library gets a new book.

• Librarian needs to enter all the data about that item into their local system.

• It takes quite some time to correctly enter cataloging data into local system.

• Thousands of libraries are all going to get the same book and do the same things, thereby replicating each other's work.

• There should be a system whereby libraries can share and build on each other's work.

• SaaS before buzzwords were cool. System proposed in July 1966. First use in 1971.

*A member of the HBase implementation team also worked on the initial OCLC system.

Page 4: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Current Data State at OCLC

• Oracle (WorldCat – Oracle RAC)

• SAN Storage (Approximately 20 TB)

• Several other smaller instances of Oracle

• A LOT of stored procedures for read and update. The most commonly used are 10 years old and difficult to follow (being polite)

• Two copies of the primary database in other formats, various processes to keep them in sync (or not)

Page 5: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Schema Design – Oracle Version

4 main tables. The primary key (xwcmd_id) is an ever-increasing, OCLC-assigned number for every library resource.

Page 6: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Schema Design – HBase Version

4 Tables become 1

Use Columns as data

Page 7: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Using column qualifiers to represent library ownership

hbase(main):001:0> get 'Worldcat','1'

data:createDate value=19690526 00:00:00.000

data:hold:10810 value={"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}

data:hold:1100 value={"md":[{"CDATE":"20040826 02:08:57.000"},{"CPID":"NA"},{"UDATE":"20040826 02:08:57.000"},{"UPID":"NA"}]}

Qualifier         Value

data:hold:10810   {"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}

data:hold:1100    {"md":[{"CDATE":"20040826 02:08:57.000"},{"CPID":"NA"},{"UDATE":"20040826 02:08:57.000"},{"UPID":"NA"}]}

data:hold:727     {"md":[{"CDATE":"20120522:08:57.000"},{"CPID":"NA"},{"UDATE":"20120522:08:57.000"},{"UPID":"NA"}]}
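The qualifier layout above can be read back with very little code. The following is a minimal Python sketch, not OCLC's implementation: it assumes the number after `data:hold:` identifies the owning library, and it uses a plain dictionary (`row`, hypothetical) in place of a real HBase client result. The function name `parse_holdings` is also made up for illustration.

```python
import json

# Hypothetical in-memory copy of the cells shown above; a real client
# would return these from a single-row read.
row = {
    "data:createDate": "19690526 00:00:00.000",
    "data:hold:10810": '{"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},'
                       '{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}',
    "data:hold:1100": '{"md":[{"CDATE":"20040826 02:08:57.000"},{"CPID":"NA"},'
                      '{"UDATE":"20040826 02:08:57.000"},{"UPID":"NA"}]}',
}

HOLD_PREFIX = "data:hold:"


def parse_holdings(cells):
    """Return {library_id: metadata dict} for every holding column in a row."""
    holdings = {}
    for column, value in cells.items():
        if not column.startswith(HOLD_PREFIX):
            continue  # skip non-ownership columns such as data:createDate
        library_id = column[len(HOLD_PREFIX):]  # assumed to be the library's ID
        merged = {}
        for entry in json.loads(value)["md"]:
            merged.update(entry)  # each "md" entry is a one-key object
        holdings[library_id] = merged
    return holdings


holdings = parse_holdings(row)
print(sorted(holdings))
print(holdings["10810"]["UDATE"])
```

Because the library identifier lives in the qualifier itself, listing every owner of a record is a prefix filter over one row's columns rather than a join.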

Page 8: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Advantages

• Everything in one I/O – We get the record, all of its metadata and a complete set of ‘who owns it and for how long’, in one call to HBase. HBase can generally read it in 1 physical I/O.

• New requirements – The existing Oracle table is a binary indicator of ‘I own this’. Adding new columns to the table was going to be very difficult.

• With HBase, we’re now storing complete ownership, by just making up new column qualifiers.
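As a rough illustration of the last point, recording a holding amounts to writing one more column into the record's row, with no schema change. The Python sketch below only builds the cell; the helper `holding_cell` and the dictionary standing in for the row are hypothetical, and the JSON shape is copied from the example values on the previous slide.

```python
import json


def holding_cell(library_id, cdate, udate, cpid="NA", upid="NA"):
    """Build the (column, value) pair for one library's holding."""
    column = f"data:hold:{library_id}"
    value = json.dumps(
        {"md": [{"CDATE": cdate}, {"CPID": cpid}, {"UDATE": udate}, {"UPID": upid}]},
        separators=(",", ":"),  # compact form, matching the values shown earlier
    )
    return column, value


# Stand-in for an existing row; a real client would issue a put instead.
row = {"data:createDate": "19690526 00:00:00.000"}
column, value = holding_cell("1100", "20040826 02:08:57.000", "20040826 02:08:57.000")
row[column] = value
print(column, value)
```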

Page 9: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Problems

Nagle – We’ve disabled Nagle’s algorithm (that is, set TCP_NODELAY) across the board.

HBase Balancer – We’ve written a script that balances regions at the table level, outside of the default balancer. We’re hoping that “Allow regions to be load-balanced by table” (HBASE-3373) is included in 0.94.

IOPS – For us, HBase is used for online, user-facing traffic. Our cluster is designed such that we have plenty of capacity for this use. It’s easy for MapReduce activity to fully utilize the available I/O and leave HBase nothing to work with.
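The balancing script mentioned above is not shown in the slides. As a hedged sketch of what per-table balancing means, the Python below plans region moves so that each table's regions are spread across servers to within one region of each other. The function `plan_table_moves`, the region names, and the server names are all hypothetical; actually carrying out the moves (for example with the HBase shell's `move` command) is left out.

```python
from collections import defaultdict


def plan_table_moves(assignments, servers):
    """Plan moves that even out each table's regions across servers.

    assignments: {region_name: (table, server)}
    Returns a list of (region, source_server, destination_server).
    """
    by_table = defaultdict(lambda: defaultdict(list))
    for region, (table, server) in sorted(assignments.items()):
        by_table[table][server].append(region)

    moves = []
    for table, hosted in sorted(by_table.items()):
        load = {s: list(hosted.get(s, [])) for s in servers}
        while True:
            busiest = max(servers, key=lambda s: len(load[s]))
            idlest = min(servers, key=lambda s: len(load[s]))
            if len(load[busiest]) - len(load[idlest]) <= 1:
                break  # this table is spread as evenly as it can be
            region = load[busiest].pop()
            load[idlest].append(region)
            moves.append((region, busiest, idlest))
    return moves


# Hypothetical layout: t1 sits entirely on rs1 and t2 entirely on rs2.
assignments = {
    "t1,a": ("t1", "rs1"), "t1,b": ("t1", "rs1"),
    "t1,c": ("t1", "rs1"), "t1,d": ("t1", "rs1"),
    "t2,a": ("t2", "rs2"), "t2,b": ("t2", "rs2"),
}
moves = plan_table_moves(assignments, ["rs1", "rs2"])
for move in moves:
    print(move)
```

In this example every read of table t1 lands on rs1 until the table's regions are split between the two servers, which is the situation per-table balancing is meant to avoid.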

Page 10: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Status – Hardware/Software Systems

Production Cluster

• 50 Nodes – 3 ‘Control’ Nodes, 3 ‘edge’ Nodes, 44 Data Nodes

• 8 CPU/32 GB RAM/8 TB Disk

• 3 Rack configuration – 10 GB interconnects

6 Node Clusters – Used for testing and disaster recovery

• 2 development clusters – IntegrationTest, ProofOfConcept

• 2 clusters in a separate datacenter – Business Continuity, Pre-production Testing

Versions

• Cloudera Distribution 3 Update 3 – CDH3U3

• Apache HBase 0.92.1

Page 11: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Backup/Restore

We’ve built our own backup/restore capability, like that described in:

https://issues.apache.org/jira/browse/HBASE-4618

It allows for both inter- and intra-datacenter backups and restores.

On GitHub at:

https://github.com/oclc/HBase-Backup

The backup runs weekly and on demand.

Page 12: HBaseCon 2012 | HBase for the Worlds Libraries - OCLC

Other Interesting Data Sets OCLC is moving to HBase

• The Dewey Decimal Editorial System - The system where the editors of the Dewey Decimal System do their work.

• VIAF - "Virtual International Authority File" - A joint project of several national libraries plus selected regional and trans-national library agencies. The project's goal is to lower the cost and increase the utility of library authority files by matching and linking widely-used authority files and making that information available on the Web.