HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimball, WibiData

Upload: cloudera-inc

Post on 30-Jun-2015

DESCRIPTION

HBase application developers face a number of challenges: schema management is performed at the application level, decoupled components of a system can break one another in unexpected ways, less-technical users cannot easily access data, and evolving data collection and analysis needs are difficult to plan for. In this talk, we describe a schema management methodology based on Apache Avro that enables users and applications to share data in HBase in a scalable, evolvable fashion. By adopting these practices, engineers independently using the same data have guarantees on how their applications interact. As data collection needs change, applications are resilient to drift in the underlying data representation. This methodology results in a data dictionary that allows less-technical users to understand what data is available to them for analysis and inspect data using general-purpose tools (for example, export it via Sqoop to an RDBMS). And because of Avro’s cross-language capabilities, HBase’s power can reach new domains, like web apps built in Ruby.

TRANSCRIPT

Page 1: HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimball, WibiData
Page 2:

Living Data: Applying Adaptable Schemas to HBase

WibiData, Inc.

Aaron Kimball – CTO

Page 3:

HBase is a nexus for your data

Page 4:

HBase: Schema free (unfortunately)

• Cells only hold byte arrays
• Column names implicitly defined by apps
• Each app must (de)serialize values correctly
• Changing a schema requires rewriting a column and updating every reader/writer
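The problem on this slide can be sketched in a few lines. This is an illustration only: the function names and the choice of a count column are hypothetical, and Python's `struct` module stands in for whatever byte encoding an application might pick. Because a cell is just bytes, nothing stops two applications from disagreeing about what those bytes mean.

```python
import struct

# Hypothetical writer: stores a count as a 4-byte big-endian int.
def write_count_v1(count: int) -> bytes:
    return struct.pack(">i", count)

# Hypothetical reader, written later by another team, which assumes the
# same column holds an 8-byte big-endian long.
def read_count_v2(cell: bytes) -> int:
    return struct.unpack(">q", cell)[0]

cell = write_count_v1(42)

try:
    read_count_v2(cell)
    mismatch_detected = False
except struct.error:
    # The cell carries no type information, so the disagreement only
    # shows up when decoding fails.
    mismatch_detected = True

# Worse case: the bytes decode without error but mean something else.
# A writer stored the text "1234"; a reader decodes it as a 4-byte int.
misread = struct.unpack(">i", b"1234")[0]
```

In the first case the reader at least fails loudly; in the second it silently gets a number that is not 1234.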

Page 5:

Datatypes can get rooted in place

Page 6:

Avro: Flexible schemas

Page 7:

Avro decouples schemas
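A rough sketch of what decoupling means in practice: a reader resolves data written under one schema against the schema it expects. The code below is a simplified stand-in for Avro's record resolution rules (match fields by name, ignore writer-only fields, fill reader-only fields from defaults). It is not the Avro library API, and the `User` record, its fields, and the `resolve` helper are hypothetical.

```python
import json

# Schema the writing application used (hypothetical).
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "email", "type": "string"},
            {"name": "legacy_id", "type": "int"}]}
""")

# Schema a different, newer reading application expects (hypothetical).
READER_SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "email", "type": "string"},
            {"name": "country", "type": "string", "default": "unknown"}]}
""")

def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field known to both sides: take the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field only the reader knows: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Writer-only fields (here legacy_id) are simply not copied.
    return resolved

written = {"email": "user@example.com", "legacy_id": 7}
seen_by_reader = resolve(written, WRITER_SCHEMA, READER_SCHEMA)
```

Neither application had to be redeployed or to rewrite stored data for the other to change its view of the record.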

Page 8:

Every cell stores its schema (hash)
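One way to picture this slide is a cell whose first bytes identify the writer's schema, followed by the encoded value. The slides do not specify the hash function, the prefix length, or where the hash-to-schema mapping lives, so everything below is an assumption: MD5 over a compact JSON rendering, a 16-byte prefix, and an in-memory dictionary standing in for a shared schema table. The payload here is plain UTF-8 rather than Avro binary encoding.

```python
import hashlib
import json

# Hypothetical in-memory stand-in for a shared schema table keyed by hash.
SCHEMA_TABLE = {}

def schema_hash(schema) -> bytes:
    # Assumption: hash a compact, key-sorted JSON rendering of the schema.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).digest()  # 16 bytes

def register(schema) -> bytes:
    h = schema_hash(schema)
    SCHEMA_TABLE[h] = schema
    return h

def encode_cell(schema, payload: bytes) -> bytes:
    # Cell layout (assumed): 16-byte schema hash, then the value bytes.
    return register(schema) + payload

def decode_cell(cell: bytes):
    h, payload = cell[:16], cell[16:]
    # Look up the writer's schema from the hash stored in the cell.
    return SCHEMA_TABLE[h], payload

cell = encode_cell("string", "user@example.com".encode("utf-8"))
writer_schema, payload = decode_cell(cell)
```

Storing a fixed-size hash instead of the full schema keeps the per-cell overhead constant however large the schema grows.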

Page 9:

Layout table stores common schemas

<column>
  <name>info:email</name>
  <description>User email address</description>
  <schema>"string"</schema>
</column>

• Data dictionary provides reference to engineers on different projects

• Common schemas used by tools that want to enforce a “default” schema for a column (e.g., Sqoop-based exports)
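As a sketch of how a tool might turn such column descriptors into a data dictionary, the snippet below parses the XML from this slide with Python's standard library. The enclosing `<layout>` element and the `load_columns` helper are hypothetical; the slides show only a single `<column>` entry and do not describe the full layout file format.

```python
import xml.etree.ElementTree as ET

# The column descriptor from the slide.
LAYOUT_XML = """
<column>
  <name>info:email</name>
  <description>User email address</description>
  <schema>"string"</schema>
</column>
"""

def load_columns(xml_text):
    """Return {column name: (description, schema text)}."""
    # Wrap in a hypothetical root element so several <column> entries
    # could be parsed at once.
    root = ET.fromstring(f"<layout>{xml_text}</layout>")
    dictionary = {}
    for col in root.iter("column"):
        name = col.findtext("name").strip()
        dictionary[name] = (
            col.findtext("description").strip(),
            col.findtext("schema").strip(),
        )
    return dictionary

data_dictionary = load_columns(LAYOUT_XML)
description, default_schema = data_dictionary["info:email"]
```

A generic export tool could use `default_schema` to decide how to decode a column without any application-specific code.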

Page 10:

Conclusions

• Avro allows decoupled applications to:
  – Share the same data store
  – Change individual applications without downtime
  – Avoid the need to structurally modify data

• Layout management allows:
  – Developers to communicate about data without using code
  – Data-agnostic applications to manipulate structured information

Page 11:

www.wibidata.com / @wibidata
Aaron Kimball – [email protected]