The Return of the Hierarchical Model

Jukka Zitting @ Day Software

/Agenda

Part 1: Hierarchy

- Concepts

- Benefits

- Drawbacks

- Examples

Part 2: Case Study

- JCR

- Jackrabbit

- Sling

- Lessons Learned

questions and comments allowed

/Hierarchy/Concepts

Every record has a parent record
- Except the root
- No cyclical parent relations allowed
- Referential integrity, but often no other reference types supported

A name identifies a record within its parent
- The name is not necessarily unique (XML, DNS, etc.)
- Path as an identifier: /path/to/record

Record hierarchy is distinct from type hierarchy
- Structural flexibility, optionally limited by type constraints

[Figure: example tree of records labeled A to F]
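The concepts above can be sketched in a few lines. This is only an illustration: the `Record` class and its methods are hypothetical names, and for simplicity it assumes names are unique within a parent, which the slide notes is not always the case (XML, DNS).

```python
class Record:
    """A record in a tree: one parent (None for the root), named children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}  # child name -> Record (assumes unique names)
        if parent is not None:
            parent.children[name] = self

    def path(self):
        """Build the path identifier by walking up to the root."""
        if self.parent is None:
            return "/"
        prefix = self.parent.path()
        return prefix + self.name if prefix == "/" else prefix + "/" + self.name

    def resolve(self, path):
        """Look up a descendant by following path segments downwards."""
        node = self
        for segment in path.strip("/").split("/"):
            if segment:
                node = node.children[segment]
        return node


root = Record("")
record = Record("record", Record("to", Record("path", root)))
print(record.path())  # /path/to/record
```

Because each record stores exactly one parent, cycles cannot be built through the constructor, and the path doubles as an identifier.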

/Hierarchy/Benefits

Natural
- Data in many domains is inherently hierarchical
- Easy to understand

Self-similar
- Recursive algorithms
- Incremental map-reduce!

Scalable
- Partitioning
- Parallel processing

Efficient
- Highly optimized path-based access and “joins” on the parent-child and subtree relationships
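One possible reading of the "self-similar" points, as a sketch: the same aggregation function applies at every level of the tree, and a cached per-subtree result can be updated incrementally by touching only the ancestors of a changed record. The `Node` class and `subtree_sum` field are hypothetical names, and treating this as "incremental map-reduce" is an interpretation of the slide, not a definition.

```python
class Node:
    def __init__(self, value=0):
        self.value = value
        self.parent = None
        self.children = []
        self.subtree_sum = value  # cached aggregate for this subtree

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        # Propagate the new subtree's total up the ancestor chain only.
        node = self
        while node is not None:
            node.subtree_sum += child.subtree_sum
            node = node.parent
        return child

    def set_value(self, value):
        # Incremental update: only this node and its ancestors change.
        delta = value - self.value
        self.value = value
        node = self
        while node is not None:
            node.subtree_sum += delta
            node = node.parent


def recompute(node):
    """Full recursive aggregation: the same function applies at every level."""
    return node.value + sum(recompute(child) for child in node.children)


root = Node(1)
a = root.add_child(Node(2))
b = a.add_child(Node(3))
b.set_value(10)
print(root.subtree_sum, recompute(root))  # 13 13
```

The incremental update costs work proportional to the depth of the changed record, while the full recomputation visits every record.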

/Hierarchy/Drawbacks

Limited support for references
- Graph databases solve this problem, at a cost
- DAG a partial solution

Handling of flat structures
- Chronological: blogs, tweets, email, log entries, etc.
- Sets: wiki pages, user accounts, etc.
- Often requires an artificial hierarchy, e.g. /blog/2010/06/entry-for-today

Standards are domain-specific or limited in scope
- POSIX, DNS, XPath/XQuery, JCR, etc.

Difficulty of organizing things
- Coming up with good names for records is hard
- Hierarchy requires maintenance
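The artificial hierarchy for chronological data could be generated along these lines. The `blog_path` helper is a hypothetical name, and the day of the month in the usage line is an arbitrary assumption; only the resulting path comes from the slide.

```python
from datetime import date


def blog_path(published, slug):
    """Derive a year/month hierarchy from an otherwise flat list of entries."""
    return f"/blog/{published.year:04d}/{published.month:02d}/{slug}"


print(blog_path(date(2010, 6, 15), "entry-for-today"))  # /blog/2010/06/entry-for-today
```

The year and month levels carry no information that is not already in the entry's date, which is what makes the hierarchy artificial.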

/Hierarchy/Examples

File system

DNS

LDAP

XML

WebDAV

RDBMS

/Hierarchy/Examples/File System

Universally available
- Two main types: files and folders
- Notable extensions: /dev/* and /proc/*
- Unix philosophy: Everything is a file!

Heavily optimized for specific use cases

Limited support for fine-grained data
- Some systems support things like extended attributes

Built-in access controls, but usually no query support

Major limitations in distributed solutions
- SAN and NAS solutions reasonably efficient but limited in scope
- Truly distributed systems like HDFS applicable only for limited use cases
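Without query support, finding records in a file system usually comes down to traversing the tree. A minimal sketch using only the Python standard library; the `list_paths` helper and the directory names are made up for the example.

```python
import os
import tempfile


def list_paths(root):
    """Return every file and folder under root as sorted /-separated paths."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append("/" + rel.replace(os.sep, "/"))
    return sorted(found)


# Build a tiny throwaway hierarchy and walk it.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "docs", "2010"))
    with open(os.path.join(tmp, "docs", "2010", "notes.txt"), "w") as f:
        f.write("hello")
    paths = list_paths(tmp)

print(paths)  # ['/docs', '/docs/2010', '/docs/2010/notes.txt']
```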

/Hierarchy/Examples/DNS

Globally distributed, heterogeneous, eventually consistent

In production since 1983!

Standardized query and update protocols

Domain-specific, highly optimized for scalability

Multiple records can have the same name

Fine-grained record types: A, NS, MX, TXT, AAAA, etc.

Security issues, both in design and implementations
- Not much impact in practice
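To make the hierarchy in DNS names explicit, a name can be read right to left as a path from the root. The sketch below also shows several records sharing one name. The `dns_to_path` and `add_record` helpers and the in-memory `zone` mapping are hypothetical; the addresses come from ranges reserved for documentation.

```python
from collections import defaultdict


def dns_to_path(name):
    """Reverse the labels so the root comes first."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return "/" + "/".join(reversed(labels))


# One name can carry several records, keyed here by (path, record type).
zone = defaultdict(list)


def add_record(name, rtype, value):
    zone[(dns_to_path(name), rtype)].append(value)


add_record("www.example.com", "A", "192.0.2.1")
add_record("www.example.com", "A", "192.0.2.2")
add_record("www.example.com", "AAAA", "2001:db8::1")
add_record("example.com.", "MX", "10 mail.example.com.")

print(dns_to_path("www.example.com"))   # /com/example/www
print(zone[("/com/example/www", "A")])  # ['192.0.2.1', '192.0.2.2']
```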

/Hierarchy/Examples/LDAP

Protocol for accessing X.500-style directories

Record names are constructed from selected properties
- dn: cn=John Doe, dc=example, dc=com

Record types defined by extensible schemas

Limited form of record references

Fairly powerful search
- Though no aggregate queries or arbitrary joins

Optimized for fine-grained data that is mostly read

Replication and distributed use widely supported
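A sketch of how a distinguished name is built from one of the entry's own properties, and how it reads as a path from the root. The helpers `build_dn` and `dn_to_path` are hypothetical, and the naive comma split ignores escaped commas and multi-valued name components that real DNs may contain.

```python
def build_dn(entry, naming_attr, parent_dn):
    """Name the entry after one of its own attributes, relative to its parent."""
    return f"{naming_attr}={entry[naming_attr]}, {parent_dn}"


def dn_to_path(dn):
    """Reverse the comma-separated components so the root comes first."""
    parts = [part.strip() for part in dn.split(",")]
    return "/" + "/".join(reversed(parts))


entry = {"cn": "John Doe", "sn": "Doe"}
dn = build_dn(entry, "cn", "dc=example, dc=com")
print(dn)              # cn=John Doe, dc=example, dc=com
print(dn_to_path(dn))  # /dc=com/dc=example/cn=John Doe
```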

/Hierarchy/Examples/XML

Data storage based on the XML DOM

Various levels of conformance

Highly buzzword compliant in the early 2000s
- Few of the XML database products are still in active use

Inefficient handling of binary data (at all granularities)

Powerful query and transformation tooling
- XPath, XQuery, XSLT, etc.
- Many implementations not optimized for performance

Optional type constraints with XML Schema, etc.

The result? XML extensions in SQL
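A small taste of path-based querying over an XML tree, using the limited XPath subset in Python's standard `xml.etree.ElementTree` module. The document content is invented for the example; a full XPath or XQuery engine offers far more than this.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<blog>
  <year value="2010">
    <entry title="entry-for-today"/>
    <entry title="another-entry"/>
  </year>
  <year value="2009">
    <entry title="older-entry"/>
  </year>
</blog>
""")

# Parent-child query: entries directly under the 2010 element.
titles = [e.get("title") for e in doc.findall("./year[@value='2010']/entry")]

# Subtree query: every entry anywhere below the root.
all_entries = doc.findall(".//entry")

print(titles)            # ['entry-for-today', 'another-entry']
print(len(all_entries))  # 3
```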

/Hierarchy/Examples/WebDAV

Extends HTTP with concepts of collections and properties
- Also: locking, versioning, search, etc.

Often used (only) for HTTP-based access to a file system
- Also leveraged by fs-like systems like Subversion

Limited XML-based query with PROPFIND
- More query power with DASL

Somewhat heavy-weight for fine-grained access

Fragmented and often incompatible implementations
- File system backend as the lowest common denominator
- cf. AtomPub
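To give a feel for the weight of a PROPFIND query, the sketch below assembles the request without sending it. The `propfind_request` helper and the `/docs/` path are hypothetical; the `DAV:` namespace, the `Depth` header and the `displayname` and `getcontentlength` properties are standard WebDAV.

```python
import xml.etree.ElementTree as ET

DAV = "DAV:"


def propfind_request(path, properties, depth="1"):
    """Build method, path, headers and XML body for a PROPFIND request."""
    ET.register_namespace("D", DAV)
    root = ET.Element(f"{{{DAV}}}propfind")
    prop = ET.SubElement(root, f"{{{DAV}}}prop")
    for name in properties:
        ET.SubElement(prop, f"{{{DAV}}}{name}")
    body = ET.tostring(root, encoding="unicode")
    # Depth "0" is the resource itself, "1" adds its direct children.
    headers = {"Depth": depth, "Content-Type": "application/xml; charset=utf-8"}
    return "PROPFIND", path, headers, body


method, path, headers, body = propfind_request(
    "/docs/", ["displayname", "getcontentlength"]
)
print(method, path, headers["Depth"])
print(body)
```

Even a simple listing of one collection needs an XML request body and produces an XML multi-status response, which is part of why the protocol feels heavy for fine-grained access.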

/Hierarchy/Examples/RDBMS

Various ways of representing hierarchies in an RDBMS
- Adjacency model: Each row has a reference to the parent
- Nested sets: Rows numbered in depth-first traversal order
- etc.

Little structural flexibility

Expensive parent-child or subtree joins
- Vendor-specific extensions to address this problem

Two words: Impedance mismatch
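The adjacency model can be sketched with SQLite from the Python standard library. The `record` table and its contents are invented for the example. The subtree query relies on a recursive common table expression, which SQLite supports; other databases offer their own mechanisms for the same job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES record(id),"  # adjacency: link to parent
    " name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO record VALUES (?, ?, ?)",
    [
        (1, None, ""),
        (2, 1, "blog"),
        (3, 2, "2010"),
        (4, 3, "06"),
        (5, 4, "entry-for-today"),
        (6, 1, "users"),
    ],
)


def subtree_paths(conn, root_id):
    """Return paths of all records in the subtree, relative to its root."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id, path) AS (
            SELECT id, name FROM record WHERE id = ?
            UNION ALL
            SELECT r.id, s.path || '/' || r.name
            FROM record AS r JOIN subtree AS s ON r.parent_id = s.id
        )
        SELECT path FROM subtree ORDER BY path
        """,
        (root_id,),
    )
    return [row[0] for row in rows]


print(subtree_paths(conn, 2))
# ['blog', 'blog/2010', 'blog/2010/06', 'blog/2010/06/entry-for-today']
```

Each level of the subtree costs another self-join on the table, which illustrates why these queries get expensive on deep trees.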

/Hierarchy/Summary

Data storage/management using an explicit tree hierarchy

Natural mapping, nice non-functional characteristics

Limited functionality, lack of generic standards

Widely used, but in domain-specific ways
- Extremely efficient/scalable for certain data models

How about a generic, feature-rich hierarchical database?

/Case/JCR

Content Repository for Java Technology API (JCR)
- JCR 1.0 out in 2005, specified in JSR 170
- JCR 2.0 out in 2009, specified in JSR 283
- Work on JCR 2.1 starting

A content repository is a hierarchical content store with full text search, observation, versioning, transactions, etc.
- JCR 2.0 adds retention, type management, join queries, etc.

Designed for both structured and unstructured content
- Handling of both finely and coarsely grained data

Application platform more than an integration API

/Case/Jackrabbit

Reference implementation of both JCR 1.0 and 2.0

Primary focus on feature-completeness

Apache incubator since 2004, TLP since 2006

Internal storage through an abstracted key-value API
- Tree model implemented on top of that
- Lucene search index maintained separately
- Separate journal for cluster deployments

Advanced WebDAV support

Jackrabbit 3: Focus on scalability, modularity
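The idea of a tree model layered on a key-value API can be illustrated roughly as follows. This is not Jackrabbit's actual storage format or API: the `store` dictionary, `create_node` and `get_by_path` are hypothetical stand-ins that only show how parent and child links stored as values let a flat store behave like a tree.

```python
import uuid

store = {}  # key-value store: node id -> node state (a plain dict here)


def create_node(parent_id, name, properties=None):
    """Store a new node under its own key and link it from the parent."""
    node_id = str(uuid.uuid4())
    store[node_id] = {
        "name": name,
        "parent": parent_id,
        "children": {},  # child name -> child node id
        "properties": properties or {},
    }
    if parent_id is not None:
        store[parent_id]["children"][name] = node_id
    return node_id


def get_by_path(root_id, path):
    """Resolve a path with one key-value lookup per segment."""
    node_id = root_id
    for segment in filter(None, path.split("/")):
        node_id = store[node_id]["children"][segment]
    return store[node_id]


root_id = create_node(None, "")
blog_id = create_node(root_id, "blog", {"title": "My blog"})
create_node(blog_id, "entry-for-today")

print(get_by_path(root_id, "/blog")["properties"])             # {'title': 'My blog'}
print(get_by_path(root_id, "/blog/entry-for-today")["name"])   # entry-for-today
```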

/Case/Sling

Web framework based on the JCR content model

Apache incubator since 2007, TLP since 2009

Intuitive URL mapping
- Path selects the underlying content resource
- Optional selectors and extensions determine representation

JSON and POST servlets with JavaScript support

OSGi for server-side modularity

Everything is content
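The URL mapping can be approximated with a purely syntactic split. The `decompose` function and the example URL are hypothetical, and the sketch is a simplification: Sling itself matches the path against resources that actually exist in the repository and also supports a trailing suffix, neither of which is modeled here.

```python
def decompose(url_path):
    """Split a URL path into resource path, selectors and extension."""
    head, _, last = url_path.rpartition("/")
    name, *rest = last.split(".")
    # The final dot-separated part is the extension, anything between is a selector.
    selectors, extension = (rest[:-1], rest[-1]) if rest else ([], None)
    return {
        "resource": f"{head}/{name}",
        "selectors": selectors,
        "extension": extension,
    }


print(decompose("/content/blog/entry.print.a4.html"))
# {'resource': '/content/blog/entry', 'selectors': ['print', 'a4'], 'extension': 'html'}
```

The same content resource can then be rendered differently depending on the selectors and the extension, for example as HTML or JSON.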

/Case/Lessons Learned

Content-driven development

Data first, structure later

Distribute for redundancy
- Modern hardware goes a long way for scalability/performance
- For small/medium deployments, distribution is more important for fault-tolerance, especially in cloud environments

Relationships are important
- JCR 2.0 is a DAG, plus references for expressing full graphs
- Referential integrity not so important

Notable data sets are flat

Don’t forget tool support for ad-hoc tasks!

/Questions?

http://jackrabbit.apache.org/

http://sling.apache.org/

http://www.day.com/jsr283
