What Hadoop Got Right: Enabling Unconstrained Data Access

Crack Open Your Operational Database
Jamie Martin, [email protected]
September 24th, 2013


Analytics on Operational Data
● Most analytics are derived from operational data
● Two canonical approaches
  ○ in-situ: run analytics on the operational store
  ○ ex-situ: move data (ETL) to an optimized store
● Crack open the operational database
  ○ direct external access into the live OLTP database

In-Situ Analytics on Operational Data is Limited
● OLTP databases are not built for analytics
  ○ built for short transactions and simple queries over modest data sets
  ○ limited query expressiveness
  ○ storage impedance mismatch (e.g. row vs. column layout)
  ○ hybrids exist, but cannot bridge the gap across all dimensions
● Operational databases are usually resource constrained
  ○ limited CPU, cache, and IOPS available for analytics
  ○ long-running queries cause lock conflicts or MVCC inefficiencies
● Run analytics on an optimized analytics engine instead
  ○ optimized columnar stores
  ○ massively scalable compute engines
  ○ fast aggregation engines (OLAP)

Ex-Situ Analytics is an ETL Nightmare
● Get a snapshot of the operational store
  ○ must be non-disruptive
  ○ usually needs to be transactionally consistent
● Run analytics somewhere else
  ○ use other compute, perhaps with more suitable storage
● Capture ongoing changes from the OLTP engine
  ○ the analytics often need to stay live
● In practice this is really, really painful
  ○ ETL nightmare: expensive, rigid, slow, fragile
  ○ data governance and provenance problems

Hadoop: Unconstrained Data Access
● Open data ecosystem
  ○ distributed storage: HDFS
  ○ distributed compute: M/R (YARN)
  ○ really interesting data access possibilities
● Unconstrained access to data
  ○ data 'files' are all out there in the wild on HDFS
  ○ storage formats are typically public (implement M/R InputFormat/OutputFormat)
  ○ the ecosystem encourages integration (e.g. you can run M/R directly on HBase HFiles)
  ○ this is very different from your typical DBMS
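The slide's point can be sketched in a few lines. This is a hedged illustration, not Hadoop API code: a plain local directory stands in for HDFS, and CSV stands in for a public storage format such as an HFile. The names (`scan_table`, the `part-*` files) are hypothetical; the point is only that when the format is public, any external tool can read the table's files without going through the database engine.

```python
import csv
import os
import tempfile

def scan_table(table_dir):
    """Yield records from every data file in the table's directory.

    No DBMS involved: because the on-disk format is public, any external
    reader (an M/R job, a script) can consume the files directly.
    """
    for name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, name), newline="") as f:
            for row in csv.reader(f):
                yield row

# Build a toy "table" made of two part files, then scan it the way an
# external job would.
table_dir = tempfile.mkdtemp()
for i, rows in enumerate([[["1", "alice"]], [["2", "bob"]]]):
    with open(os.path.join(table_dir, f"part-{i:05d}"), "w", newline="") as f:
        csv.writer(f).writerows(rows)

records = list(scan_table(table_dir))
```

In the real ecosystem the equivalent move is implementing an `InputFormat` over the published file layout, which is what lets M/R run directly over HBase HFiles.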

OLTP-Enabled Analytics: Snapshots
● Goal: take a snapshot directly against an active database
● Getting to bytes is complicated
  ○ semantics of the data: columns and datatypes for table T
  ○ logical-to-physical mapping: where the data lives
  ○ physical consistency: coordination with writers
  ○ transactional consistency
  ○ persistence formats: understanding the data layout (rows, columns, etc.)
● The traditional database is a black box
  ○ contents of a table: system catalogs, JDBC metadata, etc.
  ○ where the data is: table spaces -> dbs -> partitions -> ... -> extents -> pages
  ○ physical consistency: in-memory latches, pinning
  ○ transactional consistency: in-memory lock tables, MVCC information
  ○ persistence formats: proprietary
  (diagram: queries go in, results come out; the data itself stays inside the engine)
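The "getting to bytes" checklist above amounts to metadata an external reader must obtain before it can touch a single byte. A minimal sketch, with entirely hypothetical names and paths, of what an externalized catalog answer for one table might look like — exactly the pieces a black-box DBMS keeps internal:

```python
# Hypothetical externalized catalog: each entry answers the questions an
# external reader has to ask before it can get to bytes.
CATALOG = {
    "orders": {
        # semantics of the data: columns and datatypes for table T
        "columns": [("id", "int"), ("total", "decimal")],
        # logical-to-physical mapping: which artifacts hold the data
        "artifacts": ["/hdfs/orders/part-0", "/hdfs/orders/part-1"],
        # published persistence format: how to decode the bytes
        "format": "hfile-v2",
    }
}

def describe(table):
    """Return (schema, artifact locations, format) for a table."""
    meta = CATALOG[table]
    return meta["columns"], meta["artifacts"], meta["format"]

cols, paths, fmt = describe("orders")
```

In a black-box DBMS the same answers exist only as system catalogs, internal page maps, and a proprietary layout; publishing them is what makes direct external access possible.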

Direct Snapshots
● An approach to direct external access
  ○ logical-to-physical data mapping externalized through a public catalog service
    ■ find the specific persistent artifacts that contain the desired data
    ■ the DBMS abdicates space management
  ○ physical consistency without latching
    ■ immutable storage (not ARIES)
    ■ anyone can read persisted data without coordination
  ○ transactional consistency through MVCC
    ■ records carry their transaction information
    ■ consistent point in time via filters on the data (not PITR)
  ○ published persistence formats
● These are the same techniques needed to scale up and out
  ○ MVCC and immutable data to scale up
  ○ a cross-node catalog describing persistence to scale out
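The "consistent point in time via filters on data" idea can be shown concretely. A minimal sketch under assumed record shapes (the `txn` field and the `visible` helper are hypothetical): each record carries the id of the transaction that wrote it, so an external reader obtains a consistent view purely by filtering immutable files, with no latches and no lock table.

```python
def visible(records, snapshot_txn, committed):
    """Keep only record versions written by transactions that committed
    at or before the snapshot point; later or uncommitted writes are
    filtered out. The files themselves are never locked or coordinated.
    """
    return [r for r in records
            if r["txn"] <= snapshot_txn and r["txn"] in committed]

rows = [
    {"key": "a", "val": 1, "txn": 5},  # committed before the snapshot
    {"key": "a", "val": 2, "txn": 9},  # written after the snapshot point
    {"key": "b", "val": 7, "txn": 6},  # transaction never committed
]
view = visible(rows, snapshot_txn=8, committed={5})
```

Because the storage is immutable, a concurrent writer can only add new versions; it can never invalidate the versions the filter already admitted, which is why no reader/writer coordination is needed.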

Taking a Snapshot
● Snapshot acquisition
  ○ obtain a snapshot for table T
    ■ locate the immutable artifacts that may have data for table T
    ■ register interest in them as of a point in time (MVCC)
    ■ get a consistent snapshot
  ○ access the data directly, with impunity
    ■ direct analytics, e.g. M/R on OLTP data
    ■ dump into a secondary system for subsequent analytics
  ○ release the snapshot
● Consistency without fine-grained coordination
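The acquire/read/release protocol above can be sketched as a reference count on the immutable artifacts. This is a hypothetical illustration (the `SnapshotManager` class and its methods are invented for this sketch): "registering interest" pins the files so compaction or garbage collection cannot remove them while a snapshot is open, and releasing drops the pin — no fine-grained locks anywhere.

```python
class SnapshotManager:
    """Pin immutable artifacts for the lifetime of an open snapshot."""

    def __init__(self):
        self.pins = {}  # artifact path -> number of open snapshots using it

    def acquire(self, artifacts):
        # Register interest: pinned artifacts survive compaction/GC.
        for a in artifacts:
            self.pins[a] = self.pins.get(a, 0) + 1
        return list(artifacts)  # the caller may now read these files directly

    def release(self, artifacts):
        # Drop the pin; unreferenced artifacts become reclaimable again.
        for a in artifacts:
            self.pins[a] -= 1
            if self.pins[a] == 0:
                del self.pins[a]

mgr = SnapshotManager()
snap = mgr.acquire(["/hdfs/t/part-0", "/hdfs/t/part-1"])
# ... run M/R over the files, or dump them into a secondary system ...
mgr.release(snap)
```

The only shared state is the coarse pin count per artifact, which is what "consistency without fine-grained coordination" buys: readers never touch the engine's latches or lock tables.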

Change Detection
● Allow direct external access to the OLTP transaction log
  ○ the transaction log as an externally meaningful data stream
● Externalized access
  ○ track transaction logs in the external catalog
  ○ physical consistency: logs are already append-only/immutable
  ○ transactional consistency: tie data MVCC to log records
  ○ published log formats
● Models
  ○ pull log chunks as needed
    ■ apply them to snapshots
  ○ push log records onto a data bus
    ■ enables streaming analytics
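The pull model above can be sketched in a few lines. The record shape (`op`/`key`/`val` dictionaries) and the `apply_log` helper are hypothetical; the point is that folding log chunks into a snapshot keeps a derived copy of the table live without re-running ETL.

```python
def apply_log(snapshot, log_records):
    """Fold insert/update/delete records pulled from the transaction log
    into a snapshot, producing an up-to-date derived copy of the table."""
    table = dict(snapshot)  # leave the original snapshot untouched
    for rec in log_records:
        if rec["op"] in ("insert", "update"):
            table[rec["key"]] = rec["val"]
        elif rec["op"] == "delete":
            table.pop(rec["key"], None)
    return table

base = {"a": 1, "b": 2}          # state captured by the snapshot
log = [                          # changes pulled since the snapshot point
    {"op": "update", "key": "a", "val": 10},
    {"op": "delete", "key": "b"},
    {"op": "insert", "key": "c", "val": 3},
]
live = apply_log(base, log)
```

The push model is the same records delivered over a data bus instead of pulled in chunks, which is what makes streaming analytics on the change feed possible.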

Challenges
● Schema evolution
  ○ snapshots cannot require DDL coordination
  ○ hard to reconcile schema changes arriving amid the firehose of data changes
● Externalizing persistence formats is easier said than done