What Hadoop Got Right: Enabling Unconstrained Data Access
TRANSCRIPT
Crack Open Your Operational Database
Jamie [email protected] 24th, 2013
Analytics on Operational Data
● Most analytics are derived from operational data
● Two canonical approaches
○ in-situ: run analytics on the operational store
○ ex-situ: move data (ETL) to an optimized store
● Crack open the operational database
○ direct external access into the live OLTP database
In-Situ Analytics on Operational Data is Limited
● OLTP databases are not built for analytics
○ built for short transactions and simple queries over modest data sets
○ limited query expressiveness
○ storage impedance mismatch (e.g. row vs. column layout)
○ hybrids exist, but cannot bridge the gap across all dimensions
● Operational databases are usually resource constrained
○ limited CPU, cache, and IOPS available for analytics
○ long-running queries cause lock conflicts or MVCC inefficiencies
● Run analytics on an optimized analytics engine
○ optimized columnar stores
○ massively scalable compute engines
○ fast aggregation engines (OLAP)
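The storage impedance mismatch above can be made concrete with a small sketch (illustrative, not from the talk): an analytic aggregation over one column must drag every field of every row through the cache in a row store, while a columnar layout lets it scan just the values it needs.

```python
# Illustrative sketch of the row-vs-column impedance mismatch. The same
# aggregate is computed over a row layout (every field of every row is
# touched) and a columnar layout (only the needed column is scanned).

rows = [{"id": i, "name": f"user{i}", "age": 20 + i % 50, "city": "x"}
        for i in range(1000)]

# Row layout: the scan visits whole row dicts to read one field.
total_row = sum(r["age"] for r in rows)

# Columnar layout: the same values stored contiguously, scanned alone.
columns = {"age": [20 + i % 50 for i in range(1000)]}
total_col = sum(columns["age"])

assert total_row == total_col
```

Both scans produce the same answer; the difference is in how many bytes each one has to move, which is exactly why analytic engines prefer columnar storage.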
Ex-situ Analytics is an ETL Nightmare
● Get a snapshot of the operational store
○ must be non-disruptive
○ usually needs to be transactionally consistent
● Run analytics somewhere else
○ use other compute, perhaps with better-suited storage
● Capture ongoing changes from the OLTP engine
○ often need to keep the analytics live
● In practice this is genuinely painful
○ ETL nightmare: expensive, rigid, slow, fragile
○ data governance and provenance problems
Hadoop: Unconstrained Data Access
● Open data ecosystem
○ distributed storage: HDFS
○ distributed compute: MapReduce (on YARN)
○ really interesting data access possibilities
● Unconstrained access to data
○ data ‘files’ are all out there in the wild on HDFS
○ storage formats are typically public (implement an M/R InputFormat/OutputFormat)
○ the ecosystem encourages integration (e.g. run M/R directly on HBase HFiles)
○ this is very different from your typical DBMS
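The "public storage format" point is the crux: if the on-disk layout is documented, any external tool can parse the bytes without going through the owning system. A toy sketch (the format here is hypothetical, just a 4-byte length prefix plus UTF-8 payload) shows how little is needed for a reader that is completely independent of the writer:

```python
# Hypothetical sketch of a "public storage format": records in an ordinary
# byte stream with a documented layout (4-byte big-endian length prefix +
# UTF-8 payload). Any external tool -- an M/R job, a one-off script -- can
# read the data without asking the owning database.
import io
import struct

def write_records(buf, records):
    # Writer side: length-prefix each UTF-8 encoded record.
    for rec in records:
        payload = rec.encode("utf-8")
        buf.write(struct.pack(">I", len(payload)))
        buf.write(payload)

def read_records(buf):
    # Reader side: knows only the published layout, nothing else.
    records = []
    while True:
        header = buf.read(4)
        if not header:
            return records
        (length,) = struct.unpack(">I", header)
        records.append(buf.read(length).decode("utf-8"))

buf = io.BytesIO()
write_records(buf, ["alice,30", "bob,41"])
buf.seek(0)
assert read_records(buf) == ["alice,30", "bob,41"]
```

Real Hadoop formats (SequenceFile, HFile, etc.) are richer, but the principle is the same: the layout is public, so the reader needs no cooperation from the writer.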
OLTP Enabled Analytics: Snapshots
● Goal: take a snapshot directly against an active database
● Getting to the bytes is complicated
○ understand the semantics of the data - columns, datatypes for table T
○ logical-to-physical mapping - where is the data?
○ physical consistency - coordination with writers
○ transactional consistency
○ persistence formats - understanding the data layout (rows, columns, etc.)
● Traditional database is a black box
○ contents of table T: system catalogs, JDBC metadata, etc.
○ where the data lives: table spaces -> databases -> partitions -> ... -> extents -> pages
○ physical consistency: in-memory latches, page pinning
○ transactional consistency: in-memory lock tables, MVCC information
○ persistence formats: proprietary
[slide diagram: queries and results flow through the DBMS, which alone touches the data]
Direct Snapshots
● An approach to direct external access
○ logical-to-physical data mapping externalized through a public catalog service
■ find the specific persistent artifacts that contain the desired data
■ the DBMS abdicates space management
○ physical consistency without latching
■ immutable storage (not ARIES-style update-in-place)
■ anyone can read the persisted data without coordination
○ transactional consistency through MVCC
■ records carry transaction information
■ consistent point in time via filters on the data (not point-in-time recovery)
○ published persistence formats
● These are the same techniques needed to scale up and out
○ MVCC & immutable data to scale up
○ a cross-node catalog describing persistence to scale out
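The mechanisms above can be sketched in a few lines (all names and record shapes here are illustrative, not from any real system): records in immutable files carry the id of the transaction that wrote them, a public catalog maps tables to files, and a consistent point-in-time view is just a visibility filter over those records.

```python
# Hedged sketch of direct snapshots via MVCC filters over immutable files.
# Each record: (key, value, txn_id_created, txn_id_deleted or None).
FILE_1 = [("k1", "v1", 5, None), ("k2", "v2", 7, 12)]
FILE_2 = [("k2", "v2'", 12, None), ("k3", "v3", 15, None)]

# A public catalog service would map table T to its immutable files;
# here it is a plain dict standing in for that service.
CATALOG = {"T": [FILE_1, FILE_2]}

def snapshot(table, as_of_txn):
    """Return the records of `table` visible at transaction `as_of_txn`.

    No latches and no log replay: the files are immutable, so the filter
    alone yields a transactionally consistent point-in-time view.
    """
    visible = {}
    for file in CATALOG[table]:
        for key, value, created, deleted in file:
            if created <= as_of_txn and (deleted is None or deleted > as_of_txn):
                visible[key] = value
    return visible

assert snapshot("T", 10) == {"k1": "v1", "k2": "v2"}
assert snapshot("T", 20) == {"k1": "v1", "k2": "v2'", "k3": "v3"}
```

Note how the update to k2 is an out-of-place rewrite: the old version stays readable for older snapshots, which is what makes uncoordinated external reads safe.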
Taking a Snapshot
● Snapshot acquisition
○ obtain a snapshot for table T
■ locate the immutable artifacts that may contain data for table T
■ register interest in them as of a point in time (MVCC)
■ get a consistent snapshot
○ access the data directly, with impunity
■ direct analytics, e.g. M/R on OLTP data
■ dump into a secondary system for subsequent analytics
○ release the snapshot
● Consistency without fine-grained coordination
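The acquire/use/release lifecycle above can be sketched as follows (class and method names are assumptions for illustration): "registering interest" pins the immutable artifacts against garbage collection for the duration of the job, and releasing the snapshot lets the store reclaim superseded files. No fine-grained coordination with writers is needed because the files themselves never change.

```python
# Illustrative sketch (assumed names) of the snapshot lifecycle: register
# interest in a table's immutable files, read them with impunity, release.

class SnapshotManager:
    def __init__(self, catalog):
        self.catalog = catalog   # table -> list of immutable file names
        self.pinned = {}         # snapshot id -> file names it pins
        self.next_id = 0

    def acquire(self, table):
        """Register interest in the table's current files; return (id, files)."""
        self.next_id += 1
        files = list(self.catalog[table])
        self.pinned[self.next_id] = files
        return self.next_id, files

    def release(self, snap_id):
        """Drop the registration so superseded files can be reclaimed."""
        del self.pinned[snap_id]

    def can_collect(self, filename):
        """A file is collectible only if no live snapshot pins it."""
        return all(filename not in files for files in self.pinned.values())

mgr = SnapshotManager({"T": ["t-0001.dat", "t-0002.dat"]})
snap_id, files = mgr.acquire("T")
assert not mgr.can_collect("t-0001.dat")   # pinned while the job runs
mgr.release(snap_id)
assert mgr.can_collect("t-0001.dat")       # safe to reclaim afterwards
```

The external job can hand `files` straight to an M/R run or bulk-load them into a secondary system; the pin is the only contract it has with the OLTP engine.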
Change Detection
● Allow direct external access to the OLTP transaction log
○ the transaction log as an externally meaningful data stream
● Externalized access
○ track transaction logs in the external catalog
○ physical consistency - logs are already append-only/immutable
○ transactional consistency - tie data MVCC to log records
○ published log formats
● Models
○ pull log chunks as needed
■ apply them to snapshots
○ push log records onto a data bus
■ enables streaming analytics
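The pull model above can be sketched as follows (record shapes are illustrative): because the log records carry the same MVCC transaction ids as the data, a consumer can pull chunks of the transaction log and roll an existing snapshot forward to a newer consistent point in time.

```python
# Hedged sketch of the pull model: apply transaction-log records to a
# snapshot to advance it between two consistent points in time.

snapshot = {"k1": "v1", "k2": "v2"}     # consistent as of txn 10
log = [
    (12, "put", "k2", "v2'"),           # (txn_id, op, key, value)
    (15, "put", "k3", "v3"),
    (18, "delete", "k1", None),
]

def apply_log(snap, log, from_txn, to_txn):
    """Roll `snap` forward from `from_txn` to `to_txn` using log records."""
    for txn, op, key, value in log:
        if from_txn < txn <= to_txn:
            if op == "put":
                snap[key] = value
            elif op == "delete":
                snap.pop(key, None)
    return snap

assert apply_log(dict(snapshot), log, 10, 15) == {"k1": "v1", "k2": "v2'", "k3": "v3"}
assert apply_log(dict(snapshot), log, 10, 18) == {"k2": "v2'", "k3": "v3"}
```

The push model is the same transformation with the roles reversed: instead of the consumer pulling chunks, each log record is published onto a data bus as it commits, which is what enables streaming analytics downstream.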