discardable in-memory materialized queries with hadoop

25
Page 1 © Hortonworks Inc. 2014 Discardable, In-Memory Materialized Query for Hadoop Julian Hyde Julian Hyde June 3 rd , 2014

Upload: julianhyde

Post on 15-Jan-2015

2.723 views

Category:

Technology


2 download

DESCRIPTION

What to do with all that memory in a Hadoop cluster? Should we load all of our data into memory to process it? The goal should be to put memory into its right place in the storage hierarchy, alongside disk and solid-state drives (SSD). Data should reside in the right place for how it is being used, and should be organized appropriately for where it resides. This proposed solution requires a new kind of data set called the Discardable, In-Memory, Materialized Query (DIMMQ). In this session we will talk through how we can build on existing Hadoop facilities to deliver three key underlying concepts that enable this approach.

TRANSCRIPT

Page 1: Discardable In-Memory Materialized Queries With Hadoop

Page 1 © Hortonworks Inc. 2014

Discardable, In-MemoryMaterialized Query for HadoopJulian Hyde Julian Hyde

June 3rd, 2014

Page 2: Discardable In-Memory Materialized Queries With Hadoop

Page 2 © Hortonworks Inc. 2014

About me

Julian Hyde

Architect at Hortonworks

Open source:• Founder & lead, Apache Optiq (query optimization framework)

• Founder & lead, Pentaho Mondrian (analysis engine)

• Committer, Apache Drill

• Contributor, Apache Hive

• Contributor, Cascading Lingual (SQL interface to Cascading)

Past:• SQLstream (streaming SQL)

• Broadbase (data warehouse)

• Oracle (SQL kernel development)

Page 3: Discardable In-Memory Materialized Queries With Hadoop

Page 3 © Hortonworks Inc. 2014

Before we get started…

The bad news• This software is not available to download

• I am a database bigot

• I am a BI bigot

The good news• I believe in Hadoop

• Now is a great time to discuss where Hadoop is going

Page 4: Discardable In-Memory Materialized Queries With Hadoop

Page 4 © Hortonworks Inc. 2014

Hadoop today

Brute forceHadoop brings a lot of CPU, disk, IO

Yarn, Tez, Vectorization are making Hadoop faster

How to use that brute force is left to the application

Business IntelligenceBest practice is to pull data out of Hadoop

• Populate enterprise data warehouse

• In-memory analytics

• Custom analytics, e.g. Lambda architecture

Ineffective use of memory

Opportunity to make Hadoop smarter

Page 5: Discardable In-Memory Materialized Queries With Hadoop

Page 5 © Hortonworks Inc. 2014

Brute force + diplomacy

Page 6: Discardable In-Memory Materialized Queries With Hadoop

Page 6 © Hortonworks Inc. 2014

Hardware trends

Typical Hadoop server configuration

Lots of memory - but not enough for all data

More heterogeneous mix (disk + SSD + memory)

What to do about memory?

Year 2009 2014

Cores 4 – 8 24

Memory 8 GB 128 GB

SSD None 1 TB

Disk 4 x 1 TB 12 x 4 TB

Disk : memory 512 : 1 384 : 1

Page 7: Discardable In-Memory Materialized Queries With Hadoop

Page 7 © Hortonworks Inc. 2014

Dumb use of memory - Buffer cache

Example:50 TB data

1 TB memory

Full-table scan

Analogous to virtual memory• Great while it works, but…

• Table scan nukes the cache!

Table (disk)

Buffer cache(memory)

Table scan

Queryoperators(Tez, orwhatever)

Pre-fetch

Page 8: Discardable In-Memory Materialized Queries With Hadoop

Page 8 © Hortonworks Inc. 2014

“Document-oriented” analysis

Operate on working sets small enough to fit in memory

Analogous to working on a document (e.g. a spreadsheet)• Works well for problems that fit into memory (e.g. some machine-learning algorithms)

• If your problem grows, you’re out of luck

Working set (memory)

Table (disk)

Interactive user

Page 9: Discardable In-Memory Materialized Queries With Hadoop

Page 9 © Hortonworks Inc. 2014

Smarter use of memory - Materialized queries

In-memorymaterializedqueries

Tableson disk

Page 10: Discardable In-Memory Materialized Queries With Hadoop

Page 10 © Hortonworks Inc. 2014

Census data

Census table – 300M records

Which state has the most males?

Brute force – read 150M records

Smarter – read 50 records

CREATE TABLE Census (id, gender, zipcode, state, age);

SELECT state, COUNT(*) AS cFROM CensusWHERE gender = ‘M’GROUP BY stateORDER BY c DESC LIMIT 1;

CREATE TABLE CensusSummary ASSELECT state, gender, COUNT(*) AS cFROM CensusGROUP BY state, gender;

SELECT state, cFROM CensusSummaryWHERE gender = ‘M’ORDER BY c DESC LIMIT 1;

Page 11: Discardable In-Memory Materialized Queries With Hadoop

Page 11 © Hortonworks Inc. 2014

Materialized view – Automatic smartness

A materialized view is a table that is declared to be identical to a given query

Optimizer rewrites query on “Census” to use “CensusSummary” instead

Even smarter – read 50 records

CREATE MATERIALIZED VIEW CensusSummary STORAGE (MEMORY, DISCARDABLE) ASSELECT state, gender, COUNT(*) AS cFROM CensusGROUP BY state, gender;

SELECT state, COUNT(*) AS cFROM CensusWHERE gender = ‘M’GROUP BY stateORDER BY c DESC LIMIT 1;

Page 12: Discardable In-Memory Materialized Queries With Hadoop

Page 12 © Hortonworks Inc. 2014

Indistinguishable from magic

Run query #1

System creates materialized view in background

Related query #2 runs faster

SELECT state, COUNT(*) AS cFROM CensusWHERE gender = ‘M’GROUP BY stateORDER BY c DESC LIMIT 1;

CREATE MATERIALIZED VIEW CensusSummary STORAGE (MEMORY, DISCARDABLE) ASSELECT state, gender, COUNT(*) AS cFROM CensusGROUP BY state, gender;

SELECT state, COUNT(NULLIF(gender, ‘F’)) AS males, COUNT(NULLIF(gender, ‘M’)) AS femalesFROM CensusGROUP BY stateHAVING females > males;

Page 13: Discardable In-Memory Materialized Queries With Hadoop

Page 13 © Hortonworks Inc. 2014

Materialized views - Classic

Classic materialized view (Oracle, DB2, Teradata, MSSql)1. A table defined using a SQL query

2. Designed by DBA

3. Storage same as a regular table

1. On disk

2. Can define indexes

4. DB populates the table

5. Queries are rewritten to use the table**

6. DB updates the table to reflect changes to source data (usually deferred)*

*Magic required

SELECT t.year, AVG(s.units)FROM SalesFact AS sJOIN TimeDim AS t USING (timeId)GROUP BY t.year;

CREATE MATERIALIZED VIEW SalesMonthZip ASSELECT t.year, t.month, c.state, c.zipcode, COUNT(*), SUM(s.units), SUM(s.price)FROM SalesFact AS sJOIN TimeDim AS t USING (timeId)JOIN CustomerDim AS c USING (customerId)GROUP BY t.year, t.month, c.state, c.zipcode;

Page 14: Discardable In-Memory Materialized Queries With Hadoop

Page 14 © Hortonworks Inc. 2014

Materialized views - DIMMQ

DIMMQs - Discardable, In-memory Materialized Queries

Differences with classic materialized views1. May be in-memory

2. HDFS may discard – based on DDM (Distributed Discardable Memory)

3. Lifecycle support:

1. Assume table is populated

2. Don’t populate & maintain

3. User can flag as valid, invalid, or change definition (e.g. date range)

4. HDFS may discard

4. More design options:

1. DBA specifies

2. Retain query results (or partial results)

3. An agent builds MVs based on query traffic

Page 15: Discardable In-Memory Materialized Queries With Hadoop

Page 15 © Hortonworks Inc. 2014

DIMMQ compared to Spark RDDs

It’s not “either / or”• Spark-on-YARN, SQL-on-Spark already exist; Cascading-on-Spark common soon

• Hive-queries-on-RDDs, DIMMQs populated by Spark, Spark-on-Hive-tables are possible

Spark RDD DIMMQ

Access By reference By algebraic expression

Sharing Within session Across sessions

Execution model Built-in External

Recovery Yes No

Discard Failure or GC Failure or cost-based

Native organization Memory Disk

Language Scala (or other JVM language)

Language-independent

Page 16: Discardable In-Memory Materialized Queries With Hadoop

Page 16 © Hortonworks Inc. 2014

Data independence

This is not just about SQL standards compliance!Materialized views are supposed to be transparent in creation, maintenance and use.

If not one DBA ever types “CREATE MATERIALIZED VIEW”, we have still succeeded

Data independenceAbility to move data around and not tell your application

Replicas

Redundant copies

Moving between disk and memory

Sort order, projections (à la Vertica), aggregates (à la Microstrategy)

Indexes, and other weird data structures

Page 17: Discardable In-Memory Materialized Queries With Hadoop

Page 17 © Hortonworks Inc. 2014

Implementing DIMMQs

Relational algebraApache Optiq (just entered incubator – yay!)

Algebra, rewrite rules, cost model

MetadataHive: “CREATE MATERIALIZED VIEW”

Definitions of materialized views in HCatalog

HDFS - Discardable Distributed Memory (DDM)Off-heap data in memory-mapped files

Discard policy

Build in-memory, replicate to disk; or vice versa

Central namespace

Evolution of existing components

Page 18: Discardable In-Memory Materialized Queries With Hadoop

Page 18 © Hortonworks Inc. 2014

Tiled queries in distributed memory

Query: SELECT x, SUM(y) FROM t GROUP BY x

In-memorymaterializedqueries

Tableson disk

Page 19: Discardable In-Memory Materialized Queries With Hadoop

Page 19 © Hortonworks Inc. 2014

An adaptive system

Ongoing activities:• Agent suggests new MVs

• MVs are built in background

• Ongoing query activity uses MVs

• User marks MVs as invalid due to source data changes

• HDFS throws out MVs that are not pulling their weight

Dynamic equilibriumDIMMQs continually created & destroyed

System moves data around to adapt to changing usage patterns

Page 20: Discardable In-Memory Materialized Queries With Hadoop

Page 20 © Hortonworks Inc. 2014

Lambda architecture

From “Runaway complexity in Big Data and a plan to stop it” (Nathan Marz)

Page 21: Discardable In-Memory Materialized Queries With Hadoop

Page 21 © Hortonworks Inc. 2014

Lambda architecture in Hadoop via DIMMQs

Use DIMMQs for materialized historic & streaming data

MV on disk

MV in key-value store

Hadoop

Query via SQL

Page 22: Discardable In-Memory Materialized Queries With Hadoop

Page 22 © Hortonworks Inc. 2014

Variations on a theme

Materialized queries don’t have to be in memory

Materialized queries don’t need to be discardable

Materialized queries don’t need to be accessed via SQL

Materialized queries allow novel data structures to be described

Maintaining materialized queries - build on Hive ACID

Fine-grained invalidation

Streaming into DIMMQs

In-memory tables don’t have to be materialized queries

Data aging – Older data to cheaper storage

Discardable

In-memory

Algebraic

Page 23: Discardable In-Memory Materialized Queries With Hadoop

Page 23 © Hortonworks Inc. 2014

Lattice

Space of possible materialized viewsA star schema, with mandatory many-to-one relationships

Each view is a projected, filtered aggregation• Sales by zipcode and quarter in 2013

• Sales by state in Q1, 2012

Lattice gathers stats• “I used MV m to answer query q and avoided

fetching r rows”

• Cost of MV = construction effort + memory * time

• Utility of MV = query processing effort saved

Recommends & builds optimal MVs

CREATE LATTICE SalesStar ASSELECT *FROM SalesFact AS sJOIN TimeDim AS t USING (timeId)JOIN CustomerDim AS c USING (customerId);

Page 24: Discardable In-Memory Materialized Queries With Hadoop

Page 24 © Hortonworks Inc. 2014

Conclusion

Broadening the tent – batch, interactive BI, streaming, iterative

Declarative, algebraic, transparent

Seamless movement between memory and disk

Adding brains to complement Hadoop’s brawn

Page 25: Discardable In-Memory Materialized Queries With Hadoop

Page 25 © Hortonworks Inc. 2014

Thank you!

My next talk:“Cost-based query optimization in Hive”4:35pm Wednesday

@julianhyde

http://hortonworks.com/blog/dmmq/

http://hortonworks.com/blog/ddm/

http://incubator.apache.org/projects/optiq.html