Taming the Elephant: The Power of SQL on Hadoop


Upload: inside-analysis

Post on 01-Jul-2015


DESCRIPTION

Hot Technologies with Dr. Robin Bloor, John O’Brien and Actian. Live Webcast on July 16, 2014. Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=af8301607c715cde31012925e17cba93

The latest Holy Grail in the data world is SQL on Hadoop: marrying the long-standing query language with the innovative data platform. Cracking this nut will pave the way for a decidedly new kind of information architecture, one that is less constrained by the complexity of current data environments. Data-driven businesses will have faster access to more data, with much greater flexibility for mixing and matching, slicing and dicing.

Register for this episode of Hot Technologies to hear veteran analysts Dr. Robin Bloor and John O’Brien as they provide their insights about what’s required to effectively fuse SQL and Hadoop. They’ll be briefed by John Santaferraro of Actian, who will tout his company’s recent announcement of its Actian Analytics Platform – Hadoop SQL Edition, which leverages vector processing to greatly increase the speed of complex data management functions. He’ll offer a demo of the solution to show how it enables SQL queries on Hadoop and also provides enterprise hardening, including security and full ACID compliance.

Visit InsideAnalysis.com for more information.

TRANSCRIPT

Page 1: Taming the Elephant: The Power of SQL on Hadoop

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Page 2: Taming the Elephant: The Power of SQL on Hadoop

Hot Technologies of 2014

Page 3: Taming the Elephant: The Power of SQL on Hadoop

HOST:  Eric  Kavanagh  

Page 4: Taming the Elephant: The Power of SQL on Hadoop

     THIS  YEAR  is…  

Page 5: Taming the Elephant: The Power of SQL on Hadoop

SQL on Hadoop

• SQL has been the de facto query language for decades

• Hadoop provides an innovative data platform, but accessing and leveraging the file system has so far often meant the need for a whole new skill set

• The marriage of highly performant SQL and Hadoop can be a giant step forward

Page 6: Taming the Elephant: The Power of SQL on Hadoop

THE LINE UP

ANALYST: John O’Brien, Principal & CEO, Radiant Advisors

ANALYST: Robin Bloor, Chief Analyst, The Bloor Group

GUEST: John Santaferraro, Vice President of Marketing, Actian

Page 7: Taming the Elephant: The Power of SQL on Hadoop

INTRODUCING  

John  O’Brien  

Page 8: Taming the Elephant: The Power of SQL on Hadoop

© Copyright 2014 Radiant Advisors. All Rights Reserved

TAMING THE ELEPHANT: THE POWER OF SQL-ON-HADOOP

Hot Technologies – Inside Analysis July 16, 2014

John O’Brien | Principal Advisor and CEO, Radiant Advisors @obrienjw @radiantadvisors [email protected]


Page 9: Taming the Elephant: The Power of SQL on Hadoop

The Power of SQL-on-Hadoop: HOW SQL UNLOCKS DISCOVERY

Enable Highly Iterative
• Access, assemble, verify, deploy process, and modern data platform
• Enable fail-fast, short shelf life, personalized to enterprise context

Self-Sufficiency is the New Self-Service
• Agility and data integration through abstraction usage
• Enable many business analysts, not just programmers, with pre-built

Intuitive Visualization Tools Oriented
• All forms of business analytics required from SQL, nPath, Graph, Textual, Statistical, Predictive to achieve business goals

Page 10: Taming the Elephant: The Power of SQL on Hadoop

The Power of SQL-on-Hadoop: UNLOCKING BIG DATA VALUE

[Figure: Business Value plotted against Users Involved, rising from Very Few Data Scientists through More Analysts, Power Users, and Analysts & Casual Users to Many Many Consumers. Hadoop v1 stack labels: Hadoop Distributed File System, MapReduce, HCatalog, Hive, PIG, DB, BI Tool.]

Have to meet the Casual Users' expectations

Page 11: Taming the Elephant: The Power of SQL on Hadoop

The Power of SQL-on-Hadoop: INDEPENDENT BENCHMARK DOWNLOAD

Page 12: Taming the Elephant: The Power of SQL on Hadoop

The Power of SQL-on-Hadoop: KEY EVALUATION CONSIDERATIONS

Evaluation Criteria

SQL Capability
• Tools Compatibility
• ANSI SQL
• Analytic SQL
• User Defined Functions

Scalability
• How many nodes max?
• All nodes in cluster?
• Subset of cluster?
• Data duplication?

Speed
• Response time
• Ad-hoc workloads
• Without caching
• Concurrency

Architecture
• YARN compatible
• Data file formats
• Data Lake strategy
• Semantic Layer
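One way to put the four evaluation criteria to work is a simple weighted scorecard. The sketch below is only an illustration: the weights, the 1-to-5 scale, and the function name `weighted_score` are assumptions made for this example and do not come from the slides.

```python
# Hypothetical weights for the four criteria on the slide.
# The numbers are illustrative assumptions, not recommendations from the deck.
CRITERIA_WEIGHTS = {
    "sql_capability": 0.3,
    "scalability": 0.2,
    "speed": 0.3,
    "architecture": 0.2,
}


def weighted_score(scores, weights=CRITERIA_WEIGHTS):
    """Return the weighted average of per-criterion scores (same scale as the inputs)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


if __name__ == "__main__":
    # Made-up scores (1 = poor, 5 = strong) for an imaginary engine.
    engine = {"sql_capability": 4, "scalability": 3, "speed": 5, "architecture": 2}
    print(round(weighted_score(engine), 2))
```

Each team would set its own weights; the point is only that the checklist can be turned into a comparable number per engine.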

Page 13: Taming the Elephant: The Power of SQL on Hadoop

The Power of SQL-on-Hadoop: EVOLVING ARCHITECTURE FOR SQL

[Figure: three stack diagrams.
Batch-oriented SQL (Hadoop v1): PIG and Hive-QL over MapReduce on Hadoop HDFS.
Interactive SQL (Hadoop v2): MapReduce, PIG and Hive 0.13 over YARN on Hadoop HDFS, with Tez.
Architectural SQL (Hadoop v2 with more SQL options): PIG, Hive and M/R over YARN on Hadoop HDFS, with Tez, alongside MPP engines (Impala, HAWQ, InfiniDB, Presto, *Vortex).]

Page 14: Taming the Elephant: The Power of SQL on Hadoop

The Power of SQL-on-Hadoop: MODERN DATA PLATFORM UNIFIED SQL

[Figure: the modern data platform in three classes.
Flexibility Class (Discovery, Scalable, Programs): Apache Hadoop with HDFS, MapReduce and PIG / Hive, fed by Operational Systems, Big Data, Streams.
Reference Class (Stable, Context, SQL): Enterprise Data Warehouses and Master Reference Data.
Optimized Class (Discovery & Analytics Oriented): Highly Optimized for Analytics (In-memory, MOLAP, MPP, Columnar) and Highly Specialized for Analytics (Graphs, Document Stores, Text Analytics).
Connector label: Generate Hive SQL.]

Extending SQL Access to Big Data and Hadoop via Hive and other HDFS SQL engines

Page 15: Taming the Elephant: The Power of SQL on Hadoop

THANK YOU!

For more information: www.RadiantAdvisors.com

Twitter: @RadiantAdvisors
RSS: feed://radiantadvisors.com/feed/
Email us at: [email protected]
LinkedIn: www.linkedin.com/company/radiant-advisors

Page 16: Taming the Elephant: The Power of SQL on Hadoop

The Power of SQL-on-Hadoop: ANALYST QUESTIONS

1.  What file format do you recommend loading data into for SQL? (e.g. RC, ORC, Sequence, Parquet, JSON, proprietary)

2.  Are the data files accessible by other Hadoop engines (Hive, PIG, MapReduce, Spark) or duplicated for SQL access?

3.  Where is the schema metadata stored in Hadoop? (e.g. Hive metastore, HCatalog, other)

4.  How easy is it for business analysts to create and work with schema definitions?

Page 17: Taming the Elephant: The Power of SQL on Hadoop

INTRODUCING  

Dr.  Robin  Bloor  

Page 18: Taming the Elephant: The Power of SQL on Hadoop
Page 19: Taming the Elephant: The Power of SQL on Hadoop

Hadoop

The Obvious Role of Hadoop is as the Staging Area for Data Refinement

But it can also be a file system for a database

Page 20: Taming the Elephant: The Power of SQL on Hadoop

Big Data Architecture In Overview

Think Logical, Implement Physical

Page 21: Taming the Elephant: The Power of SQL on Hadoop

Two Data Flows

Page 22: Taming the Elephant: The Power of SQL on Hadoop

Within The Data Hub

Page 23: Taming the Elephant: The Power of SQL on Hadoop

Within The Data Hub

Nevertheless, the main workload is SQL, and SQL with analytics

Page 24: Taming the Elephant: The Power of SQL on Hadoop

SQL on Hadoop

It’s not about SQL on Hadoop; it’s about fast SQL on Hadoop

Hadoop both as a file system and a database is probably desirable

Page 25: Taming the Elephant: The Power of SQL on Hadoop

INTRODUCING  

John  Santaferraro  

Page 26: Taming the Elephant: The Power of SQL on Hadoop

Confidential © 2014 Actian Corporation

Actian Analytics Platform Hadoop SQL Edition: Hot Technologies Webinar

John Santaferraro, VP of Solution and Product Marketing, Actian

July 16, 2014

Page 27: Taming the Elephant: The Power of SQL on Hadoop

Actian Turns Data into Transformational Value

[Figure: Data Explosion flows through the Actian Analytics Platform™ (Analyze, Act, Connect) to Transformational Value.]

Transformational Value
• Customer Delight
• Competitive Advantage
• World-Class Risk Management
• Disruptive New Business Models

Best in Class Usage
• Discovery without limitations
• Low latency at any scale
• Reactive to predictive
• Static to dynamic
• Segment of 1

Best in Class Capabilities
• Design-time & run-time optimization
• Linear parallelism
• Rich analytics DNA
• Pipeline architecture
• Affordable unlimited scale

Page 28: Taming the Elephant: The Power of SQL on Hadoop

Actian Analytics Platform: Next Generation Big Data Analytics

[Figure: the Actian Analytics Platform™.
Data sources: Enterprise Data, Machine Data, Social Data, SaaS Data, Data Warehouse, Amazon Redshift.
Platform components: Connections for Any Data; Massively Parallel Integration; Hadoop; High-Performance, Low Latency Analytics in Database; Libraries of Analytics; Real-Time Analytic Services; Visual Data Science and Analytics Workbench; High Performance Data Science Natively in Hadoop.
Consumers: Business Processes, Users, Machines, Applications.]

Page 29: Taming the Elephant: The Power of SQL on Hadoop

Actian Analytics Platform – Hadoop SQL Edition
Industrialized, High-Performance SQL in Hadoop

What's HOT?
• Turns Hadoop into a High-Performance, Fully-Functional Analytics Platform

What makes it HOT?
• Highest performing, most industrialized SQL access to Hadoop data
• Only end-to-end analytic processing natively in Hadoop
• Most consumable, accessible, manageable Hadoop analytics

What does this mean to YOU?
• Removes all barriers for business access to big data analytics
• Unleashes millions of business-savvy SQL users with no constraints on Hadoop data
• Accelerates time to value and turns Hadoop data into transformational value

Page 30: Taming the Elephant: The Power of SQL on Hadoop

Actian Analytics Platform – Hadoop SQL Edition
Transform Hadoop into a High Performance Analytics Platform

[Figure: a Hadoop cluster with YARN, a Namenode and several Datanodes, each Datanode running an X100 engine on HDFS. On top sit SQL access and the Visual Data Science & Analytics Workbench. Component labels: High Performance Dataflow Engine (Read, Load; Blend & Enrich; Data Science & Analytics) and High Performance, Industrialized SQL Database (Actian Vector).]

HDFS
• Original file format
• Standard block replication

Vector
• Column-based blocks
• Binary
• Compressed
• Partitioned

Benefits
• Faster Loading
• Faster SQL
• Standard SQL
• Better Scaling
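The "column-based blocks" idea on this slide can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not Actian's implementation: it pivots a few made-up rows into one typed array per column and then aggregates a single column in fixed-size slices ("vectors"), so a query that needs only `amount` never touches the other columns. The sample data and the names `to_columns` and `sum_amount_vectorized` are hypothetical.

```python
from array import array

# Hypothetical sample rows: (order_id, region, amount).
ROWS = [
    (1, "east", 10.0),
    (2, "west", 25.5),
    (3, "east", 4.5),
    (4, "west", 60.0),
]


def to_columns(rows):
    """Pivot row tuples into one container per column (typed arrays for numbers)."""
    return {
        "order_id": array("q", (r[0] for r in rows)),  # 64-bit integers
        "region": [r[1] for r in rows],                # strings stay in a list
        "amount": array("d", (r[2] for r in rows)),    # 64-bit floats
    }


def sum_amount_vectorized(columns, vector_size=2):
    """Sum the amount column one fixed-size slice (vector) at a time."""
    amounts = columns["amount"]
    total = 0.0
    for start in range(0, len(amounts), vector_size):
        total += sum(amounts[start:start + vector_size])
    return total


if __name__ == "__main__":
    print(sum_amount_vectorized(to_columns(ROWS)))
```

A real columnar engine would add compression, partitioning and CPU-level vector instructions; the sketch only shows why reading one contiguous column is cheaper than scanning whole rows.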

Page 31: Taming the Elephant: The Power of SQL on Hadoop

Visual Data Science & Analytics Workbench
• Drag/drop 1000+ analytic functions
• Connect, blend, & enrich data
• Perform discovery analytics & data science
• Build and test predictive models

MapReduce Coding

Page 32: Taming the Elephant: The Power of SQL on Hadoop

Most Industrialized SQL in Hadoop

• Comprehensive – covers full analytic process: data blending & enrichment, discovery & data science, analytics & operational BI

• Accessible – standard ANSI SQL-92 to support standard BI tools; plus key advanced analytics including cube, grouping sets and windowing functions

• Optimized – mature, proven planner and optimizer; optimal use of every node, CPU, memory, and cache

• Secure – native DBMS security including authentication, user and role-based security, data protection, and encryption

• Reliable – fully ACID-compliant with multi-version read consistency, plus system-wide failover protection

• Manageable – resources managed automatically in Hadoop via YARN

• Consumable – now usable by millions of users with every SQL tool and application on the planet

• Scalable – unlimited expansion to handle extreme #s of users, nodes, data
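For readers unfamiliar with the windowing functions mentioned in the "Accessible" bullet, here is a minimal sketch of one. It runs against an in-memory SQLite database purely as a stand-in (an assumption for this example; it is not the Actian product, and SQLite needs version 3.25 or later for window functions and does not support CUBE or GROUPING SETS). The table, data, and function name are made up.

```python
import sqlite3


def running_totals():
    """Compute a per-region running total with a SQL window function."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("east", 1, 10.0), ("east", 2, 5.0), ("west", 1, 7.0), ("west", 2, 3.0)],
    )
    rows = conn.execute(
        """
        SELECT region, day, amount,
               SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
        FROM sales
        ORDER BY region, day
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in running_totals():
        print(row)
```

The same `SUM(...) OVER (PARTITION BY ... ORDER BY ...)` clause is standard SQL, which is why BI tools that already emit such queries can, in principle, be pointed at any engine that supports it.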

Page 33: Taming the Elephant: The Power of SQL on Hadoop

Highest Performing SQL in Hadoop: Up to 30X Faster Than Impala

[Figure: bar chart, “Impala Subset” of TPC-DS at Scale Factor 3000 (3TB), Actian vs Impala. X axis: queries Q3, Q7, Q19, Q27, Q34, Q42, Q43, Q46, Q52, Q53, Q55, Q59, Q63, Q65, Q68, Q73, Q79, Q89, Q98, plus Average. Y axis: Times Faster Than Impala, 0 to 35.]

Background to the “Impala Subset” of the TPC-DS benchmark can be found here: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

Both executed on the same hardware and software environment: 5-node cluster with 64GB of RAM per node and 12x2TB hard disks.
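A "times faster" chart like this one is built from per-query runtime ratios, and an "up to N X" headline is the maximum of those ratios rather than their average. The sketch below shows the arithmetic only; the runtimes are invented for illustration and are not the benchmark results, and the names `baseline`, `candidate`, and `times_faster` are hypothetical.

```python
from statistics import mean


def times_faster(baseline_seconds, candidate_seconds):
    """Per-query speedup: baseline runtime divided by candidate runtime."""
    return {q: baseline_seconds[q] / candidate_seconds[q] for q in baseline_seconds}


# Hypothetical runtimes in seconds; NOT taken from the slide or the benchmark.
baseline = {"Q3": 60.0, "Q7": 90.0, "Q19": 40.0}
candidate = {"Q3": 4.0, "Q7": 3.0, "Q19": 8.0}

if __name__ == "__main__":
    speedups = times_faster(baseline, candidate)
    print("per query:", speedups)
    print("up to:", max(speedups.values()))
    print("average:", round(mean(speedups.values()), 1))
```

With these made-up numbers the best case and the average differ by almost a factor of two, which is why it helps to look at both figures when reading vendor benchmarks.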

Page 34: Taming the Elephant: The Power of SQL on Hadoop

“SQL on Hadoop” Vendor Landscape

[Figure: quadrant chart. Y axis: Maturity (SQL support, ACID, reliability, security, connectivity, performance), Low to High. X axis: Hadoop Integration, Low to Native. Regions: “wrapped legacy”, “from scratch”, “connections”, and Mature & Integrated + End-to-End.]

Page 35: Taming the Elephant: The Power of SQL on Hadoop

Actian Analytics Platform – Hadoop SQL Edition

Removes all barriers for business access to big data analytics

[Figure: platform overview.
Data sources: Enterprise Data, Machine Data, Social Data, Data Warehouse, SaaS Data, Amazon Redshift.
Components on Hadoop: Connections for Any Data; Libraries of Analytics; Visual Data Science and Analytic Workbench; High Performance Dataflow Engine; High Performance, Industrialized SQL Analytics Database.
Capabilities: Expansive Connectivity, Data Blending & Enrichment, Discovery, Data Science, Analytics, Operational BI.
Consumers: Business Processes, Users, Machines, Applications.]

Page 36: Taming the Elephant: The Power of SQL on Hadoop

Unleash millions of business-savvy SQL users with no constraints on Hadoop data

Ubiquitous Skills
■ 1 Million+ SQL Users
■ $ Inexpensive
■ Easy to find, in most companies
■ Embedded in the business

Specialty Skills
■ 150K MapReduce Programmers
■ $$$ Expensive
■ 170K Shortage, hard to find
■ Separate from the business

Actian Analytics Platform™: Analyze, Act, Connect

Page 37: Taming the Elephant: The Power of SQL on Hadoop

Accelerate time to value and turn Hadoop data into transformational value

COLLABORATIVE DATA SCIENCE ENVIRONMENT

Data Scientist: Discover new opportunities, build and test models. Come up with candidate models.

Data Miner: Validate models, apply data mining techniques. Choose and maintain contender models.

Business Analyst: Select model for deployment based on business impact.

Operational User: Use models for operational intelligence and embed analytics in real-time systems.

COMMON DATA & ANALYTICS ACCESS

Page 38: Taming the Elephant: The Power of SQL on Hadoop

Actian Analytics Platform – Hadoop SQL Edition
Industrialized, High-Performance SQL in Hadoop

Actian transforms Hadoop from a data lake into a high-performance analytics platform.

• Only end-to-end analytic processing natively in Hadoop
• Highest performing, most industrialized SQL in Hadoop
• Removes all barriers for business access to big data analytics
• Unleashes millions of business-savvy SQL users on Hadoop data
• Speeds time to value for big data analytics projects
• Outperforms Cloudera’s Impala by up to 30x

Page 39: Taming the Elephant: The Power of SQL on Hadoop

What Big Data Analytics Pricing Was Meant to Be

All-In-One (1 SKU)

Right-to-Deploy (no limits)

Page 40: Taming the Elephant: The Power of SQL on Hadoop

Thank You

www.actian.com

facebook.com/actiancorp

@actiancorp

Page 41: Taming the Elephant: The Power of SQL on Hadoop

Vector in Hadoop Technical Overview

Page 42: Taming the Elephant: The Power of SQL on Hadoop

Vector in Hadoop Architecture

[Figure: query flow across a leader node and worker nodes.
Client Application or BI Tools submit a SQL query to the leader node (namenode).
Leader node, INGRES layer: SQL parser → parsed tree → Optimizer → query plan → Cross compiler → X100 algebra.
Leader node, X100 layer: Distributed rewriter → annotated query tree → Builder → operator tree → Execution engine, with a Buffer manager exchanging data requests and data through I/O on HDFS.
Worker nodes [1..n] (datanodes), X100 layer: Rewriter → annotated query tree → Builder → partial operator tree → Execution engine, with a Buffer manager exchanging data requests and data through I/O on HDFS.
MPI inter-node communication: the annotated tree goes out to the worker nodes, partial result sets come back, and the leader node returns the result.
Also shown: Active-Passive Fail-over for Leader Node; Actian Director for Management.]
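The leader/worker pattern in this architecture (the leader distributes the plan, each worker runs a partial operator tree on its local data, and partial result sets are merged into the final result) can be sketched in a few lines. This is a simplified illustration under stated assumptions: threads stand in for MPI-connected datanodes, the "query" is a hard-coded GROUP BY SUM, and the data and the names `worker_partial_sum` and `leader_execute` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data partitions: one list of (region, amount) rows per worker node.
PARTITIONS = [
    [("east", 10.0), ("west", 7.0)],
    [("east", 5.0), ("west", 3.0)],
    [("north", 1.0)],
]


def worker_partial_sum(rows):
    """Worker side: run the partial operator tree (here, a local GROUP BY SUM)."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0.0) + amount
    return partial


def leader_execute(partitions):
    """Leader side: fan the work out, then merge the partial result sets."""
    # Threads are a stand-in for worker nodes reached over MPI.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(worker_partial_sum, partitions))
    result = {}
    for partial in partials:
        for region, subtotal in partial.items():
            result[region] = result.get(region, 0.0) + subtotal
    return result


if __name__ == "__main__":
    print(leader_execute(PARTITIONS))
```

The merge step works here because SUM can be recombined from partial sums; aggregates such as a distinct count or a median need a different partial representation.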

Page 43: Taming the Elephant: The Power of SQL on Hadoop
Page 44: Taming the Elephant: The Power of SQL on Hadoop

The Archive Trifecta:
• Inside Analysis: www.insideanalysis.com
• SlideShare: www.slideshare.net/InsideAnalysis
• YouTube: www.youtube.com/user/BloorGroup

THANK  YOU!