Cloudera Impala - San Diego Big Data Meetup, August 13th 2014


Upload: cdmaxime

Post on 13-Jan-2015


DESCRIPTION

Cloudera Impala presentation to San Diego Big Data Meetup (http://www.meetup.com/sdbigdata/events/189420582/)

TRANSCRIPT

Page 1: Cloudera Impala - San Diego Big Data Meetup August 13th 2014

Cloudera Impala
SD Big Data Monthly Meetup #2, August 13th 2014
Maxime Dumas, Systems Engineer

Thirty Seconds About Max

• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• from Montreal, Canada

What Does Cloudera Do?

• product
  • distribution of Hadoop components, Apache licensed
  • enterprise tooling
• support
• training
• services (aka consulting)
• community

What This Talk Isn’t About

• deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
  • depends heavily on data and workload
• coding
  • unless you count XML or CSV or SQL
• algorithms

[Image. Credit: Public Domain, IFCAR]

What is Cloudera Impala?

cloud·e·ra im·pal·a
/kloudˈi(ə)rə imˈpalə/
noun

a modern, open source, MPP SQL query engine for Apache Hadoop.

“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

The Apache Hadoop Ecosystem

Quick and dirty, for context.

Why “Ecosystem?”

• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow

HDFS

• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
  • http://research.google.com/archive/gfs.html

Lots of Commodity Machines

[Image: Yahoo! Hadoop cluster (OSCON ’07)]

MapReduce (MR)

• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
  • http://research.google.com/archive/mapreduce.html
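As a rough illustration of the paradigm (not Hadoop's Java API), the sketch below runs the map, shuffle, and reduce phases of a word count inside a single Python process. The function names map_phase, shuffle, and reduce_phase are hypothetical labels chosen for this example; a real job would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["impala runs sql", "hive runs mapreduce"])
```

Because each map call sees one line and each reduce call sees one key, both phases can run independently on different machines, which is what makes the model fit distributed batch work.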

Apache Hive

• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
  • a “SQL-like” language
• Eases analysis using MapReduce

Apache Hive Metastore

• Maps HDFS files to DB-like resources
  • Databases
  • Tables
  • Column/field names, data types
  • Roles/users
  • InputFormat/OutputFormat
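To make the mapping concrete, here is a toy in-memory model of the kind of record the slide lists. It is only a sketch: TableEntry, the catalog dictionary, the web_logs table, its HDFS path, and the format labels are all hypothetical, and the real metastore is a separate service with its own schema.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """One table as the slide describes it: files plus DB-like metadata."""
    database: str
    name: str
    hdfs_location: str   # directory holding the table's files
    columns: dict        # column name -> data type
    input_format: str    # how files are read (illustrative label)
    output_format: str   # how files are written (illustrative label)
    owners: list = field(default_factory=list)


catalog = {}


def register(entry):
    """Store the entry under its qualified name, e.g. 'default.web_logs'."""
    catalog[f"{entry.database}.{entry.name}"] = entry


def resolve(qualified_name):
    """Look up where a table's data lives and how to interpret it."""
    entry = catalog[qualified_name]
    return entry.hdfs_location, entry.columns


register(TableEntry(
    database="default",
    name="web_logs",
    hdfs_location="/user/hive/warehouse/web_logs",  # hypothetical path
    columns={"ts": "timestamp", "url": "string", "status": "int"},
    input_format="TextInputFormat",
    output_format="TextOutputFormat",
    owners=["analyst"],
))

location, schema = resolve("default.web_logs")
```

Any engine that can call resolve gets the same answer, which is the point of a shared metastore: the files stay where they are and only the description is shared.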

Sqoop

• SQL to Hadoop
• Tool to import/export any JDBC-supported database into Hadoop
• Transfer data between Hadoop and external databases or EDW
• High performance connectors for some RDBMS
  • Oracle, Teradata, Netezza
• Developed at Cloudera


Cloudera Impala

Familiar interface, but more powerful.

Cloudera Impala

Interactive SQL for Hadoop
• Responses in seconds
• Nearly ANSI-92 standard SQL with Hive SQL

Native MPP Query Engine
• Purpose-built for low-latency queries
• Separate runtime from MapReduce
• Designed as part of the Hadoop ecosystem

Open Source
• Apache-licensed

Benefits of Impala

More & Faster Value from “Big Data”
• Interactive BI/Analytics experience via SQL
• No delays from data migration

Flexibility
• Query across existing data
• Select best-fit file formats (Parquet, Avro, etc.)
• Run multiple frameworks on the same data at the same time

Cost Efficiency
• Reduce movement, duplicate storage & compute
• 10% to 1% the cost of analytic DBMS

Full Fidelity Analysis
• No loss from aggregations or fixed schemas

Impala Use Cases

Cost-effective, ad hoc query environment that offloads the data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions – exploration, ML
• Data processing with tight SLAs
• Query-able archive w/full fidelity

Our Design Strategy

An Integrated Part of the Hadoop System
• One pool of (open) data
• One metadata model
• One security framework
• One set of system resources

[Diagram: platform stack]
• Engines
  • Batch Processing: MAPREDUCE, HIVE & PIG
  • Interactive SQL: CLOUDERA IMPALA
  • Interactive Search: CLOUDERA SEARCH
  • Machine Learning: MAHOUT, ClouderaML, Oryx
  • Math & Statistics: SAS, R
  • In-Memory Processing & Streaming: Spark
• Storage
  • HDFS: TEXT, RCFILE, PARQUET, AVRO, ETC.
  • HBase: RECORDS
• Integration
• Resource Management
• Metadata
• Security

Impala Key Features

Fast
• In-memory data transfers
• Partitioned joins
• Fully distributed aggregations

Flexible
• Query data in HDFS & HBase
• Supports multiple file formats & compression algorithms
• Java & Native UDFs, UDAFs

Secure
• Integrated with Hadoop security
• Kerberos authentication
• Authorization (Sentry)

Easy to Implement
• Leverages Hive’s ODBC/JDBC connectors, metastore & SQL syntax
• Open source

Easy to Use
• Interact with data via SQL
• Certified with leading BI tools

Simple to Manage
• Deploy, configure & monitor with Cloudera Manager
• Integrated with Hadoop resource management
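The "partitioned joins" and "fully distributed aggregations" items can be pictured with a small single-process sketch. It is not Impala's implementation: the node count, the orders and customers tables, and every function name are assumptions made for illustration. Both inputs are hash-partitioned on the join key so that matching rows meet on the same node, each node joins and pre-aggregates its share, and only the small partial sums are merged.

```python
from collections import defaultdict

NUM_NODES = 3  # assumed cluster size for the example


def partition(rows, key_index):
    """Send each row to the node chosen by hashing its join key."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[hash(row[key_index]) % NUM_NODES].append(row)
    return nodes


def local_join_and_sum(orders, customers):
    """On one node: join orders to customers, then pre-aggregate by region."""
    region_of = {cust_id: region for cust_id, region in customers}
    partial = defaultdict(int)
    for cust_id, amount in orders:
        if cust_id in region_of:
            partial[region_of[cust_id]] += amount
    return partial


def merge(partials):
    """Final step: combine the per-node partial sums."""
    total = defaultdict(int)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)


orders = [(1, 10), (2, 5), (1, 7), (3, 4)]           # (customer_id, amount)
customers = [(1, "west"), (2, "east"), (3, "west")]  # (customer_id, region)

order_parts = partition(orders, 0)
customer_parts = partition(customers, 0)
totals = merge(
    local_join_and_sum(o, c) for o, c in zip(order_parts, customer_parts)
)
```

The design point is that no single node ever needs to hold either full table, and the data crossing the network after the join is one row per region per node rather than one row per order.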

What’s Coming?*

In the Near Term:
• SQL 2003-Compliant Analytic Window Functions
• Additional Authentication Mechanisms
• User Defined Table Functions
• Intra-node Parallelized Aggregations & Joins
• Nested Data
• Enhanced YARN-Integrated Resource Manager
• Dynamic Partition Pruning

*On the roadmap… no guarantees

Impala Plays Well with Others

BI Partners: Building on the Enterprise Standard

[Logo: Powered by Impala]

Not All SQL On Hadoop Is Created Equal

Batch MapReduce
• Make MapReduce faster
• Slow, still batch

Remote Query
• Pull data from HDFS over the network to the DW compute layer
• Slow, expensive

Siloed DBMS
• Load data into a proprietary database file
• Rigid, siloed data, slow ETL

Impala
• Native MPP query engine that’s integrated into Hadoop
• Fast, flexible, cost-effective

More Detail On Alternative Approaches

Batch MapReduce
• Batch-oriented
• High latency

Remote Query
• Network bottleneck
• 2x the hardware
• Duplicate metadata, security, SQL, etc.

Siloed DBMS
• RDBMS rigidity
• Query subset of data
• Duplicate storage, metadata, security, SQL, etc.

[Diagrams: one architecture sketch per approach. Recoverable labels: Hadoop and DBMS stacks with Storage (HDFS, HBase), Compute, Integration, Resource Management, Metadata, and Security; Hadoop Engines (MAPREDUCE, HIVE, PIG, IMPALA, ETC.) marked STANDARD & SHARED; DBMS marked PROPRIETARY]

Other Sexy New Big Data MPP Tools

Presto
• Purpose-Built MPP Engine
• Similar Architecture to Impala
• Few Performance Comparisons, but Impala Anecdotally 5x-10x Faster

Shark
• Hive-Compatible Data Warehouse for Spark
• Great Performance until Required to go to Disk, at Which Point Impala Better
• With HDFS Caching Impala will Perform on Par from a Memory Perspective

Drill
• Open Source version of Dremel
• Another MPP Engine
• Multiple Data Formats and Sources

Phoenix – Sort Of
• SQL Skin over HBase (and Only HBase)
• Subset of SQL Standard

What About an EDW/RDBMS?

“Right Tool for the Right Job”

EDW/RDBMS Great For:
• OLTP’s complex transactions
• Highly planned and optimized known workloads
• Operational reports and repeated known queries

Impala Great For:
• Exploratory analytics with previously-unknown queries
• Queries on big and growing data sets

EDW/RDBMS Can’t:
• Dump in raw data then later define schema and query what you want
• Evolve schemas without an expensive schema upgrade planning process
• Simply scale just by adding industry-standard servers
• Store at < $1k/TB instead of $10-150k/TB

Impala Technical Details

The Impala Advantage

Where does the Performance Come From?
• No MapReduce; No JVM; All Native
• In-Memory Data Transfers
• Saturate Disks on Reads
• Optimized File Format (i.e. Parquet)
• In-Memory HDFS Caching
• Cost-Based Join Order Optimization – Frees User from Having to Guess the Correct Join Order

Impala and Hive

Shares Everything Client-Facing
• Metadata (table definitions)
• ODBC/JDBC drivers
• SQL syntax (Hive SQL)
• Flexible file formats
• Machine pool
• Hue GUI

But Built for Different Purposes
• Hive: runs on MapReduce and ideal for batch processing
• Impala: native MPP query engine ideal for interactive SQL

[Diagram: shared Storage (HDFS: TEXT, RCFILE, PARQUET, AVRO, ETC.; HBase: RECORDS), Integration, Resource Management, and Metadata. Batch Processing: Hive (SQL Syntax) on MapReduce (Compute Framework). Interactive SQL: Impala (SQL Syntax + Compute Framework).]

Impala Query Execution

[Diagram: SQL App (ODBC), Hive Metastore, HDFS NN, Statestore, and three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase. A SQL request arrives from the SQL App.]

1) Request arrives via ODBC/JDBC/HUE/Shell

Impala Query Execution

[Diagram: SQL App (ODBC), Hive Metastore, HDFS NN, Statestore, and three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase.]

2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data

Impala Query Execution

[Diagram: SQL App (ODBC), Hive Metastore, HDFS NN, Statestore, and three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase. Query results return to the SQL App.]

4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
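The five steps can be traced in a small single-process sketch. Everything here is an assumption made for illustration: the BLOCKS placement, the fragment dictionaries, and the function names are hypothetical, and the loop runs fragments one after another, whereas real impalad daemons run them in parallel and exchange rows over the network.

```python
# Hypothetical block placement: node name -> rows stored locally.
BLOCKS = {
    "node1": [("a", 3), ("b", 1)],
    "node2": [("a", 4)],
    "node3": [("b", 6), ("c", 2)],
}


def plan(min_value):
    """Step 2: turn the request into one scan fragment per data-holding node."""
    return [{"node": node, "min_value": min_value} for node in BLOCKS]


def execute_fragment(fragment):
    """Step 3: run a fragment on the node that holds the data, yielding rows."""
    for key, value in BLOCKS[fragment["node"]]:
        if value >= fragment["min_value"]:
            yield key, value


def coordinate(min_value):
    """Steps 4-5: stream intermediate rows, aggregate, stream results back."""
    sums = {}
    for fragment in plan(min_value):
        for key, value in execute_fragment(fragment):
            sums[key] = sums.get(key, 0) + value
    for key in sorted(sums):
        yield key, sums[key]


# Step 1: the request, standing in for
# SELECT k, SUM(v) FROM t WHERE v >= 2 GROUP BY k ORDER BY k
results = list(coordinate(min_value=2))
```

Generators stand in for streaming here: rows are handed to the coordinator as they are produced rather than being written out between stages, which is the contrast with batch MapReduce that the slides draw.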

Parquet File Format

Open source, columnar Hadoop file format developed by Cloudera & Twitter

• Limits the IO to only the data that is needed
• Supports storing each column in a separate file
• Saves space: columnar layout compresses better
• Enables better scans: load only the columns that are needed
• Supports index pages for fast lookup
• Extensible value encodings
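A rough way to see why a columnar layout limits I/O is to store the same records both ways and count the bytes a single-column query has to read. This sketch does not use the Parquet format itself: JSON and zlib stand in for real encodings, and the rows, column names, and helper function are assumptions made for the example.

```python
import json
import zlib

rows = [{"id": i, "country": "US" if i % 2 else "CA", "amount": i * 10}
        for i in range(1000)]

# Row layout: every record is stored whole, so any scan reads all fields.
row_store = json.dumps(rows).encode()

# Columnar layout: one contiguous, separately compressed chunk per column.
column_store = {
    name: zlib.compress(json.dumps([r[name] for r in rows]).encode())
    for name in ("id", "country", "amount")
}


def sum_amount_columnar(store):
    """Read only the 'amount' chunk; the other columns are never touched."""
    return sum(json.loads(zlib.decompress(store["amount"])))


bytes_read_columnar = len(column_store["amount"])
bytes_read_row = len(zlib.compress(row_store))
total = sum_amount_columnar(column_store)
```

Comparing bytes_read_columnar with bytes_read_row shows the effect for a query such as SUM(amount): the columnar reader skips the id and country data entirely, while the row reader has to pull every field of every record.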

Impala Performance Results

Impala Performance Results

• Impala’s Milestone in Jan 2014:
  • Comparable commercial MPP DBMS speed
  • Natively on Hadoop
• Three Result Sets:
  • Impala vs Hive 0.12 (Impala 6-70x faster)
  • Impala vs “DBMS-Y” (Impala average of 2x faster)
  • Impala scalability (Impala achieves linear scale)
• Background
  • 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
  • Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
  • Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks)
  • Methodical testing (multiple runs, reviewed fairness for competition, etc)
• Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

Enough slides… DEMO TIME!

So What is Cloudera Impala?

What’s Next?

• Download Hadoop!
  • CDH available at www.cloudera.com
  • Try it online: Cloudera Live
• Cloudera provides pre-loaded VMs
  • http://tiny.cloudera.com/quickstartvm
• Ride Impala!
  • http://impala.io/

Special thanks:
SAN DIEGO BIG DATA

Questions?

Preferably related to the talk… or not.

Thank You!
Maxime Dumas
[email protected]
We’re hiring.
