Ibis: Scaling the Python Data Experience

© Cloudera, Inc. All rights reserved. Ibis: Scaling the Python Data Experience. Wes McKinney, Marcel Kornacker, Justin Erickson, Silvius Rus

Upload: wes-mckinney

Post on 14-Aug-2015


Page 1: Ibis: Scaling the Python Data Experience

Ibis: Scaling the Python Data Experience
Wes McKinney, Marcel Kornacker, Justin Erickson, Silvius Rus

Page 2: Ibis: Scaling the Python Data Experience

Wes McKinney

• A key person in building today’s open source Python data community
• Creator of pandas, a standard Python data wrangling and analytics toolkit used by data scientists
• Author of best-selling canonical text Python for Data Analysis (2012)
• Formerly Founder/CEO of DataPad (acquired by Cloudera in 2014)

Page 3: Ibis: Scaling the Python Data Experience

Python is popular…

• Python has become a standard language of data science
• Why is it popular?
  • Maximizes productivity for data engineers and data scientists
  • Build robust software and do interactive data analysis with 100% Python code
  • Easy to learn, and makes for happy and productive data teams
  • Large, diverse open source development community
  • Comprehensive libraries: data wrangling, ML, visualization, etc.
• Main use case: data science & engineering Swiss Army knife on small-to-medium size data

Page 4: Ibis: Scaling the Python Data Experience

…but Python does not scale today

• Python ecosystem confined to single-node analysis
  • Great for smaller data sets
  • Requires sampling or aggregations for larger data
  • Distributed tools compromise in various ways
• Extracting samples or aggregations for larger data means:
  • “Scales” by losing more fidelity
  • Additional ETL overhead to extract samples/aggregations
  • Loss of productivity with multiple languages, tools, etc.
  • Blocks certain analysis and use cases

Page 5: Ibis: Scaling the Python Data Experience

Ibis: Same Python, now at scale

• Target user:
  • Data scientists and data engineers (“Python data users”)
• Goals:
  • Mirrors single-node Python experience
  • Scales to any node and data size
  • No compromise in functionality or usability
  • Interactive experience at native hardware speeds

Page 6: Ibis: Scaling the Python Data Experience

What’s announced?

• First public release of Ibis
  • http://ibis-project.org
• Beta release to Cloudera Labs
  • Inviting usage and community development
  • Apache-licensed open-source

Page 7: Ibis: Scaling the Python Data Experience

Ibis’s Vision

• Uncompromised Python experience
  • 100% Python end-to-end user workflows
  • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
• Interactive at big data scale
  • Full-fidelity analysis without extractions
  • Scalability for big data
  • Native hardware speeds for a broad set of use cases

Page 8: Ibis: Scaling the Python Data Experience

Page 9: Ibis: Scaling the Python Data Experience

Advantages of our approach

• Analyze big data 100% in Python, with the same ease as small/medium data on the local filesystem
• Full-fidelity data access
• Familiar Python experience and integration with existing Python data libraries
• Provide a means for Python high performance computing tools to be leveraged at Hadoop-scale

Page 10: Ibis: Scaling the Python Data Experience

Beta 0.3 release

• High level Python API for describing analytics and ETL that can be executed by Impala
  • Familiar API for users of pandas
  • Comprehensive coverage of operations expressible as relational data flows
• Integrated tools for managing data in HDFS
• Simple workflows to query data files in several formats (Parquet, Avro, Text)
• pandas data interchange
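
The idea of a Python API that describes a relational data flow and hands execution to a SQL engine such as Impala can be illustrated with a toy sketch. The code below is not the Ibis API: the TableExpr class, its methods, and the table and column names (flights, carrier, arr_delay, year) are all hypothetical, chosen only to show how a chained Python expression might stay unevaluated until it is compiled to a SQL string.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TableExpr:
    """Hypothetical deferred table expression (illustration only, not the Ibis API)."""

    name: str
    predicates: Tuple[str, ...] = ()
    group_keys: Tuple[str, ...] = ()
    aggregates: Tuple[Tuple[str, str, str], ...] = ()  # (function, column, alias)

    def filter(self, predicate: str) -> "TableExpr":
        # Returns a new expression; no data is touched here.
        return replace(self, predicates=self.predicates + (predicate,))

    def group_by(self, *keys: str) -> "TableExpr":
        return replace(self, group_keys=self.group_keys + keys)

    def aggregate(self, **aggs: Tuple[str, str]) -> "TableExpr":
        # Each keyword is alias=(function, column), e.g. avg_delay=("AVG", "arr_delay").
        new = tuple((func, col, alias) for alias, (func, col) in aggs.items())
        return replace(self, aggregates=self.aggregates + new)

    def compile(self) -> str:
        # Turn the accumulated description into one SQL statement.
        items = list(self.group_keys) + [
            f"{func}({col}) AS {alias}" for func, col, alias in self.aggregates
        ]
        sql = f"SELECT {', '.join(items) or '*'} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        return sql


# Hypothetical table and column names, for illustration only.
expr = (
    TableExpr("flights")
    .filter("year = 2014")
    .group_by("carrier")
    .aggregate(avg_delay=("AVG", "arr_delay"))
)
print(expr.compile())
# SELECT carrier, AVG(arr_delay) AS avg_delay FROM flights WHERE (year = 2014) GROUP BY carrier
```

In a design like this, building the expression costs nothing and the heavy work happens wherever the generated SQL is run, which is one plausible way to keep a pandas-like feel while the data stays in the cluster.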

Page 11: Ibis: Scaling the Python Data Experience

Ibis/Impala Joint Roadmap

• More natural data modeling
  • Complex types support
• Integration with full Python data ecosystem
  • Advanced analytics + machine learning
  • Enable use of performance computing tools
• User extensibility with native performance
  • In-memory columnar format
  • Python-to-LLVM IR compilation
• Workflow and usability tools

Page 12: Ibis: Scaling the Python Data Experience

Benefits of Ibis

• Maximize developer productivity
  • Mirrors single-node Python experience
  • Solve big data problems without leaving Python
  • Leverage Python skills, ecosystem, and tools
• Python as first-class language for Hadoop
  • Full-fidelity analysis without extractions
  • Python analysis at any scale
  • Native hardware speeds for a broad set of use cases

Page 13: Ibis: Scaling the Python Data Experience

Thank you