CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge

Securing the Hadoop Ecosystem. Aaron T. Myers (ATM) @ Cloudera. Cloud Identity Summit, July 2013

Upload: cloudidsummit

Post on 26-Jan-2015

DESCRIPTION

Aaron T. Myers (ATM), Software Engineer, Cloudera, Inc. The era of “Big Data for the masses” is upon us. Despite the mindshare Big Data has been receiving, driven by the development and distribution of Apache Hadoop, the first commercialized release came only in December 2011, from Cloudera, Inc. Cloudera remains the leading Hadoop platform provider in the market today. With a diverse list of enterprise and government early adopters, Cloudera offers a bird’s-eye view of the leading authentication issues emerging as these organizations move out of the sandbox and into full production. Speaker Aaron T. Myers (ATM) was one of Cloudera’s earliest engineers and focuses on core Apache Hadoop, specifically HDFS and Hadoop’s security features. ATM is an Apache Hadoop PMC Member and Committer.

TRANSCRIPT

Securing the Hadoop Ecosystem

Aaron T. Myers (ATM) @ Cloudera

Cloud Identity Summit, July 2013

Who am I?

• Software Engineer at Cloudera
• Hadoop Committer and PMC Member at Apache Software Foundation
• Primarily work on Hadoop Security and HDFS
• Masters thesis focused on systems security

Agenda

• What is Hadoop?
• Hadoop Ecosystem Interactions
• Hadoop Authentication
• Hadoop Authorization
• IT Infrastructure Integration
• The Future: Where Hadoop Security is Headed

Hadoop Is…

• A distributed system
  • Designed for massive scaling of storage and compute across many (10s-1000s) nodes
• An ecosystem
  • Hadoop is the kernel, apps on top are user-level programs
  • e.g. Impala, Hive, Oozie, HBase, etc.
• A security pain
  • Designed to run arbitrary code submitted by users
• Another place where many users interact with the system
  • Many orgs provide “Hadoop as a service”

Hadoop Is…

• Not secure by default
  • No authentication whatsoever
  • Usually behind a corporate firewall
• Often accessed by common BI tools
  • Tableau, SAS, MicroStrategy, etc.
• Expected to be integrated into corporate IT infra
  • SSO, etc.

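Hadoop on its Own

[Diagram: a standalone Hadoop cluster. Components: NN (NameNode), SNN (Secondary NameNode), JT (JobTracker), three DN/TT (DataNode/TaskTracker) nodes running Map and Reduce tasks, and HttpFS. Clients: MR client, HDFS client, WebHdfs client. Legend: hdfs, httpfs & mapred users; end users; protocols: RPC/data transfer/HTTP.]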

The Hadoop Ecosystem

• Storage
  • HBase
  • HDFS
• Processing
  • Map/Reduce
  • YARN
• Querying
  • Hive, Impala (SQL)
  • Pig (DSL)
• Cron, workflows
  • Oozie
• Data ingest
  • Flume (streaming)
  • Sqoop (batch)
• Live data serving
  • HBase
• Pipelines
  • Crunch, Cascading
• GUI
  • Hue
• Management
  • Cloudera Manager

Hadoop and Friends

[Diagram: Hadoop at the center, surrounded by ecosystem services and their clients. Services: Hive Metastore, HBase, Oozie, Hue, Impala, ZooKeeper, Flume. Clients: MapRed, Pig, Crunch, Cascading, Sqoop, Hive, HBase, Oozie, Impala, Flume, browser. Connections are labeled RPC, HTTP, Thrift, Avro RPC, and WebHdfs. Legend: service users; end users; protocols: RPCs/data/HTTP/Thrift/Avro-RPC.]

Authentication Details

• Hadoop Authentication based on Kerberos
  • Usually MIT, also Active Directory
• End Users to services, as a user
  • CLI & libraries: Kerberos (kinit or keytab)
  • Web UIs: Kerberos SPNEGO & pluggable HTTP auth
• Services to Services, as a service
  • Credentials: Kerberos (keytab)
• Services to Services, on behalf of a user
  • Proxy-user (after Kerberos for service)
• Job tasks to Services, on behalf of a user
  • Job delegation token
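The “on behalf of a user” case is the least obvious one on this slide, so here is a minimal Python sketch of the kind of check a proxy-user rule implies. It assumes the service has already authenticated with its own Kerberos credentials. The `ProxyRule` class, `may_impersonate` function, host names, and rule contents are all hypothetical illustrations, not Hadoop's actual code; in Hadoop itself the equivalent rules are configured through the `hadoop.proxyuser.<service>.hosts` and `hadoop.proxyuser.<service>.groups` properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyRule:
    """Who a service may impersonate, and from which hosts (hypothetical model)."""
    allowed_groups: frozenset
    allowed_hosts: frozenset


def may_impersonate(service, end_user, caller_host, rules, groups_of):
    """Return True if `service` may act on behalf of `end_user`.

    `rules` maps a service short name to its ProxyRule; `groups_of` maps a
    user name to the set of groups that user belongs to. "*" is a wildcard.
    """
    rule = rules.get(service)
    if rule is None:
        return False  # a service with no rule may not impersonate anyone
    host_ok = "*" in rule.allowed_hosts or caller_host in rule.allowed_hosts
    user_groups = groups_of.get(end_user, set())
    group_ok = "*" in rule.allowed_groups or bool(rule.allowed_groups & user_groups)
    return host_ok and group_ok


# Illustrative data only: Oozie may impersonate members of 'analysts',
# and only when the request comes from the Oozie server host.
rules = {
    "oozie": ProxyRule(frozenset({"analysts"}), frozenset({"oozie-host.example.com"})),
}
groups_of = {"atm": {"analysts", "eng"}, "root": {"wheel"}}

print(may_impersonate("oozie", "atm", "oozie-host.example.com", rules, groups_of))
```

Restricting both the impersonated users and the calling hosts limits the damage if a service's keytab leaks.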

Authorization Details

• HDFS Data
  • File System permissions (Unix-like user/group permissions)
• HBase Data
  • Read/Write Access Control Lists (ACLs) at table level
• Hive Metastore (Hive, Impala)
  • Leverages/proxies HDFS permissions for tables & partitions
• Hive Server (Hive, Impala) (coming)
  • More advanced GRANT/REVOKE with ACLs for tables
• Jobs (Hadoop, Oozie)
  • Job ACLs for Hadoop Scheduler Queues, manage & view jobs
• ZooKeeper
  • ACLs at znodes, authenticated & read/write
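To make “Unix-like user/group permissions” concrete, here is a small Python sketch of the classic owner/group/other check. It is a simplified model, not HDFS source code: the `can_access` function and the sample file are hypothetical, and superuser handling is left out.

```python
def can_access(user, user_groups, owner, group, mode, want):
    """Unix-style check: pick exactly one permission class, then test its bits.

    mode is an int such as 0o640; want is a string made of 'r', 'w', 'x'.
    """
    if user == owner:
        shift = 6          # owner bits
    elif group in user_groups:
        shift = 3          # group bits
    else:
        shift = 0          # "other" bits
    bits = (mode >> shift) & 0o7
    needed = sum({"r": 4, "w": 2, "x": 1}[c] for c in set(want))
    return (bits & needed) == needed


# Hypothetical file: owned by 'hive', group 'analysts', mode rw-r-----
owner, group, mode = "hive", "analysts", 0o640

print(can_access("atm", {"analysts", "eng"}, owner, group, mode, "r"))
print(can_access("atm", {"analysts", "eng"}, owner, group, mode, "w"))
```

A file carries exactly one owner and one group, so this model cannot express "read for group A, read/write for group B".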

IT Integration: Kerberos

• Users don’t want Yet Another Credential
• Corp IT doesn’t want to provision thousands of service principals
• Solution: local KDC + one-way trust
  • Run a KDC (usually MIT Kerberos) in the cluster
    • Put all service principals here
  • Set up one-way trust of central corporate realm by local KDC
    • Normal user credentials can be used to access Hadoop
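The direction of the trust is what matters in this design, so the following Python sketch models it. It only illustrates the accept/reject logic; it does not talk to a KDC. The realm names, the `TRUSTS` set, and both functions are hypothetical.

```python
def parse_principal(principal):
    """Split 'primary[/instance]@REALM' into (primary, instance, realm)."""
    name, at, realm = principal.rpartition("@")
    if not at or not name or not realm:
        raise ValueError(f"not a full Kerberos principal: {principal!r}")
    primary, _, instance = name.partition("/")
    return primary, instance or None, realm


def realm_accepts(service_realm, client_principal, trusts):
    """True if a service in service_realm should accept this client.

    trusts is a set of (trusting_realm, trusted_realm) pairs. The pairs are
    directional: (A, B) lets B's users reach A's services, not the reverse.
    """
    _, _, client_realm = parse_principal(client_principal)
    if client_realm == service_realm:
        return True
    return (service_realm, client_realm) in trusts


# Assumed realm names: the cluster-local KDC trusts the corporate realm.
CLUSTER = "HADOOP.EXAMPLE.COM"
CORP = "CORP.EXAMPLE.COM"
TRUSTS = {(CLUSTER, CORP)}

print(realm_accepts(CLUSTER, "atm@CORP.EXAMPLE.COM", TRUSTS))
print(realm_accepts(CORP, "hdfs/nn1.example.com@HADOOP.EXAMPLE.COM", TRUSTS))
```

Corporate users reach the cluster with their normal credentials, while cluster service principals gain nothing in the corporate realm.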

IT Integration: Groups

• Much of Hadoop authorization uses “groups”
  • User ‘atm’ might belong to groups ‘analysts’, ‘eng’, etc.
• Users’ groups are not stored in Hadoop anywhere
  • Refers to external system to determine group membership
  • NN/JT/Oozie/Hive servers all must perform group mapping
• Default plugins for user/group mapping:
  • ShellBasedUnixGroupsMapping – forks/runs /bin/id
  • JniBasedUnixGroupsMapping – makes a system call
  • LdapGroupsMapping – talks directly to an LDAP server
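The shell-based approach is simple enough to sketch in a few lines of Python. This is an illustration of the idea, not Hadoop's implementation: it assumes the `id -Gn <user>` invocation, which prints a user's group names separated by spaces on Unix-like systems, and the function names are hypothetical.

```python
import subprocess


def parse_id_output(output):
    """Turn the output of `id -Gn` ('analysts eng\\n') into a list of names."""
    return output.split()


def groups_for(user):
    """Fork `id -Gn <user>` and return that user's groups ([] if unknown)."""
    result = subprocess.run(
        ["id", "-Gn", user], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return []  # `id` exits non-zero for users the host does not know
    return parse_id_output(result.stdout)


print(parse_id_output("analysts eng\n"))
```

Because the lookup is answered by the local operating system, every server that performs group mapping must resolve the same users to the same groups.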

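IT Integration: Kerberos + LDAP

[Diagram: a Hadoop cluster containing a local KDC, which holds the service principals (hdfs/[email protected], yarn/[email protected]), and the NN and JT. A central Active Directory holds the user principals ([email protected], [email protected]). The two are linked by a cross-realm trust, and the NN and JT use LDAP group mapping against the central directory.]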

IT Integration: Web Interfaces

• Most web interfaces authenticate using SPNEGO
  • Standard HTTP authentication protocol
  • Used internally by services which communicate over HTTP
  • Most browsers support Kerberos SPNEGO authentication
• Hadoop components which use servlets for web interfaces can plug in custom filter
  • Integrate with intranet SSO HTTP solution
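At the HTTP level, SPNEGO is a challenge/response built on the `Negotiate` scheme: the server answers an unauthenticated request with `401` and `WWW-Authenticate: Negotiate`, and the client retries with `Authorization: Negotiate <base64 token>`. The Python sketch below shows only that header handling. Ticket validation is delegated to a `validate_token` callable, which is a hypothetical stand-in for a real GSS-API/Kerberos library, and `handle_request` is not any Hadoop filter's actual API.

```python
import base64
import binascii

CHALLENGE = {"WWW-Authenticate": "Negotiate"}


def handle_request(headers, validate_token):
    """Return (status, response_headers, authenticated_user_or_None)."""
    auth = headers.get("Authorization", "")
    scheme, _, encoded = auth.partition(" ")
    if scheme.lower() != "negotiate" or not encoded:
        return 401, CHALLENGE, None          # ask the client to negotiate
    try:
        token = base64.b64decode(encoded, validate=True)
    except (binascii.Error, ValueError):
        return 400, {}, None                 # malformed token encoding
    user = validate_token(token)
    if user is None:
        return 401, CHALLENGE, None          # token rejected, challenge again
    return 200, {}, user


def fake_validator(token):
    """Stand-in for real Kerberos validation, for illustration only."""
    return "atm" if token == b"good-ticket" else None


good = base64.b64encode(b"good-ticket").decode()
print(handle_request({}, fake_validator))
print(handle_request({"Authorization": "Negotiate " + good}, fake_validator))
```

A custom servlet filter that integrates with an intranet SSO system would replace this step with its own check of the incoming request.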

Issues with Hadoop Security

• SSO is poorly and not universally supported
  • Only supported for the web interfaces, little used, etc.
• Kerberos the only option
  • Not all orgs comfortable administering net new Kerberos realm
• Not well-suited for cloud deployments
  • Need properly working reverse DNS
  • Pain to provision KDC, distribute keytabs
• Kerberos tough for management tools
  • No Kerberos administrative API/protocol
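The reverse DNS requirement follows from how host-based service principals are named: the principal embeds the host's fully qualified name, so a host whose address resolves back to a different name will not match the principal in its keytab. The Python sketch below checks forward and reverse mappings for consistency. It uses plain dictionaries in place of real DNS lookups so that it stays self-contained, and the host names, addresses, and function names are all hypothetical.

```python
def service_principal(service, fqdn, realm):
    """Build a host-based principal such as hdfs/dn1.example.com@REALM."""
    return f"{service}/{fqdn}@{realm}"


def dns_mismatches(hosts, forward, reverse):
    """Return the hosts whose forward and reverse DNS entries disagree.

    forward maps host name -> IP address; reverse maps IP address -> name.
    """
    bad = []
    for name in hosts:
        ip = forward.get(name)
        if ip is None or reverse.get(ip, "").lower() != name.lower():
            bad.append(name)
    return bad


# Illustrative records: dn2's address resolves back to a cloud-internal name.
forward = {"dn1.example.com": "10.0.0.11", "dn2.example.com": "10.0.0.12"}
reverse = {"10.0.0.11": "dn1.example.com", "10.0.0.12": "ip-10-0-0-12.internal"}

print(dns_mismatches(["dn1.example.com", "dn2.example.com"], forward, reverse))
print(service_principal("hdfs", "dn1.example.com", "HADOOP.EXAMPLE.COM"))
```

Cloud instances are often assigned provider-generated internal names like the one above.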

Issues with Hadoop Security (cont.)

• Isolation of user tasks currently requires separate local Unix accounts on all boxes
  • Need to integrate with LDAP using PAM or something like it
• HDFS authorization only supports Unix-style permissions
  • Not expressive enough for some applications, e.g. Hive

Future Development

• Full SSO support
  • OAuth the most commonly requested, first goal
• Decouple Hadoop RPC implementation from Kerberos
  • Make authentication system fully pluggable for custom implementations
  • Any service which can provide bidirectional authentication
• Improve management tools
  • Cloudera Manager can manage more of the security infrastructure

Future Development (cont.)

• Use better isolation methods for user tasks
  • Linux containers
  • Solaris “zones”
  • Etc.
• Better authorization capabilities
  • Talk of adding ACL support to HDFS
  • Hive Server 2 will provide rich authorization capabilities

Q&A  

Thanks  

Aaron T. Myers (ATM) @ Cloudera

Cloud Identity Summit, July 2013