NYC HUG - Application Architectures with Apache Hadoop

Post on 02-Dec-2014


DESCRIPTION

Presentation at NYC HUG on Application Architectures with Apache Hadoop

TRANSCRIPT


Application Architectures with Apache Hadoop
Mark Grover | @mark_grover
NYC HUG
slideshare.com/markgrover
October 14th, 2014

©2014 Cloudera, Inc. All Rights Reserved.

About Me
• Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
• Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
• Software developer at Cloudera
• @mark_grover

Co-authoring O’Reilly book
• @hadooparchbook
• hadooparchitecturebook.com
• Strata Hadoop World Tutorial
  • at 9 AM tomorrow

Click Stream Analysis

Case Study

Analytics

Web Logs – Combined Log Format

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
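As a rough illustration of how such a log line breaks into fields, here is a minimal Java sketch. The regular expression and the class name `CombinedLogParser` are assumptions made for this example; they are not taken from the slides or from the tutorial repository.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedLogParser {
    // Assumed pattern (not from the slides): client IP, identity, user,
    // [timestamp], "request", status, bytes, "referrer", "user agent".
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns the nine captured fields, or null when the line does not match. */
    public static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "244.157.45.12 - - [17/Oct/2014:21:08:30 ] "
            + "\"GET /seatposts HTTP/1.0\" 200 4463 "
            + "\"http://bestcyclingreviews.com/top_online_shops\" "
            + "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)\"";
        String[] f = parse(line);
        System.out.println("ip=" + f[0] + " request=" + f[4] + " status=" + f[5]);
    }
}
```

Returning null for a non-matching line keeps the "is this record valid?" decision with the caller, which is one possible way to support a later filtering step.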

Clickstream Analytics

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

Similar use-cases
• Sensors – heart, agriculture, etc.
• Casinos – session of a person at a table

Challenges of Hadoop Implementation

Other challenges - Architectural Considerations
• Storage managers?
  • HDFS? HBase?
• Data storage and modeling:
  • File formats? Compression? Schema design?
• Data movement
  • How do we actually get the data into Hadoop? How do we get it out?
• Metadata
  • How do we manage data about the data?
• Processing
  • How can we transform it? How do we query it?
• Orchestration
  • How do we manage the workflow for all of this?

2. Processing

Since that’s all the time allows today :)

Processing
• De-duplication
• Filtering
• Sessionization

Deduplication – remove duplicate records

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
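A minimal single-process sketch of de-duplication, assuming that "duplicate" means a byte-for-byte identical log line. The class name `Deduplicator` and the shortened sample records are hypothetical; on a cluster the same idea is usually expressed by using the whole record as the grouping key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class Deduplicator {
    /** Removes exact duplicate records, keeping the first occurrence of each. */
    public static List<String> dedupe(List<String> records) {
        // LinkedHashSet drops repeats while preserving first-seen order.
        return new ArrayList<>(new LinkedHashSet<>(records));
    }

    public static void main(String[] args) {
        // Shortened, hypothetical records standing in for full log lines.
        List<String> records = Arrays.asList(
            "244.157.45.12 [17/Oct/2014:21:08:30 ] GET /seatposts",
            "244.157.45.12 [17/Oct/2014:21:08:30 ] GET /seatposts",
            "244.157.45.12 [17/Oct/2014:21:59:59 ] GET /Store/cart.jsp?productID=1023");
        for (String record : dedupe(records)) {
            System.out.println(record);
        }
    }
}
```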

Filtering – filter out invalid records

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…

Sessionization

[Diagram labels: Website visit; Visitor 1 Session 1; Visitor 1 Session 2; Visitor 2 Session 1; > 30 minutes]

Why sessionize?

Helps answer questions like:
• What is my website’s bounce rate?
  • i.e. what percentage of visitors don’t go past the landing page?
• Which marketing channels (e.g. organic search, display ad, etc.) are leading to most sessions?
  • Which ones of those lead to most conversions (e.g. people buying things, signing up, etc.)?
• Do attribution analysis – which channels are responsible for most conversions?
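To make the bounce-rate question concrete, here is a small Java sketch. It assumes every click has already been tagged with a session ID and treats a session with exactly one click as a bounce; that definition and the class name `BounceRate` are assumptions for illustration, not something the slides specify.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BounceRate {
    /**
     * Takes one session ID per click and returns the fraction of sessions
     * that contain exactly one click (0.0 when there are no sessions).
     */
    public static double bounceRate(List<String> sessionIdPerClick) {
        Map<String, Integer> clicksPerSession = new HashMap<>();
        for (String sessionId : sessionIdPerClick) {
            clicksPerSession.merge(sessionId, 1, Integer::sum);
        }
        if (clicksPerSession.isEmpty()) {
            return 0.0;
        }
        long bounced = clicksPerSession.values().stream()
            .filter(count -> count == 1)
            .count();
        return (double) bounced / clicksPerSession.size();
    }

    public static void main(String[] args) {
        // Hypothetical session IDs: s1 has two clicks, s2 and s3 have one each.
        List<String> clicks = Arrays.asList("s1", "s1", "s2", "s3");
        System.out.println(bounceRate(clicks));
    }
}
```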

Sessionization

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 165
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 166

How to Sessionize?

1. Given a list of clicks, determine which clicks came from the same user
2. Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session

#1 – Which clicks are from same user?

• We can use:
  • IP address (244.157.45.12)
  • Cookies (A9A3BECE0563982D)
  • IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

#2 – Which clicks are part of the same session?

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

> 30 mins apart = different sessions
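The 30-minute rule can be sketched in plain Java, outside of Hadoop. This assumes one user's clicks are already sorted by timestamp (in milliseconds) and builds the session ID from the IP and the timestamp of the session's first click; the class name `Sessionizer` and the method `assignSessions` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class Sessionizer {
    // 30 minutes, expressed in milliseconds.
    static final long SESSION_TIMEOUT_IN_MS = 30L * 60 * 1000;

    /**
     * Returns one session ID per click. The timestamps must belong to a
     * single user and be sorted in ascending order.
     */
    public static List<String> assignSessions(String ip, long[] sortedTimestamps) {
        List<String> sessionIds = new ArrayList<>();
        String sessionId = null;
        Long lastTimestamp = null;
        for (long timestamp : sortedTimestamps) {
            // First click, or a gap longer than the timeout: start a new session.
            if (lastTimestamp == null || timestamp - lastTimestamp > SESSION_TIMEOUT_IN_MS) {
                sessionId = ip + "+" + timestamp;
            }
            lastTimestamp = timestamp;
            sessionIds.add(sessionId);
        }
        return sessionIds;
    }

    public static void main(String[] args) {
        // Hypothetical clicks: the second follows after 10 minutes,
        // the third after a further gap of just over 30 minutes.
        long[] clicks = {0L, 600_000L, 2_400_001L};
        System.out.println(assignSessions("244.157.45.12", clicks));
    }
}
```

The gap is measured from the previous click rather than from the start of the session, so a long run of closely spaced clicks stays in one session.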

Sessionization in MapReduce

github.com/hadooparchitecturebook/clickstream-tutorial

[Diagram: Map tasks take log lines as input and emit (IP, log lines); Reduce tasks receive (IP1, log lines) or (IP2, log lines) and emit (log line, session ID).]

Mapper for Sessionization

    public static class SessionizeMapper
            extends Mapper<Object, Text, IpTimestampKey, Text> {
        private Matcher logRecordMatcher;

        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            logRecordMatcher = logRecordPattern.matcher(value.toString());

            // We only emit something out if the record matches with our regex.
            // Otherwise, we assume the record is busted and simply ignore it.
            if (logRecordMatcher.matches()) {
                String ip = logRecordMatcher.group(1);
                DateTime timestamp = DateTime.parse(logRecordMatcher.group(2), TIMESTAMP_FORMATTER);
                Long unixTimestamp = timestamp.getMillis();
                IpTimestampKey outputKey = new IpTimestampKey(ip, unixTimestamp);
                context.write(outputKey, value);
            }
        }
    }

Reducer for Sessionization

    public static class SessionizeReducer
            extends Reducer<IpTimestampKey, Text, IpTimestampKey, Text> {

        private Text result = new Text();

        public void reduce(IpTimestampKey key, Iterable<Text> values,
                           Context context
        ) throws IOException, InterruptedException {
            // The sessionId generated here is per day, per IP. So, any queries
            // that will be done as if this session ID were global, would require
            // a combination of the day in question and IP as well.
            String sessionId = null;
            Long lastTimeStamp = null;
            for (Text value : values) {
                String logRecord = value.toString();
                // If this is the first record for this user or it's been more than the timeout since
                // the last click from this user, let's increment the session ID.
                if (lastTimeStamp == null || (key.getUnixTimestamp() - lastTimeStamp > SESSION_TIMEOUT_IN_MS)) {
                    sessionId = key.getIp() + "+" + key.getUnixTimestamp();
                }
                lastTimeStamp = key.getUnixTimestamp();
                result.set(logRecord + " " + sessionId);
                // Since we only care about printing out the entire record in the result, with session ID appended
                // at the end, we just emit out "null" for the key
                context.write(null, result);
            }
        }
    }

Secondary sorting – by timestamp

• Records sent to the reducer need to be grouped by IP address and sorted by timestamp – a concept called secondary sorting
• Instead of using just the IP address as map output key and reduce input key
• We use a composite key (IP, timestamp) as map output key and reduce input key

Secondary sorting – vocabulary

• Composite key – IP address, timestamp
• Natural key – IP address
• Secondary sort key - timestamp

Secondary sorting

• Custom Grouping Comparator – on Natural Key (IP)
• Custom Sort Comparator – on Composite Key (IP, timestamp)
• Custom Partitioner – on Natural Key (IP)

    job.setGroupingComparatorClass(NaturalKeyComparator.class);
    job.setSortComparatorClass(CompositeKeyComparator.class);
    job.setPartitionerClass(NaturalKeyPartitioner.class);
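The division of labor between the three pieces can be imitated in plain Java without any Hadoop classes. This is only a sketch under assumptions: `Key`, `COMPOSITE`, `NATURAL`, `partition`, and `sortAndGroup` are hypothetical names that stand in for the composite key, the sort comparator, the grouping comparator, the partitioner, and the framework's shuffle, respectively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {
    /** Composite key: natural key (IP) plus secondary sort key (timestamp). */
    public static final class Key {
        public final String ip;
        public final long timestamp;

        public Key(String ip, long timestamp) {
            this.ip = ip;
            this.timestamp = timestamp;
        }
    }

    // Sort comparator: orders by the full composite key (IP, then timestamp).
    public static final Comparator<Key> COMPOSITE =
        Comparator.comparing((Key k) -> k.ip).thenComparingLong(k -> k.timestamp);

    // Grouping comparator: two keys belong to the same group if the IPs match.
    public static final Comparator<Key> NATURAL = Comparator.comparing((Key k) -> k.ip);

    // Partitioner: depends on the IP only, so all of one IP's keys land together.
    public static int partition(Key key, int numReducers) {
        return (key.ip.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sorts by composite key, then groups consecutive keys that share an IP. */
    public static Map<String, List<Long>> sortAndGroup(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(COMPOSITE);
        Map<String, List<Long>> groups = new LinkedHashMap<>();
        Key previous = null;
        for (Key key : sorted) {
            if (previous == null || NATURAL.compare(previous, key) != 0) {
                groups.put(key.ip, new ArrayList<>());
            }
            groups.get(key.ip).add(key.timestamp);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(new Key("IP2", 5L), new Key("IP1", 9L), new Key("IP1", 3L));
        System.out.println(sortAndGroup(keys));
    }
}
```

Each group's timestamps come out in ascending order even though grouping looks only at the IP, which is the property the sessionizing reducer depends on.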

Final Architecture

[Diagram labels: Hadoop Cluster; 1. Ingestion; 2. Processing; 3. Accessing; 4. Orchestration via Oozie; Website users; Web servers; Operational Data Store; CRM System; Via Sqoop; BI/Visualization tool (e.g. MicroStrategy); BI Analysts; Spark – for machine learning and graph processing; R/Python – statistical analysis; Custom Apps]

Final Architecture – High Level Overview

[Diagram labels: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]

Final Architecture – Ingestion

[Diagram: Web Apps, each paired with an Avro Agent, feed a set of Flume Agents that write to HDFS. Annotations: Fan-in pattern; Multi agents for failover and rolling restarts.]


Final Architecture – Storage and Processing

/etl/weblogs/20140331/
/etl/weblogs/20140401/
…

Data Processing

/data/marketing/clickstream/bouncerate/
/data/marketing/clickstream/attribution/
…
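The dated directory names appear to follow a yyyyMMdd pattern. Assuming that is the convention, a daily directory path could be derived from a date as in the sketch below; the class name `WeblogPaths` and the method `dailyDir` are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class WeblogPaths {
    /** Builds the per-day raw weblog directory, assuming a yyyyMMdd folder name. */
    public static String dailyDir(LocalDate day) {
        // BASIC_ISO_DATE formats a LocalDate as yyyyMMdd, without separators.
        return "/etl/weblogs/" + day.format(DateTimeFormatter.BASIC_ISO_DATE) + "/";
    }

    public static void main(String[] args) {
        System.out.println(dailyDir(LocalDate.of(2014, 3, 31)));
    }
}
```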


Final Architecture – Data Access

[Diagram labels: Hive/Impala; BI/Analytics Tools; JDBC/ODBC; DWH; Sqoop; DB import tool; Local Disk; R, etc.]

Contact info
• Mark Grover
• @mark_grover
• www.linkedin.com/in/grovermark
• Slides at slideshare.net/markgrover