big data @ orange - dev day 2013 - part 2

[email protected] !

BigData @ Digital Factory

A short story still being written

Olivier Varene, DSIF/DFY
Orange DevDay 2013

Hadoop

Hadoop - Core

•  MapReduce
•  HDFS

Hadoop genealogy

Hadoop timeline

0.2[0-2].x, 0.23.x, 1.x, 2.x

Hadoop Distribution

•  Packaging
•  Deployment
•  Support

Main Distributions

Distribution | Licence | Business Model | Support
Apache | Apache 2.0 | Foundation | community only
HortonWorks | Apache 2.0, HortonWorks (add-on) | PS + Training + support | community + Professional
Cloudera | Apache 2.0, Closed Source (not core) | PS + Licencing + Training + support |
MapR | Apache 2.0, Closed Source (FS) | PS + Licencing + support |
WanDisco | Apache 2.0, Closed Source (DConE) | PS + Licencing + Training + support |

PS: Professional Services

Big Name Distributions

•  IBM InfoSphere BigInsights
•  GreenPlum (EMC)
•  Intel Distribution for Hadoop
•  …

Paid & closed source

Big Data Suite: Tooling

•  Code generation
•  Scheduling
•  Integration

Tools (1st level)

Tool | Description | Licence
Apache Pig | Scripting Platform | Apache 2
Apache Hive | Data Access & Query | Apache 2
Apache HCatalog | Metadata Services | Apache 2
Apache HBase | NoSQL Database | Apache 2
Apache ZooKeeper | Cluster Coordination | Apache 2
Apache Tez | Query processing | Apache 2
Apache Oozie | Workflow Scheduler | Apache 2
Apache Sqoop | Data Integration Services | Apache 2

Tools (add-ons)

Tool | Description | Licence
Teradata connector | Connector | Teradata + Distribution
Hive ODBC | ODBC | Distribution
Mahout | Data Mining | Apache 2
Cascading | Fault Tolerant API / Framework | Apache 2
Cassandra Connector | Connector to Cassandra NoSQL | Apache 2
MongoDB Connector | Connector to MongoDB | Apache 2
…

Landscape

@ Digital Factory

DSIF / Digital Factory

Back in Time

•  PageRank computation on billions of nodes and tens of billions of edges
•  regularly failed! (hardware ...)
•  4 to 8 weeks per computation
•  unscalable
•  failure rate around 80%
•  one person full time to supervise!

3 years ago

Answer?

Internal development
+ full control
- long term
- €€

Open source
+ €€
+ short term
- support
- evolution

Success

In PRODUCTION since 2010!

How does it work?

Hadoop Axioms

•  The system shall manage and heal itself
•  Performance shall scale linearly
•  Compute shall move to data
•  Modular and extensible

HDFS (Simple)

Self-healing, high-bandwidth clustered storage

MapReduce V1 (Simple)

cat <data> | <Mapper> | sort | <Reducer>

Framework:
cat <data> | … | sort | …

Your program:
… <Mapper> … <Reducer>
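The split between framework and user code can be imitated in a few lines of Python. This is only a sketch of the `cat <data> | <Mapper> | sort | <Reducer>` idea, not Hadoop's implementation: `run_job`, `word_mapper` and `count_reducer` are hypothetical names, and word counting is an assumed example workload.

```python
from itertools import groupby
from operator import itemgetter


def word_mapper(line):
    """User code: emit one (word, 1) pair per word of an input line."""
    for word in line.split():
        yield word, 1


def count_reducer(key, values):
    """User code: sum all the values that share a key."""
    yield key, sum(values)


def run_job(lines, mapper, reducer):
    """Framework part (simplified): cat <data> | mapper | sort | reducer."""
    # "cat <data> | <Mapper>": feed every record to the mapper.
    mapped = [pair for line in lines for pair in mapper(line)]
    # "sort": bring identical keys next to each other.
    mapped.sort(key=itemgetter(0))
    # "<Reducer>": call the reducer once per key with all its values.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results


if __name__ == "__main__":
    data = ["big data at orange", "big cluster", "data on hdfs"]
    for word, count in run_job(data, word_mapper, count_reducer):
        print(word, count)
```

Only the two small functions at the top are "your program"; everything in `run_job` stands for what the framework does on your behalf.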

YARN

Allows plugging in new paradigms

MapReduce V1

[Diagram: data on HDFS -> Map() tasks (Map phase) -> sort / partition -> Reduce() tasks (Reduce phase) -> partXX output files]

Before map()

[Diagram: data on HDFS -> slicing / partitioning -> blocks of data -> Map() tasks, each turning (Kin,Vin) records into (Kout,Vout) records]

JobTracker calculates locality for job assignment and input split data.
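As a rough illustration of the slicing step, the sketch below cuts a file into fixed-size splits and assigns each one to a node that already stores the block when possible. Everything here is assumed for illustration: the function names, the 64 MB block size, the node names and the simple "least-loaded local node first" rule. The real JobTracker scheduling is more involved.

```python
def input_splits(file_size, block_size):
    """Return (offset, length) pairs covering the file, one per block."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def assign_splits(replicas, nodes):
    """Pick a node per split, preferring nodes that hold a local replica.

    replicas: one list of hosting nodes per split (hypothetical layout).
    nodes: all nodes able to run a map task.
    """
    load = {node: 0 for node in nodes}
    assignment = []
    for hosts in replicas:
        local = [node for node in hosts if node in load]
        # Fall back to any node when no replica host can run the task.
        candidates = local or list(nodes)
        chosen = min(candidates, key=lambda node: load[node])
        load[chosen] += 1
        assignment.append(chosen)
    return assignment


if __name__ == "__main__":
    mb = 1024 * 1024
    # A 150 MB file with 64 MB blocks gives splits of 64, 64 and 22 MB.
    print(input_splits(150 * mb, 64 * mb))
    replicas = [["node1", "node2"], ["node2", "node3"], ["node1", "node3"]]
    print(assign_splits(replicas, ["node1", "node2", "node3"]))
```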

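Java (API): Mapper

class YourMapper extends Mapper<Kin, Vin, Kout, Vout> {

    // optional: called once before the first map() call
    protected void setup(Context context) { }

    // optional: called once after the last map() call
    protected void cleanup(Context context) { }

    // called once per input record
    protected void map(Kin key, Vin value, Context context) {
        // ... your program ...
    }
}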

Before reduce()

[Diagram of the steps between map() and reduce():]

•  Map() emits (Kout,Vout) pairs: RAM sorting, then disk write to temporary intermediate files, sorted in each file
•  Combine() (OPTIONAL, run 1 or more times): RAM sorting, then disk write to temporary intermediate files
•  Partition(): key namespace partitioning into parts
•  JobTracker: distribution to reducers
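The combine and partition steps can be sketched as follows. This is an assumed, simplified model: `combine`, `partition` and `shuffle` are hypothetical helper names, summing is an assumed combine function, and a CRC32 hash modulo the number of reducers stands in for a partitioner (the general idea behind hash partitioning, not Hadoop's exact code).

```python
import zlib
from collections import defaultdict


def combine(pairs):
    """Optional map-side pre-aggregation: sum the values of each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    # Sorted output mirrors the "sorted in each file" property.
    return sorted(totals.items())


def partition(key, num_reducers):
    """Map a key to a reducer index; equal keys always land together."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Group combined pairs into one bucket ("part") per reducer."""
    parts = [[] for _ in range(num_reducers)]
    for key, value in combine(pairs):
        parts[partition(key, num_reducers)].append((key, value))
    return parts


if __name__ == "__main__":
    map_output = [("tcp", 1), ("udp", 1), ("tcp", 1), ("icmp", 1)]
    for index, part in enumerate(shuffle(map_output, 2)):
        print(index, part)
```

The combiner shrinks what has to be written to disk and sent over the network, which is why it shows up again among the optimization levers.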

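Java (API): Reducer

class YourReducer extends Reducer<Kin, Vin, Kout, Vout> {

    // optional: called once before the first reduce() call
    protected void setup(Context context) { }

    // optional: called once after the last reduce() call
    protected void cleanup(Context context) { }

    // called once per key, with all the values for that key
    protected void reduce(Kin key, Iterable<Vin> values, Context context) {
        // ... your program ...
    }
}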

Optimization tips

•  JVM
•  Algorithm in MapReduce paradigm
•  Combiner
•  Sort algorithm
•  Partitioning

Streaming

… | <mapper> | … | <reducer> | …

•  STDIN
•  STDOUT
•  Text as input and output by default
•  '\t' as default separator
•  Use your language: perl, python, shell, ruby, …
•  (interpreter needed on all nodes)

hadoop jar $streamingJar -input <inputDir> -output <outputDir> -mapper <mapProg> -reducer <reduceProg> -file <files>
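A minimal sketch of what `<mapProg>` and `<reduceProg>` could look like in Python, assuming a word-count job (the workload, the function names and the `map` / `reduce` command-line argument are illustrative, not from the slides). It relies only on the conventions listed above: lines on STDIN, lines on STDOUT, and a tab between key and value. The reducer assumes its input arrives sorted by key, which the framework takes care of between the two stages.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: turn each input line into 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum counts per key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # First argument selects the role: "map" for the mapper, anything else for the reducer.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Because the contract is just text on standard streams, the same two roles can be tried locally with an ordinary shell pipeline and `sort` in the middle before submitting anything to the cluster.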

Pipes - C++

… | <mapper> | … | <reducer> | …

•  Socket communication
•  Bytes as input and output
•  C++ API

hadoop fs -put <binFile> <toHDFS…>
hadoop pipes -input <inputDir> -output <outputDir> -program <path/binfile> [-conf <confFile>]

class MyMap: public HadoopPipes::Mapper { … }
class MyReducer: public HadoopPipes::Reducer { … }

Too difficult

Fortunately, there are tools that can generate code for you or let you run SQL queries!

Tools, Algo / Libs

PIG

Scripting language:

•  Simple
•  Parallel execution
•  Data oriented
•  Extensible via UDF
•  Automatic performance enhancement via compiler

set job.name calculateGraphDegres

%default nbpigreducers 10
set default_parallel $nbpigreducers

-- out-degrees
A = load '$degout' using PigStorage() as (url:chararray,out_deg:int);
-- keep entries where out_deg > 1
A2 = filter A by (out_deg > 1);
B = order A2 by out_deg DESC;
store B into '$degoutOrdered';

-- out-degree distribution
C = foreach A generate out_deg,1 as deg_occ;
D = group C by out_deg;
E = foreach D generate FLATTEN(group) as out_deg,SUM(C.deg_occ) as deg_occ;
F = order E by out_deg ASC;
store F into '$degoutDistrib';
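For readers unfamiliar with Pig Latin, here is a plain-Python sketch of what the script computes, run on a small in-memory sample: the entries with an out-degree above 1 sorted in descending order, and the distribution of out-degrees. The sample URLs and the function names are made up for illustration; Pig would run the same logic in parallel over files on HDFS.

```python
from collections import Counter


def ordered_out_degrees(records):
    """Like A2 and B: keep out_deg > 1, order by out_deg descending."""
    kept = [(url, deg) for url, deg in records if deg > 1]
    return sorted(kept, key=lambda record: record[1], reverse=True)


def out_degree_distribution(records):
    """Like C to F: count how many URLs have each out-degree, ascending."""
    occurrences = Counter(deg for _, deg in records)
    return sorted(occurrences.items())


if __name__ == "__main__":
    sample = [("a.example", 3), ("b.example", 1),
              ("c.example", 3), ("d.example", 7)]
    print(ordered_out_degrees(sample))
    print(out_degree_distribution(sample))
```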

Hive

Querying language:

•  HiveQL (SQL-like)
•  ETL tool
•  HDFS, HBase, Thrift …
•  MapReduce interface (with streaming to python …)
•  Extensible via UDF

CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
LOCATION 'b-file/input/';

-- one column per aggregate produced by the query below
CREATE EXTERNAL TABLE b_packet_out (overall bigint, tcp bigint, udp bigint, icmp bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION 'b-file/output/1/';

INSERT INTO TABLE b_packet_out
SELECT count(*) AS overall,
       sum(if(protocol like 'ip:tcp', 1, 0)) AS tcp,
       sum(if(protocol like 'ip:udp', 1, 0)) AS udp,
       sum(if(protocol like 'ip:icmp', 1, 0)) AS icmp
FROM b_packet;

R

RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki

•  rmr: functions providing mapreduce in R
•  rhdfs: functions providing hdfs operations in R
•  rhbase: functions providing hbase operations in R

library(rmr2)
library(rhdfs)
gdp <- read.csv("GDP_converted.csv")
hdfs.init()
gdp.values <- to.dfs(gdp)
# aaplRevenue is not defined on the slide; it is assumed to be set beforehand
gdp.map.fn <- function(k,v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1)
}
count.reduce.fn <- function(k,v) {
  keyval(k, length(v))
}
count <- mapreduce(input=gdp.values, map = gdp.map.fn, reduce = count.reduce.fn)
from.dfs(count)

GUI tools

PoC:

•  Time saver
•  Prototyping
•  Visualize complex processes
•  Fast changes

But you need to know the internals for optimization

SQL

[Diagram labels: HBase, Phoenix, Hive, Tajo, HDFS, Impala, Presto; access through ODBC/JDBC, HiveQL, JDBC; query languages SQL, HQL, ISO, PSQL; legend: Prod / Beta & Alpha products]

Sqoop

Transfers data between HDFS and structured storage via JDBC connectors: PostgreSQL, MySQL, Oracle, Teradata, …

[Diagram: RDBMS / NoSQL -> Sqoop import -> Hadoop process -> Sqoop export -> RDBMS / NoSQL]

Oozie

Today @ Digital Factory?

In Production

•  Since 2010
•  Growth driven by internal project needs
•  Recycling servers (€€ savings)
•  We learned as we went:
   * tar -> cdh3 -> cdh4 …
   * optimizations
   * run processes …

Production « PFS »

•  Shared among different teams
•  xx nodes on COTS
•  xxx TBytes
•  > xxx jobs per day
•  Monitoring: Xymon
•  Graphing via NetStat (SNMP / RRD: x’s oids/second)
•  Automatic configuration

Architecture

[Diagram components: HDFS, MapReduce, ZooKeeper, HIVE, HIVE Server, PIG, Mahout, Oozie, Khiops, Sqoop, Flume, R, Real Time Query Engine, Web Service, App Services, HCatalog, HBase, Cassandra, Cascading; legend: in POC]

Benefits

•  Infrastructure cost
•  Development cost
•  Robustness
•  Scalability
•  New development areas (Graph Mining, Logs statistics …)

€: -70% LOC, -50% dev time, -75% run cost


A few of our use cases!

Scoring - Search Engine

Graph algorithms for http://www.lemoteur.fr/

•  xx TB compressed
•  xx billion nodes
•  > xxx billion edges
•  xxx TB in RAM

xRank

Profiling

Customers’ statistical behaviors, ads display optimization, …

xxx GB daily + xxx GB monthly (customer DB)

Customer profile

Log Analysis

xx billion events daily

OJD certified measurements: Internet and Mobile, customers’ journey analysis, …

KPIs

with NoSQL

Hadoop over Cassandra (next session)

Benefits & Drawbacks

Benefits:
•  Scalable
•  Stable
•  RUN cost
•  Development cost
•  Performance
•  Very fast evolution
•  New dev areas

Drawbacks:
•  Learning curve
•  Debug
•  Algorithms
•  Complex
•  Very fast evolution

Future

•  Enhance security and robustness
•  Create Services & Functional Catalog
•  Continue building our expertise: Fast Data, Cascading, MR2, …
•  A thousand-node cluster!
•  Help other teams go to production

CONTACT US: [email protected]

Thank you! Merci!

Olivier Varene, DSIF/DFY
Orange DevDay 2013

My thanks to

•  Apache http://www.apache.org/
•  http://hadooper.blogspot.com/
•  Cloudera http://www.cloudera.com/
•  HortonWorks http://www.hortonworks.com/
•  And all the community!
