Big Data @ Orange - Dev Day 2013 - part 2
![Page 1: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/1.jpg)
BigData @ Digital Factory
A short story still being written
Olivier Varene, DSIF/DFY, Orange DevDay 2013
![Page 7: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/7.jpg)
Main Distributions

| Distribution | Licence | Business Model | Support |
| --- | --- | --- | --- |
| Apache | Apache 2.0 | Foundation | community only |
| HortonWorks | Apache 2.0, HortonWorks (add-on) | PS + Training + support | community + Professional |
| Cloudera | Apache 2.0, Closed Source (not core) | PS + Licensing + Training + support | community + Professional |
| MapR | Apache 2.0, Closed Source (FS) | PS + Licensing + support | community + Professional |
| WanDisco | Apache 2.0, Closed Source (DConE) | PS + Licensing + Training + support | community + Professional |

PS: Professional Services
![Page 8: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/8.jpg)
Big Name Distributions
• IBM InfoSphere BigInsights
• Greenplum (EMC)
• Intel Distribution for Hadoop
• …
Paid and closed source
![Page 10: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/10.jpg)
Tools (1st level)

| Tool | Description | Licence |
| --- | --- | --- |
| Apache Pig | Scripting Platform | Apache 2 |
| Apache Hive | Data Access & Query | Apache 2 |
| Apache HCatalog | Metadata Services | Apache 2 |
| Apache HBase | NoSQL Database | Apache 2 |
| Apache ZooKeeper | Cluster Coordination | Apache 2 |
| Apache Tez | Query processing | Apache 2 |
| Apache Oozie | Workflow Scheduler | Apache 2 |
| Apache Sqoop | Data Integration Services | Apache 2 |
![Page 11: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/11.jpg)
Tools (add-ons)

| Tool | Description | Licence |
| --- | --- | --- |
| Teradata connector | Connector | Teradata + Distribution |
| Hive ODBC | ODBC | Distribution |
| Mahout | Data Mining | Apache 2 |
| Cascading | Fault Tolerant API / Framework | Apache 2 |
| Cassandra Connector | Connector to Cassandra NoSQL | Apache 2 |
| MongoDB Connector | Connector to MongoDB | Apache 2 |

…
![Page 14: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/14.jpg)
Back in Time
• PageRank computation on billions of nodes and tens of billions of edges
• regularly failed (hardware ...)
• 4 to 8 weeks per computation
• unscalable
• failure rate around 80%
• One person full time to supervise
- 3 years
![Page 15: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/15.jpg)
Answer?
Internal development: + full control, - long term, - €€
Open source: + €€, + short term, - support, - evolution
![Page 18: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/18.jpg)
Hadoop Axioms
• System shall manage and heal itself
• Performance shall scale linearly
• Compute shall move to data
• Modular and extensible
![Page 27: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/27.jpg)
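MapReduce V1
[Diagram: data on HDFS is read by several Map() tasks; the map output is sorted and partitioned, then consumed by Reduce() tasks, each of which writes a partXX output file. The two phases are labelled Map and Reduce.]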
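The flow on this slide (map, then sort and partition, then reduce) can be imitated in memory. The following is a minimal sketch in plain Python, not Hadoop code: the word-count job, the function names and the choice of two reducers are assumptions made for illustration.

```python
from collections import defaultdict
import zlib

def map_fn(line):
    """Emit (word, 1) for every word of one input line."""
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    """Sum all counts received for one key."""
    return key, sum(values)

def partition(key, num_reducers):
    """Stable hash partitioning: the same key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(lines, num_reducers=2):
    # Map phase: every record is processed independently.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in map_fn(line):
            # Partition phase: route each pair to exactly one reducer.
            buckets[partition(key, num_reducers)][key].append(value)
    # Sort + reduce phase: each reducer walks its keys in sorted order
    # and produces one "part" of the output.
    parts = []
    for bucket in buckets:
        parts.append([reduce_fn(k, bucket[k]) for k in sorted(bucket)])
    return parts

parts = run_job(["big data at orange", "big cluster", "data on hdfs"])
merged = dict(pair for part in parts for pair in part)
```

Each inner list of `parts` plays the role of one partXX file; a key never appears in more than one part.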
![Page 28: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/28.jpg)
Before map()
[Diagram: data on HDFS is sliced and partitioned into blocks of data; the JobTracker calculates locality for job assignment and input split data; each block feeds a Map() task, which turns (Kin,Vin) into (Kout,Vout).]
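Slicing a file into input splits and preferring a node that already holds the block can be sketched as follows. This is an illustration only: the 64 MB block size, the node names and the helper names are assumptions, and the real JobTracker logic is considerably more involved.

```python
def input_splits(file_size, block_size):
    """Return (offset, length) pairs covering the file, one per block."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

def pick_node(replica_nodes, free_nodes):
    """Prefer a free node that stores a replica of the block (data-local);
    otherwise fall back to any free node, or None if nothing is free."""
    for node in replica_nodes:
        if node in free_nodes:
            return node, True
    if free_nodes:
        return sorted(free_nodes)[0], False
    return None, False

MB = 1024 * 1024
splits = input_splits(file_size=150 * MB, block_size=64 * MB)
node, is_local = pick_node(["node3", "node7"], {"node1", "node7"})
```

With these inputs the 150 MB file yields three splits, and the task lands on `node7` because it is both free and holds a replica.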
![Page 29: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/29.jpg)
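Java (API): Mapper

    class YourMapper extends Mapper<Kin, Vin, Kout, Vout> {
        // optional hooks, each run once per task
        [void setup(Context context);]
        [void cleanup(Context context);]
        // called once per input record
        void map(Kin key, Vin value, Context context) {
            // ... your program ...
        }
    }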
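The order in which the framework calls these methods can be mimicked in a few lines. The sketch below is Python, not the Hadoop Java API: the `Context`, `Mapper` and `run_mapper` names are stand-ins invented for illustration.

```python
class Context:
    """Collects the (Kout, Vout) pairs emitted by a task."""
    def __init__(self):
        self.output = []

    def write(self, key, value):
        self.output.append((key, value))

class Mapper:
    def setup(self, context):      # optional hook, called once before the first record
        pass

    def map(self, key, value, context):
        raise NotImplementedError

    def cleanup(self, context):    # optional hook, called once after the last record
        pass

class LineLengthMapper(Mapper):
    """Emits (line, length) and reports the record count during cleanup."""
    def setup(self, context):
        self.records = 0

    def map(self, key, value, context):
        self.records += 1
        context.write(value, len(value))

    def cleanup(self, context):
        context.write("__records__", self.records)

def run_mapper(mapper, records):
    """Drive one map task: setup, then map() per record, then cleanup."""
    context = Context()
    mapper.setup(context)
    for key, value in records:
        mapper.map(key, value, context)
    mapper.cleanup(context)
    return context.output

output = run_mapper(LineLengthMapper(), [(0, "hadoop"), (7, "pig")])
```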
![Page 30: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/30.jpg)
Before reduce()
[Diagram: Map() emits (Kout,Vout) pairs, which are sorted in RAM and written to disk as temporary intermediate files, each file sorted internally. An OPTIONAL Combine() step runs 1 or more times over these files, again sorting in RAM and writing temporary intermediate files. Partition() then splits the key namespace into parts, which the JobTracker distributes to reducers.]
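The map-side steps (sorting in RAM, writing sorted spill files, merging them, then partitioning the key namespace) can be sketched in memory. This is a simplified Python illustration: the buffer size of 3 records, the function names and the first-letter partitioner are assumptions, and real spills go to local disk rather than lists.

```python
import heapq

def spill_sorted_runs(pairs, buffer_size):
    """Sort each buffer-full of pairs in RAM and 'spill' it as one sorted run."""
    runs = []
    for start in range(0, len(pairs), buffer_size):
        runs.append(sorted(pairs[start:start + buffer_size]))
    return runs

def merge_runs(runs):
    """Merge the sorted runs into one sorted stream (a k-way merge)."""
    return list(heapq.merge(*runs))

def partition_by_key(sorted_pairs, num_reducers, partitioner):
    """Split the key namespace so that each reducer gets its own part."""
    parts = [[] for _ in range(num_reducers)]
    for key, value in sorted_pairs:
        parts[partitioner(key) % num_reducers].append((key, value))
    return parts

map_output = [("d", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1), ("d", 1), ("b", 1)]
runs = spill_sorted_runs(map_output, buffer_size=3)
merged = merge_runs(runs)
# Hypothetical partitioner: route keys by the code point of their first letter.
parts = partition_by_key(merged, 2, lambda key: ord(key[0]))
```

Because the merged stream is already sorted, every part stays sorted by key, which is what the reducers rely on.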
![Page 31: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/31.jpg)
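Java (API): Reducer

    class YourReducer extends Reducer<Kin, Vin, Kout, Vout> {
        // optional hooks, each run once per task
        [void setup(Context context);]
        [void cleanup(Context context);]
        // called once per key, with all values emitted for that key
        void reduce(Kin key, Iterable<Vin> values, Context context) {
            // ... your program ...
        }
    }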
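The reducer receives each key once, together with all the values emitted for it. A minimal Python sketch of that grouping step, assuming the input is already sorted by key (the names and sample data are illustrative, not Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def run_reducer(sorted_pairs, reduce_fn):
    """Group consecutive pairs by key and call reduce_fn(key, values) once per key."""
    results = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        results.append(reduce_fn(key, values))
    return results

def max_reduce(key, values):
    """Example reduce function: keep the largest value seen for a key."""
    return key, max(values)

pairs = sorted([("tcp", 40), ("udp", 512), ("tcp", 1500), ("icmp", 84)])
result = run_reducer(pairs, max_reduce)
```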
![Page 32: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/32.jpg)
Optimization tips
• JVM
• Algorithm in MapReduce paradigm
• Combiner
• Sort algorithm
• Partitioning
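Of these tips, the combiner is the easiest to illustrate: it pre-aggregates map output on the map side so that fewer records cross the network. A small Python sketch, assuming the reduce operation is a sum (associative and commutative, which is what makes it safe to apply early); the names are invented for the example.

```python
from collections import Counter

def map_words(line):
    """Map step: one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum the counts per key locally, before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

line = "to be or not to be"
raw = map_words(line)          # what would be shuffled without a combiner
combined = combine(raw)        # what is shuffled with a combiner
```

Here six map records shrink to four before the shuffle; on skewed real data the reduction is usually much larger.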
![Page 33: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/33.jpg)
Streaming
… | <mapper> | … | <reducer> | …
• STDIN
• STDOUT
• Text as input and output by default
• '\t' as default separator
• Use your language: perl, python, shell, ruby, …
• (interpreter needed on all nodes)

    hadoop jar $streamingJar -input <inputDir> -output <outputDir> \
        -mapper <mapProg> -reducer <reduceProg> -file <files>
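A streaming mapper and reducer are ordinary programs that read lines on STDIN and write tab-separated key/value lines on STDOUT. The sketch below keeps both in one Python file and takes the streams as parameters so it can be tried locally; the word-count logic and the function names are assumptions made for illustration.

```python
import io
from itertools import groupby

def mapper(stdin, stdout):
    """Emit 'word<TAB>1' for each word read from stdin."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    """Sum counts per word; input lines must arrive sorted by key."""
    rows = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for word, group in groupby(rows, key=lambda row: row[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")

# Local dry run of: cat input | mapper | sort | reducer
map_out = io.StringIO()
mapper(io.StringIO("pig hive pig\nhive pig\n"), map_out)
shuffled = io.StringIO("".join(sorted(map_out.getvalue().splitlines(keepends=True))))
reduce_out = io.StringIO()
reducer(shuffled, reduce_out)
result = reduce_out.getvalue()
```

On a cluster, each function would live in its own script, called with `sys.stdin` and `sys.stdout`, and the two scripts would be passed to `-mapper` and `-reducer`.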
![Page 34: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/34.jpg)
Pipes (C++)
… | <mapper> | … | <reducer> | …
• Socket communication
• Bytes as input and output
• C++ API

    hadoop fs -put <binFile> <toHDFS…>
    hadoop pipes -input <inputDir> -output <outputDir> \
        -program <path/binfile> [-conf <confFile>]

    class MyMap: public HadoopPipes::Mapper { … }
    class MyReducer: public HadoopPipes::Reducer { … }
![Page 35: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/35.jpg)
Too difficult?
Fortunately, there are tools that can generate code for you or let you write SQL queries.
Tools, Algo / Libs
![Page 36: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/36.jpg)
PIG
Scripting language:
• Simple
• Parallel execution
• Data oriented
• Extensible via UDF
• Automatic performance enhancement via compiler

    set job.name calculateGraphDegres

    %default nbpigreducers 10
    set default_parallel $nbpigreducers

    -- out-degrees
    A = load '$degout' using PigStorage() as (url:chararray,out_deg:int);
    -- keep entries where out_deg > 1
    A2 = filter A by (out_deg > 1);
    B = order A2 by out_deg DESC;
    store B into '$degoutOrdered';

    -- distribution of out-degrees
    C = foreach A generate out_deg,1 as deg_occ;
    D = group C by out_deg;
    E = foreach D generate FLATTEN(group) as out_deg,SUM(C.deg_occ) as deg_occ;
    F = order E by out_deg ASC;
    store F into '$degoutDistrib';
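For readers who do not know Pig Latin, the same two computations (URLs with an out-degree above 1 in descending order, and the out-degree distribution) can be written in plain Python on a small in-memory sample. The sample data is invented; the sketch only mirrors the logic of the script and says nothing about how Pig executes it.

```python
from collections import Counter

# Invented sample standing in for the '$degout' input: (url, out_deg)
A = [("a.example", 3), ("b.example", 1), ("c.example", 3), ("d.example", 7)]

# A2 = filter A by (out_deg > 1); B = order A2 by out_deg DESC
B = sorted((row for row in A if row[1] > 1), key=lambda row: row[1], reverse=True)

# C..F: count how many URLs share each out-degree, ordered by out_deg ASC
F = sorted(Counter(out_deg for _, out_deg in A).items())
```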
![Page 37: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/37.jpg)
Hive
Querying language:
• HiveQL (SQL-like)
• ETL tool
• HDFS, HBase, Thrift …
• MapReduce interface (with streaming to python …)
• Extensible via UDF

    CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
    LOCATION 'b-file/input/';

    -- one output column per aggregate selected below
    CREATE EXTERNAL TABLE b_packet_out (overall bigint, tcp bigint, udp bigint, icmp bigint)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    LOCATION 'b-file/output/1/';

    -- rlike rather than like: the patterns are regular expressions anchored with ^
    INSERT INTO TABLE b_packet_out
    SELECT count(*) AS overall,
           sum(if(protocol rlike '^ip:tcp', 1, 0)) AS tcp,
           sum(if(protocol rlike '^ip:udp', 1, 0)) AS udp,
           sum(if(protocol rlike '^ip:icmp', 1, 0)) AS icmp
    FROM b_packet;
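The aggregation performed by the final query can be checked on a few rows in plain Python. The sample packets are invented, and `re.match` is used here as a stand-in for a regular-expression match anchored at the start of the protocol string.

```python
import re

# Invented rows in the b_packet layout: timestamp|packet_length|protocol
lines = [
    "1380000000|60|ip:tcp:http",
    "1380000001|120|ip:udp:dns",
    "1380000002|84|ip:icmp",
    "1380000003|1500|ip:tcp:ssh",
    "1380000004|42|arp",
]

def protocol_counts(rows):
    """Return (overall, tcp, udp, icmp) counts, one per aggregate of the query."""
    overall = tcp = udp = icmp = 0
    for row in rows:
        _timestamp, _length, protocol = row.split("|")
        overall += 1
        tcp += bool(re.match(r"ip:tcp", protocol))
        udp += bool(re.match(r"ip:udp", protocol))
        icmp += bool(re.match(r"ip:icmp", protocol))
    return overall, tcp, udp, icmp

counts = protocol_counts(lines)
```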
![Page 38: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/38.jpg)
R
RHadoop:
https://github.com/RevolutionAnalytics/RHadoop/wiki
• rmr: functions providing MapReduce in R
• rhdfs: functions providing HDFS operations in R
• rhbase: functions providing HBase operations in R

    library(rmr2)
    library(rhdfs)
    gdp <- read.csv("GDP_converted.csv")
    hdfs.init()
    gdp.values <- to.dfs(gdp)
    # aaplRevenue is not defined on the slide; it must be set before running
    gdp.map.fn <- function(k, v) {
      key <- ifelse(v[4] < aaplRevenue, "less", "greater")
      keyval(key, 1)
    }
    count.reduce.fn <- function(k, v) {
      keyval(k, length(v))
    }
    count <- mapreduce(input = gdp.values, map = gdp.map.fn, reduce = count.reduce.fn)
    from.dfs(count)
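The logic of this R job (label each value as "less" or "greater" than a threshold, then count per label) is small enough to restate in plain Python. The figures below are invented, and `threshold` plays the role of `aaplRevenue`, whose value is not given on the slide.

```python
from collections import Counter

def gdp_map(value, threshold):
    """Map step: key each GDP value by its side of the threshold."""
    return ("less" if value < threshold else "greater"), 1

def count_reduce(pairs):
    """Reduce step: count how many values landed under each key."""
    counts = Counter()
    for key, one in pairs:
        counts[key] += one
    return dict(counts)

# Invented figures, for illustration only.
gdp_values = [50, 120, 300, 80]
threshold = 100
count = count_reduce(gdp_map(v, threshold) for v in gdp_values)
```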
![Page 39: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/39.jpg)
GUI
Tools
PoC
Time saver, prototyping, visualize complex processes, fast changes
But you need to know the internals for optimization
![Page 40: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/40.jpg)
SQL
[Diagram: SQL-on-Hadoop landscape. Labels: HBase, Phoenix, Hive, Tajo, HDFS, Impala, Presto; interfaces: ODBC/JDBC, HiveQL, JDBC; dialects: SQL, HQL, ISO, PSQL. Legend: Prod / Beta & Alpha products.]
![Page 41: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/41.jpg)
Sqoop
Transfers data between HDFS and structured storage via JDBC connectors: PostgreSQL, MySQL, Oracle, Teradata, …
[Diagram: RDBMS and NoSQL stores on one side, Hadoop processing on the other, linked by Sqoop import and Sqoop export.]
![Page 44: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/44.jpg)
In Production
• Since 2010
• Growth driven by internal project needs
• Recycling servers (€€ savings)
• We learned as we went:
  * tar -> cdh3 -> cdh4 …
  * optimizations
  * Run processes …
![Page 45: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/45.jpg)
Production « PFS »
• Shared among different teams
• xx nodes on COTS
• xxx TBytes
• >xxx jobs per day
• Monitoring: Xymon
• Graphing via NetStat (SNMP / RRD: x's oids/second)
• Automatic configuration
![Page 46: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/46.jpg)
Architecture
[Diagram: platform stack. Labels: HDFS, MapReduce, ZooKeeper, HIVE, PIG, HCatalog, HBase, Cassandra, Cascading, Mahout, Oozie, Khiops, Sqoop, Flume, R, Real Time Query Engine, HIVE Server, Web Service, App Services. Some components are marked "in POC".]
![Page 47: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/47.jpg)
Benefits
• Infrastructure cost
• Development cost
• Robustness
• Scalability
• New development areas (Graph Mining, Logs statistics …)
€: -70% LOC, -50% dev time, -75% run cost
![Page 49: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/49.jpg)
Graph algorithms for http://www.lemoteur.fr/
Scoring - Search Engine
xx TB compressed
xx billion nodes
>xxx billion edges
xxx TB in RAM
xRank
![Page 50: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/50.jpg)
Customers' statistical behaviors, ads display optimization, …
Profiling
xxx GB daily + xxx GB monthly (customer DB)
Customer profile
![Page 51: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/51.jpg)
Log Analysis
xx billion events daily
OJD certified measurements: Internet and Mobile, customers' journey analysis, …
KPIs
![Page 53: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/53.jpg)
Benefits & Drawbacks
Benefits: Scalable, Stable, RUN Cost, Development Cost, Performance, Very fast evolution, New Dev areas
Drawbacks: Learning curve, Debug, Algorithms, Complex, Very fast evolution
![Page 54: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/54.jpg)
Future
• Enhance security and robustness
• Create Services & Functional Catalog
• Continue building our expertise: Fast Data, Cascading, MR2, …
• A thousand-node cluster!
• Help other teams go to production
CONTACT US: [email protected]
![Page 56: Big Data @ Orange - Dev Day 2013 - part 2](https://reader038.vdocuments.site/reader038/viewer/2022102815/554f7237b4c9058a148b53cb/html5/thumbnails/56.jpg)
My thanks to
• Apache http://www.apache.org/
• http://hadooper.blogspot.com/
• Cloudera http://www.cloudera.com/
• HortonWorks http://www.hortonworks.com/
• And all the community!