APACHE SPARK REX
Heuritech for La Poste
ABOUT ME
Didier Marin
PhD in Computer Science (UPMC): Machine Learning, Reinforcement Learning & Robotics
Co-founder of Heuritech
Likes functional programming and distributed computing
We develop tools to make sense of raw text data
Customer insight using the text of visited web pages
Data Analytics Platform
Qualify users using their web logs
50M lines/day
Match CRM and web data
WHY SPARK?
Performance, in particular when batch size < total RAM in the cluster
More general than MapReduce, with a high-level API
Extensions (ML, streaming) and connectors (Cassandra)
Growing community
PARSING LOGS
def parseLine(line: String): Either[ParsingError, LogData] = ???
val logs = sc.textFile("logfile").map(parseLine(_))
val validLogs = logs.flatMap(_.right.toOption)
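The slide elides the `LogData` and `ParsingError` types. A minimal self-contained sketch, where the fields and the whitespace-separated record format are assumptions for illustration (not the actual La Poste log format), could look like:

```scala
// Hypothetical log model: the real LogData / ParsingError types are not shown in the slides.
case class LogData(ip: String, timestamp: String, url: String)
case class ParsingError(line: String, reason: String)

// Assume "ip timestamp url" records separated by whitespace.
def parseLine(line: String): Either[ParsingError, LogData] =
  line.split("\\s+") match {
    case Array(ip, ts, url) => Right(LogData(ip, ts, url))
    case _                  => Left(ParsingError(line, "malformed record"))
  }

// Same filtering logic as above, on a local list instead of an RDD:
val parsed = List("1.2.3.4 2015-06-01 /home", "garbage").map(parseLine)
val valid  = parsed.flatMap(_.right.toOption)
// valid contains only the successfully parsed LogData records
```

Returning `Either` instead of throwing keeps malformed lines from killing a whole partition; `flatMap(_.right.toOption)` then silently drops them, while the `Left` side remains available if you want to count or log parse failures.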
LAMBDA ARCHITECTURE
IMPLEMENTATION
CLUSTER CONFIGURATION
LXC + Salt
N containers: 1 master/executor + (N-1) executors
A Cassandra node for each Spark executor
Using an "uber"-JAR to submit jobs
Sharing data through NFS
MANAGING SPARK'S MEMORY
Default: 40% working memory, 60% cache
20% of the cache is used to unroll blocks
Explicit caching for huge RDDs we reuse:
validLogs.persist(StorageLevel.MEMORY_AND_DISK)
Partition tuning may be necessary (spark.default.parallelism)
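Putting the two points above together, a sketch of explicit caching plus partition tuning (the parallelism value of 200 is an illustrative assumption; the right number depends on cluster size and data volume):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// Raise default parallelism so individual partitions are small enough
// to unroll in the cache (illustrative value, tune per cluster).
val conf = new SparkConf()
  .setAppName("log-analytics")
  .set("spark.default.parallelism", "200")

// MEMORY_AND_DISK spills partitions to disk instead of recomputing them
// when the RDD is larger than the available cache.
// validLogs.persist(StorageLevel.MEMORY_AND_DISK)
```

`MEMORY_AND_DISK` trades disk I/O on re-read against recomputing the lineage, which is usually the right trade for an RDD that is both expensive to build and reused many times.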
AGGREGATION
val words = sc.parallelize(List("a","b","a","c"))
words.groupBy(x=>x).mapValues(_.size).collect
// Array((a,2), (b,1), (c,1))
words.map(x=>(x,1)).reduceByKey(_+_).collect
// Array((a,2), (b,1), (c,1))
AGGREGATION: groupBy and reduceByKey (shuffle diagrams)
See also: combineByKey, foldByKey
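A sketch of the two alternatives mentioned above, on the same word data as before (output order after a shuffle is not guaranteed):

```scala
val pairs = sc.parallelize(List("a", "b", "a", "c")).map(x => (x, 1))

// foldByKey: like reduceByKey, but with an explicit zero value.
pairs.foldByKey(0)(_ + _).collect
// e.g. Array((a,2), (b,1), (c,1))

// combineByKey: the general form that reduceByKey and foldByKey build on.
pairs.combineByKey(
  (v: Int) => v,                         // createCombiner: first value seen for a key
  (acc: Int, v: Int) => acc + v,         // mergeValue: fold a value in, within a partition
  (acc1: Int, acc2: Int) => acc1 + acc2  // mergeCombiners: merge partial results across partitions
).collect
// e.g. Array((a,2), (b,1), (c,1))
```

Like reduceByKey, both combine values per-key inside each partition before shuffling, which is why they scale better than groupBy followed by a local reduction.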
USEFUL LINKS
Databricks knowledge base: github.com/databricks/spark-knowledgebase
Spark users mailing list: apache-spark-user-list.1001560.n3.nabble.com
Parsing Apache logs with Spark (Scala): alvinalexander.com/scala/analyzing-apache-access-logs-files-spark-scala
THANK YOU!