Heuritech: Apache Spark REX (experience report)
TRANSCRIPT
ABOUT ME
Didier Marin
PhD in Computer Science (UPMC), Machine Learning, Reinforcement Learning & Robotics
Co-founder of Heuritech
Likes functional programming and distributed computing
We develop tools to make sense of raw text data
Customer insight from the text of visited web pages
WHY SPARK?
Performance, in particular when batch size < total RAM in the cluster
More general than MapReduce, with a high-level API
Extensions (ML, streaming) and connectors (Cassandra)
Growing community
PARSING LOGS
def parseLine(line: String): Either[ParsingError, LogData] = ???
val logs = sc.textFile("logfile").map(parseLine(_))
val validLogs = logs.flatMap(_.right.toOption)
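A minimal sketch of what parseLine could look like, assuming a simple space-separated log format; the ParsingError and LogData definitions here are hypothetical stand-ins for the talk's actual types.

```scala
// Hypothetical types: placeholders for the real log schema.
case class ParsingError(line: String, reason: String)
case class LogData(ip: String, path: String, status: Int)

def parseLine(line: String): Either[ParsingError, LogData] =
  line.split(" ") match {
    // Expect exactly "ip path status", with a numeric status code
    case Array(ip, path, status) if status.nonEmpty && status.forall(_.isDigit) =>
      Right(LogData(ip, path, status.toInt))
    case _ =>
      Left(ParsingError(line, "unexpected format"))
  }
```

Returning Either instead of throwing keeps a single malformed line from killing the whole job; the flatMap(_.right.toOption) step above then silently drops the failures.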
CLUSTER CONFIGURATION
LXC + Salt
N containers: 1 master/executor + (N-1) executors
One Cassandra node co-located with each Spark executor
Using an "uber"-JAR to submit jobs
Sharing data through NFS
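An uber-JAR bundles the application and all its non-Spark dependencies into a single artifact, so only one file has to reach the cluster. A hedged sketch of the submit command; the class name, master URL and JAR name are illustrative placeholders, not the ones from the talk.

```shell
# Build the uber-JAR (e.g. with the sbt-assembly plugin), then:
spark-submit \
  --class com.example.LogAnalysis \
  --master spark://master:7077 \
  log-analysis-assembly.jar
```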
MANAGING SPARK'S MEMORY
Default: 40% working memory, 60% cache
20% of the cache is used to unroll blocks
Explicit caching for huge RDDs we reuse:validLogs.persist(StorageLevel.MEMORY_AND_DISK)
Partition tuning may be necessary (spark.default.parallelism)
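The knobs above can be set on the SparkConf before building the context; the values here are illustrative, not recommendations. The two storage fractions correspond to the 60% cache / 20% unroll defaults mentioned above (Spark's legacy memory model).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Illustrative settings, to be tuned per cluster.
val conf = new SparkConf()
  .setAppName("log-analysis")
  .set("spark.default.parallelism", "96")       // e.g. ~2-3 tasks per CPU core
  .set("spark.storage.memoryFraction", "0.6")   // fraction of heap for the cache
  .set("spark.storage.unrollFraction", "0.2")   // fraction of cache for unrolling blocks
val sc = new SparkContext(conf)

// Spill to disk rather than recompute when a huge RDD does not fit in memory:
// validLogs.persist(StorageLevel.MEMORY_AND_DISK)
```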
AGGREGATION
val words = sc.parallelize(List("a","b","a","c"))
words.groupBy(x=>x).mapValues(_.size).collect
// Array((a,2), (b,1), (c,1))
words.map(x=>(x,1)).reduceByKey(_+_).collect
// Array((a,2), (b,1), (c,1))
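Both versions produce the same counts, but reduceByKey combines values within each partition before the shuffle, so only small partial sums cross the network, whereas groupBy ships every element. A plain-Scala sketch of that two-phase semantics, with partitions simulated as nested lists:

```scala
// Simulate reduceByKey: reduce inside each "partition" first,
// then merge the (much smaller) partial results after the "shuffle".
def reduceByKeySim[K](partitions: List[List[(K, Int)]]): Map[K, Int] = {
  // Map-side combine: one partial sum per key per partition
  val partials = partitions.map(_.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) })
  // Final merge of the partial sums
  partials.flatten.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
}
```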
USEFUL LINKS
Databricks knowledge base: github.com/databricks/spark-knowledgebase
Spark users mailing list: apache-spark-user-list.1001560.n3.nabble.com
Parsing Apache logs with Spark (Scala): alvinalexander.com/scala/analyzing-apache-access-logs-files-spark-scala