scala 20140715

Intro to Apache Spark: Fast cluster computing engine for Hadoop

Intro to Scala: Object-oriented and functional language for the Java Virtual Machine

ACM SIGKDD, 7/9/2014

Roger Huang

Lead System Architect

@BigDataWrangler



http://spark.apache.org/


About me: Roger Huang

• Visa

– Digital & Mobile Products Architecture, Strategic Projects & Infrastructure

– Search infrastructure

– Customer segmentation

– Logging Framework

– Splunk on Hadoop (Hunk)

– Real-time monitoring

– Data

• PayPal

– Java Infrastructure


Different perspectives on an elephant Scala


Outline

• Spark

– Hadoop ecosystem

• Scala

– Background

• Why Scala?

– For the computer scientist

– For the Java / OO programmer

– For the Spark developer

– For the Big Data developer

– For the Big Data scientist / mathematician

– For the system architect


Spark in the Hadoop ecosystem


Spark Ecosystem of Software Projects

• Spark [Ognen]

– APIs: Scala, Python [Robert], Java

• “SQL”

– Shark (Hive + Spark) [Roger]

– SparkSQL (alpha)

• Machine Learning Library (MLlib) [Omar]

– Clustering

– Classification and regression

• Binary classification

• Linear regression

– Recommendations

• Spark Streaming [Chance]

• GraphX [Srini]

• …


Resilient Distributed Dataset

• Fault-tolerant collection of elements, partitioned across the nodes of the cluster, that can be operated on in parallel

• Data sources for RDDs

– Parallelized collections

• From Scala collections

– Hadoop datasets

• From HDFS or any other Hadoop-supported storage system (HBase, Amazon S3, …)

• Text files, SequenceFiles, any Hadoop InputFormat

• Two types of operations

– Transformation

• Takes an existing dataset and creates a new one

– Action

• Takes a dataset, runs a computation, and returns a value to the driver program


(Some) RDD Operations

• Transformations

– map(func)

– filter(func)

– flatMap(func)

– mapPartitions(func)

– mapPartitionsWithIndex(func)

– sample(withReplacement, fraction, seed)

– union(otherDataset)

– distinct()

– groupByKey()

– reduceByKey(func)

– sortByKey()

– join(otherDataset)

– cogroup(otherDataset)

– cartesian(otherDataset)

• Actions

– reduce(func)

– collect()

– count()

– first()

– take(n)

– takeSample(withReplacement, num, seed)

– saveAsTextFile(path)

– saveAsSequenceFile(path)

– countByKey()

– foreach(func)

– …
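Many of the operation names above mirror methods on ordinary Scala collections. The sketch below uses a plain Scala List as a local stand-in for an RDD, so it runs without Spark; it is an analogy, not Spark code. The object name RddStyleOps and the sample data are made up for illustration, and groupBy followed by a sum is only a rough local equivalent of reduceByKey.

```scala
// Plain Scala collections as a stand-in for an RDD (no Spark dependency).
object RddStyleOps {
  val lines: List[String] = List("to be or", "not to be")

  // "Transformations": each one produces a new collection.
  val words: List[String] = lines.flatMap(_.split(" ").toList)
  val longWords: List[String] = words.filter(_.length > 2).distinct
  val pairs: List[(String, Int)] = words.map(w => (w, 1))

  // Rough local equivalent of reduceByKey(_ + _): group by key, then sum the values.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  // "Actions": each one returns a plain value.
  val total: Int = pairs.map(_._2).reduce(_ + _)
  val firstTwo: List[String] = words.take(2)

  def main(args: Array[String]): Unit = {
    println(counts)
    println(total)
  }
}
```

On a real RDD the same chain would start from a SparkContext, and the transformations would only be evaluated once an action is called.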


Scala background

• Scalable, object-oriented, functional language

– Version 2.11 (4/2014)

• Runs on the Java Virtual Machine

• Created by Martin Odersky, who also worked on

– javac

– Java generics

• http://scala-lang.org/, REPL

• http://www.scala-lang.org/api/current

• http://scala-ide.org/

• http://www.scala-sbt.org/, Simple build tool

• Who’s using Scala?

– Twitter, LinkedIn, …

• Powered by Scala

– Apache Spark, Apache Kafka, Akka,…








Scala for the computer scientist: functional programming (FP)

• Math functions, e.g., f(x) = y

– A function has a single responsibility

– A function has no side effects

– A function is referentially transparent

• A function outputs the same value for the same inputs.

• Functional programming

– Expresses computation as the evaluation and composition of mathematical functions

– Avoids side effects and mutable state
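A minimal sketch of the difference between a referentially transparent function and one with side effects. The names Purity, addTax, and nextId are hypothetical and chosen only for this illustration.

```scala
object Purity {
  // Pure: the result depends only on the arguments, and nothing else changes.
  def addTax(amount: Int, ratePercent: Int): Int =
    amount + amount * ratePercent / 100

  // Impure: reads and mutates state that lives outside the function.
  var counter: Int = 0
  def nextId(): Int = { counter += 1; counter }

  def main(args: Array[String]): Unit = {
    println(addTax(200, 10) == addTax(200, 10)) // same inputs, same output
    println(nextId() == nextId())               // two calls, two different results
  }
}
```

Because addTax(200, 10) can always be replaced by its value, calls to it can be reordered or run in parallel safely; the same is not true of nextId().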


Why functional programming?

• Multi core processors

• Concurrency

– Computation as a series of independent data transformations

– Parallel data transformations without side effects

• Referential transparency


Scala for the computer scientist: functional programming

• Functions

– Lambda, closure

• For-comprehensions

• Type inference

• Pattern matching

• Higher-order functions

– map, flatMap, foldLeft

• And more …


FP: functions

• Anonymous function

– Function without a name

– Lambda function

• Example

scala> List(100, 200, 300) map { _ * 10 / 100 }

res0: List[Int] = List(10, 20, 30)

• Closure (Wikipedia)

– A closure is a function together with a referencing environment: a table storing a reference to each of the non-local variables of that function.

– A closure allows a function to access those non-local variables even when invoked outside its immediate lexical scope.


FP: functions

• applyPercentage is an example of a closure

scala> var percentage = 10

percentage: Int = 10

scala> val applyPercentage = (amount: Int) => amount * percentage / 100

applyPercentage: Int => Int = <function1>

scala> percentage = 20

percentage: Int = 20

scala> List(100, 200, 300) map applyPercentage

res1: List[Int] = List(20, 40, 60)

scala>
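The session above yields 20, 40, 60 rather than 10, 20, 30 because the closure captures the variable percentage itself, not the value it held when the function was defined. The sketch below contrasts that with fixing the rate at creation time; fixedRate and applyTen are hypothetical names added for this illustration.

```scala
object ClosureCapture {
  var percentage: Int = 10

  // Captures the variable, so later assignments to percentage are visible.
  val applyPercentage: Int => Int = amount => amount * percentage / 100

  // Hypothetical helper: the rate is passed in, so it is fixed when the function is built.
  def fixedRate(rate: Int): Int => Int = amount => amount * rate / 100
  val applyTen: Int => Int = fixedRate(percentage) // percentage is 10 at this point

  def main(args: Array[String]): Unit = {
    percentage = 20
    println(List(100, 200, 300).map(applyPercentage)) // uses the current value, 20
    println(List(100, 200, 300).map(applyTen))        // still uses 10
  }
}
```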


FP: functions

• Anonymous function

• Closure


FP: Higher order functions

scala> :load Person.scala

Loading Person.scala...

defined class Person

scala> val jd = new Person("John", "Doe", 17)

jd: Person = Person@372a6e85

scala> val rh = new Person("Roger", "Huang", 34)

rh: Person = Person@611c4041

scala> val people = Array(jd, rh)

people: Array[Person] = Array(Person@372a6e85, Person@611c4041)

scala> val (minors, adults) = people partition (_.age < 18)

minors: Array[Person] = Array(Person@372a6e85)

adults: Array[Person] = Array(Person@611c4041)

scala>
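Person.scala itself is not shown on the slide. A minimal sketch of what it might contain, assuming a plain class with three constructor fields; the field names firstName and lastName are guesses, and only age is confirmed by the session. The REPL prints Person@372a6e85, which suggests an ordinary class rather than a case class (a case class would print its fields).

```scala
// Assumed contents of Person.scala; field names other than age are guesses.
class Person(val firstName: String, val lastName: String, val age: Int)

object PartitionDemo {
  val people: Array[Person] =
    Array(new Person("John", "Doe", 17), new Person("Roger", "Huang", 34))

  // partition is a higher-order function: it takes a predicate and
  // returns the elements that satisfy it and those that do not.
  val (minors, adults) = people.partition(_.age < 18)

  def main(args: Array[String]): Unit = {
    println(minors.map(_.firstName).mkString(", "))
    println(adults.map(_.firstName).mkString(", "))
  }
}
```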


FP: Higher order functions

• Higher-order function (HOF)

– Takes a function as an argument, or

– Returns a function
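A short sketch of a function that does both: it takes a function as an argument and returns a new one. The names HigherOrder, twice, and addTwo are hypothetical.

```scala
object HigherOrder {
  // Takes a function f and returns a new function that applies f twice.
  def twice(f: Int => Int): Int => Int = x => f(f(x))

  val addTwo: Int => Int = twice(_ + 1)

  def main(args: Array[String]): Unit =
    println(List(1, 2, 3).map(addTwo))
}
```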


FP: Higher order functions: map

• Creates a new collection from an existing collection by applying a function

• Anonymous function

scala> List(1, 2, 3) map { (x: Int) => x + 1 }

res0: List[Int] = List(2, 3, 4)

• Function literal (placeholder syntax)

scala> List(1, 2, 3) map { _ + 1 }

res1: List[Int] = List(2, 3, 4)

• Passing an existing function

scala> def addOne(num: Int) = num + 1

addOne: (num: Int)Int

scala> List(1, 2, 3) map addOne

res2: List[Int] = List(2, 3, 4)





FP: Higher order functions: flatMap
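The content of this slide did not survive extraction, so the following is a stand-in sketch rather than the original example. It contrasts map, which keeps one output element per input element, with flatMap, which flattens the per-element results into a single collection. The object name and sample data are made up.

```scala
object FlatMapDemo {
  val lines: List[String] = List("a b", "c")

  // map keeps the nesting: one inner list per input line.
  val nested: List[List[String]] = lines.map(_.split(" ").toList)

  // flatMap applies the same function and then concatenates the inner lists.
  val flat: List[String] = lines.flatMap(_.split(" ").toList)

  def main(args: Array[String]): Unit = {
    println(nested)
    println(flat)
  }
}
```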


FP: for-comprehension

• Syntax

– for ( <generator> | <guard> ) <expression> [yield] <expression>

• Types

– Imperative form. Does not return a value.

scala> val aList = List(1, 2, 3)

aList: List[Int] = List(1, 2, 3)

scala> val bList = List(4, 5, 6)

bList: List[Int] = List(4, 5, 6)

scala> for { a <- aList; if (a < 2); b <- bList; if (b < 7) } println( a + b )

5

6

7



– Functional form (a.k.a. sequence comprehension). Returns/yields a value.

scala> for { a <- aList; b <- bList } yield a + b

res0: List[Int] = List(5, 6, 7, 6, 7, 8, 7, 8, 9)

scala> res0.take(1)

res1: List[Int] = List(5)

scala> for { a <- aList; if (a < 2); b <- bList } yield a + b

res2: List[Int] = List(5, 6, 7)


scala>
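A for-comprehension with yield is syntactic sugar for calls to withFilter, flatMap, and map. The sketch below writes the guarded example both ways, using the same aList and bList values as the session above; the object name ForDesugar is made up.

```scala
object ForDesugar {
  val aList = List(1, 2, 3)
  val bList = List(4, 5, 6)

  // Sugared form, as in the REPL session.
  val viaFor: List[Int] =
    for { a <- aList; if a < 2; b <- bList } yield a + b

  // Roughly what the compiler generates: guard -> withFilter,
  // outer generator -> flatMap, innermost generator -> map.
  val viaMethods: List[Int] =
    aList.withFilter(_ < 2).flatMap(a => bList.map(b => a + b))

  def main(args: Array[String]): Unit =
    println(viaFor == viaMethods)
}
```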


FP: foldLeft

scala> val numbers = 1.to(10)

numbers: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> def add(a: Int, b: Int): Int = { a + b }

add: (a: Int, b: Int)Int

scala> numbers.foldLeft(0){ add }

res0: Int = 55

scala> numbers.foldLeft(0){ (acc, b) => acc + b }

res1: Int = 55

scala>
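To make the mechanics explicit, here is a hand-written loop that behaves like foldLeft: start from the initial value and combine it with each element from left to right. myFoldLeft is a hypothetical name used only for illustration; the standard library method should be used in real code.

```scala
object FoldLeftTrace {
  // Illustrative equivalent of xs.foldLeft(zero)(op).
  def myFoldLeft[A, B](xs: Seq[A], zero: B)(op: (B, A) => B): B = {
    var acc = zero
    for (x <- xs) acc = op(acc, x) // acc carries the running result
    acc
  }

  def main(args: Array[String]): Unit = {
    println(myFoldLeft(1 to 10, 0)(_ + _))
    println(myFoldLeft(List("a", "b", "c"), "")(_ + _))
  }
}
```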




FP: find the last item in an array

scala> val ns = Array(20, 40, 60)

ns: Array[Int] = Array(20, 40, 60)

scala> ns.foldLeft(ns.head) { (acc, b) => b }

res0: Int = 60

scala>


FP: reverse an array w/ foldLeft

scala> val ns = Array(20, 40, 60)

ns: Array[Int] = Array(20, 40, 60)

scala> ns.foldLeft(Array[Int]()) { (acc, b) => b +: acc }

res1: Array[Int] = Array(60, 40, 20)

scala>
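Two variations on the folds above, offered as a sketch. ns.head throws on an empty array, so the first helper starts from an empty Option instead. The second uses a List, where prepending with :: is a constant-time operation, whereas b +: acc on an Array copies the array at every step. The names FoldVariants, lastOption, and reverse are hypothetical.

```scala
object FoldVariants {
  // Last element without calling head: returns None for an empty input.
  def lastOption[A](xs: Seq[A]): Option[A] =
    xs.foldLeft(Option.empty[A])((_, x) => Some(x))

  // Reverse by prepending each element to the accumulator.
  def reverse[A](xs: List[A]): List[A] =
    xs.foldLeft(List.empty[A])((acc, x) => x :: acc)

  def main(args: Array[String]): Unit = {
    println(lastOption(Seq(20, 40, 60)))
    println(reverse(List(20, 40, 60)))
  }
}
```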














Scala for the Java / OO developer

• Interoperable w/ Java

• Case classes

• Mixins with traits


Scala for the Java / OO developer

• Case class

– Implements equals(), hashCode(), toString()

– Can be used in Pattern Matching
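A brief sketch of these features together: a case class that gets equals(), hashCode(), and toString() for free, is used in a pattern match, and mixes in a trait. The names Greeter, Customer, and describe are hypothetical and not from the talk.

```scala
object CaseClassDemo {
  // A trait with one abstract member and one concrete method.
  trait Greeter {
    def name: String
    def greet: String = s"Hello, $name"
  }

  // The constructor parameter name implements the trait's abstract member.
  case class Customer(name: String, age: Int) extends Greeter

  // Pattern matching destructures the case class fields.
  def describe(c: Customer): String = c match {
    case Customer(n, a) if a < 18 => s"$n is a minor"
    case Customer(n, _)           => s"$n is an adult"
  }

  def main(args: Array[String]): Unit = {
    val ann = Customer("Ann", 30)
    println(ann)                        // generated toString
    println(ann == Customer("Ann", 30)) // structural equality
    println(describe(ann))
    println(ann.greet)
  }
}
```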


Scala for the Java / OO developer

• http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html

• map

– <R> Stream<R> map(Function<? super T,? extends R> mapper)

– Returns a stream consisting of the results of applying the given function to the elements of this stream. This is an intermediate operation.

• flatMap

– <R> Stream<R> flatMap(Function<? super T,? extends Stream<? extends R>> mapper)

– Returns a stream consisting of the results of replacing each element of this stream with the contents of a mapped stream produced by applying the provided mapping function to each element. Each mapped stream is closed after its contents have been placed into this stream. (If a mapped stream is null, an empty stream is used instead.) This is an intermediate operation.













Scala for the Spark developer

• Resilient Distributed Dataset (RDD)

• A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. The RDD class contains the basic operations available on all RDDs, such as map, filter, and persist.

• http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD














Scala for the Big Data developer

• Spark

– Programming API in Scala

– Implemented in Scala

• Scalding

– Scala DSL on top of Cascading

– Cascading: a data-processing API and query planner used for defining, sharing, and executing data-processing workflows

– Abstractions: tuples, pipes, source/sink taps

• Algebird

• Summingbird

– Library that lets you write MapReduce programs that look like native Scala or Java collection transformations

– Executes them on a number of well-known distributed MapReduce platforms, including Storm (https://github.com/nathanmarz/storm) and Scalding (https://github.com/twitter/scalding)












Scala for the Big Data scientist / mathematician

• Monoid

– If you want to “attach” operations such as +, -, *, / or <= to data objects (e.g., Bloom filters), then you want to provide monoid forms of those data objects

– Consists of

• A set of objects

• A binary operation that satisfies the monoid axioms (associativity, plus an identity element)

• Monad

– If you want to create a data processing pipeline that transforms the state of a data object

– Composition
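The two ideas can be sketched in a few lines of plain Scala. The Monoid trait below is a hypothetical minimal version written for this illustration (libraries such as Algebird provide their own, richer definitions), and the parse and half steps are made-up pipeline stages. Option stands in for a monad here: flatMap chains steps that may each fail.

```scala
object MonoidSketch {
  // Hypothetical minimal monoid: an identity element plus an associative operation.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  val intSum: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def plus(x: Int, y: Int): Int = x + y
  }

  def setUnion[T]: Monoid[Set[T]] = new Monoid[Set[T]] {
    def zero: Set[T] = Set.empty[T]
    def plus(x: Set[T], y: Set[T]): Set[T] = x ++ y
  }

  // Any monoid can be folded the same way, whatever the data type.
  def combineAll[A](xs: Seq[A])(m: Monoid[A]): A = xs.foldLeft(m.zero)(m.plus)

  // Monad-style pipeline: each stage returns an Option, flatMap composes them.
  def parse(s: String): Option[Int] = scala.util.Try(s.trim.toInt).toOption
  def half(n: Int): Option[Int] = if (n % 2 == 0) Some(n / 2) else None
  def pipeline(s: String): Option[Int] = parse(s).flatMap(half)

  def main(args: Array[String]): Unit = {
    println(combineAll(Seq(1, 2, 3))(intSum))
    println(combineAll(Seq(Set(1), Set(1, 2)))(setUnion[Int]))
    println(pipeline("8"))
  }
}
```

Associativity is what lets a cluster combine partial results from different partitions in any grouping and still get the same answer.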












Scala for the system architect

• Concurrency

• Problem:

– Threads

– Shared mutable state

– Locks

• Solution:

– message passing concurrency w/ Actors

– Future, Promise

• Abstractions

– Actor

• an object that processes a message

• encapsulates state (state not shared)

– ActorRef

– Message, usually sent asynchronously

– Mailbox

– ActorSystem
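Actors require the Akka library, so the sketch below covers only the Future and Promise part of the list, which is available in the Scala standard library. The names priceOf, total, and viaPromise are hypothetical, and Await is used here only so that a small demo can print its results; production code would normally compose futures rather than block.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureSketch {
  // A Future runs its body asynchronously on the execution context.
  def priceOf(item: String): Future[Int] = Future { item.length * 10 }

  // Futures compose with for-comprehensions; no locks or shared mutable state.
  def total: Future[Int] =
    for { a <- priceOf("book"); b <- priceOf("pen") } yield a + b

  // A Promise is the write side of a Future: a producer completes it later.
  def viaPromise(value: String): Future[String] = {
    val p = Promise[String]()
    Future { p.success(value.toUpperCase) } // completed from another thread
    p.future
  }

  def main(args: Array[String]): Unit = {
    println(Await.result(total, 5.seconds))
    println(Await.result(viaPromise("done"), 5.seconds))
  }
}
```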


Scala for the system architect: Akka

• Fault tolerance

– Supervision

– Strategies

• Resume, restart, stop, escalate, …

• Scale out: remote actors

– Via configuration


Scala for the system architect

• Parallel collections

scala> import scala.collection.parallel.immutable._

import scala.collection.parallel.immutable._

scala> ParVector(10, 20, 30, 40, 50, 60, 70, 80, 90).map { x =>

     | println(Thread.currentThread.getName); x / 2 }

ForkJoinPool-1-worker-13

res0: scala.collection.parallel.immutable.ParVector[Int] = ParVector(5, 10, 15, 20, 25, 30, 35, 40, 45)

scala>


Sequential collections


Parallel collections














References

• http://scala-lang.org/

• Scala in Action, Nilanjan Raychaudhuri

• Grokking Functional Programming, Aslam Khan

• Michael Noll



