Data Science at Scale with Spark (pdf)
TRANSCRIPT
![Page 2: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/2.jpg)
<shameless> <plug>
</plug> </shameless>
![Page 3: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/3.jpg)
“Trolling the Hadoop community since 2012...”
![Page 4: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/4.jpg)
![Page 5: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/5.jpg)
Why the JVM?
![Page 6: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/6.jpg)
The JVM
Akka, Breeze, Algebird, Spire & Cats, Axle, ...
![Page 7: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/7.jpg)
Big Data Ecosystem
![Page 8: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/8.jpg)
Hadoop
![Page 9: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/9.jpg)
Hadoop
[Diagram: a Hadoop cluster. The master runs the Resource Mgr and Name Node; each slave runs a Node Mgr and a Data Node over several local disks.]
![Page 10: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/10.jpg)
Hadoop
[Diagram: the Hadoop cluster (master: Resource Mgr, Name Node; slaves: Node Mgr, Data Node, disks), with HDFS labeled on each slave.]
![Page 11: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/11.jpg)
[Diagram: the Hadoop cluster (master: Resource Mgr, Name Node; slaves: Node Mgr, Data Node, disks) with HDFS on each slave and three MapReduce jobs added.]
![Page 12: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/12.jpg)
Hadoop
MapReduce
![Page 13: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/13.jpg)
Example: Inverted Index

wikipedia.org/hadoop: Hadoop provides MapReduce and HDFS
wikipedia.org/hbase: HBase stores data in HDFS
wikipedia.org/hive: Hive queries HDFS files and HBase tables with SQL
...

inverse index (stored in blocks)
...
hadoop: (.../hadoop,1)
hbase: (.../hbase,1),(.../hive,1)
hdfs: (.../hadoop,1),(.../hbase,1),(.../hive,1)
hive: (.../hive,1)
...
and: (.../hadoop,1),(.../hive,1)
...
![Page 14: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/14.jpg)
Example: Inverted Index

Web Crawl
wikipedia.org/hadoop: Hadoop provides MapReduce and HDFS
wikipedia.org/hbase: HBase stores data in HDFS
wikipedia.org/hive: Hive queries HDFS files and HBase tables with SQL
...

index (stored in blocks)
...
wikipedia.org/hadoop: Hadoop provides...
wikipedia.org/hbase: HBase stores...
wikipedia.org/hive: Hive queries...
...

Compute Inverted Index: Miracle!!

inverse index (stored in blocks)
...
hadoop: (.../hadoop,1)
hbase: (.../hbase,1),(.../hive,1)
hdfs: (.../hadoop,1),(.../hbase,1),(.../hive,1)
hive: (.../hive,1)
...
and: (.../hadoop,1),(.../hive,1)
...
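The "miracle" step can be sketched in a few lines of plain Python, with no Hadoop involved. This is only an illustration of the map, shuffle, and reduce phases on the three sample pages from the slide; the function names (`map_phase`, `shuffle`, `reduce_phase`) are made up for this sketch, and lowercasing the words is an assumption.

```python
from collections import defaultdict

# Sample pages taken from the slide; a real crawl would be far larger.
CRAWL = {
    "wikipedia.org/hadoop": "Hadoop provides MapReduce and HDFS",
    "wikipedia.org/hbase": "HBase stores data in HDFS",
    "wikipedia.org/hive": "Hive queries HDFS files and HBase tables with SQL",
}


def map_phase(path, text):
    """Emit one (word, path) pair per word occurrence."""
    for word in text.lower().split():
        yield word, path


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, paths):
    """Count occurrences per path, giving (word, [(path, count), ...])."""
    counts = defaultdict(int)
    for path in paths:
        counts[path] += 1
    return word, sorted(counts.items())


def inverted_index(crawl):
    pairs = (pair for path, text in crawl.items()
             for pair in map_phase(path, text))
    return dict(reduce_phase(w, ps) for w, ps in shuffle(pairs).items())


index = inverted_index(CRAWL)
print(index["hdfs"])
```

On a cluster the three phases run on different machines and the shuffle moves data over the network; here they are ordinary function calls.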
![Page 19: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/19.jpg)
Java MapReduce Inverted Index

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LineIndexer {

  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf = new JobConf(LineIndexer.class);

    conf.setJobName("LineIndexer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));
    conf.setMapperClass(LineIndexMapper.class);
    conf.setReducerClass(LineIndexReducer.class);
    client.setConf(conf);

    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public static class LineIndexMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(
        LongWritable key, Text val,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

      FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
      String fileName = fileSplit.getPath().getName();
      location.set(fileName);

      // Actual business logic: emit (word, file name) for every token.
      String line = val.toString();
      StringTokenizer itr = new StringTokenizer(line.toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, location);
      }
    }
  }

  public static class LineIndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Actual business logic: join the locations into one string.
      boolean first = true;
      StringBuilder toReturn = new StringBuilder();
      while (values.hasNext()) {
        if (!first) toReturn.append(", ");
        first = false;
        toReturn.append(values.next().toString());
      }
      output.collect(key, new Text(toReturn.toString()));
    }
  }
}
![Page 24: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/24.jpg)
Problems
Hard to implement algorithms...
![Page 25: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/25.jpg)
Higher-Level Tools?
![Page 26: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/26.jpg)
CREATE TABLE students (name STRING, age INT, gpa FLOAT);
LOAD DATA ...;
...
SELECT age, AVG(gpa) FROM students GROUP BY age;
![Page 27: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/27.jpg)
A = LOAD 'students' USING PigStorage()
    AS (name:chararray, age:int, gpa:float);
B = GROUP A BY age;
C = FOREACH B GENERATE group AS age, AVG(A.gpa);
DUMP C;
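Both the SQL and the Pig Latin query compute the average GPA per age. As a rough point of comparison, the same group-and-average can be sketched in plain Python; the `students` rows below are hypothetical, since the slides elide the data load.

```python
from collections import defaultdict

# Hypothetical rows (name, age, gpa); the slides do not show the data.
students = [
    ("alice", 20, 3.5),
    ("bob", 20, 3.0),
    ("carol", 21, 4.0),
]


def avg_gpa_by_age(rows):
    """Rough equivalent of: SELECT age, AVG(gpa) ... GROUP BY age."""
    totals = defaultdict(lambda: [0.0, 0])  # age -> [sum of gpa, row count]
    for _name, age, gpa in rows:
        totals[age][0] += gpa
        totals[age][1] += 1
    return {age: total / count for age, (total, count) in totals.items()}


result = avg_gpa_by_age(students)
print(result)
```

The declarative versions say what to compute and leave the grouping strategy to the engine, which is the appeal of the higher-level tools.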
![Page 28: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/28.jpg)
Cascading (Java)
MapReduce
![Page 29: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/29.jpg)
import com.twitter.scalding._

class InvertedIndex(args: Args) extends Job(args) {

  val texts = Tsv("texts.tsv", ('id, 'text))
  val wordToIds = texts
    .flatMap(('id, 'text) -> ('word, 'id2)) {
      fields: (String, String) =>
        val (id2, text) = fields
        text.split("\\s+").map {
          word => (word, id2)
        }
    }

  val invertedIndex = wordToIds
    .groupBy('word)(_.toList[String]('id2 -> 'ids))
  invertedIndex.write(Tsv("output.tsv"))
}
![Page 30: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/30.jpg)
Problems
Only “Batch mode”; what about streaming?
![Page 31: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/31.jpg)
Problems
Performance needs to be better
![Page 32: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/32.jpg)
Spark
![Page 33: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/33.jpg)
Productivity?
Very concise, elegant, functional APIs.
•Python, R
•Scala, Java
•... and SQL!
![Page 34: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/34.jpg)
Productivity?
Interactive shell (REPL)
•Scala, Python, R, and SQL
Notebooks (iPython/Jupyter-like)
![Page 35: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/35.jpg)
Flexible for Algorithms?
Composable primitives support a wide class of algorithms: iterative machine learning and graph processing/traversal.
![Page 36: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/36.jpg)
Efficient?
Builds a dataflow DAG:
•Combines steps into “stages”
•Can cache intermediate data
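To make the two bullets concrete, here is a toy sketch in plain Python; it is not how Spark is implemented. `Dataset` is a hypothetical class: transformations only record steps, element-wise steps are pipelined through generators in a single pass (loosely, one "stage"), and `cache()` keeps an intermediate result so later computations do not re-read the source.

```python
class Dataset:
    """A node in a lazily evaluated dataflow graph (toy model, not Spark)."""

    def __init__(self, parent=None, step=None, source=None):
        self.parent, self.step, self.source = parent, step, source
        self._cache_requested = False
        self._cached = None

    def map(self, fn):
        # Nothing runs here; we only record the step.
        return Dataset(parent=self, step=lambda items: (fn(x) for x in items))

    def filter(self, pred):
        return Dataset(parent=self,
                       step=lambda items: (x for x in items if pred(x)))

    def cache(self):
        self._cache_requested = True
        return self

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        if self.parent is None:
            items = self.source()
        else:
            # Generators chain the steps, so the data is traversed once.
            items = self.step(self.parent._iterate())
        if self._cache_requested:
            self._cached = list(items)
            return iter(self._cached)
        return items

    def collect(self):
        """Trigger the computation and return the results as a list."""
        return list(self._iterate())


reads = []  # records how many times the source is actually read


def load():
    reads.append(1)
    return iter(range(10))


squares = Dataset(source=load).map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0).collect()
big = squares.filter(lambda x: x > 50).collect()
print(evens, big, len(reads))
```

Both `collect()` calls share the cached squares, so the source is read only once. In a real engine, stage boundaries fall where data must be shuffled between nodes, which this sketch leaves out.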
![Page 37: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/37.jpg)
Efficient?
The New DataFrame API has the same performance for all languages.
![Page 38: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/38.jpg)
Batch + Streaming?
Streams - “mini batch” processing:
•Reuses “batch” code
•Adds “window” functions
![Page 39: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/39.jpg)
Resilient Distributed Datasets (RDDs)
[Diagram: a cluster of three nodes; one RDD spans the cluster, with a partition on each node.]
![Page 40: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/40.jpg)
DStreams
DStream (discretized stream)
[Diagram: a timeline of RDDs (… Time 1 RDD, Time 2 RDD, Time 3 RDD, Time 4 RDD …), each holding the events of one batch. “Window of 3 RDD Batches #1” and “Window of 3 RDD Batches #2” each span three consecutive RDDs.]
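The windowing idea can be sketched in plain Python: treat the stream as a sequence of small batches and slide a fixed-size window over them. The `windows` helper and the sample events are hypothetical, and a slide of one batch is an assumption; this is not the Spark Streaming API.

```python
from collections import deque


def windows(batches, size=3, slide=1):
    """Yield tuples of `size` consecutive batches, advancing `slide` batches."""
    buf = deque(maxlen=size)
    for i, batch in enumerate(batches):
        buf.append(batch)
        if len(buf) == size and (i + 1 - size) % slide == 0:
            yield tuple(buf)


# Four hypothetical mini-batches of events, one per time interval.
batches = [["e1", "e2"], ["e3"], ["e4", "e5", "e6"], ["e7"]]

# Per-window event counts: ordinary "batch" code reused on each window.
counts = [sum(len(batch) for batch in window) for window in windows(batches)]
print(counts)
```

The first window covers batches 1 to 3 and the second covers batches 2 to 4, so consecutive windows overlap.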
![Page 41: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/41.jpg)
Scala?
I’ll use Scala for the examples, but Data Scientists can use Python or R, if they prefer.
See Vitaly Gordon's opinion on Scala for Data Science
![Page 42: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/42.jpg)
Inverted Index
![Page 43: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/43.jpg)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object InvertedIndex {
  def main(args: Array[String]) = {

    val sc = new SparkContext("local", "Inverted Index")

    sc.textFile("data/crawl")
      .map { line =>
        val array = line.split("\t", 2)
        (array(0), array(1))
      }
      .flatMap {
        case (path, text) =>
          text.split("""\W+""") map {
            word => (word, path)
          }
      }
      // (word1,path1) (word2,path2) ...
      .map {
        case (w, p) => ((w, p), 1)
      }
      .reduceByKey {
        (n1, n2) => n1 + n2
      }
      // ((word1,path1),N1) ((word2,path2),N2) ...
      .map {
        case ((word,path),n) => (word,(path,n))
      }
      // (word1,(path1,N1)) (word2,(path2,N2)) ...
      .groupByKey
      // (word,Seq((path1,n1),(path2,n2),(path3,n3),...)) ...
      .mapValues { iter =>
        iter.toSeq.sortBy {
          case (path, n) => (-n, path)
        }.mkString(", ")
      }
      // (word,“(path4,80),(path19,51),(path8,12),...”) ...
      .saveAsTextFile("output/inverted-index")

    sc.stop()
  }
}
![Page 53: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/53.jpg)
Altogether

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object InvertedIndex {
  def main(args: Array[String]) = {

    val sc = new SparkContext(
      "local", "Inverted Index")

    sc.textFile("data/crawl")
      .map { line =>
        val array = line.split("\t", 2)
        (array(0), array(1))
      }
      .flatMap {
        case (path, text) =>
          text.split("""\W+""") map {
            word => (word, path)
          }
      }
      .map {
        case (w, p) => ((w, p), 1)
      }
      .reduceByKey {
        (n1, n2) => n1 + n2
      }
      .map {
        case ((word,path),n) => (word,(path,n))
      }
      .groupByKey
      .mapValues { iter =>
        iter.toSeq.sortBy {
          case (path, n) => (-n, path)
        }.mkString(", ")
      }
      .saveAsTextFile("output/inverted-index")

    sc.stop()
  }
}
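For readers who want to check what the operator chain produces without a Spark installation, here is a plain-Python mirror of the same steps on two hypothetical input lines. It follows the Scala logic (split on tab, split on non-word characters, count per (word, path), sort by descending count then path), with one deviation noted in the comments: empty tokens are skipped.

```python
import re
from collections import Counter, defaultdict


def inverted_index(lines):
    """lines: iterable of "path<TAB>text" records, as in data/crawl."""
    counts = Counter()
    for line in lines:
        path, text = line.split("\t", 1)            # the first .map
        for word in re.split(r"\W+", text):         # the .flatMap
            if word:                                # deviation: skip empty tokens
                counts[(word, path)] += 1           # .map + .reduceByKey
    grouped = defaultdict(list)
    for (word, path), n in counts.items():          # re-key by word, .groupByKey
        grouped[word].append((path, n))
    return {                                        # the .mapValues
        word: ", ".join(
            f"({path},{n})"
            for path, n in sorted(pairs, key=lambda pn: (-pn[1], pn[0])))
        for word, pairs in grouped.items()
    }


# Hypothetical crawl records.
sample = [
    "a.html\tspark spark hadoop",
    "b.html\tspark hadoop hadoop hadoop",
]
index = inverted_index(sample)
print(index["spark"])
print(index["hadoop"])
```

Sorting on `(-n, path)` puts the pages with the most occurrences first and breaks ties alphabetically by path, which is what the `sortBy` in the Scala version does.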
![Page 54: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/54.jpg)
Powerful, composable “operators”:
.map { case (w, p) => ((w, p), 1) }
.reduceByKey { (n1, n2) => n1 + n2 }
.map { case ((word,path),n) => (word,(path,n)) }
.groupByKey
.mapValues { iter => iter.toSeq.sortBy { case (path, n) => (-n, path) }.mkString(", ") }
![Page 55: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/55.jpg)
The larger ecosystem
![Page 56: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/56.jpg)
SQL Revisited
![Page 57: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/57.jpg)
Spark SQL
•Use HiveQL (Hive’s SQL dialect)
•Use Spark’s own SQL dialect
•Use the new DataFrame API
![Page 59: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/59.jpg)
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._

val sc = new SparkContext(...)
val sqlc = new HiveContext(sc)

sqlc.sql(
  "CREATE TABLE wc (word STRING, count INT)").show()

sqlc.sql("""
  LOAD DATA LOCAL INPATH '/path/to/wc.txt'
  INTO TABLE wc""").show()

sqlc.sql("""
  SELECT * FROM wc
  ORDER BY count DESC""").show()
![Page 62: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/62.jpg)
•Prefer Python??
•Just replace:
import org.apache.spark.sql.hive._
•With this:
from pyspark.sql import HiveContext
•and delete the vals.
![Page 63: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/63.jpg)
•Prefer Just HiveQL??
•Use spark-sql shell script:
CREATE TABLE wc (word STRING, count INT);
LOAD DATA LOCAL INPATH '/path/to/wc.txt' INTO TABLE wc;
SELECT * FROM wc ORDER BY count DESC;
![Page 64: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/64.jpg)
Spark SQL
•Use HiveQL (Hive’s SQL dialect)
•Use Spark’s own SQL dialect
•Use the new DataFrame API
![Page 65: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/65.jpg)
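import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.functions.length
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(...)
val sqlc = new HiveContext(sc)
import sqlc.implicits._  // enables the $"column" syntax

val df = sqlc.read.parquet("/path/to/wc.parquet")

val ordered_df = df.orderBy($"count".desc)
ordered_df.show()
ordered_df.cache()

val long_words = ordered_df.filter(length($"word") > 20)
long_words.write.parquet("/…/long_words.parquet")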
![Page 70: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/70.jpg)
Machine Learning
MLlib
![Page 71: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/71.jpg)
Streaming KMeans Example
![Page 72: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/72.jpg)
import ...spark.mllib.clustering.StreamingKMeans
import ...spark.mllib.linalg.{Vector, Vectors}
import ...spark.mllib.regression.LabeledPoint
import ...spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(...)
val ssc = new StreamingContext(sc, Seconds(10))

val trainingData = ssc.textFileStream(...)
  .map(Vectors.parse)
val testData = ssc.textFileStream(...)
  .map(LabeledPoint.parse)

val model = new StreamingKMeans()
  .setK(K_CLUSTERS)
  .setDecayFactor(1.0)
  .setRandomCenters(N_FEATURES, 0.0)

val f: LabeledPoint => (Double, Vector) =
  lp => (lp.label, lp.features)

model.trainOn(trainingData)
model.predictOnValues(testData.map(f)).print()

ssc.start()
ssc.awaitTermination()
![Page 80: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/80.jpg)
Graph Processing
GraphX
![Page 81: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/81.jpg)
GraphX
• Social networks
• Epidemics
• Teh Interwebs
• “Page Rank”
• ...
![Page 82: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/82.jpg)
import scala.collection.mutable
import org.apache.spark._
import ...spark.storage.StorageLevel
import ...spark.graphx._
import ...spark.graphx.lib._
import ...spark.graphx.PartitionStrategy._

val nEdgePartitions = 20
// Wrapped in an Option so that the foldLeft over it on the later slides type-checks.
val partitionStrategy: Option[PartitionStrategy] =
  Some(PartitionStrategy.CanonicalRandomVertexCut)
val edgeStorageLevel = StorageLevel.MEMORY_ONLY
val vertexStorageLevel = StorageLevel.MEMORY_ONLY
val tolerance = 0.001F
val input = "..."

val sc = new SparkContext(...)
![Page 83: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/83.jpg)
![Page 84: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/84.jpg)
![Page 85: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/85.jpg)
val tolerance = 0.001F
val input = "..."

val sc = new SparkContext(...)

// Named arguments: the third positional parameter of edgeListFile is a Boolean flag.
val unpartitionedGraph = GraphLoader.edgeListFile(
  sc, input,
  numEdgePartitions = nEdgePartitions,
  edgeStorageLevel = edgeStorageLevel,
  vertexStorageLevel = vertexStorageLevel).cache()

val graph = partitionStrategy.foldLeft(
  unpartitionedGraph)(_.partitionBy(_))

println("# vertices = " + graph.vertices.count)
println("# edges = " + graph.edges.count)

val pr = PageRank.runUntilConvergence(
  graph, tolerance).vertices.cache()
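`PageRank.runUntilConvergence` takes a tolerance, but the slide does not show what converging means. The following is a minimal plain-Scala sketch of an iterate-until-stable PageRank on an in-memory edge list. It is not GraphX's implementation. The object name `PageRankSketch`, the stopping rule (largest per-vertex change at or below the tolerance), and the reset probability of 0.15 are all assumptions made for illustration.

```scala
// Sketch only: PageRank on a small in-memory edge list, iterated until
// no vertex's rank moves by more than `tolerance`. Not the GraphX algorithm.
object PageRankSketch {
  def run(edges: Seq[(Long, Long)], tolerance: Double,
          resetProb: Double = 0.15): Map[Long, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    if (vertices.isEmpty) Map.empty
    else {
      val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
      var ranks: Map[Long, Double] = vertices.map(v => v -> 1.0).toMap
      var delta = Double.MaxValue
      while (delta > tolerance) {
        // Each vertex sends rank / outDegree along every outgoing edge.
        val contribs = edges
          .map { case (s, d) => d -> (ranks(s) / outDegree(s)) }
          .groupBy(_._1)
          .map { case (v, cs) => v -> cs.map(_._2).sum }
        val next = vertices.map { v =>
          v -> (resetProb + (1.0 - resetProb) * contribs.getOrElse(v, 0.0))
        }.toMap
        // Largest change in any single rank during this iteration.
        delta = vertices.map(v => math.abs(next(v) - ranks(v))).max
        ranks = next
      }
      ranks
    }
  }
}
```

For example, `PageRankSketch.run(Seq((2L, 1L), (3L, 1L)), 0.001)` ranks vertex 1 above vertices 2 and 3, because both of them link to it.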
![Page 86: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/86.jpg)
![Page 87: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/87.jpg)
val graph = partitionStrategy.foldLeft(
  unpartitionedGraph)(_.partitionBy(_))

println("# vertices = " + graph.vertices.count)
println("# edges = " + graph.edges.count)

val pr = PageRank.runUntilConvergence(
  graph, tolerance).vertices.cache()

println("Top ranks: ")
// Sort by rank, highest first; collect() so the driver prints them in order.
pr.sortBy(tuple => tuple._2, ascending = false)
  .collect()
  .foreach(println)

pr.map { case (id, r) => id + "\t" + r }
  .saveAsTextFile(...)
sc.stop()
![Page 88: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/88.jpg)
![Page 90: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/90.jpg)
Bonus Slides: Details of the MapReduce implementation for the Inverted Index
![Page 91: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/91.jpg)
1 Map step + 1 Reduce step
![Page 92: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/92.jpg)
Map Task
(hadoop, (wikipedia.org/hadoop, 1))
(mapreduce, (wikipedia.org/hadoop, 1))
(hdfs, (wikipedia.org/hadoop, 1))
(provides, (wikipedia.org/hadoop, 1))
(and, (wikipedia.org/hadoop, 1))
![Page 93: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/93.jpg)
![Page 94: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/94.jpg)
Reduce Task
(hadoop, Iterator((…/hadoop, 1)))
(hdfs, Iterator((…/hadoop, 1), (…/hbase, 1), (…/hive, 1)))
(hbase, Iterator((…/hbase, 1), (…/hive, 1)))
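The Map Task and Reduce Task tuples above can be sketched in plain Scala, with no Hadoop or Spark involved. This is a minimal illustration rather than the implementation from the talk. The object `InvertedIndexSketch`, its function names, the word-splitting rule, and the in-memory `groupBy` standing in for the shuffle are all assumptions.

```scala
// Sketch only: an inverted index as one map step plus one reduce step,
// using in-memory collections in place of a real MapReduce job.
object InvertedIndexSketch {
  // Map step: emit one (word, (docId, 1)) pair per word occurrence.
  def mapStep(docId: String, text: String): Seq[(String, (String, Int))] =
    text.toLowerCase.split("\\W+").toSeq
      .filter(_.nonEmpty)
      .map(word => (word, (docId, 1)))

  // Reduce step: for one word, sum the counts per document and
  // list documents by descending count, then by name.
  def reduceStep(word: String,
                 values: Iterator[(String, Int)]): (String, Seq[(String, Int)]) = {
    val perDoc = values.toSeq
      .groupBy(_._1)
      .map { case (doc, vs) => (doc, vs.map(_._2).sum) }
    (word, perDoc.toSeq.sortBy { case (doc, n) => (-n, doc) })
  }

  // Stand-in for the shuffle: group the map output by word, then reduce each group.
  def run(docs: Seq[(String, String)]): Map[String, Seq[(String, Int)]] =
    docs.flatMap { case (id, text) => mapStep(id, text) }
      .groupBy(_._1)
      .map { case (word, pairs) => reduceStep(word, pairs.map(_._2).iterator) }
}
```

With two made-up documents, `("wikipedia.org/hadoop", "Hadoop provides MapReduce and HDFS")` and `("wikipedia.org/hbase", "HBase stores data in HDFS")`, the entry for `hdfs` lists both documents with a count of 1 each.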
![Page 95: Data Science at Scale with Spark (pdf)](https://reader038.vdocuments.site/reader038/viewer/2022102812/58a1a9d51a28abd94d8c462b/html5/thumbnails/95.jpg)