Integrating Elastic and Apache Spark - Elastic London Meetup (2015-09-24)
© Exabre Limited 2015. The three bar ‘F’ device, “The Filter” and “Filter Systems” are trademarks or registered trademarks of Exabre Limited. All rights reserved.
Neil Andrassy – CTO at The Filter
@andrassy
24 September 2015 @ Elastic London Meetup
THE FILTER
IN MEDIA AND IN RETAIL, WE AIM TO UNDERSTAND…
• HOW STUFF RELATES TO OTHER STUFF
• HOW PEOPLE RELATE TO STUFF
GIVING USERS THE RIGHT “STUFF” AT THE RIGHT TIME…
• ALTERNATIVE PRODUCTS
• COHERENT PERSONALISED PLAYLISTS
• PRODUCTS YOU MIGHT LIKE
• CONTENT RELATED TO THIS PRODUCT
• RELEVANT NEWS
• …AND MANY MORE
OUR HERITAGE
“The Filter is like a zen master, who knows me, knows what I am interested in, knows what’s out there and gives me what is relevant at the time that I really want it in the most appropriate way.”
OUR CHALLENGES
PERSONALISED IS HARD…
• READ SCALABILITY
• WRITE SCALABILITY (REALTIME PERSONALISATION FOR THE INDIVIDUAL)
• AVAILABILITY / FAULT TOLERANCE
MACHINE LEARNING IS HARD…
• DATA HUNGRY
• VOLUME – VELOCITY – VARIETY
• ML PROCESSES ARE RESOURCE INTENSIVE
MULTI-TENANCY IS HARD
• EVERY CATALOGUE IS UNIQUE / DIFFERENT
OUR DATA JOURNEY (pre Spark)
2004 – MS SQL
2011 – MS SQL + MONGODB
2012 – ELASTIC + MS SQL
ELASTIC good….
• STRUCTURED DATA
• UNSTRUCTURED DATA
• TIME-SERIES DATA
• READ SCALABILITY
• WRITE SCALABILITY
• SUPPORT FOR FAILURE
• EASY MANAGEMENT
• ….
ELASTIC not so good….
• DATA PROCESSING
• ETL
• BATCH
• STREAMS
• MACHINE LEARNING
• GRAPH
• ….
OUR DATA JOURNEY (continued)
2004 – MS SQL
2011 – MS SQL + MONGODB
2012 – ELASTIC + MS SQL
2014 – ELASTIC + SPARK
APACHE SPARK is…. A fast and general purpose engine for large-scale data processing
• SPEEDY – faster than Hadoop MapReduce, largely through in-memory processing
• EASY TO USE API – Java, Scala, Python, R
• SCALABLE – makes clustered operation transparent
• FLEXIBLE / POWERFUL COMPONENTS
• CORE
• SQL
• STREAMING
• MLLIB
• GRAPHX
• ***ELASTICSEARCH-SPARK***
https://spark.apache.org
CLIENT, MASTER, WORKER
• CLIENT (DRIVER) – submits a job to the MASTER
• MASTER (MANAGER) – co-ordinates the job with the WORKERS
• WORKERS – “do” the actual work/tasks (on RDDs)
• Ideally co-locate these on ES data nodes
• Workers manage local executors (per app)
https://spark.apache.org
RDD Resilient Distributed Dataset
• An IMMUTABLE collection of data elements
• PARTITIONED for distributed processing (think SHARD)
• RESILIENT for failure tolerance / recovery
IMMUTABLE + PARTITIONED
PARALLELIZABLE + SCALABLE
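As a rough illustration of the RDD properties above, the following minimal sketch runs against a local Spark context. The object name `RddBasics`, the sample numbers and the partition count of 4 are made up for this example; only standard Spark core calls (`parallelize`, `map`, `count`, `collect`, `partitions`) are used.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper, not part of the talk's demo code
object RddBasics {
  // Returns (partition count, element count of the source RDD, doubled values)
  def run(): (Int, Long, Seq[Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-basics")
    val sc = new SparkContext(conf)
    try {
      // Ask for 4 partitions explicitly; each partition can be processed independently
      val numbers = sc.parallelize(1 to 8, 4)
      // map is a transformation: it describes a NEW RDD and leaves `numbers` untouched.
      // Nothing is computed at this point.
      val doubled = numbers.map(_ * 2)
      // count and collect are actions: they trigger the actual distributed work
      (numbers.partitions.length, numbers.count(), doubled.collect().toSeq)
    } finally {
      sc.stop()
    }
  }
}
```

Because `numbers` is never modified, Spark can recompute a lost partition of `doubled` from its lineage, which is where the "resilient" part comes from.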
ELASTIC-SPARK Part of ELASTICSEARCH-HADOOP – connects Spark with Elastic
• Support for READ and WRITE
• Support for SQL
• RDD partitioning and ES shards work together…
• PARTITION PER SHARD
• PARTITION / SHARD LOCALITY PREFERRED
DEMOS
Using road safety data from https://data.gov.uk/
• Accidents
• Casualties
Scala language – expressive and natural fit for parallel workloads / RDD (but Java, Python etc. also available).
DEMO 1: LOAD CSV
Index from CSV – 1) Setup SparkContext
//Spark Core
import org.apache.spark.{SparkConf, SparkContext}
//ElasticSearch-Spark
import org.elasticsearch.spark._
//Configure and create a Spark context for our work
def InitialiseSparkContext: SparkContext = {
  val sparkConfig = new SparkConf()
    .setMaster("local[4]") //Run locally with 4 worker threads - can scale out easily later
    .setAppName("Accident data loader") //Friendly job/app name
    .set("es.index.auto.create", "true") //Optional job/app level ES settings
  new SparkContext(sparkConfig)
}
Index from CSV – 2) Load from file and prepare
// sc is the SparkContext created in step 1; sourceFileName is the path to the CSV file
// Read the CSV file (distributed / parallel iterable set of file lines)
val csvRdd = sc.textFile(sourceFileName)
// Split and clean ALL the text file rows -> iterable set of Array[String]
val headerAndRowsRdd = csvRdd.map(line => line.split(",").map(_.trim))
// Get headers - first() returns a plain Array[String] to the driver (not an RDD); it is shipped to the workers inside the closures below
val headers = headerAndRowsRdd.first()
// Create a set of all data *except* header (drops any row whose first field equals the first header name)
val rowDataRdd = headerAndRowsRdd.filter(_(0) != headers(0))
// Zip together headers and data into an iterable set of Maps (e.g. key->value)
val finalMapsRdd = rowDataRdd.map(rowValues => headers.zip(rowValues).toMap)
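To make the shape of the resulting records concrete, here is a minimal sketch of the same split / drop-header / zip logic on an ordinary Scala collection, with no Spark involved. The object name `CsvRecords` and the sample column names are hypothetical. One deliberate difference from the slide: it calls `split(",", -1)`, because a plain `split(",")` silently drops trailing empty fields, which would leave the last columns out of the zipped Map. Like the slide code, it does not handle quoted fields that contain commas.

```scala
// Hypothetical helper, not part of the talk's demo code
object CsvRecords {
  // Turn CSV lines (header line first) into one Map per data row
  def toMaps(lines: Seq[String]): Seq[Map[String, String]] = {
    // limit -1 keeps trailing empty fields
    val rows = lines.map(_.split(",", -1).map(_.trim))
    val header = rows.head
    rows
      .filter(row => row(0) != header(0)) // drop the header row, same test as the slide
      .map(row => header.zip(row).toMap)  // pair each header name with its value
  }
}
```

Each Map becomes one document when saved, with the header names as field names.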
Index from CSV – 3) Save to ES
//Finally, save it - the full pipeline only executes at this line (first() above already ran a small job to fetch the header)
finalMapsRdd.saveToEs(s"$destinationIndex/$destinationType")
DEMO 2: RE-INDEX A TYPE
Create target index and mappings as required and then…
sc.esRDD(s"$sourceIndex/$sourceType").saveToEs(s"$destIndex/$destType")
Or, alternatively, take more control…
//Load data
val sourceRdd = sc.esRDD(s"$sourceIndex/$sourceType")
//Save to ES, extracting parent ID from source data
sourceRdd.saveToEs(s"$destIndex/$destType", Map("es.mapping.parent" -> "Accident_Index"))
DEMO 3: SQL / JOIN
1 - Create a SQLContext from a SparkContext…
2 - Declare tables
import org.apache.spark.sql.SQLContext
// sc = existing SparkContext
val sqlContext = new SQLContext(sc)
sqlContext.sql(
"CREATE TEMPORARY TABLE accident " +
"USING org.elasticsearch.spark.sql " +
s"OPTIONS (resource '${indexName}_reindex/accident', pushdown 'true')")
sqlContext.sql(
"CREATE TEMPORARY TABLE casualty " +
"USING org.elasticsearch.spark.sql " +
s"OPTIONS (resource '${indexName}_reindex/casualty', pushdown 'true')")
DEMO 3: SQL / JOIN
3 – Query the tables into a DataFrame (effectively a SQL RDD)
4 – Collect results on client (be careful – don’t collect HUGE datasets!)
val joinedDataFrame = sqlContext.sql(
"""SELECT
| `1st_Road_Class`,
| COUNT(a.Accident_Index) as count_Casualty,
| AVG(c.Age_of_Casualty) as avg_Age_of_Casualty,
| AVG(c.Casualty_Severity) as avg_Casualty_Severity
|FROM accident a
|INNER JOIN casualty c
| ON c.Accident_Index = a.Accident_Index
|WHERE c.Casualty_Severity < 3
|GROUP BY `1st_Road_Class`
|ORDER BY `1st_Road_Class`""".stripMargin
)
//Collect pulls the final result data back from the workers to the client
joinedDataFrame.collect().foreach(println)
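One way to heed the warning about collecting huge datasets is to bound what comes back to the client with `take(n)` instead of `collect()`. The sketch below assumes the Spark 1.5 API listed under VERSIONS and uses tiny in-memory tables instead of Elasticsearch so it runs without a cluster. The case classes, the simplified `Road_Class` column, the sample rows and the object name `BoundedCollect` are all invented for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}

// Hypothetical, heavily simplified row types (not the real road safety schema)
case class Accident(Accident_Index: String, Road_Class: Int)
case class Casualty(Accident_Index: String, Casualty_Severity: Int)

object BoundedCollect {
  def run(): Array[Row] = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("bounded-collect"))
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._ // enables .toDF() on RDDs of case classes

      // Register two small in-memory tables in place of the ES-backed ones
      sc.parallelize(Seq(Accident("A1", 3), Accident("A2", 6)))
        .toDF().registerTempTable("accident")
      sc.parallelize(Seq(Casualty("A1", 2), Casualty("A1", 3), Casualty("A2", 1)))
        .toDF().registerTempTable("casualty")

      val joined = sqlContext.sql(
        """SELECT a.Road_Class AS road_class, COUNT(*) AS casualties
          |FROM accident a
          |INNER JOIN casualty c
          |  ON c.Accident_Index = a.Accident_Index
          |WHERE c.Casualty_Severity < 3
          |GROUP BY a.Road_Class
          |ORDER BY road_class""".stripMargin)

      // take(n) returns at most n rows to the driver, however large the result is
      joined.take(10)
    } finally {
      sc.stop()
    }
  }
}
```

An aggregate such as the GROUP BY above is usually small enough to collect in full; the bound matters most when pulling back un-aggregated rows.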
VERSIONS
• Elastic 1.7.2
• Apache Spark 1.5
• Scala 2.11.7
• ElasticSearch-Spark v2.2.0-m1
//SBT
libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.5.0")
libraryDependencies += ("org.apache.spark" %% "spark-sql" % "1.5.0")
libraryDependencies += ("org.elasticsearch" %% "elasticsearch-spark" % "2.2.0-m1")