integrating elastic and apache spark - elastic london meetup (2015-09-24)

Neil Andrassy – CTO at The Filter

@andrassy

24 September 2015 @ Elastic London Meetup

THE FILTER

IN MEDIA AND IN RETAIL, WE AIM TO UNDERSTAND…

• HOW STUFF RELATES TO OTHER STUFF

• HOW PEOPLE RELATE TO STUFF

GIVING USERS THE RIGHT “STUFF” AT THE RIGHT TIME…

• ALTERNATIVE PRODUCTS

• COHERENT PERSONALISED PLAYLISTS

• PRODUCTS YOU MIGHT LIKE

• CONTENT RELATED TO THIS PRODUCT

• RELEVANT NEWS

• ….AND MANY MORE

OUR HERITAGE

“The Filter is like a zen master, who knows me, knows what I am interested in, knows what’s out there and gives me what is relevant at the time that I really want it in the most appropriate way.”

OUR CHALLENGES

PERSONALISATION IS HARD…

• READ SCALABILITY

• WRITE SCALABILITY (REALTIME PERSONALISATION FOR THE INDIVIDUAL)

• AVAILABILITY / FAULT TOLERANCE

MACHINE LEARNING IS HARD…

• DATA HUNGRY

• VOLUME – VELOCITY – VARIETY

• ML PROCESSES ARE RESOURCE INTENSIVE

MULTI-TENANCY IS HARD

• EVERY CATALOGUE IS UNIQUE / DIFFERENT

OUR DATA JOURNEY (pre Spark)

2004 – MS SQL

2011 – MS SQL + MONGODB

2012 – ELASTIC + MS SQL

ELASTIC good….

• STRUCTURED DATA

• UNSTRUCTURED DATA

• TIME-SERIES DATA

• READ SCALABILITY

• WRITE SCALABILITY

• SUPPORT FOR FAILURE

• EASY MANAGEMENT

• ….

ELASTIC not so good….

• DATA PROCESSING

• ETL

• BATCH

• STREAMS

• MACHINE LEARNING

• GRAPH

• ….

OUR DATA JOURNEY (continued)

2004 – MS SQL

2011 – MS SQL + MONGODB

2012 – ELASTIC + MS SQL

2014 – ELASTIC + SPARK

APACHE SPARK is…. A fast and general purpose engine for large-scale data processing

• SPEEDY – faster than Hadoop MapReduce

• EASY TO USE API – Java, Scala, Python, R

• SCALABLE – makes clustered operation transparent

• FLEXIBLE / POWERFUL COMPONENTS

• CORE

• SQL

• STREAMING

• MLLIB

• GRAPHX

• ***ELASTICSEARCH-SPARK***

https://spark.apache.org

CLIENT, MASTER, WORKER

• CLIENT (DRIVER) – submits a job to the MASTER

• MASTER (MANAGER) – co-ordinates the job with the WORKERS

• WORKERS – “do” the actual work/tasks (on RDDs)

• Ideally co-locate these on ES data nodes

• Workers manage local executors (per app)

https://spark.apache.org
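As a rough illustration of the client / master / worker split, the sketch below picks the master URL at start-up: a standalone master if one is configured, otherwise local mode. The environment variable name DEMO_SPARK_MASTER and the example host are assumptions made for this sketch, not something from the talk or from Spark itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MasterSelection {
  // Hypothetical environment variable name, used only in this sketch.
  val MasterEnvVar = "DEMO_SPARK_MASTER"

  // Use a configured master (e.g. "spark://master-host:7077"),
  // falling back to local mode with 4 worker threads.
  def masterUrl(env: Map[String, String]): String =
    env.getOrElse(MasterEnvVar, "local[4]")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster(masterUrl(sys.env))
      .setAppName("Master selection sketch")

    // The process that creates the SparkContext is the client (driver).
    val sc = new SparkContext(conf)
    try {
      // A trivial job: the summing tasks run on the executors.
      println(sc.parallelize(1 to 100).sum())
    } finally {
      sc.stop()
    }
  }
}
```

The same application code runs unchanged in both modes; only the master URL differs.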

RDD Resilient Distributed Dataset

• An IMMUTABLE collection of data elements

• PARTITIONED for distributed processing (think SHARD)

• RESILIENT for failure tolerance / recovery

IMMUTABLE + PARTITIONED

PARALLELIZABLE + SCALABLE
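The two properties above can be seen in a minimal local sketch, assuming Spark 1.5 on the classpath. The object name and the sample numbers are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  // Returns (partition count, original elements, doubled elements).
  def demo(sc: SparkContext): (Int, Seq[Int], Seq[Int]) = {
    // PARTITIONED: explicitly ask for 4 partitions.
    val numbers = sc.parallelize(1 to 8, numSlices = 4)

    // IMMUTABLE: map does not change `numbers`; it describes a new RDD.
    val doubled = numbers.map(_ * 2)

    (numbers.partitions.length, numbers.collect().toSeq, doubled.collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("RDD basics sketch")
    val sc = new SparkContext(conf)
    try {
      val (partitionCount, original, doubled) = demo(sc)
      println(s"partitions = $partitionCount")
      println(s"original   = ${original.mkString(",")}")
      println(s"doubled    = ${doubled.mkString(",")}")
    } finally {
      sc.stop()
    }
  }
}
```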

ELASTIC-SPARK Part of ELASTICSEARCH-HADOOP – connects Spark with Elastic

• Support for READ and WRITE

• Support for SQL

• RDD partitioning and ES shards work together…

• PARTITION PER SHARD

• PARTITION / SHARD LOCALITY PREFERRED
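A hedged sketch of how the partition/shard relationship might be inspected, assuming an Elasticsearch node on localhost:9200 and an existing index. The resource name roadsafety/accident is hypothetical. es.nodes and es.port are the connector's connection settings.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object ShardPartitions {
  // Build a SparkConf that points the connector at a given ES node.
  def esConf(nodes: String, port: Int): SparkConf =
    new SparkConf()
      .setMaster("local[4]")
      .setAppName("Shard/partition sketch")
      .set("es.nodes", nodes)
      .set("es.port", port.toString)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(esConf("localhost", 9200))
    try {
      // Hypothetical index/type; replace with one that exists.
      val accidents = sc.esRDD("roadsafety/accident")
      // Per the slide, this is expected to match the number of shards read.
      println(s"partitions: ${accidents.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```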

Using road safety data from https://data.gov.uk/

• Accidents

• Casualties

Scala language – expressive and natural fit for parallel workloads / RDD (but Java, Python etc. also available).

DEMO 1: LOAD CSV

Index from CSV – 1) Setup SparkContext

//Spark Core

import org.apache.spark.{SparkConf, SparkContext}

//ElasticSearch-Spark

import org.elasticsearch.spark._

//Configure and create a Spark context for our work

def InitialiseSparkContext: SparkContext = {
  val sparkConfig = new SparkConf()
    .setMaster("local[4]") //Run locally with 4 worker threads - can scale out easily later
    .setAppName("Accident data loader") //Friendly job/app name
    .set("es.index.auto.create", "true") //Optional job/app level ES settings

  new SparkContext(sparkConfig)
}

Index from CSV – 2) Load from file and prepare

// Read the CSV file (distributed / parallel iterable set of file lines)

val csvRdd = sc.textFile(sourceFileName)

// Split and clean ALL the text file rows -> iterable set of Array[String]

val headerAndRowsRdd = csvRdd.map(line => line.split(",").map(_.trim))

// Get headers – first() returns the first row as a local Array[String] (not an RDD), shipped to partitions / workers as needed

val headerRdd = headerAndRowsRdd.first()

// Create a set of all data *except* header

val rowDataRdd = headerAndRowsRdd.filter(_(0) != headerRdd(0))

// Zip together headers and data into a iterable set of Maps (e.g. key->value)

val finalMapsRdd = rowDataRdd.map(rowValues => headerRdd.zip(rowValues).toMap)
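To show what the maps look like, here is the same split / filter / zip logic applied to a plain local Seq instead of an RDD. It is a sketch only: the sample rows are made up, and the naive split(",") does not handle quoted fields that contain commas.

```scala
object CsvToMaps {
  // Same steps as the RDD pipeline, on a local collection.
  // Assumes `lines` is non-empty and the first line is the header.
  def toMaps(lines: Seq[String]): Seq[Map[String, String]] = {
    val headerAndRows = lines.map(_.split(",").map(_.trim))
    val header = headerAndRows.head
    // Drop the header row by comparing the first column with the first header name.
    val rows = headerAndRows.filter(_(0) != header(0))
    // Pair each header name with the value in the same position.
    rows.map(values => header.zip(values).toMap)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, for illustration only.
    val sample = Seq(
      "Accident_Index, Accident_Severity, Number_of_Casualties",
      "A001, 3, 1",
      "A002, 2, 4")
    toMaps(sample).foreach(println)
  }
}
```

Because the RDD API mirrors the Scala collections API, the body of toMaps is almost line-for-line what runs per partition in the Spark version.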

Index from CSV – 3) Save to ES

//Finally, save it - no work *actually* executes until this line...

finalMapsRdd.saveToEs(s"$destinationIndex/$destinationType")
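The "no work executes until this line" behaviour can be observed with a small local sketch. It uses a count() action in place of saveToEs so that it runs without Elasticsearch, and the Spark 1.x accumulator API that matches the versions used in this talk. The sample lines are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluation {
  // Returns the number of rows processed before and after an action runs.
  def countsBeforeAndAfter(sc: SparkContext): (Int, Int) = {
    val processed = sc.accumulator(0, "rows processed")
    val lines = sc.parallelize(Seq("a,b", "1,2", "3,4"))

    // Transformation: only recorded, nothing runs yet.
    val split = lines.map { line => processed += 1; line.split(",") }
    val before = processed.value

    // Action: this is what actually triggers the job.
    split.count()
    val after = processed.value

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("Lazy evaluation sketch")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = countsBeforeAndAfter(sc)
      println(s"processed before action: $before, after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```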

DEMO 2: RE-INDEX A TYPE

Create target index and mappings as required and then…

sc.esRDD(s"$sourceIndex/$sourceType").saveToEs(s"$destIndex/$destType")

Or, alternatively, take more control…

//Load data

val sourceRdd = sc.esRDD(s"$sourceIndex/$sourceType")

//Save to ES, extracting parent ID from source data

sourceRdd.saveToEs(s"$destIndex/$destType",Map("es.mapping.parent" -> "Accident_Index"))
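One further variant, offered as a sketch rather than as what was shown in the demo: esRDD yields (document id, fields) pairs, and the connector's saveToEsWithMeta uses the key of each pair as document metadata, so the original ids can be carried over. The index and type names below are hypothetical, and a running Elasticsearch node is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object ReindexKeepingIds {
  // Builds an "index/type" resource string, as used throughout the demos.
  def resource(index: String, docType: String): String = s"$index/$docType"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("Re-index keeping ids sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical source and destination names.
      val source = sc.esRDD(resource("roadsafety", "accident"))
      // The key of each pair (the source document id) becomes the target document id.
      source.saveToEsWithMeta(resource("roadsafety_reindex", "accident"))
    } finally {
      sc.stop()
    }
  }
}
```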

DEMO 3: SQL / JOIN

1 – Create a SQLContext from a SparkContext…

2 – Declare tables

import org.apache.spark.sql.SQLContext

// sc = existing SparkContext

val sqlContext = new SQLContext(sc)

sqlContext.sql(

"CREATE TEMPORARY TABLE accident " +

"USING org.elasticsearch.spark.sql " +

s"OPTIONS (resource '${indexName}_reindex/accident', pushdown 'true')")

sqlContext.sql(

"CREATE TEMPORARY TABLE casualty " +

"USING org.elasticsearch.spark.sql " +

s"OPTIONS (resource '${indexName}_reindex/casualty', pushdown 'true')")

DEMO 3: SQL / JOIN (continued)

3 – Query the tables into a DataFrame (effectively a SQL RDD)

4 – Collect results on client (be careful – don’t collect HUGE datasets!)

val joinedDataFrame = sqlContext.sql(

"""SELECT

| `1st_Road_Class`,

| COUNT(a.Accident_Index) as count_Casualty,

| AVG(c.Age_of_Casualty) as avg_Age_of_Casualty,

| AVG(c.Casualty_Severity) as avg_Casualty_Severity

|FROM accident a

|INNER JOIN casualty c

| ON c.Accident_Index = a.Accident_Index

|WHERE c.Casualty_Severity < 3

|GROUP BY `1st_Road_Class`

|ORDER BY `1st_Road_Class`""".stripMargin)

//Collect pulls the final result data back from the workers to the client

joinedDataFrame.collect().foreach(println)
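When the result might be large, one cautious option is to bring back only the first few rows with take(n) instead of collect(). The sketch below runs locally against a small in-memory table so that it needs no Elasticsearch. The Casualty case class, the table name and the sample rows are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Made-up row type, for illustration only.
case class Casualty(Accident_Index: String, Casualty_Severity: Int)

object BoundedCollect {
  // Returns at most `n` accident ids with severity < 3, in id order.
  def firstSevere(sqlContext: SQLContext, n: Int): Array[String] = {
    import sqlContext.implicits._

    val casualties = Seq(
      Casualty("A001", 1),
      Casualty("A002", 3),
      Casualty("A003", 2),
      Casualty("A004", 2)).toDF()
    casualties.registerTempTable("casualty_sample")

    val severe = sqlContext.sql(
      """SELECT Accident_Index
        |FROM casualty_sample
        |WHERE Casualty_Severity < 3
        |ORDER BY Accident_Index""".stripMargin)

    // take(n) only brings n rows back to the client, unlike collect().
    severe.take(n).map(_.getString(0))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("Bounded collect sketch")
    val sc = new SparkContext(conf)
    try {
      firstSevere(new SQLContext(sc), 2).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```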

VERSIONS

• Elastic 1.7.2

• Apache Spark 1.5

• Scala 2.11.7

• ElasticSearch-Spark v2.2.0-m1

libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.5.0")

libraryDependencies += ("org.apache.spark" %% "spark-sql" % "1.5.0")

libraryDependencies += ("org.elasticsearch" %% "elasticsearch-spark" % "2.2.0-m1")
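Put together, a minimal build.sbt for these versions might look like the sketch below. The project name and version are hypothetical, and the slides do not say whether the milestone connector artifact needs an extra resolver.

```scala
// build.sbt (sketch)

name := "elastic-spark-demo" // hypothetical project name

version := "0.1.0" // hypothetical project version

scalaVersion := "2.11.7"

// %% appends the Scala binary version (here _2.11) to each artifact name.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0",
  "org.apache.spark" %% "spark-sql" % "1.5.0",
  "org.elasticsearch" %% "elasticsearch-spark" % "2.2.0-m1"
)
```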
