before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images...

145
Before we begin, visit https://github.com/ radanalyticsio/workshop to download images and code!

Upload: others

Post on 17-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Before we begin, visit https://github.com/radanalyticsio/workshop to download images and code!

Page 2: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Insightful Apps with Apache Spark and OpenShift

William Benton (@willb) Michael McCune (@FOSSJunkie)

Page 3: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

ForecastIntroducing insightful apps

Learning from data

Meet Apache Spark

Hands-on: data engineering and machine learning in Spark and building an insightful application in OpenShift

Page 4: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

PreliminariesMake sure you have OpenShift Origin installed (if you want to build an app) or at least Docker (if you just want to try out Apache Spark)

Pull all of the necessary images for the hands-on portion

Details here: https://github.com/radanalyticsio/workshop

Page 5: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Introducing insightful apps

Page 6: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Insightful applicationsInsightful applications collect and learn from data that users generate and provide in order to work better with longevity and popularity.

Almost every exciting or important contemporary app is insightful!

Page 7: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 8: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 9: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 10: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 11: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 12: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 13: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Learning from data

Page 14: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

BASIC CONCEPTS

Page 15: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

def classify(bike): if bar_type(bike) == "flat": if tire_width(bike) > 80: return "winter bike" if tire_width(bike) > 50 or has_suspension(bike): return "mountain bike" if frame_type(bike) == "step-through": return "city bike" elif bar_type(bike) == "drop": if tire_width(bike) <= 27: return "road bike" if tire_type(bike) == "knobby": return "cyclocross bike" return "touring bike" return "unknown bike"

Page 16: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

def classify(bike): if bar_type(bike) == "flat": if tire_width(bike) > 80: return "winter bike" if tire_width(bike) > 50 or has_suspension(bike): return "mountain bike" if frame_type(bike) == "step-through": return "city bike" elif bar_type(bike) == "drop": if tire_width(bike) <= 27: return "road bike" if tire_type(bike) == "knobby": return "cyclocross bike" return "touring bike" return "unknown bike"

road bike

cyclocross bike

touring bike

mountain bike

Page 17: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

def classify(bike): if bar_type(bike) == "flat": if tire_width(bike) > 80: return "winter bike" if tire_width(bike) > 50 or has_suspension(bike): return "mountain bike" if frame_type(bike) == "step-through": return "city bike" elif bar_type(bike) == "drop": if tire_width(bike) <= 27: return "road bike" if tire_type(bike) == "knobby": return "cyclocross bike" return "touring bike" return "unknown bike"

road bike

cyclocross bike

touring bike

mountain bike

Page 18: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

def classify(bike): if bar_type(bike) == "flat": if tire_width(bike) > 80: return "winter bike" if tire_width(bike) > 50 or has_suspension(bike): return "mountain bike" if frame_type(bike) == "step-through": return "city bike" elif bar_type(bike) == "drop": if tire_width(bike) <= 27: return "road bike" if tire_type(bike) == "knobby": return "cyclocross bike" return "touring bike" return "unknown bike"

road bike

cyclocross bike

touring bike

mountain bike

triathlon bike

Page 19: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

Page 20: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

mountain bike 0 1 60 1 1 1

Page 21: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

mountain bike 0 1 60 1 1 1

LABEL

Page 22: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

mountain bike 0 1 60 1 1 1

HANDLEBAR TYPELABEL

Page 23: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

mountain bike 0 1 60 1 1 1

HANDLEBAR TYPE

DROP FLATLABEL

Page 24: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

mountain bike 0 1 60 1 1 1

HANDLEBAR TYPE

DROP FLATTIRE SIZE

LABEL

Page 25: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

mountain bike 0 1 60 1 1 1

HANDLEBAR TYPE

DROP FLATTIRE SIZE

TIRE KNOBS

LABEL

Page 26: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

mountain bike 0 1 60 1 1 1

HANDLEBAR TYPE

DROP FLATTIRE SIZE

SUSPENSION?

TIRE KNOBS

LABEL

Page 27: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

mountain bike 0 1 60 1 1 1

HANDLEBAR TYPE

DROP FLATTIRE SIZE

SUSPENSION?

TIRE KNOBS

FRONT REARLABEL

Page 28: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

mountain bike 0 1 60 1 1 1

HANDLEBAR TYPE

DROP FLATTIRE SIZE

SUSPENSION?

TIRE KNOBS

FRONT REARLABEL

Page 29: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature engineering

mountain bike 0 1 60 1 1 1

HANDLEBAR TYPE

DROP FLATTIRE SIZE

SUSPENSION?

TIRE KNOBS

FRONT REARLABEL

cyclocross bike 1 0 33 1 0 0

Page 30: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

one-hot encoding

mountain bike 0 1 60 1 1 1

HANDLEBAR TYPE

DROP FLATTIRE SIZE

SUSPENSION?

TIRE KNOBS

FRONT REARLABEL

cyclocross bike 1 0 33 1 0 0

(convert from a categorical feature with n values to an n-bit vector)

Page 31: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

value scaling

mountain bike 0 1 0.35 1 1 1

HANDLEBAR TYPE

DROP FLATTIRE SIZE

SUSPENSION?

TIRE KNOBS

FRONT REARLABEL

cyclocross bike 1 0 0.13 1 0 0

(assuming that all tires are between 19mm and 130mm wide)

Page 32: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Approximation techniques

Page 33: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Approximation techniques

Page 34: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Approximation techniques

Page 35: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Approximation techniques

Page 36: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature hashing

0 0 0 0 0 … 0 0 0 0 0

A a aa aal aalii zythem Zythia zythum Zyzomys Zyzzogeton

Page 37: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Feature hashingdef hash_bucket(s): """ Assumes the existence of an external hash function. Returns a tuple of * a bucket (from 0-127, inclusive) and * a sign value (either +1 or -1). """ raw_hash = my_hash(s) & 0xFF sign = (raw_hash & 0x80) != 0 and -1 or 1 bucket = raw_hash & ~0x80 return (bucket, sign)

Page 38: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

"the" → (37, 1) "quick" → (121, -1) "brown" → (50, -1)

Feature hashingdef hash_bucket(s): """ Assumes the existence of an external hash function. Returns a tuple of * a bucket (from 0-127, inclusive) and * a sign value (either +1 or -1). """ raw_hash = my_hash(s) & 0xFF sign = (raw_hash & 0x80) != 0 and -1 or 1 bucket = raw_hash & ~0x80 return (bucket, sign)

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Page 39: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

"the" → (37, 1) "quick" → (121, -1) "brown" → (50, -1) "fox" → (71, 1)

Feature hashingdef hash_bucket(s): """ Assumes the existence of an external hash function. Returns a tuple of * a bucket (from 0-127, inclusive) and * a sign value (either +1 or -1). """ raw_hash = my_hash(s) & 0xFF sign = (raw_hash & 0x80) != 0 and -1 or 1 bucket = raw_hash & ~0x80 return (bucket, sign)

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0

Page 40: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

"the" → (37, 1) "quick" → (121, -1) "brown" → (50, -1) "fox" → (71, 1) "jumps" → (39, 1)

Feature hashingdef hash_bucket(s): """ Assumes the existence of an external hash function. Returns a tuple of * a bucket (from 0-127, inclusive) and * a sign value (either +1 or -1). """ raw_hash = my_hash(s) & 0xFF sign = (raw_hash & 0x80) != 0 and -1 or 1 bucket = raw_hash & ~0x80 return (bucket, sign)

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0

Page 41: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

"the" → (37, 1) "quick" → (121, -1) "brown" → (50, -1) "fox" → (71, 1) "jumps" → (39, 1) "over" → (100, -1) "the" → (37, 1)

Feature hashingdef hash_bucket(s): """ Assumes the existence of an external hash function. Returns a tuple of * a bucket (from 0-127, inclusive) and * a sign value (either +1 or -1). """ raw_hash = my_hash(s) & 0xFF sign = (raw_hash & 0x80) != 0 and -1 or 1 bucket = raw_hash & ~0x80 return (bucket, sign)

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0

0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0

Page 42: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

"the" → (37, 1) "quick" → (121, -1) "brown" → (50, -1) "fox" → (71, 1) "jumps" → (39, 1) "over" → (100, -1) "the" → (37, 1) "lazy" → (120, -1) "dog" → (54, 1)

Feature hashingdef hash_bucket(s): """ Assumes the existence of an external hash function. Returns a tuple of * a bucket (from 0-127, inclusive) and * a sign value (either +1 or -1). """ raw_hash = my_hash(s) & 0xFF sign = (raw_hash & 0x80) != 0 and -1 or 1 bucket = raw_hash & ~0x80 return (bucket, sign)

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0

0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0

Page 43: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

"the" → (37, 1) "quick" → (121, -1) "brown" → (50, -1) "fox" → (71, 1) "jumps" → (39, 1) "over" → (100, -1) "the" → (37, 1) "lazy" → (120, -1) "dog" → (54, 1)

Feature hashingdef hash_bucket(s): """ Assumes the existence of an external hash function. Returns a tuple of * a bucket (from 0-127, inclusive) and * a sign value (either +1 or -1). """ raw_hash = my_hash(s) & 0xFF sign = (raw_hash & 0x80) != 0 and -1 or 1 bucket = raw_hash & ~0x80 return (bucket, sign)

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0

0 0 -1 0 0 0 1 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 -1 -1 0 0 0 0 0 0

Page 44: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

CLASSIFICATION

Page 45: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 46: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 47: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 48: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 49: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 50: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 51: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 52: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

CLUSTERING

Page 53: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 54: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 55: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

RECOMMENDATION

Page 56: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 57: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

?

Page 58: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

OUTLIER DETECTION

Page 59: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 60: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 61: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

map tiles by Stamen Design (CC-BY 3.0) • map data © OpenStreetMap

Page 62: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

map tiles by Stamen Design (CC-BY 3.0) • map data © OpenStreetMap

Page 63: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

map tiles by Stamen Design (CC-BY 3.0) • map data © OpenStreetMap

Page 64: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

map tiles by Stamen Design (CC-BY 3.0) • map data © OpenStreetMap

Page 65: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

map tiles by Stamen Design (CC-BY 3.0) • map data © OpenStreetMap

Page 66: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

UNDERSTANDING DATA WITH MANY DIMENSIONS

Page 67: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 68: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

[4,7]

Page 69: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

[4,7]

Page 70: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

[4,7] [2,3,5]

Page 71: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

[4,7] [2,3,5]

Page 72: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

[4,7] [2,3,5][7,1,6,5,12,8,9,2,2,4, 7,11,6,1,5]

Page 73: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

[4,7] [2,3,5][7,1,6,5,12,8,9,2,2,4, 7,11,6,1,5]

Page 74: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Similarity and distance

Page 75: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Similarity and distance

Page 76: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Similarity and distance

(q - p) • (q - p)

Page 77: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Similarity and distance

pi - qi

i=1

n

Page 78: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Similarity and distance

pi - qi

i=1

n

Page 79: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Similarity and distance

p • qp q

Page 80: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Similarity and distance

Page 81: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Similarity and distance

Page 82: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Similarity and distance

Page 83: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Similarity and distance

10

10

3=.7

Page 84: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Eliminating inessential features

mountain bike 0 1 0.35 1 1 1

HANDLEBAR TYPE

DROP FLATTIRE SIZE

SUSPENSION?

TIRE KNOBS

FRONT REARLABEL

cyclocross bike 1 0 0.13 1 0 0

Page 85: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Eliminating inessential features

mountain bike 0 1 0.35 1 1 1

HANDLEBAR TYPE

DROP FLATTIRE SIZE

SUSPENSION?

TIRE KNOBS

FRONT REARLABEL

cyclocross bike 1 0 0.13 1 0 0

Page 86: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Very simple: random projection0 0 0 1 1 0 1 0 1 0

0 0 1 0 0 0 1 1 0 0

1 0 1 1 0 1 0 0 0 0

0 0 0 0 0 0 1 1 0 1

0 1 0 0 1 0 0 1 0 0

1 0 0 0 0 1 0 1 1 0

0 0 1 0 1 0 1 0 0 0

0 1 0 0 0 1 0 0 1 1

0 0 0 0 1 0 0 1 0 1

1 1 0 0 0 0 0 0 0 1

Page 87: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Very simple: random projection0 0 0 1 1 0 1 0 1 0

0 0 1 0 0 0 1 1 0 0

1 0 1 1 0 1 0 0 0 0

0 0 0 0 0 0 1 1 0 1

0 1 0 0 1 0 0 1 0 0

1 0 0 0 0 1 0 1 1 0

0 0 1 0 1 0 1 0 0 0

0 1 0 0 0 1 0 0 1 1

0 0 0 0 1 0 0 1 0 1

1 1 0 0 0 0 0 0 0 1

0.13 0.13

0.06 0.07

0.07 0.06

0.02 0.08

0.17 0.11

0.11 0.09

0.04 0.18

0.13 0.04

0.13 0.21

0.14 0.03

*

Page 88: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Very simple: random projection0 0 0 1 1 0 1 0 1 0

0 0 1 0 0 0 1 1 0 0

1 0 1 1 0 1 0 0 0 0

0 0 0 0 0 0 1 1 0 1

0 1 0 0 1 0 0 1 0 0

1 0 0 0 0 1 0 1 1 0

0 0 1 0 1 0 1 0 0 0

0 1 0 0 0 1 0 0 1 1

0 0 0 0 1 0 0 1 0 1

1 1 0 0 0 0 0 0 0 1

0.13 0.13

0.06 0.07

0.07 0.06

0.02 0.08

0.17 0.11

0.11 0.09

0.04 0.18

0.13 0.04

0.13 0.21

0.14 0.03

* =

Page 89: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

A linear approach: PCA0 0 0 1 1 0 1 0 1 0

0 0 1 0 0 0 1 1 0 0

1 0 1 1 0 1 0 0 0 0

0 0 0 0 0 0 1 1 0 1

0 1 0 0 1 0 0 1 0 0

1 0 0 0 0 1 0 1 1 0

0 0 1 0 1 0 1 0 0 0

0 1 0 0 0 1 0 0 1 1

0 0 0 0 1 0 0 1 0 1

1 1 0 0 0 0 0 0 0 1

Page 90: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

A linear approach: PCA0 0 0 1 1 0 1 0 1 0

0 0 1 0 0 0 1 1 0 0

1 0 1 1 0 1 0 0 0 0

0 0 0 0 0 0 1 1 0 1

0 1 0 0 1 0 0 1 0 0

1 0 0 0 0 1 0 1 1 0

0 0 1 0 1 0 1 0 0 0

0 1 0 0 0 1 0 0 1 1

0 0 0 0 1 0 0 1 0 1

1 1 0 0 0 0 0 0 0 1

Page 91: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 92: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

A nonlinear approach: t-SNE

Page 93: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

A nonlinear approach: t-SNE

p( | )

Page 94: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

p( | )≈

A nonlinear approach: t-SNE

p( | )

Page 95: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 96: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Tree-based approaches

Page 97: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Tree-based approaches

yes

no

yes

no

if orange

if !orange

if red

if !red

if !gray

if !gray

Page 98: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Tree-based approaches

yes

no

yes

no

if orange

if !orange

if red

if !red

if !gray

if !gray

yes

no

no

yes

yes

no

yes

no

yes

no

no

yes

yes

no

yes

no

yes

no

no

yes

yes

no

yes

no

yes

no

no

yes

yes

no

yes

no

yes

no

no

yes

yes

no

yes

no

yes

no

no

yes

yes

no

yes

no

Page 99: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Tree-based approaches

yes

no

yes

no

if orange

if !orange

if red

if !red

if !gray

if !gray

yes

no

no

yes

yes

no

yes

no

yes

no

no

yes

yes

no

yes

no

yes

no

no

yes

yes

no

yes

no

yes

no

no

yes

yes

no

yes

no

yes

no

no

yes

yes

no

yes

no

yes

no

no

yes

yes

no

yes

no

Page 100: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Self-organizing maps

Page 101: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Self-organizing maps

Page 102: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Self-organizing maps

Page 103: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Self-organizing maps

Page 104: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Self-organizing maps

https://github.com/radanalyticsio/silex

Page 105: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Meet Apache Spark

Page 106: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

A FUNDAMENTAL ABSTRACTION, NOT AN EXECUTION MODEL

Page 107: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Resilient Distributed Datasets are partitioned, lazy, and immutable homogeneous collections.

Page 108: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)
Page 109: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

1 2 3

Page 110: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

1 2 3 λ x: x % 2 != 0

FILTER

Page 111: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

1 2 3 λ x: x % 2 != 0 λ x: x * 3

FILTER MAP

Page 112: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

1 2 3 λ x: x % 2 != 0 λ x: x * 3

FILTER MAP

λ x: [x, x+1]

FLATMAP

Page 113: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

3 λ x: x % 2 != 0 λ x: x * 3

FILTER MAP

λ x: [x, x+1]

FLATMAP

3 4 9 10COLLECT

Page 114: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

1 2 3

Page 115: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

1 2 3 λ x: x % 2 != 0

FILTER

Page 116: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

2 3 λ x: x % 2 != 0

FILTER

Page 117: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

2 3 λ x: x % 2 != 0 λ x: x * 3

FILTER MAP

Page 118: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

2 3 λ x: x % 2 != 0 λ x: x * 3

FILTER MAP

λ x: [x, x+1]

FLATMAP

Page 119: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

2 3 λ x: x % 2 != 0 λ x: x * 3

FILTER MAP

λ x: [x, x+1]

FLATMAP

3 4 9 10SAVE AS TEXT FILE

Page 120: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

2 3 λ x: x % 2 != 0 λ x: x * 3

FILTER MAP

λ x: [x, x+1]

FLATMAP

3 4 9 10SAVE AS TEXT FILE

CACHE

Page 121: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

executor1

1 2 3

executorn

10 11 12

cluster manager

driver

Page 122: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

executor1

1 2 3

executorn

10 11 12

cluster manager

2 4 6 20 22 24

λ x: x * 2 λ x: x * 2

driver

Page 123: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

executor1

1 2 3

executorn

10 11 12

cluster manager

2 4 6 20 22 24

λ x: x * 2 λ x: x * 2

driver

CACHECACHE

Page 124: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Example: word countfile = sc.textFile("file://...")

counts = file.flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)) .reduceByKey(lambda x, y: x + y)

# computation actually occurs here counts.saveAsTextFile("file://...")

Page 125: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Example: word countfile = sc.textFile("file://...")

counts = file.flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)) .reduceByKey(lambda x, y: x + y)

# computation actually occurs here counts.saveAsTextFile("file://...")

Page 126: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Example: word countfile = sc.textFile("file://...")

counts = file.flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)) .reduceByKey(lambda x, y: x + y)

# computation actually occurs here counts.saveAsTextFile("file://...")

Page 127: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Example: word countfile = sc.textFile("file://...")

counts = file.flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)) .reduceByKey(lambda x, y: x + y)

# computation actually occurs here counts.saveAsTextFile("file://...")

Page 128: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Example: word countfile = sc.textFile("file://...")

counts = file.flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)) .reduceByKey(lambda x, y: x + y)

# computation actually occurs here counts.saveAsTextFile("file://...")

Page 129: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Example: word countfile = sc.textFile("file://...")

counts = file.flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)) .reduceByKey(lambda x, y: x + y)

# computation actually occurs here counts.saveAsTextFile("file://...")

Page 130: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

BEYOND THE RDD

Page 131: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Spark core

Page 132: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Spark core

Graph SQL ML Streaming

Page 133: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Spark core

Graph SQL ML Streaming

ad hoc Mesos YARN

Page 134: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Spark core

Graph SQL ML Streaming

ad hoc Mesos YARNk8s

Page 135: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Machine learning with SparkSupport code for feature engineering and learning pipelines.

Many parallel implementations of classic algorithms for machine learning tasks: dimensionality reduction, classification, regression, clustering, recommendation engines, etc.

Parallel optimization primitives (gradient descent, etc.) and linear algebra to implement your own algorithms.

Page 136: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Streaming dataGoal: use the same abstraction for batch and “streaming” (micro-batch) data by dividing a stream into many small RDDs.

input stream

Page 137: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Streaming dataGoal: use the same abstraction for batch and “streaming” (micro-batch) data by dividing a stream into many small RDDs.

Streaming engine

input stream

Page 138: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Streaming dataGoal: use the same abstraction for batch and “streaming” (micro-batch) data by dividing a stream into many small RDDs.

Streaming engine

input streamwindowed data (RDDs)

Page 139: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Streaming dataGoal: use the same abstraction for batch and “streaming” (micro-batch) data by dividing a stream into many small RDDs.

Streaming engine Spark

input streamwindowed data (RDDs)

Page 140: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Streaming dataGoal: use the same abstraction for batch and “streaming” (micro-batch) data by dividing a stream into many small RDDs.

Streaming engine Spark

input streamwindowed data (RDDs)

processed data (RDDs)

Page 141: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Structured queriesThe capacity to run arbitrary code in RDDs is powerful but comes with an important tradeoff: Spark can’t rearrange RDD programs to improve their performance.

Writing Spark programs with a query DSL allows Spark to generate optimized execution plans.

Page 142: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Query planninghugeCollection .join(anotherHugeCollection) .filter(lambda (n, (a, b)): ultraRare(a) and ultraRare(b))

Page 143: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Query planning

hugeCollection.filter(lambda a: ultraRare(a)) .join(anotherHugeCollection.filter(lambda a: ultraRare(a)))

hugeCollection .join(anotherHugeCollection) .filter(lambda (n, (a, b)): ultraRare(a) and ultraRare(b))

Page 144: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Structured query in SparkSQL interface (unchecked syntax or semantics)SELECT word, COUNT(*) FROM words GROUP BY word

Data frame interface (semantics checked at run-time)words.groupBy('word').count()

Dataset interface (mostly checked at compile-time)

Page 145: Before we begin, visit radanalyticsio/workshop … · radanalyticsio/workshop to download images and code! Insightful Apps with Apache Spark and OpenShift William Benton (@willb)

Questions & hands-on exercises