best weather for goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · thinking in spark...

31
Best Weather for Best Weather for Goals? Goals? Willem Hendriks Data Scientist IBM [email protected] https://github.com/willemhendriks

Upload: others

Post on 28-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Best Weather for Best Weather for Goals?Goals?

Willem HendriksData Scientist IBM

[email protected]://github.com/willemhendriks

Page 2: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

“More data usually beats better algorithms”Anand Rajaraman (when teaching at Stanford)

http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

Old Skool StatisticsOld Skool Statistics Big Data!Big Data!

Nice read!

Page 3: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

In The News: No more Silent Movies!

Page 4: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Creative Jobs are at risk, not engineers!

Page 5: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Scary email this week...

Page 6: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Apache Spark™ is a fast and general engine for large-scale data processing.

Google Trends of “Apache Spark”

What is Spark?

Page 7: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Spark 1.6.1 Released (9 March 2016)

Page 8: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Anybody here did any Big-Data projects with Hadoop?

Any computation on clusters?

Page 9: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Why People love Spark

1 – Nice short snappy code (Python / Scala / R)

2 – Universal Connectors (Databases / Hadoop / Files / Messaging Systems)

3 – Interactive Shell

4 – Beautiful Library(Linear Algebra & ML)

Page 10: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Why Lissabon loves Spark?

1 – Nice short snappy code (Python / Scala / R)

2 – Universal Connectors (Databases / Hadoop / Files / Messaging Systems)

3 – Interactive Shell

4 – Beautiful Library(Linear Algebra & ML)

Madrid

Dresden

Amsterdam

Lisbon

Lisbon

Page 11: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

What can we do now,We could not do some years ago?

Why would a person/company invest in Spark Technology?

Page 12: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Spark & IBM

Page 13: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Key Components Spark

Page 14: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Key Components Spark

Spark Context

worker worker workerworker worker worker workerworker worker worker worker

It doesn't matter how you access Spark,Command line,Spark-submitOr notebook

Page 15: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Quiz!

With Apache Spark, Hadoop will be useless (soon)

True -or- False ?

Page 16: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Quiz!

Different Data-Sources (MySQL / Hadoop / Files / Cassandra)Requires Different Spark Programming Approaches

True -or- False ?

Page 17: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Quiz!

Scala is the Best Way to Spark,Python is for hipsters....R is for pirates....Java for old folks....

True -or- False ?

Page 18: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

More than 1 way to Spark

RDDs DataFrames

little slower little faster (language independent)

MLlib sparkML

chain of RDDs pipelines!Nice chain of transformations & fitters

Freedom (map/reduce) structured

Programmers SQL'er / R lovers

Page 19: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Thinking in SparkDon't think as Variables & Functions, but in

Transformations & Actions

A RDD is a pointer to your data, not the data itself

(Hadoop / File / Database)

A transformation on a RDD will create a new RDD, by creating a

new pointer (again no data)

It adds tubes & flasks, but still no data/fluid is flowing.

Not until an action is called upon an RDD, spark will not flow data through your chain of RDD's... (lazy evaluation)

After an action,

Spark lets the

fluid/data flow.

1 2

3

Page 20: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

How can we 'hold' some fluid/data inside a flask, to later

quickly use it?

rdd 1

rdd 2rdd 3

rdd 4rdd 5

Page 21: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

How can we 'hold' some fluid/data inside a flask, to later

quickly use it?

rdd 1

rdd 2rdd 3

rdd 4rdd 5

.persist() or

.cache()

Page 22: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Actions & Transformations

Transformations: Actions:

Page 23: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

RDD's

Contains Elements

Transformations & Actions

.persist()

Mllib to create models (Machine Learning)

Key/Value, a special case, needed for joins

Freedom (as a opposite to Data-Frames)

Page 24: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Double, Float 1.3

List [12, “Blue”, 3.03]

String “23;”AEFED”

Numpy Array Array(1, 4, 3, 3, 5, ...

Spark's Labelled Points

LabelledPoint( “GOOD”, “Chablis”, 12.3 )

Any (basic) python …...

(Key, Value) Special CaseKey = everythingValue = everything

[2,4,6]

[5,4,3]

[7,6,5]

[8,7,8]

[3,4,5]

[2,4,6]

[5,4,3]

[7,1,3]

[1,1,3]

[3,4,3]

[1,4,6]

[5,7,3]

[7,6,5]

[8,8,8]

[3,8,5]

Partition APartition A

Partition B

Partition C

Page 25: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer
Page 26: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer
Page 27: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Let's do some basic Spark'in together

Page 28: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

UEFA Demo

Page 29: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Let's take some Data!

Page 30: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

Let's take some Data!

Page 31: Best Weather for Goals?files.meetup.com/19103884/2_lissabon_meetup.pdf · Thinking in Spark Don't think as Variables & Functions, but in Transformations & Actions A RDD is a pointer

RDD's

What is are the best weather conditions for Portuguese teams?

Does the weather have influence on the number of goals in UEFA Champions League Matches?