Python Scalability: A Convenient Truth
Travis Oliphant, Continuum Analytics

Uploaded by continuum-analytics · Posted 05-Jan-2017 · Category: Data & Analytics

TRANSCRIPT

Page 1: Python Scalability: A Convenient Truth

Travis Oliphant
Continuum Analytics

Page 2

Overview

• Abstract overview
• A short tour through data science
• Understanding Scale
• The Zen of PyData

Page 3

DATA SCIENCE: A SHORT TOUR

Page 4

wordcount.py: 8 lines of code
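The wordcount script itself is not reproduced in the transcript; a minimal sketch of what an eight-line wordcount looks like (the function name and toy input are illustrative, not from the slides):

```python
from collections import Counter

def wordcount(lines):
    """Count words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Toy usage: count words in a tiny in-memory "file"
print(wordcount(["the quick brown fox", "the lazy dog"]).most_common(2))
```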

Page 5

Performance & Limitations

• For 1 million rows (45 MB CSV), takes ~3 sec on a MacBook
• … but CSVs are not where Python shines
• For a billion rows…?
• Also, data has to fit in memory
• For this problem, maybe we can be clever with itertools or toolz and streaming for out-of-core
• The lack of “easy” and “obvious” solutions for larger datasets is why people tend to believe that “Python doesn’t scale”

Page 6

Enter the Hadoops!

Hadoop decomposes into Map and Reduce steps:

From http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
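The linked tutorial's mapper/reducer pair can be sketched as plain functions (a simplified sketch: Hadoop Streaming's stdin/stdout, tab-separated plumbing is omitted, and the sort Hadoop performs between the steps is simulated with sorted()):

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map step: emit a (word, 1) pair for every word seen
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce step: pairs arrive sorted by key, so group and sum the counts
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

print(dict(reducer(mapper(["to be or not to be"]))))
# → {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```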

Page 7

Maybe Spark is an improvement?

Page 8

An Inconvenient Truth

• “Python” libraries in the Hadoop ecosystem are lipstick.py on an elephant in the room: Python in Hadoop is a second-class citizen outside of the most straightforward use cases
• The cognitive model is very different
• Access to libraries is extremely limited

Page 9

Query, Exploration, Modeling

• “Core” Hadoop is traditionally concerned with scalable, robust query performance
• Even though Spark and Hadoop enable scale-out, for many data scientists the usability for ad hoc exploration is dramatically lower than with Python or R because of the reduced interactivity
• In essence, the usability factors around scale-out mean that data scientists explore on a single machine, sometimes on a subset of the data, and then deploy not only onto a different machine (a cluster) but also have to recode into a different language or API

Page 10

UNDERSTANDING SCALE

Page 11

Scaling Up vs. Out

Page 12

Scaling Up

[Diagram: a single node (CPU, memory, disk) running C/C++, Python, and R “native code” on one operating system scales “up” to a larger single node with many CPUs (cpu0 … cpu32), SSDs (ssd0 … ssd16), and memory modules (dmm0 … dmm16), still running the same native-code stack on one operating system.]

Page 13

Scaling Out

[Diagram: many single nodes, each with its own CPU, memory, disk, and operating system, run Distributed Workers that together form a Virtual Distributed Computer (e.g. a Hadoop cluster), accessed through a Distributed Computing API.]

Page 14

Symmetry Ideal

[Diagram: the scale-up picture (one large node running C/C++, Python, R “native code”) side by side with the scale-out picture (a Virtual Distributed Computer of Distributed Workers behind a Distributed Computing API); the ideal is symmetry between “Scale Up” and “Scale Out”.]

Page 15

Programming Model

[Diagram: on a single node the programming model is C, C++, Python, R; on the Virtual Distributed Computer it is Java and Scala, with the “Distributed Computing API?” for native code left as an open question.]

Page 16

[Diagram: three Hadoop cluster stacks, each built from Java VMs on YARN over HDFS (disks, CPUs, memory). SQL reaches the first via Hive/Impala; Java, Python, and R reach the second via Hadoop Map-Reduce; Scala, Python, R, and SQL reach the third (Spark RDDs on the JVM) via Spark, PySpark, SparkR, and SparkSQL.]

Page 17

Who’s Afraid of Distributed Computing?

• “What’s hard about distributed computing is not space, but time”
• The difficulty of the problem scales with your temporal exposure
• The fewer nodes you have, the less complex distributed computing is
• So, optimizing per-node performance can be a lifesaver
• Additionally, the whole reason you use a distributed computer is performance, so performance is an aspect of correctness

Page 18

Why Is Python Nice?

• Core data structures, with nice functions and spelling
• Facilitates a wide variety of algorithms that interop
• Open access to a huge variety of existing libraries and algorithms
• Very easy to get high performance when you need it…
• … on a single machine

Page 19

Scaling Python

• Most Python data structures and supporting libraries (including C, C++, and FORTRAN ones) were designed for the single-node use case
• “Automagically” scaling out single-node (“unified memory”) code to a distributed machine is hard
• Making it robust is extremely hard to do with software
• “When software gets hard, we solve it with hardware”

Page 20

Scalable Primitives

• What if… we created a small set of distributed data structures for Python that have this property?
• Not everything from the single-node case would work
• But many things would
• And the spelling could at least be made congruent

Page 21

ANACONDA + HADOOP

Page 22

Hadoop/Spark Revisited

[Diagram: the Hadoop cluster stack (Java VMs, Spark RDDs, YARN, HDFS over disks, CPUs, and memory) decomposed into its single-node view: one node’s CPU(s), disk(s), and memory run three Java VMs — an HDFS DataNode, Spark, and a YARN Node Manager — which plug into the cluster-wide distributed APIs: HDFS, Spark RDD, and YARN.]

Page 23

[Diagram: the same single-node view, with Python attached to the node’s Spark JVM via PySpark.]

Page 24

[Diagram: the same single-node view, with C/C++, Python, and R “native code” running directly on the node beside the Java VMs.]

Page 25

Hadoop API

[Diagram: native code (C, C++, Python, etc.) on the node talks to the HDFS and YARN Java VMs directly, e.g. via hdfs.py, rather than going through JVM-centric APIs.]

Page 26

Hadoop API

[Diagram: the same picture, posing the question: what would Pythonic distributed computing on top of this look like?]

Page 27

Dask: Pythonic Parallelism

• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python

Core Ideas

• Dynamic task scheduling yields sane parallelism
• A simple library to enable parallelism
• dask.array/dask.dataframe to encapsulate the functionality
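The "dynamic task scheduling" idea can be sketched with dask.delayed, which wraps ordinary functions into lazy task-graph nodes (a minimal sketch of dask's public API; the toy functions are illustrative):

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Calling delayed functions builds a task graph; nothing has executed yet
total = add(inc(1), inc(2))

# compute() hands the graph to the scheduler, which can run the two
# independent inc() tasks in parallel before the final add()
print(total.compute())   # → 5
```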

Page 28

Simple Architecture

Page 29

Core Concepts

Page 30

dask.array: OOC, parallel, ND array

• Arithmetic: +, *, …
• Reductions: mean, max, …
• Slicing: x[10:, 100:50:-2]
• Fancy indexing: x[:, [3, 1, 2]]
• Some linear algebra: tensordot, qr, svd
• Parallel algorithms (approximate quantiles, topk, …)
• Slightly overlapping arrays
• Integration with HDF5
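A small sketch of the blocked-algorithm idea behind dask.array: the array is split into chunks, and reductions run per chunk before being combined, so the full array never needs to be materialized at once (array sizes here are arbitrary):

```python
import dask.array as da
import numpy as np

# A 2000x2000 array of ones, stored as sixteen 500x500 blocks
x = da.ones((2000, 2000), chunks=(500, 500))

# Lazy expression: builds a task graph over the blocks, computes nothing yet
y = (x + x.T).mean(axis=0)

# compute() runs the graph, block by block, potentially in parallel
result = y.compute()
print(result.shape)   # (2000,) — every entry is 2.0 for this input
```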

Page 31

dask.dataframe: OOC, parallel dataframe

• Elementwise operations: df.x + df.y
• Row-wise selections: df[df.x > 0]
• Aggregations: df.x.max()
• Groupby-aggregate: df.groupby(df.x).y.max()
• Value counts: df.x.value_counts()
• Drop duplicates: df.x.drop_duplicates()
• Join on index: dd.merge(df1, df2, left_index=True, right_index=True)

Page 32

DAG of Computation

Page 33: Python Scalability: A Convenient Truth

33http://continuum.io/blog/xray-dask

Scientific Big Data

Page 34

More Complex Graphs

[Graph: a Dask task graph for cross validation]

Page 35

LU Decomposition

import numpy as np
import scipy.linalg
import dask.array as da

# A is the NumPy array being factored; scipy gives the reference L and U
P, L, U = scipy.linalg.lu(A)

dA = da.from_array(A, (3, 3))
dP, dL, dU = da.linalg.lu(dA)
assert np.allclose(dL.compute(), L)
assert np.allclose(dU.compute(), U)
dL.visualize()

Page 36

Dask + Hadoop

[Diagram: the earlier “Pythonic distributed computing?” question answered with Dask: on each node a Dask worker (native code) sits beside the HDFS and YARN Java VMs, reading data via hdfs.py, while a Dask Scheduler coordinates the workers across the cluster.]

Page 37

NYC Taxi Dataset

http://matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2

• 50 GB total on disk, 120 GB in memory
• Creates ~400 sharded Pandas dataframes (~128 MB each)

Page 38

Familiar Programming Model

Q: For each payment_type, how many rides didn’t tip?

Page 39

Familiar Programming Model

Takes ~6 seconds

Tip % by day of week, and by hour of day

People are very generous at 4 a.m.

Page 40

Reddit Comments Dataset (240 GB)

Page 41

Easy Text Processing with NLTK

Page 42

Takeaways

• Pythonic spelling for parallel arrays and dataframes
• Direct access to Hadoop data, without paying the cost of JVM serialization, via the hdfs3 library
• Excited about Cloudera’s Arrow project, which will further improve direct data access
• Load data into distributed memory via persist(), in a fast and data-local fashion
• Direct scheduling with YARN via the knit library

Page 43

Not Only Hadoop!

[Diagram: a Dask Scheduler coordinating Dask workers (dask.bag, dask.array, dask.dataframe) over plain single nodes running native C/C++/Python code, with no Hadoop layer.]

• Dask works well with traditional distributed computing (Sun Grid Engine, IPython Parallel, etc.)
• Convenient dec2 library for easily provisioning on EC2
• Excellent for “embarrassingly parallel” numerical problems
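An "embarrassingly parallel" job, where every element is processed independently, can be sketched with dask.bag (the toy computation is illustrative; the same pattern applies to per-file or per-record work):

```python
import dask.bag as db

# Eight independent elements spread over four partitions; each map/filter
# runs per element with no communication between tasks
b = db.from_sequence(range(8), npartitions=4)
result = b.map(lambda x: x ** 2).filter(lambda x: x % 2 == 0).sum().compute()
print(result)   # 0 + 4 + 16 + 36 = 56
```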

Page 44

Scale Up and Out

• Same Pythonic programming model for scale-up and scale-out
• Works great with existing fast C, C++, and Python code
• Threads or processes
• Can stream data from disk, for out-of-core computation

[Diagram: on a single node, the Dask Scheduler drives several dask workers, each running native code against its own CPU, memory, and SSD (cpu0/dmm0/ssd0 … cpuN/dmmN/ssdN), under the dask.bag/dask.array/dask.dataframe APIs.]

Page 45

ZEN OF PYDATA

Page 46

Productivity at All Scales

• “Scaling out” is important for production and data processing
• “Scaling down” and “scaling up” are important for agile, fast iterations in data exploration and modeling
• Don’t compromise agility for the sake of future-proofing scalability
• There are real wins in being able to use the same language and API across scales

Page 47

About Me

• Travis Oliphant, @teoliphant
• CEO & co-founder, Continuum Analytics
• Ph.D., Mayo Clinic, in Biomedical Engineering
• B.S., M.S. BYU Mathematics & Electrical Eng.
• Open source contributor and leader since 1997
• Creator of NumPy and SciPy; started Numba
• Author, Guide to NumPy