Install Eclipse for Spark-Python



Philippe ROSSIGNOL: 2015/06/12

How to configure Eclipse in order to develop with Spark and Python

http://enahwe.blogspot.fr/p/philippe-rossignol-20150612-how-to.html

Introduction

Under the cover of PySpark

Requirements

A brief note about Scala

1°) Install Eclipse

2°) Install Spark

3°) Install PyDev

4°) Configure PyDev with a Python interpreter

5°) Configure PyDev with the Spark Python sources

6°) Configure PyDev with the Spark Environment variables

7°) Create the Spark-Python project “CountWords”

8°) Run the Spark-Python project “CountWords”


Introduction

This document shows how to configure the Eclipse IDE in order to develop with Spark 1.3.1 and Python via the PyDev plugin.

PyDev is a plugin that enables Eclipse to be used as a Python IDE.

First we will install Eclipse, then Spark 1.3.1 and PyDev, then we will configure PyDev.

Finally, we will develop and test a well-known example named “Word Counts”, written in Python and running on Spark.

Under the cover of PySpark

The Spark Python API (PySpark) exposes the Spark programming model to Python.

By default, PySpark requires Python (2.6 or higher) to be available on the system PATH, and uses it to run programs.

Note that PySpark applications are executed by a standard CPython interpreter, in order to support Python modules that use C extensions.

But an alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable.
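For example, the variable can be exported in your shell before launching a program, or set from the driver before the SparkContext is created. A minimal sketch (the interpreter path below is hypothetical; adjust it to your machine):

import os

# Hypothetical path: point PySpark at a specific CPython interpreter
# (must be set before the SparkContext is created)
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python2.7"

from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("Demo").setMaster("local"))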

All of PySpark’s library dependencies, including Py4J, are bundled with PySpark and automatically imported.

In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.

Py4J is only used on the driver for local communication between the Python and Java SparkContext objects.

RDD transformations in Python are mapped to transformations on PythonRDD objects in Java.
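As a small illustration of this bridge, the Python SparkContext keeps a reference to the JVM-side context it created through Py4J. A hedged sketch (it relies on the private attribute _jsc, which may change between Spark versions):

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("Py4JDemo").setMaster("local"))
# sc._jsc wraps the JavaSparkContext living in the JVM launched by Py4J
print type(sc._jsc)
sc.stop()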

For more details, please visit the pages below:

Installing and Configuring PySpark: https://spark.apache.org/docs/0.9.2/python-programming-guide.html

PySpark Internals: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals


Requirements

Note that Spark 1.3.1 runs on Java 6+ and Python 2.6+, so you will need on your computer:

A JVM 6 or higher (JVM 7 is a good compromise)

A Python 2.6 or higher

The following installation has been carried out with a JVM 7 and a Python 2.7 interpreter.

A brief note about Scala

Keep in mind that it is a good idea to use this same Eclipse IDE for Spark in order to develop later in both Python and Scala.

To allow this, it is important to know that Spark 1.3.1 needs a Scala API compatible with Scala version 2.10.x.

That is why the following installation uses Eclipse 4.3 (Kepler), because of its compatibility with Scala 2.10.


1°) Install Eclipse

Go to the Eclipse website, then download and uncompress Eclipse 4.3 (Kepler) on your computer:

http://www.eclipse.org/downloads/packages/release/Kepler/SR2

Finally, launch Eclipse and create your workspace as usual.

2°) Install Spark

Go to the Spark website, then download and uncompress Spark 1.3.1 (e.g. “Pre-built for Hadoop 2.6 and later”) on your computer:

https://spark.apache.org/downloads.html


3°) Install PyDev

From the Eclipse IDE:

Go to the menu Help > Install New Software...

From the “Install” window:

Click on the button [Add…]

From the “Add Repository” dialog box:

Fill in the field Name: PyDev

Fill in the field Location: http://pydev.sf.net/updates

Validate with the button [OK]

From the “Install” window:

Check the name PyDev and click twice on the button [Next >]

Accept the terms of the license agreement and click on the button [Finish]

If a “Security Warning” window appears with the message “Warning: you are installing software that contains unsigned content…”:

Click on the button [OK]

From the “Software Updates” window:

Click the button [Yes] to restart Eclipse so that the changes take effect.

Now PyDev (e.g. 4.1.0) is installed in your Eclipse.

But you can’t develop in Python, because PyDev isn’t configured yet.


4°) Configure PyDev with a Python interpreter

Like PySpark, PyDev requires a Python interpreter installed on your computer.

Remember that with PySpark, Py4J is not a Python interpreter.

Py4J is only used on the driver for local communication between the Python and Java SparkContext objects.

The following installation has been carried out with a Python 2.7 interpreter.

From the Eclipse IDE:

Open the PyDev perspective (at the top right of the Eclipse IDE)

Go to the menu Eclipse > Preferences… (on Mac), or Window > Preferences... (on Linux and Windows)

From the “Preferences” window:

Go to PyDev > Interpreters > Python Interpreter

Click on the button [Advanced Auto-Config]

Eclipse will introspect all the Python installations on your computer.

Choose a Python version 2.7 (e.g. /usr/bin/python2.7 on Mac), then validate with the button [OK]

From the “Selection needed” window:

Click on the button [OK] to accept the folders to be added to the system PYTHONPATH

From the “Preferences” window:

Validate with the button [OK]

Now PyDev is configured in your Eclipse.

You can now develop in Python, but not with Spark yet.


5°) Configure PyDev with the Spark Python sources

Now we are going to configure PyDev with the Spark Python sources.

From the Eclipse IDE:

Check that you are in the PyDev perspective

Go to the menu Eclipse > Preferences... (on Mac), or Window > Preferences... (on Linux and Windows)

From the “Preferences” window:

Go to PyDev > Interpreters > Python Interpreter

Click on the button [New Folder]

Choose the python folder located directly under your Spark install directory and validate:

e.g. /home/foo/Spark_1.3.1-Hadoop_2.6/python

Note: this path must be absolute (don’t use the Spark home environment variable)

Click on the button [New Egg/Zip(s)]

From the File Explorer, select [*.zip] rather than [*.egg]

Choose the file py4j-0.8.2.1-src.zip located directly under your Spark python folder and validate:

e.g. /home/foo/Spark_1.3.1-Hadoop_2.6/python/py4j-0.8.2.1-src.zip

Note: this path must be absolute (don’t use the Spark home environment variable)

Validate with the button [OK]

Now PyDev is configured with Spark Python sources.

But we can’t execute Spark until the environment variables are configured.
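For reference, the PyDev configuration above is roughly equivalent to extending sys.path by hand in a plain Python session, as in this sketch (assuming the same example install directory as above):

import sys

# Same paths as configured in PyDev (adjust to your Spark install directory)
sys.path.insert(0, "/home/foo/Spark_1.3.1-Hadoop_2.6/python")
sys.path.insert(0, "/home/foo/Spark_1.3.1-Hadoop_2.6/python/py4j-0.8.2.1-src.zip")

from pyspark import SparkConf, SparkContext  # now importable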


6°) Configure PyDev with the Spark Environment variables

From the Eclipse IDE:

Check that you are in the PyDev perspective

Go to the menu Eclipse > Preferences... (on Mac), or Window > Preferences... (on Linux and Windows)

From the “Preferences” window:

Go to PyDev > Interpreters > Python Interpreter

Click on the central button [Environment]

Click on the button [New...] (close to the button [Select...]) to add a new Environment variable.

Add the environment variable SPARK_HOME and validate:

e.g. 1: Name: SPARK_HOME, Value: /home/foo/Spark_1.3.1-Hadoop_2.6

e.g. 2: Name: SPARK_HOME, Value: ${eclipse_home}../Spark_1.3.1-Hadoop_2.6

Note: don’t use system environment variables such as the Spark home

It’s recommended to maintain your own “log4j.properties” file in each of your projects.

To allow that, add the environment variable SPARK_CONF_DIR as previously and validate:

e.g. Name: SPARK_CONF_DIR, Value: ${project_loc}/conf

If you experience some problems with the variable ${project_loc} (e.g: with Linux OS), specify an absolute path instead.

Or, if you want to keep ${project_loc}, right-click on each Python source and choose Run As > Run Configurations…,

then create your SPARK_CONF_DIR variable in the Environment tab as described previously

Optionally, you can add other environment variables such as TERM, SPARK_LOCAL_IP, and so on:

e.g. 1: Name: TERM, Value on Mac: xterm-256color, Value on Linux: xterm

e.g. 2: Name: SPARK_LOCAL_IP, Value: 127.0.0.1 (it’s recommended to specify your real local IP address)

Validate with the button [OK]

Now PyDev is fully ready to develop with Spark in Python.
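As a quick sanity check (a small sketch, not part of the original steps), you can print the variables from a PyDev run to confirm they are visible:

import os

# Confirm that the variables configured in PyDev are visible at run time
for name in ["SPARK_HOME", "SPARK_CONF_DIR"]:
    print name, "=", os.environ.get(name, "(not set)")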


7°) Create the Spark-Python project “CountWords”

Now that we can develop any kind of Spark project written in Python, we will create the code example named “CountWords”.

This example will count the frequency of each word present in the “README.md” file belonging to the Spark installation.

To allow such counting, the well-known MapReduce paradigm will be applied in memory by using the two Spark transformations named “flatMap” and “reduceByKey”.
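To see how these two transformations combine, here is a tiny illustrative sketch (assuming a SparkContext named sc, as created in the code of the next pages):

# Illustrative only: count the words of a small in-memory dataset
lines = sc.parallelize(["to be or", "not to be"])
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
print counts.collect()  # e.g. [('or', 1), ('not', 1), ('to', 2), ('be', 2)]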

Create the new project:

Check that you are in the PyDev perspective

Go to the Eclipse menu File > New > PyDev project

Name your new project “PythonSpark”, then click on the button [Finish]

Create a source folder:

To add a source folder (in which you will soon create your Python source), right-click on the project icon and choose New > Folder

Name the new folder “src”, then click on the button [Finish]

To add the new Python source, right-click on the source folder icon and choose New > PyDev Module

Name the new Python source “WordCounts”, then click on the button [Finish], then click on the button [OK]


Copy-paste the following Python code into your PyDev module WordCounts.py:

# Imports
# Take care about unused imports (and also unused variables):
# comment them all out, otherwise you will get errors at execution.
# Note that neither the directive "@PydevCodeAnalysisIgnore" nor
# "@UnusedImport" will solve that issue.
#from pyspark.mllib.clustering import KMeans
from pyspark import SparkConf, SparkContext
import os

# Configure the Spark environment
sparkConf = SparkConf().setAppName("WordCounts").setMaster("local")
sc = SparkContext(conf=sparkConf)

# The WordCounts Spark program
textFile = sc.textFile(os.environ["SPARK_HOME"] + "/README.md")
wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)
for wc in wordCounts.collect():
    print wc

In PyDev, take care about unused imports (and also unused variables).

Comment them all out, otherwise you will get errors at execution.

Note that neither the directive @PydevCodeAnalysisIgnore nor @UnusedImport will solve that issue.
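As an optional variation (not part of the original example), you can sort the pairs by descending count before collecting them:

# Optional: show the ten most frequent words first
topWords = wordCounts.sortBy(lambda wc: wc[1], ascending=False).take(10)
for wc in topWords:
    print wc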

Create a config folder:

To add a config folder (useful for log4j), right-click on the project icon and choose New > Folder

Name the new folder “conf”, then click on the button [Finish]

To add your new config file (the “log4j.properties” file), right-click on the config folder icon and choose New > File

Name the new config file “log4j.properties”, then click on the button [Finish], then click on the button [OK]

Copy-paste the content of the file “log4j.properties.template” (under $SPARK_HOME/conf) into your new config file “log4j.properties”

Edit your own config file “log4j.properties” to adjust the log levels as much as you want (e.g. INFO to WARN, or INFO to ERROR…)
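For example, to show only warnings and errors on the console, you can lower the rootCategory line inherited from the template from INFO to WARN:

# conf/log4j.properties (excerpt)
log4j.rootCategory=WARN, console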


8°) Run the Spark-Python project “CountWords”

To execute your code, right-click on the Python module “WordCounts.py”, then choose Run As > 1 Python Run

Have fun :-)