a data science tutorial in python

1/22/2016 Tutorial in Python


A Complete Tutorial for Data Science in Python

Python is an amazing language. It was created by Guido van Rossum. You can read Guido's history of Python at the Python History Blog at http://python-history.blogspot.in/2009/01/introduction-and-overview.html

Here we present a comprehensive tutorial on using Python for data science. Data science lies at the intersection of programming, statistics and business analysis. It is the use of programming tools with statistical techniques to analyze data in a systematic and scientific way. Accordingly, this tutorial focuses on at least the statistical and programming parts of data science. Data scientists may also be interested in the PyData community at http://pydata.org/

Note: I am writing this article within the Jupyter Notebook, a Python interface derived from IPython.

Markdown Tip within Jupyter

I can also write this text within Jupyter by changing the cell type to Markdown in the dropdown. In Markdown, changing the font size is easy: prefix the text with #, ## or ### (the more # signs, the smaller the font, since the text changes from header 1 to header 2 to header 3). For a non-numbered list, prefix the words with a hyphen (-).

Markdown

within Jupyter

is just a # in front of words

and changing the cell type to Markdown

This is a list made by adding a hyphen in front of words

Installation of Python Packages

Python packages are installed using pip or easy_install (from setuptools). Here we show how to install the pandas package from the Jupyter Notebook itself. I use the --upgrade flag to upgrade it, and I install Bokeh using easy_install. pandas is the Python library for data analysis and Bokeh helps make interactive data analysis available. Note the ! sign before the sudo command: it lets me use the terminal without leaving the comfort of my Jupyter Notebook. I can also install Python packages using conda, which is my preferred method for data science since I can create custom environments for projects.

The complete Python Package Index (PyPI) is at https://pypi.python.org/pypi. PyPI had 71833 packages as of December 30, 2015.




In [1]:

In [2]:

In [3]:

Loading a Python Package

You can load a Python package in the following ways:

import PACKAGE
import PACKAGE as PK
from PACKAGE import FUN

The directory '/home/ajay/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
You are using pip version 7.1.0, however version 8.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The directory '/home/ajay/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Requirement already up-to-date: pandas in /usr/local/lib/python2.7/dist-packages
Requirement already up-to-date: python-dateutil in /usr/local/lib/python2.7/dist-packages (from pandas)
Requirement already up-to-date: pytz>=2011k in /usr/local/lib/python2.7/dist-packages (from pandas)
Requirement already up-to-date: numpy>=1.7.0 in /usr/local/lib/python2.7/dist-packages (from pandas)
Requirement already up-to-date: six>=1.5 in /usr/local/lib/python2.7/dist-packages (from python-dateutil->pandas)

Searching for bokeh
Best match: bokeh 0.10.0
Processing bokeh-0.10.0-py2.7.egg
bokeh 0.10.0 is already the active version in easy-install.pth
Installing bokeh-server script to /usr/local/bin
Installing websocket_worker.py script to /usr/local/bin

Using /usr/local/lib/python2.7/dist-packages/bokeh-0.10.0-py2.7.egg
Processing dependencies for bokeh
Finished processing dependencies for bokeh

! sudo pip install pandas --upgrade

! sudo easy_install bokeh

#! conda install seaborn



You can then invoke the function using PACKAGE.FUN, PK.FUN and FUN respectively.
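To make the three import styles concrete, here is a minimal sketch that uses the standard-library math module in place of PACKAGE, the alias m in place of PK, and sqrt in place of FUN. These stand-ins are my own choice for illustration; any package and function would behave the same way.

```python
# Three equivalent ways to reach the same function.
# `math`, the alias `m`, and `sqrt` stand in for PACKAGE, PK and FUN.
import math                 # import PACKAGE
import math as m            # import PACKAGE as PK
from math import sqrt       # from PACKAGE import FUN

a = math.sqrt(16)           # PACKAGE.FUN
b = m.sqrt(16)              # PK.FUN
c = sqrt(16)                # FUN

print(a, b, c)
```

All three calls run the same function, so the choice is mostly about how much typing you want versus how clear you want the origin of a name to be.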

In [4]:

In [5]:

The Python Package Index (PyPI) at https://pypi.python.org/pypi hosts thousands of third-party modules for Python.

You can browse Python packages by topic at https://pypi.python.org/pypi?%3Aaction=browse

Import Data

Let's import some datasets.

In [6]:

In [7]:

In [8]:

Out[4]:

datetime.datetime(2016, 1, 22, 13, 4, 3, 39744)

Out[7]:

'/home/ajay/Dropbox/PYTHON BOOK WILEY/FINAL'

from datetime import datetime
Starttime = datetime.now()
Starttime

import pandas as pd

# In case the file is stored locally we can use the os Python library
import os as os

os.getcwd() #current working directory

os.chdir('/home/ajay/Desktop/test')




In [9]:

In [10]:

In [11]:

We will use the diamonds dataset, which ships with the ggplot2 package in R, from the Rdatasets collection at https://vincentarelbundock.github.io/Rdatasets/datasets.html

In [12]:

In [13]:

So we got a rough estimate of the time the code took to execute through the datetime.timedelta object above. Also, read_csv is just one of the many convenient ways to read data with the pandas library in Python. However, Python lacks an equivalent of R's lubridate package (for easier date-time manipulation) and of R's data.table package, which makes import and manipulation faster.
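The timing trick is just a subtraction of two datetime values. Here is a minimal, self-contained sketch of it; the loop being timed is a made-up stand-in for the data import, and the last lines rebuild the timedelta shown in Out[13] to show how to read it as seconds.

```python
from datetime import datetime, timedelta

start = datetime.now()
total = sum(i * i for i in range(100000))   # stand-in for the work being timed
elapsed = datetime.now() - start            # subtracting gives a timedelta

# total_seconds() folds days, seconds and microseconds into one float
print(elapsed.total_seconds())

# The timedelta from Out[13]: 0 days, 5 seconds, 689405 microseconds
example = timedelta(0, 5, 689405)
print(example.total_seconds())
```

This is only a rough wall-clock estimate; it includes everything that happened between the two calls to datetime.now().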

In [14]:

Out[9]:

['adult.data.txt']

Out[11]:

32561

Out[13]:

datetime.timedelta(0, 5, 689405)

Out[14]:

pandas.core.frame.DataFrame

a=os.getcwd()
os.listdir(a)

adult=pd.read_csv("adult.data.txt",header=None)

len(adult)

diamonds =pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv")

datetime.now()- Starttime

type(diamonds) #this works just like class(object) in R




In [15]:

To find out more about the objects in your session you can use locals() and globals().

Data Inspection

We get the column names, the column types and general information about the data through the columns, dtypes and info commands below. In R we would get this with the str command (for structure). In Python, str turns the object into a string. (This is just one of the ways people can get confused moving between data science languages.)

In R we use the names function for variable names and the length function for the length of an object, while Python uses columns and len respectively.
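The inspection commands can be tried on any data frame. Here is a minimal sketch on a tiny hand-made frame (three rows copied from the diamonds data, with only three of its columns) so that it runs without a download; the name df is my own.

```python
import pandas as pd

# A tiny frame with a few diamonds-like columns
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 327],
})

print(list(df.columns))   # variable names, like names(df) in R
print(len(df))            # number of rows, like nrow(df) in R
print(df.shape)           # (rows, columns), like dim(df) in R
print(df.dtypes)          # one type per column
df.info()                 # the closest thing to str(df) in R
print(str(len(df)))       # in Python, str() only converts to a string
```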

In [16]:

Out[15]:

['T', '_AXIS_ALIASES', '_AXIS_IALIASES', '_AXIS_LEN', '_AXIS_NAMES', '_AXIS_NUMBERS', '_AXIS_ORDERS', '_AXIS_REVERSED', '_AXIS_SLICEMAP', '__abs__', '__add__', '__and__', '__array__', '__array_wrap__', '__bool__', '__bytes__', '__class__', '__contains__',

Out[16]:

Index(['Unnamed: 0', 'carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z'], dtype='object')

#to find out what functions are available on an object we can just use the dir command
dir(diamonds)

diamonds.columns
# In Python as well as R, a single-line comment starts with #
# The names of the variables are given by columns. In R we would use the command names(object)
# Note also that R uses the FUNCTION(OBJECTNAME) syntax while Python uses OBJECTNAME.FUNCTION



In [17]:

In [18]:

In [19]:

In [20]:

Out[17]:

Unnamed: 0      int64
carat         float64
cut            object
color          object
clarity        object
depth         float64
table         float64
price           int64
x             float64
y             float64
z             float64
dtype: object

Out[18]:

53940

Out[19]:

5.394

Out[20]:

5

diamonds.dtypes

len(diamonds) #gives the number of rows

0.0001*len(diamonds)

round(0.0001*len(diamonds))



In [21]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 0 to 53939
Data columns (total 11 columns):
Unnamed: 0    53940 non-null int64
carat         53940 non-null float64
cut           53940 non-null object
color         53940 non-null object
clarity       53940 non-null object
depth         53940 non-null float64
table         53940 non-null float64
price         53940 non-null int64
x             53940 non-null float64
y             53940 non-null float64
z             53940 non-null float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.3+ MB

'''Let's get some information on the object.

This was a multiple-line comment using three single quote marks'''
diamonds.info()



In [22]:

Data Munging

To refer to a particular row in Python I can use the index or .ix.
In R I refer to the object in the i-th row and j-th column by OBJECTNAME[i,j].
In R I refer to a column by OBJECTNAME$ColumnName, while in Python I would use OBJECTNAME["ColumnName"].
Note that in Python the index starts at 0 while in R it starts at 1.
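One caveat for readers on newer pandas: the .ix indexer used in this tutorial was deprecated in pandas 0.20 and removed in pandas 1.0. The sketch below shows the same kinds of lookups with .loc (by label) and .iloc (by position) on a small hand-made frame; the frame and variable names are my own, and the rows are the first five rows of the diamonds data.

```python
import pandas as pd

df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "price": [326, 326, 327, 334, 335],
})

first_row = df.iloc[0]          # position 0 is the first row (row 1 in R)
by_label = df.loc[1:3]          # label slice: includes both ends, rows 1, 2, 3
by_position = df.iloc[1:3]      # position slice: excludes the end, rows 1, 2
one_cell = df.loc[2, "cut"]     # row label 2, column "cut"; df[3, "cut"] in R
one_column = df["price"]        # like df$price in R

print(len(by_label), len(by_position), one_cell)
```

With the default integer index, diamonds.ix[20:30] in this tutorial behaves like diamonds.loc[20:30], which is why it returns 11 rows rather than 10.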

Out[22]:

   Unnamed: 0  carat        cut color clarity  depth  table  price     x     y     z
0           1   0.23      Ideal     E     SI2   61.5     55    326  3.95  3.98  2.43
1           2   0.21    Premium     E     SI1   59.8     61    326  3.89  3.84  2.31
2           3   0.23       Good     E     VS1   56.9     65    327  4.05  4.07  2.31
3           4   0.29    Premium     I     VS2   62.4     58    334  4.20  4.23  2.63
4           5   0.31       Good     J     SI2   63.3     58    335  4.34  4.35  2.75
5           6   0.24  Very Good     J    VVS2   62.8     57    336  3.94  3.96  2.48
6           7   0.24  Very Good     I    VVS1   62.3     57    336  3.95  3.98  2.47
7           8   0.26  Very Good     H     SI1   61.9     55    337  4.07  4.11  2.53
8           9   0.22       Fair     E     VS2   65.1     61    337  3.87  3.78  2.49
9          10   0.23  Very Good     H     VS1   59.4     61    338  4.00  4.05  2.39

diamonds.head(10) #we check the first 10 rows in the dataset



In [23]:

Out[23]:


0 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43

1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31

2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31

3 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63

4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75

diamonds2=diamonds.drop('Unnamed: 0', axis=1) #Dropping a particular variable
diamonds2.head()



In [24]:

In [25]:

Out[24]:

    Unnamed: 0  carat        cut color clarity  depth  table  price     x     y     z
20          21   0.30       Good     I     SI2   63.3     56    351  4.26  4.30  2.71
21          22   0.23  Very Good     E     VS2   63.8     55    352  3.85  3.92  2.48
22          23   0.23  Very Good     H     VS1   61.0     57    353  3.94  3.96  2.41
23          24   0.31  Very Good     J     SI1   59.4     62    353  4.39  4.43  2.62
24          25   0.31  Very Good     J     SI1   58.1     62    353  4.44  4.47  2.59
25          26   0.23  Very Good     G    VVS2   60.4     58    354  3.97  4.01  2.41
26          27   0.24    Premium     I     VS1   62.5     57    355  3.97  3.94  2.47
27          28   0.30  Very Good     J     VS2   62.2     57    357  4.28  4.30  2.67
28          29   0.23  Very Good     D     VS2   60.5     61    357  3.96  3.97  2.40
29          30   0.23  Very Good     F     VS1   60.9     57    357  3.96  3.99  2.42
30          31   0.23  Very Good     F     VS1   60.0     57    402  4.00  4.03  2.41

Out[25]:

20         Good
21    Very Good
22    Very Good
23    Very Good
24    Very Good
25    Very Good
Name: cut, dtype: object

diamonds.ix[20:30] #refers to the 21st to 31st row

#To refer to a particular column I use its name
# I can also chain the commands
diamonds.ix[20:25].cut



In [26]:

In [27]:

Out[26]:

20    I
21    E
22    H
23    J
24    J
25    G
Name: color, dtype: object

Out[27]:

color cut price

0 E Ideal 326

1 E Premium 326

2 E Good 327

3 I Premium 334

4 J Good 335

diamonds.ix[20:25]["color"]

diamonds[['color','cut','price']].head() #Note the double square brackets [[]]



In [28]:

Out[28]:

color cut price

10 J Good 339

11 J Ideal 340

12 F Premium 342

13 J Ideal 344

14 E Premium 345

15 E Premium 345

16 I Ideal 348

17 J Good 351

18 J Good 351

19 J Very Good 351

20 I Good 351

diamonds.ix[10:20,['color','cut','price']]
#Note how I placed the row index numbers and column names within the double SQUARE brackets.
# This is more elaborate than R, isn't it?



In [29]:

Out[29]:

       Unnamed: 0  carat        cut color clarity  depth  table  price      x      y
23644       23645   3.65       Fair     H      I1   67.1     53  11668   9.53   9.48
24131       24132   3.24    Premium     H      I1   62.1     58  12300   9.44   9.40
24297       24298   3.22      Ideal     I      I1   62.6     55  12545   9.49   9.42
24328       24329   3.50      Ideal     H      I1   62.8     57  12587   9.65   9.59
25998       25999   4.01    Premium     I      I1   61.0     61  15223  10.14  10.10
25999       26000   4.01    Premium     J      I1   62.5     62  15223  10.02   9.94
26431       26432   3.40       Fair     D      I1   66.8     52  15964   9.42   9.34
26444       26445   4.00  Very Good     I      I1   63.3     58  15984  10.01   9.94
26534       26535   3.67    Premium     I      I1   62.4     56  16193   9.86   9.81
27130       27131   4.13       Fair     H      I1   64.8     61  17329  10.00   9.85
27415       27416   5.01       Fair     J      I1   65.5     59  18018  10.74  10.54
27630       27631   4.50       Fair     J      I1   65.8     58  18531  10.23  10.16
27679       27680   3.51    Premium     J     VS2   62.5     59  18701   9.66   9.63

#Let's try conditional selection
diamonds[diamonds['carat']>3.2]



In [30]:

In [31]:

Random Sample

Since Python does not have a package like dplyr, it needs numpy for more elaborate operations. Here we take a random sample of a pandas data frame.
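Two details are worth spelling out for sampling. By default np.random.choice draws with replacement, so the same row can be picked twice; passing replace=False avoids that. pandas also has a built-in DataFrame.sample method that does the job in one step. The sketch below shows both on a made-up 100-row frame; the frame, the seed and the variable names are my own assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": range(100, 200)})   # 100 illustrative rows

# Option 1: pick row labels with numpy, then look them up.
# replace=False avoids drawing the same row twice.
rng = np.random.RandomState(42)                 # fixed seed for repeatability
rows = rng.choice(df.index.values, size=5, replace=False)
sample_np = df.loc[rows]

# Option 2: let pandas do it in one step (sampling without replacement).
sample_pd = df.sample(n=5, random_state=42)

print(sample_np)
print(sample_pd)
```

Fixing the seed is optional; leave it out if you want a different sample on every run.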

In [32]:

In [33]:

Out[30]:

       Unnamed: 0  carat      cut color clarity  depth  table  price      x      y
21758       21759   3.11     Fair     J      I1   65.9     57   9823   9.15   9.02
25999       26000   4.01  Premium     J      I1   62.5     62  15223  10.02   9.94
26467       26468   3.01    Ideal     J     SI2   61.7     58  16037   9.25   9.20
26744       26745   3.01    Ideal     J      I1   65.4     60  16538   8.99   8.93
27415       27416   5.01     Fair     J      I1   65.5     59  18018  10.74  10.54
27630       27631   4.50     Fair     J      I1   65.8     58  18531  10.23  10.16
27679       27680   3.51  Premium     J     VS2   62.5     59  18701   9.66   9.63
27684       27685   3.01  Premium     J     SI2   60.7     59  18710   9.35   9.22
27685       27686   3.01  Premium     J     SI2   59.7     58  18710   9.41   9.32

Out[31]:

(13791, 11)

[34159 23971 31335 1895 28279]

##Let's try multiple conditions. We use the query command.
diamonds.query('carat >3 and color =="J"')

diamonds3=diamonds.query('price >28000 or cut =="Premium"')
diamonds3.shape
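The query command is a shorthand for combining boolean masks. Here is a minimal sketch showing that both forms select the same rows; it uses a small made-up frame (illustrative values, not real diamonds rows) and my own variable names.

```python
import pandas as pd

df = pd.DataFrame({
    "carat": [3.11, 4.01, 0.31, 3.01],
    "color": ["J", "J", "J", "H"],
    "price": [9823, 15223, 335, 16037],
})

# Boolean masks: wrap each condition in parentheses,
# then combine with & (and) or | (or)
mask_result = df[(df["carat"] > 3) & (df["color"] == "J")]

# The same filter written as a query string
query_result = df.query('carat > 3 and color == "J"')

print(mask_result.equals(query_result))
```

The parentheses in the mask version matter, because & binds more tightly than the comparison operators in Python.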

import numpy as np

rows = np.random.choice(diamonds.index.values, round(0.0001*len(diamonds)))
print(rows)



In [34]:

In [35]:

Summaries

We now do summaries for numerical and categorical data.

In [36]:

Out[34]:

       Unnamed: 0  carat        cut color clarity  depth  table  price     x     y     z
34159       34160   0.33      Ideal     G     VS1   62.1   55.0    854  4.46  4.43  2.76
23971       23972   1.51  Very Good     H     VS2   62.4   55.6  12108  7.28  7.33  4.56
31335       31336   0.41      Ideal     G     SI1   61.9   54.0    759  4.77  4.82  2.97
1895         1896   0.73      Ideal     E     VS2   62.7   56.0   3077  5.75  5.80  3.62
28279       28280   0.31    Premium     J     SI1   60.9   60.0    363  4.36  4.38  2.66

Out[36]:

Unnamed: 0 carat depth table price x

count 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000

mean 26970.500000 0.797940 61.749405 57.457184 3932.799722 5.731157

std 15571.281097 0.474011 1.432621 2.234491 3989.439738 1.121761

min 1.000000 0.200000 43.000000 43.000000 326.000000 0.000000

25% 13485.750000 0.400000 61.000000 56.000000 950.000000 4.710000

50% 26970.500000 0.700000 61.800000 57.000000 2401.000000 5.700000

75% 40455.250000 1.040000 62.500000 59.000000 5324.250000 6.540000

max 53940.000000 5.010000 79.000000 95.000000 18823.000000 10.740000

diamonds.ix[rows]

##Missing Values

diamonds= diamonds.dropna(how='any')

diamonds.describe()



In [37]:

In [38]:

Out[37]:

count    53940.000000
mean      3932.799722
std       3989.439738
min        326.000000
25%        950.000000
50%       2401.000000
75%       5324.250000
max      18823.000000
Name: price, dtype: float64

Out[38]:

            Unnamed: 0     carat     depth     table     price         x         y
Unnamed: 0    1.000000  0.377983  0.034800  0.100830  0.306873  0.405440  0.395843
carat         0.377983  1.000000  0.028224  0.181618  0.921591  0.975094  0.951722
depth         0.034800  0.028224  1.000000  0.295779  0.010647  0.025289  0.029341
table         0.100830  0.181618  0.295779  1.000000  0.127134  0.195344  0.183760
price         0.306873  0.921591  0.010647  0.127134  1.000000  0.884435  0.865421
x             0.405440  0.975094  0.025289  0.195344  0.884435  1.000000  0.974701
y             0.395843  0.951722  0.029341  0.183760  0.865421  0.974701  1.000000
z             0.399208  0.953387  0.094924  0.150929  0.861249  0.970772  0.952006

diamonds.price.describe()

diamonds.corr() #Numerical correlations



In [39]:

In [40]:

In [41]:

In [42]:

Out[39]:

Unnamed: 0 carat depth table price x y z

Unnamed: 0 True False False False False False False False

carat False True False False True True True True

depth False False True False False False False False

table False False False True False False False False

price False True False False True True True True

x False True False False True True True True

y False True False False True True True True

z False True False False True True True True

Out[40]:

array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'], dtype=object)

Out[41]:

array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

Out[42]:

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64

diamonds.corr()>0.5

# I use unique to get unique values. That is useful for categorical and character data.
diamonds['clarity'].unique()

diamonds['cut'].unique()

#to get the distribution across values of categorical variables I can use the value_counts command
pd.value_counts(diamonds.cut)



In [43]:

In [44]:

In [45]:

Out[43]:

G    11292
E     9797
F     9542
H     8304
D     6775
I     5422
J     2808
Name: color, dtype: int64

Out[44]:

color D E F G H I J

cut

Fair 163 224 312 314 303 175 119

Good 662 933 909 871 702 522 307

Ideal 2834 3903 3826 4884 3115 2093 896

Premium 1603 2337 2331 2924 2360 1428 808

Very Good 1513 2400 2164 2299 1824 1204 678

Out[45]:

color D E F G H I J All

cut

Fair 163 224 312 314 303 175 119 1610

Good 662 933 909 871 702 522 307 4906

Ideal 2834 3903 3826 4884 3115 2093 896 21551

Premium 1603 2337 2331 2924 2360 1428 808 13791

Very Good 1513 2400 2164 2299 1824 1204 678 12082

All 6775 9797 9542 11292 8304 5422 2808 53940

pd.value_counts(diamonds.color)

#the crosstab command helps to make a cross-tabulation.
pd.crosstab(diamonds.cut,diamonds.color)

#Adding margins=True adds the row and column totals to a cross-tabulation
pd.crosstab(diamonds.cut,diamonds.color,margins=True)



In [46]:

In [47]:

In [48]:

In [49]:

Out[46]:

color D E F G H I J All

cut

Fair 163 224 312 314 303 175 119 1610

Good 662 933 909 871 702 522 307 4906

Ideal 2834 3903 3826 4884 3115 2093 896 21551

Premium 1603 2337 2331 2924 2360 1428 808 13791

Very Good 1513 2400 2164 2299 1824 1204 678 12082

All 6775 9797 9542 11292 8304 5422 2808 53940

Out[48]:

pandas.core.groupby.DataFrameGroupBy

Out[49]:

cut
Fair         3282.0
Good         3050.5
Ideal        1810.0
Premium      3185.0
Very Good    2648.0
Name: price, dtype: float64

pd.crosstab(diamonds.cut,diamonds.color,margins=True)

#To do a groupby analysis we can use the groupby command. This two-step method is more similar to dplyr in R
cutgroup=diamonds.groupby(diamonds.cut)

type(cutgroup)

cutgroup.price.median()
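The two steps of a groupby analysis (first split, then summarise) can be seen on a small made-up frame. This is only a sketch: the prices are illustrative and the variable names are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "cut": ["Fair", "Fair", "Good", "Good", "Good"],
    "price": [300, 500, 100, 200, 600],      # illustrative prices
})

# Step 1: split the rows into groups (nothing is computed yet)
groups = df.groupby("cut")

# Step 2: apply a summary to one column of every group
medians = groups["price"].median()           # a Series indexed by cut
as_table = medians.reset_index()             # back to an ordinary data frame

# Several summaries at once
summary = groups["price"].agg(["median", "mean", "count"])

print(as_table)
print(summary)
```

Passing the column name as a string, as in df.groupby("cut"), groups by the same values as passing the column itself.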



In [50]:

In [51]:

In [52]:

Out[50]:

cut price

0 Fair 3282.0

1 Good 3050.5

2 Ideal 1810.0

3 Premium 3185.0

4 Very Good 2648.0

Out[51]:

0 1 2 3 4

cut Fair Good Ideal Premium Very Good

price 3282 3050.5 1810 3185 2648

Out[52]:

<pandas.core.groupby.DataFrameGroupBy object at 0xaad3a36c>

cutgroup.price.median().reset_index()

d=cutgroup.price.median().reset_index()
#transpose turns row values to columns
d.transpose()

# We can group by multiple columns
diamonds.groupby(['cut', "color"])



In [53]:

Out[53]:

cut color price

0 Fair D 3730.0

1 Fair E 2956.0

2 Fair F 3035.0

3 Fair G 3057.0

4 Fair H 3816.0

5 Fair I 3246.0

6 Fair J 3302.0

7 Good D 2728.5

8 Good E 2420.0

9 Good F 2647.0

10 Good G 3340.0

11 Good H 3468.5

12 Good I 3639.5

13 Good J 3733.0

14 Ideal D 1576.0

15 Ideal E 1437.0

16 Ideal F 1775.0

17 Ideal G 1857.5

18 Ideal H 2278.0

19 Ideal I 2659.0

20 Ideal J 4096.0

21 Premium D 2009.0

22 Premium E 1928.0

23 Premium F 2841.0

24 Premium G 2745.0

25 Premium H 4511.0

26 Premium I 4640.0

27 Premium J 5063.0

28 Very Good D 2310.0

diamonds.groupby(['cut', "color"]).price.median().reset_index()



In [54]:

In [55]:

In [56]:

Using SQL

29 Very Good E 1989.5

30 Very Good F 2471.0

31 Very Good G 2437.0

32 Very Good H 3734.0

33 Very Good I 3888.0

34 Very Good J 4113.0

Out[54]:

color D E F G H I J

cut

Fair 3730.0 2956.0 3035 3057.0 3816.0 3246.0 3302

Good 2728.5 2420.0 2647 3340.0 3468.5 3639.5 3733

Ideal 1576.0 1437.0 1775 1857.5 2278.0 2659.0 4096

Premium 2009.0 1928.0 2841 2745.0 4511.0 4640.0 5063

Very Good 2310.0 1989.5 2471 2437.0 3734.0 3888.0 4113

Out[56]:

color D E F G H I J

cut

Fair False False False False False False False

Good False False False False False False False

Ideal False False False False False False True

Premium False False False False True True True

Very Good False False False False False False True

e=diamonds.groupby(['cut', "color"]).price.median().reset_index()
e.pivot(index='cut', columns='color', values='price')
#The pivot command further helps to look at the data in a pivot table format.

f=e.pivot(index='cut', columns='color', values='price')

f>4000
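To see what pivot does to the shape of the data, here is a minimal sketch on four of the median prices from Out[53]. The frame and variable names are my own; only the numbers come from the tutorial.

```python
import pandas as pd

long_form = pd.DataFrame({
    "cut": ["Fair", "Fair", "Good", "Good"],
    "color": ["D", "E", "D", "E"],
    "price": [3730.0, 2956.0, 2728.5, 2420.0],   # medians taken from Out[53]
})

# pivot: one row per cut, one column per color
wide = long_form.pivot(index="cut", columns="color", values="price")

# Comparing a frame with a number gives a frame of booleans
flags = wide > 3000

print(wide)
print(flags)
```

pivot only reshapes, so each cut and color pair must appear once. pandas also offers pivot_table, which aggregates while reshaping and can therefore start from the raw rows.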



Python does have the pandasql package, thanks to the lovely team at Yhat (who also made the Rodeo IDE). It is similar to the sqldf package in R in that it allows the user to write SQL queries against a data frame object.
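As far as I know, pandasql runs its queries through SQLite, so the SQL dialect is SQLite's. The sketch below uses Python's built-in sqlite3 module directly, with a tiny hand-made table (made-up rows and a reduced set of columns), to show the kind of WHERE and GROUP BY queries used in this section. It is a stand-in for illustration, not how pandasql itself is called.

```python
import sqlite3

# An in-memory SQLite database with a tiny, hand-made diamonds table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE diamonds2 (carat REAL, color TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO diamonds2 VALUES (?, ?, ?)",
    [(0.23, "E", 326), (0.21, "E", 328), (0.31, "J", 335), (4.5, "J", 18531)],
)

# Filter rows with a WHERE clause
big = conn.execute("SELECT * FROM diamonds2 WHERE carat > 4").fetchall()

# Aggregate per group with GROUP BY
means = conn.execute(
    "SELECT color, AVG(price) AS mean_price FROM diamonds2 "
    "GROUP BY color ORDER BY color"
).fetchall()

print(big)
print(means)
conn.close()
```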

In [57]:

In [58]:

In [59]:

Out[58]:


0 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43

1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31

2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31

3 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63

4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75

5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47

7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53

8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49

9 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39

Out[59]:


0 4.01 Premium I I1 61.0 61 15223 10.14 10.10 6.17

1 4.01 Premium J I1 62.5 62 15223 10.02 9.94 6.24

2 4.13 Fair H I1 64.8 61 17329 10.00 9.85 6.43

3 5.01 Fair J I1 65.5 59 18018 10.74 10.54 6.98

4 4.50 Fair J I1 65.8 58 18531 10.23 10.16 6.72

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

pysqldf("SELECT * FROM diamonds2 LIMIT 10 ; ")
#you can get an error if your pandas data frame has a column name that SQLite does not accept.
#Therefore we used the diamonds dataset but after dropping the first column
#(i.e. diamonds2=diamonds.drop('Unnamed: 0', 1) #Dropping a particular variable)

pysqldf("SELECT * FROM diamonds2 WHERE carat >4 ;")



In [60]:

In [61]:

Out[60]:


0 4.01 Premium J I1 62.5 62 15223 10.02 9.94 6.24

1 5.01 Fair J I1 65.5 59 18018 10.74 10.54 6.98

2 4.50 Fair J I1 65.8 58 18531 10.23 10.16 6.72

Out[61]:

mean_price color

0 3169.954096 D

1 3076.752475 E

2 3724.886397 F

3 3999.135671 G

4 4486.669196 H

5 5091.874954 I

6 5323.818020 J

pysqldf("SELECT * FROM diamonds2 WHERE color =='J' and carat>4 ;")

pysqldf("SELECT AVG(price) AS mean_price,color FROM diamonds2 GROUP by color;")



In [62]:

Out[62]:

AVG(price) AVG(carat) cut clarity

0 3703.533333 1.361000 Fair I1

1 1912.333333 0.474444 Fair IF

2 4208.279412 0.964632 Fair SI1

3 5173.916309 1.203841 Fair SI2

4 4165.141176 0.879824 Fair VS1

5 4174.724138 0.885249 Fair VS2

6 3871.352941 0.664706 Fair VVS1

7 3349.768116 0.691594 Fair VVS2

8 3596.635417 1.203021 Good I1

9 4098.323944 0.616338 Good IF

10 3689.533333 0.830397 Good SI1

11 4580.260870 1.035227 Good SI2

12 3801.445988 0.757685 Good VS1

13 4262.236196 0.850787 Good VS2

14 2254.774194 0.502312 Good VVS1

15 3079.108392 0.614930 Good VVS2

16 4335.726027 1.222671 Ideal I1

17 2272.913366 0.455041 Ideal IF

18 3752.118169 0.801808 Ideal SI1

19 4755.952656 1.007925 Ideal SI2

20 3489.744497 0.674714 Ideal VS1

21 3284.550385 0.670566 Ideal VS2

22 2468.129458 0.495960 Ideal VVS1

23 3250.290100 0.586213 Ideal VVS2

24 3947.331707 1.287024 Premium I1

25 3856.143478 0.603478 Premium IF

26 4455.269371 0.908601 Premium SI1

27 5545.936928 1.144161 Premium SI2

pysqldf("SELECT AVG(price),AVG(carat),cut,clarity FROM diamonds2 GROUP by cut,clarity;")



Data Visualization

We are going to follow three main packages for data visualization in Python. They are:

- matplotlib (standard basic data visualization package)
- seaborn (advanced package for statistical graphs)
- ggplot (a port by Yhat of the ggplot2 package in R created by Hadley Wickham)

In [63]:

In [64]:

28 4485.462041 0.793308 Premium VS1

29 4550.331248 0.833774 Premium VS2

30 2831.206169 0.534821 Premium VVS1

31 3795.122989 0.654724 Premium VVS2

32 4078.226190 1.281905 Very Good I1

33 4396.216418 0.618769 Very Good IF

34 3932.391049 0.845978 Very Good SI1

35 4988.688095 1.064338 Very Good SI2

36 3805.353239 0.733307 Very Good VS1

37 4215.759552 0.811181 Very Good VS2

38 2459.441065 0.494588 Very Good VVS1

39 3037.765182 0.566389 Very Good VVS2

/home/ajay/anaconda3/lib/python3.4/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter. warnings.warn(self.msg_depr % (key, alt_key))

import matplotlib.pyplot as plt
%matplotlib inline
pd.options.display.mpl_style = 'default'
plt.style.use('ggplot')

import seaborn as sns



In [65]:


Out[65]:

<seaborn.axisgrid.JointGrid at 0xa68163ac>

sns.jointplot(x='price', y='carat', kind='hex', data=diamonds2)



In [66]:

Out[66]:

(array([ 25335., 9328., 7393., 3878., 2364., 1745., 1306., 1002., 863., 726.]), array([ 326. , 2175.7, 4025.4, 5875.1, 7724.8, 9574.5, 11424.2, 13273.9, 15123.6, 16973.3, 18823. ]), <a list of 10 Patch objects>)

plt.hist(diamonds.price)
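The output of plt.hist in Out[66] is a tuple of the counts per bin, the bin edges, and the drawn patches. With the default of 10 bins, the edges are equally spaced between the minimum and maximum price. The sketch below recomputes those edges by hand from the minimum (326) and maximum (18823); the helper bin_index is a hypothetical name of my own, not a matplotlib function.

```python
# Recompute the 11 bin edges of a 10-bin histogram from the data range.
lo, hi, bins = 326, 18823, 10          # min price, max price, default bin count
width = (hi - lo) / bins               # width of each bin

edges = [lo + i * width for i in range(bins + 1)]
print([round(e, 1) for e in edges])

def bin_index(value):
    """Return the bin a value falls into (the last bin includes hi)."""
    i = int((value - lo) // width)
    return min(i, bins - 1)

print(bin_index(326), bin_index(2401), bin_index(18823))
```

The rounded edges match the second array in Out[66], which is a quick way to convince yourself how the bins were chosen.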



In [74]:

In [67]:

Out[67]:

<matplotlib.axes._subplots.AxesSubplot at 0xa3d3ecac>

sns.distplot(diamonds.price, bins=20, kde=True, rug=False);

plt.figure();diamonds['price'].plot(kind='hist', stacked=True, bins=20)



In [68]:

Out[68]:

{'boxes': [<matplotlib.lines.Line2D at 0xa38c344c>], 'caps': [<matplotlib.lines.Line2D at 0xa38c08ac>, <matplotlib.lines.Line2D at 0xa38be38c>], 'fliers': [<matplotlib.lines.Line2D at 0xa38bb9ac>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0xa38bee8c>], 'whiskers': [<matplotlib.lines.Line2D at 0xa38c22cc>, <matplotlib.lines.Line2D at 0xa38c2d8c>]}

plt.boxplot(diamonds.price)



In [69]:

In [70]:

Out[69]:

<matplotlib.axes._subplots.AxesSubplot at 0xa3b2502c>

Out[70]:

<matplotlib.axes._subplots.AxesSubplot at 0xa38e8e2c>

diamonds['price'].plot()

plt.figure();diamonds['price'].plot(kind='box')



In [72]:

In [ ]:


ax = sns.boxplot(x="color", y="price", data=diamonds)

diamonds.plot(kind='hexbin', x='price', y='carat', gridsize=8)



In [76]:

Out[76]:

<matplotlib.axes._subplots.AxesSubplot at 0x96d078cc>

sns.kdeplot(diamonds['price'],shade= True)



In [75]:

In [77]:


Out[75]:

<seaborn.axisgrid.JointGrid at 0x9717fd8c>

sns.jointplot(x='price', y='carat', data=diamonds2)

from ggplot import *



In [78]:


Out[78]:

<ggplot: (-917530690)>

p = ggplot(aes(x='price', y='carat',color="clarity"), data=diamonds)
p + geom_point()



In [79]:

Modeling

Let's do some basic regression modeling.

In [80]:

In [81]:

In [82]:


Out[79]:

<ggplot: (-917530742)>

p = ggplot(aes(x='price', y='carat',color="cut"), data=diamonds)
p + geom_point()

import statsmodels.formula.api as sm

boston=pd.read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Boston.csv")



In [83]:

In [84]:

Out[83]:

crim zn indus chas nox rm age dis rad tax ptratio black lstat

0 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98

1 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14

2 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03

3 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94

4 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33

Out[84]:

crim zn indus chas nox rm age

crim 1.000000 0.200469 0.406583 0.055892 0.420972 0.219247 0.352734

zn 0.200469 1.000000 0.533828 0.042697 0.516604 0.311991 0.569537

indus 0.406583 0.533828 1.000000 0.062938 0.763651 0.391676 0.644779

chas 0.055892 0.042697 0.062938 1.000000 0.091203 0.091251 0.086518

nox 0.420972 0.516604 0.763651 0.091203 1.000000 0.302188 0.731470

rm 0.219247 0.311991 0.391676 0.091251 0.302188 1.000000 0.240265

age 0.352734 0.569537 0.644779 0.086518 0.731470 0.240265 1.000000

dis 0.379670 0.664408 0.708027 0.099176 0.769230 0.205246 0.747881

rad 0.625505 0.311948 0.595129 0.007368 0.611441 0.209847 0.456022

tax 0.582764 0.314563 0.720760 0.035587 0.668023 0.292048 0.506456

ptratio 0.289946 0.391679 0.383248 0.121515 0.188933 0.355501 0.261515

black 0.385064 0.175520 0.356977 0.048788 0.380051 0.128069 0.273534

lstat 0.455621 0.412995 0.603800 0.053929 0.590879 0.613808 0.602339

medv 0.388305 0.360445 0.483725 0.175260 0.427321 0.695360 0.376955

boston =boston.drop('Unnamed: 0', axis=1)

boston.head()

boston.corr()



In [85]:

In [86]:

Out[85]:

crim zn indus chas nox rm age dis rad tax ptratio

crim True False False False False False False False False False False

zn False True False False False False False False False False False

indus False False True False True False False False False False False

chas False False False True False False False False False False False

nox False False True False True False False False False False False

rm False False False False False True False False False False False

age False False False False False False True False False False False

dis False False False False False False False True False False False

rad False False False False False False False False True True False

tax False False False False False False False False True True False

ptratio False False False False False False False False False False True

black False False False False False False False False False False False

lstat False False False False False False False False False False False

medv False False False False False False False False False False False

Out[86]:

crim      -0.388305
zn         0.360445
indus     -0.483725
chas       0.175260
nox       -0.427321
rm         0.695360
age       -0.376955
dis        0.249929
rad       -0.381626
tax       -0.468536
ptratio   -0.507787
black      0.333461
lstat     -0.737663
medv       1.000000
Name: medv, dtype: float64

boston.corr()>0.75

boston.corr().medv



In [87]:

Out[87]:

OLS Regression Results

Dep. Variable:     medv               R-squared:            0.631
Model:             OLS                Adj. R-squared:       0.626
Method:            Least Squares      F-statistic:          142.0
Date:              Fri, 22 Jan 2016   Prob (F-statistic):   1.49e-104
Time:              13:22:42           Log-Likelihood:       -1588.2
No. Observations:  506                AIC:                  3190.
Df Residuals:      499                BIC:                  3220.
Df Model:          6
Covariance Type:   nonrobust

              coef   std err        t   P>|t|   [95.0% Conf. Int.]
Intercept  -0.3594     4.863   -0.074   0.941    -9.915      9.196
crim       -0.0991     0.034   -2.890   0.004    -0.167     -0.032
zn         -0.0064     0.014   -0.470   0.638    -0.033      0.020
nox       -10.8653     2.865   -3.793   0.000   -16.494     -5.237
ptratio    -1.0519     0.135   -7.796   0.000    -1.317     -0.787
black       0.0137     0.003    4.453   0.000     0.008      0.020
rm          6.9796     0.396   17.612   0.000     6.201      7.758

Omnibus:         298.859   Durbin-Watson:      0.808
Prob(Omnibus):   0.000     Jarque-Bera (JB):   3305.426
Skew:            2.385     Prob(JB):           0.00
Kurtosis:        14.577    Cond. No.           7.66e+03

import statsmodels.formula.api as sm
result = sm.ols(formula="medv ~ crim + zn + nox + ptratio + black + rm ", data=boston).fit()
result.summary()



In [88]:

Out[88]:

Intercept -0.359432crim -0.099122zn -0.006364nox -10.865295ptratio -1.051937black 0.013737rm 6.979587dtype: float64

result.params
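The coefficients in result.params are the values that minimise the sum of squared residuals. With a single predictor the formulas are short enough to write out by hand, which may help in reading the summary table. The sketch below uses made-up numbers and plain Python; it illustrates the idea only and is not how statsmodels computes the multi-variable fit.

```python
# Ordinary least squares with one predictor, written out by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]           # made-up predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]          # made-up responses

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# slope = sum of cross-deviations / sum of squared x-deviations;
# the intercept makes the line pass through the point of means
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R-squared = 1 - residual sum of squares / total sum of squares
fitted = [intercept + slope * xi for xi in x]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```

In the Boston model the same idea applies with six predictors at once, which is why each coefficient is read as the effect of one variable with the others held fixed.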



In [89]:

Out[89]:

['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cache', '_data_attr', '_get_robustcov_results', '_is_nested', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params',

dir(result)



In [90]:

In [91]:

'cov_type', 'df_model', 'df_resid', 'diagn', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_robustcov_results', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wresid']

Out[90]:

<bound method OLSResults.outlier_test of <statsmodels.regression.linear_model.OLSResults object at 0x961745cc>>

result.outlier_test



In [92]:

In [93]:

In [94]:

Decision Trees

Out[92]:

['__call__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__func__', '__ge__', '__get__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__self__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']

     student_resid       unadj_p       bonf(p)
365       5.130997  4.137329e-07  2.093488e-04
367       4.458162  1.022270e-05  5.172687e-03
368       7.350666  8.147884e-13  4.122829e-10
369       4.972797  9.097632e-07  4.603402e-04
370       4.510890  8.060499e-06  4.078612e-03
371       5.691137  2.156804e-08  1.091343e-05
372       6.272833  7.704855e-10  3.898656e-07

a=result.outlier_test

dir(a)

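def outlierTest(x):
    # Run the outlier test and print only the rows whose
    # Bonferroni-adjusted p-value is not capped at 1
    outl = x.outlier_test()
    print (outl.loc[outl['bonf(p)'] != 1])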

outlierTest(result)
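The `bonf(p)` column is the Bonferroni-corrected p-value: the unadjusted p-value multiplied by the number of observations tested and capped at 1. That is why filtering on `bonf(p) != 1` keeps only the rows with some evidence of being outliers. A minimal sketch of the correction follows; the sample size of 506 is an assumption inferred from the ratio of the two columns in the printed output, and `bonferroni` is a hypothetical helper name.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni correction: scale the p-value by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


N_OBS = 506  # assumed sample size, inferred from the printed output

# Unadjusted p-values for observations 365 and 367, copied from the output.
for label, p in [(365, 4.137329e-07), (367, 1.022270e-05)]:
    print(label, bonferroni(p, N_OBS))

# A large p-value is capped at 1, so such a row is dropped by the != 1 filter.
print(bonferroni(0.25, N_OBS))
```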



pydot is a Python interface to Graphviz's dot language. The module provides a full interface to create, handle, modify and process graphs in Graphviz's dot language.

In [95]:

In [96]:

In [97]:

In [98]:

The directory '/home/ajay/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
You are using pip version 7.1.0, however version 8.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The directory '/home/ajay/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Requirement already satisfied (use --upgrade to upgrade): pydot in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): pyparsing in /usr/lib/python2.7/dist-packages (from pydot)
Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/local/lib/python2.7/dist-packages/setuptools-18.6.1-py2.7.egg (from pydot)

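from sklearn import tree
# Note: newer scikit-learn releases no longer ship sklearn.externals.six;
# there, use `from io import StringIO` instead.
from sklearn.externals.six import StringIO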

! sudo pip install pydot
#pydot import pydot

weather=pd.read_csv('https://raw.githubusercontent.com/decisionstats/pythonfordatascience/master/data/weather.csv'

weather=weather.drop('Unnamed: 0', axis=1)



In [110]:

For decision trees to work, we need to convert the categorical variables to integer variables. To do this, we'll create an encoding function as below.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 366 entries, 0 to 365
Data columns (total 24 columns):
Date             366 non-null object
Location         366 non-null object
MinTemp          366 non-null float64
MaxTemp          366 non-null float64
Rainfall         366 non-null float64
Evaporation      366 non-null float64
Sunshine         363 non-null float64
WindGustDir      363 non-null object
WindGustSpeed    364 non-null float64
WindDir9am       335 non-null object
WindDir3pm       365 non-null object
WindSpeed9am     359 non-null float64
WindSpeed3pm     366 non-null int64
Humidity9am      366 non-null int64
Humidity3pm      366 non-null int64
Pressure9am      366 non-null float64
Pressure3pm      366 non-null float64
Cloud9am         366 non-null int64
Cloud3pm         366 non-null int64
Temp9am          366 non-null float64
Temp3pm          366 non-null float64
RainToday        366 non-null object
RISK_MM          366 non-null float64
RainTomorrow     366 non-null object
dtypes: float64(12), int64(5), object(7)
memory usage: 61.5+ KB

weather.info()



In [100]:

In [101]:

In [102]:

In [103]:

In [104]:

['MaxTemp', 'Rainfall', 'Evaporation', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm', 'RainToday']

def encode_target(df, target_columns):
    """Replace the values of each listed column in df with integer codes.

    Args
    ----
    df -- pandas DataFrame.
    target_columns -- list of columns whose values are mapped to int.

    Returns
    -------
    df_mod -- modified DataFrame.
    """
    df_mod = df.copy()
    for target_column in target_columns:
        # Number the distinct values in order of first appearance
        targets = df_mod[target_column].unique()
        map_to_int = {name: n for n, name in enumerate(targets)}
        df_mod[target_column] = df_mod[target_column].replace(map_to_int)

    return df_mod
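To see what the mapping inside `encode_target` looks like, here is a plain-Python sketch of the same idea on a short list of wind directions. The sample values and the helper `build_mapping` are illustrative assumptions; the real function takes its distinct values from pandas' `unique()`, which also returns them in order of first appearance.

```python
def build_mapping(values):
    """Map each distinct value to an integer, in order of first appearance."""
    distinct = list(dict.fromkeys(values))  # de-duplicate, keeping order
    return {name: n for n, name in enumerate(distinct)}


wind_dir = ["NW", "ENE", "NW", "SSE", "ENE"]  # made-up sample values
mapping = build_mapping(wind_dir)
encoded = [mapping[v] for v in wind_dir]

print(mapping)
print(encoded)
```

The integer codes are arbitrary labels: a tree can split on them, but their numeric order carries no meaning.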

weather_new=encode_target(weather,["RainToday","Location","WindGustDir","WindDir9am"
features= list(weather_new.columns[3:])
features.remove("RISK_MM")
target=features.pop()

y = weather_new[target]
X = weather_new[features]

good_columns = X._get_numeric_data().dropna(axis=1)
features= list(good_columns.columns)

print (features)



In [105]:

In [106]:

In [111]:

Out[111]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=20, min_weight_fraction_leaf=0.0, random_state=99, splitter='best')

dt = tree.DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt=dt.fit(good_columns, y)

tree.export_graphviz(dt,out_file="tree.dot")

dt
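The classifier is created with the default `criterion='gini'`, so each split is chosen to reduce Gini impurity. As a rough sketch of that measure (not scikit-learn's internal implementation), the functions below compute the impurity of a node and the size-weighted impurity after a split. The label lists and helper names are made-up examples.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


parent = ["Yes", "Yes", "No", "No", "No", "No"]  # hypothetical labels
left, right = ["Yes", "Yes", "No"], ["No", "No", "No"]

print(gini(parent))
print(split_impurity(left, right))
```

A split is worthwhile when the weighted impurity of the children is lower than the impurity of the parent node.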



In [112]:

Out[112]:

['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_get_param_names', 'class_weight', 'classes_', 'criterion', 'feature_importances_', 'fit', 'fit_transform', 'get_params', 'max_depth', 'max_features', 'max_features_', 'max_leaf_nodes', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_classes_', 'n_features_', 'n_outputs_', 'predict', 'predict_log_proba', 'predict_proba',

dir(dt)



In [116]:

In [107]:

In [121]:

In [108]:

In [109]:

In [117]:

In [120]:

In [ ]:

'random_state', 'score', 'set_params', 'splitter', 'transform', 'tree_']

Out[116]:

<bound method DecisionTreeClassifier.score of DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=20, min_weight_fraction_leaf=0.0, random_state=99, splitter='best')>

Out[108]:

'/home/ajay/Desktop/test'

Out[109]:

['tree.dot', 'adult.data.txt']

dt.score
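`dt.score` prints as a bound method here because it is not called. When called as `dt.score(X, y)`, it returns the mean accuracy, that is, the share of rows whose predicted label equals the true label. A minimal plain-Python sketch of that calculation, with made-up labels and a hypothetical helper `accuracy`:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = ["No", "No", "Yes", "No", "Yes"]   # made-up true labels
y_pred = ["No", "Yes", "Yes", "No", "Yes"]  # made-up predictions

print(accuracy(y_true, y_pred))
```

Scoring on the same rows that were used for fitting tends to overstate accuracy; a held-out test set usually gives a fairer estimate.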

import os

#import pydot

os.getcwd()

os.listdir(os.getcwd())

#from IPython.display import Image
#dot_data = StringIO()

#graph = pydot.graph_from_dot_data(tree.dot.getvalue())

#You can use Pydot from Python 2, or use Graphviz for reading the dot file
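If the Graphviz command-line tools are installed, `tree.dot` can be rendered without pydot by calling the `dot` program. The sketch below assumes `dot` is on the PATH and that `tree.dot` is in the working directory; `build_dot_command` and `render_dot` are hypothetical helper names, and the output file name `tree.png` is an arbitrary choice.

```python
import shutil
import subprocess


def build_dot_command(dot_path, png_path):
    """Return the Graphviz command that renders a .dot file as a PNG."""
    return ["dot", "-Tpng", dot_path, "-o", png_path]


def render_dot(dot_path, png_path):
    """Render dot_path to png_path; return False if Graphviz is not installed."""
    if shutil.which("dot") is None:
        return False
    subprocess.run(build_dot_command(dot_path, png_path), check=True)
    return True


print(build_dot_command("tree.dot", "tree.png"))
# render_dot("tree.dot", "tree.png")  # uncomment once tree.dot exists
```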
