a data science tutorial in python
TRANSCRIPT
1/22/2016 Tutorial in Python
http://localhost:8888/notebooks/Dropbox/PYTHON%20BOOK%20WILEY/FINAL/Tutorial%20in%20Python.ipynb# 1/50
A Complete Tutorial for Data Science in Python
Python is an amazing language. It was created by Guido van Rossum. You can read Guido's history of Python at the Python History Blog at http://pythonhistory.blogspot.in/2009/01/introductionandoverview.html
Here we present a comprehensive tutorial on using Python for data science. Data science lies at the intersection of programming, statistics and business analysis. It is the use of programming tools with statistical techniques to analyze data in a systematic and scientific way. Accordingly, this tutorial focuses on at least the statistical and programming parts of data science. Data scientists may also be interested in the PyData community at http://pydata.org/
Note: I am writing this article within the Jupyter notebook, a Python interface derived from IPython.
Markdown Tip within Jupyter
I can also write this text within Jupyter by changing the cell type to Markdown in the dropdown. In Markdown, changing the font size is easy: prefix the text with #, ## or ### (the more # characters, the smaller the font, as the text changes from header 1 to header 2 to header 3). For a non-numbered list, prefix each item with a hyphen (-).
- Markdown
- within Jupyter
- is just a # in front of words
- and changing the cell type to Markdown
This is a list made by adding a hyphen in front of the words.
Installation of Python Packages
Python packages are installed using pip or easy_install (from setuptools). Here we show how to install the Pandas package from within the Jupyter Notebook itself. I use the upgrade flag to upgrade it, and I install Bokeh using easy_install. Pandas is the Python library for data analysis, and Bokeh helps make interactive data analysis available. Note the ! sign before the sudo command: it lets me use the terminal without leaving the comfort of my Jupyter Notebook. I can also install Python packages using conda, which is my preferred method for data science since I can create custom environments for projects.
The complete Python Package Index (PyPI) is at https://pypi.python.org/pypi. PyPI had 71833 packages as of December 30, 2015.
In [1]:
In [2]:
In [3]:
Loading a Python Package
You can load a Python package in the following ways:
import PACKAGE
import PACKAGE as PK
from PACKAGE import FUN
The directory '/home/ajay/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
You are using pip version 7.1.0, however version 8.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The directory '/home/ajay/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Requirement already up-to-date: pandas in /usr/local/lib/python2.7/dist-packages
Requirement already up-to-date: python-dateutil in /usr/local/lib/python2.7/dist-packages (from pandas)
Requirement already up-to-date: pytz>=2011k in /usr/local/lib/python2.7/dist-packages (from pandas)
Requirement already up-to-date: numpy>=1.7.0 in /usr/local/lib/python2.7/dist-packages (from pandas)
Requirement already up-to-date: six>=1.5 in /usr/local/lib/python2.7/dist-packages (from python-dateutil->pandas)
Searching for bokeh
Best match: bokeh 0.10.0
Processing bokeh-0.10.0-py2.7.egg
bokeh 0.10.0 is already the active version in easy-install.pth
Installing bokeh-server script to /usr/local/bin
Installing websocket_worker.py script to /usr/local/bin
Using /usr/local/lib/python2.7/dist-packages/bokeh-0.10.0-py2.7.egg
Processing dependencies for bokeh
Finished processing dependencies for bokeh
! sudo pip install pandas --upgrade
! sudo easy_install bokeh
#! conda install seaborn
You can then invoke the function using PACKAGE.FUN, PK.FUN and FUN respectively.
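The three import forms can be sketched with the standard-library math module standing in for PACKAGE (the choice of math and sqrt is only for illustration; any package and function would do):

```python
# Three ways to reach the same function, with math standing in for PACKAGE.
import math                 # import PACKAGE
import math as m            # import PACKAGE as PK
from math import sqrt       # from PACKAGE import FUN

a = math.sqrt(16)   # PACKAGE.FUN
b = m.sqrt(16)      # PK.FUN
c = sqrt(16)        # FUN

print(a, b, c)
```

All three names point at the same function object, so the choice is mostly about how much typing and how much namespace clutter you are willing to accept.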
In [4]:
In [5]:
The Python Package Index (PyPI) at https://pypi.python.org/pypi hosts thousands of third-party modules for Python.
You can browse Python packages by topic at https://pypi.python.org/pypi?%3Aaction=browse
Import Data
Let's import some datasets.
In [6]:
In [7]:
In [8]:
Out[4]:
datetime.datetime(2016, 1, 22, 13, 4, 3, 39744)
Out[7]:
'/home/ajay/Dropbox/PYTHON BOOK WILEY/FINAL'
from datetime import datetime
Starttime = datetime.now()
Starttime
import pandas as pd
# In case the file is stored locally we can use the os Python library
import os
os.getcwd() #current working directory
os.chdir('/home/ajay/Desktop/test')
In [9]:
In [10]:
In [11]:
We will use the diamonds dataset, which ships with R's ggplot2 package, from the Rdatasets collection at https://vincentarelbundock.github.io/Rdatasets/datasets.html
In [12]:
In [13]:
So we got a rough estimate of the time the code took to execute through the datetime.timedelta object above. Also, read_csv is just one of the many convenient ways we can read data through the pandas library in Python. However, Python lacks an equivalent of R's lubridate (for easier date-time manipulation) as well as R's data.table package, which makes import and manipulation faster.
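The timing idea above can be sketched on its own. The loop below is only a stand-in for real work, and the variable names are illustrative rather than taken from the notebook:

```python
from datetime import datetime, timedelta

start = datetime.now()
total = sum(i * i for i in range(100000))  # stand-in for the work being timed
elapsed = datetime.now() - start           # subtracting datetimes gives a timedelta

# total_seconds() collapses days, seconds and microseconds into one float.
print(elapsed.total_seconds())

# The value shown in the notebook, timedelta(0, 5, 689405), reads as
# 0 days, 5 seconds and 689405 microseconds.
shown = timedelta(days=0, seconds=5, microseconds=689405)
print(shown.total_seconds())
```

Because the start time was taken several cells earlier, the notebook's figure includes everything run in between, which is why it is only a rough estimate.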
In [14]:
Out[9]:
['adult.data.txt']
Out[11]:
32561
Out[13]:
datetime.timedelta(0, 5, 689405)
Out[14]:
pandas.core.frame.DataFrame
a=os.getcwd()
os.listdir(a)
adult=pd.read_csv("adult.data.txt",header=None)
len(adult)
diamonds = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv")
datetime.now()- Starttime
type(diamonds) #this works just like class(object) in R
In [15]:
To find out more about the objects in the session, you can use locals() and globals().
Data Inspection
We get the column names, the column types and general information about the data through the columns, dtypes and info commands below. In R we would get this with the str command (for structure). In Python, str turns the object into a string. (This is just one of the ways people can get confused moving between data science languages.)
In R we use the names function for variable names and length for the length of an object (for a data frame, nrow gives the number of rows), while Python uses columns and len respectively.
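As a small illustration of these inspection calls, here is a sketch on a made-up three-row frame (the values are placeholders, not the real diamonds data); the R equivalents in the comments are given from memory as a rough guide:

```python
import pandas as pd

# A tiny made-up frame standing in for the diamonds data.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 334],
})

col_names = list(df.columns)   # R: names(df)
n_rows = len(df)               # R: nrow(df)
n_rows_cols = df.shape         # R: dim(df)
as_text = str(n_rows)          # Python's str() converts to a string

print(col_names, n_rows, n_rows_cols, repr(as_text))
print(df.dtypes)               # one dtype per column
```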
In [16]:
Out[15]:
['T', '_AXIS_ALIASES', '_AXIS_IALIASES', '_AXIS_LEN', '_AXIS_NAMES', '_AXIS_NUMBERS', '_AXIS_ORDERS', '_AXIS_REVERSED', '_AXIS_SLICEMAP', '__abs__', '__add__', '__and__', '__array__', '__array_wrap__', '__bool__', '__bytes__', '__class__', '__contains__',
Out[16]:
Index(['Unnamed: 0', 'carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z'], dtype='object')
# To find out which functions are available for an object we can use the dir command
dir(diamonds)
diamonds.columns
# In Python as well as R, a single-line comment starts with #
# Names of variables are given by columns. In R we would use the command names(object)
# Note also that R uses the FUNCTION(OBJECTNAME) syntax while Python uses OBJECTNAME.FUNCTION
In [17]:
In [18]:
In [19]:
In [20]:
Out[17]:
Unnamed: 0 int64
carat float64
cut object
color object
clarity object
depth float64
table float64
price int64
x float64
y float64
z float64
dtype: object
Out[18]:
53940
Out[19]:
5.394
Out[20]:
5
diamonds.dtypes
len(diamonds) #gives the number of rows
0.0001*len(diamonds)
round(0.0001*len(diamonds))
In [21]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 0 to 53939
Data columns (total 11 columns):
Unnamed: 0 53940 non-null int64
carat 53940 non-null float64
cut 53940 non-null object
color 53940 non-null object
clarity 53940 non-null object
depth 53940 non-null float64
table 53940 non-null float64
price 53940 non-null int64
x 53940 non-null float64
y 53940 non-null float64
z 53940 non-null float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.3+ MB
'''Let's get some information on the object.
This was a multiple line comment using three single quote marks'''
diamonds.info()
In [22]:
Data Munging
To refer to a particular row in Python I can use the index or .ix.
In R I refer to the object in the i-th row and j-th column by OBJECTNAME[i,j].
In R I refer to a column by OBJECTNAME$ColumnName, while in Python I would use OBJECTNAME["ColumnName"].
Note that in Python the index starts with 0, while in R it starts with 1.
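The .ix indexer used in this notebook was deprecated in pandas 0.20 and removed in pandas 1.0. On newer versions the same selections can be written with .loc (by label) and .iloc (by position). A minimal sketch on a made-up frame, not the diamonds data:

```python
import pandas as pd

# Made-up frame with the default 0-based integer index.
df = pd.DataFrame({
    "color": list("EEEIJJ"),
    "price": [326, 326, 327, 334, 335, 336],
})

by_label = df.loc[2:4]         # label-based: both endpoints included (3 rows)
by_position = df.iloc[2:4]     # position-based: end excluded (2 rows)
one_cell = df.loc[0, "price"]  # row label 0, column "price"
one_column = df["color"]       # R: df$color

print(len(by_label), len(by_position), one_cell)
```

With the default integer index, .loc[20:30] returns the same eleven rows that .ix[20:30] returns in the notebook, since label slices include both endpoints.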
Out[22]:
Unnamed: 0 carat cut color clarity depth table price x y z
0 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
1 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
4 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
5 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
6 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
7 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
8 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
9 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
diamonds.head(10) #we check the first 10 rows in the dataset
In [23]:
Out[23]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
diamonds2=diamonds.drop('Unnamed: 0', axis=1) #Dropping a particular variable
diamonds2.head()
In [24]:
In [25]:
Out[24]:
Unnamed: 0 carat cut color clarity depth table price x y z
20 21 0.30 Good I SI2 63.3 56 351 4.26 4.30 2.71
21 22 0.23 Very Good E VS2 63.8 55 352 3.85 3.92 2.48
22 23 0.23 Very Good H VS1 61.0 57 353 3.94 3.96 2.41
23 24 0.31 Very Good J SI1 59.4 62 353 4.39 4.43 2.62
24 25 0.31 Very Good J SI1 58.1 62 353 4.44 4.47 2.59
25 26 0.23 Very Good G VVS2 60.4 58 354 3.97 4.01 2.41
26 27 0.24 Premium I VS1 62.5 57 355 3.97 3.94 2.47
27 28 0.30 Very Good J VS2 62.2 57 357 4.28 4.30 2.67
28 29 0.23 Very Good D VS2 60.5 61 357 3.96 3.97 2.40
29 30 0.23 Very Good F VS1 60.9 57 357 3.96 3.99 2.42
30 31 0.23 Very Good F VS1 60.0 57 402 4.00 4.03 2.41
Out[25]:
20 Good
21 Very Good
22 Very Good
23 Very Good
24 Very Good
25 Very Good
Name: cut, dtype: object
diamonds.ix[20:30] #refers to the 21st to 31st row
# To refer to a particular column I use its name
# I can also chain the commands
diamonds.ix[20:25].cut
In [26]:
In [27]:
Out[26]:
20 I
21 E
22 H
23 J
24 J
25 G
Name: color, dtype: object
Out[27]:
color cut price
0 E Ideal 326
1 E Premium 326
2 E Good 327
3 I Premium 334
4 J Good 335
diamonds.ix[20:25]["color"]
diamonds[['color','cut','price']].head() #Note the double square brackets [[]]
In [28]:
Out[28]:
color cut price
10 J Good 339
11 J Ideal 340
12 F Premium 342
13 J Ideal 344
14 E Premium 345
15 E Premium 345
16 I Ideal 348
17 J Good 351
18 J Good 351
19 J Very Good 351
20 I Good 351
diamonds.ix[10:20,['color','cut','price']]
# Note how I placed the row index numbers and column names within the double SQUARE brackets.
# This is more elaborate than R, isn't it.
In [29]:
Out[29]:
Unnamed: 0 carat cut color clarity depth table price x y
23644 23645 3.65 Fair H I1 67.1 53 11668 9.53 9.48
24131 24132 3.24 Premium H I1 62.1 58 12300 9.44 9.40
24297 24298 3.22 Ideal I I1 62.6 55 12545 9.49 9.42
24328 24329 3.50 Ideal H I1 62.8 57 12587 9.65 9.59
25998 25999 4.01 Premium I I1 61.0 61 15223 10.14 10.10
25999 26000 4.01 Premium J I1 62.5 62 15223 10.02 9.94
26431 26432 3.40 Fair D I1 66.8 52 15964 9.42 9.34
26444 26445 4.00 Very Good I I1 63.3 58 15984 10.01 9.94
26534 26535 3.67 Premium I I1 62.4 56 16193 9.86 9.81
27130 27131 4.13 Fair H I1 64.8 61 17329 10.00 9.85
27415 27416 5.01 Fair J I1 65.5 59 18018 10.74 10.54
27630 27631 4.50 Fair J I1 65.8 58 18531 10.23 10.16
27679 27680 3.51 Premium J VS2 62.5 59 18701 9.66 9.63
# Let's try conditional selection
diamonds[diamonds['carat']>3.2]
In [30]:
In [31]:
Random Sample
Since Python does not have any package like dplyr, it needs numpy for more elaborate operations. Here we take a random sample of a Pandas data frame.
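One detail worth knowing: np.random.choice draws with replacement unless replace=False is passed, so the same row can be picked twice. The sketch below, on made-up data, shows the numpy route without replacement alongside pandas' own DataFrame.sample method; the seed values are arbitrary and only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.arange(100, 200)})  # made-up data, 100 rows

# numpy route: pick 5 distinct row labels, then look them up.
rng = np.random.RandomState(0)  # seeded generator (arbitrary seed)
labels = rng.choice(df.index.values, size=5, replace=False)
picked = df.loc[labels]

# pandas route: sample() draws without replacement by default.
sampled = df.sample(n=5, random_state=0)

print(picked)
print(sampled)
```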
In [32]:
In [33]:
Out[30]:
Unnamed: 0 carat cut color clarity depth table price x y
21758 21759 3.11 Fair J I1 65.9 57 9823 9.15 9.02
25999 26000 4.01 Premium J I1 62.5 62 15223 10.02 9.94
26467 26468 3.01 Ideal J SI2 61.7 58 16037 9.25 9.20
26744 26745 3.01 Ideal J I1 65.4 60 16538 8.99 8.93
27415 27416 5.01 Fair J I1 65.5 59 18018 10.74 10.54
27630 27631 4.50 Fair J I1 65.8 58 18531 10.23 10.16
27679 27680 3.51 Premium J VS2 62.5 59 18701 9.66 9.63
27684 27685 3.01 Premium J SI2 60.7 59 18710 9.35 9.22
27685 27686 3.01 Premium J SI2 59.7 58 18710 9.41 9.32
Out[31]:
(13791, 11)
[34159 23971 31335 1895 28279]
## Let's try multiple conditions. We use the query command.
diamonds.query('carat >3 and color =="J"')
diamonds3=diamonds.query('price >28000 or cut =="Premium"')
diamonds3.shape
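For comparison, the same kind of multi-condition filter can be written with boolean masks instead of query. This is a sketch on a made-up four-row frame, not the real data:

```python
import pandas as pd

df = pd.DataFrame({
    "carat": [3.11, 4.01, 0.90, 3.01],
    "color": ["J", "J", "J", "E"],
    "cut": ["Fair", "Premium", "Ideal", "Premium"],
})

# query() takes the condition as a string ...
via_query = df.query('carat > 3 and color == "J"')

# ... while boolean masks use & and |, with each comparison in parentheses.
via_mask = df[(df["carat"] > 3) & (df["color"] == "J")]
either = df[(df["carat"] > 4) | (df["cut"] == "Ideal")]

print(via_query.equals(via_mask), len(either))
```

The parentheses matter because & and | bind more tightly than the comparison operators in Python.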
import numpy as np
rows = np.random.choice(diamonds.index.values, round(0.0001*len(diamonds)))
print(rows)
In [34]:
In [35]:
Summaries
We now do summaries for numerical and categorical data.
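The kinds of summary used in this section can be sketched together on a made-up six-row frame (values invented so the results are easy to check by hand):

```python
import pandas as pd

df = pd.DataFrame({
    "cut": ["Fair", "Good", "Good", "Ideal", "Ideal", "Ideal"],
    "color": ["D", "D", "E", "E", "E", "D"],
    "price": [300, 400, 500, 600, 700, 800],
})

numeric_summary = df["price"].describe()      # count, mean, std, quartiles
category_counts = df["cut"].value_counts()    # frequency of each level
table = pd.crosstab(df["cut"], df["color"], margins=True)  # adds "All" totals

print(numeric_summary["mean"])
print(category_counts)
print(table)
```

describe suits numeric columns, while value_counts and crosstab are the natural tools for categorical ones.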
In [36]:
Out[34]:
Unnamed: 0 carat cut color clarity depth table price x y z
34159 34160 0.33 Ideal G VS1 62.1 55.0 854 4.46 4.43 2.76
23971 23972 1.51 Very Good H VS2 62.4 55.6 12108 7.28 7.33 4.56
31335 31336 0.41 Ideal G SI1 61.9 54.0 759 4.77 4.82 2.97
1895 1896 0.73 Ideal E VS2 62.7 56.0 3077 5.75 5.80 3.62
28279 28280 0.31 Premium J SI1 60.9 60.0 363 4.36 4.38 2.66
Out[36]:
Unnamed: 0 carat depth table price x
count 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000
mean 26970.500000 0.797940 61.749405 57.457184 3932.799722 5.731157
std 15571.281097 0.474011 1.432621 2.234491 3989.439738 1.121761
min 1.000000 0.200000 43.000000 43.000000 326.000000 0.000000
25% 13485.750000 0.400000 61.000000 56.000000 950.000000 4.710000
50% 26970.500000 0.700000 61.800000 57.000000 2401.000000 5.700000
75% 40455.250000 1.040000 62.500000 59.000000 5324.250000 6.540000
max 53940.000000 5.010000 79.000000 95.000000 18823.000000 10.740000
diamonds.ix[rows]
##Missing Values
diamonds= diamonds.dropna(how='any')
diamonds.describe()
In [37]:
In [38]:
Out[37]:
count 53940.000000
mean 3932.799722
std 3989.439738
min 326.000000
25% 950.000000
50% 2401.000000
75% 5324.250000
max 18823.000000
Name: price, dtype: float64
Out[38]:
Unnamed: 0 carat depth table price x y
Unnamed: 0 1.000000 0.377983 0.034800 0.100830 0.306873 0.405440 0.395843
carat 0.377983 1.000000 0.028224 0.181618 0.921591 0.975094 0.951722
depth 0.034800 0.028224 1.000000 0.295779 0.010647 0.025289 0.029341
table 0.100830 0.181618 0.295779 1.000000 0.127134 0.195344 0.183760
price 0.306873 0.921591 0.010647 0.127134 1.000000 0.884435 0.865421
x 0.405440 0.975094 0.025289 0.195344 0.884435 1.000000 0.974701
y 0.395843 0.951722 0.029341 0.183760 0.865421 0.974701 1.000000
z 0.399208 0.953387 0.094924 0.150929 0.861249 0.970772 0.952006
diamonds.price.describe()
diamonds.corr() #Numerical correlations
In [39]:
In [40]:
In [41]:
In [42]:
Out[39]:
Unnamed: 0 carat depth table price x y z
Unnamed: 0 True False False False False False False False
carat False True False False True True True True
depth False False True False False False False False
table False False False True False False False False
price False True False False True True True True
x False True False False True True True True
y False True False False True True True True
z False True False False True True True True
Out[40]:
array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'], dtype=object)
Out[41]:
array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)
Out[42]:
Ideal 21551
Premium 13791
Very Good 12082
Good 4906
Fair 1610
Name: cut, dtype: int64
diamonds.corr()>0.5
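A True/False grid gets hard to read as the number of columns grows. One possible alternative, sketched here on made-up numbers, lists each strongly correlated pair once. Two choices in it are assumptions of this sketch rather than part of the notebook: it restricts the frame to numeric columns first (newer pandas releases may refuse to correlate text columns), and it uses the absolute value so that strong negative correlations are caught as well.

```python
import pandas as pd

df = pd.DataFrame({
    "carat": [0.2, 0.4, 0.6, 0.8, 1.0],
    "price": [300, 700, 1100, 1500, 1900],   # rises exactly with carat
    "depth": [62.0, 61.0, 63.0, 60.5, 62.5],
    "cut": ["Fair", "Good", "Good", "Ideal", "Ideal"],  # non-numeric
})

# Keep only numeric columns so corr() is not handed text columns.
corr = df.select_dtypes(include=["number"]).corr()

# List each pair once (upper triangle) where |r| exceeds the threshold.
cols = list(corr.columns)
strong = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.5
]
print(strong)
```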
# I use unique to get unique values. That is useful for categorical and character data.
diamonds['clarity'].unique()
diamonds['cut'].unique()
# To get the distribution across values of categorical variables I can use the value_counts command
pd.value_counts(diamonds.cut)
In [43]:
In [44]:
In [45]:
Out[43]:
G 11292
E 9797
F 9542
H 8304
D 6775
I 5422
J 2808
Name: color, dtype: int64
Out[44]:
color D E F G H I J
cut
Fair 163 224 312 314 303 175 119
Good 662 933 909 871 702 522 307
Ideal 2834 3903 3826 4884 3115 2093 896
Premium 1603 2337 2331 2924 2360 1428 808
Very Good 1513 2400 2164 2299 1824 1204 678
Out[45]:
color D E F G H I J All
cut
Fair 163 224 312 314 303 175 119 1610
Good 662 933 909 871 702 522 307 4906
Ideal 2834 3903 3826 4884 3115 2093 896 21551
Premium 1603 2337 2331 2924 2360 1428 808 13791
Very Good 1513 2400 2164 2299 1824 1204 678 12082
All 6775 9797 9542 11292 8304 5422 2808 53940
pd.value_counts(diamonds.color)
# The crosstab command makes a cross-tabulation.
pd.crosstab(diamonds.cut,diamonds.color)
# Adding margins=True gives the row and column totals in a cross-tabulation
pd.crosstab(diamonds.cut,diamonds.color,margins=True)
In [46]:
In [47]:
In [48]:
In [49]:
Out[46]:
color D E F G H I J All
cut
Fair 163 224 312 314 303 175 119 1610
Good 662 933 909 871 702 522 307 4906
Ideal 2834 3903 3826 4884 3115 2093 896 21551
Premium 1603 2337 2331 2924 2360 1428 808 13791
Very Good 1513 2400 2164 2299 1824 1204 678 12082
All 6775 9797 9542 11292 8304 5422 2808 53940
Out[48]:
pandas.core.groupby.DataFrameGroupBy
Out[49]:
cut
Fair 3282.0
Good 3050.5
Ideal 1810.0
Premium 3185.0
Very Good 2648.0
Name: price, dtype: float64
pd.crosstab(diamonds.cut,diamonds.color,margins=True)
# To do a groupby analysis we can use the groupby command. This two-step method is more similar to dplyr in R
cutgroup=diamonds.groupby(diamonds.cut)
type(cutgroup)
cutgroup.price.median()
In [50]:
In [51]:
In [52]:
Out[50]:
cut price
0 Fair 3282.0
1 Good 3050.5
2 Ideal 1810.0
3 Premium 3185.0
4 Very Good 2648.0
Out[51]:
0 1 2 3 4
cut Fair Good Ideal Premium Very Good
price 3282 3050.5 1810 3185 2648
Out[52]:
<pandas.core.groupby.DataFrameGroupBy object at 0xaad3a36c>
cutgroup.price.median().reset_index()
d=cutgroup.price.median().reset_index()
# transpose turns row values to columns
d.transpose()
# We can group by multiple columns
diamonds.groupby(['cut', "color"])
In [53]:
Out[53]:
cut color price
0 Fair D 3730.0
1 Fair E 2956.0
2 Fair F 3035.0
3 Fair G 3057.0
4 Fair H 3816.0
5 Fair I 3246.0
6 Fair J 3302.0
7 Good D 2728.5
8 Good E 2420.0
9 Good F 2647.0
10 Good G 3340.0
11 Good H 3468.5
12 Good I 3639.5
13 Good J 3733.0
14 Ideal D 1576.0
15 Ideal E 1437.0
16 Ideal F 1775.0
17 Ideal G 1857.5
18 Ideal H 2278.0
19 Ideal I 2659.0
20 Ideal J 4096.0
21 Premium D 2009.0
22 Premium E 1928.0
23 Premium F 2841.0
24 Premium G 2745.0
25 Premium H 4511.0
26 Premium I 4640.0
27 Premium J 5063.0
28 Very Good D 2310.0
diamonds.groupby(['cut', "color"]).price.median().reset_index()
In [54]:
In [55]:
In [56]:
Using SQL
29 Very Good E 1989.5
30 Very Good F 2471.0
31 Very Good G 2437.0
32 Very Good H 3734.0
33 Very Good I 3888.0
34 Very Good J 4113.0
Out[54]:
color D E F G H I J
cut
Fair 3730.0 2956.0 3035 3057.0 3816.0 3246.0 3302
Good 2728.5 2420.0 2647 3340.0 3468.5 3639.5 3733
Ideal 1576.0 1437.0 1775 1857.5 2278.0 2659.0 4096
Premium 2009.0 1928.0 2841 2745.0 4511.0 4640.0 5063
Very Good 2310.0 1989.5 2471 2437.0 3734.0 3888.0 4113
Out[56]:
color D E F G H I J
cut
Fair False False False False False False False
Good False False False False False False False
Ideal False False False False False False True
Premium False False False False True True True
Very Good False False False False False False True
e=diamonds.groupby(['cut', "color"]).price.median().reset_index()
e.pivot(index='cut', columns='color', values='price')
# The pivot command further helps to look at the data in a pivot table format.
f=e.pivot(index='cut', columns='color', values='price')
f>4000
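The group-then-pivot sequence above can also be collapsed into a single pivot_table call. The sketch below uses a made-up six-row frame to show that both routes give the same table; the data and variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "cut": ["Fair", "Fair", "Fair", "Good", "Good", "Good"],
    "color": ["D", "D", "E", "D", "E", "E"],
    "price": [100, 300, 500, 200, 400, 800],
})

# Two steps, as in the text: group, take the median, then pivot.
e = df.groupby(["cut", "color"]).price.median().reset_index()
two_step = e.pivot(index="cut", columns="color", values="price")

# One step: pivot_table groups and aggregates in a single call.
one_step = df.pivot_table(index="cut", columns="color",
                          values="price", aggfunc="median")

print(two_step)
print(two_step > 250)   # elementwise comparison, like f>4000 above
```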
Python does have the pandasql package thanks to the lovely team at YHat (who also made the Rodeo IDE). It is similar to the sqldf package in R in that it allows the user to write SQL queries against the data frame object.
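To see what such queries do without installing pandasql, here is a sketch using only the standard-library sqlite3 module. This is a swapped-in technique, not pandasql itself, and the four rows are made up to stand in for the diamonds2 table:

```python
import sqlite3

# Made-up rows standing in for the diamonds2 table: (carat, color, price).
rows = [
    (0.23, "E", 326),
    (0.21, "E", 328),
    (0.31, "J", 335),
    (4.01, "J", 15223),
]

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE diamonds2 (carat REAL, color TEXT, price INTEGER)")
conn.executemany("INSERT INTO diamonds2 VALUES (?, ?, ?)", rows)

mean_by_color = conn.execute(
    "SELECT color, AVG(price) AS mean_price "
    "FROM diamonds2 GROUP BY color ORDER BY color"
).fetchall()
big_stones = conn.execute(
    "SELECT carat, color FROM diamonds2 WHERE carat > 4"
).fetchall()
conn.close()

print(mean_by_color)
print(big_stones)
```

The SQL itself is the same kind of statement passed to pysqldf in the cells that follow; only the plumbing around it differs.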
In [57]:
In [58]:
In [59]:
Out[58]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
9 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
Out[59]:
carat cut color clarity depth table price x y z
0 4.01 Premium I I1 61.0 61 15223 10.14 10.10 6.17
1 4.01 Premium J I1 62.5 62 15223 10.02 9.94 6.24
2 4.13 Fair H I1 64.8 61 17329 10.00 9.85 6.43
3 5.01 Fair J I1 65.5 59 18018 10.74 10.54 6.98
4 4.50 Fair J I1 65.8 58 18531 10.23 10.16 6.72
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM diamonds2 LIMIT 10 ; ")
# You can get an error if your Pandas data frame has a column name that is not valid in SQLite tables.
# Therefore we used the diamonds dataset after dropping the first column
# (i.e. diamonds2=diamonds.drop('Unnamed: 0', axis=1) #Dropping a particular variable)
pysqldf("SELECT * FROM diamonds2 WHERE carat >4 ;")
In [60]:
In [61]:
Out[60]:
carat cut color clarity depth table price x y z
0 4.01 Premium J I1 62.5 62 15223 10.02 9.94 6.24
1 5.01 Fair J I1 65.5 59 18018 10.74 10.54 6.98
2 4.50 Fair J I1 65.8 58 18531 10.23 10.16 6.72
Out[61]:
mean_price color
0 3169.954096 D
1 3076.752475 E
2 3724.886397 F
3 3999.135671 G
4 4486.669196 H
5 5091.874954 I
6 5323.818020 J
pysqldf("SELECT * FROM diamonds2 WHERE color =='J' and carat>4 ;")
pysqldf("SELECT AVG(price) AS mean_price,color FROM diamonds2 GROUP by color;")
In [62]:
Out[62]:
AVG(price) AVG(carat) cut clarity
0 3703.533333 1.361000 Fair I1
1 1912.333333 0.474444 Fair IF
2 4208.279412 0.964632 Fair SI1
3 5173.916309 1.203841 Fair SI2
4 4165.141176 0.879824 Fair VS1
5 4174.724138 0.885249 Fair VS2
6 3871.352941 0.664706 Fair VVS1
7 3349.768116 0.691594 Fair VVS2
8 3596.635417 1.203021 Good I1
9 4098.323944 0.616338 Good IF
10 3689.533333 0.830397 Good SI1
11 4580.260870 1.035227 Good SI2
12 3801.445988 0.757685 Good VS1
13 4262.236196 0.850787 Good VS2
14 2254.774194 0.502312 Good VVS1
15 3079.108392 0.614930 Good VVS2
16 4335.726027 1.222671 Ideal I1
17 2272.913366 0.455041 Ideal IF
18 3752.118169 0.801808 Ideal SI1
19 4755.952656 1.007925 Ideal SI2
20 3489.744497 0.674714 Ideal VS1
21 3284.550385 0.670566 Ideal VS2
22 2468.129458 0.495960 Ideal VVS1
23 3250.290100 0.586213 Ideal VVS2
24 3947.331707 1.287024 Premium I1
25 3856.143478 0.603478 Premium IF
26 4455.269371 0.908601 Premium SI1
27 5545.936928 1.144161 Premium SI2
pysqldf("SELECT AVG(price),AVG(carat),cut,clarity FROM diamonds2 GROUP by cut,clarity;")
Data Visualization
We are going to follow three main packages for data visualization in Python. They are:
- matplotlib (standard basic data visualization package)
- seaborn (advanced package for statistical graphs)
- ggplot (a port by Yhat of the ggplot2 package in R created by Hadley Wickham)
In [63]:
In [64]:
28 4485.462041 0.793308 Premium VS1
29 4550.331248 0.833774 Premium VS2
30 2831.206169 0.534821 Premium VVS1
31 3795.122989 0.654724 Premium VVS2
32 4078.226190 1.281905 Very Good I1
33 4396.216418 0.618769 Very Good IF
34 3932.391049 0.845978 Very Good SI1
35 4988.688095 1.064338 Very Good SI2
36 3805.353239 0.733307 Very Good VS1
37 4215.759552 0.811181 Very Good VS2
38 2459.441065 0.494588 Very Good VVS1
39 3037.765182 0.566389 Very Good VVS2
/home/ajay/anaconda3/lib/python3.4/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter. warnings.warn(self.msg_depr % (key, alt_key))
import matplotlib.pyplot as plt
%matplotlib inline
pd.options.display.mpl_style = 'default'
plt.style.use('ggplot')
import seaborn as sns
In [65]:
/home/ajay/anaconda3/lib/python3.4/site-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter. warnings.warn(self.msg_depr % (key, alt_key))
Out[65]:
<seaborn.axisgrid.JointGrid at 0xa68163ac>
sns.jointplot('price','carat',kind='hex',data=diamonds2)
In [66]:
Out[66]:
(array([ 25335., 9328., 7393., 3878., 2364., 1745., 1306., 1002., 863., 726.]), array([ 326. , 2175.7, 4025.4, 5875.1, 7724.8, 9574.5, 11424.2, 13273.9, 15123.6, 16973.3, 18823. ]), <a list of 10 Patch objects>)
plt.hist(diamonds.price)
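The tuple shown in Out[66] holds the bin counts, the bin edges and the drawn patches. The first two can be reproduced without plotting using np.histogram. The sketch below uses made-up prices in which only the minimum (326) and maximum (18823) match the real data, so the edges agree with Out[66] but the counts do not:

```python
import numpy as np

# Made-up prices; only the minimum and maximum match the diamonds data.
prices = np.array([326, 500, 950, 2401, 5324, 9000, 18823])

counts, edges = np.histogram(prices, bins=10)

# With bins=10 the range is cut into 10 equal-width intervals, so there are
# 11 edges and each bin is (18823 - 326) / 10 = 1849.7 wide.
width = edges[1] - edges[0]
print(counts)
print(edges)
```

The second edge, 326 + 1849.7 = 2175.7, matches the second value in the edge array printed by the notebook.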
In [74]:
In [67]:
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0xa3d3ecac>
sns.distplot(diamonds.price, bins=20, kde=True, rug=False);
plt.figure();diamonds['price'].plot(kind='hist', stacked=True, bins=20)
In [68]:
Out[68]:
{'boxes': [<matplotlib.lines.Line2D at 0xa38c344c>], 'caps': [<matplotlib.lines.Line2D at 0xa38c08ac>, <matplotlib.lines.Line2D at 0xa38be38c>], 'fliers': [<matplotlib.lines.Line2D at 0xa38bb9ac>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0xa38bee8c>], 'whiskers': [<matplotlib.lines.Line2D at 0xa38c22cc>, <matplotlib.lines.Line2D at 0xa38c2d8c>]}
plt.boxplot(diamonds.price)
In [69]:
In [70]:
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0xa3b2502c>
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0xa38e8e2c>
diamonds['price'].plot()
plt.figure();diamonds['price'].plot(kind='box')
In [72]:
In [ ]:
/home/ajay/anaconda3/lib/python3.4/site-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter. warnings.warn(self.msg_depr % (key, alt_key))
ax = sns.boxplot(x="color", y="price", data=diamonds)
diamonds.plot(kind='hexbin', x='price', y='carat', gridsize=8)
In [76]:
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x96d078cc>
sns.kdeplot(diamonds['price'],shade= True)
In [75]:
In [77]:
/home/ajay/anaconda3/lib/python3.4/site-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter. warnings.warn(self.msg_depr % (key, alt_key))
Out[75]:
<seaborn.axisgrid.JointGrid at 0x9717fd8c>
sns.jointplot('price','carat',data=diamonds2)
from ggplot import *
In [78]:
/home/ajay/anaconda3/lib/python3.4/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter. warnings.warn(self.msg_depr % (key, alt_key))
Out[78]:
<ggplot: (-917530690)>
p = ggplot(aes(x='price', y='carat',color="clarity"), data=diamonds)
p + geom_point()
In [79]:
Modeling
Let's do some basic regression modeling.
In [80]:
In [81]:
In [82]:
/home/ajay/anaconda3/lib/python3.4/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter. warnings.warn(self.msg_depr % (key, alt_key))
Out[79]:
<ggplot: (-917530742)>
p = ggplot(aes(x='price', y='carat',color="cut"), data=diamonds)
p + geom_point()
import statsmodels.formula.api as sm
boston=pd.read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Boston.csv")
In [82]:
# Drop the index column that was saved into the CSV file
boston = boston.drop('Unnamed: 0', axis=1)
In [83]:
boston.head()
Out[83]:
      crim  zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio   black  lstat
0  0.00632  18   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98
1  0.02731   0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14
2  0.02729   0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03
3  0.03237   0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94
4  0.06905   0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7  396.90   5.33
In [84]:
boston.corr()
Out[84]:
              crim         zn      indus       chas        nox         rm        age
crim      1.000000  -0.200469   0.406583  -0.055892   0.420972  -0.219247   0.352734
zn       -0.200469   1.000000  -0.533828  -0.042697  -0.516604   0.311991  -0.569537
indus     0.406583  -0.533828   1.000000   0.062938   0.763651  -0.391676   0.644779
chas     -0.055892  -0.042697   0.062938   1.000000   0.091203   0.091251   0.086518
nox       0.420972  -0.516604   0.763651   0.091203   1.000000  -0.302188   0.731470
rm       -0.219247   0.311991  -0.391676   0.091251  -0.302188   1.000000  -0.240265
age       0.352734  -0.569537   0.644779   0.086518   0.731470  -0.240265   1.000000
dis      -0.379670   0.664408  -0.708027  -0.099176  -0.769230   0.205246  -0.747881
rad       0.625505  -0.311948   0.595129  -0.007368   0.611441  -0.209847   0.456022
tax       0.582764  -0.314563   0.720760  -0.035587   0.668023  -0.292048   0.506456
ptratio   0.289946  -0.391679   0.383248  -0.121515   0.188933  -0.355501   0.261515
black    -0.385064   0.175520  -0.356977   0.048788  -0.380051   0.128069  -0.273534
lstat     0.455621  -0.412995   0.603800  -0.053929   0.590879  -0.613808   0.602339
medv     -0.388305   0.360445  -0.483725   0.175260  -0.427321   0.695360  -0.376955
In [85]:
boston.corr()>0.75
Out[85]:
          crim     zn  indus   chas    nox     rm    age    dis    rad    tax  ptratio
crim      True  False  False  False  False  False  False  False  False  False    False
zn       False   True  False  False  False  False  False  False  False  False    False
indus    False  False   True  False   True  False  False  False  False  False    False
chas     False  False  False   True  False  False  False  False  False  False    False
nox      False  False   True  False   True  False  False  False  False  False    False
rm       False  False  False  False  False   True  False  False  False  False    False
age      False  False  False  False  False  False   True  False  False  False    False
dis      False  False  False  False  False  False  False   True  False  False    False
rad      False  False  False  False  False  False  False  False   True   True    False
tax      False  False  False  False  False  False  False  False   True   True    False
ptratio  False  False  False  False  False  False  False  False  False  False     True
black    False  False  False  False  False  False  False  False  False  False    False
lstat    False  False  False  False  False  False  False  False  False  False    False
medv     False  False  False  False  False  False  False  False  False  False    False
In [86]:
boston.corr().medv
Out[86]:
crim      -0.388305
zn         0.360445
indus     -0.483725
chas       0.175260
nox       -0.427321
rm         0.695360
age       -0.376955
dis        0.249929
rad       -0.381626
tax       -0.468536
ptratio   -0.507787
black      0.333461
lstat     -0.737663
medv       1.000000
Name: medv, dtype: float64
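Note that a test such as corr() > 0.75 also marks the diagonal as True and cannot see strong negative correlations. Here is a small sketch of ranking predictors by the absolute value of their correlation with a target and masking the diagonal. It runs on synthetic data; the column names x1, x2 and y and the coefficients are my assumptions for illustration and are not part of the Boston data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a data set (column names are made up for this sketch).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "y": 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.5, size=n),
})

corr = df.corr()

# Rank predictors by the absolute value of their correlation with the target,
# so that strong negative relationships are not overlooked.
target_corr = corr["y"].drop("y")
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)

# Flag strongly correlated pairs of variables, ignoring the diagonal.
strong = (corr.abs() > 0.75) & ~np.eye(len(corr), dtype=bool)
print(strong)
```

The same two steps should carry over to a real DataFrame by swapping df for it and "y" for the name of the target column.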
In [87]:
import statsmodels.formula.api as sm
# Fit an ordinary least squares model of medv on six predictors
result = sm.ols(formula="medv ~ crim + zn + nox + ptratio + black + rm ", data=boston).fit()
result.summary()
Out[87]:
OLS Regression Results
Dep. Variable:                medv    R-squared:                0.631
Model:                         OLS    Adj. R-squared:           0.626
Method:              Least Squares    F-statistic:              142.0
Date:             Fri, 22 Jan 2016    Prob (F-statistic):   1.49e-104
Time:                     13:22:42    Log-Likelihood:         -1588.2
No. Observations:              506    AIC:                      3190.
Df Residuals:                  499    BIC:                      3220.
Df Model:                        6
Covariance Type:         nonrobust
                 coef   std err         t    P>|t|    [95.0% Conf. Int.]
Intercept     -0.3594     4.863    -0.074    0.941     -9.915     9.196
crim          -0.0991     0.034    -2.890    0.004     -0.167    -0.032
zn            -0.0064     0.014    -0.470    0.638     -0.033     0.020
nox          -10.8653     2.865    -3.793    0.000    -16.494    -5.237
ptratio       -1.0519     0.135    -7.796    0.000     -1.317    -0.787
black          0.0137     0.003     4.453    0.000      0.008     0.020
rm             6.9796     0.396    17.612    0.000      6.201     7.758
Omnibus:          298.859    Durbin-Watson:          0.808
Prob(Omnibus):      0.000    Jarque-Bera (JB):    3305.426
Skew:               2.385    Prob(JB):                0.00
Kurtosis:          14.577    Cond. No.            7.66e+03
In [88]:
result.params
Out[88]:
Intercept     -0.359432
crim          -0.099122
zn            -0.006364
nox          -10.865295
ptratio       -1.051937
black          0.013737
rm             6.979587
dtype: float64
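To see what the numbers in result.params mean, here is a minimal sketch that recovers regression coefficients by ordinary least squares using plain NumPy. The data is synthetic, and the variable names and the true coefficients (1.0, 2.0 and -3.0) are assumptions made for illustration. This is not how statsmodels computes its results internally; it only shows the same least squares idea.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Known coefficients: intercept 1.0, slope 2.0 on x1, slope -3.0 on x2.
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Least squares solution that minimises the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared: share of the variance in y explained by the fitted values.
fitted = X @ beta
r_squared = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

print(beta)
print(r_squared)
```

The entries of beta play the same role as Intercept and the per-variable coefficients in result.params, and r_squared corresponds to the R-squared field of the summary.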
In [89]:
dir(result)
Out[89]:
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cache', '_data_attr', '_get_robustcov_results', '_is_nested', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params',
 'cov_type', 'df_model', 'df_resid', 'diagn', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_robustcov_results', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wresid']
In [90]:
result.outlier_test
Out[90]:
<bound method OLSResults.outlier_test of <statsmodels.regression.linear_model.OLSResults object at 0x961745cc>>
In [91]:
a=result.outlier_test
In [92]:
dir(a)
Out[92]:
['__call__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__func__', '__ge__', '__get__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__self__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']
In [93]:
def outlierTest(x):
    # Run the outlier test and print only the rows whose Bonferroni-adjusted p-value is below 1
    outl = x.outlier_test()
    print(outl.loc[outl['bonf(p)'] != 1])
In [94]:
outlierTest(result)
     student_resid       unadj_p       bonf(p)
365       5.130997  4.137329e-07  2.093488e-04
367       4.458162  1.022270e-05  5.172687e-03
368       7.350666  8.147884e-13  4.122829e-10
369       4.972797  9.097632e-07  4.603402e-04
370       4.510890  8.060499e-06  4.078612e-03
371       5.691137  2.156804e-08  1.091343e-05
372       6.272833  7.704855e-10  3.898656e-07
Decision Trees
pydot is a Python interface to Graphviz's DOT language. The module provides a full interface to create, handle, modify and process graphs in Graphviz's DOT language.
In [95]:
from sklearn import tree
# StringIO from the standard library (sklearn.externals.six was removed from later scikit-learn releases)
from io import StringIO
In [96]:
! sudo pip install pydot
#pydot import pydot
The directory '/home/ajay/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
You are using pip version 7.1.0, however version 8.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The directory '/home/ajay/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Requirement already satisfied (use --upgrade to upgrade): pydot in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): pyparsing in /usr/lib/python2.7/dist-packages (from pydot)
Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/local/lib/python2.7/dist-packages/setuptools-18.6.1-py2.7.egg (from pydot)
In [97]:
weather = pd.read_csv('https://raw.githubusercontent.com/decisionstats/pythonfordatascience/master/data/weather.csv')
In [98]:
# Drop the index column that was saved into the CSV file
weather = weather.drop('Unnamed: 0', axis=1)
In [110]:
weather.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 366 entries, 0 to 365
Data columns (total 24 columns):
Date             366 non-null object
Location         366 non-null object
MinTemp          366 non-null float64
MaxTemp          366 non-null float64
Rainfall         366 non-null float64
Evaporation      366 non-null float64
Sunshine         363 non-null float64
WindGustDir      363 non-null object
WindGustSpeed    364 non-null float64
WindDir9am       335 non-null object
WindDir3pm       365 non-null object
WindSpeed9am     359 non-null float64
WindSpeed3pm     366 non-null int64
Humidity9am      366 non-null int64
Humidity3pm      366 non-null int64
Pressure9am      366 non-null float64
Pressure3pm      366 non-null float64
Cloud9am         366 non-null int64
Cloud3pm         366 non-null int64
Temp9am          366 non-null float64
Temp3pm          366 non-null float64
RainToday        366 non-null object
RISK_MM          366 non-null float64
RainTomorrow     366 non-null object
dtypes: float64(12), int64(5), object(7)
memory usage: 61.5+ KB
For decision trees to work, we need to convert the categorical variables to integer variables. To do this, we'll create an encoding function as below.
In [100]:
def encode_target(df, target_columns):
    """Replace each listed column of df with integer codes.
    Args
    ----
    df -- pandas DataFrame.
    target_columns -- list of columns to map to int.
    Returns
    -------
    df_mod -- modified DataFrame.
    """
    df_mod = df.copy()
    for target_column in target_columns:
        # Number the distinct values in order of first appearance
        targets = df_mod[target_column].unique()
        map_to_int = {name: n for n, name in enumerate(targets)}
        df_mod[target_column] = df_mod[target_column].replace(map_to_int)
    return df_mod
In [101]:
# The original line is cut off after "WindDir9am"; "WindDir3pm" is assumed here
# because it appears among the numeric features printed below.
weather_new = encode_target(weather, ["RainToday", "Location", "WindGustDir", "WindDir9am", "WindDir3pm"])
features = list(weather_new.columns[3:])
features.remove("RISK_MM")
target = features.pop()
In [102]:
y = weather_new[target]
X = weather_new[features]
In [103]:
# Keep only numeric columns that have no missing values
good_columns = X._get_numeric_data().dropna(axis=1)
features = list(good_columns.columns)
In [104]:
print (features)
['MaxTemp', 'Rainfall', 'Evaporation', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm', 'RainToday']
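As an alternative to a hand-written mapping, the same kind of integer encoding can be sketched with pandas.factorize. The toy frame below is an assumption for illustration: its column names mimic the weather data but the values are made up. Unlike the function above, factorize codes missing values as -1, which I find handy for spotting them later.

```python
import pandas as pd

# Toy frame; the column names mimic the weather data but the values are made up.
df = pd.DataFrame({
    "RainToday": ["No", "Yes", "No", "No"],
    "WindGustDir": ["NW", "ENE", "NW", None],
})

encoded = df.copy()
categories = {}
for column in ["RainToday", "WindGustDir"]:
    # factorize assigns 0, 1, 2, ... in order of first appearance;
    # missing values are coded as -1.
    codes, uniques = pd.factorize(encoded[column])
    encoded[column] = codes
    categories[column] = list(uniques)

print(encoded)
print(categories)
```

Keeping the categories dictionary around makes it possible to translate the integer codes back to the original labels when reading the fitted tree.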
In [105]:
dt = tree.DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt = dt.fit(good_columns, y)
In [106]:
tree.export_graphviz(dt, out_file="tree.dot")
In [111]:
dt
Out[111]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=20, min_weight_fraction_leaf=0.0, random_state=99, splitter='best')
In [112]:
dir(dt)
Out[112]:
['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_get_param_names', 'class_weight', 'classes_', 'criterion', 'feature_importances_', 'fit', 'fit_transform', 'get_params', 'max_depth', 'max_features', 'max_features_', 'max_leaf_nodes', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_classes_', 'n_features_', 'n_outputs_', 'predict', 'predict_log_proba', 'predict_proba',
 'random_state', 'score', 'set_params', 'splitter', 'transform', 'tree_']
In [116]:
dt.score
Out[116]:
<bound method DecisionTreeClassifier.score of DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=20, min_weight_fraction_leaf=0.0, random_state=99, splitter='best')>
In [107]:
import os as os
In [121]:
#import pydot
In [108]:
os.getcwd()
Out[108]:
'/home/ajay/Desktop/test'
In [109]:
os.listdir(os.getcwd())
Out[109]:
['tree.dot', 'adult.data.txt']
In [117]:
#from IPython.display import Image
#dot_data = StringIO()
In [120]:
#graph = pydot.graph_from_dot_data(tree.dot.getvalue())
In [ ]:
#You can use Pydot from Python 2, or use Graphviz for reading the dot file
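Typing dt.score without parentheses only shows the bound method; to get an accuracy it has to be called with data. Here is a hedged sketch of doing that on a held-out split, together with one way to look at the DOT source without pydot. The data is synthetic, and the names clf, f0, f1 and f2 are my own assumptions for illustration, not the weather model from above.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on the sign of the first feature.
rng = np.random.default_rng(99)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

# Hold out a quarter of the rows so the score is not measured on training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=99
)

clf = tree.DecisionTreeClassifier(min_samples_split=20, random_state=99)
clf.fit(X_train, y_train)

# score is a method: called with data, it returns the mean accuracy.
accuracy = clf.score(X_test, y_test)
print(accuracy)

# With out_file=None, export_graphviz returns the DOT source as a string
# instead of writing a file, so it can be inspected without pydot.
dot_source = tree.export_graphviz(
    clf, out_file=None, feature_names=["f0", "f1", "f2"]
)
print(dot_source[:200])
```

The DOT text can then be saved to a file and rendered with the Graphviz command line tools, if they are installed on your machine.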