Data Mining using Python — code comments

Finn Årup Nielsen

DTU Compute

Technical University of Denmark

November 29, 2017

Code comments

Random comments on code provided by students.

With thanks to Vladimir Keleshev and others for tips.

argparse?

import argparse

Use docopt. :-)

import docopt

http://docopt.org/

Vladimir Keleshev’s video: PyCon UK 2012: Create *beautiful* command-line interfaces with Python

You will get the functionality and the documentation in one go.

Comments and a function declaration?

# Get comments of a subreddit, returns a list of strings

def get_subreddit_comments(self, subreddit, limit=None):

Any particular reason for not using docstrings?

def get_subreddit_comments(self, subreddit, limit=None):
    """Get comments of a subreddit and return a list of strings."""

. . . and use Vladimir Keleshev’s Python program pep257 to check the docstring format convention (PEP 257, “Docstring Conventions”).

Please do:

$ sudo pip install pep257

$ pep257 yourpythonmodule.py

Names

req = requests.get(self.video_url.format(video_id), params=params)

The returned object from requests.get is a Response object (actually a requests.models.Response object).

A more appropriate name would be response:

response = requests.get(self.video_url.format(video_id), params=params)

And what about this:

next_url = [n["href"] for n in r["feed"]["link"] if n["rel"] == "next"]

Single character names are difficult for the reader to understand.

Single characters should perhaps only be used for indices and for abstract

mathematical objects, e.g., matrix where the matrix can contain ‘general’

data.

More names

with open(WORDS_PATH, 'a+') as w:
    ...

WORDS_PATH is a file name not a path (or a path name).

Enumerable constants

Use the Enum class in the enum module, which is in the standard library from Python 3.4 and available for earlier versions through the enum34 package.
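A minimal sketch of what that could look like. The Sentiment class and the describe function are hypothetical names invented for the illustration; they are not from the student code.

```python
from enum import Enum


class Sentiment(Enum):
    """Hypothetical sentiment labels, used only for illustration."""

    NEGATIVE = -1
    NEUTRAL = 0
    POSITIVE = 1


def describe(label):
    """Return a short text for a Sentiment member."""
    if label is Sentiment.POSITIVE:
        return "good"
    if label is Sentiment.NEGATIVE:
        return "bad"
    return "so-so"


# Members can be looked up by value or by name.
print(Sentiment(0).name)
print(describe(Sentiment["NEGATIVE"]))
```

Compared to module-level integer constants, the members carry their own name and cannot be confused with an ordinary number.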

pi

import math

pi = math.pi # Define pi

What about

from math import pi

Assignment

words =[]

words = single_comment.split()

words is set to an empty list and then immediately overwritten!

URL and CSV

def get_csv_from_url(self, url):
    request = urllib2.Request(url)
    try:
        response = urllib2.urlopen(request)
        self.company_list = pandas.DataFrame({"Companies" : \
            [line for line in response.read().split("\r\n") \
            if (line != '' and line != "Companies") ]})
        print "Fetching data from " + url
    except urllib2.HTTPError, e:
        print 'HTTPError = ' + str(e.code)
    ...

Pandas read_csv will also do URLs:

def get_company_list_from_url(self, url):
    self.company_list = pandas.read_csv(url)

Also note: issues of exception handling, logging and documentation.

Sorting

def SortList(l):
    ...

def FindClosestValue(v,l):
    ...

...

SortList(a)
VaIn = FindClosestValue(int(Value), a)

Reinventing the wheel? Google: “find closest value in list python” yields several suggestions, if unsorted:

min(my_list, key=lambda x: abs(x - my_number))

and if sorted:

from bisect import bisect_left
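The slide only shows the import for the sorted case. One way the lookup might be completed is sketched below. The function name find_closest and the tie-breaking rule (the smaller element wins) are assumptions made for the example.

```python
from bisect import bisect_left


def find_closest(sorted_values, number):
    """Return the element of sorted_values closest to number.

    sorted_values must be a non-empty list sorted in ascending order.
    On a tie the smaller element is returned.
    """
    # Index of the first element that is >= number.
    position = bisect_left(sorted_values, number)
    if position == 0:
        return sorted_values[0]
    if position == len(sorted_values):
        return sorted_values[-1]
    before = sorted_values[position - 1]
    after = sorted_values[position]
    return before if number - before <= after - number else after


values = sorted([10, 4, 7, 1])
print(find_closest(values, 6))
```

The bisection takes O(log N) per lookup, so it pays off when the same sorted list is queried many times.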

Sorting 2

Returning the key associated with the maximum value in a dict:

return sorted(likelihoods, key=likelihoods.get, reverse=True)[0]

Sorting is O(N log N) while finding the maximum is O(N), so this should (hopefully) be faster:

return max(likelihoods, key=likelihoods.get)
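A small check that the two expressions pick the same key. The likelihoods dictionary is made up for the illustration.

```python
# Hypothetical class likelihoods, invented for the example.
likelihoods = {"positive": 0.7, "neutral": 0.2, "negative": 0.1}

# Sort all keys by value and take the first one ...
by_sorting = sorted(likelihoods, key=likelihoods.get, reverse=True)[0]

# ... or scan once for the key with the largest value.
by_max = max(likelihoods, key=likelihoods.get)

print(by_sorting, by_max)
```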

Word tokenization

Splitting a string into a list of words and processing each word:

for word in re.split("\W+", sentence)

Maybe \W+ is not a particularly good choice?

Comparison with NLTK’s word tokenizer:

>>> import nltk, re
>>> sentence = "In a well-behaved manner"
>>> [word for word in re.split("\W+", sentence)]
['In', 'a', 'well', 'behaved', 'manner']
>>> nltk.word_tokenize(sentence)
['In', 'a', 'well-behaved', 'manner']
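If NLTK is not at hand, the regular expression itself can be adjusted. The pattern below is only a sketch invented for this example (a word is taken to be a run of word characters optionally joined by single hyphens). It is not a replacement for a proper tokenizer.

```python
import re

sentence = "In a well-behaved manner"

# Splitting on \W+ also splits on the hyphen.
print(re.split(r"\W+", sentence))
# ['In', 'a', 'well', 'behaved', 'manner']

# Hypothetical alternative: match the words instead of the separators.
word_pattern = re.compile(r"\w+(?:-\w+)*")
print(word_pattern.findall(sentence))
# ['In', 'a', 'well-behaved', 'manner']
```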

POS-tagging

import re

import nltk

sentences = """Sometimes it may be good to take a close look at the
documentation. Sometimes you will get surprised."""

words = [word for sentence in nltk.sent_tokenize(sentences)
         for word in re.split('\W+', sentence)]

nltk.pos_tag(words)

>>> nltk.pos_tag(words)[12:15]
[('documentation', 'NN'), ('', 'NN'), ('Sometimes', 'NNP')]

>>> map(lambda s: nltk.pos_tag(nltk.word_tokenize(s)), nltk.sent_tokenize(sentences))
[[('Sometimes', 'RB'), ('it', 'PRP'), ('may', 'MD'), ('be', 'VB'),
  ('good', 'JJ'), ('to', 'TO'), ('take', 'VB'), ('a', 'DT'), ('close', 'JJ'),
  ('look', 'NN'), ('at', 'IN'), ('the', 'DT'), ('documentation', 'NN'),
  ('.', '.')],
 [('Sometimes', 'RB'), ('you', 'PRP'), ('will', 'MD'), ('get', 'VB'),
  ('surprised', 'VBN'), ('.', '.')]]

Note the period: with word_tokenize it is kept as its own token, while splitting on \W+ leaves an empty string that gets tagged as a noun. In the flat word list “Sometimes” looks like a proper noun because of the initial capital letter.

Exception

class LongMessageException(Exception):
    pass

Yes, we can! It is possible to define your own exceptions!

Exception 2

if self.db is None:
    raise Exception('No database engine attached to this instance.')

Derive your own class so the user of your module can distinguish between

errors.
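A sketch of how that might look. The class names DatabaseError, NoDatabaseEngineError and Store are hypothetical and stand in for whatever the student module defines.

```python
class DatabaseError(Exception):
    """Base class for errors raised by this (hypothetical) module."""


class NoDatabaseEngineError(DatabaseError):
    """Raised when no database engine is attached to the instance."""


class Store(object):
    """Hypothetical class standing in for the student's class."""

    def __init__(self, db=None):
        self.db = db

    def query(self):
        if self.db is None:
            raise NoDatabaseEngineError(
                'No database engine attached to this instance.')
        return self.db


# The caller can now catch exactly this error and let others propagate.
try:
    Store().query()
except NoDatabaseEngineError as error:
    print('Caught: %s' % error)
```

With a common base class the user can also catch every error from the module with a single except DatabaseError.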

Exception 3

try:
    if data["feed"]["entry"]:
        for item in data["feed"]["entry"]:
            return_comments.append(item["content"])
except KeyError:
    sys.exc_info()[0]

sys.exc_info()[0] just ignores the exception. Either you should pass it, log it or actually handle it. Here it is logged using the logging module:

import logging

try:
    if data["feed"]["entry"]:
        for item in data["feed"]["entry"]:
            return_comments.append(item["content"])
except KeyError:
    logging.exception("Unhandled feed item")

Trying to import a url library?

try:
    import urllib3 as urllib
except ImportError:
    try:
        import urllib2 as urllib
    except ImportError:
        import urllib as urllib

This is a silly example. This code was from the lecture slides and was meant to be a demonstration (I thought I was being pedagogic). Just import one of them. And urllib and urllib2 are in the Python Standard Library (PSL), so it is not likely that you cannot import them.

Just write:

import urllib

Import failing?

try:
    import BeautifulSoup as bs
except ImportError, message:
    print "There was an error loading BeautifulSoup: %s" % message

But ehhh. . . you are using BeautifulSoup further down in the code, so it will fail there instead and raise an exception (a NameError on bs) that is more difficult to understand.

Globbing import

from youtube import *

It is usually considered good style to only import the names you need to

avoid “polluting” your name space.

Better:

import youtube

alternatively:

from youtube import YouTubeScraper

Importing files in a directory tree

import sys

sys.path.append('theproject/lib')
sys.path.append('theproject/view_objects')
sys.path.append('theproject/models')

from user_builder import UserBuilder

(user_builder is somewhere in the directory tree)

It is better to define __init__.py files in each directory containing the imports with the names that need to be exported.

Function declaration

def __fetch_page_local(self, course_id, course_dir):

Standard naming (for “public” method) (Beazley and Jones, 2013):

def fetch_page_local(self, course_id, course_dir):

Standard naming (for “internal” method):

def _fetch_page_local(self, course_id, course_dir):

Standard naming (for “internal” method with name mangling. Is that

what you want? Or have you been coding too much in Java?):

def __fetch_page_local(self, course_id, course_dir):
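What name mangling does in practice can be seen in a small experiment. The Scraper class and its method names are hypothetical and only serve to show the three naming styles side by side.

```python
class Scraper(object):
    """Hypothetical class used only to show the three naming styles."""

    def fetch_page(self):
        return 'public'

    def _fetch_page(self):
        return 'internal by convention'

    def __fetch_page(self):
        return 'name-mangled'


scraper = Scraper()
print(scraper.fetch_page())
# The single underscore is only a hint; the call still works.
print(scraper._fetch_page())
# The double underscore name is stored as _Scraper__fetch_page ...
print(hasattr(scraper, '__fetch_page'))
# ... and is still reachable under the mangled name.
print(scraper._Scraper__fetch_page())
```

So the double underscore does not make the method private in the Java sense. It mainly avoids name clashes in subclasses.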

Making sense of repr

class CommentSentiment(base):
    ...

    def __repr__(self):
        return self.positive

The __repr__ should present something useful for the developer that uses the class, e.g., its name! Here only the value of an attribute is printed.

Vladimir Keleshev’s suggestion:

def __repr__(self):
    return '%s(id=%r, video_id=%r, positive=%r)' % (
        self.__class__.__name__, self.id, self.video_id, self.positive)

This will print out:

>>> comment_sentiment = CommentSentiment(video_id=12, positive=True)
>>> comment_sentiment
CommentSentiment(id=23, video_id=12, positive=True)
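A self-contained version to try out. It assumes a plain class without the database base class of the original, so the id is passed in by hand here instead of being assigned by a database. Note also that __repr__ must return a string: if self.positive is a boolean, the student version raises a TypeError when the object is displayed.

```python
class CommentSentiment(object):
    """Plain stand-in for the class above; no database base class."""

    def __init__(self, id=None, video_id=None, positive=None):
        self.id = id
        self.video_id = video_id
        self.positive = positive

    def __repr__(self):
        # %r inserts the repr of each attribute, so strings get quotes.
        return '%s(id=%r, video_id=%r, positive=%r)' % (
            self.__class__.__name__, self.id, self.video_id, self.positive)


comment_sentiment = CommentSentiment(id=23, video_id=12, positive=True)
print(repr(comment_sentiment))
# CommentSentiment(id=23, video_id=12, positive=True)
```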

Strings?

userName = str(user[u'name'].encode('ascii', 'ignore'))

Ehhh. . . Why not just user['name']? What does str do? Redundant!?

You really need to make sure you understand the distinction between ASCII byte strings, UTF-8 byte strings and Unicode strings.

You should consider for each variable in your program what is the most appropriate type and when it makes sense to convert it.

Usually: On the web data comes as UTF-8 byte strings that you would need in Python 2 to convert to Unicode strings. After you have done the processing in Unicode you may want to write out the results. This will mostly be in UTF-8.

See slides on encoding.
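The decode, process, encode pattern can be sketched as follows. The example is written for Python 3, where str is Unicode and bytes is a separate type, which is an assumption that differs from the Python 2 code on the slides. The name in the byte string is made up.

```python
# Bytes as they might arrive from the web: UTF-8 encoded.
raw = b'S\xc3\xb8ren \xc3\x85'

# Decode once at the boundary ...
name = raw.decode('utf-8')
print(len(raw), len(name))   # 9 7

# ... work with text inside the program ...
upper = name.upper()

# ... and encode again only when writing out.
encoded = upper.encode('utf-8')

# Dropping non-ASCII characters, as in the student's line, loses information.
print(name.encode('ascii', 'ignore'))   # b'Sren '
```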

A URL?

baselink = ("http://www.kurser.dtu.dk/search.aspx"
            "?YearGroup=2013-2014"
            "&btnSearch=Search"
            "&menulanguage=en-GB"
            "&txtSearchKeyword=%s")

“2013-2014” looks like something likely to change in the future. Maybe

it would be better to make it a parameter.

Also note that the get function in the requests module has the params keyword argument, which might be better for URL parameters.

“Constants”

print(c.fetch_comments("RiQYcw-u18I"))

Don’t put such “pseudoconstants” in a reusable module, unless they are examples.

Put them in data files, configuration files or as script input arguments.

Configuration

Use configuration files for ’changing constants’, e.g., API keys.

There are two modules: config and ConfigParser/configparser. configparser (Python 3) can parse portable Windows-like configuration files such as:

[requests]

user_agent = fnielsenbot

from = [email protected]

[twitter]

consumer_key = HFDFDF45454HJHJH

consumer_secret = kjhkjsdhfksjdfhf3434jhjhjh34h3

access_token = kjh234kj2h34

access_secret = kj23h4k2h34k23h4
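Reading such a file could look like the sketch below. To keep it self-contained the configuration text is given as a string and parsed with read_string. In practice it would live in a file read with config.read, where the file name is your choice.

```python
import configparser

# In practice this text would be in a separate file outside the module.
CONFIG_TEXT = """
[requests]
user_agent = fnielsenbot

[twitter]
consumer_key = HFDFDF45454HJHJH
access_token = kjh234kj2h34
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# Values can be fetched with get or with dictionary-style access.
user_agent = config.get('requests', 'user_agent')
consumer_key = config['twitter']['consumer_key']
print(user_agent, consumer_key)
```

This keeps API keys out of the source code, so the module can be shared without sharing the keys.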

Constructing a path

FILE_PATH = "%s" + os.sep + "%s.txt"

current_file_path = FILE_PATH % (directory, filename)

Yes! os.sep is platform independent.

But so is os.path.join:

from os.path import join

join(directory, filename + '.txt')

Building a URL

request_url = self.BASE_URL + \
    '?' + self.PARAM_DEV_KEY + \
    '=' + self.developer_key + \
    '&' + self.PARAM_PER_PAGE + \
    '=' + str(amount) + \
    '&' + self.PARAM_KEYWORDS + \
    '=' + ','.join(keywords)

Use instead the params keyword in the requests.get function, as special characters need to be escaped in a URL, e.g.,

>>> response = requests.get('http://www.dtu.dk', params={'q': u'æ ø å'})
>>> response.url
u'http://www.dtu.dk/?q=%C3%A6+%C3%B8+%C3%A5'
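The escaping can also be tried without a network connection. The sketch below swaps in urlencode from the standard library (Python 3) as a stand-in for what requests does with params. The function build_url is a hypothetical helper, not part of requests.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.dtu.dk/'   # taken from the example above


def build_url(base_url, params):
    """Return base_url with params appended as an escaped query string."""
    return base_url + '?' + urlencode(params)


print(build_url(BASE_URL, {'q': 'æ ø å'}))
# http://www.dtu.dk/?q=%C3%A6+%C3%B8+%C3%A5

# The comma in a joined keyword list is escaped as well.
print(build_url(BASE_URL, {'keywords': ','.join(['data', 'mining'])}))
```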

The double break out

import Image

image = Image.open("/usr/lib/libreoffice/program/about.png")
message = "Hello, world"

mat = image.load()
x, y = image.size
count = 0
done = False
for i in range(x):
    for j in range(y):
        mat[i,j] = (mat[i,j][2], mat[i,j][0], mat[i,j][1])
        count = count + 1
        if count == len(message):
            done = True
            break
    if done:
        break

(modified from the original)

The double break out

import Image
from itertools import product

image = Image.open("/usr/lib/libreoffice/program/about.png")
message = "Hello, world"

mat = image.load()
# start=1 so that exactly len(message) pixels are changed, as in the loop version
for count, (i, j) in enumerate(product(*map(range, image.size)), start=1):
    mat[i,j] = (mat[i,j][2], mat[i,j][0], mat[i,j][1])
    if count == len(message):
        break

Fewer lines avoiding the double break, but less readable perhaps? So not

necessarily better? In Python itertools module there are lots of interesting

functions for iterators. Take a look.

Note that count = count + 1 can also be written count += 1
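Another way to avoid both the counter and the break is itertools.islice, which stops the iteration after a fixed number of items. A minimal sketch, assuming Python 3 and using a plain dictionary as a stand-in for the pixel access object (the function name and the toy image are made up for this example):

```python
from itertools import islice, product


def rotate_first_pixels(pixels, size, n):
    """Rotate the colour channels of the first n pixels, column by column."""
    width, height = size
    for i, j in islice(product(range(width), range(height)), n):
        red, green, blue = pixels[i, j]
        pixels[i, j] = (blue, red, green)
    return pixels


# Toy 2x3 "image" where every pixel starts out as (1, 2, 3).
size = (2, 3)
pixels = {(i, j): (1, 2, 3) for i in range(2) for j in range(3)}
rotate_first_pixels(pixels, size, 4)
print(pixels)
```

Only the first four pixels in iteration order are changed; the rest keep their original value.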


Not everything you read on the Internet . . .

def remove_html_markup(html_string):
    tag = False
    quote = False
    out = ""
    for char in html_string:
        if char == "<" and not quote:
            tag = True
        elif char == '>' and not quote:
            tag = False
        elif (char == '"' or char == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + char
    return out

“Borrowed from: http://stackoverflow.com/a/14464381/368379”


The Stack Overflow website is a good resource, but please be critical of the suggested solutions.


An apostrophe inside a quoted attribute value is enough to break it:

>>> s = """<a title="DTU's homepage" href="http://dtu.dk">DTU</a>"""
>>> remove_html_markup(s)
''


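You have BeautifulSoup, NLTK, etc., e.g. (clean_html existed in NLTK 2; NLTK 3 removed it and points to BeautifulSoup's get_text() instead):

>>> import nltk
>>> nltk.clean_html(s)
'DTU'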
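If no third-party package is at hand, the standard library parser handles the quoting correctly too. A minimal sketch, assuming Python 3; the names TextExtractor and strip_tags are made up for this example:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html_string):
    parser = TextExtractor()
    parser.feed(html_string)
    parser.close()
    return ''.join(parser.chunks)


s = """<a title="DTU's homepage" href="http://dtu.dk">DTU</a>"""
print(strip_tags(s))
```

The parser, not hand-written state flags, decides where attribute values start and end.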


Documentation

def update(self):
    """Execute an update where the currently selected course from the history is displayed and the buttons for back and forward are up"""


Docstrings should be written so that they can be read on an 80-column terminal:

def update(self):
    """Update the currently selected courses.

    Update so that the currently selected course from the history is
    displayed and the buttons for back and forward are up.

    """

Style Guide for Python Code: “For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.” (van Rossum et al., 2013).

Run Vladimir Keleshev’s pep257 program (since renamed pydocstyle) on your code!
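The 72-character limit is easy to check by hand as well. A rough sketch, assuming Python 3; long_docstring_lines is a made-up helper and no substitute for a real checker, and since inspect.getdoc removes the leading indentation, the lengths are measured without it:

```python
import inspect


def long_docstring_lines(obj, limit=72):
    """Return the docstring lines of obj that are longer than limit."""
    doc = inspect.getdoc(obj) or ''
    return [line for line in doc.splitlines() if len(line) > limit]


def update_long():
    """Execute an update where the currently selected course from the history is displayed and the buttons for back and forward are up"""


def update_wrapped():
    """Update the currently selected courses.

    Update so that the currently selected course from the history is
    displayed and the buttons for back and forward are up.

    """


print(len(long_docstring_lines(update_long)))
print(len(long_docstring_lines(update_wrapped)))
```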


For loop with counter

tcount = 0
for t in stream:
    if tcount >= 1000:
        break
    dump(t)
    tcount += 1


Please at least enumerate:

for tcount, t in enumerate(stream):
    if tcount >= 1000:
        break
    dump(t)


and don’t use such short variable names as “t”!
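The counter can be dropped altogether with itertools.islice. A minimal sketch, assuming Python 3; stream and dump stand for the objects in the student code, while dump_first, tweet and the integer test stream are made up for this example:

```python
from itertools import islice


def dump_first(stream, dump, limit=1000):
    """Pass at most limit items from stream to dump."""
    for tweet in islice(stream, limit):
        dump(tweet)


dumped = []
dump_first(iter(range(5000)), dumped.append, limit=1000)
print(len(dumped))
```

islice works on any iterable, including streams of unknown length, and stops reading once the limit is reached.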


For loop with counter 2

len together with for is often suspicious. Made-up example:

from nltk.corpus import shakespeare

tokens = shakespeare.words('hamlet.xml')
words = []
for n in range(len(tokens)):
    if tokens[n].isalpha():
        words.append(tokens[n].lower())


Better and cleaner:

tokens = shakespeare.words('hamlet.xml')
words = []
for token in tokens:
    if token.isalpha():
        words.append(token.lower())


Or with a generator expression (alternatively a list comprehension):

tokens = shakespeare.words('hamlet.xml')
words = (token.lower() for token in tokens if token.isalpha())
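The choice between the two matters when the result is used more than once. A small sketch, assuming Python 3 and a made-up token list in place of the NLTK corpus:

```python
# Made-up stand-in for shakespeare.words('hamlet.xml').
tokens = ['To', 'be', ',', 'or', 'not', 'to', 'be', '!']

words_generator = (token.lower() for token in tokens if token.isalpha())
words_list = [token.lower() for token in tokens if token.isalpha()]

first_pass = list(words_generator)
second_pass = list(words_generator)   # the generator is exhausted by now

print(first_pass)
print(second_pass)
```

A generator expression saves memory on a large corpus but can only be iterated once; a list comprehension keeps all the words available.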


For loop with counter 3

for c_no, value in enumerate(a_list):
    # more code here.
    c_no += 1


The first variable, c_no, is incremented automatically by enumerate. Just write:

for c_no, value in enumerate(a_list):
    # more code here.
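If the increment was meant to give one-based numbering (a guess at the intent, the original code does not say), enumerate can do that as well through its start argument. A minimal sketch with a made-up list:

```python
a_list = ['a', 'b', 'c']

numbered = []
for c_no, value in enumerate(a_list, start=1):
    numbered.append((c_no, value))

print(numbered)
```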


Caching results

You want to cache results that take a long time to fetch or compute:

def get_all_comments(self):
    self.comments = computation_that_takes_long_time()
    return self.comments

def get_all_comments_from_last_call(self):
    return self.comments


This can be done more elegantly with a lazy property:

from lazy import lazy

@lazy
def all_comments(self):
    comments = computation_that_takes_long_time()
    return comments
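Since Python 3.8 the standard library offers the same idea as functools.cached_property. A minimal sketch, assuming Python 3.8 or later; the class name Thread and the call counter are made up to show that the slow computation runs only once:

```python
from functools import cached_property


class Thread(object):
    """Hypothetical container class, named only for this example."""

    def __init__(self):
        self.calls = 0

    @cached_property
    def all_comments(self):
        self.calls += 1          # counts how often the slow path runs
        return ['first comment', 'second comment']


thread = Thread()
thread.all_comments
thread.all_comments
print(thread.calls)
```

The first access computes and stores the value on the instance; later accesses read the stored value.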


All those Python versions . . . !

import sysconfig

if float(sysconfig.get_python_version()) < 3.1:
    exit('your version of python is below 3.1')


Is there any particular reason why it shouldn’t work with previous versions of Python?

Try installing tox, which allows you to test your code with multiple versions of Python.

. . . and please be careful when handling version numbers: conversion to float will not work with, e.g., “3.2.3”. See pkg_resources.parse_version.
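For the running interpreter, comparing sys.version_info with a tuple avoids string parsing entirely. A minimal sketch; version_tuple is a made-up helper that only handles purely numeric version strings:

```python
import sys


def version_tuple(version):
    """Convert a purely numeric dotted version string to a tuple of ints."""
    return tuple(int(part) for part in version.split('.'))


# float() gets the ordering wrong: 3.10 is newer than 3.9.
print(float('3.10') < float('3.9'))
print(version_tuple('3.10') > version_tuple('3.9'))
print(version_tuple('3.2.3'))

if sys.version_info < (3, 1):
    sys.exit('your version of python is below 3.1')
```

Tuples compare element by element, so (3, 10) sorts after (3, 9), which a float comparison cannot express.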


Iterable

for kursus in iter(kursusInfo.keys()):
    # Here is some extra code


Dictionary keys are already iterable:

for kursus in kursusInfo.keys():
    # Here is some extra code


. . . and you can actually make it yet shorter.


for kursus in kursusInfo:
    # Here is some extra code
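A quick check that the variants visit the same keys, with items() as the usual choice when the values are needed too. A small sketch, assuming Python 3.7 or later (where dictionaries keep insertion order) and made-up dictionary content:

```python
kursusInfo = {'A': 'first course', 'B': 'second course'}   # made-up data

keys_direct = [kursus for kursus in kursusInfo]
keys_explicit = [kursus for kursus in kursusInfo.keys()]
pairs = [(kursus, info) for kursus, info in kursusInfo.items()]

print(keys_direct)
print(pairs)
```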


Getting those items

endTime = firstData["candles"][-1].__getitem__("time")


There is no need to call magic methods directly: .__getitem__("time") is the same as ["time"]:

endTime = firstData["candles"][-1]["time"]
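The equivalence is easy to confirm, and the dictionary method get covers the case where a key may be missing. A small sketch with a made-up data structure that only mimics the shape of firstData:

```python
# Made-up data with the same shape as in the student code.
firstData = {'candles': [{'time': 't0'}, {'time': 't1'}]}

via_method = firstData['candles'][-1].__getitem__('time')
via_brackets = firstData['candles'][-1]['time']
via_get = firstData['candles'][-1].get('volume')   # None, the key is absent

print(via_method == via_brackets)
print(via_get)
```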


Checkin of .pyc files

$ git add *.pyc


*.pyc files are byte code files generated from *.py files. Do not check

these files into the revision control system.


Put them in .gitignore together with other generated files:

*.pyc
.tox
__pycache__

There are many more; see an example of a .gitignore file.


I18N

Module with a docstring:

"""
@author: Finn Årup Nielsen
"""
# Code below.


Python 2 source files are ASCII by default, not UTF-8, so declare the encoding:

# -*- coding: utf-8 -*-
u"""
@author: Finn Årup Nielsen
"""
# Code below.


Note that you might still run into a Python 2/3 output encoding problem with

>>> import mymodule
>>> help(mymodule)
...
UnicodeEncodeError: 'ascii' codec can't encode character
u'\xc5' in position 68: ordinal not in range(128)

If the user's terminal session uses ASCII, an encoding exception is raised.
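The same exception can be provoked directly by encoding the name to ASCII. A minimal sketch, assuming Python 3 (the u prefix is accepted there too):

```python
author = u'Finn \xc5rup Nielsen'

try:
    author.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

as_utf8 = author.encode('utf-8')
as_ascii = author.encode('ascii', errors='replace')

print(failed)
print(as_utf8)
print(as_ascii)
```

UTF-8 represents the letter with two bytes, while the ASCII codec can only fail or substitute a replacement character.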


Opening a file with with

comments_filename = 'somekindoffilename.txt'
words_filename = 'anotherfilename.txt'
text = 'Some kind of text goes here.'

with open(comments_filename, 'a+') as c:
    #Convert text to str and remove newlines
    single_comment = str(text).replace("\n", " ")
    single_comment +='\n'
    c.write(single_comment)
    with open(words_filename,'a+') as w:
        words =[]
        words = single_comment.split()
        for word in words:
            single_word = str(word)
            single_word +='\n'
            w.write(single_word)
    w.close()
c.close()


comments_filename = 'somekindoffilename.txt'
words_filename = 'anotherfilename.txt'
text = 'Some kind of text goes here.'

with open(comments_filename, 'a+') as c:
    #Convert text to str and remove newlines
    single_comment = str(text).replace("\n", " ")
    single_comment +='\n'
    c.write(single_comment)
    with open(words_filename,'a+') as w:
        words =[]
        words = single_comment.split()
        for word in words:
            single_word = str(word)
            single_word +='\n'
            w.write(single_word)

File identifiers are already closed when the with block has ended.
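This is easy to verify with the closed attribute of the file object. A minimal sketch, assuming Python 3 and writing to a temporary directory so that no existing file is touched:

```python
import os
import tempfile

filename = os.path.join(tempfile.mkdtemp(), 'somekindoffilename.txt')

with open(filename, 'a+') as comments_file:
    comments_file.write('Some kind of text goes here.\n')
    closed_inside = comments_file.closed

closed_after = comments_file.closed

print(closed_inside)
print(closed_after)
```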


comments_filename = 'somekindoffilename.txt'
words_filename = 'anotherfilename.txt'
text = 'Some kind of text goes here.'

with open(comments_filename, 'a+') as c:
    #Convert text to str and remove newlines
    single_comment = str(text).replace("\n", " ")
    single_comment +='\n'
    c.write(single_comment)
    with open(words_filename,'a+') as w:
        words = single_comment.split()
        for word in words:
            single_word = str(word)
            single_word +='\n'
            w.write(single_word)

File identifiers are already closed when the with block has ended. Erase redundant assignment.


comments_filename = 'somekindoffilename.txt'
words_filename = 'anotherfilename.txt'
text = 'Some kind of text goes here.'

with open(comments_filename, 'a+') as c:
    single_comment = text.replace("\n", " ")
    single_comment +='\n'
    c.write(single_comment)
    with open(words_filename,'a+') as w:
        words = single_comment.split()
        for word in words:
            single_word = word
            single_word +='\n'
            w.write(single_word)

File identifiers are already closed when the with block has ended. Erase redundant assignment. No need to convert to str.


comments_filename = 'somekindoffilename.txt'
words_filename = 'anotherfilename.txt'
text = 'Some kind of text goes here.'

with open(comments_filename, 'a+') as c:
    single_comment = text.replace("\n", " ")
    c.write(single_comment + '\n')
    with open(words_filename,'a+') as w:
        words = single_comment.split()
        for word in words:
            w.write(word + '\n')

File identifiers are already closed when the with block has ended. Erase redundant assignment. No need to convert to str. Simplifying.


comments_filename = 'somekindoffilename.txt'
words_filename = 'anotherfilename.txt'
text = 'Some kind of text goes here.'

with open(comments_filename, 'a+') as comments_file, \
        open(words_filename,'a+') as words_file:
    single_comment = text.replace("\n", " ")
    comments_file.write(single_comment + '\n')
    words = single_comment.split()
    for word in words:
        words_file.write(word + '\n')

File identifiers are already closed when the with block has ended. Erase redundant assignment. No need to convert to str. Simplifying. Multiple file openings with with.


comments_filename = 'somekindoffilename.txt'
words_filename = 'anotherfilename.txt'
text = 'Some kind of text goes here.'

with open(comments_filename, 'a+') as f:
    single_comment = text.replace("\n", " ")
    f.write(single_comment + '\n')

with open(words_filename,'a+') as f:
    words = text.split()
    for word in words:
        f.write(word + '\n')

File identifiers are already closed when the with block has ended. Erase redundant assignment. No need to convert to str. Simplifying. Multiple file openings with with. Alternatively: Split the with blocks.


Splitting words

text = 'Some kind of text goes here.'
words = text.split()


You get a word with a trailing dot, “here.”, because split() splits on whitespace only.


Write a test!


def split(text):
    words = text.split()
    return words

def test_split():
    text = 'Some kind of text goes here.'
    assert ['Some', 'kind', 'of', 'text', 'goes', 'here'] == split(text)


Test the module with py.test:

$ py.test yourmodulewithtestfunctions.py


    def test_split():
>       assert ['Some', 'kind', 'of', 'text', 'goes', 'here'] == split(text)
E       assert ['Some', 'kin... goes', 'here'] == ['Some', 'kind ... oes', 'here.']
E         At index 5 diff: 'here' != 'here.'
E         Use -v to get the full diff


from nltk import sent_tokenize, word_tokenize

def split(text):
    words = [word for sentence in sent_tokenize(text)
             for word in word_tokenize(sentence)]
    return words

def test_split():
    text = 'Some kind of text goes here.'
    assert ['Some', 'kind', 'of', 'text', 'goes', 'here'] == split(text)

Test the module with py.test

$ py.test yourmodulewithtestfunctions.py


    def test_split():
        text = 'Some kind of text goes here.'
>       assert ['Some', 'kind', 'of', 'text', 'goes', 'here'] == split(text)
E       assert ['Some', 'kin... goes', 'here'] == ['Some', 'kind..., 'here', ...]
E         Right contains more items, first extra item: '.'
E         Use -v to get the full diff


from nltk import sent_tokenize, word_tokenize

def split(text):
    words = [word for sentence in sent_tokenize(text)
             for word in word_tokenize(sentence)
             if word.isalpha()]
    return words

def test_split():
    text = 'Some kind of text goes here.'
    assert ['Some', 'kind', 'of', 'text', 'goes', 'here'] == split(text)

Test the module with py.test

$ py.test yourmodulewithtestfunctions.py


Success!
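For comparison, the same test also passes with a regular expression from the standard library. This is a different technique from the NLTK tokenizers above, not a replacement for them: a sketch, assuming Python 3, that simply keeps runs of letters and therefore splits contractions such as “don't” in two:

```python
import re


def split(text):
    """Return the alphabetic words in text, without punctuation."""
    # [^\W\d_]+ matches runs of Unicode letters only.
    return re.findall(r'[^\W\d_]+', text)


def test_split():
    text = 'Some kind of text goes here.'
    assert ['Some', 'kind', 'of', 'text', 'goes', 'here'] == split(text)


test_split()
print(split('Some kind of text goes here.'))
```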


The example continues . . .

What about lower case versus upper case?

What about issues of Unicode/UTF-8?

Should the files really be opened for each comment?

Should individual words really be written one at a time to a file?


A sentiment analysis function

# pylint: disable=fixme, line-too-long
#-------------------------------------------------------------------------------
def afinn(text):
    '''
    AFINN is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words have been manually labeled by Finn Årup Nielsen in 2009-2011.

    This method uses this AFINn list to find the sentiment score of a tweet:

    Parameters
    ----------
    text : Text
    A tweet text

    Returns
    -------
    sum :
    A sentiment score based on the input text
    '''
    afinn = dict(map(lambda(k, v): (k, int(v)), [line.split('\t') for line in open("AFINN-111.txt")]))
    return sum(map(lambda word: afinn.get(word, 0), text.lower().split()))


If the lines are too long they are too long.


def afinn(text):
    """ Return sentiment analysis score for text.

    The AFINN word list is used to score the text. The sum of word
    valences is returned.

    Parameters
    ----------
    text : str
        A tweet text

    Returns
    -------
    sum : int
        A sentiment score based on the input text

    """
    afinn = dict(map(lambda(k, v): (k, int(v)), [line.split('\t') for line in open("AFINN-111.txt")]))
    return sum(map(lambda word: afinn.get(word, 0), text.lower().split()))

PEP 257 fixes: Proper types for input and output, short headline, correct indentation. Note the extra space needed. And perhaps ‘References’.


A closer look on the actual itself:

def afinn(text):

afinn = dict(map(lambda(k, v): (k, int(v)), [line.split(’\t’) for line in open("AFINN -111. txt")]))

return sum(map(lambda word: afinn.get(word , 0), text.lower (). split ()))

What is wrong?

Finn Arup Nielsen 113 November 29, 2017

Page 115: Data Mining using Python | code · PDF fileData Mining using Python | code comments Code comments Random comments on code provided by students. With thanks to Vladimir Keleshev and

Data Mining using Python — code comments

A sentiment analysis function

A closer look at the actual code itself:

def afinn(text):
    afinn = dict(map(lambda(k, v): (k, int(v)), [line.split('\t') for line in open("AFINN-111.txt")]))
    return sum(map(lambda word: afinn.get(word, 0), text.lower().split()))

What is wrong?

The word list is read each time the function is called. That’s slow.

The word tokenization is bad. It splits only on whitespace, so punctuation stays attached to the words.

The UTF-8 encoded file is not handled.

Style: The lines are too long and variable names too short.

The location of the file is hardcoded (the filename cannot be seen because the line is too long).
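The tokenization problem can be illustrated with a small sketch. The example text and the regular expression are assumptions chosen for illustration; they are not part of the student code, and a real tokenizer may need to handle more cases (e.g., apostrophes and multi-word phrases).

```python
import re

text = "Not bad, not BAD at all!"

# Whitespace splitting leaves punctuation attached to the words,
# so 'bad,' and 'all!' would not be found in a word list.
whitespace_tokens = text.lower().split()

# One simple alternative: extract runs of word characters.
regex_tokens = re.findall(r"\w+", text.lower())
```

Here `whitespace_tokens` still contains `'bad,'` and `'all!'`, while `regex_tokens` contains the bare words.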


A sentiment analysis function

Avoid reading the word list file multiple times:

class Afinn(object):

    def __init__(self):
        self.load_wordlist()

    def load_wordlist(self, filename='AFINN-111.txt'):
        self.wordlist = ...

    def score(self, text):
        return ...

afinn = Afinn()
afinn.score('This bad text should be sentiment analyzed')
afinn.score('This good text should be sentiment analyzed')
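One possible way to fill in the skeleton above is sketched here. It is not the author's implementation: passing the filename to the constructor, the blank-line handling and the `\w+` tokenization pattern are assumptions made for illustration, and the file is assumed to contain tab-separated `word<TAB>valence` lines in UTF-8.

```python
import io
import re


class Afinn(object):
    """Sentiment scorer that reads the word list only once."""

    # Assumed tokenization: runs of word characters.
    # Multi-word entries in the word list are not matched by this pattern.
    pattern = re.compile(r"\w+", flags=re.UNICODE)

    def __init__(self, filename='AFINN-111.txt'):
        self.wordlist = self.load_wordlist(filename)

    @staticmethod
    def load_wordlist(filename):
        """Read 'word<TAB>valence' lines from a UTF-8 encoded file."""
        wordlist = {}
        with io.open(filename, encoding='utf-8') as fid:
            for line in fid:
                if not line.strip():
                    continue  # Skip blank lines
                word, valence = line.strip().split('\t')
                wordlist[word] = int(valence)
        return wordlist

    def score(self, text):
        """Return the sum of the valences of the words in the text."""
        words = self.pattern.findall(text.lower())
        return sum(self.wordlist.get(word, 0) for word in words)
```

With the word list file in the working directory, `afinn = Afinn()` reads the file once, and each later call to `afinn.score(...)` only does dictionary lookups.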


Tools to help

Install flake8 with the plugins flake8-docstrings, flake8-import-order and pep8-naming to perform style checking:

$ flake8 yourmoduledirectory

It catches, e.g., naming issues, unused imports, missing documentation or documentation in the wrong format.

pylint will catch, e.g., unused arguments.

Install py.test, write test functions and then run:

$ py.test yourmoduledirectory

Use the NumPy documentation convention described in A Guide to NumPy/SciPy Documentation. See example.py for use. The “Examples” section should be doctested.
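As a minimal sketch of what these tools look for, here is a function with a NumPy-style docstring, a doctestable “Examples” section and a test function. The function `count_words` and its test are hypothetical examples, not taken from example.py.

```python
def count_words(text):
    """Return the number of whitespace-separated words in a text.

    Parameters
    ----------
    text : str
        Text to count words in.

    Returns
    -------
    count : int
        Number of words.

    Examples
    --------
    >>> count_words('This is a text')
    4

    """
    return len(text.split())


def test_count_words():
    """Test count_words; the test_ prefix lets py.test discover it."""
    assert count_words('') == 0
    assert count_words('one two') == 2
```

The doctest in the “Examples” section can be run with `python -m doctest` on the module file.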


Testing

Ran both py.test and nosetests with the result:

“All scripts returned ‘Ran 0 tests in 0.000s’ ”


Machine learning


Splitting the data set

When you evaluate the performance of a machine learning algorithm that has been trained/estimated/optimized, you need to evaluate the performance (e.g., mean square error, accuracy, precision, ...) on an independent data set to get unbiased results. Otherwise you get too optimistic results.

So split the data into a training and a test set, and perhaps a development set (a part of the training set used to test, e.g., hyperparameters).

The split can, e.g., be split-half or cross-validation.

It is ok to evaluate the performance on the training set: This will usually give an upper bound on the performance of your model, but do not regard this value as relevant for the final performance of the system.
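The two kinds of split mentioned above can be sketched in plain Python. The function names `split_half` and `k_fold` are hypothetical and written only for illustration; in practice scikit-learn's `train_test_split` and `KFold` in `sklearn.model_selection` provide this functionality.

```python
import random


def split_half(items, seed=0):
    """Shuffle the items and return (training, test) halves."""
    items = list(items)
    random.Random(seed).shuffle(items)
    middle = len(items) // 2
    return items[:middle], items[middle:]


def k_fold(items, k=5):
    """Yield (training, test) pairs; each item is in a test set once."""
    items = list(items)
    for fold in range(k):
        # Every k'th item, starting at index 'fold', goes to the test set
        test = items[fold::k]
        training = [item for index, item in enumerate(items)
                    if index % k != fold]
        yield training, test
```

In both cases the model is fitted on the training part only and the performance is reported on the test part.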


Continuous, ordered and categorical data

If you have a continuous variable as input or output of the model, then it is usually not a good idea to model this as a categorical value.

The default NLTK naïve Bayes classifier handles categorical input and output values and not continuous ones, so if you have a problem with continuous variables do not use that default classifier.

There are ample methods, e.g., in sklearn and statsmodels, for modeling both continuous and categorical values.


Continuous, ordered and categorical data

This is very wrong:

>>> import nltk
>>> classifier = nltk.NaiveBayesClassifier.train([({'a': 1.0}, 'pos'),
...                                               ({'a': -2.0}, 'neg')])
>>> classifier.prob_classify({'a': 1.5}).prob('pos')
0.5

The classifier treats 1.5 as a category it has never seen, so it cannot use that 1.5 is close to 1.0.

This is an example of how it can be done:

import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame([{'Value': 1.0, 'Class': 'pos'},
                     {'Value': -2.0, 'Class': 'neg'}])
data['Classint'] = (data['Class'] == 'pos').astype(float)
results = smf.ols('Classint ~ Value', data=data).fit()

Example use:

>>> results.predict({'Value': 1.5})
array([ 1.16666667])


Labels as features!?

Comments from social media (Twitter, YouTube, ...) can be downloaded and annotated for sentiment automatically using the happy and sad emoticons as labels.

This is a good idea!

But then do not use the specific emoticons as features. You will very likely get an all too optimistic performance measure!

When training a machine learning classifier do not use labels as features!
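A minimal sketch of how the labeling emoticons can be kept out of the features is given below. The emoticon sets, the label names and the function `label_and_features` are assumptions made for illustration.

```python
import re

# Hypothetical emoticon sets used for the automatic labeling
HAPPY = (':-)', ':)')
SAD = (':-(', ':(')
EMOTICON_PATTERN = re.compile('|'.join(re.escape(e) for e in HAPPY + SAD))


def label_and_features(comment):
    """Return (label, words) with the labeling emoticons removed."""
    if any(emoticon in comment for emoticon in HAPPY):
        label = 'pos'
    elif any(emoticon in comment for emoticon in SAD):
        label = 'neg'
    else:
        label = None
    # Strip the emoticons before feature extraction, so the label
    # does not leak into the features.
    words = EMOTICON_PATTERN.sub(' ', comment).lower().split()
    return label, words
```

For example, `label_and_features('Great movie :-)')` gives the label `'pos'`, while the word features no longer contain the emoticon.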


sklearn

def sklearn_classifier_Kfold_score(features, targets):
    # gnb = GaussianNB()
    svm = SVC()

If you want to try out several classifiers it is possible to parametrize the classifier.

There is a nice scikit-learn example, plot_classifier_comparison.py, where multiple classifiers are used on multiple data sets.
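Parametrizing the classifier could look like the following sketch, where the classifier is passed as an argument instead of being commented in and out. The function `classifier_kfold_score` and the stand-in `MajorityClassifier` are hypothetical; the function is assumed to work with any object that has `fit` and `predict` methods, such as scikit-learn's `GaussianNB()` or `SVC()`.

```python
class MajorityClassifier(object):
    """Hypothetical stand-in that predicts the most common target."""

    def fit(self, features, targets):
        targets = list(targets)
        self.majority_ = max(set(targets), key=targets.count)
        return self

    def predict(self, features):
        return [self.majority_ for _ in features]


def classifier_kfold_score(classifier, features, targets, k=2):
    """Return the mean accuracy of the classifier over k folds."""
    accuracies = []
    for fold in range(k):
        train = [i for i in range(len(targets)) if i % k != fold]
        test = [i for i in range(len(targets)) if i % k == fold]
        # Fit on the training part only
        classifier.fit([features[i] for i in train],
                       [targets[i] for i in train])
        predictions = classifier.predict([features[i] for i in test])
        correct = sum(prediction == targets[i]
                      for prediction, i in zip(predictions, test))
        accuracies.append(correct / float(len(test)))
    return sum(accuracies) / k
```

Several classifiers can then be compared with a loop such as `for classifier in classifiers: classifier_kfold_score(classifier, features, targets)`.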


To be continued


References

Beazley, D. and Jones, B. K. (2013). Python Cookbook. O’Reilly, Sebastopol, third edition.

van Rossum, G., Warsaw, B., and Coghlan, N. (2013). Style guide for Python code. Python Enhancement Proposals 8, Python Software Foundation. http://www.python.org/dev/peps/pep-0008/.
