ga final report

GA_data_science/ final_project dan knox co-founder, science exchange

Upload: dan-knox

Post on 18-Nov-2014


TRANSCRIPT

Page 1: Ga final report

GA_data_science/ final_project

dan knox co-founder, science exchange

Page 2: Ga final report

how has the language of machine learning changed since 2003?

Page 3: Ga final report

motivation

Page 4: Ga final report
Page 5: Ga final report

data

Page 6: Ga final report

import requests

# Fragment from inside the scraping loop: `a`, `year`, `counter` and
# `directory` are defined by the surrounding code (not shown on the slide).
href = a.get("href")
if href and "pdf" in href:
    name = "cs229-%s-%s.pdf" % (year, counter)
    fullname = "%s/%s" % (directory, name)

    pdf_url = "http://cs229.stanford.edu/" + str(href)
    r = requests.get(pdf_url)
    with open(fullname, "wb") as pdf:
        pdf.write(r.content)

    print("...downloading %s as %s" % (href, name))
    counter += 1
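The link-filtering and file-naming step above can be tried without touching the network. The following is a minimal sketch, not the code used for the project: it swaps in the standard-library `HTMLParser` for whatever parser produced `a`, and the names `PdfLinkParser`, `plan_downloads`, the sample HTML and the counter starting at 1 are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class PdfLinkParser(HTMLParser):
    """Collects the href of every <a> tag whose link mentions 'pdf'."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and "pdf" in href:
            self.pdf_links.append(href)


def plan_downloads(html, year, directory, base_url="http://cs229.stanford.edu/"):
    """Return (pdf_url, local_path) pairs; nothing is downloaded here."""
    parser = PdfLinkParser()
    parser.feed(html)
    plan = []
    # Assumption: the counter starts at 1 for each year.
    for counter, href in enumerate(parser.pdf_links, start=1):
        name = "cs229-%s-%s.pdf" % (year, counter)
        plan.append((base_url + href, "%s/%s" % (directory, name)))
    return plan


# Made-up HTML snippet, only to exercise the function.
sample = '<a href="proj/example.pdf">paper</a><a href="index.html">home</a>'
print(plan_downloads(sample, 2005, "papers"))
```

Separating the "what to fetch" step from the actual `requests.get` call makes it easy to inspect the file names before starting a long download.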

Page 7: Ga final report

3,030 papers

913 papers from Stanford CS229 class (2005-12)
2,117 papers from N.I.P.S. Annual Conference (2005-12)

Page 8: Ga final report
Page 9: Ga final report

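import subprocess

# Fragment: `input_file` and `output_file` are set by the surrounding loop.
if input_file.endswith(".pdf"):
    print("extracting %s as %s" % (input_file, output_file))
    try:
        # pdf2txt.py is the command-line tool that ships with PDFMiner.
        subprocess.call(["pdf2txt.py", "-c", "utf-8", "-o", output_file, input_file])
    except OSError:
        pass  # skip the file if pdf2txt.py cannot be launched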

Page 10: Ga final report

from nltk.stem import PorterStemmer

def stem_word(word):
    """Takes a single word input and returns the stem."""
    stemmer = PorterStemmer()
    return stemmer.stem(word)

Page 11: Ga final report

['use', 'atom', 'action', 'control', 'snake', 'robot', 'locomotion', 'joseph', 'kooemail', 'jckoo', 'stanfordedu', 'introduction', 'locomot', 'problem', 'gener', 'difficult', 'because', 'larg', 'amount', 'coordin', 'requir', 'make', 'sure', 'desir', 'movement', 'occur', 'undesir', 'action', 'avoid', 'snake', 'locomot', 'exception', 'rule', 'project', 'show', 'simplifi', 'control', 'snake', 'robot', 'comput', 'tractable', 'still', 'produc', 'interest', 'locomot', 'strategi', 'particular', 'applic', 'snake', 'robot', 'navigate', 'complex', 'terrain', 'order', 'reach', 'goal', 'traversing', 'greatest', 'distanc', 'possibl', 'scenario', 'could', 'aris', 'exampl', 'applic', 'robot', 'need', 'look', 'through', 'rubbl', 'build', 'collaps', 'order', 'search', 'survivor', 'situat', 'wellknown', 'strategi', 'work', 'strategi', 'highlyspeci', 'particular', 'terrain', 'case', 'exampl', 'accept', 'movement', 'produc', 'terrain', 'enough', 'contact', 'snake', 'surfac', 'occur', 'regular', 'interv', 'along', 'snake', 'length', 'unfortun', 'hard', 'come', 'rubbl', 'snake', 'robot', 'use', 'project', 'shown', 'figur', 'consist', 'bodi', 'segment', 'connect', 'chain', 'consecut', 'bodi', 'segment', 'connect', 'pair', 'joint', 'give', 'degre', 'freedom', 'link', 'total', 'degre', 'freedom', 'entir', 'precis', 'control', 'snake', 'build']
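The token list above looks like the result of lowercasing, stripping punctuation inside words (hence 'wellknown' and 'stanfordedu'), dropping common words and stemming. The slides do not show that cleaning step, so the following is only a sketch of what it might look like. The function name `clean_tokens`, the tiny stopword set and the optional `stem` callable are all assumptions, not the project's code.

```python
import re

# Tiny illustrative stopword set; the slides do not say which list was used.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "for"}


def clean_tokens(text, stem=None):
    """Lowercase, strip non-letters, drop stopwords, optionally stem."""
    tokens = []
    for raw in text.lower().split():
        # Remove everything except a-z, so "well-known" becomes "wellknown".
        word = re.sub(r"[^a-z]", "", raw)
        if not word or word in STOPWORDS:
            continue
        tokens.append(stem(word) if stem else word)
    return tokens


print(clean_tokens("The control of a well-known snake robot."))
# prints ['control', 'wellknown', 'snake', 'robot']
```

Passing `stem=stem_word` would plug in the Porter stemmer from the earlier slide.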

Page 12: Ga final report

full_corpus.txt

5,650,772 tokens

Page 13: Ga final report

analysis

Page 14: Ga final report

from collections import Counter

def calc_unigrams(text, number=100):
    """Returns frequency of unigrams from a text input."""
    # `text` is an iterable of tokens, not a raw string.
    words = [w for w in text]
    count = Counter(words).most_common(number)
    return count

Page 15: Ga final report

from nltk.collocations import BigramCollocationFinder

def calc_bigrams(text, min_freq=100):
    """Returns frequency of bigrams from a text input."""
    words = [w.lower() for w in text]
    bcf = BigramCollocationFinder.from_words(words)
    bcf.apply_freq_filter(min_freq)
    # bigram_list was never defined on the slide; build it locally.
    bigram_list = list(bcf.ngram_fd.items())
    return bigram_list

Page 16: Ga final report

from nltk.collocations import TrigramCollocationFinder

def calc_trigrams(text, min_freq=50):
    """Returns frequency of trigrams from a text input."""
    words = [w.lower() for w in text]
    tcf = TrigramCollocationFinder.from_words(words)
    tcf.apply_freq_filter(min_freq)
    # trigram_list was never defined on the slide; build it locally.
    trigram_list = list(tcf.ngram_fd.items())
    return trigram_list
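For readers without NLTK installed, the bigram and trigram counts above can be approximated with the standard library alone. This is a sketch of the same idea (count contiguous n-grams, then drop the rare ones), not a drop-in replacement for the collocation finders. The name `count_ngrams` and the sample tokens are made up for illustration.

```python
from collections import Counter


def count_ngrams(tokens, n=2, min_freq=1):
    """Count contiguous n-grams and keep those seen at least min_freq times."""
    words = [w.lower() for w in tokens]
    # Zipping n staggered views of the list yields every contiguous n-gram.
    ngrams = zip(*(words[i:] for i in range(n)))
    counts = Counter(ngrams)
    return {gram: c for gram, c in counts.items() if c >= min_freq}


tokens = "snake robot control snake robot locomot".split()
print(count_ngrams(tokens, n=2, min_freq=2))
# prints {('snake', 'robot'): 2}
```

Setting `n=3` gives trigrams; the `min_freq` cut-off plays the role of `apply_freq_filter`.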

Page 17: Ga final report

results

Page 18: Ga final report

[Figure: papers per year, 2005-2012, cs229 vs. nips; y-axis from 0 to 700]

Page 19: Ga final report
Page 20: Ga final report
Page 21: Ga final report
Page 22: Ga final report

HOT :)
PCA: 3.4x
gibbs sampling: 2.9x
logistic regression: 2.7x
naïve bayes: 2.7x
cross validation: 2.3x
random forest: 2.1x

NOT :(
nonlinear dimensionality reduction: 0.2x
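The slides do not say how the multipliers were computed. One plausible reading is a fold change: a term's share of all tokens in a later period divided by its share in an earlier period. The sketch below works under that assumption only; the function name `fold_change` and the counts are invented for illustration and are not the project's data.

```python
def fold_change(early_count, early_total, late_count, late_total):
    """Ratio of a term's relative frequency in the late vs. early period."""
    if early_count == 0 or late_total == 0:
        raise ValueError("fold change is undefined for a zero denominator")
    # (late_count / late_total) / (early_count / early_total), rearranged
    # so that only one division is performed.
    return float(late_count * early_total) / (early_count * late_total)


# Made-up counts: 10 mentions per 100,000 tokens early, 34 per 100,000 late.
print("%.1fx" % fold_change(10, 100000, 34, 100000))
# prints 3.4x
```

Normalising by the total token count matters because the number of papers per year is not constant; raw counts alone would overstate growth.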

Page 23: Ga final report

extensions

Page 24: Ga final report

• more data
• more analysis
• better visualizations
• classification model, i.e. see if I can predict which year a paper is from based on the language used