Demographics and Weblog Targeting
TRANSCRIPT
copyright All Rights Reserved Doug Chang dougc at stanfordalumni dot org
Demographics and Weblog Hackathon – Case Study
5.3% of Motley Fool visitors are subscribers. Design a classification model for insight into which variables are
important for strategies to increase the subscription rate.
Learn by Doing
http://www.meetup.com/HandsOnProgrammingEvents/
Data Mining Hackathon
Funded by Rapleaf
• With Motley Fool's data
• App note for Rapleaf/Motley Fool
• Template for other hackathons
• Did not use AWS; R on individual PCs
• Logistics: Rapleaf funded prizes and food for 2 weekends for ~20-50 people. Venue was free.
Getting more subscribers
Headline Data, Weblog
Demographics
Cleaning Data
• training.csv (201,000), headlines.tsv (811 MB), entry.tsv (100k), demographics.tsv (loading sketch below)
• Feature Engineering
• Github:
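A minimal sketch of loading these files in R. File names and sizes are from the slide above; the tab-separated layout and the shared "uid" join key are assumptions, not from the slides.

# Load the hackathon files
train <- read.csv("training.csv")                      # ~201,000 rows
demo  <- read.delim("demographics.tsv")                # tab-separated
entry <- read.delim("entry.tsv")                       # ~100k rows
heads <- read.delim("headlines.tsv", nrows = 100000)   # ~811 MB; read in chunks if memory is tight

# Join demographics onto the training labels (key column "uid" is an assumption)
dat <- merge(train, demo, by = "uid", all.x = TRUE)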
Ensemble Methods
• Bagging, Boosting, Random Forests (fitting sketch after this list)
• Overfitting
• Stability (small changes make large prediction changes)
• Previously none of these worked at scale
• Small-scale results using R; large-scale versions exist in proprietary implementations (Google, Amazon, etc.)
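As one concrete example, a boosted regression tree classifier can be fit with the gbm package in R. This is a sketch, not the exact hackathon code; the response and predictor column names (subscriber, pageV, loc, ...) are assumptions taken from the importance table later in the deck.

library(gbm)

set.seed(1)
form <- subscriber ~ pageV + loc + income + age + residlen + home +
        marital + sex + prop + child + own
fit <- gbm(form, data = dat,
           distribution      = "bernoulli",  # subscriber assumed coded 0/1
           n.trees           = 2000,         # number of trees
           interaction.depth = 3,            # "tc": tree complexity
           shrinkage         = 0.01,         # "lr": learning rate
           cv.folds          = 5)            # keep CV folds for later tuning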
ROC Curves
Binary Classifier Only!
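A quick sketch of producing an ROC curve and AUC for a binary classifier, using the pROC package and the gbm fit assumed above (pROC is one common choice; other packages work too).

library(pROC)

pred    <- predict(fit, newdata = dat, n.trees = 2000, type = "response")
roc_obj <- roc(dat$subscriber, pred)   # true 0/1 labels vs. predicted probabilities
plot(roc_obj)                          # the ROC curve
auc(roc_obj)                           # area under the curve: 0.5 = random, 1.0 = perfect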
Paid Subscriber ROC curve, ~61%
Boosted Regression Trees Performance
• Training data ROC score = 0.745
• CV ROC score = 0.737; se = 0.002 (see the CV sketch below)
• 5.5% less performance than the winning score, without doing any data preprocessing
• Random is 50% (0.50). At 0.737 we are 23.7 percentage points better than random.
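A minimal sketch of how a cross-validated ROC score like the 0.737 above can be computed: fit on k-1 folds, predict the held-out fold, pool the predictions, and score once. It reuses the assumed formula and data from the earlier sketches.

library(gbm)
library(pROC)

k    <- 5
fold <- sample(rep(1:k, length.out = nrow(dat)))
pred <- numeric(nrow(dat))
for (i in 1:k) {
  m <- gbm(form, data = dat[fold != i, ],
           distribution = "bernoulli", n.trees = 2000,
           interaction.depth = 3, shrinkage = 0.01)
  pred[fold == i] <- predict(m, dat[fold == i, ], n.trees = 2000,
                             type = "response")
}
auc(roc(dat$subscriber, pred))   # ~0.50 is random; higher is better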
Contribution of predictor variables
Predictive Importance
• Friedman: the number of times a variable is selected for splitting, weighted by the squared-error improvement to the model. A measure of sparsity in the data. (Sketch below.)
• Fit plots show one variable's effect with the other model variables averaged out.

Relative influence:
 1  pageV     74.0567852
 2  loc       11.0801383
 3  income     4.1565597
 4  age        3.1426519
 5  residlen   3.0813927
 6  home       2.3308287
 7  marital    0.6560258
 8  sex        0.6476549
 9  prop       0.3817017
10  child      0.2632598
11  own        0.2030012
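In R, gbm reports this Friedman relative-influence measure, and the single-variable fit (partial dependence) plots, directly from the fitted model; a sketch using the fit assumed earlier:

summary(fit)                  # relative influence table, as above (pageV, loc, ...)
plot(fit, i.var = "pageV")    # fitted effect of pageV with the other variables averaged out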
Behavioral vs. Demographics
• Demographics are sparse
• Behavioral weblogs are the best source. Most sites aren't using this information correctly. There is no single correct answer; it is trial and error on features. The features are more important than the algorithm.
• Linear vs. Nonlinear
Fitted Values (Crappy)
Fitted Values Better
Predictor Variable Interaction
• Adjusting variable interactions
Variable Interactions
Plot Interactions age, loc
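A sketch of how the age and loc interaction can be measured and plotted with gbm: interact.gbm gives Friedman's H interaction statistic, and plot.gbm draws the joint partial-dependence surface. Variable names follow the slide; the fit is the assumed one from earlier.

interact.gbm(fit, data = dat, i.var = c("age", "loc"), n.trees = fit$n.trees)   # Friedman's H statistic
plot(fit, i.var = c("age", "loc"))                                               # joint partial-dependence plot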
Trees vs. other methods
• You can see multiple levels, which is good for trees. Do other variables match this? Simplify the model or add more features; iterate to a better model.
• No math required; this is analyst work.
Number of Trees
Data Set Number of Trees
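A sketch of choosing the number of trees: gbm.perf plots training vs. held-out (CV) deviance as trees are added and returns the tree count at the minimum of the CV curve (requires the cv.folds used in the earlier assumed fit).

best_iter <- gbm.perf(fit, method = "cv")   # plots the deviance curves, returns the optimum
best_iter                                   # number of trees to use for prediction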
Hackathon Results
Weblogs only: 68.15%, 18 percentage points better than random
Demographics add 1%
AWS Advantages
• Running multiple instances with different algorithms and parameters using R
• Add tutorial, install Screen, R GUI bugs
• http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
Conclusion
• Data mining at scale requires more development in visualization, MR algorithms, and MR data preprocessing.
• Tuning using visualization. Tune 3 parameters: tc (tree complexity), lr (learning rate), and the number of trees. Didn't cover 2 of the 3.
• This isn't reproducible in Hadoop/Mahout or any open-source code I know of.
• Other use cases: predicting which items will sell (eBay), search engine ranking.
• Be careful with MR paradigms: Hadoop MR != Couchbase MR.