Mining the Social Web for Fun and Profit: A Getting Started Guide

Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com

Front Range PyData Meetup - 21 May 2014

Upload: matthew-russell

Post on 27-Jan-2015

DESCRIPTION

A presentation to the Front Range PyData Meetup on how to get started with Mining the Social Web.

TRANSCRIPT


Overview

Intro (5 mins)

Virtual Machine Experience (10 mins)

Virtual Machine and IPython Notebook Demonstration (10 mins)

Mining Twitter: A Primer (20 mins)

Wrap Up/Final Q&A (10 mins)


Intro


Hello, My Name Is ... Matthew

Background in Computer Science

Data mining & machine learning

CTO @ Digital Reasoning Systems

Data mining; machine learning

Author @ O'Reilly Media

5 published books on technology

Principal @ Zaffra

Selective boutique consulting

Transforming Curiosity Into Insight

An open source software (OSS) project

http://bit.ly/MiningTheSocialWeb2E

A (rewritten) book

http://bit.ly/135dHfs

Accessible to (virtually) everyone

Virtual machine with turn-key coding templates for data science experiments

Think of the book as "premium" support for the OSS project

The Social Web Is All the Rage

World population: ~7B people

Facebook: 1.15B users

Twitter: 500M users

Google+: 343M users

LinkedIn: 238M users

~200M+ blogs (conservative estimate)

Table of Contents (1/2)

Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More

Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More

Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More

Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and More

Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More

Chapter 6 - Mining Mailboxes: Analyzing Who's Talking to Whom About What, How Often, and More

Table of Contents (2/2)

Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More

Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More

Chapter 9 - Twitter Cookbook

Appendix A - Information About This Book's Virtual Machine Experience

Appendix B - OAuth Primer

Appendix C - Python and IPython Notebook Tips & Tricks

Anatomy of Each Chapter

Brief Intro

Objectives

API Primer

Analysis Technique(s)

Data Visualization

Recap

Suggested Exercises

Recommended Resources

The Virtual Machine Experience

Why do you need a VM?

To save time

Because installation and configuration management is harder than it first appears

So that you can focus on the task at hand instead

So that I can support you regardless of your hardware and operating system

Arguably, it's even a best practice for a dev environment

But I can do all of that myself...

True...

If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand

At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages

Including scientific computing tools that require underlying C/C++ code to be compiled

Which requires specific versions of developer libraries to be installed

You get the idea...

The Virtual Machine Experience

Vagrant

A nice abstraction around virtual machine providers

One ring to rule them all

VirtualBox, VMware, AWS, ...

IPython Notebook

The easiest way to program with Python

A better REPL (interpreter)

Great for hacking

What happens when you vagrant up?

Vagrant follows the instructions in your Vagrantfile

Starts up a VirtualBox instance

Uses Chef to provision it

Installs OS patches/updates

Installs MTSW software dependencies

Starts IPython Notebook server on port 8888

Why Should I Use IPython Notebook?

Because it's great for hacking

And hacking is usually the first step

Because it's great for collaboration

Sharing/publishing results is trivial

Because the UX is as easy as working in a notepad

Think of it as "executable paper"

VM Quick Start Instructions

Go to http://MiningTheSocialWeb.com/quick-start/

Follow the instructions

And watch the screencasts!

Basically:

Install VirtualBox & Vagrant

Run "vagrant up" in a terminal to start a guest VM

Then, go to http://localhost:8888 on your host machine's web browser
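Provisioning can take a while on the first run, so it can help to check whether anything is answering on port 8888 before opening the browser. The following is a minimal sketch, not part of the quick-start instructions; the function name `notebook_server_is_up` is made up for illustration, and it only tests that a TCP connection succeeds, not that the listener is actually IPython Notebook.

```python
import socket


def notebook_server_is_up(host="localhost", port=8888, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError when nothing is listening
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if notebook_server_is_up():
        print("Something is listening at http://localhost:8888")
    else:
        print("Nothing on port 8888 yet; has 'vagrant up' finished?")
```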

An (AWS) Hosted Virtual Machine

Is it free?

Perhaps...

...Sign up for the AWS free tier at http://aws.amazon.com/free/

But not right now. Do it later

See this blog post for some inspiration on how to easily build your own AMI from Vagrant boxes

http://wp.me/p3QiJd-3T

Virtual Machine and IPython Notebook Demonstration

Demonstration of Virtual Machine

http://nbviewer.ipython.org

http://MiningTheSocialWeb.com/quick-start/

Your first "vagrant up"

Mining Twitter: A Primer

Objectives

Be able to identify Twitter primitives

Understand tweet metadata and how to use it

Learn how to extract entities such as user mentions, hashtags, and URLs from tweets

Apply techniques for performing frequency analysis with Python

Be able to plot histograms of Twitter data with IPython Notebook

Twitter Primitives

Account Types: "Anything"

"Following" Relationships

Favorites

Retweets

Replies

(Almost) No Privacy Controls

API Requests

RESTful requests

Everything is a "resource"

You GET, PUT, POST, and DELETE resources

Standard HTTP "verbs"

Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining

Streaming API filters

JSON responses

Cursors (not quite pagination)
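To make the request and cursor ideas concrete, here is a small sketch. It builds the example GET URL from a resource name and query parameters, then walks a cursored result set until the cursor signals the end. The helper names (`build_url`, `walk_cursor`) and the `FAKE_PAGES` data are invented for illustration; a real call would also need OAuth credentials and an HTTP client, which are left out here.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_url(resource, **params):
    """Build the URL for a GET request against a v1.1-style resource."""
    query = urlencode(sorted(params.items()))
    return f"{API_ROOT}/{resource}.json" + (f"?{query}" if query else "")


def walk_cursor(fetch_page):
    """Collect ids from a cursored resource.

    fetch_page is a caller-supplied function (hypothetical here) that takes
    a cursor value and returns a dict with "ids" and "next_cursor" keys.
    """
    ids, cursor = [], -1      # -1 asks for the first page
    while cursor != 0:        # a next_cursor of 0 means no more pages
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
    return ids


# Stand-in for authenticated HTTP calls: three made-up pages of ids.
FAKE_PAGES = {
    -1: {"ids": [1, 2], "next_cursor": 10},
    10: {"ids": [3, 4], "next_cursor": 20},
    20: {"ids": [5], "next_cursor": 0},
}

print(build_url("statuses/user_timeline", screen_name="SocialWebMining"))
print(walk_cursor(FAKE_PAGES.__getitem__))  # [1, 2, 3, 4, 5]
```

The cursor is an opaque token handed back by the previous response rather than a page number, which is why it is "not quite pagination".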

Twitter is an Interest Graph

Figure: interest graph with node labels Roberto Mercedes, Jorge, Ana, Nina, Johnny Araya, Rodolfo Hernández
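One way to picture an interest graph in code is as a set of directed "following" edges, where following someone does not imply being followed back. The sketch below is illustrative only: the account names and every edge are invented, and the helper names are not from the talk.

```python
from collections import Counter

# Directed edges (follower, followed). All of these are made up.
follows = [
    ("ana", "news_account"),
    ("jorge", "news_account"),
    ("nina", "news_account"),
    ("ana", "jorge"),
    ("jorge", "ana"),
    ("nina", "python_tips"),
]


def build_graph(edges):
    """Map each account to the set of accounts it follows."""
    graph = {}
    for follower, followed in edges:
        graph.setdefault(follower, set()).add(followed)
        graph.setdefault(followed, set())  # make sure every node appears
    return graph


def most_followed(edges, n=1):
    """Rank accounts by how many followers they have (in-degree)."""
    return Counter(followed for _, followed in edges).most_common(n)


def mutual_follows(graph):
    """Return pairs of accounts that follow each other."""
    pairs = {
        tuple(sorted((a, b)))
        for a, targets in graph.items()
        for b in targets
        if a in graph[b]
    }
    return sorted(pairs)


graph = build_graph(follows)
print(most_followed(follows))   # [('news_account', 3)]
print(mutual_follows(graph))    # [('ana', 'jorge')]
```

Most edges here are one-way, which is the sense in which the graph records interest rather than mutual friendship.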

What's in a Tweet?

140 Characters ...

... Plus ~5KB of metadata!

Authorship

Time & location

Tweet "entities"

Replying, retweeting, favoriting, etc.
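As a rough illustration of what that metadata looks like once the JSON is parsed, the sketch below pulls a few fields out of a tweet dictionary. The sample tweet is hand-written and heavily trimmed (a real one carries far more fields), the values are made up, and `summarize_metadata` is a hypothetical helper, not something from the book's notebooks.

```python
from datetime import datetime

# Hand-written sample shaped like a v1.1 tweet object; values are made up.
sample_tweet = {
    "text": "Mining the Social Web with #python",
    "created_at": "Wed May 21 18:30:00 +0000 2014",
    "user": {"screen_name": "example_user", "followers_count": 42},
    "retweet_count": 3,
    "favorite_count": 5,
    "coordinates": None,
    "in_reply_to_screen_name": None,
}


def summarize_metadata(tweet):
    """Pick out authorship, time, location, and interaction fields."""
    # created_at uses a fixed English-language format with a UTC offset
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "author": tweet["user"]["screen_name"],
        "created": created.isoformat(),
        "has_location": tweet.get("coordinates") is not None,
        "is_reply": tweet.get("in_reply_to_screen_name") is not None,
        "retweets": tweet.get("retweet_count", 0),
        "favorites": tweet.get("favorite_count", 0),
    }


for field, value in summarize_metadata(sample_tweet).items():
    print(f"{field}: {value}")
```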

What are Tweet Entities?

Essentially, the "easy to get at" data in the 140 characters

@usermentions

#hashtags

URLs

multiple variations

(financial) symbols

stock tickers

media
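Because the API returns entities already parsed out, extracting them is mostly a matter of walking the `entities` field of each tweet. Here is a minimal sketch under that assumption; the sample tweet is invented and trimmed, and `extract_entities` is an illustrative name rather than the book's own example code.

```python
# Invented sample with only the fields this sketch needs.
sample_tweet = {
    "text": "@SocialWebMining loving #python and #datamining http://t.co/abc $AAPL",
    "entities": {
        "user_mentions": [{"screen_name": "SocialWebMining"}],
        "hashtags": [{"text": "python"}, {"text": "datamining"}],
        "urls": [{"url": "http://t.co/abc",
                  "expanded_url": "http://example.com/post"}],
        "symbols": [{"text": "AAPL"}],
    },
}


def extract_entities(tweet):
    """Return the entity values of one tweet, grouped by entity type."""
    entities = tweet.get("entities", {})
    return {
        "screen_names": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        # "media" is only present when the tweet has attachments
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


print(extract_entities(sample_tweet))
```

Using `.get(..., [])` for each entity type keeps the function from failing on tweets that lack one of them.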

Data Mining Is...

Counting

Comparing

Filtering

Ranking
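Each of those four verbs maps onto a line or two of standard-library Python. The sketch below applies them to two small, made-up lists of hashtags; the variable names and the threshold of 2 are arbitrary choices for illustration.

```python
from collections import Counter

# Made-up hashtags from two hypothetical accounts.
hashtags_a = ["python", "data", "python", "pydata", "data", "python", "vagrant"]
hashtags_b = ["python", "ipython", "pydata"]

# Counting: how often does each hashtag appear?
counts = Counter(hashtags_a)

# Comparing: which hashtags do both accounts use?
shared = sorted(set(hashtags_a) & set(hashtags_b))

# Filtering: keep only hashtags seen at least twice.
frequent = {tag: n for tag, n in counts.items() if n >= 2}

# Ranking: order by frequency, breaking ties alphabetically.
ranked = sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0]))

print(shared)   # ['pydata', 'python']
print(ranked)   # [('python', 3), ('data', 2)]
```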

Histograms

A chart that is handy for frequency analysis

They look like bar charts...except they're not bar charts

Each value on the x-axis is a range (or "bin") of values

Not categorical data

Each value on the y-axis is the combined frequency of values in each range
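The binning step is what separates a histogram from a bar chart, and it is short enough to write by hand. This is a dependency-free sketch with made-up retweet counts and a text rendering; in IPython Notebook one would more likely hand the raw values to a plotting library. The `histogram` function and the bin width of 10 are assumptions for illustration.

```python
def histogram(values, bin_width):
    """Map each bin's lower edge to the count of values in [edge, edge + bin_width)."""
    bins = {}
    for v in values:
        edge = (v // bin_width) * bin_width  # lower edge of the bin holding v
        bins[edge] = bins.get(edge, 0) + 1
    return dict(sorted(bins.items()))


# Made-up retweet counts for ten tweets.
retweet_counts = [0, 1, 1, 3, 7, 12, 15, 18, 22, 41]

bin_width = 10
for edge, freq in histogram(retweet_counts, bin_width).items():
    # Range labels assume integer values
    print(f"{edge:>3}-{edge + bin_width - 1:<3} {'#' * freq}")
```

Empty bins (30-39 here) are simply absent from the result; a plotting library would draw them as gaps.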

Example: Histogram of Retweets

Social Media Analysis Framework

A memorable four-step process to guide data science experiments:

Aspire

To test a hypothesis (answer a question)

Acquire

Get the data

Analyze

Count things

Summarize

Plot the results
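The four steps can be laid out as one small script, with a function per step after the question is stated. The sketch below is only a skeleton: the question, the made-up tweets returned by `acquire`, and all function names are assumptions, and a real experiment would replace `acquire` with API calls and `summarize` with a plot.

```python
from statistics import mean

# Aspire: do tweets with hashtags get retweeted more than tweets without?


def acquire():
    """Stand-in for an API call; returns made-up, trimmed tweets."""
    return [
        {"hashtags": ["python"], "retweet_count": 8},
        {"hashtags": [], "retweet_count": 1},
        {"hashtags": ["pydata", "python"], "retweet_count": 12},
        {"hashtags": [], "retweet_count": 3},
    ]


def analyze(tweets):
    """Compare the mean retweet count of the two groups."""
    with_tags = [t["retweet_count"] for t in tweets if t["hashtags"]]
    without = [t["retweet_count"] for t in tweets if not t["hashtags"]]
    return {"with_hashtags": mean(with_tags), "without_hashtags": mean(without)}


def summarize(result):
    """Render the result as aligned text (a plot would go here instead)."""
    return "\n".join(f"{name:<17}{value:6.1f}" for name, value in result.items())


print(summarize(analyze(acquire())))
```

Four tweets prove nothing, of course; the point is only the shape of the workflow.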

Recommended Exercises

Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook

Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook

Fill in Example 1-1 with credentials and begin work

Execute each example sequentially

Customize queries

Explore tweet metadata; count tweet entities; plot histograms of results

Explore the "Chapter 9 (Twitter Cookbook)" notebook

Think of it as a collection of building blocks

Final Q&A and Wrap Up

Recommended Resources

http://MiningTheSocialWeb.com

Mining the Social Web 2E Chapter 1 (Chimera)

http://bit.ly/13XgNWR

Source Code (GitHub)

http://bit.ly/MiningTheSocialWeb2E

http://bit.ly/1fVf5ej (numbered examples)

Screencasts (Vimeo)

http://bit.ly/mtsw2e-screencasts
