welcome to ist 380 !


Upload: havard

Post on 09-Jan-2016

Page 1: Welcome to IST 380 !

Welcome to IST 380 !

Data Science Programming

"When the course was over, I knew it was a good thing." - New York Times Review of Courses

"We don't have strong enough words to describe this class." - US News and Course Report

"We give this course two thumbs!" - Ebert and Roeper

an advocate of concrete computing – and HMC's mascot

Page 2: Welcome to IST 380 !

Welcome to IST 380 !

Data Science Programming

an advocate of concrete computing – and HMC's mascot

Page 3: Welcome to IST 380 !

About myself

Who: Zach Dodds

Where: Harvey Mudd College

What: Research includes robotics and computer vision

When: Mondays 7-10pm here in ACB 119

Contact Information

[email protected]

909-607-0867

HMC Beckman B111

Office Hours: Friday mornings, 9-11 am, or set up a time...

Page 4: Welcome to IST 380 !

TMI?

fan of low-tech games

fan of low-level AI

Page 5: Welcome to IST 380 !

IST 380 ~ the big picture

What is it? Why me?

Page 6: Welcome to IST 380 !

IST 380 ~ the big picture

Data Science Venn Diagram

Hmmm… where am I on this diagram?

What is it?

Page 7: Welcome to IST 380 !

Data?!

• Neighbor's name

• A place they consider home

• Are they working at a company now? Where?

• How many U.S. states have they visited?

• Their favorite unhealthy food… ?

• Do they have any "Data Science" background? (statistics, machine learning, CS)

Page 8: Welcome to IST 380 !

state reminders…

Page 9: Welcome to IST 380 !

Data!

• Neighbor's name: Zachary Dodds

• A place they consider home: Pittsburgh, PA

• Are they working at a company now? Where? Harvey Mudd

• How many U.S. states have they visited? 44

• Their favorite unhealthy food… ? M&Ms

• Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…

Page 10: Welcome to IST 380 !

Data!

• Neighbor's name: Zachary Dodds

• A place they consider home: Pittsburgh, PA

• Are they working at a company now? Where? Harvey Mudd

• How many U.S. states have they visited? 44

• Their favorite unhealthy food… ? M&Ms

• Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…

be sure to set up your login + profile for the submission site…

This class is truly seminar-style: I'm here, as you are, in order to gain insights into this very new field…

Page 11: Welcome to IST 380 !

Data Science concerns

Is "Data Science" important or just trendy?

Page 12: Welcome to IST 380 !

Hmmm…

Data Science concerns

Page 13: Welcome to IST 380 !

the companies are expanding as fast as the data!

Page 14: Welcome to IST 380 !

There's certainly a lot of it!

Data, data everywhere…

[Figure: data volumes on a logarithmic scale, with axis marks at 1 Petabyte, 1 Exabyte, and 1 Zettabyte]

1 TB = 1000 GB

1 Petabyte == 1000 TB

Human brain's capacity: 14 PB

100 years of HD video + audio: 60 PB

120 PB

Data produced each year

2002: 5 EB

2006: 161 EB

2009: 800 EB

2011: 1.8 ZB

2015: 8.0 ZB

References

(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store

(life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound) – almost certainly a gross overestimate, as sleep can be compressed significantly!

(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm

(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf

(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf

(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm

(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
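To get a feel for why this chart needs a logarithmic scale, here is a small R sketch that puts the slide's yearly figures into bytes and takes log10 of each. The unit constants assume decimal prefixes (consistent with "1 Petabyte == 1000 TB"), and the variable names are only illustrative.

```r
# Decimal storage units, matching the slide's "1 Petabyte == 1000 TB"
TB <- 1e12
PB <- 1000 * TB
EB <- 1000 * PB
ZB <- 1000 * EB

# Data produced each year, as quoted on the slide
produced <- c("2002" = 5 * EB, "2006" = 161 * EB, "2009" = 800 * EB,
              "2011" = 1.8 * ZB, "2015" = 8.0 * ZB)

# Positions of each year on a base-10 logarithmic axis
log10(produced)

# How many times larger the 2015 figure is than the 2002 figure
ratio <- unname(produced["2015"] / produced["2002"])
ratio

# The two reference points from the slide, for comparison
brain <- 14 * PB   # human brain's capacity
video <- 60 * PB   # 100 years of HD video + audio
unname(produced["2002"] / video)
```

On a linear axis the petabyte-sized reference points would be invisible next to the zettabyte-sized yearly totals; the log10 values sit only a few units apart.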

Page 15: Welcome to IST 380 !

data

information

knowledge

wisdom

I'd call it data, not information

Page 16: Welcome to IST 380 !

Big Data?

I agree with this…

Page 17: Welcome to IST 380 !

Make data easier to use ~ by using it!

It may be true that Data Science isn't

a science – but that doesn't mean

it's not useful!

Page 18: Welcome to IST 380 !

IST 380 ~ the big picture

What? Data Science Programming

Why? Data Rules

All of our insights – large and small, permanent and ephemeral, natural and artificial – come about

through the integration of lots of data.

Data Science simply recognizes that the rules and skills behind those insights are widely applicable…

Page 19: Welcome to IST 380 !

A few examples…

Make3d

How is this being done?

Andrew Ng ~ Computers and Thought award, 2009

… Data Science is at the heart of computer science

and how do we succeed?

Page 20: Welcome to IST 380 !

A few examples…

… Data Science is at the heart of computer science

Stanford's Autonomous Vehicles project (Thrun et al.)

Learning to Powerslide

Page 21: Welcome to IST 380 !

A few examples…

… Data Science is at the heart of computer science

"my summer was finding that red line"

Learning ground from obstacles

Page 22: Welcome to IST 380 !

A few examples…

Learning ground from obstacles

classification segmentation

Page 23: Welcome to IST 380 !

Insights beyond science

Page 24: Welcome to IST 380 !

Marketing

Page 25: Welcome to IST 380 !

Visualization

Motivation

Page 26: Welcome to IST 380 !
Page 27: Welcome to IST 380 !

Recommender Systems

predicting movie ratings

Page 28: Welcome to IST 380 !

Netflix Prize

Bob Bell, winner of the "Netflix prize" (I don't know this guy)

Napoleon Dynamite = 1.22

Batman Begins = .75

Finding Nemo = ????

Lord of the Rings = ????

Some films are difficult to predict…

Page 29: Welcome to IST 380 !

Netflix Prize

Bob Bell, winner of the "Netflix prize" (I don't know this guy)

Napoleon Dynamite = 1.22

Batman Begins = .75

Finding Nemo = .67

Lord of the Rings = .42

Some films are difficult to predict… and others are easier!

Page 30: Welcome to IST 380 !

Why IST 380 ?

Specific skills:

R statistical environment (and the S programming language)

Experience with several statistical analyses (descriptive statistics)

Experience with predictive statistics (modeling) and machine learning algorithms

Page 31: Welcome to IST 380 !

Why IST 380 ?

Specific skills:

Broad background:

You'll be confident and capable with whatever datasets you encounter in the future – on your own or as part of a team.

R statistical environment (and the S programming language)

Experience with several statistical analyses (descriptive statistics)

Experience with predictive statistics (modeling) and machine learning algorithms

Final project ~ open-ended with datasets of your choice

Page 32: Welcome to IST 380 !

About IST 380 …

Page 33: Welcome to IST 380 !

Details

Web Page: http://www.cs.hmc.edu/~dodds/IST380

Assignments, online text, necessary files, and lecture slides are linked

First week's assignment: Getting started with R

Programming: R

www.r-project.org/

Textbook: An Introduction to Data Science

jsresearch.net/groups/teachdatascience/

Grab both of these now…

freely available online

and many online resources…

Page 34: Welcome to IST 380 !

Homepage

Go to the course page

http://www.cs.hmc.edu/~dodds/IST380/

Grab R and the text from these two links…

Page 35: Welcome to IST 380 !

Homework

Assignments: ~ 2-5 problems/week, ~ 100 points; extra credit, often

Due Tuesday of the following week by 11:59 pm.

Assignment 1 due Tuesday, February 5.

1 week + 1 day…

Page 36: Welcome to IST 380 !

Homework

Working on programs: On your own or in groups of 2.

Divide the work at the keyboard evenly!

Submitting programs: at the submission website

Today's Lab:

• install software

• ensure accounts are working

• try out R - the first HW is officially due on 2/5

Assignments: ~ 2-5 problems/week, ~ 100 points; extra credit, often

Due Tuesday of the following week by 11:59 pm.

Assignment 1 due Tuesday, February 5.

Page 37: Welcome to IST 380 !

Outline

Weeks 1-5

using R

descriptive statistics

predictive statistics

probability distributions

Weeks 6-10

"Data Science"

"Machine Learning"

statistical modeling

support vector machines (SVMs)

random forests

k-means algorithm

nearest neighbors (NN)

Weeks 11-15

approximate!

Final Project

No breaks?!

Page 38: Welcome to IST 380 !

Grading

Grades

Based on points percentage

~ 800 points for assignments

~ 400 points for the final project

if score >= 0.95: grade = "A"
elif score >= 0.90: grade = "A-"
elif score >= 0.86: grade = "B+"

see the course syllabus for the full list...

Final project

• the last ~4 weeks will work towards a larger, final project

• choose your own problem to study (I'll have some suggestions, too.)

• there will be a short design phase and a short final presentation

• I'd encourage you to connect R and our Data Science techniques to other datasets or projects that you use/need/like, etc.
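As a rough illustration of the points-percentage rule, the grading cutoffs from this slide can be written as a small R function. This is only a sketch: `letter_grade` is a made-up helper, it encodes just the three cutoffs shown here, and it returns NA for anything lower because the full list is in the syllabus. The points earned in the example are invented.

```r
# Point totals from the slide
assignment_points <- 800
project_points <- 400
total_points <- assignment_points + project_points

# Hypothetical helper: only the three cutoffs shown on the slide are encoded.
# Lower scores return NA, since the remaining cutoffs are in the syllabus.
letter_grade <- function(score) {
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    NA_character_
  }
}

# Example with invented points earned
earned <- 1100
letter_grade(earned / total_points)
```

The checks run from the highest cutoff down, so each score gets the first (best) grade it qualifies for.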

Page 39: Welcome to IST 380 !

Academic Honesty

This course operates under CGU's (and all of Claremont Schools') Academic Honesty policies…

•Your work must be your own. This must be true for the whole team, if you're working in a pair.

•Consulting with others (except team members or myself) is encouraged, but has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU’s academic honesty policy.

•A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.

Page 40: Welcome to IST 380 !

Thoughts?

Page 41: Welcome to IST 380 !

Getting to know… R

Page 42: Welcome to IST 380 !

Getting to know… R

http://lang-index.sourceforge.net/#categ

R is the programmer's toolkit for statistics; SAS, Stata, SPSS are preferred by those in business intelligence

Page 43: Welcome to IST 380 !

Getting to know… R

Free… and very well supported online…

Page 44: Welcome to IST 380 !

Getting to know… R

R is responsive, up-to-date, and flexible: Data Science vs. Statistics

Page 45: Welcome to IST 380 !

Getting to know… R

1) Find the IST 380 course webpage

Try it!

www.cs.hmc.edu/~dodds/IST380/

2) Download and install R

3) Run R and try some basic commands at the prompt:

6 * 7

rnorm(10)

x <- 380
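For reference, here are the same three commands with a comment on what each one does at the prompt. The last line is an addition (not on the slide) to show that assignment itself prints nothing.

```r
6 * 7        # arithmetic: R evaluates the expression and prints the result
rnorm(10)    # ten random draws from a standard normal distribution
x <- 380     # assignment with the <- arrow; nothing is printed
x            # typing a variable's name prints its value
```

Because `rnorm` is random, its output will differ every time it is run.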

Page 46: Welcome to IST 380 !

Getting started!

1) Open Matloff's Why R? notes

2) Skip ahead to page 7, the "5 minute example session"

3) Try out the commands in section 2.2 to get started…

4) When you finish, save your session and submit it!

This is problem 1 this week

Page 47: Welcome to IST 380 !

Saving your session

This is problem 1 this week

1) Create a folder named hw1, perhaps on your desktop

2) Use the Save to file… (Windows) or Save as… (Mac) in order to save your current console session into hw1

3) Name that file pr1.txt

4) From your operating system, open up that file in order to confirm it contains your whole session!

Page 48: Welcome to IST 380 !

Submitting your work

1) Zip up hw1 into hw1.zip

2) From the course webpage, click on the submission site link.

3) Choose a submission site login name & let me know!

4) Once your account is made, log in, change your password to something you know, and submit hw1.zip

5) You can submit again – all copies are saved…

You've completed Problem 1!

This webserver can be spacey -- I should know!

troubles? email me!

Page 49: Welcome to IST 380 !

Reflection

Average and standard deviation?

Assignment?

Comments?

Printing?

Comments?

Creating a vector?
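One possible set of answers to these reflection questions, as a short R sketch. The numbers in `v` are invented for illustration and are not taken from the tutorial.

```r
# Comments start with a # and run to the end of the line
v <- c(2, 4, 4, 4, 5, 5, 7, 9)   # creating a vector and assigning it to v
mean(v)                           # average
sd(v)                             # sample standard deviation (divides by n - 1)
print(v)                          # explicit printing
v                                 # implicit printing at the prompt
```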

Page 50: Welcome to IST 380 !

R types

You can use mode() to view the type of a variable.
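A few quick calls to `mode()` on different kinds of values; the values themselves are just examples.

```r
mode(380)          # "numeric"
mode("IST 380")    # "character"
mode(TRUE)         # "logical"
mode(c(1, 2, 3))   # "numeric": a vector reports the type of its elements

x <- 380
mode(x)            # works the same way on a variable
```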

Page 51: Welcome to IST 380 !

Where's the big data?

Vectors are R lists of a single type of element

c ~ concatenate

Page 52: Welcome to IST 380 !

Where's the big data?

Vectors are R lists of a single type of element

c ~ concatenate

the colon : also creates vectors
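A minimal sketch of both ways to build a vector. The variable names and values are illustrative only.

```r
states <- c(44, 12, 30)   # c() concatenates values into one vector
more <- c(states, 50)     # c() also joins existing vectors with new values
more

1:10                      # the colon builds the integers 1 through 10
length(1:10)

# A vector holds a single type, so mixed inputs are coerced to one type
mixed <- c(1, "a", TRUE)
mode(mixed)               # "character"
```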

Page 53: Welcome to IST 380 !

Analyzing vectors – try these…

Square brackets [] can "subset" (or "slice") vectors

Page 54: Welcome to IST 380 !

Analyzing vectors

Square brackets [] can "subset" (or "slice") vectors

you can use a boolean vector

to subset another vector
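The sketch below shows a few common ways to subset a vector with square brackets, including the boolean-vector form. The vector `v` is made up for illustration.

```r
v <- c(10, 20, 30, 40, 50)

v[2]           # the second element (R indexes from 1)
v[2:4]         # elements two through four
v[c(1, 5)]     # the first and fifth elements
v[-1]          # everything except the first element

keep <- v > 25 # a boolean vector: FALSE FALSE TRUE TRUE TRUE
v[keep]        # keeps only the positions where keep is TRUE
v[v > 25]      # the same thing in one step
```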

Page 55: Welcome to IST 380 !

NA

R uses NA to represent data that is "not available"

What is going on here?

The function is.na( ) tests for NA

Page 56: Welcome to IST 380 !

NA

R uses NA to represent data that is "not available"

What is going on here?

The function is.na( ) tests for NA

This uses subsetting to remove NA values!
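The slide's screenshot is not in this transcript, so here is a small stand-in example of the same idea: NA values propagate through a calculation, `is.na()` finds them, and subsetting removes them. The data are invented.

```r
v <- c(3, NA, 7, NA, 10)

mean(v)                  # NA: one missing value makes the whole mean unknown
is.na(v)                 # FALSE TRUE FALSE TRUE FALSE

clean <- v[!is.na(v)]    # subsetting keeps only the values that are not NA
clean
mean(clean)

mean(v, na.rm = TRUE)    # many functions can also drop NA values themselves
```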

Page 57: Welcome to IST 380 !

Data frames

R's fundamental data structures are data frames

The next tutorial will introduce them…

Page 58: Welcome to IST 380 !

Irises…

setosa

virginica

data() yields many built-in data files. This is iris

Page 59: Welcome to IST 380 !

Subsetting iris data

As with vectors, you can "subset" data frames.

df[rows,cols]
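A short sketch of the df[rows,cols] pattern on the built-in iris data. Leaving a position blank means "all rows" or "all columns".

```r
data(iris)     # load the built-in iris data frame
dim(iris)      # number of rows and columns
head(iris)     # the first six rows

iris[1:3, ]                               # rows 1-3, all columns
iris[, "Sepal.Length"]                    # all rows, one column (a vector)
iris[1:3, c("Sepal.Length", "Species")]   # rows 1-3, two named columns

# A boolean vector in the rows position selects the matching rows
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)
```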

Page 60: Welcome to IST 380 !

Lab…

The 2nd part of each class meeting is dedicated to lab work.

I welcome you to stay for the lab, but it is not required.

Today's lab:

Work through Santorico and Shin's Tutorial for the R Statistical Package and submit the console sessions as pr2_1.txt, pr2_2.txt, pr2_3.txt, pr2_4.txt, and pr2_5.txt.

This is a nice reinforcement of vectors, introduction to data frames, and a look at the graphics that R supports.

Page 61: Welcome to IST 380 !

Homework

Problem 3: Challenge exercises in R

These will reinforce the "subsetting" and data-analysis introduction from pr2's tutorial.

Problem 4: Introduction to Data Science, early chapters

This is a fuller background on R and the field of data science

(submit your console session for both of these…)

Page 62: Welcome to IST 380 !

Lab !

Page 63: Welcome to IST 380 !

CS vs. IS and IT ?

www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf

greater integration system-wide issues

smaller details machine specifics

Page 64: Welcome to IST 380 !

CS vs. IS and IT ?

Where will IS go?

Page 65: Welcome to IST 380 !

CS vs. IS and IT ?

Page 66: Welcome to IST 380 !

IT ?

Where will IT go?

Page 67: Welcome to IST 380 !

IT ?

Page 68: Welcome to IST 380 !
Page 69: Welcome to IST 380 !

The bigger picture

Weeks 10-12: Objects

Week 10: classes vs. objects

Week 11: methods and data

Week 12: inheritance

Weeks 13-15: Final Projects

Week 13: final projects

Week 14: final projects

Week 15: final exam

Page 73: Welcome to IST 380 !

Data!

• Neighbor's name: Zachary Dodds

• A place they consider home: Pittsburgh, PA

• Are they working at a company now? Where? Harvey Mudd

• How many U.S. states have they visited? 44

• Their favorite unhealthy food… ? M&Ms

• Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…

be sure to set up your login + profile for the submission site…

This class is truly seminar-style: we're developing expertise in this field together.