
Statistics 202: Data Mining
Introduction

© Jonathan Taylor
Based in part on slides from textbook, slides of Susan Holmes

October 7, 2011


Data Mining

What is data mining?

Non-trivial extraction of implicit, previously unknown and potentially useful information from data.

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.

A key feature of data mining is that the data sets are larger than those encountered in “classical” statistics: so large that the analysis must be (semi-)automated.


Data Mining

Who uses data mining?

Industry:

1. Netflix

2. Amazon

3. Google (i.e. Google Trends)

Science:

1. Genomics

2. Climate Science

3. Astrophysics

4. Neuroimaging


Netflix


Amazon

[Screenshot: Amazon.com product page for Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar (http://www.amazon.com/Introduction-Data-Mining-Pang-Ning..., captured 9/25/11). The page shows the “Frequently Bought Together” and “Customers Who Bought This Item Also Bought” recommendations, including Data Mining: Practical Machine Learning Tools and Techniques (Ian H. Witten), The Elements of Statistical Learning (Trevor Hastie), Programming Collective Intelligence (Toby Segaran) and Data Mining: Concepts and Techniques (Jiawei Han).]


Google Trends

[Screenshot: Google Trends results for the search term “andrew luck” (http://www.google.com/trends?q=andrew+luck&ctab=0&geo=a..., captured 9/25/11). The page lists related news headlines dated from Nov 22 2009 to Sep 18 2011, together with the top regions (United States, Canada, Australia, United Kingdom), the top cities (led by Stanford, CA, USA) and the top languages (English, Spanish).]


Genomics


Neuroimaging


Climate science


Data Mining

Some things that are not data mining

Looking up a record in a database by an identifier such as last name. (No pattern is revealed by this lookup . . . )

Searching for “Amazon” on Google. (Google has done some data mining, but you have not . . . )

Testing a two-sample hypothesis in a clinical trial. (The data set is often neither large nor unstructured.)


Data Mining

Some things that are more like data mining

Noting that some last names occur in certain geographical areas.

Taking all query results from Google on Amazon and discovering that there are at least two groups: “Amazon river” and “Amazon.com”.

When doing multiple tests across many different genes, identifying very strongly significant genes . . .
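The gene example can be made concrete with a small sketch. It is an illustration only, not a method taken from the slides: it assumes each gene has already been summarized by a z-statistic, uses a Bonferroni correction (one common way to account for many simultaneous tests), and runs on made-up gene names and values.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def bonferroni_significant(z_scores, alpha=0.05):
    """Return the names whose p-value is below the Bonferroni threshold alpha / m."""
    m = len(z_scores)
    threshold = alpha / m
    return sorted(
        name for name, z in z_scores.items() if two_sided_p_value(z) < threshold
    )


# Hypothetical z-statistics, one per gene (values invented for illustration).
z_scores = {
    "gene_a": 0.3,
    "gene_b": -1.1,
    "gene_c": 5.2,
    "gene_d": 2.1,
    "gene_e": -4.8,
}

significant = bonferroni_significant(z_scores)
print(significant)  # ['gene_c', 'gene_e']
```

Here gene_d would pass an unadjusted 0.05 cutoff but not the corrected one, which is the sense in which only the "very strongly significant" genes survive.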


Data Mining

Prediction / Supervised Problems

In such problems there is an outcome or label we want to predict based on many features.

Classification

Regression

Outlier detection
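As a minimal sketch of what classification means here, the following predicts a label for a new observation from labelled examples. The nearest-neighbor rule is chosen only because it is short; the slides do not single out any classifier, and the data and label names are hypothetical.

```python
import math


def nearest_neighbor_predict(train_points, train_labels, query):
    """Predict the label of `query` as the label of the closest training point."""
    distances = [math.dist(p, query) for p in train_points]
    best = min(range(len(train_points)), key=distances.__getitem__)
    return train_labels[best]


# Hypothetical toy data: two features per observation, made-up labels.
train_points = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (3.8, 4.0)]
train_labels = ["low", "low", "high", "high"]

print(nearest_neighbor_predict(train_points, train_labels, (1.1, 0.9)))  # low
print(nearest_neighbor_predict(train_points, train_labels, (4.1, 3.9)))  # high
```

Regression has the same shape, except that the outcome is a number rather than a category.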


Data Mining

Descriptive / Unsupervised Problems

In such problems, we are seeking to discover hidden “structure” in the data, without an outcome or label.

Clustering

Dimension Reduction

Association Rules

Semisupervised problems

A mix of labelled and unlabelled data is used.
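To illustrate the clustering task listed under unsupervised problems, here is a short sketch of k-means (Lloyd's algorithm), one of many possible clustering methods. No labels are supplied; the groups are found from the points alone. The data, the starting centers and the function name are assumptions made for the example.

```python
import math


def k_means(points, centers, n_iter=10):
    """Lloyd's algorithm: alternate between assigning points and updating centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else center
            for cluster, center in zip(clusters, centers)
        ]
    # Final labels: index of the nearest center for each point.
    labels = [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]
    return centers, labels


# Hypothetical toy data: two well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Starting centers are picked by hand here to keep the example deterministic.
centers, labels = k_means(points, centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the starting centers are usually chosen at random and the number of clusters must be decided separately.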
