Fall 2018 CptS 475/575: Data Science What is Data Science?


Post on 21-Jun-2020


Fall 2018

CptS 475/575: Data Science

What is Data Science?


Assefaw Gebremedhin, CptS 475/575: Data Science, http://scads.eecs.wsu.edu

First, some good news…

• Starting Friday, August 24, and for the remainder of the semester, the meeting location for the class has changed to CUE 319
• CUE 319 is a bigger (and nicer) room
• Everyone on the waiting list will be enrolled!


Next, a couple of leftover slides from last time…


Learning Outcomes

• Describe what Data Science is and the skill sets needed
• Describe the Data Science Process
• Use R to carry out statistical modeling and analysis
• Carry out exploratory data analysis (to gain insight)
• Apply machine learning algorithms for predictive modeling
• Correctly apply cross-validation to assess model performance
• Apply unsupervised learning methods to discover patterns, trends and anomalies in data
• Use effective data wrangling approaches to manipulate data
• Create effective visualizations of data (to communicate or persuade)
• Reason about ethical and privacy issues in data science, and apply ethical practices
• Work effectively in teams on data science projects
• Apply knowledge gained in the course to carry out a project and write a technical report


Weekly Schedule


Books

• No required textbook
• Lecture notes (slides) and reading material will be made available on the OSBLE+ page
• References
  • Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, 2013. (Freely available online)
  • Cathy O'Neil and Rachel Schutt. Doing Data Science, Straight Talk From The Frontline. O'Reilly, 2014.
  • Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1. Cambridge University Press, 2014. (Freely available online)
  • Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. Draft of a book, latest version, 2018. (Freely available online)
  • Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques. Third Edition. Morgan Kaufmann Publishers, 2012.
  • Ethem Alpaydin. Introduction to Machine Learning. Third Edition. MIT Press, 2014.
  • Nathan Yau. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics. Wiley Publications, 2011.
  • Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. (Freely available online)


Policies

• Conduct in class
  • Silence personal electronics
  • Arrive on time and remain throughout the class
• Correspondence
  • Happens via OSBLE+
• Attendance
  • Required. Make sure absences are cleared with me
• Missing or late work
  • Max 48 hrs with 10% penalty per 24 hrs
• Academic Integrity
  • Strongly enforced
• Consult syllabus for more details
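
The late-work rule above ("Max 48 hrs with 10% penalty per 24 hrs") can be read as a small piece of arithmetic. The sketch below is only one possible reading, not the official grading rule: it assumes the penalty is charged per started 24-hour period, that it is a fraction of the raw score, and that work more than 48 hours late earns no credit. The function name `late_score` and the constants are hypothetical; the syllabus remains the authority.

```python
import math

MAX_LATE_HOURS = 48        # from the slide: late work accepted up to 48 hours
PENALTY_PER_PERIOD = 0.10  # from the slide: 10% penalty per 24 hours
PERIOD_HOURS = 24


def late_score(raw_score: float, hours_late: float) -> float:
    """Return the score after the late penalty (hypothetical helper).

    Assumptions not stated on the slide: the penalty is charged per
    started 24-hour period, it is a fraction of the raw score, and work
    later than 48 hours receives no credit.
    """
    if hours_late <= 0:
        return raw_score          # on time: no penalty
    if hours_late > MAX_LATE_HOURS:
        return 0.0                # past the 48-hour window (assumed: no credit)
    periods = math.ceil(hours_late / PERIOD_HOURS)  # started 24-hour periods
    return raw_score * (1 - PENALTY_PER_PERIOD * periods)
```

Under these assumptions, a submission with a raw score of 80 that arrives 30 hours late falls in the second 24-hour period, loses 20%, and would be recorded as 64.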


Now to today’s topic…


What is Data Science?

Outline:
• Big Data and Data Science hype
  • and getting past the hype
• Why now?
• Landscape of perspectives
• Skill set needed


Big Data and Data Science Hype

What might be eyebrow-raising about Big Data and Data Science?

• Lack of definition around basic terminology
• Lack of recognition for researchers in academia and industry who have been working on this kind of stuff for years
• The hype can be crazy

Source: Doing Data Science (O’Neil & Schutt, 2013).


Getting past the hype

Around all the hype, there is a ring of truth: Data Science is something new. It has access to a larger body of knowledge and methodology, as well as a process that has foundations in both statistics and computer science. [DDS, O’Neil and Schutt]

We are here in this course to understand this better and contribute to the pursuit of a sharper definition.


Quote from Introduction of “Foundations of Data Science” book by Avrim Blum, John Hopcroft and Ravindran Kannan (2018)

(https://www.cs.cornell.edu/jeh/book.pdf)

Computer science as an academic discipline began in the 60’s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context free languages, and computability. In the 70’s, algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks, which are by far the largest such structures, presents both opportunities and challenges for theory.

John Hopcroft


Quote from Introduction of “Foundations of Data Science” book by Avrim Blum, John Hopcroft and Ravindran Kannan (2018)

While traditional areas of computer science remain highly important, increasingly researchers of the future will be involved with using computers to understand and extract usable information from massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as an understanding of automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is an increase in emphasis on probability, statistics, and numerical methods.

John Hopcroft


Why Now? Enablers of today’s “big data revolution”

• Proliferation of sensors
• Creation of almost all information in digital form
  • Datafication
• Dramatic cost reduction in storage
  • You can afford to keep all the data
• Dramatic increases in network bandwidth
  • You can move the data to where it is needed
• Dramatic cost reduction and scalability improvements in computation
• Dramatic algorithmic breakthroughs
  • Machine Learning, Data Mining, Fundamental advances in CS and Statistics
• Ever more powerful models producing ever increasing volumes of data that must be analyzed