Introduction to Big Data
Chapter 3 & 4 (Week 2)
Applications of Big Data
DCCS208(02) Korea University 2019 Fall
Asst. Prof. Minseok [email protected]
Contents
1. Applications of Big Data
• Diverse applications
2. Workflow of Data Science
• General workflow
3. 1st step (Problem definition)
• Practice to define problem
• Practice to imagine required data
Copyright ⓒ 2018 All rights reserved by Korea University
Applications
Natural language processing and voice recognition
Applications
Netflix & YouTube recommendation system
Applications
ChatBot
fXck yoX!!! <- Please don't try that, for the chatbot's future.
Applications
Color recovery for B&W pictures
Applications
Resolution recovery for poor-quality pictures
Applications
Motion detection
https://www.youtube.com/watch?v=pW6nZXeWlGM&feature=youtu.be
Applications
Image captioning
Applications
New image generation
Applications
Autonomous car
Applications
Robotics
But what if I asked you to build these things right now?
Main goal of this course
Mindset of this course
A journey of a thousand miles begins with a single step!
After taking this class, you should be able to:
• identify which types of data these applications require.
• imagine the data structure behind these applications.
• know which techniques these applications require, even if you do not know the exact mathematical or statistical formulas behind them.
General workflow for Data Science
Diagram of workflow
Another workflow for Data Science
Diagram of workflow
1st step for Big Data Science
Problem definition
This is the first step in every piece of work.
• We can define a problem by talking with someone.
• You can also identify issues while wrestling with a problem on your own.
• Someone else may tell you what they find uncomfortable or inconvenient.
• You can also read news articles and come up with new ideas.
• Ideas can come to mind during unrelated activities.
...
What was uncomfortable?
Think about why this technology came about
Applications
Netflix & YouTube recommendation system
Applications
ChatBot
Applications
Color recovery for B&W pictures
Workflow for Data Science
Diagram of workflow
Workflow for Data Science
Diagram of workflow
2nd step for Big Data Science
Experimental design for getting data
Before attempting to collect data, think about what data you need to collect for your purpose.
• Which features are important?
• Number of features
• Number of samples
• Types of features
• Target individuals
• ...
This is basically covered in the "Experimental Design" course in the Department of Statistics.
Tabular Data
Structured data
What is a table?
• A table is a collection of rows and columns
• Each row has an index
• Each column has a name
• A cell is specified by an (index, name) pair
• A cell may or may not have a value
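The definition above can be sketched in a few lines of plain Python. This is only an illustration under assumed names: the `table` dictionary, the `cell` helper, and the sample values are hypothetical and not part of the lecture.

```python
# A tiny table: rows are keyed by index, cells are keyed by column name.
# All values here are made-up sample data.
columns = ["name", "age", "city"]
table = {
    0: {"name": "Kim", "age": 21, "city": "Seoul"},
    1: {"name": "Lee", "age": None, "city": "Busan"},  # this cell has no value
}


def cell(table, index, name):
    """Return the value of the cell addressed by the (index, name) pair."""
    return table[index][name]


print(cell(table, 0, "name"))         # a cell that has a value
print(cell(table, 1, "age") is None)  # a cell without a value
```

Here `None` stands in for a missing value; real tools usually have their own marker for that.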
Tabular Data
Structured data
Tabular Data
csv format (comma-separated values)
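As a rough sketch of what a CSV file looks like and how it maps to a table, the snippet below parses a small hypothetical CSV string with Python's standard `csv` module. The column names and values are invented for illustration.

```python
import csv
import io

# Hypothetical CSV text: the first line holds the column names,
# and every following line is one row of the table.
csv_text = "name,age,city\nKim,21,Seoul\nLee,,Busan\n"

# DictReader uses the header line as column names for each row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for index, row in enumerate(rows):
    print(index, dict(row))
```

Note that every value comes back as a string, and an empty field is read as an empty string, so types and missing values have to be handled after loading.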
The structure spectrum
Structured or not
Structured (schema-first)
• Relational Database
• Formatted Messages
Semi-Structured (schema-later)
• Documents
• XML
• Tagged Text/Media
Unstructured (schema-never)
• Plain Text
• Media
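One way to picture the spectrum is to write the same fact three ways. The sketch below is an assumption-laden illustration: the record, the `clubs` field, and the sentence are made up, and JSON is used as the semi-structured example simply because it ships with Python (the slide itself lists XML).

```python
import json

# Structured (schema-first): the field names are fixed before any data exists.
schema = ("sid", "name", "age")
structured_row = ("53666", "Jones", 18)

# Semi-structured (schema-later): each record carries its own tags,
# so fields can vary from record to record.
semi_structured = json.loads(
    '{"sid": "53666", "name": "Jones", "clubs": ["chess"]}'
)

# Unstructured (schema-never): plain text with no declared fields.
unstructured = "Jones, an 18-year-old student, joined the chess club."

print(dict(zip(schema, structured_row)))
print(sorted(semi_structured.keys()))
print(unstructured)
```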
When people use the word database, fundamentally what they are saying is
that the data should be self-describing and it should have a
schema. That’s really all the word database means.
-- Jim Gray, “The Fourth Paradigm”
Key concept: Structured Data
Structured data
A data model is a collection of concepts for describing data.
A schema is a description of a particular collection of data, using a given data model.
The Relational Model
Structured data
The Relational Model is ubiquitous
• MySQL, PostgreSQL, Oracle, DB2, SQL Server, …
Foundational work done at
• IBM San Jose Research Laboratory (now IBM Almaden): "System R"
• UC Berkeley CS: the "Ingres" system
Object-oriented concepts have been merged in
• Early work: POSTGRES research project at Berkeley
As has support for XML (semi-structured data)
E. F. "Ted" Codd, Turing Award, 1981
Example
Instance of student relation
sid    name   login       age  gpa
53666  Jones  jones@cs    18   3.4
53688  Smith  smith@eecs  18   3.2
53650  Smith  smith@math  19   3.8
CREATE TABLE Students (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)
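To try the schema and the instance above, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. SQLite is an assumption made for convenience; the slide does not name a database system, and SQLite treats `CHAR(20)` and `FLOAT` only as type hints.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema from the slide: a description of what a Students row looks like.
conn.execute(
    "CREATE TABLE Students ("
    "sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)"
)

# The instance from the slide: the actual rows stored under that schema.
students = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@eecs", 18, 3.2),
    ("53650", "Smith", "smith@math", 19, 3.8),
]
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", students)

rows = conn.execute(
    "SELECT sid, name, gpa FROM Students ORDER BY sid"
).fetchall()
for row in rows:
    print(row)
```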
Data model (Tabular)
Python
DataFrame: a dict of Series objects
• Each Series object represents a column
Series: a named, ordered dictionary
• The keys of the dictionary are the indexes
• Built on NumPy's ndarray
• Values can be any NumPy data type object
Data stored in memory
Operations performed from the Python shell
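The "dict of Series" idea can be imitated with ordinary dictionaries. The sketch below is a plain-Python analogy only, not the actual DataFrame or Series API, and the variable names are hypothetical; it reuses the student values from the earlier example.

```python
# Plain-Python analogy of the data model; this is NOT the real library API.
index = [0, 1, 2]

# Each "series" is an ordered dictionary: keys are the indexes, values the data.
name = dict(zip(index, ["Jones", "Smith", "Smith"]))
gpa = dict(zip(index, [3.4, 3.2, 3.8]))

# The "frame" is a dict of series: one entry per column.
frame = {"name": name, "gpa": gpa}

# A cell is reached by column name first, then by index.
print(frame["gpa"][2])

# A row is assembled by reading the same index from every column.
row_1 = {column: series[1] for column, series in frame.items()}
print(row_1)
```

Storing data column by column is what makes whole-column operations such as sums and averages cheap, at the cost of slightly more work to rebuild a single row.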
Operations (Tabular)
Python
• integrate (join), transform, clean, impute
• aggregate: sum, count, average, max, min
• sort
• pivot
• Relational
  • union, intersection, difference, cartesian product (CROSS JOIN)
  • select/filter, project
  • join: natural join (INNER JOIN), theta join, semi-join, etc.
  • rename
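A few of the operations listed above can be tried with SQL through Python's built-in `sqlite3` module. This is a sketch under assumptions: the `Enrolled` table and its course codes are hypothetical additions, and SQL is used here in place of a dataframe library purely to keep the example dependency-free.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (sid TEXT, name TEXT, age INTEGER, gpa REAL);
CREATE TABLE Enrolled (sid TEXT, course TEXT);  -- hypothetical second table
INSERT INTO Students VALUES
    ('53666', 'Jones', 18, 3.4),
    ('53688', 'Smith', 18, 3.2),
    ('53650', 'Smith', 19, 3.8);
INSERT INTO Enrolled VALUES
    ('53666', 'DB101'), ('53650', 'DB101'), ('53650', 'STAT101');
""")

# select/filter (WHERE), project (column list), and sort (ORDER BY)
filtered = conn.execute(
    "SELECT name, gpa FROM Students WHERE age = 18 ORDER BY gpa"
).fetchall()

# join: match rows of the two tables on the shared sid column
joined = conn.execute(
    "SELECT s.name, e.course FROM Students s "
    "INNER JOIN Enrolled e ON s.sid = e.sid "
    "ORDER BY s.name, e.course"
).fetchall()

# aggregate: count, average, max, min over a column
stats = conn.execute(
    "SELECT COUNT(*), AVG(gpa), MAX(gpa), MIN(gpa) FROM Students"
).fetchone()

print(filtered)
print(joined)
print(stats)
```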
Data model (Tabular)
R
data.frame: a list of vector objects
• Each vector object represents a column
Possible vector types
• logical, integer, double, complex, character, raw
Data stored in memory
Operations performed from the R shell
What's wrong with Tables?
• Too limited in structure?
• Too rigid?
• Too old-fashioned?
Beyond tables
But Structure Matters!
[Figure: Functionality versus Time (and cost), comparing Structured (schema-first), Unstructured (schema-less), and Dataspaces (pay-as-you-go)]
Structure enables computers to help users manipulate and maintain the data.
End of Slide