data management for big data analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/course... ·...

19
Data Management for Big Data Analytics 23 - 31 July PKU

Upload: others

Post on 23-Mar-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Data Management for

Big Data Analytics

23 - 31 July PKU

Page 2: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

What do data analysts do?

Page 3: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

PROGRAMMING LANGUAGES

SQL70%

R57%

PYTHON54%

BASH24%JAVA18%

JAVASCRIPT17%

VISUAL BASIC / VBA13%

C++9%

SCALA8%

C#8%

C8%

SAS5%

PERL5%

RUBY3%

GO1%

OCTAVE2%

MATLAB9%

SALARY MEDIAN AND IQR (US DOLLARS)

Range/Median

Lang

uage

s

SHARE OF RESPONDENTS

0 50K 100K 150K 200K

GoOctave

RubySASPerl

C#C

ScalaMatlab

C++Visual Basic/VBA

JavaScriptJavaBash

PythonR

SQL

Source: O’Reilly Data Science Salary Survey 2016

Page 4: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Source: O’Reilly Data Science Salary Survey 2015

0%

10%

20%

30%

40%

50%

60%

70%

Couc

hbas

eG

oG

raph

Lab

Ora

cle

Exas

cale

Aste

r Dat

a (T

erad

ata)

Ora

cle

BIAm

azon

Dyn

amoD

BIB

MVo

wpa

l Wab

bit

EMC/

Gre

enpl

umKN

IME

Spotfir

eM

ahou

tN

eo4J

Mat

hem

atic

aSt

ata

Stor

mVe

rtic

aM

icro

stra

tegy

Pent

aho

Qlik

View

Cogn

osLI

BSVM

Map

RRu

byN

etez

za (I

BM)

Splu

nkG

oogl

e Bi

gQue

ry/F

usio

n Ta

bles

Rapi

dMin

erSA

P H

ANA

Busi

ness

Obj

ects

IBM

DB2

Goo

gle

Imag

e Ch

arts

API

Mat

lab

Redi

sG

oogl

e Ch

art T

ools

/ Im

age

API

C#H

orto

nwor

ksCa

ssan

dra

Tera

data

SPSSPe

rlAm

azon

Ela

stic

Map

Redu

ce (E

MR)

Hba

seW

eka

Amaz

on R

edSh

iftPigC

SQLi

teSc

ala

Pow

erPi

vot

C++

SAS

Apac

he H

adoo

pM

ongo

DB

Visu

al B

asic

/VBA

Clou

dera

Spar

kH

ive

Hom

egro

wn

anal

ysis

tool

sD

3O

racl

ePo

stgr

eSQ

LJa

vaM

atpl

otlib

(Pyt

hon)

Java

Scrip

tTa

blea

uM

icro

soft

SQ

L Se

rver

ggpl

otPy

thon

: num

py, s

cipy

, sci

kit-

lear

nM

ySQ

LRPy

thon

Exce

lSQ

L

TOOLSLANGUAGES, DATA PLATFORMS, ANALYTICS

Shar

e of

Res

pond

ents

Tool: language, data platform, analytics Tool: language, data platform, analytics

Page 5: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Source: O’Reilly Data Science Salary Survey 2014

4 Two exceptions were “Natural Language/Text Processing” and “Networks/Social GraphProcessing,"” which are less tools than they are types of data analysis.

bases, Hadoop distributions, visualization applications, businessintelligence (BI) programs, operating systems, or statistical pack‐ages.4 One hundred and fourteen tools were present on the list, butover 200 more were manually entered in the “other” fields.

Figure 1-10. Most commonly used tools

Just as in the previous year’s salary survey, SQL was the most com‐monly used tool (aside from operating systems); even with the rapidinflux of new data technology, there is no sign that SQL is going

12 | 2014 Data Science Salary Survey

Page 6: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Source: O’Reilly Data Science Salary Survey 2013

3. Correlations were tested using a Pearson’s chi square test with p=.05.

That SQL/RDB is the top bar is no surprise: accessing data is the meatand potatoes of data analysis, and has not been displaced by othertools. The preponderance of R and Python usage is more surprising—operating systems aside, these were the two most commonly usedindividual tools, even above Excel, which for years has been the go-tooption for spreadsheets and surface-level analysis. R and Python arelikely popular because they are easily accessible and effective opensource tools for analysis. More traditional statistical programs such asSAS and SPSS were far less common than R and Python.

By counting tool usage, we are only scratching the surface: who exactlyuses these tools? In comparing usage of R/Python and Excel, we hadhypothesized that it would be possible to categorize respondents asusers of one or the other: those who use a wider variety of tools, largelyopen source, including R, Python, and some Hadoop, and those whouse Excel but few tools beside it.

Python and R correlate with each other—a respondent who uses oneis more likely to use the other—but neither correlates with Excel (neg‐atively or positively): their usage (joint or separate) does not predictwhether a respondent would also use Excel. However, if we look at allcorrelations between all pairs of tools, we can see a pattern that, to anextent, divides respondents. The significant positive correlations canbe drawn as edges between tools as nodes, producing a graph with twomain clusters.3

Tool Usage | 7

Page 7: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Future data scientists’ favorite tools

Page 8: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

wrangling analytics

Page 9: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

wrangling analytics(up to 80%)

Page 10: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Challenges• 4 Vs of big data

• Volume - data is large

• Variety - data comes in different formats

• Veracity - data is not there/uncertain/dirty

• Velocity - speed of change

Page 11: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Volume challenges

• Even scanning data can take hours/days/weeks

• How to get to the right data?

• Precise answers to queries are impossible

• Hence need to approximate

Page 12: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Variety challenges• Different data formats

• Still much of the data is stored in relational DBMSs

• But other models catch up

• Most active these days is graph data

• We will look at it a lot

Page 13: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Veracity challenges• How to deal with uncertainty or incompleteness?

Relational databases (SQL) are really bad at it

• What to do if data is structured under a different schema, or different bits of data reside in different databases? (Data exchange and integration)

• What to do if data is supplemented with additional knowledge, e.g., an ontology, to compensate for missing data?

Page 14: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Course structure• A quick of reminder of the basics of relational

databases

• SQL: how well do we understand it?

• Volume: Conjunctive queries (many-way joins)

• optimisation

• approximation

Page 15: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

• Volume: scale independence

• Variety: graph databases

• theoretical languages

• property graphs in Neo4j

• Variety: RDF data

• Variety: tree-structured data (XML) - depending on time

Page 16: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

• Veracity: incomplete information and correct answers

• Veracity: data integration and exchange

• Veracity: answering queries with the help of ontologies

Page 17: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

Evaluation• Project

• There is a long list of papers on the web

• Choose one of them

• The goal is to write an essay, 6-8 pages

• It must present a summary of the paper that would be understood by someone who has not read the paper

• It should also provide some of your own ideas or further investigation about the paper

Page 18: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

• Examples:

• analysis of the followup literature to see how these ideas were used

• ideas on improving algorithms in the paper, perhaps in some special cases

• an implementation of a theoretical algorithm to see how it performs

Page 19: Data Management for Big Data Analyticshomepages.inf.ed.ac.uk/libkin/teach/beijing2018/Course... · 2018-07-23 · Source: O’Reilly Data Science Salary Survey 2014 4 Two exceptions

2-stage process• Stage 1: this week (and weekend), choose the

paper, read it quickly, and decide what you want to do in addition to summarising its ideas

• Have a quick presentation Tuesday 31 July (during the last 2 hours of the course)

• The do the proper writeup and email it to me, [email protected], before To Be Announced