Data Science Patterns: Preparing Data for Agile Data Science (BrightTalk Webinar)


Upload: enda-ridge

Post on 17-Jan-2017


TRANSCRIPT

Page 1: Data Science Patterns: Preparing Data for Agile Data Science (BrightTalk Webinar)

Data Science Patterns
PREPARING DATA FOR AGILE DATA SCIENCE

Copyright Enda Ridge 2015

#GuerrillaAnalytics http://guerrilla-analytics.net


Page 2: What You Will Learn

• Why you must identify and mitigate disruptions in projects
• What Data Science patterns are and how to use them effectively

How this will help you
• Data Scientists: you need to ‘think in patterns’
• Developers: you will productionise these patterns
• Managers and Directors: you need this capability in a high-performing team

Page 3: What I’ve Learned

Timeline: 2004, 2008, 2010, 2012, 2015
• PhD: ‘Design of Experiments for Tuning Algorithms’
• Boutique Consultancy
• Forensic Data Analytics
• Senior Manager Professional Services
• Head of Algorithms

No matter the industry, teams were always plagued by the same problem: time was wasted preparing data and revisiting data instead of delivering real Data Science value.

Page 4: Teams Need ‘Guerrilla Analytics’

Data
• Extraction
• Receipt
• Loading

Analytics
• Transform
• Algorithms
• Consolidate

Insight
• Reporting
• Work Products

Disruptions

Page 5: Solution: Maintain Data Provenance

• Data
• Code
• Business Domain

Page 6: Agile Data Preparation Capability

Agility
3. Recognize & Implement Patterns
2. Supporting Tools
1. Simple Conventions

Page 7: Data

WHAT IT LOOKS LIKE
WHAT IT SHOULD LOOK LIKE

Page 8: What Raw Data Looks Like

Relational Data

Customer
firstName  lastName  age  addressID
John       Smith     25   340
Jane       Doe       36   158

Address
addressID  StreetAddress  City      State  postCode
340        21 2nd Street  New York  NY     10021
341        Main Street    Boston    MA     34041

JSON
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "fax", "number": "646 555-4567" }
  ]
}

Page 9: What Raw Data Looks Like

Machine Data

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

Page 10: Data Scientists Need Data To Look Like This

Artist        Track           Week  Date        Rank
2 Pac         Baby Don’t Cry  1     2000-02-26  87
2 Pac         Baby Don’t Cry  2     2000-03-02  82
2 Pac         Baby Don’t Cry  3     2000-03-11  72
2 Pac         Baby Don’t Cry  4     2000-03-18  77
2 Pac         Baby Don’t Cry  5     2000-03-25  87
2 Pac         Baby Don’t Cry  6     2000-04-01  94
2 Pac         Baby Don’t Cry  7     2000-04-08  99
3 Doors Down  Kryptonite      1     2000-04-08  68
3 Doors Down  Kryptonite      2     2000-04-15  67
3 Doors Down  Kryptonite      3     2000-04-22  66

• One row per observation
• One variable per column

‘Tidy Data’, H. Wickham, Journal of Statistical Software 2014
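
As a rough illustration of getting chart data into this shape, the Python sketch below unpivots a wide table into one row per observation. The wide layout (one rank column per week), the `wide_rows` sample and the `unpivot` helper are assumptions made for the example; the slides only show the tidy result.

```python
def unpivot(wide_rows, id_columns, variable_name, value_name):
    """Turn one column per measurement into one row per observation."""
    tidy = []
    for row in wide_rows:
        ids = {col: row[col] for col in id_columns}
        for col, value in row.items():
            if col in id_columns or value is None:
                continue  # skip identifier columns and missing measurements
            tidy.append({**ids, variable_name: col, value_name: value})
    return tidy


# Hypothetical wide layout: one rank column per chart week.
wide_rows = [
    {"artist": "2 Pac", "track": "Baby Don't Cry", "wk1": 87, "wk2": 82, "wk3": 72},
    {"artist": "3 Doors Down", "track": "Kryptonite", "wk1": 68, "wk2": 67, "wk3": 66},
]

tidy_rows = unpivot(wide_rows, ["artist", "track"], "week", "rank")
for row in tidy_rows:
    print(row)
```

Each output row now holds one observation (artist, track, week, rank), so later steps can filter or group on any single column.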

Page 11: Data Scientists Need Data To Look Like This

(Same table as on Page 10.)

Easier to describe relationships between variables (columns) than between rows

2000-04-01 is the 6th week

‘Tidy Data’, H. Wickham, Journal of Statistical Software 2014

Page 12: Data Scientists Need Data To Look Like This

(Same table as on Page 10.)

• Easier to describe relationships between variables (columns) than between rows
• Easier to do comparisons between groups of observations than between groups of columns

Min, max, first, Nth, average, median

‘Tidy Data’, H. Wickham, Journal of Statistical Software 2014

Page 13: What Data Scientists Need Data To Look Like

(Same table as on Page 10.)

• Variables organized by role
• Experiment design (fixed) on left
• Measurements on right
• De-normalised inefficiencies are OK!

‘Tidy Data’, H. Wickham, Journal of Statistical Software 2014

Page 14: Patterns

Patterns are “Recurring solutions to common problems”

• Architecture
• Software
• Data Science ?

Page 15: Patterns: ‘Recurring solutions to common problems’

Joining Data
• Collecting
• Unique ID
• Map rename
• Fuzzy join
• Stacking

Transformation
• Duplicates
• Outliers
• Sampling

Tidying Data
• Sort
• Filter
• Derived variables
• Aggregations
• Pivot and unpivot
• Roll and unroll
• Previous/Next N
• Split-Apply-Combine

Pattern Matching
• Regular Expressions

Page 16: Joining Patterns

• Collecting
• Unique ID
• Map rename
• Fuzzy join
• Stacking

Page 17: Joining Pattern: Collecting Datasets

Situation: log files
2015-10-01.log
2015-10-02.log
2015-10-03.log
2015-10-04.log
2015-10-05.log
2015-10-06.log
2015-10-07.log
…

Schema

Capability
• Pull datasets by name (if you have a convention)
• Pull datasets by content
• Index and search

Page 18: Joining Pattern: Collecting Datasets

(Situation and capability as on Page 17.)

Benefit
• Sampling: test and train
• Experimenting: factors
• Exploring: what’s in there?
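
A minimal Python sketch of pulling datasets by name, assuming the daily logs follow the `YYYY-MM-DD.log` convention shown in the file list. The function names, the folder argument and the 70/30 split are illustrative choices, not something the slides specify.

```python
import re
from pathlib import Path

# Assumed convention: one log per day, named YYYY-MM-DD.log.
LOG_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}\.log$")


def collect_logs(folder, start, end):
    """Return log files whose date-stamped name falls in [start, end]."""
    selected = []
    for path in sorted(Path(folder).iterdir()):
        # ISO dates sort lexicographically, so string comparison is enough.
        if LOG_NAME.match(path.name) and start <= path.stem <= end:
            selected.append(path)
    return selected


def split_train_test(paths, train_fraction=0.7):
    """Split an ordered list of files into train and test portions."""
    cut = int(len(paths) * train_fraction)
    return paths[:cut], paths[cut:]


if __name__ == "__main__":
    # "logs" is a placeholder folder name for this demo.
    if Path("logs").is_dir():
        week = collect_logs("logs", "2015-10-01", "2015-10-07")
        train, test = split_train_test(week)
        print(len(train), "training files,", len(test), "test files")
```

Because the selection is driven by the naming convention alone, the same call can serve sampling, experimenting and exploring without opening any file.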

Page 19: Joining Pattern: Unique IDs

Situation: data refreshes

WEEK 1
Day_ID  date        Amt      Act
3477    2014-03-16  150,000  SETTLE
4598    2014-03-17  45,000   AMEND
…       …           …        …

WEEK 3.5
Day_ID  date        Amt      Act
3477    2014-03-16  150,000  SETTLE
4598    2014-03-17  45,001   AMEND
…       …           …        …

Capability
• Need to uniquely identify records, even when IDs exist in the data
• Hash functions turn a large amount of data into a ‘unique’ string

MD5(Guerrilla Analytics) = 3b04a8085df05752e24c095f036c44f3
MD5(guerrilla analytics) = 8f1438b18748981180e10b8c1365e4d9

Page 20: Joining Pattern: Unique IDs

WEEK 1
Day_ID  date        Amt      Act     Hash_id
3477    2014-03-16  150,000  SETTLE  244072c9f78f59f7ca0ca93426db98da
4598    2014-03-17  45,000   AMEND   613b4ddfc4db2436e8b8deda26bc3c25
…       …           …        …

WEEK 3.5
Day_ID  date        Amt      Act     Hash_id
3477    2014-03-16  150,000  SETTLE  244072c9f78f59f7ca0ca93426db98da
4598    2014-03-17  45,001   AMEND   03a0d5e5646bfe60fce679a87ef4cd34
…       …           …        …
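
One way to compute such a hash ID is sketched below in Python with the standard `hashlib` module. How the slides serialised each row before hashing is not stated, so the separator and column order here are assumptions and the digests will not match the ones in the tables.

```python
import hashlib


def row_hash(row, columns):
    """Build an MD5 hex digest from the chosen columns of one record."""
    # Join with a separator unlikely to occur in the data, so that
    # ("ab", "c") and ("a", "bc") do not produce the same input string.
    joined = "\x1f".join(str(row[col]) for col in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


COLUMNS = ["Day_ID", "date", "Amt", "Act"]

week_1 = {"Day_ID": 4598, "date": "2014-03-17", "Amt": "45,000", "Act": "AMEND"}
week_3_5 = {"Day_ID": 4598, "date": "2014-03-17", "Amt": "45,001", "Act": "AMEND"}

# Day_ID is the same in both deliveries, but the hashes differ
# because the amount changed between refreshes.
print(row_hash(week_1, COLUMNS))
print(row_hash(week_3_5, COLUMNS))
```

Comparing hash IDs between deliveries then flags records that changed even though their supplied ID did not.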

Page 21: Joining Pattern: Map rename

Situation: lots of renaming

Day_ID  Cust        Amt      Act
3477    2014-03-16  150,000  SETTLE
4598    2014-03-17  45,000   AMEND
…       …           …        …

id      customer    amount   event
3477    2014-03-16  150,000  SETTLE
4598    2014-03-17  45,000   AMEND
…       …           …        …

Page 22: Joining Pattern: Map rename

Situation: lots of renaming

Day_ID  Cust        Amt      Act
3477    2014-03-16  150,000  SETTLE
4598    2014-03-17  45,000   AMEND
…       …           …        …

Pattern

dataset  from    to
trades   Day_ID  id
trades   Amt     amount
…        …       …

id      customer    amount   event
3477    2014-03-16  150,000  SETTLE
4598    2014-03-17  45,000   AMEND
…       …           …        …
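
The mapping table can drive the renaming directly. The Python sketch below is one possible shape for that, with the mapping held as a list of records. The `Day_ID` and `Amt` entries come from the slide; the `Cust` and `Act` entries are inferred from the before and after column headers, and `rename_columns` is a hypothetical helper.

```python
# Mapping table: one entry per (dataset, original column) pair.
RENAME_MAP = [
    {"dataset": "trades", "from": "Day_ID", "to": "id"},
    {"dataset": "trades", "from": "Amt", "to": "amount"},
    # Inferred from the renamed table headers, not listed on the slide:
    {"dataset": "trades", "from": "Cust", "to": "customer"},
    {"dataset": "trades", "from": "Act", "to": "event"},
]


def rename_columns(rows, dataset, rename_map):
    """Rename columns of a dataset using the entries of the mapping table."""
    lookup = {m["from"]: m["to"] for m in rename_map if m["dataset"] == dataset}
    # Columns without a mapping entry keep their original name.
    return [{lookup.get(col, col): value for col, value in row.items()} for row in rows]


trades = [
    {"Day_ID": 3477, "Cust": "2014-03-16", "Amt": "150,000", "Act": "SETTLE"},
    {"Day_ID": 4598, "Cust": "2014-03-17", "Amt": "45,000", "Act": "AMEND"},
]

renamed = rename_columns(trades, "trades", RENAME_MAP)
print(renamed[0])
```

Keeping the renames as data rather than code means a new naming decision is a one-row change to the mapping table.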

Page 23: Transformation Patterns

• Duplicates
• Outliers
• Sampling

Page 24: Transformation Pattern: Duplicates

Situation: repeated data. Can’t decide what to remove

Artist        Track           Week  Date        Rank
2 Pac         Baby Don’t Cry  1     2000-02-26  87
2 Pac         Baby Don’t Cry  2     2000-03-02  82
2 Pac         Baby Don’t Cry  2     2000-03-02  82
2 Pac         Baby Don’t Cry  3     2000-03-11  72
2 Pac         Baby Don’t Cry  4     2000-03-18  77
2 Pac         Baby Don’t Cry  5     2000-03-25  87
2 Pac         Baby Don’t Cry  6     2000-04-01  94
2 Pac         Baby Don’t Cry  7     2000-04-08  99
3 Doors Down  Kryptonite      1     2000-03-02  82

Capability
• Tag repeating records
• Hold out and review in critical applications
• Tag records that repeat across arbitrary columns

Page 25: Transformation Pattern: Duplicates

Artist        Track           Week  Date        Rank  Dupe_Full_id
2 Pac         Baby Don’t Cry  1     2000-02-26  87    1
2 Pac         Baby Don’t Cry  2     2000-03-02  82    2
2 Pac         Baby Don’t Cry  2     2000-03-02  82    2
2 Pac         Baby Don’t Cry  3     2000-03-11  72    3
2 Pac         Baby Don’t Cry  4     2000-03-18  77    4
2 Pac         Baby Don’t Cry  5     2000-03-25  87    5
2 Pac         Baby Don’t Cry  6     2000-04-01  94    6
2 Pac         Baby Don’t Cry  7     2000-04-08  99    7
3 Doors Down  Kryptonite      1     2000-03-02  82    8

Give duplicate groups an ID. Don’t delete!

Page 26: Transformation Pattern: Duplicates

Artist        Track           Week  Date        Rank  Dupe_Full_id  Dupe_rank_date
2 Pac         Baby Don’t Cry  1     2000-02-26  87    1             1
2 Pac         Baby Don’t Cry  2     2000-03-02  82    2             2
2 Pac         Baby Don’t Cry  2     2000-03-02  82    2             2
2 Pac         Baby Don’t Cry  3     2000-03-11  72    3             3
2 Pac         Baby Don’t Cry  4     2000-03-18  77    4             4
2 Pac         Baby Don’t Cry  5     2000-03-25  87    5             5
2 Pac         Baby Don’t Cry  6     2000-04-01  94    6             6
2 Pac         Baby Don’t Cry  7     2000-04-08  99    7             7
3 Doors Down  Kryptonite      1     2000-03-02  82    8             2

Give multiple duplicate groups their own IDs.
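
The tagging above can be sketched in a few lines of Python. The helper name `tag_duplicate_groups` and the lower-case column names are assumptions for the example; the idea of giving each duplicate group an ID rather than deleting rows is the slide's.

```python
def tag_duplicate_groups(rows, key_columns, id_column):
    """Add a group ID so rows sharing the same key values share an ID."""
    group_ids = {}
    tagged = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        # The first time a key is seen it gets the next ID; repeats reuse it.
        group_id = group_ids.setdefault(key, len(group_ids) + 1)
        tagged.append({**row, id_column: group_id})
    return tagged


COLUMNS = ["artist", "track", "week", "date", "rank"]
CHART = [
    ("2 Pac", "Baby Don't Cry", 1, "2000-02-26", 87),
    ("2 Pac", "Baby Don't Cry", 2, "2000-03-02", 82),
    ("2 Pac", "Baby Don't Cry", 2, "2000-03-02", 82),
    ("2 Pac", "Baby Don't Cry", 3, "2000-03-11", 72),
    ("2 Pac", "Baby Don't Cry", 4, "2000-03-18", 77),
    ("2 Pac", "Baby Don't Cry", 5, "2000-03-25", 87),
    ("2 Pac", "Baby Don't Cry", 6, "2000-04-01", 94),
    ("2 Pac", "Baby Don't Cry", 7, "2000-04-08", 99),
    ("3 Doors Down", "Kryptonite", 1, "2000-03-02", 82),
]
rows = [dict(zip(COLUMNS, values)) for values in CHART]

# Tag full-row duplicates, then duplicates on rank and date only.
tagged = tag_duplicate_groups(rows, COLUMNS, "dupe_full_id")
tagged = tag_duplicate_groups(tagged, ["rank", "date"], "dupe_rank_date")

for row in tagged:
    print(row["artist"], row["week"], row["dupe_full_id"], row["dupe_rank_date"])
```

No row is removed; a reviewer can hold out any group whose ID appears more than once and decide what to keep.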

Page 27: Pattern Matching

• Regular Expressions

Page 28: Pattern Matching

Situation: getting content from large amounts of text

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

Capability: Find and extract arbitrary groups of text

ip               datetime                    verb  target                     return  Etc etc
123.123.123.123  26/Apr/2000:00:23:48 -0400  GET   /pics/wpaper.gif HTTP/1.0  200     http://www.jafsoft.com/asctortf/

Page 29: Pattern Matching: Regular Expressions

Situation: getting content from large amounts of text

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

(Extracted fields as on Page 28.)

From the beginning of the line, give me: 1 to 3 digits, immediately followed by a dot, immediately followed by 1 to 3 digits … etc., up until I encounter the first “ - -”

Regular Expression:
/^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$/m
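
To show the regular expression at work, here is a Python sketch using the standard `re` module. It is an adaptation rather than the slide's exact pattern: named groups are added, and the target, status, size and referrer are captured as well, so that the output lines up with the extracted-fields table. The group names are assumptions.

```python
import re

# Adapted from the slide's pattern, with named groups for each field.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<verb>[A-Z]+) (?P<target>[^"]*)" '
    r'(?P<status>\d+) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

line = (
    '123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] '
    '"GET /pics/wpaper.gif HTTP/1.0" 200 6248 '
    '"http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"'
)


def parse_log_line(text):
    """Return the captured fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(text)
    return match.groupdict() if match else None


fields = parse_log_line(line)
for name, value in fields.items():
    print(name, "=", value)
```

Lines that do not fit the expected layout return `None`, which makes it straightforward to set them aside for review instead of failing the whole load.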

Page 30: Tidying Data Patterns

• Sort
• Filter
• Derived variables
• Aggregations
• Pivot and unpivot
• Roll and unroll
• Previous/Next N
• Split-Apply-Combine

Page 31: Tidying Data Pattern: Split-apply-combine

Artist        Track           Week  Date        Rank
2 Pac         Baby Don’t Cry  1     2000-02-26  87
2 Pac         Baby Don’t Cry  2     2000-03-02  82
2 Pac         Baby Don’t Cry  3     2000-03-11  72
2 Pac         Baby Don’t Cry  4     2000-03-18  77
2 Pac         Baby Don’t Cry  5     2000-03-25  87
2 Pac         Baby Don’t Cry  6     2000-04-01  94
2 Pac         Baby Don’t Cry  7     2000-04-08  99
3 Doors Down  Kryptonite      1     2000-04-08  68
3 Doors Down  Kryptonite      2     2000-04-15  67
3 Doors Down  Kryptonite      3     2000-04-22  66

Capability
Apply arbitrary functions to arbitrary groups

Example
What was each artist’s lowest rank per month (i.e. their best track)?

Page 32: Split-apply-combine: SPLIT

(Same table as on Page 31.)

Page 33: Split-apply-combine: APPLY

(Same table as on Page 31.)

Page 34: Split-apply-combine: COMBINE

(Input table as on Page 31.)

Result
Artist        Track           Week  Date        Rank
2 Pac         Baby Don’t Cry  1     2000-02-26  87
2 Pac         Baby Don’t Cry  3     2000-03-11  72
2 Pac         Baby Don’t Cry  6     2000-04-01  94
3 Doors Down  Kryptonite      3     2000-04-22  66
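
The three steps can be sketched in plain Python for the example question, lowest rank per artist per month. The grouping key (artist plus the year-month prefix of the date) is read off the result table; the function names are illustrative.

```python
from collections import defaultdict

COLUMNS = ["artist", "track", "week", "date", "rank"]
CHART = [
    ("2 Pac", "Baby Don't Cry", 1, "2000-02-26", 87),
    ("2 Pac", "Baby Don't Cry", 2, "2000-03-02", 82),
    ("2 Pac", "Baby Don't Cry", 3, "2000-03-11", 72),
    ("2 Pac", "Baby Don't Cry", 4, "2000-03-18", 77),
    ("2 Pac", "Baby Don't Cry", 5, "2000-03-25", 87),
    ("2 Pac", "Baby Don't Cry", 6, "2000-04-01", 94),
    ("2 Pac", "Baby Don't Cry", 7, "2000-04-08", 99),
    ("3 Doors Down", "Kryptonite", 1, "2000-04-08", 68),
    ("3 Doors Down", "Kryptonite", 2, "2000-04-15", 67),
    ("3 Doors Down", "Kryptonite", 3, "2000-04-22", 66),
]
rows = [dict(zip(COLUMNS, values)) for values in CHART]


def split(rows, key):
    """SPLIT: bucket rows by the value of key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups


def artist_month(row):
    """Grouping key: artist plus the YYYY-MM part of the date."""
    return (row["artist"], row["date"][:7])


def lowest_rank(group):
    """APPLY: pick the row with the smallest rank in one group."""
    return min(group, key=lambda row: row["rank"])


groups = split(rows, artist_month)
# COMBINE: collect one result row per group into a single table.
best = [lowest_rank(group) for group in groups.values()]

for row in best:
    print(row["artist"], row["week"], row["date"], row["rank"])
```

Swapping `artist_month` or `lowest_rank` for another function changes the question being answered without touching the split and combine steps.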

Page 35: Tidying Data Pattern: Unroll (and roll up)

Situation: data on one line

customer_id  session_id  basket
34567        12          45;67;235;9920fD
1232134      2           1345t;456234t

Capability
Get data into a Tidy format

Page 36: Tidying Data Pattern: Unroll (and roll up)

Situation: data on one line

customer_id  session_id  basket
34567        12          45;67;235;9920fD
1232134      2           1345t;456234t

Capability

customer_id  session_id  basket            basket_item  item_order
34567        12          45;67;235;9920fD  45           1
34567        12          45;67;235;9920fD  67           2
34567        12          45;67;235;9920fD  235          3
34567        12          45;67;235;9920fD  99           4

SELECT customer_id, session_id,
       unnest(string_to_array(basket, ';')) AS basket_item
FROM TheTable
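
For teams working outside a database, the same unroll (and its inverse, roll up) might look like this in Python. This is a sketch under assumptions: the basket value is simplified to `45;67;235;99` so that it matches the four unrolled items, and `unroll` and `roll_up` are hypothetical helper names.

```python
def unroll(rows, column, separator=";"):
    """Emit one row per delimited item, keeping its position in the list."""
    unrolled = []
    for row in rows:
        for position, item in enumerate(row[column].split(separator), start=1):
            unrolled.append({**row, "basket_item": item, "item_order": position})
    return unrolled


def roll_up(rows, key_columns, separator=";"):
    """Inverse of unroll: join the items of each key back into one string."""
    baskets = {}
    for row in sorted(rows, key=lambda r: r["item_order"]):
        key = tuple(row[col] for col in key_columns)
        baskets.setdefault(key, []).append(row["basket_item"])
    return [
        {**dict(zip(key_columns, key)), "basket": separator.join(items)}
        for key, items in baskets.items()
    ]


# Simplified basket value (assumption) matching the unrolled items shown.
sessions = [{"customer_id": 34567, "session_id": 12, "basket": "45;67;235;99"}]

unrolled = unroll(sessions, "basket")
for row in unrolled:
    print(row["customer_id"], row["session_id"], row["basket_item"], row["item_order"])

rolled = roll_up(unrolled, ["customer_id", "session_id"])
```

Recording `item_order` during the unroll is what makes the roll up lossless, since the original sequence can be restored.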

Page 37: Tidying Data Pattern: Nth item

Situation: items have an order

customer_id  session_id  access_point         access
34567        12          45;67;235;99;235;99  45
34567        12          45;67;235;99;235;99  67
34567        12          45;67;235;99;235;99  235
34567        12          45;67;235;99;235;99  99
34567        12          45;67;235;99;235;99  235
34567        12          45;67;235;99;235;99  99

Page 38: Tidying Data Pattern: Nth item

Situation: items have an order

customer_id  session_id  access_point         access  order
34567        12          45;67;235;99;235;99  45      1
34567        12          45;67;235;99;235;99  67      2
34567        12          45;67;235;99;235;99  235     3
34567        12          45;67;235;99;235;99  99      4
34567        12          45;67;235;99;235;99  235     5
34567        12          45;67;235;99;235;99  99      6

SELECT row_number() OVER (PARTITION BY customer_id, session_id) AS "order"
FROM TheTable

Page 39: Tidying Data Pattern: Nth item

Situation: items have an order

Can I see when users are flipping between access points?

(Same table as on Page 38.)

Gap=1
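
A possible Python sketch of the Nth item pattern and the flipping question follows. It numbers items within each customer and session, then reports how many other items sit between two visits to the same access point. Reading "Gap=1" as "one item in between" is an interpretation, and both function names are assumptions.

```python
def number_items(rows, partition_columns):
    """Number rows within each partition, in the order they arrive."""
    counters = {}
    numbered = []
    for row in rows:
        key = tuple(row[col] for col in partition_columns)
        counters[key] = counters.get(key, 0) + 1
        numbered.append({**row, "order": counters[key]})
    return numbered


def gaps_between_repeats(numbered, partition_columns, item_column):
    """For each repeat visit, count the items seen since the previous visit."""
    last_seen = {}
    gaps = []
    for row in numbered:
        key = tuple(row[col] for col in partition_columns) + (row[item_column],)
        if key in last_seen:
            gaps.append({**row, "gap": row["order"] - last_seen[key] - 1})
        last_seen[key] = row["order"]
    return gaps


accesses = [
    {"customer_id": 34567, "session_id": 12, "access": item}
    for item in "45;67;235;99;235;99".split(";")
]

numbered = number_items(accesses, ["customer_id", "session_id"])
gaps = gaps_between_repeats(numbered, ["customer_id", "session_id"], "access")

for row in gaps:
    print(row["access"], "revisited at position", row["order"], "gap", row["gap"])
```

A run of small gaps for the same pair of access points is one signal that a user is flipping between them.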

Page 40: Chaining of Patterns

1. Sort
2. Unroll
3. Nth item
4. Split-Apply-Combine

With high pattern maturity, focus is no longer on details of the ‘standard’ pattern. Complex evolving code is easier to maintain.
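
To make the chaining concrete, the Python sketch below strings the four patterns together as small functions. The session data is partly hypothetical (the second session is invented for the example), and the final summary, visits and distinct access points per session, is just one possible function to apply.

```python
from collections import defaultdict

# First session follows the Nth item slides; the second is hypothetical.
sessions = [
    {"customer_id": 1232134, "session_id": 2, "access_point": "1345;456234"},
    {"customer_id": 34567, "session_id": 12, "access_point": "45;67;235;99;235;99"},
]


def sort_rows(rows):
    """1. Sort by customer and session."""
    return sorted(rows, key=lambda r: (r["customer_id"], r["session_id"]))


def unroll(rows):
    """2. Unroll the delimited access_point string into one row per item."""
    return [{**r, "access": a} for r in rows for a in r["access_point"].split(";")]


def nth_item(rows):
    """3. Number each item within its customer and session."""
    counters = defaultdict(int)
    numbered = []
    for r in rows:
        key = (r["customer_id"], r["session_id"])
        counters[key] += 1
        numbered.append({**r, "order": counters[key]})
    return numbered


def split_apply_combine(rows):
    """4. Per session: total visits and number of distinct access points."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["customer_id"], r["session_id"])].append(r)
    return [
        {
            "customer_id": customer,
            "session_id": session,
            "visits": max(r["order"] for r in group),
            "distinct_access_points": len({r["access"] for r in group}),
        }
        for (customer, session), group in groups.items()
    ]


def chain(rows, *steps):
    """Apply each pattern in turn to the output of the previous one."""
    for step in steps:
        rows = step(rows)
    return rows


summary = chain(sessions, sort_rows, unroll, nth_item, split_apply_combine)
for row in summary:
    print(row)
```

Each step is a named, reusable pattern, so changing the analysis means swapping or reordering steps rather than rewriting a single large script.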

Page 41: Summing up

• Guerrilla Analytics requires agile teams
• Data Science Patterns are recurring solutions to data preparation problems
• Capability to recognize and implement patterns is key for high performance
• Pattern groups: Join, Transform, Pattern matching, Tidying, Chaining

Agility
3. Recognize & Implement Patterns
2. Supporting Tools
1. Simple Conventions

Page 42: Find out more

@Enda_Ridge

http://guerrilla-analytics.net

Available on: