web ui, algorithms, and feature engineering

Post on 07-Feb-2017

132 Views

Category:

Data & Analytics

3 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Algorithms

Poul Petersen @pejpgrep CIO, BigML, Inc @bigmlcom

UI Algorithms & Feature Engineering with Flatline

BigML, Inc 2ML Crash Course - UI/Algorithms/Feature Engineering

BigML Algorithm History

2011

Prototyping and Beta

API-first Approach

2013

Evaluations, Batch Predictions,

Ensembles, Sunburst

2015

Association Discovery,

Correlations, Samples, Statistical

Tests

2014

Anomaly Detection, Clusters, Flatline

2016

Scripts, Libraries, Executions,

WhizzML, Logistic Regression

2012

Core ML workflow: source, dataset,

model, prediction

BigML, Inc 3ML Crash Course - UI/Algorithms/Feature Engineering

The need for Machine Learning• Can you find any pattern in this tiny data set?

Talk Text Purchases Data Age Churn?

148 72 0 33.6 50 TRUE

85 66 0 26.6 31 FALSE

183 64 0 23.3 32 TRUE

89 66 94 28.1 21 FALSE

115 0 0 35.3 29 FALSE

166 72 175 25.8 51 TRUE

100 0 0 30 32 TRUE

118 84 230 45.8 31 TRUE

171 110 240 45.4 54 TRUE

159 64 0 27.4 40 FALSE

…. but this is a simple example

BigML, Inc 4ML Crash Course - UI/Algorithms/Feature Engineering

Data Types

numeric

1 2 3

1, 2.0, 3, -5.4 categoricaltrue, yes, red, mammal categoricalcategorical

A B C

DATE-TIME2013-09-25 10:02

DATE-TIME

YEAR

MONTH

DAY-OF-MONTH

YYYY-MM-DD

DAY-OF-WEEK

HOUR

MINUTE

YYYY-MM-DD

YYYY-MM-DD

M-T-W-T-F-S-D

HH:MM:SS

HH:MM:SS

2013

September

25

Wednesday

10

02

text / itemsBe not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.

text

“great”“afraid”“born”“some”

appears 2 timesappears 1 timeappears 1 timeappears 2 times

BigML, Inc 5ML Crash Course - UI/Algorithms/Feature Engineering

Text Analysis

Be not afraid of greatness: some are born great, some achieve greatness, and some have greatnessthrust upon 'em.

great: appears 4 times

Bag of Words

BigML, Inc 6ML Crash Course - UI/Algorithms/Feature Engineering

Text Analysis

… great afraid born achieve … …

… 4 1 1 1 … …

… … … … … … …

Be not afraid of greatness: some are born great, some achieve greatness, and some have greatnessthrust upon ‘em.

Model

The token “great” occurs more than 3 times

The token “afraid” occurs no more than once

BigML, Inc 7ML Crash Course - UI/Algorithms/Feature Engineering

DATASET

Evaluation

TRAIN SET

TEST SET

PREDICTIONS

METRICS

BigML, Inc 8ML Crash Course - UI/Algorithms/Feature Engineering

EnsemblesDiameter Color Shape Fruit

4 red round plum

5 red round apple

5 red round apple

6 red round plum

7 red round appleBagging!

Random Decision Forest!

All Data: “plum”

Sample 2: “apple”

Sample 3: “apple”

Sample 1: “plum”}“apple”

What is a round, red 6cm fruit?

BigML, Inc 9ML Crash Course - UI/Algorithms/Feature Engineering

Logistic Regression

BigML, Inc 10ML Crash Course - UI/Algorithms/Feature Engineering

Logistic Regression

????

BigML, Inc 11ML Crash Course - UI/Algorithms/Feature Engineering

Logistic Regression

P≈0 P≈10<P<1• x→-∞ : P(x)→0

• x→∞ : P(x)→1

BigML, Inc 12ML Crash Course - UI/Algorithms/Feature Engineering

Supervised Learning

animal state … proximity actiontiger hungry … close run

elephant happy … far take picture

Classification

animal state … proximity min_kmhtiger hungry … close 70

hippo angry … far 10

Regression

label

animal state … proximity action1 action2tiger hungry … close run look untasty

elephant happy … far take picture call friends

Multi-Label Classification

BigML, Inc 13ML Crash Course - UI/Algorithms/Feature Engineering

Unsupervised Learning

date customer account auth class zip amountMon Bob 3421 pin clothes 46140 135Tue Bob 3421 sign food 46140 401Tue Alice 2456 pin food 12222 234Wed Sally 6788 pin gas 26339 94Wed Bob 3421 pin tech 21350 2459Wed Bob 3421 pin gas 46140 83The Sally 6788 sign food 26339 51

date customer account auth class zip amountMon Bob 3421 pin clothes 46140 135Tue Bob 3421 sign food 46140 401Tue Alice 2456 pin food 12222 234Wed Sally 6788 pin gas 26339 94Wed Bob 3421 pin tech 21350 2459Wed Bob 3421 pin gas 46140 83The Sally 6788 sign food 26339 51

Clustering

Anomaly Detection

similar

unusual

BigML, Inc 14ML Crash Course - UI/Algorithms/Feature Engineering

K-Means

K=3

BigML, Inc 15ML Crash Course - UI/Algorithms/Feature Engineering

K-Means

K=3

BigML, Inc 16ML Crash Course - UI/Algorithms/Feature Engineering

G-Means

BigML, Inc 17ML Crash Course - UI/Algorithms/Feature Engineering

G-Means

BigML, Inc 18ML Crash Course - UI/Algorithms/Feature Engineering

G-MeansLet K=2Keep 1, Split 1 New K=3

BigML, Inc 19ML Crash Course - UI/Algorithms/Feature Engineering

G-MeansLet K=3Keep 1, Split 2New K=5

BigML, Inc 20ML Crash Course - UI/Algorithms/Feature Engineering

G-MeansLet K=5K=5

BigML, Inc 21ML Crash Course - UI/Algorithms/Feature Engineering

Isolation Forest

Grow a random decision tree until each instance is in its own leaf

“easy” to isolate

“hard” to isolate

Depth

Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)

BigML, Inc 22ML Crash Course - UI/Algorithms/Feature Engineering

Model Competence

MODEL

ANOMALY DETECTOR

Prediction T T

Confidence 86% 84%

AnomalyScore 0.5367 0.7124

Competent? Y N

At Training Time At Prediction Time

DATASET

BigML, Inc 23ML Crash Course - UI/Algorithms/Feature Engineering

Association Rules

date customer account auth class zip amountMon Bob 3421 pin clothes 46140 135Tue Bob 3421 sign food 46140 401Tue Alice 2456 pin food 12222 234Wed Sally 6788 pin gas 26339 94Wed Bob 3421 pin tech 21350 2459Wed Bob 3421 pin gas 46140 83The Sally 6788 sign food 26339 51

{class = gas} amount < 100{customer = Bob, account = 3421} zip = 46140

Rules:

Antecedent Consequent

BigML, Inc 24ML Crash Course - UI/Algorithms/Feature Engineering

Association Metrics

Instances

AC

Coverage

Percentage of instances which match antecedent “A”

BigML, Inc 25ML Crash Course - UI/Algorithms/Feature Engineering

Association Metrics

Instances

AC

Support

Percentage of instances which match antecedent “A” and Consequent “C”

BigML, Inc 26ML Crash Course - UI/Algorithms/Feature Engineering

Association Metrics

Coverage

Support

Instances

AC

Confidence

Percentage of instances in the antecedent which also contain the consequent.

BigML, Inc 27ML Crash Course - UI/Algorithms/Feature Engineering

Association Metrics

CInstances

A C

A

Instances

C

Instances

A

Instances

AC

0% 100%

Instances

AC

Confidence

A never implies C

A sometimes implies C

A always implies C

BigML, Inc 28ML Crash Course - UI/Algorithms/Feature Engineering

Association Metrics

Independent

AC

C

Observed

A

Lift

Ratio of observed support to support if A and C were statistically independent.

Support == Confidence p(A) * p(C) p(C)

BigML, Inc 29ML Crash Course - UI/Algorithms/Feature Engineering

Association Metrics

C

Observed

A

Observed

AC

< 1 > 1

Independent

A C

Lift = 1

Negative Correlation No Association Positive

Correlation

Independent

A C

Independent

A C

Observed

A C

BigML, Inc 30ML Crash Course - UI/Algorithms/Feature Engineering

Association Metrics

Independent

AC

C

Observed

A

Leverage

Difference of observed support and support if A and C were statistically independent.

Support - [ p(A) * p(C) ]

BigML, Inc 31ML Crash Course - UI/Algorithms/Feature Engineering

Association Metrics

C

Observed

A

Observed

AC

< 0 > 0

Independent

A C

Leverage = 0

NegativeCorrelation No Association Positive

Correlation

Independent

A C

Independent

A C

Observed

A C

-1…

BigML, Inc 32ML Crash Course - UI/Algorithms/Feature Engineering

Machine Learning Secret

“…the largest improvements in accuracy often came from quick experiments, feature engineering, and model tuning rather than applying fundamentally different algorithms.”

Facebook FBLearner 2016

Feature Engineering: applying domain knowledge of the data to create features that make machine

learning algorithms work better or at all.

BigML, Inc 33ML Crash Course - UI/Algorithms/Feature Engineering

Feature Engineering

2013-09-25 10:02

DATE-TIME

Automatic Date Transformation

… year month day hour minute …

… 2013 Sep 25 10 2 …

… … … … … … …

NUM NUMCAT NUM NUM

BigML, Inc 34ML Crash Course - UI/Algorithms/Feature Engineering

Feature EngineeringAutomatic Categorical Transformation

… alchemy_category …… business …… recreation …… health …… … …

CAT

business health recreation …… 1 0 0 …… 0 0 1 …… 0 1 0 …… … … … …

NUM NUM NUM

BigML, Inc 35ML Crash Course - UI/Algorithms/Feature Engineering

Feature Engineering

Be not afraid of greatness: some are born great, some achieve greatness, and some have greatnessthrust upon ‘em.

TEXT

Automatic Text Transformation

… great afraid born achieve …

… 4 1 1 1 …

… … … … … …

NUM NUM NUM NUM

BigML, Inc 36ML Crash Course - UI/Algorithms/Feature Engineering

Feature Engineering

{ “url":"cbsnews", "title":"Breaking News Headlines Business Entertainment World News “, "body":" news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more.”}

TEXT

Better representation

title body

Breaking News… news covering…

… …

TEXT TEXT

BigML, Inc 37ML Crash Course - UI/Algorithms/Feature Engineering

Feature EngineeringDiscretization

Total Spend

7,342.99

304.12

4.56

345.87

8,546.32

NUM

“Predict will spend $3,521 with error

$1,232”

Spend Category

Top 33%

Middle 33%

Bottom 33%

Middle 33%

Top 33%

CAT

“Predict customer will be Top 33% in

spending”

BigML, Inc 38ML Crash Course - UI/Algorithms/Feature Engineering

Feature EngineeringCombinations of Multiple Features

Kg M2

101.4 3.24

85.2 2.8

56.2 2.9

136.1 3.6

95.9 4.1

NUM NUM

BMI

31.17

30.4

19.38

37.8

23.39

NUM

Kg M2

BigML, Inc 39ML Crash Course - UI/Algorithms/Feature Engineering

Feature EngineeringFlatline

• BigML’s Domain-Specific Language (DSL) for Transforming Datasets

• Limited programming language structures

• let, cond, if, maps, list operators, */+-

• Dataset Fields are first-class citizens

• (field “diabetes pedigree”)

• Built-in transformations

• statistics, strings, timestamps, windows

BigML, Inc 40ML Crash Course - UI/Algorithms/Feature Engineering

Feature Engineering

(/ (- ( f "price") (avg-window "price" -4, -1)) (standard-deviation "price"))

date volume price1 34353 3142 44455 3153 22333 3154 52322 3215 28000 3206 31254 3197 56544 3238 44331 3249 81111 287

10 65422 29411 59999 30012 45556 30213 19899 30114 21453 302

day-4 day-3 day-2 day-1 4davg 0

314 314314 315 314.5

314 315 315 314.6314 315 315 321 316.25315 315 321 320 317.75315 321 320 319 318.75

Current - (4-day avg) std dev

Shock: Deviations from a Trend

BigML, Inc 41ML Crash Course - UI/Algorithms/Feature Engineering

Feature Engineering

(/ (- (f "price") (avg-window "price" -4, -1)) (standard-deviation "price"))

Current - (4-day avg) std dev

Shock: Deviations from a Trend

Current : (field “price”) 4-day avg: (avg-window “price” -4 -1) std dev: (standard-deviation “price”)

BigML, Inc 42ML Crash Course - UI/Algorithms/Feature Engineering

Feature EngineeringFix Missing Values in a “Meaningful” Way

Filter Zeros

Model insulin

Predict insulin

Select insulin

FixedDataset

AmendedDataset

OriginalDataset

CleanDataset

top related