feature engineering for diverse data types

FEATURE ENGINEERING FOR DIVERSE DATA TYPES Alice Zheng October 10, 2016 Seattle PyLadies Meetup 1

Upload: alice-zheng

Post on 18-Jan-2017


TRANSCRIPT

Page 1: Feature engineering for diverse data types

FEATURE ENGINEERING FOR DIVERSE DATA TYPES
Alice Zheng
October 10, 2016
Seattle PyLadies Meetup

Page 2: Feature engineering for diverse data types

MY JOURNEY SO FAR

Shortage of expertise and good tools in the market.

Applied machine learning/data science

Build ML tools

Write a book

Page 3: Feature engineering for diverse data types

MACHINE LEARNING IS USEFUL!

• Model data.
• Make predictions.
• Build intelligent applications.
• Play chess and Go!

Page 4: Feature engineering for diverse data types

THE MACHINE LEARNING PIPELINE

Raw data → Features → Models → Predictions → Deploy in production

Example raw data: "It is a puppy and it is extremely cute."

Page 5: Feature engineering for diverse data types

Models

Page 6: Feature engineering for diverse data types

A SIMPLE MODEL

X  Y  X and Y
1  1  1
1  0  0
0  1  0
0  0  0

f(x, y) = 0.5 x + 0.5 y - 1
g(x, y) = 1 if f(x, y) ≥ 0
          0 if f(x, y) < 0
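
The simple model above can be checked by evaluating it on all four inputs. This is a minimal sketch, not code from the talk. The threshold is written as f ≥ 0 here so that the input (1, 1), where f is exactly 0, maps to 1 as the truth table requires; that boundary choice is an assumption.

```python
def f(x, y):
    # Linear score from the slide: 0.5 x + 0.5 y - 1
    return 0.5 * x + 0.5 * y - 1


def g(x, y):
    # Assumption: the boundary case f == 0 is counted as 1,
    # so that g(1, 1) agrees with the "X and Y" column.
    return 1 if f(x, y) >= 0 else 0


# Rows of (x, y, g(x, y)) for every combination of binary inputs.
truth_table = [(x, y, g(x, y)) for x in (1, 0) for y in (1, 0)]
for row in truth_table:
    print(row)
```

Only the input (1, 1) reaches the threshold, so g reproduces the logical AND of X and Y.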

Page 7: Feature engineering for diverse data types

VISUALIZING A MODEL

[Figure: plot of g(x, y) against axes X and Y.]

Page 8: Feature engineering for diverse data types

FROM SIMPLE TO COMPLEX

Use more complicated functions

or

Stack layers of simple functions (e.g., deep neural nets)

[Diagram: inputs X1, X2, X3, …, Xn feed a first layer of functions r1(X1, X2), r2(X2, X3), …, rm(X1, Xn), which feed a second layer s1(r1, r2), s2(r1, r3), …, sm(rm-1, rm).]
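
As a rough illustration of stacking simple functions, the sketch below builds XOR, which no single thresholded linear function can represent, out of two layers of threshold units. The choice of XOR, the weights, and the names `step`, `r1`, `r2`, `s1`, and `xor` are assumptions made for this example; they are not taken from the talk.

```python
def step(z):
    # Threshold unit: fires when its input is positive.
    return 1 if z > 0 else 0


def r1(x1, x2):
    # First-layer unit that behaves like OR.
    return step(x1 + x2 - 0.5)


def r2(x1, x2):
    # First-layer unit that behaves like AND.
    return step(x1 + x2 - 1.5)


def s1(a, b):
    # Second-layer unit: fires when a is on and b is off.
    return step(a - b - 0.5)


def xor(x1, x2):
    # Two stacked layers of simple functions.
    return s1(r1(x1, x2), r2(x1, x2))


results = {(x1, x2): xor(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(results)
```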

Page 9: Feature engineering for diverse data types

BETWEEN RAW DATA AND MODELS

• Mathematical models take numeric input
• Raw data are not numeric (or not the right kind of numeric)
• Featurization: the step in-between
• Feature space: multi-dimensional numeric space where modeling happens

Page 10: Feature engineering for diverse data types

Feature Generation

Feature: An individual measurable property of a phenomenon being observed.

⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”

Page 11: Feature engineering for diverse data types

TEXT

Page 12: Feature engineering for diverse data types

TURNING TEXT INTO FEATURES

Raw text: It is a puppy and it is extremely cute.

What are the important measures? Keywords? Verb tense? Subject, object?

Bag of words feature vector:

it         2
is         2
puppy      1
and        1
cat        0
aardvark   0
cute       1
extremely  1
…          …
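
The bag-of-words vector on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the tokenizer (lowercase, letters and apostrophes only) and the name `bag_of_words` are assumptions, and the vocabulary is just the words listed on the slide.

```python
import re
from collections import Counter


def bag_of_words(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


bow = bag_of_words("It is a puppy and it is extremely cute.")

# Vocabulary in the order shown on the slide; absent words count as 0.
vocabulary = ["it", "is", "puppy", "and", "cat", "aardvark", "cute", "extremely"]
vector = [bow[word] for word in vocabulary]
print(vector)
```

Word order and grammar are discarded; only the per-word counts survive.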

Page 13: Feature engineering for diverse data types

VISUALIZING BAG-OF-WORDS

[Figure: the sentence "It is a puppy and it is extremely cute" plotted as a point in a feature space with axes "puppy" and "cute".]

Page 14: Feature engineering for diverse data types

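CLASSIFYING BAG-OF-WORDS

[Figure: documents plotted in a feature space with axes "puppy", "cat", and "have", separated by a decision surface.]

Documents shown:
• I have a puppy
• I have a cat
• I have a kitten
• I have a dog and I have a pen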

Page 15: Feature engineering for diverse data types

Feature Cleaning and Transformation

Page 16: Feature engineering for diverse data types

AUTO-GENERATED FEATURES ARE NOISY

Rank  Word  Doc Count      Rank  Word  Doc Count
1     the   1,416,058      11    was   929,703
2     and   1,381,324      12    this  844,824
3     a     1,263,126      13    but   822,313
4     i     1,230,214      14    my    786,595
5     to    1,196,238      15    that  777,045
6     it    1,027,835      16    with  775,044
7     of    1,025,638      17    on    735,419
8     for   993,430        18    they  720,994
9     is    988,547        19    you   701,015
10    in    961,518        20    have  692,749

Most popular words in Yelp reviews dataset (~ 6M reviews).

Page 17: Feature engineering for diverse data types

AUTO-GENERATED FEATURES ARE NOISY

Rank     Word         Doc Count      Rank     Word             Doc Count
357,480  cmtk8xyqg    1              357,470  attractif        1
357,479  tangified    1              357,469  chappagetti      1
357,478  laaaaaaasts  1              357,468  herdy            1
357,477  bailouts     1              357,467  csmpus           1
357,476  feautred     1              357,466  costoso          1
357,475  résine       1              357,465  freebased        1
357,474  chilyl       1              357,464  tikme            1
357,473  cariottis    1              357,463  traditionresort  1
357,472  enfeebled    1              357,462  jallisco         1
357,471  sparklely    1              357,461  zoawan           1

Least popular words in Yelp reviews dataset (~ 6M reviews).

Page 18: Feature engineering for diverse data types

FEATURE CLEANING

• Popular words and rare words are not helpful
• Manually defined blacklist – stopwords

a            b         c       d           e       f         g        h        i
able         be        came    definitely  each    far       get      had      ie
about        became    can     described   edu     few       gets     happens  if
above        because   cannot  despite     eg      fifth     getting  hardly   ignored
according    become    cant    did         eight   first     given    has      immediately
accordingly  becomes   cause   different   either  five      gives    have     in
across       becoming  causes  do          else    followed  go       having   inasmuch
…            …         …       …           …       …         …        …        …
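
Stopword filtering amounts to a set lookup. The sketch below assumes a tiny stopword set made up only of entries visible in the excerpt above, plus whitespace tokenization; a real list would be much longer, and the names `STOPWORDS` and `remove_stopwords` are illustrative.

```python
# Assumed, deliberately tiny stopword set drawn from the excerpt above.
STOPWORDS = {
    "a", "i", "about", "above", "be", "can", "did", "do",
    "each", "few", "get", "had", "has", "have", "if", "in",
}


def remove_stopwords(text, stopwords=STOPWORDS):
    # Lowercase, split on whitespace, and drop any token in the blacklist.
    return [word for word in text.lower().split() if word not in stopwords]


filtered = remove_stopwords("I have a puppy")
print(filtered)
```

Because the list is fixed in advance, a common word that is missing from it (such as "and" in this small set) passes through untouched.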

Page 19: Feature engineering for diverse data types

FEATURE CLEANING

• Frequency-based pruning
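
One way to sketch frequency-based pruning is to count how many documents contain each word and keep only the words inside a chosen band. The thresholds `min_docs` and `max_fraction`, the function names, and the whitespace tokenizer are all assumptions for illustration; the talk does not specify them.

```python
from collections import Counter


def document_frequency(docs):
    # Number of documents in which each word appears at least once.
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def prune_vocabulary(docs, min_docs=2, max_fraction=0.9):
    # Keep words that appear in at least `min_docs` documents and in
    # at most `max_fraction` of all documents (hypothetical thresholds).
    df = document_frequency(docs)
    limit = max_fraction * len(docs)
    return sorted(word for word, n in df.items() if min_docs <= n <= limit)


docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
kept = prune_vocabulary(docs, min_docs=1, max_fraction=0.5)
print(kept)
```

With these settings the words "i", "have", and "a", which appear in every document, are pruned. Raising `min_docs` would also drop the rare words, which is why the thresholds need tuning per dataset.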

Page 20: Feature engineering for diverse data types

STOPWORDS VS. FREQUENCY FILTERS

Stopwords
• No training required
• Can be exhaustive
• Inflexible

Frequency filters
• Adapts to data
• Also deals with rare words
• Needs tuning, hard to control

Both require manual attention

Page 21: Feature engineering for diverse data types

FEATURE SCALING WITH TF-IDF

• Scaling “evens out” the features
  • A soft filter
• Tf-idf = term frequency × inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
  • Large for uncommon words, small for popular words
  • Discounts popular words, highlights rare words
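
The definitions above translate directly into code. This is a minimal sketch assuming whitespace tokenization and the natural logarithm; the function names are illustrative, and the four example documents are the ones used elsewhere in the talk.

```python
import math
from collections import Counter


def tokenize(doc):
    # Assumed tokenizer: lowercase and split on whitespace.
    return doc.lower().split()


def idf(word, docs):
    # log(# total docs / # docs containing the word).
    # Assumes the word occurs in at least one document.
    n_containing = sum(1 for doc in docs if word in tokenize(doc))
    return math.log(len(docs) / n_containing)


def tfidf(word, doc, docs):
    # Term frequency in this document times inverse document frequency.
    tf = Counter(tokenize(doc))[word]
    return tf * idf(word, docs)


docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
print(idf("puppy", docs))           # log 4
print(idf("have", docs))            # log 1 = 0
print(tfidf("have", docs[3], docs))
```

"have" appears in every document, so its idf is zero and it contributes nothing after scaling, even in the document where it occurs twice.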

Page 22: Feature engineering for diverse data types

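VISUALIZING TF-IDF

[Figure: bag-of-words vectors plotted on axes "puppy", "cat", and "have" for the documents below.]

Documents shown:
• I have a puppy
• I have a cat
• I have a kitten
• I have a dog and I have a pen

idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0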

Page 23: Feature engineering for diverse data types

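VISUALIZING TF-IDF

[Figure: the same documents after tf-idf scaling, on axes "puppy", "cat", and "have". "I have a puppy" is marked at log 4 on the puppy axis and "I have a cat" at log 4 on the cat axis; "I have a dog and I have a pen" and "I have a kitten" share a single label.]

tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0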

Page 24: Feature engineering for diverse data types

IMAGES

Page 25: Feature engineering for diverse data types

REPRESENTING IMAGES

What are the “semantic atoms” of images?
• Semantic atom = a unit of meaning

Page 26: Feature engineering for diverse data types

COLOR HISTOGRAM

[Figure: two images, each shown with a color histogram over White and Blue; axis marks at 40% and 60%.]
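
A color histogram is simply the fraction of pixels of each color. In the sketch below the images are tiny grids of color labels, which is an assumption made to keep the example self-contained; real images would first be quantized into color bins. The names `color_histogram`, `striped`, and `blocked` are hypothetical.

```python
from collections import Counter


def color_histogram(image):
    # `image` is a list of rows, each row a list of color labels.
    pixels = [color for row in image for color in row]
    counts = Counter(pixels)
    return {color: n / len(pixels) for color, n in counts.items()}


# Two hypothetical 2x5 images with different layouts.
striped = [
    ["white", "blue", "white", "blue", "blue"],
    ["blue", "white", "blue", "white", "blue"],
]
blocked = [
    ["white", "white", "blue", "blue", "blue"],
    ["white", "white", "blue", "blue", "blue"],
]

print(color_histogram(striped))
print(color_histogram(blocked))
```

Both images produce the same histogram even though their pixels are arranged differently, so the histogram alone carries no information about spatial structure.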

Page 27: Feature engineering for diverse data types

INFORMATION ABOUT STRUCTURE

Collection of local patches encapsulates global structure

Page 28: Feature engineering for diverse data types

IMAGE GRADIENTS AND ORIENTATION HISTOGRAM

• Color changes indicate edges, patterns, or texture
• Image gradient: direction of largest change in color, starting from a pixel
• Gradient orientation histogram: indicates the prominent directions of color change in a patch of pixels

[Figure: gradient orientation bins labeled -135°, -90°, -45°, 45°, 90°, 135°, 180°.]
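
A gradient orientation histogram for a small grayscale patch can be sketched as follows. The central-difference gradient, the eight 45° bins starting at 0°, and the weighting of each vote by gradient magnitude are assumptions chosen for illustration; SIFT and HOG differ in these details.

```python
import math


def gradient_orientation_histogram(patch, n_bins=8):
    # `patch` is a list of rows of grayscale intensities.
    height, width = len(patch), len(patch[0])
    bin_width = 360.0 / n_bins
    hist = [0.0] * n_bins
    # Interior pixels only, so central differences are defined.
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]
            gy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region: no direction to vote for
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


# Hypothetical 4x4 patch with a vertical edge: dark left, bright right.
patch = [[0, 0, 1, 1] for _ in range(4)]
hist = gradient_orientation_histogram(patch)
print(hist)
```

All of the weight lands in the first bin because intensity changes only in the horizontal direction across this patch.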

Page 29: Feature engineering for diverse data types

SIFT IMAGE FEATURE PIPELINE

Lowe, ICCV 1999

Page 30: Feature engineering for diverse data types

DEEP LEARNING APPROACH

• Stack multiple layers – combine local features to form global features
• Similar in spirit to SIFT/HOG

“AlexNet” – Krizhevsky et al., NIPS 2012

Page 31: Feature engineering for diverse data types

VISUALIZING ALEXNET

Weights of a trained AlexNet. Left: first layer. Right: second layer.

Page 32: Feature engineering for diverse data types

FEATURIZATION CHALLENGES

[Diagram: data types (Text, Image, Audio) placed along three parallel scales.]

“Human native” ↔ Conceptually abstract
Semantic content in data: Low ↔ High
Difficulty of feature generation: Higher ↔ Lower

Example text: "It is a puppy and it is extremely cute."

Page 33: Feature engineering for diverse data types

KEY TO FEATURE ENGINEERING

• Features sit in-between data and models
• Need to encapsulate necessary semantic information from raw data
• Distribution of data in feature space should be easily manageable by intended model
• Natural text and logs contain higher level semantic information
  • Easier to featurize than images and audio
• Requires ingenuity and intuition!

@RainyData [email protected]

Amazon Ad Platform is hiring!