feature engineering for diverse data types

FEATURE ENGINEERING FOR DIVERSE DATA TYPES
Alice Zheng
October 10, 2016
Seattle PyLadies Meetup


MY JOURNEY SO FAR

Shortage of expertise and good tools in the market.

Applied machine learning/data science

Build ML tools

Write a book


MACHINE LEARNING IS USEFUL!

• Model data.
• Make predictions.
• Build intelligent applications.
• Play chess and Go!


THE MACHINE LEARNING PIPELINE

[Diagram: Raw data ("It is a puppy and it is extremely cute.") → Features → Models → Predictions → Deploy in production]


A SIMPLE MODEL

X   Y   X and Y
1   1   1
1   0   0
0   1   0
0   0   0

f(x, y) = 0.5 x + 0.5 y - 1
g(x, y) = 1 if f(x, y) > 0
          0 if f(x, y) <= 0
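
As a minimal sketch, the model on this slide can be evaluated directly in Python. The names `f` and `g` follow the slide; the dictionaries `scores` and `outputs` are added here for illustration only.

```python
def f(x, y):
    # Linear score from the slide: f(x, y) = 0.5 x + 0.5 y - 1
    return 0.5 * x + 0.5 * y - 1

def g(x, y):
    # Threshold the score: 1 if f(x, y) > 0, otherwise 0
    return 1 if f(x, y) > 0 else 0

# Evaluate the score and the thresholded output on the four binary inputs
scores = {(x, y): f(x, y) for x in (1, 0) for y in (1, 0)}
outputs = {(x, y): g(x, y) for x in (1, 0) for y in (1, 0)}
for point in scores:
    print(point, scores[point], outputs[point])
```

With these exact coefficients, (1, 1) is the only input whose score reaches the boundary f = 0; the other three score strictly below it. Under the strict inequality as written, (1, 1) therefore maps to 0. A boundary of f(x, y) >= 0, or a slightly smaller offset, would be one way to reproduce the "X and Y" column; that adjustment is an assumption, not something stated on the slide.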


VISUALIZING A MODEL

[Figure: the model g(x, y) plotted over axes X and Y, each marked at 1.]


FROM SIMPLE TO COMPLEX

Use more complicated functions, or stack layers of simple functions (e.g., deep neural nets).

[Diagram: inputs X1, X2, X3, …, Xn feed the functions r1(X1, X2), r2(X2, X3), …, rm(X1, Xn), which in turn feed s1(r1, r2), s2(r1, r3), …, sm(rm-1, rm), …]
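
To make the idea of stacking concrete, here is a small sketch in which two layers of threshold functions compute XOR, a function that no single linear threshold can represent. The choice of XOR and the specific weights are illustrative assumptions, not taken from the slides; the names `r1`, `r2`, and `s1` only echo the diagram's notation.

```python
def step(z):
    # Simple threshold unit: 1 if the score is positive, else 0
    return 1 if z > 0 else 0

def r1(x1, x2):
    # First-layer function: fires if at least one input is 1 (OR)
    return step(x1 + x2 - 0.5)

def r2(x1, x2):
    # First-layer function: fires only if both inputs are 1 (AND)
    return step(x1 + x2 - 1.5)

def s1(a, b):
    # Second-layer function: fires if r1 is on and r2 is off
    return step(a - b - 0.5)

def stacked(x1, x2):
    # Compose the layers: the second layer sees only first-layer outputs
    return s1(r1(x1, x2), r2(x1, x2))

results = {(x1, x2): stacked(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(results)
```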


BETWEEN RAW DATA AND MODELS
• Mathematical models take numeric input
• Raw data are not numeric (or not the right kind of numeric)
• Featurization: the step in-between
• Feature space: multi-dimensional numeric space where modeling happens

Feature Generation

Feature: An individual measurable property of a phenomenon being observed.

⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”


TURNING TEXT INTO FEATURES


What are the important measures? Keywords? Verb tense? Subject, object?

Raw text → bag-of-words feature vector:

it          2
is          2
puppy       1
and         1
cat         0
aardvark    0
cute        1
extremely   1
…           …
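
A bag-of-words vector like this one can be sketched with the standard library alone. The tokenizer below (lowercase, split on runs of letters) and the fixed `vocabulary` list are simplifying assumptions for illustration; the sentence is the one used earlier in the deck.

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase the text and split it into runs of letters (a deliberately simple tokenizer)
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

counts = bag_of_words("It is a puppy and it is extremely cute.")

# Look up counts for a fixed vocabulary; words that are absent get 0
vocabulary = ["it", "is", "puppy", "and", "cat", "aardvark", "cute", "extremely"]
vector = [counts[word] for word in vocabulary]
print(dict(zip(vocabulary, vector)))
```

Note that word order is discarded: only the counts survive, which is what makes this a "bag".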


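VISUALIZING BAG-OF-WORDS

[Figure: the sentence "It is a puppy and it is extremely cute" plotted as the point (puppy = 1, cute = 1) in a feature space with axes "puppy" and "cute".]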


CLASSIFYING BAG-OF-WORDS

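[Figure: four sentences plotted in a feature space with axes "puppy", "cat", and "have", separated by a decision surface: "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen".]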

Feature Cleaning and Transformation


AUTO-GENERATED FEATURES ARE NOISY

Rank  Word  Doc Count      Rank  Word  Doc Count
1     the   1,416,058      11    was   929,703
2     and   1,381,324      12    this  844,824
3     a     1,263,126      13    but   822,313
4     i     1,230,214      14    my    786,595
5     to    1,196,238      15    that  777,045
6     it    1,027,835      16    with  775,044
7     of    1,025,638      17    on    735,419
8     for   993,430        18    they  720,994
9     is    988,547        19    you   701,015
10    in    961,518        20    have  692,749

Most popular words in Yelp reviews dataset (~ 6M reviews).


AUTO-GENERATED FEATURES ARE NOISY

Rank     Word         Doc Count      Rank     Word             Doc Count
357,480  cmtk8xyqg    1              357,470  attractif        1
357,479  tangified    1              357,469  chappagetti      1
357,478  laaaaaaasts  1              357,468  herdy            1
357,477  bailouts     1              357,467  csmpus           1
357,476  feautred     1              357,466  costoso          1
357,475  résine       1              357,465  freebased        1
357,474  chilyl       1              357,464  tikme            1
357,473  cariottis    1              357,463  traditionresort  1
357,472  enfeebled    1              357,462  jallisco         1
357,471  sparklely    1              357,461  zoawan           1

Least popular words in Yelp reviews dataset (~ 6M reviews).


FEATURE CLEANING
• Popular words and rare words are not helpful
• Manually defined blacklist – stopwords

a            b         c       d           e       f         g        h        i
able         be        came    definitely  each    far       get      had      ie
about        became    can     described   edu     few       gets     happens  if
above        because   cannot  despite     eg      fifth     getting  hardly   ignored
according    become    cant    did         eight   first     given    has      immediately
accordingly  becomes   cause   different   either  five      gives    have     in
across       becoming  causes  do          else    followed  go       having   inasmuch
…            …         …       …           …       …         …        …        …
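
Stopword filtering amounts to a set-membership test on each token. The sketch below uses a tiny hand-written `STOPWORDS` set that is an assumption for illustration, not the list shown on the slide; the tokenizer is the same simplification as before.

```python
import re

# A tiny hand-written stopword list for illustration; real lists are much longer
STOPWORDS = {"a", "and", "i", "is", "it", "have", "the"}

def remove_stopwords(text, stopwords=STOPWORDS):
    # Tokenize, then drop every token that appears in the stopword set
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]

filtered = remove_stopwords("It is a puppy and it is extremely cute.")
print(filtered)
```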


FEATURE CLEANING
• Frequency-based pruning
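
One possible way to prune by frequency is to count the number of documents each word appears in and keep only words inside a chosen band. The thresholds `min_df` and `max_df_ratio` are hypothetical parameters chosen for this sketch, and the fifth document is made up so that some words survive the filter.

```python
import re
from collections import Counter

def document_frequency(docs):
    # Count, for each word, the number of documents it appears in
    df = Counter()
    for doc in docs:
        df.update(set(re.findall(r"[a-z]+", doc.lower())))
    return df

def prune_vocabulary(docs, min_df=2, max_df_ratio=0.75):
    # Keep words that are neither too rare nor too common
    df = document_frequency(docs)
    n_docs = len(docs)
    return sorted(w for w, c in df.items()
                  if c >= min_df and c / n_docs <= max_df_ratio)

docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
    "The puppy chased the cat",  # made-up sentence, not from the slides
]
kept = prune_vocabulary(docs)
print(kept)
```

Here "i", "have", and "a" are dropped as too common (4 of 5 documents), and words seen in only one document are dropped as too rare. Both thresholds would need tuning on real data.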


STOPWORDS VS. FREQUENCY FILTERS

Stopwords
• No training required
• Can be exhaustive
• Inflexible

Frequency filters
• Adapts to data
• Also deals with rare words
• Needs tuning, hard to control

Both require manual attention


FEATURE SCALING WITH TF-IDF
• Scaling “evens out” the features
• A soft filter
• Tf-idf = term frequency x inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
• Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
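
The definitions above translate into a few lines of Python. This is a minimal sketch using the unsmoothed idf exactly as written on the slide (library implementations often add smoothing terms); the helper names `tokenize`, `idf`, and `tfidf` are chosen here for illustration, and the four documents are the example sentences from the bag-of-words slides.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into runs of letters
    return re.findall(r"[a-z]+", text.lower())

def idf(docs):
    # idf(w) = log(# total docs / # docs containing w)
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {w: math.log(n_docs / c) for w, c in df.items()}

def tfidf(doc, idf_weights):
    # tf = number of times a term appears in the document
    tf = Counter(tokenize(doc))
    return {w: count * idf_weights[w] for w, count in tf.items()}

docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
weights = idf(docs)
vectors = [tfidf(doc, weights) for doc in docs]
print(weights["puppy"], weights["have"])
print(vectors[0])
```

Because "have" appears in every document, its idf is log 1 = 0, so it contributes nothing to any vector even where its raw count is 2.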


VISUALIZING TF-IDF

[Figure: four sentences plotted in bag-of-words space with axes "puppy", "cat", and "have": "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen".]

idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0


VISUALIZING TF-IDF

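tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0

[Figure: the same sentences after tf-idf scaling, with axes "puppy", "cat", and "have". "I have a puppy" sits at log 4 on the puppy axis and "I have a cat" at log 4 on the cat axis; "I have a dog and I have a pen" and "I have a kitten" fall together at the origin.]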

IMAGES


REPRESENTING IMAGES

What are the “semantic atoms” of images?
• Semantic atom = a unit of meaning


COLOR HISTOGRAM

[Figure: two color histograms, each with bars labeled White and Blue at 40% and 60%.]
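
A color histogram can be sketched by counting pixel values and normalizing. The tiny images below are made up, and the split of 40% white and 60% blue is an assumption, since the slide text does not say which color carries which percentage.

```python
from collections import Counter

WHITE = (255, 255, 255)
BLUE = (0, 0, 255)

# A made-up 2 x 5 image stored as rows of RGB tuples: 4 white pixels, 6 blue pixels
image = [
    [WHITE, WHITE, BLUE, BLUE, BLUE],
    [WHITE, WHITE, BLUE, BLUE, BLUE],
]

def color_histogram(img):
    # Count each color and normalize by the total number of pixels
    pixels = [pixel for row in img for pixel in row]
    counts = Counter(pixels)
    return {color: n / len(pixels) for color, n in counts.items()}

hist = color_histogram(image)
print(hist[WHITE], hist[BLUE])

# Rearranging the pixels changes the picture but not the histogram
shuffled = [
    [BLUE, WHITE, BLUE, WHITE, BLUE],
    [WHITE, BLUE, BLUE, WHITE, BLUE],
]
print(color_histogram(shuffled) == hist)
```

The second image shows the limitation of this feature: a histogram records how much of each color is present but nothing about where it is.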


INFORMATION ABOUT STRUCTURE

Collection of local patches encapsulates global structure


IMAGE GRADIENTS AND ORIENTATION HISTOGRAM

• Color changes indicate edges, patterns, or texture

• Image gradient: direction of largest change in color, starting from a pixel

[Figure: compass of gradient orientations labeled -135°, -90°, -45°, 0°, 45°, 90°, 135°, and 180°.]

• Gradient orientation histogram: indicates the prominent directions of color change in a patch of pixels
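
A gradient orientation histogram for one patch might be sketched as follows. Several choices here are assumptions made for illustration: the patch is grayscale rather than color, gradients use central differences, each vote is weighted by gradient magnitude (one common choice), and the eight bins are 45° wide starting at -180°. The function name and the sample patch are hypothetical.

```python
import math

def gradient_orientation_histogram(patch, n_bins=8):
    # patch: 2-D list of grayscale intensities
    # Returns n_bins magnitude-weighted votes over orientations in [-180, 180) degrees
    hist = [0.0] * n_bins
    bin_width = 360.0 / n_bins
    rows, cols = len(patch), len(patch[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the horizontal and vertical change
            gx = patch[r][c + 1] - patch[r][c - 1]
            gy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region: no direction of change
            angle = math.degrees(math.atan2(gy, gx))  # in [-180, 180]
            index = int((angle + 180.0) // bin_width) % n_bins
            hist[index] += magnitude
    return hist

# A made-up 4 x 4 patch with a vertical edge: dark on the left, bright on the right
patch = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
hist = gradient_orientation_histogram(patch)
print(hist)
```

All of the weight lands in the bin that contains 0°, because intensity increases only from left to right across the edge.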


SIFT IMAGE FEATURE PIPELINE

Lowe, ICCV 1999


DEEP LEARNING APPROACH
• Stack multiple layers – combine local features to form global features
• Similar in spirit to SIFT/HOG

“AlexNet” – Krizhevsky et al., NIPS 2012


VISUALIZING ALEXNET

Weights of a trained AlexNet. Left: first layer; right: second layer.


FEATURIZATION CHALLENGES


                                  “Human native”    Conceptually abstract
Semantic content in data          Low               High
Difficulty of feature generation  Higher            Lower

Data types shown: Text, Image, Audio


KEY TO FEATURE ENGINEERING
• Features sit in-between data and models
• Need to encapsulate necessary semantic information from raw data
• Distribution of data in feature space should be easily manageable by intended model
• Natural text and logs contain higher level semantic information
• Easier to featurize than images and audio
• Requires ingenuity and intuition!

@RainyData

Amazon Ad Platform is hiring!
