feature engineering for diverse data types
FEATURE ENGINEERING FOR DIVERSE DATA TYPES
Alice Zheng
October 10, 2016
Seattle PyLadies Meetup
MY JOURNEY SO FAR
Shortage of expertise and good tools in the market.
Applied machine learning/data science
Build ML tools
Write a book
MACHINE LEARNING IS USEFUL!
Model data.
Make predictions.
Build intelligent applications.
Play chess and Go!
THE MACHINE LEARNING PIPELINE
It is a puppy and it is extremely cute.
Raw data → Features → Models → Predictions → Deploy in production
Models
A SIMPLE MODEL
X   Y   X and Y
0   0   0
0   1   0
1   0   0
1   1   1
f(x, y) = 0.5 x + 0.5 y – 1
g(x, y) = 1 if f(x, y) ≥ 0
          0 if f(x, y) < 0
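The model on this slide can be checked in a few lines of Python. This is a minimal sketch, not code from the talk. It assumes the threshold is inclusive (g = 1 when f(x, y) ≥ 0), because f(1, 1) is exactly 0 and the "X and Y" column needs that row to map to 1.

```python
def f(x, y):
    """Linear score from the slide: 0.5*x + 0.5*y - 1."""
    return 0.5 * x + 0.5 * y - 1


def g(x, y):
    """Threshold the score. Assumption: the boundary case f == 0 maps to 1."""
    return 1 if f(x, y) >= 0 else 0


# Evaluate the model on every row of the truth table.
truth_table = {(x, y): g(x, y) for x in (0, 1) for y in (0, 1)}
for (x, y), out in truth_table.items():
    print(x, y, out)
```

Only the input (1, 1) reaches the threshold, so the model behaves like a logical AND.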
VISUALIZING A MODEL
[Figure: plot of g(x, y) against X and Y; both axes are marked at 1.]
FROM SIMPLE TO COMPLEX
X1, X2, X3, …, Xn
r1(X1, X2), r2(X2, X3), …, rm(X1, Xn)
s1(r1, r2), s2(r1, r3), …, sm(rm-1, rm)
Use more complicated functions
or
Stack layers of simple functions (e.g., deep neural nets)
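To illustrate the stacking option, here is a hypothetical example that is not from the talk. Exclusive-or cannot be computed by a single linear threshold function, but two layers of them can. The weights and the helper names `step`, `unit`, and `xor` are chosen for illustration only.

```python
def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


def unit(weights, bias, inputs):
    """One simple function: a weighted sum followed by a threshold."""
    return step(sum(w * v for w, v in zip(weights, inputs)) + bias)


def xor(x1, x2):
    # First layer: two simple functions of the raw inputs.
    r1 = unit((1, 1), -0.5, (x1, x2))  # fires if at least one input is 1 (OR)
    r2 = unit((1, 1), -1.5, (x1, x2))  # fires only if both inputs are 1 (AND)
    # Second layer: a simple function of the first-layer outputs.
    return unit((1, -1), -0.5, (r1, r2))  # OR but not AND


outputs = [xor(a, b) for a in (0, 1) for b in (0, 1)]
print(outputs)
```

Each unit on its own is as simple as the AND model; the extra expressive power comes only from feeding one layer's outputs into the next.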
BETWEEN RAW DATA AND MODELS
• Mathematical models take numeric input
• Raw data are not numeric (or not the right kind of numeric)
• Featurization: the step in-between
• Feature space: multi-dimensional numeric space where modeling happens
Feature Generation
Feature: An individual measurable property of a phenomenon being observed.
⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”
TEXT
TURNING TEXT INTO FEATURES
Raw text: It is a puppy and it is extremely cute.
What are the important measures? Keywords? Verb tense? Subject, object?
Bag of words feature vector:
it          2
is          2
puppy       1
and         1
cat         0
aardvark    0
cute        1
extremely   1
…           …
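A bag-of-words vector like the one on this slide can be built with a word counter. The sketch below is an illustration, not the speaker's code: the `bag_of_words` helper and the letters-only tokenizer are assumptions, and the vocabulary is limited to the words shown on the slide.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer: runs of letters
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Words shown on the slide; a real vocabulary has thousands of entries.
vocabulary = ["it", "is", "puppy", "and", "cat", "aardvark", "cute", "extremely"]
vector = bag_of_words("It is a puppy and it is extremely cute.", vocabulary)
print(dict(zip(vocabulary, vector)))
```

Word order is discarded: any reshuffling of the sentence gives the same vector.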
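VISUALIZING BAG-OF-WORDS
[Figure: the sentence "It is a puppy and it is extremely cute" plotted as a point in feature space, at 1 on the "puppy" axis and 1 on the "cute" axis.]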
CLASSIFYING BAG-OF-WORDS
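[Figure: four sentences plotted in a feature space with axes "puppy", "cat", and "have": "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen". A decision surface is drawn through the space.]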
Feature Cleaning and Transformation
AUTO-GENERATED FEATURES ARE NOISY
Rank   Word   Doc Count
1      the    1,416,058
2      and    1,381,324
3      a      1,263,126
4      i      1,230,214
5      to     1,196,238
6      it     1,027,835
7      of     1,025,638
8      for      993,430
9      is       988,547
10     in       961,518
11     was      929,703
12     this     844,824
13     but      822,313
14     my       786,595
15     that     777,045
16     with     775,044
17     on       735,419
18     they     720,994
19     you      701,015
20     have     692,749
Most popular words in Yelp reviews dataset (~ 6M reviews).
AUTO-GENERATED FEATURES ARE NOISY
Rank      Word              Doc Count
357,480   cmtk8xyqg         1
357,479   tangified         1
357,478   laaaaaaasts       1
357,477   bailouts          1
357,476   feautred          1
357,475   résine            1
357,474   chilyl            1
357,473   cariottis         1
357,472   enfeebled         1
357,471   sparklely         1
357,470   attractif         1
357,469   chappagetti       1
357,468   herdy             1
357,467   csmpus            1
357,466   costoso           1
357,465   freebased         1
357,464   tikme             1
357,463   traditionresort   1
357,462   jallisco          1
357,461   zoawan            1
Least popular words in Yelp reviews dataset (~ 6M reviews).
FEATURE CLEANING
• Popular words and rare words are not helpful
• Manually defined blacklist – stopwords

a             b          c        d            e        f          g         h         i
able          be         came     definitely   each     far        get       had       ie
about         became     can      described    edu      few        gets      happens   if
above         because    cannot   despite      eg       fifth      getting   hardly    ignored
according     become     cant     did          eight    first      given     has       immediately
accordingly   becomes    cause    different    either   five       gives     have      in
across        becoming   causes   do           else     followed   go        having    inasmuch
…             …          …        …            …        …          …         …         …
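Stopword filtering is a set-membership test. The sketch below uses a tiny hypothetical stopword set: a few entries are copied from the slide, and "a" and "i" are added for the example. The `remove_stopwords` name is an assumption as well.

```python
# Hypothetical, tiny stopword list for illustration only.
# Real lists contain hundreds of words.
STOPWORDS = {"a", "able", "about", "be", "because", "can", "did", "do",
             "had", "has", "have", "i", "if", "in"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not on the blacklist."""
    return [t for t in tokens if t not in stopwords]


tokens = "i have a puppy".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

The list is fixed in advance, so no training data is needed, but words that are only uninformative in a particular corpus will slip through.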
FEATURE CLEANING
• Frequency-based pruning
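One way to sketch frequency-based pruning is to count how many documents contain each word and drop words at both extremes. The corpus, the function names, and the `min_docs` and `max_fraction` thresholds below are hypothetical; the slide does not specify any of them.

```python
from collections import Counter


def document_frequencies(docs):
    """Map each word to the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # set(): count each word once per doc
    return df


def prune_vocabulary(docs, min_docs=2, max_fraction=0.5):
    """Keep words that are neither too rare nor too common.

    min_docs and max_fraction are hypothetical thresholds that need tuning.
    """
    df = document_frequencies(docs)
    limit = max_fraction * len(docs)
    return sorted(w for w, n in df.items() if min_docs <= n <= limit)


docs = [
    "i love my puppy",
    "i love my cat",
    "my puppy is cute",
    "i saw a cute cat",
]
kept = prune_vocabulary(docs)
print(kept)
```

Here "i" and "my" are dropped because they appear in three of the four documents, and "is", "saw", and "a" are dropped because they appear in only one.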
STOPWORDS VS. FREQUENCY FILTERS
Stopwords:
• No training required
• Can be exhaustive
• Inflexible
Frequency filters:
• Adapts to data
• Also deals with rare words
• Needs tuning, hard to control
Both require manual attention
FEATURE SCALING WITH TF-IDF
• Scaling “evens out” the features
• A soft filter
• Tf-idf = term frequency x inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
• Idf is large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
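The definitions above translate directly into code. The following is a minimal sketch using the four toy sentences that appear elsewhere in these slides. The whitespace tokenizer and the function names are assumptions, and library implementations often add smoothing terms that this version omits.

```python
import math
from collections import Counter


def tokenize(doc):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return doc.lower().split()


def idf(docs):
    """idf(w) = log(# total docs / # docs containing w)."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each word once per document
    return {w: math.log(len(docs) / n) for w, n in df.items()}


def tfidf(doc, idf_scores):
    """tf-idf(w, doc) = (number of times w appears in doc) * idf(w)."""
    tf = Counter(tokenize(doc))
    return {w: count * idf_scores[w] for w, count in tf.items()}


docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
idf_scores = idf(docs)
vectors = [tfidf(doc, idf_scores) for doc in docs]
print(idf_scores["puppy"], idf_scores["have"])
print(vectors[3]["have"])
```

"have" occurs in every document, so its idf is log 1 = 0 and it contributes nothing even where it appears twice; "puppy" occurs in one document of four, so its idf is log 4.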
VISUALIZING TF-IDF
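[Figure: the sentences "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen" plotted in the feature space with axes "puppy", "cat", and "have", before tf-idf scaling.]
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0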
VISUALIZING TF-IDF
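tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
[Figure: the same feature space after tf-idf scaling. "I have a puppy" sits at log 4 on the "puppy" axis and "I have a cat" at log 4 on the "cat" axis; "I have a dog and I have a pen" and "I have a kitten" fall on the same point.]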
IMAGES
REPRESENTING IMAGES
What are the “semantic atoms” of images?
• Semantic atom = a unit of meaning
COLOR HISTOGRAM
[Figure: two color histograms, each showing White and Blue bars at 40% and 60%.]
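A color histogram records what fraction of an image's pixels fall into each color bin. The sketch below is a hypothetical illustration: the 10-pixel "image" of color labels and the `color_histogram` name are made up for the example, and real images would first need their RGB values quantized into a fixed set of bins.

```python
from collections import Counter


def color_histogram(pixels):
    """Return the fraction of pixels that have each color."""
    counts = Counter(pixels)
    total = len(pixels)
    return {color: n / total for color, n in counts.items()}


# Hypothetical 10-pixel image, flattened to a list of color labels.
image = ["white"] * 4 + ["blue"] * 6
hist = color_histogram(image)
print(hist)

# Rearranging the pixels does not change the histogram.
shuffled = ["blue", "white"] * 4 + ["blue", "blue"]
print(color_histogram(shuffled) == hist)
```

Because the histogram ignores where each pixel sits, images with very different layouts can share the same histogram.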
INFORMATION ABOUT STRUCTURE
Collection of local patches encapsulates global structure
IMAGE GRADIENTS AND ORIENTATION HISTOGRAM
• Color changes indicate edges, patterns, or texture
• Image gradient: direction of largest change in color, starting from a pixel
[Figure: gradient orientation bins at -135º, -90º, -45º, 0º, 45º, 90º, 135º, and 180º.]
• Gradient orientation histogram: indicates the prominent directions of color change in a patch of pixels
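A gradient orientation histogram for one patch can be sketched in plain Python. Several choices here are assumptions rather than anything stated on the slide: gradients are estimated with central differences on grayscale values, each pixel's vote is weighted by its gradient magnitude, and orientations are snapped to the eight 45º bins shown in the figure.

```python
import math


def orientation_histogram(image):
    """Sum gradient magnitudes into 45-degree orientation bins.

    `image` is a list of rows of grayscale values. Gradients use central
    differences, so border pixels are skipped.
    """
    bins = {angle: 0.0 for angle in (-135, -90, -45, 0, 45, 90, 135, 180)}
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            gx = image[r][c + 1] - image[r][c - 1]  # change along a row
            gy = image[r + 1][c] - image[r - 1][c]  # change down a column
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region: no direction of change
            angle = math.degrees(math.atan2(gy, gx))  # in [-180, 180]
            nearest = int(round(angle / 45)) * 45     # snap to the nearest bin
            if nearest == -180:
                nearest = 180                         # -180 and 180 coincide
            bins[nearest] += magnitude
    return bins


# Dark-to-light edge running left to right: all change is along the rows.
vertical_edge = [[0, 0, 1, 1] for _ in range(4)]
# Dark-to-light edge running top to bottom: all change is down the columns.
horizontal_edge = [[0] * 4, [0] * 4, [1] * 4, [1] * 4]

hist_v = orientation_histogram(vertical_edge)
hist_h = orientation_histogram(horizontal_edge)
print(hist_v)
print(hist_h)
```

Row indices grow downward in this representation, so a 90º orientation means the values increase toward the bottom of the patch.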
SIFT IMAGE FEATURE PIPELINE
Lowe, ICCV 1999
DEEP LEARNING APPROACH
• Stack multiple layers – combine local features to form global features
• Similar in spirit to SIFT/HOG
“AlexNet” – Krizhevsky et al., NIPS 2012
VISUALIZING ALEXNET
Weights of a trained AlexNet. Left: first layer; right: second layer.
FEATURIZATION CHALLENGES
It is a puppy and it is extremely cute.
[Figure: Text, Image, and Audio placed on a spectrum between “human native” and conceptually abstract. Semantic content in data runs from low to high; difficulty of feature generation runs from higher to lower.]
KEY TO FEATURE ENGINEERING
• Features sit in-between data and models
• Need to encapsulate necessary semantic information from raw data
• Distribution of data in feature space should be easily manageable by intended model
• Natural text and logs contain higher level semantic information
• They are easier to featurize than images and audio
• Requires ingenuity and intuition!
@RainyData
Amazon Ad Platform is hiring!