cm nccu class2
TRANSCRIPT
-
NCCU Chih-Ming
-
Kaggle
https://www.facebook.com/groups/kaggletw/
-
https://www.kaggle.com/
-
Why Compete?
For Fun: Competing with others, as in running or racing
For Learning: Improving your abilities
What's Your Motivation?
-
Related Websites
http://dc.dsp.im/index.php
https://tianchi.aliyun.com/
-
An Overview of Making a Prediction
Raw Data
Data Cleaning
Train/Test Splitting
Exploratory Data Analysis (EDA)
Feature Engineering
Applying Estimation Models
Evaluation
Prediction
-
An Overview of Making a Prediction
Raw Data
Data Cleaning
Preprocessing Tasks
- Data Cleaning
- Data Transformation
- Data Reduction
-
Recap
Missing Values
- drop the missing data
- replace them with statistical values (mean / median / mode / clustering / modeling methods)
- label them as missing
Outlier Detection
- https://en.wikipedia.org/wiki/Outlier
Redundant Features
- we usually remove them
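The three ways of handling missing values listed in the recap can be sketched in plain Python. The `ages` list is made-up data, and using the median as the replacement statistic is only one of the options named on the slide.

```python
from statistics import median

# Hypothetical ages; None marks a missing value.
ages = [19, 27, None, 36, None, 22]

# Option 1: drop the missing data.
dropped = [a for a in ages if a is not None]

# Option 2: replace missing entries with a statistic of the observed values
# (the median here; mean or mode would work the same way).
fill_value = median(dropped)
filled = [fill_value if a is None else a for a in ages]

# Option 3: label them as missing with an extra indicator feature.
is_missing = [1 if a is None else 0 for a in ages]

print(dropped)
print(filled)
print(is_missing)
```

Options 2 and 3 are often combined, so the model sees both a usable number and the fact that it was imputed.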
-
Data Cleaning / Preprocessing
        Age
User A  19
User B  27
User C  200
drop
add label
        Age  Std.  Out.
User A  19   1.2   1
User B  27   1.1   1
User C  200  6.3   0
replace (mean / median / mode / clustering / modeling methods)
        Age
User A  19
User B  27
User C  36
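A minimal sketch of the drop / add label / replace options for the Age example. The plausibility limit of 120 is an assumed domain rule, not something stated on the slides, and the median replacement will not reproduce the slide's value of 36, whose method is not given.

```python
from statistics import median

ages = {"User A": 19, "User B": 27, "User C": 200}

MAX_PLAUSIBLE_AGE = 120  # assumed domain rule, not from the slides


def is_outlier(age):
    """Flag ages outside the assumed plausible range."""
    return age < 0 or age > MAX_PLAUSIBLE_AGE


# drop: remove the outlying records
dropped = {u: a for u, a in ages.items() if not is_outlier(a)}

# add label: keep every record and attach a 0/1 outlier flag
labeled = {u: (a, int(is_outlier(a))) for u, a in ages.items()}

# replace: substitute the median of the remaining values
fill_value = median(dropped.values())
replaced = {u: (fill_value if is_outlier(a) else a) for u, a in ages.items()}
```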
-
An Overview of Making a Prediction
Raw Data
Data Cleaning
Train/Test Splitting
- Random Splitting
- Split by Time
- Split by Id
- Cross-Validation
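Random splitting and cross-validation can be sketched without any toolkit. The helper names `random_split` and `k_fold` are hypothetical, and the fold assignment (every k-th row) is one simple choice among many.

```python
import random


def random_split(rows, test_ratio=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]


def k_fold(rows, k=5):
    """Yield (train, validation) pairs; each row is validated exactly once."""
    rows = list(rows)
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, valid


data = list(range(10))
train, test = random_split(data)
splits = list(k_fold(data, k=5))
```

In practice the rows would usually be shuffled once before building the folds.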
-
Hold A Proper Validation
Random Splitting
Split by Time
Split by Id
[Figure: timeline split by time. Train: 5/13 to 5/20 (7 DAYS); Validation or Test: 5/20 to 5/27 (7 DAYS)]
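Splitting by time and splitting by id can be sketched as follows. The log records, the cutoff date and the held-out user set are all made up for illustration.

```python
from datetime import date

# Hypothetical interaction log: (user_id, day, label).
logs = [
    ("u1", date(2017, 5, 14), 1),
    ("u2", date(2017, 5, 16), 0),
    ("u1", date(2017, 5, 21), 0),
    ("u3", date(2017, 5, 22), 1),
    ("u2", date(2017, 5, 26), 1),
]

# Split by time: train on everything before the cutoff, validate on the rest.
cutoff = date(2017, 5, 20)
train_by_time = [r for r in logs if r[1] < cutoff]
valid_by_time = [r for r in logs if r[1] >= cutoff]

# Split by id: all records of a held-out user go to validation,
# so no user appears on both sides.
valid_users = {"u3"}
train_by_id = [r for r in logs if r[0] not in valid_users]
valid_by_id = [r for r in logs if r[0] in valid_users]
```

A time split mimics predicting the future from the past; an id split checks how the model behaves on users it has never seen.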
-
RoadMap (1)
Feature Engineering
Feature Encoding
- Binary Features
- Numeric Features
- Categorical Features
-
Feature Engineering
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
https://en.wikipedia.org/wiki/Feature_engineering
like/dislike?
answer: like
gender: boy, age: 22
artist: Cheer, genre: pop
-
Feature Encoding
Convert the extracted features to be readable by applied machine learning models.
1 = boy*w1 + age*w2 + cheer*w3 + pop*w4
0 = girl*w1 + age*w2 + may_day*w3 + indie*w4
-
Feature Encoding
Convert the extracted features to be readable by applied machine learning models.
1 = boy*w1 + age*w2 + cheer*w3 + pop*w4 ??????
Assign each feature an index:
boy -> 0, age -> 2, cheer -> 13, pop -> 26
0 AND 2 AND 13 AND 26 = 1
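One way to read the index assignment: each feature gets a column index, and a record becomes a sparse index-to-value mapping that a linear model can score. The `feature_index` keys, the helper names and the weights below are assumptions for illustration; only the indices 0, 2, 13 and 26 come from the slide.

```python
# Hypothetical feature-to-index table using the indices shown on the slide.
feature_index = {"gender=boy": 0, "age": 2, "artist=cheer": 13, "genre=pop": 26}


def encode(record):
    """Return a sparse {index: value} representation of one record."""
    sparse = {}
    for name, value in record.items():
        if isinstance(value, str):
            # Categorical value: switch on the column for "name=value".
            sparse[feature_index[f"{name}={value}"]] = 1.0
        else:
            # Numeric value: store it as-is in the feature's column.
            sparse[feature_index[name]] = float(value)
    return sparse


def linear_score(sparse, weights):
    """Weighted sum over the non-zero features only."""
    return sum(weights.get(i, 0.0) * v for i, v in sparse.items())


x = encode({"gender": "boy", "age": 22, "artist": "cheer", "genre": "pop"})
weights = {0: 0.5, 2: 0.01, 13: 0.2, 26: 0.08}  # made-up weights
score = linear_score(x, weights)
```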
-
Feature Encoding
Convert the extracted features to be readable by applied machine learning models.
- Binary Features
- Numeric Features
- Categorical Features
-
Binarization
        Gender  0
User A  boy     0
User B  boy     0
User C  girl    1
0 for boy
1 for girl
-
Binarization
        Age  1
User A  17   0
User B  27   1
User C  32   1
0 for Age < 18
1 otherwise
-
Binarization
        Gender  Age  0  1
User A  boy     17   0  0
User B  boy     27   0  1
User C  girl    32   1  1
What about using BINNING?
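A short sketch of binarization for the Gender and Age columns, followed by binning as an alternative for Age. The threshold of 18 matches the slide's values; the bin edges (18, 30) are an assumption chosen only to show the idea.

```python
users = {"User A": ("boy", 17), "User B": ("boy", 27), "User C": ("girl", 32)}


def binarize_gender(gender):
    """0 for boy, 1 for girl, as on the slide."""
    return 0 if gender == "boy" else 1


def binarize_age(age, threshold=18):
    """0 below the threshold, 1 otherwise."""
    return 0 if age < threshold else 1


def bin_age(age, edges=(18, 30)):
    """Index of the bin the age falls into; the edges are assumed cut points."""
    return sum(age >= e for e in edges)


binary = {u: (binarize_gender(g), binarize_age(a)) for u, (g, a) in users.items()}
binned = {u: bin_age(a) for u, (_, a) in users.items()}
```

Binning keeps more information than a single threshold: here User B and User C fall into different age groups instead of both mapping to 1.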
-
Categorical Features
        Artist
User A  Mayday
User B  SEKAI_NO_OWARE
User C  The_Beatles
One-hot Encoding
        2 (mayday)  3 (sekai_no_oware)  4 (the_beatles)
User A  1           0                   0
User B  0           1                   0
User C  0           0                   1
-
Categorical Features
        Artist
User A  Mayday
User B  SEKAI_NO_OWARE
User C  The_Beatles
Grouping by language
        5 (CHN)  6 (JPN)  7 (ENG)
User A  1        0        0
User B  0        1        0
User C  0        0        1
-
Categorical Features
        Artist
User A  Mayday
User B  SEKAI_NO_OWARE
User C  The_Beatles
Grouping by language
        2 (mayday)  3 (sekai_no_oware)  4 (the_beatles)  5 (CHN)  6 (JPN)  7 (ENG)
User A  1           0                   0                1        0        0
User B  0           1                   0                0        1        0
User C  0           0                   1                0        0        1
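The artist columns and the language-group columns can be produced by the same one-hot helper. The `language` lookup table is an assumption that mirrors the CHN / JPN / ENG grouping on the slide; the helper name `one_hot` is hypothetical.

```python
artists = {
    "User A": "Mayday",
    "User B": "SEKAI_NO_OWARE",
    "User C": "The_Beatles",
}

# Assumed artist-to-language lookup, following the grouping on the slide.
language = {"Mayday": "CHN", "SEKAI_NO_OWARE": "JPN", "The_Beatles": "ENG"}

artist_vocab = ["Mayday", "SEKAI_NO_OWARE", "The_Beatles"]
language_vocab = ["CHN", "JPN", "ENG"]


def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of value."""
    return [1 if value == v else 0 for v in vocabulary]


# Concatenate the artist one-hot block and the language one-hot block.
encoded = {
    user: one_hot(artist, artist_vocab) + one_hot(language[artist], language_vocab)
    for user, artist in artists.items()
}
```

The grouped columns become useful once many artists share a language, since the model can then generalize across artists it has rarely seen.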
-
Numerical Features
[Figure: user U linked to T1, T2, T3 with weights 23, 1, 6]
             T1     T2    T3
count        23     1     6
binary       1      0     1
probability  23/30  1/30  6/30
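The three representations of the same counts can be computed as below. The binary row on the slide is 1, 0, 1, so this sketch assumes a threshold of "count greater than 1"; the slides do not state the rule.

```python
counts = {"T1": 23, "T2": 1, "T3": 6}

# count: use the raw values directly.
total = sum(counts.values())

# binary: 1 if the count exceeds a threshold (assumed to be 1 here).
THRESHOLD = 1
binary = {t: int(c > THRESHOLD) for t, c in counts.items()}

# probability: each count divided by the total.
probability = {t: c / total for t, c in counts.items()}
```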
-
Numerical Features
Standardization / Normalization (required by many ML algorithms)
Rescaling
https://en.wikipedia.org/wiki/Feature_scaling
Transform the Distribution
- logarithmic transformation
- tf-idf like transformation
https://en.wikipedia.org/wiki/Data_transformation_(statistics)
Binning / Sampling
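Standardization, rescaling and a logarithmic transformation can each be written in a line or two. The input values are illustrative, and `log1p` (log of 1 + x) is one common choice that also handles zero counts.

```python
import math
from statistics import mean, pstdev

values = [23, 1, 6]

# Standardization: subtract the mean, divide by the standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Rescaling (min-max): map the values into the range [0, 1].
lo, hi = min(values), max(values)
rescaled = [(v - lo) / (hi - lo) for v in values]

# Logarithmic transformation: compress large values.
logged = [math.log1p(v) for v in values]
```

The statistics (mean, standard deviation, min, max) should be computed on the training split only and then reused on validation and test data.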
-
Categorical vs. Numerical
Ordinal Categories
HATE  DON'T MIND  LIKE  LOVE
0     1           2     3
[Figure: plot over HATE, DON'T MIND, LIKE, LOVE, labeled exp(value)]
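A small sketch of treating ordinal categories as numbers, with an optional exp transform to widen the gap between higher ranks. The function names are hypothetical.

```python
import math

# Categories in increasing order of preference, as on the slide.
ORDER = ["HATE", "DON'T MIND", "LIKE", "LOVE"]
rank = {category: i for i, category in enumerate(ORDER)}


def encode_ordinal(category):
    """Map the category to its rank: 0, 1, 2 or 3."""
    return rank[category]


def encode_exp(category):
    """Apply exp to the rank so that higher ranks are spaced further apart."""
    return math.exp(rank[category])
```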
-
RoadMap (2)
Advanced Feature Engineering
Feature Extraction
- Feature Interactions
- Data Mining
- Dimensionality Reduction
- Domain-specific Process
-
Example (1)
Text-based
- Vector Space Model
- Word Embeddings
https://en.wikipedia.org/wiki/Vector_space_model
[Figure: word vectors for MAN, WOMAN, KING, QUEEN]
need stemming? lemmatization?
-
Example (2)
Text
......
segmentation
[] [] [] [] [] [] [] [] [] [] []
filtering
:1 :1 :1 :2 :1 :1 :2 :1
dummy variables
Word Embeddings?
Advanced Weighting?
:2 :1 :1 :4 :2 :1 :1 :0.8
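The segmentation, filtering and counting steps can be sketched as a bag-of-words pipeline. This example uses an English sentence with whitespace segmentation and a made-up stopword list; other languages would need a proper word segmenter.

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "of"}  # assumed filter list


def segment(text):
    """Split the text into lowercase tokens on whitespace."""
    return text.lower().split()


def bag_of_words(text):
    """Filter out stopwords and count the remaining tokens."""
    tokens = [t for t in segment(text) if t not in STOPWORDS]
    return Counter(tokens)


doc = "the cat is a friend of the other cat"
bow = bag_of_words(doc)

# Turn the counts into a fixed-order vector (one dummy variable per word).
vocab = sorted(bow)
vector = [bow[w] for w in vocab]
```

More advanced weighting, such as tf-idf, would replace the raw counts in `vector` with weighted values.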
-
Example (3)
Image-based
- SIFT
- Convolutional NN
https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
https://en.wikipedia.org/wiki/Convolutional_neural_network
-
Example (4)
1 = boy*w1 + age*w2 + cheer*w3 + pop*w4
+ (boy AND pop)*w5
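The extra term (boy AND pop) is a feature interaction: a new feature that is 1 only when both inputs are 1. A minimal sketch, with a hypothetical helper name and made-up records:

```python
def add_interaction(features, a, b):
    """Return a copy of features with the product of a and b added."""
    out = dict(features)
    out[f"{a}_AND_{b}"] = features.get(a, 0) * features.get(b, 0)
    return out


x = {"boy": 1, "age": 22, "cheer": 1, "pop": 1}
x_with_interaction = add_interaction(x, "boy", "pop")

y = {"boy": 0, "age": 30, "cheer": 0, "pop": 1}
y_with_interaction = add_interaction(y, "boy", "pop")
```

A linear model cannot learn "boys who listen to pop" from the separate boy and pop weights alone, which is why the interaction is added as its own feature with its own weight.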
-
Realize the Meaning Behind the Observed Features
2017/05/20 08:00
- Holiday? Weekday?
- Day? Night?
Taipei
- Asia
- Mandarin
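Deriving such features from a timestamp and a city can be sketched as follows. The daytime window (06:00 to 18:00) and the `CITY_INFO` lookup table are assumptions; detecting public holidays would need a calendar that is not included here, so only the weekend is checked.

```python
from datetime import datetime


def time_features(ts):
    """Derive coarse features from a timestamp."""
    return {
        "is_weekend": int(ts.weekday() >= 5),   # Saturday=5, Sunday=6
        "is_daytime": int(6 <= ts.hour < 18),   # assumed daytime window
        "hour": ts.hour,
    }


# Hypothetical lookup table from city to broader attributes.
CITY_INFO = {"Taipei": {"region": "Asia", "language": "Mandarin"}}

ts = datetime(2017, 5, 20, 8, 0)
features = {**time_features(ts), **CITY_INFO["Taipei"]}
```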
-
RoadMap (3)
Popular ML Models
Linear Models
Tree-based Models
Support Vector Machines
K-means
-
Understand the Pros and Cons
Linear Model
- simple, fast and easy to tune
- low memory usage
- non-complex: limited to linear relationships
[Figure: Linear vs. Non-Linear]
-
Understand the Pros and Cons
Random Forest
- work very well in many competitions
- fast and easy to tune
- memory hungry
[Figure: decision tree. Root: Language. CHN branch: Genre=Pop? (YES / NO), leaves Age=12, Age=19. NON-CHN branch: Western? (YES / NO), leaves Age=29, Age=23]
-
Understand the Pros and Cons
Neural Networks
- easy end-to-end learning
- flexible
- hard to tune/train
https://en.wikipedia.org/wiki/Artificial_neural_network
[Figure: network with inputs A BOY?, CHN?, POP? and outputs LIKE, DISLIKE]
-
Understand the Pros and Cons
SVM
- strong theoretical guarantees
- good at preventing overfitting
- slow and memory heavy
- usually needs a grid search over hyperparameters
https://en.wikipedia.org/wiki/Support_vector_machine
-
Understand the Pros and Cons
Gradient Boosting Machine (GBM)
- usually unbeatable on dense feature sets
Factorization Machine (FM)
- the master at dealing with sparse data
-
Understand the Pros and Cons
There are too many details
Find some online courses or ML books
- The Elements of Statistical Learning
- Machine Learning: A Probabilistic Perspective
- Programming Collective Intelligence
- Pattern Recognition and Machine Learning (Information Science and Statistics)
-
Understand the Pros and Cons
I'll tell you everything.
-
HOMEWORK PROJECT
Find a Dataset or Join a Competition
Apply the Techniques Presented in this Course
Raw Data
Data Cleaning
Train/Test Splitting
Exploratory Data Analysis (EDA)
Feature Engineering
Applying Estimation Models
Evaluation
Prediction
-
What to HAND IN
A Paper Report
Any toolkit is welcome.
Select and use one or multiple topics you learned from the course.
Show the performance difference between different methods.
-
ANY QUESTIONS?
changecandy at gmail