cm nccu class2

NCCU Chih-Ming

Post on 22-Jan-2018

Category: Data & Analytics

TRANSCRIPT

  • NCCU Chih-Ming

  • Kaggle

    https://www.facebook.com/groups/kaggletw/

    https://www.kaggle.com/

  • Why Compete?

    For Fun: Competing with others like running or racing

    For Learning: Improving your abilities

    What's Your Motivation?

  • Related Websites

    http://dc.dsp.im/index.php

    https://tianchi.aliyun.com/

  • An Overview of Making a Prediction

    Raw Data

    Data Cleaning

    Train/Test Splitting

    Exploratory Data Analysis (EDA)

    Feature Engineering

    Applying Estimation Models

    Evaluation

    Prediction

  • An Overview of Making a Prediction

    Raw Data

    Data Cleaning

    Preprocessing Tasks

    - Data Cleaning

    - Data Transformation

    - Data Reduction

  • Recap

    Missing Values

    - drop the missing data

    - replace them by certain statistical values (mean / median / mode / clustering / modeling methods)

    - label them as the missing value

    Outlier Detection - https://en.wikipedia.org/wiki/Outlier

    Redundant Features - we usually remove them
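
    The three ways of handling missing values listed above can be sketched as follows. This is only a minimal illustration: the toy `ages` column and the choice of the median as the replacement statistic are assumptions, not part of the slides.

```python
from statistics import median

# Hypothetical toy column: None marks a missing age.
ages = [19, 27, None, 35]

# Option 1: drop the missing entries.
dropped = [a for a in ages if a is not None]

# Option 2: replace missing entries with a statistic of the observed values
# (median here; mean or mode would work the same way).
fill_value = median(dropped)
imputed = [fill_value if a is None else a for a in ages]

# Option 3: keep the row and add an explicit "is missing" label.
missing_flag = [1 if a is None else 0 for a in ages]

print(dropped)
print(imputed)
print(missing_flag)
```

    Options 2 and 3 are often combined, so the model can still see which values were filled in.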

  • Data Cleaning / Preprocessing

             Age
    User A   19
    User B   27
    User C   200

    drop: remove User C, whose age of 200 is an outlier

             Age   Std.
    User A   19    1.2
    User B   27    1.1
    User C   200   6.3

    add label:

             Age   Out.
    User A   19    1
    User B   27    1
    User C   200   0

    replace (mean / median / mode / clustering / modeling methods):

             Age
    User A   19
    User B   27
    User C   36
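
    A minimal sketch of the drop / add label / replace options for the age example. The valid range of 0 to 120 used to detect the outlier and the use of the median as replacement are assumptions made for illustration; the slides do not say which rule or statistic was used.

```python
from statistics import median

ages = {"User A": 19, "User B": 27, "User C": 200}

def is_outlier(age, low=0, high=120):
    # Assumption: a valid age lies in [low, high]; this rule is not from the slides.
    return not (low <= age <= high)

# drop: remove the rows flagged as outliers.
dropped = {u: a for u, a in ages.items() if not is_outlier(a)}

# add label: keep the value and add an indicator column
# (1 marks an outlier here; the opposite convention is equally valid).
outlier_label = {u: int(is_outlier(a)) for u, a in ages.items()}

# replace: substitute a statistic computed from the valid values only.
fill = median(dropped.values())
replaced = {u: (fill if is_outlier(a) else a) for u, a in ages.items()}

print(dropped)
print(outlier_label)
print(replaced)
```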

  • An Overview of Making a Prediction

    Raw Data

    Data Cleaning

    Train/Test Splitting

    - Random Splitting

    - Split by Time

    - Split by Id

    - Cross-Validation

  • Hold A Proper Validation

    Random Splitting

    Split by Time

    Split by Id

    [Timeline: Train covers 5/13 to 5/20 (7 days); Validation or Test covers 5/20 to 5/27 (7 days)]
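
    The three splitting strategies can be sketched in a few lines. The `records` list, the 30% validation ratio and the held-out id are hypothetical; only the 5/20 cutoff date comes from the slide.

```python
import random
from datetime import date

# Hypothetical log records: (user_id, event_date).
records = [
    ("u1", date(2017, 5, 13)), ("u2", date(2017, 5, 15)),
    ("u3", date(2017, 5, 18)), ("u1", date(2017, 5, 21)),
    ("u2", date(2017, 5, 24)), ("u4", date(2017, 5, 26)),
]

def random_split(rows, valid_ratio=0.3, seed=0):
    """Shuffle a copy of the rows and cut off the tail for validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - valid_ratio))
    return rows[:cut], rows[cut:]

def split_by_time(rows, cutoff):
    """Train on events before the cutoff, validate on the rest."""
    train = [r for r in rows if r[1] < cutoff]
    valid = [r for r in rows if r[1] >= cutoff]
    return train, valid

def split_by_id(rows, valid_ids):
    """Keep every record of a given id on the same side of the split."""
    train = [r for r in rows if r[0] not in valid_ids]
    valid = [r for r in rows if r[0] in valid_ids]
    return train, valid

rand_train, rand_valid = random_split(records)
time_train, time_valid = split_by_time(records, date(2017, 5, 20))
id_train, id_valid = split_by_id(records, {"u2"})
```

    Splitting by time or by id is usually chosen when a random split would leak information, for example when the test set lies in the future or contains unseen users.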

  • An Overview of Making a Prediction

    Raw Data

    Data Cleaning

    Train/Test Splitting

    Exploratory Data Analysis (EDA)

    Feature Engineering

    Applying Estimation Models

    Evaluation

    Prediction

  • RoadMap (1)

    Feature Engineering

    Feature Encoding

    - Binary Features

    - Numeric Features

    - Categorical Features

  • Feature Engineering

    Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. (https://en.wikipedia.org/wiki/Feature_engineering)

    like/dislike?

    answer: like

    gender: boy, age: 22

    artist: Cheer, genre: pop

  • Feature Encoding

    Convert the extracted features to be readable by applied machine learning models.

    1 = boy*w1 + age*w2 + cheer*w3 + pop*w4

    0 = girl*w1 + age*w2 + may_day*w3 + indie*w4

    1 = boy*w1 + age*w2 + cheer*w3 + pop*w4 ??????

    Feature indices: boy = 0, age = 2, cheer = 13, pop = 26

    0 AND 2 AND 13 AND 26 = 1

    - Binary Features

    - Numeric Features

    - Categorical Features
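
    One way to read the index example above is as a sparse encoding, where each raw feature is mapped to a column index before the weights are applied. The sketch below assumes that reading; the `feature_index` table reuses the indices 0, 2, 13 and 26 from the slide, while the key names and the weights are made up for illustration.

```python
# Hypothetical feature-to-index table; the indices follow the slide (0, 2, 13, 26).
feature_index = {"gender=boy": 0, "age": 2, "artist=cheer": 13, "genre=pop": 26}

def encode(record):
    """Turn a raw record into {index: value} pairs a linear model can use."""
    encoded = {feature_index["age"]: float(record["age"])}
    for field in ("gender", "artist", "genre"):
        key = f"{field}={record[field]}"
        if key in feature_index:  # categories without an index are skipped
            encoded[feature_index[key]] = 1.0
    return encoded

def score(encoded, weights):
    """Weighted sum: value * w for every active feature index."""
    return sum(value * weights.get(idx, 0.0) for idx, value in encoded.items())

x = encode({"gender": "boy", "age": 22, "artist": "cheer", "genre": "pop"})
w = {0: 0.5, 2: 0.01, 13: 0.2, 26: 0.1}  # made-up weights for illustration
print(sorted(x.items()))
print(score(x, w))
```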

  • Binarization

    Gender: 0 for boy, 1 for girl

    Age: 0 for < 18, 1 otherwise

             Gender   Age   0   1
    User A   boy      17    0   0
    User B   boy      27    0   1
    User C   girl     32    1   1

    What about using BINNING?
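
    A small sketch of binarization and of binning as the alternative the slide asks about. The threshold of 18 follows the slide; treating it as a strict "below 18" cut and the bin edges (18, 30) are assumptions.

```python
users = {
    "User A": {"gender": "boy", "age": 17},
    "User B": {"gender": "boy", "age": 27},
    "User C": {"gender": "girl", "age": 32},
}

def binarize_gender(gender):
    # 0 for boy, 1 for girl, as on the slide.
    return 0 if gender == "boy" else 1

def binarize_age(age, threshold=18):
    # Assumption: ages below the threshold map to 0, all others to 1.
    return 0 if age < threshold else 1

def bin_age(age, edges=(18, 30)):
    # Binning: index of the first bin whose upper edge exceeds the age.
    # The edges are hypothetical.
    for i, edge in enumerate(edges):
        if age < edge:
            return i
    return len(edges)

encoded = {u: (binarize_gender(r["gender"]), binarize_age(r["age"]))
           for u, r in users.items()}
binned = {u: bin_age(r["age"]) for u, r in users.items()}
print(encoded)
print(binned)
```

    Binning keeps more information than a single threshold: the three users end up in three different age groups instead of two.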

  • Categorical Features

             Artist
    User A   Mayday
    User B   SEKAI_NO_OWARE
    User C   The_Beatles

    One-hot Encoding (features 2 to 4) and Grouping by language (features 5 to 7)

             2        3                4             5     6     7
             mayday   sekai_no_oware   the_beatles   CHN   JPN   ENG
    User A   1        0                0             1     0     0
    User B   0        1                0             0     1     0
    User C   0        0                1             0     0     1
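
    The one-hot and grouping columns can be produced with a short helper. This is a sketch: the `language_of` lookup table is an assumption standing in for whatever artist metadata is available.

```python
artists = {"User A": "mayday", "User B": "sekai_no_oware", "User C": "the_beatles"}

# Hypothetical lookup table: the language group of each artist.
language_of = {"mayday": "CHN", "sekai_no_oware": "JPN", "the_beatles": "ENG"}

def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == v else 0 for v in vocabulary]

artist_vocab = sorted(set(artists.values()))
language_vocab = ["CHN", "JPN", "ENG"]

rows = {}
for user, artist in artists.items():
    # Artist one-hot columns first, then the language-group columns.
    rows[user] = one_hot(artist, artist_vocab) + one_hot(language_of[artist], language_vocab)

for user, row in rows.items():
    print(user, row)
```

    Grouping adds a coarser feature that is shared by many artists, which helps when individual artists appear too rarely to learn a weight for each.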

  • Numerical Features

    [Figure: U linked to T1, T2 and T3 with weights 23, 1 and 6]

                  T1      T2     T3
    count         23      1      6
    binary        1       0      1
    probability   23/30   1/30   6/30
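
    The three representations of the same counts can be derived as below. The rule "a count above 1 becomes 1" is an assumption chosen because it reproduces the 1 0 1 row; the slides do not state the threshold.

```python
counts = {"T1": 23, "T2": 1, "T3": 6}

# Assumption: a count must exceed 1 to become a 1 (reproduces the slide's 1 0 1).
binary = {t: int(c > 1) for t, c in counts.items()}

# Probability: each count divided by the total (23 + 1 + 6 = 30).
total = sum(counts.values())
probability = {t: c / total for t, c in counts.items()}

print(binary)
print(probability)
```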

  • Numerical Features

    Standardization / Normalization (required by many ML algorithms)

    Rescaling

    https://en.wikipedia.org/wiki/Feature_scaling

    Transform the Distribution

    - logarithmic transformation

    - tf-idf like transformation

    https://en.wikipedia.org/wiki/Data_transformation_(statistics)

    Binning / Sampling
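
    A minimal sketch of standardization, min-max rescaling and a logarithmic transformation, using only the standard library. The input values are illustrative, and the helpers assume the values are not all identical (otherwise the divisions fail).

```python
import math
from statistics import mean, pstdev

values = [23.0, 1.0, 6.0]

def standardize(xs):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def log_transform(xs):
    """Compress large values with log(1 + x); requires x > -1."""
    return [math.log1p(x) for x in xs]

print(standardize(values))
print(min_max(values))
print(log_transform(values))
```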

  • Categorical vs. Numerical

    Ordinal Categories

    HATE = 0, DON'T MIND = 1, LIKE = 2, LOVE = 3

    [Chart: exp(value) for HATE, DON'T MIND, LIKE, LOVE; y-axis from 0 to 8]
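
    A short sketch of encoding ordinal categories as ranks and then applying exp(value), which widens the gap between the higher ranks. Whether this spacing suits a given dataset is a modeling choice, not something the slides prescribe.

```python
import math

ORDER = ["HATE", "DON'T MIND", "LIKE", "LOVE"]

# Ordinal encoding: the position in the ordered list becomes the value.
ordinal = {label: rank for rank, label in enumerate(ORDER)}

# exp(value): a non-linear rescaling that keeps the order but stretches the top.
stretched = {label: math.exp(rank) for label, rank in ordinal.items()}

print(ordinal)
print(stretched)
```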

  • RoadMap (2)

    Advanced Feature Engineering

    Feature Extraction

    - Feature Interactions

    - Data Mining

    - Dimensionality Reduction

    - Domain-specific Processing

  • Example (1)

    Text-based

    - Vector Space Model

    - Word Embeddings

    https://en.wikipedia.org/wiki/Vector_space_model

    [Figure: word vectors for MAN, WOMAN, KING, QUEEN]

    need stemming? lemmatization?

  • Example (2)

    Text

    ......

    segmentation

    [] [] [] [] [] [] [] [] [] [] []

    dummy variables

    :1 :1 :1 :2 :1 :1 :2 :1

    filtering

    Advanced Weighting?

    :2 :1 :1 :4 :2 :1 :1 :0.8

    Word Embeddings?
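
    The text pipeline (segmentation, filtering, term counts, then a weighting step) can be sketched as below. Several parts are assumptions: the English sample documents, whitespace tokenization in place of a real word segmenter, the stopword list, and tf-idf as one possible answer to the "Advanced Weighting?" question.

```python
import math
from collections import Counter

# Hypothetical documents; the original slide text is not available.
docs = ["i like pop music", "i like indie music music", "the beatles"]

def segment(text):
    # Assumption: whitespace tokenization. Chinese text needs a word segmenter.
    return text.lower().split()

def term_counts(text, stopwords=frozenset({"i", "the"})):
    """Segment, filter out stopwords, and count the remaining terms."""
    return Counter(tok for tok in segment(text) if tok not in stopwords)

def tf_idf(all_counts):
    """Weight each count by log(N / document frequency) of its term."""
    n = len(all_counts)
    df = Counter(term for counts in all_counts for term in counts)
    return [{t: c * math.log(n / df[t]) for t, c in counts.items()}
            for counts in all_counts]

counts = [term_counts(d) for d in docs]
weights = tf_idf(counts)
print(counts[1])
print(weights[1])
```

    Terms that occur in fewer documents ("indie") end up with a larger weight than terms shared across documents ("like"), even at the same raw count.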

  • Example (3)

    Image-based

    - SIFT

    - Convolutional NN

    https://en.wikipedia.org/wiki/Scale-invariant_feature_transform

    https://en.wikipedia.org/wiki/Convolutional_neural_network

  • Example (4)

    1 = boy*w1 + age*w2 + cheer*w3 + pop*w4 + (boy AND pop)*w5
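
    A minimal sketch of adding the (boy AND pop) interaction term to a linear score. The feature values follow the running example; the weights and the `boy_and_pop` name are made up for illustration.

```python
def with_interaction(features):
    """Return a copy with (boy AND pop) appended as an extra binary feature."""
    out = dict(features)
    out["boy_and_pop"] = features["boy"] * features["pop"]
    return out

def predict(features, weights):
    """Linear score: sum of weight * value over all features."""
    return sum(weights[name] * value for name, value in features.items())

# Made-up weights w1..w5 for illustration.
w = {"boy": 0.1, "age": 0.01, "cheer": 0.2, "pop": 0.1, "boy_and_pop": 0.3}

x = with_interaction({"boy": 1, "age": 22, "cheer": 1, "pop": 1})
print(x["boy_and_pop"], predict(x, w))
```

    A linear model cannot learn that a combination of two features matters unless the combination is given to it as its own feature.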

  • Realize the Meaning Behind the Observed Features

    2017/05/20 08:00 -> Holiday? Weekday? Day? Night?

    Taipei -> Asia, Mandarin
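
    Deriving such features from a timestamp and a city can be sketched as follows. The day/night boundary (06:00 to 18:00) and the `CITY_INFO` lookup table are assumptions; detecting public holidays would additionally need a calendar, which is left out here.

```python
from datetime import datetime

def time_features(stamp):
    """Derive weekday, weekend and day/night flags from 'YYYY/MM/DD HH:MM'."""
    t = datetime.strptime(stamp, "%Y/%m/%d %H:%M")
    return {
        "weekday": t.weekday(),               # Monday = 0 ... Sunday = 6
        "is_weekend": int(t.weekday() >= 5),
        "is_daytime": int(6 <= t.hour < 18),  # assumed day/night boundary
    }

# Hypothetical lookup table for location-derived features.
CITY_INFO = {"Taipei": {"region": "Asia", "language": "Mandarin"}}

features = {**time_features("2017/05/20 08:00"), **CITY_INFO["Taipei"]}
print(features)
```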

  • RoadMap (3)

    Popular ML Models

    Linear Models

    Tree-based Models

    Support Vector Machines

    K-means

  • Understand the Pros and Cons

    Linear Model

    - simple, fast and easy to tune

    - occupy low memory

    - non-complex

    [Figure: Linear vs. Non-Linear]

  • Understand the Pros and Cons

    Random Forest

    - work very well in many competitions

    - fast and easy to tune

    - memory hungry

    [Example tree]

    Language

    - CHN: Genre=Pop? YES: Age=12, NO: Age=19

    - NON-CHN: Western? YES: Age=29, NO: Age=23
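
    The example tree can be written as nested conditions, and a forest then averages the predictions of several trees. This is a sketch only: the mapping of YES/NO to the leaf ages follows the left-to-right order on the slide, `second_tree` is hypothetical, and a real random forest learns its trees from bootstrap samples and random feature subsets instead of hand-written rules.

```python
from statistics import mean

def tree_from_slide(x):
    """Split on language first, then on genre (CHN) or origin (NON-CHN)."""
    if x["language"] == "CHN":
        return 12 if x["genre"] == "pop" else 19
    return 29 if x["western"] else 23

def second_tree(x):
    # Hypothetical second tree, added only to show the averaging step.
    return 15 if x["genre"] == "pop" else 25

def forest_predict(trees, x):
    """Average the predictions of all trees (regression forest)."""
    return mean(tree(x) for tree in trees)

sample = {"language": "CHN", "genre": "pop", "western": False}
print(tree_from_slide(sample))
print(forest_predict([tree_from_slide, second_tree], sample))
```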

  • Understand the Pros and Cons

    Neural Networks

    - easy end2end learning

    - flexible

    - hard to tune/train

    https://en.wikipedia.org/wiki/Artificial_neural_network

    [Figure: network with inputs A BOY?, CHN?, POP? and outputs LIKE, DISLIKE]

  • Understand the Pros and Cons

    SVM

    - strong theoretical guarantees

    - good at preventing overfitting

    - slow and memory heavy

    - usually needs grid search on hyperparameters

    https://en.wikipedia.org/wiki/Support_vector_machine

  • Understand the Pros and Cons

    Gradient Boosting Machine (GBM) - usually unbeatable on dense feature sets

    Factorization Machine (FM) - the master at dealing with sparse data

  • Understand the Pros and Cons

    There are too many details

    Find some online courses or ML books:

    - The Elements of Statistical Learning

    - Machine Learning: A Probabilistic Perspective

    - Programming Collective Intelligence

    - Pattern Recognition and Machine Learning (Information Science and Statistics)

  • Understand the Pros and Cons

    I'll tell you everything.

  • HOMEWORK PROJECT

    Find a Dataset or Join a Competition

    Apply the Techniques Presented in this Course

    Raw Data

    Data Cleaning

    Train/Test Splitting

    Exploratory Data Analysis (EDA)

    Feature Engineering

    Applying Estimation Models

    Evaluation

    Prediction

  • What to HAND IN

    A Paper Report

    Any toolkit is welcome.

    Select and use one or multiple topics you learned from the course.

    Show the performance difference between the methods you used.

  • ANY QUESTIONS?

    changecandy at gmail