Advanced Network Database Lab
Kaggle Competition
Prudential Life Insurance Assessment
Can you make buying life insurance easier?
2
Registration
• Site: https://www.kaggle.com/competitions
• Account: IKDD1(Group Number)
3
Prudential
• Prudential Financial, Inc.
• An American Fortune Global 500 and Fortune 500 company
• https://www.prulife.com.tw/page/index.htm
• $30,000
4
Prudential
• Competition url: https://www.kaggle.com/c/prudential-life-insurance-assessment
• Data url: https://www.kaggle.com/c/prudential-life-insurance-assessment/data
• Leaderboard: https://www.kaggle.com/c/prudential-life-insurance-assessment/leaderboard
5
Data Attribute
6
Data Attribute
• Nominal type
• Numbers may be used to represent the categories, but those numbers carry no numerical value, order, or relationship.
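Because nominal codes have no numerical meaning, a common way to feed them to a classifier is one-hot encoding. The following is a minimal sketch, assuming a hypothetical column of product codes; the function name `one_hot_encode` and the sample values are illustrative and not taken from the competition data.

```python
def one_hot_encode(values):
    """Turn nominal codes into 0/1 indicator columns, one per category."""
    # Sorting only fixes the column order; it does not imply any ranking.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical nominal codes: 3 is not "more" than 1, it is just a label.
product_codes = [3, 1, 3, 2]
categories, encoded = one_hot_encode(product_codes)

for code, row in zip(product_codes, encoded):
    print(code, row)
```

Each row has exactly one 1, so the classifier sees category membership rather than a misleading magnitude.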
7
Classification
8
Prediction
9
Decision Tree
10
Sklearn – Python tool
• Simple and efficient tools for data mining and data analysis!
• Decision tree url : http://scikit-learn.org/stable/modules/tree.html
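A minimal sketch of training a decision tree with scikit-learn, assuming a tiny made-up data set. The two features (an age and a 0/1 flag) and the class labels are hypothetical stand-ins, not the real competition columns.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: each row is [age, flag]; labels are risk classes.
X_train = [[25, 0], [30, 1], [45, 0], [50, 1], [60, 1], [65, 0]]
y_train = [8, 8, 5, 5, 2, 2]

# random_state makes tie-breaking between equally good splits repeatable.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Predict the class of two unseen rows.
predicted = clf.predict([[28, 0], [62, 1]])
print(predicted)
```

On real data it is usually worth limiting `max_depth` or `min_samples_leaf`, since an unconstrained tree tends to memorize the training set.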
11
Homework 1
• Registration
• Apply a simple algorithm to build the classifier
• To predict the "Response" variable for each Id in the test set
• Submit the result to Kaggle
• Deadline: next Thursday (12/31)
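The submission step can be sketched as below. This assumes the submission file is a CSV with an "Id" column and a "Response" column, as the task wording suggests; check the sample submission on the competition data page for the exact layout. The helper name `write_submission` and the sample values are hypothetical.

```python
import csv
import io


def write_submission(ids, predictions, fileobj):
    """Write one 'Id,Response' row per test record to an open text file."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["Id", "Response"])  # assumed header names
    for row_id, response in zip(ids, predictions):
        writer.writerow([row_id, int(response)])


# Demo with an in-memory buffer and made-up predictions.
buffer = io.StringIO()
write_submission([1, 2, 3], [8, 5, 2], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To produce a real file, pass `open("submission.csv", "w", newline="")` instead of the in-memory buffer.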
12
Homework 2
• Improve your prediction results
• Oral report
• Deadline: next Thursday (1/7)
13
Homework 3 (Final project)
• Try different algorithms to build the best classifier
• Submit the result to Kaggle
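One way to compare algorithms before submitting is cross-validation on the training set. The sketch below assumes synthetic stand-in data from `make_classification` and three arbitrarily chosen scikit-learn models; it scores by plain accuracy, which may differ from the metric used on the competition leaderboard.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, cv=5):
    """Return the mean cross-validated accuracy of each candidate model."""
    candidates = {
        "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    return {name: float(cross_val_score(model, X, y, cv=cv).mean())
            for name, model in candidates.items()}


# Synthetic stand-in data; replace with the real training features and labels.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)
scores = compare_classifiers(X, y)
best = max(scores, key=scores.get)
print(scores, best)
```

Keeping the comparison on held-out folds gives a fairer estimate than training accuracy and avoids spending leaderboard submissions on weak models.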
14
Final project
• Deadline: 1/14 23:59
• Submission:
  • Submit the results to Kaggle
  • Email your project to [email protected]
  • Project file content:
    • code
    • prediction result
    • report
15
Report
• The details of your best method
• A description of the other methods you tried
• The important attributes or surprising features you found
16
Grading
• Homework 1: 20%
• Homework 2: 10%
• Final Project: 70%
  • The ranking: 20%
  • Algorithm and coding: 25%
  • Report: 25%
XGBoost
• General purpose gradient boosting library, including generalized linear model and gradient boosted decision tree
• SITE: http://dmlc.ml/
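The idea behind gradient boosted decision trees can be sketched in a few lines. This is not XGBoost's implementation or API: it is a simplified illustration for squared-error regression, built on scikit-learn's `DecisionTreeRegressor`, in which each new tree is fitted to the residuals left by the trees before it. The function names, data, and parameter values are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=20, learning_rate=0.3, max_depth=2):
    """Fit small trees one after another, each on the current residuals."""
    base = float(np.mean(y))            # start from a constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred             # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, learning_rate, trees


def predict_boosted(model, X):
    """Sum the constant start value and every tree's scaled correction."""
    base, learning_rate, trees = model
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


# Synthetic data for the demonstration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(0, 0.1, size=200)

model = fit_boosted_trees(X, y)
mse_base = float(np.mean((y - np.mean(y)) ** 2))
mse_boost = float(np.mean((y - predict_boosted(model, X)) ** 2))
print(mse_base, mse_boost)
```

XGBoost adds regularization, second-order gradient information, and many engineering optimizations on top of this basic loop.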
tslm
• A linear model with time series components
• SITE: http://www.inside-r.org/packages/cran/forecast/docs/tslm
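`tslm` is an R function, so the following is only a Python sketch of the underlying idea: an ordinary least-squares fit whose predictors are a linear trend and seasonal indicator columns. The helper name `trend_season_design`, the period of 4, and the coefficients are assumptions chosen for illustration.

```python
import numpy as np


def trend_season_design(n_obs, period):
    """Build columns: intercept, linear trend, and period-1 season dummies."""
    t = np.arange(1, n_obs + 1)
    season = (t - 1) % period
    columns = [np.ones(n_obs), t.astype(float)]
    # The first season is the baseline, so it gets no dummy column.
    for s in range(1, period):
        columns.append((season == s).astype(float))
    return np.column_stack(columns)


# Noise-free series generated from known coefficients (illustrative values):
# intercept, trend slope, then offsets for seasons 2, 3, and 4.
true_coef = np.array([2.0, 0.5, 1.0, -1.0, 3.0])
X = trend_season_design(24, 4)
y = X @ true_coef

# Least-squares fit; on noise-free data it recovers the coefficients.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 3))
```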
H2o.randomForest
• Random Forest (RF) is a powerful classification tool. Given a set of data, RF generates a forest of classification trees rather than a single classification tree. Each tree produces a classification for a given set of attributes. The classification from each H2O tree can be thought of as a vote; the class with the most votes becomes the final classification.
• SITE: http://docs.h2o.ai/h2oclassic/datascience/rf.html
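The voting scheme described above can be sketched as follows. This is not H2O's implementation: it is a simplified illustration that trains scikit-learn decision trees on bootstrap samples and takes a majority vote. The function names and the two-cluster toy data are assumptions made for the example.

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for i in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))
        # max_features="sqrt" lets each split consider a random feature subset.
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_by_vote(trees, X):
    """Each tree votes for a class; the most common vote wins per row."""
    votes = np.array([tree.predict(X) for tree in trees])  # (n_trees, n_rows)
    return np.array([Counter(column).most_common(1)[0][0]
                     for column in votes.T])


# Toy data: two well-separated clusters labeled 0 and 1.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

trees = fit_forest(X, y)
forest_prediction = predict_by_vote(trees, np.array([[0.0, 0.0], [5.0, 5.0]]))
print(forest_prediction)
```

Using an odd number of trees avoids tied votes in the two-class case.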