Student’s Alcohol Consumption Analysis

Group 9: Demin; Derrick; Gaurav; Jingya; Ramya; Si

Upload: demin-damian-wang

Post on 11-Apr-2017


Introduction

Some of the most important new data to emerge on young adult drinking were collected through a recent nationwide survey, the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). According to these data, about 70 percent of young adults, or about 19 million people, consumed alcohol in the year preceding the survey.

This is a short exploratory data analysis focusing on the alcohol variables in the Portuguese school dataset. Our main goal is to use data mining to predict school students’ alcohol consumption and to identify the significant factors.

Objective/problem statement

• Build models to predict school students’ drinking behavior during weekdays and weekends.

• Compare various models and choose the best.

• Find out which factors influence school students’ alcohol consumption, and make sensible recommendations.

Dataset

Data collected through a survey from two classes in two schools in Portugal

33 Variables

Personal, e.g. school, sex, age, address, health status, romantic experience, going out with friends, free time after school

Educational, e.g. study time, class failures, intention for higher education, extra-curricular activities, educational support, number of school absences, grades

Family, e.g. mother/father’s education, mother/father’s job, family size, quality of family relationship, parents’ cohabitation status

Alcohol Consumption, e.g. workday alcohol consumption, weekend alcohol consumption

Data Types

Binary

Ordinal

Nominal

Numeric

Data preparation

No missing data

Overlapping students: some students take both the Math and the Portuguese class

649 students in Portuguese class, 395 students in Math class

Merging data

Criterion: records from the two classes are treated as the same student when all of these attributes match

"school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"

382 students identified

Deciding attribute values for overlapping students

Keep max values

Keep “yes” for paid class

Resulting in 674 students in total
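The merge rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not the group's actual script: each student is assumed to be a dict with the same columns in both class files, the function names are hypothetical, and the matching attributes are assumed to be unique within one class.

```python
# Matching attributes listed on the slide.
KEYS = ["school", "sex", "age", "address", "famsize", "Pstatus", "Medu",
        "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"]


def student_key(row):
    """Build a hashable identity from the matching attributes."""
    return tuple(row[k] for k in KEYS)


def combine(a, b):
    """Resolve one overlapping student (hypothetical helper).

    Numeric attributes keep the larger value; the paid-class flag is
    "yes" if either record says "yes"; everything else is kept from `a`.
    """
    merged = dict(a)
    for col, val_b in b.items():
        if col not in a:
            merged[col] = val_b
            continue
        val_a = a[col]
        if col == "paid":
            merged[col] = "yes" if "yes" in (val_a, val_b) else "no"
        elif isinstance(val_a, (int, float)) and isinstance(val_b, (int, float)):
            merged[col] = max(val_a, val_b)
    return merged


def merge_classes(math_rows, por_rows):
    """Return (merged rows, number of overlapping students).

    Assumes the matching attributes identify at most one student per class.
    """
    by_key = {student_key(r): r for r in por_rows}
    overlap = 0
    for r in math_rows:
        k = student_key(r)
        if k in by_key:
            by_key[k] = combine(by_key[k], r)
            overlap += 1
        else:
            by_key[k] = r
    return list(by_key.values()), overlap
```

The same logic could be expressed as a dataframe join on the key columns followed by a per-column aggregation.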

Approaches

Two separate models are built: one for weekday alcohol consumption and one for weekend alcohol consumption.

Target variables: weekday alcohol consumption and weekend alcohol consumption

For weekday (more serious issue than weekend),

Level 1 - acceptable alcohol consumption

Levels 2-5 - unacceptable

For the weekend,

Levels 1 and 2 - acceptable alcohol consumption

Levels 3-5 - unacceptable
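The two cut-offs can be written as a small lookup. This is a minimal sketch; `Dalc` and `Walc` are the dataset's weekday and weekend consumption columns, while the function name and the label strings are chosen here for illustration.

```python
# Highest rating (on the 1-5 scale) still counted as acceptable, per target.
ACCEPTABLE_MAX = {"Dalc": 1, "Walc": 2}


def binarize(rating, target):
    """Map a 1-5 consumption rating to 'acceptable' or 'unacceptable'."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    return "acceptable" if rating <= ACCEPTABLE_MAX[target] else "unacceptable"
```

A rating of 2 is therefore unacceptable on a weekday but acceptable on a weekend.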

Techniques Used

Decision Tree

Poor performance ☹
• Overall error rate 38%
• Tried improving the model with a cost matrix (0,25,80,0) → 32% error in predicting unacceptable behavior
• But this increased the error rate of the acceptable class to 44%

REJECTED DECISION TREE
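To see why a cost matrix shifts errors from one class to the other, here is a rough sketch of a cost-weighted decision for a single student. The slide does not say how the four numbers (0,25,80,0) are laid out; the code assumes that 25 is the cost of flagging an acceptable student and 80 is the cost of missing an unacceptable one, which is consistent with the reported drop in unacceptable-class error. A tree-building tool may apply the losses differently (for example during splitting), so this only illustrates the idea.

```python
COST_FALSE_ALARM = 25   # assumed: acceptable student predicted as unacceptable
COST_MISS = 80          # assumed: unacceptable student predicted as acceptable


def decide(p_unacceptable):
    """Pick the label with the lower expected loss for one student."""
    loss_if_predict_acceptable = p_unacceptable * COST_MISS
    loss_if_predict_unacceptable = (1 - p_unacceptable) * COST_FALSE_ALARM
    if loss_if_predict_unacceptable < loss_if_predict_acceptable:
        return "unacceptable"
    return "acceptable"


# Probability above which "unacceptable" becomes the cheaper prediction.
THRESHOLD = COST_FALSE_ALARM / (COST_FALSE_ALARM + COST_MISS)
```

Under these assumed costs the threshold is 25/105, so a student is flagged well before the estimated probability reaches one half. That catches more unacceptable cases at the price of more false alarms in the acceptable class.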

Neural Network

Poor performance ☹
• Neural network worked best with 15 nodes
• But the error rate is quite high → 53% for the unacceptable class
• Also, the error rate for the acceptable class was 22%

REJECTED NEURAL NETWORK

Boosting

Poor performance ☹
• Overall error rate is 25%, which is quite low ☺
• However, 59% of the data is wrongly classified into the unacceptable class
• Area under ROC curve is 0.6782

REJECTED BOOSTING

Naïve Bayes

Poor performance ☹
• Overall error rate was 38.46%
• Couldn’t properly classify the unacceptable class
• Accuracy was also very low

REJECTED NAÏVE BAYES

Random Forest

Winner ☺

• Unacceptable class error rate was 29%

• And the unacceptable class is very important for the prediction of the model

ACCEPTED RANDOM FOREST

Weekday Alcohol Consumption

Input variables: All variables were used as inputs for the weekday alcohol consumption model except G1, G2 and weekend alcohol consumption.

Weekend alcohol consumption is excluded to avoid target leakage.

G1, G2 - Grades for the first and second period. We include only G3 (the final grade, which is strongly correlated with G1 and G2) and drop G1 and G2 to reduce redundancy among the input variables.

Target:

Weekday Alcohol consumption

We grouped the ordinal variable weekday alcohol consumption (ratings 1-5) into two classes:

Acceptable (Rating 1) and

Unacceptable (Ratings 2-5)

Weekday Alcohol Consumption

Random Forest Model:

Partitioning:

Training: Validation: Test - 70:15:15

Sample size set to 85,100 to downsample the acceptable class

No. of trees: 5200
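A per-class sample size means each tree is grown on a small, roughly balanced draw instead of the full, imbalanced training set. The sketch below shows that draw for one tree. The slide does not say which class gets 85 rows and which gets 100, nor whether sampling is with replacement; both are assumptions here, and the function name is hypothetical.

```python
import random


def stratified_sample(labels, sizes, rng):
    """Draw row indices for one tree.

    For every class c in `sizes`, pick sizes[c] indices of rows with that
    label, with replacement (an assumption).
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    sample = []
    for cls, n in sizes.items():
        sample.extend(rng.choices(by_class[cls], k=n))
    return sample


# Assumed assignment of the two numbers from the slide to the classes.
SIZES = {"unacceptable": 85, "acceptable": 100}
```

Repeating this draw once per tree gives every tree a different balanced subset, so the majority class no longer dominates the votes.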

Weekday Alcohol Consumption

Random Forest Model:

Overall error 35%

For the unacceptable class:
Precision: 52%
Recall: 70.5%

Weekday Alcohol Consumption

Important Factors:
● Sex being male
● Grades
● Mother’s education
● Going out
● Mother’s job
● Failures

Weekend Alcohol Consumption - Input & Balance

The best model is a balanced random forest:

Ignore the variables Dalc, G1 & G2

The target variable Walc: 1-2 “Low” & 3-5 “High”

High : Low = 262 : 412 = 38 : 62

Train : Validation : Test = 70 : 15 : 15
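For reference, a 70:15:15 partition can be produced by shuffling the row indices once and slicing. This is a minimal sketch; the seed, the rounding rule, and the function name are assumptions, and the slides do not say whether the split was stratified by class.

```python
import random


def split_70_15_15(indices, seed=42):
    """Shuffle the indices and cut them into train / validation / test."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    n_train = round(0.70 * len(idx))
    n_valid = round(0.15 * len(idx))
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]   # whatever is left over
    return train, valid, test
```

With 674 rows and this rounding rule the three parts contain 472, 101 and 101 students.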

Weekend Alcohol Consumption - Number of Trees

The number of trees is 5200

Weekend Alcohol Consumption - Validation

AUC = 0.748
Overall error 32%

Precision: 58.5%
Recall: 73.8%

Error matrix (proportions; rows are actual classes, columns are predicted classes):

Actual   Unac   Accp   Error
Unac     0.31   0.11   0.26
Accp     0.22   0.36   0.37
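The reported precision and recall can be recomputed from the four proportions above. The sketch assumes rows are actual classes and columns are predicted classes, a reading chosen because it reproduces the reported 58.5% and 73.8%; the function and key names are illustrative.

```python
# (actual, predicted) -> proportion of the data, taken from the slide.
CONFUSION = {
    ("unacceptable", "unacceptable"): 0.31,
    ("unacceptable", "acceptable"): 0.11,
    ("acceptable", "unacceptable"): 0.22,
    ("acceptable", "acceptable"): 0.36,
}


def metrics(cm, positive="unacceptable", negative="acceptable"):
    """Compute precision, recall and error rates for the positive class."""
    tp = cm[(positive, positive)]
    fn = cm[(positive, negative)]
    fp = cm[(negative, positive)]
    tn = cm[(negative, negative)]
    total = tp + fn + fp + tn
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "overall_error": (fn + fp) / total,
        "positive_class_error": fn / (tp + fn),
        "negative_class_error": fp / (fp + tn),
    }
```

Because the table entries are already rounded to two decimals, the recomputed overall error and per-class errors can differ from the slide in the last digit.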

Weekend Alcohol Consumption - Importance

Important Factors:
● Going out with friends
● Sex
● Grades
● Family size
● Absences
● Free time
● Father’s job

Compare two models

Random forest gives the best predictions for both targets.

For weekday alcohol consumption, the overall error rate is 35%, with an error rate of 29% in the unacceptable group. However, the AUC is only about 0.69.

For weekend alcohol consumption, the overall error rate is 32%, with an error rate of 26% in the high-consumption group. The AUC is 0.748.

The weekend model is the better one.

Insights of the models

1. Drinking is a daily behavior

Most of the drinkers drink on both weekends and weekdays. Students tend to drink more on weekends.

2. Mom and dad play important roles at different times

According to the weekday alcohol consumption model, mother’s education and mother’s job are related to the child’s weekday drinking behavior.

On weekends, by contrast, father’s job is related to the child’s drinking behavior.

Insights of the models

3. Common factors show up in both models

● Sex -- boys tend to drink more than girls
● Grades -- kids with lower grades drink more than those with higher grades
● Absences -- kids with more absences tend to drink more
● Free time -- kids with more free time tend to drink more

4. Exclusive factors related to alcohol consumption

● Going out with friends -- on weekends, peer behavior is related to alcohol consumption
● Family size -- kids from larger families tend to drink less on weekends
● Going out for more time -- during weekdays, more free time is related to alcohol consumption

Recommendation

Family and school are both important.

After running both models on school-related data only and on family-related data only, we found that the prediction error rate gets even higher, which indicates that alcohol consumption behaviour is related to both aspects. Solving the alcohol consumption problem among high-school students needs effort from both school and family.

● Educate the students. Reduce negative peer impacts. Build their awareness of the harmful effects of alcohol use.

● Educate the parents, and get parents to keep track of their kids’ after-school behavior.

● Keep track of the data to build students’ behavior profiles for future prediction.

Recommendation

How to predict better.

Neither model predicts the drinkers group well. We could collect more data from a larger sample to build a better model. There may also be more relevant variables, such as the group the kids hang out with, how much money they have, or other factors not included in this study.

THANK YOU

More support pages

Variables


Correlation

Weekday Alcohol Consumption

Decision Tree Model:

● Sex being male
● Lower final grade (G3 < 14)
● Going out more
● More absences from class
● Mother’s education level below 1.5 (on the ordinal 0-4 scale)
● Mother’s job other than at home, health or teacher

are the factors associated with unacceptable drinking behavior (Ratings 2-5)

Weekday Alcohol Consumption

Decision Tree with Loss Matrix:

Loss matrix used: 0,25,80,0

Training: Validation: Test - 70:15:15

Weekday Alcohol Consumption

Decision Tree Model with loss matrix:

Why G3 was chosen as the indicator of grade