Students' Alcohol Consumption Data Analysis
Post on 11-Apr-2017
Students' Alcohol Consumption Analysis
Group 9: Demin; Derrick; Gaurav; Jingya; Ramya; Si
Introduction
Some of the most important new data on young adult drinking were collected through a recent nationwide survey, the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). According to these data, about 70 percent of young adults, or about 19 million people, consumed alcohol in the year preceding the survey.
This is a short exploratory data analysis focusing on the alcohol variables in the Portuguese school dataset. Our main goal is to use data mining to predict school students' alcohol consumption and to find the significant factors.
Objective/problem statement
Build models to predict school students' drinking behavior on weekdays and weekends.
Compare various models and choose the best.
Find out which factors influence school students' alcohol consumption, so that sensible recommendations can be made.
Dataset
Data collected through a survey from two classes in two schools in Portugal
33 variables:
Personal, e.g. school, sex, age, address, health status, romantic experience, going out with friends, free time after school
Educational, e.g. study time, class failures, intention for higher education, extra-curricular activities, educational support, number of school absences, grades
Family, e.g. mother's/father's education, mother's/father's job, family size, quality of family relationships, parents' cohabitation status
Alcohol consumption, e.g. workday alcohol consumption, weekend alcohol consumption
Data types: binary, ordinal, nominal, numeric
Data preparation
No missing data
Overlapping students: some students take both the Math and the Portuguese class
649 students in the Portuguese class, 395 students in the Math class
Merging data
Criterion: "school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"
382 students identified
Deciding attributes for overlapping students: keep the max values; keep "yes" for paid class
Resulting in 674 students in total
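The merge step can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the group's actual code: the rows are assumed to be dictionaries keyed by column name, the function names are hypothetical, and "keep max values" is assumed to apply to every non-key column other than the paid-class flag.

```python
# Columns used as the merge criterion, as listed on the slide.
KEYS = ["school", "sex", "age", "address", "famsize", "Pstatus", "Medu",
        "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"]

def key_of(row):
    """Build the tuple that identifies one student across both class files."""
    return tuple(row[k] for k in KEYS)

def combine(a, b):
    """Resolve two records of the same student into one.

    Assumption: 'yes' wins for the paid-class flag, and every other
    non-key column keeps the larger of the two values (for text columns
    this is a lexicographic comparison).
    """
    merged = dict(a)
    for col, b_val in b.items():
        if col in KEYS:
            continue
        if col == "paid":
            merged[col] = "yes" if "yes" in (a[col], b_val) else "no"
        else:
            merged[col] = max(a[col], b_val)
    return merged

def merge_classes(math_rows, port_rows):
    """Union of both classes, with overlapping students collapsed to one row."""
    students = {}
    for row in list(math_rows) + list(port_rows):
        k = key_of(row)
        students[k] = combine(students[k], row) if k in students else dict(row)
    return list(students.values())
```

Keying a dictionary on the criterion tuple keeps the sketch short; with a data-frame library the same idea would be an outer join on the criterion columns followed by a per-column resolution rule.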
Approaches
The analysis is split into two models: alcohol consumption on weekdays and on the weekend
Target variables: weekday alcohol consumption and weekend alcohol consumption
For weekdays (a more serious issue than the weekend): Level 1 is acceptable alcohol consumption; Levels 2-5 are unacceptable
For the weekend: Levels 1 and 2 are acceptable alcohol consumption; Levels 3, 4 and 5 are unacceptable
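The two recoding rules above can be written as a pair of small functions. This is a minimal sketch; the function names and the label strings are illustrative choices, not taken from the slides.

```python
def _check_level(level):
    """The survey rates consumption on an ordinal scale from 1 to 5."""
    if level not in (1, 2, 3, 4, 5):
        raise ValueError(f"consumption level must be 1-5, got {level!r}")

def label_weekday(dalc):
    """Weekday rule: only level 1 counts as acceptable."""
    _check_level(dalc)
    return "acceptable" if dalc == 1 else "unacceptable"

def label_weekend(walc):
    """Weekend rule: levels 1 and 2 count as acceptable."""
    _check_level(walc)
    return "acceptable" if walc <= 2 else "unacceptable"
```

The same level 2 is therefore unacceptable on a weekday but acceptable on the weekend, which is why the two targets have different class balances.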
Techniques Used
Decision Tree
Poor performance
Overall error rate: 38%
Tried improving the model with a cost matrix (0, 25, 80, 0): 32% error in predicting unacceptable behavior
But this increased the error rate for the acceptable class to 44%
REJECTED DECISION TREE
Neural Network
Poor performance
The neural network worked best with 15 nodes
But the error rate is quite high: 53% for the unacceptable class
The error rate for the acceptable class was 22%
REJECTED NEURAL NETWORK
Poor performance
Overall error rate is 25%, which is quite low
However, 59% of the data is wrongly classified as unacceptable
Area under the ROC curve is 0.6782
Poor performance
Overall error rate was 38.46%
Couldn't properly classify the unacceptable class
Accuracy was also very low
REJECTED NAIVE BAYES
Random Forest
Winner
Unacceptable class error rate was 29%
The unacceptable class is the most important one for the model to predict
ACCEPTED RANDOM FOREST
Weekday Alcohol Consumption
Input variables: all variables were used as inputs for the weekday alcohol consumption model except G1, G2 and weekend alcohol consumption.
Weekend alcohol consumption is excluded to avoid target leakage.
G1, G2: grades for the first and second periods. We include G3 (the final grade, which is closely related to G1 and G2) and exclude G1 and G2 to reduce dependence among the input variables.
Target: weekday alcohol consumption. We recoded the ordinal variable weekday alcohol consumption (ratings 1-5) into Acceptable (rating 1) and Unacceptable (ratings 2-5).
Weekday Alcohol Consumption - Random Forest Model
Partitioning: Training : Validation : Test = 70 : 15 : 15
Sample size chosen as 85, 100 to downsample the acceptable class
Number of trees: 5200
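The partitioning and the per-class sample sizes can be illustrated with the standard library alone. This is a hedged sketch: the function names are hypothetical, the slides do not say which of the two sizes (85 and 100) belongs to which class, and the mapping in the usage note below is an assumption.

```python
import random

def split_70_15_15(rows, seed=0):
    """Shuffle the rows and cut them into training, validation and test sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = int(0.70 * len(shuffled))
    n_valid = int(0.15 * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains, about 15%
    return train, valid, test

def class_sample(rows, label_of, sizes, seed=0):
    """Draw a fixed number of rows per class, with replacement.

    Each tree of a balanced random forest is grown on a sample like this,
    so the larger class no longer dominates the training data of a tree.
    """
    rng = random.Random(seed)
    sample = []
    for label, n in sizes.items():
        pool = [r for r in rows if label_of(r) == label]
        sample.extend(rng.choices(pool, k=n))
    return sample
```

For example, `class_sample(train, label_of, {"unacceptable": 85, "acceptable": 100})` would draw one tree's training sample under the assumed mapping of sizes to classes.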
Overall error: 35%
For the unacceptable class: precision 52%, recall 70.5%
Weekday Alcohol Consumption - Important Factors
Sex (being male), grades, mother's education, going out, mother's job, failures
Weekend Alcohol Consumption - Input & Balance
The best model is a balanced Random Forest
Ignored variables: Dalc, G1 and G2
Target variable Walc: levels 1-2 are Low, levels 3-5 are High
High : Low = 262 : 412, about 39 : 61
Train : Validation : Test = 70 : 15 : 15
Weekend Alcohol Consumption - Number of Trees
The number of trees is 5200
Weekend Alcohol Consumption - Validation
Overall error 32%
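Precision: 58.5%
Recall: 73.8%
Confusion matrix (proportions; rows are actual classes, columns are predicted classes):
Actual   Unac   Accp   Error
Unac     0.31   0.11   0.26
Accp     0.22   0.36   0.37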
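The precision, recall and per-class error figures follow directly from the four confusion-matrix proportions. The short sketch below recomputes them, treating the unacceptable class as the positive class; the variable names are illustrative. Because the cells are rounded to two decimals, the overall error comes out as 33% rather than the reported 32%.

```python
# Confusion-matrix proportions from the slide: (actual, predicted) -> share.
cm = {
    ("unac", "unac"): 0.31,  # true positives
    ("unac", "accp"): 0.11,  # false negatives
    ("accp", "unac"): 0.22,  # false positives
    ("accp", "accp"): 0.36,  # true negatives
}

tp, fn = cm[("unac", "unac")], cm[("unac", "accp")]
fp, tn = cm[("accp", "unac")], cm[("accp", "accp")]

precision = tp / (tp + fp)      # of those predicted unacceptable, share correct
recall = tp / (tp + fn)         # of those actually unacceptable, share found
error_unac = fn / (tp + fn)     # error rate within the unacceptable class
error_accp = fp / (fp + tn)     # error rate within the acceptable class
overall_error = fn + fp         # the cells are already proportions of the total

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"error_unac={error_unac:.2f} error_accp={error_accp:.2f} "
      f"overall={overall_error:.2f}")
```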
Weekend Alcohol Consumption - Importance
Important factors: going out with friends, sex, grades, family size, absences, free time, father's job
Compare two models
Random forest predicts best for both targets.
For daily (weekday) alcohol consumption, the overall error rate is 35%, with an error rate of 29% in the unacceptable group. However, the AUC is only 0.69.
For weekend alcohol consumption, the overall error rate is 32%, with an error rate of 26% in the high consumption group. The AUC is 0.748.
The weekend model is the better one.
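When comparing the two AUC values, it helps to recall what the number measures: the probability that a randomly chosen student from the unacceptable (high) class receives a higher score than a randomly chosen student from the acceptable (low) class, so 0.5 is chance level and 1.0 is a perfect ranking. The sketch below computes AUC from that definition; the function name and the example scores are made up for illustration and are not the models' outputs.

```python
def auc(scores_pos, scores_neg):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic in
    the number of rows, which is fine for a dataset of a few hundred students.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Made-up scores: three students from the positive class, two from the negative.
example = auc([0.9, 0.8, 0.4], [0.7, 0.3])  # 5 of 6 pairs are ordered correctly
```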
Insights of the models
1. Drinking is a daily behavior
Most of the drinkers drink on both weekends and weekdays. Students tend to drink more on weekends.
2. Mom and dad play important roles at different times
According to the daily alcohol consumption model, the mother's education and the mother's job are related to the child's daily drinking behavior, while the father's job matters for drinking behavior on weekends.
3. Common factors show up in both models
Sex -- boys tend to drink more than girls
Grades -- kids with lower grades drink more than those with higher grades
Absences -- kids with more absences tend to drink more
Freetime -- kids with more free time tend to drink more
4. Exclusive factors related to alcohol consumption
Going out with friends -- on weekends, peer behavior is related to alcohol consumption
Family size -- kids from larger families tend to drink less on weekends
Going out for more time -- during weekdays, more free time is related to alcohol consumption
Recommendation
Family and school are both important. After running both models on only school-related data and on only family-related data, we found that the prediction error rate gets even higher, which indicates that alcohol consumption behaviour is related to both aspects. Solving the alcohol consumption problem among high-school students needs effort from both school and family.
Educate the students. Reduce negative peer impacts. Build their awareness of the harmful effects of alcohol use.
Educate the parents, and get parents to keep track of their kids' after-school behavior.
Keep tracking the data to build student behavior profiles for future prediction.
How to predict better
Both models can hardly predict the drinkers group well. We could collect more data from a larger sample to build a better model.
There might be more relevant variables that were not included in the study, such as the group the kids hang out with or how much money they have.
More support pages
Weekday Alcohol Consumption - Decision Tree Model
Sex being male
Lesser Grade during finals (G3