hint analysis

8/3/2019 Hint Analysis




mountain. The main interaction of a player with Prime Climb consists of moving from one location on a mountain of numbers to another until she reaches the top of the mountain. Each location on a mountain either represents a number or is blocked. The other possible forms of interaction with the game are attending to the hints given and using a tool called the Magnifying Glass, which shows the factor tree of a number when the student selects the tool and clicks on that number on the mountain.

Prime Climb utilizes a probabilistic student’s model, a Dynamic Bayesian Network (DBN), to track and assess the student’s number factorization knowledge during the interaction. The pedagogical agent embedded in the game provides the student with adaptive hints when, according to the student’s model assessment, the student needs such interventions. As an adaptive educational game, the success of Prime Climb in helping the student learn number factorization depends on how accurately the student’s model evaluates the level of relevant knowledge and skills and supports the student.

The objective of this report is threefold. First, we summarize the simulations carried out to improve the accuracy of the student’s model. To this end, a data-driven approach was applied to refine the parameters of the student’s model, which are essential in defining the conditional probability tables of the nodes in the DBN designed for Prime Climb. Second, we report on the analysis of the performance of the pedagogical agent in providing adaptive interventions to the students during game-play. To this end, two measures of intervention performance, called hint precision and hint recall, were defined and calculated. Finally, the accuracy of the student’s model in assessing the current level of number factorization knowledge is examined and analyzed.

The rest of this manuscript is organized as follows. Section 2 briefly summarizes the results of the data-driven refinement of the student’s model parameters. Section 3 discusses the analysis of the intervention mechanism used in Prime Climb. Section 4 describes the effects of using different prior probability settings on the hinting mechanism in Prime Climb. Section 5 focuses on analyzing the performance of the student’s model in evaluating the level of factorization knowledge during the interaction. Section 6 summarizes the statistical analysis of the effect of different prior probability settings on the student’s model. Section 7 presents some preliminary results on the analysis of the pre-test and post-test. Finally, Section 8 outlines future work.

2 Data-Driven Model Refinement

Prime Climb utilizes a parametric probabilistic student’s model, a Dynamic Bayesian Network (DBN), to track the evolution of the student’s number factorization and common factor skills while the student interacts with the game. Essentially, there are three steps in creating a DBN:

1. Determining the random variables and their domains.

2. Specifying the connections among different random variables.

3. Parameterizing the model by specifying the conditional probability tables, CPT, of the

random variables and specifying the prior probabilities if available.

In Prime Climb, an expert-driven approach has been used for defining the random variables, their domains and the connections among them. In such an expert-driven approach, a domain expert determines the variables and their connections based on her own intuition and experience. In contrast, a data-driven mechanism has been used to find the optimal parameter setting for the conditional probability tables (CPTs) of the nodes in the network. In such a data-driven method, the values of the relevant parameters are calculated using sample training datasets.

There are four parameters used in the student’s model in Prime Climb. These parameters specify how the evidence (making a correct or wrong movement) propagates in the student’s model and represent the probability of a student making a correct or wrong movement in a specific situation, given that she knows or does not know the relevant number factorization skill. These parameters are as follows:

• Guess: The probability that the student makes a correct movement while she does not have the required skill for making such a movement.

• Edu-Guess: Standing for Educational-Guess, Edu-Guess is the probability that the student makes a correct movement while she has only partially gained the knowledge required for making such a movement.

• Slip: The probability that the student makes a wrong move while, based on the student’s model assessment, the corresponding skill is known to the student.

• Max: Shows how the evidence on a skill will propagate to other relevant skills.

In addition, one other step of constructing a DBN, as mentioned earlier, is assigning initial probabilities, known as prior probabilities, to the random variables. In Prime Climb, three types of prior probability settings are considered, namely 1) Population, 2) User-specific and 3) Generic. We elaborate on these prior probability settings and the model’s parameters in the subsequent subsection.

2.1 Sensitivity of the Model to Parameters

Given the structure (nodes and connections) of the DBN in Prime Climb, a more appropriate set of model parameters allows the model to track and assess the evolution of the desired skills (factorization and common factor) more precisely during game-play and, at the end of the interaction, to produce posterior probabilities for the skills’ corresponding nodes in the DBN which accurately predict the students’ relevant knowledge after game-play.

In order to find the best set of parameters, a comprehensive range of values between 0 and 1 was selected for each parameter. We then used a Receiver Operating Characteristic (ROC) curve to find the pair of sensitivity and specificity which gives the highest accuracy and the best balance between sensitivity and specificity. A ROC curve plots the true positive rate (sensitivity) versus the false positive rate (1-specificity) as a discrimination threshold varies. As our measure of accuracy, we chose accuracy=(sensitivity+specificity)/2. Sensitivity is the true positive rate, the percentage of known skills that the model classifies as such. Specificity is the true negative rate, the percentage of unknown skills classified as such. A simulator was developed to simulate the interactions of 45 students with Prime Climb. Table 1 presents the optimal values found for the model’s parameters. In the next subsection we report on the values of specificity, sensitivity and accuracy for the different prior probability settings.
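The accuracy measure above can be illustrated with a minimal Python sketch. It is not the simulator used in the study: the posterior values, the known/unknown labels and the candidate thresholds are made-up illustration data, and the function names are hypothetical. A skill is classified as known when its posterior is at or above the threshold.

```python
from typing import Sequence, Tuple


def rates(posteriors: Sequence[float], known: Sequence[bool],
          threshold: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for one discrimination threshold."""
    pairs = list(zip(posteriors, known))
    tp = sum(1 for p, k in pairs if k and p >= threshold)      # known, classified known
    fn = sum(1 for p, k in pairs if k and p < threshold)       # known, classified unknown
    tn = sum(1 for p, k in pairs if not k and p < threshold)   # unknown, classified unknown
    fp = sum(1 for p, k in pairs if not k and p >= threshold)  # unknown, classified known
    return tp / (tp + fn), tn / (tn + fp)


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """accuracy = (sensitivity + specificity) / 2, as defined in the text."""
    return (sensitivity + specificity) / 2


def best_threshold(posteriors, known, thresholds):
    """Pick the candidate threshold with the highest balanced accuracy."""
    return max(thresholds,
               key=lambda t: balanced_accuracy(*rates(posteriors, known, t)))


# Hypothetical data: model posteriors and test-based ground truth.
posteriors = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
known = [True, True, True, True, False, False, False, False]
print(best_threshold(posteriors, known, [0.35, 0.5, 0.75]))
```

Each threshold gives one point on the ROC curve; the same comparison could be repeated for every candidate parameter setting.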

Table 1: Optimal values of the model’s parameters

Parameter Guess Edu-Guess Slip Max

Value 0.6 0.7 0.1 0.2



Table 2: Summary of the simulation results for different prior probabilities settings

Prior Setting Accuracy Specificity Sensitivity

Population 0.755 0.737 0.779

User-specific 0.713 0.648 0.77

Generic 0.684 0.773 0.612

A probable drawback of relying on this result and using the population prior probabilities in future studies with different students is that future subjects might have a lower level of number factorization knowledge than the sample group of subjects used to refine the student’s model parameters. This would result in a model which might initially overestimate the student’s knowledge and might not perform as well as expected. On the other hand, the user-specific prior probability setting is expected to represent the student’s prior number factorization knowledge more specifically than the other two settings. Yet, according to the results in Table 2, using user-specific prior probabilities resulted in a model with the lowest specificity of the three settings.

3 Hint Precision and Recall

A good intervention strategy in an adaptive educational game ensures pedagogical effectiveness by providing appropriate tailored support when required, while not intervening so often that it negatively affects the user’s engagement in the game. The intervention mechanism in Prime Climb takes the form of different types of hints provided during the student’s interaction with the game. The hinting strategy in Prime Climb uses the student’s model’s assessment of the student’s number factorization and common factor knowledge during game-play to provide adaptive support in terms of hints on unknown skills. To decide when to intervene, the hinting strategy uses four thresholds, namely: 1) Fact-CorrectMove, 2) Fact-WrongMove, 3) CF-CorrectMove and 4) CF-WrongMove. The first two thresholds, 1 and 2, determine the values used to evaluate a number factorization (Fact) skill as known or unknown after a correct and a wrong movement respectively. Similarly, the last two thresholds, 3 and 4, are used to assess the common factor (CF) skill as known or unknown immediately after a correct or wrong movement. A human-adjusted approach was applied to find an original setting for these four thresholds. To this end, after some initial values were chosen for each of the thresholds, some graduate students played the game and their reports on the timing of the hints were used to adjust the initial values. Table 3 shows the final values selected for each of the thresholds.

Table 3: The thresholds used in the hinting algorithm in Prime Climb

Threshold Final value

Fact-CorrectMove 0.5

Fact-WrongMove 0.8

CF-CorrectMove 0.1

CF-WrongMove 0.5



Algorithm 1 shows how these thresholds are used in the intervention mechanism in Prime Climb to decide when, and on what skill, to provide hints.

    // Initializing variables

    if (player made a correct move)
    {
        fact_unknown = (playerBelief < fact_correctMoveHintThreshold ||
                        partnerBelief < fact_correctMoveHintThreshold);
        cf_unknown = (cfBelief < cf_correctMoveHintThreshold);
    }
    else // player made a wrong move
    {
        fact_unknown = (playerBelief < fact_wrongMoveHintThreshold ||
                        partnerBelief < fact_wrongMoveHintThreshold);
        cf_unknown = (cfBelief < cf_wrongMoveHintThreshold);
    }

    // When and on what skill to hint
    if (cf_unknown && !fact_unknown)
    {
        hint on the common factor skill
    }
    else if (fact_unknown && !cf_unknown)
    {
        hint on the factorization skill
    }
    else if (cf_unknown && fact_unknown)
    {
        hint on the common factor and factorization skills alternately
    }

Algorithm 1: The hinting strategy in Prime Climb
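The decision logic of Algorithm 1 can be sketched as a small Python function. This is an illustrative reading of the pseudocode, not the game's actual code: the names `HintThresholds` and `choose_hint` are hypothetical, the default threshold values are those of Table 3, and the alternation between the two hint types in the "both unknown" case is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HintThresholds:
    """Threshold values taken from Table 3."""
    fact_correct_move: float = 0.5
    fact_wrong_move: float = 0.8
    cf_correct_move: float = 0.1
    cf_wrong_move: float = 0.5


def choose_hint(correct_move: bool, player_belief: float, partner_belief: float,
                cf_belief: float,
                thresholds: HintThresholds = HintThresholds()) -> Optional[str]:
    """Return which skill to hint on, or None when no hint is warranted."""
    if correct_move:
        fact_t, cf_t = thresholds.fact_correct_move, thresholds.cf_correct_move
    else:
        fact_t, cf_t = thresholds.fact_wrong_move, thresholds.cf_wrong_move

    # A skill is treated as unknown when the model's belief is below the threshold.
    fact_unknown = player_belief < fact_t or partner_belief < fact_t
    cf_unknown = cf_belief < cf_t

    if cf_unknown and fact_unknown:
        return "both"            # caller alternates between the two hint types
    if cf_unknown:
        return "common_factor"
    if fact_unknown:
        return "factorization"
    return None


print(choose_hint(False, 0.6, 0.9, 0.6))  # wrong move, player belief below 0.8
```

Because the wrong-move thresholds are higher than the correct-move ones, the same beliefs can trigger a hint after a wrong move but not after a correct one.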

From a pedagogical perspective, it is essential to provide the student with “correct” support when she needs it. A “correct” support is given on the correct skill when required and presented with helpful context in a way that encourages the student to attend to it. As the intervention mechanism in Prime Climb uses real-time assessment of the student’s knowledge to determine when and on what to provide help, the effectiveness of the mechanism depends on how accurately the student’s model tracks and assesses the evolution of the desired skills. To investigate how well the hinting strategy and the student’s model provide tailored support to the student during the interaction, two measures of performance are defined: 1) Hint Precision and 2) Hint Recall.

Generally, precision is defined as the fraction of retrieved instances which are relevant, while recall is the fraction of relevant instances that are retrieved. Similarly, hint precision is defined as the fraction of given hints which are justified, and hint recall is defined as the fraction of justified hints which are actually given to the student.



An intervention provided to the user is called justified if it is given at the correct time and on the right skill. In contrast, an unjustified intervention is presented to the student when it is not required or expected by the student. Similarly, if the intervention strategy fails to provide a justified intervention, a justified intervention is said to have been missed. Finally, when no intervention is given and none is required, the intervention mechanism has “correctly not given” the hint. Given this terminology, hint precision and hint recall are defined by the following equations.

Equation 1: Hint precision

\[
\text{Hint Precision} = \frac{\text{Number of justified hints}}{\text{Number of justified hints} + \text{Number of unjustified hints}}
\]

Equation 2: Hint recall

\[
\text{Hint Recall} = \frac{\text{Number of justified hints}}{\text{Number of justified hints} + \text{Number of missed hints}}
\]
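As a quick check of Equations 1 and 2, the sketch below computes both measures from raw counts. The function names are hypothetical; the example counts are the population-prior values reported in Table 5 (122 justified, 108 unjustified and 343 missed hints).

```python
def hint_precision(justified: int, unjustified: int) -> float:
    """Equation 1: fraction of given hints that were justified."""
    return justified / (justified + unjustified)


def hint_recall(justified: int, missed: int) -> float:
    """Equation 2: fraction of required hints that were actually given."""
    return justified / (justified + missed)


# Counts from the population-prior confusion matrix (Table 5).
precision = hint_precision(justified=122, unjustified=108)
recall = hint_recall(justified=122, missed=343)
print(round(precision, 2), round(recall, 2))
```

Note that correctly-not-given hints appear in neither formula, so both measures ignore the cases where the model rightly stays silent.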

3.1 Simulation of the intervention mechanism using the original threshold setting

In order to calculate the hint precision and hint recall in Prime Climb, data from the interactions of 45 students in grades 5 and 6 with Prime Climb was used to simulate the hinting strategy with the original threshold settings (see Table 3). To this end, we initialized the student’s model with each of the prior probability settings and used the optimal setting of the model’s parameters (see Table 1). Since there is no ground truth on how the student’s number factorization and common factor knowledge evolve during the interaction with Prime Climb, in calculating the hint precision and hint recall we only considered the movements in which the player’s number, the partner’s number, or both keep the same score from pre-test to post-test. Each movement made by the student involves two numbers: 1) the player’s number and 2) the partner’s number. The player’s number is the number to which the player has just moved, while the partner’s number is the number on the mountain to which the partner has moved. All the numbers the students moved to during game-play were assigned a label based on the student’s performance on that specific number in the pre-test and post-test. We used 5 labels to represent the status of the numbers from pre-test to post-test:

1. KK: Stands for Known-Known and shows that the number was known to the student in both the pre-test and the post-test (the student answered the number’s corresponding question correctly in both tests).

2. UU: Stands for Unknown-Unknown and shows that the number was unknown to the student in both the pre-test and the post-test.

3. KU: Stands for Known-Unknown and shows that the student answered the number’s corresponding question correctly in the pre-test and wrongly in the post-test.

4. UK: Stands for Unknown-Known and shows that the student gave a wrong answer to the number’s corresponding question in the pre-test and a correct answer in the post-test.

5. NAP: The number does not appear on the tests.
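The five labels can be derived mechanically from a student's pre-test and post-test answers. The sketch below assumes each answer is recorded as `True` (correct), `False` (wrong) or `None` (the number is not on the tests); that encoding and the function name `status_label` are illustrative assumptions, not part of the original study.

```python
from typing import Optional


def status_label(pre_correct: Optional[bool], post_correct: Optional[bool]) -> str:
    """Map pre-test/post-test results for one number to KK, UU, KU, UK or NAP."""
    if pre_correct is None or post_correct is None:
        return "NAP"  # the number does not appear on the tests
    pre = "K" if pre_correct else "U"
    post = "K" if post_correct else "U"
    return pre + post


print(status_label(False, True))   # wrong in the pre-test, correct in the post-test
print(status_label(None, None))    # number not on the tests
```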



Given the above terminology, the types of hints are defined based on the status of the numbers on which the hints are given:

• Justified hint: A hint given on a number with status UU.

• Unjustified hint: A hint given on a number with status KK.

• Missed hint: The hinting mechanism fails to provide a hint on a number with status UU.

• CorrectlyNotGiven hint: The hinting mechanism correctly provides no hint on a number with status KK.
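The four hint types translate directly into a small classification routine. The following is a sketch under the definitions above; the event list is made-up illustration data and the names `hint_outcome` and `confusion_counts` are hypothetical. Numbers with status KU, UK or NAP are skipped because the definitions do not assign them a hint type.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def hint_outcome(status: str, hint_given: bool) -> Optional[str]:
    """Classify one (number status, hint given?) event."""
    if status == "UU":
        return "justified" if hint_given else "missed"
    if status == "KK":
        return "unjustified" if hint_given else "correctly_not_given"
    return None  # KU, UK and NAP are not covered by the definitions


def confusion_counts(events: Iterable[Tuple[str, bool]]) -> Counter:
    """Tally the four cells of the confusion matrix."""
    counts: Counter = Counter()
    for status, hint_given in events:
        outcome = hint_outcome(status, hint_given)
        if outcome is not None:
            counts[outcome] += 1
    return counts


# Hypothetical events: (status of the number, whether a hint was given).
events = [("UU", True), ("UU", False), ("KK", True),
          ("KK", False), ("UK", True), ("NAP", False)]
counts = confusion_counts(events)
print(dict(counts))
```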

In the calculation of hint precision and hint recall, it is assumed that a student should receive a hint following a movement which involves at least one number with status UU and should never receive a hint on a number with status KK. For each set of prior probabilities, the total numbers of the different types of hints were calculated and the confusion matrix was constructed. Table 4 shows the structure of the confusion matrix calculated for the intervention mechanism in Prime Climb. For instance, in this confusion matrix, an unjustified hint is a hint given on a number which is known to the student according to her pre-test and post-test scores but is assessed as unknown by the student’s model.

Table 4: Structure of the confusion matrix for the intervention mechanism

                    Model assessment of student knowledge
Pre-Post Test       Unknown                  Known
Known               Unjustified hint (UJ)    Correctly Not Given (CN)
Unknown             Justified hint (J)       Missed hint (M)

3.1.1 Simulation of the Intervention Mechanism Using Population Prior Settings

As previously discussed, three types of prior probability settings are used in Prime Climb to initialize the student’s model. Table 5 presents the confusion matrix for the hinting mechanism in Prime Climb when the population prior setting was used. The result is based on the original thresholds for the hinting strategy (see Table 3) and the optimal model parameters (see Table 1).

Table 5: Confusion matrix (number of raw data points and [percentages]) when the population priors are used

                    Model assessment of student knowledge (Population-based Prior)
Pre-Post Test       Unknown               Known                 Total
Known               108 [12.3%] (UJ)      306 [34.8%] (CN)      414 [47.1%]
Unknown             122 [13.9%] (J)       343 [39.0%] (M)       465 [52.9%]
Total               230 [26.2%]           649 [73.8%]           879 [100%]

Given Equations 1 and 2, the hint precision and hint recall are 0.53 and 0.26 respectively.

Both values are low, which means that initializing the student’s model with the population prior probabilities and using the model as the basis for providing tailored support could result in many unjustified interventions (almost 47% of all interventions), which could lead the student to stop benefiting from the provided support. It could also result in many missed hints (about 74% of the time the model fails to provide a justified hint), which could negatively affect the students’ learning gains.

To find out which situations during game-play contribute most to lowering the hint precision and hint recall, we investigated further. To this end, all the movements made by the students were extracted from the log files and each movement was assigned a label comprising the status of the player’s number in the pre-test and post-test followed by the status of the partner’s number in the pre-test and post-test. A number’s status has the format XY, in which X represents whether, based on the pre-test result, the student knows (K) or does not know (U) the factorization of the number, and Y represents the same for the post-test result. If a number does not appear in the pre-test and post-test, it is assigned a NAP status. For instance, in the status (UK-NAP), UK is the status of the player’s number in the movement and shows that the factorization of the number is Unknown to the student in the pre-test and Known in the post-test. NAP, on the other hand, is the status of the partner’s number and means that the number does not appear in the pre-test and post-test.
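The movement labels described above are simple to build and filter. The sketch below is only an illustration: `movement_label` and `keeps_same_score` are hypothetical names, and the filter encodes the earlier restriction to movements in which the player's number, the partner's number, or both keep the same score from pre-test to post-test (status KK or UU).

```python
STABLE_STATUSES = {"KK", "UU"}  # same score in the pre-test and the post-test


def movement_label(player_status: str, partner_status: str) -> str:
    """Combine the two number statuses, player's number first."""
    return f"{player_status}-{partner_status}"


def keeps_same_score(label: str) -> bool:
    """True when at least one of the two numbers has status KK or UU."""
    player_status, partner_status = label.split("-")
    return player_status in STABLE_STATUSES or partner_status in STABLE_STATUSES


label = movement_label("UK", "NAP")
print(label, keeps_same_score(label))
```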

Then all the movements which have the potential of receiving unjustified and justified hints were extracted. Figure 2 illustrates the frequencies of the movements relevant to the hints. As depicted, 3.95% of the time the model underestimates known number factorization knowledge in the students. In contrast, 64% of the time when a justified hint was required, no hint was given to the students, an indication of a high rate of overestimation of unknown number factorization skills. In addition, “at least” 22.8% of the time (we could not judge the hints given on numbers with status NAP) the model succeeded in providing a justified hint to the student.

Figure 2: Frequency (raw# and percentages) proportion of each hint types to its relevant possible

movements for the population prior

Then all the movements which have the potential of receiving unjustified hints were extracted.

There are 9 types of movements on which unjustified hints might be given. Figure 3 shows the

labels of the 9 movement types.



Figure 3: Frequency (raw# and percentages) of the unjustified hints for each movement type

Figure 4: Frequency (raw# and percentages) of the missed hints for each movement type

Figures 3, 4 and 5 give a more detailed analysis of all the movements relevant to the hints. Figure 3 shows all the movement statuses which are relevant to unjustified hints. Next to each movement status, the raw number and percentage of unjustified hints given on that movement type are shown. For instance, 40 unjustified hints are given on movements with status KK-KK, which amounts to 11.5% of all movements with status KK-KK. Figures 4 and 5 present similar information for the missed and justified hints. As shown in Figure 4, in at least 50% of all the relevant movements the model has failed to provide a hint and a justified hint has been missed. In addition, Figure 5 shows the low rate of given justified hints for each relevant movement status.

Figure 5: Frequency (raw# and percentages) of the justified hints for each movement type

We can conclude from Figures 2-5 that the hinting strategy is successful in not giving many unjustified hints on the numbers for which the student’s model is initialized with population prior knowledge, although, as mentioned before, almost 47% of the given hints are unjustified. The hinting strategy also has trouble giving justified hints, and there are too many missed hints, meaning that the student’s model overestimates the student’s factorization knowledge on numbers with status UU. This deficiency could hinder the learning gains that come from receiving tailored help during the interaction with Prime Climb. The next subsection discusses the effect of initializing the model with the generic prior probabilities on hint precision and hint recall.

3.1.2 Simulation of the Intervention Mechanism Using Generic Prior Setting

To further investigate the effect of the prior probability settings on the hint precision and hint recall, a similar process was carried out on the model initialized with the generic prior probabilities. In the generic prior setting, the prior probabilities of all numbers on the mountains are set to 0.5 regardless of how the student scored on that specific number in the pre-test. The confusion matrix of the intervention mechanism based on the generic prior is shown in Table 6. Applying Equations 1 and 2 to the counts in Table 6, the hint precision and hint recall are 0.308 and 0.363 respectively when the generic prior setting is used. Figure 6 shows the frequencies of all the relevant movements as well as the frequencies and percentages of the corresponding hints. The results show an increase in the frequency of given unjustified hints and a decrease in the frequency of missed hints. A detailed statistical analysis and comparison is given in Section 4. The results with the generic prior probabilities suggest that lowering the prior probabilities could result in a higher rate of underestimation of known skills and a lower rate of overestimation of unknown skills.



Table 6: Confusion matrix when the generic priors are used

                    Model assessment of student knowledge (Generic-based Prior)
Pre-Post Test       Unknown               Known                 Total
Known               379 [34.4%] (UJ)      257 [23.3%] (CN)      636 [57.7%]
Unknown             169 [15.3%] (J)       297 [27.0%] (M)       466 [42.3%]
Total               548 [49.7%]           554 [50.3%]           1102 [100%]

Figure 6: Frequency (raw# and percentage) proportion of each hint types to its relevant possible

movements for the generic prior probabilities




Figures 7, 8 and 9 respectively illustrate all the movement statuses relevant to the unjustified, missed and justified hints. Figure 7 shows a low rate of unjustified hints on each type of movement, although almost 70% of all the given hints are unjustified (see the confusion matrix, Table 6). Furthermore, as shown in Figure 8, the student’s model has failed to provide a justified hint in at least 30% of the movements of each relevant status. Figure 9 shows all the relevant statuses, the raw frequency of each movement, and the raw frequency and percentage of the justified hints given on each corresponding status. We also conducted a similar study using the user-specific prior setting, as discussed in the next subsection.





3.1.3 Simulation of the Intervention Mechanism Using User-specific Prior Settings

In the user-specific prior setting, the prior probabilities of the numbers appearing in the pre-test and post-test are calculated based on the student’s performance on the number’s corresponding question in the pre-test. In other words, if the student answered a number’s corresponding question correctly in the pre-test, the prior probability of the number is set to 0.9, and to 0.1 otherwise. Clearly, the prior probability of a known number in the user-specific prior setting is higher than the same number’s prior probability in the generic and population prior probability settings. To investigate the effect of initializing the student’s model with the user-specific prior probabilities, we conducted a simulation similar to those described in the two previous subsections. Table 7 presents the confusion matrix of the intervention mechanism when the user-specific prior setting is used.
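The generic and user-specific settings described in the text can be written down in a few lines. This is a sketch only: the constant and function names are hypothetical, and the `fallback` used for numbers that do not appear on the pre-test is an assumption introduced for illustration, since the text does not state how such numbers are initialized. The population setting is omitted because its values are not given here.

```python
from typing import Optional

GENERIC_PRIOR = 0.5        # every number, regardless of the pre-test result
USER_KNOWN_PRIOR = 0.9     # pre-test question answered correctly
USER_UNKNOWN_PRIOR = 0.1   # pre-test question answered wrongly


def user_specific_prior(pretest_correct: Optional[bool],
                        fallback: float = GENERIC_PRIOR) -> float:
    """Prior probability that a number's factorization is known.

    pretest_correct is None when the number is not on the pre-test; the
    fallback used in that case is an assumption, not taken from the report.
    """
    if pretest_correct is None:
        return fallback
    return USER_KNOWN_PRIOR if pretest_correct else USER_UNKNOWN_PRIOR


print(user_specific_prior(True), user_specific_prior(False))
```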

Table 7: Confusion matrix when the user-specific priors are used

                    Model assessment of student knowledge (User-Specific-based Prior)
Pre-Post Test       Unknown               Known                 Total
Known               79 [8.7%] (UJ)        315 [34.8%] (CN)      394 [43.5%]
Unknown             468 [51.7%] (J)       44 [4.8%] (M)         512 [56.5%]
Total               547 [60.4%]           359 [39.6%]           906 [100%]

When the user-specific prior is used, the hint precision and hint recall are 0.856 and 0.91 respectively. The results show a considerable improvement in hint precision and hint recall compared to the results obtained with the population and generic priors. Figure 10 shows the raw frequencies of all the movements relevant to the hints as well as the raw frequencies and percentages of the hints. As shown in Figure 10, the student’s model initialized with the user-specific prior probabilities succeeded in providing a justified hint on 87.3% of the relevant movements. The rates of unjustified and missed hints are also low.

Figure 10: Frequency (raw# and percentage) proportion of each hint types to its relevant possible

movements when the user-specific prior probabilities are used



Figures 11, 12 and 13 respectively show all the statuses relevant to the hints, the frequencies of each status’ corresponding movements, and the frequencies and percentages of the hints. Figure 11 shows that the highest rate of unjustified hints is related to the movements with status KK-KK. This could be an indication that the students made many wrong movements involving numbers with status KK. On the other hand, it could also indicate a poorly adjusted slip parameter (see Section 2). The highest rate of missed hints pertains to the movements with status UU-KK. Moreover, Figure 13 shows a high rate of justified hints on each relevant movement. In the next section, a statistical comparison of the results is presented.






4 Comparison and Results of Hint Precision and Hint Recall

Table 8 summarizes the results of simulating the intervention mechanism using the three different prior settings. In the simulation, the total number of movements made by all the players was 8666. The intervention mechanism used in Prime Climb provides support on two skills: 1) the number factorization skill: the knowledge of factorizing a number into its factors, and 2) the common factor skill: the concept of two numbers having at least one factor in common. The results show that, on average, more than one hint is given for every three movements made by the player during game-play.

Table 8: General statistics on the total number and [percentage] of hints using different prior probability settings

Prior Setting           Population       User-Specific    Generic
Number of hints         3344 [38.6%]     3807 [43.9%]     3561 [41.1%]
Factorization hints     3256             3721             3510
Common Factor hints     88               86               51
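The percentages in Table 8 follow from dividing the hint totals by the 8666 movements. The short sketch below reproduces that arithmetic from the table's own counts; the variable and function names are illustrative.

```python
TOTAL_MOVEMENTS = 8666

# (factorization hints, common factor hints) per prior setting, from Table 8.
hints = {
    "Population": (3256, 88),
    "User-Specific": (3721, 86),
    "Generic": (3510, 51),
}


def hint_rate_percent(fact_hints: int, cf_hints: int,
                      movements: int = TOTAL_MOVEMENTS) -> float:
    """Total hints as a percentage of all movements, rounded to one decimal."""
    return round(100 * (fact_hints + cf_hints) / movements, 1)


rates_by_prior = {name: hint_rate_percent(f, c) for name, (f, c) in hints.items()}
print(rates_by_prior)
```

All three rates exceed one third, which is the basis for the statement that more than one hint is given for every three movements.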

In the following subsections, the effects of initializing the student’s model with the three prior probability settings are compared on the total number of given hints, total number of justified hints, total number of unjustified hints, total number of missed hints and total number of correctly not-given hints. In all the comparisons, we first tested for homogeneity of variances, and whenever the assumption of homogeneity of variance was violated, the Welch test followed by the Games-Howell post-hoc test was applied instead of the traditional single-factor ANOVA.
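For readers unfamiliar with the Welch test, the sketch below computes Welch's ANOVA F statistic and its degrees of freedom using only the Python standard library. It is an illustration of the standard textbook formula, not the statistics package used in the study: it stops short of the p-value (which needs the F distribution), does not implement the homogeneity test or the Games-Howell post-hoc test, and the sample data are made up.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_anova(groups: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    Each group needs at least two observations and a non-zero variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n_i / s_i^2 (sample variance), so noisy groups count less.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_w = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_w

    numerator = sum(w * (m - weighted_mean) ** 2
                    for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_w) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    return numerator / denominator, k - 1, (k ** 2 - 1) / (3 * lam)


# Hypothetical per-student counts for three prior settings.
f_stat, df1, df2 = welch_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
print(f_stat, df1, df2)
```

Unlike the classic single-factor ANOVA, the denominator degrees of freedom depend on the group variances, which is why the test remains usable when the variances differ.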



4.1 Total Number of Given Hints

We found no statistically significant difference in the total number of hints given between the different prior settings (F(2,132)=1.32, p=0.270>0.05). On average, each student made 193 movements (std.: 53) during the interaction with Prime Climb. Table 9 presents the mean and standard deviation of the total number of given hints for the different prior settings. Figure 14 illustrates the average number of hints given to each player during the interaction with Prime Climb under the different prior settings, and Figure 15 compares the total hints given to each student under the different prior settings.

Figure 14: Average number of total hints given to each player

Figure 15: Total number of hints given to each player (student)



Table 9: Mean and Standard Deviation of the total number (# raw data point) of given hints

Population Prior User Specific Prior Generic Prior

Mean 74.31 84.6 79.13

Standard Deviation 27.5 32.71 29.66

4.2 Number of Given Justified Hints

Using the Welch test, we found a statistically significant difference in the total number of given justified hints among the different prior settings (p<0.05). Table 10 presents the means, standard deviations and total number of justified hints for each prior probability setting. Table 11 presents the results of the Games-Howell post-hoc test (“*” indicates a significant difference).

Table 10: Descriptive statistics on total number of justified hints

                                         Population Prior   User Specific Prior   Generic Prior
Mean                                     1.83               13.8                  3.67
Total number of given justified hints    55                 414                   110

Table 11: Games-Howell post-hoc test result (Dependent variable: total number of justified hints)

Prior Probabilities   Compared with    p-value (Sig.)   Significant (*: Yes)
Population            User-specific    .000             *
                      Generic          .649
User-specific         Population       .000             *
                      Generic          .002             *
Generic               Population       .649
                      User-specific    .002             *

The results showed no significant difference between the population and the generic prior probability settings in the total number of justified hints. In contrast, there is a statistically significant difference between the user-specific and population settings, as well as between the user-specific and generic settings, with respect to the total number of justified hints. Figures 16 and 17 respectively illustrate the average number of given justified hints and the total number of justified hints given to each student.

4.3 Number of Given Unjustified Hints

The Welch test showed a statistically significant difference in the total number of given unjustified hints among the different prior settings (p<0.05). Table 12 shows the descriptive statistics on the total number of unjustified hints. Table 13 presents the results of the Games-Howell post-hoc test.

Figure 16: Average number of given justified hints

Figure 17: Total number of given justified hints to each student

Table 12: Descriptive statistics on total number of unjustified hints

                                           Population Prior   User Specific Prior   Generic Prior
Mean                                       2.64               1.73                  7.93
Total number of given unjustified hints    116                76                    349



Table 13: Games-Howell Test (Dependent variable: total number of unjustified hints)

Games-Howell Test: Comparison

Prior Probabilities   Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
Population            User-specific         .546
Population            Generic               .002             *
User-specific         Population            .546
User-specific         Generic               .000             *
Generic               Population            .002             *
Generic               User-specific         .000             *

The results showed a significant difference between the generic prior probabilities setting and both the population and user-specific prior probabilities settings in the total number of unjustified hints. There is no statistically significant difference between the user-specific and population prior probabilities settings. Figures 18 and 19 illustrate, respectively, the average number of given unjustified hints and the total number of unjustified hints given to each student.

Figure 18: Average number of given unjustified hints

4.4 Number of Missed Hints

The Welch test showed a statistically significant difference in the total number of missed hints among the different groups of prior settings (p<0.05). Table 14 shows the descriptive statistics on the total number of missed hints. Table 15 presents the results of the Games-Howell post-hoc test. The results showed no significant difference in the total number of missed hints between the generic and population prior probabilities settings, while there was a significant difference between the user-specific prior probabilities setting and both the population and generic prior probabilities settings.



Figure 19: Total number of unjustified hints

Table 14: Descriptive statistics on total number of missed hints


                               Population Prior   User-Specific Prior   Generic Prior
Mean                           10.47              1.37                  9.2
Total number of missed hints   314                41                    276

Table 15: Games-Howell test results (Dependent variable: total number of missed hints)

Games-Howell Test: Comparison

Prior Probabilities   Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
Population            Generic               .770
User-specific         Population            .000             *
User-specific         Generic               .000             *
Generic               Population            .770
Generic               User-specific         .000             *

Figures 20 and 21 illustrate, respectively, the average number of missed hints and the total number of missed hints of each student for the different prior probabilities settings.



Figure 20: Average number of missed hints

Figure 21: Total number of missed hints for each student

4.5 Number of Correctly Not-Given Hints

No significant difference in the total number of correctly not-given hints was found using a single-factor ANOVA test (F(2,129)=0.034, p>0.05). Table 16 shows the descriptive statistics on the total number of correctly not-given hints. Figures 22 and 23 illustrate, respectively, the average number of correctly not-given hints and the total number of correctly not-given hints for each student in the different prior probabilities settings.
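The reported degrees of freedom, F(2,129), are consistent with three prior settings (3 - 1 = 2) and 132 observations in total (132 - 3 = 129). As a minimal sketch of the single-factor ANOVA computation behind such a result, the function below derives F and both degrees of freedom from raw samples. The name `one_way_anova` and the sample data are assumptions made for illustration.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a single-factor ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up counts for three prior settings, for illustration only.
print(one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```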



Table 16: Descriptive statistics on total number of correctly not-given hints


                                            Population Prior   User-Specific Prior   Generic Prior
Mean                                        6.95               7.16                  5.84
Total number of correctly not-given hints   306                315                   257

Figure 22: Average number of correctly not given hints

4.6 Hint Precision

The Welch test showed a statistically significant difference in hint precision among the different groups of prior settings (p<0.05). Table 17 presents the results of the Games-Howell test and Table 18 shows the descriptive statistics on the hint precision. The results showed no significant difference between the population and the generic prior probabilities settings in hint precision. In contrast, there was a statistically significant difference between the user-specific prior setting and both the population and the generic prior probabilities settings.

Table 17: Games-Howell post-hoc test result (Dependent variable: hint precision)

Games-Howell Test: Comparison

Prior Probabilities   Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
Population            User-specific         .001             *
Population            Generic               .682
User-specific         Population            .001             *
User-specific         Generic               .000             *




Table 18: Descriptive statistics on the hint precision


       Population Prior   User-Specific Prior   Generic Prior
Mean   50.79%             85.2%                 41.7%


Figures 23 and 24 illustrate, respectively, the average hint precision and the hint precision of each student.

Figure 23: Average hint precision for the different prior settings

Figure 24: Total hint precision of each student for the different prior settings



4.7 Hint Recall

The Welch single-factor ANOVA test showed a statistically significant difference in hint recall among the different groups of prior settings (p<0.05). Table 19 shows the descriptive statistics on the hint recall and Table 20 gives the results of the Games-Howell post-hoc test. The results showed no statistically significant difference between the population and the generic prior probabilities settings, while there was a statistically significant difference between the user-specific prior probabilities setting and both the population and the generic prior probabilities settings.

Table 19: Descriptive statistics on the hint recall


       Population Prior   User-Specific Prior   Generic Prior
Mean   21.27%             93.96%                26.44%


Table 20: Games-Howell test results (Dependent variable: hint recall)

Games-Howell Test: Comparison

Prior Probabilities   Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
User-specific         Generic               .000             *
Generic               Population            .746


Figures 25 and 26 respectively illustrate the average hint recall and the hint recall of each

student.

Figure 25: Average hint recall for the different prior settings



Figure 26: Total hint recall of each student for the different prior settings

4.8 Thresholds Refinement in the Hinting Mechanism

As discussed in the previous section, an expert-based approach was used to find the optimal thresholds used in the intervention mechanism. Alternatively, a data-driven approach can be used to determine the threshold values, possibly resulting in higher hint precision and hint recall. Similar to the student’s model parameter refinement discussed in Section 2, a set of values for the FACT-CorrectMove and FACT-WrongMove thresholds was examined, and the hint precision and hint recall were calculated for each value. We defined another measure of performance, called accuracy = (hint precision + hint recall)/2. Figures 27, 29 and 31 illustrate how the hint precision and hint recall change as the FACT-WrongMove threshold varies while the FACT-CorrectMove threshold holds its original value (i.e., 0.5), for the population, generic and user-specific prior probabilities settings respectively. Figures 28, 30 and 32 then plot the changes in hint precision, hint recall and accuracy for different values of the FACT-CorrectMove threshold while the FACT-WrongMove threshold holds its optimal value, that is, the value resulting in the highest hint precision and hint recall. The thresholds that resulted in the highest hint precision and hint recall are marked in the figures. Tables 21 and 22 summarize the optimal thresholds and the total number, average and standard deviation of given hints for all prior probabilities settings.
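The refinement procedure described above can be sketched as a one-dimensional search over candidate threshold values. The sketch assumes the usual definitions of the two measures (hint precision as justified hints over all given hints, hint recall as justified hints over justified plus missed hints); these definitions, the `simulate` callback and the toy counts are assumptions made for illustration and are not taken from the report's implementation.

```python
def hint_precision(justified, unjustified):
    """Share of given hints that were justified (assumed definition)."""
    given = justified + unjustified
    return justified / given if given else 0.0


def hint_recall(justified, missed):
    """Share of needed hints that were actually given (assumed definition)."""
    needed = justified + missed
    return justified / needed if needed else 0.0


def accuracy(precision, recall):
    """Mean of hint precision and hint recall, as defined in the text."""
    return (precision + recall) / 2


def best_threshold(candidates, simulate):
    """Return the candidate threshold with the highest accuracy.

    `simulate` is a hypothetical callback that replays the logged movements
    with the given threshold and returns (justified, unjustified, missed).
    """
    def score(threshold):
        justified, unjustified, missed = simulate(threshold)
        return accuracy(hint_precision(justified, unjustified),
                        hint_recall(justified, missed))
    return max(candidates, key=score)


# Toy stand-in for the simulation: made-up counts per threshold value.
toy_counts = {0.4: (30, 40, 5), 0.6: (28, 12, 7), 0.8: (15, 5, 20)}
print(best_threshold(toy_counts, toy_counts.get))  # prints 0.6
```

Running the search once for FACT-WrongMove with FACT-CorrectMove fixed at 0.5, and then once for FACT-CorrectMove with FACT-WrongMove fixed at its best value, mirrors the two-step order described in the text.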

Table 21: Summary of hinting strategy’s thresholds refinement

Prior setting   CF-CorrectMove   CF-WrongMove   Hint Precision   Hint Recall
Population      0.72             0.8            55.2%            56.2%
Generic         0.88             0.76           40.6%            94.2%
User-specific   0.68             0.44           92.8%            95.1%



Table 22: Descriptive statistics of the hinting strategy’s thresholds refinement

Prior setting   Total number of given hints   Average number of given hints   Std. number of given hints
Population      6703                          148                             34
Generic         8024                          178                             45
User-specific   6556                          145                             34

Figure 27: FACT-WrongMove threshold refinement for the population prior probabilities

Figure 28: FACT-CorrectMove threshold refinement for the population prior probabilities

Figure 29: FACT-WrongMove threshold refinement for the generic prior probabilities

Figure 30: FACT-CorrectMove threshold refinement for the generic prior probabilities

Figure 31: FACT-WrongMove threshold refinement for the user-specific prior probabilities

Figure 32: FACT-CorrectMove threshold refinement for the user-specific prior probabilities

5 Model Precision and Sensitivity

In Sections 3 and 4, two measures of the effectiveness of the intervention (hinting) mechanism in Prime Climb, hint precision and hint recall, were calculated. In contrast, the main objective of the current section is to quantify the ability of the student’s model to detect the level of the player’s number factorization skills during the interaction with Prime Climb. As in the calculation of hint precision and hint recall, there is no ground truth on how the number factorization knowledge evolves during the interaction with the game between the pre-test and the post-test, so only numbers with the same score in the pre-test and post-test were considered. To this end, four measures were defined: 1) model positive precision, 2) model negative precision, 3) model sensitivity and 4) model specificity. Before formulating these measures, some terminology needs to be defined. In the following definitions, a “known/unknown factorization skill” refers to a factorization skill on which the student keeps the same score from the pre-test to the post-test, having correctly/wrongly answered the skill’s corresponding question in both tests.

• True-Positive: The student’s model correctly assesses a known factorization skill as known to the student during the game-play.

• False-Positive: The student’s model wrongly assesses an unknown factorization skill as known to the student during the game-play.

• True-Negative: The student’s model correctly assesses an unknown factorization skill as unknown to the student during the game-play.

• False-Negative: The student’s model wrongly assesses a known factorization skill as unknown to the student during the game-play.

Given the above definitions, model positive precision, model negative precision, model sensitivity and model specificity are formulated as follows:

Equation 3: Model Positive Precision

$$\text{model positive precision} = \frac{\#\text{ of True-Positive}}{\#\text{ of True-Positive} + \#\text{ of False-Positive}}$$

Equation 4: Model Negative Precision

$$\text{model negative precision} = \frac{\#\text{ of True-Negative}}{\#\text{ of True-Negative} + \#\text{ of False-Negative}}$$

Equation 5: Model Sensitivity

$$\text{model sensitivity} = \frac{\#\text{ of True-Positive}}{\#\text{ of True-Positive} + \#\text{ of False-Negative}}$$

Equation 6: Model Specificity

$$\text{model specificity} = \frac{\#\text{ of True-Negative}}{\#\text{ of True-Negative} + \#\text{ of False-Positive}}$$
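As a worked illustration of Equations 3-6, the sketch below computes the four measures from confusion-matrix counts. The function name `model_measures` is an assumption made for illustration; the counts are the ones reported for the generic prior setting in Table 26.

```python
def model_measures(tp, fp, tn, fn):
    """Compute the four model measures from confusion-matrix counts."""
    return {
        "positive_precision": tp / (tp + fp),  # Equation 3
        "negative_precision": tn / (tn + fn),  # Equation 4
        "sensitivity": tp / (tp + fn),         # Equation 5
        "specificity": tn / (tn + fp),         # Equation 6
    }


# Counts reported in Table 26 (generic prior setting).
generic = model_measures(tp=2491, fp=386, tn=171, fn=592)
print({name: round(value, 3) for name, value in generic.items()})
```

With these counts the four values agree with those listed in Table 27 to within about 0.001.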

5.1 Simulation of Interactions of the Users with Prime Climb

Log files of the interactions of 45 students in grades 5 and 6 with Prime Climb were parsed to simulate the movements the students made during the game-play. In total, 8666 movements were extracted from the log files. A post-processing filter then excluded the movements in which neither the player’s number nor the partner’s number kept the same score from the pre-test to the post-test, leaving 3203 movements with at least one number with the same status in the pre-test and post-test. These movements were classified into 16 possible groups on the basis of the status of the player’s number and the partner’s number in the pre-test and post-test. Figure 33 shows the percentage frequency distribution of the statuses. Overall, there are 3083 (84.6%) and 559 (15.4%) data points that represent numbers with the status of KK and UU respectively.

Figure 33: Percentage frequency of each movement type in the total movements made by the players

The objective of the simulation was to evaluate how accurately the model could assess the level of factorization knowledge for the numbers with the status of KK and UU after each movement. To this end, we used two thresholds, FACT-CorrectMove and FACT-WrongMove. The former is the cut-off for evaluating a number factorization skill as known (above the threshold) or unknown (below the threshold) after a correct movement, and the latter is the cut-off for evaluating a number factorization skill as known (above the threshold) or unknown (below the threshold) after a wrong movement. The initial values used for these two thresholds were 0.5 (for FACT-CorrectMove) and 0.8 (for FACT-WrongMove). These values are identical to the original values used for the thresholds in the hinting strategy (see Table 3). We counted the True-Positive, True-Negative, False-Positive and False-Negative cases over all the students, formed the confusion matrix and used Equations 3-6 to calculate the model positive precision, model negative precision, model sensitivity and model specificity. The structure of the confusion matrix is shown in Table 23.

Table 23: Structure of the confusion matrix

                          Unknown               Known
Pre-PostTest   Known      False Negative (FN)   True Positive (TP)
               Unknown    True Negative (TN)    False Positive (FP)
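The way each assessment falls into one cell of the confusion matrix can be sketched as follows. This is an illustrative reading of the procedure, not the report's implementation: the input names are hypothetical, and a probability exactly equal to the threshold is treated as "unknown", which the text does not specify.

```python
from collections import Counter

FACT_CORRECT_MOVE = 0.5  # initial cut-off after a correct movement
FACT_WRONG_MOVE = 0.8    # initial cut-off after a wrong movement


def classify(probability, move_correct, truly_known):
    """Label one model assessment as TP, FP, TN or FN.

    `probability` is the model's belief that the skill is known,
    `move_correct` says whether the movement was correct, and
    `truly_known` is True for status KK and False for status UU.
    """
    threshold = FACT_CORRECT_MOVE if move_correct else FACT_WRONG_MOVE
    assessed_known = probability > threshold
    if truly_known:
        return "TP" if assessed_known else "FN"
    return "FP" if assessed_known else "TN"


def confusion_counts(observations):
    """Tally the four cells over (probability, move_correct, truly_known)."""
    return Counter(classify(*obs) for obs in observations)


# Made-up observations, for illustration only.
sample = [(0.9, True, True), (0.6, False, True),
          (0.3, True, False), (0.85, False, False)]
print(confusion_counts(sample))
```

The second sample shows the effect of the stricter cut-off after a wrong movement: a belief of 0.6 would count as "known" after a correct movement but counts as "unknown" here.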



5.1.2 Generic Prior Setting

Table 26 shows the confusion matrix when the generic prior probabilities were used to initialize the student’s model. Similar to the population prior probabilities setting, the low percentage of TrueNegative (4.7% out of 15.3%) indicates that the model with the generic prior probabilities has difficulty detecting the factorization skills that are “unknown” to the student during the game-play; consequently, a low model negative precision and a low model specificity are expected. The student’s model performed best at evaluating the “known” skills as “known” (68.4% out of 84.7%). Table 27 presents the values for the four measures of model positive precision, model negative precision, model sensitivity and model specificity. Figure 35 illustrates the percentages of the elements of the confusion matrix (TruePositive, FalsePositive, TrueNegative, FalseNegative) for each relevant status.

Table 26: Confusion Matrix (# of raw data points and [percentages]) for the generic priors

                          (Generic-based Prior)
                          Unknown            Known               Total
Pre-Post Test   Known     592 [16.3%] (FN)   2491 [68.4%] (TP)   3083 [84.7%]
                Unknown   171 [4.7%] (TN)    386 [10.6%] (FP)    557 [15.3%]
Total                     763 [21.0%]        2877 [79.0%]        3640 [100%]

Table 27: Summary of the results on the model analysis for the generic priors setting

Prior Setting: Generic-based

Measures   Model Positive Precision   Model Negative Precision   Model Sensitivity   Model Specificity
Values     0.866                      0.225                      0.808               0.307

Figure 35: Frequency (%) of the elements of the confusion matrix for each relevant status



5.1.3 User-specific Prior Setting

Table 28 gives the confusion matrix when the user-specific prior probabilities are used to initialize the student’s model in Prime Climb. The low percentages of FalseNegative (4.3% / 84.7%) and FalsePositive (2.1% / 82.5%) and the high percentages of TrueNegative (13.2% / 17.5%) and TruePositive (80.4% / 84.7%) provide evidence that the student’s model initialized with the user-specific prior probabilities performs well in assessing “known” skills as “known” and “unknown” skills as “unknown”. Table 29 presents the values for the four measures of model positive precision, model negative precision, model sensitivity and model specificity. Figure 36 also illustrates the percentages of the elements of the confusion matrix in their relevant statuses.

Table 28: Confusion Matrix (# of raw data points and [percentages]) for the user-specific prior setting

                          (User-specific-based Prior)
                          Unknown            Known               Total
Pre-Post Test   Known     156 [4.3%] (FN)    2927 [80.4%] (TP)   3083 [84.7%]
                Unknown   480 [13.2%] (TN)   77 [2.1%] (FP)      557 [15.3%]
Total                     [17.5%]            [82.5%]             3640 [100%]

Table 29: Summary of the results on the model analysis for the user-specific priors setting

Prior Setting: User-specific-based

Measures   Model Positive Precision   Model Negative Precision   Model Sensitivity   Model Specificity
Values     0.975                      0.755                      0.95                0.862

Figure 36: Frequency (%) of the elements of the confusion matrix for each relevant status



6 Comparison of the Model’s Performance for Different Prior Probabilities

In the previous section we showed how the student’s model performs when the different prior probabilities settings are used to initialize it. In this section, the effects of using different prior probabilities settings on the number of TruePositive, FalsePositive, TrueNegative and FalseNegative cases and on the model positive precision, model negative precision, model sensitivity and model specificity are analyzed statistically.

6.1 Total Number of True-Negative

The Welch test showed a statistically significant difference in the number of TrueNegative among the different groups of prior settings (p<0.05). Table 30 shows the descriptive statistics on TrueNegative. Table 31 presents the results of the subsequent Games-Howell test. The results showed no statistically significant difference in the total number of TrueNegative between the population and generic prior probabilities settings, while there was a statistically significant difference between the user-specific prior probabilities setting and the other two settings.

Table 30: Descriptive statistics on True Negative


       Population Prior   User-Specific Prior   Generic Prior
Mean   4.1                16                    5.7


Table 31: Games-Howell test result (Dependent variable: True Negative)

Games-Howell Test: Comparison

Prior Probabilities   Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
User-specific         Generic               .002             *


6.2 Total Number of False-Negative

The Welch test showed a statistically significant difference in the number of FalseNegative among the different groups of prior settings (p<0.05). Table 32 shows the descriptive statistics on the total number of FalseNegative. Table 33 presents the results of the Games-Howell tests. The results showed a statistically significant difference between all three groups of prior probabilities settings in the total number of FalseNegative.

Table 32: Descriptive statistics on False Negative


       Population Prior   User-Specific Prior   Generic Prior
Mean   6.61               3.54                  13.45




Table 33: Games-Howell test result (Dependent variable: False-Negative)

Games-Howell Test: Comparison

Prior Probabilities   Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
Population            Generic               .001             *
User-specific         Population            .030             *
User-specific         Generic               .000             *


Figure 37: Average of TrueNegative for the different prior probabilities settings

Figure 38: Total TrueNegative for each student



Figures 39 and 40 respectively illustrate the average number of FalseNegative and the total

FalseNegative for each student for each prior probabilities setting.

Figure 39: Average of FalseNegative for the different prior probabilities settings

Figure 40: Total number of FalseNegative for each student when different prior probabilities were used

6.3 Comparison of Total Number of True Positive

After the test of homogeneity of variance (Levene statistic) found no significant difference between the variances of the three prior probabilities settings, a traditional single-factor ANOVA showed no statistically significant difference in the total number of TruePositive among the different groups of prior settings (F(2,129)=0.63, p=0.531758>0.05). Table 34 reports the mean of the total TruePositive for the different settings. Figures 41 and 42 illustrate, respectively, the average of the TruePositive and the total number of TruePositive of each student for the different prior probabilities settings.

Table 34: Descriptive statistics on True Positive


       Population Prior   User-Specific Prior   Generic Prior
Mean   63.45              66.52                 56.61


Figure 41: Average of TruePositive

Figure 42: Total number of TruePositive of each student for each prior probabilities setting



Figure 44: The total number of FalsePositive of each student for each prior probabilities setting

6.5 Comparison of Model Positive Precision

The Welch test showed a statistically significant difference in the model positive precision among the different groups of prior probabilities settings (p<0.05). Table 37 shows the descriptive statistics on the model positive precision. Table 38 presents the result of the Games-Howell post-hoc test. The results showed no statistically significant difference in model positive precision between the population and generic prior probabilities settings. Furthermore, there was a statistically significant difference in the model positive precision between the user-specific prior probabilities setting and the other two settings. Figures 45 and 46 illustrate, respectively, the average (in percentage) of the model positive precision and the model positive precision for each student for the different prior probabilities settings.

Table 37: Descriptive statistics on Model Positive Precision


       Population Prior   User-Specific Prior   Generic Prior
Mean   84.68              96.87                 84.57


Table 38: Games-Howell test results (Dependent variable: Model Positive Precision)

Games-Howell Test: Comparison

Prior Probabilities   Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
Population            Generic               1.00
Generic               Population            1.000




Figure 45: Average of the model positive precision

Figure 46: model positive precision for each student for different prior probabilities settings

6.6 Comparison of the Model Negative Precision

The Welch test showed a statistically significant difference in the model negative precision among the different groups of prior settings (p<0.05). Table 39 shows the descriptive statistics on the model negative precision. Table 40 presents the results of the Games-Howell test. The results showed no statistically significant difference in the model negative precision between the population and generic settings. Also, there was a statistically significant difference between the user-specific prior probabilities setting and the other two settings. Figures 47 and 48 illustrate, respectively, the average (in percentage) of the model negative precision and the model negative precision for each student for each prior probabilities setting.

Table 39: Descriptive statistics on Model Negative Precision


                     Population Prior   User-Specific Prior   Generic Prior
Mean                 38.73              75.91                 32.94
Standard Deviation   38.52              26.45                 36.25

Table 40: Games-Howell test results (Dependent variable: Model Negative Precision)

Games-Howell Test: Comparison

Prior Probabilities   Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
Population            Generic               .821
Generic               Population            .821


Figure 47: Average of the Model Negative Precision



Figure 48: Model Negative Precision for each student and each prior probabilities setting

6.7 Comparison of the Model Sensitivity

The Welch test showed a statistically significant difference in the model sensitivity among the different groups of prior settings (p<0.05). Table 41 shows the descriptive statistics on the model sensitivity. Table 42 gives the result of the Games-Howell test. The results showed a statistically significant difference among all settings of the prior probabilities. Figures 49 and 50 illustrate, respectively, the average (in percentage) of the model sensitivity and the model sensitivity for each student for each prior probabilities setting.

Table 41: Descriptive statistics on Model Sensitivity


       Population Prior   User-Specific Prior   Generic Prior
Mean   90.5               95.26                 80.93


Table 42: Games-Howell test result (Dependent variable: Model Sensitivity)

Games-Howell Test: Comparison

Prior Probabilities   Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
Population            User-specific         .002             *
Population            Generic               .000             *
User-specific         Generic               .000             *




Figure 52: Model specificity for each student for the different prior probabilities settings

7 Preliminary Analysis on Pre-Post Tests

As shown in Section 4, the original values for the thresholds in the hinting mechanism in Prime Climb resulted in low hint precision and hint recall in the population and generic prior settings. In contrast, initializing the student’s model with the user-specific prior probabilities resulted in high hint precision and hint recall. Moreover, as already discussed in Section 3, in measuring the hint precision and hint recall we had to consider solely the movements involving at least one number (player’s number or partner’s number) that appears on the pre-test and post-test and for which the student gives the same answer to the number’s corresponding question in both tests. Under this constraint, only a small share of the hints given to the student (Mean: 29.3%, Std: 9.96%) could be considered in calculating the hint precision and hint recall. This could negatively affect the values of hint precision and hint recall when the user-specific prior is used, because the prior probabilities are only set for the nodes in the BN whose corresponding numbers appear on the pre-test and post-test; the prior probabilities of the other nodes are set to 0.5, which equals the prior probability used in the generic prior setting. To investigate the possibility of such a negative effect, we calculated some preliminary descriptive statistics on the numbers appearing on the pre-test and post-test. Table 45 lists the numbers that appear most frequently in the movements and whether or not they appear on the pre-test and post-test (Y: Yes, N: No).

Table 45: Numbers (top 15) with highest frequency of appearance in the movements

Number        17   25   76   4    27   40   81   89   99   97   96   19   37   31   9
Frequency     713  644  578  554  515  498  463  439  412  407  391  373  366  345  325
In pretest?   N    Y    N    N    Y    N    Y    Y    N    Y    N    N    N    Y    Y



It follows that more than 50% (8 out of 15) of the most frequently visited numbers do not appear on the pre-test and post-test. Tables 46 and 47 show the most frequently visited numbers in correct and wrong movements respectively. 60% of the most frequently visited numbers in the correct movements do not appear on the pre-test and post-test. The situation is worse for the wrong movements (73%).

Table 46: Numbers with highest frequency of visit in the correct movements

Number        17   25   76   89   27   4    97   81   19   37   31   13   99   40   71
Frequency     713  580  492  439  427  420  407  399  367  362  345  310  308  304  283
In pretest?   N    Y    N    Y    Y    N    Y    Y    N    N    Y    N    N    N    N

Table 47: Numbers with highest frequency of visit in the wrong movements

Number        40   57   18   96   4    15   99   36   50   21   9    33   27   76   69
Frequency     194  145  143  142  134  108  104  100  95   94   91   89   88   86   74
In pretest?   N    N    N    N    N    Y    N    N    N    N    Y    Y    Y    N    N
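The shares quoted above can be checked directly from the "In pretest?" rows of Tables 45, 46 and 47. The short sketch below does so; the helper name `share_not_in_pretest` is an assumption made for illustration.

```python
def share_not_in_pretest(flags):
    """Return the fraction of 'N' entries in a row of Y/N flags."""
    entries = flags.split()
    return entries.count("N") / len(entries)


# "In pretest?" rows copied from Tables 45, 46 and 47.
all_moves = "N Y N N Y N Y Y N Y N N N Y Y"
correct_moves = "N Y N Y Y N Y Y N N Y N N N N"
wrong_moves = "N N N N N Y N N N N Y Y Y N N"

for label, row in [("all", all_moves),
                   ("correct", correct_moves),
                   ("wrong", wrong_moves)]:
    print(label, f"{share_not_in_pretest(row):.1%}")
# all 53.3%
# correct 60.0%
# wrong 73.3%
```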

8 Conclusion and Future work

This manuscript reports the results of the refinement of the student’s model parameters and the analysis of the intervention mechanism and the student’s model used in Prime Climb. The highest accuracy in predicting the students’ performance in the post-test, conducted after the students interacted with Prime Climb, is 75.5%, obtained when the population prior setting is used. It was also found that the hint precision and hint recall were very low when the population and generic prior settings were used. In contrast, these values were high when the user-specific prior setting was used, and there was a significant difference in the total number of justified, unjustified and missed hints between the user-specific prior probabilities setting and the other two settings, while in all cases (except for the total number of correctly not-given hints) there was no significant difference between the population and the generic prior probabilities settings. Furthermore, the student’s model initialized with the user-specific prior probabilities setting resulted in higher model positive precision, model negative precision, model sensitivity and model specificity.

As for future work, we would like to concentrate on the situations that negatively affect the model’s specificity and model negative precision and investigate whether they follow specific patterns. The other focus will be on finding the most appropriate time to intervene, as it was shown that, although the student is interrupted too often during the interaction with the game and provided with hints, the hint precision and hint recall are very low when the population and
