

ARTICLE

An Educational System for Learning Search Algorithms and Automatically Assessing Student Performance

Foteini Grivokostopoulou 1 & Isidoros Perikos 1 & Ioannis Hatzilygeroudis 1

Published online: 22 June 2016
© International Artificial Intelligence in Education Society 2016

Abstract In this paper, we first present an educational system that assists students in learning and tutors in teaching search algorithms, an artificial intelligence topic. Learning is achieved through a wide range of learning activities. Algorithm visualizations demonstrate the operational functionality of algorithms according to the principles of active learning. So, a visualization process can stop and request that a student specify the next step or explain the way that a decision was made by the algorithm. Similarly, interactive exercises assist students in learning to apply algorithms in a step-by-step interactive way. Students can apply an algorithm to an example case, specifying the algorithm's steps interactively, with the system's guidance and help, when necessary. Next, we present assessment approaches integrated in the system that aim to assist tutors in assessing the performance of students, reduce their marking workload and provide immediate and meaningful feedback to students. Automatic assessment is achieved in four stages, which constitute a general assessment framework. First, the system calculates the similarity between the student's answer and the correct answer using the edit distance metric. In the next stage, it identifies the type of the answer, based on an introduced answer categorization scheme related to the completeness and accuracy of an answer, taking student carelessness into account too. Afterwards, the types of errors are identified, based on an introduced error categorization scheme. Finally, the answer is automatically marked via an automated marker, based on its type, the edit distance and the types of errors made. To assess the learning effectiveness of the system, an extended evaluation study was conducted in real class conditions. The experiment showed very encouraging results. Furthermore, to evaluate the performance of the assessment system, we compared the assessment mechanism against expert (human) tutors. A total of 400 students' answers were assessed by three tutors and the results showed a very good agreement between the automatic assessment system and the tutors.

Int J Artif Intell Educ (2017) 27:207–240
DOI 10.1007/s40593-016-0116-x

* Ioannis Hatzilygeroudis
  [email protected]

1 Department of Computer Engineering & Informatics, School of Engineering, University of Patras, 26504 Patras, Hellas (Greece)

Keywords Artificial intelligence curriculum · Search algorithms · Automated assessment · Intelligent tutoring system · Algorithm visualization

Introduction

Over the last decade, the web has changed the way that educational content and learning processes are delivered to students. It constitutes a new means for education and training, which is growing rapidly worldwide, giving new possibilities and offering better, more efficient and more intensive learning processes. Intelligent Tutoring Systems (ITSs) constitute a generation of computer-based educational systems that encompass intelligence to increase their instructional effectiveness. The main characteristic of ITSs is that they can adapt the educational tasks and learning processes to individual students' needs in order to maximize their learning. This is mainly accomplished by utilizing Artificial Intelligence methods to represent the pedagogical decisions they make and the knowledge regarding the domain they teach, the learning activities, the students' characteristics and their assessment (Shute and Zapata-Rivera 2010). So, ITSs constitute a popular type of educational system and are becoming a main means of education delivery, leading to an impressive improvement in student learning (Aleven et al. 2009; VanLehn 2006; Woolf 2010).

Assessment constitutes a fundamental aspect of ITSs. The aim of assessment is to provide a measure of students' comprehension and performance and to assist both the educational system and the learner to get a deeper insight into his/her knowledge level and gaps (Jeremić et al. 2012; VanLehn 2008). Educational systems are becoming increasingly effective at assessing the knowledge level of students, utilizing systematic assessment and marking methods (Baker and Rossi 2013; Martin and VanLehn 1995; Pavlik et al. 2009). In such systems, assessment mechanisms are vital and can help tutors know how well the students have understood various concepts and monitor students' performance and class learning progress. More accurate assessments can lead to tutoring that is more adaptive to individual students and thus to more effective learning (Siler and VanLehn 2003). Assessments can determine how well students are learning on a continuous basis and assist in taking necessary corrective actions as soon as possible to improve student learning (Mehta and Schlecht 1998). Therefore, systematic assessment and marking mechanisms should be an integral part of any e-learning system (Kwan et al. 2004). In general, the assessment of students' performance via their answers to exercises is considered a complex and time-consuming activity that makes tutors give up valuable time, which they could devote to other educational tasks. On the other hand, providing systematic assessment manually, even for a small class, cannot ensure that tutor feedback will be as instant as in one-to-one tutoring (Ihantola et al. 2010). Manual assessment of students' performance on exercises can delay the delivery of feedback by tutors to students for days or even weeks. So, in some cases, tutors may even have to reduce the number of assignments given to their students, due to lack of time. Especially in large-scale courses, accurate and meaningful assessment is a very demanding task for tutors. Also, accuracy is usually difficult to achieve, due to subjective and objective reasons. Automatic assessment can secure consistency in students' assessment, since all exercises are evaluated based on exactly the same criteria, and also all assessments and marks awarded can be explained instantly and, most of all, deeply and in detail to the students (Suleman 2008). Therefore, the creation of mechanisms for automatic assessment is quite desirable.

Automatic assessment systems can assist tutors in evaluating students' work and also enable more regular and prompt feedback (Barker-Plummer et al. 2008; Charman and Elmes 1998; Gouli et al. 2006). It is commonly acknowledged by tutors that students' learning is enhanced by frequent assessment and proper feedback (Shepard 2005). While learning in educational systems, timely feedback is essential to students, and automated marking mechanisms can give the opportunity to provide feedback on students' work in progress (Falkner et al. 2014).

One of the major challenges that teachers face in teaching computer science courses is the difficulty associated with teaching programming and algorithms, which are considered difficult domains for students (Jenkins 2002; Lahtinen et al. 2005; Watson and Li 2014). The Artificial Intelligence (AI) course is an important course in the computer science discipline. Among the fundamental topics of the curriculum of an AI course is "search algorithms", including blind and heuristic search algorithms. It is vital for students to get a strong understanding of the way search algorithms work and also of their application to various problems. In general, search algorithms are complicated, and many students have particular difficulties in understanding and applying them.

Usually, in an AI course, the tutor creates and gives a set of assignments, asking the students to provide their hand-made solutions. Then, the tutor has to mark all students' answers, present the correct ones and discuss common errors. This process is time demanding for the tutor, particularly when the number of answers is large. On the other hand, educational systems with graphical and interactive web-based tools are more appealing to students than the traditional way of doing exercises (Naps et al. 2002; Olson and Wisher 2002; Sitzmann et al. 2006). Therefore, we developed the Artificial Intelligence Teaching System (AITS), an ITS that can assist tutors in teaching and students in learning about, among others, search algorithms (Grivokostopoulou and Hatzilygeroudis 2013a). It supports study of their theoretical aspects, provides visualizations demonstrating the way that different algorithms function and also assists students in applying the algorithms via various interactive exercises and learning scenarios. Furthermore, an automatic assessment mechanism has been developed and integrated into AITS, which can help tutors reduce the time spent in marking and use this time efficiently for more creative tasks and personal contact with the students. Also, during students' practice with interactive exercises, the automatic assessment mechanism instantly evaluates students' actions and provides meaningful feedback. With the use of automatic assessment, all students' answers are assessed in a consistent manner; students can get their marks and feedback immediately after the submission of their answers.

The contributions of this paper are as follows. First, it introduces the use of interactive step-based visualizations (or, in other words, visualized animations) of algorithmic operations in teaching and learning concepts of search algorithms in the context of an ITS. The aim is to achieve more effective learning by actively involving students in interactive visualizations through interactive exercises. As far as we are aware, there are no other similar efforts. Second, it introduces an automatic assessment mechanism for assessing students' performance on exercises related to search algorithms. The mechanism takes into account the similarity between a student's answer and the correct answer, the type of the answer in terms of completeness and accuracy, as well as possible carelessness or inattention errors. The aim is to obtain a consistent and reliable assessment mechanism that assures more effective feedback. As far as we are aware, there are no other efforts that provide such a systematic assessment approach (based on a similarity measure, a systematic error categorization, a systematic answer categorization and an automated marker) that can generalize to other domains. Both contributions are validated via experiments. For the first contribution, a pre-test/post-test and experimental/control group approach has been used. For the second, linear regression and classification metrics have been used.

The rest of the paper is organized as follows: Section 2 presents related work on educational systems for teaching algorithms and also work on automated assessment methodologies and tools that have been developed. Section 3 presents the Artificial Intelligence Teaching System (AITS), illustrating its architecture and analyzing its functionality. Section 4 presents the automatic assessment mechanism, describes the way it analyzes and evaluates students' answers and presents the provided feedback. Section 5 presents the experimental studies conducted and discusses the results. Finally, Section 6 concludes the paper and provides directions for future work.

Related Work

Many research efforts and educational systems have been developed to assist teaching and learning in the domain of algorithms. PATHFINDER (Sánchez-Torrubia et al. 2009) is a system developed to assist students in actively learning Dijkstra's algorithm. The highlighting feature provided by that tool is the animated algorithm visualization panel. It shows, on the code, the current step the student is executing and also where a user's mistake occurs within the algorithm's run. TRAKLA2 (Malmi et al. 2004) is a system for automatically assessing visual algorithm simulation exercises. TRAKLA2 provides automatic feedback and grading and allows resubmission, which means that students can correct their errors in real time. In (Kordaki et al. 2008), an educational system called Starting with Algorithmic Structures (SAS), designed for teaching concepts of algorithms and basic algorithmic structures to beginners, is presented. Students come across the implementation of algorithms in real-life scenarios and the system offers feedback to correct their answers. The work presented in (Lau and Yuen 2010) attempts to examine whether gender and learning styles can be used to associate mental models in learning a sorting algorithm. Results indicate that the mental models of females are more similar to the expert referent structure and that concrete learners have a higher similarity in their mental models with the expert ones than abstract learners. Those findings can be utilized in designing educational processes and learning activities for assisting students in learning algorithms. In a previous work of ours (Grivokostopoulou et al. 2014a), aspects of an educational system used in the context of an AI course are presented and the AI techniques used for adapting learning to students are described. None of the above efforts, except TRAKLA2, provides any mechanism for automatic assessment, although they offer some kind of visualization.


Also, recently, there have been research studies and systems that support automatic assessment of students in various domains. Marking of students' answers to exercises, and hence assessment of their performance, is necessary to scaffold an effective learning process in an educational system. Also, in line with the delivery of appropriate formative feedback, it can enhance a student's knowledge construction (Clark 2012; Nicol and Macfarlane-Dick 2006; Heffernan and Heffernan 2014). Assessment is necessary to update a student's model and characteristics, specify the student's knowledge level and trace misconceptions and knowledge gaps for both individual students and the class. Indeed, studies from cognitive and educational psychology indicate correlations between self-assessment and learning outcomes, pointing out that students who are aware of their own learning more accurately and timely tend to have better learning outcomes (Chi et al. 1989; Long and Aleven 2013; Winne and Hadwin 1998). However, automatic assessment in general is considered to be domain dependent, which means that it necessitates knowledge of the domain's main principles, concepts and constraints. So, a challenging research direction is the specification of a general framework for automated assessment. In this paper, we have made a step towards this direction (see the section on the automatic marking mechanism and Fig. 7).

The field where automatic assessment is widely used is computer science and especially computer programming (Douce et al. 2005; Alemán 2011; Ala-Mutka 2005). There are various systems that include a mechanism for automated assessment, to mark student programming exercises and provide feedback, such as ASSYST (Jackson and Usher 1997), BOSS (Joy et al. 2005), GAME (Blumenstein et al. 2008), CourseMarker (Higgins et al. 2005), AutoLEP (Wang et al. 2011) and Autograder (Helmick 2007). ASSYST utilizes a scheme that analyzes students' programming submissions across a number of criteria and specifies whether submissions are correct by comparing the operation of a program to a set of predefined test data. Also, the analysis aims to specify its efficiency and whether it has sensible metric scores that correspond to complexity and style. CourseMarker evaluates students' programming assignments in various programming languages, such as C and C++. It automatically provides feedback to students and reports to instructors regarding students' performance. Its basic marking method relies on typographic, feature, and dynamic execution tests of the students' exercises. CourseMarker also supports the formative aspects of assessment, allowing students to have their program graded at frequent intervals prior to submission. In order for this to be feasible, the profile of the program is constrained by measuring its attributes and its functionality in order to arrive at a grade. AutoLEP is a learning environment for the C programming language that integrates a grading mechanism to mark students' programs, which combines static analysis with dynamic testing to analyze those programs. It utilizes the similarity of students' and teacher's solutions and also provides feedback regarding compiler errors and failed test cases. To do so, it creates a graph representation of a student's program and compares it with a set of correct model programs. BOSS supports the assessment of student exercises by collecting submissions, performing automatic tests for correctness and quality, checking for plagiarism, and providing an interface for marking and delivering feedback. It provides automatic assessment and also assists the lecturer in obtaining a higher degree of accuracy and consistency in marking. Also, it provides administrative and archiving functionalities. In general, BOSS is conceived as a summative assessment tool and, although it supports feedback for students, its primary function is to assist in the process of accurate assessment. QuizPACK (Brusilovsky and Sosnovsky 2005) is a good example of a system that assesses program evaluation skills. QuizPACK generates parameterized exercises for the C language and automatically evaluates the correctness of student answers. For the assessment, QuizPACK utilizes simple code transformations to convert a student's exercise code into a function that takes the parameter value and returns the value to be checked. Also, it provides guidance to the users and assists them in selecting the most useful exercises in order to advance their learning goals and level of knowledge. QuizJET (Hsiao et al. 2008) is a system for teaching the Java programming language supporting automatic assessment of parameterized online quizzes and questions.

In (Higgins and Bligh 2006), an approach to conducting formative assessment of student coursework within diagram-based domains using Computer Based Assessment (CBA) technology is presented. Also, in (Thomas et al. 2008), an automatic marking tool for learning and assessing graph-based diagrams, such as Entity-Relationship Diagrams (ERDs) and Unified Modeling Language (UML) diagrams, is presented. It specifies the minimal meaningful units of the diagrams, automatically marks student answers based on similarity values with the correct answer and also provides dynamically created feedback to guide students.

Graph similarity methods are quite often utilized to analyze and assess exercises and student answers. In (Naudé et al. 2010), graph similarity measures are used to assess program source code directly, by comparing the structural similarity between a student's submissions and the already marked solutions, relying on the principle that similar vertices should have similar neighbors. In (Barker-Plummer et al. 2012), edit distance is used as an approach to analyze and characterize the errors in student formalization exercises. They report that edit distance is quite promising in examining the type and the nature of student errors. In (Stajduhar and Mausa 2015), the authors mark student SQL exercises and statements by comparing the similarity of a student's SQL statements with reference statement pairs, utilizing methods such as Euclidean and Levenshtein word distance. The obtained results show that string metrics are greatly promising, given that they contribute to the overall predictive accuracy of the assessment method. Also, in (Vujošević-Janičić et al. 2013), tools for objective and reliable automated grading of introductory programming courses are presented, where substantial and comprehensible feedback is provided too. The authors present two methods that can be used for improving automated evaluation of students' programs. The first is based on software verification and the second on control flow graph (CFG) similarity measurement. Both methods can be used for providing feedback to students and for improving automated grading for teachers. The authors report quite interesting results regarding the performance of the automatic assessment tools. Although the above efforts use the notion of edit distance in assessing the similarity between the student answer and the correct answer and in characterizing errors, they do not take carelessness errors into account and do not do it in a systematic way, as we do. Also, in WADEIn II (Brusilovsky and Loboda 2006), which is a Web-based visualization tool for the C language, adaptive visualization and textual explanations are utilized in order to portray the process of expression evaluation.

In the context of the AI course in our university, we have developed mechanisms that automatically mark exercises related to symbolic logic. AutoMark-NLtoFOL (Perikos et al. 2012) is a web-based system that automatically marks student answers to exercises related to converting natural language sentences into First Order Logic (FOL) formulas. AutoMark-NLtoFOL provides students with an environment for practicing and assessing their performance on converting natural language sentences into FOL and also for improving their performance by providing personalized feedback. In (Grivokostopoulou et al. 2012), a system that automatically marks students' answers to exercises on converting FOL to clause form (CF) is presented. Both marking approaches utilize a domain error categorization to detect errors in students' answers, mark them and provide proper feedback.

However, as far as we are aware, there are no works in the literature that have developed mechanisms or methodologies to assess students' answers to interactive exercises on search algorithms, apart from two works of ours. In (Grivokostopoulou and Hatzilygeroudis 2013b, c), two methods for the automated assessment of student answers to exercises related to search algorithms are presented. The methods and the tools developed can assist tutors in their assessment tasks and also provide immediate feedback to students concerning their performance and the errors made.

Artificial Intelligence Teaching System (AITS)

The Artificial Intelligence Teaching System (AITS) is an intelligent tutoring system that we have developed in our department for helping students in learning and tutors in teaching AI topics, a basic one being 'search algorithms'. The architecture of the system is illustrated in Fig. 1. It consists of six main units: Student Interface, Tutor Interface, Automatic Assessment, Test Generator, Learning Analytics and the Domain Knowledge & Learning Objects.

During a student's interaction with the system, e.g. while dealing with an interactive exercise or a test, his/her answer(s) is (are) forwarded to the Automatic Assessment unit.

Fig. 1 An overview of AITS architecture


The Automatic Assessment unit consists of three main parts: the Error Detection Mechanism, the Automatic Marking Mechanism and the Feedback Mechanism. The error detection mechanism is used to analyze the student's answers, detect the errors made and characterize the student's answers in terms of completeness and accuracy. After that, it interacts with the automatic marking mechanism, which is used to calculate the mark for each student's answer to an exercise and also specify the overall student's score on a test. The feedback mechanism is used to provide immediate and meaningful feedback to the student regarding the score achieved and the errors made on individual exercises or a test.

A test in AITS is generated in a user-adapted mode by the Test Generator unit. The test generator unit utilizes a rule-based expert system for making decisions on the difficulty level of the exercises to be included in the test (Hatzilygeroudis et al. 2006), so that it is adapted to the knowledge level and needs of the student. Created tests consist of a number of exercises that examine different aspects of (blind and/or heuristic) search algorithms.

From the tutor's perspective, a tutor can also connect and interact with the system through the Tutor Interface. The tutor can manage the educational content and the learning activities in the system, add new exercises and examples and also edit or even delete existing ones. For this purpose, an exercise generation tool (Grivokostopoulou and Hatzilygeroudis 2015) has been developed and embedded into AITS, aiming to assist tutors in creating new exercises in a semi-automatic way. The Learning Analytics unit aims at assisting the tutor in monitoring students' activities and supervising their learning performance and progress. It provides tutors with general information regarding a student's learning progress and shows statistics and common errors that students make. Finally, the Domain Knowledge and Learning Objects unit represents concepts related to search algorithms and their relations in a concise way, through an ontology.

Domain Knowledge and Learning Objects

The domain knowledge structure concerns AI curriculum concepts related to a number of AI subjects. The AITS system covers four main subjects: Knowledge Representation & Reasoning, Search Algorithms, Constraint Satisfaction and Planning. The domain knowledge is structured in a tree-like way. The root of a tree is a main subject (e.g. Search Algorithms, Constraint Satisfaction etc.). The main subject is divided into topics and each topic into concepts. In this way, each subject deals with a number of topics and each topic deals with a number of concepts. The number of topics depends on the subject; for example, the subject Knowledge Representation and Reasoning consists of three main topics, Propositional Logic, Predicate Logic and Reasoning, and many concepts. The knowledge tree is displayed in the navigation area, at the left-hand side of the student interface. A student should specify a concept for studying (following a 'subject-topic-concept' path). As soon as this is done, the corresponding learning objects, distinguished into theory, examples and exercises, are presented to the students.

The search algorithms subject consists of two main topics, heuristic search and blind search, and each one consists of a number of concepts. In Fig. 2, a part of the tree structure for the search algorithms subject, concerning the path to the heuristic function concept, is presented. The 'theory' objects consist of text presenting theoretical knowledge about the corresponding concept. The 'examples and visualizations' objects are visual presentations/animations of search algorithm operations, which, in line with the theory, aim to improve the students' comprehension of the related concepts. The 'interactive exercises' are exercises that students try to solve and answer in a step-based interactive way. There are two main types of interactive exercises: practice exercises and assessment exercises. The practice exercises are interactive exercises that are equipped with hints and assistance during the learning sessions, aiming to provide guidance and help to the students. On the other hand, the assessment exercises are used to examine the students' progress and comprehension of the corresponding concepts. The assessment of students' answers to exercises can be useful for both the students and the system. The system can get a deeper insight into each individual student's knowledge level, skills and performance and adapt the learning activities and the topics for study to the student's learning needs. Also, from the students' perspective, self-assessment can help students trace their gaps and specify concepts needing further study. Finally, a 'test' object consists of a set of exercises that a student is called to solve. The answers are assessed and marked automatically by the system.
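To illustrate the 'subject-topic-concept' organization, the following Python sketch models a fragment of the knowledge tree as nested dictionaries. It is not the system's actual data model (which uses an ontology), and the concept listed under blind search is a hypothetical placeholder.

    # Illustrative only: a fragment of the 'subject -> topic -> concept' tree,
    # with each concept mapped to its learning objects. The blind-search
    # concept name is a hypothetical example.
    domain_tree = {
        "Search Algorithms": {
            "Heuristic search": {
                "Heuristic function": ["theory", "examples & visualizations",
                                       "interactive exercises", "test"],
            },
            "Blind search": {
                "Breadth-first search": ["theory", "examples & visualizations",
                                         "interactive exercises", "test"],
            },
        },
    }

    def learning_objects(subject, topic, concept):
        """Return the learning objects along a 'subject-topic-concept' path."""
        return domain_tree[subject][topic][concept]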

Learning Analytics

The Learning Analytics unit records, analyses and visualizes learning data related to students' activities, aiming to assist the tutor to get a deeper understanding of the students' learning progress on search algorithms. More specifically, it provides information regarding each individual student's overall performance and performance on specific topics, concepts or exercises. Also, it disaggregates student performance according to selected characteristics such as major, year of study etc. Furthermore, it gives feedback to the tutor about the knowledge level and grades (with the help of the automatic marking mechanism) of each student, the lowest/highest performance on each exercise, the most frequent errors made, the exercises that have been tried, the time taken for each assessment exercise and for the test, the total time spent in the system etc. Moreover, it uses data mining techniques to make predictions regarding a student's performance (Grivokostopoulou et al. 2014b). Additionally, it provides information related to the assessment exercises, like the number of hints requested for each exercise, the student performance after the delivery of a hint and more. The statistics can assist the tutor in tracing concepts of particular difficulty for the students and also assist both the tutor and the system in getting a more complete insight of each student and better adapting the learning activities to each student's needs and performance.

Fig. 2 A part of the domain modeling of the course curriculum (the Search Algorithms subject, divided into the Heuristic search and Blind search topics; each concept, e.g. Heuristic function, is linked to Theory, Interactive Exercises, Examples & Visualizations and Test learning objects)

Learning Approach and Activities

A student first chooses a concept from the domain hierarchy for studying. Then, for the chosen concept, the theory is presented first and afterwards examples and visualizations are provided to the student. The student, after having "played" with them, is called to deal with some interactive practice exercises, which aim to give him/her (a) a better understanding of what he/she has learnt so far, since practice exercises convey hints and guidance for any difficulties met, and (b) the opportunity to check whether he/she can apply what has been learnt. In this spirit, some assessment exercises can be offered to the student, where he/she experiences a pre-test state and the system obtains a first evaluation of the student's knowledge level. Finally, a test is formed, consisting of a number of exercises of various difficulty levels and complexity, based on the history of the student in dealing with practice and pre-test assessment exercises. In this way, a student with a worse performance history than another student will get an easier test than the other student. The results of the test may give a final assessment of the knowledge level of the student regarding the corresponding concept(s). The student is not forced to follow the system's way of teaching, but can make his/her own choices for studying a concept. Any student who has finished studying a particular concept can take the corresponding concept-level test and continue studying the next concept(s).

Visualization of Algorithms

In the learning scenarios offered by the system, a student can study the theoretical aspects of an algorithm alongside appropriate explanations and algorithm visualizations of various example cases. Algorithm visualizations/animations are well suited to assist students in learning algorithms (Hundhausen et al. 2002). Visualizations, when properly used in a learning process, can help a student understand more deeply the way that an algorithm operates, by demonstrating how it works and how it makes proper decisions based on parameters such as heuristic and cost functions (Hansen et al. 2002; Naps et al. 2002). Therefore, there are many algorithm visualization tools for educational purposes, such as Jeliot 3 (Moreno et al. 2004) and ViLLE (Rajala et al. 2007).


Jeliot is a visualization tool for novice students to learn procedural and object-oriented programming. The provided animations present step-by-step executions of Java programs. ViLLE is a language-independent program visualization tool that provides a more abstract view of programming. It offers an environment for students to study the execution of example programs, thus supporting the learning process of novice programmers.

In our system, during the visualization of an algorithm, every decision the algorithm makes, such as which node(s) to expand/visit, is properly presented and explained to the student. The system explains how a decision was made by the algorithm and how the values of parameters, such as the heuristic and the cost functions (if any), were calculated at each of the algorithm's steps.

A noticeable aspect of our algorithm visualizations is that they have been developed according to the essence of student active learning. They have been designed based on the principle of engaging students as much as possible in the demonstration process and making them think hard at every step of an algorithm's process animation. The principles of active learning postulate that the more users directly manipulate and act upon the learning material, the higher the mental effort and psychological involvement, and therefore the better the learning outcome. In this spirit, during an animation demonstrating the application of an algorithm in a case scenario, the system can stop at a random step and ask the student to specify some aspects of the operation of the algorithm. The animation may engage the student and request that he/she specify the next action to be made, or ask him/her to justify why an action was made. In general, such justifications mainly concern either the last one (or more) action(s) conducted by the algorithm or the specification and proper justification of the next action to be conducted. The interactions with the student and the questions asked are mainly multiple choice questions, where the student has to specify the correct answer. For example, during a visualization the system can pause and ask the student to specify the algorithm's next step by selecting the proper answer to a multiple choice question. In the case of a correct student answer, it can also request that the student justify the reason, by offering additional multiple choice question(s). In the case of an erroneous answer, the correct response and proper explanations are immediately offered to the student. After an interaction with the learner, the animation process continues. So, during an algorithm's visualization in an example exercise scenario, multiple interactions with the learner can take place. In Fig. 3, the explanation of the functionality of A* on an example exercise via step-by-step animations is illustrated.

Interactive Exercises

The system provides two types of interactive exercises: practice exercises and assessment exercises. The practice interactive exercises provide help and immediate feedback after a student's incorrect action, because the main objective is to help the student learn. In practice exercises, students are requested to apply an algorithm according to a specific exercise scenario, which can concern the specification of the whole sequence of nodes (starting from the root of the tree), a specific sub-part of it (starting from an intermediate node and consisting of some steps of an algorithm) or the specification of the next node(s)/step(s) of the algorithm. In Fig. 4, a practice exercise related to the breadth-first search (BFS) algorithm, the corresponding feedback error report and the given mark are presented.

Furthermore, the system provides interactive assessment exercises that are used to examine the student's progress and comprehension. They can be used by the students themselves to measure their knowledge and understanding, and also by the system to get a deeper understanding of students' skills and interests and to provide personalized instruction, tailored to students' learning needs (Aleven et al. 2010; Rosa and Eskenazi 2013). The system does not provide any feedback during the student's interaction with assessment exercises; feedback is provided after a student has submitted an answer. In Fig. 5, an assessment exercise on the A* algorithm and the corresponding marks are presented.

Feedback Provision

Another aspect of the system is that it provides meaningful and immediate feedback to students. Several research studies have shown that the most important factor for students' learning in an educational system is feedback (Darus 2009; Hattie and Timperley 2007; Narciss 2008).

Human tutoring is considered to be extremely effective at providing hints and guiding students' learning. A key aspect of tutoring is that the tutor does not reveal the solution directly, but tries to engage students in more active learning and thinking. Furthermore, research studies underpin that computer-generated hints can effectively assist students to a similar degree as a human tutor (Muñoz-Merino et al. 2011; VanLehn 2011). In this spirit, the system aims to make students think hard so that they finally manage to give a correct answer without unnecessary or unasked-for hints. The system never gives unsolicited hints to students. If a student's answer is incorrect, proper feedback messages are constructed and are available via the help button. The student can get those messages on demand by clicking on the help button.

Fig. 3 Visualization of the operation of A* on an example case

A feedback mechanism in AITS is used to provide immediate and meaningful feedback to students during learning sessions. With regard to practice exercises, feedback can be provided before a student submits an answer and also after the student has submitted an incorrect answer. So, a student can ask for help while working on an exercise and before he/she submits an answer. An answer to an interactive exercise mainly depends on the exercise's type and characteristics and in general concerns the specification of the steps of an algorithm trying to solve a problem. This may refer to the specification of either all of the steps, part of them or even a single step. The system's assistance, before the student submits an answer, can remind the student of the corresponding concepts involved in the exercise and also orient the student's attention to specific elements of the current step of the algorithm that may be tricky, thus leading to high error rates and failed attempts. In this spirit, the system can provide hints to the student regarding the algorithm's functionality under the current exercise's conditions. This kind of assistance and the corresponding help messages are provided on demand, after a student's request for assistance.

Fig. 4 A practice exercise on breadth-first search algorithm


On the other hand, for the assessment exercises, feedback is provided only after the student submits an answer. The system provides feedback about the correctness of the students' answers as well as information about the types of errors made and their possible causes. Feedback information is provided at two levels. At the first level, after a student submits a test, the system informs him/her about the correctness of the answers to each involved exercise. Then, the system recognizes the errors made and provides proper feedback to the student. Also, it provides the grade (mark) achieved by the student, as estimated by the automatic assessment mechanism. In Fig. 6, the feedback provided to a student for an exercise of a test is presented.

Fig. 6 Feedback to a student’s answer

Fig. 5 An assessment exercise on A* algorithm


Error Detection Mechanism

The error detection mechanism is used to recognize the errors made by a student. It interacts with the feedback mechanism to provide input for creating feedback, as well as with the automatic marking mechanism to give input for the calculation of the mark.

Similarity Measure

An answer is represented by the sequence of the nodes visited by an algorithm while trying to solve a problem. In terms of data type, an answer is represented by a string. Calculation of the mark of an answer to an exercise is based on the similarity of the student answer to the correct answer, i.e. on the similarity of two strings. The error detection mechanism analyzes student answers and estimates the similarity between the sequence of a student's answer (SA) and the sequence of the corresponding correct answer (CA). The similarity between SA and CA is calculated using the edit distance metric (Levenshtein 1966).

The edit distance between SA and CA, denoted by d(SA,CA), is defined as the minimal cost of transforming SA into CA, using three basic operations: insertion, deletion and relabeling. We define the three basic edit operations, node insertion, node deletion and node relabeling, as follows:

– node insertion: insert a new node into the sequence of SA.
– node deletion: delete a node from the sequence of SA.
– node relabeling: change the label of a node in SA.

Node insertion and node relabeling have the following three special cases:

– Second_Node_Insertion (SCI): insert the second node in a sequence.
– Second_Node_Relabeling (SNR): change the label of the second node in a sequence.
– Goal_Node_Relabeling (GNR): change the label of the goal node (last node) in a sequence.

The rationale behind the above distinctions is that we are interested in distinguishing the cases where an error is made in selecting the second node in a sequence (actually the first choice of the student) or the goal node (actually the last choice). This is because they are considered more serious errors than the others.

Given an SA and the corresponding CA, their similarity is determined by the edit distance, considering that the cost of each basic operation on the nodes (insertion, deletion, relabeling) is 1. Also, we consider the cost of applying one of the special case operations (SCI, SNR, GNR) to be 2. The above choices are based on the results of experimental studies.

As a first example, let us consider the following: SA = <A B D E M H> and CA = <A C D E Z H>. In this case, in order to make them match, we have to apply relabeling twice, once to node B, to make it C, and once to node M, to make it Z. Using the above cost scheme, the cost of relabeling B to C is 2, because it is a special case (SNR), while the cost of relabeling M to Z is 1, because it is a basic case. Thus, the edit distance of the two sequences is d(SA,CA) = 2 + 1 = 3.

As a second example, consider the following two sequences: SA = <A(0,5) C(1,4) D(2,3) G(3,4) F(3,4) K(0,7)> and CA = <A(0,5) C(1,4) D(2,3) M(3,2) F(3,4)>, related to an A* exercise. To have an (exact) match between the two sequences, we have to apply a relabeling of G(3,4) to M(3,2) and a node deletion of K(0,7). The relabeling of node G to M has a cost of 1, as does the deletion of node K. Thus, the edit distance of the two sequences is d(SA,CA) = 1 + 1 = 2.
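For illustration, the following Python sketch computes this edit distance with a standard dynamic-programming (Levenshtein-style) formulation. It is not the system's implementation; in particular, charging the special-case cost of 2 to operations that affect the second or the goal (last) position of the correct answer is our reading of the SCI/SNR/GNR cases.

    def edit_distance(sa, ca):
        """Minimal cost of transforming student answer SA into correct answer CA
        with node insertion, deletion and relabeling. Basic operations cost 1;
        operations affecting the second or the goal (last) node of CA cost 2
        (assumed reading of the SCI/SNR/GNR special cases)."""
        def cost(j):                       # j: 0-based position in CA
            return 2 if j == 1 or j == len(ca) - 1 else 1

        m, n = len(sa), len(ca)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for j in range(1, n + 1):          # build CA from nothing: insertions
            d[0][j] = d[0][j - 1] + cost(j - 1)
        for i in range(1, m + 1):          # empty CA: delete every SA node
            d[i][0] = i
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if sa[i - 1] == ca[j - 1] else cost(j - 1)
                d[i][j] = min(d[i - 1][j] + 1,              # delete sa[i-1]
                              d[i][j - 1] + cost(j - 1),    # insert ca[j-1]
                              d[i - 1][j - 1] + sub)        # relabel / match
        return d[m][n]

    # The two worked examples above:
    print(edit_distance(list("ABDEMH"), list("ACDEZH")))                         # 3
    print(edit_distance(["A(0,5)", "C(1,4)", "D(2,3)", "G(3,4)", "F(3,4)", "K(0,7)"],
                        ["A(0,5)", "C(1,4)", "D(2,3)", "M(3,2)", "F(3,4)"]))     # 2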

Categorizing Student Answers

The categorization of students' answers is made for better modeling and assessment. To this end, a student's answer is characterized in terms of completeness and accuracy. So, an answer is considered complete if all nodes of the correct answer appear in the student's answer; otherwise it is incomplete or superfluous. An answer is accurate when all its nodes are correct; otherwise it is inaccurate. Based on these, we have categorized answers into five categories. The categorization is influenced by the scheme proposed in (Fiedler and Tsovaltzi 2003); however, it is enriched in terms of superfluity (Gouli et al. 2006). The categories of student answers are the following:

– Incomplete–Accurate (IncAcc): All present nodes are correct, but they are a subset of the required ones and the edit distance is greater than 0.
– Incomplete–Inaccurate (IncIna): Present nodes are a subset of the required ones and the edit distance is greater than 0.
– Complete–Inaccurate (ComIna): Only the required number of nodes are present, but the edit distance is greater than 0.
– Complete–Accurate (ComAcc): Only the required number of nodes are present and the edit distance is equal to 0. So, the student answer is correct.
– Superfluous (SF): Present nodes are more than the required ones; also, the edit distance is greater than 0. We distinguish two further cases:
  – SF-LG: The last node is the goal node-state.
  – SF-LNG: The last node is not the goal node-state (the student continues after having reached the goal node or fails to reach it).

In the case that a student's answer is characterized as complete and accurate, the student has specified the correct answer (or a correct answer). In all other cases, the student's answer is considered incorrect and the student has to work towards a correct one. The student's answer is further analyzed by the error detection unit, in order to recognize the types of the errors made.

The two example student sequences used above are categorized as follows: first example, ComIna; second example, SF-LNG.
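A possible encoding of this categorization scheme is sketched below, reusing the edit_distance function above. Treating an incomplete answer as 'accurate' when it coincides with the corresponding prefix of the correct answer is our own assumption, made only for illustration.

    def categorize(sa, ca, dist):
        """Map a student answer to one of the categories above, given the
        edit distance `dist` between SA and CA (a hedged sketch)."""
        if len(sa) > len(ca):                        # superfluous
            return "SF-LG" if sa and sa[-1] == ca[-1] else "SF-LNG"
        if len(sa) == len(ca):                       # complete
            return "ComAcc" if dist == 0 else "ComIna"
        # incomplete: fewer nodes than required
        accurate = list(sa) == list(ca[:len(sa)])    # assumed reading of 'accurate'
        return "IncAcc" if accurate else "IncIna"

    sa1, ca1 = list("ABDEMH"), list("ACDEZH")
    print(categorize(sa1, ca1, edit_distance(sa1, ca1)))     # ComIna
    sa2 = ["A(0,5)", "C(1,4)", "D(2,3)", "G(3,4)", "F(3,4)", "K(0,7)"]
    ca2 = ["A(0,5)", "C(1,4)", "D(2,3)", "M(3,2)", "F(3,4)"]
    print(categorize(sa2, ca2, edit_distance(sa2, ca2)))     # SF-LNG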

Importance of Errors

An important factor in answer assessment is estimating the importance of the errors made by a student. Not all errors that a student makes are of the same importance. Also, not all of them are due to lack of knowledge or understanding of the algorithms. Some of them are due to carelessness or inattention on the part of the student. Carelessness can be defined as giving the wrong answer despite having the skills needed to answer correctly (Hershkovitz et al. 2013). In educational systems, carelessness is not an uncommon behavior of students, even among high-performing students (San et al. 2011), so careless errors are often made by students. Even highly engaged students may paradoxically become overconfident or impulsive, which leads to careless errors (San Pedro et al. 2014). In this spirit, an important aspect of modeling student answers is to take such carelessness into account too. To this end, we consider that the existence of only one error in an answer may be due to inattention.

So, we distinguish the following two types of single-error answers:

I. Only one operation, either regular or of a special case, is required in order to achieve matching.
II. Only one node relabeling operation for two consecutive nodes-states (that have been switched with each other) is required in order to achieve matching.

Notice that Type I represents answers that have a missing node, an extra node or a node that needs to be changed. Type II represents answers that have two nodes in incorrect positions.
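One way such single-error answers could be detected is sketched below; the exact checks are our own illustration rather than the system's implementation.

    def single_error_type(sa, ca):
        """Return "I", "II" or None for the two single-error (carelessness) cases:
        Type I  - CA is reached by exactly one insertion, deletion or relabeling;
        Type II - CA is reached by switching two consecutive nodes of SA."""
        sa, ca = list(sa), list(ca)
        if len(sa) == len(ca):
            diff = [i for i, (a, b) in enumerate(zip(sa, ca)) if a != b]
            if len(diff) == 1:
                return "I"                                   # one relabeling
            if (len(diff) == 2 and diff[1] == diff[0] + 1
                    and sa[diff[0]] == ca[diff[1]] and sa[diff[1]] == ca[diff[0]]):
                return "II"                                  # adjacent switch
        elif len(sa) == len(ca) - 1:                         # one missing node
            if any(sa == ca[:i] + ca[i + 1:] for i in range(len(ca))):
                return "I"
        elif len(sa) == len(ca) + 1:                         # one extra node
            if any(sa[:i] + sa[i + 1:] == ca for i in range(len(sa))):
                return "I"
        return None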

Automatic Marking Mechanism

In this section, we present the automatic marking mechanism, which is used to mark students' answers to exercises. To accomplish this, the automatic marking mechanism interacts with the error detection mechanism. We distinguish two types of exercise-answers as far as marking is concerned: simple and complex. Exercises whose correct answers include six or fewer nodes-states are considered simple exercise-answers. The rest are considered complex. This distinction is based on empirical data.

Simple Exercise-Answer Marking

Initially, the mechanism calculates the edit distance between the sequence of a simple student answer (SAi) and that of the correct answer (CAi).

Then the mark, called the student score (SSi), of the simple answer SAi is calculated by the following equation:

SS_i = \begin{cases} maxscore \cdot \left(1 - \dfrac{d(SA_i, CA_i)}{n_c}\right), & \text{if } d(SA_i, CA_i) < n_c \\ 0, & \text{otherwise} \end{cases} \quad (1)

where d(SAi, CAi) is the edit distance, nc represents the number of nodes-states of CAi and maxscore represents the marking scale. In the case of a complex answer, the student score is calculated by the automated marking mechanism presented below.

Given a test including a number q of interactive exercises, the test score is calculated as the average score of the answers:

TestScore = \dfrac{\sum_{i=1}^{q} SS_i}{q} \quad (2)
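In Python, and reusing the edit_distance sketch from above, formulas (1) and (2) amount to the following (maxscore defaults to 100, the scale used by AITS):

    def simple_score(sa, ca, maxscore=100):
        """Formula (1): score of a simple answer from its edit distance to CA."""
        nc = len(ca)                          # number of nodes-states of CA
        d = edit_distance(sa, ca)
        return maxscore * (1 - d / nc) if d < nc else 0

    def test_score(exercise_scores):
        """Formula (2): a test's score is the average of its exercise scores."""
        return sum(exercise_scores) / len(exercise_scores)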

Automated Marking Algorithm

Marking students' answers, as mentioned above, is a complex and very demanding process for tutors, especially in the case of complex exercise-answers. Automated marking in AITS is based on the type of the student answer and the type of errors made, as determined by the error detection mechanism. So, the marking mechanism tries to model and estimate the overall student understanding of the functionality of the algorithms. The marking algorithm presented below is based on empirical estimation and evaluation results as well as on the principle of simulating tutor marking.

1. If simple, SSi is calculated via formula (1).
2. If complex:
   2.1 If the student answer (SAi) is of type ComIna:
       2.1.1 If SAi is a single-error answer of Type I, SSi = maxscore * (1 - (w1 * d(SAi,CAi)) / nc)
       2.1.2 If SAi is a single-error answer of Type II, SSi = maxscore * (1 - (w2 * d(SAi,CAi)) / nc)
       2.1.3 Otherwise, SSi is calculated via formula (1).
   2.2 If the student answer (SAi) is of type IncAcc:
       2.2.1 If SAi is a single-error answer of Type I, SSi = maxscore * (1 - (w1 * d(SAi,CAi)) / nc)
       2.2.2 Otherwise, SSi is calculated via formula (1).
   2.3 If the student answer (SAi) is of type IncIna:
       2.3.1 SSi is calculated via formula (1).
   2.4 If the student answer (SAi) is of type Superfluous, case SF-LG:
       2.4.1 If SAi is a single-error answer of Type I, SSi = maxscore * (1 - (w1 / nc))
       2.4.2 If SAi is a single-error answer of Type II, SSi = maxscore * (1 - (w2 * d(SAi,CAi)) / nc)
       2.4.3 Otherwise, SSi is calculated via formula (1).
   2.5 If the student answer (SAi) is of type Superfluous, case SF-LNG:
       2.5.1 SSi is calculated via formula (1).
   2.6 If the student answer (SAi) is of type ComAcc, SSi = maxscore.

where we set the following values for the parameters: w1 = 0.6 and w2 = 0.8. Parameters w1 and w2 represent the importance of the errors in calculating the mark and are set empirically, based on two tutors' experience. Also, recall that, based on empirical data, we consider that for a sequence with fewer than 7 nodes it does not make sense to take into account the type of the answer and of the errors made. In the context of our work, maxscore is set to 100 and the automated marking algorithm marks exercises on a 0 to 100 point scale.
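Putting the pieces together, the marking algorithm can be sketched as follows. This is illustrative rather than the system's code; it reuses the edit_distance, categorize, single_error_type and simple_score sketches defined above.

    W1, W2 = 0.6, 0.8                 # empirically set error-importance weights

    def mark_answer(sa, ca, maxscore=100):
        """Sketch of the automated marking algorithm for a single answer."""
        if len(ca) <= 6:                              # simple exercise-answer
            return simple_score(sa, ca, maxscore)

        d = edit_distance(sa, ca)
        nc = len(ca)
        category = categorize(sa, ca, d)
        err = single_error_type(sa, ca)

        if category == "ComAcc":
            return maxscore
        if category == "ComIna":
            if err == "I":
                return maxscore * (1 - W1 * d / nc)
            if err == "II":
                return maxscore * (1 - W2 * d / nc)
        elif category == "IncAcc" and err == "I":
            return maxscore * (1 - W1 * d / nc)
        elif category == "SF-LG":
            if err == "I":
                return maxscore * (1 - W1 / nc)
            if err == "II":
                return maxscore * (1 - W2 * d / nc)
        # IncIna, SF-LNG and all remaining cases fall back to formula (1)
        return simple_score(sa, ca, maxscore)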

Furthermore, the mechanism can specify and take into account cases where the student has not gained an adequate understanding of the algorithm and has given an answer in an inconsistent and, in some sense, random way. Special attention is paid to the cases where, particularly in blind search algorithms, the student has correctly specified just some nodes but has not understood the algorithm's way of functioning.

As a first example, consider the following case, where the initial state is node A and the correct answer is CA = <A B K E W Z M> for the depth-first search algorithm. The student answer is the following: SA = <A B K E L D M H F>. Initially, the error detection mechanism estimates the edit distance between SA and CA. The required operations are two node relabelings (L ➔ W and D ➔ Z) and two node deletions (H and F). The cost of node relabeling and node deletion is 1. So, the edit distance is d(SA,CA) = 2*1 + 2*1 = 2 + 2 = 4. Also, the error detection mechanism detects the student's answer as being superfluous, more specifically of the SF-LG type. Thus, according to the marking algorithm, SS = 100 – 100*(4/7) = 42.

As a second example, consider the following case, where the initial state is node A and the correct answer is CA = <A(8) K(5) M(3) L(2) D(2) C(1) N(0)> for a hill climbing search interactive exercise. The answer of a student is the following: SA = <A(8) B(6) F(8) M(3) L(2) D(2) C(1) N(0)>. Initially, the error detection mechanism estimates the edit distance between the student answer (SA) and the correct answer (CA). The required operations are a second_node_relabeling (B(6) ➔ K(5)) and a node deletion (F(8)). The cost of the second_node_relabeling is 2, while the cost of the node deletion operation is 1. So, the edit distance is d(SA,CA) = 2 + 1 = 3. Also, the error detection mechanism detects the student's answer as SF-LG. Thus, according to the marking algorithm, SS = 100 – 100*(3/7) = 57.15.

As a third example, consider the following case, where the initial state is node A and the correct answer is CA = <A(8) M(6) L(5) C(4) D(3) G(2) I(0)> for a best-first search interactive exercise. The student answer is the following: SA = <A(8) M(6) L(5) D(3) C(4) G(2) I(0)>. The required operations to achieve matching are two relabelings (D(3) ➔ C(4) and C(4) ➔ D(3)). The cost of node relabeling is 1. So, the edit distance is d(SA,CA) = 1 + 1 = 2. Also, the error detection mechanism detects the student's answer as ComIna and a single-error answer of Type II. Thus, according to the marking algorithm, SS = 100 – 100*((0.8*2)/7) = 77.15.
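Running the third example through the sketches above yields the same categorization, error type and (up to rounding) the same mark:

    ca = ["A(8)", "M(6)", "L(5)", "C(4)", "D(3)", "G(2)", "I(0)"]
    sa = ["A(8)", "M(6)", "L(5)", "D(3)", "C(4)", "G(2)", "I(0)"]
    d = edit_distance(sa, ca)                  # 2 (two unit-cost relabelings)
    print(categorize(sa, ca, d))               # ComIna
    print(single_error_type(sa, ca))           # II
    print(round(mark_answer(sa, ca), 2))       # about 77.14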

Finally, consider the following exercise for the Breadth-First search algorithm, as illustrated in Fig. 7. The correct answer (CA) is the following: <A B C D E F G H I J K>.

Table 1 presents student answers to the above exercise, the categorization of each answer and the corresponding score calculated by the automated marking algorithm.


General Assessment Framework

In an effort to specify a general framework for the assessment process, which could be used in other domains too, we ended up with the diagram of Fig. 8. It consists of four stages:

1. Specify the similarity between the student answer and the correct answer
2. Categorize the student answer according to the answer categorization scheme
3. Check for important errors
4. Calculate the mark via the automated marker

Indeed, the resulting framework can be used in different domains by proper instantiations of the following elements:

Table 1 Examples of students' answers

Student Answer                   Category of Answer   Score
<A B C D E F G H I K J>          ComIna               78.1
<A C B D F E G H I J K>          ComIna               55
<A B C D>                        IncAcc               28
<A B C D E F G H I J>            IncAcc               85
<A B C J K I>                    IncIna               28
<A B C D E G H F L M X K>        SF-LG                55
<A B C D E G H F L M X J K>      SF-LG                55
<A B C D E F H G J I L M>        SF-LNG               55
<A B C D E F G H I J L M>        SF-LNG               73
<A X M L>                        IncIna               0
<A B C D E F G H I J K>          ComAcc               100

Fig. 7 An Interactive exercise for Breadth First Search algorithm


• Similarity metric
• Answer categorization scheme
• Automated marking algorithm

The above elements, as implemented in AITS, could be easily used for domains where the answers are strings or could be transformed into strings.
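To make the four-stage framework concrete, here is a minimal Python sketch of how the stages could be wired together under our own assumptions; the similarity metric, the categorization rule and the marker shown here are placeholders standing in for the AITS components described above, not the system's actual implementation.

# Hypothetical wiring of the four-stage assessment framework (Fig. 8).
# Each stage is a pluggable function, so the pipeline can be instantiated
# for any domain whose answers can be expressed as strings.
def assess(student_answer, correct_answer, similarity, categorize, check_errors, mark):
    d = similarity(student_answer, correct_answer)             # stage 1: similarity
    category = categorize(student_answer, correct_answer, d)   # stage 2: answer type
    errors = check_errors(student_answer, correct_answer)      # stage 3: important errors
    return mark(category, d, errors, len(correct_answer))      # stage 4: automated marker

# Example instantiation with the string-based components sketched earlier:
# score = assess(list("ABKELDMHF"), list("ABKEWZM"),
#                similarity=edit_distance,
#                categorize=lambda sa, ca, d: "SF-LG",          # placeholder rule
#                check_errors=lambda sa, ca: [],                # placeholder rule
#                mark=lambda cat, d, errs, nc: 100 * (1 - d / nc))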

Evaluation

Experimental studies were conducted to evaluate AITS and the assessment mechanism during learning. The main objective of the experimental studies was to obtain an evaluation of the effectiveness of AITS and also of the performance of the automatic assessment mechanism. For this purpose, two different experiments were designed and implemented in the context of the AI course in our department. In order to explore how the system assists students in learning about search algorithms, an extended evaluation study through the pre-test/post-test and experimental/control group method was conducted in real class conditions. Furthermore, to evaluate the performance of the assessment system, we compared the assessment mechanism against expert tutors on the domain of search algorithms.

Evaluation of the Assessment Mechanism

In this section, we describe the experiments conducted and the resulting findings regarding the automatic assessment mechanism. The automatic assessment mechanism has been used to assess students' performance while studying with AITS.

Fig. 8 General framework of the assessment process (stages: similarity between student answer and correct answer; categorization of student answer; checking for important errors; marking of the student answer by the automatic marker mechanism)


Experiment Design

Initially, for the needs of the study, we randomly selected 80 undergraduate students (both female and male) of our department who attended the artificial intelligence course, had used AITS and had taken exercises and tests related to blind and heuristic search algorithms. During the interaction of a student with AITS, all of the student's learning actions and submitted answers were recorded and archived by the system. For the study, students' answers were collected and a corpus of 400 answers that the students provided for various search algorithm exercises was formed. After that, three tutors having many years of experience in teaching search algorithms were asked to mark the corpus of students' answers. The tutors jointly evaluated each student answer and marked it on a 100-point scale. Those tutors' marks were used as a gold standard. The tutors' marking process was completed before the automatic marking process was used to assess the student answers. So, the scores of the automatic marker were not known to the tutors, to eliminate any possible bias towards the automatic marker's scores. After that, the automatic marker was used to assess each one of the corpus answers. So, the students' answers were assessed by both the tutors and the automated marker and thus two sets were formulated, one with the tutors' marks and the other with the system's marks.

Results

After the formulation of the two datasets, we calculated the Pearson correlation between the marks (scores) of the tutors and the automated marker. The results showed that there was a very strong, positive correlation between the tutors' marks and those of the automated marking system, which was statistically significant (r = .96, n = 400, p < .0005).

After that, a linear regression approach was applied to the dataset of students' marks. Linear regression attempts to model the relationship between two variables, one independent and one dependent. We consider as the independent variable the one representing the automated marker's scores and as the dependent variable the one representing the tutors' marks. The objective of the regression is to offer us a way to assess the quality of the automated marker; it does not predict the students' marks. Initially, we checked that the data are appropriate and meet the assumptions required for linear regression to give a valid result. For example, looking at the scatter plot (Fig. 9), we can see that the conditions for the two (continuous) variables (automated marker, tutor marker) are met: there is a linear relationship between the two variables and there are no significant outliers.

A simple linear regression applied to the two sets of marks resulted in the following equation: y = 9.697 + 0.894x, with r² = .934. The results of the linear regression model indicated that there was a strong positive relationship (Dancey and Reidy 2007) between the automated marker and the human marker, since r = .967, while r² = .934 suggests that 93.4 % of the total variation in y can be explained by the linear relationship between x and y. This means that the regression line accounts for approximately 93.4 % of the variability in the data corpus. The model was a good fit for the data (F = 5656, p < .0005).
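The statistics reported here are standard and can be reproduced with off-the-shelf tools; the following is a brief sketch (our own, with placeholder arrays) using SciPy's pearsonr and linregress on the two sets of marks.

# Sketch of the correlation / regression analysis on the two mark sets.
# `system_marks` and `tutor_marks` are placeholders for the 400 paired scores.
import numpy as np
from scipy import stats

system_marks = np.array([42.86, 57.14, 77.14, 100.0])   # illustrative values only
tutor_marks = np.array([45.0, 60.0, 80.0, 100.0])        # illustrative values only

r, p = stats.pearsonr(system_marks, tutor_marks)          # Pearson correlation
reg = stats.linregress(system_marks, tutor_marks)         # y = intercept + slope * x
print(f"r = {r:.3f} (p = {p:.4f})")
print(f"y = {reg.intercept:.3f} + {reg.slope:.3f}x, r^2 = {reg.rvalue**2:.3f}")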


Then, we created three corpuses out of the corpus of 400 student answers, namely CorpusA, CorpusB and CorpusC. CorpusA consisted of the student answers whose assessment levels were 'very low' and 'low', CorpusB of those with levels 'medium' and 'good' and CorpusC of those with level 'excellent'. We characterize an answer assessment as 'very low' if its score specified by the tutor is up to 25, 'low' if its score is from 26 to 45, 'medium' if it is from 46 to 65, 'good' if it is from 66 to 85 and 'excellent' if the score is greater than 85. The purpose of the creation of the three groups was to look at the evaluation of answer assessment in terms of low, medium and excellent levels. Table 2 presents the correlations between the scores of the human marker and the automated marker for the three corpuses. The correlation for CorpusB was .896, higher than those of the other two corpuses.
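The score bands above translate directly into a discretization function; the following short Python sketch (our own naming) maps a 0-100 mark to the five assessment levels used for the three corpuses and for the classification experiment below.

# Hypothetical helper mapping a 0-100 score to the paper's five mark categories.
def mark_category(score):
    if score <= 25:
        return "very low"
    if score <= 45:
        return "low"
    if score <= 65:
        return "medium"
    if score <= 85:
        return "good"
    return "excellent"

# e.g. mark_category(42.86) -> 'low', mark_category(77.14) -> 'good'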

Moreover, in Fig. 10 the scatter plots resulting from the three groups are presented.

Fig. 9 Scatter Plot of automated marker vs tutor marker

Table 2 Results of correlation for the three corpuses

Corpus     Correlation   Significance of correlation
CorpusA    .858          1.98 × 10^-20
CorpusB    .896          1.26 × 10^-100
CorpusC    .629          7.49 × 10^-7


Fig. 10 Scatter plots for (a) CorpusA, (b) CorpusB, (c) CorpusC


In fact, Fig. 10b shows a better agreement between tutor and system assessments for the medium-level answers than for the others, since the data concentrate more in the vicinity of the corresponding line.

A second experiment was designed to evaluate the automated assessment system, considering assessment as a classification problem. In this respect, we use appropriate metrics to evaluate the system's assessment in comparison with the tutors' assessment. We analyze the corpus of the 400 student answers and discretize it using the above-mentioned classification into 'very low', 'low', 'medium', 'good' and 'excellent' mark categories.

The evaluation is mainly based on three well-known metrics: average accuracy, precision and F-measure, given that we have a multiple-class output consisting of five classes. Average accuracy is the mean value of the accuracies of the output classes and F-measure is defined as:

F-measure = (2 × precision × recall) / (precision + recall)

Also, we evaluate the agreement between the tutors and the automated marker using Cohen's Kappa statistic (Cohen 1960), which is defined as follows:

k = (p0 − pe) / (1 − pe)

where p0 is the observed proportion of agreement between the raters and pe is the proportion of agreement expected by chance alone. Thus, "perfect agreement" would be indicated by k = 1 and no agreement means that k = 0. Cohen's Kappa was estimated to determine whether there is agreement between the grades of the tutors and the automated marker on the 400 students' answers. The results indicate that there is substantial agreement (Viera and Garrett 2005) between the tutors and the automated marker, since κ = .707 (95 % confidence interval, p < .0005).

Table 3 Evaluation results of the automated assessment mechanism

Average accuracy   0.83
Precision          0.84
F-measure          0.84

Table 4 Confusion matrix of automatic assessment performance

            very low   low   medium   good   excellent
very low       18       2      0        0       0
low            13      34      0        0       0
medium          0      26    136       12       0
good            0       0     11       93       0
excellent       0       0      0        4      51


The performance of the automated assessment mechanism and the confusion matrix are presented in Tables 3 and 4, respectively.

The results indicate that the automated assessment mechanism has an encouraging performance. Out of the corpus of 400 student answers that were marked by the automated marking mechanism, 332 were assessed in the correct mark-level category. This means that in approximately 83 % of the cases the automated marking mechanism estimated the mark category of the student answer correctly. The analysis of the confusion matrix shows that the automated assessment system's and the tutors' marks have much in common, but do not match exactly; in most cases the tutor assigned a higher score than the system. The fact is that automated marking cannot always indicate whether a student has deeply understood an algorithm and adapt the way of marking to the current student(s). A tutor, however, can realize whether a student has understood the algorithm, in spite of his/her errors, and include this principle in the marking of the answers.

Evaluation of AITS Learning Effectiveness

Method

We conducted an evaluation study in which we compared teaching/learning with AITS versus the traditional way. The purpose of the study was to evaluate the effectiveness of learning in those two different ways. The participants in this evaluation study were 300 undergraduate students (both female and male) from the Artificial Intelligence (AI) classes at our department. All students were in the 4th year of their studies and ranged in age from 21 to 24 years (M = 22.5). A pre-test/post-test experiment was used. So, we created two groups; the first one consisted of 150 students (70 female and 80 male) of the class of academic year 2011-2012, denoted by ClassA (control group), and the second consisted of 150 students (73 female and 77 male) of the class of academic year 2012-2013, denoted by ClassB (experimental group). The participants in each group were randomly chosen from a total of about 250 students in each year.

ClassA (control group) did not use AITS, but instead followed the traditional learning/teaching process for search algorithms. The students attended lectures and videos on AI search algorithms, solved exercises given by the tutor individually and then discussed them with the tutor. ClassB (experimental group) was given access to AITS to study AI search algorithms. The system provided different types of exercises and feedback during the interaction with it.

The experiment consisted of four phases: pre-test, learning phase, post-test and questionnaire, as illustrated in Fig. 11. The two groups followed the same procedure; both groups were given a pre-test and then ClassA learned about search algorithms in the traditional way, whereas ClassB learned through AITS; afterwards, both groups were given a post-test. After the post-test, all participants were given a questionnaire to fill in. The pre-tests and post-tests were isomorphic and incorporated structurally equivalent exercises on search algorithms.


Results

In order to analyze the students' performance, an independent t-test was used on the pre-test. We consider the null hypothesis: there is no difference between the performances of the students of ClassA and ClassB on the pre-test. The results show that the mean value and standard deviation of the pre-test were 39.4 and 10.74 for ClassA (M = 39.4, SD = 10.74) and 38.9 and 11.0 for ClassB (M = 38.9, SD = 11.0), respectively. Also, the p-value (significance level) was p = .691 (p > .05), t = .398 and the effect size d (Cohen 1988) was .046. So, it can be inferred that the two classes did not significantly differ prior to the experiment: ClassA and ClassB were almost at the same knowledge level regarding search algorithm concepts before starting the learning process. Also, Table 5 presents the means and standard deviations of the pre-test and post-test scores for the two groups (control vs experimental).
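For completeness, here is a small sketch (our own, with placeholder score arrays) of how such a pre-test comparison can be computed with SciPy: an independent-samples t-test plus Cohen's d from the pooled standard deviation.

# Sketch of the pre-test comparison between the two groups.
# `class_a` and `class_b` are placeholders for the 150 pre-test scores per group.
import numpy as np
from scipy import stats

class_a = np.array([35, 42, 38, 45, 37], dtype=float)   # illustrative values only
class_b = np.array([36, 40, 39, 41, 38], dtype=float)   # illustrative values only

t, p = stats.ttest_ind(class_a, class_b)                 # independent-samples t-test

# Cohen's d using the pooled standard deviation
n1, n2 = len(class_a), len(class_b)
pooled_sd = np.sqrt(((n1 - 1) * class_a.var(ddof=1) + (n2 - 1) * class_b.var(ddof=1)) / (n1 + n2 - 2))
d = (class_a.mean() - class_b.mean()) / pooled_sd
print(f"t = {t:.3f}, p = {p:.3f}, d = {d:.3f}")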

The results show that ClassB, while it had a pre-test mean value of 38.9, increased to 67.54 in the post-test, whereas ClassA had a pre-test mean of 39.4 that increased to 49.98 in the post-test. So, the results revealed that the mean value of the post-test for ClassB was considerably higher than the mean value of the post-test for ClassA.

To determine the effectiveness of learning, we conducted an ANOVA with repeated measures to extract the difference between the two conditions (control and experimental).

Fig. 11 Structure of the experiment (first phase: pre-test, estimation of prior knowledge, 1 hour, all participants, N = 300; second phase: learning tasks, 2 weeks, ClassA with the traditional way of learning (lectures, video, etc.), N = 150, and ClassB using AITS, N = 150; third phase: post-test, estimation of posterior knowledge, 1 hour, all participants, N = 300; fourth phase: questionnaire, opinion of students, 20 minutes, all participants, N = 300)

Table 5 Results of performance on the pre-test/post-test for each group

Group                    Pre-Test            Post-Test
                         M       SD          M        SD
ClassA (control)         39.4    10.74       49.98    11.18
ClassB (experimental)    38.9    11.00       67.54    14.06


An analysis of variance (ANOVA) was conducted with Test (pre-test, post-test) as a repeated factor and Group (ClassA, ClassB) as a between-subjects factor. The results revealed a significant difference in learning performance between the conditions, F(1,298) = 235.319, p < .001, and the effect size was .44. Also, Fig. 12 presents how each group performed on the pre-test and post-test, with each line representing a group.

In addition, we calculated the simple learning gains as posttest − pretest. An ANOVA performed on the simple learning gains showed significant differences among the conditions, F(1,298) = 235.319, p < .001, MSError = 103.87. We also calculated the normalized learning gain as follows:

normalized gain = (posttest − pretest) / (1 − pretest)

An ANOVA performed on the normalized gains showed significant differences among the conditions, F(1,298) = 91.28, p < .001, MSError = 282.83. Finally, the results showed that the performance of the students of ClassB, who interacted with AITS, was better than that of ClassA.
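As a brief illustration (our own sketch, with placeholder arrays), the simple and normalized gains and the between-groups comparison can be computed as follows; for two groups, a one-way ANOVA on the gains, as done here with SciPy's f_oneway, yields the F statistic for the group effect. The normalized-gain line assumes the scores are first expressed as fractions of the maximum score, which is our reading of the formula above.

# Sketch of the learning-gain analysis. pre_a/post_a and pre_b/post_b are
# placeholders for the per-student pre-test and post-test scores of each class.
import numpy as np
from scipy import stats

pre_a, post_a = np.array([40., 35., 42.]), np.array([50., 47., 55.])   # illustrative
pre_b, post_b = np.array([39., 36., 41.]), np.array([68., 70., 65.])   # illustrative

simple_gain_a = post_a - pre_a
simple_gain_b = post_b - pre_b

# Normalized gain, assuming scores expressed as fractions of maxscore (= 100)
norm_gain_a = (post_a / 100 - pre_a / 100) / (1 - pre_a / 100)
norm_gain_b = (post_b / 100 - pre_b / 100) / (1 - pre_b / 100)

F, p = stats.f_oneway(simple_gain_a, simple_gain_b)   # group effect on simple gains
print(f"F = {F:.2f}, p = {p:.4f}")
print(norm_gain_a.mean(), norm_gain_b.mean())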

Survey Questionnaire

All the participants (ClassA and ClassB), after the post-test, were asked to fill in a questionnaire regarding their experiences and opinions about the system's learning impact and also the automated assessment mechanism. The questionnaire for ClassB consisted of 12 questions, where ten questions required answers based on a five-point Likert scale (1 - strongly disagree to 5 - strongly agree) and two were open-ended questions. The open-ended questions were provided at the end of the questionnaire to allow students to write their comments about AITS and the automated marking mechanism, stating their experiences and opinions. Table 6 presents the means and the standard deviations of the students' responses for ClassB.

Fig. 12 The means of the pre-test and post-test for each group


The results of the questionnaire are very encouraging for learning with AITS and for marking with the automated marker. After analyzing the students' responses to the questionnaires, the reliability of the questionnaire was checked using Cronbach's alpha (Cronbach 1951). The reliability of the scale was good, with an internal consistency coefficient of α = .78 for the students of ClassB.
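Cronbach's alpha can be computed directly from the item responses; the sketch below (our own, with a placeholder response matrix) applies the usual formula α = k/(k−1) · (1 − Σ item variances / variance of the total score).

# Sketch of Cronbach's alpha over a students-by-items matrix of Likert responses.
import numpy as np

def cronbach_alpha(responses):
    """responses: 2-D array, rows = students, columns = questionnaire items."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                         # number of items
    item_vars = responses.var(axis=0, ddof=1)      # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# e.g. cronbach_alpha([[4, 5, 4], [3, 4, 4], [5, 5, 5], [4, 4, 3]])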

Additionally, a questionnaire survey was conducted to evaluate the efficiency of the automated marker. We used the same three tutors as above, having many years of experience in dealing with search algorithms, who were asked to rate a corpus of students' answers that had been marked by the automated marking mechanism. The range of the score was from 0 to 5. The tutors were given 10 students' answers that had been marked by the automated assessment mechanism and were asked to rate the quality of the marks, this time independently of each other. The rating results provided by the tutors are presented in Fig. 13.

The average rating scores provided by the three tutors were M = 4.38 for tutor1, M = 4.47 for tutor2 and M = 4.39 for tutor3. The rating scores indicate that the tutors found the automated marking mechanism to be appropriately helpful, resulting in an overall average score of 4.41.

Table 6 Results of the questionnaire

Question                                                                                         Mean   SD
Q1  AITS makes the AI search algorithms more understandable                                      4.44   .50
Q2  I feel satisfied with the feedback offered by AITS for each exercise                         4.78   .42
Q3  How much did the automatic assessment contribute to learning the AI search algorithms?       4.12   .65
Q4  I feel satisfied with the grades of the automated marking system                             4.44   .543
Q5  The grading was accurate                                                                     4.01   .74
Q6  I feel that the automated marking system is a good platform to facilitate assessment         4.09   .54
Q7  How do you rate your overall experience?                                                     4.11   .58
Q8  I feel more confident in dealing with AI search algorithms                                   4.18   .59
Q9  Would you suggest that AITS be integrated into the course and used by next year's students?  4.57   .52
Q10 The visualization examples helped me learn the algorithm's way of functioning more effectively  4.23   .56

Fig. 13 Tutors’ rating for the quality of marking system


Conclusion and Future Work

AITS is an adaptive and intelligent tutoring system used for assisting students in learning and tutors in teaching artificial intelligence curriculum aspects, one of them being "search algorithms". The system offers theory descriptions but, most importantly, interactive examples and exercises related to search algorithms. A student can study the theory and the examples, which use visualized animations to present AI search algorithms in a step-by-step way, to make them more understandable and attractive. Also, it provides interactive exercises that aim to assist students in learning to apply algorithms in a step-by-step interactive way. Students are called to apply an algorithm to an example case, by specifying the algorithm's steps interactively and with the system's guidance and help. Also, it provides immediate feedback for the interactive exercises and the tests. To our knowledge, it is the first time that such technologies and methods are used in teaching/learning about search algorithms.

In the context of AITS, we introduced an automatic assessment mechanism to assess the students' answers. Automatic assessment is achieved in a number of stages. First, the system calculates the similarity between a student's answer and the correct answer using the 'edit distance' metric. Afterwards, it identifies the type of the answer, based on its completeness and accuracy, also taking into account carelessness errors. Finally, it automatically marks the answer, based on the answer's type, the edit distance and the type of errors, via the automated marking algorithm. So, marking is not based on a clear-cut right-wrong distinction, but on partial correctness, and it takes into account carelessness or inattention cases. In this way, accuracy and consistency are achieved to a large degree, by avoiding the subjectivity of human marking. Again, it seems that this is the first effort that specifies a systematic categorization of student answers taking into account, apart from correctness and consistency, also carelessness and inattention. Additionally, it is the first time that an automated assessment process is introduced for exercises on search algorithms. On the other hand, the introduced process constitutes an adequately general assessment framework that could be applied to other domains too. Furthermore, the automated marking algorithm itself could be used as the basis for marking answers to exercises in other domains, given that they can be expressed as strings.

We conducted two experiments to evaluate (a) the performance of the automated assessment mechanism and (b) the learning effectiveness of using interactive exercises with visualized step-based animations for search algorithms in AITS. In the first experiment, to evaluate the performance of the automated assessment, a data set of 400 student answers, marked by the system and jointly by three expert tutors, was used as a test bed. Experimental results, analyzed via linear regression and classification metrics, showed a very good agreement between the automatic assessment mechanism and the expert tutors. So, the automatic assessment mechanism can be used as a reference (i.e. accurate and consistent) grading system. In the second experiment, we evaluated the learning effectiveness of AITS through a pre-test/post-test and experimental/control group approach. The results gathered from the evaluation study are very promising. The experimental group did clearly better than the control group. So, it seems that visualized animations and interactivity are two crucial factors that contribute to better learning, at least for subjects like search algorithms.

At the moment, the implementation of the interactive examples and visualizations is quite time consuming. So, a way for semi-automatic or automatic generation of such learning objects (actually programs) is a quite interesting direction for further research. On the other hand, investigation of further improvements of the assessment mechanism may be necessary in a number of directions. For example, investigation of more criteria, like graph connectivity, for the specification of special error cases is one of them. Also, assessment of errors based on user modeling, to involve domain knowledge, is another possible direction. Finally, an interesting direction would be to test the assessment mechanism on other types of algorithms. Exploring this aspect is a key direction of our future work.

References

Ala-Mutka, K. M. (2005). A survey of automated assessment approaches for programming assignments. Computer Science Education, 15(2), 83–102.

Alemán, J. L. F. (2011). Automated assessment in a programming tools course. IEEE Transactions on Education, 54(4), 576–581.

Aleven, V., McLaren, B. M., Sewall, J., & Koedinger, K. R. (2009). A new paradigm for intelligent tutoring systems: example-tracing tutors. International Journal of Artificial Intelligence in Education, 19(2), 105–154.

Aleven, V., Roll, I., McLaren, B. M., & Koedinger, K. R. (2010). Automated, unobtrusive, action-by-action assessment of self-regulation during learning with an intelligent tutoring system. Educational Psychologist, 45(4), 224–233.

Baker, R. S., & Rossi, L. M. (2013). Assessing the disengaged behaviors of learners. Design Recommendations for Intelligent Tutoring Systems, 153.

Barker-Plummer, D., Dale, R., Cox, R., & Etchemendy, J. (2008). Automated assessment in the internet classroom. In Proc. AAAI Fall Symposium on Education Informatics, Arlington, VA.

Barker-Plummer, D., Dale, R., & Cox, R. (2012). Using edit distance to analyse errors in a natural language to logic translation corpus. EDM, 134.

Blumenstein, M., Green, S., Fogelman, S., Nguyen, A., & Muthukkumarasamy, V. (2008). Performance analysis of GAME: a generic automated marking environment. Computers & Education, 50(4), 1203–1216.

Brusilovsky, P., & Loboda, T. D. (2006). WADEIn II: a case for adaptive explanatory visualization. ACM SIGCSE Bulletin, 38(3), 48–52.

Brusilovsky, P., & Sosnovsky, S. (2005). Individualized exercises for self-assessment of programming knowledge: an evaluation of QuizPACK. Journal on Educational Resources in Computing (JERIC), 5(3), 6.

Charman, D., & Elmes, A. (1998). Computer based assessment (volume 1): a guide to good practice. SEED (Science Education, Enhancement and Development), University of Plymouth.

Chi, M. T., Bassok, M., Lewis, M. W., Reimann, P., & Glaser, R. (1989). Self-explanations: how students study and use examples in learning to solve problems. Cognitive Science, 13(2), 145–182.

Clark, I. (2012). Formative assessment: assessment is for self-regulated learning. Educational Psychology Review, 24(2), 205–249.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd edn.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.

Dancey, C. P., & Reidy, J. (2007). Statistics without maths for psychology. Pearson Education.

Darus, S. (2009). Framework for a computer-based essay marking system: specifically developed for ESL writing. Lambert Academic Publishing.

Douce, C., Livingstone, D., & Orwell, J. (2005). Automatic test-based assessment of programming: a review. Journal on Educational Resources in Computing (JERIC), 5(3), 4.

Falkner, N., Vivian, R., Piper, D., & Falkner, K. (2014). Increasing the effectiveness of automated assessment by increasing marking granularity and feedback units. In Proceedings of the 45th ACM Technical Symposium on Computer Science Education (pp. 9–14). ACM.

Fiedler, A., & Tsovaltzi, D. (2003). Automating hinting in an intelligent tutorial dialog system for mathematics. Knowledge Representation and Automated Reasoning for E-Learning Systems, 23.

Gouli, E., Gogoulou, A., Papanikolaou, K. A., & Grigoriadou, M. (2006). An adaptive feedback framework to support reflection, guiding and tutoring. Advances in Web-Based Education: Personalized Learning Environments, 178–202.

Grivokostopoulou, F., & Hatzilygeroudis, I. (2013a). Teaching AI search algorithms in a web-based educational system. In Proceedings of the IADIS International Conference e-Learning (pp. 83–90).

Grivokostopoulou, F., & Hatzilygeroudis, I. (2013b). An automatic marking system for interactive exercises on blind search algorithms. In Artificial Intelligence in Education (pp. 783–786). Berlin Heidelberg: Springer.

Grivokostopoulou, F., & Hatzilygeroudis, I. (2013c). Automated marking for interactive exercises on heuristic search algorithms. In Teaching, Assessment and Learning for Engineering (TALE), 2013 IEEE International Conference on (pp. 598–603). IEEE.

Grivokostopoulou, F., & Hatzilygeroudis, I. (2015). Semi-automatic generation of interactive exercises related to search algorithms. In Computer Science & Education (ICCSE), 2015 10th International Conference on (pp. 33–37). IEEE.

Grivokostopoulou, F., Perikos, I., & Hatzilygeroudis, I. (2012). An automatic marking system for FOL to CF conversions. In Teaching, Assessment and Learning for Engineering (TALE), 2012 IEEE International Conference on (pp. H1A-7). IEEE.

Grivokostopoulou, F., Perikos, I., & Hatzilygeroudis, I. (2014a). Using semantic web technologies in a web based system for personalized learning AI course. In Technology for Education (T4E), 2014 IEEE Sixth International Conference on (pp. 257–260). IEEE.

Grivokostopoulou, F., Perikos, I., & Hatzilygeroudis, I. (2014b). Utilizing semantic web technologies and data mining techniques to analyze students' learning and predict final performance. In Teaching, Assessment and Learning (TALE), 2014 International Conference on (pp. 488–494). IEEE.

Hansen, S., Narayanan, N. H., & Hegarty, M. (2002). Designing educationally effective algorithm visualizations. Journal of Visual Languages and Computing, 13(3), 291–317.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.

Hatzilygeroudis, I., Koutsojannis, C., Papavlasopoulos, C., & Prentzas, J. (2006, July). Knowledge-based adaptive assessment in a web-based intelligent educational system. In Advanced Learning Technologies, 2006. Sixth International Conference on (pp. 651–655). IEEE.

Heffernan, N. T., & Heffernan, C. L. (2014). The ASSISTments ecosystem: building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. International Journal of Artificial Intelligence in Education, 24(4), 470–497.

Helmick, M. T. (2007). Interface-based programming assignments and automatic grading of Java programs. In ACM SIGCSE Bulletin (Vol. 39, No. 3, pp. 63–67). ACM.

Hershkovitz, A., de Baker, R. S. J., Gobert, J., Wixon, M., & Sao Pedro, M. (2013). Discovery with models: a case study on carelessness in computer-based science inquiry. American Behavioral Scientist, 57(10), 1480–1499.

Higgins, C. A., & Bligh, B. (2006). Formative computer based assessment in diagram based domains. ACM SIGCSE Bulletin, 38(3), 98–102.

Higgins, C. A., Gray, G., Symeonidis, P., & Tsintsifas, A. (2005). Automated assessment and experiences of teaching programming. Journal on Educational Resources in Computing (JERIC), 5(3), 5.

Hsiao, I. H., Brusilovsky, P., & Sosnovsky, S. (2008). Web-based parameterized questions for object-oriented programming. In Proceedings of World Conference on E-Learning, E-Learn (pp. 17–21).

Hundhausen, C. D., Douglas, S. A., & Stasko, J. T. (2002). A meta-study of algorithm visualization effectiveness. Journal of Visual Languages and Computing, 13(3), 259–290.

Ihantola, P., Ahoniemi, T., Karavirta, V., & Seppälä, O. (2010). Review of recent systems for automatic assessment of programming assignments. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research (pp. 86–93). ACM.

Jackson, D., & Usher, M. (1997). Grading student programs using ASSYST. In ACM SIGCSE Bulletin (Vol. 29, No. 1, pp. 335–339). ACM.

Jenkins, T. (2002). On the difficulty of learning to program. In Proceedings of the 3rd Annual Conference of the LTSN Centre for Information and Computer Sciences (Vol. 4, pp. 53–58).

Jeremić, Z., Jovanović, J., & Gašević, D. (2012). Student modeling and assessment in intelligent tutoring of software patterns. Expert Systems with Applications, 39(1), 210–222.

Joy, M., Griffiths, N., & Boyatt, R. (2005). The BOSS online submission and assessment system. Journal on Educational Resources in Computing (JERIC), 5(3), 2.

Kordaki, M., Miatidis, M., & Kapsampelis, G. (2008). A computer environment for beginners' learning of sorting algorithms: design and pilot evaluation. Computers & Education, 51(2), 708–723.

Kwan, R., Chan, J., & Lui, A. (2004). Reaching an ITopia in distance learning—a case study. AACE Journal, 12(2), 171–187.

Lahtinen, E., Ala-Mutka, K., & Järvinen, H. M. (2005). A study of the difficulties of novice programmers. In ACM SIGCSE Bulletin (Vol. 37, No. 3, pp. 14–18). ACM.

Lau, W. W., & Yuen, A. H. (2010). Promoting conceptual change of learning sorting algorithm through the diagnosis of mental models: the effects of gender and learning styles. Computers & Education, 54(1), 275–288.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady (Vol. 10, No. 8, pp. 707–710).

Long, Y., & Aleven, V. (2013). Skill diaries: improve student learning in an intelligent tutoring system with periodic self-assessment. In Artificial Intelligence in Education (pp. 249–258). Berlin Heidelberg: Springer.

Malmi, L., Karavirta, V., Korhonen, A., Nikander, J., Seppälä, O., & Silvasti, P. (2004). Visual algorithm simulation exercise system with automatic assessment: TRAKLA2. Informatics in Education, 3(2), 267–288.

Martin, J., & VanLehn, K. (1995). Student assessment using Bayesian nets. International Journal of Human-Computer Studies, 42(6), 575–591.

Mehta, S. I., & Schlecht, N. W. (1998). Computerized assessment technique for large classes. Journal of Engineering Education, 87(2), 167.

Moreno, A., Myller, N., Sutinen, E., & Ben-Ari, M. (2004). Visualizing programs with Jeliot 3. In Proceedings of the Working Conference on Advanced Visual Interfaces (pp. 373–376). ACM.

Muñoz-Merino, P. J., Kloos, C. D., & Muñoz-Organero, M. (2011). Enhancement of student learning through the use of a hinting computer e-learning system and comparison with human teachers. IEEE Transactions on Education, 54(1), 164–167.

Naps, T. L., Rößling, G., Almstrum, V., Dann, W., Fleischer, R., Hundhausen, C., ... & Velázquez-Iturbide, J. Á. (2002). Exploring the role of visualization and engagement in computer science education. In ACM SIGCSE Bulletin (Vol. 35, No. 2, pp. 131–152). ACM.

Narciss, S. (2008). Feedback strategies for interactive learning tasks. Handbook of Research on Educational Communications and Technology, 3, 125–144.

Naudé, K. A., Greyling, J. H., & Vogts, D. (2010). Marking student programs using graph similarity. Computers & Education, 54(2), 545–561.

Nicol, D. J., & Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: a model and seven principles of good feedback practice. Studies in Higher Education, 31(2), 199–218.

Olson, T., & Wisher, R. A. (2002). The effectiveness of web-based instruction: an initial inquiry. The International Review of Research in Open and Distributed Learning, 3(2).

Pavlik Jr, P. I., Cen, H., & Koedinger, K. R. (2009). Performance factors analysis – a new alternative to knowledge tracing. Online Submission.

Perikos, I., Grivokostopoulou, F., & Hatzilygeroudis, I. (2012). Automatic marking of NL to FOL conversions. In Proc. of 15th IASTED International Conference on Computers and Advanced Technology in Education (CATE), Napoli, Italy (pp. 227–233).

Rajala, T., Laakso, M. J., Kaila, E., & Salakoski, T. (2007). VILLE: a language-independent program visualization tool. In Proceedings of the Seventh Baltic Sea Conference on Computing Education Research - Volume 88 (pp. 151–159). Australian Computer Society, Inc.

Rosa, K. D., & Eskenazi, M. (2013). Self-assessment in the REAP tutor: knowledge, interest, motivation, & learning. International Journal of Artificial Intelligence in Education, 21(4), 237–253.

San Pedro, M. O. Z., Baker, R. S. d., & Rodrigo, M. M. T. (2014). Carelessness and affect in an intelligent tutoring system for mathematics. International Journal of Artificial Intelligence in Education, 24(2), 189–210.

San Pedro, M. O. C. Z., d Baker, R. S., & Rodrigo, M. M. T. (2011). Detecting carelessness through contextual estimation of slip probabilities among students using an intelligent tutor for mathematics. In Artificial Intelligence in Education (pp. 304–311). Berlin Heidelberg: Springer.

Sánchez-Torrubia, M. G., Torres-Blanc, C., & López-Martínez, M. A. (2009). PathFinder: a visualization eMathTeacher for actively learning Dijkstra's algorithm. Electronic Notes in Theoretical Computer Science, 224, 151–158.

Shepard, L. A. (2005). Linking formative assessment to scaffolding. Educational Leadership, 63(3), 66–70.

Shute, V. J., & Zapata-Rivera, D. (2010). Educational measurement and intelligent systems. In International Encyclopedia of Education. Oxford: Elsevier Publishers.

Siler, S. A., & VanLehn, K. (2003). Accuracy of tutors' assessments of their students by tutoring context. In Proceedings of the 25th Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.

Sitzmann, T., Kraiger, K., Stewart, D., & Wisher, R. (2006). The comparative effectiveness of web-based and classroom instruction: a meta-analysis. Personnel Psychology, 59(3), 623–664.

Stajduhar, I., & Mausa, G. (2015). Using string similarity metrics for automated grading of SQL statements. In Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2015 38th International Convention on (pp. 1250–1255). IEEE.

Suleman, H. (2008). Automatic marking with Sakai. In Proceedings of the 2008 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists on IT Research in Developing Countries: Riding the Wave of Technology (pp. 229–236). ACM.

Thomas, P., Smith, N., & Waugh, K. (2008). Automatically assessing graph-based diagrams. Learning, Media and Technology, 33(3), 249–267.

VanLehn, K. (2006). The behavior of tutoring systems. International Journal of Artificial Intelligence in Education, 16(3), 227–265.

VanLehn, K. (2008). Intelligent tutoring systems for continuous, embedded assessment. The Future of Assessment: Shaping Teaching and Learning, 113–138.

VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221.

Viera, A. J., & Garrett, J. M. (2005). Understanding interobserver agreement: the kappa statistic. Family Medicine, 37(5), 360–363.

Vujošević-Janičić, M., Nikolić, M., Tošić, D., & Kuncak, V. (2013). Software verification and graph similarity for automated evaluation of students' assignments. Information and Software Technology, 55(6), 1004–1016.

Wang, T., Su, X., Ma, P., Wang, Y., & Wang, K. (2011). Ability-training-oriented automated assessment in introductory programming course. Computers & Education, 56(1), 220–226.

Watson, C., & Li, F. W. (2014). Failure rates in introductory programming revisited. In Proceedings of the 2014 Conference on Innovation & Technology in Computer Science Education (pp. 39–44). ACM.

Winne, P. H., & Hadwin, A. F. (1998). Studying as self-regulated learning. Metacognition in Educational Theory and Practice, 93, 27–30.

Woolf, B. P. (2010). Building intelligent interactive tutors: student-centered strategies for revolutionizing e-learning. Morgan Kaufmann.