

International Journal of Civil Engineering and Technology (IJCIET)
Volume 10, Issue 01, January 2019, pp. 1674–1685, Article ID: IJCIET_10_01_153
Available online at http://www.iaeme.com/ijciet/issues.asp?JType=IJCIET&VType=10&IType=01
ISSN Print: 0976-6308; ISSN Online: 0976-6316
© IAEME Publication, Scopus Indexed

USE OF RECURRENT NEURAL NETWORK ARCHITECTURES FOR DATA VERIFICATION IN THE SYSTEM OF DISTANCE EDUCATION

Arseniy Aleksandrovich Lebedev

Laboratory of Innovations, Ltd, Kazan, Tatarstan, Russian Federation

ABSTRACT

There are very few examples of the use of different recurrent neural network (RNN) architectures to predict student learning outcomes. In practice, the only architecture applied to this problem is LSTM. The works devoted to the use of LSTM for predicting educational outcomes do not present a detailed theoretical justification for preferring this particular RNN architecture. It therefore seems advisable to provide such a justification within the framework of this study. The main property of the input data for predicting educational outcomes is its temporal nature: a sequence of user actions unfolds in time and is evaluated (classified) by an external observer as evidence of the presence or absence of an educational result (subject-specific or meta-subject). Consequently, an RNN used to classify user actions must adjust the neuron weights over a certain set of past states, and the length of this sequence of states is not known in advance: it can be short (for example, for subject-specific results) or quite long.

Keywords: Distance education, Recurrent neural network, Architecture, Structure, Information technology, Monitoring, Educational outcomes prediction, Online courses.

Cite this Article: Arseniy Aleksandrovich Lebedev, Use of Recurrent Neural Network Architectures for Data Verification in the System of Distance Education, International Journal of Civil Engineering and Technology, 10(01), 2019, pp. 1674–1685. http://www.iaeme.com/IJCIET/issues.asp?JType=IJCIET&VType=10&IType=01

1. INTRODUCTION

Early RNN architectures used backpropagation through time [1] and real-time recurrent learning [2] to adjust the weights over past states. Both methods ran into two main problems with the propagation of the error signal.

1.1. Two main problems with the propagation of the error signal
1. The error signal increased sharply (exploded) with the temporal distance to past states;


2. The error signal vanished.
In both cases, the change in the error signal over time depended exponentially on the magnitudes of the fitted weights. Problem 1 led to the undesirable effect of constant oscillation of the weights; problem 2 led to a situation where, with a sufficiently large temporal distance between the states affected by the error signal, network training either took an unacceptably long time or did not occur at all [3].

2. LITERATURE REVIEW

Researchers have proposed a variety of RNN (recurrent neural network) architectures aimed at solving these problems.

2.1. Altinay

The work by Altinay in 2017 [4] provides an overview of a number of approaches that use various modified gradient descent schemes to address the problems of error signal propagation, but none of the proposed options solves both problems at once.

2.2. Zagami

In the work by Jason Zagami in 2018 [5], the architecture of a neural network with a time delay was proposed as a solution to both problems, but only for relatively short sequences of states. In such a network, the neuron weights are updated with a weighted sum of the old weights.

2.3. Fox-Turnbull

The idea of time-delayed recurrent networks formed the basis of the NARX neural network described in the work by Fox-Turnbull in 2016 [6].

2.4. Campbell

To solve the problems of error signal propagation for relatively long sequences of states, the work by Campbell in 2015 [7] proposed using a set of time constants governing the updating of the weights.

2.5. Baines and Chen

An attempt to combine the time-delay approach with time constants regulating the updating of the weights was undertaken in the work by Mapotse in 2018 [20]. However, both in the work by Baines in 2018 [8] and in the work by Chen in 2018 [9], long sequences of states required a careful and time-consuming process of selecting the time constants.

2.6. Hsu

An alternative solution to both problems for short and long sequences of events was described in the work by Hsu in 2016 [10]. The authors proposed updating the weights of the recurrent cell by summing the old weight and the current normalized input value. However, the normalized current input gradually distorted (supplanted) the stored information about past states, which made it impossible to work with long sequences of states.

2.7. Fletcher-Watson

In the work by Fletcher-Watson in 2015 [11], to address the error signal problems on long sequences of events, it was proposed to use special, separate network cells that affect the weights. Such cells are added only if conflicting error signals occur in the network. In a limited number of cases, this approach can significantly reduce the amount of computation in the network; however, in unfavorable cases the number of additional cells can become equal to the number of states in the sequence, which leads to problems similar to the infinite oscillation of weights.

2.8. Hochreiter and Schmidhuber

In the paper by Hochreiter and Schmidhuber [12], the LSTM recurrent neural network architecture was proposed. It solves both problems of error signal propagation for almost arbitrary sequences of states. In this architecture, the backpropagated error signal is automatically kept constant by an efficient algorithm based on gradient descent. Signal propagation occurs through the states of network cells that have a specific four-layer architecture. As a result, LSTM is capable of maintaining temporal relationships over more than 1000 states even with fairly “noisy” input data, and at the same time it does not lose this property on short sequences of states.

3. MATERIALS AND METHODS

In contrast to classical machine learning methods, which find a point estimate of the neural network parameters $w$, in Bayesian neural networks the objects, target variables and parameters are treated as random variables. Accordingly, the neural network models the dependence $p(y \mid x, w)$. The prior distribution $p(w)$ encodes initial knowledge and expectations about the parameters; for example, in thinning models the prior distribution encourages zero parameter values. The learning process consists in finding the posterior distribution of the parameters $p(w \mid D)$. The predictions of the model are then given as

$p(y \mid x) = \mathbb{E}_{p(w \mid D)}\, p(y \mid x, w).$  (1)

To find the posterior distribution of the parameters, the Bayes formula would have to be applied:

$p(w \mid D) = \dfrac{p(D \mid w)\, p(w)}{\int p(D \mid w)\, p(w)\, dw}.$  (2)

This fails because the integral in the denominator is intractable. Therefore, an approximate posterior distribution $q_\lambda(w)$ is sought within a certain parametric family of distributions, where $\lambda$ are the parameters of the approximate posterior. The parameters $\lambda$ are chosen so as to minimize the KL divergence:

$\mathrm{KL}\big(q_\lambda(w)\,\|\,p(w \mid D)\big) \to \min_{\lambda},$  (3)

which is equivalent to maximizing the variational lower bound on the log-likelihood:

$\mathcal{L}(\lambda) = \sum_{i=1}^{N} \mathbb{E}_{q_\lambda(w)} \log p(y_i \mid x_i, w) - \mathrm{KL}\big(q_\lambda(w)\,\|\,p(w)\big) \to \max_{\lambda}.$  (4)

This expression is essentially the sum of a term responsible for the quality of the solution of the problem and a regularizer indicating that the posterior distribution of the parameters should remain close to the prior. The first term is usually estimated with the Monte Carlo method, using one sample of the weights per object:

$\mathbb{E}_{q_\lambda(w)} \log p(y_i \mid x_i, w) \approx \log p(y_i \mid x_i, \hat{w}), \qquad \hat{w} \sim q_\lambda(w).$  (5)

To avoid biased gradients, the reparameterization trick described by Damewood in 2016 [13] is applied: the weights are expressed as a deterministic function of the variational parameters and an auxiliary noise variable,

$w = f(\lambda, \varepsilon), \qquad \varepsilon \sim p(\varepsilon).$  (6)


The reparameterization trick is not applicable to all distributions, but it is applicable, for example, to the normal distribution:

$w = \theta + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I),$  (7)

where $\odot$ denotes elementwise multiplication.
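To make the training objective concrete, the following minimal Python (NumPy) sketch estimates the variational lower bound (4) for a toy Bayesian linear model with a factorized Gaussian posterior and a standard normal prior; all names, sizes and values are illustrative and are not taken from the original study.

import numpy as np

rng = np.random.default_rng(0)

# Variational parameters lambda = (theta, log_sigma) of a factorized Gaussian posterior q_lambda(w)
theta = rng.normal(0.0, 0.1, size=5)          # posterior means
log_sigma = np.full(5, -3.0)                  # posterior log standard deviations

def sample_weights():
    # Reparameterization trick (6)-(7): w = theta + sigma * eps, eps ~ N(0, I)
    eps = rng.standard_normal(theta.shape)
    return theta + np.exp(log_sigma) * eps

def log_likelihood(w, x, y, noise_std=1.0):
    # Gaussian regression likelihood log p(y | x, w)
    resid = y - x @ w
    return -0.5 * np.sum((resid / noise_std) ** 2 + np.log(2 * np.pi * noise_std ** 2))

def kl_to_standard_normal():
    # Closed-form KL(q_lambda(w) || N(0, I)); a standard normal prior is used here for simplicity
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * np.sum(sigma2 + theta ** 2 - 1.0 - 2 * log_sigma)

# One-sample Monte Carlo estimate of the lower bound (4), cf. (5)
x = rng.standard_normal((100, 5))
y = x @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(100)
elbo = log_likelihood(sample_weights(), x, y) - kl_to_standard_normal()

In the thinning models discussed below, the standard normal prior is replaced by a log-uniform one, so the KL term is approximated rather than computed in closed form.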

In addition, the multiple regression method, the logistic regression method and the coefficient of determination were used as controls in the experiment.

4. RESULTS AND DISCUSSIONS

Dropout [14] is a regularization technique for neural networks that imposes multiplicative noise on the inputs of each layer. Typically, the elements of the noise vector are generated from a Bernoulli distribution (binary dropout) or from a normal distribution centered at 1 (Gaussian dropout), and the parameters of this noise are tuned by cross-validation. The work by McLain in 2018 [15] proposed an interpretation of Gaussian dropout as a way of specifying a Bayesian neural network, which made it possible to adjust the noise parameters automatically. In the work by Virtanen in 2015 [16], this approach was extended to thin fully connected neural networks and was called thinning variational dropout (TVD).
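For reference, the two noise schemes mentioned above can be written in a few lines of NumPy; the drop probability here is an arbitrary illustrative value, not one used in the cited works.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))        # a mini-batch of layer inputs

p = 0.5                                # drop probability (illustrative)
# Binary dropout: Bernoulli mask, rescaled so that the expected activation is unchanged
binary_mask = rng.binomial(1, 1.0 - p, size=x.shape) / (1.0 - p)
x_binary = x * binary_mask

# Gaussian dropout: multiplicative noise centered at 1 with variance alpha = p / (1 - p)
alpha = p / (1.0 - p)
x_gaussian = x * rng.normal(1.0, np.sqrt(alpha), size=x.shape)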

Consider a fully connected layer $h = g(Wx + b)$ with a weight matrix $W$. In the TVD, the prior distribution of the weights is a factorized log-uniform distribution:

$p(W) = \prod_{i,j} p(w_{ij}), \qquad p(|w_{ij}|) \propto \dfrac{1}{|w_{ij}|}.$  (8)

This distribution has a large mass near zero and therefore encourages zero weights. The approximate posterior distribution is sought in the family of factorized normal distributions:

$q(W) = \prod_{i=1}^{k}\prod_{j=1}^{n} q(w_{ij}), \qquad q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) = \mathcal{N}\big(w_{ij} \mid \theta_{ij}, \alpha_{ij}\theta_{ij}^{2}\big).$  (9)

The use of such a posterior distribution is equivalent to imposing multiplicative [17]

$w_{ij} = \theta_{ij}\,\xi_{ij}, \qquad \xi_{ij} \sim \mathcal{N}(1, \alpha_{ij}),$  (10)

or additive [18]

$w_{ij} = \theta_{ij} + \sigma_{ij}\,\varepsilon_{ij}, \qquad \varepsilon_{ij} \sim \mathcal{N}(0, 1), \qquad \sigma_{ij}^{2} = \alpha_{ij}\theta_{ij}^{2},$  (11)

normal noise on the weights. The parametrization of the weights (11) is called additive reparametrization and makes it possible to reduce the variance of the gradients of $\mathcal{L}$ with respect to the weight means $\theta_{ij}$. In addition, since a sum of normal random variables is again normally distributed with easily computed parameters, the noise can be imposed on the pre-activations $Wx$ rather than separately on the components of the matrix $W$. This technique is called local reparametrization [19, 20]. Local reparameterization reduces the variance of the gradients even further and also saves computation, since sampling noise on the weights separately for each object is an expensive operation.

In the TVD, the variational lower bound (4) is optimized with respect to $\{\theta, \log\sigma\}$ using the reparameterization trick, additive reparametrization and local reparametrization, which yield unbiased, low-variance gradients. Since the prior and the approximate posterior factorize over the weights, the KL divergence also splits into a sum over individual weights, and each term depends only on the noise variance $\alpha_{ij}$ due to the special choice of the prior distribution:

$-\mathrm{KL}\big(q(w_{ij} \mid \theta_{ij}, \alpha_{ij})\,\|\,p(w_{ij})\big) = k(\alpha_{ij}), \qquad k(\alpha) \approx 0.64\,\sigma(1.87 + 1.49\log\alpha) - 0.5\log(1 + \alpha^{-1}) + C,$  (12)

where $\sigma(\cdot)$ is the sigmoid function; this expression is a fairly accurate approximation. The KL term (12) encourages large values of $\alpha_{ij}$ and small absolute values of $\theta_{ij}$. If $\alpha_{ij} \to \infty$ for the weight $w_{ij}$, then because of the large noise variance it is advantageous for the model to set $\theta_{ij} = 0$ and $\sigma_{ij}^{2} = \alpha_{ij}\theta_{ij}^{2} = 0$ in order to avoid large prediction errors. As a result, the distribution $q(w_{ij} \mid \theta_{ij}, \alpha_{ij})$ approaches a delta function at 0, and this weight is always zero.
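A small NumPy sketch of how the approximation (12) and the resulting thinning rule can be implemented follows; the threshold value and the variational parameters are illustrative assumptions rather than values from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_kl_approx(log_alpha):
    # Approximation (12) of -KL(q || log-uniform prior), up to an additive constant C
    return 0.64 * sigmoid(1.87 + 1.49 * log_alpha) - 0.5 * np.log1p(np.exp(-log_alpha))

rng = np.random.default_rng(2)
theta = rng.normal(0.0, 0.5, size=(3, 4))            # posterior means of the weights
log_sigma2 = rng.normal(-6.0, 1.0, size=(3, 4))      # posterior log variances
log_alpha = log_sigma2 - np.log(theta ** 2 + 1e-8)   # alpha = sigma^2 / theta^2

regularizer = -neg_kl_approx(log_alpha).sum()        # the KL term subtracted in the bound (4)

# Test-time thinning: weights with large alpha are treated as uninformative and zeroed out
mask = (log_alpha < 3.0).astype(float)               # illustrative log-alpha threshold
w_test = theta * mask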

The thinning variational dropout model was extended to achieve group thinning of the fully connected layer. Group thinning refers to removing a whole group of weights from the model, for example a row or a column of the weight matrix. Group thinning makes it possible to remove elements of the hidden layers of the neural network, which speeds up the forward pass. As an example, we combine the columns of the weight matrix of a fully connected layer into groups and number them $1, \ldots, k$.

The authors propose to introduce a group multiplicative weight $z_i$ for each weight group and to adjust the weights in the following parameterization:

$w_{ij} = \hat{w}_{ij}\, z_i.$  (13)

In a fully connected layer, this parametrization is equivalent to imposing multiplicative noise on the layer input:

$h = g\big(\hat{W}(x \odot z) + b\big).$  (14)

Since the main task is to zero out $z_i$, the authors use for the multiplicative variables the same pair of prior and approximate posterior distributions as in the TVD:

$p(|z_i|) \propto \dfrac{1}{|z_i|}, \qquad q(z_i \mid \theta^{z}_i, \alpha^{z}_i) = \mathcal{N}\big(z_i \mid \theta^{z}_i, \alpha^{z}_i (\theta^{z}_i)^{2}\big).$  (15)

For the individual weights, a standard normal prior distribution is used, and the posterior, as in the TVD, is approximated in the class of normal distributions:

$p(\hat{w}_{ij}) = \mathcal{N}(\hat{w}_{ij} \mid 0, 1), \qquad q(\hat{w}_{ij}) = \mathcal{N}(\hat{w}_{ij} \mid \theta_{ij}, \sigma_{ij}^{2}).$  (16)

The prior on the individual weights encourages zero means $\theta_{ij}$, which in turn helps drive the group means $\theta^{z}_i$ to zero, that is, to zero out the group variables. The model is trained in the same way as the TVD model, by optimizing the variational lower bound (4). The KL divergence splits into a sum of KL divergences for the group variables and for the weights, with the last term computed analytically.
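A hedged NumPy illustration of the group parameterization (13) and its thinning effect on a fully connected layer is given below; the grouping by input columns and the threshold are assumptions made for the sketch.

import numpy as np

rng = np.random.default_rng(3)
n_in, n_out = 6, 4

theta_w = rng.normal(0.0, 0.3, size=(n_out, n_in))   # posterior means of individual weights
theta_z = rng.normal(1.0, 0.1, size=n_in)            # posterior means of group variables z_i
log_alpha_z = rng.normal(-4.0, 2.0, size=n_in)       # noise parameters of the group variables

# Parameterization (13): every weight in group i (here, an input column) is scaled by z_i
W = theta_w * theta_z[None, :]

# Test-time group thinning: groups with large alpha are dropped, which removes whole input
# features from the layer, cf. the equivalent formulation (14)
keep = log_alpha_z < 3.0                              # illustrative threshold
x = rng.standard_normal(n_in)
h = np.tanh(W[:, keep] @ x[keep])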

For most tasks, recurrent neural networks are defined by dense weight matrices, with most of the weights being uninformative and not affecting the quality of the solution. Although heuristic approaches to thinning RNNs based on a large number of hyperparameters exist, the use of Bayesian thinning techniques had not previously been investigated for recurrent neural networks. On the other hand, the literature describes various models of Bayesian regularization of RNNs, some features of which are also reflected in the proposed model.

When applying the TVD to an RNN, the following features of the recurrent layer should be taken into account:


• the weights in the recurrent layer are tied in time, that is, different elements of the input sequence are multiplied by the same weight matrices;
• in Bayesian regularization of an RNN, the current hidden state $h_t$ and the matrix of recurrent weights $W^h$ are not independent random variables, since the latter enters the expressions used to compute the former.

First, we consider the thinning variational dropout model for the recurrent layer, and then we note the features of applying the TVD to a fully connected layer and to the representation layer of an RNN. As before, we use a log-uniform prior distribution on the weights of the recurrent layer $\{W^x, W^h\}$ and approximate the posterior in the class of normal distributions:

$p(|w_{ij}|) \propto \dfrac{1}{|w_{ij}|}, \qquad q(w_{ij} \mid \theta_{ij}, \sigma_{ij}) = \mathcal{N}\big(w_{ij} \mid \theta_{ij}, \sigma_{ij}^{2}\big)$  (14)

for every weight $w_{ij}$ of the matrices $W^x$ and $W^h$.

The model is trained by optimizing the variational lower bound

$\mathcal{L}(\theta, \sigma) = \sum_{i=1}^{N} \mathbb{E}_{q(w \mid \theta, \sigma)} \log p(y_i \mid x_i, w) - \mathrm{KL}\big(q(w \mid \theta, \sigma)\,\|\,p(w)\big) \to \max$  (15)

with respect to the parameters $\{\theta, \log\sigma\}$ using stochastic gradient optimization methods. In expression (15), the first term is the likelihood of the model averaged over the weight distribution $q(w \mid \theta, \sigma)$. During optimization, this likelihood is estimated by the Monte Carlo method with one sample of the weights. As in the TVD model, the reparameterization trick and additive reparametrization are used here to obtain unbiased, low-variance gradients.

(16)

In the likelihood, the dependence on the target variable $y_i$ is unrolled in time to emphasize that the same weights $W^x, W^h$ are used at all time steps. Likewise, the normal noise in Bayesian regularization of an RNN must be tied in time: the same noise sample is used for one object at all time steps.

However, in Bayesian regularization of an RNN, local reparameterization cannot be applied to either the $W^x$ or the $W^h$ weights. Applying local reparameterization to the weight matrix $W^x$ in an RNN implies using the same noise sample for the pre-activations $W^x x_t$ for all $t$, which is not equivalent to using the same sample of the $W^x$ weights at all points in time. For $W^h$, local reparametrization cannot be applied for another reason: since $h_{t-1}$ and $W^h$ are not independent random variables, the statement about the sum of normal distributions is not applicable to the product $W^h h_{t-1}$. Instead of local reparameterization, in order to avoid resource-intensive sampling of three-dimensional noise tensors, it is proposed to use one sample of the $W^x$ and $W^h$ weights for all objects of one mini-batch.

A similar scheme can be applied to gated architectures, for example LSTM. In this case, a prior and an approximate posterior distribution are used for each of the parameter matrices.
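The following NumPy sketch shows the two points just made for a plain recurrent layer: one weight sample is drawn per mini-batch, and the same sampled weights (hence the same noise) are reused at every time step; sizes and initial values are illustrative.

import numpy as np

rng = np.random.default_rng(4)
batch, time_steps, n_in, n_hid = 8, 5, 3, 4

# Variational parameters of the recurrent-layer weights {W^x, W^h}
theta_x, log_sigma_x = rng.normal(0, 0.2, (n_hid, n_in)), np.full((n_hid, n_in), -3.0)
theta_h, log_sigma_h = rng.normal(0, 0.2, (n_hid, n_hid)), np.full((n_hid, n_hid), -3.0)

def sample(theta, log_sigma):
    # Additive reparameterization; a single weight sample shared by the whole mini-batch
    return theta + np.exp(log_sigma) * rng.standard_normal(theta.shape)

Wx, Wh = sample(theta_x, log_sigma_x), sample(theta_h, log_sigma_h)

x = rng.standard_normal((batch, time_steps, n_in))
h = np.zeros((batch, n_hid))
for t in range(time_steps):
    # The same Wx, Wh are used at all time steps, as required for the recurrent layer
    h = np.tanh(x[:, t, :] @ Wx.T + h @ Wh.T)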


At the testing stage, the mean weights $\Theta^x, \Theta^h$ can be used, by analogy with the TVD. In addition, weights with large values of the noise parameter $\alpha$ are zeroed out. The final scheme of one forward time step through the thinning Bayesian RNN, including the zeroing of weights in test mode, is given in algorithm (4).

When applying the thinning dropout to other RNN layers preceding or following the recurrent layer, the same sample of weights should be used at all time steps for one object during training. Thus, when formulating the TVD model for an RNN, the following features of recurrent neural networks are taken into account:
1. the same noise is sampled on the weights at all points in time;
2. unlike feed-forward networks, local reparametrization is not applicable to an RNN, therefore one weight matrix is sampled for all objects of a mini-batch.

In order for thinning to speed up the forward pass through the recurrent neural network when computing on GPUs, weights need to be removed in groups corresponding to a single neuron. To do this, the approach described above can be applied. However, this approach can be improved to obtain different levels of sparsity in gated recurrent architectures. We consider this approach for the most popular gated architecture, LSTM.

In LSTM, in addition to the hidden state vector, an internal memory vector $c_t$ is maintained at each time step. At each step, the memory is first updated using the gating mechanism, and then the hidden state is updated:

$i_t = \sigma(W^x_i x_t + W^h_i h_{t-1} + b_i), \qquad f_t = \sigma(W^x_f x_t + W^h_f h_{t-1} + b_f),$
$o_t = \sigma(W^x_o x_t + W^h_o h_{t-1} + b_o), \qquad g_t = \tanh(W^x_g x_t + W^h_g h_{t-1} + b_g),$
$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t).$  (17)
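For completeness, a compact NumPy implementation of one step of the standard LSTM cell described by (17) is shown; parameter names follow the gate notation used in the text, and the random initialization is purely illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    # One LSTM time step following (17): gates i, f, o and information flow g
    i = sigmoid(x_t @ params["Wix"].T + h_prev @ params["Wih"].T + params["bi"])
    f = sigmoid(x_t @ params["Wfx"].T + h_prev @ params["Wfh"].T + params["bf"])
    o = sigmoid(x_t @ params["Wox"].T + h_prev @ params["Woh"].T + params["bo"])
    g = np.tanh(x_t @ params["Wgx"].T + h_prev @ params["Wgh"].T + params["bg"])
    c = f * c_prev + i * g          # memory update
    h = o * np.tanh(c)              # hidden state update
    return h, c

rng = np.random.default_rng(5)
n_in, n_hid = 3, 4
params = {name: rng.normal(0, 0.2, (n_hid, n_in if name.endswith("x") else n_hid))
          for name in ["Wix", "Wih", "Wfx", "Wfh", "Wox", "Woh", "Wgx", "Wgh"]}
params.update({b: np.zeros(n_hid) for b in ["bi", "bf", "bo", "bg"]})
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((6, n_in)):     # a sequence of 6 input vectors
    h, c = lstm_step(x_t, h, c, params)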

By analogy with (14), in order to achieve group thinning, we introduce multiplicative group variables on the weights. In addition to the group variables $z^x$ and $z^h$ on the rows of the weight matrices, responsible for excluding elements of the input and hidden vectors, we also introduce group variables $z^i, z^f, z^g$ and $z^o$ on the columns of the weight matrices, responsible for applying the gates $i, f, o$ and the information flow $g$ to the input data. For example, for the matrix $W^x_f$ we obtain the following parameterization of the weights:

$w^x_{f,ij} = \hat{w}^x_{f,ij}\, z^x_i\, z^f_j.$  (18)

Such a parametrization corresponds to imposing multiplicative noise on the input vector $x_t$ and the hidden state $h_t$, as well as separate multiplicative noise on the pre-activations of the gates and the information flow:

(19)


When components of $z^x$ or $z^h$ are zeroed out, the corresponding element of the input vector or of the hidden state is excluded from the model. When components of $z^i, z^f, z^g$ or $z^o$ are zeroed out, the corresponding element of the gate or of the information flow becomes constant, no longer determined by the input data $x_t$ and $h_t$. Note that the appearance of constant gates simplifies but does not violate the structure of the LSTM and, in addition, saves the computation of matrix products.
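A brief NumPy sketch of the parameterization (18) for the forget-gate input matrix and of the effect of zeroing a gate group variable follows; the index convention (rows = gate units, columns = input elements) is an assumption made for readability.

import numpy as np

rng = np.random.default_rng(6)
n_in, n_hid = 5, 4

w_hat_fx = rng.normal(0.0, 0.3, (n_hid, n_in))   # individual weights of the forget-gate input matrix
z_x = rng.normal(1.0, 0.1, n_in)                  # group variables on input-vector elements
z_f = rng.normal(1.0, 0.1, n_hid)                 # group variables on forget-gate pre-activations

# Parameterization (18): each weight is scaled by its input group variable and its gate group variable
W_fx = w_hat_fx * z_f[:, None] * z_x[None, :]

# Zeroing one component of z_f makes the corresponding forget-gate element constant:
z_f[2] = 0.0
W_fx = w_hat_fx * z_f[:, None] * z_x[None, :]
x_t = rng.standard_normal(n_in)
pre_activation = W_fx @ x_t                       # row 2 is identically zero, so that gate value
                                                  # reduces to sigmoid(b_f[2]) regardless of x_t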

A standard normal prior distribution for the individual weights $\hat{w}_{ij}$ had been used before, but in practice this limits the thinning of the model. In this paper, it is proposed to use the same pair of prior and approximate posterior distributions as in the TVD for the RNN for all groups of weights (for example, the distributions for the $W^x_f$ matrix are given):

(20)

Due to the thinning of all three groups of weights, a hierarchical effect is achieved: thinning individual weights contributes to the appearance of constant gates and simplifies the structure of the LSTM, which in turn helps to eliminate elements of $x_t$ and $h_t$.

The group variables $z^x, z^h, z^i, z^f, z^g$ and $z^o$ are sampled using the reparametrization trick and additive reparametrization, as in the TVD model for the RNN. Training, the forward pass and the testing phase are the same as for the TVD model for the RNN, with the addition of sampling the group variables in the forward pass and of the KL-divergence components responsible for the group variables in the variational lower bound (15).

Group thinning can be applied in a similar way to the representation layer. To do this, group multiplicative variables are introduced on the elements of the dictionary, and both the elements of the representation matrix and the multiplicative group variables are thinned. The prior and approximate posterior distributions for the weights remain the same as in the group TVD model for the RNN described above. Applying such a model achieves the effect of thinning the input dictionary, that is, feature selection.

Thus, when formulating the group TVD model for the RNN, the following features of gated recurrent neural networks are taken into account:
• multiplicative variables are introduced on the pre-activations of the gates and the information flow;
• for the weights $\hat{w}_{ij}$, a thinning log-uniform prior distribution is used, which enhances the thinning of the group variables.

A traditional RNN maps an input sequence of vectors $x_1, \ldots, x_T$ to an output sequence of vectors $y_1, \ldots, y_T$. This is achieved by computing a sequence of “hidden” states $h_1, \ldots, h_T$, which can be viewed as a running record of information about past states that is relevant for predicting future ones. These variables are connected by a simple system of equations:

$h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h),$  (21)
$y_t = \sigma(W_{yh} h_t + b_y),$  (22)


where $\tanh$ and the sigmoid function $\sigma(\cdot)$ are applied elementwise, $W_{hx}$ is the matrix of input weights, $W_{hh}$ is the matrix of recurrent weights, $W_{yh}$ is the matrix of output weights, and $b_h$ and $b_y$ are the bias terms of the hidden and output units, respectively.
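Equations (21)-(22) correspond to the following minimal NumPy loop; the dimensions and random weights are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(7)
n_in, n_hid, n_out, T = 3, 4, 2, 5
W_hx, W_hh = rng.normal(0, 0.3, (n_hid, n_in)), rng.normal(0, 0.3, (n_hid, n_hid))
W_yh = rng.normal(0, 0.3, (n_out, n_hid))
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)

h = np.zeros(n_hid)
for x_t in rng.standard_normal((T, n_in)):
    h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)   # hidden state update (21)
    y_t = sigmoid(W_yh @ h + b_y)              # output at each step (22)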

A traditional RNN has problems with error signal propagation when the number of analyzed states is large. The RNN architecture that solves this problem most efficiently is LSTM. In an LSTM, hidden cells retain their values until they are explicitly cleared by the triggering of special “forget gate” units. Thanks to the LSTM gating mechanism, information about intermediate states is stored much longer than in a conventional RNN, which greatly facilitates training. In addition, in an LSTM the hidden layers are updated through multiplicative rather than additive interactions, which allows this architecture to represent much more complex transformations with the same number of neurons in the hidden layers.

The LSTM variant of the RNN was used in the two most recent experimental studies of the capabilities of RNNs for predicting educational outcomes.

In the work of researchers from Kyushu University, the accuracy of predicting the future score for the course “Information Science” was investigated on the basis of the actions of 108 students logged by the M2B learning support system. The input to the RNN was obtained from the logs by giving each student a score from 0 to 6 on the following scales:

1. attendance (0 – absent, 3 – late attendance, 5 – on-time attendance);
2. test score (with the scale broken down in steps of 20%);
3. submission of the report (0 – not submitted, 3 – submitted late, 5 – submitted on time);
4. the number of views of course materials, of materials in the electronic textbook service, of actions in the electronic textbook service, and of words in texts sent to the electronic portfolio service (broken down in steps of 10%; below 50% – 0 points).

In total, 9 variables were proposed. The output was the final score for the course on a 5-point scale. The paper does not indicate the specific type of LSTM used for forecasting; however, based on the brief description, it can be assumed that it is the traditional LSTM described above.

One step (state) consisted of the scores on the 9 scales given to a student for one week of classes. The maximum number of weeks used in the experiment was 15. Multiple regression and the coefficient of determination were used as control methods in the experiment. Figure 1 shows the accuracy of the prediction of the final score after each week of the course.

Figure 1. The accuracy of the final score prediction after each week of the course


From Figure 1 it can be seen that, when using LSTM, a prediction accuracy of more than 90% is achieved by the 6th week, while multiple regression reaches it only by the 10th week and the coefficient of determination only by the 14th.
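As an illustration of how the input described above could be laid out, the following hypothetical NumPy sketch arranges the weekly scores into a sequence tensor suitable for an RNN; the array names, shapes and random contents are assumptions, not data from the cited study.

import numpy as np

rng = np.random.default_rng(8)
n_students, n_weeks, n_features = 108, 15, 9      # 9 weekly scores per student, up to 15 weeks

# Hypothetical stand-in for the logged scores: integer values on the 0-6 range described above
weekly_scores = rng.integers(0, 7, size=(n_students, n_weeks, n_features)).astype(float)
final_grades = rng.integers(0, 5, size=n_students)   # final course score on a 5-point scale

# After week k, the model only sees the prefix of each sequence:
k = 6
inputs_after_week_k = weekly_scores[:, :k, :]      # shape (students, k weeks, 9 features)
# inputs_after_week_k would be fed to an LSTM whose final state predicts final_grades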

Of particular interest in this work is the fact that the input data are not individual properties of the students but the program code they wrote. For the vectorization of program code, the researchers used a method of their own based on building an AST representation of the student's program code, by analogy with the representation of texts in natural language.
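Python's standard ast module can illustrate the general idea of representing a program through its syntax tree; this is only a simplified, hypothetical stand-in for the vectorization method actually developed by the researchers.

import ast
from collections import Counter

def ast_bag_of_nodes(source: str) -> Counter:
    # Represent a submission by the counts of its AST node types (a toy analogue of a
    # bag-of-words representation of natural-language text)
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

submission = "def area(r):\n    return 3.14 * r * r\n"
features = ast_bag_of_nodes(submission)   # e.g. counts of FunctionDef, Return, BinOp, Name, ...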

As a control method, logistic regression was used, the input to which was a two-dimensional vector. The first element of this vector is computed as a function inversely proportional to the number of program codes submitted by the student that are close to the correct version. The second element is a binary indicator of the success or failure of the assignment.
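A minimal sketch of such a two-feature logistic regression control model is shown below; the exact form of the first feature is not specified in detail in the text, so an inverse-proportional placeholder is used, and the coefficients are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def control_features(n_close_to_correct: int, solved: bool) -> np.ndarray:
    # Feature 1: inversely proportional to the number of near-correct submissions (placeholder form)
    # Feature 2: binary success/failure indicator for the assignment
    f1 = 1.0 / (1.0 + n_close_to_correct)
    return np.array([f1, float(solved)])

w, b = np.array([-1.5, 2.0]), 0.1       # illustrative fitted coefficients
x = control_features(n_close_to_correct=3, solved=True)
p_success_next = sigmoid(w @ x + b)     # predicted probability of success on the next task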

Figure 2 shows the curves of prediction accuracy for the educational outcome of the next assignment, for the control method and for LSTM.

Figure 2. Accuracy of predicting the educational result of the next task, depending on the number of attempts students made to submit the task

Figure 2 shows that the LSTM method is, on average, 10 percentage points more accurate than the control method, and the minimum accuracy of LSTM is above 80%. The authors explain the difference in accuracy by the fact that the LSTM method builds its forecast directly from the content of the student's response (the properties of the program code), whereas the control method takes into account only the number of attempts that were close to successful.

5. CONCLUSION

The extremely high prediction accuracy achieved in the experiment (100%) is explained, in our opinion, by the simple 5-point grading scale. It should be noted that no other work analyzed in the framework of the analytical review achieved such accuracy. Nevertheless, given how widespread the 5-point scale is in the Russian Federation, it is extremely important that the simplest LSTM architecture, combined with a small set of equally simple input data, can provide such prediction accuracy, at least for subject-specific results.

The paper describes an experiment in which LSTM was used to predict an educational result from the source code of programs written by students while completing a single task of the Hour of Code mass online course on the Code.org platform. The input data set contained about 1.2 million program codes written by 263.5 thousand students. The researchers had data for only two tasks of the course. A single step (state) was one attempt by a student to submit program code (students could submit several programs in response to one task). Only those students who made from 2 to 10 attempts to complete the task were selected from the full data set. The LSTM had to predict the educational outcome in the form of the probability of successfully completing the assignment following the one whose performance data were used for training.

The traditional LSTM described above was used for training (the authors state this explicitly). Since the LSTM must produce a probabilistic value, the output of the last cell of the network is passed through a fully connected layer followed by a layer with the softmax activation function.
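The output head just described can be sketched as follows in NumPy: the final LSTM hidden state is passed through a fully connected layer and a softmax to obtain a probability distribution over outcomes; the layer sizes and weights are illustrative.

import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(9)
n_hid, n_classes = 16, 2                 # e.g. success / failure of the next assignment

h_T = rng.standard_normal(n_hid)         # hidden state of the last LSTM cell (stand-in)
W_fc, b_fc = rng.normal(0, 0.2, (n_classes, n_hid)), np.zeros(n_classes)

probabilities = softmax(W_fc @ h_T + b_fc)   # probabilistic output of the model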

FUNDING STATEMENT

Applied research described in this paper is carried out with the financial support of the state, represented by the Russian Federation Ministry for Education and Science, under Agreement #14.576.21.0091 of 26 September 2017 (unique identifier of the applied research: RFMEFI57617X0091).

REFERENCES

[1] Hughes, E. Sh., Bradford, J. and Likens, C. Facilitating Collaboration, Communication, and Critical Thinking Skills in Physical Therapy Education through Technology-Enhanced Instruction: A Case Study. TechTrends, 62(3), 2018, pp. 296–302. DOI: 10.1007/s11528-018-0259-8.
[2] Abramov, R. A. Management Functions of Integrative Formations of Differentiated Nature. Biosci Biotech Res Asia, 12(1), 2015, pp. 991–997.
[3] Poth, Ch. The Contributions of Mixed Insights to Advancing Technology-Enhanced Formative Assessments within Higher Education Learning Environments: An Illustrative Example. International Journal of Educational Technology in Higher Education, 15(1), 2018, pp. 9. DOI: 10.1186/s41239-018-0090-5.
[4] Altinay, F., Dagli, G. and Altinay, Z. Role of Technology and Management in Tolerance and Reconciliation Education. Quality & Quantity, 51(6), 2017, pp. 2725–2736. DOI: 10.1007/s11135-016-0419-x.
[5] Zagami, J. et al. Creating Future Ready Information Technology Policy for National Education Systems. Technology, Knowledge and Learning, 23(3), 2018, pp. 495–506. DOI: 10.1007/s10758-018-9387-7.
[6] Fox-Turnbull, W. H. The Nature of Primary Students’ Conversation in Technology Education. International Journal of Technology and Design Education, 26(1), 2016, pp. 21–41. DOI: 10.1007/s10798-015-9303-6.
[7] Campbell, T. and Oh, Ph. S. Engaging Students in Modeling as an Epistemic Practice of Science. Journal of Science Education and Technology, 24(2-3), 2015, pp. 125–131. DOI: 10.1007/s10956-014-9544-2.
[8] Baines, D. et al. Conceptualising Production, Productivity and Technology in Pharmacy Practice: A Novel Framework for Policy, Education and Research. Human Resources for Health, 16(1), 2018, pp. 51. DOI: 10.1186/s12960-018-0317-5.
[9] Chen, G., Xu, B., Lu, M. and Chen, N.-Sh. Exploring Blockchain Technology and Its Potential Applications for Education. Smart Learning Environments, 5(1), 2018, pp. 1. DOI: 10.1186/s40561-017-0050-x.
[10] Hsu, L. Diffusion of Innovation and Use of Technology in Hospitality Education: An Empirical Assessment with Multilevel Analyses of Learning Effectiveness. The Asia-Pacific Education Researcher, 25(1), 2016, pp. 135–145. DOI: 10.1007/s40299-015-0244-3.
[11] Fletcher-Watson, S. Evidence-Based Technology Design and Commercialisation: Recommendations Derived from Research in Education and Autism. TechTrends, 59(1), 2015, pp. 84–88. DOI: 10.1007/s11528-014-0825-7.
[12] Abramov, R. A., Tronin, S. A., Brovkin, A. V. and Pak, K. C. Regional Features of Energy Resources Extraction in Eastern Siberia and the Far East. International Journal of Energy Economics and Policy, 8(4), 2018, pp. 280–287.
[13] Damewood, A. M. Current Trends in Higher Education Technology: Simulation. TechTrends, 60(3), 2016, pp. 268–271. DOI: 10.1007/s11528-016-0048-1.
[14] Dorner, H. and Kumar, S. Online Collaborative Mentoring for Technology Integration in Pre-Service Teacher Education. TechTrends, 60(1), 2016, pp. 48–55. DOI: 10.1007/s11528-015-0016-1.
[15] McLain, M. Emerging Perspectives on the Demonstration as a Signature Pedagogy in Design and Technology Education. International Journal of Technology and Design Education, 28(4), 2018, pp. 985–1000. DOI: 10.1007/s10798-017-9425-0.
[16] Virtanen, S., Räikkönen, E. and Ikonen, P. Gender-Based Motivational Differences in Technology Education. International Journal of Technology and Design Education, 25(2), 2015, pp. 197–211. DOI: 10.1007/s10798-014-9278-8.
[17] Muller, J. The Future of Knowledge and Skills in Science and Technology Higher Education. Higher Education, 70(3), 2015, pp. 409–416. DOI: 10.1007/s10734-014-9842-x.
[18] Tondeur, J., van Braak, J., Ertmer, P. A. and Ottenbreit-Leftwich, A. Erratum to: Understanding the Relationship between Teachers’ Pedagogical Beliefs and Technology Use in Education: A Systematic Review of Qualitative Evidence. Educational Technology Research and Development, 65(3), 2017, pp. 577. DOI: 10.1007/s11423-016-9492-z.
[19] Mapotse, T. A. An Emancipation Framework for Technology Education Teachers: An Action Research Study. International Journal of Technology and Design Education, 25(2), 2015, pp. 213–225. DOI: 10.1007/s10798-014-9275-y.
[20] Mapotse, T. A. Development of a Technology Education Cascading Theory through Community Engagement Site-Based Support. International Journal of Technology and Design Education, 28(3), 2018, pp. 685–699. DOI: 10.1007/s10798-017-9411-6.