Technical Report, IDE0806, January 2008
Functional Analysis of Real World Truck Fuel Consumption Data
Master’s Thesis in Computer Systems Engineering
Georg Vogetseder
School of Information Science, Computer and Electrical Engineering, Halmstad University
Box 823, S-301 18 Halmstad, Sweden
January 2008
Acknowledgement
If it looks like a duck, and quacks like a duck, we have at least to consider the possibility that we have a small aquatic bird of the family Anatidae on our hands.
Douglas Adams (1952-2001)
Thanks to my family, especially my mother Eva and friends.
Abstract
This thesis covers the analysis of sparse and irregular fuel consumption data of long distance haulage articulated trucks. It is shown that this kind of data is hard to analyse with multivariate as well as with functional methods. To be able to analyse the data, Principal Components Analysis through Conditional Expectation (PACE) is used, which enables the use of observations from many trucks to compensate for the sparsity of observations in order to get continuous results.
The principal component scores generated by PACE can then be used to get rough estimates of the trajectories for single trucks as well as to detect outliers.
The data centric approach of PACE is very useful for enabling functional analysis of sparse and irregular data. Functional analysis is desirable for this data because it sidesteps feature extraction and enables a more natural view of the data.
Contents

Acknowledgement
Abstract
List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Motivation and Novelty
  1.3 Related Work
  1.4 Limitations
  1.5 Outline

2 Methods
  2.1 General Statistical Methods
    2.1.1 PCA
    2.1.2 Hierarchical Clustering
    2.1.3 Validation Methods
    2.1.4 Diagrams
  2.2 Functional Data Analysis
  2.3 Principal Components Analysis through Conditional Expectation

3 The Vehicle Application and Data Description
  3.1 Data
    3.1.1 Impurities in the Truck Data
    3.1.2 Data structure
  3.2 Approach

4 Results
  4.1 Basic Data Analysis
    4.1.1 Data Binning
    4.1.2 Feature Extraction
    4.1.3 Function Fitting
  4.2 Application of PACE
    4.2.1 Baseline PACE Results
    4.2.2 Number of Principal Components
    4.2.3 Error Assumptions in PACE
    4.2.4 Different Kernel Functions
    4.2.5 Variances
      4.2.5.1 Model Variance
      4.2.5.2 Data Variance
  4.3 Prediction
  4.4 Outlier Detection
  4.5 Expansion

5 Discussion
6 Conclusion
Bibliography
List of Abbreviations
List of Figures

3.1 Fuel Consumption between Observations
3.2 Fuel consumption plot generated from the raw data
3.3 Histograms of the original and the cleaned data
3.4 Fuel consumption plot generated from the clean data
3.5 Scatter plot and histograms
3.6 Histogram of the distance between observations
4.1 Distribution and mean/variance of binned data
4.2 Boxplots of binned data
4.3 Outlier detection based on feature extraction
4.4 Straight line fitting
4.5 Plot of mean function and principal components
4.6 Scree Plot
4.7 Smoothed covariance matrix
4.8 Reconstructed curves versus mean function and raw observations of selected trucks
4.9 Reconstructed curves and raw measurements for all trucks
4.10 Reconstructed traces of misfitted trucks
4.11 Comparison of reconstructed trajectories with differing number of PCs
4.12 Reconstructed trajectories without measurement error assumed
4.13 A comparison of µ with different smoothing kernels
4.14 A comparison of 3 PCs with different smoothing kernels
4.15 Distribution of all mean curves
4.16 Graph of all mean curves
4.17 Trucks with a high influence on the results of PACE
4.18 Data variance
4.19 Normal Distribution Plots of the PC scores
4.20 Histograms of the probability of trucks
4.21 Samples of truck probability
4.22 PACE Results of Speed Data
4.23 PACE Results on Seasonal Fuel Consumption
4.24 Selected trucks from the Seasonal Fuel Consumption Data
List of Tables

4.1 MSE of PACE with 8 principal components
4.2 MSE of PACE with 3 PCs
4.3 MSE of PACE with 4 PCs
4.4 MSE of PACE with 29 PCs
4.5 MSE of PACE with 8 PCs and error cut-off
1 Introduction
1.1 Background
The original idea for analyzing this data came from Volvo Parts AB, one of the main
business units of Volvo Group AB. The role of Volvo Parts is to provide solutions and
tools to the after-market, which includes vehicle electronics diagnostic tools. When a
truck is in the workshop, the vehicle electronics data is read out from the truck using
diagnostics tools from Volvo Parts and transmitted to a central database.
This data, which is collected from sensors within the vehicle's electronic systems, is called logged vehicle data (LVD). Several electronic subsystems supply information for LVD, which can include data from the electronic suspension, the transmission and, most importantly, the Engine Electronic Control Unit. The current main use of LVD is seemingly just basic analysis, e.g. remote diagnostics of faulty components and simple statistics.
One of the problems with analysing LVD is the relative lack of observations. The source of this lack is the data retrieval process: the procedure is time consuming, making it a cost factor for the workshops. This cost negatively affects the adoption rate of the procedure in the field, which leads to the data composition detailed in Section 3.1.
The basic idea behind the problems detailed in this thesis is to expand the usefulness of the data for Volvo Parts, retrieving additional information from it and providing means to access this information. This is done by using recent advanced statistical techniques. As a starting point for the application of these techniques, the analysis of the fuel consumption data contained in LVD was suggested.
Fuel consumption data is very interesting from a statistical point of view. This interest stems from fuel being a major cost factor, as well as from fuel consumption being influenced by a high number of other factors, such as:
• Usage patterns of the operator, i.e. the driving style and habits
• Maintenance of the truck
• Gross Combination Weight usage, i.e. the cargo of the truck
• Environment, i.e. hilliness, road condition, etc.
The influence of these and other factors makes this data a good indicator, but the mass of influences also makes exact determination of the underlying cause impossible.
Additionally, some of these influences might cancel each other out, thus removing
information. If it is possible to extract information from fuel consumption data, then
it should work for the rest of the data too.
1.2 Motivation and Novelty
From LVD, it should be possible to extract information on hidden trends, i.e. the principal components (see Section 2.1.1) that are common to all similar trucks. Based on these components, it should be possible to determine whether a truck is unrelated to other trucks, i.e. an outlier, and to predict future developments in fuel consumption when the truck's behavior is similar to that of other vehicles.
It is very easy to take the last observation of each truck in a group of similar trucks to
determine abnormal fuel consumption, but it is hardly possible to calculate underlying
trends or other information from these facts.
To discover information like trends or outliers from LVD, the data of a truck has to include not only the last available observation, but also past ones. These requirements – multiple observations per truck and a set of similar trucks – lead to the irregular and sparse structure of the data used in this thesis. The data is described in more detail
in Section 3.1.
The analysis of this data can be done in at least two ways. The most obvious choice
in methodology would be the use of multivariate statistics, but for several reasons de-
tailed below, the central methodology for this thesis is functional statistics. Functional
statistics focuses on analysing the data as functions, rather than a set of discrete values.¹
Multivariate statistics are a set of methods which work on more than one variable at a
time. Some examples for these methods are regression analysis, principal components
analysis and artificial neural networks. In principle, functional statistics are also part
of this set, as both have multiple variables as input. However, the focus on handling
the input variables as continuous functions rather than arbitrary variables separates
those two fields.
As the observation of trucks in the workshop does not happen regularly, i.e. the observations cannot be fitted to a grid, it is difficult to incorporate all information
from the input into variables for use in multivariate statistics. Therefore, features
like mean, variance, duration of all observations, date of first observation, odometer
count at the last observation, etc. have to be extracted from the data to be able to do
analysis. Inevitably, the extraction of this knowledge leads to information loss, which
is problematic on this already sparse data. The process of discovery and selection of
important features for multivariate analysis is very difficult and time consuming. It is
crucial to extract and select the best and most important features from the data to
minimize the data loss and maximize the information content of the features for the
success of all further steps in analysis. Feature extraction creates an additional layer
of data processing and introduces a large number of tunable knobs.
Functional Data Analysis (FDA), on the other hand, preserves the information present in the data and does not need feature extraction at all. Furthermore, it facilitates a
more natural handling of the data, describing not only more or less abstract features
of the data, but a function which resembles the data. The choice of using functional
over multivariate data analysis is also motivated by the ability to analyze the func-
tional properties of the data, e.g. derivatives of the data. Additionally, FDA does not
introduce a high number of additional parameters, unlike multivariate analysis.
¹ A more detailed description of this collection of methods can be found in Section 2.2.
However, multivariate analysis has an advantage over FDA when a high number of
different functions have to be analysed at the same time. FDA has problems visualizing this higher dimensional data and requires a high amount of data for each dimension (curse of dimensionality).
The most important step in FDA is the transformation of the discrete data to a func-
tional basis. Again, the irregular and sparse nature of the data makes this transforma-
tion difficult. To be able to perform FDA on this data, a method called Principal Components Analysis through Conditional Expectation (PACE) is applied. The foundation of PACE is the assumption that a smooth function underlies the sparse data. Under this assumption, it is possible to use even irregular data for the discovery
of principal components.
The main novel aspect of this thesis is the application of FDA and PACE to automotive
data. Previously it has successfully been applied to biological data, economic processes and bidding in online auction houses, but not to automotive data. PACE itself is highly interesting for the data at hand, because it is able to work on it without the need for feature extraction or regular observations.
The methods used in this work can be used to describe the actual fuel consumption of
the observed trucks in customer hands. This means the methods applied to LVD are
driven by data and not by a model.
1.3 Related Work
General sources of information on data analysis – related to this work – are The El-
ements of Statistical Learning [1], Functional Data Analysis [2] and Nonparametric
Functional Data Analysis [3].
The single most important paper related to this work is Functional Data Analysis for
Sparse Longitudinal Data [4], which proposed the method PACE and applied it to
yeast cell cycle gene expression data and to longitudinal CD4 cell percentages. The
percentage is used as a marker for the progress of AIDS in adults.
Functional Data Analysis for Sparse Auction Data [5] combines the PACE approach
with linear regression to predict closing prices of online auctions.
The most related of the few public papers on fuel consumption in heavy trucks is Heavy
Truck Modeling for Fuel Consumption Simulations and Measurements [6]. This work
deals with building a simulation model of fuel consumption. Another paper, which dis-
cusses methods to reduce idle fuel consumption in North American long distance trucks
and highlights typical driver behavior, is Analysis of Technology Options to Reduce the Fuel Consumption of Idling Trucks [7].
Additional information on doing PCA on sparse and irregular data can be found in
Principal component models for sparse functional data [8] and Sparse Principal Compo-
nent Analysis [9]. More related to PACE is Properties of principal component methods
for functional and longitudinal data analysis [10]. Another paper which is related to
the estimation of Functional Principal Component Scores is [11]. Knowledge relating
to linear regression analysis for longitudinal data can be found in [12].
1.4 Limitations
The scope of this thesis is to research the possibilities for the application of FDA meth-
ods to the sparse and irregular automotive data from LVD. It is outside of the scope
of this thesis to establish a conclusive theory about a true long term fuel consumption
model of all truck engines.
Such a conclusive, globally valid model is impossible because of the relatively low number of
individuals in the data, as well as a limited observation duration and possible differences
in usage patterns of the trucks, i.e. vehicles with a high mileage in a limited time span
do not necessarily exhibit a similar fuel consumption to low mileage trucks in the same
time span.
1.5 Outline
The next chapter, ”Methods”, describes the crucial methods used. This includes underlying basic methods as well as the foundations of FDA and PACE. Chapter 3, ”The Vehicle Application and Data Description”, provides a description of the data used in this thesis and includes information on the interplay of the proposed methods and the data. Chapter 4 provides comprehensive information on the results. The last two chapters, ”Discussion” and ”Conclusion”, wrap up the results from this thesis and provide an outlook on possible continuations
of the research.
2 Methods
This chapter is divided into three parts. General Statistical Methods describes non-
functional methods which are fundamental to this work. Functional Data Analysis
provides an introduction into this field. The final part, Principal Components Analysis
through Conditional Expectation gives an overview of this crucial method.
2.1 General Statistical Methods
This section introduces general statistical concepts used in this thesis and a number of
tools to visualize data and test results.
2.1.1 Principal Component Analysis
One of the constitutional methods for analysing LVD is the Karhunen-Loève transformation, universally known as Principal Component Analysis (PCA). PCA is also the foundation of Functional Principal Component Analysis (FPCA) [1, 13].
Basically, PCA is a method to explore data by finding the most important ways the
variables in the data differ from one another. It can compress the data by discovering a
low number of linear combinations of input variables which contribute most to the
variability of the input. These linear combinations are found by constructing a linear
basis for the data where the retained variability is maximal.
Mathematically speaking, the goal is to reduce or compress high dimensional data X
to lower dimensional data Y .
To do this reduction, a number of algorithms are available; here, a method involving the calculation of the covariance matrix is described.
The first step is to calculate the mean vector µ, with one entry µ_i per variable:

µ_i = (1/K_i) Σ_{j=1}^{K_i} x_ij,   i = 1, …, N

where N denotes the number of variables and K_i the number of observations of one variable.

Subsequently, µ is removed from every observation in X; the centered data is denoted as X − X̄.
In the next step the covariance matrix cov(X − X̄) has to be calculated. Covariance is a measure of how two variables vary together. If the two variables vary in the same way (i.e. with the same sign), the covariance will be positive. If, on the other hand, the two variables vary with opposite signs, the covariance will be negative. A covariance matrix is the result of calculating the covariance for all members of two vectors. The resulting matrix gives the degree of correlation between the input vectors.
To find a mapping M that is able to transform the high dimensional data into low dimensional data, the M that maximizes Mᵀ cov(X − X̄)M has to be found. It can be shown that the best (variance maximizing) mapping is formed by the eigenvectors of the covariance matrix. Hence, PCA has to solve the eigenproblem to get the transformation matrix:

cov(X − X̄)M = λM

The eigenproblem has to be solved d times with different principal eigenvalues λ to get the d principal eigenvectors (or principal components). The low dimensional representation Y can then be computed by simple multiplication:

Y = (X − X̄)M
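The steps above translate directly into a few lines of linear algebra. The following is a minimal, illustrative sketch in Python/NumPy (not code from the thesis; the function names and example data are ours):

```python
import numpy as np

def pca(X, d):
    """Project the rows of X (observations x variables) onto d principal components."""
    X_c = X - X.mean(axis=0)                # subtract the mean vector mu
    C = np.cov(X_c, rowvar=False)           # covariance matrix cov(X - Xbar)
    eigvals, eigvecs = np.linalg.eigh(C)    # solve the eigenproblem cov(X - Xbar) M = lambda M
    order = np.argsort(eigvals)[::-1]       # sort eigenvectors by descending eigenvalue
    M = eigvecs[:, order[:d]]               # mapping formed by the top d eigenvectors
    return X_c @ M                          # low dimensional representation Y = (X - Xbar) M

# Example: compress 100 observations of 5 variables to 2 principal components
Y = pca(np.random.default_rng(0).normal(size=(100, 5)), d=2)
```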
2.1.2 Hierarchical Clustering
Hierarchical clustering is a relatively simple method [1] to segment data into related
groups. Clustering is used within this thesis for testing if differing clusters of trucks can
be found from extracted features. Hierarchical clustering needs a dissimilarity measure
between the elements. The standard measure of dissimilarity is the Euclidean distance, which is also used in this thesis.
When the distance between all possible pairs of elements is calculated, the clusters can
be built. For building these clusters, there are two different approaches: the agglomerative approach, which starts with as many clusters as there are individuals and merges them step by step, and the divisive method, which starts with one big cluster that is then split into smaller clusters. Agglomerative methods are guaranteed to have a monotonically increasing level of dissimilarity between merged clusters, growing with the level of merging. This property is not guaranteed for divisive approaches.
The second choice in building the clusters is the measure of distance between two clusters:
• Single Linkage – The link between the clusters is defined by the smallest dis-
tance between elements in the two clusters.
• Complete Linkage – The link is defined by the largest distance between ele-
ments in the two clusters, the opposite of the first method.
• Average Linkage – Uses the average distance between all pairs of elements in
both clusters.
2.1.3 Validation Methods
A number of methods to validate the results and to estimate variation were used in
the scope of this thesis. These include brief usage of bootstrap, jackknife and various
cross validation methods, such as k-fold and leave-one-out [1].
Bootstrapping is the process of randomly picking samples from given observations, where a single observation can be chosen multiple times. The goal of a bootstrap is to approximate the distribution from these samples.
Jackknifing can be used to estimate the bias and standard error. Jackknife is very
similar to k-fold and leave-one-out cross validation, as it systematically removes one
or more observations from a sample and then recalculates the results as often as there
are possible readouts.
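As an illustration of the bootstrap described above, the sampling distribution of the mean can be approximated as follows (a sketch under the same resampling scheme; Section 4.1.1 uses 10000 bootstrap samples per bin):

```python
import numpy as np

def bootstrap_means(x, n_boot=10000, seed=0):
    """Approximate the sampling distribution of the mean by resampling with replacement."""
    rng = np.random.default_rng(seed)
    # each bootstrap sample draws len(x) observations; repeats are allowed
    samples = rng.choice(x, size=(n_boot, len(x)), replace=True)
    return samples.mean(axis=1)             # one mean per bootstrap sample

means = bootstrap_means(np.array([2.3, 2.5, 2.4, 2.6, 2.2]))
print(means.std())                          # bootstrap estimate of the standard error
```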
2.1.4 Diagrams
A number of special diagrams were used to illustrate some results of this thesis. Those
diagrams are dendrograms, boxplots and scree plots [1, 2].
• Dendrograms are tree diagrams which are used to illustrate the result of a clus-
tering algorithm. An example for such a diagram is Figure 4.3. On the vertical
axis the distance between clusters is plotted. A horizontal line denotes a split
between classes at this specific distance measure. This implies that a split at
a higher distance value has a higher dissimilarity between the split classes, as
opposed to a lower distance value split.
• Boxplots describe groups of data – such as binned data – through five statistical
properties. A boxplot example can be seen in Figure 4.2. The box represents
the lower and the upper quartile, showing where half of the data is contained.
The line in this box illustrates the median of data in this group. The whiskers
attached to this box extend to the furthest data point, up to a maximum of 1.5 times the distance between the quartiles. Data points outside of this boundary are usually marked with a cross, indicating a possible outlier.
• Scree plots give an indication of the relevance of a principal component (eigen-
function) by indicating the accumulated eigenvalue up to the n-th principal com-
ponent. This plot can be used to select a suitable number of eigenfunctions. An
example for a scree plot is Figure 4.6.
2.2 Functional Data Analysis
Functional data analysis (FDA) [2, 3] is a collection of methods which enable the
investigation of data in a functional form. Functional data is the idea of looking at a
set of observations not as a vector in discrete time, but as a continuous function. The
analysis of functions rather than discrete samples brings advantages over multivariate
analysis.
An advantage of this property is that the rate of change or derivatives of these functions
can easily be calculated and analysed. FDA also includes variants of multivariate
methods like PCA. Functional PCA, like normal PCA, not only provides a method for
dimensionality reduction, but also characterizes the main modes of variation from a
mean function.
To perform FDA on discretely sampled data, the data has to be converted to a contin-
uous, functional format. This means a function has to be fitted to the sampled data
points. It is not feasible to convert every dataset to a functional form. Especially in
the case of sparse and irregular observations, this task is very difficult, but central to
the success of functional data analysis.
Usually, the methods used to convert data into a functional format are interpolation
and smoothing, or more generally function fitting. A very simple method to do this
conversion would be a least squares fit of a first order polynomial (a straight line).
Usually, a more flexible method is used for this step, namely spline interpolation.
Depending on the underlying data, other fits like Fourier functions are possible.
FDA is easily applicable if the measurements were done with a regular spacing and the data is complete over the observation duration. In the opposite case, it is very difficult to estimate the complete trajectory when only a single subject is taken into the calculation.
2.3 Principal Components Analysis through Conditional Expectation
Principal Components Analysis through Conditional Expectation (PACE) is a deriva-
tive of functional principal components analysis for sparse longitudinal data, proposed
in the paper Functional Data Analysis for Sparse Longitudinal Data by Yao, Müller and Wang [4].
PACE is an algorithm for extracting the principal components from irregular and
sparse data. It also provides an estimation of individual smooth trajectories of the data.
PACE assumes that the data is randomly located, with a random number of observations per subject. Furthermore, it assumes that the data is determined by an underlying smooth trajectory.
The first step in PACE is the estimation of the smooth mean function µ, by using a
local linear line smoother on all measurements combined into one pool of data. The
choice of the smoothing parameter, or bandwidth, is done automatically [14] or by hand in this step.
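To illustrate the idea of a local linear smoother, the following sketch fits a weighted straight line around every point of an output grid, using the Epanechnikov kernel. This is an illustrative reimplementation, not the PACE code; the bandwidth h is a free parameter here, whereas PACE chooses it automatically or by hand, as noted above.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: compact support on [-1, 1]."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def local_linear_smoother(t, y, grid, h):
    """Local linear fit of the pooled points (t, y), evaluated at each grid point."""
    out = np.empty(len(grid))
    for i, t0 in enumerate(grid):
        w = epanechnikov((t - t0) / h)                # kernel weights around t0
        A = np.column_stack([np.ones_like(t), t - t0])
        sw = np.sqrt(w)
        # weighted least squares; the intercept is the smoothed value at t0
        beta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)
        out[i] = beta[0]
    return out   # assumes at least two observations fall inside every window
```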
The covariance surface can then be calculated like a regular covariance matrix. This raw covariance surface is stripped of the variance (the main diagonal). The raw matrix is then smoothed utilizing a local linear surface smoother. The bandwidth is chosen by leave-one-curve-out cross-validation. The smoothing step is necessary to fill in for missing observations. The estimations of these two model components share the same smoothing kernel. The choice of a smoothing kernel is discussed in Chapter 4.
From these model components, it is possible to calculate the estimates of the eigenvalues
and eigenfunctions, i.e. the functional principal components of sparse and irregular
data.
The last step is the calculation of the functional principal component scores. Those
scores describe how much of a principal component is retained in a single subject.
However, the conventional method of using numerical integration to recover the Prin-
cipal Component (PC) scores leads to biased results because of the sparse and irregular
data. In this step, the conditional expectation comes into play. It provides the best
prediction of the PC scores if the measurement error is Gaussian, or the best linear
prediction otherwise. PACE is discussed in detail by Yao, Müller and Wang [4].
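For reference, the conditional expectation estimate of the k-th score of subject i, as given by Yao, Müller and Wang [4], takes the form

ξ̂_ik = λ̂_k φ̂_ik^T Σ̂_{Y_i}^{-1} (Y_i − µ̂_i),   with (Σ̂_{Y_i})_{jl} = Ĝ(t_ij, t_il) + σ̂² δ_jl,

where φ̂_ik is the k-th eigenfunction evaluated at the observation times of subject i, Ĝ is the smoothed covariance surface, σ̂² the estimated error variance and δ_jl the Kronecker delta.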
3 The Vehicle Application and Data Description
The purpose of this chapter is to outline the connection between the methods proposed
in Chapter 2 and the application of those methods on the Volvo data.
3.1 Volvo Truck Data
The original data received from Volvo Parts AB consists of 2027 observations of 267
trucks. It was collected between June 2004 and May 2007 in North America.
All trucks have the same engine and are configured as articulated trucks for long distance transports on smooth roads. The gross combination weight (GCW), which includes the weight of the towed trailer and the truck itself, is 36 tons, the US federal GCW limit.
Data is retrieved when a truck is in a workshop that is equipped to read out the onboard
electronics and performs this procedure. It is then sent to the Volvo headquarters in Gothenburg for storage and analysis.
The data from each observation contains only information from one of the truck's onboard electronic systems, the Engine Control Unit (ECU). From this data, two variables are mainly relevant for this thesis:
• Total distance driven
• Total amount of fuel consumed
Figure 3.1: This figure shows the distribution of the fuel consumption (incremental fuel mileage [km/l] over distance driven [km]) when the fuel mileage is calculated only between two observations. The outliers visible in this figure can be explained by a high amount of idling between two close observations. When the fuel mileage is calculated accumulatively, those outliers do not occur.
These variables are not reset when the ECU is read out in the workshop and therefore behave accumulatively. Using these variables as a basis to calculate the fuel consumption per distance or time has an averaging effect, as it includes all former mileage data. This is necessary because of the unevenly distributed data. If a truck is read out twice within a very short span of time, the fuel consumption in this interval is possibly vastly different from the normal fuel consumption behavior of the truck, possibly because the truck was not moved very far within this time span, but was idling for some time. The outliers caused by this effect can be seen in Figure 3.1. These outliers are the reason for not using the difference in fuel amounts between two observations as a calculation basis in this thesis. The accumulative approach allows those observations to remain in the dataset.
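The difference between the two calculation bases can be sketched in a few lines (the counter values below are made up for illustration):

```python
import numpy as np

# accumulative ECU counters of one truck, sorted by read-out time
odometer = np.array([50_000.0, 120_000.0, 125_000.0, 300_000.0])  # km
fuel     = np.array([20_000.0,  48_000.0,  50_500.0, 120_000.0])  # litres

# accumulative basis (used in this thesis): averages over the whole
# history, so short intervals with much idling cannot dominate
accumulative_mileage = odometer / fuel                 # km/l per observation

# incremental basis: ratios of differences between consecutive observations;
# the short middle interval produces an outlier (2.0 km/l here)
incremental_mileage = np.diff(odometer) / np.diff(fuel)
```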
3.1.1 Impurities in the Truck Data
The raw data retrieved from the trucks contains irregular observations or changes in
the truck data which result – in some cases – in a removal of specific observations or
the whole truck from the data set. See Figure 3.2 for a plot of the raw fuel consumption data.
Figure 3.2: Fuel consumption plot (fuel mileage [km/l] over distance driven [km]) generated from the raw data. The lines are linear interpolations between the observations.
• Incomplete Observations – A truck is missing one or more variables that would be required for analysis. The observations from such an individual can not be used for the calculations.
• Physically impossible changes in accumulative variables – Between two
observations of a single truck, accumulative variables changed to a smaller value.
This means that a later observation in time has a smaller number of total driving
distance than an earlier measurement for example. This is physically impossible,
but observable if the ECU has been replaced or the contents of the ECU were
erased during a software update. This criterion applies to 44 trucks. Although it is possible to use a subset of the observations from each of these trucks, this was not done, because the quality of the measurements might have been compromised and the manual effort of cleaning the data is a time consuming task for very few usable measurements.
• Empty and Duplicated Observations – Some observations do not contain any
new information, but only seem to be resubmits of earlier or empty observations
with a different time stamp. These particular observations are removed from the
3. The Vehicle Application and Data Description 16
final data, but the remaining observations of the truck are used. Phenomena like these might occur when the data acquisition process in the workshop was interrupted, or a transmission error occurred.
• Early Observations – These observations are too early in the life of the truck to give meaningful information. The removal of these observations is motivated by the unusual fuel consumption of a truck in this state. The unusual fuel consumption is caused by the high number of short trips the truck has to travel before it can be put into regular service. Examples are drives to paint shops or truck customizers as well as transfers to the customer. The number of observations purged when this criterion is set to remove all measurements below 10000 km is 150; when all measurements before 1000 km are deleted, the number of observations drops by 100. See Figure 3.3.
From the 269 initial individual trucks, 56 trucks are removed. In terms of observations, from the original 2027¹ observations, 1320 remained in the data set when the lower border for observations is set to 1000 km. See Figure 3.4 for a plot of the cleaned fuel consumption data. The most visible change compared to Figure 3.2 is the lower number of outliers at roughly 0 kilometers, which is mostly an effect of the removal of very early observations.
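A condensed sketch of these cleaning rules, assuming a pandas DataFrame with the hypothetical columns truck_id, date, distance (accumulative km) and fuel (accumulative litres):

```python
import pandas as pd

def clean_lvd(df, min_distance_km=1000):
    # incomplete observations: required variables are missing
    df = df.dropna(subset=["distance", "fuel"])
    # empty/duplicated observations: resubmits differing only in time stamp
    df = df.drop_duplicates(subset=["truck_id", "distance", "fuel"])
    # early observations: paint shops, customizers, transfers to the customer
    df = df[df["distance"] >= min_distance_km]
    # physically impossible changes: drop whole trucks whose accumulative
    # counters ever decrease (e.g. ECU replaced or contents erased)
    ok = (df.sort_values("date")
            .groupby("truck_id")["distance"]
            .apply(lambda s: s.is_monotonic_increasing))
    return df[df["truck_id"].isin(ok[ok].index)]
```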
3.1.2 Data structure
Some properties of the data make the task of analysing it inherently difficult. Most of
these properties stem from the sparsity of the data. Sparseness in this case means that
every truck has been observed on average just 7.405 times with a standard deviation
of 2.4083 observations. The sparseness of the data is visualized in Figure 3.5.
• The data is not fully observed. The observations of a single truck often are not
scattered over a very long distance in time or driven distance, but measured only
within a short span. The average distance between the first observation of a truck and the last one is 317841 kilometers, with a standard deviation of 114208 kilometers. The mean focus of the observations
¹ Excluding incomplete observations, as they are not usable at all.
Figure 3.3: This comparison shows the number of observations (over distance driven [km]) in the raw data versus the cleaned data. The overall reduction in the number of observations, as well as the lower number of observations at the beginning, is noticeable.
Figure 3.4: Fuel consumption plot (fuel mileage [km/l] over driven distance [km]) generated from the clean data. Note the lack of outliers at the beginning of the data.
Figure 3.5: The scatter plot in this figure (fuel mileage [km/l] over driven distance [km]) highlights the sparse and irregular distribution of the data. The histograms describe the distribution of the observations along the axes.
is at 303232 kilometers, deviating by 133609 kilometers, which means that most of the trucks are not observed from the beginning, but are observed later on in their life-cycle.
• The density of measurements varies. This implies that the placement of measure-
ments is irregular throughout the duration of their observation. As the trucks are
independent of each other, the times when observations happen are not correlated
with each other. For a visual representation of the irregular duration between the
measurements, see Figure 3.6. This figure indicates a non-normal distribution.
The average distance between observations is 52020 kilometers with a standard
deviation of 61858 kilometers.
• Unsupported curvature. The irregular placement and the sparsity of the observations cause this property to occur. A part of a curve may have a high curvature, which can be approximated by ‖d²y/dx²‖ or (d²y/dx²)². When this is the case, the relative resolution of the data at the point of the high curvature should also be high to enable a good estimation of the underlying function [2].
Figure 3.6: This figure shows the distribution of distances between two observations of the same truck (number of observations over distance driven between observations [km]).
3.2 Approach
The first part in analysing the truck data, which is described in Section 4.1, is to establish results with basic multivariate analysis as a baseline to which the results of functional analysis can be compared. This part shows pitfalls and difficulties when applying standard multivariate methods to the data.
The first possible way for multivariate analysis is feature extraction. It is a difficult
task to find relevant features to extract. A simple statistical feature will be extracted
from the data to give an idea of how feature extraction works. The second
possibility for multivariate analysis is to put the observations into bins. This is done
in order to be able to align the data onto a vertical grid.
The second way is necessary because it is very hard to visualize the extracted features or to convert them back to the original data format. However, binning cannot easily be
used for outlier detection. Usually, some of the bins are likely to have only a low
number of observations which makes outlier determination in this bin very difficult. If
the bins are made larger, multiple – or even all – observations of a single truck might
be put into a single bin. This leads to increased difficulty in differentiating between
normal and outlying observations.
3. The Vehicle Application and Data Description 20
These steps should lead to two results: a simple outlier detection, based on a clustering of the extracted features, and a variance and mean estimation for the data, based on the binned data.
The task of estimating the fuel consumption behavior of a single truck outside of its observation duration using the extracted features is very hard. This is because the
mapping between the values of the features and a function is not available. Addition-
ally, information from other, similar trucks is not taken into consideration.
The last step in Basic Analysis (Sect. 4.1) is a demonstration of the main problem
of applying FDA on the data at hand: the difficulty of fitting a function to a single
truck.
The main task of this thesis is to apply the PACE algorithm to the data (Sect. 4.2) and to try out the various options within the PACE algorithm. In that section, the results of PACE in general will be assessed, as well as the differences between PACE runs with different options, both in regard to the PACE generated functions and to general statistical properties, such as the mean function.
The first advantage in using the PACE algorithm in comparison to the basic methods
is the lack of need to pre-process data, i.e. to extract features or otherwise process the
data. This non-parametric input of the data is complemented by a number of options to tune the algorithm itself for various needs (amount of information retained, whether the input data has measurement errors, etc.).
The next step is to try out a number of methods which can be applied to the results of PACE, for example to calculate the probability of the fuel consumption of a particular truck, given all the other trucks.
PACE enables the user to analyse the sparse and irregular data at hand, opening up the
use of additional techniques from FDA, whereas using only multivariate data analysis
or normal FDA on the same data is very difficult to do and does not incorporate the
information gathered from the other trucks.
PACE makes outlier detection, estimation of the function outside the observation du-
ration and the gathering of common statistical properties, like mean and variance in
functional form, from sparse and irregular data a lot easier or even possible.
4 Results
4.1 Basic Data Analysis
The aim of this section is to provide an overview of basic multivariate analysis possi-
bilities with the available data. Functional methods are applied from Section 4.2 onwards.
4.1.1 Data Binning
One approach, as described in the previous chapter, is the creation of a vertical grid
for the data domain followed by binning the data into a limited number of “buckets”
along the time or distance axis, similar to creating a histogram. If there is more than
one observation of a truck in one of these bins, an average of these measurements is
put into the bin. This has to be done to avoid biasing in case of dense observations of
a truck within a short timespan.
The size and the number of the bins are crucial for binning. With the data at hand, 25 bins were used, which results in a size of 36087 kilometers per bin.
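A sketch of the binning step for a single truck; the span of 25 × 36087 km follows from the numbers above, and the array names are illustrative:

```python
import numpy as np

def bin_truck(distance, mileage, n_bins=25, span=25 * 36_087):
    """Average one truck's observations into n_bins buckets along the distance axis."""
    edges = np.linspace(0.0, span, n_bins + 1)
    idx = np.clip(np.digitize(distance, edges) - 1, 0, n_bins - 1)
    binned = np.full(n_bins, np.nan)          # NaN marks empty bins
    for b in np.unique(idx):
        # several observations of the truck in one bin are averaged to avoid biasing
        binned[b] = mileage[idx == b].mean()
    return binned
```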
In Figure 4.1 the number of observations per bin, as well as an estimation of the mean
function and the variance of the data can be seen.
In Figure 4.2 a boxplot of the binned data and one of the results of bootstrapping [1]
the mean value per bin (10000 bootstrap samples) are illustrated.
Figure 4.1: The histogram (left) depicts the number of observations per bin. Especially the first and the last few bins have a very small number of observations, which leads to the abnormal results in these bins in the mean and standard deviation figure on the right. The right figure shows the mean as well as the standard deviation (fuel mileage [km/l] over distance driven [km]) estimated from the binned data.
Figure 4.2: The figures show boxplots for the binned data (left) and bootstrapped mean values (right). The left boxplot is a simple plot of the raw binned data, providing an easy visualization. The right boxplot is generated by bootstrapping the mean of each bin 10000 times. Bootstrapping should give an idea of how much the mean can vary if new data has the same distribution as the data at hand.
4.1.2 Feature Extraction
The features which are retrieved from all observations of a single truck are used to construct a simple outlier detector with hierarchical clustering.

The goal of this simple outlier detector is to find trucks whose mean deviates significantly from the mean of the entire data. A single extracted feature was used in this case:

∆_Truck = (µ_Truck − µ_All)²

The data was then clustered with a hierarchical algorithm, using average linkage. The outlying classes were subjectively selected by looking at the resulting dendrogram. For the results, see Figure 4.3.
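A sketch of this detector using SciPy's hierarchical clustering; the number of classes (seven in Figure 4.3) is read off the dendrogram subjectively, so passing it in explicitly is an assumption of this sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def outlier_classes(truck_means, overall_mean, n_classes=7):
    """Cluster trucks on the single feature (mu_truck - mu_all)^2."""
    delta = (truck_means - overall_mean) ** 2
    Z = linkage(delta.reshape(-1, 1), method="average", metric="euclidean")
    return fcluster(Z, t=n_classes, criterion="maxclust")   # class label per truck

# Example with synthetic per-truck means around 2.5 km/l
means = np.random.default_rng(0).normal(2.5, 0.1, size=213)
labels = outlier_classes(means, means.mean())
```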
Figure 4.3: Results of outlier detection based on feature extraction. The left figure shows the dendrogram of the clustering algorithm; class 6 is an extreme outlier, whereas classes 3 and 7 are also quite different from the main part of the data. The basis for these classes being outliers is a vastly different mean from the rest of the data. In the other figure (fuel mileage [km/l] over distance driven [km]), the outlying clusters are highlighted: the extreme outlier is marked red, the normal outliers are marked green and the normal data is colored blue. Class 3 has 5 members, whereas the other outlier classes have just 1 member. Classes 1 and 5 have 114 respectively 52 members. Class 2 has 27 members, whereas class 4 has 13 members.
4.1.3 Function Fitting
Finding a plausible function that fits the data of the trucks well is difficult because of the open-ended nature of the measurements. If a set of observations has a defined start and end of its measurements, i.e. the data is fully observed, it is easy to interpolate the data in between, even if the data within this span is sparse. This
property of the data at hand is also discussed in Section 3.1.
If the set of data is not fully observed, it is almost impossible to get a reliable fit outside
the observation span of a single entity. This reliable fit outside of this span is necessary
for performing FDA on this data, as FDA needs the same set of basis functions, or in
the case of spline interpolation, the same knots for all functions to work.
It was not possible to get a good fit on this data with splines, where all of the knots are
distributed the same for all truck entities. Also, polynomial fits, i.e. the approximation of the data with polynomials of low (< 5) order, did not result in a stable fit for the available data. The most reliable fits under these conditions were generated by fitting a linear function to the fuel consumption observations. These results from fitting the sparse and irregular data motivate the idea of combining the observations by means of PACE, to be able to get better fits from the reconstructed trajectories.
The results of fitting a straight line to the data can be seen in Figure 4.4.
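A sketch of the per-truck straight line fit; np.polyfit performs the least squares fit, and trucks is a hypothetical mapping from a truck id to its observation arrays:

```python
import numpy as np

def fit_lines(trucks):
    """Least squares straight line per truck: mileage ~ slope * distance + offset."""
    fits = {}
    for truck_id, (distance, mileage) in trucks.items():
        slope, offset = np.polyfit(distance, mileage, deg=1)
        # fits with a high gradient are not valid outside the observation span
        fits[truck_id] = (slope, offset)
    return fits
```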
Figure 4.4: On the left, all fitted straight lines are shown (fuel mileage [km/l] over distance driven [km]). The right figure shows the mean straight line along with the standard deviation of the slope and the offset (blue) and the standard deviation of just the offset (dashed). The main problem with this straight line fit is a number of fits with high gradients, which are not valid outside their observation span. However, the mean line shows a slight increase in fuel economy, just like the mean curve from PACE (Figure 4.5).
4.2 Application of PACE
The goal of this section is to elaborate on the application of the PACE method on the
truck data, focusing only on fuel consumption per kilometer over the distance axis.
Along with the results of this first application, some options available for a fine-tuning
of the method will be presented and a general estimate of variability will be given.
4.2.1 Baseline PACE Results
The data in use for this initial run of the PACE method is the cleaned set, with all trucks removed which have fewer than 2 observations. Additionally, every observation that happened before a threshold of 10000 km has been removed. The PACE method
has some interchangeable sub-methods. For the baseline results, mostly the same
parts as in the original method described in [4] were used. Thus, the kernel used for
smoothing the mean function is the Epanechnikov kernel [4] and the input data is
assumed to contain measurement errors.
A small discrepancy to the original method is the choice of using Fraction of Variance Explained¹ (FVE) instead of the Akaike Information Criterion (AIC) [1] to select the number of PCs. The FVE threshold is set at 95 % of variance explained.
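The FVE rule described in the footnote amounts to a cumulative sum over the eigenvalues; a minimal sketch:

```python
import numpy as np

def n_components_fve(eigenvalues, threshold=0.95):
    """First number of PCs whose cumulative eigenvalue share exceeds the threshold."""
    fve = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    return int(np.searchsorted(fve, threshold) + 1)

# with the eigenvalues from this analysis, a threshold of 0.95 selects
# 8 PCs, accounting for 96.57 % of the total variation (see below)
```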
Regarding Figure 4.5, the smoothed mean curve should be taken with a grain of salt; especially the variance plots and the measurement density plot in Figure 3.3 should be considered. The number of PCs selected by FVE is 8, which accounts for 96.57 %
of the total variation. The scree plot (Section 2.1.4) of the principal components from
this analysis can be seen in Figure 4.6. The first, strong principal component is almost
a straight line, which is basically shifting the mean from its starting point closer to
the position of the measurements. The second and the fourth principal component
seem to serve partially as corrective for trucks with a higher initial fuel economy than
the average truck. The smoothed covariance matrix generated and used by PACE is
visualized in Figure 4.7 by a color-matrix.
¹ The sum of the eigenvalues of a certain number of eigenfunctions divided by the sum of all eigenvalues has to exceed a certain threshold. The first number of PCs which exceeds this threshold is subsequently used.
Figure 4.5: The smooth mean function generated by PACE (left; fuel mileage [km/l] over distance driven [km]) is the basis for all other results. The four most significant PCs (right) are the strongest ways in which the individual trucks vary. The legend quantifies the strength of the PCs (55.72 %, 11.88 %, 8.65 % and 4.03 %).
Figure 4.6: The scree plot, which highlights the trade-off between the number of PCs used versus the variance retained. The use of more than 10 PCs makes little sense, as the Fraction of Variance Explained (FVE) is not improving much.
Figure 4.7: The smoothed covariance matrix generated by PACE. (The diagonal, which is the variance, has been removed prior to smoothing.) The main part of the matrix shows a small positive covariance (green).
Figure 4.8: These plots exhibit the mean curve (red), the corresponding original observations (green) and the reconstructed curve (blue) for vehicles 14, 106, 92, 72 and 4. Vehicles 14 and 106 have high values on all major PC scores, with opposite signs. Number 92 has the lowest PC scores overall; trucks 72 and 4 have average PC scores. High PC scores lead to extreme values, especially on the strong first PC.
From the estimated PC scores, the mean function µ and the principal component functions, the individual traces of the trucks can be reconstructed, which should give a rough estimate of the behavior of each truck. A number of selected reconstructions can
be viewed in Figure 4.8 and a collection of all traces and the original measurements
can be seen in Figure 4.9.
As a next step, for an analysis of the results, the goodness-of-fit of the original mea-
surements versus the reconstructed traces is assessed. To estimate the goodness-of-fit,
the mean squared error [1] between the discrete observation and the estimated re-
construction is considered. However, the irregular measurement intervals make the assessment of the results difficult.
In Figure 4.10 some examples of bad fits are explained. Just taking the mean of the
mean square error (MSE) of all observations of one truck is prone to skewing, as well
as just summing up the MSE for each single truck. A more sensible approach to
Figure 4.9: This graph shows all reconstructed traces (gray) and original measurements (blue; fuel mileage [km/l] over distance driven [km]). Note how the traces tend to follow the observations, especially when the relative occurrence of observations is low.
Figure 4.10: As described in the text, these figures depict misfitted trucks. Vehicles #73 and #106 are trucks which provide bad fits, whereas #102 is a truck which is only identifiable as a misfit when the median mean square error (MSE) is applied. Truck #202 is a counter-example, where the misfit is more noticeable when the mean MSE is used.
Method                    Max. MSE   Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck        0.189%     0.0343%    0.0209%      0.0383%
Median MSE per Truck      0.238%     0.0215%    0.0096%      0.0331%
All Observations Pooled   0.679%     0.0310%    0.0089%      0.0629%

Table 4.1: MSE of the reconstructed traces by PACE versus the original observations with 8 PCs. In the last column, the standard deviation of the MSE is given.
get reliable error measurements is to use the median of the individual MSE as error
measure. A good example of a bad fit is truck #102 (Figure 4.10), which is, when the
median MSE is used, the third worst fitting truck, in contrast to mean MSE, where
the truck is ranked 63rd.
A counter-example is provided by vehicle #202 which is ranked 3rd using the median
and 19th with mean MSE. In this example, one of the observations is a strong outlier,
which is influencing the median MSE, because of the low number of observations on
this truck.
Because both measurement methods have their respective merits, both are used for
judging the fit of the individual trucks. In addition to these two methods, which view
the trucks as separate entities, all truck observations will be pooled and the overall
MSE is given. The results can be seen in Table 4.1.
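A sketch of the three error measures reported in the MSE tables (per-truck mean, per-truck median, and all observations pooled):

```python
import numpy as np

def mse_summaries(squared_errors_per_truck):
    """squared_errors_per_truck: one array of squared residuals per truck."""
    per_truck_mean   = np.array([e.mean() for e in squared_errors_per_truck])
    per_truck_median = np.array([np.median(e) for e in squared_errors_per_truck])
    pooled = np.concatenate(squared_errors_per_truck)   # ignores truck boundaries
    # each table row reports max/mean/median/std over one of these collections
    return per_truck_mean, per_truck_median, pooled
```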
After the establishment of these baseline results, various parts of the PACE method
can be changed to see their influence on the results.
4.2.2 Number of Principal Components
As a first variation, the number of PCs will be varied and the resulting MSE table will
be compared. Also, a visual comparison will be offered. In addition to the baseline
threshold of FVE – 0.95, thresholds of 0.75, 0.85 and 0.9999 will be subject to this
experiment. The difference between the baseline and the variants is just the number of
PCs. With a lower number of PCs, the MSE in the data should be higher. Using a higher number of PCs results in a lower MSE, but probably causes a worse performance in generalisation, i.e. the principal components might be over-fitted to the existing data.
Figure 4.11: These plots show how much the reconstructed traces of vehicles 14, 105, 92, 72 and 4 vary with different numbers of PCs involved (8, 3, 4 and 29 PCs).
Method                    Max. MSE   Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck        0.2653%    0.0451%    0.0264%      0.0514%
Median MSE per Truck      0.255%     0.0296%    0.0139%      0.0430%
All Observations Pooled   0.729%     0.0421%    0.0121%      0.0818%

Table 4.2: MSE of the reconstructed traces with 3 PCs (76.69% variance retained).
As the only difference to the baseline result is the number of PCs, graphs of the mean function and the PCs themselves will be omitted. Only the MSE tables and the reconstructed trajectories of selected trucks will be shown. For the baseline table, see Table 4.1, and for a comparative visualization of reconstructed trajectories see Figure 4.11.
As expected, the MSE results from the variations (Tables 4.2, 4.3, 4.4) behave analogously to the scree plot visible in Figure 4.6. When using a lower number of PCs the
Method                    Max. MSE   Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck        0.233%     0.0405%    0.0247%      0.0453%
Median MSE per Truck      0.263%     0.0260%    0.0110%      0.0401%
All Observations Pooled   0.800%     0.0374%    0.0102%      0.0756%

Table 4.3: MSE of the reconstructed traces with 4 PCs (85.34% variance retained).
Method                    Max. MSE   Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck        0.179%     0.0319%    0.0196%      0.0358%
Median MSE per Truck      0.220%     0.0200%    0.0084%      0.0309%
All Observations Pooled   0.635%     0.0286%    0.0081%      0.0578%

Table 4.4: MSE of the reconstructed traces with 29 PCs (99.99% variance retained).
error increases, whereas a high number of principal components does not necessarily improve the error performance much. This means the scree plot and the fraction of variance retained are indicative of the size of the MSE.
4.2.3 Error Assumptions in PACE
There are two possibilities to tune the behavior of PACE regarding “measurement
errors”:
• The assumption that the observations contain no ”measurement errors”.
• In addition to the presence of ”measurement errors”, the estimated errors are cut
off at the quartiles for the estimation of the error variance σ.
The notion of “measurement errors” in this context is a bit misleading, as PACE
assumes an underlying smooth function. The accumulative fuel consumption data itself
is precise enough, but the variation of the observations around this smooth function
can be considered noise. The assumptions on the measurement error mostly influence
the calculation of the PC scores.
The previous results were obtained with PACE under the assumption of measurement
errors, without cut-off. This section therefore covers the two remaining modes of
operation.
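A minimal sketch of the quartile cut-off as described above, assuming the residuals between the raw observations and the smooth fit are available (the actual PACE implementation may differ in detail):

    import numpy as np

    def error_variance(residuals, cut_at_quartiles=True):
        """Estimate the 'measurement error' variance from the residuals.
        With the cut-off enabled, only residuals between the first and the
        third quartile are used, making the estimate robust to outliers."""
        r = np.asarray(residuals)
        if cut_at_quartiles:
            q1, q3 = np.percentile(r, [25, 75])
            r = r[(r >= q1) & (r <= q3)]
        return np.mean(r ** 2)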
Method                     Max. MSE   Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck         0.173%     0.0316%    0.0199%      0.0350%
Median MSE per Truck       0.214%     0.0195%    0.0087%      0.0298%
All Observations Pooled    0.628%     0.0286%    0.0081%      0.0584%

Table 4.5: MSE of the reconstructed traces with 8 PCs. For the estimation of the error variance, all data outside the quartiles was cut off.
[Figure: five panels of fuel consumption (km/l) over driven distance for Vehicles #14, #105, #92, #72 and #4.]

Figure 4.12: Reconstructed traces of selected trucks with no measurement error assumed. The influence of this assumption can be seen clearly in Vehicle #105, where the PC scores are maximized to fit at the observation points.
Baseline PACE with error cut-off:

Table 4.5 shows that the MSE with error cut-off is almost as small as the MSE with
29 PCs, which is a clear improvement over the baseline. Basically, the additional
cut-off reduces the influence of outliers, which improves the performance in
comparison with the baseline results.
Using PACE under the assumption of zero measurement error:

As there is no assumed measurement error, the reconstruction MSE is very small. The
problem with this tight fit, however, is that the reconstruction is accurate only at
the original measurement points. This is shown in Figure 4.12.
[Figure: three smoothed mean curves of fuel consumption (km/l) over driven distance (km), one each for the Epanechnikov, rectangular and Gaussian kernels.]

Figure 4.13: This figure shows the effects of using different kernels for smoothing the mean curve µ. The Gaussian kernel produces a very smooth mean curve, whereas the rectangular kernel picks up noise from the measurements. The Epanechnikov kernel produces a compromise between these two variants.
4.2.4 Different Kernel Functions
Usually, the Epanechnikov kernel [4] is the standard choice for the smoothing steps in
the PACE method. This kernel function has compact support, i.e. it is zero outside a
finite interval. Alternative choices are the rectangular and the Gaussian kernel [4].
Whereas the rectangular kernel also has compact support, the Gaussian kernel extends
to infinity. For the smooth mean curve and the principal components, the rectangular
kernel has the effect of adding some noise to the curves, whereas the Gaussian kernel
has stronger smoothing properties. Figure 4.13 shows all three mean curves and
Figure 4.14 the three most significant PC curves.
In comparison, the overall MSE of the pooled data is slightly higher for PACE with
a rectangular kernel than with an Epanechnikov kernel (0.0351% mean, 0.0113%
median in the rectangular case versus 0.031% mean, 0.0089% median with the Epanech-
nikov kernel). In the Gaussian case, the fit is worse than with the other two kernels
(0.0468% mean, 0.0162% median)2.
2 These results were achieved with 8 principal components.
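The three kernels differ only in how they weight an observation by its scaled distance u from the evaluation point. The sketch below illustrates this effect with a simple Nadaraya-Watson smoother on the pooled observations; note that PACE itself uses local linear smoothing (cf. [14]), so this is only an illustration of the kernel choice, not of the actual smoothing step:

    import numpy as np

    def epanechnikov(u):
        return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

    def rectangular(u):
        return np.where(np.abs(u) <= 1, 0.5, 0.0)

    def gaussian(u):
        return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

    def smooth(t_grid, t_obs, y_obs, kernel, bandwidth):
        """Kernel-weighted average of all pooled observations on a grid."""
        w = kernel((t_grid[:, None] - t_obs[None, :]) / bandwidth)
        return (w @ y_obs) / np.maximum(w.sum(axis=1), 1e-12)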
[Figure: the three most significant PCs over distance (km) for the rectangular, Epanechnikov and Gaussian kernels.]

Figure 4.14: This figure shows the effect of using different kernels for smoothing on the PCs. The order of the PCs is visualized by the thickness of the lines, i.e. the thickest line depicts the first principal component. Generally, the same observations as in Figure 4.13 apply.
4.2.5 Variances
There are two different variances in the results: model variance and data variance.
As the names indicate, these variances come from different sources and must therefore
be handled differently.
The model variance addresses the question of how sure we are of a model. One way to
assess this variance is to perform leave-one-curve-out cross-validation on the smooth
mean curve. This enables a visualization of how much influence a single curve has on
the overall result of the mean curve or the principal components.
The data variance represents the density of measurements in a certain part of the
curve. To calculate the variance and the confidence interval of a certain part of the
curve, the number of trucks that influence that part of the curve has to be known.
There are two different approaches to this:

One approach is to bin the data, calculate the variances of the bins as shown in
Section 4.1, and use them to get approximate results for the variance.
Another approach is to use the reconstructed curves as a basis for calculating the
variances. There are two different implementations of this approach: either the
reconstructed curves are taken into account only within the interval of their real
observations, i.e. only observations which are relevant for a particular interval are
incorporated, or the complete reconstructed curves are used, which ignores the number
of real observations in a part of the curve.
4.2.5.1 Model Variance
The data used for this experiment is generated by PACE with 8 principal components,
after every observation made before the truck had run 1000 kilometers was removed.
The data needed to analyse the model variance is generated by leave-one-curve-out
cross-validation: this validation method produces one PACE result for each truck that
is excluded from the data, so there are as many PACE results as there are trucks.
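A minimal sketch of this procedure, with run_pace standing in for the actual PACE call (the names and the data layout are illustrative assumptions):

    def leave_one_curve_out(trucks, run_pace):
        """Run PACE once per truck, with that truck's curve excluded.
        trucks: list of (t_obs, y_obs) pairs; run_pace returns e.g. the
        mean curve evaluated on a fixed grid."""
        results = []
        for i in range(len(trucks)):
            subset = trucks[:i] + trucks[i + 1:]  # all curves except truck i
            results.append(run_pace(subset))
        return results  # one PACE result per excluded truck

    # The influence of truck i can then be judged by comparing results[i]
    # with the PACE result on the full data set.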
Model variance gives us two results. The first is a ranking of the trucks most
influential on the mean curve or on a PC. It is obtained by comparing the PACE result
that excludes a particular truck with the overall PACE result.

Additionally, model variance gives the distribution and variation of a particular
PACE result. Figure 4.15 shows the very peaky distribution of all the mean curves at
different points. Figure 4.16 shows all leave-one-out mean curves, the overall mean
curve µ and the standard deviation σ curves. The average deviation of σ from µ is
0.0016 km/l, the maximal deviation 0.0039 km/l. Figure 4.17 is a fuel consumption
plot of vehicles that are interesting with regard to their influence on the mean
curve or on the PCs.
4.2.5.2 Data Variance
Analysing the data variance gives an idea of the variation of the data around the
mean function. As mentioned before, there are two methods to accomplish this: the
first, simpler method is to use the data from binning. The other method, outlined in
this part of the thesis, is to consider only the segments of the reconstructed data
where real data support exists.
[Figure: five histograms (number of curves over fuel consumption in km/l) at 50000, 225000, 400000, 575000 and 750000 km.]

Figure 4.15: The distribution of all mean curves generated with the leave-one-curve-out method at various points. Two properties of the mean curves are visible, namely the peakiness of the distribution and the higher deviation from the mean at 50000 km and at 750000 km.
[Figure: all leave-one-out mean curves of fuel consumption (km/l) over distance (km).]

Figure 4.16: All mean curves generated by leave-one-curve-out cross-validation and the original µ (blue) and σ (red) curves. The small distance between the σ curves and the µ curve highlights the density of curves around the mean.
[Figure: five panels of fuel consumption (km/l) over distance (km) for Vehicles #43, #15, #88, #23 and #186.]

Figure 4.17: Plot of trucks with a high influence on the results of PACE. Trucks #43 and #15 have a strong influence on the µ curve since they provide data at the end of all observations, where data is very sparse. Vehicle #88 is the truck with the smallest influence on µ; it has both a short observation duration and average measurements. Truck #23 has the highest influence on the first PC and truck #186 has the smallest influence on the first PC.
Both results can be seen in Figure 4.18, and both methods deliver a similar result.
The main difference is the resolution of the result based on PACE, which is much
higher. However, unlike with the binning results, the estimated data between the
observations is also incorporated into the variance results, which means that regions
with low data support are also represented in the variance.
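A minimal sketch of the support-restricted variant, assuming the reconstructed trajectories have been evaluated on a common distance grid (all names are illustrative):

    import numpy as np

    def data_variance(grid, mean_curve, trajectories, obs_intervals=None):
        """Pointwise variance of the reconstructed curves around the mean.
        trajectories: array of shape (n_trucks, len(grid)). If obs_intervals
        (one (first, last) observed distance per truck) is given, a curve
        only contributes where it has real data support; grid points without
        any support then stay NaN."""
        var = np.full(len(grid), np.nan)
        for k, t in enumerate(grid):
            if obs_intervals is None:
                values = trajectories[:, k]  # complete reconstructed curves
            else:
                idx = [i for i, (lo, hi) in enumerate(obs_intervals)
                       if lo <= t <= hi]
                values = trajectories[idx, k]
            if len(values) > 0:
                var[k] = np.mean((values - mean_curve[k]) ** 2)
        return var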
4.3 Prediction of Fuel Consumption with PACE
Prediction in this case essentially means using the reconstructed trajectories from
the PC scores to estimate the fuel consumption of a truck at a certain point3.

As the baseline to measure the effectiveness of the prediction, the value of the last
available measurement is used as the predicted value. This simple assumption works
well on the accumulative data because the fuel consumption usually develops along an
almost straight line.

3 If the data, unlike the available truck data, is not open-ended, an alternative to the direct use of the trajectories would be the use of regression methods [5].
[Figure: two panels of standard deviation bands of fuel mileage (km/l) over distance driven (km).]

Figure 4.18: The standard deviation extracted from the binned data is visible on the left. The right graph shows the standard deviation of the data reconstructed from the observation durations and the trajectories regenerated from the PC scores of PACE.
For testing the prediction of new observations, the last observation of the truck to
be predicted is removed from the data, and the PACE results are calculated without
it. The prediction at the position of the removed observation is then taken from
these results. This procedure is repeated for each available truck.
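A minimal sketch of this test, again with run_pace as a stand-in for the actual PACE call; it is assumed here that run_pace returns a reconstructed trajectory for the reduced truck that can be evaluated at an arbitrary distance, and that each truck has at least two observations:

    import numpy as np

    def prediction_errors(trucks, run_pace):
        """Hold out the last observation of each truck in turn and compare
        the straight-line baseline (last value carried forward) with the
        PACE trajectory at the held-out point. Returns relative errors."""
        err_straight, err_pace = [], []
        for i, (t_obs, y_obs) in enumerate(trucks):
            held_t, held_y = t_obs[-1], y_obs[-1]
            reduced = [(t[:-1], y[:-1]) if j == i else (t, y)
                       for j, (t, y) in enumerate(trucks)]
            trajectory = run_pace(reduced, truck=i)
            err_straight.append(abs(y_obs[-2] - held_y) / held_y)
            err_pace.append(abs(trajectory(held_t) - held_y) / held_y)
        return np.array(err_straight), np.array(err_pace)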
In this general prediction test, straight-line prediction produces a maximum error of
5.04% and a mean error of 0.58% with a standard deviation of 0.81%, whereas using
the reconstructed trajectories to predict produces a maximum error of 5.49% and a
mean error of 1.25% with a standard deviation of 1.06%.

These results emphasize the straight nature of the data. In general, they show that
it is better to assume a steadily continuing fuel consumption behavior for forward
prediction.
If the trajectory is used as the true reference instead of the real observations,
straight-line prediction has a maximum error of 2.81% and a mean error of 0.48% with
a standard deviation of 0.56%, while prediction using PACE produces a maximum error
of 2.31% and a mean error of 0.38% with a standard deviation of 0.40%. These results
are not necessarily an indicator of prediction quality, but of result stability in
the case of a single missing observation.
Using the reconstructed trajectories directly for prediction is affected by the
assumption of the presence of a measurement error (i.e. a basic underlying deviation
even at the points with known observations) and by the bad fit which usually occurs
when dealing with outliers. However, given the relatively constant measurements of
individual trucks and the preexisting error between the actual observations and the
trajectories, the prediction works and is quite stable with regard to the removal of
observations.
4.4 Detection of Outliers with PACE
The main idea behind outlier detection with PACE, in particular with the PC scores,
is to quantify how normal and likely the fuel consumption behavior of a single
truck is.

As the first step in quantifying this probability, the distribution of the PC scores
has to be found. In this case the scores are normally distributed, as can be seen in
Figure 4.19. This makes the calculation of probabilities for a single PC score
straightforward.
By using the probabilities from just the first principal component, the same outliers
as with simple feature extraction (Section 4.1.2) can be found; in fact, the same
outliers can already be found by just using the raw PC scores.

However, if the probabilities of several PCs are combined, it is possible to
calculate the “normality” of a truck. The distribution of these probabilities, for a
varying number of PCs used, can be seen in Figure 4.20. Figure 4.21 shows a few
example fuel consumption plots of trucks along with their normality. For these
samples, the first four PCs, weighted by their eigenvalues, were used.
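A sketch of this computation for a single truck. The scores of each PC are treated as draws from a zero-mean normal distribution whose variance is the corresponding eigenvalue, and each score is turned into a two-sided tail probability; how exactly the eigenvalue weighting enters is not spelled out here, so the weighted geometric product below is only one plausible reading:

    import numpy as np
    from scipy import stats

    def normality(scores, eigenvalues, n_pcs=4, weighted=True):
        """Combine the tail probabilities of the first n_pcs PC scores of
        one truck into a single 'normality' value in [0, 1]."""
        weights = eigenvalues[:n_pcs] / np.sum(eigenvalues[:n_pcs])
        p = 1.0
        for k in range(n_pcs):
            sd = np.sqrt(eigenvalues[k])
            # probability of a score at least this far from the mean
            tail = 2 * stats.norm.sf(abs(scores[k]), loc=0.0, scale=sd)
            p *= tail ** weights[k] if weighted else tail
        return p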
In comparison to clustering the data with extracted features, as in Section 4.1, this
approach delivers a probability value rather than a grouping of the data, and the
probability value provides finer increments than the clusters. The normality method
is coupled tightly to the PCs calculated with the help of PACE, whereas clustering
works on arbitrary extracted features. However, for finding outliers in the data,
calculating the normality is far more non-parametric, as no features have to be
chosen by hand.
[Figure: four normal probability plots (probability over data) for PC 1 to PC 4.]

Figure 4.19: These are the normal probability plots for the first four PCs. The dashed red line represents the ideal normal distribution, whereas the blue crosses are the actual observations.
[Figure: four histograms (number of observations over probability) for 1 PC, 3 PCs, 4 PCs and 4 PCs (weighted).]

Figure 4.20: These histograms show the likelihoods for the occurrence of a single truck with different counts of PCs used for the calculation. When multiple PCs are used, the result is given by the product of the probabilities of all principal components. In the rightmost histogram the likelihoods are weighted by the eigenvalues of the PCs.
[Figure: eight panels of mileage (km/l) over distance (km): Vehicle #88 (prob. 78.84%), #92 (75.60%), #4 (50.74%), #72 (53.48%), #6 (25.93%), #16 (18.20%), #14 (2.36%) and #106 (11.84%).]

Figure 4.21: These graphs depict trajectories of several trucks, along with their normality. The normality describes how average the fuel consumption of one truck is with regard to all other trucks.
[Figure: the mean function of average speed (km/h) over distance (km), the PCs (PC1 43%, PC2 16%, PC3 13%, PC4 7%) and a scatter plot of all observations.]

Figure 4.22: This figure shows the mean curve of the average vehicle speed, the PCs and a scatter plot of all available observations. The mean curve is an indicator that trucks with a high odometer count have a higher average speed.
4.5 Expansion of our Application
As an example of the application of PACE to other data, the average vehicle speed is
used. Furthermore, in this section the PACE method is applied to cyclic fuel
consumption data, even though PACE was developed for longitudinal data.

The results of PACE on the average vehicle speed can be seen in Figure 4.22. The
speed data has a distribution of observations similar to that of the fuel consumption
data.
The most interesting outcome of the analysis of seasonal fuel consumption was not
the mean curve, but the second and the third principal component, which seem to
reflect high fuel efficiency in spring and autumn. Based on those two PCs, it is
possible to calculate the strength of these two seasonal effects relative to the
other trucks.
[Figure: left, the mean function of fuel consumption (km/l) over the day of the year; right, the first four PCs.]

Figure 4.23: The left figure shows the mean fuel consumption when the fuel consumption is observed over the year. The peak at the very beginning and the very end of the year is probably caused by the lack of observations at this time of the year. On the right, the first four PCs are shown. The first PC is a linear offset, whereas the second and the third components probably show seasonal effects.
Figure 4.23 shows the mean curve as well as the strongest PCs. Traces of interesting
trucks in relation to the strength of their seasonal effects can be seen in Figure
4.24. Note that the data was collected over no more than three years, so these
results might be affected by a mild winter or a rainy summer.
[Figure: four panels of fuel consumption (km/l) over the day of the year for Vehicles #121, #192, #190 and #6.]

Figure 4.24: Vehicle #121 exhibits no seasonal dependencies, which suggests a constant climate. Vehicle #192 has a high PC2 score, i.e. a high fuel efficiency in spring. Vehicle #190 has a low fuel efficiency in winter and high values in both its PC2 and PC3 scores. Vehicle #6 has a low fuel efficiency only in spring; this may be because of a change in the utilisation of this truck, or because those measurements are from a different year.
5
Discussion
A natural perspective for a possible continuation of this work would be the expansion
of the method applications to different datasets, especially to more specialized
ones: for example, data in a similar quantity to the one used, but from a corporate
fleet of quasi-identical trucks which are in service within the same climate zone,
with similar loads, etc. Such data would likely be better suited for research on
detecting trends, as well as on detecting trucks with unexpected behavior, i.e.
outliers. An example of such a dataset would be data from Scandinavian long-distance
trucks, where the differences in fuel consumption between summer and winter should be
clearly visible. To further the research on seasonal variation of fuel consumption,
an expansion of PACE onto cyclic data might be useful.
Furthermore, the analysis of data containing more observations might be interesting,
as many small underlying influences in the fuel consumption data could be uncovered.
With such datasets, research on the asymptotic properties as well as on the
distribution of the data would be more useful than with the small amount of mixed
data at hand. With denser data, it might also be viable to switch to calculating the
fuel consumption from the fuel amount used between two observations instead of the
fuel amount consumed since the truck was manufactured.
6
Conclusion
Sparse and irregular data is hard to analyse with multivariate as well as with
functional statistics. The steps which make analysis with these two approaches hard
are feature extraction and data interpolation. Feature extraction requires a careful
selection of relevant features, and the manual work involved is not very desirable.
The main problem with interpolating this data is its open-ended nature: outside the
given observations for a single truck, it is very difficult to estimate a function
without knowledge of the underlying model.
However, it is possible to analyse such data in a functional way if the Principal
Components Analysis through Conditional Expectation (PACE) method is used. If the
Gaussian assumptions made by PACE are acceptable, the method provides a completely
data-centric approach to extract a mean curve and principal components from the data,
as well as complete trajectories regenerated from the principal component scores of
the individuals.
These results can be used as a basis for further analysis, such as classification and
regression. While these tasks can also be approached with feature extraction, PACE
uses all available data and is much more non-parametric. Also, the functional
approach of PACE keeps the data in a more natural format than the abstract extracted
features of the individuals.
Most of the variation in the data used in this work (long-distance articulate truck
fuel consumption data) can be captured with a small number of principal components.
However, the data does not contain highly significant general trends or easily
separable clusters. Some outlying individuals are contained in this data, but because
of the meta-data nature of the fuel consumption, it is not possible to distinguish
between a possible truck fault and environmental influences.
Fuel consumption is difficult to predict, as it can change very rapidly when the
environment changes. The available truck data has no definitive start or end; samples
were taken at arbitrary times and are connected only by the truck configuration.
Thus, prediction for this data can only give an educated guess about the fuel
consumption based on the data from the other trucks, not from the individual truck
itself.
However, using the principal component scores, it is possible to calculate whether a
truck is normal relative to its peers, or whether its behavior is relatively
unlikely, i.e. an outlier.

The data-centric approach of these methods is bound by the quality and quantity of
the data used, which is true for all statistical methods. To overcome the
difficulties with sparseness and irregularity in the observations, the methods
described above are very reasonable and enable a functional analysis of this data.
Bibliography
[1] Hastie, T., R. Tibshirani, and J. Friedman: The Elements of Statistical Learning.
Springer Verlag, 2001.
[2] Ramsay, J. O. and B. W. Silverman: Functional Data Analysis. Second Edition.
Springer Verlag, 2006.
[3] Ferraty, F. and P. Vieu: Nonparametric Functional Data Analysis: Theory and
Practice. Springer Verlag, 2006.
[4] Yao, F., H.G. Müller, and J.L. Wang: Functional Data Analysis for Sparse Longitudinal Data. Journal of the American Statistical Association, 100(470):577–591, 2005.
[5] Liu, B. and H.G. Müller: Functional Data Analysis for Sparse Auction Data. 2007. http://www.smith.umd.edu/ceme/statistics/Liu_Muller_FDA%20for_Sparse_Auction_Data.pdf Preprint published online. Retrieved 2007-11-25.
[6] Sandberg, T.: Heavy Truck Modeling for Fuel Consumption Simulations and Measurements. Master's thesis, Division of Vehicular Systems, Department of Electrical Engineering, Linköping University, Linköping, 2001.
[7] Stodolsky, F., L. Gaines, and A. Vyas: Analysis of Technology Options to Reduce
the Fuel Consumption of Idling Trucks. Technical report, ANL/ESD-43, Argonne
National Lab., IL (US), 2000.
[8] James, GM, TJ Hastie, and CA Sugar: Principal component models for sparse
functional data. Biometrika, 87(3):587–602, 2000.
[9] Zou, H., T. Hastie, and R. Tibshirani: Sparse principal component analysis. Jour-
nal of Computational and Graphical Statistics, 15(2):265–286, 2006.
[10] Hall, P., H.G. Müller, and J.L. Wang: Properties of principal component methods for functional and longitudinal data analysis. Ann. Statist., 34(3):1493–1517, 2006.
[11] Yao, F., H.G. Müller, A.J. Clifford, S.R. Dueker, J. Follett, Y. Lin, B.A. Buchholz, and J.S. Vogel: Shrinkage Estimation for Functional Principal Component Scores with Application to the Population Kinetics of Plasma Folate. Biometrics, 59:676–685, 2003.
[12] Liang, K.Y. and S.L. Zeger: Longitudinal data analysis using generalized linear
models. Biometrika, 73(1):13–22, 1986.
[13] Maaten, L. J. P. van der, E. O. Postma, and H. J. van den Herik: Dimensionality reduction: A comparative review. 2007. http://www.cs.unimaas.nl/l.vandermaaten/dr/DR_draft.pdf Preprint published online. Retrieved 2007-12-01.
[14] Fan, J. and I. Gijbels: Variable Bandwidth and Local Linear Regression Smoothers.
The Annals of Statistics, 20(4):2008–2036, 1992.
List of Abbreviations
LVD . . . . . . . . . . . . . . . . . . . Logged Vehicle Data
EECU . . . . . . . . . . . . . . . . . Engine Electric Control Unit
FDA . . . . . . . . . . . . . . . . . . . Functional Data Analysis
FD . . . . . . . . . . . . . . . . . . . . Functional Data
PCA . . . . . . . . . . . . . . . . . . . Principal Component Analysis
PC . . . . . . . . . . . . . . . . . . . . Principal Component
PCs . . . . . . . . . . . . . . . . . . . Principal Components
PACE . . . . . . . . . . . . . . . . . Principal Component Analysis through Conditional Expectation
MSE . . . . . . . . . . . . . . . . . . . Mean Squared Error
AIC . . . . . . . . . . . . . . . . . . . Akaike Information Criterion
FVE . . . . . . . . . . . . . . . . . . . Fraction of Variance Explained