

MNRAS 465, 4311–4324 (2017) doi:10.1093/mnras/stw2894

A new automated spectral feature extraction method and its application in spectral classification and defective spectra recovery

Ke Wang,1,3 Ping Guo1,2 and A-Li Luo3
1School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
2Image Processing and Pattern Recognition Laboratory, Beijing Normal University, Beijing 100875, China
3Key Laboratory of Optical Astronomy, National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100012, China

Accepted 2016 November 7. Received 2016 October 17; in original form 2016 July 17

ABSTRACT
Spectral feature extraction is a crucial procedure in automated spectral analysis. This procedure starts from the spectral data and produces informative and non-redundant features, facilitating the subsequent automated processing and analysis with machine-learning and data-mining techniques. In this paper, we present a new automated feature extraction method for astronomical spectra, with application in spectral classification and defective spectra recovery. The basic idea of our approach is to train a deep neural network to extract features of spectra with different levels of abstraction in different layers. The deep neural network is trained with a fast layer-wise learning algorithm in an analytical way, without any iterative optimization procedure. We evaluate the performance of the proposed scheme on real-world spectral data. The results demonstrate that our method is superior regarding its comprehensive performance, and its computational cost is significantly lower than that of other methods. The proposed method can be regarded as a valid new general-purpose feature extraction method for various tasks in spectral data analysis.

Key words: methods: data analysis – methods: numerical – methods: statistical – techniques: spectroscopic.

1 INTRODUCTION

With the technological advancement of observational instruments in astronomy, huge volumes of spectral data have been and will be generated in modern spectroscopic surveys, for example the Sloan Digital Sky Survey (SDSS, York et al. 2000), the Global Astrometric Interferometer for Astrophysics (GAIA, Perryman et al. 2001), and the Large Sky Area Multi-Object Fibre Spectroscopic Telescope (LAMOST or Guo Shoujing Telescope, Cui et al. 2012). As is the case for many other data-intensive scientific disciplines, the immense volume of data and the high rate of data acquisition in modern astronomy necessitate a focus on automated, efficient and intelligent techniques and methodologies that can ‘understand’ certain tasks in astronomical research and automatically mine large astronomical data bases for scientific discoveries. One way of performing such tasks is through the use of machine-learning techniques. Various reviews of the use of machine learning in astronomy are available, for example Ball & Brunner (2010) and Way et al. (2012).

One significant procedure in the deployment of machine-learning algorithms is the design of data transformation pipelines that result

E-mail: [email protected] (KW); [email protected] (PG); [email protected] (A-LL)

in a representation of the original data. This is usually referred to as feature extraction (or feature learning or representation learning). The features (or representation) of the original data can be viewed as the input to a machine-learning algorithm. Feature extraction is a crucial procedure for machine learning because the performance of machine-learning algorithms is heavily dependent on the quality of data representation (Bengio, Courville & Vincent 2013).

In astronomy, principal component analysis (PCA) is widely used as a general tool for achieving unsupervised feature extraction or dimensionality reduction of spectral data. In PCA, the original data are represented by their projections onto the principal components, and the projections are dimension-reduced features. The derived features are then used as the inputs in the subsequent processing. PCA has been used in several areas of astronomy, including stellar spectral classification (Bailer-Jones, Irwin & Von Hippel 1998), galaxy spectral classification (Yip et al. 2004), spectral clustering analysis (Wang, Guo & Luo 2015) and the estimation of stellar fundamental parameters (McGurk, Kimball & Ivezić 2010). In feature extraction using PCA, spectra are represented by linear combinations of a few eigenvectors with the highest eigenvalues. It is clear that linear combination is an oversimplified method, because it cannot reveal the inherently non-linear relationships within the spectral data. Thus, non-linear methods have been introduced in spectral data feature extraction. Among these methods, locally linear embedding (LLE) is a manifold learning method that seeks to find a set of the

© 2016 The Authors. Published by Oxford University Press on behalf of the Royal Astronomical Society.




nearest neighbours of a data point that best describes the point and adopts an eigenvector-based optimization technique to find the low-dimensional representation. It is a representative technique applied in spectral data feature learning (Richards et al. 2009; VanderPlas & Connolly 2009; Daniel et al. 2011).

Among the various feature-learning techniques, the work of Hinton and Salakhutdinov (Hinton & Salakhutdinov 2006) can be considered as a breakthrough. In their work, multiple-layer (deep) architectures are used to learn features with different levels of abstraction. To overcome the difficulties of training such deep models, a greedy algorithm with a layer-wise pre-training scheme is proposed. The pre-training can be considered as a feature extraction procedure that learns a new representation from the learned representation in previous layers. Inspired by Hinton and Salakhutdinov’s work, other researchers have applied deep learning with success in many fields. These include, but are not limited to, image recognition (Hinton & Salakhutdinov 2006; Bengio & Larochelle 2006; Krizhevsky, Sutskever & Hinton 2012), speech recognition (Hinton et al. 2012; Dahl et al. 2012) and information retrieval (Krizhevsky & Hinton 2011; Salakhutdinov & Hinton 2009). More and more research indicates that deep models are able to achieve a much better performance than traditional shallow models (Bengio 2009).

The explosion of data in many fields means that more samples are available, and hence the training of deep models has become more feasible. Deep learning also provides a powerful tool for the analysis of large data sets. The success of deep learning in many applications provides precedents for other fields. There are also increasing opportunities to capitalize on the tremendous volumes of data in astronomy using deep-learning techniques. Recently, various deep-learning techniques have been employed in astronomical data processing. Bu et al. (2015) applied the restricted Boltzmann machine (RBM) to spectral processing. Although the RBM is usually used as a building unit to train deep models, these authors used a single RBM rather than stacking RBMs into a deep model. Graff et al. (2014) presented a training tool for deep neural networks in astronomy. They applied their deep neural network to three astrophysics examples, namely the mapping of dark matter, the identification of gamma-ray bursters, and image compression for galaxies. However, they focused mainly on image data rather than on spectra. Dieleman, Willett & Dambre (2015) used a convolutional neural network (CNN) for the classification of galaxy morphology in the Galaxy Zoo project. Subsequently, Huertas-Company et al. (2015) extended this new methodology to high redshifts by classifying images of galaxies with median redshift 〈z〉 ∼ 1.25. Yang & Li (2015) proposed a scheme for spectral feature learning in atmospheric parameter estimation. They used the auto-encoder to extract local features from stellar spectra. Hoyle (2016) applied deep-learning techniques to evaluate the photometric redshift. He used a deep neural network to implement an end-to-end pipeline that takes the entire multi-band galaxy images as the input.

The immense volume of astronomical data, however, presents a challenge to the widespread use of deep-learning schemes in astronomy. One of the challenges relates to learning efficiency. Most deep-learning algorithms are based on variations of gradient-descent-based algorithms, such as the back-propagation (BP) algorithm. These algorithms suffer from a slow training speed when the data volume is large, because the training procedure requires computationally expensive iterative optimization. Another issue for most gradient-descent-based algorithms lies in the fact that the user needs to specify a set of control parameters. These parameters, including maximum epoch number, learning rate and momentum, are

crucial to the performance of the algorithm. However, parameter adjustment is usually task-specific and relies mainly on empirical tricks.

In order to utilize fully the capacity of deep learning in astronomical spectral processing, we propose an efficient learning scheme for deep neural networks and extend it to an incremental learning version. This scheme builds on previous pseudo-inverse strategies, which were designed for the training of forward networks. In order to demonstrate the practicality of our method for spectral data, we apply it to a stellar spectral classification task and to a defective spectra recovery task. In addition, we seek good qualitative interpretations of what the neural network learns from the spectral data. Although neural networks have long been used in astronomical spectral processing, they are employed mainly as classifiers trained with a supervised back-propagation algorithm (Von Hippel et al. 1994; Weaver & Torres-Dodgen 1997; Navarro, Corradi & Mampaso 2012). The most notable difference in our work is that we employ neural networks to extract features of spectra with a new efficient algorithm, and the extracted features can be used not only for spectral classification but also for other tasks.

The remainder of the paper is organized as follows. In the next section, we give a brief introduction to the spectral data used in this paper. In Section 3, we provide a detailed description of our feature-learning scheme. In Section 4, we discuss and analyse the utility of the proposed method as a feature-learning scheme, as well as its performance relative to classical spectral classification and defective spectra recovery. Finally, our conclusions are presented in Section 5.

2 SPECTRAL DATA AND PRE-PROCESSING

The spectral data in this work comprise 50 000 stellar spectra randomly selected from LAMOST Data Release One (Luo et al. 2015). The LAMOST telescope, also known as the Guo ShouJing telescope, is a 4-m reflecting Schmidt telescope with 4000 fibres configured on a 5° field of view (Cui et al. 2012). The data reduction method for LAMOST is described in Luo et al.’s previous work (Luo et al. 2015). The stellar spectra are classified as F, G and K types by the LAMOST 1D pipeline. The wavelength range of these spectra is 370–900 nm. The theoretical resolution of the LAMOST spectrographs is R = 1000, while the practical resolution reaches R > 1500. Because there is still no corresponding photometric survey, LAMOST cannot provide spectra with absolutely calibrated fluxes. However, a relative flux calibration is employed in the LAMOST 2D pipeline. In the calibration, first some stars are selected in each spectrograph field as standard stars (or as pseudo-standard stars, which can be regarded as extending the more restricted group of standard stars). The spectral response function for each spectrograph is then obtained by finding the observed pseudo-continuum of the standard (or pseudo-standard) stars and the best physical pseudo-continuum generated using Kurucz models. Then the response function is used to calibrate the raw spectra provided by other fibres of the same spectrograph (Song et al. 2012).

To start, we need to perform a series of pre-processing tasks on the original spectral data. For a given set of spectra denoted by $D = \{x_i, o_i\}_{i=1}^{N}$, where the vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$ represents a spectrum, in which $x_{in}$ is the flux at a given wavelength, we re-bin each spectrum to form a lower-dimensional vector. To be specific, we empirically average every five pixels to form one synthesized pixel and use this synthesized pixel to represent the original five. By doing this, we can obtain for each spectrum a 721-dimensional vector, which was initially 3601-dimensional. This pre-processing




Figure 1. (Top) a random sample from the original spectra; (bottom) the pre-processed spectrum. The wavelengths are given in angstroms.

can eliminate the disturbance of stochastic noise with little to no effect on the final performance. Moreover, the re-binning can reduce the computational complexity and thus improve efficiency. It is worth noting that the fluxes at different wavelengths in the raw spectrum are generally on very different scales. Hence we carry out a linear re-scaling of the raw data, such that the fluxes in each individual spectrum have zero mean and unit variance. This is a commonly used technique in machine learning when the input data are distributed on different scales. It is carried out as follows:

$$x_i = \frac{x_i - \mu}{\sigma}, \quad (1)$$

where $\mu$ is the mean of the sample and $\sigma$ is the standard deviation. An alternative method is also frequently used, namely

$$x_{in} = \frac{(\max - \min)(x_{in} - \min\{x_i\})}{\max\{x_i\} - \min\{x_i\}} + \min, \quad (2)$$

where $\min$ and $\max$ indicate the expected minimum and maximum values, respectively. Apart from the aforementioned pre-processing, no other operation is applied. A comparison between the original spectrum and the pre-processed one is shown in Fig. 1.

3 DEEP NEURAL NETWORK SCHEME

3.1 Deep neural networks

A deep neural network (DNN), also called a multilayer neural network or multilayer perceptron, is a type of artificial neural network with multiple hidden layers between the input layer and the output layer. The basic structure of a DNN is illustrated in Fig. 2. For a supervised learning problem, we obtain a training set with $N$ arbitrary distinct samples, denoted as $D = \{x_i, o_i\}_{i=1}^{N}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$ is the $i$th $d$-dimensional input vector and $o_i = (o_{i1}, o_{i2}, \ldots, o_{im}) \in \mathbb{R}^m$ is the corresponding label vector. The supervised learning task involves seeking the weight matrix that can minimize the following sum of the square error:

$$E = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{m} \left\| G_j(x_i, \Theta) - o_{ij} \right\|^2, \quad (3)$$

where $\Theta$ is the network parameter set, including the connection weights $W$ and a bias. $G_j(x_i, \Theta)$ is a function that maps the inputs to the output of the $j$th neuron in the output layer. In practice, $G_j(x_i, \Theta)$ is calculated layer by layer from the input layer to the output layer, a process usually called ‘feed-forward’. To be specific, the top layer computes an output vector by taking the output of the

Figure 2. The structure of a deep neural network with n hidden layers. The neurons in each layer are fully connected with the neurons in the adjacent layers.

nth hidden layer as input, after the output of the $n$th hidden layer is calculated by using the output of the $(n-1)$th hidden layer, recursively down to the input layer. Taking the calculation in the $n$th hidden layer as an example, this process can be formulated as

$$g_j^n(x_i, \Theta) = \sigma\left( \sum_{k=1}^{p_{n-1}} w_{k,j}\, g_k^{n-1}(x_i, \Theta) + \theta \right), \quad (4)$$

where $g_j^n(x_i, \Theta)$ is the function used to calculate the output of the $j$th neuron in the $n$th hidden layer, $g_k^{n-1}(x_i, \Theta)$ is the output of the $k$th neuron in the previous layer, and $\theta$ is a bias parameter. $\sigma(\cdot)$ is the so-called activation function, which is a non-linear piecewise continuous function, for example the hyperbolic function

$$\sigma(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \quad (5)$$

or the sigmoidal function

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \quad (6)$$

or the rectifier function

$$\sigma(x) = \max(0, x). \quad (7)$$
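The feed-forward computation of equation (4), with the activation functions of equations (5)–(7), can be sketched as follows. This is an illustrative NumPy sketch: the layer sizes and the weight scaling are arbitrary assumptions, not values from the paper.

```python
import numpy as np

def tanh_act(x):
    # Equation (5): the hyperbolic tangent activation.
    return np.tanh(x)

def sigmoid(x):
    # Equation (6): the sigmoidal activation.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Equation (7): the rectifier activation.
    return np.maximum(0.0, x)

def layer(g_prev, W, theta, activation=tanh_act):
    # Equation (4): affine map of the previous layer's output followed
    # by a non-linearity, computed for all neurons in the layer at once.
    return activation(g_prev @ W + theta)

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 721))                       # 4 spectra, 721 features
W1 = rng.normal(size=(721, 128)) / np.sqrt(721)     # scaled random weights
b1 = np.zeros(128)
h1 = layer(x, W1, b1)                               # first hidden-layer output
```

Stacking several such `layer` calls, each taking the previous layer's output as input, reproduces the recursive feed-forward pass described above.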

3.2 Auto-encoders

As the network becomes deeper, it is able to achieve a much better learning capacity than a shallow network. However, training the DNN becomes more difficult than training shallow architectures (Erhan et al. 2009a; Bengio 2009). In order to solve the challenging training problem of deep architectures, Hinton & Salakhutdinov (2006) adopted a layer-wise unsupervised learning strategy to initialize the network, rather than random initialization. In their new learning algorithm, each layer of a deep architecture is associated with an ‘auto-encoder’ that is trained in an unsupervised fashion. Several auto-encoders are finally stacked to form a pre-trained DNN. An auto-encoder (also called an autoassociator) (Hinton & Salakhutdinov 2006; Bengio 2009; Vincent et al. 2008) is simply a particular type of single-layer feed-forward network




Figure 3. The structure of an auto-encoder.

(SLFN). It tries to learn an approximation to an identity function so as to reconstruct the input. In other words, the auto-encoder tries to learn features that make the reconstructed outputs as similar as possible to the inputs. Auto-encoders adopt the framework shown in Fig. 3. The framework can be considered to consist of two parts, namely the ‘encoder’ and the ‘decoder’. The ‘encoder’ is trained to encode an input $x$ into a feature $y$. Its typical form is an affine mapping followed by a non-linearity, namely

$$y = f(x, \Theta) = \sigma(W^{\mathrm{T}} x + \theta). \quad (8)$$

Its parameter set is denoted as $\Theta = \{W, \theta\}$. A vector $z$ in input space is then reconstructed from the feature $y$ by the ‘decoder’, formulated as

$$z = \phi(y, \Theta') = W y + \theta', \quad (9)$$

with $\Theta' = \{W, \theta'\}$ as its parameter set. Training the auto-encoder involves reconstructing an input vector optimally by minimizing the following cost function:

$$E = -\log p(x \mid z). \quad (10)$$

If $x|z$ is continuous, the cost function can be defined as the following squared reconstruction error:

$$E = \sum_{i=1}^{N} \left\| \left( W\,\sigma(W^{\mathrm{T}} x_i + \theta) + \theta' \right) - x_i \right\|^2. \quad (11)$$

Here, we use the L2 norm to penalize the difference between the inputs and the reconstructed ones. If the input $x$ is binary, or considered to be binomial probabilities, the following cross-entropy cost can be used:

$$E = -\sum_{i=1}^{N} \left( x_i \log z_i + (1 - x_i) \log(1 - z_i) \right). \quad (12)$$

The auto-encoder is a well-known deep-learning model used to learn features for a set of data. Auto-encoders have been used as building blocks to train DNNs (stacked auto-encoders). In this approach, each hidden layer is trained separately in an auto-encoder (Bengio 2009; Larochelle et al. 2007; Vincent et al. 2010). In this type of DNN, different hidden layers extract features of the input data with different levels of abstraction. This means that the (k + 1)th layer is trained after the kth has been trained, and the learned features in the kth layer are used as input for the next, (k + 1)th, layer. This greedy layer-wise scheme has been shown to yield significantly better local minima than random initialization, achieving better generalization on a number of tasks (Larochelle et al. 2009). Auto-encoders can be used to learn a compressed, sparse or equal representation (features) for a set of data. When the hidden layer is narrower than the input layer, the auto-encoder attempts to represent the data in a lower-dimensional feature space, and the original input could be reconstructed approximately from the lower-dimensional feature. In contrast, if the hidden layer is wider than the input layer, the auto-encoder represents the data in a higher-dimensional feature space.
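A minimal sketch of the encoder (equation 8), the tied-weight linear decoder (equation 9) and the squared reconstruction error (equation 11) is given below. It assumes row-vector inputs, so that `x @ W` plays the role of $W^{\mathrm{T}} x$; the shapes and initialization are illustrative only.

```python
import numpy as np

def encode(X, W, theta):
    # Equation (8): y = sigma(W^T x + theta), with a sigmoid non-linearity.
    return 1.0 / (1.0 + np.exp(-(X @ W + theta)))

def decode(Y, W, theta_p):
    # Equation (9): linear decoder reusing the tied weight matrix W.
    return Y @ W.T + theta_p

def reconstruction_error(X, W, theta, theta_p):
    # Equation (11): summed squared L2 reconstruction error over all samples.
    Z = decode(encode(X, W, theta), W, theta_p)
    return np.sum((Z - X) ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 20))          # 8 samples, 20 input dimensions
W = rng.normal(size=(20, 6)) * 0.1    # narrower hidden layer: 6 features
err = reconstruction_error(X, W, np.zeros(6), np.zeros(20))
```

With 6 hidden units for 20 inputs, this corresponds to the compressed (lower-dimensional) case discussed above.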

3.3 The training technique

3.3.1 Pseudo-inverse learning (PIL) algorithm

The pseudo-inverse learning (PIL) algorithm was originally proposed by Guo & Lyu (2004). These authors used this algorithm to train feed-forward neural networks in supervised machine-learning problems. In the PIL algorithm, it is worth noting that the weights of the neural network are calculated in an analytical way, rather than iteratively as in conventional gradient-descent-based learning algorithms. Hence the greatest advantage of the PIL algorithm is that it is more efficient than iterative learning algorithms, for example the error back-propagation (BP) algorithm. Furthermore, with the PIL algorithm there is no need to explicitly set any control parameters, such as the learning epochs, step length and momentum, which are usually specified empirically by the user without a theoretical basis. A detailed description of the PIL algorithm is given in Appendix A.

3.3.2 Pseudo-inverse learning auto-encoder (PILAE)

In essence, an auto-encoder is an SLFN, and thus the PIL algorithm can also be used to train auto-encoders. In our proposed feature-learning scheme, we extended the PIL algorithm to train auto-encoders and stacked these trained auto-encoders into a DNN. Unlike traditional algorithms for auto-encoders or restricted Boltzmann machines, the PIL algorithm can calculate the network weights analytically without repeated control-parameter tuning. Hence, it is an efficient and easy-to-use learning algorithm, and competitive with gradient-descent algorithms in practical applications. A detailed description of the PILAE is given in Appendix B.
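The core idea behind pseudo-inverse learning, computing the output weights analytically as a least-squares solution rather than by gradient descent, can be sketched as follows. This is purely illustrative (the paper's actual PIL/PILAE algorithms are detailed in its Appendices A and B); in particular, the random sigmoid encoder here is our assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))             # 200 samples, 50 features

# An encoder with (here) random input weights and a sigmoid activation.
W_e = rng.normal(size=(50, 30)) / np.sqrt(50)
H = 1.0 / (1.0 + np.exp(-(X @ W_e)))       # hidden-layer output matrix

# Analytical decoder weights: the least-squares solution minimizing
# ||H W_d - X||^2, obtained via the Moore-Penrose pseudo-inverse.
# No iterative optimization, no learning rate, no epoch count.
W_d = np.linalg.pinv(H) @ X
Z = H @ W_d                                # reconstruction of the input
```

The absence of any tunable control parameters in the weight computation is exactly the practical advantage claimed for PIL over BP-style training.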

3.3.3 Local connectivity

In traditional DNNs, neurons in adjacent layers are fully connected. This kind of connection scheme treats different input dimensions in the same way and hence does not take the spatial structure of the data into account. Furthermore, the full connectivity between neurons not only lowers the learning speed but also increases the network complexity, and further increases the risk of over-fitting.

In order to solve the above problems, we adopt a locally connected network structure as a substitute for the fully connected structure. Our proposed locally connected structure is shown in Fig. 4. In this structure, a neuron is forced to be connected only to neighbouring neurons in the adjacent layers. We refer to these neighbouring neurons as a ‘segment’. This architecture thus ensures that the learned features produce the strongest response to a local input pattern. This locally connected structure enables the network to first learn local features from the input data and then to assemble these local features into a global feature. Furthermore, this scheme also dramatically reduces the complexity of the network. If an SLFN has 100 input neurons, 50 hidden neurons and 100 output neurons, it needs a total of (100 × 50) + (50 × 100) = 10 000 weights to fully connect the neurons. If we employ local connectivity with five segments of equal size for the same input data, the network would have only (20 × 10 × 5) + (20 × 10 × 5) = 2000 connection weights, a much lower number than in the fully connected network.




Figure 4. The local connection network structure.

This model-slimming mitigates the risk of over-fitting while improving the learning efficiency.
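The weight-count arithmetic above can be checked with a small sketch that builds a block-diagonal connectivity mask for five equal segments. The mask construction is our illustration of the idea, not the paper's implementation.

```python
import numpy as np

def local_mask(n_in, n_hidden, n_segments):
    # Block-diagonal connectivity: each hidden segment connects only
    # to its corresponding input segment; all other weights are absent.
    mask = np.zeros((n_in, n_hidden))
    for s in range(n_segments):
        rows = slice(s * n_in // n_segments, (s + 1) * n_in // n_segments)
        cols = slice(s * n_hidden // n_segments, (s + 1) * n_hidden // n_segments)
        mask[rows, cols] = 1.0
    return mask

m = local_mask(100, 50, 5)
encoder_weights = int(m.sum())     # 20 x 10 per segment, 5 segments = 1000
total_weights = 2 * encoder_weights  # encoder + decoder = 2000, vs 10 000 fully connected
```

Multiplying a weight matrix element-wise by such a mask is one simple way to enforce the segment structure while keeping ordinary matrix operations.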

3.3.4 Incremental learning scheme

Typically, we assume that the entire training data set is available and its size is fixed when we train auto-encoders. However, real-world spectral data are usually collected continually and become available in a sequential order, which means that the learning algorithm has limited access to the data at a given point in time. In conventional batch learning, whenever a new sample is received, the newly arriving sample must be combined with the existing data to form a new training set, and the connection weights must be updated according to the new, larger data set. It is obvious that batch learning is not feasible for training sophisticated models with ever-increasing volumes of data.

In order to address the above-mentioned issues, we developed an incremental learning auto-encoder (IPILAE) by extending the basic PILAE presented in the previous subsection. A detailed description of the incremental learning algorithm is given in Appendix C.
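One common way to realize such incremental updating, sketched here purely for illustration (the paper's actual IPILAE algorithm is detailed in its Appendix C), is to accumulate the Gram matrix $H^{\mathrm{T}}H$ and the cross term $H^{\mathrm{T}}X$ batch by batch, so that the least-squares weights can be re-solved without revisiting earlier samples. The class name and the ridge term are our assumptions.

```python
import numpy as np

class IncrementalLeastSquares:
    """Sketch of incremental least-squares weight updating: each new
    batch is absorbed into fixed-size sufficient statistics, so earlier
    samples never need to be stored or revisited."""

    def __init__(self, n_hidden, n_out, ridge=1e-6):
        self.A = ridge * np.eye(n_hidden)   # accumulates H^T H (+ small ridge)
        self.B = np.zeros((n_hidden, n_out))  # accumulates H^T X

    def partial_fit(self, H, X):
        self.A += H.T @ H
        self.B += H.T @ X

    @property
    def weights(self):
        # Solve (H^T H + ridge I) W = H^T X for the current weights.
        return np.linalg.solve(self.A, self.B)

rng = np.random.default_rng(4)
H_all = rng.normal(size=(300, 20))
X_all = rng.normal(size=(300, 8))

inc = IncrementalLeastSquares(20, 8)
for start in range(0, 300, 100):            # three sequential batches
    inc.partial_fit(H_all[start:start + 100], X_all[start:start + 100])
```

Because the statistics are additive, the result after all batches matches the batch solution computed on the full data set in one pass.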

4 RESULTS: APPLICATION OF THE PROPOSED METHOD

In order to evaluate the proposed algorithm regarding spectral feature extraction, we conducted two experiments to investigate its performance. The first is a stellar spectral classification task, and the second is the recovery of spectra with defects caused by an incorrect wavelength-band connection.

4.1 Stellar spectral classification

Stellar spectral classification is an essential procedure in spectroscopic surveys. It is useful for users to be able to conduct specific research with classified spectral data. For example, astronomers may want to select all OB-type stars from archived spectra in order to study young massive stars. In large spectroscopic surveys, for example LAMOST, the MK-type classification of stars cannot be achieved by comparing the spectra with a standard star with the unaided human eye, owing to the huge volume of data. Spectral classification is a non-trivial task for LAMOST, because there is no photometric survey accompanying the spectroscopic one, which makes it difficult to classify spectra with photometric colour indices.

In this subsection, we conduct an experiment to assess the performance of our feature-extraction method in the spectral classification task. In this experiment, we randomly chose 50 000 LAMOST spectra along with information on their spectral type. This spectral data set includes 12 994 F-type stars, 16 448 G-type stars and 13 058 K-type stars. A common practice in classification tasks is to divide the data set into two separate subsets, known as the training set and the validation set. Therefore, we randomly selected 85 per cent of the data to form the training set, which is used to train the models, and reserved the remaining data as the validation set, which is used for performance evaluation. We divided the training procedure into two phases. The first is the feature-extraction phase, in which a multilayer neural network is trained with the PIL algorithm layer by layer. To be specific, each layer of the network is associated with a PILAE, which is trained to extract the features from its input. After that, the trained network is applied to extract features from the spectral training data layer by layer. The next phase is classification. In this phase, we employ a softmax regression model as the classifier. Softmax regression models generalize logistic regression to multiclass problems. In the softmax regression setting, we estimate the probability of the class label taking on each of the K possible values. The prediction of the class label is made by selecting the class label with the highest probability. For a given sample $x_i$, our hypothesis outputs the estimated probability as

p(y_i = k | x_i; θ) = exp(θ_k^T x_i) / Σ_{j=1}^K exp(θ_j^T x_i), (13)

where y denotes the class label and k ∈ {1, 2, ..., K}; θ is the parameter set of the model. Given a training set, θ is trained to minimize the following cost function:

J(θ) = −(1/N) Σ_{i=1}^N Σ_{j=1}^K 1{y_i = j} log p(y_i = j | x_i; θ), (14)

where 1{·} takes on a value of 1 if its argument is true, and 0 otherwise. This optimization problem can be solved with an iterative optimization algorithm such as gradient descent. In the training of our softmax classifier, the inputs are the features extracted from the original training spectra in the previous phase. In order to facilitate the training, the spectral types are coded in numerical form to supervise the training of the classifier, namely 1 for F-type stars, 2 for G-type stars and 3 for K-type stars.
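To make the classification phase concrete, the following is a minimal sketch of a softmax regression classifier implementing equations (13) and (14) with batch gradient descent. The Gaussian-blob "features", the learning rate and the iteration count are illustrative stand-ins for the PILAE-extracted features and the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the extracted spectral features: three Gaussian blobs,
# one per class (the coding 1=F, 2=G, 3=K in the text maps to 0, 1, 2 here).
N, d, K = 300, 4, 3
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(N // K, d)) for c in range(K)])
y = np.repeat(np.arange(K), N // K)              # integer class labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)         # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

Xb = np.hstack([X, np.ones((N, 1))])             # absorb the bias as an extra input
T = np.eye(K)[y]                                 # one-hot form of the indicator 1{y_i = j}

theta = np.zeros((d + 1, K))
for _ in range(200):                             # plain batch gradient descent on J(theta)
    P = softmax(Xb @ theta)                      # p(y_i = k | x_i; theta), equation (13)
    theta -= 0.5 * (Xb.T @ (P - T) / N)          # gradient of the cost in equation (14)

pred = softmax(Xb @ theta).argmax(axis=1)        # predict the most probable class
accuracy = (pred == y).mean()
```

The prediction rule is exactly the one described in the text: evaluate the K class probabilities and select the label with the highest one.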

4.2 Defective spectra recovery

In LAMOST, there are 16 spectrographs, with two CCD cameras mounted on each spectrograph to record the blue and red images of the multifibre spectra. On the original CCD images, 4000 pixels are used for recording the blue or red wavelength regions, ranging from 370 to 590 nm or from 570 to 900 nm, respectively. One of the basic functions of the LAMOST 2D pipeline is wavelength-band connection, in which the blue and red channels are connected to each other. However, this procedure produces some spectra with wavelength-connection defects. Fig. 5 shows an example of this type of defective spectrum. This defect may cause unpredictable errors in the subsequent 1D pipeline; for example, the automated spectral classification algorithms in the 1D pipeline may misclassify the defective spectrum. In fact, a large proportion of the spectra classified as 'unknown' are classified as such because of the wavelength-connection error (Wang et al. 2015). Although human experts can achieve accurate classification for these defective spectra, the volumes of data involved make manual inspection infeasible. Consequently, it is necessary to develop an automated method to extract features that could


Figure 5. An example of a spectrum with a wavelength-connection defect. A connection defect can be observed at 5700 Å, which is the joint point of the blue and red bands. The wavelengths are given in angstroms.

Table 1. Comparison between the PIL-based algorithm and others in stellar spectral classification. The elapsed training times are given in seconds.

Algorithm   Training time            F1-score
                           F-type   G-type   K-type
PCA            1.9056      0.7727   0.6707   0.8115
LLE         2539.9331      0.8082   0.7361   0.8214
RBM         1481.0040      0.7927   0.7225   0.8019
PILDNN*     1103.4251      0.8106   0.7415   0.8228
PILDNN       226.9495      0.8468   0.7747   0.8427

be used to reconstruct these spectra with a wavelength-connection defect.

4.3 Experimental settings

In the stellar spectral classification experiment, we compare our proposed method with PCA and LLE, both of which are widely used as unsupervised feature-extraction techniques in astronomical spectral processing. In addition, we compare our method with the new approach using RBM. It should be emphasized that the aim of this comparison is not to seek high classification accuracy, but to compare the feature-learning capacities of the different methods under the same conditions. We trained our neural networks with and without local connectivity (denoted as PILDNN and PILDNN* respectively). In PILDNN, any spectrum input into the network is split into four arbitrary segments. In order to ensure fairness, all methods employ the softmax regression model with the same parameter settings as the classifier. Furthermore, the networks have the same architecture, represented as 721-400-800-1200-2000, which means that a network has 721 input neurons, corresponding to the dimensionality of the input spectrum, and four hidden layers with 400, 800, 1200 and 2000 neurons, respectively.

In order to evaluate the classification performance quantitatively, we use the F1-score and the elapsed time to assess the classifiers. The F1-score is a commonly used metric for evaluating the performance of algorithms or models in classification or information retrieval tasks; it is the harmonic mean of the precision and the recall. It should be noted that the F1-score is generally used in binary classification tasks, and hence we calculate the F1-score for each class separately. Specifically, we can regard one type of spectrum as the positive class, while the other types of spectra are regarded as negative samples, for example F-type versus G- and K-type spectra, and then calculate the F1-score for the positive class as

F1 = 2PR / (P + R), (15)

where P is the precision, which is defined as

P = TP / (TP + FP), (16)

and R is the recall, which is calculated as

R = TP / (TP + FN). (17)

TP denotes the true positives: the number of samples that are classified into the positive class correctly. FP denotes the false positives: the number of samples that are classified into the positive class incorrectly. FN denotes the false negatives: the number of samples that are classified into the negative class incorrectly.
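The one-vs-rest F1 computation of equations (15)-(17) can be sketched directly; the tiny label arrays below are invented purely for illustration, using the numerical coding 1=F, 2=G, 3=K from the classification experiment.

```python
import numpy as np

def per_class_f1(y_true, y_pred, label):
    """One-vs-rest F1 for a single class, following equations (15)-(17)."""
    tp = np.sum((y_pred == label) & (y_true == label))   # true positives
    fp = np.sum((y_pred == label) & (y_true != label))   # false positives
    fn = np.sum((y_pred != label) & (y_true == label))   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0       # equation (16)
    recall = tp / (tp + fn) if tp + fn else 0.0          # equation (17)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # equation (15)

# Invented toy labels: 1=F, 2=G, 3=K.
y_true = np.array([1, 1, 2, 2, 3, 3, 1, 2])
y_pred = np.array([1, 2, 2, 2, 3, 1, 1, 2])
f1_F = per_class_f1(y_true, y_pred, 1)   # F-type treated as the positive class
```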

In order to illustrate the performance of our proposed method in defective spectra recovery, we conduct an experiment according to the following steps. (1) For each spectrum in the spectral data set, we generate a random number between 0 and 1 as the offset. (2) The offset is added to or subtracted from the fluxes in the red (or blue) band with a probability of 0.5 for each spectrum. By doing this, we obtain a new pseudo-defective spectral data set. (3) We use this pseudo-defective data set as the input to the network, and the original data set as the expected output. In other words, we aim to train a network that can reconstruct the original spectra from the defective ones. The spectral data set in this experiment is the same as the one in the aforementioned classification task, and is also divided into two subsets: a training set with 85 per cent of the samples and a test set with 15 per cent of the samples. We employ a basic PILAE in this experiment. The architecture of the PILAE is 721-360-721, which means a network with 721 input neurons, 360 hidden neurons and 721 output neurons, corresponding to the dimensionality of the input spectrum. We conducted both experiments on the same server with 12 Xeon X5690 3.47-GHz processors.
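Steps (1)-(2) of the pseudo-defective data generation can be sketched as follows. The join index of 360 for a 721-pixel spectrum and the toy flux vectors are placeholders, since the exact joint pixel is not stated here.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_pseudo_defective(spectra, join_index):
    """Simulate a wavelength-connection defect following steps (1)-(2).

    `join_index` marks where the blue and red bands meet; 360 is only an
    illustrative choice for a 721-pixel spectrum.
    """
    defective = spectra.copy()
    for s in defective:                              # rows are views: edits are in place
        offset = rng.uniform(0.0, 1.0)               # step (1): random offset in [0, 1)
        sign = 1.0 if rng.random() < 0.5 else -1.0   # step (2): add or subtract, p = 0.5
        if rng.random() < 0.5:
            s[join_index:] += sign * offset          # shift the red band ...
        else:
            s[:join_index] += sign * offset          # ... or the blue band
    return defective

clean = rng.normal(1.0, 0.05, size=(10, 721))        # toy flux vectors, not real spectra
defective = make_pseudo_defective(clean, join_index=360)
```

The (defective, clean) pairs would then serve as network inputs and expected outputs, as in step (3).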

4.4 Results

Table 1 gives a summary of the performance evaluation for each method in the stellar spectral classification task. The confusion matrices for the classification experiment can be found in Fig. 6. Compared with conventional methods, our method has superior

Figure 6. Confusion matrices of the stellar spectral classification according to the different methods.


Figure 7. The 2D visualization. The right panel shows 10 000 randomly selected samples produced by taking the first two principal components of all training samples. The left panel shows the 2D features of the same samples found by a network with only two output neurons.

performance. The reason for this lies in the fact that deep models are able to represent more complicated functions, which shallower architectures fail to represent. Fig. 7 illustrates a way of visualizing the 2D principal components produced by PCA and by our network with only two output neurons. It is clear that our model produces a better separable visualization of the spectral data, which indicates its superior feature-extraction capacity. Although deep networks have a stronger feature-learning ability, if we train a deep network with a conventional algorithm, for example the BP algorithm, the training needs more than 20 h, whereas our algorithm needs only a few minutes. There are two reasons for this superior efficiency. First, for conventional deep-learning models, for example deep belief networks (DBNs; Hinton & Salakhutdinov 2006) and stacked auto-encoders (SAEs; Vincent et al. 2010), a time-consuming global fine-tuning with stochastic gradient descent is required after the layer-wise pre-training. For the sake of efficiency, we do not adopt fine-tuning in our algorithm, and simply employ the softmax regression model as the classifier, which takes the output of the last hidden layer as its input. Second, our pseudo-inverse-based learning algorithm pre-trains the deep model analytically, while conventional algorithms require time-consuming iterative optimization. From the comparison between PILDNN and PILDNN*, it can be seen that the local connectivity scheme also improves the learning efficiency.

There is another type of classification method that, unlike our method, extracts spectral features based on empirical knowledge, such as index-based stellar classification. For example, Liu et al. (2015) used 27 pre-defined line indices as features to classify LAMOST stellar spectra. From the results reported in their work, we calculated the F1-scores for F-, G- and K-type stellar spectra, obtaining 0.8264, 0.8558 and 0.6464, respectively. Note that these authors only used spectra with a signal-to-noise ratio larger than 20; in contrast, we impose no special requirement on the signal-to-noise ratio. This demonstrates that features extracted by our automated method can achieve performance comparable to that of expert-selected features in the classification task.

The results of the recovery of randomly selected defective spectra are shown in Fig. 8. Our proposed method can detect and repair the connection defect while retaining as much information as possible on the continuum and spectral lines in both the red and blue bands. It is worth noting that our recovery method does not need any manual intervention throughout the entire procedure. In addition, we do not need to know anything about the details of the connection defect, for example the position of the join point or the offset between the blue and red bands. Note that we find a few peculiar cases, as shown in the top right panel of Fig. 8. In this type of case, the spectra not only are defective in the wavelength-band connection but also have missing flux pixels near the join point of the red and blue bands. This kind of defect is very different from the common cases and confuses our algorithm. Our method ultimately reconstructs the missing flux pixels as a broad absorption line.

4.5 Parameter selection

Because our algorithm does not need learning control parameters to be set, the only parameter that needs to be specified is the number of neurons. In this subsection, we discuss parameter selection by analysing how the number of hidden neurons affects the performance of the network. For the sake of simplicity, we only analyse the size of the last hidden layer, that is, the dimensionality of the final obtained features. Fig. 9 illustrates the correlation between the accuracy and the number of neurons. It can be seen that the accuracy increases rapidly with the number of neurons when there are fewer than 500 neurons in the layer. After that, the accuracy increases only slightly with the number of neurons and remains nearly constant once the number falls within the range of 1500 to 2000. In fact, the number of neurons affects not only the accuracy but also the training time. As shown in the middle panel of Fig. 9, the elapsed

Figure 8. Defective spectra with their corresponding recoveries. The spectra marked in red are the synthetic defective ones. The black spectra are the ones repaired with our method. The wavelengths are given in angstroms.


Figure 9. From left to right: classification accuracy with increasing size of the last hidden layer; computational time with increasing size of the last hidden layer; recognition accuracy for the different ensemble models.

training time increases with the number of hidden neurons. Following a comprehensive investigation, we set the number of hidden neurons to 2000 in this spectral classification task. The sizes of the other hidden layers can be analysed and specified in a similar way.

4.6 Model averaging

Because our method has a superior learning speed, we can also train a set of networks and average the predictions of the networks to obtain better performance in the classification task. These networks vary slightly in the number of layers, the activation functions and the number of neurons in the individual hidden layers, and consequently make slightly different predictions. We trained a series of different networks on different data sets. Each data set consisted of the F-, G- and K-type spectra selected from every 100 000 stellar spectra of LAMOST DR1. We then averaged the predictions of the different networks to obtain the final results on the same validation set. The prediction accuracy of the ensemble models is shown in the right panel of Fig. 9. When the number of training samples increases to 608 483 and the number of networks in the ensemble model reaches 12, the F1-scores are 0.8549, 0.7891 and 0.8499 for F-, G- and K-type spectra, respectively. The ensemble model thus improves the performance further.
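The averaging step itself can be sketched in a few lines; the per-network class probabilities below are hypothetical numbers, standing in for the softmax outputs of the individually trained networks.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average the class-probability outputs of several networks and select
    the most probable class, as in the model-averaging scheme above."""
    mean_prob = np.mean(prob_list, axis=0)   # average over ensemble members
    return mean_prob.argmax(axis=1)          # pick the class with highest mean probability

# Three hypothetical networks scoring two samples over K = 3 classes.
probs = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    np.array([[0.5, 0.4, 0.1], [0.1, 0.6, 0.3]]),
    np.array([[0.4, 0.2, 0.4], [0.3, 0.4, 0.3]]),
]
labels = ensemble_predict(probs)
```

Averaging probabilities rather than hard labels lets confident networks outvote uncertain ones.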

4.7 Visualization and analysis

The proposed DNNs with the pseudo-inverse-based learning algorithm have demonstrated an impressive performance on the astronomical spectral data. However, this deep model remains a 'black box' because the activations in the intermediate layers are very difficult to understand, and there is no clear understanding of how the model operates or what exactly it learns from the spectra. As shown in Fig. 10, a simple visualization may not be conducive to understanding the learned features. It is therefore necessary to visualize and understand the obtained features in order to help us understand the model and the learning procedure. A visualization and understanding of deep networks is difficult, however, because deep models always have millions, or even more, parameters and work as highly complex non-linear functions. Without a clear understanding of the features in the hidden layers and of what exactly the networks learn from the input data, we can only develop or improve a model by trial and error.

In this subsection, we aim to address two problems. The first one is to find good qualitative interpretations of what a neuron learns

Figure 10. (Top) a random sample of spectra; (bottom) the features extracted by the deep neural network. The wavelengths are given in angstroms.

from the spectra. The second one is to explore ways of visualizing the salient wavelength range to which the network is attentive.

For the first problem, we try to seek the input patterns that maximize the output of a given hidden neuron while avoiding trivial solutions. The reasoning behind this strategy is that the input patterns maximizing the activation of a given hidden neuron illustrate what the neuron is looking for. Inspired by the work of Erhan et al. (2009b), we can search for the maximum activation of a given neuron by solving an optimization problem. To be specific, we use Θ to denote the network parameter set, including the connection weights and biases. Let h_ij represent the ith neuron in the jth layer. The function h_ij(Θ, x) maps an input spectrum x to the activation of the given neuron. After training, Θ is fixed, and the optimization problem can be formulated as

x* = arg max_x h_ij(Θ, x), (18)

where x* is the input pattern maximizing the activation of the given neuron. It can be viewed as what this neuron has learned from the training data. Initially, we input the spectrum into the trained network, and it is propagated through the first j layers. Then we select the top N most active neurons in the jth layer and map these activities back to the input space by solving the aforementioned optimization problem. Because the most frequently used activation functions, for example the hyperbolic tangent and sigmoidal functions, are continuous and have continuous first-order derivatives, the optimization problem can be solved with a gradient-based method. For


Figure 11. Visualization of learned patterns in the trained network given an F-type spectrum as input. The top row shows the four input spectrum segments. Rows 2 to 6 show the patterns that maximize the activation of the top five active neurons for each segment separately. The wavelengths are given in angstroms.

Figure 12. Visualization of learned patterns in the trained network given a G-type spectrum as input. The top row shows the four input spectrum segments. Rows 2 to 6 show the patterns that maximize the activation of the top five active neurons for each segment separately. The wavelengths are given in angstroms.

a given input spectrum divided into four segments, we select the top five most active neurons in an arbitrary hidden layer corresponding to each segment and map each activation back to the input space. Figs 11, 12 and 13 show what the top five most active neurons learn from the spectral data once the training is finished. It can be observed that the learned patterns are similar to the original input spectrum segments, which shows that the model is able to extract the features of both the continuum and the spectral lines.
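A minimal sketch of this activation-maximization idea for equation (18) is given below, with a toy randomly weighted tanh layer standing in for the trained network and the input renormalized at each step to rule out the trivial unbounded solution. The layer sizes, step length and iteration count are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy "trained" layer standing in for Theta: 16 input pixels, 8 hidden neurons.
# Weights are scaled down so tanh stays away from saturation and the gradient
# remains informative throughout the ascent.
W = 0.1 * rng.normal(size=(16, 8))
b = np.zeros(8)

def activation(x, i):
    """h_ij(Theta, x): activation of hidden neuron i for input x."""
    return np.tanh(x @ W + b)[i]

def grad_activation(x, i):
    """Analytic gradient of the activation with respect to the input x."""
    a = x @ W[:, i] + b[i]
    return (1.0 - np.tanh(a) ** 2) * W[:, i]

# Gradient ascent for x* = argmax_x h_ij(Theta, x), equation (18).
x = rng.normal(size=16)
x /= np.linalg.norm(x)
for _ in range(100):
    x += 0.1 * grad_activation(x, 0)   # step uphill on the activation of neuron 0
    x /= np.linalg.norm(x)             # renormalize to avoid the unbounded solution

# On the unit sphere the maximizer aligns with the neuron's weight vector.
alignment = x @ W[:, 0] / np.linalg.norm(W[:, 0])
final_act = activation(x, 0)
```

For this single tanh layer, the recovered pattern x* is simply the neuron's (normalized) weight vector; in a deep network the same ascent produces the non-trivial input patterns shown in Figs 11-13.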

For the second problem, we were inspired by the work of Zeiler & Fergus (2014) and evaluate the 'importance' of different wavebands in the input spectrum by visualizing the output of the network. In other words, a 'contribution value' is assigned to each portion of the input spectrum. To do this, we occluded different portions (wavebands) of the input spectra using a sliding window with the step length set to 1 Å. We then visualized the changes in

the output neuron. If the occlusion of a certain wavelength range causes a significant change in the final output, we can conclude that the neural network is more attentive to the particular patterns lying in the occluded wavelength range. In contrast, if the occlusion of a certain wavelength range has no effect on the final output, this shows that this waveband is inconsequential. Because a portion may be occluded several times by the sliding window, we calculate the overall 'contribution value' by averaging the variations of the output. The visualization is presented in Fig. 14, in which the colour reflects the importance. A light-coloured region means that these portions of the input spectrum are important evidence for the final predicted class. In contrast, a dark-coloured region means that these portions are unrelated to the prediction. For example, we can see that almost the entire spectra are depicted in a dark colour in the first column of Fig. 14. This implies that the trained network does not 'think'


Figure 13. Visualization of learned patterns in the trained network given a K-type spectrum as input. The top row shows the four input spectrum segments. Rows 2 to 6 show the patterns that maximize the activation of the top five active neurons for each segment separately. The wavelengths are given in angstroms.

that any portion provides important evidence for the final prediction. In fact, we set the window size to only 1 in this case, and any single pixel provides little information for the final prediction. As the size of the sliding window increases, it is interesting to observe how the network assesses the saliency of different wavelength ranges. As shown in the fourth column of Fig. 14, we can observe that the trained network is more attentive to the portions near the peak of the blackbody radiation intensity. This peak reflects the differences in the distribution of the stellar effective temperature, which is the main criterion adopted to define the different spectral classes.
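The occlusion procedure described above can be sketched as follows; the toy one-output model, the window size and the zero fill value are illustrative choices standing in for the trained classifier and the paper's settings.

```python
import numpy as np

def occlusion_saliency(model, spectrum, window, step=1, fill=0.0):
    """Per-pixel saliency from sliding-window occlusion (Zeiler & Fergus style).

    Each window is replaced by `fill`, the change in the model output is
    recorded, and overlapping windows are averaged per pixel.
    """
    base = model(spectrum)
    change_sum = np.zeros_like(spectrum)
    counts = np.zeros_like(spectrum)
    for start in range(0, spectrum.size - window + 1, step):
        occluded = spectrum.copy()
        occluded[start:start + window] = fill          # occlude one waveband
        delta = abs(model(occluded) - base)            # output variation for this window
        change_sum[start:start + window] += delta
        counts[start:start + window] += 1
    return change_sum / np.maximum(counts, 1)          # average over overlapping windows

def toy_model(s):
    # A toy scalar "classifier output" that only depends on pixels 30-39,
    # so only that region should receive non-zero saliency.
    return s[30:40].sum()

spectrum = np.ones(100)
sal = occlusion_saliency(toy_model, spectrum, window=8)
```

With a real network, the scalar output would be the predicted-class probability, and the resulting per-pixel curve corresponds to the saliency values plotted in Figs 14 and 15.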

Once the aforementioned operations are complete, we can obtain a function that maps the wavelengths to the corresponding saliency values. The saliency value of a wavelength reflects the relative importance of this wavelength for the final prediction result, and can be quantified as the variation of the network output. We randomly selected three spectra from F-, G- and K-type stars and plot the saliency value as a function of wavelength in Fig. 15. In addition, the three saliency-value curves are averaged to form an overall curve. In order to evaluate and understand the feature-extraction capacity of our model, we compared it with the line-index-based method, which can be understood as a method that manually selects features based on domain knowledge. Liu et al. (2015) selected five line indices, namely Hγ, Mg, Fe, the G band and TiO2, for index-based stellar classification. Hγ was selected as the representative Balmer line because all Balmer lines separate the classes well and Hγ has the largest amplitude of variation. Mg1, Mg2 and Mgb were averaged to represent the composite line index of Mg, and nine iron lines were averaged to represent that of Fe. The G band (CH) and TiO2 were selected to represent the molecular bands. Among these line indices, Hγ and Mg are suitable for distinguishing F-, G- and K-type spectra, and they also lie in the wavelength range with relatively high average saliency values, as shown in Fig. 15. The differences in the line indices among F-, G- and K-type spectra are small at Fe 5709, Fe 5782 and TiO2; these three indices are located at wavelengths with lower saliency values. Through the above analysis, we find that, although our method requires no expert knowledge about the spectral classification task, it can obtain knowledge directly from the data, and what it learns is consistent with the expert knowledge to some extent.

5 CONCLUSIONS

In this paper, we have presented a new efficient and automated feature-extraction method for large-scale astronomical spectral data sets and applied the proposed method to stellar spectral classification and defective spectra recovery. Compared with other classical feature-extraction methods widely used in astronomy, our algorithm has a better feature-learning performance. Unlike expert-designed feature-extraction methods, the proposed method does not make any prior assumptions about the underlying structure of the spectral data, does not assume any a priori knowledge and does not require any manual intervention. Hence it is not task-specific but is rather a general-purpose technique for various spectral processing tasks. It is worth noting that, although we only used F-, G- and K-type stellar spectra in the classification task, other types of spectra can also be processed by our method. There are also some peculiar cases that need further research. For example, the neural network produces fake features in some cases (see Figs 11, 12 and 13). In the defective spectra recovery task, we also found a few complicated cases that confuse our algorithm. In future work, we plan to apply our method to other astronomical spectral analysis tasks in addition to the ones discussed in this paper.

ACKNOWLEDGEMENTS

This work was fully supported by grants from the National Natural Science Foundation of China (61375045), the Beijing Natural Science Foundation (4142030) and the Joint Research Fund in Astronomy (U1531242) under a cooperative agreement between the National Natural Science Foundation of China (NSFC) and the Chinese Academy of Sciences (CAS).


Figure 14. Visualization of the importance of input patterns for the predicted results in three randomly selected spectra with different types. We systematically cover up different portions of the input spectra with a sliding window to see how the final classifier output changes. Rows 1 to 3 show the three sampled spectra separately. The size of the sliding window is specified as 1, 64, 128 and 256, respectively, in columns 1 to 4. The wavelengths are given in angstroms. (Note that the figure is clearer in the colour version online.)


Figure 15. The saliency curves of randomly selected spectra and their average. Vertical lines with different colours and symbols correspond to the selected prominent spectral lines. The wavelengths are given in angstroms. (Note that the figure is clearer in the colour version online.)

REFERENCES

Bailer-Jones C. A., Irwin M., Von Hippel T., 1998, MNRAS, 298, 361
Ball N. M., Brunner R. J., 2010, Int. J. Mod. Phys. D, 19, 1049
Bengio Y., 2009, Foundations and Trends in Machine Learning, 2, 1
Bengio Y., Courville A., Vincent P., 2013, IEEE Trans. Pattern Anal. Mach. Intell., 35, 1798
Bengio Y., Lamblin P., Popovici D., Larochelle H., 2006, in Adv. Neural Inf. Process. Syst., Greedy Layer-Wise Training of Deep Networks, p. 153
Boullion T. L., Odell P. L., 1971, Generalized Inverse Matrices. John Wiley & Sons, New York
Bu Y., Zhao G., Luo A.-L., Pan J., Chen Y., 2015, A&A, 576, A96
Cui X.-Q. et al., 2012, RA&A, 12, 1197
Dahl G. E., Yu D., Deng L., Acero A., 2012, IEEE Trans. Audio Speech Language Process., 20, 30
Daniel S. F., Connolly A., Schneider J., VanderPlas J., Xiong L., 2011, AJ, 142, 203
Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441
Erhan D., Manzagol P.-A., Bengio Y., Bengio S., Vincent P., 2009a, in AISTATS, The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training, p. 153
Erhan D., Bengio Y., Courville A., Vincent P., 2009b, Visualizing Higher-Layer Features of a Deep Network. Dept. IRO, Univ. Montreal
Graff P., Feroz F., Hobson M. P., Lasenby A., 2014, MNRAS, 441, 1741
Guo P., Lyu M. R., 2004, Neurocomputing, 56, 101
Hager W. W., 1989, SIAM Rev., 31, 221
Hinton G. E., Salakhutdinov R. R., 2006, Science, 313, 504
Hinton G. et al., 2012, IEEE Signal Process. Mag., 29, 82
Hoerl A. E., Kennard R. W., 1970, Technometrics, 12, 55
Hoyle B., 2016, Astron. Comput., 16, 34
Huertas-Company M. et al., 2015, ApJS, 221, 8
Krizhevsky A., Hinton G. E., 2011, in ESANN, Using Very Deep Autoencoders for Content-Based Image Retrieval
Krizhevsky A., Sutskever I., Hinton G. E., 2012, in Adv. Neural Inf. Process. Syst., ImageNet Classification with Deep Convolutional Neural Networks, p. 1097
Larochelle H., Erhan D., Courville A., Bergstra J., Bengio Y., 2007, in ICML, An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation, p. 473
Larochelle H., Bengio Y., Louradour J., Lamblin P., 2009, J. Mach. Learn. Res., 10, 1
Liu C. et al., 2015, RA&A, 15, 1137
Luo A.-L. et al., 2015, RA&A, 15, 1095
McGurk R. C., Kimball A. E., Ivezic Z., 2010, AJ, 139, 1261
Navarro S., Corradi R., Mampaso A., 2012, A&A, 538, 76
Perryman M. et al., 2001, A&A, 369, 339
Richards J. W., Freeman P. E., Lee A. B., Schafer C. M., 2009, ApJ, 691, 32
Salakhutdinov R., Hinton G., 2009, Int. J. Approx. Reasoning, 50, 969
Song Y.-H. et al., 2012, RA&A, 12, 453
VanderPlas J., Connolly A., 2009, AJ, 138, 1365
Vincent P., Larochelle H., Bengio Y., Manzagol P.-A., 2008, in ICML, Extracting and Composing Robust Features with Denoising Autoencoders, p. 1096
Vincent P., Larochelle H., Lajoie I., Bengio Y., Manzagol P.-A., 2010, J. Mach. Learn. Res., 11, 3371
Von Hippel T., Storrie-Lombardi L., Storrie-Lombardi M., Irwin M., 1994, MNRAS, 269, 97
Wang K., Guo P., Luo A.-L., 2015, in IEEE Big Data, Angular Quantization Based Affinity Propagation Clustering and Its Application to Astronomical Big Spectra Data, p. 601
Way M. J., Scargle J. D., Ali K. M., Srivastava A. N., 2012, Advances in Machine Learning and Data Mining for Astronomy. CRC Press, Boca Raton, FL
Weaver W. B., Torres-Dodgen A. V., 1997, ApJ, 487, 847
Yang T., Li X., 2015, MNRAS, 452, 158
Yip C.-W. et al., 2004, AJ, 128, 585
York D. G. et al., 2000, AJ, 120, 1579
Zeiler M. D., Fergus R., 2014, in ECCV. Springer, Cham, p. 818

APPENDIX A: INTRODUCTION TO THE PIL ALGORITHM

For simplicity, we define the propagation of an SLFN in matrix form:

H = σ(XW_0 + θ), X ∈ R^(N×d), W_0 ∈ R^(d×p), (A1)

where the outputs of the hidden layer are summarized into a matrix H. X is the input matrix consisting of N d-dimensional input vectors. W_0 = [w_0^1, w_0^2, ..., w_0^p] is the input connection weight matrix, in which w_0^i = [w_0^(i,1), w_0^(i,2), ..., w_0^(i,d)]^T is the connection weight between all input neurons and the ith hidden neuron. The output of the SLFN is

Y = HW_1, H ∈ R^(N×p), W_1 ∈ R^(p×m), (A2)

where W_1 = [w_1^1, w_1^2, ..., w_1^m] is the output connection weight matrix and the ith column of W_1, w_1^i = [w_1^(i,1), w_1^(i,2), ..., w_1^(i,p)]^T, is the weight connecting all hidden neurons and the ith output neuron. Therefore, the supervised learning is to solve the following optimization problem:

minimize_(W_1) ‖Y − O‖^2, (A3)

where O ∈ R^(N×m) is the expected label matrix, which consists of N m-dimensional label vectors. Guo & Lyu (2004) proposed a pseudo-inverse-based solution to the optimization problem defined in equation (A3):

W_1 = H^+ O, (A4)

where H^+ is the pseudo-inverse of the hidden-layer matrix H. From the point of view of linear algebra, the above-mentioned solution is the best approximation for Y = O (see Guo & Lyu 2004; Boullion & Odell 1971 for more details). The pseudo-code of the PIL algorithm is as follows:

(1) H_0 = X_0; /* Initialization */
(2) (H_0)^+ = Pseudo-inverse(H_0); /* Compute the pseudo-inverse */


(3) If ‖Hl(Hl)+ − I‖2 < e go to (6); Else go to (4); /* e is a givenerror */

(4) Wl = (Hl)+; Hl + 1 = σ (HlWl); /* Feed forward the result tothe next layer */

(5) (Hl + 1)+ = Pseudo-inverse(Hl + 1); l = l + 1; go to (3);(6) WL = (HL)+O; /* Calculate the output weight */(7) Y = σ (...σ (σ (H0W0)W1)...)WL; /* Stop training and calcu-

late the output */.
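The steps above can be sketched in Python with NumPy. This is a minimal illustration of the scheme, not the authors' implementation; the function names, the stopping tolerance e and the cap on the number of layers are illustrative choices:

```python
import numpy as np

def pil_train(X, O, sigma=np.tanh, e=1e-2, max_layers=5):
    """Pseudo-inverse learning (PIL): weights are set layer by layer
    from pseudo-inverses, with no gradient-based optimization."""
    H = X                                       # (1) H_0 = X_0
    weights = []
    for _ in range(max_layers):
        H_pinv = np.linalg.pinv(H)              # (2)/(5) pseudo-inverse of H_l
        # (3) stop adding layers once H_l (H_l)^+ is close to the identity
        if np.linalg.norm(H @ H_pinv - np.eye(H.shape[0])) ** 2 < e:
            break
        W = H_pinv                              # (4) W_l = (H_l)^+
        H = sigma(H @ W)                        # feed forward to the next layer
        weights.append(W)
    W_out = np.linalg.pinv(H) @ O               # (6) W_L = (H_L)^+ O
    return weights, W_out

def pil_predict(X, weights, W_out, sigma=np.tanh):
    """(7) Propagate through the trained hidden layers, then the output weights."""
    H = X
    for W in weights:
        H = sigma(H @ W)
    return H @ W_out
```

When the input matrix already has full row rank, step (3) is satisfied immediately and the network collapses to the single pseudo-inverse solution of equation (A4).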

APPENDIX B: INTRODUCTION TO THE PILAE

Here, we present a detailed description of the PILAE. We regard an auto-encoder as a particular type of SLFN that learns an approximation to an identity function. This means that we can use the PIL algorithm to train auto-encoders with the constraint O = X. Unlike the original PIL algorithm, auto-encoders should be able to map the input to a lower- or higher-dimensional feature space. Therefore, we randomly project the input data into a space with a different dimensionality, rather than arbitrarily setting the number of hidden neurons equal to the number of inputs. In addition, the input weight matrix is constrained by W_0 = W_1^T. This constraint is called tied weights in auto-encoders (Vincent et al. 2008). With this constraint, we can simply use W to represent both the input and output weight matrices without distinction. Analogous to the original PIL algorithm, we adopt the following pseudo-inverse approximate solution to the optimization problem defined in equation (A3):

W = H^+ X. (B1)

The orthogonal projection method (Boullion & Odell 1971) can be used to calculate the pseudo-inverse, and thus equation (B1) can be rewritten as

W = (H^T H)^{-1} H^T X. (B2)

In order to avoid over-fitting, we add a regularization term to obtain good generalization performance. According to the research of Hoerl and Kennard (Hoerl & Kennard 1970), for multiple linear regression, Y = Xβ + ε, where X ∈ R^{n×p} is of rank p and β ∈ R^{p×1} is unknown, if X^T X is not nearly a unit matrix, the least-squares estimates are sensitive to the 'errors' in X, and the estimate of β does not make sense in real applications. The following method is proposed to increase the generalization performance:

β = (X^T X + kI)^{-1} X^T Y, k > 0. (B3)

Therefore, we rewrite equation (B2) as

W = (H^T H + kI)^{-1} H^T X, (B4)

where k > 0 is a user-specified regularization parameter.
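A single PILAE layer can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: the function name, the hidden dimension p, the values of k and the seed, and the 1/sqrt(d) scaling of the random projection are all illustrative choices.

```python
import numpy as np

def pilae_layer(X, p, k=0.1, sigma=np.tanh, seed=0):
    """One PILAE layer: randomly project the input into a p-dimensional
    hidden space, then solve for the tied weight matrix W with the
    regularized pseudo-inverse solution of equation (B4)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W_rand = rng.standard_normal((d, p)) / np.sqrt(d)   # random input projection
    H = sigma(X @ W_rand)                               # hidden activations
    # W = (H^T H + kI)^{-1} H^T X  -- equation (B4)
    W = np.linalg.solve(H.T @ H + k * np.eye(p), H.T @ X)
    features = sigma(X @ W.T)        # tied weights: the encoder uses W^T
    return W, features
```

Stacking layers then amounts to feeding the `features` of one layer in as the `X` of the next.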

APPENDIX C: INTRODUCTION TO THE IPILAE

Here, we present a detailed description of the IPILAE. For the sake of convenience, we take a single-hidden-layer auto-encoder as an example rather than stacked auto-encoders, although the latter are more commonly used in real applications. In fact, the learning algorithm for the single-hidden-layer auto-encoder can easily be applied to stacked auto-encoders when the output of the previous layer is used as the input for the current layer.

In the following, we describe the incremental learning algorithm for the PILAE in detail. Let us use H_{t_0}, W_{t_0} and X_{t_0} to denote the activations of the hidden neurons, the output weight matrix and the existing training samples at an arbitrary time t_0. Whenever a new batch of training instances, ΔD = {x_i, o_i}_{i=1}^{ΔN}, arrives at time t_1 during the training procedure, we should update the weight matrix according to equation (A3). Thus, the optimization objective becomes

minimize_{W_{t_1}}: ‖H_{t_1} W_{t_1} − X_{t_1}‖². (C1)

A direct solution to the above optimization problem is given in equation (B4). On these grounds, we can calculate the weight matrix at time t_1 as

W_{t_1} = (H_{t_1}^T H_{t_1} + kI)^{-1} H_{t_1}^T X_{t_1}. (C2)

Here, we simply combine the past samples X_{t_0} with the newly arriving samples ΔX to construct a new, larger training set X_{t_1} = [X_{t_0}^T, ΔX^T]^T. This direct solution follows a conventional batch-learning fashion. For the sake of simplicity, we use S_i to denote H_i^T H_i + kI, and then equation (C2) can be simplified to

W_{t_1} = S_{t_1}^{-1} H_{t_1}^T X_{t_1}, (C3)

and W_{t_0} can be written as

W_{t_0} = S_{t_0}^{-1} H_{t_0}^T X_{t_0}. (C4)

We can also rewrite the solution in equation (C2) in another form:

W_{t_1} = ([H_{t_0}; ΔH]^T [H_{t_0}; ΔH] + kI)^{-1} [H_{t_0}; ΔH]^T [X_{t_0}; ΔX]
        = (H_{t_0}^T H_{t_0} + ΔH^T ΔH + kI)^{-1} (H_{t_0}^T X_{t_0} + ΔH^T ΔX), (C5)

in which [A; B] denotes the vertical stacking of A and B.

Here ΔH denotes the values of the hidden neurons corresponding to the newly arriving training samples ΔX. Then equation (C3) can be written as

W_{t_1} = S_{t_1}^{-1} (H_{t_0}^T X_{t_0} + ΔH^T ΔX) = (S_{t_0} + ΔH^T ΔH)^{-1} (H_{t_0}^T X_{t_0} + ΔH^T ΔX). (C6)

According to the Woodbury matrix identity (Hager 1989), we can calculate the inverse of the matrix S_{t_1} as

S_{t_1}^{-1} = (S_{t_0} + ΔH^T ΔH)^{-1} = S_{t_0}^{-1} − S_{t_0}^{-1} ΔH^T C_{t_0}^{-1} ΔH S_{t_0}^{-1}, (C7)

where C_{t_0} is the so-called capacitance matrix (Hager 1989) and can be calculated as

C_{t_0} = I + ΔH S_{t_0}^{-1} ΔH^T. (C8)
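Equation (C7), with the capacitance matrix of equation (C8), is the Woodbury identity specialized to this rank-ΔN update, and it can be checked numerically. In the sketch below the matrix sizes and the value of k are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
p, dn, k = 6, 4, 0.1
H0 = rng.standard_normal((20, p))          # existing hidden activations
dH = rng.standard_normal((dn, p))          # hidden values of the new batch
S0_inv = np.linalg.inv(H0.T @ H0 + k * np.eye(p))   # S_{t0}^{-1}

# direct inverse of S_{t1} = S_{t0} + dH^T dH
direct = np.linalg.inv(H0.T @ H0 + k * np.eye(p) + dH.T @ dH)

# Woodbury update of equation (C7), capacitance matrix of equation (C8)
C0 = np.eye(dn) + dH @ S0_inv @ dH.T
woodbury = S0_inv - S0_inv @ dH.T @ np.linalg.inv(C0) @ dH @ S0_inv

assert np.allclose(direct, woodbury)
```

The gain is that the Woodbury form only inverts the ΔN × ΔN capacitance matrix instead of the full p × p matrix S_{t_1}, which pays off when each new batch is small.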

Subsequently, by substituting equation (C7) into equation (C6), the calculation of W_{t_1} can be transformed into the equivalent form

W_{t_1} = (S_{t_0}^{-1} − S_{t_0}^{-1} ΔH^T C_{t_0}^{-1} ΔH S_{t_0}^{-1}) (H_{t_0}^T X_{t_0} + ΔH^T ΔX). (C9)

Through formula derivation, equation (C9) can be written as

W_{t_1} = α_{t_0} W_{t_0} + β_{t_0}, (C10)

where

α_{t_0} = I − S_{t_0}^{-1} ΔH^T C_{t_0}^{-1} ΔH,
β_{t_0} = S_{t_0}^{-1} (I − ΔH^T C_{t_0}^{-1} ΔH S_{t_0}^{-1}) ΔH^T ΔX. (C11)


Without loss of generality, we can calculate the output weights in a recursive way whenever new training instances are received, and thus equation (C10) can be generalized as

W_{t_n} = α_{t_{n−1}} W_{t_{n−1}} + β_{t_{n−1}}, (C12)

where

α_{t_{n−1}} = I − S_{t_{n−1}}^{-1} ΔH^T C_{t_{n−1}}^{-1} ΔH,
β_{t_{n−1}} = S_{t_{n−1}}^{-1} (I − ΔH^T C_{t_{n−1}}^{-1} ΔH S_{t_{n−1}}^{-1}) ΔH^T ΔX, (C13)

in which the output weight matrix at any time t_n can be calculated from the weights at the previous time point t_{n−1}. Analogously, S_{t_{n−1}}^{-1} and C_{t_{n−1}}^{-1} are also calculated in a recursive way by generalizing the specific cases in equations (C7) and (C8):

S_{t_n}^{-1} = S_{t_{n−1}}^{-1} − S_{t_{n−1}}^{-1} ΔH^T C_{t_{n−1}}^{-1} ΔH S_{t_{n−1}}^{-1}, n ≥ 1, (C14)

where

C_{t_{n−1}}^{-1} = (I + ΔH S_{t_{n−1}}^{-1} ΔH^T)^{-1}. (C15)

Supposing that the training data are continually collected in sequential order, after receiving a new batch of training instances, ΔX, at time t_n, we start a new round of weight updating. To do this, we calculate ΔH using the current weight matrix W_{t_{n−1}}, namely ΔH = ΔX W_{t_{n−1}}^T. Then, we update the output weight matrix with ΔH and S_{t_{n−1}}^{-1} according to equations (C12) and (C13). It should be noted that S_{t_{n−1}}^{-1} was calculated in the previous round, and hence we can access it directly. Finally, we calculate the inverse of the capacitance matrix C_{t_{n−1}} and S_{t_n}^{-1} according to equations (C14) and (C15) and store them for future use in the next round of weight updating.

At the beginning of the whole learning procedure, the first batch of training samples should be trained in a batch-learning way, which means that we can directly calculate the weight matrix according to equation (C4). We also need to store S_{t_0}^{-1} = (H_{t_0}^T H_{t_0} + kI)^{-1} for the next round.
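The whole incremental procedure can be sketched as follows. The sketch checks that one recursive update via equations (C12)-(C15) reproduces the batch solution of equation (C2) on the combined data; the matrix sizes, the value of k and the synthetic hidden activations are illustrative assumptions, not values from the paper.

```python
import numpy as np

def batch_weights(H, X, k):
    """Batch solution W = (H^T H + kI)^{-1} H^T X of equation (C2)."""
    p = H.shape[1]
    return np.linalg.solve(H.T @ H + k * np.eye(p), H.T @ X)

def incremental_update(W, S_inv, dH, dX):
    """One IPILAE round: equations (C12)-(C15)."""
    p = S_inv.shape[0]
    C_inv = np.linalg.inv(np.eye(dH.shape[0]) + dH @ S_inv @ dH.T)   # (C15)
    alpha = np.eye(p) - S_inv @ dH.T @ C_inv @ dH                    # (C13)
    beta = S_inv @ (np.eye(p) - dH.T @ C_inv @ dH @ S_inv) @ dH.T @ dX
    S_inv_new = S_inv - S_inv @ dH.T @ C_inv @ dH @ S_inv            # (C14)
    return alpha @ W + beta, S_inv_new                               # (C12)

rng = np.random.default_rng(2)
p, d, k = 5, 7, 0.1
H0, X0 = rng.standard_normal((30, p)), rng.standard_normal((30, d))
dH, dX = rng.standard_normal((10, p)), rng.standard_normal((10, d))

# first batch trained in batch mode, storing S_{t0}^{-1} for later rounds
S0_inv = np.linalg.inv(H0.T @ H0 + k * np.eye(p))
W0 = S0_inv @ H0.T @ X0

# one recursive update versus retraining on the combined data
W1, S1_inv = incremental_update(W0, S0_inv, dH, dX)
W1_batch = batch_weights(np.vstack([H0, dH]), np.vstack([X0, dX]), k)
assert np.allclose(W1, W1_batch)
```

Because the update touches only W, S^{-1} and the new batch, the past samples X_{t_0} never need to be stored, which is the point of the incremental scheme.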

This paper has been typeset from a TEX/LATEX file prepared by the author.
