
IISE Transactions
ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/uiie21

Distribution inference from early-stage stationary data streams by transfer learning

Kai Wang, Jian Li & Fugee Tsung

To cite this article: Kai Wang, Jian Li & Fugee Tsung (2021): Distribution inference from early-stage stationary data streams by transfer learning, IISE Transactions, DOI: 10.1080/24725854.2021.1875520

To link to this article: https://doi.org/10.1080/24725854.2021.1875520

Published online: 01 Mar 2021.


Distribution inference from early-stage stationary data streams by transfer learning

Kai Wang (a), Jian Li (a), and Fugee Tsung (b)

(a) School of Management and State Key Laboratory for Manufacturing Systems Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi, China; (b) Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Kowloon, Hong Kong

ABSTRACT
Data streams are prevalent in current manufacturing and service systems where real-time data arrive progressively. A quick distribution inference from such data streams at their early stages is extremely useful for prompt decision making in many industrial applications. For example, a quality monitoring scheme can be quickly started if the process data distribution is available, and the optimal inventory level can be determined early once the customer demand distribution is estimated. To this end, this article proposes a novel online recursive distribution inference method for stationary data streams that can respond as soon as the streaming data are generated and update as regularly as the data accumulate. A major challenge is that the data size might be too small to produce an accurate estimation at the early stage of data streams. To solve this, we resort to an instance-based transfer learning approach which integrates a sufficient amount of auxiliary data from similar processes or products to aid the distribution inference in our target task. Particularly, the auxiliary data are reweighted automatically by a density ratio fitting model with a prior-belief-guided regularization term to alleviate data scarcity. Our proposed distribution inference method also possesses an efficient online algorithm with recursive formulas to update upon every incoming data point. Extensive numerical simulations and real case studies verify the advantages of the proposed method.

ARTICLE HISTORY
Received 20 May 2020
Accepted 6 January 2021

KEYWORDS
Data sharing; density ratio estimation; instance transfer; prior information; small data

1. Introduction

Rapid advances in sensing and information technologies have facilitated the collection of massive amounts of data in an automatic and real-time fashion (Tien, 2013; Yang et al., 2019). Such data arrive progressively and continuously, e.g., in-situ signals from machine sensors, transaction records in banks, hit logs of e-commerce websites, etc., and have been referred to as data streams (Domingos and Hulten, 2003; Heinz and Seeger, 2008). Unlike batch data analysis, where sufficient data have to be prepared for a long period in advance and learning from the accumulated dataset is a once-and-for-all offline process, decision making from streaming data requires an efficient online learning approach (Lin and Wang, 2011, 2012; Hoi et al., 2018). The solutions designed for data streams should be able to (i) respond as soon as the data stream starts, and (ii) update as regularly as a new data point from the data stream is received.

This article aims to infer distributions from early-stage stationary data streams. It is well known that distributions can provide complete information of data processes and play a fundamental role in many advanced data mining and machine learning algorithms (Bishop, 2006). Particularly, we focus on the estimation of a stationary Cumulative Distribution Function (CDF) at the very early stage of a data stream. Due to current high-variety low-volume manufacturing and service environments (Huertas Quintero et al., 2010; Srinivasan and Viswanathan, 2010), many data streams are quite short. Even for long or open-ended data streams, the process dynamics would make the duration of data streams in one steady state rather limited. In such cases, developing an accurate and reliable distribution estimation instantaneously when data streams begin or enter a new state is imperative for timely decision making, and spans a wide range of applications in industrial engineering. For instance:

1. In quality control of short-run or low-volume manufacturing processes, the probability that a part's characteristic exceeds control limits has to be monitored as soon as possible (Zantek et al., 2006; Li et al., 2014).

2. For newsvendor problems, the demand distribution of a product in a new period or market should be inferred early to quickly determine the optimal inventory level (Huber et al., 2019; Oroojlooy et al., 2020).

However, when conducting distribution inference in the above early-stage context, a daunting challenge inevitably arises: the available data are still very scarce. In this article, we propose to tackle this practical dilemma from a transfer learning perspective,

CONTACT Jian Li [email protected]
Supplementary data for this article are available online at https://doi.org/10.1080/24725854.2021.1875520.

Copyright © 2021 IISE


which is inspired by current industrial practice that: (i) products or processes within an enterprise are usually differentiated for diverse demands, but they also share common features for cost saving; and (ii) their data can be cheaply recorded, safely stored and easily retrieved using today's big data infrastructure. Therefore, data sharing is an intuitively appealing idea for solving data scarcity, and it is exactly what transfer learning promotes. In this article we realize this novel idea by developing a theoretically sound and computationally fast distribution inference method which can effectively transfer auxiliary data into target data streams to enhance estimation performance.

1.1. Related works

The conventional distribution inference methods are the Empirical Distribution Function (EDF) and the Kernel Density Estimation (KDE), which are both distribution free or require no parametric forms. When applying them to data streams, the main concern of existing works is to keep the computation and storage cost at each time point constant, so that the model can be easily updated. For example, in the online KDE, only a fixed number of kernels are used by merging adjacent data points (Zhou et al., 2003; Heinz and Seeger, 2008) or resampling historical data (Zheng et al., 2013), and in the online EDF, a constant number of predetermined bins, each with a center point and an accumulative count statistic, are taken to calculate the rank of a new data point (Ross et al., 2011). Note that these works concentrate on developing approximate estimates when the data size goes to infinity, whereas our focus is on making a quick and accurate distribution inference at the early stage of data streams when the data size is still small.

Learning from small datasets is drawing increasing research attention (Kuo and Kusiak, 2019), and the relevant methods in the literature can be classified into the following three categories. One solution is to add artificial data. To name a few, Li and Lin (2006) proposed a KDE with varying bandwidths in different data intervals to generate virtual samples. A mega or generalized trend diffusion technique was used in Li et al. (2007) and Lin and Li (2010) to draw virtual data whose probabilities are calculated by the triangle membership function in fuzzy theory, but this membership function is deficient in closely mimicking complex distributions (e.g., multimodal or heavily-tailed ones). In addition, the virtual samples are not guaranteed to be generated from the true distribution and could induce potential bias (see Figure 13 in Li and Lin (2006)).

Another way to treat small data is to incorporate prior information via the Bayesian framework, and the estimation can be updated upon every new data point by posterior sampling (Bishop, 2006). To infer distributions of any potential forms, the Bayesian nonparametric density estimation has been widely studied (Müller and Quintana, 2005; Jara et al., 2011; Polansky, 2014; Li et al., 2017), where the underlying distribution is often modeled by an infinite mixture of normal distributions whose parameters follow a Dirichlet Process (DP) prior. The DP prior and its hyperparameters impact the posteriors greatly when the data are limited, and thus need a demanding tuning. The elicitation of an informative prior for a target dataset from an auxiliary dataset actually neglects their inherent difference, and could impair the target distribution inference.

The final class refers to transfer learning, which improves a learner's performance in a new target domain or task by transferring information from related source domains (e.g., similar products or processes) so as to eliminate the expensive efforts in data collection (Pan and Yang, 2010; Huang et al., 2012; Tseng et al., 2016; Tsung et al., 2018). One important transfer strategy is to adjust the weights of source instances or data considering the target task. This can be achieved by supervised learning to assign more weights on source instances that lead to lower target prediction errors (Dai et al., 2007; Garcke and Vanck, 2014; Tirinzoni et al., 2018), or by importance or density ratio estimation to reduce divergence between the target and reweighted source covariate data (Huang et al., 2006; Jiang and Zhai, 2007; Kanamori et al., 2009; Sugiyama et al., 2012; Garcke and Vanck, 2014; Xia et al., 2018). The above instance weighting transfer learning typically focuses on the offline analysis of batch data, and has not been investigated as in our context for an online distribution inference task where the target data arrive continuously.

1.2. Proposed solution and contributions

In this article, we propose a distribution inference method for stationary data streams at their early stages with small data size via an instance-based transfer learning approach. To be specific, the similar data from a source domain, after being properly reweighted, will join our target CDF estimation in a theoretically unbiased way. Then our goal becomes a weight assignment problem that is equivalent to a density ratio estimation task, and we adopt a least-squares fitting model as in Kanamori et al. (2009), Sugiyama et al. (2012) and Liu et al. (2013) to conquer this task. More importantly, to address the challenge that the initial target data size is fairly small, a prior belief that the target and source distributions are similar is explicitly utilized. Mathematically, such information a priori is formulated as a differentiable quadratic-form regularization term, which is further imposed on our model. The tuning parameters involved are determined creatively such that as the stream accumulates more data, the effect of the prior information vanishes. Our proposed transfer learning-based CDF inference method, embedded with a regularized density ratio estimation model, enjoys a closed-form solution, and has efficient recursive formulas for online updating upon every new data point. The extensions of our method to situations with multiple source domains and multivariate data streams are also provided.

To sum up, we put forward a novel online distribution inference method that can respond quickly, perform accurately and update efficiently for data streams. The main contributions of this article are highlighted as below:

1. An instance-based transfer learning approach which can effectively realize the data sharing idea is proposed for our online distribution inference task based on streaming data. It successfully addresses two specific characteristics of data streams with respect to the data scarcity when a data stream starts and the model updating when a data stream grows.

2. A regularized density ratio estimation model that explicitly utilizes the similarity between the source and target tasks is developed. The differentiable quadratic-form regularization term produces analytical solutions at each time point and enhances estimation performance in small data scenarios. The tuning parameter decays progressively to diminish the prior information when more target data are accumulated.

3. A computationally efficient algorithm is derived to enable a fast model updating for every new data point in data streams. It is equipped with closed-form recursive formulas for key model parameters to circumvent big matrix inversions and to maintain modest storage costs, which makes the execution time of our online algorithm significantly less than the batch learning.

The remainder of this article is organized as follows. Section 2 describes the technical details of our proposed CDF inference method. Numerical simulations are performed in Section 3 to show the advantages of our method. Section 4 applies the method to two industrial examples. Conclusions are finally given in Section 5. Some simulation results are offered in an online file containing supplementary material.

2. Methodology

This section first describes a transfer learning approach to combine source data for inferring the target distribution. The involved density ratio is then estimated with the aid of a tailored regularization term. Next, an online updating algorithm is derived and some guidelines on the tuning parameter selection are offered. Finally, several extensions of our method are discussed.

2.1. Distribution inference via instance transfer

Suppose our target data stream is characterized by a continuous univariate random variable $X$ with CDF $F(x)$ and Probability Density Function (PDF) $p(x)$. At the $n$th time point, we have observed a sequence of target data $\mathcal{D}^{(t)}_n = \{x_1, \dots, x_n\}$ which are independently and identically drawn from $p(x)$. As mentioned before, we can also access a similar product or process in a source domain and obtain a source dataset $\mathcal{D}^{(s)} = \{z_1, \dots, z_m\}$ with PDF $q(z)$ and a large size $m$. The choice of $\mathcal{D}^{(s)}$ is discussed later in Section 2.4. Note that for exposition convenience, here we only consider one source domain and a univariate data stream, but our developed method can scale up to multiple source domains and multivariate data streams as will be shown in Section 2.5.

A typical nonparametric method to estimate $F(x)$ is the EDF based on $\mathcal{D}^{(t)}_n$, i.e., $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x)$, where $I(\cdot)$ is the indicator function. At the early stage of the target stream, the size $n$ is small, which causes $\hat{F}_n(x)$ to be highly unreliable. To circumvent this pitfall, we now reformulate the EDF from an instance-based transfer learning perspective. That is, we borrow the data or instances from the source domain to learn the target task.

First, note that by using the importance sampling technique (Liu, 2008), we have
$$F(x) = P(X \le x) = \int_{-\infty}^{x} p(u)\,du = \int_{-\infty}^{+\infty} I(u \le x)\,p(u)\,du = \int_{-\infty}^{+\infty} I(u \le x)\,\frac{p(u)}{g(u)}\,g(u)\,du, \quad (1)$$
where $g(u)$ is an auxiliary PDF and a density ratio function is defined as $c(u) = p(u)/g(u)$. A naive choice of $g(u)$ is $q(u)$, and then by evaluating the expectation in Equation (1) with the empirical average, we obtain
$$F(x) = \int_{-\infty}^{+\infty} I(u \le x)\,\frac{p(u)}{q(u)}\,q(u)\,du \approx \frac{1}{m}\sum_{j=1}^{m} I(z_j \le x)\,\frac{p(z_j)}{q(z_j)},$$
where $F(x)$ is estimated from $\mathcal{D}^{(s)}$, a much larger dataset, and the density ratio function $c(u)$, which is now equal to $p(u)/q(u)$, is used to adjust the weight of each source data point. However, due to the nature of a ratio, here $c(u)$ could be highly unstable and even diverge to infinity. For example, let $p(u) = N(1, 1^2)$ and $q(u) = N(0, 1^2)$; then $c(u) = \exp(u - 1/2)$, which gets extremely large as $u \to +\infty$. This fact can lead to a poor estimation of $c(u)$ as demonstrated by our pilot numerical studies (see Section S.1 in our supplementary material). The downside of the high fluctuations in the density ratio function has also been pointed out in recent works (Yamada et al., 2011; Sugiyama et al., 2012; Liu et al., 2013; Anees et al., 2016; Xia et al., 2018).
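As a concrete illustration of this instability (a minimal sketch, not part of the original study; the evaluation grid and the use of SciPy are our own choices), the plain ratio $p(u)/q(u)$ from the example above can be contrasted with the bounded relative ratio that Equation (2) below will introduce:

```python
import numpy as np
from scipy.stats import norm

# p(u) = N(1, 1^2), q(u) = N(0, 1^2) as in the example above; eta = 0.5.
u = np.linspace(-2.0, 6.0, 9)
p, q = norm.pdf(u, loc=1.0), norm.pdf(u, loc=0.0)

plain_ratio = p / q                        # equals exp(u - 1/2), unbounded in u
relative_ratio = p / (0.5 * p + 0.5 * q)   # relative ratio, bounded above by 1/eta = 2

print(np.round(plain_ratio, 2))            # grows without bound as u increases
print(np.round(relative_ratio, 2))         # stays below 2
```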

In view of this, we let $g(u) = \eta\, p(u) + (1-\eta)\, q(u)$ be a mixture of $p(u)$ and $q(u)$, where the proportion parameter $\eta \in (0, 1)$, and then the density ratio function becomes
$$c(u) = \frac{p(u)}{g(u)} = \frac{p(u)}{\eta\, p(u) + (1-\eta)\, q(u)}, \quad (2)$$
which is more smooth and well bounded in $(0, 1/\eta)$, and is also termed the relative density ratio (Yamada et al., 2011; Liu et al., 2013). Plugging this $g(u)$ into Equation (1) yields
$$\begin{aligned} F(x) &= \int_{-\infty}^{+\infty} I(u \le x)\, c(u)\, \big(\eta\, p(u) + (1-\eta)\, q(u)\big)\, du \\ &= \eta \int_{-\infty}^{+\infty} I(u \le x)\, c(u)\, p(u)\, du + (1-\eta) \int_{-\infty}^{+\infty} I(u \le x)\, c(u)\, q(u)\, du \\ &\approx \eta\, \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x)\, c(x_i) + (1-\eta)\, \frac{1}{m}\sum_{j=1}^{m} I(z_j \le x)\, c(z_j). \end{aligned} \quad (3)$$

Here $F(x)$ is inferred from $\mathcal{D}^{(t)}_n \cup \mathcal{D}^{(s)}$ in an unbiased way as a result of the importance sampling and the approximation of expectations using averages. As $F(+\infty) = 1$, we further have the normalization constraint below, which also guarantees $\int_{-\infty}^{+\infty} p(u)\,du = \int_{-\infty}^{+\infty} c(u)\, g(u)\,du = 1$:
$$\eta\, \frac{1}{n}\sum_{i=1}^{n} c(x_i) + (1-\eta)\, \frac{1}{m}\sum_{j=1}^{m} c(z_j) = 1. \quad (4)$$
Now it is clear from Equation (3) that the key problem in our CDF inference becomes the estimation of the density ratio $c(u)$. In addition, we do not go through the estimation of the source density $q(u)$ even if it has abundant data points, since this intermediate step is not necessary to derive $c(u)$, as will be shown in the following.
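For readers who prefer code, the estimator in Equation (3) amounts to a weighted two-sample empirical CDF. The following sketch assumes the density-ratio values $c(x_i)$ and $c(z_j)$ have already been estimated (e.g., by the model of Section 2.2); the function name and interface are ours, not the authors'.

```python
import numpy as np

def transferred_cdf(x, target, source, c_target, c_source, eta=0.5):
    """Transfer-learning CDF estimate of Equation (3).

    target, source     : arrays holding the target points x_i and source points z_j
    c_target, c_source : estimated density-ratio values c(x_i) and c(z_j)
    """
    term_t = np.mean((np.asarray(target) <= x) * np.asarray(c_target))  # (1/n) sum I(x_i<=x) c(x_i)
    term_s = np.mean((np.asarray(source) <= x) * np.asarray(c_source))  # (1/m) sum I(z_j<=x) c(z_j)
    return eta * term_t + (1.0 - eta) * term_s
```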

2.2. Regularized density ratio estimation

Let $\hat{c}(u)$ denote an estimate of the true $c(u)$ in Equation (2). We want $\hat{c}(u)$ to minimize the following squared error that is widely used in density ratio estimation studies (Kanamori et al., 2009; Yamada et al., 2011; Sugiyama et al., 2012; Liu et al., 2013; Anees et al., 2016):
$$\begin{aligned} J &= \int_{-\infty}^{+\infty} \big(\hat{c}(u) - c(u)\big)^2 g(u)\,du = \int_{-\infty}^{+\infty} \hat{c}(u)^2 g(u)\,du - 2\int_{-\infty}^{+\infty} \hat{c}(u)\, c(u)\, g(u)\,du + \int_{-\infty}^{+\infty} c(u)^2 g(u)\,du \\ &= \eta \int_{-\infty}^{+\infty} \hat{c}(u)^2 p(u)\,du + (1-\eta) \int_{-\infty}^{+\infty} \hat{c}(u)^2 q(u)\,du - 2\int_{-\infty}^{+\infty} \hat{c}(u)\, p(u)\,du + \text{Constant}, \end{aligned} \quad (5)$$
where the last equation holds as $g(u) = \eta\, p(u) + (1-\eta)\, q(u)$ and $c(u) = p(u)/g(u)$. Approximating the expectations in Equation (5) by averages and discarding the constant term, our objective is
$$J_n = \eta\, \frac{1}{n}\sum_{i=1}^{n} \hat{c}(x_i)^2 + (1-\eta)\, \frac{1}{m}\sum_{j=1}^{m} \hat{c}(z_j)^2 - 2\, \frac{1}{n}\sum_{i=1}^{n} \hat{c}(x_i).$$

We now represent $\hat{c}(u)$ by a vector of non-negative basis functions $\phi(u) = (\phi_1(u), \dots, \phi_b(u))^T$:
$$\hat{c}(u) = a^T \phi(u) = \sum_{j=1}^{b} \phi_j(u)\, a_j, \quad (6)$$
where $\phi_j(u) \ge 0$, $j = 1, \dots, b$. Particularly, we use the source data $\mathcal{D}^{(s)}$ as center points and let $\phi_j(u) = \exp\big(-(u - z_j)^2 / (2\sigma^2)\big)$ with a bandwidth parameter $\sigma$. Here we do not further include the target data $\mathcal{D}^{(t)}_n$ as center points, as the source data are rich enough to create sufficient and widely populated centers to represent the estimated density ratio function well (see an example in Section S.2 in the supplementary material). The fixed number and locations of center points can also facilitate closed-form recursive formulas in online computing, as is shown later in Section 2.3. Substituting $\hat{c}(u)$ into $J_n$, we obtain
$$J_n = \eta\, a^T H_n a + (1-\eta)\, a^T V a - 2 h_n^T a, \quad (7)$$
where $H_n = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\phi(x_i)^T$, $V = \frac{1}{m}\sum_{j=1}^{m} \phi(z_j)\phi(z_j)^T$ and $h_n = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)$.

Our preceding derivation follows a least-squares density ratio fitting model as in Kanamori et al. (2009), Sugiyama et al. (2012) and Liu et al. (2013), but their models only perform well when both $n$ and $m$ are large. When $n$ is too small to offer sufficient information for a purely data-driven estimation in Equation (7), we supplement a prior belief in our transfer learning context that $p(u)$ and $q(u)$ are fairly similar, i.e., $c(u) \approx 1$. To impose such a prior belief on our density ratio estimation, a regularization term is formulated as:
$$\begin{aligned} R &= \int_{-\infty}^{+\infty} \big(\hat{c}(u) - 1\big)^2 g(u)\,du = \int_{-\infty}^{+\infty} \hat{c}(u)^2 g(u)\,du - 2\int_{-\infty}^{+\infty} \hat{c}(u)\, g(u)\,du + 1 \\ &= \eta \int_{-\infty}^{+\infty} \hat{c}(u)^2 p(u)\,du + (1-\eta) \int_{-\infty}^{+\infty} \hat{c}(u)^2 q(u)\,du - 2\eta \int_{-\infty}^{+\infty} \hat{c}(u)\, p(u)\,du - 2(1-\eta) \int_{-\infty}^{+\infty} \hat{c}(u)\, q(u)\,du + 1. \end{aligned}$$
Evaluating $R$ empirically using $\mathcal{D}^{(t)}_n$ and $\mathcal{D}^{(s)}$ and taking the form of $\hat{c}(u)$ in Equation (6), the regularization term (without a constant) is
$$R_n = \eta\, a^T H_n a + (1-\eta)\, a^T V a - 2\eta\, h_n^T a - 2(1-\eta)\, v^T a, \quad (8)$$

where $v = \frac{1}{m}\sum_{j=1}^{m} \phi(z_j)$. In addition, as abundant basis functions ($b = m$ in Equation (6)) are used to represent $\hat{c}(u)$, to avoid over-fitting, we also consider an $L_2$-norm regularization term, i.e.,
$$Q = a^T a. \quad (9)$$

By combining Equations (7)-(9), our regularized density ratio estimation model with two non-negative tuning parameters $\lambda_{1,n}$ and $\lambda_{2,n}$ is developed as
$$\min_{a}\; J_n + \lambda_{1,n} R_n + \lambda_{2,n} Q = \min_{a}\; a^T \Big(\eta H_n + (1-\eta)V + \frac{\lambda_{2,n}}{1+\lambda_{1,n}} I\Big) a - 2\Big(\frac{1}{1+\lambda_{1,n}}\, h_n + \frac{\lambda_{1,n}}{1+\lambda_{1,n}}\big(\eta h_n + (1-\eta)v\big)\Big)^T a,$$
where $I$ is an identity matrix, and the analytical solution to the above problem is
$$\tilde{a}_n = \Big(\eta H_n + (1-\eta)V + \frac{\lambda_{2,n}}{1+\lambda_{1,n}} I\Big)^{-1} \Big(\frac{1}{1+\lambda_{1,n}}\, h_n + \frac{\lambda_{1,n}}{1+\lambda_{1,n}}\big(\eta h_n + (1-\eta)v\big)\Big). \quad (10)$$
Note that when $\lambda_{1,n} = 0$, our model degenerates into the original density ratio estimation model with an $L_2$-norm regularization, and when $\lambda_{1,n} = +\infty$, our model drives the density ratio estimate $\hat{c}(u)$ to be constant one. The determination of $\lambda_{1,n}$ is discussed later in Section 2.4.

Finally, since $c(u) \ge 0$, we further truncate each negative element in $\tilde{a}_n$ as zero, and also slightly scale $\tilde{a}_n$ to satisfy the normalization constraint in Equation (4). Denote the final non-negative and normalized model parameter as $a_n$, and our estimate for the density ratio is
$$\hat{c}_n(u) = a_n^T \phi(u). \quad (11)$$
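A compact batch implementation of Equations (6)-(11) might look as follows. It is a sketch under simplifying assumptions: the tuning constants passed in are placeholders rather than the defaults of Section 2.4, and the bandwidth follows the empirical formula quoted there.

```python
import numpy as np

def fit_rdr(target, source, eta=0.5, lam1=1.0, ridge=0.2):
    """One batch solve of the regularized density ratio model (Equation (10)).

    ridge plays the role of lambda_{2,n}/(1 + lambda_{1,n}); eta, lam1 and ridge
    here are illustrative placeholders, not the paper's recommended defaults.
    """
    z = np.asarray(source)                      # kernel centres are the source points
    sigma = np.std(z) * len(z) ** (-0.2)        # empirical bandwidth, Section 2.4.2
    phi = lambda u: np.exp(-(np.subtract.outer(u, z) ** 2) / (2 * sigma ** 2))

    Phi_x, Phi_z = phi(np.asarray(target)), phi(z)   # n x b and m x b design matrices
    H = Phi_x.T @ Phi_x / len(target)                # H_n
    V = Phi_z.T @ Phi_z / len(source)                # V
    h = Phi_x.mean(axis=0)                           # h_n
    v = Phi_z.mean(axis=0)                           # v

    A = eta * H + (1 - eta) * V + ridge * np.eye(len(z))
    rhs = h / (1 + lam1) + lam1 / (1 + lam1) * (eta * h + (1 - eta) * v)
    a = np.linalg.solve(A, rhs)                      # Equation (10)

    a = np.maximum(a, 0.0)                           # truncate negative entries
    scale = eta * (Phi_x @ a).mean() + (1 - eta) * (Phi_z @ a).mean()
    return a / scale, sigma                          # rescale to meet Equation (4)
```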

2.3. Online computation

We have outlined the proposed CDF inference method at the $n$th time point. When a new target data point arrives, our results have to be updated. This section develops an efficient online algorithm with recursive formulas to perform the updating. In brief, we first update $\tilde{a}_n$ in Equation (10), and then $\hat{c}_n(u)$ in Equation (11) can be evaluated and taken to estimate $F(x)$ via Equation (3).

Suppose at the $n$th time point, we have obtained $H_n$ and $h_n$, and let
$$X_n = \eta H_n + (1-\eta)V + \frac{\lambda_{2,n}}{1+\lambda_{1,n}}\, I.$$

At the $(n+1)$th time point, after observing $x_{n+1}$, we construct $\phi(x_{n+1})$, and then
$$H_{n+1} = \frac{1}{n+1}\sum_{i=1}^{n+1} \phi(x_i)\phi(x_i)^T = \frac{n}{n+1}\, H_n + \frac{1}{n+1}\, \phi(x_{n+1})\phi(x_{n+1})^T, \quad (12)$$
$$h_{n+1} = \frac{1}{n+1}\sum_{i=1}^{n+1} \phi(x_i) = \frac{n}{n+1}\, h_n + \frac{1}{n+1}\, \phi(x_{n+1}), \quad (13)$$
$$X_{n+1} = \eta H_{n+1} + (1-\eta)V + \frac{\lambda_{2,n+1}}{1+\lambda_{1,n+1}}\, I. \quad (14)$$

In the equations above, $H_n$ and $h_n$ can be easily updated to $H_{n+1}$ and $h_{n+1}$. To obtain $\tilde{a}_{n+1}$ in Equation (10), however, we need $X_{n+1}^{-1}$, an inversion of a $b \times b$ large matrix where $b = m$. To circumvent this expensive computation, we also develop a recursive form for $X_{n+1}^{-1}$. Note that after substituting Equation (12) in Equation (14), we get
$$\begin{aligned} X_{n+1} &= \eta\Big(\frac{n}{n+1}\, H_n + \frac{1}{n+1}\,\phi(x_{n+1})\phi(x_{n+1})^T\Big) + (1-\eta)V + \frac{\lambda_{2,n+1}}{1+\lambda_{1,n+1}}\, I \\ &= \Big(\eta H_n + (1-\eta)V + \frac{\lambda_{2,n}}{1+\lambda_{1,n}}\, I\Big) + \eta\,\frac{1}{n+1}\,\phi(x_{n+1})\phi(x_{n+1})^T + \Big(\frac{\lambda_{2,n+1}}{1+\lambda_{1,n+1}} - \frac{\lambda_{2,n}}{1+\lambda_{1,n}}\Big) I - \eta\,\frac{1}{n+1}\, H_n \\ &= \underbrace{X_n + \eta\,\frac{1}{n+1}\,\phi(x_{n+1})\phi(x_{n+1})^T}_{G_{n+1}} + \underbrace{\Big(\frac{\lambda_{2,n+1}}{1+\lambda_{1,n+1}} - \frac{\lambda_{2,n}}{1+\lambda_{1,n}}\Big) I - \eta\,\frac{1}{n+1}\, H_n}_{F_{n+1}}. \end{aligned} \quad (15)$$

In light of the fact that $G_{n+1}^{-1}$ can be easily derived when $X_n^{-1}$ is available using the Sherman–Morrison identity (Petersen and Pedersen, 2008), we try to approximate $X_{n+1}^{-1}$ as
$$X_{n+1}^{-1} \approx G_{n+1}^{-1} - \delta_{n+1}\, G_{n+1}^{-1} F_{n+1} G_{n+1}^{-1}.$$

In recursion, suppose that at the $n$th time point, we have obtained $\bar{X}_n^{-1}$, a proxy of $X_n^{-1}$; then, based on the Sherman–Morrison identity (Petersen and Pedersen, 2008), we have
$$G_{n+1}^{-1} = \Big(X_n + \eta\,\frac{1}{n+1}\,\phi(x_{n+1})\phi(x_{n+1})^T\Big)^{-1} = \bar{X}_n^{-1} - \frac{\eta\, \bar{X}_n^{-1}\phi(x_{n+1})\phi(x_{n+1})^T \bar{X}_n^{-1}}{n+1+\eta\,\phi(x_{n+1})^T \bar{X}_n^{-1} \phi(x_{n+1})}. \quad (16)$$

Then $X_{n+1}^{-1}$ can be approximated as
$$\bar{X}_{n+1}^{-1} = G_{n+1}^{-1} - \delta_{n+1}\, G_{n+1}^{-1} F_{n+1} G_{n+1}^{-1}, \quad (17)$$

where $\delta_{n+1}$ is determined by minimizing the approximation error of $\bar{X}_{n+1}^{-1}$ with respect to $X_{n+1}^{-1}$:
$$\begin{aligned} \delta_{n+1} &= \arg\min_{\delta}\; \big\|\bar{X}_{n+1}^{-1} X_{n+1} - I\big\|_F^2 = \arg\min_{\delta}\; \big\|\big(G_{n+1}^{-1} - \delta\, G_{n+1}^{-1} F_{n+1} G_{n+1}^{-1}\big) X_{n+1} - I\big\|_F^2 \\ &= \frac{\mathrm{Tr}\Big(\big(G_{n+1}^{-1} F_{n+1} G_{n+1}^{-1} X_{n+1}\big)\big(X_{n+1} G_{n+1}^{-1} - I\big)\Big)}{\big\|G_{n+1}^{-1} F_{n+1} G_{n+1}^{-1} X_{n+1}\big\|_F^2}. \end{aligned} \quad (18)$$

By combining Equations (16)-(18), $\bar{X}_{n+1}^{-1}$ can be obtained from $\bar{X}_n^{-1}$ without any matrix inversion operation, and so can $\tilde{a}_{n+1}$, as below:
$$\tilde{a}_{n+1} = \bar{X}_{n+1}^{-1}\Big(\frac{1}{1+\lambda_{1,n+1}}\, h_{n+1} + \frac{\lambda_{1,n+1}}{1+\lambda_{1,n+1}}\big(\eta h_{n+1} + (1-\eta)v\big)\Big). \quad (19)$$

After truncating and normalizing $\tilde{a}_{n+1}$ to $a_{n+1}$, the CDF estimate at the $(n+1)$th time point is
$$\begin{aligned} \hat{F}_{n+1}(x) &= \eta\,\frac{1}{n+1}\sum_{i=1}^{n+1} I(x_i \le x)\,\hat{c}_{n+1}(x_i) + (1-\eta)\,\frac{1}{m}\sum_{j=1}^{m} I(z_j \le x)\,\hat{c}_{n+1}(z_j) \\ &= \Big(\eta\,\frac{1}{n+1}\sum_{i=1}^{n+1} I(x_i \le x)\,\phi(x_i)^T + (1-\eta)\,\frac{1}{m}\sum_{j=1}^{m} I(z_j \le x)\,\phi(z_j)^T\Big)\, a_{n+1}. \end{aligned} \quad (20)$$

When the storage space for the target data stream is limited, we can further follow the binning-and-counting trick in Ross et al. (2011) to make $\hat{F}_{n+1}(x)$ get rid of $x_1, \dots, x_n$. Specifically, the range of target data is split into $L$ fixed bins or intervals, each with a center $x^c_l$ and a count $n^c_l$. If $x_{n+1}$ falls into the $l$th bin, $n^c_l$ will increase by one. Then
$$\hat{F}_{n+1}(x) \approx \Big(\eta\,\frac{1}{n+1}\sum_{l=1}^{L} n^c_l\, I(x^c_l \le x)\,\phi(x^c_l)^T + (1-\eta)\,\frac{1}{m}\sum_{j=1}^{m} I(z_j \le x)\,\phi(z_j)^T\Big)\, a_{n+1}. \quad (21)$$
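A minimal sketch of this binning-and-counting bookkeeping (with hypothetical fixed bins; the interface is ours, not the authors') is:

```python
import numpy as np

def update_bins(bin_centers, bin_counts, x_new):
    """Attribute the new target point to its nearest fixed bin and discard the
    raw value afterwards, as in the trick of Ross et al. (2011)."""
    l = int(np.argmin(np.abs(bin_centers - x_new)))
    bin_counts[l] += 1
    return bin_counts

def target_term(x, bin_centers, bin_counts, Phi_bins, a, n_total, eta=0.5):
    """Target-data part of Equation (21) computed from bins only;
    Phi_bins is the L x b design matrix phi(x_l^c) evaluated once."""
    w = bin_counts * (bin_centers <= x)        # n_l^c * I(x_l^c <= x)
    return eta * (w @ Phi_bins @ a) / n_total
```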

The above computation details are summarized in Algorithm 1. We can see that the time complexity of our algorithm is $O(b^3)$, and the main cost resides in the matrix multiplication in Equation (18). Actually, $\delta_{n+1}$ does not need to be decided at every time point, as its value typically becomes stable after a few rounds of updating (see Section S.3 in the supplementary material). Then the dominant computation is only the two matrix multiplications in Equation (17) to obtain $\bar{X}_{n+1}^{-1}$. Our online algorithm for data streams has the following three favorable properties:

1. First, to obtain $a_{n+1}$, only a few key terms (solid boxes in Algorithm 1) need to be stored and updated at each time point. It is thus not necessary to store the past target data $\mathcal{D}^{(t)}_n$.

2. Then, to obtain $\hat{F}_{n+1}(x)$, as shown in Equation (21), we can just store the bin centers and counts $(x^c_l, n^c_l)$, $l = 1, \dots, L$. Every new target data point updates the $n^c_l$ values and is then discarded.

3. Finally, the real execution of our algorithm is highly efficient compared with its batch counterpart, as demonstrated in numerical studies (see Section 3.2).

To sum up, at each time point, our online algorithm enjoys a constant storage cost (maintaining a fixed number of terms), a constant computation cost (evaluating a fixed number of recursive formulas) and a high real execution efficiency (skipping the big matrix inversion).
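The recursion of Equations (12)-(19) can be sketched in a few lines of Python. This is our own illustration rather than the authors' implementation; in particular it assumes the ridge term $\lambda_{2,n}/(1+\lambda_{1,n})$ is held constant over time (as suggested in Section 2.4.3), so the correction matrix $F_{n+1}$ in Equation (15) reduces to $-\eta H_n/(n+1)$.

```python
import numpy as np

def online_step(state, phi_new, eta, lam1, ridge):
    """One recursive update of the proxy inverse and coefficients.

    state = (n, H, h, V, v, X_inv); phi_new = phi(x_{n+1});
    lam1 = lambda_{1,n+1} (e.g., m/((n+1)*w) as in Section 2.4.3) and
    ridge = lambda_2/(1+lambda_1), assumed constant, are supplied by the caller.
    """
    n, H, h, V, v, X_inv = state
    k = n + 1

    # Equations (12)-(13): update the running averages.
    H_old = H
    H = n / k * H_old + np.outer(phi_new, phi_new) / k
    h = n / k * h + phi_new / k

    # Equation (16): Sherman-Morrison update of G_{n+1}^{-1} from the proxy X_inv.
    Xp = X_inv @ phi_new
    G_inv = X_inv - eta * np.outer(Xp, Xp) / (k + eta * phi_new @ Xp)

    # Equation (15) with a constant ridge term: F_{n+1} = -(eta/(n+1)) H_n.
    F = -(eta / k) * H_old
    X = eta * H + (1 - eta) * V + ridge * np.eye(len(h))   # X_{n+1}, no inversion needed

    # Equation (18): least-squares step size for the first-order correction.
    B = G_inv @ F @ G_inv @ X
    delta = np.trace(B @ (X @ G_inv - np.eye(len(h)))) / np.sum(B * B)

    # Equation (17): proxy of X_{n+1}^{-1}; Equation (19): updated coefficients.
    X_inv = G_inv - delta * G_inv @ F @ G_inv
    a_tilde = X_inv @ (h / (1 + lam1) + lam1 / (1 + lam1) * (eta * h + (1 - eta) * v))
    a = np.maximum(a_tilde, 0.0)   # truncation; rescaling to Equation (4) omitted here

    return (k, H, h, V, v, X_inv), a
```

Initialization follows Algorithm 1 below: $H_0 = 0$, $h_0 = 0$, and the proxy $\bar{X}_0^{-1}$ is computed once from the source data.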

Algorithm 1. CDF Inference from a Data Stream via Transfer Learning.
Input: Source dataset $\mathcal{D}^{(s)} = \{z_1, \dots, z_m\}$, target data stream $\mathcal{D}^{(t)} = \{x_1, x_2, \dots\}$, tuning parameters $\eta$, $\sigma$, and $\lambda_{1,n}$, $\lambda_{2,n}$ at each time point.
Output: An estimate of the CDF at each time point.
Initialization: $V = \frac{1}{m}\sum_{j=1}^{m}\phi(z_j)\phi(z_j)^T$, $v = \frac{1}{m}\sum_{j=1}^{m}\phi(z_j)$, $H_0 = 0$, $h_0 = 0$, $\bar{X}_0^{-1} = \big((1-\eta)V + \frac{\lambda_{2,0}}{1+\lambda_{1,0}} I\big)^{-1}$.
Online updating: for $n = 0, 1, 2, \dots$, observe $x_{n+1}$ and construct $\phi(x_{n+1})$, update $H_{n+1}$ and $h_{n+1}$ by Equations (12)-(13), obtain $\bar{X}_{n+1}^{-1}$ by Equations (16)-(18), compute $\tilde{a}_{n+1}$ by Equation (19), truncate and normalize it to $a_{n+1}$, and output $\hat{F}_{n+1}(x)$ by Equation (20) or (21).

2.4. Implementation guidelines

2.4.1. On choosing the source data $\mathcal{D}^{(s)}$

In the current big data era, it is not uncommon to observe many similar datasets. Numerous examples include news and personal articles (Jiang and Zhai, 2007), movie and book reviews (Xia et al., 2018), product quality data from the same 3D printer (Cheng et al., 2021) and flight delay data in two sequential years (Garcke and Vanck, 2014). In addition to simply using the above contextual information, we can choose the source domain more carefully by: (i) identifying a set of key attributes (e.g., raw materials, process equipment, operators, environmental conditions, etc.) based on industrial knowledge to represent the nature of the source and target products or processes as in Lu et al. (2009); (ii) comparing these attribute values and checking whether they are of identical value, of scale similarity (e.g., pilot-scale or full-scale manufacturing phase) or of family similarity (e.g., using the same material category), which are all believed to lead to a high similarity between the two studied domains (Lu et al., 2009). As the target data stream begins and accumulates more data, a quantitative comparison between the source and target distributions, such as the two-sample goodness-of-fit hypothesis test (Zhang, 2006), can be conducted at every time point, where the use of source data can be continued unless an extremely small p-value occurs.

2.4.2. On selecting the proportion parameter $\eta$ and the bandwidth parameter $\sigma$

The proportion parameter $\eta$ makes the density ratio function $c(u)$ in Equation (2) well upper-bounded and reliably estimated. As $\eta$ approaches zero, $g(u)$ will be more like $q(u)$, which makes the bounding constraint lose efficacy and the density ratio estimation highly unstable. On the other hand, when $\eta$ gets close to one, $g(u)$ degenerates to $p(u)$ and utilizes little source data. A reasonable choice of $\eta$ to balance the above trade-off is within an interval around 0.5, e.g., $[0.3, 0.7]$, which can yield a satisfactory CDF inference performance. Therefore, we set $\eta = 0.5$ (i.e., $c(u) \in (0, 2)$) in this article. For the bandwidth parameter $\sigma$, since we adopt Gaussian kernels for a nonparametric fitting, a simple choice can be made based on an empirical bandwidth formula as in Zou et al. (2008). That is, $\sigma = (\mathrm{var}(\mathcal{D}^{(s)}))^{1/2}\, m^{-1/5}$. Here $\sigma$ is determined based on the source data $\mathcal{D}^{(s)}$ as they are the center points, as discussed in Section 2.2.

2.4.3. On determining the tuning parameters $\lambda_{1,n}$ and $\lambda_{2,n}$

First, $\lambda_{2,n}$ suppresses extreme entries in $a$ as indicated by Equation (9), and also stabilizes the matrix inversion involved in Equation (10). Therefore, $\lambda_{2,n}$ is not so critical in our model, and for simplicity, we let $\lambda_{2,n}/(1+\lambda_{1,n}) = 0.2$ in Equation (10). To clarify the role of $\lambda_{1,n}$, we rewrite $\tilde{a}_n$ in Equation (10) as
$$\tilde{a}_n = \frac{1}{1+\lambda_{1,n}}\,\tilde{a}_n^{(1)} + \frac{\lambda_{1,n}}{1+\lambda_{1,n}}\,\tilde{a}_n^{(2)}, \quad (22)$$
where
$$\tilde{a}_n^{(1)} = \Big(\eta H_n + (1-\eta)V + \frac{\lambda_{2,n}}{1+\lambda_{1,n}} I\Big)^{-1} h_n, \qquad \tilde{a}_n^{(2)} = \Big(\eta H_n + (1-\eta)V + \frac{\lambda_{2,n}}{1+\lambda_{1,n}} I\Big)^{-1}\big(\eta h_n + (1-\eta)v\big).$$

Hence, $\tilde{a}_n$ is an integration of a data-driven part $\tilde{a}_n^{(1)}$ and a prior-belief-guided part $\tilde{a}_n^{(2)}$, and $\lambda_{1,n}$ is the trade-off weight. As data accumulate from the target data stream, the effect of the prior belief is desired to be progressively reduced. Therefore, we let $\lambda_{1,n} = m/(nw)$, which makes
$$\frac{1}{1+\lambda_{1,n}} = \frac{nw}{nw+m}, \qquad \frac{\lambda_{1,n}}{1+\lambda_{1,n}} = \frac{m}{nw+m}.$$
The effect of $\tilde{a}_n^{(2)}$ thus decays as $n$ gets larger. Here $w > 1$ is a factor to imply that the target data have stronger relevance to our distribution inference than the source data and to control the decay rate of the prior belief. An empirical setting $w \in [2, 10]$ works well in our studies, and a smaller $w$ can be used when a higher degree of similarity between the target and source data is expected.
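As a quick illustration, using the simulation settings of Section 3 ($m = 500$ and $w = 5$), the weight of the prior-belief-guided part decays as
$$\frac{\lambda_{1,n}}{1+\lambda_{1,n}} = \frac{m}{nw+m}: \qquad n=10:\ \frac{500}{550}\approx 0.91, \qquad n=100:\ \frac{500}{1000}=0.50, \qquad n=500:\ \frac{500}{3000}=\frac{1}{6},$$
so the prior dominates at the very start of the stream and contributes only one-sixth of $\tilde{a}_n$ by $n = 500$.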

2.5. Extensions

In practice, sometimes we can obtain multiple products or processes similar to our target one. Suppose we have $K$ source domains and thus $K$ source datasets $\mathcal{D}^{(s)}_1, \dots, \mathcal{D}^{(s)}_K$, where $\mathcal{D}^{(s)}_k = \{z_{k1}, \dots, z_{k m_k}\}$ with PDF $q_k(z)$ and size $m_k$. To utilize our method, we build an auxiliary PDF as
$$g(u) = \eta\, p(u) + (1-\eta)\sum_{k=1}^{K} \pi_k\, q_k(u),$$
where $\pi_k \ge 0$ is the weight parameter and $\sum_{k=1}^{K}\pi_k = 1$. This actually regards the $K$ source datasets as one larger composite source dataset of a mixture form, i.e., $q(z) = \sum_{k=1}^{K}\pi_k\, q_k(z)$. Then our target CDF can be approximated as

$$F(x) \approx \eta\,\frac{1}{n}\sum_{i=1}^{n} I(x_i \le x)\, c(x_i) + (1-\eta)\sum_{k=1}^{K}\pi_k\,\frac{1}{m_k}\sum_{j=1}^{m_k} I(z_{kj} \le x)\, c(z_{kj}).$$

Following the procedures in Section 2.2, $\hat{c}(u)$ can be derived by minimizing $J_n + \lambda_{1,n} R_n + \lambda_{2,n} Q$, except that
$$V = \sum_{k=1}^{K}\pi_k\,\frac{1}{m_k}\sum_{j=1}^{m_k} \phi(z_{kj})\phi(z_{kj})^T, \qquad v = \sum_{k=1}^{K}\pi_k\,\frac{1}{m_k}\sum_{j=1}^{m_k} \phi(z_{kj}),$$
and $\phi(u)$ is built on $b = \sum_{k=1}^{K} m_k$ center points. Our online algorithm is still applicable with a slight modification in the initialization, but since $b$ gets larger, our use of the proxy of $X_n^{-1}$ is more advantageous. When the $K$ source domains are all highly similar to our target one, we can simply let $\pi_k = 1/K$. Otherwise, the set of key attributes suggested in Section 2.4 can be used to measure the heterogeneous similarity between the target domain and the $K$ source domains, and a higher similarity deserves a larger $\pi_k$.
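The only change needed in code is how the source moments $V$ and $v$ are assembled from the $K$ datasets; a hedged sketch (function name and interface are ours) is:

```python
import numpy as np

def composite_moments(sources, weights, phi):
    """Composite V and v over K source datasets (Section 2.5).

    sources : list of K arrays of source points; weights : pi_1, ..., pi_K summing to one;
    phi     : callable mapping an array of points to its (num_points, b) design matrix.
    """
    Phi_list = [phi(np.asarray(data)) for data in sources]
    b = Phi_list[0].shape[1]
    V, v = np.zeros((b, b)), np.zeros(b)
    for Phi_k, pi_k in zip(Phi_list, weights):
        V += pi_k * Phi_k.T @ Phi_k / Phi_k.shape[0]   # pi_k * (1/m_k) * sum phi phi^T
        v += pi_k * Phi_k.mean(axis=0)                 # pi_k * (1/m_k) * sum phi
    return V, v
```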

On the other hand, multiple data streams can be of interest in many applications, i.e., at each time point, we have a multivariate observation with correlated elements rather than a univariate one. When extending our method to multivariate data streams, we use the definition of the multivariate CDF, define a density ratio $c(u)$ in a $d$-dimensional space, and then Equation (3) becomes
$$F(x) \approx \eta\,\frac{1}{n}\sum_{i=1}^{n} I(x_{i1} \le x_1, \dots, x_{id} \le x_d)\, c(x_i) + (1-\eta)\,\frac{1}{m}\sum_{j=1}^{m} I(z_{j1} \le x_1, \dots, z_{jd} \le x_d)\, c(z_j).$$
We again represent $\hat{c}(u)$ by $b$ basis functions, each of which is now defined in the $d$-dimensional space as $\phi_j(u) = \exp\big(-(u - z_j)^T S^{-1}(u - z_j)\big)$, where $S$ is the bandwidth matrix and can be determined using Silverman's rule of thumb. Then the estimation of $\hat{c}(u)$ is equivalent to the minimization over $a$ as in Section 2.2. When the distribution of the multivariate random variable $x$ can be factorized as a product of several marginal distributions and conditional distributions according to some dependence structures (e.g., Bayesian networks), our CDF inference can be simplified into several independent tasks conducted in some spaces of dimension lower than $d$.

3. Numerical simulations

This section tests our proposed method by simulations. We first show the behavior of our regularized model for density ratio estimation, and then the superiority of our CDF inference method is verified for various data streams. Finally, the effect of the tuning parameters and the extension performance for multiple source domains and multivariate data streams are investigated.

3.1. Density ratio estimation

Recall that in Section 2.2, we incorporate a regularization term $R_n$ in the least-squares density ratio fitting model to reflect the prior belief in our transfer learning context. A numerical example is thus given here to illustrate this regularization's benefit. The simulation setup is as follows. The source and target data $\mathcal{D}^{(s)}$ and $\mathcal{D}^{(t)}_n$ are produced from the normal distributions $N(0, 1.5^2)$ and $N(0, 1^2)$, respectively. Following the guidelines in Section 2.4, we let $\eta = 0.5$, $c = 1$, $\lambda_{2,n}/(1+\lambda_{1,n}) = 0.2$, $w = 5$. The source data size is fixed as $m = 500$ whereas the target one $n$ increases from 10 to 500.

The performance of the proposed Regularized Density Ratio ("RDR") model and the original Density Ratio ("DR") model is displayed in Figure 1. It seems that the RDR model results in density ratios much closer to the true one than the DR model, especially when $n$ is small (less than 150). From Figures 1(b)-(e), it is clear that when $n$ is too small to generate reliable estimates, the prior belief that the source and target distributions are similar would force the density ratio estimated by the RDR model towards a constant one. For example, the estimated density ratio is lowered when it gets extremely large (e.g., when $u = -0.5$) and is lifted when it is too small (e.g., when $u = 2$). When $n$ grows large in Figures 1(f)-(i), the estimated density ratios from the RDR and DR models both approximate the true one, and their mutual difference is also weakened. This is because $\lambda_{1,n}$ is tuned as a decreasing function of $n$ such that the effect of the prior belief would gradually disappear. For example, when $n = 500$, the prior-belief-guided part $\tilde{a}_n^{(2)}$ only accounts for one-sixth of $\tilde{a}_n$ in Equation (22). To sum up, our proposed RDR model enjoys an enhanced capability when the dataset is small and will automatically revert to its original version when the dataset gets large.

3.2. CDF inference

Now our proposed CDF inference method is compared with many of its counterparts to show its advantages. The EDF and KDE are two common methods. We also generate virtual samples as in Lin and Li (2010) and then apply the EDF to the augmented dataset. Note that these three methods only depend on the target data stream. To combine the source data, a naive way is to directly mix the target and source data as a whole dataset. Another way is the Bayesian nonparametric density estimation with a DP mixture of normals as the prior distribution. Specifically, we take the posterior from the source data as the prior for the target data, and the Bayesian computing is conducted by the DP package in R (Jara et al., 2011). Two variants of our proposed method are also considered. One takes the original density ratio estimation model, the other performs the batch learning with big matrix inversion operations and uses no recursive formulas in online updating. The notations of all competing methods are given in Table 1.

Table 1. Notations of considered methods.
EDF: Empirical distribution function.
EDF-MIX: EDF for a mixed sample of the source and target data.
KDE: Kernel density estimation (the CDF is estimated by numerical integration).
KDE-MIX: KDE for a mixed sample of the source and target data.
VSG: Virtual sample generation method to increase the target data size (Lin and Li, 2010).
BN: Bayesian nonparametric density estimation (Jara et al., 2011) using source data to elicit the prior.
TL-DR: Our proposed transfer learning-based CDF inference method with the original density ratio estimation model (Kanamori et al., 2009; Sugiyama et al., 2012; Liu et al., 2013).
TL-RDR: Our proposed transfer learning-based CDF inference method with the regularized density ratio estimation model.
TL-RDR-B: Our TL-RDR method for batch learning without recursive formulas and online computation.

We consider the following six cases to simulate the source and target data from a wide array of distributions:

Case 1. $p(x)$: $N(0.0, 1.0^2)$; $q(z)$: $N(0.0, 1.5^2)$.
Case 2. $p(x)$: $N(0.0, 1.0^2)$; $q(z)$: $N(0.2, 1.0^2)$.
Case 3. $p(x)$: $0.40\,N(0.0, 1.0^2) + 0.60\,N(0.5, 1.0^2)$; $q(z)$: $N(0.0, 1.0^2)$.
Case 4. $p(x)$: $t(0.0, 3.0)$; $q(z)$: $t(0.2, 3.0)$. $t(l, \nu)$ is a Student's t distribution with location $l$ and degrees of freedom $\nu$.
Case 5. $p(x)$: $\mathrm{Gamma}(5.0, 1.0)$; $q(z)$: $\mathrm{Gamma}(5.5, 1.0)$. $\mathrm{Gamma}(k, \theta)$ is a Gamma distribution with shape $k$ and scale $\theta$.
Case 6. $p(x)$: $\mathrm{Logistic}(0.0, 1.0)$; $q(z)$: $\mathrm{Logistic}(-0.5, 1.0)$. $\mathrm{Logistic}(l, s)$ is a logistic distribution with location $l$ and scale $s$.

Figure 1. Estimated density ratio functions for different values of target data size n.

In each case, $\mathcal{D}^{(t)}_n$ and $\mathcal{D}^{(s)}$ are simulated from $p(x)$ and $q(z)$, respectively. We set $\eta = 0.5$, $c = 1$, $\lambda_{2,n}/(1+\lambda_{1,n}) = 0.2$, $w = 5$, $m = 500$, and let $n$ increase from 1 to 300. We repeat this process for $T = 500$ times, generating different realizations of $\mathcal{D}^{(s)}$ and $\mathcal{D}^{(t)}_n$ in each case. The performance criterion of the CDF inference at the $n$th time point is the mean squared error:

$$\mathrm{MSE}_n = \frac{1}{T}\sum_{rep=1}^{T}\int \big(\hat{F}_n^{(rep)}(x) - F(x)\big)^2 p(x)\,dx,$$

where $\hat{F}_n^{(rep)}(x)$ is the estimated CDF based on one realization of $\mathcal{D}^{(s)}$ and $\mathcal{D}^{(t)}_n$, and the integration is empirically calculated by averaging. We also calculate the average of the mean squared errors from the beginning to the $n$th time point as

$$\mathrm{AMSE}_n = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSE}_i.$$
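For completeness, the Monte Carlo evaluation of this criterion can be sketched as below; the helper is illustrative and assumes vectorized CDF callables.

```python
import numpy as np

def mse_n(cdf_estimates, true_cdf, eval_points):
    """Monte Carlo version of MSE_n: average the squared CDF error over the T
    replications and over evaluation points drawn from p(x)."""
    errors = [np.mean((F_hat(eval_points) - true_cdf(eval_points)) ** 2)
              for F_hat in cdf_estimates]
    return float(np.mean(errors))
```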

The $\mathrm{MSE}_n$ values, after being logarithm-transformed, are visualized in Figure 2. Generally speaking, our proposed TL-RDR method delivers a better performance with smaller errors than the competing ones, and the advantage is particularly pronounced when $n$ is small (less than 100). When $n$ gets larger, the difference between our TL-RDR and the competing EDF and KDE methods is reduced, and sometimes the EDF and KDE methods generate slightly better results based on sufficient target data (e.g., in Cases 1, 2 and 4 when $n > 150$). The VSG method does not perform well here since the virtual samples, generated from the very scarce target data, could induce heavy bias as discussed in Lin and Li (2010). Although using the source data, the EDF-MIX and KDE-MIX methods are worse than our TL-RDR method, and their blind mix could result in large errors when $n$ is large. The BN method is also inferior to our TL-RDR, as the prior distribution affects the CDF inference quite a lot when the target data are limited and the prior elicitation from the source data actually ignores the inherent difference between the target and source distributions. By contrast, our TL-RDR utilizes the source data in a more sophisticated manner by using importance weighting via the density ratio estimation. Furthermore, the advantage of TL-RDR over TL-DR substantiates the positive effects of our tailored regularization term. The indistinguishable difference between the TL-RDR and TL-RDR-B methods (they overlay each other in Figure 2) verifies the effectiveness of our recursive formulas in online updating, and the former enjoys much higher computational efficiency, as it can be performed almost eight times faster than the latter (more comparison is given in Section S.3 in the supplementary material). On average, our TL-RDR method based on Algorithm 1 is able to complete the updating for each new target data point in about 0.03 seconds using our personal computers. The $\mathrm{AMSE}_n$ values are tabulated in Table 2 to show the overall performance of each method, where the best two in each row are in boldface. We see again that our proposed TL-RDR method is always among the best candidates.

3.3. Parameters’ effects and discussion

Here we first explore the effect of tuning parameters. From our discussion in Section 2.4, a key parameter is $\lambda_{1,n}$, which controls the dynamic effect of the prior belief in continuously inferring the CDF from a data stream. Take Case 6 for instance, where $p(x)$ is $\mathrm{Logistic}(0.0, 1.0)$ but $q(z)$ is $\mathrm{Logistic}(-\Delta, 1.0)$. We let $\Delta = 0.2, 0.4, \dots, 1.2$ and consider $w = 2, 5, 10, 20$. The $\mathrm{MSE}_n$ values are plotted in Figure 3, and the $\mathrm{AMSE}_n$'s when $n = 300$ are given in Table 3 (the best result for each $\Delta$ is highlighted). It can be seen that when $\Delta$ is small or the target and source domains are highly similar, a smaller $w$ is preferred so as to utilize the prior belief for a longer time. When $\Delta$ gets larger, the best $w$ with the smallest AMSE also increases to make the prior belief decay more quickly. On the other hand, for each $w$, the $\mathrm{MSE}_n$ curve moves upwards as $\Delta$ increases, and when $\Delta$ is too large, which implies that the foundation of transfer learning has in fact collapsed, large errors could occur. As such, we recommend our CDF inference method only when the source and target domains are expected to be highly or moderately similar, and in these conditions, $w \in [2, 10]$ is a good choice.

We also study the influence of the proportion parameter $\eta$ in the above simulations with $\Delta = 0.4$ and $w = 2$. The $\mathrm{MSE}_n$ values under different values of $\eta$ are shown in Figure 4(a). It seems that our CDF inference gets worse when $\eta$ is either too small (e.g., $\eta = 0.1$), since the upper bound becomes greatly loose and the estimated ratio function gets highly unstable, or too large (e.g., $\eta = 0.9$), as only minor information from the source data is utilized. We get the same conclusion when calculating the $\mathrm{AMSE}_n$; the results are deferred to Table S.3 in the supplementary material. A moderate $\eta$ around 0.5 within an interval $[0.3, 0.7]$ produces better results, which still outperforms the competing methods in Table 1 (see Tables S.3 and S.4 in our supplementary material).

Figure 2. Logarithm of MSE_n in Cases 1-6.

Note that our above investigations all rely on the source domain samples $\mathcal{D}^{(s)}$ and do not go through the estimation of the true source distribution $q(z)$. In the ideal scenario with $q(z)$ given, estimating the density ratio $c(u)$, which is necessary to incorporate the source domain as shown in Equation (1), is expected to behave better since the integrals involving $q(u)$ in Equations (3) and (5) can now be more precisely evaluated than using the finite-sample approximation. To quantify the difference, we follow the above study in Case 6 and fix $\eta = 0.5$. The $\mathrm{AMSE}_n$'s based on the source distribution and the source dataset ($m = 500$) are listed in Table 4. A two-step scenario where $q(z)$ is first estimated from $\mathcal{D}^{(s)}$ and then used as true is also compared. It is not surprising in Table 4 that the availability of $q(z)$ can generate smaller errors than the use of $\mathcal{D}^{(s)}$, as the former is equivalent to utilizing infinite source data points. Their difference also decreases when the target data size $n$ gets larger, as the source domain becomes less important. The two-step procedure performs slightly worse than our method, which indicates that estimating $q(z)$ as an intermediate step can create more estimation errors than directly estimating the density ratio $c(u)$.

As a final note, we analyze the sensitivity of our TL-RDR method when noises, e.g., measurement errors, are added to the target data. We also set $\eta = 0.5$ and introduce normally distributed white noises, whose scale parameter is a fraction of the scale parameter of the target data. Figure 4(b) visualizes the $\mathrm{MSE}_n$'s when the fraction $\epsilon$ varies. The curve of $\mathrm{AMSE}_{300}$ over $\epsilon$ is also provided in Figure S.5 in the supplementary material. To summarize, there is almost no significant change in the performance of our method until $\epsilon$ gets particularly large (> 0.2), which indicates that our method has sufficient robustness to commonly encountered small noises in practice.

Figure 3. Logarithm of MSE_n in Case 6 with different w values.

Table 2. Logarithm of AMSE_n in Cases 1-6.
Case  n    EDF      EDF-MIX  KDE      KDE-MIX  VSG      BN       TL-DR    TL-RDR   TL-RDR-B
1     50   -4.8973  -5.3997  -5.2826  -5.2588  -4.2659  -5.5285  -4.3570  -5.4182  -5.4194
1     100  -5.3407  -5.4841  -5.6951  -5.3351  -4.5658  -5.6727  -4.9237  -5.7304  -5.7271
1     200  -5.8337  -5.6315  -6.1597  -5.4700  -4.8743  -5.9004  -5.4954  -6.0963  -6.0858
1     300  -6.1341  -5.7604  -6.4433  -5.5889  -5.0522  -6.0853  -5.8293  -6.3363  -6.3209
2     50   -4.8973  -5.1563  -5.2826  -5.1876  -4.2659  -5.3707  -4.3963  -5.4519  -5.4539
2     100  -5.3407  -5.2424  -5.6951  -5.2725  -4.5658  -5.5670  -4.9516  -5.7898  -5.7891
2     200  -5.8337  -5.3903  -6.1597  -5.4185  -4.8743  -5.8623  -5.5147  -6.1635  -6.1578
2     300  -6.1341  -5.5195  -6.4433  -5.5463  -5.0522  -6.0857  -5.8414  -6.3968  -6.3870
3     50   -4.9271  -5.3033  -5.3053  -5.3776  -4.2910  -5.5379  -4.0285  -5.7525  -5.7557
3     100  -5.3715  -5.3828  -5.7166  -5.4571  -4.5994  -5.7537  -4.6345  -6.1020  -6.1058
3     200  -5.8524  -5.5301  -6.1661  -5.6045  -4.9167  -6.0660  -5.2357  -6.4930  -6.4979
3     300  -6.1492  -5.6586  -6.4429  -5.7329  -5.0983  -6.2897  -5.5846  -6.7409  -6.7462
4     50   -4.9560  -5.0555  -5.1543  -5.1638  -4.3692  -5.2798  -4.1177  -5.5419  -5.5443
4     100  -5.4022  -5.1402  -5.5366  -5.2456  -4.7039  -5.5011  -4.7060  -5.8723  -5.8717
4     200  -5.8904  -5.2903  -5.9504  -5.3893  -5.0543  -5.8237  -5.3053  -6.2559  -6.2500
4     300  -6.1884  -5.4207  -6.2045  -5.5138  -5.2568  -6.0596  -5.6553  -6.4957  -6.4860
5     50   -4.9607  -4.9714  -5.3305  -4.9728  -4.2383  -5.3357  -4.1806  -5.5721  -5.5756
5     100  -5.4105  -5.0550  -5.7444  -5.0537  -4.3401  -5.5905  -4.7717  -5.9283  -5.9294
5     200  -5.8988  -5.2049  -6.1961  -5.1991  -4.3638  -5.9560  -5.3693  -6.3316  -6.3286
5     300  -6.1936  -5.3348  -6.4681  -5.3251  -4.3440  -6.2162  -5.7156  -6.5817  -6.5758
6     50   -4.9792  -5.2704  -5.3184  -5.2852  -4.3473  -5.5239  -4.1167  -5.8740  -5.8788
6     100  -5.4138  -5.3523  -5.7138  -5.3653  -4.6406  -5.7448  -4.7183  -6.2157  -6.2223
6     200  -5.9132  -5.5004  -6.1766  -5.5101  -4.9676  -6.0660  -5.3240  -6.6219  -6.6291
6     300  -6.2184  -5.6295  -6.4579  -5.6361  -5.1607  -6.2876  -5.6719  -6.8682  -6.8752

3.4. Extension performance

In this section, we first investigate the marginal benefits when additional source domains are included in our TL-RDR method. Following Case 2, where only a source dataset from $N(0.2, 1.0^2)$ is used (a baseline), we then add another source dataset from $N(-0.2, 1.0^2)$ and $N(0.0, 0.9^2)$, respectively. We also consider a situation with the above three source datasets together. The above source data sizes are all 200. As high similarities between the target and all source datasets are expected, we let $\pi_k = 1/K$. The $\mathrm{MSE}_n$'s are plotted in Figure 5(a). It can be clearly seen that the incorporation of more highly similar source domains could lead to better estimations. We then consider a scenario where two source domains have different similarities with the target domain in Figure 5(b), where the source dataset further away from the target one is assigned less weight. We can see that with appropriate weight parameters, the two heterogeneous source domains still perform better than the baseline since more source data are actually leveraged, but the advantages vanish as one of the source datasets becomes more distant from the target one. Negative effects might even occur if one dissimilar source domain is assigned an overly large weight (see the long-dashed cyan line in Figure 5(b)).

From Figures 5(a) and 5(b), we conclude that multiple-source transfer can bring more assistance, but their incorporation needs more caution in the heterogeneous cases. The determination of the weights $\pi_k$ in a data-driven manner seems unreliable in our early-stage context with limited target data. For example, we consider $n = 20$ target data points in Case 2 and two source datasets from $N(-0.2, 1.0^2)$ and $N(0.2, 1.0^2)$. The leave-one-out cross validation, where every 19 out of 20 target data points are taken to train our model and the estimated density function (i.e., $\hat{p}(x) \approx \frac{\hat{F}(x+e) - \hat{F}(x-e)}{2e}$, where $e$ is a small value) is evaluated at the left-out test point, is performed to find the optimal $\pi_1$ that leads to the maximum value of the log-density over all the left-out test points (i.e., $\sum_{i=1}^{n}\ln \hat{p}(x_i)$). We repeat the above procedure 500 times for different realizations of the target dataset and plot the frequency of the selected $\pi_1$ in Figure 5(c). These $\pi_1$ values are quite unstable as they can range from zero to one. The most frequent choice is $\pi_1 = 0$, which indicates that only one source should be utilized, but we have already shown in Figure 5(a) that using two sources with $\pi_1 = 0.5$ can yield a better distribution inference. For multiple source domains, prior industrial knowledge should be relied on to select the highly similar source domains, and when these source domains all have high similarities with the target one, we can let $\pi_k = 1/K$ for simplicity.

Figure 5. Logarithm of MSE_n with multiple sources in (a) and (b) and weight selection in (c).

Table 4. Logarithm of AMSE_n in Case 6 under different scenarios.
Scenario                        n=10     n=20     n=50     n=100    n=150    n=200    n=300
Source dataset                  -5.2729  -5.6062  -6.0290  -6.4005  -6.6385  -6.8045  -7.0286
True source distribution        -5.4670  -5.7504  -6.0972  -6.4275  -6.6490  -6.8119  -7.0304
Estimated source distribution   -5.2690  -5.6019  -6.0236  -6.3983  -6.6373  -6.8033  -7.0270

Figure 4. Logarithm of MSE_n in Case 6 with (a) different η values and (b) different ε values.

Table 3. Logarithm of AMSE_300 in Case 6.
       Δ=0.2    Δ=0.4    Δ=0.6    Δ=0.8    Δ=1.0    Δ=1.2
w=2    -7.0434  -7.0286  -6.8453  -6.5074  -6.1442  -5.8182
w=5    -6.8768  -6.8788  -6.8574  -6.6299  -6.3952  -6.1629
w=10   -6.7400  -6.7439  -6.7207  -6.6254  -6.4848  -6.3370
w=20   -6.6075  -6.6067  -6.5953  -6.5374  -6.4503  -6.3614

Finally, our TL-RDR method is applied to infer multivariate distributions. The target data are generated from $\mathrm{MN}(0, H)$, where $H$ is the covariance matrix with diagonal 1 and off-diagonal 0.5, and the source data from $\mathrm{MN}(\Delta, H)$ with $\Delta = (0.1, 0.1)^T$, $(0.05, \dots, 0.05)^T$, $(0.02, \dots, 0.02)^T$ for $d = 2, 4, 8$, respectively. Figure 6 shows the $\mathrm{MSE}_n$ values of our method and its counterparts. It seems that the superiority of our method is still retained in the low dimension in Figure 6(a). However, as $d$ increases to 4 and 8 (Figures 6(b)-(c)), our method deteriorates with slight or little advantages over the competing ones. The KDE-based methods are no longer shown when $d \ge 4$, due to the integrations involved for CDF calculations being unavailable in high dimensions with a small $n$ in our numerical experiments. The reason for the compromise of our method is that the density ratio estimation, as pointed out in Sugiyama et al. (2010), Sugiyama et al. (2012) and Bu et al. (2018), could be more unstable in higher dimensions due to the nature of the ratio. As such, our proposed method is more suitable for the online inference of multivariate distributions of low and moderate dimensions.

Figure 6. Logarithm of MSE_n for multivariate data streams.

4. Industrial applications

4.1. Quality monitoring

Our first case considers quality monitoring of smartphones in an Original Equipment Manufacturer (OEM). For modern phones, the display is a dominant feature that strongly affects customer experience. One critical characteristic of a display is its Color Uniformity (CU), which is measured as the maximum value of color differences at a grid of checkpoints when the display shows white. To meet diverse demands, a type of phone usually has a family of variants to be assembled in the OEM. These variants share product designs and assembly lines, but have components of different grades. We focus on a particular privilege variant with a small production volume. To efficiently monitor its display quality, the upper control limit of the CU test data has to be built as soon as the assembly process begins. When the CU test data distribution is available, a sensible control index is the CDF $F(x)$ and the control limit is $1-s$, where $s$ is the false alarm rate (Qiu, 2018).
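In code, such a monitoring rule is a one-liner on top of any CDF estimate; the sketch below uses $s = 0.01$, which corresponds to the 0.990 control limit adopted later in Figure 8.

```python
def alarm(x_new, cdf_estimate, s=0.01):
    """Signal an out-of-control alarm when the control index exceeds 1 - s."""
    return cdf_estimate(x_new) > 1.0 - s
```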

To this end, we apply our TL-RDR method to quickly infer the CDF from the CU data stream. We take a basic variant of the phone as the source product that is highly similar to the privilege one. Particularly, in one production batch, the volumes of the source and target products are scheduled to be 2240 and 280, respectively. After the source products are finished, we export $\mathcal{D}^{(s)}$, upon which the PDF $q(z)$ is derived by the KDE method and is plotted in Figure 7(a) (the raw data have been transformed for confidentiality issues). The target PDF $p(x)$ is also shown there based on $\mathcal{D}^{(t)}_n$ with $n = 280$. This PDF $p(x)$ is taken as the true target distribution only for the following performance comparison, and is certainly unknown during the assembly process. In addition, the non-normality exhibited in Figure 7(a) justifies our use of a nonparametric charting statistic $\hat{F}(x)$ rather than the conventional statistics in the existing self-starting charts.

The guidelines in Section 2.4 are used for parameter selection, and we set $w = 5$ in this real case. First, the estimated density ratios $\hat{c}_n$ as $n$ increases are plotted in Figures 7(b)-(f). Consistent with our observations in the simulations (see Figure 1), for this real data stream, the proposed RDR with a prior-belief-guided regularization term also produces a much better estimation. The control indexes $\hat{F}_n(x)$, obtained from a wide array of competing methods, are monitored as time goes by in Figure 8, where the control limit (red dotted line) is 0.990. In the control chart constructed from the true target distribution, an alarm is triggered at the 120th time point, indicating a phone with a quality-deficient display. We count the number of False Alarms (FA) and True Alarms (TA) of each method. The control chart based on our TL-RDR method in Figure 8(f) performs best as it catches the existing one TA at the expense of the fewest FAs.

Figure 7. Estimated density ratios as the target data stream generates more data.

Figure 8. Control charts with the numbers of FA and TA based on different CDF inference methods (the interval [0.9, 1.0] on the vertical axis is enlarged for visual clarity).

4.2. Inventory control

Our second real example solves the inventory control of a bike sharing system using the newsvendor model. In this bike rental system, customers can pick up a bike at one location and drop it off at another. The Bike Sharing dataset in the UCI Machine Learning Repository is used here, which records the daily rental counts of Capital Bikeshare, a metro bike sharing company in Washington DC, from January 1, 2011 to December 31, 2012 (Fanaee-T and Gama, 2014). In daily operation, the company has to decide how many bikes to launch in the city, or equivalently, to strike the best trade-off between overstocking costs and unsatisfied rental opportunities. In the newsvendor model, let c_o denote the cost of each unit of overage, i.e., the maintenance and fund occupation cost of an unused bike per day, and c_u the cost of underage, i.e., the average profit that a bike can make if regularly rented out per day. Then the optimal inventory level is known to be F^{-1}(c_u / (c_u + c_o)), where F(x) is the CDF of the daily bike demand and c_u / (c_u + c_o) is the optimal service level.
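In code, this decision rule is a single quantile lookup. The sketch below uses the sample quantile of observed demands as a stand-in for the inverse of whatever CDF estimate is available; the demand numbers and costs are illustrative, not taken from the dataset.

```python
import numpy as np

def newsvendor_level(demands, c_u, c_o):
    """Optimal inventory level F^{-1}(c_u / (c_u + c_o)) via the sample quantile."""
    service_level = c_u / (c_u + c_o)
    return np.quantile(demands, service_level)

# Illustrative daily rental counts:
demands = np.array([3840, 4120, 4570, 3990, 4410, 4260])
level = newsvendor_level(demands, c_u=3.0, c_o=1.0)   # service level 0.75, as used below
```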

Huber et al. (2019) demonstrated the benefits of large historical datasets for demand distribution estimation in the newsvendor problem, whereas we aim to deliver the optimal inventory level when entering a new season, with the demand data received daily and rather scarce. A basic assumption here is that the daily demand follows a stationary distribution within each season, so the optimal inventory level within a season is a constant. Particularly, we aim to decide the optimal number of daily launched bikes in Season 3, 2012, but we cannot simply refer to the historical demand data in Season 3, 2011, because the daily demand in 2012 has increased dramatically due to remarkable annual market growth (see Figure 9(a); only records on working days are kept). Instead, we use the data in Season 2, 2012 as the source data. The weather conditions, which strongly affect bike rentals (Fanaee-T and Gama, 2014), are very similar in these two seasons (see the weather history in Table 5).
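A minimal data-preparation sketch for this setup is given below. It assumes the day-level file of the UCI Bike Sharing Dataset with its documented columns (`yr`, `season`, `workingday`, `cnt`), where `yr == 1` denotes 2012, and assumes that Seasons 2 and 3 here map directly onto the dataset's `season` codes; it is not the authors' code.

```python
import pandas as pd

# day.csv from the UCI Bike Sharing Dataset (column names per its documentation).
day = pd.read_csv("day.csv")
day = day[day["workingday"] == 1]        # keep working-day records only, as in Figure 9(a)

in_2012 = day["yr"] == 1                 # yr: 0 = 2011, 1 = 2012
source_demand = day[in_2012 & (day["season"] == 2)]["cnt"].to_numpy()  # Season 2, 2012 (source)
target_stream = day[in_2012 & (day["season"] == 3)]["cnt"].to_numpy()  # Season 3, 2012, revealed daily
```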

We let c_u / (c_u + c_o) = 0.75 and w = 5, and apply our TL-RDR method to derive the optimal daily inventory; the results are shown in Figure 10. The optimal inventory based on the post-hoc demand distribution p(x) in Figure 9(b) is also plotted in Figure 10 as a benchmark (red solid horizontal line). Our TL-RDR method generates the results closest to the benchmark during the first half of Season 3, 2012, and thereafter behaves very similarly to the EDF, KDE and TL-DR methods. The inferior performance of EDF-MIX, KDE-MIX and BN indicates that regarding the demand data in Seasons 2 and 3, 2012 as homogeneous is inappropriate.

Figure 7. Estimated density ratios as the target data stream generates more data.


Figure 8. Control charts with the numbers of FA and TA based on different CDF inference methods (the interval [0.9, 1.0] on the vertical axis is enlarged for visual clarity).

Figure 9. Bike sharing datasets.


The newsvendor model here is just a simple inventory control strategy used to show the effectiveness of our CDF inference method; more advanced models can be built upon our method whenever the demand distribution has to be estimated from data streams.

5. Conclusion

Data streams have become a common data structure in modern manufacturing and service applications, the analysis of which requires solutions that respond instantaneously and update efficiently. This article proposes a quick distribution inference method for stationary streaming data, a theoretically sound and computationally fast building block for any advanced analytics that requires distribution information. To obtain an accurate and reliable CDF estimate as soon as a data stream starts, we adopt an instance-based transfer learning approach to borrow auxiliary data from source domains. The density ratio function for weight assignment on the source data is then estimated by a regularized least-squares fitting model with a well-designed regularization term that utilizes the similarity between the source and target distributions. Hinging on the transfer of external source data, our TL-RDR method is effective at the early stage of the target data stream. It is also equipped with an online updating algorithm with recursive formulas to ensure computational efficiency, and is extensible to multiple source domains and multivariate data streams.
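To make the weighting step concrete, the sketch below shows a plain, batch, univariate least-squares density-ratio fit in the spirit of Kanamori et al. (2009). It omits the prior-belief-guided regularization term and the recursive online updates that distinguish TL-RDR, so it illustrates the generic building block rather than our method; the function and parameter names (`ulsif_weights`, `sigma`, `lam`) are ours.

```python
import numpy as np

def ulsif_weights(x_target, x_source, sigma=1.0, lam=0.1):
    """Least-squares density-ratio fit returning weights w(z) ~ p(z)/q(z) at source points."""
    centers = x_target                                    # Gaussian kernel centers at target points
    def design(x):                                        # kernel design matrix
        d = x[:, None] - centers[None, :]
        return np.exp(-d**2 / (2 * sigma**2))

    Phi_s = design(x_source)                              # basis on source (denominator) data
    Phi_t = design(x_target)                              # basis on target (numerator) data
    H = Phi_s.T @ Phi_s / len(x_source)
    h = Phi_t.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return np.clip(design(x_source) @ theta, 0.0, None)   # non-negative ratio values
```

One natural use of such weights is to attach them to the source observations when forming a weighted empirical CDF alongside the incoming target data.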

As illustrated in our simulations, distribution inference from multivariate data streams is quite challenging. To tackle this difficulty, we could try the dimension reduction techniques in Sugiyama et al. (2010) or employ a more stable density-difference-based estimation model as in Bu et al. (2018), whose online version will be investigated in our future work. Furthermore, the distributions studied in this article are assumed to be stationary, but in reality they might be time-varying. State-space modeling as in Zhang et al. (2017) and Qian et al. (2019) can be explored in this case, and our instance-weighted transfer learning is also a promising technique when many transition parameters have to be estimated from limited target data. Lastly, our proposed method can be embedded into many generative machine learning models that rely on distribution information, such as the Bayesian classifier, and will transform these models into an online learning scheme, which also deserves our future research efforts.

Acknowledgments

The authors gratefully acknowledge the valuable comments provided by the department editor, associate editor and two anonymous referees, which have resulted in great improvements to this article.

Funding

This work was supported by the National Natural Science Foundation of China under Grants 71931006 and 71772147; the National Key R&D Program of China under Grant 2019YFB1704100; the Fellowship of China Postdoctoral Science Foundation under Grant 2020M673430; the Hong Kong RGC General Research Funds under Grants 16216119 and 16201718; the Science & Technology Innovation Team Plan of Shaanxi Province under Project S2020-ZC-TD-0083; the Youth Innovation Team of Shaanxi Universities "Big Data and Business Intelligent Innovation Team"; and the Fundamental Research Funds for the Central Universities.

Table 5. Weather information of Seasons 2 and 3 in 2011 (standard deviations are in parentheses; temperature, feeling temperature, humidity, and wind speed are the dataset's normalized values).

Item                                  Season 2           Season 3
Weather condition: Clear              36 days            46 days
Weather condition: Mist + Cloudy      27 days            17 days
Weather condition: Light Snow/Rain    1 day              3 days
Temperature                           0.5372 (0.1409)    0.6999 (0.0722)
Feeling Temperature                   0.5123 (0.1267)    0.6529 (0.0706)
Humidity                              0.6465 (0.1483)    0.6428 (0.1339)
Wind Speed                            0.2199 (0.0708)    0.1721 (0.0542)

Figure 10. Optimal daily inventory levels in Season 3, 2012 by different methods.


Notes on contributors

Kai Wang is currently an assistant professor in the Department of Operations Management & Industrial Engineering, School of Management, at Xi'an Jiaotong University, Xi'an, China. He received his PhD degree in industrial engineering and logistics management in 2018 from the HKUST, Hong Kong, and his bachelor's degree in industrial engineering in 2014 from Xi'an Jiaotong University, Shaanxi, China. His research focuses on industrial big data analytics, machine learning and transfer learning, and statistical process control and monitoring.

Jian Li is an associate professor in the School of Management, Xi'an Jiaotong University, China. He received his BS degree in automation from Tsinghua University, Beijing, China, and his PhD degree in industrial engineering and decision analytics from the Hong Kong University of Science and Technology, Hong Kong. His current research interests include quality management and quality engineering, Six Sigma implementation, and statistical process control.

Fugee Tsung is a Chair Professor in the Department of Industrial Engineering and Decision Analytics (IEDA) and Director of the Quality and Data Analytics Lab (QLab) at the Hong Kong University of Science and Technology (HKUST), Hong Kong, China. He is a Fellow of the American Society for Quality, a Fellow of the American Statistical Association, an Academician of the International Academy for Quality, and a Fellow of the Hong Kong Institution of Engineers. He received both his MSc and PhD from the University of Michigan, Ann Arbor, and his BSc from the National Taiwan University. His research interests include quality analytics in advanced manufacturing and service processes, industrial big data, and statistical process control, monitoring, and diagnosis.

References

Anees, A., Aryal, J., O'Reilly, M.M. and Gale, T.J. (2016) A relative density ratio-based framework for detection of land cover changes in MODIS NDVI time series. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(8), 3359–3371.

Bishop, C.M. (2006) Pattern Recognition and Machine Learning, Springer Science & Business Media, New York, NY.

Bu, L., Alippi, C. and Zhao, D. (2018) A pdf-free change detection test based on density difference estimation. IEEE Transactions on Neural Networks, 29(2), 324–334.

Cheng, L., Wang, K. and Tsung, F. (2021) A hybrid transfer learning framework for in-plane freeform shape accuracy control in additive manufacturing. IISE Transactions, 53(3), 298–312.

Dai, W., Yang, Q., Xue, G.-R. and Yu, Y. (2007) Boosting for transfer learning, in Proceedings of the 24th International Conference on Machine Learning, pp. 193–200.

Domingos, P. and Hulten, G. (2003) A general framework for mining massive data streams. Journal of Computational and Graphical Statistics, 12(4), 945–949.

Fanaee-T, H. and Gama, J. (2014) Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2-3), 113–127.

Garcke, J. and Vanck, T. (2014) Importance weighted inductive transfer learning for regression, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Berlin, Heidelberg, pp. 466–481.

Heinz, C. and Seeger, B. (2008) Cluster kernels: Resource-aware kernel density estimators over streaming data. IEEE Transactions on Knowledge and Data Engineering, 20(7), 880–893.

Hoi, S.C., Sahoo, D., Lu, J. and Zhao, P. (2018) Online learning: A comprehensive survey. arXiv preprint arXiv:1802.02871.

Huang, J., Gretton, A., Borgwardt, K.M., Schölkopf, B. and Smola, A.J. (2006) Correcting sample selection bias by unlabeled data, in Advances in Neural Information Processing Systems, 19, pp. 601–608.

Huang, S., Li, J., Chen, K., Wu, T., Ye, J., Wu, X. and Yao, L. (2012) A transfer learning approach for network modeling. IIE Transactions, 44(11), 915–931.

Huber, J., Müller, S., Fleischmann, M. and Stuckenschmidt, H. (2019) A data-driven newsvendor problem: From data to decision. European Journal of Operational Research, 278(3), 904–915.

Huertas Quintero, L.A., West, A.A., Velandia, D.M.S., Conway, P.P. and Wilson, A. (2010) Integrated simulation tool for quality support in the low-volume high-complexity electronics manufacturing domain. International Journal of Production Research, 48(1), 45–68.

Jara, A., Hanson, T.E., Quintana, F.A., Müller, P. and Rosner, G.L. (2011) DPpackage: Bayesian semi- and nonparametric modeling in R. Journal of Statistical Software, 40(5), 1–30.

Jiang, J. and Zhai, C. (2007) Instance weighting for domain adaptation in NLP, in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Association for Computational Linguistics, Stroudsburg, Pennsylvania, pp. 264–271.

Kanamori, T., Hido, S. and Sugiyama, M. (2009) A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.

Kuo, Y.H. and Kusiak, A. (2019) From data to big data in production research: The past and future trends. International Journal of Production Research, 57(15-16), 4828–4853.

Li, D.C. and Lin, Y.S. (2006) Using virtual sample generation to build up management knowledge in the early manufacturing stages. European Journal of Operational Research, 175(1), 413–434.

Li, D.C., Wu, C.S., Tsai, T.I. and Lina, Y.S. (2007) Using mega-trend-diffusion and artificial samples in small data set learning for early flexible manufacturing system scheduling knowledge. Computers & Operations Research, 34(4), 966–982.

Li, M., Han, J. and Liu, J. (2017) Bayesian nonparametric modeling of heterogeneous time-to-event data with an unknown number of subpopulations. IISE Transactions, 49(5), 481–492.

Li, Y., Liu, Y., Zou, C. and Jiang, W. (2014) A self-starting control chart for high-dimensional short-run processes. International Journal of Production Research, 52(2), 445–461.

Lin, J. and Wang, K. (2011) Online parameter estimation and run-to-run process adjustment using categorical observations. International Journal of Production Research, 49(13), 4103–4117.

Lin, J. and Wang, K. (2012) A Bayesian framework for online parameter estimation and process adjustment using categorical observations. IIE Transactions, 44(4), 291–300.

Lin, Y.S. and Li, D.C. (2010) The generalized-trend-diffusion modeling algorithm for small data sets in the early stages of manufacturing systems. European Journal of Operational Research, 207(1), 121–130.

Liu, J.S. (2008) Monte Carlo Strategies in Scientific Computing, Springer Science & Business Media, New York, NY.

Liu, S., Yamada, M., Collier, N. and Sugiyama, M. (2013) Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43, 72–83.

Lu, J., Yao, K. and Gao, F. (2009) Process similarity and developing new process models through migration. AIChE Journal, 55(9), 2318–2328.

Müller, P. and Quintana, F.A. (2005) Nonparametric Bayesian data analysis. Quality Engineering, 50(3), 325–326.

Oroojlooy, A., Snyder, L. and Takáč, M. (2020) Applying deep learning to the newsvendor problem. IISE Transactions, 52(4), 444–463.

Pan, S.J. and Yang, Q. (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.

Petersen, K.B. and Pedersen, M.S. (2008) The Matrix Cookbook, Technical University of Denmark, Kongens Lyngby, Denmark.

Polansky, A.M. (2014) Assessing the capability of a manufacturing process using nonparametric Bayesian density estimation. Journal of Quality Technology, 46(2), 150–170.

Qian, Y., Huang, J.Z., Park, C. and Ding, Y. (2019) Fast dynamic nonparametric distribution tracking in electron microscopic data. The Annals of Applied Statistics, 13(3), 1537–1563.

Qiu, P. (2018) Some perspectives on nonparametric statistical process control. Journal of Quality Technology, 50(1), 49–65.


Ross, G.J., Tasoulis, D.K. and Adams, N.M. (2011) Nonparametric monitoring of data streams for changes in location and scale. Technometrics, 53(4), 379–389.

Srinivasan, M.M. and Viswanathan, S. (2010) Optimal work-in-process inventory levels for high-variety, low-volume manufacturing systems. IIE Transactions, 42(6), 379–391.

Sugiyama, M., Kawanabe, M. and Chui, P.L. (2010) Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Networks, 23(1), 44–59.

Sugiyama, M., Suzuki, T. and Kanamori, T. (2012) Density Ratio Estimation in Machine Learning, Cambridge University Press, New York, NY.

Tien, J.M. (2013) Big data: Unleashing information. Journal of Systems Science and Systems Engineering, 22(2), 127–151.

Tirinzoni, A., Sessa, A., Pirotta, M. and Restelli, M. (2018) Importance weighted transfer of samples in reinforcement learning, in ICML 2018: Thirty-fifth International Conference on Machine Learning, pp. 4936–4945.

Tseng, S.T., Hsu, N.J. and Lin, Y.C. (2016) Joint modeling of laboratory and field data with application to warranty prediction for highly reliable products. IIE Transactions, 48(8), 710–719.

Tsung, F., Zhang, K., Cheng, L. and Song, Z. (2018) Statistical transfer learning: A review and some extensions to statistical process control. Quality Engineering, 30(1), 115–128.

Xia, R., Pan, Z. and Xu, F. (2018) Instance weighting for domain adaptation via trading off sample selection bias and variance, in Proceedings of the 27th International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, Stockholm, Sweden, pp. 13–19.

Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H. and Sugiyama, M. (2011) Relative density-ratio estimation for robust distribution comparison. Neural Information Processing Systems, 25(5), 594–602.

Yang, H., Kumara, S., Bukkapatnam, S.T. and Tsung, F. (2019) The internet of things for smart manufacturing: A review. IISE Transactions, 51(11), 1190–1216.

Zantek, P.F., Wright, G.P. and Plante, R.D. (2006) A self-starting procedure for monitoring process quality in multistage manufacturing systems. IIE Transactions, 38(4), 293–308.

Zhang, C., Chen, N. and Li, Z. (2017) State space modeling of autocorrelated multivariate Poisson counts. IISE Transactions, 49(5), 518–531.

Zhang, J. (2006) Powerful two-sample tests based on the likelihood ratio. Technometrics, 48(1), 95–103.

Zheng, Y., Jestes, J., Phillips, J.M. and Li, F. (2013) Quality and efficiency for kernel density estimates in large data, in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Association for Computing Machinery, New York, NY, pp. 433–444.

Zhou, A., Cai, Z., Wei, L. and Qian, W. (2003) M-kernel merging: Towards density estimation over data streams, in Eighth International Conference on Database Systems for Advanced Applications, IEEE Press, Piscataway, NJ, pp. 285–292.

Zou, C., Tsung, F. and Wang, Z. (2008) Monitoring profiles based on nonparametric regression methods. Technometrics, 50(4), 512–526.
